Opened 18 years ago
Last modified 6 months ago
#4431 new enhancement
wiki_to_wikidom
Reported by: | Owned by: | ||
---|---|---|---|
Priority: | highest | Milestone: | topic-wikiengine |
Component: | wiki system | Version: | 0.10.3 |
Severity: | critical | Keywords: | engine parser formatter |
Cc: | Martin.vGagern@…, itamarost@…, felix.schwarz@…, fcorreia@…, leho@…, olemis+trac@… | Branch: | |
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
It would be good for plugins if the parsing of wiki text were split into two stages. First, convert the text into a tree of data objects (à la DOM, but wiki-oriented, so nodes are 'text', 'macro', 'link', etc.); then different formatters can walk the tree and format it, and plugins can search it.
Attachments (0)
Change History (32)
comment:1 by , 18 years ago
comment:3 by , 18 years ago
Component: | general → wiki |
---|---|
Keywords: | engine parser formatter added |
Milestone: | → 0.11 |
Owner: | changed from | to
Priority: | normal → high |
Severity: | normal → major |
Well, I'm not sure there's been a specific ticket for this, but references to this are scattered through wiki-related tickets.
So, yes, generating a Wiki-DOM tree is the plan. More specifically, the wiki parser would yield a tree of Element nodes, which can be traversed by the wiki formatters to render either as another tree of (HTML) Element nodes or serialized to various other formats (plain text, LaTeX, etc.).
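As a rough illustration of that split (not the actual Trac design), the sketch below parses a toy syntax into a node tree and then renders the same tree with two independent formatters. `Node`, `parse`, `to_html` and `to_plain` are hypothetical names, and the toy syntax only knows `'''bold'''` and CamelCase links.

```python
import re
from html import escape


class Node(object):
    """A toy WikiDOM node: a kind, optional text, and child nodes."""

    def __init__(self, kind, text='', children=None):
        self.kind = kind
        self.text = text
        self.children = children or []

    def find_all(self, kind):
        """Yield every descendant of the given kind (what a plugin might search for)."""
        for child in self.children:
            if child.kind == kind:
                yield child
            for found in child.find_all(kind):
                yield found


# Assumed toy syntax: '''bold''' spans and CamelCase page links only.
_TOKEN = re.compile(r"'''(?P<bold>.+?)'''|(?P<link>\b(?:[A-Z][a-z]+){2,}\b)")


def parse(source):
    """Stage 1: build the tree; no output format is decided here."""
    children, pos = [], 0
    for m in _TOKEN.finditer(source):
        if m.start() > pos:
            children.append(Node('text', source[pos:m.start()]))
        if m.group('bold') is not None:
            # Bold content is parsed recursively so links inside it are found.
            children.append(Node('bold', children=parse(m.group('bold')).children))
        else:
            children.append(Node('link', m.group('link')))
        pos = m.end()
    if pos < len(source):
        children.append(Node('text', source[pos:]))
    return Node('document', children=children)


def to_html(node):
    """Stage 2a: render the tree as HTML."""
    inner = ''.join(to_html(c) for c in node.children)
    if node.kind == 'text':
        return escape(node.text)
    if node.kind == 'bold':
        return '<strong>%s</strong>' % inner
    if node.kind == 'link':
        return '<a href="/wiki/%s">%s</a>' % (node.text, node.text)
    return inner


def to_plain(node):
    """Stage 2b: render the same tree as plain text."""
    if node.kind in ('text', 'link'):
        return node.text
    return ''.join(to_plain(c) for c in node.children)


tree = parse("See '''the WikiStart page''' now")
```

Here `to_plain(tree)` drops the markup while `to_html(tree)` keeps it, and `tree.find_all('link')` lets a plugin collect the links without re-parsing the source.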
Some related tickets are:
- #3925 (build <pre> elements)
- #3895 (build <link> elements)
- #3089 (retrieve <heading> elements from macros, needs 2 passes)
- #2296 (formatting to LaTeX)
- #4270 (formatting to plain text)
- #2064 (better truncating when generating one-liner content)
… and probably others.
Let's focus this ticket on the parsing/formatting split itself, and the specifics of the WikiDom.
comment:4 by , 18 years ago
Some other requirements (from #4139): a macro should be able to return some more Wiki content, instead of simply some rendered output.
If we split the macro rendering in two phases, one corresponding to the parsing, the other to the rendering (by way of the expand_macro recently introduced in r4621), then this would be possible, as the first phase could return the additional wiki text source, which could then be recursively parsed.
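A minimal sketch of that two-phase idea, under assumptions of my own: `SOURCE_MACROS` is a hypothetical registry of macros that return wiki source during parsing, and the `[[Name(args)]]` regular expression is a simplification. Macros not in the registry are kept as nodes for the rendering phase.

```python
import re

# Simplified macro call syntax: [[Name]] or [[Name(args)]].
_MACRO = re.compile(r'\[\[(\w+)(?:\((.*?)\))?\]\]')

# Hypothetical parse-phase macros: each returns more wiki source,
# not rendered output.
SOURCE_MACROS = {
    'Greeting': lambda args: "Hello [[Name(%s)]]!" % args,
    'Name': lambda args: "'''%s'''" % args,
}


def parse(source, depth=0, max_depth=10):
    """Return a flat list of ('text', str) and ('macro', name, args) nodes.

    Macros found in SOURCE_MACROS are expanded now and their result is
    parsed again; all others are left for the rendering phase.
    """
    if depth > max_depth:
        raise ValueError('macro expansion nested too deeply')
    nodes, pos = [], 0
    for m in _MACRO.finditer(source):
        if m.start() > pos:
            nodes.append(('text', source[pos:m.start()]))
        name, args = m.group(1), m.group(2) or ''
        if name in SOURCE_MACROS:
            # First phase: the macro yields wiki text, parsed recursively.
            nodes.extend(parse(SOURCE_MACROS[name](args), depth + 1, max_depth))
        else:
            # Second phase happens later, at render time.
            nodes.append(('macro', name, args))
        pos = m.end()
    if pos < len(source):
        nodes.append(('text', source[pos:]))
    return nodes


nodes = parse("[[Greeting(World)]] [[TOC]]")
```

The depth limit is there because a macro that returns a call to itself would otherwise recurse forever.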
comment:5 by , 18 years ago
Status: | new → assigned |
---|
Ok, I've now started to work on the parser/formatter split, stay tuned ;)
comment:6 by , 18 years ago
Cc: | added |
---|
comment:7 by , 17 years ago
Priority: | high → highest |
---|
comment:8 by , 17 years ago
Milestone: | 0.11 → 0.12 |
---|
Probably not for 0.11 - too many things left to do for that release.
comment:9 by , 17 years ago
Just a hint: If the "DOM" you have in mind represents the input domain, then it is a sort of abstract syntax tree (AST). There are parser generators out there which generate parsers able to do just that: generate ASTs. They would require a formal definition of the input language syntax though, which I have not seen yet for wikis (but I think the project would definitely benefit from formally defining the wiki syntax anyway, so it might very well be worth the effort).
Cheers — Zoran
comment:10 by , 15 years ago
Milestone: | next-major-0.1X → 0.13 |
---|---|
Severity: | major → critical |
Let's make that a priority for me for the next release, as it is the enabler for key improvements.
comment:11 by , 15 years ago
I also started thinking about replacing the regular expression based parser with a parser that would be both extensible and generated by, for example, antlr3.
Some considerations:
- antlr3 does not currently support Python 3.0 (I've made an unpublished patch for this, but it makes the output generated by antlr3 incompatible with Python 2.5)
- making the parser extensible would basically mean merging the available wiki syntax extensions into one big grammar or several such grammars; the latter would then be compiled and cached by the system and would have to be recreated as soon as a new wiki syntax provider becomes available
- wiki syntax providers will become more complicated, and not all users will be able to provide such extensions, as they would have to be capable of authoring, for example, antlr3 grammars, or at least fragments of such grammars
- supporting multiple wiki syntaxes, for example the original Trac syntax or the Creole syntax, will require multiple such grammars
- besides that, I do think that Trac would become more responsive once the regular expressions have been eliminated; also, generated ASTs could be cached and from there transformed to whatever output type you like
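The "compiled and cached, recreated when a new provider becomes available" point can be sketched as follows. This is only an illustration: `GrammarCache` is a hypothetical name, and a regular-expression alternation stands in for a generated antlr3 grammar, which is not what the comment proposes to ship.

```python
import re


class GrammarCache(object):
    """Combine provider fragments into one compiled grammar,
    rebuilt only when the set of providers changes."""

    def __init__(self):
        self._providers = {}   # provider name -> {rule name: pattern}
        self._compiled = None  # cached combined grammar
        self.builds = 0        # number of (re)compilations so far

    def register(self, name, rules):
        """Add or replace a syntax provider and invalidate the cache."""
        self._providers[name] = dict(rules)
        self._compiled = None

    def grammar(self):
        """Return the combined grammar, compiling it only if needed."""
        if self._compiled is None:
            parts = []
            for provider in sorted(self._providers):
                for rule, pattern in sorted(self._providers[provider].items()):
                    # Rule names must be unique valid identifiers.
                    parts.append('(?P<%s>%s)' % (rule, pattern))
            self._compiled = re.compile('|'.join(parts))
            self.builds += 1
        return self._compiled


cache = GrammarCache()
cache.register('core', {'bold': r"'''"})
first = cache.grammar()
second = cache.grammar()          # served from the cache
cache.register('tickets', {'ticket': r'#\d+'})
third = cache.grammar()           # rebuilt after the new provider
```

With a real parser generator the rebuild step would be much more expensive than `re.compile`, which is why invalidating only on provider changes matters.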
comment:12 by , 15 years ago
but making the wiki formatter output markup instead of just plain text will be a good start ;)
comment:13 by , 14 years ago
Cc: | added |
---|
comment:14 by , 14 years ago
Cc: | added |
---|
comment:15 by , 14 years ago
which dom api are you going to use?
a custom one?
if so, could we please get in touch in order to enable me to adopt the dom i am currently implementing for a genshi replacement?
perhaps we could also exchange some ideas…
comment:16 by , 14 years ago
How far is this in development? Is there some public repository to check your current solution and join your effort?
I am also currently prototyping such a WikiDom and parser; however, there is still the problem of extending the syntax while allowing the parser to recognize similar productions from different syntax providers.
See also the current thread on the mailing list, where someone asked to associate priorities with each production returned by the syntax provider.
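One way to read the priority idea, as a sketch only: each provider returns `(priority, name, pattern)` tuples, and the parser tries them in priority order so that overlapping productions are resolved predictably. The function names, the tuple layout and the "lower number wins" convention are assumptions, not anything taken from the mailing list thread.

```python
import re


def core_provider():
    # Generic CamelCase wiki names, low precedence.
    return [(50, 'wikiname', r'(?:[A-Z][a-z]+){2,}')]


def plugin_provider():
    # A more specific production that overlaps with 'wikiname'.
    return [(10, 'projectname', r'Proj[A-Z][a-z]+')]


def collect_productions(providers):
    """Gather productions from all providers, lowest priority number first.

    sorted() is stable, so productions with equal priority keep the
    order in which their providers were listed.
    """
    productions = []
    for provider in providers:
        productions.extend(provider())
    return sorted(productions, key=lambda p: p[0])


def first_match(productions, text, pos=0):
    """Return (name, matched text) for the first production that matches."""
    for _priority, name, pattern in productions:
        m = re.compile(pattern).match(text, pos)
        if m:
            return name, m.group(0)
    return None


productions = collect_productions([core_provider, plugin_provider])
```

Both patterns match `ProjAlpha`, but the plugin's production is tried first because of its priority; `WikiStart` still falls through to the core production.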
comment:17 by , 14 years ago
Back in July, I started a wikiparser branch that you can find on my github clone. Although that was a good start, there's still much to do, which is why I didn't advertise it much. I'll go back to it, but can't tell when.
comment:18 by , 14 years ago
I recently stumbled over the pyparsing module. Wouldn't that be great for defining the grammar? Authors of wiki syntax extensions could then simply use it to plug their grammar into the existing grammars without much headache.
I will look into pyparsing over the weekend and see what I can find out, whether it is usable for the approach or not.
comment:19 by , 14 years ago
Just had a look into pyparsing, here it goes.
- Setting up a recursive syntax is not that complicated.
- time spent parsing, at least for the example provided below, is negligible
- rendering to xml comes for free, and so does the wiki dom; no need to reinvent the wheel, just use one of the available parsers/dom builders and process it
- of course, error handling etc. requires additional work
- dunno whether developers should use the pyparsing syntax or use something different for declaring their syntax extensions, though
    import pyparsing as pp
    import string
    import time

    icap_word = r'[' + string.uppercase + '][' + string.lowercase + ']+'
    wiki_name = pp.Regex(icap_word + r'(' + icap_word + ')+').setResultsName('wikiName')
    text = pp.Word(pp.alphas + pp.alphas8bit).setResultsName('text')
    markup_or_text = pp.Forward()
    italics = pp.Group(pp.Literal("''").suppress() + markup_or_text + pp.Literal("''").suppress()).setResultsName('italics')
    bold = pp.Group(pp.Literal("'''").suppress() + markup_or_text + pp.Literal("'''").suppress()).setResultsName('bold')
    wiki_link = pp.Group(wiki_name).setResultsName('wikiLink')
    element = italics | bold | wiki_link
    markup_or_text << pp.OneOrMore(element | text)
    markup = markup_or_text

    if __name__ == '__main__':
        t = """
        ''' bold wikiName '' italics WikiName'' notAwikiName NotAWikiName
        Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
        Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
        Lorem ''' ipsum etc pp Lorem ipsum ''' etc pp
        Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
        Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
        Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
        Lorem ipsum etc pp Lorem ipsum etc pp
        italicsa '' '' '''
        """
        print time.time()
        print 'parsing'
        tokens = markup.parseString(t)
        print time.time()
        print tokens.asXML('document')
        print time.time()
comment:20 by , 14 years ago
Cc: | added |
---|
comment:21 by , 14 years ago
Milestone: | 0.13 → 0.14-wikiengine |
---|
comment:22 by , 13 years ago
I have already done some work on extensible grammars using pyparsing, for a configuration parser of mine. It basically relies on forward-declared concepts of the language.
However, it still requires up-front knowledge of the modules that will provide those grammar extensions, so it is not yet suitable for an extensible system such as Trac.
Still looking into that…
comment:23 by , 13 years ago
Good luck with this… but let me just share my intuitive feelings on the topic: I doubt a pyparsing based Wiki parser will work, as Wiki markup is not a programming language and a Wiki parser needs to emulate the way a human reader deals with structured plain text. It's been nearly a year now since I started something in this direction (wikiparser) and I hope to be able to resume work on this topic this summer ;-)
comment:24 by , 13 years ago
I understand trac's wiki syntax is not very far from creole (at least, compared to other wiki engines that I've worked with), and that there's a goal to make trac's wiki syntax more creole-friendly. So, I thought it could be useful to add here this reference to a wiki creole parser in python.
This library uses the BSD license — the same as Trac — so it may be a good starting point.
comment:25 by , 13 years ago
@cboos: Cool, I've just noticed you've already worked quite a bit on this some time ago. How far would you say it is from a release? Would love to see it make it to the trunk :)
comment:26 by , 12 years ago
Cc: | added |
---|
comment:27 by , 11 years ago
@cboos: Looks good, but what are your goals for the final document schema, i.e. just
    <document>
      <node>...
        <block name="name">
          <params><param name="name">...</param>...</params>
          <node>...</node>
        </block>
      </node>
    </document>
may not suffice, especially when wanting to process that document using genshi.
Is there a specification somewhere in the wiki that discusses the actual document schema, and where plugin authors can comment and propose additional ideas?
comment:28 by , 11 years ago
Cc: | added |
---|
comment:29 by , 9 years ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:30 by , 7 years ago
Have you considered using the RST DOM? It seems to be reasonably well developed and pretty well documented, relatively speaking:
I have not worked on it much, but the little work I have done on docutils indicates that RST is designed with the kind of structure you are trying to achieve. If you could reuse a lot of it, by making a Trac wiki markup to RST DOM converter, you would gain a whole lot of appropriate documentation along with the code.
comment:32 by , 7 years ago
Replying to Jun Omae:
Trac Wiki is not reStructuredText.
Yes. I understand that.
What I was suggesting is that you could write a Trac Wiki to docutils DOM translation, and that would give you an already documented DOM to work from for the rest of what you need to do.
You might even gain something from docutils' ability to go from its DOM to various output formats, though I do not know whether that would be helpful or not.
So, all I was suggesting is that the docutils DOM for reStructuredText might be of value because someone has already worked through a structure that is probably very close to what you need.
I'm pretty sure there is already one ticket for this.
This feature is expected for email formatting, for example. Please search the DB and close this ticket if/when you find the duplicate.