Ticket #4431 (assigned enhancement)
Opened 5 years ago
Last modified 8 weeks ago
wiki_to_wikidom
| Reported by: | ittayd@… | Owned by: | cboos |
|---|---|---|---|
| Priority: | highest | Milestone: | 0.14-wikiengine |
| Component: | wiki system | Version: | 0.10.3 |
| Severity: | critical | Keywords: | engine parser formatter |
| Cc: | Martin.vGagern@…, itamarost@…, felix.schwarz@…, fcorreia@… | ||
| Release Notes: | |||
| API Changes: | |||
Description
It would be good for plugins if the parsing of wiki text were split into two stages: first, take the text and convert it to a tree of data objects (à la DOM, but wiki-oriented, so nodes are 'text', 'macro', 'link', etc.); then different formatters can walk the tree and format it, and plugins can search it.
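As a rough illustration of the tree described above, here is a minimal sketch. The `WikiNode` class, its fields, and the `find_all` helper are hypothetical names invented for this example, not part of Trac; the tree is built by hand where a real parser would produce it.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class WikiNode:
    """Hypothetical tree node; `kind` is e.g. 'text', 'macro' or 'link'."""
    kind: str
    value: str = ""
    children: List["WikiNode"] = field(default_factory=list)

    def walk(self) -> Iterator["WikiNode"]:
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def find_all(root: WikiNode, kind: str) -> List[WikiNode]:
    """What a plugin might do: collect every node of a given kind."""
    return [node for node in root.walk() if node.kind == kind]


# Hand-built tree standing in for the output of the first (parsing) stage.
page = WikiNode("document", children=[
    WikiNode("text", "See "),
    WikiNode("link", "wiki:WikiFormatting"),
    WikiNode("text", " and "),
    WikiNode("macro", "TitleIndex"),
])

links = [node.value for node in find_all(page, "link")]
```

Formatters would use the same `walk()` traversal, emitting output per node instead of collecting nodes.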
Attachments
Change History
comment:1 Changed 5 years ago by eblot
comment:2 Changed 5 years ago by ittayd@…
sorry, couldn't find a duplicate
comment:3 Changed 5 years ago by cboos
- Component changed from general to wiki
- Keywords engine parser formatter added
- Milestone set to 0.11
- Owner changed from jonas to cboos
- Priority changed from normal to high
- Severity changed from normal to major
Well, I'm not sure there's been a specific ticket for this, but references to this are scattered through wiki-related tickets.
So, yes, generating a Wiki-DOM tree is the plan.
More specifically, the wiki parser would yield a tree of Element nodes, which can be traversed by the wiki formatters to render either as another tree of (HTML) Element nodes or serialized to various other formats (plain text, LaTeX, etc.).
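To make the idea concrete, here is a small sketch of two formatters traversing the same tree. The `Element` class, the tag names, and both formatter functions are assumptions made for this example; for brevity the HTML formatter serializes straight to a string rather than building a second tree of HTML elements.

```python
from html import escape


class Element:
    """Hypothetical parse-tree node: a tag name plus children (Elements or str)."""

    def __init__(self, tag, *children):
        self.tag = tag
        self.children = list(children)


# Assumed mapping from wiki tags to HTML tags, for illustration only.
HTML_TAGS = {"paragraph": "p", "bold": "strong", "italic": "em"}


def to_html(node):
    """Format the tree as HTML, escaping text leaves."""
    if isinstance(node, str):
        return escape(node)
    inner = "".join(to_html(child) for child in node.children)
    tag = HTML_TAGS.get(node.tag, "span")
    return "<%s>%s</%s>" % (tag, inner, tag)


def to_plain_text(node):
    """Format the same tree as plain text by dropping all markup."""
    if isinstance(node, str):
        return node
    return "".join(to_plain_text(child) for child in node.children)


tree = Element("paragraph", "a < b is ",
               Element("bold", "very ", Element("italic", "important")))

html_output = to_html(tree)
text_output = to_plain_text(tree)
```

The point of the split is that the parser runs once and each output format only needs its own traversal.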
Some related tickets are:
- #3925 (build <pre> elements)
- #3895 (build <link> elements)
- #3089 (retrieve <heading> elements from macros, needs 2 passes)
- #2296 (formatting to LaTeX)
- #4270 (formatting to plain text)
- #2064 (better truncating when generating one-liner content)
... and probably others.
Let's focus this ticket on the parsing/formatting split itself, and the specifics of the WikiDom.
comment:4 Changed 5 years ago by cboos
Some other requirements (from #4139): a macro should be able to return some more Wiki content, instead of simply some rendered output.
If we split macro rendering into two phases, one corresponding to the parsing and the other to the rendering (by way of the expand_macro recently introduced in r4621), then this would be possible: the first phase could return the additional wiki text source, which could then be recursively parsed.
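A toy sketch of that two-phase idea follows. Everything here is hypothetical: the `[[Name]]` regex, the `PARSE_PHASE_MACROS` table, and the `parse`/`render` functions are invented for illustration, and the render step is only a placeholder that does not call Trac's actual expand_macro.

```python
import re

MACRO_RE = re.compile(r"\[\[(\w+)\]\]")

# Hypothetical parse-phase macros: each returns more wiki source,
# not rendered output.
PARSE_PHASE_MACROS = {
    "Greeting": lambda: "Hello from [[Project]]!",
    "Project": lambda: "Trac",
}


def parse(text, depth=0, max_depth=10):
    """Phase 1: return a flat list of ('text', str) and ('macro', name) nodes.

    Wiki source returned by a parse-phase macro is parsed recursively.
    """
    if depth > max_depth:
        raise RecursionError("macro expansion nested too deeply")
    nodes = []
    pos = 0
    for match in MACRO_RE.finditer(text):
        if match.start() > pos:
            nodes.append(("text", text[pos:match.start()]))
        name = match.group(1)
        if name in PARSE_PHASE_MACROS:
            source = PARSE_PHASE_MACROS[name]()
            nodes.extend(parse(source, depth + 1, max_depth))
        else:
            # Left in the tree for the rendering phase to deal with.
            nodes.append(("macro", name))
        pos = match.end()
    if pos < len(text):
        nodes.append(("text", text[pos:]))
    return nodes


def render(nodes):
    """Phase 2: placeholder rendering of the remaining nodes."""
    parts = []
    for kind, value in nodes:
        parts.append(value if kind == "text" else "<macro %s>" % value)
    return "".join(parts)


nodes = parse("[[Greeting]] Today: [[Timestamp]]")
output = render(nodes)
```

The depth limit guards against a macro that, directly or indirectly, expands to itself.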
comment:5 Changed 5 years ago by cboos
- Status changed from new to assigned
Ok, I've now started to work on the parser/formatter split, stay tuned ;)
comment:6 Changed 5 years ago by anonymous
- Cc Martin.vGagern@… added
comment:7 Changed 5 years ago by cboos
- Priority changed from high to highest
comment:8 Changed 5 years ago by cboos
- Milestone changed from 0.11 to 0.12
Probably not for 0.11 - too many things left to do for that release.
comment:9 Changed 4 years ago by Zoran Isailovski
Just a hint: If the "DOM" you have in mind represents the input domain, then it is a sort of abstract syntax tree (AST). There are parser generators out there which generate parsers able to do just that: generate ASTs. They would require a formal definition of the input language syntax though, which I have not seen yet for wikis (but I think the project would definitely benefit from formally defining the wiki syntax anyway, so it might very well be worth the effort).
Cheers -- Zoran
comment:10 Changed 2 years ago by cboos
- Milestone changed from next-major-0.1X to 0.13
- Severity changed from major to critical
Let's make that a priority for me for the next release, as it is the enabler for key improvements.
comment:11 Changed 2 years ago by Carsten Klein <carsten.klein@…>
I also started thinking about replacing the regular expression based parser with a parser that would be both extensible and generated by, for example, antlr3.
Some considerations:
- antlr3 currently does not support python 3.0 (I've made an unpublished patch for this, but it makes the output generated by antlr3 incompatible with python 2.5)
- making the parser extensible would basically mean merging the available wiki syntax extensions into one big grammar or several such grammars; these would then be compiled and cached by the system, and would have to be recreated as soon as a new wiki syntax provider becomes available
- wiki syntax providers will become more complicated, and not all users will be able to provide such extensions, as they would have to be capable of authoring, for example, antlr3 grammars, or at least fragments of such grammars
- supporting multiple wiki syntaxes, for example the original trac syntax or the creole syntax, will require multiple such grammars
- besides that, I do think trac would become more responsive once the regular expressions have been eliminated; also, generated ASTs could be cached and from there transformed to whatever output type you like
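The caching idea in the last point could look roughly like the sketch below. `AstCache` and `toy_parse` are hypothetical names for this example; the stand-in parser just splits paragraphs, and `clear()` models the need to discard cached trees when a new syntax provider changes the grammar.

```python
import hashlib


class AstCache:
    """Hypothetical cache of parsed trees, keyed by a hash of the wiki source."""

    def __init__(self, parse):
        self._parse = parse
        self._trees = {}
        self.misses = 0

    def get(self, wiki_text):
        """Return the cached tree for this text, parsing only on a miss."""
        key = hashlib.sha1(wiki_text.encode("utf-8")).hexdigest()
        if key not in self._trees:
            self.misses += 1
            self._trees[key] = self._parse(wiki_text)
        return self._trees[key]

    def clear(self):
        """Drop all trees, e.g. when a new syntax provider is installed."""
        self._trees.clear()


def toy_parse(text):
    """Stand-in parser: one ('paragraph', text) node per non-empty block."""
    blocks = (block.strip() for block in text.split("\n\n"))
    return [("paragraph", block) for block in blocks if block]


cache = AstCache(toy_parse)
first = cache.get("Intro\n\nDetails")
second = cache.get("Intro\n\nDetails")   # served from the cache
```

Each formatter can then start from the cached tree instead of re-parsing the source.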
comment:12 Changed 2 years ago by Carsten Klein <carsten.klein@…>
but making the wiki formatter output markup instead of just plain text will be a good start ;)
comment:13 Changed 21 months ago by itamaro
- Cc itamarost@… added
comment:14 Changed 20 months ago by fschwarz
- Cc felix.schwarz@… added
comment:15 Changed 19 months ago by Carsten Klein <carsten.klein@…>
which dom api are you going to use?
a custom one?
if so, could we please get in touch in order to enable me to adopt the dom i am currently implementing for a genshi replacement?
perhaps we could also exchange some ideas...
comment:16 Changed 15 months ago by carsten.klein@…
How far is this in development? Is there some public repository to check your current solution and join your effort?
I am also currently prototyping such a WikiDom and Parser; however, there is still the problem of extending the syntax while also allowing the parser to recognize similar productions from different syntax providers.
See also the current thread on the mailing list, where someone asked to associate priorities with each production returned by the syntax provider.
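One way such priorities could resolve overlapping productions is sketched below. The `SyntaxRegistry` class, the `tokenize` helper, and the two example productions are hypothetical and only illustrate the ordering problem; they do not reflect what was proposed on the mailing list or any Trac API.

```python
import re


class SyntaxRegistry:
    """Hypothetical registry of productions contributed by syntax providers."""

    def __init__(self):
        self._productions = []  # (priority, name, regex pattern)

    def register(self, name, pattern, priority):
        self._productions.append((priority, name, pattern))

    def compile(self):
        """Join all patterns into one regex, highest priority first.

        In a regex alternation the leftmost alternative that matches wins,
        so ordering by priority decides between overlapping productions.
        """
        ordered = sorted(self._productions, key=lambda prod: -prod[0])
        return re.compile("|".join(
            "(?P<%s>%s)" % (name, pattern) for _, name, pattern in ordered))


def tokenize(regex, text):
    """Split text into (production name, matched text) pairs."""
    tokens = []
    pos = 0
    for match in regex.finditer(text):
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


registry = SyntaxRegistry()
registry.register("hashtag", r"#\w+", priority=1)   # would also match "#42"
registry.register("ticket", r"#\d+", priority=10)   # wins because of priority
tokens = tokenize(registry.compile(), "see #42 and #parser")
```

Without the priority, `#42` would be claimed by whichever provider happened to register first.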
comment:17 Changed 15 months ago by cboos
Back in July, I started a wikiparser branch that you can find on my github clone. Although that was a good start, there's still much to do, which is why I didn't advertise it much. I'll go back to it, but can't tell when.
comment:18 Changed 14 months ago by Carsten Klein <carsten.klein@…>
Recently stumbled over the pyparsing module. Wouldn't that be great for defining the grammar with?
Then authors of wiki syntax extensions could simply use this for plugging in the grammar into the existing grammars without much headache.
I will look into pyparsing over the weekend and see what I can find out, whether it is usable for the approach or not.
comment:19 Changed 14 months ago by Carsten Klein <carsten.klein@…>
Just had a look into pyparsing, here it goes.
- Setting up a recursive syntax is not that complicated.
- time spent for parsing, as far as the below provided example goes, is negligible
- rendering to xml comes for free, and so does the wiki dom; no need to reinvent the wheel, just use one of the available parsers/dom builders and process it
- of course, error handling etc. requires additional work
- dunno whether developers should use the pyparsing syntax or use something different for declaring their syntax extensions, though
```python
import pyparsing as pp
import string
import time

icap_word = r'[' + string.uppercase + '][' + string.lowercase + ']+'
wiki_name = pp.Regex(icap_word + r'(' + icap_word + ')+').setResultsName('wikiName')
text = pp.Word(pp.alphas + pp.alphas8bit).setResultsName('text')
markup_or_text = pp.Forward()
italics = pp.Group(pp.Literal("''").suppress() + markup_or_text + pp.Literal("''").suppress()).setResultsName('italics')
bold = pp.Group(pp.Literal("'''").suppress() + markup_or_text + pp.Literal("'''").suppress()).setResultsName('bold')
wiki_link = pp.Group(wiki_name).setResultsName('wikiLink')
element = italics | bold | wiki_link
markup_or_text << pp.OneOrMore(element | text)
markup = markup_or_text

if __name__ == '__main__':
    t = """
    ''' bold wikiName '' italics WikiName'' notAwikiName NotAWikiName
    Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
    Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
    Lorem ''' ipsum etc pp Lorem ipsum ''' etc pp
    Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
    Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
    Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
    Lorem ipsum etc pp Lorem ipsum etc pp
    italicsa '' '' '''
    """
    print time.time()
    print 'parsing'
    tokens = markup.parseString(t)
    print time.time()
    print tokens.asXML('document')
    print time.time()
```
comment:20 Changed 13 months ago by fcorreia@…
- Cc fcorreia@… added
comment:21 Changed 12 months ago by cboos
- Milestone changed from 0.13 to 0.14-wikiengine
comment:22 Changed 8 months ago by Carsten Klein <carsten.klein@…>
I have already done some work on extensible grammars using pyparsing, for a configuration parser of mine. It basically uses forward-declared concepts of the language.
However, it still requires up-front knowledge of the modules that will provide those grammar extensions, so it is still not suitable for use with an extensible system such as Trac.
Still looking into that...
comment:23 Changed 8 months ago by cboos
Good luck with this... but let me just share my intuitive feelings on the topic: I doubt a pyparsing based Wiki parser will work, as Wiki markup is not a programming language and a Wiki parser needs to emulate the way a human reader deals with structured plain text. It's been nearly a year now since I started something in this direction (wikiparser) and I hope to be able to resume work on this topic this summer ;-)
comment:24 Changed 8 weeks ago by FilipeCorreia
I understand trac's wiki syntax is not very far from creole (at least, compared to other wiki engines that I've worked with), and that there's a goal to make trac's wiki syntax more creole-friendly. So, I thought it could be useful to add here this reference to a wiki creole parser in python.
This library uses the BSD license -- the same as Trac -- so it may be a good starting point.
comment:25 Changed 8 weeks ago by FilipeCorreia
@cboos: Cool, I've just noticed you've already worked quite a bit on this some time ago. How far away would you say it is for a release? Would love to see it making it to the trunk :)



I'm pretty sure there is already one ticket for this.
This feature is expected for email formatting, for example. Please search the DB and close this ticket if/when you find the duplicate.