Edgewall Software

Opened 13 years ago

Last modified 11 months ago

#4431 new enhancement

wiki_to_wikidom

Reported by: ittayd@…
Owned by:
Priority: highest
Milestone: topic-wikiengine
Component: wiki system
Version: 0.10.3
Severity: critical
Keywords: engine parser formatter
Cc: Martin.vGagern@…, itamarost@…, felix.schwarz@…, fcorreia@…, leho@…, olemis+trac@…
Branch:
Release Notes:
API Changes:

Description

It would be good for plugins if parsing of wiki text were broken into two stages. First, take the text and convert it to a data object tree (à la DOM, but wiki-oriented, so nodes are 'text', 'macro', 'link', etc.). Then different formatters can walk the tree and format it, and plugins can search it.
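
As a rough illustration of the request, the sketch below shows what such a tree and a plugin-side search might look like. The `Node` class, its `kind` values and the `find_all` helper are hypothetical names chosen for this example, not part of Trac, and the tree is built by hand in place of a real parser.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """One node of a hypothetical wiki tree; kind is 'text', 'macro', 'link', ..."""
    kind: str
    value: str = ""
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        """Yield this node and all of its descendants in document order."""
        yield self
        for child in self.children:
            yield from child.walk()


def find_all(root: Node, kind: str) -> List[Node]:
    """What a plugin might do: collect every node of one kind."""
    return [node for node in root.walk() if node.kind == kind]


# Hand-built tree standing in for the output of stage one (the parser).
doc = Node("document", children=[
    Node("text", "See "),
    Node("link", "WikiStart"),
    Node("text", " and "),
    Node("macro", "TitleIndex"),
])

links = [node.value for node in find_all(doc, "link")]
print(links)
```

Because the search works on nodes rather than on rendered output, a plugin would not need to re-parse wiki text or scrape HTML to find links or macros.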

Attachments (0)

Change History (32)

comment:1 by Emmanuel Blot, 13 years ago

I'm pretty sure there is already one ticket for this.

This feature is expected for email formatting, for example. Please search the DB and close this ticket if/when you find the duplicate.

comment:2 by ittayd@…, 13 years ago

sorry, couldn't find a duplicate

comment:3 by Christian Boos, 13 years ago

Component: general → wiki
Keywords: engine parser formatter added
Milestone: 0.11
Owner: changed from Jonas Borgström to Christian Boos
Priority: normal → high
Severity: normal → major

Well, I'm not sure there's been a specific ticket for this, but references to this are scattered through wiki-related tickets.

So, yes, generating a Wiki-DOM tree is the plan. More specifically, the wiki parser would yield a tree of Element nodes, which can be traversed by the wiki formatters to render either as another tree of (HTML) Element nodes or serialized to various other formats (plain text, LaTeX, etc.).

Some related tickets are:

  • #3925 (build <pre> elements)
  • #3895 (build <link> elements)
  • #3089 (retrieve <heading> elements from macros, needs 2 passes)
  • #2296 (formatting to LaTeX)
  • #4270 (formatting to plain text)
  • #2064 (better truncating when generating one-liner content)

… and probably others.

Let's focus this ticket on the parsing/formatting split itself, and the specifics of the WikiDom.
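
To make the parsing/formatting split concrete, here is a minimal sketch of two formatters traversing the same tree. The tuple-based tree shape and the names `to_html` and `to_plain` are assumptions made for this example. Unlike the plan described above, the HTML formatter serializes straight to a string instead of building a second tree of Element nodes.

```python
from html import escape

# Hypothetical tree shape: (kind, payload), where payload is a string for
# text leaves and a list of child nodes for everything else.
tree = ("paragraph", [
    ("text", "Fix for "),
    ("bold", [("text", "parser & formatter")]),
    ("text", " split"),
])

# Assumed mapping from node kinds to HTML tag names.
HTML_TAGS = {"paragraph": "p", "bold": "strong"}


def to_html(node):
    """Render the tree as an HTML string, escaping text leaves."""
    kind, payload = node
    if kind == "text":
        return escape(payload)
    inner = "".join(to_html(child) for child in payload)
    tag = HTML_TAGS[kind]
    return "<%s>%s</%s>" % (tag, inner, tag)


def to_plain(node):
    """Render the same tree as plain text by dropping all markup."""
    kind, payload = node
    if kind == "text":
        return payload
    return "".join(to_plain(child) for child in payload)


html_out = to_html(tree)
plain_out = to_plain(tree)
print(html_out)
print(plain_out)
```

Adding another output format (LaTeX, for instance) would then mean writing one more traversal function, with no change to the parser.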

comment:4 by Christian Boos, 13 years ago

Some other requirements (from #4139): a macro should be able to return some more Wiki content, instead of simply some rendered output.

If we split the macro rendering into two phases, one corresponding to the parsing, the other to the rendering (by way of the expand_macro recently introduced in r4621), then this would be possible: the first phase could return the additional wiki text source, which could then be recursively parsed.
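
The recursive idea can be sketched as follows. This is a toy under stated assumptions: it only recognizes argument-less `[[Name]]` macros, the `MACROS` registry and the `parse` function are hypothetical, and it does not use Trac's real macro API. The depth limit is an addition of this sketch to keep a self-referencing macro from recursing forever.

```python
import re

MACRO_RE = re.compile(r"\[\[(\w+)\]\]")

# Hypothetical macros: each returns wiki *source*, not rendered output.
MACROS = {
    "Greeting": lambda: "Hello from [[Project]]!",
    "Project": lambda: "'''Trac'''",
}


def parse(source, depth=0, max_depth=5):
    """Return a list of ('text', str) and ('macro', name, children) nodes."""
    nodes = []
    pos = 0
    for match in MACRO_RE.finditer(source):
        if match.start() > pos:
            nodes.append(("text", source[pos:match.start()]))
        name = match.group(1)
        expand = MACROS.get(name)
        if expand is None or depth >= max_depth:
            # Unknown macro or recursion limit reached: keep it as literal text.
            nodes.append(("text", match.group(0)))
        else:
            # Parsing phase: the macro yields wiki text, parsed recursively.
            nodes.append(("macro", name, parse(expand(), depth + 1, max_depth)))
        pos = match.end()
    if pos < len(source):
        nodes.append(("text", source[pos:]))
    return nodes


result = parse("Intro: [[Greeting]]")
print(result)
```

The toy leaves inline markup such as `'''Trac'''` as plain text; a real parser would turn it into further nodes during the same pass.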

comment:5 by Christian Boos, 12 years ago

Status: new → assigned

Ok, I've now started to work on the parser/formatter split, stay tuned ;)

comment:6 by anonymous, 12 years ago

Cc: Martin.vGagern@… added

comment:7 by Christian Boos, 12 years ago

Priority: high → highest

comment:8 by Christian Boos, 12 years ago

Milestone: 0.11 → 0.12

Probably not for 0.11 - too many things left to do for that release.

comment:9 by Zoran Isailovski, 12 years ago

Just a hint: if the "DOM" you have in mind represents the input domain, then it is a sort of abstract syntax tree (AST). There are parser generators out there that generate parsers able to do just that: produce ASTs. They would require a formal definition of the input language syntax though, which I have not seen yet for wikis (but I think the project would definitely benefit from formally defining the wiki syntax anyway, so it might very well be worth the effort).

Cheers — Zoran

comment:10 by Christian Boos, 9 years ago

Milestone: next-major-0.1X → 0.13
Severity: major → critical

Let's make that a priority for me for the next release, as it is the enabler for key improvements.

comment:11 by Carsten Klein <carsten.klein@…>, 9 years ago

I also started thinking about replacing the regular-expression-based parser with a parser that would be both extensible and generated by, for example, antlr3.

Some considerations:

  • antlr3 is currently not supporting python 3.0 (I've made a patch for this (unpublished) but it renders the output generated by antlr3 incompatible to python 2.5)
  • making the parser extensible would basically mean that one would have to join in the available extensions to the wiki syntax into one big grammar or multiple such grammars, the latter then would be compiled and cached by the system and would have to be recreated as soon as a new wiki syntax provider becomes available
  • wiki syntax providers will become more complicated and not all users may provide such extensions to the system as they would have to be capable of authoring for example antlr3 grammars or at least be able to provide fragments of such grammars
  • supporting multiple wiki syntaxes, for example the original trac syntax or the creole syntax will require multiple such grammars
  • besides that, I do think that trac would become more responsive once the regular expressions have been eliminated; in addition, generated ASTs could be cached and from there transformed to whatever output type you like

comment:12 by Carsten Klein <carsten.klein@…>, 9 years ago

but making the wiki formatter output markup instead of just plain text will be a good start ;)

comment:13 by Itamar Ostricher, 9 years ago

Cc: itamarost@… added

comment:14 by Felix Schwarz, 9 years ago

Cc: felix.schwarz@… added

comment:15 by Carsten Klein <carsten.klein@…>, 9 years ago

which dom api are you going to use?

a custom one?

if so, could we please get in touch in order to enable me to adopt the dom i am currently implementing for a genshi replacement?

perhaps we could also exchange some ideas…

comment:16 by carsten.klein@…, 9 years ago

How far is this in development? Is there some public repository to check your current solution and join your effort?

I am also currently prototyping such a WikiDom and Parser, however, there is still the problem of extending the syntax and also be able to allow the parser to recognize similar productions from different syntax providers.

See also the current thread on the mailing list, where someone asked to associate priorities with each production returned by the syntax provider.
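
One way to read the priority suggestion is sketched below: every provider contributes (priority, name, pattern) triples and the matcher tries them in priority order, so overlapping productions from different providers are resolved predictably. The provider lists, the "lower number wins" convention and `build_matcher` are all assumptions of this example, not something taken from the mailing list thread.

```python
import re

# Hypothetical provider output: (priority, name, pattern); lower number wins.
PROVIDER_A = [(10, "wiki_link", r"[A-Z][a-z]+(?:[A-Z][a-z]+)+")]
PROVIDER_B = [(5, "ticket_link", r"#\d+"), (20, "camel_word", r"[A-Z]\w+")]


def build_matcher(*providers):
    """Merge all productions and return a function that tries them in order."""
    productions = sorted(p for provider in providers for p in provider)
    compiled = [(name, re.compile(pattern)) for _, name, pattern in productions]

    def match_at(text, pos=0):
        """Return (production name, matched text) or None if nothing matches."""
        for name, regex in compiled:
            found = regex.match(text, pos)
            if found:
                return name, found.group(0)
        return None

    return match_at


match_at = build_matcher(PROVIDER_A, PROVIDER_B)
print(match_at("WikiStart page"))
print(match_at("#4431"))
```

Here "WikiStart" would also match the lower-priority `camel_word` pattern, but `wiki_link` is tried first because of its priority.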

comment:17 by Christian Boos, 9 years ago

Back in July, I started a wikiparser branch that you can find on my github clone. Although that was a good start, there's still much to do, which is why I didn't advertise it much. I'll go back to it, but can't tell when.

comment:18 by Carsten Klein <carsten.klein@…>, 9 years ago

Recently stumbled over the pyparsing module. Wouldn't that be great for defining the grammar with? Then authors of wiki syntax extensions could simply use this for plugging in the grammar into the existing grammars without much headache.

I will look into pyparsing over the weekend and see what I can find out, whether it is usable for the approach or not.

comment:19 by Carsten Klein <carsten.klein@…>, 9 years ago

Just had a look into pyparsing, here it goes.

  • Setting up a recursive syntax is not that complicated.
  • time spent for parsing, as far as the below provided example goes, is negligible
  • rendering to xml comes for free, and so does the wiki dom, no need to re-implement the wheel, just use one of the available parsers/dom builders and process it
  • of course, error handling etc. requires additional work
  • dunno whether developers should use the pyparsing syntax or use something different for declaring their syntax extensions, though
# Python 3 port of the original Python 2 snippet. It assumes pyparsing 2.x:
# ParseResults.asXML() is not available in pyparsing 3.
import string
import time

import pyparsing as pp


# A capitalised word such as "Wiki"; two or more in a row form a WikiName.
icap_word = r'[' + string.ascii_uppercase + '][' + string.ascii_lowercase + ']+'
wiki_name = pp.Regex(icap_word + r'(' + icap_word + ')+').setResultsName('wikiName')

text = pp.Word(pp.alphas + pp.alphas8bit).setResultsName('text')
# Forward declaration, so that italics and bold can contain further markup.
markup_or_text = pp.Forward()

italics = pp.Group(pp.Literal("''").suppress() + markup_or_text + pp.Literal("''").suppress()).setResultsName('italics')
bold = pp.Group(pp.Literal("'''").suppress() + markup_or_text + pp.Literal("'''").suppress()).setResultsName('bold')

wiki_link = pp.Group(wiki_name).setResultsName('wikiLink')

element = italics | bold | wiki_link
markup_or_text << pp.OneOrMore(element | text)

markup = markup_or_text

if __name__ == '__main__':

    t = """
''' bold wikiName '' italics WikiName'' 
notAwikiName NotAWikiName
Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp 
Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
Lorem ''' ipsum etc pp Lorem ipsum ''' etc pp Lorem ipsum etc pp Lorem ipsum etc pp
Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp Lorem ipsum etc pp
italicsa '' '' '''
    """

    # Timestamps before parsing, after parsing and after XML serialisation.
    print(time.time())
    print('parsing')
    tokens = markup.parseString(t)
    print(time.time())
    print(tokens.asXML('document'))
    print(time.time())

comment:20 by fcorreia@…, 9 years ago

Cc: fcorreia@… added

comment:21 by Christian Boos, 8 years ago

Milestone: 0.13 → 0.14-wikiengine

comment:22 by Carsten Klein <carsten.klein@…>, 8 years ago

I have some work done already on the topic of extensible grammars using pyparsing for a configuration parser of mine. It basically uses forwardly declared concepts of the language.

However, it still requires up-front knowledge of the modules that will provide those grammar extensions, so it is still not suitable for an extensible system such as Trac.

Still looking into that…

comment:23 by Christian Boos, 8 years ago

Good luck with this… but let me just share my intuitive feelings on the topic: I doubt a pyparsing-based Wiki parser will work, as Wiki markup is not a programming language and a Wiki parser needs to emulate the way a human reader deals with structured plain text. It's been nearly a year now since I started something in this direction (wikiparser) and I hope to be able to resume work on this topic this summer ;-)

comment:24 by FilipeCorreia, 8 years ago

I understand trac's wiki syntax is not very far from creole (at least, compared to other wiki engines that I've worked with), and that there's a goal to make trac's wiki syntax more creole-friendly. So, I thought it could be useful to add here this reference to a wiki creole parser in python.

This library uses the BSD license — the same as Trac — so it may be a good starting point.

comment:25 by FilipeCorreia, 8 years ago

@cboos: Cool, I've just noticed you've already worked quite a bit on this some time ago. How far away would you say it is for a release? Would love to see it making it to the trunk :)

comment:26 by lkraav <leho@…>, 6 years ago

Cc: leho@… added

comment:27 by trancesilken@…, 5 years ago

@cboos: Looks good, but what are your goals for the final document schema, i.e. just

<document>
  <node>...
    <block name="name">
      <params><param name="name">...</param>...</params>
      <node>...</node>
    </block>
  </node>
</document>

may not suffice, especially when wanting to process that document using genshi.

Is there a specification somewhere in the wiki that discusses the actual document schema, and where plugin authors can comment and propose additional ideas?
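
For discussion purposes, the schema outlined above can be built and queried with the standard library, as in the sketch below. The element names come from the outline in this comment; the attribute values ("toc", "depth", "2") and the nested text are made up for illustration, and nothing here reflects the schema of the actual wikiparser branch.

```python
import xml.etree.ElementTree as ET

# Build the outlined structure: document > node > block > (params, node).
document = ET.Element("document")
node = ET.SubElement(document, "node")
block = ET.SubElement(node, "block", name="toc")        # hypothetical block name
params = ET.SubElement(block, "params")
param = ET.SubElement(params, "param", name="depth")    # hypothetical parameter
param.text = "2"
child = ET.SubElement(block, "node")
child.text = "Some nested content"

# A plugin could query the tree with ElementTree's limited XPath support.
param_names = [p.get("name") for p in document.findall(".//block/params/param")]

xml_text = ET.tostring(document, encoding="unicode")
print(param_names)
print(xml_text)
```

Writing out even a small instance like this makes it easier to see which node types and attributes the generic `<node>`/`<block>` outline leaves unspecified.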

comment:28 by Olemis Lang <olemis+trac@…>, 5 years ago

Cc: olemis+trac@… added

comment:29 by Ryan J Ollos, 4 years ago

Owner: Christian Boos removed
Status: assigned → new

comment:30 by ilewismsl <ilewis@…>, 22 months ago

Have you considered using the RST DOM? It seems to be reasonably well developed and pretty well documented, relatively speaking:

I have not worked on it much, but the little work I have done on docutils indicates that RST is designed with the kind of structure you are trying to achieve. If you could reuse a lot of it, by making a Trac wiki markup to RST DOM converter, you would gain a whole lot of appropriate documentation along with the code.

comment:31 by Jun Omae, 22 months ago

Trac Wiki is not reStructuredText.

comment:32 by ilewismsl, 22 months ago (in reply to: 31)

Replying to Jun Omae:

Trac Wiki is not reStructuredText.

Yes. I understand that.

What I was suggesting is that you could write a Trac Wiki to docutils DOM translation, and that would give you an already documented DOM to work from for the rest of what you need to do.

You might even gain something from docutils' ability to go from its DOM to various output formats, though I do not know whether that would be helpful or not.

So, all I was suggesting is that the docutils DOM for reStructuredText might be of value because someone has already worked through a structure that is probably very close to what you need.
