
Version 6 (modified by Christian Boos, 14 years ago)

link to related proposal

Wiki Engine Refactoring: Split Parser and Formatter Stages

The current Trac Wiki Engine, while working quite well, suffers from a few fundamental drawbacks:

  • parsing and formatting are done in a single step; this makes it hard to examine the inner structure of a Wiki text, as every new parser/formatter has to be written by subclassing an existing one
  • the code is quite complex, contains numerous (only?) special cases, and as a result, is not easy to maintain

Because of that, a parser/formatter split was in order. This has to be done carefully, however, as the Wiki engine is one of the "vital pieces" of Trac.

We first outline the major requirements, then briefly explain the envisioned implementation of the new engine. Finally we present and discuss the major open issues for the Wiki system, which provide concrete use cases that must be addressed by the new design.

Requirements

  1. Compatibility
    The existing Trac wiki syntax should continue to work the way it always has (well, modulo the bug fixes). Pages that used to "look good" using the former engine should look the same.
  2. Flexibility
    The Trac wiki engine is a modular one, with the possibility to have plugins inject their own Wiki syntax (IWikiSyntaxProvider). This flexibility should remain and even be augmented:
    1. Possibility to extend the parser, by adding new "tokens"
    2. Possibility to extend the existing formatters, in order to have the new tokens transformed by existing formatters
    3. Possibility to easily create new formatters, to address new needs
  3. Speed
    The use of wiki text is ubiquitous in Trac, so it should be very fast to parse a wiki text and apply one or more kinds of formatting to it.
  4. Maintainability
    The wiki engine must be easy to understand. OK, the regexps can still be scary in order to satisfy requirement 3.

Implementation

I've chosen a tree-based approach, in which the wiki text is first converted to a Wiki DOM tree, consisting of instances of subclasses of WikiDomNode. This tree can then in turn be transformed into some other output format, like Genshi Element nodes or events.

This is the approach outlined in #4431.

More specifically, the WikiDomNode class hierarchy is used to model the inheritance relations between the various types of nodes. This makes it possible to set up embedding rules, specifying which nodes can be placed into which others. The structure of the document and its consistency can therefore be ensured.
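As a rough illustration of such embedding rules, here is a minimal sketch. Only the name `WikiDomNode` comes from the text; the node types `Block`, `Inline`, `Paragraph`, `Anchor` and `Bold`, the `VALID_PARENTS` table and the `can_embed` helper are assumptions made up for this example, not the actual implementation.

```python
class WikiDomNode:
    """Base node; only the class name comes from the text above."""

    def __init__(self, *children):
        self.children = []
        for child in children:
            self.append(child)

    def append(self, child):
        # Refuse children that no embedding rule allows under this node.
        if not can_embed(type(child), type(self)):
            raise ValueError("%s cannot be placed in %s"
                             % (type(child).__name__, type(self).__name__))
        self.children.append(child)
        return child


# Hypothetical node types, chosen only for illustration.
class Block(WikiDomNode):
    pass

class Inline(WikiDomNode):
    pass

class Paragraph(Block):
    pass

class Anchor(Inline):
    pass

class Bold(Inline):
    pass


# Embedding rules as (nodetype, valid_parent_nodetypes) pairs.
VALID_PARENTS = [
    (Inline, (Paragraph, Bold, Anchor)),
]


def can_embed(child_type, parent_type):
    """True if some rule lets `child_type` nodes live under `parent_type`."""
    return any(issubclass(child_type, nodetype) and
               issubclass(parent_type, parents)
               for nodetype, parents in VALID_PARENTS)
```

With these rules, `Paragraph(Bold(Anchor()))` builds a small tree, whereas `Bold(Paragraph())` raises `ValueError`. Because the checks use `issubclass`, a rule stated for `Inline` automatically covers every inline node type added later.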

The inheritance relationships are also quite useful at the formatter level, as they make it easy to specify common handling for a whole group of node types. More importantly, if a new type of node is added (say by an IWikiSyntaxProvider), the existing formatters can also be augmented by custom rendering callbacks for those new nodes. If some formatters aren't updated (say because they are special formatters contributed by unrelated plugins), they'll still be able to handle those new nodes based on their parent type.
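The fallback on the parent type can be sketched as a lookup along the node class's method resolution order. This is only an illustration under assumptions: the `Formatter` class below, its `render` and `render_children` methods, and the node types `Text`, `Bold` and `Highlight` are hypothetical names, and string HTML is used for brevity.

```python
from html import escape


class WikiDomNode:
    def __init__(self, *children):
        self.children = list(children)

class Inline(WikiDomNode):
    pass

class Text(Inline):
    def __init__(self, value):
        super().__init__()
        self.value = value

class Bold(Inline):
    pass

class Highlight(Bold):
    """Stands in for a node type contributed later by a plugin."""


class Formatter:
    """Hypothetical formatter holding a {nodetype: callback} registry."""

    def __init__(self, callbacks):
        self.callbacks = dict(callbacks)

    def render(self, node):
        # Walk the class hierarchy from most to least specific type.
        for cls in type(node).__mro__:
            callback = self.callbacks.get(cls)
            if callback is not None:
                return callback(self, node)
        raise LookupError("no callback for %s" % type(node).__name__)

    def render_children(self, node):
        return "".join(self.render(child) for child in node.children)


html_formatter = Formatter({
    Text: lambda fmt, node: escape(node.value),
    Bold: lambda fmt, node: "<strong>%s</strong>" % fmt.render_children(node),
    WikiDomNode: lambda fmt, node: fmt.render_children(node),
})
```

Here `html_formatter.render(Highlight(Text("a < b")))` falls back to the `Bold` callback, since nothing is registered for `Highlight` itself. Registering a callback for `Highlight` later overrides that fallback without touching the other entries.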

As a snapshot of the work in progress, here are the new interfaces:

  • IWikiSyntaxProvider, for providing new syntax.
    The default Trac syntax is for the most part defined by syntax provider components. The interface remains compatible with Trac 0.10 (lesson learned ;-) )
  • IWikiFormatterContributor, for providing new or extending existing formatting flavors.
    The default Trac formatters are entirely provided by formatter contributors.
    (new interface)
from trac.core import Interface


class IWikiSyntaxProvider(Interface):
 
    def get_wiki_syntax():
        """Return an iterable that provides additional wiki syntax.

        A new syntax rule is a `(priority, regexp, callback)` triple.
        The `priority` is used to globally order the rules.

        The `regexp` must be of the form `"(?P<...>...)"`, where
        `<...>` contains a globally unique name that will be used
        to identify the rule.

        The `callback` expects a `parser` argument of type `WikiParser`,
        and will use it to participate in the construction of the
        Wiki DOM tree.
        
        Before 0.11, the additional wiki syntax corresponded
        simply to a `(regexp, callback)` pair. The priority of such
        rules is set to `50`, which introduces them after the simple
        text style rules and before the other ones, as they used to be.

        However, the old `callback` function was of the form
        `handler(formatter, ns, match, fullmatch)`, i.e. it was
        actually expected to take care of the ''formatting'' as well.
        """
 
    def get_link_resolvers():
        """Generate new handlers for new TracLinks prefixes.

        A handler is a `(namespace, callback)` pair, where the
        `callback` is a function expecting a `parser` argument
        (the `WikiParser` instance currently driving the parsing).

        Before 0.11, the generated values were `(namespace, formatter)`
        pairs, where the `formatter` was a function of the form
        `fmt(formatter, ns, target, label)`, and would return some
        HTML fragment.
        The `label` is already HTML escaped, whereas the `target` is not.
        """

    def get_valid_parents():
        """Generate rules for element embedding.

        Each rule is of the form `(nodetype, valid_parent_nodetypes)`,
        which means that an instance node of the given `nodetype` type
        can be parented in instance nodes of the `valid_parent_nodetypes`
        types.

        Example: `(Inline, Anchor)` means that any `Inline` node can
                 be set to be a child of an `Anchor` node.

        (`valid_parent_nodetypes` can be a single class or a tuple
        of classes, all subclasses of `WikiDomNode`)
        """


class IWikiFormatterContributor(Interface):

    def get_wiki_formatters():
        """Generate `(flavor, nodetype, formatter_callback)` triples.

        This enables the wiki system to register the `formatter_callback`
        for handling `nodetype` nodes when rendering a parse tree to 
        the specified `flavor` kind of output.
        If for a given node, there's no callback registered directly for
        its class, a callback registered for one of its ancestor classes
        will be used, following the method resolution order.      

        The `formatter_callback` itself is a function accepting a
        `Formatter` argument and the `WikiDomNode` instance currently
        being rendered and must return a result which is appropriate
        for the kind of formatting being done.
        """

The main utility functions for parsing and formatting wiki text will be:

# -- Utility functions

def parse_wiki(ctx, wikitext):
    """Parse `wikitext` and produce a Wiki DOM tree.

    Return the `WikiDocument` root node of that tree.
    """
    return WikiSystem(ctx.env).parse(wikitext)

def format_to(flavor, ctx, wikidom, **kwargs):
    """Format to `flavor` the given `wikidom` parsed wiki text,
    in the given `ctx`. 

    `wikidom` can be simply text instead of a Wiki DOM tree,
    and it will be parsed on the fly.
    """
    return WikiSystem(ctx.env).format(ctx, wikidom, flavor, **kwargs)

def format_to_html(ctx, wikidom, escape_newlines=False):
    return WikiSystem(ctx.env).format(ctx, wikidom, 'default',
                                      escape_newlines=escape_newlines)
# etc. 

Implementation details (to be updated as the code progresses)

The WikiSystem

It contains the extension point for the IWikiSyntaxProviders. It also serves as a registry for the various formatters, according to their flavor.

The Parser

WikiParser Component
The component is used to incorporate syntax additions and rules provided by the IWikiSyntaxProviders and to maintain the environment global state of the parser.
WikiEngine
An engine is a transient object whose purpose is to maintain some state during the parsing of a given wiki text, and produce a Wiki DOM tree.
WikiDomNode
A node in a Wiki DOM tree. Each node can have child nodes, and other properties depending on its type.

The Formatters

WikiFormatter Component
The component is used to register the specific formatting rules that the IWikiSyntaxProviders may offer.
(format)Formatter
Individual formatters are transient objects used for traversing a Wiki DOM tree and generating some kind of output.

Addressing Current Issues

Here we list some of the most outstanding issues in the Trac ticket database concerning the Wiki component, and we discuss for each of them how they would be handled in the proposed solution.

Parser Issues

#2048: Wiki macros for generating <div>s and <span>s with classes

We need a robust way to "wrap" wiki text fragments. This need can be addressed if recursively parsing wiki text has a low footprint.

On a related topic, #4139 discussed the possibility for a macro to provide additional wiki text, instead of producing formatted output.

While building the DOM tree, when macros are found, they can be asked to participate in the parsing stage as well.

#1200: Provide syntax to link to file-differences from svn log messages

Or more generally, have the possibility to have context-dependent special syntaxes.

The parser should be able to use additional parsers for what it would otherwise consider to be text.

#2064: Timeline WikiFormatting mis-parses WikiLinks with aliases.

Currently, for rendering short excerpts of wiki text, we simply truncate the wiki text. That's of course far better than arbitrarily truncating the rendered HTML output, but it is also very problematic, as it has a high likelihood of breaking the consistency of the original wiki syntax.

Therefore, it's better to parse the full text, and in the formatter, stop producing output as soon as enough text was produced.
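To make the idea concrete, here is a small sketch of a formatter that stops emitting text once a character budget is used up, while still closing every element it opened. The `Text` and `Element` classes and the `render_excerpt` function are hypothetical stand-ins for the real Wiki DOM and formatter, and link attributes are left out for brevity.

```python
from html import escape


class Text:
    def __init__(self, value):
        self.value = value


class Element:
    def __init__(self, tag, *children):
        self.tag = tag
        self.children = children


def render_excerpt(root, limit):
    """Render `root` to HTML, emitting at most `limit` characters of text."""
    remaining = limit

    def render(node):
        nonlocal remaining
        if isinstance(node, Text):
            # Cut the text itself, never the markup around it.
            chunk = node.value[:remaining]
            remaining -= len(chunk)
            return escape(chunk)
        parts = []
        for child in node.children:
            if remaining <= 0:
                break  # budget exhausted: skip the remaining siblings
            parts.append(render(child))
        # The element is always closed, so the output stays well-formed.
        return "<%s>%s</%s>" % (node.tag, "".join(parts), node.tag)

    return render(root)


doc = Element("p", Text("See "),
              Element("a", Text("the wiki start page")),
              Text(" for details"))
print(render_excerpt(doc, 11))  # <p>See <a>the wik</a></p>
```

The link label is shortened, but the link syntax itself cannot be cut in half, which is what happens when the raw wiki text is truncated before parsing.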

#4140: Merge OutlineFormatter into Formatter

Currently, using the [[PageOutline]] macro to produce a table of contents for the current page implies parsing the wiki text twice. Once we have a parse tree, we can reuse it as many times as we want, to produce different kinds of output (see also processing structured content below).

In this situation however, we see that we must take care of passing the WikiDOM tree of the wiki text to the macros when "rendering" them.

#1424: MediaWiki-style table syntax support

Improved table support is difficult to achieve in the current engine, as the table syntax is limited to one row per line of wiki text. The handling of tables carries on that limitation throughout and basically only inline content can be placed in a cell.

A nearly complete redesign of the internals of the current engine would be required just to handle this particular enhancement request. In the new engine, the rules for embedding WikiDOM elements make writing such an enhancement easy, possibly even by way of an (extended) IWikiSyntaxProvider.

#4235: Wiki formatting lost in ">" quoted blocks

It should be possible to "embed" the usual wiki notation within some context. Previously we discussed this for table cells, but rendering wiki content normally even when each line is prefixed with ">" poses a similar problem.

A notable difference is the need for a better abstraction of what should be considered the start of a line. This is an open issue that will hopefully be more easily solvable using the new parser. Other than the lexing issues, the WikiDOM rules for embedding elements in one another greatly facilitate the handling of this situation.

Formatter Issues

#4778: simplified wikification of svn log messages

A TracLinksHtmlFormatter could be written (flavor 'traclinks'), which would handle only the TracLinks and no other formatting. This formatter could also be used for addressing #1520, wikification of comments in source code.

#4270: Ticket CC emails contain wiki formatting

A TextFormatter could be written (flavor 'text'), which would render the wiki text in a form suitable for reading in plain-text e-mails. This involves suppressing markup that gets in the way (like {{{ }}} blocks), providing more explicit markup for headings (underlined titles), providing footnote-style references for links, etc.
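A plain-text rendering along those lines might look like the following sketch. The `Heading`, `Paragraph` and `Link` classes and the `format_text` function are hypothetical and far simpler than a real Wiki DOM; the example only shows underlined headings and footnote-style link references.

```python
class Heading:
    def __init__(self, text):
        self.text = text


class Link:
    def __init__(self, label, target):
        self.label = label
        self.target = target


class Paragraph:
    def __init__(self, *parts):
        self.parts = parts  # each part is a str or a Link


def format_text(blocks):
    """Render blocks as e-mail friendly plain text."""
    lines = []
    footnotes = []
    for block in blocks:
        if isinstance(block, Heading):
            # Underline the title instead of using "= ... =" markup.
            lines += [block.text, "=" * len(block.text), ""]
        else:
            pieces = []
            for part in block.parts:
                if isinstance(part, Link):
                    # Keep the label inline, move the target to a footnote.
                    footnotes.append(part.target)
                    pieces.append("%s [%d]" % (part.label, len(footnotes)))
                else:
                    pieces.append(part)
            lines += ["".join(pieces), ""]
    lines += ["[%d] %s" % (number, target)
              for number, target in enumerate(footnotes, 1)]
    return "\n".join(lines)


mail_body = format_text([
    Heading("Status"),
    Paragraph("Fixed in ",
              Link("changeset 42", "http://example.org/changeset/42"), "."),
])
```

For this input, `mail_body` consists of the underlined title, the sentence `Fixed in changeset 42 [1].` and a final `[1] http://example.org/changeset/42` line.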

#2296: Export wiki pages to latex

This ticket is one example of a different output format which should be entirely doable by creating a plugin which would implement only the formatting side. #3477 is another example for this (DocBook in this case).

Processing Structured Content

#2717: Create window title on wiki pages from = The title =

Once a wiki text is parsed, it should be trivial to fetch the first toplevel heading node from the corresponding Wiki DOM tree.

#3895: Provide Trac API for returning all outbound links in a page

The Wiki text contains structured information, primarily by way of TracLinks. Properly retrieving TracLinks inside any wiki text can be useful in a lot of situations, and due to the profusion of TracLinks forms and their extensible nature, some direct support for reusable parsing is necessary.

A Wiki DOM tree can be quickly scanned in order to retrieve all the links contained in a robust way. Typical use cases are #108, #109, #611 and more generally TracCrossReferences.
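Such a scan can be sketched as a depth-first walk that yields every link node. The `Node`, `Text` and `Link` classes, their `ns` and `target` attributes, and the `iter_links` and `outbound_links` helpers are all assumptions for this example.

```python
class Node:
    def __init__(self, *children):
        self.children = list(children)


class Text(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value


class Link(Node):
    def __init__(self, ns, target, *label):
        super().__init__(*label)  # the label nodes are the link's children
        self.ns = ns
        self.target = target


def iter_links(node):
    """Yield every Link below `node`, in document order."""
    if isinstance(node, Link):
        yield node
    for child in node.children:
        yield from iter_links(child)


def outbound_links(root):
    return [(link.ns, link.target) for link in iter_links(root)]


page = Node(
    Text("See "),
    Link("ticket", "108", Text("#108")),
    Node(Text("and "), Link("wiki", "TracCrossReferences")),
)
print(outbound_links(page))
# [('ticket', '108'), ('wiki', 'TracCrossReferences')]
```

Since the walk works on parsed nodes rather than on regular expressions over raw text, link forms contributed by plugins are picked up as soon as they produce link nodes.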

#1024: Each section should have a edit button

Being able to edit a particular section of the wiki text requires being able to identify where the section starts and ends in the source wiki text. This can be done if the WikiDOM nodes keep track of their position in the original text.
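The sketch below shows how recorded offsets would allow a single section to be replaced. It is not the proposed parser: a simple regular expression stands in for it, and `Section`, `find_sections` and `replace_section` are hypothetical names. A section is assumed to run from its heading up to the next heading of the same or a higher level.

```python
import re

# Matches "= Title =" style headings, one per line.
HEADING_RE = re.compile(r"^(=+)[ \t]+(.*?)[ \t]+=+[ \t]*$", re.MULTILINE)


class Section:
    def __init__(self, level, title, start, end):
        self.level = level
        self.title = title
        self.start = start  # offset of the heading in the source text
        self.end = end      # offset just past the section's last character


def find_sections(wikitext):
    """Return the sections of `wikitext` with their source offsets."""
    matches = list(HEADING_RE.finditer(wikitext))
    sections = []
    for index, match in enumerate(matches):
        level = len(match.group(1))
        end = len(wikitext)
        # The section stops at the next heading of the same or higher level.
        for later in matches[index + 1:]:
            if len(later.group(1)) <= level:
                end = later.start()
                break
        sections.append(Section(level, match.group(2), match.start(), end))
    return sections


def replace_section(wikitext, section, new_text):
    """Splice edited text back in place of `section`."""
    return wikitext[:section.start] + new_text + wikitext[section.end:]


source = "= A =\nintro\n== B ==\nbody b\n= C =\nend\n"
section_b = find_sections(source)[1]
edited = replace_section(source, section_b, "== B ==\nnew\n")
```

Here `source[section_b.start:section_b.end]` is exactly the text an "edit section" form would present, and `edited` differs from `source` only inside that range.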

"Technical" details

#3925: whitespace not preserved in {{{ }}} blocks

Since the switch to Genshi, the whitespace in <pre> elements is collapsed, because the corresponding output is wrapped in a Markup instance.

This can be solved by having the HtmlFormatter output Genshi events or Elements.

#3232: Wiki syntax should be enforced (for ticket comments)

In various situations, the formatted HTML is not valid, as it contains opening tags with no corresponding closing tags.

If the formatter uses a regular way (e.g. recursive descent traversal of the parse tree) to produce structured content, this problem can be avoided completely.

#3794: Invalid table(with indentation) layout in wiki.

An example of the usual kind of "buglets" that affect the Wiki engine: the formatting produced for a given wiki text doesn't look "right". Fixing such issues is currently not very convenient, because everything is done at once. Being able to focus on the logic of either the parser or the formatter would help tremendously with this kind of issue. Other similar tickets: #1936, #3335, #4790.

Comments

Not sure what your specific plans are, but one of the docutils developers at PyCon illustrated fairly convincingly why calling enter/leave methods for each node in the tree, when "parsing", is bad, and why a visitor pattern is good. Something to consider.

(AlecThomas)

For the formatting phase, I use something that could be called a "typed" hierarchical visitor pattern. You start with the root of the parse tree, and the formatter simply renders the root node, by calling the most specific callback registered for the type of that node. That callback in turn can call the render method on children nodes, and the process repeats. So in effect you find there the hierarchical navigation notion (as the visitor has full access to the tree) and the conditional navigation notion (as the visitor decides if it's worth recursing or not) of the c2Wiki:HierarchicalVisitorPattern. But this will become clearer with some code :-)
— cboos


See also: Wiki related tickets, VerticalHorizontalParsing
