Vertical vs. Horizontal Parsing in Trac Wiki
When one looks at a Trac Wiki markup source, the primary structure one can see is along the "vertical" direction.
Of course, one can find in nearly any text a vertical sequence of group of lines ("paragraphs"), vertically isolated lines corresponding to section titles, etc.
But in Trac or other wiki markups, this goes further:
- wiki processors use "blocks" of lines (
{{{
…}}}
), eventually nested - lists, blockquotes, definition lists all rely on a consistent indentation,
- citation quotes ('> …') also stand out first in the vertical direction
Yet the wiki parsing up to Trac 0.11 is heavily line oriented. The detection of code blocks is one exception to this, where matching {{{
/ }}}
pairs of lines are first detected, then their content processed, even recursively if needed.
By analyzing the structure of lines one after the other, one has to maintain a lot of state in order to keep a correct sense of context. Also the code is often fragile as it's not always obvious to decide what HTML elements have to be closed first before proceeding.
The idea with VH parsing is that the existing vertical structure in the markup should be exploited first before tackling the horizontal parsing, which is better suited to intra-paragraph markup.
In 0.12, the parsing of citation quotes ('> …') now uses this approach, and this enabled to address the #4235 issue in a comprehensive way relatively painlessly.
In future versions, the lists and other kind of markup could also benefit from this approach.
Parsing Overview
Here's a very rough outline:
parse_vertical
- prepares WikiDocument (W)
- preprocess (r9868), split text in lines
parse_blocks
- get a tree of the{{{
/}}}
delimited blocks (B) and the spans between them (Raw); at this stage, the root document (W) and each (B) contains a list of (B|Raw) nodes- for each wiki block (i.e. (W) and each (B) containing wiki text)
parse_raw_text
- each top-level (Raw) node will be scanned for structural ("vertical") markup; for each line:- detect verbatim text (
`
…`
and{{{
…}}}
sequences); remember verbatim spans for that line, escape the line (replaced by 'X') - match vertical patterns which can result in:
- (I)tem node (
- * 1.
etc.) - (D)efinition list node (
... :: ...
) - (Q)uote node (leading space)
- (C )itation node (
>+
) - (Row) node (
|| ... ||
) - …
- if nothing matches, this is a plain (T)ext node
- (I)tem node (
- at the end, this collection of (S)tructural nodes replace the (Raw) node
- detect verbatim text (
assemble_nodes
- each node had an indentation level, the first non-space character in its starting line; this information will enable us to re-arrange a list of (B) and (S) nodes according a logical nesting determined by the indentation
parse_horizontal
- each node in the previous tree will be split further, according to inline ("horizontal") markup- some nodes won't have any text content to process
- some will have two (D) or more (Row)
- it can well be that some markup will need to be processed recursively (e.g.
[=#anchor ''this was already explained above'']
). - macros could at this stage expand the tree as it's being built (e.g. via a new
IWikiMacroProvider.parse_macro
method)