I'm writing down these details here because maybe other VC could take advantage of the same considerations, or even the VcRefactoring.

= Darcs specific cache =

For performance reasons, TracDarcs heavily relies on a specialized `CachedRepository` that basically reimplements the Subversion versioned tree for the tracked darcs repository: once a darcs changeset gets processed, the trac environment won't need to access the repository anymore (this is slightly simplified, as the ''content'' of each file at a given revision gets computed and cached only at its first request).

It does that by keeping a few extra tables (as of version 0.6, there are [http://progetti.arstecnica.it/trac%2Bdarcs/browser/tracdarcs/darcs-tables.sql four specific] `darcs_xxx` tables) extending/replacing the standard trac database schema. This is far from perfect, but it does the (''tricky''!) work reasonably well: all the merits goes to K.S.Sreeram who [http://progetti.arstecnica.it/trac%2Bdarcs/changeset/27 reimplemented the core] for version 0.4.

As I'm planning to implement MultipleRepositorySupport for 0.7, I'm trying to figure out a better integration with the standard schema, avoiding table/data duplication.

== Changesets ==

Generally speaking, in darcs the ''order'' of patches is meaningless: from an abstract point of view, a repository may be seen as a ''bag'' of changesets that, under the rules specified by the [http://en.wikibooks.org/wiki/Understanding_darcs/Patch_theory patch theory], may be composed (''commuted'') together to give the result one expects.

In the Trac context however it's very handy to be able to point to a particular change by it's ''application ordering index'' instead of the full hash. To accomplish that, TracDarcs chose to use an [http://progetti.arstecnica.it/trac%2Bdarcs/wiki/DarcsBackend#Revisionnumber artificial monotonically incrementing index] as a shortcut, effectively mimicing the Subversion view of the history.

This index is the `rev` field in the standard schema tables; TracDarcs adds a table `darcs_changesets` to augment the description of a single changeset (that is, to extend the `revision` table), and to have a way to map the `rev` field to the globally unique identifier, the `hash`, and the other way around:

{{{
    rev_table = Table('darcs_changesets', key='rev')[
        Column('rev',type='int'),
        Column('hash'),
        Column('name'),
        Index(['hash'])]
}}}

Another way to indentify a single (or rather a set of) changeset(s) in darcs is by its (their) `name` (darcs makes explicit the Subversion suggestion of using the first line of the changelog as a summary, the rest as a longer description).

=== M/R considerations ===

Current MultipleRepositorySupport/Cache introduces the notion of a `repos`, and an equivalent field shall be added to this table, changing the primary key to `repos,rev`...

I hope to convince that it would be nicer using a ''surrogate primary key'' for `repository`, to allow the administration layer to rename repositories without having to update handfuls of tables.

IMHO, even the standard schema's `revision` and `node_change` should

 a. use that surrogate key to point to the repository
 a. introduce their own surrogate index: this would make much easier for TracDarcs (and I'd imagine for any other VC that needs to extend the schema in some way) to correlate the tables (in other words, if done in this way from the beginning, TracDarcs wouldn't even need to know that Trac 0.12 will come with MultipleRepositorySupport :-)

OTOH, if the consensum will be going with the IRepositoryChangeListener way proposed by cboos, I'll probably add the mentioned surrogate key to `darcs_changesets` anyway, to simplify TracDarcs's queries.
 - well, OK for the arguments about repository renaming, but I don't see how surrogate keys would help for repository additions/removals, so I think this IRepositoryChangeListener is needed anyway -- cboos
 - yes: I don't foresee the need for additions though... the RepositoryManager [source:/sandbox/multirepos/trac/versioncontrol/admin.py#L125 will reload] repositories and will eventually call `sync()` on us, won't it? -- lelit

== Nodes, a.k.a. the versioned tree ==

As said, TracDarcs wants to cache also the actual ''tree'' of the repository, at any given changeset, because it is (still) very expensive if not impossible doing that with darcs itself.

K.S.Sreeram (kudos!) replaced my original broken solution with one that actually works great (no new bug has been reported on the algorithm lately :)

The pitfall was handling ''complex'' changesets (I mean, something SVN does not handle/allow) as well as ''corner cases'' (some would say ''b0rken'') changesets that darcs builds (maybe recorded with a buggy 0.9.x version of darcs many years ago, fixed and handled in succeding versions). If there is a lesson I learned writing [http://progetti.arstecnica.it/tailor Tailor], it's surely that each VC has its own strange interpretation of what a changeset means :-)

Sreeram's solution uses two tables (actually three, one for caching the content of files, to make periodic purge easier), `darcs_nodes` and `darcs_nodes_changes`. The first introduces an artificial entity that represents a particular ''path'', so that the machinery can operate unambiguosly when the exact same path actually means different items (imagine something sillyties like ``ADD this/path; MOVE this/path other/path; MKDIR this/path; ADD this/path/file; MOVE other/path this/path/newpath``: in this example, each "this/path" would be a different ''node''); each ''node'' is of a particular type (that is, it's either a file '''or''' a directory), and "knows" which revision added it as well the one that removed it, if any.

{{{
    node_table = Table('darcs_nodes', key='node_id')[
        Column('node_id',type='int'),
        Column('node_type',size=1),
        Column('add_rev',type='int'),
        Column('remove_rev',type='int')]
}}}

=== Nodes changes ===

Surprisingly, a generic ''node'' does not correspond to particular "path": such association happens couplings nodes to revisions, that is, quoting Sreeram, ''"a node doesn't have a particular name or content but, for a given revision, its name and content will be well defined"''.

This is done in `darcs_node_changes`, that mimics the standard `node_change`, plus it keeps the hierarchy of the entries (that is, at any particular revision, a particular node has a pathname, the parent's node id, and the change it happens to be subject of).

{{{
    change_table = Table('darcs_node_changes', key=('node_id','rev'))[
        Column('node_id',type='int'),
        Column('rev',type='int'),
        Column('path'),
        Column('parent_id',type='int'),
        Column('the_change')]
}}}

=== Nodes content ===

Even the content of any requested path at a particular revision is cached by TracDarcs (again, for speed issues). Earlier versions used to keep such cache on the filesystem, while it's now kept within the database itself, in the `darcs_cache` table:

{{{
    cache_table = Table('darcs_cache', key=('node_id','rev'))[
        Column('node_id',type='int'),
        Column('rev',type='int'),
        Column('content',type=blobtype)]
}}}
''Tip: adding a `size` column would do wonders for enhancing browsing speed (i.e. `Darcs Node.get_content_length` would use that info if available, revert to the slower `len(self.get_content().read())` otherwise).''

This could go within the `darcs_node_changes` table (see, same primary key), but for practical reason it's been kept on its own.