Edgewall Software

I'm writing down these details here because maybe other VC could take advantage of the same considerations, or even the VcRefactoring.

Darcs specific cache

For performance reasons, TracDarcs heavily relies on a specialized CachedRepository that basically reimplements the Subversion versioned tree for the tracked darcs repository: once a darcs changeset gets processed, the trac environment won't need to access the repository anymore (this is slightly simplified, as the content of each file at a given revision gets computed and cached only at its first request).

It does that by keeping a few extra tables (as of version 0.6, there are four specific darcs_xxx tables) extending/replacing the standard trac database schema. This is far from perfect, but it does the (tricky!) work reasonably well: all the merits goes to K.S.Sreeram who reimplemented the core for version 0.4.

As I'm planning to implement MultipleRepositorySupport for 0.7, I'm trying to figure out a better integration with the standard schema, avoiding table/data duplication.

Changesets

Generally speaking, in darcs the order of patches is meaningless: from an abstract point of view, a repository may be seen as a bag of changesets that, under the rules specified by the patch theory, may be composed (commuted) together to give the result one expects.

In the Trac context however it's very handy to be able to point to a particular change by it's application ordering index instead of the full hash. To accomplish that, TracDarcs chose to use an artificial monotonically incrementing index as a shortcut, effectively mimicing the Subversion view of the history.

This index is the rev field in the standard schema tables; TracDarcs adds a table darcs_changesets to augment the description of a single changeset (that is, to extend the revision table), and to have a way to map the rev field to the globally unique identifier, the hash, and the other way around:

    rev_table = Table('darcs_changesets', key='rev')[
        Column('rev',type='int'),
        Column('hash'),
        Column('name'),
        Index(['hash'])]

Another way to indentify a single (or rather a set of) changeset(s) in darcs is by its (their) name (darcs makes explicit the Subversion suggestion of using the first line of the changelog as a summary, the rest as a longer description).

M/R considerations

Current MultipleRepositorySupport/Cache introduces the notion of a repos, and an equivalent field shall be added to this table, changing the primary key to repos,rev

I hope to convince that it would be nicer using a surrogate primary key for repository, to allow the administration layer to rename repositories without having to update handfuls of tables.

IMHO, even the standard schema's revision and node_change should

  1. use that surrogate key to point to the repository
  2. introduce their own surrogate index: this would make much easier for TracDarcs (and I'd imagine for any other VC that needs to extend the schema in some way) to correlate the tables (in other words, if done in this way from the beginning, TracDarcs wouldn't even need to know that Trac 0.12 will come with MultipleRepositorySupport :-)

OTOH, if the consensum will be going with the IRepositoryChangeListener way proposed by cboos, I'll probably add the mentioned surrogate key to darcs_changesets anyway, to simplify TracDarcs's queries.

  • well, OK for the arguments about repository renaming, but I don't see how surrogate keys would help for repository additions/removals, so I think this IRepositoryChangeListener is needed anyway — cboos
  • yes: I don't foresee the need for additions though… the RepositoryManager will reload repositories and will eventually call sync() on us, won't it? — lelit

Nodes, a.k.a. the versioned tree

As said, TracDarcs wants to cache also the actual tree of the repository, at any given changeset, because it is (still) very expensive if not impossible doing that with darcs itself.

K.S.Sreeram (kudos!) replaced my original broken solution with one that actually works great (no new bug has been reported on the algorithm lately :)

The pitfall was handling complex changesets (I mean, something SVN does not handle/allow) as well as corner cases (some would say b0rken) changesets that darcs builds (maybe recorded with a buggy 0.9.x version of darcs many years ago, fixed and handled in succeding versions). If there is a lesson I learned writing Tailor, it's surely that each VC has its own strange interpretation of what a changeset means :-)

Sreeram's solution uses two tables (actually three, one for caching the content of files, to make periodic purge easier), darcs_nodes and darcs_nodes_changes. The first introduces an artificial entity that represents a particular path, so that the machinery can operate unambiguosly when the exact same path actually means different items (imagine something sillyties like ADD this/path; MOVE this/path other/path; MKDIR this/path; ADD this/path/file; MOVE other/path this/path/newpath: in this example, each "this/path" would be a different node); each node is of a particular type (that is, it's either a file or a directory), and "knows" which revision added it as well the one that removed it, if any.

    node_table = Table('darcs_nodes', key='node_id')[
        Column('node_id',type='int'),
        Column('node_type',size=1),
        Column('add_rev',type='int'),
        Column('remove_rev',type='int')]

Nodes changes

Surprisingly, a generic node does not correspond to particular "path": such association happens couplings nodes to revisions, that is, quoting Sreeram, "a node doesn't have a particular name or content but, for a given revision, its name and content will be well defined".

This is done in darcs_node_changes, that mimics the standard node_change, plus it keeps the hierarchy of the entries (that is, at any particular revision, a particular node has a pathname, the parent's node id, and the change it happens to be subject of).

    change_table = Table('darcs_node_changes', key=('node_id','rev'))[
        Column('node_id',type='int'),
        Column('rev',type='int'),
        Column('path'),
        Column('parent_id',type='int'),
        Column('the_change')]

Nodes content

Even the content of any requested path at a particular revision is cached by TracDarcs (again, for speed issues). Earlier versions used to keep such cache on the filesystem, while it's now kept within the database itself, in the darcs_cache table:

    cache_table = Table('darcs_cache', key=('node_id','rev'))[
        Column('node_id',type='int'),
        Column('rev',type='int'),
        Column('content',type=blobtype)]

Tip: adding a size column would do wonders for enhancing browsing speed (i.e. Darcs Node.get_content_length would use that info if available, revert to the slower len(self.get_content().read()) otherwise).

This could go within the darcs_node_changes table (see, same primary key), but for practical reason it's been kept on its own.

Last modified 9 years ago Last modified on Apr 22, 2009, 1:43:04 PM
Note: See TracWiki for help on using the wiki.