Edgewall Software

MultipleRepositorySupport for Cached Repositories

This is a design document for sharing thoughts between the people interested in implementing support for cached repositories in the source:sandbox/multirepos branch.

This is kind of a showstopper for now, as it's rather impractical to browse big Subversion repositories in Trac using a direct-svnfs connection only.

Issues

When to sync?

I think we definitely can't afford to do a synchronization check at the beginning of each request anymore, now that there are many repositories to check. That would impose a big slowdown on every request, most of the time for nothing.

Doing this in the single repository case was already a sub-optimal solution, only prompted by practical considerations. Some historical background: a seemingly innocent change in 0.10-stable (r3972) made Trac likely to trigger a sync at any time during request processing, even during ongoing SQL queries. Eventually, after the rushed-out 0.10.1 and 0.10.2 releases, which didn't completely fix the issue, the 0.10.3 release finally settled it by doing this sync very early, during a request preprocessing step. In 0.10.4, the sync was also triggered at the appropriate time in the trac-post-commit-hook.

The general idea though is fine: when starting to serve a request, Trac must operate on some fixed state of the cache. That can also be achieved by having the cache synced externally. Having a single trigger for synchronization would also go a long way toward solving the strange concurrent synchronization problems we have had in the past (#4043, #4586). Note in passing that the quite complex sync code in cache.py is mainly due to #4043.
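The "one fixed state per request" idea can be sketched as follows. This is a minimal illustration, not Trac code: `RepositoryCache` and `pre_process_request` are hypothetical stand-ins for the real cache and the request pre-processing step.

```python
# Hypothetical sketch: sync the cache once, as a request pre-processing
# step, so the rest of the request operates on a fixed cache state.
# RepositoryCache and pre_process_request are illustrative stand-ins,
# not Trac APIs.

class RepositoryCache:
    def __init__(self):
        self.youngest_rev = 0

    def sync(self, repo_youngest):
        # Catch up with the real repository in one place only.
        if repo_youngest > self.youngest_rev:
            self.youngest_rev = repo_youngest

def pre_process_request(cache, repo_youngest):
    # One sync before any SQL query is issued; the handlers that run
    # afterwards never trigger a sync themselves, so they all see the
    # same fixed state.
    cache.sync(repo_youngest)
    return cache.youngest_rev
```

With an external trigger (e.g. a post-commit hook), the `sync()` call would simply move out of the request path entirely, which is the direction proposed below.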

Scoped Repositories

It would be nice not to replicate common data when multiple scoped repositories are rooted in the same real SVN repository. I see essentially two approaches:

  1. cache the whole repository
    • (+) it's simpler to sync, the scoping part has to be done when retrieving the data, filtering out what's out of scope
    • (-) Trac might be set up on a small subpart of a huge repository; caching the whole thing could introduce performance issues (though we can expect good performance from the db, so it's probably not a strong objection)
  2. cache the relevant parts of the repository
    • (+) no extra cost, only what needs to be known by Trac is cached
    • (-) more complex, we first have to gather all the scopes for a repository, then filter out what's not needed
    • (-) whenever a scope is modified, removed or a new one added, a whole resync is needed
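Approach 1 can be sketched like this: cache everything, and apply the scope only when retrieving data. The function names and the `(path, change)` tuple shape are hypothetical, chosen only for illustration.

```python
# Illustrative sketch of approach 1: the whole repository is cached,
# and scoping is applied only on retrieval.  These are hypothetical
# helpers, not the actual Trac cache API.

def in_scope(path, scope):
    """True if a cached path falls under the repository scope."""
    scope = scope.strip('/')
    if not scope:
        return True          # unscoped repository: everything is visible
    path = path.strip('/')
    return path == scope or path.startswith(scope + '/')

def filter_node_changes(changes, scope):
    # Filter out cached node changes that are out of scope; the cache
    # itself stays complete and is shared by all scoped repositories.
    return [(path, change) for path, change in changes
            if in_scope(path, scope)]
```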

An alternative take is that data duplication is not so dramatic: multiple scopes probably don't overlap, so there won't be much duplication, and it could be simpler to just duplicate the data.

  • (-) possible data duplication
  • (-) need to trigger several syncs for a given real repository
  • (+) probably easier to deal with

There's yet another alternative, which consists of dropping the notion of scoped repositories altogether and instead limiting what can be seen using the svn authz policy (which is still to be ported to the new TracFineGrainedPermissions framework, btw).

  • (-) backward incompatibility for links (the scope will now be apparent)
  • (+) much more flexible: no need to create artificial scoped repositories, so we always deal with exactly one Trac repository for each Subversion repository (in that respect, it's close to solution 1. above)
  • (?) what would be the impact on efficiency?

What information should we cache?

The current cache deals mainly with the changeset data. It's not optimized or even usable as such for browsing the repository. See VcRefactoring#NewRepositoryCache for an example of what such an enhanced cache would look like. There are other needs as well, like the need to better support changeset DAGs for distributed VCS (Mercurial, Git, etc.).

But those enhancements, while important to consider, are nevertheless orthogonal to the topic at hand, which is extending the cache so that it supports more than one repository.

We should only keep in mind that in a DVCS, changesets have a globally unique identifier, and that even if such changesets are located in different repositories, they actually contain the same data. Cheap clones should translate to cheap repository caches as well. Therefore we probably need to dissociate the changeset data from its binding to a repository and from the sequence number under which it appears in a repository (that information is probably to be left out of the cache anyway, as Git for example doesn't have the notion of a sequence number).
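The dissociation suggested above could look something like this. All names here are hypothetical; the point is only that the changeset data, keyed by its globally unique id, is stored once, while the (repository, id) binding is kept separately.

```python
# Sketch of dissociating changeset data from its repository binding.
# ChangesetData and ChangesetCache are hypothetical names.
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangesetData:
    # Content identified by a globally unique id (e.g. a SHA1 in
    # Mercurial or Git); identical across cheap clones.
    id: str
    author: str
    message: str
    time: int

class ChangesetCache:
    def __init__(self):
        self._data = {}         # changeset id -> ChangesetData (stored once)
        self._bindings = set()  # (repository, changeset id) pairs

    def add(self, repos, cset):
        self._bindings.add((repos, cset.id))
        self._data.setdefault(cset.id, cset)   # shared between clones

    def get(self, repos, id):
        if (repos, id) in self._bindings:
            return self._data[id]
        return None
```

A cheap clone then only adds bindings, not data, which is the "cheap clones should translate to cheap repository caches" property.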

Proposals

Version

Would it be possible to simply check the version against the master during certain actions and/or times to see if the cached version is the same and if not sync?

  • Well, that sync() check used to be what Trac did in a request pre-processor. As discussed above, this doesn't scale.

Syncing

Syncing must be done exclusively by post-commit hooks. We have to clean up the post-commit hook so that it becomes a bit more modular, possibly by introducing a ChangesetChangeListener. See the related discussion googlegroups:trac-dev:76a049777edf0c10, TH:SvnChangeListenerPlugin and, most importantly, #7723 and related changesets.

  • also, we currently still do the sync() check in a request pre-processor, but only for the default repository, if one is defined. Therefore we're mostly backward compatible, in that a single-repository Trac can continue to operate the same way in 0.12 as it did in previous versions.
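A post-commit hook along these lines could notify Trac of a new changeset via trac-admin's `changeset added` command (the environment path and repository name below are placeholders):

```python
# Hedged sketch of what a post-commit hook could run to notify Trac of
# a new changeset.  Paths and names are placeholders; in a real hook,
# REPOS and REV come from the arguments Subversion passes to the hook.
import subprocess

def changeset_added_command(env_path, repos_name, rev):
    # Build the trac-admin invocation; running this from the hook
    # replaces the per-request sync() check for cached repositories.
    return ['trac-admin', env_path, 'changeset', 'added',
            repos_name, str(rev)]

def notify(env_path, repos_name, rev):
    subprocess.check_call(changeset_added_command(env_path, repos_name, rev))
```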

Scoped Repositories

It's probably best to simply tolerate the duplication, for a start.

  • Note that this is what #7723 did. I think it's near optimal, as we can't actually have that much overlap between scopes (I was probably thinking of multiple scopes in the same repository when writing #ScopedRepositories above, something we can actually achieve simply by having multiple scoped repositories on the same repository)

Repository ids for Subversion need to integrate the uuid and the scope (but I think that's already the case).

  • #7723 introduced the notion of a base repository identifier. This makes it possible to notify all the repositories sharing the same base at once. OTOH, with [7964] I took the option of triggering the notification by simply giving the repository directory, which is available from the hook. This needs to be expanded/fixed to handle notification of all repositories sharing the same base.
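The expansion described above could work roughly like this: given the repository directory available in the hook, find every configured (possibly scoped) repository sharing that base and notify each. The registry structure is hypothetical.

```python
# Sketch of expanding the directory-based notification from [7964]:
# notify every configured repository that shares the same base
# directory.  The registry structure is a hypothetical stand-in for
# Trac's repository configuration.

def repositories_for_base(registry, repos_dir):
    # registry maps repository name -> (base directory, scope)
    return [name for name, (base, scope) in registry.items()
            if base == repos_dir]

def notify_all(registry, repos_dir, rev, notify):
    # One hook invocation fans out to all scoped repositories that are
    # rooted in the same real repository.
    for name in repositories_for_base(registry, repos_dir):
        notify(name, rev)
```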

Model

A repository table, with a unique key for each new repository (id), which binds repository uids (uuid, which can't itself be used as a short key as it's potentially long if it contains the scope) and carries some associated metadata (e.g. youngest_rev). We could eventually use a generic table in the spirit of the existing system(name,value) table, which is for now used to keep this metadata.

Then, we need to modify the revision table so that it can contain different changesets with the same id. Two situations:

  1. same id, same changeset data: this would be the case for Mercurial changesets, but also for Subversion changesets coming from different scoped repositories based on the same underlying real repository.
  2. same id, different changesets: this would be the common case of several disjoint Subversion repositories

If we simply integrate the repository "key" into the changeset id, we can handle 2) above but not 1). If we don't, we can handle 1) but not 2). One solution would be to integrate the repository key in the changeset id only when needed, i.e. for 2).
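That "only when needed" compromise can be stated in a few lines. The `globally_unique` flag is a hypothetical per-backend property (true for SHA1-based DVCS ids, false for Subversion sequence numbers):

```python
# Sketch of integrating the repository key into the changeset key only
# when needed: globally unique ids (case 1) are keyed by id alone and
# thus shared, while per-repository sequence numbers (case 2) get a
# composite key.  `globally_unique` is a hypothetical backend property.

def cache_key(repos, rev, globally_unique):
    if globally_unique:
        return (None, rev)     # e.g. a Mercurial/Git SHA1: shared entry
    return (repos, rev)        # e.g. a Subversion revision number
```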

The node_change table shares the same concerns as above.

  • the current solution in #7723 favors situation 2) by requiring repos and id to be used as a composite key.
  • This might also work for 1) if we simply use NULL for repos, as thanks to the use of SHA1 keys, both Mercurial and Git have changeset ids that are globally unique

Current Design

(taken from source:sandbox/multirepos/trac/db_default.py@8081)

    # Version control cache
    Table('repository', key=('id', 'name'))[
        Column('id'),
        Column('name'),
        Column('value')],
    Table('revision', key=('repos', 'rev'))[
        Column('repos'),
        Column('rev'),
        Column('time', type='int'),
        Column('author'),
        Column('message'),
        Index(['repos', 'time'])],
    Table('node_change', key=('repos', 'rev', 'path', 'change_type'))[
        Column('repos'),
        Column('rev'),
        Column('path'),
        Column('node_type', size=1),
        Column('change_type', size=1),
        Column('base_path'),
        Column('base_rev'),
        Index(['repos', 'rev'])],
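The composite (repos, rev) keys above can be exercised with an in-memory SQLite database. The DDL below is a hand-translation of the Table() declarations for illustration, not Trac's own schema-generation code:

```python
# Exercising the composite (repos, rev) key of the revision table with
# SQLite; the DDL is a hand-translation of the declarations above.
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
CREATE TABLE repository (
    id TEXT, name TEXT, value TEXT,
    PRIMARY KEY (id, name));
CREATE TABLE revision (
    repos TEXT, rev TEXT, time INTEGER, author TEXT, message TEXT,
    PRIMARY KEY (repos, rev));
CREATE INDEX revision_repos_time_idx ON revision (repos, time);
""")
# Two disjoint repositories can cache a changeset with the same rev,
# which is situation 2) above:
con.execute("INSERT INTO revision VALUES ('projA', '42', 1, 'joe', 'fix')")
con.execute("INSERT INTO revision VALUES ('projB', '42', 2, 'ann', 'add')")
rows = con.execute(
    "SELECT repos, author FROM revision WHERE rev = '42' ORDER BY repos"
).fetchall()
```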

Discussion

From IRC, lelit (author of the TracDarcs plugin) suggested that using a surrogate integer primary key for the repository would be more adequate for his use case: he re-uses CachedRepository and extends it with extra information stored in additional tables (e.g. for binding an artificial integer revision id to the corresponding Darcs hash and Darcs name). When renaming a repository, he would have to update those tables as well.

There's currently no extension point for reacting to such changes, but the idea was that if we used a unique integer id for the repository, there would be no need to update anything after a simple renaming (i.e. the repository name would be just another repository property). Regardless of TracDarcs' needs, though, the current implementation does not scale very well.

See also the considerations on the page TracDarcs/Cache. — lelit.

Still, there are other actions (adding/removing) that would in any case need to be notified to extensions, so renames might as well be too. I think we therefore need an IRepositoryChangeListener to fully support the above use case. — cboos.

Last modified on Apr 21, 2009, 11:42:28 PM