Edgewall Software
Modify

Opened 16 years ago

Closed 15 years ago

Last modified 14 years ago

#7723 closed enhancement (fixed)

Make repository cache work on multirepos branch

Reported by: Remy Blank Owned by: Remy Blank
Priority: normal Milestone: 0.12-multirepos
Component: version control Version:
Severity: major Keywords: multirepos
Cc: Christian Boos, gwk.rko@…, mike@… Branch:
Release Notes:
API Changes:
Internal Changes:

Description

This ticket tracks the progress in making the repository cache work in the multirepos branch (#2086), as discussed in MultipleRepositorySupport/Cache.

Attachments (20)

7723-multirepos-cache-r7590.patch (15.9 KB ) - added by Remy Blank 16 years ago.
First try at a simple version of the cache
7723-multirepos-cache-2-r7590.patch (31.1 KB ) - added by Remy Blank 16 years ago.
Patch update
7723-multirepos-cache-3-r7684.patch (34.5 KB ) - added by Remy Blank 16 years ago.
Updated patch with better resync command
7723-multirepos-cache-4-r7684.patch (36.2 KB ) - added by Remy Blank 16 years ago.
Update improving handling of repository aliases
7723-multirepos-cache-4-r7815.patch (35.6 KB ) - added by Remy Blank 16 years ago.
Updated patch to current head of multirepos branch.
7723-multirepos-cache-4-r7829.patch (35.6 KB ) - added by Remy Blank 16 years ago.
Updated patch to current head of multirepos branch.
7723-multirepos-cache-4-r7888.patch (35.6 KB ) - added by Remy Blank 16 years ago.
Updated patch to current head of multirepos branch.
7723-1-multirepos-cache-r7928.patch (35.0 KB ) - added by Christian Boos 16 years ago.
7723-3-get_youngest_rev-r7928.patch (847 bytes ) - added by Christian Boos 16 years ago.
7723-2-unicode-fix-r7928.patch (1.3 KB ) - added by Christian Boos 16 years ago.
7723-multirepos-cache-5-r7928.patch (33.9 KB ) - added by Remy Blank 16 years ago.
Changed trac-admin changeset notification command into repository notify.
7723-multirepos-cache-6-r7928.patch (36.9 KB ) - added by Remy Blank 16 years ago.
Invalidate youngest_rev cache when updating revision cache.
Cache-metadata-1.patch (1.6 KB ) - added by Christian Boos 16 years ago.
retrieve metadata in a helper function and use it also in get_youngest_rev (applies on top of attachment:7723-multirepos-cache-6-r7928.patch)
7723-admin-cleanup-r7985.patch (2.6 KB ) - added by Remy Blank 16 years ago.
Small cleanup and removed unwanted argument to repository resync.
commit_ticket_update.py (6.3 KB ) - added by Remy Blank 16 years ago.
The functionality of trac-post-commit-hook as a plugin using IRepositoryChangeListener.
commit_ticket_update.2.py (7.7 KB ) - added by Remy Blank 16 years ago.
Same script as above, but with original copyright notice.
7723-changeset-admin-r8028.patch (4.1 KB ) - added by Remy Blank 16 years ago.
Changed trac-admin repository notify <event> into trac-admin changeset (added|modified).
7723-changeset-admin-r8028.2.patch (5.1 KB ) - added by Remy Blank 16 years ago.
Ah, forgot the unit tests.
7723-metadata-cache-r8097.patch (16.2 KB ) - added by Remy Blank 16 years ago.
Updated caching of repository metadata to new caching infrastructure.
7723-youngest-rev-r8299.patch (3.6 KB ) - added by Remy Blank 15 years ago.
Make consistent use of youngest_rev in CachedRepository.

Download all attachments as: .zip

Change History (77)

by Remy Blank, 16 years ago

First try at a simple version of the cache

comment:1 by Remy Blank, 16 years ago

The patch above implements the simplest version of the multi-repository cache:

  • Added a repository table with a generic structure similar to system for holding repository_dir and youngest_rev for every repository.
  • Added an id column to revision and node_change to contain the repository key.
  • Updated CachedRepository and trac-admin resync accordingly.
  • Added a DB upgrade script.

This is probably the simplest solution to get the cache working again, with the minor drawback that data duplication is possible with scoped repositories.

The cache is currently still synced on every request. But I'd like to get the cache structure right before I tackle the next problem.

comment:2 by Christian Boos, 16 years ago

I just read briefly the patch, it looks good, I'll test it at some later point.

There's a possible problem with your last, unrelated, change in that patch:

- ngettext('Changeset', 'Changesets', single and 1 or 2)
+ ngettext('Changeset ', 'Changesets ', single and 1 or 2)

I'm not entirely sure, but I think that at some point during the extraction the trailing spaces are stripped, that's why I did it the other way round. Would be worth checking.

in reply to:  2 ; comment:3 by Remy Blank, 16 years ago

Status: newassigned

Replying to cboos:

I'm not entirely sure, but I think that at some point during the extraction the trailing spaces are stripped, that's why I did it the other way round. Would be worth checking.

At least 0.11-stable and trunk have the space in the 'Changeset ' string. If you put the space into the <em>, the dotted underline below "pseudo-links" in the timeline continues below the space, which looks strange. That's why I made the change. Looking at the generated HTML, the space still seems to be there.

in reply to:  3 comment:4 by Christian Boos, 16 years ago

Replying to rblank:

If you put the space into the <em>, the dotted underline below "pseudo-links" in the timeline continues below the space, which looks strange.

Ok

… Looking at the generated HTML, the space still seems to be there.

Well, I was referring to their extraction using Babel (python setup.py extract_messages), but now I verified and at least with 0.9.4, the leading and trailing spaces are correctly preserved (I was probably wrong about that stripping). So I'm fine with this change as well.

comment:5 by gwk.rko@…, 16 years ago

Cc: gwk.rko@… added

comment:6 by anonymous, 16 years ago

wszytsko dziła poprawnie

by Remy Blank, 16 years ago

Patch update

comment:7 by Remy Blank, 16 years ago

Here's an updated patch with the following changes w.r.t. the previous one:

  • Added autocompletion for trac-admin resync.
  • Use (default) as an ID for the default repository.
  • Added IRepositoryChangeListener and the corresponding change notification chain. Notification of new changesets are triggered by trac-admin repository changeset.
  • Fixed failing test cases. This involved changing the initial creation of the youngest_rev key in the repository table.

I am not sure about a few points:

  • Previously, the youngest_rev value in the system table was created during DB creation. This cannot be done anymore for the repository table, so I moved it into CachedRepository.sync(), in the same transaction where repository_dir is created. But I am not sure if this could introduce race conditions.
  • Changeset notification tries to be smart about scoped repositories, so that only a single call to trac-admin repository changeset is needed. In particular, when notifying new changesets for a scoped repository, it will also notify all repositories having the same "base" repository. For this, I had to introduce Repository.get_base(). I am not sure that this is the right way to go about this.
  • It is also possible to specify the base repository name (e.g. svn:{uuid}:/path/to/repos) as the repository in trac-admin repository changeset, but this exposes the (internal) repository name.
  • The call to RepositoryManager.notify_changesets_added() from console.py goes through the environment, in the same way as get_repository(). Maybe I should use RepositoryManager(env) directly.

comment:8 by Remy Blank, 16 years ago

Milestone: 0.12

comment:9 by Christian Boos, 16 years ago

As a reminder: we should do nothing when calling resync and there's no repository available (#7524).

On a related note, what do you think of having:

trac-admin <env> repository resync <repos> [rev]

Resync the specified repository.
If <repos> is not specified, do nothing.
If <repos> is '*', resync all repositories.

I think that as resyncing all repositories could be potentially very expensive, so better do that when specifically told to do so. Also, moving the resync as a sub-command of repository seems nicer.

in reply to:  9 comment:10 by Remy Blank, 16 years ago

Replying to cboos:

As a reminder: we should do nothing when calling resync and there's no repository available (#7524).

I'll look into that.

I think that as resyncing all repositories could be potentially very expensive, so better do that when specifically told to do so. Also, moving the resync as a sub-command of repository seems nicer.

Yes, that's a good idea.

On a related note as well, I was thinking of starting to work on your suggestion on trac-dev for the rest of the repository command set:

  • repository add <type> <path>: add a repository
  • repository delete <repos>: remove a repository
  • repository rename <old> <new>: rename a repository
  • repository alias <repos> <alias>: create an alias for a repository

and managing repositories in the repository table. I would also like to create an admin module allowing to perform the same operations. Should I merge this into the patch tracked here, or do you prefer a separate patch?

comment:11 by Christian Boos, 16 years ago

I think it will be better addressed separately.

(Ideally we'd need to have an IAdminConsoleCommandProvider extension, so that we could move all this logic into a trac/versioncontrol/admin.py module, where the command line and the web admin modules could reuse the same functionality).

A few notes on the changeset notification in the patch (in trac/versioncontrol/api.py):

        289             def notify_repository(repos): 
 	290	            repos.sync() 
 	291	            for rev in revs: 

Here we should probably take care of not notifying twice about a given change. Even if this should normally not happen if trac-admin repository changeset is called from a hook, I think it would be more robust to have repos.sync() return the list of changesets that were effectively synced and only notify for those changesets, as that method takes great care of race conditions (even if I'm not 100% confident that there's no race condition left…).

in reply to:  11 ; comment:12 by Remy Blank, 16 years ago

Replying to cboos:

I think it will be better addressed separately.

Ok, I'll open a new ticket about that. I'll keep a new patch there that should be applied on top of this one. Thank God for Mercurial and its easy branching / merging :-)

(Ideally we'd need to have an IAdminConsoleCommandProvider extension, so that we could move all this logic into a trac/versioncontrol/admin.py module, where the command line and the web admin modules could reuse the same functionality).

Hey, that's an excellent idea! I'll also open a new ticket to keep track of that.

Here we should probably take care of not notifying twice about a given change. Even if this should normally not happen if trac-admin repository changeset is called from a hook (…)

I'm not sure about that. New changeset notification and caching are actually completely orthogonal functionalities, and you certainly also want notification for non-cached repositories. So you probably want to keep the "last notified" information separate from the cache, probably in the repository table as well.

But I'm not sure it's worth the trouble avoiding duplicate notification. As you say, calling the notification from the post-commit hook should already ensure no duplication occurs.

by Remy Blank, 16 years ago

Updated patch with better resync command

comment:13 by Remy Blank, 16 years ago

The patch above is updated to the current state of the multirepos branch.

  • The resync command has been changed to work as suggested in comment:9.
  • About #7524, the recent refactoring of trac-admin changed the way commands are provided. In this case, disabling the default repository prevents the VersionControlAdmin component from loading, and therefore the repository commands are not provided.

This completes the basic version described in MultipleRepositorySupport/Cache. I'm waiting for feedback and instructions about the next steps on this issue. In the meantime, I'll start on the repository management functionality in #7822.

by Remy Blank, 16 years ago

Update improving handling of repository aliases

comment:14 by Remy Blank, 16 years ago

The patch above improves handling of repository aliases, which ended up being cached separately before. It slightly modifies RepositoryManager.get_repository() so that the same object is returned for a real repository and an alias. Therefore, repos.reponame is always the name of the real repository.

Christian, do you think you could find some time to review this patch and give me some feedback about what must still be changed (or apply it if it is ok)?

in reply to:  12 comment:15 by Christian Boos, 16 years ago

Replying to rblank:

Replying to cboos:

Here we should probably take care of not notifying twice about a given change. Even if this should normally not happen if trac-admin repository changeset is called from a hook (…)

I'm not sure about that. New changeset notification and caching are actually completely orthogonal functionalities, and you certainly also want notification for non-cached repositories. So you probably want to keep the "last notified" information separate from the cache, probably in the repository table as well.

Another reason why this still bothers me a bit is that I'd like this mechanism to be generic enough so that it can be extended later to other cached vcs. Take the case of Mercurial, the hook can inform you of which changesets have been added, which would be something non completely trivial to recompute on the Trac end. This is why I have the feeling that we should carry the changeset(s) information from the notify to the resync. I haven't started coding anything yet, so it's not anything more than a feeling, I might well be wrong ;-)

by Remy Blank, 16 years ago

Updated patch to current head of multirepos branch.

by Remy Blank, 16 years ago

Updated patch to current head of multirepos branch.

by Remy Blank, 16 years ago

Updated patch to current head of multirepos branch.

comment:16 by Christian Boos, 16 years ago

I've unfortunately run short of time … so here's my partial update to keep you busy for this WE ;-)

by Christian Boos, 16 years ago

by Christian Boos, 16 years ago

by Christian Boos, 16 years ago

by Remy Blank, 16 years ago

Changed trac-admin changeset notification command into repository notify.

comment:17 by Remy Blank, 16 years ago

The updated patch above implements the following changes:

  • Changed the changeset notification trac-admin command from repository changeset into repository notify, and added a parameter for the event type. This allowed making the notification code more modular, and to easily add the the event changeset_modified (in addition to the already existing changeset_added), to be used e.g. in a SVN post-revprop-change hook.

The changeset_modified event required changing Repository.sync_changeset() to be a no-operation instead of raising a NotImplementedError. This is reasonable IMO, as Repository.sync() already is a no-operation.

The last remaining issue is the youngest_rev problem. I have been staring at the code for an hour now, and I still don't get the sync() logic and the associated caching issues. There are just too many interconnected issues. So I'm afraid I will need help from you on this, Christian. I would hate breaking the seemingly fragile synchronization procedure.

comment:18 by Christian Boos, 16 years ago

For RepositoryCache.youngest_rev, we cache the value if it's not already there, in a way similar to attachment:7723-3-get_youngest_rev-r7928.patch. Then, the problem is how do we invalidate this when another process do the sync? The easiest way, which we should adopt as a first step, is to simply trigger a global invalidation of the affected environments, by calling config.touch() at the end of a sync(). It's admittedly heavy-weight, but:

  • it will work, as we use it in plenty other situations (the RepositoryCache is not a Component itself, but it is accessed through the RepositoryManager, so as the latter gets cleared, a new RepositoryCache will be created when needed)
  • we sooner or later need a better solution for this cache invalidation problem anyway, and then it'll still be time to "fix" it for the youngest_rev case as well.

Are you OK with this approach? We can discuss the general issue in TracDev/Proposals/CacheInvalidation.

by Remy Blank, 16 years ago

Invalidate youngest_rev cache when updating revision cache.

comment:19 by Remy Blank, 16 years ago

The patch above implements the suggestion in comment:18 and triggers an environment reload every time Repository.sync() updates the cache, so that youngest_rev is re-loaded from the database.

I think we agree that this solution is kinda ugly, so I'm all the more motivated to find a good solution to the cache invalidation issue.

Except if I missed something, this patch should therefore be final and completely restore the SVN repository cache functionality in the multirepos branch.

comment:20 by Remy Blank, 16 years ago

Is there anything else you need from me about this issue? Or would you rather wait for CacheInvalidation to be resolved before applying the patch?

comment:21 by Christian Boos, 16 years ago

Sorry - as usual, simply lack of time :-(

Besides, I think 7723-3-get_youngest_rev-r7928.patch is not optimal, we should rather take the youngest_rev info from the meta-data in the repository table.

But no need to wait for a better cache invalidation scheme, we'll just fix it a the same time as the other config.touch() when time is ripe.

I'll try to finalize and commit the patch today.

comment:22 by Christian Boos, 16 years ago

What about this last(?) change?

attachment:Cache-metadata-1.patch

by Christian Boos, 16 years ago

Attachment: Cache-metadata-1.patch added

retrieve metadata in a helper function and use it also in get_youngest_rev (applies on top of attachment:7723-multirepos-cache-6-r7928.patch)

comment:23 by Christian Boos, 16 years ago

Main patch (attachment:7723-multirepos-cache-6-r7928.patch) committed as [7961].

A series of follow-up patches were committed as well:

Things left to do before we can close this ticket:

  • migrate the trac-post-commit-hook logic into a sample-plugin (e.g. update-ticket-from-changeset.py)
  • the trac-post-commit-hook can now be a shell-script calling repository notify (i.e. similar to trac-post-commit-hook.cmd)

in reply to:  23 comment:24 by Christian Boos, 16 years ago

Things left to do before we can close this ticket:

… and rewrite the cache invalidation currently based on config.touch() to be based on the TracDev/Proposals/CacheInvalidation, of course!

comment:25 by Christian Boos, 16 years ago

Also, I realized that [7964] was somehow not taking the aim of get_base into account. By giving one appropriate reponame to trac-admin, notify would have triggered the appropriate sync() in all repositories sharing the same base.

Still, I think it's simpler to give to trac-admin directly the REPOS information obtained within the hook, thereby requiring no knowledge about how that repository is mapped to one or many repositories objects within the environment.

So I reworked this in [7973], removing the changes to get_repository added in [7964] and added a get_repositories_by_dir method.

Now notify should do the right thing (i.e. sync and notify all the repositories that are involved), whether provided with a reponame or directly with repository directory.

by Remy Blank, 16 years ago

Small cleanup and removed unwanted argument to repository resync.

comment:26 by Remy Blank, 16 years ago

The patch above is a small cleanup and it removes the unwanted argument clean from the repository resync command (the arguments of the _do_* methods map directly to command arguments, so clean should not be in _do_resync()'s signature).

comment:27 by Remy Blank, 16 years ago

Also, the test suite currently fails on the multirepos branch, I haven't yet found out why. Current trunk is ok, though.

Do you want me to set up a build slave for the multirepos branch?

in reply to:  27 ; comment:28 by Christian Boos, 16 years ago

Replying to rblank:

Also, the test suite currently fails on the multirepos branch, I haven't yet found out why.

Those are the functional tests failing during repository creation, I haven't checked yet.

Do you want me to set up a build slave for the multirepos branch?

Yes, that could be useful (once the above problem is fixed).

in reply to:  27 comment:29 by Christian Boos, 16 years ago

Replying to rblank:

Also, the test suite currently fails on the multirepos branch, I haven't yet found out why.

[7971] was only tested with Mercurial repositories… [7987] fixes that and the tests pass again.

in reply to:  28 comment:30 by Christian Boos, 16 years ago

Replying to cboos:

Replying to rblank:

Also, the test suite currently fails on the multirepos branch, I haven't yet found out why.

Those are the functional tests failing during repository creation, I haven't checked yet.

The above problem actually happens when there is another version of Trac installed - when the functional tests code starts the python process, that one doesn't pick the code from the checkout but from the installed location - it's a bit weird, PYTHONPATH doesn't seem to be propagated.

comment:31 by Christian Boos, 16 years ago

Related to #8067/#7490, I think we need to add a setting to disable the synchronization check for the default repository in the RepositoryManager.preprocess_request, so that it could be done with a post-commit hook even in this case (e.g. [trac] repository_per_request_sync = false, defaults to true). Anyway, the post-commit hook will be needed for getting the changeset_added notification.

By the way, another thing that occurred to me recently is the possible confusion between sync and notify changeset_added, in hooks. I'm worried that people could now simply call repository sync from their hook and be under the impression that everything just works - but they'll miss the notifications.

So here's an alternative idea to the notify commands:

  • repository post-commit <repo_dir_or_name> <rev>
  • repository post-revprop <repo_dir_or_name> <rev>

Then it becomes more obvious what to use from within a hook, rather than facing the choice between sync and notify… (and no, people don't read the docs :-) ).

in reply to:  31 comment:32 by Remy Blank, 16 years ago

Replying to cboos:

Anyway, the post-commit hook will be needed for getting the changeset_added notification.

Then why not drop the per-request synchronization completely?

Another idea could be a [trac] repository_sync_per_request setting that contains a list of repository names that should be synced on every request. This would allow those who cannot install post-commit hooks to use multiple repositories as well. Obviously, the impact has to be documented (no changeset notifications, bad performance). The default would be (default).

By the way, another thing that occurred to me recently is the possible confusion between sync and notify changeset_added, in hooks. I'm worried that people could now simply call repository sync from their hook and be under the impression that everything just works - but they'll miss the notifications.

I'm not sure. Another option could be to drop sync and use an optional argument to resync for resuming synchronization.

So here's an alternative idea to the notify commands:

I don't like them too much. I liked the fact that all notifications were under a single command better, and if sync is removed, there shouldn't be any confusion. Maybe notify is not the right term, and can be confused with ticket notifications? How about using the term hook?

  • repository hook changeset_added <repos> <rev> [rev ...]
  • repository hook changeset_modified <repos> <rev> [rev ...]

Or even:

  • repository hook post-commit <repos> <rev> [rev ...]
  • repository hook post-revprop <repos> <rev> [rev ...]

I prefer the first version, as I feel the post-* terms sound too SVN-specific.

by Remy Blank, 16 years ago

Attachment: commit_ticket_update.py added

The functionality of trac-post-commit-hook as a plugin using IRepositoryChangeListener.

in reply to:  23 comment:33 by Remy Blank, 16 years ago

Replying to cboos:

Things left to do before we can close this ticket:

  • migrate the trac-post-commit-hook logic into a sample-plugin (e.g. update-ticket-from-changeset.py)

I have attached a new plugin that implements the same functionality as the trac-post-commit-hook script, to be placed into sample-plugins.

It Works For Me (tm) ;-)

Oh, wait, I have forgotten to include the copyright notice of the original script. Coming…

by Remy Blank, 16 years ago

Attachment: commit_ticket_update.2.py added

Same script as above, but with original copyright notice.

comment:34 by Christian Boos, 16 years ago

Some comments:

  • about the trac-admin commands, would it be possible to use another prefix than repository? If yes, we could have:
    • trac-admin <env> changeset added <repos> <rev>
    • trac-admin <env> changeset modified <repos> <rev>
  • the commit_ticket_update plugin looks fine, but I wonder if it shouldn't also act on a changeset_modified callback. Rationale: you enter a ticket reference to #7732 but you actually meant #7723. Typically you'd edit the log message to fix it. It would make sense to also update the newly referenced ticket in this case. Even better, it would make sense to amend the old ticket as well, but for that we should pass the old changeset as well as the new:
    • def changeset_modified(self, repos, old_changeset, new_changeset):

in reply to:  34 ; comment:35 by Remy Blank, 16 years ago

Replying to cboos:

  • about the trac-admin commands, would it be possible to use another prefix than repository?

Sure, everything is possible ;-) It's a bit a return to the initial idea, which was repository changeset.

I guess a separate category changeset makes it harder to miss when actually reading the documentation, so I'd be ok with that. Let me know your final decision, and I'll implement it.

  • the commit_ticket_update plugin looks fine, but I wonder if it shouldn't also act on a changeset_modified callback.

That does indeed make sense, thanks for the idea. In changeset_added, I use the time of the changeset as the ticket change time. I will have to use the current time for changeset_modified. I guess this makes sense anyway.

This also means I can compute the difference between the ticket references in the old and new changeset data, and only update those tickets that differ.

but for that we should pass the old changeset as well as the new:

Yes, I had forgotten about that. I'll implement the augmented changeset_modified as well.

in reply to:  35 comment:36 by Christian Boos, 16 years ago

Replying to rblank:

Replying to cboos: Sure, everything is possible ;-) It's a bit a return to the initial idea, which was repository changeset.

I guess a separate category changeset makes it harder to miss when actually reading the documentation, so I'd be ok with that. Let me know your final decision, and I'll implement it.

OK, so let's settle on trac-admin <env> changeset added|modified <repos> <rev>.

by Remy Blank, 16 years ago

Changed trac-admin repository notify <event> into trac-admin changeset (added|modified).

comment:37 by Remy Blank, 16 years ago

The patch above renames the repository notification commands of trac-admin as suggested in comment:36.

BTW, do you want me to continue posting patches on this ticket for small items like this one, or should I commit directly to the branch? I'm fine with both variants.

About adding the old changeset metadata to the changeset_modified() hook, I'm afraid this is not such a good idea after all. Indeed, as the command is called after the metadata has been updated in the repository (typically in the post-revprop hook), the only way to get the previous values is from the cache. So we have a few options:

  • Add the old_changeset argument to changeset_modified() anyway, and set it to None if the previous data cannot be retrieved. There's currently no interface for repositories to tell if they are cached, so this would have to be detected by comparing the old and new metadata, and assuming that the old values cannot be retrieved if they are equal.
  • Add a changes argument containing a dictionary of changed changeset attributes. It will obviously be empty if the previous values cannot be retrieved.
  • Do not provide the old values.

I tend to prefer the last option, as the others will need a big fat warning that the data may or may not be present depending on the VC backend.

It still makes sense to process the changeset_modified() event in the commit_ticket_update plugin.

by Remy Blank, 16 years ago

Ah, forgot the unit tests.

comment:38 by Christian Boos, 16 years ago

Severity: normalmajor

Applied attachment:7723-changeset-admin-r8028.2.patch plus adaptations for the post-commit hooks in r8070.

About changeset_modified(), yes it's true that we can't get the old changeset for anything else than the DB cached repositories… OTOH, the Git plugin has support for caching changeset metdata and the Mercurial plugin should have it as well sometime soon. So I think the first option among the possibilities you listed is also a viable one and we could indeed add a property telling if the changesets are cached or not (and the big fat warning that the old_changeset might be None).

Btw, I guess the ticket has now reached a point where we can say it's was "major" task ;-) For the work process itself - feel free to commit directly on the branch.

Summarizing what's left to be done:

comment:39 by Remy Blank, 16 years ago

The cache invalidation patch has been committed in [8071]. I was going to merge it to multirepos, but I'm not sure about the database version conflict. Should I combine both upgrades into a single db22.py file, or should the VC cache upgrade be moved to a 22 → 23 upgrade?

comment:40 by Christian Boos, 16 years ago

Yes, it's a bit tricky, but I think the rule here should be to minimize the trouble for people using trunk. So I'd say we do 22→23 on the branch, plus code in db23.py that copes with upgrading previous MultiRepos db22.py users. When merging in trunk, we strip off that extra code.

comment:41 by Christian Boos, 16 years ago

I believe r8073 handles this correctly:

  • environments at version 21 get upgraded to 23 the normal way
  • environments previously at version 22 get upgraded to 23, in which we detect that the repository table already exist, in which case we call explicitly the db22.do_upgrade. That code can be removed before merging into trunk, as it's unlikely that at that time there will be people left with a MultiRepos specific database at version 22 (and if so, it's easy enough to fix by hand).

comment:42 by Remy Blank, 16 years ago

I have added the old_changeset argument to IRepositoryChangeListener.changeset_modified() in [8094]. The old changeset is returned by Repository.sync_changeset(), by querying the database prior to making the change. This is backward-compatible, as the function previously returned None, which means "data not available".

I have also added the new commit_ticket_update.py sample plugin, and removed the old trac-post-commit-hook. It now triggers a ticket comment when a changeset is modified, for those tickets that were not already referenced by the old changeset text. Obviously, if the old changeset data cannot be retrieved, it will add a comment to all referenced tickets.

There are a few things that I'm not entirely satisfied with the plugin:

  • The time used for the ticket comment is the current time, instead of the time of the changeset, as requested in comment:3:ticket:7251. I would have preferred the changeset time, but it would have messed up threading of the comments.
  • If saving of the ticket comment fails (due to concurrent editing), it just logs the error. We could retry saving, but I don't know if it's worth the effort.

Other than that, I would say that this is a good sample plugin, so I'll move on to using the new cache infrastructure for caching repository metadata.

by Remy Blank, 16 years ago

Updated caching of repository metadata to new caching infrastructure.

comment:43 by Remy Blank, 16 years ago

The patch above is a first stab at migrating caching of the repository metadata to the new caching infrastructure. While testing, the following popped up:

  • Setting reponame on the repository after retrieving it in RepositoryManager.get_repository() is impractical. For example, I have to change the cache id when setting reponame. Wouldn't it be simpler to pass reponame to IRepositoryConnector.get_repository(), so that it can be set in the Repository constructor?
  • More importantly (not related to metadata caching), CachedRepository delegates some work to SubversionRepository without checking if the revision exists in the cache. For example, if a new changeset is added to a repository, but the changeset added hook has not yet been called, then the revision log for the repository generates an internal error. This is due to using SubversionNode.get_history() to get the history of a node. The history will contain the new changeset, but it will not be possible to retrieve it from the repository.

The second issue is already present on the multirepos branch. The window of opportunity for this to happen is typically small, as notification should be triggered from the post-commit hook, but still, it's not nice to get an internal error.

I see two potential solutions:

  • "Overload" more of the behavior of SubversionRepository in CachedRepository, and ensure that no data for not-yet-cached revisions leaks. Seems clumsy.
  • In CachedRepository, fall back to the underlying repository if some piece of data cannot be retrieved from the cache. Seems more robust, but will introduce some inconsistencies between different views of the data.

The second idea could even be extended further: what if we did "opportunistic" caching? That is, start with an empty cache (and no explicit "resync"), and on data retrieval, if the data is not available from the cache, get it from the repository and store it in the cache.

in reply to:  43 comment:44 by Christian Boos, 16 years ago

Replying to rblank:

The patch above is a first stab at migrating caching of the repository metadata to the new caching infrastructure.

I've tested it, and it seems to work fine.

  • Setting reponame on the repository after retrieving it in RepositoryManager.get_repository() is impractical.

The primary concern here was API compatibility. There was no compelling reason to change things, as setting .reponame afterwards was enough. Now it seems we have a real need for a change. If possible, that change should be handled with backward compatibility in mind (using arity on the connector's get_repository method), otherwise I think it's better to postpone the incompatible change once we're on trunk.

  • More importantly (not related to metadata caching), CachedRepository delegates some work to SubversionRepository without checking if the revision exists in the cache. For example, if a new changeset is added to a repository, but the changeset added hook has not yet been called, then the revision log for the repository generates an internal error. This is due to using SubversionNode.get_history() to get the history of a node. The history will contain the new changeset, but it will not be possible to retrieve it from the repository.

Yes, this is problematic.

I see two potential solutions:

  • "Overload" more of the behavior of SubversionRepository in CachedRepository, and ensure that no data for not-yet-cached revisions leaks. Seems clumsy.

All we need to do is to ensure that we start with a proper youngest, i.e. the one in the cache. Relatively easy for Subversion, maybe be more involved for other backends (well, we have still no other cached backends, but theoretically…)

  • In CachedRepository, fall back to the underlying repository if some piece of data cannot be retrieved from the cache. Seems more robust, but will introduce some inconsistencies between different views of the data.

Note that this might cover real issues, like the hook not working anymore for some reason.

But I think the 2 solutions can be done in complement one to each other: take care of giving a consistent view on the repo starting from the cache's youngest and be prepared to handle what falls between the cracks by delegating to the underlying repos when the cache doesn't have the information.

The second idea could even be extended further: what if we did "opportunistic" caching? That is, start with an empty cache (and no explicit "resync"), and on data retrieval, if the data is not available from the cache, get it from the repository and store it in the cache.

As we can retrieve information about a repository at anytime (i.e. we might be going through the results of a SELECT), we can't save the metadata right away. We would have to postpone that to after request processing, like we do for saving the session data modifications. I think it's an interesting option and might be doable, but certainly not trivial and without surprise. For example, it will break the get_(next|prev)_rev optimizations.

comment:45 by Remy Blank, 15 years ago

I have committed the patch in [8200]. I left the reponame setting after creation as-is, as we can change that at any time later, and it's not really a big issue (yet), only a minor inconvenience.

I'll try to address the youngest issue next.

comment:46 by mike@…, 15 years ago

Cc: mike@… added

by Remy Blank, 15 years ago

Make consistent use of youngest_rev in CachedRepository.

comment:47 by Remy Blank, 15 years ago

About youngest_rev consistency, the patch above seems to fix the issue for me, and all tests pass. Comments? Good to go?

in reply to:  47 comment:48 by Christian Boos, 15 years ago

Replying to rblank:

About youngest_rev consistency, the patch above seems to fix the issue for me, and all tests pass. Comments? Good to go?

I've tested attachment:7723-youngest-rev-r8299.patch, which works fine for me. Ok to apply.

So what's left? From comment:38, only the last two points are still opened, the one about Mercurial shouldn't block this ticket and the [trac] repository_sync_per_request feature.

comment:49 by Remy Blank, 15 years ago

I'll tackle [trac] repository_sync_per_request next. The initial idea was that it be a list of repository names, and that only those repositories would be synced at every request, with a default of (default) to be backward compatible. Ok with that?

comment:50 by Christian Boos, 15 years ago

Yes. The empty list meaning nothing gets synced at every request, which could become the default one day.

comment:51 by Remy Blank, 15 years ago

Fix for youngest_rev applied in [8396].

comment:52 by Remy Blank, 15 years ago

Resolution: fixed
Status: assignedclosed

Added the option [trac] repository_sync_per_request in [8397].

As far as I can tell, this concludes work on this ticket. Phew, 10 months and 52 comments… not bad!

comment:53 by Christian Boos, 15 years ago

I just wondered about the interaction between the CachedRepository metadata CacheProxy and the admin operations on the repository table: on rename and remove, don't we need to invalidate them?

in reply to:  53 ; comment:54 by Remy Blank, 15 years ago

Replying to cboos:

I just wondered about the interaction between the CachedRepository metadata CacheProxy and the admin operations on the repository table: on rename and remove, don't we need to invalidate them?

Currently it wouldn't make a difference, as all repository operations call RepositoryManager.reload_repositories(), which in turn calls config.touch(), so the whole environment is reloaded anyway.

Now if reload_repositories() was migrated to the fine-grained cache mechanism, the repository objects would be re-created, and hence the CacheProxy object as well, but the cache itself would still contain stale data from the previous object. This might have perverse side-effects.

I'll have to think about it some more. Maybe CacheProxy should allow clearing the associated (thread and process) cache on creation, for the use case where it's stored as an object attribute.

comment:55 by Christian Boos, 15 years ago

Milestone: 0.120.12-multirepos

in reply to:  54 comment:56 by Remy Blank, 15 years ago

Replying to rblank:

Now if reload_repositories() was migrated to the fine-grained cache mechanism, the repository objects would be re-created, and hence the CacheProxy object as well, but the cache itself would still contain stale data from the previous object. This might have perverse side-effects.

There's no easy solution to that. Clearing the cache when creating the CacheProxy in CachedRepository.__init__() is not an option, as repository objects are created per thread, and hence new threads would clear the cache of older threads.

The current assumption is that all caches are independent. This would not be the case anymore if e.g. reload_repositories() switches to a fine-grained cache, and there would need to be some kind of cascading cache invalidation. Currently, config.touch() is used for that (a kind of global cache invalidation), but we could also imagine that a cache can depend on another, and is invalidated at the same time as the latter.

Anyway, the current implementation with config.touch() works, so I'm leaving this for now.

comment:57 by Remy Blank, 14 years ago

Type: taskenhancement

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain Remy Blank.
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from Remy Blank to the specified user.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.