Edgewall Software
Modify

Opened 8 years ago

Closed 8 years ago

#8731 closed enhancement (fixed)

experiment with surrogate keys for the repository table

Reported by: Christian Boos Owned by: Remy Blank
Priority: normal Milestone: 0.12-multirepos
Component: version control Version: none
Severity: normal Keywords: consider multirepos
Cc: lele@…
Release Notes:
API Changes:

Description

In MultipleRepositorySupport branch, we introduced a new table for storing information about repositories. This table uses the "traditional" natural key approach.

However, we could consider using surrogate keys instead. Some version control backends (e.g. TracDarcs) have to store extra informations for changesets in their own table, and changesets have to be associated with a repository. This works best if the repository key would be a surrogate key, so the relations won't be lost when changing the "natural" key of a repository (i.e. renaming the repository).

See MultipleRepositorySupport/Cache#Discussion.

Attachments (5)

8731-surrogate-keys-r8662.patch (24.2 KB ) - added by Remy Blank 8 years ago.
First experiment with surrogate keys for repositories.
8731-surrogate-keys-r8662.2.patch (35.0 KB ) - added by Remy Blank 8 years ago.
Complete patch.
8731-mercurial-r8664.patch (1.2 KB ) - added by Remy Blank 8 years ago.
Adds then new reponame and id arguments to MercurialRepository.__init__().
8731-repo-options-r8673.patch (9.6 KB ) - added by Remy Blank 8 years ago.
Replace separate reponame and id arguments with an options dict in Repository.__init__().
8731-simpler-api-r8676.patch (2.9 KB ) - added by Remy Blank 8 years ago.
Fix for simpler API in [8677].

Download all attachments as: .zip

Change History (28)

comment:1 Changed 8 years ago by Remy Blank

Owner: set to Remy Blank

As discussed on IRC.

Changed 8 years ago by Remy Blank

First experiment with surrogate keys for repositories.

comment:2 Changed 8 years ago by Remy Blank

Cc: lele@… added

Here's a first experiment with surrogate keys for repositories. A few comments:

  • The DB upgrade code is not yet included, so you'll have to create a new environment for testing.
  • Repository ids are generated in RepositoryManager.get_repository_id(). Basically, for each name passed to that method, a row is created in the repository table. This also applies to repositories defined in trac.ini.
  • There's a small race condition in get_repository_id(), but I don't know how to remove it. The same race is also present in AbstractEnum.insert().
  • Every Repository object now has an .id attribute containing its unique id. It is set in the same way as the .reponame attribute, after retrieving the repository from the connector.
  • Even with the surrogate keys, we will probably have to add some hooks for repositories that store data in auxiliary tables. For example, the revision and node_change tables are cleared in VersionControlAdmin._sync() prior to calling Repository.sync(). This should probably be moved to the repository itself, in a clean() method.

Lele, it would be great to get some feedback from you about this approach, in particular, if it would be sufficient w.r.t. TracDarcs/Cache.

comment:3 Changed 8 years ago by Christian Boos

I did some initial testing, everything seems to be working fine except the repository alias command.

I think the race condition in get_repository_id is not too worrisome. However, as the repositories defined in the .ini file will also get assigned an id, there's the following scenario to consider: at the first startup, two concurrent requests for the /browser could trigger a conflict - if the only consequence is one of the requests failing, then it's not a serious problem.

The clean method sounds nice, but don't we already use that name? Ah, no, we use clear(). Maybe a longer name would be more explicit, like reset_before_sync.

comment:4 in reply to:  3 ; Changed 8 years ago by Remy Blank

Replying to cboos:

I did some initial testing, everything seems to be working fine except the repository alias command.

Right, I hadn't tested that one.

I think the race condition in get_repository_id is not too worrisome. However, as the repositories defined in the .ini file will also get assigned an id, there's the following scenario to consider: at the first startup, two concurrent requests for the /browser could trigger a conflict - if the only consequence is one of the requests failing, then it's not a serious problem.

That's what I would expect: both processes would try to insert a row with the same primary key (id, 'name'), and one of them will trigger a primary key violation.

I would still be interested in a general solution to this situation, namely inserting a new row atomically if it doesn't exist yet. The other "interesting" situation is inserting a new row if it doesn't exist, or update the existing row (sometimes called UPSERT). You'd think this is standard SQL, but it doesn't seem to be the case.

The clean method sounds nice, but don't we already use that name? Ah, no, we use clear(). Maybe a longer name would be more explicit, like reset_before_sync.

Makes sense.

So I assume you are convinced that this is the right approach? Then I'll continue with the fixes above and the DB upgrade code.

A side issue: you are going to break the Repository constructor signature in #5640/#7116 by removing the authz argument, right? Could we take this opportunity to pass the reponame and id as arguments to the constructor, instead of setting them afterwards in the RepositoryManager? That would make the code much cleaner (especially in CachedRepository).

comment:5 in reply to:  4 ; Changed 8 years ago by anonymous

Replying to rblank:

I would still be interested in a general solution to this situation, namely inserting a new row atomically if it doesn't exist yet.

Sure, we should eventually find a nice solution for that, but it's not a blocker as the problem should be very exceptional and have no damaging consequences if it ever happens.

The other "interesting" situation is inserting a new row if it doesn't exist, or update the existing row (sometimes called UPSERT). You'd think this is standard SQL, but it doesn't seem to be the case.

Well, what you did for this in trac/cache.py is perfect for me, are there other places that need a similar handling?

So I assume you are convinced that this is the right approach? Then I'll continue with the fixes above and the DB upgrade code.

Yes, that's the way I'd like to go. But we can wait a bit for additional feedback. Since yesterday I also tested [repositories] and it seems that this provider is still working as usual (i.e. handy for quickly adding/removing temporary Mercurial repositories, for example).

A side issue: you are going to break the Repository constructor signature in #5640/#7116 by removing the authz argument, right? Could we take this opportunity to pass the reponame and id as arguments to the constructor, instead of setting them afterwards in the RepositoryManager? That would make the code much cleaner (especially in CachedRepository).

Good idea, I'll do that.

comment:6 in reply to:  5 Changed 8 years ago by Remy Blank

Replying to anonymous (probably cboos):

Sure, we should eventually find a nice solution for that, but it's not a blocker as the problem should be very exceptional and have no damaging consequences if it ever happens.

Sure, that's not critical here. I'm just curious.

Well, what you did for this in trac/cache.py is perfect for me, are there other places that need a similar handling?

No, and the code in the cache is the best I could come up with. The only thing I'm not absolutely sure is whether an UPDATE that doesn't update any rows actually starts a transaction or not.

It's just that these two operations sound so common that I am amazed that SQL doesn't provide a definite answer to that. I have even asked on StackOverflow, and the replies were all DB-specific.

Changed 8 years ago by Remy Blank

Complete patch.

comment:7 Changed 8 years ago by Remy Blank

Here's a patch with (I believe) the full functionality. Changes compared to the previous one:

  • Fixed repository aliases.
  • Added the database upgrade code.
  • Moved the repository cache clean code into CachedRepository.sync(), triggered by a new optional argument clean.
  • Added the reponame and id arguments to the Repository constructor.

Feedback and testing welcome, especially for database upgrades on backends other than SQLite.

Changed 8 years ago by Remy Blank

Attachment: 8731-mercurial-r8664.patch added

Adds then new reponame and id arguments to MercurialRepository.__init__().

comment:8 Changed 8 years ago by Christian Boos

Upgrade from a 0.11-stable database for which the default repository was a mirror of Trac's repository worked great for PostgreSQL.

For MySQL unfortunately I had this issue:

17:23:38 Trac[util] DEBUG: SQL: "INSERT INTO revision (repos,rev,time,author,message) SELECT '',rev,time,author,message
FROM rev_old"
17:23:38 Trac[util] DEBUG: execute exception: OperationalError(1366, "Incorrect string value: '\\xC5\\x88 thr...' for co
lumn 'message' at row 5342")
OperationalError: (1366, "Incorrect string value: '\\xC5\\x88 thr...' for column 'message' at row 5342")

Looks like this corresponds to r5806. I'm investigating (maybe it's only due to the way I copied the MySQL db).

comment:9 Changed 8 years ago by Christian Boos

Ok, my bad. Both the original db and copy db seemed to have the same content (by doing a mysqldump and comparing them), but actually I just did a create database <copy> instead of the whole incantation (create database <copy> default character set utf8 collate utf8_bin).

The upgrade worked flawlessly after fixing that.

I also added a teo alias to the default repository, and that worked as well.

Now doing a last test with a SQLite db and Mercurial.

comment:10 Changed 8 years ago by Christian Boos

Everything works well ;-)

OK to commit from my side.

comment:11 Changed 8 years ago by Remy Blank

Thanks for the extensive testing!

Are you also ok with the signature change on Repository.__init__()?

comment:12 Changed 8 years ago by Christian Boos

Some details in the docstring about what are each of reponame, id, name would be useful , other than that it's OK.

comment:13 Changed 8 years ago by Remy Blank

Patch applied in [8672], with the additional documentation.

I will document the signature change on ApiChanges/0.12 in the next few days.

comment:14 Changed 8 years ago by Remy Blank

… and added the DB upgrade file in [8673]. Oops.

comment:15 Changed 8 years ago by Christian Boos

Committed 8731-mercurial-r8664.patch in r8676. Anything else or can we close?

comment:16 Changed 8 years ago by Remy Blank

Only the documentation of the API change. I can probably do that tonight.

comment:17 Changed 8 years ago by Lele Gaifax <lele@…>

I'm sorry, while I do follow the thread I couldn't dedicate the needed effort to understand the details and adapt my TracDarcs backend to the new API. Thanks for doing it, anyway! :-)

comment:18 Changed 8 years ago by Remy Blank

I wonder about one thing: instead of passing the reponame and id to Repository.__init__(), hence requiring all repository connectors to extract them from the options, would it be a better idea to pass the options dict directly to Repository.__init__()? Something like the following patch.

Changed 8 years ago by Remy Blank

Replace separate reponame and id arguments with an options dict in Repository.__init__().

comment:19 Changed 8 years ago by Christian Boos

I thought about that, but I didn't propose it because I don't like putting mandatory arguments in something called options ;-)

But the API simplification is appealing. Maybe rename options to params?

comment:20 in reply to:  19 Changed 8 years ago by Remy Blank

Replying to cboos:

But the API simplification is appealing. Maybe rename options to params?

Sounds good. I'll make the change and commit tonight.

comment:21 Changed 8 years ago by Remy Blank

Simplified API committed in [8677]. Patch for mercurial-plugin following shortly.

Changed 8 years ago by Remy Blank

Fix for simpler API in [8677].

comment:22 Changed 8 years ago by Remy Blank

API change documented in ApiChanges/0.12.

comment:23 Changed 8 years ago by Remy Blank

Resolution: fixed
Status: newclosed

The last patch has been applied in [8692], so this can be closed.

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain Remy Blank.
The resolution will be deleted.
to The owner will be changed from Remy Blank to the specified user.

Add Comment


E-mail address and name can be saved in the Preferences .

 
Note: See TracTickets for help on using tickets.