Opened 15 years ago
Closed 15 years ago
#8731 closed enhancement (fixed)
experiment with surrogate keys for the repository table
Reported by: | Christian Boos | Owned by: | Remy Blank |
---|---|---|---|
Priority: | normal | Milestone: | 0.12-multirepos |
Component: | version control | Version: | none |
Severity: | normal | Keywords: | consider multirepos |
Cc: | lele@… | Branch: | |
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
In MultipleRepositorySupport branch, we introduced a new table for storing information about repositories. This table uses the "traditional" natural key approach.
However, we could consider using surrogate keys instead. Some version control backends (e.g. TracDarcs) have to store extra informations for changesets in their own table, and changesets have to be associated with a repository. This works best if the repository key would be a surrogate key, so the relations won't be lost when changing the "natural" key of a repository (i.e. renaming the repository).
Attachments (5)
Change History (28)
comment:1 by , 15 years ago
Owner: | set to |
---|
by , 15 years ago
Attachment: | 8731-surrogate-keys-r8662.patch added |
---|
First experiment with surrogate keys for repositories.
comment:2 by , 15 years ago
Cc: | added |
---|
Here's a first experiment with surrogate keys for repositories. A few comments:
- The DB upgrade code is not yet included, so you'll have to create a new environment for testing.
- Repository ids are generated in
RepositoryManager.get_repository_id()
. Basically, for each name passed to that method, a row is created in therepository
table. This also applies to repositories defined intrac.ini
.
- There's a small race condition in
get_repository_id()
, but I don't know how to remove it. The same race is also present inAbstractEnum.insert()
.
- Every
Repository
object now has an.id
attribute containing its unique id. It is set in the same way as the.reponame
attribute, after retrieving the repository from the connector.
- Even with the surrogate keys, we will probably have to add some hooks for repositories that store data in auxiliary tables. For example, the
revision
andnode_change
tables are cleared inVersionControlAdmin._sync()
prior to callingRepository.sync()
. This should probably be moved to the repository itself, in aclean()
method.
Lele, it would be great to get some feedback from you about this approach, in particular, if it would be sufficient w.r.t. TracDarcs/Cache.
follow-up: 4 comment:3 by , 15 years ago
I did some initial testing, everything seems to be working fine except the repository alias
command.
I think the race condition in get_repository_id
is not too worrisome. However, as the repositories defined in the .ini
file will also get assigned an id, there's the following scenario to consider: at the first startup, two concurrent requests for the /browser could trigger a conflict - if the only consequence is one of the requests failing, then it's not a serious problem.
The clean
method sounds nice, but don't we already use that name? Ah, no, we use clear()
. Maybe a longer name would be more explicit, like reset_before_sync
.
follow-up: 5 comment:4 by , 15 years ago
Replying to cboos:
I did some initial testing, everything seems to be working fine except the
repository alias
command.
Right, I hadn't tested that one.
I think the race condition in
get_repository_id
is not too worrisome. However, as the repositories defined in the.ini
file will also get assigned an id, there's the following scenario to consider: at the first startup, two concurrent requests for the /browser could trigger a conflict - if the only consequence is one of the requests failing, then it's not a serious problem.
That's what I would expect: both processes would try to insert a row with the same primary key (id, 'name'), and one of them will trigger a primary key violation.
I would still be interested in a general solution to this situation, namely inserting a new row atomically if it doesn't exist yet. The other "interesting" situation is inserting a new row if it doesn't exist, or update the existing row (sometimes called UPSERT). You'd think this is standard SQL, but it doesn't seem to be the case.
The
clean
method sounds nice, but don't we already use that name? Ah, no, we useclear()
. Maybe a longer name would be more explicit, likereset_before_sync
.
Makes sense.
So I assume you are convinced that this is the right approach? Then I'll continue with the fixes above and the DB upgrade code.
A side issue: you are going to break the Repository
constructor signature in #5640/#7116 by removing the authz
argument, right? Could we take this opportunity to pass the reponame
and id
as arguments to the constructor, instead of setting them afterwards in the RepositoryManager
? That would make the code much cleaner (especially in CachedRepository
).
follow-up: 6 comment:5 by , 15 years ago
Replying to rblank:
I would still be interested in a general solution to this situation, namely inserting a new row atomically if it doesn't exist yet.
Sure, we should eventually find a nice solution for that, but it's not a blocker as the problem should be very exceptional and have no damaging consequences if it ever happens.
The other "interesting" situation is inserting a new row if it doesn't exist, or update the existing row (sometimes called UPSERT). You'd think this is standard SQL, but it doesn't seem to be the case.
Well, what you did for this in trac/cache.py is perfect for me, are there other places that need a similar handling?
So I assume you are convinced that this is the right approach? Then I'll continue with the fixes above and the DB upgrade code.
Yes, that's the way I'd like to go. But we can wait a bit for additional feedback.
Since yesterday I also tested [repositories]
and it seems that this provider is still working as usual (i.e. handy for quickly adding/removing temporary Mercurial repositories, for example).
A side issue: you are going to break the
Repository
constructor signature in #5640/#7116 by removing theauthz
argument, right? Could we take this opportunity to pass thereponame
andid
as arguments to the constructor, instead of setting them afterwards in theRepositoryManager
? That would make the code much cleaner (especially inCachedRepository
).
Good idea, I'll do that.
comment:6 by , 15 years ago
Replying to anonymous (probably cboos):
Sure, we should eventually find a nice solution for that, but it's not a blocker as the problem should be very exceptional and have no damaging consequences if it ever happens.
Sure, that's not critical here. I'm just curious.
Well, what you did for this in trac/cache.py is perfect for me, are there other places that need a similar handling?
No, and the code in the cache is the best I could come up with. The only thing I'm not absolutely sure is whether an UPDATE
that doesn't update any rows actually starts a transaction or not.
It's just that these two operations sound so common that I am amazed that SQL doesn't provide a definite answer to that. I have even asked on StackOverflow, and the replies were all DB-specific.
comment:7 by , 15 years ago
Here's a patch with (I believe) the full functionality. Changes compared to the previous one:
- Fixed repository aliases.
- Added the database upgrade code.
- Moved the repository cache clean code into
CachedRepository.sync()
, triggered by a new optional argumentclean
. - Added the
reponame
andid
arguments to theRepository
constructor.
Feedback and testing welcome, especially for database upgrades on backends other than SQLite.
by , 15 years ago
Attachment: | 8731-mercurial-r8664.patch added |
---|
Adds then new reponame
and id
arguments to MercurialRepository.__init__()
.
comment:8 by , 15 years ago
Upgrade from a 0.11-stable database for which the default repository was a mirror of Trac's repository worked great for PostgreSQL.
For MySQL unfortunately I had this issue:
17:23:38 Trac[util] DEBUG: SQL: "INSERT INTO revision (repos,rev,time,author,message) SELECT '',rev,time,author,message FROM rev_old" 17:23:38 Trac[util] DEBUG: execute exception: OperationalError(1366, "Incorrect string value: '\\xC5\\x88 thr...' for co lumn 'message' at row 5342") OperationalError: (1366, "Incorrect string value: '\\xC5\\x88 thr...' for column 'message' at row 5342")
Looks like this corresponds to r5806. I'm investigating (maybe it's only due to the way I copied the MySQL db).
comment:9 by , 15 years ago
Ok, my bad. Both the original db and copy db seemed to have the same content (by doing a mysqldump and comparing them), but actually I just did a create database <copy>
instead of the whole incantation (create database <copy> default character set utf8 collate utf8_bin
).
The upgrade worked flawlessly after fixing that.
I also added a teo alias to the default repository, and that worked as well.
Now doing a last test with a SQLite db and Mercurial.
comment:11 by , 15 years ago
Thanks for the extensive testing!
Are you also ok with the signature change on Repository.__init__()
?
comment:12 by , 15 years ago
Some details in the docstring about what are each of reponame, id, name would be useful , other than that it's OK.
comment:13 by , 15 years ago
Patch applied in [8672], with the additional documentation.
I will document the signature change on ApiChanges/0.12 in the next few days.
comment:15 by , 15 years ago
Committed 8731-mercurial-r8664.patch in r8676. Anything else or can we close?
comment:16 by , 15 years ago
Only the documentation of the API change. I can probably do that tonight.
comment:17 by , 15 years ago
I'm sorry, while I do follow the thread I couldn't dedicate the needed effort to understand the details and adapt my TracDarcs backend to the new API. Thanks for doing it, anyway! :-)
comment:18 by , 15 years ago
I wonder about one thing: instead of passing the reponame
and id
to Repository.__init__()
, hence requiring all repository connectors to extract them from the options
, would it be a better idea to pass the options
dict directly to Repository.__init__()
? Something like the following patch.
by , 15 years ago
Attachment: | 8731-repo-options-r8673.patch added |
---|
Replace separate reponame
and id
arguments with an options
dict in Repository.__init__()
.
follow-up: 20 comment:19 by , 15 years ago
I thought about that, but I didn't propose it because I don't like putting mandatory arguments in something called options
;-)
But the API simplification is appealing. Maybe rename options
to params
?
comment:20 by , 15 years ago
Replying to cboos:
But the API simplification is appealing. Maybe rename
options
toparams
?
Sounds good. I'll make the change and commit tonight.
comment:21 by , 15 years ago
Simplified API committed in [8677]. Patch for mercurial-plugin following shortly.
comment:23 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
The last patch has been applied in [8692], so this can be closed.
As discussed on IRC.