#3676 closed defect (fixed)
MySQL: Primary keys are not well designed
Reported by: | Owned by: | Jun Omae | |
---|---|---|---|
Priority: | normal | Milestone: | 1.1.4 |
Component: | version control | Version: | 0.10b1 |
Severity: | major | Keywords: | mysql utf8 primary key |
Cc: | Branch: | ||
Release Notes: |
Add an auto-increment primary key to |
||
API Changes: | |||
Internal Changes: |
Description
The automated computation of primary key sizes in mysql_backend.py (compare #3673) is not well designed. For example: primary key of table node_change.
PRIMARY KEY (`rev`(111),`path`(111),`change_type`(111)
In this way, all three keys have a maximal length of 111 characters. Now, we have a too strict constraint on the length of pathes in Subversion: A path cannot be longer than 111 characters. If a revision contains too long pathes, you will get an "1062 - Duplicate entry" error (the path is truncated).
The change_type is just one char long, and I have never seen a revision nummber that is 111 digits long. ;-)
In general, I think you should use ids instead of composed primary keys.
Attachments (3)
Change History (34)
comment:1 by , 18 years ago
comment:2 by , 18 years ago
Keywords: | mysql utf8 primary key added |
---|
comment:3 by , 18 years ago
The rev
column does need to be pretty long, since many patch-based version control systems (Darcs, Mercurial, etc.) use hashes as the unique identifier of a patch which can be quite long.
comment:4 by , 18 years ago
Mmh, the MySQL developers classify that problem as a 'limitation', not as a bug.
So, the only solution may be another database schema? I know, in theory such a combined primary key (or multiple-column index in the term of MySQL) is fine, but in practice there may be other issues about that. However, AFAIK Oracle allows very long primary keys. But for the MySQL database provider (mysql_backend) this definitively poses a problem.
comment:5 by , 18 years ago
Milestone: | → 0.10.1 |
---|---|
Owner: | changed from | to
If we have #3778 (conversion of PRIMARY KEY to a non-unique index), then we could also use partial columns for creating the index.
comment:6 by , 18 years ago
Severity: | normal → major |
---|
So I think I'm going to apply attachment:ticket:3778:node_change_upgrade-r3787.patch, even if this will imply a schema update for 0.10.1.
follow-up: 8 comment:7 by , 18 years ago
What we could perhaps do is a 'no-op' upgrade for the backends that don't need that change, and simply increment the database_schema number.
That way, the operation will be quick and painless for SQLite and PostgreSQL users, as there will be no need to resync afterwards.
comment:8 by , 18 years ago
Replying to cboos:
What we could perhaps do is a 'no-op' upgrade for the backends that don't need that change, and simply increment the database_schema number.
That way, the operation will be quick and painless for SQLite and PostgreSQL users, as there will be no need to resync afterwards.
Can this really be a no-op for the other db backends? I think we need to remove the primary key for all backends to avoid node_change having a different schema depending on if the env was created before or after this schema update.
On the other hand it sucks that we need to remove a perfectly functional primary key just because of this MySQL limitation. But mainting a backend-specific schema will probably be too much work.
comment:9 by , 18 years ago
Well, I think we could make more effective use of the primary key without dropping it entirely. Obviously the change_type
and rev
columns should use less bytes than the path
, so we could either make the MySQL prefix length calculation better, or explicitly the prefix length in the schema definition.
We could also reduce the storage used by the rev
column by changing the charset to latin1, which should be fine for storing the numeric or hexadecimal values used as version control identifiers. The hashes all seem to be based on SHA1 which can be represented as 40 hex digits. Darcs adds some other things to the hash like a timestamp bringing the size up to 64 digits. So, somewhere between 64 and 128 bytes should be reasonable.
So if we used 1 byte for change_type
and 128 for rev
we could increase the path
prefix length to 290 which would be a big improvement.
comment:10 by , 18 years ago
Hrm, we could also get away without using the full length of the rev
column in the key, since it's much less likely to have collisions if it's truncated than the path
. So cutting the rev
prefix to 39 would allow for almost the full length of a SHA1 hash, and allow 320 characters for the path
key prefix.
Also if we could detect the database's character set we could use a longer prefix for the path on latin1 so that we used the full 1000 bytes for the key.
follow-up: 12 comment:11 by , 18 years ago
Well, no matter how you set the limit, you'll find users that have longer paths that you think is reasonable ;) So, to workaround this limitation of MySQL, the path can't be part of a fixed length primary key, no matter how you cut it.
OTOH, I don't see the reason why the other backends should be penalized from the deficiencies of MySQL in that respect. Having a pk on the path already helped to detect one subtle bug (#3778), so it's a good thing to have IMO.
My preferred solution would be that only the MySQL backend works around this limitation and uses an index (allowing duplicates) instead of a primary key here.
I don't really agree with jonas when he says that this approach would involve maintaining a backend-specific schema: the data model remains the same, it's just about an index, which is primarily here for performance reasons, the uniqueness constraint being only an (useful) side-effect, not something we rely on.
Anyway, when a table gets really modified, we usually blow away the table and recreate it, so the exact kind of index used won't matter.
comment:12 by , 18 years ago
Replying to cboos:
I don't really agree with jonas when he says that this approach would involve maintaining a backend-specific schema: the data model remains the same, it's just about an index, which is primarily here for performance reasons, the uniqueness constraint being only an (useful) side-effect, not something we rely on.
Anyway, when a table gets really modified, we usually blow away the table and recreate it, so the exact kind of index used won't matter.
True, as long as the backend-specific deviation is limited to different index/primary key configurations or other things that doesn't affect the data model I think we can handle it.
comment:13 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:14 by , 18 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Cboos its not a patch. we use long file names and still have it.
This ssue should be solved the correct way. Not the dirty one which is in the patch.
A patch can not be in a index.
comment:15 by , 18 years ago
OK, I think that for 0.10.4 / 0.11, we should simply remove that primary key:
- it's not used
- the argument I gave in comment:11 no longer applies I think, as we didn't get any other such subtle bug reported, so I think the code is OK now ;-)
- there's a new cache planned anyway
by , 18 years ago
Attachment: | remove_node_change_pk-r5133.diff added |
---|
Simply remove the problematic index
comment:16 by , 18 years ago
Please have a try of the attachment:remove_node_change_pk-r5133.diff.
If there's (positive) feedback in time, this may go in 0.10.4.
by , 18 years ago
Attachment: | node_change_index.txt added |
---|
comparison of EXPLAIN with and without key
comment:17 by , 18 years ago
Well, the key certainly does impact the execution plan of the databases since it provides useful information for the database to index on. See attachment:node_change_index.txt comparing SQLite and PostgreSQL EXPLAIN output with and without the key. I tried EXPLAIN on MySQL as well, but the output didn't seem to contain enough information to be useful. So, we shouldn't be penalizing other databases' performance because of MySQL's limitations. The key has also caught bugs such as #4586 where without it we would have created duplicate records, so I'm not sure it's valid to say "it's not used".
comment:18 by , 18 years ago
There's an index on the rev
field, which gets used instead of the primary key, as shown in the explain output. This is where the main cost reside, and for that there's no difference. There's possibly a small additional cost due to the ORDER BY path
which can't get optimized without an index on the path
, but that should be pretty marginal for anything but changesets containing thousands of entries. And for those, we certainly have rendering costs which marginalize the sorting penalty even more.
As for #4586, well that's a good point, though with the new sync code, there shouldn't be any possibility left for even trying to create duplicate node_change entries… but we never know.
Ok, maybe the right thing to do would be to scrap that pk at the mysql_backend.py level only, and leave the general model and the other backends alone. Trouble is, what do we do for the existing MySQL environments? Do we try to "migrate" them or consider that the problem will be fixed only for new environments?
comment:20 by , 18 years ago
Note that this ticket it still mentioned on the 0.10.4 milestone (in the text) which is a little misleading.
comment:21 by , 18 years ago
Ok, I'll update this. Note that even if the current fix was not deemed to be enough (see comment:14), the current situation is a bit better than the 0.10.3 one (see comment:13).
Please note that supporting MySQL is not high on my priority list, so I won't make any further attempt to fix this, as I'll concentrate my efforts on the new cache for 0.12 which will make the issue go away. I guess this ticket will end up with a wontfix resolution some day, unless someone comes with a better proposal or there's finally some consensus on comment:15 and comment:16 ;-)
comment:22 by , 18 years ago
Component: | general → version control |
---|---|
Milestone: | 0.10.5 → 0.12 |
The new repository cache scheduled for milestone:0.12 will fix this and #4378 as well.
Of course, if someone has a better idea in the meantime for 0.11, feel free to submit it. Preferably without involving any database upgrade, so more likely a mysql_backend.py fix only.
comment:23 by , 13 years ago
The solution I got is to change database schema in "mysql_backend.py" . If we drop all primary keys of "node_change" table then there will be no restriction for path size. Now my question is whether it will affect the other trac module functionality or other database functionality. Also is there any plan to fix this in future MYSQL version.
comment:25 by , 10 years ago
Milestone: | next-major-releases → 1.1.3 |
---|
#11782 was closed as duplicate.
I'll try to solve this issue and refresh remove_node_change_pk-r5133.diff on trunk. Also, I think we should add AUTO_INCREMENT
as primary key rather than dropping primary key.
comment:26 by , 10 years ago
Owner: | changed from | to
---|---|
Status: | reopened → assigned |
comment:27 by , 10 years ago
Milestone: | 1.1.3 → 1.1.4 |
---|
comment:28 by , 10 years ago
Proposed changes in jomae.git@t3676, as the result it would improve performance on MySQL.
- Replace the primary key with auto-increment id in node_change table.
- Unique index
(repos,rev,path,change_type)
would be removed.
- Unique index
- Add index
node_change (repos,rev,path)
. - Add index
node_change (repos,path,rev)
.
SQLite:
1.1.1 | 1.1.2 | 1.1.3 | r13774 | t3676 | |
---|---|---|---|---|---|
0.286283s | 0.278857s | 0.272613s | 0.256555s | 0.280069s | _next_prev_rev('<', 8802, 'trunk/tracopt/mimeview/tests/php.py')
|
0.070519s | 0.067838s | 0.066458s | 0.056563s | 0.057104s | _next_prev_rev('>', 8802, 'trunk/tracopt/mimeview/tests/php.py')
|
0.021353s | 0.014886s | 0.014168s | 0.007420s | 0.001173s | _next_prev_rev('<', 12000, 'trunk')
|
0.015980s | 0.015816s | 0.015060s | 0.008035s | 0.001219s | _next_prev_rev('>', 12000, 'trunk')
|
0.015870s | 0.015786s | 0.017921s | 0.007970s | 0.000907s | _next_prev_rev('<', 12345)
|
0.016744s | 0.016615s | 0.015915s | 0.008493s | 0.000902s | _next_prev_rev('>', 12345)
|
PostgreSQL:
1.1.1 | 1.1.2 | 1.1.3 | r13774 | t3676 | |
---|---|---|---|---|---|
0.710516s | 0.232353s | 0.234274s | 0.234219s | 0.243118s | _next_prev_rev('<', 8802, 'trunk/tracopt/mimeview/tests/php.py')
|
0.147532s | 0.050900s | 0.050583s | 0.053642s | 0.053488s | _next_prev_rev('>', 8802, 'trunk/tracopt/mimeview/tests/php.py')
|
0.007082s | 0.005189s | 0.005201s | 0.006669s | 0.006331s | _next_prev_rev('<', 12000, 'trunk')
|
0.007629s | 0.005721s | 0.005705s | 0.007117s | 0.006422s | _next_prev_rev('>', 12000, 'trunk')
|
0.004330s | 0.004323s | 0.004324s | 0.004459s | 0.004288s | _next_prev_rev('<', 12345)
|
0.004012s | 0.004007s | 0.004024s | 0.004757s | 0.004594s | _next_prev_rev('>', 12345)
|
MySQL:
1.1.1 | 1.1.2 | 1.1.3 | r13774 | t3676 | |
---|---|---|---|---|---|
1.186757s | 1.178240s | 1.166688s | 0.946206s | 0.010578s | _next_prev_rev('<', 8802, 'trunk/tracopt/mimeview/tests/php.py')
|
0.157969s | 0.153565s | 0.151270s | 0.119920s | 0.015570s | _next_prev_rev('>', 8802, 'trunk/tracopt/mimeview/tests/php.py')
|
0.859452s | 0.847304s | 0.844246s | 0.987854s | 0.340267s | _next_prev_rev('<', 12000, 'trunk')
|
0.106523s | 0.097104s | 0.097142s | 0.043159s | 0.068852s | _next_prev_rev('>', 12000, 'trunk')
|
1.418827s | 1.437706s | 1.419752s | 0.079679s | 0.080730s | _next_prev_rev('<', 12345)
|
0.103221s | 0.097151s | 0.096800s | 0.010424s | 0.010672s | _next_prev_rev('>', 12345)
|
comment:29 by , 10 years ago
Release Notes: | modified (diff) |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
Committed in [13782].
follow-up: 31 comment:30 by , 10 years ago
Hi "jomae",
we are currently struckling with this "behaviour" regarding "Primary_Key" and i was really happy that someone acts on this issue.
My question is: Can this "Change" already being downloaded or applied manually ?
Thanks in advance!
comment:31 by , 10 years ago
Replying to Anonymous:
My question is: Can this "Change" already being downloaded or applied manually ?
The issue was fixed in milestone:1.1.4. You can download from TracDownload#LatestDevRelease, however please read the caveats about using a developer stable release. Trac 1.2 should be released in a few months, will be production stable, and include the fix.
comment:32 by , 9 years ago
#12412 also closed as a duplicate.
I encountered IntegrityError
s on Trac 1.0.10 with MySQL due to repository paths that were not unique in the first 255 characters. A developer committed a library to the repository, which resulted in many very long paths in a single changeset (subversion).
While it looks like the node caching issue will be resolved in Trac 1.0.10, is it correct that paths will still be truncated to 255 characters? Will there be an impact on the changeset view?
Quick Fix
Just after running init via trac-admin execute SQL:
Now, path can contain 255 characters.