Ticket #3676 (reopened defect)
MySQL: Primary keys are not well designed
| Reported by: | Martin Burger <mburger@…> | Owned by: | cboos |
|---|---|---|---|
| Priority: | normal | Milestone: | next-major-0.1X |
| Component: | version control | Version: | 0.10b1 |
| Severity: | major | Keywords: | mysql utf8 primary key |
| Cc: |
Description
The automated computation of primary key sizes in mysql_backend.py (compare #3673) is not well designed. For example: primary key of table node_change.
PRIMARY KEY (`rev`(111),`path`(111),`change_type`(111)
In this way, all three keys have a maximal length of 111 characters. Now, we have a too strict constraint on the length of pathes in Subversion: A path cannot be longer than 111 characters. If a revision contains too long pathes, you will get an "1062 - Duplicate entry" error (the path is truncated).
The change_type is just one char long, and I have never seen a revision nummber that is 111 digits long. ;-)
In general, I think you should use ids instead of composed primary keys.
Attachments
Change History
comment:3 Changed 4 years ago by mgood
The rev column does need to be pretty long, since many patch-based version control systems (Darcs, Mercurial, etc.) use hashes as the unique identifier of a patch which can be quite long.
comment:4 Changed 4 years ago by Martin Burger <mburger@…>
Mmh, the MySQL developers classify that problem as a 'limitation', not as a bug.
So, the only solution may be another database schema? I know, in theory such a combined primary key (or multiple-column index in the term of MySQL) is fine, but in practice there may be other issues about that. However, AFAIK Oracle allows very long primary keys. But for the MySQL database provider (mysql_backend) this definitively poses a problem.
comment:5 Changed 4 years ago by cboos
- Owner changed from jonas to cboos
- Milestone set to 0.10.1
If we have #3778 (conversion of PRIMARY KEY to a non-unique index), then we could also use partial columns for creating the index.
comment:6 Changed 4 years ago by cboos
- Severity changed from normal to major
So I think I'm going to apply attachment:ticket:3778:node_change_upgrade-r3787.patch, even if this will imply a schema update for 0.10.1.
comment:7 follow-up: ↓ 8 Changed 4 years ago by cboos
What we could perhaps do is a 'no-op' upgrade for the backends that don't need that change, and simply increment the database_schema number.
That way, the operation will be quick and painless for SQLite and PostgreSQL users, as there will be no need to resync afterwards.
comment:8 in reply to: ↑ 7 Changed 4 years ago by jonas
Replying to cboos:
What we could perhaps do is a 'no-op' upgrade for the backends that don't need that change, and simply increment the database_schema number.
That way, the operation will be quick and painless for SQLite and PostgreSQL users, as there will be no need to resync afterwards.
Can this really be a no-op for the other db backends? I think we need to remove the primary key for all backends to avoid node_change having a different schema depending on if the env was created before or after this schema update.
On the other hand it sucks that we need to remove a perfectly functional primary key just because of this MySQL limitation. But mainting a backend-specific schema will probably be too much work.
comment:9 Changed 4 years ago by mgood
Well, I think we could make more effective use of the primary key without dropping it entirely. Obviously the change_type and rev columns should use less bytes than the path, so we could either make the MySQL prefix length calculation better, or explicitly the prefix length in the schema definition.
We could also reduce the storage used by the rev column by changing the charset to latin1, which should be fine for storing the numeric or hexadecimal values used as version control identifiers. The hashes all seem to be based on SHA1 which can be represented as 40 hex digits. Darcs adds some other things to the hash like a timestamp bringing the size up to 64 digits. So, somewhere between 64 and 128 bytes should be reasonable.
So if we used 1 byte for change_type and 128 for rev we could increase the path prefix length to 290 which would be a big improvement.
comment:10 Changed 4 years ago by mgood
Hrm, we could also get away without using the full length of the rev column in the key, since it's much less likely to have collisions if it's truncated than the path. So cutting the rev prefix to 39 would allow for almost the full length of a SHA1 hash, and allow 320 characters for the path key prefix.
Also if we could detect the database's character set we could use a longer prefix for the path on latin1 so that we used the full 1000 bytes for the key.
comment:11 follow-up: ↓ 12 Changed 4 years ago by cboos
Well, no matter how you set the limit, you'll find users that have longer paths that you think is reasonable ;) So, to workaround this limitation of MySQL, the path can't be part of a fixed length primary key, no matter how you cut it.
OTOH, I don't see the reason why the other backends should be penalized from the deficiencies of MySQL in that respect. Having a pk on the path already helped to detect one subtle bug (#3778), so it's a good thing to have IMO.
My preferred solution would be that only the MySQL backend works around this limitation and uses an index (allowing duplicates) instead of a primary key here.
I don't really agree with jonas when he says that this approach would involve maintaining a backend-specific schema: the data model remains the same, it's just about an index, which is primarily here for performance reasons, the uniqueness constraint being only an (useful) side-effect, not something we rely on.
Anyway, when a table gets really modified, we usually blow away the table and recreate it, so the exact kind of index used won't matter.
comment:12 in reply to: ↑ 11 Changed 4 years ago by jonas
Replying to cboos:
I don't really agree with jonas when he says that this approach would involve maintaining a backend-specific schema: the data model remains the same, it's just about an index, which is primarily here for performance reasons, the uniqueness constraint being only an (useful) side-effect, not something we rely on.
Anyway, when a table gets really modified, we usually blow away the table and recreate it, so the exact kind of index used won't matter.
True, as long as the backend-specific deviation is limited to different index/primary key configurations or other things that doesn't affect the data model I think we can handle it.
Changed 4 years ago by Dirk Datzert <dummy@…>
-
attachment
trac-mysql_backend.diff
added
My quick version of the fix
comment:13 Changed 3 years ago by cboos
- Status changed from new to closed
- Resolution set to fixed
patch applied in r4755 (trunk) and r4756 (0.10-stable).
For 0.12 and the VcRefactoring, care must be taken to avoid the problem of long primary keys, so that we can revert the special casing of column names (but not the s/500/333/ part, of course).
comment:14 Changed 3 years ago by Hans
- Status changed from closed to reopened
- Resolution fixed deleted
Cboos its not a patch. we use long file names and still have it.
This ssue should be solved the correct way. Not the dirty one which is in the patch.
A patch can not be in a index.
comment:15 Changed 3 years ago by cboos
OK, I think that for 0.10.4 / 0.11, we should simply remove that primary key:
- it's not used
- the argument I gave in comment:11 no longer applies I think, as we didn't get any other such subtle bug reported, so I think the code is OK now ;-)
- there's a new cache planned anyway
Changed 3 years ago by cboos
-
attachment
remove_node_change_pk-r5133.diff
added
Simply remove the problematic index
comment:16 Changed 3 years ago by cboos
Please have a try of the attachment:remove_node_change_pk-r5133.diff.
If there's (positive) feedback in time, this may go in 0.10.4.
Changed 3 years ago by mgood
-
attachment
node_change_index.txt
added
comparison of EXPLAIN with and without key
comment:17 Changed 3 years ago by mgood
Well, the key certainly does impact the execution plan of the databases since it provides useful information for the database to index on. See attachment:node_change_index.txt comparing SQLite and PostgreSQL EXPLAIN output with and without the key. I tried EXPLAIN on MySQL as well, but the output didn't seem to contain enough information to be useful. So, we shouldn't be penalizing other databases' performance because of MySQL's limitations. The key has also caught bugs such as #4586 where without it we would have created duplicate records, so I'm not sure it's valid to say "it's not used".
comment:18 Changed 3 years ago by cboos
There's an index on the rev field, which gets used instead of the primary key, as shown in the explain output. This is where the main cost reside, and for that there's no difference. There's possibly a small additional cost due to the ORDER BY path which can't get optimized without an index on the path, but that should be pretty marginal for anything but changesets containing thousands of entries. And for those, we certainly have rendering costs which marginalize the sorting penalty even more.
As for #4586, well that's a good point, though with the new sync code, there shouldn't be any possibility left for even trying to create duplicate node_change entries... but we never know.
Ok, maybe the right thing to do would be to scrap that pk at the mysql_backend.py level only, and leave the general model and the other backends alone. Trouble is, what do we do for the existing MySQL environments? Do we try to "migrate" them or consider that the problem will be fixed only for new environments?
comment:19 Changed 3 years ago by cboos
- Milestone changed from 0.10.4 to 0.10.5
No resolution for 0.10.4 found.
comment:20 Changed 3 years ago by anonymous
Note that this ticket it still mentioned on the 0.10.4 milestone (in the text) which is a little misleading.
comment:21 Changed 3 years ago by cboos
Ok, I'll update this. Note that even if the current fix was not deemed to be enough (see comment:14), the current situation is a bit better than the 0.10.3 one (see comment:13).
Please note that supporting MySQL is not high on my priority list, so I won't make any further attempt to fix this, as I'll concentrate my efforts on the new cache for 0.12 which will make the issue go away. I guess this ticket will end up with a wontfix resolution some day, unless someone comes with a better proposal or there's finally some consensus on comment:15 and comment:16 ;-)
comment:22 Changed 3 years ago by cboos
- Component changed from general to version control
- Milestone changed from 0.10.5 to 0.12
The new repository cache scheduled for milestone:0.12 will fix this and #4378 as well.
Of course, if someone has a better idea in the meantime for 0.11, feel free to submit it. Preferably without involving any database upgrade, so more likely a mysql_backend.py fix only.




Quick Fix
Just after running init via trac-admin execute SQL:
Now, path can contain 255 characters.