Edgewall Software

Ticket #3676 (reopened defect)

Opened 2 years ago

Last modified 19 months ago

MySQL: Primary keys are not well designed

Reported by: Martin Burger <mburger@…> Owned by: cboos
Priority: normal Milestone: 0.13
Component: version control Version: 0.10b1
Severity: major Keywords: mysql utf8 primary key
Cc:

Description

The automated computation of primary key sizes in mysql_backend.py (compare #3673) is not well designed. For example: primary key of table node_change.

PRIMARY KEY  (`rev`(111),`path`(111),`change_type`(111)

In this way, all three keys have a maximal length of 111 characters. Now, we have a too strict constraint on the length of pathes in Subversion: A path cannot be longer than 111 characters. If a revision contains too long pathes, you will get an "1062 - Duplicate entry" error (the path is truncated).

The change_type is just one char long, and I have never seen a revision nummber that is 111 digits long. ;-)

In general, I think you should use ids instead of composed primary keys.

Attachments

trac-mysql_backend.diff (1.1 KB) - added by Dirk Datzert <dummy@…> 22 months ago.
My quick version of the fix
remove_node_change_pk-r5133.diff (2.9 KB) - added by cboos 20 months ago.
Simply remove the problematic index
node_change_index.txt (8.6 KB) - added by mgood 20 months ago.
comparison of EXPLAIN with and without key

Change History

  Changed 2 years ago by Martin Burger <mburger@…>

Quick Fix

Just after running init via trac-admin execute SQL:

ALTER TABLE `node_change` DROP PRIMARY KEY ,
ADD PRIMARY KEY ( `rev` ( 39 ) , `path` ( 255 ) , `change_type` ( 39 ) ) ;

Now, path can contain 255 characters.

  Changed 2 years ago by Martin Burger <mburger@…>

  • keywords mysql utf8 primary key added

  Changed 2 years ago by mgood

The rev column does need to be pretty long, since many patch-based version control systems (Darcs, Mercurial, etc.) use hashes as the unique identifier of a patch which can be quite long.

  Changed 2 years ago by Martin Burger <mburger@…>

Mmh, the MySQL developers classify that problem as a 'limitation', not as a bug.

So, the only solution may be another database schema? I know, in theory such a combined primary key (or multiple-column index in the term of MySQL) is fine, but in practice there may be other issues about that. However, AFAIK Oracle allows very long primary keys. But for the MySQL database provider (mysql_backend) this definitively poses a problem.

  Changed 2 years ago by cboos

  • owner changed from jonas to cboos
  • milestone set to 0.10.1

If we have #3778 (conversion of PRIMARY KEY to a non-unique index), then we could also use partial columns for creating the index.

  Changed 2 years ago by cboos

  • severity changed from normal to major

So I think I'm going to apply attachment:ticket:3778:node_change_upgrade-r3787.patch, even if this will imply a schema update for 0.10.1.

follow-up: ↓ 8   Changed 2 years ago by cboos

What we could perhaps do is a 'no-op' upgrade for the backends that don't need that change, and simply increment the database_schema number.

That way, the operation will be quick and painless for SQLite and PostgreSQL users, as there will be no need to resync afterwards.

in reply to: ↑ 7   Changed 2 years ago by jonas

Replying to cboos:

What we could perhaps do is a 'no-op' upgrade for the backends that don't need that change, and simply increment the database_schema number. That way, the operation will be quick and painless for SQLite and PostgreSQL users, as there will be no need to resync afterwards.

Can this really be a no-op for the other db backends? I think we need to remove the primary key for all backends to avoid node_change having a different schema depending on if the env was created before or after this schema update.

On the other hand it sucks that we need to remove a perfectly functional primary key just because of this MySQL limitation. But mainting a backend-specific schema will probably be too much work.

  Changed 2 years ago by mgood

Well, I think we could make more effective use of the primary key without dropping it entirely. Obviously the change_type and rev columns should use less bytes than the path, so we could either make the MySQL prefix length calculation better, or explicitly the prefix length in the schema definition.

We could also reduce the storage used by the rev column by changing the charset to latin1, which should be fine for storing the numeric or hexadecimal values used as version control identifiers. The hashes all seem to be based on SHA1 which can be represented as 40 hex digits. Darcs adds some other things to the hash like a timestamp bringing the size up to 64 digits. So, somewhere between 64 and 128 bytes should be reasonable.

So if we used 1 byte for change_type and 128 for rev we could increase the path prefix length to 290 which would be a big improvement.

  Changed 2 years ago by mgood

Hrm, we could also get away without using the full length of the rev column in the key, since it's much less likely to have collisions if it's truncated than the path. So cutting the rev prefix to 39 would allow for almost the full length of a SHA1 hash, and allow 320 characters for the path key prefix.

Also if we could detect the database's character set we could use a longer prefix for the path on latin1 so that we used the full 1000 bytes for the key.

follow-up: ↓ 12   Changed 2 years ago by cboos

Well, no matter how you set the limit, you'll find users that have longer paths that you think is reasonable ;) So, to workaround this limitation of MySQL, the path can't be part of a fixed length primary key, no matter how you cut it.

OTOH, I don't see the reason why the other backends should be penalized from the deficiencies of MySQL in that respect. Having a pk on the path already helped to detect one subtle bug (#3778), so it's a good thing to have IMO.

My preferred solution would be that only the MySQL backend works around this limitation and uses an index (allowing duplicates) instead of a primary key here.

I don't really agree with jonas when he says that this approach would involve maintaining a backend-specific schema: the data model remains the same, it's just about an index, which is primarily here for performance reasons, the uniqueness constraint being only an (useful) side-effect, not something we rely on.

Anyway, when a table gets really modified, we usually blow away the table and recreate it, so the exact kind of index used won't matter.

in reply to: ↑ 11   Changed 2 years ago by jonas

Replying to cboos:

I don't really agree with jonas when he says that this approach would involve maintaining a backend-specific schema: the data model remains the same, it's just about an index, which is primarily here for performance reasons, the uniqueness constraint being only an (useful) side-effect, not something we rely on.

Anyway, when a table gets really modified, we usually blow away the table and recreate it, so the exact kind of index used won't matter.

True, as long as the backend-specific deviation is limited to different index/primary key configurations or other things that doesn't affect the data model I think we can handle it.

Changed 22 months ago by Dirk Datzert <dummy@…>

My quick version of the fix

  Changed 22 months ago by cboos

  • status changed from new to closed
  • resolution set to fixed

patch applied in r4755 (trunk) and r4756 (0.10-stable).

For 0.12 and the VcRefactoring, care must be taken to avoid the problem of long primary keys, so that we can revert the special casing of column names (but not the s/500/333/ part, of course).

  Changed 20 months ago by Hans

  • status changed from closed to reopened
  • resolution fixed deleted

Cboos its not a patch. we use long file names and still have it.

This ssue should be solved the correct way. Not the dirty one which is in the patch.

A patch can not be in a index.

  Changed 20 months ago by cboos

OK, I think that for 0.10.4 / 0.11, we should simply remove that primary key:

  1. it's not used
  2. the argument I gave in comment:11 no longer applies I think, as we didn't get any other such subtle bug reported, so I think the code is OK now ;-)
  3. there's a new cache planned anyway

Changed 20 months ago by cboos

Simply remove the problematic index

  Changed 20 months ago by cboos

Please have a try of the attachment:remove_node_change_pk-r5133.diff.

If there's (positive) feedback in time, this may go in 0.10.4.

Changed 20 months ago by mgood

comparison of EXPLAIN with and without key

  Changed 20 months ago by mgood

Well, the key certainly does impact the execution plan of the databases since it provides useful information for the database to index on. See attachment:node_change_index.txt comparing SQLite and PostgreSQL EXPLAIN output with and without the key. I tried EXPLAIN on MySQL as well, but the output didn't seem to contain enough information to be useful. So, we shouldn't be penalizing other databases' performance because of MySQL's limitations. The key has also caught bugs such as #4586 where without it we would have created duplicate records, so I'm not sure it's valid to say "it's not used".

  Changed 20 months ago by cboos

There's an index on the rev field, which gets used instead of the primary key, as shown in the explain output. This is where the main cost reside, and for that there's no difference. There's possibly a small additional cost due to the ORDER BY path which can't get optimized without an index on the path, but that should be pretty marginal for anything but changesets containing thousands of entries. And for those, we certainly have rendering costs which marginalize the sorting penalty even more.

As for #4586, well that's a good point, though with the new sync code, there shouldn't be any possibility left for even trying to create duplicate node_change entries... but we never know.

Ok, maybe the right thing to do would be to scrap that pk at the mysql_backend.py level only, and leave the general model and the other backends alone. Trouble is, what do we do for the existing MySQL environments? Do we try to "migrate" them or consider that the problem will be fixed only for new environments?

  Changed 20 months ago by cboos

  • milestone changed from 0.10.4 to 0.10.5

No resolution for 0.10.4 found.

  Changed 20 months ago by anonymous

Note that this ticket it still mentioned on the 0.10.4 milestone (in the text) which is a little misleading.

  Changed 20 months ago by cboos

Ok, I'll update this. Note that even if the current fix was not deemed to be enough (see comment:14), the current situation is a bit better than the 0.10.3 one (see comment:13).

Please note that supporting MySQL is not high on my priority list, so I won't make any further attempt to fix this, as I'll concentrate my efforts on the new cache for 0.12 which will make the issue go away. I guess this ticket will end up with a wontfix resolution some day, unless someone comes with a better proposal or there's finally some consensus on comment:15 and comment:16 ;-)

  Changed 19 months ago by cboos

  • component changed from general to version control
  • milestone changed from 0.10.5 to 0.12

The new repository cache scheduled for milestone:0.12 will fix this and #4378 as well.

Of course, if someone has a better idea in the meantime for 0.11, feel free to submit it. Preferably without involving any database upgrade, so more likely a mysql_backend.py fix only.

Add/Change #3676 (MySQL: Primary keys are not well designed)

Author



Change Properties
<Author field>
Action
as reopened
as The resolution will be set. Next status will be 'closed'
to The owner will change from cboos. Next status will be 'new'
 
Note: See TracTickets for help on using tickets.