Context Navigation

Modify ↓

#8067 closed defect (duplicate)

Stopped sync in 0.11.2.1

Reported by:	Christian Boos	Owned by:	Christian Boos
Priority:	normal	Milestone:
Component:	version control	Version:	0.11-stable
Severity:	major	Keywords:	cache, postgresql, mysql, pysqlite
Cc:	eric.petit@…, techtonik@…, dale.miller@…	Branch:
Release Notes:
API Changes:
Internal Changes:

Description

As reported by Anatoly on googlegroups:trac-dev:401827e96364f901, there can be situations where versioncontrol cache tables in the database are left in an inconsistent state, from which it is impossible to resume synchronization.

… It turned out to be locked cache after timeout. At first I thought about a race condition, but it was false. The cache can easily be jammed with a long resync from repository on a busy server. It took about 40 seconds to sync 200 revisions before timeout occurred. Take a look at cache.py sync(): On line 159 we check if a revision is being entered in DB by looking if next revision number is already present in revisions table. That means another thread is made an INSERT in line 184. The cache is now locked. To unlock it, another thread should update youngest revision at line 221.
So, the problem is when a timeout occurs during execution of 202:221 - there is an update in node_change table that can potentially take a long time to experience a timeout (esp. when server is under a heavy load). There are also cset.get_changes() and self.repos.next_rev(next_youngest) calls that can contribute to the issue. I haven't found other way to recover than to manually patch DB.

Also, with a synchronization attempt on every request, this can also lead to a serious slowdown (#7490).

Attachments (1)

t8067-stopped-sync-r7921.diff (1.0 KB ) - added by Christian Boos 16 years ago.: Take care to do a rollback in case something went wrong.

Download all attachments as: .zip

Change History (20)

comment:1 by Eric Petit <eric.petit@…>, 16 years ago

Cc:	eric.petit@… added

by Christian Boos, 16 years ago

Attachment:	t8067-stopped-sync-r7921.diff added

Take care to do a rollback in case something went wrong.

follow-up: 3 comment:2 by Christian Boos, 16 years ago

The t8067-stopped-sync-r7921.diff change should be more robust in case of an exception raised within the 1.2 - 1.5 part of the resync.

If the finally clause won't execute because the process is killed, then I think it's safe to assume there will also be no follow-up commit of the changes. Hopefully this doesn't leave a stale lock ("operation in progress" kind).

in reply to: 2 comment:3 by Christian Boos, 16 years ago

Hopefully this doesn't leave a stale lock ("operation in progress" kind).

That doesn't seem to be case on Windows Vista, Python 2.5.4 and its default sqlite3. tracd was killed from the task manager while it was populating the node_change for a big changeset, and a follow-up run of tracd managed to resume sync.

It would be nice to get confirmation that this works also for other platforms/databases.

comment:4 by Christian Boos, 16 years ago

I should note that I have actually not been able to reproduce the original problem. Also, in normal circumstances, if an exception occurred during the sync, then the connection wrapper would be garbage collected and a rollback should happen anyway before the actual connection could be reused through the pool. Weird.

comment:5 by Christian Boos, 16 years ago

See also #3444, another weird commit that happened when it shouldn't have, with SQLite. Also, Anatoly mentioned that he was using MySQL, so maybe there's a "transaction-less" mode there which could explain the behavior in his case.

comment:6 by anonymous, 16 years ago

My MySQL uses MyISAM tables. I've rechecked documentation and it appeared that they lack multi-statement transaction support. See the comparison at http://dev.mysql.com/tech-resources/articles/storage-engine/part_3.html

Perhaps MySQL pages in Trac installation manual should enforce users to create InnoDB tables instead of MyISAM. Maybe it is possible to add a warning in logs or convert tables on upgrade automatically?

comment:7 by anatoly techtonik <techtonik@…>, 16 years ago

It ^^^ was me. SPAM filter is a real annoyance.

Last edited 4 years ago by Ryan J Ollos (previous) (diff)

comment:8 by anatoly techtonik <techtonik@…>, 16 years ago

Cc:	techtonik@… added

comment:9 by Christian Boos, 16 years ago

Thanks for the precisions Anatoly. The behavior of MyISAM tables would indeed explain what happened in your case. I agree that we should make the MySQL requirements clear about that and add a run-time check as you suggested in #8089 (for the automatic upgrade, I'm less convinced).

Now, the big question is: could this possibly happen with other backends as well?

One possible scenario with PySqlite would be that "someone" (i.e. a plugin) sets the db connection in autocommit isolation level, then that connection is pushed back in the pool and later reused for sync().

For PostgreSQL, the sync() algorithm introduced in #4586 was trying to minimize the risks of inconsistencies, which used to happen in the 0.10.x days (#4043). In particular (ticket:4586#comment:8, point 2) there was the assumption that concurrent insert attempts using the same primary key will result in only one attempt to succeed, the other to fail. Though documentation on this is scarce, I've found the following precision (Tom Lane's paper http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/concurrency.pdf, slide 31). Quoting:

If two concurrent transactions try to insert rows having the same key value, then the second one will block until the first one finishes. If the first transaction commits, the second one must abort because of the uniqueness constraint; but if the first one aborts the second one can proceed.

So while it's not exactly how I supposed it worked (no immediate exception when inserting a duplicate key), in practice it's the same: the exception will be deferred until the other attempts are committed, or execution is resumed if other attempts have aborted.

So in theory we should be safe on the PostgreSQL side, even though I have the feeling that we traded our weird update problem between the revision and node_change tables for a similar one between the revision and the system one…

comment:10 by Christian Boos, 16 years ago

To be clear, the "weird" problem common to #4043 and this ticket is the following.

Let's say we start with the value 122 in both table a and b (which for this example are simple tables like (CREATE TABLE a (i INT PRIMARY KEY)).

We then have two (or more) concurrent executions of the following code:

BEGIN;
 INSERT INTO a VALUES (123);
 INSERT INTO b VALUES (123);
COMMIT;

(and there's no other possible kind of update for the above tables)

Then how can we possibly end up with 123 into a, but still have the previous value 122 in b? i.e. how could only part of the transaction succeed?

This can be explained if there's actually a COMMIT happening behind scene after each INSERT (pysqlite's autocommit) or if the BEGIN/COMMIT are no-ops and each successful INSERT is actually persistent (MySQL MyISAM tables).

But there was no real explanation for when this happened in a PostgreSQL database (see in particular ticket:4043#comment:9 and 10), where transactions are supposed to be reliable ;-)

comment:11 by Jonas Borgström, 16 years ago

So are we ready to close thi ticket now? Or is there anything left to investigate?

At least until someone manages to reproduce something similar to ticket:4043#comment:9

follow-up: 19 comment:12 by Christian Boos, 16 years ago

Keywords:	postgresql mysql pysqlite added
Milestone:	0.11.4 → not applicable
Priority:	high → normal

The only satisfying answer for this issue at this time is for MySQL - ISAM. Whether we warn or not when such engine is used should be discussed in #8089.

For the other backends, it might be a problem with the db or the bindings, but as we don't know yet I'm a bit reluctant to close the ticket, so I think we'll leave it opened as a 3^rd party issue.

comment:13 by Dale Miller <dale.miller@…>, 16 years ago

Cc:	dale.miller@… added

comment:14 by anatoly techtonik <techtonik@…>, 13 years ago

Got the same problem again. Needed to resync 5000+ revisions, request got killed during the process.

comment:15 by anatoly techtonik <techtonik@…>, 13 years ago

I see that my MySQL tables are again in MyISAM. Trac upgrade procedure should execute SET storage_engine=INNODB; or set InnoDB tables explicitly.

comment:16 by anatoly techtonik <techtonik@…>, 13 years ago

For the reference, here is how to manually unwedge the sync:

$ tail log/trac.log
2012-12-23 16:29:43,544 Trac[cache] INFO: repos rev [9238] != cached rev [5770]
2012-12-23 16:29:43,566 Trac[api] INFO: Synchronized '(default)' repository in 0.08 seconds
2012-12-23 16:29:46,320 Trac[cache] INFO: repos rev [9238] != cached rev [5770]
2012-12-23 16:29:46,323 Trac[api] INFO: Synchronized '(default)' repository in 0.02 seconds
$ mysql -h host -u user -p dbname
DELETE FROM `node_change` WHERE `rev` > 5750;
DELETE FROM `revision` WHERE `rev` > 5750;
UPDATE `system` SET `value` = 5750 WHERE `name` = 'youngest_rev' LIMIT 1;

comment:17 by Ryan J Ollos, 10 years ago

Keywords:	cache postgresql mysql pysqlite → cache, postgresql, mysql, pysqlite

comment:18 by figaro, 10 years ago

If the MyISAM storage engine is the root cause of the issue, then I suggest to look into the patch at #12097.

in reply to: 12 comment:19 by Ryan J Ollos, 4 years ago

Milestone:	not applicable
Resolution:	→ duplicate
Status:	new → closed

Replying to Christian Boos:

The only satisfying answer for this issue at this time is for MySQL - ISAM. Whether we warn or not when such engine is used should be discussed in #8089.

This was done in #10993, so closing the ticket.

We can make another issue if we find evidence of the behavior for other database types.

Modify Ticket

Change Properties

Summary:
Description:	As reported by Anatoly on googlegroups:trac-dev:401827e96364f901, there can be situations where versioncontrol cache tables in the database are left in an inconsistent state, from which it is impossible to resume synchronization. ... It turned out to be locked cache after timeout. At first I thought about a race condition, but it was false. The cache can easily be jammed with a long resync from repository on a busy server. It took about 40 seconds to sync 200 revisions before timeout occurred. Take a look at [source:/branches/0.11-stable/trac/versioncontrol/cache.py@7889#L159 cache.py] sync(): On line 159 we check if a revision is being entered in DB by looking if next revision number is already present in `revisions` table. That means another thread is made an INSERT in line 184. The cache is now locked. To unlock it, another thread should update youngest revision at line 221. [[br]] So, the problem is when a timeout occurs during execution of 202:221 - there is an update in `node_change` table that can potentially take a long time to experience a timeout (esp. when server is under a heavy load). There are also cset.get_changes() and self.repos.next_rev(next_youngest) calls that can contribute to the issue. I haven't found other way to recover than to manually patch DB. Also, with a synchronization attempt on every request, this can also lead to a serious slowdown (#7490). You may use WikiFormatting here.
Type:		Priority:
Milestone:		Component:
Version:		Severity:
Keywords:		Cc:	Set your email in Preferences
Branch:
Release Notes:
API Changes:
Internal Changes:

Action

leave as closed The owner will remain Christian Boos.

reopen The resolution will be deleted. Next status will be 'reopened'.

change ownership to The owner will be changed from Christian Boos to the specified user.

Add Comment

Your email or username:

E-mail address and name can be saved in the Preferences .

You may use WikiFormatting here.

Attachments ↑ Description ↑

Note: See TracTickets for help on using tickets.

Download in other formats: