#4043 closed defect (fixed)
two concurrent resync attempts can leave the database in an inconsistent state
Reported by: | Christian Boos | Owned by: | Christian Boos |
---|---|---|---|
Priority: | highest | Milestone: | 0.10.4 |
Component: | version control | Version: | 0.10 |
Severity: | blocker | Keywords: | pysqlite, weird, postgresql |
Cc: | brad@… | Branch: | |
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
If this happens, then every further attempt fails with:
```
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 356, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 224, in dispatch
    resp = chosen_handler.process_request(req)
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/web_ui/changeset.py", line 206, in process_request
    chgset = repos.get_changeset(new)
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/cache.py", line 41, in get_changeset
    self.sync()
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/cache.py", line 107, in sync
    (str(current_rev), path, kind, action,
  File "/usr/lib/python2.4/site-packages/trac/db/util.py", line 47, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
  File "/usr/lib/python2.4/site-packages/trac/db/util.py", line 47, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
IntegrityError: duplicate key violates unique constraint "node_change_pk"
```
Workaround: do a `trac-admin $TRAC_ENV resync`.
Change History (20)
comment:1 by , 18 years ago
Component: | general → version control |
---|---|
Owner: | changed from | to
comment:2 by , 18 years ago
From the logs:
```
2006-10-31 11:47:40,425 Trac[cache] INFO: Syncing with repository (4089 to 4090)
2006-10-31 11:47:42,372 Trac[cache] INFO: Syncing with repository (4089 to 4090)
2006-10-31 11:47:42,423 Trac[main] ERROR: duplicate key violates unique constraint "node_change_pk"
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 356, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 224, in dispatch
    resp = chosen_handler.process_request(req)
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/web_ui/log.py", line 163, in process_request
    changes = get_changes(self.env, repos, revs, verbose, req, format)
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/web_ui/util.py", line 37, in get_changes
    changeset = repos.get_changeset(rev)
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/cache.py", line 41, in get_changeset
    self.sync()
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/cache.py", line 107, in sync
    (str(current_rev), path, kind, action,
  File "/usr/lib/python2.4/site-packages/trac/db/util.py", line 47, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
  File "/usr/lib/python2.4/site-packages/trac/db/util.py", line 47, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
IntegrityError: duplicate key violates unique constraint "node_change_pk"
```
comment:3 by , 18 years ago
From IRC:
```
<cmlenz> the select thinks the latest rev is 4089, but the insert says it already exists
<cboos>  exactly
<cmlenz> i.e. 4090 already exists
<cboos>  so that's why I thought about a kind of "write cache" that the select doesn't see
<cmlenz> two resyncs are started but only one is terminated with an error, apparently
         what happened to the other one?
<cboos>  blocked ... "about to commit" but can't as they're still readers on that table ...
         let's call him (1)
         all the other readers see the table content as it was before the insert attempt of (1)
         but when they turn into writers themselves, they write in a kind of cache...
         which already contains the insert of (1), hence the exception begins to make sense ;)
<cmlenz> isolation-level wise that'd be awful
```
comment:4 by , 18 years ago
The PostgreSQL default isolation level is `READ COMMITTED`, and that can explain this behaviour.
I see two ways to avoid this:
- Use the `SERIALIZABLE` isolation level (`SET TRANSACTION ISOLATION LEVEL SERIALIZABLE`). This would force PostgreSQL to behave more like SQLite when it comes to concurrent transactions, but it will severely hurt the db's performance.
- Clever use of the `SELECT .. FOR UPDATE` statement might make it possible to safely resync repositories without hurting the performance too much (while keeping the `READ COMMITTED` isolation level); a minimal sketch of this idea follows below.
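
To illustrate the second idea, here is a minimal sketch of how a `SELECT .. FOR UPDATE` guard could serialize resync attempts under `READ COMMITTED`. This is not Trac's actual code; the `system` table row named `youngest_rev`, the function signature and the connection string are assumptions made purely for illustration.

```python
# Hedged sketch only, not Trac's implementation: serialize resyncs by locking
# a bookkeeping row before caching, so a concurrent resync blocks on the
# SELECT ... FOR UPDATE and then re-reads the committed value instead of
# working from a stale snapshot.  The 'system'/'youngest_rev' row is a
# hypothetical example.
import psycopg2

def resync_once(dsn, repos_youngest):
    conn = psycopg2.connect(dsn)  # e.g. "dbname=trac user=trac"
    try:
        cursor = conn.cursor()
        # Take a row-level lock; a second resync waits here until we commit.
        cursor.execute("SELECT value FROM system WHERE name=%s FOR UPDATE",
                       ('youngest_rev',))
        row = cursor.fetchone()
        cached_youngest = 0
        if row and row[0]:
            cached_youngest = int(row[0])
        for rev in range(cached_youngest + 1, repos_youngest + 1):
            # ... INSERT INTO revision ... / INSERT INTO node_change ... here
            pass
        cursor.execute("UPDATE system SET value=%s WHERE name=%s",
                       (str(repos_youngest), 'youngest_rev'))
        conn.commit()
    finally:
        conn.close()
```

The point of the lock is that the second transaction, once unblocked, sees the first one's committed rows and skips them instead of repeating the INSERTs.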
comment:5 by , 18 years ago
jonas recovered the data that was in the db at the time we kept having the exception:
```
trac=# SELECT max(cast(rev as int)) FROM revision;
 max
------
 4089
(1 row)

trac=# SELECT max(cast(rev as int)) FROM node_change;
 max
------
 4090
(1 row)

trac=# SELECT * FROM node_change WHERE rev=4090;
 rev  |         path         | node_type | change_type |      base_path       | base_rev
------+----------------------+-----------+-------------+----------------------+----------
 4090 | branches/0.10-stable | D         | E           | branches/0.10-stable | 4089
(1 row)
```
Given this data, it's clear why the exception happened; the question now is how we got there.
comment:7 by , 18 years ago
Cc: | added |
---|
#4631 was marked as duplicate.
Brad, before doing the resync, can you check whether the `node_change` and `revision` tables show the same kind of inconsistencies as described above?
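
For reference, a quick way to run that check is to compare the highest cached revision in each table. The following is only a sketch: the database path is an example (adjust it to your environment), and the same two queries work against PostgreSQL with the appropriate driver.

```python
# Sketch of the consistency check discussed above: the highest rev cached in
# node_change should never be ahead of the highest rev in revision.
# The database path is an example; adjust it to your Trac environment.
import sqlite3

conn = sqlite3.connect('/path/to/env/db/trac.db')
cursor = conn.cursor()
cursor.execute("SELECT max(cast(rev AS int)) FROM revision")
max_revision = cursor.fetchone()[0]
cursor.execute("SELECT max(cast(rev AS int)) FROM node_change")
max_node_change = cursor.fetchone()[0]
print("revision: %s, node_change: %s" % (max_revision, max_node_change))
if max_node_change != max_revision:
    print("inconsistent: the two tables disagree")
conn.close()
```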
comment:8 by , 18 years ago
They do have the same inconsistencies. I have not done the resync yet, in case I can help a bit more. My thought is that the server hung, or was restarted in the middle of this write process.
Are these writes to the db tables encased in a transaction? If not, would it be possible to have some transaction mechanism in the pg backend? We'd have to figure out how to ignore it for backends that don't support transactions. Or does Trac already have this? I haven't been through the code in a while.
comment:9 by , 18 years ago
```
poseidon=> select max(cast(rev as int)) from revision;
 max
-----
 162
(1 row)

poseidon=> select max(cast(rev as int)) from node_change;
 max
-----
 163
(1 row)

poseidon=> select * from node_change where rev=4090;
 rev | path | node_type | change_type | base_path | base_rev
-----+------+-----------+-------------+-----------+----------
(0 rows)
```
comment:10 by , 18 years ago
Keywords: | weird added |
---|
If you're using psycopg2, we should already be in a transaction (psycopg:wiki:psycopg2_documentation#setting-transaction-isolation-levels), as jonas explained in comment:4.
What Trac does during the sync (source:trunk/trac/versioncontrol/cache.py@4670#L64) can be summarized like this (a simplified sketch follows the list):
 1. determines which `current_rev` it has to cache
 2. caches the revision metadata:
    ```
    cursor.execute("INSERT INTO revision (rev,time,author,message) "
                   "VALUES (%s,%s,%s,%s)", (str(current_rev), ...
    ```
 3. caches the node changes:
    ```
    cursor.execute("INSERT INTO node_change (rev,path," ...
                   "VALUES (%s,%s,%s,%s,%s,%s)", (str(current_rev), ...
    ```
 4. loops back to 1. until there are no more revisions to cache

and then only commits.
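
A condensed paraphrase of that loop (simplified from the steps above, not the real `cache.py`, with the function signature and attribute names used here being illustrative) shows why two processes that both start from the same cached youngest revision end up attempting the very same INSERTs:

```python
# Simplified paraphrase of the sync steps listed above (not the real cache.py):
# everything runs inside a single transaction that is committed only at the
# very end, so two processes that both read the same cached youngest revision
# will both try to INSERT the same rows, and one of them must hit the
# "duplicate key" error.
def sync(cursor, repos, youngest_cached, youngest_in_repos):
    current_rev = youngest_cached + 1
    while current_rev <= youngest_in_repos:           # 1. next rev to cache
        changeset = repos.get_changeset(current_rev)
        # 2. cache the revision metadata
        cursor.execute("INSERT INTO revision (rev,time,author,message) "
                       "VALUES (%s,%s,%s,%s)",
                       (str(current_rev), changeset.date, changeset.author,
                        changeset.message))
        # 3. cache the node changes
        for path, kind, action, base_path, base_rev in changeset.get_changes():
            cursor.execute("INSERT INTO node_change (rev,path,node_type,"
                           "change_type,base_path,base_rev) "
                           "VALUES (%s,%s,%s,%s,%s,%s)",
                           (str(current_rev), path, kind, action,
                            base_path, base_rev))
        current_rev += 1                              # 4. loop until done
    # the caller commits only now; a sync that started in the meantime still
    # sees the old youngest revision and repeats the same INSERTs
```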
So far it's been quite mysterious for me why the `revision` table can end up with e.g. rev 162 in your case, while the `node_change` table already contains rev 163.
Even if we consider two or more concurrent resync attempts, I can't come up with a scenario in which 163 can be written and committed in the `node_change` table but not in the `revision` table.
Anyway, we need to avoid concurrent resync attempts altogether, see #4586. Still, it would be satisfying to understand what's going on…
comment:11 by , 18 years ago
I just attempted the resync:
trac-admin /var/trac/poseidon resync
It hung there for quite a while, and I began to see the familiar pile up of PG connections that eventually brings down my server due to mod_python processes being spawned, etc.
See: http://rafb.net/p/9Epln718.html for the ps listing - the first three poseidon pg connections are the resync, which hung and never returned.
I am fairly sure that there were no concurrent resync attempts happening, because I stopped Apache and restarted Postgres fresh.
Here's svn info for my PROD instance:
```
> svn info
Path: .
URL: http://svn.edgewall.com/repos/trac/branches/0.10-stable
Repository UUID: af82e41b-90c4-0310-8c96-b1721e28e2e2
Revision: 4387
```
hth
comment:12 by , 18 years ago (follow-up: comment:13)
Doing a resync in this situation can also fail:
```
done 12400
done 12500
done 12600
done 12700
done 12800
done 12900
done 13000
Command failed: ERROR:  duplicate key violates unique constraint "revision_pkey"
```
comment:13 by , 18 years ago
Replying to wichert@wiggy.net:
Doing a resync in this situation can also fail: … Command failed: ERROR: duplicate key violates unique constraint "revision_pkey"
Sounds like something different (the constraint above is "revision_pkey", whereas it is "node_change_pk" in this ticket). If you're using the Perforce plugin, it's likely a duplicate of #3885. If you're using Subversion, can you open another ticket for this one? Please include the resync listing with DEBUG level information (see TracLogging).
comment:14 by , 18 years ago
Priority: | normal → highest |
---|
Also happened with pysqlite2 on http://paludis.pioto.org/trac:
```
$ svn ls -v svn://svn.pioto.org/paludis/
   2438 pioto                 Feb 10 18:28 branches/
   2528 spb                   Feb 27 03:46 overlay/
   1635 spb                   Oct 24 18:40 repo-templates/
   2530 dleverto              Feb 27 15:18 scratch/
   2508 dleverto              Feb 23 20:57 tags/
   2532 rbrown                Feb 27 21:11 trunk/
```
So latest is r2532.
In the db/trac.db:
```
sqlite3> select max(cast(rev as int)) from revision;
2529
sqlite3> select max(cast(rev as int)) from node_change;
2530
```
comment:15 by , 18 years ago
Keywords: | pysqlite added |
---|---|
Severity: | critical → blocker |
Status: | new → assigned |
While it's still not understood how the above situation can happen, I've rewritten the sync code anyway in order to prevent too many concurrent sync attempts.
I believe that the above patch will also fix this issue by drastically reducing the race conditions.
However, I'm increasing the severity of this bug to blocker, as I think it's crucial that we fix it for the next release (0.10.4):
- it happens with PostgreSQL but also with PySqlite
- when it happens, the Trac instance becomes completely unusable, as we're attempting the sync (and hence triggering the error) in the pre_request_filter
- the only way to recover is to have an admin do a `resync` (or, alternatively, delete the latest entries in the node_change table, e.g. `delete from node_change where rev=2530` for the above example); a sketch of that cleanup follows below
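
For the last bullet, here is a rough sketch of that manual cleanup (SQLite flavour; the database path is just an example and the generalization to "every rev ahead of the revision table" is mine, so back up the file and adapt the query before trying anything like this):

```python
# Hedged sketch of the manual cleanup mentioned above, not an official tool:
# delete node_change rows for revisions that never made it into the revision
# table, so the next sync (or `trac-admin $TRAC_ENV resync`) can re-cache them.
# The database path is an example; back up the file before running this.
import sqlite3

conn = sqlite3.connect('/path/to/env/db/trac.db')
cursor = conn.cursor()
cursor.execute("SELECT max(cast(rev AS int)) FROM revision")
youngest_cached = cursor.fetchone()[0] or 0
cursor.execute("DELETE FROM node_change WHERE cast(rev AS int) > ?",
               (youngest_cached,))
conn.commit()
conn.close()
```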
comment:16 by , 18 years ago
Milestone: | 0.10.5 → 0.10.4 |
---|
comment:17 by , 18 years ago
Summary: | PostgreSQL can trip over two concurrent resync attempts → two concurrent resync attempts can leave the database in an inconsistent state |
---|
comment:18 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
With r4965 and r4966, `CachedRepository.sync()` now uses a different strategy, which I think is immune to the kind of issue shown in this ticket.
I tested the new code extensively with multiple server processes and many concurrent clients, and repeated this with the three database backends we currently support. With the final version that was committed, I couldn't detect any problem, so I'm quite hopeful that the issue is gone.
comment:20 by , 10 years ago
Keywords: | postgresql added; postgres removed |
---|
Well, I fail to understand what could have gone wrong in the cache code… We were too quick to do the `resync`… so let's hope this happens again, and then we should look at the contents of the `revision` table and at the handling of `current_rev` in `cache.py`.