
Opened 12 years ago

Closed 11 years ago

Last modified 3 years ago

#4043 closed defect (fixed)

two concurrent resync attempts can leave the database in an inconsistent state

Reported by: Christian Boos Owned by: Christian Boos
Priority: highest Milestone: 0.10.4
Component: version control Version: 0.10
Severity: blocker Keywords: pysqlite, weird, postgresql
Cc: brad@…
Release Notes:
API Changes:

Description

If this happens, every further sync attempt fails with:

Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 356, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 224, in dispatch
    resp = chosen_handler.process_request(req)
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/web_ui/changeset.py", line 206, in process_request
    chgset = repos.get_changeset(new)
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/cache.py", line 41, in get_changeset
    self.sync()
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/cache.py", line 107, in sync
    (str(current_rev), path, kind, action,
  File "/usr/lib/python2.4/site-packages/trac/db/util.py", line 47, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
  File "/usr/lib/python2.4/site-packages/trac/db/util.py", line 47, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
IntegrityError: duplicate key violates unique constraint "node_change_pk"

Workaround: run trac-admin $TRAC_ENV resync.

Attachments (0)

Change History (20)

comment:1 Changed 12 years ago by Christian Boos

Component: general → version control
Owner: changed from Jonas Borgström to Christian Boos

Well, I fail to understand what could have gone wrong in the cache code…

We were too quick to do the resync… so let's hope this happens again; when it does, we should:

  • check what's in the revision table
  • check what is current_rev in cache.py

comment:2 Changed 12 years ago by Christian Boos

From the logs:

2006-10-31 11:47:40,425 Trac[cache] INFO: Syncing with repository (4089 to 4090)
2006-10-31 11:47:42,372 Trac[cache] INFO: Syncing with repository (4089 to 4090)
2006-10-31 11:47:42,423 Trac[main] ERROR: duplicate key violates unique constraint "node_change_pk"
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 356, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 224, in dispatch
    resp = chosen_handler.process_request(req)
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/web_ui/log.py", line 163, in process_request
    changes = get_changes(self.env, repos, revs, verbose, req, format)
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/web_ui/util.py", line 37, in get_changes
    changeset = repos.get_changeset(rev)
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/cache.py", line 41, in get_changeset
    self.sync()
  File "/usr/lib/python2.4/site-packages/trac/versioncontrol/cache.py", line 107, in sync
    (str(current_rev), path, kind, action,
  File "/usr/lib/python2.4/site-packages/trac/db/util.py", line 47, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
  File "/usr/lib/python2.4/site-packages/trac/db/util.py", line 47, in execute
    return self.cursor.execute(sql_escape_percent(sql), args)
IntegrityError: duplicate key violates unique constraint "node_change_pk"

comment:3 Changed 12 years ago by Christian Boos

From the IRC discussion:

<cmlenz>	the select thinks the latest rev is 4089, but the insert says it already exists
<cboos>	exactly
<cmlenz>	i.e. 4090 already exists
<cboos>	so that's why I thought about a kind of "write cache" that the select doesn't see
<cmlenz>	two resyncs are started but only one is terminated with an error, apparently
	what happened to the other one?
<cboos>	blocked ... "about to commit" but can't, as the others are still readers on that table ...
	let's call him (1)
	all the other readers see the table content as it was before the insert attempt of (1)
	but when they turn into writers themselves, they write in a kind of cache... which already contains the insert of (1), hence the exception
	begins to make sense ;)
<cmlenz>	isolation-level wise that'd be awful
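
The race described in this exchange can be sketched with two psycopg2 connections under PostgreSQL's default READ COMMITTED level. This is a minimal illustration only: the dbname and the inserted values are made up for the example, and in reality the two transactions belong to separate web front-end processes.

    # Illustration only: two connections play the two concurrent resyncs.
    import psycopg2

    conn_a = psycopg2.connect("dbname=trac")  # resync attempt (1)
    conn_b = psycopg2.connect("dbname=trac")  # concurrent resync attempt (2)
    cur_a, cur_b = conn_a.cursor(), conn_b.cursor()

    # Both transactions read before either commits, so both see max rev 4089
    # and both conclude that r4090 still has to be cached:
    cur_a.execute("SELECT max(cast(rev AS int)) FROM revision")
    cur_b.execute("SELECT max(cast(rev AS int)) FROM revision")

    insert = ("INSERT INTO node_change (rev,path,node_type,change_type,"
              "base_path,base_rev) VALUES (%s,%s,%s,%s,%s,%s)")
    args = ('4090', 'branches/0.10-stable', 'D', 'E',
            'branches/0.10-stable', '4089')
    cur_a.execute(insert, args)
    conn_a.commit()  # (1) wins the race

    try:
        # Had (1) not committed yet, this INSERT would block on the unique
        # index until it did; either way it then fails with the error from
        # this ticket: duplicate key violates unique constraint "node_change_pk"
        cur_b.execute(insert, args)
    except psycopg2.IntegrityError:
        conn_b.rollback()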

comment:4 Changed 12 years ago by Jonas Borgström

The PostgreSQL default isolation level is READ COMMITTED, and that can explain this behaviour.

I see two ways to avoid this:

  1. Use the SERIALIZABLE isolation level. This would force PostgreSQL to behave more like SQLite when it comes to concurrent transactions, but it would severely hurt the db's performance.
     SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

  2. Clever use of the SELECT .. FOR UPDATE statement might make it possible to safely resync repositories without hurting the performance too much, while keeping the READ COMMITTED isolation level; a sketch follows below.
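
A minimal sketch of what option 2 could look like, assuming a hypothetical youngest_rev row in the system table as the synchronization point (an illustration of the idea, not the actual fix; db, cursor and repos stand for the usual environment database connection and repository objects):

    # Sketch only: assumes a hypothetical 'youngest_rev' row in the system
    # table. Whichever transaction locks the row first performs the sync;
    # a concurrent transaction blocks on the lock, and by the time it
    # acquires it, the committed value tells it there is nothing left to do.
    cursor.execute("SELECT value FROM system WHERE name=%s FOR UPDATE",
                   ('youngest_rev',))
    youngest_cached = int(cursor.fetchone()[0])
    repos_youngest = repos.youngest_rev  # youngest revision in the repository
    for rev in xrange(youngest_cached + 1, repos_youngest + 1):
        pass  # INSERT INTO revision ... and INSERT INTO node_change ... for rev
    cursor.execute("UPDATE system SET value=%s WHERE name=%s",
                   (str(repos_youngest), 'youngest_rev'))
    db.commit()  # releases the row lock

Note that SQLite doesn't support SELECT ... FOR UPDATE, so this path would have to be PostgreSQL-specific; SQLite's database-wide write lock already serializes writers.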

comment:5 Changed 12 years ago by Christian Boos

Jonas recovered the data that was in the db at the time we kept getting the exception:

trac=# SELECT max(cast(rev as int)) FROM revision;
 max  
------
 4089
(1 row)
 
trac=# SELECT max(cast(rev as int)) FROM node_change;
 max  
------
 4090
(1 row)
 
trac=# SELECT * FROM node_change WHERE rev=4090;
 rev  |         path         | node_type | change_type |      base_path       | base_rev 
------+----------------------+-----------+-------------+----------------------+----------
 4090 | branches/0.10-stable | D         | E           | branches/0.10-stable | 4089
(1 row)

Given this data, it's clear why the exception happened; the question now is how we got there.

comment:6 Changed 12 years ago by sid

Similar to the problem seen in #3831.

comment:7 Changed 11 years ago by Christian Boos

Cc: brad@… added

#4631 was marked as a duplicate. Brad, before doing the resync, can you check whether the node_change and revision tables show the same kind of inconsistencies as described above?

comment:8 Changed 11 years ago by Brad Anderson <brad@…>

They do have the same inconsistencies. I have not done the resync yet, in case I can help a bit more. My thought is that the server hung, or was restarted in the middle of this write process.

Are these writes to the db tables encased in a transaction? If not, would it be possible to have some transaction mechanism in the pg backend? We'd have to figure out how to ignore it for backends that don't support transactions. Or does Trac already have this? I haven't been through the code in a while.

comment:9 Changed 11 years ago by Brad Anderson <brad@…>

poseidon=> select max(cast(rev as int)) from revision;
 max
-----
 162
(1 row)

poseidon=> select max(cast(rev as int)) from node_change;
 max
-----
 163
(1 row)

poseidon=> select * from node_change where rev=4090;
 rev | path | node_type | change_type | base_path | base_rev
-----+------+-----------+-------------+-----------+----------
(0 rows)

comment:10 Changed 11 years ago by Christian Boos

Keywords: weird added

If you're using psycopg2, we should already be in a transaction (psycopg:wiki:psycopg2_documentation#setting-transaction-isolation-levels), as Jonas explained in comment:4.

What Trac does during the sync (source:trunk/trac/versioncontrol/cache.py@4670#L64) can be summarized like this:

  1. determines which current_rev it has to cache
  2. caches the revision metadata:
    cursor.execute("INSERT INTO revision (rev,time,author,message) "
                   "VALUES (%s,%s,%s,%s)", (str(current_rev), ...
    
  3. caches the node changes:
    cursor.execute("INSERT INTO node_change (rev,path," ...
                   "VALUES (%s,%s,%s,%s,%s,%s)",
                   (str(current_rev), ...
    
  4. loops back to 1. until there are no more revisions to cache

and then only commits.
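
In code, the loop can be paraphrased like this (condensed and with my own variable names, not a verbatim copy of cache.py; next_rev(), get_changeset() and get_changes() are the versioncontrol API methods, and everything runs inside a CachedRepository method with db and cursor in scope):

    # Paraphrased sync loop: everything runs in one transaction,
    # with a single commit at the very end.
    next_youngest = self.repos.next_rev(youngest_stored)    # 1. next rev to cache
    while next_youngest is not None:
        changeset = self.repos.get_changeset(next_youngest)
        cursor.execute("INSERT INTO revision (rev,time,author,message) "
                       "VALUES (%s,%s,%s,%s)",              # 2. revision metadata
                       (str(next_youngest), changeset.date,
                        changeset.author, changeset.message))
        for path, kind, action, base_path, base_rev in changeset.get_changes():
            cursor.execute("INSERT INTO node_change (rev,path,node_type,"
                           "change_type,base_path,base_rev) "
                           "VALUES (%s,%s,%s,%s,%s,%s)",    # 3. node changes
                           (str(next_youngest), path, kind, action,
                            base_path, base_rev))
        next_youngest = self.repos.next_rev(next_youngest)  # 4. loop
    db.commit()  # the only commit, after the whole loop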

So far it's been quite mysterious to me why the revision table can end up with e.g. rev 162 in your case, while the node_change table already contains rev 163. Even if we consider two or more concurrent resync attempts, I can't come up with a scenario in which 163 can be written and committed in the node_change table but not in the revision table.

Anyway, we need to avoid concurrent resync attempts altogether, see #4586. Still, it would be satisfying to understand what's going on…

comment:11 Changed 11 years ago by Brad Anderson <brad@…>

I just attempted the resync:

trac-admin /var/trac/poseidon resync

It hung there for quite a while, and I began to see the familiar pile-up of PG connections that eventually brings down my server as more mod_python processes are spawned, etc.

See: http://rafb.net/p/9Epln718.html for the ps listing - the first three poseidon pg connections are the resync, which hung and never returned.

I am fairly sure that there were no concurrent resync attempts happening, because I stopped Apache and restarted Postgres fresh.

Here's svn info for my PROD instance:

> svn info
Path: .
URL: http://svn.edgewall.com/repos/trac/branches/0.10-stable
Repository UUID: af82e41b-90c4-0310-8c96-b1721e28e2e2
Revision: 4387

hth

comment:12 Changed 11 years ago by wichert@…

Doing a resync in this situation can also fail:

done 12400
done 12500
done 12600
done 12700
done 12800
done 12900
done 13000
Command failed: ERROR:  duplicate key violates unique constraint "revision_pkey"

comment:13 in reply to:  12 Changed 11 years ago by Christian Boos

Replying to wichert@wiggy.net:

Doing a resync in this situation can also fail: … Command failed: ERROR: duplicate key violates unique constraint "revision_pkey"

Sounds like something different (the failing constraint above is "revision_pkey", whereas this ticket is about "node_change_pk"). If you're using the Perforce plugin, it's likely a duplicate of #3885. If you're using Subversion, can you open another ticket for this one? Please include the resync listing with DEBUG-level information (see TracLogging).

comment:14 Changed 11 years ago by Christian Boos

Priority: normal → highest

Also happened with pysqlite2 on http://paludis.pioto.org/trac:

$ svn ls -v svn://svn.pioto.org/paludis/
   2438 pioto                 Feb 10 18:28 branches/
   2528 spb                   Feb 27 03:46 overlay/
   1635 spb                   Oct 24 18:40 repo-templates/
   2530 dleverto              Feb 27 15:18 scratch/
   2508 dleverto              Feb 23 20:57 tags/
   2532 rbrown                Feb 27 21:11 trunk/

So latest is r2532.

In the db/trac.db:

sqlite3> select max(cast(rev as int)) from revision;
2529
sqlite3> select max(cast(rev as int)) from node_change;
2530

comment:15 Changed 11 years ago by Christian Boos

Keywords: pysqlite added
Severity: critical → blocker
Status: new → assigned

While it's still not understood how the above situation can happen, I've rewritten the sync code anyway in order to prevent too many concurrent sync attempts.

See this patch on #4586.

I believe that the above patch will also fix this issue, by drastically narrowing the window for this race condition.

However, I'm increasing the severity of this bug to blocker, as I think it's crucial that we fix it for the next release (0.10.4):

  • it happens with PostgreSQL but also with PySqlite
  • when it happens, the Trac instance becomes completely unusable, as we're attempting the sync (and hence triggering the error) in the pre_request_filter
  • the only way to recover is to have an admin do the resync (or, alternatively, delete the latest entries in the node_change table, e.g. delete from node_change where rev=2530 for the example above).

comment:16 Changed 11 years ago by Christian Boos

Milestone: 0.10.5 → 0.10.4

comment:17 Changed 11 years ago by Christian Boos

Summary: PostgreSQL can trip over two concurrent resync attempts → two concurrent resync attempts can leave the database in an inconsistent state

comment:18 Changed 11 years ago by Christian Boos

Resolution: fixed
Status: assigned → closed

With r4965 and r4966, CachedRepository.sync() now uses a different strategy, one that I think is immune to the kind of issue displayed in this ticket.

I extensively tested the new code with multiple server processes and many concurrent clients, and repeated this with the three database backends we currently support: with the final version that was committed, I couldn't detect any problem, so I'm quite hopeful that the issue is gone.

comment:19 Changed 8 years ago by Christian Boos

See ticket:5932#comment:25, though.

comment:20 Changed 3 years ago by Ryan J Ollos

Keywords: postgresql added; postgres removed
