Edgewall Software
Modify

Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#2170 closed defect (fixed)

Crash while doing a login/logout in the Timeline

Reported by: Christian Boos Owned by: Christian Boos
Priority: highest Milestone: 0.9
Component: general Version: devel
Severity: critical Keywords: crash
Cc: vyt@…
Release Notes:
API Changes:

Description (last modified by Christian Boos)

This is somehow reproductible, though I got slightly different stack traces each time. But it always bombs in connection_commit, though.

Configuration:

  • tracd on Linux
  • python 2.3.4
  • pysqlite 2.0.4 with sqlite 3.2.7.

With older versions of pysqlite, I got simply dead-locks, but this was not a good thing either…

Here's the stack trace:

dserver.bct.de - - [04/Oct/2005 15:59:05] "GET /stable-bct-trac/login HTTP/1.1" 401 -
[New Thread 557091 (LWP 12565)]
Trac[session] DEBUG: Promoting anonymous session 90d7a758b262bd4e88eced41 to authenticated session for user cboos

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 557091 (LWP 12565)]
0x40582056 in pysqlite_sleep (seconds=-21474836.48) at src/util.c:37
37          Py_DECREF(ret);
(gdb)
(gdb)
(gdb) bt
#0  0x40582056 in pysqlite_sleep (seconds=-21474836.48) at src/util.c:37
#1  0x4058222d in _sqlite_step_with_busyhandler (statement=0x8550dc8, _connection=0x4084e988) at src/util.c:86
#2  0x4057d940 in connection_commit (self=0x4084e988, args=0x0) at src/connection.c:230
#3  0x080c0c42 in call_function (pp_stack=0x45363450, oparg=0) at Python/ceval.c:3425
#4  0x080bd5d4 in eval_frame (f=0x844ee8c) at Python/ceval.c:2116
#5  0x080c102c in fast_function (func=0x40839e44, pp_stack=0x453635d0, n=2, na=2, nk=0) at Python/ceval.c:3518
#6  0x080c0e46 in call_function (pp_stack=0x453635d0, oparg=1) at Python/ceval.c:3458
#7  0x080bd5d4 in eval_frame (f=0x84b675c) at Python/ceval.c:2116
...

pyframe points to _do_login, in source:trunk/trac/web/auth.py@2326#L119

Attachments (2)

trac_no_writer_contention.2.patch (4.2 KB ) - added by Christian Boos 12 years ago.
Second iteration of the patch, integrating cmlenz feedback, previous patch survived 1 day of intensive use, let's see how it goes with this one :)
trac_no_writer_contention.patch (4.1 KB ) - added by Christopher Lenz 12 years ago.
Minor cleanup to the previous patch

Download all attachments as: .zip

Change History (17)

comment:1 Changed 12 years ago by Christian Boos

Ok, now I manage to reproduce the crash systematically:

  • in Thunderbird, I refresh the Timeline feed
  • half a second later, in Firefox, I reload the page which does a logout

Now, looking more closely to the pysqlite2 code within gdb, I located the cause of the crash itself:

  • util.c

    old new  
    8080        }
    8181
    8282        how_long = 0.01* (1 << counter);
    83         if (how_long > 1.0) {
     83        if (how_long > 1.0 || counter > 30) {
    8484            how_long = 1.0;
    8585        }
    8686        pysqlite_sleep(how_long);
    8787
    8888        counter++;

… now I'm back with my database locks :)

Ok, it looks like a commit is not "atomic" at all, and with our database pool, we reuse database connections which are in a commit in progress, hence the deadlock.

comment:2 Changed 12 years ago by Christian Boos

I disabled 2.0.4 and reinstalled pysqlite-1.1.6: Now I can still trigger the database lock by doing the operations described above but the database recovers from that state for subsequent queries, whereas with 2.0.4, the database remains locked forever.

comment:3 Changed 12 years ago by Christian Boos

Description: modified (diff)
Owner: changed from Jonas Borgström to Christian Boos
Status: newassigned

I just filled pysqlite:ticket:124, which contains some additional explanations about the issue: in short, we have to use 10 and not 10000 for the timeout value when using pysqlite2:

  • db.py

     
    254254                                 'directory it is located in.' \
    255255                                 % (getuser(), path)
    256256
    257         timeout = int(params.get('timeout', 10000))
    258257        if have_pysqlite == 2:
     258            timeout = int(params.get('timeout', 10)) # 10 seconds
    259259            # Convert unicode to UTF-8 bytestrings. This is case-sensitive, so
    260260            # we need two converters
    261261            sqlite.register_converter('text', str)
     
    264264            cnx = sqlite.connect(path, detect_types=sqlite.PARSE_DECLTYPES,
    265265                                 check_same_thread=False, timeout=timeout)
    266266        else:
     267            timeout = int(params.get('timeout', 10000)) # 10 seconds
    267268            cnx = sqlite.connect(path, timeout=timeout)
    268269        ConnectionWrapper.__init__(self, cnx)

comment:4 Changed 12 years ago by Christian Boos

… not that simple: applying the previous change was not enough, because the database effectively remains locked when using 2.0.4.

I see in the log that I got this exception:

Exception pysqlite2.dbapi2.OperationalError: 'cannot rollback transaction - SQL statements in progress' in <bound method PooledConnection.__del__ of <trac.db.PooledConnection object at 0x40cf2814>> ignored

which I can relate to the SQLiteConnection object which was doing the COMMIT. So it appears that the connection object doesn't recover cleanly after a commit which timed out.

comment:5 Changed 12 years ago by Christian Boos

I reverted to pysqlite-1.1.6, but I still manage to have persistent database locks, as soon as Trac needs to COMMIT anything:

Traceback (most recent call last):
  File "/opt/trac/stable-bct-trac/lib/python2.3/site-packages/trac/web/standalone.py", line 235, in _do_trac_req
    dispatch_request(path_info, req, env)
  File "/opt/trac/stable-bct-trac/lib/python2.3/site-packages/trac/web/main.py", line 139, in dispatch_request
    dispatcher.dispatch(req)
  File "/opt/trac/stable-bct-trac/lib/python2.3/site-packages/trac/web/main.py", line 107, in dispatch
    resp = chosen_handler.process_request(req)
  File "/opt/trac/stable-bct-trac/lib/python2.3/site-packages/trac/versioncontrol/web_ui/browser.py", line 122, in process_request
    self._render_directory(req, repos, node, rev)
  File "/opt/trac/stable-bct-trac/lib/python2.3/site-packages/trac/versioncontrol/web_ui/browser.py", line 154, in _render_directory
    changes = get_changes(self.env, repos, [i['rev'] for i in info])
  File "/opt/trac/stable-bct-trac/lib/python2.3/site-packages/trac/versioncontrol/web_ui/util.py", line 30, in get_changes
    changeset = repos.get_changeset(rev)
  File "/opt/trac/stable-bct-trac/lib/python2.3/site-packages/trac/versioncontrol/cache.py", line 42, in get_changeset
    self.sync()
  File "/opt/trac/stable-bct-trac/lib/python2.3/site-packages/trac/versioncontrol/cache.py", line 81, in sync
    self.db.commit()
  File "/opt/trac/stable-bct-trac/lib/python2.3/site-packages/trac/db.py", line 131, in commit
    self.cnx.commit()
  File "/usr/lib/python2.3/site-packages/sqlite/main.py", line 539, in commit
    self.db.execute("COMMIT")
OperationalError: database is locked

Read access is OK, which is a little bit better than with 2.0.4 …

comment:6 Changed 12 years ago by Christian Boos

Isn't the problems occurring because we lend ourselves to database locks?

  1. request A - thread A : involves a repo sync.
    This will start a COMMIT, the COMMIT will be "in progress" for a while
  2. shortly after: request B - thread B : only needs to get access to the DB for a SELECT. But there's a COMMIT in progress, so it will fail, database lock for B.
  3. since B initiated a query, the commit in A will also fail! Database lock for A too.

Step 3. is wrong, IMO, A shouldn't fail. At this point, I'm not sure who's the culprit: SQLite, pysqlite, or the fact that we allow B even if there's a COMMIT in progress.

comment:7 Changed 12 years ago by Christian Boos

Description: modified (diff)

Found some relevant discussion about the problem:

Consider a scenario where one process is holding a read lock that it is trying to promote to a reserved lock and a second process is holding a reserved lock that it is trying to promote to an exclusive lock. The first process cannot proceed because it is blocked by the second and the second process cannot proceed because it is blocked by the first. If both processes invoke the busy handlers, neither will make any progress.

  • Karl Mierle had a similar issue back in June with Argon, which he describe there Locking
  • The SQLite author was contacted by the PySQLite author about the former problem, and he replied
    • use EXCLUSIVE transactions, but this reduce the amount of concurrency
    • one of the involved threads must ROLLBACK

comment:8 Changed 12 years ago by Christian Boos

Priority: highhighest
Last edited 8 years ago by Christian Boos (previous) (diff)

comment:9 Changed 12 years ago by Christian Boos

I might have a fix for the issue: attachment:trac_no_writer_contention.patch

This uses the workaround described in #py126.

The sqlite binary has to be reconfigured with the --enable-threadsafe flag and recompiled, because this flag is turned off by default (on Linux) and the binaries used until now by Trac are most likely not thread-safe (see #2196).

Therefore, the above patch contains also cmlenz's patch for #2196, which is a prerequisite for being able to run Trac with a threadsafe sqlite.

The patches to pysqlite that can be found on #py126 are not absolutely required, but they should help.

Some caveats:

  • I haven't tested that on Windows yet
  • I've got segfaults due to a second __del__ call on a Cursor object (!?). This is happening on SuSE 9.0, with the original python (sys.version == "2.3+ (#1, Sep 23 2003, 23:07:16) \n[GCC 3.3.1 (SuSE Linux)]"). This appears to be an early 2.3 release (2.3.1, from the date information).
  • The Python 2.3.4 that I compiled myself seems to be immune to the problem, so far. The difference might be due to the changes introduced in Python 2.3.3c1, where 2 critical bug fixes involving weakrefs and garbage collection have been fixed.

Changed 12 years ago by Christian Boos

Second iteration of the patch, integrating cmlenz feedback, previous patch survived 1 day of intensive use, let's see how it goes with this one :)

comment:10 Changed 12 years ago by Christian Boos

Also, I tested the patch on Windows, it works well too, but it's so fast it's not easy to stress test :)

We would need some kind of web test automation there…

Changed 12 years ago by Christopher Lenz

Minor cleanup to the previous patch

comment:11 Changed 12 years ago by Christopher Lenz

I've attached another revision of the patch. Nothing major, but this avoids the duplicate handling of OperationErrors in the PyFormatCursor, etc.

comment:12 Changed 12 years ago by Christopher Lenz

Doh, “replace existing attachment” hit me here :-/ Should probably not be checked by default.

comment:13 Changed 12 years ago by anonymous

Cc: vyt@… added

comment:14 Changed 12 years ago by Christian Boos

Resolution: fixed
Status: assignedclosed

OK, this week not a single database lock and not any other problem detected, while testing the patches in our "production" environment.

Therefore I checked in the patch (the latest iteration from cmlenz) in r2355.

comment:15 Changed 12 years ago by Christian Boos

Keywords: crash added

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain Christian Boos.
The resolution will be deleted.
to The owner will be changed from Christian Boos to the specified user.

Add Comment


E-mail address and name can be saved in the Preferences .

 
Note: See TracTickets for help on using tickets.