Edgewall Software
Modify

Opened 18 years ago

Closed 16 years ago

#3923 closed defect (fixed)

Thread deadlock

Reported by: exarkun@… Owned by: Jonas Borgström
Priority: high Milestone: 0.11.1
Component: general Version: 0.10
Severity: critical Keywords: database pool deadlock twisted verify
Cc: exarkun@…, moo.lighttpd@… Branch:
Release Notes:
API Changes:
Internal Changes:

Description

I have trac deployed as a WSGI application. Occassionally it hangs and must be restarted. Inspecting the state of the process reveals threads in two states. One is here:

  File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/threadpool.py", line 148, in _worker
    context.call(ctx, function, *args, **kwargs)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
  File "/home/trac/Projects/Twisted/trunk/twisted/web2/wsgi.py", line 138, in run
    result = self.application(self.environment, self.startWSGIResponse)
  File "/home/trac/Run/trac/external.py", line 74, in __call__
    return self.application(environ, start_response)
  File "/home/trac/Run/trac/external.py", line 115, in tracApplication
    return trac.web.main.dispatch_request(environ, start_response)
  File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 356, in dispatch_request
    dispatcher.dispatch(req)
  File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 190, in dispatch
    req.perm = PermissionCache(self.env, req.authname)
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 267, in __init__
    self.perms = PermissionSystem(env).get_user_permissions(username)
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 231, in get_user_permissions
    for perm in self.store.get_user_permissions(username):
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 109, in get_user_permissions
    db = self.env.get_db_cnx()
  File "/home/trac/Projects/trac/trunk/trac/env.py", line 182, in get_db_cnx
    return DatabaseManager(self).get_connection()
  File "/home/trac/Projects/trac/trunk/trac/db/api.py", line 75, in get_connection
    return self._cnx_pool.get_cnx(self.timeout or None)
  File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 84, in get_cnx
    cnx = self._connector.get_connection(**self._kwargs)
  File "/home/trac/Projects/trac/trunk/trac/db/sqlite_backend.py", line 113, in get_connection
    return SQLiteConnection(path, params)
  File "/home/trac/Projects/trac/trunk/trac/db/sqlite_backend.py", line 159, in __init__
    timeout=timeout)
  File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 48, in __del__
    self.close()
  File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 44, in close
    self._pool._return_cnx(self.cnx, self._tid)
  File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 102, in _return_cnx
    self._available.acquire()

Nine are here:

  File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/threadpool.py", line 148, in _worker
    context.call(ctx, function, *args, **kwargs)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
  File "/home/trac/Projects/Twisted/trunk/twisted/web2/wsgi.py", line 138, in run
    result = self.application(self.environment, self.startWSGIResponse)
  File "/home/trac/Run/trac/external.py", line 74, in __call__
    return self.application(environ, start_response)
  File "/home/trac/Run/trac/external.py", line 115, in tracApplication
    return trac.web.main.dispatch_request(environ, start_response)
  File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 356, in dispatch_request
    dispatcher.dispatch(req)
  File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 190, in dispatch
    req.perm = PermissionCache(self.env, req.authname)
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 267, in __init__
    self.perms = PermissionSystem(env).get_user_permissions(username)
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 231, in get_user_permissions
    for perm in self.store.get_user_permissions(username):
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 109, in get_user_permissions
    db = self.env.get_db_cnx()
  File "/home/trac/Projects/trac/trunk/trac/env.py", line 182, in get_db_cnx
    return DatabaseManager(self).get_connection()
  File "/home/trac/Projects/trac/trunk/trac/db/api.py", line 75, in get_connection
    return self._cnx_pool.get_cnx(self.timeout or None)
  File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 65, in get_cnx
    self._available.acquire()

The WSGI container uses a threadpool with ten threads for the application.

I have only had instrumentation in place to inspect this state on one occassion so far, so I don't know if this is the typical state in which the process ends.

The server is running r3798 with the db_pool_timeout-r3868.patch from #3504. Python is 2.4.2, SQLite is 3.3.7. PySQLite is 2.3.2. Subversion is 1.4.0. glibc is 2.3.5. Linux is 2.6.12.

Attachments (0)

Change History (29)

comment:1 by Christian Boos, 18 years ago

Keywords: database pool added
Milestone: 0.10.1
Owner: changed from Jonas Borgström to Christian Boos
Severity: majorcritical

"translated stack" for the first thread (I've offset the line numbers a bit, to make things clearer):

  1. we start from an innocuous env.get_db_cnx() and we end-up in the pool, holding the threading.Lock:
  2. source:trunk/trac/db/pool.py@3798#L83, in get_cnx
    cnx = self._connector.get_connection(**self._kwargs)''''
  3. source:trunk/trac/db/sqlite_backend.py@3798#L113, in get_connection
    return SQLiteConnection(path, params)
  4. source:trunk/trac/db/sqlite_backend.py@3798#L158, in __init__
    timeout=timeout)
    here, PySqlite triggers a garbage collection…
  5. source:trunk/trac/db/pool.py@3798#L48, in __del__
    self.close()
    … which has no better idea than to free another PooledConnection!
  6. source:trunk/trac/db/pool.py@3798#L44, in close
    self._pool._return_cnx(self.cnx, self._tid)
  7. source:trunk/trac/db/pool.py@3798#L101, in _return_cnx
    self._available.acquire()

… and boom, deadlock, as we are not using a threading.RLock anymore, since r2771. I have to discuss this with cmlenz who did this change, but I don't think r2771 is valid, we should really use a threading.RLock here, IMO.

Thank you very much for the feedback!

comment:2 by Christian Boos, 18 years ago

Status: newassigned

exarkun, can you try again with s/threading.Lock/threading.RLock/ ?

comment:3 by exarkun@…, 18 years ago

Cc: exarkun@… added

Updated my server to use RLock. Will comment again if/when anything happens…

comment:4 by Christian Boos, 18 years ago

Too bad, I missed the opportunity to close that ticket with r3923 ;)

More seriously, can you also try with the updated db code integrating the [3920:3924/trunk/trac/db] changes?

comment:5 by Christian Boos, 18 years ago

Oops, don't update to current trunk right now: looks like those pool changes introduced some big issues (I'm now always hitting the timeout).

comment:6 by Christian Boos, 18 years ago

Now with [3929:3934/trunk/trac/db] all is ok again, so pleasy try out the following patch.

comment:7 by moo, 18 years ago

Cc: moo.lighttpd@… added
Keywords: deadlock added

added cc to me, thx

comment:8 by Christian Boos, 18 years ago

Do I have to assume that everything is OK? ;)

One "little" problem I have with the recent pool code changes is that I've got segfaults with Python 2.5 on windows. I didn't have time so far to investigate this problem in more depth, but this is surely a showstopper for the merge in 0.10.1 at that point…

comment:9 by exarkun@…, 18 years ago

I haven't had time to update to trunk (also I'm not really sure from the comments here that I should update to trunk - can you clarify regarding what revisions are safe to use?). Nor have I run into this problem again since my last message here. This just doesn't happen very often.

comment:10 by Christian Boos, 18 years ago

  • can you clarify regarding what revisions are safe to use?

Well, I have the crash on win64 with Python 2.5 (and the builtin sqlite3).

My regular Python 2.4 (and pysqlite 2.1.3) on win32 works fine, so does Python 2.5 on win32 with builtin sqlite3 (I do have crashes with the latter, but they're not related to sqlite3, rather to the -r code reload feature of tracd).

For now, I can't say how things are on Linux. Once I've successfully tested the new code on Linux, I'll probably port the whole set of changes on 0.10-stable, that will make it easier to test the changes on a wider scale.

comment:11 by Christian Boos, 18 years ago

Keywords: needinfo added

No more deadlocks?

comment:12 by exarkun@…, 18 years ago

This hasn't happened since I updated to r4045.

comment:13 by Christian Boos, 18 years ago

Milestone: 0.10.30.10.1
Resolution: fixed
Status: assignedclosed

So I think this was fixed in the bunch of db pool fixes ported to 0.10.1.

comment:14 by anonymous, 18 years ago

I just had my server deadlock again. Same version of everything as before.

comment:15 by exarkun@…, 18 years ago

Actually, it deadlocked a number of times in a row (5 or so) in rapid succession.

comment:16 by Christian Boos, 18 years ago

So I've got a bunch of questions…

  1. Just to be sure, when you say r4045, are you using that with 0.10-stable or trunk?
  2. Also, is this with r3830 backed out or not?
  3. Are there any other modifications to the code?
  4. Does the deadlock ends with a timeout exception?
  5. Also, are you using an authz_file and maybe you're hit by #3996? (this is unlikely, I know, as otherwise the lock would happen systematically).

comment:17 by exarkun@…, 18 years ago

  1. trunk
  2. Yes, r3830 backed out
  3. No other modifications to the code
  4. The deadlock never resolves itself, the server has to be SIGKILL'd.
  5. There's no authz_file entry in my trac.ini, so I guess not.

comment:18 by Christian Boos, 18 years ago

Milestone: 0.10.10.10.4
Resolution: fixed
Status: closedreopened

Can you try a more recent trunk than r4045? Also, I'd need the C and Python backtrace again, please ;)

comment:19 by Christian Boos, 18 years ago

Is the situation any better with 0.10.4?

comment:20 by exarkun, 18 years ago

We're still using r4045 of trunk with r3830 backed out. I don't think we have any plans to upgrade in the foreseeable future. We're making do by restarting the server every six hours. I don't think we'll be able to test with a newer revision.

in reply to:  20 comment:21 by Christian Boos, 18 years ago

Replying to exarkun:

We're still using r4045 of trunk with r3830 backed out. I don't think we have any plans to upgrade in the foreseeable future.

Try at least to apply r4886. I wouldn't be so surprised if that fixed this issue as well…

comment:22 by Christian Boos, 17 years ago

Keywords: twisted verify added; needinfo removed
Milestone: 0.10.50.11.1

comment:23 by anonymous, 17 years ago

Trying that. I'll comment again in a few days.

comment:24 by exarkun, 17 years ago

I applied that revision. The problems don't seem to be much affected. Today the server segfaulted (which it hasn't done in a long, long time). Dunno if it's related to that changeset or not.

comment:25 by Christian Boos, 17 years ago

Is there any possibility for you to try the current trunk?

comment:26 by anonymous, 17 years ago

This would probably involve a lot of changes to our deployment and related code and templates, which I don't have time to do.

comment:27 by Christian Boos, 17 years ago

Well, I was simply suggesting that you could setup a test installation of 0.11 with Twisted, not the whole twistedmatrix.com Trac.

comment:28 by Christian Boos, 16 years ago

Milestone: 0.11.20.11.1
Owner: changed from Christian Boos to Jonas Borgström
Status: reopenednew

twistedmatrix.com/trac is currently running an old 0.11dev version. I strongly suggest that you upgrade to the new pool code (i.e. at least branches/0.11-stable@7317), which should hopefully be more robust w.r.t. deadlocks.

comment:29 by Christian Boos, 16 years ago

Resolution: fixed
Status: newclosed

Assuming the issue was fixed by r7190. Of course, feel free to reopen if the problem persists even with the new pool.

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain Jonas Borgström.
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from Jonas Borgström to the specified user.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.