Opened 19 years ago
Closed 17 years ago
#3923 closed defect (fixed)
Thread deadlock
| Reported by: | | Owned by: | Jonas Borgström |
|---|---|---|---|
| Priority: | high | Milestone: | 0.11.1 |
| Component: | general | Version: | 0.10 |
| Severity: | critical | Keywords: | database pool deadlock twisted verify |
| Cc: | exarkun@…, moo.lighttpd@… | Branch: | |
| Release Notes: | | | |
| API Changes: | | | |
| Internal Changes: | | | |
Description
I have trac deployed as a WSGI application. Occasionally it hangs and must be restarted. Inspecting the state of the process reveals threads in two states. One is here:
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib/python2.4/threading.py", line 422, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/trac/Projects/Twisted/trunk/twisted/python/threadpool.py", line 148, in _worker
context.call(ctx, function, *args, **kwargs)
File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 59, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 37, in callWithContext
return func(*args,**kw)
File "/home/trac/Projects/Twisted/trunk/twisted/web2/wsgi.py", line 138, in run
result = self.application(self.environment, self.startWSGIResponse)
File "/home/trac/Run/trac/external.py", line 74, in __call__
return self.application(environ, start_response)
File "/home/trac/Run/trac/external.py", line 115, in tracApplication
return trac.web.main.dispatch_request(environ, start_response)
File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 356, in dispatch_request
dispatcher.dispatch(req)
File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 190, in dispatch
req.perm = PermissionCache(self.env, req.authname)
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 267, in __init__
self.perms = PermissionSystem(env).get_user_permissions(username)
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 231, in get_user_permissions
for perm in self.store.get_user_permissions(username):
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 109, in get_user_permissions
db = self.env.get_db_cnx()
File "/home/trac/Projects/trac/trunk/trac/env.py", line 182, in get_db_cnx
return DatabaseManager(self).get_connection()
File "/home/trac/Projects/trac/trunk/trac/db/api.py", line 75, in get_connection
return self._cnx_pool.get_cnx(self.timeout or None)
File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 84, in get_cnx
cnx = self._connector.get_connection(**self._kwargs)
File "/home/trac/Projects/trac/trunk/trac/db/sqlite_backend.py", line 113, in get_connection
return SQLiteConnection(path, params)
File "/home/trac/Projects/trac/trunk/trac/db/sqlite_backend.py", line 159, in __init__
timeout=timeout)
File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 48, in __del__
self.close()
File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 44, in close
self._pool._return_cnx(self.cnx, self._tid)
File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 102, in _return_cnx
self._available.acquire()
Nine are here:
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib/python2.4/threading.py", line 422, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/trac/Projects/Twisted/trunk/twisted/python/threadpool.py", line 148, in _worker
context.call(ctx, function, *args, **kwargs)
File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 59, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 37, in callWithContext
return func(*args,**kw)
File "/home/trac/Projects/Twisted/trunk/twisted/web2/wsgi.py", line 138, in run
result = self.application(self.environment, self.startWSGIResponse)
File "/home/trac/Run/trac/external.py", line 74, in __call__
return self.application(environ, start_response)
File "/home/trac/Run/trac/external.py", line 115, in tracApplication
return trac.web.main.dispatch_request(environ, start_response)
File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 356, in dispatch_request
dispatcher.dispatch(req)
File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 190, in dispatch
req.perm = PermissionCache(self.env, req.authname)
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 267, in __init__
self.perms = PermissionSystem(env).get_user_permissions(username)
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 231, in get_user_permissions
for perm in self.store.get_user_permissions(username):
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 109, in get_user_permissions
db = self.env.get_db_cnx()
File "/home/trac/Projects/trac/trunk/trac/env.py", line 182, in get_db_cnx
return DatabaseManager(self).get_connection()
File "/home/trac/Projects/trac/trunk/trac/db/api.py", line 75, in get_connection
return self._cnx_pool.get_cnx(self.timeout or None)
File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 65, in get_cnx
self._available.acquire()
The WSGI container uses a threadpool with ten threads for the application.
I have only had instrumentation in place to inspect this state on one occasion so far, so I don't know if this is the typical state in which the process ends.
The server is running r3798 with the db_pool_timeout-r3868.patch from #3504. Python is 2.4.2, SQLite is 3.3.7. PySQLite is 2.3.2. Subversion is 1.4.0. glibc is 2.3.5. Linux is 2.6.12.
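For reference, here is a minimal sketch of one way to capture this kind of state (an assumption for illustration, not necessarily the instrumentation used here): install a signal handler that dumps every thread's Python stack. Note that sys._current_frames() only exists on Python 2.5 and later, so it would not have been available on the 2.4.2 interpreter mentioned above.

```python
import signal
import sys
import traceback

def dump_all_stacks(signum=None, frame=None, out=sys.stderr):
    # Print the Python stack of every live thread in the process.
    for thread_id, top in sys._current_frames().items():
        out.write("\n--- thread %s ---\n" % thread_id)
        traceback.print_stack(top, file=out)

# After this, sending SIGUSR1 to the hung process dumps all stacks to stderr.
signal.signal(signal.SIGUSR1, dump_all_stacks)
```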
Attachments (0)
Change History (29)
comment:1 by , 19 years ago
| Keywords: | database pool added |
|---|---|
| Milestone: | → 0.10.1 |
| Owner: | changed from to |
| Severity: | major → critical |
comment:2 by , 19 years ago
| Status: | new → assigned |
|---|
exarkun, can you try again with s/threading.Lock/threading.RLock/ ?
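For illustration, a minimal sketch of the behaviour this suggestion relies on (standard library only, nothing Trac-specific): a plain threading.Lock blocks forever when the thread that already holds it tries to acquire it again, whereas a threading.RLock allows re-entrant acquisition by its owner.

```python
import threading

lock = threading.Lock()
lock.acquire()
# lock.acquire()   # same thread, second acquire(): blocks forever -> deadlock
lock.release()

rlock = threading.RLock()
rlock.acquire()
rlock.acquire()      # re-entrant: the owning thread may acquire again
rlock.release()
rlock.release()      # must be released once per acquire()
```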
comment:3 by , 19 years ago
| Cc: | added |
|---|
Updated my server to use RLock. Will comment again if/when anything happens…
comment:4 by , 19 years ago
Too bad, I missed the opportunity to close that ticket with r3923 ;)
More seriously, can you also try with the updated db code integrating the [3920:3924/trunk/trac/db] changes?
comment:5 by , 19 years ago
Oops, don't update to current trunk right now: looks like those pool changes introduced some big issues (I'm now always hitting the timeout).
comment:6 by , 19 years ago
Now with [3929:3934/trunk/trac/db] all is ok again, so please try out the following patch.
comment:8 by , 19 years ago
Do I have to assume that everything is OK? ;)
One "little" problem I have with the recent pool code changes is that I've got segfaults with Python 2.5 on windows. I didn't have time so far to investigate this problem in more depth, but this is surely a showstopper for the merge in 0.10.1 at that point…
comment:9 by , 19 years ago
I haven't had time to update to trunk (and from the comments here I'm not really sure that I should - can you clarify which revisions are safe to use?). Nor have I run into this problem again since my last message here. This just doesn't happen very often.
comment:10 by , 19 years ago
- can you clarify regarding what revisions are safe to use?
Well, I have the crash on win64 with Python 2.5 (and the builtin sqlite3).
My regular Python 2.4 (and pysqlite 2.1.3) on win32 works fine, as does Python 2.5 on win32 with the builtin sqlite3 (I do have crashes with the latter, but they're not related to sqlite3, rather to the -r code reload feature of tracd).
For now, I can't say how things are on Linux. Once I've successfully tested the new code there, I'll probably port the whole set of changes to 0.10-stable, which will make it easier to test them on a wider scale.
comment:13 by , 19 years ago
| Milestone: | 0.10.3 → 0.10.1 |
|---|---|
| Resolution: | → fixed |
| Status: | assigned → closed |
So I think this was fixed in the bunch of db pool fixes ported to 0.10.1.
comment:14 by , 19 years ago
I just had my server deadlock again. Same version of everything as before.
comment:15 by , 19 years ago
Actually, it deadlocked a number of times in a row (5 or so) in rapid succession.
comment:16 by , 19 years ago
So I've got a bunch of questions…
- Just to be sure, when you say r4045, are you using that with 0.10-stable or trunk?
- Also, is this with r3830 backed out or not?
- Are there any other modifications to the code?
- Does the deadlock end with a timeout exception?
- Also, are you using an authz_file? Maybe you're hit by #3996 (this is unlikely, I know, as otherwise the deadlock would happen systematically).
comment:17 by , 19 years ago
- trunk
- Yes, r3830 backed out
- No other modifications to the code
- The deadlock never resolves itself; the server has to be SIGKILL'd.
- There's no authz_file entry in my trac.ini, so I guess not.
comment:18 by , 19 years ago
| Milestone: | 0.10.1 → 0.10.4 |
|---|---|
| Resolution: | fixed |
| Status: | closed → reopened |
Can you try a more recent trunk than r4045? Also, I'd need the C and Python backtrace again, please ;)
follow-up: 21 comment:20 by , 19 years ago
comment:21 by , 19 years ago
comment:22 by , 18 years ago
| Keywords: | twisted verify added; needinfo removed |
|---|---|
| Milestone: | 0.10.5 → 0.11.1 |
comment:24 by , 18 years ago
I applied that revision. The problems don't seem to be much affected. Today the server segfaulted (which it hasn't done in a long, long time). Dunno if it's related to that changeset or not.
comment:26 by , 18 years ago
This would probably involve a lot of changes to our deployment and related code and templates, which I don't have time to do.
comment:27 by , 18 years ago
Well, I was simply suggesting that you could set up a test installation of 0.11 with Twisted, not the whole twistedmatrix.com Trac.
comment:28 by , 17 years ago
| Milestone: | 0.11.2 → 0.11.1 |
|---|---|
| Owner: | changed from to |
| Status: | reopened → new |
twistedmatrix.com/trac is currently running an old 0.11dev version. I strongly suggest that you upgrade to the new pool code (i.e. at least branches/0.11-stable@7317), which should hopefully be more robust w.r.t. deadlocks.
comment:29 by , 17 years ago
| Resolution: | → fixed |
|---|---|
| Status: | new → closed |
Assuming the issue was fixed by r7190. Of course, feel free to reopen if the problem persists even with the new pool.
"translated stack" for the first thread (I've offset the line numbers a bit, to make things clearer):
env.get_db_cnx()and we end-up in the pool, holding thethreading.Lock:get_cnxcnx = self._connector.get_connection(**self._kwargs)''''get_connectionreturn SQLiteConnection(path, params)__init__timeout=timeout)here, PySqlite triggers a garbage collection…
__del__self.close()… which has no better idea than to free another PooledConnection!
closeself._pool._return_cnx(self.cnx, self._tid)_return_cnxself._available.acquire()… and boom, deadlock, as we are not using a
threading.RLockanymore, since r2771. I have to discuss this with cmlenz who did this change, but I don't think r2771 is valid, we should really use athreading.RLockhere, IMO.Thank you very much for the feedback!
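To make that failure mode concrete, here is a toy sketch of the pattern (an illustration only, with simplified names that mirror the traceback, not the real trac/db/pool.py code): a __del__ that re-enters the pool while the same thread still holds the pool's non-reentrant lock.

```python
import threading

class Pool(object):
    # Toy pool with the same locking shape as the traceback above.

    def __init__(self):
        self._available = threading.Lock()   # plain Lock, i.e. not an RLock (cf. r2771)

    def get_cnx(self):
        self._available.acquire()
        try:
            # While the lock is held, opening the real SQLite connection can
            # trigger a garbage collection that finalizes an unreferenced
            # PooledConnection -- its __del__ then calls _return_cnx() below.
            stale = PooledConnection(self)
            del stale                        # stands in for the gc run PySqlite triggered
            return "new connection"
        finally:
            self._available.release()

    def _return_cnx(self, cnx):
        # Reached from __del__ while get_cnx() still holds _available:
        # with a plain Lock this second acquire() never returns.
        self._available.acquire()
        try:
            pass                             # bookkeeping elided
        finally:
            self._available.release()

class PooledConnection(object):
    def __init__(self, pool):
        self._pool = pool

    def __del__(self):
        self._pool._return_cnx(self)

# Pool().get_cnx()   # uncomment to see the hang on the nested acquire()
```

With self._available as a threading.RLock, the nested acquire() in _return_cnx() succeeds and the call simply unwinds, which is exactly what the RLock suggestion above is about.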