Opened 18 years ago
Closed 16 years ago
#3923 closed defect (fixed)
Thread deadlock
Reported by: | | Owned by: | Jonas Borgström |
---|---|---|---|
Priority: | high | Milestone: | 0.11.1 |
Component: | general | Version: | 0.10 |
Severity: | critical | Keywords: | database pool deadlock twisted verify |
Cc: | exarkun@…, moo.lighttpd@… | Branch: | |
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
I have Trac deployed as a WSGI application. Occasionally it hangs and must be restarted. Inspecting the state of the process reveals threads in two states. One is here:
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap self.run() File "/usr/lib/python2.4/threading.py", line 422, in run self.__target(*self.__args, **self.__kwargs) File "/home/trac/Projects/Twisted/trunk/twisted/python/threadpool.py", line 148, in _worker context.call(ctx, function, *args, **kwargs) File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 59, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) File "/home/trac/Projects/Twisted/trunk/twisted/web2/wsgi.py", line 138, in run result = self.application(self.environment, self.startWSGIResponse) File "/home/trac/Run/trac/external.py", line 74, in __call__ return self.application(environ, start_response) File "/home/trac/Run/trac/external.py", line 115, in tracApplication return trac.web.main.dispatch_request(environ, start_response) File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 356, in dispatch_request dispatcher.dispatch(req) File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 190, in dispatch req.perm = PermissionCache(self.env, req.authname) File "/home/trac/Projects/trac/trunk/trac/perm.py", line 267, in __init__ self.perms = PermissionSystem(env).get_user_permissions(username) File "/home/trac/Projects/trac/trunk/trac/perm.py", line 231, in get_user_permissions for perm in self.store.get_user_permissions(username): File "/home/trac/Projects/trac/trunk/trac/perm.py", line 109, in get_user_permissions db = self.env.get_db_cnx() File "/home/trac/Projects/trac/trunk/trac/env.py", line 182, in get_db_cnx return DatabaseManager(self).get_connection() File "/home/trac/Projects/trac/trunk/trac/db/api.py", line 75, in get_connection return self._cnx_pool.get_cnx(self.timeout or None) File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 84, in get_cnx cnx = self._connector.get_connection(**self._kwargs) File "/home/trac/Projects/trac/trunk/trac/db/sqlite_backend.py", line 113, in get_connection return SQLiteConnection(path, params) File "/home/trac/Projects/trac/trunk/trac/db/sqlite_backend.py", line 159, in __init__ timeout=timeout) File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 48, in __del__ self.close() File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 44, in close self._pool._return_cnx(self.cnx, self._tid) File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 102, in _return_cnx self._available.acquire()
Nine are here:
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap self.run() File "/usr/lib/python2.4/threading.py", line 422, in run self.__target(*self.__args, **self.__kwargs) File "/home/trac/Projects/Twisted/trunk/twisted/python/threadpool.py", line 148, in _worker context.call(ctx, function, *args, **kwargs) File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 59, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) File "/home/trac/Projects/Twisted/trunk/twisted/web2/wsgi.py", line 138, in run result = self.application(self.environment, self.startWSGIResponse) File "/home/trac/Run/trac/external.py", line 74, in __call__ return self.application(environ, start_response) File "/home/trac/Run/trac/external.py", line 115, in tracApplication return trac.web.main.dispatch_request(environ, start_response) File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 356, in dispatch_request dispatcher.dispatch(req) File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 190, in dispatch req.perm = PermissionCache(self.env, req.authname) File "/home/trac/Projects/trac/trunk/trac/perm.py", line 267, in __init__ self.perms = PermissionSystem(env).get_user_permissions(username) File "/home/trac/Projects/trac/trunk/trac/perm.py", line 231, in get_user_permissions for perm in self.store.get_user_permissions(username): File "/home/trac/Projects/trac/trunk/trac/perm.py", line 109, in get_user_permissions db = self.env.get_db_cnx() File "/home/trac/Projects/trac/trunk/trac/env.py", line 182, in get_db_cnx return DatabaseManager(self).get_connection() File "/home/trac/Projects/trac/trunk/trac/db/api.py", line 75, in get_connection return self._cnx_pool.get_cnx(self.timeout or None) File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 65, in get_cnx self._available.acquire()
The WSGI container uses a threadpool with ten threads for the application.
I have only had instrumentation in place to inspect this state on one occasion so far, so I don't know if this is the typical state the process ends up in.
The server is running r3798 with the db_pool_timeout-r3868.patch from #3504. Python is 2.4.2, SQLite is 3.3.7. PySQLite is 2.3.2. Subversion is 1.4.0. glibc is 2.3.5. Linux is 2.6.12.
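For reference, one way to capture this kind of per-thread snapshot is a signal handler that dumps every thread's Python stack. This is only a sketch of one possible approach, not the instrumentation actually used here, and it assumes Python 2.5 or later for sys._current_frames() (the server above runs 2.4.2):

```python
# Sketch only: print the Python stack of every thread when SIGUSR1 arrives.
# Assumes Python >= 2.5 (sys._current_frames) and a POSIX platform.
import signal
import sys
import threading
import traceback


def dump_thread_stacks(signum, frame):
    names = dict((t.ident, t.name) for t in threading.enumerate())
    for thread_id, stack in sys._current_frames().items():
        print("--- thread %s (%s) ---" % (thread_id, names.get(thread_id, "?")))
        traceback.print_stack(stack)


# Install once at startup; afterwards `kill -USR1 <pid>` prints all stacks,
# which is enough to spot threads stuck in pool.py acquire() calls.
signal.signal(signal.SIGUSR1, dump_thread_stacks)
```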
Attachments (0)
Change History (29)
comment:1 by , 18 years ago
Keywords: | database pool added |
---|---|
Milestone: | → 0.10.1 |
Owner: | changed from | to
Severity: | major → critical |
comment:2 by , 18 years ago
Status: | new → assigned |
---|
exarkun, can you try again with s/threading.Lock/threading.RLock/?
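In pool.py terms, the suggested substitution amounts to something like the following. This is only a sketch: the attribute name comes from the tracebacks above, and the rest of the class is assumed.

```python
import threading

class ConnectionPool(object):
    def __init__(self):
        # Before: a plain Lock, which the same thread cannot acquire twice.
        # self._available = threading.Lock()
        # Suggested change: a reentrant lock, so code that re-enters the pool
        # on the same thread (e.g. from a __del__ hook) does not block.
        self._available = threading.RLock()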
comment:3 by , 18 years ago
Cc: | added |
---|
Updated my server to use RLock. Will comment again if/when anything happens…
comment:4 by , 18 years ago
Too bad, I missed the opportunity to close that ticket with r3923 ;)
More seriously, can you also try with the updated db code integrating the [3920:3924/trunk/trac/db] changes?
comment:5 by , 18 years ago
Oops, don't update to current trunk right now: looks like those pool changes introduced some big issues (I'm now always hitting the timeout).
comment:6 by , 18 years ago
Now with [3929:3934/trunk/trac/db] all is OK again, so please try out the following patch.
comment:8 by , 18 years ago
Do I have to assume that everything is OK? ;)
One "little" problem I have with the recent pool code changes is that I've got segfaults with Python 2.5 on windows. I didn't have time so far to investigate this problem in more depth, but this is surely a showstopper for the merge in 0.10.1 at that point…
comment:9 by , 18 years ago
I haven't had time to update to trunk (also, I'm not really sure from the comments here that I should update to trunk - can you clarify which revisions are safe to use?). Nor have I run into this problem again since my last message here. This just doesn't happen very often.
comment:10 by , 18 years ago
- can you clarify regarding what revisions are safe to use?
Well, I have the crash on win64 with Python 2.5 (and the builtin sqlite3).
My regular Python 2.4 (and pysqlite 2.1.3) on win32 works fine, as does Python 2.5 on win32 with the builtin sqlite3 (I do have crashes with the latter, but they're not related to sqlite3, rather to the -r code reload feature of tracd).
For now, I can't say how things are on Linux. Once I've successfully tested the new code on Linux, I'll probably port the whole set of changes on 0.10-stable, that will make it easier to test the changes on a wider scale.
comment:13 by , 18 years ago
Milestone: | 0.10.3 → 0.10.1 |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
So I think this was fixed in the bunch of db pool fixes ported to 0.10.1.
comment:14 by , 18 years ago
I just had my server deadlock again. Same version of everything as before.
comment:15 by , 18 years ago
Actually, it deadlocked a number of times in a row (5 or so) in rapid succession.
comment:16 by , 18 years ago
So I've got a bunch of questions…
- Just to be sure, when you say r4045, are you using that with 0.10-stable or trunk?
- Also, is this with r3830 backed out or not?
- Are there any other modifications to the code?
- Does the deadlock end with a timeout exception?
- Also, are you using an authz_file, and might you be hit by #3996? (This is unlikely, I know, as otherwise the deadlock would happen systematically.)
comment:17 by , 18 years ago
- trunk
- Yes, r3830 backed out
- No other modifications to the code
- The deadlock never resolves itself; the server has to be SIGKILL'd.
- There's no authz_file entry in my trac.ini, so I guess not.
comment:18 by , 18 years ago
Milestone: | 0.10.1 → 0.10.4 |
---|---|
Resolution: | fixed |
Status: | closed → reopened |
Can you try a more recent trunk than r4045? Also, I'd need the C and Python backtrace again, please ;)
follow-up: 21 comment:20 by , 18 years ago
comment:21 by , 18 years ago
comment:22 by , 17 years ago
Keywords: | twisted verify added; needinfo removed |
---|---|
Milestone: | 0.10.5 → 0.11.1 |
comment:24 by , 17 years ago
I applied that revision. The problems don't seem to be much affected. Today the server segfaulted (which it hasn't done in a long, long time). Dunno if it's related to that changeset or not.
comment:26 by , 17 years ago
This would probably involve a lot of changes to our deployment and related code and templates, which I don't have time to do.
comment:27 by , 17 years ago
Well, I was simply suggesting that you could set up a test installation of 0.11 with Twisted, not the whole twistedmatrix.com Trac.
comment:28 by , 16 years ago
Milestone: | 0.11.2 → 0.11.1 |
---|---|
Owner: | changed from | to
Status: | reopened → new |
twistedmatrix.com/trac is currently running an old 0.11dev version. I strongly suggest that you upgrade to the new pool code (i.e. at least branches/0.11-stable@7317), which should hopefully be more robust w.r.t. deadlocks.
comment:29 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Assuming the issue was fixed by r7190. Of course, feel free to reopen if the problem persists even with the new pool.
"translated stack" for the first thread (I've offset the line numbers a bit, to make things clearer):
env.get_db_cnx()
and we end-up in the pool, holding thethreading.Lock
:get_cnx
cnx = self._connector.get_connection(**self._kwargs)''''
get_connection
return SQLiteConnection(path, params)
__init__
timeout=timeout)
here, PySqlite triggers a garbage collection…
__del__
self.close()
… which has no better idea than to free another PooledConnection!
close
self._pool._return_cnx(self.cnx, self._tid)
_return_cnx
self._available.acquire()
… and boom, deadlock, as we are not using a
threading.RLock
anymore, since r2771. I have to discuss this with cmlenz who did this change, but I don't think r2771 is valid, we should really use athreading.RLock
here, IMO.Thank you very much for the feedback!
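To make that last point concrete, here is a minimal sketch of the re-entrant acquisition. The names mirror the pool code discussed above, but the structure is assumed; it only demonstrates why the call from __del__ blocks forever on a plain threading.Lock while a threading.RLock lets it through.

```python
# Sketch only: the same thread holds the pool lock in get_cnx(), and a
# finalizer fired during connection construction re-enters _return_cnx().
import threading


class Pool(object):
    def __init__(self, lock_factory):
        self._available = lock_factory()

    def get_cnx(self):
        self._available.acquire()
        try:
            # In the real code, building SQLiteConnection can trigger a
            # garbage collection that runs PooledConnection.__del__ here,
            # on the very thread that already holds the lock.
            self._return_cnx()
            return object()  # stands in for the new connection
        finally:
            self._available.release()

    def _return_cnx(self):
        # With threading.Lock this second acquire() by the same thread would
        # block forever; with threading.RLock it just bumps the recursion
        # count and returns. Use a non-blocking acquire so the demo never hangs.
        acquired = self._available.acquire(False)
        print("re-entrant acquire succeeded: %s" % acquired)
        if acquired:
            self._available.release()


Pool(threading.RLock).get_cnx()   # prints True
Pool(threading.Lock).get_cnx()    # prints False (would deadlock if blocking)
```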