Ticket #3923 (closed defect: fixed)
Opened 6 years ago
Last modified 4 years ago
Thread deadlock
| Reported by: | exarkun@… | Owned by: | jonas |
|---|---|---|---|
| Priority: | high | Milestone: | 0.11.1 |
| Component: | general | Version: | 0.10 |
| Severity: | critical | Keywords: | database pool deadlock twisted verify |
| Cc: | exarkun@…, moo.lighttpd@… | ||
| Release Notes: | |||
| API Changes: | |||
Description
I have trac deployed as a WSGI application. Occassionally it hangs and must be restarted. Inspecting the state of the process reveals threads in two states. One is here:
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib/python2.4/threading.py", line 422, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/trac/Projects/Twisted/trunk/twisted/python/threadpool.py", line 148, in _worker
context.call(ctx, function, *args, **kwargs)
File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 59, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 37, in callWithContext
return func(*args,**kw)
File "/home/trac/Projects/Twisted/trunk/twisted/web2/wsgi.py", line 138, in run
result = self.application(self.environment, self.startWSGIResponse)
File "/home/trac/Run/trac/external.py", line 74, in __call__
return self.application(environ, start_response)
File "/home/trac/Run/trac/external.py", line 115, in tracApplication
return trac.web.main.dispatch_request(environ, start_response)
File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 356, in dispatch_request
dispatcher.dispatch(req)
File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 190, in dispatch
req.perm = PermissionCache(self.env, req.authname)
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 267, in __init__
self.perms = PermissionSystem(env).get_user_permissions(username)
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 231, in get_user_permissions
for perm in self.store.get_user_permissions(username):
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 109, in get_user_permissions
db = self.env.get_db_cnx()
File "/home/trac/Projects/trac/trunk/trac/env.py", line 182, in get_db_cnx
return DatabaseManager(self).get_connection()
File "/home/trac/Projects/trac/trunk/trac/db/api.py", line 75, in get_connection
return self._cnx_pool.get_cnx(self.timeout or None)
File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 84, in get_cnx
cnx = self._connector.get_connection(**self._kwargs)
File "/home/trac/Projects/trac/trunk/trac/db/sqlite_backend.py", line 113, in get_connection
return SQLiteConnection(path, params)
File "/home/trac/Projects/trac/trunk/trac/db/sqlite_backend.py", line 159, in __init__
timeout=timeout)
File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 48, in __del__
self.close()
File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 44, in close
self._pool._return_cnx(self.cnx, self._tid)
File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 102, in _return_cnx
self._available.acquire()
Nine are here:
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib/python2.4/threading.py", line 422, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/trac/Projects/Twisted/trunk/twisted/python/threadpool.py", line 148, in _worker
context.call(ctx, function, *args, **kwargs)
File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 59, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 37, in callWithContext
return func(*args,**kw)
File "/home/trac/Projects/Twisted/trunk/twisted/web2/wsgi.py", line 138, in run
result = self.application(self.environment, self.startWSGIResponse)
File "/home/trac/Run/trac/external.py", line 74, in __call__
return self.application(environ, start_response)
File "/home/trac/Run/trac/external.py", line 115, in tracApplication
return trac.web.main.dispatch_request(environ, start_response)
File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 356, in dispatch_request
dispatcher.dispatch(req)
File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 190, in dispatch
req.perm = PermissionCache(self.env, req.authname)
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 267, in __init__
self.perms = PermissionSystem(env).get_user_permissions(username)
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 231, in get_user_permissions
for perm in self.store.get_user_permissions(username):
File "/home/trac/Projects/trac/trunk/trac/perm.py", line 109, in get_user_permissions
db = self.env.get_db_cnx()
File "/home/trac/Projects/trac/trunk/trac/env.py", line 182, in get_db_cnx
return DatabaseManager(self).get_connection()
File "/home/trac/Projects/trac/trunk/trac/db/api.py", line 75, in get_connection
return self._cnx_pool.get_cnx(self.timeout or None)
File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 65, in get_cnx
self._available.acquire()
The WSGI container uses a threadpool with ten threads for the application.
I have only had instrumentation in place to inspect this state on one occassion so far, so I don't know if this is the typical state in which the process ends.
The server is running r3798 with the db_pool_timeout-r3868.patch from #3504. Python is 2.4.2, SQLite is 3.3.7. PySQLite is 2.3.2. Subversion is 1.4.0. glibc is 2.3.5. Linux is 2.6.12.
Attachments
Change History
comment:1 Changed 6 years ago by cboos
- Keywords database pool added
- Milestone set to 0.10.1
- Owner changed from jonas to cboos
- Severity changed from major to critical
comment:2 Changed 6 years ago by cboos
- Status changed from new to assigned
exarkun, can you try again with s/threading.Lock/threading.RLock/ ?
comment:3 Changed 6 years ago by exarkun@…
- Cc exarkun@… added
Updated my server to use RLock. Will comment again if/when anything happens…
comment:4 Changed 6 years ago by cboos
Too bad, I missed the opportunity to close that ticket with r3923 ;)
More seriously, can you also try with the updated db code integrating the [3920:3924/trunk/trac/db] changes?
comment:5 Changed 6 years ago by cboos
Oops, don't update to current trunk right now: looks like those pool changes introduced some big issues (I'm now always hitting the timeout).
comment:6 Changed 6 years ago by cboos
Now with [3929:3934/trunk/trac/db] all is ok again, so pleasy try out the following patch.
comment:7 Changed 6 years ago by moo
- Cc moo.lighttpd@… added
- Keywords deadlock added
added cc to me, thx
comment:8 Changed 6 years ago by cboos
Do I have to assume that everything is OK? ;)
One "little" problem I have with the recent pool code changes is that I've got segfaults with Python 2.5 on windows. I didn't have time so far to investigate this problem in more depth, but this is surely a showstopper for the merge in 0.10.1 at that point…
comment:9 Changed 6 years ago by exarkun@…
I haven't had time to update to trunk (also I'm not really sure from the comments here that I should update to trunk - can you clarify regarding what revisions are safe to use?). Nor have I run into this problem again since my last message here. This just doesn't happen very often.
comment:10 Changed 6 years ago by cboos
- can you clarify regarding what revisions are safe to use?
Well, I have the crash on win64 with Python 2.5 (and the builtin sqlite3).
My regular Python 2.4 (and pysqlite 2.1.3) on win32 works fine, so does Python 2.5 on win32 with builtin sqlite3 (I do have crashes with the latter, but they're not related to sqlite3, rather to the -r code reload feature of tracd).
For now, I can't say how things are on Linux.
Once I've successfully tested the new code on Linux, I'll probably port the whole set of changes on 0.10-stable, that will make it easier to test the changes on a wider scale.
comment:12 Changed 6 years ago by exarkun@…
This hasn't happened since I updated to r4045.
comment:13 Changed 6 years ago by cboos
- Milestone changed from 0.10.3 to 0.10.1
- Resolution set to fixed
- Status changed from assigned to closed
So I think this was fixed in the bunch of db pool fixes ported to 0.10.1.
comment:14 Changed 6 years ago by anonymous
I just had my server deadlock again. Same version of everything as before.
comment:15 Changed 6 years ago by exarkun@…
Actually, it deadlocked a number of times in a row (5 or so) in rapid succession.
comment:16 Changed 6 years ago by cboos
So I've got a bunch of questions…
- Just to be sure, when you say r4045, are you using that with 0.10-stable or trunk?
- Also, is this with r3830 backed out or not?
- Are there any other modifications to the code?
- Does the deadlock ends with a timeout exception?
- Also, are you using an authz_file and maybe you're hit by #3996? (this is unlikely, I know, as otherwise the lock would happen systematically).
comment:17 Changed 5 years ago by exarkun@…
- trunk
- Yes, r3830 backed out
- No other modifications to the code
- The deadlock never resolves itself, the server has to be SIGKILL'd.
- There's no authz_file entry in my trac.ini, so I guess not.
comment:18 Changed 5 years ago by cboos
- Milestone changed from 0.10.1 to 0.10.4
- Resolution fixed deleted
- Status changed from closed to reopened
Can you try a more recent trunk than r4045?
Also, I'd need the C and Python backtrace again, please ;)
comment:19 Changed 5 years ago by cboos
Is the situation any better with 0.10.4?
comment:20 follow-up: ↓ 21 Changed 5 years ago by exarkun
comment:21 in reply to: ↑ 20 Changed 5 years ago by cboos
comment:22 Changed 4 years ago by cboos
- Keywords twisted verify added; needinfo removed
- Milestone changed from 0.10.5 to 0.11.1
comment:23 Changed 4 years ago by anonymous
Trying that. I'll comment again in a few days.
comment:24 Changed 4 years ago by exarkun
I applied that revision. The problems don't seem to be much affected. Today the server segfaulted (which it hasn't done in a long, long time). Dunno if it's related to that changeset or not.
comment:25 Changed 4 years ago by cboos
Is there any possibility for you to try the current trunk?
comment:26 Changed 4 years ago by anonymous
This would probably involve a lot of changes to our deployment and related code and templates, which I don't have time to do.
comment:27 Changed 4 years ago by cboos
Well, I was simply suggesting that you could setup a test installation of 0.11 with Twisted, not the whole twistedmatrix.com Trac.
comment:28 Changed 4 years ago by cboos
- Milestone changed from 0.11.2 to 0.11.1
- Owner changed from cboos to jonas
- Status changed from reopened to new
twistedmatrix.com/trac is currently running an old 0.11dev version. I strongly suggest that you upgrade to the new pool code (i.e. at least branches/0.11-stable@7317), which should hopefully be more robust w.r.t. deadlocks.
comment:29 Changed 4 years ago by cboos
- Resolution set to fixed
- Status changed from new to closed
Assuming the issue was fixed by r7190. Of course, feel free to reopen if the problem persists even with the new pool.



"translated stack" for the first thread (I've offset the line numbers a bit, to make things clearer):
cnx = self._connector.get_connection(**self._kwargs)''''
return SQLiteConnection(path, params)
timeout=timeout)
here, PySqlite triggers a garbage collection…
self.close()
… which has no better idea than to free another PooledConnection!
self._pool._return_cnx(self.cnx, self._tid)
self._available.acquire()
… and boom, deadlock, as we are not using a threading.RLock anymore, since r2771. I have to discuss this with cmlenz who did this change, but I don't think r2771 is valid, we should really use a threading.RLock here, IMO.
Thank you very much for the feedback!