Edgewall Software
Modify

Ticket #3923 (closed defect: fixed)

Opened 6 years ago

Last modified 4 years ago

Thread deadlock

Reported by: exarkun@… Owned by: jonas
Priority: high Milestone: 0.11.1
Component: general Version: 0.10
Severity: critical Keywords: database pool deadlock twisted verify
Cc: exarkun@…, moo.lighttpd@…
Release Notes:
API Changes:

Description

I have trac deployed as a WSGI application. Occassionally it hangs and must be restarted. Inspecting the state of the process reveals threads in two states. One is here:

  File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/threadpool.py", line 148, in _worker
    context.call(ctx, function, *args, **kwargs)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
  File "/home/trac/Projects/Twisted/trunk/twisted/web2/wsgi.py", line 138, in run
    result = self.application(self.environment, self.startWSGIResponse)
  File "/home/trac/Run/trac/external.py", line 74, in __call__
    return self.application(environ, start_response)
  File "/home/trac/Run/trac/external.py", line 115, in tracApplication
    return trac.web.main.dispatch_request(environ, start_response)
  File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 356, in dispatch_request
    dispatcher.dispatch(req)
  File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 190, in dispatch
    req.perm = PermissionCache(self.env, req.authname)
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 267, in __init__
    self.perms = PermissionSystem(env).get_user_permissions(username)
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 231, in get_user_permissions
    for perm in self.store.get_user_permissions(username):
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 109, in get_user_permissions
    db = self.env.get_db_cnx()
  File "/home/trac/Projects/trac/trunk/trac/env.py", line 182, in get_db_cnx
    return DatabaseManager(self).get_connection()
  File "/home/trac/Projects/trac/trunk/trac/db/api.py", line 75, in get_connection
    return self._cnx_pool.get_cnx(self.timeout or None)
  File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 84, in get_cnx
    cnx = self._connector.get_connection(**self._kwargs)
  File "/home/trac/Projects/trac/trunk/trac/db/sqlite_backend.py", line 113, in get_connection
    return SQLiteConnection(path, params)
  File "/home/trac/Projects/trac/trunk/trac/db/sqlite_backend.py", line 159, in __init__
    timeout=timeout)
  File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 48, in __del__
    self.close()
  File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 44, in close
    self._pool._return_cnx(self.cnx, self._tid)
  File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 102, in _return_cnx
    self._available.acquire()

Nine are here:

  File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/threadpool.py", line 148, in _worker
    context.call(ctx, function, *args, **kwargs)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 59, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/home/trac/Projects/Twisted/trunk/twisted/python/context.py", line 37, in callWithContext
    return func(*args,**kw)
  File "/home/trac/Projects/Twisted/trunk/twisted/web2/wsgi.py", line 138, in run
    result = self.application(self.environment, self.startWSGIResponse)
  File "/home/trac/Run/trac/external.py", line 74, in __call__
    return self.application(environ, start_response)
  File "/home/trac/Run/trac/external.py", line 115, in tracApplication
    return trac.web.main.dispatch_request(environ, start_response)
  File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 356, in dispatch_request
    dispatcher.dispatch(req)
  File "/home/trac/Projects/trac/trunk/trac/web/main.py", line 190, in dispatch
    req.perm = PermissionCache(self.env, req.authname)
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 267, in __init__
    self.perms = PermissionSystem(env).get_user_permissions(username)
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 231, in get_user_permissions
    for perm in self.store.get_user_permissions(username):
  File "/home/trac/Projects/trac/trunk/trac/perm.py", line 109, in get_user_permissions
    db = self.env.get_db_cnx()
  File "/home/trac/Projects/trac/trunk/trac/env.py", line 182, in get_db_cnx
    return DatabaseManager(self).get_connection()
  File "/home/trac/Projects/trac/trunk/trac/db/api.py", line 75, in get_connection
    return self._cnx_pool.get_cnx(self.timeout or None)
  File "/home/trac/Projects/trac/trunk/trac/db/pool.py", line 65, in get_cnx
    self._available.acquire()

The WSGI container uses a threadpool with ten threads for the application.

I have only had instrumentation in place to inspect this state on one occassion so far, so I don't know if this is the typical state in which the process ends.

The server is running r3798 with the db_pool_timeout-r3868.patch from #3504. Python is 2.4.2, SQLite is 3.3.7. PySQLite is 2.3.2. Subversion is 1.4.0. glibc is 2.3.5. Linux is 2.6.12.

Attachments

Change History

comment:1 Changed 6 years ago by cboos

  • Keywords database pool added
  • Milestone set to 0.10.1
  • Owner changed from jonas to cboos
  • Severity changed from major to critical

"translated stack" for the first thread (I've offset the line numbers a bit, to make things clearer):

  1. we start from an innocuous env.get_db_cnx() and we end-up in the pool, holding the threading.Lock:
  2. source:trunk/trac/db/pool.py@3798#L83, in get_cnx
    cnx = self._connector.get_connection(**self._kwargs)''''
  3. source:trunk/trac/db/sqlite_backend.py@3798#L113, in get_connection
    return SQLiteConnection(path, params)
  4. source:trunk/trac/db/sqlite_backend.py@3798#L158, in __init__
    timeout=timeout)
    here, PySqlite triggers a garbage collection…
  5. source:trunk/trac/db/pool.py@3798#L48, in __del__
    self.close()
    … which has no better idea than to free another PooledConnection!
  6. source:trunk/trac/db/pool.py@3798#L44, in close
    self._pool._return_cnx(self.cnx, self._tid)
  7. source:trunk/trac/db/pool.py@3798#L101, in _return_cnx
    self._available.acquire()

… and boom, deadlock, as we are not using a threading.RLock anymore, since r2771. I have to discuss this with cmlenz who did this change, but I don't think r2771 is valid, we should really use a threading.RLock here, IMO.

Thank you very much for the feedback!

comment:2 Changed 6 years ago by cboos

  • Status changed from new to assigned

exarkun, can you try again with s/threading.Lock/threading.RLock/ ?

comment:3 Changed 6 years ago by exarkun@…

  • Cc exarkun@… added

Updated my server to use RLock. Will comment again if/when anything happens…

comment:4 Changed 6 years ago by cboos

Too bad, I missed the opportunity to close that ticket with r3923 ;)

More seriously, can you also try with the updated db code integrating the [3920:3924/trunk/trac/db] changes?

comment:5 Changed 6 years ago by cboos

Oops, don't update to current trunk right now: looks like those pool changes introduced some big issues (I'm now always hitting the timeout).

comment:6 Changed 6 years ago by cboos

Now with [3929:3934/trunk/trac/db] all is ok again, so pleasy try out the following patch.

comment:7 Changed 6 years ago by moo

  • Cc moo.lighttpd@… added
  • Keywords deadlock added

added cc to me, thx

comment:8 Changed 6 years ago by cboos

Do I have to assume that everything is OK? ;)

One "little" problem I have with the recent pool code changes is that I've got segfaults with Python 2.5 on windows. I didn't have time so far to investigate this problem in more depth, but this is surely a showstopper for the merge in 0.10.1 at that point…

comment:9 Changed 6 years ago by exarkun@…

I haven't had time to update to trunk (also I'm not really sure from the comments here that I should update to trunk - can you clarify regarding what revisions are safe to use?). Nor have I run into this problem again since my last message here. This just doesn't happen very often.

comment:10 Changed 6 years ago by cboos

  • can you clarify regarding what revisions are safe to use?

Well, I have the crash on win64 with Python 2.5 (and the builtin sqlite3).

My regular Python 2.4 (and pysqlite 2.1.3) on win32 works fine, so does Python 2.5 on win32 with builtin sqlite3 (I do have crashes with the latter, but they're not related to sqlite3, rather to the -r code reload feature of tracd).

For now, I can't say how things are on Linux.
Once I've successfully tested the new code on Linux, I'll probably port the whole set of changes on 0.10-stable, that will make it easier to test the changes on a wider scale.

comment:11 Changed 6 years ago by cboos

  • Keywords needinfo added

No more deadlocks?

comment:12 Changed 6 years ago by exarkun@…

This hasn't happened since I updated to r4045.

comment:13 Changed 6 years ago by cboos

  • Milestone changed from 0.10.3 to 0.10.1
  • Resolution set to fixed
  • Status changed from assigned to closed

So I think this was fixed in the bunch of db pool fixes ported to 0.10.1.

comment:14 Changed 6 years ago by anonymous

I just had my server deadlock again. Same version of everything as before.

comment:15 Changed 6 years ago by exarkun@…

Actually, it deadlocked a number of times in a row (5 or so) in rapid succession.

comment:16 Changed 6 years ago by cboos

So I've got a bunch of questions…

  1. Just to be sure, when you say r4045, are you using that with 0.10-stable or trunk?
  2. Also, is this with r3830 backed out or not?
  3. Are there any other modifications to the code?
  4. Does the deadlock ends with a timeout exception?
  5. Also, are you using an authz_file and maybe you're hit by #3996? (this is unlikely, I know, as otherwise the lock would happen systematically).

comment:17 Changed 5 years ago by exarkun@…

  1. trunk
  2. Yes, r3830 backed out
  3. No other modifications to the code
  4. The deadlock never resolves itself, the server has to be SIGKILL'd.
  5. There's no authz_file entry in my trac.ini, so I guess not.

comment:18 Changed 5 years ago by cboos

  • Milestone changed from 0.10.1 to 0.10.4
  • Resolution fixed deleted
  • Status changed from closed to reopened

Can you try a more recent trunk than r4045?
Also, I'd need the C and Python backtrace again, please ;)

comment:19 Changed 5 years ago by cboos

Is the situation any better with 0.10.4?

comment:20 follow-up: Changed 5 years ago by exarkun

We're still using r4045 of trunk with r3830 backed out. I don't think we have any plans to upgrade in the foreseeable future. We're making do by restarting the server every six hours. I don't think we'll be able to test with a newer revision.

comment:21 in reply to: ↑ 20 Changed 5 years ago by cboos

Replying to exarkun:

We're still using r4045 of trunk with r3830 backed out. I don't think we have any plans to upgrade in the foreseeable future.

Try at least to apply r4886. I wouldn't be so surprised if that fixed this issue as well…

comment:22 Changed 4 years ago by cboos

  • Keywords twisted verify added; needinfo removed
  • Milestone changed from 0.10.5 to 0.11.1

comment:23 Changed 4 years ago by anonymous

Trying that. I'll comment again in a few days.

comment:24 Changed 4 years ago by exarkun

I applied that revision. The problems don't seem to be much affected. Today the server segfaulted (which it hasn't done in a long, long time). Dunno if it's related to that changeset or not.

comment:25 Changed 4 years ago by cboos

Is there any possibility for you to try the current trunk?

comment:26 Changed 4 years ago by anonymous

This would probably involve a lot of changes to our deployment and related code and templates, which I don't have time to do.

comment:27 Changed 4 years ago by cboos

Well, I was simply suggesting that you could setup a test installation of 0.11 with Twisted, not the whole twistedmatrix.com Trac.

comment:28 Changed 4 years ago by cboos

  • Milestone changed from 0.11.2 to 0.11.1
  • Owner changed from cboos to jonas
  • Status changed from reopened to new

twistedmatrix.com/trac is currently running an old 0.11dev version. I strongly suggest that you upgrade to the new pool code (i.e. at least branches/0.11-stable@7317), which should hopefully be more robust w.r.t. deadlocks.

comment:29 Changed 4 years ago by cboos

  • Resolution set to fixed
  • Status changed from new to closed

Assuming the issue was fixed by r7190. Of course, feel free to reopen if the problem persists even with the new pool.

View

Add a comment

Modify Ticket

Change Properties
<Author field>
Action
as closed
The resolution will be deleted. Next status will be 'reopened'
to The owner will be changed from jonas. Next status will be 'closed'
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.