Edgewall Software

Ticket #4465 (closed defect: fixed)

Opened 20 months ago

Last modified 9 months ago

Timeout reached while waiting for a connection

Reported by: ezra@… Owned by: cboos
Priority: highest Milestone: 0.11
Component: general Version: 0.10.4
Severity: critical Keywords: db pool
Cc:

Description (last modified by thatch) (diff)

Python Traceback

Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 387, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 183, in dispatch
    req.perm = PermissionCache(self.env, req.authname)
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 263, in __init__
    self.perms = PermissionSystem(env).get_user_permissions(username)
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 227, in get_user_permissions
    for perm in self.store.get_user_permissions(username):
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 109, in get_user_permissions
    db = self.env.get_db_cnx()
  File "/usr/lib/python2.4/site-packages/trac/env.py", line 182, in get_db_cnx
    return DatabaseManager(self).get_connection()
  File "/usr/lib/python2.4/site-packages/trac/db/api.py", line 75, in get_connection
    return self._cnx_pool.get_cnx(self.timeout or None)
  File "/usr/lib/python2.4/site-packages/trac/db/pool.py", line 107, in get_cnx
    raise TimeoutError('Unable to get database '
TimeoutError: Unable to get database connection within 20 seconds

Attachments

test_pool_timeout.py (2.4 kB) - added by cboos 18 months ago.
Simple test program demonstrating the issue - needs two paramaters, the $TRAC_ENV and the number of threads to use. If more than 5 given, it freezes.

Change History

Changed 20 months ago by thatch

  • keywords needinfo added
  • priority changed from high to normal
  • component changed from wiki to general
  • description modified (diff)

A little more information would be helpful. For starters, what db backend are you using? What server method? Does this happen on the first request or only after sitting a while?

Changed 20 months ago by ezra@…

  • status changed from new to closed
  • resolution set to worksforme

sorry this was on the edgewall wiki itself. it seems to have cleared itself up though

thank you for the response

Changed 19 months ago by cboos

  • status changed from closed to reopened
  • severity changed from normal to major
  • priority changed from normal to high
  • milestone set to 0.10.4
  • keywords db pool weird added; needinfo removed
  • resolution worksforme deleted

That error happens now systematically when trying to load a bunch of pages simultaneously. Once a thread gets stuck on the self._available.wait(), all the other threads seem to get stuck as well, and then nothing progresses until the timeout is reached. At that point, things resume normally. Weird.

Changed 18 months ago by cboos

  • owner changed from jonas to cboos
  • status changed from reopened to new
  • severity changed from major to blocker

Actually, this is the other main problem we have now in Trac, together with #4586.

Changed 18 months ago by cboos

  • summary changed from TracPermissions wiki down to Timeout reached while waiting for a connection

The problem is not really with the fact that one Request can timeout while waiting for a connection, it's rather the fact that when one thread starts to wait, all activity is frozen. It seems that no other thread gets back the control... So in fact, the waiting thread has no chance to ever get a connection, as no other thread can terminate and release the connection. The waiting thread has no "choice" but wait until the timeout and exit with an error. At this point, the other threads seem to proceed normally.

I've seen this on both Linux and Windows, and I've not yet understood completely the reasons for this behavior.

Changed 18 months ago by cboos

  • keywords weird removed
  • status changed from new to assigned

Ok, "weirdness" elucidated.

The problem comes from the interference between the connection pool and the repository cache.

Inside the repository cache (i.e. with the repo cache lock held), if we don't already have a repository object available, we try to create a new one. In case this implies a CachedRepository (as with Subversion), this will in turn try to acquire a connection from the pool.

If we don't already have a connection and if the pool is empty (all connections are in use), we wait on a timeout there in hope that the other threads will proceed and release their connection to the pool. However, it's very likely that before they'll do so, they require a repository object for themselves... deadlock.

A simple solution would be to make sure that we already have the db connection before trying to get a repository from the cache and getting hold of the repository cache lock.

Changed 18 months ago by cboos

Simple test program demonstrating the issue - needs two paramaters, the $TRAC_ENV and the number of threads to use. If more than 5 given, it freezes.

Changed 18 months ago by cboos

  • status changed from assigned to closed
  • resolution set to fixed

Fixed in r4886 (trunk) and r4887 (0.10-stable).

Changed 16 months ago by datallah@…

  • version changed from 0.10.3 to 0.10.4

I'm still running into this after upgrading to 0.10.4.

We're the fastcgi interface with lighttpd and a postgres db on a Debian etch box. I haven't changed any connection pooling settings. The www-data user is allowed 1024 open files.

I'm not sure if this helps to track down the issue, but I currently have version control disabled: trac.versioncontrol.* = disabled

The stack traces are across the board - anything that tries to get a db connection fails.

Please let me know what other info I can provide.

Changed 16 months ago by eblot

  • status changed from closed to reopened
  • resolution fixed deleted

(re-opening the ticked to match the latest comment)

Changed 11 months ago by cboos

  • keywords verify added
  • severity changed from blocker to critical
  • milestone changed from 0.10.4 to 0.10.5

#5335 and #6188 report identical backtraces. I think there could be a real problem here... though I guess this will be difficult to reproduce.

Changed 11 months ago by cboos

... and #5536 as well, though the backtrace is slightly different.

Changed 10 months ago by anonymous

  • keywords really added

Changed 9 months ago by cboos

  • milestone changed from 0.10.5 to 0.11.1

Update: r4886 is no longer current (see r6124) and despite of that, I'm unable to reproduce the deadlocks in Trac, the attachment:test_pool_timeout.py sample works fine for any number of threads.

Changed 9 months ago by cboos

  • keywords verify really removed
  • priority changed from high to highest
  • status changed from reopened to new
  • milestone changed from 0.11.1 to 0.11

#6405 was closed as duplicate. That ticket contained a detailed report showing that this issue is still present with the very latest version of everything, when the server is tracd on Linux.

That prompted me to do some further tests, and indeed I was able to reproduce the same timeout issue, with tracd on Linux. The only difference in my initial tests was that after the timeout, it was possible to resume working without having to restart the server.

I then tried again with r6124 reverted (i.e. reapplying the r4886 fix) and that definitely helped in making the timeout errors much less likely.

However this doesn't prevent the timeout situation completely, in particular when one do a reload all tabs while they're not yet fully loaded. Even worse, when doing so, it seems that the server can really be put in a state where all later requests will fail with the timeout.

Changed 9 months ago by cboos

  • status changed from new to closed
  • resolution set to fixed

Ok, so in r6260 I've restored the r4886 fix, which is kind of bad as I really wanted to get rid of that. It appears that for some reason it's really needed.

Now, even with that change, I was still able to trigger the "persistent" timeout issue, specifically when triggering lots of "broken connection" errors. This made me think that the env.shutdown(tid) done at the end of request processing was maybe not called every time.

Indeed, when making sure that the env.shutdown(tid) gets always called, with r6261, the persistent timeout issue didn't happen again.

I've not checked yet if there's something to backport to 0.10.5dev, but that appears less critical now that the beta1 is about to be released.

Add/Change #4465 (Timeout reached while waiting for a connection)

Author



Change Properties
<Author field>
Action
as closed
Next status will be 'reopened'
 
Note: See TracTickets for help on using tickets.