Edgewall Software

Opened 10 years ago

Closed 8 years ago

#4465 closed defect (fixed)

Timeout reached while waiting for a connection

Reported by: ezra@… Owned by: Christian Boos
Priority: highest Milestone: 0.11.2
Component: general Version: 0.10.4
Severity: critical Keywords: db pool
Cc: osimons
Release Notes:
API Changes:

Description (last modified by Tim Hatch)

Python Traceback

Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 387, in dispatch_request
  File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 183, in dispatch
    req.perm = PermissionCache(self.env, req.authname)
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 263, in __init__
    self.perms = PermissionSystem(env).get_user_permissions(username)
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 227, in get_user_permissions
    for perm in self.store.get_user_permissions(username):
  File "/usr/lib/python2.4/site-packages/trac/perm.py", line 109, in get_user_permissions
    db = self.env.get_db_cnx()
  File "/usr/lib/python2.4/site-packages/trac/env.py", line 182, in get_db_cnx
    return DatabaseManager(self).get_connection()
  File "/usr/lib/python2.4/site-packages/trac/db/api.py", line 75, in get_connection
    return self._cnx_pool.get_cnx(self.timeout or None)
  File "/usr/lib/python2.4/site-packages/trac/db/pool.py", line 107, in get_cnx
    raise TimeoutError('Unable to get database '
TimeoutError: Unable to get database connection within 20 seconds

Attachments (1)

test_pool_timeout.py (2.4 KB) - added by Christian Boos 10 years ago.
Simple test program demonstrating the issue - needs two paramaters, the $TRAC_ENV and the number of threads to use. If more than 5 given, it freezes.

Download all attachments as: .zip

Change History (23)

comment:1 Changed 10 years ago by Tim Hatch

Component: wikigeneral
Description: modified (diff)
Keywords: needinfo added
Priority: highnormal

A little more information would be helpful. For starters, what db backend are you using? What server method? Does this happen on the first request or only after sitting a while?

comment:2 Changed 10 years ago by ezra@…

Resolution: worksforme
Status: newclosed

sorry this was on the edgewall wiki itself. it seems to have cleared itself up though

thank you for the response

comment:3 Changed 10 years ago by Christian Boos

Keywords: db pool weird added; needinfo removed
Milestone: 0.10.4
Priority: normalhigh
Resolution: worksforme
Severity: normalmajor
Status: closedreopened

That error happens now systematically when trying to load a bunch of pages simultaneously. Once a thread gets stuck on the self._available.wait(), all the other threads seem to get stuck as well, and then nothing progresses until the timeout is reached. At that point, things resume normally. Weird.

comment:4 Changed 10 years ago by Christian Boos

Owner: changed from Jonas Borgström to Christian Boos
Severity: majorblocker
Status: reopenednew

Actually, this is the other main problem we have now in Trac, together with #4586.

comment:5 Changed 10 years ago by Christian Boos

Summary: TracPermissions wiki downTimeout reached while waiting for a connection

The problem is not really with the fact that one Request can timeout while waiting for a connection, it's rather the fact that when one thread starts to wait, all activity is frozen. It seems that no other thread gets back the control… So in fact, the waiting thread has no chance to ever get a connection, as no other thread can terminate and release the connection. The waiting thread has no "choice" but wait until the timeout and exit with an error. At this point, the other threads seem to proceed normally.

I've seen this on both Linux and Windows, and I've not yet understood completely the reasons for this behavior.

comment:6 Changed 10 years ago by Christian Boos

Keywords: weird removed
Status: newassigned

Ok, "weirdness" elucidated.

The problem comes from the interference between the connection pool and the repository cache.

Inside the repository cache (i.e. with the repo cache lock held), if we don't already have a repository object available, we try to create a new one. In case this implies a CachedRepository (as with Subversion), this will in turn try to acquire a connection from the pool.

If we don't already have a connection and if the pool is empty (all connections are in use), we wait on a timeout there in hope that the other threads will proceed and release their connection to the pool. However, it's very likely that before they'll do so, they require a repository object for themselves… deadlock.

A simple solution would be to make sure that we already have the db connection before trying to get a repository from the cache and getting hold of the repository cache lock.

Changed 10 years ago by Christian Boos

Attachment: test_pool_timeout.py added

Simple test program demonstrating the issue - needs two paramaters, the $TRAC_ENV and the number of threads to use. If more than 5 given, it freezes.

comment:7 Changed 10 years ago by Christian Boos

Resolution: fixed
Status: assignedclosed

Fixed in r4886 (trunk) and r4887 (0.10-stable).

comment:8 Changed 10 years ago by datallah@…


I'm still running into this after upgrading to 0.10.4.

We're the fastcgi interface with lighttpd and a postgres db on a Debian etch box. I haven't changed any connection pooling settings. The www-data user is allowed 1024 open files.

I'm not sure if this helps to track down the issue, but I currently have version control disabled: trac.versioncontrol.* = disabled

The stack traces are across the board - anything that tries to get a db connection fails.

Please let me know what other info I can provide.

comment:9 Changed 10 years ago by Emmanuel Blot

Resolution: fixed
Status: closedreopened

(re-opening the ticked to match the latest comment)

comment:10 Changed 9 years ago by Christian Boos

Keywords: verify added
Severity: blockercritical

#5335 and #6188 report identical backtraces. I think there could be a real problem here… though I guess this will be difficult to reproduce.

comment:11 Changed 9 years ago by Christian Boos

… and #5536 as well, though the backtrace is slightly different.

comment:12 Changed 9 years ago by anonymous

Keywords: really added

comment:13 Changed 9 years ago by Christian Boos


Update: r4886 is no longer current (see r6124) and despite of that, I'm unable to reproduce the deadlocks in Trac, the attachment:test_pool_timeout.py sample works fine for any number of threads.

comment:14 Changed 9 years ago by Christian Boos

Keywords: verify really removed
Priority: highhighest
Status: reopenednew

#6405 was closed as duplicate. That ticket contained a detailed report showing that this issue is still present with the very latest version of everything, when the server is tracd on Linux.

That prompted me to do some further tests, and indeed I was able to reproduce the same timeout issue, with tracd on Linux. The only difference in my initial tests was that after the timeout, it was possible to resume working without having to restart the server.

I then tried again with r6124 reverted (i.e. reapplying the r4886 fix) and that definitely helped in making the timeout errors much less likely.

However this doesn't prevent the timeout situation completely, in particular when one do a reload all tabs while they're not yet fully loaded. Even worse, when doing so, it seems that the server can really be put in a state where all later requests will fail with the timeout.

comment:15 Changed 9 years ago by Christian Boos

Resolution: fixed
Status: newclosed

Ok, so in r6260 I've restored the r4886 fix, which is kind of bad as I really wanted to get rid of that. It appears that for some reason it's really needed.

Now, even with that change, I was still able to trigger the "persistent" timeout issue, specifically when triggering lots of "broken connection" errors. This made me think that the env.shutdown(tid) done at the end of request processing was maybe not called every time.

Indeed, when making sure that the env.shutdown(tid) gets always called, with r6261, the persistent timeout issue didn't happen again.

I've not checked yet if there's something to backport to 0.10.5dev, but that appears less critical now that the beta1 is about to be released.

comment:16 Changed 8 years ago by osimons

Cc: osimons added
Resolution: fixed
Status: closedreopened

I still get this TimeoutError at sporadic intervals, and there is something not quite right. Once one TimeoutError is reached there is no helping the site apart from a restart - all projects and all requests hang (Apache Windows single-process setup).

Reading up on threading I see that the pool code calls wait(), but there is no code anywhere that actually calls notify() to wake them up again. With the help of rblank on IRC, this patch was put together that I'm currently testing:

  • trac/db/pool.py

    a b  
    130130                self._pool.append(cnx)
    131131                self._pool_key.append(key)
    132132                self._pool_time.append(time.time())
     133                self._available.notify()
    133134            elif num == 1:
    134135                del self._active[(tid, key)]
     136                self._available.notify()
    135137            else:
    136138                self._active[(tid, key)] = (cnx, num - 1)
    137139        finally:

If anyone else sees regular TimeoutError, it may be well worth trying the patch to see if it helps. Further insight and testing much appreciated…

comment:17 Changed 8 years ago by Christian Boos

Out of curiosity, what happens when you try attachment:test_pool_timeout.py, are you able to reproduce the freeze when using more threads than the default available connections (10 now, IIRC)? I wasn't, but maybe this program was testing a different condition than the one you're dealing with now.

comment:18 Changed 8 years ago by osimons

Tried it, and here is findings using 250 threads:

  • Without patch I get consistently longer periods without output (everything waiting). With patch there is continuous output from the script.
  • Without patch and a timeout setting of 20, i get quite a lot of timeout errors. Adding the patch does not visibly reduce the amount of errors markedly, but quite a few of them also additionally show database is locked + timeout messages - ie. it throws the error for the wrong reason, I suppose.
  • Setting timeout to 0 in trac.ini and using patch gives continuous output with occasionally errors due to database is locked, but all these threads acquire cnx shortly after.
  • I tried changing the pool size, but did not find any particular changes from increasing it - except if it got too high (100) it increased errors. However, with just one project in the test script, this change may not be effective anyway - perhaps this increase will make more sense across a larger multi-project setup with many environments.

Looking at the pool code, I see that without the explicit call to notify() (patch) only 2 attempts at getting a connection will ever be made: First on entry, and if it doesn't get it it will wait for the full timeout period regardless of available connections in the meantime, then it will get woken up and tries one more time. If no connection is available then, it will hit the time-test and fail this due to fractionally more time than timeout will have passed.

I think the patch improves things, and I'm now using it + timeout = 0 but haven't yet changed pool size. We'll see how it all works out.

comment:19 Changed 8 years ago by Jonas Borgström

Good catch osimons. Without calling notify() everytime a pool slot is made available we risk raising TimeoutErrors in some high load situations.

Here's an updated patch with a small restructuring of _return_cnx to make it easier to follow and adding another notify() call to the _shutdown method.

  • trac/db/pool.py

    125125        try:
    126126            assert (tid, key) in self._active
    127127            cnx, num = self._active[(tid, key)]
    128             if num == 1 and cnx.poolable and try_rollback(cnx):
     128            if num == 1:
    129129                del self._active[(tid, key)]
    130                 self._pool.append(cnx)
    131                 self._pool_key.append(key)
    132                 self._pool_time.append(time.time())
    133             elif num == 1:
    134                 del self._active[(tid, key)]
     130                self._available.notify()
     131                if cnx.poolable and try_rollback(cnx):
     132                    self._pool.append(cnx)
     133                    self._pool_key.append(key)
     134                    self._pool_time.append(time.time())
    135135            else:
    136136                self._active[(tid, key)] = (cnx, num - 1)
    137137        finally:
    149149                self._pool.pop(0)
    150150                self._pool_key.pop(0)
    151151                self._pool_time.pop(0)
     152                self._available.notify()
    152153        finally:
    153154            self._available.release()

comment:20 Changed 8 years ago by Jonas Borgström

I've been giving this some more thought and the notify() call in _shutdown isn't really needed after all since it only cleans up inactive connections.

comment:21 in reply to:  20 Changed 8 years ago by osimons

Replying to jonas:

I've been giving this some more thought and the notify() call in _shutdown isn't really needed after all since it only cleans up inactive connections.

Right, same thing rblank said when we discussed it on IRC.

comment:22 Changed 8 years ago by Jonas Borgström

Resolution: fixed
Status: reopenedclosed

I've committed the changes in [7537,7538] so I'm closing this ticket again, thanks guys.

Modify Ticket

Change Properties
Set your email in Preferences
as closed The owner will remain Christian Boos.
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from Christian Boos to the specified user.

Add Comment

E-mail address and name can be saved in the Preferences.

Note: See TracTickets for help on using tickets.