Ticket #4465 (closed defect: fixed)
Timeout reached while waiting for a connection
| Reported by: | ezra@… | Owned by: | cboos |
|---|---|---|---|
| Priority: | highest | Milestone: | 0.11.2 |
| Component: | general | Version: | 0.10.4 |
| Severity: | critical | Keywords: | db pool |
| Cc: | osimons |
Description (last modified by thatch) (diff)
Python Traceback
Traceback (most recent call last):
File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 387, in dispatch_request
dispatcher.dispatch(req)
File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 183, in dispatch
req.perm = PermissionCache(self.env, req.authname)
File "/usr/lib/python2.4/site-packages/trac/perm.py", line 263, in __init__
self.perms = PermissionSystem(env).get_user_permissions(username)
File "/usr/lib/python2.4/site-packages/trac/perm.py", line 227, in get_user_permissions
for perm in self.store.get_user_permissions(username):
File "/usr/lib/python2.4/site-packages/trac/perm.py", line 109, in get_user_permissions
db = self.env.get_db_cnx()
File "/usr/lib/python2.4/site-packages/trac/env.py", line 182, in get_db_cnx
return DatabaseManager(self).get_connection()
File "/usr/lib/python2.4/site-packages/trac/db/api.py", line 75, in get_connection
return self._cnx_pool.get_cnx(self.timeout or None)
File "/usr/lib/python2.4/site-packages/trac/db/pool.py", line 107, in get_cnx
raise TimeoutError('Unable to get database '
TimeoutError: Unable to get database connection within 20 seconds
Attachments
Change History
comment:1 Changed 4 years ago by thatch
- Keywords needinfo added
- Priority changed from high to normal
- Component changed from wiki to general
- Description modified (diff)
comment:2 Changed 4 years ago by ezra@…
- Status changed from new to closed
- Resolution set to worksforme
sorry this was on the edgewall wiki itself. it seems to have cleared itself up though
thank you for the response
comment:3 Changed 4 years ago by cboos
- Status changed from closed to reopened
- Severity changed from normal to major
- Priority changed from normal to high
- Milestone set to 0.10.4
- Keywords db pool weird added; needinfo removed
- Resolution worksforme deleted
That error happens now systematically when trying to load a bunch of pages simultaneously. Once a thread gets stuck on the self._available.wait(), all the other threads seem to get stuck as well, and then nothing progresses until the timeout is reached. At that point, things resume normally. Weird.
comment:4 Changed 3 years ago by cboos
- Owner changed from jonas to cboos
- Status changed from reopened to new
- Severity changed from major to blocker
Actually, this is the other main problem we have now in Trac, together with #4586.
comment:5 Changed 3 years ago by cboos
- Summary changed from TracPermissions wiki down to Timeout reached while waiting for a connection
The problem is not really with the fact that one Request can timeout while waiting for a connection, it's rather the fact that when one thread starts to wait, all activity is frozen. It seems that no other thread gets back the control... So in fact, the waiting thread has no chance to ever get a connection, as no other thread can terminate and release the connection. The waiting thread has no "choice" but wait until the timeout and exit with an error. At this point, the other threads seem to proceed normally.
I've seen this on both Linux and Windows, and I've not yet understood completely the reasons for this behavior.
comment:6 Changed 3 years ago by cboos
- Keywords weird removed
- Status changed from new to assigned
Ok, "weirdness" elucidated.
The problem comes from the interference between the connection pool and the repository cache.
Inside the repository cache (i.e. with the repo cache lock held), if we don't already have a repository object available, we try to create a new one. In case this implies a CachedRepository (as with Subversion), this will in turn try to acquire a connection from the pool.
If we don't already have a connection and if the pool is empty (all connections are in use), we wait on a timeout there in hope that the other threads will proceed and release their connection to the pool. However, it's very likely that before they'll do so, they require a repository object for themselves... deadlock.
A simple solution would be to make sure that we already have the db connection before trying to get a repository from the cache and getting hold of the repository cache lock.
Changed 3 years ago by cboos
-
attachment
test_pool_timeout.py
added
Simple test program demonstrating the issue - needs two paramaters, the $TRAC_ENV and the number of threads to use. If more than 5 given, it freezes.
comment:7 Changed 3 years ago by cboos
- Status changed from assigned to closed
- Resolution set to fixed
comment:8 Changed 3 years ago by datallah@…
- Version changed from 0.10.3 to 0.10.4
I'm still running into this after upgrading to 0.10.4.
We're the fastcgi interface with lighttpd and a postgres db on a Debian etch box. I haven't changed any connection pooling settings. The www-data user is allowed 1024 open files.
I'm not sure if this helps to track down the issue, but I currently have version control disabled: trac.versioncontrol.* = disabled
The stack traces are across the board - anything that tries to get a db connection fails.
Please let me know what other info I can provide.
comment:9 Changed 3 years ago by eblot
- Status changed from closed to reopened
- Resolution fixed deleted
(re-opening the ticked to match the latest comment)
comment:10 Changed 3 years ago by cboos
- Keywords verify added
- Severity changed from blocker to critical
- Milestone changed from 0.10.4 to 0.10.5
comment:11 Changed 3 years ago by cboos
... and #5536 as well, though the backtrace is slightly different.
comment:13 Changed 3 years ago by cboos
- Milestone changed from 0.10.5 to 0.11.1
Update: r4886 is no longer current (see r6124) and despite of that, I'm unable to reproduce the deadlocks in Trac, the attachment:test_pool_timeout.py sample works fine for any number of threads.
comment:14 Changed 3 years ago by cboos
- Keywords verify really removed
- Priority changed from high to highest
- Status changed from reopened to new
- Milestone changed from 0.11.1 to 0.11
#6405 was closed as duplicate. That ticket contained a detailed report showing that this issue is still present with the very latest version of everything, when the server is tracd on Linux.
That prompted me to do some further tests, and indeed I was able to reproduce the same timeout issue, with tracd on Linux. The only difference in my initial tests was that after the timeout, it was possible to resume working without having to restart the server.
I then tried again with r6124 reverted (i.e. reapplying the r4886 fix) and that definitely helped in making the timeout errors much less likely.
However this doesn't prevent the timeout situation completely, in particular when one do a reload all tabs while they're not yet fully loaded. Even worse, when doing so, it seems that the server can really be put in a state where all later requests will fail with the timeout.
comment:15 Changed 3 years ago by cboos
- Status changed from new to closed
- Resolution set to fixed
Ok, so in r6260 I've restored the r4886 fix, which is kind of bad as I really wanted to get rid of that. It appears that for some reason it's really needed.
Now, even with that change, I was still able to trigger the "persistent" timeout issue, specifically when triggering lots of "broken connection" errors. This made me think that the env.shutdown(tid) done at the end of request processing was maybe not called every time.
Indeed, when making sure that the env.shutdown(tid) gets always called, with r6261, the persistent timeout issue didn't happen again.
I've not checked yet if there's something to backport to 0.10.5dev, but that appears less critical now that the beta1 is about to be released.
comment:16 Changed 23 months ago by osimons
- Cc osimons added
- Status changed from closed to reopened
- Resolution fixed deleted
- Milestone changed from 0.11 to 0.11.2
I still get this TimeoutError at sporadic intervals, and there is something not quite right. Once one TimeoutError is reached there is no helping the site apart from a restart - all projects and all requests hang (Apache Windows single-process setup).
Reading up on threading I see that the pool code calls wait(), but there is no code anywhere that actually calls notify() to wake them up again. With the help of rblank on IRC, this patch was put together that I'm currently testing:
-
trac/db/pool.py
a b 130 130 self._pool.append(cnx) 131 131 self._pool_key.append(key) 132 132 self._pool_time.append(time.time()) 133 self._available.notify() 133 134 elif num == 1: 134 135 del self._active[(tid, key)] 136 self._available.notify() 135 137 else: 136 138 self._active[(tid, key)] = (cnx, num - 1) 137 139 finally:
If anyone else sees regular TimeoutError, it may be well worth trying the patch to see if it helps. Further insight and testing much appreciated...
comment:17 Changed 23 months ago by cboos
Out of curiosity, what happens when you try attachment:test_pool_timeout.py, are you able to reproduce the freeze when using more threads than the default available connections (10 now, IIRC)? I wasn't, but maybe this program was testing a different condition than the one you're dealing with now.
comment:18 Changed 23 months ago by osimons
Tried it, and here is findings using 250 threads:
- Without patch I get consistently longer periods without output (everything waiting). With patch there is continuous output from the script.
- Without patch and a timeout setting of 20, i get quite a lot of timeout errors. Adding the patch does not visibly reduce the amount of errors markedly, but quite a few of them also additionally show database is locked + timeout messages - ie. it throws the error for the wrong reason, I suppose.
- Setting timeout to 0 in trac.ini and using patch gives continuous output with occasionally errors due to database is locked, but all these threads acquire cnx shortly after.
- I tried changing the pool size, but did not find any particular changes from increasing it - except if it got too high (100) it increased errors. However, with just one project in the test script, this change may not be effective anyway - perhaps this increase will make more sense across a larger multi-project setup with many environments.
Looking at the pool code, I see that without the explicit call to notify() (patch) only 2 attempts at getting a connection will ever be made: First on entry, and if it doesn't get it it will wait for the full timeout period regardless of available connections in the meantime, then it will get woken up and tries one more time. If no connection is available then, it will hit the time-test and fail this due to fractionally more time than timeout will have passed.
I think the patch improves things, and I'm now using it + timeout = 0 but haven't yet changed pool size. We'll see how it all works out.
comment:19 Changed 23 months ago by jonas
Good catch osimons. Without calling notify() everytime a pool slot is made available we risk raising TimeoutErrors in some high load situations.
Here's an updated patch with a small restructuring of _return_cnx to make it easier to follow and adding another notify() call to the _shutdown method.
-
trac/db/pool.py
125 125 try: 126 126 assert (tid, key) in self._active 127 127 cnx, num = self._active[(tid, key)] 128 if num == 1 and cnx.poolable and try_rollback(cnx):128 if num == 1: 129 129 del self._active[(tid, key)] 130 self._ pool.append(cnx)131 self._pool_key.append(key)132 self._pool_time.append(time.time())133 elif num == 1:134 del self._active[(tid, key)]130 self._available.notify() 131 if cnx.poolable and try_rollback(cnx): 132 self._pool.append(cnx) 133 self._pool_key.append(key) 134 self._pool_time.append(time.time()) 135 135 else: 136 136 self._active[(tid, key)] = (cnx, num - 1) 137 137 finally: … … 149 149 self._pool.pop(0) 150 150 self._pool_key.pop(0) 151 151 self._pool_time.pop(0) 152 self._available.notify() 152 153 finally: 153 154 self._available.release() 154 155
comment:20 follow-up: ↓ 21 Changed 23 months ago by jonas
I've been giving this some more thought and the notify() call in _shutdown isn't really needed after all since it only cleans up inactive connections.
comment:21 in reply to: ↑ 20 Changed 23 months ago by osimons
Replying to jonas:
I've been giving this some more thought and the notify() call in _shutdown isn't really needed after all since it only cleans up inactive connections.
Right, same thing rblank said when we discussed it on IRC.
comment:22 Changed 23 months ago by jonas
- Status changed from reopened to closed
- Resolution set to fixed
I've committed the changes in [7537,7538] so I'm closing this ticket again, thanks guys.




A little more information would be helpful. For starters, what db backend are you using? What server method? Does this happen on the first request or only after sitting a while?