Opened 18 years ago
Closed 16 years ago
#4465 closed defect (fixed)
Timeout reached while waiting for a connection
| Reported by: | | Owned by: | Christian Boos |
|---|---|---|---|
| Priority: | highest | Milestone: | 0.11.2 |
| Component: | general | Version: | 0.10.4 |
| Severity: | critical | Keywords: | db pool |
| Cc: | osimons | Branch: | |
| Release Notes: | | | |
| API Changes: | | | |
| Internal Changes: | | | |
Description (last modified by )
Python Traceback

    Traceback (most recent call last):
      File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 387, in dispatch_request
        dispatcher.dispatch(req)
      File "/usr/lib/python2.4/site-packages/trac/web/main.py", line 183, in dispatch
        req.perm = PermissionCache(self.env, req.authname)
      File "/usr/lib/python2.4/site-packages/trac/perm.py", line 263, in __init__
        self.perms = PermissionSystem(env).get_user_permissions(username)
      File "/usr/lib/python2.4/site-packages/trac/perm.py", line 227, in get_user_permissions
        for perm in self.store.get_user_permissions(username):
      File "/usr/lib/python2.4/site-packages/trac/perm.py", line 109, in get_user_permissions
        db = self.env.get_db_cnx()
      File "/usr/lib/python2.4/site-packages/trac/env.py", line 182, in get_db_cnx
        return DatabaseManager(self).get_connection()
      File "/usr/lib/python2.4/site-packages/trac/db/api.py", line 75, in get_connection
        return self._cnx_pool.get_cnx(self.timeout or None)
      File "/usr/lib/python2.4/site-packages/trac/db/pool.py", line 107, in get_cnx
        raise TimeoutError('Unable to get database '
    TimeoutError: Unable to get database connection within 20 seconds
Attachments (1)
Change History (23)
comment:1 by , 18 years ago
Component: | wiki → general |
---|---|
Description: | modified (diff) |
Keywords: | needinfo added |
Priority: | high → normal |
comment:2 by , 18 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
Sorry, this was on the Edgewall wiki itself. It seems to have cleared itself up, though.
Thank you for the response.
comment:3 by , 18 years ago
Keywords: | db pool weird added; needinfo removed |
---|---|
Milestone: | → 0.10.4 |
Priority: | normal → high |
Resolution: | worksforme |
Severity: | normal → major |
Status: | closed → reopened |
That error now happens systematically when trying to load a bunch of pages simultaneously. Once a thread gets stuck on the `self._available.wait()` call, all the other threads seem to get stuck as well, and then nothing progresses until the timeout is reached. At that point, things resume normally. Weird.
comment:4 by , 18 years ago
Owner: | changed from | to
---|---|
Severity: | major → blocker |
Status: | reopened → new |
Actually, this is the other main problem we have now in Trac, together with #4586.
comment:5 by , 18 years ago
Summary: | TracPermissions wiki down → Timeout reached while waiting for a connection |
---|
The problem is not really that one request can time out while waiting for a connection; it's rather that when one thread starts to wait, all activity is frozen. It seems that no other thread gets control back… So in fact, the waiting thread has no chance to ever get a connection, as no other thread can terminate and release one. The waiting thread has no "choice" but to wait until the timeout and exit with an error. At that point, the other threads seem to proceed normally.
I've seen this on both Linux and Windows, and I have not yet completely understood the reasons for this behavior.
comment:6 by , 18 years ago
Keywords: | weird removed |
---|---|
Status: | new → assigned |
Ok, "weirdness" elucidated.
The problem comes from the interference between the connection pool and the repository cache.
Inside the repository cache (i.e. with the repo cache lock held), if we don't already have a repository object available, we try to create a new one. In case this implies a CachedRepository (as with Subversion), this will in turn try to acquire a connection from the pool.
If we don't already have a connection and if the pool is empty (all connections are in use), we wait on a timeout there in hope that the other threads will proceed and release their connection to the pool. However, it's very likely that before they'll do so, they require a repository object for themselves… deadlock.
A simple solution would be to make sure that we already have the db connection before trying to get a repository from the cache and getting hold of the repository cache lock.
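For illustration, here is a minimal, self-contained sketch of that lock-ordering pattern and of the proposed fix. This is not Trac's actual code; all names (`cache_lock`, `pool_cond`, `get_repository_*`) are made up for the example.

```python
# Minimal illustration of the deadlock pattern described above (not Trac's
# actual code; all names here are invented for the example).
import threading

cache_lock = threading.Lock()        # plays the role of the repository cache lock
pool_cond = threading.Condition()    # guards a tiny, easily exhausted pool
pool = ["cnx1", "cnx2"]

def get_connection(timeout=20):
    """Hand out a pooled connection, waiting up to `timeout` seconds."""
    with pool_cond:
        if not pool:
            pool_cond.wait(timeout)  # blocks here if nobody returns a connection
        if not pool:
            raise RuntimeError("Unable to get database connection "
                               "within %d seconds" % timeout)
        return pool.pop()

def return_connection(cnx):
    with pool_cond:
        pool.append(cnx)
        pool_cond.notify()

def get_repository_deadlock_prone():
    # The problematic ordering: the cache lock is held *while* waiting on the
    # pool, so threads that could return a connection (but also need the cache
    # lock first) can never make progress.
    with cache_lock:
        return get_connection()

def get_repository_fixed():
    # The fix sketched above: obtain the connection first, then take the cache
    # lock, so we never wait on the pool while holding it.
    cnx = get_connection()
    with cache_lock:
        return cnx
```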
by , 18 years ago
Attachment: | test_pool_timeout.py added |
---|
Simple test program demonstrating the issue - needs two parameters, the $TRAC_ENV and the number of threads to use. If more than 5 are given, it freezes.
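For reference, a rough sketch of what such a stress test can look like. This is not the attached test_pool_timeout.py itself, just an illustration against the 0.10/0.11-era Trac API (`open_environment`, `get_db_cnx`, `get_repository`); the iteration count and query are arbitrary.

```python
# Rough sketch of a pool stress test along the lines described (not the
# attached test_pool_timeout.py): N threads hammer one environment, each
# repeatedly fetching a db connection and the repository.
import sys
import threading

from trac.env import open_environment

def worker(env, iterations=50):
    for _ in range(iterations):
        db = env.get_db_cnx()          # may block in the connection pool
        repos = env.get_repository()   # takes the repository cache lock
        cursor = db.cursor()
        cursor.execute("SELECT COUNT(*) FROM wiki")
        cursor.fetchone()

if __name__ == '__main__':
    env_path, num_threads = sys.argv[1], int(sys.argv[2])
    env = open_environment(env_path)
    threads = [threading.Thread(target=worker, args=(env,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```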
comment:7 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
comment:8 by , 18 years ago
Version: | 0.10.3 → 0.10.4 |
---|
I'm still running into this after upgrading to 0.10.4.
We're using the FastCGI interface with lighttpd and a PostgreSQL database on a Debian etch box. I haven't changed any connection pooling settings. The www-data user is allowed 1024 open files.
I'm not sure if this helps to track down the issue, but I currently have version control disabled: `trac.versioncontrol.* = disabled`
The stack traces are across the board - anything that tries to get a db connection fails.
Please let me know what other info I can provide.
comment:9 by , 18 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
(re-opening the ticket to match the latest comment)
comment:10 by , 17 years ago
Keywords: | verify added |
---|---|
Milestone: | 0.10.4 → 0.10.5 |
Severity: | blocker → critical |
comment:12 by , 17 years ago
Keywords: | really added |
---|
comment:13 by , 17 years ago
Milestone: | 0.10.5 → 0.11.1 |
---|
Update: r4886 is no longer current (see r6124), and despite that, I'm unable to reproduce the deadlocks in Trac; the attachment:test_pool_timeout.py sample works fine for any number of threads.
comment:14 by , 17 years ago
Keywords: | verify really removed |
---|---|
Milestone: | 0.11.1 → 0.11 |
Priority: | high → highest |
Status: | reopened → new |
#6405 was closed as duplicate. That ticket contained a detailed report showing that this issue is still present with the very latest version of everything, when the server is tracd on Linux.
That prompted me to do some further tests, and indeed I was able to reproduce the same timeout issue, with tracd on Linux. The only difference in my initial tests was that after the timeout, it was possible to resume working without having to restart the server.
I then tried again with r6124 reverted (i.e. reapplying the r4886 fix) and that definitely helped in making the timeout errors much less likely.
However, this doesn't prevent the timeout situation completely, in particular when one does a "reload all tabs" while the tabs are not yet fully loaded. Even worse, when doing so, it seems that the server can really be put in a state where all later requests will fail with the timeout.
comment:15 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Ok, so in r6260 I've restored the r4886 fix, which is kind of bad as I really wanted to get rid of that. It appears that for some reason it's really needed.
Now, even with that change, I was still able to trigger the "persistent" timeout issue, specifically when triggering lots of "broken connection" errors. This made me think that the `env.shutdown(tid)` done at the end of request processing was maybe not being called every time.
Indeed, after making sure that `env.shutdown(tid)` is always called, with r6261, the persistent timeout issue didn't happen again.
I've not checked yet if there's something to backport to 0.10.5dev, but that appears less critical now that the beta1 is about to be released.
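For illustration, a minimal sketch of the idea behind that change (not the actual r6261 changeset; `process_request` and `dispatch` are placeholder names):

```python
# Sketch of the idea behind r6261 (not the actual changeset): make sure the
# per-thread database resources are released no matter how request processing
# ends, by always calling env.shutdown(tid) in a finally block.
import threading

def process_request(env, req, dispatch):
    # `dispatch` stands in for whatever actually handles the request.
    try:
        dispatch(req)
    finally:
        # Always return this thread's pooled connection(s), even when the
        # request fails with a "broken connection" or any other error.
        # (threading.get_ident() is the modern spelling of the thread id.)
        env.shutdown(threading.get_ident())
```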
comment:16 by , 16 years ago
Cc: | added |
---|---|
Milestone: | 0.11 → 0.11.2 |
Resolution: | fixed |
Status: | closed → reopened |
I still get this `TimeoutError` at sporadic intervals, and there is something not quite right. Once one `TimeoutError` is reached, there is no helping the site apart from a restart - all projects and all requests hang (Apache on Windows, single-process setup).
Reading up on `threading`, I see that the pool code calls `wait()`, but there is no code anywhere that actually calls `notify()` to wake the waiters up again. With the help of rblank on IRC, this patch was put together, which I'm currently testing:
trac/db/pool.py:

                     self._pool.append(cnx)
                     self._pool_key.append(key)
                     self._pool_time.append(time.time())
    +                self._available.notify()
                 elif num == 1:
                     del self._active[(tid, key)]
    +                self._available.notify()
                 else:
                     self._active[(tid, key)] = (cnx, num - 1)
             finally:
If anyone else sees regular `TimeoutError`s, it may well be worth trying the patch to see if it helps. Further insight and testing much appreciated…
comment:17 by , 16 years ago
Out of curiosity, what happens when you try attachment:test_pool_timeout.py? Are you able to reproduce the freeze when using more threads than the default number of available connections (10 now, IIRC)? I wasn't, but maybe this program was testing a different condition than the one you're dealing with now.
comment:18 by , 16 years ago
Tried it, and here are the findings using 250 threads:
- Without the patch I get consistently longer periods without output (everything waiting). With the patch there is continuous output from the script.
- Without the patch and a `timeout` setting of 20, I get quite a lot of timeout errors. Adding the patch does not reduce the number of errors markedly, but quite a few of them additionally show `database is locked` + timeout messages - i.e. it throws the error for the wrong reason, I suppose.
- Setting `timeout` to 0 in `trac.ini` and using the patch gives continuous output with occasional errors due to `database is locked`, but all these threads acquire a connection shortly after.
- I tried changing the pool size, but did not find any particular improvement from increasing it - except that if it got too high (100) it increased the errors. However, with just one project in the test script, this change may not be effective anyway - perhaps an increase will make more sense across a larger multi-project setup with many environments.
Looking at the pool code, I see that without the explicit call to `notify()` (the patch), only two attempts at getting a connection will ever be made: the first on entry; if that fails, the thread waits for the full timeout period regardless of connections becoming available in the meantime, then gets woken up and tries one more time. If no connection is available at that point, it hits the time test and fails, because fractionally more than the timeout will have passed.
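This is the classic bounded-resource pattern for a condition variable: every time a connection goes back into the pool, one waiter should be notified so it can re-check immediately. A simplified sketch of that get/return cycle (not Trac's actual pool.py; `TinyPool` and its contents are illustrative only):

```python
# Simplified sketch of the get/return cycle the patch establishes (not Trac's
# exact pool.py): returning a connection notifies a waiter, so get_cnx() can
# re-check as soon as a slot frees up instead of sleeping out the timeout.
import time
import threading

class TinyPool(object):
    def __init__(self, size=10):
        self._available = threading.Condition()
        self._pool = ["cnx%d" % i for i in range(size)]

    def get_cnx(self, timeout=None):
        start = time.time()
        with self._available:
            while not self._pool:
                if timeout is not None:
                    remaining = timeout - (time.time() - start)
                    if remaining <= 0:
                        raise RuntimeError("Unable to get database connection "
                                           "within %s seconds" % timeout)
                    self._available.wait(remaining)
                else:
                    self._available.wait()
            return self._pool.pop(0)

    def return_cnx(self, cnx):
        with self._available:
            self._pool.append(cnx)
            self._available.notify()   # wake one waiting get_cnx() call
```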
I think the patch improves things, and I'm now using it together with `timeout = 0`, but haven't yet changed the pool size. We'll see how it all works out.
comment:19 by , 16 years ago
Good catch osimons. Without calling notify()
everytime a pool slot is made available we risk raising TimeoutError
s in some high load situations.
Here's an updated patch with a small restructuring of `_return_cnx` to make it easier to follow, and with another `notify()` call added to the `_shutdown` method.
trac/db/pool.py:

             try:
                 assert (tid, key) in self._active
                 cnx, num = self._active[(tid, key)]
    -            if num == 1 and cnx.poolable and try_rollback(cnx):
    +            if num == 1:
                     del self._active[(tid, key)]
    -                self._pool.append(cnx)
    -                self._pool_key.append(key)
    -                self._pool_time.append(time.time())
    -            elif num == 1:
    -                del self._active[(tid, key)]
    +                self._available.notify()
    +                if cnx.poolable and try_rollback(cnx):
    +                    self._pool.append(cnx)
    +                    self._pool_key.append(key)
    +                    self._pool_time.append(time.time())
                 else:
                     self._active[(tid, key)] = (cnx, num - 1)
             finally:
     …
                     self._pool.pop(0)
                     self._pool_key.pop(0)
                     self._pool_time.pop(0)
    +                self._available.notify()
             finally:
                 self._available.release()
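For readability, this is roughly how `_return_cnx` reads with the patch above applied. It is reconstructed from the diff, with a stub class and a stand-in `try_rollback` added only so the snippet is self-contained; the code actually committed in [7537,7538] may differ in details.

```python
# Readable reconstruction of _return_cnx with the patch applied (the committed
# version in [7537,7538] may differ). The stub class and try_rollback stand-in
# exist only to give the method some context.
import time
import threading

def try_rollback(cnx):
    # Stand-in for the real helper: roll back and report whether the
    # connection is still usable.
    try:
        cnx.rollback()
        return True
    except Exception:
        return False

class ConnectionPoolStub(object):
    def __init__(self):
        self._available = threading.Condition()
        self._active = {}       # (tid, key) -> (cnx, refcount)
        self._pool = []
        self._pool_key = []
        self._pool_time = []

    def _return_cnx(self, cnx, key, tid):
        self._available.acquire()
        try:
            assert (tid, key) in self._active
            cnx, num = self._active[(tid, key)]
            if num == 1:
                del self._active[(tid, key)]
                self._available.notify()   # a slot is free: wake one waiter
                if cnx.poolable and try_rollback(cnx):
                    self._pool.append(cnx)
                    self._pool_key.append(key)
                    self._pool_time.append(time.time())
            else:
                self._active[(tid, key)] = (cnx, num - 1)
        finally:
            self._available.release()
```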
follow-up: 21 comment:20 by , 16 years ago
I've been giving this some more thought, and the `notify()` call in `_shutdown` isn't really needed after all, since it only cleans up inactive connections.
comment:21 by , 16 years ago
Replying to jonas:
> I've been giving this some more thought, and the `notify()` call in `_shutdown` isn't really needed after all, since it only cleans up inactive connections.
Right, same thing rblank said when we discussed it on IRC.
comment:22 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
I've committed the changes in [7537,7538] so I'm closing this ticket again, thanks guys.
A little more information would be helpful. For starters, what db backend are you using? What server method? Does this happen on the first request or only after sitting a while?