#9111 closed defect (fixed)
TimeoutError: Unable to get database connection within 20 seconds
Reported by: | | Owned by: | Christian Boos |
---|---|---|---|
Priority: | highest | Milestone: | 0.12.1 |
Component: | database backend | Version: | |
Severity: | critical | Keywords: | connection pool multithread concurrent |
Cc: | Jonas Borgström, felix.schwarz@… | Branch: | |
Release Notes: | | | |
API Changes: | | | |
Internal Changes: | | | |
Description
How to Reproduce
While doing a GET operation on `/`, Trac issued an internal error.
(please provide additional details here)
User Agent was: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.89 Safari/532.5
System Information
Trac | 0.12dev-r9318
Babel | 0.9.4
Docutils | 0.6
Genshi | 0.6dev-r1096
psycopg2 | 2.0.8
Pygments | 1.2.2dev-20100224
Python | 2.5.2 (r252:60911, Oct 5 2008, 19:29:17) [GCC 4.3.2]
setuptools | 0.6c9
Subversion | 1.5.1 (r32289)
jQuery | 1.4.1
Enabled Plugins
Python Traceback
```
Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 499, in _dispatch_request
    dispatcher.dispatch(req)
  File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 231, in dispatch
    resp = chosen_handler.process_request(req)
  File "build/bdist.linux-x86_64/egg/trac/wiki/web_ui.py", line 118, in process_request
    versioned_page = WikiPage(self.env, pagename, version=version)
  File "build/bdist.linux-x86_64/egg/trac/wiki/model.py", line 45, in __init__
    self._fetch(name, version, db)
  File "build/bdist.linux-x86_64/egg/trac/wiki/model.py", line 56, in _fetch
    db = self.env.get_db_cnx()
  File "build/bdist.linux-x86_64/egg/trac/env.py", line 329, in get_db_cnx
    return DatabaseManager(self).get_connection()
  File "build/bdist.linux-x86_64/egg/trac/db/api.py", line 93, in get_connection
    return self._cnx_pool.get_cnx(self.timeout or None)
  File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 191, in get_cnx
    return _backend.get_cnx(self._connector, self._kwargs, timeout)
  File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 124, in get_cnx
    'seconds', time=timeout))
TimeoutError: Unable to get database connection within 20 seconds
```
Attachments (4)
Change History (26)
comment:1 by , 15 years ago
Which web front-end are you using? (mod_python, mod_wsgi, lighttpd + fastcgi, other?)
comment:2 by , 15 years ago
I'm using Google Chrome. This ticket was created as a result of your web site downtime yesterday. I visited http://trac.edgewall.org, but the page wasn't loading. Then a "create ticket" button appeared, so I clicked it to help you record the details. The web site began responding again about 15 minutes after that. It is good now.
comment:4 by , 15 years ago
Milestone: | → 0.13 |
---|---|
Severity: | blocker → critical |
Reopening to handle #8443 with a fresh start (see ticket:8443#comment:38).
As fixing this will require some major changes to the ConnectionPool, I'm targeting 0.13 instead of 0.12.x.
comment:5 by , 15 years ago
Keywords: | connection pool multithread concurrent added |
---|---|
Resolution: | duplicate |
Status: | closed → reopened |
comment:6 by , 14 years ago
Cc: | added |
---|---|
Component: | general → database backend |
So after lots of debugging, I'm finally starting to see what's going on.
We had very optimistic assumptions about the implementation of Python's `threading.Condition.wait(timeout)`. It seems that when we have lots of threads, the total amount of semi-busy waiting done in that method becomes overwhelming. We end up with such a level of contention that even the threads that already have a connection, i.e. those with a Connection in the `_active` dict, are no longer able to acquire the pool's `_available` lock to re-use their connection or to return it to the pool.
Now that I know where to look, it seems this is actually a well-known source of trouble:

> I would discourage this use because they used a polling mechanism to implement this, and according to the source code we poll at least 20 times every second

- http://linuxgazette.net/107/pai.html (2004, but still valid for all 2.x versions)
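For reference, the timed wait in CPython 2.x is indeed implemented as a polling loop; simplified from `Lib/threading.py` (`Condition.wait(timeout)`), it looks roughly like this:

```python
# Simplified from CPython 2.x Lib/threading.py, Condition.wait(timeout):
# instead of blocking on the waiter lock, the thread polls it in a sleep
# loop whose delay is capped at 50 ms -- hence "at least 20 times every
# second" per waiting thread.
from time import time as _time, sleep as _sleep

def _timed_wait(waiter, timeout):
    # `waiter` is a lock that notify() will release; poll it until then.
    endtime = _time() + timeout
    delay = 0.0005                              # 500 us initial delay
    while True:
        if waiter.acquire(0):                   # non-blocking attempt
            return True                         # we were notified
        remaining = endtime - _time()
        if remaining <= 0:
            return False                        # timed out
        delay = min(delay * 2, remaining, .05)  # never sleep longer than 50 ms
        _sleep(delay)
```

With dozens of waiting threads each waking up 20+ times per second just to check a lock, the contention on the pool lock (and on the GIL) adds up quickly.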
This also explains the dynamics of the bug: when there are only a few threads, everything is fast and smooth; past a given threshold, the number of threads becomes so large that all the CPU is spent in `wait()`, and as new connections keep coming, this gets worse; eventually the rate of threads exiting after a TimeoutError becomes higher than the rate of incoming requests, and if the number of threads manages to drop back to a normal level, the system recovers.
From that point, there are a number of things to try, the first being simply to `wait()` without a timeout (already possible by setting `[trac] timeout = 0`). Then, we can avoid even entering the pool's critical section where possible, notably by using thread-local storage for the active `Connection`s (attachment:t9111-tls-for-active.patch). Finally, we can try to find a better overall solution than doing a global `wait()`.
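To see why `timeout = 0` already side-steps the polling, note the `self.timeout or None` call visible in the traceback above; a minimal illustration:

```python
# Illustration only: DatabaseManager passes `self.timeout or None` to the
# pool (see trac/db/api.py in the traceback above), so a configured timeout
# of 0 becomes None, and Condition.wait(None) blocks on the lock directly
# instead of entering the polling loop sketched earlier.
timeout = 0
effective_timeout = timeout or None        # 0 -> None
# pool._available.wait(effective_timeout)  # blocking wait, no polling
```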
This might well be the performance problem of Trac for multi-threaded setups (i.e. also explaining #7490 and #8813) … besides Genshi, of course ;-)
by , 14 years ago
Attachment: | t9111-tls-for-active.patch added |
---|
db: ConnectionPool uses ThreadLocal storage for active connections (patch on r9929)
comment:7 by , 14 years ago
Cc: | added |
---|
follow-ups: comment:9, comment:10
comment:8 by , 14 years ago
Some feedback:
- `timeout = 0` works wonders, problem gone ;-)
- there's a problem with attachment:t9111-tls-for-active.patch, though: it worked well in my local tests, but on t.e.o, after a restart with this patch, new threads kept accumulating and none progressed, all were in the sleep state; strange…
comment:9 by , 14 years ago
Replying to cboos:
> Some feedback:
> - `timeout = 0` works wonders, problem gone ;-)

So I spoke too soon… Though the behavior seems to be far better under normal load, unbounded accumulation of threads can still happen. Either there's an additional problem, or we just need to limit the maximum number of threads…
follow-up: comment:11
comment:10 by , 14 years ago
> - there's a problem with attachment:t9111-tls-for-active.patch
There was a little error in the patch's `_return_cnx` (`cnx` not defined) which prevented `notify()` from being called.
But there was a more fundamental flaw in the approach taken by this patch. Using thread-local storage is all well and good for retrieving the thread's active connection without having to acquire the lock. But the mechanism for returning the connection to the pool depends on the garbage collection of the PooledConnection instances wrapping that connection, and we have no guarantee that such an instance will get garbage collected in the thread that created it! When this happens, the Connection ref count goes wrong for both threads: the thread doing the collection sees one decref too many (resulting in an `assert`, harmless as we're called from `__del__`), and since the thread that should actually have seen this decref never does, a connection slot is wasted. Once this has happened for all Connections, threads will all wait and accumulate.
Does this mean that we have to give up on the optimization of using TLS? Not necessarily: we can use an approach similar to what we're currently doing for transactions (see TracDev/ApiChanges/0.12#get_db_cnx), but for read connections as well. It would be quite tedious to have to use a decorator in all the places where we use a simple `db = env.get_read_db()`, but by using the `with` statement, this should become easy (something like `with env.database_query as db`).
First we should fix the current behavior when `timeout = 0`, which is possibly a simple matter of arranging for "fair" scheduling. Currently, newly created threads for new requests can just go and grab a Connection, even if there are already lots of waiters. This means that when a connection slot is freed and the oldest waiter gets notified (this is what `_available.notify()` does), there is a competition between that waiter and all the newly created threads. As the system gets loaded, it becomes increasingly hard for a waiter to win such a race. Of course, if the web front end itself would handle the situation better and not blindly accept new connections and create new threads as fast as possible, that wouldn't hurt… That's why the problem was most apparent on lighttpd + _fcgi, as opposed to systems with a fixed number of threads like mod_wsgi or mod_python. Which is not to say things are working the best they could with those systems.
So the first idea is to change the logic of the pool to a true FIFO behavior: if there are already waiters, a thread will just wait, unless it has itself just resumed after a wait, in which case it should always get the connection.
The second idea is to limit as much as possible the contention on the `_available` lock by doing as little as possible in the critical section. Currently we do quite a lot of things inside it, which we shouldn't. For example, instead of creating the Connection there, we should simply do the bookkeeping assuming the creation will succeed, do the actual creation outside, and in case of failure, free a slot again (and hope the next try will succeed).
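A hedged, self-contained sketch of both ideas combined (the actual patches, t9111-wait-if-waiters.diff and t9111-create-close-outside-cs.diff, differ in detail):

```python
# Sketch only: 1) FIFO-ish fairness -- a newcomer waits if there are already
# waiters, and a thread woken by notify() gets priority; 2) only the slot
# bookkeeping happens inside the critical section, the actual connection is
# created (or closed) after the lock is released.
import threading
import time

class TimeoutError(Exception):
    pass

class FairPool(object):
    def __init__(self, maxsize, connector, **kwargs):
        self._available = threading.Condition(threading.Lock())
        self._maxsize = maxsize      # maximum number of live connections
        self._active = 0             # slots currently booked
        self._waiters = 0            # threads currently blocked in wait()
        self._connector = connector
        self._kwargs = kwargs

    def get_cnx(self, timeout=None):
        deadline = time.time() + timeout if timeout else None
        resumed = False
        self._available.acquire()
        try:
            while True:
                # Fairness: only take a slot if nobody was already waiting,
                # or if we were just woken up by notify().
                if (resumed or not self._waiters) and self._active < self._maxsize:
                    self._active += 1            # book the slot under the lock
                    break
                if deadline is not None and time.time() >= deadline:
                    raise TimeoutError('Unable to get database connection')
                self._waiters += 1
                try:
                    remaining = deadline - time.time() if deadline else None
                    self._available.wait(remaining)
                finally:
                    self._waiters -= 1
                resumed = True
        finally:
            self._available.release()
        # Potentially slow work happens outside the lock.
        try:
            return self._connector.get_connection(**self._kwargs)
        except Exception:
            self.return_cnx(None)                # creation failed: free the slot
            raise

    def return_cnx(self, cnx):
        # Closing the real connection (if any) would also happen outside
        # the lock; only the bookkeeping needs the critical section.
        self._available.acquire()
        try:
            self._active -= 1
            self._available.notify()             # wake the oldest waiter
        finally:
            self._available.release()
```

The key points are that a newcomer never jumps ahead of existing waiters, and that the only work done while holding the lock is the slot bookkeeping.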
by , 14 years ago
Attachment: | t9111-wait-if-waiters.diff added |
---|
pool: fair attribution of Connection. (patch on r9929)
comment:11 by , 14 years ago
So the first idea is to change the logic of the pool into a true FIFO behavior: if there are already waiters, a thread will just wait unless it just itself resumed after a wait, in which case it should always get the connection.
t9111-wait-if-waiters.diff implements this.
It seems to be doing pretty good here on t.e.o. More testing welcome.
comment:12 by , 14 years ago
Okay, so after a bit more testing, it seems things still don't behave perfectly well… We still have situations where the server can't cope with the current load and new threads keep being created faster than already created ones can finish. Sometimes this peak disappears and load goes back to normal, sometimes it only gets worse until the system resources are exhausted.
This is not so surprising: many times during the lifetime of a request, a thread will do some work with an active connection, then release it completely, and the Connection goes back to the pool. Each time the thread then tries to reacquire a Connection in the presence of waiters, it has to wait a full cycle again, i.e. until all the previous waiters have had their turn.
So we could add a way to abort the new incoming threads early when there are already many waiters. Even better, we could avoid creating too many threads by using a better fcgi front-end than what we currently have (an old version of flup's fcgi.py). Recent versions of flup have support for a thread pool, which seems to be exactly what we need to avoid this thread congestion problem in the first place.
follow-up: comment:18
comment:13 by , 14 years ago
Owner: | set to |
---|---|
Status: | reopened → new |
The following patch makes it possible to use the installed version of flup instead of the bundled one (source:trunk/trac/web/_fcgi.py, which is from a quite old version not even available anymore from http://www.saddi.com/software/flup/dist/archive/, see also #3957).
```diff
diff -r 292fd034e9c4 trac/web/fcgi_frontend.py
--- a/trac/web/fcgi_frontend.py
+++ b/trac/web/fcgi_frontend.py
@@ -20,10 +20,16 @@
 from trac import __version__ as VERSION
 from trac.web.main import dispatch_request
 
-import _fcgi
+params = {}
+
+try:
+    from flup.server.fcgi import WSGIServer
+    params['maxThreads'] = 15
+except ImportError:
+    from _fcgi import WSGIServer
 
 def run():
-    _fcgi.WSGIServer(dispatch_request).run()
+    WSGIServer(dispatch_request, **params).run()
 
 if __name__ == '__main__':
     pkg_resources.require('Trac==%s' % VERSION)
```
With this and the latest flup (1.0.3dev, [flup 72496e17674f]), the behavior is more under control, as we don't exceed `maxThreads` threads. However, when flup hits the limit, it closes the socket to lighttpd, which results in a 500 - Internal Error for the client. Not nice, but certainly better than having to manually restart the server every few hours ;-) Maybe newer versions of lighttpd handle this better.
Also, a restart operation is still problematic. Remember that with stock 0.12 (and quite certainly 0.11 as well), a restart would almost every time trigger the TimeoutError problem. With t9111-wait-if-waiters.diff, this could lead to an unbounded accumulation of threads. If we complement that with the above patch and flup 1.0.3dev, some clients will see the error 500 reported by lighty.
What explains this is that at startup time all the Connections are being created, possibly even one after the other, as we do that from within the critical section. Maybe this will improve with the second idea of comment:10 and the corresponding attachment:t9111-create-close-outside-cs.diff.
by , 14 years ago
Attachment: | t9111-create-close-outside-cs.diff added |
---|
pool: when taking or returning a Connection, don't do any potentially lengthy operation with the lock held. (patch on r9929)
comment:14 by , 14 years ago
The behavior of the server seems to be much better with this patch. Most notably, at startup time the system feels much more responsive.
follow-up: comment:16
comment:15 by , 14 years ago
I tried 0.11.7, 0.12 and 0.13dev as fcgi and tracd with the MySQL back-end and there is still the same error. Finally I set the maximum number of database connections to 1 on line 70 of db/pool.py:
self._maxsize = 1 # maxsize
It still gives the same error with more than 10-20 simultaneous requests, but at least it doesn't go into an indefinite loop and doesn't eat CPU resources; as a result Trac is available under moderate load.
follow-up: comment:17
comment:16 by , 14 years ago
Replying to Alex Trofimov:
> I tried 0.11.7, 0.12 and 0.13dev as fcgi and tracd with the MySQL back-end and there is still the same error.
The patches above are not yet committed. This is how you should proceed:
1. cd trunk
2. patch -p0 < t9111-wait-if-waiters.diff
3. patch -p0 < t9111-create-close-outside-cs.diff
4. optionally, get the unified diff patch from comment:13 and apply it (-p1), then get and install the latest version of flup

Step 2 should already make the `TimeoutError` go away. Step 3 should theoretically make things even smoother, but I haven't benchmarked that rigorously yet.
Step 4 is only useful if you see an unbounded accumulation of threads, as it will use a thread pool.
comment:17 by , 14 years ago
Replying to cboos:
Thank you for the detailed explanation.
I tried all these steps (the patches were applied to 0.13dev-r9950) - nothing helps :(
I tried setting the timeout to 0 and to 10.
After some number of simultaneous requests (10-20), it stops working, the trac process uses 90-110% of the CPU and the browser times out; only a full restart helps.
I'm using nginx + fastcgi (or tracd), maybe this is the reason it doesn't help.
I see that your site times out sometimes, so maybe this problem still exists on your site too.
After many tries I had to set it up as regular CGI on Apache, and set up nginx as a proxy for Apache.
comment:18 by , 14 years ago
Replying to cboos:
> The following patch makes it possible to use the installed version of flup instead of the bundled one

Two improvements were made to the initial patch, in flupify.diff:
- do `urllib.unquote` on the PATH_INFO (same as #8128)
- add an environment variable `TRAC_USE_FLUP` to control the usage of flup vs. the bundled `_fcgi.py`. This will default to `0` for 0.12.x and `1` for 0.13.
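A sketch of the environment-variable switch described above (the default, the accepted values and the `maxThreads` setting here are assumptions; flupify.diff is the authoritative version):

```python
# Hypothetical sketch of the TRAC_USE_FLUP switch; illustrative only.
import os
from trac.web.main import dispatch_request

params = {}
if os.environ.get('TRAC_USE_FLUP', '0') == '1':
    from flup.server.fcgi import WSGIServer    # installed flup
    params['maxThreads'] = 15                  # bound the worker thread count
else:
    from trac.web._fcgi import WSGIServer      # bundled fallback

def run():
    WSGIServer(dispatch_request, **params).run()
```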
by , 14 years ago
Attachment: | flupify.diff added |
---|
fcgi: make it possible to use the installed version of flup instead of builtin version (patch on r9930)
follow-up: comment:20
comment:19 by , 14 years ago
Milestone: | 0.13 → 0.12.1 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
The whole set of patches was applied in r10002, r10003 and r10004. These changes have been used on t.e.o for the last month, with success.
What happens now in case of high load depends on the front end. For example, with the latest flup (comment:13), when no more spare threads are available from its thread pool, an error 500 is sent back to the client.
follow-up: comment:21
comment:20 by , 14 years ago
Replying to cboos:
> For example, with the latest flup (comment:13), when no more spare threads are available from its thread pool, an error 500 is sent back to the client.
Incidentally, I had this a few times yesterday evening. Still not very elegant, but much better than locking up the whole server.
comment:21 by , 14 years ago
Replying to rblank:
> Replying to cboos:
> > For example, with the latest flup (comment:13), when no more spare threads are available from its thread pool, an error 500 is sent back to the client.
>
> Incidentally, I had this a few times yesterday evening. Still not very elegant, but much better than locking up the whole server.
That might have been my fault. I started playing a bit with uwsgi on t.e.o last night and in the process restarted and reconfigured lighttpd a few times.
It was just something I wanted to experiment a bit with. I'll keep you guys updated when I have some more results or before I want to try something more disruptive.
comment:22 by , 14 years ago
Any reason why we shouldn't use `mod_wsgi`? Actually, why do we serve Trac via lighttpd, and the repositories via Apache? Wouldn't it be simpler to do everything with Apache?