Edgewall Software
Modify

Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#9111 closed defect (fixed)

TimeoutError: Unable to get database connection within 20 seconds

Reported by: vfursov@… Owned by: cboos
Priority: highest Milestone: 0.12.1
Component: database backend Version:
Severity: critical Keywords: connection pool multithread concurrent
Cc: jonas, felix.schwarz@…
Release Notes:
API Changes:

Description

How to Reproduce

While doing a GET operation on /, Trac issued an internal error.

(please provide additional details here)

User Agent was: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.89 Safari/532.5

System Information

Trac 0.12dev-r9318
Babel 0.9.4
Docutils 0.6
Genshi 0.6dev-r1096
psycopg2 2.0.8
Pygments 1.2.2dev-20100224
Python 2.5.2 (r252:60911, Oct 5 2008, 19:29:17)
[GCC 4.3.2]
setuptools 0.6c9
Subversion 1.5.1 (r32289)
jQuery 1.4.1

Enabled Plugins

Python Traceback

Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 499, in _dispatch_request
    dispatcher.dispatch(req)
  File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 231, in dispatch
    resp = chosen_handler.process_request(req)
  File "build/bdist.linux-x86_64/egg/trac/wiki/web_ui.py", line 118, in process_request
    versioned_page = WikiPage(self.env, pagename, version=version)
  File "build/bdist.linux-x86_64/egg/trac/wiki/model.py", line 45, in __init__
    self._fetch(name, version, db)
  File "build/bdist.linux-x86_64/egg/trac/wiki/model.py", line 56, in _fetch
    db = self.env.get_db_cnx()
  File "build/bdist.linux-x86_64/egg/trac/env.py", line 329, in get_db_cnx
    return DatabaseManager(self).get_connection()
  File "build/bdist.linux-x86_64/egg/trac/db/api.py", line 93, in get_connection
    return self._cnx_pool.get_cnx(self.timeout or None)
  File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 191, in get_cnx
    return _backend.get_cnx(self._connector, self._kwargs, timeout)
  File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 124, in get_cnx
    'seconds', time=timeout))
TimeoutError: Unable to get database connection within 20 seconds

Attachments (4)

t9111-tls-for-active.patch (5.3 KB) - added by cboos 4 years ago.
db: ConnectionPool uses ThreadLocal storage for active connections (patch on r9929)
t9111-wait-if-waiters.diff (5.0 KB) - added by cboos 4 years ago.
pool: fair attribution of Connection. (patch on r9929)
t9111-create-close-outside-cs.diff (5.3 KB) - added by cboos 4 years ago.
pool: when taking or returning a Connection, don't do any potentially lenghty operation with the lock held. (patch on r9929)
flupify.diff (2.9 KB) - added by cboos 4 years ago.
fcgi: make it possible to use the installed version of flup instead of builtin version (patch on r9930)

Download all attachments as: .zip

Change History (26)

comment:1 Changed 4 years ago by cboos

Which web front-end are you using? (mod_python, mod_wsgi, lightppd + fastcgi, other?)

comment:2 Changed 4 years ago by vfursov@…

I'm using Google Chrome. This ticket was created as a result of your web site downtime yesterday. I've visited http://trac.edgewall.org, but the page wasn't loaded. Then button create ticket appeared, so I clicked to help you to record the details. The web site began responding in 15 minutes after that. It is good now.

comment:3 Changed 4 years ago by rblank

  • Resolution set to duplicate
  • Status changed from new to closed

Ok, known issue, then.

comment:4 Changed 4 years ago by cboos

  • Milestone set to 0.13
  • Severity changed from blocker to critical

Reopening to handle #8443 with a fresh start (see ticket:8443#comment:38).

As fixing this will require some major changes to the ConnectionPool, targeting 0.13 instead of 0.12.x.

comment:5 Changed 4 years ago by cboos

  • Keywords connection pool multithread concurrent added
  • Resolution duplicate deleted
  • Status changed from closed to reopened

comment:6 Changed 4 years ago by cboos

  • Cc jonas added
  • Component changed from general to database backend

So after lots of debugging, I finally start to see what's going on.

We had very optimistic assumptions regarding the implementation of Python's threading.Condition.wait(timeout). It seems that when we have lots of threads, the total amount of semi-busy wait done in that method becomes overwhelming. We end up with such a level of contention that even the threads already having a connection, i.e. those with a Connection in the _active dict, are not capable anymore to acquire the pool's _available lock for re-using their connection or returning the connection to the pool.

Now that I know where to look, it seems this is actually a well-known source of trouble:

I would discourage this use because they used a polling mechanism to implement this, and according to the source code we poll atleast 20 times ever second
- http://linuxgazette.net/107/pai.html (2004, but still valid for all 2.x versions)

This also explains the dynamics of the bug: when there are a few threads, everything is fast and smooth; after a given threshold, the number of threads becomes so important that all the CPU is spent in wait() and as new connections keep coming, this get worse; eventually after a while the rate of threads exiting after a TimeoutError becomes more important than the rate of incoming requests, and if the number of threads manages to drop at a normal level, the system recovers.

From that point, there are a number of things to try, the first being simply to wait() without a timeout (already possible by simply setting [trac] timeout = 0). Then, we can also avoid having to even enter the pool's critical section when we can avoid it, notably by using thread local storage for storing the active Connections (attachment:t9111-tls-for-active.patch). Finally, we can try to find a better overall solution than doing a global wait().

This might well be the performance problem of Trac for multi-threaded setups (i.e. also explaining #7490,​8813) … besides Genshi of course ;-)

Changed 4 years ago by cboos

db: ConnectionPool uses ThreadLocal storage for active connections (patch on r9929)

comment:7 Changed 4 years ago by fschwarz

  • Cc felix.schwarz@… added

comment:8 follow-ups: Changed 4 years ago by cboos

Some feedback:

  • timeout = 0 works like wonder, problem gone ;-)
  • there's a problem with the attachment:t9111-tls-for-active.patch, though it worked well on my local tests, on t.e.o after a restart with this patch, new threads kept accumulating and none progressed, all were in the sleep state; strange…

comment:9 in reply to: ↑ 8 Changed 4 years ago by cboos

Replying to cboos:

Some feedback:

  • timeout = 0 works like wonder, problem gone ;-)

So I spoke too soon… Though the behavior seems to be far better under normal load, unbounded accumulation of threads can still happen. Either there's an additional problem or we just need to limit the maximal number of threads…

comment:10 in reply to: ↑ 8 ; follow-up: Changed 4 years ago by cboos

There was a little error in the patch _return_cnx (cnx not defined) which prevented notify() to be called.

But there was a more fundamental flaw in the approach with this patch. Using thread-local storage is all well and good for retrieving the active connection for the thread without having to acquire the lock. But the mechanism for returning the connection to the pool depends on the garbage collection of the PooledConnection instances wrapping that connection. The problem is that we have no guarantee that such an instance will get garbage collected in the thread that created it! When this happens, this gets the Connection ref count wrong for both threads, in the thread doing the collection there will be one decref too much (resulting in an assert, harmless as we're called from a __del__) and as the thread that should actually have seen this decref won't see it, this wastes a slot for a connection. When this has happened for all Connections, threads will all wait and accumulate.

Does this mean that we have to give up for the optimization of using TLS? Not necessarily, we can use a similar approach to what we're currently doing for transactions (see TracDev/ApiChanges/0.12#get_db_cnx), but also for read connections. It would be quite tedious to have to use a decorator for all the places we use a simple db = env.get_read_db(), but by using the with statement, this should become easy (something like with env.database_query as db).

First we should fix the current behavior when timeout=0, and is possibly a simple matter of arranging for a "fair" scheduling. Currently, newly created threads for new requests can just go and get a Connection, even if there are already lots of waiters. This means that when a connection slot is freed and the oldest waiter gets notified (this is what _available.notify() does), there will be a competition between this one and all the newly created threads. As the system gets loaded, it becomes increasingly hard for a waiter to win such a race. Of course, if the web front end itself would better handle the situation and not blindingly accept new connections and create new threads as fast as possible, that wouldn't hurt … That's why the problem was most apparent on lighttpd + _fcgi, as opposed to other systems with a fixed number of threads like mod_wsgi or mod_python. Which is not to say things are working the best they could with those systems.

So the first idea is to change the logic of the pool into a true FIFO behavior: if there are already waiters, a thread will just wait unless it just itself resumed after a wait, in which case it should always get the connection.

The second idea is to limit to the maximum the contention on the _available lock by doing as little as possible in the critical section. Currently we do quite a lot of things inside, which we shouldn't. For example, instead of creation the Connection we should simply do the bookkeeping assuming the creation will succeed, do the actual creation outside and in case of failure, free a slot again (and hope the next try will succeed).

Changed 4 years ago by cboos

pool: fair attribution of Connection. (patch on r9929)

comment:11 in reply to: ↑ 10 Changed 4 years ago by cboos

So the first idea is to change the logic of the pool into a true FIFO behavior: if there are already waiters, a thread will just wait unless it just itself resumed after a wait, in which case it should always get the connection.

t9111-wait-if-waiters.diff implements this.

It seems to be doing pretty good here on t.e.o. More testing welcome.

comment:12 Changed 4 years ago by cboos

Okay, so after a bit more testing, it seems things still don't behave perfectly well… We still have situations where the server can't cope with the current load and new threads keep being created faster than already created ones can finish. Sometimes this peak disappears and load goes back to normal, sometimes it only gets worse until the system resources are exhausted.

This is not so surprising, as many times during the lifetime of a request, a thread will do some work with an active connection, then will release it completely and the Connection goes back to the pool. Each time the thread will try to reacquire a Connection in presence of waiters, it'll have to wait a full cycle again i.e. until all the previous waiters have got their turn.

So we could add a way to abort early the new incoming threads when there are already many waiters. Even better, we could avoid creating too many threads by using a better fcgi front-end than what we currently have (an old version of flup's fcgi.py). Recent version of flup has support for a thread pool, which seems to be exactly what we need to avoid this thread congestion problem in the first place.

Last edited 4 years ago by cboos (previous) (diff)

comment:13 follow-up: Changed 4 years ago by cboos

  • Owner set to cboos
  • Status changed from reopened to new

The following patch makes it possible to use the installed version of flup instead of the bundled one (source:trunk/trac/web/_fcgi.py, which is from a quite old version not even available anymore from http://www.saddi.com/software/flup/dist/archive/, see also #3957).

  • trac/web/fcgi_frontend.py

    diff -r 292fd034e9c4 trac/web/fcgi_frontend.py
    a b  
    2020from trac import __version__ as VERSION 
    2121from trac.web.main import dispatch_request 
    2222 
    23 import _fcgi 
     23params = {} 
     24 
     25try: 
     26    from flup.server.fcgi import WSGIServer 
     27    params['maxThreads'] = 15 
     28except ImportError: 
     29    from _fcgi import WSGIServer 
    2430 
    2531def run(): 
    26     _fcgi.WSGIServer(dispatch_request).run() 
     32    WSGIServer(dispatch_request, **params).run() 
    2733 
    2834if __name__ == '__main__': 
    2935    pkg_resources.require('Trac==%s' % VERSION) 

With this and latest flup (1.0.3dev, [flup 72496e17674f]), the behavior is more in control as we don't exceed the maxThreads number of threads. However, when flup hits the limit, it closes the socket to lighttpd, which results in an 500 - Internal Error for the client. Not nice, but certainly better than having to manually restart the server every few hours ;-) Maybe newer versions of lighttpd handles this better.

Also, a restart operation is still problematic. Remember that with stock 0.12 (and quite certainly 0.11 as well), a restart would almost everytime trigger the TimeoutError problem. With the t9111-wait-if-waiters.diff, this could lead to an unbounded accumulation of threads. If we complement that with the above patch and flup 1.0.3dev, some clients will see the error 500 reported by lighty.

What explains this is that at startup time all the Connections are being created, possibly even one after the other as we do that from within the critical section. Maybe this will improve with the second idea of comment:10, and the corresponding attachment:t9111-create-close-outside-cs.diff.

Changed 4 years ago by cboos

pool: when taking or returning a Connection, don't do any potentially lenghty operation with the lock held. (patch on r9929)

comment:14 Changed 4 years ago by cboos

The behavior of the server seems to be much better with this patch. Most notably, at startup time the system feels much more responsive.

comment:15 follow-up: Changed 4 years ago by Alex Trofimov

I tried 0.11.7, 0.12 and 0.13dev as fcgi and tracd with mysql back-end and there is still the same error. Finally I changed set max database connections to 1 on line 70 in db/pool.py

self._maxsize = 1 # maxsize

it still gives the same error during > 10-20 simultaneous requests, but at least it do not go into undefinite loop and don't eat cpu resources, as the result trac is available under moderate load.

comment:16 in reply to: ↑ 15 ; follow-up: Changed 4 years ago by cboos

Replying to Alex Trofimov:

I tried 0.11.7, 0.12 and 0.13dev as fcgi and tracd with mysql back-end and there is still the same error.

The patches above are not yet committed. This is how you should proceed:

  1. cd trunk
  2. patch -p0 < t9111-wait-if-waiters.diff
  3. patch -p0 < t9111-create-close-outside-cs.diff
  4. optionally, get the unified diff patch from comment:13 and apply it (-p1), then get and install the latest version of flup

Step 2. should already make the TimeoutError go away. Step 3. should theoretically make things even smoother but I didn't benchmark that rigorously yet.

Step 4. is only useful if you see an unbounded accumulation of threads, as this will use a threadpool.

Last edited 4 years ago by cboos (previous) (diff)

comment:17 in reply to: ↑ 16 Changed 4 years ago by anonymous

Replying to cboos:

Thank you for detailed explanation.
Tried all these steps (pathes were applied to 0.13dev-r9950) - nothing helps :(
Tried to set Timeout to 0 and 10.
After some number of simulations requests (10-20), it stops working, trac process uses 90-110% of cpu and browser timeouts, only full restart helps.
I'm using nginx + fastcgi(or tracd), maybe this is the reason it doesn't help.

I see that your site timeout sometimes, so maybe this problem is still actual on your site too.

After many tries have to setup it as regular cgi on apache, and setup nginx as proxy for apache.

comment:18 in reply to: ↑ 13 Changed 4 years ago by cboos

Replying to cboos:

The following patch makes it possible to use the installed version of flup instead of the bundled one

Two improvements were made to the initial patch, in flupify.diff:

  • do urllib.unquote on the PATH_INFO (same as #8128)
  • add an environment variable TRAC_USE_FLUP to control the usage of flup vs. bundled _fcgi.py. This will default to 0 for 0.12.x and 1 for 0.13.

Changed 4 years ago by cboos

fcgi: make it possible to use the installed version of flup instead of builtin version (patch on r9930)

comment:19 follow-up: Changed 4 years ago by cboos

  • Milestone changed from 0.13 to 0.12.1
  • Resolution set to fixed
  • Status changed from new to closed

The whole set of patches was applied as r10002, r10003 and r10004. These changes have been used on t.e.o for the last month, with success.

What happens now in case of high load is dependent on the front end. For example, with latest flup (comment:13), when no more spare thread is available from its thread pool, an error 500 is sent back to the client.

comment:20 in reply to: ↑ 19 ; follow-up: Changed 4 years ago by rblank

Replying to cboos:

For example, with latest flup (comment:13), when no more spare thread is available from its thread pool, an error 500 is sent back to the client.

Incidentally, I had this a few times yesterday evening. Still not very elegant, but much better than locking up the whole server.

comment:21 in reply to: ↑ 20 Changed 4 years ago by jonas

Replying to rblank:

Replying to cboos:

For example, with latest flup (comment:13), when no more spare thread is available from its thread pool, an error 500 is sent back to the client.

Incidentally, I had this a few times yesterday evening. Still not very elegant, but much better than locking up the whole server.

That might have been my fault. I started playing a bit with uwsgi on t.e.o last night and in the process restarted and reconfigured lighttpd a few times.

It was just something I wanted to experiment a bit with. I'll keep you guys updated when I have some more results or before I want to try something more disruptive.

comment:22 Changed 4 years ago by rblank

Any reason why we shouldn't use mod_wsgi? Actually, why do we serve Trac via lighttpd, and the repositories via Apache? Wouldn't it be simpler to do everything with Apache?

Add Comment

Modify Ticket

Change Properties
<Author field>
Action
as closed The owner will remain cboos.
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from cboos to the specified user.
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.