Edgewall Software

Opened 10 years ago

Closed 9 years ago

Last modified 5 years ago

#8443 closed defect (duplicate)

postgres and "IDLE in transaction" status

Reported by: mixedpuppy Owned by: Christian Boos
Priority: high Milestone:
Component: database backend Version: 0.11-stable
Severity: critical Keywords: postgresql
Cc: dfraser, pacopablo@… Branch:
Release Notes:
API Changes:

Description (last modified by Remy Blank)

I've noticed a lot of "IDLE in transaction" statuses on postgres connections from trac after a request is finished. The problem with this is that the connection essentially becomes unusable until a rollback or commit is executed, sometimes resulting in locked up requests. I've tracked it down, and have a hack of a fix for it, but first some background.

My feeling is that whether people see this depends on the installed plugins and how they are used; default trac may not exhibit the behaviour required to cause the problem, though the bug is in trac itself. My reliable repro involves using the IncludePages plugin to include one wiki page inside another.

The database connection pool in trac is called with every env.get_db_cnx call. It has a list of "active" connections, which are connections currently in use, and then the pool. The typical pattern that happens when using a database connection looks like:

  • some function in trac is entered
  • get connection
  • if connection in active list for thread, use that
  • otherwise get one from the pool (or create one)
  • a cursor is created
  • psycopg may execute a BEGIN statement, based on the default settings for the postgres server. The installed default for postgres causes psycopg to do that.
  • function in trac finishes
  • connection wrapper is garbage collected, causing the wrapper to "reinsert" the connection into the pool
  • if it is the last "reference" to the connection, trac attempts a rollback (which typically matches the begin above)
  • if it is NOT the last "reference" to the connection, and no explicit rollback was done (eg. after an insert, trac/plugins usually do a commit) then the pool also does not rollback (THIS IS THE PROBLEM). The result is that the begin statement above remains in effect, causing "IDLE in transaction".
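The sequence above can be sketched in miniature (hypothetical names, not the actual trac.db.pool code) to show why the implicit BEGIN survives when a non-last reference is returned:

```python
# Toy model of the pool behavior described above; FakeConnection stands in
# for a psycopg connection whose first execute() issues an implicit BEGIN.
class FakeConnection:
    def __init__(self):
        self.in_transaction = False

    def execute(self, sql):
        self.in_transaction = True  # psycopg's implicit BEGIN

    def rollback(self):
        self.in_transaction = False

class Pool:
    def __init__(self):
        self.cnx = FakeConnection()
        self.refcount = 0

    def get_cnx(self):
        self.refcount += 1
        return self.cnx

    def return_cnx(self):
        self.refcount -= 1
        if self.refcount == 0:  # rollback only on the *last* return
            self.cnx.rollback()

pool = Pool()
outer = pool.get_cnx()
outer.execute("SELECT bat FROM fab")
inner = pool.get_cnx()            # nested call, e.g. from a plugin
inner.execute("SELECT foo FROM bar")
pool.return_cnx()                 # NOT the last reference: no rollback
leaked = pool.cnx.in_transaction  # the "IDLE in transaction" state
pool.return_cnx()                 # last reference: rollback clears it
```

The `leaked` flag captures exactly the window in which postgres reports the connection as "IDLE in transaction".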

Fixing this in the pool would be the wrong place, since you can never know if a rollback is safe on an "active" connection. I'm also wondering if the rollback currently in there is due to bugs reported on this issue before.

An easy repro is to use the includepages plugin for the wiki and include one wiki page within another. At the end of a request for that wiki page you will have the IDLE in transaction state on the connection.

Without requiring a repro, this illustrates how it happens:

def do_something(env):
    db = env.get_db_cnx()
    c = db.cursor()
    c.execute("SELECT foo FROM bar")
    # since we're the second reference, no
    # rollback is called, but since it's
    # a new cursor, BEGIN was called

def main(env):
    db = env.get_db_cnx()
    c = db.cursor()
    c.execute("SELECT bat FROM fab")
    do_something(env)
    # once we're done, db is reinserted 
    # to the pool and rollback is called

With the above outline, two BEGIN statements were executed, but only one rollback.

The simple hack fix is to modify main.dispatch_request to get a handle on the database connection prior to handling the request, then doing a rollback after the request:

    req = Request(environ, start_response)
    db = None
    try:
        db = env.get_db_cnx()
        return _dispatch_request(req, env, env_error)
    finally:
        if db:
            db.rollback()
        if env and not run_once:
            env.shutdown(threading._get_ident())

I call it a hack fix because it seems ugly to me, but it's the only place that this can be handled reliably without the potential to step on a commit that is in progress, and without adding commit/rollback all over the Trac (and plugin) code.

The fix does not break current transaction areas (e.g. after an UPDATE/INSERT), since psycopg will issue a new BEGIN on the next execute if commit or rollback was called. UPDATEs, INSERTs, etc. should remain explicit about calling commit, since a SELECT will not see the new data unless that is done.

It also seems like some extra work is happening in the current db pooling, such as creating a PooledConnection class for each call to env.get_db_cnx even if the connection is in the "active" list.

Looking at other bugs regarding this issue, it seems it would also be a good idea to force-close the connection at the end of a request IF the status is still "in transaction", and to log or raise an error as well. This would at least prevent problems around replication and vacuuming, and let people know there is a bug (likely a missing commit in some plugin).
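A minimal sketch of that safety net, assuming a psycopg2-style connection API (the status constants mirror the values in `psycopg2.extensions`; `FakeConnection` and `check_and_close` are hypothetical names for illustration):

```python
# Values match psycopg2.extensions.TRANSACTION_STATUS_* (an assumption here;
# a real implementation would import them from psycopg2.extensions).
TRANSACTION_STATUS_IDLE = 0     # no transaction open: nothing to do
TRANSACTION_STATUS_INTRANS = 2  # "idle in transaction": the problem state

class FakeConnection:
    """Stand-in for a psycopg2 connection at end of request."""
    def __init__(self, status):
        self.status = status
        self.closed = False

    def get_transaction_status(self):
        return self.status

    def close(self):
        self.closed = True

def check_and_close(cnx, log):
    # Force-close (and log) a connection still in a transaction, so a
    # missing commit in some plugin gets noticed instead of silently
    # blocking replication and vacuuming.
    if cnx.get_transaction_status() == TRANSACTION_STATUS_INTRANS:
        log.append("connection still in transaction: missing commit?")
        cnx.close()

log = []
stuck = FakeConnection(TRANSACTION_STATUS_INTRANS)
check_and_close(stuck, log)   # closed and logged
clean = FakeConnection(TRANSACTION_STATUS_IDLE)
check_and_close(clean, log)   # left alone
```

Running this once per request at shutdown would turn a silent leak into a visible log entry.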

Attachments (3)

postgres-idle.patch (1.7 KB ) - added by mixedpuppy 10 years ago.
postgres idle fix
t8443-deactivate-r9234.patch (2.7 KB ) - added by Christian Boos 10 years ago.
db pool: try harder to rollback unused db connections
teo-debug.log (430.0 KB ) - added by Christian Boos 10 years ago.
strace + backtraces of a trac.fcgi process that was continuously ≥ 100%, while holding alive ≥ 10 idle in transaction postgres processes


Change History (46)

comment:1 by mixedpuppy, 10 years ago

formatting, formatting, formatting. sorry.

comment:2 by Remy Blank, 10 years ago

Description: modified (diff)

Indeed ;-)

by mixedpuppy, 10 years ago

Attachment: postgres-idle.patch added

postgres idle fix

comment:3 by anonymous, 10 years ago

The issue with postgres idle connections was masked by the gc.collect to some extent. When I removed gc.collect I ended up with many more idle connections. When the environment is shutdown at the end of a request, it does not return the last connection from the active list to the pool. The act of returning the connection to the pool is what does the last rollback. With this patch, I don't need the hack I described earlier.

comment:4 by Christian Boos, 10 years ago

Milestone: 0.11.6

in reply to:  description ; comment:5 by Christian Boos, 10 years ago

Replying to mixedpuppy:

With the above outline, two BEGIN statements were executed, but only on rollback.

Is this really a problem? From http://www.postgresql.org/docs/8.3/static/sql-begin.html:

Issuing BEGIN when already inside a transaction block will provoke a warning message. The state of the transaction is not affected.

in reply to:  3 comment:6 by Christian Boos, 10 years ago

Replying to anonymous:

… When the environment is shutdown at the end of a request, it does not return the last connection from the active list to the pool. The act of returning the connection to the pool is what does the last rollback.

The try_rollback, yes. Patch looks good. I need a bit more time to digest it fully but I'm getting convinced it should work.

in reply to:  description ; comment:7 by Christian Boos, 10 years ago

Replying to mixedpuppy:

It also seems like some extra work is happening in the current db pooling, such as creating a PooledConnection class for each call to env.get_db_cnx even if the connection is in the "active" list.

Well, I don't see what would be the alternative, can you elaborate?

comment:8 by Christian Boos, 10 years ago

Owner: set to Christian Boos

In the attachment:postgres-idle.patch, in _return_active_cnx, the _active dict is modified without the _available mutex being acquired.

comment:9 by Christian Boos, 10 years ago

Keywords: postgres added

The #8602 issue of today was a symptom of this problem.

in reply to:  5 ; comment:10 by Shane Caraveo <shanec@…>, 10 years ago

Replying to cboos:

Replying to mixedpuppy:

With the above outline, two BEGIN statements were executed, but only on rollback.

Is this really a problem? From http://www.postgresql.org/docs/8.3/static/sql-begin.html:

Issuing BEGIN when already inside a transaction block will provoke a warning message. The state of the transaction is not affected.

Well, you essentially lose the start of a transaction in that case. The bigger issue is that a rollback or commit has not happened. The result of the latter transaction is undetermined; it could be a rollback or a commit, you just don't know.

in reply to:  7 comment:11 by Shane Caraveo <shanec@…>, 10 years ago

Replying to cboos:

Replying to mixedpuppy:

It also seems like some extra work is happening in the current db pooling, such as creating a PooledConnection class for each call to env.get_db_cnx even if the connection is in the "active" list.

Well, I don't see what would be the alternative, can you elaborate?

One thought was to attach the connection to the request object, but that isn't quite right either. Essentially, somehow keep the pooled object around for the duration of the request, rather than continually releasing/acquiring the object.

However, for us it actually turned out well that the pool works this way; it made a migration to using pgbouncer easier.

in reply to:  10 comment:12 by Christian Boos, 10 years ago

Priority: normal → high

Replying to Shane Caraveo <shanec@…>:

Replying to cboos:

Replying to mixedpuppy:

With the above outline, two BEGIN statements were executed, but only on rollback.

Is this really a problem? From http://www.postgresql.org/docs/8.3/static/sql-begin.html:

Issuing BEGIN when already inside a transaction block will provoke a warning message. The state of the transaction is not affected.

Well, you essentially lose the start of a transaction in that case.

That's not really important - we're talking about SELECTs that could happen before an INSERT/UPDATE/DELETE, and it doesn't really matter if those selects are within the same transaction or not, as we're at READ COMMITTED isolation level.

The point I tried to make with the above quote was that it doesn't matter if you had two or more BEGIN followed by only one ROLLBACK - as the other BEGINs are ignored, one ROLLBACK is enough.

Aside from that point, I think I understand a bit better your analysis now, in particular comment:7.

As currently we always return a new instance of PooledConnection in self.get_db_cnx(), we have no choice but to do the ROLLBACK when the last instance corresponding to a given real connection is gc'ed.

Otherwise we could easily trigger a ROLLBACK during a transaction in progress, as soon as there was more than one call to env.get_db_cnx(). For example, with the following code:

    # this is just some non-sensical code to illustrate the point
    def do_save(self, req):
        db = self.get_db_cnx()
        cursor = db.cursor()
        cursor.execute("INSERT ...")
        email_map = self._get_all_email_address()
        for user, mail in email_map.items():
            cursor.execute("UPDATE ...")
        db.commit()

If _get_all_email_address itself does a self.get_db_cnx(), then it will get another PooledConnection instance, and that instance will be gc'ed upon the method's return, right after the first INSERT, so we would ROLLBACK that INSERT as well.

Now we could certainly improve upon that if env.get_db_cnx() would return the same PooledConnection instance when it's taken from the active list. If _active were a WeakValueDictionary, we could access the PooledConnection instances without preventing them from being gc'ed. If we have that, we could simply do a ROLLBACK in PooledConnection.__del__.

Yet another problem is that we don't do a ROLLBACK on errors, and that should also be possible to do thanks to weakrefs.

comment:13 by Shane Caraveo <shanec@…>, 10 years ago

that's cool, using weakref should deal with cases where people are not being explicit. I would like to see an explicit release as well, since (with the removal of the explicit gc.collect for performance reasons) those objects could take a little while to be collected.

comment:14 by tod@…, 10 years ago

That's a really high priority issue. Too many idle processes slow down performance. Thanks.

comment:15 by anonymous, 10 years ago

postgres  8550  0.3  0.3 40308 6864 ?        Ss   16:46   0:03 postgres: postgres trac_db 127.0.0.1(39129) idle in transaction
postgres  8566  0.5  0.4 40624 8712 ?        Ss   16:46   0:05 postgres: postgres trac_db 127.0.0.1(39131) idle
postgres  8571  0.3  0.3 40332 8288 ?        Ss   16:46   0:03 postgres: postgres trac_db 127.0.0.1(39132) idle
postgres  9166  0.7  0.3 40292 7792 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39141) idle
postgres  9178  0.7  0.3 40292 7504 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39142) idle
postgres  9182  0.7  0.2 40296 5812 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39144) idle
postgres  9183  0.7  0.2 40268 5516 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39145) idle
postgres  9194  0.7  0.3 40280 7456 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39146) idle
postgres  9199  0.8  0.3 40288 7304 ?        Rs   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39147) idle in transaction
postgres  9201  0.6  0.2 40300 5780 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39148) idle
postgres  9202  0.6  0.2 40276 5752 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39149) idle in transaction
postgres  9203  0.7  0.2 40296 5676 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39150) idle in transaction
postgres  9204  0.7  0.2 40292 5896 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39151) idle
postgres  9211  0.6  0.3 40292 7556 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39153) idle
postgres  9217  0.7  0.2 40292 6108 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39154) idle
postgres  9219  0.7  0.2 40276 5680 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39155) idle
postgres  9220  0.6  0.2 40276 5864 ?        Rs   17:00   0:00 postgres: postgres trac_db 127.0.0.1(39156) idle in transaction
postgres  9221  0.7  0.3 40632 6380 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39157) idle
postgres  9222  0.8  0.2 40304 5904 ?        Rs   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39158) idle
postgres  9223  0.6  0.2 40292 5740 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39159) idle
postgres  9224  0.7  0.3 40628 6508 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39160) idle
postgres  9225  0.7  0.2 40296 5656 ?        Rs   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39161) idle in transaction
postgres  9226  0.6  0.2 40284 5844 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39162) idle
postgres  9227  0.7  0.2 40272 5768 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39163) idle in transaction
postgres  9228  0.7  0.3 40288 7936 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39164) idle
postgres  9230  0.6  0.2 40280 5488 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39165) idle
postgres  9282  0.7  0.2 40276 5736 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39167) idle
postgres  9283  0.6  0.2 40300 5564 ?        Rs   17:00   0:00 postgres: postgres trac_db 127.0.0.1(39168) idle in transaction
postgres  9284  0.6  0.2 40276 5568 ?        Ss   17:00   0:00 postgres: postgres trac_db 127.0.0.1(39169) idle in transaction
postgres  9285  0.7  0.2 40300 5796 ?        Ss   17:00   0:00 postgres: postgres trac_db 127.0.0.1(39170) idle in transaction
postgres  9286  0.7  0.2 40276 5436 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39171) idle in transaction
postgres  9287  0.7  0.2 40300 5540 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39172) idle in transaction
postgres  9288  0.6  0.2 40296 5512 ?        Ss   17:00   0:00 postgres: postgres trac_db 127.0.0.1(39173) idle in transaction
postgres  9289  0.7  0.2 40280 5596 ?        Ss   17:00   0:00 postgres: postgres trac_db 127.0.0.1(39174) idle
postgres  9290  0.6  0.2 40288 5616 ?        Ss   17:00   0:00 postgres: postgres trac_db 127.0.0.1(39175) idle in transaction
postgres  9293  0.7  0.3 40272 7272 ?        Ss   17:00   0:01 postgres: postgres trac_db 127.0.0.1(39176) idle
postgres  9294  0.6  0.2 40288 5880 ?        Ss   17:00   0:00 postgres: postgres trac_db 127.0.0.1(39177) idle in transaction
postgres  9307  0.6  0.2 40276 5396 ?        Ss   17:01   0:00 postgres: postgres trac_db 127.0.0.1(39179) idle

comment:16 by dfraser, 10 years ago

Cc: dfraser added

comment:17 by Christian Boos, 10 years ago

Milestone: 0.11.6 → next-minor-0.12.x

The problem sort of disappeared on t.e.o since we disabled Bitten …

Postponed.

comment:18 by Christian Boos, 10 years ago

Milestone: next-minor-0.12.x → 0.12

#8602 was closed as a duplicate, as it's pretty clear now that the idle in transaction problem can lead to a blocked server showing the Unable to get database connection within 20 seconds error.

Putting it back on the radar for short terms issues, though I don't want this to be a blocker for 0.12.

comment:19 by Christian Boos, 10 years ago

Status: new → assigned

In the tests I did, the idle in transaction connections persisted as long as the connection is not rollback'ed, but this rollback should happen as soon as the cnx Connection is returned to the _pool, when the last PooledConnection that wraps it is garbage collected (_return_cnx).

One possible issue there was that the try_rollback was only attempted when the cnx is poolable. This is normally True for PostgreSQL, but I know sometimes people forcefully set it to False.

So the problem can only happen if there's a memory leak and the PooledConnection is kept alive somewhere. One extra safety measure could be to force this rollback to happen at the end of a request, during shutdown(tid).

See attachment:t8443-deactivate-r9234.patch.

All tests pass on the 3 backends, but I'd like to get some additional testing feedback.

by Christian Boos, 10 years ago

Attachment: t8443-deactivate-r9234.patch added

db pool: try harder to rollback unused db connections

in reply to:  19 comment:20 by Christian Boos, 10 years ago

Replying to cboos:

attachment:t8443-deactivate-r9234.patch.

Patch now active on t.e.o, but I'll probably have to re-activate Bitten to really stress-test this ;-) (so far, so good)

comment:21 by John Hampton, 10 years ago

So I did some fairly unscientific testing against http://pacopablo.com. I ran:

ab -c 10 -n 1000 http://pacopablo.com/wiki/WikiStart

against trunk@9275 and then applied the patch and ran the same command again. My current setup is using 3 tracd instances proxied behind nginx. However, it looks as though only one of those instances was handling all of the traffic. (I guess I'll have to fix my proxying). In both cases, the number of used connections stayed around 10 for the entire benchmark. Every 2 seconds I counted the number of idle in transaction connections. Below is the distribution throughout the test.

  • r9275
    Number of idle in transaction connections: 0 (2 times)
    Number of idle in transaction connections: 1 (1 times)
    Number of idle in transaction connections: 2 (1 times)
    Number of idle in transaction connections: 3 (4 times)
    Number of idle in transaction connections: 4 (8 times)
    Number of idle in transaction connections: 5 (22 times)
    Number of idle in transaction connections: 6 (42 times)
    Number of idle in transaction connections: 7 (60 times)
    Number of idle in transaction connections: 8 (48 times)
    Number of idle in transaction connections: 9 (57 times)
    Number of idle in transaction connections: 10 (22 times)
    
  • r9275 w/ patch applied
    Number of idle in transaction connections: 0 (3 times)
    Number of idle in transaction connections: 2 (1 times)
    Number of idle in transaction connections: 3 (1 times)
    Number of idle in transaction connections: 4 (16 times)
    Number of idle in transaction connections: 5 (26 times)
    Number of idle in transaction connections: 6 (42 times)
    Number of idle in transaction connections: 7 (72 times)
    Number of idle in transaction connections: 8 (57 times)
    Number of idle in transaction connections: 9 (37 times)
    Number of idle in transaction connections: 10 (18 times)
    

The good news is that the number of idle connections never went over 10 (the number of concurrent connections from ab). Additionally, the idle in transaction connections dropped to 0 after the test was done.

However, I can't tell any difference between patch and no patch. The test was against my front page, which hits at least 2 plugins (th:FullBlogPlugin and th:IrcLogsPlugin).

comment:22 by John Hampton, 10 years ago

Cc: pacopablo@… added

Ran some more tests. Can't see any difference between pre and post patch. However, I never have more db connections open than the number of concurrent web requests, so it doesn't look like any are being leaked.

comment:23 by anonymous, 10 years ago

Thanks for the testing, it's good to see there were apparently no unwanted side effects with the changes.

The fact that you didn't see any differences with or without the patch is expected in normal conditions of operations. In effect, I suspect that this idle in transaction issue can only happen in abnormal conditions, which ones exactly is not completely clear, but some candidates are:

  • very long SQL queries (Bitten)
  • db connections kept alive across web requests (e.g. by storing one in a component, a.k.a my TOC brown paper bag issue, fixed by [TH4366] ;-) )
  • possibly exceptions triggered during query, with the db being held alive in the stack frames, and those in turn being kept alive (?) - back to the previous point

So if you can try again your experiments with something like:

  • trac/web/main.py

        req = Request(environ, start_response)
        translation.make_activable(lambda: req.locale, env and env.path or None)
        try:
    +       env._db = env.get_db_cnx() # Misbehavior #1
            return _dispatch_request(req, env, env_error)
        finally:
            translation.deactivate()

(or any other creative way to trigger errors ;-) ), you should see:

  • idle in transaction accumulate without the patch
  • "proper behavior" with the patch

comment:24 by John Hampton, 10 years ago

Well, I tried to get the code to misbehave, but after testing with your patch above I have failed to get r9275 to show any difference with the #8443 patch and without.

However, throughout all my tests, I have seen stable behavior. I think that this is a good sign.

+1 for committing the patch

comment:25 by Christian Boos, 10 years ago

So yesterday we had another interesting situation here on t.e.o:

  • one trac.fcgi process was at 100%
  • 12 idle in transaction postgres processes

Looking at what trac.fcgi did, it was just busy accepting new connections (137 threads), all trying to get a db connection when none was available anymore. The gdb backtraces for all these threads didn't show anything interesting except that no other 3rd party bug was involved, like the svn/apr issue of #7785.

See teo-debug.log for the strace and gdb log.

That day I did a lot of restarts of the lighttpd server, maybe something wrong can happen when a trac.fcgi process is abruptly terminated (the corresponding postgres process hanging?). OTOH, to "fix" the server I killed -TERM that 17705 process, restarted lighty once more and after a while, all idle in transaction processes disappeared.

So apparently the patch doesn't do any good for the base problem, and maybe the additional notify() I introduced only contributes to worsening the problem, with too many thread wake-ups…

by Christian Boos, 10 years ago

Attachment: teo-debug.log added

strace + backtraces of a trac.fcgi process that was continuously ≥ 100%, while holding alive ≥ 10 idle in transaction postgres processes

comment:26 by Christian Boos, 10 years ago

Milestone: 0.12next-minor-0.12.x

Ok, the results with the patch were mixed; the bug is still here and we still don't have a clue about what's going on, so I'll postpone this.

comment:27 by Carsten Klein <carsten.klein@…>, 10 years ago

We had a similar problem with a Java based application, where too many connections were handed out to the same thread requesting those connections.

What finally settled it was to cache the last connection provided to the calling thread, and, if that connection prematurely closed, create a new connection that then in turn would be cached for future access by the same thread.

Additionally, I do not see the point where a simple SELECT statement would leave a connection in an IDLE IN TRANSACTION state, as a stand alone select operation does IMHO not actually require a transaction context since it is read only.

Perhaps there are some other write-access statements, like for example UPDATE or INSERT, that will start a transaction and never commit, and thus occupy the connection?

I don't actually know, but does trac do some kind of access logging in the database or similar, where it actually inserts or updates a tuple and never commits?

Besides that, all model write operations seem to be @with_transaction as with 0.12dev, so that should not be the cause for this problem.

in reply to:  18 ; comment:28 by Christian Boos, 10 years ago

Milestone: next-minor-0.12.x → 0.12

Replying to cboos:

#8602 was closed as a duplicate, as it's pretty clear now that the idle in transaction problem can lead to a blocked server showing the Unable to get database connection within 20 seconds error.

Actually I now think that they're both a symptom of a same root cause, which must be some kind of deadlock, as the strace logs taken at this occasion always show some kind of fierce "futex" battle…

We need to be more careful between cross acquisition of the different locks we have, including those we don't "see". There's one in source:trunk/trac/web/_fcgi.py, there's the logging lock, the Genshi template loader lock, the localedata _cache_lock in Babel, possibly others (psycopg2 uses a pthread_mutex_t directly).

See on ftp.edgewall.org, the tmp/db-timeout-incident-2010-03-18.txt (no direct link, to not load the site too much, the file has a size of 20M).

That strace clearly shows contention on the 0x30b2510 and 0x2e0cf20 futexes. Now we just need to find out to which lock they correspond ;-)

From db-timeout-incident-2010-03-18-except-0x30b2510-and-0x2e0cf20-no-resumed.txt, the other futexes we see are at thread startup (near set_robust_list) we can ignore those, and a few times 0x30c4c80, which seems to be the logging IO lock.

[pid  6578] 10:25:00.804434 futex(0x30c4c80, 0x80 /* FUTEX_??? */, 0 <unfinished ...>
[pid  6551] 10:25:00.804531 write(7, "[pid  1802332496] 2010-03-18 10:"..., 175 <unfinished ...>
[pid  6551] 10:25:00.804798 futex(0x30c4c80, 0x81 /* FUTEX_??? */, 1 <unfinished ...>

From tmp/db-timeout-incident-2010-03-18.txt, we see things like:

[pid  6540] 10:24:56.227063 sendto(21, "\1\6\21\362\1K\5\0", 8, 0, NULL, 0) = 8
[pid  6540] 10:24:56.227477 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0
[pid  6540] 10:24:56.227547 sendto(21, "Status: 500 Internal Server Erro"..., 331, 0, NULL, 0) = 331
[pid  6540] 10:24:56.227786 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0
[pid  6540] 10:24:56.227849 sendto(21, "\0\0\0\0\0", 5, 0, NULL, 0) = 5
[pid  6540] 10:24:56.228086 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0
[pid  6540] 10:24:56.228234 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0
[pid  6540] 10:24:56.228313 sendto(21, "\1\6\21\362\37\370\0\0", 8, 0, NULL, 0) = 8
[pid  6540] 10:24:56.228609 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0
[pid  6540] 10:24:56.228670 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0

which suggests that 0x30b2510 corresponds to the lock in _fcgi.py's MultiplexedConnection.

From log excerpts like:

[pid  6596] 10:24:58.006529 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid  6596] 10:24:58.006583 poll([{fd=35, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
[pid  6596] 10:24:58.006637 recvfrom(35, "C\0\0\0\rROLLBACK\0Z\0\0\0\5I", 1048576, 0, NULL, NULL) = 20
[pid  6596] 10:24:58.006731 gettimeofday({1268904298, 6748}, NULL) = 0
[pid  6596] 10:24:58.006809 futex(0x2e0cf20, 0x81 /* FUTEX_??? */, 1) = 1

it seems that 0x2e0cf20 is the psycopg2 mutex.

Now we just have to find a way to avoid deadlocks between the two, maybe the "easy" fix would be to disable MultiplexedConnection. I'll try that next.

in reply to:  28 comment:29 by Christian Boos, 10 years ago

… disable MultiplexedConnection

Sigh, MultiplexedConnection was not used, 0x30b2510 must be another lock, ah, it must be the GIL of course ;-)

So does that boil down to some design flaw in psycopg2 which uses its own mutex, hence the risk to deadlock with the GIL?

comment:30 by Christian Boos, 10 years ago

I've found a possibly related discussion on the psycopg2 mailing list, including a proposed patch, but it has not been integrated in the psycopg2 repository yet. It would be interesting to try it.

Also, they talk about a deadlock, and while I also said "deadlock" in the above, what we're actually seeing is not a "dead" lock, but rather a livelock, leading to 100% CPU utilization.

in reply to:  30 ; comment:31 by Christian Boos, 10 years ago

Replying to cboos:

I've found a possibly related discussion on the psycopg2 mailing list, including a proposed patch, but it has not been integrated in the psycopg2 repository yet. It would be interesting to try it.

The test program given in the same mailing list thread works fine on t.e.o, so it's likely another problem…

in reply to:  31 comment:32 by Remy Blank, 10 years ago

Replying to cboos:

The test program given in the same mailing list thread works fine on t.e.o, so it's likely another problem…

The thread also mentions that, depending on the versions of the components, the issue takes longer to show up. So you might want to leave the program running for a few hours.

in reply to:  28 comment:33 by Christian Boos, 10 years ago

There was probably a flaw in my previous analysis; if it were right, where would the futex corresponding to the trac.db.pool lock be?

Also, this morning everything was fine on t.e.o, but nevertheless I had the curiosity to look for idle in transaction processes and indeed there were some!

postgres  2655  4178  0 Mar15 ?        00:00:33 postgres: www-data trac [local] idle in transaction         
postgres  2656  4178  0 Mar15 ?        00:00:11 postgres: www-data trac [local] idle in transaction         
postgres  2657  4178  0 Mar15 ?        00:00:38 postgres: www-data trac [local] idle in transaction         
postgres  2659  4178  0 Mar15 ?        00:00:16 postgres: www-data trac [local] idle in transaction         
postgres  2671  4178  0 Mar15 ?        00:00:19 postgres: www-data trac [local] idle in transaction         
postgres  2675  4178  0 Mar15 ?        00:00:29 postgres: www-data trac [local] idle in transaction         
postgres  2676  4178  0 Mar15 ?        00:00:17 postgres: www-data trac [local] idle in transaction         
postgres  3951  4178  0 Mar15 ?        00:00:06 postgres: www-data trac [local] idle in transaction         
root     12949 12334  0 09:48 pts/0    00:00:00 grep transaction

root@lynx:~# strace -f -tt -p 2655
Process 2655 attached - interrupt to quit
09:48:52.136709 recvfrom(8, ^C
root@lynx:~# strace -f -tt -p 2656
Process 2656 attached - interrupt to quit
09:49:19.471894 recvfrom(8, ^C
...

(same for the others)

That postgres process is simply waiting for input:

root@lynx:~# gdb -p 2655
...
(gdb) bt
#0  0x00007fc557788aa5 in recv () from /lib/libc.so.6
#1  0x00007fc5598bfbe6 in secure_read () from /usr/lib/postgresql/8.3/bin/postgres
#2  0x00007fc5598c6e34 in ?? () from /usr/lib/postgresql/8.3/bin/postgres
#3  0x00007fc5598c72cb in pq_getbyte () from /usr/lib/postgresql/8.3/bin/postgres
#4  0x00007fc559942685 in PostgresMain () from /usr/lib/postgresql/8.3/bin/postgres
#5  0x00007fc559914d80 in ?? () from /usr/lib/postgresql/8.3/bin/postgres
#6  0x00007fc559915c6a in PostmasterMain () from /usr/lib/postgresql/8.3/bin/postgres
#7  0x00007fc5598c8d4e in main () from /usr/lib/postgresql/8.3/bin/postgres

fd 8 is a socket:

lrwx------ 1 postgres postgres 64 Mar 22 11:18 8 -> socket:[171082172]

And there's indeed a trac.fcgi process (2650), on the other end:

lrwx------ 1 www-data www-data 64 Mar 22 11:19 8 -> socket:[171082171]

Also interesting is that there are a bunch of sockets used by that process:

root@lynx:/proc/2650# ll fd
total 0
lrwx------ 1 www-data www-data 64 Mar 22 11:19 0 -> socket:[171082152]
l-wx------ 1 www-data www-data 64 Mar 22 11:19 1 -> /var/log/lighttpd/error.log
lrwx------ 1 www-data www-data 64 Mar 22 11:19 10 -> socket:[171082181]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 100 -> socket:[171103696]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 101 -> socket:[171114602]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 102 -> socket:[171103698]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 103 -> socket:[171112888]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 104 -> socket:[171114603]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 107 -> socket:[171103706]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 109 -> socket:[171103708]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 11 -> socket:[171114658]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 110 -> socket:[171103709]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 12 -> socket:[171082175]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 125 -> socket:[171103724]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 13 -> socket:[171104647]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 14 -> socket:[171114559]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 15 -> socket:[171103745]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 16 -> socket:[171104057]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 17 -> socket:[171082249]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 18 -> socket:[171104126]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 19 -> socket:[171082262]
l-wx------ 1 www-data www-data 64 Mar 22 11:19 2 -> /var/log/lighttpd/error.log
lrwx------ 1 www-data www-data 64 Mar 22 11:19 20 -> socket:[171082270]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 21 -> socket:[171112463]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 22 -> socket:[171082273]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 23 -> socket:[171104816]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 24 -> socket:[171104666]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 25 -> socket:[171112472]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 26 -> socket:[171112791]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 27 -> socket:[171088246]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 28 -> socket:[171105148]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 29 -> socket:[171104070]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 3 -> socket:[171082152]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 30 -> socket:[171103809]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 31 -> socket:[171114886]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 32 -> socket:[171114600]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 33 -> socket:[171114887]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 34 -> socket:[171112792]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 35 -> socket:[171104074]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 36 -> socket:[171114888]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 37 -> socket:[171103752]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 38 -> socket:[171104590]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 39 -> socket:[171105082]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 4 -> socket:[171105003]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 40 -> socket:[171104817]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 41 -> socket:[171112868]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 42 -> socket:[171101938]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 43 -> socket:[171104177]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 44 -> socket:[171103861]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 45 -> socket:[171103899]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 46 -> socket:[171104679]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 47 -> socket:[171103905]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 48 -> socket:[171107995]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 49 -> socket:[171105223]
l-wx------ 1 www-data www-data 64 Mar 22 11:19 5 -> /var/trac/trac/log/trac.log
lrwx------ 1 www-data www-data 64 Mar 22 11:19 50 -> socket:[171112793]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 51 -> socket:[171114560]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 52 -> socket:[171105233]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 53 -> socket:[171104836]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 54 -> socket:[171102016]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 55 -> socket:[171104981]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 57 -> socket:[171107996]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 58 -> socket:[171104710]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 59 -> socket:[171107997]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 6 -> socket:[171104297]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 61 -> socket:[171112794]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 62 -> socket:[171104996]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 63 -> socket:[171104911]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 64 -> socket:[171104730]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 65 -> socket:[171104851]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 67 -> socket:[171104926]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 68 -> socket:[171103965]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 69 -> socket:[171105035]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 7 -> socket:[171082275]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 70 -> socket:[171112795]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 71 -> socket:[171114601]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 73 -> socket:[171102624]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 75 -> socket:[171104960]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 76 -> socket:[171112796]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 77 -> socket:[171112797]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 78 -> socket:[171102643]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 79 -> socket:[171104930]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 8 -> socket:[171082171]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 80 -> socket:[171112827]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 81 -> socket:[171102771]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 82 -> socket:[171103677]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 83 -> socket:[171103678]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 84 -> socket:[171104937]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 86 -> socket:[171103681]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 87 -> socket:[171104392]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 88 -> socket:[171105016]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 9 -> socket:[171082173]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 90 -> socket:[171105011]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 91 -> socket:[171105043]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 92 -> socket:[171103688]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 94 -> socket:[171112869]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 95 -> socket:[171112870]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 96 -> socket:[171112871]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 97 -> socket:[171112873]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 98 -> socket:[171103694]
lrwx------ 1 www-data www-data 64 Mar 22 11:19 99 -> socket:[171103695]

Unfortunately, that process seems to be untraceable. It was actually stopped (State T):

0 T www-data  2650     1  0  80   0 - 289822 signal Mar15 ?       00:20:23 /usr/local/virtualenv/trunk/bin/python /usr/local/virtualenv/trunk/share/cgi-bin/trac.fcgi

and after a manual CONT, it switched to (State: Z (zombie)) and then exited. All the idle in transaction processes went away as well, so I suppose that among the many sockets above were the other ends of all the postgres sockets involved.

Looks like "something" could have stopped the trac.fcgi processes in the middle of their transactions, leaving the corresponding postgres processes hanging. Maybe at this point, when the load gets higher, new trac.fcgi processes are unable to get any new connections?

Also at the same time, there was another trac.fcgi process that was subject to the #7785 deadlock, maybe this was related:

root@lynx:~# gdb -p 13309
...
(gdb) info threads
  5 Thread 0x43cc6950 (LWP 20364)  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
  4 Thread 0x434c5950 (LWP 20365)  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
  3 Thread 0x44cc8950 (LWP 20366)  0x00007f6b168cb635 in ?? () from /usr/lib/libapr-1.so.0
  2 Thread 0x454c9950 (LWP 24553)  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
  1 Thread 0x7f6b223776e0 (LWP 13309)  0x00007f6b21f54b04 in __lll_lock_wait () from /lib/libpthread.so.0
(gdb) thread apply all bt

Thread 5 (Thread 0x43cc6950 (LWP 20364)):
#0  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
#1  0x00000000004b9ea8 in PyThread_acquire_lock ()
...

Thread 4 (Thread 0x434c5950 (LWP 20365)):
#0  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
#1  0x00000000004b9ea8 in PyThread_acquire_lock ()
...

Thread 3 (Thread 0x44cc8950 (LWP 20366)):
#0  0x00007f6b168cb635 in ?? () from /usr/lib/libapr-1.so.0
#1  0x00007f6b168cbd36 in apr_pool_destroy () from /usr/lib/libapr-1.so.0
#2  0x00007f6b15f133c4 in ?? () from /var/lib/python-support/python2.5/libsvn/_core.so
#3  0x0000000000491052 in PyEval_EvalFrameEx ()
...

Thread 2 (Thread 0x454c9950 (LWP 24553)):
#0  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
#1  0x00000000004b9ea8 in PyThread_acquire_lock ()
#2  0x00000000004be2e0 in ?? ()
#...

Thread 1 (Thread 0x7f6b223776e0 (LWP 13309)):
#0  0x00007f6b21f54b04 in __lll_lock_wait () from /lib/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#1  0x00007f6b21f501a0 in _L_lock_102 () from /lib/libpthread.so.0
#2  0x00007f6b21f4faae in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x00007f6b168c9b22 in apr_thread_mutex_lock () from /usr/lib/libapr-1.so.0
#4  0x00007f6b168cbca6 in apr_pool_destroy () from /usr/lib/libapr-1.so.0
#5  0x00007f6b168cbc4f in apr_pool_destroy () from /usr/lib/libapr-1.so.0
#6  0x00007f6b168cb874 in apr_pool_terminate () from /usr/lib/libapr-1.so.0
#7  0x00007f6b168cd13c in apr_terminate () from /usr/lib/libapr-1.so.0
#8  0x00007f6b2157f6ad in exit () from /lib/libc.so.6
#9  0x00007f6b2156746d in __libc_start_main () from /lib/libc.so.6
#10 0x0000000000413ca9 in _start ()
#0  0x00007f6b21f54b04 in __lll_lock_wait () from /lib/libpthread.so.0

Besides 2650 and 13309, there is a third process from the 15th of March, still "active" somehow:

root@lynx:/proc# strace -tt -p 9596
Process 9596 attached - interrupt to quit
13:22:31.888620 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
13:22:31.889229 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
13:22:31.913413 --- SIGTERM (Terminated) @ 0 (0) ---
13:22:31.913605 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
13:22:31.913650 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
13:22:32.029193 rt_sigaction(SIGTERM, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, 8) = 0
13:22:32.029400 rt_sigreturn(0xf)       = -1 EINTR (Interrupted system call)
13:22:33.346454 rt_sigaction(SIGHUP, {SIG_IGN}, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, 8) = 0
13:22:33.361953 rt_sigaction(SIGINT, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, 8) = 0
13:22:33.362052 rt_sigaction(SIGTERM, {SIG_DFL}, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, 8) = 0
13:22:33.501645 close(3)                = 0
13:22:38.376117 close(6)                = 0
13:22:38.376241 munmap(0x7f68740b7000, 4096) = 0
13:22:38.665211 rt_sigaction(SIGINT, {SIG_DFL}, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, 8) = 0
^C

root@lynx:/proc# ll /proc/9596/fd
total 0
lrwx------ 1 www-data www-data 64 Mar 22 13:24 0 -> socket:[171116709]
l-wx------ 1 www-data www-data 64 Mar 22 13:24 1 -> /var/log/lighttpd/error.log
lrwx------ 1 www-data www-data 64 Mar 22 13:24 10 -> socket:[171116783]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 12 -> socket:[171116787]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 14 -> socket:[171121772]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 15 -> socket:[171121885]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 19 -> socket:[171122207]
l-wx------ 1 www-data www-data 64 Mar 22 13:24 2 -> /var/log/lighttpd/error.log
lrwx------ 1 www-data www-data 64 Mar 22 13:24 22 -> socket:[171137403]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 26 -> socket:[171137406]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 4 -> socket:[171130805]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 7 -> socket:[171116774]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 8 -> socket:[171116779]

(gdb) bt
#0  0x00007f68732febab in ?? () from /lib/libc.so.6
#1  0x00007f68732ff0a6 in free () from /lib/libc.so.6
#2  0x0000000000446d62 in ?? ()
#3  0x000000000045f47b in ?? ()
#4  0x0000000000446d23 in ?? ()
#5  0x00000000004457e7 in ?? ()
#6  0x0000000000447e8a in PyDict_SetItem ()
#7  0x0000000000449693 in _PyModule_Clear ()
#8  0x00000000004a6074 in PyImport_Cleanup ()
#9  0x00000000004b1f0b in Py_Finalize ()
#10 0x000000000041431b in Py_Main ()
#11 0x00007f68732a1466 in __libc_start_main () from /lib/libc.so.6
#12 0x0000000000413ca9 in _start ()
(gdb) info threads
  1 Thread 0x7f68740b16e0 (LWP 9596)  0x00007f68732febab in ?? () from /lib/libc.so.6

A sort of failed exit attempt ;-)

Hopefully one day all of this will make sense…

comment:34 by Christian Boos, 10 years ago

Happened again the other day, nearly the same scenario. Coincidentally or not, it seems to have started when I was testing #9171.

There were two trac.fcgi processes involved. While the cause of the livelock is still not clear, what happened afterwards is now clearer: lighttpd simply detects that requests are no longer being processed and signals (STOP) the trac.fcgi process. Too bad, as this leaves the "IDLE in transaction" postgres processes alive, as seen above.

/var/log/lighttpd/error.log excerpt:

...
2010-03-25 13:24:20: (server.c.1258) NOTE: a request for /browser/plugins/0.10 timed out after writing 12960 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle
2010-03-25 13:28:46: (server.c.1258) NOTE: a request for /wiki/TracLinks timed out after writing 22240 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle
2010-03-25 13:33:20: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 259
2010-03-25 13:33:23: (mod_fastcgi.c.2673) fcgi-server re-enabled:  0 /tmp/trac-trac-fastcgi.sock
2010-03-25 13:33:24: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 265
2010-03-25 13:33:27: (mod_fastcgi.c.2673) fcgi-server re-enabled:  0 /tmp/trac-trac-fastcgi.sock
2010-03-25 13:33:28: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 269
2010-03-25 13:33:31: (mod_fastcgi.c.2673) fcgi-server re-enabled:  0 /tmp/trac-trac-fastcgi.sock
...

The re-enable doesn't seem to have worked, though.

As it happened again today, I decided to switch off Babel for some time, just to verify whether this makes any difference.

comment:35 by Christian Boos, 10 years ago

Though t.e.o seems more responsive these days, there were still Unable to connect to db … errors in the log, so I'm reactivating Babel.

comment:36 by Christian Boos, 10 years ago

Milestone: 0.12 → 0.12.1

Will not block 0.12 release.

comment:37 by Christian Boos, 10 years ago

More random notes.

I've seen this in the trac.log. Is it related?

2010-05-11 13:51:37: (server.c.1475) server stopped by UID = 0 PID = 27072
Unhandled exception in thread started by
Error in sys.excepthook:

Original exception was:
Traceback (most recent call last):
  File "/usr/lib/python2.5/logging/__init__.py", line 750, in emit
    self.stream.write(fs % msg)
ValueError: I/O operation on closed file
Unhandled exception in thread started by
Error in sys.excepthook:

Original exception was:
Unhandled exception in thread started by
Error in sys.excepthook:

...

(At least at 14:02 the load was quite high (16); the situation improved by itself a few minutes later.)

Just found a duplicate of this error: #8955, itself linking to #6580. Now that one could be interesting, notably the endless exception loop part…

But what happened here? Was trac.log closed by logrotate? The problem was already there before I added logrotate for trac.log, and looking at the trac.log file, it doesn't look like it has been recreated recently (its content dates back to 06:36 this morning).

Instead, I see:

[pid  1113229648] 2010-05-11 13:53:17,502 Trac[env] INFO: -------------------------------- environment startup [Trac 0.12dev-r9664] --------------------------------
[pid  1097636176] 2010-05-11 13:53:17,514 Trac[env] INFO: -------------------------------- environment startup [Trac 0.12dev-r9664] --------------------------------

After that, there was no Unable to get database connection within 20 seconds error. It should be noted that quite regularly (though not every day) we get that error a few minutes after startup; it resolves itself after a few minutes.

Other thoughts: what exactly happens when we shut down once a modification to trac.ini is detected? We do close all currently open connections (r7208), but what happens if a request has not finished and wants to get another connection from the pool manager we have shut down? So far I thought this was not problematic, but it deserves a closer look…

Finally, if the problem is really nothing more than bad behavior in high-contention scenarios (100s of threads), we could try a more controlled approach for selecting which thread gets the next available connection.
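That "more controlled approach" could amount to a simple FIFO handover, so waiting threads are served in arrival order instead of racing on a shared condition variable. A minimal sketch (hypothetical `FairPool`, not Trac's actual trac.db.pool code):

```python
import queue

class FairPool:
    """Minimal FIFO connection pool sketch: under contention, threads
    blocked in get_cnx() are woken roughly in arrival order, so no
    thread starves until the timeout expires."""

    def __init__(self, connections):
        self._q = queue.Queue()
        for cnx in connections:
            self._q.put(cnx)

    def get_cnx(self, timeout=20):
        try:
            return self._q.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError('Unable to get database connection '
                               'within %s seconds' % timeout)

    def put_back(self, cnx):
        self._q.put(cnx)
```

The point of the sketch is only the handover discipline; a real pool would also create connections lazily and track per-thread reuse.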

comment:38 by Christian Boos, 9 years ago

Milestone: 0.12.1
Resolution: duplicate
Status: assigned → closed

Funny, just when I was going to comment again on this issue, it happened again! The typical situation: a bunch of postgres processes stuck in idle in transaction, Trac unresponsive with all queries ending in a Timeout error, and then, after a while, a recovery, the server becoming fast again and the idle in transaction processes gone.

The only thing a bit new is the following backtrace, showing a nice case of re-entrancy in get_cnx:

File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 513, in _dispatch_request
File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 235, in dispatch
File "build/bdist.linux-x86_64/egg/trac/timeline/web_ui.py", line 89, in process_request
File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__
File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 291, in _get_perm
File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__
File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 298, in _get_session
File "build/bdist.linux-x86_64/egg/trac/web/session.py", line 156, in __init__
File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__
File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 157, in authenticate
File "build/bdist.linux-x86_64/egg/trac/web/auth.py", line 83, in authenticate
File "build/bdist.linux-x86_64/egg/trac/web/auth.py", line 205, in _get_name_for_cookie
File "build/bdist.linux-x86_64/egg/trac/env.py", line 335, in get_db_cnx
File "build/bdist.linux-x86_64/egg/trac/db/api.py", line 90, in get_read_db
File "build/bdist.linux-x86_64/egg/trac/db/api.py", line 152, in get_connection
File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 205, in get_cnx
File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 153, in get_cnx
File "build/bdist.linux-x86_64/egg/trac/util/translation.py", line 186, in gettext
File "build/bdist.linux-x86_64/egg/trac/util/translation.py", line 175, in isactive
File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 479, in <lambda>
File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__
File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 309, in _get_locale
File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__
File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 298, in _get_session
File "build/bdist.linux-x86_64/egg/trac/web/session.py", line 156, in __init__
File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__
File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 157, in authenticate
File "build/bdist.linux-x86_64/egg/trac/web/auth.py", line 83, in authenticate
File "build/bdist.linux-x86_64/egg/trac/web/auth.py", line 205, in _get_name_for_cookie
File "build/bdist.linux-x86_64/egg/trac/env.py", line 335, in get_db_cnx
File "build/bdist.linux-x86_64/egg/trac/db/api.py", line 90, in get_read_db
File "build/bdist.linux-x86_64/egg/trac/db/api.py", line 152, in get_connection
File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 205, in get_cnx
File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 153, in get_cnx

A bit annoying, but nothing critical as we use an RLock.
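Why the RLock makes this merely annoying rather than fatal can be illustrated with a plain threading example (illustrative only, not Trac code):

```python
import threading

_pool_lock = threading.RLock()  # reentrant: same thread may acquire twice

def get_cnx(depth=0):
    """Sketch of the nested call in the traceback above: while holding
    the pool lock, the translation/session machinery ends up calling
    get_cnx() again on the same thread."""
    with _pool_lock:
        if depth == 0:
            # e.g. gettext() -> req.session -> authenticate() -> get_cnx()
            return get_cnx(depth=1)
        return 'connection'

# With a plain threading.Lock(), the inner acquisition would deadlock
# the thread against itself; with an RLock it simply nests.
```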

But I digress; the point I wanted to make is that this "idle in transaction" issue is really nothing more than a consequence of the bad behavior of our pool under certain circumstances, and that by improving the connection pool and solving the Timeout issue, this side effect will go away. See #9111, which I'll reopen shortly.
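For reference, the core of the fix described in this ticket amounts to rolling back unconditionally whenever a connection goes back to the pool, so no implicit BEGIN can linger. A minimal sketch with a stand-in connection class (not the actual #9111 patch):

```python
class FakeConnection:
    """Stand-in for a DB-API connection, tracking transaction state."""
    def __init__(self):
        self.in_transaction = False
    def execute(self, sql):
        self.in_transaction = True   # psycopg implicitly issues BEGIN
    def rollback(self):
        self.in_transaction = False
    def commit(self):
        self.in_transaction = False

class Pool:
    def __init__(self):
        self._free = []
    def get_cnx(self):
        return self._free.pop() if self._free else FakeConnection()
    def reinsert(self, cnx):
        # Unconditional rollback: even if no explicit commit/rollback
        # happened during the request, no transaction is left open, so
        # the backend never sits in "idle in transaction".
        try:
            cnx.rollback()
        except Exception:
            pass  # a broken connection must not take the request down
        self._free.append(cnx)
```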

comment:39 by Carsten Klein <carsten.klein@…>, 9 years ago

Resolution: duplicate
Status: closed → reopened

Hm, I don't know, but closing this as duplicate when the actual issue is that

But I digress; the point I wanted to make is that this "idle in transaction" issue is really nothing more than a consequence of the bad behavior of our pool under certain circumstances, and that by improving the connection pool and solving the Timeout issue, this side effect will go away. See #9111, which I'll reopen shortly.

is rather unproductive, don't you think?

I would call this a major one which might even lead to blocking future releases until fixed!? As for #9111, it is still closed. So we need to reopen this one.

As for the db connection pooling issue, see comment:27. I still believe that there are some queries that are not part of a transaction that is actively closed; otherwise the transactions would not remain idle.

in reply to:  39 comment:40 by Carsten Klein <carsten.klein@…>, 9 years ago

Replying to Carsten Klein <carsten.klein@…>:

a viable solution might be to implement the pool so that it actively commits any open transactions that have not been closed by the time a connection still in the idle in transaction state is evicted.

of course, any commit failures should be logged, if a log is available, and should not cause the application to fail.

this might require a thread that permanently checks open connections and their overall lifetime; see for example the Apache DB connection pools.
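That suggestion could be prototyped with a watchdog thread; rolling back (rather than committing) a half-finished transaction is arguably the safer default. A sketch against a hypothetical pool API (`idle_connections()` is assumed, it is not Trac's actual interface):

```python
import logging
import threading
import time

def start_idle_reaper(pool, max_idle=300, interval=30):
    """Watchdog sketch: periodically roll back connections that have sat
    in a transaction longer than max_idle seconds. `pool` is assumed to
    expose idle_connections() yielding (cnx, idle_seconds) pairs."""
    def reap():
        while True:
            for cnx, idle_for in pool.idle_connections():
                if idle_for > max_idle:
                    try:
                        cnx.rollback()
                    except Exception:
                        # log and carry on; must not crash the application
                        logging.exception('rollback of idle cnx failed')
            time.sleep(interval)
    t = threading.Thread(target=reap, daemon=True)
    t.start()
    return t
```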

in reply to:  39 comment:41 by Remy Blank, 9 years ago

Replying to Carsten Klein <carsten.klein@…>:

As for #9111, it is still closed. So we need to reopen this one.

Did you notice that #9111 was reopened four months ago (i.e. at the time comment:38 was written), had some major development going on, then finally was closed again?

comment:42 by Christian Boos, 9 years ago

Resolution: duplicate
Status: reopened → closed

Exactly. Furthermore, I see no evidence in your reopening comment showing that the issue is not fixed when using 0.12.1rc1 or trunk.

If this is nevertheless the case for you, it would be better at this point to create a new ticket, given the long history of this one.

Last edited 9 years ago by Christian Boos (previous) (diff)

comment:43 by Ryan J Ollos, 5 years ago

Keywords: postgresql added; postgres removed
