#8443 closed defect (duplicate)
postgres and "IDLE in transaction" status
| Reported by: | mixedpuppy | Owned by: | Christian Boos |
|---|---|---|---|
| Priority: | high | Milestone: | |
| Component: | database backend | Version: | 0.11-stable |
| Severity: | critical | Keywords: | postgresql |
| Cc: | dfraser, pacopablo@… | Branch: | |
| Release Notes: | | | |
| API Changes: | | | |
| Internal Changes: | | | |
Description
I've noticed a lot of "IDLE in transaction" statuses on postgres connections from trac after a request is finished. The problem with this is that the connection essentially becomes unusable until a rollback or commit is executed, sometimes resulting in locked-up requests. I've tracked it down and have a hack of a fix for it, but first some background.
My feeling is that whether people see this depends on the installed plugins and how they are used; default trac may not exhibit the behaviour required to trigger the problem, even though the bug is in trac itself. My reliable repro involves using the IncludePages plugin to include one wiki page inside another.
The database connection pool in trac is consulted on every env.get_db_cnx call. It keeps a list of "active" connections, which are connections currently in use, plus the pool of idle ones. The typical pattern when using a database connection looks like this:
- some function in trac is entered
- get connection
- if connection in active list for thread, use that
- otherwise get one from the pool (or create one)
- a cursor is created
- psycopg may execute a BEGIN statement, depending on the default settings for the postgres server; the installed default for postgres causes psycopg to do exactly that
- function in trac finishes
- connection wrapper is garbage collected, causing the wrapper to "reinsert" the connection into the pool
- if it is the last "reference" to the connection, trac attempts a rollback (which typically matches the begin above)
- if it is NOT the last "reference" to the connection, and no explicit rollback was done (after an insert, trac/plugins usually do a commit explicitly), then the pool does not rollback either (THIS IS THE PROBLEM). The result is that the BEGIN above remains in effect, leaving the connection "IDLE in transaction" (a minimal demonstration follows this list).
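Not part of the original report, but a minimal standalone sketch of the behaviour described in the last two bullets, using psycopg2 directly (the DSN is a placeholder, and this is independent of Trac's pool):

```python
import psycopg2
import psycopg2.extensions as ext

# Placeholder DSN -- point this at a real database before running.
conn = psycopg2.connect("dbname=test")

cur = conn.cursor()
cur.execute("SELECT 1")   # psycopg2 implicitly issues BEGIN here
cur.fetchall()

# No statement is running anymore, but the transaction opened by the
# implicit BEGIN is still pending: the backend shows "idle in transaction".
assert conn.get_transaction_status() == ext.TRANSACTION_STATUS_INTRANS

conn.rollback()           # ends the transaction ...
assert conn.get_transaction_status() == ext.TRANSACTION_STATUS_IDLE
# ... and the server-side process goes back to plain "idle".
```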
The pool would be the wrong place to fix this, since you can never know whether a rollback is safe on an "active" connection. I'm also wondering whether the rollback that is currently in there was added because of bugs reported on this issue before.
An easy repro is to use the includepages plugin for the wiki and include one wiki page within another. At the end of a request for that wiki page you will have the IDLE in transaction state on the connection.
Without requiring a repro, this illustrates how it happens:
```python
def do_something(env):
    db = env.get_db_cnx()
    c = db.cursor()
    c.execute("SELECT foo FROM bar")
    # since we're the second reference, no rollback is called,
    # but since it's a new cursor, BEGIN was called

def main(env):
    db = env.get_db_cnx()
    c = db.cursor()
    c.execute("SELECT bat FROM fab")
    do_something(env)
    # once we're done, db is reinserted into the pool and rollback is called
```
With the above outline, two BEGIN statements were executed, but only one rollback.
The simple hack fix is to modify main.dispatch_request to get a handle on the database connection prior to handling the request, then do a rollback after the request:
```python
req = Request(environ, start_response)
try:
    db = env.get_db_cnx()
    return _dispatch_request(req, env, env_error)
finally:
    db.rollback()
    if env and not run_once:
        env.shutdown(threading._get_ident())
```
I call it a hack fix because it seems ugly to me, but it's the only place where this can be handled reliably, without the potential to step on a commit that is in progress and without adding commit/rollback calls all over the trac (and plugin) code.
The fix does not break existing transaction areas (e.g. after an update/insert), since psycopg will issue a new BEGIN on the next execute once commit or rollback has been called. Updates, inserts, etc. should remain explicit about calling commit, since a select will not see the new data unless that is done.
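As an aside (not from the ticket), the visibility point above can be seen with two independent psycopg2 connections under the default READ COMMITTED level; the DSN and table name are placeholders for this sketch:

```python
import psycopg2

a = psycopg2.connect("dbname=test")   # writer
b = psycopg2.connect("dbname=test")   # reader

ca, cb = a.cursor(), b.cursor()
ca.execute("CREATE TABLE IF NOT EXISTS t8443_demo (v int)")
a.commit()

ca.execute("INSERT INTO t8443_demo VALUES (1)")   # not committed yet

cb.execute("SELECT count(*) FROM t8443_demo")
print(cb.fetchone())   # the uncommitted row is not visible to the reader

a.commit()             # only now does the write become visible

cb.execute("SELECT count(*) FROM t8443_demo")
print(cb.fetchone())   # READ COMMITTED: the next statement sees the new row

ca.execute("DROP TABLE t8443_demo")
a.commit()
```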
It also seems like some extra work is happening in the current db pooling, such as creating a PooledConnection instance for each call to env.get_db_cnx, even if the connection is in the "active" list.
Looking at other bugs regarding this issue, it seems it would also be a good idea to force-close the connection at the end of a request IF its status is still "in transaction", and to log or raise an error as well. This would at least prevent problems around replication and vacuuming, and let people know there is a bug (likely a missing commit in some plugin).
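A rough illustration of that suggestion (not Trac code; the helper name is made up), using psycopg2's transaction status to detect and log a connection left in a transaction when the request ends:

```python
import logging

import psycopg2.extensions as ext

log = logging.getLogger(__name__)

def check_connection_at_request_end(cnx):
    """Hypothetical end-of-request hook: complain loudly about, and clean up,
    a connection that is still inside a transaction.

    `cnx` is assumed to be the raw psycopg2 connection wrapped by the pool.
    """
    status = cnx.get_transaction_status()
    if status in (ext.TRANSACTION_STATUS_INTRANS,
                  ext.TRANSACTION_STATUS_INERROR):
        log.error("connection still in transaction at end of request "
                  "(probably a missing commit/rollback in some plugin)")
        cnx.rollback()   # or cnx.close() to be even stricter
```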
Attachments (3)
Change History (46)
comment:1 by , 15 years ago
follow-up: 6 comment:3 by , 15 years ago
The issue with postgres idle connections was masked by the gc.collect to some extent. When I removed gc.collect I ended up with many more idle connections. When the environment is shut down at the end of a request, it does not return the last connection from the active list to the pool, and the act of returning the connection to the pool is what does the last rollback. With this patch, I don't need the hack I described earlier.
comment:4 by , 15 years ago
Milestone: | → 0.11.6 |
---|
follow-up: 10 comment:5 by , 15 years ago
Replying to mixedpuppy:
With the above outline, two BEGIN statements were executed, but only one rollback.
Is this really a problem? From http://www.postgresql.org/docs/8.3/static/sql-begin.html:
Issuing BEGIN when already inside a transaction block will provoke a warning message. The state of the transaction is not affected.
comment:6 by , 15 years ago
Replying to anonymous:
… When the environment is shutdown at the end of a request, it does not return the last connection from the active list to the pool. The act of returning the connection to the pool is what does the last rollback.
The try_rollback, yes. Patch looks good. I need a bit more time to digest it fully, but I'm getting convinced it should work.
follow-up: 11 comment:7 by , 15 years ago
Replying to mixedpuppy:
It also seems like some extra work is happening in the current db pooling, such as creating a PooledConnection instance for each call to env.get_db_cnx, even if the connection is in the "active" list.
Well, I don't see what the alternative would be; can you elaborate?
comment:8 by , 15 years ago
Owner: | set to |
---|
In the attachment:postgres-idle.patch, in _return_active_cnx, the _active dict is modified without the _available mutex being acquired.
comment:9 by , 15 years ago
Keywords: | postgres added |
---|
The #8602 issue of today was a symptom of this problem.
follow-up: 12 comment:10 by , 15 years ago
Replying to cboos:
Replying to mixedpuppy:
With the above outline, two BEGIN statements were executed, but only one rollback.
Is this really a problem? From http://www.postgresql.org/docs/8.3/static/sql-begin.html:
Issuing BEGIN when already inside a transaction block will provoke a warning message. The state of the transaction is not affected.
Well, you essentially lose the start of a transaction in that case. The bigger issue is that a rollback or commit has not happened. The result of the latter transaction is undetermined; it could end up as a rollback or a commit, you just don't know.
comment:11 by , 15 years ago
Replying to cboos:
Replying to mixedpuppy:
It also seems like some extra work is happening in the current db pooling, such as creating a PooledConnection instance for each call to env.get_db_cnx, even if the connection is in the "active" list.
Well, I don't see what the alternative would be; can you elaborate?
One thought was to attach the connection to the request object, but that isn't quite right either. Essentially, somehow keep the pooled object around for the duration of the request, rather than continually releasing/acquiring the object.
However, for us it actually turned out to be a good thing that the pool works this way; it made a migration to pgbouncer easier.
comment:12 by , 15 years ago
Priority: | normal → high |
---|
Replying to Shane Caraveo <shanec@…>:
Replying to cboos:
Replying to mixedpuppy:
With the above outline, two BEGIN statements were executed, but only one rollback.
Is this really a problem? From http://www.postgresql.org/docs/8.3/static/sql-begin.html:
Issuing BEGIN when already inside a transaction block will provoke a warning message. The state of the transaction is not affected.
Well, you essentially lose the start of a transaction in that case.
That's not really important - we're talking about SELECTs that could happen before an INSERT/UPDATE/DELETE, and it doesn't really matter if those selects are within the same transaction or not, as we're at READ COMMITTED isolation level.
The point I tried to make with the above quote was that it doesn't matter if you had two or more BEGIN followed by only one ROLLBACK - as the other BEGINs are ignored, one ROLLBACK is enough.
Aside from that point, I think I understand your analysis a bit better now, in particular comment:7.
As we currently always return a new instance of PooledConnection in self.get_db_cnx(), we have no choice but to do the ROLLBACK when the last instance corresponding to a given real connection is gc'ed. Otherwise we could easily trigger a ROLLBACK during a transaction in progress, as soon as there is more than one call to env.get_db_cnx(). For example, with the following code:
```python
# this is just some non-sensical code to illustrate the point
def do_save(self, req):
    db = self.get_db_cnx()
    cursor = db.cursor()
    cursor.execute("INSERT ...")
    email_map = self._get_all_email_address()
    for user, mail in email_map.items():
        cursor.execute("UPDATE ...")
    db.commit()
```
If _get_all_email_address itself does a self.get_db_cnx(), then it will get another PooledConnection instance, and that instance will be gc'ed upon the method's return, right after the first INSERT, so we would ROLLBACK that INSERT as well.
Now we could certainly improve upon that if env.get_db_cnx() would return the same PooledConnection instance when it is taken from the active list. If _active were a WeakValueDictionary, we could access the PooledConnection instances without preventing them from being gc'ed. If we had that, we could simply do a ROLLBACK in PooledConnection.__del__.
Yet another problem is that we don't do a ROLLBACK on errors, and that should also be possible to do thanks to weakrefs.
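Not an actual patch, just a sketch of the WeakValueDictionary idea being discussed; all names here (ConnectionPool, _active, _return_cnx, the connector callable) are simplified stand-ins for the real trac.db.pool code:

```python
import threading
import weakref

class PooledConnection(object):
    """Thin wrapper returned to callers; rolls back when it goes away."""

    def __init__(self, pool, cnx):
        self._pool = pool
        self.cnx = cnx

    def __del__(self):
        # Last reference for this thread is gone: undo any implicit BEGIN
        # and hand the real connection back to the pool.
        try:
            self.cnx.rollback()
        except Exception:
            pass
        self._pool._return_cnx(self.cnx)


class ConnectionPool(object):
    def __init__(self):
        self._lock = threading.RLock()
        self._active = weakref.WeakValueDictionary()  # tid -> PooledConnection
        self._pool = []                               # idle real connections

    def get_cnx(self, connector):
        tid = threading.get_ident()
        with self._lock:
            wrapper = self._active.get(tid)
            if wrapper is not None:
                return wrapper           # same instance for the whole request
            cnx = self._pool.pop() if self._pool else connector()
            wrapper = PooledConnection(self, cnx)
            self._active[tid] = wrapper  # weak reference: doesn't keep it alive
            return wrapper

    def _return_cnx(self, cnx):
        with self._lock:
            self._pool.append(cnx)
```

Because the entry in _active is only a weak reference, the wrapper is collected as soon as the request code drops it, __del__ performs the rollback, and the raw connection goes back to the idle pool.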
comment:13 by , 15 years ago
That's cool; using weakrefs should deal with cases where people are not being explicit. I would like to see an explicit release as well, since (with the removal of the explicit gc.collect for performance reasons) those objects could take a little while to be collected.
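For the "explicit release" idea, something like a context manager would do it; this is only a sketch of the pattern, not an API Trac actually exposed:

```python
from contextlib import contextmanager

@contextmanager
def borrowed_db(env):
    """Yield a pooled connection and guarantee it is left in a clean,
    rolled-back state when the block exits."""
    db = env.get_db_cnx()
    try:
        yield db
    finally:
        db.rollback()   # harmless if a commit already ended the transaction

# usage sketch:
# with borrowed_db(env) as db:
#     cursor = db.cursor()
#     cursor.execute("SELECT ...")
```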
comment:14 by , 15 years ago
That's a really high-priority issue. Too many idle processes slow down performance. Thanks.
comment:15 by , 15 years ago
```
postgres 8550 0.3 0.3 40308 6864 ? Ss 16:46 0:03 postgres: postgres trac_db 127.0.0.1(39129) idle in transaction
postgres 8566 0.5 0.4 40624 8712 ? Ss 16:46 0:05 postgres: postgres trac_db 127.0.0.1(39131) idle
postgres 8571 0.3 0.3 40332 8288 ? Ss 16:46 0:03 postgres: postgres trac_db 127.0.0.1(39132) idle
postgres 9166 0.7 0.3 40292 7792 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39141) idle
postgres 9178 0.7 0.3 40292 7504 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39142) idle
postgres 9182 0.7 0.2 40296 5812 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39144) idle
postgres 9183 0.7 0.2 40268 5516 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39145) idle
postgres 9194 0.7 0.3 40280 7456 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39146) idle
postgres 9199 0.8 0.3 40288 7304 ? Rs 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39147) idle in transaction
postgres 9201 0.6 0.2 40300 5780 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39148) idle
postgres 9202 0.6 0.2 40276 5752 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39149) idle in transaction
postgres 9203 0.7 0.2 40296 5676 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39150) idle in transaction
postgres 9204 0.7 0.2 40292 5896 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39151) idle
postgres 9211 0.6 0.3 40292 7556 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39153) idle
postgres 9217 0.7 0.2 40292 6108 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39154) idle
postgres 9219 0.7 0.2 40276 5680 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39155) idle
postgres 9220 0.6 0.2 40276 5864 ? Rs 17:00 0:00 postgres: postgres trac_db 127.0.0.1(39156) idle in transaction
postgres 9221 0.7 0.3 40632 6380 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39157) idle
postgres 9222 0.8 0.2 40304 5904 ? Rs 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39158) idle
postgres 9223 0.6 0.2 40292 5740 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39159) idle
postgres 9224 0.7 0.3 40628 6508 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39160) idle
postgres 9225 0.7 0.2 40296 5656 ? Rs 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39161) idle in transaction
postgres 9226 0.6 0.2 40284 5844 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39162) idle
postgres 9227 0.7 0.2 40272 5768 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39163) idle in transaction
postgres 9228 0.7 0.3 40288 7936 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39164) idle
postgres 9230 0.6 0.2 40280 5488 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39165) idle
postgres 9282 0.7 0.2 40276 5736 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39167) idle
postgres 9283 0.6 0.2 40300 5564 ? Rs 17:00 0:00 postgres: postgres trac_db 127.0.0.1(39168) idle in transaction
postgres 9284 0.6 0.2 40276 5568 ? Ss 17:00 0:00 postgres: postgres trac_db 127.0.0.1(39169) idle in transaction
postgres 9285 0.7 0.2 40300 5796 ? Ss 17:00 0:00 postgres: postgres trac_db 127.0.0.1(39170) idle in transaction
postgres 9286 0.7 0.2 40276 5436 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39171) idle in transaction
postgres 9287 0.7 0.2 40300 5540 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39172) idle in transaction
postgres 9288 0.6 0.2 40296 5512 ? Ss 17:00 0:00 postgres: postgres trac_db 127.0.0.1(39173) idle in transaction
postgres 9289 0.7 0.2 40280 5596 ? Ss 17:00 0:00 postgres: postgres trac_db 127.0.0.1(39174) idle
postgres 9290 0.6 0.2 40288 5616 ? Ss 17:00 0:00 postgres: postgres trac_db 127.0.0.1(39175) idle in transaction
postgres 9293 0.7 0.3 40272 7272 ? Ss 17:00 0:01 postgres: postgres trac_db 127.0.0.1(39176) idle
postgres 9294 0.6 0.2 40288 5880 ? Ss 17:00 0:00 postgres: postgres trac_db 127.0.0.1(39177) idle in transaction
postgres 9307 0.6 0.2 40276 5396 ? Ss 17:01 0:00 postgres: postgres trac_db 127.0.0.1(39179) idle
```
comment:16 by , 15 years ago
Cc: | added |
---|
comment:17 by , 15 years ago
Milestone: | 0.11.6 → next-minor-0.12.x |
---|
The problem sort of disappeared on t.e.o since we disabled Bitten …
Postponed.
follow-up: 28 comment:18 by , 15 years ago
Milestone: | next-minor-0.12.x → 0.12 |
---|
follow-up: 20 comment:19 by , 15 years ago
Status: | new → assigned |
---|
In the tests I did, the idle in transaction connections persisted as long as the connection is not rollback'ed, but this rollback should happen as soon as the cnx Connection is returned to the _pool, when the last PooledConnection that wraps it is garbage collected (_return_cnx).
One possible issue there was that the try_rollback was only attempted when the cnx is poolable. This is normally True for PostgreSQL, but I know sometimes people forcefully set it to False.
So the problem can only happen if there's a memory leak and the PooledConnection is kept alive somewhere. One extra safety measure could be to force this rollback to happen at the end of a request, during shutdown(tid).
See attachment:t8443-deactivate-r9234.patch.
All tests pass on the 3 backends, but I'd like to get some additional testing feedback.
by , 15 years ago
Attachment: | t8443-deactivate-r9234.patch added |
---|
db pool: try harder to rollback unused db connections
comment:20 by , 15 years ago
Replying to cboos:
Patch now active on t.e.o, but I'll probably have to re-activate Bitten to really stress-test this ;-) (so far, so good)
comment:21 by , 15 years ago
So I did some fairly unscientific testing against http://pacopablo.com. I ran:
ab -c 10 -n 1000 http://pacopablo.com/wiki/WikiStart
against trunk@9275 and then applied the patch and ran the same command again. My current setup is using 3 tracd instances proxied behind nginx. However, it looks as though only one of those instances was handling all of the traffic. (I guess I'll have to fix my proxying.) In both cases, the number of used connections stayed around 10 for the entire benchmark. Every 2 seconds I counted the number of idle in transaction connections. Below is the distribution throughout the test.
- r9275
```
Number of idle in transaction connections: 0 (2 times)
Number of idle in transaction connections: 1 (1 times)
Number of idle in transaction connections: 2 (1 times)
Number of idle in transaction connections: 3 (4 times)
Number of idle in transaction connections: 4 (8 times)
Number of idle in transaction connections: 5 (22 times)
Number of idle in transaction connections: 6 (42 times)
Number of idle in transaction connections: 7 (60 times)
Number of idle in transaction connections: 8 (48 times)
Number of idle in transaction connections: 9 (57 times)
Number of idle in transaction connections: 10 (22 times)
```
- r9275 w/ patch applied
```
Number of idle in transaction connections: 0 (3 times)
Number of idle in transaction connections: 2 (1 times)
Number of idle in transaction connections: 3 (1 times)
Number of idle in transaction connections: 4 (16 times)
Number of idle in transaction connections: 5 (26 times)
Number of idle in transaction connections: 6 (42 times)
Number of idle in transaction connections: 7 (72 times)
Number of idle in transaction connections: 8 (57 times)
Number of idle in transaction connections: 9 (37 times)
Number of idle in transaction connections: 10 (18 times)
```
The good news is that the number of idle connections never went over 10 (the number of concurrent connections from ab). Additionally, the idle in transaction connections dropped to 0 after the test was done.
However, I can't tell any difference between patch and no patch. The test was against my front page, which hits at least 2 plugins (th:FullBlogPlugin and th:IrcLogsPlugin).
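Not the script actually used above, but a sketch of how such a 2-second count can be collected, by grepping the postgres process list the same way the earlier comments did:

```python
import subprocess
import time
from collections import Counter

def count_idle_in_transaction():
    # The command line of a postgres backend includes "idle in transaction"
    # for exactly the connections this ticket is about.
    out = subprocess.check_output(["ps", "-C", "postgres", "-o", "args="])
    return sum("idle in transaction" in line
               for line in out.decode("utf-8", "replace").splitlines())

distribution = Counter()
end = time.time() + 120            # sample for two minutes
while time.time() < end:
    distribution[count_idle_in_transaction()] += 1
    time.sleep(2)

for n, times in sorted(distribution.items()):
    print("Number of idle in transaction connections: %d (%d times)" % (n, times))
```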
comment:22 by , 15 years ago
Cc: | added |
---|
Ran some more tests. Can't see any difference between pre and post patch. However, I never have more db connections open than the number of concurrent web requests, so it doesn't look like any are being leaked.
comment:23 by , 15 years ago
Thanks for the testing, it's good to see there were apparently no unwanted side effects with the changes.
The fact that you didn't see any differences with or without the patch is expected under normal operating conditions. In effect, I suspect that this idle in transaction issue can only happen in abnormal conditions; which ones exactly is not completely clear, but some candidates are:
- very long SQL queries (Bitten)
- db connections kept alive across web requests (e.g. by storing one in a component, a.k.a my TOC brown paper bag issue, fixed by [TH4366] ;-) )
- possibly exceptions triggered during query, with the db being held alive in the stack frames, and those in turn being kept alive (?) - back to the previous point
So if you can try your experiments again with something like the following diff:
```diff
--- trac/web/main.py
+++ trac/web/main.py
@@ -465,6 +465,7 @@
     req = Request(environ, start_response)
     translation.make_activable(lambda: req.locale, env and env.path or None)
     try:
+        env._db = env.get_db_cnx()  # Misbehavior #1
         return _dispatch_request(req, env, env_error)
     finally:
         translation.deactivate()
```
(or any other creative way to trigger errors ;-) ), you should see:
- idle in transaction connections accumulating without the patch
- "proper behavior" with the patch
comment:24 by , 15 years ago
Well, I tried to get the code to misbehave, but after testing with your patch above I have failed to get r9275 to show any difference with or without the #8443 patch.
However, throughout all my tests, I have seen stable behavior. I think that this is a good sign.
+1 for committing the patch
comment:25 by , 15 years ago
So yesterday we had another interesting situation here on t.e.o:
- one trac.fcgi process was at 100%
- 12 idle in transaction postgres processes
Looking at what trac.fcgi did, it was just busy accepting new connections (137 threads), all trying to get a db connection when none was available anymore. The gdb backtraces for all these threads didn't show anything interesting except that no other 3rd party bug was involved, like the svn/apr issue of #7785.
See teo-debug.log for the strace and gdb log.
That day I did a lot of restarts of the lighttpd server; maybe something can go wrong when a trac.fcgi process is abruptly terminated (the corresponding postgres process hanging?). OTOH, to "fix" the server I killed -TERM that 17705 process, restarted lighty once more, and after a while all idle in transaction processes disappeared.
So apparently the patch doesn't do any good for the base problem, and maybe the additional notify() I introduced only contributes to worsening the problem, with too many thread wake-ups…
by , 15 years ago
Attachment: | teo-debug.log added |
---|
strace + backtraces of a trac.fcgi process that was continuously ≥ 100%, while holding alive ≥ 10 idle in transaction postgres processes
comment:26 by , 15 years ago
Milestone: | 0.12 → next-minor-0.12.x |
---|
Ok, the results with the patch were mixed; the bug is still here and we still don't have a clue about what's going on, so I'll postpone this.
comment:27 by , 15 years ago
We had a similar problem with a Java based application, where too many connections were handed out to the same thread requesting those connections.
What finally settled it was to cache the last connection provided to the calling thread, and, if that connection prematurely closed, create a new connection that then in turn would be cached for future access by the same thread.
Additionally, I do not see how a simple SELECT statement would leave a connection in an IDLE IN TRANSACTION state, as a stand-alone select operation IMHO does not actually require a transaction context, since it is read only.
Perhaps there are some other write access statements like for example UPDATE or INSERT that will start a transaction and never commit, and thus occupy the connection?
I don't actually know, but does trac do some kind of access logging in the database or similar such things, where it actually inserts or updates a tuple and never commits?
Besides that, all model write operations seem to use @with_transaction as of 0.12dev, so that should not be the cause of this problem.
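On the point that a stand-alone SELECT should not need a transaction: with psycopg2 that is exactly what autocommit mode gives you. A hedged sketch (placeholder DSN; this is not how Trac wires up its connections):

```python
import psycopg2
import psycopg2.extensions as ext

conn = psycopg2.connect("dbname=test")            # placeholder DSN
conn.set_isolation_level(ext.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("SELECT 42")                          # no implicit BEGIN is sent
print(cur.fetchone())

# With autocommit there is no open transaction to leave behind, so the
# backend shows up as plain "idle" instead of "idle in transaction".
assert conn.get_transaction_status() == ext.TRANSACTION_STATUS_IDLE
```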
follow-ups: 29 33 comment:28 by , 15 years ago
Milestone: | next-minor-0.12.x → 0.12 |
---|
Replying to cboos:
#8602 was closed as a duplicate, as it's pretty clear now that the idle in transaction problem can lead to a blocked server showing the Unable to get database connection within 20 seconds error.
Actually I now think that they're both a symptom of a same root cause, which must be some kind of deadlock, as the strace logs taken at this occasion always show some kind of fierce "futex" battle…
We need to be more careful about cross-acquisition of the different locks we have, including those we don't "see". There's one in source:trunk/trac/web/_fcgi.py, there's the logging lock, the Genshi template loader lock, the localedata _cache_lock in Babel, possibly others (psycopg2 uses a pthread_mutex_t directly).
See tmp/db-timeout-incident-2010-03-18.txt on ftp.edgewall.org (no direct link, to not load the site too much; the file has a size of 20M).
That strace clearly shows contention on the 0x30b2510 and 0x2e0cf20 futexes. Now we just need to find out to which lock they correspond ;-)
From db-timeout-incident-2010-03-18-except-0x30b2510-and-0x2e0cf20-no-resumed.txt, the other futexes we see are at thread startup (near set_robust_list), which we can ignore, and a few occurrences of 0x30c4c80, which seems to be the logging IO lock.
```
[pid 6578] 10:25:00.804434 futex(0x30c4c80, 0x80 /* FUTEX_??? */, 0 <unfinished ...>
[pid 6551] 10:25:00.804531 write(7, "[pid 1802332496] 2010-03-18 10:"..., 175 <unfinished ...>
[pid 6551] 10:25:00.804798 futex(0x30c4c80, 0x81 /* FUTEX_??? */, 1 <unfinished ...>
```
From tmp/db-timeout-incident-2010-03-18.txt, we see things like:
```
[pid 6540] 10:24:56.227063 sendto(21, "\1\6\21\362\1K\5\0", 8, 0, NULL, 0) = 8
[pid 6540] 10:24:56.227477 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0
[pid 6540] 10:24:56.227547 sendto(21, "Status: 500 Internal Server Erro"..., 331, 0, NULL, 0) = 331
[pid 6540] 10:24:56.227786 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0
[pid 6540] 10:24:56.227849 sendto(21, "\0\0\0\0\0", 5, 0, NULL, 0) = 5
[pid 6540] 10:24:56.228086 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0
[pid 6540] 10:24:56.228234 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0
[pid 6540] 10:24:56.228313 sendto(21, "\1\6\21\362\37\370\0\0", 8, 0, NULL, 0) = 8
[pid 6540] 10:24:56.228609 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0
[pid 6540] 10:24:56.228670 futex(0x30b2510, 0x81 /* FUTEX_??? */, 1) = 0
```
which suggests that 0x30b2510 corresponds to the lock in _fcgi.py's MultiplexedConnection.
From log excerpts like:
```
[pid 6596] 10:24:58.006529 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 6596] 10:24:58.006583 poll([{fd=35, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
[pid 6596] 10:24:58.006637 recvfrom(35, "C\0\0\0\rROLLBACK\0Z\0\0\0\5I", 1048576, 0, NULL, NULL) = 20
[pid 6596] 10:24:58.006731 gettimeofday({1268904298, 6748}, NULL) = 0
[pid 6596] 10:24:58.006809 futex(0x2e0cf20, 0x81 /* FUTEX_??? */, 1) = 1
```
it seems that 0x2e0cf20 is the psycopg2 mutex.
Now we just have to find a way to avoid deadlocks between the two, maybe the "easy" fix would be to disable MultiplexedConnection. I'll try that next.
comment:29 by , 15 years ago
… disable MultiplexedConnection
Sigh, MultiplexedConnection was not used, 0x30b2510 must be another lock, ah, it must be the GIL of course ;-)
So does that boil down to some design flaw in psycopg2 which uses its own mutex, hence the risk to deadlock with the GIL?
follow-up: 31 comment:30 by , 15 years ago
I've found a possibly related discussion on the psycopg2 mailing list, including a proposed patch, but it has not been integrated in the psycopg2 repository yet. It would be interesting to try it.
Also, they talk about a deadlock, and while I also said "deadlock" in the above, what we're actually seeing is not a "dead" lock, but rather a livelock, leading to 100% CPU utilization.
follow-up: 32 comment:31 by , 15 years ago
Replying to cboos:
I've found a possibly related discussion on the psycopg2 mailing list, including a proposed patch, but it has not been integrated in the psycopg2 repository yet. It would be interesting to try it.
The test program given in the same mailing list thread works fine on t.e.o, so it's likely another problem…
comment:32 by , 15 years ago
Replying to cboos:
The test program given in the same mailing list thread works fine on t.e.o, so it's likely another problem…
The thread also mentions that, depending on the versions of the components, the issue takes longer to show up. So you might want to leave the program running for a few hours.
comment:33 by , 15 years ago
There was probably a flaw in my previous analysis; otherwise, where would the futex corresponding to the trac.db.pool lock be?
Also, this morning everything was fine on t.e.o, but I was nevertheless curious enough to look for idle in transaction processes, and indeed there were some!
```
postgres  2655  4178  0 Mar15 ?        00:00:33 postgres: www-data trac [local] idle in transaction
postgres  2656  4178  0 Mar15 ?        00:00:11 postgres: www-data trac [local] idle in transaction
postgres  2657  4178  0 Mar15 ?        00:00:38 postgres: www-data trac [local] idle in transaction
postgres  2659  4178  0 Mar15 ?        00:00:16 postgres: www-data trac [local] idle in transaction
postgres  2671  4178  0 Mar15 ?        00:00:19 postgres: www-data trac [local] idle in transaction
postgres  2675  4178  0 Mar15 ?        00:00:29 postgres: www-data trac [local] idle in transaction
postgres  2676  4178  0 Mar15 ?        00:00:17 postgres: www-data trac [local] idle in transaction
postgres  3951  4178  0 Mar15 ?        00:00:06 postgres: www-data trac [local] idle in transaction
root     12949 12334  0 09:48 pts/0    00:00:00 grep transaction
root@lynx:~# strace -f -tt -p 2655
Process 2655 attached - interrupt to quit
09:48:52.136709 recvfrom(8, ^C
root@lynx:~# strace -f -tt -p 2656
Process 2656 attached - interrupt to quit
09:49:19.471894 recvfrom(8, ^C
...
```
(same for the others)
That postgres process is simply waiting for input:
```
root@lynx:~# gdb -p 2655
...
(gdb) bt
#0  0x00007fc557788aa5 in recv () from /lib/libc.so.6
#1  0x00007fc5598bfbe6 in secure_read () from /usr/lib/postgresql/8.3/bin/postgres
#2  0x00007fc5598c6e34 in ?? () from /usr/lib/postgresql/8.3/bin/postgres
#3  0x00007fc5598c72cb in pq_getbyte () from /usr/lib/postgresql/8.3/bin/postgres
#4  0x00007fc559942685 in PostgresMain () from /usr/lib/postgresql/8.3/bin/postgres
#5  0x00007fc559914d80 in ?? () from /usr/lib/postgresql/8.3/bin/postgres
#6  0x00007fc559915c6a in PostmasterMain () from /usr/lib/postgresql/8.3/bin/postgres
#7  0x00007fc5598c8d4e in main () from /usr/lib/postgresql/8.3/bin/postgres
```
fd 8 is a socket:
lrwx------ 1 postgres postgres 64 Mar 22 11:18 8 -> socket:[171082172]
And there's indeed a trac.fcgi process (2650), on the other end:
lrwx------ 1 www-data www-data 64 Mar 22 11:19 8 -> socket:[171082171]
Also interesting is that there are a bunch of sockets used by that process:
root@lynx:/proc/2650# ll fd total 0 lrwx------ 1 www-data www-data 64 Mar 22 11:19 0 -> socket:[171082152] l-wx------ 1 www-data www-data 64 Mar 22 11:19 1 -> /var/log/lighttpd/error.log lrwx------ 1 www-data www-data 64 Mar 22 11:19 10 -> socket:[171082181] lrwx------ 1 www-data www-data 64 Mar 22 11:19 100 -> socket:[171103696] lrwx------ 1 www-data www-data 64 Mar 22 11:19 101 -> socket:[171114602] lrwx------ 1 www-data www-data 64 Mar 22 11:19 102 -> socket:[171103698] lrwx------ 1 www-data www-data 64 Mar 22 11:19 103 -> socket:[171112888] lrwx------ 1 www-data www-data 64 Mar 22 11:19 104 -> socket:[171114603] lrwx------ 1 www-data www-data 64 Mar 22 11:19 107 -> socket:[171103706] lrwx------ 1 www-data www-data 64 Mar 22 11:19 109 -> socket:[171103708] lrwx------ 1 www-data www-data 64 Mar 22 11:19 11 -> socket:[171114658] lrwx------ 1 www-data www-data 64 Mar 22 11:19 110 -> socket:[171103709] lrwx------ 1 www-data www-data 64 Mar 22 11:19 12 -> socket:[171082175] lrwx------ 1 www-data www-data 64 Mar 22 11:19 125 -> socket:[171103724] lrwx------ 1 www-data www-data 64 Mar 22 11:19 13 -> socket:[171104647] lrwx------ 1 www-data www-data 64 Mar 22 11:19 14 -> socket:[171114559] lrwx------ 1 www-data www-data 64 Mar 22 11:19 15 -> socket:[171103745] lrwx------ 1 www-data www-data 64 Mar 22 11:19 16 -> socket:[171104057] lrwx------ 1 www-data www-data 64 Mar 22 11:19 17 -> socket:[171082249] lrwx------ 1 www-data www-data 64 Mar 22 11:19 18 -> socket:[171104126] lrwx------ 1 www-data www-data 64 Mar 22 11:19 19 -> socket:[171082262] l-wx------ 1 www-data www-data 64 Mar 22 11:19 2 -> /var/log/lighttpd/error.log lrwx------ 1 www-data www-data 64 Mar 22 11:19 20 -> socket:[171082270] lrwx------ 1 www-data www-data 64 Mar 22 11:19 21 -> socket:[171112463] lrwx------ 1 www-data www-data 64 Mar 22 11:19 22 -> socket:[171082273] lrwx------ 1 www-data www-data 64 Mar 22 11:19 23 -> socket:[171104816] lrwx------ 1 www-data www-data 64 Mar 22 11:19 24 -> socket:[171104666] lrwx------ 1 www-data www-data 64 Mar 22 11:19 25 -> socket:[171112472] lrwx------ 1 www-data www-data 64 Mar 22 11:19 26 -> socket:[171112791] lrwx------ 1 www-data www-data 64 Mar 22 11:19 27 -> socket:[171088246] lrwx------ 1 www-data www-data 64 Mar 22 11:19 28 -> socket:[171105148] lrwx------ 1 www-data www-data 64 Mar 22 11:19 29 -> socket:[171104070] lrwx------ 1 www-data www-data 64 Mar 22 11:19 3 -> socket:[171082152] lrwx------ 1 www-data www-data 64 Mar 22 11:19 30 -> socket:[171103809] lrwx------ 1 www-data www-data 64 Mar 22 11:19 31 -> socket:[171114886] lrwx------ 1 www-data www-data 64 Mar 22 11:19 32 -> socket:[171114600] lrwx------ 1 www-data www-data 64 Mar 22 11:19 33 -> socket:[171114887] lrwx------ 1 www-data www-data 64 Mar 22 11:19 34 -> socket:[171112792] lrwx------ 1 www-data www-data 64 Mar 22 11:19 35 -> socket:[171104074] lrwx------ 1 www-data www-data 64 Mar 22 11:19 36 -> socket:[171114888] lrwx------ 1 www-data www-data 64 Mar 22 11:19 37 -> socket:[171103752] lrwx------ 1 www-data www-data 64 Mar 22 11:19 38 -> socket:[171104590] lrwx------ 1 www-data www-data 64 Mar 22 11:19 39 -> socket:[171105082] lrwx------ 1 www-data www-data 64 Mar 22 11:19 4 -> socket:[171105003] lrwx------ 1 www-data www-data 64 Mar 22 11:19 40 -> socket:[171104817] lrwx------ 1 www-data www-data 64 Mar 22 11:19 41 -> socket:[171112868] lrwx------ 1 www-data www-data 64 Mar 22 11:19 42 -> socket:[171101938] lrwx------ 1 www-data www-data 64 Mar 22 11:19 43 -> socket:[171104177] lrwx------ 1 www-data www-data 64 Mar 22 
11:19 44 -> socket:[171103861] lrwx------ 1 www-data www-data 64 Mar 22 11:19 45 -> socket:[171103899] lrwx------ 1 www-data www-data 64 Mar 22 11:19 46 -> socket:[171104679] lrwx------ 1 www-data www-data 64 Mar 22 11:19 47 -> socket:[171103905] lrwx------ 1 www-data www-data 64 Mar 22 11:19 48 -> socket:[171107995] lrwx------ 1 www-data www-data 64 Mar 22 11:19 49 -> socket:[171105223] l-wx------ 1 www-data www-data 64 Mar 22 11:19 5 -> /var/trac/trac/log/trac.log lrwx------ 1 www-data www-data 64 Mar 22 11:19 50 -> socket:[171112793] lrwx------ 1 www-data www-data 64 Mar 22 11:19 51 -> socket:[171114560] lrwx------ 1 www-data www-data 64 Mar 22 11:19 52 -> socket:[171105233] lrwx------ 1 www-data www-data 64 Mar 22 11:19 53 -> socket:[171104836] lrwx------ 1 www-data www-data 64 Mar 22 11:19 54 -> socket:[171102016] lrwx------ 1 www-data www-data 64 Mar 22 11:19 55 -> socket:[171104981] lrwx------ 1 www-data www-data 64 Mar 22 11:19 57 -> socket:[171107996] lrwx------ 1 www-data www-data 64 Mar 22 11:19 58 -> socket:[171104710] lrwx------ 1 www-data www-data 64 Mar 22 11:19 59 -> socket:[171107997] lrwx------ 1 www-data www-data 64 Mar 22 11:19 6 -> socket:[171104297] lrwx------ 1 www-data www-data 64 Mar 22 11:19 61 -> socket:[171112794] lrwx------ 1 www-data www-data 64 Mar 22 11:19 62 -> socket:[171104996] lrwx------ 1 www-data www-data 64 Mar 22 11:19 63 -> socket:[171104911] lrwx------ 1 www-data www-data 64 Mar 22 11:19 64 -> socket:[171104730] lrwx------ 1 www-data www-data 64 Mar 22 11:19 65 -> socket:[171104851] lrwx------ 1 www-data www-data 64 Mar 22 11:19 67 -> socket:[171104926] lrwx------ 1 www-data www-data 64 Mar 22 11:19 68 -> socket:[171103965] lrwx------ 1 www-data www-data 64 Mar 22 11:19 69 -> socket:[171105035] lrwx------ 1 www-data www-data 64 Mar 22 11:19 7 -> socket:[171082275] lrwx------ 1 www-data www-data 64 Mar 22 11:19 70 -> socket:[171112795] lrwx------ 1 www-data www-data 64 Mar 22 11:19 71 -> socket:[171114601] lrwx------ 1 www-data www-data 64 Mar 22 11:19 73 -> socket:[171102624] lrwx------ 1 www-data www-data 64 Mar 22 11:19 75 -> socket:[171104960] lrwx------ 1 www-data www-data 64 Mar 22 11:19 76 -> socket:[171112796] lrwx------ 1 www-data www-data 64 Mar 22 11:19 77 -> socket:[171112797] lrwx------ 1 www-data www-data 64 Mar 22 11:19 78 -> socket:[171102643] lrwx------ 1 www-data www-data 64 Mar 22 11:19 79 -> socket:[171104930] lrwx------ 1 www-data www-data 64 Mar 22 11:19 8 -> socket:[171082171] lrwx------ 1 www-data www-data 64 Mar 22 11:19 80 -> socket:[171112827] lrwx------ 1 www-data www-data 64 Mar 22 11:19 81 -> socket:[171102771] lrwx------ 1 www-data www-data 64 Mar 22 11:19 82 -> socket:[171103677] lrwx------ 1 www-data www-data 64 Mar 22 11:19 83 -> socket:[171103678] lrwx------ 1 www-data www-data 64 Mar 22 11:19 84 -> socket:[171104937] lrwx------ 1 www-data www-data 64 Mar 22 11:19 86 -> socket:[171103681] lrwx------ 1 www-data www-data 64 Mar 22 11:19 87 -> socket:[171104392] lrwx------ 1 www-data www-data 64 Mar 22 11:19 88 -> socket:[171105016] lrwx------ 1 www-data www-data 64 Mar 22 11:19 9 -> socket:[171082173] lrwx------ 1 www-data www-data 64 Mar 22 11:19 90 -> socket:[171105011] lrwx------ 1 www-data www-data 64 Mar 22 11:19 91 -> socket:[171105043] lrwx------ 1 www-data www-data 64 Mar 22 11:19 92 -> socket:[171103688] lrwx------ 1 www-data www-data 64 Mar 22 11:19 94 -> socket:[171112869] lrwx------ 1 www-data www-data 64 Mar 22 11:19 95 -> socket:[171112870] lrwx------ 1 www-data www-data 64 Mar 22 11:19 96 -> 
socket:[171112871] lrwx------ 1 www-data www-data 64 Mar 22 11:19 97 -> socket:[171112873] lrwx------ 1 www-data www-data 64 Mar 22 11:19 98 -> socket:[171103694] lrwx------ 1 www-data www-data 64 Mar 22 11:19 99 -> socket:[171103695]
Unfortunately, that process seems to be untraceable. It was actually stopped (State T):
0 T www-data 2650 1 0 80 0 - 289822 signal Mar15 ? 00:20:23 /usr/local/virtualenv/trunk/bin/python /usr/local/virtualenv/trunk/share/cgi-bin/trac.fcgi
and after a manual CONT, it switched to (State: Z (zombie)) and then exited. All the idle in transaction processes went away as well, so I suppose that among the many sockets above were all the other ends of the postgres sockets involved.
Looks like "something" could have stopped the trac.fcgi processes in the middle of transactions, leaving the corresponding postgres processes hanging. Maybe at this point when the load gets higher, new trac.fcgi processes are unable to get any new connections?
Also at the same time, there was another trac.fcgi process that was subject to the #7785 deadlock, maybe this was related:
```
root@lynx:~# gdb -p 13309
...
(gdb) info threads
  5 Thread 0x43cc6950 (LWP 20364)  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
  4 Thread 0x434c5950 (LWP 20365)  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
  3 Thread 0x44cc8950 (LWP 20366)  0x00007f6b168cb635 in ?? () from /usr/lib/libapr-1.so.0
  2 Thread 0x454c9950 (LWP 24553)  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
  1 Thread 0x7f6b223776e0 (LWP 13309)  0x00007f6b21f54b04 in __lll_lock_wait () from /lib/libpthread.so.0
(gdb) thread apply all bt

Thread 5 (Thread 0x43cc6950 (LWP 20364)):
#0  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
#1  0x00000000004b9ea8 in PyThread_acquire_lock ()
...

Thread 4 (Thread 0x434c5950 (LWP 20365)):
#0  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
#1  0x00000000004b9ea8 in PyThread_acquire_lock ()
...

Thread 3 (Thread 0x44cc8950 (LWP 20366)):
#0  0x00007f6b168cb635 in ?? () from /usr/lib/libapr-1.so.0
#1  0x00007f6b168cbd36 in apr_pool_destroy () from /usr/lib/libapr-1.so.0
#2  0x00007f6b15f133c4 in ?? () from /var/lib/python-support/python2.5/libsvn/_core.so
#3  0x0000000000491052 in PyEval_EvalFrameEx ()
...

Thread 2 (Thread 0x454c9950 (LWP 24553)):
#0  0x00007f6b21f542f1 in sem_wait () from /lib/libpthread.so.0
#1  0x00000000004b9ea8 in PyThread_acquire_lock ()
#2  0x00000000004be2e0 in ?? ()
#...

Thread 1 (Thread 0x7f6b223776e0 (LWP 13309)):
#0  0x00007f6b21f54b04 in __lll_lock_wait () from /lib/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#1  0x00007f6b21f501a0 in _L_lock_102 () from /lib/libpthread.so.0
#2  0x00007f6b21f4faae in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x00007f6b168c9b22 in apr_thread_mutex_lock () from /usr/lib/libapr-1.so.0
#4  0x00007f6b168cbca6 in apr_pool_destroy () from /usr/lib/libapr-1.so.0
#5  0x00007f6b168cbc4f in apr_pool_destroy () from /usr/lib/libapr-1.so.0
#6  0x00007f6b168cb874 in apr_pool_terminate () from /usr/lib/libapr-1.so.0
#7  0x00007f6b168cd13c in apr_terminate () from /usr/lib/libapr-1.so.0
#8  0x00007f6b2157f6ad in exit () from /lib/libc.so.6
#9  0x00007f6b2156746d in __libc_start_main () from /lib/libc.so.6
#10 0x0000000000413ca9 in _start ()
#0  0x00007f6b21f54b04 in __lll_lock_wait () from /lib/libpthread.so.0
```
Besides 2650 and 13309, there is a 3rd process from the 15th of March, still "active" somehow:
```
root@lynx:/proc# strace -tt -p 9596
Process 9596 attached - interrupt to quit
13:22:31.888620 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
13:22:31.889229 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
13:22:31.913413 --- SIGTERM (Terminated) @ 0 (0) ---
13:22:31.913605 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
13:22:31.913650 --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
13:22:32.029193 rt_sigaction(SIGTERM, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, 8) = 0
13:22:32.029400 rt_sigreturn(0xf) = -1 EINTR (Interrupted system call)
13:22:33.346454 rt_sigaction(SIGHUP, {SIG_IGN}, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, 8) = 0
13:22:33.361953 rt_sigaction(SIGINT, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, 8) = 0
13:22:33.362052 rt_sigaction(SIGTERM, {SIG_DFL}, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, 8) = 0
13:22:33.501645 close(3) = 0
13:22:38.376117 close(6) = 0
13:22:38.376241 munmap(0x7f68740b7000, 4096) = 0
13:22:38.665211 rt_sigaction(SIGINT, {SIG_DFL}, {0x4c00a0, [], SA_RESTORER, 0x7f6873c900f0}, 8) = 0
^C
root@lynx:/proc# ll /proc/9596/fd
total 0
lrwx------ 1 www-data www-data 64 Mar 22 13:24 0 -> socket:[171116709]
l-wx------ 1 www-data www-data 64 Mar 22 13:24 1 -> /var/log/lighttpd/error.log
lrwx------ 1 www-data www-data 64 Mar 22 13:24 10 -> socket:[171116783]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 12 -> socket:[171116787]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 14 -> socket:[171121772]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 15 -> socket:[171121885]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 19 -> socket:[171122207]
l-wx------ 1 www-data www-data 64 Mar 22 13:24 2 -> /var/log/lighttpd/error.log
lrwx------ 1 www-data www-data 64 Mar 22 13:24 22 -> socket:[171137403]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 26 -> socket:[171137406]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 4 -> socket:[171130805]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 7 -> socket:[171116774]
lrwx------ 1 www-data www-data 64 Mar 22 13:24 8 -> socket:[171116779]
(gdb) bt
#0  0x00007f68732febab in ?? () from /lib/libc.so.6
#1  0x00007f68732ff0a6 in free () from /lib/libc.so.6
#2  0x0000000000446d62 in ?? ()
#3  0x000000000045f47b in ?? ()
#4  0x0000000000446d23 in ?? ()
#5  0x00000000004457e7 in ?? ()
#6  0x0000000000447e8a in PyDict_SetItem ()
#7  0x0000000000449693 in _PyModule_Clear ()
#8  0x00000000004a6074 in PyImport_Cleanup ()
#9  0x00000000004b1f0b in Py_Finalize ()
#10 0x000000000041431b in Py_Main ()
#11 0x00007f68732a1466 in __libc_start_main () from /lib/libc.so.6
#12 0x0000000000413ca9 in _start ()
(gdb) info threads
  1 Thread 0x7f68740b16e0 (LWP 9596)  0x00007f68732febab in ?? () from /lib/libc.so.6
```
A sort of failed exit attempt ;-)
Hopefully one day all of this will make sense…
comment:34 by , 15 years ago
Happened again the other day, nearly the same scenario. Coincidentally or not, it seems to have started when I was testing #9171…
There were two trac.fcgi processes involved. While the cause of the livelock is still not clear, what happened afterwards is now clearer: lighttpd simply detects that requests are no longer processed and signals (STOP) the trac.fcgi process. Too bad, as this leaves the "IDLE in transaction" postgres processes alive, as seen above.
/var/log/lighttpd/error.log excerpt:
```
...
2010-03-25 13:24:20: (server.c.1258) NOTE: a request for /browser/plugins/0.10 timed out after writing 12960 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle
2010-03-25 13:28:46: (server.c.1258) NOTE: a request for /wiki/TracLinks timed out after writing 22240 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle
2010-03-25 13:33:20: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 259
2010-03-25 13:33:23: (mod_fastcgi.c.2673) fcgi-server re-enabled: 0 /tmp/trac-trac-fastcgi.sock
2010-03-25 13:33:24: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 265
2010-03-25 13:33:27: (mod_fastcgi.c.2673) fcgi-server re-enabled: 0 /tmp/trac-trac-fastcgi.sock
2010-03-25 13:33:28: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 269
2010-03-25 13:33:31: (mod_fastcgi.c.2673) fcgi-server re-enabled: 0 /tmp/trac-trac-fastcgi.sock
...
```
The re-enable doesn't seem to have worked, though.
As today it happened again, I decided to switch off Babel for some time, just to verify if this makes any difference.
comment:35 by , 15 years ago
Though t.e.o seems more responsive these days, there were still Unable to connect to db … errors in the log, so I'm reactivating Babel.
comment:37 by , 15 years ago
More random notes.
I've seen this in the trac.log. Is it related?
```
2010-05-11 13:51:37: (server.c.1475) server stopped by UID = 0 PID = 27072
Unhandled exception in thread started by
Error in sys.excepthook:
Original exception was:
Traceback (most recent call last):
  File "/usr/lib/python2.5/logging/__init__.py", line 750, in emit
    self.stream.write(fs % msg)
ValueError: I/O operation on closed file
Unhandled exception in thread started by
Error in sys.excepthook:
Original exception was:
Unhandled exception in thread started by
Error in sys.excepthook:
...
```
(At least at 14:02 the load was quite high, around 16; the situation improved by itself a few minutes later.)
Just found a duplicate of this error: #8955, itself linking to #6580. Now that one could be interesting, notably the endless exception loop part…
But what happened here? Was trac.log closed by logrotate? The problem was already there before I added logrotate on trac.log, and looking at the trac.log file, it doesn't look like it has been recreated recently (its content dates back to this morning 06:36).
Instead, I see:
```
[pid 1113229648] 2010-05-11 13:53:17,502 Trac[env] INFO: -------------------------------- environment startup [Trac 0.12dev-r9664] --------------------------------
[pid 1097636176] 2010-05-11 13:53:17,514 Trac[env] INFO: -------------------------------- environment startup [Trac 0.12dev-r9664] --------------------------------
```
After that, there was no Unable to get database connection within 20 seconds error. It should be noted that quite regularly (not every day though) we get that error a few minutes after startup; it resolves itself after a few minutes.
Other thoughts: what exactly happens when we shut down once a modification to trac.ini is detected? We do close all currently opened connections (r7208), but what happens if a request has not finished and wants to get another connection from the pool manager we have shut down? So far I thought this was not problematic, but this deserves a closer look…
Finally, if the problem is really nothing more than bad behavior in high-contention scenarios (hundreds of threads), we could try a more controlled approach for selecting which thread gets the next available connection.
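A sketch of what such a "more controlled" hand-out could look like: a FIFO of waiting threads, each woken individually instead of a broadcast notify. This is an illustration only, not the trac.db.pool implementation:

```python
import threading
from collections import deque

class FairConnectionPool(object):
    """Hand out connections to waiting threads in strict FIFO order."""

    def __init__(self, connections):
        self._lock = threading.Lock()
        self._idle = list(connections)   # available raw connections
        self._waiters = deque()          # one Event per blocked thread

    def acquire(self, timeout=20.0):
        me = threading.Event()
        with self._lock:
            if self._idle and not self._waiters:
                return self._idle.pop()
            self._waiters.append(me)
        if me.wait(timeout):
            return me.cnx                # attached by release()
        with self._lock:
            if me.is_set():              # lost the race: a connection arrived anyway
                return me.cnx
            self._waiters.remove(me)
        raise RuntimeError("unable to get database connection within %s seconds"
                           % timeout)

    def release(self, cnx):
        with self._lock:
            if self._waiters:
                waiter = self._waiters.popleft()
                waiter.cnx = cnx         # hand over directly to the oldest waiter
                waiter.set()             # wake exactly one thread
            else:
                self._idle.append(cnx)
```

The point of the design is that release() wakes exactly one waiter and passes the connection to it directly, so a stampede of hundreds of threads never competes for the same notification.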
comment:38 by , 15 years ago
Milestone: | 0.12.1 |
---|---|
Resolution: | → duplicate |
Status: | assigned → closed |
Funny, just when I was going to comment again on this issue, it happened again! The typical situation: a bunch of postgres processes stuck in idle in transaction, Trac being unresponsive with all the queries ending in a Timeout error, and after a while a recovery, the server becoming fast again and the idle in transaction processes gone.
The only thing a bit new is the following backtrace, showing a nice case of re-entrancy in get_cnx:
File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 513, in _dispatch_request File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 235, in dispatch File "build/bdist.linux-x86_64/egg/trac/timeline/web_ui.py", line 89, in process_request File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__ File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 291, in _get_perm File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__ File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 298, in _get_session File "build/bdist.linux-x86_64/egg/trac/web/session.py", line 156, in __init__ File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__ File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 157, in authenticate File "build/bdist.linux-x86_64/egg/trac/web/auth.py", line 83, in authenticate File "build/bdist.linux-x86_64/egg/trac/web/auth.py", line 205, in _get_name_for_cookie File "build/bdist.linux-x86_64/egg/trac/env.py", line 335, in get_db_cnx File "build/bdist.linux-x86_64/egg/trac/db/api.py", line 90, in get_read_db File "build/bdist.linux-x86_64/egg/trac/db/api.py", line 152, in get_connection File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 205, in get_cnx File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 153, in get_cnx File "build/bdist.linux-x86_64/egg/trac/util/translation.py", line 186, in gettext File "build/bdist.linux-x86_64/egg/trac/util/translation.py", line 175, in isactive File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 479, in <lambda> File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__ File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 309, in _get_locale File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__ File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 298, in _get_session File "build/bdist.linux-x86_64/egg/trac/web/session.py", line 156, in __init__ File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 212, in __getattr__ File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 157, in authenticate File "build/bdist.linux-x86_64/egg/trac/web/auth.py", line 83, in authenticate File "build/bdist.linux-x86_64/egg/trac/web/auth.py", line 205, in _get_name_for_cookie File "build/bdist.linux-x86_64/egg/trac/env.py", line 335, in get_db_cnx File "build/bdist.linux-x86_64/egg/trac/db/api.py", line 90, in get_read_db File "build/bdist.linux-x86_64/egg/trac/db/api.py", line 152, in get_connection File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 205, in get_cnx File "build/bdist.linux-x86_64/egg/trac/db/pool.py", line 153, in get_cnx
A bit annoying, but nothing critical as we use an RLock.
But I digress. The point I wanted to make is that this "idle in transaction" issue is really nothing more than a consequence of the bad behavior of our pool under certain circumstances, and that by improving the connection pool and solving the Timeout issue, this side effect will go away. See #9111, which I'll reopen shortly.
follow-ups: 40 41 comment:39 by , 14 years ago
Resolution: | duplicate |
---|---|
Status: | closed → reopened |
Hm, I don't know, but closing this as duplicate when the actual issue is that
But I digress, the point I wanted to make is that this "idle in transaction" issue is really nothing more than a consequence of the bad behavior of our pool under certain circumstances and that by improving the connection pool, solving the Timeout issue, this side-effect will go away. See #9111, which I'll reopen shortly.
is rather unproductive, don't you think?
I would call this a major issue, one that might even block future releases until it is fixed. As for #9111, it is still closed, so we need to reopen this one.
As for the db connection pooling issue, see comment:27. I still believe that there are some queries that are not part of a transaction that is actively closed; otherwise the transactions would not remain idle.
comment:40 by , 14 years ago
Replying to Carsten Klein <carsten.klein@…>:
A possible solution might be to implement the pool so that it actively commits any open transactions that have not been closed before the eviction time of a connection still in the idle in transaction state.
Of course, any commit failures should be logged, if a log is available, and should not cause the application to fail.
This might require a thread that continuously checks open connections and their overall lifetime; see for example the Apache db connection pools.
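A rough sketch of that idea, a background "reaper" thread that cleans up connections stuck in a transaction for too long. The names and eviction policy are invented for illustration, and rollback is used instead of the commit suggested above, since blindly committing half-finished work seems riskier:

```python
import logging
import threading
import time

import psycopg2.extensions as ext

log = logging.getLogger(__name__)

class IdleTransactionReaper(threading.Thread):
    """Periodically roll back pooled connections that have been sitting
    'idle in transaction' for longer than `max_idle` seconds."""

    def __init__(self, pool, max_idle=60, interval=10):
        threading.Thread.__init__(self)
        self.daemon = True
        self.pool = pool            # assumed to be a dict {cnx: last_used_timestamp}
        self.max_idle = max_idle
        self.interval = interval

    def run(self):
        while True:
            now = time.time()
            for cnx, last_used in list(self.pool.items()):
                in_txn = cnx.get_transaction_status() in (
                    ext.TRANSACTION_STATUS_INTRANS,
                    ext.TRANSACTION_STATUS_INERROR)
                if in_txn and now - last_used > self.max_idle:
                    log.warning("rolling back connection idle in transaction "
                                "for %.0f seconds", now - last_used)
                    try:
                        cnx.rollback()
                    except Exception:
                        log.exception("rollback failed; dropping connection")
            time.sleep(self.interval)
```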
comment:41 by , 14 years ago
Replying to Carsten Klein <carsten.klein@…>:
As for #9111, it is still closed. So we need to reopen this one.
Did you notice that #9111 was reopened four months ago (i.e. at the time comment:38 was written), had some major development going on, then finally was closed again?
comment:42 by , 14 years ago
Resolution: | → duplicate |
---|---|
Status: | reopened → closed |
Exactly. Furthermore, I see no evidence in your reopening comment showing that the issue is not fixed when using 0.12.1rc1 or trunk.
If this is nevertheless the case for you, it would be better at this point to create a new ticket, given the long history of this one.
comment:43 by , 10 years ago
Keywords: | postgresql added; postgres removed |
---|
formatting, formatting, formatting. sorry.