#1401 closed defect (worksforme)
tracd crash related to svn bindings
Reported by: | Christian Boos | Owned by: | Mark Rowe |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | general | Version: | devel |
Severity: | major | Keywords: | svn bindings |
Cc: | Branch: | ||
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
With the latest trunk [1493], I have repeatable crashes of the python interpreter while trying to get the timeline: First, I get the following warning each time I refresh:
Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.__del__ of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x011A5B50>> ignored Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.__del__ of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x011B88D0>> ignored
Then, it sometimes crashes… This is on Windows, with python 2.3 and svn bindings 1.1.1:
_core.dll!svn_swig_py_acquire_py_lock() Line 96 C _core.dll!_wrap_apr_pool_destroy(_object * self=0x00000000, _object * args=0x011ac050) Line 1480 C
I post the report because the Pool wrapper logic is a bit too subtle for me to understand and debug this evening… :)
Attachments (1)
Change History (28)
comment:1 by , 20 years ago
comment:2 by , 20 years ago
Is there any possibility that you could post a more useful backtrace for the crash? It's going to be incredibly difficult to resolve otherwise as I cannot reproduce the crash here. Can you post full C and Python backtraces please?
comment:3 by , 20 years ago
The rest of the stack trace is not interesting, I'm afraid
(its a non-debug version of Python, so there are only python23.dll
lines).
I progressed a little bit, though: by commenting out the line 114 in
Timeline.py (the explicit repos.close()
) the problem disappears.
Now I try to understand why…
comment:4 by , 20 years ago
Hrm, sorry, no: the exception disappears, but not the crash. Weird.
I'll try to get some python backtrace.
comment:5 by , 20 years ago
In my experience the most likely reason for this to lead to a crash is that an exception is occurring, which leads to an unexpected order of the destruction of the SVN-related objects. In this case, the problem is likely inside CachedRepository.sync. I would suggest putting a try: except:
block around that and dumping out any exceptions that are caught.
comment:6 by , 20 years ago
I surrounded the svn code in the Timeline by a try/except block, but this gave nothing.
But, at one point, I got this:
Unhandled exception in thread started by Traceback (most recent call last): File "D:\COM\activestace\python-2.3.2-232\lib\threading.py", line 392, in __repr__ assert self.__initialized, "Thread.__init__() was not called" AssertionError: Thread.__init__() was not called
There must be a thread problem somewhere. If I refresh the timeline very quickly, the crash happens at the second or third refresh. If I wait enough time (for instance, after the refresh progress bar disappears) the crash doesn't happen.
Indeed, after the crash I see several python threads in the debugger:
2960 Win32 Thread 7c90eb94 Normal 0 > 3776 Win32 Thread 1e0871e5 Normal 0 4092 _wrap_apr_pool_destroy svn_swig_py_acquire_py_lock Normal 0 2264 _wrap_apr_pool_destroy svn_swig_py_acquire_py_lock Normal 0 312 Win32 Thread 7c90eb94 Normal 0
Detail of the 3776 thread, where the access violation occurs:
> python23.dll!1e0871e5() python23.dll!1e03f2b5() python23.dll!1e08372a() python23.dll!1e08370c() ...
The other python threads are looking like:
> ntdll.dll!7c90eb94() ntdll.dll!7c90e9c0() kernel32.dll!7c8025db() kernel32.dll!7c8399f3() kernel32.dll!7c802542() python23.dll!1e080331() python23.dll!1e08058b() python23.dll!1e012ad8() _core.dll!svn_swig_py_acquire_py_lock() Line 96 C _core.dll!_wrap_apr_pool_destroy(_object * self=0x00000000, _object * args=0x00c48150) Line 1480 C python23.dll!1e059196() python23.dll!1e0169fb() python23.dll!1e014b85() ...
In order to get a more useful backtrace, I tried to reproduce the problem on Linux, but everything seems to be fine there.
comment:7 by , 20 years ago
I'll add a "me too" to this:
vnet@kobuk$ tracd -p 9090 -a test,/home/websites/vnet-works/trac/test/ digest_pw,test Segmentation fault (core dumped)
vnet@kobuk$ tracd -p 9090 -a test,/home/websites/vnet-works/trac/test/ digest_pw,test /home/websites/vnet-works/trac/test Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.del of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x8591bec>> ignored Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.del of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x85825ac>> ignored
i created at ticket and clicked on "Timeline" and tracd dumps core again. If I restart tracd, however, then I get errors as above again.
-if i append to existing ticket, i get errors as above.
-if i create another new ticket, then tracd dumps core when accessing "Timeline"
fyi- the repo is exists, but has no projects or checkins. the repo is fsfs based. just a svnadmin create.
System: FreeBSD-5.3-p5 clearsilver-python-0.9.13_1 py-pyPgSQL-2.4_2 py24-PySQLite-1.1.6 py24-mx-base-2.0.5 python-2.4_1 postgresql-client-8.0.1 postgresql-server-8.0.1_3
trac trunk co version r1487 w/pgsql patch level 13 running via tracd
comment:8 by , 20 years ago
Severity: | normal → major |
---|
Reverting to [1480] makes the crash go away, though now I have again this exception while accessing the Timeline (with no stack trace, just the exception):
Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.__del__ of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x01386F90>> ignored
When I say the crash goes away… well, at least it appears to be as stable as it ever was…
But, if I try hard enough, I can actually crash any page served by tracd, if I hit the refresh key as fast as I can (usually it takes from 20 to 40 repetitions). Then the debugger shows a similar number of Python threads.
Unfortunately, this is not only a Windows issue. I was also able to get a crash on Linux, by performing the same kind of stress test (e.g. refreshing the WikiStart page as fast as I can)
Trac[main] ERROR: (104, 'Connection reset by peer') Trac[standalone] DEBUG: "GET /edgewall-trac/wiki HTTP/1.1" 500 - [New Thread 1294416 (LWP 17403)] Trac[main] ERROR: (104, 'Connection reset by peer') Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1179721 (LWP 17396)] 0x4003ea39 in instance_dealloc () from /usr/lib/libpython2.3.so.1.0 (gdb) bt #0 0x4003ea39 in instance_dealloc () from /usr/lib/libpython2.3.so.1.0 #1 0x40042a2c in instancemethod_dealloc () from /usr/lib/libpython2.3.so.1.0 #2 0x4003eb99 in instance_dealloc () from /usr/lib/libpython2.3.so.1.0 #3 0x4006e7e0 in clear_slots () from /usr/lib/libpython2.3.so.1.0 #4 0x4006ea72 in subtype_dealloc () from /usr/lib/libpython2.3.so.1.0 #5 0x4004dff0 in frame_dealloc () from /usr/lib/libpython2.3.so.1.0 #6 0x4009459d in fast_function () from /usr/lib/libpython2.3.so.1.0 #7 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0 #8 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0 #9 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0 #10 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0 #11 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0 #12 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0 #13 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0 #14 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0 #15 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0 #16 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0 #17 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0 #18 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0 #19 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0 #20 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0 #21 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0 #22 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0 #23 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0 #24 0x40092dbc in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0 #25 0x4004ff7c in function_call () from /usr/lib/libpython2.3.so.1.0 #26 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0 #27 0x4004304b in instancemethod_call () from /usr/lib/libpython2.3.so.1.0 #28 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0 #29 0x40093fb8 in PyEval_CallObjectWithKeywords () from /usr/lib/libpython2.3.so.1.0 #30 0x4003e794 in PyInstance_New () from /usr/lib/libpython2.3.so.1.0 #31 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0 #32 0x400948a6 in do_call () from /usr/lib/libpython2.3.so.1.0 #33 0x40094322 in call_function () from /usr/lib/libpython2.3.so.1.0 #34 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0 #35 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0 #36 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0 #37 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0 #38 0x40092dbc in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0 #39 0x4004ff7c in function_call () from /usr/lib/libpython2.3.so.1.0 #40 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0 #41 0x40094a86 in ext_do_call () from /usr/lib/libpython2.3.so.1.0 #42 0x40090091 in eval_frame () from /usr/lib/libpython2.3.so.1.0 #43 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0 #44 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0 #45 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0 #46 0x40092dbc in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0 #47 0x4004ff7c in function_call () from /usr/lib/libpython2.3.so.1.0 #48 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0 #49 0x4004304b in instancemethod_call () from /usr/lib/libpython2.3.so.1.0 #50 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0 #51 0x40093fb8 in PyEval_CallObjectWithKeywords () from /usr/lib/libpython2.3.so.1.0 #52 0x400b9bc4 in t_bootstrap () from /usr/lib/libpython2.3.so.1.0 #53 0x4011bf60 in pthread_start_thread () from /lib/i686/libpthread.so.0 #54 0x4011c0fe in pthread_start_thread_event () from /lib/i686/libpthread.so.0 #55 0x40335327 in clone () from /lib/i686/libc.so.6 (gdb)
This was on SuSE 9.0 Linux (2.4.21-99-default), with tracd of the source:trunk#1496
Wondering if it was a generic tracd issue, I reverted to an old revision [1000], but I can't crash this one, even on Windows. Now I'm starting a dichotomic search…
comment:9 by , 20 years ago
The Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.__del__ of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x011A5B50>> ignored
was resolved as of [1495]. I am looking into the crash.
comment:10 by , 20 years ago
Oh, and if you're going to try convincing tracd to crash may I suggest trying it at r1426. This is the revision before I added the new method for managing Subversion memory pools that looks to be causing the problems.
comment:11 by , 20 years ago
Owner: | changed from | to
---|
comment:12 by , 20 years ago
Status: | new → assigned |
---|
Another option to help track down the crash could be to comment out the svn_pool_destroy
call inside Pool.close
in svn_fs.py
. This may reveal that an exception is being raised somewhere.
comment:13 by , 20 years ago
I already commented out svn_pool_destroy
,
but that didn't help.
The crash is not strictly related to Subversion: I can crash
a fresh tracd only by requesting the WikiStart page repeatedly…
… but, as a follow-up to what I wrote in #1183, it seems to be more difficult to crash, as of [1403], but it's still possible to crash it.
Btw, my previous dichotomic search lead nowhere, since the more I went into the past, the more difficult to crash it was, so I couldn't really say if a version was immune to crash…
That being said, I just tried once again with [1412], which I used as a branching point for the xref, and couldn't make it crash! (I tried as hard as ever…)
comment:14 by , 20 years ago
Ok. The easiest way that I have found to track down where these types of crashes occur is to use a debug build of Python running in GDB combined with some intelligent GDB macros to map the C stack to the corresponding Python stack. You should be able to load that according to the instructions at the top of the file, and type pyframe
after it crashes to get the Python backtrace.
I'm not 100% sure, but this last backtrace looks like a different crash from the one initially reported.
comment:15 by , 20 years ago
comment:16 by , 20 years ago
I updated Subversion to version 1.1.4 (precompiled binaries on Windows, full build on Linux using swig-1.3.24 and python-2.3.4).
Apparently, the current trunk [1508] is much more robust that way.
I can still crash it at times on Windows, but apparently not anymore on Linux.
I guess the next logical thing to do is to rebuild on Windows using the same configuration.
In the end, it might be only problems related to SWIG:
- on Linux, swig 1.3.19 was used for the bindings used in the reported crashes
- on Windows, as I used the precompiled binaries, I don't know what swig version was used).
comment:17 by , 20 years ago
Just a quick note to mention that I found an easy way to tell if a given version is stable or not: simply use the Web Developer Toolbar for Firefox, and edit the CSS of a svn related page (like the revision log): if the python bindings are not stable, they crash in no time!
comment:18 by , 20 years ago
Yesterday I made another attempt for understanding the issue on Windows. Here are my findings:
FSFS vs. BDB
The trunk is incredibly fragile when the repos is using the BDB backend.
Though I switched all my repositories to fsfs, it just happened that
I used a small BDB repository for testing purpose (in fact, that was the one
generated by the unit tests).
This is with python-2.3.2-232 from ActiveState, with
the pre-build svn-win32-1.1.4 python bindings, testing with tracd
.
The Pool class
I tried to understand the Pool class, in source:trunk/trac/versioncontrol/svn_fs.py#1535
I can't really understand that design… The intent is to tie the lifetime of an apr pool to the main object using that pool within its method, right? But care should be taken that the pool is not deleted when subpools exist and could be used. Instead, the code seems to track the subpools and to close them when the parent pool is garbage collected. This might work, but that's very indirect…
In general, I have the impression that this code is too complex.
Why use weakref, and not rely on the normal garbage collection
cleanup __del__
?
I would propose the following implementation, which I think is both safer and easier to understand:
- the worker object (the one which needs a pool) will keep a ref on the pool
- any sub-pool keeps a ref on its parent pool
- therefore, the
__del__
method on a pool will be called upon garbage collection only when the worker object is gone and all sub-pools are also gone (meaning that all their worker objects are also gone): nothing should then be able to use a destroyed pool. - toplevel pools (those given a
None
parent could do theapr_initialize/apr_terminate
. I'm not sure about this, but shouldn't we avoid doingapr_terminate
at all?
class Pool(object): """ A wrapper for a new Subversion `apr_pool_t` object that ties the lifetime of the pool to that of a given `worker` object. However, a pool won't be deleted until all subpools have been also deleted. Instances of this type return their associated `pool` when called. """ def __init__(self, worker, parent_pool=lambda: None): """ Create a new pool that is a sub-pool of `parent_pool`, and arrange for `self.close` to be called up when the `worker` object is destroyed. The `worker` object must be weak-referenceable. The returned `Pool` instance is a callable returning the newly created pool. """ self.pool = core.svn_pool_create(parent_pool()) print "__init__(", self, worker, parent_pool, ")" if parent_pool(): # assume it's a Pool self.parent = parent_pool # the parent should not be GC'ed before us else: core.apr_initialize() self.parent = None def __call__(self): return self.pool def __del__(self): """ The `worker` object has been destroyed so it is time for us to go. -- So long, and thanks for all the fish! """ print '__del__(',self,')' # if not self.parent: # core.apr_terminate() core.svn_pool_destroy(self.pool)
comment:19 by , 20 years ago
I can't really understand that design… The intent is to tie the lifetime of an apr pool to the main object using that pool within its method, right? But care should be taken that the pool is not deleted when subpools exist and could be used. Instead, the code seems to track the subpools and to close them when the parent pool is garbage collected. This might work, but that's very indirect…
The entire point of the design of the pool class is to take garbage collection out of the picture. You are right about the issue with parent pools being deallocated before their sub-pools, and this should be fixed.
In general, I have the impression that this code is too complex. Why use weakref, and not rely on the normal garbage collection cleanup __del__
?
Granted, the code is relatively obscure. The use of weakref
was explicitly chosen to avoid the use of __del__
. Consider the following paragraph, taken from the Python documentation for the gc module:
3.1. Objects that have
__del__()
methods and are part of a reference cycle cause the entire reference cycle to be uncollectable, including objects not necessarily in the cycle but reachable only from it. Python doesn't collect such cycles automatically because, in general, it isn't possible for Python to guess a safe order in which to run the__del__()
methods.
In my opinion, the problem of creating a reference cycle involving a svn_fs
object that ties up a potentially large amount of memory in the Subversion library is a big one and warrants somewhat obscure code to prevent it from occurring. This is especially true when considering that many Trac processes may be long-running, such as in the case of tracd
or mod_python
.
My thoughts are that we should adapt the suggestion that you made with regards to pools, their parent pools, and their sub-pools to fit in with a weakref
-based approach. On the other hand, if someone can come up with a simpler approach that solves the aforementioned problems then I am all ears.
comment:20 by , 20 years ago
the problem of creating a reference cycle involving a svn_fs object that ties up a potentially large amount of memory in the Subversion library is a big one and warrants somewhat obscure code to prevent it from occurring.
Let me disagree on this.
Wouldn't it be better to ensure that, by design, such cycles won't occur?
Currently this is the case, as everything works using a top-down approach
(repos, then nodes, then sub-nodes…)
I can't see a reason for introducing cycles.
In that context, the __del__
-based Pool class is enough to
keep the garbage collection of pools out of the way.
comment:21 by , 20 years ago
cboos, have you had a chance to test with the __del__
-based implementation yet? Does it fix the problems you reported?
If so, I'd suggest you attach a patch to this ticket so that we can try it out.
by , 20 years ago
Attachment: | trac-svn-pool-children-#1401.patch added |
---|
Patch against source:/trunk#1544 to have a Pool wait until it's children have been closed to close itself.
comment:23 by , 19 years ago
One more "symptom" for the record:
Trac [1801] with the precompiled Subversion-1.2.0 bindings for Python 2.3, while editing CSS styles (browser.css) using the Web Developer Toolbar in Mozilla:
Trac[svn_fs] DEBUG: Closing subversion file-system at D:/Workspace/local/svn/trac-svnrepos Fatal Python error: GC object already tracked This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.
comment:24 by , 19 years ago
Summary: | crash in svn bindings while browsing the timeline → tracd crash related to svn bindings |
---|
Note that this problem has been silenced since r1875 (removal of the ThreadingMixIn from the TracHTTPServer).
I think that the SVN bindings are not really thread-safe, even more so if the underlying Subversion "filesystem" is bdb, judging by the much increased frequency of crashes triggered in such a configuration.
comment:25 by , 19 years ago
Resolution: | → worksforme |
---|---|
Status: | assigned → closed |
Ok, after much testing, it seems that the problem doesn't happen anymore with recent versions.
Actually, it stopped crashing after [2013]: those changes are not directly related to the SVN bindings, but there, the usage of pysqlite has been made more robust w.r.t. multithreading, so I suspect that the various crashes described here were, at least indirectly, favorised by the way pysqlite was used prior to [2013].
comment:26 by , 19 years ago
Keywords: | svn bindings added |
---|---|
Milestone: | → 0.9 |
Forgot to mention: I use tracd for the server and Firefox for the client. Also, in the debugger, I noticed 2 threads having python on their stack.