Edgewall Software
Modify

Opened 19 years ago

Closed 19 years ago

Last modified 12 years ago

#1401 closed defect (worksforme)

tracd crash related to svn bindings

Reported by: Christian Boos Owned by: Mark Rowe
Priority: normal Milestone:
Component: general Version: devel
Severity: major Keywords: svn bindings
Cc: Branch:
Release Notes:
API Changes:
Internal Changes:

Description

With the latest trunk [1493], I have repeatable crashes of the python interpreter while trying to get the timeline: First, I get the following warning each time I refresh:

Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.__del__ of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x011A5B50>> ignored
Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.__del__ of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x011B88D0>> ignored

Then, it sometimes crashes… This is on Windows, with python 2.3 and svn bindings 1.1.1:

 	_core.dll!svn_swig_py_acquire_py_lock()  Line 96	C
 	_core.dll!_wrap_apr_pool_destroy(_object * self=0x00000000, _object * args=0x011ac050)  Line 1480	C

I post the report because the Pool wrapper logic is a bit too subtle for me to understand and debug this evening… :)

Attachments (1)

trac-svn-pool-children-#1401.patch (2.8 KB ) - added by Mark Rowe 19 years ago.
Patch against source:/trunk#1544 to have a Pool wait until it's children have been closed to close itself.

Download all attachments as: .zip

Change History (28)

comment:1 by Christian Boos, 19 years ago

Forgot to mention: I use tracd for the server and Firefox for the client. Also, in the debugger, I noticed 2 threads having python on their stack.

comment:2 by Mark Rowe, 19 years ago

Is there any possibility that you could post a more useful backtrace for the crash? It's going to be incredibly difficult to resolve otherwise as I cannot reproduce the crash here. Can you post full C and Python backtraces please?

comment:3 by anonymous, 19 years ago

The rest of the stack trace is not interesting, I'm afraid (its a non-debug version of Python, so there are only python23.dll lines).

I progressed a little bit, though: by commenting out the line 114 in Timeline.py (the explicit repos.close()) the problem disappears.

Now I try to understand why…

comment:4 by anonymous, 19 years ago

Hrm, sorry, no: the exception disappears, but not the crash. Weird.

I'll try to get some python backtrace.

comment:5 by Mark Rowe, 19 years ago

In my experience the most likely reason for this to lead to a crash is that an exception is occurring, which leads to an unexpected order of the destruction of the SVN-related objects. In this case, the problem is likely inside CachedRepository.sync. I would suggest putting a try: except: block around that and dumping out any exceptions that are caught.

comment:6 by Christian Boos, 19 years ago

I surrounded the svn code in the Timeline by a try/except block, but this gave nothing.

But, at one point, I got this:

Unhandled exception in thread started by
Traceback (most recent call last):
  File "D:\COM\activestace\python-2.3.2-232\lib\threading.py", line 392, in __repr__
    assert self.__initialized, "Thread.__init__() was not called"
AssertionError: Thread.__init__() was not called

There must be a thread problem somewhere. If I refresh the timeline very quickly, the crash happens at the second or third refresh. If I wait enough time (for instance, after the refresh progress bar disappears) the crash doesn't happen.

Indeed, after the crash I see several python threads in the debugger:

 	2960	Win32 Thread	7c90eb94	Normal	0
>	3776	Win32 Thread	1e0871e5	Normal	0
 	4092	_wrap_apr_pool_destroy	svn_swig_py_acquire_py_lock	Normal	0
 	2264	_wrap_apr_pool_destroy	svn_swig_py_acquire_py_lock	Normal	0
 	312	Win32 Thread	7c90eb94	Normal	0

Detail of the 3776 thread, where the access violation occurs:

>	python23.dll!1e0871e5() 	
 	python23.dll!1e03f2b5() 	
 	python23.dll!1e08372a() 	
 	python23.dll!1e08370c() 	
...

The other python threads are looking like:

>	ntdll.dll!7c90eb94() 	
 	ntdll.dll!7c90e9c0() 	
 	kernel32.dll!7c8025db() 	
 	kernel32.dll!7c8399f3() 	
 	kernel32.dll!7c802542() 	
 	python23.dll!1e080331() 	
 	python23.dll!1e08058b() 	
 	python23.dll!1e012ad8() 	
 	_core.dll!svn_swig_py_acquire_py_lock()  Line 96	C
 	_core.dll!_wrap_apr_pool_destroy(_object * self=0x00000000, _object * args=0x00c48150)  Line 1480	C
 	python23.dll!1e059196() 	
 	python23.dll!1e0169fb() 	
 	python23.dll!1e014b85() 	
...

In order to get a more useful backtrace, I tried to reproduce the problem on Linux, but everything seems to be fine there.

comment:7 by anonymous, 19 years ago

I'll add a "me too" to this:

vnet@kobuk$ tracd -p 9090 -a test,/home/websites/vnet-works/trac/test/ digest_pw,test Segmentation fault (core dumped)

vnet@kobuk$ tracd -p 9090 -a test,/home/websites/vnet-works/trac/test/ digest_pw,test /home/websites/vnet-works/trac/test Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.del of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x8591bec>> ignored Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.del of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x85825ac>> ignored

i created at ticket and clicked on "Timeline" and tracd dumps core again. If I restart tracd, however, then I get errors as above again.

-if i append to existing ticket, i get errors as above.

-if i create another new ticket, then tracd dumps core when accessing "Timeline"

fyi- the repo is exists, but has no projects or checkins. the repo is fsfs based. just a svnadmin create.

System:

FreeBSD-5.3-p5
clearsilver-python-0.9.13_1 
py-pyPgSQL-2.4_2 
py24-PySQLite-1.1.6 
py24-mx-base-2.0.5 
python-2.4_1     
postgresql-client-8.0.1
postgresql-server-8.0.1_3

trac trunk co version r1487 w/pgsql patch level 13 running via tracd

comment:8 by Christian Boos, 19 years ago

Severity: normalmajor

Reverting to [1480] makes the crash go away, though now I have again this exception while accessing the Timeline (with no stack trace, just the exception):

Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.__del__ of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x01386F90>> ignored

When I say the crash goes away… well, at least it appears to be as stable as it ever was…

But, if I try hard enough, I can actually crash any page served by tracd, if I hit the refresh key as fast as I can (usually it takes from 20 to 40 repetitions). Then the debugger shows a similar number of Python threads.

Unfortunately, this is not only a Windows issue. I was also able to get a crash on Linux, by performing the same kind of stress test (e.g. refreshing the WikiStart page as fast as I can)

Trac[main] ERROR: (104, 'Connection reset by peer')
Trac[standalone] DEBUG: "GET /edgewall-trac/wiki HTTP/1.1" 500 -
[New Thread 1294416 (LWP 17403)]
Trac[main] ERROR: (104, 'Connection reset by peer')

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1179721 (LWP 17396)]
0x4003ea39 in instance_dealloc () from /usr/lib/libpython2.3.so.1.0
(gdb) bt
#0  0x4003ea39 in instance_dealloc () from /usr/lib/libpython2.3.so.1.0
#1  0x40042a2c in instancemethod_dealloc () from /usr/lib/libpython2.3.so.1.0
#2  0x4003eb99 in instance_dealloc () from /usr/lib/libpython2.3.so.1.0
#3  0x4006e7e0 in clear_slots () from /usr/lib/libpython2.3.so.1.0
#4  0x4006ea72 in subtype_dealloc () from /usr/lib/libpython2.3.so.1.0
#5  0x4004dff0 in frame_dealloc () from /usr/lib/libpython2.3.so.1.0
#6  0x4009459d in fast_function () from /usr/lib/libpython2.3.so.1.0
#7  0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0
#8  0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0
#9  0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0
#10 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0
#11 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0
#12 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0
#13 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0
#14 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0
#15 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0
#16 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0
#17 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0
#18 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0
#19 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0
#20 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0
#21 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0
#22 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0
#23 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0
#24 0x40092dbc in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
#25 0x4004ff7c in function_call () from /usr/lib/libpython2.3.so.1.0
#26 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#27 0x4004304b in instancemethod_call () from /usr/lib/libpython2.3.so.1.0
#28 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#29 0x40093fb8 in PyEval_CallObjectWithKeywords () from /usr/lib/libpython2.3.so.1.0
#30 0x4003e794 in PyInstance_New () from /usr/lib/libpython2.3.so.1.0
#31 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#32 0x400948a6 in do_call () from /usr/lib/libpython2.3.so.1.0
#33 0x40094322 in call_function () from /usr/lib/libpython2.3.so.1.0
#34 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0
#35 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0
#36 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0
#37 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0
#38 0x40092dbc in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
#39 0x4004ff7c in function_call () from /usr/lib/libpython2.3.so.1.0
#40 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#41 0x40094a86 in ext_do_call () from /usr/lib/libpython2.3.so.1.0
#42 0x40090091 in eval_frame () from /usr/lib/libpython2.3.so.1.0
#43 0x40094576 in fast_function () from /usr/lib/libpython2.3.so.1.0
#44 0x400943ac in call_function () from /usr/lib/libpython2.3.so.1.0
#45 0x400903cc in eval_frame () from /usr/lib/libpython2.3.so.1.0
#46 0x40092dbc in PyEval_EvalCodeEx () from /usr/lib/libpython2.3.so.1.0
#47 0x4004ff7c in function_call () from /usr/lib/libpython2.3.so.1.0
#48 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#49 0x4004304b in instancemethod_call () from /usr/lib/libpython2.3.so.1.0
#50 0x4003b627 in PyObject_Call () from /usr/lib/libpython2.3.so.1.0
#51 0x40093fb8 in PyEval_CallObjectWithKeywords () from /usr/lib/libpython2.3.so.1.0
#52 0x400b9bc4 in t_bootstrap () from /usr/lib/libpython2.3.so.1.0
#53 0x4011bf60 in pthread_start_thread () from /lib/i686/libpthread.so.0
#54 0x4011c0fe in pthread_start_thread_event () from /lib/i686/libpthread.so.0
#55 0x40335327 in clone () from /lib/i686/libc.so.6
(gdb)

This was on SuSE 9.0 Linux (2.4.21-99-default), with tracd of the source:trunk#1496

Wondering if it was a generic tracd issue, I reverted to an old revision [1000], but I can't crash this one, even on Windows. Now I'm starting a dichotomic search…

comment:9 by Mark Rowe, 19 years ago

The Exception exceptions.TypeError: "'NoneType' object is not callable" in <bound method SubversionRepository.__del__ of <trac.versioncontrol.svn_fs.SubversionRepository object at 0x011A5B50>> ignored was resolved as of [1495]. I am looking into the crash.

comment:10 by Mark Rowe, 19 years ago

Oh, and if you're going to try convincing tracd to crash may I suggest trying it at r1426. This is the revision before I added the new method for managing Subversion memory pools that looks to be causing the problems.

comment:11 by Mark Rowe, 19 years ago

Owner: changed from Jonas Borgström to Mark Rowe

comment:12 by Mark Rowe, 19 years ago

Status: newassigned

Another option to help track down the crash could be to comment out the svn_pool_destroy call inside Pool.close in svn_fs.py. This may reveal that an exception is being raised somewhere.

comment:13 by Christian Boos, 19 years ago

I already commented out svn_pool_destroy, but that didn't help. The crash is not strictly related to Subversion: I can crash a fresh tracd only by requesting the WikiStart page repeatedly…

… but, as a follow-up to what I wrote in #1183, it seems to be more difficult to crash, as of [1403], but it's still possible to crash it.

Btw, my previous dichotomic search lead nowhere, since the more I went into the past, the more difficult to crash it was, so I couldn't really say if a version was immune to crash…

That being said, I just tried once again with [1412], which I used as a branching point for the xref, and couldn't make it crash! (I tried as hard as ever…)

So now I will restart my search between [1412] and [1496].

comment:14 by Mark Rowe, 19 years ago

Ok. The easiest way that I have found to track down where these types of crashes occur is to use a debug build of Python running in GDB combined with some intelligent GDB macros to map the C stack to the corresponding Python stack. You should be able to load that according to the instructions at the top of the file, and type pyframe after it crashes to get the Python backtrace.

I'm not 100% sure, but this last backtrace looks like a different crash from the one initially reported.

comment:15 by Christian Boos, 19 years ago

Some quick notes:

  • thanks for the GDB tips, I'll be able to debug on Linux next week
  • r1426 is definitely robust, its only with r1427 that I can get easily reproductible crashes

comment:16 by Christian Boos, 19 years ago

I updated Subversion to version 1.1.4 (precompiled binaries on Windows, full build on Linux using swig-1.3.24 and python-2.3.4).

Apparently, the current trunk [1508] is much more robust that way.

I can still crash it at times on Windows, but apparently not anymore on Linux.

I guess the next logical thing to do is to rebuild on Windows using the same configuration.

In the end, it might be only problems related to SWIG:

  • on Linux, swig 1.3.19 was used for the bindings used in the reported crashes
  • on Windows, as I used the precompiled binaries, I don't know what swig version was used).

comment:17 by Christian Boos, 19 years ago

Just a quick note to mention that I found an easy way to tell if a given version is stable or not: simply use the Web Developer Toolbar for Firefox, and edit the CSS of a svn related page (like the revision log): if the python bindings are not stable, they crash in no time!

comment:18 by Christian Boos, 19 years ago

Yesterday I made another attempt for understanding the issue on Windows. Here are my findings:

FSFS vs. BDB

The trunk is incredibly fragile when the repos is using the BDB backend. Though I switched all my repositories to fsfs, it just happened that I used a small BDB repository for testing purpose (in fact, that was the one generated by the unit tests). This is with python-2.3.2-232 from ActiveState, with the pre-build svn-win32-1.1.4 python bindings, testing with tracd.

The Pool class

I tried to understand the Pool class, in source:trunk/trac/versioncontrol/svn_fs.py#1535

I can't really understand that design… The intent is to tie the lifetime of an apr pool to the main object using that pool within its method, right? But care should be taken that the pool is not deleted when subpools exist and could be used. Instead, the code seems to track the subpools and to close them when the parent pool is garbage collected. This might work, but that's very indirect…

In general, I have the impression that this code is too complex. Why use weakref, and not rely on the normal garbage collection cleanup __del__?

I would propose the following implementation, which I think is both safer and easier to understand:

  • the worker object (the one which needs a pool) will keep a ref on the pool
  • any sub-pool keeps a ref on its parent pool
  • therefore, the __del__ method on a pool will be called upon garbage collection only when the worker object is gone and all sub-pools are also gone (meaning that all their worker objects are also gone): nothing should then be able to use a destroyed pool.
  • toplevel pools (those given a None parent could do the apr_initialize/apr_terminate. I'm not sure about this, but shouldn't we avoid doing apr_terminate at all?
class Pool(object):
    """
    A wrapper for a new Subversion `apr_pool_t` object that ties the lifetime of the
    pool to that of a given `worker` object.
    However, a pool won't be deleted until all subpools have been also deleted.
    
    Instances of this type return their associated `pool` when called.
    """
    
    def __init__(self, worker, parent_pool=lambda: None):
        """
        Create a new pool that is a sub-pool of `parent_pool`, and arrange for
        `self.close` to be called up when the `worker` object is destroyed.
        
        The `worker` object must be weak-referenceable.  The returned `Pool`
        instance is a callable returning the newly created pool. 
        """
        self.pool = core.svn_pool_create(parent_pool())
        print "__init__(", self, worker, parent_pool, ")"
        
        if parent_pool(): # assume it's a Pool
            self.parent = parent_pool # the parent should not be GC'ed before us
        else:
            core.apr_initialize()
            self.parent = None

    def __call__(self):
        return self.pool

    def __del__(self):
        """
        The `worker` object has been destroyed so it is time for us to go.
        
        -- So long, and thanks for all the fish!
        """
        print '__del__(',self,')'
        #         if not self.parent:
        #             core.apr_terminate()

        core.svn_pool_destroy(self.pool)

comment:19 by Mark Rowe, 19 years ago

I can't really understand that design… The intent is to tie the lifetime of an apr pool to the main object using that pool within its method, right? But care should be taken that the pool is not deleted when subpools exist and could be used. Instead, the code seems to track the subpools and to close them when the parent pool is garbage collected. This might work, but that's very indirect…

The entire point of the design of the pool class is to take garbage collection out of the picture. You are right about the issue with parent pools being deallocated before their sub-pools, and this should be fixed.

In general, I have the impression that this code is too complex. Why use weakref, and not rely on the normal garbage collection cleanup __del__?

Granted, the code is relatively obscure. The use of weakref was explicitly chosen to avoid the use of __del__. Consider the following paragraph, taken from the Python documentation for the gc module:

3.1. Objects that have __del__() methods and are part of a reference cycle cause the entire reference cycle to be uncollectable, including objects not necessarily in the cycle but reachable only from it. Python doesn't collect such cycles automatically because, in general, it isn't possible for Python to guess a safe order in which to run the __del__() methods.

In my opinion, the problem of creating a reference cycle involving a svn_fs object that ties up a potentially large amount of memory in the Subversion library is a big one and warrants somewhat obscure code to prevent it from occurring. This is especially true when considering that many Trac processes may be long-running, such as in the case of tracd or mod_python.

My thoughts are that we should adapt the suggestion that you made with regards to pools, their parent pools, and their sub-pools to fit in with a weakref-based approach. On the other hand, if someone can come up with a simpler approach that solves the aforementioned problems then I am all ears.

comment:20 by Christian Boos, 19 years ago

the problem of creating a reference cycle involving a svn_fs object that ties up a potentially large amount of memory in the Subversion library is a big one and warrants somewhat obscure code to prevent it from occurring.

Let me disagree on this.

Wouldn't it be better to ensure that, by design, such cycles won't occur? Currently this is the case, as everything works using a top-down approach (repos, then nodes, then sub-nodes…) I can't see a reason for introducing cycles. In that context, the __del__-based Pool class is enough to keep the garbage collection of pools out of the way.

comment:21 by Christopher Lenz, 19 years ago

cboos, have you had a chance to test with the __del__-based implementation yet? Does it fix the problems you reported?

If so, I'd suggest you attach a patch to this ticket so that we can try it out.

by Mark Rowe, 19 years ago

Patch against source:/trunk#1544 to have a Pool wait until it's children have been closed to close itself.

comment:22 by Mark Rowe, 19 years ago

The Pool cleanup patch was committed in [1549].

comment:23 by Christian Boos, 19 years ago

One more "symptom" for the record:

Trac [1801] with the precompiled Subversion-1.2.0 bindings for Python 2.3, while editing CSS styles (browser.css) using the Web Developer Toolbar in Mozilla:

Trac[svn_fs] DEBUG: Closing subversion file-system at D:/Workspace/local/svn/trac-svnrepos
Fatal Python error: GC object already tracked

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

comment:24 by Christian Boos, 19 years ago

Summary: crash in svn bindings while browsing the timelinetracd crash related to svn bindings

Note that this problem has been silenced since r1875 (removal of the ThreadingMixIn from the TracHTTPServer).

I think that the SVN bindings are not really thread-safe, even more so if the underlying Subversion "filesystem" is bdb, judging by the much increased frequency of crashes triggered in such a configuration.

comment:25 by Christian Boos, 19 years ago

Resolution: worksforme
Status: assignedclosed

Ok, after much testing, it seems that the problem doesn't happen anymore with recent versions.

Actually, it stopped crashing after [2013]: those changes are not directly related to the SVN bindings, but there, the usage of pysqlite has been made more robust w.r.t. multithreading, so I suspect that the various crashes described here were, at least indirectly, favorised by the way pysqlite was used prior to [2013].

comment:26 by Christian Boos, 19 years ago

Keywords: svn bindings added
Milestone: 0.9

comment:27 by Christian Boos, 12 years ago

Milestone: 0.9

(clearing report:35)

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain Mark Rowe.
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from Mark Rowe to the specified user.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.