Edgewall Software
Modify

Opened 10 years ago

Closed 8 years ago

#6614 closed defect (fixed)

Memory leak using apache

Reported by: anonymous Owned by: Christian Boos
Priority: highest Milestone: 0.11
Component: version control Version: 0.11.5
Severity: blocker Keywords: memory
Cc: ilias@…, pierreroth64@…, kirean@…, kamil@…, mbertheau@…, benjixx@…, manuzhai@…, bdelaage@…, jdthood@…, mailinglist@…, alicebot@…, tereutes@…, daniel@…, gt4329b@…, ammon@…, chris@…, kts_66@…, bubbamagic87@…, gustav.svensson@…, trac@…
Release Notes:
API Changes:

Description

When I browse my source, my apache process increases its memory usage dangerously (more than 1 GB !), never freeing it. It is particularly visible when trying to display a directory content with several entries in it.

I'm currently using :

  • Apache 2.0.61
  • Python 2.4
  • Trac 0.11b1

Attachments (11)

timeline_rendering_backtrace.log (5.3 KB ) - added by Christian Boos 10 years ago.
The backtrace for the exception discussed in comment:23
no-permission-store-cache.diff (742 bytes ) - added by Christian Boos 10 years ago.
gives a cleaner picture of real memory usage
single-threaded-tracd.diff (501 bytes ) - added by Christian Boos 10 years ago.
disables multi-threading for tracd
fix-_history.diff (1.4 KB ) - added by Christian Boos 10 years ago.
better Pool usage in SubversionRepository._history()
explicit_gc.diff (810 bytes ) - added by Christian Boos 10 years ago.
call gc.collect() after every request
debug-mem-growth.diff (2.7 KB ) - added by Christian Boos 10 years ago.
same as explicit_gc.diff + some stats about accumulated objects
timeline-expression-bug.log (33.7 KB ) - added by Christian Boos 10 years ago.
log a session with timeline-expression-bug.diff active
timeline-expression-bug.diff (575 bytes ) - added by Christian Boos 10 years ago.
raise a ValueError from an Expression evaluation immediately during timeline rendering
timeline-suite-bug.diff (1.5 KB ) - added by Christian Boos 10 years ago.
raise a ValueError from a Suite execution immediately during timeline rendering
timeline-expression-bug-genshi-0.4.4.log (12.1 KB ) - added by Christian Boos 10 years ago.
log a session with timeline-expression-bug.diff active - this time using Genshi 0.4.4 instead of [G789]
render-using-StringIO-rr6677.patch (1.1 KB ) - added by Christian Boos 9 years ago.
This patch helps to reduce the maximal memory usage.

Download all attachments as: .zip

Change History (114)

comment:1 Changed 10 years ago by hyuga <hyugaricdeau@…>

Hard to say without more detail, but could this be a duplicate of #6581? Try the patch suggested there and see if that helps.

comment:2 in reply to:  1 Changed 10 years ago by hyuga <hyugaricdeau@…>

Replying to hyuga <hyugaricdeau@gmail.com>:

Hard to say without more detail, but could this be a duplicate of #6581? Try the patch suggested there and see if that helps.

Nevermind, apparently the patch won't help with the high memory usage issue.

comment:3 in reply to:  description Changed 10 years ago by Christian Boos

Component: generalversion control
Keywords: memory added
Owner: changed from Jonas Borgström to Christian Boos
Priority: normalhighest

Replying to anonymous:

It is particularly visible when trying to display a directory content with several entries in it.

Thanks for the hint. For some reason, I was focusing on the rendering of file in the TracBrowser, but that seems to work fine whereas the rendering of a folder indeed appears to leak.

comment:4 Changed 10 years ago by Christian Boos

Looks like this is also present when using TracMercurial and none of the fancy new stuff from 0.11 (no quickjump, no color scale). Maybe a <py:include> issue?

comment:5 in reply to:  4 Changed 10 years ago by hyuga <hyugaricdeau@…>

Replying to cboos:

Looks like this is also present when using TracMercurial and none of the fancy new stuff from 0.11 (no quickjump, no color scale). Maybe a <py:include> issue?

Just out of curiosity, why do you think it's an issue with includes, or even anything having to do with Genshi?

comment:6 Changed 10 years ago by anonymous

Cc: xarquonster@… added

comment:7 Changed 10 years ago by Christian Boos

From my various tests, it seems that Trac memory is growing as long as new data is processed.

For example, when requesting 6000 different changesets on a t.e.o clone, the memory slowly grows from 100 to 200 Mbytes. But then, when requesting the same range of changesets again, there's no extra growth, i.e. memory usage is still 200 Mbytes. This is inconsistent with a "classical" memory leak.

So my current take on the issue is that perhaps a lot of strings get interned. From a superficial inspection, I couldn't find the origin of this, and I won't have time in the next days to go more in depth, so maybe someone else can explore that hint.

comment:8 Changed 10 years ago by ThurnerRupert

how trac and apache are connected? mod_python?

comment:9 in reply to:  8 Changed 10 years ago by anonymous

Replying to ThurnerRupert:

how trac and apache are connected? mod_python?

Yes, the problem occurs with mod_python.

comment:10 Changed 10 years ago by esaj

I'd be interested to know if this happens in mod_wsgi too.

comment:11 Changed 10 years ago by Christian Boos

Happens with tracd as well, so I think this has nothing to do with apache and mod_* …

comment:12 in reply to:  7 ; Changed 10 years ago by osimons

Replying to cboos:

From my various tests, it seems that Trac memory is growing as long as new data is processed.

That is what I see too. As i browse new changesets it just keeps growing, but revisiting them makes little difference to memory - tried new user, and scoping the changeset display as well just to test if variations had any impact.

However, when browsing changesets, I see that a changeset with many diff's - like typically for a branch merge - can increase the memory with 25MB or more in one go. It does not get released again. Viewing a similar size changeset with only adds (initial import of a new project) leaves no noticeable impact on memory at all.

May the diff rendering code be a starting point for further research?

comment:13 in reply to:  12 ; Changed 10 years ago by Christian Boos

Sorry for the late reply - I thought I would be able to locate the precise problem, but as I'm not yet there, I'll post some intermediate findings.

Replying to osimons:

May the diff rendering code be a starting point for further research?

I don't think so: I forgot to mention that for the test mentioned in comment:7, I disabled the diff generation (same goes for the results below).

Here are some more results from my recent testings, using tracd on Linux x86_64, psycopg 2.0.6, latest Trac and Genshi.

It seems that there are actually two issues:

  1. memory always increases and is never returned to the system: Python 2.5 performs better here, as explained in http://evanjones.ca/memoryallocator
  2. Subversion bindings, or the way we're using them, cause a serious leak

In more details:

When doing a wget -r -np -l2 / on a t.e.o clone (no Subversion bindings):

  • Python 2.4.4 stabilizes at 303 Mbytes
  • Python 2.5.1 stabilizes at 138 Mbytes, with a peak at about 310 Mbytes

stabilizes means here that we no longer see an increase in the memory usage when running the same wget command again. Here we see that 2.5.1 does a better job at memory management, and that the overall memory usage is relatively reasonable. Still a bit high, but I think we can't expect to have less than that.

Now I did the same wget test, this time with the 1.4.3 bindings enabled, using repository_type = direct-svnfs to make things more apparent:

  • Python 2.4.4 stabilizes at 3338 Mbytes (yes, 3.3 Gbytes!)
  • Python 2.5.1 stabilizes at 2590 Mbytes

So while 2.5 does better, it's still pretty bad. The timeline queries seem to be the most expensive ones, sometimes adding 100 Mbytes each, when the changesets are accessed for the first time. When using repository_type = svn those timeline queries don't involve the svn bindings and are fast and lean.

comment:14 in reply to:  13 ; Changed 10 years ago by anonymous

When using repository_type = svn those timeline queries don't involve the svn bindings and are fast and lean.

So a workaround is to use repository_type = svn for now, for those of us using 0.11b1?

comment:15 Changed 10 years ago by anonymous

Cc: pierreroth64@… added

comment:16 in reply to:  14 Changed 10 years ago by Christian Boos

Replying to anonymous:

When using repository_type = svn those timeline queries don't involve the svn bindings and are fast and lean.

So a workaround is to use repository_type = svn for now, for those of us using 0.11b1?

Well, not really, svn is anyway the default, and I doubt the people having the issue reported here had used direct-svnfs. The latter only makes the problem more apparent, but the bindings are nevertheless still used for the former, e.g. for the browser queries.

We currently can't use the cache for everything, but there are still opportunities to use it more (like in #6654). Future work (0.12, 0.13?) should aim at a complete by-pass of the svn bindings, or alternatively, should use the newer ctypes bindings.

comment:17 Changed 10 years ago by Christian Boos

Back to the point: I think I've isolated one of the issues this week-end, related to Repository._history() and the weird behavior of fs.node_history(path), fs.history_prev and fs.history_location, most notably when path is not the root. I wrote an alternative implementation of _history(), using temporary memory pools. At first, it seemed to help a lot, but actually it was the explicit call to gc.collect() that I added at the end of request processing, in order to get more reproducible results, which was the main reason for the memory savings!

In more details:

With the following change only:

  • trac/web/main.py

    diff --git a/trac/web/main.py b/trac/web/main.py
    a b def dispatch_request(environ, start_resp  
    389389        return _dispatch_request(req, env, env_error)
    390390    finally:
    391391        if env and not run_once:
     392            import gc
     393            gc.collect()
     394            print "Garbage collected..."
    392395            env.shutdown(threading._get_ident())
    393396
    394397def _dispatch_request(req, env, env_error):

on the test case from comment:13 above, the memory usage drops to 504M (!). More specifically:

wget -r ... tracd 0.11 tracd 0.11, with explicit gc.collect()
1st run 2040M 476M
2nd run 3121M 504M
3rd run 3123M 504M

Also, in all my tests, the ThreadingMixIn was removed from the TracHTTPServer, in order to eliminate possible multi-threading issues.

Now, with the alternative implementation for _history(), using temporary pools:

    def _history(self, svn_path, start, end, pool):
        """`svn_path` must be a full scope path, UTF-8 encoded string.

        Generator yielding `(path, rev)` pairs, where `path` is an `unicode`
        object.
        Must start with `(path, created rev)`.
        """
        if start < end:
            start, end = end, start
        root = fs.revision_root(self.fs_ptr, start, pool())
        tmp1 = Pool(pool)
        tmp2 = Pool(pool)
        history_ptr = fs.node_history(root, svn_path, tmp1())
        cross_copies = 1
        while history_ptr:
            history_ptr = fs.history_prev(history_ptr, cross_copies, tmp2())
            tmp1.clear()
            tmp1, tmp2 = tmp2, tmp1
            if history_ptr:
                path, rev = fs.history_location(history_ptr, tmp2())
                tmp2.clear()
                if rev < end:
                    break
                path = _from_svn(path)
                if not self.authz.has_permission(path):
                    break
                yield path, rev
        del tmp1
        del tmp2

we can get even further savings, i.e.

wget -r ... tracd 0.11 tracd 0.11, with explicit gc.collect() and temp pools for _history()
1st run 2040M 425M
2nd run 3121M 453M
3rd run 3123M 453M

The additional savings (roughly 50M) are a hint that we can probably get even more savings by using temporary pools in other places. Or even that we should rewrite the whole Pool stuff, see #1401

Of course, those findings are quite counter-intuitive. If everything would work "as expected", the above changes in _history() would actually make no difference on the long run, and the same could be said for the gc.collect(). I think that this shows that there's definitely something wrong going on with the management of the pools, inside the bindings.

This still doesn't explain why it seems this is a 0.11 issue, as the svn_fs code hasn't changed that much since 0.10.5dev. I should probably make the same test on a 0.10.5dev install as well.

to be continued…

comment:18 in reply to:  17 Changed 10 years ago by Christian Boos

To complement comment:17 with testing results using python 2.5:

wget -r ... tracd 0.11 tracd' 0.11 - 2.4 tracd' 0.11 - 2.5
1st run 2040M 425M 277M
2nd run 3121M 453M 294M
3rd run 3123M 453M 288M

where tracd' means "with explicit gc.collect() and temp pools for _history()".

Also, in case that wasn't clear, the wget -r -np -l2 / command is run 3 times in succession, without restarting tracd.

comment:19 Changed 10 years ago by Christian Boos

Severity: normalblocker

Needless to say, this is a blocker issue.

Note that #5213 was probably related to the same cause (r6328 only tried to "hide" the issue). #6588 as well, and that one was reported against 0.10.4.

comment:20 Changed 10 years ago by Christian Boos

I repeated the same test using several variants of 0.10.5dev tracd, using Python 2.4:

wget -r ... tracd
(1)
tracd +gc
(2)
tracd +0.11
(3)
tracd +0.11 +gc
(4)
tracd +0.11 +tmp
(5)
tracd +0.11 +tmp +gc
(6)
1st run 124M 120M 117.8M 113M 118M 115M
2nd run 125.4M 122.4M 119M 117.4M 119M 118.6M
3rd run 125.4M 122.4M 120.8M 117.4M 119M 118.6M

Notes:

  1. straight from source:/branches/0.10-stable@6401
  2. same as 1. + the gc.collect() after every request
  3. same as 1. + svn_fs.py was patched with the 0.11 changes.
    I verified that (3) doesn't leak, as it stabilizes at 120.8M for a 4th run.
  4. same as 3. + the gc.collect() after every request
  5. same as 3. + the temp pools for _history().
  6. same as 5. + the gc.collect() after every request

From those numbers, it looks like the 0.11 changes are not harmful, they even seem to reduce the memory usage a bit. It's interesting to see that the temp pools optimizations are not worth doing (3 better than 5), unless the explicit call to gc.collect() is also done (6 better than 3). However, the explicit gc.collect() call alone is better than combined with the tmp pool optimization! (4 better than 6). Doing only the gc.collect() call on stock 0.10.5dev doesn't buy much (2 vs. 1).

Also, besides the memory issue, 0.10.5dev is a lot faster than 0.11: one wget run takes about 4 minutes for 0.10.5dev (even 3m 20s when not doing the systematic gc.collect() call), versus 33 minutes for 0.11. But that's old news.

Summary of the findings so far:

  • From the present results, it seems that the svn changes alone have no real impact on the memory.
  • From comment:13, we can see the big difference it makes for 0.11 when running with or without the bindings
  • From comment:18, we can see that thanks to some specific optimizations, the huge memory requirements added by the bindings can be lowered a lot.

Surprisingly enough, the main optimization is simply to call gc.collect() at regular intervals. Also interesting is the fact that the temp pool optimization (the changes to SvnRepository._history() described in comment:17) seem to give an additional 50M savings in 0.11, whereas we've seen here that those savings should really be marginals, if any (4 vs 6).

comment:21 Changed 10 years ago by hyuga <hyugaricdeau@…>

But what about the TracMercurial plugin? In comment:3 you said that this problem seems to happen with it too, in which case it would have nothing to do with Subversion. I haven't tried it myself, so I don't know. Do you still think the issue arises with TracMercurial?

comment:22 Changed 10 years ago by anonymous

Cc: kirean@… added

comment:23 Changed 10 years ago by anonymous

Cc: kamil@… added

comment:24 Changed 10 years ago by anonymous

Cc: gustavo@… added

comment:25 Changed 10 years ago by osimons

#5782 may be closely related to this issue.

comment:26 in reply to:  21 Changed 10 years ago by Christian Boos

Replying to hyuga <hyugaricdeau@gmail.com>:

But what about the TracMercurial plugin? In comment:3 you said that this problem seems to happen with it too, in which case it would have nothing to do with Subversion.

It happens as well with Mercurial, but to a lesser extent than with Subversion. Subversion makes the effects of the underlying problem very apparent, because each request will consume a lot of memory.

I still don't have a solution, but I made some progress: when there's an exception, it looks like the frame information is being referenced from somewhere. It's not due to cycles or uncollectable objects (gc.garbage remains empty, also with gc.DEBUG_UNCOLLECTABLE set). I'm not even sure that's the only reason, but that's definitely one reason, as I've got a constant +10M increase when repeating a given failing request, and by looking at the objects that are accumulated, they really correspond to what's reachable from the stack (and this is using Mercurial as a backend, even).

/me still deeply puzzled and looking…

comment:27 Changed 10 years ago by Christian Boos

It's not the logger: modifed trac/log.py with

class FakeLogger(object):
    def info(self, *args, **kwargs):
        pass
    def debug(self, *args, **kwargs):
        pass
    def warn(self, *args, **kwargs):
        pass
    def warning(self, *args, **kwargs):
        pass
    def error(self, *args, **kwargs):
        pass
    def exception(self, *args, **kwargs):
        pass

def logger_factory(logtype='syslog', logfile=None, level='WARNING',
                   logid='Trac', format=None):
    return FakeLogger()

Same results.

comment:28 Changed 10 years ago by Christian Boos

It's not the somewhat complex error handling in trac/web either. If I silence the error that I get during template generation like that:

  • trac/web/main.py

    diff -r 419b87217da6 trac/web/main.py
    a b class RequestDispatcher(Component):  
    206206                        template, data, content_type = \
    207207                                  self._post_process_request(req, *resp)
    208208                        if 'hdfdump' in req.args:
    209                             req.perm.require('TRAC_ADMIN')
     209                            #req.perm.require('TRAC_ADMIN')
    210210                            # debugging helper - no need to render first
    211211                            from pprint import pprint
    212212                            out = StringIO()
    213213                            pprint(data, out)
    214214                            req.send(out.getvalue(), 'text/plain')
    215215                        else:
    216                             output = chrome.render_template(req, template,
     216                            try:
     217                                output = chrome.render_template(req, template,
    217218                                                            data, content_type)
     219                            except:
     220                                pass
     221                            output = '****'
    218222                            # Give the session a chance to persist changes
    219223                            if req.session:
    220224                                req.session.save()

Same results.

So probably back to genshi?

comment:29 Changed 10 years ago by ilias@…

Cc: ilias@… added

comment:30 Changed 10 years ago by Christian Boos

The error I'm referring to in comment:26 is an exception raised in a WikiSyntaxProvider, inside a wiki formatting done while rendering the timeline.

The corresponding backtrace for that error can be seen in attachment:timeline_rendering_backtrace.log

Now, if I catch the exception before it goes through the template rendering, the memory usage is OK again. The live object count, as measured by doing a len(gc.get_objects()) after a gc.collect(), remains constant at 62642, after successive requests.

If I raise an exception on purpose after the template rendering, there's also no increase in memory usage.

But if the original exception is not caught, the live object count increase by ~50000 after each call.

I'll try to locate the exact point in the backtrace which cause the memory retention.

Changed 10 years ago by Christian Boos

The backtrace for the exception discussed in comment:23

comment:31 Changed 10 years ago by anonymous

Cc: mbertheau@… added

comment:32 Changed 10 years ago by anonymous

Cc: benjixx@… added

comment:33 Changed 10 years ago by anonymous

Cc: manuzhai@… added

comment:34 Changed 10 years ago by anonymous

Cc: etix@… added

comment:35 Changed 10 years ago by anonymous

Cc: bdelaage@… added

comment:36 Changed 10 years ago by ilias@…

May I suggest to the team the following:

  • provide a wiki page which summarizes the findings
  • provide some debug-code which prints mem-usage data into the log file, thus test give specific results.
  • users can publish the test results (from different systems) on the wiki page

I can run the tests on 2 servers (one has a mem problem, the other one seems(!) not to have it). Possibly the configuration of the servers is the answer.

comment:37 Changed 10 years ago by anonymous

FC5, r6432, tracd, svn, running for 3 days now: VSZ 123M, RSS 85M.

comment:38 Changed 10 years ago by Christian Boos

To summarize the issues at this point:

  1. prefer Python 2.5 over Python 2.4, the former has much improved over the latter at giving back memory to the system
  2. the alternative _history implementation helps a bit, when Subversion is used (fix-_history.diff), but not that much
  3. doing frequent full garbage collections helps a lot (explicit_gc.diff)
  4. there's a major issue going on as soon as an exception is raised when evaluating a Genshi Expression or Suite.

For testing 4. one can use the following set of patches (against latest trunk - r6493):

Also attached is a log file showing the accumulation of objects when testing the timeline expression bug. The VM memory usage grows approximately by 125Mb for each request (/timeline, daysback=10, date=2005-02-08, repository_dir=trac repository, repository_type=direct-svnfs).

That huge amount of memory can be attributed to an excessive reuse of the main Pool by Repository.get_changesets() when direct-svnfs is used. The real problem however is that this pool is not collected because it's kept alive. It's kept alive because there are 9 sub-pools kept alive, each corresponding to the 9 changesets stored in the 'data' object… which is itself kept alive for some reason.

I still haven't figured out why this happens. One possible approach to fix the problem would be to catch exceptions at the Base._eval level, and propagating that up as an EXCEPTION event. At an upper level, probably just before serialization, we can try to raise the exception again without running into the memory issue (it seems that exceptions raised at upper levels don't trigger the issue). More work is needed here.

Changed 10 years ago by Christian Boos

gives a cleaner picture of real memory usage

Changed 10 years ago by Christian Boos

Attachment: single-threaded-tracd.diff added

disables multi-threading for tracd

Changed 10 years ago by Christian Boos

Attachment: fix-_history.diff added

better Pool usage in SubversionRepository._history()

Changed 10 years ago by Christian Boos

Attachment: explicit_gc.diff added

call gc.collect() after every request

Changed 10 years ago by Christian Boos

Attachment: debug-mem-growth.diff added

same as explicit_gc.diff + some stats about accumulated objects

Changed 10 years ago by Christian Boos

Attachment: timeline-expression-bug.log added

log a session with timeline-expression-bug.diff active

Changed 10 years ago by Christian Boos

raise a ValueError from an Expression evaluation immediately during timeline rendering

Changed 10 years ago by Christian Boos

Attachment: timeline-suite-bug.diff added

raise a ValueError from a Suite execution immediately during timeline rendering

comment:39 Changed 10 years ago by Christian Boos

More findings:

  • among the frames in the alive objects, it seems that none refers to the data dictionary
  • when using Genshi:source:tags/0.4.4, the bug is still present but behaves differently: memory goes from 150Mb to 275Mb exactly like before, then stays at 275Mb for subsequent repeated requests (even slightly different, like when changing the data), i.e. it stops growing after the second failure.

See timeline-expression-bug-with-genshi-0.4.4.log.

Was anonymous also using Genshi 0.4.4?

Changed 10 years ago by Christian Boos

log a session with timeline-expression-bug.diff active - this time using Genshi 0.4.4 instead of [G789]

comment:40 in reply to:  39 Changed 10 years ago by Christian Boos

Correcting the above, I meant:

even slightly different, like when changing the date

I also checked that [G799] makes no difference with [G789].

The difference of behavior between 0.4 and 0.5 can actually be explained by [G770]… Reverting that changeset on trunk makes it behave like 0.4. Of course, we known that we can't just revert that change (as this would re-introduce nasty multi-threading issues - #G158) and anyway this wouldn't remove the issue completely as the memory increment at the second error is still there, but I think this narrows the issue even more.

comment:41 in reply to:  37 ; Changed 10 years ago by mbertheau@…

I'm anonymous. We're using Genshi 0.5dev [G789] and python 2.4.2.

comment:42 in reply to:  41 Changed 10 years ago by Christian Boos

Replying to mbertheau@gmail.com:

Great, so you just probably never had any exception raised within an Expression in a template :-)

comment:43 Changed 10 years ago by anonymous

Cc: jdthood@… added

(I try again to add my e-mail address to the cc field. If I just add the address then the submission gets rejected as spam. If you are reading this then it means that it helped to add a comment.)

comment:44 Changed 10 years ago by anonymous

Cc: mailinglist@… added

adding my email so I can be notified when this is fixed.

comment:45 Changed 10 years ago by alicebot@…

Cc: alicebot@… added

comment:46 Changed 10 years ago by anonymous

Cc: tereutes@… added

comment:47 Changed 10 years ago by Christian Boos

Status: newassigned

In r6593, I added explicit garbage collection after each request. This should already help a lot (see comment:17).

In r6594, I added the SubversionRepository._history optimization, which should also help a bit (comment:17 again).

Finally, using Python 2.5 in preference to 2.4 will also help greatly (see comment:18).

Some feedback would be useful, in order to see how well this performs in practice.

What's left to be done:

  • workaround the real memory leak that happens when an exception is raised from inlined expressions in Genshi templates (comment:38, comment:39 and comment:40).
  • further improve the memory usage in the Subversion backend (#5782)

comment:48 in reply to:  47 ; Changed 10 years ago by ilias@…

Replying to cboos:

In r6593, I added explicit garbage collection after each request. This should already help a lot (see comment:17).

What effects has this on performance?

(I like to repeat my plea in comment:36, as the information gets lost within the ticket comments)

comment:49 in reply to:  48 Changed 10 years ago by Christian Boos

Replying to ilias@lazaridis.com:

Replying to cboos:

In r6593, I added explicit garbage collection after each request. This should already help a lot (see comment:17).

What effects has this on performance?

Approximately a 10% slowdown, IIRC.

comment:50 Changed 10 years ago by daniel@…

Cc: daniel@… added

comment:51 Changed 9 years ago by Christian Boos

Concerning the current trunk - [6603], I'm still interested in getting feedback in term of memory usage.

Some news concerning the Genshi issue: I've not yet been able to trigger the problem purely at the Genshi level, so I'll go from the Trac side, stripping things away.

So far, I've found that the problem is not exactly tied to an exception being raised from an expression.

Having even an UndefinedError triggered from inside the py:for block in the timeline template triggers the same problem, with the same memory usage as in attachment:timeline-expression-bug.log (first request goes up to 146Mb, then 271Mb, then 396Mb, etc.)

  • trac/timeline/templates/timeline.html

     
    3232      </form>
    3333
    3434      <py:for each="day, events in groupby(events, key=lambda e: format_date(e.date))">
     35      ${say.hello}
    3536        <h2>${day}: ${day == today and 'Today' or day == yesterday and 'Yesterday' or None}</h2>
    3637        <dl>
    3738          <py:for each="event in events"

Now, changing that groupby to a simple day in event loop, things are even more interesting:

  • trac/timeline/templates/timeline.html

     
    3131       </div>
    3232      </form>
    3333
    34       <py:for each="day, events in groupby(events, key=lambda e: format_date(e.date))">
     34      <py:for each="day in events">
     35      ${say.hello}
    3536        <h2>${day}: ${day == today and 'Today' or day == yesterday and 'Yesterday' or None}</h2>
    3637        <dl>
    3738          <py:for each="event in events"

Here, the memory usage starts with 146Mb, then 271Mb, then stays at 271Mb for every other repeated requests.

Seeing that, I've modified timeline.html even more drastically:

<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:py="http://genshi.edgewall.org/"
      xmlns:xi="http://www.w3.org/2001/XInclude">
  ${say.hello}
</html>

The memory usage starts with 145Mb, then 270Mb, then stays at 270Mb.

Now, thinking again even the following results look suspicious:

<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:py="http://genshi.edgewall.org/"
      xmlns:xi="http://www.w3.org/2001/XInclude">
</html>

i.e. dummy template, no error generated: memory starts and stays at 145Mb. Why? This is with Python 2.5, the memory allocated during this request should be returned to the system.

Note that with an empty template, I get a TemplateSyntaxError: no element found error and the memory usage is the same as above, starts and stays at 145Mb.

… to be continued

comment:52 Changed 9 years ago by Jeroen Ruigrok van der Werven

Unless the is a __del__ or something else involved that keeps a reference and thus causing the garbage collector to not be able to free it. Might such be the cause here?

comment:53 in reply to:  51 ; Changed 9 years ago by Alec Thomas

Replying to cboos:

Seeing that, I've modified timeline.html even more drastically: … The memory usage starts with 145Mb, then 270Mb, then stays at 270Mb.

This doesn't seem very surprising to me. TemplateLoader caches the template stream/tree on load, so an initial jump is to be expected. There are also the various caches internal to Trac that get populated on first-use. None of these allocations would be returned to the system.

Unless I'm misunderstanding your point?

Now, thinking again even the following results look suspicious: … i.e. dummy template, no error generated: memory starts and stays at 145Mb. Why? This is with Python 2.5, the memory allocated during this request should be returned to the system.

This kind of confirms my previous point: the template fails to load and is thus not cached and no extra memory consumed.

comment:54 in reply to:  53 Changed 9 years ago by Christian Boos

Replying to athomas:

Replying to cboos:

Seeing that, I've modified timeline.html even more drastically: … The memory usage starts with 145Mb, then 270Mb, then stays at 270Mb.

This doesn't seem very surprising to me. TemplateLoader caches the template stream/tree on load, so an initial jump is to be expected. There are also the various caches internal to Trac that get populated on first-use. None of these allocations would be returned to the system.

Unless I'm misunderstanding your point?

You misunderstood. The template loader is supposed to cache the template, not the data, AFAICT. The 145Mb is a very high number which corresponds to the specific test case (3 years of changeset timeline data, using direct-svnfs on a t.e.o clone, see comment:38). The cost of the initial load you described is an order of magnitude smaller than what is observed here.

So again:

  1. starting Trac → 10Mb
  2. make an initial "cheap" timeline query → 22Mb
    that's what you were talking about, and yes, we're probably not able to go below that ever after
  3. timeline?from=2%2F25%2F2005&daysback=3&changeset=on query → 146Mb
    I think it's already wrong that we stay that high, most of the memory associated to the data should have been given back to the system
  4. repeat request 3. → 271Mb
    now in all cases, this is totally wrong
  5. repeat request 3., what we'll see now depends on the actual content of the timeline template:
    1. <py:for ... groupby()>:
      each time, the memory usage will increase by another 125Mb. Again, this is totally wrong
    2. <py:for each="day in events"> or <html>${say.hello}</html>-only template:
      no matter how many times we repeat the request, the memory stays at 271Mb (but why eliminating groupby makes such a big difference in behavior?)

Even if there was a bug in the template loader, where somehow the data would sneak in, I've eliminated that possibility by deleting the template loader after each use. where that reference is kept is still unknown at this point (and no, it's not uncollectable, see comment:26).

comment:55 Changed 9 years ago by Alec Thomas

Thanks for the clarification. The piece of information I was missing was the increasing daysback period.

Does the memory accumulation occur for other timeline providers, or only versioncontrol?

comment:56 in reply to:  55 Changed 9 years ago by Christian Boos

Replying to athomas:

Does the memory accumulation occur for other timeline providers, or only versioncontrol?

Yes, same thing without the changesets, though it's then less visible, of course. Doing a timeline query of the whole history of my test Trac without the changesets consistently increases the memory usage of 2.7Mb after each repeated request.

I've focused on the timeline so far, but I think the issue is the same with the other modules. It's quite likely that given the absence of groupby or similar in the other templates, we'll rather see the 5.2. behavior there (in line with early observations 7 and 12).

comment:57 Changed 9 years ago by Alec Thomas

I've done some testing at home and I see the same behaviour as you, where the first query consumes N MB of memory, but subsequent identical queries do not increase the memory usage. I also don't see any uncollectable objects at all. I was slightly confused by the terminology used in the GC module documentation, but some searching revealed the difference between unreachable and uncollectible.

The combination of these factors indicates to me that the memory is being released by the GC but continues to be held by Python's pool allocator or the system's libc (that is, free() is not returning the memory to the O/S). This is apparently fairly "normal" for Python and there's some discussion of it here.

I've not been able to replicate the behaviour you've seen with groupby() where the memory usage continues to climb. I'm running Python 2.5 and Genshi trunk.

comment:58 in reply to:  57 Changed 9 years ago by Christian Boos

Replying to athomas:

I've not been able to replicate the behaviour you've seen with groupby() where the memory usage continues to climb. I'm running Python 2.5 and Genshi trunk.

Just to be sure: are you triggering an exception from within the <py:for> element, like shown in the diffs shown in comment:51?

comment:59 Changed 9 years ago by Jeroen Ruigrok van der Werven

Christian,

thanks to Alec I received a trac-hacks trac.db and I did some timeline testing with trunk (r6612) and Genshi trunk @ rev 800. When I request 300 days I get (SIZE, RES as top describes): 38684K 33980K and it is stable after (forced) reloads. Using 400 I get: 38684K 33988K and stable, 500: 46108K 41536K and 600 days I get: 56088K 51524K. In all these cases I do not have a repository active that accompanies this trac.db. Alec said that his testing, which is with the repository for all I understood, caused him to rise to 300 MB for the timeline for 600 days.

My machine is a FreeBSD 6.3-STABLE one with Python 2.5.1.

I hope this data helps you a bit.

comment:60 Changed 9 years ago by Alec Thomas

Ah, no, I'm not triggering an exception. I'll try that tonight.

comment:61 Changed 9 years ago by Jeroen Ruigrok van der Werven

With the addition of say.hello into the timeline.html I manage to leak about 5 MB per hit on the timeline for 600 days back using a dump as provided by Alec.

80756K 75352K
85752K 80200K
90476K 85032K

as reported by top.

comment:62 Changed 9 years ago by Jeroen Ruigrok van der Werven

When I get rid of the groupby() my top reports this:

36468K 31068K
36628K 31228K
36468K 31068K
36628K 31228K
36468K 31068K

So something gets lost due to the groupby() when an error occurs.

comment:63 Changed 9 years ago by Alec Thomas

I've just confirmed the same leak as comment:51 by adding a reference to ${say.hello}.. The GC is not reporting any weirdness either. I also wrote a small script with the same general logic, but was unable to make Genshi by itself leak.

I also tried uncommenting the entirety of Trac's error handling except block and the leak persisted.

Chris also suggested commenting out the stream filters and using list(stream) instead of stream.render(). We tried a few other things, such as disabling genshi.speedups.

The last thing we tried was using the Python groupby() from trac.util. This fixed the problem! Test case here seems to replicate it.

We're still looking into it.

comment:64 in reply to:  63 ; Changed 9 years ago by Christian Boos

Replying to athomas:

I also wrote a small script with the same general logic, but was unable to make Genshi by itself leak. … Test case here seems to replicate it.

Replicate it or not replicate it? The two sentences above seem contradictory. Anyway, for me that script doesn't leak once you add explicit gc (http://paste.pocoo.org/show/30691/).

The last thing we tried was using the Python groupby() from trac.util. This fixed the problem!

One part of the problem, as there's still something weird going on even when groupby is not used at all (see comment:54, point 4. and 5.2).

We're still looking into it.

Great!

comment:65 in reply to:  64 Changed 9 years ago by Alec Thomas

Replying to cboos:

Replicate it or not replicate it? The two sentences above seem contradictory.

The sentences were referring to different test scripts ;)

Anyway, for me that script doesn't leak once you add explicit gc (http://paste.pocoo.org/show/30691/).

Right we tried that too, but adding an explicit gc.collect() is not something we want to do and doesn't solve the problem.

One part of the problem, as there's still something weird going on even when groupby is not used at all (see comment:54, point 4. and 5.2).

Yeah, turns out that was a dead end. Latest test case is here.

Feel free to jump on #python-genshi if you're interested.

comment:66 Changed 9 years ago by Christian Boos

So there's indeed a Genshi-only issue explaining the problem we see with the exception raised from an expression in a template, see #G190.

comment:67 Changed 9 years ago by Christian Boos

As #G190 blamed itertools.groupby, a simple workaround could be made at the Trac level, committed in [6637].

That is not to say that this issue is fixed, though. Trac still seems to be leaking memory.

Given the somewhat large number of people on CC, I'd like to get some feedback to see if the situation has improved with [6637].

comment:68 in reply to:  67 Changed 9 years ago by pierreroth64@…

Replying to cboos:

As #G190 blamed itertools.groupby, a simple workaround could be made at the Trac level, committed in [6637].

That is not to say that this issue is fixed, though. Trac still seems to be leaking memory.

Given the somewhat large number of people on CC, I'd like to get some feedback to see if the situation has improved with [6637].

I did some tests with r6647. My use case is browsing a directory with about 200 small files and the improvement mentioned in comment:67 is obvious :

memory usageTrac 0.11b1Trac r6647
before browse clic 78 MB 46 MB
after browse clic 146 MB < 48 MB
delta ~ 68 MB delta < 2 MB

So this is a huge improvement since we can see a little memory leak (< 2MB) with r6647 compared to the 68 MB of 0.11b1 !

comment:69 Changed 9 years ago by Christian Boos

Well, you can only call it a leak if you find a way to make the memory steadily increase or reach some fixed high bounds it should not reach. In the before/after test case you describe above, the increase you're seeing is probably only the browser.html parsed template being cached, and that's normal. Try browsing some more, like recursively getting all the files out and see how it behaves facing the load.

comment:70 in reply to:  69 Changed 9 years ago by pierreroth64@…

Replying to cboos:

Well, you can only call it a leak if you find a way to make the memory steadily increase or reach some fixed high bounds it should not reach. In the before/after test case you describe above, the increase you're seeing is probably only the browser.html parsed template being cached, and that's normal. Try browsing some more, like recursively getting all the files out and see how it behaves facing the load.

It is what I did and continuing browsing my tree makes my httpd process reach more than 1 GByte of RAM with 0.11b1 while the r6647 version stays at about 250 MBytes…

comment:71 Changed 9 years ago by anonymous

Milestone: 0.110.10.5
Resolution: fixed
Status: assignedclosed
Version: 0.11b1devel

comment:72 Changed 9 years ago by Jeroen Ruigrok van der Werven

Milestone: 0.10.50.11
Resolution: fixed
Status: closedreopened

comment:73 Changed 9 years ago by Jeroen Ruigrok van der Werven

Version: devel0.11b1

Changed 9 years ago by Christian Boos

This patch helps to reduce the maximal memory usage.

comment:74 Changed 9 years ago by Christian Boos

Well, the attachment:render-using-StringIO-rr6677.patch doesn't help as much as I would have liked. It's seems to only marginally help.

Also, I'm currently looking at a non-Genshi issue: repeatedly querying a /report/1?format=csv on a test environment (with 600/700 tickets in the report) leaks about 50-60 kB each time, and there seem to be no upper bound.

comment:75 Changed 9 years ago by gt4329b@…

Cc: gt4329b@… added

comment:76 Changed 9 years ago by Ammon Lauritzen <ammon@…>

Cc: ammon@… added

comment:77 Changed 9 years ago by chris@…

Cc: chris@… added

comment:78 Changed 9 years ago by Christian Boos

Those who are still having a high memory usage are invited to check the latest changes in Trac (r6756) and Genshi ([G816]), as we think that very substantial progresses have been made.

Please report your findings!

comment:79 Changed 9 years ago by xarquonster@…

Cc: xarquonster@… removed

comment:80 Changed 9 years ago by kts_66@…

Cc: kts_66@… added

comment:81 Changed 9 years ago by bubbamagic87@…

Cc: bubbamagic87@… added

I would like to be added to the CC, but the spam filter keeps blocking my ticket modifications when no text is in the comment box. Below is what it keeps telling me:

Submission rejected as potential spam (SpamBayes determined spam probability of 90.82%)

comment:82 Changed 9 years ago by gustav.svensson@…

Cc: gustav.svensson@… added

comment:83 Changed 9 years ago by anonymous

Summary: Memory leak using apacheMemory leak using apace

comment:84 Changed 9 years ago by anonymous

Summary: Memory leak using apaceMemory leak using apache

comment:85 Changed 9 years ago by Geoff Reedy <gereedy@…>

I just found out about a python heap profiler called heapy (part of http://guppy-pe.sourceforge.net/). Has anyone tried using it to track down the cause of the reference leaks? I would give it a go myself, but I have not been able to produce any leaks on my configuration.

Also, there haven't been any responses since comment:78. It would be nice to have an update from people who have encountered the problem, maybe the issue is now resolved.

comment:86 Changed 9 years ago by gustav.svensson@…

I think I suffer from this, but I might not have enough memory to test it. I just know my VPS hangs after some clicking in Trac. I have 256 MB RAM. I am running 0.11b2.

comment:87 Changed 9 years ago by Christian Boos

Severity: blockercritical

While there remains a few memory issues, I think we can no longer consider it to be a blocker. Latest trunk of both Trac and Genshi should now behave sensibly, even more so when using Python 2.5 (or later).

There's still the issue with report .csv which remains unexplained at this point (comment:74).

comment:88 Changed 9 years ago by keith@…

I can confirm that this bug exists in 0.10, but only when a person attempts to view a specific svn revision number on a trac environment that has indexed an svn repository from a folder other than the root of the repository.

For example, we have a repository in /foo and we index beginning with a subfolder /foo/bar (done by setting repository_dir to /foo/bar in trac.ini — a step I've read is a suitable way of ensuring trac users have access to only one portion of the repository).

Once that is complete, accessing a specific revision of a file in the Revision Log will cause apache to consume nearly 200 megabytes of memory. Do this a couple of times, and your system will eventually run out of memory. In addition, it is painfully slow to view a revision of a file — the load time is several seconds slower than normal.

The solution I've found for this is to index the repository from the root folder /foo, and then write up a svn permissions file (called svn-perm) that looks like this:

foo

  • = r

foo/private

  • =

foo/bar

  • = r

Gives read access to /foo/bar and /foo, and restricts /foo/private for our trac environment. Then set authz_file = /path/to/svn-perm in trac.ini.

The memory leaks will stop, while still restricing access.

comment:89 Changed 9 years ago by frekko@…

Although the situation became much better with rc1 this bug is still present and brings my server to it's knees about once every 14 days (instead of once a day) with only very moderate traffic. I _really_ hope you fix this for the final release. I'll be glad to provide you with any information about my setup if required.

comment:90 Changed 9 years ago by trac@…

Cc: trac@… added

comment:91 in reply to:  87 Changed 9 years ago by anonymous

Resolution: fixed
Status: reopenedclosed

Replying to cboos:

There's still the issue with report .csv which remains unexplained at this point (comment:74).

That specific leak can be attributed to the usage of izip and [7133] fixes it.

I also committed two related changes, [7134] and [7135] which reduced the need for itertools functions (and make the code more readable IMO).

For people still experiencing increasing memory usage, please create new detailed issues, once you make sure that you're using the latest version of the software (both Trac 0.11-stable r7135 and upward and Genshi trunk [G851] or 0.5 and upward).

comment:92 Changed 9 years ago by anonymous

So anonymous was cboos :-)

comment:93 Changed 9 years ago by gustavo@…

Cc: gustavo@… removed

comment:94 Changed 9 years ago by etix@…

Cc: etix@… removed

comment:95 Changed 9 years ago by anonymous

Resolution: fixed
Severity: criticalblocker
Status: closedreopened

This bug is NOT fixed.

comment:96 Changed 9 years ago by Jeroen Ruigrok van der Werven

Just saying so without providing additional details does not help.

comment:97 Changed 9 years ago by Christian Boos

Resolution: fixed
Severity: blockercritical
Status: reopenedclosed

Once again:

For people still experiencing increasing memory usage, please create new detailed issues, once you make sure that you're using the latest version of the software (both Trac 0.11-stable r7135 and upward and Genshi trunk [G851] or 0.5 and upward).

And now, please let this ticket closed, it has reached a critical size beyond which it's not practical to work with anymore.

comment:98 Changed 9 years ago by ilias@…

As suggested in comment:36, you should place the relevant information within a separate document.

This is a highly critical issue and you should treat it with much more care and of course respect against your user-base.

I agree with "anonymous" (which btw. should not reopen tickets as 'anonymous'): this ticket is not fixed, the "severity" of this ticket was (is) of course "blocker".

comment:99 Changed 9 years ago by Justin

Resolution: fixed
Severity: criticalblocker
Status: closedreopened

Has there been any evidence to suggest this bug has been fixed?

I don't believe so and as such, will reopen the ticket (as opposed to what comment 97 has to say).

comment:100 in reply to:  99 Changed 9 years ago by Alec Thomas

Resolution: fixed
Status: reopenedclosed

Replying to Justin:

Has there been any evidence to suggest this bug has been fixed?

Yes, there's plenty of evidence, thoroughly documented in the history of this ticket.

We've found several bugs in both Python for which we've implemented workarounds in Genshi and Trac. We've also found some actual bugs in Genshi and Trac themselves. That's why the ticket was closed.

To have these fixes you must be running "both Trac 0.11-stable r7135 and upward and Genshi trunk [G851] or 0.5 and upward" as mentioned by cboos.

I don't believe so and as such, will reopen the ticket (as opposed to what comment 97 has to say).

What version of Trac and Genshi are you running? Specifically, we need:

Dependency What we need
Trac Subversion revision.
Genshi Subversion revision.
Python database adapter Name and version
Version control backend Type, version of Trac plugin and VC version.
RAM How much do you have?
Operating System Name and release.
Apache/Lightty/etc. Version of the respective web server you are running.

The last three people to reopen this bug have not provided any of this information whatsoever.

Without being sure that you're running the versions of Genshi and Trac with the fixes, we can't commit the large amount of time required to track down what may not even be an issue any longer.

If you do provide detailed version information and show that the bug is still present, we'll continue to look. Until then, I'm closing this ticket, again.

comment:101 Changed 9 years ago by Christian Boos

For svn users, see also ticket:6588#comment:3

comment:102 Changed 8 years ago by elephant.man@…

Resolution: fixed
Status: closedreopened
Version: 0.11b10.11.5

I'm using Trac 0.11.5 (r8362) with Genshi 0.5.1, and this issue still seems real to me.

comment:103 Changed 8 years ago by Christian Boos

Resolution: fixed
Status: reopenedclosed

Even if that's actually the case, please don't reopen this ticket again, it is really too long at this point. Rather create a new one. I'll take care of linking the other ticket(s) from here.

Be sure to provide some numbers in your new ticket, and also be sure you have all the plugins disabled when doing your measures.

Thanks.

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain Christian Boos.
The resolution will be deleted.
to The owner will be changed from Christian Boos to the specified user.

Add Comment


E-mail address and name can be saved in the Preferences .

 
Note: See TracTickets for help on using tickets.