Opened 17 years ago
Closed 15 years ago
#6614 closed defect (fixed)
Memory leak using apache
Reported by: anonymous
Owned by: Christian Boos
Priority: highest
Milestone: 0.11
Component: version control
Version: 0.11.5
Severity: blocker
Keywords: memory
Cc: ilias@…, pierreroth64@…, kirean@…, kamil@…, mbertheau@…, benjixx@…, manuzhai@…, bdelaage@…, jdthood@…, mailinglist@…, alicebot@…, tereutes@…, daniel@…, gt4329b@…, ammon@…, chris@…, kts_66@…, bubbamagic87@…, gustav.svensson@…, trac@…
Branch:
Release Notes:
API Changes:
Internal Changes:
Description
When I browse my source, my apache process increases its memory usage dangerously (more than 1 GB!), never freeing it. It is particularly visible when trying to display a directory content with several entries in it.
I'm currently using:
- Apache 2.0.61
- Python 2.4
- Trac 0.11b1
Attachments (11)
Change History (114)
follow-up: 2 comment:1 by , 17 years ago
comment:2 by , 17 years ago
Replying to hyuga <hyugaricdeau@gmail.com>:
Hard to say without more detail, but could this be a duplicate of #6581? Try the patch suggested there and see if that helps.
Nevermind, apparently the patch won't help with the high memory usage issue.
comment:3 by , 17 years ago
Component: general → version control
Keywords: memory added
Owner: changed
Priority: normal → highest
Replying to anonymous:
It is particularly visible when trying to display a directory content with several entries in it.
Thanks for the hint. For some reason, I was focusing on the rendering of file in the TracBrowser, but that seems to work fine whereas the rendering of a folder indeed appears to leak.
follow-up: 5 comment:4 by , 17 years ago
Looks like this is also present when using TracMercurial and none of the fancy new stuff from 0.11 (no quickjump, no color scale). Maybe a <py:include> issue?
comment:5 by , 17 years ago
Replying to cboos:
Looks like this is also present when using TracMercurial and none of the fancy new stuff from 0.11 (no quickjump, no color scale). Maybe a <py:include> issue?
Just out of curiosity, why do you think it's an issue with includes, or even anything having to do with Genshi?
comment:6 by , 17 years ago
Cc: added
follow-up: 12 comment:7 by , 17 years ago
From my various tests, it seems that Trac memory is growing as long as new data is processed.
For example, when requesting 6000 different changesets on a t.e.o clone, the memory slowly grows from 100 to 200 Mbytes. But then, when requesting the same range of changesets again, there's no extra growth, i.e. memory usage is still 200 Mbytes. This is inconsistent with a "classical" memory leak.
So my current take on the issue is that perhaps a lot of strings get interned. From a superficial inspection, I couldn't find the origin of this, and I won't have time in the next days to go more in depth, so maybe someone else can explore that hint.
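As a side note, the interning hypothesis is easy to illustrate: interned strings live in an interpreter-wide table (and in the Python 2 used here they were effectively immortal), so interning lots of distinct repository strings would look exactly like memory that is never given back. A minimal sketch, using the Python 3 spelling sys.intern (Python 2 had a builtin intern()); the 'rev-1234' value is just a made-up example:

```python
import sys

a = ''.join(['rev-', '1234'])   # built at runtime: a fresh string object
b = ''.join(['rev-', '1234'])
assert a == b and a is not b    # equal contents, two distinct objects

ia = sys.intern(a)              # interning maps both names to one canonical
ib = sys.intern(b)              # copy, held in the interpreter-wide table
assert ia is ib
```

If many such strings were interned per request, the table would grow but never shrink, matching the "grows with new data, flat on repeats" pattern described above.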
comment:9 by , 17 years ago
Replying to ThurnerRupert:
how trac and apache are connected? mod_python?
Yes, the problem occurs with mod_python.
comment:11 by , 17 years ago
Happens with tracd as well, so I think this has nothing to do with apache and mod_* …
follow-up: 13 comment:12 by , 17 years ago
Replying to cboos:
From my various tests, it seems that Trac memory is growing as long as new data is processed.
That is what I see too. As I browse new changesets it just keeps growing, but revisiting them makes little difference to memory. I tried a new user, and scoping the changeset display as well, just to test whether variations had any impact.
However, when browsing changesets, I see that a changeset with many diffs - like typically for a branch merge - can increase the memory by 25MB or more in one go. It does not get released again. Viewing a similar-size changeset with only adds (initial import of a new project) leaves no noticeable impact on memory at all.
May the diff rendering code be a starting point for further research?
follow-up: 14 comment:13 by , 17 years ago
Sorry for the late reply - I thought I would be able to locate the precise problem, but as I'm not yet there, I'll post some intermediate findings.
Replying to osimons:
May the diff rendering code be a starting point for further research?
I don't think so: I forgot to mention that for the test mentioned in comment:7, I disabled the diff generation (same goes for the results below).
Here are some more results from my recent tests, using tracd on Linux x86_64, psycopg 2.0.6, latest Trac and Genshi.
It seems that there are actually two issues:
- memory always increases and is never returned to the system: Python 2.5 performs better here, as explained in http://evanjones.ca/memoryallocator
- Subversion bindings, or the way we're using them, cause a serious leak
In more detail:
When doing a wget -r -np -l2 / on a t.e.o clone (no Subversion bindings):
- Python 2.4.4 stabilizes at 303 Mbytes
- Python 2.5.1 stabilizes at 138 Mbytes, with a peak at about 310 Mbytes
"stabilizes" means here that we no longer see an increase in the memory usage when running the same wget command again.
Here we see that 2.5.1 does a better job at memory management, and that the overall memory usage is relatively reasonable. Still a bit high, but I think we can't expect to have less than that.
Now I did the same wget test, this time with the 1.4.3 bindings enabled, using repository_type = direct-svnfs to make things more apparent:
- Python 2.4.4 stabilizes at 3338 Mbytes (yes, 3.3 Gbytes!)
- Python 2.5.1 stabilizes at 2590 Mbytes
So while 2.5 does better, it's still pretty bad. The timeline queries seem to be the most expensive ones, sometimes adding 100 Mbytes each, when the changesets are accessed for the first time.
When using repository_type = svn, those timeline queries don't involve the svn bindings and are fast and lean.
follow-up: 16 comment:14 by , 17 years ago
When using repository_type = svn, those timeline queries don't involve the svn bindings and are fast and lean.
So a workaround is to use repository_type = svn for now, for those of us using 0.11b1?
comment:15 by , 17 years ago
Cc: added
comment:16 by , 17 years ago
Replying to anonymous:
When using repository_type = svn, those timeline queries don't involve the svn bindings and are fast and lean.
So a workaround is to use repository_type = svn for now, for those of us using 0.11b1?
Well, not really: svn is anyway the default, and I doubt the people having the issue reported here had used direct-svnfs. The latter only makes the problem more apparent, but the bindings are nevertheless still used for the former, e.g. for the browser queries.
We currently can't use the cache for everything, but there are still opportunities to use it more (like in #6654). Future work (0.12, 0.13?) should aim at a complete by-pass of the svn bindings, or alternatively, should use the newer ctypes bindings.
follow-up: 18 comment:17 by , 17 years ago
Back to the point: I think I've isolated one of the issues this week-end, related to Repository._history() and the weird behavior of fs.node_history(path), fs.history_prev and fs.history_location, most notably when path is not the root. I wrote an alternative implementation of _history(), using temporary memory pools. At first, it seemed to help a lot, but actually it was the explicit call to gc.collect() that I added at the end of request processing, in order to get more reproducible results, which was the main reason for the memory savings!
In more detail:
With the following change only:

    diff --git a/trac/web/main.py b/trac/web/main.py
    --- a/trac/web/main.py
    +++ b/trac/web/main.py
    @@ -389,6 +389,9 @@ def dispatch_request(environ, start_resp
                 return _dispatch_request(req, env, env_error)
             finally:
                 if env and not run_once:
    +                import gc
    +                gc.collect()
    +                print "Garbage collected..."
                     env.shutdown(threading._get_ident())

     def _dispatch_request(req, env, env_error):
on the test case from comment:13 above, the memory usage drops to 504M (!). More specifically:
wget -r ... | tracd 0.11 | tracd 0.11, with explicit gc.collect()
-------------|------------|----------------------------------------
1st run      | 2040M      | 476M
2nd run      | 3121M      | 504M
3rd run      | 3123M      | 504M
Also, in all my tests, the ThreadingMixIn was removed from the TracHTTPServer, in order to eliminate possible multi-threading issues.
Now, with the alternative implementation for _history(), using temporary pools:

    def _history(self, svn_path, start, end, pool):
        """`svn_path` must be a full scope path, UTF-8 encoded string.
        Generator yielding `(path, rev)` pairs, where `path` is an
        `unicode` object. Must start with `(path, created rev)`.
        """
        if start < end:
            start, end = end, start
        root = fs.revision_root(self.fs_ptr, start, pool())
        tmp1 = Pool(pool)
        tmp2 = Pool(pool)
        history_ptr = fs.node_history(root, svn_path, tmp1())
        cross_copies = 1
        while history_ptr:
            history_ptr = fs.history_prev(history_ptr, cross_copies, tmp2())
            tmp1.clear()
            tmp1, tmp2 = tmp2, tmp1
            if history_ptr:
                path, rev = fs.history_location(history_ptr, tmp2())
                tmp2.clear()
                if rev < end:
                    break
                path = _from_svn(path)
                if not self.authz.has_permission(path):
                    break
                yield path, rev
        del tmp1
        del tmp2
we can get even further savings, i.e.
wget -r ... | tracd 0.11 | tracd 0.11, with explicit gc.collect() and temp pools for _history()
-------------|------------|------------------------------------------------------------------
1st run      | 2040M      | 425M
2nd run      | 3121M      | 453M
3rd run      | 3123M      | 453M
The additional savings (roughly 50M) are a hint that we can probably get even more savings by using temporary pools in other places. Or even that we should rewrite the whole Pool stuff, see #1401…
Of course, those findings are quite counter-intuitive. If everything worked "as expected", the above changes in _history() would actually make no difference in the long run, and the same could be said for the gc.collect().
I think that this shows that there's definitely something wrong going on with the management of the pools, inside the bindings.
This still doesn't explain why it seems this is a 0.11 issue, as the svn_fs code hasn't changed that much since 0.10.5dev. I should probably make the same test on a 0.10.5dev install as well.
to be continued…
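The dramatic effect of the explicit gc.collect() is consistent with how CPython frees memory: most objects die immediately via reference counting, but reference cycles wait for the cyclic collector, and between automatic collections they pile up. A small sketch of that effect (the Node class is made up for illustration, not Trac code):

```python
import gc

class Node(object):
    pass

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a   # cycle: refcounting alone can never free these

gc.disable()                  # simulate long gaps between automatic collections
gc.collect()
baseline = len(gc.get_objects())
for _ in range(1000):
    make_cycle()              # each call leaves an unreachable cycle behind
grown = len(gc.get_objects())
freed = gc.collect()          # an explicit collection reclaims them all at once
gc.enable()

assert grown > baseline       # cyclic garbage accumulated while gc was off
assert freed > 0              # gc.collect() found and freed the cycles
```

If each request creates cycles (and the svn binding pools sit inside such cycles), calling gc.collect() at the end of every request caps the backlog instead of letting it grow for minutes between automatic collections.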
comment:18 by , 17 years ago
To complement comment:17 with testing results using python 2.5:
wget -r ... | tracd 0.11 | tracd' 0.11 - 2.4 | tracd' 0.11 - 2.5
-------------|------------|--------------------|-------------------
1st run      | 2040M      | 425M               | 277M
2nd run      | 3121M      | 453M               | 294M
3rd run      | 3123M      | 453M               | 288M
where tracd' means "with explicit gc.collect() and temp pools for _history()".
Also, in case that wasn't clear, the wget -r -np -l2 / command is run 3 times in succession, without restarting tracd.
comment:19 by , 17 years ago
Severity: normal → blocker
comment:20 by , 17 years ago
I repeated the same test using several variants of 0.10.5dev tracd, using Python 2.4:
wget -r ... | tracd (1) | tracd +gc (2) | tracd +0.11 (3) | tracd +0.11 +gc (4) | tracd +0.11 +tmp (5) | tracd +0.11 +tmp +gc (6)
-------------|-----------|----------------|------------------|----------------------|-----------------------|--------------------------
1st run      | 124M      | 120M           | 117.8M           | 113M                 | 118M                  | 115M
2nd run      | 125.4M    | 122.4M         | 119M             | 117.4M               | 119M                  | 118.6M
3rd run      | 125.4M    | 122.4M         | 120.8M           | 117.4M               | 119M                  | 118.6M
Notes:
1. straight from source:/branches/0.10-stable@6401
2. same as 1. + the gc.collect() after every request
3. same as 1. + svn_fs.py patched with the 0.11 changes.
   I verified that (3) doesn't leak, as it stabilizes at 120.8M for a 4th run.
4. same as 3. + the gc.collect() after every request
5. same as 3. + the temp pools for _history()
6. same as 5. + the gc.collect() after every request
From those numbers, it looks like the 0.11 changes are not harmful; they even seem to reduce the memory usage a bit. It's interesting to see that the temp pool optimizations are not worth doing (3 better than 5), unless the explicit call to gc.collect() is also done (6 better than 3). However, the explicit gc.collect() call alone is better than combined with the tmp pool optimization! (4 better than 6). Doing only the gc.collect() call on stock 0.10.5dev doesn't buy much (2 vs. 1).
Also, besides the memory issue, 0.10.5dev is a lot faster than 0.11: one wget run takes about 4 minutes for 0.10.5dev (even 3m 20s when not doing the systematic gc.collect() call), versus 33 minutes for 0.11. But that's old news.
Summary of the findings so far:
- From the present results, it seems that the svn changes alone have no real impact on the memory.
- From comment:13, we can see the big difference it makes for 0.11 when running with or without the bindings
- From comment:18, we can see that thanks to some specific optimizations, the huge memory requirements added by the bindings can be lowered a lot.
Surprisingly enough, the main optimization is simply to call gc.collect() at regular intervals. Also interesting is the fact that the temp pool optimization (the changes to SvnRepository._history() described in comment:17) seems to give an additional 50M of savings in 0.11, whereas we've seen here that those savings should really be marginal, if any (4 vs 6).
follow-up: 26 comment:21 by , 17 years ago
But what about the TracMercurial plugin? In comment:3 you said that this problem seems to happen with it too, in which case it would have nothing to do with Subversion. I haven't tried it myself, so I don't know. Do you still think the issue arises with TracMercurial?
comment:22 by , 17 years ago
Cc: added
comment:23 by , 17 years ago
Cc: added
comment:24 by , 17 years ago
Cc: added
comment:26 by , 17 years ago
Replying to hyuga <hyugaricdeau@gmail.com>:
But what about the TracMercurial plugin? In comment:3 you said that this problem seems to happen with it too, in which case it would have nothing to do with Subversion.
It happens as well with Mercurial, but to a lesser extent than with Subversion. Subversion makes the effects of the underlying problem very apparent, because each request will consume a lot of memory.
I still don't have a solution, but I made some progress: when there's an exception, it looks like the frame information is being referenced from somewhere. It's not due to cycles or uncollectable objects (gc.garbage remains empty, also with gc.DEBUG_UNCOLLECTABLE set). I'm not even sure that's the only reason, but that's definitely one reason, as I've got a constant +10M increase when repeating a given failing request, and by looking at the objects that are accumulated, they really correspond to what's reachable from the stack (and this is using Mercurial as a backend, even).
/me still deeply puzzled and looking…
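The "frame information referenced from somewhere" suspicion matches a well-known Python pitfall: keeping a traceback object alive keeps every frame on it alive, and with them every local variable in those frames. A hedged sketch (Big and failing are made up for illustration; Python 3 spelling, but the mechanism is the same in the Python 2 of this ticket):

```python
import gc
import sys
import weakref

class Big(object):
    pass

def failing():
    payload = Big()              # stands in for a large per-request object
    ref = weakref.ref(payload)
    try:
        raise ValueError("boom")
    except ValueError:
        tb = sys.exc_info()[2]   # saving the traceback pins the whole frame...
        return ref, tb           # ...including the `payload` local

ref, tb = failing()
assert ref() is not None         # payload survived: the traceback holds its frame
del tb                           # drop the traceback...
gc.collect()                     # ...and collect the frame<->traceback cycle
assert ref() is None             # now the payload is really gone
```

So anything that stores a traceback (or an exc_info tuple) past the end of request processing would retain everything reachable from the stack, exactly as observed here, without ever showing up in gc.garbage.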
comment:27 by , 17 years ago
It's not the logger: I modified trac/log.py with

    class FakeLogger(object):
        def info(self, *args, **kwargs): pass
        def debug(self, *args, **kwargs): pass
        def warn(self, *args, **kwargs): pass
        def warning(self, *args, **kwargs): pass
        def error(self, *args, **kwargs): pass
        def exception(self, *args, **kwargs): pass

    def logger_factory(logtype='syslog', logfile=None, level='WARNING',
                       logid='Trac', format=None):
        return FakeLogger()
Same results.
comment:28 by , 17 years ago
It's not the somewhat complex error handling in trac/web either. If I silence the error that I get during template generation like that:
    diff -r 419b87217da6 trac/web/main.py
    --- a/trac/web/main.py
    +++ b/trac/web/main.py
    @@ -206,15 +206,19 @@ class RequestDispatcher(Component):
                 template, data, content_type = \
                     self._post_process_request(req, *resp)
                 if 'hdfdump' in req.args:
    -                req.perm.require('TRAC_ADMIN')
    +                #req.perm.require('TRAC_ADMIN')
                     # debugging helper - no need to render first
                     from pprint import pprint
                     out = StringIO()
                     pprint(data, out)
                     req.send(out.getvalue(), 'text/plain')
                 else:
    -                output = chrome.render_template(req, template,
    -                                                data, content_type)
    +                try:
    +                    output = chrome.render_template(req, template,
    +                                                    data, content_type)
    +                except:
    +                    pass
    +                output = '****'
                 # Give the session a chance to persist changes
                 if req.session:
                     req.session.save()
Same results.
So probably back to genshi?
comment:29 by , 17 years ago
Cc: added
comment:30 by , 17 years ago
The error I'm referring to in comment:26 is an exception raised in a WikiSyntaxProvider, inside a wiki formatting done while rendering the timeline.
The corresponding backtrace for that error can be seen in attachment:timeline_rendering_backtrace.log
Now, if I catch the exception before it goes through the template rendering, the memory usage is OK again. The live object count, as measured by doing a len(gc.get_objects()) after a gc.collect(), remains constant at 62642 after successive requests.
If I raise an exception on purpose after the template rendering, there's also no increase in memory usage.
But if the original exception is not caught, the live object count increases by ~50000 after each call.
I'll try to locate the exact point in the backtrace which causes the memory retention.
by , 17 years ago
Attachment: timeline_rendering_backtrace.log added
The backtrace for the exception discussed in comment:23
comment:31 by , 17 years ago
Cc: added
comment:32 by , 17 years ago
Cc: added
comment:33 by , 17 years ago
Cc: added
comment:34 by , 17 years ago
Cc: added
comment:35 by , 17 years ago
Cc: added
comment:36 by , 17 years ago
May I suggest to the team the following:
- provide a wiki page which summarizes the findings
- provide some debug code that prints memory-usage data into the log file, so that tests give specific results.
- users can publish the test results (from different systems) on the wiki page
I can run the tests on 2 servers (one has a mem problem, the other one seems(!) not to have it). Possibly the configuration of the servers is the answer.
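For the proposed "debug code which prints mem-usage data into the log", one portable sketch uses the stdlib resource module to record the process's peak RSS (Unix-only; note the assumption that ru_maxrss is in KB on Linux but bytes on macOS, so units must be checked per platform):

```python
import logging
import resource

def log_mem_usage(log, where):
    """Record the process's peak resident set size into an existing logger.
    Sketch only -- log_mem_usage and its call sites are hypothetical."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    log.debug("%s: peak RSS %s", where, peak)
    return peak

# usage: call after each request, then grep the log for growth
peak = log_mem_usage(logging.getLogger("trac-memdebug"), "after /timeline")
assert peak > 0
```

Calling this at the end of request dispatching would let users on different servers publish comparable numbers, as suggested above.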
follow-up: 41 comment:37 by , 17 years ago
FC5, r6432, tracd, svn, running for 3 days now: VSZ 123M, RSS 85M.
comment:38 by , 17 years ago
To summarize the issues at this point:
1. prefer Python 2.5 over Python 2.4: the former is much better at giving memory back to the system
2. the alternative _history implementation helps a bit when Subversion is used (fix-_history.diff), but not that much
3. doing frequent full garbage collections helps a lot (explicit_gc.diff)
4. there's a major issue going on as soon as an exception is raised when evaluating a Genshi Expression or Suite.
For testing 4. one can use the following set of patches (against latest trunk - r6493):
1. Prepare a clean trunk for testing:
   - patch -p0 < no-permission-store-cache.diff
   - patch -p0 < single-threaded-tracd.diff
   - patch -p0 < fix-_history.diff
2. Now, use either one of the following patches (the second is more verbose):
   - patch -p0 < explicit_gc.diff
   - patch -p0 < debug-mem-growth.diff
3. Then test a timeline query with either one of the following patches applied:
   - patch -p0 < timeline-expression-bug.diff
   - patch -p0 < timeline-suite-bug.diff
Also attached is a log file showing the accumulation of objects when testing the timeline expression bug. The VM memory usage grows approximately by 125Mb for each request (/timeline, daysback=10, date=2005-02-08, repository_dir=trac repository, repository_type=direct-svnfs).
That huge amount of memory can be attributed to an excessive reuse of the main Pool by Repository.get_changesets() when direct-svnfs is used. The real problem, however, is that this pool is not collected because it's kept alive. It's kept alive because there are 9 sub-pools kept alive, each corresponding to the 9 changesets stored in the 'data' object… which is itself kept alive for some reason.
I still haven't figured out why this happens.
One possible approach to fix the problem would be to catch exceptions at the Base._eval level and propagate that up as an EXCEPTION event. At an upper level, probably just before serialization, we can try to raise the exception again without running into the memory issue (it seems that exceptions raised at upper levels don't trigger the issue). More work is needed here.
by , 17 years ago
Attachment: no-permission-store-cache.diff added
gives a cleaner picture of real memory usage
by , 17 years ago
Attachment: fix-_history.diff added
better Pool usage in SubversionRepository._history()
by , 17 years ago
Attachment: debug-mem-growth.diff added
same as explicit_gc.diff + some stats about accumulated objects
by , 17 years ago
Attachment: timeline-expression-bug.log added
log a session with timeline-expression-bug.diff active
by , 17 years ago
Attachment: timeline-expression-bug.diff added
raise a ValueError from an Expression evaluation immediately during timeline rendering
by , 17 years ago
Attachment: timeline-suite-bug.diff added
raise a ValueError from a Suite execution immediately during timeline rendering
follow-up: 40 comment:39 by , 17 years ago
More findings:
- among the frames in the alive objects, it seems that none refers to the data dictionary
- when using Genshi:source:tags/0.4.4, the bug is still present but behaves differently: memory goes from 150Mb to 275Mb exactly like before, then stays at 275Mb for subsequent repeated requests (even slightly different, like when changing the data), i.e. it stops growing after the second failure.
See timeline-expression-bug-with-genshi-0.4.4.log.
Was anonymous also using Genshi 0.4.4?
by , 17 years ago
Attachment: timeline-expression-bug-genshi-0.4.4.log added
log a session with timeline-expression-bug.diff active - this time using Genshi 0.4.4 instead of [G789]
comment:40 by , 17 years ago
Correcting the above, I meant:
even slightly different, like when changing the date
I also checked that [G799] makes no difference with [G789].
The difference in behavior between 0.4 and 0.5 can actually be explained by [G770]… Reverting that changeset on trunk makes it behave like 0.4. Of course, we know that we can't just revert that change (as this would re-introduce nasty multi-threading issues - #G158), and anyway this wouldn't remove the issue completely, as the memory increment at the second error is still there, but I think this narrows the issue even more.
follow-up: 42 comment:41 by , 17 years ago
comment:42 by , 17 years ago
Replying to mbertheau@gmail.com:
Great, so you just probably never had any exception raised within an Expression in a template :-)
comment:43 by , 17 years ago
Cc: added
(I try again to add my e-mail address to the cc field. If I just add the address then the submission gets rejected as spam. If you are reading this then it means that it helped to add a comment.)
comment:45 by , 17 years ago
Cc: added
comment:46 by , 17 years ago
Cc: added
follow-up: 48 comment:47 by , 17 years ago
Status: new → assigned
In r6593, I added explicit garbage collection after each request. This should already help a lot (see comment:17).
In r6594, I added the SubversionRepository._history optimization, which should also help a bit (comment:17 again).
Finally, using Python 2.5 in preference to 2.4 will also help greatly (see comment:18).
Some feedback would be useful, in order to see how well this performs in practice.
What's left to be done:
- work around the real memory leak that happens when an exception is raised from inlined expressions in Genshi templates (comment:38, comment:39 and comment:40).
- further improve the memory usage in the Subversion backend (#5782)
follow-up: 49 comment:48 by , 17 years ago
Replying to cboos:
In r6593, I added explicit garbage collection after each request. This should already help a lot (see comment:17).
What effects has this on performance?
(I'd like to repeat my plea from comment:36, as the information gets lost within the ticket comments.)
comment:49 by , 17 years ago
Replying to ilias@lazaridis.com:
Replying to cboos:
In r6593, I added explicit garbage collection after each request. This should already help a lot (see comment:17).
What effects has this on performance?
Approximately a 10% slowdown, IIRC.
comment:50 by , 17 years ago
Cc: added
follow-up: 53 comment:51 by , 17 years ago
Concerning the current trunk - [6603], I'm still interested in getting feedback in terms of memory usage.
Some news concerning the Genshi issue: I've not yet been able to trigger the problem purely at the Genshi level, so I'll go from the Trac side, stripping things away.
So far, I've found that the problem is not exactly tied to an exception being raised from an expression.
Even an UndefinedError triggered from inside the py:for block in the timeline template causes the same problem, with the same memory usage as in attachment:timeline-expression-bug.log (first request goes up to 146Mb, then 271Mb, then 396Mb, etc.)
    --- trac/timeline/templates/timeline.html
    +++ trac/timeline/templates/timeline.html
    @@ -32,6 +32,7 @@
         </form>

         <py:for each="day, events in groupby(events, key=lambda e: format_date(e.date))">
    +      ${say.hello}
           <h2>${day}: ${day == today and 'Today' or day == yesterday and 'Yesterday' or None}</h2>
           <dl>
             <py:for each="event in events"
Now, changing that groupby to a simple day in events loop, things are even more interesting:
    --- trac/timeline/templates/timeline.html
    +++ trac/timeline/templates/timeline.html
    @@ -31,7 +31,8 @@
           </div>
         </form>

    -    <py:for each="day, events in groupby(events, key=lambda e: format_date(e.date))">
    +    <py:for each="day in events">
    +      ${say.hello}
           <h2>${day}: ${day == today and 'Today' or day == yesterday and 'Yesterday' or None}</h2>
           <dl>
             <py:for each="event in events"
Here, the memory usage starts with 146Mb, then 271Mb, then stays at 271Mb for every other repeated requests.
Seeing that, I've modified timeline.html even more drastically:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:py="http://genshi.edgewall.org/"
          xmlns:xi="http://www.w3.org/2001/XInclude">
      ${say.hello}
    </html>
The memory usage starts with 145Mb, then 270Mb, then stays at 270Mb.
Now, thinking about it again, even the following results look suspicious:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:py="http://genshi.edgewall.org/"
          xmlns:xi="http://www.w3.org/2001/XInclude">
    </html>
i.e. dummy template, no error generated: memory starts and stays at 145Mb. Why? This is with Python 2.5, the memory allocated during this request should be returned to the system.
Note that with an empty template, I get a TemplateSyntaxError: no element found error, and the memory usage is the same as above: starts and stays at 145Mb.
… to be continued
comment:52 by , 17 years ago
Unless there is a __del__ or something else involved that keeps a reference, causing the garbage collector to be unable to free it. Might that be the cause here?
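On the Python 2 this ticket runs on, that was a real possibility: a cycle whose objects define __del__ was never freed and was parked in gc.garbage instead. (Python 3.4+ changed this via PEP 442, so the sketch below shows the modern behavior and only notes the old one; Finalized is a made-up class.)

```python
import gc

class Finalized(object):
    def __del__(self):
        pass                     # a finalizer -- the suspect in this comment

def make_cycle():
    a, b = Finalized(), Finalized()
    a.other, b.other = b, a      # cycle of objects with __del__

gc.collect()
del gc.garbage[:]
make_cycle()
found = gc.collect()
assert found > 0                 # the cycle was detected as unreachable
assert gc.garbage == []          # Python 3.4+: freed anyway; on Python 2 these
                                 # objects would have been stuck in gc.garbage
```

Since comment:26 reported gc.garbage staying empty even with gc.DEBUG_UNCOLLECTABLE set, a __del__ cycle is unlikely to be the culprit here, but it is the right thing to rule out.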
follow-up: 54 comment:53 by , 17 years ago
Replying to cboos:
Seeing that, I've modified timeline.html even more drastically: … The memory usage starts with 145Mb, then 270Mb, then stays at 270Mb.
This doesn't seem very surprising to me. TemplateLoader caches the template stream/tree on load, so an initial jump is to be expected. There are also the various caches internal to Trac that get populated on first use. None of these allocations would be returned to the system.
Unless I'm misunderstanding your point?
Now, thinking again even the following results look suspicious: … i.e. dummy template, no error generated: memory starts and stays at 145Mb. Why? This is with Python 2.5, the memory allocated during this request should be returned to the system.
This kind of confirms my previous point: the template fails to load and is thus not cached and no extra memory consumed.
comment:54 by , 17 years ago
Replying to athomas:
Replying to cboos:
Seeing that, I've modified timeline.html even more drastically: … The memory usage starts with 145Mb, then 270Mb, then stays at 270Mb.
This doesn't seem very surprising to me. TemplateLoader caches the template stream/tree on load, so an initial jump is to be expected. There are also the various caches internal to Trac that get populated on first use. None of these allocations would be returned to the system. Unless I'm misunderstanding your point?
You misunderstood. The template loader is supposed to cache the template, not the data, AFAICT. The 145Mb is a very high number which corresponds to the specific test case (3 years of changeset timeline data, using direct-svnfs on a t.e.o clone, see comment:38). The cost of the initial load you described is an order of magnitude smaller than what is observed here.
So again:
1. starting Trac → 10Mb
2. make an initial "cheap" timeline query → 22Mb
   That's what you were talking about, and yes, we're probably not able to go below that ever after.
3. timeline?from=2%2F25%2F2005&daysback=3&changeset=on query → 146Mb
   I think it's already wrong that we stay that high; most of the memory associated with the data should have been given back to the system.
4. repeat request 3. → 271Mb
   Now in all cases, this is totally wrong.
5. repeat request 3.; what we'll see now depends on the actual content of the timeline template:
   5.1. <py:for ... groupby()> template:
        each time, the memory usage will increase by another 125Mb. Again, this is totally wrong.
   5.2. <py:for each="day in events"> or <html>${say.hello}</html>-only template:
        no matter how many times we repeat the request, the memory stays at 271Mb (but why does eliminating groupby make such a big difference in behavior?)
Even if there were a bug in the template loader where somehow the data would sneak in, I've eliminated that possibility by deleting the template loader after each use.
Where that reference is kept is still unknown at this point (and no, it's not uncollectable, see comment:26).
follow-up: 56 comment:55 by , 17 years ago
Thanks for the clarification. The piece of information I was missing was the increasing daysback period.
Does the memory accumulation occur for other timeline providers, or only versioncontrol?
comment:56 by , 17 years ago
Replying to athomas:
Does the memory accumulation occur for other timeline providers, or only versioncontrol?
Yes, same thing without the changesets, though it's then less visible, of course. Doing a timeline query of the whole history of my test Trac without the changesets consistently increases the memory usage by 2.7Mb after each repeated request.
I've focused on the timeline so far, but I think the issue is the same with the other modules. It's quite likely that, given the absence of groupby or similar in the other templates, we'll rather see the 5.2 behavior there (in line with the early observations in comments 7 and 12).
follow-up: 58 comment:57 by , 17 years ago
I've done some testing at home and I see the same behaviour as you, where the first query consumes N MB of memory, but subsequent identical queries do not increase the memory usage. I also don't see any uncollectable objects at all. I was slightly confused by the terminology used in the GC module documentation, but some searching revealed the difference between unreachable and uncollectible.
The combination of these factors indicates to me that the memory is being released by the GC but continues to be held by Python's pool allocator or the system's libc (that is, free() is not returning the memory to the O/S). This is apparently fairly "normal" for Python and there's some discussion of it here.
I've not been able to replicate the behaviour you've seen with groupby(), where the memory usage continues to climb. I'm running Python 2.5 and Genshi trunk.
comment:58 by , 17 years ago
Replying to athomas:
I've not been able to replicate the behaviour you've seen with groupby(), where the memory usage continues to climb. I'm running Python 2.5 and Genshi trunk.
Just to be sure: are you triggering an exception from within the <py:for> element, as shown in the diffs in comment:51?
comment:59 by , 17 years ago
Christian,
Thanks to Alec, I received a trac-hacks trac.db and did some timeline testing with trunk (r6612) and Genshi trunk @ rev 800. When I request 300 days I get (SIZE, RES as top describes them): 38684K 33980K, and it is stable after (forced) reloads. Using 400 I get 38684K 33988K, and stable; with 500: 46108K 41536K; and for 600 days I get 56088K 51524K. In all these cases I do not have a repository active that accompanies this trac.db. Alec said that his testing, which is with the repository for all I understood, caused him to rise to 300 MB for the timeline for 600 days.
My machine is a FreeBSD 6.3-STABLE one with Python 2.5.1.
I hope this data helps you a bit.
comment:61 by , 17 years ago
With the addition of say.hello into the timeline.html I manage to leak about 5 MB per hit on the timeline for 600 days back using a dump as provided by Alec.
80756K/75352K, 85752K/80200K, 90476K/85032K (SIZE/RES), as reported by top.
comment:62 by , 17 years ago
When I get rid of the groupby() my top reports this:
36468K/31068K, 36628K/31228K, 36468K/31068K, 36628K/31228K, 36468K/31068K (SIZE/RES).
So something gets lost due to the groupby() when an error occurs.
follow-up: 64 comment:63 by , 17 years ago
I've just confirmed the same leak as comment:51 by adding a reference to ${say.hello}. The GC is not reporting any weirdness either. I also wrote a small script with the same general logic, but was unable to make Genshi by itself leak.
I also tried uncommenting the entirety of Trac's error handling except block and the leak persisted.
Chris also suggested commenting out the stream filters and using list(stream) instead of stream.render(). We tried a few other things, such as disabling genshi.speedups.
The last thing we tried was using the Python groupby() from trac.util. This fixed the problem! Test case here seems to replicate it.
We're still looking into it.
follow-up: 65 comment:64 by , 17 years ago
Replying to athomas:
I also wrote a small script with the same general logic, but was unable to make Genshi by itself leak. … Test case here seems to replicate it.
Replicate it or not replicate it? The two sentences above seem contradictory. Anyway, for me that script doesn't leak once you add explicit gc (http://paste.pocoo.org/show/30691/).
The last thing we tried was using the Python groupby() from trac.util. This fixed the problem!
One part of the problem, as there's still something weird going on even when groupby is not used at all (see comment:54, points 4 and 5.2).
We're still looking into it.
Great!
comment:65 by , 17 years ago
Replying to cboos:
Replicate it or not replicate it? The two sentences above seem contradictory.
The sentences were referring to different test scripts ;)
Anyway, for me that script doesn't leak once you add explicit gc (http://paste.pocoo.org/show/30691/).
Right, we tried that too, but adding an explicit gc.collect() is not something we want to do, and it doesn't solve the problem.
One part of the problem, as there's still something weird going on even when groupby is not used at all (see comment:54, points 4 and 5.2).
Yeah, turns out that was a dead end. Latest test case is here.
Feel free to jump on #python-genshi if you're interested.
comment:66 by , 17 years ago
So there's indeed a Genshi-only issue explaining the problem we see with the exception raised from an expression in a template, see #G190.
follow-up: 68 comment:67 by , 17 years ago
As #G190 blamed itertools.groupby, a simple workaround could be made at the Trac level, committed in [6637].
That is not to say that this issue is fixed, though. Trac still seems to be leaking memory.
Given the somewhat large number of people on CC, I'd like to get some feedback to see if the situation has improved with [6637].
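For reference, a pure-Python replacement along these lines looks roughly like the following (a sketch of the idea behind the workaround, not the actual trac.util code): each run of equal keys is buffered into an independent list, instead of relying on itertools.groupby's shared internal iterator.

```python
def groupby(iterable, key=lambda x: x):
    """Yield (key, items) pairs for consecutive runs of equal keys,
    materializing each group as a plain list. A sketch of a pure-Python
    stand-in for itertools.groupby; not Trac's actual implementation."""
    group = []
    current = object()  # sentinel that compares unequal to any real key
    for item in iterable:
        k = key(item)
        if group and k != current:
            yield current, group
            group = []
        current = k
        group.append(item)
    if group:
        yield current, group

print(list(groupby([1, 1, 2, 3, 3])))
```

Because each group is an independent list, an exception raised while consuming one group cannot keep the shared iterator state alive, which sidesteps the reference-retention behavior #G190 attributes to itertools.groupby.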
comment:68 by , 17 years ago
Replying to cboos:
As #G190 blamed itertools.groupby, a simple workaround could be made at the Trac level, committed in [6637]. That is not to say that this issue is fixed, though. Trac still seems to be leaking memory.
Given the somewhat large number of people on CC, I'd like to get some feedback to see if the situation has improved with [6637].
I did some tests with r6647. My use case is browsing a directory with about 200 small files, and the improvement mentioned in comment:67 is obvious:
memory usage | Trac 0.11b1 | Trac r6647 |
before browse click | 78 MB | 46 MB |
after browse click | 146 MB | < 48 MB |
delta | ~ 68 MB | < 2 MB |
So this is a huge improvement: a small memory increase (< 2 MB) with r6647, compared to the 68 MB of 0.11b1!
follow-up: 70 comment:69 by , 17 years ago
Well, you can only call it a leak if you find a way to make the memory usage steadily increase, or exceed some high bound it should not reach. In the before/after test case you describe above, the increase you're seeing is probably only the browser.html parsed template being cached, and that's normal. Try browsing some more, like recursively fetching all the files, and see how it behaves under load.
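A crude way to apply that definition in practice (a stdlib-only sketch; handle_request is a hypothetical stand-in for whatever request you are hammering) is to discard the warm-up hits, then watch whether resident memory keeps growing:

```python
import gc
import resource

def rss_kb():
    """Peak resident set size so far, as reported by getrusage.
    (Linux reports this in kB; macOS reports bytes.)"""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def handle_request():
    """Hypothetical stand-in for rendering one page."""
    return len([object() for _ in range(1000)])

for _ in range(5):           # warm-up: caches, parsed templates, etc.
    handle_request()
gc.collect()
baseline = rss_kb()

for _ in range(50):          # the hits that should NOT grow memory
    handle_request()
gc.collect()
growth = rss_kb() - baseline
print("growth after warm-up: %d" % growth)
```

If growth keeps climbing as you raise the second loop count, that is a leak in the sense above; a one-time step after the first request usually is not.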
comment:70 by , 17 years ago
Replying to cboos:
Well, you can only call it a leak if you find a way to make the memory usage steadily increase, or exceed some high bound it should not reach. In the before/after test case you describe above, the increase you're seeing is probably only the browser.html parsed template being cached, and that's normal. Try browsing some more, like recursively fetching all the files, and see how it behaves under load.
That is what I did, and continued browsing of my tree makes my httpd process exceed 1 GB of RAM with 0.11b1, while the r6647 version stays at about 250 MB…
comment:71 by , 17 years ago
Milestone: | 0.11 → 0.10.5 |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
Version: | 0.11b1 → devel |
comment:72 by , 17 years ago
Milestone: | 0.10.5 → 0.11 |
---|---|
Resolution: | fixed |
Status: | closed → reopened |
comment:73 by , 17 years ago
Version: | devel → 0.11b1 |
---|
by , 17 years ago
Attachment: | render-using-StringIO-rr6677.patch added |
---|
This patch helps to reduce the maximal memory usage.
comment:74 by , 17 years ago
Well, the attachment:render-using-StringIO-rr6677.patch doesn't help as much as I would have liked. It seems to help only marginally.
Also, I'm currently looking at a non-Genshi issue: repeatedly querying /report/1?format=csv on a test environment (with 600/700 tickets in the report) leaks about 50-60 kB each time, and there seems to be no upper bound.
comment:75 by , 17 years ago
Cc: | added |
---|
comment:76 by , 17 years ago
Cc: | added |
---|
comment:77 by , 17 years ago
Cc: | added |
---|
comment:78 by , 17 years ago
comment:79 by , 17 years ago
Cc: | removed |
---|
comment:80 by , 17 years ago
Cc: | added |
---|
comment:81 by , 17 years ago
Cc: | added |
---|
I would like to be added to the CC, but the spam filter keeps blocking my ticket modifications when no text is in the comment box. Below is what it keeps telling me:
Submission rejected as potential spam (SpamBayes determined spam probability of 90.82%)
comment:82 by , 17 years ago
Cc: | added |
---|
comment:83 by , 17 years ago
Summary: | Memory leak using apache → Memory leak using apace |
---|
comment:84 by , 17 years ago
Summary: | Memory leak using apace → Memory leak using apache |
---|
comment:85 by , 17 years ago
I just found out about a python heap profiler called heapy (part of http://guppy-pe.sourceforge.net/). Has anyone tried using it to track down the cause of the reference leaks? I would give it a go myself, but I have not been able to produce any leaks on my configuration.
Also, there haven't been any responses since comment:78. It would be nice to have an update from people who have encountered the problem, maybe the issue is now resolved.
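In case anyone wants a quick approximation without installing guppy, a crude heapy-style census can be had from the stdlib alone (this only counts objects the GC tracks, so it is far less precise than heapy's hpy().heap() summary):

```python
import gc
from collections import defaultdict

def heap_census(limit=10):
    """Count live GC-tracked objects per type name and return the
    `limit` most numerous types, most common first."""
    counts = defaultdict(int)
    for obj in gc.get_objects():
        counts[type(obj).__name__] += 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:limit]

for name, count in heap_census():
    print("%-25s %d" % (name, count))
```

Diffing two censuses taken before and after a batch of requests points at which type keeps accumulating.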
comment:86 by , 17 years ago
I think I suffer from this, but I might not have enough memory to test it. I just know my VPS hangs after some clicking in Trac. I have 256 MB RAM. I am running 0.11b2.
follow-up: 91 comment:87 by , 17 years ago
Severity: | blocker → critical |
---|
While a few memory issues remain, I think we can no longer consider this a blocker. Latest trunk of both Trac and Genshi should now behave sensibly, even more so when using Python 2.5 (or later).
There's still the issue with report .csv which remains unexplained at this point (comment:74).
comment:88 by , 17 years ago
I can confirm that this bug exists in 0.10, but only when a person attempts to view a specific svn revision number on a trac environment that has indexed an svn repository from a folder other than the root of the repository.
For example, we have a repository in /foo and we index beginning with a subfolder /foo/bar (done by setting repository_dir to /foo/bar in trac.ini — a step I've read is a suitable way of ensuring trac users have access to only one portion of the repository).
Once that is complete, accessing a specific revision of a file in the Revision Log will cause apache to consume nearly 200 megabytes of memory. Do this a couple of times, and your system will eventually run out of memory. In addition, it is painfully slow to view a revision of a file — the load time is several seconds slower than normal.
The solution I've found for this is to index the repository from the root folder /foo, and then write up a svn permissions file (called svn-perm) that looks like this:
[/]
* = r
[/private]
* =
[/bar]
* = r
This gives read access to /foo/bar and /foo, and restricts /foo/private for our trac environment. Then set authz_file = /path/to/svn-perm in trac.ini.
The memory leaks will stop, while still restricting access.
comment:89 by , 17 years ago
Although the situation became much better with rc1, this bug is still present and brings my server to its knees about once every 14 days (instead of once a day) with only very moderate traffic. I _really_ hope you fix this for the final release. I'll be glad to provide you with any information about my setup if required.
comment:90 by , 17 years ago
Cc: | added |
---|
comment:91 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Replying to cboos:
There's still the issue with report .csv which remains unexplained at this point (comment:74).
That specific leak can be attributed to the usage of izip, and [7133] fixes it.
I also committed two related changes, [7134] and [7135], which reduce the need for itertools functions (and make the code more readable, IMO).
For people still experiencing increasing memory usage, please create new detailed issues, once you make sure that you're using the latest version of the software (both Trac 0.11-stable r7135 and upward and Genshi trunk [G851] or 0.5 and upward).
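To illustrate the kind of change involved (a hypothetical sketch, not the actual [7133] diff; the function and column names are made up): the CSV rows can be emitted with a plain loop instead of pairing values through itertools.izip.

```python
import csv
import io

def report_to_csv(columns, rows):
    """Hypothetical sketch: stream a report out as CSV using a plain
    loop over the rows, avoiding itertools.izip entirely."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)     # header row
    for row in rows:
        writer.writerow(row)     # one report row per line
    return buf.getvalue()

print(report_to_csv(["id", "summary"], [(1, "memory leak"), (2, "fixed")]))
```

The plain loop holds no iterator state between rows, so nothing is left behind once the response has been written.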
comment:93 by , 17 years ago
Cc: | removed |
---|
comment:94 by , 17 years ago
Cc: | removed |
---|
comment:95 by , 17 years ago
Resolution: | fixed |
---|---|
Severity: | critical → blocker |
Status: | closed → reopened |
This bug is NOT fixed.
comment:97 by , 17 years ago
Resolution: | → fixed |
---|---|
Severity: | blocker → critical |
Status: | reopened → closed |
Once again:
For people still experiencing increasing memory usage, please create new detailed issues, once you make sure that you're using the latest version of the software (both Trac 0.11-stable r7135 and upward and Genshi trunk [G851] or 0.5 and upward).
And now, please leave this ticket closed: it has reached a critical size beyond which it's not practical to work with anymore.
comment:98 by , 17 years ago
As suggested in comment:36, you should place the relevant information within a separate document.
This is a highly critical issue and you should treat it with much more care and, of course, respect toward your user base.
I agree with "anonymous" (who, btw, should not reopen tickets as 'anonymous'): this ticket is not fixed, and the "severity" of this ticket was (is) of course "blocker".
follow-up: 100 comment:99 by , 17 years ago
Resolution: | fixed |
---|---|
Severity: | critical → blocker |
Status: | closed → reopened |
Has there been any evidence to suggest this bug has been fixed?
I don't believe so and as such, will reopen the ticket (as opposed to what comment 97 has to say).
comment:100 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Replying to Justin:
Has there been any evidence to suggest this bug has been fixed?
Yes, there's plenty of evidence, thoroughly documented in the history of this ticket.
We've found several bugs in Python itself for which we've implemented workarounds in Genshi and Trac. We've also found some actual bugs in Genshi and Trac themselves. That's why the ticket was closed.
To have these fixes you must be running "both Trac 0.11-stable r7135 and upward and Genshi trunk [G851] or 0.5 and upward" as mentioned by cboos.
I don't believe so and as such, will reopen the ticket (as opposed to what comment 97 has to say).
What version of Trac and Genshi are you running? Specifically, we need:
Dependency | What we need |
Trac | Subversion revision |
Genshi | Subversion revision |
Python database adapter | Name and version |
Version control backend | Type, Trac plugin version, and VC version |
RAM | How much do you have? |
Operating System | Name and release |
Apache/Lighttpd/etc. | Version of the web server you are running |
The last three people to reopen this bug have not provided any of this information whatsoever.
Without being sure that you're running the versions of Genshi and Trac with the fixes, we can't commit the large amount of time required to track down what may not even be an issue any longer.
If you do provide detailed version information and show that the bug is still present, we'll continue to look. Until then, I'm closing this ticket, again.
comment:102 by , 15 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Version: | 0.11b1 → 0.11.5 |
I'm using Trac 0.11.5 (r8362) with Genshi 0.5.1, and this issue still seems real to me.
comment:103 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Even if that's actually the case, please don't reopen this ticket again; it is really too long at this point. Create a new one instead, and I'll take care of linking the other ticket(s) from here.
Be sure to provide some numbers in your new ticket, and also be sure you have all plugins disabled when taking your measurements.
Thanks.