Analysis of Performance of Trac 0.11.5
Initial content from a mail authored by Shane Caraveo on trac-dev (googlegroups:trac-dev:a48473cd1dcfbd7c)
I've recently reduced per-request time for *simple* template generation from 150ms to 45ms on my laptop (1.6GHz dual-core Core Duo, 2GB RAM; tested with ab against a simple template). I also reduced page load times (reported via Google Page Load/Firebug) to around a second (from 2-8 seconds in some cases). These timings are without any network overhead, since I'm testing on the local machine. Times are all taken on my laptop with my typical environment/applications open (lots of crap).
This was all done with Trac 0.11.5.
Even with the gains I still get 100% CPU spikes on every request, but at least they're of shorter duration. I'm still digging into that. I've also done a bit of work on overall browser page load.
Here's my current brain dump. If some of my conclusions are silly, please let me know why.
In general there are only a couple of big wins. For me those were removing gc.collect (see trac.main) and the Timing and Estimation plugin. Everything else was small potatoes in comparison (10ms here, 5ms there), but together they added up to a good 40-50ms per request. Think of it this way: at 100% CPU and 50ms/request, you're limited to a maximum of 20 requests/second/CPU. Every ms counts if we want decent throughput. I'd like to get under 30ms.
How I timed
My test setup is Apache prefork with WSGI in daemon mode, one process, mod_deflate enabled. My laptop is pretty loaded, with very little free memory, but generally enough that Trac shouldn't be swapping. Postgres is the database, also running on the laptop.
For timing template generation I used ApacheBench and tossed out "warmup" requests, sometimes testing with keep-alive (which gets slightly better req/s):
ab [-k] -c 1 -n 10 url
With zero network latency, the timing is essentially how much time Trac takes to generate the content. I tested against a simple template generated from a small plugin that just renders a template file (chosen to involve as little other processing as possible, such as ticket table queries or permission checks), and against WikiStart.
For "page load" I used Google Page Load extension for firebug and firefox 3.5. I also tested with Safari 4 which has similar functionality now, it's much faster than firefox :(
I tested with plugins enabled, since I wanted to also discover whether any plugins were causing me grief; some were.
I tested against empty projects, so table indexing and table size are not involved. I also tested against a small project in some instances, which turned up issues in a couple of plugins (e.g. the Timing and Estimation plugin).
It seems to me that cProfile adds time spent in generators to the function that consumes them, masking where the real time is spent. That makes profiling Genshi terribly difficult. I also tried out dtrace; it gives the same rough data as cProfile. Knowing a tiny bit about the Python engine (having written a C-level debugger extension), I figure it has something to do with how generators work in the Python core.
I rewrote in C a couple of areas where cProfile claimed a lot of exclusive time in Genshi. This didn't change my performance at all, but it did shift timing to other areas, which helped me find a couple of the actual time sinks in Genshi (see below).
For general performance
Use either a multithreaded Apache if you use mod_python, or use WSGI in multithreaded daemon mode. It won't remove high-CPU slow requests, but it does lower memory use to some extent. While that didn't improve per-request performance, I got higher concurrent throughput with that setup. Apache prefork with mod_python, or WSGI in embedded mode, will chew up memory, and due to high CPU usage, processes will fight each other more for resources.
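For concreteness, a mod_wsgi daemon-mode sketch (the process/thread counts and paths are placeholder values, not recommendations):

    # Apache config (hypothetical values)
    WSGIDaemonProcess trac processes=1 threads=25 maximum-requests=1000
    WSGIScriptAlias /trac /var/www/trac.wsgi process-group=trac application-group=%{GLOBAL}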
customizations and plugins
The single largest gain I got with plugins was by removing the Timing and Estimation plugin (primarily seen on the ticket reports page, where it does another SQL query for each ticket, doubling the request time). Its stream filters slow lots of things down. Removing other plugins with stream filters, or reworking them to avoid stream filters where possible, is a win. In some instances the TracHours plugin is a better solution, if you don't need all the T&E features.
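For reference, this is the pattern to look for: an ITemplateStreamFilter re-walks the entire Genshi event stream of every page just to make one change. A minimal hypothetical plugin (not T&E's actual code):

    from genshi.builder import tag
    from genshi.filters.transform import Transformer
    from trac.core import Component, implements
    from trac.web.api import ITemplateStreamFilter

    class StreamFilterExample(Component):
        implements(ITemplateStreamFilter)

        def filter_stream(self, req, method, filename, stream, data):
            # the whole output stream gets reprocessed event by event,
            # on every request, just to append one node to <head>
            return stream | Transformer('//head').append(
                tag.script(src='/chrome/site/analytics.js'))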
Get rid of things like the Google Analytics plugin and just add the code to a site or theme template; you'll get better performance. It's not that it's a bad plugin, but it does something simple that adds *Genshi* overhead.
Examine any custom templates for extraneous py:match use and get rid of it. I had one match in the site template that was removing something from an admin panel, which I had forgotten about; a big mistake that cost performance on *every* page. I patched the template directly instead and saved myself 10ms per request. This kind of thing doesn't show up in the profiler; it was some debug dumps I inserted deep into the Genshi bowels that showed me the light.
Make sure all static resources for all plugins are exported. trac-admin deploy will only deploy static resources for enabled plugins in that project.
This is mostly about page load times.
Upgrade to jQuery 1.3; it's much faster. Trac trunk has done this, and there is a diff somewhere that shows what kind of changes have to be made. You'd have to examine any plugins for similar jQuery updates that need to be done.
If you use Firebug, be sure you have disabled it (or at least the console tab) for Trac pages; the console somehow interferes and adds a couple of seconds of delay in the browser.
I haven't gotten to this yet, but there is a lot of slow CSS in there. As a test, removing all the CSS made my page load in the browser 200ms faster. Google Page Load is good at reporting which CSS *might* be slow, and it reports lots of inefficient CSS in Trac.
The next big item was removing gc.collect. If I recall correctly, gc.collect took at least 50ms of the request time, sometimes upwards of 100ms. It also hides other bugs (ticket:8443 for example; the postgres idle bug came back without gc.collect, and I've got a patch in that bug). gc.collect could be done in a background thread under tracd, but I wouldn't suggest background threads for prefork mod_python. With WSGI daemon mode, just limit the number of requests before a process restart to something you think is fine and test memory consumption. I think the use of gc.collect should be seriously reconsidered.
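A sketch of the background-thread idea for tracd (hypothetical and untested; as said, don't do this under prefork mod_python):

    import gc
    import threading
    import time

    def _collect_periodically(interval=30.0):
        # collect off the request path, instead of once per request
        # (see trac.main)
        while True:
            time.sleep(interval)
            gc.collect()

    _collector = threading.Thread(target=_collect_periodically)
    _collector.daemon = True  # don't block interpreter shutdown
    _collector.start()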
I have a sneaking suspicion (unproven) that people who use mod_python and claim that turning off keep-alive and/or mod_deflate helps are actually having problems due to gc.collect. As I understand Apache filters (e.g. mod_deflate), they won't finish up the request until the mod_python handler has returned. So the extra time in gc.collect delays the request being returned, which delays mod_deflate finishing. It also delays the start of the next request over a persistent socket connection (i.e. keep-alive).
Going through all the match_request implementations helps: remove permission checking (put it in process_request instead), make sure regexes are precompiled, and generally simplify. There are a number of offenders in Trac core, but plugins are the greater abusers in this area. Also examine any IRequestFilter use and simplify.
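A sketch of the cheap pattern, as a hypothetical handler (names made up):

    import re
    from trac.core import Component, implements
    from trac.web.api import IRequestHandler

    class FastHandler(Component):
        implements(IRequestHandler)

        # compiled once at import time, not per request
        _url_re = re.compile(r'/mypath(?:/|$)')

        def match_request(self, req):
            # nothing but a match: no DB access, no permission checks
            return self._url_re.match(req.path_info) is not None

        def process_request(self, req):
            # permission checking moves here, so it only runs for
            # requests this handler actually owns
            req.perm.require('WIKI_VIEW')
            return 'mytemplate.html', {}, None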
Other than Genshi, profiling showed trac.config to be the single largest time sink in simple template generation. Profiling showed that the get method in the Configuration class (trac.config.Configuration) was slow. I added caching there for a boost of a few extra milliseconds. I'm also looking at removing the auto-reload when the ini file changes (maybe using Spread or memcached to deliver reload notifications) to get rid of the file stats, but timing doesn't show this to be a large issue on my laptop.
While I still want to remove the sync on every request (get rid of system calls/file stats), I have been unable to show that performance changes much when I test with it removed. There are still bigger fish to fry.
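The get-caching idea mentioned above amounts to something like this sketch (not my actual patch; a monkey-patch for illustration that ignores ini-file reloads):

    from trac.config import Configuration

    _orig_get = Configuration.get
    _get_cache = {}

    def _caching_get(self, section, name, default=''):
        # cache lookups per (file, section, option); a real version
        # must clear the cache whenever the ini file is re-parsed
        key = (self.filename, section, name, default)
        try:
            return _get_cache[key]
        except KeyError:
            value = _orig_get(self, section, name, default)
            _get_cache[key] = value
            return value

    Configuration.get = _caching_get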
Watching the postgres log file, I can tell that a lot of cursors get created without actually being used to do a query. This shows up because psycopg2 executes a BEGIN when a cursor is created. I haven't yet looked into where that is happening, but it's extra work the system is doing for nothing.
The current wiki parser design is slow, running the same large regex over each line. I think a general redesign to use re.finditer, rather than line split then re.match, would help improve performance here. However, post-caching the content would be better. I'm experimenting with partial caching of the wiki content and have reduced my request timing on WikiStart from 120ms to 75ms while still getting correct content. The patch I have doesn't cache macros unless the macro arguments have 'cache' in them (good for the page outline macro, which, by the way, seems to re-parse the entire page, macros included). There may be other issues with the approach I have taken; I haven't tested it much. Once I get further, I might actually pre-process the wiki page when it is saved and stash the result in the database, avoiding the parse in most requests.
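A toy illustration of the shape of that change (made-up token rules, not Trac's actual wiki syntax):

    import re

    TOKEN = re.compile(r"(?P<bold>''')|(?P<link>\[\S+?\])")

    def scan_per_line(text):
        # current shape: split first, then run the big regex on each line
        return [(m.lastgroup, m.group())
                for line in text.splitlines()
                for m in TOKEN.finditer(line)]

    def scan_whole_buffer(text):
        # proposed shape: one finditer pass over the whole text
        return [(m.lastgroup, m.group())
                for m in TOKEN.finditer(text)]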
Use Genshi trunk; there's some performance gain there.
With Genshi, one big win is turning off the whitespace filter; I think I got around 10ms on a small page, and the gain would be larger on larger pages. I did that by patching trac.web.chrome where there is a call to stream.render, adding strip_whitespace=False. While this slightly increases the delivered file size (and adds a lot of whitespace to the source view), that can be countered by mod_deflate. The namespace flattener is another slow filter, but it's necessary.
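The patch amounts to something like this (a sketch; the exact call site in trac.web.chrome varies by version):

    # where Chrome renders the stream, pass the serializer flag through
    data = stream.render(method, doctype=doctype, strip_whitespace=False)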
Another win with Genshi is patching genshi.output.encode and getting rid of the iteration that calls _encode a bazillion times. Instead, join the whole stream and encode it once:
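A sketch of that change (assuming _encode is the local per-chunk encoder defined inside genshi.output.encode):

    # before: encode every chunk inside the loop
    #     for chunk in iterator:
    #         out.write(_encode(chunk))
    # after: join the whole stream once, encode once
    out.write(_encode(u''.join(list(iterator))))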
This is much faster.
I should note that I also tried changing Trac so that the output went straight through to the WSGI layer (rather than being buffered in a cStringIO object). While this improved the initial wait time dramatically, it actually slowed down the overall request time. I believe this slowdown was also due to the _encode calls inside the iteration over the stream. I'm still going to experiment more with this at some point.
Overall, I like Genshi; now that I kind of get how it works, it has a certain elegance. But I remain unconvinced that its design and flexibility outweigh the performance issues and the added difficulty of figuring them out (refer to the profiling notes at the top). One way I'm thinking of working around this is to beef up the XML-RPC interfaces and do everything with Ajax.
I don't think most plugin authors know where they will generate performance penalties. It's hard to know all those details without spending a lot of time on it.
Generally, pay attention to any plugins that use stream filters or request filters. These are called for every request and, more often than not, do something time-intensive. Additionally, IMO, a request handler's match_request should never do anything beyond req.path_info.startswith('/mypath'). Many match_request implementations do a lot more, and all of it gets executed on every request.
One request filter I ran into added a large JS file to *every* request, whether useful or not. I removed the large script from the request filter, then changed the loader script to check whether there is a textarea element in the DOM and to load the larger script only if necessary.
is this not cached by the browser?
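A sketch of that deferred-loading setup on the Python side (a hypothetical plugin; the loader script itself performs the client-side textarea check):

    from trac.core import Component, implements
    from trac.web.api import IRequestFilter
    from trac.web.chrome import add_script

    class DeferredScriptLoader(Component):
        implements(IRequestFilter)

        def pre_process_request(self, req, handler):
            return handler

        def post_process_request(self, req, template, data, content_type):
            # only a tiny loader goes out with every page; it checks the
            # DOM for a textarea and pulls in the big script on demand
            add_script(req, 'myplugin/js/loader.js')
            return template, data, content_type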