= Analysis of Performance of Trac 0.11.5 =

''initial content from a mail authored by Shane Caraveo on trac-dev (googlegroups:trac-dev:a48473cd1dcfbd7c)''

I've recently reduced per-request time for *simple* template generation
from 150ms to 45ms on my laptop (dual-core 1.6GHz Core Duo, 2GB RAM,
tested with ab against a simple template). I also reduced page load
times (reported via Google Page Load/Firebug) to around a second (from
2-8 seconds in some cases). These timings include no network overhead
since I'm testing on the local machine. Times are all taken on my
laptop with my typical environment/applications open (lots of crap).

This was all done with Trac 0.11.5.

Even with the gains I still get 100% CPU spikes on every request, but
at least they are of shorter duration. I'm still digging further into
that. I have also done a bit of work on overall browser page load.

Here's my current brain dump. If some of my conclusions are silly,
please let me know why.


== General ==

In general there are only a couple of big wins. For me they were
removing gc.collect (see trac.main) and the Timing and Estimation
plugin. Everything else was small potatoes in comparison (10ms here,
5ms there), but together those added up to a good 40-50ms per request.
Think of it this way: at 100% CPU and 50ms/request you are limited to a
maximum of 20 requests/second/CPU. Every ms counts if we want decent
throughput. I'd like to get under 30ms.

=== How I timed ===

My test setup is Apache prefork with WSGI in daemon mode, one process,
and mod_deflate enabled. My laptop is pretty loaded, with very little
free memory, but generally enough that Trac shouldn't be swapping.
PostgreSQL is the database, also running on the laptop.

For timing template generation I used ApacheBench and tossed out
"warmup" requests, sometimes testing with keep-alive (which gives
slightly better req/s):

{{{
ab [-k] -c 1 -n 10 url
}}}

With zero network latency, the timing is essentially how much time Trac
takes to generate the content. I tested against a simple template
generated from a small plugin that just renders a template file (so as
to involve as little other processing as possible, such as querying the
ticket table or permissions), and against WikiStart.

| 54 | For "page load" I used Google Page Load extension for firebug and |
| 55 | firefox 3.5. I also tested with Safari 4 which has similar |
| 56 | functionality now, it's much faster than firefox :( |
| 57 | |
I tested with plugins enabled since I wanted to also discover whether
any plugins were causing me grief; some were.

I tested against empty projects, so table indexing and table size are
not involved. In some instances I also tested against a small-sized
project, which turned up issues in a couple of plugins (e.g. Timing and
Estimation).

=== Profiling issues ===

It seems to me that cProfile adds time from generators to the function
that uses them, masking where the real time is spent. That makes
profiling Genshi terribly difficult. I also tried out dtrace; it gives
the same rough data as cProfile. Knowing a tiny bit about the Python
engine (having written a C-level debugger extension), I figure it has
something to do with how generators work in the Python core.

A couple of areas where cProfile claimed a lot of exclusive time in
Genshi I rewrote in C. This didn't change my performance at all, but it
did shift timing to other areas, which helped me find a couple of the
actual time sinks in Genshi (see below).

== For general performance ==

Use either a multithreaded Apache if you use mod_python, or WSGI in
multithreaded daemon mode. It won't remove high-CPU slow requests, but
it does lower memory use to some extent. While that didn't improve
per-request performance, I got higher concurrent throughput with that
setup. Apache prefork with mod_python, or WSGI in embedded mode, will
chew up memory, and due to high CPU usage, processes will fight each
other more for resources.
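
For reference, a mod_wsgi daemon-mode setup along those lines looks
roughly like this in the Apache configuration (the paths and numbers
here are illustrative, not recommendations):

{{{
# One daemon process, many threads; maximum-requests bounds memory
# growth by recycling the process periodically (see gc.collect below).
WSGIDaemonProcess trac threads=25 maximum-requests=1000
WSGIProcessGroup trac
WSGIScriptAlias /trac /var/www/trac/trac.wsgi
}}}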

=== customizations and plugins ===

The single largest gain I got with plugins was by removing the Timing
and Estimation plugin (primarily seen on the ticket reports page, where
it does another SQL query for each ticket, doubling the request time).
Its stream filters slow lots of stuff down. Removing other plugins with
stream filters, or reworking them to avoid stream filters where
possible, is a win. In some instances the TracHours plugin is a better
solution if you don't need all the Timing and Estimation features.

Get rid of stuff like the Google Analytics plugin and just add the code
to a site or theme template; you'll get better performance. It's not
that it's a bad plugin, but it does something simple that adds *Genshi*
overhead.

Examine any custom templates for extraneous py:match use and get rid of
it. I had one match in the site template that was removing something
from an admin panel, which I had forgotten about; big mistake, and it
cost on the performance of *every* page. I patched the template
directly instead and saved myself 10ms per request. This kind of thing
doesn't show up in the profiler; it was some debug dumps I inserted
deep into the Genshi bowels that showed me the light.

Make sure all static resources for all plugins are exported. trac-admin
deploy will only deploy static resources for the plugins enabled in
that project.
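
For example (the paths here are illustrative):

{{{
trac-admin /path/to/projenv deploy /var/www/trac-static
}}}

Since deploy only covers the plugins enabled in that environment, run
it again after enabling new plugins.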

=== javascript ===

This is mostly about page load times.

Minify all the JavaScript output by trac-admin deploy, and make sure it
gets cache headers via the Apache configuration.
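
As a sketch of the minification step, here is one way to do it with the
third-party jsmin module (the deploy path is illustrative, and any
minifier will do):

{{{
import os
from jsmin import jsmin  # third-party module; any JS minifier works

# Wherever trac-admin deploy put the static resources.
deploy_dir = '/var/www/trac-static/htdocs'

for root, dirs, files in os.walk(deploy_dir):
    for name in files:
        if name.endswith('.js'):
            path = os.path.join(root, name)
            source = open(path).read()
            open(path, 'w').write(jsmin(source))
}}}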

Upgrade to jQuery 1.3; it's much faster. Trac trunk has done this, and
there is a diff somewhere that shows what kind of changes have to be
made. You'd have to examine any plugins for similar jQuery updates that
need to be done.

If you use Firebug, be sure you have disabled it (or at least disable
the console tab) for Trac pages; the console somehow interrupts and
adds a couple of seconds of delay in the browser.

=== css ===

I haven't gotten to this yet, but there is a lot of slow CSS there.
Removing all the CSS as a test, my page load in the browser was 200ms
faster. Google Page Load is good at reporting which CSS *might* be
slow, and it reports lots of inefficient CSS in Trac.

== trac issues ==

=== gc.collect ===

The next big item was removing gc.collect. If I recall correctly,
gc.collect took at least 50ms of the request time, sometimes upwards of
100ms. It also hides other bugs (ticket:8443 for example: the postgres
idle bug came back without gc.collect, and I've got a patch in that
bug). gc.collect could be done in a background thread from tracd, but I
wouldn't suggest background threads for mod_python prefork. With WSGI
daemon mode, just limit the number of requests before restart to
something you think is fine and test memory consumption. I think the
use of gc.collect should be seriously reconsidered.
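
If you want to see what the call costs on your own setup, it is easy to
time in isolation; note that the numbers only mean something inside a
loaded process (e.g. dropped into the request handling code), since a
fresh interpreter has almost nothing to collect:

{{{
import gc, time

t0 = time.time()
gc.collect()
print("gc.collect() took %.1f ms" % ((time.time() - t0) * 1000.0))
}}}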

I have a sneaking suspicion (unproven) that people who use mod_python
and claim that turning off keep-alive and/or mod_deflate helps are
really having problems due to gc.collect. As I understand Apache
filters (e.g. mod_deflate), they won't finish up the request until the
mod_python handler has returned. So the extra time in gc.collect delays
the request being returned, which delays mod_deflate finishing. It also
delays the start of the next request over a persistent socket
connection (i.e. keep-alive).

=== filters ===

Going through all the match_request implementations helps: remove
permission checking (put it in process_request instead), make sure
regexes are precompiled, and generally simplify things. There are a
number of these in Trac core, but plugins are the greater abusers in
this area. Also examine any IRequestFilter use and simplify it.
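
As an illustration, a lean handler looks something like this (a minimal
sketch; the path, permission, and template names are made up):

{{{
import re
from trac.core import Component, implements
from trac.web.api import IRequestHandler

# Compiled once at import time, not on every request.
PATH_RE = re.compile(r'^/mypath(?:/|$)')

class MyHandler(Component):
    implements(IRequestHandler)

    def match_request(self, req):
        # Cheap test only: no permission checks, no database access.
        return PATH_RE.match(req.path_info) is not None

    def process_request(self, req):
        # Permission checking belongs here, where it only runs for
        # requests this handler actually matched.
        req.perm.require('WIKI_VIEW')
        return 'mytemplate.html', {}, None
}}}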

=== trac.config ===

Other than Genshi, profiling showed trac.config to be the single
largest time sink in simple template generation. Profiling showed me
that the get method of the Configuration class
(trac.config.Configuration) was slow. I added caching there for a few
extra milliseconds of boost. I'm also looking at removing the
auto-reload when the ini file changes, maybe using Spread or memcached
to create reload notifications, to get rid of the file stats, but
timing doesn't show this to be a large issue on my laptop.
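
The caching was roughly of this shape (an illustrative monkeypatch, not
the actual change; a real version must also invalidate the cache
whenever the ini file is reparsed):

{{{
from trac.config import Configuration

_original_get = Configuration.get
_get_cache = {}

def _cached_get(self, section, name, default=''):
    key = (id(self), section, name, default)
    if key not in _get_cache:
        _get_cache[key] = _original_get(self, section, name, default)
    return _get_cache[key]

Configuration.get = _cached_get
}}}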

=== repository ===

While I still want to remove the repository sync on every request (to
get rid of system calls/file stats), I have been unable to show that
performance changes much when I test with it removed. There are still
bigger fish to fry.

=== database pool ===

Watching the PostgreSQL log file, I can tell that a lot of cursors get
created without actually being used to do a query. This shows up
because psycopg2 executes a BEGIN when a cursor is created. I haven't
yet looked into where that is happening, but it's extra work the system
is doing for nothing.
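
One way to hunt for the call sites is to wrap the pooled connection and
dump a stack trace whenever a cursor is created (a debugging sketch,
not production code):

{{{
import traceback

class TracingConnection(object):
    """Wraps a DB connection and reports every cursor() call."""

    def __init__(self, conn):
        self._conn = conn

    def cursor(self):
        traceback.print_stack()  # who asked for this cursor?
        return self._conn.cursor()

    def __getattr__(self, name):
        return getattr(self._conn, name)
}}}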

=== wiki ===

The current wiki parser design is slow, running the same large regex
over each line. I think a general redesign to use re.finditer, rather
than line split then re.match, would help improve performance here.
However, post-caching the content would be better. I'm experimenting
with partial caching of the wiki content and have reduced my request
timing on WikiStart from 120ms to 75ms while still getting correct
content. The patch I have doesn't cache macros unless the macro
arguments contain 'cache' (good for the page outline macro, which by
the way seems to re-parse the entire page, macros included). There may
be other issues with the approach I have taken; I haven't tested it
much. Once I get further, I might actually pre-process when the wiki
page is saved and stash the result into the database, avoiding the
parse in most requests.
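
Stripped of the macro handling, the core idea looks something like this
(a hedged sketch; the actual patch is partial and macro-aware):

{{{
import hashlib

_wiki_cache = {}

def render_cached(text, render):
    # `render` is whatever callable does the real wiki-to-HTML parse;
    # the cache is keyed on a digest of the source text, so unchanged
    # pages skip the parse entirely.
    key = hashlib.md5(text.encode('utf-8')).hexdigest()
    if key not in _wiki_cache:
        _wiki_cache[key] = render(text)
    return _wiki_cache[key]
}}}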

== Genshi ==

Use Genshi trunk; there is some performance gain there.

With Genshi, one big win is turning off the whitespace filter; I think
I got around 10ms back on a small page, and the gain would be larger on
larger pages. I did that by patching trac.web.chrome where there is a
call to stream.render, adding strip_whitespace=False. While this
slightly increases the delivered file size (and adds a lot of
whitespace to the source view), that can be countered by mod_deflate.
The namespace flattener is another slow filter, but it's necessary.
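
The patch amounts to something like this at the stream.render call in
trac.web.chrome (a sketch; the surrounding code differs between Trac
versions):

{{{
# Was (roughly): data = stream.render('xhtml', doctype=doctype)
# Passing strip_whitespace=False through to the serializer skips
# Genshi's WhitespaceFilter; mod_deflate absorbs the size increase.
data = stream.render('xhtml', doctype=doctype, strip_whitespace=False)
}}}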

Another win with Genshi is patching genshi.output.encode and getting rid
of the iteration that calls _encode a bazillion times. Instead, just do:

{{{
out.write(_encode(u''.join(list(iterator))))
}}}

This is much faster.

I should note that I also tried changing Trac so that the output went
straight through to the WSGI layer (rather than being buffered in a
cStringIO object). While this improved the initial wait time
dramatically, it actually slowed down the overall request time. I
believe this slowdown was also due to the _encode calls inside the
iteration over the stream. I'm still going to experiment with this
more at some point.

Overall, I like Genshi; now that I kind of get how it works, it has a
certain elegance. But I am left unconvinced that its design and
flexibility outweigh the performance issues and the added difficulty in
figuring them out (refer to profiling issues at the top). One way I'm
thinking of working around this is to beef up the XML-RPC interfaces
and do everything with Ajax.

== Misc ==

=== plugins ===

I don't think most plugin authors know where they will generate
performance penalties. It's hard to know all those details without
spending a lot of time on it.

Generally, pay attention to any plugins that use stream filters or
request filters. These are called for every request, and more often
than not they do something time-intensive. Additionally, IMO, a request
handler's match_request should never do anything beyond
req.path_info.startswith('/mypath'). Many match_request implementations
do a lot more, and that gets executed on every request.

=== wysiwyg plugin ===

This adds a large JS file to *every* request whether useful or not. I
removed the large script from the request filter, then changed the load
script to check whether there is a textarea element in the DOM and load
the larger script only if necessary.
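
In jQuery terms, the check amounts to something like this (a sketch;
the script URL is made up and the real plugin's file layout differs):

{{{
jQuery(document).ready(function($) {
  // Only fetch the heavy editor script on pages that can use it.
  if ($('textarea').length) {
    $.getScript('/chrome/wysiwyg/wysiwyg.js');
  }
});
}}}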