Edgewall Software
Modify

Opened 10 years ago

Last modified 4 years ago

#8509 new defect

[PATCH] use precompiled regex for match_request

Reported by: mixedpuppy <shanec@…> Owned by:
Priority: high Milestone: next-major-releases
Component: web frontend Version: 0.11.5
Severity: normal Keywords: patch performance
Cc: osimons Branch:
Release Notes:
API Changes:

Description

This patch changes all the trac match_request calls to use precompiled regular expressions, avoiding multiple re compilations per request.

Attachments (3)

trac-perf-match-request.patch (8.7 KB ) - added by mixedpuppy <shanec@…> 10 years ago.
match_request_optimization.diff (16.2 KB ) - added by shookie@… 9 years ago.
new IRequestHandler Interface
match.diff (5.0 KB ) - added by shookie@… 9 years ago.
updated match_request

Download all attachments as: .zip

Change History (32)

by mixedpuppy <shanec@…>, 10 years ago

comment:1 by mixedpuppy <shanec@…>, 10 years ago

Version: none0.11.5

comment:2 by Remy Blank, 10 years ago

Did you actually measure a performance improvement with those changes? The re module in the standard lib caches up to 100 patterns (before emptying the whole cache), and I doubt that we have that many distinct patterns. Still, I agree that compiling the patterns separately is better style.

Another thing bothers me with the patch, though: I don't think the "Admin" navigation item should be restricted to TRAC_ADMIN. I do have access to a few admin panels here on t.e.o, and I am certainly not TRAC_ADMIN.

comment:3 by mixedpuppy <shanec@…>, 10 years ago

Precompiling regex is an old habit of mine from a time (perhaps even another language) that it did make a difference. I primarily did it when I was going through looking at match_request trying to minimize what was happening in those calls. Core trac doesn't have a lot of issues there, the bigger issues are generally in plugins.

The precompiles make relatively little difference if any, the larger issues in match_request have been doing other lookups such as permissions. However, a quick scan of genshi and trac show there is somewhere around 40-50 regular expressions (between the two) in the code base. Add whatever base python might do, and trac plugins, its realistic to think hitting the cache limit could happen.

For the admin tab, I was trying to avoid going through all the panels to make a decision on showing the tab. Didn't really think about the case of a non-admin having access to an admin panel.

comment:4 by Christian Boos, 10 years ago

Component: generalweb frontend
Keywords: needinfo added
Milestone: 2.0

Numbers, numbers…

Otherwise, might be a 2.0 thing if we one day work out a new way to globally dispatch requests.

comment:5 by Christian Boos, 9 years ago

Keywords: needinfo removed

comment:6 by shookie@…, 9 years ago

Implemented a new IRequestDispatcher mechanism with the follwing elements that should improve the overall performance:

  • Compiled re.
  • Only one primary global re, so not every of the dozens of handlers has to cycle through path_info on every request.
  • Also saves the dozens of match_request function calls overhead for every request.

I shortly thought of using a trie for that, but I got my doubts that one implemented in python would outperform the re and dict modules.

Other advantages:

  • Fully and transparently backwards compatible, so a smooth transition towards using the new mechanism is possible.
  • Also, should some plugin or module need the old more flexible match_request function, if it doesn't use a path prefix to match, it is still available.
  • The new mechanism can catch name conflicts in plugin IRequestHandlers, because it is aware of the names

The only downside is, that I didn't get to make proper benchmarks yet. But will do that on the weekend, when I'm home again.

On a side note, since #9224 was closed, I'm not sure you noticed the DictOption I implemented for that (http://trac.edgewall.org/attachment/ticket/9224/dict_option.diff). Really think that should be included, if only for plugin writers.

by shookie@…, 9 years ago

new IRequestHandler Interface

in reply to:  6 ; comment:7 by Christian Boos, 9 years ago

Milestone: 2.00.13

Replying to shookie@…:

Implemented a new IRequestDispatcher mechanism with the follwing elements that should improve the overall performance:

  • Compiled re.
  • Only one primary global re, so not every of the dozens of handlers has to cycle through path_info on every request.
  • Also saves the dozens of match_request function calls overhead for every request.

This is definitely the goal, yes. You're on the right track, but there's still some room for improvements, let's go to the specifics.

  • Fully and transparently backwards compatible, so a smooth transition towards using the new mechanism is possible.

Yes. This could also be achieved by using trac.util.arity on match_request, so we would just need to modify the signature of the existing IRequestDispatcher.match_request method.

  • Also, should some plugin or module need the old more flexible match_request function, if it doesn't use a path prefix to match, it is still available.

I think there's no need to split the regexp in two parts. If a partial regexp needs capture groups, it should then use named groups, (?P<unique_name> ...).

And, let's do it right from the start, we should use VERBOSE compatible regexps (i.e. use \s when we need a space).

  • The new mechanism can catch name conflicts in plugin IRequestHandlers, because it is aware of the names

I don't see that as conflicts, rather as a flexibility to override some other handlers. In addition to a regexp return value, the match_request could optionally return a pair (regexp, priority). If not, the priority is the default one, e.g. 5. Higher priority wins, as usual.

The only downside is, that I didn't get to make proper benchmarks yet. But will do that on the weekend, when I'm home again.

Don't worry about that, it will be faster, it doesn't really matter how much, any improvement is good to take ;-)

On a side note, since #9224 was closed, I'm not sure you noticed the DictOption I implemented for that

I noticed. See #7432, we'll use it when we'll need it.

in reply to:  7 ; comment:8 by Remy Blank, 9 years ago

Replying to cboos:

Don't worry about that, it will be faster, it doesn't really matter how much, any improvement is good to take ;-)

I disagree on that point. I'd really like to make sure that we don't have a performance regression. I'm happy as long as it actually is faster.

in reply to:  8 comment:9 by anonymous, 9 years ago

Ok, made a few runs with apache ab this morning. It's about 1-1.5% faster in the different runs. Only tested with tracd though, e.g. no keepalives etc.

But for now I guess I'll implement the improvements suggested.

I thought about making it one re, but using named options would just introduce new possible naming conflicts, so we better agree on a naming scheme like "module.groupname" or "class.groupname" We don't want 25 groups with the name "id".

comment:10 by shookie@…, 9 years ago

Next version.

  • No split re anymore.
  • Using named groups for paramteres (naming scheme is <_class_._name_>_name)
    • exception to this is the first group, which is the matched prefix
  • The re.X flag is added.
  • And using arity instead of an extended IRequestHandler interface to distinguish the two matching methods. Although I'm not sure this really is a good idea. Feels kinda hackish.

Disclaimers:

  • No priority yet implemented.
  • And although the dispatch re is one big re, its prefix and parameter parts are still defined separately. Else I would have to parse the re to get the keys for the handler lookup dict.
  • Also, the only handler I updated to use the new method is the WikiModule one. Makes the patch smaller for easier review, and spares me duplicated work, in the likely case this is not final.
  • benchmarks will be added later

by shookie@…, 9 years ago

Attachment: match.diff added

updated match_request

comment:11 by shookie@…, 9 years ago

Did some more benchmarks. The new updated version is about a the same 1-1.5% faster than trunk. Maybe leaning a little more towards the 1.5%

comment:12 by Carsten Klein <carsten.klein@…>, 9 years ago

I found the initial IRequestHandler interface proposal more straight forward to use. Now, when implementing the interface, plugin authors will have to always check for the extended contract on match_request(), e.g. if req is None: ….

Would it not be easier to just leave it the way you initially proposed?

Older IRequestHandlers then would not have the match() method, so hasattr() in the main dispatcher would then find out whether it is a new style or an old style request handler?

This, I believe is much more framework'ish than having to implement the extended contract of match_request() on every request handler.

comment:13 by Christian Boos, 9 years ago

Well, hold on, don't let the current patch fool you, plugins won't have to do anything.

Just wait for an updated patch.

comment:14 by Christian Boos, 9 years ago

Owner: set to Christian Boos
Priority: normalhigh

comment:15 by Christian Boos, 8 years ago

Milestone: 0.130.14

I often wanted to implement this one (and refactor the dispatching of web requests at this occasion), but there's a major issue:

  1. if we use one huge regexp, then how do we implement multiple "candidates" (i.e. simulte match_request returning False?);
  2. otherwise if we use multiple regexps, then there's probably no speed advantage to be expected.

I'm inclined to favor 1., and if the match is rejected (should be rare), then proceed on matching individual regexps

comment:16 by osimons, 8 years ago

Sorry, but I had overlooked this ticket earlier - or rather missed how it has morphed into more than the summary suggests. Pre-compiling regular expressions is a good idea, and having each component compile according to needs as in original patch looks good to me.

However, arriving at the idea that all dispatching must be regexp based seeing the main Trac handlers use it (or can use it) is not a good idea. Components should be free to match according to whatever criteria they want, and while path_info is very common it is certainly not the only dimension. For both Bitten and RPC plugin, Content-Type header may be involved in routing request to handlers. It should not take much imagination from developers to contemplate other such use-cases that aren't strictly path based.

Please drop the extended refactoring ideas…

comment:17 by osimons, 8 years ago

Cc: osimons added

Best to keep an eye on this one, I suppose ;-)

comment:18 by Christian Boos, 8 years ago

Right ;-)

What about having a small PathInfoDispatcher component which would implement IRequestHandler and will take care of the single regexp approach? Most request handlers do dispatching based on the path_info alone and could be converted to some IPathRequestHandler interface for which the PathInfoDispatcher will provide an extension point. If there's no match, then it's just an IRequestHandler among others which is failing and we proceed as before with the others. This means that plugins like the ones you mentioned don't need to be modified.

However, most of the others IRequestHandler could probably be converted and the increase in speed could be worth it. On the opposite, I'm not sure that compiling a bunch of different regexps and going through all of them in turn will have a significant impact.

comment:19 by osimons, 8 years ago

If pre-compiling expressions won't have a noticeable speed effect or make it easier to read or whatever, this whole ticket is approaching a wontfix in my opinion. Considering the quite extensive use of plugins at regular Trac installs, I just don't see the point of reworking the API or using energy and attention from the project to push plugin developers to fall in line with a new optional interface for little or no perceived gain.

It is just one of those things where it just makes more sense to live with the design choices made a long time ago. Just because we 'can' does not mean we always 'should', especially when it comes to an interface like IRequestHandler that is in such widespread use.

comment:20 by Christian Boos, 8 years ago

Ok, I suppose that what I wrote in comment:18 was not clear enough, so let me try again:

  • if we precompile each pattern separately and try them in turn like the initial patch does, then I'm not sure there will be a noticeable gain; this just adds complexity for little gain
  • if we precompile all patterns in one single regexp, I'm quite sure the speed improvement will be significative; in addition, the argument retrieval and dispatching could be made smarter

Of course, this is just my gut feeling, and I could be wrong. But at least this should explain the motivation for the change (which is not "because we can").

in reply to:  20 ; comment:21 by Remy Blank, 8 years ago

Replying to cboos:

  • if we precompile each pattern separately and try them in turn like the initial patch does, then I'm not sure there will be a noticeable gain; this just adds complexity for little gain

I'm not sure about that. The re module has a cache of 100 compiled regexps, and if that number is exceeded, the whole cache is cleared. A typical Trac installation uses more than 100 regexps (IIRC, I counted once), so we end up re-compiling regexps over and over again. So the question is really, how long does it take to compile a regexp?

Personally, I'm very much in favor of using explicit compilation, and not introducing any other complexity.

in reply to:  21 ; comment:22 by Christian Boos, 8 years ago

Replying to rblank:

Replying to cboos:

  • if we precompile each pattern separately and try them in turn like the initial patch does, then I'm not sure there will be a noticeable gain; this just adds complexity for little gain

I'm not sure about that. The re module has a cache of 100 compiled regexps, and if that number is exceeded, the whole cache is cleared. A typical Trac installation uses more than 100 regexps (IIRC, I counted once), so we end up re-compiling regexps over and over again. So the question is really, how long does it take to compile a regexp?

Ok, so because of that, the initial patch could already be an improvement. I notice we already do that in a couple of place, btw. But I'm sure, definitely not as much as having one single regexp.

Personally, I'm very much in favor of using explicit compilation, and not introducing any other complexity.

The advantages of a revised dispatching, as I see it:

  • speed
  • control the match order, if we introduce a priority together with the pattern
  • higher-level dispatching, like automatically extracting arguments from regexps and using them to call the dispatching method

comment:23 by alex@…, 8 years ago

Cc: alex@… added

in reply to:  22 ; comment:24 by Olemis Lang <olemis+trac@…>, 5 years ago

Replying to cboos:

Replying to rblank:

Replying to cboos:

  • if we precompile each pattern separately and try them in turn like the initial patch does, then I'm not sure there will be a noticeable gain; this just adds complexity for little gain

I'm not sure about that. The re module has a cache of 100 compiled regexps, and if that number is exceeded, the whole cache is cleared. A typical Trac installation uses more than 100 regexps (IIRC, I counted once), so we end up re-compiling regexps over and over again. So the question is really, how long does it take to compile a regexp?

Ok, so because of that, the initial patch could already be an improvement. I notice we already do that in a couple of place, btw. But I'm sure, definitely not as much as having one single regexp.

Personally, I'm very much in favor of using explicit compilation, and not introducing any other complexity.

The advantages of a revised dispatching, as I see it:

  • speed
  • control the match order, if we introduce a priority together with the pattern
  • higher-level dispatching, like automatically extracting arguments from regexps and using them to call the dispatching method

I think there's something still missing in this approach : URL matching should be started before the actual intra-environment request dispatching . In Bloodhound I added (once upon a time …) support for this kind of regexp-based web dispatching . In there we introduced the concept of web bootstrap handlers to deal with matching environment name and (in our case) product prefix to support custom URL namespaces for resources . There are a number of use cases listed in that document .

… to the point … In Bloodhound labs plugin (to be released soon ;) I have coded one such class that extracts such variables using Routes definitions. Nevertheless the underlying (not-so-efficient) request dispatching is still continued after that point, because it's been inherited from Trac . The aforementioned module offers a very solid and flexible approach to URL matching , and especially the possibility of nesting rules using sub-mappers , and more-than-regex matchers (see comment:16) . Indeed it would be very nice to connect both dispatching stages with the help of sub-mappers. Another use case would be to embed Trac into the dispatching machinery of other (compatible) frameworks .

Therefore … to the point once again … is there a chance to rearrange the new request dispatching solution to rely upon Routes framework ? … or at least make it compatible with Routes ?

in reply to:  24 comment:25 by Olemis Lang <olemis+trac@…>, 5 years ago

Replying to Olemis Lang <olemis+trac@…>: […]

The aforementioned module offers a very solid and flexible approach to URL matching , and especially the possibility of nesting rules using sub-mappers , and more-than-regex matchers (see comment:16) .

I mean , by passing a custom function in to Routes definitions it's actually possible to wrap (by the time) legacy IRequestHandler implementors in spite of achieving backwards compatibility . There will be some inertia in trac-hacks land to upgrade ad adopt a new request dispatching API.

[…]

comment:26 by Ryan J Ollos, 4 years ago

Milestone: next-dev-1.1.xnext-major-releases

Retargetting tickets to narrow focus for milestone:1.2. Please move the ticket back to milestone:next-dev-1.1.x if you intend to resolve it by milestone:1.2.

comment:27 by Alex Willmer <alex@…>, 4 years ago

Cc: alex@… removed

comment:28 by Ryan J Ollos, 4 years ago

Owner: Christian Boos removed

comment:29 by figaro, 4 years ago

Keywords: patch added

Modify Ticket

Change Properties
Set your email in Preferences
Action
as new The ticket will remain with no owner.
The ticket will be disowned.
as The resolution will be set.
The owner will be changed from (none) to anonymous.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.