Ticket #9025 (closed defect: fixed)
Opened 2 years ago
Last modified 2 years ago
Trac hangs with certain text
| Reported by: | anonymous | Owned by: | cboos |
|---|---|---|---|
| Priority: | highest | Milestone: | 0.11.7 |
| Component: | wiki system | Version: | 0.11-stable |
| Severity: | critical | Keywords: | performance |
| Cc: | srl@… | ||
| Release Notes: | |||
| API Changes: | |||
Description
Our server hung with the attached text as the body of a ticket. (actually it is from a 'tab' export of the ticket, which obviously worked OK).
I pasted the same text into a comment and it also caused a hang.
this is r9125 of 0.11.7-stabler0
sqlite3 database 3.6.22, red hat 7, python 2.6.2
Attachments
Change History
Changed 2 years ago by srl@…
comment:1 Changed 2 years ago by Steven R. Loomis <srl@…>
- Cc srl@… added
I am the author of this ticket.
comment:2 Changed 2 years ago by Steven R. Loomis <srl@…>
- Milestone 0.11.7 deleted
comment:3 Changed 2 years ago by cboos
- Component changed from general to wiki system
- Keywords performance added
- Milestone set to 0.11.7
- Priority changed from normal to high
Reduced test cases and the approximate time they take:
[ሀሁሂሃሄህሆለሉሊላሌልሎሏሐሑሒሓሔሕሖመሙሚማሜምሞ] > 1s
[ሀሁሂሃሄህሆለሉሊላሌልሎሏሐሑሒሓሔሕሖመሙሚማሜምሞሟሠ] > 3s
[ሀሁሂሃሄህሆለሉሊላሌልሎሏሐሑሒሓሔሕሖመሙሚማሜምሞሟሠሡሢ] > 6s
[ሀሁሂሃሄህሆለሉሊላሌልሎሏሐሑሒሓሔሕሖመሙሚማሜምሞሟሠሡሢሣሤ] > 17s
[ሀሁሂሃሄህሆለሉሊላሌልሎሏሐሑሒሓሔሕሖመሙሚማሜምሞሟሠሡሢሣሤሥሦ] > 35s
...
Note that every other form seem to be fine:
ሀሁሂሃሄህሆለሉሊላሌልሎሏሐሑሒሓሔሕሖመሙሚማሜምሞሟሠሡሢሣሤ wiki:ሀሁሂሃሄህሆለሉሊላሌልሎሏሐሑሒሓሔሕሖመሙሚማሜምሞሟሠሡሢሣሤ [wiki:ሀሁሂሃሄህሆለሉሊላሌልሎሏሐሑሒሓሔሕሖመሙሚማሜምሞሟሠሡሢሣሤ] ["ሀሁሂሃሄህሆለሉሊላሌልሎሏሐሑሒሓሔሕሖመሙሚማሜምሞሟሠሡሢሣሤ"] [ሀሁሂሃሄህሆለሉሊላሌልሎሏሐሑሒሓሔሕሖመሙሚማሜምሞሟሠሡሢሣሤ with label]
I've identified the guilty regexp, which is the one for wikipagename_with_label_link.
When a label is actually given, the regexp "works" as expected.
comment:4 Changed 2 years ago by cboos
- Priority changed from high to highest
comment:5 Changed 2 years ago by cboos
The regexp for CamelCase page names tries to deal with unicode strings (#230) and looks like this:
...(?:\w(?<![a-z0-9_])(?:\w(?<![A-Z0-9_]))*[\w/](?<![a-z0-9_]\))...
---------------- ================
a word character a word character
but no lower case but no upper case
digit or _ digit or _
In presence of a string like ሀሁሂሃሄህሆለሉሊ... which contains neither a-z nor A-Z characters but otherwise only \w characters, the above regexp becomes equivalent to:
(\w(\w)*)+
When it has to backtrack (failed match), a regexp like that is not performing very well...
>>> r = re.compile(r'(\w(\w)*)+ ')
>>> t = time.time(); r.search('[abcdefghijklmnopqrstu]'); time.time() - t
1.2868900299072266
>>> t = time.time(); r.search('[abcdefghijklmnopqrstuv]'); time.time() - t
2.3317339420318604
>>> t = time.time(); r.search('[abcdefghijklmnopqrstuvw]'); time.time() - t
4.4360249042510986
comment:6 Changed 2 years ago by cboos
- Owner set to cboos
- Status changed from new to assigned
I reworked the CamelCase regexp to come to a patch which is quite close to mgood's fix for #230 (attachment:unicode_wiki_links.diff:ticket:230).
In this t9025-unicode-CamelCase-r9250.patch I have two variants, one with the full list of characters, the other with regexp ranges, so it's easy to comment out one and test the other.
For me using the full list variant was faster (this is the one enabled in the patch).
This is also a tad bit faster than mgood's patch (refreshed version in t9025-mgood-t230-r9250.patch).
This is still unfortunately slower than my original solution using \w and look-behinds, but at least there's no pathological case with this one.
Even better solutions welcomed ;-)
Changed 2 years ago by cboos
- Attachment t9025-unicode-CamelCase-r9250.patch added
use explicit list of lower/upper unicode characters for identifying CamelCase words
Changed 2 years ago by cboos
- Attachment t9025-mgood-t230-r9250.patch added
refreshed patch from mgood on #230, for comparison
comment:7 Changed 2 years ago by rblank
Here are the results on my machine, for running:
PYTHONPATH=. python trac/tests/allwiki.py
- Current trunk: 1.18 s
- t9025-unicode-CamelCase-r9250.patch, full list: 1.27 s
- t9025-unicode-CamelCase-r9250.patch, ranges: 1.72 s
- t9025-mgood-t230-r9250.patch: 1.30 s
So we're not that much slower with the first variant. I have tried to find a better solution, but I have to admit that my regexp-fu is probably deficient here. Anyway, a performance loss of about 7% sounds ok to me (compared to the alternative, i.e. a DOS), so I would say apply and close.
comment:8 Changed 2 years ago by cboos
Thanks for testing. Would you mind timing that last one: t9025-unicode-CamelCase-r9250.2.patch?
I think the slight twist to the WikiPageNames it introduces is OK.
Changed 2 years ago by cboos
- Attachment t9025-unicode-CamelCase-r9250.2.patch added
A slightly faster version, albeit with a twist to the WikiPageNames rules (Page/Sub? is now a valid wiki name)
comment:9 Changed 2 years ago by cboos
- Resolution set to fixed
- Status changed from assigned to closed
Last iteration of the patch + some tests committed as r9252.



2593.txt