Ticket #230 (closed enhancement: fixed)
Opened 8 years ago
Last modified 2 years ago
'CamelCase' words don't work with 'umlauts'
| Reported by: | gernot@… | Owned by: | mgood |
|---|---|---|---|
| Priority: | normal | Milestone: | 0.11 |
| Component: | wiki system | Version: | 0.6 |
| Severity: | normal | Keywords: | unicode |
| Cc: | gernot@… | | |
| Release Notes: | | | |
| API Changes: | | | |
Description
Try this and you know what I mean!
Attachments
Change History
comment:1 Changed 8 years ago by gernot@…
- Component changed from general to wiki
comment:2 Changed 8 years ago by daniel
- Milestone set to 0.6.1
comment:3 Changed 8 years ago by rocky
- Milestone changed from 0.6.1 to 0.7
comment:4 Changed 8 years ago by jonas
comment:5 Changed 8 years ago by daniel
- Milestone changed from 0.7 to 0.8
comment:6 Changed 8 years ago by daniel
Here's how ZWiki does it:
comment:7 Changed 8 years ago by daniel
A discussion on Twiki:
http://twiki.org/cgi-bin/view/Codev/InternationalCharactersInWikiWords
comment:8 Changed 8 years ago by daniel
- Owner changed from jonas to daniel
- Status changed from new to assigned
comment:9 Changed 8 years ago by daniel
- Milestone changed from 0.8 to 0.9
comment:10 Changed 7 years ago by Alexander Shopov <ash@…>
Just reporting that Cyrillic letters also do not work in camel case.
They work neither in a purely Cyrillic camel-cased word nor in a mixed Latin-Cyrillic camel-cased word.
Best regards.
comment:11 Changed 7 years ago by cmlenz
- Milestone 0.9 deleted
If someone has a patch for this, we'll gladly put it in. At the moment, I don't see a good way to solve this issue…
comment:12 Changed 6 years ago by cboos
- Keywords unicode added
Note that as a workaround, one could use the explicit wiki link notation:
[wiki:ÜberflüssigkeitsTheorie] → ÜberflüssigkeitsTheorie?
comment:13 Changed 5 years ago by ThurnerRupert
comment:14 Changed 5 years ago by ThurnerRupert
Changed 5 years ago by cboos
- Attachment wiki-t230.diff added
Unicode aware regexp for the wiki engine and the WikiPageNames syntax
comment:15 Changed 5 years ago by cboos
- Milestone set to 0.11
- Owner changed from daniel to cboos
- Status changed from assigned to new
With attachment:wiki-t230.diff (diff on top of r4680), the example WikiPageName given above is now recognized in both short wiki: TracLinks form and CamelCase form:
All the unit tests still pass. The performance impact of the changes seems quite negligible: it is only 3% slower than without the patch.
comment:16 Changed 5 years ago by cboos
- Resolution set to fixed
- Status changed from new to closed
comment:17 Changed 5 years ago by ThurnerRupert
- Resolution fixed deleted
- Status changed from closed to reopened
Are links with "not uppercase chars except whitespace and separators" valid now, like TracRel20 (as in http://moinmoin.wikiwikiweb.de/WikiName)?
comment:18 Changed 5 years ago by cboos
- Resolution set to fixed
- Status changed from reopened to closed
TracRel20 would be discussed in #425.
Btw, there's no need to reopen the ticket for asking a question about a closed ticket, we see everything ;-)
comment:19 Changed 5 years ago by christian.skarby@…
- Resolution fixed deleted
- Status changed from closed to reopened
r4693 produces unexpected behaviour on words with lower-case international characters. The expected behaviour would be that Småbokstaver should not produce a link, whereas SmÅogstore should. I am reopening this ticket, as I believe the fix is insufficient (words with umlauts do work, but they do so regardless of whether they are CamelCase).
comment:20 Changed 5 years ago by cboos
- Resolution set to fixed
- Status changed from reopened to closed
Oops, thanks for the notice; it should really be fixed with r4709 now.
comment:21 Changed 5 years ago by mgood
- Resolution fixed deleted
- Status changed from closed to reopened
- Type changed from defect to enhancement
comment:22 Changed 5 years ago by mgood
- Owner changed from cboos to mgood
- Status changed from reopened to new
comment:23 follow-up: ↓ 24 Changed 5 years ago by cboos
Right, I stopped checking as soon as a wiki name started CamelCase…
Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?
comment:24 in reply to: ↑ 23 ; follow-up: ↓ 25 Changed 5 years ago by mgood
Replying to cboos:
Right, I stopped checking as soon as a wiki name started CamelCase…
Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?
Yes, I have a fix that will handle this in the regular expression, and without using the lookbehinds. I have some unit tests for the WikiPageNames syntax, but I could use an example of the effect of the SHREF_ changes in r4693. Can you provide an example of a Wiki link that requires those changes?
comment:25 in reply to: ↑ 24 ; follow-up: ↓ 26 Changed 5 years ago by cboos
Replying to mgood:
Replying to cboos:
Right, I stopped checking as soon as a wiki name started CamelCase…
Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?
Yes, I have a fix that will handle this in the regular expression, and without using the lookbehinds.
Ok, I'm very interested to see that, as I failed to find a regexp-only solution.
Nevertheless, I thought it would be good to "finish" the approach started in r4693 and r4709, with r4717, so that at least it was working the intended way (and thereby fixing the regression for #3240).
I have some unit tests for the WikiPageNames syntax, but I could use an example of the effect of the SHREF_ changes in r4693. Can you provide an example of a Wiki link that requires those changes?
In addition, r4717 also adds a few examples of unicode WikiPageNames.
The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie? (no unit test for that yet).
Changed 5 years ago by mgood
- Attachment unicode_wiki_links.diff added
alternate solution for unicode WikiPageNames
comment:26 in reply to: ↑ 25 ; follow-up: ↓ 27 Changed 5 years ago by mgood
I've attached my patch, which uses two strings defined in trac.util.text for identifying all upper- and lower-case unicode characters. I was hoping a straight regex solution would be faster than mixing lookbehinds and an additional Python method to test whether it actually matched a WikiPageName, though it's hard to tell on my laptop, since the processor frequency scaling makes it difficult to benchmark accurately. I'll try on my other computer when I get home.
Replying to cboos:
The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie? (no unit test for that yet).
Actually it's not. I reverted your changes before starting on my solution, and as you can see from the unit test I added for those types of links it works without that change. So, what does that change actually affect?
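To make the "two strings" idea concrete, here is a minimal sketch, assuming Python 3. It is not the attached patch: the names `cased_chars`, `RANGES`, `UPPER`, `LOWER`, `WIKI_PAGE_NAME` and `is_wiki_page_name` are hypothetical, and only a few code point ranges are scanned to keep the example small. It builds one string of upper-case and one of lower-case characters and places them in regex character classes, so no lookbehind or follow-up Python check is needed.

```python
import re


def cased_chars(ranges):
    """Collect upper- and lower-case characters from the given code point ranges."""
    upper, lower = [], []
    for start, stop in ranges:
        for code_point in range(start, stop):
            ch = chr(code_point)
            if ch.isupper():
                upper.append(ch)
            elif ch.islower():
                lower.append(ch)
    return "".join(upper), "".join(lower)


# Assumption: Latin (Basic through Extended-B) and Cyrillic only.
# A real implementation would have to cover every cased script.
RANGES = [(0x0041, 0x0250), (0x0400, 0x0530)]
UPPER, LOWER = cased_chars(RANGES)

# Two or more groups of "one upper-case char, then one or more lower-case chars".
WIKI_PAGE_NAME = re.compile(
    "(?:[%s][%s]+){2,}" % (re.escape(UPPER), re.escape(LOWER))
)


def is_wiki_page_name(word):
    """Return True if the whole word matches the CamelCase pattern."""
    return WIKI_PAGE_NAME.fullmatch(word) is not None
```

With these assumptions, `is_wiki_page_name("SmÅogstore")` is true while `is_wiki_page_name("Småbokstaver")` is false, since the second word contains only one upper-case letter.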
comment:27 in reply to: ↑ 26 ; follow-up: ↓ 28 Changed 5 years ago by cboos
Replying to mgood:
… I was hoping a straight regex solution would be faster than mixing lookbehinds and an additional Python method to test whether it actually matched a WikiPageName
Actually it seems even significantly slower:
- Running trac/tests/allwiki.py with r4722 takes 1.13 seconds
- The same test with attachment:unicode_wiki_links.diff takes 1.76 seconds
(± 0.01 seconds in each case)
I think this is because testing the membership of a given character in a very large [] range must be slower than looking for \w followed by a lookbehind check. The final check uses islower() and isupper(), which are quite fast as well.
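The two-step approach described here can be sketched roughly as follows, assuming Python 3. This is an illustration only, not the code from r4722: the names `CANDIDATE`, `is_camel_case` and `find_wiki_page_names` are hypothetical, and the lookbehind part is left out for brevity. A cheap `\w+` match finds candidate words, and a plain Python check using `isupper()` and `islower()` then validates the whole word.

```python
import re

# Step 1: a cheap, Unicode-aware match for candidate words.
# In Python 3, \w matches Unicode letters, digits and underscore for str patterns.
CANDIDATE = re.compile(r"\w+")


def is_camel_case(word):
    """Step 2: check the whole word, not only its beginning.

    The word must consist of at least two groups, each made of one
    upper-case character followed by one or more lower-case characters.
    """
    groups = 0
    i = 0
    while i < len(word):
        if not word[i].isupper():
            return False  # digits, underscores or a misplaced lower-case char
        i += 1
        start = i
        while i < len(word) and word[i].islower():
            i += 1
        if i == start:
            return False  # an upper-case char not followed by lower-case
        groups += 1
    return groups >= 2


def find_wiki_page_names(text):
    """Return the candidate words in text that pass the CamelCase check."""
    return [m.group(0) for m in CANDIDATE.finditer(text)
            if is_camel_case(m.group(0))]
```

The regular expression stays small because all case handling is delegated to the string methods, which is one plausible reason for the timing difference reported above.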
Replying to cboos:
The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie? (no unit test for that yet).
Actually it's not. … So, what does that change actually affect?
… words like wiki:ÜberflüssigkeitsTheorie?, i.e. when the first (or the last) character is Unicode. It's true that it's not that apparent for the above example, as "ÜberflüssigkeitsTheorie?" itself will get recognized as a WikiPageNames, only leaving the unprocessed "wiki:" prefix before it.
But the limitation is more visible for things that are not wiki page names and other link resolvers. I've added a few examples in r4723.
comment:28 in reply to: ↑ 27 Changed 5 years ago by cboos
Replying to cboos:
- Running trac/tests/allwiki.py with r4722 takes 1.13 seconds
- The same test with attachment:unicode_wiki_links.diff takes 1.76 seconds
(± 0.01 seconds in each case)
And I just checked the timings with r4692 (no unicode at all) on the same machine: I get 1.11 seconds, with the same variations. So as I said in the beginning, I think the performance impact of my solution is quite negligible.
comment:29 Changed 5 years ago by mgood
- Resolution set to fixed
- Status changed from new to closed
I also tried using regex range syntax instead of a complete list of characters, but it does seem marginally slower. So, I guess the current implementation is sufficient.
comment:30 Changed 2 years ago by cboos
See #9025 for a follow-up.



The regexp module (re) doesn't know the difference between upper and lower case letters in
unicode strings :(
Anyone got an idea how to do this?
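For illustration, the limitation can be reproduced in a few lines. This is a minimal sketch assuming Python 3, where `str` patterns are Unicode-aware by default; it is not code from this ticket, and the variable names are hypothetical. It shows that an ASCII range misses the umlaut, that `\w` matches letters of both cases alike, and that the string methods do distinguish case.

```python
import re

text = "ÜberflüssigkeitsTheorie"

# An ASCII range only finds the plain Latin capital letter.
ascii_upper = re.findall(r"[A-Z]", text)

# \w matches every letter regardless of case, so it cannot tell them apart.
word_chars = re.findall(r"\w", text)

# The str methods consult the Unicode character database instead.
unicode_upper = [ch for ch in text if ch.isupper()]

print(ascii_upper, len(word_chars), unicode_upper)
```

So any case-aware matching has to come either from explicit character lists in the pattern or from a check done outside the regular expression.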