#230 closed enhancement (fixed)
'CamelCase' words don't work with 'umlauts'
Reported by: | Owned by: | Matthew Good | |
---|---|---|---|
Priority: | normal | Milestone: | 0.11 |
Component: | wiki system | Version: | 0.6 |
Severity: | normal | Keywords: | unicode |
Cc: | gernot@… | Branch: | |
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
Try this and you know what I mean!
Attachments (2)
Change History (32)
comment:1 by , 21 years ago
Component: | general → wiki |
---|
comment:2 by , 21 years ago
Milestone: | → 0.6.1 |
---|
comment:3 by , 21 years ago
Milestone: | 0.6.1 → 0.7 |
---|
comment:4 by , 21 years ago
comment:5 by , 20 years ago
Milestone: | 0.7 → 0.8 |
---|
comment:7 by , 20 years ago
A discussion on Twiki:
http://twiki.org/cgi-bin/view/Codev/InternationalCharactersInWikiWords
comment:8 by , 20 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:9 by , 20 years ago
Milestone: | 0.8 → 0.9 |
---|
comment:10 by , 20 years ago
Just reporting that Cyrillic letters also do not work in camel case. They work neither in a purely Cyrillic camel-cased word nor in a mixed Latin-Cyrillic camel-cased word. Best regards.
comment:11 by , 19 years ago
Milestone: | 0.9 |
---|
If someone has a patch for this, we'll gladly put it in. At the moment, I don't see a good way to solve this issue…
comment:12 by , 19 years ago
Keywords: | unicode added |
---|
Note that as a workaround, one could use the explicit wiki link notation:
[wiki:ÜberflüssigkeitsTheorie]
→ ÜberflüssigkeitsTheorie
comment:13 by , 18 years ago
by , 18 years ago
Attachment: | wiki-t230.diff added |
---|
Unicode aware regexp for the wiki engine and the WikiPageNames syntax
comment:15 by , 18 years ago
Milestone: | → 0.11 |
---|---|
Owner: | changed from | to
Status: | assigned → new |
With attachment:wiki-t230.diff (diff on top of r4680), the example WikiPageName given above is now recognized in both the short wiki: TracLinks form and the CamelCase form:
All the unit tests still pass. The performance impact of the changes seems quite negligible: the patched code is only 3% slower than without the patch.
comment:16 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:17 by , 18 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
are links with "not uppercase chars except whitespace and separators" valid now, like TracRel20 (like in http://moinmoin.wikiwikiweb.de/WikiName)?
comment:18 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
TracRel20 would be discussed in #425.
Btw, there's no need to reopen the ticket for asking a question about a closed ticket, we see everything ;-)
comment:19 by , 18 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
r4693 produces unexpected behaviour on words with lower-case international characters. The expected behaviour would be that Småbokstaver should not produce a link, whereas SmÅogstore should. I will try to reopen this ticket, as I believe it is an insufficient fix (words with umlauts actually work, but they do so regardless of being CamelCase).
comment:20 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Oops, thanks for the notice. It should really be fixed with r4709 now.
comment:21 by , 18 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Type: | defect → enhancement |
comment:22 by , 18 years ago
Owner: | changed from | to
---|---|
Status: | reopened → new |
follow-up: 24 comment:23 by , 18 years ago
Right, I stopped checking as soon as a wiki name started CamelCase… Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?
follow-up: 25 comment:24 by , 18 years ago
Replying to cboos:
Right, I stopped checking as soon as a wiki name started CamelCase… Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?
Yes, I have a fix that will handle this in the regular expression, and without using the lookbehinds. I have some unit tests for the WikiPageNames syntax, but I could use an example of the effect of the SHREF_ changes in r4693. Can you provide an example of a Wiki link that requires those changes?
follow-up: 26 comment:25 by , 18 years ago
Replying to mgood:
Replying to cboos:
Right, I stopped checking as soon as a wiki name started CamelCase… Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?
Yes, I have a fix that will handle this in the regular expression, and without using the lookbehinds.
Ok, I'm very interested to see that, as I failed to find a regexp-only solution. Nevertheless, I thought it would be good to "finish" the approach started in r4693 and r4709, with r4717, so that at least it was working the intended way (and thereby fixing the regression for #3240).
I have some unit tests for the WikiPageNames syntax, but I could use an example of the effect of the SHREF_ changes in r4693. Can you provide an example of a Wiki link that requires those changes?
In addition, r4717 adds also a few examples of unicode WikiPageNames.
The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie (no unit test for that yet).
by , 18 years ago
Attachment: | unicode_wiki_links.diff added |
---|
alternate solution for unicode WikiPageNames
follow-up: 27 comment:26 by , 18 years ago
I've attached my patch, which uses two strings defined in trac.util.text for identifying all upper- and lower-case unicode characters. I was hoping a straight regex solution would be faster than mixing lookbehinds and an additional Python method to test whether it actually matched a WikiPageName, though it's hard to tell on my laptop, since the processor frequency scaling makes it difficult to benchmark accurately. I'll try on my other computer when I get home.
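For readers following along, here is a rough sketch of the "two strings" idea in modern Python 3. It is not the code from the attached patch: the names `build_case_classes`, `UPPER`, `LOWER` and `is_camel_case`, and the code point limit of 0x530 (roughly Latin, Greek and Cyrillic), are assumptions made for illustration.

```python
import re


def build_case_classes(limit=0x530):
    """Collect upper- and lower-case characters below `limit` into two strings.

    The limit is an arbitrary choice for this sketch; it covers the Latin,
    Greek and Cyrillic blocks.
    """
    upper, lower = [], []
    for code_point in range(limit):
        ch = chr(code_point)
        if ch.isupper():
            upper.append(ch)
        elif ch.islower():
            lower.append(ch)
    return "".join(upper), "".join(lower)


UPPER, LOWER = build_case_classes()

# Two or more "one upper-case letter followed by lower-case letters" humps.
CAMEL_RE = re.compile(
    "(?:[%s][%s]+){2,}" % (re.escape(UPPER), re.escape(LOWER)))


def is_camel_case(word):
    """Return True if the whole word is CamelCase under the classes above."""
    return CAMEL_RE.fullmatch(word) is not None
```

With these classes, `is_camel_case("SmÅogstore")` is true and `is_camel_case("Småbokstaver")` is false, which matches the behaviour requested in comment:19. The trade-off is that the regular expression carries two very large character classes.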
Replying to cboos:
The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie (no unit test for that yet).
Actually it's not. I reverted your changes before starting on my solution, and as you can see from the unit test I added for those types of links it works without that change. So, what does that change actually affect?
follow-up: 28 comment:27 by , 18 years ago
Replying to mgood:
… I was hoping a straight regex solution would be faster than mixing lookbehinds and an additional Python method to test whether it actually matched a WikiPageName
Actually it seems even significantly slower:
- Running trac/tests/allwiki.py with r4722 takes 1.13 seconds
- The same test with attachment:unicode_wiki_links.diff takes 1.76 seconds
(± 0.01 seconds in each case)
I think this is because testing the membership of a given character in a very large [] range must be slower than looking for \w followed by a lookbehind check. The final check is using islower() and isupper(), which are quite fast as well.
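As a rough illustration of that second approach, the sketch below lets the regular expression find candidate words with \w and then checks the whole candidate with isupper() and islower(). It is written for Python 3 and leaves out the lookbehind part; the names `WORD_RE`, `is_wiki_page_name` and `find_wiki_page_names` are hypothetical and not taken from r4717.

```python
import re

# In Python 3, \w is Unicode-aware for str patterns, so this finds
# candidate words regardless of script.
WORD_RE = re.compile(r"\w+")


def is_wiki_page_name(word):
    """Check the whole word: two or more humps of one upper-case letter
    followed by at least one lower-case letter, and nothing else."""
    humps = 0
    prev_upper = False
    for index, ch in enumerate(word):
        if ch.isupper():
            if prev_upper:
                return False  # two capitals in a row
            humps += 1
            prev_upper = True
        elif ch.islower():
            if index == 0:
                return False  # must start with an upper-case letter
            prev_upper = False
        else:
            return False  # digits, underscores, caseless characters
    return humps >= 2 and not prev_upper


def find_wiki_page_names(text):
    """Return the candidate words in `text` that pass the case check."""
    return [w for w in WORD_RE.findall(text) if is_wiki_page_name(w)]
```

Because the check runs over the whole string, a word such as Småbokstaver is rejected even though it starts like a CamelCase word.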
Replying to cboos:
The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie (no unit test for that yet).
Actually it's not. … So, what does that change actually affect?
… words like wiki:ÜberflüssigkeitsTheorie, i.e. when the first (or the last) character is Unicode. It's true that it's not that apparent for the above example, as "ÜberflüssigkeitsTheorie" itself will get recognized as a WikiPageNames, only leaving the unprocessed "wiki:" prefix before it.
But the limitation is more visible for things that are not wiki page names and other link resolvers. I've added a few examples in r4723.
comment:28 by , 18 years ago
Replying to cboos:
- Running trac/tests/allwiki.py with r4722 takes 1.13 seconds
- The same test with attachment:unicode_wiki_links.diff takes 1.76 seconds
(± 0.01 seconds in each case)
And I just checked the timings with r4692 (no unicode at all) on the same machine, I get 1.11 seconds, with the same variations. So as I said in the beginning, I think the performance impact of my solution is quite negligible.
comment:29 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
I also tried using regex range syntax instead of a complete list of characters, but it does seem marginally slower. So, I guess the current implementation is sufficient.
The regexp module (re) doesn't know the difference between upper and lower case letters in unicode strings :(
Anyone got an idea how to do this?
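The limitation can be reproduced with a small Python 3 sketch. The pattern and function names here are made up for the demonstration: an ASCII-only CamelCase pattern misses the umlaut example, while \w accepts the word but cannot tell upper case from lower case.

```python
import re

# ASCII-only CamelCase: [A-Z] and [a-z] know nothing about Ü or ü.
ASCII_CAMEL = re.compile(r"(?:[A-Z][a-z]+){2,}")

# \w matches letters of any case, so it cannot express "upper then lower".
ANY_WORD = re.compile(r"\w+")


def ascii_camel(word):
    """True if the whole word is CamelCase using ASCII ranges only."""
    return ASCII_CAMEL.fullmatch(word) is not None


def any_word(word):
    """True if the whole word consists of \\w characters."""
    return ANY_WORD.fullmatch(word) is not None
```

Here `ascii_camel("ÜberflüssigkeitsTheorie")` is false, and `any_word` is true for both "ÜberflüssigkeitsTheorie" and "überflüssig", so neither pattern alone can pick out Unicode CamelCase words.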