Opened 15 years ago

Closed 13 years ago

Last modified 10 years ago

#230 closed enhancement (fixed)

'CamelCase' words don't work with 'umlauts'

Reported by: gernot@…
Owned by: Matthew Good
Priority: normal
Milestone: 0.11
Component: wiki system
Version: 0.6
Severity: normal
Keywords: unicode
Cc: gernot@…
Branch:
Release Notes:
API Changes:

Description

Try this and you know what I mean!

ÜberflüssigkeitsTheorie

Attachments (2)

wiki-t230.diff (2.0 KB ) - added by Christian Boos 13 years ago.
Unicode aware regexp for the wiki engine and the WikiPageNames syntax
unicode_wiki_links.diff (29.6 KB ) - added by Matthew Good 13 years ago.
alternate solution for unicode WikiPageNames


Change History (32)

comment:1 by gernot@…, 15 years ago

Component: general → wiki

comment:2 by daniel, 15 years ago

Milestone: 0.6.1

comment:3 by rocky, 15 years ago

Milestone: 0.6.1 → 0.7

comment:4 by Jonas Borgström, 15 years ago

The regexp module (re) doesn't know the difference between upper and lower case letters in unicode strings :(

Anyone got an idea how to do this?
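To make the limitation concrete, here is a minimal sketch in modern Python 3 (the ticket predates it, so this is an illustration rather than the Trac code of the time). The pattern and the helper name `is_ascii_camel` are assumptions made for the example. An ASCII-only character class such as `[A-Z]` cannot see `Ü` as an uppercase letter, and `re` offers no built-in class for "any uppercase letter".

```python
import re

# Hypothetical ASCII-only CamelCase pattern: two or more "humps", each an
# uppercase ASCII letter followed by one or more lowercase ASCII letters.
ASCII_CAMEL = re.compile(r"(?:[A-Z][a-z]+){2,}")


def is_ascii_camel(word):
    """Return True if the whole word matches the ASCII-only pattern."""
    return ASCII_CAMEL.fullmatch(word) is not None


print(is_ascii_camel("WikiPageName"))             # True
print(is_ascii_camel("ÜberflüssigkeitsTheorie"))  # False: Ü and ü are outside A-Z / a-z
```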

comment:5 by daniel, 15 years ago

Milestone: 0.7 → 0.8

comment:6 by daniel, 15 years ago

Here's how ZWiki does it:

http://zwiki.org/zwikidir/Regexps.py

comment:8 by daniel, 15 years ago

Owner: changed from Jonas Borgström to daniel
Status: new → assigned

comment:9 by daniel, 15 years ago

Milestone: 0.8 → 0.9

comment:10 by Alexander Shopov <ash@…>, 14 years ago

Just reporting that Cyrillic letters also do not work in camel case. They work neither in a purely Cyrillic camel-cased word nor in a mixed Latin-Cyrillic camel-cased word. Best regards.

comment:11 by Christopher Lenz, 14 years ago

Milestone: 0.9

If someone has a patch for this, we'll gladly put it in. At the moment, I don't see a good way to solve this issue…

comment:12 by Christian Boos, 13 years ago

Keywords: unicode added

Note that, as a workaround, one can use the explicit wiki link notation: [wiki:ÜberflüssigkeitsTheorie]

comment:13 by ThurnerRupert, 13 years ago

see related #4104, #4663. shall i close #4104 as duplicate, even if it deals with whitespace in wiki names?

comment:14 by ThurnerRupert, 13 years ago

#4633 of course, not #4663.

by Christian Boos, 13 years ago

Attachment: wiki-t230.diff added

Unicode aware regexp for the wiki engine and the WikiPageNames syntax

comment:15 by Christian Boos, 13 years ago

Milestone: 0.11
Owner: changed from daniel to Christian Boos
Status: assigned → new

With attachment:wiki-t230.diff (diff on top of r4680), the example WikiPageName given above is now recognized in both short wiki: TracLinks form and CamelCase form:

All the unit tests still pass. The performance impact of the changes appears to be negligible: rendering is only about 3% slower than without the patch.

comment:16 by Christian Boos, 13 years ago

Resolution: fixed
Status: new → closed

Patch applied in r4693.

Near miss with Alec's r4691 change on the wrong branch ;)

comment:17 by ThurnerRupert, 13 years ago

Resolution: fixed
Status: closed → reopened

are links with "not uppercase chars except whitespace and separators" valid now, like TracRel20 (like in http://moinmoin.wikiwikiweb.de/WikiName)?

comment:18 by Christian Boos, 13 years ago

Resolution: fixed
Status: reopened → closed

TracRel20 would be discussed in #425.

Btw, there's no need to reopen the ticket for asking a question about a closed ticket, we see everything ;-)

comment:19 by christian.skarby@…, 13 years ago

Resolution: fixed
Status: closed → reopened

r4693 produces unexpected behaviour on words with lower-case international characters. The expected behaviour is that Småbokstaver should not produce a link, whereas SmÅogstore should. I am reopening this ticket, as I believe the fix is insufficient (words with umlauts do work now, but they do so regardless of whether they are CamelCase).

comment:20 by Christian Boos, 13 years ago

Resolution: fixed
Status: reopened → closed

Oops, thanks for the notice. It should really be fixed now, with r4709.

comment:21 by Matthew Good, 13 years ago

Resolution: fixed
Status: closed → reopened
Type: defect → enhancement

The check_unicode_camelcase from r4709 causes false-positives such as AbAbÅ to be considered CamelCase.

I'm working on a more robust solution to this.

comment:22 by Matthew Good, 13 years ago

Owner: changed from Christian Boos to Matthew Good
Status: reopened → new

comment:23 by Christian Boos, 13 years ago

Right, I stopped checking as soon as a wiki name started CamelCase… Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?
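As a rough illustration of "continuing the check for the whole string", the sketch below walks every character using `str.isupper()` and `str.islower()`. It is not the code from r4709 or r4717: the function name `is_unicode_camelcase` and the simplified rule (two or more humps, each an uppercase letter followed by at least one lowercase letter, with nothing left over) are assumptions made for this example, written in Python 3.

```python
def is_unicode_camelcase(word):
    """Check the *whole* word, not just its beginning (hypothetical helper).

    Assumed rule: the word consists of two or more humps, where a hump is
    one uppercase character followed by one or more lowercase characters.
    """
    humps = 0
    i, n = 0, len(word)
    while i < n:
        if not word[i].isupper():
            return False          # a hump must start with an uppercase letter
        i += 1
        start = i
        while i < n and word[i].islower():
            i += 1
        if i == start:
            return False          # uppercase letter with no lowercase after it
        humps += 1
    return humps >= 2


for candidate in ("ÜberflüssigkeitsTheorie", "SmÅogstore", "Småbokstaver", "AbAbÅ"):
    print(candidate, is_unicode_camelcase(candidate))
```

With this rule, ÜberflüssigkeitsTheorie and SmÅogstore are accepted, while Småbokstaver (a single hump) and AbAbÅ (a trailing uppercase letter) are rejected, which matches the behaviour requested in the comments above.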

in reply to:  23 ; comment:24 by Matthew Good, 13 years ago

Replying to cboos:

Right, I stopped checking as soon as a wiki name started CamelCase… Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?

Yes, I have a fix that will handle this in the regular expression, and without using the lookbehinds. I have some unit tests for the WikiPageNames syntax, but I could use an example of the effect of the SHREF_ changes in r4693. Can you provide an example of a Wiki link that requires those changes?

in reply to:  24 ; comment:25 by Christian Boos, 13 years ago

Replying to mgood:

Replying to cboos:

Right, I stopped checking as soon as a wiki name started CamelCase… Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?

Yes, I have a fix that will handle this in the regular expression, and without using the lookbehinds.

Ok, I'm very interested to see that, as I failed to find a regexp-only solution. Nevertheless, I thought it would be good to "finish" the approach started in r4693 and r4709, with r4717, so that at least it was working the intended way (and thereby fixing the regression for #3240).

I have some unit tests for the WikiPageNames syntax, but I could use an example of the effect of the SHREF_ changes in r4693. Can you provide an example of a Wiki link that requires those changes?

In addition, r4717 adds also a few examples of unicode WikiPageNames.

The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie (no unit test for that yet).

by Matthew Good, 13 years ago

Attachment: unicode_wiki_links.diff added

alternate solution for unicode WikiPageNames

in reply to:  25 ; comment:26 by Matthew Good, 13 years ago

I've attached my patch, which uses two strings defined in trac.util.text to identify all upper- and lower-case unicode characters. I was hoping a straight regex solution would be faster than mixing lookbehinds with an additional Python method that tests whether the match is actually a WikiPageName, though it's hard to tell on my laptop, since the processor frequency scaling makes it difficult to benchmark accurately. I'll try on my other computer when I get home.
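The general idea of the regex-only approach can be sketched as follows, in Python 3. This is not the attached patch: the names `chars_where`, `UPPER`, `LOWER` and `WIKI_PAGE_NAME` are hypothetical, the strings are built at import time rather than taken from trac.util.text, the scan is limited to the Basic Multilingual Plane to keep it cheap, and the CamelCase rule is deliberately simplified.

```python
import re


def chars_where(predicate, limit=0x10000):
    """Collect every character below `limit` for which predicate(char) is true."""
    return "".join(c for c in map(chr, range(limit)) if predicate(c))


# Two (large) strings listing all upper- and lower-case characters.
UPPER = chars_where(str.isupper)
LOWER = chars_where(str.islower)

# Simplified WikiPageName rule: two or more humps of one uppercase
# character followed by one or more lowercase characters.
WIKI_PAGE_NAME = re.compile(
    "(?:[%s][%s]+){2,}" % (re.escape(UPPER), re.escape(LOWER))
)


def is_wiki_page_name(word):
    """Return True if the whole word matches the character-class pattern."""
    return WIKI_PAGE_NAME.fullmatch(word) is not None


for candidate in ("ÜberflüssigkeitsTheorie", "SmÅogstore", "Småbokstaver", "AbAbÅ"):
    print(candidate, is_wiki_page_name(candidate))
```

The trade-off is that each character class holds well over a thousand characters, whereas a `\w`-based pattern followed by an `isupper()`/`islower()` check keeps the regular expression small and does the case test in Python.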

Replying to cboos:

The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie (no unit test for that yet).

Actually it's not. I reverted your changes before starting on my solution, and as you can see from the unit test I added for those types of links it works without that change. So, what does that change actually affect?

in reply to:  26 ; comment:27 by Christian Boos, 13 years ago

Replying to mgood:

… I was hoping a straight regex solution would be faster than mixing lookbehinds and an additional Python method to test whether it actually matched a WikiPageName

Actually it seems even significantly slower:

(± 0.01 seconds in each case)

I think this is because testing the membership of a given character in a very large [] range must be slower than looking for \w followed by a lookbehind check. The final check uses islower() and isupper(), which are quite fast as well.

Replying to cboos:

The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie (no unit test for that yet).

Actually it's not. … So, what does that change actually affect?

… words like wiki:ÜberflüssigkeitsTheorie, i.e. when the first (or the last) character is Unicode. It's true that it's not that apparent for the above example, as "ÜberflüssigkeitsTheorie" itself will get recognized as a WikiPageNames, only leaving the unprocessed "wiki:" prefix before it.

But the limitation is more visible for things that are not wiki page names and other link resolvers. I've added a few examples in r4723.

in reply to:  27 comment:28 by Christian Boos, 13 years ago

Replying to cboos:

(± 0.01 seconds in each case)

And I just checked the timings with r4692 (no unicode at all) on the same machine, I get 1.11 seconds, with the same variations. So as I said in the beginning, I think the performance impact of my solution is quite negligible.

comment:29 by Matthew Good, 13 years ago

Resolution: fixed
Status: new → closed

I also tried using regex range syntax instead of a complete list of characters, but it does seem marginally slower. So, I guess the current implementation is sufficient.

comment:30 by Christian Boos, 10 years ago

See #9025 for a follow-up.
