Edgewall Software

Ticket #230 (closed enhancement: fixed)

Opened 8 years ago

Last modified 2 years ago

'CamelCase' words don't work with 'umlauts'

Reported by: gernot@…
Owned by: mgood
Priority: normal
Milestone: 0.11
Component: wiki system
Version: 0.6
Severity: normal
Keywords: unicode
Cc: gernot@…
Release Notes:
API Changes:

Description

Try this and you know what I mean!

ÜberflüssigkeitsTheorie?

Attachments

wiki-t230.diff (2.0 KB) - added by cboos 5 years ago.
Unicode aware regexp for the wiki engine and the WikiPageNames syntax
unicode_wiki_links.diff (29.6 KB) - added by mgood 5 years ago.
alternate solution for unicode WikiPageNames


Change History

comment:1 Changed 8 years ago by gernot@…

  • Component changed from general to wiki

comment:2 Changed 8 years ago by daniel

  • Milestone set to 0.6.1

comment:3 Changed 8 years ago by rocky

  • Milestone changed from 0.6.1 to 0.7

comment:4 Changed 8 years ago by jonas

The regexp module (re) doesn't know the difference between upper and lower case letters in
unicode strings :(

Anyone got an idea how to do this?
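
A minimal sketch of the limitation, assuming Python 3 and a simplified ASCII-only CamelCase pattern (not the actual Trac regexp). The standard `re` module has no character class meaning "uppercase letter", so `[A-Z]` misses "Ü", while the string methods do consult the Unicode database:

```python
import re

# "ÜberflüssigkeitsTheorie", written with escapes so the code points are unambiguous.
word = "\u00dcberfl\u00fcssigkeitsTheorie"

# Simplified ASCII-only CamelCase pattern (an assumption for illustration).
ascii_camel = re.compile(r"(?:[A-Z][a-z]+){2,}")

# The whole word cannot match: "Ü" is outside [A-Z] and "ü" is outside [a-z].
ascii_matches_whole = ascii_camel.fullmatch(word) is not None

# str.isupper() knows that "Ü" is an uppercase letter.
first_is_upper = word[0].isupper()

print(ascii_matches_whole, first_is_upper)
```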

comment:5 Changed 8 years ago by daniel

  • Milestone changed from 0.7 to 0.8

comment:6 Changed 8 years ago by daniel

Here's how ZWiki does it:

http://zwiki.org/zwikidir/Regexps.py

comment:8 Changed 8 years ago by daniel

  • Owner changed from jonas to daniel
  • Status changed from new to assigned

comment:9 Changed 8 years ago by daniel

  • Milestone changed from 0.8 to 0.9

comment:10 Changed 7 years ago by Alexander Shopov <ash@…>

Just reporting that Cyrillic letters also do not work in camel case.
They work neither in a purely Cyrillic camel-cased word nor in a mixed Latin-Cyrillic camel-cased word.
Best regards.

comment:11 Changed 7 years ago by cmlenz

  • Milestone 0.9 deleted

If someone has a patch for this, we'll gladly put it in. At the moment, I don't see a good way to solve this issue…

comment:12 Changed 6 years ago by cboos

  • Keywords unicode added

Note that as a workaround, one could use the explicit wiki link notation:
[wiki:ÜberflüssigkeitsTheorie], which renders as ÜberflüssigkeitsTheorie?

comment:13 Changed 5 years ago by ThurnerRupert

see related #4104, #4663. shall i close #4104 as duplicate, even if it deals with whitespace in wiki names?

comment:14 Changed 5 years ago by ThurnerRupert

#4633 of course, not #4663.

Changed 5 years ago by cboos

Unicode aware regexp for the wiki engine and the WikiPageNames syntax

comment:15 Changed 5 years ago by cboos

  • Milestone set to 0.11
  • Owner changed from daniel to cboos
  • Status changed from assigned to new

With attachment:wiki-t230.diff (diff on top of r4680), the example WikiPageName given above is now recognized in both short wiki: TracLinks form and CamelCase form:

All the unit tests still pass. The performance impact of the changes appears to be negligible: it is only about 3% slower than without the patch.

comment:16 Changed 5 years ago by cboos

  • Resolution set to fixed
  • Status changed from new to closed

Patch applied in r4693.

Near miss with Alec's r4691 change on the wrong branch ;)

comment:17 Changed 5 years ago by ThurnerRupert

  • Resolution fixed deleted
  • Status changed from closed to reopened

are links with "not uppercase chars except whitespace and separators" valid now, like TracRel20 (like in http://moinmoin.wikiwikiweb.de/WikiName)?

comment:18 Changed 5 years ago by cboos

  • Resolution set to fixed
  • Status changed from reopened to closed

TracRel20 would be discussed in #425.

Btw, there's no need to reopen the ticket for asking a question about a closed ticket, we see everything ;-)

comment:19 Changed 5 years ago by christian.skarby@…

  • Resolution fixed deleted
  • Status changed from closed to reopened

r4693 produces unexpected behaviour on words with lower-case international characters. The expected behaviour would be that Småbokstaver should not produce a link, whereas SmÅogstore? should. I will try to reopen this ticket, as I believe the fix is insufficient (words with umlauts do work, but they do so regardless of whether they are CamelCase).

comment:20 Changed 5 years ago by cboos

  • Resolution set to fixed
  • Status changed from reopened to closed

Oops, thanks for the notice. It should really be fixed now, with r4709.

comment:21 Changed 5 years ago by mgood

  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Type changed from defect to enhancement

The check_unicode_camelcase from r4709 causes false-positives such as AbAbÅ to be considered CamelCase.

I'm working on a more robust solution to this.

comment:22 Changed 5 years ago by mgood

  • Owner changed from cboos to mgood
  • Status changed from reopened to new

comment:23 follow-up: Changed 5 years ago by cboos

Right, I stopped checking as soon as a wiki name started out as CamelCase.
Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?
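
A minimal sketch of what continuing the check over the whole string could look like. The function name and the rule (two or more groups of one uppercase letter followed by at least one lowercase letter) are assumptions for illustration, not the code from r4709 or r4717:

```python
def is_unicode_camelcase(word: str) -> bool:
    """Hypothetical whole-string check: two or more Upper+lower groups."""
    groups = 0
    i = 0
    n = len(word)
    while i < n:
        # Each group must open with one uppercase letter...
        if not word[i].isupper():
            return False
        i += 1
        # ...followed by at least one lowercase letter.
        start = i
        while i < n and word[i].islower():
            i += 1
        if i == start:
            return False
        groups += 1
    return groups >= 2


# "AbAbÅ" is rejected because the trailing "Å" has no lowercase letter after it.
print(is_unicode_camelcase("AbAb\u00c5"))
# "SmÅogstore" is accepted, "Småbokstaver" is not.
print(is_unicode_camelcase("Sm\u00c5ogstore"), is_unicode_camelcase("Sm\u00e5bokstaver"))
```

Because the loop only returns True after consuming every character, a name that merely starts out as CamelCase no longer passes.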

comment:24 in reply to: ↑ 23 ; follow-up: Changed 5 years ago by mgood

Replying to cboos:

Right, I stopped checking as soon as a wiki name started out as CamelCase.
Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?

Yes, I have a fix that will handle this in the regular expression, and without using the lookbehinds. I have some unit tests for the WikiPageNames syntax, but I could use an example of the effect of the SHREF_ changes in r4693. Can you provide an example of a Wiki link that requires those changes?

comment:25 in reply to: ↑ 24 ; follow-up: Changed 5 years ago by cboos

Replying to mgood:

Replying to cboos:

Right, I stopped checking as soon as a wiki name started out as CamelCase.
Fixing it by continuing the check for the whole string should be relatively straightforward, or do you have a completely different approach in mind?

Yes, I have a fix that will handle this in the regular expression, and without using the lookbehinds.

Ok, I'm very interested to see that, as I failed to find a regexp-only solution.
Nevertheless, I thought it would be good to "finish" the approach started in r4693 and r4709, with r4717, so that at least it was working the intended way (and thereby fixing the regression for #3240).

I have some unit tests for the WikiPageNames syntax, but I could use an example of the effect of the SHREF_ changes in r4693. Can you provide an example of a Wiki link that requires those changes?

In addition, r4717 also adds a few examples of unicode WikiPageNames.

The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie? (no unit test for that yet).

Changed 5 years ago by mgood

alternate solution for unicode WikiPageNames

comment:26 in reply to: ↑ 25 ; follow-up: Changed 5 years ago by mgood

I've attached my patch, which uses two strings defined in trac.util.text to identify all upper- and lower-case unicode characters. I was hoping a straight regex solution would be faster than mixing lookbehinds with an additional Python method that tests whether the match is actually a WikiPageName, but it's hard to tell on my laptop, since the processor frequency scaling makes accurate benchmarking difficult. I'll try on my other computer when I get home.
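
The general idea of a regex-only approach can be sketched as follows. The names `UPPER`, `LOWER` and `CAMEL`, the restriction to the Basic Multilingual Plane, and the simplified CamelCase rule are all assumptions for illustration; they are not the contents of the attached patch or of trac.util.text:

```python
import re

# Hypothetical stand-ins for the two strings: every BMP character that
# Python classifies as uppercase or lowercase.
BMP = [chr(cp) for cp in range(0x10000)]
UPPER = "".join(c for c in BMP if c.isupper())
LOWER = "".join(c for c in BMP if c.islower())

# Simplified rule: two or more groups of one uppercase letter followed by
# one or more lowercase letters, expressed purely as character classes.
CAMEL = re.compile(f"(?:[{re.escape(UPPER)}][{re.escape(LOWER)}]+){{2,}}")

for candidate in ("Sm\u00c5ogstore", "Sm\u00e5bokstaver", "AbAb\u00c5"):
    print(candidate, CAMEL.fullmatch(candidate) is not None)
```

With this shape no Python-side post-check is needed, at the cost of two very large character classes.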

Replying to cboos:

The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie? (no unit test for that yet).

Actually it's not. I reverted your changes before starting on my solution, and as you can see from the unit test I added for those types of links it works without that change. So, what does that change actually affect?

comment:27 in reply to: ↑ 26 ; follow-up: Changed 5 years ago by cboos

Replying to mgood:

… I was hoping a straight regex solution would be faster than mixing lookbehinds and an additional Python method to test whether it actually matched a WikiPageName

Actually it seems even significantly slower:

(± 0.01 seconds in each case)

I think this is because testing the membership of a given character in a very large [] range must be slower than looking for \w followed by a lookbehind check. The final check uses islower() and isupper(), which are quite fast as well.
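
A rough sketch of the two-step shape described here: a cheap `\w+` scan for candidates, then a Python-side check based on isupper() and islower(). The helper names and the "case signature" trick are assumptions for illustration, and the lookbehinds used in the actual changesets are omitted:

```python
import re

WORD = re.compile(r"\w+")            # \w is Unicode-aware for str patterns in Python 3
SHAPE = re.compile(r"(?:UL+){2,}")   # hypothetical CamelCase rule on the signature


def case_signature(word):
    """Map each character to 'U' (upper), 'L' (lower) or '?' (anything else)."""
    return "".join(
        "U" if c.isupper() else "L" if c.islower() else "?" for c in word
    )


def find_camelcase(text):
    """Return the candidate words whose case signature looks like CamelCase."""
    return [w for w in WORD.findall(text) if SHAPE.fullmatch(case_signature(w))]


print(find_camelcase("Sm\u00e5bokstaver and Sm\u00c5ogstore but not AbAb\u00c5"))
```

The regular expression stays small because the Unicode knowledge lives in the string methods rather than in the pattern.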

Replying to cboos:

The change introduced in r4693 for shref would be required for things like wiki:ÜberflüssigkeitsTheorie? (no unit test for that yet).

Actually it's not. … So, what does that change actually affect?

… words like wiki:ÜberflüssigkeitsTheorie?, i.e. when the first (or the last) character is Unicode. It's true that it's not that apparent for the above example, as "ÜberflüssigkeitsTheorie?" itself will get recognized as a WikiPageName, only leaving the unprocessed "wiki:" prefix before it.

But the limitation is more visible for things that are not wiki page names and other link resolvers. I've added a few examples in r4723.

comment:28 in reply to: ↑ 27 Changed 5 years ago by cboos

Replying to cboos:

(± 0.01 seconds in each case)

And I just checked the timings with r4692 (no unicode at all) on the same machine: I get 1.11 seconds, with the same variations. So, as I said in the beginning, I think the performance impact of my solution is quite negligible.

comment:29 Changed 5 years ago by mgood

  • Resolution set to fixed
  • Status changed from new to closed

I also tried using regex range syntax instead of a complete list of characters, but it does seem marginally slower. So, I guess the current implementation is sufficient.
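
For reference, collapsing a complete list of characters into regex range syntax could be done roughly like this. The helper `to_class_ranges` is hypothetical and is not taken from the Trac source:

```python
import re


def to_class_ranges(chars):
    """Collapse characters into character-class range syntax, e.g. 'abcx' -> 'a-cx'."""
    points = sorted(set(map(ord, chars)))
    parts = []
    i = 0
    while i < len(points):
        # Extend j while the code points are consecutive.
        j = i
        while j + 1 < len(points) and points[j + 1] == points[j] + 1:
            j += 1
        lo = re.escape(chr(points[i]))
        hi = re.escape(chr(points[j]))
        parts.append(lo if i == j else lo + "-" + hi)
        i = j + 1
    return "".join(parts)


# Build an "uppercase" class from the BMP and compile it with ranges.
upper = "".join(chr(cp) for cp in range(0x10000) if chr(cp).isupper())
upper_class = re.compile("[" + to_class_ranges(upper) + "]")

print(to_class_ranges("abcx"))
print(upper_class.fullmatch("\u00c5") is not None)
```

The resulting pattern is much shorter than the full list; whether it matches faster has to be measured, as noted above.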

comment:30 Changed 2 years ago by cboos

See #9025 for a follow-up.
