Opened 16 years ago

Closed 16 years ago

#7552 closed defect (wontfix)

Japanese strings don't match in module "re"

Reported by: kondo@…
Owned by:
Priority: normal
Milestone:
Component: wiki system
Version: 0.11.1
Severity: normal
Keywords:
Cc:
Branch:
Release Notes:
API Changes:
Internal Changes:

Description

Autowikify plugin uses IWikiSyntaxProvider methods.

http://trac-hacks.org/browser/autowikifyplugin/trunk/tracautowikify/autowikify.py#L50

In a Japanese environment, when I use Japanese wiki page names and content, the page names are not matched.

http://trac-hacks.org/ticket/2252

I think the compile method of the "re" module does not take the locale into account.

I expect regular expressions to respect the locale.

I made locale_add.patch and applied it. With the patch, matching works correctly in a Japanese environment.

Environment

  • OS:WindowsXP SP2
  • Python 2.5.2

Attachments (2)

locale_add.patch (765 bytes) - added by kondo@… 16 years ago.
formatter_test.patch (1.1 KB) - added by kondo@… 16 years ago.

Change History (6)

by kondo@…, 16 years ago

Attachment: locale_add.patch added

comment:1 by Tim Hatch, 16 years ago

Note: On trunk, this patch causes 6 test failures. If the patch is right, we might need to rework some regexps.

by kondo@…, 16 years ago

Attachment: formatter_test.patch added

comment:2 by kondo@…, 16 years ago

I am sorry for the delayed answer. It took me some time to work out the environment for the unit tests.

I resolved the 6 test failures that you pointed out.

I had applied the locale-aware patch to Trac, but these 6 test cases did not take the locale into account.

These test cases use German characters, for instance u-umlaut.

So I applied a patch to the test code as well.

It is formatter_test.patch.

Before the tests run, it stores the current locale and sets the locale to German. After the tests, it restores the original locale.

With this patch applied, all the tests pass.
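
The store, set, and restore steps described in this comment can be sketched as a small context manager. This is a minimal illustration and not the contents of formatter_test.patch: the helper name `temporary_locale` is hypothetical, the code assumes Python 3, and it switches only LC_CTYPE to the portable 'C' locale, because a name such as "German" is not available on every platform.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(name, category=locale.LC_CTYPE):
    """Switch `category` to locale `name` and restore the old value on exit.

    Hypothetical helper for illustration. locale.setlocale raises
    locale.Error if the platform does not provide the requested locale.
    """
    saved = locale.setlocale(category)  # one-argument form only queries
    locale.setlocale(category, name)
    try:
        yield
    finally:
        locale.setlocale(category, saved)  # restore even if the body fails


original = locale.setlocale(locale.LC_CTYPE)
with temporary_locale("C"):
    inside = locale.setlocale(locale.LC_CTYPE)
restored = locale.setlocale(locale.LC_CTYPE)
```

Restoring in a `finally` clause keeps a failing test from leaking its locale into the tests that run after it.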

comment:3 by Remy Blank, 16 years ago

Milestone: 0.13

The patch attachment:locale_add.patch seems to have a non-trivial effect. If I apply it, accented characters are not picked up anymore for a \w pattern in regexps. This explains the test failures.

The patch attachment:formatter_test.patch doesn't work here (Linux):

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "German")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/locale.py", line 478, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting

Indeed, the Python documentation page for the locale module mentions this:

According to POSIX, a program which has not called setlocale(LC_ALL, '') runs using the portable 'C' locale. Calling setlocale(LC_ALL, '') lets it use the default locale as defined by the LANG variable. Since we do not want to interfere with the current locale setting we thus emulate the behavior in the way described above.

Calling setlocale(LC_ALL, '') doesn't fix the problem. Actually, whatever I set the locale to, I cannot make the regexp \w* match e.g. "éàè" if the LOCALE flag is set.

Any ideas?
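
For comparison, here is a minimal sketch of how `\w` treats the same accented sample when the LOCALE flag is not involved. It assumes Python 3, where `str` patterns use Unicode character properties by default (the behaviour that `re.UNICODE` selects in the Python 2.5 discussed in this ticket). The variable names are illustrative.

```python
import re

# The accented sample from the comment above, written with escapes so the
# result does not depend on the source file encoding.
sample = "\u00e9\u00e0\u00e8"

# Unicode-aware matching: every character counts as a word character.
unicode_match = re.fullmatch(r"\w*", sample)

# ASCII-only matching: none of the accented characters is in [a-zA-Z0-9_],
# so \w* cannot cover the whole string and fullmatch fails.
ascii_match = re.fullmatch(r"\w*", sample, re.ASCII)
```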

comment:4 by Christian Boos, 16 years ago

Milestone: 0.13
Resolution: wontfix
Status: new → closed

I think we shouldn't use the LOCALE flag in regexps and, more generally, shouldn't use the locale for anything other than system defaults. Even better, staying with the portable 'C' locale by not calling setlocale at all would spare us the kind of trouble seen with buggy locales, like the "tr_TR" one (see #6953 and #7686). But this is slightly off-topic…

For this specific issue, there's another problem with attachment:locale_add.patch: it simply doesn't make sense to use re.UNICODE and re.LOCALE at the same time, as they're mutually incompatible:

L / LOCALE
    Make \w, \W, \b, \B, \s and \S dependent on the current locale.

...

U / UNICODE
    Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.
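
The two flags select different sources for the same character classes, which is why combining them is not meaningful. Later Python versions make this explicit. The following is a minimal sketch, assuming Python 3.6 or newer, where combining LOCALE with a `str` pattern is rejected at compile time. The helper name `compiles_with` is hypothetical.

```python
import re


def compiles_with(flags):
    """Return True if the sample pattern compiles with `flags`.

    Hypothetical helper for illustration only.
    """
    try:
        re.compile("\\b\u3042+\\b", flags)
    except ValueError:
        # Python 3.6+ refuses the LOCALE flag on str (Unicode) patterns.
        return False
    return True


unicode_only = compiles_with(re.UNICODE)
unicode_and_locale = compiles_with(re.UNICODE | re.LOCALE)
```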

And from the following experiment, it indeed seems they can't work together (taking the ああああ sample from #TH2252):

>>> import re
>>> re.search(u'\\b\u3042+\\b', u' \u3042\u3042\u3042\u3042 ', re.UNICODE).span()
(1, 5)

Works fine, but not:

>>> re.search(u'\\b\u3042+\\b', u' \u3042\u3042\u3042\u3042 ', re.UNICODE|re.LOCALE ) is None
True

Additionally:

>>> re.search(u'\\b\u3042+\\b', u' Foo\u3042\u3042\u3042\u3042Bar ', re.UNICODE) is None
True

The first and third matches show that it's indeed possible to match Japanese words using the \b marker and the re.UNICODE flag. Looking at the autowikify plugin, it seems that it's an issue with re.escape and unicode characters.
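
To illustrate the point about \b and Unicode-aware matching, here is a minimal sketch that builds one pattern from a list of page names. It is not the autowikify plugin's code: `build_page_pattern` and the sample page names are hypothetical, and the sketch assumes Python 3, where `str` patterns are Unicode-aware by default.

```python
import re


def build_page_pattern(page_names):
    """Compile a pattern matching any of `page_names` as a whole word.

    Hypothetical helper for illustration only.
    """
    # Longest names first so that a longer name wins over its own prefix.
    ordered = sorted(page_names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:%s)\b" % alternation, re.UNICODE)


pattern = build_page_pattern(["\u3042\u3042\u3042\u3042", "WikiStart"])

# The Japanese name surrounded by spaces is found as a whole word.
standalone = pattern.search(" \u3042\u3042\u3042\u3042 ")

# Embedded between other word characters there is no \b, so no match.
embedded = pattern.search(" Foo\u3042\u3042\u3042\u3042Bar ")
```

Here `standalone.span()` corresponds to the first experiment and `embedded` to the third.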

So this is not a Trac problem, see #TH2252.
