
Opened 6 years ago

Closed 6 years ago

#7552 closed defect (wontfix)

Japanese strings don't match in module "re"

Reported by: kondo@… Owned by:
Priority: normal Milestone:
Component: wiki system Version: 0.11.1
Severity: normal Keywords:
Cc:
Release Notes:
API Changes:

Description

Autowikify plugin uses IWikiSyntaxProvider methods.

http://trac-hacks.org/browser/autowikifyplugin/trunk/tracautowikify/autowikify.py#L50

In a Japanese environment, when I use Japanese wiki page names and content, they are not matched.

http://trac-hacks.org/ticket/2252

I think that the compile method of the "re" module doesn't take the locale into account.

I expect regular expressions to respect the locale.

I made locale_add.patch and applied it. With it, matching works correctly in a Japanese environment.

Environment

  • OS:WindowsXP SP2
  • Python 2.5.2

Attachments (2)

locale_add.patch (765 bytes) - added by kondo@… 6 years ago.
formatter_test.patch (1.1 KB) - added by kondo@… 6 years ago.


Change History (6)

Changed 6 years ago by kondo@…

comment:1 Changed 6 years ago by thatch

Note: On trunk, this patch causes 6 test failures. If it's right, we might need to rework some regexps.

Changed 6 years ago by kondo@…

comment:2 Changed 6 years ago by kondo@…

I am sorry for the delayed reply. It took me some time to set up an environment for the unit tests.

I resolved the 6 test failures that you pointed out.

I applied the locale-aware patch to Trac, but these 6 test cases did not take the locale into account.

These test cases use German text, for instance u-umlaut.

I applied a patch to the test code.

It is formatter_test.patch.

Before the tests, it stores the current locale and sets the locale to German. After the tests, it restores the original locale.

With this patch applied, all the tests passed.
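
The save/set/restore approach described above can be sketched as follows. This is not the content of formatter_test.patch; the helper name temporary_locale is hypothetical, and the sketch targets a current Python rather than 2.5. Locale names are platform-specific, so a name such as "German" may raise locale.Error on systems that do not provide it.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_locale(name, category=locale.LC_ALL):
    """Hypothetical helper: switch the locale for a block, then restore it."""
    # Calling setlocale() with no locale argument only queries the current setting.
    saved = locale.setlocale(category)
    try:
        # Raises locale.Error if the platform does not know this locale name.
        locale.setlocale(category, name)
        yield
    finally:
        # Restore the previous setting even if the block (or the switch) failed.
        locale.setlocale(category, saved)


before = locale.setlocale(locale.LC_ALL)
with temporary_locale("C"):
    inside = locale.setlocale(locale.LC_ALL)
after = locale.setlocale(locale.LC_ALL)
```

The portable "C" locale is used here only because it exists everywhere; a test that needs German would have to pick a name valid on the platform running it.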

comment:3 Changed 6 years ago by rblank

  • Milestone set to 0.13

The patch attachment:locale_add.patch seems to have a non-trivial effect. If I apply it, accented characters are not picked up anymore for a \w pattern in regexps. This explains the test failures.

The patch attachment:formatter_test.patch doesn't work here (Linux):

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "German")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/locale.py", line 478, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting

Indeed, the Python documentation page for the locale module mentions this:

According to POSIX, a program which has not called setlocale(LC_ALL, '') runs using the portable 'C' locale. Calling setlocale(LC_ALL, '') lets it use the default locale as defined by the LANG variable. Since we do not want to interfere with the current locale setting we thus emulate the behavior in the way described above.

Calling setlocale(LC_ALL, '') doesn't fix the problem. Actually, whatever I set the locale to, I cannot make the regexp \w* match e.g. "éàè" if the LOCALE flag is set.

Any ideas?
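
For readers on Python 3, a minimal sketch of how the same check behaves there. This assumes a newer interpreter (3.6 or later), not the Python 2.5 used in this ticket: str patterns are matched with Unicode semantics by default, so no locale is involved, and re.LOCALE is rejected for str patterns.

```python
import re

# "éàè" written with escapes so the code points are unambiguous.
text = "\u00e9\u00e0\u00e8"

# Default for str patterns in Python 3: \w is Unicode-aware, independent of the locale.
unicode_match = re.match(r"\w*", text).group()

# re.ASCII restricts \w to [a-zA-Z0-9_], so the accented letters are not matched.
ascii_match = re.match(r"\w*", text, re.ASCII).group()

# Python 3.6+ refuses re.LOCALE on a str pattern rather than accepting it.
try:
    re.compile(r"\w*", re.LOCALE)
    locale_rejected = False
except ValueError:
    locale_rejected = True
```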

comment:4 Changed 6 years ago by cboos

  • Milestone 0.13 deleted
  • Resolution set to wontfix
  • Status changed from new to closed

I think we shouldn't use the LOCALE flag in regexps and, more generally, shouldn't use the locale for anything other than system defaults. Even better, keeping the portable 'C' locale by not calling setlocale at all would spare us from the kind of trouble seen with buggy locales, like the "tr_TR" one (see #6953 and #7686). But this is slightly off-topic…

For this specific issue, there's another problem with attachment:locale_add.patch: it simply doesn't make sense to use re.UNICODE and re.LOCALE at the same time, as they're mutually incompatible:

L / LOCALE
    Make \w, \W, \b, \B, \s and \S dependent on the current locale.

...

U / UNICODE
    Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.

And from the following experiment, it indeed seems they can't work together (taking the ああああ sample from #TH2252):

>>> import re
>>> re.search(u'\\b\u3042+\\b', u' \u3042\u3042\u3042\u3042 ', re.UNICODE).span()
(1, 5)

Works fine, but not:

>>> re.search(u'\\b\u3042+\\b', u' \u3042\u3042\u3042\u3042 ', re.UNICODE|re.LOCALE ) is None
True

Additionally:

>>> re.search(u'\\b\u3042+\\b', u' Foo\u3042\u3042\u3042\u3042Bar ', re.UNICODE) is None
True

The first and third matches show that it's indeed possible to match Japanese words using the \b marker and the re.UNICODE flag. Looking at the autowikify plugin, it seems that it's an issue with re.escape and unicode characters.

So this is not a Trac problem, see #TH2252.
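
As an illustration of the point about \b and Unicode matching, here is a minimal sketch of building a page-name pattern with re.escape. It is not the autowikify plugin's code: the helper build_page_pattern and the page list are hypothetical, and it is written for Python 3, where str patterns are Unicode-aware without any flag.

```python
import re


def build_page_pattern(page_names):
    """Hypothetical helper: one regex matching any page name as a whole word."""
    # Longest names first, so a longer name wins over a name that is its prefix.
    names = sorted(page_names, key=len, reverse=True)
    # re.escape neutralizes regex metacharacters that may occur in page names.
    alternation = "|".join(re.escape(name) for name in names)
    # \b relies on Unicode word characters here; no locale is consulted.
    return re.compile(r"\b(?:%s)\b" % alternation)


pattern = build_page_pattern(["\u3042\u3042\u3042\u3042", "WikiStart"])

# Surrounded by spaces: the name stands alone and is matched.
standalone = pattern.search(" \u3042\u3042\u3042\u3042 ")

# Embedded in other word characters: there is no word boundary, so no match.
embedded = pattern.search(" Foo\u3042\u3042\u3042\u3042Bar ")
```

Sorting by length is a design choice for the alternation, since the regex engine tries alternatives from left to right.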
