
#7552 closed defect (wontfix)

Japanese strings don't match in module "re"

Reported by:	kondo@…	Owned by:
Priority:	normal	Milestone:
Component:	wiki system	Version:	0.11.1
Severity:	normal	Keywords:
Cc:		Branch:
Release Notes:
API Changes:
Internal Changes:

Description

The Autowikify plugin uses IWikiSyntaxProvider methods.

http://trac-hacks.org/browser/autowikifyplugin/trunk/tracautowikify/autowikify.py#L50

In a Japanese environment, when I use Japanese wiki page names and contents, the pattern doesn't match.

http://trac-hacks.org/ticket/2252

I think that the compile method of the "re" module doesn't take the locale into account.

I expect regular expressions to behave according to the locale.

I made locale_add.patch and applied it. With the patch, it works correctly in a Japanese environment.

Environment

OS:WindowsXP SP2
Python 2.5.2

Attachments (2)

locale_add.patch (765 bytes) - added by kondo@… 17 years ago.
formatter_test.patch (1.1 KB) - added by kondo@… 17 years ago.


Change History (6)

by kondo@…, 17 years ago

Attachment:	locale_add.patch added

comment:1 by Tim Hatch, 17 years ago

Note: On trunk, this patch causes 6 test failures. If it's right, we might need to rework some regexps.

by kondo@…, 17 years ago

Attachment:	formatter_test.patch added

comment:2 by kondo@…, 17 years ago

Sorry for the delayed answer. It took me some time to set up an environment for the unit tests.

I have resolved the 6 test failures that you pointed out.

I had applied the locale-aware patch to Trac, but these 6 test cases did not take the locale into account.

These test cases use German characters, for instance u umlaut.

So I also patched the test code.

That patch is formatter_test.patch.

Before the test, it stores the current locale and sets the locale to German. After the test, it restores the original locale.

With this patch applied, all the tests pass.
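
For illustration, the store/set/restore step described above might look like the following in current Python. This is a minimal sketch, not the contents of formatter_test.patch: the helper name `temporary_locale` is hypothetical, and it assumes that only LC_CTYPE (the category that governs character classification) needs to change.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(name, category=locale.LC_CTYPE):
    """Switch `category` to the locale `name`, then restore the old value.

    Hypothetical helper for illustration only. Raises locale.Error if
    `name` is not available on the current system.
    """
    # Called without a second argument, setlocale only queries the setting.
    saved = locale.setlocale(category)
    try:
        locale.setlocale(category, name)
        yield
    finally:
        # Runs even if the requested locale is unsupported or the body fails.
        locale.setlocale(category, saved)
```

Locale names are platform-specific, so a test built on such a helper would probably need to be skipped when `setlocale` raises `locale.Error` rather than assume that a given name exists everywhere.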

comment:3 by Remy Blank, 17 years ago

Milestone:	→ 0.13

The patch attachment:locale_add.patch seems to have a non-trivial effect. If I apply it, accented characters are no longer matched by a \w pattern in regexps. This explains the test failures.

The patch attachment:formatter_test.patch doesn't work here (Linux):

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "German")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/locale.py", line 478, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting

Indeed, the Python documentation page for the locale module mentions this:

According to POSIX, a program which has not called setlocale(LC_ALL, '') runs using the portable 'C' locale. Calling setlocale(LC_ALL, '') lets it use the default locale as defined by the LANG variable. Since we do not want to interfere with the current locale setting we thus emulate the behavior in the way described above.

Calling setlocale(LC_ALL, '') doesn't fix the problem. Actually, whatever I set the locale to, I cannot make the regexp \w* match e.g. "éàè" if the LOCALE flag is set.

Any ideas?

comment:4 by Christian Boos, 17 years ago

Milestone:	0.13
Resolution:	→ wontfix
Status:	new → closed

I think we shouldn't use the LOCALE flag in regexps and, more generally, shouldn't use the locale for anything other than system defaults. Even better, continuing to use the portable 'C' locale by not calling setlocale at all would spare us the kind of trouble seen with buggy locales, like the "tr_TR" one (see #6953 and #7686). But this is slightly off-topic…

For this specific issue, there's another problem with attachment:locale_add.patch: it simply doesn't make sense to use the re.UNICODE and re.LOCALE flags at the same time, as they're mutually incompatible:

L, LOCALE
  Make \w, \W, \b, \B, \s and \S dependent on the current locale.

...

U, UNICODE
  Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.
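
As a point of comparison, the effect of the Unicode character database on \w can be sketched in Python 3 (an assumption of this example; the ticket itself concerns Python 2.5). There, str patterns are Unicode-aware by default, and the re.ASCII flag restricts \w to [a-zA-Z0-9_]:

```python
import re

accented = 'éàè'

# In Python 3, str patterns consult Unicode character properties by default,
# so \w covers accented letters.
unicode_match = re.fullmatch(r'\w*', accented)

# re.ASCII narrows \w to [a-zA-Z0-9_], so the same text no longer matches.
ascii_match = re.fullmatch(r'\w*', accented, re.ASCII)

print(unicode_match is not None)  # True
print(ascii_match is None)        # True
```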

And from the following experiment, it indeed seems they can't work together (taking the ああああ sample from #TH2252):

>>> import re
>>> re.search(u'\\b\u3042+\\b', u' \u3042\u3042\u3042\u3042 ', re.UNICODE).span()
(1, 5)

Works fine, but not:

>>> re.search(u'\\b\u3042+\\b', u' \u3042\u3042\u3042\u3042 ', re.UNICODE|re.LOCALE) is None
True

Additionally:

>>> re.search(u'\\b\u3042+\\b', u' Foo\u3042\u3042\u3042\u3042Bar ', re.UNICODE) is None
True

The first and third experiments show that it's indeed possible to match Japanese words using the \b marker and the re.UNICODE flag. Looking at the autowikify plugin, it seems that it's an issue with re.escape and Unicode characters.

So this is not a Trac problem, see #TH2252.
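
For readers on Python 3, the experiments above can be reproduced roughly as follows. This is a sketch under stated assumptions: it targets Python 3.6 or later rather than the Python 2.5 of this ticket, the page names are hypothetical, and the alternation built with re.escape is only an illustration, not the autowikify plugin's actual code.

```python
import re

# Same pattern as in the experiments: one or more U+3042 between word boundaries.
pattern = '\\b\u3042+\\b'

# str patterns are Unicode-aware by default, so no flag is needed.
standalone = re.search(pattern, ' \u3042\u3042\u3042\u3042 ')
embedded = re.search(pattern, ' Foo\u3042\u3042\u3042\u3042Bar ')

# Python 3.6+ rejects the LOCALE flag on a str pattern at compile time.
try:
    re.compile(pattern, re.UNICODE | re.LOCALE)
    locale_rejected = False
except ValueError:
    locale_rejected = True

# Hypothetical page names, longest first so longer names win in the alternation.
page_names = ['\u3042\u3042\u3042\u3042', 'WikiStart']
ordered = sorted(page_names, key=len, reverse=True)
names_re = re.compile(r'\b(?:%s)\b' % '|'.join(re.escape(n) for n in ordered))
found = names_re.findall('See \u3042\u3042\u3042\u3042 and WikiStart.')

print(standalone.span())  # (1, 5)
print(embedded is None)   # True
print(locale_rejected)    # True
print(found)
```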
