Opened 16 years ago
Closed 16 years ago
#7552 closed defect (wontfix)
Japanese strings doesn't match in module "re"
Reported by: | Owned by: | ||
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | wiki system | Version: | 0.11.1 |
Severity: | normal | Keywords: | |
Cc: | Branch: | ||
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
The Autowikify plugin uses the IWikiSyntaxProvider methods.
http://trac-hacks.org/browser/autowikifyplugin/trunk/tracautowikify/autowikify.py#L50
In a Japanese environment, when I use Japanese wiki page names and contents, the page names are not matched.
http://trac-hacks.org/ticket/2252
I think that the compile method of the "re" module does not take the locale into account.
I expect regular expressions to respect the locale.
I made a patch, locale_add.patch, and applied it. With it, matching works correctly in a Japanese environment.
Environment
- OS: Windows XP SP2
- Python 2.5.2
Attachments (2)
Change History (6)
by , 16 years ago
Attachment: | locale_add.patch added |
---|
comment:1 by , 16 years ago
by , 16 years ago
Attachment: | formatter_test.patch added |
---|
comment:2 by , 16 years ago
I am sorry for the delayed answer. It took me some time to work out the environment for the unit tests.
I resolved the 6 test case failures that you pointed out.
I had applied a patch to Trac that takes the locale into account, but these 6 test cases did not take the locale into account.
These test cases use German characters, for instance u umlaut (ü).
I applied a patch to the test case code.
It is formatter_test.patch.
Before the tests, it stores the current locale and sets the locale to German. After the tests, it restores the locale.
With this patch applied, all the tests pass.
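The save-and-restore step described above could be sketched roughly as follows. This is not the content of formatter_test.patch, only a minimal illustration: the helper name `temporary_locale` is hypothetical, and the sketch uses LC_CTYPE with the always-available "C" locale because names such as "German" are platform-specific.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_locale(name, category=locale.LC_CTYPE):
    """Switch to locale `name` for the duration of a block (hypothetical helper)."""
    # The one-argument form of setlocale only queries the current setting.
    saved = locale.setlocale(category)
    try:
        # Raises locale.Error if the platform does not know `name`.
        locale.setlocale(category, name)
        yield
    finally:
        # Always put the original setting back, even if the switch or the block failed.
        locale.setlocale(category, saved)


before = locale.setlocale(locale.LC_CTYPE)
with temporary_locale("C"):
    inside = locale.setlocale(locale.LC_CTYPE)
after = locale.setlocale(locale.LC_CTYPE)
```

Because the restore happens in a `finally` clause, a test that fails inside the block should not leak the changed locale into later tests.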
comment:3 by , 16 years ago
Milestone: | → 0.13 |
---|
The patch attachment:locale_add.patch seems to have a non-trivial effect. If I apply it, accented characters are not picked up anymore for a \w pattern in regexps. This explains the test failures.
The patch attachment:formatter_test.patch doesn't work here (Linux):
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "German")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/locale.py", line 478, in setlocale
    return _setlocale(category, locale)
locale.Error: unsupported locale setting
Indeed, the Python documentation page for the locale module mentions this:
According to POSIX, a program which has not called setlocale(LC_ALL, '') runs using the portable 'C' locale. Calling setlocale(LC_ALL, '') lets it use the default locale as defined by the LANG variable. Since we do not want to interfere with the current locale setting we thus emulate the behavior in the way described above.
Calling setlocale(LC_ALL, '') doesn't fix the problem. Actually, whatever I set the locale to, I cannot make the regexp \w* match e.g. "éàè" if the LOCALE flag is set.
Any ideas?
comment:4 by , 16 years ago
Milestone: | 0.13 |
---|---|
Resolution: | → wontfix |
Status: | new → closed |
I think we shouldn't use the LOCALE flag in regexps and, more generally, shouldn't use the locale for anything other than system defaults. Even better, sticking to the portable 'C' locale by not calling setlocale at all would spare us the kind of trouble seen with buggy locales, like the "tr_TR" one (see #6953 and #7686). But this is slightly off-topic…
For this specific issue, there's another problem with attachment:locale_add.patch: it simply doesn't make sense to use re.UNICODE and re.LOCALE at the same time, as they're mutually incompatible:
L  LOCALE   Make \w, \W, \b, \B, \s and \S dependent on the current locale.
...
U  UNICODE  Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.
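As a side note for readers on newer interpreters: to my knowledge, Python 3.6 and later reject these combinations when the pattern is compiled, rather than silently failing to match. A minimal sketch, assuming Python 3.6+ (the ticket itself concerns Python 2.5); the helper name `flags_accepted` is made up for illustration.

```python
import re


def flags_accepted(pattern, flags):
    """Return True if re.compile accepts `flags` for `pattern` (hypothetical helper)."""
    try:
        re.compile(pattern, flags)
        return True
    except ValueError:
        return False


# str patterns are Unicode-aware; the LOCALE flag is refused for them.
str_unicode = flags_accepted(r"\w+", re.UNICODE)
str_both = flags_accepted(r"\w+", re.UNICODE | re.LOCALE)

# bytes patterns may use LOCALE, but the UNICODE flag is refused for them.
bytes_locale = flags_accepted(rb"\w+", re.LOCALE)
bytes_unicode = flags_accepted(rb"\w+", re.UNICODE)
```

Under that assumption, no pattern type accepts both flags together, which is consistent with the documentation excerpt above.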
And from the following experiment, it indeed seems they can't work together (taking the ああああ sample from #TH2252):
>>> import re
>>> re.search(u'\\b\u3042+\\b', u' \u3042\u3042\u3042\u3042 ', re.UNICODE).span()
(1, 5)
Works fine, but not:
>>> re.search(u'\\b\u3042+\\b', u' \u3042\u3042\u3042\u3042 ', re.UNICODE|re.LOCALE) is None
True
Additionally:
>>> re.search(u'\\b\u3042+\\b', u' Foo\u3042\u3042\u3042\u3042Bar ', re.UNICODE) is None
True
The first and third searches show that it's indeed possible to match Japanese words using the \b marker and the re.UNICODE flag: the standalone word is found, and the one embedded in "Foo…Bar" is correctly rejected. Looking at the autowikify plugin, it seems that it's an issue with re.escape and unicode characters.
So this is not a Trac problem, see #TH2252.
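For comparison, here is a minimal sketch of building a word-boundary pattern from page names, the general approach a plugin of this kind could take. It is written for Python 3, where str patterns are Unicode-aware by default, and is not the plugin's actual code: `build_page_pattern` and the sample page names are assumptions for illustration.

```python
import re


def build_page_pattern(page_names):
    """Compile one regex matching any page name as a whole word (hypothetical helper)."""
    # Longest names first, so a longer name wins over a shorter prefix of it.
    ordered = sorted(page_names, key=len, reverse=True)
    escaped = [re.escape(name) for name in ordered]
    return re.compile(r"\b(?:%s)\b" % "|".join(escaped))


# Sample page names: the Japanese sample from the experiment plus an ASCII one.
pattern = build_page_pattern(["\u3042\u3042\u3042\u3042", "WikiStart"])

standalone = pattern.search(" \u3042\u3042\u3042\u3042 ")
embedded = pattern.search(" Foo\u3042\u3042\u3042\u3042Bar ")
```

Here `standalone` is a match object covering the Japanese word, while `embedded` is `None`, mirroring the first and third searches in the experiment above.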
Note: On trunk, this patch causes 6 test failures, if it's right we might need to rework some regexps.