Opened 14 years ago
Closed 14 years ago
#9631 closed defect (fixed)
UnicodeDecodeError when file names in Mercurial repo use multi encoding
Reported by: | Owned by: | ||
---|---|---|---|
Priority: | low | Milestone: | plugin - mercurial |
Component: | plugin/mercurial | Version: | 0.12 |
Severity: | normal | Keywords: | unicode |
Cc: | Branch: | ||
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
If files in a Mercurial repository use more than one multi-byte encoding for their names, you may get an error such as "UnicodeDecodeError: 'ascii' codec can't decode byte 0xd3 in position 15: ordinal not in range(128)" when browsing the source code in Trac.
Here is a detailed example:
We committed some files with Chinese characters in their names. At first we used "GB18030" as the file name encoding, because we committed them on a Windows machine. We then found that such file names do not display correctly on Linux & Mac, so we removed the files and re-committed them with "utf-8" encoded names. As a result, the hg history contains some "GB18030" names, while the tip contains "utf-8" names. Now we get "UnicodeDecodeError: 'ascii' codec can't decode byte 0xd3 in position 15: ordinal not in range(128)" when we browse the source code through Trac.
At that time, I passed "HGENCODING" ⇒ "utf-8" to Trac through lighttpd & fast-cgi, and set "default_charset = utf-8" in trac.ini.
I guess some of the source browsing operations touch the historical "GB18030" file names, but Trac & Mercurial always try to decode file names as 'utf-8', which results in the exception.
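The mismatch described above can be reproduced outside Trac. The following is only a minimal sketch, written for Python 3 (the ticket itself concerns Python 2.5); the file name and the helper `try_decode` are hypothetical and do not come from Trac or TracMercurial.

```python
# Hypothetical file name; the ticket does not give the real one.
name = "\u6587\u6863.txt"

# The same name as raw bytes, as it might be stored in the repository:
gb_bytes = name.encode("gb18030")   # first commit, made on Windows
utf8_bytes = name.encode("utf-8")   # later re-commit


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The utf-8 bytes decode cleanly with a utf-8 setting...
utf8_as_utf8 = try_decode(utf8_bytes, "utf-8")
# ...but the historical GB18030 bytes are not valid utf-8 (nor ASCII).
gb_as_utf8 = try_decode(gb_bytes, "utf-8")
gb_as_ascii = try_decode(gb_bytes, "ascii")
# They only decode correctly with the encoding they were written in.
gb_as_gb = try_decode(gb_bytes, "gb18030")
```

With a single configured encoding, one of the two generations of file names cannot be decoded correctly, which matches the behaviour reported here.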
My Trac & Mercurial versions are:
Trac | 0.12 |
Babel | 0.9.5 |
Docutils | 0.6 |
Genshi | 0.6 |
Mercurial | 1.3.1 |
MySQL | server: "5.0.51a-24+lenny4", client: "5.0.51a", thread-safe: 1 |
MySQLdb | 1.2.2 |
Pygments | 1.3.1 |
Python | 2.5.2 (r252:60911, Jan 24 2010, 14:53:14) [GCC 4.3.2] |
pytz | 2010h |
setuptools | 0.6c8 |
jQuery | 1.4.2 |
And
TracMercurial | 0.12.0.23dev-r9953 |
For reference, attached is a quick and dirty patch that works for my situation (it supports the utf-8 & GB18030 encodings at the same time).
Attachments (1)
Change History (6)
by , 14 years ago
Attachment: | file_name_multi_encodings.patch added |
---|
follow-up: 2 comment:1 by , 14 years ago
Resolution: | → duplicate |
---|---|
Status: | new → closed |
Thanks for the report & patch, but this was already reported in #8538.
comment:2 by , 14 years ago
Resolution: | duplicate |
---|---|
Status: | closed → reopened |
Replying to cboos:
Thanks for the report & patch, but this was already reported in #8538.
Hello,
I think #8538 is a little different from my situation.
In my test, when I set "default_charset = utf-8", the variable 'entry' can be decoded as utf-8 automatically, even though 'entry' is actually a Python str. However, there may be more than one file name encoding in the history of a Mercurial repo. That's why my patch tries two different encodings (utf-8 & GB18030) when decoding.
I guess #8538 only relates to a single encoding, but my ticket is about several. In detail, my situation is a dilemma: if I set "default_charset = utf-8", some GB18030 file names raise an exception; and if I set "default_charset = GB18030", some utf-8 file names raise an exception. So I feel the existing mechanisms in Trac (default_charset & HGENCODING) can't solve this well…
follow-up: 4 comment:3 by , 14 years ago
Thanks for the patch - it worked well for me too, as I have the same problem.
But ideally those encodings should be configurable in TracMercurial.
comment:4 by , 14 years ago
Replying to anonymous:
Thanks for the patch - it worked well for me too, as I have the same problem.
But ideally those encodings should be configurable in TracMercurial.
Maybe something like Vim's 'smart encoding' detection could be a model for the related configuration. In Vim, we use 'set fencs=utf-8,ucs-bom,shift-jis,gb18030,gbk,gb2312,cp936' to make Vim try multiple encodings when opening a file.
comment:5 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Thanks for the suggestions.
I'm going to support the following setting: [hg] encoding = utf-8,ucs-bom,shift-jis,gb18030,gbk,gb2312,cp936.
By default it will be utf-8, but in complex cases like yours, a list can be specified. The last, implicit encoding tried when everything else has failed will be latin1, which always succeeds (so there is no need to specify it).
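The fallback order described above could look roughly like the sketch below. This is not the code from r10490 or r10491: the function name `decode_name` is hypothetical, and skipping names that Python does not recognise as codecs (such as Vim's "ucs-bom") is an assumption made here, not something the ticket specifies.

```python
def decode_name(raw, encodings=("utf-8",)):
    """Decode raw bytes with the first encoding in the list that accepts them.

    Hypothetical helper: tries each configured encoding in order and
    finally falls back to latin-1, which maps every byte to a character
    and therefore never fails.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding, try the next
        except LookupError:
            continue  # assumption: silently skip unknown codec names
    return raw.decode("latin-1")


# Hypothetical file name, stored once as GB18030 and once as utf-8.
name = "\u6587\u6863.txt"
configured = ("utf-8", "gb18030")

from_history = decode_name(name.encode("gb18030"), configured)
from_tip = decode_name(name.encode("utf-8"), configured)
last_resort = decode_name(b"\xff")  # invalid utf-8, decoded via latin-1
```

In this sketch the order of the list matters: a permissive multi-byte codec may accept bytes written in another encoding and return the wrong characters, so the strictest encoding (here utf-8) is listed first.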
See r10490 and r10491. This is a proof of concept at this stage. Please test and report any remaining issue in #8538, thanks!
A dirty patch for more than one multi-char encoding in Mercurial repositories.