Edgewall Software
Opened 14 years ago

Closed 13 years ago

#9631 closed defect (fixed)

UnicodeDecodeError when file names in Mercurial repo use multiple encodings

Reported by: elias.soong@…
Owned by:
Priority: low
Milestone: plugin - mercurial
Component: plugin/mercurial
Version: 0.12
Severity: normal
Keywords: unicode
Cc:
Branch:
Release Notes:
API Changes:
Internal Changes:

Description

If file names in a Mercurial repository use more than one multi-byte encoding, you may get an error such as "UnicodeDecodeError: 'ascii' codec can't decode byte 0xd3 in position 15: ordinal not in range(128)" when browsing the source code in Trac.


Here is a detailed example:

We committed some files with Chinese characters in their names. At first we used "GB18030" as the file name encoding, because we committed them on a Windows machine. We then found that these file names did not display correctly on Linux & Mac, so we removed the files and re-committed them with "utf-8" encoded names. As a result, the hg history contains some "GB18030" names, while the tip contains "utf-8" names. Now we get "UnicodeDecodeError: 'ascii' codec can't decode byte 0xd3 in position 15: ordinal not in range(128)" when we browse the source code through Trac.

At that time, I had passed "HGENCODING" ⇒ "utf-8" to Trac through lighttpd & FastCGI, and set "default_charset = utf-8" in trac.ini.

I guess some of the source browsing operations touch the historical "GB18030" file names, but Trac & Mercurial always try to decode file names as 'utf-8', which results in the exception.
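
The mismatch described above can be reproduced outside Trac. The following is a minimal Python 3 sketch, not code from Trac or TracMercurial; the file name and the helper `try_decode` are hypothetical and exist only for illustration. It shows that the same name produces different bytes under the two encodings, and that the GB18030 bytes are rejected by a strict utf-8 decode.

```python
# Illustration only (Python 3); this is not code from Trac or TracMercurial.
# The file name below is a hypothetical example ("wenjian.txt" in Chinese).
name = "\u6587\u4ef6.txt"

gb_bytes = name.encode("gb18030")   # as first committed on the Windows machine
utf8_bytes = name.encode("utf-8")   # as re-committed later


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The same name is stored as two different byte sequences in the history.
same_bytes = gb_bytes == utf8_bytes

# A strict utf-8 decode accepts the new name but rejects the historical one.
new_name = try_decode(utf8_bytes, "utf-8")
old_name = try_decode(gb_bytes, "utf-8")
```

Here `new_name` equals the original name, while `old_name` is None, which is the situation that a single configured charset cannot handle.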


My Trac & Mercurial versions are:

Trac 0.12
Babel 0.9.5
Docutils 0.6
Genshi 0.6
Mercurial 1.3.1
MySQL server: "5.0.51a-24+lenny4", client: "5.0.51a", thread-safe: 1
MySQLdb 1.2.2
Pygments 1.3.1
Python 2.5.2 (r252:60911, Jan 24 2010, 14:53:14) [GCC 4.3.2]
pytz 2010h
setuptools 0.6c8
jQuery 1.4.2

And

TracMercurial 0.12.0.23dev-r9953

For reference, attached is a quick and dirty patch that just works for my situation (it supports the utf-8 & GB18030 encodings at the same time).

Attachments (1)

file_name_multi_encodings.patch (1.4 KB) - added by elias.soong@… 14 years ago.
A dirty patch for more than one multi-char encoding in Mercurial repositories.

Change History (6)

by elias.soong@…, 14 years ago

A dirty patch for more than one multi-char encoding in Mercurial repositories.

comment:1 by Christian Boos, 14 years ago

Resolution: duplicate
Status: new → closed

Thanks for the report & patch, but this was already reported in #8538.

in reply to:  1 comment:2 by elias.soong@…, 14 years ago

Resolution: duplicate
Status: closed → reopened

Replying to cboos:

> Thanks for the report & patch, but this was already reported in #8538.

Hello,

I think #8538 is a little different from my situation.

In my tests, when I set "default_charset = utf-8", the variable 'entry' could be decoded as utf-8 automatically, although 'entry' is in fact a Python str. However, there may be more than one file name encoding in the history of a Mercurial repo. That's why my patch tries two different encodings (both utf-8 & GB18030) when decoding.

I guess #8538 only relates to one encoding, but my ticket is about several. In detail, my situation is a dilemma: if I set "default_charset = utf-8", then some GB18030 file names raise an exception; and if I set "default_charset = GB18030", some utf-8 file names raise an exception. So I feel the existing mechanisms in Trac (default_charset & HGENCODING) can't solve this well…

comment:3 by anonymous, 14 years ago

Thanks for the patch. It worked well for me too, as I have the same problem.

But ideally those encodings should be configurable in TracMercurial.

in reply to:  3 comment:4 by elias.soong@…, 14 years ago

Replying to anonymous:

> Thanks for the patch. It worked well for me too, as I have the same problem.
>
> But ideally those encodings should be configurable in TracMercurial.

Maybe something like Vim's 'smart encoding' detection could be a model for this configuration. In Vim, we use 'set fencs=utf-8,ucs-bom,shift-jis,gb18030,gbk,gb2312,cp936' to make Vim try multiple encodings in turn when opening a file.

comment:5 by Christian Boos, 13 years ago

Resolution: fixed
Status: reopened → closed

Thanks for the suggestions.

I'm going to support the following setting: [hg] encoding = utf-8,ucs-bom,shift-jis,gb18030,gbk,gb2312,cp936.

By default it will be utf-8, but in complex cases like yours, a list can be specified. The last, implicit encoding to be tried when everything else has failed will be latin1, which always succeeds (so there is no need to specify it).
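
The fallback order described here can be sketched roughly as follows. This is a minimal Python 3 illustration of the idea only, not the code from r10490 or r10491; the function name `decode_with_fallback` and its signature are assumptions made for the example. Skipping names that Python does not recognise as codecs (such as the Vim-style "ucs-bom") is also an assumption of this sketch, not documented plugin behaviour.

```python
def decode_with_fallback(raw, encodings=("utf-8",)):
    """Decode raw bytes by trying each encoding in order.

    Hypothetical helper for illustration. Each candidate is tried with
    strict error handling; the first one that succeeds wins. If none
    succeeds, latin-1 is used as the implicit last resort, because it
    maps every possible byte to a character and therefore cannot fail.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding; try the next one.
            continue
        except LookupError:
            # Assumption of this sketch: names unknown to Python are skipped.
            continue
    return raw.decode("latin-1")


# Hypothetical file name, stored under both encodings from the report.
name = "\u6587\u4ef6.txt"
from_utf8 = decode_with_fallback(name.encode("utf-8"), ("utf-8", "gb18030"))
from_gb = decode_with_fallback(name.encode("gb18030"), ("utf-8", "gb18030"))
```

Both `from_utf8` and `from_gb` come back as the original name. The order of the list matters: a strict encoding such as utf-8 should come first, since more permissive encodings may accept bytes that were not written in them and return wrong characters instead of failing.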

See r10490 and r10491. This is a proof of concept at this stage. Please test and report any remaining issue in #8538, thanks!
