Opened 15 years ago

Closed 12 years ago

#7217 closed defect (fixed)

Non-ASCII characters get replaced with '?' in changeset metadata

Reported by: anonymous
Owned by: Christian Boos
Priority: high
Milestone: plugin - mercurial
Component: plugin/mercurial
Version:
Severity: normal
Keywords: unicode
Cc: franoleg@…
Branch:
Release Notes:
API Changes:
Internal Changes:

Description

With TracMercurial, non-ASCII characters in changeset metadata strings get replaced with question marks. The plugin should set os.environ['HGENCODING'] = 'utf-8'.

I think in actuality this behavior can vary depending on the locale that Trac is running with, but I think it should just use UTF-8 regardless, since the plugin seems to expect that anyway (with calls to to_unicode(), which tries to decode from UTF-8 by default).

I'm not sure where exactly in the code it should set this, but this works for me, at least:

  • tracext/hg/backend.py

    @@ -27,6 +27,8 @@
                                         NoSuchChangeset, NoSuchNode
     from trac.wiki import IWikiSyntaxProvider
     
    +os.environ['HGENCODING'] = 'utf-8'
    +
     try:
         # The new `demandimport` mechanism doesn't play well with code relying
         # on the `ImportError` exception being caught.

Attachments (1)

ru-sample-repo.zip (4.7 KB) - added by Oleg Frantsuzov <franoleg@…> 12 years ago.
Sample repository with commit messages in Russian

Change History (14)

comment:1 by Christian Boos, 15 years ago

Milestone: not applicable

comment:2 by anonymous, 14 years ago

I can second that; I got that problem especially when using mod_wsgi. Applying the fix described above inside my trac.wsgi file eliminates the problem.

With standalone tracd that problem never arose.

I'm also using hg (serve) via mod_wsgi, and there I didn't have to set up that variable. Might be something inside hg.

comment:3 by Christian Boos, 14 years ago

#7694 closed as duplicate.

This ticket deals with meta-data encoding, see also #7160.

comment:4 by Christian Boos, 13 years ago

Keywords: unicode added
Milestone: not applicable → mercurial-plugin
Priority: normal → high

comment:5 by Christian Boos, 12 years ago

This one is a bit tricky, I've started to address it in r10491, but it's not finished yet.

comment:6 by Christian Boos, 12 years ago

Resolution: fixed
Status: new → closed

Should be fixed in a robust way in r10518.

comment:7 by Oleg Frantsuzov <franoleg@…>, 12 years ago

Cc: franoleg@… added

I'm not sure if I should reopen this ticket, but I'm getting the original problem on Trac 0.12.2. The commit messages in my Mercurial repository are in Russian, and everything except the Latin characters is displayed as question marks. Trac is installed on Debian squeeze, Python version 2.6.6rc1+, Mercurial 1.6.4, TracMercurial r10532, mod_wsgi 3.2.

The same environment/repository was previously used on Trac 0.11.5 installed as per TracOnWindowsIisAjp, Python 2.5.4, Mercurial 1.3.1, TracMercurial r8963 on Windows Server 2003, and no metadata encoding problems were spotted.

I tried setting the [hg] encoding setting to utf-8 in TracIni, but it didn't seem to help. While trying other approaches I had found here, I noticed that os.environ['HGENCODING'] = 'utf-8' in trac.wsgi makes Trac sometimes display question marks and sometimes the correct characters. I patched backend.py, replacing latin1 with utf-8 on line 89, and suddenly this did help.

I guess I must be missing something about the whole thing.

in reply to:  7 ; comment:8 by Christian Boos, 12 years ago

Replying to Oleg Frantsuzov <franoleg@…>:

… I guess I must be missing something about the whole thing.

… or I do ;-) Care to provide me with a sample repository reproducing the issue?

by Oleg Frantsuzov <franoleg@…>, 12 years ago

Attachment: ru-sample-repo.zip added

Sample repository with commit messages in Russian

in reply to:  8 comment:9 by Oleg Frantsuzov <franoleg@…>, 12 years ago

Replying to cboos:

… or I do ;-) Care to provide me with a sample repository reproducing the issue?

While preparing the sample repository (see attachment:ru-sample-repo.zip), I've found out that my problem has at least one more contributing factor. I use Windows and TortoiseHg (1.1.9.1 with Mercurial 1.7.5) on my PC, and it's the TortoiseHg commit dialog that I use to enter my Russian commit messages.

I tried entering different commit messages: one in English, one with diacritical marks, and one in Russian. I did this once in my customary ru-RU system locale (commits 0-2), and once in en-US locale (commits 3-5 in the sample repo). The results were somewhat surprising: this is how the messages are displayed when the locale is set to ru-RU, and this is how they look when it's en-US.

I understand this isn't the place to report bugs against either Mercurial or TortoiseHg, but I guess this can be useful for the TracTeam for diagnosing encoding problems. I used to think that both Mercurial and TortoiseHg use Unicode inside them, but it looks like it isn't the case.

As for Trac itself, that's how my sample repository looks with the unmodified TracMercurial r10532:

http://warmland.ru/direct/trac-7217/trac-latin1.png

Commits 1, 4 and 5 are garbled because I was too successful in testing if TortoiseHg could fail on problematic locale and encoding scenarios, but commit 2 is expected to be displayed correctly. The actual commits in the repository I mentioned in comment:7 are like commit 2 in the sample repository.

Now, that's how the repository looks with the patched backend.py:

http://warmland.ru/direct/trac-7217/trac-utf-8.png

Just for reference, here's the patch:

  • tracext/hg/backend.py

    @@ -86,7 +86,7 @@
             from mercurial.error import RepoError, LookupError as HgLookupError
     
         # Force local encoding to be non-lossy (#7217)
    -    os.environ['HGENCODING'] = 'latin1'
    +    os.environ['HGENCODING'] = 'utf-8'
     
         if demandimport:
             demandimport.disable();

comment:10 by Christian Boos, 12 years ago

You were right, r10518 was plain wrong, as there's actually no way to make sure that Mercurial's encoding.tolocal(str) is a no-op.

My initial goal by using 'latin1' instead of 'utf-8' was to be able to retrieve any bytes from the metadata, even if they were not decodable as 'utf-8' (as it could be for old repositories started before UTF-8 metadata was the norm in Mercurial), so that we can perform our sequence of conversions.

But it doesn't work that way: encoding.tolocal first attempts to decode using 'utf-8', and if this succeeds (which was the case in your example), it encodes the resulting unicode to the chosen HGENCODING ('latin1' here), which then fails and triggers a fallback to 'replace' mode, hence the question marks.
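
The sequence described above can be modelled in a few lines. This is a simplified sketch written for Python 3, not Mercurial's actual implementation; the function name `tolocal_model` and its handling of input that is not valid UTF-8 are assumptions made for illustration only.

```python
def tolocal_model(stored: bytes, local_encoding: str) -> bytes:
    """Toy model of the conversion described in this comment (not Mercurial's code)."""
    try:
        # Step 1: the stored metadata is first decoded as UTF-8.
        text = stored.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption of this sketch: undecodable bytes are decoded lossily.
        text = stored.decode("utf-8", "replace")
    try:
        # Step 2: the text is re-encoded to the local encoding (HGENCODING).
        return text.encode(local_encoding)
    except UnicodeEncodeError:
        # Step 3: characters the local encoding cannot represent become '?'.
        return text.encode(local_encoding, "replace")


message = "пример".encode("utf-8")
print(tolocal_model(message, "latin1"))            # b'??????'
print(tolocal_model(message, "utf-8") == message)  # True
```

With 'latin1' as the local encoding every Cyrillic character is lost, while 'utf-8' round-trips the bytes unchanged, which matches the behavior reported in comment:9.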

So I'd suggest the following patch:

  • tracext/hg/backend.py

    @@ -86,7 +86,8 @@
             from mercurial.error import RepoError, LookupError as HgLookupError
     
         # Force local encoding to be non-lossy (#7217)
    -    os.environ['HGENCODING'] = 'latin1'
    +    os.environ['HGENCODING'] = 'utf-8'
    +    encoding.tolocal = str
     
         if demandimport:
             demandimport.disable();

… which should work even if you happen to have old changesets with messages written in 'cp866', 'koi8_r' encodings or such things ;-) (provided you put [hg] encoding = utf-8, cp866 in your ini file).
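
The idea of trying a list of encodings in order, as with [hg] encoding = utf-8, cp866, can be sketched as follows. This is an illustration in Python 3 and not TracMercurial's code: the helper name `decode_metadata` is hypothetical, and the final 'latin1' fallback is an assumption of this sketch.

```python
def decode_metadata(raw: bytes, encodings=("utf-8", "cp866")) -> str:
    """Decode raw changeset metadata by trying each encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    # Assumption: fall back to latin1, which maps every byte to a character.
    return raw.decode("latin1")


new_style = "пример".encode("utf-8")   # metadata stored as UTF-8
old_style = "пример".encode("cp866")   # metadata from an old cp866 changeset
print(decode_metadata(new_style))
print(decode_metadata(old_style))
```

Order matters here: the strict encoding (UTF-8) should come first, because single-byte code pages accept almost any byte sequence and would otherwise mask valid UTF-8 input.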

The encoding.tolocal = str line alone would work, but let's be safe and also use a proper value for 'HGENCODING', so that fromlocal will also work, for the day we will use the hg backend as a store and we will have to create our own commits.

OTOH, if you're sure to have all the metadata stored as UTF-8, then of course os.environ['HGENCODING'] = 'utf-8' should be enough, so your fix is quite OK.

As for the '?' in the other revisions, they are really part of the changeset data:

>>> from mercurial import hg, encoding, ui
>>> repo = hg.repository(ui.ui(), '.')
>>> encoding.tolocal = str
>>> print '\n'.join(repr((i, repo[i].description())) for i in range(0, len(repo)))
(0, 'English: sample commit message in English')
(1, 'Diacritics: s?mpl? c?mm?t m?ss?g? w?th l?ts ?f d??cr?t?cs')
(2, 'Russian: \xd0\xbf\xd1\x80\xd0\xb8\xd0\xbc\xd0\xb5\xd1\x80 \xd0\xbe\xd0\xbf\xd0\xb8\xd1\x81\xd0\xb0\xd0\xbd\xd0\xb8\xd1\x8f \xd0\xba\xd0\xbe\xd0\xbc\xd0\xbc\xd0\xb8\xd1\x82\xd0\xb0 \xd0\xbd\xd0\xb0 \xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xbe\xd0\xbc')
(3, 'System locale changed to en-US (was ru-RU)')
(4, 'Diacritics: s\xc3\xa1mpl\xc3\xab c\xc3\xb2mm?t m\xc3\xa9ss\xc3\xa4g\xc3\xa8 w?th l\xc3\xb3ts \xc3\xb6f d\xc3\xac\xc3\xa3cr\xc3\xadt\xc3\xafcs')
(5, 'Russian: ?????? ???????? ??????? ??-??????')
>>>
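
For readers on Python 3, the stored bytes from the session above can be checked directly. The snippet below is a sketch using prefixes of the descriptions of commits 1, 2 and 4. The last part is only a hypothesis about where the literal '?' bytes came from: it assumes the commit tool encoded the message with the Windows code page of the active locale (cp1251 for ru-RU, cp1252 for en-US), which the ticket does not confirm.

```python
# Prefixes of the raw descriptions printed in the session above.
commit1 = b"Diacritics: s?mpl?"
commit2 = b"Russian: \xd0\xbf\xd1\x80\xd0\xb8\xd0\xbc\xd0\xb5\xd1\x80"
commit4 = b"Diacritics: s\xc3\xa1mpl\xc3\xab"

# Commits 2 and 4 hold valid UTF-8, so decoding recovers the original text.
print(commit2.decode("utf-8"))
print(commit4.decode("utf-8"))

# Commit 1 contains real 0x3f bytes: the information was lost before storage.
print(commit1.count(b"\x3f"))

# Hypothesis (not confirmed by the ticket): a lossy encode to the locale's
# code page at commit time would produce exactly this kind of damage.
lossy_ru = "sámplë".encode("cp1251", "replace")   # ru-RU locale, Latin accents
lossy_en = "пример".encode("cp1252", "replace")   # en-US locale, Cyrillic
print(lossy_ru, lossy_en)
```

Whatever the exact cause, no HGENCODING setting on the Trac side can restore characters that were already replaced when the changeset was written.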

comment:11 by Christian Boos, 12 years ago

Resolution: fixed
Status: closed → reopened

(re-opening until I apply the patch - comments welcome in the meantime)

comment:12 by mr.troll <troll@…>, 12 years ago

This patch works on my Trac, which I installed yesterday. I have source code and commit descriptions in the windows-1251 charset, so I compiled the plugin with this patch to tracext/hg/backend.py:

 os.environ['HGENCODING'] = 'utf-8' 
 encoding.tolocal = str

In my hgrc files I have

[web]
encoding = windows-1251

In my trac.conf I have

[trac]
default_charset = windows-1251

Trac version 0.12.3dev-r10552, hg version 1.7.5, server on Fedora Core 7, committing from Windows XP. http://www.valar.ru/gallery/0211/untitled1.gif

comment:13 by Christian Boos, 12 years ago

Resolution: fixed
Status: reopened → closed

Ok, patch applied in r10618:10620. Thanks all for testing!
