Edgewall Software

Ticket #7217 (closed defect: fixed)

Opened 4 years ago

Last modified 12 months ago

Non-ASCII characters get replaced with '?' in changeset metadata

Reported by: anonymous Owned by: cboos
Priority: high Milestone: plugin - mercurial
Component: plugin/mercurial Version:
Severity: normal Keywords: unicode
Cc: franoleg@…
Release Notes:
API Changes:

Description

With TracMercurial, non-ASCII characters in changeset metadata strings get replaced with question marks. The plugin should set os.environ['HGENCODING'] = 'utf-8'.

I think in actuality this behavior can vary depending on the locale that Trac is running with, but I think it should just use UTF-8 regardless, since the plugin seems to expect that anyway (with calls to to_unicode(), which tries to decode from UTF-8 by default).

I'm not sure where exactly in the code it should set this, but this works for me, at least:

  • tracext/hg/backend.py

     
    @@ -27,6 +27,8 @@
                                        NoSuchChangeset, NoSuchNode
     from trac.wiki import IWikiSyntaxProvider
     
    +os.environ['HGENCODING'] = 'utf-8'
    +
     try:
         # The new `demandimport` mechanism doesn't play well with code relying
         # on the `ImportError` exception being caught.

Attachments

ru-sample-repo.zip (4.7 KB) - added by Oleg Frantsuzov <franoleg@…> 12 months ago.
Sample repository with commit messages in Russian


Change History

comment:1 Changed 4 years ago by cboos

  • Milestone set to not applicable

comment:2 Changed 4 years ago by anonymous

I can second that; I ran into this problem especially when using mod_wsgi. Applying the fix described above inside my trac.wsgi file eliminates the problem.

With standalone tracd the problem never arose.

I'm also using hg (serve) via mod_wsgi, and there I didn't have to set that variable. Might be something inside hg.

comment:3 Changed 3 years ago by cboos

#7694 closed as duplicate.

This ticket deals with meta-data encoding, see also #7160.

comment:4 Changed 2 years ago by cboos

  • Keywords unicode added
  • Milestone changed from not applicable to mercurial-plugin
  • Priority changed from normal to high

comment:5 Changed 13 months ago by cboos

This one is a bit tricky, I've started to address it in r10491, but it's not finished yet.

comment:6 Changed 12 months ago by cboos

  • Resolution set to fixed
  • Status changed from new to closed

Should be fixed in a robust way in r10518.

comment:7 follow-up: Changed 12 months ago by Oleg Frantsuzov <franoleg@…>

  • Cc franoleg@… added

I'm not sure if I should reopen this ticket, but I'm getting the original problem on Trac 0.12.2. The commit messages in my Mercurial repository are in Russian, and everything except the Latin characters is displayed as question marks. Trac is installed on Debian squeeze, Python version 2.6.6rc1+, Mercurial 1.6.4, TracMercurial r10532, mod_wsgi 3.2.

The same environment/repository was previously used on Trac 0.11.5 installed as per TracOnWindowsIisAjp, Python 2.5.4, Mercurial 1.3.1, TracMercurial r8963 on Windows Server 2003, and no metadata encoding problems were spotted.

I tried setting the [hg] encoding setting to utf-8 in TracIni, but it didn't seem to help. While trying other approaches I had found here, I noticed that os.environ['HGENCODING'] = 'utf-8' in trac.wsgi makes Trac sometimes display question marks and sometimes correct characters. I patched backend.py, replacing latin1 with utf-8 on line 89, and suddenly this did help.

I guess I must be missing something about the whole thing.

comment:8 in reply to: ↑ 7 ; follow-up: Changed 12 months ago by cboos

Replying to Oleg Frantsuzov <franoleg@…>:

...
I guess I must be missing something about the whole thing.

... or I do ;-) Care to provide me with a sample repository reproducing the issue?

Changed 12 months ago by Oleg Frantsuzov <franoleg@…>

Sample repository with commit messages in Russian

comment:9 in reply to: ↑ 8 Changed 12 months ago by Oleg Frantsuzov <franoleg@…>

Replying to cboos:

... or I do ;-) Care to provide me with a sample repository reproducing the issue?

While preparing the sample repository (see attachment:ru-sample-repo.zip), I've found out that my problem has at least one more contributing factor. I use Windows and TortoiseHg (1.1.9.1 with Mercurial 1.7.5) on my PC, and it's the TortoiseHg commit dialog that I use to enter my Russian commit messages.

I tried entering different commit messages: one in English, one with diacritical marks, and one in Russian. I did this once in my customary ru-RU system locale (commits 0-2), and once in the en-US locale (commits 3-5 in the sample repo). The results were somewhat surprising: this is how the messages are displayed when the locale is set to ru-RU, and this is how they look when it's en-US.

I understand this isn't the place to report bugs against either Mercurial or TortoiseHg, but I guess this can be useful for the TracTeam for diagnosing encoding problems. I used to think that both Mercurial and TortoiseHg use Unicode inside them, but it looks like it isn't the case.

As for Trac itself, that's how my sample repository looks with the unmodified TracMercurial r10532:

http://warmland.ru/direct/trac-7217/trac-latin1.png

Commits 1, 4 and 5 are garbled because I was too successful in testing if TortoiseHg could fail on problematic locale and encoding scenarios, but commit 2 is expected to be displayed correctly. The actual commits in the repository I mentioned in comment:7 are like commit 2 in the sample repository.

Now, that's how the repository looks with the patched backend.py:

http://warmland.ru/direct/trac-7217/trac-utf-8.png

Just for reference, here's the patch:

  • tracext/hg/backend.py

     
    @@ -86,7 +86,7 @@
             from mercurial.error import RepoError, LookupError as HgLookupError
     
         # Force local encoding to be non-lossy (#7217)
    -    os.environ['HGENCODING'] = 'latin1'
    +    os.environ['HGENCODING'] = 'utf-8'
     
         if demandimport:
             demandimport.disable();

comment:10 Changed 12 months ago by cboos

You were right, r10518 was plain wrong, as there's actually no way to make sure that Mercurial's encoding.tolocal(str) is a no-op.

My initial goal by using 'latin1' instead of 'utf-8' was to be able to retrieve any bytes from the metadata, even if they were not decodable as 'utf-8' (as it could be for old repositories started before UTF-8 metadata was the norm in Mercurial), so that we can perform our sequence of conversions.

But it doesn't work that way: encoding.tolocal first attempts to decode using 'utf-8', and if this succeeds (which was the case in your example), it encodes the resulting unicode to the chosen HGENCODING ('latin1' here). That encoding step then fails and triggers a fallback to 'replace' mode, hence the question marks.
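To make the failure mode described above concrete, here is a minimal Python 3 sketch. It is a simplified model of the behavior as explained in this comment, not Mercurial's actual implementation: the name `tolocal_model` is hypothetical, and only the path for metadata that is valid UTF-8 is modeled.

```python
def tolocal_model(stored: bytes, local_encoding: str) -> bytes:
    """Simplified model of the conversion described in this comment.

    Assumes `stored` is valid UTF-8; other inputs are not modeled here.
    """
    # Step 1: the stored metadata decodes cleanly as UTF-8.
    text = stored.decode("utf-8")
    try:
        # Step 2: re-encode to the local encoding (the HGENCODING value).
        return text.encode(local_encoding)
    except UnicodeEncodeError:
        # Step 3: characters missing from the local encoding become '?'.
        return text.encode(local_encoding, "replace")


stored = "пример".encode("utf-8")       # Russian text as stored in the changeset
print(tolocal_model(stored, "latin1"))  # lossy: Cyrillic does not exist in latin1
print(tolocal_model(stored, "utf-8") == stored)  # lossless round trip
```

With 'latin1' as the local encoding every Cyrillic character is replaced, whereas with 'utf-8' the bytes come back unchanged, which matches what was observed with the patched backend.py.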

So I'd suggest the following patch:

  • tracext/hg/backend.py

     
    @@ -86,7 +86,8 @@
             from mercurial.error import RepoError, LookupError as HgLookupError
     
         # Force local encoding to be non-lossy (#7217)
    -    os.environ['HGENCODING'] = 'latin1'
    +    os.environ['HGENCODING'] = 'utf-8'
    +    encoding.tolocal = str
     
         if demandimport:
             demandimport.disable();

... which should work even if you happen to have old changesets with messages written in 'cp866', 'koi8_r' encodings or such things ;-) (provided you put [hg] encoding = utf-8, cp866 in your ini file).
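The idea of trying several encodings in turn, as an [hg] encoding = utf-8, cp866 setting suggests, can be sketched in Python 3 as follows. The function name `decode_metadata` and the final latin1 fallback are assumptions made for illustration; this is not the plugin's actual code.

```python
def decode_metadata(raw: bytes, encodings=("utf-8", "cp866")) -> str:
    """Decode raw changeset metadata by trying each encoding in order.

    Hypothetical helper: the first encoding that decodes strictly wins.
    """
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    # Assumed last resort: latin1 maps every byte, so it never fails.
    return raw.decode("latin1")


new_style = "привет".encode("utf-8")  # metadata stored as UTF-8
old_style = "привет".encode("cp866")  # metadata from an old cp866 repository
print(decode_metadata(new_style) == decode_metadata(old_style))
```

This only works if the raw bytes reach the decoder untouched, which is why the patch replaces encoding.tolocal with a no-op.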

The encoding.tolocal = str line alone would work, but let's be safe and also use a proper value for 'HGENCODING', so that fromlocal will also work, for the day we will use the hg backend as a store and we will have to create our own commits.

OTOH, if you're sure to have all the metadata stored as UTF-8, then of course os.environ['HGENCODING'] = 'utf-8' should be enough, so your fix is quite OK.

As for the '?' in the other revisions, they are really part of the changeset data:

>>> from mercurial import hg, encoding, ui
>>> repo = hg.repository(ui.ui(), '.')
>>> encoding.tolocal = str
>>> print '\n'.join(repr((i, repo[i].description())) for i in range(0, len(repo)))
(0, 'English: sample commit message in English')
(1, 'Diacritics: s?mpl? c?mm?t m?ss?g? w?th l?ts ?f d??cr?t?cs')
(2, 'Russian: \xd0\xbf\xd1\x80\xd0\xb8\xd0\xbc\xd0\xb5\xd1\x80 \xd0\xbe\xd0\xbf\xd0\xb8\xd1\x81\xd0\xb0\xd0\xbd\xd0\xb8\xd1\x8f \xd0\xba\xd0\xbe\xd0\xbc\xd0\xbc\xd0\xb8\xd1\x82\xd0\xb0 \xd0\xbd\xd0\xb0 \xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xbe\xd0\xbc')
(3, 'System locale changed to en-US (was ru-RU)')
(4, 'Diacritics: s\xc3\xa1mpl\xc3\xab c\xc3\xb2mm?t m\xc3\xa9ss\xc3\xa4g\xc3\xa8 w?th l\xc3\xb3ts \xc3\xb6f d\xc3\xac\xc3\xa3cr\xc3\xadt\xc3\xafcs')
(5, 'Russian: ?????? ???????? ??????? ??-??????')
>>>
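One plausible explanation for the literal '?' bytes in commits 1 and 5, offered here only as an assumption, is that the client passed the message through the Windows ANSI code page of the active locale (cp1251 for ru-RU, cp1252 for en-US) in replace mode before storing it as UTF-8. A minimal Python 3 sketch, where the helper name `through_codepage` is hypothetical:

```python
def through_codepage(message: str, codepage: str) -> bytes:
    """Model a lossy trip through a locale code page, then store as UTF-8."""
    # Characters missing from the code page are replaced by '?' at this point.
    narrowed = message.encode(codepage, "replace").decode(codepage)
    return narrowed.encode("utf-8")


print(through_codepage("sámplë", "cp1251"))  # diacritics are lost under ru-RU
print(through_codepage("пример", "cp1252"))  # Cyrillic is lost under en-US
print(through_codepage("пример", "cp1251"))  # Cyrillic survives under ru-RU

# The first word of commit 2 as stored in the repository is valid UTF-8:
print(b"\xd0\xbf\xd1\x80\xd0\xb8\xd0\xbc\xd0\xb5\xd1\x80".decode("utf-8"))
```

Under this model the question marks are written into the changeset at commit time, so no setting on the Trac side can recover the original characters.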

comment:11 Changed 12 months ago by cboos

  • Resolution fixed deleted
  • Status changed from closed to reopened

(re-opening until I apply the patch - comments welcome in the meantime)

comment:12 Changed 12 months ago by mr.troll <troll@…>

This patch works on my Trac, which I installed yesterday. I have source code and commit descriptions in the windows-1251 charset, so I compiled the plugin with the patch to tracext/hg/backend.py:

 os.environ['HGENCODING'] = 'utf-8' 
 encoding.tolocal = str


In my hgrc files I have

[web]
encoding = windows-1251

In my trac.conf I have

[trac]
default_charset = windows-1251

Trac version 0.12.3dev-r10552, hg version 1.7.5, server on Fedora Core 7, committing from Windows XP.
http://www.valar.ru/gallery/0211/untitled1.gif

comment:13 Changed 12 months ago by cboos

  • Resolution set to fixed
  • Status changed from reopened to closed

Ok, patch applied in r10618:10620. Thanks all for testing!
