Opened 17 years ago
Closed 14 years ago
#7217 closed defect (fixed)
Non-ASCII characters get replaced with '?' in changeset metadata
Reported by: | anonymous | Owned by: | Christian Boos |
---|---|---|---|
Priority: | high | Milestone: | plugin - mercurial |
Component: | plugin/mercurial | Version: | |
Severity: | normal | Keywords: | unicode |
Cc: | franoleg@… | Branch: | |
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
With TracMercurial, non-ASCII characters in changeset metadata strings get replaced with question marks. The plugin should set os.environ['HGENCODING'] = 'utf-8'.
I think in actuality this behavior can vary depending on the locale that Trac is running with, but I think it should just use UTF-8 regardless, since the plugin seems to expect that anyway (with calls to to_unicode(), which tries to decode from UTF-8 by default).
I'm not sure where exactly in the code it should set this, but this works for me, at least:
--- tracext/hg/backend.py
+++ tracext/hg/backend.py
@@ -27,6 +27,8 @@
      NoSuchChangeset, NoSuchNode
 from trac.wiki import IWikiSyntaxProvider
 
+os.environ['HGENCODING'] = 'utf-8'
+
 try:
     # The new `demandimport` mechanism doesn't play well with code relying
     # on the `ImportError` exception being caught.
Attachments (1)
Change History (14)
comment:1 by , 17 years ago
Milestone: | → not applicable |
---|
comment:2 by , 17 years ago
comment:3 by , 16 years ago
comment:4 by , 15 years ago
Keywords: | unicode added |
---|---|
Milestone: | not applicable → mercurial-plugin |
Priority: | normal → high |
comment:5 by , 14 years ago
This one is a bit tricky, I've started to address it in r10491, but it's not finished yet.
comment:6 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Should be fixed in a robust way in r10518.
follow-up: 8 comment:7 by , 14 years ago
Cc: | added |
---|
I'm not sure if I should reopen this ticket, but I'm getting the original problem on Trac 0.12.2. The commit messages in my Mercurial repository are in Russian, and everything except the Latin characters is displayed as question marks. Trac is installed on Debian squeeze, Python version 2.6.6rc1+, Mercurial 1.6.4, TracMercurial r10532, mod_wsgi 3.2.
The same environment/repository was previously used on Trac 0.11.5 installed as per TracOnWindowsIisAjp, Python 2.5.4, Mercurial 1.3.1, TracMercurial r8963 on Windows Server 2003, and no metadata encoding problems were spotted.
I tried setting the [hg] encoding setting to utf-8 in TracIni, but it didn't seem to help. While trying other approaches I had found here, I noticed that os.environ['HGENCODING'] = 'utf-8' in trac.wsgi makes Trac display sometimes question marks and sometimes correct characters. I patched backend.py, replacing latin1 with utf-8 on line 89, and suddenly this did help.
I guess I must be missing something about the whole thing.
follow-up: 9 comment:8 by , 14 years ago
Replying to Oleg Frantsuzov <franoleg@…>:
… I guess I must be missing something about the whole thing.
… or I do ;-) Care to provide me with a sample repository reproducing the issue?
by , 14 years ago
Attachment: | ru-sample-repo.zip added |
---|
Sample repository with commit messages in Russian
comment:9 by , 14 years ago
Replying to cboos:
… or I do ;-) Care to provide me with a sample repository reproducing the issue?
While preparing the sample repository (see attachment:ru-sample-repo.zip), I've found out that my problem has at least one more contributing factor. I use Windows and TortoiseHg (1.1.9.1 with Mercurial 1.7.5) on my PC, and it's the TortoiseHg commit dialog that I use to enter my Russian commit messages.
I tried entering different commit messages: one in English, one with diacritical marks, and one in Russian. I did this once in my customary ru-RU system locale (commits 0-2), and once in en-US locale (commits 3-5 in the sample repo). The results were somewhat surprising: this is how the messages are displayed when the locale is set to ru-RU, and this is how they look when it's en-US.
I understand this isn't the place to report bugs against either Mercurial or TortoiseHg, but I guess this can be useful for the TracTeam for diagnosing encoding problems. I used to think that both Mercurial and TortoiseHg use Unicode inside them, but it looks like it isn't the case.
As for Trac itself, that's how my sample repository looks with the unmodified TracMercurial r10532:
Commits 1, 4 and 5 are garbled because I was too successful in testing if TortoiseHg could fail on problematic locale and encoding scenarios, but commit 2 is expected to be displayed correctly. The actual commits in the repository I mentioned in comment:7 are like commit 2 in the sample repository.
Now, that's how the repository looks with the patched backend.py:
Just for reference, here's the patch:
--- tracext/hg/backend.py
+++ tracext/hg/backend.py
@@ -86,7 +86,7 @@
 from mercurial.error import RepoError, LookupError as HgLookupError
 
 # Force local encoding to be non-lossy (#7217)
-os.environ['HGENCODING'] = 'latin1'
+os.environ['HGENCODING'] = 'utf-8'
 
 if demandimport:
     demandimport.disable();
comment:10 by , 14 years ago
You were right, r10518 was plain wrong, as there's actually no way to make sure that Mercurial's encoding.tolocal(str) is a no-op.
My initial goal by using 'latin1' instead of 'utf-8' was to be able to retrieve any bytes from the metadata, even if they were not decodable as 'utf-8' (as it could be for old repositories started before UTF-8 metadata was the norm in Mercurial), so that we can perform our sequence of conversions.
But it doesn't work that way: encoding.tolocal first attempts to decode using 'utf-8', and if this succeeds (which was the case in your example), it encodes the resulting unicode to the chosen HGENCODING ('latin1' here), which then fails and triggers a fallback to 'replace' mode, hence the question marks.
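To make that sequence concrete, here is a minimal sketch of the behavior just described. It is a simplified model written for Python 3, not Mercurial's actual code: the name tolocal_model is hypothetical, and the handling of input that is not valid UTF-8 (returning the bytes unchanged) is an assumption made only to keep the example short.

```python
def tolocal_model(raw: bytes, local_encoding: str) -> bytes:
    """Simplified model of the decode/re-encode sequence described above.

    Hypothetical helper for illustration only; not Mercurial's implementation.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption for this sketch: undecodable input is passed through.
        return raw
    try:
        return text.encode(local_encoding)
    except UnicodeEncodeError:
        # Fallback to 'replace' mode: unencodable characters become '?'.
        return text.encode(local_encoding, "replace")


message = "Russian: пример".encode("utf-8")

lossy = tolocal_model(message, "latin-1")
lossless = tolocal_model(message, "utf-8")

print(lossy)                # b'Russian: ??????'
print(lossless == message)  # True
```

With 'latin1' as the local encoding the Cyrillic characters cannot be represented, so each one is replaced; with 'utf-8' the round trip returns the original bytes.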
So I'd suggest the following patch:
--- tracext/hg/backend.py
+++ tracext/hg/backend.py
@@ -86,7 +86,8 @@
 from mercurial.error import RepoError, LookupError as HgLookupError
 
 # Force local encoding to be non-lossy (#7217)
-os.environ['HGENCODING'] = 'latin1'
+os.environ['HGENCODING'] = 'utf-8'
+encoding.tolocal = str
 
 if demandimport:
     demandimport.disable();
… which should work even if you happen to have old changesets with messages written in 'cp866', 'koi8_r' encodings or such things ;-) (provided you put [hg] encoding = utf-8, cp866 in your ini file).
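The idea behind listing several encodings can be sketched as follows. This is only an illustration in Python 3 of trying each encoding in order; decode_with_fallbacks is a hypothetical helper and is not the plugin's or Trac's actual conversion code, and the final latin-1 step is an assumption added so the function always returns something.

```python
def decode_with_fallbacks(raw: bytes, encodings=("utf-8", "cp866")) -> str:
    """Try each encoding in order and return the first successful decode.

    Hypothetical helper for illustration; mirrors the order 'utf-8, cp866'.
    """
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Assumption for this sketch: latin-1 maps every byte, so it never fails.
    return raw.decode("latin-1")


new_style = "пример".encode("utf-8")  # metadata stored as UTF-8
old_style = "пример".encode("cp866")  # metadata from an older, cp866 commit

print(decode_with_fallbacks(new_style))
print(decode_with_fallbacks(old_style))
```

The cp866 bytes are not valid UTF-8, so the first attempt fails and the second encoding in the list is used. This only works if the raw bytes reach the decoder untouched, which is why tolocal must not alter them first.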
The encoding.tolocal = str line alone would work, but let's be safe and also use a proper value for 'HGENCODING', so that fromlocal will also work, for the day we will use the hg backend as a store and we will have to create our own commits.
OTOH, if you're sure to have all the metadata stored as UTF-8, then of course os.environ['HGENCODING'] = 'utf-8' should be enough, so your fix is quite OK.
As for the '?' in the other revisions, they are really part of the changeset data:
>>> from mercurial import hg, encoding, ui
>>> repo = hg.repository(ui.ui(), '.')
>>> encoding.tolocal = str
>>> print '\n'.join(repr((i, repo[i].description())) for i in range(0, len(repo)))
(0, 'English: sample commit message in English')
(1, 'Diacritics: s?mpl? c?mm?t m?ss?g? w?th l?ts ?f d??cr?t?cs')
(2, 'Russian: \xd0\xbf\xd1\x80\xd0\xb8\xd0\xbc\xd0\xb5\xd1\x80 \xd0\xbe\xd0\xbf\xd0\xb8\xd1\x81\xd0\xb0\xd0\xbd\xd0\xb8\xd1\x8f \xd0\xba\xd0\xbe\xd0\xbc\xd0\xbc\xd0\xb8\xd1\x82\xd0\xb0 \xd0\xbd\xd0\xb0 \xd1\x80\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xbe\xd0\xbc')
(3, 'System locale changed to en-US (was ru-RU)')
(4, 'Diacritics: s\xc3\xa1mpl\xc3\xab c\xc3\xb2mm?t m\xc3\xa9ss\xc3\xa4g\xc3\xa8 w?th l\xc3\xb3ts \xc3\xb6f d\xc3\xac\xc3\xa3cr\xc3\xadt\xc3\xafcs')
(5, 'Russian: ?????? ???????? ??????? ??-??????')
>>>
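As a quick cross-check of that output, the following Python 3 sketch decodes two of the byte strings copied from the session (revision 2 is shortened to its first word). It does not need Mercurial; the variable names are only for illustration.

```python
# Bytes copied from the session output above.
rev2 = b'Russian: \xd0\xbf\xd1\x80\xd0\xb8\xd0\xbc\xd0\xb5\xd1\x80'
rev5 = b'Russian: ?????? ???????? ??????? ??-??????'

# Revision 2 holds real UTF-8 data, so decoding recovers the Cyrillic text.
decoded2 = rev2.decode('utf-8')

# Revision 5 holds only ASCII bytes: the '?' are literal 0x3f characters,
# so no choice of encoding can bring the original text back.
only_ascii5 = all(byte < 0x80 for byte in rev5)

print(decoded2)
print(only_ascii5)
```

So revision 2 is recoverable once the bytes are left alone, while for revisions 1 and 5 (and partly 4) the information was already lost at commit time.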
comment:11 by , 14 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
(re-opening until I apply the patch - comments welcome in the meantime)
comment:12 by , 14 years ago
This patch works on my Trac, which I installed yesterday. My source code and commit descriptions are in the windows-1251 charset, so I built the plugin with this patch to tracext/hg/backend.py:
os.environ['HGENCODING'] = 'utf-8'
encoding.tolocal = str
In my hgrc files I have:
[web]
encoding = windows-1251
In my trac.conf I have:
[trac]
default_charset = windows-1251
Trac version 0.12.3dev-r10552, hg version 1.7.5, server on Fedora Core 7, committing from Windows XP.
comment:13 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Ok, patch applied in r10618:10620. Thanks all for testing!
I can second that; I got this problem especially when using mod_wsgi. Using the fix described inside my trac.wsgi file eliminates the problem.
With standalone tracd that problem never arose.
I'm using hg (serv) too via mod_wsgi, and there I didn't have to set up that variable. Might be something inside hg.