Opened 16 years ago

Closed 13 years ago

#7160 closed defect (fixed)

Problems with character encoding

Reported by: Martin <martin@…> Owned by: Christian Boos
Priority: high Milestone: plugin - mercurial
Component: plugin/mercurial Version: 0.11.5
Severity: major Keywords: unicode utf-8 mercurial hg
Cc: alvaro.justen@… Branch:
Release Notes:
API Changes:
Internal Changes:

Description

I have a mercurial repo and I'm using trac to manage the project.

The problem is that, since Spanish is my first language, we write a lot of accented vowels and other characters that are not in the ASCII charset. So we use UTF-8 (all the files are in UTF-8 format).

The thing is that Trac works great, but when browsing the repo, even though the browser itself works fine, it doesn't print the extended characters as it should. For example, my name taken from the author information in hg looks like: Mart?n Marqu?s.

The contents of the repo also have the same problem. For example:

Creacón → Creación
catálogo → catálogo

Looking at the hg repo, everything is OK, but trac doesn't render it correctly.
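
For reference, a minimal Python sketch of two common ways UTF-8 text gets mangled. Which one applies to this ticket is an assumption, not something confirmed here: forcing the text into ASCII with replacement yields the `?` seen in the author name, while decoding UTF-8 bytes as Latin-1 turns each accented letter into two characters.

```python
# -*- coding: utf-8 -*-
name = u"Martín Marqués"
utf8_bytes = name.encode("utf-8")

# Forcing the text into ASCII replaces every non-ASCII character with "?".
as_ascii = name.encode("ascii", "replace")

# Decoding UTF-8 bytes with the wrong charset yields two characters
# for each accented letter.
as_latin1 = utf8_bytes.decode("latin-1")

# Decoding with the matching charset round-trips cleanly.
round_trip = utf8_bytes.decode("utf-8")

print(as_ascii)
print(repr(as_latin1))
```

If the `?` form is what shows up, the data was probably pushed through an ASCII conversion somewhere before it reached the page.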

If you need further information, please ask.

Attachments (0)

Change History (20)

comment:1 by Remy Blank, 15 years ago

Milestone: not applicable

Could you please test with 0.11.1 and the latest mercurial-plugin?

comment:2 by Christian Boos, 15 years ago

Keywords: unicode added

Also, how did you configure your [trac] default_charset TracIni entry?

In any case, as Remy said, you're strongly advised to upgrade to 0.11 and the corresponding Mercurial plugin, because the 0.10.x version (and Trac 0.10.x as well) is in "low maintenance mode".

comment:3 by Christian Boos, 15 years ago

See also #3809.

comment:4 by anonymous, 15 years ago

Version: 0.10.4 → 0.11.1

Upgraded quite a few months ago (on Debian testing):

trac                                 0.11.1-2 
trac-mercurial                       0.11.0.5dev~svnr7354-2 

Same problem displaying the accented vowels.

Also changed default_charset to UTF-8 in trac.ini, with no luck (restarted apache just in case).

comment:5 by Christian Boos, 15 years ago

Priority: normal → high
Severity: minor → major

We need to better handle arbitrary character sets in TracMercurial, at different levels:

The common point for all cases is that Mercurial by itself doesn't care about encodings, it simply stores the bytes as they come in (by design). So the sensible thing to do here is:

  • make it possible to configure which encoding must be used in which situation (content, filename, meta-data), all falling back on default_charset. One concrete example would be a repository created on Windows, with the filenames encoded in whatever the current codepage is and the content in UTF-8.
  • we must use robust conversion, as nothing guarantees that the data in the Mercurial repository will always be consistent w.r.t. the chosen encoding.
Version 0, edited 15 years ago by Christian Boos (next)
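
A rough sketch of the first point, to make the fallback idea concrete. The `[hg-sketch]` section and the `*_encoding` option names are hypothetical and exist only for this illustration; `default_charset` in `[trac]` is the only setting taken from this ticket, and the hard-coded `utf-8` last resort is likewise an assumption.

```python
import configparser

# Hypothetical option names; only [trac] default_charset comes from the ticket.
SAMPLE_INI = """
[trac]
default_charset = utf-8

[hg-sketch]
filename_encoding = cp1252
"""


def encoding_for(config, kind):
    """Return the encoding for kind ('content', 'filename' or 'metadata').

    Falls back on [trac] default_charset when no specific option is set.
    """
    default = config.get("trac", "default_charset", fallback="utf-8")
    return config.get("hg-sketch", kind + "_encoding", fallback=default)


config = configparser.ConfigParser()
config.read_string(SAMPLE_INI)

print(encoding_for(config, "filename"))  # specific setting wins
print(encoding_for(config, "content"))   # falls back on default_charset
```

This mirrors the Windows example: filenames in the local codepage, everything else in the default charset.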

comment:6 by anonymous, 15 years ago

You should set or add the environment variable

HGENCODING=utf-8

and then restart the Trac daemon or web server. Another way, if you are using a web server, is to define it through the server's bin-environment setting. This example is from my lighttpd server configuration file:

....
 "bin-environment" =>
     ("TRAC_ENV_PARENT_DIR" => "/var/lib/trac/" ,
     "LC_TIME" => "bg_BG.UTF-8",
     "PYTHON_EGG_CACHE" => "/tmp/.python_eggs",
     "HGENCODING" => "utf-8")
....

comment:7 by martin@…, 15 years ago

OK, but should I set HGENCODING in /etc/profile?

Can't test it at the moment. For some reason the Debian trac-mercurial plugin seems to have a bug in the browse section. Think I'll have to create a new ticket. :-(

comment:8 by anonymous, 15 years ago

I can't speak for Debian, but on Gentoo the Mercurial environment settings are located in /etc/env.d/80mercurial. The file looks like this:

HG=/usr/bin/hg
HGENCODING=utf-8

comment:9 by martin@…, 15 years ago

Setting HGENCODING in Apache, /etc/profile, etc. doesn't help. I keep seeing my name (Martín Marqués) with ? in place of the accented vowels.

comment:10 by IanMLewis@…, 15 years ago

Did you try changing the default_charset in the trac.ini to utf-8?

comment:11 by anonymous, 15 years ago

comment:5 had the answers. I had to hack the code, though.

comment:12 by anonymous, 15 years ago

I solved the problem by editing backend.py in trac-mercurial. I added os.environ["HGENCODING"] = "UTF-8" there. Apache with mod_python doesn't pass environment variables on, so SetEnv HGENCODING UTF-8 doesn't work.

comment:13 by anonymous, 15 years ago

os.environ["HGENCODING"] = "UTF-8"
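
A slightly more defensive variant of that line, as a sketch only. `setdefault` keeps any value the server already exported, and the call is placed before the first Mercurial import on the assumption that Mercurial reads `HGENCODING` when its modules are first loaded; whether that holds for a given Mercurial version is not confirmed in this ticket.

```python
import os

# Keep an HGENCODING already provided by the server (for example through
# lighttpd's bin-environment); otherwise default to UTF-8.
os.environ.setdefault("HGENCODING", "UTF-8")

# Assumption: this has to run before the first `from mercurial import ...`
# in backend.py, otherwise the value may be picked up too late.
print(os.environ["HGENCODING"])
```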

comment:14 by Álvaro Justen <alvaro.justen@…>, 15 years ago

Cc: alvaro.justen@… added
Keywords: utf-8 mercurial hg added
Version: 0.11.1 → 0.11.5

I just added:

os.environ["HGENCODING"] = "UTF-8"

to mercurial-plugin-0.11/tracext/hg/backend.py and it worked!

In my case I have everything in UTF-8. But for people who don't use UTF-8 it won't work. Is there a way to get the Hg charset? (I don't know about using Hg in Python programs.)
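
One possible approximation, using only the standard library: Mercurial documents `HGENCODING` as overriding the locale it would otherwise detect, so the sketch below checks the variable first and then falls back on the locale's preferred encoding. The helper name `guess_hg_charset` is made up for this example, and the result is a guess at what Mercurial would choose, not a value read from Mercurial itself.

```python
import locale
import os


def guess_hg_charset(environ=os.environ):
    """Approximate Mercurial's charset: HGENCODING first, else the locale's."""
    explicit = environ.get("HGENCODING")
    if explicit:
        return explicit
    # Fall back on the locale; "ascii" is a last resort if nothing is reported.
    return locale.getpreferredencoding(False) or "ascii"


print(guess_hg_charset({"HGENCODING": "latin-1"}))
print(guess_hg_charset({}))
```

Note that a repository's contents can still be in a different charset, since Mercurial stores the bytes as they come in.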

comment:15 by Christian Boos, 14 years ago

Milestone: not applicable → mercurial-plugin

in reply to:  15 comment:16 by anonymous, 14 years ago

Replying to cboos:

I also tried adding os.environ["HGENCODING"] = "utf-8" just next to import os in backend.py, but with no success.

This is with Trac 0.11.7. I still see UTF-8 strings coming from mercurial being interpreted as ISO_8859-1.

comment:17 by Ismael de Esteban <ismael@…>, 13 years ago

I'm getting this error when browsing the source; I don't know if it is related to this ticket:

12:10:59 PM Trac[main] ERROR: Internal Server Error: 
Traceback (most recent call last):
  File "/home/ismael/trac-trunk/trac/web/main.py", line 513, in _dispatch_request
    dispatcher.dispatch(req)
  File "/home/ismael/trac-trunk/trac/web/main.py", line 235, in dispatch
    resp = chosen_handler.process_request(req)
  File "/home/ismael/trac-trunk/trac/versioncontrol/web_ui/browser.py", line 370, in process_request
    node = get_existing_node(req, repos, path, rev_or_latest)
  File "/home/ismael/trac-trunk/trac/versioncontrol/web_ui/util.py", line 61, in get_existing_node
    return repos.get_node(path, rev)
  File "/home/ismael/mercu/mercurial-plugin/tracext/hg/backend.py", line 557, in get_node
    self.hg_node(rev))
  File "/home/ismael/mercu/mercurial-plugin/tracext/hg/backend.py", line 682, in __init__
    self._init_path(log, path.encode('utf-8'))
  File "/home/ismael/mercu/mercurial-plugin/tracext/hg/backend.py", line 731, in _init_path
    dirnodes = self.findnode(log.rev(self.n), [dir,])
  File "/home/ismael/mercu/mercurial-plugin/tracext/hg/backend.py", line 697, in findnode
    if f.startswith(d):
  File "/home/ismael/mercu/mercurial-plugin/tracext/hg/backend.py", line 697, in findnode
    if f.startswith(d):
  File "/usr/lib/python2.5/bdb.py", line 48, in trace_dispatch
    return self.dispatch_line(frame)

When I debug, the problematic file is this one.

(Pdb) f
'main/core/domain/sheldon/mvno/authentication/EMPRESA\xe2\x80\x93CIFB84675529.cert'
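
For what it is worth, the bytes `\xe2\x80\x93` in that name are the UTF-8 encoding of an en dash (U+2013). The sketch below decodes the path and shows a prefix test that keeps both operands as bytes. That the failure comes from mixing byte strings and unicode in `startswith` is an assumption, since the traceback above is cut off before the exception line.

```python
raw = (b'main/core/domain/sheldon/mvno/authentication/'
       b'EMPRESA\xe2\x80\x93CIFB84675529.cert')

# Decode the stored bytes as UTF-8 and look at the character after "EMPRESA".
text = raw.decode('utf-8')
stem = u'main/core/domain/sheldon/mvno/authentication/EMPRESA'
dash = text[len(stem)]
print(hex(ord(dash)))  # 0x2013

# Encode the unicode prefix first so that bytes are compared with bytes.
prefix = u'main/core/domain/sheldon/mvno/authentication'
matches = raw.startswith(prefix.encode('utf-8'))
print(matches)
```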

It was the same file for different paths. I have fixed adding a

I'm using 12.0 and latest version of mercurial plugin.

in reply to:  17 comment:18 by Ismael de Esteban <ismael@…>, 13 years ago

Replying to Ismael de Esteban <ismael@…>:

Sorry I saw this issue in #9631

comment:19 by Christian Boos, 13 years ago

#9970 was closed as duplicate.

in reply to:  5 comment:20 by Christian Boos, 13 years ago

Resolution: fixed
Status: new → closed

Replying to cboos:

We need to better handle arbitrary character sets in TracMercurial, at different levels:

  • content encoding (this ticket)

This seems to work now with r10490 / r10491. See also #9631 for the "multiple encoding" approach.

This one (#8538) I'll keep open until all glitches are fixed.

Also benefits from the multiple encoding approach of #9631, but not yet finished.

The common point for all cases is that Mercurial by itself doesn't care about encodings, it simply stores the bytes as they come in (by design). So the sensible thing to do here is:

  • make it possible to configure which encoding must be used in which situation (content, filename, meta-data), all falling back on default_charset. One concrete example would be a repository created on Windows, with the filenames encoded in whatever the current codepage is and the content in UTF-8.

Well, actually this is handled in a generic way by the [hg] encoding setting, which can accept multiple encodings if needed (see #9631). By default it's "utf-8", and regardless of the value of the setting, a fallback of "latin1" will always be used if every other encoding has failed. That way, we are guaranteed to never trigger errors, at the cost of possibly getting latin1-mangled characters if the correct encoding was not part of the list.
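
The fallback chain described here can be sketched in a few lines. This is an illustration of the idea only, not the code from r10490 / r10491; the function name `to_text` and its signature are made up for the example.

```python
def to_text(raw, encodings=("utf-8",)):
    """Decode raw bytes with the first encoding that fits, else latin1."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next one in the list.
            continue
    # latin1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin1")


print(to_text(b"Mart\xc3\xadn"))  # valid UTF-8
print(to_text(b"Mart\xedn"))      # not UTF-8, rescued by the latin1 fallback
print(repr(to_text(b"Mart\xc3\xadn", encodings=("ascii",))))  # mangled, no error
```

The last call shows the trade-off: with the right encoding missing from the list, nothing raises, but the text comes out latin1-mangled.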

  • we must use robust conversion, as nothing guarantees that the data in the Mercurial repository will always be consistent w.r.t. the chosen encoding.

This should be achieved by r10491.
