Edgewall Software
Ticket #7160 (closed defect: fixed)

Opened 4 years ago

Last modified 13 months ago

Problems with character encoding

Reported by: Martin <martin@…> Owned by: cboos
Priority: high Milestone: plugin - mercurial
Component: plugin/mercurial Version: 0.11.5
Severity: major Keywords: unicode utf-8 mercurial hg
Cc: alvaro.justen@…
Release Notes:
API Changes:

Description

I have a mercurial repo and I'm using trac to manage the project.

The problem is that, as Spanish is my first language, we write a lot of accented vowels and other characters that are not in the ASCII charset. So we use UTF-8 (all the files are in UTF-8 format).

The thing is that Trac works great, but when I browse the repo, even though the browser itself works, it doesn't print the extended characters as it should. For example, my name taken from the author information in hg looks like: Mart?n Marqu?s.

The contents of the repo also have the same problem. For example:

Creacón -> Creación
catálogo -> catálogo

Looking at the hg repo, everything is OK, but trac doesn't render it correctly.

If you need further information, please ask.
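
The "?" symptom is consistent with a lossy conversion to an ASCII-only charset somewhere between Mercurial and the rendered page. The following is a minimal sketch of that effect, assuming (the report does not confirm this) that the author string is valid UTF-8 in the repository and gets re-encoded to ASCII with replacement:

```python
# Assumption: the author name is stored as UTF-8 bytes in the changeset.
stored = "Martín Marqués".encode("utf-8")

# Decoding with the right charset recovers the name.
name = stored.decode("utf-8")

# Re-encoding to an ASCII-only charset with errors="replace" turns every
# non-ASCII character into a single "?".
lossy = name.encode("ascii", "replace").decode("ascii")
print(lossy)
```

One "?" per accented character, as in the name above, points to this kind of replacement rather than to a wrong decode of the raw bytes.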

Change History

comment:1 Changed 3 years ago by rblank

  • Milestone set to not applicable

Could you please test with 0.11.1 and the latest mercurial-plugin?

comment:2 Changed 3 years ago by cboos

  • Keywords unicode added

Also, how did you configure your [trac] default_charset TracIni entry?

In any case, as Remy said, you're strongly advised to upgrade to 0.11 and the corresponding Mercurial plugin, because the 0.10.x version (and Trac 0.10.x as well) is in "low maintenance mode".

comment:3 Changed 3 years ago by cboos

See also #3809.

comment:4 Changed 3 years ago by anonymous

  • Version changed from 0.10.4 to 0.11.1

Upgraded quite a few months ago (on Debian testing):

trac                                 0.11.1-2 
trac-mercurial                       0.11.0.5dev~svnr7354-2 

Same problem displaying the accented vowels.

Also changed default_charset to UTF-8 in trac.ini, with no luck (restarted apache just in case).

comment:5 follow-up: Changed 3 years ago by cboos

  • Priority changed from normal to high
  • Severity changed from minor to major

We need to better handle arbitrary character sets in TracMercurial, at different levels:

The common point for all cases is that Mercurial by itself doesn't care about encodings, it simply stores the bytes as they come in (by design).
So the sensible thing to do here is:

  • make it possible to configure which encoding must be used in which situation (content, filename, meta-data), all falling back on default_charset. One concrete example would be a repository created on Windows, with the filenames encoded using whatever the current codepage is, and utf-8 content.
  • we must use robust conversion, as nothing guarantees that the data in the Mercurial repository will always be consistent w.r.t. the chosen encoding.

Last edited 17 months ago by cboos
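
The per-situation configuration proposed in this comment could look roughly like the sketch below. It is only an illustration: the function name, the dictionary keys and the sample values are hypothetical, and only default_charset comes from the ticket.

```python
# Hypothetical stand-in for the [trac] default_charset setting.
DEFAULT_CHARSET = "utf-8"

def encoding_for(kind, settings, default_charset=DEFAULT_CHARSET):
    """Return the encoding configured for `kind`, else default_charset.

    `kind` is one of "content", "filename" or "metadata" (names assumed
    here for illustration only).
    """
    return settings.get(kind) or default_charset

# Example from the comment: a repository created on Windows, with
# filenames in the local codepage (cp1252 is an assumed value) and
# UTF-8 content.
settings = {"filename": "cp1252"}
print(encoding_for("filename", settings), encoding_for("content", settings))
```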

comment:6 Changed 3 years ago by anonymous

You should set / add

HGENCODING=utf-8

environment variable, then restart the trac daemon or web server.
Another possible way (if you are using a web server) is to define the variable through bin-environment. This example is from my lighttpd server configuration file:

....
 "bin-environment" =>
     ("TRAC_ENV_PARENT_DIR" => "/var/lib/trac/" ,
     "LC_TIME" => "bg_BG.UTF-8",
     "PYTHON_EGG_CACHE" => "/tmp/.python_eggs",
     "HGENCODING" => "utf-8")
....

comment:7 Changed 3 years ago by martin@…

OK, but should I set HGENCODING in /etc/profile?

Can't test it at the moment. For some reason the Debian trac-mercurial plugin seems to have a bug in the browse section. Think I'll have to create a new ticket. :-(

comment:8 Changed 3 years ago by anonymous

I can't speak for Debian, but on Gentoo, Mercurial has its environment settings in /etc/env.d/80mercurial. The file looks like this:

HG=/usr/bin/hg
HGENCODING=utf-8

comment:9 Changed 3 years ago by martin@…

Setting HGENCODING in apache, /etc/profile, etc. doesn't help. I keep seeing my name (Martín Marqués) with ? in place of the accented vowels.

comment:10 Changed 3 years ago by IanMLewis@…

Did you try changing the default_charset in the trac.ini to utf-8?

comment:11 Changed 3 years ago by anonymous

comment:5 had the answers. I had to hack the code, though.

comment:12 Changed 3 years ago by anonymous

I solved the problem by editing backend.py in trac-mercurial. I added os.environ["HGENCODING"] = "UTF-8" there. With mod_python, Apache doesn't pass environment variables through to the Python process, so SetEnv HGENCODING UTF-8 doesn't work.

comment:13 Changed 3 years ago by anonymous

os.environ["HGENCODING"] = "UTF-8"

comment:14 Changed 3 years ago by Álvaro Justen <alvaro.justen@…>

  • Cc alvaro.justen@… added
  • Keywords utf-8 mercurial hg added
  • Version changed from 0.11.1 to 0.11.5

I just added:

os.environ["HGENCODING"] = "UTF-8"

to mercurial-plugin-0.11/tracext/hg/backend.py and it worked!

In my case everything is UTF-8. But for people who don't use UTF-8 it won't work.
Is there a way to get the Hg charset? (I don't know about using Hg in Python programs.)
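
A slightly less invasive variant of this workaround is sketched below. It assumes that Mercurial picks up HGENCODING when its encoding module is first imported, so the variable has to be set before any "import mercurial" in the process, and that the chosen charset is exposed as mercurial.encoding.encoding. The helper name hg_charset is made up for this example.

```python
import os

# Keep an HGENCODING that the server environment already provides and
# only fall back to UTF-8 when none is set. Assumption: this runs before
# the first "import mercurial" anywhere in the process.
os.environ.setdefault("HGENCODING", "UTF-8")

def hg_charset():
    """Return the charset Mercurial picked, or None if it is not installed."""
    try:
        from mercurial import encoding
    except ImportError:
        return None
    return encoding.encoding

print(os.environ["HGENCODING"], hg_charset())
```

Using setdefault instead of a plain assignment leaves room for sites that do not use UTF-8 to configure their own value.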

comment:15 follow-up: Changed 2 years ago by cboos

  • Milestone changed from not applicable to mercurial-plugin

comment:16 in reply to: ↑ 15 Changed 21 months ago by anonymous

Replying to cboos:

I also tried adding os.environ["HGENCODING"] = "utf-8" just next to import os in backend.py, but with no success.

This is with Trac 0.11.7.
I still see UTF-8 strings coming from mercurial being interpreted as ISO-8859-1.
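
For reference, this is what UTF-8 text looks like when it is decoded as ISO-8859-1. The sketch uses a word from the ticket description as sample data and is not taken from the plugin code:

```python
original = "Creación"

# UTF-8 encodes "ó" as two bytes; reading them as ISO-8859-1 yields two
# separate characters instead of one.
mojibake = original.encode("utf-8").decode("iso-8859-1")
print(mojibake)  # CreaciÃ³n

# As long as nothing was lost in between, the mistake is reversible.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")
print(repaired)  # Creación
```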

comment:17 follow-up: Changed 15 months ago by Ismael de Esteban <ismael@…>

I'm getting this error when browsing the source; I don't know if it is related to this ticket:

12:10:59 PM Trac[main] ERROR: Internal Server Error: 
Traceback (most recent call last):
  File "/home/ismael/trac-trunk/trac/web/main.py", line 513, in _dispatch_request
    dispatcher.dispatch(req)
  File "/home/ismael/trac-trunk/trac/web/main.py", line 235, in dispatch
    resp = chosen_handler.process_request(req)
  File "/home/ismael/trac-trunk/trac/versioncontrol/web_ui/browser.py", line 370, in process_request
    node = get_existing_node(req, repos, path, rev_or_latest)
  File "/home/ismael/trac-trunk/trac/versioncontrol/web_ui/util.py", line 61, in get_existing_node
    return repos.get_node(path, rev)
  File "/home/ismael/mercu/mercurial-plugin/tracext/hg/backend.py", line 557, in get_node
    self.hg_node(rev))
  File "/home/ismael/mercu/mercurial-plugin/tracext/hg/backend.py", line 682, in __init__
    self._init_path(log, path.encode('utf-8'))
  File "/home/ismael/mercu/mercurial-plugin/tracext/hg/backend.py", line 731, in _init_path
    dirnodes = self.findnode(log.rev(self.n), [dir,])
  File "/home/ismael/mercu/mercurial-plugin/tracext/hg/backend.py", line 697, in findnode
    if f.startswith(d):
  File "/home/ismael/mercu/mercurial-plugin/tracext/hg/backend.py", line 697, in findnode
    if f.startswith(d):
  File "/usr/lib/python2.5/bdb.py", line 48, in trace_dispatch
    return self.dispatch_line(frame)

When I debug, the problematic file is this one.

(Pdb) f
'main/core/domain/sheldon/mvno/authentication/EMPRESA\xe2\x80\x93CIFB84675529.cert'
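
The byte sequence \xe2\x80\x93 in that path is the UTF-8 encoding of a single non-ASCII character, which can be checked with a short sketch (Python 3 syntax, using the path from the pdb output above as sample data):

```python
import unicodedata

# Path exactly as printed by pdb, as a byte string.
raw = b'main/core/domain/sheldon/mvno/authentication/EMPRESA\xe2\x80\x93CIFB84675529.cert'

# Decoding as UTF-8 succeeds and reveals the non-ASCII character.
path = raw.decode("utf-8")
non_ascii = [(ch, unicodedata.name(ch)) for ch in path if ord(ch) > 127]
print(non_ascii)  # [('\u2013', 'EN DASH')]
```

The traceback above is cut off before the actual exception, so whether this character is what triggers the error is an assumption. In Python 2, comparing such a byte string with a unicode string forces an implicit ASCII decode, which fails on these bytes.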

It was the same file for different paths. I have fixed adding a

I'm using 12.0 and latest version of mercurial plugin.

comment:18 in reply to: ↑ 17 Changed 15 months ago by Ismael de Esteban <ismael@…>

Replying to Ismael de Esteban <ismael@…>:

Sorry I saw this issue in #9631

comment:19 Changed 13 months ago by cboos

#9970 was closed as duplicate.

comment:20 in reply to: ↑ 5 Changed 13 months ago by cboos

  • Resolution set to fixed
  • Status changed from new to closed

Replying to cboos:

We need to better handle arbitrary character sets in TracMercurial, at different levels:

  • content encoding (this ticket)

This seems to work now with r10490 / r10491. See also #9631 for the "multiple encoding" approach.

This one (#8538) I'll keep open until all glitches are fixed.

Also benefits from the multiple encoding approach of #9631, but not yet finished.

The common point for all cases is that Mercurial by itself doesn't care about encodings, it simply stores the bytes as they come in (by design).
So the sensible thing to do here is:

  • make it possible to configure which encoding must be used in which situation (content, filename, meta-data), all falling back on default_charset. One concrete example would be a repository created on Windows, with the filenames encoded using whatever the current codepage is, and utf-8 content.

Well, actually this is handled in a generic way by the [hg] encoding setting, which can accept multiple encodings if needed (see #9631). By default it's "utf-8", and regardless of the value of the setting, a fallback of "latin1" will always be used if every other encoding has failed. That way, we are guaranteed to never trigger errors, at the cost of possibly having latin1-mangled characters if the correct encoding was not part of the list.

  • we must use robust conversion, as nothing guarantees that the data in the Mercurial repository will always be consistent w.r.t. the chosen encoding.

This should be achieved by r10491.
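
The fallback behaviour described for the [hg] encoding setting can be illustrated with the sketch below. It is not the code from r10490 / r10491; the function name and signature are made up for the example.

```python
def decode_with_fallback(data, encodings=("utf-8",)):
    """Decode bytes with the first encoding that works.

    latin-1 maps every byte to a character, so the final fallback can
    never raise; the price is mangled text when the real encoding was
    not in `encodings`.
    """
    for enc in encodings:
        try:
            return data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    return data.decode("latin-1")

print(decode_with_fallback(b"Mart\xc3\xadn"))  # valid UTF-8
print(decode_with_fallback(b"Mart\xedn"))      # invalid UTF-8, latin-1 fallback
```

Both calls print "Martín": the first input is valid UTF-8, the second is not and falls through to latin-1, where the byte 0xED happens to be the same character.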
