Edgewall Software

Opened 9 years ago

Closed 9 years ago

#12322 closed defect (fixed)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xca in position 8: invalid continuation byte

Reported by: Ryan J Ollos
Owned by: Ryan J Ollos
Priority: normal
Milestone: 1.0.10
Component: plugin/git
Version:
Severity: normal
Keywords:
Cc:
Branch:
Release Notes:

An invalid byte sequence in a file path is now replaced when reading Git commits.

API Changes:
Internal Changes:

Description

Encountered this error while running trac-admin $env repository resync "(default)":

2016-01-19 00:21:23,635 Trac[console] ERROR: Exception in trac-admin command: 
Traceback (most recent call last):
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/trac/admin/console.py", line 109, in onecmd
    rv = cmd.Cmd.onecmd(self, line) or 0
  File "/usr/lib/python2.7/cmd.py", line 220, in onecmd
    return self.default(line)
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/trac/admin/console.py", line 287, in default
    return self.cmd_mgr.execute_command(*args)
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/trac/admin/api.py", line 127, in execute_command
    return f(*fargs)
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/trac/versioncontrol/admin.py", line 156, in _do_resync
    self._sync(reponame, rev, clean=True)
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/trac/versioncontrol/admin.py", line 143, in _sync
    repos.sync(self._sync_feedback, clean=clean)
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/tracopt/versioncontrol/git/git_fs.py", line 141, in sync
    self._insert_changeset(db, rev, cset)
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/trac/versioncontrol/cache.py", line 285, in _insert_changeset
    for path, kind, action, bpath, brev in cset.get_changes():
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/tracopt/versioncontrol/git/git_fs.py", line 851, in get_changes
    self.repos.git.diff_tree(parent, self.rev, find_renames=True):
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/tracopt/versioncontrol/git/PyGIT.py", line 1044, in diff_tree
    yield __chg_tuple()
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/tracopt/versioncontrol/git/PyGIT.py", line 1036, in __chg_tuple
    chg[5] = self._fs_to_unicode(chg[5])
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/tracopt/versioncontrol/git/PyGIT.py", line 380, in <lambda>
    self._fs_to_unicode = lambda s: s.decode(git_fs_encoding)
  File "/var/www/bugs.jqueryui.com/private/pve/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xca in position 8: invalid continuation byte

I'll post more info if I can reproduce at a different debug level.

Attachments (0)

Change History (20)

comment:1 by Ryan J Ollos, 9 years ago

Milestone: 1.0.10

With log_level at INFO:

2016-01-19 00:43:47,572 Trac[git_fs] INFO: Trying to sync revision [c1800c59953161d88432ea8a307b5cdf08c5ec41]
2016-01-19 00:43:47,602 Trac[console] ERROR: Exception in trac-admin command:
Traceback (most recent call last):
  File "/var/www/bugs.jqueryui.com/private/pve/local/lib/python2.7/site-packages/trac/admin/console.py", line 109, in onecmd
    rv = cmd.Cmd.onecmd(self, line) or 0
  File "/usr/lib/python2.7/cmd.py", line 220, in onecmd
    return self.default(line)

The commit can be found here.

comment:2 by Jun Omae, 9 years ago

That commit has an invalid byte sequence in one of its file names.

$ git show --name-status c1800c59953161d88432ea8a307b5cdf08c5ec41
...
M       ya/demos/accordion/default.html
M       ya/demos/dialog/default.html
A       ya/external/PIE.htc
A       ya/external/border-radius.htc
A       ya/external/jquery.bgiframe-2.1.2.js
A       ya/lib/sl.css
M       ya/lib/sl.js
A       ya/lib/uihelper.js
A       "ya/test/\312\326\267\347\307\331.txt"
A       ya/themes/default/images/ui-icon-arrows.png
A       ya/themes/default/images/ui-icon-close.png
A       ya/themes/default/images/ui-icon-triangle-1-e.png
A       ya/themes/default/images/ui-icon-triangle-1-s.png
A       ya/themes/default/images/ui-icons.png
A       ya/themes/default/jquery.ui.accordion.css
A       ya/themes/default/jquery.ui.dialog.css
A       ya/themes/default/jquery.ui.override.css
M       ya/ui/jquery.ya.accordion0.js
M       ya/ui/jquery.ya.dialog0.js
$ python -c '"ya/test/\312\326\267\347\307\331.txt".decode("utf-8")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xca in position 8: invalid continuation byte

We could tolerate such invalid byte sequences in a Git repository by decoding with the 'replace' error handler.

  • tracopt/versioncontrol/git/PyGIT.py

    diff --git a/tracopt/versioncontrol/git/PyGIT.py b/tracopt/versioncontrol/git/PyGIT.py
    index 966df98bc..fc61319ed 100644
    --- a/tracopt/versioncontrol/git/PyGIT.py
    +++ b/tracopt/versioncontrol/git/PyGIT.py
    @@ -380,7 +380,8 @@ class Storage(object):
                 codecs.lookup(git_fs_encoding)
     
                 # setup conversion functions
    -            self._fs_to_unicode = lambda s: s.decode(git_fs_encoding)
    +            self._fs_to_unicode = lambda s: s.decode(git_fs_encoding,
    +                                                     'replace')
                 self._fs_from_unicode = lambda s: s.encode(git_fs_encoding)
             else:
                 # pass bytestrings as-is w/o any conversion

After the patch:

Python 2.5.6 (r256:88840, Oct 21 2014, 22:26:35)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from trac.env import open_environment
>>> env = open_environment('/home/jun66j5/var/trac/1.0-sqlite')
>>> repos = env.get_repository('jquery-ui.git')
>>> cset = repos.get_changeset('c1800c59953161d88432ea8a307b5cdf08c5ec41')
>>> for _ in cset.get_changes(): print _[0]
...
ya/demos/accordion/default.html
ya/demos/dialog/default.html
ya/external/PIE.htc
ya/external/border-radius.htc
ya/external/jquery.bgiframe-2.1.2.js
ya/lib/sl.css
ya/lib/sl.js
ya/lib/uihelper.js
ya/test/�ַ���.txt
ya/themes/default/images/ui-icon-arrows.png
ya/themes/default/images/ui-icon-close.png
ya/themes/default/images/ui-icon-triangle-1-e.png
ya/themes/default/images/ui-icon-triangle-1-s.png
ya/themes/default/images/ui-icons.png
ya/themes/default/jquery.ui.accordion.css
ya/themes/default/jquery.ui.dialog.css
ya/themes/default/jquery.ui.override.css
ya/ui/jquery.ya.accordion0.js
ya/ui/jquery.ya.dialog0.js
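
The effect of the patch can be reproduced outside Trac. This is only a minimal sketch in Python 3 (the sessions in this ticket use Python 2): `decode_path` is a hypothetical helper standing in for the `_fs_to_unicode` lambda, and `utf-8` is assumed as the value of `git_fs_encoding`.

```python
# Raw path bytes of the offending file, as listed by `git show --name-status`.
RAW_PATH = b"ya/test/\312\326\267\347\307\331.txt"


def decode_path(raw, encoding="utf-8", errors="replace"):
    """Hypothetical stand-in for Storage._fs_to_unicode after the patch."""
    return raw.decode(encoding, errors)


# Strict decoding (the behaviour before the patch) stops at the first bad byte.
try:
    decode_path(RAW_PATH, errors="strict")
except UnicodeDecodeError as exc:
    print("strict:", exc.reason, "at position", exc.start)

# With 'replace', undecodable bytes become U+FFFD and decoding continues.
print("replace:", ascii(decode_path(RAW_PATH)))
```

The replacement is lossy: the original bytes cannot be recovered from the decoded path, so the result is suitable for display and caching but not for round-tripping back to Git.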

comment:3 by Ryan J Ollos, 9 years ago

Replacing invalid characters seems like a good solution. Thanks for investigating.

comment:4 by Ryan J Ollos, 9 years ago

Owner: set to Ryan J Ollos
Status: new → assigned

comment:5 by Ryan J Ollos, 9 years ago

Release Notes: modified (diff)

Change from comment:2 committed to 1.0-stable in [14523], merged to trunk in [14524].

It would be good to have a test case, but I struggled with that. I was trying to use _git_fast_import and the format used in _generate_data_many_merges, but I'm unsure of the specification of that format, or how I can export a Git commit in the format.

comment:6 by Ryan J Ollos, 9 years ago

There were some warnings when installing r14523 and syncing the repository:

$ pve/bin/trac-admin trac repository resync "(default)"
Resyncing repository history for (default)... 
<path>/pve/local/lib/python2.7/site-packages/trac/db/util.py:72: Warning: Invalid utf8 character string: 'F09F98'
  return self.cursor.execute(sql_escape_percent(sql), args)
<path>/pve/local/lib/python2.7/site-packages/trac/db/util.py:72: Warning: Incorrect string value: '\xF0\x9F\x98\xB3' for column 'message' at row 1
  return self.cursor.execute(sql_escape_percent(sql), args)
15002 rev

I haven't looked at whether those should be expected for the content of the repository.

comment:7 by Jun Omae, 9 years ago

Are you using MySQL? It might be fixed with utf8mb4. See also #10548 and #9766.

comment:8 by Ryan J Ollos, 9 years ago

It looks like the database is MySQL with the utf8_bin collation. We could try converting it from utf8 to utf8mb4.
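
The second warning in comment:6 quotes the bytes `\xF0\x9F\x98\xB3`. A small sketch (Python 3; `needs_utf8mb4` is a hypothetical helper, not part of Trac) suggests why MySQL rejects them: they encode a single code point outside the Basic Multilingual Plane, which takes four bytes in UTF-8 and so does not fit MySQL's three-byte `utf8` character set.

```python
def needs_utf8mb4(text):
    """Return True if any character in `text` takes four bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


# Bytes taken from the "Incorrect string value" warning in comment:6.
rejected = b"\xF0\x9F\x98\xB3".decode("utf-8")

print(hex(ord(rejected)), len(rejected.encode("utf-8")))
print(needs_utf8mb4(rejected))
print(needs_utf8mb4("plain commit message"))
```

A check like this could be used to find commit messages that would trigger the warning before resyncing, assuming the column stays on the three-byte character set.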

comment:9 by Jun Omae, 9 years ago (in reply to: comment:5)

Replying to Ryan J Ollos:

… how I can export a Git commit in the format.

We can export the test data with the following steps.

$ mkdir /tmp/t12322
$ cd /tmp/t12322
$ git init .
Initialized empty Git repository in /tmp/t12322/.git/
$ python -c 'with open("\312\326\267\347\307\331.txt", "w") as f: f.write("")'
$ LC_ALL=C ls -lb
total 0
-rw-r--r-- 1 jun66j5 jun66j5 0 Feb 16 15:49 \312\326\267\347\307\331.txt
$ git add *.txt
$ git commit -a -m '(#12322)'
[master (root-commit) 2b6c462] (#12322)
 0 files changed
 create mode 100644 "\312\326\267\347\307\331.txt"
$ git fast-export --all
blob
mark :1
data 0

reset refs/heads/master
commit refs/heads/master
mark :2
author Jun Omae <jun66j5@gmail.com> 1455605392 +0900
committer Jun Omae <jun66j5@gmail.com> 1455605392 +0900
data 9
(#12322)
M 100644 :1 "\312\326\267\347\307\331.txt"

reset refs/heads/master
from :2
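
In the listings above, Git prints the file name in double quotes with backslash-octal escapes for the bytes it cannot show as text. As a rough sketch in Python 3, such a quoted path can be turned back into raw bytes as follows. `unquote_git_path` is a hypothetical helper written for this illustration, not a Trac or Git API, and it covers only the common escapes.

```python
import re

# Single-character escapes and the byte values they stand for.
_SIMPLE = {"a": 7, "b": 8, "f": 12, "n": 10, "r": 13, "t": 9, "v": 11,
           '"': 34, "\\": 92}
# A backslash followed by a three-digit octal byte value or one character.
_ESCAPE = re.compile(r"\\([0-3][0-7]{2}|.)")


def unquote_git_path(quoted):
    """Convert a path as printed by Git into the raw bytes of the name."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        return quoted.encode("utf-8")  # unquoted paths are printed as-is
    body = quoted[1:-1]
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(body):
        out += body[pos:match.start()].encode("utf-8")
        token = match.group(1)
        if len(token) == 3:
            out.append(int(token, 8))    # octal escape, e.g. \312 -> 0xCA
        else:
            out.append(_SIMPLE[token])   # simple escape, e.g. \" or \\
        pos = match.end()
    out += body[pos:].encode("utf-8")
    return bytes(out)


print(unquote_git_path(r'"\312\326\267\347\307\331.txt"'))
```

An unknown escape raises `KeyError` in this sketch; a more careful version would report the offending input instead.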

comment:10 by Ryan J Ollos, 9 years ago

Thanks, I will try adding a unit test.

comment:11 by Ryan J Ollos, 9 years ago

Leaving this ticket open to add a test in milestone:1.0.11.

comment:12 by Ryan J Ollos, 9 years ago

I think there must be something different about my shell environment, because I get:

$ python -c 'with open("\312\326\267\347\307\331.txt", "w") as f: f.write("")'
$ LC_ALL=C ls -b
%CA\326\267%E7%C7%D9.txt

comment:13 by Jun Omae, 9 years ago

I guess that you're using Python 3. Please try using the b prefix, which works on both Python 3 and Python 2:

$ python3 -c 'with open(b"\312\326\267\347\307\331.txt", "w") as f: f.write("")'

comment:14 by Ryan J Ollos, 9 years ago

I'm using Python 2.7.11 on OSX:

$ echo $LANG
en_US.UTF-8
$ python --version
Python 2.7.11
$ python -c 'with open(b"\312\326\267\347\307\331.txt", "w") as f: f.write("")'
$ LC_ALL=C ls -b
%CA\326\267%E7%C7%D9.txt

comment:15 by Jun Omae, 9 years ago

OK, I get the same on my Mac. It seems that the filename cannot be created on HFS. Instead, please try on a case-sensitive filesystem (e.g. ext4, xfs, …).
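
Whether a given filesystem keeps such a name unchanged can be checked before generating test data. The following is a sketch in Python 3 under the assumption that comparing the directory listing with the requested bytes is a sufficient test. `filesystem_keeps_raw_name` is a hypothetical helper, not part of Trac.

```python
import os
import tempfile

# The file name from this ticket, as raw bytes that are not valid UTF-8.
RAW_NAME = b"\312\326\267\347\307\331.txt"


def filesystem_keeps_raw_name(directory):
    """Create RAW_NAME in `directory` and report whether it is stored as-is."""
    raw_dir = os.fsencode(directory)
    try:
        with open(os.path.join(raw_dir, RAW_NAME), "wb"):
            pass
    except OSError:
        return False  # the filesystem refused the name outright
    # Some filesystems accept the call but store a rewritten name,
    # as the percent-encoded listing in comment:12 shows.
    return RAW_NAME in os.listdir(raw_dir)


with tempfile.TemporaryDirectory() as tmp:
    print(filesystem_keeps_raw_name(tmp))
```

A test that builds its repository from fast-export data, as done later in this ticket, avoids the problem because the name never has to exist in a working tree.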

comment:16 by Ryan J Ollos, 9 years ago

Thanks for checking. Works on Windows 7:

>C:\Python27-x64\python.exe --version
Python 2.7.11

>C:\Python27-x64\python.exe -c "with open('\312\326\267\347\307\331.txt', 'w') as f: f.write('')"

C:\Users\Ryan Ollos\temp>ls -lb
total 0
-rw-r--r-- 1 Ryan Ollos Administrators 0 May 18 23:14 \312\326\267\347\307\331.txt

I created the fast-export data on Debian. Tests pass on OS X with:

$ git --version
git version 2.8.2

Proposed changes in log:rjollos.git:t12322.

comment:17 by Jun Omae, 9 years ago

I get 1 failure on Python 2.5:

======================================================================
FAIL: test_sync_file_with_invalid_byte_sequence (tracopt.versioncontrol.git.tests.git_fs.GitCachedRepositoryTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/run/shm/209e76f970dd6806b5e83dd56a7cb71299754d14/py25-sqlite/tracopt/versioncontrol/git/tests/git_fs.py", line 571, in test_sync_file_with_invalid_byte_sequence
    self.assertEqual(u'�ַ���.txt', changes[0][0])
AssertionError: u'\ufffd\u05b7\ufffd\ufffd\ufffd.txt' != u'\ufffd\ufffd\ufffd.txt'

Hmm, the results differ between Python 2.5 and 2.6.

$ python2.4 -c 'print(repr("\312\326\267\347\307\331.txt".decode("utf8", "replace")))'
u'\ufffd\ufffd\ufffd.txt'
$ python2.5 -c 'print(repr("\312\326\267\347\307\331.txt".decode("utf8", "replace")))'
u'\ufffd\ufffd\ufffd.txt'
$ python2.6 -c 'print(repr("\312\326\267\347\307\331.txt".decode("utf8", "replace")))'
u'\ufffd\u05b7\ufffd\ufffd\ufffd.txt'
$ python2.7 -c 'print(repr("\312\326\267\347\307\331.txt".decode("utf8", "replace")))'
u'\ufffd\u05b7\ufffd\ufffd\ufffd.txt'
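
The `\u05b7` in the Python 2.6 and 2.7 results can be explained by looking at the role each byte plays in UTF-8. This is a sketch in Python 3, whose decoder gives the same result as 2.6 and 2.7 for this input. `classify` is a hypothetical helper that only inspects the high bits of a single byte.

```python
RAW_NAME = b"\312\326\267\347\307\331.txt"


def classify(byte):
    """Name the role a single byte value can play in UTF-8."""
    if byte < 0x80:
        return "ascii"
    if byte < 0xC0:
        return "continuation"
    if byte < 0xC2:
        return "invalid"
    if byte < 0xE0:
        return "lead-2"   # starts a two-byte sequence
    if byte < 0xF0:
        return "lead-3"   # starts a three-byte sequence
    if byte < 0xF5:
        return "lead-4"   # starts a four-byte sequence
    return "invalid"


for value in RAW_NAME[:6]:
    print(hex(value), classify(value))

# 0xD6 0xB7 is the only lead byte followed by a continuation byte, so it
# decodes to a real character; every other lead byte is replaced on its own.
print(ascii(RAW_NAME[1:3].decode("utf-8")))
print(ascii(RAW_NAME.decode("utf-8", "replace")))
```

Five of the six non-ASCII bytes are lead bytes, which is why strict decoding reports an "invalid continuation byte" at the very first of them.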

comment:18 by Ryan J Ollos, 9 years ago

That's a strange finding. I think it would be okay to just push the test case to trunk since development on 1.0-stable is winding down.

comment:19 by Jun Omae, 9 years ago

I agree. However, if we want to solve it on 1.0-stable, we could use assertIn(...):

  • tracopt/versioncontrol/git/tests/git_fs.py

    diff --git a/tracopt/versioncontrol/git/tests/git_fs.py b/tracopt/versioncontrol/git/tests/git_fs.py
    index b24a9fe86..e1bdaaf5c 100644
    --- a/tracopt/versioncontrol/git/tests/git_fs.py
    +++ b/tracopt/versioncontrol/git/tests/git_fs.py
    @@ -567,7 +567,8 @@ from :2
     
             changes = list(repos.repos.get_changeset(revs[0]).get_changes())
             self.assertEqual(1, len(changes))
    -        self.assertEqual(u'�ַ���.txt', changes[0][0])
    +        self.assertIn(changes[0][0], (u'\ufffd\u05b7\ufffd\ufffd\ufffd.txt',
    +                                      u'\ufffd\ufffd\ufffd.txt'))
     
         def test_sync_merge(self):
             self._git_init()

comment:20 by Ryan J Ollos, 9 years ago

Resolution: fixed
Status: assigned → closed

That sounds good. Committed to 1.0-stable in [14785], merged to trunk in [14786].
