Edgewall Software
Modify

Opened 18 years ago

Closed 14 years ago

#3663 closed defect (fixed)

Chinese folder or file name error

Reported by: niijyeni@… Owned by: Christian Boos
Priority: normal Milestone: 0.12
Component: web frontend/mod_python Version: devel
Severity: normal Keywords: apache cgi unicode review
Cc: Branch:
Release Notes:
API Changes:
Internal Changes:

Description (last modified by Christian Boos)

i clicked folder of file(encoding:utf-8) trac report this message:


Traceback (most recent call last):
  File "C:\Python23\Lib\site-packages\trac\web\main.py", line 335, in dispatch_request
    dispatcher.dispatch(req)
  File "C:\Python23\Lib\site-packages\trac\web\main.py", line 167, in dispatch
    if not req.path_info or req.path_info == '/':
  File "C:\Python23\lib\site-packages\trac\web\api.py", line 193, in <lambda>
    path_info = property(fget=lambda self: self.environ.get('PATH_INFO', '').decode('utf-8'),
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa6 in position 17: unexpected code byte

Attachments (5)

a1.PNG (14.2 KB ) - added by niijyeni@… 18 years ago.
error picture
a2.PNG (14.2 KB ) - added by niijyeni@… 18 years ago.
error picture2
a3.PNG (7.6 KB ) - added by niijyeni@… 18 years ago.
error picture3
t3663-to_unicode-PATH_INFO.patch (2.3 KB ) - added by Christian Boos 14 years ago.
fix + test
t3663-more-robust-PATH_INFO-decoding-r9325.diff (10.9 KB ) - added by Christian Boos 14 years ago.
web: don't assume PATH_INFO is always UTF-8 encoded.

Download all attachments as: .zip

Change History (38)

comment:1 by Christian Boos, 18 years ago

Keywords: unicode needinfo added
Owner: changed from Jonas Borgström to Christian Boos

Can you please dump the byte sequence corresponding to the UTF-8 encoding of that path, so that I can try to reproduce the issue?

by niijyeni@…, 18 years ago

Attachment: a1.PNG added

error picture

by niijyeni@…, 18 years ago

Attachment: a2.PNG added

error picture2

comment:2 by niijyeni@…, 18 years ago

there two error picture.

comment:3 by niijyeni@…, 18 years ago

and a file save as ANSI Encoding,when browse whill raise a encoding error.

ÎÒÊǺÃÈË£¬²»ÊÇ»µÈË¡£

see picture a3

by niijyeni@…, 18 years ago

Attachment: a3.PNG added

error picture3

comment:4 by Matthew Good, 18 years ago

I've tested the same filename found in the screenshot with FastCGI and tracd and it works for me. Be sure you're using the latest Trac source.

What web server are you using? Also, which frontend are you using with Trac: FastCGI, CGI, mod_python, or something else? Normally I would expect to see another Trac module above dispatch_request in the stack where it was called from, but that's not there in this traceback.

in reply to:  4 comment:5 by anonymous, 18 years ago

Replying to mgood:

I've tested the same filename found in the screenshot with FastCGI and tracd and it works for me. Be sure you're using the latest Trac source.

What web server are you using? Also, which frontend are you using with Trac: FastCGI, CGI, mod_python, or something else? Normally I would expect to see another Trac module above dispatch_request in the stack where it was called from, but that's not there in this traceback.


Apache 2.0.59(Win32) SVN 1.3.2 DAV 2 CGI IE 6

comment:6 by niijyeni@…, 18 years ago

??? I use

Apache 2.0.59(Win32)

SVN 1.3.2 DAV 2

CGI

IE 6

Windows XP

comment:7 by Christian Boos, 18 years ago

And what revision of Trac 0.10, exactly?

in reply to:  7 comment:8 by niijyeni@…, 18 years ago

Replying to cboos:

And what revision of Trac 0.10, exactly?

Trac 0.10b1

comment:9 by Christian Boos, 18 years ago

Description: modified (diff)
Keywords: apache cgi win32 added; chinese removed
Milestone: 0.100.10.1
Priority: highnormal

Ok, so according to attachment:a2.PNG, the request was for path "vendor/试试中文.txt" (url = u'vendor/\u8bd5\u8bd5\u4e2d\u6587'), but then, the PATH_INFO CGI env variable contained something that had a byte 0xa8 in position 16… I couldn't find an encoding that would actually encode the above url to this, so I'm left wondering what to do here…

Perhaps you could add a print `self.environ.get('PATH_INFO', '')` statement in trac.web.api.py before line 193, in order to see the full sequence of bytes. From that, perhaps we could guess what Apache is doing.

Also, this issue is possibly related to: http://twiki.org/cgi-bin/view/Codev.ApacheTwoBreaksNonUTF8EncodedURLsOnWindows

In any case, this is not going to block 0.10…

in reply to:  9 comment:10 by anonymous, 18 years ago

Replying to cboos:

url = u'vendor/\u8bd5\u8bd5\u4e2d\u6587'), but then, the PATH_INFO CGI env variable contained something that had a byte 0xa8 in position 16…

Sorry, I meant url = u'/browser/vendor/\u8bd5\u8bd5\u4e2d\u6587', so that position 16 effectively corresponds to the first non-ascii character.

comment:11 by Christian Boos, 18 years ago

#2981 was marked as duplicate.

comment:12 by sid, 17 years ago

Keywords: windows added; win32 removed

comment:13 by alexhalf@…, 17 years ago

Hello all. I had the same problem. I have windows server 2003 (Russian locale), apache-2.0.52, trac-0.10.3 (is running as cgi), python-2.4.1.

I couldn't submit and view(get) file with filenames were consist of russian characters (Attachements). I couldn't get the same files from svn repository through trac (Browse Source).

I decided the problem the following way. I changed the trac/web/api.py file. Need change line 219: path_info = property(fget=lambda self: self.environ.get('PATH_INFO', ).decode('utf-8'), To: path_info = property(fget=lambda self: self.environ.get('PATH_INFO', ).decode('mbcs'),

It's all. Best regards, Alyabushev Alexander alexhalf@…

comment:14 by Christian Boos, 17 years ago

#4003 closed as duplicate. The reporter of that ticket said that he tried to "update from encode('utf-8') to encode('mbcs') in api.py file" without success (btw, that was a few months before comment:13, so I wonder where that advice came from, initially).

comment:15 by Christian Boos, 17 years ago

#5915 was closed as duplicate.

in reply to:  13 comment:16 by Christian Boos, 16 years ago

#6372 was closed as duplicate.

This is the suggested fix: Replying to alexhalf@gmail.com:

… I decided the problem the following way. I changed the trac/web/api.py file. Need change line 219:

    path_info = property(fget=lambda self: self.environ.get('PATH_INFO', '').decode('utf-8'),

To:

    path_info = property(fget=lambda self: self.environ.get('PATH_INFO', '').decode('mbcs'),

Please try it out. If that works, then also try to create/view wiki pages containing unicode characters, to check if they are still working with that change.

comment:17 by Christian Boos, 16 years ago

Component: version control/browserweb frontend/mod_python
Keywords: windows removed
Milestone: 0.10.60.11.1

#7368 was closed as duplicate. That one is the first to be reported for a non-Windows platform.

Also that one is about mod_wsgi and not mod_python, but I think the root cause must be the same.

comment:18 by Christian Boos, 16 years ago

Keywords: verify added; needinfo removed
Milestone: 0.11.20.11.3

Still worth investigating/fixing one day, keeping on the -stable TODO list.

comment:19 by Christian Boos, 14 years ago

Keywords: review added; verify removed
Milestone: next-minor-0.12.x0.11.7
Status: newassigned

As discussed in #9077, I've seen a few similar errors in the trac.log of the demo-0.11 Trac instace:

2010-02-04 08:21:54,829 Trac[main] ERROR: Internal Server Error:
Traceback (most recent call last):
  File "/usr/local/virtualenv-0.11/lib/python2.5/site-packages/Trac-0.11.7stable_r8997-py2.5.egg/trac/web/main.py", line 450, in _dispatch_request
    dispatcher.dispatch(req)
  File "/usr/local/virtualenv-0.11/lib/python2.5/site-packages/Trac-0.11.7stable_r8997-py2.5.egg/trac/web/main.py", line 167, in dispatch
    if handler.match_request(req):
  File "/usr/local/virtualenv-0.11/lib/python2.5/site-packages/Trac-0.11.7stable_r8997-py2.5.egg/trac/wiki/intertrac.py", line 39, in match_request
    match = re.match(r'^/intertrac/(.*)', req.path_info)
  File "/usr/local/virtualenv-0.11/lib/python2.5/site-packages/Trac-0.11.7stable_r8997-py2.5.egg/trac/web/api.py", line 208, in <lambda>
    path_info = property(fget=lambda self: self.environ.get('PATH_INFO', '').decode('utf-8'),
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1: unexpected code byte

...

2010-02-04 08:45:14,858 Trac[main] ERROR: Internal Server Error:
Traceback (most recent call last):
  File "/usr/local/virtualenv-0.11/lib/python2.5/site-packages/Trac-0.11.7stable_r8997-py2.5.egg/trac/web/main.py", line 450, in _dispatch_request
    dispatcher.dispatch(req)
  File "/usr/local/virtualenv-0.11/lib/python2.5/site-packages/Trac-0.11.7stable_r8997-py2.5.egg/trac/web/main.py", line 167, in dispatch
    if handler.match_request(req):
  File "/usr/local/virtualenv-0.11/lib/python2.5/site-packages/Trac-0.11.7stable_r8997-py2.5.egg/trac/wiki/intertrac.py", line 39, in match_request
    match = re.match(r'^/intertrac/(.*)', req.path_info)
  File "/usr/local/virtualenv-0.11/lib/python2.5/site-packages/Trac-0.11.7stable_r8997-py2.5.egg/trac/web/api.py", line 208, in <lambda>
    path_info = property(fget=lambda self: self.environ.get('PATH_INFO', '').decode('utf-8'),
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 1: unexpected code byte

We should fallback to .decode('latin1') if .decode('utf-8') fails, i.e. actually use to_unicode() here.

Reproducing the error is actually easy:

>chcp
Active code page: 437
>wget http://localhost:8000/devel-0.11/wiki/été
--2010-03-03 15:54:56--  http://localhost:8000/devel-0.11/wiki/%E9t%E9
Resolving localhost... 127.0.0.1
Connecting to localhost|127.0.0.1|:8000... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
2010-03-03 15:54:56 ERROR 500: Internal Server Error.

The backtrace corresponding to the above:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data
(7668) Trac[main] ERROR: Internal Server Error:
Traceback (most recent call last):
  File "C:\Workspace\src\trac\repos\0.11-stable\trac\web\main.py", line 450, in _dispatch_request
    dispatcher.dispatch(req)
  File "C:\Workspace\src\trac\repos\0.11-stable\trac\web\main.py", line 167, in dispatch
    if handler.match_request(req):
  File "C:\Workspace\src\trac\repos\0.11-stable\trac\wiki\intertrac.py", line 39, in match_request
    match = re.match(r'^/intertrac/(.*)', req.path_info)
  File "C:\Workspace\src\trac\repos\0.11-stable\trac\web\api.py", line 208, in <lambda>
    path_info = property(fget=lambda self: self.environ.get('PATH_INFO', '').decode('utf-8'),
  File "C:\Dev\Python261\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: ('utf8', '/wiki/\xe9t\xe9', 6, 9, 'invalid data')

Patch:

  • trac/web/api.py

     
    2828from trac.core import Interface, TracError
    2929from trac.util import get_last_traceback, md5, unquote
    3030from trac.util.datefmt import http_date, localtz
     31from trac.util.text import to_unicode
    3132from trac.web.href import Href
    3233from trac.web.wsgi import _FileWrapper
    3334
     
    205206
    206207    method = property(fget=lambda self: self.environ['REQUEST_METHOD'],
    207208                      doc='The HTTP method of the request')
    208     path_info = property(fget=lambda self: self.environ.get('PATH_INFO', '').decode('utf-8'),
     209    path_info = property(fget=(lambda self:
     210                               to_unicode(self.environ.get('PATH_INFO', ''))),
    209211                         doc='Path inside the application')
    210212    query_string = property(fget=lambda self: self.environ.get('QUERY_STRING',
    211213                                                               ''),

With the above patch, the same wget doesn't trigger an internal error anymore:

>wget http://localhost:8000/devel-0.11/wiki/été
--2010-03-03 15:53:37--  http://localhost:8000/devel-0.11/wiki/%E9t%E9
Resolving localhost... 127.0.0.1
Connecting to localhost|127.0.0.1|:8000... connected.
HTTP request sent, awaiting response... 404 Not Found
2010-03-03 15:53:39 ERROR 404: Not Found.

comment:20 by Christian Boos, 14 years ago

All tests pass.

  • trac/tests/functional/testcases.py

     
     1# -*- encoding: utf-8 -*-
    12#!/usr/bin/python
    23import os
    34from subprocess import call
     
    154155        tc.notfind('Second Attachment')
    155156
    156157
     158class RegressionTestTicket3663(FunctionalTwillTestCaseSetup):
     159    def runTest(self):
     160        """Regression test for non-UTF-8 PATH_INFO (#3663)"""
     161        self._tester.logout()                # as we start logged in as 'admin'
     162        self._tester.go_to_wiki('\xe9t\xe9') # PATH_INFO 'wiki/été' in latin1
     163        tc.code(404)                         # page not found (no WIKI_CREATE)
     164        self._tester.login('admin')          # the other tests assume 'admin'
     165
     166
    157167def functionalSuite():
    158168    suite = FunctionalTestSuite()
    159169    return suite
     
    168178    suite.addTest(RegressionTestTicket3833c())
    169179    suite.addTest(RegressionTestTicket5572())
    170180    suite.addTest(RegressionTestTicket7209())
     181    suite.addTest(RegressionTestTicket3663())
    171182
    172183    import trac.versioncontrol.tests
    173184    trac.versioncontrol.tests.functionalSuite(suite)

by Christian Boos, 14 years ago

fix + test

in reply to:  19 ; comment:21 by Remy Blank, 14 years ago

Replying to cboos:

With the above patch, the same wget doesn't trigger an internal error anymore:

The test fails in the XHTML validation on my box, because lxml cannot parse the resulting page (it probably contains the \xe9 characters as-is, which cannot be converted from utf-8). If a disable XHTML validation, the test passes.

(BTW, the test case should probably have the .login('admin') in a finally: clause, otherwise a failure of the test case will make the following test cases fail due to "collateral damage".)

I'm unsure what the goal of the patch is:

  • Don't throw a 500 error. Then we could just change the utf-8 conversion to "replace" instead of the default "strict". Your test case passes with this change.
    • trac/web/api.py

       
      205205
      206206    method = property(fget=lambda self: self.environ['REQUEST_METHOD'],
      207207                      doc='The HTTP method of the request')
      208     path_info = property(fget=lambda self: self.environ.get('PATH_INFO', '').decode('utf-8'),
       208    path_info = property(fget=lambda self: self.environ.get('PATH_INFO', '').decode('utf-8', 'replace'),
      209209                         doc='Path inside the application')
      210210    query_string = property(fget=lambda self: self.environ.get('QUERY_STRING',
      211211                                                               ''),
  • Make the system work with iso-8859-1 encoded URLs as well as utf-8 encoded URLs. Then we should be able to create a wiki page named "été" using an utf-8 URL, then retrieve it correctly using an iso-8859-1 URL. I tried this and it doesn't work (it should have been obvious, but I tried anyway).

It seems that RFC 3986 mandates utf-8 encoding for new URI schemes, so the current behavior is arguably correct (except if URLs cannot be considered a "new URI scheme", but in that case the expected behavior is undefined).

I don't think using a "fallback" encoding will make the system work in all cases, as I'm sure we can find iso-8859-1 encoded strings that are also valid utf-8 strings. So the URL will be ambiguous, and even the best guess won't make it work.

So I wonder if the best solution would be to keep a single encoding, but to make it configurable (e.g. through an environment variable). Everything seems to work for most people by using utf-8, and the few people who have reported that changing the encoding to mbcs fixed the issue can do so.

in reply to:  21 ; comment:22 by Christian Boos, 14 years ago

Replying to rblank:

Replying to cboos:

With the above patch, the same wget doesn't trigger an internal error anymore:

The test fails in the XHTML validation on my box, because lxml cannot parse the resulting page (it probably contains the \xe9 characters as-is, which cannot be converted from utf-8). If a disable XHTML validation, the test passes.

Strange, for me the validation passes, and no, the resulting page should not contain \xe9, see below.

(BTW, the test case should probably have the .login('admin') in a finally: clause, otherwise a failure of the test case will make the following test cases fail due to "collateral damage".)

OK.

I'm unsure what the goal of the patch is:

  • Don't throw a 500 error. …
  • Make the system work with iso-8859-1 encoded URLs as well as utf-8 encoded URLs. Then we should be able to create a wiki page named "été" using an utf-8 URL, then retrieve it correctly using an iso-8859-1 URL. I tried this and it doesn't work (it should have been obvious, but I tried anyway).

Both, and for the second point, it's not obvious it doesn't work, given that for me it does!

After creating the "été" page via the browser (the normal way):

C:\Workspace\src\trac\repos\0.11-stable>wget http://localhost:8000/devel-0.11/wiki/été
--2010-03-04 07:21:29--  http://localhost:8000/devel-0.11/wiki/%E9t%E9
Resolving localhost... 127.0.0.1
Connecting to localhost|127.0.0.1|:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4974 (4.9K) [text/html]
Saving to: `été'

100%[==============================================================================>] 4,974       --.-K/s   in 0s

2010-03-04 07:21:29 (13.5 MB/s) - `été' saved [4974/4974]

And on the server side:

127.0.0.1 - - [04/Mar/2010 07:21:29] "GET /devel-0.11/wiki/%E9t%E9 HTTP/1.0" 200 -

It seems that RFC 3986 mandates utf-8 encoding for new URI schemes, so the current behavior is arguably correct (except if URLs cannot be considered a "new URI scheme", but in that case the expected behavior is undefined).

OK.

I don't think using a "fallback" encoding will make the system work in all cases, as I'm sure we can find iso-8859-1 encoded strings that are also valid utf-8 strings. So the URL will be ambiguous, and even the best guess won't make it work.

Well, there's certainly an overlap given that any sequence of bytes is valid latin1, so by first attempting a decoding from utf-8, and if this fails, a decoding to latin1, we're guaranteed to always end up with unicode, "valid" unicode if the actual encoding was UTF-8 or latin1, "garbage" unicode (i.e. an non-meaningful sequence of unicode code points from the U+0000 - U+00FF range) if the original encoding was anything else. In that latter case, we avoid at least 1.

This is the typical use case for to_unicode… or rather, what I thought to_unicode would do. Looking again at the code, I see that the we now use locale.getpreferredencoding() as the fallback, and this explains why it worked for me but not for you. This was introduced part of r3569 and in retrospect I think this is wrong, we should have kept 'iso-8859-15' or rather 'iso-8859-1', as using locale always introduces variance in the results (#6953, #8951).

So I wonder if the best solution would be to keep a single encoding, but to make it configurable (e.g. through an environment variable). Everything seems to work for most people by using utf-8, and the few people who have reported that changing the encoding to mbcs fixed the issue can do so.

Now this would even be a third goal: if some servers always deliver PATH_INFO according to a local encoding (I've seen some hints this could be the case for IIS), then giving the possibility to configure the encoding as the first encoding to try would be useful.

I'll come up with some updated patches, with the minimal change from locale.getpreferredencoding() to 'latin1' in to_unicode for 0.11 and some more changes for 0.12. Or we can simply leaves things as they are on 0.11-stable and move the whole thing to 0.12.

in reply to:  22 ; comment:23 by Remy Blank, 14 years ago

Replying to cboos:

Strange, for me the validation passes, and no, the resulting page should not contain \xe9, see below.

I'll test again with your updated patches. I may have made a mistake somewhere.

Both, and for the second point, it's not obvious it doesn't work, given that for me it does!

Well that's the whole point: it depends on the encoding used by the client, which is not communicated to the server. If I use a client that encodes URLs as utf-8 to create the page, and a client that encodes as iso-8859-1 to retrieve it, the latter will fail with a 404 code. That's the test that failed for me.

I'll come up with some updated patches, with the minimal change from locale.getpreferredencoding() to 'latin1' in to_unicode for 0.11 and some more changes for 0.12. Or we can simply leaves things as they are on 0.11-stable and move the whole thing to 0.12.

I'm still not convinced that trying two encodings is the right thing to do, and considering this is becoming non-trivial, I would push (gently :) for 0.12.

in reply to:  23 ; comment:24 by Christian Boos, 14 years ago

Milestone: 0.11.70.12

Replying to rblank:

it depends on the encoding used by the client, which is not communicated to the server. If I use a client that encodes URLs as utf-8 to create the page, and a client that encodes as iso-8859-1 to retrieve it, the latter will fail with a 404 code.

  • client 1, utf-8 → /wiki/%c3%a9t%c3%a9, PATH_INFO → '\xc3\xa9t\xc3\xa9',
    to_unicode(self.environ('PATH_INFO')) → u'été'
  • client 2, latin1 → /wiki/%e9t%e9, PATH_INFO is '\xe9t\xe9',
    to_unicode(self.environ('PATH_INFO')) → u'été' as well… provided locale.getpreferredencoding() returns 'latin1' which is not guaranteed of course

Hence:

… with the minimal change from locale.getpreferredencoding() to 'latin1' in to_unicode

I'm still not convinced that trying two encodings is the right thing to do, and considering this is becoming non-trivial, I would push (gently :) for 0.12.

What do you mean, in this specific case or in general, the way to_unicode works? To me, trying an utf-8 decoding first then a latin1 decoding if the former fails is a better approach than an utf-8 decoding in replace mode, as it's a lossless conversion from str to unicode. In replace mode, you may well end up with u'??????' strings and therefore have no clue about what the original input was. It's also fast, as the second encoding is a direct mapping.

As for 0.12, sure, as that ticket has already slipped through all the minor milestones since 0.10, one more is no big deal ;-)

See t3663-more-robust-PATH_INFO-decoding-r9325.diff, which is a bit more involved and definitely not for 0.11.7.

by Christian Boos, 14 years ago

web: don't assume PATH_INFO is always UTF-8 encoded.

in reply to:  24 ; comment:25 by Remy Blank, 14 years ago

Replying to cboos:

  • client 1, utf-8 → /wiki/%c3%a9t%c3%a9, PATH_INFO → '\xc3\xa9t\xc3\xa9',
    to_unicode(self.environ('PATH_INFO')) → u'été'
  • client 2, latin1 → /wiki/%e9t%e9, PATH_INFO is '\xe9t\xe9',
    to_unicode(self.environ('PATH_INFO')) → u'été' as well… provided locale.getpreferredencoding() returns 'latin1' which is not guaranteed of course
  • client 3, iso-8859-1, trying to access the page "ÃtÃ"
    → /wiki/%c3%a9t%c3%a9, PATH_INFO → '\xc3\xa9t\xc3\xa9',
    to_unicode(self.environ('PATH_INFO')) → u'été'

That is, not the page intended. The issue is probably worse with other encodings, which result in valid utf-8 strings.

What do you mean, in this specific case or in general, the way to_unicode works?

In this specific case (PATH_INFO).

To me, trying an utf-8 decoding first then a latin1 decoding if the former fails is a better approach than an utf-8 decoding in replace mode, as it's a lossless conversion from str to unicode.

We're trying to guess the encoding used by the client, but only in the case where the string is an invalid utf-8 string. If the string is valid utf-8, the result is wrong. So I don't see much gain compared to "replace" mode, which gets it wrong every time it must guess. We always will get it wrong in some cases as soon as the client encoding is not utf-8.

in reply to:  25 ; comment:26 by Christian Boos, 14 years ago

Replying to rblank:

  • client 3, iso-8859-1, trying to access the page "ÃtÃ"
    → /wiki/%c3%a9t%c3%a9, PATH_INFO → '\xc3\xa9t\xc3\xa9',
    to_unicode(self.environ('PATH_INFO')) → u'été'

I see, but this of course will "fail" in all cases, even when using 'replace' mode.

We're trying to guess the encoding used by the client, but only in the case where the string is an invalid utf-8 string. If the string is valid utf-8, the result is wrong. So I don't see much gain compared to "replace" mode, which gets it wrong every time it must guess.

Well, admittedly the gain is very marginal, being able to "see" the original byte sequence through the resulting unicode object, which might help for debugging in contrast with '?????', for which we will only be able to tell "the encoding was not UTF-8". The fact that (some) latin1 URLs would work is only a side-effect, and not a justification for the change, I agree. I can modify the test to reflect that, as currently it gives the impression that it's a feature ;-)

in reply to:  26 ; comment:27 by Remy Blank, 14 years ago

Replying to cboos:

Well, admittedly the gain is very marginal, being able to "see" the original byte sequence through the resulting unicode object, which might help for debugging (…)

Slightly different idea: how about doing the conversion from utf-8 in strict mode, catching the UnicodeDecodeError and raising a TracError with a good error message describing the cause of the error and possible solutions (if there are any)?

in reply to:  27 ; comment:28 by Christian Boos, 14 years ago

Replying to rblank:

Replying to cboos:

Well, admittedly the gain is very marginal, being able to "see" the original byte sequence through the resulting unicode object, which might help for debugging (…)

Slightly different idea: how about doing the conversion from utf-8 in strict mode, catching the UnicodeDecodeError and raising a TracError with a good error message describing the cause of the error and possible solutions (if there are any)?

Good suggestion, I'll try to do something along those lines. The hint could be something like:

Your web browser is trying to access non-ascii URLs using an encoding
which is not the one expected by the server (%(encoding)s). 

Followed by, depending on the value of the server encoding:

  • utf-8:
    Please use a more standard compliant browser. 
    If you're already using a standard browser, something wrong is happening 
    on the server side, please contact your site administrator, 
    who perhaps need to adapt the server side encoding.
    
  • not utf-8:
    That encoding doesn't seem to be working in all cases.
    Please contact your site administrator.
    

But maybe this starts to be too complex, so perhaps the simpler version would also do:

    try:
        encoding = self.pathinfo_encoding or 'utf-8'
        pathinfo = unicode(path_info, encoding)
    except UnicodeDecodeError, e:
        raise TracError(_("Invalid URL encoding (was %(pathinfo)r, "
                          "server expected %(encoding)s)", 
                          pathinfo=pathinfo, encoding=encoding))

in reply to:  28 comment:29 by Remy Blank, 14 years ago

Replying to cboos:

But maybe this starts to be too complex, so perhaps the simpler version would also do:

Yes, that looks good.

comment:30 by Christian Boos, 14 years ago

Perhaps related?

2010-03-15 06:57:23,481 Trac[main] ERROR: Internal Server Error:
Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 504, in _dispatch_request
    dispatcher.dispatch(req)
  File "build/bdist.linux-x86_64/egg/trac/web/main.py", line 209, in dispatch
    req.redirect(req.href + target, permanent=True)
  File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 336, in redirect
    self.send_header('Location', url)
  File "build/bdist.linux-x86_64/egg/trac/web/api.py", line 274, in send_header
    self._outheaders.append((name, unicode(value).encode('utf-8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 25: ordinal not in range(128)

for:

66.249.65.35 trac.edgewall.org - [15/Mar/2010:06:57:34 +0100] "GET /%E3%80%82%E5%AE%83%E9%9B%86%E6%88%90%E4%BA%86%E5%A2%9E%E5%BC%BA%E7%9A%84Wiki%E5%8A%9F%E8%83%BD%E5%92%8C%E7%89%88%E6%9C%AC%E6%8E%A7%E5%88%B6%E5%8A%9F%E8%83%BD%EF%BC%8C%E5%B9%B6%E5%8F%AF%E9%80%9A%E8%BF%87%E6%8F%92%E4%BB%B6%E6%89%A9%E5%B1%95%E5%85%B6%E5%8A%9F%E8%83%BD%E3%80%82%E7%94%B1%E4%BA%8E%E6%8F%92%E4%BB%B6%E4%BC%97%E5%A4%9A%E3%80%81%E5%8A%9F%E8%83%BD%E5%85%A8%E9%9D%A2%EF%BC%8C%E7%94%9A%E8%87%B3%E5%8F%AF%E4%BB%A5%E4%B8%8E%E5%BE%88%E5%A4%9A%E5%95%86%E4%B8%9A%E7%9A%84CMS%E7%B3%BB%E7%BB%9F%E5%AA%B2%E7%BE%8E%EF%BC%8C%E5%9B%A0%E6%AD%A4%E5%BA%94%E7%94%A8%E4%B9%9F%E6%97%A5%E7%9B%8A%E5%B9%BF%E6%B3%9B%E3%80%82%E5%AE%83%E7%9A%84ticket%E7%AE%A1%E7%90%86%E5%8F%8A%E5%B7%A5%E4%BD%9C%E6%B5%81%E6%8F%92%E4%BB%B6(http://trac-hacks.org/ HTTP/1.1" 500 9243 "-" "Mediapartners-Google"

comment:31 by Christian Boos, 14 years ago

Resolution: fixed
Status: assignedclosed

Trying to support non-UTF-8 encodings for the PATH_INFO lead to other errors (notably for redirects). I think that it's not worth the trouble to try to support this kind of server setup (which might even be mis-configured ones from the start), so I gave up on the path_info_encoding setting and simply report the error in r9377.

I tried to address the other problem with non-UTF-8 URLs reported in comment:30, but I couldn't reproduce the traceback involving send_header. I nevertheless fixed the other traceback I got at this occasion, in [9380-9381].

Some additional code cleanups for source:trunk/trac/web/api.py were done in r9382.

OTOH, I still wanted to do the simplifications to to_unicode contained in t3663-more-robust-PATH_INFO-decoding-r9325.diff so I committed them separately in r9383.

comment:32 by Remy Blank, 14 years ago

Resolution: fixed
Status: closedreopened

I now get a failing functional test, in lxml:

======================================================================
ERROR: Regression test for non-UTF-8 PATH_INFO (#3663)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/joe/src/trac/trunk/trac/tests/functional/testcases.py", line 171, in runTest
    self._tester.go_to_wiki(u'été'.encode('latin1'))
  File "/home/joe/src/trac/trunk/trac/tests/functional/tester.py", line 115, in go_to_wiki
    tc.go(wiki_url)
  File "/usr/lib/python2.6/site-packages/twill/commands.py", line 112, in go
    browser.go(url)
  File "/usr/lib/python2.6/site-packages/twill/browser.py", line 113, in go
    self._journey('open', u)
  File "/usr/lib/python2.6/site-packages/twill/browser.py", line 549, in _journey
    callable(func_name, *args, **kwargs)
  File "/home/joe/src/trac/trunk/trac/tests/functional/better_twill.py", line 104, in _validate_xhtml
    etree.parse(StringIO(page), base_url=b.get_url())
  File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
  File "parser.pxi", line 1518, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71971)
  File "apihelpers.pxi", line 1381, in lxml.etree._encodeFilenameUTF8 (src/lxml/lxml.etree.c:21091)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 27-29: invalid data

Some relevant data: Linux, de_CH.utf8 locale, lxml 2.2.6.

comment:33 by Christian Boos, 14 years ago

Resolution: fixed
Status: reopenedclosed

Good catch! Fixed in r9395.

Btw, while running the tests on Linux, I stumbled upon the two errors already reported in #9128, the ball is in your court ;-)

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain Christian Boos.
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from Christian Boos to the specified user.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.