
Changes between Initial Version and Version 1 of UnicodeDecodeError


Timestamp:
Mar 2, 2007, 1:51:40 PM (17 years ago)
Author:
Christian Boos
Comment:

Explain the UnicodeDecodeError

  • UnicodeDecodeError

     1= What does this `UnicodeDecodeError` mean? =
     2
     3This error means that an attempt to create a `unicode` object (i.e. Python's internal representation of a sequence of internationalized characters, as defined by the Unicode standard) from a `str` object has failed. The attempt consists of "decoding" the sequence of bytes composing the latter according to some conventional encoding.
     4
     5In practice, this happens because the default conversion uses the default encoding, which is usually ASCII and therefore assigns no meaning to byte values higher than 127.
     6
     7If an encoding is explicitly specified (e.g. "UTF-8"), the same exception is raised when the sequence of bytes does not actually conform to that encoding (e.g. the bytes were really encoded as "iso-8859-1", a.k.a. "latin1").
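The mismatch described above can be reproduced with a short sketch. The sample bytes are an assumption made for illustration (the word "chaîne" encoded as latin1), and the sketch uses the `decode` method of byte strings, which behaves the same way on Python 2 (where `str` is the byte type) and on Python 3.

```python
# Assumed sample: "chaîne" encoded as iso-8859-1 (latin1),
# so the "î" is the single byte 0xEE at position 3.
latin1_bytes = b'cha\xeene'

# Decoding with the wrong explicit encoding raises UnicodeDecodeError,
# because 0xEE followed by "n" is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding with the encoding that was really used succeeds.
text = latin1_bytes.decode('iso-8859-1')
```

Here `utf8_error.start` points at the offending byte, which is often enough to guess which encoding the data was really in.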
     8
     9This error was quite frequent during the transition to using `unicode` internally, which took place during Trac [milestone:0.10], until we adopted a robust conversion helper function (`to_unicode`, from the `trac.util.text` module).
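To give an idea of what a "robust conversion" can look like, here is a minimal sketch. It is not Trac's actual `to_unicode` implementation: the name `decode_with_fallback`, its parameters and the choice of fallback encodings are all assumptions made for illustration.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'iso-8859-1')):
    """Hypothetical helper: decode a byte string by trying several encodings.

    Each encoding is tried in order; the first one that decodes the bytes
    without error wins.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: never raise, replace undecodable bytes with U+FFFD.
    return raw.decode('utf-8', 'replace')
```

Trying UTF-8 first is a deliberate choice in this sketch: text in another 8-bit encoding rarely happens to be valid UTF-8, whereas iso-8859-1 accepts any byte sequence and therefore only makes sense as a fallback.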
     10
     11Plugins may still try to produce `unicode` objects in a naive way, which can easily trigger the error:
     12{{{
     13>>> unicode('chaîne de caractères')
     14Traceback (most recent call last):
     15  File "<stdin>", line 1, in ?
     16UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 3: ordinal not in range(128)
     17}}}
     18
     19Not specifying the encoding means using `sys.getdefaultencoding()`, which is usually 'ascii'.
     20So in effect, the above is equivalent to:
     21{{{
     22>>> unicode('chaîne de caractères', 'ascii')
     23Traceback (most recent call last):
     24  File "<stdin>", line 1, in ?
     25UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 3: ordinal not in range(128)
     26}}}
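To see where the details in that message come from, the exception object can be inspected directly. This is a sketch under assumptions: the sample bytes are latin1-encoded, and `'ascii'` is passed explicitly so that the result does not depend on the interpreter's default encoding (which is 'utf-8' on Python 3).

```python
# Assumed sample: the latin1 bytes of "chaîne"; "î" is byte 0xEE at position 3.
raw = b'cha\xeene'

try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The attributes of the exception mirror the parts of the error message.
codec = error.encoding                     # name of the codec that failed
position = error.start                     # index of the first undecodable byte
bad_byte = bytearray(error.object)[position]  # value of that byte
reason = error.reason                      # human-readable explanation
```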
     27
     28A more subtle and confusing way to trigger this error is when trying to ''encode'' a sequence of bytes to a given encoding. Wait... encoding a sequence of bytes? Does that make sense? Well, no, normally it shouldn't, but Python "offers" you the `encode` method on `str` objects, as a shortcut to: `"string".encode() == unicode("string").encode()`. That could be called a `u"cadeau empoisonné"` ;-)
     29
     30Of course, if `"string"` can't be first decoded the naive way in order to produce that temporary `unicode` object, it will trigger the same error we saw above:
     31{{{
     32>>> 'chaîne de caractères'.encode('utf-8')
     33Traceback (most recent call last):
     34  File "<stdin>", line 1, in ?
     35UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 3: ordinal not in range(128)
     36}}}
     37
     38In practice, this happens when an API designed to handle a `unicode` object suddenly receives a `str` object. It's "normal" to call `s.encode(...)` if `s` is a `unicode` object, but this will fail with the above confusing error if `s` is actually a `str` object containing bytes not in the 0..127 range (see #4875 for an example).
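One way to guard such an API is to check the type before encoding. The helper below is a hypothetical sketch (the name `encode_text` and its signature are assumptions, not part of Trac) and is written to run on both Python 2 and Python 3.

```python
# The Unicode string type: `unicode` on Python 2, `str` on Python 3.
text_type = type(u'')


def encode_text(s, encoding='utf-8'):
    """Hypothetical helper: return `s` as a byte string.

    Unicode strings are encoded; byte strings are returned unchanged,
    which avoids the implicit ASCII decoding that `str.encode` performs
    on Python 2.
    """
    if isinstance(s, text_type):
        return s.encode(encoding)
    if isinstance(s, bytes):
        return s
    raise TypeError('expected a text or byte string, got %r' % type(s))
```

Returning byte strings unchanged assumes that the caller already encoded them as expected; a stricter variant could decode them first in order to validate their content.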
     39
     40----
     41See also: TracDev/UnicodeGuidelines, UnicodeEncodeError