
Changes between Version 4 and Version 5 of TracDev/UnicodeGuidelines

Timestamp:: May 11, 2006, 8:35:12 AM
Author:: Christian Boos
Comment:: cosmetic change: code blocks flushed left


TracDev/UnicodeGuidelines

-              v4
+              v5
    Unicode code points:
    {{{
    >>> ["U-%04x" % ord(x) for x in u"ndré Le"]
    ['U-006e', 'U-0064', 'U-0072', 'U-00e9', 'U-0020', 'U-004c', 'U-0065']
+>>> ["U-%04x" % ord(x) for x in u"ndré Le"]
+['U-006e', 'U-0064', 'U-0072', 'U-00e9', 'U-0020', 'U-004c', 'U-0065']
    }}}
  * From there, you can easily transform that to a `str` object. [[br]]
 …
      and in which ''any'' code point can be represented: [[br]]
      {{{
      >>> u"ndré Le".encode('utf-8')
      'ndr\xc3\xa9 Le'
+>>> u"ndré Le".encode('utf-8')
+'ndr\xc3\xa9 Le'
      }}}
    * ''iso-8859-15'': it's a fixed-length encoding, which is commonly used
 …
      are interested in can be mapped to a sequence of bytes in this encoding.
      {{{
      >>> u"ndré Le".encode('iso-8859-15')
      'ndr\xe9 Le'
+>>> u"ndré Le".encode('iso-8859-15')
+'ndr\xe9 Le'
      }}}
    * ''ascii'': it is a very "poor" encoding, as there are only 128 Unicode
 …
      as it contains one code point outside of this range (U-00e9).
      {{{
      >>> u"ndré Le".encode('ascii')
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
+>>> u"ndré Le".encode('ascii')
+Traceback (most recent call last):
+  File "<stdin>", line 1, in ?
+UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
      }}}
      It should be noted that this is also the error one would get by doing a
 …
      is usually `'ascii'`:
      {{{
      >>> str(u"ndré Le")
      Traceback (...): # same as above
      >>> sys.getdefaultencoding()
      'ascii'
+>>> str(u"ndré Le")
+Traceback (...): # same as above
+>>> sys.getdefaultencoding()
+'ascii'
      }}}
      Lastly, there are ways to ''force'' a conversion to succeed, even
 …
      in the targeted charset. One possible way is to use replacement characters:
      {{{
      >>> u"ndré Le".encode('ascii', 'replace')
      'ndr? Le'
+>>> u"ndré Le".encode('ascii', 'replace')
+'ndr? Le'
      }}}
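The sessions above use Python 2, where `unicode` is the text type and `str` holds bytes. As a rough sketch of the same encoding steps under Python 3 (an assumption on our side, not something this page covers), `str` is the Unicode type and `.encode()` returns a `bytes` object. The variable names are illustrative only.

```python
# Python 3 sketch of the encoding examples above (assumption: Python 3,
# where str is Unicode text and bytes is the byte string type).
text = "ndr\u00e9 Le"

# Code points, formatted the same way as in the first session.
code_points = ["U-%04x" % ord(ch) for ch in text]

# utf-8 is variable length: the e-acute (U-00e9) takes two bytes.
utf8_bytes = text.encode("utf-8")

# iso-8859-15 is fixed length: one byte per character.
latin9_bytes = text.encode("iso-8859-15")

# ascii cannot represent U-00e9, so a strict encode raises an error.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Forcing the conversion with replacement characters.
replaced = text.encode("ascii", "replace")

print(code_points)
print(utf8_bytes, latin9_bytes, replaced)
print(ascii_error)
```

The error object records the offending position in its `start` attribute, which matches the "position 3" reported in the traceback above.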
  * Now, you might wonder how to get a `unicode` object in the first place,
 …
      object out of it is trivial:
      {{{
      >>> unicode('ndr\xc3\xa9 Le', 'utf-8')
      u'ndr\xe9 Le'
      >>> unicode('ndr\xe9 Le', 'iso-8859-15')
      u'ndr\xe9 Le'
+>>> unicode('ndr\xc3\xa9 Le', 'utf-8')
+u'ndr\xe9 Le'
+>>> unicode('ndr\xe9 Le', 'iso-8859-15')
+u'ndr\xe9 Le'
      }}}
      The above can be rewritten using the `str.decode()` method:
      {{{
      >>> 'ndr\xc3\xa9 Le'.decode('utf-8')
      u'ndr\xe9 Le'
      >>> 'ndr\xe9 Le'.decode('iso-8859-15')
      u'ndr\xe9 Le'
+>>> 'ndr\xc3\xa9 Le'.decode('utf-8')
+u'ndr\xe9 Le'
+>>> 'ndr\xe9 Le'.decode('iso-8859-15')
+u'ndr\xe9 Le'
      }}}
    * But what happens if we make a bad guess?
      {{{
      >>> unicode('ndr\xc3\xa9 Le', 'iso-8859-15')
      u'ndr\xc3\xa9 Le'
+>>> unicode('ndr\xc3\xa9 Le', 'iso-8859-15')
+u'ndr\xc3\xa9 Le'
      }}}
      No errors here, but the unicode string now contains garbage [[br]]
 …
      input assuming such an encoding will ''always'' succeed).
      {{{
      >>> unicode('ndr\xe9 Le', 'utf-8')
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
+>>> unicode('ndr\xe9 Le', 'utf-8')
+Traceback (most recent call last):
+  File "<stdin>", line 1, in ?
+UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
      }}}
      Here, we clearly see that not every sequence of bytes can be interpreted as UTF-8.
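The two kinds of bad guess shown above can be reproduced side by side. This is a minimal sketch assuming Python 3, where the byte strings are `bytes` literals and `bytes.decode()` plays the role of `str.decode()` or `unicode()` in the Python 2 sessions. The variable names are illustrative only.

```python
# Python 3 sketch of decoding with a right and a wrong charset guess
# (assumption: Python 3, bytes.decode() instead of unicode()/str.decode()).
utf8_bytes = b"ndr\xc3\xa9 Le"
latin9_bytes = b"ndr\xe9 Le"

# Correct guesses: both byte strings decode to the same text.
good_utf8 = utf8_bytes.decode("utf-8")
good_latin9 = latin9_bytes.decode("iso-8859-15")

# Bad guess 1: UTF-8 bytes read as iso-8859-15.
# No error is raised, but the result contains two garbage characters
# (U-00c3 and U-00a9) in place of the single U-00e9.
garbage = utf8_bytes.decode("iso-8859-15")

# Bad guess 2: iso-8859-15 bytes read as UTF-8.
# The lone 0xe9 byte is not valid UTF-8 here, so decoding fails.
try:
    latin9_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(good_utf8 == good_latin9)
print(repr(garbage))
print(utf8_failed)
```

The silent failure is the more dangerous of the two, since nothing signals that the text is wrong until someone reads it.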
    * What happens if we don't provide an encoding at all?
      {{{
      >>> unicode('ndr\xe9 Le')
      Traceback (most recent call last):
        File "<stdin>", line 1, in ?
      UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
      >>> 'ndr\xe9 Le'.decode()
      Traceback (...) # same as above
+>>> unicode('ndr\xe9 Le')
+Traceback (most recent call last):
+  File "<stdin>", line 1, in ?
+UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
+>>> 'ndr\xe9 Le'.decode()
+Traceback (...) # same as above
      }}}
      This mirrors the encoding situation: `sys.getdefaultencoding()` is used
 …
      * One possibility would be to use replacement characters:
        {{{
        >>> unicode('ndr\xe9 Le', 'utf-8', 'replace')
        u'ndr\ufffde'
+>>> unicode('ndr\xe9 Le', 'utf-8', 'replace')
+u'ndr\ufffde'
        }}}
      * The other one would be to choose an encoding guaranteed to succeed