Edgewall Software

Changes between Version 4 and Version 5 of TracDev/UnicodeGuidelines


Timestamp:
May 11, 2006, 8:35:12 AM
Author:
Christian Boos
Comment:

cosmetic change: code blocks flushed left

  • TracDev/UnicodeGuidelines

    v4 v5  
    2424   Unicode code points:
    2525   {{{
    26    >>> ["U-%04x" % ord(x) for x in u"ndré Le"]
    27    ['U-006e', 'U-0064', 'U-0072', 'U-00e9', 'U-0020', 'U-004c', 'U-0065']
     26>>> ["U-%04x" % ord(x) for x in u"ndré Le"]
     27['U-006e', 'U-0064', 'U-0072', 'U-00e9', 'U-0020', 'U-004c', 'U-0065']
    2828   }}}
    2929 * From there, you can easily transform that to a `str` object. [[br]]
     
    3232     and in which ''any'' code point can be represented: [[br]]
    3333     {{{
    34      >>> u"ndré Le".encode('utf-8')
    35      'ndr\xc3\xa9 Le'
     34>>> u"ndré Le".encode('utf-8')
     35'ndr\xc3\xa9 Le'
    3636     }}}
    3737   * ''iso-8859-15'': it's a fixed-length encoding, which is commonly used
     
    3939     are interested in can be mapped to a sequence of bytes in this encoding.
    4040     {{{
    41      >>> u"ndré Le".encode('iso-8859-15')
    42      'ndr\xe9 Le'
     41>>> u"ndré Le".encode('iso-8859-15')
     42'ndr\xe9 Le'
    4343     }}}
    4444   * ''ascii'': it is a very "poor" encoding, as there are only 128 unicode
     
    4747     as it contains one code point outside of this range (U-00e9).
    4848     {{{
    49      >>> u"ndré Le".encode('ascii')
    50      Traceback (most recent call last):
    51        File "<stdin>", line 1, in ?
    52      UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
     49>>> u"ndré Le".encode('ascii')
     50Traceback (most recent call last):
     51  File "<stdin>", line 1, in ?
     52UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
    5353     }}}
    5454     It should be noted that this is also the error one would get by doing a
     
    5656     is usually `'ascii'`:
    5757     {{{
    58      >>> str(u"ndré Le")
    59      Traceback (...): # same as above
    60      >>> sys.getdefaultencoding()
    61      'ascii'
     58>>> str(u"ndré Le")
     59Traceback (...): # same as above
     60>>> sys.getdefaultencoding()
     61'ascii'
    6262     }}}
    6363     Lastly, there are ways to ''force'' a conversion to succeed, even
     
    6565     in the targeted charset. One possible way is to use replacement characters:
    6666     {{{
    67      >>> u"ndré Le".encode('ascii', 'replace')
    68      'ndr? Le'
     67>>> u"ndré Le".encode('ascii', 'replace')
     68'ndr? Le'
    6969     }}}
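The sessions above use Python 2, where `unicode` and `str` are distinct types. As a minimal sketch, assuming a Python 3 interpreter (where `str` is the Unicode type and `bytes` is the byte string type), the same encoding behaviour can be reproduced as follows. The variable names are illustrative only and do not come from the guidelines.

```python
# Python 3 sketch of the encoding examples above (assumption: Python 3,
# where `str` holds code points and `bytes` holds encoded data).
text = "ndr\u00e9 Le"  # same sample string, with é written as an escape

utf8_bytes = text.encode("utf-8")          # variable length: é takes two bytes
latin9_bytes = text.encode("iso-8859-15")  # fixed length: é takes one byte

try:
    text.encode("ascii")                   # é (U+00E9) is outside the ASCII range
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc                      # same failure as in the session above

# Force the conversion: unencodable characters are replaced by "?".
forced = text.encode("ascii", "replace")

print(utf8_bytes, latin9_bytes, forced, ascii_error)
```

Catching `UnicodeEncodeError` explicitly, rather than relying on a default encoding, keeps the failure visible at the point where the conversion happens.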
    7070 * Now, you might wonder how to get a `unicode` object in the first place,
     
    7878     object out of it is trivial:
    7979     {{{
    80      >>> unicode('ndr\xc3\xa9 Le', 'utf-8')
    81      u'ndr\xe9 Le'
    82      >>> unicode('ndr\xe9 Le', 'iso-8859-15')
    83      u'ndr\xe9 Le'
     80>>> unicode('ndr\xc3\xa9 Le', 'utf-8')
     81u'ndr\xe9 Le'
     82>>> unicode('ndr\xe9 Le', 'iso-8859-15')
     83u'ndr\xe9 Le'
    8484     }}}
    8585     The above can be rewritten using the `str.decode()` method:
    8686     {{{
    87      >>> 'ndr\xc3\xa9 Le'.decode('utf-8')
    88      u'ndr\xe9 Le'
    89      >>> 'ndr\xe9 Le'.decode('iso-8859-15')
    90      u'ndr\xe9 Le'
     87>>> 'ndr\xc3\xa9 Le'.decode('utf-8')
     88u'ndr\xe9 Le'
     89>>> 'ndr\xe9 Le'.decode('iso-8859-15')
     90u'ndr\xe9 Le'
    9191     }}}
    9292   * But what happens if we make a bad guess?
    9393     {{{
    94      >>> unicode('ndr\xc3\xa9 Le', 'iso-8859-15')
    95      u'ndr\xc3\xa9 Le'
     94>>> unicode('ndr\xc3\xa9 Le', 'iso-8859-15')
     95u'ndr\xc3\xa9 Le'
    9696     }}}
    9797     No errors here, but the unicode string now contains garbage [[br]]
     
    100100     input assuming such an encoding will ''always'' succeed).
    101101     {{{
    102      >>> unicode('ndr\xe9 Le', 'utf-8')
    103      Traceback (most recent call last):
    104        File "<stdin>", line 1, in ?
    105      UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
     102>>> unicode('ndr\xe9 Le', 'utf-8')
     103Traceback (most recent call last):
     104  File "<stdin>", line 1, in ?
     105UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
    106106     }}}
    107107     Here, we clearly see that not every sequence of bytes can be interpreted as UTF-8.
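The two failure modes described above (a wrong guess that silently yields garbage, and a wrong guess that raises an error) can be sketched in Python 3 as well. This is an illustration under the assumption of a Python 3 interpreter, where `bytes.decode()` plays the role of `str.decode()` in the sessions above; the variable names are hypothetical.

```python
# Python 3 sketch of the decoding examples above (assumption: Python 3).
utf8_bytes = b"ndr\xc3\xa9 Le"    # the sample string encoded as UTF-8
latin9_bytes = b"ndr\xe9 Le"      # the sample string encoded as ISO-8859-15

# Correct guesses: both byte strings decode to the same text.
from_utf8 = utf8_bytes.decode("utf-8")
from_latin9 = latin9_bytes.decode("iso-8859-15")

# Bad guess, no error: every byte maps to some ISO-8859-15 character,
# so the two UTF-8 bytes of é become two unrelated characters.
garbage = utf8_bytes.decode("iso-8859-15")

# Bad guess, with an error: 0xe9 does not start a valid UTF-8 sequence here.
try:
    latin9_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

print(repr(from_utf8), repr(garbage), decode_error)
```

The silent case is the more dangerous one, since nothing signals that `garbage` no longer matches the original text.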
    108108   * What happens if we don't provide an encoding at all?
    109109     {{{
    110      >>> unicode('ndr\xe9 Le')
    111      Traceback (most recent call last):
    112        File "<stdin>", line 1, in ?
    113      UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
    114      >>> 'ndr\xe9 Le'.decode()
    115      Traceback (...) # same as above
     110>>> unicode('ndr\xe9 Le')
     111Traceback (most recent call last):
     112  File "<stdin>", line 1, in ?
     113UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
     114>>> 'ndr\xe9 Le'.decode()
     115Traceback (...) # same as above
    116116     }}}
    117117     This mirrors the encoding situation: the `sys.getdefaultencoding()` value is used
     
    121121     * One possibility would be to use replacement characters:
    122122       {{{
    123        >>> unicode('ndr\xe9 Le', 'utf-8', 'replace')
    124        u'ndr\ufffde'
     123>>> unicode('ndr\xe9 Le', 'utf-8', 'replace')
     124u'ndr\ufffde'
    125125       }}}
    126126     * The other one would be to choose an encoding guaranteed to succeed