Changes between Version 4 and Version 5 of TracDev/UnicodeGuidelines
Timestamp: May 11, 2006, 8:35:12 AM
TracDev/UnicodeGuidelines
v4 → v5 (excerpt; lines marked "..." were unchanged between versions and omitted by the diff):

Unicode code points:
{{{
>>> ["U-%04x" % ord(x) for x in u"ndré Le"]
['U-006e', 'U-0064', 'U-0072', 'U-00e9', 'U-0020', 'U-004c', 'U-0065']
}}}
 * From there, you can easily transform that to a `str` object. [[br]]
...
   and in which ''any'' code point can be represented: [[br]]
   {{{
>>> u"ndré Le".encode('utf-8')
'ndr\xc3\xa9 Le'
   }}}
 * ''iso-8859-15'': it's a fixed-length encoding, which is commonly used
...
   are interested in can be mapped to a sequence of bytes in this encoding.
   {{{
>>> u"ndré Le".encode('iso-8859-15')
'ndr\xe9 Le'
   }}}
 * ''ascii'': it is a very "poor" encoding, as there are only 128 unicode
...
   as it contains one code point outside of this range (U-00e9).
   {{{
>>> u"ndré Le".encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
   }}}
   It should be noted that this is also the error one would get by doing a
...
   is usually `'ascii'`:
   {{{
>>> str(u"ndré Le")
Traceback (...): # same as above
>>> sys.getdefaultencoding()
'ascii'
   }}}
   Lastly, there are ways to ''force'' a conversion to succeed, even
...
   in the targeted charset. One possible way is to use replacement characters:
   {{{
>>> u"ndré Le".encode('ascii', 'replace')
'ndr? Le'
   }}}
 * Now, you might wonder how to get a `unicode` object in the first place,
...
   object out of it is trivial:
   {{{
>>> unicode('ndr\xc3\xa9 Le', 'utf-8')
u'ndr\xe9 Le'
>>> unicode('ndr\xe9 Le', 'iso-8859-15')
u'ndr\xe9 Le'
   }}}
   The above can be rewritten using the `str.decode()` method:
   {{{
>>> 'ndr\xc3\xa9 Le'.decode('utf-8')
u'ndr\xe9 Le'
>>> 'ndr\xe9 Le'.decode('iso-8859-15')
u'ndr\xe9 Le'
   }}}
 * But what happens if we make a bad guess?
   {{{
>>> unicode('ndr\xc3\xa9 Le', 'iso-8859-15')
u'ndr\xc3\xa9 Le'
   }}}
   No errors here, but the unicode string now contains garbage [[br]]
...
   input assuming such an encoding will ''always'' succeed).
   {{{
>>> unicode('ndr\xe9 Le', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
   }}}
   Here, we clearly see that not all sequences of bytes can be interpreted as UTF-8...
 * What happens if we don't provide an encoding at all?
   {{{
>>> unicode('ndr\xe9 Le')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
>>> 'ndr\xe9 Le'.decode()
Traceback (...) # same as above
   }}}
   This is very symmetrical to the encoding situation: the `sys.getdefaultencoding()` is used
...
 * One possibility would be to use replacement characters:
   {{{
>>> unicode('ndr\xe9 Le', 'utf-8', 'replace')
u'ndr\ufffde'
   }}}
 * The other one would be to choose an encoding guaranteed to succeed
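The interpreter transcripts above use Python 2, where text is `unicode` and `str` holds bytes. As a sketch of how the same ''encoding'' examples look in Python 3 (where `str` is the unicode type and `bytes` plays the role of Python 2's `str`):

```python
# Python 3 sketch of the encoding examples above.
s = "ndr\xe9 Le"  # the same text as u"ndré Le" in the Python 2 examples

# Listing the Unicode code points:
print(["U-%04x" % ord(c) for c in s])
# ['U-006e', 'U-0064', 'U-0072', 'U-00e9', 'U-0020', 'U-004c', 'U-0065']

# UTF-8 is variable-length and can represent any code point:
print(s.encode('utf-8'))          # b'ndr\xc3\xa9 Le'

# iso-8859-15 works here because every code point of s fits in that charset:
print(s.encode('iso-8859-15'))    # b'ndr\xe9 Le'

# ascii fails: U+00E9 is outside the 0-127 range...
try:
    s.encode('ascii')
except UnicodeEncodeError as e:
    print(e)

# ...unless the 'replace' error handler is used to force it to succeed:
print(s.encode('ascii', 'replace'))  # b'ndr? Le'
```

Note that Python 3 removed the implicit `str()`-driven conversion: there is no default-encoding trap to fall into, since mixing `str` and `bytes` is always an explicit `encode()`/`decode()` call.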
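The ''decoding'' examples translate similarly; in Python 3, `bytes.decode()` covers what both `unicode(...)` and `str.decode()` did in Python 2 (a sketch; note that modern UTF-8 decoders may replace a slightly different span of invalid bytes than the old `u'ndr\ufffde'` output shown above):

```python
# Python 3 sketch of the decoding examples above.
utf8_bytes = b'ndr\xc3\xa9 Le'    # "ndré Le" encoded as UTF-8
latin_bytes = b'ndr\xe9 Le'       # "ndré Le" encoded as iso-8859-15

# Decoding with the right charset recovers the same text either way:
assert utf8_bytes.decode('utf-8') == latin_bytes.decode('iso-8859-15')

# A bad guess with a fixed-length charset "succeeds" silently,
# but the result is garbage (mojibake):
print(utf8_bytes.decode('iso-8859-15'))   # 'ndrÃ© Le'

# A bad guess with UTF-8 fails loudly, since not all byte
# sequences are valid UTF-8:
try:
    latin_bytes.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)

# The 'replace' error handler substitutes U+FFFD for undecodable bytes:
print(latin_bytes.decode('utf-8', 'replace'))
```

This mirrors the symmetry the page describes: a wrong-but-plausible charset corrupts silently, while a strict charset like UTF-8 turns the wrong guess into an immediate, debuggable error.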