Edgewall Software

Changes between Version 1 and Version 2 of TracDev/UnicodeGuidelines


Timestamp:
Apr 18, 2006, 7:40:21 PM (18 years ago)
Author:
Christopher Lenz
Comment:

A couple of typos fixed

  • TracDev/UnicodeGuidelines

    v1 v2  
    22
    33Since Trac [milestone:0.10], Trac uses `unicode` strings internally.
    4 This document aims at clarifying what are the implications of this.
     4This document aims at clarifying what the implications of this change are.
    55
    66== Unicode Mini Tutorial ==
    77
    8 In python, they are two kind of string classes, both subclasses of `basestring`:
    9  * `unicode` is a string datatype in which each character is an Unicode code point. [[br]]
     8In Python, there are two kinds of string types, both subclasses of `basestring`:
     9 * `unicode` is a string type in which each character is a Unicode code point. [[br]]
    1010   All common string operations (len, slicing, etc.) operate on those code points,
    1111   i.e. on "real" character boundaries, in any language.
    12  * `str` is a string datatype in which each character is a byte. [[br]]
     12 * `str` is a string type in which each character is a byte. [[br]]
    1313   The string operations will operate on those bytes, and byte boundaries
    1414   don't correspond to character boundaries in many common encodings.
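
The difference between counting code points and counting bytes can be sketched as follows. This is a minimal illustration, not part of the original guidelines, and it is written for Python 3 so that it runs on current interpreters: there, `str` plays the role of the `unicode` type described above and `bytes` plays the role of the old `str`. The sample word and the choice of UTF-8 are assumptions made for the example.

```python
# Python 3 sketch: `str` stands in for Python 2 `unicode`,
# `bytes` stands in for Python 2 `str`.
text = "caf\u00e9"             # four code points: c, a, f, e-acute
data = text.encode("utf-8")    # UTF-8 bytes: the e-acute takes two bytes

text_len = len(text)           # counts code points
data_len = len(data)           # counts bytes

# Slicing text always cuts on character boundaries.
last_char = text[3:]

# Slicing bytes can cut a multi-byte character in half.
truncated = data[:4]           # ends with a lone lead byte
try:
    truncated.decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False
```

Here `text_len` and `data_len` differ because the byte string contains one character that is encoded as two bytes, and the truncated byte slice is no longer valid UTF-8.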
    1515
    16 Therefore, `unicode` can be seen as the ''safe side'' of textual data:
    17 once you're in `unicode`, you know that your text data can contain any
    18 kind of multilingual characters, and that you can safely manipulate it
    19 the expected way.
    20 
    21 On the other hand, a `str` object can be used to contain anything,
    22 binary data, or some text using any conceivable encoding.
    23 But if it supposed to contain some text, it is crucial to know
    24 which encoding was used. That knowledge must be known or inferred
    25 from somewhere, which is not always a trivial thing to do.
    26 
    27 In summary, it is not manipulating `unicode` object which is
    28 problematic (it is not), but how to go from the "wild" side
    29 to the "safe" side...
    30 Going from `unicode` to `str` is usually less problematic,
    31 because you can always control what kind of encoding you
    32 want to use for serializing your Unicode data.
     16`unicode` provides a real representation of textual data: once you're in `unicode`, you know that your text data can contain any kind of multilingual characters, and that you can safely manipulate it in the expected way.
     17
     18On the other hand, a `str` object can contain anything: binary data, or text in any conceivable encoding. But if it's supposed to contain text, it is crucial to know which encoding was used. The encoding must be known or inferred from somewhere, which is not always a trivial thing to do.
     19
     20In summary, the difficulty lies not in manipulating `unicode` objects but in getting from the "wild" side (`str`) to the "safe" side (`unicode`). Going from `unicode` to `str` is usually less problematic, because you can always control which encoding is used to serialize your Unicode data.
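
The asymmetry between the two directions can be sketched like this. Again this is a hedged illustration in Python 3 (`bytes` for the "wild" side, `str` for the "safe" side). The sample bytes, and the assumption that they were produced as Latin-1, are invented for the example.

```python
# Bytes arriving from the "wild" side. Assumption for this example:
# they were written as Latin-1, but nothing in the bytes says so.
raw = b"caf\xe9"

# Decoding with the right encoding gives the intended text.
good = raw.decode("latin-1")

# Decoding with the wrong guess fails (or, with other encodings,
# silently yields the wrong characters).
try:
    raw.decode("utf-8")
    utf8_guess_ok = True
except UnicodeDecodeError:
    utf8_guess_ok = False

# Going the other way is under our control: we choose the encoding.
as_utf8 = good.encode("utf-8")
as_latin1 = good.encode("latin-1")
```

Decoding depends on knowledge the bytes themselves do not carry, while encoding is a choice made by the code that serializes the text.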
    3321
    3422What does all of the above look like in practice? Let's take an example (from ![1]):