Changes between Version 1 and Version 2 of TracDev/UnicodeGuidelines

Timestamp:: Apr 18, 2006, 7:40:21 PM (18 years ago)
Author:: Christopher Lenz
Comment:: A couple of typos fixed

TracDev/UnicodeGuidelines

-              v1
+              v2
 Since Trac [milestone:0.10], Trac uses `unicode` strings internally.
 This document aims at clarifying what are the implications of this.
+This document aims at clarifying what the implications of this change are.
 == Unicode Mini Tutorial ==
 In python, they are two kind of string classes, both subclasses of `basestring`:
  * `unicode` is a string datatype in which each character is an Unicode code point. [[br]]
+In Python, there are two kinds of string types, both subclasses of `basestring`:
+ * `unicode` is a string type in which each character is a Unicode code point. [[br]]
    All common string operations (len, slicing, etc.) will operate on those code points.
    i.e. "real" character boundaries, in any language.
  * `str` is a string datatype in which each character is a byte. [[br]]
+ * `str` is a string type in which each character is a byte. [[br]]
    The string operations will operate on those bytes, and byte boundaries
    don't correspond to character boundaries in many common encodings.
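The difference between the two types can be sketched with a short snippet. This is only an illustration, not part of the original guidelines: the sample word and the variable names are made up for the example. It uses `u"..."` literals so that it runs on Python 2 (where the types are `unicode` and `str`) as well as on Python 3.3+ (where the same roles are played by `str` and `bytes`).

```python
# Illustrative sketch; the sample text and variable names are assumptions.
text = u"caf\u00e9"          # 4 code points: c, a, f, e-acute
raw = text.encode("utf-8")   # byte string: e-acute takes 2 bytes in UTF-8

n_chars = len(text)   # counts code points
n_bytes = len(raw)    # counts bytes

# Slicing the text always keeps whole characters.
last_char = text[3:]

# Slicing the bytes at index 4 cuts the 2-byte sequence for e-acute in half,
# so the truncated byte string is no longer valid UTF-8.
try:
    raw[:4].decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```

Here `len` reports 4 for the text but 5 for its UTF-8 bytes, and the byte-level slice cannot be decoded back into text.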
+Therefore, `unicode` can be seen as the ''safe side'' of textual data:
+once you're in `unicode`, you know that your text data can contain any
+kind of multilingual characters, and that you can safely manipulate it
+the expected way.
+On the other hand, a `str` object can be used to contain anything,
+binary data, or some text using any conceivable encoding.
+But if it supposed to contain some text, it is crucial to know
+which encoding was used. That knowledge must be known or inferred
+from somewhere, which is not always a trivial thing to do.
+In summary, it is not manipulating `unicode` object which is
+problematic (it is not), but how to go from the "wild" side
+to the "safe" side...
+Going from `unicode` to `str` is usually less problematic,
+because you can always control what kind of encoding you
+want to use for serializing your Unicode data.
+`unicode` provides a real representation of textual data: once you're in `unicode`, you know that your text data can contain any kind of multilingual characters, and that you can safely manipulate it the expected way.
+On the other hand, a `str` object can contain anything: binary data, or text in any conceivable encoding. But if it's supposed to contain text, it is crucial to know which encoding was used. That encoding must be known or inferred from somewhere, which is not always a trivial thing to do.
+In summary, manipulating `unicode` objects is not the problem; the difficulty is how to go from the "wild" side (`str`) to the "safe" side (`unicode`). Going from `unicode` to `str` is usually less problematic, because you can always control which encoding you use to serialize your Unicode data.
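As a rough sketch of the two directions, the snippet below decodes a byte string with a known encoding, shows what can go wrong when the encoding is guessed incorrectly, and then encodes the text again with an explicitly chosen encoding. The sample bytes and variable names are assumptions made for illustration; the code runs on Python 2.6+ and Python 3.3+.

```python
# Illustrative sketch; the sample bytes and variable names are assumptions.
raw = b"caf\xe9"   # bytes from the "wild" side; nothing in them names the encoding

# Knowing that the data is ISO-8859-1, decoding gives the intended text.
text = raw.decode("iso-8859-1")

# A wrong guess may fail loudly: a lone 0xE9 byte is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Going back is under our control: we pick the encoding explicitly.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("iso-8859-1")

# A wrong guess may also succeed silently and produce the wrong characters:
# UTF-8 bytes read as ISO-8859-1 turn e-acute into two unrelated characters.
garbled = as_utf8.decode("iso-8859-1")
```

The last line is the more dangerous case, because no exception is raised and the damaged text can travel further into the application.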
 What does all of the above look like in practice? Let's take an example (from ![1]):