Edgewall Software

Changes between Initial Version and Version 1 of TracDev/UnicodeGuidelines


Timestamp: Apr 18, 2006, 5:20:33 PM
Author: Christian Boos
Comment: Some guidelines about the new unicode inside way

= Trac and Unicode: Development Guidelines =

Since [milestone:0.10], Trac has used `unicode` strings internally.
This document aims to clarify the implications of this change.

== Unicode Mini Tutorial ==

In Python, there are two kinds of string classes, both subclasses of `basestring`:
 * `unicode` is a string datatype in which each character is a Unicode code point. [[br]]
   All common string operations (`len`, slicing, etc.) operate on those code points,
   i.e. on "real" character boundaries, in any language.
 * `str` is a string datatype in which each character is a byte. [[br]]
   String operations operate on those bytes, and byte boundaries
   don't correspond to character boundaries in many common encodings.

Therefore, `unicode` can be seen as the ''safe side'' of textual data:
once you're in `unicode`, you know that your text can contain any
kind of multilingual characters, and that you can safely manipulate it
in the expected way.

On the other hand, a `str` object can contain anything:
binary data, or text in any conceivable encoding.
If it is supposed to contain text, it is crucial to know
which encoding was used. That knowledge must be obtained or inferred
from somewhere, which is not always a trivial thing to do.

In summary, it is not manipulating `unicode` objects that is
problematic (it is not), but getting from the "wild" side
to the "safe" side...
Going from `unicode` to `str` is usually less problematic,
because you can always control which encoding you
want to use for serializing your Unicode data.

How does all of the above look in practice? Let's take an example (from ![1]):
 * `u"ndré Le"` is a `unicode` object containing the following sequence of
   Unicode code points:
   {{{
   >>> ["U-%04x" % ord(x) for x in u"ndré Le"]
   ['U-006e', 'U-0064', 'U-0072', 'U-00e9', 'U-0020', 'U-004c', 'U-0065']
   }}}
 * From there, you can easily transform that into a `str` object. [[br]]
   As we said above, we have the freedom to choose the encoding:
   * ''UTF-8'': a variable-length encoding which is widely understood,
     and in which ''any'' code point can be represented: [[br]]
     {{{
     >>> u"ndré Le".encode('utf-8')
     'ndr\xc3\xa9 Le'
     }}}
   * ''iso-8859-15'': a fixed-length encoding, commonly used
     in European countries. It ''happens'' that the `unicode` sequence we
     are interested in can be mapped to a sequence of bytes in this encoding.
     {{{
     >>> u"ndré Le".encode('iso-8859-15')
     'ndr\xe9 Le'
     }}}
   * ''ascii'': a very "poor" encoding, as only 128 Unicode
     code points (those in the U-0000 to U-007f range) can be mapped to
     ASCII. Therefore, trying to encode our sample sequence will fail,
     as it contains one code point outside of this range (U-00e9).
     {{{
     >>> u"ndré Le".encode('ascii')
     Traceback (most recent call last):
       File "<stdin>", line 1, in ?
     UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
     }}}
     It should be noted that this is also the error one would get by doing a
     coercion to `str` on that `unicode` object, because the system encoding
     is usually `'ascii'`:
     {{{
     >>> str(u"ndré Le")
     Traceback (...): # same as above
     >>> sys.getdefaultencoding()
     'ascii'
     }}}
     Lastly, there are ways to ''force'' a conversion to succeed, even
     if there's no way to encode some of the original Unicode characters
     in the targeted charset. One possible way is to use replacement characters:
     {{{
     >>> u"ndré Le".encode('ascii', 'replace')
     'ndr? Le'
     }}}
 * Now, you might wonder how to get a `unicode` object in the first place,
   starting from a string. [[br]]
   Well, from the above it should be obvious that it's absolutely necessary
   to ''know'' which encoding was used in the `str` object, as either
   `'ndr\xe9 Le'` or `'ndr\xc3\xa9 Le'` could be decoded into the same
   unicode string `u"ndré Le"` (as a matter of fact, it is as important
   as knowing whether that stream of bytes has been gzipped or ROT13-ed...) [[br]]
   * Assuming we know the encoding of the `str` object, getting a `unicode`
     object out of it is trivial:
     {{{
     >>> unicode('ndr\xc3\xa9 Le', 'utf-8')
     u'ndr\xe9 Le'
     >>> unicode('ndr\xe9 Le', 'iso-8859-15')
     u'ndr\xe9 Le'
     }}}
     The above can be rewritten using the `str.decode()` method:
     {{{
     >>> 'ndr\xc3\xa9 Le'.decode('utf-8')
     u'ndr\xe9 Le'
     >>> 'ndr\xe9 Le'.decode('iso-8859-15')
     u'ndr\xe9 Le'
     }}}
   * But what happens if we make a bad guess?
     {{{
     >>> unicode('ndr\xc3\xa9 Le', 'iso-8859-15')
     u'ndr\xc3\xa9 Le'
     }}}
     No errors here, but the unicode string now contains garbage [[br]]
     (NB: as we have seen above, 'iso-8859-15' is a fixed-length encoding
     with a mapping defined for the whole 0..255 range, so decoding ''any''
     input with such an encoding will ''always'' succeed).
     {{{
     >>> unicode('ndr\xe9 Le', 'utf-8')
     Traceback (most recent call last):
       File "<stdin>", line 1, in ?
     UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
     }}}
     Here, we clearly see that not every sequence of bytes can be interpreted as UTF-8...
   * What happens if we don't provide an encoding at all?
     {{{
     >>> unicode('ndr\xe9 Le')
     Traceback (most recent call last):
       File "<stdin>", line 1, in ?
     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
     >>> 'ndr\xe9 Le'.decode()
     Traceback (...) # same as above
     }}}
     This is symmetrical to the encoding situation: `sys.getdefaultencoding()`
     (usually 'ascii') is used when no encoding is explicitly given.
   * Now, as in the encoding situation, there are ways to ''force'' the decoding
     process to succeed, even if we are wrong about the charset used by our `str` object.
     * One possibility is to use replacement characters:
       {{{
       >>> unicode('ndr\xe9 Le', 'utf-8', 'replace')
       u'ndr\ufffde'
       }}}
     * The other is to choose an encoding guaranteed to succeed
       (such as ''iso-8859-1'' or ''iso-8859-15'', see above).

This was a very rough mini-tutorial on the question; I hope
it's enough to get you into the general mood needed to read the
rest of the guidelines...

Of course, there are many more in-depth tutorials on Unicode in general,
and Python/Unicode in particular, available on the Web:
 * ![1] http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf
 * ![2] http://www.amk.ca/python/howto/unicode
 * ![3] http://www.python.org/dev/peps/pep-0100

Now we can move on to the specifics of Trac programming...

== Trac utilities for Unicode ==

In order to handle unicode-related issues in a cohesive way,
there are a few utility functions that can be used, chief among them
our Swiss-army knife, the `to_unicode` function.

=== `to_unicode` ===

The `to_unicode` function was designed with flexibility and
robustness in mind: calling `to_unicode()` on anything should
never fail.

The use cases are as follows:
 1. Given any arbitrary object `x`, one can use `to_unicode(x)`
    as one would use `unicode(x)`, to convert it to a unicode string.
 2. Given a `str` object `s`, which ''might'' be text but for which
    we have no idea which encoding was used, one can use
    `to_unicode(s)` to convert it to a `unicode` object in a safe way. [[br]]
    Actually, a decoding using 'utf-8' will be attempted first,
    and if this fails, a decoding using `locale.getpreferredencoding()`
    will be done, in replacement mode.
 3. Given a `str` object `s`, for which we ''think'' we know which
    encoding `enc` was used, we can do `to_unicode(s, enc)` to try
    to decode it using the `enc` encoding, in replacement mode. [[br]]
    A practical advantage of using `to_unicode(s, enc)` over
    `unicode(s, enc, 'replace')` is that the first form falls back to
    ''use case 2'', should `enc` be `None`.

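The behavior described by those use cases can be sketched as follows. This is an illustrative sketch only, not Trac's actual implementation, and it is written in modern Python terms, where `str` plays the role of the old `unicode` type and `bytes` the role of the old `str` type (`to_unicode_sketch` is a made-up name):

```python
# Illustrative sketch only -- not Trac's actual implementation.
# Written in Python 3 terms: `str` plays the role of the old
# `unicode` type, `bytes` the role of the old `str` type.
import locale

def to_unicode_sketch(value, charset=None):
    """Convert anything to text, without ever raising on bad bytes."""
    if not isinstance(value, bytes):
        # Use case 1: anything that is not raw bytes goes through str()
        return str(value)
    if charset:
        # Use case 3: decode with the suspected charset, in replacement mode
        return value.decode(charset, 'replace')
    # Use case 2: try UTF-8 first, then fall back to the locale's
    # preferred encoding, in replacement mode
    try:
        return value.decode('utf-8')
    except UnicodeDecodeError:
        return value.decode(locale.getpreferredencoding(), 'replace')
```

For example, `to_unicode_sketch(b'ndr\xc3\xa9 Le')` yields `u'ndr\xe9 Le'`, while an arbitrary object such as `42` comes back as `'42'`; a byte string in an unknown encoding always yields ''some'' text, never an exception.
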
So, you may ask, if the above works in all situations, where should you
still use `unicode(x)` or `unicode(x, enc)`?

 * You could use `unicode(x)` when you know for sure that `x` is anything
   __but__ a `str` containing bytes in the 128..255 range. [[br]]
   It should be noted that `to_unicode(x)` simply does a `unicode(x)` call
   for anything which is not a `str` object, so there's virtually no
   performance penalty in using `to_unicode` instead (in particular,
   no exception handler is set in this case).
 * Use `unicode(buf, encoding)` when you know for sure what the
   encoding is. You will have a performance gain here over `to_unicode`,
   as no exception handler will be set. Of course, the downside is that
   you will get a `UnicodeDecodeError` exception if your assumption
   was wrong. Therefore, use this if you ''want'' to catch errors
   in this situation.

 ''FIXME: talk a bit about the other utilities''

 ''FIXME: Those utilities are currently placed in the `trac.util` package,
 though I'm thinking about moving them into the `trac.util.text` package:
  * some of the corresponding unit tests are already in `trac.util.tests.text`
  * `to_unicode` could then be used in the `Markup` class''

=== The Mimeview component ===

The Mimeview component is the place where we collect some intelligence
about MIME types and charset auto-detection.

Most of the time, when we manipulate ''file content'', we only have partial
information about the nature of the data actually contained in those files.

This is true whether the file is located in the filesystem, in a version
control repository, or is streamed by the web browser (file upload).

The Mimeview component tries to associate a MIME type with a file's content,
based on the filename or, if that's not enough, on the file's content itself.
During this process, the charset used by the file ''might'' be inferred as well.

The API is quite simple:
 * `Mimeview.get_mimetype(self, filename, content)` [[br]]
   guesses the MIME type from the `filename` or, if necessary, from the `content`
 * `Mimeview.get_charset(self, content, mimetype=None)` [[br]]
   guesses the charset from the `content` or from the `mimetype`
   (as the `mimetype` ''might'' convey charset information as well)
 * `Mimeview.to_unicode(self, content, mimetype=None, charset=None)` [[br]]
   uses the `to_unicode` utility and, if needed, guesses the charset first
 * ''`Mimeview.is_binary(self, filename, content, mimetype)`'' '''TBD''' [[br]]
   guesses whether the `content` is textual data or not

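Charset detection of the kind `get_charset` performs can be illustrated with a simplified, hypothetical sketch; the real Mimeview component is pluggable and more elaborate, and `guess_charset` is a made-up name:

```python
# Simplified, hypothetical illustration of charset detection;
# the real Mimeview component is pluggable and more elaborate.
def guess_charset(content, mimetype=None):
    """Guess the charset of `content` (a byte string)."""
    # 1. An explicit charset parameter in the MIME type wins
    if mimetype and 'charset=' in mimetype:
        return mimetype.split('charset=')[1].split(';')[0].strip()
    # 2. A UTF-8 byte order mark is a strong hint
    if content.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    # 3. Otherwise, see whether the content decodes cleanly as UTF-8
    try:
        content.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        # Fall back to some single-byte encoding that always decodes
        return 'iso-8859-15'
```

Note the ordering: explicit metadata is trusted before content sniffing, and the fallback is an encoding for which decoding can never fail (see the mini-tutorial above).
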
== Trac boundaries for Unicode Data ==

Most of the time, within Trac, we assume that we are manipulating `unicode` objects.

But there are places where we need to deal with raw `str` objects, and therefore
we must know what to do, both when encoding to and when decoding from `str` objects.

=== Database Layer ===

Each database connector should configure its database driver
so that `Cursor` objects accept and return
`unicode` objects.

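As an illustration of the end result, the stdlib `sqlite3` driver under modern Python already behaves this way: cursors accept and return text strings natively, with no explicit decoding needed by the caller.

```python
# Illustration with the stdlib sqlite3 driver under Python 3,
# where cursors accept and return text strings natively.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE wiki (name TEXT)')
conn.execute('INSERT INTO wiki VALUES (?)', ('ndr\xe9 Le',))
row = conn.execute('SELECT name FROM wiki').fetchone()
assert row[0] == 'ndr\xe9 Le'   # text comes back as a text string
```
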
=== Filesystem objects ===

Whenever a file is read or written, some care should be taken with the content.
Usually, when writing text data, we choose to encode it using `'utf-8'`.
When reading, it is context-dependent: there are situations where we know for sure
that the data in the file is encoded using `'utf-8'`;
we therefore usually do a `to_unicode(filecontent, 'utf-8')` in those situations.

There's an additional complexity here, in that filenames may also
contain non-ASCII characters. In Python, it should be safe to provide `unicode`
objects to all the `os` filesystem-related functions.

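That convention can be sketched as follows, in modern Python (where `str` is the Unicode type); the helper names are made up for the example:

```python
# Sketch of the convention above; helper names are made up.
import os
import tempfile

def write_text(path, text):
    # Writing: serialize Unicode text as UTF-8 bytes
    with open(path, 'wb') as f:
        f.write(text.encode('utf-8'))

def read_text(path):
    # Reading: here we know our own files are UTF-8
    with open(path, 'rb') as f:
        return f.read().decode('utf-8')

# Unicode filenames are handed directly to the `os` functions
path = os.path.join(tempfile.mkdtemp(), 'ndr\xe9.txt')
write_text(path, 'ndr\xe9 Le')
assert read_text(path) == 'ndr\xe9 Le'
```
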
=== `versioncontrol` subsystem ===

This is dependent on the backend.

In Subversion, there are clear rules about the pathnames used
by the SVN bindings for Python: those should be UTF-8 encoded `str` objects.

Therefore, `unicode` pathnames should be 'utf-8' encoded before
being passed to the bindings, and pathnames returned by
the bindings should be decoded using 'utf-8' before being
returned to callers of the `versioncontrol` API.

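The encode/decode convention at that boundary can be sketched like this (the function names are made up; Python 3 `bytes` stands in for the old `str`):

```python
# Sketch of the boundary convention; function names are made up.
def to_svn_path(path):
    # unicode pathname -> UTF-8 encoded bytes for the SVN bindings
    return path.encode('utf-8')

def from_svn_path(raw):
    # bytes coming back from the bindings -> unicode for API callers
    return raw.decode('utf-8')

assert to_svn_path('trunk/ndr\xe9') == b'trunk/ndr\xc3\xa9'
assert from_svn_path(b'trunk/ndr\xc3\xa9') == 'trunk/ndr\xe9'
```
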
As noted above when talking about file contents, a node's content
can contain any kind of data, including binary data; therefore
`Node.get_content().read()` returns a `str` object.

Depending on the backend, some ''hints'' about the nature of the
content (and possibly about the charset used, if the content
is text) can be given by the `Node.get_content_type()` method.

The Mimeview component can be used to exploit those hints
in a streamlined way.

=== Generating content with !ClearSilver templates ===

The main "source" of generated text in Trac is the !ClearSilver template engine.
The !ClearSilver engine doesn't accept `unicode` objects, so those are
converted to UTF-8 encoded `str` objects just before being inserted into the "HDF"
(the data structure used by the template engine to fill in the templates).

The body of those templates (the `.cs` files) must also use this encoding.

=== The Web interface ===

The information in the `Request` object (`req`) is converted to `unicode` objects
from UTF-8 encoded strings.

The data sent out is generally converted to UTF-8 as well
(like the headers), except if some charset information has
been explicitly set in the `'Content-Type'` header.
If that is the case, that encoding is used.

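The outgoing conversion can be sketched with a hypothetical helper (this is not the actual `Request` implementation):

```python
# Hypothetical helper, not the actual Request implementation.
def encode_response(body, content_type=None):
    """Serialize a unicode body, honouring an explicit charset if any."""
    charset = 'utf-8'   # the default output encoding
    if content_type and 'charset=' in content_type:
        charset = content_type.split('charset=')[1].split(';')[0].strip()
    return body.encode(charset)

assert encode_response('ndr\xe9 Le') == b'ndr\xc3\xa9 Le'
assert encode_response('ndr\xe9 Le',
                       'text/html; charset=iso-8859-15') == b'ndr\xe9 Le'
```
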
=== The console ===

When reading from the console, we assume the text is encoded
using `sys.stdin.encoding`.

When writing to the console, we assume that `sys.stdout.encoding`
should be used.

  ''FIXME: and logging?''

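A defensive sketch of the console convention; the helper is hypothetical, and falling back to the locale's preferred encoding when a stream has no declared encoding (e.g. output redirected to a pipe) is one reasonable choice, not necessarily what Trac does:

```python
# Hypothetical helper; falls back to the locale encoding when a
# stream has no declared encoding (e.g. stdout is a pipe).
import locale
import sys

def console_encoding(stream):
    return getattr(stream, 'encoding', None) or locale.getpreferredencoding()

enc = console_encoding(sys.stdout)
assert isinstance(enc, str) and enc   # we always get some usable name
```
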
=== Interaction with plugins ===

Whenever Trac gets data from plugins, it must try to cope
with `str` objects. Those might come from 0.9-era pre-unicode plugins
which have not been fully migrated to 0.10 and beyond.

== Questions/Suggestions... ==

Sorry, there are certainly a ton of typos in there; hopefully
no more serious errors. But I had to get a first draft of this out.

Feel free to correct me, ask questions, etc. This is a Wiki :)