Changes between Version 11 and Version 12 of TracDev/UnicodeGuidelines


Timestamp:
Feb 23, 2016, 9:57:51 PM (8 years ago)
Author:
figaro
Comment:

Cosmetic changes, removed dead link

  • TracDev/UnicodeGuidelines

    v11 v12  
    1 = Trac and Unicode: Development Guidelines =
     1[[PageOutline(2-3,Contents)]]
     2
     3= Trac and Unicode: Development Guidelines
    24
    35Since Trac [milestone:0.10], Trac uses `unicode` strings internally.
    4 This document aims at clarifying what the implications of this change are.
    5 
    6 == Unicode Mini Tutorial ==
    7 
    8 In Python, they are two kinds of string types, both subclasses of `basestring`:
     6This document clarifies what the implications of this change are.
     7
     8== Unicode Mini Tutorial
     9
     10In Python, there are two kinds of string types, both subclasses of `basestring`:
    911 * `unicode` is a string type in which each character is a Unicode code point. [[br]]
    1012   All common string operations (len, slicing, etc.) will operate on those code points.
     
    1618`unicode` provides a real representation of textual data: once you're in `unicode`, you know that your text data can contain any kind of multilingual characters, and that you can safely manipulate it the expected way.
    1719
    18 On the other hand, a `str` object can be used to contain anything, binary data, or some text using any conceivable encoding. But if it's supposed to contain text, it is crucial to know which encoding was used. That knowledge must be known or inferred from somewhere, which is not always a trivial thing to do.
    19 
    20 In summary, it is not manipulating `unicode` object which is problematic (it is not), but how to go from the "wild" side (`str`) to the "safe" side (`unicode`)… Going from `unicode` to `str` is usually less problematic, because you can always control what kind of encoding you want to use for serializing your Unicode data.
     20On the other hand, a `str` object can be used to contain anything, binary data, or some text using any conceivable encoding. But if it's supposed to contain text, it is crucial to know which encoding was used. That knowledge must be known or inferred from somewhere, which is not always trivial.
     21
     22In summary, manipulating `unicode` objects is not what is problematic; the difficulty lies in how to go from the "wild" side (`str`) to the "safe" side (`unicode`). Going from `unicode` to `str` is usually less problematic, because you can always control what kind of encoding you want to use for serializing your Unicode data.
    2123
    2224What does all of the above look like in practice? Let's take an example (from ![1]):
     
    2830   }}}
    2931 * From there, you can easily transform that to a `str` object. [[br]]
    30    As we said above, we have to freedom to choose the encoding:
     32   As we said above, we can choose the encoding:
    3133   * ''UTF-8'': it's a variable length encoding which is widely understood,
    3234     and in which ''any'' code point can be represented: [[br]]
     
    7072 * Now, you might wonder how to get a `unicode` object in the first place,
    7173   starting from a string. [[br]]
    72    Well, from the above it should be obvious that it's absolutely necessary
    73    to ''know'' what is the encoding used in the `str` object, as either
     74   For this it is critical
     75   to ''know'' what encoding was used in the `str` object, as either
    7476   `'ndr\xe9 Le'` or `'ndr\xc3\xa9 Le'` could be decoded into the same
    75    unicode string `u"ndré Le"` (as a matter of fact, it is as important
    76    as knowing if that stream of bytes has been gzipped or ROT13-ed...) [[br]]
     77   unicode string `u"ndré Le"` (it is in fact as important
     78   as knowing whether that stream of bytes has been gzipped or ROT13-ed.) [[br]]
    7779   * Assuming we know the encoding of the `str` object, getting an `unicode`
    7880     object out of it is trivial:
     
    9698     }}}
    9799     No errors here, but the unicode string now contains garbage [[br]]
    98      (NB: as we have seen above, 'iso-8859-15' is a fixed-byte encoding
     100     NB: as we have seen above, 'iso-8859-15' is a fixed-byte encoding
    99101     with a mapping defined for all the 0..255 range, so decoding ''any''
    100      input assuming such an encoding will ''always'' succeed).
     102     input assuming such an encoding will ''always'' succeed.
    101103     {{{
    102104>>> unicode('ndr\xe9 Le', 'utf-8')
     
    105107UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
    106108     }}}
    107      Here, we clearly see that not all sequence of bytes can be interpreted as UTF-8...
     109     Here, we clearly see that not every sequence of bytes can be interpreted as UTF-8.
    108110   * What happens if we don't provide an encoding at all?
    109111     {{{
     
    115117Traceback (...) # same as above
    116118     }}}
    117      This is very symmetrical to the encoding situation: the `sys.getdefaultencoding()` is used
    118      (usually 'ascii') when no encoding is explicitely given.
     119     This is analogous to the encoding situation: the `sys.getdefaultencoding()` is used
     120     (usually 'ascii') when no encoding is explicitly given.
    119121   * Now, as with the encoding situation, there are ways to ''force'' the decoding
    120122     process to succeed, even if we are wrong about the charset used by our `str` object.
     
    127129       (as ''iso-8859-1'' or ''iso-8859-15'', see above).
    128130
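The decoding behaviour walked through in the tutorial can be summarised in a short sketch. It is only an illustration, written so that it also runs on Python 3, where `bytes` plays the role of the `str` type discussed here and `str` plays the role of `unicode`. The variable names are made up for the example.

```python
# The same text, serialized with two different encodings.
raw_latin = b'ndr\xe9 Le'       # iso-8859-15 bytes
raw_utf8 = b'ndr\xc3\xa9 Le'    # UTF-8 bytes
text = u'ndr\xe9 Le'            # the text both byte strings stand for

# Decoding with the right encoding gives the same text in both cases.
from_latin = raw_latin.decode('iso-8859-15')
from_utf8 = raw_utf8.decode('utf-8')

# Decoding with the wrong fixed-byte encoding "succeeds" but yields garbage,
# because every byte value has a mapping in iso-8859-15.
garbage = raw_utf8.decode('iso-8859-15')

# Decoding non-UTF-8 bytes as UTF-8 fails in strict mode...
try:
    raw_latin.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ...unless the decoding is forced, here in replacement mode.
replaced = raw_latin.decode('utf-8', 'replace')
```

The point of the sketch is that only knowledge of the encoding tells the two byte strings apart: a successful decode is no proof that the right charset was used.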
    129 This was a very rough mini-tutorial on the question, I hope
    130 it's enough for getting in the general mood needed to read the
    131 rest of the guidelines...
    132 
    133 Of course, there are a lot of more in-depth tutorials on Unicode in general
    134 and Python/Unicode in particular available on the Web:
     131There are more in-depth tutorials on Unicode in general and Python / Unicode in particular available:
    135132 * ![1] http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf
    136  * ![2] http://www.amk.ca/python/howto/unicode
    137  * ![3] http://www.python.org/dev/peps/pep-0100
    138 
    139 Now we can move to the specifics of Trac programming...
    140 
    141 == Trac utilities for Unicode ==
    142 
    143 In order to handle the unicode related issues in a cohesive way,
    144 there are a few utility functions that can be used, but this
    145 is mainly our swiss-army knife `to_unicode` function.
    146 
    147 === `to_unicode` ===
    148 
    149 The `to_unicode` function was designed with flexibility and
    150 robustness in mind: Calling `to_unicode()` on anything should
    151 never fail.
     133 * ![2] http://www.python.org/dev/peps/pep-0100
     134
     135Now we can move to the specifics of Trac programming.
     136
     137== Trac utilities for Unicode
     138
     139In order to handle the unicode related issues in a cohesive way, there are a few utility functions that can be used, the main one being our Swiss Army knife `to_unicode` function.
     140
     141=== `to_unicode`
     142
     143The `to_unicode` function was designed with flexibility and robustness in mind: Calling `to_unicode()` on anything should never fail.
    152144
    153145The use cases are as follows:
    154146 1. given any arbitrary object `x`, one could use `to_unicode(x)`
    155147    as one would use `unicode(x)` to convert it to an unicode string
    156  2. given a `str` object `s`, which ''might'' be a text but for which
     148 1. given a `str` object `s`, which ''might'' be a text but for which
    157149    we have no idea what was the encoding used, one can use
    158150    `to_unicode(s)` to convert it to an `unicode` object in a safe way. [[br]]
     
    160152    and if this fails, a decoding using the `locale.getpreferredencoding()`
    161153    will be done, in replacement mode.
    162  3. given a `str` object `s`, for which we ''think'' we know what is
     154 1. given a `str` object `s`, for which we ''think'' we know what is
    163155    the encoding `enc` used, we can do `to_unicode(s, enc)` to try
    164156    to decode it using the `enc` encoding, in replacement mode. [[br]]
     
    167159    the ''use case 2'', should `enc` be `None`.
    168160
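The three use cases above can be approximated by the following sketch. It is not the actual Trac implementation: the name `to_unicode_sketch` is hypothetical, and the initial UTF-8 attempt in the no-charset case is an assumption, since only the fallback to `locale.getpreferredencoding()` in replacement mode is described above. The code is written to run on both Python 2 and Python 3 (`bytes` is Python 2's `str`).

```python
import locale

text_type = type(u'')  # `unicode` on Python 2, `str` on Python 3


def to_unicode_sketch(value, charset=None):
    """Hypothetical approximation of `to_unicode`; should never raise on bytes."""
    if isinstance(value, text_type):
        # Already text: nothing to decode.
        return value
    if isinstance(value, bytes):
        if charset:
            # Use case 3: we think we know the encoding; replacement mode.
            return value.decode(charset, 'replace')
        try:
            # Use case 2 (assumed first attempt): strict UTF-8.
            return value.decode('utf-8')
        except UnicodeDecodeError:
            # Fallback: preferred locale encoding, in replacement mode.
            return value.decode(locale.getpreferredencoding(), 'replace')
    # Use case 1: any other object, converted as `unicode(x)` would.
    return text_type(value)
```

For example, `to_unicode_sketch('ndr\xc3\xa9 Le')` on a Python 2 `str` decodes the bytes as UTF-8, while `to_unicode_sketch(s, 'iso-8859-15')` decodes with the given charset and replaces anything that cannot be decoded.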
    169 So, you may ask, if the above works in all situations, where should you
    170 still use `unicode(x)` or `unicode(x,enc)`?
     161So, if the above works in all situations, where should you still use `unicode(x)` or `unicode(x,enc)`?
    171162
    172163 * you could use `unicode(x)` when you know for sure that x is anything
     
    185176There are a few other unicode related utilities besides `to_unicode` in the [source:/trunk/trac/util/text.py trac.util.text] module.
    186177
    187 === The Mimeview component ===
    188 
    189 The Mimeview component is the place where we collect some intelligence
    190 about the MIME types and charsets auto-detection.
    191 
    192 Most of the time, when we manipulate ''file content'', we only have partial
    193 information about the nature of the data actually contained in those files.
    194 
    195 This is true whether the file is located in the filesystem, in a version
    196 control repository or is streamed by the web browser (file upload).
    197 
    198 The Mimeview component tries to associate a MIME type to a file content,
    199 based on the filename or, if that's not enough on the file's content itself.
    200 During this process, the charset used by the file ''might'' be inferred as well.
     178=== The Mimeview component
     179
     180The Mimeview component is the place where we collect some intelligence about the MIME types and charsets auto-detection.
     181
     182Most of the time, when we manipulate ''file content'', we only have partial information about the nature of the data actually contained in those files.
     183
     184This is true whether the file is located in the filesystem, in a version control repository or is streamed by the web browser (file upload).
     185
     186The Mimeview component tries to associate a MIME type with file content, based on the filename or, if that's not enough, on the file's content itself. During this process, the charset used by the file ''might'' be inferred as well.
    201187
    202188The API is quite simple:
     
    208194 * `Mimeview.to_unicode(self, content, mimetype=None, charset=None)` [[br]]
    209195   uses the `to_unicode` utility and, if needed, guesses the charset
    210 '''Note that the Mimeview API is currently behing overhauled and will most probably change in the next releases. See #3332'''.
    211 
    212 == Trac boundaries for Unicode Data ==
     196'''Note''': the Mimeview API is currently being overhauled and will most probably change in the next releases (#3332).
     197
     198== Trac boundaries for Unicode Data
    213199
    214200Most of the time, within Trac we assume that we are manipulating `unicode` objects.
    215201
    216 But there are places where we need to deal with raw `str` objects, and therefore
    217 we must know what to do, either when encoding to or when decoding from `str` objects.
    218 
    219 === Database Layer ===
    220 
    221 Each database connector should configure its database driver
    222 so that the `Cursor` objects are able to accept and will return
    223 `unicode` objects. This sometimes involve writing a wrapper class
    224 for the original Cursor class. See for example
     202But there are places where we need to deal with raw `str` objects, and therefore we must know what to do, either when encoding to or when decoding from `str` objects.
     203
     204=== Database Layer
     205
     206Each database connector should configure its database driver so that the `Cursor` objects are able to accept and will return
     207`unicode` objects. This sometimes involves writing a wrapper class for the original Cursor class. See for example
    225208[source:/trunk/trac/db/sqlite_backend.py@head#L58 SQLiteUnicodeCursor], for pysqlite1.
    226209
    227 === The console ===
    228 
    229 When reading from the console, we assume the text is encoded
    230 using `sys.stdin.encoding`.
    231 
    232 When writing to the console, we assume that the `sys.stdout.encoding`
    233 should be used.
     210=== The console
     211
     212When reading from the console, we assume the text is encoded using `sys.stdin.encoding`.
     213
     214When writing to the console, we assume that the `sys.stdout.encoding` should be used.
    234215
    235216The logging API seems to handle `unicode` objects just fine.
    236217
    237 === Filesystem objects ===
     218=== Filesystem objects
    238219
    239220Whenever a file is read or written, some care should be taken about the content.
    240221Usually, when writing text data, we will choose to encode it using `'utf-8'`.
    241 When reading, it is context dependent: there are situations were we know for sure
    242 the data in the file is encoded using `'utf-8'`;
     222When reading, it is context dependent: there are situations where we know for sure the data in the file is encoded using `'utf-8'`.
    243223We therefore usually do a `to_unicode(filecontent, 'utf-8')` in these situations.
    244224
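As a rough illustration of the file content rule above, the sketch below writes text as UTF-8 and decodes it again on reading. The helper names `write_text` and `read_text` are hypothetical, and the plain `decode('utf-8', 'replace')` call merely stands in for `to_unicode(filecontent, 'utf-8')`. It runs on both Python 2 and Python 3.

```python
import os
import shutil
import tempfile


def write_text(path, text):
    """Encode the text explicitly as UTF-8 and write the resulting bytes."""
    with open(path, 'wb') as f:
        f.write(text.encode('utf-8'))


def read_text(path):
    """Read raw bytes and decode them as UTF-8, in replacement mode."""
    with open(path, 'rb') as f:
        return f.read().decode('utf-8', 'replace')


# Round trip through a temporary file.
tmpdir = tempfile.mkdtemp()
try:
    path = os.path.join(tmpdir, 'example.txt')
    write_text(path, u'ndr\xe9 Le')
    roundtrip = read_text(path)
finally:
    shutil.rmtree(tmpdir)
```

Opening the file in binary mode keeps the encoding decision explicit at the boundary, rather than relying on a platform default.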
    245 There's an additional complexity here in that the filenames are also possibly
    246 using non-ascii characters. In Python, it should be safe to provide `unicode`
    247 objects for all the `os` filesystem related functions.
     225There's an additional complexity here in that the filenames are also possibly using non-ascii characters. In Python, it should be safe to provide `unicode` objects for all the `os` filesystem related functions.
    248226
    249227Look also at r7360, r7361, r7362.
     
    251229More information about how Python deals with Unicode at system boundaries can be found here: http://kofoto.rosdahl.net/wiki/UnicodeInPython.
    252230
    253 
    254 
    255 === `versioncontrol` subsystem ===
     231=== `versioncontrol` subsystem
    256232
    257233This is dependent on the backend.
    258234
    259 In Subversion, there are clear rules about the pathnames used
    260 by the SVN Bindings for Python: those should be UTF-8 encoded `str` objects.
    261 
    262 Therefore, `unicode` pathnames should be 'utf-8' encoded before
    263 being passed to the bindings, and pathnames returned by
    264 the bindings should be decoded using 'utf-8' before being
    265 returned to callers of the `versioncontrol` API.
    266 
    267 As noted above when talking about file contents, the node content
    268 can contain any kind of data, including binary data and therefore
    269 `Node.get_content().read()` returns a `str` object.
    270 
    271 Depending on the backend, some ''hints'' about the nature of the
    272 content (and eventually about the charset used if the content
    273 is text) can be given by the `Node.get_content_type()` method.
    274 
    275 The Mimeview component can be used in order to use those hints
    276 in a streamlined way.
    277 
    278 === Generating content with !ClearSilver templates ===
     235In Subversion, there are clear rules about the pathnames used by the SVN Bindings for Python: those should be UTF-8 encoded `str` objects.
     236
     237Therefore, `unicode` pathnames should be 'utf-8' encoded before being passed to the bindings, and pathnames returned by the bindings should be decoded using 'utf-8' before being returned to callers of the `versioncontrol` API.
     238
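The pathname rule can be sketched as a pair of conversion helpers applied at the boundary with the bindings. The function names are hypothetical and no real binding call is shown. The only rule illustrated is the one stated above: UTF-8 encoded byte strings go in, and decoded text comes out.

```python
def to_backend_path(path):
    """Encode a text pathname as UTF-8 before handing it to the bindings."""
    if isinstance(path, bytes):
        # Assumed to be UTF-8 encoded already; pass through unchanged.
        return path
    return path.encode('utf-8')


def from_backend_path(raw_path):
    """Decode a UTF-8 pathname returned by the bindings into text."""
    return raw_path.decode('utf-8')


encoded = to_backend_path(u'/trunk/r\xe9sum\xe9.txt')
decoded = from_backend_path(encoded)
```

Keeping both conversions in one place means callers of the `versioncontrol` API only ever see `unicode` pathnames.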
     239As noted above when talking about file contents, the node content can contain any kind of data, including binary data and therefore `Node.get_content().read()` returns a `str` object.
     240
     241Depending on the backend, some ''hints'' about the nature of the content (and possibly about the charset used if the content is text) can be given by the `Node.get_content_type()` method.
     242
     243The Mimeview component can be used in order to use those hints in a streamlined way.
     244
     245=== Generating content with !ClearSilver templates
    279246
    280247The main "source" of generated text from Trac is the ClearSilver template engine.
    281 The !ClearSilver engine doesn't accept `unicode` objects, so those are
    282 converted to UTF-8 encoded `str` objects just before being inserted in the "HDF"
    283 (the data structure used by the template engine to fill in the templates).
    284 This is done automatically by our
    285 [source:/trunk/trac/web/clearsilver.py@head#L22 HDFWrapper] class, so anywhere else
    286 in the code one can safely associate unicode values to entries in `req.hdf`.
     248The !ClearSilver engine doesn't accept `unicode` objects, so those are converted to UTF-8 encoded `str` objects just before being inserted in the "HDF" (the data structure used by the template engine to fill in the templates).
     249This is done automatically by our [source:/trunk/trac/web/clearsilver.py@head#L22 HDFWrapper] class, so anywhere else in the code one can safely associate unicode values to entries in `req.hdf`.
    287250
    288251The body of those templates (the `.cs` files) must also use the UTF-8 encoding.
    289252
    290 === The Web interface ===
    291 
    292 The information in the `Request` object (`req`) is converted to `unicode` objects,
    293 from 'UTF-8' encoded strings.
    294 
    295 The data sent out is generally converted to 'UTF-8' as well
    296 (like the headers), except if some charset information has
    297 been explicitely set in the `'Content-Type'` header.
    298 If this is the case, that encoding is used.
    299 
    300 === Interaction with plugins ===
    301 
    302 Whenever Trac gets data from plugins, it must try to cope
    303 with `str` objects. Those might be 0.9 pre-unicode plugins
    304 which have not been migrated fully to 0.10 and beyond.
    305 
    306 == Questions/Suggestions... ==
    307 
    308 Feel free to correct me, ask questions, etc.; this is a Wiki. :)
    309 
    310 ----
    311 Q: When dealing with plugins that weren't designed to be unicode friendly and used unicode in favour of to_unicode, what parts of the plugin should be updated, what should use to_unicode ? --JamesMills
    312 
    313 A: There shouldn't be any reason to replace a working call to `unicode()` by a call to `to_unicode()`, unless you specified the encoding, like in:
     253=== The Web interface
     254
     255The information in the `Request` object (`req`) is converted to `unicode` objects, from 'UTF-8' encoded strings.
     256
     257The data sent out is generally converted to 'UTF-8' as well (like the headers), except if some charset information has been explicitly set in the `'Content-Type'` header. If this is the case, that encoding is used.
     258
     259=== Interaction with plugins
     260
     261Whenever Trac gets data from plugins, it must try to cope with `str` objects. Those might be 0.9 pre-unicode plugins which have not been migrated fully to 0.10 and beyond.
     262
     263== Questions / Suggestions
     264
     265'''Q''': When dealing with plugins that weren't designed to be unicode friendly and used unicode in favour of to_unicode, what parts of the plugin should be updated, what should use to_unicode ? --JamesMills
     266
     267'''A''': There shouldn't be any reason to replace a working call to `unicode()` by a call to `to_unicode()`, unless you specified the encoding, like in:
    314268{{{
    315269  ustring = unicode(data_from_trac, 'utf-8')
    316270}}}
    317 The above doesn't work if `data_from_trac` is actually an unicode object (you'd get `TypeError: decoding Unicode is not supported`).
     271
     272The above doesn't work if `data_from_trac` is actually an unicode object. You would get `TypeError: decoding Unicode is not supported`.
    318273
    319274In this case, either don't use `unicode` at all (0.10 and above only plugins) or replace it by `to_unicode` (0.9 and 0.10 plugins).