Changes between Version 11 and Version 12 of TracDev/UnicodeGuidelines

Timestamp: Feb 23, 2016, 9:57:51 PM (8 years ago)
Author: figaro
Comment: Cosmetic changes, removed dead link

TracDev/UnicodeGuidelines

-              v11
+              v12
+= Trac and Unicode: Development Guidelines =
+[[PageOutline(2-3,Contents)]]
+= Trac and Unicode: Development Guidelines
 Since Trac [milestone:0.10], Trac uses `unicode` strings internally.
 This document aims at clarifying what the implications of this change are.
 == Unicode Mini Tutorial ==
 In Python, they are two kinds of string types, both subclasses of `basestring`:
+This document clarifies what the implications of this change are.
+== Unicode Mini Tutorial
+In Python, there are two kinds of string types, both subclasses of `basestring`:
  * `unicode` is a string type in which each character is a Unicode code point. [[br]]
    All common string operations (len, slicing, etc.) will operate on those code points.
 …
 `unicode` provides a real representation of textual data: once you're in `unicode`, you know that your text data can contain any kind of multilingual characters, and that you can safely manipulate it the expected way.
 On the other hand, a `str` object can be used to contain anything, binary data, or some text using any conceivable encoding. But if it's supposed to contain text, it is crucial to know which encoding was used. That knowledge must be known or inferred from somewhere, which is not always a trivial thing to do.
 In summary, it is not manipulating `unicode` object which is problematic (it is not), but how to go from the "wild" side (`str`) to the "safe" side (`unicode`)… Going from `unicode` to `str` is usually less problematic,  because you can always control what kind of encoding you want to use for serializing your Unicode data.
+On the other hand, a `str` object can be used to contain anything, binary data, or some text using any conceivable encoding. But if it's supposed to contain text, it is crucial to know which encoding was used. That knowledge must be known or inferred from somewhere, which is not always trivial.
+In summary, the problematic part is not manipulating `unicode` objects, but going from the "wild" side (`str`) to the "safe" side (`unicode`). Going from `unicode` to `str` is usually less problematic, because you can always control which encoding you want to use for serializing your Unicode data.
 What does all of the above look like in practice? Let's take an example (from ![1]):
 …
    }}}
  * From there, you can easily transform that to a `str` object. [[br]]
    As we said above, we have to freedom to choose the encoding:
+   As we said above, we can choose the encoding:
    * ''UTF-8'': it's a variable length encoding which is widely understood,
      and in which ''any'' code point can be represented: [[br]]
 …
  * Now, you might wonder how to get a `unicode` object in the first place,
    starting from a string. [[br]]
    Well, from the above it should be obvious that it's absolutely necessary
    to ''know'' what is the encoding used in the `str` object, as either
+   For this it is critical
+   to ''know'' what encoding was used in the `str` object, as either
    `'ndr\xe9 Le'` or `'ndr\xc3\xa9 Le'` could be decoded into the same
    unicode string `u"ndré Le"` (as a matter of fact, it is as important
    as knowing if that stream of bytes has been gzipped or ROT13-ed...) [[br]]
+   unicode string `u"ndré Le"` (it is in fact as important
+   as knowing whether that stream of bytes has been gzipped or ROT13-ed.) [[br]]
    * Assuming we know the encoding of the `str` object, getting a `unicode`
      object out of it is trivial:
 …
      }}}
      No errors here, but the unicode string now contains garbage [[br]]
      (NB: as we have seen above, 'iso-8859-15' is a fixed-byte encoding
+     NB: as we have seen above, 'iso-8859-15' is a fixed-byte encoding
      with a mapping defined for all the 0..255 range, so decoding ''any''
      input assuming such an encoding will ''always'' succeed).
+     input assuming such an encoding will ''always'' succeed.
      {{{
 >>> unicode('ndr\xe9 Le', 'utf-8')
 …
 UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
      }}}
      Here, we clearly see that not all sequence of bytes can be interpreted as UTF-8...
+     Here, we clearly see that not all sequences of bytes can be interpreted as UTF-8.
    * What happens if we don't provide an encoding at all?
      {{{
 …
 Traceback (...) # same as above
      }}}
      This is very symmetrical to the encoding situation: the `sys.getdefaultencoding()` is used
      (usually 'ascii') when no encoding is explicitely given.
+     This is analogous to the encoding situation: the `sys.getdefaultencoding()` is used
+     (usually 'ascii') when no encoding is explicitly given.
    * Now, as with the encoding situation, there are ways to ''force'' the encoding
      process to succeed, even if we are wrong about the charset used by our `str` object.
 …
        (as ''iso-8859-1'' or ''iso-8859-15'', see above).
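The decoding cases discussed above can be condensed into one short sketch. It is written with explicit `b'...'` and `u'...'` literals and the `.encode()` / `.decode()` methods so that it runs on both Python 2 and Python 3 (where `bytes` plays the role of `str` and `str` the role of `unicode`); the variable names are purely illustrative.

```python
# -*- coding: utf-8 -*-
text = u'ndr\xe9 Le'

# Going from unicode to bytes: we choose the encoding.
utf8_bytes = text.encode('utf-8')          # 'ndr\xc3\xa9 Le'
latin_bytes = text.encode('iso-8859-15')   # 'ndr\xe9 Le'

# Decoding with the right encoding gives back the original text.
from_utf8 = utf8_bytes.decode('utf-8')
from_latin = latin_bytes.decode('iso-8859-15')

# Decoding with the wrong single-byte encoding never fails,
# but silently produces garbage.
garbage = utf8_bytes.decode('iso-8859-15')

# Decoding non-UTF-8 bytes as UTF-8 fails in strict mode...
try:
    latin_bytes.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ...unless we force it with the standard 'replace' error handler,
# which substitutes U+FFFD for the undecodable bytes.
forced = latin_bytes.decode('utf-8', 'replace')
```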
+This was a very rough mini-tutorial on the question, I hope
+it's enough for getting in the general mood needed to read the
+rest of the guidelines...
+Of course, there are a lot of more in-depth tutorials on Unicode in general
+and Python/Unicode in particular available on the Web:
+There are more in-depth tutorials on Unicode in general and Python / Unicode in particular available:
  * ![1] http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf
+ * ![2] http://www.amk.ca/python/howto/unicode
+ * ![3] http://www.python.org/dev/peps/pep-0100
+Now we can move to the specifics of Trac programming...
+== Trac utilities for Unicode ==
+In order to handle the unicode related issues in a cohesive way,
+there are a few utility functions that can be used, but this
+is mainly our swiss-army knife `to_unicode` function.
+=== `to_unicode` ===
+The `to_unicode` function was designed with flexibility and
+robustness in mind: Calling `to_unicode()` on anything should
+never fail.
+ * ![2] http://www.python.org/dev/peps/pep-0100
+Now we can move to the specifics of Trac programming.
+== Trac utilities for Unicode
+To handle Unicode-related issues in a cohesive way, there are a few utility functions that can be used, chiefly our Swiss-army knife, the `to_unicode` function.
+=== `to_unicode`
+The `to_unicode` function was designed with flexibility and robustness in mind: Calling `to_unicode()` on anything should never fail.
 The use cases are as follows:
 . given any arbitrary object `x`, one could use `to_unicode(x)`
     as one would use `unicode(x)` to convert it to a unicode string
 . given a `str` object `s`, which ''might'' be a text but for which
+. given a `str` object `s`, which ''might'' be a text but for which
     we have no idea which encoding was used, one can use
     `to_unicode(s)` to convert it to a `unicode` object in a safe way. [[br]]
 …
     and if this fails, a decoding using the `locale.getpreferredencoding()`
     will be done, in replacement mode.
 . given a `str` object `s`, for which we ''think'' we know what is
+. given a `str` object `s`, for which we ''think'' we know what is
     the encoding `enc` used, we can do `to_unicode(s, enc)` to try
     to decode it using the `enc` encoding, in replacement mode. [[br]]
 …
     the ''use case 2'', should `enc` be `None`.
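The fallback chain described in the use cases above could be sketched roughly as follows. This is not Trac's actual implementation: `to_unicode_sketch` is a hypothetical name, trying UTF-8 first is an assumption, and the code uses `bytes` and `type(u'')` so that it runs on both Python 2 (where they are `str` and `unicode`) and Python 3.

```python
import locale

text_type = type(u'')  # `unicode` on Python 2, `str` on Python 3


def to_unicode_sketch(value, charset=None):
    """Hypothetical illustration of a 'never fails' conversion."""
    if isinstance(value, bytes):
        if charset:
            # Use case 3: we think we know the encoding; replacement
            # mode keeps the call from failing if we are wrong.
            return value.decode(charset, 'replace')
        try:
            # Use case 2 (assumed first attempt): strict UTF-8.
            return value.decode('utf-8')
        except UnicodeDecodeError:
            # Fall back to the locale's preferred encoding,
            # again in replacement mode.
            return value.decode(locale.getpreferredencoding(), 'replace')
    if isinstance(value, text_type):
        return value
    # Use case 1: any other object is converted like unicode(x) would.
    return text_type(value)


from_utf8 = to_unicode_sketch(b'ndr\xc3\xa9 Le')
from_latin = to_unicode_sketch(b'ndr\xe9 Le', 'iso-8859-15')
from_unknown = to_unicode_sketch(b'ndr\xe9 Le')
from_object = to_unicode_sketch(42)
```

The result for `from_unknown` depends on the locale of the machine running the code; the only guarantee is that a text object comes back instead of an exception.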
+So, you may ask, if the above works in all situations, where should you
+still use `unicode(x)` or `unicode(x,enc)`?
+So, if the above works in all situations, where should you still use `unicode(x)` or `unicode(x,enc)`?
  * you could use `unicode(x)` when you know for sure that x is anything
 …
 There are a few other unicode related utilities besides `to_unicode` in the [source:/trunk/trac/util/text.py trac.util.text] module.
+=== The Mimeview component ===
+The Mimeview component is the place where we collect some intelligence
+about the MIME types and charsets auto-detection.
+Most of the time, when we manipulate ''file content'', we only have partial
+information about the nature of the data actually contained in those files.
+This is true whether the file is located in the filesystem, in a version
+control repository or is streamed by the web browser (file upload).
+The Mimeview component tries to associate a MIME type to a file content,
+based on the filename or, if that's not enough on the file's content itself.
+During this process, the charset used by the file ''might'' be inferred as well.
+=== The Mimeview component
+The Mimeview component is the place where we collect some intelligence about the MIME types and charsets auto-detection.
+Most of the time, when we manipulate ''file content'', we only have partial information about the nature of the data actually contained in those files.
+This is true whether the file is located in the filesystem, in a version control repository or is streamed by the web browser (file upload).
+The Mimeview component tries to associate a MIME type to a file content, based on the filename or, if that's not enough, on the file's content itself. During this process, the charset used by the file ''might'' be inferred as well.
 The API is quite simple:
 …
  * `Mimeview.to_unicode(self, content, mimetype=None, charset=None)` [[br]]
    uses the `to_unicode` utility and guesses the charset if needed
 '''Note that the Mimeview API is currently behing overhauled and will most probably change in the next releases. See #3332'''.
 == Trac boundaries for Unicode Data ==
+'''Note''': the Mimeview API is currently being overhauled and will most probably change in the next releases (#3332).
+== Trac boundaries for Unicode Data
 Most of the time, within Trac we assume that we are manipulating `unicode` objects.
+But there are places where we need to deal with raw `str` objects, and therefore
+we must know what to do, either when encoding to or when decoding from `str` objects.
+=== Database Layer ===
+Each database connector should configure its database driver
+so that the `Cursor` objects are able to accept and will return
+`unicode` objects. This sometimes involve writing a wrapper class
+for the original Cursor class. See for example
+But there are places where we need to deal with raw `str` objects, and therefore we must know what to do, either when encoding to or when decoding from `str` objects.
+=== Database Layer
+Each database connector should configure its database driver so that the `Cursor` objects are able to accept and will return
+`unicode` objects. This sometimes involves writing a wrapper class for the original Cursor class. See for example
 [source:/trunk/trac/db/sqlite_backend.py@head#L58 SQLiteUnicodeCursor], for pysqlite1.
+=== The console ===
+When reading from the console, we assume the text is encoded
+using `sys.stdin.encoding`.
+When writing to the console, we assume that the `sys.stdout.encoding`
+should be used.
+=== The console
+When reading from the console, we assume the text is encoded using `sys.stdin.encoding`.
+When writing to the console, we assume that the `sys.stdout.encoding` should be used.
 The logging API seems to handle `unicode` objects just fine.
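For the console case, a minimal sketch of the writing side might look like this. The helper name `console_encode` is hypothetical, and falling back to UTF-8 when the stream reports no encoding (for instance when output is piped) is an assumption, not something this page specifies.

```python
import sys


def console_encode(text, stream=None):
    """Encode text using the stream's declared encoding (hypothetical helper)."""
    if stream is None:
        stream = sys.stdout
    # `encoding` may be missing or None, e.g. when output is redirected.
    encoding = getattr(stream, 'encoding', None) or 'utf-8'
    # Replacement mode avoids a UnicodeEncodeError on limited consoles.
    return text.encode(encoding, 'replace')


class AsciiStream(object):
    """Stand-in for a console that only understands ASCII."""
    encoding = 'ascii'


ascii_bytes = console_encode(u'ndr\xe9 Le', AsciiStream())
```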
 === Filesystem objects ===
+=== Filesystem objects
 Whenever a file is read or written, some care should be taken about the content.
 Usually, when writing text data, we will choose to encode it using `'utf-8'`.
+When reading, it is context dependent: there are situations were we know for sure
+the data in the file is encoded using `'utf-8'`;
+When reading, it is context dependent: there are situations where we know for sure the data in the file is encoded using `'utf-8'`.
 We therefore usually do a `to_unicode(filecontent, 'utf-8')` in these situations.
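As a rough illustration of that write-then-read round trip, the sketch below encodes text to UTF-8 before writing and decodes it in replacement mode after reading. It uses plain `.decode('utf-8', 'replace')` as a stand-in for `to_unicode(filecontent, 'utf-8')`, and the temporary file is only there to keep the example self-contained.

```python
import os
import tempfile

text = u'ndr\xe9 Le'

fd, path = tempfile.mkstemp()
try:
    # Writing: encode explicitly and write bytes.
    with os.fdopen(fd, 'wb') as f:
        f.write(text.encode('utf-8'))

    # Reading: read bytes, then decode with the encoding we expect.
    with open(path, 'rb') as f:
        raw = f.read()
    decoded = raw.decode('utf-8', 'replace')
finally:
    os.remove(path)
```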
+There's an additional complexity here in that the filenames are also possibly
+using non-ascii characters. In Python, it should be safe to provide `unicode`
+objects for all the `os` filesystem related functions.
+There's an additional complexity here in that the filenames are also possibly using non-ascii characters. In Python, it should be safe to provide `unicode` objects for all the `os` filesystem related functions.
 Look also at r7360, r7361, r7362.
 …
 More information about how Python deals with Unicode at system boundaries can be found here: http://kofoto.rosdahl.net/wiki/UnicodeInPython.
+=== `versioncontrol` subsystem ===
+=== `versioncontrol` subsystem
 This is dependent on the backend.
+In Subversion, there are clear rules about the pathnames used
+by the SVN Bindings for Python: those should be UTF-8 encoded `str` objects.
+Therefore, `unicode` pathnames should be 'utf-8' encoded before
+being passed to the bindings, and pathnames returned by
+the bindings should be decoded using 'utf-8' before being
+returned to callers of the `versioncontrol` API.
+As noted above when talking about file contents, the node content
+can contain any kind of data, including binary data and therefore
+`Node.get_content().read()` returns a `str` object.
+Depending on the backend, some ''hints'' about the nature of the
+content (and eventually about the charset used if the content
+is text) can be given by the `Node.get_content_type()` method.
+The Mimeview component can be used in order to use those hints
+in a streamlined way.
+=== Generating content with !ClearSilver templates ===
+In Subversion, there are clear rules about the pathnames used by the SVN Bindings for Python: those should be UTF-8 encoded `str` objects.
+Therefore, `unicode` pathnames should be 'utf-8' encoded before being passed to the bindings, and pathnames returned by the bindings should be decoded using 'utf-8' before being returned to callers of the `versioncontrol` API.
+As noted above when talking about file contents, the node content can contain any kind of data, including binary data and therefore `Node.get_content().read()` returns a `str` object.
+Depending on the backend, some ''hints'' about the nature of the content (and possibly about the charset used if the content is text) can be given by the `Node.get_content_type()` method.
+The Mimeview component can be used in order to use those hints in a streamlined way.
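The pathname rule for the Subversion backend amounts to encoding on the way in and decoding on the way out. The two helpers below are hypothetical names used only to illustrate that boundary; they do not call the SVN bindings and are not taken from Trac's code.

```python
text_type = type(u'')  # `unicode` on Python 2, `str` on Python 3


def to_svn_path(path):
    """Encode a text pathname to UTF-8 bytes before calling the bindings."""
    if isinstance(path, text_type):
        return path.encode('utf-8')
    return path  # assumed to be UTF-8 encoded bytes already


def from_svn_path(raw_path):
    """Decode a UTF-8 pathname returned by the bindings."""
    return raw_path.decode('utf-8')


svn_path = to_svn_path(u'trunk/caf\xe9.txt')
round_trip = from_svn_path(svn_path)
```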
+=== Generating content with !ClearSilver templates
 The main "source" of generated text from Trac is the ClearSilver template engine.
+The !ClearSilver engine doesn't accept `unicode` objects, so those are
+converted to UTF-8 encoded `str` objects just before being inserted in the "HDF"
+(the data structure used by the template engine to fill in the templates).
+This is done automatically by our
+[source:/trunk/trac/web/clearsilver.py@head#L22 HDFWrapper] class, so anywhere else
+in the code one can safely associate unicode values to entries in `req.hdf`.
+The !ClearSilver engine doesn't accept `unicode` objects, so those are converted to UTF-8 encoded `str` objects just before being inserted in the "HDF" (the data structure used by the template engine to fill in the templates).
+This is done automatically by our [source:/trunk/trac/web/clearsilver.py@head#L22 HDFWrapper] class, so anywhere else in the code one can safely associate unicode values to entries in `req.hdf`.
 The body of those templates (the `.cs` files) must also use the UTF-8 encoding.
+=== The Web interface ===
+The information in the `Request` object (`req`) is converted to `unicode` objects,
+from 'UTF-8' encoded strings.
+The data sent out is generally converted to 'UTF-8' as well
+(like the headers), except if some charset information has
+been explicitely set in the `'Content-Type'` header.
+If this is the case, that encoding is used.
+=== Interaction with plugins ===
+Whenever Trac gets data from plugins, it must try to cope
+with `str` objects. Those might be 0.9 pre-unicode plugins
+which have not been migrated fully to 0.10 and beyond.
+== Questions/Suggestions... ==
+Feel free to correct me, ask questions, etc.; this is a Wiki. :)
+----
+Q: When dealing with plugins that weren't designed to be unicode friendly and used unicode in favour of to_unicode, what parts of the plugin should be updated, what should use to_unicode ? --JamesMills
+A: There shouldn't be any reason to replace a working call to `unicode()` by a call to `to_unicode()`, unless you specified the encoding, like in:
+=== The Web interface
+The information in the `Request` object (`req`) is converted to `unicode` objects, from 'UTF-8' encoded strings.
+The data sent out is generally converted to 'UTF-8' as well (like the headers), except if some charset information has been explicitly set in the `'Content-Type'` header. If this is the case, that encoding is used.
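The output rule above (UTF-8 unless the `'Content-Type'` header names another charset) could be sketched as follows. Both function names are hypothetical and the header parsing is deliberately simplistic; this is not how Trac's `Request` class is implemented.

```python
def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset parameter of a Content-Type value, if any."""
    for param in content_type.split(';')[1:]:
        name, _, value = param.strip().partition('=')
        if name.strip().lower() == 'charset' and value:
            return value.strip().strip('"')
    return default


def encode_body(text, content_type):
    """Encode outgoing text with the declared charset, else UTF-8."""
    return text.encode(charset_from_content_type(content_type), 'replace')


default_body = encode_body(u'ndr\xe9 Le', 'text/html')
latin_body = encode_body(u'ndr\xe9 Le', 'text/html; charset=iso-8859-15')
```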
+=== Interaction with plugins
+Whenever Trac gets data from plugins, it must try to cope with `str` objects. Those might be 0.9 pre-unicode plugins which have not been migrated fully to 0.10 and beyond.
+== Questions / Suggestions
+'''Q''': When dealing with plugins that weren't designed to be unicode friendly and used unicode in favour of to_unicode, what parts of the plugin should be updated, what should use to_unicode ? --JamesMills
+'''A''': There shouldn't be any reason to replace a working call to `unicode()` by a call to `to_unicode()`, unless you specified the encoding, like in:
 {{{
   ustring = unicode(data_from_trac, 'utf-8')
 }}}
+The above doesn't work if `data_from_trac` is actually an unicode object (you'd get `TypeError: decoding Unicode is not supported`).
+The above doesn't work if `data_from_trac` is actually a unicode object. You would get `TypeError: decoding Unicode is not supported`.
 In this case, either don't use `unicode` at all (0.10 and above only plugins) or replace it by `to_unicode` (0.9 and 0.10 plugins).
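The failure mentioned in this answer, and one way a plugin could guard against it without depending on Trac, can be shown with a short sketch. `as_unicode` is a hypothetical helper and `type(u'')` stands for `unicode` on Python 2 (`str` on Python 3, where the analogous `TypeError` is raised).

```python
text_type = type(u'')  # `unicode` on Python 2, `str` on Python 3

data_from_trac = u'ndr\xe9 Le'  # already a unicode object

# Passing an encoding together with an already-decoded object fails.
try:
    text_type(data_from_trac, 'utf-8')
    raised_type_error = False
except TypeError:
    raised_type_error = True


def as_unicode(data, charset='utf-8'):
    """Decode byte strings only; pass unicode objects through unchanged."""
    if isinstance(data, text_type):
        return data
    return data.decode(charset, 'replace')


from_text = as_unicode(data_from_trac)
from_bytes = as_unicode(b'ndr\xc3\xa9 Le')
```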