= Trac and Unicode: Development Guidelines =

Since [milestone:0.10], Trac uses `unicode` strings internally.
This document aims to clarify the implications of this choice.

== Unicode Mini Tutorial ==

In Python, there are two kinds of string classes, both subclasses of `basestring`:
 * `unicode` is a string datatype in which each character is a Unicode code point. [[br]]
   All common string operations (len, slicing, etc.) will operate on those code points,
   i.e. on "real" character boundaries, in any language.
 * `str` is a string datatype in which each character is a byte. [[br]]
   The string operations will operate on those bytes, and byte boundaries
   don't correspond to character boundaries in many common encodings.

Therefore, `unicode` can be seen as the ''safe side'' of textual data:
once you're in `unicode`, you know that your text data can contain any
kind of multilingual characters, and that you can safely manipulate it
in the expected way.

On the other hand, a `str` object can contain anything:
binary data, or some text using any conceivable encoding.
But if it is supposed to contain text, it is crucial to know
which encoding was used. That information must be known or inferred
from somewhere, which is not always a trivial thing to do.

In summary, manipulating `unicode` objects is not the problematic
part; the real question is how to go from the "wild" side
to the "safe" side...
Going from `unicode` to `str` is usually less problematic,
because you can always control what kind of encoding you
want to use for serializing your Unicode data.

What does all of the above look like in practice? Let's take an example (from ![1]):
 * `u"ndré Le"` is a `unicode` object containing the following sequence of
   Unicode code points:
   {{{
   >>> ["U-%04x" % ord(x) for x in u"ndré Le"]
   ['U-006e', 'U-0064', 'U-0072', 'U-00e9', 'U-0020', 'U-004c', 'U-0065']
   }}}
 * From there, you can easily transform that to a `str` object. [[br]]
   As we said above, we have the freedom to choose the encoding:
   * ''UTF-8'': it's a variable-length encoding which is widely understood,
     and in which ''any'' code point can be represented: [[br]]
     {{{
     >>> u"ndré Le".encode('utf-8')
     'ndr\xc3\xa9 Le'
     }}}
   * ''iso-8859-15'': it's a fixed-length encoding, which is commonly used
     in European countries. It ''happens'' that the `unicode` sequence we
     are interested in can be mapped to a sequence of bytes in this encoding.
     {{{
     >>> u"ndré Le".encode('iso-8859-15')
     'ndr\xe9 Le'
     }}}
   * ''ascii'': it is a very "poor" encoding, as there are only 128 Unicode
     code points (those in the U-0000 to U-007f range) that can be mapped to
     ascii. Therefore, trying to encode our sample sequence will fail,
     as it contains one code point outside of this range (U-00e9).
     {{{
     >>> u"ndré Le".encode('ascii')
     Traceback (most recent call last):
       File "<stdin>", line 1, in ?
     UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
     }}}
     It should be noted that this is also the error one would get by doing a
     coercion to `str` on that unicode object, because the system encoding
     is usually `'ascii'`:
     {{{
     >>> str(u"ndré Le")
     Traceback (...): # same as above
     >>> sys.getdefaultencoding()
     'ascii'
     }}}
   Lastly, there are ways to ''force'' a conversion to succeed, even
   if there's no way to encode some of the original unicode characters
   in the targeted charset. One possible way is to use replacement characters
   (a couple of other error handlers are shown right after this list):
   {{{
   >>> u"ndré Le".encode('ascii', 'replace')
   'ndr? Le'
   }}}
 * Now, you might wonder how to get a `unicode` object in the first place,
   starting from a string. [[br]]
   Well, from the above it should be obvious that it's absolutely necessary
   to ''know'' the encoding used in the `str` object, as either
   `'ndr\xe9 Le'` or `'ndr\xc3\xa9 Le'` could be decoded into the same
   unicode string `u"ndré Le"` (as a matter of fact, this is as important
   as knowing whether that stream of bytes has been gzipped or ROT13-ed...) [[br]]
 * Assuming we know the encoding of the `str` object, getting a `unicode`
   object out of it is trivial:
   {{{
   >>> unicode('ndr\xc3\xa9 Le', 'utf-8')
   u'ndr\xe9 Le'
   >>> unicode('ndr\xe9 Le', 'iso-8859-15')
   u'ndr\xe9 Le'
   }}}
   The above can be rewritten using the `str.decode()` method:
   {{{
   >>> 'ndr\xc3\xa9 Le'.decode('utf-8')
   u'ndr\xe9 Le'
   >>> 'ndr\xe9 Le'.decode('iso-8859-15')
   u'ndr\xe9 Le'
   }}}
 * But what happens if we make a bad guess?
   {{{
   >>> unicode('ndr\xc3\xa9 Le', 'iso-8859-15')
   u'ndr\xc3\xa9 Le'
   }}}
   No errors here, but the unicode string now contains garbage [[br]]
   (NB: as we have seen above, 'iso-8859-15' is a fixed-length encoding
   with a mapping defined for the whole 0..255 range, so decoding ''any''
   input with such an encoding will ''always'' succeed).
   {{{
   >>> unicode('ndr\xe9 Le', 'utf-8')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
   }}}
   Here, we clearly see that not all sequences of bytes can be interpreted as UTF-8...
 * What happens if we don't provide an encoding at all?
   {{{
   >>> unicode('ndr\xe9 Le')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
   >>> 'ndr\xe9 Le'.decode()
   Traceback (...) # same as above
   }}}
   This is symmetric with the encoding situation: `sys.getdefaultencoding()`
   (usually 'ascii') is used when no encoding is explicitly given.
 * Now, as with the encoding situation, there are ways to ''force'' the decoding
   process to succeed, even if we are wrong about the charset used by our `str` object.
   * One possibility would be to use replacement characters:
     {{{
     >>> unicode('ndr\xe9 Le', 'utf-8', 'replace')
     u'ndr\ufffde'
     }}}
   * The other one would be to choose an encoding guaranteed to succeed
     (such as ''iso-8859-1'' or ''iso-8859-15'', see above and the examples
     right after this list).
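
To round off the ''force'' techniques above, here are a few more examples. The first two show other standard Python 2 error handlers when encoding (`'ignore'` drops the offending characters, `'xmlcharrefreplace'` turns them into HTML/XML character references); the last one shows a decoding that is guaranteed to succeed, since ''iso-8859-1'' maps every possible byte value:
{{{
>>> u"ndré Le".encode('ascii', 'ignore')
'ndr Le'
>>> u"ndré Le".encode('ascii', 'xmlcharrefreplace')
'ndr&#233; Le'
>>> unicode('ndr\xe9 Le', 'iso-8859-1')
u'ndr\xe9 Le'
}}}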

This was a very rough mini-tutorial on the topic; hopefully
it's enough to get you in the general mood needed to read the
rest of the guidelines...

Of course, there are a lot of more in-depth tutorials on Unicode in general,
and Python/Unicode in particular, available on the Web:
 * ![1] http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf
 * ![2] http://www.amk.ca/python/howto/unicode
 * ![3] http://www.python.org/dev/peps/pep-0100

Now we can move on to the specifics of Trac programming...

== Trac utilities for Unicode ==

In order to handle unicode-related issues in a cohesive way,
a few utility functions are available; chief among them is our
swiss-army knife, the `to_unicode` function.

=== `to_unicode` ===

The `to_unicode` function was designed with flexibility and
robustness in mind: calling `to_unicode()` on anything should
never fail.

The use cases are as follows (see the sketch after this list):
 1. given any arbitrary object `x`, one can use `to_unicode(x)`
    as one would use `unicode(x)`, to convert it to a unicode string
 2. given a `str` object `s` which ''might'' be text, but for which
    we have no idea what encoding was used, one can use
    `to_unicode(s)` to convert it to a `unicode` object in a safe way. [[br]]
    A decoding using 'utf-8' will be attempted first,
    and if this fails, a decoding using `locale.getpreferredencoding()`
    will be done, in replacement mode.
 3. given a `str` object `s` for which we ''think'' we know the
    encoding `enc` used, we can do `to_unicode(s, enc)` to try
    to decode it using the `enc` encoding, in replacement mode. [[br]]
    A practical advantage of using `to_unicode(s, enc)` over
    `unicode(s, enc, 'replace')` is that the first form will revert to
    ''use case 2'', should `enc` be `None`.
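
The following is a minimal sketch of the fallback logic described above.
It is ''not'' the actual implementation (which lives in `trac.util` and
handles more corner cases), just an illustration of the behavior:
{{{
import locale

def to_unicode_sketch(x, charset=None):
    """Illustrative only: mimics the three use cases described above."""
    if not isinstance(x, str):
        return unicode(x)          # use case 1: not a byte string
    if charset:                    # use case 3: a presumed encoding
        return unicode(x, charset, 'replace')
    try:                           # use case 2: try utf-8 first...
        return unicode(x, 'utf-8')
    except UnicodeError:           # ...then the locale's preferred encoding
        return unicode(x, locale.getpreferredencoding(), 'replace')
}}}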

So, you may ask, if the above works in all situations, where should you
still use `unicode(x)` or `unicode(x, enc)`?

 * you could use `unicode(x)` when you know for sure that `x` is anything
   __but__ a `str` containing bytes in the 128..255 range; [[br]]
   It should be noted that `to_unicode(x)` simply does a `unicode(x)` call
   for anything which is not a `str` object, so there's virtually no
   performance penalty in using `to_unicode` instead (in particular,
   no exception handler is set up in this case).
 * use `unicode(buf, encoding)` when you know for sure what the
   encoding is. You will have a performance gain here over `to_unicode`,
   as no exception handler will be set up. Of course, the downside is that
   you will get a `UnicodeDecodeError` exception if your assumption
   was wrong. Therefore, use this form when you ''want'' to catch errors
   in this situation.
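
For instance, this is how the second case typically looks when you do
want the error (the function and error message below are made up for
the example):
{{{
from trac.core import TracError

def parse_utf8_file(raw_bytes):
    try:
        return unicode(raw_bytes, 'utf-8')  # we expect utf-8 here, period
    except UnicodeDecodeError:
        raise TracError("This file is not valid UTF-8")
}}}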

''FIXME: talk a bit about the other utilities''

''FIXME: Those utilities are currently placed in the `trac.util` package,
though I'm thinking about moving them into the `trac.util.text` module:
 * some of the corresponding unit tests are already in `trac.util.tests.text`
 * `to_unicode` could then be used in the `Markup` class''

=== The Mimeview component ===

The Mimeview component is the place where we collect some intelligence
about MIME types and charset auto-detection.

Most of the time, when we manipulate ''file content'', we only have partial
information about the nature of the data actually contained in those files.

This is true whether the file is located in the filesystem, in a version
control repository, or is streamed by the web browser (file upload).

The Mimeview component tries to associate a MIME type to a file's content,
based on the filename or, if that's not enough, on the file's content itself.
During this process, the charset used by the file ''might'' be inferred as well.

The API is quite simple:
 * `Mimeview.get_mimetype(self, filename, content)` [[br]]
   guesses the MIME type from the `filename` or, failing that, from the `content`
 * `Mimeview.get_charset(self, content, mimetype=None)` [[br]]
   guesses the charset from the `content` or from the `mimetype`
   (as the `mimetype` ''might'' convey charset information as well)
 * `Mimeview.to_unicode(self, content, mimetype=None, charset=None)` [[br]]
   uses the `to_unicode` utility, guessing the charset first if needed
 * ''`Mimeview.is_binary(self, filename, content, mimetype)`'' '''TBD''' [[br]]
   guesses whether the `content` is textual data or not
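
As an illustration, here is how a consumer of that API might chain those
calls to turn raw file content into `unicode` text (a hypothetical helper;
the `trac.mimeview.api` import path is assumed here):
{{{
from trac.mimeview.api import Mimeview

def file_content_to_unicode(env, filename, content):
    mimeview = Mimeview(env)
    mimetype = mimeview.get_mimetype(filename, content)
    # charset is left to None, so it will be guessed from the
    # content and/or the mimetype:
    return mimeview.to_unicode(content, mimetype)
}}}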


== Trac boundaries for Unicode Data ==

Most of the time, within Trac we assume that we are manipulating `unicode` objects.

But there are places where we need to deal with raw `str` objects, and therefore
we must know what to do, either when encoding to or when decoding from `str` objects.

=== Database Layer ===

Each database connector should configure its database driver
so that the `Cursor` objects are able to accept and will return
`unicode` objects.
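
Concretely, application code can then simply pass and expect `unicode`.
An illustrative snippet (the table layout and the `cursor` object are
simplified and made up here):
{{{
cursor.execute("INSERT INTO wiki (name, text) VALUES (%s, %s)",
               (u"SandBox", u"caf\xe9"))       # unicode goes in...
cursor.execute("SELECT text FROM wiki WHERE name=%s", (u"SandBox",))
text = cursor.fetchone()[0]
assert isinstance(text, unicode)               # ...and unicode comes out
}}}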

=== Filesystem objects ===

Whenever a file is read or written, some care should be taken about the content.
Usually, when writing text data, we will choose to encode it using `'utf-8'`.
When reading, it is context dependent: there are situations where we know for sure
that the data in the file is encoded using `'utf-8'`;
we therefore usually do a `to_unicode(filecontent, 'utf-8')` in these situations.
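
A minimal sketch of both directions, under those conventions
(the helper names are made up):
{{{
from trac.util import to_unicode

def write_text_file(path, text):
    f = open(path, 'wb')
    try:
        f.write(text.encode('utf-8'))   # serialize unicode as utf-8
    finally:
        f.close()

def read_text_file(path):
    f = open(path, 'rb')
    try:
        # we know (by convention) that we wrote utf-8 in there:
        return to_unicode(f.read(), 'utf-8')
    finally:
        f.close()
}}}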

There's an additional complexity here, in that the filenames themselves may
contain non-ascii characters. In Python, it should be safe to provide `unicode`
objects to all the `os` filesystem-related functions.
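
For example, in Python 2, passing a `unicode` path to `os.listdir` makes it
return `unicode` filenames where possible (the path below is hypothetical):
{{{
import os

for filename in os.listdir(u'/somewhere/attachments'):
    # usually unicode; names that cannot be decoded using the
    # filesystem encoding are still returned as str objects
    print repr(filename)
}}}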

=== `versioncontrol` subsystem ===

This is dependent on the backend.

In Subversion, there are clear rules about the pathnames used
by the SVN bindings for Python: those should be UTF-8 encoded `str` objects.

Therefore, `unicode` pathnames should be 'utf-8' encoded before
being passed to the bindings, and pathnames returned by
the bindings should be decoded using 'utf-8' before being
returned to callers of the `versioncontrol` API.
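
In code, that convention boils down to a pair of conversions at the
bindings boundary (the helper names are made up):
{{{
def to_svn_path(path):
    """unicode -> utf-8 encoded str, for handing a path to the bindings."""
    return path.encode('utf-8')

def from_svn_path(path):
    """utf-8 encoded str from the bindings -> unicode, for our callers."""
    return path.decode('utf-8')
}}}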

As noted above when talking about file contents, a node's content
can contain any kind of data, including binary data, and therefore
`Node.get_content().read()` returns a `str` object.

Depending on the backend, some ''hints'' about the nature of the
content (and possibly about the charset used, if the content
is text) can be given by the `Node.get_content_type()` method.

The Mimeview component can be used in order to exploit those hints
in a streamlined way.

=== Generating content with !ClearSilver templates ===

The main "source" of generated text in Trac is the !ClearSilver template engine.
The !ClearSilver engine doesn't accept `unicode` objects, so those are
converted to UTF-8 encoded `str` objects just before being inserted into the "HDF"
(the data structure used by the template engine to fill in the templates).

The body of those templates (the `.cs` files) must also use this encoding.
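
By way of illustration, filling the HDF might look like this (using the
raw !ClearSilver `setValue` API directly; the real code goes through
Trac's own HDF wrapper):
{{{
title = u"Aper\xe7u"                          # unicode inside Trac
hdf.setValue('title', title.encode('utf-8'))  # utf-8 str for ClearSilver
}}}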

=== The Web interface ===

The information in the `Request` object (`req`) is converted to `unicode` objects,
from 'UTF-8' encoded strings.

The data sent out is generally converted to 'UTF-8' as well
(like the headers), except if some charset information has
been explicitly set in the `'Content-Type'` header.
If this is the case, that encoding is used.
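
A sketch of the output side (assuming the 0.10-era `Request` methods;
the charset and text below are made up):
{{{
req.send_response(200)
req.send_header('Content-Type', 'text/plain;charset=iso-8859-15')
req.end_headers()
req.write(u'Aper\xe7u'.encode('iso-8859-15'))  # honor the declared charset
}}}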

=== The console ===

When reading from the console, we assume the text is encoded
using `sys.stdin.encoding`.

When writing to the console, we assume that `sys.stdout.encoding`
should be used.
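
A minimal sketch of those two conventions (the `or 'ascii'` fallbacks
are an assumption for streams that report no encoding, e.g. when piped):
{{{
import sys

line = raw_input()                       # a str object from the console
text = unicode(line, sys.stdin.encoding or 'ascii', 'replace')
data = text.encode(sys.stdout.encoding or 'ascii', 'replace')
sys.stdout.write(data + '\n')
}}}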

''FIXME: and logging?''

=== Interaction with plugins ===

Whenever Trac gets data from plugins, it must be prepared to cope
with `str` objects, as those may come from 0.9-era pre-unicode plugins
which have not been fully migrated to 0.10 and beyond.
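
Defensively, this means passing any value obtained from a plugin through
`to_unicode` before treating it as text (the `provider` object below is
hypothetical):
{{{
from trac.util import to_unicode

label = to_unicode(provider.get_label())  # safe whether str or unicode
}}}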

== Questions/Suggestions... ==

Sorry, there are certainly still a ton of typos in there; hopefully
no more serious errors. But I had to get a first draft of this out.

Feel free to correct me, ask questions, etc.; this is a Wiki :)