Changes between Version 11 and Version 12 of TracDev/UnicodeGuidelines
Timestamp: Feb 23, 2016, 9:57:51 PM
[[PageOutline(2-3,Contents)]]

= Trac and Unicode: Development Guidelines

Since [milestone:0.10], Trac uses `unicode` strings internally.
This document clarifies what the implications of this change are.

== Unicode Mini Tutorial

In Python, there are two kinds of string types, both subclasses of `basestring`:
 * `unicode` is a string type in which each character is a Unicode code point. [[br]]
   All common string operations (len, slicing, etc.) will operate on those code points.
 * `str` is a string of bytes: it can contain anything, binary data or text in ''some'' (unspecified) encoding.

`unicode` provides a real representation of textual data: once you're in `unicode`, you know that your text data can contain any kind of multilingual characters, and that you can safely manipulate it the expected way.

On the other hand, a `str` object can be used to contain anything: binary data, or some text using any conceivable encoding. But if it's supposed to contain text, it is crucial to know which encoding was used. That knowledge must be known or inferred from somewhere, which is not always trivial.

In summary, it is not manipulating `unicode` objects that is problematic (it is not), but how to go from the "wild" side (`str`) to the "safe" side (`unicode`). Going from `unicode` to `str` is usually less problematic, because you can always control what kind of encoding you want to use for serializing your Unicode data.

What does all the above look like in practice? Let's take an example (from ![1]):
{{{
...
}}}
 * From there, you can easily transform that to a `str` object. [[br]]
   As we said above, we can choose the encoding:
   * ''UTF-8'': it's a variable length encoding which is widely understood,
     and in which ''any'' code point can be represented: [[br]]
     [...]
 * Now, you might wonder how to get a `unicode` object in the first place,
   starting from a string. [[br]]
   For this it is critical to ''know'' what encoding was used in the `str` object,
   as either `'ndr\xe9 Le'` or `'ndr\xc3\xa9 Le'` could be decoded into the same
   unicode string `u"ndré Le"` (it is in fact as important as knowing whether
   that stream of bytes has been gzipped or ROT13-ed.) [[br]]
 * Assuming we know the encoding of the `str` object, getting a `unicode`
   object out of it is trivial; but what if we assume the ''wrong'' encoding,
   say 'iso-8859-15' for UTF-8 encoded data?
   {{{
   ...
   }}}
   No errors here, but the unicode string now contains garbage. [[br]]
   NB: as we have seen above, 'iso-8859-15' is a fixed-byte encoding
   with a mapping defined for all the 0..255 range, so decoding ''any''
   input assuming such an encoding will ''always'' succeed.
   {{{
   >>> unicode('ndr\xe9 Le', 'utf-8')
   Traceback (most recent call last):
     ...
   UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
   }}}
   Here, we clearly see that not all sequences of bytes can be interpreted as UTF-8.
 * What happens if we don't provide an encoding at all?
   {{{
   >>> unicode('ndr\xe9 Le')
   Traceback (...) # same as above
   }}}
   This is analogous to the encoding situation: the `sys.getdefaultencoding()` is used
   (usually 'ascii') when no encoding is explicitly given.
 * Now, as with the encoding situation, there are ways to ''force'' the decoding
   process to succeed, even if we are wrong about the charset used by our `str` object:
   either by using a non-strict error handling mode, or by decoding with a fixed-byte
   encoding (as ''iso-8859-1'' or ''iso-8859-15'', see above), as sketched below.
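As a minimal illustration of that last point, in plain Python 2 (this is not the original example from ![1]): passing a non-strict error handler such as `'replace'` or `'ignore'` to `unicode` makes the decoding always succeed, at the price of losing or mangling data.
{{{
>>> s = 'ndr\xc3\xa9 Le'            # the UTF-8 bytes for u'ndr\xe9 Le'
>>> unicode(s, 'ascii', 'replace')  # undecodable bytes become U+FFFD
u'ndr\ufffd\ufffd Le'
>>> unicode(s, 'ascii', 'ignore')   # undecodable bytes are silently dropped
u'ndr Le'
>>> unicode(s, 'iso-8859-1')        # always "succeeds", but yields mojibake
u'ndr\xc3\xa9 Le'
}}}
In all three cases no exception is raised, but information is lost or garbled, which is why knowing the real encoding remains preferable.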
There are more in-depth tutorials on Unicode in general and Python/Unicode in particular available:
 * ![1] http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf
 * ![2] http://www.python.org/dev/peps/pep-0100

Now we can move to the specifics of Trac programming.

== Trac utilities for Unicode

In order to handle unicode related issues in a cohesive way, there are a few utility functions that can be used; the main one is our swiss-army knife, the `to_unicode` function.

=== `to_unicode`

The `to_unicode` function was designed with flexibility and robustness in mind: calling `to_unicode()` on anything should never fail.

The use cases are as follows:
 1. given any arbitrary object `x`, one could use `to_unicode(x)`
    as one would use `unicode(x)` to convert it to a unicode string
 1. given a `str` object `s`, which ''might'' be text but for which
    we have no idea which encoding was used, one can use
    `to_unicode(s)` to convert it to a `unicode` object in a safe way. [[br]]
    A decoding using 'utf-8' will be tried first,
    and if this fails, a decoding using the `locale.getpreferredencoding()`
    will be done, in replacement mode.
 1. given a `str` object `s`, for which we ''think'' we know the
    encoding `enc` used, we can do `to_unicode(s, enc)` to try
    to decode it using the `enc` encoding, in replacement mode. [[br]]
    Note that this behaves like the ''use case 2'', should `enc` be `None`.

So, if the above works in all situations, where should you still use `unicode(x)` or `unicode(x,enc)`?

 * you could use `unicode(x)` when you know for sure that x is anything
   but a `str` object
 [...]

There are a few other unicode related utilities besides `to_unicode` in the [source:/trunk/trac/util/text.py trac.util.text] module.
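As a rough sketch of the three use cases above, assuming the behaviour just described (the results given in the comments are what one would expect, not a verified transcript):
{{{
from trac.util.text import to_unicode

to_unicode(42)                          # any object: behaves like unicode(42), giving u'42'
to_unicode('ndr\xc3\xa9 Le')            # no charset given: 'utf-8' is tried first, giving u'ndr\xe9 Le'
to_unicode('ndr\xe9 Le', 'iso-8859-1')  # charset given, giving u'ndr\xe9 Le'
to_unicode('ndr\xe9 Le', 'utf-8')       # wrong charset: no exception, replacement characters instead
}}}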
=== The Mimeview component

The Mimeview component is the place where we collect some intelligence about MIME type and charset auto-detection.

Most of the time, when we manipulate ''file content'', we only have partial information about the nature of the data actually contained in those files.

This is true whether the file is located in the filesystem, in a version control repository or is streamed by the web browser (file upload).

The Mimeview component tries to associate a MIME type to a file content, based on the filename or, if that's not enough, on the file's content itself. During this process, the charset used by the file ''might'' be inferred as well.

The API is quite simple:
 [...]
 * `Mimeview.to_unicode(self, content, mimetype=None, charset=None)` [[br]]
   uses the `to_unicode` utility and guesses the charset if needed

'''Note''': the Mimeview API is currently being overhauled and will most probably change in the next releases (#3332).

== Trac boundaries for Unicode Data

Most of the time, within Trac we assume that we are manipulating `unicode` objects.

But there are places where we need to deal with raw `str` objects, and therefore we must know what to do, either when encoding to or when decoding from `str` objects.

=== Database Layer

Each database connector should configure its database driver so that the `Cursor` objects are able to accept and will return `unicode` objects. This sometimes involves writing a wrapper class for the original Cursor class. See for example [source:/trunk/trac/db/sqlite_backend.py@head#L58 SQLiteUnicodeCursor], for pysqlite1.

=== The console

When reading from the console, we assume the text is encoded using `sys.stdin.encoding`.

When writing to the console, we assume that `sys.stdout.encoding` should be used.

The logging API seems to handle `unicode` objects just fine.

=== Filesystem objects

Whenever a file is read or written, some care should be taken about the content.
Usually, when writing text data, we will choose to encode it using `'utf-8'`.
When reading, it is context dependent: there are situations where we know for sure that the data in the file is encoded using `'utf-8'`.
We therefore usually do a `to_unicode(filecontent, 'utf-8')` in these situations.
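For illustration, such a round trip could look like the following sketch (the helper names are made up; only the `'utf-8'` convention and the `to_unicode` call come from the guideline above):
{{{
from trac.util.text import to_unicode

def write_text_file(path, text):
    # `text` is expected to be a `unicode` object; serialize it as UTF-8
    f = open(path, 'wb')
    try:
        f.write(text.encode('utf-8'))
    finally:
        f.close()

def read_text_file(path):
    # we wrote the file ourselves, so we know its content is UTF-8 encoded
    f = open(path, 'rb')
    try:
        return to_unicode(f.read(), 'utf-8')
    finally:
        f.close()
}}}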
There's an additional complexity here in that the filenames themselves may contain non-ASCII characters. In Python, it should be safe to provide `unicode` objects for all the `os` filesystem related functions.

Look also at r7360, r7361, r7362.

More information about how Python deals with Unicode at system boundaries can be found here: http://kofoto.rosdahl.net/wiki/UnicodeInPython.

=== `versioncontrol` subsystem

This is dependent on the backend.

In Subversion, there are clear rules about the pathnames used by the SVN bindings for Python: those should be UTF-8 encoded `str` objects.

Therefore, `unicode` pathnames should be 'utf-8' encoded before being passed to the bindings, and pathnames returned by the bindings should be decoded using 'utf-8' before being returned to callers of the `versioncontrol` API.
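A minimal sketch of that convention (the helper names here are hypothetical, not necessarily those used in the actual Subversion backend):
{{{
def _to_svn(path):
    # the SVN bindings expect UTF-8 encoded `str` pathnames
    if isinstance(path, unicode):
        return path.encode('utf-8')
    return path

def _from_svn(path):
    # pathnames coming back from the bindings are UTF-8 encoded `str` objects
    return path.decode('utf-8')
}}}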
As noted above when talking about file contents, the node content can contain any kind of data, including binary data, and therefore `Node.get_content().read()` returns a `str` object.

Depending on the backend, some ''hints'' about the nature of the content (and possibly about the charset used, if the content is text) can be given by the `Node.get_content_type()` method.

The Mimeview component can be used in order to use those hints in a streamlined way.

=== Generating content with !ClearSilver templates

The main "source" of generated text from Trac is the !ClearSilver template engine.
The !ClearSilver engine doesn't accept `unicode` objects, so those are converted to UTF-8 encoded `str` objects just before being inserted in the "HDF" (the data structure used by the template engine to fill in the templates).
This is done automatically by our [source:/trunk/trac/web/clearsilver.py@head#L22 HDFWrapper] class, so anywhere else in the code one can safely associate unicode values to entries in `req.hdf`.

The body of those templates (the `.cs` files) must also use the UTF-8 encoding.

=== The Web interface

The information in the `Request` object (`req`) is converted to `unicode` objects, from 'UTF-8' encoded strings.

The data sent out is generally converted to 'UTF-8' as well (like the headers), except if some charset information has been explicitly set in the `'Content-Type'` header. In that case, that encoding is used.

=== Interaction with plugins

Whenever Trac gets data from plugins, it must try to cope with `str` objects. Those might come from 0.9 pre-unicode plugins which have not been fully migrated to 0.10 and beyond.

== Questions / Suggestions

'''Q''': When dealing with plugins that weren't designed to be unicode friendly and used `unicode` in favour of `to_unicode`, what parts of the plugin should be updated, what should use `to_unicode`? --JamesMills

'''A''': There shouldn't be any reason to replace a working call to `unicode()` by a call to `to_unicode()`, unless you specified the encoding, like in:
{{{
ustring = unicode(data_from_trac, 'utf-8')
}}}

The above doesn't work if `data_from_trac` is actually a unicode object. You would get `TypeError: decoding Unicode is not supported`.

In this case, either don't use `unicode` at all (for plugins targeting 0.10 and above only) or replace it with `to_unicode` (for plugins that must also run on 0.9).
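In other words, the portable pattern is simply the following (a sketch, reusing the `data_from_trac` placeholder from the answer above; the import path is the trunk location mentioned earlier):
{{{
from trac.util.text import to_unicode

# accepts both a UTF-8 encoded `str` and an already decoded `unicode` object
ustring = to_unicode(data_from_trac)
}}}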