#10538 closed defect (fixed)
restructuredtext renderer confused by utf-8 BOM
Reported by: | anonymous | Owned by: | Christian Boos |
---|---|---|---|
Priority: | normal | Milestone: | 0.12.4 |
Component: | rendering | Version: | 0.10 |
Severity: | normal | Keywords: | unicode bom |
Cc: | zooko@… | Branch: | |
Release Notes: | | | |
API Changes: | | | |
Internal Changes: | | | |
Description
If an .rst file starts with a UTF-8 byte order mark (the bytes 0xEF 0xBB 0xBF), Trac misrenders the first line of the file because it counts the length of that line incorrectly. (For example, if the length of a "======" line differs from the length of the following title and the "======" line after it, the title won't be turned into a headline.) Trac should probably check for a byte order mark and remove it before rendering.
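As a rough illustration of the "check for a byte order marker and remove it" idea, here is a minimal Python 3 sketch that works on the raw bytes before any decoding. The function name `strip_utf8_bom` is hypothetical and is not part of Trac; only `codecs.BOM_UTF8` comes from the standard library.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return `raw` without a leading UTF-8 BOM (EF BB BF), if one is present.

    Hypothetical helper for illustration only; not Trac's actual code.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# The first bytes of the file from this report, with and without the BOM.
with_bom = b"\xef\xbb\xbf============\nKnown Issues\n============\n"
cleaned = strip_utf8_bom(with_bom)
```

Content that has no BOM is returned unchanged, so the helper could be applied unconditionally.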
Attachments (0)
Change History (7)
comment:1 by , 13 years ago
Cc: | added |
---|
comment:2 by , 13 years ago
comment:3 by , 13 years ago
I would have thought it was a problem with docutils, but docutils doesn't misparse the same file:
zooko@tahoe-lafs:~/playground/tahoe-lafs/1.8.3/docs$ hexdump -C known_issues.rst | head -4
00000000  ef bb bf 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 0a  |...============.|
00000010  4b 6e 6f 77 6e 20 49 73  73 75 65 73 0a 3d 3d 3d  |Known Issues.===|
00000020  3d 3d 3d 3d 3d 3d 3d 3d  3d 0a 0a 2a 20 60 4f 76  |=========..* `Ov|
00000030  65 72 76 69 65 77 60 5f  0a 2a 20 60 49 73 73 75  |erview`_.* `Issu|
zooko@tahoe-lafs:~/playground/tahoe-lafs/1.8.3/docs$ rst2html known_issues.rst > ~/public_html/known_issues.html
zooko@tahoe-lafs:~/playground/tahoe-lafs/1.8.3/docs$ elinks -dump ~/public_html/known_issues.html | head -4
Known Issues
* [1]Overview
* Issues in Tahoe-LAFS v1.8.2, released 2011-01-30
zooko@tahoe-lafs:~/playground/tahoe-lafs/1.8.3/docs$ rst2html --version
rst2html (Docutils 0.6 [release], Python 2.6.5, on linux2)
In contrast, the trac rendering to html of the same file on the same machine starts with:
============
Known Issues
============
Overview
(Visible here: https://tahoe-lafs.org/trac/tahoe-lafs/browser/1.8.3/docs/known_issues.rst )
Which suggests that trac doesn't realize that those first three lines are actually the same length as one another. If I insert a couple of blank lines before the first "============" line, then the mis-rendering from trac stops happening.
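The suspected length mismatch can be reproduced outside Trac. The sketch below (Python 3, using the bytes from the hexdump above) decodes the start of the file once with the plain `utf-8` codec, which keeps the BOM as the character U+FEFF, and once with `utf-8-sig`, which drops it. Whether Trac's renderer counts the line in exactly this way is an assumption; the helper name `line_lengths` is made up for the example.

```python
# First three lines of known_issues.rst, as shown in the hexdump.
raw = b"\xef\xbb\xbf============\nKnown Issues\n============\n"


def line_lengths(text: str) -> list:
    """Return the character length of each line in `text`."""
    return [len(line) for line in text.splitlines()]


# Plain "utf-8" keeps the BOM as an invisible U+FEFF on the first line.
with_bom = line_lengths(raw.decode("utf-8"))

# "utf-8-sig" strips a leading BOM while decoding.
without_bom = line_lengths(raw.decode("utf-8-sig"))

print(with_bom)     # [13, 12, 12]
print(without_bom)  # [12, 12, 12]
```

With the BOM left in place, the overline is one character longer than the title and the underline, which is consistent with the title not being recognized as a headline.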
Now about the encoding: isn't the point of having a BOM in the file that the parser can then know what encoding the file is in even if it isn't told out-of-band? But anyway, even if it isn't told out-of-band, or if it is told the incorrect encoding out-of-band, a good parser should probably (?) pop off any leading UTF-8 BOM no matter what, before parsing.
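To make the "BOM as an in-band encoding hint" idea concrete, here is a small Python 3 sketch that lets a leading BOM override whatever encoding was supplied out-of-band. The name `decode_with_bom` and the `fallback` parameter are hypothetical, and only the UTF-8 and UTF-16 marks are handled; this is not how Trac or docutils actually do it.

```python
import codecs

# (BOM bytes, codec name) pairs. UTF-32 is deliberately left out of this
# sketch: its little-endian BOM begins with the UTF-16-LE BOM, so a fuller
# version would have to test the longer marks first.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode `raw`, trusting a leading BOM over the `fallback` encoding.

    The BOM itself is removed, so it never reaches the parser.
    """
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    return raw.decode(fallback)


# Even with a wrong out-of-band hint, the BOM decides the encoding.
text = decode_with_bom(b"\xef\xbb\xbfKnown Issues", fallback="latin-1")
```

Here the wrong `latin-1` hint is ignored because the UTF-8 BOM is found first, and the returned text starts directly with the title.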
Anyway, in answer to your question, I'm using darcs, [trac] default_charset = utf-8. I'm not 100% sure that means the trac darcs plugin is giving the right encoding to the trac rst renderer. Lele Gaifax (author of trac darcs) would probably be able to tell us for sure.
Thanks for your help!
comment:4 by , 13 years ago
Component: | general → rendering |
---|---|
Keywords: | unicode bom added |
Milestone: | → 0.12.4 |
Owner: | set to |
Status: | new → assigned |
Version: | → 0.10 |
Ok, docutils first removes the BOM but only if it had to try to decode the str itself. If it is given unicode, it takes it wholesale and then doesn't know what to do with the BOM.
So at the very least we should remove the BOM as well in this case… and maybe even in our content_to_unicode utility function:
trac/mimeview/api.py
diff -r e22b02cdc589 trac/mimeview/api.py
--- a/trac/mimeview/api.py
+++ b/trac/mimeview/api.py
@@ -451,11 +451,17 @@
         return None
 
 def content_to_unicode(env, content, mimetype):
-    """Retrieve an `unicode` object from a `content` to be previewed"""
+    """Retrieve an `unicode` object from a `content` to be previewed.
+
+    In case the raw content had an unicode BOM, we remove it.
+    """
     mimeview = Mimeview(env)
     if hasattr(content, 'read'):
         content = content.read(mimeview.max_preview_size)
-    return mimeview.to_unicode(content, mimetype)
+    u = mimeview.to_unicode(content, mimetype)
+    if u and u[0] == u'\ufeff':
+        u = u[1:]
+    return u
 
 
 class IHTMLPreviewRenderer(Interface):
The fix will be for 0.12.4 though, unless people give this a lot of testing ;-)
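For anyone who wants to try the approach from the patch without a Trac environment, the same check can be sketched as a standalone Python 3 function operating on already-decoded text. The name `strip_unicode_bom` is hypothetical; the patch above applies the equivalent test inside content_to_unicode.

```python
BOM = "\ufeff"  # the BOM as it appears once the content has been decoded


def strip_unicode_bom(text):
    """Drop a single leading U+FEFF from decoded text, if present.

    Mirrors the `if u and u[0] == u'\\ufeff'` check from the patch, so an
    empty string or None is passed through unchanged.
    """
    if text and text[0] == BOM:
        return text[1:]
    return text


# Decoding with plain "utf-8" leaves the BOM in place; strip it afterwards.
decoded = b"\xef\xbb\xbf============\n".decode("utf-8")
cleaned = strip_unicode_bom(decoded)
```

Only the first character is examined, so a U+FEFF appearing later in the content is left alone.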
comment:5 by , 13 years ago
While we're riding the Trac UTF8 BOM mindshare train, perhaps #10441 could get a comment?
comment:7 by , 13 years ago
API Changes: | modified (diff) |
---|
Isn't that more a problem with docutils itself?
[hg] encoding
)