
Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#10538 closed defect (fixed)

restructuredtext renderer confused by utf-8 BOM

Reported by: anonymous
Owned by: Christian Boos
Priority: normal
Milestone: 0.12.4
Component: rendering
Version: 0.10
Severity: normal
Keywords: unicode bom
Cc: zooko@…
Branch:
Release Notes:
API Changes:

trac.mimeview.api.content_to_unicode will remove the leading BOM character in the content if present (r10967)

Internal Changes:

Description

If an .rst file starts with a UTF-8 byte order mark (the bytes 0xEF 0xBB 0xBF), then trac misrenders the first line of the file, because it counts the length of that line incorrectly. (For example, if the length of a "======" line differs from the length of the subsequent title and following "======" line, then it won't be turned into a headline.) Trac should probably check for a byte order mark and remove it before rendering.
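The length mismatch described above can be reproduced outside Trac. The following is a minimal sketch in Python 3, assuming a made-up three-line title block (the sample bytes are illustrative, not taken from a real file). It only shows how a decoded BOM inflates the length of the first line; it does not call Trac or docutils.

```python
import codecs

# Hypothetical sample: an .rst title block preceded by a UTF-8 BOM.
raw = codecs.BOM_UTF8 + b"=" * 12 + b"\nKnown Issues\n" + b"=" * 12 + b"\n"

# Plain "utf-8" decoding keeps the BOM as the character U+FEFF.
text = raw.decode("utf-8")
overline, title, underline = text.splitlines()

overline_len = len(overline)    # counts the invisible U+FEFF as well
underline_len = len(underline)  # only the "=" characters

# The "utf-8-sig" codec drops a leading BOM while decoding.
clean_overline = raw.decode("utf-8-sig").splitlines()[0]
```

With the BOM still present, the overline is one character longer than the underline, so the two no longer match.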

Attachments (0)

Change History (7)

comment:1 by zooko@…, 12 years ago

Cc: zooko@… added

comment:2 by Christian Boos, 12 years ago

Isn't that more a problem with docutils itself?

  • which version of docutils do you use?
  • what encoding is Trac using when reading that file? (if you use svn and have no charset specified in the svn:mime-type property, then check your [trac] default_charset setting; if using Mercurial, then check [hg] encoding)

comment:3 by zooko@…, 12 years ago

I would have thought it was a problem with docutils, but docutils doesn't misparse the same file:

zooko@tahoe-lafs:~/playground/tahoe-lafs/1.8.3/docs$ hexdump -C known_issues.rst | head -4
00000000  ef bb bf 3d 3d 3d 3d 3d  3d 3d 3d 3d 3d 3d 3d 0a  |...============.|
00000010  4b 6e 6f 77 6e 20 49 73  73 75 65 73 0a 3d 3d 3d  |Known Issues.===|
00000020  3d 3d 3d 3d 3d 3d 3d 3d  3d 0a 0a 2a 20 60 4f 76  |=========..* `Ov|
00000030  65 72 76 69 65 77 60 5f  0a 2a 20 60 49 73 73 75  |erview`_.* `Issu|
zooko@tahoe-lafs:~/playground/tahoe-lafs/1.8.3/docs$ rst2html known_issues.rst > ~/public_html/known_issues.html
zooko@tahoe-lafs:~/playground/tahoe-lafs/1.8.3/docs$ elinks -dump ~/public_html/known_issues.html | head -4
                                  Known Issues

     * [1]Overview
     * Issues in Tahoe-LAFS v1.8.2, released 2011-01-30
zooko@tahoe-lafs:~/playground/tahoe-lafs/1.8.3/docs$ rst2html --version
rst2html (Docutils 0.6 [release], Python 2.6.5, on linux2)

In contrast, the trac rendering to html of the same file on the same machine starts with:

============ Known Issues ============

    Overview

(Visible here: https://tahoe-lafs.org/trac/tahoe-lafs/browser/1.8.3/docs/known_issues.rst )

Which suggests that trac doesn't realize that those first three lines are actually the same length as one another. If I insert a couple of blank lines before the first "============" line, then the mis-rendering from trac stops happening.

Now about the encoding: isn't the point of having a BOM in the file that the parser can know what encoding the file is in even if it isn't told out-of-band? But anyway, even if it isn't told out-of-band, or if it is told the incorrect encoding out-of-band, a good parser should probably (?) pop off any leading UTF-8 BOM no matter what, before parsing.
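As a rough illustration of that idea, here is a sketch in Python 3 of a decoder that lets a leading BOM override the encoding it was told out-of-band. The helper name `decode_with_bom` and its `fallback` parameter are hypothetical; this is not how Trac or docutils actually chooses an encoding.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data, fallback="utf-8"):
    """Hypothetical helper: decode bytes, trusting a leading BOM over `fallback`."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the BOM bytes, then decode with the encoding it announces.
            return data[len(bom):].decode(encoding)
    # No BOM found: use the encoding supplied out-of-band.
    return data.decode(fallback)
```

For example, `decode_with_bom(codecs.BOM_UTF8 + b"abc", fallback="latin-1")` decodes as UTF-8 and returns the text without the BOM, even though the caller named a different encoding.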

Anyway, in answer to your question, I'm using darcs, with [trac] default_charset = utf-8. I'm not 100% sure that means the trac darcs plugin is giving the right encoding to the trac rst renderer. Lele Gaifax (author of trac darcs) would probably be able to tell us for sure.

Thanks for your help!

comment:4 by Christian Boos, 12 years ago

Component: general → rendering
Keywords: unicode bom added
Milestone: set to 0.12.4
Owner: set to Christian Boos
Status: new → assigned
Version: set to 0.10

Ok, docutils first removes the BOM but only if it had to try to decode the str itself. If it is given unicode, it takes it wholesale and then doesn't know what to do with the BOM.

So at the very least we should remove the BOM as well in this case… and maybe even in our content_to_unicode utility function:

  • trac/mimeview/api.py

    diff -r e22b02cdc589 trac/mimeview/api.py
    --- a/trac/mimeview/api.py
    +++ b/trac/mimeview/api.py
    @@ -451,11 +451,17 @@
             return None
     
     def content_to_unicode(env, content, mimetype):
    -    """Retrieve an `unicode` object from a `content` to be previewed"""
    +    """Retrieve an `unicode` object from a `content` to be previewed.
    +
    +    In case the raw content had an unicode BOM, we remove it.
    +    """
         mimeview = Mimeview(env)
         if hasattr(content, 'read'):
             content = content.read(mimeview.max_preview_size)
    -    return mimeview.to_unicode(content, mimetype)
    +    u = mimeview.to_unicode(content, mimetype)
    +    if u and u[0] == u'\ufeff':
    +        u = u[1:]
    +    return u
     
     
     class IHTMLPreviewRenderer(Interface):

The fix will be for 0.12.4 though, unless people give this a lot of testing ;-)
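The BOM check in the patch can be tried in isolation. Below is a minimal standalone sketch in Python 3 of the same test on an already-decoded string. The function name `strip_leading_bom` is hypothetical and not part of Trac's API; the real change lives in `content_to_unicode`.

```python
def strip_leading_bom(text):
    """Return `text` without a leading U+FEFF, if there is one.

    Hypothetical standalone helper mirroring the check in the patch.
    """
    # `text and ...` guards against empty strings (and None) before indexing.
    if text and text[0] == "\ufeff":
        return text[1:]
    return text


# Only a BOM at position 0 is removed; one in the middle is left alone.
with_bom = "\ufeff============\nKnown Issues\n============\n"
stripped = strip_leading_bom(with_bom)
```

After stripping, the first and third lines of the sample have the same length again, which is what the reStructuredText title syntax needs.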

comment:5 by lkraav <leho@…>, 12 years ago

While we're riding the Trac UTF8 BOM mindshare train, perhaps #10441 could get a comment?

comment:6 by Christian Boos, 12 years ago

Resolution: fixed
Status: assigned → closed

Fix applied in r10967.

comment:7 by Christian Boos, 12 years ago

API Changes: modified (diff)
