Edgewall Software
Modify

Opened 8 years ago

Closed 7 years ago

Last modified 5 years ago

#12847 closed defect (fixed)

TypeError: unicode argument expected, got 'str'

Reported by: Ryan J Ollos Owned by: Jun Omae
Priority: normal Milestone: 1.3.3
Component: wiki system Version:
Severity: normal Keywords:
Cc: Branch:
Release Notes:
API Changes:

Fixed TracHTMLSanitizer raising exceptions and decoding refs twice in title attributes.

Internal Changes:

Description

Found in the logs,

[pid 24004 139801362245376] 2017-06-19 08:03:12,877 Trac[formatter] ERROR: Processor html failed for <Resource u'wiki:SandBox@165'>:
Traceback (most recent call last):
  File "/usr/local/virtualenv/1.3dev/lib/python2.7/site-packages/trac/wiki/formatter.py", line 1195, in _exec_processor
    return processor.process(text)
  File "/usr/local/virtualenv/1.3dev/lib/python2.7/site-packages/trac/wiki/formatter.py", line 375, in process
    text = self.processor(text)
  File "/usr/local/virtualenv/1.3dev/lib/python2.7/site-packages/trac/wiki/formatter.py", line 228, in _html_processor
    return self._sanitizer.sanitize(text)
  File "/usr/local/virtualenv/1.3dev/lib/python2.7/site-packages/trac/util/html.py", line 645, in sanitize
    transform.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 222, in goahead
    self.handle_data("&")
  File "/usr/local/virtualenv/1.3dev/lib/python2.7/site-packages/trac/util/html.py", line 1012, in handle_data
    self.out.write(data)
TypeError: unicode argument expected, got 'str'

I've been seeing the error in the logs for a week or two, but until we upgraded to the latest Trac trunk with improved logging, couldn't see that the exception was being raised from SandBox@165.

Attachments (1)

test_failures.txt (4.5 KB ) - added by Ryan J Ollos 7 years ago.

Download all attachments as: .zip

Change History (23)

comment:1 by Ryan J Ollos, 8 years ago

Issue also seen in comment:description:ticket:2558.

comment:2 by Jun Omae, 8 years ago

TracHTMLSanitizer with html special characters leads this issue:

>>> from trac.util.html import TracHTMLSanitizer
>>> sanitizer = TracHTMLSanitizer()
>>> sanitizer.sanitize(u'<div> &amp; </div>')
Markup(u'<div> &amp; </div>')
>>> sanitizer.sanitize(u'<div> & </div>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "trac/util/html.py", line 645, in sanitize
    transform.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 222, in goahead
    self.handle_data("&")
  File "trac/util/html.py", line 1012, in handle_data
    self.out.write(data)
TypeError: unicode argument expected, got 'str'
>>> sanitizer.sanitize(u'<div> < </div>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "trac/util/html.py", line 645, in sanitize
    transform.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 171, in goahead
    self.handle_data("<")
  File "trac/util/html.py", line 1012, in handle_data
    self.out.write(data)
TypeError: unicode argument expected, got 'str'
>>> sanitizer.sanitize(u'<div> > </div>')
Markup(u'<div> > </div>')

I think we should add test cases for sanitizer with malformed html.

comment:3 by Ryan J Ollos, 8 years ago

Milestone: 1.0.16

comment:4 by Ryan J Ollos, 7 years ago

Milestone: 1.0.161.0.17

Milestone renamed

comment:5 by Ryan J Ollos, 7 years ago

Owner: set to Ryan J Ollos
Status: newassigned

comment:6 by Ryan J Ollos, 7 years ago

Milestone: 1.0.171.3.3

Seems to be an issue only on trunk.

HTMLParser passes a string type:

                    self.handle_data("&")

It seems that the issue is fundamentally:

>>> import io
>>> s = io.StringIO()
>>> s.write('&')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    s.write('&')
TypeError: unicode argument expected, got 'str'
>>> s.write(u'&')
1L

… and probably remedied by using StringIO.StringIO rather than io.StringIO. However, using io module is good for forward-compatibility, looking towards Python3.

Does the following patch make any sense?:

  • trac/util/html.py

    diff --git a/trac/util/html.py b/trac/util/html.py
    index 57c9475b4..4fa69326e 100644
    a b class TracHTMLSanitizer(object):  
    641641        :rtype: Markup
    642642
    643643        """
    644         transform = HTMLSanitization(self, io.StringIO())
     644        class DecodingStringIO(io.StringIO):
     645            def write(self, content):
     646                super(DecodingStringIO, self).write(unicode(content))
     647
     648        transform = HTMLSanitization(self, DecodingStringIO())
    645649        transform.feed(html)
    646650        transform.close()
    647651        return Markup(transform.out.getvalue())
Last edited 7 years ago by Ryan J Ollos (previous) (diff)

comment:7 by Jun Omae, 7 years ago

Subclasses of HTMLTransform have the same issue (FormTokenInjector, HTMLSanitization).

I think it would be good to convert bytes to unicode before call self.out.write() in this case like this:

diff --git a/trac/util/html.py b/trac/util/html.py
index 57c9475b4..da6092114 100644
--- a/trac/util/html.py
+++ b/trac/util/html.py
@@ -911,33 +911,44 @@ class HTMLTransform(HTMLParser):
     def __init__(self, out):
         HTMLParser.__init__(self)
         self.out = out
+        if isinstance(out, io.TextIOBase):
+            self._convert = lambda v: v.decode('utf-8') \
+                                      if isinstance(v, bytes) else v
+        elif isinstance(out, io.IOBase):
+            self._convert = lambda v: v.encode('utf-8') \
+                                      if isinstance(v, unicode) else v
+        else:
+            self._convert = lambda v: v

 ...
+
+    def _write(self, data):
+        self.out.write(self._convert(data))

See [9665d5e51/jomae.git].

I noticed another issue that TracHTMLSanitizer with Genshi sanitizes unescaped & character to &amp; but TracHTMLSanitizer without Genshi doesn't. I think unescaped & character should be sanitized to &amp;. However, I have no idea to resolve it.

class TracHTMLSanitizerTestCase(unittest.TestCase):

    ...

    def test_special_characters(self):
        self.assertEqual(u'<p> & </p>', self.sanitize(u'<p> & </p>'))
        self.assertEqual(u'<p> < </p>', self.sanitize(u'<p> < </p>'))
        self.assertEqual(u'<p> > </p>', self.sanitize(u'<p> > </p>'))


if genshi:
    class TracHTMLSanitizerLegacyGenshiTestCase(TracHTMLSanitizerTestCase):
        def sanitize(self, html):
            sanitizer = TracHTMLSanitizer(safe_schemes=self.safe_schemes,
                                          safe_origins=self.safe_origins)
            return unicode(HTML(html, encoding='utf-8') | sanitizer)

        def test_special_characters(self):
            self.assertEqual(u'<p> &amp; </p>', self.sanitize(u'<p> & </p>'))
            self.assertEqual(u'<p> &lt; </p>', self.sanitize(u'<p> < </p>'))
            self.assertEqual(u'<p> &gt; </p>', self.sanitize(u'<p> > </p>'))

in reply to:  7 comment:8 by Ryan J Ollos, 7 years ago

Replying to Jun Omae:

See [9665d5e51/jomae.git].

Those changes look good.

I noticed another issue that TracHTMLSanitizer with Genshi sanitizes unescaped & character to &amp; but TracHTMLSanitizer without Genshi doesn't. I think unescaped & character should be sanitized to &amp;. However, I have no idea to resolve it.

Can we escape the content in handle_data?

  • trac/util/html.py

    diff --git a/trac/util/html.py b/trac/util/html.py
    index da6092114..d0bbf87d8 100644
    a b class HTMLSanitization(HTMLTransform):  
    10201020
    10211021    def handle_data(self, data):
    10221022        if not self.waiting_for:
    1023             self._write(data)
     1023            self._write(escape(data))
    10241024
    10251025    def handle_endtag(self, tag):
    10261026        if self.waiting_for:

comment:9 by Jun Omae, 7 years ago

I noticed another issue on trunk@16580 while trying your suggestion.

title="&amp;amp;" in attribute is sanitized to title="&amp;":

>>> from trac.util.html import TracHTMLSanitizer
>>> sanitizer = TracHTMLSanitizer()
>>> sanitizer.sanitize(u'<p title="&"> &amp;amp; </p>')
Markup(u'<p title="&amp;"> &amp;amp; </p>')
>>> sanitizer.sanitize(u'<p title="&amp;"> &amp; </p>')
Markup(u'<p title="&amp;"> &amp; </p>')
>>> sanitizer.sanitize(u'<p title="&amp;amp;"> &amp;amp; </p>')
Markup(u'<p title="&amp;"> &amp;amp; </p>')
>>> sanitizer.sanitize(u'<p title="&amp;amp;amp;"> &amp;amp; </p>')
Markup(u'<p title="&amp;amp;"> &amp;amp; </p>')

in reply to:  9 ; comment:10 by Ryan J Ollos, 7 years ago

Replying to Jun Omae:

title="&amp;amp;" in attribute is sanitized to title="&amp;":

The call sequence in HTMLParser is feedgoaheadparse_starttagunescape. Then the unescaped data is passed → HTMLParser.handle_starttagHTMLSanitization._handle_start: trunk/trac/util/html.py@16580:980-983#L973.

With &amp;amp; in the title attribute we get:

  • after unescape: &amp;
  • self.sanitizer.sanitize_attrs maps: [('title', '&amp;')] to [('title', '&')]
  • the values are escaped again to create html_attrs: [('title', '&amp;')]

So maybe the problem is in TracHTMLSanitizer.sanitize_attrs?

I can't say it's the right solution because I don't understand the code well enough, but this patch fixes the comment:9 issue:

  • trac/util/html.py

    diff --git a/trac/util/html.py b/trac/util/html.py
    index d4bfdf72a..7890700b1 100644
    a b class TracHTMLSanitizer(object):  
    761761        """
    762762        new_attrs = {}
    763763        for attr, value in attrs.iteritems():
    764             value = stripentities(value) if value is not None else attr
     764            value = stripentities(value, keepxmlentities=True) \
     765                    if value is not None else attr
    765766            if attr not in self.safe_attrs:
    766767                continue
    767768            elif attr in self.uri_attrs:
>>> from trac.util.html import TracHTMLSanitizer
>>> sanitizer = TracHTMLSanitizer()
>>> sanitizer.sanitize(u'<p title="&"> &amp;amp; </p>')
Markup(u'<p title="&amp;"> &amp;amp; </p>')
>>> sanitizer.sanitize(u'<p title="&amp;"> &amp; </p>')
Markup(u'<p title="&amp;"> &amp; </p>')
>>> sanitizer.sanitize(u'<p title="&amp;amp;"> &amp;amp; </p>')
Markup(u'<p title="&amp;amp;"> &amp;amp; </p>')
>>> sanitizer.sanitize(u'<p title="&amp;amp;amp;"> &amp;amp; </p>')
Markup(u'<p title="&amp;amp;amp;"> &amp;amp; </p>')

All tests pass with the comment:8 patch and the above patch, but maybe we just don't have enough test coverage to be sure.

  • trac/util/html.py

    diff --git a/trac/util/html.py b/trac/util/html.py
    index d4bfdf72a..86c98ecda 100644
    a b class TracHTMLSanitizer(object):  
    761761        """
    762762        new_attrs = {}
    763763        for attr, value in attrs.iteritems():
    764             value = stripentities(value) if value is not None else attr
     764            value = stripentities(value, keepxmlentities=True) \
     765                    if value is not None else attr
    765766            if attr not in self.safe_attrs:
    766767                continue
    767768            elif attr in self.uri_attrs:
    class HTMLTransform(HTMLParser):  
    936937        self.out.write('<?%s?>' % data)
    937938
    938939    def handle_data(self, data):
    939         self.out.write(data)
     940        self.out.write(escape(data))
    940941
    941942    def handle_endtag(self, tag):
    942943        self.out.write('</' + tag + '>')
    class HTMLSanitization(HTMLTransform):  
    10111012
    10121013    def handle_data(self, data):
    10131014        if not self.waiting_for:
    1014             self.out.write(data)
     1015            self.out.write(escape(data))
    10151016
    10161017    def handle_endtag(self, tag):
    10171018        if self.waiting_for:

in reply to:  10 comment:11 by Jun Omae, 7 years ago

Replying to Ryan J Ollos:

Replying to Jun Omae:

title="&amp;amp;" in attribute is sanitized to title="&amp;":

The call sequence in HTMLParser is feedgoaheadparse_starttagunescape. Then the unescaped data is passed → HTMLParser.handle_starttagHTMLSanitization._handle_start: trunk/trac/util/html.py@16580:980-983#L973.

With &amp;amp; in the title attribute we get:

  • after unescape: &amp;
  • self.sanitizer.sanitize_attrs maps: [('title', '&amp;')] to [('title', '&')]
  • the values are escaped again to create html_attrs: [('title', '&amp;')]

So maybe the problem is in TracHTMLSanitizer.sanitize_attrs?

HTMLParser.unescape decodes entity reference at https://github.com/python/cpython/blob/v2.7.14/Lib/HTMLParser.py#L472. I consider no need to call stripentities in TracHTMLSanitizer.sanitize_attrs.

>>> from HTMLParser import HTMLParser
>>> HTMLParser().unescape('&amp;')
u'&'
>>> HTMLParser().unescape('&hellip;')
u'\u2026'
>>> 
>>> from trac.util.html import TracHTMLSanitizer
>>> sanitizer = TracHTMLSanitizer()
>>> sanitizer.sanitize(u'<p title="&amp;hellip;">....</p>')
Markup(u'<p title="\u2026">....</p>')

Also, keepxmlentities=True of stripentities has an issue. It is easy to bypass using character reference with code point.

>>> stripentities('&amp; &#38; &#x26; &hellip;')
u'& & & \u2026'
>>> stripentities('&amp; &#38; &#x26; &hellip;', keepxmlentities=True)
u'&amp; & & \u2026'

comment:12 by Jun Omae, 7 years ago

HTMLParser crashes when large integer in character reference is used. I think we should override unescape to fix it in HTMLTransform which is a subclass of HTMLParser.

>>> sanitizer.sanitize(u'<p title="&#xffffffff;">...</p>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "trac/util/html.py", line 647, in sanitize
    transform.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 161, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 308, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/home/jun66j5/venv/py27/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python2.7/HTMLParser.py", line 459, in replaceEntities
    return unichr(c)
OverflowError: signed integer is greater than maximum
>>> sanitizer.sanitize(u'<p title="&#x1fffff;">...</p>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "trac/util/html.py", line 647, in sanitize
    transform.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 161, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 327, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "trac/util/html.py", line 987, in handle_starttag
    self._handle_start(tag, attrs, '')
  File "trac/util/html.py", line 980, in _handle_start
    new_attrs = self.sanitizer.sanitize_attrs(tag, dict(attrs))
  File "trac/util/html.py", line 764, in sanitize_attrs
    value = stripentities(value) if value is not None else attr
  File "trac/util/html.py", line 195, in stripentities
    return _STRIPENTITIES_RE.sub(_replace_entity, text)
  File "trac/util/html.py", line 181, in _replace_entity
    return unichr(ref)
ValueError: unichr() arg not in range(0x110000) (wide Python build)

On Windows (narrow build), <p title="&#x10000;">` can lead the same issue.

in reply to:  12 comment:13 by Jun Omae, 7 years ago

Replying to Jun Omae:

HTMLParser crashes when large integer in character reference is used. I think we should override unescape to fix it in HTMLTransform which is a subclass of HTMLParser.

In [3540fdba4/jomae.git], validate character and entity references using handle_charref and handle_entityref of HTMLTransform rather than unescape.

Revised jomae.git@t12847.1.

However, I noticed Genshi has two issues at least and have no idea to work around.

  1. genshi.HTML incorrectly returns \u2026 for &amp;hellip; in attribute. Works fine for &amp;hellip; in element.
    >>> import genshi
    >>> genshi.__version__
    '0.7'
    >>> from genshi import HTML
    >>> for event in HTML('<img title="&amp;hellip;&amp;"/>', encoding='utf-8'): event
    ...
    ('START', (QName('img'), Attrs([(QName('title'), u'\u2026&')])), (None, 1, 0))
    ('END', QName('img'), (None, 1, 0))
    >>> for event in HTML('<p>&amp;hellip;&amp;</p>', encoding='utf-8'): event
    ...
    ('START', (QName('p'), Attrs()), (None, 1, 0))
    ('TEXT', u'&hellip;&', (None, 1, 3))
    ('END', QName('p'), (None, 1, 20))
    
  2. Character reference with out of [0, 0x10ffff] leads ValueError (e.g. &#x110000;).
    >>> HTML('<img title="&#x110000;"/>', encoding='utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 442, in HTML
        return Stream(list(HTMLParser(BytesIO(text), encoding=encoding)))
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/core.py", line 273, in _ensure
        event = stream.next()
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 449, in _coalesce
        for kind, data, pos in chain(stream, [(None, None, None)]):
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 338, in _generate
        self.feed(data)
      File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed
        self.goahead(0)
      File "/usr/lib/python2.7/HTMLParser.py", line 161, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.7/HTMLParser.py", line 325, in parse_starttag
        self.handle_startendtag(tag, attrs)
      File "/usr/lib/python2.7/HTMLParser.py", line 407, in handle_startendtag
        self.handle_starttag(tag, attrs)
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 370, in handle_starttag
        fixed_attrib.append((QName(name), stripentities(value)))
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/util.py", line 227, in stripentities
        return _STRIPENTITIES_RE.sub(_replace_entity, text)
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/util.py", line 215, in _replace_entity
        return unichr(ref)
    ValueError: unichr() arg not in range(0x110000) (wide Python build)
    >>> HTML('<p>&#x110000;</p>', encoding='utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 442, in HTML
        return Stream(list(HTMLParser(BytesIO(text), encoding=encoding)))
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/core.py", line 273, in _ensure
        event = stream.next()
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 449, in _coalesce
        for kind, data, pos in chain(stream, [(None, None, None)]):
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 338, in _generate
        self.feed(data)
      File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed
        self.goahead(0)
      File "/usr/lib/python2.7/HTMLParser.py", line 191, in goahead
        self.handle_charref(name)
      File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 391, in handle_charref
        text = unichr(int(name[1:], 16))
    ValueError: unichr() arg not in range(0x110000) (wide Python build)
    

in reply to:  12 comment:14 by Jun Omae, 7 years ago

Replying to Jun Omae:

HTMLParser crashes when large integer in character reference is used. I think we should override unescape to fix it in HTMLTransform which is a subclass of HTMLParser.

Updated jomae.git@t12847.1 to fix these cases.

by Ryan J Ollos, 7 years ago

Attachment: test_failures.txt added

comment:15 by Ryan J Ollos, 7 years ago

This looks like a huge amount of work. I'm glad you understand all these details because I do not.

It looks like we need a modification of test cases for narrow Python builds: test_failures.txt.

I have Python installed via pyenv on OSX:

$ python --version
Python 2.7.14
$ python -c "import sysconfig; print(sysconfig.get_config_var('Py_UNICODE_SIZE'))"
2
Last edited 7 years ago by Ryan J Ollos (previous) (diff)

comment:16 by Jun Omae, 7 years ago

Thanks for the testing. Updated jomae.git@t12847.1 which is verified on travis-ci and appveyor.

I'm considering that we could backport fixes for TracHTMLSanitizer using Genshi (described in comment:13) on 1.0-stable and 1.2-stable. Should we backport the fixes?

in reply to:  16 ; comment:17 by Ryan J Ollos, 7 years ago

Replying to Jun Omae:

I'm considering that we could backport fixes for TracHTMLSanitizer using Genshi (described in comment:13) on 1.0-stable and 1.2-stable. Should we backport the fixes?

I'm certainly not against the idea, if you want to prepare the changes. However, I can't say whether it's definitely needed or not.

in reply to:  17 comment:18 by Jun Omae, 7 years ago

Replying to Ryan J Ollos:

Replying to Jun Omae:

I'm considering that we could backport fixes for TracHTMLSanitizer using Genshi (described in comment:13) on 1.0-stable and 1.2-stable. Should we backport the fixes?

I'm certainly not against the idea, if you want to prepare the changes. However, I can't say whether it's definitely needed or not.

Okay. I choose not to consider the backport since the issues are not critical and no one notice the issues until now. We'll consider the backport when anyone report the issues of 1.2-stable.

comment:19 by Ryan J Ollos, 7 years ago

Looks good to commit!

comment:20 by Jun Omae, 7 years ago

Owner: changed from Ryan J Ollos to Jun Omae
Release Notes: modified (diff)

Thanks for the reviewing! Committed in [16613].

comment:21 by Jun Omae, 7 years ago

Resolution: fixed
Status: assignedclosed

comment:22 by Ryan J Ollos, 5 years ago

API Changes: modified (diff)
Release Notes: modified (diff)

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain Jun Omae.
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from Jun Omae to the specified user.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.