#12847 closed defect (fixed)
TypeError: unicode argument expected, got 'str'
Reported by: | Ryan J Ollos | Owned by: | Jun Omae |
---|---|---|---|
Priority: | normal | Milestone: | 1.3.3 |
Component: | wiki system | Version: | |
Severity: | normal | Keywords: | |
Cc: | Branch: | ||
Release Notes: | |||
API Changes: |
Fixed |
||
Internal Changes: |
Description
Found in the logs,
[pid 24004 139801362245376] 2017-06-19 08:03:12,877 Trac[formatter] ERROR: Processor html failed for <Resource u'wiki:SandBox@165'>: Traceback (most recent call last): File "/usr/local/virtualenv/1.3dev/lib/python2.7/site-packages/trac/wiki/formatter.py", line 1195, in _exec_processor return processor.process(text) File "/usr/local/virtualenv/1.3dev/lib/python2.7/site-packages/trac/wiki/formatter.py", line 375, in process text = self.processor(text) File "/usr/local/virtualenv/1.3dev/lib/python2.7/site-packages/trac/wiki/formatter.py", line 228, in _html_processor return self._sanitizer.sanitize(text) File "/usr/local/virtualenv/1.3dev/lib/python2.7/site-packages/trac/util/html.py", line 645, in sanitize transform.feed(html) File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed self.goahead(0) File "/usr/lib/python2.7/HTMLParser.py", line 222, in goahead self.handle_data("&") File "/usr/local/virtualenv/1.3dev/lib/python2.7/site-packages/trac/util/html.py", line 1012, in handle_data self.out.write(data) TypeError: unicode argument expected, got 'str'
I've been seeing the error in the logs for a week or two, but until we upgraded to the latest Trac trunk with improved logging, couldn't see that the exception was being raised from SandBox@165.
Attachments (1)
Change History (23)
comment:1 by , 8 years ago
comment:2 by , 8 years ago
TracHTMLSanitizer
with html special characters leads this issue:
>>> from trac.util.html import TracHTMLSanitizer >>> sanitizer = TracHTMLSanitizer() >>> sanitizer.sanitize(u'<div> & </div>') Markup(u'<div> & </div>') >>> sanitizer.sanitize(u'<div> & </div>') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "trac/util/html.py", line 645, in sanitize transform.feed(html) File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed self.goahead(0) File "/usr/lib/python2.7/HTMLParser.py", line 222, in goahead self.handle_data("&") File "trac/util/html.py", line 1012, in handle_data self.out.write(data) TypeError: unicode argument expected, got 'str' >>> sanitizer.sanitize(u'<div> < </div>') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "trac/util/html.py", line 645, in sanitize transform.feed(html) File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed self.goahead(0) File "/usr/lib/python2.7/HTMLParser.py", line 171, in goahead self.handle_data("<") File "trac/util/html.py", line 1012, in handle_data self.out.write(data) TypeError: unicode argument expected, got 'str' >>> sanitizer.sanitize(u'<div> > </div>') Markup(u'<div> > </div>')
I think we should add test cases for sanitizer with malformed html.
comment:3 by , 8 years ago
Milestone: | → 1.0.16 |
---|
comment:5 by , 7 years ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:6 by , 7 years ago
Milestone: | 1.0.17 → 1.3.3 |
---|
Seems to be an issue only on trunk.
HTMLParser
passes a string type:
self.handle_data("&")
It seems that the issue is fundamentally:
>>> import io >>> s = io.StringIO() >>> s.write('&') Traceback (most recent call last): File "<input>", line 1, in <module> s.write('&') TypeError: unicode argument expected, got 'str' >>> s.write(u'&') 1L
… and probably remedied by using StringIO.StringIO
rather than io.StringIO
. However, using io
module is good for forward-compatibility, looking towards Python3.
Does the following patch make any sense?:
-
trac/util/html.py
diff --git a/trac/util/html.py b/trac/util/html.py index 57c9475b4..4fa69326e 100644
a b class TracHTMLSanitizer(object): 641 641 :rtype: Markup 642 642 643 643 """ 644 transform = HTMLSanitization(self, io.StringIO()) 644 class DecodingStringIO(io.StringIO): 645 def write(self, content): 646 super(DecodingStringIO, self).write(unicode(content)) 647 648 transform = HTMLSanitization(self, DecodingStringIO()) 645 649 transform.feed(html) 646 650 transform.close() 647 651 return Markup(transform.out.getvalue())
follow-up: 8 comment:7 by , 7 years ago
Subclasses of HTMLTransform
have the same issue (FormTokenInjector
, HTMLSanitization
).
I think it would be good to convert bytes
to unicode
before call self.out.write()
in this case like this:
diff --git a/trac/util/html.py b/trac/util/html.py index 57c9475b4..da6092114 100644 --- a/trac/util/html.py +++ b/trac/util/html.py @@ -911,33 +911,44 @@ class HTMLTransform(HTMLParser): def __init__(self, out): HTMLParser.__init__(self) self.out = out + if isinstance(out, io.TextIOBase): + self._convert = lambda v: v.decode('utf-8') \ + if isinstance(v, bytes) else v + elif isinstance(out, io.IOBase): + self._convert = lambda v: v.encode('utf-8') \ + if isinstance(v, unicode) else v + else: + self._convert = lambda v: v ... + + def _write(self, data): + self.out.write(self._convert(data))
I noticed another issue that TracHTMLSanitizer
with Genshi sanitizes unescaped &
character to &
but TracHTMLSanitizer
without Genshi doesn't. I think unescaped &
character should be sanitized to &
. However, I have no idea to resolve it.
class TracHTMLSanitizerTestCase(unittest.TestCase): ... def test_special_characters(self): self.assertEqual(u'<p> & </p>', self.sanitize(u'<p> & </p>')) self.assertEqual(u'<p> < </p>', self.sanitize(u'<p> < </p>')) self.assertEqual(u'<p> > </p>', self.sanitize(u'<p> > </p>')) if genshi: class TracHTMLSanitizerLegacyGenshiTestCase(TracHTMLSanitizerTestCase): def sanitize(self, html): sanitizer = TracHTMLSanitizer(safe_schemes=self.safe_schemes, safe_origins=self.safe_origins) return unicode(HTML(html, encoding='utf-8') | sanitizer) def test_special_characters(self): self.assertEqual(u'<p> & </p>', self.sanitize(u'<p> & </p>')) self.assertEqual(u'<p> < </p>', self.sanitize(u'<p> < </p>')) self.assertEqual(u'<p> > </p>', self.sanitize(u'<p> > </p>'))
comment:8 by , 7 years ago
Replying to Jun Omae:
Those changes look good.
I noticed another issue that
TracHTMLSanitizer
with Genshi sanitizes unescaped&
character to&
butTracHTMLSanitizer
without Genshi doesn't. I think unescaped&
character should be sanitized to&
. However, I have no idea to resolve it.
Can we escape the content in handle_data
?
-
trac/util/html.py
diff --git a/trac/util/html.py b/trac/util/html.py index da6092114..d0bbf87d8 100644
a b class HTMLSanitization(HTMLTransform): 1020 1020 1021 1021 def handle_data(self, data): 1022 1022 if not self.waiting_for: 1023 self._write( data)1023 self._write(escape(data)) 1024 1024 1025 1025 def handle_endtag(self, tag): 1026 1026 if self.waiting_for:
follow-up: 10 comment:9 by , 7 years ago
I noticed another issue on trunk@16580 while trying your suggestion.
title="&amp;"
in attribute is sanitized to title="&"
:
>>> from trac.util.html import TracHTMLSanitizer >>> sanitizer = TracHTMLSanitizer() >>> sanitizer.sanitize(u'<p title="&"> &amp; </p>') Markup(u'<p title="&"> &amp; </p>') >>> sanitizer.sanitize(u'<p title="&"> & </p>') Markup(u'<p title="&"> & </p>') >>> sanitizer.sanitize(u'<p title="&amp;"> &amp; </p>') Markup(u'<p title="&"> &amp; </p>') >>> sanitizer.sanitize(u'<p title="&amp;amp;"> &amp; </p>') Markup(u'<p title="&amp;"> &amp; </p>')
follow-up: 11 comment:10 by , 7 years ago
Replying to Jun Omae:
title="&amp;"
in attribute is sanitized totitle="&"
:
The call sequence in HTMLParser
is feed
→ goahead
→ parse_starttag
→ unescape
. Then the unescaped data is passed → HTMLParser.handle_starttag
→ HTMLSanitization._handle_start
: trunk/trac/util/html.py@16580:980-983#L973.
With &amp;
in the title
attribute we get:
- after
unescape
:&
self.sanitizer.sanitize_attrs
maps:[('title', '&')]
to[('title', '&')]
- the values are escaped again to create
html_attrs
:[('title', '&')]
So maybe the problem is in TracHTMLSanitizer.sanitize_attrs?
I can't say it's the right solution because I don't understand the code well enough, but this patch fixes the comment:9 issue:
-
trac/util/html.py
diff --git a/trac/util/html.py b/trac/util/html.py index d4bfdf72a..7890700b1 100644
a b class TracHTMLSanitizer(object): 761 761 """ 762 762 new_attrs = {} 763 763 for attr, value in attrs.iteritems(): 764 value = stripentities(value) if value is not None else attr 764 value = stripentities(value, keepxmlentities=True) \ 765 if value is not None else attr 765 766 if attr not in self.safe_attrs: 766 767 continue 767 768 elif attr in self.uri_attrs:
>>> from trac.util.html import TracHTMLSanitizer >>> sanitizer = TracHTMLSanitizer() >>> sanitizer.sanitize(u'<p title="&"> &amp; </p>') Markup(u'<p title="&"> &amp; </p>') >>> sanitizer.sanitize(u'<p title="&"> & </p>') Markup(u'<p title="&"> & </p>') >>> sanitizer.sanitize(u'<p title="&amp;"> &amp; </p>') Markup(u'<p title="&amp;"> &amp; </p>') >>> sanitizer.sanitize(u'<p title="&amp;amp;"> &amp; </p>') Markup(u'<p title="&amp;amp;"> &amp; </p>')
All tests pass with the comment:8 patch and the above patch, but maybe we just don't have enough test coverage to be sure.
-
trac/util/html.py
diff --git a/trac/util/html.py b/trac/util/html.py index d4bfdf72a..86c98ecda 100644
a b class TracHTMLSanitizer(object): 761 761 """ 762 762 new_attrs = {} 763 763 for attr, value in attrs.iteritems(): 764 value = stripentities(value) if value is not None else attr 764 value = stripentities(value, keepxmlentities=True) \ 765 if value is not None else attr 765 766 if attr not in self.safe_attrs: 766 767 continue 767 768 elif attr in self.uri_attrs: … … class HTMLTransform(HTMLParser): 936 937 self.out.write('<?%s?>' % data) 937 938 938 939 def handle_data(self, data): 939 self.out.write( data)940 self.out.write(escape(data)) 940 941 941 942 def handle_endtag(self, tag): 942 943 self.out.write('</' + tag + '>') … … class HTMLSanitization(HTMLTransform): 1011 1012 1012 1013 def handle_data(self, data): 1013 1014 if not self.waiting_for: 1014 self.out.write( data)1015 self.out.write(escape(data)) 1015 1016 1016 1017 def handle_endtag(self, tag): 1017 1018 if self.waiting_for:
comment:11 by , 7 years ago
Replying to Ryan J Ollos:
Replying to Jun Omae:
title="&amp;"
in attribute is sanitized totitle="&"
:The call sequence in
HTMLParser
isfeed
→goahead
→parse_starttag
→unescape
. Then the unescaped data is passed →HTMLParser.handle_starttag
→HTMLSanitization._handle_start
: trunk/trac/util/html.py@16580:980-983#L973.With
&amp;
in thetitle
attribute we get:
- after
unescape
:&
self.sanitizer.sanitize_attrs
maps:[('title', '&')]
to[('title', '&')]
- the values are escaped again to create
html_attrs
:[('title', '&')]
So maybe the problem is in TracHTMLSanitizer.sanitize_attrs?
HTMLParser.unescape
decodes entity reference at https://github.com/python/cpython/blob/v2.7.14/Lib/HTMLParser.py#L472. I consider no need to call stripentities
in TracHTMLSanitizer.sanitize_attrs
.
>>> from HTMLParser import HTMLParser >>> HTMLParser().unescape('&') u'&' >>> HTMLParser().unescape('…') u'\u2026' >>> >>> from trac.util.html import TracHTMLSanitizer >>> sanitizer = TracHTMLSanitizer() >>> sanitizer.sanitize(u'<p title="&hellip;">....</p>') Markup(u'<p title="\u2026">....</p>')
Also, keepxmlentities=True
of stripentities
has an issue. It is easy to bypass using character reference with code point.
>>> stripentities('& & & …') u'& & & \u2026' >>> stripentities('& & & …', keepxmlentities=True) u'& & & \u2026'
follow-ups: 13 14 comment:12 by , 7 years ago
HTMLParser
crashes when large integer in character reference is used. I think we should override unescape
to fix it in HTMLTransform
which is a subclass of HTMLParser
.
>>> sanitizer.sanitize(u'<p title="�">...</p>') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "trac/util/html.py", line 647, in sanitize transform.feed(html) File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed self.goahead(0) File "/usr/lib/python2.7/HTMLParser.py", line 161, in goahead k = self.parse_starttag(i) File "/usr/lib/python2.7/HTMLParser.py", line 308, in parse_starttag attrvalue = self.unescape(attrvalue) File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s) File "/home/jun66j5/venv/py27/lib/python2.7/re.py", line 155, in sub return _compile(pattern, flags).sub(repl, string, count) File "/usr/lib/python2.7/HTMLParser.py", line 459, in replaceEntities return unichr(c) OverflowError: signed integer is greater than maximum >>> sanitizer.sanitize(u'<p title="�">...</p>') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "trac/util/html.py", line 647, in sanitize transform.feed(html) File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed self.goahead(0) File "/usr/lib/python2.7/HTMLParser.py", line 161, in goahead k = self.parse_starttag(i) File "/usr/lib/python2.7/HTMLParser.py", line 327, in parse_starttag self.handle_starttag(tag, attrs) File "trac/util/html.py", line 987, in handle_starttag self._handle_start(tag, attrs, '') File "trac/util/html.py", line 980, in _handle_start new_attrs = self.sanitizer.sanitize_attrs(tag, dict(attrs)) File "trac/util/html.py", line 764, in sanitize_attrs value = stripentities(value) if value is not None else attr File "trac/util/html.py", line 195, in stripentities return _STRIPENTITIES_RE.sub(_replace_entity, text) File "trac/util/html.py", line 181, in _replace_entity return unichr(ref) ValueError: unichr() arg not in range(0x110000) (wide Python build)
On Windows (narrow build), <p title="𐀀"
>` can lead the same issue.
comment:13 by , 7 years ago
Replying to Jun Omae:
HTMLParser
crashes when large integer in character reference is used. I think we should overrideunescape
to fix it inHTMLTransform
which is a subclass ofHTMLParser
.
In [3540fdba4/jomae.git], validate character and entity references using handle_charref
and handle_entityref
of HTMLTransform
rather than unescape
.
Revised jomae.git@t12847.1.
However, I noticed Genshi has two issues at least and have no idea to work around.
genshi.HTML
incorrectly returns\u2026
for&hellip;
in attribute. Works fine for&hellip;
in element.>>> import genshi >>> genshi.__version__ '0.7' >>> from genshi import HTML >>> for event in HTML('<img title="&hellip;&"/>', encoding='utf-8'): event ... ('START', (QName('img'), Attrs([(QName('title'), u'\u2026&')])), (None, 1, 0)) ('END', QName('img'), (None, 1, 0)) >>> for event in HTML('<p>&hellip;&</p>', encoding='utf-8'): event ... ('START', (QName('p'), Attrs()), (None, 1, 0)) ('TEXT', u'…&', (None, 1, 3)) ('END', QName('p'), (None, 1, 20))
- Character reference with out of [0, 0x10ffff] leads
ValueError
(e.g.�
).>>> HTML('<img title="�"/>', encoding='utf-8') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 442, in HTML return Stream(list(HTMLParser(BytesIO(text), encoding=encoding))) File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/core.py", line 273, in _ensure event = stream.next() File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 449, in _coalesce for kind, data, pos in chain(stream, [(None, None, None)]): File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 338, in _generate self.feed(data) File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed self.goahead(0) File "/usr/lib/python2.7/HTMLParser.py", line 161, in goahead k = self.parse_starttag(i) File "/usr/lib/python2.7/HTMLParser.py", line 325, in parse_starttag self.handle_startendtag(tag, attrs) File "/usr/lib/python2.7/HTMLParser.py", line 407, in handle_startendtag self.handle_starttag(tag, attrs) File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 370, in handle_starttag fixed_attrib.append((QName(name), stripentities(value))) File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/util.py", line 227, in stripentities return _STRIPENTITIES_RE.sub(_replace_entity, text) File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/util.py", line 215, in _replace_entity return unichr(ref) ValueError: unichr() arg not in range(0x110000) (wide Python build) >>> HTML('<p>�</p>', encoding='utf-8') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 442, in HTML return Stream(list(HTMLParser(BytesIO(text), encoding=encoding))) File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/core.py", line 273, in _ensure event = stream.next() File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 449, in _coalesce for kind, data, pos in chain(stream, [(None, None, None)]): File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 338, in _generate self.feed(data) File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed self.goahead(0) File "/usr/lib/python2.7/HTMLParser.py", line 191, in goahead self.handle_charref(name) File "/dev/shm/venv/local/lib/python2.7/site-packages/genshi/input.py", line 391, in handle_charref text = unichr(int(name[1:], 16)) ValueError: unichr() arg not in range(0x110000) (wide Python build)
comment:14 by , 7 years ago
Replying to Jun Omae:
HTMLParser
crashes when large integer in character reference is used. I think we should overrideunescape
to fix it inHTMLTransform
which is a subclass ofHTMLParser
.
Updated jomae.git@t12847.1 to fix these cases.
by , 7 years ago
Attachment: | test_failures.txt added |
---|
comment:15 by , 7 years ago
This looks like a huge amount of work. I'm glad you understand all these details because I do not.
It looks like we need a modification of test cases for narrow Python builds: test_failures.txt.
I have Python installed via pyenv on OSX:
$ python --version Python 2.7.14 $ python -c "import sysconfig; print(sysconfig.get_config_var('Py_UNICODE_SIZE'))" 2
follow-up: 17 comment:16 by , 7 years ago
Thanks for the testing. Updated jomae.git@t12847.1 which is verified on travis-ci and appveyor.
I'm considering that we could backport fixes for TracHTMLSanitizer using Genshi (described in comment:13) on 1.0-stable and 1.2-stable. Should we backport the fixes?
follow-up: 18 comment:17 by , 7 years ago
Replying to Jun Omae:
I'm considering that we could backport fixes for TracHTMLSanitizer using Genshi (described in comment:13) on 1.0-stable and 1.2-stable. Should we backport the fixes?
I'm certainly not against the idea, if you want to prepare the changes. However, I can't say whether it's definitely needed or not.
comment:18 by , 7 years ago
Replying to Ryan J Ollos:
Replying to Jun Omae:
I'm considering that we could backport fixes for TracHTMLSanitizer using Genshi (described in comment:13) on 1.0-stable and 1.2-stable. Should we backport the fixes?
I'm certainly not against the idea, if you want to prepare the changes. However, I can't say whether it's definitely needed or not.
Okay. I choose not to consider the backport since the issues are not critical and no one notice the issues until now. We'll consider the backport when anyone report the issues of 1.2-stable.
comment:20 by , 7 years ago
Owner: | changed from | to
---|---|
Release Notes: | modified (diff) |
Thanks for the reviewing! Committed in [16613].
comment:21 by , 7 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Issue also seen in comment:description:ticket:2558.