#9409 closed defect (fixed)
Spamfilter: Encoding exceptions when there is invalid utf-8 data in the request
Reported by: | Owned by: | ||
---|---|---|---|
Priority: | normal | Milestone: | plugin - spam-filter |
Component: | plugin/spamfilter | Version: | |
Severity: | normal | Keywords: | |
Cc: | Branch: | ||
Release Notes: | |||
API Changes: | |||
Internal Changes: |
Description
Using the spamfilter plugin (r9835) with trac 0.11.1, I've been bitten by encoding issues resulting from invalid UTF-8 in the request headers. A spam comment had buggy HTTP headers, which were stored verbatim in the database. When getting the data back from the database, SQLite errored out because it couldn't put the raw data in a unicode string:
OperationalError: Could not decode to UTF-8 column 'headers' with text 'Cookie: lang=spanish; JSESSIONID=894DC4625C228BD8D7916854072C48F2; SESS14a1f681a943bae47b32736abf4881ee=04e5010f965e252cec72bd78535f5e84; PHPSES
I haven't been able to retrieve the actual invalid string (which is beyond the part shown above), but I have been able to create another test case which produces the same error.
The attached file contains an HTTP request with a "Bar" header, with the invalid sequence fd20 (0xfd, which is the start of a 6-byte sequence, followed by a space). It is preceded by a bunch of x's, to make sure the actual invalid string does not show up in the error message produced (so you see the actual database error, instead of a genshi error while showing the database error).
To reproduce:
- Save the attached file to
request.txt
- Update the Host: header and request path to match your installation.
- Browse to the ticket url manually, and find the value of the
__FORM_TOKEN
hidden field. - Put your form token in the
Cookie
header and the__FORM_TOKEN
value in the POST payload. - run:
cat request.txt | nc example.org 80
- Go to the the "monitoring" section of the spamfilter plugin. You should get an error now.
I've looked a bit into resolving this error. However, I haven't found out what the proper way to do this is yet. According to the HTTP spec, HTTP headers should be like RFC822 (email) headers, which can consist of ASCII characters only.
However, a different section of the HTTP specification (at the definition of TEXT
) specifies that HTTP headers must be interpreted as ISO-8859-1, and characters from other character sets should be encoded using the MIME header encoding. For example, the utf-8 sequence e298 would become: =?utf-8?q?=e2=98?=
.
I'm not quite sure how this MIME encoding is handled, though. If the webserver or trac core takes care of the decoding, it's completely uncertain what characters or encoding are used in HTTP headers and it might be correct to use a BLOB field in the database instead of a TEXT field currently. However, it might also be that the headers are still encoded when SpamFilter sees them, in which case the content might only need interpretation as ISO-8859-1 (leaving any encoded characters encoded, which is probably not a big deal). Alternatively, there might be a python function to do this decoding. Either way, the end result should be a unicode string (instead of a normal string as is the case right now).
According to this post this error can be prevented by only passing unicode strings to pysqlite when inserting records, so they can also be interpreted as utf-8 later on (it's a bit silly that this error occurse when retrieving data, AFAICS it would make more sense to error out on the insert…).
Apart from the HTTP headers, there might be other strings that are inserted into the database that might need to be made into unicode strings as well.
I'm out of time to investigate this issue further, but I'd be happy to test or comment further.
Attachments (1)
Change History (9)
by , 14 years ago
Attachment: | request.txt added |
---|
comment:1 by , 14 years ago
Btw, please put me in the CC if that's required to get notifications (just realized that notifications to the reporter might be disabled).
comment:2 by , 14 years ago
Milestone: | → plugin - spam-filter |
---|
Note that the SpamFilter is in "low maintenance" mode, which means with don't have much time to apply patches, even less to fix new issues on our own, especially in these days of Trac 0.12 release. So please be patient with us (or on the contrary, bug us with good patches and volunteer for taking over maintenance of this plugin ;-) ).
comment:3 by , 14 years ago
I'm willing to implement a fix, but I'm not quite sure what the right fix is. Insights as to the direction of a fix would be most welcome :-)
Are there any alternatives to spamfilter that are better maintained, or does 0.12 include some spamfiltering by default or something?
comment:4 by , 14 years ago
Well, I'm now maintainer. Direction of the fix - I think ensuring, that correct data is entered into database should be the way to go. I think it is acceptable when data is not 100% exact (we talk about SPAM here anyway), but there should be no exceptions caused in handling data.
I'm happy to see patches from you.
comment:5 by , 14 years ago
Could you try if
-
tracspamfilter/model.py
44 44 self.env = env 45 45 self.time = time 46 46 self.path = path 47 self.author = author47 self.author = to_unicode(author) 48 48 self.authenticated = bool(authenticated) 49 49 self.ipnr = ipnr 50 self.headers = headersor ''50 self.headers = to_unicode(headers) or '' 51 51 self.content = content 52 52 self.rejected = bool(rejected) 53 53 self.karma = karma
fixes the issue?
comment:6 by , 14 years ago
Yup, this fixes the problem for me.
I was going to ask why "content" is not converted, but it seems that already happens in the callers (e.g., in FilterSystem._combine_changes). Perhaps it would be useful to add a comment about this?
Anyway, thanks for looking into this!
comment:7 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
In r9949. content has an own encoding function which does utf-8 first and base64 encoding afterwards. Probably this is outdated nowadays, but I wont break compatibility.
HTTP request to reproduce the problem