Edgewall Software
Modify

Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#9409 closed defect (fixed)

Spamfilter: Encoding exceptions when there is invalid utf-8 data in the request

Reported by: Matthijs Kooijman <matthijs@…> Owned by:
Priority: normal Milestone: plugin - spam-filter
Component: plugin/spamfilter Version:
Severity: normal Keywords:
Cc: Branch:
Release Notes:
API Changes:
Internal Changes:

Description

Using the spamfilter plugin (r9835) with trac 0.11.1, I've been bitten by encoding issues resulting from invalid UTF-8 in the request headers. A spam comment had buggy HTTP headers, which were stored verbatim in the database. When getting the data back from the database, SQLite errored out because it couldn't put the raw data in a unicode string:

OperationalError: Could not decode to UTF-8 column 'headers' with text 
'Cookie: lang=spanish; JSESSIONID=894DC4625C228BD8D7916854072C48F2;
SESS14a1f681a943bae47b32736abf4881ee=04e5010f965e252cec72bd78535f5e84; PHPSES

I haven't been able to retrieve the actual invalid string (which is beyond the part shown above), but I have been able to create another test case which produces the same error.

The attached file contains an HTTP request with a "Bar" header, with the invalid sequence fd20 (0xfd, which is the start of a 6-byte sequence, followed by a space). It is preceded by a bunch of x's, to make sure the actual invalid string does not show up in the error message produced (so you see the actual database error, instead of a genshi error while showing the database error).

To reproduce:

  1. Save the attached file to request.txt
  2. Update the Host: header and request path to match your installation.
  3. Browse to the ticket url manually, and find the value of the __FORM_TOKEN hidden field.
  4. Put your form token in the Cookie header and the __FORM_TOKEN value in the POST payload.
  5. run: cat request.txt | nc example.org 80
  6. Go to the the "monitoring" section of the spamfilter plugin. You should get an error now.

I've looked a bit into resolving this error. However, I haven't found out what the proper way to do this is yet. According to the HTTP spec, HTTP headers should be like RFC822 (email) headers, which can consist of ASCII characters only.

However, a different section of the HTTP specification (at the definition of TEXT) specifies that HTTP headers must be interpreted as ISO-8859-1, and characters from other character sets should be encoded using the MIME header encoding. For example, the utf-8 sequence e298 would become: =?utf-8?q?=e2=98?=.

I'm not quite sure how this MIME encoding is handled, though. If the webserver or trac core takes care of the decoding, it's completely uncertain what characters or encoding are used in HTTP headers and it might be correct to use a BLOB field in the database instead of a TEXT field currently. However, it might also be that the headers are still encoded when SpamFilter sees them, in which case the content might only need interpretation as ISO-8859-1 (leaving any encoded characters encoded, which is probably not a big deal). Alternatively, there might be a python function to do this decoding. Either way, the end result should be a unicode string (instead of a normal string as is the case right now).

According to this post this error can be prevented by only passing unicode strings to pysqlite when inserting records, so they can also be interpreted as utf-8 later on (it's a bit silly that this error occurse when retrieving data, AFAICS it would make more sense to error out on the insert…).

Apart from the HTTP headers, there might be other strings that are inserted into the database that might need to be made into unicode strings as well.

I'm out of time to investigate this issue further, but I'd be happy to test or comment further.

Attachments (1)

request.txt (341 bytes ) - added by Matthijs Kooijman <matthijs@…> 14 years ago.
HTTP request to reproduce the problem

Download all attachments as: .zip

Change History (9)

by Matthijs Kooijman <matthijs@…>, 14 years ago

Attachment: request.txt added

HTTP request to reproduce the problem

comment:1 by Matthijs Kooijman <matthijs@…>, 14 years ago

Btw, please put me in the CC if that's required to get notifications (just realized that notifications to the reporter might be disabled).

comment:2 by Christian Boos, 14 years ago

Milestone: plugin - spam-filter

Note that the SpamFilter is in "low maintenance" mode, which means with don't have much time to apply patches, even less to fix new issues on our own, especially in these days of Trac 0.12 release. So please be patient with us (or on the contrary, bug us with good patches and volunteer for taking over maintenance of this plugin ;-) ).

comment:3 by Matthijs Kooijman <matthijs@…>, 14 years ago

I'm willing to implement a fix, but I'm not quite sure what the right fix is. Insights as to the direction of a fix would be most welcome :-)

Are there any alternatives to spamfilter that are better maintained, or does 0.12 include some spamfiltering by default or something?

comment:4 by Dirk Stöcker, 14 years ago

Well, I'm now maintainer. Direction of the fix - I think ensuring, that correct data is entered into database should be the way to go. I think it is acceptable when data is not 100% exact (we talk about SPAM here anyway), but there should be no exceptions caused in handling data.

I'm happy to see patches from you.

Last edited 14 years ago by Dirk Stöcker (previous) (diff)

comment:5 by anonymous, 14 years ago

Could you try if

  • tracspamfilter/model.py

     
    4444        self.env = env
    4545        self.time = time
    4646        self.path = path
    47         self.author = author
     47        self.author = to_unicode(author)
    4848        self.authenticated = bool(authenticated)
    4949        self.ipnr = ipnr
    50         self.headers = headers or ''
     50        self.headers = to_unicode(headers) or ''
    5151        self.content = content
    5252        self.rejected = bool(rejected)
    5353        self.karma = karma

fixes the issue?

comment:6 by Matthijs Kooijman <matthijs@…>, 14 years ago

Yup, this fixes the problem for me.

I was going to ask why "content" is not converted, but it seems that already happens in the callers (e.g., in FilterSystem._combine_changes). Perhaps it would be useful to add a comment about this?

Anyway, thanks for looking into this!

comment:7 by Dirk Stöcker, 14 years ago

Resolution: fixed
Status: newclosed

In r9949. content has an own encoding function which does utf-8 first and base64 encoding afterwards. Probably this is outdated nowadays, but I wont break compatibility.

comment:8 by Matthijs Kooijman <matthijs@…>, 14 years ago

Thanks!

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The ticket will remain with no owner.
The resolution will be deleted. Next status will be 'reopened'.
to The owner will be changed from (none) to the specified user.

Add Comment


E-mail address and name can be saved in the Preferences .
 
Note: See TracTickets for help on using tickets.