Opened 13 years ago

Closed 13 years ago

Last modified 13 years ago

#10314 closed enhancement (wontfix)

Double-checking spam filter

Reported by: jtv@…
Owned by: Dirk Stöcker
Priority: normal
Component: plugin/spamfilter
Severity: normal

Description

I'm getting a considerable amount of bug spam even after Akismet filtering. Typically the same spam will be posted again within the hour, but by then Akismet has learned to recognize the contents as spam.

I suspect it might be helpful to run recent bug tickets through a second, asynchronous round of spam filtering, perhaps a few minutes after they are first submitted: "this was not considered spam when it was submitted, but what about now? In retrospect, was it the front of a fresh spam wave?"
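As a rough illustration of the delayed second pass described above, here is a minimal sketch. It is not part of Trac or the SpamFilter plugin: the `Ticket` class, `recheck_recent`, the `is_spam` callable, and the 10-minute default delay are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List


@dataclass
class Ticket:
    # Hypothetical stand-in for a stored ticket; not a Trac class.
    ticket_id: int
    text: str
    submitted_at: datetime
    rechecked: bool = False


def recheck_recent(
    tickets: Iterable[Ticket],
    is_spam: Callable[[str], bool],
    now: datetime,
    delay: timedelta = timedelta(minutes=10),  # assumed delay, not a plugin setting
) -> List[int]:
    """Return the ids of tickets that fail a second, delayed spam check."""
    flagged = []
    for ticket in tickets:
        if ticket.rechecked:
            continue  # each ticket gets only one delayed pass
        if now - ticket.submitted_at < delay:
            continue  # too fresh; the filter may not have learned the wave yet
        ticket.rechecked = True
        if is_spam(ticket.text):
            flagged.append(ticket.ticket_id)
    return flagged
```

The function only reports which tickets fail the second check; what to do with them (delete, hide, or queue for review) is left to the caller, since that is the part with the most room for false positives.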

Checking tickets retroactively and deleting ones that fail is probably harder to do than the current synchronous check. But for my small single-person hobby site, it might make the difference between having a near-permanent spam presence despite constant attention (a massive drain on productivity), and having just bits of spam that disappear automatically after a few minutes.

An option to quarantine new tickets, so that they don't show up at all until after a delayed spam check would also be nice, though probably harder.

Attachments (0)

Change History (8)

comment:1 by Dirk Stöcker, 13 years ago

Resolution: wontfix
Status: new → closed

If you rely only on Akismet, it is no wonder that spam gets through. The current SpamFilter supports many different services. Activate more of them and you get better detection rates. The most powerful tool is still the internal Bayes filter. Train it properly. Usually spam scores >90% and ham <2%. If not, your training is insufficient and you need to train more.

What you suggest is not really a useful approach. Delaying entries would break the user experience (what user would expect a ticket to appear, say, 10 minutes later?). And automatically deleting accepted entries later would have strange results as well in cases where content is wrongly detected as spam.

Besides all this, your suggestion would be much too big a rework, and there is no need for it. After a bit of training there is hardly any spam coming through.

I currently run 3 different sites, from a hobby-style site up to ones under major spam attack. After some initial training time, even the small site has a nearly perfect detection rate.

For Bayes it is important to have good training material, so in the beginning train EVERY ham and EVERY spam entry. If you get only a little ham, train on texts already contained in your site. See also the SpamFilter page for a description of training.

comment:2 by jtv@…, 13 years ago

Thanks. I wasn't relying on Akismet alone actually, and until recently almost nothing was getting through even without Akismet (and I seem to be back to that now). But I wasn't using the Bayesian filter either; I've always managed Trac through the command line and so simply hadn't discovered the UI for training the filter.

I'm trying to train my Bayesian filter now. It would be easier if the UI didn't crash on every non-ASCII character though!

in reply to:  2 comment:3 by Dirk Stöcker, 13 years ago

Replying to jtv@…:

I'm trying to train my Bayesian filter now. It would be easier if the UI didn't crash on every non-ASCII character though!

Hmm, that sounds as if you have an old version of either SpamFilter or Trac. I can't remember a single crash since I started using 0.12 and a recent SpamFilter. Report crashes if there are any.

comment:4 by jtv@…, 13 years ago

I do have an older version: 0.11 (on the current Ubuntu LTS).

By the way, I've trained the Bayesian filter on a few dozen pieces of ham and a few thousand pieces of spam; I notice that the Monitoring list will say that most spams are rejected by both the Akismet filter and the regex filter, and often the number-of-links filter as well. But as far as I'm aware I have yet to see the same notice about the Bayesian filter. Do rejections from that filter not show up in the same list, or is there simply no need to invoke the Bayesian filter if multiple other filters already reject the same post?

So far, so good though: I've been almost entirely spam-free for days! Had some probing spam today (no links or suspicious terms) on an instance without enough ham to train on, but that's about it. Thanks for helping me through this.

comment:5 by Dirk Stöcker, 13 years ago

Probably you still have not reached the minimum training count for ham? Bayes is not active until you reach that count (see settings).

A large imbalance between spam and ham is not good. The ratio should not be much larger than 5:1.
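The two conditions above (a minimum training count and a bounded spam-to-ham ratio) can be sketched as a small sanity check. This is only an illustration, not the plugin's own logic: `training_status` is a hypothetical helper, the default of 25 is the minimum mentioned elsewhere in this ticket, and applying the 5:1 limit in both directions is an assumption.

```python
def training_status(
    ham_count: int,
    spam_count: int,
    min_training: int = 25,   # default minimum mentioned in this ticket
    max_ratio: float = 5.0,   # "not much larger than 5:1"
) -> str:
    """Classify a training set as inactive, unbalanced, or ok."""
    if ham_count < min_training or spam_count < min_training:
        return "inactive"  # Bayes is not used until the minimum is reached
    larger = max(ham_count, spam_count)
    smaller = min(ham_count, spam_count)
    # Assumption: an imbalance in either direction is treated the same way.
    if larger > max_ratio * smaller:
        return "unbalanced"
    return "ok"


# A few dozen ham against a few thousand spam, as described in comment:4.
print(training_status(ham_count=36, spam_count=3000))
```

With counts like those in the usage line, the check reports an unbalanced training set even though both counts are above the minimum.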

The crashes will be fixed by switching to 0.12. If I remember correctly, the current filter no longer works for 0.11 wikis.

comment:6 by jtv@…, 13 years ago

The ham was definitely more than the default minimum of 25. I was told to train for all ham and all spam initially, but of course I get hundreds of spams per day and since this is a fairly mature project, entire months go by without new bugs. I guess I should just stop training then.

I'll see if I can find a way to upgrade to a newer trac version.

in reply to:  6 comment:7 by Dirk Stöcker, 13 years ago

Replying to jtv@…:

The ham was definitely more than the default minimum of 25. I was told to train for all ham and all spam initially, but of course I get hundreds of spams per day and since this is a fairly mature project, entire months go by without new bugs. I guess I should just stop training then.

If Bayes is active, then you should see something like this:

BayesianFilterStrategy (14): SpamBayes determined spam probability of 0.07%
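To check a batch of monitoring entries for a line like the one above, a small parser can help. This is a sketch that assumes the line format is exactly as in the sample; `parse_bayes_line` is a hypothetical helper, and the meaning of the number in parentheses is not stated in this ticket, so it is returned unchanged as a plain integer.

```python
import re
from typing import Optional, Tuple

# Pattern built from the single sample line in this ticket (an assumption
# about the general format, not a documented one).
_BAYES_LINE = re.compile(
    r"^BayesianFilterStrategy \((-?\d+)\): "
    r"SpamBayes determined spam probability of (\d+(?:\.\d+)?)%$"
)


def parse_bayes_line(line: str) -> Optional[Tuple[int, float]]:
    """Return (parenthesised number, probability in percent), or None."""
    match = _BAYES_LINE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))
```

If no entry in the list parses, the Bayesian filter is probably not contributing at all, which points to one of the reasons listed in this comment.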

Possible reasons it does not work:

  • SpamBayes not installed (training should not work then, I think)
  • Not enough training
  • HAM and SPAM scores do not give a reliable result (training not good enough yet)
  • The 0.11/0.12 semantics change (should only affect wiki pages, not tickets)

You can test Bayes in the admin section. If you don't get results there and you have 25 entries each for ham and spam, starting the training database over may be a good idea (more balanced this time, if possible).

Training should never stop, but it is a good idea to reduce the amount of spam training and increase ham training. I train every ham and only spams with less than 90% recognition, and this seems to be a good strategy.
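The selection rule described above can be written down in a few lines. This is a minimal sketch of that strategy only; `should_train` and its parameter names are hypothetical and do not correspond to any SpamFilter function.

```python
def should_train(
    is_spam: bool,
    bayes_spam_probability: float,  # in percent, as shown in the log line
    spam_threshold: float = 90.0,   # "less than 90% recognition"
) -> bool:
    """Decide whether an entry is worth feeding back into Bayes training."""
    if not is_spam:
        return True  # every ham entry is trained
    # Spam is trained only when the filter was not yet confident about it.
    return bayes_spam_probability < spam_threshold
```

Skipping spam that is already recognised keeps the spam side of the training set from growing far faster than the ham side.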

I'll see if I can find a way to upgrade to a newer trac version.

Always a good idea. The 0.11 plugin version is no longer updated anyway.

comment:8 by jtv@…, 13 years ago

OK thanks, I think I've used up enough of your time. Better consider this ticket closed. ☺
