Trac Spam Filtering
- How good is the filtering?
- Supported Internal Filtering Strategies
- Supported External Filtering Strategies
- Get the Plugin
- Enabling the Plugin
- SpamFilter and AccountManager
- Further Reading
- Known Issues
This plugin provides several ways to reject contributions that contain spam. It requires at least Trac release 1.0. The source code for Trac 0.12 and earlier is no longer updated, but is still available.
The spam filter plugin has many options, but most of them are optional: simply installing it already provides basic spam protection. Some additional steps may be helpful (in order of importance):
- Train the Bayes database (using the entries of the log) to activate that filter and reach good detection rates (the Bayes filter needs spambayes installed!)
- Set up a BadContent page containing regular expressions to filter
- Get API keys for Akismet, Mollom, and/or HTTP:BL to use external services
- Activate the captcha rejection handler to improve the treatment of legitimate users (reCAPTCHA access may be needed if that method is used)
- Fine-tune the karma settings and parameters for your system (e.g. you may increase the karma of a well-trained Bayes filter or stop trusting registered users)
- If necessary, get API keys for other services and activate them
WebAdmin is used for configuration, monitoring, and training. For monitoring and training purposes, it optionally logs all activity to a table in the database. Upgrading the environment is necessary to install the database table required for this logging.
How good is the filtering?
The spam filter will never be perfect. You need to check submissions to Trac and improve the training or settings of the filter when necessary. But a well-trained setup will help you run a site even if it is actively spammed (i.e. thousands of spam attempts a day or week). Even large sites with completely anonymous edits are possible.
From time to time spam attacks will nevertheless succeed, and manual work is required. Try to remove successful spam as fast as possible. The longer it stays on the pages, the harder your work gets (some spammers seem to monitor successful attempts and retry more intensively).
Spam should be removed completely (also in page history). Trac has options to delete tickets as well as wiki page versions. If done early enough this does not produce gaps in page history. Spam can also be in uploaded files. Delete them!
Some spam bots edit a page twice, so the last change is harmless and the previous one added the spam. Be aware of such tactics. Sometimes spam is done by humans - this type is usually successful, but humans are easily discouraged by fast deletion.
The Bayes filter (when properly trained) usually has the best detection rates and can be adapted quickly to new attacks by training on the successful spam attempts. Akismet is a good second line of defense (it also uses adaptive algorithms), and training also helps the external service when a new type of attack begins. All the other services are good at catching spammers that use rather dumb methods (which is most of them).
Sometimes it is hard even for human admins to tell whether a submission is spam. Understand that for mere software it may be impossible!
A realistic goal is currently about one spam submission slipping through per 10,000 to 20,000 attempts (except for a new type of spam wave, where in the beginning maybe 10-20 slip through, but that happens only once or twice a year). False rejects should be on the order of one rejection per 1,000 or more successful submissions.
Supported Internal Filtering Strategies
The individual strategies assign scores (“karma”) to submitted content, and the total karma determines whether a submission is rejected or not.
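As a rough sketch (illustrative only, not the plugin's actual code), the karma decision can be pictured like this, with a configurable rejection threshold:

```python
# Simplified sketch of the karma decision: each strategy contributes
# positive karma for ham-like signals and negative karma for spam-like
# signals, and the total is compared against a threshold.
def is_rejected(karma_points, min_karma=0):
    return sum(karma_points) < min_karma

print(is_rejected([5, -3]))       # total +2: accepted
print(is_rejected([-10, 2, -5]))  # total -13: rejected
```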
The regex filter reads a list of regular expressions from a wiki page named “BadContent”, each regular expression being on a separate line inside the first code block on the page, using the Python syntax for regular expressions.
If any of those regular expressions matches the submitted content, the submission will be rejected.
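The matching can be pictured with a short Python sketch (the patterns are made-up examples, not recommended filter rules):

```python
import re

# Illustrative sketch of the BadContent mechanism: every line of the first
# code block on the page is a Python regular expression, and any match
# against the submitted content rejects the submission.
bad_content = [
    r"(?i)cheap\s+viagra",    # case-insensitive phrase match
    r"https?://\S*casino",    # links containing a spammy keyword
]

def matches_bad_content(text):
    return any(re.search(pattern, text) for pattern in bad_content)

print(matches_bad_content("Buy CHEAP  viagra now!"))   # True
print(matches_bad_content("Improving the wiki docs"))  # False
```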
Regular Expressions for IP
The ip_regex filter reads a list of regular expressions from a wiki page named “BadIP”, each regular expression being on a separate line inside the first code block on the page, using the Python syntax for regular expressions.
If any of those regular expressions matches the submitter's IP, the submission will be rejected.
Regular expressions are much too powerful for the simple task of matching an IP or IP range, but to keep things simple for users the design mirrors the content-based regular expressions. You can simply specify full IPv4 addresses even though the dot has a special meaning, as the match will still work correctly. Only when matching partial addresses is more care needed.
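A small standalone Python illustration of this point (not plugin code):

```python
import re

# A full IPv4 address works as-is even though "." matches any character:
# the literal digits still line up, so "192.0.2.10" matches itself.
assert re.match(r"192.0.2.10", "192.0.2.10")

# For partial addresses, escape the dots and anchor the pattern;
# otherwise "10.1" would also match inside e.g. "110.15.0.3".
assert re.match(r"^10\.1\.", "10.1.200.7")      # the 10.1.0.0/16 range
assert not re.match(r"^10\.1\.", "110.15.0.3")  # anchored, escaped: no match
```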
The maximum number of posts per hour is configured in trac.ini:
[spam-filter]
max_posts_by_ip = 5
When this limit is exceeded, the filter starts giving submissions negative karma as specified by the corresponding karma setting.
Integrated CAPTCHA-style "human" verification is supported. Captcha usage is configured on the 'Captcha' administration page.
Several captcha types are supported, among them:
- Simple text captcha: Spam robots can bypass these, so they are not recommended.
- Image captcha
- External reCAPTCHA service: to use the reCAPTCHA method, you'll need to sign up at http://www.google.com/recaptcha/whyrecaptcha and set the keys on the 'Captcha' administration page.
The captcha in the spam filter is a rejection system: captchas are only displayed when a submission would otherwise be rejected as spam. In this case a successfully solved captcha can increase the score of a submission. If a submission has too many spam points, even a successfully solved captcha can't save it (e.g. the score is 30 and a captcha only removes 20 points).
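The arithmetic of that example can be sketched as follows (the 20-point captcha karma is just the example value from above, not a plugin default):

```python
# A solved captcha adds a fixed amount of karma, which may or may not
# be enough to rescue a heavily penalized submission.
def passes_after_captcha(spam_score, captcha_karma=20):
    return spam_score - captcha_karma <= 0

print(passes_after_captcha(15))  # True: 20 points of captcha karma outweigh 15
print(passes_after_captcha(30))  # False: 30 - 20 = 10 spam points remain
```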
The Bayes filter is a very powerful tool when used properly. The following guidelines describe how to use and train the filter to get good results:
- When beginning, the filter needs at least 25 entries each for HAM (useful entries) and for SPAM (advertising). Simply train every submission you get until these limits are reached.
- Training is done in the administration menu under "Spam Filtering / Monitoring". The following buttons are available:
- Mark selected as Spam - mark the entries as SPAM and train them in the Bayes database (not visible by default in newer versions)
- Mark selected as Ham - mark the entries as HAM and train them in the Bayes database (not visible by default in newer versions)
- Delete selected - remove the entries without training
- Delete selected as Spam - mark the entries as SPAM and train them in the Bayes database, then remove them
- Delete selected as Ham - mark the entries as HAM and train them in the Bayes database, then remove them
- Rules for a well-trained database are:
- Don't train the same content multiple times
- HAM and SPAM counts should be nearly equal (in reality you will have more SPAM, but a ratio of 1 to 5 should be the maximum)
- Restart from scratch when results are poor
- It is hard to get rid of training errors, so be careful
- See SpamBayes pages for more details.
- Strategy for Trac usage:
- Use the Delete selected as Spam and Delete selected as Ham buttons
- Remove every strange entry (e.g. SandBox stuff) using Delete selected
- Train every valid HAM entry (or the database will become unbalanced)
- Be sure to train every error: Rejected user submissions as well as undetected SPAM
- Train every SPAM entry with a score below 90% (at the beginning you may train everything not 100%)
- Delete SPAM entries with high score (100% in any case, after beginning phase everything above 90%)
- When in doubt if SPAM or HAM, delete entry
- NOTE: When Akismet, Defensio, BlogSpam or StopForumSpam (with an API key) are activated, training will also send the entries to these services.
- If you append the parameter "num" with values between 5 and 150 to the monitoring page URL, e.g. url.../admin/spamfilter/monitor?num=100, you can show more entries, but don't train a very large dataset at once.
Supported External Filtering Strategies
See e.g. SpamLinks DNS Lists for a list of DNS based blacklists. A blacklist usable for this filter must return an IP for listed entries and no IP (NXDOMAIN) for unlisted entries.
NOTE: The submitter's IP is sent to the configured servers.
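The lookup itself follows the common DNSBL convention of reversing the IPv4 octets and prefixing them to the blacklist zone; a sketch (the zone name is a placeholder, not a real blacklist):

```python
# The query name is resolved via DNS: an answer means "listed",
# NXDOMAIN means "not listed".
def blacklist_query_name(ip, zone="dnsbl.example.org"):
    return ".".join(reversed(ip.split("."))) + "." + zone

print(blacklist_query_name("192.0.2.99"))  # 99.2.0.192.dnsbl.example.org
```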
The use of this filter requires a Wordpress API key. The API key is configured in the 'External' administration page.
NOTE: Submitted content is sent to Akismet servers. Don't use this in private environments.
The use of this filter requires an API key. The API key is configured in the 'External' administration page.
NOTE: Submitted content is sent to Defensio servers. Don't use this in private environments.
Status: This service seems to have a relatively bad detection ratio.
The use of this filter requires API keys. These API keys are configured in the 'External' administration page.
NOTE: Submitted content is sent to Mollom servers. Don't use this in private environments.
Training this filter requires an API key. The API key is configured in the 'External' administration page.
NOTE: The submitted username and IP are sent to StopForumSpam servers. Don't use this in private environments.
NOTE: Submitted content is sent to LinkSleeve servers. Don't use this in private environments.
This service also includes DNS checks and services identical to checks in this plugin. Be sure to set the karma appropriately, or these checks will be counted twice. You can also disable individual checks in the preferences.
NOTE: Submitted content is sent to BlogSpam servers. Don't use this in private environments.
The use of this filter requires a HTTP:BL API key. The API key is configured in the 'External' administration page.
NOTE: The submitter's IP is sent to HTTP:BL servers.
Using this filter requires an API key. The API key is configured in the 'External' administration page.
NOTE: The submitted username and IP are sent to BotScout servers. Don't use this in private environments.
Using this filter requires an API key. The API key is configured in the 'External' administration page.
NOTE: The submitted username and IP are sent to FSpamList servers. Don't use this in private environments.
Get the Plugin
You can also obtain the code from the Trac Subversion repository:
svn co http://svn.edgewall.com/repos/trac/plugins/1.0/spam-filter
or download the zipped source.
See TracPlugins for instructions on building and installing plugins.
Enabling the Plugin
[components]
tracspamfilter.* = enabled
You can disable individual strategies:
- Disable the corresponding class in plugin handling
- Set karma to 0
- External services that require an API key are disabled as long as no key is configured
- All external services can be disabled in the 'External' section (completely, or only for training)
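For example, a hypothetical trac.ini fragment (the exact component names depend on the installed plugin version; check the plugin admin page rather than copying these verbatim):

```ini
[components]
tracspamfilter.* = enabled
; placeholder module path: disable one strategy while keeping the rest
tracspamfilter.filters.akismet.* = disabled
```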
The Spamfilter adds 4 new permissions to Trac:
- SPAM_CONFIG - get the admin menu entries to configure the filter
- SPAM_MONITOR - get the admin menu entries to monitor the submissions (spam or ham)
- SPAM_TRAIN - access the spam training functions in the monitoring panel (useless without SPAM_MONITOR)
- SPAM_ADMIN - combination of all three: SPAM_CONFIG, SPAM_MONITOR, SPAM_TRAIN
SpamFilter and AccountManager
If the AccountManager plugin is used in version 0.4 or later, then registrations can be checked for spam as well.
To do so, the entry RegistrationFilterAdapter needs to be added to the key register_check in the account-manager section of the Trac configuration.
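For example (BasicCheck and EmailCheck below are placeholders standing in for whatever checks your configuration already lists; only RegistrationFilterAdapter is the addition described here):

```ini
[account-manager]
register_check = BasicCheck, EmailCheck, RegistrationFilterAdapter
```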
Further Reading
Please help to translate the plugin into your language: https://www.transifex.com/projects/p/Trac_Plugin-L10N/resource/spamfilter/
- Historic information about SpamFilter: Managing Trac Spam
Known Issues
- The modules for IP blacklisting and HTTP:BL need dnspython installed. Install "setuptools" based on the Trac plugin requirements; then you can run "easy_install dnspython" to automatically download and install the package.
- Attention: The 1.7 series of dnspython causes a massive slowdown of the whole Trac installation. Use newer versions only.
- The ImageCaptcha requires python-imaging to work.
- Bayes filtering needs spambayes software installed.
- Mollom filter needs python-oauth2 installed.