71 | | > (The code in svn uses [http://spambayes.org SpamBayes], which is a logical choice. It would make sense, however, to use a custom tokenizer rather than the email-centric one included with [http://spambayes.org SpamBayes]. The bigger issue is that some form of training is required. For example, the API could be extended so that (optionally) authenticated users and the other filters could report contributions as spam, with automatic training assuming that everything else is ham; this is a complex change, however. An alternative would be a script, executed periodically, that trains all existing contributions as ham and gathers spam from an appropriate source. If you decide to continue with this in the future, please don't hesitate to ask [mailto:spambayes-dev@python.org spambayes-dev] for help. |
| 75 | * At the start, the filter needs at least 25 entries each for HAM (useful entries) and SPAM (advertising). Simply train every submission you get until these limits are reached. |
| 76 | * Training is done in the Administration menu under "Spam Filtering / Monitoring". The following buttons are available: |
| 77 | * ''Mark selected as Spam'' - Mark the entries as SPAM and train them in Bayes database |
| 78 | * ''Mark selected as Ham'' - Mark the entries as HAM and train them in Bayes database |
| 79 | * ''Delete selected'' - remove entry without training |
| 80 | * ''Delete selected as Spam'' - Mark the entries as SPAM and train them in Bayes database, remove them afterwards |
| 81 | * ''Delete selected as Ham'' - Mark the entries as HAM and train them in Bayes database, remove them afterwards |
| 82 | * When !JavaScript is enabled, several check boxes are available to help with selecting entries |
| 83 | * Rules for a well-trained database are: |
| 84 | * Don't train the same stuff multiple times |
| 85 | * HAM and SPAM counts should be nearly equal (in reality you will have more SPAM, but a ratio of 1 to 5 should be the maximum) |
| 86 | * Restart from scratch when results are poor |
| 87 | * It is hard to get rid of training errors, so be careful |
| 88 | * See [http://spambayes.org/background.html SpamBayes pages] for more details. |
| 89 | * Strategy for Trac usage: |
| 90 | * Use the ''Delete selected as Spam'' and ''Delete selected as Ham'' buttons |
| 91 | * Remove every strange entry (e.g. SandBox stuff) using ''Delete selected'' |
| 92 | * Train every valid HAM entry (otherwise the database will become unbalanced) |
| 93 | * Be sure to train every error: rejected user submissions as well as undetected SPAM |
| 94 | * Train every SPAM entry with a score below 90% (at the beginning you may train everything below 100%) |
| 95 | * Delete SPAM entries with a high score (100% in any case; after the initial phase, everything above 90%) |
| 96 | * When in doubt whether an entry is SPAM or HAM, delete it |
| 97 | * NOTE: When Akismet or !TypePad are activated, training also sends the entries to these services. |
| 98 | * If you append the parameter "num" with a value between 5 and 150 to the monitoring page URL, e.g. {{{url.../admin/spamfilter/monitor?num=100}}}, more entries are shown, but don't train a very large dataset at once. |
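
The training rules and the Trac strategy above can be summarised as a small decision helper. This is only an illustrative sketch, not part of the plugin: the function names `database_status` and `triage`, the constants, and the idea of passing a reviewer verdict as `is_spam` are all assumptions made for this example. It also assumes that the 1 to 5 limit applies in both directions, and that a score of exactly 90% counts as "high".

```python
# Hypothetical helper mirroring the training guidelines; not a Trac API.

MIN_ENTRIES = 25   # minimum HAM and minimum SPAM entries before the filter is usable
MAX_RATIO = 5      # larger class should be at most 5x the smaller (assumed symmetric)
HIGH_SCORE = 90.0  # percent; threshold used after the initial phase


def database_status(ham_count, spam_count):
    """Classify the training database as 'warm-up', 'unbalanced' or 'ok'."""
    if ham_count < MIN_ENTRIES or spam_count < MIN_ENTRIES:
        return "warm-up"
    if max(ham_count, spam_count) > MAX_RATIO * min(ham_count, spam_count):
        return "unbalanced"
    return "ok"


def triage(is_spam, score, warm_up=False):
    """Suggest which monitoring-page button to use for one entry.

    is_spam: True, False, or None when the reviewer is unsure or the
             entry is odd (e.g. SandBox edits).
    score:   the filter's spam score in percent (0-100).
    warm_up: True during the initial phase, when everything below 100%
             may be trained.
    """
    if is_spam is None:
        # In doubt or strange entry: remove without training.
        return "Delete selected"
    if not is_spam:
        # Train every valid HAM entry to keep the database balanced.
        return "Delete selected as Ham"
    threshold = 100.0 if warm_up else HIGH_SCORE
    if score >= threshold:
        # The filter already scores this highly; no need to train it again.
        return "Delete selected"
    # Undetected or weakly detected SPAM: train it, then remove it.
    return "Delete selected as Spam"


if __name__ == "__main__":
    print(database_status(40, 120))
    print(triage(True, 72.5))
```

Keeping the thresholds as named constants makes it easy to adjust them if experience with a particular site suggests different limits.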