Mon, 29 May 2006
Retraining SpamAssassin's Bayesian classifier
I use SpamAssassin to filter my mail, and in general I am very happy with it. SpamAssassin classifies mail according to various criteria and assigns each message a score. A score of between five and ten earns a message a place in my probablespam mailbox, and above ten sends the message straight into the caughtspam mailbox. Any mail getting this far that is not to a name that I recognise goes into the not_me mailbox. Anything left goes into my inbox.
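The score thresholds above can be sketched as a small shell function. This is only an illustration of the routing rule, not the filter I actually run: the name route_by_score is made up, the handling of scores of exactly five or ten is an assumption, and the not_me step is left out.

```shell
# Hypothetical helper: map a SpamAssassin score to a mailbox name.
# Assumption: exactly 5 or exactly 10 counts as "between five and ten".
route_by_score() {
    # awk does the comparison because scores are fractional (e.g. 7.2),
    # which the shell's integer tests cannot handle.
    awk -v s="$1" 'BEGIN {
        s = s + 0                        # force numeric comparison
        if (s > 10)      print "caughtspam"
        else if (s >= 5) print "probablespam"
        else             print "inbox"
    }'
}
```

For example, route_by_score 7.2 prints probablespam.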
This has worked very well for me. Very rarely do I find spam in my inbox, and real mail ends up in caughtspam so rarely that I never look in there except when someone insists they have sent me mail that I can't find. The probablespam mailbox is mostly spam, but occasionally I find some real mail in there. The not_me mailbox contains some spam along with messages I have been bcced on.
But recently I have been finding more real mail in my probablespam mailbox. Almost invariably these messages have been classified as BAYES_99, meaning that the SpamAssassin Bayesian classifier thinks the message is almost certainly spam. It's been a long time since I first trained SpamAssassin so I wondered whether the database had been polluted. This is often known as Bayesian poisoning, and is part of the goal of messages you might see which contain a poem or part of a story or just a long list of random words.
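One rough way to see how often this happens is to count the messages in a mailbox whose X-Spam-Status header mentions BAYES_99. The sketch below assumes an mbox file and SpamAssassin's usual X-Spam-Status header, which may be folded across several lines. The function name count_bayes99 is made up, and the exact header layout depends on your configuration.

```shell
# Hypothetical helper: count messages in an mbox whose X-Spam-Status
# header (including folded continuation lines) mentions BAYES_99.
count_bayes99() {
    awk '
        /^From / { inhdr = 1; instatus = 0; next }        # mbox message separator
        inhdr && /^$/ { inhdr = 0; instatus = 0; next }   # blank line ends the headers
        inhdr && /^X-Spam-Status:/ { instatus = 1 }       # start of the header we want
        inhdr && /^[^ \t]/ && !/^X-Spam-Status:/ { instatus = 0 }  # a different header begins
        inhdr && instatus && /BAYES_99/ { n++; instatus = 0 }      # count each message once
        END { print n + 0 }
    ' "$1"
}
```

Running count_bayes99 ~/Mail/probablespam before and after retraining gives a crude idea of whether the classifier has become less trigger-happy.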
So I decided to retrain the Bayesian classifier to see if it could do any better. First I backed up the current database, then cleared it and trained it afresh on ham and spam.
$ sa-learn --backup > /var/tmp/sa.db
$ sa-learn --clear
$ sa-learn --ham --progress --mbox ~/Mail/new ~/Mail/tips
$ sa-learn --spam --progress --mbox ~/Mail/probablespam ~/Mail/spam
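If the retrained database turns out worse than the old one, the backup can be put back with sa-learn --restore, and sa-learn --dump magic shows how many spam and ham messages the classifier has learned. The wrappers below are a minimal sketch. The function names and the SA_LEARN override are my own inventions, and note that --restore replaces whatever Bayes data is currently in place.

```shell
# Assumed override so a different sa-learn binary can be substituted.
SA_LEARN="${SA_LEARN:-sa-learn}"

# Hypothetical helper: restore the Bayes database from a backup file.
# Defaults to the backup path used above. This wipes the current data.
restore_bayes() {
    "$SA_LEARN" --restore "${1:-/var/tmp/sa.db}"
}

# Hypothetical helper: show only the learned spam and ham message counts.
bayes_counts() {
    "$SA_LEARN" --dump magic | grep -E 'nspam|nham'
}
```

By default SpamAssassin wants a couple of hundred messages of each kind before it uses the Bayes rules at all, so bayes_counts is worth a look straight after retraining.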
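Early results are encouraging. There are a few mistakes, of course, but that is to be expected until I train it a little better. Fixing them is easy in mutt: pressing S pipes the current message to sa-learn --spam and saves it to the spam folder, and H does the same with sa-learn --ham, saving to new.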
$ grep sa-learn ~/.muttrc
macro index S "|sa-learn --spam\ns=spam\n"
macro pager S "|sa-learn --spam\ns=spam\n"
macro index H "|sa-learn --ham\ns=new\n"
macro pager H "|sa-learn --ham\ns=new\n"
As an aside, I wonder what SpamAssassin has to do with Apache?