pjcj's notes

Mon, 29 May 2006

Retraining SpamAssassin's Bayesian classifier

I use SpamAssassin to filter my mail, and in general I am very happy with it. SpamAssassin classifies mail according to various criteria and assigns each message a score. A score between five and ten earns a message a place in my probablespam mailbox, and a score above ten sends it straight into the caughtspam mailbox. Any mail that gets past those checks and is not addressed to a name I recognise goes into the not_me mailbox. Anything left goes into my inbox.
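
The routing rules above can be sketched as a small shell function. This is only an illustration, not the actual filter configuration: the function name route_mail, its two arguments, and the treatment of scores of exactly five and exactly ten are all assumptions made for the example.

```shell
# Hypothetical sketch of the routing rules described above.
# route_mail SCORE KNOWN
#   SCORE  SpamAssassin score for the message (may be fractional or negative)
#   KNOWN  "yes" if the message is addressed to a name I recognise
route_mail() {
    score=$1
    known=$2
    # awk does the comparison because sh arithmetic is integer only.
    # Assumption: exactly 5 and exactly 10 both count as "probablespam".
    bucket=$(awk -v s="$score" 'BEGIN {
        s = s + 0
        if (s > 10)      print "caughtspam"
        else if (s >= 5) print "probablespam"
        else             print "clean"
    }')
    case $bucket in
        clean)
            # Mail that is not spam is split by recipient.
            if [ "$known" = yes ]; then echo inbox; else echo not_me; fi
            ;;
        *)
            echo "$bucket"
            ;;
    esac
}
```

For example, route_mail 7.2 yes prints probablespam, while route_mail 1.3 no prints not_me.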

This has worked very well for me. Very rarely do I find spam in my inbox, and real mail ends up in caughtspam so rarely that I never look in there unless someone insists they have sent me mail that I can't find. The probablespam mailbox is mostly spam, but occasionally I find some real mail in there. The not_me mailbox contains some spam along with messages I have been bcc'd on.

But recently I have been finding more real mail in my probablespam mailbox. Almost invariably these messages have hit the BAYES_99 rule, meaning that SpamAssassin's Bayesian classifier thinks the message is almost certainly spam. It's been a long time since I first trained SpamAssassin, so I wondered whether the database had been polluted. This is often known as Bayesian poisoning, and it is part of the goal of those spam messages you might see which contain a poem, part of a story, or just a long list of random words.

So I decided to retrain the Bayesian classifier to see if it could do any better. First I backed up the current database, then cleared it and trained it afresh on ham and spam.

$ sa-learn --backup > /var/tmp/sa.db
$ sa-learn --clear
$ sa-learn --ham --progress --mbox ~/Mail/new ~/Mail/tips
$ sa-learn --spam --progress --mbox ~/Mail/probablespam ~/Mail/spam
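
One way to check how much the fresh database has learned is to look at the counters printed by sa-learn --dump magic. By default SpamAssassin does not apply its Bayes rules until it has learned at least 200 spam and 200 ham messages. The helper below is a minimal sketch: the name bayes_counts is made up for this example, and it assumes the usual dump layout, in which the value is the third column of the "non-token data" lines.

```shell
# Hypothetical helper: read `sa-learn --dump magic` output on stdin and
# print the number of spam and ham messages learned so far.
# Assumes the value sits in the third whitespace-separated column.
bayes_counts() {
    awk '
        /non-token data: nspam/ { nspam = $3 }
        /non-token data: nham/  { nham = $3 }
        END { printf "nspam=%d nham=%d\n", nspam, nham }
    '
}

# Usage (needs SpamAssassin installed):
#   sa-learn --dump magic | bayes_counts
#
# If the retrained database turns out worse than the old one, the backup
# taken earlier can be loaded again with:
#   sa-learn --restore /var/tmp/sa.db
```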

Early results are encouraging. There are a few mistakes, of course, but that is to be expected until I have trained it a little more. Fixing them is easy in mutt:

$ grep sa-learn ~/.muttrc
  macro index S "|sa-learn --spam\ns=spam\n"
  macro pager S "|sa-learn --spam\ns=spam\n"
  macro index H "|sa-learn --ham\ns=new\n"
  macro pager H "|sa-learn --ham\ns=new\n"

As an aside, I wonder what SpamAssassin has to do with Apache?
