Mon, 29 May 2006

Retraining SpamAssassin's Bayesian classifier


I use SpamAssassin to filter my mail, and in general I am very happy with it. SpamAssassin classifies mail according to various criteria and assigns each message a score. A score of between five and ten earns a message a place in my probablespam mailbox, and above ten sends the message straight into the caughtspam mailbox. Any mail getting this far that is not to a name that I recognise goes into the not_me mailbox. Anything left goes into my inbox.
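The routing rules above can be sketched as a small shell function. This is only an illustration of the decision order, not the actual delivery configuration (which isn't shown here): the function name `route_message`, its two arguments, and the treatment of a score of exactly ten as "probably spam" are all assumptions.

```shell
# Hypothetical sketch of the mailbox routing described above.
#   $1 = SpamAssassin score (may be fractional, e.g. 7.3)
#   $2 = "yes" if the message is addressed to a recognised name
route_message() {
    score=$1
    known=$2
    # awk does the comparison because POSIX sh has no floating point
    if awk -v s="$score" 'BEGIN { exit !(s > 10) }'; then
        echo caughtspam
    elif awk -v s="$score" 'BEGIN { exit !(s >= 5) }'; then
        echo probablespam
    elif [ "$known" != yes ]; then
        echo not_me
    else
        echo inbox
    fi
}

route_message 7.3 yes
```

The order of the tests matters: the spam checks run first, so a high-scoring message never reaches the recipient check.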

This has worked very well for me. Very rarely do I find spam in my inbox, and real mail ends up in caughtspam so rarely that I never look in there except when someone insists they have sent me mail that I can't find. The probablespam mailbox is mostly spam, but occasionally I find some real mail in there. The not_me mailbox contains some spam along with messages I have been bcced on.

But recently I have been finding more real mail in my probablespam mailbox. Almost invariably these messages have been classified as BAYES_99, meaning that the SpamAssassin Bayesian classifier thinks the message is almost certainly spam. It's been a long time since I first trained SpamAssassin so I wondered whether the database had been polluted. This is often known as Bayesian poisoning, and is part of the goal of messages you might see which contain a poem or part of a story or just a long list of random words.

So I decided to retrain the Bayesian classifier to see if it could do any better. First I backed up the current database, then cleared it and trained it afresh on ham and spam.

$ sa-learn --backup > /var/tmp/sa.db
$ sa-learn --clear
$ sa-learn --ham --progress --mbox ~/Mail/new ~/Mail/tips
$ sa-learn --spam --progress --mbox ~/Mail/probablespam ~/Mail/spam
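Should the retrained database turn out worse than the old one, the backup taken in the first step can be loaded back with sa-learn's `--restore` option. A minimal sketch, assuming the backup path used above; the wrapper function `restore_bayes` is a hypothetical name, and only `--clear` and `--restore` are real sa-learn options here.

```shell
# Hypothetical helper: put back a Bayes database saved with
# "sa-learn --backup > FILE". Refuses to clear anything if the
# backup file is missing or empty.
restore_bayes() {
    backup=$1
    if [ ! -s "$backup" ]; then
        echo "no usable backup at $backup" >&2
        return 1
    fi
    # Wipe the current database, then load the saved one
    sa-learn --clear && sa-learn --restore "$backup"
}

# Usage: restore_bayes /var/tmp/sa.db
```

Running `sa-learn --dump magic` afterwards prints the nspam and nham counters, which is a quick way to confirm how many messages the database has learned from.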

Early results are encouraging. A few mistakes of course, but that is to be expected until I train it a little better. But fixing the problems is easy in mutt:

$ grep sa-learn ~/.muttrc
  macro index S "|sa-learn --spam\ns=spam\n"
  macro pager S "|sa-learn --spam\ns=spam\n"
  macro index H "|sa-learn --ham\ns=new\n"
  macro pager H "|sa-learn --ham\ns=new\n"

As an aside, I wonder what SpamAssassin has to do with Apache?

[/software/spamassassin] permanent link

Sat, 27 May 2006

Updating SVN::Web


I noticed recently that my SVN::Web pages had stopped working. Today I found a little time to investigate. My apache error log said:

Can't locate object method "caught" via package "SVN::Web::X"
at /usr/local/share/perl/5.8.7/SVN/Web.pm

Nice.

I remembered having a bit of hassle installing it the first time around, primarily because it wasn't ready for Apache2, so I punched "SVN::Web Apache2" into Google, and surprised myself when I noticed that my notes page was the second hit. It was top on MSN.

Aha! So that's the reason I write these notes!

My notes told me which modules I could let debian install and which I had to manage myself. They also told me the hacks I had made to make things work with Apache2.

So in these situations I normally make sure I'm running the latest versions of everything. The bug I'm chasing might already be fixed. The first thing I noticed was that there was a new version of SVN::Web itself. So I installed it.

Since I had first installed SVN::Web, debian had upgraded from Perl 5.8.7 to 5.8.8, so the latest SVN::Web installed into a slightly different directory. During the installation it told me that Exception::Class was out of date and asked if it should be updated. I declined since currently Exception::Class was installed as a debian package and I was hoping it could stay that way. (In fact, I had also installed it from CPAN, but didn't read enough of my notes to notice that.)

After installing the latest SVN::Web, I tried running it, just to see what happened. I was expecting loads of errors since my hacky patches were now lost. In fact, I got exactly the same error as before. Good News! That seemed to show that SVN::Web now works with Apache2, and my hacks were no longer required.

But the original problem remained. So I installed the latest Exception::Class, which hadn't yet made it into debian, tried again and everything just worked.

Wonderful!

Now, if only there was a debian package of SVN::Web so that someone else could worry about all this.

So once again I battled SVN::Web and debian and I prevailed! And once again you can see the results at svnweb.

[/revision_control] permanent link
