Appendix D. Bayesian filtering with spambayes

We knew early on that Bayesian filtering would be one of the tests we applied to incoming email. It is a statistical approach that lets a machine recognise SPAM-like messages automatically, given a small database of previous decisions.

Of all the tests we applied, Bayesian filtering was the most complex, largely because it needs a local database of HAM/SPAM terms and frequencies.

We settled upon the spambayes implementation of Bayesian filtering primarily because it uses only a single binary database to maintain its records of words and frequencies. This meant that the database could be copied around very easily.

On a satellite MX machine spambayes would refer to a domain-specific database to do its testing; for the domain example.org that database would be located at /home/spam/example.org.db.

The problem we ran into was ensuring that training operations took effect. We decided that the control panel would hold the canonical copy of each spambayes database, and that from there each database would be copied to the satellite MX machines every hour.

This meant that, for training to take effect, it had to occur on the control panel host. When a user clicked "non-SPAM" in the quarantine display this essentially fed the message to the local/master copy of the database, and similarly when updates were made via email submission (as described in Appendix E) they operated upon the master copy.
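The hourly copy from the control panel to the satellites could be sketched roughly as follows. This is only an illustration, not the script we ran: the use of rsync, the function names, the DRY_RUN switch and the satellite host names are all assumptions; only the /home/spam directory comes from the text above.

```shell
#!/bin/sh
# Sketch only: push every spambayes database to each satellite MX.
# DB_DIR matches the directory used in this appendix; rsync, the
# function names and the DRY_RUN switch are assumptions.

DB_DIR=${DB_DIR:-/home/spam}

# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "$DRY_RUN" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# push_databases HOST...
# Copy each database file in DB_DIR to the same directory on each host.
push_databases() {
    for host in "$@"; do
        for db in "$DB_DIR"/*; do
            # Skip anything that is not a regular file.
            [ -f "$db" ] || continue
            run rsync -a "$db" "$host:$DB_DIR/" ||
                echo "copy to $host failed: $db" >&2
        done
    done
}
```

An hourly cron job on the control panel host could then call push_databases with the satellite names (for instance the hypothetical mx1.example.org and mx2.example.org). A production version would probably also need to hold the per-domain lock described in section D.1 while copying, so that a half-written database is never shipped.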

D.1. Problems with spambayes

The tool we used for Bayesian testing of mail didn't correctly lock its database. This was documented in Debian bug #296322.

Our solution was to replace calls to the sb_filter.py tool with calls to three wrappers. These wrappers added lockfile protection, and also abstracted the operation a little.

/mf/bin/spam-test

Take a message from STDIN and return an exit code denoting the spam/ham result.

/mf/bin/train-as-ham

Take a message from STDIN and train that as being "not spam".

/mf/bin/train-as-spam

Take a message from STDIN and train that as being spam.
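As a rough sketch of how a caller such as the quarantine display might drive the two training wrappers, consider the helper below. The function name train_message, the verdict strings and the BIN_DIR variable are hypothetical; only the wrapper names and the /mf/bin location come from the list above.

```shell
#!/bin/sh
# Sketch only: dispatch a stored message to the matching training wrapper.
# train_message and BIN_DIR are hypothetical names used for illustration.

BIN_DIR=${BIN_DIR:-/mf/bin}

# train_message VERDICT DOMAIN FILE
# VERDICT is "spam" or "ham"; FILE holds the raw message.
train_message() {
    verdict=$1 domain=$2 file=$3

    case "$verdict" in
        spam) tool=train-as-spam ;;
        ham)  tool=train-as-ham ;;
        *)    echo "unknown verdict: $verdict" >&2
              return 64 ;;
    esac

    # The wrappers read the message from STDIN and take the domain
    # as their first argument; their exit code is passed back as-is.
    "$BIN_DIR/$tool" "$domain" < "$file"
}
```

A call such as train_message ham example.org /path/to/message would then feed the message to train-as-ham for that domain and return the wrapper's exit code to the caller.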

Example D-1. Testing a message to see if it is SPAM.


#!/bin/sh
#
#  Test if a message is spam for the given domain.
#
#  Usage: cat $file | spam-test example.com
#
##
#
# Steve
# --

domain=$1

#
#  Do we have the lockfile tool?
#
if [ ! -x /usr/bin/lockfile ]; then
   echo "Lockfile missing: apt-get install procmail"
   exit 1
fi

#
#  Ensure we got a domain
#
if [ -z "$domain" ]; then
    echo "Domain is mandatory"
    exit 2
fi

#
#  Does the domain exist locally?
#
if [ ! -d "/srv/$domain" ]; then
    echo "Domain not handled here: $domain"
    exit 3
fi

#
#  Now we're running under lockfile.
#
if ( /usr/bin/lockfile -s 9 -r 10 /tmp/spam.$domain ); then

    #
    # Do we have a database setup?  If not create one
    #
    if [ ! -e "/home/spam/$domain" ]; then
        sb_filter.py -d /home/spam/$domain -n
    fi

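    #
    #  OK now we're green.  Drop the domain from the argument list so
    #  that only any extra arguments reach sb_filter.py; otherwise it
    #  would treat the domain as a file to read rather than using STDIN.
    #
    shift
    sb_filter.py -d /home/spam/$domain "$@"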
    ret="$?"

    rm -f /tmp/spam.$domain

    #
    #  Keep any exit code we had
    #
    exit $ret

else

    echo "Lockfile failed"
    exit 4
fi

Example D-2. Training a message as being SPAM.


#!/bin/sh
#
#  Train a message as spam for the given domain.
#
#  Usage: cat $file | train-as-spam example.com
#
##
#
# Steve
# --

domain=$1

#
#  Do we have the lockfile tool?
#
if [ ! -x /usr/bin/lockfile ]; then
   echo "Lockfile missing: apt-get install procmail"
   exit 1
fi

#
#  Ensure we got a domain
#
if [ -z "$domain" ]; then
    echo "Domain is mandatory"
    exit 2
fi

#
#  Does the domain exist locally?
#
if [ ! -d "/srv/$domain" ]; then
    echo "Domain not handled here: $domain"
    exit 3
fi

#
#  Now we're running under lockfile.
#
if ( /usr/bin/lockfile -s 9 -r 10 /tmp/spam.$domain ); then

    #
    # Do we have a database setup?  If not create one
    #
    if [ ! -e "/home/spam/$domain" ]; then
        sb_filter.py -d /home/spam/$domain -n
    fi

    #
    #  OK now we're green.
    #
    sb_filter.py -d /home/spam/$domain -s
    ret="$?"

    rm -f /tmp/spam.$domain

    #
    #  Keep any exit code we had
    #
    exit $ret

else

    echo "Lockfile failed"
    exit 4
fi

Example D-3. Training a message as not being SPAM.


#!/bin/sh
#
#  Train a message as ham for the given domain.
#
#  Usage: cat $file | train-as-ham example.com
#
##
#
# Steve
# --

domain=$1

#
#  Do we have the lockfile tool?
#
if [ ! -x /usr/bin/lockfile ]; then
   echo "Lockfile missing: apt-get install procmail"
   exit 1
fi

#
#  Ensure we got a domain
#
if [ -z "$domain" ]; then
    echo "Domain is mandatory"
    exit 2
fi

#
# Does the domain exist locally?
#
if [ ! -d "/srv/$domain" ]; then
    echo "Domain not handled here: $domain"
    exit 3
fi

#
#  Now we're running under lockfile.
#
if ( /usr/bin/lockfile -s 9 -r 10 /tmp/spam.$domain ); then

    #
    # Do we have a database setup?  If not create one
    #
    if [ ! -e "/home/spam/$domain" ]; then
        sb_filter.py -d /home/spam/$domain -n
    fi

    #
    #  OK now we're green.
    #
    sb_filter.py -d /home/spam/$domain -g
    ret="$?"

    rm -f /tmp/spam.$domain

    #
    #  Keep any exit code we had
    #
    exit $ret

else

    echo "Lockfile failed"
    exit 4
fi
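
The three wrappers above differ only in the flags they pass to sb_filter.py, so the locking could be factored into a single helper. The following is a minimal sketch under that assumption, not the code we deployed: with_spam_lock, LOCKFILE_BIN and LOCK_DIR are hypothetical names, and the trap is an addition intended to remove the lock file if the script is interrupted while holding it.

```shell
#!/bin/sh
# Sketch only: shared locking helper for the three wrappers.
# with_spam_lock, LOCKFILE_BIN and LOCK_DIR are hypothetical names;
# the lockfile flags and the lock path mirror the listings above.

LOCKFILE_BIN=${LOCKFILE_BIN:-/usr/bin/lockfile}
LOCK_DIR=${LOCK_DIR:-/tmp}

# with_spam_lock DOMAIN COMMAND [ARG...]
# Run COMMAND while holding the per-domain lock; return its exit code.
with_spam_lock() {
    lock="$LOCK_DIR/spam.$1"
    shift

    if ! "$LOCKFILE_BIN" -s 9 -r 10 "$lock"; then
        echo "Lockfile failed" >&2
        return 4
    fi

    # Remove the lock even if we are interrupted while holding it.
    trap 'rm -f "$lock"; exit 130' INT TERM

    "$@"
    ret=$?

    trap - INT TERM
    rm -f "$lock"

    # Keep any exit code the command had.
    return $ret
}
```

With such a helper, the final step of train-as-spam would reduce to something like with_spam_lock "$domain" sb_filter.py -d "/home/spam/$domain" -s, with the argument checks kept in front of it.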