As documented, we had a single master machine hosting the online quarantine, and multiple distinct MX hosts which would archive their rejected and accepted mail to local disk. These local archives of delivered mail and rejected SPAM were kept on the MX hosts for a period of five days.
One of the early challenges of our service was to ensure that all rejected mail was brought together in the single quarantine location, regardless of which MX machine had rejected it. We used rsync to pull all of the messages from the satellite MX machines to the single central host, where they could then be processed.
Upon each satellite MX machine the SPAM and non-SPAM mails were saved as Maildir folders beneath two fixed directories, /home/rejected and /home/good respectively. Beneath those directories we'd store the messages in a hierarchy named after the current date and the domain each message was addressed to.
There were two reasons to store these messages:
To serve as a local backup - because the external backup of the quarantine host might be up to 24 hours out of date.
To store it as a local spool, in case the central machine couldn't contact the MX machine for a period of time.
To actually make the rejected mail for each domain available to the quarantine host two things had to happen:
The mail had to be transferred to the quarantine host.
The mail had to be imported.
Transferring the mail from each satellite machine was a matter of using rsync, and the script performing that transfer was not much more complex than this:
Example 6-1. Importing messages from the satellite MX boxes.
#!/bin/sh
#
# For each known satellite MX machine pull /home/rejected & /home/good
# to our local system.
#

#
# If we're already running exit.
#
if [ -e /tmp/sync.in.progress ]; then
    exit
else
    touch /tmp/sync.in.progress
fi

#
# Source our list of secondaries.
#
if [ -e "/etc/secondaries.conf" ]; then
    . /etc/secondaries.conf
else
    # old defaults
    secondaries="incoming0.mail-scanning.com incoming1.mail-scanning.com"
fi

#
# If the hosts are pingable, copy
#
for host in $secondaries ; do
    if ( ping -c 1 $host 2>/dev/null >/dev/null ); then
        #
        # Good mail is moved to:
        #
        #   /home/secondaries/incoming0.mail-scanning.com/good
        #
        rsync -e ssh $args --bwlimit=500 -azr root@$host:/home/good/ \
            /home/secondaries/$host/good/

        #
        # Rejected mail is moved to a similar location.
        #
        rsync -e ssh $args --bwlimit=500 -azr root@$host:/home/rejected/ \
            /home/secondaries/$host/rejected/
    fi
done

#
# Cleanup our lock-file
#
rm /tmp/sync.in.progress
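The lock file at the top of the script suggests it was intended to be run repeatedly and unattended. As a purely hypothetical illustration (neither the schedule nor the script's installed path is documented here), a cron.d-style entry driving it might look like:

```shell
# Hypothetical /etc/cron.d entry: run the sync script every five minutes
# as root. The path /usr/local/bin/sync-secondaries is an assumption.
*/5 * * * * root /usr/local/bin/sync-secondaries >/dev/null 2>&1
```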
Once the mail had been pulled to the master machine we then had to actually process it. There were two cases:
A piece of SPAM had to be imported into the quarantine for the appropriate domain.
A piece of non-SPAM would increase the count of non-SPAM messages for the given domain.
Once the SPAM and non-SPAM email had been copied from each of the satellite machines, as described earlier in this chapter, we had a local directory tree which looked something like this:
Example 6-2. Master machine's copy of MX-machine mail.
/home/
`-- secondaries/
    |-- incoming0.mail-scanning.com/
    |   |-- good/
    |   |   |-- 1-3-2009/
    |   |   |   |-- hosted.org/
    |   |   |   |   |-- new/
    |   |   |   |   |-- cur/
    |   |   |   |   `-- tmp/
    |   |   |   `-- user.org/
    |   |   |       |-- new/
    |   |   |       |-- cur/
    |   |   |       `-- tmp/
    |   |   |-- 2-3-2009/
    |   |   |   |-- hosted.org/
    |   |   |   |   |-- new/
    |   |   |   |   |-- cur/
    |   |   |   |   `-- tmp/
    |   |   |   `-- user.org/
    |   |   |       |-- new/
    |   |   |       |-- cur/
    |   |   |       `-- tmp/
    |   |   `-- 3-3-2009/
    |   |       |-- hosted.org/
    |   |       |   |-- new/
    |   |       |   |-- cur/
    |   |       |   `-- tmp/
    |   |       `-- user.org/
    |   |           |-- new/
    |   |           |-- cur/
    |   |           `-- tmp/
    |   `-- rejected/
    |       |-- 1-3-2009/
    |       |   |-- hosted.org/
    |       |   |   |-- new/
    |       |   |   |-- cur/
    |       |   |   `-- tmp/
    |       |   `-- user.org/
    |       |       |-- new/
    |       |       |-- cur/
    |       |       `-- tmp/
    |       |-- 2-3-2009/
    |       |   |-- hosted.org/
    |       |   |   |-- new/
    |       |   |   |-- cur/
    |       |   |   `-- tmp/
    |       |   `-- user.org/
    |       |       |-- new/
    |       |       |-- cur/
    |       |       `-- tmp/
    |       `-- 3-3-2009/
    |           |-- hosted.org/
    |           |   |-- new/
    |           |   |-- cur/
    |           |   `-- tmp/
    |           `-- user.org/
    |               |-- new/
    |               |-- cur/
    |               `-- tmp/
    `-- incoming1.mail-scanning.com/
        ..
        ..
The naive approach to importing the mail, and increasing the counts, would have been to process each file located beneath the new/, cur/, and tmp/ directories, then delete it.
Unfortunately deleting messages once they'd been imported, or counted, wouldn't work, because they'd simply be restored the next time the rsync job executed.
One possible way of dealing with this problem would have been to only import the messages belonging to the previous day, and only rsync messages referring to the current day. However that delay would have meant the quarantine was less useful, so we ignored that approach.
Our process for importing the messages involved the use of a marker file for each message. That allowed us to continuously process the freshly copied messages without worrying about importing any message more than once.
We set up a script which would iterate over each file beneath the /home/secondaries/ directory and act upon it.
The recipe looked like this:
1. Recursively descend the directory /home/secondaries/ looking for files and ignoring directories.
2. For each file we find ($filename):
2a. If $filename ended in a .processed suffix skip it.
2b. If the file $filename.processed exists skip it.
2c. If $filename has /good/ in its name increase the count of delivered mail for the given domain.
2d. If $filename has /rejected/ in its name import the message into the quarantine for the given domain.
2e. Create the file "$filename.processed" to ensure this message isn't processed again.
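The recipe above can be sketched in a few lines of shell. This is an illustrative reconstruction, not the actual production script: the sample tree, filenames, and the echo statements standing in for the real "count" and "import" actions are all hypothetical, and the walk runs against a temporary directory rather than /home/secondaries/.

```shell
#!/bin/sh
# Build a tiny sample tree (hypothetical) so the walk can be demonstrated.
base=$(mktemp -d)
mkdir -p "$base/good/1-3-2009/hosted.org/new"
mkdir -p "$base/rejected/1-3-2009/hosted.org/new"
echo "mail" > "$base/good/1-3-2009/hosted.org/new/msg1"
echo "spam" > "$base/rejected/1-3-2009/hosted.org/new/msg2"

# 1. Recursively descend, looking only at files.
find "$base" -type f | while read filename; do
    # 2a. Skip the marker files themselves.
    case "$filename" in
        *.processed) continue ;;
    esac
    # 2b. Skip messages we've already handled on a previous run.
    if [ -e "$filename.processed" ]; then
        continue
    fi
    # Derive the domain from the path: .../<date>/<domain>/<new|cur|tmp>/<file>
    domain=$(echo "$filename" | awk -F/ '{print $(NF-2)}')
    # 2c/2d. Count or import, depending on the path component.
    case "$filename" in
        */good/*)     echo "count:  $domain" ;;
        */rejected/*) echo "import: $domain" ;;
    esac
    # 2e. Leave a marker so this message is never processed twice.
    touch "$filename.processed"
done
```

Running the script a second time produces no output at all, since every message now sits beside its .processed marker.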
In both cases we could determine the domain a message was addressed to from the pathname of the file itself, just as the pathname told us whether the message was SPAM or non-SPAM.
Because we used rsync without the --delete flag our .processed files would survive as long as the copied message did, and this allowed us to ensure that each message was only examined once.
Having a count of both SPAM and non-SPAM meant that a percentage-SPAM figure could be arrived at on a per-domain basis.
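The calculation itself is trivial; a sketch with hypothetical counts (the real figures came from the import step above):

```shell
#!/bin/sh
# Hypothetical per-domain counts for illustration only.
spam=742
good=258
# Integer arithmetic is sufficient for a whole-number percentage.
percent=$(( spam * 100 / ( spam + good ) ))
echo "hosted.org: ${percent}% SPAM"
```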