4.4. System & service monitoring

Our system was designed to be robust in the presence of failure. As described previously, this was primarily achieved by using multiple redundant satellite MX machines, hosted in different locations by different hosting companies.

The main exception to this was the control panel & quarantine machine, which was unique and not duplicated. (To mitigate the risk of failure, database dumps were taken off-site every few hours, and all data was backed up off-site once a day.)

Although the loss of one or two MX machines wouldn't have caused any issues with delivery time, it was obviously important to know when a machine had gone off-line, or was struggling to keep up with the number of SMTP connections it was receiving.

To ensure that things were working correctly, we used both internal and external monitoring.

External monitoring was carried out by a custom script which ensured that each machine was reachable and was running the services that were expected. (An installation of Nagios/Netsaint would have been sufficient; the custom script just made a nice webpage and sent SMS alerts on failure.)
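As a rough illustration of this kind of external check, the following Python sketch probes a TCP port on each machine and reports the ones that fail. It is not the original script: the host names, the port list and the function names are all assumptions made for the example, and the webpage and SMS alerting are left out.

```python
import socket

# Hypothetical inventory: the real machine names are not given in the text.
CHECKS = [("mx1.example.com", 25), ("mx2.example.com", 25)]


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def failed_checks(checks, probe=port_open):
    """Return the (host, port) pairs for which the probe failed."""
    return [(host, port) for host, port in checks if not probe(host, port)]
```

A wrapper could call failed_checks(CHECKS) from an off-site machine every few minutes and raise an alert whenever the returned list is not empty.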

Internal monitoring was achieved by having each MX machine run a series of tests via local cronjobs. There were three main tests which were regularly executed upon each host:

/mf/bin/timeout

This command ran once a minute and ensured there were no orphaned or stuck qpsmtpd processes. Any that were found were killed.

(This test was specifically added as we found that earlier versions of qpsmtpd would sometimes hang for no obvious reason. Later versions didn't suffer from this problem, but the tests were left in place just in case the problem did ever recur.)
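The general idea behind a check like /mf/bin/timeout can be sketched as follows. This is an illustration under assumptions rather than the original code: the age threshold, the way a "stuck" process is recognised (here, simply by age) and the function names are all invented for the example. It also relies on the etimes column of ps, which is available in recent Linux procps versions but not everywhere.

```python
import os
import signal
import subprocess

MAX_AGE_SECONDS = 600  # assumed threshold; the text does not state one


def stale_pids(ps_output, name="qpsmtpd", max_age=MAX_AGE_SECONDS):
    """Pick the PIDs of matching processes older than max_age seconds.

    ps_output holds one "PID ELAPSED_SECONDS COMMAND" entry per line.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, age, command = fields
        # Substring match is an assumption; a real check would also need
        # to spare the long-lived parent daemon.
        if name in command and int(age) > max_age:
            pids.append(int(pid))
    return pids


def kill_stale():
    """List processes with ps and send SIGKILL to the stale ones."""
    listing = subprocess.run(
        ["ps", "-eo", "pid=,etimes=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in stale_pids(listing):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between listing and killing
```

Keeping the selection logic in stale_pids, separate from the code that actually sends signals, makes it possible to test the rule against canned ps output without killing anything.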

/mf/bin/smtp-test

This command ran a full SMTP transaction against the loopback address, and restarted the qpsmtpd server if there was a failure.
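A self-test of this kind might look like the sketch below, which uses Python's standard smtplib to run EHLO, MAIL FROM, RCPT TO and DATA against the local server. The probe addresses, the timeout and the restart command are assumptions made for the example; the text does not say how the original script restarted qpsmtpd.

```python
import smtplib
import subprocess

PROBE_ADDRESS = "probe@example.com"  # hypothetical test address


def smtp_transaction_ok(host="127.0.0.1", port=25, timeout=10):
    """Run a complete SMTP transaction; return True only if it succeeds."""
    message = (
        f"From: {PROBE_ADDRESS}\r\n"
        f"To: {PROBE_ADDRESS}\r\n"
        "Subject: self-test\r\n"
        "\r\n"
        "self-test\r\n"
    )
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            # sendmail() performs EHLO/HELO, MAIL FROM, RCPT TO and DATA,
            # and returns a dict of refused recipients (empty on success).
            refused = smtp.sendmail(PROBE_ADDRESS, [PROBE_ADDRESS], message)
            return not refused
    except (smtplib.SMTPException, OSError):
        return False


def check_and_restart(probe=smtp_transaction_ok, restart=None):
    """Restart the server if the probe fails; return True if a restart ran."""
    if probe():
        return False
    if restart is None:
        # Assumed restart command; adjust to the local init system.
        subprocess.run(["service", "qpsmtpd", "restart"], check=False)
    else:
        restart()
    return True
```

Testing against the loopback address means the check exercises the SMTP server itself, independently of whether the machine is reachable from outside, which is what the external monitoring covers.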

/mf/bin/uptime

As several of the scheduled processes (primarily copying rejected messages and performing off-site backups) were IO-intensive, there were times when the system load would rise to an unacceptable level.

Every two minutes the uptime command would run. If the system load was 5 or higher, the SMTP service would stop accepting new connections (via a temporary firewall rule) and a marker file would be created on the filesystem.
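The load check can be sketched roughly as follows. Only the threshold of 5 comes from the description above; the marker path, the exact iptables rule, and the assumption that both the rule and the marker are removed once the load drops again are illustrative guesses.

```python
import os
import subprocess
from pathlib import Path

LOAD_LIMIT = 5.0                        # threshold from the text
MARKER = Path("/var/run/mf-high-load")  # hypothetical marker path
# Assumed rule: reject new (SYN) connections to the SMTP port.
BLOCK_RULE = ["INPUT", "-p", "tcp", "--dport", "25", "--syn", "-j", "REJECT"]


def set_firewall(block):
    """Insert the blocking rule when block is True, delete it otherwise."""
    action = "-I" if block else "-D"
    subprocess.run(["iptables", action, *BLOCK_RULE], check=False)


def run_once(load=None, marker=MARKER, firewall=set_firewall):
    """Compare the load with the limit and update marker and firewall.

    Returns True if the load is at or above the limit.
    """
    if load is None:
        load = os.getloadavg()[0]  # 1-minute load average
    was_high = marker.exists()
    is_high = load >= LOAD_LIMIT
    if is_high and not was_high:
        marker.touch()
        firewall(True)
    elif was_high and not is_high:
        marker.unlink()
        firewall(False)
    return is_high
```

Acting only when the state changes keeps the rule from being inserted again on every run while the load stays high.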

The marker file was used to throttle several parts of our setup. For example, we'd import messages into the quarantine regularly, but if the marker was present we'd attempt to avoid adding to the load by sleeping for several seconds between importing each message.
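A throttled import loop along these lines could look like the sketch below. The marker path, the pause length (standing in for "several seconds") and the import_one callback are assumptions for the example; the real quarantine import code is not shown in this document.

```python
import time
from pathlib import Path

MARKER = Path("/var/run/mf-high-load")  # hypothetical marker path
PAUSE_SECONDS = 5                       # assumed pause length


def import_all(messages, import_one, marker=MARKER,
               pause=PAUSE_SECONDS, sleep=time.sleep):
    """Import every message, pausing after each one while the marker exists.

    Returns the number of pauses taken.
    """
    pauses = 0
    for message in messages:
        import_one(message)
        # Re-check on every iteration so the loop speeds up again as soon
        # as the marker is removed.
        if marker.exists():
            sleep(pause)
            pauses += 1
    return pauses
```

Because the marker is only a file, any scheduled job can honour it with a single existence check, with no need for the jobs to communicate with each other directly.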

Note that each script is located beneath /mf/bin. This is because the code running upon all our machines was located beneath /mf, for reasons explained in Appendix B.