Chapter 7. Service mistakes

We're pleased to say that in all the time our service was live we never lost any incoming email, suffered service unavailability, or otherwise failed in any significant fashion.

But no service is perfect, and we made some mistakes which came back to haunt us.

Failure To Understand Scale

In the early days of the service we spent time implementing features which simply didn't scale and had to be withdrawn. As with many mistakes this was a learning experience, and adding features only to remove them when they proved unpopular, or less useful than initially anticipated, was not necessarily a bad thing.

As a concrete example, early on we exported an RSS feed containing summaries of rejected mail. This proved to be an annoyance because an RSS feed isn't useful when it contains 30,000 entries!
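The obvious fix, had the feature survived, would have been to cap the feed at a sane number of recent entries. A minimal sketch of that idea in Python; the entry structure and the limit of 50 are illustrative assumptions, not details of our actual feed:

```python
# Sketch: cap a rejected-mail feed at the most recent N entries.
# The entry structure and the default limit are illustrative assumptions.
def recent_entries(entries, limit=50):
    """Return the newest `limit` entries, newest first."""
    return sorted(entries, key=lambda e: e["timestamp"], reverse=True)[:limit]
```

Even a generous cap keeps the feed readable while still showing users what was rejected most recently.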

Remote Hosting

The master machine which contained the central archive of rejected mail was located in America. The archive was local to the control panel system itself, so users logging in and viewing their rejected mail wouldn't suffer a delay.

The primary downside to the location was that the additional network distance from the MX machines made copying the archives of rejected mail slower than ideal.

Were we to build the distributed setup again we would take more care to ensure that the various hosts were "close" to each other: all hosts located in America, or Germany, or the Netherlands, for example. (In that case we'd still aim for redundancy, by using different hosting companies in the same country.)

The Quarantine Storage

Over the lifetime of the service we used several different techniques for storing the archive of rejected mail. Initially we stored it in a per-domain SQLite database, but this became too slow once domains with significant volumes of email were added to the service.

Our first mistake was to move the rejected mail from SQLite to MySQL. Eventually we fixed that too: we moved to storing rejected mail beneath a /reject directory on the master machine, merely keeping indexes updated whenever we added or removed archived mail.
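The final scheme can be sketched as follows. The directory layout, the filename scheme, and the JSON-lines index format here are assumptions for illustration, not our exact implementation:

```python
import json
import os
import time
import uuid

REJECT_ROOT = "/reject"  # assumed layout: one subdirectory per domain


def archive_rejected(domain, message_bytes, root=REJECT_ROOT):
    """Store one rejected message as a flat file and append to the
    domain's index, so the control panel never has to scan the tree."""
    dom_dir = os.path.join(root, domain)
    os.makedirs(dom_dir, exist_ok=True)

    # Timestamp plus a random suffix gives a unique, sortable name.
    name = "%d.%s" % (int(time.time()), uuid.uuid4().hex)
    path = os.path.join(dom_dir, name)
    with open(path, "wb") as fh:
        fh.write(message_bytes)

    # Append-only index: one JSON line per archived message.
    with open(os.path.join(dom_dir, "index"), "a") as idx:
        idx.write(json.dumps({"file": name, "size": len(message_bytes)}) + "\n")
    return path
```

The appeal of this design is that the mail bodies are just files, while listing or counting a domain's quarantine only ever touches the small index.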

Bayesian Corruption

The Bayesian filtering we applied to incoming mail was implemented using the existing spambayes software. However, we soon discovered that spambayes didn't lock its database properly.

If a mail was being tested for a domain at the same time as a training event was occurring, the resulting corruption would wipe out the database. So we had to invoke spambayes via wrapper scripts which applied locking. These wrappers are documented in Appendix D.
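The wrappers in Appendix D are the authoritative version; the essential idea is just to hold an exclusive lock on a sentinel file for the duration of each spambayes invocation, so testing and training serialise against each other. A minimal sketch, where the lock path and the wrapped command are assumptions:

```python
import fcntl
import subprocess

LOCK_FILE = "/var/run/spambayes.lock"  # assumed sentinel path


def run_locked(command, lock_file=LOCK_FILE):
    """Run `command` (a list of arguments) while holding an exclusive
    flock, so concurrent test/train runs never touch the database
    at the same time. Returns the command's exit status."""
    with open(lock_file, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return subprocess.call(command)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

A training run would then be something like run_locked(["sb_filter.py", "--train", ...]) rather than invoking the tool directly.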

Failure To Use DNS Caches

Many different parts of our testing process involved the use of DNS lookups. We achieved a significant boost in throughput when we implemented local DNS caches upon each MX machine.

We could describe the use of DNS caches as a service optimisation, but it is probably fairer to say that not deploying a cache from the start was a definite mistake.
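One simple way to get such a cache is to run a small caching resolver, such as dnsmasq, on each MX host and point the resolver configuration at it. The following fragment is illustrative rather than our exact setup:

```
# /etc/dnsmasq.conf - local DNS cache on each MX machine
listen-address=127.0.0.1   # only answer queries from this host
cache-size=10000           # keep up to 10,000 records in memory
```

With /etc/resolv.conf containing "nameserver 127.0.0.1", every lookup made by the filtering plugins hits the local cache first, and repeated queries for the same names become essentially free.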

Repeating The Same Work

Because plugins were generally stateless we were frequently performing the same operations over and over again.

For example, detecting bad HTTP links in incoming email involved performing DNS queries against every link in a message, and almost every HTML email contains a reference to an external DTD, which would be recognised as a hyperlink.

We eventually skipped processing all links which referred to sites such as w3.org, microsoft.com and openxmlformats.org.
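The skip-list itself was a simple check applied before any DNS lookup. A sketch of the kind of test involved; the helper name and the exact list are illustrative:

```python
from urllib.parse import urlparse

# Hosts whose links never need checking; the exact list is illustrative.
SKIP_DOMAINS = {"w3.org", "microsoft.com", "openxmlformats.org"}


def should_check(url, skip=SKIP_DOMAINS):
    """Return True if this link's host warrants a DNS lookup,
    False if it belongs to a known-good site (or a subdomain of one)."""
    host = (urlparse(url).hostname or "").lower()
    return not any(host == d or host.endswith("." + d) for d in skip)
```

Since nearly every HTML message carries a DTD reference such as http://www.w3.org/TR/html4/strict.dtd, skipping these hosts eliminated a DNS query per message at essentially no cost.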