Several of our testing methods evolved over time, from fairly basic checks into something significantly more involved and advanced. The badlinks plugin was a typical example: it started as a very basic test and later evolved to be more useful and comprehensive.
The initial aim of the plugin was to detect and reject emails of a particular form:
Small emails in HTML format.
The emails would contain a single inline image (*.png|*.jpg).
The image would be wrapped with a link to http://random.string.tn/
There were several possible approaches to programmatically detecting emails like this, but our initial approach was to extract URLs from each incoming message and compare those to a blacklist. A SPAM link might look something like this:
This link contains several parts:
1. The domain name.
2. A path.
3. A CGI key.
The path and key components would typically vary between messages, so we were only concerned with examining the hostname, and testing that against our blacklist file.
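As an illustration of the hostname-only comparison described above, here is a minimal Python sketch. It is not the plugin's actual code: the URL pattern, the function names and the one-hostname-per-line blacklist format (with `#` comments) are all assumptions made for the example.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern: an http/https URL runs until whitespace, a quote or an angle bracket.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""", re.IGNORECASE)


def extract_hostnames(body):
    """Return the set of lowercased hostnames found in HTTP links within body."""
    hosts = set()
    for url in URL_PATTERN.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed URL (for example a broken IPv6 literal): ignore it.
            continue
        if host:
            hosts.add(host.lower())
    return hosts


def load_blacklist(lines):
    """Build a set of hostnames from lines, skipping blanks and '#' comments.

    The file format is an assumption: one hostname per line.
    """
    entries = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            entries.add(entry)
    return entries


def is_blacklisted(body, blacklist):
    """True if any hostname linked from body appears in the blacklist."""
    return not extract_hostnames(body).isdisjoint(blacklist)
```

Because only the hostname is compared, the path and CGI key can change from message to message without affecting the result.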
The blacklist would be regularly updated and uploaded to each satellite MX machine, via rsync, after an admin added new entries to it. This meant that links would be extracted from every single incoming email:
HTTP links were extracted and recorded from each message which was to be rejected as SPAM, so that an admin could look at them and decide whether the link was itself an indicator of a SPAM email.
Emails which hadn't yet been determined to be SPAM had their HTTP links extracted and tested against the blacklist, to determine if the message was SPAM, as part of the testing process.
Having a global list of URLs contained in rejected emails allowed us to develop a simple online tool, which ordered URLs based upon the number of times they'd been seen, and loaded each site in a browser along with a "SPAM: Yes/No" checkbox. An admin could test several hundred sites in a short space of time, and keep current with emerging URLs easily.
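The ordering step behind such a review tool can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the idea of passing in a set of already-reviewed hostnames, and the in-memory list of sightings are hypothetical, and the real tool's storage and interface are not described here.

```python
from collections import Counter


def review_queue(seen_hosts, already_reviewed=frozenset()):
    """Return (hostname, count) pairs, most frequently seen first.

    seen_hosts is an iterable with one entry per sighting in a rejected email.
    Hostnames in already_reviewed (a hypothetical set of hosts an admin has
    already judged) are left out of the queue.
    """
    counts = Counter(h for h in seen_hosts if h not in already_reviewed)
    return counts.most_common()
```

For example, `review_queue(["a.example", "b.example", "a.example"])` puts `a.example` at the head of the queue because it was seen twice.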
It quickly became apparent that, although there was an administrative overhead, the technique itself was useful: a single site could be advertised in over 20,000 emails in the space of 48 hours.
Although testing incoming hyper-links against a flat file of known bad sites was a simple process and reasonably fast to carry out, we knew that such an approach was doomed to failure if we ever fell behind in maintaining our list. So we looked at different ways we could evolve the test and reduce the administrative burden involved in maintaining the blacklist.
One change which led to a significant improvement came from the observation that a lot of the advertised sites were hosted by botnets. Detecting botnet sites was something that could be achieved using simple DNS lookups, which was a very lightweight process.
Because botnets don't like to have a single point of failure, they would typically implement webhosting across a number of compromised machines. A site hosted in this fashion might have a name in DNS which resolved to 5+ different IP addresses, each of which would be a residential IP address and none of which would have reverse DNS entries.
Many large or popular websites like http://www.microsoft.com/ are configured with redundant hosting via partners such as Akamai, and they too typically resolve to multiple IP addresses, but the difference between those sites and botnet sites is obvious:
A legitimate site would have valid or matching reverse DNS entries.
A legitimate site wouldn't resolve to residential IP addresses.
So in addition to testing domain names against our list of known-bad sites we also heuristically decided that links were bogus if the hostname section resolved to multiple IP addresses, and over half of those IP addresses were missing reverse DNS entries or were obviously hosted upon residential IP addresses.
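The heuristic above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the plugin's implementation: the function names are hypothetical, only IPv4 addresses are considered, and the residential-address check is left as a pluggable callback because the text does not say how that was decided. The lookups are passed in as parameters so the logic can be exercised without touching the network.

```python
import socket


def resolve_ips(hostname):
    """Return the distinct IPv4 addresses for hostname, or [] if lookup fails."""
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return []
    return sorted(set(addresses))


def has_reverse_dns(ip):
    """True if a reverse (PTR) lookup for ip succeeds."""
    try:
        socket.gethostbyaddr(ip)
    except OSError:
        return False
    return True


def never_residential(ip):
    """Placeholder (assumption): plug in a real residential-range check here."""
    return False


def looks_like_botnet(hostname, resolve=resolve_ips, has_ptr=has_reverse_dns,
                      is_residential=never_residential):
    """Flag a hostname that resolves to multiple IP addresses when over half
    of them lack reverse DNS or look residential."""
    ips = resolve(hostname)
    if len(ips) < 2:
        # A single address (or a failed lookup) is not treated as botnet hosting.
        return False
    suspicious = sum(1 for ip in ips if not has_ptr(ip) or is_residential(ip))
    return suspicious * 2 > len(ips)
```

Each call to `looks_like_botnet` with the default lookups performs one forward query plus one reverse query per address, so repeated hostnames cause repeated DNS traffic unless the results are cached.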
We also experimented with testing links against the surbl.org lookup site, but we found this didn't increase our accuracy in any significant fashion, so these lookups were abandoned.
Many of our tests evolved from their initial implementation to add features or improve correctness. The change from comparing sites against a blacklist to performing botnet detection was a particularly interesting evolution. Although this evolution was generally a positive one, it still qualifies for its own entry in our list of mistakes: Failure To Use DNS Caches.
Wikipedia has an article on this topic: fast flux hosting.