Spam is extremely annoying in whatever shape or form, but the most annoying thing about it by far is that despite its proven ineffectiveness it does not go away.
We've been trying to eradicate e-mail spam for nearly half a decade now, but since we've been focusing on fighting the methods (stopping mass-mailing) rather than the cause (prosecuting the idiots who actually pay to have it sent), we're unlikely to succeed anytime soon.
So This Is Progress, Huh?
But there's a new kind of spam, aimed at attracting traffic to websites by seeding links in comments or referrer logs. Most of these are designed to poison or otherwise fudge PageRank, but all share the common denominators of spam - undesired, irrelevant and a damned waste of time (plus occasionally offensive).
Comment spam is a well-known phenomenon, and Bayesian techniques have popped up to counter it in common weblog tools like Movable Type (for which, like Nuno, I recommend MT-Blacklist). It is also one of the reasons I, for one, don't allow any sort of comments (or persistent user input) in this Wiki.
Stealth Fakers
Referrer spam, however, is far more stealthy and "under the radar". It consists of issuing faked HTTP requests with bogus referring URLs, and it's a pain for any site that keeps a public Referrers page or public traffic statistics, since it makes it look as though you're linked from those bogus sites.
Even if (like me) you exclude your Referrers page from crawlers, clueless spammers don't get it and keep at it, filling your server logs with garbage.
So I've been taking it easy for a while, sending snippets of my HTTP logs to the abuse addresses of some ISPs and hosting providers (most notably EV1.net, which seems to generate a disproportionate amount of spam referrers).
Upping The Ante
Last week (in the course of my quarterly system snapshot) I dumped all the machine's logs, ran them through a quick Perl script, and found I had a particularly large number of especially unsavoury referrals from sites in the .info and .biz TLDs.
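The Perl script itself isn't reproduced here, so what follows is only a rough Python sketch of that kind of tally. It assumes the logs are in Apache's Combined Log Format; the regular expression and the function names (`referrer_tld`, `count_referrer_tlds`) are my own illustrations, not the original code.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed layout (Apache Combined Log Format):
# ip ident user [date] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def referrer_tld(line):
    """Return the TLD of the referring host, or None if there is none."""
    match = LOG_LINE.match(line)
    if not match:
        return None  # not a line in the assumed format
    host = urlsplit(match.group("referer")).hostname
    if not host or "." not in host:
        return None  # empty referrer ("-") or a bare hostname
    return host.rsplit(".", 1)[1].lower()


def count_referrer_tlds(lines):
    """Count requests per referrer TLD across an iterable of log lines."""
    return Counter(tld for tld in map(referrer_tld, lines) if tld)
```

Feeding it an open log file and calling `most_common()` on the result gives a quick ranking, which is enough to spot a TLD that shows up far more often than it should.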
Worse, some of those actually showed up on the Referrers page. So I picked up my old filter code (which runs before pretty much anything on this site) and hacked some serious oomph into it, by instating dynamic bans on all addresses issuing spam referrers and logging the results.
Bans are temporary, but instated (and cleared) in a fully automatic fashion. And I'm going fully Bayesian soon once I iron out the rating mechanism, so it's bound to be even more carefree.
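For the curious, the general idea can be sketched in a few lines of Python. This is not the filter that runs on this site: the `BanList` class, the one-hour default and the keyword-based `looks_like_spam` check are all placeholders I made up for illustration.

```python
import time

# Hypothetical placeholder rule; a real filter would use a better rating.
SPAM_KEYWORDS = ("casino", "pills", "poker")


def looks_like_spam(referer):
    """Crude stand-in check: flag referrers containing a known keyword."""
    return any(word in referer.lower() for word in SPAM_KEYWORDS)


class BanList:
    """Temporary, self-clearing bans keyed by client IP address."""

    def __init__(self, ban_seconds=3600, clock=time.time):
        self.ban_seconds = ban_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._expires = {}      # ip -> timestamp at which the ban lapses

    def ban(self, ip):
        self._expires[ip] = self.clock() + self.ban_seconds

    def is_banned(self, ip):
        expiry = self._expires.get(ip)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._expires[ip]   # ban has lapsed, clear it automatically
            return False
        return True

    def check(self, ip, referer, is_spam=looks_like_spam):
        """Return True if the request should be rejected."""
        if self.is_banned(ip):
            return True
        if is_spam(referer):
            self.ban(ip)            # instate a new ban and reject this hit
            return True
        return False
```

Lapsed bans are cleared lazily, the next time the address shows up, so nothing needs to run on a schedule. Swapping `looks_like_spam` for a Bayesian rating would only change the predicate passed to `check`.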
The Surprise
Nothing prepared me for the log results, however. After a week, I was expecting to find spammer scripts running at universities, hosting companies, etc. Instead, running a simple command line on a small sample told a different story:
$ cat /tmp/ips.txt |xargs -i host {} |grep pointer|cut -f 5 -d\ |rev|sort|rev
port-212-202-224-66.reverse.qsc.de.
www.yellex.de.
ip88.ph.ee.
inw2k2.hce.org.
old.hce.org.
indy.hce.org.
cs78147207.pp.htv.fi.
cm203-168-166-150.hkcable.com.hk.
host-200-105-136-20.acelerate.com.
eu247.st48-net74.ip.superonlinecorporate.com.
c-24-98-65-213.atl.client2.attbi.com.
www4.rkymtnhi.com.
user.rkymtnhi.com.
12-218-251-75.client.mchsi.com.
ns.nosui.com.
brm-sams-ext2.sun.com.
mstvnldc1.mstvnl.chello.com.
sv.d-jacket.com.
84.69-93-237.reverse.theplanet.com.
200-204-127-236.speedyterra.com.br.
200-168-24-208.speedyterra.com.br.
167130.telemar.net.br.
chello213047228107.tirol.surfer.at.
ip-cust-sv28085.telefonica-ca.net.
YahooBB219011200161.bbtec.net.
h-68-166-110-242.mclnva23.covad.net.
berthelemy-2-81-56-96-121.fbx.proxad.net.
priproxy.yisd.net.
tpp.dc.ukrtel.net.
adsl-65-71-88-51.dsl.rcsntx.swbell.net.
206-169-78-194.gen.twtelecom.net.
ip03.asccl.adsl.gxn.net.
wnpgmb06dc1-0-171.static.mts.net.
211-20-131-90.HINET-IP.hinet.net.
61-222-11-35.HINET-IP.hinet.net.
h-213.61.7.249.host.de.colt.net.
esx124dhcp799.essex01.md.comcast.net.
pcp02245282pcs.bechgr01.in.comcast.net.
c-67-165-143-103.client.comcast.net.
ip68-0-244-149.ri.ri.cox.net.
host202-20.pool62110.interbusiness.it.
host100-160.pool195103.interbusiness.it.
www.sfa.uconn.edu.
OTOTO.ETC.cmu.edu.
NCC1-FX-lonet.ILNET.ru.
117_PC6.ntcb.edu.tw.
dsl-201-129-37-62.prod-infinitum.com.mx.
dsl-201-129-128-34.prod-infinitum.com.mx.
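The `rev|sort|rev` part of that pipeline is the only non-obvious bit: reversing each name before sorting groups hosts by their trailing domain rather than by their first label. Here is a rough Python equivalent of the whole thing, assuming one IP address per line in the input file. The function names are mine, and the exact ordering may differ slightly from `sort`, which follows the shell's locale.

```python
import socket


def reverse_lookup(ip):
    """Return the reverse-DNS (PTR) name for ip, or None if it has none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def sort_by_suffix(hostnames):
    """Sort names by their reversed text, like `rev | sort | rev`."""
    return sorted(hostnames, key=lambda name: name[::-1])


def hostnames_for(path):
    """Resolve every IP listed in path and return the names, suffix-sorted."""
    with open(path) as handle:
        ips = [line.strip() for line in handle if line.strip()]
    names = [name for name in map(reverse_lookup, ips) if name]
    return sort_by_suffix(names)
```

Calling `hostnames_for("/tmp/ips.txt")` does live DNS lookups, so it is slow on a large sample and silently skips addresses without a PTR record, much as the `grep pointer` step does.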
Invasion Of The PC Snatchers
After cross-checking the IP addresses with the logged User-Agent, I found that, if the User-Agent info is real, most of these addresses are perfectly ordinary Windows PCs (or proxies/gateways for PCs). But let me rephrase that -
all of the User-Agent strings were variations of Microsoft Internet Explorer.
All of them. Of course, if these are spamming scripts, a fake User-Agent is the best way to dodge .htaccess filtering like the kind I use, but the hostnames above tell a very different story. It's highly unlikely that all of those hosts are running HTTP spamming scripts voluntarily.
My guess is that trojans don't just generate "regular" spam, and that someone has coded a minimal HTTP spam trojan that uses the existing HTTP libraries on Windows.
(If I recall my MFC coding days correctly, the Windows HTTP libraries, or at least the ones I used, issue requests that are indistinguishable from IE's and follow redirects automatically, a behaviour I also noticed while reading the raw logs.)
Or that there is some sort of malware doing the rounds that picks IP addresses (apparently at random) and issues HTTP requests with a pre-defined list of bogus referrers (so far, each IP address blocked has issued requests with at least 4 different bogus referrers, but the number might just be a fluke).
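Checking the "at least 4 different bogus referrers" figure is a matter of grouping logged hits by address. A minimal Python sketch follows, assuming the ban log can be reduced to (IP, referrer) pairs; the function name and the input shape are illustrative, not how my filter actually stores things.

```python
from collections import defaultdict


def referrers_per_ip(hits):
    """Map each IP to the number of distinct referrers it sent.

    hits is an iterable of (ip, referer) pairs.
    """
    seen = defaultdict(set)
    for ip, referer in hits:
        seen[ip].add(referer)
    return {ip: len(referers) for ip, referers in seen.items()}
```

Taking `min()` over the values of the result gives the lowest count for any blocked address, which is the number quoted above.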
So there's a lot more happening on the Net than we're (usually) aware of.
I'm glad these are apparently all Windows machines.
And then people wonder why corporate IT staff is getting more and more draconian every passing year.
Nah. Now that I think about it, it's just sad. Fortunately the press is catching up.
Update: More background info on previous tactics -
- textism: Refer Spamming
- BotWhack - a first attempt at filtering, relying on vanilla User-Agent strings.
- Porn Sites Hiding Behind Blogs
- Weblog spam - the cardinal Mark Pilgrim post.