May The Cache Be With You

Inspired by Melo's reference to mod_speedyfeed, I've been replumbing the PhpWiki RSS generator to cut down bandwidth consumption by only including new or updated feed items.

The logic is plain and simple (in fact, it's quite similar to something I did before), and just about anyone can (and probably should) code this, no matter what the programming language:

  • Grab the most recent 15 Wiki nodes, and keep track of the last updated one.
  • If the RSS client is clever enough to send an If-Modified-Since header, compare the modification times of those nodes with the header value.
    • If there are no updated items at all, return a 304 Not Modified and zero content.
    • Else, assemble the partial feed and spit it out with a Last-Modified header according to the last updated node.
  • If the RSS client did not issue an If-Modified-Since header, output the entire feed with a Last-Modified header according to the last updated node, in the hopes it will wise up and start doing things properly.
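The steps above can be sketched roughly as follows. This is Python rather than the PHP that PhpWiki is written in, and it is only an illustration under stated assumptions: the Node class, the build_feed_response function and its (status, headers, items) return shape are hypothetical names, not anything taken from PhpWiki.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


@dataclass
class Node:
    """Hypothetical stand-in for a Wiki node; 'modified' must be timezone-aware."""
    title: str
    modified: datetime


def build_feed_response(nodes, if_modified_since=None, limit=15):
    """Return (status, headers, items) for a feed request."""
    # Grab the most recent nodes and keep track of the last updated one.
    recent = sorted(nodes, key=lambda n: n.modified, reverse=True)[:limit]
    if not recent:
        return 200, {}, []

    # HTTP dates only carry whole seconds, so drop microseconds before comparing.
    latest = recent[0].modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(latest, usegmt=True)}

    since = None
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: treat it as absent
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)

    if since is None:
        # No usable If-Modified-Since: send the entire feed.
        return 200, headers, recent

    updated = [n for n in recent if n.modified.replace(microsecond=0) > since]
    if not updated:
        # Nothing new: 304 Not Modified and zero content.
        return 304, headers, []

    # Partial feed containing only the new or updated items.
    return 200, headers, updated
```

Treating an unparseable header as missing means a confused client gets the full feed instead of an error, which seems the safer default.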

Of course, this assumes RSS aggregators know how to use If-Modified-Since. Most do, but there are exceptions: some versions of NetNewsWire and Bloglines are two of the notable ones. newspipe is another, and one that either Ricardo or I can fix easily once we find the time. It should be doing this properly, but some instances (older code? proxies? mis-caching?) aren't.

I think I'm going to take the trouble of logging full headers for RSS only and figure out what is happening - it might be a date parsing bug (I'm using strtotime, but some folk mis-parse timezones in headers). So far, most people checking the feed are getting 304s (or, when I do a minor update, short 200s with only a few KB), which suggests it's working.
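To illustrate the kind of timezone trouble I mean, here is a small Python sketch (not the strtotime-based PHP code the Wiki actually runs) that normalizes an If-Modified-Since value to UTC before any comparison. The function name parse_http_date is hypothetical, and stripping a trailing "; length=..." suffix is an assumption about what some older clients send, not something I have confirmed in my logs.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Return a UTC-aware datetime for an HTTP date header, or None if unusable."""
    if not value:
        return None
    # Assumption: some clients append parameters such as "; length=1234".
    value = value.split(";", 1)[0].strip()
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        # No usable zone information: assume UTC, as HTTP dates are meant to be GMT.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Normalize to UTC so differently written zones compare as the same instant.
    return parsed.astimezone(timezone.utc)
```

With this, "Sat, 01 Jan 2005 12:00:00 GMT" and "Sat, 01 Jan 2005 13:00:00 +0100" come out as the same instant, so a client that writes its local offset into the header is still compared correctly.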

Join Me In The Cache Side

While investigating this, I found out that PhpWiki had about zero HTTP cache control (at least the ancient sources I'm running only have provision for internal caching, not HTTP cache control), and hacked in a similar algorithm for all page renders.

As a result, bandwidth usage has dropped around 30% so far. Frequently-viewed pages (like the HomePage) are being artificially kept back to increase the effect, and I'm thinking of looking at problematic User-Agents and including appropriate warnings in the RSS and HTML contents.

Of course, there are always surprises. Google seems to have changed their indexing bot's User-Agent to Mozilla/5.0 (compatible; Googlebot/2.1; + I had previously excluded it from a few sections (it insisted on indexing older versions of pages), so the traffic jumped back up when the new version started re-indexing this.
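One way to make such an exclusion survive a User-Agent change is to match on a product token instead of the whole string. A minimal Python sketch, where the BOT_TOKENS list and the is_excluded_bot function are hypothetical and not what the Wiki currently does:

```python
# Hypothetical list of lowercase substrings that identify crawlers to exclude.
BOT_TOKENS = ("googlebot",)


def is_excluded_bot(user_agent):
    """Return True if the User-Agent contains any known bot token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)
```

Because the check is a case-insensitive substring match, it catches both a bare Googlebot/2.1 string and one wrapped in a Mozilla/5.0 (compatible; ...) prefix. The trade-off is that anything merely claiming to be Googlebot gets excluded too.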