Grilling Posts

After Brent Simmons, Tim Bray and Tom Insam popped up in my feed practically in a row writing about baking posts and running sites with site generators rather than fully dynamic platforms, I thought it would be a good time to summarize how I grill this site’s pages in .

Right now, everything on this site is stored in , which replaced .

To post, I simply edit a text file, drop it into a folder, and it gets synced automagically to the server.

That’s it. Zero hassle. Completely free choice of format, editor, device, you name it. No headaches yet, and if has a glitch or goes away, there are plenty of replicas - and I can replace their service with in an eyeblink.

A post is published instantly, and Whoosh indexes the content within 30 minutes of it being online (I’ve been playing around with filesystem notifications to make that instantaneous, but tends to get inotify all hot under the collar, so that’s not finished yet).

Comments were handled by Disqus until 2, and there is really not much else to look at other than content.

From the moment a browser starts talking to this site, a number of things happen:

  • 1 takes the request and checks if the content is in a RAM cache, thereby dealing with 90% of the crufty bits like favicons, CSS (which is both minified and gzipped), etc.
  • If not (or if the cache is stale) it then reverse proxies it to . can handle being a front-end server just fine, but one of the sites running on this VPS is based on nodejs (which is dumb as a doornail by default) so I need something in front of both. I have been meaning to get rid of in favor of nginx for , but I’ve been lazy and haven’t figured out the new setup yet.
  • then takes the request and checks if the relevant page content has been pre-rendered. Pre-rendered content is stored in a Haystack-like binary file (here’s the source for the stable version of that module - I’ve been testing an mmap()-based version, but haven’t yet decided it’s worthwhile to put in production).
  • If the pre-rendered exists and the remote browser doesn’t have it already (in which case it’ll get a 304 Not Modified reply), then blindly grabs a chunk of the file and spits it out via a template (templates in /Snakelets are pre-processed code, so the whole thing is very fast).
  • If not, then grabs the page contents off disk (remember, there’s no database, just plain text files and images), filters it through , or whatever, updates the intra-wiki link map, and sticks the resulting into the haystack before running it through the site template.
  • In either case, headers are properly set.
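The pre-rendered store in the steps above can be sketched roughly as follows. This is a minimal illustration of a Haystack-like file (one append-only binary file plus an in-memory index of offsets), not the actual module used on this site; the `Haystack` class, the `serve` helper and the `render` callback are all hypothetical names chosen for the example.

```python
import os


class Haystack:
    """Append-only blob file plus an in-memory index (hypothetical sketch)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # page name -> (offset, length, mtime)
        # Make sure the backing file exists.
        open(self.path, "ab").close()

    def put(self, name, rendered, mtime):
        """Append rendered bytes and remember where they landed."""
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(rendered)
        self.index[name] = (offset, len(rendered), mtime)

    def get(self, name, source_mtime):
        """Return stored bytes, or None if missing or older than the source."""
        entry = self.index.get(name)
        if entry is None or entry[2] < source_mtime:
            return None
        offset, length, _ = entry
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)


def serve(store, name, source_mtime, render):
    """Serve from the store when possible, otherwise render and store."""
    body = store.get(name, source_mtime)
    if body is None:
        body = render(name)  # the slow path: read text, apply markup filters
        store.put(name, body, source_mtime)
    return body
```

Stale entries are simply superseded by a new append, so the file only grows; a real store would need some form of compaction, which is left out here.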

The fun thing here is that is very dynamic. I can bolt on a number of filters to the output and re-render the in other ways, do link substitutions, you name it - but the static content store and pre-baked haystack makes it brutally efficient -wise, and the simple caching tweaks I’ve added over the years more so.

You’d be surprised at the number of “professional” CMS solutions that waste CPU cycles running a database query on every request, parsing the results and rendering them with insanely optimized engines, only to utterly fail at doing something so simple as outputting:

Cache-Control:public, max-age=3600
Date:Sat, 19 Mar 2011 16:46:34 GMT
Expires:Sat, 19 Mar 2011 17:01:34 GMT
Last-Modified:Sat, 19 Mar 2011 16:46:34 GMT

It’s not rocket science. Back when I was running and , I managed to by a full third by simply adding caching headers (before moving to FeedBurner, further tweaking on traffic), and I keep seeing it on pretty much every kind of site out there - you just have to open the Web Inspector (or Firebug), refresh a page a couple of times and take a look at how many requests get a 200 OK (for retransmission) instead of a 304 Not Modified (for proper caching handling on the server side).
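For the sake of illustration, here is a minimal sketch of what getting this right can look like in Python, using only the standard library. The `caching_response` function and its parameters are hypothetical, and deriving `Expires` from the same `max_age` as `Cache-Control` is an assumption of the sketch rather than a description of this site's code.

```python
from email.utils import formatdate, parsedate_to_datetime


def caching_response(last_modified, now, if_modified_since=None, max_age=3600):
    """Return (status, headers) for a resource changed at `last_modified`.

    Timestamps are seconds since the epoch; `if_modified_since` is the raw
    request header value, if the browser sent one.
    """
    headers = {
        "Cache-Control": "public, max-age=%d" % max_age,
        "Date": formatdate(now, usegmt=True),
        "Expires": formatdate(now + max_age, usegmt=True),
        "Last-Modified": formatdate(last_modified, usegmt=True),
    }
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full reply
        # HTTP dates have one-second resolution, so compare whole seconds.
        if since is not None and int(last_modified) <= int(since):
            return 304, headers  # browser copy is current, send no body
    return 200, headers
```

A caller would send the page body only when the status is 200; on a 304 the headers alone go out, which is where the bandwidth saving comes from.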

There are actually very few frameworks that do caching properly for you, and most server-side web development these days seems more focused on architectural sophistication than efficiency - and thus the inherent inefficiency of, say, or Ruby is offset by a ludicrous amount of no-sequeling, memcaching and layering and whatnot.

All of which are fun and useful, but perhaps not really necessary for pushing content out there.

Update: I forgot to mention that also caches pre-rendered compressed content in gzip binary chunks for some things, thereby saving even more CPU cycles. I’ve been looking at request handling times, and server-side it takes around 0.02 seconds to render and serve a complex page (i.e., lots of markup with syntax highlighting plugins and all) from markup and typically around 0.005 seconds to spit out pre-processed output (from ).
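The idea of keeping compressed output around can be sketched like this. It is an assumption-laden illustration, not the code running here: `GzipCache`, `respond` and the `render` callback are hypothetical names, and the cache is a plain in-memory dictionary.

```python
import gzip


class GzipCache:
    """Keeps gzip-compressed bodies so each page is compressed only once."""

    def __init__(self):
        self._chunks = {}

    def get(self, key, render):
        """Return gzip bytes for key, rendering and compressing on a miss."""
        chunk = self._chunks.get(key)
        if chunk is None:
            chunk = gzip.compress(render(key), compresslevel=9)
            self._chunks[key] = chunk
        return chunk


def respond(cache, key, render, accept_encoding=""):
    """Pick the compressed or plain body based on Accept-Encoding."""
    chunk = cache.get(key, render)
    if "gzip" in accept_encoding.lower():
        # Most browsers: hand over the stored bytes untouched.
        return {"Content-Encoding": "gzip"}, chunk
    # Rare client without gzip support: decompress the cached copy.
    return {}, gzip.decompress(chunk)
```

The design choice is to pay the compression cost once per page and to treat clients without gzip support as the slow path, since decompressing is much cheaper than compressing.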

Update 2: For extra kicks, I’ve temporarily set up Varnish instead of , to see if there would be any benefit of having an aggressive RAM-based cache (set to 128MB, which is plenty enough for the homepage, stylesheets, images, and most recent articles). Time will tell if even needs it, but I’ve been meaning to play around with it for a while (it’s extensively used at SAPO) and it might be useful to some folk - if you can’t bake, you might try glazing… :)

  1. I actually use lighttpd-improved since it has a few interesting fixes. Still, it will eventually go away. ↩︎

  2. I decided to get rid of them due to visual clutter and eventually realized I didn’t miss them, so I as well. ↩︎
