Python Patterns, Take One


Over the past few months a number of people have come to me for guidance regarding various aspects of Python development, so I decided I’d post something about that.

Toolset

I like my tools simple, flexible and, above all, easy to understand and extend. Being able to poke under the hood is essential, and that assumes you don’t have to spend days poring over source code, which is a big reason why I mostly write Python, without any IDE to hamper my judgement. All I need is vim and a terminal window, even if I do tweak them or occasionally use Sublime Text.

Since I mostly do back-end stuff and REST APIs (and then, on occasion, JavaScript front-ends atop those APIs), my current Python toolset revolves around three libraries:

  • Bottle for request routing and templating
  • Peewee as my lightweight ORM of choice
  • Celery for task queueing [1]

The first two have small, compact codebases you can read through in one sitting and understand pretty thoroughly (or at least well enough to debug if the need arises).

Celery is in a league of its own here (it’s definitely not small), but the sheer power of it and the simplicity with which you can get a reliable, scalable and distributed task queue off the ground more than makes up for its size. It’s not something most people realize they need, but believe me, you do.

I’ve been meaning to clean up and publish my own micro-task-queue for a while (as part of my own little Python utilities library), but have yet to get around to it. Celery is overkill for doing simple batch processing, but excels at running massive workloads (hundreds of thousands of tasks an hour across dozens of CPUs), so I’m sticking with it for the time being.

Project Structure

My usual file tree these days looks like this:

+-- dev.py                # Stub to run Bottle while coding
+-- app.py                # WSGI entry point
+-- tasks.py              # Celery entry point
+-- fabfile               # Fabric deployment/build scripts
+-- Vagrantfile           # never leave home without it
+-- etc
|    +-- config.json      # my main configuration file
+-- api
|    +-- [model].py       # RESTful routes for each model
+-- controllers
|    +-- [behavior].py    # controllers used by routes
+-- lib
|    +-- bottle.py        # more bang than Flask
|    +-- peewee.py        # almost as nice as the Django ORM
|    +-- config.py        # loads up the JSON file 
|    +-- utils            # my little bag of tricks
|    |    +-- core.py
|    |    +-- urlkit.py
|    |    +-- stringkit.py
|    |    +-- datekit.py
|    +-- [dependencies]   # Include ALL the dependencies locally
+-- env                   # virtualenv for "fat" dependencies
+-- puppet                # Puppet manifests and deployment stuff
+-- models
|    +-- db.py            # Base models and database initialisation
|    +-- [store].py       # Other data stores (Redis, etc.)
+-- batch                 # Celery tasks
|    +-- [worker].py      # Separate modules per worker, if needed
+-- static                # Static assets (HTML and sundry)
|    +-- css
|    |    +-- ink.css     # I love it
|    +-- js
|    |    +-- app.js      # Everything starts here
|    |    +-- ink-all.js  # Our in-house library
|    |    +-- knockout.js # I love data binding
|    |    +-- zepto.js    # I hate jQuery
|    +-- img
|    +-- font             # Usually Roboto
+-- views
     +-- layout.tpl       # Base layout for templates
     +-- [group]          # Partials for each entity/screen

As you can see, there’s a lot going on here – for instance, I use Vagrant for building entire Linux environments from scratch [2] and Fabric for deployment, etc.

Part of this is because I have to precisely match the (often antiquated) setups we have in production, and part because deployment is, in a word, hard and I tend to have to go through a number of contortions to deploy a new release – which is why I emphasize self-contained, pure Python in this piece.

Sometimes the end result is packaged as OS packages (Debian in my case), but that’s beyond the scope here.

Core Stuff

But let’s go back to basics for a second. There are three essential sets of files here:

  • dev.py, app.py and etc/config.json
  • the api, routes and views folders
  • the models and controllers folders

dev.py and app.py are fundamentally the same thing – they set up sys.path to make sure my local libraries (in lib) take precedence over everything else, load up the configuration file, and run the app – but dev.py uses the built-in Bottle HTTP server whereas app.py has a WSGI entry point for whatever app container I use.
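As a rough illustration of the path juggling described above, a dev.py stub might look something like the sketch below. The helper name prepend_lib_path and the host/port values are assumptions made up for this example; only bottle.run and its debug/reloader flags come from Bottle itself.

```python
import os
import sys


def prepend_lib_path(base_dir):
    """Put <base_dir>/lib at the front of sys.path so bundled libraries win."""
    lib_dir = os.path.join(os.path.abspath(base_dir), "lib")
    if lib_dir in sys.path:
        sys.path.remove(lib_dir)  # avoid duplicates on repeated calls
    sys.path.insert(0, lib_dir)
    return lib_dir


def main():
    # Hypothetical dev.py body: fix up the path first, and only then import
    # Bottle, so that the copy bundled in lib/ is the one that gets loaded.
    prepend_lib_path(os.path.dirname(os.path.abspath(__file__)))
    import bottle
    # Built-in development server; host and port are placeholder values.
    bottle.run(host="127.0.0.1", port=8080, debug=True, reloader=True)
```

An app.py along the same lines would presumably do the same path setup and then expose a WSGI callable (for instance via bottle.default_app()) instead of calling bottle.run.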

For production, I usually run gevent workers in a gunicorn or uwsgi container [3]. Picking the right container and understanding why you need something to handle your threading model is worth a post on its own – one I dearly wish I had time to write – but let’s just say you need to really learn about this if you want your apps to scale, regardless of programming language and framework.

Configuration

Why config.json and not YAML or something else, you might ask?

Well, because JSON is about the only format I can read in all the languages I use without ambiguity or extra dependencies – it’s not unusual for me to have to mix Python with something else (like Node or, of late, Go), and this way I can load configurations for every component off a single file [4].

And yes, JSON also forces you to be careful about how you structure the configuration file, which I find to be worthwhile in the long run.

With Python, I load the JSON data into a Struct object (an enriched dict with some syntax sugar) and stick that into a common config module that is re-used across the board, making it trivial to get at configuration data and easy to manage everything centrally.
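A minimal sketch of that idea, using only the standard library, could look like this. The Struct class and load_config function below are illustrative guesses at the shape of such a helper, not the actual code behind this post.

```python
import json


class Struct(dict):
    """A dict whose keys can also be read and written as attributes.

    Note: names that clash with dict methods (items, keys, ...) still
    resolve to the methods, so those keys need the cfg["items"] form.
    """

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value


def load_config(text):
    """Parse JSON text, turning every JSON object (nested ones too) into a Struct."""
    return json.loads(text, object_hook=Struct)
```

With something like this in a shared config module, a route or controller could read the contents of etc/config.json once and then write config.db.host rather than config['db']['host'] (the db and host keys are, again, just examples).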

Routing and Templating

Bottle’s internal routing mechanism handles 99% of what I need, so the only thing I do is group routes in modules depending on their intended use.

The difference between api and routes is mostly a matter of taste – since I build REST APIs before any UI, I like to keep those separate from everything else, and easy to version.

So inside api you’ll usually find stuff like:

prefix = '/api/v1'
...

@get(prefix + '/<model>/<id:int>')
@auth(is_admin)
@cache(30)
@jsonp
def get_model_instance(model, id):
    ...

And so on. I make heavy use of decorators, since they allow me to tack on all sorts of filtering and access control without cluttering up my code with complex logic, and they increase readability – for instance, in the above example it’s immediately plain who can invoke that API.

It also makes it trivial to change stuff like caching back-ends. For instance:

from decorators import request_cache as cache

…can be replaced with:

from utils.redis.decorators import request_cache as cache

…and boom! Shared caching across all worker processes, without changing a single route.
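For the curious, an in-process request_cache along those lines might be sketched as follows. This is an assumption about what such a decorator could look like, not the one in my utils library: the store and clock parameters are invented here to make the back-end swappable and the expiry testable.

```python
import functools
import time


def request_cache(seconds, store=None, clock=time.time):
    """Cache a function's result per argument set for `seconds` seconds.

    `store` is any dict-like object; a plain dict keeps results inside one
    process, while a shared back-end would need a wrapper with the same
    get/set-item behaviour (hypothetical, not shown here).
    """
    store = {} if store is None else store

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = (func.__name__, args, tuple(sorted(kwargs.items())))
            now = clock()
            hit = store.get(key)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]  # still fresh, skip the real call
            value = func(*args, **kwargs)
            store[key] = (now, value)
            return value
        return wrapper
    return decorator
```

Because routes only ever see the decorator's name, changing the import is enough to switch implementations, which is the whole point of the trick above.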

As a counterpoint, stuff inside routes is a lot more prosaic, and deals mostly with HTML handling. In fact, ever since I started using Knockout.js, the most common bit of code in there is this:

@route('/<path:path>')
def serve_static(path):
    return static_file(path, root='static')

The views folder, in turn, contains all the templates, which are just HTML with Bottle’s simple, neat and effective templating. It allows me to write Python inline, so boring stuff like rendering tables goes in there – usually in partials that are invoked from a base layout as required.

Using it is, again, trivial:

@route('/')
@view('default')
def serve_index():
    return {'title': 'My neat app', 'module': 'Main'}

Variables are then replaced in the template as you’d expect.

Dependency Management

This is where it gets interesting. I take a pretty radical approach to dependency management, which is to prefer pure Python code that I can version together with my code – I’ve found this is the only way to reliably rebuild your deployments from scratch months (or years) down the road, when most libraries have been updated in sometimes unexpected ways.

So lib is where I keep all my core (pure Python) dependencies and my utils library, which is crammed with goodies and reusable patterns.

But it’s quite easy to run up a sizeable number of dependencies that aren’t pure Python and/or need to be compiled for deployment, so I set those aside in a virtualenv – predictably called env [5].

My ground rule is that anything I easy_install will go into env, and env will quite often be (re)built on a separate machine that matches my target environment. So I treat it as a disposable folder, and it’s never committed to any repos.

This has been changing a bit of late since with LXC and Vagrant I quite often rebuild an entire machine rather than a virtualenv, but it’s still common enough – and when env has the right binaries, deployment is usually a matter of doing a plain rsync to production servers.

Ground Rules

For application internals, I’ve settled on a set of rules as to how components interact that boil down to the following:

  • routes handle CSRF protection (if applicable), input validation, result pagination, etc.
  • to get at data, routes instantiate a Controller object of some kind and never, ever manipulate models directly – the only things they’re supposed to handle are iterables (lists and generators) of dict results, which are trivial to fiddle with and cache.
  • controllers encapsulate all forms of storage and foreign APIs in individual classes like FoobarAPIController, BatchController, QueryController, etc.
  • controllers also manage external connections (for instance, they will invoke db.disconnect() in their __del__ methods, etc.)
  • There are zero SQL statements in controllers. Zero. In fact, you’ll be hard pressed to find a single one in most of my stuff, since I find crafting SQL by hand to be error-prone and hard to test on occasion [6].
  • Since controllers have all the complex behaviour, they also define exceptions for all common failure cases and provide generators for routes to iterate over large datasets.
  • by extension of the above, views get only simple Python types for rendering templates, and Bottle helpfully escapes all strings by default – so script injection is, if not impossible, then at the very least extremely hard.
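To make the route/controller split a bit more concrete, here is a deliberately tiny sketch. Everything in it is hypothetical: the NotFoundError name, the get and iter_items methods, and the plain list of dicts that stands in for the real model layer (which would be Peewee models in practice).

```python
import itertools


class NotFoundError(Exception):
    """Raised by the controller when a requested record does not exist."""


class QueryController(object):
    def __init__(self, rows):
        # `rows` stands in for the model layer; routes never touch it.
        self._rows = rows

    def get(self, item_id):
        """Return a single record as a plain dict, or raise NotFoundError."""
        for row in self._rows:
            if row["id"] == item_id:
                return dict(row)
        raise NotFoundError(item_id)

    def iter_items(self, offset=0, limit=None):
        """Yield plain dicts lazily so routes can paginate large datasets."""
        stop = None if limit is None else offset + limit
        for row in itertools.islice(self._rows, offset, stop):
            yield dict(row)
```

A route would then only validate its input, call something like iter_items(offset, limit), turn NotFoundError into a 404, and hand the resulting dicts to a template or a JSON encoder.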

The rest largely depends on what I’m building, but there have been a few constants of late.

Back-Ends

I’ve mostly settled (as much as anyone can in this field) on two back-end technologies worth noting [7]:

  • Postgres, because I hate relational databases but have grown to love the way it works and scales a fair bit beyond the norm.
  • Redis, because it’s a wonderful Swiss Army knife for extending your application logic in a number of ways.

I use Redis for a lot of stuff – for sharing state between HTTP workers, for running Celery, as an intermediate cache, and (in rare instances) as a message broker with a simple publish-subscribe mechanism.

Like Postgres, Redis can be used with just about any runtime besides Python, so it’s perfect as a result store to and from other systems.

My only gripe with it so far is that it doesn’t do replication out of the box, but I’ve been meaning to try some tricks with PubSub for doing quorum-based replication a friend of mine described a little while ago.

If I ever sort that out, I’ll be sure to post about it. And now if you’ll excuse me it’s getting late, and I have a book to read…


  1. Celery is a fairly new addition (I usually rolled my own task queueing, but I’ve already pushed Celery to run millions of tasks across a few dozen CPUs, so I’m sticking with it from now on).

  2. I’ve got a nice HOWTO about how to do that on a Mac, by the way.

  3. I’ve never felt the need for grand debates around using greenlets vs. OS threads for web services. Suffice it to say for the moment I’ve been able to scale this particular combination up to 5000 transactions/s on a desktop (and way beyond that in production hardware and employing judicious caching) before I hit a database bottleneck. But I’ll eventually write about that later, since it’s all very dependent on workload.

  4. I also use multiple JSON files for production deployments, each named after a particular host or role in Fabric. So there’ll usually be a master.json, shard.json, etc.

  5. I usually manage my virtualenvs with Fabric, with the advantage that I can then re-use the Fabric script to deploy identical environments to remote servers without hassles.

  6. The ORM only issues prepared statements, can be mocked easily, and anyway I make sure that all user input is double-checked prior to building a query.

  7. I’m also using Solr at the moment (and mostly enjoying it), but it’s not something you’d generally tackle for web development, so I’ll write about it some other time.