With a Spark In My Ears

Nothing much to report this week, thanks to a bout of the flu which my sinuses decided to support by providing me with a completely free (and quite painful) earache that has made it very hard to sleep at all.

Nevertheless, I managed to find the time and energy to slowly rebuild my cluster atop the unofficial Ubuntu image I linked to the other day, which also happens to have working Docker support.

If you’re interested in Docker, the right way to go about it is to use these images, which work just fine – I’ve used them to test a few things without breaking my existing install, and the base image takes up less than 300MB, so there’s no need to worry about filling up your SD card.

Painless PySpark

On the “Little Big Data” front, here are a few notes on getting Spark to run – assuming you already know how to set it up in standalone cluster mode, it’s completely painless to get it working with the IPython notebook and have your jobs run on remote executors:

# get the cluster going first (e.g. with Spark's sbin/start-all.sh on the master), so that we can have remote workers
# tell PySpark we intend to use the IPython notebook
# (exported, so that the pyspark launcher below actually sees it)
export IPYTHON_OPTS="notebook --pylab inline --ip=* --port=8889"
# start PySpark (and the notebook server), pointed at our master
/opt/spark/bin/pyspark --master spark://master:7077

…and that’s it – you automatically get a working SparkContext as the sc global inside your notebooks, so you’re good to go [1].
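As a quick sanity check that jobs really do run through that sc global, something like the following should do. This is only a minimal sketch: the sum_of_squares helper and its num_slices default are made up for illustration, and it assumes the notebook was started as above so that sc already exists.

```python
from operator import add


def sum_of_squares(sc, n, num_slices=8):
    """Spread range(n) across the cluster and add up the squares.

    `sc` is the SparkContext that PySpark injects into the notebook;
    `num_slices` (a hypothetical default) is how many partitions to
    split the data into, i.e. how many tasks get handed to executors.
    """
    rdd = sc.parallelize(range(n), num_slices)
    # map() runs on the executors, reduce() combines the partial results
    return rdd.map(lambda x: x * x).reduce(add)


# Inside a notebook cell, where `sc` is already defined:
# sum_of_squares(sc, 1000)  # -> 332833500
```

If that returns, the job should also show up in the master’s web UI, which is an easy way to confirm the work went to the remote workers rather than a local context.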

Time to grab some more tea (discreetly seasoned with ibuprofen) and see if I can get well quickly enough to be of some use at the office tomorrow.

  1. I’ve yet to try embedding Scala, but there’s very little you can’t do in Python, and I’m looking forward to Spark 1.3 and its DataFrame support.