Photo Archival and Integrity Checking

I’ve recently been spending some time recovering from yet another hard disk crash, and as part of that I’m taking advantage of my age-old approach to storing photographs: using the wonders of jhead and a few other tidbits, everything is stored in a filesystem tree with nested folders for year and month, such as this one:

Photos-+-2001-+-01-+-200101012230.jpg
       |      |    |
       |      |    +-SHA1SUMS
       .      |
       .      +-02-+-200102010830.jpg
       .      .    .

This results in a simple, straightforward structure that is easy to navigate and archive, with a unique filename at the end of each pathname (YYYY/MM/YYYYMMDDHHMMSS.foo), which also helps considerably.
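As a minimal sketch of that naming scheme (jhead does the actual renaming for me, so the function name and arguments below are purely hypothetical), this is how a capture timestamp could map to a path in the tree:

```python
import os
from datetime import datetime


def archive_path(root, taken, extension):
    """Build root/YYYY/MM/YYYYMMDDHHMMSS.ext from a capture timestamp.

    Hypothetical helper for illustration only; it assumes the timestamp
    has already been read from the image (e.g. from its EXIF data).
    """
    name = taken.strftime("%Y%m%d%H%M%S") + "." + extension.lstrip(".").lower()
    return os.path.join(root, taken.strftime("%Y"), taken.strftime("%m"), name)


if __name__ == "__main__":
    print(archive_path("Photos", datetime(2001, 1, 1, 22, 30, 0), "jpg"))
```

Two shots taken within the same second would collide under this scheme, so a real tool needs some way to break the tie.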

And yes, I know that iPhoto and suchlike will do a (reasonably) decent job of managing my photos, and I have used it since the days when it was incapable of handling more than a couple of thousand images. The problem here is long-term storage and archival, and iPhoto can work off such a filesystem tree without messing around with my originals too much.

For each folder, I’ve so far been using an MD5SUMS file, which helps me ensure that when I back this up to DVD (or, as is the case, try to save my files from a bad disk) the data I’m getting back is what I saved in the first place.

Thing is, times have moved on and now SHA1 is the thing to use, but taking the long view (i.e., more than a couple of years), I’ve seen both md5 and sha1 utilities that go and do their own thing regarding storing digest files and whatnot (from using parentheses to adding a lot of extraneous junk around both the digest and the filename), so I decided to keep mine simple:

hexdigest filename
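To make that format concrete, here is a small sketch of writing and reading such lines. The helper names are my own invention for this example, and the assumption is that a digest is either 32 (MD5) or 40 (SHA1) hex characters followed by whitespace and the filename:

```python
import re

# Assumption: 40 hex characters (SHA1) or 32 (MD5), whitespace, then the filename.
_ENTRY = re.compile(r"^([0-9a-fA-F]{40}|[0-9a-fA-F]{32})\s+(.+)$")


def format_entry(hexdigest, filename):
    """Return one index line in the plain 'hexdigest filename' form."""
    return "%s %s\n" % (hexdigest.lower(), filename)


def parse_entry(line):
    """Return (hexdigest, filename), or None if the line is not an entry."""
    match = _ENTRY.match(line.rstrip("\r\n"))
    if match is None:
        return None
    return match.group(1).lower(), match.group(2)
```

Because the parser accepts any run of whitespace between the two fields, it also reads the two-space separator that md5sum-style tools emit by default.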

Furthermore, since OSX has shown a regrettable tendency to either not include md5sum or to twiddle its output format over the years, I’ve decided to go and code my own set of utility functions in Python to compute and confirm the hash values. The trick here is to use mmap(), which makes it as fast as (if not faster than) C utilities.
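The mmap() trick can be sketched roughly like this. It is an illustration written against Python 3's hashlib and mmap modules rather than my actual utility code, and the function name is made up for the example:

```python
import hashlib
import mmap
import os


def sha1_of_file(path):
    """Return the SHA1 hex digest of a file, read through a memory map."""
    digest = hashlib.sha1()
    if os.path.getsize(path) == 0:
        # An empty file cannot be memory-mapped; its digest is that of b"".
        return digest.hexdigest()
    with open(path, "rb") as handle:
        mapped = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # hashlib reads straight from the mapping, with no read() loop.
            digest.update(mapped)
        finally:
            mapped.close()
    return digest.hexdigest()
```

Whether this actually beats a plain chunked read depends on the platform and on file sizes, so it is worth timing on your own collection.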

Here’s the set of utility functions I’m using, which will take both kinds of hash indexes, compute sha1 for a given directory, and move out of the way for manual checking any files that don’t match either of the indexes for some reason:

(download)
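For readers who just want the gist, the checking step might look something like the sketch below. This is not the downloadable code: the function names, the "mismatched" folder name and the choice to pick the algorithm from the digest length are all assumptions made for the example, and it uses plain chunked reads to stay short.

```python
import hashlib
import os
import shutil

INDEX_NAMES = ("SHA1SUMS", "MD5SUMS")   # index files looked for in each folder
ALGORITHMS = {32: "md5", 40: "sha1"}    # algorithm chosen by hex digest length


def file_digest(path, algorithm):
    """Hash a file in 1 MiB chunks with the named hashlib algorithm."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_index(path):
    """Read 'hexdigest filename' lines into a {filename: hexdigest} dict."""
    entries = {}
    with open(path, "r") as handle:
        for line in handle:
            parts = line.split(None, 1)
            if len(parts) == 2 and len(parts[0]) in ALGORITHMS:
                entries[parts[1].strip()] = parts[0].lower()
    return entries


def check_folder(folder, quarantine="mismatched"):
    """Move files that disagree with an index aside; return their names."""
    moved = []
    for index_name in INDEX_NAMES:
        index_path = os.path.join(folder, index_name)
        if not os.path.isfile(index_path):
            continue
        for name, expected in load_index(index_path).items():
            target = os.path.join(folder, name)
            if not os.path.isfile(target):
                continue  # missing, or already moved by an earlier index
            if file_digest(target, ALGORITHMS[len(expected)]) != expected:
                os.makedirs(os.path.join(folder, quarantine), exist_ok=True)
                shutil.move(target, os.path.join(folder, quarantine, name))
                moved.append(name)
    return moved
```

Note that this sketch silently skips files that are listed in an index but missing from disk, and ignores files that are on disk but not indexed; both cases deserve their own report in a real tool.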

These are a toolkit and not a finished solution, but they (and the amazing Index sheet view in Quicklook in Snow Leopard, which affords me effortless visual inspection of folders with hundreds of images) go a long way towards helping me make sure (or at least check) that my files are correctly preserved, and hopefully will remain so for many years to come, regardless of storage.

One improvement I’ll be adding (besides better cleanup of the way I handle paths and deal with the problem of having non-indexed files in the same directory) is automated JPEG loading and checking, although (since my media collection has started including more and more movie files) that will necessarily be somewhat limited.