HOWTO: Perform OCR on PDFs

Here’s a quick and dirty approach (based on this) that works well enough for me to perform on scanned bills and bank drafts.

Requires with GhostScript support, matching language files (I’m using Portuguese training data), and source material at 300dpi (150dpi didn’t cut it for me with smaller type).

convert -density 300 ~/foo.pdf /tmp/p%03d.tif
montage /tmp/*.tif -tile 1x -mode concatenate /tmp/foo.tif
tesseract /tmp/foo.tif output -l por

You can then grab output.txt, tokenize it and use it to set keywords or other metadata (details forthcoming once I find a way that works for all my use cases).