Here’s a quick and dirty approach (based on this) that works well enough for me to perform OCR on scanned bills and bank drafts.
Requires ImageMagick with GhostScript support, matching language files (I’m using Portuguese training data), and source material at 300dpi (150dpi didn’t cut it for me with smaller type).
convert -density 300 ~/foo.pdf /tmp/p%03d.tif montage /tmp/*.tif -tile 1x -mode concatenate /tmp/foo.tif tesseract /tmp/foo.tif output -l por
You can then grab output.txt
, tokenize it and use it to set PDF keywords or other metadata (details forthcoming once I find a way that works for all my use cases).