Prepare a PDF file for OCR

If you need to OCR text from a PDF or image file, a tool like tesseract can do the job. But it won’t take any old input file; you’ll probably need to convert it first.

The first error I got from tesseract was

The Googles indicated that I can’t pass a PDF to it directly. Then I found that one format it will take is tiff.

Sweet: convert foo.pdf foo.tiff. I should say, if you don’t have the convert program, you’ll need to install ImageMagick or GraphicsMagick (GraphicsMagick runs it as gm convert). I had it, but it gave a completely unhelpful error message.

Ugh. After more Googles, I installed Ghostscript, since convert was trying to find a binary called gs; it hands PDF rendering off to Ghostscript. Now it’s converting, but tesseract still won’t read the TIFF. It gives

MOAR SEARCHING. The culprits: alpha channel and density. If your source material is a PDF, you need to check the DPI setting in the file and tell convert to use that density for the output file. You also need to get rid of the alpha channel. So in my case I ended up with this:
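The right values depend on your file. As a rough sketch, assuming ImageMagick's convert, a 300 DPI source, and 8 bits per channel, a conversion along these lines strips the alpha channel and sets the density. The 300 DPI default, the bit depth, and the pdf_to_tiff function name are illustrative assumptions, not the settings from the original command.

```shell
# One way to find the DPI of a scanned PDF (assumes poppler-utils is
# installed, which this post does not otherwise use):
#   pdfimages -list foo.pdf    # see the x-ppi and y-ppi columns

# Rasterize a PDF to a TIFF without an alpha channel.
# Usage: pdf_to_tiff INPUT.pdf OUTPUT.tiff [DPI]
pdf_to_tiff() {
  # -density comes before the input so it controls how the PDF is rasterized.
  # The 300 DPI fallback is an assumption; pass your PDF's real DPI instead.
  # -depth 8 writes 8 bits per channel (assumed value).
  # -alpha off drops the alpha channel.
  convert -density "${3:-300}" "$1" -depth 8 -alpha off "$2"
}

# Example call, commented out so sourcing this file has no side effects:
# pdf_to_tiff foo.pdf foo.tiff 300
```

With ImageMagick 7 the same options are normally passed to the magick command instead of convert.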

Note that I tried this before with just the density and depth options, but it apparently still kept the alpha channel. Running tesseract on the resulting TIFF will generate a foo.txt file by default, hopefully with something meaningful inside it. I tried this on a scan of notebook paper with my (non-cursive, but still objectively illegible) handwriting, and it produced this lovely piece of modern ASCII art

I had much better results with this process on a file that someone other than me could actually read.
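Since most of the pain above came from missing programs, it may help to check for them up front before running the OCR step. The following is a minimal sketch; the missing_tools and ocr_tiff function names are made up for illustration, and it assumes the TIFF has already been produced by convert.

```shell
# Print the name of every listed command that cannot be found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Run tesseract on a TIFF. tesseract appends .txt to the output base,
# so "ocr_tiff foo.tiff foo" writes foo.txt.
# Usage: ocr_tiff INPUT.tiff OUTPUT_BASE
ocr_tiff() {
  missing=$(missing_tools tesseract)
  if [ -n "$missing" ]; then
    echo "missing tools: $missing" >&2
    return 1
  fi
  tesseract "$1" "$2"
}

# Example calls, commented out so sourcing this file has no side effects:
# missing_tools convert gs tesseract   # prints nothing if all three exist
# ocr_tiff foo.tiff foo
```

Running missing_tools convert gs tesseract before starting would have flagged the Ghostscript problem without the unhelpful error message.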

Prepare a PDF file for OCR is original content from devolve.