{"id":647,"date":"2015-05-22T17:37:33","date_gmt":"2015-05-22T21:37:33","guid":{"rendered":"https:\/\/www.devolve.net\/blog\/?p=647"},"modified":"2018-07-13T11:11:07","modified_gmt":"2018-07-13T15:11:07","slug":"prepare-a-pdf-file-for-ocr","status":"publish","type":"post","link":"https:\/\/www.devolve.local\/prepare-a-pdf-file-for-ocr\/","title":{"rendered":"Prepare a PDF file for OCR"},"content":{"rendered":"

If you need to OCR text from a PDF or image file, you may want to use a tool like tesseract<\/code> to do the job. But it won’t take just any old input file; you’ll probably need to convert it first.<\/p>\n

The first error I got from tesseract was:<\/p>\n

Error in pixReadStream: Unknown format: no pix returned<\/pre>\n

The Googles indicated that I can’t pass a PDF to it directly. Then I found that one format it will take is tiff<\/code>.<\/p>\n

Sweet. convert foo.pdf foo.tiff<\/code>. I should say, if you don’t have the convert program, you’ll need to install ImageMagick (GraphicsMagick offers the same operation, but as gm convert). I had it, but it gave a completely unhelpful error message:<\/p>\n

convert: no images defined `foo.tiff' @ error\/convert.c\/ConvertImageCommand\/3212.<\/pre>\n

Ugh. After more Googles, I installed ghostscript<\/code>, since convert was trying to find a binary called gs<\/code>. Now it converts, but tesseract still won’t read the TIFF. It gives:<\/p>\n

Error in pixReadFromTiffStream: spp not in set {1,3,4}<\/pre>\n

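MOAR SEARCHING. Alpha channel. Density. The spp in that error is samples per pixel, and the alpha channel in my TIFF apparently pushed it outside the set tesseract accepts. If your source material is a PDF, you also need to check the DPI setting in the file, and tell convert<\/code> to match that density for the output file. So in my case I ended up with this:<\/p>\n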

$ convert -density 300 foo.pdf -depth 8 -background white -flatten +matte foo.tiff\r\n$ tesseract foo.tiff foo<\/pre>\n
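If you end up doing this more than once, the two commands above can be wrapped in a small shell function. This is only a sketch: ocr_pdf, run, and the DRY_RUN switch are names made up for illustration, and the default density of 300 is an assumption that should be replaced with whatever your PDF actually uses.

```shell
# Hypothetical helpers, not part of ImageMagick or tesseract.

# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# ocr_pdf INPUT.pdf [DENSITY]
ocr_pdf() {
    input=$1
    density=${2:-300}      # assumed default; use your PDF's real DPI
    base=${input%.pdf}     # foo.pdf -> foo

    # Same options as the command above: flatten onto a white background
    # and drop the alpha channel. tesseract only runs if convert succeeds,
    # and it adds the .txt extension to the output base name itself.
    run convert -density "$density" "$input" -depth 8 \
        -background white -flatten +matte "$base.tiff" &&
    run tesseract "$base.tiff" "$base"
}
```

Calling ocr_pdf foo.pdf 300 should leave foo.tiff and foo.txt next to the PDF. With DRY_RUN=1 set, it only prints the two commands, which is a cheap way to check the options before running them on a large file.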

Note that I first tried convert with just the density and depth options, but the output apparently still kept the alpha channel, hence the extra flags. The tesseract command generates a foo.txt file by default, hopefully with something meaningful inside it. I tried this on a scanned sheet of notebook paper with my (non-cursive, but still objectively illegible) handwriting, and it produced this lovely piece of modern ASCII art:<\/p>\n

   %\u20ac5~\u00b0\u00abTTLT>\u20ac\r\n\r\nTa\u2018~\u20185T?f%c*\u2018?\u20182>s*\u2018\"\u2018L< \u2018~\u2019\u2014T\u00a7JTTJ(fTc\u00bb>zTi\u20183\u2019 TTTTfmbT.s\u2018 ggwi \u00bbTT\u20185[z\u00ab\u201dv~> T _ V  T T  \u00bb;z\"\u2018.;z.:os?T\u2018?,T\r\n\r\n\r\n\r\n T   _ V T4?\u2014Tw?zb< .\u2018T.Z.;9s3T.  T\r\n\r\n1\r\n\r\n;_%wmgAmT\r\n\r\n TG*.a2yTQ:_T,\u00a7naI Y*f~?:S T V T T.  T T T:{\u2018_=}(\"ce .4\/t\u00e9\u2019LTvT1T=v\u2019TS7  1\u00bb:\r\n T TTC\u2019a(<2\u00bbiicLx:\u2018T\/:f\u2018,\u00bb:9.(Ez,\u00ab21zha.  T   TT  isiac\ufb01\u00e9\/\u00abii\r\nT.  T T T _\u2018fL??c15?\ufb02<::\u00bbg\r\n% ..T5;Tl.Tca<~T_1a<:v>< '    T73.Wg5Ta.rTT\r\n\r\nFf  AmAm\u00a3TTT TTTPmsw T j\r\nV    *3e\u00a7h2T\u00a7;c_\"e&5j\u2018av\\T _ T T T T TT   ?7T\u00a7\u2018RrT(\u2018\/9\u00a3T\/ Z:\r\n T. T i\u201c}2\u20ac$._\n

I had much better results with this process on a file that someone other than me could actually read.
\n<\/x:><\/p>\n","protected":false},"excerpt":{"rendered":"

If you have some need to OCR some text from a PDF or image file, you may want to use a tool like tesseract to do the job. But it won’t take any old input file, you’ll probably need to convert it first. The first error I got from tesseract was Error in pixReadStream: Unknown […]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[6],"tags":[38,34],"_links":{"self":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts\/647"}],"collection":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/comments?post=647"}],"version-history":[{"count":2,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts\/647\/revisions"}],"predecessor-version":[{"id":649,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts\/647\/revisions\/649"}],"wp:attachment":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/media?parent=647"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/categories?post=647"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/tags?post=647"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}