{"id":647,"date":"2015-05-22T17:37:33","date_gmt":"2015-05-22T21:37:33","guid":{"rendered":"https:\/\/www.devolve.net\/blog\/?p=647"},"modified":"2018-07-13T11:11:07","modified_gmt":"2018-07-13T15:11:07","slug":"prepare-a-pdf-file-for-ocr","status":"publish","type":"post","link":"https:\/\/www.devolve.local\/prepare-a-pdf-file-for-ocr\/","title":{"rendered":"Prepare a PDF file for OCR"},"content":{"rendered":"
<p>If you have some need to OCR some text from a PDF or image file, you may want to use a tool like <code>tesseract<\/code> to do the job. But it won’t take any old input file; you’ll probably need to convert it first.<\/p>\n
<p>The first error I got from tesseract was<\/p>\n
<pre>Error in pixReadStream: Unknown format: no pix returned<\/pre>\n
<p>The Googles indicated that I can’t pass a PDF to it directly. Then I found that one format it will take is <code>tiff<\/code>.<\/p>\n
<p>Sweet. <code>convert foo.pdf foo.tiff<\/code>. I should say, if you don’t have the convert program, you’ll need to install ImageMagick or GraphicsMagick. I had it, but it gave a completely unhelpful error message.<\/p>\n
<pre>convert: no images defined `foo.tiff' @ error\/convert.c\/ConvertImageCommand\/3212.<\/pre>\n
<p>Ugh. After more Googles, I installed <code>ghostscript<\/code>, since convert was trying to find a binary called <code>gs<\/code>. Now it’s converting, but tesseract still won’t read the TIFF. It gives<\/p>\n
<pre>Error in pixReadFromTiffStream: spp not in set {1,3,4}<\/pre>\n
<p>MOAR SEARCHING. Alpha channel. Density. (“spp” is samples per pixel; a grayscale image with an alpha channel has two, which is not in that set.) If your source material is a PDF, you need to check the DPI setting in the file, and tell <code>convert<\/code> to match that density for the output file. So in my case I ended up with this:<\/p>\n
<pre>$ convert -density 300 foo.pdf -depth 8 -background white -flatten +matte foo.tiff\r\n$ tesseract foo.tiff foo<\/pre>\n
<p>Note that I tried this before with just the density and depth options, but it still apparently kept the alpha channel. The tesseract command will generate a foo.txt file by default, hopefully with something meaningful inside it. I tried this on a scanned sheet of notebook paper with my (non-cursive, but still objectively illegible) handwriting, and it produced this lovely piece of modern ASCII art:<\/p>\n
<pre>%\u20ac5~\u00b0\u00abTTLT>\u20ac\r\n\r\nTa\u2018~\u20185T?f%c*\u2018?\u20182>s*\u2018\"\u2018L< \u2018~\u2019\u2014T\u00a7JTTJ(fTc\u00bb>zTi\u20183\u2019 TTTTfmbT.s\u2018 ggwi \u00bbT<\/pre>\n
<p>I had much better results with this process on a file that someone other than me could actually read.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you have some need to OCR some text from a PDF or image file, you may want to use a tool like tesseract to do the job. But it won’t take any old input file; you’ll probably need to convert it first. The first error I got from tesseract was Error in pixReadStream: Unknown […]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[6],"tags":[38,34],"_links":{"self":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts\/647"}],"collection":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/comments?post=647"}],"version-history":[{"count":2,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts\/647\/revisions"}],"predecessor-version":[{"id":649,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts\/647\/revisions\/649"}],"wp:attachment":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/media?parent=647"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/categories?post=647"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/tags?post=647"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}