{"id":647,"date":"2015-05-22T17:37:33","date_gmt":"2015-05-22T21:37:33","guid":{"rendered":"https:\/\/www.devolve.net\/blog\/?p=647"},"modified":"2018-07-13T11:11:07","modified_gmt":"2018-07-13T15:11:07","slug":"prepare-a-pdf-file-for-ocr","status":"publish","type":"post","link":"https:\/\/www.devolve.local\/prepare-a-pdf-file-for-ocr\/","title":{"rendered":"Prepare a PDF file for OCR"},"content":{"rendered":"<p>If you need to OCR some text from a PDF or image file, you may want to use a tool like <code>tesseract<\/code> to do the job. But it won&#8217;t take any old input file; you&#8217;ll probably need to convert it first.<\/p>\n<p>The first error I got from tesseract was<\/p>\n<pre>Error in pixReadStream: Unknown format: no pix returned<\/pre>\n<p>The Googles indicated that I can&#8217;t pass a PDF to it directly. Then I found that one format it will take is <code>tiff<\/code>.<!--more--><\/p>\n<p>Sweet. <code>convert foo.pdf foo.tiff<\/code>. I should say, if you don&#8217;t have the <code>convert<\/code> program, you&#8217;ll need to install <code>imagemagick<\/code> or <code>graphicsmagick<\/code>. I had it, but it gave a completely unhelpful error message:<\/p>\n<pre>convert: no images defined `foo.tiff' @ error\/convert.c\/ConvertImageCommand\/3212.<\/pre>\n<p>Ugh. After more Googles, I installed <code>ghostscript<\/code>, since <code>convert<\/code> was trying to find a binary called <code>gs<\/code> to render the PDF. Now it was converting, but tesseract still wouldn&#8217;t read the tiff. It gave<\/p>\n<pre>Error in pixReadFromTiffStream: spp not in set {1,3,4}<\/pre>\n<p>MOAR SEARCHING. Alpha channel. Density. The <code>spp<\/code> in that error is samples per pixel, and an extra alpha channel can push a TIFF outside the set tesseract accepts. If your source material is a PDF, you need to check the DPI setting in the file, and tell <code>convert<\/code> to match that density for the output file. 
So in my case I ended up with this:<\/p>\n<pre>$ convert -density 300 foo.pdf -depth 8 -background white -flatten +matte foo.tiff\r\n$ tesseract foo.tiff foo<\/pre>\n<p>Note that I tried this before with just the density and depth options, but it apparently still kept the alpha channel. The <code>tesseract<\/code> command will generate a foo.txt file by default, hopefully with something meaningful inside it. I tried this on a scanned sheet of notebook paper with my (non-cursive, but still objectively illegible) handwriting, and it produced this lovely piece of modern ASCII art:<\/p>\n<pre>   %\u20ac5~\u00b0\u00abTTLT>\u20ac\r\n\r\nTa\u2018~\u20185T?f%c*\u2018?\u20182>s*\u2018\"\u2018L< \u2018~\u2019\u2014T\u00a7JTTJ(fTc\u00bb>zTi\u20183\u2019 TTTTfmbT.s\u2018 ggwi \u00bbT&lt;x: ~y\u00a3\/aT\r\nv\/\u2018a.T\u2018<>T\u20185[z\u00ab\u201dv~> T _ V  T T  \u00bb;z\"\u2018.;z.:os?T\u2018?,T\r\n\r\n\r\n\r\n T   _ V T4?\u2014Tw?zb< .\u2018T.Z.;9s3T.  T\r\n\r\n1\r\n\r\n;_%wmgAmT\r\n\r\n TG*.a2yTQ:_T,\u00a7naI Y*f~?:S T V T T.  T T T:{\u2018_=}(\"ce .4\/t\u00e9\u2019LTvT1T=v\u2019TS7  1\u00bb:\r\n T TTC\u2019a(&lt;2\u00bbiicLx:\u2018T\/:f\u2018,\u00bb:9.(Ez,\u00ab21zha.  T   TT  isiac\ufb01\u00e9\/\u00abii\r\nT.  T T T _\u2018fL??c15?\ufb02<::\u00bbg\r\n% ..T5;Tl.Tca<~T_1a<:v>< '    T73.Wg5Ta.rTT\r\n\r\nFf  AmAm\u00a3TTT TTTPmsw T j\r\nV    *3e\u00a7h2T\u00a7;c_\"e&#038;5j\u2018av\\T _ T T T T TT   ?7T\u00a7\u2018RrT(\u2018\/9\u00a3T\/ Z:\r\n T. T i\u201c}2\u20ac$._<f TT(fai1),x\/\u00abL s%+:\u00bb:\u00ab~%TT\r\nTT TTH  V7  , T \u00bb\u20191i5\u00a3??~TCa  .:M\u00ab4~.5\u201d\\\u00ab .\r\nT .T !5_5\u00e9TWT\u2018\u00ab5\u201d\u20195Tv\u00a7?\u20acT(TT.~\u2019fW\u201d.T<\/pre>\n<p>I had much better results with this process on a file that someone other than me could actually read.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you need to OCR some text from a PDF or image file, you may want to use a tool like tesseract to do the job. 
But it won&#8217;t take any old input file; you&#8217;ll probably need to convert it first. The first error I got from tesseract was Error in pixReadStream: Unknown [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[6],"tags":[38,34],"_links":{"self":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts\/647"}],"collection":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/comments?post=647"}],"version-history":[{"count":2,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts\/647\/revisions"}],"predecessor-version":[{"id":649,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/posts\/647\/revisions\/649"}],"wp:attachment":[{"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/media?parent=647"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/categories?post=647"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devolve.local\/wp-json\/wp\/v2\/tags?post=647"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}