2015/05/22

Prepare a PDF file for OCR

If you need to OCR text from a PDF or image file, you may want to use a tool like tesseract to do the job. But it won’t take any old input file; you’ll probably need to convert it first.

The first error I got from tesseract was

Error in pixReadStream: Unknown format: no pix returned

The Googles indicated that I can’t pass a PDF to it directly. Then I found that one format it will take is TIFF.

Sweet. convert foo.pdf foo.tiff. I should say, if you don’t have the convert program, you’ll need to install ImageMagick or GraphicsMagick (which runs it as gm convert). I had it, but it gave a completely unhelpful error message:

convert: no images defined `foo.tiff' @ error/convert.c/ConvertImageCommand/3212.
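
An error like this one can mean a helper program is missing rather than that the input is bad, since ImageMagick hands PDF reading off to Ghostscript. A quick way to rule that out is to check PATH for every program the pipeline relies on. This is only a sketch, and missing_tools is a made-up helper name, not part of any package:

```shell
#!/bin/sh
# Sketch: report which of the named programs are not on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: prints nothing when the whole toolchain is installed.
# missing_tools convert gs tesseract
```

Anything it prints is a package still to install before blaming the input file.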

Ugh. After more Googles, I installed ghostscript, since convert was trying to find a binary called gs. Now it’s converting, but tesseract still won’t read the TIFF. It gives:

Error in pixReadFromTiffStream: spp not in set {1,3,4}
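
One way to see what tesseract is objecting to is to look at the TIFF's samples-per-pixel count directly. The sketch below assumes libtiff's tiffinfo tool is installed and that its output contains a line of the form "Samples/Pixel: N". The helper names spp_from_tiffinfo and spp_ok are invented for illustration:

```shell
#!/bin/sh
# Sketch: pull the samples-per-pixel value out of tiffinfo output.
# Assumes a line such as "  Samples/Pixel: 2" is present.
# spp_from_tiffinfo and spp_ok are hypothetical helper names.
spp_from_tiffinfo() {
    # reads tiffinfo output on stdin, prints the first Samples/Pixel value
    awk -F': *' '/Samples\/Pixel/ { print $2; exit }'
}

spp_ok() {
    # succeeds only for the values the tesseract error lists as readable
    case $1 in
        1|3|4) return 0 ;;
        *)     return 1 ;;
    esac
}

# Usage:
# spp=$(tiffinfo foo.tiff | spp_from_tiffinfo)
# spp_ok "$spp" || echo "foo.tiff has $spp samples per pixel" >&2
```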

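MOAR SEARCHING. Alpha channel. Density. The spp in that error is samples per pixel, and the TIFF that convert wrote still carried an alpha channel that tesseract couldn’t read. Also, if your source material is a PDF, you need to check the DPI setting in the file and tell convert to match that density for the output file. So in my case I ended up with this: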

$ convert -density 300 foo.pdf -depth 8 -background white -flatten +matte foo.tiff
$ tesseract foo.tiff foo
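
For the density value, one option for scanned PDFs is to read the DPI of the embedded page image from poppler's pdfimages -list instead of guessing. This sketch assumes the listing starts with two header lines and carries x-ppi in the 13th column, which is worth confirming against your own pdfimages output. first_image_ppi is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: read the horizontal DPI of the first embedded image from
# `pdfimages -list` output (poppler-utils).
# Assumes two header lines and x-ppi in column 13 of each image row.
# first_image_ppi is a hypothetical helper name.
first_image_ppi() {
    # reads the listing on stdin, prints x-ppi of the first image row
    awk 'NR > 2 { print $13; exit }'
}

# Usage:
# density=$(pdfimages -list foo.pdf | first_image_ppi)
# convert -density "$density" foo.pdf -depth 8 -background white -flatten +matte foo.tiff
```

This only makes sense when the PDF wraps a raster scan; a PDF made of vector text has no embedded image to report a DPI for.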

Note that I first tried convert with just the density and depth options, but the output apparently still kept the alpha channel. The tesseract command generates a foo.txt file by default, hopefully with something meaningful inside it. I tried this on a scan of notebook paper with my (non-cursive, but still objectively illegible) handwriting, and it produced this lovely piece of modern ASCII art:

   %€5~°«TTLT>€

Ta‘~‘5T?f%c*‘?‘2>s*‘"‘L< ‘~’—T§JTTJ(fTc»>zTi‘3’ TTTTfmbT.s‘ ggwi »T<x: ~y£/aT
v/‘a.T‘<>T‘5[z«”v~> T _ V  T T  »;z"‘.;z.:os?T‘?,T



 T   _ V T4?—Tw?zb< .‘T.Z.;9s3T.  T

;_%wmgAmT

 TG*.a2yTQ:_T,§naI Y*f~?:S T V T T.  T T T:{‘_=}("ce .4/té’LTvT1T=v’TS7  1»:
 T TTC’a(<2»iicLx:‘T/:f‘,»:9.(Ez,«21zha.  T   TT  isiacﬁé/«ii
T.  T T T _‘fL??c15?ﬂ<::»g
% ..T5;Tl.Tca<~T_1a<:v>< '    T73.Wg5Ta.rTT

Ff  AmAm£TTT TTTPmsw T j
V    *3e§h2T§;c_"e&5j‘av\T _ T T T T TT   ?7T§‘RrT(‘/9£T/ Z:
 T. T i“}2€$._<f TT(fai1),x/«L s%+:»:«~%TT
TT TTH  V7  , T »’1i5£??~TCa  .:M«4~.5”\« .
T .T !5_5éTWT‘«5”’5Tv§?€T(TT.~’fW”.T

I had much better results with this process on a file that someone other than me could actually read.
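
If you end up doing this more than once, the convert and tesseract commands from earlier can be wrapped in a small shell function. This is a rough sketch and not a polished tool: ocr_pdf and its DRY_RUN switch are names I made up, and the default density of 300 is just the value that matched my file.

```shell
#!/bin/sh
# Sketch: wrap the convert + tesseract steps in one function.
# ocr_pdf and DRY_RUN are hypothetical names, not part of either tool.
ocr_pdf() {
    pdf=$1
    density=${2:-300}      # should match the DPI of the source scan
    base=${pdf%.pdf}       # foo.pdf -> foo

    # With DRY_RUN=1, print the commands instead of running them.
    run=
    if [ "${DRY_RUN:-0}" = 1 ]; then
        run=echo
    fi

    # Flatten onto white and drop the alpha channel, then OCR the result.
    $run convert -density "$density" "$pdf" -depth 8 \
        -background white -flatten +matte "$base.tiff" &&
    $run tesseract "$base.tiff" "$base"
}

# Usage:
# ocr_pdf foo.pdf        # writes foo.tiff and foo.txt
# ocr_pdf foo.pdf 600    # same, for a 600 DPI scan
```

Since -flatten merges everything into a single image, this suits one-page scans; a multi-page PDF would need its pages converted separately.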

Prepare a PDF file for OCR is original content from devolve.

Tags: apps, command line

About The Author

Charlie Herron

Denizen of Portland, Maine; tech jack; lover / hater / whatever; philosophical dabbler. http://twitter.com/realgeek