Prepare a PDF file for OCR
If you need to OCR some text from a PDF or image file, you may want to use a tool like tesseract to do the job. But it won’t take any old input file; you’ll probably need to convert it first.
The first error I got from tesseract was:
    Error in pixReadStream: Unknown format: no pix returned
The Googles indicated that I can’t pass a PDF to it directly. Then I found that one format it will take is TIFF. Sweet: convert foo.pdf foo.tiff. I should say, if you don’t have the convert program, you’ll need to install ImageMagick or GraphicsMagick (which runs it as gm convert). I had it, but it gave a completely unhelpful error message:
    convert: no images defined `foo.tiff' @ error/convert.c/ConvertImageCommand/3212.
Ugh. After more Googles, I installed ghostscript, since convert hands PDF rendering off to it and was trying to find a binary called gs. Now it’s converting, but tesseract still won’t read the TIFF. It gives:
    Error in pixReadFromTiffStream: spp not in set {1,3,4}
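MOAR SEARCHING. The culprits: alpha channel and density. (“spp” is samples per pixel, and an extra alpha channel can push that count outside the set tesseract accepts.) If your source material is a PDF, you need to check the DPI setting in the file, and tell convert to match that density for the output file. So in my case I ended up with this: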
    $ convert -density 300 foo.pdf -depth 8 -background white -flatten +matte foo.tiff
    $ tesseract foo.tiff foo
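If you’re not sure what density the scan inside your PDF uses, one way to check is the pdfimages tool from poppler-utils, which lists each embedded image with its resolution. The snippet below is only a sketch: max_ppi is a hypothetical helper name, and it assumes the usual column layout of pdfimages -list (x-ppi as the 13th field, after a two-line header), so check it against the output of your version.

```shell
# Hypothetical helper: read `pdfimages -list` output on stdin and print the
# largest horizontal resolution (x-ppi) of any embedded image.
# Assumes x-ppi is the 13th whitespace-separated field and that the first
# two lines are the header and its dashed underline.
max_ppi() {
    awk 'NR > 2 && $13 + 0 > max { max = $13 + 0 } END { print max + 0 }'
}

# Example (requires poppler-utils):
#   pdfimages -list foo.pdf | max_ppi
```

A result of 0 means no embedded images were found (say, a PDF that is vector text rather than a scan), in which case there is no scan resolution to match.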
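Note that I first tried convert with just the density and depth options, but it apparently still kept the alpha channel; flattening onto a white background and switching off the matte (alpha) channel is what fixed it. On newer ImageMagick releases, -alpha off is the current spelling of +matte. The tesseract command writes its output to foo.txt by default, hopefully with something meaningful inside it. I tried this on a scanned sheet of notebook paper with my (non-cursive, but still objectively illegible) handwriting, and it produced this lovely piece of modern ASCII art: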
%€5~°«TTLT>€
Ta‘~‘5T?f%c*‘?‘2>s*‘"‘L< ‘~’—T§JTTJ(fTc»>zTi‘3’ TTTTfmbT.s‘ ggwi »T<x: ~y£/aT
v/‘a.T‘<>T‘5[z«”v~> T _ V T T »;z"‘.;z.:os?T‘?,T
T _ V T4?—Tw?zb< .‘T.Z.;9s3T. T
1
;_%wmgAmT
TG*.a2yTQ:_T,§naI Y*f~?:S T V T T. T T T:{‘_=}("ce .4/té’LTvT1T=v’TS7 1»:
T TTC’a(<2»iicLx:‘T/:f‘,»:9.(Ez,«21zha. T TT isiacfié/«ii
T. T T T _‘fL??c15?fl<::»g
% ..T5;Tl.Tca<~T_1a<:v>< ' T73.Wg5Ta.rTT
Ff AmAm£TTT TTTPmsw T j
V *3e§h2T§;c_"e&5j‘av\T _ T T T T TT ?7T§‘RrT(‘/9£T/ Z:
T. T i“}2€$._<f TT(fai1),x/«L s%+:»:«~%TT
TT TTH V7 , T »’1i5£??~TCa .:M«4~.5”\« .
T .T !5_5éTWT‘«5”’5Tv§?€T(TT.~’fW”.T
I had much better results with this process on a file that someone other than me could actually read.
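For repeat use, the whole dance can be wrapped in a small shell function. This is a minimal sketch rather than a polished script: missing_tools and pdf_to_text are hypothetical names, the 300 DPI default is just the value used above, and it assumes ImageMagick’s convert (not gm convert) is the one on your PATH.

```shell
# Print the name of every listed program that cannot be found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Convert a PDF to a flattened 8-bit TIFF without alpha, then OCR it.
# Usage: pdf_to_text foo.pdf [density]   (writes foo.tiff and foo.txt)
pdf_to_text() {
    pdf=$1
    density=${2:-300}   # assumption: 300 DPI unless told otherwise
    base=${pdf%.pdf}

    # Fail early with a readable message instead of a cryptic convert error.
    missing=$(missing_tools convert gs tesseract)
    if [ -n "$missing" ]; then
        echo "missing tools:" $missing >&2
        return 1
    fi

    # Same options as the command in the post; only run OCR if convert succeeds.
    convert -density "$density" "$pdf" -depth 8 -background white -flatten +matte "$base.tiff" &&
        tesseract "$base.tiff" "$base"
}
```

Calling pdf_to_text foo.pdf should leave foo.tiff and foo.txt next to the original. One caveat: -flatten merges every page onto a single canvas, so as written this only suits single-page PDFs.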