PDF Scanning

From Kyle's Wiki

If you are going to OCR scanned PDFs to text on Linux, you need ImageMagick (for convert), pdfimages (it ships with poppler-utils or Xpdf), and tesseract with its English training files.
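
A quick way to confirm the tools are installed before starting is sketched below. The function name missing_tools is made up for illustration, and the command names it would be given (pdfimages, convert, tesseract) are simply the ones used on this page; the package that provides each one varies by distribution.

```shell
# Hypothetical helper: print every command from the argument list that is
# not found on PATH. Returns non-zero if anything is missing.
missing_tools() {
    rc=0
    for tool in "$@"
    do
        if ! command -v "$tool" >/dev/null 2>&1
        then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# Usage:
#   missing_tools pdfimages convert tesseract || echo "install the tools listed above"
```

Anything it prints still needs installing before the steps on this page will work.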

First extract the page images from the PDF. With -j, images stored as JPEG in the PDF are written as .jpg files; everything else comes out as PPM or PBM:

pdfimages -f FIRSTPAGE -l LASTPAGE -j input.pdf out
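
If you run this often, the placeholders can be filled in by a small wrapper that checks the page range first. This is only a sketch: extract_pages is a hypothetical name, and the default output root "out" is taken from the command above.

```shell
# Hypothetical wrapper around the pdfimages call above.
# Usage: extract_pages FIRSTPAGE LASTPAGE input.pdf [root]
extract_pages() {
    first=$1
    last=$2
    pdf=$3
    root=${4:-out}
    # Both page numbers must be non-empty and contain digits only.
    for n in "$first" "$last"
    do
        case $n in
            ''|*[!0-9]*)
                echo "page numbers must be whole numbers" >&2
                return 2
                ;;
        esac
    done
    # The range must start at page 1 or later and must not run backwards.
    if [ "$first" -lt 1 ] || [ "$first" -gt "$last" ]
    then
        echo "bad page range: $first-$last" >&2
        return 2
    fi
    pdfimages -f "$first" -l "$last" -j "$pdf" "$root"
}
```

For example, extract_pages 1 324 input.pdf would write out-000, out-001 and so on into the current directory.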

Then convert them to TIFFs, inverting if necessary:

# loop over the files pdfimages wrote (prefix "out"), not everything in the directory
for EACH in out-*
do
convert -negate "$EACH" "$EACH.tif"
done
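
A variation on the conversion loop, sketched under a couple of assumptions: the extracted files start with "out-", and you would rather get out-000.tif than a name with two extensions. The helper names tif_name and convert_pages are made up for this example. Inversion is optional here, since not every scan needs it.

```shell
# Hypothetical helper: out-000.jpg -> out-000.tif (strips the last extension).
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Convert every extracted image in the current directory.
# Pass "negate" as the first argument to invert the images.
# Files that are already .tif are skipped, so the loop can be re-run.
convert_pages() {
    mode=$1
    for img in out-*
    do
        case $img in *.tif) continue ;; esac
        [ -e "$img" ] || continue   # glob matched nothing
        if [ "$mode" = negate ]
        then
            convert -negate "$img" "$(tif_name "$img")"
        else
            convert "$img" "$(tif_name "$img")"
        fi
    done
}
```

Run convert_pages negate for inverted scans, or plain convert_pages otherwise.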


Then OCR them:

# loop over whatever .tif files exist, so the names and page count don't need hard-coding
for i in *.tif
do
tesseract "$i" "tesseract-${i%.tif}" -l eng
done
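
OCR on a few hundred pages takes a while, so it can help to make the run resumable. The sketch below assumes the output naming used above (tesseract-NAME.txt for NAME.tif); pending_pages and ocr_pending are hypothetical names, not part of tesseract.

```shell
# Hypothetical helper: list the .tif files in the current directory that do
# not yet have a matching tesseract-<name>.txt file.
pending_pages() {
    for tif in *.tif
    do
        [ -e "$tif" ] || continue   # glob matched nothing
        base=${tif%.tif}
        [ -e "tesseract-$base.txt" ] || echo "$tif"
    done
}

# OCR only the pending pages, so an interrupted run can be picked up again.
ocr_pending() {
    pending_pages | while IFS= read -r tif
    do
        # stdin is redirected so tesseract cannot eat the list of file names
        tesseract "$tif" "tesseract-${tif%.tif}" -l eng </dev/null
    done
}
```

Calling ocr_pending a second time after a crash only processes the pages that have no text file yet.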

Then you can cat them all together:

cat tesseract-*.txt > full-output.txt
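
If you want to be able to find a page again in the combined file, a marker line per page helps. A minimal sketch, assuming the per-page files are named tesseract-*.txt and that their names sort in page order (zero-padded numbers do); join_pages is a made-up name.

```shell
# Hypothetical helper: concatenate the per-page text files in sorted glob
# order, printing a marker line with the source file name before each page.
join_pages() {
    for txt in tesseract-*.txt
    do
        [ -e "$txt" ] || continue   # glob matched nothing
        printf '=== %s ===\n' "$txt"
        cat "$txt"
    done
}

# Usage:
#   join_pages > full-output.txt
```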