Skip to main content

Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript

Project description

This script will take a pdf file and generate the corresponding OCR’ed version.

Usage:

Single conversion:

python pypdfocr.py filename.pdf

--> filename_ocr.pdf will be generated

Folder monitoring (new!):

python pypdfocr.py -w watch_directory

--> Every time a pdf file is added to `watch_directory` it will be OCR'ed

For those on Windows, because it’s such a pain to get all the PIL and PDF dependencies installed, I’ve gone ahead and made an executable available under:

dist/pypdfocr.exe

You still need to install Tesseract and GhostScript as detailed below in the dependencies list.

Caveats

This code is brand-new, and is barely commented with no unit-tests included. I plan to improve things as time allows in the near-future.

Dependencies:

PyPDFOCR relies on the following (free) programs being installed and in the path:

On Mac OS X, you can install the first two using homebrew:

brew install tesseract
brew install ghostscript

The last three can be installed using a regular python manager such as pip:

pip install pil
pip install reportlab
pip install watchdog

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pypdfocr-0.2.tar.gz (8.6 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page