Products.PDFtoOCR

PDFtoOCR does OCR processing on PDF documents. The text from OCR is used in the search results.

Maintainers

helmantel huub kcleong khink

Project description

Introduction

The PDFtoOCR product uses OCR to extract text from PDF documents. This is needed when text cannot be extracted directly from a (scanned) PDF. PDFtoOCR uses content rules to schedule the OCR processing. The processing cannot be done on the fly, for example with a custom TextIndexNG plugin, because processing large PDF documents with OCR is a time-consuming task.

Configuration

On the operating system

PDFtoOCR uses three tools that are available under Linux. Working with these tools has only been tested on Debian, but it will probably work in other *nix environments.

Install the requirements. PDFtoOCR uses the following programs:

pdftotext, checks if OCR processing is necessary
ghostscript, converts the pdf documents to tiff images
tesseract, does the OCR processing (make sure you’ve got all language packs!*)
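
The three tools above can be driven from Python. The following is a minimal sketch, not the code PDFtoOCR itself uses: the helper names, the tiffg4 device, the 300 dpi resolution and the default language are assumptions chosen for illustration.

```python
import shutil

# Executables the sketch expects on PATH ("gs" is the Ghostscript binary).
REQUIRED_TOOLS = ("pdftotext", "gs", "tesseract")


def missing_tools(which=shutil.which):
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]


def pdftotext_cmd(pdf_path):
    """Command that prints any embedded text of the PDF to stdout."""
    # "-" as the output file makes pdftotext write to stdout.
    return ["pdftotext", pdf_path, "-"]


def ghostscript_cmd(pdf_path, tiff_pattern, dpi=300):
    """Command that renders each page of the PDF to a TIFF image."""
    # tiff_pattern may contain %03d so every page gets its own file.
    # The tiffg4 device and 300 dpi are assumptions, not PDFtoOCR settings.
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-q",
        "-sDEVICE=tiffg4", "-r%d" % dpi,
        "-sOutputFile=" + tiff_pattern,
        pdf_path,
    ]


def tesseract_cmd(tiff_path, output_base, lang="eng"):
    """Command that runs OCR on one image; text lands in output_base.txt."""
    # The language pack for `lang` must be installed.
    return ["tesseract", tiff_path, output_base, "-l", lang]
```

Each command list can be passed to subprocess.run(). Calling missing_tools() before scheduling any work gives an early warning when one of the programs is not installed.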

On the Plone site

Add a content rule

Event trigger: Object modified
Condition: Content type is file
Actions: Store OCR output from a PDF in searchable text

Assign the content rule to a Plone site or a folder

Install cron4plone and add the following cronjob: portal/@@do_pdf_ocr_index

PDF Processing

Each time a file is added or modified, the unique id (uid) of the file is added to a queue. This queue is persistent and serves two functions: indexing and reindexing. The indexing function uses the queue to process the documents. When reindexing is used, all files in the queue history are processed.
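
The queue behaviour described above can be sketched in plain Python. This is an illustration only: the class and method names are hypothetical, the duplicate check is an assumption, and persistence (which the real queue has) is left out.

```python
class OcrQueue(object):
    """In-memory stand-in for the persistent queue (hypothetical names)."""

    def __init__(self):
        self.pending = []  # uids of files waiting to be indexed
        self.history = []  # uids of files that have been indexed before

    def add(self, uid):
        """Queue a file after it was added or modified."""
        # Skipping uids that are already queued is an assumption.
        if uid not in self.pending:
            self.pending.append(uid)

    def index(self, process):
        """Process every queued uid and move it to the history."""
        done = []
        while self.pending:
            uid = self.pending.pop(0)
            process(uid)
            if uid not in self.history:
                self.history.append(uid)
            done.append(uid)
        return done

    def reindex(self, process):
        """Process every uid that appears in the history again."""
        for uid in list(self.history):
            process(uid)
        return list(self.history)
```

Here process stands for whatever callable does the actual work for one file, for example looking the file up by uid and extracting its text.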

If the text of a PDF document can be extracted using pdftotext, no OCR is done. Otherwise OCR extracts the text and stores it in the File content type. ATFile is patched with an extra field to accommodate the extracted text and the language of the PDF.
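
The choice between pdftotext and OCR can be expressed as a small fallback function. This is a sketch under stated assumptions: the function name and the two callables are hypothetical, and treating whitespace-only output as "no text" is a guess at the criterion, not something the product documents.

```python
def extract_text(pdf_path, run_pdftotext, run_ocr):
    """Return (text, used_ocr) for one PDF.

    run_pdftotext and run_ocr are hypothetical callables that take the
    path of a PDF and return its text as a string.
    """
    text = run_pdftotext(pdf_path)
    # Assumption: whitespace-only output means pdftotext found no text.
    if text.strip():
        return text, False
    # Fall back to the slow path only when direct extraction came up empty.
    return run_ocr(pdf_path), True
```

Passing the two steps in as callables keeps the decision testable without the external programs being installed.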

Page views:

@@do_pdf_ocr_index - indexes documents in the queue
@@do_pdf_ocr_reindex - reindexes all pdf documents in the Plone site
@@pdf_ocr_status - shows the queue and a history of 10 documents

Further reading:

http://plone.org/documentation/how-to/ocr-in-plone-using-tesseract-ocr/
http://code.google.com/p/tesseract-ocr/

* Make sure there are no empty language files in /usr/local/share/tessdata/

A possible alternative in the future is OCRopus, which uses tesseract but is hard to set up and still too much in beta: http://sites.google.com/site/ocropus/

Changelog

1.0 - Unreleased

Initial release

Release history

1.1 - Mar 16, 2010
1.0 - Jun 17, 2009
1.0dev (pre-release, this version) - Jun 17, 2009

Download files

Source Distribution

Products.PDFtoOCR-1.0dev.tar.gz (16.6 kB), uploaded Jun 17, 2009

Hashes for Products.PDFtoOCR-1.0dev.tar.gz
Algorithm	Hash digest
SHA256	`5bef161fe7099a345a464927eb5bd4c88f1c10c5e8eabf18d44fb8f030c1870e`
MD5	`efb60d4eba0a7ef2603af2960741bb00`
BLAKE2b-256	`a6d748b9ba5ff300034fba39438644634bc05bf3d29fae1a27230ddebf26d9e2`