A Python script that iterates over the PDF files in a directory and tries to guess their language with Tesseract OCR.

Project description

PLD (PDF Language Detector)

PLD is a Python program that analyzes PDF files, extracts images, processes them using Optical Character Recognition (OCR), and detects the dominant language of the text. It provides language detection information in JSON format and calculates the average confidence coefficient for each language.
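
The aggregation step described above can be illustrated with a minimal sketch. It is not PLD's actual code: the function name `summarize`, the shape of the input (one language/confidence pair per OCR pass), the JSON keys, and the sample scores are all assumptions made for illustration.

```python
import json
from collections import defaultdict
from statistics import mean


def summarize(page_scores):
    """Average the confidence per language and pick the highest average.

    page_scores: list of (language_code, confidence) pairs, one per OCR pass.
    The returned dictionary layout is hypothetical, not PLD's real output.
    """
    by_language = defaultdict(list)
    for language, confidence in page_scores:
        by_language[language].append(confidence)
    averages = {lang: round(mean(vals), 2) for lang, vals in by_language.items()}
    dominant = max(averages, key=averages.get) if averages else None
    return {"dominant": dominant, "average_confidence": averages}


# Made-up scores for a two-page document tested against English and Spanish.
scores = [("eng", 91.0), ("spa", 62.5), ("eng", 88.0), ("spa", 58.5)]
report = summarize(scores)
print(json.dumps(report, indent=2))
```

With these sample scores, English has the higher average and is reported as the dominant language.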

Requirements

  • Python 3.8 or above
  • Tesseract OCR
  • pdftoppm

Installation

Install Tesseract OCR and pdftoppm using your package manager. For example, on Ubuntu:

sudo apt install tesseract-ocr tesseract-ocr-all poppler-utils
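
To confirm that both external tools are reachable after installation, a small check along these lines may help. The helper name `missing_tools` is hypothetical; the executable names `tesseract` and `pdftoppm` are the ones the Ubuntu packages above normally provide.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm"), which=shutil.which):
    """Return the names of required executables that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm are available")
```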

From PyPI

Install with pip:

python3 -m pip install --user pdf-language-detector

From the sources

Clone the PLD repository:

git clone git@github.com:icij/pld.git

Install the required Python packages with poetry:

poetry install

Usage

pld --help

    --language: A comma-separated list of three-letter (ISO 639) language codes to detect.
    --input-dir (optional): Path to the input directory containing PDF files. Default is the current directory.
    --output-dir (optional): Path to the output directory. Default is the 'out' directory in the current directory.
    --max-pages (optional): Maximum number of pages to process per PDF file. Default is 5.
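
As an illustration of how options like these could be declared, here is a sketch using argparse. It is an assumption, not PLD's real parser: the program name `pld-sketch` and the helper `parse_languages` are hypothetical, and the sketch accepts both repeated `--language` flags and comma-separated values.

```python
import argparse


def build_parser():
    """Declare the four documented options with their documented defaults."""
    parser = argparse.ArgumentParser(prog="pld-sketch")
    parser.add_argument("--language", action="append", default=[],
                        help="language code(s); may be repeated or comma-separated")
    parser.add_argument("--input-dir", default=".")
    parser.add_argument("--output-dir", default="out")
    parser.add_argument("--max-pages", type=int, default=5)
    return parser


def parse_languages(values):
    """Flatten repeated and comma-separated values, keeping first-seen order."""
    codes = []
    for value in values:
        for code in value.split(","):
            code = code.strip()
            if code and code not in codes:
                codes.append(code)
    return codes


args = build_parser().parse_args(
    ["--language", "eng,spa", "--language", "fra", "--max-pages", "3"]
)
languages = parse_languages(args.language)
print(languages, args.input_dir, args.output_dir, args.max_pages)
```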

Examples

Process PDF files in the 'documents' directory, detect English and Spanish, and save the results in the 'results' directory:

pld --language eng --language spa --input-dir documents --output-dir results

Process PDF files in the 'documents' directory, detect French and Greek, and limit the processing to 3 pages per file:

pld --language fra --language ell --input-dir documents --max-pages 3
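
PLD's internals are not documented here, so the following is only a sketch of how a page limit and a language choice could map onto the two required tools. The helper names are hypothetical, and the 300 dpi resolution is an assumption. The flags themselves (`-png`, `-r`, `-f`, `-l` for pdftoppm; `-l` and the `tsv` config for tesseract) are standard options of those programs.

```python
def pdftoppm_command(pdf_path, output_prefix, max_pages=5, dpi=300):
    """Build a pdftoppm call that renders pages 1..max_pages as PNG images."""
    return [
        "pdftoppm",
        "-png",               # write PNG files
        "-r", str(dpi),       # rendering resolution (assumed value)
        "-f", "1",            # first page to render
        "-l", str(max_pages), # last page to render
        str(pdf_path),
        str(output_prefix),
    ]


def tesseract_command(image_path, language):
    """Build a tesseract call that prints TSV output, including a 'conf' column."""
    return ["tesseract", str(image_path), "stdout", "-l", language, "tsv"]


render = pdftoppm_command("report.pdf", "page", max_pages=3)
ocr = tesseract_command("page-1.png", "fra")
print(" ".join(render))
print(" ".join(ocr))
```

Each list can be passed to `subprocess.run(command, check=True)` once both tools are installed.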

License

This project is licensed under the MIT License.

Download files

Source Distribution

pdf_language_detector-0.0.1.tar.gz (4.6 kB)

Built Distribution

pdf_language_detector-0.0.1-py3-none-any.whl (5.1 kB)
