pdftotext

Simple PDF text extraction

import pdftotext

# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))
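
Because a `pdftotext.PDF` object supports `len()`, indexing, and iteration over page strings (as the snippet above shows), ordinary string handling can be applied to it. The following is a minimal sketch of a keyword search across pages. `find_pages` is a hypothetical helper, not part of pdftotext, and a plain list of strings stands in for a loaded PDF so the example runs without any file on disk.

```python
from typing import Iterable, List


def find_pages(pages: Iterable[str], keyword: str) -> List[int]:
    """Return zero-based indexes of pages whose text contains keyword.

    Hypothetical helper, not part of pdftotext. Matching is
    case-insensitive.
    """
    needle = keyword.lower()
    return [i for i, text in enumerate(pages) if needle in text.lower()]


# Stand-in for a loaded pdftotext.PDF: a plain list of page strings.
sample_pages = [
    "Lorem ipsum dolor sit amet",
    "Consectetur adipiscing elit",
    "Ipsum again",
]

print(find_pages(sample_pages, "ipsum"))  # prints [0, 2]
```

With a real document, the same call would presumably be `find_pages(pdf, "ipsum")`, since the helper only relies on iterating over page strings.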

OS Dependencies

Debian, Ubuntu, and friends

sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python3-dev

Fedora, Red Hat, and friends

sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python3-devel redhat-rpm-config

macOS

brew install pkg-config poppler

Conda users may also need libgcc:

conda install -c anaconda libgcc

Windows

Currently tested only when using conda:

- Install the Microsoft Visual C++ Build Tools
- Install poppler through conda:
```
conda install -c conda-forge poppler
```
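
On the platforms above that rely on pkg-config, it can help to confirm that the poppler C++ development files are visible before building. The sketch below assumes the library is registered with pkg-config under the module name `poppler-cpp` (an assumption based on the package names listed above), and `poppler_cpp_visible` is a hypothetical helper name. It may not be meaningful for the conda-based Windows setup.

```python
import shutil
import subprocess


def poppler_cpp_visible() -> bool:
    """Return True if pkg-config can find the poppler-cpp module.

    Hypothetical helper; assumes the module name is "poppler-cpp".
    """
    pkg_config = shutil.which("pkg-config")
    if pkg_config is None:
        return False  # pkg-config itself is not on PATH
    # `pkg-config --exists` exits with status 0 when the module is found.
    result = subprocess.run([pkg_config, "--exists", "poppler-cpp"])
    return result.returncode == 0


print("poppler-cpp found:", poppler_cpp_visible())
```

A `False` result suggests that either pkg-config or the poppler development package from the relevant list above is missing.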

Install

pip install pdftotext
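
To confirm that the install succeeded without opening a PDF, one option is to check whether Python can locate the module. This is a sketch using only the standard library, and `pdftotext_installed` is a hypothetical helper name.

```python
import importlib.util


def pdftotext_installed() -> bool:
    """Return True if the pdftotext module can be located for import."""
    return importlib.util.find_spec("pdftotext") is not None


print("pdftotext installed:", pdftotext_installed())
```

Locating the module does not load the compiled extension, so a missing poppler shared library would only surface on an actual `import pdftotext`.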

Release history

- 2.2.2 (Nov 23, 2021)
- 2.2.1 (Oct 1, 2021)
- 2.2.0 (Aug 16, 2021)
- 2.1.6 (May 14, 2021)
- 2.1.5 (Aug 14, 2020)
- 2.1.4 (Jan 25, 2020), this version
- 2.1.3 (Jan 7, 2020)
- 2.1.2 (Aug 7, 2019)
- 2.1.1 (Oct 7, 2018)
- 2.1.0 (May 31, 2018)
- 2.0.2 (Feb 20, 2018)
- 2.0.1 (Aug 10, 2017)
- 2.0.0 (Jul 23, 2017)
- 1.1.0 (Jul 18, 2017)
- 1.0.0 (Jun 10, 2017)

Download files

Source Distribution

pdftotext-2.1.4.tar.gz (113.8 kB), uploaded Jan 25, 2020

Hashes for pdftotext-2.1.4.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `d37864049581fb13cdcf7b23d4ea23dac7ca2e9c646e8ecac1a39275ab1cae03` |
| MD5 | `e6018a96b3cd75fb65130e0601b805d7` |
| BLAKE2b-256 | `5838f04c252f4cb2d10af9abcde0a2db1bcd38288a76a99e88333f2a434e0c40` |