Gutenberg

Library to interface with Project Gutenberg

Maintainers

Clemens.Wolff gitenberg Master_Odin Master_Odin_Bot

Project description

Overview

This package contains a variety of scripts to make working with the Project Gutenberg body of public domain texts easier.

The functionality provided by this package includes:

Downloading texts from Project Gutenberg.
Cleaning the texts: stripping the Project Gutenberg header and footer boilerplate, leaving just the text behind.
Making meta-data about the texts easily accessible.

The package has been tested with Python 2.6, 2.7 and 3.4.

Installation

This project is on PyPI, so I’d recommend that you just install everything from there using your favourite Python package manager.

pip install gutenberg

If you want to install from source or modify the package, you’ll need to clone this repository:

git clone https://github.com/c-w/Gutenberg.git

This package depends on Berkeley DB, so you’ll need to install that first (shown here for Debian or Ubuntu):

sudo apt-get install libdb5.1-dev
export BERKELEYDB_DIR=/usr

Now, you should probably install the dependencies for the package and verify your checkout by running the tests.

cd Gutenberg

virtualenv --no-site-packages virtualenv
source virtualenv/bin/activate
pip install -r requirements.pip

pip install nose
nosetests

Usage

Downloading a text

from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

text = strip_headers(load_etext(2701)).strip()
print(text)  # prints 'MOBY DICK; OR THE WHALE\n\nBy Herman Melville ...'

The same two steps can be run from the command line:

python -m gutenberg.acquire.text 2701 moby-raw.txt
python -m gutenberg.cleanup.strip_headers moby-raw.txt moby-clean.txt
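A small wrapper can combine the two calls above and write the cleaned text to a file from Python. This is only a sketch: fetch_clean_text and save_clean_text are hypothetical names that are not part of the library, and the loader and cleaner are passed in as parameters so the helper can be tried without network access.

```python
import io
import os


def fetch_clean_text(etext_id, load, clean):
    """Return the text for etext_id with the boilerplate removed.

    load and clean are callables with the same shape as
    gutenberg.acquire.load_etext and gutenberg.cleanup.strip_headers.
    """
    return clean(load(etext_id)).strip()


def save_clean_text(etext_id, path, load, clean):
    """Write the cleaned text to path unless the file already exists."""
    if not os.path.exists(path):
        text = fetch_clean_text(etext_id, load, clean)
        with io.open(path, "w", encoding="utf-8") as outfile:
            outfile.write(text)
    return path
```

With the library installed, this would be called as save_clean_text(2701, 'moby-clean.txt', load_etext, strip_headers).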

Looking up meta-data

Title and author meta-data can be queried:

from gutenberg.query import get_etexts
from gutenberg.query import get_metadata

print(get_metadata('title', 2701))  # prints frozenset([u'Moby Dick; Or, The Whale'])
print(get_metadata('author', 2701)) # prints frozenset([u'Melville, Herman'])

print(get_etexts('title', 'Moby Dick; Or, The Whale'))  # prints frozenset([2701, ...])
print(get_etexts('author', 'Melville, Herman'))         # prints frozenset([2701, ...])

Note: The first time that one of the functions from gutenberg.query is called, the library will create a rather large database of meta-data about the Project Gutenberg texts. This one-off process will take quite a while to complete (18 hours on my machine) but once it is done, any subsequent calls to get_etexts or get_metadata will be very fast.
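As the comments in the example above show, get_metadata returns a frozenset even when a text has only one title. A caller that wants a single display string has to pick one value. The helper below is one possible way to do that; single_value is a hypothetical name and not part of the library.

```python
def single_value(values, default=None):
    """Pick one deterministic value from a set of metadata strings."""
    return min(values) if values else default


titles = frozenset([u'Moby Dick; Or, The Whale'])
print(single_value(titles))  # prints Moby Dick; Or, The Whale
```

With the library installed, the call would look like single_value(get_metadata('title', 2701)).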

Limitations

This project deliberately does not include any natural language processing functionality. Consuming and processing the text is the responsibility of the client; this library focuses on offering a simple, easy-to-use interface to the works in the Project Gutenberg corpus. Any linguistic processing can easily be done client-side, e.g. using the TextBlob library.
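As an illustration of client-side processing that needs nothing beyond the standard library, the sketch below counts the most frequent words in an already-cleaned text. The sample string is made up and merely stands in for the output of strip_headers, and top_words is a hypothetical helper.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most common lowercase words in text with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


sample = "Call me Ishmael. Call me early; call me late, call!"
print(top_words(sample, 2))  # prints [('call', 4), ('me', 3)]
```

The regular expression only recognises unaccented ASCII letters and apostrophes, so texts in other languages would need a different tokeniser.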

Project details

Release history

0.8.2 (Dec 26, 2021)
0.8.1 (Apr 30, 2020)
0.8.0 (Aug 24, 2019)
0.7.0 (May 18, 2018)
0.6.1 (Jan 11, 2018)
0.5.0 (Apr 28, 2017)
0.4.5 (Feb 19, 2017)
0.4.4 (Feb 19, 2017)
0.4.2 (Jan 9, 2016), this version
0.4.1 (Dec 1, 2015)
0.4.0 (Mar 11, 2015)
0.3.3 (Feb 28, 2015)
0.3.2 (Feb 28, 2015)
0.3.1 (Feb 28, 2015)
0.3 (Feb 28, 2015)
0.2.2 (Jan 3, 2015)
0.2.1 (Nov 18, 2014)
0.2.0 (Sep 29, 2014)
0.1.1 (Aug 3, 2014)
0.1.0 (Aug 3, 2014)
0.0.0 (Aug 3, 2014)

Download files

Source Distribution

Gutenberg-0.4.2.tar.gz (10.8 kB)

Uploaded Jan 9, 2016 Source

Hashes for Gutenberg-0.4.2.tar.gz
Algorithm	Hash digest
SHA256	`b3601a2fa2d32df76cfefd4851eca622747212edc2ed2646876bfd3626fc559b`
MD5	`1ca983181424476f023d120f2896049a`
BLAKE2b-256	`f25b759ff8d1732f8e2d5334a0918b256fbef9f66530cab1f0e470551b842f5d`