
A simple interface to the Project Gutenberg corpus


Overview

This package contains a variety of scripts to make working with the Project Gutenberg body of public domain texts easier.

The functionality provided by this package includes:

  • Downloading texts using the Project Gutenberg API.

  • Cleaning up the texts: removing headers and footers.

  • Making meta-data about the texts easily accessible through a database.
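To give a feel for the clean-up step, here is a minimal sketch of header and footer removal. It is not the package's implementation: it assumes the text carries the common "*** START OF ..." and "*** END OF ..." marker lines, and the function name strip_boilerplate is hypothetical. Real Project Gutenberg files vary, so treat this as a heuristic only.

```python
import re

# Assumption: the text uses the common "*** START OF ..." / "*** END OF ..."
# marker lines. Real files vary, so this is a heuristic, not the package's code.
START_RE = re.compile(r"^\*\*\*\s*START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF.*$", re.MULTILINE)


def strip_boilerplate(raw_text):
    """Return the text between the start and end markers (hypothetical helper)."""
    start = START_RE.search(raw_text)
    end = END_RE.search(raw_text)
    # Fall back to the whole text when a marker is missing.
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw_text)
    return raw_text[begin:finish].strip()


sample = (
    "Project Gutenberg license header\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK HAMLET ***\n"
    "Who's there?\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK HAMLET ***\n"
    "License footer\n"
)
body = strip_boilerplate(sample)
print(body)
```

A text without either marker is returned unchanged apart from surrounding whitespace.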

Installation

First, install the package's dependencies and verify your install.

  • The recommended way of doing this is using the project’s makefile. The command make virtualenv will install all the required dependencies for the package in a local directory called virtualenv.

  • You might want to run the tests to see if everything installed correctly: make test.

  • Now run source virtualenv/bin/activate and you’re good to go.

Another setup task you might want to run is make docs to automatically generate some API documentation for the project. After running the command, you can enjoy your documentation by pointing your browser at docs/_build/html/index.html.

Usage

Now that you’re all set up, let’s get started. There are two ways to use this project: from the command line and as a Python module.

From the command line

The following functionality is available via the command line:

  • Download some texts: python -m gutenberg.download.

  • Clean up a downloaded text: python -m gutenberg.beautify.

  • Grab some meta-data for the texts: python -m gutenberg.metainfo.

For example, to download 10MB of texts to the directory corpus, you could run the following:

python -m gutenberg.download ./corpus --limit=10MB

You can find out more about how to run the scripts by appending --help to the commands listed above.
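The --limit value above is a human-readable size. As a rough sketch of how such a value maps to a byte count, the hypothetical helper below (parse_size is not part of the package) assumes decimal units, so that 10MB equals 10 * 10**6 bytes. How the actual script parses its argument may differ; check --help.

```python
import re

# Hypothetical helper, not part of the gutenberg package.
# Assumption: decimal units, i.e. 1 MB == 10**6 bytes.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(text):
    """Convert a string such as '10MB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?B)\s*", text.upper())
    if match is None:
        raise ValueError("unrecognised size: {!r}".format(text))
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


limit_bytes = parse_size("10MB")
print(limit_bytes)
```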

As a module

You can also use the project as a Python package. The following snippet demonstrates how you’d download some texts from Project Gutenberg and then iterate over your freshly built corpus:

from gutenberg import GutenbergCorpus

# this will set up the corpus and download 10MB worth of text to ./corpus
corpus = GutenbergCorpus()
corpus.download(limit=10 * 10**6)

# iterate over the corpus
preview_chars = 25
for author in corpus.authors():
    for work in author.works():
        text = work.fulltext
        print(u'The first {preview_chars} characters of '
              u'"{title}" by "{author}" are:\n\t"{preview}"\n'
              .format(preview_chars=preview_chars,
                      title=work.title,
                      author=author.name,
                      preview=text.replace('\n', ' ')[:preview_chars]))

You can also easily drill down on specific texts and authors:

shakespeare = corpus[u'Shakespeare, William']

# list all the works for the author that we have currently available
work_names = shakespeare.work_names()
for work_num, title in enumerate(work_names, start=1):
    print(u'Work {work_num} in the Shakespeare corpus: "{title}"'
          .format(work_num=work_num,
                  title=title))

# inspect a particular text
hamlet = shakespeare[u'Hamlet'].fulltext
to_be_or_not_to_be = u'To be, or not to be, that is the Question'
print(u'The famous quote "{quote}" is in Hamlet at position {position}.'
      .format(quote=to_be_or_not_to_be,
              position=hamlet.find(to_be_or_not_to_be)))

All the loading of the heavy stuff is done lazily, so you can iterate over authors and works to your heart’s content without worrying about running out of memory.
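The general idea behind lazy loading can be sketched as follows. This is an illustration only, not the package's code: LazyWork and iter_works are hypothetical names, and the sketch assumes each work is stored as a .txt file in a single directory. The file is only read when fulltext is accessed, and works are yielded one at a time by a generator.

```python
import io
import os
import tempfile


class LazyWork(object):
    """Hypothetical stand-in for a work; stores only the title and path."""

    def __init__(self, title, path):
        self.title = title
        self.path = path

    @property
    def fulltext(self):
        # The file is read on access, not when the object is created.
        with io.open(self.path, encoding="utf-8") as handle:
            return handle.read()


def iter_works(directory):
    """Yield one LazyWork per .txt file without reading any file contents."""
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            title = os.path.splitext(name)[0]
            yield LazyWork(title, os.path.join(directory, name))


# Build a tiny throwaway corpus to demonstrate the behaviour.
demo_dir = tempfile.mkdtemp()
samples = [(u"hamlet", u"To be, or not to be"),
           (u"macbeth", u"When shall we three meet again")]
for title, text in samples:
    with io.open(os.path.join(demo_dir, title + ".txt"), "w",
                 encoding="utf-8") as handle:
        handle.write(text)

titles = [work.title for work in iter_works(demo_dir)]
first_text = next(iter_works(demo_dir)).fulltext
print(titles, first_text)
```

Because the text is not cached on the object, memory use stays bounded by the largest single work as long as the caller does not hold on to the returned strings.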

Advanced usage

You can influence how the corpus object behaves by specifying a configuration file when constructing the object: corpus = GutenbergCorpus.using_config('my-corpus.cfg'). A configuration file can be generated from a corpus object by using corpus.write_config('path-to-config.cfg').

The default configuration looks like this:

[download]
data_path = corpus/rawdata  # storage location of the raw Gutenberg texts
offset = 0  # start downloading from this result page

[database]
database = corpus/gutenberg.db3  # storage location of the corpus DB
drivername = sqlite  # the type of database to use for the corpus DB

[metadata]
metadata = corpus/metadata.json.gz  # storage location of the metadata DB

More information on the different configuration options can be found in the API documentation of the gutenberg.gutenberg package.
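If you want to inspect such a configuration yourself, the default configuration shown above can be read with Python's standard configparser module. This is a sketch for Python 3 and says nothing about how the package itself parses the file. Note that the trailing "#" comments are only stripped because inline_comment_prefixes is set explicitly; configparser does not treat them as comments by default.

```python
import configparser

# The default configuration as listed in this document.
DEFAULT_CONFIG = """\
[download]
data_path = corpus/rawdata  # storage location of the raw Gutenberg texts
offset = 0  # start downloading from this result page

[database]
database = corpus/gutenberg.db3  # storage location of the corpus DB
drivername = sqlite  # the type of database to use for the corpus DB

[metadata]
metadata = corpus/metadata.json.gz  # storage location of the metadata DB
"""

# inline_comment_prefixes makes the parser drop the trailing "# ..." notes.
parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
parser.read_string(DEFAULT_CONFIG)

data_path = parser.get("download", "data_path")
offset = parser.getint("download", "offset")
drivername = parser.get("database", "drivername")
print(data_path, offset, drivername)
```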

The corpus database stores information about the downloaded texts. The database has a single table, etexts, with four columns: etextno, title, author and path. The first column is the primary key of the table and holds the unique identifier of the work in the Project Gutenberg corpus. The title and author columns record meta-data about the work (as Unicode), and path holds a relative path to the raw text on disk.
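With the default sqlite driver, a table of this shape can be queried directly using Python's sqlite3 module. The sketch below builds an in-memory table with the four columns named above. The column types, the sample rows and the etextno values are placeholders chosen for illustration, not the real schema definition or real Project Gutenberg identifiers.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
connection = sqlite3.connect(":memory:")

# Assumption: column types are guesses; only the column names come from
# the description above.
connection.execute(
    "CREATE TABLE etexts ("
    " etextno INTEGER PRIMARY KEY,"
    " title TEXT,"
    " author TEXT,"
    " path TEXT)"
)

# Placeholder rows; the identifiers and paths are made up.
connection.executemany(
    "INSERT INTO etexts (etextno, title, author, path) VALUES (?, ?, ?, ?)",
    [
        (1, u"Hamlet", u"Shakespeare, William", u"rawdata/1.txt"),
        (2, u"Macbeth", u"Shakespeare, William", u"rawdata/2.txt"),
    ],
)

# Look up all works by one author, ordered by identifier.
rows = connection.execute(
    "SELECT etextno, title FROM etexts WHERE author = ? ORDER BY etextno",
    (u"Shakespeare, William",),
).fetchall()
print(rows)
```

To run the same query against a real corpus, you would connect to the file named by the database option in the configuration instead of ":memory:".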

Limitations

This project deliberately does not include any natural language processing functionality. Consuming and processing the text is the responsibility of the client; this library focuses solely on offering a simple, easy-to-use interface to the works in the Project Gutenberg corpus. Any linguistic processing can easily be done client-side, e.g. using the TextBlob library.

Download files


Source Distribution

Gutenberg-0.1.1.tar.gz (25.0 kB)
