gutenbergpy

Library to create and interrogate a local cache for Project Gutenberg

***********
GutenbergPy
***********

Overview
========

.. image:: https://github.com/raduangelescu/gutenbergpy/blob/master/dblogos.png
   :alt: MONGODB
   :align: center
   :width: 100%

This package makes it easier to filter and retrieve information from `Project
Gutenberg <http://www.gutenberg.org>`_ in Python.

Its target audience is machine learning practitioners who need data for their projects,
but anybody may use it freely.

The package:

- Generates a local cache of all Project Gutenberg metadata that you can query to get book ids. The local cache may be SQLite (the default) or MongoDB (which requires the ``pymongo`` package to be installed)

- Downloads and cleans raw text from Gutenberg books

The package has been tested with Python 2.7 on both Windows and Linux.
It is a faster, smaller alternative to https://github.com/c-w/Gutenberg with fewer third-party dependencies.

Installation
============

.. sourcecode :: sh

    pip install gutenbergpy

Or install it from source (it is pure Python):

.. sourcecode :: sh

    git clone https://github.com/raduangelescu/gutenbergpy
    cd gutenbergpy
    python setup.py install

Usage
=====

Downloading a text
------------------

.. sourcecode :: python

    import gutenbergpy.textget

    # Get a book by its Gutenberg id
    raw_book = gutenbergpy.textget.get_text_by_id(1000)
    print(raw_book)
    # Strip the Project Gutenberg headers from the book
    clean_book = gutenbergpy.textget.strip_headers(raw_book)
    print(clean_book)
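
As a rough illustration of what header stripping involves, the sketch below keeps only the text between Project Gutenberg's ``*** START OF`` and ``*** END OF`` marker lines. It is a simplified stand-in, not the implementation behind ``strip_headers``: the function name ``strip_between_markers`` and the sample text are made up for this example, and real books use several marker variants that it does not handle.

.. sourcecode :: python

    def strip_between_markers(text):
        """Return the text between the START and END marker lines.

        Hypothetical helper for illustration only; not part of gutenbergpy.
        """
        lines = text.splitlines()
        start, end = 0, len(lines)
        for i, line in enumerate(lines):
            if line.startswith("*** START OF"):
                start = i + 1  # the body begins after the start marker
            elif line.startswith("*** END OF"):
                end = i        # the body stops before the end marker
                break
        return "\n".join(lines[start:end]).strip()


    # Invented sample text that mimics the layout of a Gutenberg file
    sample = "\n".join([
        "The Project Gutenberg EBook of Example",
        "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
        "",
        "Chapter 1",
        "It was a dark and stormy night.",
        "",
        "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
        "License text follows here.",
    ])
    body = strip_between_markers(sample)
    print(body)

A text without any markers is returned unchanged apart from surrounding whitespace.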

Query the cache
---------------

To do this you first need to create the cache. This is a one-time step per system, until you decide to rebuild it.

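.. sourcecode :: python

    # GutenbergCacheTypes is assumed to be importable from the same module
    from gutenbergpy.gutenbergcache import GutenbergCache, GutenbergCacheTypes

    # for sqlite (the default)
    GutenbergCache.create()
    # for mongodb
    GutenbergCache.create(type=GutenbergCacheTypes.CACHE_TYPE_MONGODB)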

For debugging or finer control, ``create`` accepts these boolean options:

- *refresh* deletes the old cache
- *download* downloads the RDF file from Project Gutenberg
- *unpack* unpacks it
- *parse* parses it in memory
- *cache* writes the cache
- *deleteTemp* deletes the temporary files (judging by its name)

.. sourcecode :: python

    GutenbergCache.create(refresh=True, download=True, unpack=True, parse=True, cache=True, deleteTemp=True)
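
One possible use of these switches is to rebuild the cache without downloading the archive again. The sketch below only shows which flags such a call would pass. It assumes that the stages are independent and that the unpacked files from an earlier run are still on disk, neither of which is confirmed here. ``rebuild_from_local_files`` and ``fake_create`` are hypothetical names, and ``fake_create`` stands in for ``GutenbergCache.create`` so the example runs without the package installed.

.. sourcecode :: python

    def rebuild_from_local_files(create):
        """Call ``create`` with the download and unpack stages switched off."""
        return create(refresh=True, download=False, unpack=False,
                      parse=True, cache=True, deleteTemp=False)


    def fake_create(**flags):
        # Stand-in for GutenbergCache.create that just echoes its flags
        return flags


    flags = rebuild_from_local_files(fake_create)
    # List the stages that would run
    print(sorted(name for name, enabled in flags.items() if enabled))

With the package installed, the same helper would be called as ``rebuild_from_local_files(GutenbergCache.create)``.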

For even finer control you may set the following ``GutenbergCacheSettings``:

- *CacheFilename*
- *CacheUnpackDir*
- *CacheArchiveName*
- *ProgressBarMaxLength*
- *CacheRDFDownloadLink*
- *TextFilesCacheFolder*
- *MongoDBCacheServer*

.. sourcecode :: python

    # Module path assumed; adjust the import if GutenbergCacheSettings lives elsewhere
    from gutenbergpy.gutenbergcachesettings import GutenbergCacheSettings

    GutenbergCacheSettings.set(CacheFilename="", CacheUnpackDir="",
                               CacheArchiveName="", ProgressBarMaxLength="",
                               CacheRDFDownloadLink="", TextFilesCacheFolder="",
                               MongoDBCacheServer="")

After calling ``create`` you need to wait; it takes about 5 minutes, depending on your internet speed and computer power.
(On an i7 with a gigabit connection and an SSD it finishes in about 1 minute.)

Get the cache:

.. sourcecode :: python

    # for mongodb
    cache = GutenbergCache.get_cache(GutenbergCacheTypes.CACHE_TYPE_MONGODB)
    # for sqlite
    cache = GutenbergCache.get_cache()

Now you can run queries.

Use the ``query`` function to get the unique Gutenberg ids of the matching books.

Standard query fields:

- languages
- authors
- types
- titles
- subjects
- publishers
- bookshelves
- downloadtype

.. sourcecode :: python

    print(cache.query(downloadtype=['application/plain','text/plain','text/html; charset=utf-8']))

Or run a native query against the underlying database:

.. sourcecode :: python

    # sqlite
    cache.native_query("SELECT * FROM books")
    # mongodb
    cache.native_query({'type': 'Text'})

For custom SQLite queries, take a look at the SQLite database schema:

.. image:: https://github.com/raduangelescu/gutenbergpy/blob/master/sqlitecheme.png
   :alt: SQLite database schema
   :width: 100%
   :align: center
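
If the schema image is not at hand, the tables and columns of a SQLite file can also be listed with Python's standard ``sqlite3`` module. The sketch below runs against a small in-memory database so that it works anywhere. Apart from the ``books`` table name used in the query above, its table and column names are invented for the demo and are not the real cache schema. To inspect the real cache, connect to the cache file instead (whatever ``CacheFilename`` is set to).

.. sourcecode :: python

    import sqlite3


    def describe_schema(connection):
        """Map each table name to the list of its column names."""
        cursor = connection.cursor()
        cursor.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
        tables = [row[0] for row in cursor.fetchall()]
        schema = {}
        for table in tables:
            # PRAGMA table_info yields one row per column; index 1 is its name
            cursor.execute('PRAGMA table_info("%s")' % table)
            schema[table] = [row[1] for row in cursor.fetchall()]
        return schema


    # Stand-in database; these columns are placeholders, not the real schema
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE books (id INTEGER PRIMARY KEY, gutenbergbookid INTEGER)")
    connection.execute(
        "CREATE TABLE titles (id INTEGER PRIMARY KEY, name TEXT, bookid INTEGER)")
    schema = describe_schema(connection)
    for table, columns in sorted(schema.items()):
        print(table, columns)
    connection.close()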

For MongoDB queries you have the entire books collection. Each book has the following fields:

- book(publisher, rights, language, book_shelf, gutenberg_book_id, date_issued, num_downloads, titles, subjects, authors, files, type)
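
A filter for ``native_query`` on the MongoDB cache is a plain dictionary keyed by the field names above. The helper below is a hypothetical convenience, not part of gutenbergpy. It only checks the keys against that list, so a misspelled field name fails early instead of silently matching nothing. The example value is taken from the native query shown earlier; the value formats of the other fields are not described here.

.. sourcecode :: python

    # Field names copied from the list above
    BOOK_FIELDS = frozenset([
        "publisher", "rights", "language", "book_shelf", "gutenberg_book_id",
        "date_issued", "num_downloads", "titles", "subjects", "authors",
        "files", "type",
    ])


    def build_book_filter(**criteria):
        """Return a filter dictionary, rejecting unknown field names."""
        unknown = sorted(set(criteria) - BOOK_FIELDS)
        if unknown:
            raise ValueError("unknown book field(s): " + ", ".join(unknown))
        return dict(criteria)


    book_filter = build_book_filter(type="Text")
    print(book_filter)
    # With a MongoDB cache this would be passed on as:
    # cache.native_query(book_filter)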


Release history

- 0.3.5 (Mar 27, 2023)
- 0.3.4 (Jun 2, 2021)
- 0.3.3 (Apr 1, 2021)
- 0.3.2 (Apr 1, 2021)
- 0.3.1 (Mar 2, 2021)
- 0.3.0 (Mar 2, 2021)
- 0.2.0 (Feb 28, 2017)
- 0.1.7 (Feb 26, 2017), this version
- 0.1.6 (Feb 19, 2017)

Download files

Source distribution: GutenbergPy-0.1.7.zip (25.3 kB), uploaded Feb 26, 2017.

Hashes for GutenbergPy-0.1.7.zip:

=========== ====================================================================
Algorithm   Hash digest
=========== ====================================================================
SHA256      ``85ff12a00c1f50efd8e889298a4166196e0419f81f572a592b8db7a30c40328a``
MD5         ``3d9cce1d0edff3433ae09b8c26cf4a26``
BLAKE2b-256 ``33f5b5728ba15a0855c658c2f464f42f620f69029373ed8600028b219cf0b6a6``
=========== ====================================================================