Library for working with the HTRC Extracted Features dataset

Tools for working with the HTRC Extracted Features dataset, a dataset of page-level text features extracted from 17 million digitized works.

This library provides a FeatureReader for parsing files, which are handled as Volume objects with collections of Page objects. Volumes provide access to metadata (e.g. language), volume-wide feature information (e.g. token counts), and access to Pages. Pages allow you to easily parse page-level features, particularly token lists.

This library makes heavy use of Pandas, returning many data representations as DataFrames. Pandas is the leading way of dealing with structured data in Python, so this library doesn't try to reinvent the wheel. Since refactoring around Pandas, the primary benefit of using the HTRC Feature Reader is performance: reading and parsing the JSON structures is generally faster than with custom code. You also get convenient access to common information, such as case-folded token counts or part-of-page specific character counts. Details of the public methods provided by this library can be found in the HTRC Feature Reader docs.

Table of Contents: Installation | Usage | Additional Notes

Links: HTRC Feature Reader Documentation | HTRC Extracted Features Dataset

Citation: Peter Organisciak and Boris Capitanu, "Text Mining in Python through the HTRC Feature Reader," Programming Historian, (22 November 2016), http://programminghistorian.org/lessons/text-mining-with-extracted-features.

Installation

To install,

    pip install htrc-feature-reader

That's it! This library is written for Python 3.0+. For Python beginners, you'll need pip.

Alternately, if you are using Anaconda, you can install with

    conda install -c htrc htrc-feature-reader

The conda approach is recommended, because it makes sure that some of the hard-to-install dependencies are properly installed.

Given the nature of data analysis, iPython with Jupyter notebooks is a recommended convenience for preparing your scripts interactively. Most basically, it can be installed with pip install ipython[notebook] and run with ipython notebook from the command line, which starts a session that you can access through your browser. If this doesn't work, consult the iPython documentation.

Optional: installing the development version (see Additional Notes below).

Usage

Note: for new Python users, a more in-depth lesson is published by Programming Historian: Text Mining in Python through the HTRC Feature Reader. That lesson is also the official citation associated with the HTRC Feature Reader library.

Reading feature files

The easiest way to start using this library is to use the Volume interface, which takes a path to an Extracted Features file.

from htrc_features import Volume
vol = Volume('data/ef2-stubby/hvd/34926/hvd.32044093320364.json.bz2')
vol

The Nautilus. by Delaware Museum of Natural History. (1904, 222 pages) - hvd.32044093320364

The FeatureReader can also download files at read time, by reference to a HathiTrust volume id. For example, if I want both volumes of Pride and Prejudice, I can see that the URLs are babel.hathitrust.org/cgi/pt?id=hvd.32044013656053 and babel.hathitrust.org/cgi/pt?id=hvd.32044013656061. These ids can be passed to Volume directly, or to the FeatureReader with the ids=[] argument, as follows:

for htid in ["hvd.32044013656053", "hvd.32044013656061"]:
    vol = Volume(htid)
    print(vol.title, vol.enumeration_chronology)
Pride and prejudice. v.1
Pride and prejudice. v.2

This downloads the file temporarily, using the HTRC's web-based download link (e.g. https://data.analytics.hathitrust.org/features/get?download-id={{URL}}). One good pairing with this feature is the HTRC Python SDK's functionality for downloading collections.

For example, I have a small collection of knitting-related books at https://babel.hathitrust.org/cgi/mb?a=listis&c=1174943610. To read the feature files for those books:

from htrc import workset
from htrc_features import FeatureReader
volids = workset.load_hathitrust_collection('https://babel.hathitrust.org/cgi/mb?a=listis&c=1174943610')
FeatureReader(ids=volids).first().title

Remember that for large jobs, it is faster to download your dataset beforehand, using the rsync method.

Volume

A Volume contains information about the current work and access to the pages of the work. All the metadata fields from the HTRC JSON file are accessible as properties of the volume object, including title, language, imprint, oclc, pubDate, and genre. The main identifier id and pageCount are also accessible, and you can find the URL for the Full View of the text in the HathiTrust Digital Library - if it exists - with vol.handle_url.

"Volume {} is a {} page text from {} written in {}. You can doublecheck at {}".format(vol.id, vol.page_count, 
                                                                                      vol.year, vol.language, 
                                                                                      vol.handle_url)
'Volume hvd.32044013656061 is a 306 page text from 1903 written in eng. You can doublecheck at http://hdl.handle.net/2027/hvd.32044013656061'

This is the Extracted Features dataset, so the features are easily accessible. The most popular is token counts, which are returned as a Pandas DataFrame:

df = vol.tokenlist()
df.sample(10)
                              count
page section token    pos
201  body    abode    NN         1
117  body    head     NN         1
126  body    for      IN         1
210  body    three    CD         1
224  body    would    MD         1
89   body    The      DT         1
283  body    any      DT         1
63   body    surprise NN         1
152  body    make     VB         1
170  body    I        PRP        3

Other extracted features are discussed below.

The full included metadata can be seen with vol.parser.meta:

vol.parser.meta.keys()
dict_keys(['id', 'metadata_schema_version', 'enumeration_chronology', 'type_of_resource', 'title', 'date_created', 'pub_date', 'language', 'access_profile', 'isbn', 'issn', 'lccn', 'oclc', 'page_count', 'feature_schema_version', 'ht_bib_url', 'genre', 'handle_url', 'imprint', 'names', 'source_institution', 'classification', 'issuance', 'bibliographic_format', 'government_document', 'hathitrust_record_number', 'rights_attributes', 'pub_place', 'volume_identifier', 'source_institution_record_number', 'last_update_date'])

These fields are mapped to attributes in Volume, so vol.oclc will return the oclc field from that metadata. As a convenience, Volume.year returns the pub_date information and Volume.author returns the contributor information.

vol.year, vol.author
('1903', ['Austen, Jane 1775-1817 '])

If the minimal metadata included with the extracted feature files is insufficient, you can fetch HT's metadata record from the Bib API with vol.metadata. Remember that this calls the HTRC servers for each volume, so it can add considerable overhead. The result is a MARC file, returned as a pymarc record object. For example, to get the publisher information from field 260:

vol.metadata['260'].value()
'Boston : Little, Brown, 1903.'
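
The same indexing pattern works for other MARC fields. As a quick illustration (field 245 is the standard MARC title statement; this simply repeats the pymarc access shown above):

# Same pymarc access pattern as above; field 245 holds the MARC title statement
vol.metadata['245'].value()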

At large-scales, using vol.metadata is an impolite and inefficient amount of server pinging; there are better ways to query the API than one volume at a time. Read about the HTRC Solr Proxy.

Another source of bibliographic metadata is the HathiTrust Bib API. You can access this information through the URL returned with vol.ht_bib_url:

vol.ht_bib_url
'http://catalog.hathitrust.org/api/volumes/full/htid/hvd.32044013656061.json'

Volumes also have direct access to volume-wide info of features stored in pages. For example, you can get a list of words per page through Volume.tokens_per_page(). We'll discuss these features below, after looking first at Pages.
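
For a quick taste, here is a minimal sketch of working with one of these volume-wide features. It assumes that Volume.tokens_per_page() returns a Pandas Series indexed by page, which is not spelled out above, so treat the exact return shape as an assumption:

# Assumes tokens_per_page() returns a Pandas Series indexed by page number
tokens = vol.tokens_per_page()
print(tokens.head())   # token counts for the first few pages
print(tokens.sum())    # total tokens across the volume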

Note that for the most part, the properties of the Page and Volume objects align with the names in the HTRC Extracted Features schema, except that they are converted to follow Python naming conventions: the camelCase of the schema becomes lowercase_with_underscores. E.g. beginLineChars from the HTRC data is accessible as Page.begin_line_chars.

The fun stuff: playing with token counts and character counts

Token counts are returned by Volume.tokenlist() (or Page.tokenlist()). By default, part-of-speech-tagged, case-sensitive counts are returned for the body.

The token count information is returned as a DataFrame with a MultiIndex (page, section, token, and part of speech) and one column (count).

print(vol.tokenlist()[:3])
                         count
page section token  pos       
1    body    Austen .        1
             Pride  NNP      1
             and    CC       1

The tokenlist() results can be manipulated in various ways. You can case-fold, for example:

tl = vol.tokenlist(case=False)
tl.sample(5)
                               count
page section lowercase pos
218  body    what      WP          1
30   body    pemberley NNP         1
213  body    comes     VBZ         2
183  body    took      VBD         1
51   body    necessary JJ          1

Or, you can combine part-of-speech counts into a single count per token.

tl = vol.tokenlist(pos=False)
tl.sample(5)
                      count
page section token
264  body    family       2
47   body    journey      1
98   body    Perhaps      1
49   body    at           2
227  body    so           1

Section arguments are also valid here: 'header', 'body', 'footer', 'all', and 'group'.

tl = vol.tokenlist(section="header", case=False, pos=False)
tl.head(5)
                        count
page section lowercase
9    header  's             1
             and            1
             austen         1
             jane           1
             prejudice      1

You can also drop the section index altogether if you're content with the default 'body'.

vol.tokenlist(drop_section=True, case=False, pos=False).sample(2)
                count
page lowercase
247  suppose        1
76   would          2

The MultiIndex makes it easy to slice the results, and it is altogether more memory-efficient. For example, to return just the nouns (NN):

tl = vol.tokenlist()
tl.xs('NN', level='pos').head(4)
                            count
page section token
1    body    prejudiceJane      1
9    body    Volume             1
10   body    vol                3
12   body    ./■                1

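Slicing on the other index levels works the same way, using the standard Pandas pd.IndexSlice helper (which also appears later in this README). A minimal sketch, pulling the counts for a small page range (this assumes the index is sorted by page, as in the output above):

import pandas as pd

idx = pd.IndexSlice
tl = vol.tokenlist()

# Counts for pages 10 through 12 only, across all sections, tokens, and POS tags
print(tl.loc[idx[10:12, :, :, :], :].head())
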
If you are new to Pandas DataFrames, you might find it easier to learn by converting the index to columns.

simpler_tl = df.reset_index()
simpler_tl[simpler_tl.pos == 'NN']
       page section          token pos  count
3         1    body  prejudiceJane  NN      1
19        9    body         Volume  NN      1
40       10    body            vol  NN      3
51       12    body            ./■  NN      1
53       12    body              /  NN      1
...     ...     ...            ... ...    ...
43178   297    body          spite  NN      1
43187   297    body          uncle  NN      1
43191   297    body        warmest  NN      1
43195   297    body           wife  NN      1
43226   305    body    NON-RECEIPT  NN      1

[7224 rows × 5 columns]

If you prefer not to use Pandas, you can always convert the object, with methods like to_dict() and to_csv().

tl[:3].to_csv()
'page,section,token,pos,count\n1,body,Austen,.,1\n1,body,Pride,NNP,1\n1,body,and,CC,1\n'

To get just the unique tokens, Volume.tokens provides them as a set. Here I select a specific page for brevity and a minimum count, but you can run the method without arguments.

vol.tokens(page_select=21, min_count=5)
{'"', ',', '.', 'You', 'been', 'have', 'his', 'in', 'of', 'the', 'you'}
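
Because the result is a plain Python set, ordinary set operations apply directly. A small sketch (the page numbers are just illustrative):

# Tokens that appear on both page 21 and page 22 (plain set intersection)
shared = vol.tokens(page_select=21) & vol.tokens(page_select=22)
print(sorted(shared))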

In addition to token lists, you can also access other section features:

vol.section_features()
      tokenCount  lineCount  emptyLineCount  capAlphaSeq  sentenceCount
page
1              4          1               0            1              1
2             15         10               4            2              1
3              0          0               0            0              0
4              0          0               0            0              0
5              0          0               0            0              0
...          ...        ...             ...          ...            ...
302            0          0               0            0              0
303            0          0               0            0              0
304            0          0               0            0              0
305           49         11               2            3              3
306            2          3               1            1              1

[306 rows × 5 columns]
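
Since this is an ordinary DataFrame, the usual Pandas operations apply. A small sketch using the column names shown above:

features = vol.section_features()

# Pages where no tokens were extracted (e.g. blank pages or plates)
empty_pages = features[features['tokenCount'] == 0].index
print(len(empty_pages), "pages with no tokens")

# Average sentence count on pages that do have text
non_empty = features[features['tokenCount'] > 0]
print(non_empty['sentenceCount'].mean())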

Chunking

If you want comparably sized document units rather than pages, you can use 'chunking' to roll pages into chunks that aim for a specific token length. e.g.

by_chunk = vol.tokenlist(chunk=True, chunk_target=10000)
print(by_chunk.sample(4))
# Count words per chunk
by_chunk.groupby(level='chunk').sum()
                              count
chunk section token      pos       
5     body    husbands   NNS      3
2     body    frequently RB       3
              domestic   JJ       3
3     body    :          :       10
       count
chunk
1      12453
2       9888
3       9887
4      10129
5      10054
6      10065
7      12327

Multiprocessing

For large jobs, you'll want to use multiprocessing or multithreading to speed up your process. This is left up to your preferred method, either within Python or by spawning multiple scripts from the command line. Here are two approaches that I like.

Dask

Dask offers easy multithreading (shared resources) and multiprocessing (separate processes) in Python, and is particularly convenient because it implements a subset of the Pandas DataFrame API.

Here is a minimal example that lazily loads token frequencies from a list of volume IDs and counts them up by part-of-speech tag.

import dask.dataframe as dd
from dask import delayed
from htrc_features import FeatureReader

def get_tokenlist(volid):
    ''' Load a one-volume FeatureReader, get that volume, and return its tokenlist '''
    return FeatureReader(ids=[volid]).first().tokenlist()

delayed_dfs = [delayed(get_tokenlist)(volid) for volid in volids]

# Create a dask DataFrame from the delayed objects
ddf = (dd.from_delayed(delayed_dfs)
         .reset_index()
         .groupby('pos')[['count']]
         .sum()
      )

# Run processing
ddf.compute()

Here is an example of 78 volumes being processed in 24 seconds with 31 threads:

[Figure: Counting POS in 78 books about knitting]

This example used multithreading. Due to the nature of Python, certain functions won't parallelize well. In our case, the part where the JSON is read from the file and converted to a DataFrame (the light green parts of the graphic) won't speed up, because building Python dicts holds the Global Interpreter Lock (GIL). However, because Pandas releases the GIL, nearly everything you do after parsing the JSON will be very quick.

To better understand what happens when ddf.compute() is called, here is the task graph for 4 volumes:

[Figure: Dask task graph for 4 volumes]

GNU Parallel

As an alternative to multiprocessing in Python, my preference is to keep the Python scripts simple and to use GNU Parallel on the command line. To do this, you can set up your Python script to take a variable number of feature file paths as arguments and to print its results to stdout (a sketch follows the example command below).

This pseudo-code shows how you'd use parallel, where the number of parallel processes is 90% of the number of cores, and 50 paths are sent to the script at a time (if you send too few at a time, the start-up time of the script can add up).

find feature-files/ -name '*json.bz2' | parallel --eta --jobs 90% -n 50 python your_script.py >output.txt
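
A sketch of what such a script might look like. The file name your_script.py and the output format are illustrative, not part of the library:

# your_script.py -- illustrative sketch for use with GNU Parallel as above.
# Takes feature-file paths as arguments and prints one line per volume to stdout.
import sys
from htrc_features import Volume

for path in sys.argv[1:]:
    vol = Volume(path)
    # Print the volume id and its total body token count, tab-separated
    total = vol.tokenlist(case=False, pos=False)['count'].sum()
    print(vol.id, total, sep='\t')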

Additional Notes

Installing the development version

git clone https://github.com/htrc/htrc-feature-reader.git
cd htrc-feature-reader
python setup.py install

Iterating through the JSON files

If you need to do fast, highly customized processing without instantiating Volumes, FeatureReader has a convenient generator for getting the raw JSON as a Python dict: fr.jsons(). This simply does the file reading, optional decompression, and JSON parsing.
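
A minimal sketch, assuming FeatureReader is given a list of local file paths and reusing the example file from earlier:

from htrc_features import FeatureReader

fr = FeatureReader(['data/ef2-stubby/hvd/34926/hvd.32044093320364.json.bz2'])
for vol_json in fr.jsons():
    # vol_json is a plain Python dict parsed straight from the (optionally compressed) file
    print(vol_json.keys())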

Downloading files within the library

utils includes an Rsyncing utility, download_file. This requires Rsync to be installed on your system.

Usage:

Download one file to the current directory:

from htrc_features import utils
utils.download_file(htids='nyp.33433042068894')

Download multiple files to the current directory:

ids = ['nyp.33433042068894', 'nyp.33433074943592', 'nyp.33433074943600']
utils.download_file(htids=ids)

Download file to /tmp:

utils.download_file(htids='nyp.33433042068894', outdir='/tmp')

Download file to current directory, keeping pairtree directory structure, i.e. ./nyp/pairtree_root/33/43/30/42/06/88/94/33433042068894/nyp.33433042068894.json.bz2:

utils.download_file(htids='nyp.33433042068894', keep_dirs=True)

Getting the Rsync URL

If you have a HathiTrust Volume ID and want to be able to download the features for a specific book, htrc_features.utils contains an id_to_rsync function. This uses the pairtree library but has a fallback for when that library is not installed, since it isn't compatible with Python 3.


from htrc_features import utils
utils.id_to_rsync('miun.adx6300.0001.001')
'miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2'

See the ID to Rsync notebook for more information on this format and on Rsyncing lists of URLs.

There is also a command line utility installed with the HTRC Feature Reader:

$ htid2rsync miun.adx6300.0001.001
miun/pairtree_root/ad/x6/30/0,/00/01/,0/01/adx6300,0001,001/miun.adx6300,0001,001.json.bz2

Advanced Features

In the beta Extracted Features release, schema 2.0, a few features were separated out into an 'advanced' file. However, this designation is no longer present starting with schema 3.0, meaning information like beginLineChars, endLineChars, and capAlphaSeq is always available:

# What is the longest sequence of capital letters on each page?
vol.cap_alpha_seqs()[:10]
[0, 1, 0, 0, 0, 0, 0, 0, 4, 1]
end_line_chars = vol.end_line_chars()
print(end_line_chars.head())
                         count
page section place char       
2    body    end   -         1
                   :         1
                   I         1
                   f         1
                   t         1
# Find pages that have lines ending with "!"
import pandas as pd
idx = pd.IndexSlice
print(end_line_chars.loc[idx[:,:,:,'!'],].head())
                         count
page section place char       
45   body    end   !         1
75   body    end   !         1
77   body    end   !         1
91   body    end   !         1
92   body    end   !         1
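
beginLineChars is exposed the same way. A short sketch, assuming begin_line_chars() mirrors the end_line_chars() call above (per the naming convention described earlier):

# Assumed to mirror end_line_chars(), following beginLineChars -> begin_line_chars
begin_line_chars = vol.begin_line_chars()
print(begin_line_chars.head())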

Testing

This library is meant to be compatible with Python 3.2+ and Python 2.7+. Tests are written for py.test and can be run with setup.py test, or directly with python -m py.test -v.

If you find a bug, leave an issue on the issue tracker, or contact Peter Organisciak at organisciak+htrc@gmail.com.
