readability

Measure the readability of a given text using surface characteristics

Development Status
- 4 - Beta
Environment
- Console
- Web Environment
Intended Audience
- Science/Research
License
- OSI Approved :: Apache Software License
Operating System
- POSIX
Programming Language
Topic
- Text Processing :: Linguistic

Project description

A collection of functions that measure the readability of a given body of text using surface characteristics. These measures are basically linear regressions based on the number of words, syllables, and sentences.

The functionality is modeled after the UNIX style(1) command. Compared to the implementation in GNU diction, this version supports UTF-8 encoded text but expects sentence-segmented and tokenized input. Syllabification and word type recognition are based on simple heuristics and provide only a rough measure.

NB: all readability formulas were developed for English, so the scales of the outcomes are only meaningful for English texts.

Installation

$ pip install https://github.com/andreasvc/readability/tarball/master

Usage

$ readability --help
Simple readability measures.

Usage: readability [--lang=<x>] [FILE]
or: readability [--lang=<x>] --csv FILES...

By default, input is read from standard input.
Text should be encoded with UTF-8,
one sentence per line, tokens space-separated.

Options:
  -L, --lang=<x>   Set language (available: de, nl, en).
  --csv            Produce a table in comma separated value format on
                   standard output given one or more filenames.
  --tokenizer=<x>  Specify a tokenizer including options that will be given
                   each text on stdin and should return tokenized output on
                   stdout. Not applicable when reading from stdin.

For proper results, the text should be tokenized.

For English, I recommend “tokenizer”, cf. http://moin.delph-in.net/WeSearch/DocumentParsing
For Dutch, I recommend the tokenizer that is part of the Alpino parser: http://www.let.rug.nl/vannoord/alp/Alpino/.
ucto is a general multilingual tokenizer: http://ilk.uvt.nl/ucto
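
To illustrate the expected input format (one sentence per line, tokens space-separated), here is a minimal Python sketch. The function name `naive_prepare` and its regular expressions are hypothetical and not part of this package; they are a crude stand-in that shows the shape of the input, not a substitute for the tokenizers recommended above.

```python
import re


def naive_prepare(text):
    """Return text with one sentence per line and space-separated tokens.

    Illustration only: sentences are split after '.', '!' or '?' followed
    by whitespace, so abbreviations such as "Dr. Smith" are split wrongly.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lines = []
    for sentence in sentences:
        # A token is either a word (optionally with internal apostrophes)
        # or a single punctuation character.
        tokens = re.findall(r"\w+(?:['’]\w+)*|[^\w\s]", sentence)
        if tokens:
            lines.append(" ".join(tokens))
    return "\n".join(lines)


print(naive_prepare("Hello, world! How are you?"))
# Hello , world !
# How are you ?
```

Output in this form can be saved to a file or piped to readability on standard input.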

Example using ucto:

$ ucto -L en -n -s '' "CONRAD, Joseph - Lord Jim.txt" | readability
[...]
readability grades:
    Kincaid:                     4.95
    ARI:                         5.78
    Coleman-Liau:                6.87
    FleschReadingEase:          86.18
    GunningFogIndex:             9.4
    LIX:                        30.97
    SMOGIndex:                   9.2
    RIX:                         2.39
sentence info:
    characters_per_word:         4.19
    syll_per_word:               1.25
    words_per_sentence:         14.92
    sentences_per_paragraph:        12.6
    characters:             552074
    syllables:              164207
    words:                  131668
    sentences:                8823
    paragraphs:                700
    long_words:              21122
    complex_words:           11306
word usage:
    tobeverb:                 3909
    auxverb:                  1632
    conjunction:              4413
    pronoun:                 18104
    preposition:             19271
    nominalization:           1216
sentence beginnings:
    pronoun:                  2593
    interrogative:             215
    article:                   632
    subordination:             124
    conjunction:               240
    preposition:               404
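
Several of the grades above can be reproduced from the raw counts in the "sentence info" block. The following Python sketch uses the standard published formulas for Flesch Reading Ease, Flesch-Kincaid grade level, and the Automated Readability Index. The function names are illustrative, and the package's own code may differ in details such as how characters and syllables are counted.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level (reported as "Kincaid" above)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability_index(characters, words, sentences):
    """Automated Readability Index (reported as "ARI" above)."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Counts copied from the "sentence info" block of the Lord Jim example.
characters, syllables, words, sentences = 552074, 164207, 131668, 8823

fre = flesch_reading_ease(words, sentences, syllables)
kincaid = flesch_kincaid_grade(words, sentences, syllables)
ari = automated_readability_index(characters, words, sentences)

print(round(fre, 2))      # 86.18
print(round(kincaid, 2))  # 4.95
print(round(ari, 2))      # 5.78
```

Each formula is a linear combination of words per sentence and either syllables or characters per word, which is why the scores depend so heavily on correct sentence segmentation and tokenization.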

The --csv option collects readability measures for a number of texts in a single table. To tokenize documents on the fly when using this option, add the --tokenizer option. Example with the “tokenizer” tool:

$ readability --csv --tokenizer='tokenizer -L en-u8 -P -S -E "" -N' */*.txt >readabilitymeasures.csv

References

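The following readability metrics are included (names as reported in the example output above): Kincaid, ARI, Coleman-Liau, FleschReadingEase, GunningFogIndex, LIX, SMOGIndex, and RIX.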

For better readability measures, consider the following:

Collins-Thompson & Callan (2004). A language modeling approach to predicting reading difficulty. In Proc. of HLT/NAACL, pp. 193-200. http://aclweb.org/anthology/N04-1025.pdf
Schwarm & Ostendorf (2005). Reading level assessment using SVM and statistical language models. Proc. of ACL, pp. 523-530. http://www.aclweb.org/anthology/P05-1065.pdf
The Lexile framework for reading. http://www.lexile.com
Coh-Metrix. http://cohmetrix.memphis.edu/
Stylene: http://www.clips.ua.ac.be/category/projects/stylene
T-Scan: http://languagelink.let.uu.nl/tscan

Acknowledgments

The code is based on https://github.com/mmautner/readability, which in turn was based on https://github.com/nltk/nltk_contrib/tree/master/nltk_contrib/readability

Release history

- 0.3.1: Jan 13, 2019
- 0.3: Jul 21, 2018
- 0.2: Aug 11, 2015 (this version)
- 0.1: Apr 13, 2014

Download files

Source Distribution

readability-0.2.tar.gz (10.8 kB)

Uploaded Aug 11, 2015 Source

Hashes for readability-0.2.tar.gz
Algorithm	Hash digest
SHA256	`2246350df2b095c3b5859785ffd6140409a51f7f02b659fb1a1be60dcc88b8e7`
MD5	`cf87a608385ab8dd6e5b7d65119be128`
BLAKE2b-256	`108c869de6d3e4b7afdaa1d21ec83c10cf859c71267f3596a4c6f3c192f3c34e`