readability-lxml 0.6.2

Fast HTML-to-text parser (article readability tool) with Python 3 support

This code is under the Apache License 2.0. http://www.apache.org/licenses/LICENSE-2.0

This is a Python port of a Ruby port of arc90’s Readability project:

http://lab.arc90.com/experiments/readability/

In short: given an HTML document, it pulls out the main body text and cleans it up. It can also clean up the title, based on the latest readability.js code.

Based on:

Installation:

easy_install readability-lxml
or
pip install readability-lxml

Usage:

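from readability.readability import Document
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2
url = "http://pypi.python.org/pypi/readability-lxml"  # any page URL
html = urlopen(url).read()
doc = Document(html)  # build once, reuse for both calls
readable_article = doc.summary()
readable_title = doc.short_title()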

Command-line usage:

python -m readability.readability -u http://pypi.python.org/pypi/readability-lxml

To open resulting page in browser:

python -m readability.readability -b -u http://pypi.python.org/pypi/readability-lxml

Using positive/negative keywords example:

python -m readability.readability -p intro -n newsindex,homepage-box,news-section -u http://python.org

Document() kwarg options:

  • attributes:
  • debug: output debug messages
  • min_text_length:
  • retry_length:
  • url: will allow adjusting links to be absolute
  • positive_keywords: the list of positive search patterns in classes and ids, for example: ["news-item", "block"]
  • negative_keywords: the list of negative search patterns in classes and ids, for example: ["mysidebar", "related", "ads"]
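
To illustrate what "search patterns in classes and ids" can mean in practice, here is a minimal, standalone sketch. It is not the library's implementation: the helper names compile_keywords and keyword_weight, the case-insensitive matching, and the step value of 25 are all assumptions made for this example. It only shows the general idea that an element whose class or id matches a positive keyword is favoured, and one that matches a negative keyword is penalised.

```python
import re


def compile_keywords(keywords):
    """Hypothetical helper: turn a keyword list into one regex (or None)."""
    if not keywords:
        return None
    return re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)


def keyword_weight(attrs, positive=None, negative=None, step=25):
    """Hypothetical scoring: +step per positive hit, -step per negative hit.

    attrs is a plain dict of an element's attributes; only "class" and
    "id" are inspected. The step size is an assumption for illustration.
    """
    pos = compile_keywords(positive)
    neg = compile_keywords(negative)
    weight = 0
    for feature in (attrs.get("class"), attrs.get("id")):
        if not feature:
            continue
        if pos and pos.search(feature):
            weight += step
        if neg and neg.search(feature):
            weight -= step
    return weight


# Keyword lists taken from the examples in the option descriptions.
positive = ["news-item", "block"]
negative = ["mysidebar", "related", "ads"]

print(keyword_weight({"class": "news-item lead"}, positive, negative))  # 25
print(keyword_weight({"id": "mysidebar"}, positive, negative))          # -25
print(keyword_weight({"class": "footer"}, positive, negative))          # 0
```

With the real library, the keyword lists are simply passed to Document() as the positive_keywords and negative_keywords options listed above.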

Updates

  • 0.3 Added Document.encoding, positive_keywords and negative_keywords
  • 0.4 Added Videos loading and allowed more images per paragraph
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 and 3.4
 
File                                 Type    Py Version  Uploaded on  Size
readability-lxml-0.6.2.tar.gz (md5)  Source              2016-04-11   13KB