readability-lxml

Fast HTML to text parser (article readability tool) with Python 3 support

python-readability

Given an HTML document, it extracts the main body text and cleans it up.

This is a Python port of a Ruby port of arc90’s readability project.

Installation

Install it with pip:

$ pip install readability-lxml

Usage

>>> import requests
>>> from readability import Document

>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'

>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""
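
Note that summary() returns cleaned HTML, not plain text. If plain text is needed, one option is to post-process that HTML with the standard library. The following is a minimal sketch and not part of readability-lxml: the names TextExtractor and html_to_text are hypothetical, the set of block-level tags is an assumption, and the input is a shortened copy of the summary() output shown above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, starting a new line at block-level tags."""

    # Assumption: these tags are enough to separate lines in typical articles.
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "br"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        # Collapse whitespace within each line and drop empty lines.
        raw_lines = "".join(self.chunks).splitlines()
        lines = (" ".join(line.split()) for line in raw_lines)
        return "\n".join(line for line in lines if line)


def html_to_text(summary_html):
    """Return the visible text of an HTML string such as doc.summary()."""
    parser = TextExtractor()
    parser.feed(summary_html)
    parser.close()
    return parser.text()


# Shortened version of the doc.summary() output shown above.
summary_html = (
    '<html><body><div><body id="readabilityBody">\n<div>\n'
    "    <h1>Example Domain</h1>\n"
    '    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n'
    "</div>\n</body>\n</div></body></html>"
)
print(html_to_text(summary_html))
```

In real use, html_to_text(doc.summary()) would take the place of the hard-coded string. An lxml-based approach would work as well; the standard-library parser is used here only to keep the sketch free of extra dependencies.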

Change Log

0.8.1 Fixed processing of non-ASCII HTML via regexps.
0.8 Replaced XHTML output with HTML5 output in summary() call.
0.7.1 Added support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
0.7 Improved handling of HTML5 tags. Fixed stripping of unwanted HTML nodes (previously only the first matching node was removed).
0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
0.4 Added loading of videos and allowed more images per paragraph
0.3 Added Document.encoding, positive_keywords and negative_keywords

Licensing

This code is licensed under the Apache License 2.0.

Thanks to

Latest readability.js
Ruby port by starrhorne and iterationlabs
Python port by gfxmonk
Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/> to move to lxml
“BR to P” fix from readability.js which improves quality for smaller texts
Contributions from GitHub users.
