readability-lxml 0.8.1
pip install readability-lxml
fast html to text parser (article readability tool) with python 3 support
Unverified details
These details have not been verified by PyPI.
Meta
- License: Apache License 2.0
- Author: Yuri Baburov
Classifiers
- Environment
- Intended Audience
- Operating System
- Programming Language
- Topic
Project description
python-readability
Given an HTML document, it pulls out the main body text and cleans it up.
This is a Python port of a Ruby port of arc90’s readability project.
Installation
Installation is easy with pip; just run:
$ pip install readability-lxml
Usage
>>> import requests
>>> from readability import Document
>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'
>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n domain in examples without prior coordination or asking for permission.</p>
\n <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""
Change Log
0.8.1 Fixed processing of non-ascii HTMLs via regexps.
0.8 Replaced XHTML output with HTML5 output in summary() call.
0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
0.4 Added Videos loading and allowed more images per paragraph
0.3 Added Document.encoding, positive_keywords and negative_keywords
Licensing
This code is licensed under the Apache License 2.0.
Thanks to
Latest readability.js
Ruby port by starrhorne and iterationlabs
Python port by gfxmonk
Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/> to move to lxml
“BR to P” fix from readability.js which improves quality for smaller texts
Contributions from GitHub users.
Download files
File details
Details for the file readability-lxml-0.8.1.tar.gz.
File metadata
- Download URL: readability-lxml-0.8.1.tar.gz
- Upload date:
- Size: 15.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/46.0.0 requests-toolbelt/0.8.0 tqdm/4.47.0 CPython/3.7.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | e51fea56b5909aaf886d307d48e79e096293255afa567b7d08bca94d25b1a4e1
MD5 | dd153878f06608bd487f36a29d21cc5a
BLAKE2b-256 | b9626de3a9a8524c1a1ee0f2aee0dfbad13a36ebbca0db402abcf4e790496512
File details
Details for the file readability_lxml-0.8.1-py3-none-any.whl.
File metadata
- Download URL: readability_lxml-0.8.1-py3-none-any.whl
- Upload date:
- Size: 20.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/46.0.0 requests-toolbelt/0.8.0 tqdm/4.47.0 CPython/3.7.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | e0d366a21b1bd6cca17de71a4e6ea16fcfaa8b0a5b4004e39e2c7eff884e6305
MD5 | 6a0dc326b843d99346d2afc44d2b4faa
BLAKE2b-256 | 39a6cfe22aaa19ac69b97d127043a76a5bbcb0ef24f3a0b22793c46608190caa
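The SHA256 digests listed above can be used to check a downloaded file. The following is a minimal sketch using only the standard library; the helper names sha256_of_file and matches_published_digest are hypothetical and not part of readability-lxml, and the script assumes the archive has already been downloaded to a local path.

```python
import hashlib
import sys

# SHA256 digest published above for readability-lxml-0.8.1.tar.gz.
SDIST_SHA256 = "e51fea56b5909aaf886d307d48e79e096293255afa567b7d08bca94d25b1a4e1"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python verify_download.py readability-lxml-0.8.1.tar.gz
    ok = matches_published_digest(sys.argv[1], SDIST_SHA256)
    print("OK" if ok else "MISMATCH")
```

To check the wheel instead, pass its path and compare against the SHA256 value listed for readability_lxml-0.8.1-py3-none-any.whl.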