clear-html

Clean and normalize HTML.

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: BSD License
Natural Language
- English
Operating System
- OS Independent
Project description

Clean and normalize HTML while preserving embedded content (e.g. Twitter, Instagram).

Quick start

Installation

Install the library with pip:

pip install clear-html

Usage

Example usage with lxml:

from lxml.html import fromstring
from clear_html import clean_node, cleaned_node_to_html

# Raw HTML fragment with inline styling and nested blocks.
html = """
        <div style="color:blue" id="main_content">
            Some text to be
            <div>cleaned up!</div>
        </div>
     """
node = fromstring(html)  # parse into an lxml element
cleaned_node = clean_node(node)  # clean and normalize the tree
cleaned_html = cleaned_node_to_html(cleaned_node)  # serialize back to HTML
print(cleaned_html)

Example usage with Parsel:

from parsel import Selector
from clear_html import clean_node, cleaned_node_to_html

selector = Selector(text="""<html>
                            <body>
                                <h1>Hello!</h1>
                                <div style="color:blue" id="main_content">
                                    Some text to be
                                    <div>cleaned up!</div>
                                </div>
                            </body>
                            </html>""")
# Select the element to clean and pass its underlying lxml node.
selector = selector.css("#main_content")
cleaned_node = clean_node(selector[0].root)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)

Both approaches print the following:

<article>

<p>Some text to be</p>

<p>cleaned up!</p>

</article>

Other useful functions:

- cleaned_node_to_text: converts the cleaned node to plain text
- formatted_text.clean_doc: low-level function that gives finer control over the cleaning process

Algorithm	Hash digest
SHA256	`711957bb03b0729caa257679e15881f9e0eeea27236b5c18eac1e75b8af06b06`
MD5	`f9bcf9d2d62dc0724fab546af717b67d`
BLAKE2b-256	`7a28d08437394b1b28e46fd804a99b3ba2e6dc3a1103ac14b097f04ea442bb26`

Algorithm	Hash digest
SHA256	`a270ed4d78bda7f8d9e308c7c4fa5ebe2bdcf39280730a448064ad677a0a76cf`
MD5	`55f9c42f64099028b74c08742ed731da`
BLAKE2b-256	`d11c349aa7cf8ac99c27a9afd1b27f4c1e5a9a913ae0b6f3fdc988e60b56116c`
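
The SHA256 digests above can be compared against a downloaded distribution file. The sketch below uses only the Python standard library; the helper names sha256_of and matches_digest are hypothetical and are not part of clear-html or pip.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA256 with an expected hex digest, ignoring case."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

To use it, call matches_digest(Path(...), "<SHA256 from the table>") with the path of the file you downloaded. This page does not say which table belongs to which file, so compare against both SHA256 values.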

clear-html 0.4.1
