html-sanitizer

HTML sanitizer

Project description

This is an allowlist-based and very opinionated HTML sanitizer that can be used for both untrusted and trusted sources. It attempts to clean up the mess made by various rich text editors and/or copy-pasting to make styling of webpages simpler and more consistent. It builds on the excellent HTML cleaner in lxml to make the result both valid and safe.

HTML sanitizer goes further than e.g. bleach in that it not only ensures that content is safe and tags and attributes conform to a given allowlist, but also applies additional transforms to HTML fragments.

Goals

Clean up HTML fragments using a very restricted set of allowed tags and attributes.
Convert some tags (such as <span style="...">, <b> and <i>) into either <strong> or <em> (but never both).
Absolutely disallow all inline styles.
Normalize whitespace by removing repeated line breaks, empty paragraphs and other empty elements.
Merge adjacent tags of the same type (such as several <strong> or <h3> directly after each other).
Automatically remove redundant list markers inside <li> tags.
Clean up some ugliness such as paragraphs inside paragraphs or list elements etc.
Normalize Unicode.

Usage

>>> from html_sanitizer import Sanitizer
>>> sanitizer = Sanitizer()  # default configuration
>>> sanitizer.sanitize('<span style="font-weight:bold">some text</span>')
'<strong>some text</strong>'

Settings

Bold spans and b tags are converted into strong tags, italic spans and i tags into em tags (if strong and em are allowed at all).
Inline styles and scripts will always be dropped.
A div element is used to wrap the HTML fragment for the parser, therefore div tags are not allowed.

The default settings are:

DEFAULT_SETTINGS = {
    "tags": {
        "a", "h1", "h2", "h3", "strong", "em", "p", "ul", "ol",
        "li", "br", "sub", "sup", "hr",
    },
    "attributes": {"a": ("href", "name", "target", "title", "id", "rel")},
    "empty": {"hr", "a", "br"},
    "separate": {"a", "p", "li"},
    "whitespace": {"br"},
    "keep_typographic_whitespace": False,
    "add_nofollow": False,
    "autolink": False,
    "sanitize_href": sanitize_href,
    "element_preprocessors": [
        # convert span elements into em/strong if a matching style rule
        # has been found. strong has precedence, strong & em at the same
        # time is not supported
        bold_span_to_strong,
        italic_span_to_em,
        tag_replacer("b", "strong"),
        tag_replacer("i", "em"),
        tag_replacer("form", "p"),
        target_blank_noopener,
    ],
    "element_postprocessors": [],
    "is_mergeable": lambda e1, e2: True,
}

The keys’ meaning is as follows:

tags: A set() of allowed tags.
attributes: A dict() mapping tags to their allowed attributes.
empty: Tags which are allowed to be empty. By default, empty tags (containing no text or only whitespace) are dropped.
separate: Tags which are not merged if they appear as siblings. By default, tags of the same type are merged.
whitespace: Tags which are treated as whitespace and removed from the beginning or end of other tags’ content.
keep_typographic_whitespace: Keep typographically used space characters like non-breaking space etc.
add_nofollow: Whether to add rel="nofollow" to all links.
autolink: Enable lxml’s autolinker. May be either a boolean or a dictionary; a dictionary is passed as keyword arguments to autolink.
sanitize_href: A callable that receives an anchor’s href value and returns a sanitized version. The default implementation checks whether the link starts with one of a few allowed prefixes and, if not, returns a single hash (#).
element_preprocessors and element_postprocessors: Additional filters that are called on all elements in the tree. The tree is processed in reverse depth-first order. Under certain circumstances elements are processed more than once (search the code for backlog.append). Preprocessors are run before whitespace normalization, postprocessors afterwards.
is_mergeable: Adjacent elements which aren’t kept separate are merged by default. This callable can be used to prevent merging of adjacent elements, e.g. when their classes do not match (lambda e1, e2: e1.get('class') == e2.get('class')).
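
The callable-valued settings above can be sketched in plain Python. This is a minimal illustration, not the library's own code: the names strict_sanitize_href, same_class_is_mergeable and replace_tag are hypothetical, the prefix allowlist is an assumption (the library's default prefixes may differ), and the filter signature (one element in, the element returned) is assumed rather than taken from the text. The check at the end uses the standard library's ElementTree only because its elements offer the same .tag and .get() interface as lxml elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical allowlist; the library's own default prefixes may differ.
ALLOWED_HREF_PREFIXES = ("https://", "mailto:")


def strict_sanitize_href(href):
    """Return href unchanged if it starts with an allowed prefix, else '#'."""
    href = (href or "").strip()
    return href if href.startswith(ALLOWED_HREF_PREFIXES) else "#"


def same_class_is_mergeable(e1, e2):
    """Allow merging only when both elements carry the same class attribute."""
    return e1.get("class") == e2.get("class")


def replace_tag(from_tag, to_tag):
    """Build an element filter that renames from_tag elements to to_tag.

    Assumption: a filter receives a single element and returns it.
    """
    def processor(element):
        if element.tag == from_tag:
            element.tag = to_tag
        return element
    return processor


# Exercise the callables on standard-library elements.
first = ET.fromstring('<p class="lead">a</p>')
second = ET.fromstring('<p class="note">b</p>')
renamed = replace_tag("u", "em")(ET.fromstring("<u>text</u>"))
```

Callables like these would be supplied under the sanitize_href, is_mergeable and element_preprocessors keys of the settings passed to Sanitizer.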

Settings can be specified partially when initializing a sanitizer instance, but are still checked for consistency. For example, it is not allowed to have tags in empty that are not in tags, that is, tags that are allowed to be empty but at the same time not allowed at all. The Sanitizer constructor raises TypeError exceptions when it detects inconsistencies.
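
The consistency rule named above (nothing in empty that is missing from tags) can be sketched as follows. The helper names check_settings and is_consistent are hypothetical and this is not the library's validation code; it only illustrates that one rule and mirrors the TypeError behaviour described in the text.

```python
def check_settings(settings):
    """Raise TypeError when 'empty' names tags that 'tags' does not allow."""
    tags = set(settings.get("tags", ()))
    empty = set(settings.get("empty", ()))
    not_allowed = empty - tags
    if not_allowed:
        raise TypeError(
            "Tags in 'empty' must also be in 'tags': "
            + ", ".join(sorted(not_allowed))
        )
    return settings


def is_consistent(settings):
    """Return True if check_settings accepts the settings, else False."""
    try:
        check_settings(settings)
    except TypeError:
        return False
    return True
```

For instance, allowing hr to be empty while not allowing hr at all is rejected, whereas listing br in both sets passes.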

An example of an even more restricted configuration might be:

>>> from html_sanitizer import Sanitizer
>>> sanitizer = Sanitizer({
...     'tags': ('h1', 'h2', 'p'),
...     'attributes': {},
...     'empty': set(),
...     'separate': set(),
... })

The rationale for such a restricted set of allowed tags (e.g. no images) is documented in the design decisions section of django-content-editor’s documentation.

Django

HTML sanitizer does not depend on Django, but ships with a module which makes configuring sanitizers using Django settings easier. Usage is as follows:

>>> from html_sanitizer.django import get_sanitizer
>>> sanitizer = get_sanitizer()  # or get_sanitizer(name=...) to select a named configuration

Different sanitizers can be configured. The default configuration is aptly named 'default'. Example settings follow:

HTML_SANITIZERS = {
    'default': {
      'tags': ...,
    },
    ...
}

The 'default' configuration is special: If it isn’t explicitly defined, the default configuration above is used instead. Non-existing configurations will lead to ImproperlyConfigured exceptions.

The get_sanitizer function caches sanitizer instances, so feel free to call it as often as you want to.
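
The lookup rules described above (fall back to built-in defaults for 'default', fail for unknown names, build each configuration only once) can be modelled with a small standard-library sketch. Everything here is a stand-in and an assumption: HTML_SANITIZERS and BUILTIN_DEFAULTS are placeholder dictionaries, ConfigurationError substitutes for Django's ImproperlyConfigured, and get_named_config is a hypothetical function, not the module's get_sanitizer.

```python
from functools import lru_cache

# Placeholder for the Django setting; a real project defines this in settings.
HTML_SANITIZERS = {"comments": {"tags": ("p", "strong", "em")}}
# Placeholder for the built-in default configuration.
BUILTIN_DEFAULTS = {"tags": ("a", "p", "strong", "em")}


class ConfigurationError(Exception):
    """Stand-in for Django's ImproperlyConfigured."""


@lru_cache(maxsize=None)
def get_named_config(name="default"):
    """Resolve a configuration by name; each name is resolved only once."""
    if name in HTML_SANITIZERS:
        return dict(HTML_SANITIZERS[name])
    if name == "default":
        # 'default' is special: fall back to the built-in configuration.
        return dict(BUILTIN_DEFAULTS)
    raise ConfigurationError("Unknown sanitizer configuration: %r" % name)
```

Repeated calls with the same name return the very same object, which is why calling such a function often is cheap.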

Security issues

Please report security issues to me directly at mk@feinheit.ch.

Release history

Version	Release date
2.4.1	Apr 1, 2024 (this version)
2.4.0	Apr 1, 2024
2.3.1	Mar 10, 2024
2.3.0	Feb 7, 2024
2.2.0	Jul 3, 2023
2.1.0	Jun 29, 2023
2.0.0	Jun 28, 2023
1.9.3	Jan 15, 2022
1.9.2	Dec 13, 2021
1.9.1	May 28, 2020
1.9.0	Jan 20, 2020
1.8.0	Nov 21, 2019
1.7.3	Aug 8, 2019
1.7.2	Apr 26, 2019
1.7.1	Apr 19, 2019
1.7.0	Feb 19, 2019
1.6.4	Feb 6, 2019
1.6.3	Nov 4, 2018
1.6.2	Aug 21, 2018
1.6.1	Jul 31, 2018
1.6.0	Jun 29, 2018
1.5.0	Jun 1, 2018
1.4.0	Mar 29, 2018
1.3.0	Sep 22, 2017
1.2.1	Jun 8, 2017
1.2.0	May 25, 2017
1.1.4	May 12, 2017
1.1.3	May 10, 2017
1.1.2	May 3, 2017
1.1.1	May 2, 2017
1.1.0	May 2, 2017
1.0.0	May 2, 2017

Download files

Source Distribution

html_sanitizer-2.4.1.tar.gz (17.0 kB)

Uploaded Apr 1, 2024 Source

Built Distribution

html_sanitizer-2.4.1-py3-none-any.whl (15.1 kB)

Uploaded Apr 1, 2024 Python 3

Hashes for html_sanitizer-2.4.1.tar.gz

Algorithm	Hash digest
SHA256	`752ea75b3f5d93b431038810376df1fbef6ce0854c18b23aa9e03b048f1af435`
MD5	`7911223433afff2de602149b2861ea9b`
BLAKE2b-256	`be218697267e7e3c4558e24361b49313473b09367362965c4f0abccbd495432d`

Hashes for html_sanitizer-2.4.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`5fb1dfa951cb9b4a13f8a1c69059908ac231f9e9e99b84874d16948dcac9ff04`
MD5	`f366e67f99a8e4c3bbe9ce7a3942461d`
BLAKE2b-256	`9d4046e23db89102d57ee0217cad96b85366eb72bcf9f5ae5475e3142bd44af7`