HTML-friendly spaCy Tokenizer

It's not an HTML tokenizer, but a tokenizer that works with text that happens to be embedded in HTML.

Install

pip install spacy-html-tokenizer

How it works

Under the hood we use selectolax to parse the HTML. From there, common elements used for styling within traditional text elements (e.g. a <b> or <span> inside a <p>) are unwrapped, meaning the text they contain is merged into their parent element. You can change which tags are unwrapped with the unwrapped_tags argument to the constructor. Tags used for non-text content, such as <script> and <style>, are removed. The text is then extracted from each remaining terminal node that contains text, tokenized with the standard tokenizer defaults, and combined into a single Doc. In the resulting Doc, each element's text from the original document is also a sentence, so you can iterate through the elements' texts with doc.sents.
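
For example, you can customize which tags get unwrapped. Here's a minimal sketch -- the unwrapped_tags argument comes from the description above, but the particular tag list (and the assumption that it accepts a list of tag-name strings) is illustrative rather than the package's documented default:

import spacy
from spacy_html_tokenizer import create_html_tokenizer

nlp = spacy.blank("en")
# Assumption: unwrapped_tags takes a list of tag names whose text should be
# merged into the parent element; this list is illustrative, not the default.
nlp.tokenizer = create_html_tokenizer(unwrapped_tags=["b", "i", "em", "span", "a"])(nlp)

doc = nlp("<p>See the <a href='/docs'>docs</a> for <em>more</em>.</p>")
print([sent.text for sent in doc.sents])
# Expected (if <a> and <em> are unwrapped into the <p>):
# ['See the docs for more.']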

Example

import spacy
from spacy_html_tokenizer import create_html_tokenizer

nlp = spacy.blank("en")
nlp.tokenizer = create_html_tokenizer()(nlp)

html = """<h2>An Ordered HTML List</h2>
<ol>
    <li><b>Good</b> coffee. There's another sentence here</li>
    <li>Tea and honey</li>
    <li>Milk</li>
</ol>"""

doc = nlp(html)
for sent in doc.sents:
    print(sent.text, "-- N Tokens:", len(sent))

# An Ordered HTML List -- N Tokens: 4
# Good coffee. There's another sentence here -- N Tokens: 8
# Tea and honey -- N Tokens: 3
# Milk -- N Tokens: 1

In the prior example, we didn't have any other sentence boundary detection components, so each sentence corresponded exactly to one element's text. However, this also works with downstream sentence boundary detection components -- e.g. the dependency parser in a pretrained pipeline:

nlp = spacy.load("en_core_web_sm")  # has parser for sentence boundary detection
nlp.tokenizer = create_html_tokenizer()(nlp)

doc = nlp(html)
for sent in doc.sents:
    print(sent.text, "-- N Tokens:", len(sent))

# An Ordered HTML List -- N Tokens: 4
# Good coffee. -- N Tokens: 3
# There's another sentence here -- N Tokens: 5
# Tea and honey -- N Tokens: 3
# Milk -- N Tokens: 1

Comparison

We'll compare parsing Explosion's About page with and without the HTML tokenizer.

import requests
import spacy
from spacy_html_tokenizer import create_html_tokenizer
from selectolax.parser import HTMLParser

about_page_html = requests.get("https://explosion.ai/about").text

nlp_default = spacy.load("en_core_web_lg")
nlp_html = spacy.load("en_core_web_lg")
nlp_html.tokenizer = create_html_tokenizer()(nlp_html)

# text from HTML - used for non-HTML default tokenizer
about_page_text = HTMLParser(about_page_html).text()

doc_default = nlp_default(about_page_text)
doc_html = nlp_html(about_page_html)

View first sentences of each

With standard tokenizer on text extracted from HTML

list(sent.text for sent in doc_default.sents)[:5]
['AboutSoftware & DemosCustom SolutionsBlog & NewsAbout usExplosion is a software company specializing in developer tools for Artificial\nIntelligence and Natural Language Processing.',
 'We’re the makers of\nspaCy, one of the leading open-source libraries for advanced\nNLP and Prodigy, an annotation tool for radically efficient\nmachine teaching.',
 '\n\n',
 'Ines Montani CEO, FounderInes is a co-founder of Explosion and a core developer of the spaCy NLP library and the Prodigy annotation tool.',
 'She has helped set a new standard for user experience in developer tools for AI engineers and researchers.']

With HTML Tokenizer on HTML

list(sent.text for sent in doc_html.sents)[:10]
['About us · Explosion',
 'About',
 'Software',
 '&',
 'Demos',
 'Custom Solutions',
 'Blog & News',
 'About us',
 'Explosion is a software company specializing in developer tools for Artificial Intelligence and Natural Language Processing.',
 'We’re the makers of spaCy, one of the leading open-source libraries for advanced NLP and Prodigy, an annotation tool for radically efficient machine teaching.']

What about the last sentence?

list(sent.text for sent in doc_default.sents)[-1]

# We’re the makers of spaCy, one of the leading open-source libraries for advanced NLP.NavigationHomeAbout usSoftware & DemosCustom SolutionsBlog & NewsOur SoftwarespaCy · Industrial-strength NLPProdigy · Radically efficient annotationThinc · Functional deep learning© 2016-2022 Explosion · Legal & Imprint/*<![CDATA[*/window.pagePath="/about";/*]]>*//*<![CDATA[*/window.___chunkMapping={"app":["/app-ac229f07fa81f29e0f2d.js"],"component---node-modules-gatsby-plugin-offline-app-shell-js":["/component---node-modules-gatsby-plugin-offline-app-shell-js-461e7bc49c6ae8260783.js"],"component---src-components-post-js":["/component---src-components-post-js-cf4a6bf898db64083052.js"],"component---src-pages-404-js":["/component---src-pages-404-js-b7a6fa1d9d8ca6c40071.js"],"component---src-pages-blog-js":["/component---src-pages-blog-js-1e313ce0b28a893d3966.js"],"component---src-pages-index-js":["/component---src-pages-index-js-175434c68a53f68a253a.js"],"component---src-pages-spacy-tailored-pipelines-js":["/component---src-pages-spacy-tailored-pipelines-js-028d0c6c19584ef0935f.js"]};/*]]>*/

Yikes. How about the HTML tokenizer?

list(sent.text for sent in doc_html.sents)[-1]

# '© 2016-2022 Explosion · Legal & Imprint'
