html5-parser

Fast C-based HTML 5 parsing for Python

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

A fast implementation of the HTML 5 parsing spec. Parsing is done in C using a variant of the gumbo parser. The gumbo parse tree is then transformed into an lxml tree, also in C, yielding parse times that can be a thirtieth of html5lib's, a speedup of up to 30x.
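As a minimal usage sketch, assuming the package exposes a `parse` function in the `html5_parser` module that returns the root of an lxml tree (this is how the upstream project documents it; check the version you have installed), parsing a small document might look like the following. The `parse_and_serialize` helper name is hypothetical, and the import guard is there only so the snippet degrades gracefully when the package is not installed.

```python
def parse_and_serialize(html):
    """Parse HTML with html5-parser and return the serialized tree as bytes.

    Returns None when html5-parser or lxml cannot be imported.
    """
    try:
        from html5_parser import parse  # assumed entry point of the package
        from lxml.etree import tostring
    except ImportError:
        return None
    root = parse(html)  # root element of the resulting lxml tree
    return tostring(root)


if __name__ == "__main__":
    # The unclosed tags are deliberate: an HTML 5 parser repairs them.
    print(parse_and_serialize("<p>Hello, <b>world"))
```

Because the result is an ordinary lxml element, the usual lxml APIs such as `xpath()` or `iter()` should work on it.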

Installation

Unix

On a Unix-y system, with a working compiler, simply run:

pip install --no-binary lxml html5-parser

It is important that lxml is installed with the --no-binary flag. Without it, lxml uses a statically linked copy of libxml2. For html5-parser to work, it must use the same libxml2 implementation as lxml, which is only possible if libxml2 is loaded dynamically.
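A rough diagnostic for the libxml2 requirement can be sketched as below. It only checks whether the system loader can locate a shared libxml2 and, if lxml is installed, which libxml2 versions lxml reports. It does not prove how lxml was linked. The helper names are hypothetical; `ctypes.util.find_library` and the `LIBXML_VERSION` / `LIBXML_COMPILED_VERSION` attributes of `lxml.etree` are existing APIs.

```python
import ctypes.util


def find_shared_libxml2():
    """Return the name or path of a shared libxml2 the loader can find, or None."""
    return ctypes.util.find_library("xml2")


def lxml_libxml2_versions():
    """Return (compiled_against, running_with) libxml2 versions reported by lxml.

    Returns None when lxml is not installed.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    return etree.LIBXML_COMPILED_VERSION, etree.LIBXML_VERSION


if __name__ == "__main__":
    print("shared libxml2:", find_shared_libxml2())
    print("lxml libxml2 (compiled, runtime):", lxml_libxml2_versions())
```

If no shared libxml2 is found, installing the libxml2 development package for your distribution before running pip is the likely next step.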

You can set up html5-parser to run from a source checkout as follows:

git clone https://github.com/kovidgoyal/html5-parser && cd html5-parser
pip install --no-binary lxml 'lxml>=3.8.0' --user
python setup.py develop --user

Windows

On Windows, installation is a little more involved. A 200-line script is used to install html5-parser and all its dependencies on the Windows continuous integration server. Using that script, installation can be done by running the following commands in a Visual Studio 2015 Command Prompt:

python.exe win-ci.py install_deps
python.exe win-ci.py test

This will install all dependencies and html5-parser in the sw sub-directory. You will need to add sw\bin to PATH and sw\python\Lib\site-packages to PYTHONPATH, or copy the files into your system Python's directories.
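If you would rather not change PATH and PYTHONPATH globally, one option is to build the environment for a single child process. The sketch below assumes the sw directory layout described above; `env_with_sw` is a hypothetical helper, not part of html5-parser or win-ci.py.

```python
import os


def env_with_sw(sw_dir, base_env=None):
    """Return a copy of the environment with the sw directories prepended."""
    env = dict(os.environ if base_env is None else base_env)
    bin_dir = os.path.join(sw_dir, "bin")
    site_packages = os.path.join(sw_dir, "python", "Lib", "site-packages")
    for key, entry in (("PATH", bin_dir), ("PYTHONPATH", site_packages)):
        existing = env.get(key, "")
        # Prepend so the freshly built files win over system-wide copies.
        env[key] = entry + os.pathsep + existing if existing else entry
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching the Python interpreter that should see the sw installation.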

Benchmarking

The benchmark.py script compares the time html5lib and html5-parser take to parse a large (~5.7 MB) HTML document. On my system it shows a speedup of 28x, with the following output:

Testing with HTML file of 5,956,815 bytes
Parsing repeatedly with html5-parser
html5-parser took an average of : 0.491 seconds to parse it
Parsing repeatedly with html5lib
html5lib took an average of : 13.744 seconds to parse it
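The 28x figure follows directly from the two averages in the output. The sketch below shows that arithmetic, together with one way to average repeated parse timings using the standard `timeit` module. It is not the code of benchmark.py; `average_parse_time` and `speedup` are hypothetical names used for illustration.

```python
import timeit


def average_parse_time(parse_func, data, repeats=3):
    """Average wall-clock seconds for one call of parse_func(data)."""
    total = timeit.timeit(lambda: parse_func(data), number=repeats)
    return total / repeats


def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast parser is than the slow one."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    # Averages taken from the benchmark output: html5lib, then html5-parser.
    print(f"speedup: {speedup(13.744, 0.491):.1f}x")  # prints "speedup: 28.0x"
```

Any callable that accepts the document can be passed as `parse_func`, so the same helper can time both parsers on identical input.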

There is further potential for speedup. Currently the gumbo subsystem and the libxml2 subsystem each use their own cache for tag and attribute names. Unifying the two to use the libxml2 cache should yield significant performance and memory consumption gains.

Project details

Release history

0.4.12

Nov 19, 2023

0.4.11

Apr 12, 2023

0.4.10

Sep 22, 2021

0.4.9

Nov 3, 2019

0.4.8

Jul 25, 2019

0.4.7

Jun 4, 2019

0.4.6

May 13, 2019

0.4.5

Apr 22, 2018

0.4.4

Aug 1, 2017

0.4.3

Jul 28, 2017

0.4.2

Jul 25, 2017

0.4.1

Jul 12, 2017

0.4.0

Jul 6, 2017

0.3.3

Jun 13, 2017

0.3.2

Jun 9, 2017

0.3.1

Jun 9, 2017

0.3.0

Jun 7, 2017

This version

0.2.1

Jun 4, 2017

0.2.0

Jun 4, 2017

0.1.1

Jun 4, 2017

0.1.0

Jun 3, 2017

Download files

Source Distribution

html5-parser-0.2.1.tar.gz (238.4 kB)

Uploaded Jun 4, 2017 Source

Hashes for html5-parser-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`5f5a315391e3489f32aed6cec8acc2f5d361751dcfe502e4a700f1979154b859`
MD5	`768a2fd4b9f421cf2bcd5a729d9d1554`
BLAKE2b-256	`b3c7c5c5e2de4000647295e6a79270ae93d599b4be021ed7a60f53fa1a1b5b54`