
Fast C-based HTML 5 parsing for Python



A fast implementation of the HTML 5 parsing spec. Parsing is done in C using a variant of the gumbo parser. The gumbo parse tree is then transformed into an lxml tree, also in C, yielding parse times that can be as low as a thirtieth of those of html5lib, a speedup of roughly 30x.

Installation

Unix

On a Unix-y system, with a working compiler, simply run:

pip install --no-binary lxml html5-parser

It is important that lxml is installed with the --no-binary flag. Without it, lxml uses a static copy of libxml2. For html5-parser to work, it must use the same libxml2 implementation as lxml, which is only possible if libxml2 is loaded dynamically.

You can set up html5-parser to run from a source checkout as follows:

git clone https://github.com/kovidgoyal/html5-parser && cd html5-parser
pip install --no-binary lxml 'lxml>=3.8.0' --user
python setup.py develop --user

Windows

On Windows, installation is a little more involved. There is a 200-line script that is used to install html5-parser and all its dependencies on the Windows continuous integration server. Using that script, installation can be done by running the following commands in a Visual Studio 2015 Command Prompt:

python.exe win-ci.py install_deps
python.exe win-ci.py test

This will install all dependencies and html5-parser in the sw sub-directory. You will then need to add sw\bin to PATH and sw\python\Lib\site-packages to PYTHONPATH. Alternatively, copy the files into your system Python's directories.
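As a rough sketch of the environment setup described above, the following Python snippet builds the two cmd.exe `set` commands for a given base directory. The function name `env_commands` and the example directory `C:\src\html5-parser` are hypothetical, and it assumes the sw sub-directory sits directly under the directory in which win-ci.py was run.

```python
from pathlib import PureWindowsPath


def env_commands(base_dir):
    """Return cmd.exe commands that expose the sw tree under base_dir.

    Assumption: win-ci.py was run in base_dir, so the sw sub-directory
    lives directly beneath it.
    """
    sw = PureWindowsPath(base_dir) / "sw"
    bin_dir = sw / "bin"
    site_packages = sw / "python" / "Lib" / "site-packages"
    return [
        # Prepend sw\bin so the freshly built binaries are found first.
        "set PATH={};%PATH%".format(bin_dir),
        # Note: this overwrites any PYTHONPATH that is already set.
        "set PYTHONPATH={}".format(site_packages),
    ]


if __name__ == "__main__":
    # Hypothetical checkout location, used only for illustration.
    for command in env_commands(r"C:\src\html5-parser"):
        print(command)
```

The printed commands only affect the command prompt in which they are run, so they would need to be repeated in each new session or made permanent through the system settings.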

Benchmarking

There is a benchmark script named benchmark.py that compares the times taken to parse a large (~5.7 MB) HTML document with html5lib and with html5-parser. On my system, the results show a speedup of 28x. The output from the script on my system is:

Testing with HTML file of 5,956,815 bytes
Parsing repeatedly with html5-parser
html5-parser took an average of : 0.491 seconds to parse it
Parsing repeatedly with html5lib
html5lib took an average of : 13.744 seconds to parse it
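The 28x figure follows directly from the two averages in the output above. The snippet below is a minimal sketch of that kind of measurement, not the actual benchmark.py: the helper names are hypothetical, and the standard library's `html.parser` is used purely as a stand-in so the example runs without any third-party packages.

```python
import time
from html.parser import HTMLParser


def average_parse_time(parse, data, repeats=3):
    """Return the mean wall-clock seconds taken by one call to parse(data)."""
    start = time.perf_counter()
    for _ in range(repeats):
        parse(data)
    return (time.perf_counter() - start) / repeats


def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the fast parser is than the slow one."""
    return slow_seconds / fast_seconds


def stdlib_parse(data):
    # Stand-in parser (an assumption for illustration); the real benchmark
    # compares html5lib against html5-parser instead.
    parser = HTMLParser()
    parser.feed(data)
    parser.close()


if __name__ == "__main__":
    html = "<html><body>" + "<p>hello</p>" * 1000 + "</body></html>"
    print("stand-in average: %.4f seconds" % average_parse_time(stdlib_parse, html))
    # Averages reported above: html5lib 13.744 s, html5-parser 0.491 s.
    print("reported speedup: %.1fx" % speedup(13.744, 0.491))
```

To time a real parser, pass its parse function to `average_parse_time` in place of `stdlib_parse`.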

There is further potential for speedup. Currently the gumbo subsystem uses its own cache for tag and attribute names, and the libxml2 subsystem uses a separate cache of its own. Unifying the two to use the libxml2 cache should yield significant gains in performance and memory consumption.


Source Distribution

html5-parser-0.2.1.tar.gz (238.4 kB)
