Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup
scrapy-beautifulsoup
Configuration
Add the middleware to the DOWNLOADER_MIDDLEWARES dictionary setting:
DOWNLOADER_MIDDLEWARES = { 'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400 }
By default, BeautifulSoup uses the built-in html.parser. To change the parser, set the BEAUTIFULSOUP_PARSER setting:
BEAUTIFULSOUP_PARSER = "html5lib" # or BEAUTIFULSOUP_PARSER = "lxml"
html5lib is an extremely lenient parser; if the target HTML is seriously broken, consider making it your first choice. Note that html5lib has to be installed in this case:
pip install html5lib
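Since html5lib and lxml are optional third-party packages, one option is to choose the parser when the settings module loads, based on what is actually installed. The following is a minimal sketch for a project's settings.py; the helper name `pick_parser` and the preference order are assumptions made for illustration and are not part of scrapy-beautifulsoup.

```python
# settings.py (sketch)
from importlib.util import find_spec


def pick_parser(preferred=("html5lib", "lxml")):
    """Return the first importable parser name, else the built-in html.parser.

    Hypothetical helper: it only checks whether a module with that name can
    be found; it does not import it or verify that it works.
    """
    for name in preferred:
        if find_spec(name) is not None:
            return name
    return "html.parser"


DOWNLOADER_MIDDLEWARES = {
    "scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware": 400,
}

# Falls back to "html.parser" when neither optional parser is installed.
BEAUTIFULSOUP_PARSER = pick_parser()
```

Hard-coding the parser name, as shown earlier, is simpler and more predictable; a lookup like this mainly helps when the same settings file runs in environments with different packages installed.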
Motivation
BeautifulSoup, with the help of an underlying parser of choice, does a pretty good job of handling non-well-formed or broken HTML. In some cases, it makes sense to pipe the HTML through BeautifulSoup to “fix” it.
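To make the idea of piping HTML through BeautifulSoup concrete, here is a rough sketch of how a Scrapy downloader middleware could do it. It is an illustration under stated assumptions, not the source of scrapy-beautifulsoup: the class name `RepairHtmlMiddleware` is hypothetical, and the choice to skip non-text responses is an assumption of this sketch. It requires the beautifulsoup4 package.

```python
class RepairHtmlMiddleware:
    """Hypothetical middleware that re-serializes response bodies via BeautifulSoup."""

    def __init__(self, parser="html.parser"):
        self.parser = parser

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when it builds the middleware; read the
        # parser name from the settings, defaulting to html.parser.
        return cls(crawler.settings.get("BEAUTIFULSOUP_PARSER", "html.parser"))

    def process_response(self, request, response, spider):
        # Assumption of this sketch: responses without decoded text
        # (images, archives, ...) are passed through untouched.
        if not hasattr(response, "text"):
            return response

        # bs4 is third-party (pip install beautifulsoup4); importing it here
        # keeps this module loadable where bs4 is not installed.
        from bs4 import BeautifulSoup

        # Parsing and re-serializing closes unclosed tags and normalizes
        # the markup according to the chosen parser.
        soup = BeautifulSoup(response.text, self.parser)
        return response.replace(body=str(soup).encode(response.encoding))
```

A class like this would be registered in DOWNLOADER_MIDDLEWARES in the same way as above, under its own import path (for example, a hypothetical `myproject.middlewares.RepairHtmlMiddleware`).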
Hashes for scrapy-beautifulsoup-0.0.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 05d5b1ca40bf84f3de72001d1f5e20a6f6ca618695c4e518eb53bc9618e7a42c
MD5 | 0fd4e6331e706d07c088971f502e33bc
BLAKE2b-256 | 4ad9a803ed0e57d589ecaa6fdcca592f335f095b5f34cfce45ea6543d6e1f00d
Hashes for scrapy_beautifulsoup-0.0.1-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 5c03e3d3c216a13d2222ff9eb31aa8436c076a5d23ccaf76ca6991b6c74d6d32
MD5 | 0445640d4cbcc9454aa559fd96cf88d5
BLAKE2b-256 | 1a3e091dffa3e05197b8b61b0ade86949bee8eccc53399ac4a79bf8542c9c163