Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup

scrapy-beautifulsoup

Installation

The package is on PyPI and can be installed with pip:

pip install scrapy-beautifulsoup

Configuration

Add the middleware to the DOWNLOADER_MIDDLEWARES dictionary setting:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400
}

By default, BeautifulSoup is run with Python's built-in html.parser. To use a different parser, set the BEAUTIFULSOUP_PARSER setting:

BEAUTIFULSOUP_PARSER = "html5lib"  # or BEAUTIFULSOUP_PARSER = "lxml"

html5lib is an extremely lenient parser; if the target HTML is seriously broken, consider making it your first choice. Note: html5lib must be installed separately in this case:

pip install html5lib
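
To make the two settings above more concrete, here is a minimal sketch of how a downloader middleware of this kind could read BEAUTIFULSOUP_PARSER and re-serialize each response. It is an illustration under stated assumptions, not the package's actual source: the class name BeautifulSoupRepairMiddleware is hypothetical, and the sketch assumes text (HTML) responses whose replace() accepts a str body.

```python
from bs4 import BeautifulSoup


class BeautifulSoupRepairMiddleware:
    """Hypothetical sketch of a parser-configurable downloader middleware."""

    def __init__(self, parser="html.parser"):
        self.parser = parser

    @classmethod
    def from_crawler(cls, crawler):
        # Read the parser name from the project settings, falling back to
        # the built-in parser when BEAUTIFULSOUP_PARSER is not set.
        return cls(crawler.settings.get("BEAUTIFULSOUP_PARSER", "html.parser"))

    def process_response(self, request, response, spider):
        # Parse the downloaded body, then hand on a copy of the response
        # whose body is the re-serialized (repaired) markup.
        # Assumption: the response is a text response that accepts a str body.
        soup = BeautifulSoup(response.body, self.parser)
        return response.replace(body=str(soup))
```

Scrapy calls from_crawler when it builds the middleware, so the parser is chosen once per crawl rather than once per response.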

Motivation

BeautifulSoup, with the help of an underlying parser of your choice, does a pretty good job of handling non-well-formed or broken HTML. In some cases, it makes sense to pipe the HTML through BeautifulSoup to “fix” it.
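
As a small illustration of what "fixing" means here, the sketch below parses a fragment with unclosed tags and serializes it again, outside of Scrapy. The helper name repair_html and the sample markup are hypothetical and chosen only for this example; the exact output depends on the parser used.

```python
from bs4 import BeautifulSoup


def repair_html(html, parser="html.parser"):
    """Parse possibly broken HTML and return the re-serialized markup."""
    return str(BeautifulSoup(html, parser))


# The <p> and <b> tags are never closed in the input.
broken = "<div><p>Unclosed paragraph<b>bold text</div>"
repaired = repair_html(broken)
print(repaired)  # the missing closing tags are added on output
```

Passing "html5lib" or "lxml" as the parser argument works the same way, provided the corresponding package is installed.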
