scrapy-beautifulsoup

Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup

Development Status
- 4 - Beta
Framework
- Scrapy
Intended Audience
- Developers
License
- OSI Approved :: BSD License
Programming Language
- Python
Topic
- Internet :: WWW/HTTP

Project description

Configuration

Add the middleware to the DOWNLOADER_MIDDLEWARES dictionary setting:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400
}
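To make the setting above more concrete, here is a minimal sketch of how a downloader middleware of this kind could work. It is not the package's actual source: the class name SoupRepairMiddleware is hypothetical, and the sketch assumes the middleware reads BEAUTIFULSOUP_PARSER from the crawler settings and rewrites the body in Scrapy's process_response hook.

```python
from bs4 import BeautifulSoup


class SoupRepairMiddleware:
    """Hypothetical sketch; not the actual scrapy-beautifulsoup source."""

    def __init__(self, parser="html.parser"):
        self.parser = parser

    @classmethod
    def from_crawler(cls, crawler):
        # Assumption: the parser name comes from the BEAUTIFULSOUP_PARSER
        # setting and falls back to html.parser when it is not set.
        return cls(crawler.settings.get("BEAUTIFULSOUP_PARSER", "html.parser"))

    def process_response(self, request, response, spider):
        # Responses without a usable .text attribute (for example images)
        # are passed through unchanged.
        if not hasattr(response, "text"):
            return response
        soup = BeautifulSoup(response.text, self.parser)
        # Serialize the repaired tree and swap it in as the new body.
        # Characters the target encoding cannot represent become
        # numeric character references instead of raising an error.
        body = str(soup).encode(response.encoding, errors="xmlcharrefreplace")
        return response.replace(body=body)
```

Because the rewrite happens in process_response, spider callbacks would receive the already repaired markup.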

By default, BeautifulSoup uses Python's built-in html.parser. To change the parser, set the BEAUTIFULSOUP_PARSER setting:

BEAUTIFULSOUP_PARSER = "html5lib"  # or BEAUTIFULSOUP_PARSER = "lxml"

html5lib is an extremely lenient parser; if the target HTML is seriously broken, consider making it your first choice. Note that html5lib has to be installed in this case:

pip install html5lib
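If the same settings file has to work on machines where html5lib or lxml may be missing, the parser can be chosen at import time. The helper below is a hypothetical sketch (pick_parser is not part of this package); it only checks whether the parser's module can be found and otherwise falls back to html.parser.

```python
from importlib.util import find_spec


def pick_parser(preferred=("html5lib", "lxml")):
    """Return the first parser in `preferred` whose module is installed.

    Falls back to "html.parser", which ships with Python.
    """
    for name in preferred:
        if find_spec(name) is not None:
            return name
    return "html.parser"


# In settings.py: use the most lenient parser that is available.
BEAUTIFULSOUP_PARSER = pick_parser()
```

The order of `preferred` encodes the preference for leniency; reorder it if parsing speed matters more than tolerance for broken markup.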

Motivation

With the help of an underlying parser of your choice, BeautifulSoup does a good job of handling non-well-formed or broken HTML. In some cases, it makes sense to pipe the HTML through BeautifulSoup to “fix” it.
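As a small illustration of what such a fix looks like, the snippet below parses a fragment with two unclosed tags using the default html.parser and serializes it again. The input string is made up for the example.

```python
from bs4 import BeautifulSoup

# A fragment with unclosed <p> and <b> tags.
broken = "<p>Hello <b>world"

soup = BeautifulSoup(broken, "html.parser")
repaired = str(soup)

print(repaired)  # <p>Hello <b>world</b></p>
```

Other parsers may repair the same input differently; html5lib, for instance, wraps fragments in a full html/head/body skeleton.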

Release history

- 0.0.2 (Sep 26, 2016)
- 0.0.1 (Sep 26, 2016), this version

Download files

Source Distribution

scrapy-beautifulsoup-0.0.1.tar.gz (2.1 kB)

Uploaded Sep 26, 2016 (Source)

Built Distribution

scrapy_beautifulsoup-0.0.1-py2.py3-none-any.whl (4.2 kB)

Uploaded Sep 26, 2016 (Python 2, Python 3)

Hashes for scrapy-beautifulsoup-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`05d5b1ca40bf84f3de72001d1f5e20a6f6ca618695c4e518eb53bc9618e7a42c`
MD5	`0fd4e6331e706d07c088971f502e33bc`
BLAKE2b-256	`4ad9a803ed0e57d589ecaa6fdcca592f335f095b5f34cfce45ea6543d6e1f00d`

Hashes for scrapy_beautifulsoup-0.0.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`5c03e3d3c216a13d2222ff9eb31aa8436c076a5d23ccaf76ca6991b6c74d6d32`
MD5	`0445640d4cbcc9454aa559fd96cf88d5`
BLAKE2b-256	`1a3e091dffa3e05197b8b61b0ade86949bee8eccc53399ac4a79bf8542c9c163`