
Overview

Scrapy is a great framework for web crawling. This package provides a highly customizable way to handle exceptions raised in the downloader middleware because of a proxy, and sends a signal to notify interested components so they can deal with the invalidated proxy (e.g. move it to a blacklist, or renew the proxy pool).

This package supports two types of signals:

  • traditional signals (synchronous)

  • deferred signals (asynchronous)
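The distinction can be sketched as follows. This is a conceptual illustration only, not this package's actual API; the handler names and arguments are made up, and `asyncio` stands in for the Twisted Deferreds that Scrapy actually uses:

```python
# Conceptual sketch (not this package's API): a sync handler versus an
# async handler for an invalidated proxy. asyncio is used here as a
# stand-in for Twisted Deferreds.
import asyncio


def on_proxy_invalid(proxy, blacklist):
    """Traditional (sync) handler: runs inline, e.g. blacklist the proxy."""
    blacklist.add(proxy)


async def on_proxy_invalid_deferred(proxy, pool):
    """Deferred (async) handler: e.g. renew the pool without blocking."""
    await asyncio.sleep(0)  # placeholder for asynchronous I/O
    pool.discard(proxy)
```

A sync handler blocks the caller until it returns; a deferred handler lets the reactor keep processing other requests while the cleanup runs.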

For details on how signals work, please refer to the Scrapy and Twisted documentation.

Requirements

  • Scrapy

  • Tested on Python 3.5

  • Tested on Linux, but since this is a pure-Python module it should work on any other platform with official Python and Twisted support

Installation

The quick way:

pip install -U scrapy-proxy-validation

Or place the middleware's source code alongside your Scrapy project.

Documentation

Enable this middleware in DOWNLOADER_MIDDLEWARES in settings.py, for example:

from scrapy_proxy_validation.downloadermiddlewares.proxy_validation import Validation

DOWNLOADER_MIDDLEWARES.update({
    'scrapy_proxy_validation.downloadermiddlewares.proxy_validation.ProxyValidation': 751
})

SIGNALS = [Validation(exception='twisted.internet.error.ConnectionRefusedError',
                      signal='scrapy.signals.spider_closed'),
           Validation(exception='twisted.internet.error.ConnectionLost',
                      signal='scrapy.signals.spider_closed',
                      signal_deferred='scrapy.signals.spider_closed',
                      limit=5)]

RECYCLE_REQUEST = 'scrapy_proxy_validation.utils.recycle_request.recycle_request'

Settings Reference

SIGNALS

A list of Validation instances, each specifying the exception to handle, the sync signal to send, the async (deferred) signal to send, and the limit on how many times it may fire.

RECYCLE_REQUEST

A function to recycle a request that has trouble with its proxy; it takes a request as its input argument and returns a request as well.

Note: remember to set ``dont_filter`` to ``True``, or the duplicate filter will drop this request.
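A custom RECYCLE_REQUEST function might look like the following minimal sketch. It mirrors the behaviour described above (drop the failing proxy, bypass the duplicate filter); the exact attribute handling is an assumption, not this package's code:

```python
# Minimal sketch of a custom RECYCLE_REQUEST function (an assumption,
# not the package's implementation).
def recycle_request(request):
    # Forget the invalidated proxy so a fresh one can be assigned.
    request.meta.pop('proxy', None)
    # Bypass the duplicate filter so the request can be scheduled again.
    request.dont_filter = True
    return request
```

Point RECYCLE_REQUEST in settings.py at the dotted path of your function to use it.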

Built-in Functions

scrapy_proxy_validation.utils.recycle_request.recycle_request

This is a built-in function to recycle a request that has a problem with its proxy.

It removes the proxy key from the request's meta and sets dont_filter to True.

To use this function, in settings.py:

RECYCLE_REQUEST = 'scrapy_proxy_validation.utils.recycle_request.recycle_request'

Note

There could be many different proxy-related problems, so it will take some time to collect them all and add them to SIGNALS. Please be patient: this middleware is not a one-time, catch-all solution.

TODO

No ideas at the moment; please let me know if you have any!
