
Overview

Scrapy is a great framework for web crawling. This package provides a highly customizable way to handle exceptions raised in the downloader middleware because of a proxy, and sends a signal to notify interested components so they can deal with the invalidated proxy (e.g. move it to a blacklist, or renew the proxy pool).

This package supports two types of signals:

  • traditional signals (synchronous)

  • deferred signals (asynchronous)
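The distinction can be sketched as follows. This is a conceptual illustration only, not this package's actual API; the handler names and arguments are made up, and `asyncio` stands in for the Twisted Deferreds that Scrapy actually uses:

```python
# Conceptual sketch (not this package's API): a sync handler versus an
# async handler for an invalidated proxy. asyncio is used here as a
# stand-in for Twisted Deferreds.
import asyncio


def on_proxy_invalid(proxy, blacklist):
    """Traditional (sync) handler: runs inline, e.g. blacklist the proxy."""
    blacklist.add(proxy)


async def on_proxy_invalid_deferred(proxy, pool):
    """Deferred (async) handler: e.g. renew the pool without blocking."""
    await asyncio.sleep(0)  # placeholder for asynchronous I/O
    pool.discard(proxy)
```

A sync handler blocks the caller until it returns; a deferred handler lets the reactor keep processing other requests while the cleanup runs.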

For details on how signals work, please refer to the Scrapy and Twisted documentation.

Requirements

  • Scrapy

  • Tested on Python 3.5

  • Tested on Linux, but since this is a pure-Python module it should work on any other platform with official Python and Twisted support

Installation

The quick way:

pip install -U scrapy-proxy-validation

Or place the middleware's source code alongside your Scrapy project.

Documentation

Enable this middleware in DOWNLOADER_MIDDLEWARES in settings.py, for example:

from scrapy_proxy_validation.downloadermiddlewares.proxy_validation import Validation

DOWNLOADER_MIDDLEWARES.update({
    'scrapy_proxy_validation.downloadermiddlewares.proxy_validation.ProxyValidation': 751
})

SIGNALS = [Validation(exception='twisted.internet.error.ConnectionRefusedError',
                      signal='scrapy.signals.spider_closed'),
           Validation(exception='twisted.internet.error.ConnectionLost',
                      signal='scrapy.signals.spider_closed',
                      signal_deferred='scrapy.signals.spider_closed',
                      limit=5)]

RECYCLE_REQUEST = 'scrapy_proxy_validation.utils.recycle_request.recycle_request'

Settings Reference

SIGNALS

A list of Validation instances, each specifying the exception to handle, the sync signal to send, the async (deferred) signal to send, and the limit on how many times it may fire.

RECYCLE_REQUEST

A function to recycle a request that has trouble with its proxy; it takes a request as its input argument and returns a request as well.

Note: remember to set ``dont_filter`` to ``True``, or the duplicate filter will drop this request.
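A custom RECYCLE_REQUEST function might look like the following minimal sketch. It mirrors the behaviour described above (drop the failing proxy, bypass the duplicate filter); the exact attribute handling is an assumption, not this package's code:

```python
# Minimal sketch of a custom RECYCLE_REQUEST function (an assumption,
# not the package's implementation).
def recycle_request(request):
    # Forget the invalidated proxy so a fresh one can be assigned.
    request.meta.pop('proxy', None)
    # Bypass the duplicate filter so the request can be scheduled again.
    request.dont_filter = True
    return request
```

Point RECYCLE_REQUEST in settings.py at the dotted path of your function to use it.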

Built-in Functions

scrapy_proxy_validation.utils.recycle_request.recycle_request

This is a built-in function to recycle a request that has a problem with its proxy.

It removes the proxy key from the request's meta and sets dont_filter to True.

To use this function, in settings.py:

RECYCLE_REQUEST = 'scrapy_proxy_validation.utils.recycle_request.recycle_request'

Note

There could be many different proxy-related problems, so it will take some time to collect them all and add them to SIGNALS. Please be patient: this middleware is not a one-time, catch-all solution.

TODO

No ideas at the moment; please let me know if you have any!
