scrapy-rotating-proxies

Rotating proxies for Scrapy

This package provides a Scrapy middleware to use rotating proxies, check that they are alive and adjust crawling speed.

License is MIT.

Installation

pip install scrapy-rotating-proxies

Usage

Add ROTATING_PROXY_LIST option with a list of proxies to settings.py:

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]

As an alternative, you can specify a ROTATING_PROXY_LIST_PATH option with a path to a file with proxies, one per line:

ROTATING_PROXY_LIST_PATH = '/my/path/proxies.txt'

ROTATING_PROXY_LIST_PATH takes precedence over ROTATING_PROXY_LIST if both options are present.
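To sanity-check a proxy file before pointing ROTATING_PROXY_LIST_PATH at it, a small script along these lines may help. Note that read_proxy_file is a hypothetical helper, not part of scrapy-rotating-proxies, and that skipping blank lines is an assumption about the file format rather than documented behaviour.

```python
import tempfile
from pathlib import Path


def read_proxy_file(path):
    """Return the non-empty, stripped lines of a one-proxy-per-line file.

    Hypothetical helper for checking a file by hand; the library loads
    the file itself when ROTATING_PROXY_LIST_PATH is set.
    """
    text = Path(path).read_text(encoding="utf8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Example: write a small file (with a stray blank line) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    proxy_file = Path(tmp) / "proxies.txt"
    proxy_file.write_text("proxy1.com:8000\n\nproxy2.com:8031\n", encoding="utf8")
    proxies = read_proxy_file(proxy_file)
```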

Then add rotating_proxies middlewares to your DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    # ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    # ...
}

After this all requests will be proxied using one of the proxies from the ROTATING_PROXY_LIST / ROTATING_PROXY_LIST_PATH.

Requests with “proxy” set in their meta are not handled by scrapy-rotating-proxies. To disable proxying for a request set request.meta['proxy'] = None; to set proxy explicitly use request.meta['proxy'] = "<my-proxy-address>".
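As a rough illustration, the two opt-out cases can be expressed as plain meta dictionaries. The helper names direct_meta and pinned_proxy_meta are made up for this sketch, and the proxy address is a placeholder.

```python
def direct_meta():
    """Meta for a request that should not be proxied at all."""
    return {"proxy": None}


def pinned_proxy_meta(address):
    """Meta for a request that must go through one specific proxy."""
    return {"proxy": address}


# Inside a spider callback these would be passed to a request, e.g.:
#   yield scrapy.Request(url, meta=direct_meta())
#   yield scrapy.Request(url, meta=pinned_proxy_meta("http://proxy1.com:8000"))
unproxied = direct_meta()
pinned = pinned_proxy_meta("http://proxy1.com:8000")
```

In both cases the "proxy" key is present in meta, so the rotating middleware leaves the request alone.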

Concurrency

By default, all default Scrapy concurrency options (DOWNLOAD_DELAY, AUTOTHROTTLE_..., CONCURRENT_REQUESTS_PER_DOMAIN, etc.) become per-proxy for proxied requests when RotatingProxyMiddleware is enabled. For example, if you set CONCURRENT_REQUESTS_PER_DOMAIN=2 then the spider will make at most 2 concurrent connections to each proxy, regardless of the request URL domain.
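A back-of-the-envelope way to reason about the resulting load is sketched below. It assumes that Scrapy's global CONCURRENT_REQUESTS setting still caps the total; max_proxy_connections is a hypothetical helper used only for the arithmetic.

```python
def max_proxy_connections(num_proxies, per_proxy_limit, global_limit):
    """Upper bound on simultaneous proxied connections.

    per_proxy_limit plays the role of CONCURRENT_REQUESTS_PER_DOMAIN
    (now applied per proxy); global_limit plays the role of
    CONCURRENT_REQUESTS.
    """
    return min(num_proxies * per_proxy_limit, global_limit)


# 10 proxies, 2 connections each: the global limit decides the outcome.
bound_small_global = max_proxy_connections(10, 2, global_limit=16)
bound_large_global = max_proxy_connections(10, 2, global_limit=100)
```

With a global limit of 16 the bound is 16; raising it to 100 lets the per-proxy limit dominate, giving 20.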

Customization

scrapy-rotating-proxies keeps track of working and non-working proxies, and re-checks the non-working ones from time to time.

Detection of a non-working proxy is site-specific. By default, scrapy-rotating-proxies uses a simple heuristic: if the response status code is not 200, the response body is empty, or an exception was raised, then the proxy is considered dead.
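The heuristic as described here can be written down in a few lines. This is a simplified sketch of the rule as stated above, not the library's actual code; the shipped rules have extra exceptions (see CHANGES), and looks_like_ban is a hypothetical name.

```python
def looks_like_ban(status=None, body=b"", exception=None):
    """Simplified version of the default rule described above."""
    if exception is not None:
        return True   # any download exception counts against the proxy
    if status != 200:
        return True   # non-200 status code
    if not body:
        return True   # 200 but empty body
    return False


ok = looks_like_ban(status=200, body=b"<html>hello</html>")
forbidden = looks_like_ban(status=403, body=b"denied")
empty = looks_like_ban(status=200, body=b"")
failed = looks_like_ban(exception=TimeoutError("timed out"))
```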

You can override the ban detection method by passing a path to a custom BanDetectionPolicy in the ROTATING_PROXY_BAN_POLICY option, e.g.:

# settings.py
ROTATING_PROXY_BAN_POLICY = 'myproject.policy.MyPolicy'

The policy must be a class with response_is_ban and exception_is_ban methods. These methods can return True (ban detected), False (not a ban) or None (unknown). It can be convenient to subclass and modify default BanDetectionPolicy:

# myproject/policy.py
from rotating_proxies.policy import BanDetectionPolicy

class MyPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # use default rules, but also consider HTTP 200 responses
        # a ban if there is 'captcha' word in response body.
        ban = super(MyPolicy, self).response_is_ban(request, response)
        ban = ban or b'captcha' in response.body
        return ban

    def exception_is_ban(self, request, exception):
        # override method completely: don't take exceptions in account
        return None
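Since a policy only needs the two methods, it can also be written without subclassing. The sketch below is one possible shape. The class name, the chosen status codes, and the assumption that the policy can be built without constructor arguments are all illustrative, and the fake response objects stand in for Scrapy responses.

```python
from types import SimpleNamespace


class StatusCodeBanPolicy:
    """Hypothetical stand-alone policy based on status codes only."""

    BAN_STATUSES = {403, 429, 503}  # illustrative choice, tune per site

    def response_is_ban(self, request, response):
        if response.status in self.BAN_STATUSES:
            return True    # ban detected
        if response.status == 200 and response.body:
            return False   # looks like a normal page
        return None        # unknown

    def exception_is_ban(self, request, exception):
        return None        # exceptions tell us nothing in this policy


# Quick check with stand-in objects instead of real Scrapy responses.
policy = StatusCodeBanPolicy()
throttled = policy.response_is_ban(None, SimpleNamespace(status=429, body=b""))
normal = policy.response_is_ban(None, SimpleNamespace(status=200, body=b"ok"))
unclear = policy.response_is_ban(None, SimpleNamespace(status=404, body=b"nope"))
```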

Instead of creating a policy you can also implement response_is_ban and exception_is_ban methods as spider methods, for example:

import scrapy


class MySpider(scrapy.Spider):
    # ...

    def response_is_ban(self, request, response):
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        return None

It is important to get these rules right because a failed request and a bad proxy call for different actions: if the proxy is to blame, it makes sense to retry the request with a different proxy.

Non-working proxies can become alive again after some time. scrapy-rotating-proxies uses a randomized exponential backoff for these checks: the first check happens soon, and each time a check fails the next one is delayed further. Use ROTATING_PROXY_BACKOFF_BASE to adjust the initial delay (by default it is random, from 0 to 5 minutes). The randomized exponential backoff is capped by ROTATING_PROXY_BACKOFF_CAP.
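One common way to implement a randomized exponential backoff is the "full jitter" scheme sketched below. Whether the library uses exactly this formula is an assumption; backoff_delay is a hypothetical helper, with base and cap standing in for ROTATING_PROXY_BACKOFF_BASE and ROTATING_PROXY_BACKOFF_CAP.

```python
import random


def backoff_delay(attempt, base=300, cap=3600, rng=random):
    """Random delay in seconds before the next re-check of a dead proxy.

    attempt is the number of failed checks so far (0 for the first one).
    The upper bound doubles with every attempt until it reaches cap.
    """
    upper = min(cap, base * 2 ** attempt)
    return rng.uniform(0, upper)


# Upper bounds for the first few attempts: 300, 600, 1200, 2400, 3600, 3600
rng = random.Random(0)  # seeded only to make the example repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(6)]
```

With the default base of 300 seconds the first delay falls between 0 and 5 minutes, which matches the behaviour described above.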

Settings

ROTATING_PROXY_LIST - a list of proxies to choose from;
ROTATING_PROXY_LIST_PATH - path to a file with a list of proxies;
ROTATING_PROXY_LOGSTATS_INTERVAL - stats logging interval in seconds, 30 by default;
ROTATING_PROXY_CLOSE_SPIDER - when True, the spider is stopped if there are no alive proxies. If False (default), all dead proxies are re-checked when no alive proxies remain;
ROTATING_PROXY_PAGE_RETRY_TIMES - the number of times to retry downloading a page using a different proxy. After this many retries the failure is considered a page failure, not a proxy failure. Think of it this way: every improperly detected ban costs you ROTATING_PROXY_PAGE_RETRY_TIMES alive proxies. Default: 5. This option can be changed per request using the max_proxies_to_try request.meta key; for example, you can use a higher value for certain pages if you're sure they should work;
ROTATING_PROXY_BACKOFF_BASE - base backoff time, in seconds. Default is 300 (i.e. 5 min).
ROTATING_PROXY_BACKOFF_CAP - backoff time cap, in seconds. Default is 3600 (i.e. 60 min).
ROTATING_PROXY_BAN_POLICY - path to a ban detection policy. Default is 'rotating_proxies.policy.BanDetectionPolicy'.
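For reference, a settings.py fragment that spells out the documented defaults might look like the following. The values are taken from the list above, the proxy addresses are placeholders, and none of the default-valued lines are required.

```python
# settings.py (fragment)

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',   # placeholder addresses
    'proxy2.com:8031',
]

# Everything below repeats the documented defaults.
ROTATING_PROXY_LOGSTATS_INTERVAL = 30    # seconds between stats log lines
ROTATING_PROXY_CLOSE_SPIDER = False      # re-check dead proxies instead of stopping
ROTATING_PROXY_PAGE_RETRY_TIMES = 5      # proxies to try per page
ROTATING_PROXY_BACKOFF_BASE = 300        # seconds (5 min)
ROTATING_PROXY_BACKOFF_CAP = 3600        # seconds (60 min)
ROTATING_PROXY_BAN_POLICY = 'rotating_proxies.policy.BanDetectionPolicy'
```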

FAQ

Q: Where to get proxy lists? How to write and maintain ban rules?

A: It is up to you to find proxies and maintain proper ban rules for web sites; scrapy-rotating-proxies doesn’t have anything built-in. There are commercial proxy services like https://crawlera.com/ which can integrate with Scrapy (see https://github.com/scrapy-plugins/scrapy-crawlera) and take care of all these details.

Contributing

source code: https://github.com/TeamHG-Memex/scrapy-rotating-proxies
bug tracker: https://github.com/TeamHG-Memex/scrapy-rotating-proxies/issues

To run tests, install tox and run tox from the source checkout.

CHANGES

0.6.2 (2019-05-25)

mean_backoff_time stats are always returned as float, to make saving stats in databases easier.

0.6.1 (2019-04-03)

Fixed incorrect “proxies/good” stats values.

0.6 (2018-12-28)

Proxy information is added to scrapy stats:

proxies/unchecked
proxies/reanimated
proxies/dead
proxies/good
proxies/mean_backoff

0.5 (2017-10-09)

ROTATING_PROXY_LIST_PATH option allows passing a file name with a proxy list.

0.4 (2017-06-06)

ROTATING_PROXY_BACKOFF_CAP option allows changing the max backoff time from the default 1 hour.

0.3.2 (2017-06-05)

fixed proxy authentication issue.

0.3.1 (2017-03-20)

fixed OverflowError during backoff computation.

0.3 (2017-03-14)

redirects with empty bodies are no longer considered bans (thanks Diga Widyaprana).
ROTATING_PROXY_BAN_POLICY option allows customizing ban detection for all spiders.

0.2.3 (2017-03-03)

max_proxies_to_try request.meta key allows overriding the ROTATING_PROXY_PAGE_RETRY_TIMES option per request.

0.2.2 (2017-03-01)

Update default ban detection rules: scrapy.exceptions.IgnoreRequest is not a ban.

0.2.1 (2017-02-08)

changed ROTATING_PROXY_PAGE_RETRY_TIMES default value - it is now 5.

0.2 (2017-02-07)

improved default ban detection rules;
log ban stats.

0.1 (2017-02-01)

Initial release
