Scrapy download handler that can impersonate browser fingerprints

scrapy-impersonate

scrapy-impersonate is a Scrapy download handler. It integrates curl_cffi to perform HTTP requests, which lets it impersonate browsers' TLS signatures or JA3 fingerprints.

Installation

pip install git+https://github.com/jxlil/scrapy-impersonate

Activation

Replace the default http and/or https download handlers through the DOWNLOAD_HANDLERS setting:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}

Also, be sure to install the asyncio-based Twisted reactor:

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
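
Because the handler can replace http and/or https, a project that only needs impersonation for HTTPS traffic could register it for that scheme alone. The following is a minimal sketch of such a settings.py, assuming the default handler should stay in place for plain HTTP; only the two setting names and the handler path come from this README.

```python
# settings.py (sketch): route only HTTPS requests through scrapy-impersonate.
# Assumption: plain HTTP requests should keep using Scrapy's default handler,
# so the "http" key is deliberately left out.
DOWNLOAD_HANDLERS = {
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}

# The asyncio-based Twisted reactor is still set, as described above.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```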

Basic usage

Set the impersonate Request.meta key to download a request using curl_cffi:

import scrapy


class ImpersonateSpider(scrapy.Spider):
    name = "impersonate_spider"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_impersonate.ImpersonateDownloadHandler",
            "https": "scrapy_impersonate.ImpersonateDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        for browser in ["chrome110", "edge99", "safari15_5"]:
            yield scrapy.Request(
                "https://tls.browserleaks.com/json",
                dont_filter=True,
                meta={"impersonate": browser},
            )

    def parse(self, response):
        # ja3_hash: 773906b0efdefa24a7f2b8eb6985bf37
        # ja3_hash: cd08e31494f9531f560d64c695473da9
        # ja3_hash: 2fe1311860bc318fc7f9196556a2a6b9
        return {"ja3_hash": response.json()["ja3_hash"]}

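In this case, the spider sends one request per browser target: Chrome 110 (chrome110), Edge 99 (edge99) and Safari 15.5 (safari15_5). The comments in parse list the JA3 hashes reported for these requests. The browsers you can impersonate are the ones supported by curl_cffi; see its documentation for the full list.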
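
When a crawl covers many URLs, the impersonate meta key can be assigned per request. The sketch below shows one way to cycle through browser targets; impersonate_meta and assign_browsers are hypothetical helper names, not part of scrapy-impersonate, and the browser list is taken from the spider above.

```python
from itertools import cycle

# Browser targets reused from the example spider.
BROWSERS = ["chrome110", "edge99", "safari15_5"]


def impersonate_meta(browser, extra=None):
    """Return a Request.meta dict with the impersonate key set (hypothetical helper)."""
    meta = dict(extra or {})  # copy so the caller's dict is not mutated
    meta["impersonate"] = browser
    return meta


def assign_browsers(urls, browsers=BROWSERS):
    """Pair each URL with a meta dict, cycling through the browser targets in order."""
    rotation = cycle(browsers)
    return [(url, impersonate_meta(next(rotation))) for url in urls]


if __name__ == "__main__":
    for url, meta in assign_browsers(["https://tls.browserleaks.com/json"] * 4):
        print(url, meta)
```

Inside a spider, each pair could then be turned into a request with scrapy.Request(url, meta=meta, dont_filter=True).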

Thanks

This project is inspired by the following projects:

  • curl_cffi - Python binding for curl-impersonate via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints.
  • curl-impersonate - A special build of curl that can impersonate Chrome & Firefox
  • scrapy-playwright - Playwright integration for Scrapy
