scrapinghub-autoextract

Python interface to Scrapinghub Automatic Extraction API

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: BSD License
Natural Language
- English
Operating System
- OS Independent

Project description

Python client libraries for the Scrapinghub AutoExtract API. They let you extract product, article, job posting, and other information from any website, for whatever page types the API supports.

This package provides a command-line utility, an asyncio-based library, and a simple synchronous wrapper.

License is BSD 3-clause.

Installation

pip install scrapinghub-autoextract

scrapinghub-autoextract requires Python 3.6+ for the CLI tool and the asyncio API; the basic synchronous API works with Python 3.5.

Usage

First, make sure you have an API key. To avoid passing it in the api_key argument with every call, you can set the SCRAPINGHUB_AUTOEXTRACT_KEY environment variable to the key.

Command-line interface

The most basic way to use the client is from the command line. First, create a file with URLs, one URL per line (e.g. urls.txt). Second, set the SCRAPINGHUB_AUTOEXTRACT_KEY environment variable to your AutoExtract API key (you can also pass the API key with the --api-key script argument).

Then run the script to get the results:

python -m autoextract urls.txt --page-type article --output res.jl

If you need more flexibility, you can customize the requests by creating a JSON Lines file with queries: one JSON object per line. You can pass any AutoExtract options there. For example, store the following in a queries.jl file:

{"url": "http://example.com", "meta": "id0", "articleBodyRaw": false}
{"url": "http://example.com/foo", "meta": "id1", "articleBodyRaw": false}
{"url": "http://example.com/bar", "meta": "id2", "articleBodyRaw": false}

See the API docs for a description of all supported parameters in these query dicts. The API docs mention batch requests and their limit (no more than 100 queries at a time); this limit doesn’t apply to the queries.jl file (it may have millions of rows), because the command-line script does its own batching.
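
A file like queries.jl can also be generated from a list of URLs. The following is a minimal sketch that uses only the standard library; build_query_lines is a hypothetical helper name, not part of scrapinghub-autoextract, and the "id0", "id1", ... values simply mirror the meta strings in the example above.

```python
import json


def build_query_lines(urls, **options):
    """Return one JSON-encoded query per URL, suitable for a .jl file.

    `options` are extra query parameters copied into every query
    (for example articleBodyRaw=False).
    """
    lines = []
    for index, url in enumerate(urls):
        # "meta" is a user-chosen string; here it is used as a simple identifier.
        query = {"url": url, "meta": f"id{index}", **options}
        lines.append(json.dumps(query))
    return lines


if __name__ == "__main__":
    urls = [
        "http://example.com",
        "http://example.com/foo",
        "http://example.com/bar",
    ]
    # Print the lines; redirect the output to queries.jl to save them.
    print("\n".join(build_query_lines(urls, articleBodyRaw=False)))
```

Because each query is a separate line, the file can be produced incrementally for very large URL lists.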

Note that the pageType argument is omitted in the example; pageType values are filled in automatically from the --page-type command-line argument. You can also set a different pageType for a row in the queries.jl file; it takes priority over the --page-type value passed on the command line.

To get results for this queries.jl file, run:

python -m autoextract --intype jl queries.jl --page-type article --output res.jl

Processing speed

Each API key has a limit on requests per second (RPS). To get your URLs processed faster, you can tune the concurrency options: the batch size and the number of connections.

The best options depend on the RPS limit and on the websites you’re extracting data from. For example, if your API key has a limit of 3 RPS and the average response time you observe for your websites is 10 s, then to reach 3 RPS you could set batch size = 2 and number of connections = 15; this allows 30 requests to be processed in parallel.
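
The arithmetic behind this example can be written down as a rough estimate. The sketch below assumes that every query occupies a slot for the average response time, which is a simplification; the function names are hypothetical and are not part of the package.

```python
import math


def parallel_requests(n_conn, batch_size):
    """Number of queries in flight when every connection carries a full batch."""
    return n_conn * batch_size


def estimated_rps(n_conn, batch_size, avg_response_time_s):
    """Rough throughput estimate: queries in flight divided by the time each takes."""
    return parallel_requests(n_conn, batch_size) / avg_response_time_s


def connections_for_target(target_rps, batch_size, avg_response_time_s):
    """Connections needed to reach target_rps under the same simplification."""
    needed_in_flight = target_rps * avg_response_time_s
    return math.ceil(needed_in_flight / batch_size)


if __name__ == "__main__":
    print(parallel_requests(15, 2))          # queries processed in parallel
    print(estimated_rps(15, 2, 10))          # estimated requests per second
    print(connections_for_target(3, 2, 10))  # connections for a 3 RPS limit
```

Treat the result as a starting point and adjust it based on the throttling rate you actually observe.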

To set these options in the CLI, use the --n-conn and --batch-size arguments:

python -m autoextract urls.txt --page-type article --n-conn 15 --batch-size 2 --output res.jl

If too many requests are processed in parallel, you’ll get throttling errors. The CLI handles them automatically, but they make extraction less efficient; tune the concurrency options so that throttling errors (HTTP 429) are rare.

You may also be limited by the website speed. AutoExtract tries not to hit any individual website too hard, but it can be better to limit this on the client side as well. If you’re extracting data from a single website, it can make sense to decrease the number of parallel requests; this can give a higher overall success ratio.

If you’re extracting data from multiple websites, it makes sense to spread the load across time: if you have websites A, B and C, don’t send requests in AAAABBBBCCCC order, send them in ABCABCABCABC order instead.

To do so, you can change the order of the queries in your input file. Alternatively, you can pass the --shuffle option; it randomly shuffles the input queries before sending them to the API:

python -m autoextract urls.txt --shuffle --page-type article --output res.jl

Run python -m autoextract --help to get a description of all supported options.
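
If you prefer a deterministic ABCABC order over random shuffling, the input can be reordered before it is written to the file. The following is a minimal sketch using only the standard library; interleave_by_domain is a hypothetical helper, not something the package provides.

```python
from collections import defaultdict
from itertools import chain, zip_longest
from urllib.parse import urlsplit


def interleave_by_domain(urls):
    """Reorder URLs round-robin across domains: AAABBC -> ABCABA."""
    groups = defaultdict(list)
    for url in urls:
        # Group by host name, keeping the original order within each group.
        groups[urlsplit(url).netloc].append(url)

    missing = object()  # marks exhausted groups in uneven rounds
    rounds = zip_longest(*groups.values(), fillvalue=missing)
    return [url for url in chain.from_iterable(rounds) if url is not missing]


if __name__ == "__main__":
    urls = [
        "http://a.example/1", "http://a.example/2",
        "http://b.example/1", "http://b.example/2",
        "http://c.example/1",
    ]
    for url in interleave_by_domain(urls):
        print(url)
```

Each round takes at most one URL per domain, so consecutive requests to the same website are spaced out as far as the input allows.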

Errors

The following errors could happen while making requests:

Network errors
Request-level errors
- Authentication failure
- Malformed request
- Too many queries in request
- Request payload size is too large
Query-level errors
- Downloader errors
- Proxy errors
- …

Some errors can be retried while others can’t.

For example, you can retry a query that failed with a Proxy Timeout error, because this is a temporary error and a later attempt may succeed.

On the other hand, it makes no sense to retry queries that return a 404 Not Found error because the response is not supposed to change if retried.
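
The distinction between temporary and permanent errors can be illustrated with a generic retry loop. This is a sketch of the general technique, not the library's implementation; TemporaryError, PermanentError and call_with_retries are hypothetical names. The total number of attempts is max_retries + 1, matching the behaviour described in the 0.6.1 changelog entry.

```python
import time


class TemporaryError(Exception):
    """Hypothetical error that may go away on retry (e.g. a proxy timeout)."""


class PermanentError(Exception):
    """Hypothetical error that will not change on retry (e.g. 404 Not Found)."""


def call_with_retries(func, max_retries=3, delay_s=0.0):
    """Call func(), retrying only temporary errors.

    Total attempts = max_retries + 1. Permanent errors propagate immediately.
    """
    attempts = max_retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TemporaryError:
            if attempt == attempts:
                raise  # out of retries: surface the last temporary error
            time.sleep(delay_s)
```

A real client would usually also increase the delay between attempts; that is left out here to keep the sketch short.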

Retries

By default, Network and Request-level errors are retried automatically. You can also enable retries of Query-level errors by specifying the --max-query-error-retries argument.

Enabling Query-level retries increases the success rate at the cost of more requests being performed.

python -m autoextract urls.txt --page-type article --max-query-error-retries 3 --output res.jl

Failing queries are retried until the maximum number of retries or a timeout is reached. If it’s still not possible to fetch all queries without errors, the last available result is written to the output, including both the successful queries and the ones with errors.
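
Since the output can contain both successful and failed queries, it can be useful to separate them afterwards. The sketch below assumes that the output file has one JSON result per line and that a failed query carries an "error" key in its result object; both points should be checked against the API docs. The error text in the sample data is illustrative only.

```python
import json


def split_results(lines):
    """Split JSON Lines results into (successes, failures).

    Assumption: a failed query has an "error" key in its result object.
    """
    successes, failures = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        result = json.loads(line)
        if "error" in result:
            failures.append(result)
        else:
            successes.append(result)
    return successes, failures


if __name__ == "__main__":
    # Illustrative sample data, not real API output.
    sample = [
        '{"query": {"userQuery": {"url": "http://example.com/foo"}}, "article": {}}',
        '{"query": {"userQuery": {"url": "http://example.com/bar"}}, "error": "sample error"}',
    ]
    ok, failed = split_results(sample)
    print(len(ok), "succeeded;", len(failed), "failed")
```

An open file object can be passed directly as lines, for example the res.jl file written by the CLI.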

Synchronous API

The synchronous API provides an easy way to try AutoExtract. For production usage, the asyncio API is strongly recommended. Currently the synchronous API doesn’t handle throttling errors and has other limitations; it is best suited for quickly checking extraction results for a few URLs.

To send a request, use the request_raw function; consult the API docs to understand how to populate the query:

from autoextract.sync import request_raw
query = [{'url': 'http://example.com/foo', 'pageType': 'article'}]
results = request_raw(query)

Note that if there are several URLs in the query, results can be returned in arbitrary order.

There is also an autoextract.sync.request_batch helper, which accepts URLs and a page type, and ensures that results are in the same order as the requested URLs:

from autoextract.sync import request_batch
urls = ['http://example.com/foo', 'http://example.com/bar']
results = request_batch(urls, page_type='article')
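
If you call request_raw directly and still need the results in input order, they can be matched back to the URLs afterwards. The sketch below is not how request_batch is implemented; it assumes each result exposes the original query under result["query"]["userQuery"], which should be checked against the API docs. reorder_by_url is a hypothetical helper.

```python
def reorder_by_url(urls, results):
    """Return results in the order of `urls`; missing URLs map to None.

    Assumption: each result stores the submitted query under
    result["query"]["userQuery"]. Duplicate URLs are not distinguished.
    """
    by_url = {res["query"]["userQuery"]["url"]: res for res in results}
    return [by_url.get(url) for url in urls]


if __name__ == "__main__":
    urls = ["http://example.com/foo", "http://example.com/bar"]
    # Illustrative results, returned in a different order than requested.
    results = [
        {"query": {"userQuery": {"url": "http://example.com/bar"}}},
        {"query": {"userQuery": {"url": "http://example.com/foo"}}},
    ]
    for res in reorder_by_url(urls, results):
        print(res["query"]["userQuery"]["url"])
```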

asyncio API

Basic usage is similar to the sync API (request_raw), but an asyncio event loop is used:

from autoextract.aio import request_raw

async def foo():
    query = [{'url': 'http://example.com/foo', 'pageType': 'article'}]
    results1 = await request_raw(query)
    # ...
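
A coroutine like foo() has to be driven by an event loop. The sketch below shows one way to do that with asyncio.run, which requires Python 3.7+ (on Python 3.6, loop.run_until_complete can be used instead). To keep it runnable without an API key, fake_request_raw is a hypothetical stand-in for autoextract.aio.request_raw, and the shape of its return value is illustrative only.

```python
import asyncio


async def fake_request_raw(query):
    """Stand-in for autoextract.aio.request_raw; returns made-up results."""
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    return [{"query": {"userQuery": q}} for q in query]


async def foo():
    query = [{"url": "http://example.com/foo", "pageType": "article"}]
    return await fake_request_raw(query)


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
results = asyncio.run(foo())
print(results)
```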

There is also a request_parallel_as_completed function, which processes many URLs in parallel, using both batching and multiple connections:

import json
import sys
from autoextract.aio import (
    RequestError, create_session, request_parallel_as_completed
)
from autoextract import ArticleRequest

async def extract_from(urls):
    requests = [ArticleRequest(url) for url in urls]
    async with create_session() as session:
        res_iter = request_parallel_as_completed(requests,
                                                 n_conn=15, batch_size=2,
                                                 session=session)
        for fut in res_iter:
            try:
                batch_result = await fut
                for res in batch_result:
                    # do something with a result, e.g.
                    print(json.dumps(res))
            except RequestError as e:
                print(e, file=sys.stderr)
                raise

request_parallel_as_completed is modelled after asyncio.as_completed (see https://docs.python.org/3/library/asyncio-task.html#asyncio.as_completed), and actually uses it under the hood.
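
For readers unfamiliar with asyncio.as_completed, the standalone sketch below shows the pattern with plain coroutines: the iterator yields awaitables, and awaiting them returns results in completion order rather than submission order. The fetch coroutine and its delays are made up for illustration; asyncio.run requires Python 3.7+.

```python
import asyncio


async def fetch(name, delay_s):
    """Pretend to do some work that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return name


async def collect_in_completion_order():
    coros = [fetch("slow", 0.2), fetch("fast", 0.01), fetch("medium", 0.1)]
    finished = []
    # Each item is an awaitable; awaiting it yields the next finished result.
    for fut in asyncio.as_completed(coros):
        finished.append(await fut)
    return finished


order = asyncio.run(collect_in_completion_order())
print(order)
```

The same for/await structure appears in the extract_from example above, with batches of results instead of single values.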

Note the from autoextract import ArticleRequest line and its usage in the example above. There are several Request helper classes, which simplify building the queries.

The request_parallel_as_completed and request_raw functions handle throttling (HTTP 429 errors) and network errors, retrying the request in these cases.

The CLI implementation (autoextract/__main__.py) can serve as a usage example.

Request helpers

To query AutoExtract you need to create a dict with request parameters, e.g.:

{'url': 'http://example.com/foo', 'pageType': 'article'}

To simplify the library usage and avoid typos, scrapinghub-autoextract provides helper classes for constructing these dicts:

* autoextract.Request
* autoextract.ArticleRequest
* autoextract.ProductRequest
* autoextract.JobPostingRequest

You can pass instances of these classes instead of dicts wherever request dicts are accepted. For example, instead of writing this:

query = [{"url": url, "pageType": "article"} for url in urls]

You can write this:

query = [Request(url, pageType="article") for url in urls]

or this:

query = [ArticleRequest(url) for url in urls]

There is one difference: the articleBodyRaw parameter is set to False by default when Request or its variants are used, while it is True by default in the API.

You can override API parameters by passing a dictionary with extra data in the extra argument. Note that it overwrites any value set through the standard attributes, such as articleBodyRaw and fullHtml.

Extra parameters example:

request = ArticleRequest(
    url=url,
    fullHtml=True,
    extra={
        "customField": "custom value",
        "fullHtml": False
    }
)

This will generate a query that looks like this:

{
    "url": url,
    "pageType": "article",
    "fullHtml": False,  # our extra parameter overrides the previous value
    "customField": "custom value"  # not a default param but defined even then
}
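
The override rule can be pictured as a plain dictionary merge in which extra is applied last. The sketch below only illustrates these semantics; build_query is a hypothetical function and not the library's implementation.

```python
def build_query(url, page_type, extra=None, **params):
    """Build a query dict; keys in `extra` win over the regular parameters."""
    query = {"url": url, "pageType": page_type, **params}
    # Applied last, so "fullHtml" from `extra` replaces the earlier value.
    query.update(extra or {})
    return query


if __name__ == "__main__":
    query = build_query(
        "http://example.com/foo",
        "article",
        fullHtml=True,
        extra={"customField": "custom value", "fullHtml": False},
    )
    print(query)
```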

Contributing

Source code: https://github.com/scrapinghub/scrapinghub-autoextract
Issue tracker: https://github.com/scrapinghub/scrapinghub-autoextract/issues

Use tox to run tests with different Python versions:

tox

The command above also runs type checks; we use mypy.

Changes

0.6.1 (2021-01-27)

fixed max_retries behaviour: the total number of attempts is now max_retries + 1

0.6.0 (2020-12-29)

CLI changes: error display in the progress bar is changed; a summary is printed after the execution
more errors are retried when retrying is enabled, which allows for a higher success rate
fixed TCP connection pooling
autoextract.aio.request_raw now allows passing custom headers to the API (not to remote websites)
autoextract.aio.request_raw now allows customizing the retry behavior via the retrying argument
tenacity.RetryError is no longer raised by the library; concrete errors are raised instead
Python 3.9 support
CI is moved from Travis to GitHub Actions

0.5.2 (2020-11-27)

QueryError is renamed to _QueryError, as this is not an error users of the library ever see.
Retries were broken by having userAgent in the userQuery API output; a temporary workaround is added to make retries work again.

0.5.1 (2020-08-21)

fix a problem that was preventing calls to request_raw when endpoint argument was None

0.5.0 (2020-08-21)

added --api-endpoint option to the command-line utility
improved documentation, adding details about Request’s extra parameters

0.4.0 (2020-08-17)

autoextract.Request helper class now allows setting arbitrary parameters for AutoExtract requests; they can be passed in the extra argument.

0.3.0 (2020-07-24)

In this release retry-related features are added or improved. It is now possible to fix some of the temporary errors by enabling query-level retries, and the default retry behavior is improved.

backwards-incompatible: autoextract.aio.ApiError is renamed to autoextract.aio.RequestError
max_query_error_retries argument is added to autoextract.aio.request_raw and autoextract.aio.request_parallel_as_completed functions; it enables retries of temporary query-level errors returned by the API.
CLI: added --max-query-error-retries option to retry temporary query-level errors.
HTTP 500 errors from the server are now retried.
documentation and test improvements.

0.2.0 (2020-04-15)

asyncio API is rewritten to simplify use in cases where passing meta is required. autoextract.aio.request_parallel_as_completed is added; autoextract.aio.request_parallel and autoextract.aio.request_batch are removed.
CLI: it now shows various stats: mean response and connect time, % of throttling errors, % of network and other errors
CLI: new --intype jl option allows processing a .jl file with arbitrary AutoExtract API queries
CLI: new --shuffle option shuffles input data, to spread it more evenly across websites.
CLI: it no longer exits on unrecoverable errors, to aid long-running processing tasks.
retry logic is adjusted to handle network errors better.
autoextract.aio.request_raw and autoextract.aio.request_parallel_as_completed functions provide an interface to return statistics about requests made, including retries.
autoextract.Request, autoextract.ArticleRequest, autoextract.ProductRequest, autoextract.JobPostingRequest helper classes
Documentation improvements.

0.1.1 (2020-03-12)

allow up to 100 elements in a batch, not up to 99
custom User-Agent header is added
Python 3.8 support is declared & tested

0.1 (2019-10-09)

Initial release.
