datarobot-batch-scoring

A script to start and resume batch scoring via the DataRobot API.

These details have been verified by PyPI

Maintainers

Axik datarobot dsakagi madmott stasdr

Project description

A script to score CSV files via DataRobot’s prediction API.

How to install

Install or upgrade to the latest version:

$ pip install -U datarobot_batch_scoring

Install a particular version:

$ pip install datarobot_batch_scoring==x.y.z

Features

Concurrent requests (--n_concurrent)

Pause/resume

Gzip support

Custom delimiters

Parallel processing

Running batch_scoring

You can run the batch_scoring command with arguments given on the command line, or you can supply the arguments in an .ini file. Place the file in your home directory or in the directory from which you run the batch_scoring command. Use the syntax and arguments below to define the parameters. If a parameter is set both in the file and on the command line, the command line value takes priority.

The following table describes the syntax conventions; the syntax for running the script follows the table.

Convention	Meaning
[ ]	Optional argument
< >	User supplied value
{ | }	Required, mutually exclusive

Required arguments:

batch_scoring --host=<host> --user=<user> <project_id> <model_id> <dataset_filepath> --datarobot_key=<datarobot_key> {--password=<pwd> | --api_token=<api_token>}

Additional recommended arguments:

[--verbose] [--keep_cols=<keep_cols>] [--n_concurrent=<n_concurrent>]

Additional optional arguments:

[--out=<filepath>] [--api_version=<api_version>] [--pred_name=<string>] [--timeout=<timeout>] [--create_api_token] [--n_retry=<n_retry>] [--delimiter=<delimiter>] [--resume] [--skip_row_id] [--output_delimiter=<delimiter>]

Argument descriptions: The following table describes each of the arguments:

Argument	Description
host=<host>	Specifies the hostname of the prediction API endpoint (the location from which to get predictions).
user=<user>	Specifies the username used to acquire the API token. Use quotes if the name contains spaces.
<project_id>	Specifies the project identification string. You can find the ID embedded in the URL that displays when you are in the Leaderboard (for example, https://<host>/projects/<project_id>/models) or, when the prediction API is enabled, in the example shown when you click Deploy Model for a specific model in the Leaderboard.
<model_id>	Specifies the model identification string. You can find the ID embedded in the URL that displays when you are in the Leaderboard and have selected a model (for example, https://<host>/projects/<project_id>/models/<model_id>) or, when the prediction API is enabled, in the example shown when you click Deploy Model for a specific model in the Leaderboard.
<dataset_filepath>	Specifies the .csv input file that the script scores. It does this by submitting prediction requests against <host> using project <project_id> and model <model_id>.
datarobot_key=<datarobot_key>	An additional datarobot_key for dedicated prediction instances. This argument is required when using on-demand workers on the Cloud platform, but not for Enterprise users.
password=<pwd>	Specifies the password used to acquire the API token. Use quotes if the password contains spaces. You must specify either the password or the api_token argument. To avoid entering your password each time you run the script, use the api_token argument instead.
api_token=<api_token>	Specifies the API token for the requests; if you do not have a token, you must specify the password argument. You can retrieve your token from your profile on the My Account page.
out=<filepath>	Specifies the file name, and optionally path, to which the results are written. If not specified, the default file name is out.csv, written to the directory containing the script. The value of the output file must be a single .csv file that can be gzipped (extension .gz).
verbose	Provides status updates while the script is running. It is recommended that you include this argument to track script execution progress. Silent mode (non-verbose) displays very little output.
keep_cols=<keep_cols>	Specifies the column names to append to the predictions. Enter as a comma-separated list.
n_samples=<n_samples>	DEPRECATED. Specifies the number of samples (rows) to use per batch. If not defined, the “auto_sample” option is used.
n_concurrent=<n_concurrent>	Specifies the number of concurrent requests to submit. By default, 4 concurrent requests are submitted. Set <n_concurrent> to match the number of cores in the prediction API endpoint.
create_api_token	Requests a new API token. To use this option, you must specify the password argument for this request (not the api_token argument). Specifying this argument invalidates your existing API token and creates and stores a new token for future prediction requests.
n_retry=<n_retry>	Specifies the number of times DataRobot will retry if a request fails. A value of -1, the default, specifies an infinite number of retries.
pred_name=<pred_name>	Applies a name to the prediction column of the output file. If you do not supply the argument, the column name is blank. For binary predictions, only positive class columns are included in the output. The last class (in lexical order) is used as the name of the prediction column.
skip_row_id	Skips the row_id column in the output.
output_delimiter=<delimiter>	Specifies the delimiter for the output CSV. The special keyword “tab” can be used to indicate a tab-delimited CSV.
timeout=<timeout>	The time, in seconds, that DataRobot tries to make a connection to satisfy a prediction request. When the timeout expires, the client (the batch_scoring command) closes the connection and retries, up to the number of times defined by the value of <n_retry>. The default value is 30 seconds.
delimiter=<delimiter>	Specifies the delimiter to recognize in the input .csv file (e.g., “--delimiter=,”). If not specified, the script tries to automatically determine the delimiter. The special keyword “tab” can be used to indicate a tab-delimited csv.
resume	Starts the prediction from the point at which it was halted. If the prediction stopped, for example due to error or network connection issue, you can run the same command with all the same arguments plus this resume argument. If you do not include this argument, and the script detects a previous script was interrupted mid-execution, DataRobot prompts whether to resume. When resuming a script, you cannot change the dataset_filepath, model_id, project_id, n_samples, or keep_cols.
help	Show usage help for the command.
fast	Experimental: Uses a faster csv processor. Note that this method does not support multiline csv.
stdout	Send all log messages to stdout.
auto_sample	Override the <n_samples> value and instead use chunks of roughly 1.5 MB to improve throughput. On by default.
encoding	Declare the dataset encoding. If an encoding is not provided, the batch_scoring script attempts to detect it (e.g., “utf-8”, “latin-1”, or “iso2022_jp”). See the Python docs for a list of valid encodings.
skip_dialect	Tell the batch_scoring script to skip csv dialect detection.
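
The delimiter rules in the table (an explicit value, the special keyword “tab”, or automatic detection) can be illustrated with a short Python sketch. This is not the script's own implementation: the function name resolve_delimiter is hypothetical, and the use of the standard library's csv.Sniffer for detection is an assumption made for the example.

```python
import csv


def resolve_delimiter(option, sample_text):
    """Pick a CSV delimiter (hypothetical helper, not part of batch_scoring).

    option: the value given to --delimiter, or None when the flag is omitted.
    sample_text: the first few lines of the input file.
    """
    if option == "tab":
        # The special keyword "tab" stands for a tab character.
        return "\t"
    if option is not None:
        # Any other explicit value is used as given.
        return option
    # No flag given: guess from the sample. Using csv.Sniffer here is an
    # assumption; the real script may detect the delimiter differently.
    return csv.Sniffer().sniff(sample_text, delimiters=",;\t|").delimiter
```

Detection of this kind can fail on unusual files, which is one reason to pass the delimiter explicitly when it is known.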

Example:

batch_scoring --host=https://mycorp.orm.datarobot.com/ --user="greg@mycorp.com" --out=pred.csv 5545eb20b4912911244d4835 5545eb71b4912911244d4847 /home/greg/Downloads/diabetes_test.csv
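
When the command is launched from another program, it can help to assemble the arguments as a list rather than a single string, so that values containing spaces need no extra quoting. The sketch below, assuming Python 3, only builds that list from flags documented above; build_command is a hypothetical helper and the host value in the usage note is a placeholder.

```python
def build_command(host, user, project_id, model_id, dataset_filepath,
                  api_token=None, password=None, datarobot_key=None,
                  out=None, n_concurrent=None, verbose=False):
    """Return a batch_scoring argument list (hypothetical helper)."""
    # --password and --api_token are required but mutually exclusive.
    if (api_token is None) == (password is None):
        raise ValueError("supply exactly one of api_token or password")
    cmd = ["batch_scoring", "--host=" + host, "--user=" + user]
    if api_token is not None:
        cmd.append("--api_token=" + api_token)
    else:
        cmd.append("--password=" + password)
    if datarobot_key is not None:
        cmd.append("--datarobot_key=" + datarobot_key)
    if out is not None:
        cmd.append("--out=" + out)
    if n_concurrent is not None:
        cmd.append("--n_concurrent=%d" % n_concurrent)
    if verbose:
        cmd.append("--verbose")
    # Positional arguments: project ID, model ID, then the input file.
    cmd.extend([project_id, model_id, dataset_filepath])
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True), for example with build_command("https://example.invalid/", "greg@mycorp.com", "<project_id>", "<model_id>", "diabetes_test.csv", api_token="<api_token>").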

Using the configuration file

The batch_scoring command looks for a batch_scoring.ini file in two places: $HOME/batch_scoring.ini (your home directory) and the directory from which you run the script (the working directory). If the file exists, the command reads its arguments from it; the argument names are the same as those described above. If the file does not exist, the command proceeds with the command line arguments alone. Command line arguments have higher priority than file arguments (that is, you can override file arguments on the command line).

The format of a batch_scoring.ini file is as follows:

[batch_scoring]
host=file_host
project_id=file_project_id
model_id=file_model_id
user=file_username
password=file_password
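
The priority rule (command line over file) can be sketched in Python 3 with the standard configparser module. This illustrates the rule and is not the script's actual code: read_ini_text and merge_settings are hypothetical names, and representing an omitted command line argument as None is an assumption.

```python
import configparser


def read_ini_text(text):
    """Return the [batch_scoring] section of an .ini document as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if parser.has_section("batch_scoring"):
        return dict(parser["batch_scoring"])
    return {}


def merge_settings(file_values, cli_values):
    """Overlay command line values on file values.

    A command line value of None means the argument was not given, so the
    value from the file (if any) is kept.
    """
    merged = dict(file_values)
    for name, value in cli_values.items():
        if value is not None:
            merged[name] = value
    return merged
```

For example, merging a file that sets host=file_host with a command line that passes --host=cli_host yields cli_host, while arguments missing from the command line keep their file values.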

Usage Notes

If the script detects that a previous script was interrupted in mid-execution, it will prompt whether to resume that execution.

If no interrupted script was detected or if you indicate not to resume the previous execution, the script checks to see if the specified output file exists. If yes, the script prompts to confirm before overwriting this file.
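
The two checks described above can be read as a small decision procedure. The following Python sketch is one interpretation, not the script's code: plan_run and its ask callback are hypothetical, and the sketch assumes that declining to overwrite the output file ends the run.

```python
def plan_run(interrupted, resume_flag, output_exists, ask):
    """Decide what a run should do (hypothetical helper).

    interrupted: True if a previous run was interrupted mid-execution.
    resume_flag: True if --resume was passed.
    output_exists: True if the output file is already present.
    ask: callable taking "resume" or "overwrite" and returning the
         user's yes/no answer as a bool.
    """
    # With --resume the prompt is skipped; otherwise the user is asked.
    if interrupted and (resume_flag or ask("resume")):
        return "resume"
    # A fresh run must not overwrite existing output without confirmation.
    # Assumption: answering "no" aborts the run.
    if output_exists and not ask("overwrite"):
        return "abort"
    return "start"
```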

The logs from each batch_scoring run are stored in the current working directory. All users will see a datarobot_batch_scoring_main.log log file. Windows users will see two additional log files, datarobot_batch_scoring_batcher.log and datarobot_batch_scoring_writer.log.

Supported Platforms

The batch_scoring script is tested on Linux and Windows, but it should also work on OS X. Both Python 2.7 and Python 3.x are supported.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datarobot_batch_scoring-1.9.1.tar.gz (31.6 kB)

Uploaded Jan 6, 2017 Source

Built Distribution

datarobot_batch_scoring-1.9.1-py2.py3-none-any.whl (32.2 kB)

Uploaded Jan 6, 2017 Python 2 Python 3

Hashes for datarobot_batch_scoring-1.9.1.tar.gz
Algorithm	Hash digest
SHA256	`994621defdde2f4d18096a01632d2330c949205db6c986d7eb42aac6cfe5cf76`
MD5	`ba4b2d969c764eebea1183f948bf030f`
BLAKE2b-256	`34f186522f142228954af6aecd89d15d65e8267b9b647ae59176bcafec52e4a6`

Hashes for datarobot_batch_scoring-1.9.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`939abe59740331cce1daee7883edddcc212806f101f486cf36d8b16550e40989`
MD5	`eb17e2b6c10075fbe3620f903da6c6fb`
BLAKE2b-256	`7471ce2ca81392acb89572721a7fc4e7ca60bccff03a1783fae123d4c46031f2`
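
A downloaded file can be checked against the SHA256 digests listed above. The sketch below uses Python's standard hashlib module; the helper names sha256_of_file and matches_published_hash are hypothetical, and the file path is whatever location the download was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published one, ignoring letter case."""
    return sha256_of_file(path) == expected_hex.lower()
```

For the source distribution, the expected value would be the SHA256 digest shown in its table, passed as the second argument together with the local path of datarobot_batch_scoring-1.9.1.tar.gz.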