cleanlab-cli

Command line interface for all things Cleanlab Studio.

The CLI currently supports generating dataset schemas, uploading datasets into Cleanlab Studio, and downloading cleansets from Cleanlab Studio.

Installation

You can install the Cleanlab Studio CLI from PyPI with:

pip install cleanlab-cli

If you already have the CLI installed and wish to upgrade to the latest version, run:

pip install --upgrade cleanlab-cli

Workflow

Uploading datasets to Cleanlab Studio is a two-step process.

  1. The CLI generates a schema describing the dataset's fields and their data and feature types, which you then verify.
  2. Based on this schema, the dataset is parsed and uploaded to Cleanlab Studio.
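
The two steps above can be sketched as a pair of command lines assembled in Python. This is only an illustration: the helper name build_upload_commands and the file names are hypothetical, and it uses only flags mentioned in this document (-f, -s, --output). The commands are built and printed rather than executed, since the CLI may prompt interactively.

```python
import shlex


def build_upload_commands(dataset_path, schema_path):
    """Return the two command lines for the schema-then-upload workflow.

    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    generate = ["cleanlab", "dataset", "schema", "generate",
                "-f", dataset_path, "--output", schema_path]
    upload = ["cleanlab", "dataset", "upload",
              "-f", dataset_path, "-s", schema_path]
    return [generate, upload]


# Print the commands so they can be reviewed before running them by hand.
for command in build_upload_commands("Tweets.csv", "schema.json"):
    print(shlex.join(command))
```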

Upload a dataset

To upload a dataset without first generating a schema (i.e. Cleanlab will suggest one for you):

cleanlab dataset upload -f [dataset filepath]

You will be asked to "Specify your dataset modality (text, tabular):".

  • Enter text to only find label errors based on a single column of text in your dataset.
  • Enter tabular to find data and label issues based on any subset of the column features.

To upload a dataset with a schema:

cleanlab dataset upload -f [dataset filepath] -s [schema filepath]

To resume uploading a dataset whose upload was interrupted:

cleanlab dataset upload -f [dataset filepath] --id [dataset ID]

A dataset ID is generated and printed to the terminal the first time the dataset is uploaded. It can also be accessed by visiting https://app.cleanlab.ai/datasets and selecting 'Resume' for the relevant dataset.

Generate dataset schema

To generate a dataset schema (prior to uploading your dataset):

cleanlab dataset schema generate -f [dataset filepath]

  • At the Id column: prompt, enter the name of the column in your dataset that contains the ID of each row.
  • At the Modality (text, tabular): prompt, enter text to only find label errors based on a single column of text, or enter tabular to find data and label issues based on any subset of the column features.

You may then wish to inspect the generated schema to check that the fields and metadata are correct.

To validate an existing schema, i.e. check that it is complete, well-formatted, and has data types with sensible feature types:

cleanlab dataset schema validate -s [schema filepath]

Download clean labels

To download clean labels (i.e. labels that have been fixed through the Cleanlab Studio interface):

cleanlab cleanset download --id [cleanset ID]

To download clean labels and combine them with your local dataset:

cleanlab cleanset download --id [cleanset ID] -f [dataset filepath]

Commands

cleanlab login authenticates you

Authenticates you when uploading datasets to Cleanlab Studio. Pass in your API key using --key [API key]. Your API key can be accessed at https://app.cleanlab.ai/upload.

cleanlab dataset schema generate generates dataset schemas

Generates a schema based on your dataset. Specify your target dataset with --filepath [dataset filepath]. You will be prompted to save the generated schema JSON and to specify a save location; to set the save location up front, use --output [output filepath].

cleanlab dataset schema validate validates a schema JSON file

Validates a schema JSON file, checking that a schema is complete, well-formatted, and has data types with sensible feature types. Specify your target schema with --schema [schema filepath].

You may also validate an existing schema with respect to a dataset (-d [dataset filepath]), i.e. all previously mentioned checks and the additional check that all fields in the schema are present in the dataset.
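
The additional dataset check described above can be pictured with a short sketch. This is not the CLI's implementation; the function name fields_missing_from_dataset and the sample schema are hypothetical, and it covers only the "all schema fields are present in the dataset" check.

```python
def fields_missing_from_dataset(schema, dataset_columns):
    """Return schema field names that do not appear among the dataset's columns."""
    available = set(dataset_columns)
    return [name for name in schema["fields"] if name not in available]


# Hypothetical schema with three fields; the dataset below lacks one of them.
schema = {
    "metadata": {"id_column": "review_id", "modality": "text", "name": "reviews.csv"},
    "fields": {
        "review_id": {"data_type": "string", "feature_type": "identifier"},
        "review": {"data_type": "string", "feature_type": "text"},
        "sentiment": {"data_type": "string", "feature_type": "categorical"},
    },
}

print(fields_missing_from_dataset(schema, ["review_id", "review"]))
# prints ['sentiment']
```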

cleanlab dataset upload uploads your dataset

Uploads your dataset to Cleanlab Studio. Specify your target dataset with --filepath [dataset filepath]. You will be prompted for further details about the dataset's modality and ID column. These may be supplied to the command with --modality [modality] and --id-column [name of ID column]. You may also specify a custom dataset name with --name [custom dataset name].

After uploading your dataset, you will be prompted to save the list of dataset issues (if any) encountered during the upload process. These issues include missing IDs, duplicate IDs, missing values, and values whose types do not match the schema. You may specify the save location with --output [output filepath].
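
To make two of these issue types concrete, here is a minimal sketch that finds missing and duplicate IDs in a list of rows before uploading. The function name find_id_issues is hypothetical, and treating None and the empty string as "missing" is an assumption; the CLI's own rules may differ.

```python
from collections import Counter


def find_id_issues(rows, id_column):
    """Return (indexes of rows with a missing ID, IDs that occur more than once)."""
    def is_missing(row):
        # Assumption: an absent key, None, or "" all count as a missing ID.
        return row.get(id_column) in (None, "")

    missing = [index for index, row in enumerate(rows) if is_missing(row)]
    counts = Counter(row[id_column] for row in rows if not is_missing(row))
    duplicates = [value for value, count in counts.items() if count > 1]
    return missing, duplicates


rows = [
    {"flower_id": "flower_01", "species": "rose"},
    {"flower_id": "", "species": "lily"},
    {"flower_id": "flower_01", "species": "tulip"},
]
print(find_id_issues(rows, "flower_id"))
```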

cleanlab cleanset download downloads Cleanlab columns from your cleanset

Cleansets are initialized through the Cleanlab Studio interface. In a cleanset, users can inspect their dataset and verify their labels. Clean labels are the labels after this set of manual fixes has been applied.

This command downloads the clean labels and saves them locally as a .csv, .xls/.xlsx, or .json file with columns id and clean_label. Include --filepath [dataset filepath] to combine the clean labels with the original dataset as a new column clean_label; the combined dataset is written to --output [output filepath]. Include the --all flag to include all Cleanlab columns, i.e. issue, label quality, suggested label, clean label, instead of only the clean label column.
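
Conceptually, combining the clean labels with the original dataset is a join on the ID column. The sketch below illustrates that idea with plain dictionaries; the function name add_clean_labels is hypothetical, and using None for rows without a clean label is an assumption, not documented CLI behavior.

```python
def add_clean_labels(dataset_rows, clean_rows, id_column):
    """Return copies of dataset_rows with a clean_label column joined on the ID."""
    clean_by_id = {row["id"]: row["clean_label"] for row in clean_rows}
    return [
        # Assumption: rows with no matching clean label get None.
        {**row, "clean_label": clean_by_id.get(row[id_column])}
        for row in dataset_rows
    ]


dataset_rows = [
    {"review_id": "review_1", "review": "The sales rep was fantastic!", "sentiment": "negative"},
    {"review_id": "review_2", "review": "He was a bit wishy-washy.", "sentiment": "negative"},
]
clean_rows = [
    {"id": "review_1", "clean_label": "positive"},
    {"id": "review_2", "clean_label": "negative"},
]
for row in add_clean_labels(dataset_rows, clean_rows, "review_id"):
    print(row["review_id"], row["clean_label"])
```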

Dataset format

Cleanlab currently only supports text and tabular dataset modalities. (If your dataset contains both text and tabular data, treat it as tabular.) The accepted dataset file types are: .csv, .json, and .xls/.xlsx.

Below are some examples of how to format your dataset depending on modality and file type.

Every dataset must have an ID column (i.e. a column containing identifiers that uniquely identify each row) and a label column (for the prediction task).

Apart from the reserved column name clean_label, you are free to name the columns in your dataset in any way you want.
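
These column requirements can be checked locally before uploading. The following is a minimal sketch under stated assumptions: the function name check_columns is hypothetical, and the caller is assumed to know which columns hold the IDs and the labels.

```python
RESERVED_COLUMNS = {"clean_label"}


def check_columns(columns, id_column, label_column):
    """Return a list of human-readable problems with a dataset's column names."""
    problems = []
    for required in (id_column, label_column):
        if required not in columns:
            problems.append(f"missing column: {required}")
    for name in columns:
        if name in RESERVED_COLUMNS:
            problems.append(f"reserved column name: {name}")
    return problems


print(check_columns(["flower_id", "width", "clean_label"], "flower_id", "species"))
```

An empty list means the column names satisfy the rules above; it says nothing about the values in those columns.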

Tabular
.csv, .xls/.xlsx
| flower_id | width | length | color | species |
|-----------|-------|--------|-------|---------|
| flower_01 | 4     | 3      | red   | rose    |
| flower_02 | 7     | 2      | white | lily    |
.json
{
  "rows": [
    {
      "flower_id": "flower_01",
      "width": 4,
      "length": 3,
      "color": "red",
      "species": "rose"
    },
    {
      "flower_id": "flower_02",
      "width": 7,
      "length": 2,
      "color": "white",
      "species": "lily"
    }
  ]
}
Text
.csv, .xls/.xlsx
| review_id | review                       | sentiment |
|-----------|------------------------------|-----------|
| review_1  | The sales rep was fantastic! | positive  |
| review_2  | He was a bit wishy-washy.    | negative  |
.json
{
  "rows": [
    {
      "review_id": "review_1",
      "review": "The sales rep was fantastic!",
      "sentiment": "positive"
    },
    {
      "review_id": "review_2",
      "review": "He was a bit wishy-washy.",
      "sentiment": "negative"
    }
  ]
}
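
If your data starts out as CSV and you want the JSON layout shown above, with records under a top-level "rows" list, a conversion can be sketched with the Python standard library. The function name csv_text_to_rows_json is hypothetical. Note that csv.DictReader reads every value as a string, so numeric columns would need an explicit conversion that is not shown here.

```python
import csv
import io
import json


def csv_text_to_rows_json(csv_text):
    """Convert CSV text into a JSON string with a top-level "rows" list."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps({"rows": list(reader)}, indent=2)


csv_text = (
    "review_id,review,sentiment\n"
    "review_1,The sales rep was fantastic!,positive\n"
    "review_2,He was a bit wishy-washy.,negative\n"
)
print(csv_text_to_rows_json(csv_text))
```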

Schema

To specify the column types in your dataset, create a JSON file named schema.json. We recommend using cleanlab dataset schema generate to generate an initial schema and editing from there.

Your schema file should be formatted as follows:

{
  "metadata": {
    "id_column": "tweet_id",
    "modality": "text",
    "name": "Tweets.csv"
  },
  "fields": {
    "tweet_id": {
      "data_type": "string",
      "feature_type": "identifier"
    },
    "sentiment": {
      "data_type": "string",
      "feature_type": "categorical"
    },
    "sentiment_confidence": {
      "data_type": "float",
      "feature_type": "numeric"
    },
    "retweet_count": {
      "data_type": "integer",
      "feature_type": "numeric"
    },
    "text": {
      "data_type": "string",
      "feature_type": "text"
    },
    "is_retweet": {
      "data_type": "boolean",
      "feature_type": "boolean"
    },
    "tweet_created": {
      "data_type": "string",
      "feature_type": "datetime"
    }
  },
  "version": "0.1.12"
}

This is the schema of a hypothetical dataset Tweets.csv that contains tweets, where the column tweet_id contains a unique identifier for each record. Each column in the dataset is specified under fields with its data type and feature type.
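
As an illustration of this layout, the sketch below parses a small schema and lists each field with its data type and feature type. The function name describe_schema and the trimmed-down schema are hypothetical; the check that id_column appears under fields follows from the description above and is not a statement about how the CLI validates schemas.

```python
import json


def describe_schema(schema_text):
    """Parse schema JSON and map each field name to (data_type, feature_type)."""
    schema = json.loads(schema_text)
    id_column = schema["metadata"]["id_column"]
    if id_column not in schema["fields"]:
        raise ValueError(f"id_column {id_column!r} is not listed under fields")
    return {
        name: (spec["data_type"], spec["feature_type"])
        for name, spec in schema["fields"].items()
    }


schema_text = """
{
  "metadata": {"id_column": "tweet_id", "modality": "text", "name": "Tweets.csv"},
  "fields": {
    "tweet_id": {"data_type": "string", "feature_type": "identifier"},
    "text": {"data_type": "string", "feature_type": "text"}
  },
  "version": "0.1.12"
}
"""
for name, types in describe_schema(schema_text).items():
    print(name, types)
```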

Data types and Feature types

Data type refers to the type of the field's values: string, integer, float, or boolean.

Note that the integer type is strict, meaning floats will be rejected. In contrast, the float type is lenient, meaning integers are accepted. Users should select the float type if the field may include float values. Note too that integers can have categorical and identifier feature types, whereas floats cannot.
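
The strict/lenient distinction can be sketched on already-parsed Python values. The function name matches_numeric_type is hypothetical, and this document does not say how the CLI treats strings such as "3.0", so the sketch deliberately ignores parsing.

```python
def matches_numeric_type(value, data_type):
    """Return True if value is acceptable for the given numeric data type."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; keep it out of numeric fields.
        return False
    if data_type == "integer":
        return isinstance(value, int)           # strict: floats are rejected
    if data_type == "float":
        return isinstance(value, (int, float))  # lenient: integers are accepted
    raise ValueError(f"not a numeric data type: {data_type!r}")


print(matches_numeric_type(3, "float"))
print(matches_numeric_type(3.5, "integer"))
```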

For booleans, the accepted values are: true/false, t/f, yes/no, and 1/0.
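
A minimal sketch of normalizing these values to Python booleans follows. The function name parse_boolean is hypothetical, and ignoring case and surrounding whitespace is an assumption; this document does not state whether the CLI does either.

```python
TRUE_VALUES = {"true", "t", "yes", "1"}
FALSE_VALUES = {"false", "f", "no", "0"}


def parse_boolean(value):
    """Map an accepted boolean spelling to True or False, else raise ValueError."""
    # Assumption: comparison is case-insensitive and ignores outer whitespace.
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"not an accepted boolean value: {value!r}")


print(parse_boolean("yes"), parse_boolean("f"), parse_boolean(1))
```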

Feature type refers to the secondary type of the field, relating to how it is used in a machine learning model, such as whether it is:

  • a categorical value
  • a numeric value
  • a datetime value
  • a boolean value
  • text
  • an identifier — a string / integer that identifies some entity

Some feature types can only correspond to specific data types. The possible feature types for each data type are shown below:

| Data type | Feature type                               |
|-----------|--------------------------------------------|
| string    | text, categorical, datetime, identifier    |
| integer   | categorical, datetime, identifier, numeric |
| float     | datetime, numeric                          |
| boolean   | boolean                                    |
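
The table can be encoded as a lookup for checking a schema's fields by hand. This is a sketch, not the CLI's validator; the names ALLOWED_FEATURE_TYPES and incompatible_fields are hypothetical, while the mapping itself is copied from the table.

```python
ALLOWED_FEATURE_TYPES = {
    "string": {"text", "categorical", "datetime", "identifier"},
    "integer": {"categorical", "datetime", "identifier", "numeric"},
    "float": {"datetime", "numeric"},
    "boolean": {"boolean"},
}


def incompatible_fields(fields):
    """Return names of fields whose feature type is not allowed for their data type."""
    return [
        name
        for name, spec in fields.items()
        if spec["feature_type"] not in ALLOWED_FEATURE_TYPES.get(spec["data_type"], set())
    ]


fields = {
    "retweet_count": {"data_type": "integer", "feature_type": "numeric"},
    "sentiment_confidence": {"data_type": "float", "feature_type": "categorical"},
}
print(incompatible_fields(fields))
# prints ['sentiment_confidence']
```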

The datetime type should be used for datetime strings, e.g. "2015-02-24 11:35:52 -0800", and Unix timestamps (which will be integers or floats). Datetime values must be parsable by pandas.to_datetime().

version indicates the version of the Cleanlab CLI package used to generate the schema. The current Cleanlab schema version is 0.1.14.
