
Integrate customer side ML application with the Alectio Platform

Project description

Requirements

  • Python3 (Required)
  • PIP3 (Required)
  • Ubuntu 16.04+ / MacOS / Windows 10
  • GCC / C++ (depends on the OS you are using. On Ubuntu and macOS they come installed by default; some Linux distributions such as Amazon Linux or Red Hat Linux might not have GCC or the related C++ libraries installed)

For this tutorial, we assume you are using Python3 and PIP3. Also, make sure you have the necessary build tools installed (these vary from OS to OS). If you get any errors while installing the dependent packages, feel free to reach out to us, but most issues can be resolved quickly with a simple Google search.

Installation

1. Key Management

If you have not already created your Client ID, Client Secret, and experiment token, do so as follows:

  1. Open https://pro.alectio.com/
  2. Log in, then create a project and an experiment.
  3. An experiment token will be generated.
  4. Enter your experiment token in main.py (refer to the example in step 3) to authenticate.

2. Set up a virtual environment

We recommend setting up a virtual environment.

For example, you can use Python's built-in virtual environment via:

python3 -m venv env
source env/bin/activate

3. Download the examples

All examples are available in the examples directory.

4. Install the requirements in examples

pip install -r requirements.txt

5. Run Examples

The remaining installation instructions are detailed in the examples directory. We cover one example for image classification.

Alectio SDK

AlectioSDK is a package that enables developers to build an ML pipeline as a Flask app to interact with Alectio's platform. It is designed for Alectio's clients who prefer to keep their model and data on their own servers.

The package is currently under active development. More functionality aimed at enhancing robustness will be added soon, but for now the package provides a class alectio_sdk.sdk.Pipeline that interfaces with customer-side processes in a consistent manner. Customers need to implement 4 processes as Python functions:

  • A process to train the model
  • A process to test the model
  • A process to apply the model to infer unlabeled data
  • A process to assign each data point in the dataset to a unique index (a minimal sketch is shown after the Pipeline example below; refer to one of the examples for a complete version)

A Pipeline can be created inside the main.py file using the following syntax:

import yaml
from alectio_sdk.sdk import Pipeline
from processes import train, test, infer, getdatasetstate

# All the variables can be declared inside the .yaml file
with open("./config.yaml", "r") as stream:
    args = yaml.safe_load(stream)

# Initialising the Experiment Pipeline
AlectioPipeline = Pipeline(
    name=args["exp_name"],
    train_fn=train,  # A process to train the model
    test_fn=test,  # A process to test the model
    infer_fn=infer,  # A process to apply the model to infer unlabeled data
    getstate_fn=getdatasetstate,  # A process to assign each data point in the dataset to a unique index
    args=args,  # Any arguments the user wants to use inside the train, test, and infer functions
    token="xxxxxx7041a6xxxxx7948cexxxxxxxx",  # Experiment token
    multiple_initialisations={"seeds": [], "limit_value": 0},  # Multiple seed initialisation feature
)

Refer to the alectio examples for more clarity on the use of the Pipeline class.
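
The fourth process, getdatasetstate, does not get its own section below, so here is a minimal sketch. It assumes a hypothetical train_size entry in args giving the size of the training pool; returning the index itself as the identifier is enough for many setups, but refer to the examples directory for complete versions.

def getdatasetstate(args):
    # Map every index in the training pool to a unique reference for that
    # sample (the index itself, a file path, or any other identifier).
    # args["train_size"] is an assumed config entry used for illustration.
    return {i: i for i in range(args["train_size"])}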

Train the Model

The logic for training the model should be implemented in this process. The function should look like this:

def train(args, labeled, resume_from, ckpt_file):
    """
    Training Function
    
    Input args:
    args* # Arguments passed to Alectio Pipeline
    labeled: list # List of labeled indices for training
    resume_from: str # Path to last checkpoint file
    ckpt_file: str # Path to saved model
    
    Returns:
    None
    or
    output_dict: dict # Labels and Hyperparams
    """

    # implement your logic to train the model
    # with the selected data indexed by `labeled`

    # lbs <- dictionary of indices of train data and their ground-truth
    # hyperparameters <- dictionary of the hyperparameters used in this loop
    #                    (see "Storing Hyperparameters" below for supported keys)

    return {'labels': lbs, 'hyperparams': hyperparameters}

The name of the function can be anything you like. It takes the arguments shown in the example above:

  • resume_from: a string that specifies which checkpoint to resume from
  • ckpt_file: a string that specifies the name of the checkpoint to be saved for the current loop
  • labeled: a list of indices of the selected samples used to train the model in this loop

Depending on your situation, the samples indicated in labeled might not be labeled (despite the variable name). We call it labeled because, in the active learning setting, this list represents the pool of samples iteratively labeled by the human oracle.
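
To make this concrete, here is a minimal sketch of a train function for a simple PyTorch image classifier. The dataset (CIFAR-10 via torchvision), the ResNet-18 model, and the args keys used (DATA_DIR, EXPT_DIR, batch_size, epochs) are illustrative assumptions, not a required setup; the examples directory remains the reference.

import os

import torch
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader, Subset

def train(args, labeled, resume_from, ckpt_file):
    dataset = torchvision.datasets.CIFAR10(
        root=args["DATA_DIR"], train=True, download=True,
        transform=transforms.ToTensor())

    # train only on the samples selected for this loop
    loader = DataLoader(Subset(dataset, labeled),
                        batch_size=args["batch_size"], shuffle=True)

    model = torchvision.models.resnet18(num_classes=10)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    # resume from the previous loop's checkpoint if one is provided
    if resume_from:
        model.load_state_dict(
            torch.load(os.path.join(args["EXPT_DIR"], resume_from)))

    model.train()
    for _ in range(args["epochs"]):
        for images, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()

    # save this loop's checkpoint under the name passed in by the platform
    torch.save(model.state_dict(), os.path.join(args["EXPT_DIR"], ckpt_file))

    lbs = {i: int(dataset.targets[i]) for i in labeled}  # index -> ground truth
    hyperparameters = {
        "optimizer_name": "SGD",
        "epochs": args["epochs"],
        "batch_size": args["batch_size"],
        "loss_function": "CrossEntropyLoss",
    }
    return {"labels": lbs, "hyperparams": hyperparameters}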

Test the Model

The logic for testing the model should be implemented in this process. The function representing this process should look like this:

def test(args, ckpt_file):
    """
    Testing Function

    Input args:
    args* # Arguments passed to Alectio Pipeline
    ckpt_file: str # Path to saved model

    Returns:
    output_dict: dict # Preds and Labels
    """
    # implement your testing logic here

    # put the predictions and labels into
    # two dictionaries

    # lbs <- dictionary of indices of test data and their ground-truth
    # prd <- dictionary of indices of test data and their prediction
    
    return {'predictions': prd, 'labels': lbs}

The test function takes the arguments shown in the example above:

  • ckpt_file: a string that specifies which checkpoint to use when testing the model

The test function needs to return a dictionary with two keys:

  • predictions: a dictionary mapping an index to the prediction for each test sample
  • labels: a dictionary mapping an index to the ground-truth label for each test sample

The format of the values depends on the type of ML problem. Please refer to the examples directory for details.
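
Continuing the same illustrative PyTorch classification setup used in the train sketch above (CIFAR-10, ResNet-18, and the assumed args keys), a test function might look like this:

import os

import torch
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader

def test(args, ckpt_file):
    dataset = torchvision.datasets.CIFAR10(
        root=args["DATA_DIR"], train=False, download=True,
        transform=transforms.ToTensor())
    loader = DataLoader(dataset, batch_size=args["batch_size"], shuffle=False)

    model = torchvision.models.resnet18(num_classes=10)
    model.load_state_dict(
        torch.load(os.path.join(args["EXPT_DIR"], ckpt_file)))
    model.eval()

    prd, lbs = {}, {}
    idx = 0
    with torch.no_grad():
        for images, targets in loader:
            preds = model(images).argmax(dim=1)
            for p, t in zip(preds.tolist(), targets.tolist()):
                prd[idx] = p  # index -> predicted class
                lbs[idx] = t  # index -> ground-truth class
                idx += 1

    return {"predictions": prd, "labels": lbs}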

Apply Inference

The logic for applying the model to infer the unlabeled data should be implemented in this process. The function representing this process should look like this:

def infer(args, unlabeled, ckpt_file):
    """
    Inference Function

    Input args:
    args* # Arguments passed to Alectio Pipeline
    unlabeled: list # List of unlabeled indices for inference
    ckpt_file: str # Path to saved model

    Returns:
    output_dict: dict
    """
    # implement your inference logic here

    # outputs <- save the output from the model on the unlabeled data as a dictionary
    return {'outputs': outputs}

In addition to args, the infer function takes two arguments:

  • ckpt_file: a string that specifies which checkpoint to use to infer on the unlabeled data
  • unlabeled: a list of indices of unlabeled data in the training set

The infer function needs to return a dictionary with one key:

  • outputs: a dictionary mapping each index to the model's output before an activation function is applied

For example, if it is a classification problem, return the output before applying softmax. For more details about the format of the output, please refer to the examples directory.
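
Again using the same illustrative setup, an infer function might look like the sketch below. Note that the model output is returned raw (no softmax); the exact per-index structure expected by the platform depends on the problem type, so follow the examples directory for your use case.

import os

import torch
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader, Subset

def infer(args, unlabeled, ckpt_file):
    dataset = torchvision.datasets.CIFAR10(
        root=args["DATA_DIR"], train=True, download=True,
        transform=transforms.ToTensor())
    # keep shuffle=False so batch order matches the order of `unlabeled`
    loader = DataLoader(Subset(dataset, unlabeled),
                        batch_size=args["batch_size"], shuffle=False)

    model = torchvision.models.resnet18(num_classes=10)
    model.load_state_dict(
        torch.load(os.path.join(args["EXPT_DIR"], ckpt_file)))
    model.eval()

    outputs = {}
    pos = 0
    with torch.no_grad():
        for images, _ in loader:
            logits = model(images)  # raw scores, before softmax
            for row in logits:
                outputs[unlabeled[pos]] = row.tolist()
                pos += 1

    return {"outputs": outputs}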

config.yaml

Put all the settings that the model needs for training into config.yaml. It will be read and used in processes.py when the model trains. For example, if config.yaml looks like this:

LOG_DIR:  "./log"
DATA_DIR: "./data"
EXPT_DIR: "./log"
exp_name:  "ManualAL"

# Model configs
backbone:     "Resnet101"
description:  "Pedestrian detection"
epochs: 10
.
.

You can access these values inside any of the above 4 processes, for example as args["backbone"], args["description"], etc.
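
Concretely, with the config above, the same values show up on the args dictionary that the Pipeline passes to your processes (a small illustration):

import yaml

with open("./config.yaml", "r") as stream:
    args = yaml.safe_load(stream)

print(args["backbone"])     # "Resnet101"
print(args["description"])  # "Pedestrian detection"
print(args["epochs"])       # 10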

SDK Features

1. Tracking CO2 emissions

The Alectio SDK is capable of tracking CO2 emissions during the experiment. The SDK uses an open-source package called CodeCarbon to track CO2 emissions along with CPU, GPU, and RAM usage. This data is tracked and, once the experiment ends, synced with the user account, where the user can see the total CO2 emissions on their dashboard.

2. Time-Saved Information

The SDK uses linear interpolation to estimate the time a user saved training their model in each active learning cycle. The time-saved information is logged after each AL cycle and synced with the platform at the end of the experiment. The time-saved insights can be seen on the user dashboard.

3. Storing Hyperparameters

The SDK has the ability to track the hyperparameters for each AL cycle. To use this feature, the user just needs to return a dictionary of their hyperparameters from the train function. Currently, the SDK supports a limited number of hyperparameters; the list of these parameters is shown below:

hyperparameter_names = [
                "optimizer_name", # Name of the optimizer used
                "loss", # Loss of the training process
                "running_loss", # Running Loss 
                "epochs", # Number of epochs for which the model was trained 
                "batch_size", # batch size on which the model was trained
                "loss_function", # name of loss function used for training
                "activation", # List of activation functions used 
                "optimizer", # Can be a state_dict in case of Pytorch
            ]

The syntax for storing these values is shown in the train function section.

4. Running Multiple Seed Initialization

The SDK can also help the user choose the right seed for their experiment by training the model on a range of seed values and selecting the best seed based on the model's performance across those seeds. To use this feature, pass the multiple_initialisations argument to the Alectio Pipeline. The syntax is as shown below:

from alectio_sdk.sdk import Pipeline

AlectioPipeline = Pipeline(
    name=args["exp_name"],
    train_fn=train,
    test_fn=test,
    infer_fn=infer,
    getstate_fn=getdatasetstate,
    args=args,
    token="xxxxxx7041a6xxxxx7948cexxxxxxxx",
    multiple_initialisations={"seeds": [10, 42, 36, 78], "limit_value": 4000},
)

The input of this argument is a dict with 2 keys:

  • seeds: a list containing the different seed values you want to test your model on.
  • limit_value: the number of samples from which the training samples are selected.

5. Accessing Alectio Public Datasets

The user can access Alectio Public Datasets using the Alectio SDK. The user needs to select the public dataset they want to use when creating their project on the Alectio platform. Alectio Public Datasets contain training, validation, and testing data. The code snippets to use the public datasets are given below.

1. PyTorch
# Pytorch Syntax
import torchvision
from torchvision import transforms
from alectio_sdk.sdk.alectio_dataset import AlectioDataset
from torch.utils.data import DataLoader, Subset

# create a public dataset object
# token = experiment token
# root = directory in which you want to download your dataset
# framework = pytorch/tensorflow
alectio_dataset = AlectioDataset(token="your_exp_token_goes_here", root="./data", framework="pytorch")

# train dataset
train_transforms = transforms.Compose(
    [
        transforms.Resize((224, 224)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)

# call the get dataset function 
# dataset_type = train/test/validation
# transforms = augmentations/transformations you want to perform

# Returns 
# DataLoader Object | Length of dataset | Mapping of labels and indices
train_dataset, train_dataset_len, train_class_to_idx = alectio_dataset.get_dataset(
    dataset_type="train", transforms=train_transforms
)
2. TensorFlow
# Tensorflow Syntax
import tensorflow as tf
from alectio_sdk.sdk.alectio_dataset import AlectioDataset

# create a public dataset object
# token = experiment token
# root = directory in which you want to download your dataset
# framework = pytorch/tensorflow
alectio_dataset = AlectioDataset(token="your_exp_token_goes_here", root="./data", framework="tensorflow")

# train dataset
# all transforms supported by the TensorFlow ImageDataGenerator can be added to the transform dict
train_transforms = dict(
    featurewise_center=False,
    samplewise_center=False,
    featurewise_std_normalization=False,
    samplewise_std_normalization=False,
    zca_whitening=False,
    channel_shift_range=0.0,
    fill_mode='nearest',
    cval=0.0,
    horizontal_flip=False,
    vertical_flip=False,
    rescale=None,
    preprocessing_function=None,
    data_format=None,
)

# call the get dataset function 
# dataset_type = train/test/validation
# transforms = dict of augmentations/transformations you want to perform

# Returns 
# Imagedatagenerator Object | Length of dataset | Mapping of labels and indices
train_dataset, train_dataset_len, train_class_to_idx = alectio_dataset.get_dataset(
    dataset_type="train", transforms=train_transforms
)
