
Pipelines and primitives for machine learning and data science.

MLBlocks is a simple framework for composing end-to-end tunable machine learning pipelines.

Overview

At a high level:

  • Machine learning primitives are specified using standardized JSON files.
  • The user (or an external automated engine) specifies a list of primitives.
  • The library transforms the JSON specifications of the machine learning primitives (blocks) into MLBlock instances, which expose tunable hyperparameters, and composes them into an MLPipeline.
  • The pipeline.fit and pipeline.predict methods then allow the user to fit the pipeline to data and make predictions on new data.
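
The flow above can be sketched with a toy stand-in. This is not the real MLBlocks implementation; the class names and block logic here are invented purely to illustrate how blocks chain their outputs through fit and predict:

```python
# Toy illustration of the pipeline idea (NOT the real MLBlocks internals):
# each block exposes fit/produce, and the pipeline threads data through them.

class ToyBlock:
    """A stand-in 'primitive' that scales its inputs by a tunable factor."""
    def __init__(self, factor=1.0):
        self.hyperparameters = {"factor": factor}

    def fit(self, X):
        pass  # this toy block needs no fitting

    def produce(self, X):
        f = self.hyperparameters["factor"]
        return [x * f for x in X]


class ToyPipeline:
    """Chains blocks: each block's output is the next block's input."""
    def __init__(self, blocks):
        self.blocks = blocks

    def fit(self, X):
        for block in self.blocks:
            block.fit(X)
            X = block.produce(X)

    def predict(self, X):
        for block in self.blocks:
            X = block.produce(X)
        return X


pipeline = ToyPipeline([ToyBlock(2.0), ToyBlock(3.0)])
pipeline.fit([1, 2])
print(pipeline.predict([1, 2]))  # [6.0, 12.0]
```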

Project Structure

The MLBlocks library consists of the following modules and folders:

  • mlblocks.mlblocks: Defines the MLBlock core class of the library.
  • mlblocks.mlpipeline: Defines the MLPipeline class that allows combining multiple MLBlock instances.
  • mlblocks_primitives: folder that contains the collection of JSON primitives. This folder can either be provided by the user or installed via the MLPrimitives subproject.

Primitive JSONs

The primitive JSONs are the main component of our library. The contents of these JSON files vary slightly depending on the model's source library, but they all share a common structure.

Examples of such JSON files can be found inside the examples folder.
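
To give a feel for the common structure, the fragment below sketches what a primitive annotation for a scikit-learn classifier might look like. The field names and layout here are illustrative assumptions based on the 0.2.0 format described in the release notes, not an authoritative specification; consult the examples folder or the MLPrimitives repository for real annotations:

```json
{
    "name": "sklearn.ensemble.RandomForestClassifier",
    "primitive": "sklearn.ensemble.RandomForestClassifier",
    "fit": {
        "method": "fit",
        "args": [
            {"name": "X", "type": "ndarray"},
            {"name": "y", "type": "array"}
        ]
    },
    "produce": {
        "method": "predict",
        "args": [
            {"name": "X", "type": "ndarray"}
        ],
        "output": [
            {"name": "y", "type": "array"}
        ]
    },
    "hyperparameters": {
        "fixed": {
            "n_jobs": {"type": "int", "default": -1}
        },
        "tunable": {
            "n_estimators": {"type": "int", "default": 30, "range": [2, 500]},
            "max_depth": {"type": "int", "default": 10, "range": [1, 30]}
        }
    }
}
```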

Installation

Install with pip

The simplest and recommended way to install MLBlocks is using pip:

pip install mlblocks

Install from sources

You can also clone the repository and install it from sources:

git clone git@github.com:HDI-Project/MLBlocks.git
cd MLBlocks
pip install -e .

Usage

The following points cover the most basic usage of the MLBlocks library.

Note that in order to be able to execute the given code snippets, you will need to install a couple of additional libraries, which you can do by running:

pip install mlblocks[demo]

if you installed the library from PyPI, or

pip install -e .[demo]

if you installed from sources.

Initializing a pipeline

With MLBlocks, we can initialize a pipeline simply by passing it the list of primitives (blocks) that will compose it.

>>> from mlblocks import MLPipeline
>>> pipeline = MLPipeline(['sklearn.ensemble.RandomForestClassifier'])

Obtaining and updating hyperparameters

Upon initialization, a pipeline has a set of default hyperparameters. For a particular data science problem, we may want to view or set the values of particular hyperparameters; for example, we may need to pass the current hyperparameter values of our pipeline to a third-party tuner.

The list of tunable hyperparameters can be obtained by calling the pipeline method get_tunable_hyperparameters.

>>> tunable_hp = pipeline.get_tunable_hyperparameters()
>>> import json
>>> print(json.dumps(tunable_hp, indent=4))
{
    "sklearn.ensemble.RandomForestClassifier#1": {
        "criterion": {
            "type": "str",
            "default": "entropy",
            "values": [
                "entropy",
                "gini"
            ]
        },
        "max_features": {
            "type": "str",
            "default": null,
            "range": [
                null,
                "auto",
                "log2"
            ]
        },
        "max_depth": {
            "type": "int",
            "default": 10,
            "range": [
                1,
                30
            ]
        },
        "min_samples_split": {
            "type": "float",
            "default": 0.1,
            "range": [
                0.0001,
                0.5
            ]
        },
        "min_samples_leaf": {
            "type": "float",
            "default": 0.1,
            "range": [
                0.0001,
                0.5
            ]
        },
        "n_estimators": {
            "type": "int",
            "default": 30,
            "values": [
                2,
                500
            ]
        },
        "class_weight": {
            "type": "str",
            "default": null,
            "range": [
                null,
                "balanced"
            ]
        }
    }
}
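
Many third-party tuners expect a flat mapping rather than this nested block-by-block layout. As one possible bridge, the sketch below flattens the nested specs into (block_name, hyperparameter_name) keys; the sample data mirrors a subset of the output shown above, and the flatten helper is our own illustration, not part of the MLBlocks API:

```python
# Flatten MLBlocks-style nested tunable-hyperparameter specs into
# (block_name, hyperparam_name) keys, as many tuners expect.
# `tunable_hp` below mirrors a subset of the get_tunable_hyperparameters output.

tunable_hp = {
    "sklearn.ensemble.RandomForestClassifier#1": {
        "max_depth": {"type": "int", "default": 10, "range": [1, 30]},
        "n_estimators": {"type": "int", "default": 30, "range": [2, 500]},
    }
}

def flatten(specs):
    """Turn {block: {name: spec}} into {(block, name): spec}."""
    return {
        (block, name): spec
        for block, params in specs.items()
        for name, spec in params.items()
    }

flat = flatten(tunable_hp)
print(sorted(name for _, name in flat))  # ['max_depth', 'n_estimators']
```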

To obtain the values that the hyperparameters of our pipeline currently have, the method get_hyperparameters can be used.

>>> current_hp = pipeline.get_hyperparameters()
>>> print(json.dumps(current_hp, indent=4))
{
    "sklearn.ensemble.RandomForestClassifier#1": {
        "n_jobs": -1,
        "criterion": "entropy",
        "max_features": null,
        "max_depth": 10,
        "min_samples_split": 0.1,
        "min_samples_leaf": 0.1,
        "n_estimators": 30,
        "class_weight": null
    }
}

Similarly, to set different hyperparameter values, the method set_hyperparameters can be used.

>>> new_hyperparameters = {'sklearn.ensemble.RandomForestClassifier#1': {'max_depth': 20}}
>>> pipeline.set_hyperparameters(new_hyperparameters)
>>> pipeline.get_hyperparameters()['sklearn.ensemble.RandomForestClassifier#1']['max_depth']
20
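
Going the other way, a flat proposal coming back from a tuner can be regrouped into the nested {block: {name: value}} layout that set_hyperparameters expects. The unflatten helper below is our own illustrative sketch, not part of the MLBlocks API:

```python
def unflatten(proposal):
    """Regroup {(block, name): value} pairs into {block: {name: value}}."""
    nested = {}
    for (block, name), value in proposal.items():
        nested.setdefault(block, {})[name] = value
    return nested

# A hypothetical tuner proposal, keyed by (block_name, hyperparam_name).
proposal = {
    ("sklearn.ensemble.RandomForestClassifier#1", "max_depth"): 20,
    ("sklearn.ensemble.RandomForestClassifier#1", "n_estimators"): 100,
}
nested = unflatten(proposal)
print(nested)
# {'sklearn.ensemble.RandomForestClassifier#1': {'max_depth': 20, 'n_estimators': 100}}
```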

Making predictions

Once we have set the appropriate hyperparameters for our pipeline, we can make predictions on a dataset.

To do this, we first call the fit method if necessary. This takes in training data and labels as well as any other parameters each individual step may use during fitting.

>>> from sklearn.datasets import load_wine
>>> from sklearn.model_selection import train_test_split
>>> wine = load_wine()
>>> X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target)
>>> pipeline.fit(X_train, y_train)

Once we have fit our model to our data, we can simply make predictions. From these predictions, we can do useful things, such as obtain an accuracy score.

>>> y_pred = pipeline.predict(X_test)
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_test, y_pred)
1.0
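
Putting the pieces together, a naive random-search tuner can sample candidates from the tunable ranges and keep the best one. In the sketch below the pipeline and scoring are stubbed out so the sampling logic stands alone; a real loop would instead call pipeline.set_hyperparameters with each candidate, fit on the training split, and score predictions on held-out data:

```python
import random

# Tunable specs in the format shown earlier (subset, for illustration only).
specs = {
    "max_depth": {"type": "int", "range": [1, 30]},
    "n_estimators": {"type": "int", "range": [2, 500]},
}

def sample(specs, rng):
    """Draw one hyperparameter candidate uniformly from the spec ranges."""
    candidate = {}
    for name, spec in specs.items():
        low, high = spec["range"]
        candidate[name] = rng.randint(low, high)
    return candidate

def score(candidate):
    # Stand-in for: pipeline.set_hyperparameters(...); pipeline.fit(X_train, y_train);
    # accuracy_score(y_test, pipeline.predict(X_test))
    return -abs(candidate["max_depth"] - 15)

rng = random.Random(0)
best = max((sample(specs, rng) for _ in range(20)), key=score)
print(best)
```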

History

In its first iteration, in 2015, MLBlocks was designed only for multi-table, multi-entity temporal data. A good reference for our design rationale at that time is Bryan Collazo’s thesis.

With the recent availability of a multitude of libraries and tools, we decided it was time to integrate them and expand the library to address other data types (images, text, graphs, time series) and to integrate with deep learning libraries.

Changelog

0.2.0 - New MLBlocks API

A new MLBlocks API and Primitive format.

This is a summary of the changes:

  • Primitive JSONs and Python code have been moved to a separate repository, called MLPrimitives.
  • Optional usage of multiple JSON primitive folders.
  • JSON format has been changed to allow more flexibility and features:
    • input and output arguments, as well as argument types, can be specified for each method
    • both classes and functions are supported as primitives
    • multitype and conditional hyperparameters fully supported
    • data modalities and primitive classifiers introduced
    • metadata such as documentation, description and author fields added
  • Parsers are removed, and now the MLBlock class is responsible for loading and reading the JSON primitive.
  • Multiple blocks of the same primitive are supported within the same pipeline.
  • Arbitrary inputs and outputs for both pipelines and blocks are allowed.
  • Shared variables during pipeline execution, usable by multiple blocks.

0.1.9 - Bugfix Release

  • Disable some NetworkX functions for incompatibilities with some types of graphs.

0.1.8 - New primitives and some improvements

  • Improve the NetworkX primitives.
  • Add String Vectorization and Datetime Featurization primitives.
  • Refactor some Keras primitives to work with single dimension y arrays and be compatible with pickle.
  • Add XGBClassifier and XGBRegressor primitives.
  • Add some keras.applications pretrained networks as preprocessing primitives.
  • Add helper class to allow function primitives.

0.1.7 - Nested hyperparams dicts

  • Support passing hyperparams as nested dicts.

0.1.6 - Text and Graph Pipelines

  • Add LSTM classifier and regressor primitives.
  • Add OneHotEncoder and MultiLabelEncoder primitives.
  • Add several NetworkX graph featurization primitives.
  • Add community.best_partition primitive.

0.1.5 - Collaborative Filtering Pipelines

  • Add LightFM primitive.

0.1.4 - Image pipelines improved

  • Allow passing init_params on MLPipeline creation.
  • Fix bug with MLHyperparam types and Keras.
  • Rename produce_params as predict_params.
  • Add SingleCNN Classifier and Regressor primitives.
  • Simplify and improve Trivial Predictor.

0.1.3 - Multi Table pipelines improved

  • Improve RandomForest primitive ranges.
  • Improve DFS primitive.
  • Add Tree Based Feature Selection primitives.
  • Fix bugs in TrivialPredictor.
  • Improve documentation.

0.1.2 - Bugfix release

  • Fix bug in TrivialMedianPredictor.
  • Fix bug in OneHotLabelEncoder.

0.1.1 - Single Table pipelines improved

  • New project structure and primitives for integration into MIT-TA2.
  • MIT-TA2 default pipelines and single table pipelines fully working.

0.1.0

  • First release on PyPI.
