Pipelines and primitives for machine learning and data science.

Maintainers

csala liudy mit_dai_lab smish

An Open Source Project from the Data to AI Lab, at MIT

MLPrimitives

Pipelines and primitives for machine learning and data science.

Documentation: https://MLBazaar.github.io/MLPrimitives
Github: https://github.com/MLBazaar/MLPrimitives
License: MIT
Development Status: Pre-Alpha

Overview

This repository contains primitive annotations to be used by the MLBlocks library, as well as the necessary Python code to make some of them fully compatible with the MLBlocks API requirements.

There is also a collection of custom primitives contributed directly to this library, which either combine third party tools or implement new functionalities from scratch.

Why did we create this library?

Too many libraries in a fast growing field
Huge societal need to build machine learning apps
Domain expertise resides at several places (knowledge of math)
No documented information about hyperparameters, behavior...

Installation

Requirements

MLPrimitives has been developed and tested on Python 3.8, 3.9, 3.10, and 3.11.

Also, although it is not strictly required, the usage of a virtualenv is highly recommended in order to avoid interfering with other software installed in the system where MLPrimitives is run.

Install with pip

The easiest and recommended way to install MLPrimitives is using pip:

pip install mlprimitives

This will pull and install the latest stable release from PyPI.
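To confirm that the installation worked, the installed version can be queried from Python. The following is a minimal sketch that relies only on the standard library; the helper name installed_version is hypothetical and not part of MLPrimitives.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="mlprimitives"):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The package is not installed in the current environment.
        return None


# Prints the version string, or None when mlprimitives is not installed.
print(installed_version())
```

If this prints None inside your virtualenv, the pip command above most likely ran against a different Python environment.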

If you want to install from source or contribute to the project, please read the Contributing Guide.

Quickstart

This section is a short series of tutorials to help you get started with MLPrimitives.

In the following steps you will learn how to load and run a primitive on some data.

Later on you will learn how to evaluate and improve the performance of a primitive by tuning its hyperparameters.

Running a Primitive

In this first tutorial, we will be executing a single primitive for data transformation.

1. Load a Primitive

The first step in order to run a primitive is to load it.

This will be done using the mlprimitives.load_primitive function, which will load the indicated primitive as an MLBlock object from MLBlocks.

In this case, we will load the mlprimitives.custom.feature_extraction.CategoricalEncoder primitive.

from mlprimitives import load_primitive

primitive = load_primitive('mlprimitives.custom.feature_extraction.CategoricalEncoder')

2. Load some data

The CategoricalEncoder is a transformation primitive which applies one-hot encoding to all the categorical columns of a pandas.DataFrame.

So, in order to be able to run our primitive, we will first load some data that contains categorical columns.

This can be done with the mlprimitives.datasets.load_census function:

from mlprimitives.datasets import load_census

dataset = load_census()

This dataset object has an attribute data which contains a table with several categorical columns.

We can have a look at this table by executing dataset.data.head(), which will return a table like this:

                             0                    1                   2
age                         39                   50                  38
workclass            State-gov     Self-emp-not-inc             Private
fnlwgt                   77516                83311              215646
education            Bachelors            Bachelors             HS-grad
education-num               13                   13                   9
marital-status   Never-married   Married-civ-spouse            Divorced
occupation        Adm-clerical      Exec-managerial   Handlers-cleaners
relationship     Not-in-family              Husband       Not-in-family
race                     White                White               White
sex                       Male                 Male                Male
capital-gain              2174                    0                   0
capital-loss                 0                    0                   0
hours-per-week              40                   13                  40
native-country   United-States        United-States       United-States

3. Fit the primitive

In order to run our primitive, we first need to fit it.

This is the process where it analyzes the data to detect which columns are categorical.

This is done by calling its fit method and passing the dataset.data as X.

primitive.fit(X=dataset.data)

4. Produce results

Once the primitive is fit, we can process the data by calling the produce method of the primitive instance and passing the data again as X.

transformed = primitive.produce(X=dataset.data)

After this is done, we can see how the transformed data contains the newly generated one-hot vectors:

                                                0      1       2       3       4
age                                            39     50      38      53      28
fnlwgt                                      77516  83311  215646  234721  338409
education-num                                  13     13       9       7      13
capital-gain                                 2174      0       0       0       0
capital-loss                                    0      0       0       0       0
hours-per-week                                 40     13      40      40      40
workclass= Private                              0      0       1       1       1
workclass= Self-emp-not-inc                     0      1       0       0       0
workclass= Local-gov                            0      0       0       0       0
workclass= ?                                    0      0       0       0       0
workclass= State-gov                            1      0       0       0       0
workclass= Self-emp-inc                         0      0       0       0       0
...                                             ...    ...     ...     ...     ...
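To make the fit and produce steps above more concrete, here is a minimal pure-Python sketch of one-hot encoding. It is an illustration only and not the CategoricalEncoder implementation: the function one_hot_encode, the list-of-dictionaries input and the "column=value" naming are assumptions chosen to resemble the output shown above.

```python
def one_hot_encode(rows, categorical_columns):
    """One-hot encode the given columns of a list of row dictionaries."""
    # "Fit" step: collect the distinct values of each categorical column,
    # keeping the order in which they were first seen.
    categories = {
        column: list(dict.fromkeys(row[column] for row in rows))
        for column in categorical_columns
    }

    # "Produce" step: copy the other columns unchanged and replace each
    # categorical column with one 0/1 column per known value.
    encoded = []
    for row in rows:
        new_row = {key: val for key, val in row.items() if key not in categories}
        for column, values in categories.items():
            for value in values:
                new_row[f"{column}={value}"] = int(row[column] == value)
        encoded.append(new_row)

    return encoded


rows = [
    {"age": 39, "workclass": "State-gov"},
    {"age": 50, "workclass": "Self-emp-not-inc"},
    {"age": 38, "workclass": "Private"},
]
encoded = one_hot_encode(rows, ["workclass"])
print(encoded[0])
```

Each row keeps its numeric columns and gains one indicator column per category, of which exactly one is set to 1.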

Tuning a Primitive

In this short tutorial we will teach you how to evaluate the performance of a primitive and improve it by modifying its hyperparameters.

To do so, we will load a primitive that can learn from the transformed data that we just generated and later on make predictions based on new data.

1. Load another primitive

First of all, we will load the xgboost.XGBClassifier primitive that we will use afterwards.

primitive = load_primitive('xgboost.XGBClassifier')

2. Split the dataset

Before being able to evaluate the primitive performance, we need to split the data in two parts: train, which will be used for the primitive to learn, and test, which will be used to make the predictions that later on will be evaluated.

In order to do this, we will get the first 75% of rows from the transformed data that we obtained above and call it X_train, and then set the remaining 25% of rows as X_test.

train_size = int(len(transformed) * 0.75)
X_train = transformed.iloc[:train_size]
X_test = transformed.iloc[train_size:]

Similarly, we need to obtain the y_train and y_test variables containing the corresponding output values.

y_train = dataset.target[:train_size]
y_test = dataset.target[train_size:]
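The arithmetic behind this positional split can be checked on a tiny example. The sketch below uses a plain list instead of the real data, and the helper name split_point is hypothetical.

```python
def split_point(n_rows, train_fraction=0.75):
    """Return the index that separates the train rows from the test rows."""
    # int() truncates, so the train part is rounded down when the
    # product is not a whole number.
    return int(n_rows * train_fraction)


rows = list(range(10))
train_size = split_point(len(rows))

# Same slicing pattern as above: everything before the index is train,
# everything from the index onwards is test.
train, test = rows[:train_size], rows[train_size:]
print(train_size, len(train), len(test))
```

With 10 rows the split index is 7, which leaves 7 rows for training and 3 for testing. Note that the rows are taken in their original order; nothing is shuffled.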

3. Fit the new primitive

Once we have split the data, we can fit the primitive by passing X_train and y_train to its fit method.

primitive.fit(X=X_train, y=y_train)

4. Make predictions

Once the primitive has been fitted, we can produce predictions using the X_test data as input.

predictions = primitive.produce(X=X_test)

5. Evaluate the performance

We can now evaluate how good the predictions from our primitive are by using the score method from the dataset object on both the expected output and the real output from the primitive:

dataset.score(y_test, predictions)

This will output a float value between 0 and 1 indicating how good the predictions are, with 0 being the worst possible score and 1 the best.

In this case we will obtain a score around 0.866.
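To get an intuition for a score in the 0 to 1 range, the sketch below computes plain accuracy, the fraction of predictions that match the expected values. This is an assumption made for illustration: this text does not state which metric dataset.score uses, and the accuracy function here is hypothetical.

```python
def accuracy(expected, predicted):
    """Return the fraction of predictions that equal the expected value."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if len(expected) == 0:
        raise ValueError("cannot score an empty set of predictions")

    # Count the positions where prediction and expected value agree.
    matches = sum(1 for exp, pred in zip(expected, predicted) if exp == pred)
    return matches / len(expected)


# Three of the four predictions are correct.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Under this assumption, a score of 0.866 would mean that roughly 87 out of every 100 test rows were predicted correctly.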

6. Set new hyperparameter values

In order to improve the performance of our primitive we will try to modify a couple of its hyperparameters.

First we will see which hyperparameter values the primitive has by calling its get_hyperparameters method.

primitive.get_hyperparameters()

which will return a dictionary like this:

{
    "n_jobs": -1,
    "n_estimators": 100,
    "max_depth": 3,
    "learning_rate": 0.1,
    "gamma": 0,
    "min_child_weight": 1
}

Next, we will see which are the valid values for each one of those hyperparameters by calling its get_tunable_hyperparameters method:

primitive.get_tunable_hyperparameters()

For example, we will see that the max_depth hyperparameter has the following specification:

{
    "type": "int",
    "default": 3,
    "range": [
        3,
        10
    ]
}
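A specification like this one can be used to check a candidate value before setting it. The following sketch handles only the "int" case shown above; the helper is_valid_value is hypothetical, and treating both ends of the range as inclusive is an assumption.

```python
def is_valid_value(spec, value):
    """Check an integer candidate against a spec with "type" and "range" keys."""
    if spec.get("type") != "int":
        raise ValueError("this sketch only handles 'int' hyperparameters")

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False

    low, high = spec["range"]
    # Assumption: both bounds of the range are inclusive.
    return low <= value <= high


max_depth_spec = {"type": "int", "default": 3, "range": [3, 10]}
print(is_valid_value(max_depth_spec, 7))
print(is_valid_value(max_depth_spec, 12))
```

The first call accepts 7 because it lies within the range, while the second rejects 12.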

Next, we will choose a valid value, for example 7, and set it into the primitive using the set_hyperparameters method:

primitive.set_hyperparameters({'max_depth': 7})

7. Re-evaluate the performance

Once the new hyperparameter value has been set, we repeat the fit/produce/score cycle to evaluate the performance of this new hyperparameter value:

primitive.fit(X=X_train, y=y_train)
predictions = primitive.produce(X=X_test)
dataset.score(y_test, predictions)

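This time we obtain a score around 0.724. Since scores closer to 1 are better, compare it with the 0.866 obtained before to decide whether to keep the new value or try a different one.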
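The same cycle can be wrapped in a small loop that tries several values and records the score of each. This is a sketch that uses only the set_hyperparameters, fit and produce calls shown above; the function compare_hyperparameter_values and its arguments are hypothetical names, not part of MLPrimitives.

```python
def compare_hyperparameter_values(primitive, name, values,
                                  X_train, y_train, X_test, y_test, score):
    """Fit and score the primitive once per candidate value.

    Returns a dictionary mapping each candidate value to its score.
    """
    results = {}
    for value in values:
        # Same steps as in the tutorial: set, fit, produce, score.
        primitive.set_hyperparameters({name: value})
        primitive.fit(X=X_train, y=y_train)
        predictions = primitive.produce(X=X_test)
        results[value] = score(y_test, predictions)

    return results
```

With the objects from this tutorial, it could be called as compare_hyperparameter_values(primitive, 'max_depth', [3, 5, 7], X_train, y_train, X_test, y_test, dataset.score), and the best candidate picked with max(results, key=results.get).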

What's Next?

Do you want to learn more about the project, find out how to contribute to it, or browse the API Reference? Please check the corresponding sections of the documentation!

History

0.4.1 - 2024-11-15

Primitive Improvements

SimpleImputer primitive update - Issue #280 by @sarahmish

0.4.0 - 2024-03-22

General Improvements

Upgrade python versions 3.9, 3.10, and 3.11 - Issue #279 by @sarahmish
Adapt to statsmodels.tsa.arima_model.ARIMA deprecation - Issue #253 by @sarahmish

0.3.5 - 2023-04-14

General Improvements

Update mlblocks cap - Issue #278 by @sarahmish

0.3.4 - 2023-01-24

General Improvements

Update mlblocks cap - Issue #277 by @sarahmish

0.3.3 - 2023-01-20

General Improvements

Update dependencies - Issue #276 by @sarahmish

Adapter Improvements

Building model within fit in keras adapter - Issue #267 by @sarahmish

0.3.2 - 2021-11-09

Adapter Improvements

Inferring data shapes with single dimension for keras adapter - Issue #265 by @sarahmish

0.3.1 - 2021-10-07

Adapter Improvements

Dynamic target_shape in keras adapter - Issue #263 by @sarahmish
Save keras primitives in Windows environment - Issue #261 by @sarahmish

General Improvements

Update TensorFlow and NumPy dependency - Issue #259 by @sarahmish

0.3.0 - 2021-01-09

New Primitives

Add primitive sklearn.naive_bayes.GaussianNB - Issue #242 by @sarahmish
Add primitive sklearn.linear_model.SGDClassifier - Issue #241 by @sarahmish

Primitive Improvements

Add offset to rolling_window_sequence primitive - Issue #251 by @skyeeiskowitz
Rename the time_index column to time - Issue #252 by @pvk-developer
Update featuretools dependency - Issue #250 by @pvk-developer

General Improvements

Update dependencies and add Python 3.8 - Issue #246 by @csala
Drop Python 3.5 - Issue #244 by @csala

0.2.5 - 2020-07-29

Primitive Improvements

Accept timedelta window_size in cutoff_window_sequences - Issue #239 by @joanvaquer

Bug Fixes

ImportError: Keras requires TensorFlow 2.2 or higher. Install TensorFlow via pip install tensorflow - Issue #237 by @joanvaquer

New Primitives

Add pandas.DataFrame.set_index primitive - Issue #222 by @JDTheRipperPC

0.2.4 - 2020-01-30

New Primitives

Add RangeScaler and RangeUnscaler primitives - Issue #232 by @csala

Primitive Improvements

Extract input_shape from X in keras.Sequential - Issue #223 by @csala

Bug Fixes

mlprimitives.custom.text.TextCleaner fails if text is empty - Issue #228 by @csala
Error when loading the reviews dataset - Issue #230 by @csala
Curate dependencies: specify an explicit prompt-toolkit version range - Issue #224 by @csala

0.2.3 - 2019-11-14

New Primitives

Add primitive to make window_sequences based on cutoff times - Issue #217 by @csala
Create a keras LSTM based TimeSeriesClassifier primitive - Issue #218 by @csala
Add pandas DataFrame primitives - Issue #214 by @csala
Add featuretools.EntitySet.normalize_entity primitive - Issue #209 by @csala

Primitive Improvements

Make featuretools.EntitySet.entity_from_dataframe entityset arg optional - Issue #208 by @csala
Add text regression dataset - Issue #206 by @csala

Bug Fixes

pandas.DataFrame.resample crash when grouping by integer columns - Issue #211 by @csala

0.2.2 - 2019-10-08

New Primitives

Add primitives for GAN based time-series anomaly detection - Issue #200 by @AlexanderGeiger
Add numpy.reshape and numpy.ravel primitives - Issue #197 by @AlexanderGeiger
Add feature selection primitive based on Lasso - Issue #194 by @csala

Primitive Improvements

feature_extraction.CategoricalEncoder support dtype category - Issue #196 by @csala

0.2.1 - 2019-09-09

New Primitives

Timeseries Intervals to Mask Primitive - Issue #186 by @AlexanderGeiger
Add new primitive: Arima model - Issue #168 by @AlexanderGeiger

Primitive Improvements

Curate PCA primitive hyperparameters - Issue #190 by @AlexanderGeiger
Add option to drop rolling window sequences - Issue #186 by @AlexanderGeiger

Bug Fixes

scikit-image==0.14.3 crashes when installed on Mac - Issue #188 by @csala

0.2.0

New Features

Publish the pipelines as an entry_point Issue #175 by @csala

Primitive Improvements

Improve pandas.DataFrame.resample primitive Issue #177 by @csala
Improve feature_extractor primitives Issue #183 by @csala
Improve find_anomalies primitive Issue #180 by @AlexanderGeiger

Bug Fixes

Typo in the primitive keras.Sequential.LSTMTimeSeriesRegressor Issue #176 by @DanielCalvoCerezo

0.1.10

New Features

Add function to run primitives without a pipeline Issue #43 by @csala

New Pipelines

Add pipelines for all the MLBlocks examples Issue #162 by @csala

Primitive Improvements

Add Early Stopping to keras.Sequential.LSTMTimeSeriesRegressor primitive Issue #156 by @csala
Make FeatureExtractor primitives accept Numpy arrays Issue #165 by @csala
Add window size and pruning to the timeseries_anomalies.find_anomalies primitive Issue #160 by @csala

0.1.9

New Features

Add a single table binary classification dataset Issue #141 by @csala

New Primitives

Add Multilayer Perceptron (MLP) primitive for binary classification Issue #140 by @Hector-hedb12
Add primitive for Sequence classification with LSTM Issue #150 by @Hector-hedb12
Add VGG-like convnet primitive Issue #149 by @Hector-hedb12
Add Multilayer Perceptron (MLP) primitive for multi-class softmax classification Issue #139 by @Hector-hedb12
Add primitive to count feature matrix columns Issue #146 by @csala

Primitive Improvements

Add additional fit and predict arguments to keras.Sequential Issue #161 by @csala
Add support for keras.Sequential Callbacks Issue #159 by @csala
Add fixed hyperparam to control keras.Sequential verbosity Issue #143 by @csala

0.1.8

New Primitives

mlprimitives.custom.timeseries_preprocessing.time_segments_average - Issue #137

New Features

Add target_index output in timeseries_preprocessing.rolling_window_sequences - Issue #136

0.1.7

General Improvements

Validate JSON format in make lint - Issue #133
Add demo datasets - Issue #131
Improve featuretools.dfs primitive - Issue #127

New Primitives

pandas.DataFrame.resample - Issue #123
pandas.DataFrame.unstack - Issue #124
featuretools.EntitySet.add_relationship - Issue #126
featuretools.EntitySet.entity_from_dataframe - Issue #126

Bug Fixes

Bug in timeseries_anomalies.py - Issue #119

0.1.6

General Improvements

Add Contributing Documentation
Remove upper bound in pandas version given new release of featuretools v0.6.1
Improve LSTMTimeSeriesRegressor hyperparameters

New Primitives

mlprimitives.candidates.dsp.SpectralMask
mlprimitives.custom.timeseries_anomalies.find_anomalies
mlprimitives.custom.timeseries_anomalies.regression_errors
mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences
mlprimitives.custom.timeseries_preprocessing.time_segments_average
sklearn.linear_model.ElasticNet
sklearn.linear_model.Lars
sklearn.linear_model.Lasso
sklearn.linear_model.MultiTaskLasso
sklearn.linear_model.Ridge

0.1.5

New Primitives

sklearn.impute.SimpleImputer
sklearn.preprocessing.MinMaxScaler
sklearn.preprocessing.MaxAbsScaler
sklearn.preprocessing.RobustScaler
sklearn.linear_model.LinearRegression

General Improvements

Separate curated from candidate primitives
Setup entry_points in setup.py to improve compatibility with MLBlocks
Add a test-pipelines command to test all the existing pipelines
Clean sklearn example pipelines
Change the author entry to a contributors list
Change the name of mlblocks_primitives folder
Pip install requirements_dev.txt fail documentation

Bug Fixes

Fix LSTMTimeSeriesRegressor primitive. Issue #90
Fix timeseries primitives. Issue #91
Negative index anomalies in timeseries_errors. Issue #89
Keep pandas version below 0.24.0. Issue #87

0.1.4

New Primitives

mlprimitives.timeseries primitives for timeseries data preprocessing
mlprimitives.timeseries_error primitives for timeseries anomaly detection
keras.Sequential.LSTMTimeSeriesRegressor
sklearn.neighbors.KNeighbors Classifier and Regressor
several sklearn.decomposition primitives
several sklearn.ensemble primitives

Bug Fixes

Fix typo in mlprimitives.text.TextCleaner primitive
Fix bug in index handling in featuretools.dfs primitive
Fix bug in SingleLayerCNNImageClassifier annotation
Remove old validation tags from JSON annotations

0.1.3

New Features

Fix and re-enable featuretools.dfs primitive.

0.1.2

New Features

Add pipeline specification language and Evaluation utilities.
Add pipelines for graph, text and tabular problems.
New primitives ClassEncoder and ClassDecoder
New primitives UniqueCounter and VocabularyCounter

Bug Fixes

Fix TrivialPredictor bug when working with numpy arrays
Change XGB default learning rate and number of estimators

0.1.1

New Features

Add more keras.applications primitives.
Add a Text Cleanup primitive.

Bug Fixes

Add keywords to keras.preprocessing primitives.
Fix the image_transform method.
Add epoch as a fixed hyperparameter for keras.Sequential primitives.

0.1.0

First release on PyPI.

