Automates Machine Learning Pipeline with Feature Engineering and Hyper-Parameters Tuning

Project description

MLJAR Automated Machine Learning

Documentation: https://supervised.mljar.com/

Source Code: https://github.com/mljar/mljar-supervised

Automated Machine Learning :rocket:

mljar-supervised is an Automated Machine Learning (AutoML) Python package that works with tabular data. It is designed to save data scientists time. It abstracts the common steps of preprocessing the data, constructing machine learning models, and tuning hyper-parameters to find the best model :trophy:. It is not a black box: you can see exactly how each ML pipeline is constructed (with a detailed Markdown report for each ML model).

The mljar-supervised will help you with:

explaining and understanding your data (Automatic Exploratory Data Analysis),
trying many different machine learning models (Algorithm Selection and Hyper-Parameters tuning),
creating Markdown reports from the analysis with details about all models (Automatic Documentation),
saving, re-running and loading the analysis and ML models.

It has three built-in modes:

Explain mode, which is ideal for explaining and understanding the data, with many data explanations, such as decision tree visualizations, linear model coefficients, permutation importances, and SHAP explanations of the data,
Perform mode, for building ML pipelines to use in production,
Compete mode, which trains highly-tuned ML models with ensembling and stacking, intended for use in ML competitions.

Of course, you can further customize the details of each mode to meet the requirements.
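
As a rough illustration of such customization, the sketch below passes extra keyword arguments next to mode. The names mode, stack_models, explain_level, and total_time_limit appear in this README; the algorithms keyword, the meaning of explain_level=0, and the unit of total_time_limit are assumptions, so check the docs for your version. The import is guarded only so the snippet still runs where the package is not installed.

```python
# Hypothetical settings for illustration; values are not recommendations.
custom_settings = {
    "mode": "Perform",
    "total_time_limit": 600,  # assumed to be seconds for the whole run
    "algorithms": ["Linear", "Random Forest", "Xgboost"],  # assumed keyword name
    "explain_level": 0,  # assumed to switch explanations off
    "stack_models": False,
}

try:
    from supervised.automl import AutoML
except ImportError:
    AutoML = None  # package not installed; the settings dict is still usable

# Build the AutoML object only when the package is available.
automl = AutoML(**custom_settings) if AutoML is not None else None
```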

What's good in it? :boom:

It uses many algorithms: Baseline, Linear, Random Forest, Extra Trees, LightGBM, Xgboost, CatBoost, Neural Networks, and Nearest Neighbors.
It can build an ensemble using the greedy algorithm from the Caruana paper.
It can stack models to build a level 2 ensemble (available in Compete mode or after setting the stack_models parameter).
It can preprocess features, for example by imputing missing values and converting categoricals. It can also preprocess target values.
It can do advanced feature engineering, such as Golden Features, Features Selection, and Text and Time Transformations.
It can tune hyper-parameters with a not-so-random-search algorithm (random search over a defined set of values) and hill climbing to fine-tune the final models.
It can compute the Baseline for your data, so you will know whether you need Machine Learning or not!
It has extensive explanations. The package trains simple Decision Trees with max_depth <= 5, so you can easily visualize them with the amazing dtreeviz to better understand your data.
mljar-supervised uses simple linear regression and includes its coefficients in the summary report, so you can check which features are used the most in the linear model.
It cares about the explainability of models: for every algorithm, feature importance is computed based on permutation. Additionally, SHAP explanations are computed for every algorithm: feature importance, dependence plots, and decision plots (explanations can be switched off with the explain_level parameter).
There is automatic documentation for every ML experiment run with AutoML. mljar-supervised creates Markdown reports from AutoML training, full of ML details, metrics, and charts.
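
The tuning idea in the list above (random search over a defined set of values, followed by hill climbing) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's implementation: the search space, the toy_score function, and all helper names are hypothetical, and the score stands in for a real validation loss.

```python
import random

# Hypothetical search space: each hyper-parameter has a small, predefined set of values.
SEARCH_SPACE = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "subsample": [0.5, 0.7, 0.9, 1.0],
}


def toy_score(params):
    """Stand-in for a validation loss (lower is better); not a real model."""
    return (
        abs(params["max_depth"] - 5)
        + 10 * abs(params["learning_rate"] - 0.1)
        + abs(params["subsample"] - 0.9)
    )


def random_candidates(space, n, rng):
    """Draw n parameter sets, each value picked from its predefined list."""
    return [{name: rng.choice(values) for name, values in space.items()} for _ in range(n)]


def neighbours(params, space):
    """All parameter sets that differ by one step in exactly one hyper-parameter."""
    result = []
    for name, values in space.items():
        idx = values.index(params[name])
        for step in (-1, 1):
            if 0 <= idx + step < len(values):
                result.append({**params, name: values[idx + step]})
    return result


def hill_climb(start, space, score, max_steps=20):
    """Move to the best neighbour until no neighbour improves the score."""
    best, best_score = start, score(start)
    for _ in range(max_steps):
        candidate = min(neighbours(best, space), key=score)
        if score(candidate) >= best_score:
            break
        best, best_score = candidate, score(candidate)
    return best, best_score


rng = random.Random(42)
# Step 1: random search over the defined values; keep the best candidate.
start = min(random_candidates(SEARCH_SPACE, 5, rng), key=toy_score)
# Step 2: fine-tune that candidate with hill climbing.
tuned, tuned_score = hill_climb(start, SEARCH_SPACE, toy_score)
print(tuned, tuned_score)
```

Restricting the search to predefined values keeps the number of candidates small, and hill climbing then spends the remaining budget only around the most promising one.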

Automatic Documentation

The AutoML Report

The report from running AutoML contains a table with information about each model's score and the time needed to train it. Each model has a link that you can click to see the model's details. The performance of all ML models is presented as scatter and box plots, so you can visually inspect which algorithms perform best :trophy:.

AutoML leaderboard

The `Decision Tree` Report

An example Decision Tree summary with tree visualization. For classification tasks, additional metrics are provided:

confusion matrix
threshold (optimized in the case of binary classification task)
F1 score
Accuracy
Precision, Recall, MCC
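
For readers who want to see how these numbers relate, here is a small plain-Python sketch that derives them from a confusion matrix for a binary task. The toy labels and probabilities are made up, and choosing the threshold by maximizing F1 is an assumption for illustration; the README does not say which criterion the package optimizes.

```python
import math


def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, fn, tn) for binary labels and predicted probabilities."""
    tp = fp = fn = tn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall, F1 and MCC at a given threshold."""
    tp, fp, fn, tn = confusion_counts(y_true, y_prob, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


def best_threshold(y_true, y_prob, metric="f1"):
    """Try each predicted probability as a threshold and keep the best one."""
    return max(sorted(set(y_prob)), key=lambda t: metrics(y_true, y_prob, t)[metric])


# Toy data, made up for this example.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55]

report = metrics(y_true, y_prob)  # metrics at the default 0.5 threshold
threshold = best_threshold(y_true, y_prob)  # threshold that maximizes F1
print(report, threshold)
```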

Decision Tree summary

The `LightGBM` Report

An example LightGBM summary:

LightGBM summary

Available Modes :books:

Details about the AutoML modes are presented in a table in the docs.

Explain

automl = AutoML(mode="Explain")

It is intended for cases where the user wants to explain and understand the data.

It uses a 75%/25% train/test split.
It uses: Baseline, Linear, Decision Tree, Random Forest, Xgboost, and Neural Network algorithms, plus an ensemble.
It has full explanations: learning curves, importance plots, and SHAP plots.
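
The importance plots are described earlier as being computed by permutation. A minimal plain-Python sketch of that technique follows, assuming a regression setting with mean squared error; the toy model, the data, and all function names are hypothetical and the package's own computation may differ.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in error when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [value] + row[col + 1:] for row, value in zip(X, shuffled)]
            error = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Toy 'model': strong use of feature 0, weak use of feature 1, none of feature 2."""
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)
```

A feature the model ignores gets an importance of zero, because shuffling it cannot change the predictions.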

Perform

automl = AutoML(mode="Perform")

It should be used when the user wants to train a model that will be used in real-life use cases.

It uses 5-fold CV.
It uses: Linear, Random Forest, LightGBM, Xgboost, CatBoost, and Neural Network. It uses ensembling.
It has learning curves and importance plots in reports.
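
To make the difference between the 75%/25% split of Explain mode and the 5-fold CV of Perform mode concrete, here is a plain-Python sketch that produces the index sets for both. It is a simplification under stated assumptions: the function names are hypothetical, there is no stratification, and the package's own splitting may differ.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Single shuffled split, for example 75% train and 25% test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train, test = train_test_split_indices(100)
splits = list(k_fold_indices(103, k=5))
for fold_number, (fold_train, fold_validation) in enumerate(splits, start=1):
    print(fold_number, len(fold_train), len(fold_validation))
```

With a single split, one quarter of the data is never used for training; with k folds, every sample is used for validation once and for training k - 1 times, at the cost of training k models.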

Compete

automl = AutoML(mode="Compete")

It should be used for machine learning competitions.

It adapts the validation strategy depending on dataset size and total_time_limit. It can be: train/test split (80/20), 5-fold CV or 10-fold CV.
It uses: Linear, Decision Tree, Random Forest, Extra Trees, LightGBM, Xgboost, CatBoost, Neural Network, and Nearest Neighbors. It uses ensembling and stacking.
It has only learning curves in the reports.
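
The ensembling used by the modes is described earlier as a greedy algorithm from the Caruana paper. The sketch below shows the general idea, greedy forward selection with replacement over model predictions, in plain Python. The model names and numbers are made up, and this is a simplified illustration rather than the package's code.

```python
def mse(y_true, y_pred):
    """Mean squared error, used here as the loss to minimize."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average(predictions):
    """Element-wise mean of several prediction vectors."""
    return [sum(values) / len(values) for values in zip(*predictions)]


def greedy_ensemble(model_predictions, y_true, loss, max_rounds=10):
    """Repeatedly add the model (repeats allowed) that most lowers the ensemble loss."""
    selected = []
    best_loss = float("inf")
    for _ in range(max_rounds):
        trial_losses = {
            name: loss(y_true, average([model_predictions[m] for m in selected + [name]]))
            for name in model_predictions
        }
        name = min(trial_losses, key=trial_losses.get)
        if trial_losses[name] >= best_loss:
            break  # no candidate improves the ensemble any further
        selected.append(name)
        best_loss = trial_losses[name]
    # A model picked several times gets a proportionally larger weight.
    weights = {name: selected.count(name) / len(selected) for name in set(selected)}
    return weights, best_loss


# Hypothetical validation predictions from three models.
y_valid = [1.0, 2.0, 3.0, 4.0]
model_predictions = {
    "model_a": [1.5, 2.5, 3.5, 4.5],  # always 0.5 too high
    "model_b": [0.5, 1.5, 2.5, 3.5],  # always 0.5 too low
    "model_c": [2.0, 2.0, 2.0, 2.0],  # constant prediction
}

weights, ensemble_loss = greedy_ensemble(model_predictions, y_valid, mse)
print(weights, ensemble_loss)
```

In this toy case the two models with opposite errors cancel out, so the ensemble is better than any single model, which is the effect ensembling aims for.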

Examples

:point_right: Binary Classification Example

There is a simple interface available with fit and predict methods.

import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

df = pd.read_csv(
    "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv",
    skipinitialspace=True,
)
X_train, X_test, y_train, y_test = train_test_split(
    df[df.columns[:-1]], df["income"], test_size=0.25
)

automl = AutoML()
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)

AutoML fit will print:

Create directory AutoML_1
AutoML task to be solved: binary_classification
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will optimize for metric: logloss
1_Baseline final logloss 0.5519845471086654 time 0.08 seconds
2_DecisionTree final logloss 0.3655910192804364 time 10.28 seconds
3_Linear final logloss 0.38139916864708445 time 3.19 seconds
4_Default_RandomForest final logloss 0.2975204390214936 time 79.19 seconds
5_Default_Xgboost final logloss 0.2731086827200411 time 5.17 seconds
6_Default_NeuralNetwork final logloss 0.319812276905242 time 21.19 seconds
Ensemble final logloss 0.2731086821194617 time 1.43 seconds

The AutoML results in a Markdown report.
The Xgboost Markdown report; take a look at the amazing dependence plots produced by the SHAP package :sparkling_heart:
The Decision Tree Markdown report; take a look at the beautiful tree visualization :sparkles:
The Logistic Regression Markdown report; take a look at the coefficients table, and compare the SHAP plots between Xgboost, Decision Tree, and Logistic Regression :coffee:
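
In the log above, every model is scored with logloss and compared against the Baseline. As a rough sketch of what that comparison means, the plain-Python example below computes binary logloss for a baseline and for a toy model. It assumes the baseline predicts the positive-class frequency for every sample, which the README does not state; the labels and probabilities are made up.

```python
import math


def logloss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over samples (lower is better)."""
    total = 0.0
    for truth, prob in zip(y_true, y_prob):
        prob = min(max(prob, eps), 1 - eps)  # clip to avoid log(0)
        total -= truth * math.log(prob) + (1 - truth) * math.log(1 - prob)
    return total / len(y_true)


# Toy labels with 25% positives.
y_true = [1, 0, 0, 0]

# Assumed baseline: predict the positive-class frequency for every sample.
prior = sum(y_true) / len(y_true)
baseline_loss = logloss(y_true, [prior] * len(y_true))

# Made-up probabilities from a model that has learned something useful.
model_loss = logloss(y_true, [0.8, 0.1, 0.2, 0.1])

print(baseline_loss, model_loss)
```

If a trained model cannot beat the baseline loss, the features carry little usable signal for that target.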

:point_right: Multi-Class Classification Example

Example code for classifying the Optical Recognition of Handwritten Digits dataset. The code runs in less than 30 minutes and reaches a test accuracy of ~98%.

import pandas as pd
# scikit-learn utilities
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# mljar-supervised package
from supervised.automl import AutoML

# load the data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(digits.data), digits.target, stratify=digits.target, test_size=0.25,
    random_state=123
)

# train models with AutoML
automl = AutoML(mode="Perform")
automl.fit(X_train, y_train)

# compute the accuracy on test data
predictions = automl.predict_all(X_test)
print(predictions.head())
print("Test accuracy:", accuracy_score(y_test, predictions["label"].astype(int)))

:point_right: Regression Example

A regression example on the Boston house prices data. On test data it scores a mean squared error (MSE) of ~10.85.

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from supervised.automl import AutoML # mljar-supervised

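# Load the data
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older scikit-learn release.
housing = load_boston()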
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(housing.data, columns=housing.feature_names),
    housing.target,
    test_size=0.25,
    random_state=123,
)

# train models with AutoML
automl = AutoML(mode="Explain")
automl.fit(X_train, y_train)

# compute the MSE on test data
predictions = automl.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, predictions))

:point_right: More Examples

Income classification - it is a binary classification task on census data
Iris classification - it is a multiclass classification on Iris flowers data
House price regression - it is a regression task on Boston houses data

Documentation :books:

For details, please check the mljar-supervised docs.

Installation :package:

From the PyPI repository:

pip install mljar-supervised

From source code:

git clone https://github.com/mljar/mljar-supervised.git
cd mljar-supervised
python setup.py install

Installation for development

git clone https://github.com/mljar/mljar-supervised.git
cd mljar-supervised
virtualenv venv --python=python3.6
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements_dev.txt

Running in Docker:

FROM python:3.7-slim-buster
RUN apt-get update && apt-get -y update
RUN apt-get install -y build-essential python3-pip python3-dev
RUN pip3 -q install pip --upgrade
RUN pip3 install mljar-supervised jupyter
CMD ["jupyter", "notebook", "--port=8888", "--no-browser", "--ip=0.0.0.0", "--allow-root"]

Contributing

To get started take a look at our Contribution Guide for information about our process and where you can fit in!

Contributors

License :necktie:

mljar-supervised is provided under the MIT license.

MLJAR :heart:

mljar-supervised is an open-source project created by MLJAR. We care about ease of use in Machine Learning. mljar.com provides a beautiful and simple user interface for building machine learning models.

Release history

Version	Release date
1.1.7	Apr 10, 2024
1.1.6	Mar 8, 2024
1.1.5	Mar 4, 2024
1.1.4	Mar 4, 2024
1.1.3	Jan 22, 2024
1.1.2	Jan 8, 2024
1.1.1	Sep 26, 2023
1.1.0	Sep 21, 2023
1.0.2	Jul 6, 2023
1.0.1	Jul 6, 2023
1.0.0	Jun 26, 2023
0.11.5	Dec 30, 2022
0.11.4	Dec 14, 2022
0.11.3	Aug 16, 2022
0.11.2	Mar 2, 2022
0.11.1	Oct 1, 2021
0.11.0	Sep 6, 2021
0.10.6	Jun 8, 2021
0.10.5	Jun 8, 2021
0.10.4	May 14, 2021
0.10.3	Apr 1, 2021
0.10.2	Mar 17, 2021
0.10.1	Mar 16, 2021
0.10.0	Mar 16, 2021
0.9.1	Mar 2, 2021
0.9.0	Feb 27, 2021
0.8.9	Feb 5, 2021
0.8.8	Jan 30, 2021
0.8.7	Jan 29, 2021
0.8.6	Jan 29, 2021 (this version)
0.8.5	Jan 29, 2021
0.8.4	Jan 29, 2021
0.8.3	Jan 27, 2021
0.8.2	Jan 27, 2021
0.8.1	Jan 25, 2021
0.8.0	Jan 22, 2021
0.7.20	Jan 14, 2021
0.7.19	Jan 12, 2021
0.7.18	Jan 11, 2021
0.7.17	Jan 11, 2021
0.7.16	Jan 10, 2021
0.7.15	Dec 17, 2020
0.7.14	Dec 16, 2020
0.7.13	Dec 11, 2020
0.7.12	Dec 8, 2020
0.7.11	Dec 3, 2020
0.7.10	Dec 1, 2020
0.7.9	Nov 30, 2020
0.7.8	Nov 27, 2020
0.7.7	Nov 26, 2020
0.7.6	Nov 24, 2020
0.7.5	Nov 23, 2020
0.7.4	Nov 23, 2020
0.7.3	Sep 21, 2020
0.7.2	Sep 15, 2020
0.7.1	Sep 9, 2020
0.7.0	Sep 9, 2020
0.6.1	Aug 28, 2020
0.6.0	Jul 31, 2020
0.5.5	Jul 22, 2020
0.5.4	Jul 21, 2020
0.5.3	Jul 14, 2020
0.5.2	Jul 10, 2020
0.5.1	Jul 9, 2020
0.5.0	Jul 9, 2020
0.4.1	Jul 2, 2020
0.4.0	Jul 2, 2020
0.3.5	May 12, 2020
0.3.4	May 6, 2020
0.3.3	May 6, 2020
0.3.2	May 6, 2020
0.3.1	May 5, 2020
0.3.0	May 5, 2020
0.2.8	Apr 22, 2020
0.2.7	Apr 22, 2020
0.2.6	Apr 21, 2020
0.2.5	Apr 20, 2020
0.2.4	Apr 18, 2020
0.2.3	Apr 17, 2020
0.2.2	Apr 17, 2020
0.2.1	Apr 17, 2020
0.2.0	Apr 16, 2020
0.1.7	Apr 25, 2019
0.1.6	Apr 24, 2019
0.1.5	Apr 23, 2019
0.1.4	Apr 23, 2019
0.1.3	Apr 23, 2019
0.1.2	Apr 13, 2019
0.1.1	Apr 9, 2019
0.1.0	Apr 9, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mljar-supervised-0.8.6.tar.gz (83.8 kB)

Uploaded Jan 29, 2021 Source

Hashes for mljar-supervised-0.8.6.tar.gz

Algorithm	Hash digest
SHA256	`d91ec63114e4056493b4169194c6de08c1e3f54727d0882aa0f0d39009d48e74`
MD5	`4547771f3f4c9d782de5d8dc6d1abace`
BLAKE2b-256	`86617569cdb606a1c0ba6255450857727bac9077a045e7da9f0d0f275764b7fd`

mljar-supervised 0.8.6
