Machine Learning Toolkit (MLToolkit/mltk) for Python

Development Status
- 3 - Alpha
Environment
- Console
Intended Audience
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python
Topic
- Scientific/Engineering
- Software Development

Project description

MLToolkit

Current release: PyMLToolkit [v0.1.3]

MLToolkit (mltk) is a Python package providing a set of user-friendly functions to help build machine learning models in data science research, teaching, or production-focused projects.

Introduction

MLToolkit supports all stages of the machine learning application development process.

Installation

pip install pymltoolkit

If the installation fails with dependency issues, run the above command with --no-dependencies:

pip install pymltoolkit --no-dependencies

Functions

Data Extraction (SQL, Flatfiles, etc.)
Exploratory Data Analysis (statistical summary, univariate analysis, etc.)
Feature Engineering
Model Building
Hyper Parameter Tuning [in development for v0.2]
Model Performance Analysis and Comparison Between Models
Auto ML (automated machine learning) [in development for v0.2]
Model Deployment and Serving [will be improved for v0.2]

Supported Machine Learning Algorithms/Packages

RandomForestClassifier: scikit-learn
LogisticRegression: statsmodels
... More models will be added in the future releases ...

Usage

import mltk

Examples

Data Loading and exploration

import numpy as np
import pandas as pd
import mltk as mltk

Data = pd.read_csv(r'C:\Projects\Data\incomedata.csv')
Data = mltk.add_identity_column(Data, id_label='ID', start=1, increment=1)
DataStats = mltk.data_description(Data)

Data Preprocessing

# Analyze Response Target
print(mltk.variable_frequency(DataFrame=Data, variable='income'))

# Set Target Variable
targetVariable = 'HighIncome'
targetCondition = "income=='>50K'" #For Binary Classification

Data=mltk.set_binary_target(Data, target_condition=targetCondition, target_variable=targetVariable)
print(mltk.variable_frequency(DataFrame=Data, variable=targetVariable))

        Counts  CountsFraction%
income                         
<=50K    24720         75.91904
>50K      7841         24.08096
TOTAL    32561        100.00000
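
The frequency table above can be reproduced from the two class counts it reports. The following is a minimal sketch in plain Python, not mltk's implementation; `frequency_table` and `high_income` are hypothetical names used only for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Map each distinct value to (count, percentage of all values)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: (count, 100.0 * count / total) for key, count in counts.items()}

# Rebuild the income column from the class counts shown in the table above.
incomes = ["<=50K"] * 24720 + [">50K"] * 7841
table = frequency_table(incomes)

# Binary target: 1 where the condition income == '>50K' holds, else 0.
high_income = [1 if value == ">50K" else 0 for value in incomes]

for label, (count, percent) in table.items():
    print(f"{label:>6} {count:>6} {percent:.5f}")
```

The sum of `high_income` equals the `>50K` count, which is why the second frequency call in the example reports the same split under the new target name.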

# Flag Records to Exclude
excludeCondition = "age < 18"
action = 'flag'  # alternative: 'drop'
excludeLabel = 'EXCLUDE'
Data = mltk.exclude_records(Data, exclude_condition=excludeCondition, action=action, exclude_label=excludeLabel)

categoryVariables = {'sex', 'native-country', 'race', 'occupation', 'workclass', 'marital-status', 'relationship'}
binaryVariables = set()
print(mltk.category_lists(Data, list(categoryVariables)))

sourceVariable='age'
table = mltk.histogram(Data, sourceVariable, n_bins=10, orientation='vertical', show_plot=True)
print(table)

# Divide into categories
labels = ['0', '20', '30', '40', '50', '60', 'INF']
Data, groupVariable = mltk.numeric_to_category(DataFrame=Data, variable=sourceVariable, str_labels=labels, right_inclusive=True, print_output=False, return_variable=True)
mltk.plot_variable_response(DataFrame=Data, variable=groupVariable, class_variable=targetVariable)

            Counts  HighIncome  CountsFraction%  ResponseFraction%  ResponseRate%
ageGRP                                                                           
1_(0,20]      2410           2          7.40149            0.02551        0.08299
2_(20,30]     8162         680         25.06680            8.67236        8.33129
3_(30,40]     8546        2406         26.24612           30.68486       28.15352
4_(40,50]     6983        2655         21.44590           33.86048       38.02091
5_(50,60]     4128        1547         12.67774           19.72963       37.47578
6_(60,INF)    2332         551          7.16194            7.02716       23.62779
TOTAL        32561        7841        100.00000          100.00000            NaN
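
To make the grouping in the table above concrete, here is a minimal sketch of right-inclusive binning and the per-bin response rate, written in plain Python. It is not mltk's implementation: `bin_label` and `response_rates` are hypothetical helpers, and the label format simply mimics the `ageGRP` values shown above.

```python
from bisect import bisect_left
from collections import defaultdict

# Bin edges matching the labels above; the last bin is open-ended.
EDGES = [0, 20, 30, 40, 50, 60, float("inf")]

def bin_label(value, edges=EDGES):
    """Return a label such as '2_(20,30]' for a right-inclusive bin, or None."""
    i = bisect_left(edges, value)  # edges[i-1] < value <= edges[i]
    if i == 0 or i == len(edges):
        return None  # value falls outside every bin
    low, high = edges[i - 1], edges[i]
    if high == float("inf"):
        return f"{i}_({low},INF)"
    return f"{i}_({low},{high}]"

def response_rates(records):
    """records: iterable of (age, target) pairs with target in {0, 1}."""
    counts, positives = defaultdict(int), defaultdict(int)
    for age, target in records:
        label = bin_label(age)
        counts[label] += 1
        positives[label] += target
    # Response rate = positives in the bin / records in the bin, in percent.
    return {label: 100.0 * positives[label] / counts[label] for label in counts}

# Hypothetical toy records, not the census data used above.
sample = [(19, 0), (20, 0), (25, 0), (25, 1), (45, 1), (45, 1), (45, 0), (70, 0)]
rates = response_rates(sample)
print(rates)
```

The same ratio explains the ResponseRate% column above: for the (20,30] group, 680 / 8162 is about 8.33%.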

# Create One Hot Encoded Variables
Data, featureVariables, targetVariable = mltk.to_one_hot_encode(Data, category_variables=categoryVariables, binary_variables=binaryVariables, target_variable=targetVariable)

# Assumption: the record identifier is the 'ID' column added by add_identity_column above
identifierColumns = ['ID']
Data[identifierColumns+featureVariables+[targetVariable]].sample(5).transpose()

Correlation

correlation=mltk.correlation_matrix(Data, featureVariables+[targetVariable], target_variable=targetVariable, method='pearson', return_type='list', show_plot=False)
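
As an illustration of what a Pearson correlation against the target measures, the sketch below ranks a few columns by the absolute value of their correlation with a binary target. It uses only the standard library and is not mltk's implementation; `pearson`, `rank_by_target_correlation`, and the toy columns are assumptions made for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def rank_by_target_correlation(features, target):
    """Return (name, r) pairs sorted by |r|, strongest first."""
    scored = [(name, pearson(values, target)) for name, values in features.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical toy columns, not taken from the census data above.
features = {
    "feature_a": [1, 2, 3, 4],
    "feature_b": [4, 3, 2, 1],
    "feature_c": [1, 0, 1, 0],
}
target = [0, 0, 1, 1]
ranked = rank_by_target_correlation(features, target)
print(ranked)
```

Sorting by absolute value keeps strongly negative features near the top, since they carry as much linear signal as strongly positive ones.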

Split Train, Validate and Test Datasets

TrainDataset, ValidateDataset, TestDataset = mltk.train_validate_test_split(Data, ratios=(0.6,0.2,0.2))
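
The idea behind a 60/20/20 split can be sketched in a few lines of plain Python. This is an illustrative stand-in, not mltk's implementation; `three_way_split` and its `seed` parameter are hypothetical.

```python
import random

def three_way_split(rows, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle rows and cut them into train, validate and test partitions."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * ratios[0])
    n_validate = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]  # the remainder goes to test
    return train, validate, test

train, validate, test = three_way_split(range(100))
print(len(train), len(validate), len(test))
```

Assigning the remainder to the test partition guarantees that every record lands in exactly one partition even when the ratios do not divide the row count evenly.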

Model Building

sample_attributes = {'SampleDescription':'Adult Census Income Dataset',
                    'NumClasses':2,
                    'RecordIdentifiers':identifierColumns
                    }

score_parameters = {'Edges':[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
                   'Quantiles':10,
                   'ScoreVariable':'Probability',
                   'ScoreLabel':'Score',
                   'QuantileLabel':'Quantile'
                   }

model_attributes = {'ModelID': None,   
                   'ModelName': 'IncomeLevel',
                   'Version':'0.1',
                   }

model_parameters = {'MLAlgorithm':'RF', # 'LGR', #  'DFF', # 'CNN', # 'CATBST', # 'XGBST'
                   'NTrees':500,
                   'MaxDepth':100,
                   'MinSamplesToSplit':10,
                   'Processors':2}

# Assumption: the model uses the one-hot encoded feature variables created above
model_variables = featureVariables

RFModel = mltk.build_ml_model(TrainDataset, ValidateDataset, TestDataset, model_variables, targetVariable,
                              model_attributes, sample_attributes, model_parameters, score_parameters,
                              return_model_object=True, show_results=False, show_plot=True)

print(RFModel.model_attributes['ModelID'])
print(RFModel.model_interpretation['ModelSummary'])
print(RFModel.model_evaluation['AUC'])
print(RFModel.model_evaluation['RobustnessTable'])

# Save model
mltk.save_object(RFModel, '{}.pkl'.format(RFModel.get_model_id()))
RFModel.plot_eval_matrics(comparison=False)

          minProbability  maxProbability  meanProbability  BucketCount  ResponseCount  BucketFraction  ResponseFraction  BucketPrecision  CumulativeBucketFraction  CumulativeResponseFraction  CumulativePrecision
Quantile                                                                                                                                                                                                           
1                0.00000         0.00406          0.00064         1306           11.0         0.20052           0.00705          0.00842                   1.00000                     1.00000              0.23967
2                0.00406         0.02182          0.01138          648           15.0         0.09949           0.00961          0.02315                   0.79948                     0.99295              0.29768
3                0.02184         0.05599          0.03639          651           34.0         0.09995           0.02178          0.05223                   0.69998                     0.98334              0.33670
4                0.05603         0.11695          0.08490          652           64.0         0.10011           0.04100          0.09816                   0.60003                     0.96156              0.38408
5                0.11731         0.20303          0.15683          651          109.0         0.09995           0.06983          0.16743                   0.49992                     0.92056              0.44134
6                0.20323         0.31633          0.26482          651          182.0         0.09995           0.11659          0.27957                   0.39997                     0.85074              0.50979
7                0.31654         0.48098          0.39805          653          249.0         0.10026           0.15951          0.38132                   0.30002                     0.73414              0.58649
8                0.48136         0.67088          0.57195          651          380.0         0.09995           0.24343          0.58372                   0.19975                     0.57463              0.68947
9                0.67233         1.00000          0.80734          650          517.0         0.09980           0.33120          0.79538                   0.09980                     0.33120              0.79538
DataSet          0.00000         1.00000          0.23319         6513         1561.0         1.00000           1.00000          0.23967                   1.00000                     1.00000              0.23967
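
The precision columns in the table above follow directly from BucketCount and ResponseCount. The sketch below recomputes them in plain Python from the counts printed in the table; it is not mltk's implementation, and `bucket_precision` and `cumulative_precision` are hypothetical helper names.

```python
# BucketCount and ResponseCount columns for quantiles 1 to 9, copied from the table above.
bucket_counts = [1306, 648, 651, 652, 651, 651, 653, 651, 650]
response_counts = [11, 15, 34, 64, 109, 182, 249, 380, 517]

def bucket_precision(buckets, responses):
    """Share of positive records inside each bucket."""
    return [r / b for b, r in zip(buckets, responses)]

def cumulative_precision(buckets, responses):
    """Precision when selecting each quantile and every higher-scoring one."""
    result = [0.0] * len(buckets)
    total_buckets = total_responses = 0
    # Accumulate from the highest-scoring quantile down to the lowest.
    for i in range(len(buckets) - 1, -1, -1):
        total_buckets += buckets[i]
        total_responses += responses[i]
        result[i] = total_responses / total_buckets
    return result

precision = bucket_precision(bucket_counts, response_counts)
cumulative = cumulative_precision(bucket_counts, response_counts)
for quantile, (p, c) in enumerate(zip(precision, cumulative), start=1):
    print(f"{quantile} {p:.5f} {c:.5f}")
```

The cumulative value for quantile 1 covers the whole dataset, so it equals the overall response rate of 1561 / 6513 shown in the DataSet row.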

TestDataset = mltk.score_dataset(TestDataset, RFModel, edges=None, score_label=None, fill_missing=0)
score_variable = RFModel.get_score_variable()
score_label = RFModel.get_score_label()

Robustnesstable = mltk.robustness_table(ResultsSet=TestDataset, class_variable=targetVariable, score_variable=score_variable,  score_label=score_label, show_plot=True)

threshold = 0.8
TestDataset['Predicted'] = np.where(TestDataset[score_variable]>threshold,1,0)
ConfusionMatrix = mltk.confusion_matrix(actual_variable=TestDataset[targetVariable], predcted_variable=TestDataset['Predicted'], labels=[0,1], sample_weight=None, totals=True)
print(ConfusionMatrix)

ConfusionMatrixRow = mltk.confusion_matrix_to_row(ConfusionMatrix, ModelID=RFModel.model_attributes['ModelID'])
ConfusionMatrixRow
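
For readers who want to see what thresholding the score and tabulating a confusion matrix amounts to, here is a minimal standard-library sketch. It is not mltk's implementation; `confusion_counts` and the toy scores are assumptions for illustration, and the strict `>` comparison mirrors the `np.where` call above.

```python
from collections import Counter

def confusion_counts(actual, scores, threshold=0.8):
    """Threshold the scores and count TN, FP, FN and TP."""
    predicted = [1 if score > threshold else 0 for score in scores]
    pairs = Counter(zip(actual, predicted))  # keys are (actual, predicted)
    return {
        "TN": pairs[(0, 0)],
        "FP": pairs[(0, 1)],
        "FN": pairs[(1, 0)],
        "TP": pairs[(1, 1)],
    }

# Hypothetical toy labels and scores, not model output from the example above.
actual = [0, 0, 1, 1, 1, 0]
scores = [0.10, 0.85, 0.90, 0.40, 0.81, 0.80]
matrix = confusion_counts(actual, scores)
precision = matrix["TP"] / (matrix["TP"] + matrix["FP"])
recall = matrix["TP"] / (matrix["TP"] + matrix["FN"])
print(matrix, precision, recall)
```

A score exactly equal to the threshold is classified as 0 here, so raising the threshold trades recall for precision.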

License

Copyright 2019 Sumudu Tennakoon

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

MLToolkit Project Timeline

2018-07-02 [v0.0.1]: Initial set of functions for data exploration, model building and model evaluation was published.
2019-03-20 [v0.1.0]: Developed and published initial version of model building and serving framework for IBM Coursera Advanced Data Science Capstone Project.
2019-07-02 [v0.1.2]: First release of the PyMLToolkit Python package, a collection of classes and functions facilitating end-to-end machine learning model building and serving over a RESTful API.
2019-07-04 [v0.1.3]: Minor bug fix

Future Release Plan

2019-12-31 [v0.1.6]: Major bug-fix version of the initial release.
[v0.2.0]: Improved model serving framework, support for more machine learning algorithms and deep learning.
[v0.3.0]: Hyperparameter tuning and automated machine learning.
[v0.4.0]: Building continuous learning models.

Release history

0.1.11

Feb 13, 2020

0.1.10

Dec 31, 2019

0.1.9

Dec 8, 2019

0.1.8

Sep 29, 2019

0.1.7

Sep 1, 2019

0.1.6

Aug 12, 2019

0.1.5

Jul 28, 2019

0.1.4

Jul 14, 2019

This version

0.1.3

Jul 4, 2019

0.1.2

Jul 2, 2019

Download files

Source Distribution

pymltoolkit-0.1.3.tar.gz (22.3 kB)

Uploaded Jul 4, 2019 Source

Built Distribution

pymltoolkit-0.1.3-py3-none-any.whl (32.6 kB)

Uploaded Jul 4, 2019 Python 3

Hashes for pymltoolkit-0.1.3.tar.gz

Algorithm	Hash digest
SHA256	`af497986f0a8667cb19eec8fea2da51f462ef90418b3761628d40cb66b9f8696`
MD5	`4cec8594ff376d649f35ad692d8703a2`
BLAKE2b-256	`bf0158810e980feca163e9de95de76fc5ecdd87b6954773aece68b5a461f12e9`

Hashes for pymltoolkit-0.1.3-py3-none-any.whl

Algorithm	Hash digest
SHA256	`3f854b35e86ae70e13f2716de371f09a6f36b8b2880f832638cd533fe1b48c50`
MD5	`85973def48923f3c9ae268cae79980ff`
BLAKE2b-256	`273ed73f8b558400a2054f39493619ac031cd47f57840e47452d84ef03b2c178`