Pipeline Explorer

Classes and functions to explore and reproduce the performance obtained by thousands of MLBlocks pipelines and templates across hundreds of datasets.

Getting Started

Installation

$ git clone git@github.com:HDI-Project/piex.git
$ cd piex
$ pip install -e .
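
Alternatively, since the package is published on PyPI, it should also be installable directly with pip:

$ pip install piex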

Usage

The S3PipelineExplorer

The S3PipelineExplorer class provides methods to download the results of previous test executions from S3, see which pipelines obtained the best scores, and load them as dictionaries, ready to be used by an MLPipeline.

To start working with it, pass it the name of the S3 Bucket from which the data will be downloaded.

For these examples, we will be using the ml-pipelines-2018 bucket, where the results of the experiments run for the Machine Learning Bazaar paper can be found.

from piex.explorer import S3PipelineExplorer

piex = S3PipelineExplorer('ml-pipelines-2018')

The Datasets

The get_datasets method returns a pandas.DataFrame with information about the available datasets.

datasets = piex.get_datasets()
datasets.head()
                  dataset  data_modality       task_type  task_subtype
314         124_120_mnist          image  classification   multi_class
315      124_138_cifar100          image  classification   multi_class
316  124_153_svhn_cropped          image  classification   multi_class
317       124_174_cifar10          image  classification   multi_class
318       124_178_coil100          image  classification   multi_class
datasets = piex.get_datasets(data_modality='multi_table', task_type='regression')
datasets.head()
                                 dataset  data_modality   task_type  task_subtype
311  uu2_gp_hyperparameter_estimation    multi_table  regression  multivariate
312  uu3_world_development_indicators    multi_table  regression    univariate

The Experiments

The list of tests that have been executed can be obtained with the method get_tests.

Just like with get_datasets, any keyword arguments will be used to filter the results.

import pandas as pd

tests = piex.get_tests()
pd.DataFrame(tests.groupby(['data_modality', 'task_type']).size(), columns=['count'])
                                       count
data_modality task_type
graph         community_detection          5
              graph_matching              18
              link_prediction              2
              vertex_nomination            2
image         classification              57
              regression                   1
multi_table   classification               1
              regression                   1
single_table  classification            1405
              collaborative_filtering      1
              regression                 430
              time_series_forecasting   175
text          classification              17
timeseries    classification              37
tests = piex.get_tests(data_modality='graph', task_type='link_prediction')
tests[['dataset', 'pipeline', 'checkpoints', 'test_id']]
      dataset                                        pipeline              checkpoints               test_id
1716  59_umls                                              NaN  [900, 1800, 3600, 7200]  20181031040541366347
2141  59_umls  graph/link_prediction/random_forest_classifier  [900, 1800, 3600, 7200]  20181031182305995728

The Experiment Results

The results of the experiments can be seen using the get_test_results method.

These results include both the cross validation score obtained by the pipeline during the tuning and the score obtained once the pipeline has been fitted on the training data and used to make predictions over the test data.

Just like with get_datasets, any keyword arguments will be used to filter the results, including the test_id.

results = piex.get_test_results(test_id='20181031182305995728')
results[['test_id', 'pipeline', 'score', 'cv_score', 'elapsed', 'iterations']]
                   test_id                                         pipeline     score  cv_score      elapsed  iterations
7464  20181031182305995728  graph/link_prediction/random_forest_classifier  0.499853  0.843175   900.255511       435.0
7465  20181031182305995728  graph/link_prediction/random_forest_classifier  0.499853  0.854603  1800.885417       805.0
7466  20181031182305995728  graph/link_prediction/random_forest_classifier  0.499853  0.854603  3600.005072      1432.0
7467  20181031182305995728  graph/link_prediction/random_forest_classifier  0.785568  0.860000  7200.225256      2366.0
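
Since this is a regular pandas.DataFrame, standard pandas operations can be applied to it; for example, a minimal sketch to pick out the checkpoint that achieved the best cross validation score:

# select the row (checkpoint) with the highest cross validation score
best_checkpoint = results.loc[results['cv_score'].idxmax()]
best_checkpoint[['cv_score', 'score', 'elapsed']]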

The Best Pipeline

Information about the best pipeline for a dataset can be obtained using the get_best_pipeline method.

This method returns a pandas.Series object with information about the pipeline that obtained the best cross validation score during the tuning, as well as the template that was used to build it.

Note: This call will download some data in the background the first time that it is run, so it might take a while to return.

piex.get_best_pipeline('185_baseball')
id                            17385666-31da-4b6e-ab7f-8ac7080a4d55
dataset                                 185_baseball_dataset_TRAIN
metric                                                     f1Macro
name             categorical_encoder/imputer/standard_scaler/xg...
rank                                                      0.307887
score                                                     0.692113
template                                  5bd0ce5249e71569e8bf8003
test_id                                       20181024234726559170
pipeline         categorical_encoder/imputer/standard_scaler/xg...
data_modality                                         single_table
task_type                                           classification
Name: 1149699, dtype: object

Apart from obtaining this information, we can use the load_best_pipeline method to load its JSON specification, ready to be used in an mlblocks.MLPipeline object.

pipeline = piex.load_best_pipeline('185_baseball')
pipeline['primitives']
['mlprimitives.feature_extraction.CategoricalEncoder',
 'sklearn.preprocessing.Imputer',
 'sklearn.preprocessing.StandardScaler',
 'mlprimitives.preprocessing.ClassEncoder',
 'xgboost.XGBClassifier',
 'mlprimitives.preprocessing.ClassDecoder']
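
As a sketch of that last step (the exact entry point depends on the mlblocks version installed; older releases use MLPipeline.from_dict instead of passing the dictionary to the constructor):

from mlblocks import MLPipeline

# build an MLPipeline from the loaded specification; on older mlblocks
# versions this would be MLPipeline.from_dict(pipeline) instead
mlpipeline = MLPipeline(pipeline)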

The Best Template

Just like the best pipeline, the best template for a given dataset can be obtained using the get_best_template method.

This returns just the name of the template that was used to build the best pipeline.

template_name = piex.get_best_template('185_baseball')
template_name
'categorical_encoder/imputer/standard_scaler/xgbclassifier'

This name can later be used to explore the template, obtaining its default hyperparameters:

defaults = piex.get_default_hyperparameters(template_name)
defaults
{'mlprimitives.feature_extraction.CategoricalEncoder#1': {'copy': True,
  'features': 'auto',
  'max_labels': 0},
 'sklearn.preprocessing.Imputer#1': {'missing_values': 'NaN',
  'axis': 0,
  'copy': True,
  'strategy': 'mean'},
 'sklearn.preprocessing.StandardScaler#1': {'with_mean': True,
  'with_std': True},
 'mlprimitives.preprocessing.ClassEncoder#1': {},
 'xgboost.XGBClassifier#1': {'n_jobs': -1,
  'n_estimators': 100,
  'max_depth': 3,
  'learning_rate': 0.1,
  'gamma': 0,
  'min_child_weight': 1},
 'mlprimitives.preprocessing.ClassDecoder#1': {}}

Or obtaining the corresponding tunable ranges, ready to be used with a tuner:

tunable = piex.get_tunable_hyperparameters(template_name)
tunable
{'mlprimitives.feature_extraction.CategoricalEncoder#1': {'max_labels': {'type': 'int',
   'default': 0,
   'range': [0, 100]}},
 'sklearn.preprocessing.Imputer#1': {'strategy': {'type': 'str',
   'default': 'mean',
   'values': ['mean', 'median', 'most_frequent']}},
 'sklearn.preprocessing.StandardScaler#1': {'with_mean': {'type': 'bool',
   'default': True},
  'with_std': {'type': 'bool', 'default': True}},
 'mlprimitives.preprocessing.ClassEncoder#1': {},
 'xgboost.XGBClassifier#1': {'n_estimators': {'type': 'int',
   'default': 100,
   'range': [10, 1000]},
  'max_depth': {'type': 'int', 'default': 3, 'range': [3, 10]},
  'learning_rate': {'type': 'float', 'default': 0.1, 'range': [0, 1]},
  'gamma': {'type': 'float', 'default': 0, 'range': [0, 1]},
  'min_child_weight': {'type': 'int', 'default': 1, 'range': [1, 10]}},
 'mlprimitives.preprocessing.ClassDecoder#1': {}}
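
As an illustration of how these ranges can drive a tuner, here is a minimal random-search sketch; sample_hyperparameters is a hypothetical helper written for this example, not part of piex:

import random

def sample_hyperparameters(tunable):
    """Draw one random value for every tunable hyperparameter."""
    proposal = {}
    for primitive, hyperparams in tunable.items():
        proposal[primitive] = {}
        for name, spec in hyperparams.items():
            if spec['type'] == 'int':
                low, high = spec['range']
                proposal[primitive][name] = random.randint(low, high)
            elif spec['type'] == 'float':
                low, high = spec['range']
                proposal[primitive][name] = random.uniform(low, high)
            elif spec['type'] == 'bool':
                proposal[primitive][name] = random.choice([True, False])
            else:  # categorical types such as 'str' list their options in 'values'
                proposal[primitive][name] = random.choice(spec['values'])
    return proposal

hyperparameters = sample_hyperparameters(tunable)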

Scoring Templates and Pipelines

The S3PipelineExplorer class also allows cross validating templates and pipelines over any of the datasets.

Scoring a Pipeline

The simplest use case is cross validating a pipeline over a dataset. For this, we must pass the ID of the pipeline and the name of the dataset to the score_pipeline method.

The dataset can be the one that was used during the experiments or a different one.

piex.score_pipeline(pipeline['id'], '185_baseball')
(0.6921128080904511, 0.09950216269594728)
piex.score_pipeline(pipeline['id'], 'uu4_SPECT')
(0.8897656842904123, 0.037662864373452655)

Optionally, the cross validation configuration can be changed:

piex.score_pipeline(pipeline['id'], 'uu4_SPECT', n_splits=3, random_state=43)
(0.8869488536155202, 0.019475563687443638)

Scoring a Template

A template can also be tested over any dataset by passing its name, the dataset and, optionally, the cross validation specification.

If no hyperparameters are passed, the default ones will be used:

piex.score_template(template_name, 'uu4_SPECT', n_splits=3, random_state=43)
(0.8555346666968675, 0.028343173498423108)

However, different hyperparameters can be passed as a dictionary:

hyperparameters = piex.get_default_hyperparameters(template_name)
hyperparameters['xgboost.XGBClassifier#1']['learning_rate'] = 1

piex.score_template(template_name, 'uu4_SPECT', hyperparameters, n_splits=3, random_state=43)
(0.8754554700753094, 0.019151608028236813)
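
Combining this with the sample_hyperparameters helper sketched above, a naive random search over the template could look like this (again an illustration, not a piex feature; the first element of the returned tuple is taken as the mean cross validation score):

best_score = 0
best_hyperparameters = None

# draw a few random candidates and keep the one with the best mean cv score
for _ in range(10):
    candidate = sample_hyperparameters(tunable)
    score, _ = piex.score_template(
        template_name, 'uu4_SPECT', candidate, n_splits=3, random_state=43)
    if score > best_score:
        best_score = score
        best_hyperparameters = candidate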

History

0.1.0

  • First release on PyPI
