META Learn
MetaRL-based Estimator using Task-encodings for Automated machine Learning
META Learn is a deep learning approach that parameterizes the API of machine learning software as a sequence of actions, selecting the hyperparameters of a machine learning estimator in an end-to-end fashion: from raw data representation, imputation, normalizing, and feature representation to classification/regression. Currently the sklearn API is the only supported ML framework.
Why?
As the diversity of data and machine learning use cases increases, we need to accelerate and scale the process of training performant machine learning systems. We'll need tools that are adaptable to specific problem domains, the nature of the dataset, and the (sometimes non-differentiable) performance metric we're trying to optimize. Supervised learning of classification and regression tasks given a task distribution of small to medium datasets provides a promising jumping-off point for programmatically generating a reinforcement learning environment for automated machine learning (AutoML).
Quickstart
pre-install dependencies:
pip install numpy scipy
install the metalearn library:
pip install -e .
then you can run an experiment with the metalearn cli.
# run an experiment with default values
$ metal run experiment
Running an experiment with a configuration file
Alternatively, you can create an experiment configuration file to run your experiment.
# create experiment config file
$ metal create config my_experiment config/local --description "my experiment"
# output:
# wrote experiment config file to config/local/experiment_2018-37-25-21:37:11_my_experiment.yml
edit the parameters section of the config file to the set of parameters that you want to train on, then run the experiment with
$ metal run from-config config/local/experiment_2018-37-25-21:37:11_my_experiment.yml
Relevant Work
The Combined Algorithm Selection and Hyperparameter optimization (CASH) problem is an important one to solve if we want to effectively scale and deploy machine learning systems in real-world use cases, which often involve small (< 10 GB) to medium-sized (10 - 100 GB) data.
CASH is the problem of searching through the space of all ML frameworks, where a framework is defined as an algorithm A and a set of relevant hyperparameters lambda, and proposing a set of models that will perform well given a dataset and a task, e.g.
.----------. .--------------. .-----. .---------------------.
| Raw Data | -> | Handle Nulls | -> | PCA | -> | Logistic Regression |
.----------. .--------------. .-----. .---------------------.
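The example framework above can be written down directly against the sklearn API. The following is an illustrative sketch (the component choices mirror the diagram; the synthetic data is made up for demonstration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# One concrete "ML framework": Handle Nulls -> PCA -> Logistic Regression
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # Handle Nulls
    ("pca", PCA(n_components=2)),                 # PCA
    ("clf", LogisticRegression()),                # Logistic Regression
])

# Synthetic raw data with injected missing values
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
X[rng.rand(100, 5) < 0.1] = np.nan
y = (rng.rand(100) > 0.5).astype(int)

pipeline.fit(X, y)
accuracy = pipeline.score(X, y)
```

Evaluating such a fitted pipeline on held-out data yields the scalar performance signal that CASH solvers optimize over.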
In order to solve this problem, previous work like autosklearn uses the Bayesian optimization technique SMAC with an offline meta-learning "warm-start" step using Euclidean distance to reduce the search space of ML frameworks. This meta-learning step was done by representing the datasets with metadata features (e.g. number of features, skew, mean, variance, etc.) to learn representations of the data space that perform well with respect to the ML framework selection task.
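To make the warm-start idea concrete, here is a rough sketch (not autosklearn's actual implementation) of the kind of dataset-level metadata features described above:

```python
import numpy as np
from scipy import stats

def dataset_metafeatures(X):
    """Compute simple dataset-level metadata features of the kind used
    for meta-learning warm starts: shape, means, variances, skew."""
    X = np.asarray(X, dtype=float)
    return {
        "n_instances": X.shape[0],
        "n_features": X.shape[1],
        "mean_of_means": float(np.mean(X.mean(axis=0))),
        "mean_variance": float(np.mean(X.var(axis=0))),
        "mean_skew": float(np.mean(stats.skew(X, axis=0))),
    }

X = np.random.RandomState(0).randn(50, 4)
mf = dataset_metafeatures(X)
```

Datasets close to each other in this feature space (e.g. by Euclidean distance) can then share well-performing framework configurations.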
Neural Architecture Search is another approach to the CASH problem, where a Controller network proposes "child" neural net architectures that are trained on a training set and evaluated on a validation set, using the validation performance R as a reinforcement learning reward signal to learn the best architecture proposal policy.
Contributions
The contributions of the META Learn project are two-fold. First, it builds on the neural architecture search paradigm by formulating the output space of the Controller as a sequence of tokens conditioned on the space of possible executable frameworks. The scope of this project is to define a framework, expressed as a piece of Python code, which evaluates to an instantiated sklearn Pipeline that can be fitted on a training set and evaluated on a validation set of a particular dataset D.
Following the Neural Architecture Search scheme, META Learn uses the REINFORCE algorithm to compute the policy gradient used to update the Controller in order to learn a policy for proposing good frameworks that are able to achieve high validation set performance.
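The REINFORCE update can be illustrated with a minimal numpy sketch (this is not the project's actual controller; the toy action space and reward values are made up). A softmax policy over candidate "frameworks" is updated with grad log pi(a) * (R - baseline):

```python
import numpy as np

rng = np.random.RandomState(0)
n_actions = 4
logits = np.zeros(n_actions)                   # policy parameters
true_reward = np.array([0.2, 0.9, 0.4, 0.1])   # assumed reward per framework
baseline, lr = 0.0, 0.1

for step in range(2000):
    # softmax policy over actions
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(n_actions, p=probs)
    # noisy "validation performance R" as the reward signal
    r = true_reward[a] + 0.05 * rng.randn()
    # moving-average baseline to reduce gradient variance
    baseline = 0.9 * baseline + 0.1 * r
    # d log pi(a) / d logits = onehot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += lr * (r - baseline) * grad_logp  # policy gradient ascent

best = int(np.argmax(logits))
```

Over training, probability mass concentrates on the action with the highest expected reward, which is the same mechanism the Controller uses over its much larger framework space.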
The second contribution of this project is that it proposes a conditional ML framework generator by extending the Controller network with an encoder network that takes as input metadata about the dataset D (e.g. number of instances, number of features). The output of the encoder network is fed into the decoder network, which proposes an ML framework. Therefore, we can condition the output of the decoder network on metadata about D to propose customized frameworks.
High-level Approach
There are two general approaches to take, with substantial tradeoffs to consider:
Approach 1: Character-level Controller
Generate an ML framework at the character level, such that the goal is to output Python code using a softmax over the space of valid characters, e.g. A-Z, a-z, 0-9, ()[]=, etc.
This approach builds fewer assumptions into the AutoML system; however, the function the Controller needs to learn would be much more complex: it needs to (a) generate valid sklearn code character-by-character, and (b) generate performant algorithm and hyperparameter combinations over the distribution of datasets and tasks.
Approach 2: Domain-specific Controller
Generate ML frameworks over a state space of algorithms and hyperparameter values, in this case, over the estimators/transformers of the sklearn API.
This approach builds more assumptions into the AutoML system, e.g. explicitly specifying the algorithm/hyperparameter space to search over and how to interpret the output of the Controller so as to fit a model, but the advantage is that the Controller mainly has to learn a function that generates performant algorithm and hyperparameter combinations.
In the META Learn project, a MetaLearnController
represents the policy
approximator, which selects actions based on a tree-structured set of softmax
classifiers, each one representing some part of the algorithm and
hyperparameter space. The controller selects estimators/transformers and
hyperparameters in a pre-defined manner (interpreted as embedding priors into
the architecture of the system). The ordering is the following:
- one hot encoding
- one hot encoder hyperparameters
- imputation (e.g. mean, median, mode)
- imputer hyperparameters
- rescaling (e.g. min-max, mean-variance)
- rescaler hyperparameters
- feature preprocessing (e.g. PCA)
- feature processor hyperparameters
- classification/regression
- classifier/regressor hyperparameters
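The ordering above can be sketched as a fixed sequence of decision points, each with its own candidate set (stand-ins for the per-decision softmax classifiers). The names below are illustrative, not the project's actual AlgorithmSpace, and the per-step hyperparameter choices are collapsed for brevity:

```python
import random

# Fixed decision order, mirroring the list above (hyperparameter
# sub-steps omitted). Each entry: (decision point, candidate actions).
action_space = [
    ("one_hot_encoding", ["OneHotEncoder"]),
    ("imputation", ["mean", "median", "most_frequent"]),
    ("rescaling", ["MinMaxScaler", "StandardScaler"]),
    ("feature_preprocessing", ["PCA", "NoOp"]),
    ("estimator", ["LogisticRegression", "RandomForestClassifier"]),
]

def sample_framework(space, rng):
    """Sample one choice per decision point, in the pre-defined order.
    A trained controller would replace rng.choice with a softmax head."""
    return {name: rng.choice(options) for name, options in space}

framework = sample_framework(action_space, random.Random(0))
```

Because the order is fixed, the controller's architecture itself encodes the prior that, e.g., imputation happens before rescaling.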
Roadmap: Milestones
- implementation of the naive (unstructured) AlgorithmRNN/HyperparameterRNN that separately predict the estimators/transformers and hyperparameters of the ML framework.
- basic implementation of the structured MetaLearnController architecture
- refine MetaLearnController with a baseline function prior such that each data environment maps to its own value function (in this case, the exponential mean of rewards per episode).
- implement basic meta-RL algorithm as described in this paper; in particular, feed MetaLearnController auxiliary inputs:
  - previous reward
  - previous actions
- normalize reward - baseline (the equivalent of advantage in this system) by mean-centering and standard-deviation-rescaling.
- extend meta-RL algorithm by implementing memory as a lookup table that maps data environments to the most recent hidden state from the same data environment.
- extend deep cash to support regression problems.
- increase coverage of regression estimators (add ~5-6 more)
- handle missing-valued data with imputer
- test controller on kaggle classification and regression datasets (5 each)
- train on kaggle classification datasets
- train on kaggle regression datasets
- train on openml/sklearn classification datasets, test on kaggle classification datasets
- train on openml/sklearn regression datasets, test on kaggle classification datasets
- test controller on auto-sklearn paper classification datasets.
- add support for automated ensembling. TBD: should this be implemented as part of the CASH controller, or should there be a separate module altogether that ensembles cached pipelines?
- add support for random grid search with the AlgorithmSpace API. One big design question: how should fit/predict errors be handled? Add logic to hyperparameter sampling that prevents error-raising hyperparameter configurations in the first place, or just catch error during the fitting/ scoring process? (possibly cache as some kind of hash to speed things up).
- add support for test and train dataset environment partitions, i.e. at task env initialization, set aside n% of the data as test datasets, use 1 - n% as training datasets. Evaluate rewards and validation performance over train and test datasets to assess degree of overfitting.
- 100% coverage of sklearn classification estimators
- 100% coverage of sklearn regression estimators
- 100% coverage of sklearn data preprocessors
- 100% coverage of sklearn feature preprocessors
- support for XGBoost
- support for apricot submodular selection
- support using GANS for imputation
- test transfer-learning ability of controller
- test meta-learning ability of controller
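Two of the roadmap items above (the per-data-environment exponential-mean baseline, and mean-centering / standard-deviation-rescaling of the advantage) can be sketched as follows. This is a hypothetical illustration; the names and the smoothing factor ALPHA are assumptions, not the project's actual code:

```python
from collections import defaultdict
import statistics

# data environment -> exponential mean of episode rewards
baselines = defaultdict(float)
ALPHA = 0.1  # assumed smoothing factor

def update_baseline(env_id, reward):
    """Exponential moving average of rewards, kept per data environment."""
    baselines[env_id] += ALPHA * (reward - baselines[env_id])
    return baselines[env_id]

def normalize(advantages):
    """Mean-center and std-rescale a batch of reward - baseline values."""
    mu = statistics.mean(advantages)
    sd = statistics.pstdev(advantages) or 1.0  # guard against zero std
    return [(a - mu) / sd for a in advantages]

b = update_baseline("env_a", 1.0)
adv = normalize([0.5, 1.0, 1.5])
```

Keeping one baseline per data environment matters because different datasets yield rewards on different scales, so a single global baseline would bias the advantage estimates.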
Tooling Enhancements
- support tuning experiments in the experiments.py API. Extend the experiment configuration so that the user can specify more than one setting for a particular hyperparameter.
- create experiment viewer, either as a static report rendered via jupyter notebook or a Dash app. Inputs should be floyd job numbers.
- streamline the dataset mounting process for floyd. This includes the openml and kaggle datasets.
Analyses
The ./analysis subfolder contains jupyter notebooks that visualize the performance of the cash controller over time. Currently there are 5 analyses in the project analysis subfolder:
- rnn_metalearn_controller_experiment_analysis.ipynb: analyzes the output of running examples/example_rnn_metalearn_controller.py with static plots.
- metalearn_controller_analysis.ipynb: a basic interactive analysis of a single job's outputs.
- metalearn_controller_multi_experiment_analysis.ipynb: analyzes multiple job outputs, all assumed to have one trial (training run) per job.
- metalearn_controller_multi_trail_analysis.ipynb: analyzes the output of one job, but that job has multiple trials.
- metalearn_controller_multi_trial_experiment_analysis.ipynb: analyzes the output of multiple jobs, each with multiple trials.
Extensions
Metadata Encoder
An extension to the encoder would be to generalize the metadata feature representation from hand-crafted features (e.g. mean of means of numerical features) and instead formulate the encoder as a sequence model, where the input is a sequence of sequences: the first sequence contains data points or instances of the dataset, and the second sequence contains minimally preprocessed features of that particular instance (note that the challenge here is how to represent categorical features across different datasets). The weights in the encoder are trained jointly as part of the gradient computed using the REINFORCE algorithm.
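The "sequence of sequences" input format might look like the following sketch, which only shows the data representation (the sequence-model encoder itself is omitted, and z-scoring stands in for "minimally preprocessed" numeric features; handling categorical features is the open question noted above):

```python
import numpy as np

def to_sequence_of_sequences(X):
    """Represent a dataset as a sequence of instances, each instance a
    sequence of minimally preprocessed (here: z-scored) feature values."""
    X = np.asarray(X, dtype=float)
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant features
    Z = (X - mu) / sd
    return [list(row) for row in Z]  # outer: instances, inner: features

seqs = to_sequence_of_sequences([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

A sequence encoder consuming this representation would need to be permutation- and dimensionality-agnostic, since datasets differ in both instance count and feature count.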