A project from Data to AI Lab at MIT.

------------------------------------

GreenGuard

GreenGuard is a machine learning library built for data generated by wind turbines and solar panel installations.


Overview

The GreenGuard project is a collection of end-to-end solutions for machine learning tasks commonly found in monitoring wind energy production systems. Most tasks utilize sensor data emanating from monitoring systems. We utilize the foundational innovations developed for the automation of machine learning at the Data to AI Lab at MIT.

The salient aspects of this project are:

  • A set of ready-to-use, well-tested pipelines for different machine learning tasks. These are vetted through testing across multiple publicly available datasets for the same task.
  • An easy interface to specify the task and pipeline, generate results, and summarize them.
  • A production-ready, deployable pipeline.
  • An easy interface to tune pipelines using the Bayesian Tuning and Bandits library.
  • A community-oriented infrastructure to incorporate new pipelines.
  • A robust continuous integration and testing infrastructure.
  • A learning database that records all past results: tasks, pipelines, and outcomes.

Data Format

GreenGuard Pipelines work on time series data formatted as follows (a minimal pandas sketch of these tables is shown after the list):

  • A Turbines table that contains:
    • turbine_id: column with the unique id of each turbine.
    • A number of additional columns with information about each turbine.
  • A Signals table that contains:
    • signal_id: column with the unique id of each signal.
    • A number of additional columns with information about each signal.
  • A Readings table that contains:
    • reading_id: Unique identifier of this reading.
    • turbine_id: Unique identifier of the turbine which this reading comes from.
    • signal_id: Unique identifier of the signal which this reading comes from.
    • timestamp: Time where the reading took place, as an ISO formatted datetime.
    • value: Numeric value of this reading.
  • A Targets table that contains:
    • target_id: Unique identifier of this target.
    • turbine_id: Unique identifier of the turbine which this label corresponds to.
    • timestamp: Time associated with this target.
    • target - optional: The value that we want to predict. This can either be a numerical value or a categorical label. This column can also be skipped when preparing data that will be used only to make predictions and not to fit any pipeline.
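
To make this format concrete, the snippet below builds a tiny version of the four tables as pandas DataFrames. The column names follow the description above, while the row values and signal names are only illustrative:

import pandas as pd

# Turbines table: one row per turbine
turbines = pd.DataFrame({
    'turbine_id': [1],
    'name': ['Turbine 1'],
})

# Signals table: one row per monitored signal
signals = pd.DataFrame({
    'signal_id': [1, 2],
    'name': ['Signal 1', 'Signal 2'],
})

# Readings table: one row per (turbine, signal, timestamp) observation
readings = pd.DataFrame({
    'reading_id': [1, 2],
    'turbine_id': [1, 1],
    'signal_id': [1, 2],
    'timestamp': pd.to_datetime(['2013-01-01', '2013-01-01']),
    'value': [817.0, 805.0],
})

# Targets table: one row per training (or prediction) example
targets = pd.DataFrame({
    'target_id': [1],
    'turbine_id': [1],
    'timestamp': pd.to_datetime(['2013-01-01']),
    'target': [0.0],
})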

Demo Dataset

For development and demonstration purposes, we include a dataset with data from several telemetry signals associated with one wind energy production turbine.

This data, which has been already formatted as expected by the GreenGuard Pipelines, can be browsed and downloaded directly from the d3-ai-greenguard AWS S3 Bucket.

This dataset is adapted from the one used in the project by Cohen, Elliot J., "Wind Analysis." Joint Initiative of the ECOWAS Centre for Renewable Energy and Energy Efficiency (ECREEE), The United Nations Industrial Development Organization (UNIDO) and the Sustainable Engineering Lab (SEL). Columbia University, 22 Aug. 2014. Available online here

The complete list of manipulations performed on the original dataset to convert it into the demo version used here is shown and explained in the GreenGuard Demo Data notebook.

Concepts

Before diving into the software usage, we briefly explain some concepts and terminology.

Primitive

We call the smallest computational blocks used in a machine learning process primitives, which:

  • Can be either classes or functions.
  • Have some initialization arguments, which MLBlocks calls init_params.
  • Have some tunable hyperparameters, which have types and a list or range of valid values.

Template

Primitives can be combined to form what we call Templates, which:

  • Have a list of primitives.
  • Have some initialization arguments, which correspond to the initialization arguments of their primitives.
  • Have some tunable hyperparameters, which correspond to the tunable hyperparameters of their primitives.

Pipeline

Templates can be used to build Pipelines by taking and fixing a set of valid hyperparameters for a Template. Hence, Pipelines:

  • Have a list of primitives, which corresponds to the list of primitives of their template.
  • Have some initialization arguments, which correspond to the initialization arguments of their template.
  • Have some hyperparameter values, which fall within the ranges of valid tunable hyperparameters of their template.

A pipeline can be fitted and evaluated using the MLPipeline API in MLBlocks.
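
To illustrate how these concepts map onto code, here is a minimal sketch that builds a pipeline directly with the MLBlocks MLPipeline API. The primitive names and hyperparameter values are only examples (they assume MLPrimitives is installed), and X and y stand for any feature matrix and target array; when using GreenGuard you normally do not need to touch MLBlocks directly.

from mlblocks import MLPipeline

# A "template" is essentially an ordered list of primitives
primitives = [
    'sklearn.preprocessing.StandardScaler',
    'sklearn.ensemble.RandomForestClassifier',
]
pipeline = MLPipeline(primitives)

# See which hyperparameters can be tuned, with their types and valid ranges
tunables = pipeline.get_tunable_hyperparameters()

# Fixing a set of hyperparameter values turns the template into a concrete pipeline
pipeline.set_hyperparameters({
    'sklearn.ensemble.RandomForestClassifier#1': {
        'n_estimators': 100,
    },
})

# The resulting pipeline can be fitted and evaluated like any estimator
pipeline.fit(X, y)
predictions = pipeline.predict(X)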

Current tasks and pipelines

In the current phase, we are addressing two tasks: time series classification and time series regression. To provide solutions for these two tasks we have two components.

GreenGuardPipeline

This class is in charge of learning from the data and making predictions by building MLBlocks pipelines and later tuning them using BTB.

GreenGuardLoader

A class responsible for loading the time series data from CSV files and returning it in the format expected by the GreenGuardPipeline.

Tuning

We call tuning the process of, given a dataset and a template, finding the pipeline derived from the given template that obtains the best possible score on the given dataset.

This process usually involves fitting and evaluating multiple pipelines with different hyperparameter values on the same data while using optimization algorithms to deduce which hyperparameters are more likely to get the best results in the next iterations.

We call each one of these tries a tuning iteration.
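
Conceptually, every tuning iteration proposes a set of hyperparameter values, evaluates the resulting pipeline with cross validation, and keeps track of the best result seen so far. The sketch below illustrates that loop with a plain random search over a scikit-learn classifier; it is only an illustration of the concept, since the real GreenGuardPipeline delegates the proposal step to the Bayesian optimizers in BTB and works on full MLBlocks pipelines. X_train and y_train stand for any training data.

import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

best_score = None
best_params = None

for iteration in range(10):  # each pass is one "tuning iteration"
    # Propose candidate hyperparameter values (randomly here; BTB is smarter about it)
    params = {
        'n_estimators': random.randint(10, 500),
        'max_depth': random.randint(2, 20),
    }

    # Evaluate a pipeline built with those values using cross validation
    model = RandomForestClassifier(**params)
    score = cross_val_score(model, X_train, y_train, cv=2).mean()

    # Keep the best hyperparameters found so far
    if best_score is None or score > best_score:
        best_score, best_params = score, params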

Getting Started

Requirements

Python

GreenGuard has been developed and runs on Python 3.5, 3.6 and 3.7.

Also, although it is not strictly required, the usage of a virtualenv is highly recommended in order to avoid interfering with other software installed on the system where you are running GreenGuard.

Installation

The simplest and recommended way to install GreenGuard is using pip:

pip install greenguard

For development, you can also clone the repository and install it from source:

git clone git@github.com:D3-AI/GreenGuard.git
cd GreenGuard
make install-develop

Quickstart

In this example we will load some demo data using the GreenGuardLoader and feed it to the GreenGuardPipeline, which will search for the best possible pipeline, fit it using the given data, and then make predictions with it.

1. Load and explore the data

The first step is to load the demo data.

For this, we will import and call the greenguard.loader.load_demo function without any arguments:

from greenguard.loader import load_demo

X, y, tables = load_demo()

The returned objects are:

X: A pandas.DataFrame with the targets table data without the target column.

   target_id  turbine_id  timestamp
0          1           1 2013-01-01
1          2           1 2013-01-02
2          3           1 2013-01-03
3          4           1 2013-01-04
4          5           1 2013-01-05

y: A pandas.Series with the target column from the targets table.

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: target, dtype: float64

tables: A dictionary containing three tables in the format explained above:

the turbines table:

   turbine_id       name
0           1  Turbine 1

the signals table:

   signal_id                                          name
0          1  WTG01_Grid Production PossiblePower Avg. (1)
1          2  WTG02_Grid Production PossiblePower Avg. (2)
2          3  WTG03_Grid Production PossiblePower Avg. (3)
3          4  WTG04_Grid Production PossiblePower Avg. (4)
4          5  WTG05_Grid Production PossiblePower Avg. (5)

and the readings table:

   reading_id  turbine_id  signal_id  timestamp  value
0           1           1          1 2013-01-01  817.0
1           2           1          2 2013-01-01  805.0
2           3           1          3 2013-01-01  786.0
3           4           1          4 2013-01-01  809.0
4           5           1          5 2013-01-01  755.0

2. Split the data

If we want to split the data in train and test subsets, we can do so by splitting the X and y variables with any suitable tool.

In this case, we will do it using the train_test_split function from scikit-learn.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

3. Finding the best Pipeline

Once we have loaded the data, we create a GreenGuardPipeline instance by passing:

  • template (string): the name of a template or the path to a template json file.
  • metric (string or function): The name of the metric to use, or a metric function.
  • cost (bool): Whether the metric is a cost function to be minimized or a score to be maximized.

Optionally, we can also pass details about the cross validation configuration:

  • stratify
  • cv_splits
  • shuffle
  • random_state

In this case, we will be loading the greenguard_classification pipeline, using the accuracy metric, and using only 2 cross validation splits:

from greenguard.pipeline import GreenGuardPipeline

pipeline = GreenGuardPipeline(template='greenguard_classification', metric='accuracy', cv_splits=2)

Once we have created the pipeline, we can call its tune method to find the best possible hyperparameters for our data, passing the X, y, and tables variables returned by the loader, as well as an indication of the number of tuning iterations that we want to perform.

pipeline.tune(X_train, y_train, tables, iterations=10)

After the tuning process has finished, the best hyperparameters found have already been set on the pipeline.

We can see the found hyperparameters by calling the get_hyperparameters method,

pipeline.get_hyperparameters()

which will return a dictionary with the best hyperparameters found so far:

{
    "pandas.DataFrame.resample#1": {
        "rule": "1D",
        "time_index": "timestamp",
        "groupby": [
            "turbine_id",
            "signal_id"
        ],
        "aggregation": "mean"
    },
    "pandas.DataFrame.unstack#1": {
        "level": "signal_id",
        "reset_index": true
    },
    ...

as well as the obtained cross validation score by looking at the score attribute of the pipeline object:

pipeline.score  # -> 0.6447509660798626

NOTE: If the score is not good enough, we can call the tune method again as many times as needed and the pipeline will continue its tuning process every time based on the previous results!
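
For example, to run 10 additional tuning iterations that build on the results obtained so far, we simply call tune again:

pipeline.tune(X_train, y_train, tables, iterations=10)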

4. Fitting the pipeline

Once we are satisfied with the obtained cross validation score, we can proceed to call the fit method passing again the same data elements.

This will fit the pipeline with all the training data available using the best hyperparameters found during the tuning process:

pipeline.fit(X_train, y_train, tables)

5. Use the fitted pipeline

After fitting the pipeline, we are ready to make predictions on new data:

predictions = pipeline.predict(X_test, tables)

And evaluate its prediction performance:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions)  # -> 0.6413043478260869

6. Save and load the pipeline

Since the tuning and fitting process takes time to execute and requires a lot of data, you will probably want to save a fitted instance and load it later to analyze new signals instead of fitting pipelines over and over again.

This can be done by using the save and load methods from the GreenGuardPipeline.

In order to save an instance, call its save method passing it the path and filename where the model should be saved.

path = 'my_pipeline.pkl'

pipeline.save(path)

Once the pipeline is saved, it can be loaded back as a new GreenGuardPipeline by using the GreenGuardPipeline.load method:

new_pipeline = GreenGuardPipeline.load(path)

Once loaded, it can be directly used to make predictions on new data.

new_pipeline.predict(X_test, tables)

Use your own Dataset

Once you are familiar with the GreenGuardPipeline usage, you will probably want to run it on your own dataset.

Here are the necessary steps:

1. Prepare the data

First of all, you will need to prepare your data as 4 CSV files like the ones described in the data format section above.

2. Create a GreenGuardLoader

Once you have the CSV files ready, you will need to import the greenguard.loader.GreenGuardLoader class and create an instance passing:

  • path - str: The path to the folder where the 4 CSV files are.
  • target - str, optional: The name of the target table. Defaults to targets.
  • target_column - str, optional: The name of the target column. Defaults to target.
  • readings - str, optional: The name of the readings table. Defaults to readings.
  • turbines - str, optional: The name of the turbines table. Defaults to turbines.
  • signals - str, optional: The name of the signals table. Defaults to signals.
  • gzip - bool, optional: Set to True if the CSV files are gzipped. Defaults to False.

For example, here we will be loading a custom dataset which has been stored in gzip format inside the my_dataset folder, and for which the target table has a different name:

from greenguard.loader import GreenGuardLoader

loader = GreenGuardLoader(path='my_dataset', target='labels', gzip=True)

3. Call the loader.load method.

Once the loader instance has been created, we can call its load method:

X, y, tables = loader.load()

Optionally, if the dataset contains only data to make predictions and the target column does not exist, we can pass target=False to skip it:

X, tables = loader.load(target=False)

Docker Usage

GreenGuard comes configured and ready to be distributed and run as a Docker image which starts a Jupyter Notebook already configured to use GreenGuard, with all the required dependencies installed.

Requirements

The only requirement in order to run the GreenGuard Docker image is to have Docker installed and a user with enough permissions to run it.

Installation instructions for any compatible system can be found here.

Additionally, the system that builds the GreenGuard Docker image will also need a working internet connection that allows downloading the base image and the additional Python dependencies.

Building the GreenGuard Docker Image

After having cloned the GreenGuard repository, all you have to do in order to build the GreenGuard Docker image is to run this command:

make docker-jupyter-build

After a few minutes, the new image, called greenguard-jupyter, will have been built into the system and will be ready to be used or distributed.

Distributing the GreenGuard Docker Image

Once the greenguard-jupyter image is built, it can be distributed in several ways.

Distributing using a Docker registry

The simplest way to distribute the recently created image is using a registry.

In order to do so, we will need to have write access to a public or private registry (remember to login!) and execute these commands:

docker tag greenguard-jupyter:latest your-registry-name:some-tag
docker push your-registry-name:some-tag

Afterwards, in the receiving machine:

docker pull your-registry-name:some-tag
docker tag your-registry-name:some-tag greenguard-jupyter:latest

Distributing as a file

If the distribution of the image has to be done offline for any reason, it can be achieved using the following command.

In the system that already has the image:

docker save --output greenguard-jupyter.tar greenguard-jupyter

Then copy over the file greenguard-jupyter.tar to the new system and there, run:

docker load --input greenguard-jupyter.tar

After these commands, the greenguard-jupyter image should be available and ready to be used in the new system.

Running the greenguard-jupyter image

Once the greenguard-jupyter image has been built, pulled or loaded, it is ready to be run.

This can be done in two ways:

Running greenguard-jupyter with the code

If the GreenGuard source code is available in the system, running the image is as simple as running this command from within the root of the project:

make docker-jupyter-run

This will start a jupyter notebook using the docker image, which you can access by pointing your browser at http://127.0.0.1:8888

In this case, the local version of the project will also be mounted within the Docker container, which means that any changes that you make in your local code will immediately be available within your notebooks, and that any notebook that you create within Jupyter will also show up in your notebooks folder!

Running greenguard-jupyter without the greenguard code

If the GreenGuard source code is not available in the system and only the Docker Image is, you can still run the image by using this command:

docker run -ti -p8888:8888 greenguard-jupyter

In this case, the code changes and the notebooks that you create within jupyter will stay inside the container and you will only be able to access and download them through the jupyter interface.

History

0.1.0

  • First release on PyPI
