
# Black Box Auditing and Certifying and Removing Disparate Impact

This repository contains a sample implementation of Gradient Feature Auditing (GFA) meant to be generalizable to most datasets. For more information on the repair process, see our paper on [Certifying and Removing Disparate Impact](http://arxiv.org/abs/1412.3756). For information on the full auditing process, see our paper on [Auditing Black-box Models for Indirect Influence](http://arxiv.org/abs/1602.07043).

# License

This code is licensed under an [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) license.

# Setup and Installation

1. Install the Python dependencies listed in the requirements.txt file.
2. Install python-matplotlib if you do not already have it (https://matplotlib.org/users/installing.html)
3. Install BlackBoxAuditing (`pip install BlackBoxAuditing`)

Many of the ModelVisitors rely on [Weka](http://www.cs.waikato.ac.nz/ml/weka/). Similarly, we use [TensorFlow](https://www.tensorflow.org/) for network-based machine learning. Any Python libraries that need to be installed are listed in the `requirements.txt` file. Weka and TensorFlow should be downloaded during installation, but here are the download links just in case.

- Weka 3.6.13 [download](http://www.cs.waikato.ac.nz/ml/weka/downloading.html)
- TensorFlow [download](https://www.tensorflow.org/versions/master/get_started/os_setup.html) (original experiments run with version 0.6.0)


# Certifying and Removing Disparate Impact

After installing BlackBoxAuditing, you can run the data repair described in [Certifying and Removing Disparate Impact](http://arxiv.org/abs/1412.3756) using the command `BlackBoxAuditing-repair` in a terminal; running it will tell you the arguments the script takes.

# Black Box Auditing

To run GFA on a dataset (as in [Auditing Black-box Models for Indirect Influence](http://arxiv.org/abs/1602.07043)), follow the instructions below.


## Running as a Python Script

After installing BlackBoxAuditing, GFA can be run on a dataset (as in [Auditing Black-box Models for Indirect Influence](http://arxiv.org/abs/1602.07043)) using a simple Python script. For reference, the following sample code shows the process with both a preloaded dataset and your own dataset:

```python
# import BlackBoxAuditing
import BlackBoxAuditing as BBA
# import machine learning technique
from BlackBoxAuditing.model_factories import Weka_SVM, Weka_DecisionTree

"""
Using a preloaded dataset
"""
# load in preloaded dataset
data = BBA.load_data("german")

# initialize the auditor and set parameters
auditor = BBA.Auditor()
auditor.model = Weka_SVM

# call the auditor with the data
auditor(data)


"""
Using your own dataset
"""
# load your own data
datafile = 'path/to/datafile'
data = BBA.load_from_file(datafile)

# initialize the auditor and set parameters
auditor = BBA.Auditor()
auditor.model = Weka_DecisionTree

# call the auditor
auditor(data)

```

### More Advanced Script Options

#### Using a preloaded dataset

The BlackBoxAuditing package has a few datasets preloaded and ready to use for auditing. In a script, they are available via the function `load_data` which takes as input the name of the dataset and returns formatted data ready for auditing. The following is the list of preloaded datasets available for auditing:

* adult
* diabetes
* ricci
* german
* glass
* sample
* DRP

Refer to the Sources section below for more information about the datasets.
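
For example, auditing the preloaded `adult` dataset uses exactly the same calls as the sample script above, just with a different dataset name:

```python
import BlackBoxAuditing as BBA
from BlackBoxAuditing.model_factories import Weka_DecisionTree

# load one of the preloaded datasets by name
data = BBA.load_data("adult")

# set up and run the auditor as in the sample script
auditor = BBA.Auditor()
auditor.model = Weka_DecisionTree
auditor(data)
```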

#### Using your own dataset

To use your own data for auditing, the function `load_from_file` takes, at minimum, the path to your dataset and returns formatted data ready for auditing. `load_from_file` also accepts other parameters that should be set to ensure your data is processed correctly. The full signature and its defaults are shown here, with an example after the parameter list:

```
load_from_file(datafile, testdata=None, correct_types=None, train_percentage=2.0/3.0,
               response_header=None, features_to_ignore=None, missing_data_symbol="")
```

* *datafile*: path to your dataset.
* *testdata*: path to the dataset used for testing a model. Assumes that *datafile* is the training data.
* *correct_types*: list of the types (str, int, or float) of the features in the data. If not given, the types are inferred automatically by inspecting the values of each feature.
* *train_percentage*: fraction of the data to use for training, given as a float.
* *response_header*: name of the response column in the data. If not given, the last column in the data is assumed to be the response.
* *features_to_ignore*: list of the names of any features that you wish to be ignored by the model.
* *missing_data_symbol*: symbol that marks missing or unknown values in the data.
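
The following is a hedged sketch of a call that sets most of these parameters; the file path, column names, types, and missing-data symbol are all hypothetical, and passing Python type objects for *correct_types* is an assumption based on the description above:

```python
import BlackBoxAuditing as BBA

# hypothetical CSV with columns: name (str), age (int), score (float), outcome (str)
datafile = "path/to/my_data.csv"

data = BBA.load_from_file(
    datafile,
    correct_types=[str, int, float, str],  # one entry per column, in order (assumed type objects)
    train_percentage=0.75,                 # use 75% of the rows for training
    response_header="outcome",             # column the model should predict
    features_to_ignore=["name"],           # columns the model should not use
    missing_data_symbol="?"                # how missing values are marked in the file
)
```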

#### Auditor options

After initializing the auditor with `auditor = BlackBoxAuditing.Auditor()`, there are a few options that can be set to tune the audit. The available options are listed below, followed by a short example:

`auditor.measurers`: (*default = [accuracy, BCR]*) list of measurers to use for GFA

`auditor.model_options`: (*default = {}*) options for machine learning model

`auditor.verbose`: (*default = True*) Set to `True` for more detailed status updates

`auditor.REPAIR_STEPS`: (*default = 10*) Number of repair steps to take

`auditor.RETRAIN_MODEL_PER_REPAIR`: (*default = False*)

`auditor.WRITE_ORIGINAL_PREDICTIONS`: (*default = True*)

`auditor.ModelFactory`: (*default = Weka_SVM*) Available machine learning options: Weka_SVM, Weka_DecisionTree, TensorFlow

`auditor.kdd`: (*default = False*)
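
As a sketch, a script might tune a few of these options before running the audit; the attribute names come from the list above, and the particular values are illustrative rather than recommended settings:

```python
import BlackBoxAuditing as BBA
from BlackBoxAuditing.model_factories import Weka_DecisionTree

data = BBA.load_data("german")

auditor = BBA.Auditor()
auditor.verbose = True                     # print detailed status updates
auditor.REPAIR_STEPS = 5                   # fewer repair steps than the default of 10
auditor.RETRAIN_MODEL_PER_REPAIR = False
auditor.ModelFactory = Weka_DecisionTree   # choose the machine-learning method
auditor(data)
```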


## Testing Code Changes

After BlackBoxAuditing has been installed, you can run the test suite from a terminal using the command `BlackBoxAuditing-test`.

Every Python file should include test functions at the bottom that are run when the file itself is run. This can be done by including the line `if __name__=="__main__": test()`, as long as a function named `test` is defined.

These tests should use print statements with `True` or `False` readouts indicating success or failure (where `True` always means success). It is fine, and encouraged, to have multiple such checks per file.

Note: if a test requires reading data from the `test_data` directory, it should import the appropriate `load_data` file from the `experiments` directory.
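
A minimal sketch of this pattern, with made-up checks standing in for real ones:

```python
def test():
    # each check prints True on success, False on failure
    rows = [["a", 1], ["b", 2]]                        # placeholder data
    print("expected number of rows:", len(rows) == 2)
    print("first column is a string:", isinstance(rows[0][0], str))

if __name__ == "__main__":
    test()
```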

## Implementing a New Machine-Learning Method

The best way to add a new machine-learning method is to implement a ModelFactory and ModelVisitor. A ModelVisitor should be thought of as a wrapper that knows how to load a machine-learning model of a given type and communicate with that model file in order to output predicted values for some test dataset. A ModelFactory simply knows how to "build" a ModelVisitor based on some provided training data. Check out the "Abstract" files in the `sample_experiment` directory for outlines of what these two classes should do; similarly, check out the "SVM_ModelFactory" files in the `sample_experiment` subdirectory for examples that use Weka to create model files and produce predictions.
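
As a rough, hypothetical outline only (the class layout and the `build`/`test` method names here are assumptions; the "Abstract" files in `sample_experiment` define the real interface), the two classes fit together roughly like this:

```python
class MajorityClassModelVisitor:
    """Hypothetical visitor: holds a 'trained model' (here, just the majority label)
    and produces predictions for a test dataset."""
    def __init__(self, majority_label, response_index):
        self.majority_label = majority_label
        self.response_index = response_index

    def test(self, test_set):
        # return (actual, predicted) pairs for each row of the test data
        return [(row[self.response_index], self.majority_label) for row in test_set]


class MajorityClassModelFactory:
    """Hypothetical factory: 'builds' a ModelVisitor from training data by
    remembering the most common response value."""
    def __init__(self, headers, response_header):
        self.response_index = headers.index(response_header)

    def build(self, train_set):
        labels = [row[self.response_index] for row in train_set]
        majority = max(set(labels), key=labels.count)
        return MajorityClassModelVisitor(majority, self.response_index)
```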

# Sources

Dataset Sources:
- adult.csv [link](https://archive.ics.uci.edu/ml/datasets/Adult)
- german_categorical.csv (Modified from [link](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29))
- RicciDataMod.csv (Modified from [link](http://www.amstat.org/publications/jse/v18n3/RicciData.csv))
- DRP Datasets (Source and data-files coming soon.)
- Arrests/Recidivism Datasets [link](http://www.icpsr.umich.edu/icpsrweb/RCMD/studies/3355)
- Linear Datasets ("sample_2" Experiment) [link](https://github.com/jasonbaldridge/try-tf)

More information on DRP can be found at the [Dark Reactions Project](http://darkreactions.haverford.edu/) official site.

# Bug Reports and Feature-Requests

All bug reports and feature requests should be submitted through the [Issue Tracker](https://github.com/cfalk/BlackBoxAuditing/issues).