Compute (Target) Permutation Importances of a machine learning model
Target Permutation Importances
[Source] [Bug Report] [Documentation] [API Reference]
Overview
This method aims to reduce the feature attribution caused by a feature's variance. If a feature still shows high importance to a model after the target vector has been shuffled, it is fitting noise.
Overall, this package:
- Fits the given model class $M$ times to get $M$ actual feature importances of feature $f$: $A_f = [a_{f_1}, a_{f_2}, ..., a_{f_M}]$.
- Fits the given model class with shuffled targets $N$ times to get $N$ random feature importances: $R_f = [r_{f_1}, r_{f_2}, ..., r_{f_N}]$.
- Computes the final importance of a feature $f$ by various methods, such as:
  - $I_f = Avg(A_f) - Avg(R_f)$
  - $I_f = Avg(A_f) / (Avg(R_f) + 1)$
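The two combination methods above can be sketched with plain NumPy; the importance values here are made up for illustration:

```python
import numpy as np

# Hypothetical per-feature importances for one feature f:
# M=2 actual runs and N=4 shuffled-target runs
actual = np.array([0.30, 0.34])              # A_f
random = np.array([0.05, 0.08, 0.04, 0.07])  # R_f

# I_f = Avg(A_f) - Avg(R_f): positive when the feature beats its noise baseline
i_minus = actual.mean() - random.mean()

# I_f = Avg(A_f) / (Avg(R_f) + 1): shrinks features with high random importance
i_ratio = actual.mean() / (random.mean() + 1)
```

A feature whose actual importance barely exceeds its shuffled-target importance ends up near zero under either method.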
Not to be confused with sklearn.inspection.permutation_importance: that sklearn method permutes features, whereas this package permutes the target.
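For contrast, here is sklearn's feature-permutation method (sklearn's own API, not this package):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# sklearn permutes each *feature column* and measures the score drop;
# target permutation instead shuffles *y* and refits the model from scratch.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.shape)  # one mean importance per feature
```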
This method was originally proposed/implemented by:
- Permutation importance: a corrected feature importance measure
- Feature Selection with Null Importances
Install
pip install target-permutation-importances
or
poetry add target-permutation-importances
Basic Usage
```python
# Import the function
import target_permutation_importances as tpi

# Prepare a dataset
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Models
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

data = load_breast_cancer()

# Convert to a pandas dataframe
Xpd = pd.DataFrame(data.data, columns=data.feature_names)

# Compute permutation importances with default settings
result_df = tpi.compute(
    model_cls=RandomForestClassifier,  # The constructor/class of the model.
    model_cls_params={  # The parameters to pass to the model constructor.
        "n_estimators": 1,
    },
    model_fit_params={},  # The parameters to pass to the model fit method.
    X=Xpd,  # pd.DataFrame
    y=data.target,  # pd.Series, np.ndarray
    num_actual_runs=2,
    num_random_runs=10,
)
print(result_df[["feature", "importance"]].sort_values("importance", ascending=False).head())
```
Fork the above code on Kaggle.
Outputs:
```
Running 2 actual runs and 10 random runs
100%|██████████| 2/2 [00:00<00:00, 167.35it/s]
100%|██████████| 10/10 [00:00<00:00, 163.71it/s]
                feature  importance
7   mean concave points    0.343365
8        mean concavity    0.291501
25      worst perimeter    0.021797
10       mean perimeter    0.021520
26         worst radius    0.008913
```
You can find more detailed examples in the "Feature Selection Examples" section.
Advanced Usage / Customization
This package exposes generic_compute to allow customization. Read target_permutation_importances.__init__.py for details.
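As a rough illustration of what such a computation automates (this is not the actual generic_compute API, whose signature lives in the source; it is a manual target-shuffling loop using scikit-learn):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

def mean_importances(y_vec, n_runs):
    # Fit one model per run and average its impurity-based importances
    runs = []
    for seed in range(n_runs):
        m = RandomForestClassifier(n_estimators=5, random_state=seed).fit(X, y_vec)
        runs.append(m.feature_importances_)
    return np.mean(runs, axis=0)

actual = mean_importances(y, n_runs=2)                    # Avg(A_f)
random_imp = np.mean(
    [mean_importances(rng.permutation(y), n_runs=1) for _ in range(5)],
    axis=0,
)                                                         # Avg(R_f)
final = actual - random_imp  # the Avg(A_f) - Avg(R_f) variant
```

Customizing via generic_compute lets you swap in a different model class, importance attribute, or combination function without rewriting this loop yourself.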
Feature Selection Examples
Benchmarks
Benchmarks were run on tabular datasets from the Tabular data learning benchmark, which is also hosted on Hugging Face.
The following models, with their default params, are used in the benchmark:
- sklearn.ensemble.RandomForestClassifier
- sklearn.ensemble.RandomForestRegressor
- xgboost.XGBClassifier
- xgboost.XGBRegressor
- catboost.CatBoostClassifier
- catboost.CatBoostRegressor
- lightgbm.LGBMClassifier
- lightgbm.LGBMRegressor
For the binary classification task, sklearn.metrics.f1_score is used for evaluation. For the regression task, sklearn.metrics.mean_squared_error is used for evaluation.
The downloaded datasets are divided into 3 sections: train: 50%, val: 10%, test: 40%.
Feature importance is calculated from the train set. Feature selection is done on the val set.
The final benchmark is evaluated on the test set. Therefore the test set is unseen to both the feature importance and selection process.
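A minimal sketch of such a split and evaluation, assuming scikit-learn's train_test_split and the breast cancer dataset rather than the actual benchmark datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 50% train, 10% val, 40% test via two chained splits:
# first carve off 50% for train, then split the rest 20/80 (= 10%/40% overall)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.2, random_state=0
)

# Fit on train only; the test set stays unseen until the final evaluation
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
score = f1_score(y_test, model.predict(X_test))
```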
Raw result data are in benchmarks/results.
Kaggle Competitions
Many Kaggle competition top solutions involve this method; here are some examples:
| Year | Competition | Medal | Link |
|---|---|---|---|
| 2023 | Predict Student Performance from Game Play | Gold | 3rd place solution |
| 2019 | Elo Merchant Category Recommendation | Gold | 16th place solution |
| 2018 | Home Credit Default Risk | Gold | 10th place solution |
Development Setup and Contribution Guide
Python Version
You can find the suggested development Python version in .python-version.
You might consider setting up Pyenv if you want to have multiple Python versions on your machine.
Python packages
This repository is set up with Poetry. If you are not familiar with Poetry, you can find the package requirements listed in pyproject.toml.
Otherwise, you can just set it up with poetry install.
Run Benchmarks
To run the benchmark locally on your machine, run make run_tabular_benchmark or python -m benchmarks.run_tabular_benchmark.
Make Changes
Follow the Make Changes Guide from GitHub.
Before committing or merging, please run the linters defined in make lint and the tests defined in make test.