Coarse approximation linear function with cross validation

Project description

A binomial classifier that implements the Coarse Approximation Linear Function (CALF).

Contact

Rolf Carlson hrolfrc@gmail.com

Install

Use pip to install calfcv.

pip install calfcv

Introduction

This is a Python implementation of the Coarse Approximation Linear Function (CALF). The implementation follows the greedy forward-selection algorithm described in the paper referenced below.
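The greedy forward-selection idea can be sketched in a few lines. This is a minimal illustration of the approach, not the package's implementation: at each step, the unused feature that most improves AUC when added with a coarse weight of -1 or +1 is selected, and selection stops when no addition helps.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def calf_greedy(X, y, max_features=None):
    """Greedy forward selection with coarse weights in {-1, +1}.

    A simplified sketch of the CALF idea.  Assumes X is already
    standardized (zero mean, unit variance) and y is binary.
    """
    n_features = X.shape[1]
    w = np.zeros(n_features)
    best_auc = 0.5  # AUC of a random classifier
    for _ in range(max_features or n_features):
        best = None
        # Try every unused feature with both coarse weights.
        for j in np.flatnonzero(w == 0):
            for sign in (-1.0, 1.0):
                trial = w.copy()
                trial[j] = sign
                auc = roc_auc_score(y, X @ trial)
                if auc > best_auc:
                    best_auc, best = auc, (j, sign)
        if best is None:  # no addition improves the score
            break
        w[best[0]] = best[1]
    return w, best_auc
```

The resulting weight vector has entries in {-1, 0, 1}, matching the coarse coefficients returned by CalfCV below.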

Currently, CalfCV provides classification and prediction for two classes, the binomial case. Multinomial classification with more than two classes is not implemented.

The feature matrix is scaled to have zero mean and unit variance. Cross-validation is used to identify the optimal score and coefficients. CalfCV is designed for use with scikit-learn pipelines and composite estimators.

Example

from calfcv import CalfCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import numpy as np

Make a classification problem

seed = 42
X, y = make_classification(
    n_samples=30,
    n_features=5,
    n_informative=2,
    n_redundant=2,
    n_classes=2,
    random_state=seed
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

Train the classifier

The best score is the best mean AUC over the cross-validation folds.

cls = CalfCV().fit(X_train, y_train)
cls.best_score_
0.95

The coefficients for the best score are in [-1, 0, 1].

cls.best_coef_
[-1, 1, 0, 1, 1]

The probabilities of class 1 are in the last row

We vertically stack the ground truth on the top with the probabilities of class 1 on the bottom. We show the first 5 entries.

np.round(np.vstack((y_train, cls.predict_proba(X_train).T))[:, 0:5], 2)
array([[0.  , 1.  , 1.  , 0.  , 0.  ],
       [0.71, 0.05, 0.19, 0.34, 0.54],
       [0.29, 0.95, 0.81, 0.66, 0.46]])

Predicting on the training data should give a slightly higher score than best_score_

That is what we see here: best_score_ is the mean AUC over the cross-validation folds, while this score is computed on the full training set.

roc_auc_score(y_true=y_train, y_score=cls.predict_proba(X_train)[:, 1])
0.9750000000000001

The classifier will likely produce a lower score on unseen data

Often the score on unseen data is lower, but in this small example it is higher.

roc_auc_score(y_true=y_test, y_score=cls.predict_proba(X_test)[:, 1])
1.0

Score using classes is lower than score using probabilities

The ground truth is on the top and the predicted class is on the bottom. The sample at index 6 of y_test is predicted incorrectly; the others are correct.

y_pred = cls.predict(X_test)
np.vstack((y_test, y_pred))
array([[0, 1, 1, 0, 1, 0, 0, 0],
       [0, 1, 1, 0, 1, 0, 1, 0]])
roc_auc_score(y_true=y_test, y_score=y_pred)
0.9
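The drop from 1.0 to 0.9 comes from thresholding: hard 0/1 labels discard the ranking information that probabilities carry. The effect can be reproduced with synthetic scores chosen to mirror the example above (illustrative values, not CalfCV output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Ground truth as in y_test above.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 0])

# Probabilities that rank every positive above every negative,
# but where one negative (0.6) crosses the 0.5 threshold.
proba = np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])

# Scoring the probabilities preserves the perfect ranking.
auc_proba = roc_auc_score(y_true, proba)          # 1.0

# Thresholding at 0.5 creates one false positive at index 6,
# which costs ranking information and lowers the AUC.
auc_class = roc_auc_score(y_true, (proba >= 0.5).astype(int))  # 0.9
```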

Authors

The CALF algorithm was designed by Clark D. Jeffries, John R. Ford, Jeffrey L. Tilson, Diana O. Perkins, Darius M. Bost, Dayne L. Filer and Kirk C. Wilhelmsen. This python implementation was written by Rolf Carlson.

References

Jeffries, C.D., Ford, J.R., Tilson, J.L. et al. A greedy regression algorithm with coarse weights offers novel advantages. Sci Rep 12, 5440 (2022). https://doi.org/10.1038/s41598-022-09415-2
