Discrust

Supervised discretization in Rust

The discrust package provides a supervised discretization algorithm. Under the hood it implements a decision tree, using information value to find the optimal splits, and provides several different methods to constrain the final discretization scheme.
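To make the split criterion concrete, here is a minimal pure-Python sketch of scoring a candidate split by information value and picking the best one. It illustrates the general technique and is not the package's Rust implementation. The function names `information_value` and `best_split`, the sign convention (log of the share of ones over the share of zeros), and the choice to score a split as 0 when a bin lacks either class are all assumptions made for this example.

```python
import math


def information_value(x, y, split):
    """Information value of cutting x into two bins: x <= split and x > split.

    ASSUMPTION: y holds 0/1 labels, and a split that leaves a bin without
    both classes is scored 0.0 because its weight of evidence is undefined.
    """
    total_pos = sum(y)
    total_neg = len(y) - total_pos
    iv = 0.0
    for in_bin in (lambda v: v <= split, lambda v: v > split):
        pos = sum(yi for xi, yi in zip(x, y) if in_bin(xi))
        neg = sum(1 - yi for xi, yi in zip(x, y) if in_bin(xi))
        if pos == 0 or neg == 0:
            return 0.0
        p_pos = pos / total_pos  # share of all ones that fall in this bin
        p_neg = neg / total_neg  # share of all zeros that fall in this bin
        iv += (p_pos - p_neg) * math.log(p_pos / p_neg)
    return iv


def best_split(x, y):
    """Return the candidate split value with the highest information value."""
    candidates = sorted(set(x))[:-1]  # the largest value cannot split anything
    return max(candidates, key=lambda s: information_value(x, y, s))


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]
split = best_split(x, y)
print(split, information_value(x, y, split))
```

A decision tree built this way would repeat the search inside each resulting bin until a constraint such as min_iv or max_bins stops it.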

The package draws heavily from the ivpy package, both in the algorithm and the parameter controls.

Usage

The package has a single user-facing class, Discretizer, that can be instantiated with the following arguments.

  • min_obs (Optional[float], optional): Minimum number of observations required in a bin. Defaults to 5.
  • max_bins (Optional[int], optional): Maximum number of bins to split the variable into. Defaults to 10.
  • min_iv (Optional[float], optional): Minimum information value required to make a split. Defaults to 0.001.
  • min_pos (Optional[float], optional): Minimum number of records with a value of one that should be present in a split. Defaults to 5.
  • mono (Optional[int], optional): The monotonicity required between the binned variable and the binary performance outcome. A value of -1 will result in a negative correlation between the binned x variable and the y variable, while a value of 1 will result in a positive correlation between them. Specifying a value of 0 will result in binning x with no monotonicity constraint. If a value of None is specified, the monotonicity will be determined by the monotonicity of the first split. Defaults to None.
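The mono constraint can be pictured as a check on the event rate of each bin, taken in order of increasing x. The sketch below is only an illustration of that idea. The helpers `is_monotonic` and `infer_mono` are hypothetical names, not part of the discrust API, and treating ties as acceptable is an assumption.

```python
def is_monotonic(event_rates, mono):
    """Check per-bin event rates (ordered by x) against a mono setting.

    mono = 1 requires non-decreasing rates, mono = -1 requires
    non-increasing rates, and mono = 0 imposes no constraint.
    ASSUMPTION: equal neighbouring rates are allowed.
    """
    if mono == 0:
        return True
    pairs = list(zip(event_rates, event_rates[1:]))
    if mono == 1:
        return all(a <= b for a, b in pairs)
    if mono == -1:
        return all(a >= b for a, b in pairs)
    raise ValueError("mono must be -1, 0 or 1")


def infer_mono(left_rate, right_rate):
    """Mimic mono=None: take the direction from the first split's two bins."""
    return 1 if right_rate >= left_rate else -1


print(is_monotonic([0.10, 0.25, 0.60], mono=1))
print(infer_mono(left_rate=0.6, right_rate=0.2))
```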

The fit method can be called on data and accepts the following parameters.

  • x (ArrayLike): An arraylike numeric field that will be discretized based on the values of y, and the constraints the Discretizer was initialized with.
  • y (ArrayLike): An arraylike binary field.
  • sample_weight (Optional[ArrayLike], optional): Optional sample weight array to be used when calculating the optimal breaks. Defaults to None.

This method will return a list of the optimal split values for the feature given the constraints. After being fit the discretizer will have a splits_ attribute with this list.

import pandas as pd
import seaborn as sns

df = sns.load_dataset("titanic")

from discrust import Discretizer

ds = Discretizer(min_obs=5, max_bins=10, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["fare"], df["survived"])
# [-inf, 6.95, 7.125, 7.7292, 10.4625, 15.1, 50.4958, 52.0, 73.5, 79.65, inf]
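The returned list can be read as bin edges, with each pair of neighbouring values bounding one bin. The following sketch uses only the standard library to show one way of looking up the bin for a single value. The helpers `bin_index` and `bin_label` are hypothetical, and treating bins as closed on the right is an assumption, since the description above does not say which side of an edge is inclusive.

```python
import bisect
import math

# Edges copied from the fit example above.
edges = [-math.inf, 6.95, 7.125, 7.7292, 10.4625, 15.1,
         50.4958, 52.0, 73.5, 79.65, math.inf]


def bin_index(edges, value):
    """Index of the bin (edges[i], edges[i + 1]] that contains value.

    ASSUMPTION: bins are right-closed; the real package may differ.
    """
    return max(bisect.bisect_left(edges, value) - 1, 0)


def bin_label(edges, value):
    """Human-readable interval for the bin that contains value."""
    i = bin_index(edges, value)
    return f"({edges[i]}, {edges[i + 1]}]"


print(bin_index(edges, 7.25), bin_label(edges, 7.25))
```

Eleven edges describe ten bins, which matches the max_bins=10 setting used in the example.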

The predict method can be called and will discretize the feature, and then perform weight of evidence substitution on each binned level. This method takes the following arguments.

  • x (ArrayLike): An arraylike numeric field.

ds.predict(df["fare"])[0:5]
# array([-0.84846814, 0.78344263, -0.787529, 0.78344263, -0.787529])
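Weight of evidence substitution replaces each bin with a single number that summarizes how the share of ones in that bin compares with the share of zeros. The sketch below shows the idea in plain Python on toy data. The function `woe_by_bin`, the sign convention (log of the share of ones over the share of zeros), and the requirement that every bin contains both classes are assumptions for illustration, not a description of the package internals.

```python
import math
from collections import Counter


def woe_by_bin(bins, y):
    """Weight of evidence for each bin label.

    ASSUMPTION: every bin contains at least one 0 and one 1, so the
    logarithm is always defined.
    """
    pos = Counter(b for b, yi in zip(bins, y) if yi == 1)
    neg = Counter(b for b, yi in zip(bins, y) if yi == 0)
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())
    return {
        b: math.log((pos[b] / total_pos) / (neg[b] / total_neg))
        for b in set(bins)
    }


# Toy data: bin 0 is mostly zeros, bin 1 is mostly ones.
bins = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]

woe = woe_by_bin(bins, y)
encoded = [woe[b] for b in bins]  # substitute each bin with its WoE value
print(encoded)
```

Under this convention a bin with a below-average share of ones gets a negative value and a bin with an above-average share gets a positive one.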

Installation

From PyPI

For Windows users, the package can be installed directly from PyPI with the following command.

python -m pip install discrust

Building from Source

The package can be built from source. It uses the maturin tool as a build backend, which requires Python and a working Rust compiler to be installed; see here for details. If these two requirements are met, you can clone this repository and run the following command in the repository's root directory.

python -m pip install . -v

This should invoke the maturin tool, which will handle building the Rust code and installing the package. Alternatively, if you simply want to build a wheel, you can run the following command after installing maturin.

maturin build --release

I have had some problems building packages with maturin directly in a conda environment. This is actually a bug on Anaconda's side that will hopefully be resolved. If it gives you any problems, it is usually easiest to build a wheel inside a venv and then install the wheel.

Additional TODOs

  • Support for exception values
  • Support for missing values in both the dependent and independent variables
