napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification.

Project description

napkinXC

napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification. It lets you train a classifier on very large datasets in a few lines of code and with minimal resources.

Right now, napkinXC implements the following features both in Python and C++:

Probabilistic Label Trees (PLT) and Online Probabilistic Label Trees (OPLT),
Hierarchical softmax (HSM),
Binary Relevance (BR),
One Versus Rest (OVR),
fast online prediction of top-k labels or labels above the given threshold,
hierarchical k-means clustering for tree building and other tree building methods,
support for predefined hierarchies,
LIBLINEAR, SGD, and AdaGrad solvers for base classifiers,
efficient ensembles of tree-based models,
helpers to download and load data from XML Repository,
helpers to measure performance.
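
To make the PLT and top-k/threshold items above more concrete, here is a minimal sketch of how a probabilistic label tree scores a label: every tree node has a binary classifier, and the probability of a label is the product of the node probabilities on the path from the root to that label's leaf. The toy tree, the node names, the probabilities, and both function names are hypothetical and not part of napkinXC. The sketch scores every leaf by brute force, whereas an efficient implementation would search the tree and skip low-probability branches.

```python
# Toy label tree (hypothetical): child -> parent, the root has no parent.
PARENT = {
    "root": None,
    "a": "root", "b": "root",
    "l1": "a", "l2": "a", "l3": "b", "l4": "b",
}
# Each leaf of the tree corresponds to exactly one label.
LEAF_TO_LABEL = {"l1": 0, "l2": 1, "l3": 2, "l4": 3}


def label_probability(leaf, node_prob):
    """Multiply the node-classifier outputs on the path from the leaf up to the root."""
    p = 1.0
    node = leaf
    while node is not None:
        p *= node_prob[node]
        node = PARENT[node]
    return p


def predict_labels(node_prob, top_k=None, threshold=0.0):
    """Return (label, probability) pairs above the threshold, best first, cut to top_k."""
    scored = [(label, label_probability(leaf, node_prob))
              for leaf, label in LEAF_TO_LABEL.items()]
    kept = [(label, p) for label, p in scored if p > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_k is None else kept[:top_k]


# Made-up outputs of the node classifiers for a single example.
NODE_PROB = {"root": 1.0, "a": 0.8, "b": 0.3,
             "l1": 0.5, "l2": 0.25, "l3": 0.9, "l4": 0.1}

print(predict_labels(NODE_PROB, top_k=2))        # two most probable labels
print(predict_labels(NODE_PROB, threshold=0.1))  # all labels with probability above 0.1
```

Because each factor is at most 1, the probability can only shrink as the path gets longer, which is what allows a tree search to discard a whole subtree once its partial product drops below the threshold.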

Please note that this library is still under development and also serves as a base for experiments. Some of the experimental features may not be documented.

napkinXC is distributed under the MIT license. All contributions to the project are welcome!

Roadmap

Coming soon:

OPLT available in Python
Possibility to use any type of binary classifier from Python
Improved dataset loading from Python
More datasets from XML Repository

Python quick start

The Python version of napkinXC can be easily installed from the PyPI repository:

pip install napkinxc

Minimal example of usage:

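from napkinxc.models import PLT
from napkinxc.measures import precision_at_k
from napkinxc.datasets import load_dataset

# Load the train and test splits of the EURLex-4K dataset
X_train, Y_train = load_dataset("eurlex-4k", "train")
X_test, Y_test = load_dataset("eurlex-4k", "test")

# Train a PLT model; "eurlex-model" is the model's output location
plt = PLT("eurlex-model")
plt.fit(X_train, Y_train)

# Predict the single most probable label per example and report precision at 1
Y_pred = plt.predict(X_test, top_k=1)
print(precision_at_k(Y_test, Y_pred, k=1))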
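
For readers unfamiliar with the measure used in the example, the following is a small pure-Python sketch of what precision at k computes: for each example, the fraction of the k highest-ranked predicted labels that are true labels, averaged over all examples. The function name `precision_at_k_sketch` is hypothetical, the sketch assumes labels are given as lists of integers with predictions ordered best first, and the return shape of the library's own `precision_at_k` may differ.

```python
def precision_at_k_sketch(Y_true, Y_pred, k=1):
    """Average over examples of |top-k predicted labels that are true| / k."""
    if len(Y_true) != len(Y_pred):
        raise ValueError("Y_true and Y_pred must contain the same number of examples")
    if not Y_true:
        raise ValueError("at least one example is required")
    total = 0.0
    for true_labels, ranked_pred in zip(Y_true, Y_pred):
        true_set = set(true_labels)
        # Count how many of the k best predictions are correct.
        hits = sum(1 for label in ranked_pred[:k] if label in true_set)
        total += hits / k
    return total / len(Y_true)


# Two examples: true label sets and predictions ranked from most to least probable.
Y_true_toy = [[1, 2], [3]]
Y_pred_toy = [[1, 5], [4, 3]]

print(precision_at_k_sketch(Y_true_toy, Y_pred_toy, k=1))
print(precision_at_k_sketch(Y_true_toy, Y_pred_toy, k=2))
```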

More examples can be found in the python/examples directory.

Building executable

napkinXC can also be built as an executable using:

cmake .
make -j
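
Once built, the executable can also be driven from a script. The sketch below only assembles argument lists for the train and test commands, using options listed under Command line options; it does not run anything. It assumes the binary is called `nxc` (as in the usage text) and is on the PATH, that `test` reads the model from the directory given with `-o`, and the data paths, the model directory, and the helper name `nxc_command` are hypothetical.

```python
import shlex

NXC = "nxc"  # assumption: executable name taken from the usage text, found on PATH


def nxc_command(command, input_path, model_dir, **options):
    """Build an argument list: nxc <command> -i <input> -o <model dir> --name value ..."""
    argv = [NXC, command, "-i", input_path, "-o", model_dir]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


# Hypothetical paths; the option names come from the command line help.
train_cmd = nxc_command("train", "data/train.txt", "model",
                        model="plt", treeType="hierarchicalKmeans", arity=2)
test_cmd = nxc_command("test", "data/test.txt", "model",
                       topK=5, measures="p@1,p@3,p@5")

# Print the commands as they would be typed in a shell.
print(shlex.join(train_cmd))
print(shlex.join(test_cmd))
```

To actually execute a command, the list can be passed to `subprocess.run(train_cmd, check=True)`.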

Command line options

Usage: nxc <command> <args>

Commands:
    train                   Train model on given input data
    test                    Test model on given input data
    predict                 Predict for given data
    ofo                     Use online f-measure optimization
    version                 Print napkinXC version
    help                    Print help

Args:
    General:
    -i, --input             Input dataset
    -o, --output            Output (model) dir
    -m, --model             Model type (default = plt):
                            Models: ovr, br, hsm, plt, oplt, ubop, ubopHsm, brMips, ubopMips
    --ensemble              Number of models in ensemble (default = 1)
    -d, --dataFormat        Type of data format (default = libsvm),
                            Supported data formats: libsvm
    -t, --threads           Number of threads to use (default = 0)
                            Note: -1 to use #cpus - 1, 0 to use #cpus
    --header                Input contains header (default = 1)
                            Header format for libsvm: #lines #features #labels
    --hash                  Size of features space (default = 0)
                            Note: 0 to disable hashing
    --featuresThreshold     Prune features below given threshold (default = 0.0)
    --seed                  Seed (default = system time)
    --verbose               Verbose level (default = 2)

    Base classifiers:
    --optimizer             Optimizer used for training binary classifiers (default = liblinear)
                            Optimizers: liblinear, sgd, adagrad, fobos
    --bias                  Value of the bias features (default = 1)
    --inbalanceLabelsWeighting     Increase the weight of minority labels in base classifiers (default = 1)
    --weightsThreshold      Threshold value for pruning models weights (default = 0.1)

    LIBLINEAR:              (more about LIBLINEAR: https://github.com/cjlin1/liblinear)
    -s, --solver            LIBLINEAR solver (default for log loss = L2R_LR_DUAL, for l2 loss = L2R_L2LOSS_SVC_DUAL)
                            Supported solvers: L2R_LR_DUAL, L2R_LR, L1R_LR,
                                               L2R_L2LOSS_SVC_DUAL, L2R_L2LOSS_SVC, L2R_L1LOSS_SVC_DUAL, L1R_L2LOSS_SVC
    -c, --liblinearC        LIBLINEAR cost coefficient, inverse of regularization strength, must be a positive float,
                            smaller values specify stronger regularization (default = 10.0)
    --eps, --liblinearEps   LIBLINEAR tolerance of termination criterion (default = 0.1)

    SGD/AdaGrad:
    -l, --lr, --eta         Step size (learning rate) for online optimizers (default = 1.0)
    --epochs                Number of training epochs for online optimizers (default = 1)
    --adagradEps            Defines starting step size for AdaGrad (default = 0.001)

    Tree:
    -a, --arity             Arity of tree nodes (default = 2)
    --maxLeaves             Maximum degree of pre-leaf nodes. (default = 100)
    --tree                  File with tree structure
    --treeType              Type of a tree to build if file with structure is not provided
                            tree types: hierarchicalKmeans, huffman, completeKaryInOrder, completeKaryRandom,
                                        balancedInOrder, balancedRandom, onlineComplete

    K-Means tree:
    --kmeansEps             Tolerance of termination criterion of the k-means clustering 
                            used in hierarchical k-means tree building procedure (default = 0.001)
    --kmeansBalanced        Use balanced K-Means clustering (default = 1)

    Prediction:
    --topK                  Predict top-k labels (default = 5)
    --threshold             Predict labels with probability above the threshold (default = 0)
    --setUtility            Type of set-utility function for prediction using ubop, rbop, ubopHsm, ubopMips models.
                            Set-utility functions: uP, uF1, uAlfa, uAlfaBeta, uDeltaGamma
                            See: https://arxiv.org/abs/1906.08129

    Set-Utility:
    --alfa
    --beta
    --delta
    --gamma

    Test:
    --measures              Evaluate test using set of measures (default = "p@1,r@1,c@1,p@3,r@3,c@3,p@5,r@5,c@5")
                            Measures: acc (accuracy), p (precision), r (recall), c (coverage), hl (hamming loss)
                                      p@k (precision at k), r@k (recall at k), c@k (coverage at k), s (prediction size)
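
The help text above gives the libsvm header as `#lines #features #labels`. As an illustration, here is a minimal parser sketch for that layout. It assumes the multi-label libsvm variant in which each data line starts with comma-separated label indices followed by `index:value` feature pairs; the function name `parse_libsvm` and the sample data are hypothetical and not part of napkinXC.

```python
def parse_libsvm(text, header=True):
    """Parse multi-label libsvm text into (declared sizes, feature dicts, label lists)."""
    lines = [line for line in text.splitlines() if line.strip()]
    declared = None
    if header:
        # Header line: number of data lines, number of features, number of labels.
        n_lines, n_features, n_labels = map(int, lines[0].split())
        declared = (n_lines, n_features, n_labels)
        lines = lines[1:]

    X, Y = [], []
    for line in lines:
        tokens = line.split()
        # The first token holds the labels unless it already looks like index:value.
        if tokens and ":" not in tokens[0]:
            labels = [int(label) for label in tokens[0].split(",")]
            tokens = tokens[1:]
        else:
            labels = []
        features = {}
        for token in tokens:
            index, value = token.split(":")
            features[int(index)] = float(value)
        X.append(features)
        Y.append(labels)

    if declared is not None and declared[0] != len(X):
        raise ValueError("header declares a different number of lines than found")
    return declared, X, Y


SAMPLE = """3 10 5
0,3 1:0.5 7:1.0
2 2:0.25
4,1 9:2.0
"""

sizes, X_toy, Y_toy = parse_libsvm(SAMPLE)
print(sizes)
print(Y_toy)
```

With `header=False` the first line is treated as data, which mirrors setting `--header 0` for files that lack the size line.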

References and acknowledgments

This library implements methods from the following papers:

Another implementation of the PLT model is available in the extremeText library, which implements the approach described in this NeurIPS paper.

Project details

Release history

0.6.2: Oct 17, 2022
0.6.1: Sep 11, 2022
0.6.0: Mar 23, 2022
0.5.2: May 17, 2021
0.5.1: Feb 6, 2021
0.5.0: Feb 2, 2021
0.4.2: Nov 4, 2020
0.4.1: Oct 26, 2020
0.4.0: Sep 7, 2020 (this version)

Download files

Source Distribution

napkinxc-0.4.0.tar.gz (143.0 kB)

Uploaded Sep 7, 2020 Source

Hashes for napkinxc-0.4.0.tar.gz

Algorithm	Hash digest
SHA256	`81a1628b5f70abe779602c541ff27846fa756ee65e24c99f2ef72d85b9f4654f`
MD5	`04fe11ec0721c53c295abedb391b5bc7`
BLAKE2b-256	`052d85dd6d6e14d6e3f32cf0e887a6ddd79f35cae0dc4e71f6c5839a7c79c42d`