Skip to main content

napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification.

Project description

napkinXC

C++ build Python build Documentation Status PyPI version

napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification. It allows training a classifier for very large datasets in just a few lines of code with minimal resources.

Right now, napkinXC implements the following features both in Python and C++:

  • Probabilistic Label Trees (PLTs) and Online Probabilistic Label Trees (OPLTs),
  • Hierarchical softmax (HSM),
  • Binary Relevance (BR),
  • One Versus Rest (OVR),
  • fast online prediction of top-k labels or labels above one given threshold or separate threshold for each label,
  • fast online prediction with labels weight, e.g., propensity scores,
  • efficient online F-measure optimization (OFO) procedure,
  • hierarchical k-means clustering for tree building and other tree-building methods,
  • support for predefined hierarchies,
  • LIBLINEAR, AdaGrad, SGD optimizers for base classifiers,
  • ensembles of PLTs and HSMs,
  • helpers to download and load data from XML Repository,
  • helpers to measure performance (precision@k, recall@k, nDCG@k, propensity-scored precision@k, and more).

Please note that this library is still under development and also serves as a base for experiments. API may not be compatible between releases and some of the experimental features may not be documented. Do not hesitate to open an issue in case of a question or problem!

The napkinXC is distributed under MIT license. All contributions to the project are welcome!

Roadmap

Coming soon:

  • Possibility to use any binary classifier from Python.
  • Raw versions of datasets from XML Repository.

Python Quick Start and Documentation

napkinXC's documentation is available at https://napkinxc.readthedocs.io and is generated from this repository.

Python (3.5+) version of napkinXC can be easily installed from PyPy repository on Linux and MacOS, it requires modern C++17 compiler, CMake and Git installed:

pip install napkinxc

or the latest master version directly from the GitHub repository:

pip install git+https://github.com/mwydmuch/napkinXC.git

Minimal example of usage:

from napkinxc.datasets import load_dataset
from napkinxc.models import PLT
from napkinxc.measures import precision_at_k

X_train, Y_train = load_dataset("eurlex-4k", "train")
X_test, Y_test = load_dataset("eurlex-4k", "test")
plt = PLT("eurlex-model")
plt.fit(X_train, Y_train)
Y_pred = plt.predict(X_test, top_k=1)
print(precision_at_k(Y_test, Y_pred, k=1)) 

More examples can be found under python/examples directory.

Executable

napkinXC can also be used as executable to train and evaluate model and make a predict using a data in libsvm format

To build executable use:

cmake .
make

Command line options:

Usage: nxc [command] [arg...]

Commands:
    train                   Train model on given input data
    test                    Test model on given input data
    predict                 Predict for given data
    ofo                     Use online f-measure optimization
    version                 Print napkinXC version
    help                    Print help

Args:
    General:
    -i, --input             Input dataset, required
    -o, --output            Output (model) dir, required
    -m, --model             Model type (default = plt)
                            Models: ovr, br, hsm, plt, oplt, svbopFull, svbopHf, brMips, svbopMips
    --ensemble              Number of models in ensemble (default = 1)
    -t, --threads           Number of threads to use (default = 0)
                            Note: set to -1 to use a number of available CPUs - 1, 0 to use a number of available CPUs
    --memLimit              Maximum amount of memory (in G) available for training (defualt = 0)
                            Note: set to 0 to set limit to amount of available memory
    --hash                  Size of features space (default = 0)
                            Note: set to 0 to disable hashing
    --featuresThreshold     Prune features below given threshold (default = 0.0)
    --seed                  Seed (default = system time)
    --verbose               Verbose level (default = 2)

    For OVR and HSM:
    --pickOneLabelWeighting Allows to use multi-label data by transforming it into multi-class (default = 0)

    Base classifiers:
    --optim, --optimizer    Optimizer used for training binary classifiers (default = liblinear)
                            Optimizers: liblinear, sgd, adagrad
    --bias                  Value of the bias features (default = 1)
    --inbalanceLabelsWeighting      Increase the weight of minority labels in base classifiers (default = 0)
    --weightsThreshold      Threshold value for pruning models weights (default = 0.1)
    --loss                  Loss function to optimize in base classifier (default = log)
                            Losses: log (alias logistic), l2 (alias squaredHinge)

    LIBLINEAR:              (more about LIBLINEAR: https://github.com/cjlin1/liblinear)
    -c, --liblinearC        LIBLINEAR cost co-efficient, inverse of regularization strength, must be a positive float,
                            smaller values specify stronger regularization (default = 10.0)
    --eps, --liblinearEps   LIBLINEAR tolerance of termination criterion (default = 0.1)
    --solver, --liblinearSolver     LIBLINEAR solver (default for log loss = L2R_LR_DUAL, for l2 loss = L2R_L2LOSS_SVC_DUAL)
                                    Overrides default solver set by loss parameter.
                                    Supported solvers: L2R_LR_DUAL, L2R_LR, L1R_LR,
                                                       L2R_L2LOSS_SVC_DUAL, L2R_L2LOSS_SVC, L2R_L1LOSS_SVC_DUAL, L1R_L2LOSS_SVC
    --maxIter, --liblinearMaxIter   Maximum number of iterations for LIBLINEAR (default = 100)

    SGD/AdaGrad:
    -l, --lr, --eta         Step size (learning rate) for online optimizers (default = 1.0)
    --epochs                Number of training epochs for online optimizers (default = 1)
    --adagradEps            Defines starting step size for AdaGrad (default = 0.001)

    Tree (PLT and HSM):
    -a, --arity             Arity of tree nodes (default = 2)
    --maxLeaves             Maximum degree of pre-leaf nodes. (default = 100)
    --tree                  File with tree structure
    --treeType              Type of a tree to build if file with structure is not provided
                            tree types: hierarchicalKmeans, huffman, completeKaryInOrder, completeKaryRandom,
                                        balancedInOrder, balancedRandom, onlineComplete

    K-Means tree:
    --kmeansEps             Tolerance of termination criterion of the k-means clustering
                            used in hierarchical k-means tree building procedure (default = 0.001)
    --kmeansBalanced        Use balanced K-Means clustering (default = 1)

    Prediction:
    --topK                  Predict top-k labels (default = 5)
    --threshold             Predict labels with probability above the threshold (default = 0)
    --thresholds            Path to a file with threshold for each label, one threshold per line
    --labelsWeights         Path to a file with weight for each label, one weight per line
    --setUtility            Type of set-utility function for prediction using svbopFull, svbopHf, svbopMips models.
                            Set-utility functions: uP, uF1, uAlfa, uAlfaBeta, uDeltaGamma
                            See: https://arxiv.org/abs/1906.08129

    Set-Utility:
    --alpha
    --beta
    --delta
    --gamma

    Test:
    --measures              Evaluate test using set of measures (default = "p@1,p@3,p@5")
                            Measures: acc (accuracy), p (precision), r (recall), c (coverage), hl (hamming loos)
                                      p@k (precision at k), r@k (recall at k), c@k (coverage at k), s (prediction size)

See documentation for more details.

References and acknowledgments

This library implements methods from following papers (see experiments directory for scripts to replicate the results):

Another implementation of PLT model is available in extremeText library, that implements approach described in this NeurIPS paper.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

napkinxc-0.5.2.tar.gz (164.1 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page