Tensor-based outlier detection. A general GPU-accelerated framework.

Background: Outlier detection (OD) is a key data mining task for identifying abnormal objects among general samples, with numerous high-stakes applications including fraud detection and intrusion detection.

We propose TOD, a system for efficient and scalable outlier detection (OD) on distributed multi-GPU machines. A key idea behind TOD is decomposing OD applications into basic tensor algebra operations for GPU acceleration.

Authors: TOD is developed by the author(s) of the popular PyOD and PyGOD libraries, specifically Yue Zhao, Prof. George Chen, and Prof. Zhihao Jia. The code is being cleaned up and released. Please watch and star!

Citing TOD: Check out the design paper. If you use TOD in a scientific publication, we would appreciate citations to the following paper:

@article{zhao2021tod,
  title={TOD: GPU-accelerated Outlier Detection via Tensor Operations},
  author={Zhao, Yue and Chen, George H and Jia, Zhihao},
  journal={arXiv preprint arXiv:2110.14007},
  year={2021}
}

or:

Zhao, Y., Chen, G.H. and Jia, Z., 2021. TOD: GPU-accelerated Outlier Detection via Tensor Operations. arXiv preprint arXiv:2110.14007.

One Reason to Use It:

On average, TOD is 11 times faster than PyOD on a diverse group of OD algorithms!

If you need another reason: it can handle much larger datasets, running OD on more than a million samples within an hour!

GPU-accelerated Outlier Detection with 5 Lines of Code:

# train the kNN detector
from pytod.models.knn import KNN
clf = KNN() # default GPU device is used
clf.fit(X_train)

# get outlier scores
y_train_scores = clf.decision_scores_  # raw outlier scores on the train data
y_test_scores = clf.decision_function(X_test)  # predict raw outlier scores on test

TOD features:

  • Unified APIs, detailed documentation, and examples for ease of use (under construction)

  • More than 5 different OD algorithms, with more being added

  • The support of multi-GPU acceleration

  • Advanced techniques including provable quantization and automatic batching
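To make the batching idea concrete, here is a minimal NumPy sketch of a kNN distance computation that processes rows in chunks, so only one block of the distance matrix is held in memory at a time. It is an illustration under stated assumptions, not PyTOD's implementation: the function name `knn_distances_batched` and the hand-picked `batch_size` argument are hypothetical, and an automatic scheme would presumably choose the chunk size from the available device memory.

```python
import numpy as np


def knn_distances_batched(X, k=5, batch_size=1024):
    """Sorted distances from each row of X to its k nearest other rows.

    Hypothetical helper: batch_size is chosen by hand here and is not a
    PyTOD parameter. Requires k <= n_samples - 1.
    """
    X = np.asarray(X, dtype=np.float64)
    n = X.shape[0]
    sq_norms = (X ** 2).sum(axis=1)
    out = np.empty((n, k))
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Squared Euclidean distances for this block of rows:
        # ||a||^2 - 2 a.b + ||b||^2
        d2 = sq_norms[start:stop, None] - 2.0 * X[start:stop] @ X.T + sq_norms[None, :]
        np.maximum(d2, 0.0, out=d2)  # clip tiny negatives from rounding
        # Exclude each point's distance to itself.
        d2[np.arange(stop - start), np.arange(start, stop)] = np.inf
        # Keep the k smallest entries per row, then sort just those k.
        nearest = np.partition(d2, k - 1, axis=1)[:, :k]
        out[start:stop] = np.sqrt(np.sort(nearest, axis=1))
    return out
```

The last column, `knn_distances_batched(X, k)[:, -1]`, is the distance to the kth nearest neighbor. The result does not depend on `batch_size`; only the peak memory use does.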


Installation

It is recommended to use pip for installation. Please make sure the latest version is installed, as PyTOD is updated frequently:

pip install pytod            # normal install
pip install --upgrade pytod  # or update if needed

Alternatively, you can clone the repository and install from source:

git clone https://github.com/yzhao062/pytod.git
cd pytod
pip install .

Required Dependencies:

  • Python 3.6+

  • numpy>=1.13

  • torch>=1.7 (it is safer to install it yourself)

  • scipy>=0.19.1

  • scikit_learn>=0.20.0

  • pyod (for comparison)
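A quick way to see which of these packages are present in the current environment is sketched below. It uses only the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_versions` and the `REQUIRED` tuple are hypothetical, and the sketch only reports versions rather than comparing them against the minimums listed above.

```python
from importlib import metadata

# Distribution names taken from the dependency list above.
REQUIRED = ("numpy", "torch", "scipy", "scikit-learn", "pyod")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```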


Implemented Algorithms

PyTOD toolkit consists of three major functional groups (to be cleaned up):

(i) Individual Detection Algorithms:

===============  ========  =====================================================================================================  ====  ====
Type             Abbr      Algorithm                                                                                              Year  Ref
===============  ========  =====================================================================================================  ====  ====
Linear Model     PCA       Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes)  2003  [29]
Proximity-Based  LOF       Local Outlier Factor                                                                                   2000  [7]
Proximity-Based  COF       Connectivity-Based Outlier Factor                                                                      2002  [30]
Proximity-Based  HBOS      Histogram-based Outlier Score                                                                          2012  [9]
Proximity-Based  kNN       k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score)                2000  [25]
Proximity-Based  AvgKNN    Average kNN (use the average distance to k nearest neighbors as the outlier score)                     2002  [5]
Proximity-Based  MedKNN    Median kNN (use the median distance to k nearest neighbors as the outlier score)                       2002  [5]
Probabilistic    ABOD      Angle-Based Outlier Detection                                                                          2008  [16]
Probabilistic    COPOD     COPOD: Copula-Based Outlier Detection                                                                  2020  [20]
Probabilistic    FastABOD  Fast Angle-Based Outlier Detection using approximation                                                 2008  [16]
===============  ========  =====================================================================================================  ====  ====

Code is being released. Watch and star for the latest news!
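The kNN, AvgKNN, and MedKNN entries in the table differ only in how they reduce the k nearest distances to a single score. The NumPy sketch below illustrates that difference under simple assumptions (brute-force Euclidean distances, self-distance excluded); the function name `knn_scores` is hypothetical and this is not the PyTOD code path.

```python
import numpy as np


def knn_scores(X, k=3):
    """Outlier scores for the three kNN variants (hypothetical helper)."""
    X = np.asarray(X, dtype=np.float64)
    # Full pairwise Euclidean distance matrix; fine for small inputs.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # ignore each point's distance to itself
    nearest = np.sort(dist, axis=1)[:, :k]  # k smallest distances per row
    return {
        "kNN": nearest[:, -1],                 # distance to the kth neighbor
        "AvgKNN": nearest.mean(axis=1),        # average of the k distances
        "MedKNN": np.median(nearest, axis=1),  # median of the k distances
    }
```

In all three variants a larger score means the sample lies farther from its neighbors and is therefore more likely to be an outlier.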


A Motivating Example PyOD vs. PyTOD!

The kNN example shows how fast and easy to use PyTOD is. Take the well-known kNN outlier detection as an example:

  1. Initialize a kNN detector, fit the model, and make the prediction.

    from pytod.models.knn import KNN   # kNN detector
    
    # train kNN detector
    clf_name = 'KNN'
    clf = KNN()
    clf.fit(X_train)
    # alternatively, if a GPU is not available, use the CPU instead
    clf = KNN(device='cpu')
    clf.fit(X_train)
  2. Get the prediction results

    # get the prediction label and outlier scores of the training data
    y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
    y_train_scores = clf.decision_scores_  # raw outlier scores
  3. On a simple laptop, compare the speed against PyOD for 30,000 samples with 20 features:

    KNN-PyOD ROC:1.0, precision @ rank n:1.0
    Execution time 11.26 seconds
    KNN-PyTOD-GPU ROC:1.0, precision @ rank n:1.0
    Execution time 2.82 seconds
    KNN-PyTOD-CPU ROC:1.0, precision @ rank n:1.0
    Execution time 3.36 seconds

PyTOD is clearly more efficient than PyOD here: 2.82 seconds on GPU and 3.36 seconds on CPU versus 11.26 seconds, with identical accuracy.
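The two accuracy figures in the report above are ROC (area under the ROC curve) and precision @ rank n. The sketch below shows one common way to compute them from ground-truth labels and raw outlier scores. The function names are hypothetical, and the tie handling may differ from the evaluation helpers that produced the numbers above.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (outlier, inlier) pairs ranked correctly."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count as half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


def precision_at_rank_n(y_true, scores):
    """Precision among the n top-scored samples, n = number of true outliers."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    n_outliers = int(y_true.sum())
    top = np.argsort(-scores, kind="stable")[:n_outliers]
    return float(y_true[top].mean())
```

A value of 1.0 for both, as in the report, means every true outlier received a higher score than every inlier.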


Paper Reproducibility

Datasets: OD benchmark datasets are available in the datasets folder.

Scripts for reproducibility are available in the reproducibility folder.

Cleanup is on the way!


Programming Model Interface

Complex OD algorithms can be abstracted into common tensor operators.

https://raw.githubusercontent.com/yzhao062/pytod/master/figs/abstraction.png

For instance, ABOD and COPOD can be assembled by the basic tensor operators.

https://raw.githubusercontent.com/yzhao062/pytod/master/figs/abstraction_example.png
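As a rough illustration of how such an algorithm reduces to a handful of array operators (sort, rank lookup, log, sum, max), here is a simplified COPOD-style score in NumPy. It is a sketch under stated assumptions: it omits the skewness correction of the published COPOD method, the function names are hypothetical, and it does not reproduce the operator graph shown in the figure.

```python
import numpy as np


def ecdf(X):
    """Column-wise empirical CDF: fraction of samples <= each value."""
    n, d = X.shape
    sorted_X = np.sort(X, axis=0)
    counts = np.stack(
        [np.searchsorted(sorted_X[:, j], X[:, j], side="right") for j in range(d)],
        axis=1,
    )
    return counts / n


def tail_scores(X):
    """Simplified tail-probability outlier score (no skewness correction)."""
    X = np.asarray(X, dtype=np.float64)
    left = -np.log(ecdf(X)).sum(axis=1)    # unusually small values
    right = -np.log(ecdf(-X)).sum(axis=1)  # unusually large values
    return np.maximum(left, right)
```

A sample that is extreme in many dimensions has small tail probabilities in each of them, so its summed negative log-probability, and therefore its score, is large.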

End-to-end Performance Comparison with PyOD

Overall, TOD takes far less run time than PyOD: it is 11 times faster on average.

https://raw.githubusercontent.com/yzhao062/pytod/master/figs/run_time.png

Reference
