Outlier Detection Using PyOD

Source: DeepHub IMBA


This article is approximately 1200 words long and is recommended for a 5-minute read.
In this article, we will introduce the PyOD package and provide detailed code examples.

Outlier detection is a key task in various fields. PyOD, which stands for Python Outlier Detection, simplifies the process of identifying outliers in multivariate datasets. In this article, we will introduce the PyOD package and provide detailed code examples.

Introduction to PyOD

PyOD provides a wide range of algorithms for outlier detection, suitable for both supervised and unsupervised scenarios. Whether dealing with labeled or unlabeled data, PyOD offers a variety of techniques to meet specific needs. One of PyOD’s standout features is its user-friendly API, making it accessible for both beginners and experienced practitioners.

Example 1: kNN

We start with a simple example using the k-nearest neighbors (kNN) algorithm for outlier detection.

First, import the necessary modules from PyOD.

 from pyod.models.knn import KNN
 from pyod.utils.data import generate_data
 from pyod.utils.data import evaluate_print

We generate synthetic data with a predefined outlier rate to simulate outliers.

 contamination = 0.1 # percentage of outliers
 n_train = 200 # number of training points
 n_test = 100 # number of testing points
 X_train, X_test, y_train, y_test = generate_data(
    n_train=n_train, n_test=n_test, contamination=contamination)

Initialize the kNN detector, fit it to the training data, and obtain outlier predictions.

 clf_name = 'KNN'
 clf = KNN()
 clf.fit(X_train)

Evaluate the performance of the trained model on the training and testing datasets using ROC and Precision @ Rank n metrics.

 print("\nOn Training Data:")
 evaluate_print(clf_name, y_train, clf.decision_scores_)
 print("\nOn Test Data:")
 evaluate_print(clf_name, y_test, clf.decision_function(X_test))

Finally, visualize the outlier detection results using the built-in visualization functionality.

 from pyod.utils.data import visualize
 
 visualize(clf_name, X_train, y_train, X_test, y_test, clf.labels_,
          clf.predict(X_test), show_figure=True, save_figure=False)

This is a simple usage example.

Example 2: Model Ensemble

Outlier detection can sometimes be affected by model instability, especially in unsupervised cases. Therefore, PyOD provides model ensemble techniques to improve robustness.

 import numpy as np
 from sklearn.model_selection import train_test_split
 from scipy.io import loadmat
 
 from pyod.models.knn import KNN
 from pyod.models.combination import aom, moa, average, maximization, median
 from pyod.utils.utility import standardizer
 from pyod.utils.data import generate_data
 from pyod.utils.data import evaluate_print
 
 X, y = generate_data(train_only=True) # load data
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
 
 # standardizing data for processing
 X_train_norm, X_test_norm = standardizer(X_train, X_test)
 
 n_clf = 20 # number of base detectors
 
 # Initialize 20 base detectors for combination
 k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
            150, 160, 170, 180, 190, 200]
 
 train_scores = np.zeros([X_train.shape[0], n_clf])
 test_scores = np.zeros([X_test.shape[0], n_clf])
 
 print('Combining {n_clf} kNN detectors'.format(n_clf=n_clf))
 
 for i in range(n_clf):
    k = k_list[i]
 
    clf = KNN(n_neighbors=k, method='largest')
    clf.fit(X_train_norm)
 
    train_scores[:, i] = clf.decision_scores_
    test_scores[:, i] = clf.decision_function(X_test_norm)
 
 # Decision scores have to be normalized before combination
 train_scores_norm, test_scores_norm = standardizer(train_scores,
                                                    test_scores)
 # Combination by average
 y_by_average = average(test_scores_norm)
 evaluate_print('Combination by Average', y_test, y_by_average)
 
 # Combination by max
 y_by_maximization = maximization(test_scores_norm)
 evaluate_print('Combination by Maximization', y_test, y_by_maximization)
 
 # Combination by median
 y_by_median = median(test_scores_norm)
 evaluate_print('Combination by Median', y_test, y_by_median)
 
 # Combination by aom
 y_by_aom = aom(test_scores_norm, n_buckets=5)
 evaluate_print('Combination by AOM', y_test, y_by_aom)
 
 # Combination by moa
 y_by_moa = moa(test_scores_norm, n_buckets=5)
 evaluate_print('Combination by MOA', y_test, y_by_moa)

If the above code prompts an error, you need to install the combo package.

 pip install combo

Conclusion

As we can see, using PyOD for outlier detection is very convenient, from basic kNN outlier detection to model ensemble, PyOD provides a comprehensive integration that allows us to easily and efficiently handle outlier detection tasks.

Finally, refer to the documentation and official website of PyOD at https://pyod.readthedocs.io/en/latest/

Editor: Wenjing

Introduction to PyOD

Example 1: kNN

Example 2: Model Ensemble

Conclusion

Leave a Comment Cancel reply