
Anomaly detection and three of the most used algorithms

8 March 2023 | Machine Learning

1. Description

Anomaly detection, also known as outlier detection, is a family of problems whose purpose is to find the samples that behave differently from the majority.

It is applied in many domains: fraud detection in finance and insurance, defect detection in product quality control, intrusion detection in web services, lesion detection in medical diagnosis, etc.

By definition, there are far more normal instances than abnormal ones. Classic supervised algorithms might not be appropriate for such an imbalanced dataset, so we need other algorithms. In this article, I'd like to talk about three widely used algorithms specifically designed for anomaly detection, with a little coding part showing how to use them in Python.

2. Approaches

There are three types of solutions for anomaly detection: statistical modeling, supervised algorithms and unsupervised algorithms.

Statistical modeling assumes that the dataset follows a certain statistical distribution, usually a Gaussian distribution. If a sample lies in a zone where its probability is too low, it is considered abnormal. The drawback of this approach is obvious: it is sensitive to extreme points, and not all datasets are Gaussian distributed.

Supervised learning requires a labeled dataset. As mentioned in the previous chapter, the difficulty of anomaly detection is the imbalanced dataset (many more normal samples than abnormal ones). It is possible, but rare in real cases, to have enough labeled abnormal instances to apply supervised learning.

As a result, unsupervised learning is more pragmatic for anomaly detection. There is no hypothesis to make, and it doesn't need a large amount of labeled data for the positive and negative classes. I will explain three unsupervised algorithms in more detail in the following chapters: OC-SVM (One-Class SVM), LOF (Local Outlier Factor) and iForest (Isolation Forest).

3. OC-SVM

OC-SVM, whose full name is One-Class Support Vector Machine, was published in 1999 [1]. As its name suggests, it is an extension of the Support Vector Machine (SVM), a very classic machine learning algorithm. In SVM, the purpose is to separate two classes by maximizing the margin between them. The idea of One-Class SVM is to consider the origin of a feature space F as abnormal, and to separate all the data from the origin by a hyperplane maximizing the margin between them. If we can find such a hyperplane, there must be a binary function f which gives positive results in the subspace S (containing all normal instances) and negative results outside of S (where instances are considered abnormal).

Figure 1: data separated by the hyperplane

In figure 1, all red points are input data. Most of these points are in the target zone; only a few lie in the outlier zone, where the origin (the big black point) is the support vector.
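To make the sign convention of the function f concrete, here is a minimal sketch using scikit-learn's OneClassSVM (the pyOD OCSVM model used in chapter 6 is based on the same estimator). The toy data and parameter values are illustrative assumptions, not the article's code.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy "normal" data: a single cluster around (5, 5)
rng = np.random.RandomState(0)
X_train = rng.normal(loc=5.0, scale=0.5, size=(200, 2))

# One point inside the target zone, one far away from it
X_test = np.array([[5.0, 5.0],
                   [0.0, 0.0]])

clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

print(clf.predict(X_test))            # +1 for normal, -1 for outlier
print(clf.decision_function(X_test))  # positive inside the subspace S, negative outside
```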

4. Local outlier factor (LOF)

As its name suggests, there is nothing to train, only a factor to calculate. This factor indicates the degree to which a point is an outlier. Intuitively, a point is abnormal when it lies far away from most of the other (normal) points. As a result, a point with a low local density (far away from other points) is considered an outlier. This is the logic of LOF.

The word ‘local’ means that LOF considers the density in the neighborhood of a point, instead of the global density. Only the k points nearest to the point p are taken into consideration.

To calculate the LOF of an object p, the following definitions are needed (all taken from the original article [2]):

4.1) k-distance of an object p

For any positive integer k, the k-distance of object p, denoted as k-distance(p), is defined as the distance d(p,o) between p and an object o ∈ D such that:

(i) for at least k objects o’∈D \ {p} it holds that d(p,o’) ≤ d(p,o), and
(ii) for at most k-1 objects o’∈D\{p} it holds that d(p,o’) < d(p,o).

4.2) k-distance neighborhood of an object p

Given the k-distance of p, the k-distance neighborhood of p contains every object whose distance from p is not greater than the k-distance, i.e. N_k-distance(p)(p) = N_k(p) = { q ∈ D \ {p} | d(p, q) ≤ k-distance(p) }.

These objects q are called the k-nearest neighbors of p.

4.3) reachability distance of an object p w.r.t. (with respect to) an object o

Let k be a natural number. The reachability distance of object p with respect to object o is defined as reach-dist_k(p, o) = max { k-distance(o), d(p, o) }.

4.4) local reachability density of an object p

The local reachability density of p is defined as

lrd_MinPts(p) = 1 / ( Σ_{o ∈ N_MinPts(p)} reach-dist_MinPts(p, o) / |N_MinPts(p)| )

MinPts is the minimum number of objects considered as neighbors (it plays the role of k above). The local reachability density is thus the inverse of the average reachability distance from p to its MinPts closest points.

4.5) (local) outlier factor of an object p

The (local) outlier factor of p is defined as

LOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|

It is the average ratio between the local reachability density of p's neighbors and that of p itself. A LOF close to 1 means p has a density similar to its neighbors, while a LOF significantly greater than 1 indicates an outlier.
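To make these definitions concrete, here is a small, self-contained NumPy sketch (with k = MinPts) that computes the LOF of every point of a toy dataset. It is only an illustration of the formulas above, not the LOF implementation used in chapter 6.

```python
import numpy as np

def knn_info(X, k):
    """Pairwise distances, k-distance of each point, and its k-distance neighborhood."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    kdist, neighbors = np.empty(n), []
    for i in range(n):
        d = np.delete(D[i], i)                 # distances to all other points
        idx = np.delete(np.arange(n), i)
        kdist[i] = np.sort(d)[k - 1]           # k-distance(p)
        neighbors.append(idx[d <= kdist[i]])   # N_k(p), ties included
    return D, kdist, neighbors

def lof(X, k):
    D, kdist, nbrs = knn_info(X, k)
    n = len(X)
    # local reachability density: inverse of the mean reachability distance to the neighbors
    lrd = np.empty(n)
    for p in range(n):
        reach = [max(kdist[o], D[p, o]) for o in nbrs[p]]   # reach-dist_k(p, o)
        lrd[p] = 1.0 / np.mean(reach)
    # LOF: average ratio of the neighbors' lrd to p's own lrd
    return np.array([np.mean([lrd[o] for o in nbrs[p]]) / lrd[p] for p in range(n)])

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [5., 5.]])
print(lof(X, k=2))   # the isolated point (5, 5) gets a LOF well above 1
```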

5. iForest

Isolation forest, also known as iForest, focuses on isolating anomalies from normal instances. The idea is to build isolation trees (iTrees) to separate the data. If a point is isolated close to the root, it is more likely to be abnormal. As it is a forest, many iTrees are built, each on a different subsample of the whole dataset. Instances with a short average path length to the root are considered abnormal. An iTree, by definition, is a proper binary tree, which means each node has either zero or two daughter nodes.

Figure 3: iForest with iTrees

There are two definitions to know (all definitions are from the article [3]):

Isolation Tree. Let T be a node of an isolation tree. T is either an external-node with no child, or an internal-node with one test and exactly two daughter nodes (Tl,Tr). A test consists of an attribute q and a split value p such that the test q < p divides data points into Tl and Tr.

Path Length h(x) of a point x is measured by the number of edges x traverses in an iTree, from the root node until the traversal terminates at an external node.

An anomaly score s (0 < s < 1) is computed to estimate how abnormal an instance is: the closer s is to 1, the more likely the instance is abnormal. If you want to know more details about the anomaly score, please read [3].
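For reference, the score defined in [3] is s(x, n) = 2^(-E(h(x)) / c(n)), where E(h(x)) is the average path length of x over the iTrees and c(n) is the average path length of an unsuccessful search in a binary search tree of n points, used as a normalization term. Here is a short sketch with illustrative numbers:

```python
import numpy as np

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Normalization term c(n) = 2 * H(n - 1) - 2 * (n - 1) / n, with H(i) ~ ln(i) + gamma [3]."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); the closer to 1, the more likely an anomaly."""
    return 2.0 ** (-avg_path_length / c(n))

# With subsamples of 256 points (the size used in [3]):
print(anomaly_score(3.0, 256))    # isolated after ~3 edges -> score around 0.8, likely abnormal
print(anomaly_score(12.0, 256))   # long path -> score below 0.5, likely normal
```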

6. Use case with pyOD

Here I'd like to use the Kaggle credit card fraud detection dataset [4] as an example, to try the three algorithms introduced above.

There is a useful library called pyOD, dedicated to anomaly detection. It contains more than 40 anomaly detection algorithms (including, of course, OC-SVM, LOF and iForest), and I will use it in the coding part. If you want to know more about this toolbox, please read [5].

Figure 4: import functions from pyod
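The following is a sketch of the imports shown in figure 4, assuming the current pyOD package layout and pandas for loading the CSV file:

```python
import time
import pandas as pd

from pyod.models.ocsvm import OCSVM         # One-Class SVM
from pyod.models.lof import LOF             # Local Outlier Factor
from pyod.models.iforest import IForest     # Isolation Forest
from pyod.utils.data import evaluate_print  # prints ROC AUC and precision @ rank n
```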

I'd like to use 'evaluate_print' to evaluate the models. This function relies on 'sklearn.metrics.roc_auc_score' and 'sklearn.metrics.precision_score' to report the ROC AUC score and the precision @ rank n. As mentioned in the description, data in anomaly detection are often imbalanced: in this case, fraud represents only 0.17% of the whole dataset (figure 5). Accuracy and the confusion matrix are no longer appropriate to evaluate the performance of a model; the ROC curve and the AUC score are usually used instead. The higher the AUC score, the better the model performs. If you have never come across ROC or AUC before, just type "ROC and AUC" into any search engine; there are plenty of articles about them.

Figure 5: count normal and abnormal instances
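A possible way to reproduce this count with pandas (the 'Class' column of the Kaggle dataset [4] is 1 for fraud and 0 otherwise):

```python
# Load the Kaggle credit card dataset from a local CSV file
df = pd.read_csv("creditcard.csv")

print(df["Class"].value_counts())   # counts of normal (0) and fraudulent (1) transactions
print(df["Class"].mean())           # fraud ratio, roughly 0.0017 (0.17%)

X = df.drop(columns=["Class"])      # features
y = df["Class"]                     # ground-truth labels, used only for evaluation
```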

To build and train a One-class SVM model:

Figure 6: build, train and test an OC-SVM model
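A sketch of what figure 6 does with pyOD's OCSVM model; the contamination value (set to the observed fraud ratio) and the evaluation on the training scores are assumptions, not necessarily the article's exact code:

```python
# X and y come from the previous snippet
start = time.time()
ocsvm = OCSVM(contamination=0.0017)   # expected proportion of outliers
ocsvm.fit(X)
print("OC-SVM training time: %.1f s" % (time.time() - start))

# decision_scores_ holds the outlier scores of the training samples;
# evaluate_print reports ROC AUC and precision @ rank n against the labels
evaluate_print("OC-SVM", y, ocsvm.decision_scores_)
```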

To build and train a LOF model:

Figure 7: build, train and test a LOF model
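The same pattern with pyOD's LOF model; n_neighbors is the k of the definitions in chapter 4 (20 is pyOD's default), and the other values are assumptions:

```python
start = time.time()
lof = LOF(n_neighbors=20, contamination=0.0017)
lof.fit(X)
print("LOF training time: %.1f s" % (time.time() - start))

evaluate_print("LOF", y, lof.decision_scores_)
```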

To build and train an iForest model:

Figure 8: build, train and test an iForest model
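And a sketch for the iForest model; with the default max_samples, each of the 100 iTrees is built on a subsample of at most 256 points, as in the original paper [3]. The parameter values are again assumptions:

```python
start = time.time()
iforest = IForest(n_estimators=100, contamination=0.0017, random_state=42)
iforest.fit(X)
print("iForest training time: %.1f s" % (time.time() - start))

evaluate_print("iForest", y, iforest.decision_scores_)
```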

Looking at the execution time of each model, it's clear that iForest runs much faster than the two other models. The reason is obvious: OC-SVM and LOF are distance-based algorithms, and as with all distance-based algorithms, computing distances takes time and memory. This is where iForest shows its advantages: linear time complexity and a low memory requirement. Even on high-dimensional datasets, iForest works well and quickly.

7. Conclusion

In this article, we first described anomaly detection through some use cases. Because the data are imbalanced by nature, classical statistical modeling and supervised learning are considered less effective. As a result, One-Class SVM, Local Outlier Factor and Isolation Forest were introduced as unsupervised methods for anomaly detection. Finally, a fraud detection use case was presented and the three algorithms were applied to detect frauds. Here the models were trained directly, without any pre-processing or feature engineering, so the performance is not as good as it could be. If you are interested, try to build your own model and get a better outlier detector!

References

[1] Bernhard Schölkopf, Robert C. Williamson, Alex Smola, John Shawe-Taylor, John Platt. "Support Vector Method for Novelty Detection". In NIPS, pages 582-588, 1999.

[2] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. "LOF: identifying density-based local outliers". In ACM SIGMOD Record, volume 29, pages 93-104, 2000.

[3] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation Forest". In International Conference on Data Mining (ICDM), pages 413-422. IEEE, 2008.

[4] Kaggle dataset: Credit Card Fraud Detection, https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

[5] PyOD documentation:  https://pyod.readthedocs.io/en/latest/