
Anomaly Detection – Finding Outliers in Data

Detecting anomalies in data is a recurring theme in machine learning. In Chapter 10, Imbalanced Learning – Not Even 1% Win the Lottery, we learned how to spot these interesting minorities in our data. Back then, the data was labeled and the classification algorithms from the previous chapters were apt for the problem. Aside from labeled anomaly detection problems, however, there are cases where data is unlabeled.

In this chapter, we are going to learn how to identify outliers in our data, even when no labels are provided. We will use three different algorithms and we will learn about the two branches of unlabeled anomaly detection. Here are the topics that will be covered in this chapter:

  • Unlabeled anomaly detection
  • Detecting anomalies using basic statistics
  • Detecting outliers using EllipticEnvelope
  • Outlier and novelty detection using Local Outlier Factor (LOF)
  • Detecting outliers using isolation forest

Unlabeled anomaly detection

In this chapter, we will start with some unlabeled data and we will need to spot the anomalous samples in it. We may be given inliers only, and we want to learn what normal data looks like from them. Then, after fitting a model on our inliers, we are given new data and need to spot any outliers that diverge from the data seen so far. These kinds of problems are referred to as novelty detection. On the other hand, if we fit our model on a dataset that consists of a combination of inliers and outliers, then this problem is referred to as an outlier detection problem.

Like any other unlabeled algorithm, the fit method ignores any labels given. This method's interface allows you to pass in both x and y, for the sake of consistency, but y is simply ignored. In cases of novelty detection, it is logical to first use the fit method on a dataset that includes no outliers, and then use the algorithm's predict method later on for data that includes both inliers and outliers. Conversely, for outlier detection problems, it is common to apply the fit and predict steps all at once, using the fit_predict method.
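As a schematic example of the two workflows, here is a minimal sketch using IsolationForest, an algorithm we will meet at the end of this chapter; the toy data here is of my own making:

import numpy as np
from sklearn.ensemble import IsolationForest

x_clean = np.random.RandomState(0).normal(size=(100, 2))  # inliers only
x_mixed = np.vstack([x_clean, [[6.0, 6.0]]])  # inliers plus one outlier

# Novelty detection: fit on clean data, then predict on new data;
# -1 marks outliers and 1 marks inliers
print(IsolationForest(random_state=0).fit(x_clean).predict(x_mixed))

# Outlier detection: fit and predict on the contaminated data in one go
print(IsolationForest(random_state=0).fit_predict(x_mixed))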

Before using any of our algorithms, we need to create a sample dataset to be used throughout this chapter. Our data will include 1,000 samples, with 98% of them coming from certain distributions and the remaining 2% coming from different ones. In the next section, we are going to see how to create this sample data in detail.

Generating sample data

The make_classification function allows us to specify the number of samples, as well as the number of features. We can limit the number of informative features and make some features redundant—that is, dependent on the informative features. We can also make some features copies of any of the informative or redundant features. In our current use case, we will make sure that all our features are informative since we are going to limit ourselves to two features only. Since the make_classification function is meant to produce data for classification problems, it returns both x and y.

We will ignore y when building our models and only use it for evaluation later on. We will make sure each class comes from two different distributions by setting n_clusters_per_class to 2. We will keep the two features to the same scale by setting scale to a single value. We will also make sure the data is randomly shuffled (shuffle=True) and that no samples from one class are labeled as members of the other class (flip_y=0). Finally, we will set random_state to 0 to make sure we get the exact same random data when running the following code on our computer:

from sklearn.datasets import make_classification

x, y = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0,
    n_classes=2, n_clusters_per_class=2, weights=[0.98, ], class_sep=0.5,
    scale=1.0, shuffle=True, flip_y=0, random_state=0
)
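As a quick sanity check, which is an addition of my own and not part of the original workflow, we can confirm that roughly 2% of the samples carry the minority label:

import numpy as np

# Count the samples per class; we expect roughly 980 inliers (0) and 20 outliers (1)
print(np.unique(y, return_counts=True))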

Now that the sample data is ready, it is time to think of ways to detect the outliers in it.

Detecting anomalies using basic statistics

Rather than jumping straight into the available algorithms in scikit-learn, let's start by thinking about ways to detect the anomalous samples. Imagine measuring the traffic to your website every hour, which gives you the following numbers:

hourly_traffic = [
    120, 123, 124, 119, 196,
    121, 118, 117, 500, 132
]

Looking at these numbers, 500 sounds quite high compared to the others. Formally speaking, if the hourly traffic data is assumed to be normally distributed, then 500 is unusually far from its mean, or expected value. We can measure this by calculating the mean of these numbers and then checking for the numbers that are more than 2 or 3 standard deviations away from it, as sketched below. Similarly, we can calculate a high quantile and check which numbers lie above it.
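Here is a minimal sketch of the standard-deviation approach; the cut-off of 2 standard deviations is a choice of my own for illustration:

import pandas as pd

traffic = pd.Series(hourly_traffic)
# Flag the values lying more than 2 standard deviations from the mean
print((traffic - traffic.mean()).abs() > 2 * traffic.std())

Only the value of 500 lies more than 2 standard deviations from the mean. The quantile-based approach is just as brief. Here, we find the values above the 95th percentile: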

pd.Series(hourly_traffic) > pd.Series(hourly_traffic).quantile(0.95)

This code will give an array of False values, except for the penultimate one, which corresponds to 500. Before printing out the results, let's put the preceding code in the form of an estimator with its fit and predict methods. The fit method calculates the threshold and saves it, and the predict method compares the new data to the saved threshold. I also added a fit_predict method that carries out these two operations in sequence. Here is the code for the estimator:

class PercentileDetection:

    def __init__(self, percentile=0.9):
        self.percentile = percentile

    def fit(self, x, y=None):
        self.threshold = pd.Series(x).quantile(self.percentile)

    def predict(self, x, y=None):
        return (pd.Series(x) > self.threshold).values

    def fit_predict(self, x, y=None):
        self.fit(x)
        return self.predict(x)

We can now use our newly created estimator. In the following code snippet, we use the 95th percentile for our estimator. We then put the resulting predictions alongside the original data into a data frame. Finally, I added some styling logic to mark the rows with outliers in bold:

outlierd = PercentileDetection(percentile=0.95)
pd.DataFrame(
    {
        'hourly_traffic': hourly_traffic,
        'is_outlier': outlierd.fit_predict(hourly_traffic)
    }
).style.apply(
    lambda row: ['font-weight: bold'] * len(row)
    if row['is_outlier']
    else ['font-weight: normal'] * len(row),
    axis=1
)

Here is the resulting data frame:

Can we apply the same logic to the dataset from the previous section? Well, yes, but we need to figure out how to apply it to multi-dimensional data first.

Using percentiles for multi-dimensional data

Unlike the hourly_traffic data, the data we generated using the make_classification function is multi-dimensional. We have more than one feature to check this time. Obviously, we can check each feature separately. Here is the code for checking the outliers with respect to the first feature:

outlierd = PercentileDetection(percentile=0.98)
y_pred = outlierd.fit_predict(x[:,0])

We can do the same for the other feature as well:

outlierd = PercentileDetection(percentile=0.98)
y_pred = outlierd.fit_predict(x[:,1])

Now, we have ended up with two predictions. We can combine them so that each sample is marked as an outlier if it is an outlier with respect to either of the two features. In the following code snippet, we will tweak the PercentileDetection estimator to do that:

class PercentileDetection:

    def __init__(self, percentile=0.9):
        self.percentile = percentile

    def fit(self, x, y=None):
        self.thresholds = [
            pd.Series(x[:, i]).quantile(self.percentile)
            for i in range(x.shape[1])
        ]

    def predict(self, x, y=None):
        return (x > self.thresholds).max(axis=1)

    def fit_predict(self, x, y=None):
        self.fit(x)
        return self.predict(x)

Now, we can use the tweaked estimator as follows:

outlierd = PercentileDetection(percentile=0.98)
y_pred = outlierd.fit_predict(x)

We can also use the labels we ignored earlier to calculate the precision and recall of our new estimator. Since we care about the minority class, whose label is 1, we set pos_label to 1 in the following code snippet:

from sklearn.metrics import precision_score, recall_score

print(
    'Precision: {:.02%}, Recall: {:.02%} [Percentile Detection]'.format(
        precision_score(y, y_pred, pos_label=1),
        recall_score(y, y_pred, pos_label=1),
    )
)

This gives a precision of 4% and a recall of 5%. Did you expect better results? I did too. Maybe we need to plot our data to understand what might be the problem with our method. Here is the dataset, where each sample is marked according to its label:
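Such a plot can be produced with a few lines of matplotlib; the marker choices in this sketch are my own:

import matplotlib.pyplot as plt

# Inliers (label 0) as dots, outliers (label 1) as crosses
plt.scatter(x[y == 0, 0], x[y == 0, 1], marker='.', label='Inliers (0)')
plt.scatter(x[y == 1, 0], x[y == 1, 1], marker='x', label='Outliers (1)')
plt.legend()
plt.show()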

Our method checks each point and sees whether it is extreme on one of the two axes. Despite the fact that the outliers are further away from the inliers, for every outlier there are still inliers sharing its horizontal or vertical position. In other words, if you project your points onto either of the two axes, you will not be able to separate the outliers from the inliers anymore. So, we need a way to consider the two axes at once. What if we find the mean point of the two axes, that is, the center of our data, and then draw a circle or an ellipse around it? Then, we can consider any point that falls outside this ellipse an outlier. Would this new strategy help? Luckily, that's exactly what the EllipticEnvelope algorithm does.

Detecting outliers using EllipticEnvelope

"I'm intimidated by the fear of being average."
– Taylor Swift

The EllipticEnvelope algorithm finds the center of the data samples and then draws an ellipsoid around that center. The radii of the ellipsoid along each axis are measured in Mahalanobis distance. You can think of the Mahalanobis distance as a Euclidean distance whose units are the number of standard deviations in each direction. After the ellipsoid is drawn, the points that fall outside it can be considered outliers.

The multivariate Gaussian distribution is a key concept of the EllipticEnvelope algorithm. It's a generalization of the one-dimensional Gaussian distribution. While the one-dimensional Gaussian distribution is defined by a scalar mean and variance, the multivariate Gaussian distribution is defined by a mean vector and a covariance matrix. The multivariate Gaussian distribution is then used to draw an ellipsoid that defines what is normal and what is an outlier.

Here is how we use the EllipticEnvelope algorithm to detect the data outliers, using the algorithm's default settings. Keep in mind that the predict methods for all the outlier detection algorithms in this chapter return -1 for outliers and 1 for inliers:

from sklearn.covariance import EllipticEnvelope

ee = EllipticEnvelope(random_state=0)
y_pred = ee.fit_predict(x) == -1
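Incidentally, the fitted estimator exposes the pieces that define the ellipsoid, in case you want to inspect them; this brief snippet is an addition of my own, using attributes from scikit-learn's covariance module:

print(ee.location_)    # the estimated center of the data
print(ee.covariance_)  # the estimated covariance matrix

# Squared Mahalanobis distance of each sample to the fitted center;
# larger values mean a sample lies further outside the ellipsoid
distances = ee.mahalanobis(x)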

We can calculate the precision and the recall scores for the predictions using the exact same code from the previous section:

from sklearn.metrics import precision_score, recall_score

print(
    'Precision: {:.02%}, Recall: {:.02%} [EllipticEnvelope]'.format(
        precision_score(y, y_pred, pos_label=1),
        recall_score(y, y_pred, pos_label=1),
    )
)

This time, we get a precision of 9% and a recall of 45%. That's already better than the previous scores, but can we do better? Well, if you take another look at the data, you will notice that it is non-convex. We already know that the samples in each class come from more than one distribution, so the shape of the points doesn't seem like it would perfectly fit into an ellipse. This means that we should instead use an algorithm that bases its decision on local distances and densities, rather than comparing everything to a fixed centroid. The Local Outlier Factor (LOF) gives us that feature. If the k-means clustering algorithm of the previous chapter falls into the same group as the elliptic envelope algorithm, then LOF would be the counterpart of the DBSCAN algorithm.

Outlier and novelty detection using LOF

"Madness is rare in individuals – but in groups, parties, nations, and ages, it is the rule."
– Friedrich Nietzsche

LOF takes an opposite approach to Nietzsche's: it compares the density of a sample to the local densities of its neighbors. A sample existing in a low-density area compared to its neighbors is considered an outlier. Like other neighbor-based algorithms, it has parameters to specify the number of neighbors to consider (n_neighbors) and the distance metric to use to find the neighbors (metric and p). By default, the Euclidean distance is used, that is, metric='minkowski' and p=2. You can refer to Chapter 5, Image Processing with Nearest Neighbors, for more information about the available distance metrics. Here is how we use LocalOutlierFactor for outlier detection, using 50 neighbors and its default distance metric:

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=50)
y_pred = lof.fit_predict(x) == -1

The precision and recall scores have now improved even further: we get a precision of 26% and a recall of 65%.

Just like the classifiers, which have the predict method as well as predict_proba, outlier detection algorithms not only give us binary predictions, but can also tell us how confident they are that a sample is an outlier. Once the LOF algorithm is fitted, it stores its outlier factor scores in negative_outlier_factor_. A sample is more likely to be an outlier if the score is closer to -1. So, we can use this score and set its bottom 1%, 2%, or 10% values as outliers, and consider the rest inliers. Here is a comparison for the different performance metrics at each of the aforementioned thresholds:

import numpy as np
from sklearn.metrics import precision_score, recall_score

lof = LocalOutlierFactor(n_neighbors=50)
lof.fit(x)

for quantile in [0.01, 0.02, 0.1]:

    y_pred = lof.negative_outlier_factor_ < np.quantile(
        lof.negative_outlier_factor_, quantile
    )

    print(
        'LOF: Precision: {:.02%}, Recall: {:.02%} [Quantile={:.0%}]'.format(
            precision_score(y, y_pred, pos_label=1),
            recall_score(y, y_pred, pos_label=1),
            quantile
        )
    )

Here are the different precision and recall scores:

# LOF: Precision: 80.00%, Recall: 40.00% [Quantile=1%]
# LOF: Precision: 50.00%, Recall: 50.00% [Quantile=2%]
# LOF: Precision: 14.00%, Recall: 70.00% [Quantile=10%]

As in the case with the classifiers' probabilities, there is a trade-off here between the precision and recall scores for the different thresholds. This is how you can fine-tune your predictions to suit your needs. You can also use negative_outlier_factor_ to plot the Receiver Operating Characteristic (ROC) or Precision-Recall (PR) curves if the true labels are known.
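For instance, here is a brief sketch of how the PR curve's points could be computed from those scores; the sign flip is needed because precision_recall_curve expects higher scores for the positive (outlier) class:

from sklearn.metrics import precision_recall_curve

# negative_outlier_factor_ is close to -1 for inliers and more negative for
# outliers, so -negative_outlier_factor_ grows with the outlierness of a sample
precision, recall, thresholds = precision_recall_curve(
    y, -lof.negative_outlier_factor_
)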

Aside from its use for outlier detection, the LOF algorithm can also be used for novelty detection.

Novelty detection using LOF

When used for outlier detection, the algorithm has to be fitted on a dataset containing both inliers and outliers. In the case of novelty detection, we are expected to fit the algorithm on the inliers only, and then predict on a contaminated dataset later on. Furthermore, to use the algorithm for novelty detection, you have to set novelty=True during its initialization. Here, we remove the outliers from our data and use the resulting subsample, x_inliers, with the fit function. Then, we predict for the original dataset as usual:

from sklearn.neighbors import LocalOutlierFactor

x_inliers = x[y==0]

lof = LocalOutlierFactor(n_neighbors=50, novelty=True)
lof.fit(x_inliers)
y_pred = lof.predict(x) == -1

The resulting precision (26.53%) and recall (65.00%) values did not vary much compared to when we used the algorithm for outlier detection. In the end, the choice between the novelty detection and the outlier detection approaches is a tactical one. It depends on the data available when the model is built and whether it contains outliers.

You probably already know by now that I like using ensemble methods, and so it is hard for me to end this chapter without presenting an ensemble algorithm for outlier detection. In the next section, we are going to look at the isolation forest algorithm.

Detecting outliers using isolation forest

In the previous approaches, we started by defining what normal is, and then considered anything that doesn't conform to this as an outlier. The isolation forest algorithm follows a different approach. Since the outliers are few and different, they are easier to isolate from the rest. So, when building a forest of random trees, a sample that ends up in a leaf node early in a tree (that is, one that did not need much branching effort to be isolated) is more likely to be an outlier.

As a tree-based ensemble, this algorithm shares many hyperparameters with its counterparts, such as the number of random trees to build (n_estimators), the ratio of samples to use when building each tree (max_samples), the ratio of features to consider when building each tree (max_features), and whether to sample with replacement or not (bootstrap). You can also build the trees in parallel using all the available CPUs on your machine by setting n_jobs to -1. Here, we will build an isolation forest of 200 trees, then use it to predict the outliers in our dataset. Like all the other algorithms in this chapter, a prediction of -1 means that the sample is seen as an outlier:

from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=200, n_jobs=-1, random_state=10)
y_pred = iforest.fit_predict(x) == -1
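As an aside, if you happen to have a rough idea of the outliers' ratio, the contamination parameter lets you encode it into the decision threshold. The following variant, an addition of my own, uses the 2% ratio we know our generated data has:

# Assuming we know that about 2% of our samples are outliers
iforest_2pct = IsolationForest(
    n_estimators=200, contamination=0.02, n_jobs=-1, random_state=10
)
y_pred_2pct = iforest_2pct.fit_predict(x) == -1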

With its default settings, the resulting precision (6.5%) and recall (60.0%) values are not as good as those of the previous approaches. Clearly, LOF is the most suitable algorithm for the data we have at hand here. We were able to compare the three algorithms because the original labels were available to us. In reality, labels are usually unavailable, and it is hard to decide which algorithm to use. The field of unlabeled anomaly detection evaluation is actively being researched, and I hope to see scikit-learn implement reliable evaluation metrics once they become available.

In the case of supervised learning, you can use the true labels to evaluate models with PR curves. When it comes to unlabeled data, researchers are currently trying to tailor evaluation criteria, such as the Excess-Mass (EM) and Mass-Volume (MV) curves, to this setting.

Summary

So far in this book, we have used supervised learning algorithms to spot anomalous samples. This chapter offered additional solutions for when no labels are provided. The solutions explained here stem from different fields of machine learning, such as statistical learning, nearest neighbors, and tree-based ensembles. Each of the three tools explained here can excel in some situations, but each also has its disadvantages. We also learned that evaluating machine learning algorithms when no labels are provided is tricky.

This chapter and the previous one dealt with unlabeled data: there, we learned how to cluster data, and here we learned how to detect the outliers in it. We still have one more unsupervised learning topic to discuss in this book, though. In the next chapter, we will cover an important topic relating to e-commerce: recommendation engines. Since it is the last chapter of this book, I'd also like to go through the possible approaches to machine learning model deployment. We will learn how to save and load our models and how to deploy them on Application Programming Interfaces (APIs).