
Imbalanced Learning – Not Even 1% Win the Lottery

Cases where your classes are neatly balanced are more of an exception than the rule. In most of the interesting problems we'll come across, the classes are extremely imbalanced. Luckily, only a small fraction of online payments are fraudulent, just as only a small fraction of the population catches rare diseases. Conversely, few contestants win the lottery, and even fewer of your acquaintances become your close friends. That's why we are usually interested in capturing those rare cases.

In this chapter, we will learn how to deal with imbalanced classes. We will start by giving different weights to our training samples to mitigate the class imbalance problem. Afterward, we will learn about other techniques, such as undersampling and oversampling. We will see the effect of these techniques in practice. We will also learn how to combine concepts such as ensemble learning with resampling, and also introduce new scores to validate if our learners are meeting our needs.

The following topics will be covered in this chapter:

  • Reweighting the training samples
  • Random oversampling
  • Random undersampling
  • Combining sampling with ensembles
  • Equal opportunity score

Let's get started!

Getting the click prediction dataset

Usually, a small percentage of people who see an advertisement click on it. In other words, the percentage of samples in a positive class in such an instance can be just 1% or even less. This makes it hard to predict the click-through rate (CTR) since the training data is highly imbalanced. In this section, we are going to use a highly imbalanced dataset from the Knowledge Discovery in Databases (KDD) Cup.

The KDD Cup is an annual competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. In 2012, they released a dataset for the advertisements shown alongside the search results in a search engine. The aim of the competitors was to predict whether a user will click on each ad or not. A modified version of the data has been published on the OpenML platform (https://www.openml.org/d/1220). The CTR in the modified dataset is 16.8%. This is our positive class. We can also call it the minority class since the majority of the cases did not lead to an ad being clicked on.

Here, we are going to download the data and put it into a DataFrame, as follows:

import pandas as pd
from sklearn.datasets import fetch_openml

# Download the modified KDD Cup CTR dataset from OpenML (data_id=1220)
data = fetch_openml(data_id=1220)

df = pd.DataFrame(
    data['data'],
    columns=data['feature_names']
).astype(float)

df['target'] = pd.Series(data['target']).astype(int)

We can display 5 random rows of the dataset using the following line of code:

df.sample(n=5, random_state=42)

We can make sure we get the same random lines if we both set random_state to the same value. In The Hitchhiker's Guide to the Galaxy by Douglas Adams, the number 42 was deemed the answer to the ultimate question of life, the universe, and everything. So, we will stick to setting random_state to 42 throughout this chapter. Here is our five-line sample:

There are two things we need to keep in mind about this data:

  • The classes are imbalanced, as mentioned earlier. You can check this by running df['target'].mean(), which will give you 16.8%.
  • Despite the fact that all the features are numerical, it is clear that the features ending with the _id suffix are supposed to be treated as categorical features. For example, the relationship between ad_id and the CTR is not expected to be linear, and thus, when using a linear model, we may need to encode these features using a one-hot encoder. Nevertheless, due to their high cardinality, a one-hot encoding strategy would result in too many features for our classifier to deal with. Therefore, we need to come up with another scalable solution. For now, let's learn how to check the cardinality of each feature:
for feature in data['feature_names']:
    print(
        'Cardinality of {}: {:,}'.format(
            feature, df[feature].value_counts().shape[0]
        )
    )

This will give us the following results:

Cardinality of impression: 99
Cardinality of ad_id: 19,228
Cardinality of advertiser_id: 6,064
Cardinality of depth: 3
Cardinality of position: 3
Cardinality of keyword_id: 19,803
Cardinality of title_id: 25,321
Cardinality of description_id: 22,381
Cardinality of user_id: 30,114

Finally, we will convert our data into x_train, x_test, y_train, and y_test sets, as follows:

from sklearn.model_selection import train_test_split

x, y = df[data['feature_names']], df['target']
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

In this section, we downloaded the necessary data and added it to a DataFrame. In the next section, we will install the imbalanced-learn library.

Installing the imbalanced-learn library

Due to class imbalance, we will need to resample our training data or apply different techniques to get better classification results. Thus, we are going to rely on the imbalanced-learn library here. The project was started in 2014 by Fernando Nogueira. It now offers multiple data resampling techniques, as well as metrics for evaluating imbalanced classification problems. The library's interface is compatible with scikit-learn.

You can download the library via pip by running the following command in your Terminal:

pip install -U imbalanced-learn

Now, you can import and use its different modules in your code, as we will see in the following sections. One of the metrics provided by the library is the geometric mean score. In Chapter 8, Ensembles – When One Model is Not Enough, we learned about the true positive rate (TPR), or sensitivity, and the false positive rate (FPR), and we used them to draw the ROC curve and calculate the area under it. We also learned about the true negative rate (TNR), or specificity, which is basically 1 minus the FPR. For binary classification problems, the geometric mean score is the square root of the product of the sensitivity (TPR) and the specificity (TNR). By combining these two metrics, we try to maximize the accuracy of each of the classes while taking their imbalance into account. The interface for geometric_mean_score is similar to that of the other scikit-learn metrics. It takes the true and predicted values and returns the calculated score, as follows:

from imblearn.metrics import geometric_mean_score
geometric_mean_score(y_true, y_pred)
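To make the formula concrete, here is a minimal sketch with made-up labels (not from our dataset) showing that the score matches the square root of the sensitivity multiplied by the specificity:

import numpy as np
from imblearn.metrics import geometric_mean_score

# Made-up labels, purely to illustrate the formula
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

sensitivity = 2 / 4  # TPR: 2 out of the 4 positives were captured
specificity = 3 / 4  # TNR: 3 out of the 4 negatives were kept

print(np.sqrt(sensitivity * specificity))    # 0.6124
print(geometric_mean_score(y_true, y_pred))  # 0.6124 as well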

We will be using this metric in addition to the precision and recall scores throughout this chapter.

In the next section, we are going to alter the weights of our training samples and see if this helps us deal with our imbalanced classes.

Predicting the CTR

We have our data and have installed the imbalanced-learn library. Now, we are ready to build our classifier. As we mentioned earlier, the one-hot encoding techniques we are familiar with will not scale well with the high cardinality of our categorical features. In Chapter 8, Ensembles – When One Model is Not Enough, we briefly mentioned random trees embedding as a technique for transforming our features. It is an ensemble of totally random trees, where each sample of our data is represented according to the leaves of each tree it ends up on. Here, we are going to build a pipeline where the data will be transformed into a random trees embedding and scaled. Finally, a logistic regression classifier will be used to predict whether a click has occurred or not:

from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score
from imblearn.metrics import geometric_mean_score


def predict_and_evalutate(x_train, y_train, x_test, y_test, sample_weight=None, title='Unweighted'):

    clf = Pipeline(
        [
            ('Embedder', RandomTreesEmbedding(n_estimators=10, max_leaf_nodes=20, random_state=42)),
            ('Scaler', MaxAbsScaler()),
            ('Classifier', LogisticRegression(solver='saga', max_iter=1000, random_state=42))
        ]
    )
    clf.fit(x_train, y_train, Classifier__sample_weight=sample_weight)
    y_test_pred = clf.predict(x_test)

    print(
        'Precision: {:.02%}, Recall: {:.02%}; G-mean: {:.02%} @ {}'.format(
            precision_score(y_test, y_test_pred),
            recall_score(y_test, y_test_pred),
            geometric_mean_score(y_test, y_test_pred),
            title
        )
    )

    return clf

We wrapped the whole process into a function so that we can reuse it later in this chapter. The predict_and_evalutate() function takes the x's and the y's, as well as the sample weights. We are going to use the sample weights in a moment, but you can ignore them for now. Once it is done predicting, the function prints the different scores and returns an instance of the pipeline that was used.

We can use the function we have just created as follows:

clf = predict_and_evalutate(x_train, y_train, x_test, y_test) 

By default, the precision and recall that are calculated are for the positive class. The previous code gave us a recall of 0.3%, a precision of 62.5%, and a geometric mean score of 5.45%. The recall is less than 1%, which means that the classifier won't be able to capture the vast majority of the positive/minority class. This is an expected scenario when dealing with imbalanced data. One way to fix this is to give more weight to the samples in the minority class. This is like asking the classifier to pay more attention to these samples since we care about capturing them, despite their rareness. In the next section, we are going to see the effect of sample weighting on our classifier.

Weighting the training samples differently

The number of samples in the majority class is about five times those in the minority class. You can double-check this by running the following line of code:

(1 - y_train.mean()) / y_train.mean() 

Thus, it makes sense to give the samples in the minority class five times the weight of the other samples. We can use the same predict_and_evalutate() function from the previous section and change the sample weights, as follows:

sample_weight = (1 * (y_train == 0)) + (5 * (y_train == 1))
clf = predict_and_evalutate(
    x_train, y_train, x_test, y_test,
    sample_weight=sample_weight
)

Now, the recall jumps to 13.4% at the expense of the precision, which went down to 24.8%. The geometric mean score went up from 5.5% to 34%, thanks to the new weights.

The predict_and_evalutate() function returns an instance of the pipeline that was used. We can get the last component of the pipeline, the logistic regression classifier, via clf[-1]. Then, we can access the coefficients that the classifier assigned to each feature, as well as its intercept. Due to the embedding step, we may end up with up to 200 features; that is, 10 estimators with up to 20 leaf nodes each. The following function prints the last nine features, as well as the intercept, along with their coefficients:

def calculate_feature_coeff(clf):
    return pd.DataFrame(
        {
            'Features': [
                f'EmbFeature{e}'
                for e in range(len(clf[-1].coef_[0]))
            ] + ['Intercept'],
            'Coeff': list(
                clf[-1].coef_[0]
            ) + [clf[-1].intercept_[0]]
        }
    ).set_index('Features').tail(10)

The output of calculate_feature_coeff(clf) can be rounded to two decimal places by appending .round(2), which gives us the following:

Now, let's compare three weighting strategies side by side. With a weight of one, both the minority and the majority classes get the same weights. Then, we give the minority class double the weight of the majority class, as well as five times its weight, as follows:

df_coef_list = []
weight_options = [1, 2, 5]

for w in weight_options:

    print(f'\nMinority Class (Positive Class) Weight = Weight x {w}')
    sample_weight = (1 * (y_train == 0)) + (w * (y_train == 1))
    clf = predict_and_evalutate(
        x_train, y_train, x_test, y_test,
        sample_weight=sample_weight
    )
    df_coef = calculate_feature_coeff(clf)
    df_coef = df_coef.rename(columns={'Coeff': f'Coeff [w={w}]'})
    df_coef_list.append(df_coef)

This gives us the following results:

It is easy to see how the weighting affects the precision and the recall. It is as if one of them always improves at the expense of the other. This behavior is the result of moving the classifier's boundaries. As we know, the class boundaries are defined by the coefficients of the different features, as well as the intercept. I bet you are tempted to see the coefficients of the three previous models side by side. Luckily, we have saved the coefficients in df_coef_list so that we can display them using the following code snippet:

pd.concat(df_coef_list, axis=1).round(2).style.bar(
    subset=[f'Coeff [w={w}]' for w in weight_options],
    color='#999',
    align='zero'
)

This gives us the following visual comparison between the three classifiers:

The coefficients of the features did change slightly, but the changes in the intercept are more noticeable. In summary, the weighting affects the intercept the most and moves the class boundaries as a result.

A sample is classified as a member of the positive class if the predicted probability is above 50%. The movement of the intercept, without any changes in the other coefficients, is equivalent to changing the probability threshold so that it's above or below that 50%. If the weighting only affected the intercept, we might suggest that we should try different probability thresholds until we get the desired precision-recall tradeoff. To check whether the weighting offered any additional benefit on top of altering the intercept, we have to check the area under the Receiver Operating Characteristic (ROC) curve.
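For illustration, here is a minimal sketch of what trying a custom probability threshold might look like, assuming a pipeline (clf) that has already been fitted, as in the previous snippets; the 0.2 cutoff is an arbitrary value used for demonstration only:

# Classify a sample as positive whenever its predicted probability
# exceeds a custom threshold, instead of the default 50% cutoff
custom_threshold = 0.2  # arbitrary value; tune it to your precision-recall needs

y_test_proba = clf.predict_proba(x_test)[:, 1]
y_test_pred_custom = (y_test_proba >= custom_threshold).astype(int)

print(
    'Precision: {:.02%}, Recall: {:.02%}'.format(
        precision_score(y_test, y_test_pred_custom),
        recall_score(y_test, y_test_pred_custom),
    )
)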

The effect of the weighting on the ROC

Did the weighting improve the area under the ROC curve? To answer this question, let's start by creating a function that will display the ROC curve and print the area under the curve (AUC):

from sklearn.metrics import roc_curve, auc

def plot_roc_curve(y, y_proba, ax, label):
    fpr, tpr, thr = roc_curve(y, y_proba)
    auc_value = auc(fpr, tpr)
    pd.DataFrame(
        {
            'FPR': fpr,
            'TPR': tpr
        }
    ).set_index('FPR')['TPR'].plot(
        label=label + f'; AUC = {auc_value:.3f}',
        kind='line',
        xlim=(0,1),
        ylim=(0,1),
        color='k',
        ax=ax
    )
    return (fpr, tpr, auc_value)

Now, we can loop over the three weighting options and render their corresponding curves, as follows:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fig, ax = plt.subplots(1, 1, figsize=(15, 8), sharey=False)

ax.plot(
    [0, 1], [0, 1],
    linestyle='--',
    lw=2, color='k',
    label='Chance', alpha=.8
)

for w in weight_options:

    sample_weight = (1 * (y_train == 0)) + (w * (y_train == 1))

    clf = Pipeline(
        [
            ('Embedder', RandomTreesEmbedding(n_estimators=20, max_leaf_nodes=20, random_state=42)),
            ('Scaler', MaxAbsScaler()),
            ('Classifier', LogisticRegression(solver='lbfgs', max_iter=2000, random_state=42))
        ]
    )
    clf.fit(x_train, y_train, Classifier__sample_weight=sample_weight)
    y_test_pred_proba = clf.predict_proba(x_test)[:,1]

    plot_roc_curve(
        y_test, y_test_pred_proba,
        label=f'\nMinority Class Weight = Weight x {w}',
        ax=ax
    )

ax.set_title('Receiver Operating Characteristic (ROC)')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')

ax.legend(ncol=1, fontsize='large', shadow=True)

fig.show()

These three curves are displayed here:

The ROC curve is meant to show the tradeoff between the TPR and the FPR for the different probability thresholds. If the area under the ROC curve is more or less the same for the three weighting strategies, then the weighting did not offer much value beyond altering the classifier's intercept. Thus, if we want to increase the recall at the expense of the precision, it is up to us whether to reweight our training samples or to try different probability thresholds for our classification decision.

In addition to the sample weighting, we can resample the training data so that we train on a more balanced set. In the next section, we are going to see the different sampling techniques offered by the imbalanced-learn library.

Sampling the training data

"It's not denial. I'm just selective about the reality I accept."
- Bill Watterson

If the machine learning models were humans, they would have believed that the end justifies the means. When 99% of their training data belongs to one class, and their aim is to optimize their objective function, we cannot blame them if they focus on getting that single class right since it contributes to 99% of the solution. In the previous section, we tried to change this behavior by giving more weights to the minority class, or classes. Another strategy might entail removing some samples from the majority class or adding new samples to the minority class until the two classes are balanced.

Undersampling the majority class

"Truth, like gold, is to be obtained not by its growth, but by washing away from it all that is not gold."
- Leo Tolstoy

We can randomly remove samples from the majority class until it becomes the same size as the minority class. When dealing with non-binary classification tasks, we can remove samples from all the classes until they all become the same size as the minority class. This technique is known as Random Undersampling. The following code shows how RandomUnderSampler() can be used to downsample the majority class:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
x_train_resampled, y_train_resampled = rus.fit_resample(x_train, y_train)

Rather than keeping the classes perfectly balanced, you can just reduce their imbalance by setting the sampling_strategy hyperparameter. Its value dictates the final ratio of the minority class to the majority class. In the following example, we kept the final size of the majority class at twice that of the minority class:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy=0.5)
x_train_resampled, y_train_resampled = rus.fit_resample(x_train, y_train)

The downsampling process doesn't have to be random. For example, we can use the nearest neighbors algorithm to remove the samples that do not agree with their neighbors. The EditedNearestNeighbours module allows you to set the number of neighbors to check via its n_neighbors hyperparameter, as follows:

from imblearn.under_sampling import EditedNearestNeighbours

enn = EditedNearestNeighbours(n_neighbors=5)
x_train_resampled, y_train_resampled = enn.fit_resample(x_train, y_train)

The previous techniques belong to what is known as prototype selection. In this situation, we select samples from already existing ones. In contrast to prototype selection, the prototype generation approach generates new samples to summarize the existing ones. The ClusterCentroids algorithm puts the majority class samples into clusters and uses the cluster centroids instead of the original samples. More on clustering and cluster centroids will be provided in Chapter 11, Clustering – Making Sense of Unlabeled Data.
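As a minimal sketch, ClusterCentroids follows the same fit_resample() interface as the other samplers; keep in mind that the underlying k-means clustering can be slow on a dataset of this size:

from imblearn.under_sampling import ClusterCentroids

# Replace the majority-class samples with the centroids of their k-means clusters
cc = ClusterCentroids(random_state=42)
x_train_resampled, y_train_resampled = cc.fit_resample(x_train, y_train)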

To compare the aforementioned algorithms, let's create a function that takes the x's and y's, in addition to the sampler instance, and then trains them and returns the predicted values for the test set:

from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.pipeline import Pipeline

def sample_and_predict(x_train, y_train, x_test, y_test, sampler=None):

    if sampler:
        x_train, y_train = sampler.fit_resample(x_train, y_train)

    clf = Pipeline(
        [
            ('Embedder', RandomTreesEmbedding(n_estimators=10, max_leaf_nodes=20, random_state=42)),
            ('Scaler', MaxAbsScaler()),
            ('Classifier', LogisticRegression(solver='saga', max_iter=1000, random_state=42))
        ]
    )
    clf.fit(x_train, y_train)
    y_test_pred_proba = clf.predict_proba(x_test)[:,1]

    return y_test, y_test_pred_proba

Now, we can use the sample_and_predict() function we have just created and plot the resulting ROC curve for the following two sampling techniques:

from sklearn.metrics import roc_curve, auc
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import EditedNearestNeighbours

fig, ax = plt.subplots(1, 1, figsize=(15, 8), sharey=False)

# Original Data

y_test, y_test_pred_proba = sample_and_predict(x_train, y_train, x_test, y_test, sampler=None)
plot_roc_curve(
    y_test, y_test_pred_proba,
    label='Original Data',
    ax=ax
)

# RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
y_test, y_test_pred_proba = sample_and_predict(x_train, y_train, x_test, y_test, sampler=rus)
plot_roc_curve(
    y_test, y_test_pred_proba,
    label='RandomUnderSampler',
    ax=ax
)

# EditedNearestNeighbours

enn = EditedNearestNeighbours(n_neighbors=5)
y_test, y_test_pred_proba = sample_and_predict(x_train, y_train, x_test, y_test, sampler=enn)
plot_roc_curve(
    y_test, y_test_pred_proba,
    label='EditedNearestNeighbours',
    ax=ax
)

ax.legend(ncol=1, fontsize='large', shadow=True)

fig.show()

The resulting ROC curve will look as follows:

Here, we can see the effect of the sampling techniques on the resulting area under the ROC curve, in comparison to training on the original, unsampled set. The three curves may be too close for us to tell them apart, as is the case here, so it makes sense to check the resulting AUC numbers instead.

Oversampling the minority class

Besides undersampling, we can also add more data points to the minority class. RandomOverSampler naively clones random samples of the minority class until it becomes the same size as the majority class. SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic sampling), on the other hand, generate new synthetic samples by interpolation.
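ADASYN is not part of the comparison that follows, but it exposes the same fit_resample() interface as the other samplers; here is a minimal sketch of how it could be plugged into the sample_and_predict() helper we defined earlier:

from imblearn.over_sampling import ADASYN

# ADASYN concentrates its synthetic samples around the minority samples
# that are harder to learn, rather than treating them all equally
adasyn = ADASYN(random_state=42)
y_test, y_test_pred_proba = sample_and_predict(
    x_train, y_train, x_test, y_test, sampler=adasyn
)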

Here, we are comparing RandomOverSampler to the SMOTE oversampling algorithm:

from sklearn.metrics import roc_curve, auc
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

fig, ax = plt.subplots(1, 1, figsize=(15, 8), sharey=False)

# RandomOverSampler

ros = RandomOverSampler(random_state=42)
y_test, y_test_pred_proba = sample_and_predict(x_train, y_train, x_test, y_test, sampler=ros)
plot_roc_curve(
    y_test, y_test_pred_proba,
    label='RandomOverSampler',
    ax=ax
)

# SMOTE

smote = SMOTE(random_state=42)
y_test, y_test_pred_proba = sample_and_predict(x_train, y_train, x_test, y_test, sampler=smote)
plot_roc_curve(
    y_test, y_test_pred_proba,
    label='SMOTE',
    ax=ax
)

ax.legend(ncol=1, fontsize='large', shadow=True)

fig.show()

The resulting ROC curve helps us compare the performance of the two techniques being used on the dataset at hand:

As we can see, the SMOTE algorithm did not perform well on our current dataset, while RandomOverSampler pushed the curve upward. So far, the classifiers we've used have been agnostic to the sampling techniques we've applied. We can simply remove the logistic regression classifier and plug in any other classifier here without changing the data sampling code. In contrast to the algorithms we've used, the data sampling process is an integral part of some ensemble algorithms. In the next section, we'll learn how to make use of this fact to get the best of both worlds.

Combining data sampling with ensembles

In Chapter 8, Ensembles – When One Model is Not Enough, we learned about bagging algorithms. They basically allow multiple estimators to learn from different subsets of the dataset, in the hope that these diverse training subsets will allow the different estimators to come to a better decision when combined. Now that we've undersampled the majority class to keep our training data balanced, it is natural to combine the two ideas; that is, the bagging and undersampling techniques.

BalancedBaggingClassifier builds several estimators on different randomly selected subsets of data, where the classes are balanced during the sampling process. Similarly, BalancedRandomForestClassifier builds its trees on balanced samples. In the following code, we're plotting the ROC curves for the two ensembles:

from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.ensemble import BalancedBaggingClassifier

fig, ax = plt.subplots(1, 1, figsize=(15, 8), sharey=False)

# BalancedBaggingClassifier

clf = BalancedBaggingClassifier(n_estimators=500, n_jobs=-1, random_state=42)
clf.fit(x_train, y_train)
y_test_pred_proba = clf.predict_proba(x_test)[:,1]

plot_roc_curve(
    y_test, y_test_pred_proba,
    label='Balanced Bagging Classifier',
    ax=ax
)

# BalancedRandomForestClassifier

clf = BalancedRandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
clf.fit(x_train, y_train)
y_test_pred_proba = clf.predict_proba(x_test)[:,1]

plot_roc_curve(
    y_test, y_test_pred_proba,
    label='Balanced Random Forest Classifier',
    ax=ax
)

fig.show()

Some formatting lines have been omitted for brevity. Running the previous code gives us the following graph:

From this, it's clear that the combination of undersampling and ensembles achieved better results than our earlier models.

In addition to the bagging algorithms, RUSBoostClassifier combines the random undersampling technique with the AdaBoost classifier.
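Here is a minimal sketch of how it could be used with our data; the number of estimators is an arbitrary choice for illustration:

from imblearn.ensemble import RUSBoostClassifier

# Each boosting iteration is trained on a randomly undersampled, balanced subset
clf = RUSBoostClassifier(n_estimators=200, random_state=42)
clf.fit(x_train, y_train)
y_test_pred_proba = clf.predict_proba(x_test)[:,1]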

Equal opportunity score

So far, we've only focused on the imbalances in the class labels. In some situations, the imbalance in a particular feature may also be problematic. Say, historically, that the vast majority of the engineers in your company were men. Now, if you build an algorithm to filter the new applicants based on your existing data, would it discriminate against the female candidates?

The equal opportunity score tries to evaluate how dependent a model is on a certain feature. Simply put, a model is considered to give an equal opportunity to the different values of a certain feature if the relationship between the model's predictions and the actual targets is the same, regardless of the value of this feature. Formally, this means that the conditional probability of the predicted target, given the actual target and the applicant's gender, should be the same regardless of gender. These conditional probabilities are shown in the following equation:

P(Predicted = 1 | Actual = 1, Gender = Female) = P(Predicted = 1 | Actual = 1, Gender = Male)

The previous equation either holds or it doesn't, so it only gives us a binary outcome. Therefore, we can turn it into a ratio so that we have a value between 0 and 1. Since we do not know which gender gets the better opportunity, we take the minimum of the two possible fractions, as in the following equation:

Equal Opportunity Score = min(
    P(Predicted = 1 | Actual = 1, Gender = Female) / P(Predicted = 1 | Actual = 1, Gender = Male),
    P(Predicted = 1 | Actual = 1, Gender = Male) / P(Predicted = 1 | Actual = 1, Gender = Female)
)

To demonstrate this metric, let's assume we have a model that was trained on the applicants' IQ and Gender. The following code shows its predictions on the test set, where the true labels and the predicted labels are listed side by side:

df_engineers = pd.DataFrame(
    {
        'IQ': [110, 120, 124, 123, 112, 114],
        'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
        'Is Hired? (True Label)': [0, 1, 1, 1, 1, 0],
        'Is Hired? (Predicted Label)': [1, 0, 1, 1, 1, 0],
    }
)

Now, we can create a function to calculate the equal opportunity score for us, as follows:

def equal_opportunity_score(df, true_label, predicted_label, feature_name, feature_value):
    opportunity_to_value = df[
        (df[true_label] == 1) & (df[feature_name] == feature_value)
    ][predicted_label].mean() / df[
        (df[true_label] == 1) & (df[feature_name] != feature_value)
    ][predicted_label].mean()
    opportunity_to_other_values = 1 / opportunity_to_value
    better_opportunity_to_value = opportunity_to_value > opportunity_to_other_values
    return {
        'Score': min(opportunity_to_value, opportunity_to_other_values),
        f'Better Opportunity to {feature_value}': better_opportunity_to_value
    }

When called with our df_engineers DataFrame, it will give us 0.5. Having a value that's less than one tells us that the female applicants have less of an opportunity to get hired by our model:

equal_opportunity_score(
    df=df_engineers,
    true_label='Is Hired? (True Label)',
    predicted_label='Is Hired? (Predicted Label)',
    feature_name='Gender',
    feature_value='F'
)

Obviously, we can exclude the gender feature from this model altogether, yet this score is still useful if there are any remaining features that depend on the applicant's gender. Additionally, we need to alter this score when dealing with a non-binary classifier and/or a non-binary feature. You can read about this score in more detail in the original paper by Moritz Hardt et al.

Summary

In this chapter, we learned how to deal with class imbalances. This is a recurrent problem in machine learning, where most of the value lies in the minority class. This phenomenon is common enough that the black swan metaphor is often used to describe it. When machine learning algorithms try to blindly optimize their out-of-the-box objective functions, they usually miss those black swans. Hence, we have to use techniques such as sample weighting, sample removal, and sample generation to force the algorithms to meet our own objectives.

This was the last chapter in this book about supervised learning algorithms. There is a rough estimate that 80% of the machine learning problems in business setups and academia are supervised learning ones, which is why about 80% of this book focused on that paradigm. From the next chapter onward, we will start covering the other machine learning paradigms, which is where about 20% of the real-life value resides. We will start by looking at clustering algorithms, and then move on and look at other problems where the data is also unlabeled.