Ensembles – When One Model Is Not Enough – Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

Ensembles – When One Model Is Not Enough

In the previous three chapters, we saw how neural networks help directly and indirectly in solving natural language understanding and image processing problems. This is because neural networks are proven to work well with homogeneous data; that is, if all the input features are of the same breed—pixels, words, characters, and so on. On the other hand, when it comes to heterogeneousdata, it is the ensemblemethods that are known to shine. They are well suited to deal with heterogeneous data—for example, where one column contains users' ages, the other has their incomes, and a third has their city of residence.

You can view ensemble estimators as meta-estimators; they are made up of multiple instances of other estimators. The way they combine their underlying estimators is what differentiates between the different ensemble methods—for example, the bagging versus the boosting methods. In this chapter, we are going to look at these methods in detail and understand their underlying theory. We will also learn how to diagnose our own models and understand why they make certain decisions.

As always, I would also like to seize the opportunity to shed light on general machine learning concepts while dissecting each individual algorithm. In this chapter, we will see how to handle the estimators' uncertainties using the classifiers' probabilities and the regression ranges.

The following topics will be discussed in this chapter:

  • The motivation behind ensembles
  • Averaging/bagging ensembles
  • Boosting ensembles
  • Regression ranges
  • The ROC curve
  • Area under the curve
  • Voting and stacking ensembles

Answering the question why ensembles?

The main idea behind ensembles is to combine multiple estimators so that they make better predictions than a single estimator. However, you should not expect the mere combination of multiple estimators to just lead to better results. The combined predictions of multiple estimators who make the exact same mistakes will be as wrong as each individual estimator in the group. Therefore, it is helpful to think of the possible ways to mitigate the mistakes that individual estimators make. To do so, we have to revisit our old friend the bias and variance dichotomy. We will meet few machine learning teachers better than this pair.

If you recall from Chapter 2, Making Decisions with Trees, when we allowed our decision trees to grow as much as they can, they tended to fit the training data like a glove but failed to generalize to newer data points. We referred to this as overfitting, and we have seen the same behavior with unregularized linear models and with a small number of nearest neighbors. Conversely, aggressively restricting the growth of trees, limiting the number of the features in linear models, and asking too many neighbors to vote caused the models to become biased and underfit the data at hand. So, we had to tread a thin line between trying to find the optimum balance between the bias-variance and the underfitting-overfitting dichotomies.

In the following sections, we are going to follow a different approach. We will deal with the bias-variance dichotomy as a continuous scale, starting from one side of this scale and using the concept of ensemble to move toward the other side. In the next section, we are going to start by looking at high-variance estimators and averaging their results to reduce their variance. Later on, we will start from the other side and use the concept of boosting to reduce the estimators' biases.

Combining multiple estimators via averaging

"To derive the most useful information from multiple sources of evidence, you should always try to make these sources independent of each other."
Daniel Kahneman

If a single fully grown decision tree overfits, and if having many voters in the nearest neighbors algorithm has an opposite effect, then why not combine the two concepts? Rather than having a single tree, let's have a forest that combines the predictions of each tree in it. Nevertheless, we do not want all the trees in our forest to be identical; we would love them to be as diverse as possible. The bagging and random forest meta-estimators are the most common examples here. To achieve diversity, they make sure that each one of the individual estimators they use is trained on a random subset of the training data—hence the random prefix in random forest. Each time a random sample is drawn, it can be done with replacement (bootstrapping) or without replacement (pasting). The term bagging stands for bootstrap aggregation as the estimators draw their samples with replacement. Furthermore, for even more diversity, the ensembles can assure that each tree sees a random subset of the training features.

Both ensembles use decision tree estimators by default, but the bagging ensemble can be reconfigured to use any other estimator. Ideally, we would like to use high-variance estimators. The decisions made by the individual estimators are combined via voting or averaging.

Boosting multiple biased estimators

"If I have seen further than others, it is by standing upon the shoulders of giants."
–Isaac Newton

In contrast to fully grown trees, a shallow tree tends to be biased. Boosting a biased estimator is commonly performed via AdaBoost or gradient boosting. The AdaBoost meta-estimator starts with a weak or biased estimator, then each consequent estimator learns from the mistakes made by its predecessors. We saw in Chapter 2, Making Decisions with Trees, that we can give each individual training sample a different weight so that the estimators can give more emphasis to some samples versus others. In AdaBoost, erroneous predictions made by the preceding estimators are given more weight for their successors to pay more attention to.

The gradient boosting meta-estimator follows a slightly different approach. It starts with a biased estimator, computes its loss function, then builds each consequent estimator to minimize the loss function of its predecessors. As we saw earlier, gradient descent always comes in handy when iteratively minimizing loss functions, hence the gradient prefix in the name of the gradient boosting algorithm.

Due to the iterative nature of the two ensembles, they both have a learning rate to control their learning speed and to make sure they don't miss the local minima when converging. Like the bagging algorithm, AdaBoost is not limited to decision trees as its base estimator.

Now that we have a good idea about the different ensemble methods, we can use real-life data to demonstrate how they work in practice. Each of the ensemble methods described here can be used for classification and regression. The classifier and regressor hyperparameters are almost identical for each ensemble. Therefore, I will pick a regression problem to demonstrate each algorithm and briefly show the classification capabilities of the random forest and gradient boosting algorithms since they are the most commonly used ensembles.

In the next section, we are going to download a dataset prepared by theUniversity of California, Irvine (UCI). It contains 201 samples for different cars, along with their prices. We will be using this dataset in a later section to predict the car prices via regression.

Downloading the UCI Automobile dataset

The Automobile dataset was created by Jeffrey C. Schlimmer and published in UCI's machine learning repository. It contains information about 201 automobiles, along with their prices. The names of the features are missing. Nevertheless, I could get them from the dataset's description (http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names). So, we can start by seeing the URL and the feature names, as follows:

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'

header = [
# ... some list items are omitted for brevity


Then, we use the following code to download our data.

df = pd.read_csv(url, names=header, na_values='?')

It is mentioned in the dataset's description that missing values are replaced with a question mark. To make things more Pythonic, we set na_values to '?' to replace these question marks with NumPy's Not a Number(NaN).

Next, we can perform our Exploratory Data Analysis (EDA), check the percentages of the missing values, and see how to deal with them.

Dealing with missing values

Now, we can check which columns have the most missing values:

cols_with_missing = df.isnull().sum()
cols_with_missing > 0

This gives us the following list:

normalized-losses    41
num-of-doors          2
bore                  4
stroke                4
horsepower            2
peak-rpm              2
price                 4

Since the price is our target value, we can just ignore the four records where the prices are unknown:

df = df[~df['price'].isnull()]

As for the remaining features, I'd say let's drop thenormalized-losses column since 41 of its values are missing. Later on, we will use the data imputation techniques to deal with the other columns with fewer missing values. You can drop the normalized-losses column using the following code:

df.drop(labels=['normalized-losses'], axis=1, inplace=True)

At this point, we have a data frame with all the required features and their names. Next, we want to split the data into training and test sets, and then prepare our features. The different feature types require different preparations. You may need to separately scale the numerical features and encode the categorical ones. So, it is good practice to be able to differentiate between the numerical and the categorical features.

Differentiating between numerical features and categorical ones

Here, we are going to create a dictionary to separately list the numerical and categorical features. We will also make a combined list of the two, and provide the name of the target column, as in the following code:

features = {
'categorical': [
'make', 'fuel-type', 'aspiration', 'num-of-doors',
'body-style', 'drive-wheels', 'engine-location',
'engine-type', 'num-of-cylinders', 'fuel-system',

'numerical': [
'symboling', 'wheel-base', 'length', 'width', 'height',
'curb-weight', 'engine-size', 'bore', 'stroke',
'compression-ratio', 'horsepower', 'peak-rpm',
'city-mpg', 'highway-mpg',

features['all'] = features['categorical'] + features['numerical']

target = 'price'

By doing so, you can deal with the columns differently. Furthermore, just for my own sanity and to notprint too many zeros in the future, I rescaled the prices to be in thousands, as follows:

df[target] = df[target].astype(np.float64) / 1000

You can also display certain features separately. Here, we print a random sample, where just the categorical features are shown:

 df[features['categorical']].sample(n=3, random_state=42)

Here are the resulting rows. I set random_state to 42 to make sure we all get the same random rows:

All other transformations, such as scaling, imputing, and encoding, should be done after splitting the data into training and test sets. That way, we can ensure that no information is leaked from the test set into the training samples.

Splitting the data into training and test sets

Here, we keep 25% of the data for testing and use the rest for training:

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.25, random_state=22)

Then, we can use the information from the previous section to create our x and y values:

x_train = df_train[features['all']]
x_test = df_test[features['all']]

y_train = df_train[target]
y_test = df_test[target]

As usual, with regression tasks, it is handy to understand the distribution of the target values:

title="Distribution of Car Prices (in 1000's)",

A histogram is usually a good choice for understanding distributions, as seen in the following graph:

We may come back to this distribution later to put our regressor's mean error in perspective. Additionally, you can use this range for sanity checks. For example, if you know that all the prices you have seen fall in the range of 5,000 to 45,000, you may decide when to put your model in production to fire an alert any time it returns prices far from this range.

Imputing the missing values and encoding the categorical features

Before bringing our ensembles to action, we need to make sure we do not have null values in our data. We will replace the missing values with the most frequent value in each column using the SimpleImputer function fromChapter 4, Preparing Your Data:

from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

x_train = imp.fit_transform(x_train)
x_test = imp.transform(x_test)

You may have already seen me complain many times about the scikit-learn transformers, which do not respect the column names and insist on converting the input data frames into NumPy arrays. To stop myself from complaining again, let me solve my itch by using the following ColumnNamesKeeper class. Whenever I wrap it around a transformer, it will make sure all the data frames are kept unharmed:

class ColumnNamesKeeper:

def __init__(self, transformer):
self._columns = None
self.transformer = transformer

def fit(self, x, y=None):
self._columns = x.columns

def transform(self, x, y=None):
x = self.transformer.transform(x)
return pd.DataFrame(x, columns=self._columns)

def fit_transform(self, x, y=None):
self.fit(x, y)
return self.transform(x)

As you can see, it mainly saves the column name when the fit method is called. Then, we can use the saved names to recreate the data frames after the transformation steps.

The code for ColumnNamesKeeper can be simplified further by inheriting from sklearn.base.BaseEstimator and sklearn.base.TransformerMixin. You can check the source code of any of the library's built-in transformers if you are willing to write more scikit-learn-friendly transformers.

Now, I can call SimpleImputer again while preserving x_train and x_test as data frames:

from sklearn.impute import SimpleImputer

imp = ColumnNamesKeeper(
SimpleImputer(missing_values=np.nan, strategy='most_frequent')

x_train = imp.fit_transform(x_train)
x_test = imp.transform(x_test)

We learned in Chapter 4, Preparing Your Data, that OrdinalEncoderis recommended for tree-based algorithms, in addition to any other non-linear algorithms. The category_encoders library doesn't mess with the column names, and so we can use OrdinalEncoder without the need forColumnNamesKeeper this time. In the following code snippet, we also specify which columns to encode (the categorical columns) and which to keep unchanged (the remaining ones):

from category_encoders.ordinal import OrdinalEncoder
enc = OrdinalEncoder(
x_train = enc.fit_transform(x_train)
x_test = enc.transform(x_test)

In addition to OrdinalEncoder, you can also test the encoders mentioned in the target encoding in Chapter 4, Preparing Your Data. They, too, are meant to be used with the algorithms explained in this chapter. In the next section, we are going to use the random forest algorithm with the data we have just prepared.

Using random forest for regression

The random forest algorithm is going to be the first ensemble to deal with here. It's an easy-to-grasp algorithm with straightforward hyperparameters. Nevertheless, as we usually do, we will start by training the algorithm using its default values, as follows, then explain its hyperparameters after that:

from sklearn.ensemble import RandomForestRegressor
rgr = RandomForestRegressor(n_jobs=-1)
rgr.fit(x_train, y_train)
y_test_pred = rgr.predict(x_test)

Since each tree is independent of the others, I set n_jobs to -1 to use my multiple processors to train the trees in parallel. Once they are trained and the predictions are obtained, we can print the following accuracy metrics:

from sklearn.metrics import (
mean_squared_error, mean_absolute_error, median_absolute_error, r2_score

'R2: {:.2f}, MSE: {:.2f}, RMSE: {:.2f}, MAE {:.2f}'.format(
r2_score(y_test, y_test_pred),
mean_squared_error(y_test, y_test_pred),
np.sqrt(mean_squared_error(y_test, y_test_pred)),
mean_absolute_error(y_test, y_test_pred),

This will print the following scores:

# R2: 0.90, MSE: 4.54, RMSE: 2.13, MAE 1.35

The average car price is 13,400. So, a Mean Absolute Error (MAE) of 1.35 seems reasonable. As for the Mean Squared Error (MSE), it makes sense to use its square root to keep it in the same units as the MAE. In brief, given the high R2 score and the low errors, the algorithm seems to perform well with its default values. Furthermore, you can plot the errors to get a better understanding of the model's performance:

df_pred = pd.DataFrame(
'actuals': y_test,
'predictions': y_test_pred,

df_pred['error'] = np.abs(y_test - y_test_pred)

fig, axs = plt.subplots(1, 2, figsize=(16, 5), sharey=False)

title='Actuals vs Predictions',

title='Distribution of Error',


I've excluded some of the formatting lines to keep the code concise. In the end, we get the following graphs:

By plotting the predictions versus the actuals, we can make sure that the models don't systematically overestimate or underestimate. This is shown via the 45o slope of the scattered points on the left. A lower slope for the scattered points would have systematically reflected an underestimation. Having the scattered points aligned on a straight line assures us that there aren't non-linearities that the model couldn't capture. The histogram to the right shows that most of the errors are below 2,000. It is good to understand what mean and maximum errors you can expect to get in the future.

Checking the effect of the number of trees

By default, each tree is trained on a random sample from the training data. This is achieved by setting the bootstrap hyperparameter to True. In bootstrap sampling, a sample may be used during training more than once, while another sample may not be used at all.

When max_samples is kept as None, each tree is trained on a random sample of a size that is equal to the entire training data size. You can set max_samples to a fraction that is less than 1, then each tree is trained on a smaller random sub-sample. Similarly, we can set max_features to a fraction that is less than 1 to make sure each tree uses a random subset of the available features. These parameters help each tree to have its own personality and to ensure the diversity of the forest. To put it more formally, these parameters increase the variance of each individual tree. So, it is advised to have as many trees as possible to reduce the variance we have just introduced.

Here, we compare three forests, with a different number of trees in each:

mae = []
n_estimators_options = [5, 500, 5000]

for n_estimators in n_estimators_options:

rgr = RandomForestRegressor(

rgr.fit(x_train, y_train)
y_test_pred = rgr.predict(x_test)
mae.append(mean_absolute_error(y_test, y_test_pred))

Then, we can plot the MAE for each forest to see the merits of having more trees:

Clearly, we have just encountered a new set of hyperparameters to tune bootstrap, max_features, and max_samples. So, it makes sense to apply cross-validation for hyperparameter tuning.

Understanding the effect of each training feature

Once a random forest is trained, we can list the training features, along with their importance. As usual, we put the outcome in a data frame by using the column names and the feature_importances_ attribute, as shown:

df_feature_importances = pd.DataFrame(
'Feature': x_train.columns,
'Importance': rgr.feature_importances_,
'Importance', ascending=False

Here is the resulting data frame:

Unlike with linear models, all the values here are positive. This is because these values only show the importance of each feature, regardless of whether it is positively or negatively correlated with the target. This is common for decision trees, as well as for tree-based ensembles. Thus, we can use Partial Dependence Plots(PDPs)to show the relationship between the target and the different features. Here, we only plot it for the top six features according to their importance:

from sklearn.inspection import plot_partial_dependence

fig, ax = plt.subplots(1, 1, figsize=(15, 7), sharey=False)

top_features = df_feature_importances['Feature'].head(6)

rgr, x_train,
line_kw={'color': 'k'},

ax.set_title('Partial Dependence')


The resulting graphs are easier to read, especially when the relationship between the target and the features is non-linear:

We can now tell that cars with bigger engines, more horsepower, and less mileage per gallon tend to be more expensive.

PDPs are not just useful for ensemble methods, but also for any other complex non-linear model. Despite the fact the neural networks have coefficients for each layer, the PDP is essential in understanding the network as a whole. Furthermore, you can also understand the interaction between the different feature pairs by passing the list of features as a list tuples, with a pair of features in each tuple.

Using random forest for classification

To demonstrate the random forest classifier, we are going to use a synthetic dataset. We first create the dataset using the built-in make_hastie_10_2 class:

from sklearn.datasets import make_hastie_10_2
x, y = make_hastie_10_2(n_samples=6000, random_state=42)

This previous code snippet creates a random dataset. I set random_state to a fixed number to make sure we both get the same random data. Now, we can split the resulting data into training and test sets:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

Then, to evaluate the classifier, we are going to introduce a new concept called the Receiver Operating Characteristic (ROC) curve in the next section.

The ROC curve

"Probability is expectation founded upon partial knowledge. A perfect acquaintance with all the circumstances affecting the occurrence of an event would change expectation into certainty, and leave neither room nor demand for a theory of probabilities."
– George Boole (Boolean data types are named after him)

In a classification problem, the classifier assigns probabilities to each sample to reflect how likely it is that each sample belongs to a certain class. We get these probabilities via the classifier's predict_proba() method. The predict() method is typically a wrapper on top of the predict_proba() method. In a binary-classification problem, it assigns each sample to a specific class if the probability of it belonging to the class is above 50%. In practice, we may not always want to stick to this 50% threshold, especially as different thresholds usually change the True Positive Rates (TPRs) and False Positive Rates (FPRs) for each class. So, you can choose a different threshold to optimize for a desired TPR.

The best way to decide which threshold suits your needs is to use a ROC curve. This helps us see the TPR and FPR for each threshold. To create this curve, we will train our random forest classifier on the synthetic dataset we have just created, but get the classifier's probabilities this time:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(

clf.fit(x_train, y_train)
y_pred_proba = clf.predict_proba(x_test)[:,1]

Then, we can calculate the TPR and FPR for each threshold, as follows:

from sklearn.metrics import roc_curve
fpr, tpr, thr = roc_curve(y_test, y_pred_proba)

Let's stop for a moment to explain what TPR and FPR mean:

  • TheTPR, also known as recall or sensitivity, is calculated as the number of True Positive (TP) cases divided by all the positive cases; that is, , where FN is the positive cases falsely classified as negative (false negatives).
  • The True Negative Rates (TNR), also known as specificity, is calculated as the number of True Negative (TN) cases divided by all the negative cases; that is, , where FP is the negative cases falsely classified as positive (false positives).
  • The FPR is defined as 1 minus TNR; that is, .
  • The False Negative Rate (FNR) is defined as 1 minus TPR; that is, .

Now, we can put the calculated TPR and FPR for our dataset into the following table:

Even better than a table, we can plot them into a graph using the following code:

{'FPR': fpr, 'TPR': tpr}
title=f'Receiver Operating Characteristic (ROC)',
label='Random Forest Classifier',

I've omitted the graph's styling code for the sake of brevity. I also added a 45o line and the Area Under the Curve (AUC), which I will explain in a bit:

A classifier that randomly assigns each sample to a certain class will have a ROC curve that looks like the dashed 45o line. Any improvement over this will make the curve more convex upward. Obviously, random forest's ROC curve is better than chance. An optimum classifier will touch the upper-left corner of the graph. Therefore, the AUC can be used to reflect how good the classifier is. An area above 0.5 is better than chance, and an area of 1.0 is the best possible value. We typically expect values between 0.5 and 1.0. Here, we got an AUC of 0.94. The AUC can be calculated using the following code:

from sklearn.metrics import auc
auc_values = auc(fpr, tpr)

We can also use the ROC and AUC to compare two classifiers. Here, I trained the random forest classifier with the bootstrap hyperparameter set to True and compared it to the same classifier when bootstrap was set to False:

No wonder the bootstrap hyperparameter is set to True by default—it gives better results. Now, you have seen how to use random forest algorithms to solve classification and regression problems. In the next section, we are going to explain a similar ensemble: the bagging ensemble.

Using bagging regressors

We will go back to the Automobile dataset as we are going to use the bagging regressor this time. The bagging meta-estimator is very similar to random forest. It is built of multiple estimators, each one trained on a random subset of the data using a bootstrap sampling method. The key difference here is that although decision trees are used as the base estimators by default, any other estimator can be used as well. Out of curiosity, let's use the K-Nearest Neighbors (KNN) regressor as our base estimator this time. However, we need to prepare the data to suit the new regressor's needs.

Preparing a mixture of numerical and categorical features

It is recommended to put all features on the same scale when using distance-based algorithms such as KNN. Otherwise, the effect of the features with higher magnitudes on the distance metric will overshadow the other features. As we have a mixture of numerical and categorical features here, we need to create two parallel pipelines to prepare each feature set separately.

Here is a top-level view of our pipeline:

Here, we start by building the four transformers in our pipelines: Imputer, Scaler, and OneHotEncoder. We also wrap them in ColumnNamesKeeper, which we created earlier in this chapter:

from sklearn.impute import SimpleImputer
from category_encoders.one_hot import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

numerical_mputer = ColumnNamesKeeper(

categorical_mputer = ColumnNamesKeeper(

minmax_scaler = ColumnNamesKeeper(

onehot_encoder = OneHotEncoder(

Then, we put them into two parallel pipelines:

numerical_pipeline = Pipeline(
('numerical_mputer', numerical_mputer),
('minmax_scaler', minmax_scaler)

categorical_pipeline = Pipeline(
('categorical_mputer', categorical_mputer),
('onehot_encoder', onehot_encoder)

Finally, we concatenate the outputs of the pipelines for both the training and the test sets:

x_train_knn = pd.concat(

x_test_knn = pd.concat(

At this point, we are ready to build our bagged KNNs.

Combining KNN estimators using a bagging meta-estimator

BaggingRegressor has a base_estimator hyperparameter, where you can set the estimators you want to use. Here, KNeighborsRegressor is used with a single neighbor. Since we are aggregating multiple estimators to reduce their variance, it makes sense to have a high variance estimator in the first place, hence the small number of neighbors here:

from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor

rgr = BaggingRegressor(

rgr.fit(x_train_knn, df_train[target])
y_test_pred = rgr.predict(x_test_knn)

This new setup gives us an MAE of 1.8. We can stop here, or we may decide to improve the ensemble's performance by tuning its big array of hyperparameters.

First of all, we can try different estimators other than KNN, each with its own hyperparameters. Then, the bagging ensemble also has its own hyperparameters. We can change the number of estimators via n_estimators. Then, we can decide whether to use the entire training set or a random subset of it for each estimator via max_samples. Similarly, we can also pick a random subset of the columns to use for each estimator to use via max_features. The choice of whether to use bootstrapping for the rows and the columns can be made via the bootstrap and bootstrap_featureshyperparameters, respectively.

Finally, since each estimator is trained separately, we can use a machine with a high number of CPUs and parallelize the training process by setting n_jobs to -1.

Now that we have experienced two versions of the averaging ensembles, it is time to check their boosting counterparts. We will start with the gradient boosting ensemble, then move to the AdaBoost ensemble.

Using gradient boosting to predict automobile prices

If I were ever stranded on a desert island and had to pick one algorithm to take with me, I'd definitely chose the gradient boosting ensemble! It has proven to work very well on many classification and regression problems. We are going to use it with the same automobile data from the previous sections. The classifier and the regressor versions of this ensemble share the exact same hyperparameters, except for the loss functions they use. This means that everything we are going to learn here will be useful to us whenever we decide to use gradient boosting ensembles for classification.

Unlike the averaging ensembles we have seen so far, the boosting ensembles build their estimators iteratively. The knowledge learned from the initial ensemble is used to build its successors. This is the main downside of boosting ensembles, where parallelism is unfeasible. Putting parallelism aside, this iterative nature of the ensemble calls for a learning rate to be set. This helps the gradient descent algorithm reach the loss function's minima easily. Here, we use 500 trees, each with a maximum of 3 nodes, and a learning rate of 0.01. Furthermore, the Least Squares (LS) loss is used here; think MSE. More on the available loss functions in a moment:

from sklearn.ensemble import GradientBoostingRegressor

rgr = GradientBoostingRegressor(
n_estimators=1000, learning_rate=0.01, max_depth=3, loss='ls'

rgr.fit(x_train, y_train)
y_test_pred = rgr.predict(x_test)

This new algorithm gives us the following performance on the test set:

# R2: 0.92, MSE: 3.93, RMSE: 1.98, MAE: 1.42

As you can see, this setting gave a lower MSE compared to random forest, while random forest had a better MAE. Another loss function that the gradient boostingregressor can use is Least Absolute Deviation (LAD); think MAE, this time. LAD may help when dealing with outliers, and it can sometimes reduce the model's MAE performance on the test set. Nevertheless, it did not improve the MAE for the dataset at hand. We also have a percentile (quantile) loss, but before going deeper into the supported loss functions, we need to learn how to diagnose the learning process.

The main hyperparameters to set here are the number of trees, the depth of the trees, the learning rate, and the loss function. As a rule of thumb, you should aim for a higher number of trees and a low learning rate. As we will see in a bit, these two hyperparameters are inversely proportional to each other. Controlling the depth of your trees is purely dependent on your data. In general, we need to have shallow trees and let boosting empower them. Nevertheless, the depth of the tree controls the number of feature interactions we want to capture. In a stub (a tree with a single split), only one feature can be learned at a time. A deeper tree resembles a nested if condition where a few more features are at play each time. I usually start withmax_depth set to around 3 and 5 and tune it along the way.

Plotting the learning deviance

With each additional estimator, we expect the algorithm to learn more and the loss to decrease. Yet, at some point, the additional estimators will keep overfitting on the training data while not offering much improvement for the test data.

To have a clear picture, we need to plot the calculated loss with each additional estimator for both the training and test sets. As for the training loss, it is saved by the gradient boosting meta-estimator into its loss_ attribute. For the test loss, we can use the meta-estimator's staged_predict() methods. This method can be used for a given dataset to make predictions for each intermediate iteration.

Since we have multiple loss functions to choose from, gradient boosting also provides aloss_() method, which calculates the loss for us based on the loss function used. Here, we create a new function to calculate the training and test errors for each iteration and put them into a data frame:

def calculate_deviance(estimator, x_test, y_test):

train_errors = estimator.train_score_
test_errors = [
estimator.loss_(y_test, y_pred_staged)
for y_pred_staged in estimator.staged_predict(x_test)

return pd.DataFrame(
'n_estimators': range(1, estimator.estimators_.shape[0]+1),
'train_error': train_errors,
'test_error': test_errors,

Since we are going to use an LS loss here, you can simply replace the estimator.loss_() method with mean_squared_error() and get the exact same results. But let's keep the estimator.loss_() function for a more versatile and reusable code.

Next, we train our gradient boosting regressor, as usual:

from sklearn.ensemble import GradientBoostingRegressor

rgr = GradientBoostingRegressor(n_estimators=250, learning_rate=0.02, loss='ls')
rgr.fit(x_train, y_train)

Then, we use the trained model, along with the test set, to plot the training and test learning deviance:

fig, ax = plt.subplots(1, 1, figsize=(16, 5), sharey=False)

df_deviance = calculate_deviance(rgr, x_test, y_test)

kind='line', color='k', linestyle=':', ax=ax

kind='line', color='k', linestyle='-', ax=ax


Running the code gives us the following graph:

The beauty of this graph is that it tells us that the improvements on the test set stopped after 120 estimators or so, despite the continuous improvement in the training set; that is, it started to overfit. Furthermore, we can use this graph to understand the effect of a chosen learning rate, as we did in Chapter 7, Neural Networks - Here comes the Deep Learning.

Comparing the learning rate settings

Rather than training one model, we will train three gradient boosting regressors this time, each with a different learning rate. Then, we will plot the deviance graph for each one side by side, as shown:

As with other gradient descent-based models, a high learning rate causes the estimator to overshoot and miss the local minima. We can see this in the first graph where no improvements are seen despite the consecutive iterations. The learning rates in the second and third graphs seem reasonable. In comparison, the learning rate in the third graph seems to be too slow for the model to converge in 500 iterations. You may then decide to increase the number of estimators for the third model to allow it to converge.

We have learned from the bagging ensembles that using a random training sample with each estimator may help with overfitting. In the next section, we are going to see whether the same approach can also help the boosting ensembles.

Using different sample sizes

We have been using the entire training set for each iteration. This time, we are going to train three gradient boosting regressors, each with a different subsample size, and plot their deviance graphs as before. We will use a fixed learning rate of 0.01 and the LADas our loss function, as shown:

In the first graph, the entire training sample is used for each iteration. So, the training loss did not fluctuate as much as in the other two graphs. Nevertheless, the sampling used in the second model allowed it to reach a better test score, despite its noisy loss graph. This was similarly the case for the third model, but with a slightly larger final error.

Stopping earlier and adapting the learning rate

The n_iter_no_change hyperparameter is used to stop the training process after a certain number of iterations if the validation score is not improving enough. The subset set aside for validation, validation_fraction, is used to calculate the validation score. The tolhyperparameter is used to decide how much improvement we must consider as being enough.

The fit method in the gradient boosting algorithm accepts a callback function that is called after each iteration. It can also be used to set a custom condition to stop the training process based on it. Furthermore, It can be used for monitoring or for any other customizations you need. This callback function is called with three parameters: the order of the current iteration (n), an instance of gradient boosting (estimator), and its settings (params). To demonstrate how this callback function works, let's build a function to change the learning rate to 0.01 for one iteration at every 10 iterations, and keep it at 0.1 for the remaining iterations, as shown:

def lr_changer(n, estimator, params):
if n % 10:
estimator.learning_rate = 0.01
estimator.learning_rate = 0.1
return False

Then, we use our lr_changer function, as follows:

from sklearn.ensemble import GradientBoostingRegressor
rgr = GradientBoostingRegressor(n_estimators=50, learning_rate=0.01, loss='ls')
rgr.fit(x_train, y_train, monitor=lr_changer)

Now, if we print the deviance as we usually do, we will see how after every 10th iteration, the calculated loss jumps due to the learning rate changes:

What I've just done is pretty useless, but it demonstrates the possibilities you have at hand. For example, you can borrow ideas such as the adaptive learning rate and the momentum from the solvers used in the neural networks and incorporate them here using this callback function.

Regression ranges

"I try to be a realist and not a pessimist or an optimist."
–Yuval Noah Harari

One last gem that gradient boosting regression offers to us is regression ranges. These are very useful in quantifying the uncertainty of your predictions.

We try our best to make our predictions exactly the same as the actual data. Nevertheless, our data can still be noisy, or the features used may not capture the whole truth. Take the following example:

x1 x2 y
0 0 10
1 1 50
0 0 20
0 0 22

Consider a new sample with x1 = 0 and x2 = 0. We already have three training examples with the exact same features, so what would the predicted y value be for this new sample? If a squared loss function is used during the training, then the predicted target will be close to 17.3, which is the mean of the three corresponding targets(10, 20, and 22). Now, if MAE is used, then the predicted target will be closer to 22, which is the median (50th percentile) of the three corresponding targets. Rather than the 50th percentile, we can use any other percentiles when a quantile loss function is used. So, to achieve regression ranges, we can use two regressors with two different quantiles as the upper and lower bounds of our range.

Although the regression ranges work regardless of the dimensionality of the data at hand, the format of the page has forced us to come up with a two-dimensional example for more clarity. The following code creates 400samples to play with:

x_sample = np.arange(-10, 10, 0.05)
y_sample = np.random.normal(loc=0, scale=25, size=x_sample.shape[0])
y_sample *= x_sample

pd_random_samples = pd.DataFrame(
'x': x_sample,
'y': y_sample

Here is a scatter plot of the generated y versus x values:

Now, we can train two regressors with the 10th and 90th percentiles as our range boundaries and plot those regression boundaries, along with our scattered data points:

from sklearn.ensemble import GradientBoostingRegressor

fig, ax = plt.subplots(1, 1, figsize=(12, 8), sharey=False)

title='Regression Ranges [10th & 90th Quantiles]',
kind='scatter', x='x', y='y', color='k', alpha=0.95, ax=ax

for quantile in [0.1, 0.9]:

rgr = GradientBoostingRegressor(n_estimators=10, loss='quantile', alpha=quantile)
rgr.fit(pd_random_samples[['x']], pd_random_samples['y'])
pd_random_samples[f'pred_q{quantile}'] = rgr.predict(pd_random_samples[['x']])

kind='line', x='x', y=f'pred_q{quantile}',
linestyle='-', alpha=0.75, color='k', ax=ax

ax.legend(ncol=1, fontsize='x-large', shadow=True)


We can see that the majority of the points fall within the range. Ideally, we would expect 80% of the points to fall in the 90-100 range:

We can now use the same strategy to predict automobile prices:

from sklearn.ensemble import GradientBoostingRegressor

rgr_min = GradientBoostingRegressor(n_estimators=50, loss='quantile', alpha=0.25)
rgr_max = GradientBoostingRegressor(n_estimators=50, loss='quantile', alpha=0.75)

rgr_min.fit(x_train, y_train, monitor=lr_changer)
rgr_max.fit(x_train, y_train, monitor=lr_changer)

y_test_pred_min = rgr_min.predict(x_test)
y_test_pred_max = rgr_max.predict(x_test)

df_pred_range = pd.DataFrame(
'Actuals': y_test,
'Pred_min': y_test_pred_min,
'Pred_max': y_test_pred_max,

Then, we can check what percentage of our test set falls within the regression range:

df_pred_range['Actuals in Range?'] = df_pred_range.apply(
lambda row: 1 if row['Actuals'] >= row['Pred_min'] and row['Actuals'] <= row['Pred_max'] else 0, axis=1

Calculating the mean of df_pred_range['Actuals in Range?'] gives us 0.49, which is very close to the 0.5 value we expected. Obviously, we can use wider or narrower ranges, depending on our use case. If your model is going to be used to help car owners sell their cars, you may need to give them reasonable ranges, since telling someone that they can sell their car for any price between $5 and $30,000 is pretty accurate yet useless advice. Sometimes, a less accurate yet useful model is better than an accurate and useless one.

Another boosting algorithm that is not used as much nowadays is the AdaBoost algorithm. We will briefly explore it in the next section for the sake of completeness.

Using AdaBoost ensembles

In an AdaBoost ensemble, the mistakes made in each iteration are used to alter the weights of the training samples for the following iterations. As in the boosting meta-estimator, this method can also use any other estimators instead of the decision trees used by default. Here, we have used it with its default estimators on the Automobile dataset:

from sklearn.ensemble import AdaBoostRegressor

rgr = AdaBoostRegressor(n_estimators=100)
rgr.fit(x_train, y_train)
y_test_pred = rgr.predict(x_test)

The AdaBoost meta-estimator also has a staged_predict method, which allows us to plot the improvement in the training or test loss after each iteration. Here is the code for plotting the test error:

(n, mean_squared_error(y_test, y_pred_staged))
for n, y_pred_staged in enumerate(rgr.staged_predict(x_test), 1)
columns=['n', 'Test Error']


Here is a plot for the calculated loss after each iteration:

As in the other ensembles, the more estimators we add, the more accurate it becomes. Once we start to overfit, we should be able to stop. That's why having a validation sample is essential in knowing when to stop. I used the test set for demonstration here, but in practice, the test sample should be kept aside and a validation set should be used instead.

Exploring more ensembles

The main ensemble techniques are the ones we have seen so far. The following ones are also good to know about and can be useful for some peculiar cases.

Voting ensembles

Sometimes, we have a number of good estimators, each with its own mistakes. Our objective is not to mitigate their bias or variance, but to combine their predictions in the hope that they don't all make the same mistakes. In these cases, VotingClassifier and VotingRegressor could be used. You can give a higher preference to some estimators versus the others by adjusting the weights hyperparameter. VotingClassifier has different voting strategies, depending on whether the predicted class labels are to be used or whether the predicted probabilities should be used instead.

Stacking ensembles

Rather than voting, you can combine the predictions of multiple estimators by adding an extra one that uses their predictions as input. This strategy is known as stacking. The inputs of the final estimator can be limited to the predictions of the previous estimators, or it can be a combination of their predictions and the original training data. To avoid overfitting, the final estimators are usually trained using cross-validation.

Random tree embedding

We have seen how the trees are capable of capturing the non-linearities in the data. So, if we still want to use a simpler algorithm, we can just use the trees to transform the data and leave the prediction for the simple algorithm to do. When building a tree, each data point falls into one of its leaves. Therefore, the IDs of the leaves can be used to represent the different data points. If we build multiple trees, then each data point can be represented by the ID of the leaf it fell on in each tree. These leaves, IDs can be used as our new features and can be fed into a simpler estimator. This kind of embedding is useful for feature compression and allows linear models to capture the non-linearities in the data.

Here, we use an unsupervised RandomTreesEmbedding method to transform our automobile features, and then use the transformed features in a Ridge regression:

from sklearn.ensemble import RandomTreesEmbedding
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rgr = make_pipeline(RandomTreesEmbedding(), Ridge())
rgr.fit(x_train, y_train)
y_test_pred = rgr.predict(x_test)

print(f'MSE: {mean_squared_error(y_test, y_test_pred)}')

From the preceding block of code, we can observe the following:

  • This approach is not limited to RandomTreesEmbedding.
  • Gradient-boosted trees can also be used to transform the data for a downstream estimator to use.
  • Both GradientBoostingRegressor and GradientBoostingClassifier have an apply function, which can be used for feature transformation.


In this chapter, we saw how algorithms benefit from being assembled in the form of ensembles. We learned how these ensembles can mitigate the bias versus variance trade-off.

When dealing with heterogeneous data, the gradient boosting and random forest algorithms are my first two choices for classification and regression. They do not require any sophisticated data preparation, thanks to their dependence on trees. They are able to deal with non-linear data and capture feature interactions. Above all, the tuning of their hyperparameters is straightforward.

The more estimators in each method, the better, and you should not worry so much about them overfitting. As for gradient boosting, you can pick a lower learning rate if you can afford to have more trees. In addition to these hyperparameters, the depth of the trees in each of the two algorithms should be tuned via trail and error and cross-validation. Since the two algorithms come from different sides of the bias-variance spectrum, you may initially aim for forests with big trees that you can prune later on. Conversely, you can start with shallow trees and rely on your gradient-boosting meta-estimator to boost them.

So far in this book, we have predicted a single target at a time. Here, we predicted the prices of automobiles and that's it. In the next chapter, we will see how to predict multiple targets at a time. Furthermore, when our aim is to use the probabilities given by a classifier, having a calibrated classifier is paramount. We can have a better estimation of our risks if we have probabilities that we trust. Thus, calibrating a classifier is going to be another topic covered in the next chapter.