In this chapter, we are going to start by looking at our first supervised learning algorithm—decision trees. The decision tree algorithm is versatile and easy to understand. It is widely used and also serves as a building block for the numerous advanced algorithms that we will encounter later on in this book. In this chapter, we will learn how to train a decision tree and use it for either classification or regression problems. We will also dig into the details of its learning process in order to know how to set its different hyperparameters. Furthermore, we will use a real-world dataset to put what we learn here into practice. We will start by getting and preparing the data, and then applying our algorithm to it. Along the way, we will also try to understand key machine learning concepts, such as cross-validation and model evaluation metrics. By the end of this chapter, you will have a very good understanding of the following topics:
- Understanding decision trees
- How do decision trees learn?
- Getting a more reliable score
- Tuning the hyperparameters for higher accuracy
- Visualizing the tree's decision boundaries
- Building decision tree regressors
Understanding decision trees
I chose to start this book with decision trees because I've noticed that the majority of new machine learning practitioners have previous experience in one of two fields—software development, or statistics and mathematics. Decision trees can conceptually resemble some of the concepts software developers are used to, such as nested if-else conditions and binary search trees. As for the statisticians, bear with me—soon, you will feel at home when we reach the chapter about linear models.
What are decision trees?
I think the best way to explain what decision trees are is by showing the rules they generate after they are trained. Luckily, we can access those rules and print them. Here is an example of how decision tree rules look:
Shall I take an umbrella with me?
|--- Chance of Rain <= 0.6
| |--- UV Index <= 7.0
| | |--- class: False
| |--- UV Index > 7.0
| | |--- class: True
|--- Chance of Rain > 0.6
| |--- class: True
As you can see, it's basically a set of conditions. If the chance of rain is above 0.6 (60%), then I need to take an umbrella with me. If it is below 0.6, then it all depends on the UV index. If the UV index is above 7, then an umbrella is needed; otherwise, I will be fine without one. Now, you might be thinking: well, a few nested if-else conditions will do the trick.
True, but the main difference here is that I didn't write any of these conditions myself. The algorithm just learned the preceding conditions automatically after it went through the following data:
Of course, for this simple case, anyone can manually go through the data and come up with the same conditions. Nevertheless, when dealing with a bigger dataset, the number of conditions we need to program will quickly grow with the number of columns and the values in each column. At such a scale, it is not possible to manually perform the same job, and an algorithm that can learn the conditions from the data is needed.
Conversely, it is also possible to map a constructed tree back to the nested if-else conditions. This means that you can use Python to build a tree from data, then export the underlying conditions to be implemented in a different language or even to put them in Microsoft Excel if you want.
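For instance, the umbrella tree shown earlier maps directly to nested conditionals. Here is a hand-written Python sketch of it (the function name and signature are my own; only the thresholds come from the tree above):

```python
# A hand-coded version of the learned umbrella tree.
# The thresholds (0.6 and 7.0) are the conditions the tree learned;
# normally, the algorithm discovers them from the data for us.
def needs_umbrella(chance_of_rain: float, uv_index: float) -> bool:
    if chance_of_rain <= 0.6:
        if uv_index <= 7.0:
            return False  # class: False
        return True       # class: True
    return True           # class: True

print(needs_umbrella(0.8, 3.0))  # True: high chance of rain
print(needs_umbrella(0.2, 3.0))  # False: little rain, mild UV
print(needs_umbrella(0.2, 9.0))  # True: UV index above 7
```

Exporting a tree in this form is what makes it portable to other languages or even to a spreadsheet.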
Iris classification
scikit-learn comes loaded with a number of datasets that we can use to test new algorithms. One of these datasets is the Iris set. Iris is a genus of 260–300 species of flowering plants with showy flowers; however, our dataset covers just three species—Setosa, Versicolor, and Virginica. Each example in our dataset has the lengths and widths of the sepal and petal of a plant (the features), along with whether it is a Setosa, a Versicolor, or a Virginica (the target). Our task is to identify the species of a plant given its sepal and petal dimensions. This is a supervised learning problem, since the targets are provided with the data, and it is a classification problem, since the target takes a limited number of predefined values (three species).
Loading the Iris dataset
Let's now start by loading the dataset:
- We import the dataset's module from scikit-learn, and then load the Iris data into a variable, which we are going to call iris as well:
from sklearn import datasets
import pandas as pd
iris = datasets.load_iris()
- Using dir, we can see what methods and attributes the dataset provides:
dir(iris)
We get a list of the DESCR, data, feature_names, filename, target, and target_names attributes.
It's nice of the data creators to provide descriptions with each one, which we can access using DESCR. This is rarely the case with real-life data, however. Usually, in real life, we need to talk to the people who produced the data in the first place to understand what each value means, or at least use some descriptive statistics to understand the data before using it.
- For now, let's print the Iris data's description:
print(iris.DESCR)
Have a look at the description now and try to think of some of the main takeaways from it. I will list my own takeaways afterward:
.. _iris_dataset:
Iris plants dataset
--------------------
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolor
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
This description holds some useful information for us, and I found the following points the most interesting:
- The data is composed of 150 rows (or 150 samples). This is a reasonably small dataset. Later on, we will see how to deal with this fact when evaluating our model.
- The class labels or targets take three values—Iris-Setosa, Iris-Versicolor, and Iris-Virginica. Some classification algorithms can only deal with two class labels; we call them binary classifiers. Luckily, the decision tree algorithm can deal with more than two classes, so we have no problems this time.
- The data is balanced; there are 50 samples for each class. This is something we need to keep in mind when training and evaluating our model later on.
- We have four features—sepal length, sepal width, petal length, and petal width—and all four features are numeric. In Chapter 3, Preparing Your Data, we will learn how to deal with non-numeric data.
- There are no missing attribute values. In other words, none of our samples contains null values. Later on in this book, we will learn how to deal with missing values if we encounter them.
- The petal dimensions correlate with the class values more than the sepal dimensions. I wish we had never seen this piece of information. Understanding your data is useful, but the problem here is that this correlation is calculated for the entire dataset. Ideally, we will only calculate it for our training data. Anyway, let's ignore this information for now and just use it for a sanity check later on.
- It's time to put all the dataset information into one DataFrame.
The feature_names attribute returns the names of our features, while the data attribute returns their values in the form of a NumPy array. Similarly, the target attribute has the values of the target in the form of zeros, ones, and twos, and target_names maps 0, 1, and 2 to Iris-Setosa, Iris-Versicolor, and Iris-Virginica, respectively.
Here, we can see the first eight rows we get using iris.data[:8]:
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2]])
The following code uses the data, feature_names, and target attributes to combine all the dataset information into one DataFrame and assign its column names accordingly:
df = pd.DataFrame(
iris.data,
columns=iris.feature_names
)
df['target'] = pd.Series(
iris.target
)
- The target column now has the class IDs. However, for more clarity, we can also create a new column called target_names, where we can map our numerical target values to the class names:
df['target_names'] = df['target'].apply(lambda y: iris.target_names[y])
- Finally, let's print a sample of six rows to see how our new DataFrame looks. Running the following code in a Jupyter Notebook or JupyterLab will just print the contents of the DataFrame; otherwise, you need to surround your code with a print statement. I will assume that a Jupyter Notebook environment is used for all later code snippets:
# print(df.sample(n=6))
df.sample(n=6)
This gave me the following random sample:
The sample() method picked six random rows to display. This means that you will get a different set of rows each time you run the same code. Sometimes, we need to get the same random results every time we run the same code; then, we use a pseudo-random number generator with a preset seed. A pseudo-random number generator initialized with the same seed will produce the same results every time it runs.
So, set the random_state parameter in the sample() method to 42, as follows:
df.sample(n=6, random_state=42)
You will get the exact same rows shown earlier.
Splitting the data
Let's split the DataFrame we have just created into two—70% of the records (that is, 105 records) should go into the training set, while 30% (45 records) should go into testing. The choice of 70/30 is arbitrary for now. We will use the train_test_split() function provided by scikit-learn and specifytest_size to be 0.3:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.3)
We can use df_train.shape[0] and df_test.shape[0] to check how many rows there are in the newly created DataFrames. We can also list the columns of the new DataFrames using df_train.columns and df_test.columns. They both have the same six columns:
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
- target
- target_names
The first four columns are our features, while the fifth column is our target (or label). The sixth column will not be needed for now. Visually, you could say that we have split our data row-wise into training and test sets. Usually, it also makes sense to split each of our DataFrames column-wise into two parts—one part for the features, which we usually call x, and another part for the targets, which is usually called y. We will continue to use this x and y naming convention throughout the rest of this book.
As you know, the feature_names attribute in iris contains a list of the column names corresponding to our features. We will use this information, along with the target label, to create our x and y sets, as follows:
x_train = df_train[iris.feature_names]
x_test = df_test[iris.feature_names]
y_train = df_train['target']
y_test = df_test['target']
Training the model and using it for prediction
To get a feel for how everything works, we will train our algorithm using its default configuration for now. Later on in this chapter, I will explain the details of the decision tree algorithms and how to configure them.
We need to import DecisionTreeClassifier first, and then create an instance of it, as follows:
from sklearn.tree import DecisionTreeClassifier
# It is common to call the classifier instance clf
clf = DecisionTreeClassifier()
One commonly used synonym for training is fitting. This is how an algorithm uses the training data (x and y) to learn its parameters. All scikit-learn models implement a fit() method that takes x_train and y_train, and DecisionTreeClassifier is no different:
clf.fit(x_train, y_train)
By calling the fit() method, the clf instance is trained and ready to be used for predictions. We then call the predict() method on x_test:
# If y_test is our truth, then let's call our predictions y_test_pred
y_test_pred = clf.predict(x_test)
When predicting, we usually don't know the actual targets (y) for our features (x). That's why we only provide the predict() method here with x_test. In this particular case, we happened to know y_test; nevertheless, we will pretend that we don't know it for now, and only use it later on for evaluation. As our actual targets are called y_test, we will call the predicted ones y_test_pred and compare the two later on.
Evaluating our predictions
As we have y_test_pred, all we need now is to compare it to y_test to check how good our predictions are. If you remember from the previous chapter, there are multiple metrics for evaluating a classifier, such as precision, recall, and accuracy. The Iris dataset is a balanced dataset; it has the same number of instances for each class. Therefore, it is apt to use the accuracy metric here.
Calculating the accuracy, as follows, gives us a score of 0.91:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_test_pred)
Congratulations! You've just trained your first supervised learning algorithm. From now on, all the algorithms we are going to use in this book have a similar interface:
- The fit() method takes the x and y parts of your training data.
- The predict() method takes x only and returns a predicted y.
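To see this uniform interface in action, here is a sketch that trains two different classifiers with identical fit()/predict() calls; KNeighborsClassifier is another scikit-learn estimator, chosen here only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# The exact same two calls work for every scikit-learn estimator
for model in [DecisionTreeClassifier(), KNeighborsClassifier()]:
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print(type(model).__name__, 'accuracy:', (y_pred == y_test).mean())
```

This consistency is what lets us swap one algorithm for another later in the book with almost no code changes.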
Which features were more important?
We may now ask ourselves: Which features did the model find more useful in deciding the iris species? Luckily, DecisionTreeClassifier has an attribute called feature_importances_, which is computed after the classifier is fitted and scores how important each feature is to the model's decisions. In the following code snippet, we will create a DataFrame where we put the features' names and their importance scores together, and then sort the features by their importance:
pd.DataFrame(
{
'feature_names': iris.feature_names,
'feature_importances': clf.feature_importances_
}
).sort_values(
'feature_importances', ascending=False
).set_index('feature_names')
This is the output we get:
As you will recall, when we printed the dataset's description, the petal length and width values correlated highly with the target. They also have high feature importance scores here, which confirms what is stated in the description.
Displaying the internal tree decisions
We can also print the internal structure of the learned tree using the following code snippet:
from sklearn.tree import export_text
print(
export_text(clf, feature_names=iris.feature_names, spacing=3, decimals=1)
)
This will print the following text:
|--- petal width (cm) <= 0.8
| |--- class: 0
|--- petal width (cm) > 0.8
| |--- petal width (cm) <= 1.8
| | |--- petal length (cm) <= 5.3
| | | |--- sepal length (cm) <= 5.0
| | | | |--- class: 2
| | | |--- sepal length (cm) > 5.0
| | | | |--- class: 1
| | |--- petal length (cm) > 5.3
| | | |--- class: 2
| |--- petal width (cm) > 1.8
| | |--- class: 2
If you print the complete dataset description, you will notice that toward the end, it says the following:
One class is linearly separable from the other two; the latter are NOT linearly separable from each other.
This means that one class is easier to separate from the other two, while the other two are harder to separate from each other. Now, look at the internal tree's structure. You may notice that in the first step, it decided that anything with a petal width below or equal to 0.8 belongs to class 0 (Setosa). Then, for petal widths above 0.8, the tree kept on branching, trying to differentiate between classes 1 and 2 (Versicolor and Virginica). Generally, the harder it is to separate classes, the deeper the branching goes.
How do decision trees learn?
It's time to find out how decision trees actually learn in order to configure them. In the internal structure we just printed, the tree decided to use a petal width of 0.8 as its initial splitting decision. This was done because decision trees try to build the smallest possible tree using the following technique.
It went through all the features trying to find a feature (petal width, here) and a value within that feature (0.8, here) so that if we split all our training data into two parts (one for petal width ≤ 0.8, and one for petal width > 0.8), we get the purest split possible. In other words, it tries to find a condition where we can separate our classes as much as possible. Then, for each side, it iteratively tries to split the data further using the same technique.
Splitting criteria
If we only had two classes, an ideal split would put members of one class on one side and members of the other class on the other side. In our case, we succeeded in putting members of class 0 on one side and members of classes 1 and 2 on the other. Obviously, we are not always guaranteed to get such a pure split. As we can see in the other branches further down the tree, we always had a mix of samples from classes 1 and 2 on each side.
Having said that, we need a way to measure purity—a criterion to judge whether one split is purer than another. There are two criteria that scikit-learn uses for classification purity—gini and entropy—with the gini criterion as its default option. When it comes to decision tree regression, there are other criteria that we will come across later on.
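To make these two measures concrete, here is a toy implementation of both (my own sketch for illustration—scikit-learn computes these internally, and you never need to code them yourself):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 = pure)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits (0 = pure)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini([0, 0, 0, 0]))     # 0.0: a perfectly pure node
print(gini([0, 0, 1, 1]))     # 0.5: the worst case for two classes
print(entropy([0, 0, 1, 1]))  # 1.0: one full bit of uncertainty
```

At each candidate split, the tree compares the weighted impurity of the two resulting sides and keeps the split that reduces impurity the most.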
Preventing overfitting
After the first split, the tree went on to try to separate between the remaining classes; the Versicolor and the Virginica irises. However, are we really sure that our training data is detailed enough to explain all the nuances that differentiate between the two classes? Isn't it possible that all those branches are driving the algorithm to learn things that happen to exist in the training data, but will not generalize well enough when faced with future data? Allowing a tree to grow so much results in what is called overfitting. The tree tries to perfectly fit the training data, forgetting that the data it may encounter in the future may be different. To prevent overfitting, the following settings may be used to limit the growth of a tree:
- max_depth: This is the maximum depth a tree can get to. A lower number means that the tree will stop branching earlier. Setting it to None means that the tree will continue to grow until all the leaves are pure or until all the leaves contain fewer than min_samples_split samples.
- min_samples_split: The minimum number of samples a node must contain to be considered for further splitting. A higher number means that the tree will stop branching earlier.
- min_samples_leaf: The minimum number of samples required for a node to become a leaf node. A leaf node is a node where there are no further splits and where decisions are made. A higher number may have the effect of smoothing the model, especially in regression.
If max_depth is not set at training time to limit the tree's growth, then alternatively, you can prune the tree after it has been built. Curious readers can check the cost_complexity_pruning_path() method of the decision tree and find out how to use it to prune an already-grown tree.
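For the curious, here is one possible sketch of that pruning workflow on the Iris data (the random_state value is arbitrary; larger ccp_alpha values prune more aggressively):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x, y = iris.data, iris.target

# The path lists the effective alphas at which subtrees get pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(x, y)

# Refit with increasing ccp_alpha: the tree shrinks step by step
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(x, y)
    print(f'ccp_alpha={alpha:.4f} -> depth={clf.get_depth()}, leaves={clf.get_n_leaves()}')
```

In practice, you would pick the ccp_alpha value whose tree scores best under cross-validation.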
Predictions
At the end of the training process, nodes that aren't split any further are called leaf nodes. Within a leaf node, we may have five samples—four of them from class 1, one from class 2, and none from class 0. Then, at prediction time, if a sample ends up in the same leaf node, we can easily decide that the new sample belongs to class 1 since this leaf node had a 4:1 ratio of its training samples from class 1 compared to the other two classes.
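Those leaf-node class ratios are exposed directly by the classifier's predict_proba() method. A quick sketch (a deliberately shallow tree is used here so that some leaves stay impure):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Limit the depth so some leaves keep a mix of classes
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each row holds the class proportions of the leaf a sample falls into;
# predict() simply picks the class with the highest proportion.
print(clf.predict_proba(iris.data[50:53]))
print(clf.predict(iris.data[50:53]))
```

A row such as the 4:1 leaf described above would appear as probabilities of 0.8 and 0.2 for the two classes present in that leaf.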
When we make predictions on the test set, we can evaluate the classifier's accuracy versus the actual labels we have in the test set. Nevertheless, the manner in which we split our data may affect the reliability of the scores we get. In the next section, we will see how to get more reliable scores.
Getting a more reliable score
The Iris dataset is a small set of just 150 samples. When we randomly split it into training and test sets, we ended up with 45 instances in the test set. With such a small number, we may have variations in the distribution of our targets. For example, when I randomly split the data, I got 13 samples from class 0 and 16 samples from each of the two other classes in my test set. Knowing that class 0 is easier to predict than the other two classes in this particular dataset, we can tell that if I had been luckier and had more class 0 samples in the test set, I'd have got a higher score. Furthermore, decision trees are very sensitive to data changes, and you may get a very different tree with every slight change in your training data.
What to do now to get a more reliable score
A statistician would say: let's run the whole process of data splitting, training, and predicting more than once, and look at the distribution of the accuracy scores we get each time. The following code does exactly that for 100 iterations:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# A list to store the score from each iteration
accuracy_scores = []
After importing the required modules and defining an accuracy_scores list to store the scores we are going to get with each iteration, it is time to write a for loop to freshly split the data and recalculate the classifier's accuracy with each iteration:
for _ in range(100):
# At each iteration we freshly split our data
df_train, df_test = train_test_split(df, test_size=0.3)
x_train = df_train[iris.feature_names]
x_test = df_test[iris.feature_names]
y_train = df_train['target']
y_test = df_test['target']
# We then create a new classifier
clf = DecisionTreeClassifier()
# And use it for training and prediction
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
# Finally, we append the score to our list
accuracy_scores.append(round(accuracy_score(y_test, y_pred), 3))
# Better convert accuracy_scores from a list into a series
# Pandas series provides statistical methods to use later
accuracy_scores = pd.Series(accuracy_scores)
The following snippet lets us plot the accuracy's distribution using a box plot:
accuracy_scores.plot(
title='Distribution of classifier accuracy',
kind='box',
)
print(
'Average Score: {:.3} [5th percentile: {:.3} & 95th percentile: {:.3}]'.format(
accuracy_scores.mean(),
accuracy_scores.quantile(.05),
accuracy_scores.quantile(.95),
)
)
This will give us the following graphical analysis of the accuracy. Your results might vary slightly due to the random split of the training and test sets and the random initial settings of the decision trees. Almost all of the scikit-learn modules support a pseudo-random number generator that can be initialized via a random_state hyperparameter. This can be used to enforce code reproducibility. Nevertheless, I deliberately ignored it this time to show how the model's results may vary from one run to the other, and to show the importance of estimating the distributions of your models' errors via iterations:
Box plots are good at showing distributions. Rather than having a single number, we now have an estimation of the best- and the worst-case scenarios of our classifier's performance.
ShuffleSplit
Generating different train and test splits is called cross-validation. This helps us have a more reliable estimation of our model's accuracy. What we did in the previous section is one of many cross-validation strategies called repeated random sub-sampling validation, or Monte Carlo cross-validation.
scikit-learn's ShuffleSplit module provides us with the functionality to perform Monte Carlo cross-validation. Rather than us splitting the data ourselves, ShuffleSplit gives us lists of indices to use for splitting our data. In the following code, we are going to use the DataFrame's loc indexer and the indices we get from ShuffleSplit to randomly split the dataset into 100 training and test pairs:
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
accuracy_scores = []
# Create a shuffle split instance
rs = ShuffleSplit(n_splits=100, test_size=0.3)
# We now get 100 pairs of indices
for train_index, test_index in rs.split(df):
x_train = df.loc[train_index, iris.feature_names]
x_test = df.loc[test_index, iris.feature_names]
y_train = df.loc[train_index, 'target']
y_test = df.loc[test_index, 'target']
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_scores.append(round(accuracy_score(y_test, y_pred), 3))
accuracy_scores = pd.Series(accuracy_scores)
Alternatively, we can simplify the preceding code even further by using scikit-learn's cross_validate function. This time, we are not even splitting the data into training and test sets ourselves. We will give cross_validate the x and y values for the entire set, and then give it our ShuffleSplit instance for it to use internally to split the data. We also give it the classifier and specify what kind of scoring metric to use. When done, it will give us back a list with the calculated test set scores:
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
clf = DecisionTreeClassifier()
rs = ShuffleSplit(n_splits=100, test_size=0.3)
x = df[iris.feature_names]
y = df['target']
cv_results = cross_validate(
clf, x, y, cv=rs, scoring='accuracy'
)
accuracy_scores = pd.Series(cv_results['test_score'])
We can plot the resulting series of accuracy scores now to get the same box plot as earlier. Cross-validation is recommended when dealing with a small dataset since a group of accuracy scores will give us a better understanding of the classifier's performance compared to a single score calculated after a single trial.
Tuning the hyperparameters for higher accuracy
Now that we have learned how to evaluate the model's accuracy more reliably using the ShuffleSplit cross-validation method, it is time to test our earlier hypothesis: would a smaller tree be more accurate?
Here is what we are going to do in the following subsections:
- Split the data into training and test sets.
- Keep the test set to one side for now.
- Limit the tree's growth using different values of max_depth.
- For each max_depth setting, we will use the ShuffleSplit cross-validation method on the training set to get an estimation of the classifier's accuracy.
- Once we decide which value to use for max_depth, we will train the algorithm one last time on the entire training set and predict on the test set.
Splitting the data
Here is the usual code for splitting the data into training and test sets:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.25)
x_train = df_train[iris.feature_names]
x_test = df_test[iris.feature_names]
y_train = df_train['target']
y_test = df_test['target']
Trying different hyperparameter values
If we allowed our earlier tree to grow indefinitely, we would get a tree depth of 4. You can check the depth of a tree by calling clf.get_depth() once it is trained. So, it doesn't make sense to try any max_depth values above 4. Here, we are going to loop over the maximum depths from 1 to 4 and use ShuffleSplit to get the classifier's accuracy:
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
for max_depth in [1, 2, 3, 4]:
# We initialize a new classifier each iteration with different max_depth
clf = DecisionTreeClassifier(max_depth=max_depth)
# We also initialize our shuffle splitter
rs = ShuffleSplit(n_splits=20, test_size=0.25)
cv_results = cross_validate(
clf, x_train, y_train, cv=rs, scoring='accuracy'
)
accuracy_scores = pd.Series(cv_results['test_score'])
print(
'@ max_depth = {}: accuracy_scores: {}~{}'.format(
max_depth,
accuracy_scores.quantile(.1).round(3),
accuracy_scores.quantile(.9).round(3)
)
)
We called the cross_validate() function as we did earlier, giving it the classifier's instance, as well as the ShuffleSplit instance. We also defined our evaluation score as accuracy. Finally, we printed the scores we got with each iteration. We will look more at the printed values in the next section.
Comparing the accuracy scores
Since we have a list of scores for each iteration, we can calculate their mean or, as we will do here, print their 10th and 90th percentiles to get an idea of the accuracy ranges versus each max_depth setting.
Running the preceding code gave me the following results:
@ max_depth = 1: accuracy_scores: 0.532~0.646
@ max_depth = 2: accuracy_scores: 0.925~1.0
@ max_depth = 3: accuracy_scores: 0.929~1.0
@ max_depth = 4: accuracy_scores: 0.929~1.0
One thing I am sure about now is that a single-level tree (usually called a decision stump) is not as accurate as deeper trees. In other words, having a single decision based on whether the petal width is less than 0.8 is not enough. Allowing the tree to grow further improves the accuracy, but I can't see many differences between trees of depths 2, 3, and 4. I'd conclude that contrary to my earlier speculations, we shouldn't worry too much about overfitting here.
Finally, you can train your model once more using the entire training set and a max_depth value of, say, 3. Then, use the trained model to predict the classes for the test set in order to evaluate your final model. I won't bore you with the code for it this time as you can easily do it yourself.
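If you would like to check your version against one, here is a possible sketch (the 75/25 split and the random_state value are my own choices; your numbers will differ with a different split):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)

# Retrain once on the entire training set with the chosen depth
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(x_train, y_train)

# Evaluate the final model on the held-out test set
print('Test accuracy:', accuracy_score(y_test, clf.predict(x_test)))
```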
In addition to printing the classifier's decision and descriptive statistics about its accuracy, it is useful to also see its decision boundaries visually. Mapping those boundaries versus the data samples helps us understand why the classifier made certain mistakes. In the next section, we are going to check the decision boundaries we got for the Iris dataset.
Visualizing the tree's decision boundaries
To be able to pick the right algorithm for the problem, it is important to have a conceptual understanding of how an algorithm makes its decision. As we already know by now, decision trees pick one feature at a time and try to split the data accordingly. Nevertheless, it is important to be able to visualize those decisions as well. Let me first plot our classes versus our features, then I will explain further:
When the tree made a decision to split the data around a petal width of 0.8, you can think of it as drawing a horizontal line in the right-hand side graph at the value of 0.8. Then, with every later split, the tree splits the space further using combinations of horizontal and vertical lines. By knowing this, you should not expect the algorithm to use curves or 45-degree lines to separate the classes.
One trick to plot the decision boundaries that a tree has after it has been trained is to use contour plots. For simplicity, let's assume we only have two features—petal length and petal width. We then generate almost all the possible values for those two features and predict the class labels for our new hypothetical data. Then, we create a contour plot using those predictions to see the boundaries between the classes. The following function, created by Richard Johansson of the University of Gothenburg, does exactly that:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_decision_boundary(clf, x, y):
    feature_names = x.columns
    x, y = x.values, y.values
    # The grid's limits are taken from the two features' ranges
    x_min, x_max = x[:, 0].min(), x[:, 0].max()
    y_min, y_max = x[:, 1].min(), x[:, 1].max()
    step = 0.02
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, step),
        np.arange(y_min, y_max, step)
    )
    # Predict a class for every point on the grid
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(12, 8))
    plt.contourf(xx, yy, Z, cmap='Paired_r', alpha=0.25)
    plt.contour(xx, yy, Z, colors='k', linewidths=0.7)
    plt.scatter(x[:, 0], x[:, 1], c=y, edgecolors='k')
    plt.title("Tree's Decision Boundaries")
    plt.xlabel(feature_names[0])
    plt.ylabel(feature_names[1])
This time, we will train our classifier using two features only, and then call the preceding function using the newly trained model:
x = df[['petal width (cm)', 'petal length (cm)']]
y = df['target']
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(x, y)
plot_decision_boundary(clf, x, y)
Richard Johansson's function overlays the contour graph on our samples to give us the following graph:
By seeing the decision boundaries as well as the data samples, you can make better decisions on whether one algorithm is good for the problem at hand.
Feature engineering
On seeing the class distribution versus the petal lengths and widths, you may wonder: what if the decision trees could also draw boundaries at 40 degrees? Wouldn't 40-degree boundaries be more apt than those horizontal and vertical jigsaws? Unfortunately, decision trees cannot do that, but let's put the algorithm aside for a moment and think about the data instead. How about creating a new axis where the class boundaries change their orientation?
Let's create two new columns—petal length x width (cm) and sepal length x width (cm)—and see how the class distribution will look:
df['petal length x width (cm)'] = df['petal length (cm)'] * df['petal width (cm)']
df['sepal length x width (cm)'] = df['sepal length (cm)'] * df['sepal width (cm)']
The following code will plot the classes versus the newly derived features:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))

h_label = 'petal length x width (cm)'
v_label = 'sepal length x width (cm)'

for c in df['target'].value_counts().index.tolist():
    df[df['target'] == c].plot(
        title='Class distribution vs the newly derived features',
        kind='scatter',
        x=h_label,
        y=v_label,
        color=['r', 'g', 'b'][c],  # A different color for each class
        marker=f'${c}$',  # Use the class id as the marker
        s=64,
        alpha=0.5,
        ax=ax,
    )
fig.show()
Running this code will produce the following graph:
This new projection looks better; it makes the data more vertically separable. Nevertheless, the proof of the pudding is still in the eating. So, let's train two classifiers—one on the original features and one on the newly derived features—and see how their accuracies compare. The following code goes through 500 iterations, each time splitting the data randomly, training both models (each with its own set of features), and storing the accuracy we get with each iteration:
features_orig = iris.feature_names
features_new = ['petal length x width (cm)', 'sepal length x width (cm)']

accuracy_scores_orig = []
accuracy_scores_new = []

for _ in range(500):
    df_train, df_test = train_test_split(df, test_size=0.3)
    x_train_orig = df_train[features_orig]
    x_test_orig = df_test[features_orig]
    x_train_new = df_train[features_new]
    x_test_new = df_test[features_new]
    y_train = df_train['target']
    y_test = df_test['target']

    clf_orig = DecisionTreeClassifier(max_depth=2)
    clf_new = DecisionTreeClassifier(max_depth=2)
    clf_orig.fit(x_train_orig, y_train)
    clf_new.fit(x_train_new, y_train)

    y_pred_orig = clf_orig.predict(x_test_orig)
    y_pred_new = clf_new.predict(x_test_new)

    accuracy_scores_orig.append(round(accuracy_score(y_test, y_pred_orig), 3))
    accuracy_scores_new.append(round(accuracy_score(y_test, y_pred_new), 3))

accuracy_scores_orig = pd.Series(accuracy_scores_orig)
accuracy_scores_new = pd.Series(accuracy_scores_new)
Then, we can use box plots to compare the accuracies of the two classifiers:
fig, axs = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

accuracy_scores_orig.plot(
    title='Distribution of classifier accuracy [Original Features]',
    kind='box',
    grid=True,
    ax=axs[0]
)
accuracy_scores_new.plot(
    title='Distribution of classifier accuracy [New Features]',
    kind='box',
    grid=True,
    ax=axs[1]
)

fig.show()
Here, we put the two plots side by side to be able to compare them to each other:
Clearly, the derived features helped a bit. The classifier trained on them is more accurate on average (0.96 versus 0.93), and its lower bound is also higher.
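To put numbers behind the box plots, you can summarize the two accuracy Series directly with pandas. The sketch below is self-contained, so it uses simulated accuracy scores as stand-ins for the two Series computed above; in practice you would call describe() on the real ones:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-ins for the accuracy Series produced by the 500-iteration loop
accuracy_scores_orig = pd.Series(rng.normal(0.93, 0.03, 500).clip(0, 1))
accuracy_scores_new = pd.Series(rng.normal(0.96, 0.02, 500).clip(0, 1))

# Side-by-side descriptive statistics for the two classifiers
summary = pd.DataFrame({
    'original': accuracy_scores_orig.describe(),
    'derived': accuracy_scores_new.describe(),
}).round(3)
print(summary.loc[['mean', '25%', '50%', '75%']])
```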
Building decision tree regressors
Decision tree regressors work in a similar fashion to their classifier counterparts. The algorithm splits the data recursively using one feature at a time. At the end of the process, we end up with leaf nodes—that is, nodes where there are no further splits. In the case of a classifier, if, at training time, a leaf node has three instances of class A and one instance of class B, then at prediction time, if an instance lands in the same leaf node, the classifier decides that it belongs to the majority class (class A). In the case of a regressor, if, at training time, a leaf node has three instances of values 12, 10, and 8, then, at prediction time, if an instance lands in the same leaf node, the regressor predicts its value to be 10 (the average of the three values at training time).
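The leaf-averaging behavior is easy to verify on a toy example. In the sketch below (my own toy data, not from the book), a depth-1 regressor puts the three training samples with targets 12, 10, and 8 into the same leaf, so any new sample landing there is predicted as their mean:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: feature value 0 -> targets 12, 10, 8; feature value 1 -> target 20
X = np.array([[0], [0], [0], [1]])
y = np.array([12.0, 10.0, 8.0, 20.0])

reg = DecisionTreeRegressor(max_depth=1)
reg.fit(X, y)

# A new sample landing in the first leaf gets that leaf's mean target
print(reg.predict([[0]]))  # [10.]
```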
Predicting people's heights
Say we have two populations. Population 1 has an average height of 155 cm for females, with a standard deviation of 4, and an average height of 175 cm for males, with a standard deviation of 5. Population 2 has an average height of 165 cm for females, with a standard deviation of 15, and an average height of 185 cm for males, with a standard deviation of 12. We decide to take 200 males and 200 females from each population. To be able to simulate this, we can use a function provided by NumPy that draws random samples from a normal (Gaussian) distribution.
Here is the code for generating random samples:
# It's customary to import numpy as np
import numpy as np

# We take 200 samples of each gender from each population
n = 200

height_pop1_f = np.random.normal(loc=155, scale=4, size=n)
height_pop1_m = np.random.normal(loc=175, scale=5, size=n)
height_pop2_f = np.random.normal(loc=165, scale=15, size=n)
height_pop2_m = np.random.normal(loc=185, scale=12, size=n)
At the moment, we don't actually care about which population each sample comes from. So, we will use concatenate to group all the males and all the females together:
# We group all females together and all males together
height_f = np.concatenate([height_pop1_f, height_pop2_f])
height_m = np.concatenate([height_pop1_m, height_pop2_m])
We then put this data into a DataFrame (df_height) to be able to deal with it easily. There, we also give a label of 1 to females and 2 to males:
df_height = pd.DataFrame(
    {
        'Gender': [1 for i in range(height_f.size)] +
                  [2 for i in range(height_m.size)],
        'Height': np.concatenate((height_f, height_m))
    }
)
Let's plot our fictional data using histograms to see the height distributions among each gender:
fig, ax = plt.subplots(1, 1, figsize=(10, 5))

df_height[df_height['Gender'] == 1]['Height'].plot(
    label='Female', kind='hist',
    bins=10, alpha=0.7, ax=ax
)
df_height[df_height['Gender'] == 2]['Height'].plot(
    label='Male', kind='hist',
    bins=10, alpha=0.7, ax=ax
)

ax.legend()
fig.show()
The preceding code gives us the following graph:
As you can see, the resulting distributions are not symmetrical. Although normal distributions are symmetrical, these artificial distributions are made of two sub-distributions combined. We can use the following line of code to see that their mean and median values are not equal:
df_height.groupby('Gender')[['Height']].agg([np.mean, np.median]).round(1)
Here, we have the mean and median heights for each group:
Now, we want to predict people's heights using one feature—their gender. Therefore, we are going to split our data into training and test sets and create our x and y sets, as follows:
df_train, df_test = train_test_split(df_height, test_size=0.3)
x_train, x_test = df_train[['Gender']], df_test[['Gender']]
y_train, y_test = df_train['Height'], df_test['Height']
Remember that in the case of classifications, the trees use either gini or entropy to decide the best split at each step during the training process. The goal for these criteria was to find a split where each of the two resulting sub-groups is as pure as possible. In the case of regression, we have a different goal. We want the members of each group to have target values that are as close as possible to the predictions they make. scikit-learn implements two criteria to achieve this goal:
- Mean squared error (MSE or L2):Say after the split, we get three samples in one group with targets of 5, 5, and 8. We calculate the mean value of these three numbers (6). Then, we calculate the squared differences between each sample and the calculated mean—1, 1, and 4. We then take the mean of these squared differences, which is 2.
- Mean absolute error (MAE or L1): Say after the split, we get three samples in one group with targets of 5, 5, and 8. We calculate the median value of these three numbers (5). Then, we calculate the absolute differences between each sample and the calculated median—0, 0, and 3. We then take the mean of these absolute differences, which is 1.
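The two worked examples above take only a couple of lines to verify. The following sketch reproduces both calculations on the same three targets:

```python
import numpy as np

targets = np.array([5.0, 5.0, 8.0])

# MSE criterion: squared deviations from the mean, then averaged
mse = np.mean((targets - targets.mean()) ** 2)
# MAE criterion: absolute deviations from the median, then averaged
mae = np.mean(np.abs(targets - np.median(targets)))

print(mse, mae)  # 2.0 1.0
```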
For each possible split at training time, the tree calculates either L1 or L2 for each of the expected sub-groups after the split. The split with the minimum L1 or L2 is then chosen at this step. L1 may sometimes be preferred due to its robustness to outliers. The other important difference to keep in mind is that L1 uses the median while L2 uses the mean in its calculations.
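The robustness claim is easy to see once an outlier enters a leaf. In the sketch below (my own toy numbers), a single extreme target drags the mean, and hence the L2 leaf value, much further than it moves the median:

```python
import numpy as np

targets = np.array([5.0, 5.0, 8.0, 100.0])  # one extreme outlier

# L2 leaf value (mean) is dragged far towards the outlier
print(np.mean(targets))    # 29.5
# L1 leaf value (median) barely moves
print(np.median(targets))  # 6.5
```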
Let's now compare the effect of the splitting criteria on our height dataset:
from sklearn.tree import export_text
from sklearn.tree import DecisionTreeRegressor

for criterion in ['mse', 'mae']:
    rgrsr = DecisionTreeRegressor(criterion=criterion)
    rgrsr.fit(x_train, y_train)
    print(f'criterion={criterion}:\n')
    print(export_text(rgrsr, feature_names=['Gender'], spacing=3, decimals=1))
We get the following two trees depending on the chosen criterion:
criterion=mse:
|--- Gender <= 1.5
| |--- value: [160.2]
|--- Gender > 1.5
| |--- value: [180.8]
criterion=mae:
|--- Gender <= 1.5
| |--- value: [157.5]
|--- Gender > 1.5
| |--- value: [178.6]
As expected, when MSE was used, the predictions were close to the mean of each gender, while for MAE, the predictions were close to the median.
Of course, we only had one binary feature in our dataset—gender. That's why we ended up with a very shallow tree with a single split (a stump). Actually, in this case, we do not even need to train a decision tree; we could have easily calculated the mean heights for males and females and used them as our predicted values right away. The decisions made by such a shallow tree are called biased decisions. Had we allowed each individual to express themselves using more information than just their gender, we would have been able to make more accurate predictions for each individual.
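For this one-feature problem, the tree's predictions therefore reduce to per-gender averages, which we can compute directly with a groupby. The sketch below re-creates a small df_height-like frame (same assumed structure and population parameters as earlier, but with my own fixed seed) and computes the two values a depth-1 MSE tree would predict:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Re-create a df_height-like frame: two populations per gender
heights_f = np.concatenate([rng.normal(155, 4, 200), rng.normal(165, 15, 200)])
heights_m = np.concatenate([rng.normal(175, 5, 200), rng.normal(185, 12, 200)])
df_height = pd.DataFrame({
    'Gender': [1] * 400 + [2] * 400,
    'Height': np.concatenate([heights_f, heights_m]),
})

# A single-split MSE tree would predict exactly these per-group means
print(df_height.groupby('Gender')['Height'].mean().round(1))
```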
Finally, just as in the classification trees, we have the same knobs, such as max_depth, min_samples_split, and min_samples_leaf, to control the growth of a regression tree.
Regressor's evaluation
The very same MSE and MAE scores can also be used to evaluate a regressor's performance. We use them to compare the regressor's predictions to the actual targets in the test set. Here is the code for making predictions and evaluating them:
from sklearn.metrics import mean_squared_error, mean_absolute_error
y_test_pred = rgrsr.predict(x_test)
print('MSE:', mean_squared_error(y_test, y_test_pred))
print('MAE:', mean_absolute_error(y_test, y_test_pred))
Using MSE as a splitting criterion gives us an MSE of 117.2 and an MAE of 8.2, while using MAE as the splitting criterion gives us an MSE of 123.3 and an MAE of 7.8. Clearly, using MAE as the splitting criterion gives a lower MAE at test time, and vice versa. In other words, if your aim is to reduce the error of your predictions based on a certain metric, it is advisable to use that same metric when growing your tree at training time.
Setting sample weights
Both decision tree classifiers and regressors allow us to give more or less emphasis to individual training samples by setting their weights while fitting. This is a common feature in many estimators, and decision trees are no exception. To see the effect of sample weights, we are going to give 10 times more weight to users above 150 cm compared to the remaining users:
rgrsr = DecisionTreeRegressor(criterion='mse')
sample_weight = y_train.apply(lambda h: 10 if h > 150 else 1)
rgrsr.fit(x_train, y_train, sample_weight=sample_weight)
Conversely, we can also give more weights to users who are 150 cm and below by changing the sample_weight calculations, as follows:
sample_weight = y_train.apply(lambda h: 10 if h <= 150 else 1)
By using the export_text() function, as we did in the previous section, we can display the resulting trees. We can see how sample_weight affected their final structures:
Emphasis on "below 150":
|--- Gender <= 1.5
| |--- value: [150.7]
|--- Gender > 1.5
| |--- value: [179.2]
Emphasis on "above 150":
|--- Gender <= 1.5
| |--- value: [162.4]
|--- Gender > 1.5
| |--- value: [180.2]
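The numbers above follow directly from the weighted mean: with MSE as the criterion, a leaf's value becomes the weighted average of its targets. A minimal sketch with made-up heights shows the pull in each direction:

```python
import numpy as np

heights = np.array([148.0, 152.0, 156.0, 160.0])

# Up-weight samples above 150 cm by a factor of 10
w_above = np.where(heights > 150, 10, 1)
# Up-weight samples at or below 150 cm instead
w_below = np.where(heights <= 150, 10, 1)

print(np.average(heights, weights=w_above))  # pulled towards the taller samples
print(np.average(heights, weights=w_below))  # pulled towards the shorter ones
```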
By default, all samples are given the same weight. Weighting individual samples differently is useful when dealing with imbalanced data or imbalanced business decisions; maybe you can tolerate delaying a shipment for a new customer more than you can for your loyal ones. In Chapter 8, Ensembles – When One Model Is Not Enough, we will also see how sample weights are an integral part of how the AdaBoost algorithm learns.
Summary
Decision trees are intuitive algorithms that are capable of performing classification and regression tasks. They allow users to print out their decision rules, which is a plus when communicating the decisions you made to business personnel and non-technical third parties. Additionally, decision trees are easy to configure since they have a limited number of hyperparameters. The two main decisions you need to make when training a decision tree are your splitting criterion and how to control the growth of your tree to have a good balance between overfitting and underfitting. Your understanding of the limitations of the tree's decision boundaries is paramount in deciding whether the algorithm is good enough for the problem at hand.
In this chapter, we looked at how decision trees learn and used them to classify a well-known dataset. We also learned about the different evaluation metrics and how the size of our data affects our confidence in a model's accuracy. We then learned how to deal with the evaluation's uncertainties using different data-splitting strategies. We saw how to tune the algorithm's hyperparameters for a good balance between overfitting and underfitting. Finally, we built on the knowledge we gained to build decision tree regressors and learned how the choice of a splitting criterion affects our resulting predictions.
I hope this chapter has served as a good introduction to scikit-learn and its consistent interface. With this knowledge at hand, we can move on to our next algorithm and see how it compares to this one. In the next chapter, we will learn about linear models. This set of algorithms has its roots in the 18th century, and it is still one of the most commonly used algorithms today.