Image Processing with Nearest Neighbors – Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

Image Processing with Nearest Neighbors

In this chapter and the following one, we are going to take a different approach. The nearest neighbors algorithm will take a supporting role here, while image processing will be the main protagonist of the chapter. We will start by loading images and we will use Python to represent them in a suitable format for the machine learning algorithms to work with. We will be using the nearest neighbors algorithm for classification and regression. We will also learn how to compress information in images into a smaller space. Many of the concepts explained here are transferable and can be used with other algorithms with slight tweaks. Later, in Chapter 7, NeuralNetworks - Here Comes the Deep Learning, we will build on the knowledge acquired here and continue with image processing by using neural networks. In this chapter, we are going to cover the following topics:

  • Nearest neighbors
  • Loading and displaying images
  • Image classification
  • Using custom distances
  • Using nearest neighbors for regression
  • Reducing the dimensions of our image data

Nearest neighbors

"We learn by example and by direct experience because there are real limits to the adequacy of verbal instruction."
Malcolm Gladwell

It feels as if Malcolm Gladwell is explaining the K-nearest neighbors algorithm in the preceding quote; we only need to replace "verbal instruction" with "mathematical equation." In cases such as linear models, training data is used to learn a mathematical equation that models the data. Once a model is learned, we can easily put the training data aside. Here, in the nearest neighbors algorithm, the data itself is the model. Whenever we encounter a new data sample, we compare it to the training dataset. We locate the K-nearest samples in the training set to the newly encountered sample, and then we use the class labels of the K samples in the training set to assign a label to the new sample.

A few things should be noted here:

  • The concept of training doesn't really exist here. Unlike other algorithms, where the training time is dependent on the amount of training data, the computational cost is mostly spent in the nearest neighbors algorithm at prediction time.
  • Most of the recent research done on the nearest neighbors algorithm is focused on finding the optimum ways to quickly search through the training data during prediction time.
  • What does nearest mean? In this chapter, we will learn about the different distance measures used to compare different data points to each other. Two data points are deemed near to each other depending on the distance metric used.
  • What is K? We can compare a new data point to 1, 2, 3, or 50 other samples in the training set. The number of samples we decide to compare to is K, and we are going to see how different values of K affect the behavior of the algorithms.

Before using the nearest neighbors algorithm for image classification, we need to firstlearn how to deal with images. In the next section, we will load and display one of the most commonly used image datasets in the field of machine learning and image processing.

When finding the nearest neighbors of a sample, you can compare it to all the other training samples. This is a naive brute-force approach that doesn't scale well with the size of the training data. A more efficient approach for larger datasets requires the training samples to be stored in a specific data structure that is optimized for search. K-D tree and ball tree are two available data structures. These two data structures are parameterized by leaf_size. As its value approaches the size of the training set, the K-D tree and ball tree turn into a brute-force search. Conversely, setting the leaf size to 1 introduces lots of overhead when traversing the trees. A default leaf size of 30 is good middle ground for many sample sizes.

Loading and displaying images

"Photographs are two-dimensional. I work in four dimensions."
– Tino Sehgal

When asked about the number of dimensions that an image has, photographers, painters, illustrators, and almost everyone else on this planet will agree that images are two-dimensional objects. Only machine learning practitioners see images differently. For us, every pixel in a black and white image is a separate dimension. Dimensions expand even more with colored images, but that's something for later. We see each pixel as a separate dimension so that we can deal with each pixel and its value as a unique feature that defines the image, along with the other pixels (features). So, unlikeTino Sehgal, we can sometimes end up working with 4,000 dimensions.

The ModifiedNational Instituteof Standardsand Technology(MNIST) dataset is a collection of handwritten digits that is commonly used inimage processing. Due to its popularity, it is included in scikit-learn, and we can load it as we usually do with other datasets:

from sklearn.datasets import load_digits
digits = load_digits()

The dataset has digits from 0 to 9. We can access their targets (labels) as follows:

# Output: array([0, 1, 2, ..., 8, 9, 8])

Similarly, we can load the pixel values, as follows:

# Output:
# array([[ 0., 0., 5., ..., 0., 0., 0.],
# [ 0., 0., 0., ..., 10., 0., 0.],
# ...,
# [ 0., 0., 2., ..., 12., 0., 0.],
# [ 0., 0., 10., ..., 12., 1., 0.]])

Each line is a picture and each integer is a pixel value. In this dataset, the pixels take values between 0 and 16. The shape of the dataset (digits['data'].shape) is 1,797 x 64. In other words, we have 1,797 square-shaped pictures, and each of them has 64 pixels (width = height = 8).

Knowing this information, we can create the following function to display an image. It takes an array of 64 values and reshapes it into a two-dimensional array with 8 rows and 8 columns. It also uses the corresponding target of the image to show on top of the digit. The matplotlib axis (ax) is given so that we can display the image on it:

def display_img(img, target, ax):
img = img.reshape((8, 8))
ax.imshow(img, cmap='gray')
ax.set_title(f'Digit: {str(target)}')

We can now use the function we just created to display the first eight digits of our dataset:

fig, axs = plt.subplots(1, 8, figsize=(15, 10))

for i in range(8):
display_img(digits['data'][i], digits['target'][i], axs[i])

The digits look as follows:

Being able to display the digits is a good first step. Next, we need to convert them into our usual training and test formats. This time, we want to keep each image as one row, so there is no need to reshape it into 8 x 8 matrices:

from sklearn.model_selection import train_test_split
x, y = digits['data'], digits['target']
x_train, x_test, y_train, y_test = train_test_split(x, y)

At this point, the data is ready to be used with an image classification algorithm. By learning to predict the targets when given a bunch of pixels, we are already one step closer to making our computer understand the handwritten text.

Image classification

Now that we have our data ready, we can predict the digits using the nearest neighbors classifier, as follows:

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=11, metric='manhattan'), y_train)
y_test_pred = clf.predict(x_test)

For this example, I set n_neighbors to 11 and metric to manhattan, meaning at prediction time, we compare each new sample to the 11 nearest training samples, using the Manhattan distance to evaluate how near they are. More on these parameters in a bit. This model made predictions with an accuracy of 96.4% on the test set. This might sound reasonable, but I'm sorry to break it to you; this isn't a fantastic score for this particular dataset. Anyway, let's keep on dissecting the model's performance further.

Using a confusion matrix to understand the model's mistakes

When dealing with a dataset with 10 class labels, a single accuracy score can only tell us so much. To better understand what digits were harder to guess than others, we can print the model's confusion matrix. This is a square matrix where the actual labels are shown as rows and the predicted labels are shown as columns. Then, the numbers in each cell show the testing instances that fell into it. Let me create it now, and it will become clearer in a moment. The plot_confusion_matrix function needs the classifier's instance, along with the test's x and y values, to display the matrix:

from sklearn.metrics import plot_confusion_matrix
plot_confusion_matrix(clf, x_test, y_test, cmap='Greys')

Once called, the function runs the model internally on the test data and displays the following matrix:

Ideally, all the cells should have zeros, except for the diagonal cells. Falling into a diagonal cell means that a sample is correctly labeled. However, there are only a few non-zero cells here. The four samples at the intersection of row 8 and column 1 signify that our model has classified four samples as 1, while their actual label was 8. Most likely, those were too-skinny eights that looked like ones to the algorithm. The same conclusions can be made for the remaining non-diagonal non-zero cells.

Picking a suitable metric

The images we are using are just lists of numbers (vectors). The distance metric decides whether one image is close to another. This also applies to non-image data, where distance metrics are used to decide whether one sample is close to another. Two commonly used metrics are the Manhattan andEuclidean distances:

Name Manhattan (L1 norm) Euclidean (L2 norm)

Most likely, the equation for the Manhattan distance will remind you of the mean absolute error and L1 regularization, while the Euclidean distance resembles the mean squared error and L2 regularization. This resemblance is a nice reminder of how many concepts stem from common ideas:

For the Manhattan distance, the distance between A and C is calculated by going from A to D, and then from D to C. It gets its name from Manhattan Island in New York, where its landscape is divided into blocks. For the Euclidean distance, the distance between A and C is calculated via the diagonal line between the two points. There is a generalized form for the two metrics, called the Minkowski distance, and here is its formula:

Setting p to 1 gives us the Manhattan distance, and we can get the Euclidean distance by setting it to 2. I am sure you can tell now where 1 and 2 in the L1 and L2 norms come from. To be able to compare the different values of p, we can run the following code. Here, we calculate the Minkowski distance for the two points—(1, 2) and (4, 6)—for different values of p:

from sklearn.neighbors import DistanceMetric

points = pd.DataFrame(
[[1, 2], [4, 6]], columns=['x1', 'x2']

d = [
(p, DistanceMetric.get_metric('minkowski', p=p).pairwise(points)[0][-1])
for p in [1, 2, 10, 50, 100]

Plotting the results shows us how the Minkowski distance changes with p:

Clearly, the Minkowski distance decreases with an increase in p. For p = 1, the distance is 7, (4 - 1) + (6 - 2), and for p = 2, the distance is 5, the square root of (9 + 16). For higher values of p, the distance calculated approaches 4, which is (6 - 2) only. In other words, as p approaches infinity, the distance is just the maximum of all the spans between the points on all the axes, which is known as the Chebyshev distance.

The term metric is used to describe a distance measure that follows the following criteria:

It cannot be negative: , and it is symmetric: .
The distance from one point to itself is 0. It follows the following triangle inequality criterion: .

Another common metric is the cosine distance, and its formula is as follows:

Unlike the Euclidean distance, the cosine distance is scale-insensitive. I think it would be better to show the difference between the two metrics with the following example.

Here, we take one digit and multiply each pixel value by 2:

Now, let's calculate the distances between the original image and the intensified one:

from sklearn.metrics.pairwise import (

d0 = manhattan_distances(
[1.0 * digits['data'][0], 2.0 * digits['data'][0]]

d1 = euclidean_distances(
[1.0 * digits['data'][0], 2.0 * digits['data'][0]]

d2 = cosine_distances(
[1.0 * digits['data'][0], 2.0 * digits['data'][0]]

Running the preceding code gives us the values for each distance—Manhattan = 294, Euclidean = 55.41, and cosine = 0. As expected, the cosine distance does not care about the constant we used to multiply the pixels with, and it considers the two versions of the same images as one. The other two metrics, on the other hand, saw the two versions as further apart.

Setting the correct K

Equally important to metric selection is knowing how many neighbors to listen to when making a decision. You don't want to ask too few neighbors as maybe they don't know enough. You also don't want to ask everyone as the very distant neighbors probably don't know much about the sample at hand. To put it formally, a decision made based on too few neighbors introduces variance since any slight changes in the data will result in different neighborhoods and different results. Conversely, a decision made based on too many neighbors is a biased decision as it is less sensitive to the differences between the neighborhoods. Keep this in mind. Here, I used the model with different settings for K and plotted the resulting accuracy:

The concept of bias-variance trade-off will follow us throughout this book. When it comes to picking sides, we usually opt to use a biased model when we have smaller training sets. A high-variance model will overfit if there isn't enough data to learn from. The most biased model is one where K is set to the number of training samples. Then, all the new data points will get the same prediction and will be assigned to the same label as the majority class. Conversely, when we have a good amount of data, the few closest neighbors within a smaller radius are a better choice to consult, as it's more likely that they will belong to the same class as our new sample.

Now, we have two hyperparameters to set: the number of neighbors and the distance metrics. In the next section, we are going to use a grid search to find the optimum values for these parameters.

Hyperparameter tuning using GridSearchCV

GridSearchCV is a method for looping over all the possible hyperparameter combinations and employing cross-validation to pick the optimum hyperparameters. For each hyperparameter combination, we do not want to limit ourselves to just one accuracy score. So, to get a better understanding of the estimator's accuracy of each combination, we make use of K-fold cross-validation. Then, the data is split into a number of folds, and for each iteration, all folds but one are used for training, and the remaining one is used for testing. This method for hyperparameter tuning performs an exhaustive search over all possible parameter combinations, hence the Grid prefix. In the following code, we give GridSearchCV a Python dictionary with all the parameter values we want to loop over, as well as the estimator we want to tune. We also specify the number of folds to split the data into, and then we call the grid search's fit method with the training data. Remember, it is bad practice to learn anything from the test dataset, which should be kept aside for now. Here is the code to do this:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

parameters = {
'metric':('manhattan','euclidean', 'cosine'),
'n_neighbors': range(1, 21)

knn = KNeighborsClassifier()
gscv = GridSearchCV(knn, param_grid=parameters, scoring='accuracy'), y_train)

Once done, we can show the best parameters found via gscv.best_params_. We can also show the accuracy achieved when using the chosen parameter via gscv.best_score_. Here, the euclidean distance was chosen as metric and n_neighbors was set to 3. I also got an accuracy score of 98.7% when using the chosen hyperparameters.

We can now use the resulting classifier to make predictions for the test set:

from sklearn.metrics import accuracy_score

y_test_pred = gscv.predict(x_test)
accuracy_score(y_test, y_test_pred)

This gave me an accuracy of 98.0% on the test set. Luckily, the grid search helped us improve the accuracy of our estimator by picking the optimum hyperparameters.

GridSearchCV can become computationally expensive if we have too many hyperparameters to search through and too many values for each one. When facing a problem like this,RandomizedSearchCV may be an alternative solution since it randomly picks hyperparameter values while searching. Both hyperparameter tuning algorithms use the accuracy score by default for classifiers and R2for regressors. We can override this and specify different metrics to pick the best configuration.

Using custom distances

The digits here are written in white pixels over a black background. I don't think anyone would have a problem with identifying a digit if it was written in black pixels over a white background instead. As for a computer algorithm, things are a little different. Let's train our classifier as usual and see whether it will have any issues if the colors are inverted. We will start by training the algorithm on the original images:

clf = KNeighborsClassifier(n_neighbors=3, metric='euclidean'), y_train)
y_train_pred = clf.predict(x_train)

We then create an inverted version of the data we have just used for training:

x_train_inv = x_train.max() - x_train 

The nearest neighbors implementation has a method called kneighbors. When given a sample, it returns a list of the K-nearest samples to it from the training set, as well as their distances from the given sample. We are going to give this method one of the inverted samples and see which samples it will consider as its neighbors:

img_inv = x_train_inv[0]

fig, axs = plt.subplots(1, 8, figsize=(14, 5))

display_img(img_inv, y_train[0], axs[0])

_, kneighbors_index_inv = clf.kneighbors(

for i, neighbor_index in enumerate(kneighbors_index_inv[0], 1):

Just to make things clearer, I ran the code twice—once with the original sample and its seven neighbors, and once with the inverted sample and its neighbors. The output of the two runs is displayed here. As you can see, unlike us humans, the algorithm got totally confused by the adversarial example where the colors were inverted:

If you think about it, according to the distance we use, a sample and its inverted version cannot be too much further from each other. Although we visually see the two as one, the model sees them as different as day and night. Having that said, it is clear that we need to come up with a different way to evaluate distances. Since pixels take values between 0 and 16, in an inverted sample, all of the 16s are turned into 0s, the 15s are turned into 1s, and so on. Therefore, a distance that compares samples in relation to how far their pixels are from the midpoint between 0 and 16 (8) can help us solve our problem here. Here is how to create this custom distance. Let's call our new distance contrast_distance:

from sklearn.metrics.pairwise import euclidean_distances

def contrast_distance(x1, x2):
_x1, _x2 = np.abs(8 - x1), np.abs(8 - x2)
d = euclidean_distances([_x1], [_x2])
return d[0][0]

Once defined, we can use the custom metric in our classifier, as follows:

clf = KNeighborsClassifier(n_neighbors=3, metric=contrast_distance), y_train)

After this tweak, the inversion doesn't bother the model anymore. For the original and the inverted sets, we get the exact same accuracy of 89.3%. We can also print the seven nearest neighbors according to the new metric to validate the fact that the new model is alreadysmarter and no longer discriminates against the black digits:

One thing to keep in mind when writing your own custom distances is that they are not as optimized as the built-in ones, and running the algorithm will be more computationally expensive at prediction time.

Using nearest neighbors for regression

At the end of the day, the targets we predict in the MNIST dataset are just numbers between 0 and 9. So, we can alternatively use a regressor algorithm for the same problem. In this case, our predictions will not be integers anymore, but rather floats. Training the regressor isn't much different from training the classifier:

from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor(n_neighbors=3, metric='euclidean'), y_train)
y_test_pred = clf.predict(x_test)

Here are some of the incorrectly made predictions:

The first item's three nearest neighbors are 3, 3, and 5. So, the regressor used their mean (3.67) as the prediction. The second and third items' neighbors are 8, 9, 8 and 7, 9, 7, respectively. Remember to round these predictions and convert them into integers if you want to use a classifier's evaluation metric to evaluate this model.

More neighborhood algorithms

There are other variations of K-nearest neighbors that I'd like to quickly go through before moving on to the next section. These algorithms are less commonly used, although they have their merits as well as certain disadvantages.

Radius neighbors

Contrary to the K-nearest neighbors algorithm, where a certain number of neighbors are allowed to vote, in radius neighbors, all the neighbors within a certain radius participate in the voting process. By setting a predefined radius, the decisions in sparser neighborhoods are based on fewer neighbors than the ones made in denser neighborhoods. This can be useful when dealing with imbalanced classes. Furthermore, by using the haversine formula as our metric, we can use this algorithm to recommend nearby venues or gas stations on a map to the users. Both radius neighbors and K-nearest neighbors can give closer data points more voting power than distant ones by specifying the algorithm's weights parameter.

Nearest centroid classifier

As we have seen, the K-nearest neighbors algorithm compares the test samples to all of the samples in the training set. This exhaustive search causes the model to become slower at prediction time. To deal with this, the nearest centroid classifier summarizes all the training samples from each class into a pseudo-sample that represents this class. This pseudo-sample is called the centroid as it is typically created by calculating the mean value for each of the features in a class. At prediction time, a test sample is compared to all the centroids and is classified based on the class whose centroid is closest to it.

In the next section, we are going to use the centroid algorithm for training and prediction, but for now, we are going to use it to generate new digits just for fun. The algorithm is trained as follows:

from sklearn.neighbors import NearestCentroid
clf = NearestCentroid(metric='euclidean'), y_train)

The learned centroids are stored in centroids_. The following code displays these centroids, along with the class labels:

fig, axs = plt.subplots(1, len(clf.classes_), figsize=(15, 5))

for i, (centroid, label) in enumerate(zip(clf.centroids_, clf.classes_)):
display_img(centroid, label, axs[i])

The generated digits are shown here:

These digits do not exist in our dataset. They are just combinations of all the samples in each class.

The nearest centroid classifier is fairly simple, and I am sure you can implement it from scratch using a few lines of code. Its accuracy is not as good as nearest neighbors for the MNIST dataset, though. The centroid algorithm is more commonly used in natural language processing, where it's better known as Rocchio (pronounced like "we will rock you").

Finally, the centroid algorithm also has a hyperparameter called shrink_threshold. When set, this can help to remove the irrelevant features.

Reducing the dimensions of our image data

Earlier, we realized that the dimensionality of an image is equal to the number of pixels in it. So, there is no way to visualize our 43-dimensional MNIST dataset. It is true that we can display each digit separately, yet we cannot see where each image falls in our feature space. This is important to understand the classifier's decision boundaries. Furthermore, an estimator's memory requirements grow in proportion to the number of features in the training data. As a result, we need a way to reduce the number of features in our data to deal with the aforementioned issues.

In this section, we are going to discover two dimensionality-reduction algorithms: Principal Component Analysis (PCA) and Neighborhood Component Analysis (NCA). After explaining them, we will use them to visualize the MNIST dataset and generate additional samples to add to our training set. Finally, we will also use feature selection algorithms to remove non-informative pixels from our images.

Principal component analysis

"A good photograph is knowing where to stand."
–Ansel Adams

Imagine having the following set of data with two features—x1 and x2:

You can generate a previous data frame by using the following code snippet, keeping in mind that the numbers may vary on your machine, given their randomness:

df = pd.DataFrame(
'x1': np.random.normal(loc=10.0, scale=5.0, size=8),
'noise': np.random.normal(loc=0.0, scale=1.0, size=8),

df['x2'] = 3 * df['x1'] + df['noise']

When we plot the data, we realize that x1 and x2 take the following form:

If you want, you can tilt your head to the left. Now, imagine we did not have the x1 and x2 axes, but instead had one diagonal axis that goes through the data. Wouldn't that axis be enough to represent our data? Then, we would have reduced it from a two-dimensional dataset into a one-dimensional dataset. That's exactly what PCA tries to achieve.

This new axis has one main characteristic—the distances between the points on it are more than their distances on the x1 or x2 axes. Remember, the hypotenuse of a triangle is always bigger than any of the two other sides. In conclusion, PCA tries to find a set of new axes (principal components) where the data variance is maximized.

Just like in the case of the correlation coefficient equation discussed in Chapter 4, Preparing Your Data, PCA also needs the data to be centered. For each column, we subtract the mean of the column from each value in it. We can use thewith_std =False standard scaler to achieve this. Here is how to calculate the PCA and convert our data into the new dimensions:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler(with_std=False)
x = scaler.fit_transform(df[['x1', 'x2']])

pca = PCA(n_components=1)
x_new = pca.fit_transform(x)

The resulting x_new value is a single column data frame instead of two. We can also access the newly created component via pca.components_. Here, I plotted the new component over the original data:

As you can see, we were able to use the PCA algorithm to reduce the number of features here from two to one. Since the dots don't fall exactly on the line, some information is lost by only using one component. This information is stored in the second component that we did not retrieve. You can transform your data into any number of components from one up to the original number of features. The components are ordered descendingly according to the amount of information they carry. Therefore, ignoring the latter components may help to remove any noisy and less-useful information. Aftertransforming the data, it can also be transformed back (inverse transformation). The resulting data after the two operations only matches the original data if all the components are retained; otherwise, we can limit ourselves to the first few (principal) components to denoise the data.

In the PCA assumption, the directions in the feature space with the highest variance are expected to carry more information than the directions with lower variance. This assumption may hold in some cases, but it is not guaranteed to always be true. Remember that in PCA, the target is not used, only the features. This makes it more suitable for unlabeled data.

Neighborhood component analysis

In the nearest neighbors algorithms, the choice of the distance measure is paramount, yet it is only set empirically. We used K-fold cross-validation earlier in this chapter to decide which distance metric is better for our problem. This can be time-consuming, which triggers many researchers to look for better solutions. The main aim of NCA is to learn the distance metric from the data using gradient descend. The distances it tries to learn are usually represented by a square matrix. For N samples, we have sample pairs to compare, hence the square matrix. Nevertheless, this matrix can be restricted to become a rectangular one, , where the small n is a lower number than N and represents the reduced components. These reduced components are the building blocks of NCA.

The nearest neighbors algorithms belong to a class of learners called instance-based learners. We use instances of the training set to make decisions. So, the matrix that carries the distances between the instances is an essential part of it. This matrix inspired many researchers to do research on it. For example, learning the distances from data is what NCA and large-margin nearest neighbors do; other researchers converted this matrix into a higher dimensional space—for example, with the kernel trick—and others tried to embed feature selection into the instance-based learners via regularization.

In the next section, we will visually compare the two dimensionality-reduction methods by using them to plot the MNIST dataset onto a two-dimensional graph.

Comparing PCA to NCA

We will reduce the dimensionality of the data by projecting it into a smaller space. We will use PCA and NCA in addition to random projection. We will start by importing the required models and putting the three algorithms into a Python dictionary to loop over them later on:

from sklearn.preprocessing import StandardScaler
from sklearn.random_projection import SparseRandomProjection
from sklearn.decomposition import PCA
from sklearn.neighbors import NeighborhoodComponentsAnalysis

methods = {
'Rand': SparseRandomProjection(n_components=2),
'PCA': PCA(n_components=2),
'NCA': NeighborhoodComponentsAnalysis(n_components=2, init='random'),

Then, we will create three plots side by side for the three algorithms, as follows:

fig, axs = plt.subplots(1, 3, figsize=(15, 5))

for i, (method_name, method_obj) in enumerate(methods.items()):

scaler = StandardScaler(with_std=False)
x_train_scaled = scaler.fit_transform(x_train), y_train)
x_train_2d = method_obj.transform(x_train_scaled)

for target in set(y_train):
y_train == target
], columns=['y', 'x']
kind='scatter', x='x', y='y',
marker=f'${target}$', s=64, ax=axs[i]
axs[i].set_title(f'{method_name} MNIST')

It is important to center your data before applying PCA. We used StandardScaler to do this. Other algorithms shouldn't mind the centering, anyway. Running the code gives us the following graphs:

PCA and NCA do a better job than random projection in clustering the same digits together. In addition to the visual analysis, we can run the nearest neighbors algorithm on the reduced data to judge which transformation represents the data better. We can use similar code to the preceding one and replace the part inside the for loop with the following two chunks of code:

  1. First, we need to scale and transform our data:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

scaler = StandardScaler(with_std=False)

x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.fit_transform(x_test), y_train)
x_train_2d = method_obj.transform(x_train_scaled)
x_test_2d = method_obj.transform(x_test_scaled)

scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train_2d)
x_test_scaled = scaler.transform(x_test_2d)
  1. Then, we use cross-validation to set the optimum hyperparameters:
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

parameters = {'metric':('manhattan','euclidean'), 'n_neighbors': range(3, 9)}

knn = KNeighborsClassifier()
clf = GridSearchCV(knn, param_grid=parameters, scoring='accuracy', cv=5), y_train)
y_test_pred = clf.predict(x_test_scaled)

'MNIST test accuracy score: {:.1%} [k={}, metric={} - {}]'.format(
accuracy_score(y_test, y_test_pred),

Since we do not need to visualize the data this time, we can set the number of components to 6. This gives us the following accuracy scores. Keep in mind that your results may vary due to the random split of the data and the estimator's initial values:

Projection Accuracy
Sparse random projection 73%
PCA 93%
NCA 95%
In PCA, the class labels are not needed. I just passed them in the preceding code for consistency, but they were simply ignored by the algorithm. In comparison, in NCA, the class labels are used by the algorithm.

Picking the most informative components

After fitting PCA, explained_variance_ratio_ contains the percentage of variance explained by each of the selected components. According to the principal components hypothesis, higher ratios should reflect more information. We can put this information into a data frame, as follows:

df_explained_variance_ratio = pd.DataFrame(
(component, explained_variance_ratio)
for component, explained_variance_ratio in enumerate(pca.explained_variance_ratio_[:32], 1)
], columns=['component', 'explained_variance_ratio']

Then, plot it to get the following graph. I am sure plotting data via bar charts is becoming second nature to you by now:

From the graph, we can tell that starting from the eighth component, the remaining components carry less than 5% of the information.

We can also loop through different values for n_components, and then train a model on the reduced data and see how the accuracy changes with the number of components used. I'd trust this approach more than relying on the explained variance since it is independent of the principal components assumption and evaluates the feature reduction algorithm and the classifier as a single black box. This time, I am going to use a different algorithm: nearest centroid.

Using the centroid classifier with PCA

In the following code, we will try the centroid algorithm with a different number of principal components each time. Please don't forget to scale and transform your features with each iteration, and remember to store the resulting x values in x_train_embed and x_test_embed. I used StandardScaler here, as well as the PCA's transform method, to transform the scaled data:

from sklearn.neighbors import NearestCentroid

scores = []
for n_components in range(1, 33, 1):

# Scale and transform the features as before
clf = NearestCentroid(shrink_threshold=0.01), y_train)
y_test_pred = clf.predict(x_test_embed)

scores.append([n_components, accuracy_score(y_test, y_test_pred)])

Plotting the scores gives us the following graph:

When we use the centroid algorithm with this dataset, we can roughly see that anything above 15 components doesn't add much value. With the help of cross-validation, we can pick the exact number of components that gives the best results.

Restoring the original image from its components

Once an image is reduced to its principal components, it can also be restored back, as follows.

  1. First, you have to scale your data before using PCA:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_std=False)
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

Once scaled, you can transform your data using 32 principal components, as follows.

  1. Then, you can restore the original data after transformation by using the inverse_transform method:
from sklearn.decomposition import PCA
embedder = PCA(n_components=32), y_train)

x_train_embed = embedder.transform(x_train_scaled)
x_test_embed = embedder.transform(x_test_scaled)

x_train_restored = embedder.inverse_transform(x_train_embed)
x_test_restored = embedder.inverse_transform(x_test_embed)
  1. To keep the original images and the restored ones on the same scale, we can use MinMaxScaler, as follows:
iscaler = MinMaxScaler((x_train.min(), x_train.max()))
x_train_restored = iscaler.fit_transform(x_train_restored)
x_test_restored = iscaler.fit_transform(x_test_restored)

Here, you can see a comparison between some digits and themselves, with the less important components removed. These restored versions of the original data can be useful to the classifier, either by using them in place of the training and test sets, or by adding them as additional samples to the training set:

  1. Finally, I used x_train_embed and x_test_embed in place of the original features in our nearest neighbors classifier. I tried a different number of PCA components each time. The darker bars in the following graph show the number of PCA components that resulted in the highest accuracy scores:

Not only did PCA help us reduce the number of features and the prediction time consequently, but it also helped us achieve a score of 98.9%.

Finding the most informative pixels

Since almost all of the digits are centered in the images, we can intuitively deduce that the pixels on the right and left edges of the images do not carry valuable information. To validate our intuition, we will let the feature selection algorithms from Chapter 4, Preparing Your Data, decide for us which pixels are most important. Here, we can use the mutual information algorithm to return a list of pixels and their corresponding importance:

from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(x_train, y_train)

We then use the preceding information to remove 75% of the pixels:

percent_to_remove = 75
mi_threshold = np.quantile(mi, 0.01 * percent_to_remove)
informative_pixels = (mi >= mi_threshold).reshape((8, 8))

plt.imshow(informative_pixels, cmap='Greys')
plt.title(f'Pixels kept when {percent_to_remove}% removed')

In the following diagram, the pixels marked in black are the most informative ones, and the rest are the 75% of the pixels that are deemed less important by the mutual information algorithm:

As expected, the pixels on the edges are less informative. Now that we have identified the less informative pixels, we can reduce the number of features in our data by removing the less informative pixels, as follows:

from sklearn.feature_selection import SelectPercentile
percent_to_keep = 100 - percent_to_remove
selector = SelectPercentile(mutual_info_classif, percentile=percent_to_keep)

x_train_mi = selector.fit_transform(x_train, y_train)
x_test_mi = selector.transform(x_test)

Training a classifier on the reduced features gives us an accuracy score of 94%. Knowing that the complexity of the nearest neighbors algorithm and its prediction time grows with the number of features, we can understand the value of a slightly less accurate algorithm that only uses 25% of the data.


Images are in abundance in our day-to-day life. Robots need computer vision to understand their surroundings. The majority of the posts on social media include pictures. Handwritten documents require image processing to make them consumable by machines. These and many more uses cases are the reason why image processing is an essential competency for machine learning practitioners to master. In this chapter, we learned how to load images and make sense of their pixels. We also learned how to classify images and reduce their dimensions for better visualization and further manipulation.

We used the nearest neighbor algorithm for image classification and regression. This algorithm allowed us to plug our own metrics when needed. We also learned about other algorithms, such as radius neighbors and nearest centroid. The concepts behind these algorithms and their differences are omnipresent in the field of machine learning. Later on, we will see how the clustering and anomaly detection algorithms borrow ideas from the concepts discussed here. In addition to the main algorithms discussed here, concepts such as distance metrics and dimensionality reduction are also ubiquitous.

Due to the importance of image processing, we will not stop here, as we are going to build on the knowledge acquired here in Chapter 7, Neural Networks – Here Comes the Deep Learning, where we will use artificial neural networks for image classification.