
Chapter 3. Going Beyond the Basics: Detecting Features in Images

In Chapter 2 you learned how to get started with computer vision by creating a simple neural network that matched the input pixels of the Fashion MNIST dataset to 10 labels, each representing a type (or class) of clothing. And while you created a network that was pretty good at detecting clothing types, there was a clear drawback. Your neural network was trained on small monochrome images that each contained only a single item of clothing, and that item was centered within the image.

To take the model to the next level, you need to be able to detect features in images. So, for example, instead of looking merely at the raw pixels in the image, what if there were a way to filter the images down to their constituent elements? Matching those elements, instead of raw pixels, would help us to detect the contents of images more effectively. Consider the Fashion MNIST dataset that we used in the last chapter—when detecting a shoe, the neural network may have been activated by lots of dark pixels clustered at the bottom of the image, which it would see as the sole of the shoe. But when the shoe is no longer centered and filling the frame, this logic doesn’t hold.

One method to detect features comes from photography and the image processing methodologies that you might be familiar with. If you’ve ever used a tool like Photoshop or GIMP to sharpen an image, you’ve used a mathematical filter that works on the pixels of the image. Another name for these filters is convolutions, and by using them in a neural network you will create a convolutional neural network (CNN).

In this chapter you’ll learn about how to use convolutions to detect features in an image. You’ll then dig deeper into classifying images based on the features within. We’ll explore augmentation of images to get more features and transfer learning to take preexisting features that were learned by others, and then look briefly into optimizing your models using dropouts.

Convolutions

A convolution is simply a filter of weights that are used to multiply a pixel with its neighbors to get a new value for the pixel. For example, consider the ankle boot image from Fashion MNIST and the pixel values for it as shown in Figure 3-1.

Figure 3-1. Ankle boot with convolution

If we look at the pixel in the middle of the selection we can see that it has the value 192 (recall that Fashion MNIST uses monochrome images with pixel values from 0 to 255). The pixel above and to the left has the value 0, the one immediately above has the value 64, etc.

If we then define a filter in the same 3 × 3 grid, as shown below the original values, we can transform that pixel by calculating a new value for it. We do this by multiplying the current value of each pixel in the grid by the value in the same position in the filter grid, and summing up the total amount. This total will be the new value for the current pixel. We then repeat this for all pixels in the image.

So in this case, while the current value of the pixel in the center of the selection is 192, the new value after applying the filter will be:

new_val = (-1 * 0) + (0 * 64) + (-2 * 128) + 
     (.5 * 48) + (4.5 * 192) + (-1.5 * 144) + 
     (1.5 * 142) + (2 * 226) + (-3 * 168)

This equals 577, which will be the new value for that pixel. Repeating this process across every pixel in the image will give us a filtered image.
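To make the mechanics concrete, here’s a minimal NumPy sketch of that multiply-and-sum operation (the apply_filter function is illustrative, not from the book’s companion code):

import numpy as np

def apply_filter(image, kernel):
    # Convolve a 2D grayscale image with a 3 x 3 kernel.
    # Border pixels are skipped, so the output shrinks by 2 on each axis.
    height, width = image.shape
    output = np.zeros((height - 2, width - 2))
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Multiply the 3 x 3 neighborhood by the kernel and sum the result
            neighborhood = image[y - 1:y + 2, x - 1:x + 2]
            output[y - 1, x - 1] = np.sum(neighborhood * kernel)
    return output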

Let’s consider the impact of applying a filter on a more complicated image: the ascent image that’s built into SciPy for easy testing. This is a 512 × 512 grayscale image that shows two people climbing a staircase.

Using a filter with negative values on the left, positive values on the right, and zeros in the middle will end up removing most of the information from the image except for vertical lines, as you can see in Figure 3-2.

Figure 3-2. Using a filter to get vertical lines

Similarly, a small change to the filter can emphasize the horizontal lines, as shown in Figure 3-3.

Figure 3-3. Using a filter to get horizontal lines
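If you’d like to reproduce these figures yourself, a sketch along these lines should work, assuming the apply_filter function from earlier (the ascent image lives in scipy.misc in older SciPy versions; in SciPy 1.10 and later, use scipy.datasets.ascent() instead):

import numpy as np
import matplotlib.pyplot as plt
from scipy import misc  # scipy.datasets in SciPy 1.10+

ascent = misc.ascent()  # the 512 x 512 grayscale test image

# Negative values on the left, positive on the right: emphasizes vertical lines
vertical_filter = np.array([[-1, 0, 1],
                            [-2, 0, 2],
                            [-1, 0, 1]])

filtered = apply_filter(ascent, vertical_filter)
plt.imshow(filtered, cmap='gray')
plt.show()

Transposing the filter (negative values on top, positive on the bottom) emphasizes horizontal lines instead.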

These examples also show that the amount of information in the image is reduced, so we can potentially learn a set of filters that reduce the image to features, and those features can be matched to labels as before. Previously, we learned parameters that were used in neurons to match inputs to outputs. Similarly, the best filters to match inputs to outputs can be learned over time.

When combined with pooling, we can reduce the amount of information in the image while maintaining the features. We’ll explore that next.

Pooling

Pooling is the process of eliminating pixels in your image while maintaining the semantics of the content within the image. It’s best explained visually. Figure 3-4 shows the concept of max pooling.

Figure 3-4. Demonstrating max pooling

In this case, consider the box on the left to be the pixels in a monochrome image. We then group them into 2 × 2 arrays, so in this case the 16 pixels are grouped into four 2 × 2 arrays. These are called pools.

We then select the maximum value in each of the groups, and reassemble those into a new image. Thus, the pixels on the left are reduced by 75% (from 16 to 4), with the maximum value from each pool making up the new image.
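Here’s a minimal NumPy sketch of 2 × 2 max pooling as just described (again illustrative, not the book’s code):

import numpy as np

def max_pool_2x2(image):
    # Crop to even dimensions, then take the max of each 2 x 2 block
    height, width = image.shape
    image = image[:height // 2 * 2, :width // 2 * 2]
    return image.reshape(height // 2, 2, width // 2, 2).max(axis=(1, 3))

pixels = np.array([[1, 2, 5, 6],
                   [3, 4, 7, 8],
                   [9, 10, 13, 14],
                   [11, 12, 15, 16]])
print(max_pool_2x2(pixels))  # [[ 4  8], [12 16]]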

Figure 3-5 shows the version of ascent from Figure 3-2, with the vertical lines enhanced, after max pooling has been applied.

Figure 3-5. Ascent after vertical filter and max pooling

Note how the filtered features have not just been maintained, but further enhanced. Also, the image size has changed from 512 × 512 to 256 × 256—a quarter of the original size.

Note

There are other approaches to pooling, such as min pooling, which takes the smallest pixel value from the pool, and average pooling, which takes the overall average value.

Implementing Convolutional Neural Networks

In Chapter 2 you created a neural network that recognized fashion images. For convenience, here’s the complete code:

import tensorflow as tf
data = tf.keras.datasets.fashion_mnist

(training_images, training_labels), (test_images, test_labels) = data.load_data()

training_images = training_images / 255.0
test_images = test_images / 255.0

model = tf.keras.models.Sequential([
      tf.keras.layers.Flatten(input_shape=(28, 28)),
      tf.keras.layers.Dense(128, activation=tf.nn.relu),
      tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ])

model.compile(optimizer='adam',
       loss='sparse_categorical_crossentropy',
       metrics=['accuracy'])

model.fit(training_images, training_labels, epochs=5)

To convert this to a convolutional neural network, we simply use convolutional layers in our model definition. We’ll also add pooling layers.

To implement a convolutional layer, you’ll use the tf.keras.layers.Conv2D type. This accepts as parameters the number of convolutions to use in the layer, the size of the convolutions, the activation function, etc.

For example, here’s a convolutional layer used as the input layer to a neural network:

tf.keras.layers.Conv2D(64, (3, 3), activation='relu', 
            input_shape=(28, 28, 1)),

In this case, we want the layer to learn 64 convolutions. It will randomly initialize these, and over time will learn the filter values that work best to match the input values to their labels. The (3, 3) indicates the size of the filter. Earlier I showed 3 × 3 filters, and that’s what we are specifying here. This is the most common filter size; you can change it as you see fit, but you’ll typically see odd dimensions such as 5 × 5 or 7 × 7 because of how filters remove pixels from the borders of the image, as you’ll see later.

The activation and input_shape parameters are the same as before. As we’re using Fashion MNIST in this example, the shape is still 28 × 28. Do note, however, that because Conv2D layers are designed for multicolor images, we’re specifying the third dimension as 1, so our input shape is 28 × 28 × 1. Color images will typically have a 3 as the third parameter as they are stored as values of R, G, and B.

Here’s how to use a pooling layer in the neural network. You’ll typically do this immediately after the convolutional layer:

tf.keras.layers.MaxPooling2D(2, 2),

In the example in Figure 3-4, we split the image into 2 × 2 pools and picked the maximum value in each. That operation was parameterized by the pool size, and those are the parameters you can see here—the (2, 2) indicates that our pools are 2 × 2.

Now let’s explore the full code for Fashion MNIST with a CNN:

import tensorflow as tf
data = tf.keras.datasets.fashion_mnist

(training_images, training_labels), (test_images, test_labels) = data.load_data()

training_images = training_images.reshape(60000, 28, 28, 1)
training_images = training_images / 255.0
test_images = test_images.reshape(10000, 28, 28, 1)
test_images = test_images / 255.0

model = tf.keras.models.Sequential([
      tf.keras.layers.Conv2D(64, (3, 3), activation='relu', 
                  input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(2, 2),
      tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
      tf.keras.layers.MaxPooling2D(2,2),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(128, activation=tf.nn.relu),
      tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ])

model.compile(optimizer='adam',
       loss='sparse_categorical_crossentropy',
       metrics=['accuracy'])

model.fit(training_images, training_labels, epochs=50)

model.evaluate(test_images, test_labels)

classifications = model.predict(test_images)
print(classifications[0])
print(test_labels[0])

There are a few things to note here. Remember earlier when I said that the input shape for the images had to match what a Conv2D layer would expect, and we updated it to be a 28 × 28 × 1 image? The data also had to be reshaped accordingly. 28 × 28 is the number of pixels in the image, and 1 is the number of color channels. You’ll typically find that this is 1 for a grayscale image or 3 for a color image, where there are three channels (red, green, and blue), with the number indicating the intensity of that color.

So, prior to normalizing the images, we also reshape each array to have that extra dimension. The following code changes our training dataset from 60,000 images, each 28 × 28 (and thus a 60,000 × 28 × 28 array), to 60,000 images, each 28 × 28 × 1:

training_images = training_images.reshape(60000, 28, 28, 1)

We then do the same thing with the test dataset.

Also note that in the original deep neural network (DNN) we ran the input through a Flatten layer prior to feeding it into the first Dense layer. Here the input layer is the first convolution, so we just specify the input shape there instead. The data still gets flattened by the Flatten layer that follows the convolutions and pooling, before it reaches the Dense layers.

Training this network on the same data for 50 epochs, we can see a nice increase in accuracy over the network shown in Chapter 2. While that example reached 89% accuracy over 50 epochs, this one will hit 99% training accuracy in around half that many—24 or 25 epochs. So we can see that adding convolutions to the neural network is definitely increasing its ability to classify images. Let’s next take a look at the journey an image takes through the network so we can get a little bit more of an understanding of why this works.

Exploring the Convolutional Network

You can inspect your model using the model.summary command. When you run it on the Fashion MNIST convolutional network we’ve been working on you’ll see something like this:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape       Param #  
=================================================================
conv2d (Conv2D)              (None, 26, 26, 64) 640    
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 64) 0     
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64) 36928   
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)   0     
_________________________________________________________________
flatten (Flatten)            (None, 1600)       0     
_________________________________________________________________
dense (Dense)                (None, 128)        204928  
_________________________________________________________________
dense_1 (Dense)              (None, 10)         1290   
=================================================================
Total params: 243,786
Trainable params: 243,786
Non-trainable params: 0

Let’s first take a look at the Output Shape column to understand what is going on here. Our first layer will take in 28 × 28 images and apply 64 filters to them. But because our filter is 3 × 3, a 1-pixel border around the image will be lost, reducing our overall information to 26 × 26 pixels. Consider Figure 3-6: if we take each of the boxes as a pixel in the image, the first pixel the filter can be centered on is in the second row and second column, because the filter needs a neighbor on every side. The same applies on the right side and at the bottom of the image.

Figure 3-6. Losing pixels when running a filter

Thus, an image that is A × B pixels in shape when run through a 3 × 3 filter will become (A–2) × (B–2) pixels in shape. Similarly, a 5 × 5 filter would make it (A–4) × (B–4), and so on. As we’re using a 28 × 28 image and a 3 × 3 filter, our output will now be 26 × 26.
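In general, then, an F × F filter (with a stride of 1 and no padding) turns an A × B image into an (A–F+1) × (B–F+1) one.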

After that the pooling layer is 2 × 2, so the size of the image will halve on each axis, becoming 13 × 13. The next convolutional layer will reduce this further to 11 × 11, and the next pooling, rounding down, will make the image 5 × 5.

So, by the time the image has gone through two convolutional layers, the result will be a set of 5 × 5 feature maps, 64 of them, as the Output Shape column shows. Next, let’s look at the Param # (parameters) column to see how many values the network has to learn.

Each convolution is a 3 × 3 filter, plus a bias. Remember earlier with our dense layers, each layer was Y = mX + c, where m was our parameter (aka weight) and c was our bias? This is very similar, except that because the filter is 3 × 3 there are 9 parameters to learn. Given that we have 64 convolutions defined, we’ll have 640 overall parameters (each convolution has 9 parameters plus a bias, for a total of 10, and there are 64 of them).

The MaxPooling layers don’t learn anything, they just reduce the image, so there are no learned parameters there—hence 0 being reported.

The next convolutional layer also has 64 filters, but each of them operates across all 64 channels output by the previous layer, with 9 weights per channel. Adding a bias for each of the new 64 filters, the number of parameters is (64 × (64 × 9)) + 64, which gives us 36,928 parameters the network needs to learn.

If this is confusing, try changing the number of convolutions in the first layer—for example, to 10. You’ll see the number of parameters in the second layer becomes 5,824, which is (64 × (10 × 9)) + 64.

By the time we get through the second convolution, our images are 5 × 5, and we have 64 of them. If we multiply this out we now have 1,600 values, which we’ll feed into a dense layer of 128 neurons. Each neuron has a weight and a bias, and we have 128 of them, so the number of parameters the network will learn is ((5 × 5 × 64) × 128) + 128, giving us 204,928 parameters.

Our final dense layer of 10 neurons takes in the output of the previous 128, so the number of parameters learned will be (128 × 10) + 10, which is 1,290.

The total number of parameters is then the sum of all of these: 243,786.
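As a quick sanity check, the whole Param # column can be reproduced with a few lines of arithmetic (a verification sketch, not part of the model code):

# (weights per filter or neuron + bias) x number of filters/neurons
conv1  = (3 * 3 * 1 + 1) * 64           # 640
conv2  = (3 * 3 * 64 + 1) * 64          # 36,928
dense1 = (5 * 5 * 64) * 128 + 128       # 204,928
dense2 = 128 * 10 + 10                  # 1,290
print(conv1 + conv2 + dense1 + dense2)  # 243,786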

Training this network requires us to learn the best set of these 243,786 parameters to match the input images to their labels. It’s a slower process because there are more parameters, but as we can see from the results, it also builds a more accurate model!

Of course, with this dataset we still have the limitation that the images are 28 × 28, monochrome, and centered. Next we’ll take a look at using convolutions to explore a more complex dataset comprising color pictures of horses and humans, and we’ll try to determine if an image contains one or the other. In this case, the subject won’t always be centered in the image like with Fashion MNIST, so we’ll have to rely on convolutions to spot distinguishing features.

Building a CNN to Distinguish Between Horses and Humans

In this section we’ll explore a more complex scenario than the Fashion MNIST classifier. We’ll extend what we’ve learned about convolutions and convolutional neural networks to try to classify the contents of images where the location of a feature isn’t always in the same place. I’ve created the Horses or Humans dataset for this purpose.

The Horses or Humans Dataset

The dataset for this section contains over a thousand 300 × 300-pixel images, approximately half each of horses and humans, rendered in different poses. You can see some examples in Figure 3-7.

Figure 3-7. Horses and humans

As you can see, the subjects have different orientations and poses and the image composition varies. Consider the two horses, for example—their heads are oriented differently, and one is zoomed out showing the complete animal while the other is zoomed in, showing just the head and part of the body. Similarly, the humans are lit differently, have different skin tones, and are posed differently. The man has his hands on his hips, while the woman has hers outstretched. The images also contain backgrounds such as trees and beaches, so a classifier will have to determine which parts of the image are the important features that determine what makes a horse a horse and a human a human, without being affected by the background.

While the previous examples of predicting Y = 2X – 1 or classifying small monochrome images of clothing might have been possible with traditional coding, it’s clear that this is far more difficult, and you are crossing the line into where machine learning is essential to solve a problem.

An interesting side note is that these images are all computer-generated. The theory is that features spotted in a CGI image of a horse should apply to a real image. You’ll see how well this works later in this chapter.

The Keras ImageDataGenerator

The Fashion MNIST dataset that you’ve been using up to this point comes with labels: every image has an associated label in the data itself. Many image-based datasets do not have this, and Horses or Humans is no exception. Instead of labels, the images are sorted into subdirectories of each type. With Keras in TensorFlow, a tool called the ImageDataGenerator can use this structure to automatically assign labels to images.

To use the ImageDataGenerator, you simply ensure that your directory structure has a set of named subdirectories, with each subdirectory being a label. For example, the Horses or Humans dataset is available as a set of ZIP files, one with the training data (1,000+ images) and another with the validation data (256 images). When you download and unpack them into a local directory for training and validation, ensure that they are in a file structure like the one in Figure 3-8.

Figure 3-8. Ensuring that images are in named subdirectories

Here’s the code to get the training data and extract it into the appropriately named subdirectories shown in Figure 3-8:

import urllib.request
import zipfile

url = "https://storage.googleapis.com/laurencemoroney-blog.appspot.com/
                                            horse-or-human.zip"
file_name = "horse-or-human.zip"
training_dir = 'horse-or-human/training/'
urllib.request.urlretrieve(url, file_name)

zip_ref = zipfile.ZipFile(file_name, 'r')
zip_ref.extractall(training_dir)
zip_ref.close()

This simply downloads the ZIP of the training data and unzips it into a directory at horse-or-human/training (we’ll deal with downloading the validation data shortly). This is the parent directory that will contain subdirectories for the image types.

To use the ImageDataGenerator we now simply use the following code:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# All images will be rescaled by 1./255
train_datagen = ImageDataGenerator(rescale=1/255)

train_generator = train_datagen.flow_from_directory(
  training_dir,
  target_size=(300, 300),
  class_mode='binary'
)

We first create an instance of an ImageDataGenerator called train_datagen. We then specify that this will generate images for the training process by flowing them from a directory. The directory is training_dir, as specified earlier. We also indicate some parameters for the data, such as the target size (in this case the images are 300 × 300) and the class mode, which here is binary. The mode is usually binary if there are just two types of images (as in this case) or categorical if there are more than two.

CNN Architecture for Horses or Humans

There are several major differences between this dataset and the Fashion MNIST one that you have to take into account when designing an architecture for classifying the images. First, the images are much larger—300 × 300 pixels—so more layers may be needed. Second, the images are full color, not grayscale, so each image will have three channels instead of one. Third, there are only two image types, so we have a binary classifier that can be implemented using just a single output neuron, where it approaches 0 for one class and 1 for the other. Keep these considerations in mind when exploring this architecture:

model = tf.keras.models.Sequential([
  tf.keras.layers.Conv2D(16, (3,3), activation='relu' , 
              input_shape=(300, 300, 3)),
  tf.keras.layers.MaxPooling2D(2, 2),
  tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2,2),
  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2,2),
  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2,2),
  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2,2),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(512, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

There are a number of things to note here. First of all, look at the very first layer. We’re defining 16 filters, each 3 × 3, but the input shape of the image is (300, 300, 3). Remember that this is because our input image is 300 × 300 and it’s in color, so there are three channels, instead of just one for the monochrome Fashion MNIST dataset we were using earlier.

At the other end, notice that there’s only one neuron in the output layer. This is because we’re using a binary classifier, and we can get a binary classification with just a single neuron if we activate it with a sigmoid function. The purpose of the sigmoid function is to drive one set of values toward 0 and the other toward 1, which is perfect for binary classification.
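For reference, the sigmoid function is σ(x) = 1 / (1 + e^–x): large negative inputs are squashed toward 0 and large positive inputs toward 1, which is why a single output neuron is enough to separate two classes.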

Next, notice how we stack several more convolutional layers. We do this because our image source is quite large, and we want, over time, to have many smaller images, each with features highlighted. If we take a look at the results of model.summary we’ll see this in action:

=================================================================
conv2d (Conv2D)              (None, 298, 298, 16)  448    
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 149, 149, 16)  0     
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 147, 147, 32)  4640   
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 73, 73, 32)    0     
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 71, 71, 64)    18496   
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 35, 35, 64)    0     
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 33, 33, 64)    36928   
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 16, 16, 64)    0     
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 14, 14, 64)    36928   
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 7, 7, 64)      0     
_________________________________________________________________
flatten (Flatten)            (None, 3136)          0     
_________________________________________________________________
dense (Dense)                (None, 512)           1606144  
_________________________________________________________________
dense_1 (Dense)              (None, 1)             513    
=================================================================
Total params: 1,704,097
Trainable params: 1,704,097
Non-trainable params: 0
_________________________________________________________________

Note how, by the time the data has gone through all the convolutional and pooling layers, it ends up as 7 × 7 items. The theory is that these will be activated feature maps that are relatively simple, containing just 49 pixels. These feature maps can then be passed to the dense neural network to match them to the appropriate labels.

This, of course, leads us to have many more parameters than the previous network, so it will be slower to train. With this architecture, we’re going to learn 1.7 million parameters.

To train the network, we’ll have to compile it with a loss function and an optimizer. In this case we can use the binary cross entropy loss function because there are only two classes, and, as the name suggests, it’s a loss function designed for that scenario. And we can try a new optimizer, root mean square propagation (RMSprop), that takes a learning rate (lr) parameter that allows us to tweak the learning. Here’s the code:

from tensorflow.keras.optimizers import RMSprop

model.compile(loss='binary_crossentropy',
       optimizer=RMSprop(lr=0.001),
       metrics=['accuracy'])

We train by using fit_generator and passing it the train_generator we created earlier:

history = model.fit_generator(
  train_generator,
  epochs=15
)

This sample will work in Colab, but if you want to run it on your own machine, please ensure that the Pillow library is installed using pip install pillow.

Note that with TensorFlow Keras, you can use model.fit to fit your training data to your training labels. When using a generator, older versions required you to use model.fit_generator instead. Later versions of TensorFlow will allow you to use either.
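For example, on a recent version of TensorFlow the training step above can be written equivalently as:

history = model.fit(
  train_generator,
  epochs=15
)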

Over just 15 epochs, this architecture gives us a very impressive 95%+ accuracy on the training set. Of course, this is just with the training data, and isn’t an indication of performance on data that the network hasn’t previously seen.

Next we’ll look at adding the validation set using a generator and measuring its performance to give us a good indication of how this model might perform in real life.

Adding Validation to the Horses or Humans Dataset

To add validation, you’ll need a validation dataset that’s separate from the training one. In some cases you’ll get a master dataset that you have to split yourself, but in the case of Horses or Humans, there’s a separate validation set that you can download.

Note

You may be wondering why we’re talking about a validation dataset here, rather than a test dataset, and whether they’re the same thing. For simple models like the ones developed in the previous chapters, it’s often sufficient to split the dataset into two parts, one for training and one for testing. But for more complex models like the one we’re building here, you’ll want to create separate validation and test sets. What’s the difference? Training data is the data that is used to teach the network how the data and labels fit together. Validation data is used to see how the network is doing with previously unseen data while you are training—i.e., it isn’t used for fitting data to labels, but to inspect how well the fitting is going. Test data is used after training to see how the network does with data it has never previously seen. Some datasets come with a three-way split, and in other cases you’ll want to separate the test set into two parts for validation and testing. Here, you’ll download some additional images for testing the model.

You can use very similar code to that used for the training images to download the validation set and unzip it into a different directory:

validation_url = "https://storage.googleapis.com/laurencemoroney-blog.appspot.com
                                                /validation-horse-or-human.zip"

validation_file_name = "validation-horse-or-human.zip"
validation_dir = 'horse-or-human/validation/'
urllib.request.urlretrieve(validation_url, validation_file_name)

zip_ref = zipfile.ZipFile(validation_file_name, 'r')
zip_ref.extractall(validation_dir)
zip_ref.close()

Once you have the validation data, you can set up another ImageDataGenerator to manage these images:

validation_datagen = ImageDataGenerator(rescale=1/255)

validation_generator = validation_datagen.flow_from_directory(
  validation_dir,
  target_size=(300, 300),
  class_mode='binary'
)

To have TensorFlow perform the validation for you, you simply update your model.fit_generator method to indicate that you want to use the validation data to test the model epoch by epoch. You do this by using the validation_data parameter and passing it the validation generator you just constructed:

history = model.fit_generator(
  train_generator,
  epochs=15,
  validation_data=validation_generator
)

After training for 15 epochs, you should see that your model is 99%+ accurate on the training set, but only about 88% on the validation set. This is an indication that the model is overfitting, as we saw in the previous chapter.

Still, the performance isn’t bad considering how few images it was trained on, and how diverse those images were. You’re beginning to hit a wall caused by lack of data, but there are some techniques that you can use to improve your model’s performance. We’ll explore them later in this chapter, but before that let’s take a look at how to use this model.

Testing Horse or Human Images

It’s all very well to be able to build a model, but of course you want to try it out. A major frustration of mine when I was starting my AI journey was that I could find lots of code that showed me how to build models, and charts of how those models were performing, but very rarely was there code to help me kick the tires of the model myself to try it out. I’ll try to avoid that in this book!

Testing the model is perhaps easiest using Colab. I’ve provided a Horses or Humans notebook on GitHub that you can open directly in Colab.

Once you’ve trained the model, you’ll see a section called “Running the Model.” Before running it, find a few pictures of horses or humans online and download them to your computer. Pixabay.com is a really good site to check out for royalty-free images. It’s a good idea to get your test images together first, because the Colab runtime can time out while you’re searching.

Figure 3-9 shows a few pictures of horses and humans that I downloaded from Pixabay to test the model.

Figure 3-9. Test images

When they were uploaded, as you can see in Figure 3-10, the model correctly classified the first image as a human and the third image as a horse, but the middle image, despite being obviously a human, was incorrectly classified as a horse!

Figure 3-10. Executing the model

You can also upload multiple images simultaneously and have the model make predictions for all of them. You may notice that it tends to skew toward horses: if the human isn’t fully posed, i.e., you can’t see their full body, the prediction can tip toward horse. That’s what happened in this case. The first human subject is fully posed, and the image resembles many of the poses in the dataset, so the model was able to classify her correctly. The second subject was facing the camera, but only her upper half is in the image. There was no training data that looked like that, so the model couldn’t correctly identify her.

Let’s now explore the code to see what it’s doing. Perhaps the most important part is this chunk:

img = image.load_img(path, target_size=(300, 300))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)

Here, we are loading the image from the path that Colab wrote it to. Note that we specify the target size to be 300 × 300. The images being uploaded can be any shape, but if we are going to feed them into the model, they must be 300 × 300, because that’s what the model was trained to recognize. So, the first line of code loads the image and resizes it to 300 × 300.

The next line of code converts the image into a 3D array: 300 × 300 pixels across three color channels, matching the input_shape in the model architecture. The model’s predict method, however, expects a batch of images as a 4D array. Fortunately, NumPy provides an expand_dims method that handles this and allows us to easily add the extra batch dimension to the array.

Now that we have our image as a batch of one, we stack it vertically to ensure it has the same shape as the batches the model was trained on:

image_tensor = np.vstack([x])

With our image in the right format, it’s easy to do the classification:

classes = model.predict(image_tensor)

The model returns an array containing the classifications. Because there’s only one classification in this case, it’s effectively an array containing an array. You can see this in Figure 3-10, where for the first (human) model it looks like [[1.]].

So now it’s simply a matter of inspecting the value of the first element in that array. If it’s greater than 0.5, we’re looking at a human:

if classes[0] > 0.5:
  print(fn + " is a human")
else:
  print(fn + " is a horse")

There are a few important points to consider here. First, even though the network was trained on synthetic, computer-generated imagery, it performs quite well at spotting horses or humans in real photographs. This is a potential boon in that you may not need thousands of photographs to train a model, and can do it relatively cheaply with CGI.

But this dataset also demonstrates a fundamental issue you will face. Your training set cannot hope to represent every possible scenario your model might face in the wild, and thus the model will always have some level of overspecialization toward the training set. A clear and simple example of that was shown here, where the human in the center of Figure 3-9 was miscategorized. The training set didn’t include a human in that pose, and thus the model didn’t “learn” that a human could look like that. As a result, there was every chance it might see the figure as a horse, and in this case, it did.

What’s the solution? The obvious one is to add more training data, with humans in that particular pose and others that weren’t initially represented. That isn’t always possible, though. Fortunately, there’s a neat trick in TensorFlow that you can use to virtually extend your dataset—it’s called image augmentation, and we’ll explore that next.

Image Augmentation

In the previous section, you built a horse-or-human classifier model that was trained on a relatively small dataset. As a result, you soon began to hit problems classifying some previously unseen images, such as the miscategorization of a woman with a horse because the training set didn’t include any images of people in that pose.

One way to deal with such problems is with image augmentation. The idea behind this technique is that, as TensorFlow is loading your data, it can create additional new data by amending what it has using a number of transforms. For example, take a look at Figure 3-11. While there is nothing in the dataset that looks like the woman on the right, the image on the left is somewhat similar.

Figure 3-11. Dataset similarities

So if you could, for example, zoom into the image on the left as you are training, as shown in Figure 3-12, you would increase the chances of the model being able to correctly classify the image on the right as a person.

Figure 3-12. Zooming in on the training set data

In a similar way, you can broaden the training set with a variety of other transformations, including:

  • Rotation

  • Shifting horizontally

  • Shifting vertically

  • Shearing

  • Zooming

  • Flipping

Because you’ve been using the ImageDataGenerator to load the images, you’ve seen it do a transform already—when it normalized the images like this:

train_datagen = ImageDataGenerator(rescale=1/255)

The other transforms are easily available within the ImageDataGenerator too, so, for example, you could do something like this:

train_datagen = ImageDataGenerator(
  rescale=1./255,
  rotation_range=40,
  width_shift_range=0.2,
  height_shift_range=0.2,
  shear_range=0.2,
  zoom_range=0.2,
  horizontal_flip=True,
  fill_mode='nearest'
)

Here, as well as rescaling the image to normalize it, you’re also doing the following:

  • Rotating each image randomly up to 40 degrees left or right

  • Translating the image up to 20% vertically or horizontally

  • Shearing the image by up to 20%

  • Zooming the image by up to 20%

  • Randomly flipping the image horizontally

  • Filling in any missing pixels after a move or shear with nearest neighbors

When you retrain with these parameters, one of the first things you’ll notice is that training takes longer because of all the image processing. Also, your model’s accuracy may not be as high as it was previously, because previously it was overfitting to a largely uniform set of data.

In my case, when training with these augmentations my accuracy went down from 99% to 85% after 15 epochs, with validation slightly higher at 89%. (This indicates that the model is underfitting slightly, so the parameters could be tweaked a bit.)

What about the image from Figure 3-9 that it misclassified earlier? This time, it gets it right. Thanks to the image augmentations, the training set now has sufficient coverage for the model to understand that this particular image is a human too (see Figure 3-13). This is just a single data point, and may not be representative of the results for real data, but it’s a small step in the right direction.

Figure 3-13. The zoomed woman is now correctly classified

As you can see, even with a relatively small dataset like Horses or Humans you can start to build a pretty decent classifier. With larger datasets you could take this further. Another technique to improve the model is to use features that were already learned elsewhere. Many researchers with massive resources (millions of images) and huge models that have been trained on thousands of classes have shared their models, and using a concept called transfer learning you can use the features those models learned and apply them to your data. We’ll explore that next!

Transfer Learning

As we’ve already seen in this chapter, the use of convolutions to extract features can be a powerful tool for identifying the contents of an image. The resulting feature maps can then be fed into the dense layers of a neural network to match them to the labels and give us a more accurate way of determining the contents of an image. Using this approach, with a simple, fast-to-train neural network and some image augmentation techniques, we built a model that was 80–90% accurate at distinguishing between a horse and a human when trained on a very small dataset.

But we can improve our model even further using a method called transfer learning. The idea behind transfer learning is simple: instead of learning a set of filters from scratch for our dataset, why not use a set of filters that were learned on a much larger dataset, with many more features than we can “afford” to build from scratch? We can place these in our network and then train a model with our data using the prelearned filters. For example, our Horses or Humans dataset has only two classes. We can use an existing model that was pretrained for one thousand classes, but at some point we’ll have to throw away some of the preexisting network and add the layers that will let us have a classifier for two classes.

Figure 3-14 shows what a CNN architecture for a classification task like ours might look like. We have a series of convolutional layers that lead to a dense layer, which in turn leads to an output layer.

Figure 3-14. A convolutional neural network architecture

We’ve seen that we’re able to build a pretty good classifier using this architecture. But with transfer learning, what if we could take the prelearned layers from another model, freeze or lock them so that they aren’t trainable, and then put them on top of our model, like in Figure 3-15?

Figure 3-15. Taking layers from another architecture via transfer learning

When we consider that, once they’ve been trained, all these layers are just a set of numbers indicating the filter values, weights, and biases along with a known architecture (number of filters per layer, size of filter, etc.), the idea of reusing them is pretty straightforward.

Let’s look at how this would appear in code. There are several pretrained models already available from a variety of sources. We’ll use version 3 of the popular Inception model from Google, which is trained on more than a million images from a database called ImageNet. It has dozens of layers and can classify images into one thousand categories. A saved model is available containing the pretrained weights. To use this, we simply download the weights, create an instance of the Inception V3 architecture, and then load the weights into this architecture like this:

from tensorflow.keras.applications.inception_v3 import InceptionV3

weights_url = "https://storage.googleapis.com/mledu-
datasets/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5"

weights_file = "inception_v3.h5"
urllib.request.urlretrieve(weights_url, weights_file)

pre_trained_model = InceptionV3(input_shape=(150, 150, 3),
                include_top=False,
                weights=None)

pre_trained_model.load_weights(weights_file)

Now we have a full Inception model that’s pretrained. Note that the input shape we specified is 150 × 150 × 3, so the data generators feeding this model need target_size=(150, 150) rather than the 300 × 300 used earlier. If you want to inspect the model’s architecture, you can do so with:

pre_trained_model.summary()

Be warned—it’s huge! Still, take a look through it to see the layers and their names. I like to use the one called mixed7 because its output is nice and small—7 × 7 images—but feel free to experiment with others.
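If the full summary is too much to scroll through, a short loop like this (a convenience sketch, not from the book) prints just the layer names so you can find mixed7:

for layer in pre_trained_model.layers:
  print(layer.name)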

Next, we’ll freeze the entire network from retraining and then set a variable to point at mixed7’s output as where we want to crop the network up to. We can do that with this code:

for layer in pre_trained_model.layers:
  layer.trainable = False

last_layer = pre_trained_model.get_layer('mixed7')
print('last layer output shape: ', last_layer.output_shape)
last_output = last_layer.output

Note that we print the output shape of the last layer, and you’ll see that we’re getting 7 × 7 images at this point. This indicates that by the time the images have been fed through to mixed7, the output images from the filters are 7 × 7 in size, so they’re pretty easy to manage. Again, you don’t have to choose that specific layer; you’re welcome to experiment with others.

Let’s now see how to add our dense layers underneath this:

from tensorflow.keras import layers

# Flatten the output layer to 1 dimension
x = layers.Flatten()(last_output)
# Add a fully connected layer with 1,024 hidden units and ReLU activation
x = layers.Dense(1024, activation='relu')(x)
# Add a final sigmoid layer for classification
x = layers.Dense(1, activation='sigmoid')(x)

It’s as simple as flattening the last output, because we’ll be feeding the results into a dense layer. We then add a dense layer of 1,024 neurons, and a dense layer with 1 neuron for our output.

Now we can define our model simply by saying it’s our pretrained model’s input followed by the x we just defined. We then compile it in the usual way:

from tensorflow.keras import Model
from tensorflow.keras.optimizers import RMSprop

model = Model(pre_trained_model.input, x)

model.compile(optimizer=RMSprop(lr=0.0001),
       loss='binary_crossentropy',
       metrics=['acc'])

Training the model on this architecture over 40 epochs gave an accuracy of 99%+, with a validation accuracy of 96%+ (see Figure 3-16).

Figure 3-16. Training the horse-or-human classifier with transfer learning

The results here are much better than with our previous model, but you can continue to tweak and improve it. You can also explore how the model will work with a much larger dataset, like the famous Dogs vs. Cats from Kaggle. This is an extremely varied dataset consisting of 25,000 images of cats and dogs, often with the subjects somewhat obscured—for example, if they are held by a human.

Using the same algorithm and model design as before you can train a Dogs vs. Cats classifier on Colab, using a GPU at about 3 minutes per epoch. For 20 epochs, this equates to about 1 hour of training.

When tested with very complex pictures like those in Figure 3-17, this classifier got them all correct. I chose one picture of a dog with catlike ears, and one with its back turned. Both pictures of cats were nontypical.

Figure 3-17. Unusual dogs and cats that were classified correctly

The cat in the lower-right corner with its eyes closed, ears down, and tongue out while washing its paw gave the results in Figure 3-18 when loaded into the model. You can see that it gave a very low value (4.98 × 10⁻²⁴), which shows that the network was almost certain it was a cat!

Figure 3-18. Classifying the cat washing its paw

You can find the complete code for the Horses or Humans and Dogs vs. Cats classifiers in the GitHub repository for this book.

Multiclass Classification

In all of the examples so far you’ve been building binary classifiers—ones that choose between two options (horses or humans, cats or dogs). When building multiclass classifiers the models are almost the same, but there are a few important differences. Instead of a single sigmoid-activated neuron (or two neurons, one per class), your output layer will now require n neurons, where n is the number of classes you want to classify. You’ll also have to change your loss function to one appropriate for multiple categories. For example, whereas for the binary classifiers you’ve built so far in this chapter your loss function was binary cross entropy, if you want to extend the model for multiple classes you should instead use categorical cross entropy. If you’re using the ImageDataGenerator to provide your images, the labeling is done automatically, so multiple categories will work the same as binary ones—the ImageDataGenerator will simply label based on the number of subdirectories.

Consider, for example, the game Rock Paper Scissors. If you wanted to train a dataset to recognize the different hand gestures, you’d need to handle three categories. Fortunately, there’s a simple dataset you can use for this.

There are two downloads: a training set of many diverse hands, with different sizes, shapes, colors, and details such as nail polish; and a testing set of equally diverse hands, none of which are in the training set.

You can see some examples in Figure 3-19.

Figure 3-19. Examples of Rock/Paper/Scissors gestures

Using the dataset is simple. Download and unzip it—the sorted subdirectories are already present in the ZIP file—and then use it to initialize an ImageDataGenerator:

!wget --no-check-certificate \
 https://storage.googleapis.com/laurencemoroney-blog.appspot.com/rps.zip \
 -O /tmp/rps.zip
local_zip = '/tmp/rps.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/tmp/')
zip_ref.close()
TRAINING_DIR = "/tmp/rps/"
training_datagen = ImageDataGenerator(
  rescale = 1./255,
  rotation_range=40,
  width_shift_range=0.2,
  height_shift_range=0.2,
  shear_range=0.2,
  zoom_range=0.2,
  horizontal_flip=True,
  fill_mode='nearest'
)

Note, however, that when you set up the data generator from this, you have to specify that the class mode is categorical in order for the ImageDataGenerator to use more than two subdirectories:

train_generator = training_datagen.flow_from_directory(
  TRAINING_DIR,
  target_size=(150,150),
  class_mode='categorical'
)

When defining your model, while keeping an eye on the input and output layers, you want to ensure that the input matches the shape of the data (in this case 150 × 150) and that the output matches the number of classes (now three):

model = tf.keras.models.Sequential([
  # Note the input shape is the desired size of the image: 
  # 150x150 with 3 bytes color
  # This is the first convolution
  tf.keras.layers.Conv2D(64, (3,3), activation='relu', 
              input_shape=(150, 150, 3)),
  tf.keras.layers.MaxPooling2D(2, 2),
  # The second convolution
  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2,2),
  # The third convolution
  tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2,2),
  # The fourth convolution
  tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2,2),
  # Flatten the results to feed into a DNN
  tf.keras.layers.Flatten(),
  # 512 neuron hidden layer
  tf.keras.layers.Dense(512, activation='relu'),
  tf.keras.layers.Dense(3, activation='softmax')
])

Finally, when compiling your model, you want to ensure that it uses a categorical loss function, such as categorical cross entropy. Binary cross entropy will not work with more than two classes:

model.compile(loss = 'categorical_crossentropy', optimizer='rmsprop', 
       metrics=['accuracy'])

Training is then the same as before:

history = model.fit(train_generator, epochs=25, 
          validation_data = validation_generator, verbose = 1)
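One thing to note: validation_generator has to be created first, in the same way as the training generator. Here’s a minimal sketch, assuming the test set is hosted as rps-test-set.zip alongside the training download (the URL is an assumption following the same pattern) and unzips to /tmp/rps-test-set/:

!wget --no-check-certificate \
 https://storage.googleapis.com/laurencemoroney-blog.appspot.com/rps-test-set.zip \
 -O /tmp/rps-test-set.zip

local_zip = '/tmp/rps-test-set.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/tmp/')
zip_ref.close()

VALIDATION_DIR = "/tmp/rps-test-set/"
validation_datagen = ImageDataGenerator(rescale=1./255)
validation_generator = validation_datagen.flow_from_directory(
  VALIDATION_DIR,
  target_size=(150,150),
  class_mode='categorical'
)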

Your code for testing predictions will also need to change somewhat. There are now three output neurons, and they will output a value close to 1 for the predicted class, and close to 0 for the other classes. Note that the activation function used is softmax, which will ensure that all three predictions will add up to 1. For example, if the model sees something it’s really unsure about it might output .4, .4, .2, but if it sees something it’s quite sure about you might get .98, .01, .01.

Note also that when using the ImageDataGenerator, the classes are loaded in alphabetical order—so while you might expect the output neurons to be in the order of the name of the game, the order in fact will be Paper, Rock, Scissors.

Code to try out predictions in a Colab notebook will look like this. It’s very similar to what you saw earlier:

import numpy as np
from google.colab import files
from keras.preprocessing import image

uploaded = files.upload()

for fn in uploaded.keys():
 
  # predicting images
  path = fn
  img = image.load_img(path, target_size=(150, 150))
  x = image.img_to_array(img)
  x = np.expand_dims(x, axis=0)

  images = np.vstack([x])
  classes = model.predict(images, batch_size=10)
  print(fn)
  print(classes)

Note that it doesn’t parse the output, just prints the classes. Figure 3-20 shows what it looks like in use.
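If you’d rather print a friendly label than the raw array, a small addition inside the loop works. The class_names list here is our own, ordered alphabetically to match how the ImageDataGenerator assigned the labels:

# Inside the loop, after model.predict:
class_names = ['paper', 'rock', 'scissors']  # alphabetical, matching the subdirectories
print(fn + " looks like " + class_names[np.argmax(classes[0])])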

Figure 3-20. Testing the Rock/Paper/Scissors classifier

You can see from the filenames what the images were. Paper1.png ended up as [1, 0, 0], meaning the first neuron was activated and the others weren’t. Similarly, Rock1.png ended up as [0, 1, 0], activating the second neuron, and Scissors2.png was [0, 0, 1]. Remember that the neurons are in alphabetical order by label!

Some images that you can use to test the dataset are available to download. Alternatively, of course, you can try your own. Note that the training images are all done against a plain white background, though, so there may be some confusion if there is a lot of detail in the background of the photos you take.

Dropout Regularization

Earlier in this chapter we discussed overfitting, where a network may become too specialized in a particular type of input data and fare poorly on others. One technique to help overcome this is the use of dropout regularization.

When a neural network is being trained, each individual neuron will have an effect on neurons in subsequent layers. Over time, particularly in larger networks, some neurons can become overspecialized—and that feeds downstream, potentially causing the network as a whole to become overspecialized and leading to overfitting. Additionally, neighboring neurons can end up with similar weights and biases, and if not monitored this can lead the overall model to become overspecialized to the features activated by those neurons.

For example, consider the neural network in Figure 3-21, where there are layers of 2, 6, 6, and 2 neurons. The neurons in the middle layers might end up with very similar weights and biases.

Figure 3-21. A simple neural network

While training, if you remove a random number of neurons and ignore them, their contribution to the neurons in the next layer is temporarily blocked (Figure 3-22).

Figure 3-22. A neural network with dropouts

This reduces the chances of the neurons becoming overspecialized. The network will still learn the same number of parameters, but it should be better at generalization—that is, it should be more resilient to different inputs.

Note

The concept of dropouts was proposed by Nitish Srivastava et al. in their 2014 paper “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”.

To implement dropouts in TensorFlow, you can just use a simple Keras layer like this:

tf.keras.layers.Dropout(0.2),

This will drop out, at random, the specified percentage of neurons (here, 20%) in the specified layer. Note that it may take some experimentation to find the correct percentage for your network.

For a simple example that demonstrates this, consider the Fashion MNIST classifier from Chapter 2. I’ll change the network definition to have a lot more layers, like this:

model = tf.keras.models.Sequential([
      tf.keras.layers.Flatten(input_shape=(28,28)),
      tf.keras.layers.Dense(256, activation=tf.nn.relu),
      tf.keras.layers.Dense(128, activation=tf.nn.relu),
      tf.keras.layers.Dense(64, activation=tf.nn.relu),
      tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ])

Training this for 20 epochs gave around 94% accuracy on the training set, and about 88.5% on the validation set. This is a sign of potential overfitting.

Introducing dropouts after each dense layer looks like this:

model = tf.keras.models.Sequential([
      tf.keras.layers.Flatten(input_shape=(28,28)),
      tf.keras.layers.Dense(256, activation=tf.nn.relu),
      tf.keras.layers.Dropout(0.2),
      tf.keras.layers.Dense(128, activation=tf.nn.relu),
      tf.keras.layers.Dropout(0.2),
      tf.keras.layers.Dense(64, activation=tf.nn.relu),
      tf.keras.layers.Dropout(0.2),
      tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ])

When this network was trained for the same period on the same data, the accuracy on the training set dropped to about 89.5%. The accuracy on the validation set stayed about the same, at 88.3%. These values are much closer to each other; the introduction of dropouts thus both confirmed that the earlier gap was caused by overfitting and showed that dropouts can help close it by ensuring that the network isn’t overspecializing to the training data.

Keep in mind as you design your neural networks that great results on your training set are not always a good thing. This could be a sign of overfitting. Introducing dropouts can help you remove that problem, so that you can optimize your network in other areas without that false sense of security.

Summary

This chapter introduced you to a more advanced way of achieving computer vision using convolutional neural networks. You saw how to use convolutions to apply filters that can extract features from images, and designed your first neural networks to deal with more complex vision scenarios than those you encountered with the MNIST and Fashion MNIST datasets. You also explored techniques to improve your network’s accuracy and avoid overfitting, such as the use of image augmentation and dropouts.

Before we explore further scenarios, in Chapter 4 you’ll get an introduction to TensorFlow Datasets, a technology that makes it much easier for you to get access to data for training and testing your networks. In this chapter you were downloading ZIP files and extracting images, but that’s not always going to be possible. With TensorFlow Datasets you’ll be able to access lots of datasets with a standard API.