Chapter 11. Using Convolutional and Recurrent Methods for Sequence Models

The last few chapters introduced you to sequence data. You saw how to predict it first using statistical methods, then basic machine learning methods with a deep neural network. You also explored how to tune the model’s hyperparameters using Keras Tuner. In this chapter, you’ll look at additional techniques that may further enhance your ability to predict sequence data using convolutional neural networks as well as recurrent neural networks.

Convolutions for Sequence Data

In Chapter 3 you were introduced to convolutions where a 2D filter was passed over an image to modify it and potentially extract features. Over time, the neural network learned which filter values were effective at matching the modifications made to the pixels to their labels, effectively extracting features from the image. The same technique can be applied to numeric time series data, but with one modification: the convolution will be one-dimensional instead of two-dimensional.

Consider, for example, the series of numbers in Figure 11-1.

Figure 11-1. A sequence of numbers

A 1D convolution could operate on these as follows. Consider the convolution to be a 1 × 3 filter with filter values of –0.5, 1, and –0.5, respectively. In this case, the first value in the sequence will be lost, and the second value will be transformed from 8 to –1.5, as shown in Figure 11-2.

Figure 11-2. Using a convolution with the number sequence

The filter will then stride across the values, calculating new ones as it goes. So, for example, in the next stride 15 will be transformed to 3, as shown in Figure 11-3.

Figure 11-3. An additional stride in the 1D convolution

Using this method, it's possible to extract the patterns between values and learn the filters that extract them successfully, in much the same way as convolutions on the pixels in images are able to extract features. Here the filter values aren't hand-picked as in this example; instead, the network learns whichever filter values minimize the overall loss.
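
To make the arithmetic concrete, here's a minimal sketch in NumPy. The sequence values are illustrative stand-ins for the figure, but the filter is the –0.5, 1, –0.5 one described above:

import numpy as np

# Illustrative sequence and the 1 x 3 filter described above
sequence = np.array([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])
conv_filter = np.array([-0.5, 1.0, -0.5])

# np.convolve flips the kernel, but this filter is symmetric so it doesn't matter;
# 'valid' keeps only positions where the filter fully overlaps the sequence
result = np.convolve(sequence, conv_filter, mode='valid')
print(result)  # yields -1.5, 3.0, -3.0, -6.0: 8 becomes -1.5 and 15 becomes 3, as in the figures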

Coding Convolutions

Before coding convolutions, you'll have to adjust the windowed dataset generator that you used in the previous chapter, because convolutional layers require you to specify the dimensionality of their input. The windowed dataset held plain values, not values shaped as a 1D tensor of features per time step. Fixing this simply requires adding a tf.expand_dims statement at the beginning of the windowed_dataset function, like this:

def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
  # Add a trailing feature dimension so each time step is a 1D tensor
  series = tf.expand_dims(series, axis=-1)
  dataset = tf.data.Dataset.from_tensor_slices(series)
  # Slice the series into overlapping windows of window_size + 1 values
  dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
  dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
  # Shuffle the windows, then split each into features (all but the last
  # value) and label (the last value)
  dataset = dataset.shuffle(shuffle_buffer).map(
                 lambda window: (window[:-1], window[-1]))
  dataset = dataset.batch(batch_size).prefetch(1)
  return dataset

Now that you have an amended dataset, you can add a convolutional layer before the dense layers that you had previously:

dataset = windowed_dataset(x_train, window_size, batch_size, shuffle_buffer_size)

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(filters=128, kernel_size=3,
                           strides=1, padding="causal",
                           activation="relu",
                           input_shape=[None, 1]),
    tf.keras.layers.Dense(28, activation="relu"), 
    tf.keras.layers.Dense(10, activation="relu"), 
    tf.keras.layers.Dense(1),
])

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-5, momentum=0.5)
model.compile(loss="mse", optimizer=optimizer)
history = model.fit(dataset, epochs=100, verbose=1)

In the Conv1D layer, you have a number of parameters:

filters

Is the number of filters that you want the layer to learn. It will generate this many filters, and adjust their values over time to fit your data as it learns.

kernel_size

Is the size of the filter—earlier we demonstrated a filter with the values –0.5, 1, –0.5, which would be a kernel size of 3.

strides

Is the size of the “step” that the filter will take as it scans across the list. This is typically 1.

padding

Determines how the layer behaves at the ends of the list, where the filter would otherwise run out of data. Without padding, a kernel size of 3 "loses" the first and last values of the list, because there is no prior value for the first item and no subsequent value for the last. Typically with sequence data you'll use causal here, which pads at the start so that each output only uses data from the current and previous time steps, never future ones. So, for example, with a kernel size of 3, each output is computed from the current time step along with the previous two. (A short sketch illustrating the difference follows this parameter list.)

activation

Is the activation function. In this case, relu means to effectively reject negative values coming out of the layer.

input_shape

As always, is the input shape of the data being passed into the network. As this is the first layer, you have to specify it.
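
Here's the short sketch mentioned above: a comparison, using made-up values, of how valid padding loses values at the ends while causal padding preserves the sequence length and only looks at current and past time steps:

import tensorflow as tf

# Six illustrative values, shaped as (batch, time steps, features)
x = tf.constant([[4.0], [8.0], [15.0], [16.0], [23.0], [42.0]])
x = tf.expand_dims(x, axis=0)

valid_conv = tf.keras.layers.Conv1D(filters=1, kernel_size=3, padding="valid")
causal_conv = tf.keras.layers.Conv1D(filters=1, kernel_size=3, padding="causal")

print(valid_conv(x).shape)   # (1, 4, 1): two time steps lost
print(causal_conv(x).shape)  # (1, 6, 1): length preserved; each output uses only
                             # the current and previous time steps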

Training with this will give you a model as before, but to get predictions from the model, given that the input layer has changed shape, you’ll need to modify your prediction code somewhat.

Also, instead of predicting each value, one by one, based on the previous window, you can actually get a single prediction for an entire series if you’ve correctly formatted the series as a dataset. To simplify things a bit, here’s a helper function that can predict an entire series based on the model, with a specified window size:

def model_forecast(model, series, window_size):
    # Window the series the same way as the training data, but without labels
    ds = tf.data.Dataset.from_tensor_slices(series)
    ds = ds.window(window_size, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size))
    ds = ds.batch(32).prefetch(1)
    # One prediction per window across the entire series
    forecast = model.predict(ds)
    return forecast

If you want to use the model to predict this series, you simply pass the series in with a new axis, matching the extra input dimension that the Conv1D layer expects. You can do it like this:

forecast = model_forecast(model, series[..., np.newaxis], window_size)

And you can split this forecast into just the predictions for the validation set using the predetermined split time:

results = forecast[split_time - window_size:-1, -1, 0]

A plot of the results against the series is in Figure 11-4.

The MAE in this case is 4.89, which is slightly worse than for the previous prediction. This could be because we haven’t tuned the convolutional layer appropriately, or it could be that convolutions simply don’t help. This is the type of experimentation you’ll need to do with your data.
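
For reference, here's one way to measure that MAE, assuming x_valid holds the validation slice of the series as set up in the previous chapter:

mae = tf.keras.metrics.mean_absolute_error(x_valid, results).numpy()
print(mae)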

Do note that this data has a random element in it, so values will change across sessions. If you’re using code from Chapter 10 and then running this code separately, you will, of course, have random fluctuations affecting your data, and thus your MAE.

Figure 11-4. Convolutional neural network with time sequence data prediction

But when using convolutions, the question always comes up: Why choose the parameters that we chose? Why 128 filters? Why size 3 × 1? The good news is that you can experiment with them using Keras Tuner, as shown previously. We’ll explore that next.

Experimenting with the Conv1D Hyperparameters

In the previous section, you saw a 1D convolution that was hardcoded with parameters for things like filter number, kernel size, number of strides, etc. When training the neural network with it, it appeared that the MAE went up slightly, so we got no benefit from using the Conv1D. This may not always be the case, depending on your data, but it could be because of suboptimal hyperparameters. So, in this section, you’ll see how Keras Tuner can optimize them for you.

In this example, you’ll experiment with the hyperparameters for the number of filters, the size of the kernel, and the size of the stride, keeping the other parameters static:

def build_model(hp):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Conv1D(
        filters=hp.Int('units',min_value=128, max_value=256, step=64), 
        kernel_size=hp.Int('kernels', min_value=3, max_value=9, step=3),
        strides=hp.Int('strides', min_value=1, max_value=3, step=1),
        padding='causal', activation='relu', input_shape=[None, 1]
    ))
  
    model.add(tf.keras.layers.Dense(28, input_shape=[window_size], 
                                    activation='relu'))
  
    model.add(tf.keras.layers.Dense(10, activation='relu'))
  
    model.add(tf.keras.layers.Dense(1))

    model.compile(loss="mse", 
                   optimizer=tf.keras.optimizers.SGD(momentum=0.5, learning_rate=1e-5))
    return model

The filter values will start at 128 and then step upwards toward 256 in increments of 64. The kernel size will start at 3 and increase to 9 in steps of 3, and the strides will start at 1 and be stepped up to 3.

There are 27 combinations of values here (three filter counts, three kernel sizes, and three stride values), and with three executions per trial the experiment will take some time to run. You could also try other changes, like using a much smaller starting value for filters to see the impact that has.

Here’s the code to do the search:

from keras_tuner import RandomSearch  # on older installs: from kerastuner.tuners import RandomSearch

tuner = RandomSearch(build_model, objective='loss', 
                      max_trials=500, executions_per_trial=3, 
                      directory='my_dir', project_name='cnn-tune')

tuner.search_space_summary()

tuner.search(dataset, epochs=100, verbose=2)
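
When the search finishes, you can inspect what it found. Here's a minimal sketch using the standard Keras Tuner API:

# Summarize the best trials found during the search
tuner.results_summary()

# Or pull out the single best set of hyperparameters directly
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print(best_hp.get('units'), best_hp.get('kernels'), best_hp.get('strides'))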

When I ran the experiment, I discovered that 128 filters, with size 9 and stride 1, gave the best results. So, compared to the initial model, the big difference was to change the filter size—which makes sense with such a large body of data. With a filter size of 3, only the immediate neighbors have an impact, whereas with 9, neighbors further afield also have an impact on the result of applying the filter. This would warrant a further experiment, starting with these values and trying larger filter sizes and perhaps fewer filters. I’ll leave that to you to see if you can improve the model further!

Plugging these values into the model architecture, you’ll get this:

dataset = windowed_dataset(x_train, window_size, batch_size, 
                            shuffle_buffer_size)

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(filters=128, kernel_size=9,
                           strides=1, padding="causal",
                           activation="relu",
                           input_shape=[None, 1]),
    tf.keras.layers.Dense(28, input_shape=[window_size], 
                           activation="relu"), 
    tf.keras.layers.Dense(10, activation="relu"), 
    tf.keras.layers.Dense(1),
])

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-5, momentum=0.5)
model.compile(loss="mse", optimizer=optimizer)
history = model.fit(dataset, epochs=100,  verbose=1)

After training with this, the model had improved accuracy compared with both the naive CNN created earlier and the original DNN, giving Figure 11-5.

Figure 11-5. Optimized CNN predictions

This resulted in an MAE of 4.39, which is a slight improvement over the 4.47 we got without using the convolutional layer. Further experimentation with the CNN hyperparameters may improve this further.

Beyond convolutions, the techniques we explored in the chapters on natural language processing with RNNs, including LSTMs, may be powerful when working with sequence data. By their very nature, RNNs are designed for maintaining context, so previous values can have an effect on later ones. You’ll explore using them for sequence modeling next. But first, let’s move on from a synthetic dataset and start looking at real data. In this case, we’ll consider weather data.

Using NASA Weather Data

One great resource for time series weather data is the NASA Goddard Institute for Space Studies (GISS) Surface Temperature analysis. If you follow the Station Data link, on the right side of the page you can pick a weather station to get data from. For example, I chose the Seattle Tacoma (SeaTac) airport and was taken to the page in Figure 11-6.

Figure 11-6. Surface temperature data from GISS

You can see a link to download monthly data as CSV at the bottom of this page. Select this, and a file called station.csv will be downloaded to your device. If you open this, you’ll see that it’s a grid of data with a year in each row and a month in each column, like in Figure 11-7.

Figure 11-7. Exploring the data

As this is CSV data, it’s pretty easy to process in Python, but as with any dataset, do note the format. When reading CSV, you tend to read it line by line, and often each line has one data point that you’re interested in. In this case there are at least 12 data points of interest per line, so you’ll have to consider this when reading the data.

Reading GISS Data in Python

The code to read the GISS data is shown here:

def get_data():
    data_file = "/home/ljpm/Desktop/bookpython/station.csv"
    f = open(data_file)
    data = f.read()
    f.close()
    lines = data.split('\n')
    header = lines[0].split(',')
    lines = lines[1:]
    temperatures=[]
    for line in lines:
        if line:
            linedata = line.split(',')
            linedata = linedata[1:13]
            for item in linedata:
                if item:
                    temperatures.append(float(item))

    series = np.asarray(temperatures)
    time = np.arange(len(temperatures), dtype="float32")
    return time, series

This will open the file at the indicated path (yours will of course differ) and read in the entire file as a set of lines, splitting on the newline character (\n). It will then loop through each line, ignoring the first one, and split each on the comma character into a new array called linedata. The items at indices 1 through 12 in this array contain the values for the months January through December as strings. These values are converted to floats and appended to the list called temperatures. Once that's complete, the list is turned into a NumPy array called series, and another NumPy array called time is created with the same length as series. As it is created using np.arange, its first element will be 0, the second 1, and so on. Thus, this function returns time as steps from 0 up to one less than the number of data points, and series as the data for those time steps.

Now if you want a time series that is normalized, you can simply run this code:

time, series = get_data()
mean = series.mean(axis=0)
series -= mean
std = series.std(axis=0)
series /= std

This can be split into training and validation sets as before. Choose your split time based on the size of the data—in this case I had ~840 data items, so I split at 792 (reserving four years’ worth of data points for validation):

split_time = 792
time_train = time[:split_time]
x_train = series[:split_time]
time_valid = time[split_time:]
x_valid = series[split_time:]

Because the data is now a Numpy array, you can use the same code as before to create a windowed dataset from it to train a neural network:

window_size = 24
batch_size = 12
shuffle_buffer_size = 48
dataset = windowed_dataset(x_train, window_size, 
                           batch_size, shuffle_buffer_size)
valid_dataset = windowed_dataset(x_valid, window_size, 
                                 batch_size, shuffle_buffer_size)

This uses the same windowed_dataset function as the convolutional network earlier in this chapter, which adds the extra feature dimension. RNNs, GRUs, and LSTMs all need the data in that shape.
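
If you want to confirm the shape the recurrent layers will receive, you can inspect a batch directly. A quick sketch:

# Each element is a (features, label) pair; features are batches of windows
# with a trailing feature dimension, which is what the recurrent layers expect
print(dataset.element_spec)

for windows, labels in dataset.take(1):
    print(windows.shape)  # (batch_size, window_size, 1)
    print(labels.shape)   # (batch_size, 1)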

Using RNNs for Sequence Modeling

Now that you have the data from the NASA CSV in a windowed dataset, it’s relatively easy to create a model to train a predictor for it. (It’s a bit more difficult to train a good one!) Let’s start with a simple, naive model using RNNs. Here’s the code:

model = tf.keras.models.Sequential([
    tf.keras.layers.SimpleRNN(100, return_sequences=True, 
                              input_shape=[None, 1]),
    tf.keras.layers.SimpleRNN(100),
    tf.keras.layers.Dense(1)
])

In this case, the Keras SimpleRNN layer is used. RNNs are a class of neural networks that are powerful for exploring sequence models. You first saw them in Chapter 7 when you were looking at natural language processing. I won’t go into detail on how they work here, but if you’re interested and you skipped that chapter, take a look back at it now. Notably, an RNN has an internal loop that iterates over the time steps of a sequence while maintaining an internal state of the time steps it has seen so far. A SimpleRNN has the output of each time step fed into the next time step.
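
To see why the first SimpleRNN in the preceding model sets return_sequences=True, it helps to compare the output shapes. A quick sketch with made-up dimensions:

import tensorflow as tf

# A dummy batch: 12 windows, 24 time steps, 1 feature per step
dummy = tf.zeros([12, 24, 1])

seq_rnn = tf.keras.layers.SimpleRNN(100, return_sequences=True)
last_rnn = tf.keras.layers.SimpleRNN(100)

print(seq_rnn(dummy).shape)   # (12, 24, 100): an output for every time step,
                              # which the second RNN layer can consume
print(last_rnn(dummy).shape)  # (12, 100): only the final step's output,
                              # suitable as input to the Dense layer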

You can compile and fit the model with the same hyperparameters as before, or use Keras Tuner to see if you can find better ones. For simplicity, you can use these settings:

optimizer = tf.keras.optimizers.SGD(learning_rate=1.5e-6, momentum=0.9)
model.compile(loss=tf.keras.losses.Huber(), 
               optimizer=optimizer, metrics=["mae"])

history = model.fit(dataset, epochs=100, verbose=1,
                    validation_data=valid_dataset)

Even one hundred epochs is enough to get an idea of how it can predict values. Figure 11-8 shows the results.

Figure 11-8. Results of the SimpleRNN

As you can see, the results were pretty good. It may be a little off in the peaks, and when the pattern changes unexpectedly (like at time steps 815 and 828), but on the whole it’s not bad. Now let’s see what happens if we train it for 1,500 epochs (Figure 11-9).

Figure 11-9. RNN trained over 1,500 epochs

There’s not much of a difference, except that some of the peaks are smoothed out. If you look at the history of loss on both the validation and training sets, it looks like Figure 11-10.

Figure 11-10. Training and validation loss for the SimpleRNN

As you can see, there’s a healthy match between the training loss and the validation loss, but as the epochs increase, the model begins to overfit on the training set. Perhaps a better number of epochs would be around five hundred.

One reason for this could be the fact that the data, being monthly weather data, is highly seasonal. Another is that there is a very large training set and a relatively small validation set. Next, we’ll explore using a larger climate dataset.

Exploring a Larger Dataset

The KNMI Climate Explorer allows you to explore granular climate data from many locations around the world. I downloaded a dataset consisting of daily temperature readings from the center of England from 1772 until 2020. This data is structured differently from the GISS data, with the date as a string, followed by a number of spaces, followed by the reading.

I’ve prepared the data, stripping the headers and removing the extraneous spaces. That way it’s easy to read with code like this:

def get_data():
    data_file = "tdaily_cet.dat.txt"
    f = open(data_file)
    data = f.read()
    f.close()
    lines = data.split('\n')
    temperatures=[]
    for line in lines:
        if line:
            linedata = line.split(' ')
            temperatures.append(float(linedata[1]))

    series = np.asarray(temperatures)
    time = np.arange(len(temperatures), dtype="float32")
    return time, series

This dataset has 90,663 data points in it, so, before training your model, be sure to split it appropriately. I used a split time of 80,000, leaving 10,663 records for validation. Also, update the window size, batch size, and shuffle buffer size appropriately. Here’s an example:

window_size = 60
batch_size = 120
shuffle_buffer_size = 240

Everything else can remain the same. As you can see in Figure 11-11, after training for one hundred epochs, the plot of the predictions against the validation set looks pretty good.

Figure 11-11. Plot of predictions against real data

There’s a lot of data here, so let’s zoom in to the last hundred days’ worth (Figure 11-12).

Figure 11-12. Results for one hundred days’ worth of data

While the chart generally follows the curve of the data, and is getting the trends roughly correct, it is pretty far off, particularly at the extreme ends, so there’s room for improvement.

It’s also important to remember that we normalized the data, so while our loss and MAE may look low, that’s because they are based on the loss and MAE of normalized values that have a much lower variance than the real ones. So, Figure 11-13, showing loss of less than 0.1, might lead you into a false sense of security.

Figure 11-13. Loss and validation loss for large dataset

To denormalize the data, you can do the inverse of normalization: first multiply by the standard deviation, and then add back the mean. At that point, if you wish, you can calculate the real MAE for the prediction set as done previously.
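
Here's a minimal sketch of that, assuming you kept the mean and std computed during normalization and that results holds the model's predictions for the validation window:

# Invert the normalization: multiply by the std, then add the mean back
real_results = (results * std) + mean
real_x_valid = (x_valid * std) + mean

real_mae = tf.keras.metrics.mean_absolute_error(
    real_x_valid, real_results).numpy()
print(real_mae)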

Using Other Recurrent Methods

In addition to the SimpleRNN, TensorFlow has other recurrent layer types, such as gated recurrent units (GRUs) and long short-term memory layers (LSTMs), discussed in Chapter 7. When using the tf.data-based windowed dataset architecture that you've been using throughout this chapter, it becomes relatively simple to drop in these RNN types if you want to experiment.

So, for example, if you consider the simple naive RNN that you created earlier:

model = tf.keras.models.Sequential([
    tf.keras.layers.SimpleRNN(100, input_shape=[None, 1], 
                              return_sequences=True),
    tf.keras.layers.SimpleRNN(100),
    tf.keras.layers.Dense(1)
])

Replacing this with a GRU becomes as easy as:

model = tf.keras.models.Sequential([
    tf.keras.layers.GRU(100, input_shape=[None, 1], return_sequences=True),
    tf.keras.layers.GRU(100),
    tf.keras.layers.Dense(1)
])

With an LSTM, it’s similar:

model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(100, input_shape=[None, 1], return_sequences=True),
    tf.keras.layers.LSTM(100),
    tf.keras.layers.Dense(1)
])

It’s worth experimenting with these layer types as well as with different hyperparameters, loss functions, and optimizers. There’s no one-size-fits-all solution, so what works best for you in any given situation will depend on your data and your requirements for prediction with that data.

Using Dropout

If you encounter overfitting in your models, where the MAE or loss for the training data is much better than with the validation data, you can use dropout. As discussed in Chapter 3 in the context of computer vision, with dropout, neighboring neurons are randomly dropped out (ignored) during training to avoid a familiarity bias. When using RNNs, there’s also a recurrent dropout parameter that you can use.

What’s the difference? Recall that when using RNNs you typically have an input value, and the neuron calculates an output value and a value that gets passed to the next time step. Dropout will randomly drop out the input values. Recurrent dropout will randomly drop out the recurrent values that get passed to the next step.

For example, consider the basic recurrent neural network architecture shown in Figure 11-14.

Figure 11-14. Recurrent neural network

Here you can see the inputs to the layers at different time steps (x). The current time is t, and the steps shown are t – 2 through t + 1. The relevant outputs at the same time steps (y) are also shown. The recurrent values passed between time steps are indicated by the dotted lines and labeled as r.

Using dropout will randomly drop out the x inputs. Using recurrent dropout will randomly drop out the r recurrent values.

You can learn more about how recurrent dropout works from a deeper mathematical perspective in the paper “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks” by Yarin Gal and Zoubin Ghahramani.

One thing to consider when using recurrent dropout comes from Gal's research on uncertainty in deep learning, in which he demonstrates that the same dropout mask should be applied at every time step, rather than a fresh random mask being drawn for each step. While dropout is typically random, Gal's approach was built into Keras, so when using tf.keras the consistency recommended by his research is maintained for you.

To add dropout and recurrent dropout, you simply use the relevant parameters on your layers. For example, adding them to the simple GRU from earlier would look like this:

model = tf.keras.models.Sequential([
    tf.keras.layers.GRU(100, input_shape=[None, 1], return_sequences=True, 
                         dropout=0.1, recurrent_dropout=0.1),
    tf.keras.layers.GRU(100, dropout=0.1, recurrent_dropout=0.1),
    tf.keras.layers.Dense(1),
])

Each parameter takes a value between 0 and 1 indicating the proportion of values to drop out. A value of 0.1 will drop out 10% of the requisite values.

RNNs using dropout will often take longer to converge, so be sure to train them for more epochs to test for this. Figure 11-15 shows the results of training the preceding GRU with dropout and recurrent dropout on each layer set to 0.1 over 1,000 epochs.

Figure 11-15. Training a GRU with dropout

As you can see, the loss and MAE decreased rapidly until about epoch 300, after which they continued to decline, but quite noisily. You'll often see noise like this in the loss when using dropout, and it's an indication that you may want to tweak the amount of dropout as well as other training parameters, such as the learning rate. Predictions with this network were shaped quite nicely, as you can see in Figure 11-16, but there's room for improvement: the peaks of the predictions are much lower than the real peaks.

Figure 11-16. Predictions using a GRU with dropout

As you’ve seen in this chapter, predicting time sequence data using neural networks is a difficult proposition, but tweaking their hyperparameters (particularly with tools such as Keras Tuner) can be a powerful way to improve your model and its subsequent predictions.

Using Bidirectional RNNs

Another technique to consider when modeling sequences is bidirectional training. This may seem counterintuitive at first, as you might wonder how future values could impact past ones. But recall that time series values can contain seasonality, where values repeat over time, and when using a neural network to make predictions all we're doing is sophisticated pattern matching. Given that the data repeats, a signal for how it repeats might be found in future values. When using bidirectional training, we can train the network to spot patterns going from time t to time t + x, as well as going from time t + x back to time t.

Fortunately, coding this is simple. For example, consider the GRU from the previous section. To make this bidirectional, you simply wrap each GRU layer in a tf.keras.layers.Bidirectional call. This will effectively train twice on each step—once with the sequence data in the original order, and once with it in reverse order. The results are then merged before proceeding to the next step.

Here’s an example:

model = tf.keras.models.Sequential([
    tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(100, input_shape=[None, 1],return_sequences=True, 
                            dropout=0.1, recurrent_dropout=0.1)),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(100, dropout=0.1, recurrent_dropout=0.1)),
    tf.keras.layers.Dense(1),
])

A plot of the results of training with a bidirectional GRU with dropout on the time series is shown in Figure 11-17. As you can see, there’s no major difference here, and the MAE ends up being similar. However, with a larger data series, you may see a decent accuracy difference, and additionally tweaking the training parameters—particularly window_size, to get multiple seasons—can have a pretty big impact.

Figure 11-17. Training with a bidirectional GRU

This network has an MAE (on the normalized data) of about 0.48, chiefly because it doesn't seem to do too well on the high peaks. Retraining it with a larger window and bidirectionality produces better results: it has a significantly lower MAE of about 0.28 (Figure 11-18).

Figure 11-18. Larger window, bidirectional GRU results

As you can see, you can experiment with different network architectures and different hyperparameters to improve your overall predictions. The ideal choices are very much dependent on the data, so the skills you’ve learned in this chapter will help you with your specific datasets!

Summary

In this chapter, you explored different network types for building models to predict time series data. You built on the simple DNN from Chapter 10, adding convolutions, and experimented with recurrent network types such as simple RNNs, GRUs, and LSTMs. You saw how you can tweak hyperparameters and the network architecture to improve your model’s accuracy, and you practiced working with some real-world datasets, including one massive dataset with hundreds of years’ worth of temperature readings. You’re now ready to get started building networks for a variety of datasets, with a good understanding of what you need to know to optimize them!