Recurrent Neural Networks

In [1]:
from ipypublish import nb_setup

Introduction

Recurrent Neural Nets or RNNs are systems that are designed to detect patterns present in data sequences. This makes them better suited to solve "prediction" problems, as compared to other types of DLNs. An example of a prediction problem is predicting the next word in a sentence, which is a fundamental problem in Language Modeling. The solution to this problem requires that the system take into account the variable number of words that came before, i.e., it should be able to "remember" previous data in the sequence. Just like ConvNets, RNNs were discovered in the late 1980s, but lay dormant until recently due to the difficulty in training them. These problems have been overcome in recent years, with the use of a type of RNN called Long Short Term Memory or LSTMs, as well as the increase in processing power and size of training datasets. Today RNNs are at the forefront of exciting new discoveries in Deep Learning, and some of the most important recent work in DLNs falls in the RNN domain.

What are Recurrent Neural Networks

In [2]:
#rnn36
nb_setup.images_hconcat(["DL_images/rnn36.png"], width=600)
Out[2]:

In order to motivate the need for RNNs, consider the following: The DLN architectures that we have seen so far implements a static mapping between input and output vectors, of the type $Y=f(X)$ as shown in Part (a) of Figure rnn36.

Instead, consider a system that is evolving with time, i.e., it is a Dynamical System. In this scenario, the system is subject to an input sequence $X_1,...,X_n$ in which successive values of $X_i$ are dependent (for example if each $X_i$ were an image, then the sequence represents a video clip). Furthermore the output vector $Y_i$ at time $i$ depends not just on the input vector $X_i$, but on past values of $X$ as well, i.e., $Y_i = f_i(X_1,...,X_i)$ as shown in Part (b) of Figure rnn36. If we were to try to solve this problem using a traditional Neural Network, then we would need to find a different function $f_i$ for each value of $i$, i.e., the function would depend on the size of the input sequence. Obviously this would be a formidable task, and it would be much nicer if we could find a solution which is not dependent on the size of the input sequence; this is precisely what RNNs do.

Systems whose behavior is time dependent are commonly encountered in practice and are usually studied by postulating a Hidden Variable or Hidden State $Z_i$ (also called a State Variable) The evolution of the system state obeys the following recursion:

$$ Z_{i+1} = w_{i+1}(Z_i,X_i) $$

while the output is a function of the current state and is given by

$$ Y_{i+1} = v_{i+1}(Z_{i+1}) $$

The Hidden Variable sequence $Z_i$ captures the lossy history of the $X_i$ sequence, and hence serves as a type of memory. If we assume that the functions $v$ and $w$ do not change with time, then these equations reduce to:

$$ \begin{equation} Z_{i+1} = w(Z_i,X_i) \quad \quad (**eqn1**) \end{equation} $$

$$ \begin{equation} Y_{i+1} = v(Z_{i+1}) \quad \quad (**eqn2**) \end{equation} $$

Traditional Systems Theory makes the additional simplifying assumption that these functions are linear so that

$$ Z_{i+1} = WZ_i + UX_i $$

$$ Y_{i+1} = VZ_{i+1} $$

The use of DLNs enables us to approximate non-linear functions to high degree of accuracy using the training dataset, so that this linearity assumption is no longer needed. We do however assume that the vectors $Z_i$ and $X_i$ can be combined linearly, so that the resulting equations are:

$$ \begin{equation} Z_{i+1} = f(W Z_i + UX_i) \quad \quad (**eqn**3) \end{equation} $$

$$ \begin{equation} Y_{i+1} = h(V Z_{i+1}) \quad \quad (**eqn4**) \end{equation} $$

Equations (eqn3) and (eqn4) are in a form in which they can be implemented using Neural Networks, with the functions $f$ and $h$ serving as the Activation Functions, as shown in Figure rnn1.

In [3]:
#rnn1
nb_setup.images_hconcat(["DL_images/rnn1.png"], width=600)
Out[3]:

The figure shows three types of nodes:

$X{(n)}$: This represents the value of the input vector at time $n$.

$Z{(n)}$: This represents the value of the Hidden Layer vector at time $n$.

$Y{(n)}$: This represents the value of the output vector at time $n$.

The LHS of Figure rnn1 shows a RNN with connections between the Hidden Layers at times $n$ and $n+1$ shown explicitly. The RHS of Figure rnn1 is a simplified representation of the same RNN. Each of the boxes in this figure represents a row of Neural Network nodes belonging to a single layer, of the type shown on the LHS, which have been omitted for the sake of clarity. Note that:

  • The weight matrix $U$ connects the nodes in the Input Layer with those in the Hidden Layer
  • The weight matrix $W$ connects the nodes in the Hidden Layer with the same set of nodes, but at the previous time instant. The square block on the self-loop to the Hidden Layer represents a time delay of one unit, and represents the fact that at time $n$, the nodes in that layer have as one of their inputs, the values that they had at time $n-1$
  • The weight matrix $V$ connects the nodes in the Hidden Layer with those in the Output Layer.

The fact that all these weight matrices do not change with time is a result of the time invariance assumption.

In [28]:
#rnn1c
nb_setup.images_hconcat(["DL_images/rnn1c.png"], width=600)
Out[28]:

While Figure rnn1 accurately captures the fact that the RNN design incorporates feedback, it does not lend itself easily to training of the type we have seen in Dense Feed-Forward Networks. In order to facilitate the reuse of techniques we have developed in previous chapters, we convert Figure rnn1 into an equivalent Dense Feed Forward network shown in Figure rnn1c, by a process called "unfolding". The unfolded network in the RHS of Figure rnn1c basically shows snapshots of the RNN at various points in time, and the fact that there is feedback between the nodes in the Hidden Layer is captured by the connections involving the weight matrix $W$.

As a result of the unfolding operation, the system becomes amenable to optimization using the Backprop algorithm. In addition, the depth or number of layers of the RNN is now dependent on the length of the input data sequence. This can result in a RNN with hundreds of Hidden Layers, hence the problems of training deep models with Backprop have a special resonance here. We also mentioned earlier that the Hidden Layer in RNNs keeps a "Lossy Memory" of the input sequence - the reason why it is lossy is because this single layer has to capture data from an entire input sequence, which cannot be done without compressing the data is some fashion.

In order to gain an intuitive understanding of the way in which the RNN shown in Figure rnn1c operates, note the following: There is an analogy between the way a ConvNet operates by looking for patterns in localized patches of an image, and the way a RNN operates by looking for patterns in localized intervals in time. Just as a ConvNet slides a single Filter over the entire image, a RNN slides its own filter represented by the weight matrix U over the entire input sequence, one input at a time. Hence RNNs implement a form of Translational Invariance, but they do this over the temporal axis, as opposed to spatial Translational Invariance seen in ConvNets. As a result of this property, the important part of the sequence can occur anywhere along the time axis, but still can be detected using a single weight matrix repeated at each point in time. There is however a crucial difference in the way a RNN operates compared to a ConvNet: Due to the feedback loop, the RNN pattern detector in the Hidden Layer is able to take advantage of the information that was learnt in the prior time steps from older data in the sequence. As a result the RNN is able to detect patterns that are spread over time, something that a ConvNet cannot do (in the spatial sense).

There is a fundamental difference between the internal representations that are created in the hidden layers of a ConvNet vs those that are created in the hidden layers of a RNN. The hidden layers in a ConvNet create a hierarchical representation, in which the representation at layer $r+1$ is at a higher level of abstraction compared to that for layer $r$. The hidden layer in a RNN on the other hand, does not add to the level of abstraction in the representation with successive time steps. Instead it captures patterns that are spread in time, but at the same level of abstraction. It is indeed possible to combine DLNs and RNNs and create deep RNNs (see the next section for examples), in which case higher level hidden layers capture time dependent patterns at different levels of abstraction.

The rest of this chapter is organized as follows: RNNs can be configured in several useful ways depending upon the problem they are being used to solve, some of these are described in Section Examples. In Section Training we discuss the training of RNNs and the problems that arise when doing so. In particular the Backprop algorithm can lead to issues such as the Vanishing Gradients Problem or the Exploding Gradient Problem, and in Section LSTMs we discuss a modified RNN architecture called LSTM (Long Short Term Memory) which was designed to solve these problems. In the final section we show how to model RNNs and LSTMs using Keras.

IMBD Movie Review Classification Using an RNN

In prior chapters we built models to classify IMDB Movie Reviews using Dense Feed Forward Networks (Chapter NNDeepLearning) and 1-D Convnets (Chapter ConvNetsPart1). We now show how to do this using a RNN.

We first address the problem of how to feed data into a RNN: As shown in Figure rnn39, Keras requires that data inputs into each RNN stage be in the form of a 1-D vector. Hence as Part (a) of the figure shows, a single training sample into an RNN is a 2-D matrix of shape $time\times features$, such that the $i^{th}$ row of the matrix represents the data vector that is fed into the $i^{th}$ stage of the RNN. As shown in Part (b) of the figure, a batch of samples into the RNN becomes a 3D tensor of shape $sample\times time\times features$.

All datasets to be fed into an RNN have to be first formatted into this shape. This raises the question of how to feed non-vector data into a RNN, the most common example of which are 3-D images. A common way of doing so is by first passing the image data through a ConvNet, and then using the image feature vector that occurs at the end of the convolutional layers as an input. This enables us to feed a video clip into a RNN, such that images in the clip are fed into successive stages of the RNN.

In [11]:
#rnn39
nb_setup.images_hconcat(["DL_images/rnn39.png"], width=600)
Out[11]:

We use the IMDB dataset that comes with Keras. The test data has already been split into trainin + test samples and also tokenized such that all words have been converted into integers. When loading this data, we limit ourselves to the top 10,000 words that occur in the dataset by means of the max_features parameter. Furthermore each review is truncated after 500 words, with shorter reviews padded with zeroes, using the pad_sequences command and the maxlen parameter. At the end of these operations, each review is in the form of a 1-D vector of size 500 and the entire IMDB dataset is a matrix of shape $samples\times word tokens$.

In [2]:
from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000  # number of words to consider as features
maxlen = 500  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')

print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
input_train shape: (25000, 500)
input_test shape: (25000, 500)

The RNN model for the IMDB problem is shown below. It is a basic RNN with 500 stages (since each movie review sample consists of 500 words) and a single Logit node for doing the binary classification.

  • In order to input a single word into the model, Keras does the following: A word is represented using 1-Hot vector of size 10,000 (corresponding to the 10,000 words in the vocabulary), and it then passes through the Embedding Layer which converts it into its vector representation of size 32 by multiplying it with a matrix of size $10,000\times 32$ (note that this multiplication is not needed in the actual implementation, Keras just picks the $i^{th}$ row of the matrix if the token of the input word is $i$). Note that we are not using a pre-trained Embedding Layer, hence the best embedding for each word is learnt as part of the training process.

  • Keras inputs a single review into the model by presenting each of its 500 words in turn to successive stages of the RNN.

  • Keras inputs a batch of reviews into the model by creating a 3-D tensor of shape $(batch\ size, review\ size, feature\ size)$, which is (128, 500, 32) for our example. Hence during the forward pass, 128 reviews are fed in parallel into the model.

In [5]:
#rnn40
nb_setup.images_hconcat(["DL_images/rnn40.png"], width=600)
Out[5]:
In [5]:
from keras.layers import Dense
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
In [6]:
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 32)                2080      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
=================================================================
Total params: 322,113
Trainable params: 322,113
Non-trainable params: 0
_________________________________________________________________
In [7]:
history = model.fit(input_train, y_train,
                    epochs=100,
                    batch_size=128,
                    validation_split=0.2)
Epoch 1/100
157/157 [==============================] - 25s 148ms/step - loss: 0.5662 - acc: 0.7080 - val_loss: 0.4769 - val_acc: 0.7876
Epoch 2/100
157/157 [==============================] - 22s 138ms/step - loss: 0.3628 - acc: 0.8516 - val_loss: 0.4489 - val_acc: 0.8184
Epoch 3/100
157/157 [==============================] - 23s 145ms/step - loss: 0.2811 - acc: 0.8903 - val_loss: 0.3696 - val_acc: 0.8548
Epoch 4/100
157/157 [==============================] - 23s 144ms/step - loss: 0.2375 - acc: 0.9098 - val_loss: 0.3564 - val_acc: 0.8430
Epoch 5/100
157/157 [==============================] - 22s 137ms/step - loss: 0.2132 - acc: 0.9233 - val_loss: 0.3872 - val_acc: 0.8416
Epoch 6/100
157/157 [==============================] - 21s 134ms/step - loss: 0.1650 - acc: 0.9410 - val_loss: 0.3683 - val_acc: 0.8580
Epoch 7/100
157/157 [==============================] - 21s 133ms/step - loss: 0.1352 - acc: 0.9521 - val_loss: 0.5580 - val_acc: 0.8394
Epoch 8/100
157/157 [==============================] - 21s 134ms/step - loss: 0.1128 - acc: 0.9616 - val_loss: 0.4379 - val_acc: 0.8416
Epoch 9/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0918 - acc: 0.9688 - val_loss: 0.4639 - val_acc: 0.8524
Epoch 10/100
157/157 [==============================] - 21s 133ms/step - loss: 0.0769 - acc: 0.9739 - val_loss: 0.4772 - val_acc: 0.8678
Epoch 11/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0685 - acc: 0.9793 - val_loss: 0.7638 - val_acc: 0.8288
Epoch 12/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0456 - acc: 0.9856 - val_loss: 0.6962 - val_acc: 0.7726
Epoch 13/100
157/157 [==============================] - 22s 137ms/step - loss: 0.0356 - acc: 0.9892 - val_loss: 0.5905 - val_acc: 0.8368
Epoch 14/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0553 - acc: 0.9833 - val_loss: 0.7072 - val_acc: 0.8006
Epoch 15/100
157/157 [==============================] - 21s 136ms/step - loss: 0.0235 - acc: 0.9926 - val_loss: 0.6466 - val_acc: 0.8414
Epoch 16/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0333 - acc: 0.9880 - val_loss: 0.7729 - val_acc: 0.7922
Epoch 17/100
157/157 [==============================] - 21s 136ms/step - loss: 0.0241 - acc: 0.9931 - val_loss: 0.7891 - val_acc: 0.8072
Epoch 18/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0177 - acc: 0.9944 - val_loss: 0.7494 - val_acc: 0.8270
Epoch 19/100
157/157 [==============================] - 21s 136ms/step - loss: 0.0094 - acc: 0.9970 - val_loss: 0.8824 - val_acc: 0.8028
Epoch 20/100
157/157 [==============================] - 21s 136ms/step - loss: 0.0048 - acc: 0.9985 - val_loss: 0.8467 - val_acc: 0.8282
Epoch 21/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0100 - acc: 0.9968 - val_loss: 1.0301 - val_acc: 0.7628
Epoch 22/100
157/157 [==============================] - 21s 136ms/step - loss: 0.0051 - acc: 0.9987 - val_loss: 0.9628 - val_acc: 0.7922
Epoch 23/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0051 - acc: 0.9984 - val_loss: 0.9925 - val_acc: 0.8026
Epoch 24/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0047 - acc: 0.9985 - val_loss: 1.0366 - val_acc: 0.7850
Epoch 25/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0038 - acc: 0.9987 - val_loss: 0.9723 - val_acc: 0.8030
Epoch 26/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0043 - acc: 0.9987 - val_loss: 1.0013 - val_acc: 0.7978
Epoch 27/100
157/157 [==============================] - 21s 136ms/step - loss: 0.0076 - acc: 0.9980 - val_loss: 1.2578 - val_acc: 0.7592
Epoch 28/100
157/157 [==============================] - 21s 136ms/step - loss: 0.0034 - acc: 0.9987 - val_loss: 1.3130 - val_acc: 0.7430
Epoch 29/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0045 - acc: 0.9985 - val_loss: 1.2486 - val_acc: 0.7576
Epoch 30/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0038 - acc: 0.9987 - val_loss: 1.1152 - val_acc: 0.7880
Epoch 31/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0036 - acc: 0.9988 - val_loss: 1.1975 - val_acc: 0.7756
Epoch 32/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0044 - acc: 0.9987 - val_loss: 1.2073 - val_acc: 0.7676
Epoch 33/100
157/157 [==============================] - 21s 136ms/step - loss: 0.0033 - acc: 0.9988 - val_loss: 1.1394 - val_acc: 0.7836
Epoch 34/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0011 - acc: 0.9997 - val_loss: 1.1878 - val_acc: 0.7848
Epoch 35/100
157/157 [==============================] - 22s 138ms/step - loss: 0.0044 - acc: 0.9989 - val_loss: 1.2360 - val_acc: 0.7704
Epoch 36/100
157/157 [==============================] - 22s 142ms/step - loss: 0.0011 - acc: 0.9997 - val_loss: 1.2728 - val_acc: 0.7744
Epoch 37/100
157/157 [==============================] - 21s 137ms/step - loss: 0.0027 - acc: 0.9991 - val_loss: 1.3752 - val_acc: 0.7592
Epoch 38/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0041 - acc: 0.9983 - val_loss: 1.2901 - val_acc: 0.7710
Epoch 39/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0045 - acc: 0.9986 - val_loss: 1.3399 - val_acc: 0.7536
Epoch 40/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0044 - acc: 0.9987 - val_loss: 1.3645 - val_acc: 0.7626
Epoch 41/100
157/157 [==============================] - 23s 146ms/step - loss: 0.0023 - acc: 0.9992 - val_loss: 1.3305 - val_acc: 0.7664
Epoch 42/100
157/157 [==============================] - 22s 143ms/step - loss: 0.0015 - acc: 0.9995 - val_loss: 1.3996 - val_acc: 0.7552
Epoch 43/100
157/157 [==============================] - 21s 133ms/step - loss: 4.4946e-04 - acc: 0.9999 - val_loss: 1.4366 - val_acc: 0.7574
Epoch 44/100
157/157 [==============================] - 21s 133ms/step - loss: 2.6842e-04 - acc: 0.9999 - val_loss: 1.7124 - val_acc: 0.7120
Epoch 45/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0017 - acc: 0.9994 - val_loss: 1.5959 - val_acc: 0.7292
Epoch 46/100
157/157 [==============================] - 22s 141ms/step - loss: 8.9925e-04 - acc: 0.9997 - val_loss: 1.6372 - val_acc: 0.7218
Epoch 47/100
157/157 [==============================] - 21s 136ms/step - loss: 0.0030 - acc: 0.9990 - val_loss: 1.6065 - val_acc: 0.7232
Epoch 48/100
157/157 [==============================] - 21s 133ms/step - loss: 7.6052e-04 - acc: 0.9997 - val_loss: 1.5389 - val_acc: 0.7380
Epoch 49/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0011 - acc: 0.9997 - val_loss: 1.5803 - val_acc: 0.7468
Epoch 50/100
157/157 [==============================] - 21s 133ms/step - loss: 4.9535e-04 - acc: 0.9998 - val_loss: 1.5754 - val_acc: 0.7482
Epoch 51/100
157/157 [==============================] - 21s 133ms/step - loss: 0.0011 - acc: 0.9997 - val_loss: 1.5845 - val_acc: 0.7524
Epoch 52/100
157/157 [==============================] - 21s 133ms/step - loss: 0.0044 - acc: 0.9985 - val_loss: 1.7093 - val_acc: 0.7384
Epoch 53/100
157/157 [==============================] - 21s 135ms/step - loss: 0.0020 - acc: 0.9994 - val_loss: 1.6615 - val_acc: 0.7574
Epoch 54/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0050 - acc: 0.9987 - val_loss: 1.7627 - val_acc: 0.7318
Epoch 55/100
157/157 [==============================] - 21s 134ms/step - loss: 5.6182e-04 - acc: 0.9998 - val_loss: 1.9114 - val_acc: 0.7176
Epoch 56/100
157/157 [==============================] - 21s 133ms/step - loss: 0.0020 - acc: 0.9992 - val_loss: 1.6934 - val_acc: 0.7424
Epoch 57/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0019 - acc: 0.9995 - val_loss: 1.5983 - val_acc: 0.7500
Epoch 58/100
157/157 [==============================] - 21s 134ms/step - loss: 6.3050e-04 - acc: 0.9998 - val_loss: 1.6219 - val_acc: 0.7634
Epoch 59/100
157/157 [==============================] - 21s 133ms/step - loss: 2.1885e-04 - acc: 0.9999 - val_loss: 1.6265 - val_acc: 0.7604
Epoch 60/100
157/157 [==============================] - 21s 134ms/step - loss: 2.7765e-04 - acc: 0.9998 - val_loss: 1.7236 - val_acc: 0.7496
Epoch 61/100
157/157 [==============================] - 21s 134ms/step - loss: 5.1104e-04 - acc: 0.9998 - val_loss: 1.7307 - val_acc: 0.7554
Epoch 62/100
157/157 [==============================] - 21s 133ms/step - loss: 0.0014 - acc: 0.9995 - val_loss: 1.8284 - val_acc: 0.7284
Epoch 63/100
157/157 [==============================] - 21s 133ms/step - loss: 8.9904e-04 - acc: 0.9998 - val_loss: 1.7856 - val_acc: 0.7300
Epoch 64/100
157/157 [==============================] - 21s 132ms/step - loss: 3.0090e-04 - acc: 0.9999 - val_loss: 2.0609 - val_acc: 0.7168
Epoch 65/100
157/157 [==============================] - 21s 133ms/step - loss: 1.4129e-04 - acc: 0.9999 - val_loss: 1.8861 - val_acc: 0.7394
Epoch 66/100
157/157 [==============================] - 21s 133ms/step - loss: 0.0014 - acc: 0.9995 - val_loss: 1.9126 - val_acc: 0.7262
Epoch 67/100
157/157 [==============================] - 21s 133ms/step - loss: 0.0022 - acc: 0.9991 - val_loss: 2.1492 - val_acc: 0.7010
Epoch 68/100
157/157 [==============================] - 21s 133ms/step - loss: 1.5285e-04 - acc: 0.9999 - val_loss: 1.9580 - val_acc: 0.7378
Epoch 69/100
157/157 [==============================] - 21s 134ms/step - loss: 2.0725e-04 - acc: 0.9999 - val_loss: 2.3079 - val_acc: 0.6888
Epoch 70/100
157/157 [==============================] - 21s 134ms/step - loss: 3.2451e-05 - acc: 1.0000 - val_loss: 1.9635 - val_acc: 0.7458
Epoch 71/100
157/157 [==============================] - 21s 134ms/step - loss: 5.0015e-04 - acc: 0.9998 - val_loss: 1.9711 - val_acc: 0.7418
Epoch 72/100
157/157 [==============================] - 21s 132ms/step - loss: 0.0016 - acc: 0.9996 - val_loss: 1.9615 - val_acc: 0.7466
Epoch 73/100
157/157 [==============================] - 21s 133ms/step - loss: 6.9097e-04 - acc: 0.9998 - val_loss: 2.0379 - val_acc: 0.7404
Epoch 74/100
157/157 [==============================] - 21s 135ms/step - loss: 2.6221e-05 - acc: 1.0000 - val_loss: 2.0917 - val_acc: 0.7376
Epoch 75/100
157/157 [==============================] - 21s 134ms/step - loss: 3.0255e-05 - acc: 1.0000 - val_loss: 1.9486 - val_acc: 0.7688
Epoch 76/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0020 - acc: 0.9995 - val_loss: 2.0380 - val_acc: 0.7498
Epoch 77/100
157/157 [==============================] - 21s 132ms/step - loss: 0.0014 - acc: 0.9998 - val_loss: 2.1089 - val_acc: 0.7302
Epoch 78/100
157/157 [==============================] - 21s 133ms/step - loss: 1.8078e-04 - acc: 0.9999 - val_loss: 2.1894 - val_acc: 0.7212
Epoch 79/100
157/157 [==============================] - 21s 134ms/step - loss: 0.0016 - acc: 0.9995 - val_loss: 2.1526 - val_acc: 0.7210
Epoch 80/100
157/157 [==============================] - 21s 133ms/step - loss: 2.3403e-04 - acc: 0.9999 - val_loss: 2.2912 - val_acc: 0.7216
Epoch 81/100
157/157 [==============================] - 21s 134ms/step - loss: 8.6144e-05 - acc: 0.9999 - val_loss: 2.1191 - val_acc: 0.7364
Epoch 82/100
157/157 [==============================] - 21s 133ms/step - loss: 0.0013 - acc: 0.9997 - val_loss: 2.3310 - val_acc: 0.7094
Epoch 83/100
157/157 [==============================] - 21s 132ms/step - loss: 2.8061e-04 - acc: 0.9999 - val_loss: 2.3067 - val_acc: 0.7098
Epoch 84/100
157/157 [==============================] - 21s 132ms/step - loss: 4.5224e-04 - acc: 0.9998 - val_loss: 2.1959 - val_acc: 0.7298
Epoch 85/100
157/157 [==============================] - 21s 133ms/step - loss: 2.7094e-04 - acc: 0.9999 - val_loss: 2.2141 - val_acc: 0.7284
Epoch 86/100
157/157 [==============================] - 21s 132ms/step - loss: 6.9691e-04 - acc: 0.9998 - val_loss: 2.3401 - val_acc: 0.7196
Epoch 87/100
157/157 [==============================] - 21s 134ms/step - loss: 4.9009e-04 - acc: 0.9999 - val_loss: 2.1792 - val_acc: 0.7418
Epoch 88/100
157/157 [==============================] - 21s 134ms/step - loss: 5.4425e-04 - acc: 0.9999 - val_loss: 2.3057 - val_acc: 0.7160
Epoch 89/100
157/157 [==============================] - 21s 134ms/step - loss: 3.4485e-04 - acc: 0.9999 - val_loss: 2.2222 - val_acc: 0.7270
Epoch 90/100
157/157 [==============================] - 21s 133ms/step - loss: 1.7077e-04 - acc: 0.9999 - val_loss: 2.1758 - val_acc: 0.7398
Epoch 91/100
157/157 [==============================] - 21s 133ms/step - loss: 8.2131e-06 - acc: 1.0000 - val_loss: 2.3120 - val_acc: 0.7312
Epoch 92/100
157/157 [==============================] - 21s 133ms/step - loss: 0.0033 - acc: 0.9991 - val_loss: 2.2393 - val_acc: 0.7284
Epoch 93/100
157/157 [==============================] - 21s 136ms/step - loss: 0.0020 - acc: 0.9995 - val_loss: 2.2916 - val_acc: 0.7278
Epoch 94/100
157/157 [==============================] - 22s 140ms/step - loss: 2.8387e-04 - acc: 0.9999 - val_loss: 2.1180 - val_acc: 0.7456
Epoch 95/100
157/157 [==============================] - 22s 140ms/step - loss: 0.0017 - acc: 0.9995 - val_loss: 2.3299 - val_acc: 0.7148
Epoch 96/100
157/157 [==============================] - 22s 138ms/step - loss: 5.5296e-05 - acc: 1.0000 - val_loss: 2.4119 - val_acc: 0.7056
Epoch 97/100
157/157 [==============================] - 22s 140ms/step - loss: 0.0024 - acc: 0.9994 - val_loss: 2.2842 - val_acc: 0.7350
Epoch 98/100
157/157 [==============================] - 22s 139ms/step - loss: 8.3363e-04 - acc: 0.9997 - val_loss: 2.2574 - val_acc: 0.7376
Epoch 99/100
157/157 [==============================] - 22s 142ms/step - loss: 5.4921e-04 - acc: 0.9998 - val_loss: 2.3752 - val_acc: 0.7160
Epoch 100/100
157/157 [==============================] - 22s 139ms/step - loss: 1.7456e-04 - acc: 0.9999 - val_loss: 2.3985 - val_acc: 0.7356
In [9]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

The Accuracy and Loss curves show that the model starts to overfit after about 8-10 epochs, achieving a maximum accuracy of about 85%. This is also close to the accuracy that was achieved with the 1-D convnet (see Chapter ConvNetsPart1), even though that was done using an embedding of size 128. In order to increase the accuracy we may increase the the number of words per review to more than 500. However this also increases the number of RNN stages, and later in this chapter we will show that this can cause problems with the Backprop algorithm.

Examples of RNN Architectures

RNNs with Stacked Layers

In [12]:
#rnn38
nb_setup.images_hconcat(["DL_images/rnn48.png"], width=600)
Out[12]:

Figure rnn38 shows a RNN with multiple hidden layers, usually referred to as a Deep RNN. This system incorporates ideas from Dense Feed Forward Networks into the RNN, and enables the system to simultaneously create:

  1. Higher level hierarchical representations of the input data, and
  2. At each level of the hierarchy capture temporal patterns in the data. This architecture is used quite commonly in current RNN systems.

The Keras code for a RNN Model with 4 stacked layers is shown below. They require the use of the return_sequences flag which is set to True in the intermediate layers as shown. This causes Keras to return the full sequence of outputs for each timestep, which are needed to feed the following layer. The model.summary() command shows the output shape from the intermediate layers as a 3D tensor of size $(batch\ size, timesteps, output\ features)$.

In [10]:
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32))  # This last layer only returns the last outputs.
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, None, 32)          2080      
_________________________________________________________________
simple_rnn_5 (SimpleRNN)     (None, 32)                2080      
=================================================================
Total params: 328,320
Trainable params: 328,320
Non-trainable params: 0
_________________________________________________________________

Bi-Directional RNNs

In [10]:
#rnn38
nb_setup.images_hconcat(["DL_images/rnn38.png"], width=600)
Out[10]:

Figure rnn38 shows an important RNN architecture called Bi-Directional RNNS. As the name implies this system incorporates two sets of hidden layers. The first set is the conventional hidden layer, with connections going from left to right, while the second set of Hidden Layers has connections going in the opposite direction, from right to left. Unlike the first layer, the second layer is able to spot patterns in the reversed sequence. The final output of each layer incorporates information from both the forward and backward hidden layers, usually by concatenation of layers. Such a design can be used for the case when the input sequence is not being generated in real time and is most useful when processing text sequences and has been shown to substantially improve performance in many cases. Note that this architecture cannot be used for real time data sequences, since the at test time future values are not yet available.

The code snippet below shows how to program a Bi-Directional RNN in Keras. The Bi-Directional layer is invoked using a recurrent layer instance as its first argument, which processes the input sequence in the forward order. It creates a second layer which processes the input sequence in the reverse order, and then the final states of the two layers are concatenated and fed into the Dense layer.

In [11]:
from keras import layers
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN

model = Sequential()
model.add(layers.Embedding(10000, 32))
model.add(layers.Bidirectional(layers.SimpleRNN(32)))
model.add(layers.Dense(1,activation='sigmoid'))
model.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
=================================================================
Total params: 324,225
Trainable params: 324,225
Non-trainable params: 0
_________________________________________________________________

RNNs with Dropout

Recall that the Dropout technique is used to reduce the amount of Overfitting in a DLN model. Initial applications of Dropout to RNNs were not successful, and later it was discovered that it was due to the fact that a different random Dropout Mask was applied to successive stages of the RNN. Later was shown that if the Dropout Mask was kept fixed for each stage for a single forward/backward pass through the RNN, then indeed the amount of Overfitting was reduced. Keras supports this type of RNN Dropout, and it can be turned on by using two flags in the parameter list (see below): The dropout flag specifies that Dropout should be used in the input layer of the model, while the recurrent_dropout flag specifies that Dropout should be used in the recurrent layer as well.

Earlier in this chapter we saw that the SimpleRNN model when applied to the IMDB Movie Review problem resulted in a good deal of overfitting. In the example below we use Dropout as part of the model. The results show that indeed the Overfitting is pushed out to later in the training process, after 60-70 epochs as opposed to 10 epochs. However there is not much improvement in the best accuracy level, which actually has decreased.

In [12]:
from keras.layers import Dense
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32,
                    dropout = 0.2,
                    recurrent_dropout = 0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
In [13]:
history = model.fit(input_train, y_train,
                    epochs=100,
                    batch_size=128,
                    validation_split=0.2)
Epoch 1/100
157/157 [==============================] - 33s 203ms/step - loss: 0.7074 - acc: 0.5074 - val_loss: 0.6956 - val_acc: 0.5116
Epoch 2/100
157/157 [==============================] - 30s 193ms/step - loss: 0.6973 - acc: 0.5220 - val_loss: 0.6915 - val_acc: 0.5200
Epoch 3/100
157/157 [==============================] - 30s 193ms/step - loss: 0.6871 - acc: 0.5433 - val_loss: 0.6746 - val_acc: 0.5798
Epoch 4/100
157/157 [==============================] - 30s 191ms/step - loss: 0.6517 - acc: 0.6086 - val_loss: 0.7779 - val_acc: 0.5286
Epoch 5/100
157/157 [==============================] - 31s 195ms/step - loss: 0.6262 - acc: 0.6533 - val_loss: 0.6730 - val_acc: 0.5766
Epoch 6/100
157/157 [==============================] - 29s 186ms/step - loss: 0.4985 - acc: 0.7706 - val_loss: 0.5821 - val_acc: 0.7280
Epoch 7/100
157/157 [==============================] - 30s 193ms/step - loss: 0.4435 - acc: 0.8035 - val_loss: 0.5583 - val_acc: 0.7590
Epoch 8/100
157/157 [==============================] - 30s 192ms/step - loss: 0.4031 - acc: 0.8274 - val_loss: 0.7027 - val_acc: 0.7140
Epoch 9/100
157/157 [==============================] - 30s 193ms/step - loss: 0.3834 - acc: 0.8396 - val_loss: 0.7025 - val_acc: 0.7368
Epoch 10/100
157/157 [==============================] - 30s 193ms/step - loss: 0.3664 - acc: 0.8480 - val_loss: 0.4992 - val_acc: 0.8132
Epoch 11/100
157/157 [==============================] - 30s 193ms/step - loss: 0.3594 - acc: 0.8521 - val_loss: 0.7798 - val_acc: 0.7376
Epoch 12/100
157/157 [==============================] - 30s 193ms/step - loss: 0.3422 - acc: 0.8590 - val_loss: 0.5143 - val_acc: 0.8006
Epoch 13/100
157/157 [==============================] - 30s 192ms/step - loss: 0.3369 - acc: 0.8638 - val_loss: 0.4956 - val_acc: 0.8234
Epoch 14/100
157/157 [==============================] - 30s 193ms/step - loss: 0.3243 - acc: 0.8701 - val_loss: 0.4744 - val_acc: 0.8302
Epoch 15/100
157/157 [==============================] - 30s 193ms/step - loss: 0.3119 - acc: 0.8742 - val_loss: 0.4310 - val_acc: 0.8384
Epoch 16/100
157/157 [==============================] - 30s 194ms/step - loss: 0.2984 - acc: 0.8805 - val_loss: 0.4673 - val_acc: 0.8358
Epoch 17/100
157/157 [==============================] - 30s 193ms/step - loss: 0.2950 - acc: 0.8852 - val_loss: 0.8859 - val_acc: 0.7176
Epoch 18/100
157/157 [==============================] - 30s 194ms/step - loss: 0.3119 - acc: 0.8773 - val_loss: 0.4268 - val_acc: 0.8514
Epoch 19/100
157/157 [==============================] - 30s 193ms/step - loss: 0.2854 - acc: 0.8892 - val_loss: 0.4994 - val_acc: 0.8204
Epoch 20/100
157/157 [==============================] - 30s 193ms/step - loss: 0.2718 - acc: 0.8923 - val_loss: 0.5074 - val_acc: 0.8368
Epoch 21/100
157/157 [==============================] - 30s 193ms/step - loss: 0.2833 - acc: 0.8897 - val_loss: 0.6319 - val_acc: 0.7928
Epoch 22/100
157/157 [==============================] - 30s 192ms/step - loss: 0.2593 - acc: 0.9006 - val_loss: 0.4338 - val_acc: 0.8484
Epoch 23/100
157/157 [==============================] - 30s 194ms/step - loss: 0.2516 - acc: 0.9040 - val_loss: 0.5642 - val_acc: 0.8180
Epoch 24/100
157/157 [==============================] - 30s 194ms/step - loss: 0.2483 - acc: 0.9054 - val_loss: 0.4587 - val_acc: 0.8364
Epoch 25/100
157/157 [==============================] - 30s 193ms/step - loss: 0.2425 - acc: 0.9063 - val_loss: 0.6253 - val_acc: 0.8018
Epoch 26/100
157/157 [==============================] - 30s 193ms/step - loss: 0.2456 - acc: 0.9065 - val_loss: 0.5691 - val_acc: 0.8166
Epoch 27/100
157/157 [==============================] - 30s 193ms/step - loss: 0.2398 - acc: 0.9096 - val_loss: 0.4968 - val_acc: 0.8300
Epoch 28/100
157/157 [==============================] - 31s 195ms/step - loss: 0.2301 - acc: 0.9148 - val_loss: 0.5075 - val_acc: 0.8254
Epoch 29/100
157/157 [==============================] - 30s 194ms/step - loss: 0.2332 - acc: 0.9132 - val_loss: 0.5634 - val_acc: 0.8210
Epoch 30/100
157/157 [==============================] - 30s 190ms/step - loss: 0.2217 - acc: 0.9168 - val_loss: 0.5028 - val_acc: 0.8330
Epoch 31/100
157/157 [==============================] - 30s 193ms/step - loss: 0.2169 - acc: 0.9179 - val_loss: 0.5718 - val_acc: 0.8172
Epoch 32/100
157/157 [==============================] - 30s 193ms/step - loss: 0.2102 - acc: 0.9221 - val_loss: 0.4794 - val_acc: 0.8376
Epoch 33/100
157/157 [==============================] - 30s 192ms/step - loss: 0.2162 - acc: 0.9191 - val_loss: 0.4709 - val_acc: 0.8384
Epoch 34/100
157/157 [==============================] - 30s 194ms/step - loss: 0.2043 - acc: 0.9236 - val_loss: 0.5629 - val_acc: 0.8290
Epoch 35/100
157/157 [==============================] - 31s 195ms/step - loss: 0.2018 - acc: 0.9241 - val_loss: 0.5262 - val_acc: 0.8332
Epoch 36/100
157/157 [==============================] - 31s 201ms/step - loss: 0.1999 - acc: 0.9260 - val_loss: 0.5514 - val_acc: 0.8326
Epoch 37/100
157/157 [==============================] - 32s 204ms/step - loss: 0.1938 - acc: 0.9274 - val_loss: 0.5081 - val_acc: 0.8408
Epoch 38/100
157/157 [==============================] - 32s 201ms/step - loss: 0.2018 - acc: 0.9264 - val_loss: 0.5229 - val_acc: 0.8324
Epoch 39/100
157/157 [==============================] - 31s 200ms/step - loss: 0.1986 - acc: 0.9254 - val_loss: 0.5047 - val_acc: 0.8384
Epoch 40/100
157/157 [==============================] - 32s 202ms/step - loss: 0.1858 - acc: 0.9303 - val_loss: 0.5093 - val_acc: 0.8388
Epoch 41/100
157/157 [==============================] - 31s 195ms/step - loss: 0.1858 - acc: 0.9310 - val_loss: 0.6178 - val_acc: 0.8232
Epoch 42/100
157/157 [==============================] - 30s 191ms/step - loss: 0.1769 - acc: 0.9343 - val_loss: 0.5326 - val_acc: 0.8266
Epoch 43/100
157/157 [==============================] - 31s 200ms/step - loss: 0.1723 - acc: 0.9365 - val_loss: 0.5706 - val_acc: 0.8244
Epoch 44/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1699 - acc: 0.9380 - val_loss: 0.5882 - val_acc: 0.8222
Epoch 45/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1719 - acc: 0.9379 - val_loss: 0.6628 - val_acc: 0.8142
Epoch 46/100
157/157 [==============================] - 30s 192ms/step - loss: 0.1755 - acc: 0.9366 - val_loss: 0.6323 - val_acc: 0.8142
Epoch 47/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1712 - acc: 0.9363 - val_loss: 0.6126 - val_acc: 0.8226
Epoch 48/100
157/157 [==============================] - 30s 191ms/step - loss: 0.1695 - acc: 0.9359 - val_loss: 0.6578 - val_acc: 0.8236
Epoch 49/100
157/157 [==============================] - 31s 196ms/step - loss: 0.1623 - acc: 0.9420 - val_loss: 0.7442 - val_acc: 0.8004
Epoch 50/100
157/157 [==============================] - 34s 218ms/step - loss: 0.1626 - acc: 0.9394 - val_loss: 0.6454 - val_acc: 0.8156
Epoch 51/100
157/157 [==============================] - 31s 200ms/step - loss: 0.1605 - acc: 0.9421 - val_loss: 0.5724 - val_acc: 0.8154
Epoch 52/100
157/157 [==============================] - 31s 195ms/step - loss: 0.1582 - acc: 0.9438 - val_loss: 0.5717 - val_acc: 0.7942
Epoch 53/100
157/157 [==============================] - 30s 192ms/step - loss: 0.1585 - acc: 0.9425 - val_loss: 0.7980 - val_acc: 0.7984
Epoch 54/100
157/157 [==============================] - 30s 192ms/step - loss: 0.1533 - acc: 0.9459 - val_loss: 0.6275 - val_acc: 0.8184
Epoch 55/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1447 - acc: 0.9485 - val_loss: 0.7149 - val_acc: 0.8106
Epoch 56/100
157/157 [==============================] - 31s 197ms/step - loss: 0.1433 - acc: 0.9479 - val_loss: 0.6634 - val_acc: 0.8164
Epoch 57/100
157/157 [==============================] - 30s 192ms/step - loss: 0.1417 - acc: 0.9485 - val_loss: 0.6454 - val_acc: 0.8150
Epoch 58/100
157/157 [==============================] - 30s 192ms/step - loss: 0.1458 - acc: 0.9477 - val_loss: 0.6386 - val_acc: 0.8078
Epoch 59/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1371 - acc: 0.9517 - val_loss: 0.6825 - val_acc: 0.8112
Epoch 60/100
157/157 [==============================] - 30s 194ms/step - loss: 0.1419 - acc: 0.9505 - val_loss: 0.5936 - val_acc: 0.8124
Epoch 61/100
157/157 [==============================] - 30s 192ms/step - loss: 0.1361 - acc: 0.9520 - val_loss: 0.6631 - val_acc: 0.7912
Epoch 62/100
157/157 [==============================] - 30s 192ms/step - loss: 0.1379 - acc: 0.9520 - val_loss: 0.6753 - val_acc: 0.8132
Epoch 63/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1285 - acc: 0.9560 - val_loss: 0.7988 - val_acc: 0.8014
Epoch 64/100
157/157 [==============================] - 30s 192ms/step - loss: 0.1378 - acc: 0.9508 - val_loss: 0.6987 - val_acc: 0.8170
Epoch 65/100
157/157 [==============================] - 30s 192ms/step - loss: 0.1315 - acc: 0.9538 - val_loss: 0.6677 - val_acc: 0.8060
Epoch 66/100
157/157 [==============================] - 30s 194ms/step - loss: 0.1292 - acc: 0.9556 - val_loss: 0.9279 - val_acc: 0.7758
Epoch 67/100
157/157 [==============================] - 31s 196ms/step - loss: 0.1330 - acc: 0.9524 - val_loss: 0.6766 - val_acc: 0.8108
Epoch 68/100
157/157 [==============================] - 31s 195ms/step - loss: 0.1286 - acc: 0.9556 - val_loss: 0.6717 - val_acc: 0.8082
Epoch 69/100
157/157 [==============================] - 31s 195ms/step - loss: 0.1293 - acc: 0.9538 - val_loss: 0.6495 - val_acc: 0.8076
Epoch 70/100
157/157 [==============================] - 31s 196ms/step - loss: 0.1257 - acc: 0.9564 - val_loss: 0.6978 - val_acc: 0.8056
Epoch 71/100
157/157 [==============================] - 34s 218ms/step - loss: 0.1236 - acc: 0.9572 - val_loss: 0.6664 - val_acc: 0.8084
Epoch 72/100
157/157 [==============================] - 32s 203ms/step - loss: 0.1264 - acc: 0.9574 - val_loss: 0.7215 - val_acc: 0.8116
Epoch 73/100
157/157 [==============================] - 32s 203ms/step - loss: 0.1161 - acc: 0.9595 - val_loss: 0.6979 - val_acc: 0.8086
Epoch 74/100
157/157 [==============================] - 31s 196ms/step - loss: 0.1171 - acc: 0.9589 - val_loss: 0.6830 - val_acc: 0.8080
Epoch 75/100
157/157 [==============================] - 32s 203ms/step - loss: 0.1219 - acc: 0.9566 - val_loss: 0.6877 - val_acc: 0.8152
Epoch 76/100
157/157 [==============================] - 33s 208ms/step - loss: 0.1227 - acc: 0.9554 - val_loss: 0.7327 - val_acc: 0.8090
Epoch 77/100
157/157 [==============================] - 32s 207ms/step - loss: 0.1138 - acc: 0.9608 - val_loss: 0.6984 - val_acc: 0.8118
Epoch 78/100
157/157 [==============================] - 32s 204ms/step - loss: 0.1205 - acc: 0.9567 - val_loss: 0.7042 - val_acc: 0.8108
Epoch 79/100
157/157 [==============================] - 30s 194ms/step - loss: 0.1214 - acc: 0.9589 - val_loss: 0.6767 - val_acc: 0.8074
Epoch 80/100
157/157 [==============================] - 30s 191ms/step - loss: 0.1108 - acc: 0.9621 - val_loss: 0.8082 - val_acc: 0.8044
Epoch 81/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1118 - acc: 0.9602 - val_loss: 0.7464 - val_acc: 0.8096
Epoch 82/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1100 - acc: 0.9624 - val_loss: 0.8293 - val_acc: 0.7924
Epoch 83/100
157/157 [==============================] - 30s 194ms/step - loss: 0.1138 - acc: 0.9599 - val_loss: 0.7283 - val_acc: 0.8074
Epoch 84/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1121 - acc: 0.9625 - val_loss: 0.7342 - val_acc: 0.8024
Epoch 85/100
157/157 [==============================] - 30s 194ms/step - loss: 0.1046 - acc: 0.9656 - val_loss: 0.7643 - val_acc: 0.8028
Epoch 86/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1015 - acc: 0.9650 - val_loss: 0.7842 - val_acc: 0.7974
Epoch 87/100
157/157 [==============================] - 30s 194ms/step - loss: 0.1001 - acc: 0.9650 - val_loss: 0.8347 - val_acc: 0.8010
Epoch 88/100
157/157 [==============================] - 31s 195ms/step - loss: 0.1074 - acc: 0.9632 - val_loss: 0.7701 - val_acc: 0.8038
Epoch 89/100
157/157 [==============================] - 30s 192ms/step - loss: 0.1060 - acc: 0.9622 - val_loss: 0.8084 - val_acc: 0.7986
Epoch 90/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1034 - acc: 0.9651 - val_loss: 0.7394 - val_acc: 0.7998
Epoch 91/100
157/157 [==============================] - 30s 192ms/step - loss: 0.1044 - acc: 0.9639 - val_loss: 0.7577 - val_acc: 0.7928
Epoch 92/100
157/157 [==============================] - 30s 193ms/step - loss: 0.1013 - acc: 0.9651 - val_loss: 0.8262 - val_acc: 0.7910
Epoch 93/100
157/157 [==============================] - 31s 196ms/step - loss: 0.0989 - acc: 0.9657 - val_loss: 0.8126 - val_acc: 0.7980
Epoch 94/100
157/157 [==============================] - 33s 209ms/step - loss: 0.0982 - acc: 0.9669 - val_loss: 0.8116 - val_acc: 0.7920
Epoch 95/100
157/157 [==============================] - 31s 201ms/step - loss: 0.1006 - acc: 0.9658 - val_loss: 0.8379 - val_acc: 0.7996
Epoch 96/100
157/157 [==============================] - 32s 203ms/step - loss: 0.1018 - acc: 0.9652 - val_loss: 0.7555 - val_acc: 0.7828
Epoch 97/100
157/157 [==============================] - 31s 197ms/step - loss: 0.0978 - acc: 0.9668 - val_loss: 0.8253 - val_acc: 0.7904
Epoch 98/100
157/157 [==============================] - 31s 199ms/step - loss: 0.0930 - acc: 0.9688 - val_loss: 0.8245 - val_acc: 0.7940
Epoch 99/100
157/157 [==============================] - 31s 196ms/step - loss: 0.0917 - acc: 0.9683 - val_loss: 0.8288 - val_acc: 0.7872
Epoch 100/100
157/157 [==============================] - 31s 194ms/step - loss: 0.0963 - acc: 0.9669 - val_loss: 0.7995 - val_acc: 0.7976
In [14]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

Training RNNs: The Back Propagation Through Time (BPTT) Algorithm

The Backprop algorithm that was described in Chapter TrainingNNsBackprop can be modified in order to train RNNs, and this forms the subject of this section. We start by setting up the optimization problem, which as before, is to minimize the Loss Function $L$. Recall that $L$ is a function of of the labeled Training Sequence $\{(X(m),T(m))\}_{m=1}^{M}$ and the network parameters $(W,b)$ while $\mathcal L(m)$ is the Cross Entropy Loss Function for the $m^{th}$ training pair, given by

$$ \mathcal L(m) = -\sum_{k=1}^{K}t_k(m)\log y_k(m) $$

In this equation $K$ is the number of output categories, $t_k(m), 1\le k\le K$ is the Ground Truth vector and $y_k(m), 1\le k\le K$ are the classification probabilities for the $m^{th}$ training pair, given by

$$ y_k(m) = \frac{\exp(a_k(m))}{\sum_{j=1}^K \exp(a_j(m))},\ \ 1\le k\le K $$

where $a_k(m), 1\le k\le K$ are the corresponding logits. The overall Loss function can then be written as

$$ L={1\over M}\sum_{m=1}^M \mathcal L(m)= -{1\over M} \sum_{m=1}^M\sum_{k=1}^K t_k(m)\log y_k(m) $$

The modification to the Backprop algorithm to train RNNs is called Back Propagation Through Time (BPTT), and in the following two sub-sections we describe the forward and backward passes for this algorithm.

BPTT Forward Pass

In [19]:
#rnn41
nb_setup.images_hconcat(["DL_images/rnn41.png"], width=600)
Out[19]:

Consider a RNN with R stages, such that the (R+1)rst layer is the Logit Layer. The forward pass of the RNN backprop algorithm for a single sample of the input data $(X_1,...,X_R)$, proceeds as follows (refer to Figure rnn41):

The first input vector in the sample, $X_1$ is applied to the system, resulting in the hidden state activation vector $Z_1$ as per the formulae:

$$ Z_1 = f(UX_1) $$

The second input vector $X_2$ and the hidden state vector $Z_1$ is applied to the system, resulting in the hidden state activation vector $Z_1$:

$$ Z_2 = f(WZ_1 + UX_2) $$

Step 2 is repeated an additional $R-2$ times to compute the vectors $Z_r, 3\le r\le R$ until the remaining input vectors from the sample $(X_3,...,X_R)$ have been processed.

After the last stage, the Logits $A_{R+1}$ are computed as a function of the final Hidden State $Z_R$:

$$ A_{R+1} = VZ_R $$

and the output probabilities are given by

$$ Y = softmax(A_{R+1}) $$

This constitutes the forward pass of the BPTT algorithm for a single sample of the input dataset. In practice $B$ samples are sent through the RNN at the same time when the calculations are done in the batch mode.

BPTT Backward Pass

In [20]:
#rnn42
nb_setup.images_hconcat(["DL_images/rnn42.png"], width=600)
Out[20]:

Once again consider a RNN with R stages, such that the (R+1)rst layer is the Logit Layer. Define the following gradients for the R stages in the RNN, with P nodes per stage:

$$ \Delta_r = ({{\partial\mathcal L}\over{\partial a^1_r}},...,{{\partial\mathcal L}\over{\partial a^P _r}}) = {{\partial\mathcal L}\over{\partial A_r}},\ 1\le r\le R $$

Since there are K nodes in the Logit Layer, the gradient is defined by:

$$ \Delta_{R+1} = ({{\partial\mathcal L}\over{\partial a^1_{R+1}}},...,{{\partial\mathcal L}\over{\partial a^K _{R+1}}}) = {{\partial\mathcal L}\over{\partial A_{R+1}}} $$

The Backward Pass starts with the computation of the gradient for the Logit Layer given by:

$$ \Delta_{R+1} = Y - T $$

The gradients for the other stages is then computed iteratively in a backwards fashion from stage $R$ down to stage 1, using the equation

$$ \Delta_{r} = f'(A^{r})\odot (W^{r+1})^T \Delta_{r+1},\ \ 1\le m\le R $$

The derivation of these equations is the same as for Dense Feed Forward Networks and was explained in Chapter TrainingNNsBackprop.

The gradient of the Loss Function with respect to the weight matrices $U, V, W$ is then given by the sum of the gradients across each of the stages:

$$ \frac{\partial L}{\partial U} = X_1^T\Delta_1 + ... + X_M^R\Delta_R $$

$$ \frac{\partial L}{\partial V} = Z_R^T\Delta_{R+1} $$

$$ \frac{\partial L}{\partial W} = Z_1^T\Delta_2 + ... + Z_{R-1}^T\Delta_R $$

These calculations are illustrated for a RNN with $R=3$ in Figure rnn42.

Truncated Back Propagation through Time

In [29]:
#rnn20
nb_setup.images_hconcat(["DL_images/rnn20.png"], width=800)
Out[29]:

Given the structure of RNNs, it is possible to process a single training sample of size $R$, $(X_r,T_r),r = 1,..,R$ using a RNN with R stages. However if R is large, the RNN is not able to learn from data in the initial values of the input sample. In practice once R exceeds 20-30 stages, the model performance no longer improves. These issue, which is due to the vanishing gradient problem, is treated in greater detail later in the next section, but for now we describe a training technique that is commonly used to mitigate this problem called Truncated BPTT.

This technique is illustrated in Figure rnn20 and the basic idea is to break up a large sample into smaller segments, and then run BPTT on each segment separately. Part (a) of the figure shows a RNN with 6 stages. In order to apply Truncated BPTT we split it into two segments of size 3 and Parts (b) and (c) show BPTT being aplied to these segments. However in order to create a connection between successive segments, the final Hidden State vector $Z$ from segment $n$ is used to initialize the First Hidden State in Segmet $n+1$ as shown. It has been observed experimentally that this technique is effective in detecting correlations between inputs that are located more than a segment-size apart.

Difficulties with Backprop in RNNs

In [30]:
#rnn43
nb_setup.images_hconcat(["DL_images/rnn43.png"], width=800)
Out[30]:

When the BPTT algorithm, is applied to a RNN with a large number of stages, then we run into some problems that were first encountered when training Dense Feed Forward Networks, but they have more serious implications for RNNs:

  • The Vanishing Gradient Problem: As a result of the Vanishing Gradient Problem, the gradients $\Delta_i$ in the initial stages of the network become progressively smaller and get close to zero during Backprop, as the number of RNN stages increases.

  • The Exploding Gradient Problem: This leads to the opposite effect, i.e., the gradients $\delta_i$ become larger and larger as we add more RNN stages.

The causes of these problems can be understood by analysing the recursion equations for the gradients, and we proceed to do this next. The gist of the argument can be understood by considering a simple scalar sequence $\{x_n\}$ generated by the recursion:

$$ x_{n+1} = ax_{n} \ \ {\mbox so}\ \ that\ \ x_{n} = a^n x_0,\ \ n=1,2,... $$

Clearly if $a<1$ then the sequence converges to zero, while if $a>1$, then the sequence blows up to infinity. In the case of RNNs, the BPTT recursion equations for the gradients $\Delta_t = \frac{\partial L}{\partial A_t}$ are given by the following equations:

$$ \Delta_t = f'(A_t)\odot W^T\Delta_{t+1} $$

where $W^T$ is the transpose of the weight matrix $W$. Ignoring the derivative $f'$ (and noting that since with ReLUs, $f'$ is either $1$ or $0$, this is a reasonable assumption), it follows that for $\tau$ RNN stages

$$ \Delta(\tau-n) = (W^T)^n\ \Delta_{\tau},\ \ n = 1,2,...,\tau-1 $$

If the matrix $W$ can be decomposed as

$$ W = R\Lambda R^T $$

where $R$ is an orthogonal matrix, and $\Lambda$ is a diagonal matrix with the eigenvalues $\lambda_j$ at the $j^{th}$ position along the diagonal. The (backwards) recursion then becomes

$$ \Delta(\tau-n) = R^T\Lambda^n R\ \ \Delta_{\tau},\ \ n = 1, 2,...,\tau-1 $$

Note that the diagonal matrix $\Lambda^n$ has diagonal components $\lambda_j^n$, and as $n$ increases, they either converge to $0$ if $\lambda_j < 1$ (resulting in Vanishing Gradients) or go to infinity if $\lambda > 1$ (resulting in Exploding Gradients). Hence for larger values of $\tau$, this will result in the gradients $\Delta_i$ of the early stages converging to zero.

In the previous section we derived the following expression for the gradient of the Loss Function with respect to the weight parameters $W$:

$$ \frac{\partial L}{\partial U} = \sum_{t=1}^{\tau} X_t^T\Delta_t $$

Combining this equation with the prior one, it follows that the inputs $X_t$ from the early stages of the network will have limited or no effect on the gradient $\frac{\partial L}{\partial U}$. This is known as the problem of Long Term Dependencies in RNNs, i.e., in a RNN with a large number of stages, the early training samples are not able to influence the gradients, and thus the learning process.

Figure rnn43 shows an example of this analsys for a 3 stage RNN network.

Solutions to the Vanishing and Exploding Gradients Problems

In [8]:
#rnn4
nb_setup.images_hconcat(["DL_images/rnn4.png"], width=600)
Out[8]:

The Exploding Gradient problem is usually addressed by a technique known as Gradient Clipping. As the name implies, this involves clipping the gradients $\frac{\partial L}{\partial W}$ whenever they explode,i.e., if their value exceeds some pre-set threshold, then it is set back to a small number. Defining $$ {\hat g}=\frac{\partial L}{\partial W} $$ If $||{\hat g}|| \ge threshold$ then $$ {\hat g} \leftarrow \frac{threshold}{||{\hat g}||}{\hat g} $$ The effect of Gradient Clipping is illustrated in Figure rnn4, in which the Loss Function is plotted as a function of the network parameters $(w,b)$. The solid blue arrows show the progression of the Loss Function with successive optimization steps. When the optimization hits the high Loss Function wall, then the next gradient will be extremely large, and in the absence of Gradient Clipping, the Loss Function is pushed off to a distant location on the Loss Function surface. On the other hand, with Gradient Clipping the optimization proceeds along the dashed line, where we can see that the Loss Function is pulled back to an area that is closer to its original location.

The Vanishing Gradient problem can be (partially) addressed by using Truncated BPTT which restricts the number of stages over which BPTT is run to twenty or less. This technique doesn't work well in all cases and in general the Vanishing Gradient problem is more difficult to solve using traditional RNNs. In the next section we describe an alternative RNN architecture which was designed with the objective of solving this problem.

Long Short Term Memories (LSTMs)

In [33]:
#rnn44
nb_setup.images_hconcat(["DL_images/rnn44.png"], width=1500)
Out[33]:

LSTMs were designed with the objective of solving the Vanishing Gradients Problem that plagues RNNs. They have been very successful in this regard, indeed almost all of the successful applications of RNNs in recent years have been with LSTMs rather than with plain RNNs. The reason why RNNs suffer from the Vanishing Gradients problem is that successive hidden states in the RNN are computed using a (matrix) multiplication operation as on $A_2 = WZ_1+UX_2$, as illustrated in Part (a) of Figure rnn44. The basic change that the creators of LSTM made, was to replace this multiplication by an addition. This is illustrated in Part (b) of Figure rnn44 and explained next:

  • LSTMs introduce a new entity called the "Cell State" which is also a vector valued quantity associated with each stage (in the above figure the Cell State is a vector os size 32). The basic recursion between stages is now carried out using the Cell State and not the Hidden State Z. Indeed as shown in the figure, the Hidden State values are derived as a function of the Cell State, as in $Z =\tanh C$.
  • The Cell State recursion is additive $C_2 = C_1 + \tanh(WZ_1 + UX_2)$, which prevents it from running into the Vanishing Gradients problem. Note that the tanh function assumes values between +1 and -1. Assuming that it is saturated most of the time, we can see that each component of the Cell State goes up by +1 or down by -1 from one stage to the other. This is illustrated in Figure rnn23 which graphs the variation of a single component in the Cell State vector as a function of the number of stages. Focus on the two circles at either end of the figure: The circle on the left shows the cell state increasing while the one of the right shows the cell state decreasing with sumber of stages.
  • This architecture also ensures that gradients get propagated witout attenuation. This is explained by the fact that an addition operation causes gradients to be broadcast across all inputs as shown in Chapter TrainingNNsBackprop.

In addition to the basic increment and decrement operations, we would also like to increase the flexibility of our architecture by supporting features such as (1) Holding the Cell State to a constant value, or (2) Resetting the Cell State value. Implementation of these requires the addition of Gates to the architecture and this is explained next.

In [26]:
#rnn23
nb_setup.images_hconcat(["DL_images/rnn23.png"], width=600)
Out[26]:

There are three types of Gates that are used in LSTMs:

  1. Input Gate: This is illustrated in Part (a) of Figure rnn45. The Input Gate is implemented using the vector $i_t$ (at stage $t$). This vector is generated by using the equation $$ i_{t+1} = \sigma(W_iZ_t + U_iX_{t+1}) $$ Recall that the sigma function varies in the range [0,1] and assuming that it is operating in the saturation mode, we can see how $i_t$ can act as a switch. This is an intelligent switch whose operation is governed by the previous Hidden State $Z_t$ and the current input $X_{t+1}$, as well as by the parameters $W_i$ and $U_i$ whose values are learnt as part of the training process. This switch controls the Cell State propagation using the equation $$ C_{t+1} = C_t + i_{t+1}\odot\tanh(WZ_t + UX_{t+1}) $$ When the Input Gate is closed i.e., $i_{t+1} = 0$, then $C_{t+1} = C_t$, i.e. all the components of the Cell State vector are passed on unchanged to the next stage. On the other hand when the Input Gate is open, i.e., $i_{t+1} = 1$, then $C_{t+1} = C_t + \tanh(WZ_t + UX_{t+1})$, so that the Cell State vector changes in response to the Hidden State and Input Vector values by +1 or -1.

  2. Forget Gate: This is illustrated in Part (b) of Figure rnn45 (along with the Input Gate). The Forget Gate is implemented using the vector $f_t$, which is generated using the equation $$ f_{t+1} = \sigma(W_fZ_t + U_fX_{t+1}) $$ The Forget Gate controls the Cell State propagation using the equation (for simplicity we have left out the Input Gate) $$ C_{t+1} = f_{t+1}\odot C_t + \tanh(WZ_t + UX_{t+1}) $$ When the Forget Gate is closed i.e., $f_{t+1} = 0$, then $C_{t+1} = \tanh(WZ_t + UX_{t+1})$, i.e., the Cell State vector is reset to +1 or -1 and it looses memory of the prior Cell State $C_t$. On the other hand when the Forget Gate is open, i.e., $f_{t+1} = 1$, then $C_{t+1} = C_t + \tanh(WZ_t + UX_{t+1})$, so that the Cell State vector changes in response to both the prior state $C_t$, as well as the Hidden State and Input Vector values.

  3. Output Gate: This is illustrated in Part (c) of Figure rnn45 (along with the Input and Forget Gates). The Output Gate is implemented using the vector $o_t$, which is generated using the equation $$ o_{t+1} = \sigma(W_oZ_t + U_oX_{t+1}) $$ Unlike the Input and Forget Gates, the Output Gate does not directly influence the Cell State propagation, instead it controls whether the Cell State is copied onto the Hidden State, by using the equation $$ Z_{t+1} = o_{t+1}\odot \tanh(VC_{t+1}) $$ Of these three gates, the Input and Forget Gates have been found essential to the operation of the LSTM, while the Output Gate is of lesser importance.

In [24]:
#rnn45
nb_setup.images_hconcat(["DL_images/rnn45.png"], width=1200)
Out[24]:

Putting all the equations for the LSTM together:

The Cell State evolves according to

$$ C_{t+1} = f_{t+1}\odot C_t + i_{t+1}\odot\tanh(WZ_t + UX_{t+1}) $$

The corresponding Hidden State is given by

$$ Z_{t+1} = o_{t+1}\odot \tanh(VC_{t+1}) $$

where the gates are given by:

$$ i_{t+1} = \sigma(W_iZ_t + U_iX_{t+1}) $$

$$ f_{t+1} = \sigma(W_fZ_t + U_fX_{t+1}) $$

$$ o_{t+1} = \sigma(W_oZ_t + U_oX_{t+1}) $$

Going back to Figure rnn23, we can readily see how the flat and reset portions of the curve can be generated by appropriate choice of the input and forget gates.

From this analysis it is clear that LSTMs are much more effective in retaining memory of past inputs as compared to RNNs. Their structure gives them a fine level of control over component of their Cell State vector. They can propagate a state value indefinitly into the future by setting the gates $f_t = 1, i_t = 0$, as the analysis shows.

An example of this is shown in Figure rnn25 which shows the value of a single component of the Cell State vector in a LSTM system that was trained to generate sentences in English (the input into this network are characters that have already been generated, so that it uses this information to generate the next caharacter). This cell state component was in the ON state (shown in blue) whenever the system is generating characters that lie in between quotes, and is OFF otherwise (shown in red). This illustrates the LSTMs ability to change its state and hold it in response to the input values.

In [12]:
#rnn25
nb_setup.images_hconcat(["DL_images/rnn25.png"], width=600)
Out[12]:

There is an interesting similarity between LSTMs and the best performing ConvNet architecture called ResNets (see Chapter ConvNetsPart2), since both feature the additive interaction design. Not all aspects of why this is beneficial are understood at this point, and this topic forms an active area of research.

In [13]:
#rnn24
nb_setup.images_hconcat(["DL_images/rnn24.png"], width=600)
Out[13]:

Gated Recurrent Units (GRUs)

In [14]:
#rnn6
nb_setup.images_hconcat(["DL_images/rnn6.png"], width=600)
Out[14]:

GRUs are a recently proposed alternative to LSTMs, that share its good properties, i.e., they are also designed to avoid the Vanishing Gradient problem (see Figure rnn6). Even though they feature fewer parameters than LSTMs, their performance has been experimentally seen to be about the same. The main difference from LSTMs is that GRUs don't have a cell memory state $C_t$, but instead are able to function effectively using the Hidden State $Z_t$ alone.

The design of the GRU features two gates, the Update Gate $o_t$, given by: \begin{equation} o_t = \sigma(W^o X_t + U^o Z_{t-1}) \quad \quad (**GRU1**) \end{equation} and the Reset Gate $r_t$ given by: \begin{equation} r_t = \sigma(W^r X_t + U^r Z_{t-1}) \quad \quad (**GRU2**) \end{equation}

The Hidden State vector $Z_{t-1}$ from the prior stage is combined with the Input vector $X_t$ from the current stage, to generate an intermediate Hidden State value $\tilde Z_t$ given by: \begin{equation} {\tilde Z}_t = \tanh(r_t\odot U Z_{t-1} + W X_t) \quad \quad (**GRU3**) \end{equation}

Finally the new Hidden State $Z_t$ is generated by combining the prior Hidden State $Z_{t-1}$ with the intermediate value $\tilde Z_t$ as follows: \begin{equation} Z_t = (1 - o_t)\odot {\tilde Z}_t + o_t\odot Z_{t-1} \quad \quad (**GRU4**) \end{equation}

In order to understand the workings of the GRU, we once again assume that both the Update and Reset gates are operating in the saturated regime, so that they assume values in the set $\{0,1\}$. It is easy to see that each component of $Z_t$ vector takes on values from the set $\{+1,0,-1\}$ and from Equations (GRU3) and (GRU4) one can see that it can exhibit the following behaviors:

  1. It changes in value in response to new inputs $X_t$: This happens when $o_t = 0, r_t = 1$, and $Z_t = \tilde Z_t = \tanh(U Z_{t-1} + W X_t)$. $Z_t$ can potentially flip between $+1$ and $-1$ in response to changes in $X_t$ (or vary continuously if it is in the linear portion of the tanh curve).
  2. It holds the old value from the previous stage: This happens when $o_t = 1$, and results in $Z_t = Z_{t-1}$.
  3. It resets to the value of the new input: This happens when $o_t = 0, r_t = 0$ and results in $Z_t = \tilde Z_t = \tanh(W X_t)$. Note that this is a full reset since all memory of prior states is lost.

The reader will notice that there are similarities in the behavior of $Z_t$ in GRUs and LSTMs, in the sense both have modes in which $Z_t$ maintains its prior value as well, as well as modes in which it changes value based on current input or gets reset and looses memory of prior states. However GRUs are able to do this using a smaller number of parameters. The GRU design also provides a path for the gradients with respect to $Z$ to flow backwards without being subject to matrix multiplication, as an application of the Gradient Flow rules to Figure rnn6 reveals. This gradient flow is gated by $o_t$, such that if $o_t = 0$ at a cell, then the gradient is not propagated backwards.

LSTMs in Keras

LSTMs can be invoked in Keras simply by replacing SimpleRNN with LSTM (or GRU). We solve the IMDB Moview Review problem using a single layer LSTM model below. Note the following:

  • The number of parameters in the LSTM layer is exactly four time the number for the SimpleRNN layer. This is account by the fact that 3 extra payers of matrices $(W_i,U_i), (W_f,U_f), (W_o,U_o)$ each of the same size as the $(W,U)$ matrices, are required for the LSTM model
  • The execution is time is longer for LSTM, about 74 sec/epoch as opposed to about 30sec/epoch for the Simple RNN. This is also accounted for by the extra computations for all the Gates in each LSTM cell
  • There is an increase in accuracy in the LSTM model.
In [29]:
#rnn47
nb_setup.images_hconcat(["DL_images/rnn47.png"], width=600)
Out[29]:
In [15]:
from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000  # number of words to consider as features
maxlen = 500  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')

print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
input_train shape: (25000, 500)
input_test shape: (25000, 500)
In [16]:
from keras.layers import LSTM

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
In [ ]:
history = model.fit(input_train, y_train,
                    epochs=50,
                    batch_size=128,
                    validation_split=0.2)
In [17]:
model.summary()
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_5 (Embedding)      (None, None, 32)          320000    
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33        
=================================================================
Total params: 328,353
Trainable params: 328,353
Non-trainable params: 0
_________________________________________________________________
In [36]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()