Transformers

In [2]:
from ipypublish import nb_setup

Introduction

In [3]:
#trans4
nb_setup.images_hconcat(["DL_images/trans4.png"], width=800)
Out[3]:

Transformers are a new kind of Neural Network architecture that was introduced in 2017 by Vaswani et al. They were originally targeted at NLP applications, but since then they have been successfully applied to Image Processing as well. In NLP they overcome some of the difficulties with RNN/LSTMs, and result in much improved performance in applications such as Machine Translation. In addition, they are much better at Transfer Learning for NLP, so that Transformer models trained on huge amounts of data can be fine-tuned and used for smaller datasets, just like ConvNets.

Transformers were originally used to do Machine Translation, which is where the name comes from (they "Transform" a sentence in language 1 to language 2). Since then they have been successfully applied to other NLP applications such as Classification and Language Modeling. Indeed it was soon realized that Transformers are a versatile general purpose tool, which can be used in any Machine Learning application, as long as the input can be formatted in a way that can be ingested by them. Recently Transformers have been used for Image Processing tasks, where they have been shown to perform better than the best performing ConvNets.

In contrast to older architectures, Transformers have a much higher modeling capacity, which also means it is possible to scale up these models to very large sizes (see Figure trans4) with corresponding improvements in performance. Indeed some of the more recent Transformer models have hundreds of billions of parameters, which take days to train even with powerful computing infrastructure. Recall that models with a large number of parameters require a correspondingly large amount of training data in order to avoid the overfitting problem. In the case of Transformers this problem was addressed by using self-supervised learning on massive text datasets.

This process, whereby larger and larger models trained on bigger and bigger datasets lead to better and better performance, has not reached its limits yet. Indeed, the latest Transformer features over a trillion parameters!

Why RNNs are not Good Enough

In [4]:
#rnn38
nb_setup.images_hconcat(["DL_images/rnn38.png"], width=600)
Out[4]:

Consider the Bi-Directional RNN shown in Figure rnn38. Assume that the data being fed into the network consists of NLP sequences, with the sequence $X_i, 1\le i\le N$ being the embedding or representation for the corresponding word sequence. Note that this representation does not take into account the surrounding context from the other words in the sentence. The corresponding hidden state sequence ${Z1_i}, 1\le i\le N$, can be considered to be the new representation for the sequence ${X_i}, 1\le i\le N$, such that the representation $Z1_i$ is modified by the words $X_1,...,X_i$ that came at or before $Z1_i$. In a Bi-Directional RNN, each word $X_i$ has two such representations, with the representation $Z1_i$ modified by the words $X_1,...,X_i$, and the representation $Z2_i$ modified by the words $X_{i+1},...,X_N$. Can these word representations be further improved? This RNN model has some shortfalls in this area:

  • RNN representations such as $Z1_i$ or $Z2_i$ for the word $X_i$ are most influenced by other words that are in $X_i$'s immediate neighborhood. The influence from words that are further away becomes progressively smaller due to the Vanishing Gradient problem (this is less of an issue in LSTMs, but the problem does not completely go away). It is well known that sentences contain patterns that are strongly non-local, and these are not well captured by RNN/LSTM type models.

  • Lack of Parallelizability: Note that future RNN states cannot be computed before past RNN states have been computed. This is a significant restriction, since modern GPUs are capable of performing multiple independent computations at the same time. This causes a significant slowdown when training on very large datasets.

In order to develop better NLP models, it is important to come up with systems that are able to develop even better context dependent representations for words by allowing all the words in a sequence to interact with each other.

Just as NLP models can be improved by allowing better interactions between all the words in a sentence, as opposed to only the nearby words, it turns out that exactly the same reasoning can be applied to Image Processing as well. If we replace the word embeddings in NLP by patch embeddings in Image Processing (as we will see later in this chapter), then ConvNets can be considered to be a special type of model in which only the neighboring patches are allowed to influence each other's representations. Better Image Processing models can be developed which allow ALL the patches in an image to interact with each other, which is the main idea behind a class of Transformer models for Image Processing called Vision Transformers (or ViT).

In [5]:
#trans1
nb_setup.images_hconcat(["DL_images/trans1.png"], width=800)
Out[5]:

Introducing Self Attention

When we introduced the concept of Attention in the previous chapter (see Figure rnn65 in Chapter NLP), it was in the context of the Encoder-Decoder architecture, with Attention being used so that the words in the Encoder can influence the representation of words in the Decoder. This was referred to as Cross Attention. We are going to take this idea and modify it in such a way that the Attention mechanism can be used to allow words in the same sentence to modify each other's representations, which is called Self Attention.

A possible way in which this can be done is shown in Figure trans1. The figure shows the embedded word sequence in the bottom layer followed by two layers of Self Attention. In each Self Attention layer, the representation of each of the words is modified by every other word in the sequence by using the Self Attention mechanism (an example of the connections from one of the words is shown, but the same connections exist for all the other words as well). This process can be repeated with multiple layers, with the output of layer $i$ serving as the input to layer $i+1$, as shown in the figure. The idea behind this architecture is that after several layers of Self Attention, each word develops a representation that takes into account all the other words that exist in the sentence. The exact calculations used to implement Self Attention are described in detail in the following section. Also note that within a layer, the calculation for each word depends only on the outputs of the previous layer and not on the other outputs of the same layer, which means that unlike RNNs, the words in the entire input sequence can be processed in parallel.

The idea of multilayer Self Attention as shown in the figure may remind you of the Dense Feed Forward Networks (DFN) that we encountered several chapters ago, with their dense node to node connections and multiple layers. Indeed, if we were to replace the embedding vectors with scalars (ignoring the attention calculations for a moment), then it becomes identical to the DFN architecture. From this perspective, the Multilayer Self Attention architecture can be considered to be an extension of the DFN architecture to 2D tensor or vector inputs. Hence instead of starting with a scalar sequence and transforming it with matrix multiplications in each layer (as in DFNs), we start with a vector sequence, and use the Attention mechanism to transform it into another vector sequence. This raises the interesting possibility that there should be other ways in which vector inputs can be transformed. Indeed, in addition to RNNs and LSTMs, we have already come across two ways in which this can be done:

  • By using regular 2D ConvNets: We can treat the vector sequence as an 'image' and then process it by using a 2D ConvNet in the usual way

  • By using 1D ConvNets: This is a more natural way to process vector sequences, and indeed in Chapter ConvNets Part 1 we saw that their performance is close to that of LSTMs when processing NLP data.

The existence of so many different ways of processing vector sequences shows that the common factor in all these designs is a mechanism for mixing the individual elements of the vectors along both the row-wise and column-wise axes, and there are multiple ways in which this can be done. What sets Transformers apart from these other ways of processing vector sequences is that they have a much higher model capacity, with the ability to scale up to models with hundreds of billions of parameters. This allows them to capture and model much more complex patterns. The amount of complexity in language data or image data is such that LSTM or ConvNet models are not able to capture all the interconnected patterns that exist in them and thus are capacity limited.

Transformer Architecture

A Transformer consists of a set of identical modules that are stacked in a serial fashion. Note that each module has its own set of parameters. The block level structure of each module, as shown in Figure rnn87(a), consists of a Self-Attention layer followed by a Feed Forward layer.

In [6]:
#rnn87
nb_setup.images_hconcat(["DL_images/rnn87.png"], width=800)
Out[6]:

Part (b) of Figure rnn87 shows the progress of an input sequence (of vectors) as it traverses a module. The input $(x_1,x_2)$ is first processed by the Self Attention layer and results in the sequence $(z_1,z_2)$. This is further processed by the Dense Feed Forward Layer and the sequence $(r_1,r_2)$ is the final output for this layer. Note that each member of the sequence is propagated separately through the Self Attention and Dense layers and weight parameters are shared across all of them (each Encoder Layer has its own set of parameters though). Within a layer, the calculations for $x_1$ do not depend on the outputs computed for $x_2$, so both can proceed in parallel.

Let's first examine the Self Attention layer in greater detail:

In [7]:
#trans2
nb_setup.images_hconcat(["DL_images/trans2.png"], width=800)
Out[7]:

In Figure trans2 we show how to go from the input $X_3$ to the output $Z_3$ of the Self Attention layer. Note that $Z_3$ is a measure of the Self Attention that $X_3$ pays to the other vectors $(X_1, X_2)$ in the input sequence. The simplest way to compute the Self Attention between two vectors is by taking their dot product, and this was the technique used for RNN based Cross Attention in the prior chapter. If we carry out this procedure, then the Self Attention between vectors $X_i$ and $X_j$ is given by $A_{ij} = X_i\cdot X_j$. These numbers can then be converted into weights $$ w_{ij} = {{e^{A_{ij}}}\over{\sum_j{e^{A_{ij}}}}},\ \ i,j = 1,2,...,N $$ The output vector $Z_{i}$ is computed as a weighted sum of the input vectors $X_j$ $$ Z_{i} = \sum_j w_{ij} X_j $$ This procedure represents the core of the Self Attention based approach and it worked well for the RNN Cross Attention design, but note that it has the following drawback: There are no learnable parameters in these computations. We would give the Neural Network more flexibility (and thus greater capacity) if we were to change the procedure in a way that allows the network to modify the weights $w_{ij}$ during the course of the training process. In order to do so, we define a new Self Attention procedure, fundamental to which are the following three vector sequences:
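The parameter-free dot product procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name simple_self_attention and the random test input are assumptions made for the example, and the row maximum is subtracted before exponentiation purely for numerical stability (it does not change the weights).

```python
import numpy as np

def simple_self_attention(X):
    """Parameter-free Self Attention. X has shape (N, d), one row per input vector."""
    A = X @ X.T                              # A[i, j] = X_i . X_j
    A = A - A.max(axis=1, keepdims=True)     # stability shift, leaves the softmax unchanged
    W = np.exp(A)
    W = W / W.sum(axis=1, keepdims=True)     # softmax over j: each row of W sums to 1
    Z = W @ X                                # Z_i = sum_j w_ij X_j
    return Z, W

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                  # N = 3 vectors of dimension d = 4
Z, W = simple_self_attention(X)
print(W.sum(axis=1))                         # every row of weights sums to 1
print(Z.shape)                               # same shape as the input
```

Note that nothing in this function can be trained, which is the drawback pointed out in the text.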

  • Queries $(Q_1,Q_2,...,Q_N)$: The Query $Q_i$ for the $i^{th}$ input, represents the focus of Attention when this input is being processed, and is used to compare the $i^{th}$ input to all the other inputs.
  • Keys $(K_1,K_2,...,K_N)$: The Key $K_j$ for the $j^{th}$ input is used to compare this input with the current focus of Attention.
  • Values $(V_1,V_2,...,V_N)$: Instead of applying Self Attention to the input $(X_1,X_2,...,X_N)$ directly, it is first converted into another sequence $(V_1,V_2,...,V_N)$, called the Value Sequence. This sequence is used to compute the output for the current focus of attention.

The three sequences are derived from the input sequence $(X_1,...,X_N)$ by means of linear transformations with learnable weights.

  • The Queries $(Q_1,Q_2,...,Q_N)$ are generated from the Encoder inputs by multiplication of the inputs $(X_1,X_2,...,X_N)$ with a Query matrix $W^Q$.

  • The Keys $(K_1,K_2,...,K_N)$ are generated by multiplying the input $(X_1,X_2,...,X_N)$ by the Key Matrix $W^K$.

  • The Values $(V_1,V_2,...,V_N)$ are generated by multiplying the input $(X_1,X_2,...,X_N)$ by the Value Matrix $W^V$.

Note that these transformations are equivalent to taking each of the vectors $X_i, i = 1,...,N$ and passing them through three separate Dense Feed Forward Networks $W^Q, W^K$ and $W^V$, as shown in Figure trans2.

Each of the vectors $X_i$ is of dimension $1\times d$, while $W^Q, W^K$ and $W^V$ are of dimensions $d\times d$. The contents of these matrices are parameters that are estimated using Gradient Descent during the training process. Once we have computed the Queries, Keys and Values, the Self Attention computation for the $i^{th}$ input $X_i$ proceeds as follows:

  1. Compute a scalar valued Score $s_{ij},\ j=1,2,...,N$ associated with the $i^{th}$ and $j^{th}$ inputs by taking the inner product $$ s_{ij} = Q_i^T K_j,\ \ j=1,2,...,N $$ The Score value is a measure of the similarity between these two vectors.

  2. Normalize the Score values by dividing by $\sqrt{d}$ to create the sequence $s'_{ij}, j=1,2,...,N$. $$ s'_{ij} = \frac{Q_i^T K_j}{\sqrt d},\ \ j=1,2,...,N $$ This can be considered to be a type of Normalization in order to keep the results of the dot product between the Query and Key vector under control. Without this, there is danger that the dot product may become very large (or very small), which in combination with the exponentiation in the softmax (the following step) leads to numerical issues and problems in gradient propagation.

  3. The normalized Scores are used to generate scalar weights $w_{ij}$ by using the Softmax function $$ w_{ij} = {{e^{s'_{ij}}}\over{\sum_j{e^{s'_{ij}}}}},\ \ j=1,2,...,N $$

  4. The Self Attention output vector $Z_{i}$ for the $i^{th}$ input is computed as a weighted sum of the Value vectors $$ Z_{i} = \sum_j w_{ij} V_j $$

Since each of the outputs $Z_i$ can be computed independently, these calculations can be parallelized by using matrix multiplication, as follows: The vector sequence $(X_1,...,X_N)$ is packed into a matrix $X\in R^{N\times d}$, such that the $i^{th}$ row of $X$ represents the vector $X_i$. We then multiply $X$ by the matrices $W^Q, W^K$ and $W^V$, each of which is of dimension $d\times d$, to produce matrices $Q, K, V$ of dimensions $N\times d$: $$ Q = XW^Q, \ \ K = XW^K, \ \ V = XW^V $$ These three matrices contain all of the Query, Key and Value vectors. By using them, the calculations in steps 1 to 4 can be reduced to a single step: $$ Z = \mathrm{softmax}\left({QK^T\over{\sqrt{d}}}\right) V $$ where the softmax is applied separately to each row. Note that the output vector $Z_i$ is the $i^{th}$ row of this matrix.
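As a check on the equivalence between steps 1 to 4 and the single matrix expression, the following NumPy sketch computes the Self Attention output both ways and compares them. The function names and the randomly initialized weight matrices are assumptions made for illustration; in a real model $W^Q, W^K$ and $W^V$ would be learned.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)    # stability shift
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Matrix form: Z = softmax(Q K^T / sqrt(d)) V."""
    d = X.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def self_attention_row(i, X, W_Q, W_K, W_V):
    """Steps 1 to 4 carried out for the i-th input only."""
    N, d = X.shape
    Q_i = X[i] @ W_Q
    s = np.array([Q_i @ (X[j] @ W_K) for j in range(N)])   # step 1: scores
    s = s / np.sqrt(d)                                      # step 2: scaling
    w = softmax(s)                                          # step 3: weights
    return sum(w[j] * (X[j] @ W_V) for j in range(N))       # step 4: weighted sum

rng = np.random.default_rng(0)
N, d = 5, 8
X = rng.normal(size=(N, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

Z = self_attention(X, W_Q, W_K, W_V)
Z_rows = np.stack([self_attention_row(i, X, W_Q, W_K, W_V) for i in range(N)])
print(np.allclose(Z, Z_rows))                # the two computations agree
```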

Multiple Attention Heads

In [8]:
#trans3
nb_setup.images_hconcat(["DL_images/trans3.png"], width=600)
Out[8]:

The Attention weight $w_{ij}$ for the $i^{th}$ input $X_i$ is a measure of how important the $j^{th}$ input $X_j$ is in the calculation of the Self-Attention $Z_i$, which is the new representation for $X_i$. Note that this captures only one set of dependencies between the $i^{th}$ input and all the other inputs. Just as an image has multiple patterns whose capture requires multiple ConvNet filters, the Transformer model uses multiple Attention weights (and thus multiple Self-Attention values) in order to capture other dependencies between the $i^{th}$ and the other inputs.

As shown in Figure trans3, Multiple Attention Heads are implemented with the help of $H$ versions of the Query, Key and Value matrices: $(W^Q_1,...,W^Q_H),\ (W^K_1,...,W^K_H)$ and $(W^V_1,...,W^V_H)$, each of which is of dimension $d\times{d\over H}$. These are then used to compute $H$ Self Attention matrices, given by $(Z^1,...,Z^H)$, using the same computations as before, each of which is of dimension $N\times{d\over H}$. In order to generate a single output value, these $H$ matrices are first concatenated together to create an $N\times d$ matrix $\zeta = Z^1 || Z^2||... || Z^H$, followed by multiplication with another matrix $W^O$ in order to compute the final output $Z$: $$ Z = \zeta W^O $$ If the matrix $W^O$ is chosen to be of dimension $d\times d$, then this results in a final Attention vector of the same size as when only one Attention Head was being used. This also means that, apart from the output projection $W^O$, there is no increase in either the number of parameters or the amount of computation in implementing additional Heads. In the original Transformer paper $d = 512$ and $H = 8$.

Note: Some Self Attention implementations, such as the one that is in Keras, do not do this truncation of the individual Attention Head matrices, instead choosing to stick to the original dimensions for each of the Attention Heads. This results in a $\zeta$ matrix of size $N\times dH$, which is then multiplied by a $W^O$ matrix of size $dH\times d$ to produce the $Z$ matrix of the right size, $N\times d$.
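The following NumPy sketch illustrates the truncated version of Multi Head Attention, in which each Head works with vectors of size $d/H$. The function name, the random weights and the choice of scaling the scores by the square root of the per-head dimension are assumptions made for this example.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V are lists of H matrices of shape (d, d/H); W_O has shape (d, d)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv               # each of shape (N, d/H)
        d_head = Q.shape[1]
        heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)
    zeta = np.concatenate(heads, axis=1)               # concatenated heads, shape (N, d)
    return zeta @ W_O, zeta

rng = np.random.default_rng(0)
N, d, H = 5, 8, 2
X = rng.normal(size=(N, d))
W_Q = [rng.normal(size=(d, d // H)) for _ in range(H)]
W_K = [rng.normal(size=(d, d // H)) for _ in range(H)]
W_V = [rng.normal(size=(d, d // H)) for _ in range(H)]
W_O = rng.normal(size=(d, d))

Z, zeta = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(zeta.shape, Z.shape)                             # both (N, d)
```

In the Keras style variant described in the Note, each per-head matrix would instead have shape (d, d) and W_O would have shape (dH, d).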

A Complete Transformer Block

In [9]:
#TransformerBlock
nb_setup.images_hconcat(["DL_images/rnn101.png"], width=400)
Out[9]:

Figure TransformerBlock shows a complete Encoder Layer. In addition to the Self-Attention Layer, it includes the following:

Residual Connection + Layer Normalization:

As shown in the figure, each Encoder block has two Residual Connections, one around the Self Attention Layer, and the other around the Dense Feed Forward Layer. As in ResNets (see Chapter ConvNetsPart2), each of these Residual Connections does a bypass from the input to the output of these two layers, and the vectors at either end are added together. In addition to facilitating gradient flow during backprop, these connections also create four separate paths through the Encoder Layer in the forward direction. These two properties together enable very deep networks that are nevertheless trainable, and create an Ensemble like effect when making decisions.

Each of the two Residual Connections is followed by Layer Normalization. We discussed Batch Normalization in Chapter GradientDescentTechniques in which normalization is done one feature at a time, across a batch. Layer Normalization on the other hand, carries out Normalization across features in a single training sample, as opposed to a batch. As shown below, normalization is done by computing the mean and standard deviation for the elements in a single vector. \begin{eqnarray} \mu_L & = & \frac{1}{d}\sum_{m=1}^{d}a(m) \\ \sigma_L^2 & = & \frac{1}{d}\sum_{m=1}^d (a(m)-\mu_L)^2 \\ \hat{a}(m) & = & \frac{a(m)-\mu_L}{\sqrt{\sigma_L^2+\epsilon}} \\ c(m) & = & \gamma\hat{a}(m) + \beta \end{eqnarray}

Layer Normalization was introduced by Ba, Kiros and Hinton, and works better in Transformers than Batch Normalization.
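The Layer Normalization equations can be written directly in NumPy. The sketch below is illustrative only: the function name and the value of $\epsilon$ are assumptions, and $\gamma$ and $\beta$ are set to 1 and 0 rather than learned.

```python
import numpy as np

def layer_norm(a, gamma, beta, eps=1e-6):
    """Normalize each row of a (one vector per row) across its d features."""
    mu = a.mean(axis=-1, keepdims=True)           # mu_L
    var = a.var(axis=-1, keepdims=True)           # sigma_L^2 (divides by d)
    a_hat = (a - mu) / np.sqrt(var + eps)
    return gamma * a_hat + beta                   # c(m) = gamma * a_hat(m) + beta

rng = np.random.default_rng(0)
a = rng.normal(loc=3.0, scale=5.0, size=(4, 16))  # 4 vectors with d = 16 features
c = layer_norm(a, gamma=np.ones(16), beta=np.zeros(16))
print(c.mean(axis=-1))                            # close to 0 for every vector
print(c.std(axis=-1))                             # close to 1 for every vector
```

Unlike Batch Normalization, the statistics here are computed per vector, so the result for one sample does not depend on the other samples in the batch.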

Dense Feed Forward Layer:

The output of the first Layer Normalization is fed into a Dense Feed Forward Layer. The computation carried out by this layer is as follows: $$ R_i = ReLU(Z_iW_1 +b_1)W_2 + b_2,\ \ i =1,...,N $$ Hence each of the vectors $Z_1,...,Z_N$ is processed independently by two DFN layers, with ReLU being applied only after the first layer. Note that all of the DFNs in a layer share the same parameters, however the DFN parameters differ across layers. In the original paper, the $W_1$ matrix was of dimension $d\times 4d$, while the $W_2$ matrix was of dimension $4d\times d$. Hence the output of the DFN layer is a set of vectors $R_1,...,R_N$, each of which is of dimension $1\times d$.

The output of this layer is subjected to another round of Residual connection + Layer Normalization before generating the final output of the Encoder Layer $R_i, i=1,2,...,N$. The addition of the DFN layer accomplishes the following:

  • It introduces a non-linearity into the model. This is important since, apart from the softmax used to compute the Attention weights, the Self Attention layer is a linear function of the Value vectors.

  • It serves as a mechanism for the mixing within a 'channel' and also introduces an 'inverted bottleneck' into the architecture. This is further explained in a following section.

The set of vectors $(R_1,R_2,...,R_N)$ are then passed through another Self Attention + Dense layer, as shown in Figure trans23, and this process is repeated $P$ times to finally generate the output of the Encoder Block.

The computations in a single Encoder Block can be summarized as: $$ Z = LayerNorm(X + SelfAttn(X)) $$ $$ R = LayerNorm(Z + DFN(Z)) $$
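The two summary equations can be turned into a small NumPy sketch of a stack of Encoder blocks. To keep it short, the sketch assumes a single Attention Head, Layer Normalization with $\gamma=1$ and $\beta=0$, and randomly initialized (untrained) weights; all function and variable names are hypothetical.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(a, eps=1e-6):
    # gamma = 1 and beta = 0 are assumed here for brevity
    mu = a.mean(axis=-1, keepdims=True)
    var = a.var(axis=-1, keepdims=True)
    return (a - mu) / np.sqrt(var + eps)

def self_attn(X, p):
    Q, K, V = X @ p["W_Q"], X @ p["W_K"], X @ p["W_V"]
    return softmax(Q @ K.T / np.sqrt(X.shape[1])) @ V

def dfn(Z, p):
    # R_i = ReLU(Z_i W_1 + b_1) W_2 + b_2, applied to every row of Z
    return np.maximum(Z @ p["W_1"] + p["b_1"], 0.0) @ p["W_2"] + p["b_2"]

def encoder_block(X, p):
    Z = layer_norm(X + self_attn(X, p))      # Z = LayerNorm(X + SelfAttn(X))
    return layer_norm(Z + dfn(Z, p))         # R = LayerNorm(Z + DFN(Z))

def init_params(d, rng):
    w = lambda *shape: 0.1 * rng.normal(size=shape)
    return {"W_Q": w(d, d), "W_K": w(d, d), "W_V": w(d, d),
            "W_1": w(d, 4 * d), "b_1": np.zeros(4 * d),
            "W_2": w(4 * d, d), "b_2": np.zeros(d)}

rng = np.random.default_rng(0)
N, d, P = 6, 8, 3
X = rng.normal(size=(N, d))
blocks = [init_params(d, rng) for _ in range(P)]   # each block has its own parameters

R = X
for p in blocks:                                   # output of one block feeds the next
    R = encoder_block(R, p)
print(R.shape)                                     # (N, d), same shape as the input
```

Because every block maps an $N\times d$ input to an $N\times d$ output, any number of blocks can be stacked.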

In [10]:
#trans23
nb_setup.images_hconcat(["DL_images/trans23.png"], width=400)
Out[10]:

Keras Model for a Transformer Encoder

The following Transformer model is used to classify movie reviews from the IMDB dataset (which has already been downloaded). We start by invoking the text_dataset_from_directory function to create the training, validation and test samples.

In [11]:
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = '/Users/subirvarma/handson-ml/datasets/aclImdb'

train_ds = keras.utils.text_dataset_from_directory(
    "/Users/subirvarma/handson-ml/datasets/aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "/Users/subirvarma/handson-ml/datasets/aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "/Users/subirvarma/handson-ml/datasets/aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)                
Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.

The TextVectorization layer is invoked to convert the text of each sample review into a sequence of integers. Each review is truncated or padded to 600 tokens (words), and the vocabulary used is restricted to the 20,000 most frequently occurring words found in the reviews.

In [12]:
from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

The TransformerEncoder class defines the Self Attention and Dense Feed Forward blocks in the Transformer Encoder. The Keras MultiHeadAttention layer implements the Attention calculations. The main parameters of the class and of the Attention layer are:

  • embed_dim: The size of the input vectors after embedding
  • dense_dim: The number of nodes in the first DFN layer. Note that the number of nodes in the second DFN layer is equal to the embed_dim.
  • num_heads: The number of Attention Heads
  • key_dim: Size of Attention Heads for Query and Key
  • value_dim: Size of Attention Heads for Value, defaults to key_dim (which is set to embed_dim in this example)

Note that the length of the Transformer is not explicitly specified, since it is equal to the Sequence Length of the input (set to max_length=600 in the previous code block).

When the Attention function is invoked, its call arguments include:

  • query: The Query Tensor of shape (B, N, d), where B is the Batch Size, N is the Sequence Length and d is the embedding dimension size
  • value: the Value Tensor, also of shape (B, N, d)
  • key: Optional, if not given, then value is used for both key and value
  • attention_mask: A boolean tensor of shape (B, N, N) for Self Attention (or a shape that broadcasts to it) that prevents attention to certain positions. This is not used in the current example, but we will see it in action when we discuss Language Models using Transformers
In [13]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        # Define the Multi Headed Attention block
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        # Define the DFN block
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        # Implement Residual Connection and Layer Normalization after the Self Attention block
        proj_input = self.layernorm_1(inputs + attention_output)
        # Send the resulting output through the DFN block
        proj_output = self.dense_proj(proj_input)
        # This is then sent through another round of Residual Connection followed by Layer Normalization
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config
In [77]:
vocab_size = 20000
embed_dim = 32
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
# The Embedding layer converts each integer token index (equivalent to a one-hot vector
# of size vocab_size) into an embedded vector of size embed_dim
x = layers.Embedding(vocab_size, embed_dim)(inputs)
# The embedded vectors are sent through the Transformer Encoder
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
# The output from the Encoder is sent through the GlobalMaxPooling1D layer
# which creates a single vector of size 1 X d by taking the max across the N rows
# (sequence positions) in each column of the N X d output matrix from the Encoder
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
Model: "model_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_7 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding_7 (Embedding)      (None, None, 32)          640000    
_________________________________________________________________
transformer_encoder_3 (Trans (None, None, 32)          10656     
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 32)                0         
_________________________________________________________________
dropout_60 (Dropout)         (None, 32)                0         
_________________________________________________________________
dense_71 (Dense)             (None, 1)                 33        
=================================================================
Total params: 650,689
Trainable params: 650,689
Non-trainable params: 0
_________________________________________________________________

Recall that we used an LSTM based model to do IMDB classification in Chapter NLP. It is interesting to compare the number of parameters in the Transformer vs the LSTM model, and it turns out that they are approximately the same, around 650K (this assumes an LSTM with the same number of cell nodes as the size of the embed_dim in the Transformer, which is 32 in this case). More specifically, the number of parameters in the LSTM part of the model and in the Transformer Encoder are also about the same.

In [10]:
model.fit(int_train_ds, validation_data=int_val_ds, epochs=15)
Epoch 1/15
625/625 [==============================] - 365s 582ms/step - loss: 0.5129 - accuracy: 0.7516 - val_loss: 0.3223 - val_accuracy: 0.8598
Epoch 2/15
625/625 [==============================] - 395s 632ms/step - loss: 0.3275 - accuracy: 0.8648 - val_loss: 0.2922 - val_accuracy: 0.8762
Epoch 3/15
625/625 [==============================] - 384s 614ms/step - loss: 0.2587 - accuracy: 0.8970 - val_loss: 0.2582 - val_accuracy: 0.8946
Epoch 4/15
625/625 [==============================] - 374s 598ms/step - loss: 0.2170 - accuracy: 0.9158 - val_loss: 0.2608 - val_accuracy: 0.8914
Epoch 5/15
625/625 [==============================] - 372s 595ms/step - loss: 0.1936 - accuracy: 0.9258 - val_loss: 0.2500 - val_accuracy: 0.9016
Epoch 6/15
625/625 [==============================] - 369s 591ms/step - loss: 0.1690 - accuracy: 0.9365 - val_loss: 0.2785 - val_accuracy: 0.8910
Epoch 7/15
625/625 [==============================] - 372s 595ms/step - loss: 0.1523 - accuracy: 0.9438 - val_loss: 0.2673 - val_accuracy: 0.8984
Epoch 8/15
625/625 [==============================] - 388s 621ms/step - loss: 0.1390 - accuracy: 0.9495 - val_loss: 0.3196 - val_accuracy: 0.8888
Epoch 9/15
625/625 [==============================] - 390s 624ms/step - loss: 0.1244 - accuracy: 0.9560 - val_loss: 0.3164 - val_accuracy: 0.8920
Epoch 10/15
625/625 [==============================] - 390s 624ms/step - loss: 0.1162 - accuracy: 0.9584 - val_loss: 0.3457 - val_accuracy: 0.8856
Epoch 11/15
625/625 [==============================] - 397s 635ms/step - loss: 0.1061 - accuracy: 0.9642 - val_loss: 0.3486 - val_accuracy: 0.8874
Epoch 12/15
625/625 [==============================] - 3833s 6s/step - loss: 0.1002 - accuracy: 0.9654 - val_loss: 0.4310 - val_accuracy: 0.8646
Epoch 13/15
625/625 [==============================] - 370s 592ms/step - loss: 0.0909 - accuracy: 0.9678 - val_loss: 0.3508 - val_accuracy: 0.8880
Epoch 14/15
625/625 [==============================] - 374s 599ms/step - loss: 0.0831 - accuracy: 0.9737 - val_loss: 0.3414 - val_accuracy: 0.8866
Epoch 15/15
625/625 [==============================] - 371s 594ms/step - loss: 0.0778 - accuracy: 0.9743 - val_loss: 0.4321 - val_accuracy: 0.8770
Out[10]:
<keras.callbacks.History at 0x14de4f7b8>

The model.summary computed the total number of parameters in the Encoder Block to be 10,656. The spreadsheet in Figure trans5 shows how this number was arrived at.

In [27]:
#trans5
nb_setup.images_hconcat(["DL_images/trans5.png"], width=800)
Out[27]:

Transformer models from large research labs tend to be much larger. For example, consider a model with an Embedding Dimension of 768, 10 Attention Heads, a Dense Dimension of 3072 and 12 Encoder blocks, which is comparable in size to the BERT-base Encoder (for reference, the base model in the original paper by Vaswani et al. used an Embedding Dimension of 512, 8 Attention Heads, a Dense Dimension of 2048 and 6 Encoder blocks). Plugging these numbers into the spreadsheet shows that the number of parameters per block is 28,342,272. With 12 blocks in the model, the total number of parameters for the entire Encoder is 340,107,264. Hence we can see that the number of parameters in Transformers adds up pretty quickly, and indeed the latest models feature hundreds of billions of parameters.
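The per-block parameter counts quoted above (10,656 for the toy model and 28,342,272 for the larger configuration) can be reproduced with a short calculation. The sketch below assumes the Keras convention used in the TransformerEncoder class, in which each Attention Head keeps the full key_dim = embed_dim and every projection has a bias; the function name is hypothetical.

```python
def encoder_block_params(embed_dim, dense_dim, num_heads, key_dim=None):
    """Parameter count of one Encoder block, assuming Keras style attention heads."""
    key_dim = embed_dim if key_dim is None else key_dim
    # Query, Key and Value projections: embed_dim -> num_heads * key_dim, plus biases
    qkv = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
    # Output projection W^O: num_heads * key_dim -> embed_dim, plus bias
    out = num_heads * key_dim * embed_dim + embed_dim
    # Two dense layers: embed_dim -> dense_dim -> embed_dim, plus biases
    dfn = (embed_dim * dense_dim + dense_dim) + (dense_dim * embed_dim + embed_dim)
    # Two Layer Normalizations, each with gamma and beta of size embed_dim
    norms = 2 * 2 * embed_dim
    return qkv + out + dfn + norms

print(encoder_block_params(32, 32, 2))            # 10656, matches model.summary()
print(encoder_block_params(768, 3072, 10))        # 28342272 per block
print(12 * encoder_block_params(768, 3072, 10))   # 340107264 for 12 blocks
```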

The spreadsheet also gives an estimate of the number of computations in a single block, and as can be seen, the number is huge. The number of computations in the Self Attention part grows in proportion to $ B N^2 H $ (linearly in the batch size $B$ and the number of Heads $H$, and quadratically in the sequence length $N$), so that even for a 'toy' model such as ours, it can blow up pretty quickly. This can also be seen in the 6 minutes (approx) it took to finish each epoch when the model was run. These computations are almost all in the Self Attention calculation part of the model, and there have been proposals to replace Self Attention by other schemes in order to reduce this number. In comparison, the equivalent LSTM model from Chapter NLP took about 2 minutes per epoch, even though it was serialized in its stage by stage computations.

Encoding Position Information in Transformers

The Transformer architecture was originally designed to process sequences. However, note that the description provided so far works the same regardless of the order of the input data (you can see this for yourself by shuffling the input sequence and running it through the model). This is unlike RNN/LSTMs or even 1D ConvNets, where the order of the input changes the corresponding output. Hence we need to make some changes to the Transformer design in order to make it sensitive to the order of the input sequence. The original Transformer paper had a scheme whereby another pre-computed sequence was added to each input sequence, such that the second sequence was sensitive to the order. This was done by making the vector elements of the second sequence a function of sines and cosines with varying frequencies.
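Both points in the paragraph above can be checked numerically. The sketch below is an illustration in NumPy rather than the Keras model used in this chapter: self_attention is a simplified single-head Self Attention with the query, key and value projections left out, sinusoidal_encoding is a hypothetical helper that builds sine and cosine position vectors in the style of the original paper, and the max over the sequence plays the role of the GlobalMaxPooling1D layer.

```python
import numpy as np


def self_attention(x):
    """Single-head self-attention with identity projections (a simplification)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


def sinusoidal_encoding(seq_len, dim):
    """Position vectors made of sines and cosines with varying frequencies."""
    positions = np.arange(seq_len)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    enc = np.zeros((seq_len, dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))           # 6 tokens, embedding dimension 8
perm = np.array([3, 0, 5, 1, 4, 2])   # a fixed shuffle of the 6 positions

# Without position information, shuffling the input only shuffles the output
# rows, so the pooled result is unchanged.
pooled = self_attention(x).max(axis=0)
pooled_shuffled = self_attention(x[perm]).max(axis=0)
print(np.allclose(pooled, pooled_shuffled))        # True

# Once position vectors are added, the same shuffle changes the result.
pe = sinusoidal_encoding(6, 8)
pooled_pe = self_attention(x + pe).max(axis=0)
pooled_pe_shuffled = self_attention(x[perm] + pe).max(axis=0)
print(np.allclose(pooled_pe, pooled_pe_shuffled))  # False
```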

We now describe a simpler scheme that achieves the same purpose and is used more commonly in practice. As shown below, we create two embedded sequences for each input:

  • The original Embedded Sequence that creates word embeddings for the words in each input IMDB review

  • A new Embedded Sequence that creates position embeddings from an input that is a sequence of integers from 0 to sequence_length - 1. Hence each of the integers gets converted into a vector that takes the sequence order into account.

These two sequences are then added together to create the final input that is then fed into the Transformer.

In [28]:
#trans6
nb_setup.images_hconcat(["DL_images/trans6.png"], width=800)
Out[28]:

The following code implements both the Token Embeddings and the Position Embeddings, and then adds them together to create the input into the Encoder.

In [13]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config
In [14]:
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

model.fit(int_train_ds, validation_data=int_val_ds, epochs=15)
Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_3 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
positional_embedding (Positi (None, None, 256)         5273600   
_________________________________________________________________
transformer_encoder_2 (Trans (None, None, 256)         543776    
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 256)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 257       
=================================================================
Total params: 5,817,633
Trainable params: 5,817,633
Non-trainable params: 0
_________________________________________________________________
Epoch 1/15
625/625 [==============================] - 1429s 2s/step - loss: 0.4753 - accuracy: 0.7872 - val_loss: 0.2516 - val_accuracy: 0.9018
Epoch 2/15
625/625 [==============================] - 1506s 2s/step - loss: 0.2397 - accuracy: 0.9066 - val_loss: 0.2359 - val_accuracy: 0.9072
Epoch 3/15
625/625 [==============================] - 1507s 2s/step - loss: 0.1814 - accuracy: 0.9333 - val_loss: 0.4492 - val_accuracy: 0.8736
Epoch 4/15
625/625 [==============================] - 1505s 2s/step - loss: 0.1494 - accuracy: 0.9448 - val_loss: 0.2586 - val_accuracy: 0.8970
Epoch 5/15
625/625 [==============================] - 1509s 2s/step - loss: 0.1281 - accuracy: 0.9536 - val_loss: 0.3299 - val_accuracy: 0.8894
Epoch 6/15
625/625 [==============================] - 1510s 2s/step - loss: 0.1113 - accuracy: 0.9602 - val_loss: 0.3086 - val_accuracy: 0.8994
Epoch 7/15
625/625 [==============================] - 6816s 11s/step - loss: 0.0986 - accuracy: 0.9653 - val_loss: 0.3967 - val_accuracy: 0.8922
Epoch 8/15
625/625 [==============================] - 26082s 42s/step - loss: 0.0883 - accuracy: 0.9692 - val_loss: 0.4189 - val_accuracy: 0.8892
Epoch 9/15
625/625 [==============================] - 1401s 2s/step - loss: 0.0784 - accuracy: 0.9736 - val_loss: 0.3555 - val_accuracy: 0.8922
Epoch 10/15
625/625 [==============================] - 1475s 2s/step - loss: 0.0711 - accuracy: 0.9753 - val_loss: 0.4298 - val_accuracy: 0.8862
Epoch 11/15
625/625 [==============================] - 1495s 2s/step - loss: 0.0656 - accuracy: 0.9769 - val_loss: 0.4332 - val_accuracy: 0.8848
Epoch 12/15
625/625 [==============================] - 1501s 2s/step - loss: 0.0568 - accuracy: 0.9806 - val_loss: 0.4987 - val_accuracy: 0.8834
Epoch 13/15
625/625 [==============================] - 1462s 2s/step - loss: 0.0500 - accuracy: 0.9828 - val_loss: 0.5100 - val_accuracy: 0.8862
Epoch 14/15
625/625 [==============================] - 1419s 2s/step - loss: 0.0439 - accuracy: 0.9855 - val_loss: 0.5147 - val_accuracy: 0.8828
Epoch 15/15
625/625 [==============================] - 3260s 5s/step - loss: 0.0406 - accuracy: 0.9863 - val_loss: 0.4691 - val_accuracy: 0.8776
Out[14]:
<keras.callbacks.History at 0x153375128>

Visualizing Attention Patterns in Transformers

It has been shown (https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77) that RNN type processing forms a subset of Attention type processing, i.e., the recurrent connections that we hardwire into a RNN emerge naturally in Attention networks as a result of training. Some figures from this article are reproduced in Figure rnn99. This study was done for a type of Transformer Encoder model called BERT (we will describe this model in a later section). Note that two sentences are input into the model, separated by a special token called SEP. In addition, the special token CLS is placed at the beginning of the input and another SEP token at the end. In these figures, the input sentence is the one on the right hand side, while the left hand side is the representation of the corresponding words after self-attention. The lines represent the attention weights, with a thicker line indicating a higher weight. For example in Part (a) of the figure, the representation of the (first) "I" assigns the greatest weight to the following word "went".

Note that the Transformer model used had multiple layers, with multiple heads in each layer. The layer number is shown at the top left of each figure, while the attention head is the highlighted color in the strip at the top of the figures.

In [80]:
#rnn99.png
nb_setup.images_hconcat(["DL_images/rnn99.png"], width=800)
Out[80]:

There are some interesting patterns that emerge in the self-attention patterns of this Transformer:

  • The Self-Attention pattern in Part (a) shows that all the Attention is being paid to the next word in the sequence. This is identical to the hardwired connections that are used in backwards RNNs. This pattern is broken by the SEP tokens, which focus their attention on the CLS token.

  • The Self-Attention pattern in Part (b) shows that most of the Attention is being paid to the prior word in the sequence. This is identical to the hardwired connections that are used in forward RNNs. The Attention is somewhat diffuse compared to Part (a), since some of the Attention is also distributed among other words in the input sentence.

  • The Self-Attention pattern in Part (c) shows that most of the Attention is being paid to words in the sequence that are identical to the input word. If a word occurs only once, then this results in a horizontal line.

  • The Self-Attention pattern in Part (d) seems to pay equal attention to all the words that are in the same sentence.

Hence the usual backward and forward recurrent connections emerge naturally in a trained Transformer. In addition the Transformer incorporates other types of connections that are non-local and can span the entire sequence, which results in the higher modeling capacity of these models.

In [82]:
#trans25.png
nb_setup.images_hconcat(["DL_images/trans25.png"], width=800)
Out[82]:

Figure trans25 shows all the different attention patterns that exist in a trained Transformer Encoder with 6 layers, with 12 heads in each layer. We can see that the model is able to capture a wide variety of Attention patterns, with some of the patterns repeating from layer to layer (analogous to a multi-layer RNN). There is also a "Null" pattern that exists especially in the higher layers, with attention focused on the SEP or CLS tokens. This pattern seems to indicate that the model has not found a meaningful pattern for an Attention Head, so it is defaulting to the Null pattern.

Relationship between Transformers and Depthwise Separable ConvNets

The Transformer design was originally inspired by the Cross Attention mechanism used in RNNs. Indeed the Transformer shares a lot of aspects with RNNs, such as the fact that both architectures were designed to process vector sequence data in a modular fashion with a mechanism that enables members of the sequence to influence each other's representation. However, it turns out that Transformers also have a lot in common with Depthwise Separable Convolutions (that we first encountered in Chapter Convolutional Networks Part 2). Indeed as shown next, Transformers can be considered to be a way in which ConvNets with their superior filtering capabilities have been merged with RNNs with their superior ability to handle sequential data, to create a new architecture that has the best aspects of both.

In [83]:
#trans26.png
nb_setup.images_hconcat(["DL_images/trans26.png"], width=800)
Out[83]:

Let's start by re-visiting the Depthwise Separable 1D ConvNet design, which has been reproduced in Figure trans26. It shows four vectors, each of dimension three, being processed by a Depthwise Separable 1D ConvNet, with data passing from bottom to top of the figure. The three features in the vectors are color coded as shown. The processing happens in two stages:

  • In the first stage, a 1D Convolution is applied to each Feature Plane, with a different filter used for each of the features. These filters are marked F1, F2 and F3 in the figure. The output of this stage is once again 4 vectors, in which each feature is now a filtered version of the original feature. The "blue" features in the second layer for example were obtained by filtering the corresponding "blue" features in the input using filter F1.

  • In the second stage each of the individual vectors is processed using a 1 X 1 Convolution to obtain the final output. Note that a 1 X 1 Convolution is the same as processing using a DFN (or MLP as they are also called).

Hence the processing using a 1D Depthwise Separable ConvNet can be considered to be made up of two types of filtering: (1) filtering along the "horizontal" axis (using the 1D Convolution), followed by (2) filtering along the "vertical" axis (using the DFN). Using the terminology used in image processing, the horizontal axis is called the feature plane, while the vertical axis is called the channel plane.
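The two stages can be written out explicitly. The following NumPy sketch is an illustration only: the function name depthwise_separable_conv1d is hypothetical, "same" zero padding with an odd kernel size is assumed, and the 1 X 1 Convolution is reduced to a single matrix multiplication without bias or non-linearity.

```python
import numpy as np


def depthwise_separable_conv1d(x, depthwise_filters, pointwise_weights):
    """x: (seq_len, channels); depthwise_filters: (kernel_size, channels);
    pointwise_weights: (channels, out_channels). Assumes an odd kernel_size."""
    seq_len, channels = x.shape
    kernel_size = depthwise_filters.shape[0]
    pad = kernel_size // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))  # zero padding along the sequence
    # Stage 1: each feature is filtered along the sequence with its own filter
    stage1 = np.zeros((seq_len, channels))
    for c in range(channels):
        for t in range(seq_len):
            window = padded[t:t + kernel_size, c]
            stage1[t, c] = np.dot(window, depthwise_filters[:, c])
    # Stage 2: a 1 X 1 Convolution mixes the features of each vector separately
    return stage1 @ pointwise_weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # four vectors, each of dimension three
f = rng.normal(size=(3, 3))    # filters F1, F2, F3 (one column each), kernel size 3
w = rng.normal(size=(3, 3))    # 1 X 1 Convolution weights
print(depthwise_separable_conv1d(x, f, w).shape)  # (4, 3)
```

The first stage never mixes different features, and the second stage never mixes different positions in the sequence, which is the separation of the two filtering axes described above.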

It turns out that Transformers are structured along similar lines, as we show next.

In [84]:
#trans27.png
nb_setup.images_hconcat(["DL_images/trans27.png"], width=800)
Out[84]:

Earlier in this chapter we presented the computation of the representation $Z_1$ of a vector $X_1$ in Transformers, as a result of the Self-Attention operation. However, as shown in Figure trans27, this computation can also be considered to be a filtering operation, as evidenced by the equation: $$ Z_{i} = \sum_j w_{ij} X_j $$ Hence $w_{ij}$ can be considered to be filter coefficients rather than Self Attention weights, with the caveat that these filters are now a function of the data. Since the coefficients are the same for all the elements in the output $Z_1$, let's call them Filter $F_1$; this results in the figure shown in Part (a). Similarly the elements in $Z_2$ are generated by using a second Filter $F_2$, and so on.
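The filtering view of the equation above can be made concrete with a small NumPy sketch. It is a simplification: the helper attention_filters is hypothetical and omits the learned query and key projections, so the weights $w_{ij}$ are computed directly from the input vectors.

```python
import numpy as np


def attention_filters(x):
    """Return the matrix w, where w[i, j] is the weight of X_j in Z_i.
    The coefficients depend on the data x itself."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(1)
x = rng.normal(size=(4, 3))   # four vectors X_1..X_4, each of dimension three
w = attention_filters(x)

# Z_1 computed feature by feature: the same filter w[0] is applied to every feature
z1 = np.array([np.dot(w[0], x[:, k]) for k in range(x.shape[1])])
z = w @ x                     # all the Z_i at once
print(np.allclose(z1, z[0]))  # True
print(w.sum(axis=1))          # the coefficients of each filter sum to 1
```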

Comparing Figures trans26 and trans27, we can see that the elements $Z_i$ are generated in each case using a filtering operation, though the exact details by which the filtering is done differ for the two networks. The important point to take away from this is that in both cases we are 'mixing' together the elements of the vectors along the feature axis.

The reader may have noticed that in 1D ConvNets (see Figure trans26), the elements in $Z_1$ are generated by using 3 different filters, while all the elements in the corresponding $Z_1$ in Transformers (see Figure trans27) are generated using the same filter. This weakness in the Transformer architecture is addressed by using multiple Heads as shown in Figure trans28. Since each Transformer Head uses a different set of filters, we are able to reproduce the multiple filter aspect of 1D ConvNets.

In [85]:
#trans28.png
nb_setup.images_hconcat(["DL_images/trans28.png"], width=800)
Out[85]:

Figure trans28 shows the DFN part of a Transformer Encoder, with its filtering along the channel axis. As shown in Figure trans26, this exact operation is also carried out in 1D ConvNets, with the caveat that multiple vectors are mixed in Transformers, vs a single vector in a 1D ConvNet.

Hence both 1D ConvNets and Transformers involve two types of filtering or mixing:

  • Filtering along the Feature axis, followed by
  • Filtering along the Channel axis.

The filtering along the Channel axis is identical in the two architectures, however note that the Feature axis filtering is local in the ConvNet, while it spans the entire sequence in the Transformer. This is one of the reasons why Transformers have higher capacity than ConvNets, since they are able to detect non-local patterns.

Since the discovery of Transformers, researchers have discovered other ways of processing vector sequences in a non-local fashion with filtering along these two axes, with performance that is comparable to Transformers. Two examples of this are:

  • The paper by Lee-Thorp et al. in which the Fast Fourier Transform was used instead of Self Attention

  • The paper by Tolstikhin, Houlsby et al., in which they use multiple DFNs in each layer instead of Self Attention
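To give a feel for the first proposal in the list above, the following NumPy sketch replaces the Self Attention mixing step by a Fourier Transform. It is a rough illustration under stated assumptions, not the model from the paper: the helper name fourier_mixing is hypothetical, the transform is taken over both the sequence and the embedding axes, only the real part is kept, and the DFN, residual connections and normalization that surround the mixing step in a full block are left out.

```python
import numpy as np


def fourier_mixing(x):
    """Parameter-free mixing of a (seq_len, embed_dim) array using a 2D FFT."""
    return np.fft.fft2(x).real


rng = np.random.default_rng(2)
x = rng.normal(size=(4, 3))
mixed = fourier_mixing(x)
print(mixed.shape)  # (4, 3)

# The mixing is non-local: changing only the last vector in the sequence
# also changes the output at the first position.
x2 = x.copy()
x2[3, 0] += 1.0
print(np.allclose(fourier_mixing(x2)[0], mixed[0]))  # False
```

Unlike Self Attention, the mixing coefficients here are fixed and do not depend on the data, so there is nothing to learn in this step.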

Language Models using Transformers

Language Models were defined in Chapter LLM as a way to predict the next word in a sequence. As a result they can be used to generate new language sequences, which is very useful in applications such as Translation or Summarization. In this Section we modify the Transformer Encoder model so that it can be used as a Language Model.

Language Models using Transformers have made a big impact on NLP in the last few years. In particular a Language Model called GPT (Generative Pre-trained Transformer) from OpenAI has garnered a lot of attention due to the realistic text that it is able to generate. This model also inaugurated a new era in Deep Learning, by defining a new class of models called Large Language Models or LLMs. The largest of these models have tens of billions of parameters, and require millions of dollars to train on huge training datasets. Fortunately these datasets do not require manual labeling, since the labels are auto-generated (as the next word in a sentence), which is referred to as Self Supervised Learning. In addition, LLMs exhibit the following useful properties:

  • Pre-trained models can be reused for new datasets, while retaining their good performance, i.e., Transfer Learning works very well with LLMs, unlike RNNs or LSTMs.
  • It has also been observed that pre-trained LLMs can give very good results with very small datasets, sometimes as little as a single example. This is referred to as One-Shot Learning and is an active area of research.
  • LLMs also exhibit Emergent Properties, such as the ability to answer general questions (such as who was the first President of the USA), almost like a Q&A service. This has led to proposals for using LLMs as an interface into Search Engines. This is also an active area of research.
  • LLMs have been used in Text to Image models such as DALL-E and Imagen, which have garnered a lot of attention recently.
In [29]:
#trans7
nb_setup.images_hconcat(["DL_images/trans7.png"], width=800)
Out[29]:

The main difference between the Transformer used as a Language Model, and the Transformer described earlier in this chapter, is that when a sentence is fed into a Language Model Transformer during training, the Attention calculations for word $ x_i $ cannot take into account words that occur after $ x_i $. This is illustrated in Figure trans7: For example the Attention calculations for $ x_3 $ can take into account $ x_3 $ itself as well as $ x_1 $ and $ x_2 $, but not $ x_4 $ and $ x_5 $.

In [30]:
#trans8
nb_setup.images_hconcat(["DL_images/trans8.png"], width=800)
Out[30]:

In order to implement this restriction, we make the following change to the Transformer model: Recall that the output of the Self Attention layer is computed using the formula $$ Z = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V $$ where the matrix $ QK^T $ contains the results of the vector dot products. As shown in Figure trans8, if the entries above the diagonal of this matrix are set to $ -\infty $, then the softmax assigns them a weight of zero, and row $ i $ contains only the dot products needed for computing the Self Attention for the $i^{th}$ term in the input sequence.

The mathematical operation described above is called masking, and is implemented in Keras using the attention_mask argument of the MultiHeadAttention layer.
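Before looking at the Keras version, the masking operation can be illustrated directly in NumPy. This is a minimal sketch under simplifying assumptions: a single head, no learned projections, and a hypothetical helper name causal_self_attention.

```python
import numpy as np


def causal_self_attention(q, k, v):
    """Masked self-attention: position i only attends to positions j <= i."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Entries above the diagonal correspond to words that come later in the sequence
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)                      # exp(-inf) = 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(3)
x = rng.normal(size=(5, 4))    # x_1..x_5, embedding dimension 4
z, w = causal_self_attention(x, x, x)
print(np.round(w, 2))          # lower triangular: zero weight on future words

# Changing x_4 and x_5 leaves the representations of x_1..x_3 untouched
x_changed = x.copy()
x_changed[3:] += 5.0
z_changed, _ = causal_self_attention(x_changed, x_changed, x_changed)
print(np.allclose(z[:3], z_changed[:3]))  # True
```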

In [31]:
#trans9
nb_setup.images_hconcat(["DL_images/trans9.png"], width=800)
Out[31]:

Figure trans9 shows a Transformer based Language Model during the Training phase. Note that the entire training sentence can be fed into the model in one shot once we have the masking in place. This is in contrast to RNN/LSTMs in which words had to be fed in sequence, one at a time. This figure also shows that the label used for each word is simply the following word in the sequence.

Text Completion using Language Models

In [32]:
#trans10
nb_setup.images_hconcat(["DL_images/trans10.png"], width=800)
Out[32]:

While Figure trans9 showed a Transformer based Language Model during training, Figure trans10 shows the model being used to generate new text, once it has been trained. The model is primed using a sentence starter <so long and thanks for>, and is responsible for generating the words that follow. Note that the model operates in an auto-regressive manner, with each generated word fed back into the model in order to generate the next word.

The following code for a Language Model is one of the examples on the keras.io website. It uses the IMDB dataset for training, and consequently can be used to generate movie reviews once it has been trained.

In [36]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import numpy as np
import os
import re
import string
import random
In [37]:
def causal_attention_mask(batch_size, n_dest, n_src, dtype):
    """
    Mask the upper half of the dot product matrix in self attention.
    This prevents flow of information from future tokens to current token.
    1's in the lower triangle, counting from the lower right corner.
    """
    i = tf.range(n_dest)[:, None]
    j = tf.range(n_src)
    m = i >= j - n_src + n_dest
    mask = tf.cast(m, dtype)
    mask = tf.reshape(mask, [1, n_dest, n_src])
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
    )
    return tf.tile(mask, mult)


class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads, embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        attention_output = self.att(inputs, inputs, attention_mask=causal_mask)
        attention_output = self.dropout1(attention_output)
        out1 = self.layernorm1(inputs + attention_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        return self.layernorm2(out1 + ffn_output)
In [38]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
In [39]:
vocab_size = 20000  # Only consider the top 20k words
maxlen = 80  # Max sequence size
embed_dim = 256  # Embedding size for each token
num_heads = 2  # Number of attention heads
feed_forward_dim = 256  # Hidden layer size in feed forward network inside transformer


def create_model():
    inputs = layers.Input(shape=(maxlen,), dtype=tf.int32)
    embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
    x = embedding_layer(inputs)
    transformer_block = TransformerBlock(embed_dim, num_heads, feed_forward_dim)
    x = transformer_block(x)
    outputs = layers.Dense(vocab_size)(x)
    model = keras.Model(inputs=inputs, outputs=[outputs, x])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(
        "adam", loss=[loss_fn, None],
    )  # No loss and optimization based on word embeddings from transformer block
    return model
In [43]:
batch_size = 128

# The dataset contains each review in a separate text file
# The text files are present in four different folders
# Create a list of all files
filenames = []
directories = [
    "/Users/subirvarma/handson-ml/datasets/aclImdb/train/pos",
    "/Users/subirvarma/handson-ml/datasets/aclImdb/train/neg",
    "/Users/subirvarma/handson-ml/datasets/aclImdb/test/pos",
    "/Users/subirvarma/handson-ml/datasets/aclImdb/test/neg",
]
for directory in directories:  # avoid shadowing the built-in dir()
    for f in os.listdir(directory):
        filenames.append(os.path.join(directory, f))

print(f"{len(filenames)} files")

# Create a dataset from text files
random.shuffle(filenames)
text_ds = tf.data.TextLineDataset(filenames)
text_ds = text_ds.shuffle(buffer_size=256)
text_ds = text_ds.batch(batch_size)


def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    lowercased = tf.strings.lower(input_string)
    stripped_html = tf.strings.regex_replace(lowercased, "<br />", " ")
    return tf.strings.regex_replace(stripped_html, f"([{string.punctuation}])", r" \1")


# Create a vectorization layer and adapt it to the text
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="int",
    output_sequence_length=maxlen + 1,
)
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()  # To get words back from token indices


def prepare_lm_inputs_labels(text):
    """
    Shift word sequences by 1 position so that the target for position (i) is
    word at position (i+1). The model will use all words up till position (i)
    to predict the next word.
    """
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y


text_ds = text_ds.map(prepare_lm_inputs_labels)
text_ds = text_ds.prefetch(tf.data.AUTOTUNE)
45000 files
In [44]:
class TextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model.
    1. Feed some starting prompt to the model
    2. Predict probabilities for the next token
    3. Sample the next token and add it to the next input

    Arguments:
        max_tokens: Integer, the number of tokens to be generated after prompt.
        start_tokens: List of integers, the token indices for the starting prompt.
        index_to_word: List of strings, obtained from the TextVectorization layer.
        top_k: Integer, sample from the `top_k` token predictions.
        print_every: Integer, print after this many epochs.
    """

    def __init__(
        self, max_tokens, start_tokens, index_to_word, top_k=10, print_every=1
    ):
        self.max_tokens = max_tokens
        self.start_tokens = start_tokens
        self.index_to_word = index_to_word
        self.print_every = print_every
        self.k = top_k

    def sample_from(self, logits):
        logits, indices = tf.math.top_k(logits, k=self.k, sorted=True)
        indices = np.asarray(indices).astype("int32")
        preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
        preds = np.asarray(preds).astype("float32")
        return np.random.choice(indices, p=preds)

    def detokenize(self, number):
        return self.index_to_word[number]

    def on_epoch_end(self, epoch, logs=None):
        start_tokens = [_ for _ in self.start_tokens]
        if (epoch + 1) % self.print_every != 0:
            return
        num_tokens_generated = 0
        tokens_generated = []
        while num_tokens_generated <= self.max_tokens:
            pad_len = maxlen - len(start_tokens)
            sample_index = len(start_tokens) - 1
            if pad_len < 0:
                x = start_tokens[:maxlen]
                sample_index = maxlen - 1
            elif pad_len > 0:
                x = start_tokens + [0] * pad_len
            else:
                x = start_tokens
            x = np.array([x])
            y, _ = self.model.predict(x)
            sample_token = self.sample_from(y[0][sample_index])
            tokens_generated.append(sample_token)
            start_tokens.append(sample_token)
            num_tokens_generated = len(tokens_generated)
        txt = " ".join(
            [self.detokenize(_) for _ in self.start_tokens + tokens_generated]
        )
        print(f"generated text:\n{txt}\n")


# Tokenize starting prompt
word_to_index = {}
for index, word in enumerate(vocab):
    word_to_index[word] = index

start_prompt = "this movie is"
start_tokens = [word_to_index.get(_, 1) for _ in start_prompt.split()]
num_tokens_generated = 40
text_gen_callback = TextGenerator(num_tokens_generated, start_tokens, vocab)
In [45]:
model = create_model()

model.fit(text_ds, verbose=2, epochs=15, callbacks=[text_gen_callback])
Epoch 1/15
352/352 - 1812s - loss: 5.6594 - dense_15_loss: 5.6594
generated text:
this movie is a very low budget movie about a [UNK] [UNK] [UNK] . [UNK] " [UNK] [UNK] [UNK] [UNK] " , a [UNK] , a [UNK] ) are all the same movie . the first time , [UNK] of his [UNK] . i

Epoch 2/15
352/352 - 1897s - loss: 4.7506 - dense_15_loss: 4.7506
generated text:
this movie is so awful i think this film is very good to the story of a very good thing that it is that the characters seem to make me laugh at all , and so much to make a film . it 's

Epoch 3/15
352/352 - 1900s - loss: 4.4846 - dense_15_loss: 4.4846
generated text:
this movie is a bad , bad , the bad [UNK] , bad plot , bad acting by all the actors . the actors were so good . the plot was a decent story . the story and it was a very low budget

Epoch 4/15
352/352 - 1904s - loss: 4.3167 - dense_15_loss: 4.3167
generated text:
this movie is a movie that it 's all over , and over again . a film [UNK] me . this movie is a good movie about a boy . it is so wonderful . i have no idea that it doesn 't make

Epoch 5/15
352/352 - 1911s - loss: 4.1876 - dense_15_loss: 4.1876
generated text:
this movie is a very good movie for all the characters that are pretty good but the story is a little too much to say that it is . but it is just a very low budget movie . it is about a [UNK]

Epoch 6/15
352/352 - 1898s - loss: 4.0811 - dense_15_loss: 4.0811
generated text:
this movie is a classic . the first time it takes on a trip back with their own right now and then . it was one of those rare films that we get to see it . the plot concerns a group of [UNK]

Epoch 7/15
352/352 - 1850s - loss: 3.9896 - dense_15_loss: 3.9896
generated text:
this movie is an entertaining movie , i thought it was the biggest disappointment that it had a lot of people were trying to do . it was a joke and the jokes could have been funnier and funnier ! it had me and

Epoch 8/15
352/352 - 1912s - loss: 3.9090 - dense_15_loss: 3.9090
generated text:
this movie is a very well worth seeing . i will be honest and the movie is very much less than a great film , but it isn 't even a very enjoyable experience . it is about a young child who kills people

Epoch 9/15
352/352 - 1914s - loss: 3.8373 - dense_15_loss: 3.8373
generated text:
this movie is very funny . the acting is pretty good . a movie is good , a little movie for everyone who is doing nothing else to do is the same movie ? it is not a movie .    

Epoch 10/15
352/352 - 1910s - loss: 3.7732 - dense_15_loss: 3.7732
generated text:
this movie is a movie that has no plot , the acting and script is bad . the acting isn 't all bad . you have to have a lot . . this movie has nothing to recommend you ! the characters and it

Epoch 11/15
352/352 - 1909s - loss: 3.7149 - dense_15_loss: 3.7149
generated text:
this movie is just a terrible movie that is so stupid and worthless . there is nothing to say about it . it was a joke to laugh at the stupidity . the movie is very good , especially the acting is awful .

Epoch 12/15
352/352 - 1910s - loss: 3.6622 - dense_15_loss: 3.6622
generated text:
this movie is just terrible . i 've watched it again with all the [UNK] [UNK] , who did not have it on the [UNK] [UNK] 's face , or even in the theaters , i don 't understand why it was a good

Epoch 13/15
352/352 - 4072s - loss: 3.6138 - dense_15_loss: 3.6138
generated text:
this movie is one of the best thrillers i have ever seen , but it is a movie to see it again ! it is a great cast and crew . i am a fan of the genre that has made a great movie

Epoch 14/15
352/352 - 3931s - loss: 3.5696 - dense_15_loss: 3.5696
generated text:
this movie is one of the few surprises i have ever seen . however , the acting was poor - a tad better than the writing . the cinematography was excellent . it is not very funny , but it also had to be

Epoch 15/15
352/352 - 17336s - loss: 3.5292 - dense_15_loss: 3.5292
generated text:
this movie is a great and funny movie . if you 're going thru it , the acting , bad writing , implausible and it is not . it 's hard to be a horror movie , and the acting is very good and

Out[45]:
<keras.callbacks.History at 0x158f81ac8>

Summarization using Transformer based Language Models

In [33]:
#trans11
nb_setup.images_hconcat(["DL_images/trans11.png"], width=800)
Out[33]:

Figure trans11 shows how a Transformer based Language Model can be used to do Text Summarization:

  • During the training phase, the original article and its summary are concatenated together as shown, with a special word inserted as a delimiter between the two. The training is done exactly as for the regular Language Model, with the model trying to predict the next word in the summary

  • During the test phase, the article to be summarized is fed into the model, while its summary is generated one word at a time, in an auto-regressive fashion

The Language Model based design used for summarization can also be used for other NLP tasks, such as Machine Translation with the sentence to be translated on the left, and the corresponding translation on the right.
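
As a rough illustration of the data layout described above, the sketch below builds one training sequence by joining an article and its summary around a delimiter, and then forms the usual next-word pairs. The delimiter string "[sep]", the helper name make_summarization_example and the whitespace tokenization are assumptions made for this example only; they are not taken from the figure.

```python
# Hypothetical delimiter token; the actual special word is a design choice.
DELIMITER = "[sep]"

def make_summarization_example(article, summary):
    """Join article and summary into one token list, then build next-word pairs."""
    tokens = article.split() + [DELIMITER] + summary.split()
    inputs = tokens[:-1]    # what the Language Model sees at each position
    targets = tokens[1:]    # the next word at each position
    return inputs, targets

inputs, targets = make_summarization_example(
    "the cat sat on the mat all day", "cat sat all day")
for x, y in zip(inputs, targets):
    print(f"{x:>8} -> {y}")
```

The same layout would apply to translation, with the source sentence on the left of the delimiter and its translation on the right.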

Encoder-Decoders using Transformers

So far we have described Encoder only Transformers, and showed that they can be used to perform most NLP tasks. The original Transformer paper actually featured an Encoder-Decoder type Transformer of the type shown in Figure trans12. It has been shown that both these architectures can be used to perform NLP tasks, and the type to use for a particular application is up to the designer. The Encoder-Decoder design does include an additional feature not found in Encoder only systems, and that is the Cross Attention Layer in the decoder, which is described below.

In [43]:
#trans12
nb_setup.images_hconcat(["DL_images/trans12.png"], width=1200)
Out[43]:

Figure trans12 shows the Encoder Decoder Transformer being used to do Machine Translation. It shows that whereas the Input sentence is fed into the Encoder in one shot, the output translation is generated by the Decoder one word at a time (same as in a Language Model). Hence the first vector fed into the Decoder is the token for Start of Sentence "S", and in response it generates the first word "llego". In stage 2, the tokens for "S" and "llego" are fed into the decoder, which in turn leads to the generation of the word "la", and so on until the End of Sentence token is generated. The figure shows the test phase of the model, in which the words generated by the Decoder are fed back into it as input in an Auto Regressive fashion. During the training phase, the translated phrase is fed into the decoder after passing it through a Mask in order to hide future words.

In [46]:
#trans13
nb_setup.images_hconcat(["DL_images/trans13.png"], width=1400)
Out[46]:

Figure trans13 delves deeper into the Encoder Decoder architecture and shows the following:

  • The Encoder is the same as described previously, consisting of multiple Encoder Block layers.

  • The Decoder consists of multiple Decoder Blocks. Each Decoder Block is similar to an Encoder Block since it also contains Self-Attention and Dense Feed Forward layers. But in addition, it also has a Cross-Attention layer that connects the Encoder with the Decoder.

  • The Encoder output $(h_1,...,h_n)$ is packed into a matrix $H^{enc} = (h_1,...,h_n)$. We then multiply $H^{enc}$ by the Cross-Attention Key and Value matrices $W^K$ and $W^V$, to produce the matrices $K, V$: $$ K = H^{enc}W^K, \ \ V = H^{enc}W^V $$ The Query matrix $Q$, on the other hand, is computed from the output of the prior decoder Self-Attention layer: $$ Q = H^{dec[i-1]}W^Q $$ The output of the Cross-Attention layer is given by $$ Z = \mathrm{softmax}\left({QK^T\over{\sqrt{d}}}\right) V $$ where $d$ is the dimension of the Key vectors.

As a result of the Cross-Attention layer, each of the Decoder Blocks has full access to all the outputs of the Encoder.
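
The Cross-Attention equations can be traced with a small single-head sketch in plain Python. The toy sizes, the random weights and the helper functions (matmul, transpose, softmax_rows, random_matrix) are assumptions introduced for illustration; a real model would use a multi-head layer such as the one in the Keras code later in this section.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_enc, n_dec, d = 4, 3, 8              # toy sizes, chosen only for illustration

H_enc = random_matrix(n_enc, d, rng)   # Encoder outputs h_1..h_n, one per row
H_dec = random_matrix(n_dec, d, rng)   # output of the Decoder Self-Attention layer
W_Q, W_K, W_V = (random_matrix(d, d, rng) for _ in range(3))

Q = matmul(H_dec, W_Q)                 # queries come from the Decoder
K = matmul(H_enc, W_K)                 # keys come from the Encoder
V = matmul(H_enc, W_V)                 # values come from the Encoder

scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, transpose(K))]
A = softmax_rows(scores)               # one row of weights per Decoder position
Z = matmul(A, V)                       # one output vector per Decoder position

print(len(A), len(A[0]))               # Decoder positions x Encoder positions
print(len(Z), len(Z[0]))               # Decoder positions x d
```

Each row of A sums to one and spans all Encoder positions, which is the sense in which every Decoder position can look at the entire input sentence.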

The following example, taken from Chollet Chapter 11, uses the Encoder Decoder Transformer to do English to Spanish translation. We start by downloading the dataset and creating a list called text_pairs, each element of which is an English sentence paired with its Spanish translation. The tokens "[start]" and "[end]" are added to the beginning and end of each Spanish sentence:

In [46]:
!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
!unzip -q spa-eng.zip
--2022-04-22 14:42:22--  http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.6.48, 172.217.164.112, 142.250.189.208, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.6.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2638744 (2.5M) [application/zip]
Saving to: ‘spa-eng.zip’

spa-eng.zip         100%[===================>]   2.52M   234KB/s    in 9.6s    

2022-04-22 14:42:38 (269 KB/s) - ‘spa-eng.zip’ saved [2638744/2638744]

In [1]:
text_file = "/Users/subirvarma/handson-ml/datasets/spa-eng/spa.txt"
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    english, spanish = line.split("\t")
    spanish = "[start] " + spanish + " [end]"
    text_pairs.append((english, spanish))

Here is what a randomly selected sample from the text_pairs list looks like:

In [2]:
import random
print(random.choice(text_pairs))
('He looks old for his age.', '[start] Él luce viejo para su edad. [end]')

The elements of the text_pairs list are randomly shuffled, and then split into training, validation and test datasets.

In [3]:
import random
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

Vectorize the English and Spanish text pairs

In this code block, the vectorization functions for the English and Spanish texts are defined, using a maximum sequence length of 20 words per sentence, and a vocabulary consisting of the 15,000 most frequently used words. Before doing this, we remove the special characters from the text, and convert all text to lowercase characters. The adapt method creates a mapping between the words in the vocabulary and their corresponding integer codes.

In [9]:
import tensorflow as tf
from tensorflow.keras import layers
import string
import re

strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(
        lowercase, f"[{re.escape(strip_chars)}]", "")

vocab_size = 15000
sequence_length = 20

source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
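    # Note: custom_standardization is left disabled in this run, so the layer uses
    # its default standardization, which also strips the "[" and "]" brackets.
    #standardize=custom_standardization,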
)
train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]
source_vectorization.adapt(train_english_texts)
print(random.choice(train_english_texts))
print(random.choice(train_spanish_texts))
target_vectorization.adapt(train_spanish_texts)
Do you want to talk?
[start] Uno de mis hermanos toca el fagot. [end]

Preparing training and validation datasets for the translation task

The main task of the following code is to create the training and validation datasets for the model. Note that the input into the model consists of the encoded English sentence (to be fed into the Encoder) together with the encoded Spanish sentence (to be fed into the Decoder), while the target consists of the Spanish sentence only (at the output of the Decoder). However, note that the target is offset from the Decoder input by one word, so that each Spanish word in the input sequence is mapped to the next Spanish word in that sentence. This is accomplished in the format_dataset function, which uses spa[:, :-1] as the Decoder input and spa[:, 1:] as the target. The tf.data.Dataset.from_tensor_slices function creates a dataset of pairs of English and Spanish sentences, which are then vectorized and formatted into the correct input and target tensors by means of the format_dataset function.

In [10]:
batch_size = 64

def format_dataset(eng, spa):
    eng = source_vectorization(eng)
    spa = target_vectorization(spa)
    return ({
        "english": eng,
        "spanish": spa[:, :-1],
    }, spa[:, 1:])

def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.shuffle(2048).prefetch(16).cache()

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)
In [11]:
for inputs, targets in train_ds.take(1):
    print(f"inputs['english'].shape: {inputs['english'].shape}")
    print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
    print(f"targets.shape: {targets.shape}")
inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)
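
The one word offset between the Decoder input and the target is easiest to see on a single toy sequence. The token ids below are made up for illustration; in the notebook the Spanish sequences are vectorized to length 21, so that both slices have length 20, as the shapes printed above confirm.

```python
# Toy token ids for "[start] el gato duerme [end]" followed by padding (0).
# The id values are invented for this illustration.
spa = [2, 17, 45, 88, 3, 0]

decoder_input = spa[:-1]   # [start] el gato duerme [end]
target = spa[1:]           # el gato duerme [end] <pad>

for step, (seen, expected) in enumerate(zip(decoder_input, target)):
    print(f"position {step}: input id {seen:>2} -> target id {expected:>2}")
```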

The Transformer Encoder

As shown below, the Encoder module uses a Transformer Encoder block of the type that we have already seen.

In [12]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

The Transformer Decoder

The Transformer Decoder has two Attention layers: The first Attention layer computes the Self-Attention for the Decoder input sequence, while the second Attention layer implements the Cross Attention between the Key and Value vectors at the output of the Encoder and the Query vector from the output of the Decoder Self Attention layer. The other important function is the Causal Attention Mask, which makes sure that the Decoder Self-Attention calculation for the $i^{th}$ position does not take into account the Spanish words at position (i + 1) and beyond.

In [13]:
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            # No padding mask was propagated: avoid an undefined padding_mask below.
            padding_mask = mask
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)
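
The mask built by get_causal_attention_mask can be reproduced for a short sequence without TensorFlow. The sketch below uses the same rule (entry (i, j) is 1 when i >= j) and then combines it with a padding mask by taking the element-wise minimum, mirroring the tf.minimum call above. The function name causal_mask and the toy padding vector are assumptions for this example.

```python
def causal_mask(sequence_length):
    """Entry (i, j) is 1 when position i may attend to position j, i.e. j <= i."""
    return [[1 if i >= j else 0 for j in range(sequence_length)]
            for i in range(sequence_length)]

mask = causal_mask(4)
for row in mask:
    print(row)

# Toy padding mask: the last position is padding and must never be attended to.
padding = [1, 1, 1, 0]
combined = [[min(m, p) for m, p in zip(row, padding)] for row in mask]
for row in combined:
    print(row)
```

The first block prints a lower triangular matrix of ones; in the second, the last column is zeroed out as well.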

Positional Embedding Layer

In [15]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super(PositionalEmbedding, self).get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config

End to End Transformer

In [20]:
import tensorflow as tf
from tensorflow import keras

embed_dim = 256
dense_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="spanish")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

The model summary shows that it has a total of 19,960,216 parameters, of which about 3.9 million are in the dense layer at the output of the Decoder, and about 7.7 million are in the two Embedding layers in the Encoder and Decoder. The Transformer Encoder and Decoder themselves account for a combined 8.4 million parameters.

In [22]:
transformer.summary()
Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
english (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
spanish (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
positional_embedding_2 (Positio (None, None, 256)    3845120     english[0][0]                    
__________________________________________________________________________________________________
positional_embedding_3 (Positio (None, None, 256)    3845120     spanish[0][0]                    
__________________________________________________________________________________________________
transformer_encoder_1 (Transfor (None, None, 256)    3155456     positional_embedding_2[0][0]     
__________________________________________________________________________________________________
transformer_decoder_1 (Transfor (None, None, 256)    5259520     positional_embedding_3[0][0]     
                                                                 transformer_encoder_1[0][0]      
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, None, 256)    0           transformer_decoder_1[0][0]      
__________________________________________________________________________________________________
dense_9 (Dense)                 (None, None, 15000)  3855000     dropout_1[0][0]                  
==================================================================================================
Total params: 19,960,216
Trainable params: 19,960,216
Non-trainable params: 0
__________________________________________________________________________________________________
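
The counts in the summary can be cross-checked with a few lines of arithmetic. The breakdown of the attention layer into query, key, value and output projections (each with a bias) is an assumption about how the parameters are laid out; it is used here because it reproduces the numbers reported above.

```python
vocab_size, sequence_length = 15000, 20
embed_dim, dense_dim, num_heads = 256, 2048, 8
key_dim = embed_dim  # matches key_dim=embed_dim in the layer definitions

def dense(n_in, n_out):
    return n_in * n_out + n_out                       # weights + biases

def multi_head_attention():
    qkv = 3 * dense(embed_dim, num_heads * key_dim)   # query, key, value projections
    out = dense(num_heads * key_dim, embed_dim)       # output projection
    return qkv + out

feed_forward = dense(embed_dim, dense_dim) + dense(dense_dim, embed_dim)
layer_norm = 2 * embed_dim                            # scale and offset vectors

embedding = vocab_size * embed_dim + sequence_length * embed_dim
encoder = multi_head_attention() + feed_forward + 2 * layer_norm
decoder = 2 * multi_head_attention() + feed_forward + 3 * layer_norm
output_dense = dense(embed_dim, vocab_size)

total = 2 * embedding + encoder + decoder + output_dense
print(embedding, encoder, decoder, output_dense, total)
```

Each value matches the corresponding row of the summary, and the total comes to 19,960,216.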

Training the Encoder Decoder Transformer

In [32]:
transformer.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
transformer.fit(train_ds, epochs=15, validation_data=val_ds)
Epoch 1/15
1302/1302 [==============================] - 2282s 2s/step - loss: 1.2405 - accuracy: 0.5657 - val_loss: 1.1191 - val_accuracy: 0.5895
Epoch 2/15
1302/1302 [==============================] - 2353s 2s/step - loss: 1.1339 - accuracy: 0.6021 - val_loss: 1.0682 - val_accuracy: 0.6175
Epoch 3/15
1302/1302 [==============================] - 2359s 2s/step - loss: 1.0775 - accuracy: 0.6270 - val_loss: 1.0456 - val_accuracy: 0.6311
Epoch 4/15
1302/1302 [==============================] - 2368s 2s/step - loss: 1.0422 - accuracy: 0.6451 - val_loss: 1.0201 - val_accuracy: 0.6418
Epoch 5/15
1302/1302 [==============================] - 2357s 2s/step - loss: 1.0173 - accuracy: 0.6587 - val_loss: 1.0311 - val_accuracy: 0.6430
Epoch 6/15
1302/1302 [==============================] - 4779s 4s/step - loss: 0.9968 - accuracy: 0.6698 - val_loss: 1.0104 - val_accuracy: 0.6506
Epoch 7/15
1302/1302 [==============================] - 3814s 3s/step - loss: 0.9792 - accuracy: 0.6799 - val_loss: 1.0087 - val_accuracy: 0.6527
Epoch 8/15
1302/1302 [==============================] - 3759s 3s/step - loss: 0.9629 - accuracy: 0.6878 - val_loss: 1.0126 - val_accuracy: 0.6544
Epoch 9/15
1302/1302 [==============================] - 17043s 13s/step - loss: 0.9485 - accuracy: 0.6950 - val_loss: 1.0109 - val_accuracy: 0.6574
Epoch 10/15
1302/1302 [==============================] - 10904s 8s/step - loss: 0.9344 - accuracy: 0.7013 - val_loss: 1.0141 - val_accuracy: 0.6585
Epoch 11/15
1302/1302 [==============================] - 2268s 2s/step - loss: 0.9213 - accuracy: 0.7078 - val_loss: 1.0100 - val_accuracy: 0.6610
Epoch 12/15
1302/1302 [==============================] - 2350s 2s/step - loss: 0.9076 - accuracy: 0.7134 - val_loss: 1.0169 - val_accuracy: 0.6606
Epoch 13/15
1302/1302 [==============================] - 2370s 2s/step - loss: 0.8971 - accuracy: 0.7184 - val_loss: 1.0168 - val_accuracy: 0.6627
Epoch 14/15
1302/1302 [==============================] - 2377s 2s/step - loss: 0.8839 - accuracy: 0.7233 - val_loss: 1.0150 - val_accuracy: 0.6632
Epoch 15/15
1302/1302 [==============================] - 2381s 2s/step - loss: 0.8748 - accuracy: 0.7271 - val_loss: 1.0197 - val_accuracy: 0.6645
Out[32]:
<keras.callbacks.History at 0x156b45748>

We are now going to use the trained model to do English to Spanish translation. The English sentence is fed into the Encoder in one shot, exactly as in the training phase. The Decoder, on the other hand, is run on a stage by stage basis, such that the words generated in the previous stages are fed back as its input in the current stage, i.e., the Decoder is run in the Auto-Regressive mode.

Before the model can be run, we create a dictionary which maps each index to its corresponding Spanish word. The Spanish sentence is generated one word at a time, starting with the token "[start]"; the loop is written to stop when the token "[end]" is sampled, or after max_decoded_sentence_length words. During each stage the model predicts the probabilities of the 15,000 possible Spanish words. The output probabilities for the $i^{th}$ stage are converted into a word by first choosing the word index that has the maximum probability, and then using the lookup dictionary to convert it into the corresponding word.

In [47]:
import numpy as np
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization(
            [decoded_sentence])[:, :-1]
        predictions = transformer(
            [tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))
-
It is cheaper to go by bus.
[start] es más difícil ir en autobús end  end end end end end end end end end end end end
-
It's supposed to snow tomorrow.
[start] se va a [UNK] mañana end  end end end end end end end end end end end end end
-
This sounds very interesting.
[start] esto parece muy interesante end  end end end end end end end end end end end end  end
-
I am willing to help you.
[start] yo estoy empezando a ayudar end  end end end end end end end end end end end end end
-
Tom just cleaned his room.
[start] tom acaba de responder a su habitación end  end end end end end end end end end end end
-
January is the first month of the year.
[start] end  es el el mes de end al end end end end end end end end end  end
-
How many brothers do you have?
[start] ¿cuántos hermanos tienes end  end end end end end end end end end end end end end  end
-
I wasn't the only one who didn't know Tom.
[start] no fue el único que no sabía tom era el que no sabía end  end end end end end
-
Tom was the oldest person in the room.
[start] tom fue la persona más mayor de la habitación end  end end end end end end end  end
-
I'd like it if you were always honest with me.
[start] quisiera que yo [UNK] siempre es [UNK] conmigo end  end end end end end end end end  end
-
Tom was suddenly overcome by fear.
[start] tom fue de repente de miedo de mala forma end  end end end end end end end  end
-
Tom dropped the ball.
[start] tom dejó la pelota end  end end end end end end end end end end end end end end
-
His only wish was to see his son again one more time.
[start] su único que quisiera ser [UNK] a su hijo más tiempo end  end más end end end  end
-
Are you telling me you don't know how to cook hard-boiled eggs?
[start] ¿me está decir que no sabes oír a sal end  end end end end end end end end end
-
What time do you want me to come?
[start] ¿a qué hora quieres que venga end  end end end end end end end end end end end end
-
I tried to stop their quarrel, but that was not easy.
[start] intentó [UNK] para [UNK] la la la la la la la la la la la la la la la [UNK]
-
Some food was brought to them.
[start] algunas comida fueron [UNK] end  end end end end end end end end end end end end end end
-
We have just a tiny bit of garden.
[start] tenemos solo un poco de jardín end  más end end end end end end end end end end end
-
Just ignore him.
[start] solo le [UNK] end  end end end end end end end end end end end end end  end
-
Tom knew what was hidden in the cave.
[start] tom sabía lo que estaba [UNK] en la [UNK] end  end end end end end end end end end
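
In the sample translations above, the token end keeps repeating and every sentence runs to the maximum length. One plausible reason (an assumption, since the vocabulary is not printed here) is that the vectorization layer stored the marker without its brackets, so that the comparison with "[end]" in decode_sequence never succeeds. The sketch below is a hypothetical post-processing helper, not part of the original notebook, that cuts a decoded string at the first end marker in either spelling.

```python
def trim_translation(decoded, start="[start]", end_markers=("[end]", "end")):
    """Return the words between the start marker and the first end marker."""
    words = decoded.split()
    if words and words[0] == start:
        words = words[1:]
    kept = []
    for word in words:
        if word in end_markers:
            break                     # everything after the first end marker is dropped
        kept.append(word)
    return " ".join(kept)

raw = "[start] tom dejó la pelota end  end end end"
print(trim_translation(raw))
```

Applied to the string above, the helper returns "tom dejó la pelota". An alternative would be to re-enable the custom standardization so that the bracketed tokens survive vectorization, which would require retraining the model.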

It is interesting to compare the Encoder Decoder Transformer based Translation system with the one that was described in Chapter NLP which was an Encoder Decoder system based on GRUs (the Encoder used a Bi-Directional GRU, while the Decoder was Uni-Directional). Note that the input into both these models is the same, and so are some of the internal parameters such as Embedding Size. We find the following:

  • The number of parameters in the Embedding Layer is slightly larger for Transformers. This is due to the extra 5120 ( = 20 * 256) parameters required to do the Positional Embedding.
  • The number of parameters in the GRU portion of the model (Enc + Dec) is larger than the number of parameters in the Encoder + Decoder portion of the Transformer model, 11,814,912 vs 8,414,976. This is due to the fact that the internal state of the GRU has 1024 nodes, while the internal state of the Transformer has 256 nodes.
  • The number of parameters in the output Dense layer is much larger for the GRU (15,375,000) compared to that for the Transformer (3,855,000). This is due to the fact that the 1024 nodes in the GRU densely connect to the 15,000 nodes of the vocab_size, while there are only 256 nodes in the Transformer model in a similar position.

As a result of the Dense layer parameter difference, the total number of parameters in the GRU based model (42,554,912) is much larger than the total number of parameters in the Transformer based model (19,960,216), even though the parameter counts of the core models are much closer.

The performance of the two models is also comparable, based on a comparison of the validation accuracies. However, the time required to execute a single epoch was much larger for the GRU based model (about 4600 sec/epoch) compared to the Transformer based model (about 2300 sec/epoch). This can be attributed to the serialized nature of the GRU computation.

BERT: Bi-Directional Language Models

The Transformer Encoder based Language Model that was described earlier in this chapter was characterized by a training scheme whereby the model was tasked with predicting the next word in the training sequence. As a result, the representation that the model generated for a particular word was influenced only by the words that came before it in the sequence. However, the meaning of a word clearly depends not only on the words that come before it, but also on the words that come after it in a sentence. For example:

The bank of the river was very green and shady

Bob deposited the check in the bank

Hence it should be possible to get better word representations by taking into account all the words in a sequence, as opposed to only the prior words. This was the main motivation behind the model called BERT, which stands for Bi-Directional Encoder Representations from Transformers. The Self Attention Layer in a BERT model is shown in Figure trans14, and the reader may notice that it is nothing more than the Transformer Self Attention Layer. Hence the Self Attention calculation for each of the input words takes into account all the other words in the sequence. The resulting system is referred to as a Masked Language Model or MLM.

In [28]:
#trans14
nb_setup.images_hconcat(["DL_images/trans14.png"], width=1200)
Out[28]:

The novelty in the BERT model was the technique used for training the system, which is illustrated in Figure trans15. Instead of trying to predict the next word in the sequence, BERT tries to predict missing words, which can occur anywhere in the sequence. The actual scheme used is a little more sophisticated, and works as follows: Up to 15% of the words in a sequence are randomly selected for prediction. Out of these, 80% of the words are replaced by a special MASK token, 10% of the words are left unchanged and the remaining 10% are replaced by a randomly selected word. For example in the figure below, the words 'long' and 'thanks' are replaced by the MASK token, while the word 'apricot' is replaced by 'the'. This scheme was designed to mitigate the mismatch between training sequences and the sequences used during fine-tuning, since the latter do not use the MASK token. The usual Cross Entropy Loss is used to do the prediction for the selected words.

In [29]:
#trans15
nb_setup.images_hconcat(["DL_images/trans15.png"], width=1200)
Out[29]:
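
The 80:10:10 corruption rule can be sketched in a few lines. This is a simplified illustration under stated assumptions: each word is selected independently with probability 0.15, the tiny vocabulary is invented, and the function name mask_for_mlm is hypothetical. It is not the procedure used by any particular BERT implementation.

```python
import random

MASK = "[MASK]"

def mask_for_mlm(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted tokens, {position: original word}) for MLM training."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    targets = {}
    for pos, word in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                             # word not selected for prediction
        targets[pos] = word                      # the loss is computed only at these positions
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = MASK                # 80%: replace with the MASK token
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)   # 10%: replace with a random word
        # remaining 10%: leave the word unchanged
    return corrupted, targets

sentence = "so long and thanks for all the fish".split()
vocab = ["apricot", "the", "river", "bank", "green"]
corrupted, targets = mask_for_mlm(sentence, vocab, rng=random.Random(1))
print(corrupted)
print(targets)
```

Positions that are not in targets are passed through unchanged and do not contribute to the Cross Entropy Loss.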

In addition to the masked word prediction method, there was another technique that was proposed in a model called SpanBERT, which tries to predict words located within a span of missing words, as shown in Figure trans16. Since several Language Modeling tasks involve identification or classification of parts of a sentence, this training technique has been shown to improve their performance. The span based training works as follows: The length of the span is chosen randomly by sampling from a geometric distribution, and is limited to 10 words or less. The start of the span is randomly selected using a uniform distribution. The Loss Function used to predict a word occurring within the span is the sum of two loss functions:

  • The first Loss Function is simply the Cross Entropy Loss associated with the word being predicted
  • The second Loss Function is computed using the word immediately preceding the span AND the word immediately following the span, augmented with a position token for the location of the missing word within the span.

Once again up to 15% of the words are selected for prediction within a sequence, and as before the missing words may be replaced by the MASK token, a randomly selected word or the word itself (in the ratio 80:10:10).

In [30]:
#trans16
nb_setup.images_hconcat(["DL_images/trans16.png"], width=1200)
Out[30]:
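
The span selection step described above can be sketched as follows. The geometric parameter p = 0.2, the simple capping of the length at 10 and the function name sample_span are assumptions made for this illustration; the text only states that the length is geometric, limited to 10 words, and that the start is uniform.

```python
import random

def sample_span(sequence_length, p=0.2, max_span=10, rng=None):
    """Pick (start, length) for one masked span."""
    rng = rng or random.Random()
    # Geometric length: keep extending with probability (1 - p), capped at max_span.
    length = 1
    while length < max_span and rng.random() > p:
        length += 1
    length = min(length, sequence_length)
    # Uniform start among the positions where the whole span fits.
    start = rng.randrange(0, sequence_length - length + 1)
    return start, length

rng = random.Random(0)
for _ in range(5):
    start, length = sample_span(30, rng=rng)
    print(f"mask positions {start}..{start + length - 1}")
```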

BERT's training so far has focused on obtaining the best representation for a word as a function of the other words in the sequence. However, there are certain NLP tasks that have to do with finding relationships between pairs of sentences. In order to improve BERT's performance for these tasks, the following additional training scheme was added: The model is fed with two sentences, and it is tasked with determining whether the second sentence follows the first (see Figure trans29). The training dataset consists of 50% samples in which two successive sentences are fed into the model, while the remaining 50% consist of pairs of unrelated sentences. In order to facilitate this training, the following changes are made: Two additional tokens CLS and SEP are added to the input, with CLS prepended to the first sentence and SEP inserted after the first sentence and also after the end of the second sentence. In addition to the Word Token and Positional Embeddings, another embedding called the Segment embedding is added, which indicates whether a word is part of the first or second sentence. The output vector corresponding to the CLS token is used for doing the sentence classification. This vector is passed through a dense layer $W_{NSP}$ followed by a binary softmax based classifier.

In [4]:
#trans29
nb_setup.images_hconcat(["DL_images/trans29.png"], width=1200)
Out[4]:

One of the important benefits of Transformer based models such as BERT is that after being trained using self-supervised learning methods, they can be tuned to individual tasks using supervised learning. These tasks typically have much smaller training datasets. To do this, the model is initialized with the parameters of a fully trained self-supervised system, and then fine-tuned in a supervised manner using the smaller dataset. This is an example of Transfer Learning, which we encountered earlier with ConvNet based Image Processing systems. Earlier sequence processing models such as RNN/LSTMs were not very good at Transfer Learning, so this has been an important advance in the state of the art.

Figure trans17 shows how BERT can be used to do sentiment classification using a pre-trained model. The CLS token is inserted at the start of the sequence, and then the output vector corresponding to CLS is used for classification. The model is then fine-tuned using the training dataset with labeled sentences.
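
The classification step itself is small, as the following NumPy sketch shows. The encoder output here is random data standing in for the output of a pre-trained BERT model, and the sizes (12 tokens, 768-dimensional vectors, 2 sentiment labels) are assumptions chosen for illustration.

```python
import numpy as np

def classify_from_cls(encoder_output, W, b):
    """encoder_output has shape (seq_len, d_model); position 0 holds the CLS vector."""
    cls_vector = encoder_output[0]
    logits = cls_vector @ W + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
seq_len, d_model, num_labels = 12, 768, 2
encoder_output = rng.normal(size=(seq_len, d_model))    # stand-in for BERT's output
W = rng.normal(scale=0.02, size=(d_model, num_labels))  # classifier weights
b = np.zeros(num_labels)
probs = classify_from_cls(encoder_output, W, b)
```

During fine-tuning the gradient of the classification loss flows through `W` and `b` and back into the encoder parameters, so both the new head and the pre-trained model are updated.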

In [31]:
#trans17
nb_setup.images_hconcat(["DL_images/trans17.png"], width=1200)
Out[31]:

Image Processing using Transformers

After Transformers proved to be adept at NLP tasks, the focus shifted to trying them out for Image Processing tasks. Just as ConvNets can be used to process 2D language data, Transformers can also be adapted to process 3D image data. In order to do so the 3D image tensor has to be transformed into a sequence of vectors (i.e., a 2D tensor) before being fed into the Transformer model. The performance of the resulting system, called Vision Transformer or ViT (Dosovitskiy et al., 2021), has been shown to be better than that of the best performing ConvNets when enough training data is available, which seems to point to the conclusion that the filtering in the Self Attention architecture also subsumes that used for ConvNets. Indeed it has been shown by Cordonnier, Loukas and Jaggi that the local filtering in ConvNets arises naturally in self-attention based image models as a result of training. The disadvantage of using Transformers for images is that a much larger dataset is required for training, since Transformers don't make any assumptions about the locality of patterns.

The critical design decision in the ViT model was to decide what aspect of an image should be used to replace the word vectors used as NLP model input. The most straightforward choice would be to use the pixel vectors (for instance along the channel axis). However, given the large number of pixels in a typical image, this leads to an unacceptably high computation load in the Attention layers, since the cost of Self Attention grows as the square of the sequence length. The initial attempts at Image Processing Transformers focused on trying to reduce this computation load by various techniques, such as local self-attention around the query vector. The ViT model achieved the critical breakthrough by using image patches instead, as explained next.

The finding that Transformers are good at Image Processing opens up the possibility of using Transformer models for multi-modal processing, i.e., being able to handle not just language data but also images, sound, video etc.

In [14]:
#trans18
nb_setup.images_hconcat(["DL_images/trans18.png"], width=2000)
Out[14]:

The main idea behind ViT is quite simple and is illustrated in Figure trans18. Given an input image X of shape $R^{H\times W\times C}$, where $C$ is the number of channels and $H$ and $W$ are the dimensions of the image in pixels, sub-divide it into a sequence of flattened 2D patches, which has shape $R^{N\times (P^2 C)}$. The patches are obtained by dividing the original image into image patches of size $P\times P\times C$ as shown in the figure, so that there are $N = {HW\over P^2}$ image patches in all. Each image patch is then flattened, which results in $N$ patch vectors of size $P^2 C$. These patch vectors are then sent through a learnable embedding layer, and a position embedding is added to them, to create the input into the model. The Transformer model itself is exactly the same as was used for NLP.
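
The patch extraction step can be sketched with plain NumPy reshapes. The sizes below ($H = W = 72$, $C = 3$, $P = 6$, embedding size 64) are illustrative assumptions, and the random matrices `E` and `pos` merely stand in for the learnable patch embedding and position embedding.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into N = HW / P^2 flattened patches of size P*P*C."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image size must be a multiple of the patch size"
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    blocks = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
H, W, C, P, D = 72, 72, 3, 6, 64
image = rng.random((H, W, C))
patches = patchify(image, P)                              # shape (144, 108)
E = rng.normal(scale=0.02, size=(P * P * C, D))           # stand-in for the patch embedding
pos = rng.normal(scale=0.02, size=(patches.shape[0], D))  # stand-in for the position embedding
tokens = patches @ E + pos                                # shape (144, 64): model input
```

Each row of `tokens` plays the same role that a word embedding plays in an NLP Transformer.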

In [41]:
#trans19
nb_setup.images_hconcat(["DL_images/trans19.png"], width=1000)
Out[41]:

In order to test this new model, the researchers created three versions of ViT, as shown in Figure trans19, with increasing size. We use the notation ViT-L/16 to refer to the "Large" variant, with 16×16 input patch size. Note that decreasing the patch size increases the sequence length into the model, thus making it more computationally intensive.
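
The effect of the patch size on the computation load can be made concrete with a little arithmetic. The sketch below assumes a 224 × 224 input image (our choice for illustration) and counts the number of entries in the Attention matrix per head and per layer, which grows as the square of the sequence length. A patch size of 1 corresponds to feeding in individual pixels.

```python
def vit_sequence_stats(image_size, patch_size):
    """Return (sequence length, number of entries in one Attention matrix)."""
    num_patches = (image_size // patch_size) ** 2
    return num_patches, num_patches ** 2

for patch_size in (32, 16, 14, 1):
    n, entries = vit_sequence_stats(224, patch_size)
    print(f"patch {patch_size:>2} x {patch_size:<2}: sequence length {n:>6}, attention entries {entries:>13,}")
```

With 16 × 16 patches the sequence has 196 elements, whereas treating every pixel as a token gives 50,176 elements and roughly 2.5 billion Attention entries, which is why pixel level Self Attention is impractical.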

For comparisons to ConvNets, the researchers used slightly modified versions of ResNets, which are referred to as "ResNet (BiT)".

In [42]:
#trans20
nb_setup.images_hconcat(["DL_images/trans20.png"], width=1000)
Out[42]:

Figure trans20 has a comparison of the Transformer models (the colored circles) as well as a range of ConvNet models of varying size (in the shaded area). The performance numbers were generated by pre-training the models using the ImageNet dataset (1.3M images with 1K classes), the ImageNet-21k dataset (14M images with 21K classes) and the JFT-300M dataset (303M images with 18K classes), and then fine-tuning them on ImageNet.

The most interesting observation from this graph is that the performance of the ViT models varies very strongly as a function of the training dataset size. Indeed with the smallest dataset (ImageNet), the ResNet models outperform all the ViT models. With the intermediate size dataset ImageNet-21k, their performances are about the same, while with the largest dataset, the best ViT model performs better than all the ResNet models. From this we can conclude that the strong inductive prior built into ResNet models, that feature representations are only influenced by nearby features in the neighborhood, works well when the training set is not very large. However for large training sets such as JFT-300M, learning the relevant attention patterns from the data works as well or better. This conclusion is further reinforced by the results in Figure trans21, which shows the performance of the models on random subsets of JFT-300M of increasing size. The performance of the ResNet models starts out better than that of the ViT models, but flattens out for larger datasets. ViT models on the other hand show progressively improving performance with larger datasets and larger models, which opens up the possibility that with increasing compute resources even better performance can be achieved.

The researchers also showed that the compute resources required for training ViT models are 2-4x lower than those for ResNet models, for comparable levels of performance. For example it took about 2.5K TPUv3-core-days to train the ViT-H/14 model, while the BiT-L ResNet model took 9.9K.

In [38]:
#trans21
nb_setup.images_hconcat(["DL_images/trans21.png"], width=2000)
Out[38]:

Figure trans22 illustrates another very important aspect of ViT models: It is a plot of the Attention weights $w_{ij}$ in the ViT model, as a function of the Network Depth (or layer index). It shows the extent to which the patch vector at any particular position is influenced by other patch vectors in the sequence. The plot illustrates that even in the early layers, the query patch is already paying attention to patches that are far from it; indeed it seems to be paying attention across a broad spectrum of all the patches in the sequence. In later layers on the other hand, the Attention seems to be focused more on patches that are further away. This behavior is in contrast to that in ConvNets, in which the convolution in the early layers is influenced solely by pixels that are in the immediate neighborhood. As a result of this, the ViT is able to take into account aspects of the input image that are located further away from the query patch, thus allowing it to detect global patterns in the image that the early layers of a ConvNet cannot see. This is especially useful in applications such as object detection, where a global view can be very useful.
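
One simple way to put a number on how far a query patch "looks" is the Attention-weighted average distance between the query patch and all other patches. The sketch below is written under our own assumptions (a single Attention head, patches laid out on a regular square grid, each row of the Attention matrix summing to 1) and is not necessarily the exact procedure behind Figure trans22.

```python
import numpy as np

def mean_attention_distance(attn, grid_size, patch_size):
    """attn: (N, N) matrix of weights w_ij with N = grid_size**2 and rows summing to 1."""
    # Row and column index of every patch on the grid, converted to pixel units.
    rows, cols = np.divmod(np.arange(grid_size ** 2), grid_size)
    coords = np.stack([rows, cols], axis=1) * patch_size
    # Euclidean distance between every pair of patches, shape (N, N).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Expected distance under each query's Attention weights, averaged over queries.
    return float((attn * dist).sum(axis=1).mean())

grid_size, patch_size = 12, 6
N = grid_size ** 2
local = np.eye(N)                      # every patch attends only to itself
uniform = np.full((N, N), 1.0 / N)     # every patch attends equally to all patches
d_local = mean_attention_distance(local, grid_size, patch_size)
d_uniform = mean_attention_distance(uniform, grid_size, patch_size)
```

A purely local head gives a distance of zero, while a head that spreads its Attention over the whole image gives a distance comparable to the image size.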

In [44]:
#trans22
nb_setup.images_hconcat(["DL_images/trans22.png"], width=1000)
Out[44]:

We now present the Keras code for ViT, taken from one of the examples on the keras.io website. The model is used to classify the images in the CIFAR-100 dataset.

In [58]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
In [59]:
num_classes = 100
input_shape = (32, 32, 3)

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()

print(f"x_train shape: {x_train.shape} - y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape} - y_test shape: {y_test.shape}")
x_train shape: (50000, 32, 32, 3) - y_train shape: (50000, 1)
x_test shape: (10000, 32, 32, 3) - y_test shape: (10000, 1)
In [60]:
learning_rate = 0.001
weight_decay = 0.0001  # Defined here, but not applied by the plain Adam optimizer used below
batch_size = 256
num_epochs = 100
image_size = 72  # We'll resize input images to 72 X 72 pixels
patch_size = 6  # Size of the patches to be extracted from the input images, equal to 6 X 6 pixels
num_patches = (image_size // patch_size) ** 2
projection_dim = 64  # Patches are projected to vectors of this size before being fed into the model
num_heads = 4
transformer_units = [
    projection_dim * 2,
    projection_dim,
]  # Size of the 2 MLP layers after Self-Attention layer
transformer_layers = 8
mlp_head_units = [2048, 1024]  # Size of the dense layers of the final classifier
In [61]:
data_augmentation = keras.Sequential(
    [
        layers.Normalization(),
        layers.Resizing(image_size, image_size),
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(factor=0.02),
        layers.RandomZoom(
            height_factor=0.2, width_factor=0.2
        ),
    ],
    name="data_augmentation",
)
# Compute the mean and the variance of the training data for normalization.
data_augmentation.layers[0].adapt(x_train)
In [62]:
# This mlp module is used in two places: 
# (1) The dense layers after the Self Attention layer
# (2) The dense layers after the final encoder block

def mlp(x, hidden_units, dropout_rate):
    for units in hidden_units:
        x = layers.Dense(units, activation=tf.nn.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x
In [63]:
# Patches are extracted by using the tf.image.extract_patches function
# After extraction patches are reshaped into vectors of size patch_dims = 6 X 6 X 3 = 108

class Patches(layers.Layer):
    def __init__(self, patch_size):
        super(Patches, self).__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        patch_dims = patches.shape[-1]
        patches = tf.reshape(patches, [batch_size, -1, patch_dims])
        return patches
In [64]:
import matplotlib.pyplot as plt

plt.figure(figsize=(4, 4))
image = x_train[np.random.choice(range(x_train.shape[0]))]
plt.imshow(image.astype("uint8"))
plt.axis("off")

resized_image = tf.image.resize(
    tf.convert_to_tensor([image]), size=(image_size, image_size)
)
patches = Patches(patch_size)(resized_image)
print(f"Image size: {image_size} X {image_size}")
print(f"Patch size: {patch_size} X {patch_size}")
print(f"Patches per image: {patches.shape[1]}")
print(f"Elements per patch: {patches.shape[-1]}")

n = int(np.sqrt(patches.shape[1]))
plt.figure(figsize=(4, 4))
for i, patch in enumerate(patches[0]):
    ax = plt.subplot(n, n, i + 1)
    patch_img = tf.reshape(patch, (patch_size, patch_size, 3))
    plt.imshow(patch_img.numpy().astype("uint8"))
    plt.axis("off")
Image size: 72 X 72
Patch size: 6 X 6
Patches per image: 144
Elements per patch: 108
In [65]:
# Patches are encoded by projecting them into a vector of size projection_dim, using a dense layer.
# Position vectors are created with an Embedding layer, which maps each position index
# (0 to num_patches - 1) to a learned vector of size projection_dim

class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim):
        super(PatchEncoder, self).__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        encoded = self.projection(patch) + self.position_embedding(positions)
        return encoded
In [72]:
def create_vit_classifier():
    inputs = layers.Input(shape=input_shape)
    # Augment data.
    augmented = data_augmentation(inputs)
    # Create patches.
    patches = Patches(patch_size)(augmented)
    # Encode patches.
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    # Create multiple layers of the Transformer block.
    for _ in range(transformer_layers):
        # Layer normalization 1.
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        # Create a multi-head attention layer.
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        # Skip connection 1.
        x2 = layers.Add()([attention_output, encoded_patches])
        # Layer normalization 2.
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        # MLP.
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        # Skip connection 2.
        encoded_patches = layers.Add()([x3, x2])
  
    # Flatten the output of the Transformer Encoder and pass it through a final MLP
    # before computing the logits for image classification.
    # Start by creating a [batch_size, num_patches * projection_dim] tensor.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.5)(representation)
    # Add MLP.
    features = mlp(representation, hidden_units=mlp_head_units, dropout_rate=0.5)
    # Classify outputs.
    logits = layers.Dense(num_classes)(features)
    # Create the Keras model.
    model = keras.Model(inputs=inputs, outputs=logits)
    return model
In [27]:
def run_experiment(model):
    optimizer = keras.optimizers.Adam(
        learning_rate=learning_rate
    )

    model.compile(
        optimizer=optimizer,
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[
            keras.metrics.SparseCategoricalAccuracy(name="accuracy"),
            keras.metrics.SparseTopKCategoricalAccuracy(5, name="top-5-accuracy"),
        ],
    )

    checkpoint_filepath = "/tmp/checkpoint"
    checkpoint_callback = keras.callbacks.ModelCheckpoint(
        checkpoint_filepath,
        monitor="val_accuracy",
        save_best_only=True,
        save_weights_only=True,
    )

    history = model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        validation_split=0.1,
        callbacks=[checkpoint_callback],
    )

    model.load_weights(checkpoint_filepath)
    _, accuracy, top_5_accuracy = model.evaluate(x_test, y_test)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")
    print(f"Test top 5 accuracy: {round(top_5_accuracy * 100, 2)}%")

    return history


vit_classifier = create_vit_classifier()
history = run_experiment(vit_classifier)
Epoch 1/100
176/176 [==============================] - 4613s 26s/step - loss: 4.4740 - accuracy: 0.0446 - top-5-accuracy: 0.1594 - val_loss: 3.9194 - val_accuracy: 0.0982 - val_top-5-accuracy: 0.3216
Epoch 2/100
176/176 [==============================] - 2391s 14s/step - loss: 3.9495 - accuracy: 0.0967 - top-5-accuracy: 0.2915 - val_loss: 3.5524 - val_accuracy: 0.1640 - val_top-5-accuracy: 0.4200
Epoch 3/100
176/176 [==============================] - 2261s 13s/step - loss: 3.7123 - accuracy: 0.1270 - top-5-accuracy: 0.3595 - val_loss: 3.3973 - val_accuracy: 0.1858 - val_top-5-accuracy: 0.4664
Epoch 4/100
176/176 [==============================] - 2353s 13s/step - loss: 3.5359 - accuracy: 0.1579 - top-5-accuracy: 0.4125 - val_loss: 3.2184 - val_accuracy: 0.2200 - val_top-5-accuracy: 0.5110
Epoch 5/100
176/176 [==============================] - 2461s 14s/step - loss: 3.4080 - accuracy: 0.1783 - top-5-accuracy: 0.4495 - val_loss: 3.1172 - val_accuracy: 0.2380 - val_top-5-accuracy: 0.5256
Epoch 6/100
176/176 [==============================] - 2355s 13s/step - loss: 3.3087 - accuracy: 0.1987 - top-5-accuracy: 0.4770 - val_loss: 3.0128 - val_accuracy: 0.2590 - val_top-5-accuracy: 0.5656
Epoch 7/100
176/176 [==============================] - 2452s 14s/step - loss: 3.2007 - accuracy: 0.2185 - top-5-accuracy: 0.5051 - val_loss: 2.9189 - val_accuracy: 0.2824 - val_top-5-accuracy: 0.5802
Epoch 8/100
176/176 [==============================] - 2396s 14s/step - loss: 3.1008 - accuracy: 0.2385 - top-5-accuracy: 0.5304 - val_loss: 2.8476 - val_accuracy: 0.2882 - val_top-5-accuracy: 0.5856
Epoch 9/100
176/176 [==============================] - 2383s 14s/step - loss: 2.9905 - accuracy: 0.2604 - top-5-accuracy: 0.5608 - val_loss: 2.7022 - val_accuracy: 0.3186 - val_top-5-accuracy: 0.6240
Epoch 10/100
176/176 [==============================] - 2428s 14s/step - loss: 2.8909 - accuracy: 0.2758 - top-5-accuracy: 0.5810 - val_loss: 2.6193 - val_accuracy: 0.3326 - val_top-5-accuracy: 0.6372
Epoch 11/100
176/176 [==============================] - 2480s 14s/step - loss: 2.8035 - accuracy: 0.2926 - top-5-accuracy: 0.6037 - val_loss: 2.5614 - val_accuracy: 0.3518 - val_top-5-accuracy: 0.6516
Epoch 12/100
172/176 [============================>.] - ETA: 10:45 - loss: 2.7135 - accuracy: 0.3137 - top-5-accuracy: 0.6230
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-27-4230d345b1e5> in <module>()
     39 
     40 vit_classifier = create_vit_classifier()
---> 41 history = run_experiment(vit_classifier)

<ipython-input-27-4230d345b1e5> in run_experiment(model)
     27         epochs=num_epochs,
     28         validation_split=0.1,
---> 29         callbacks=[checkpoint_callback],
     30     )
     31 

KeyboardInterrupt: 

References and Slides