%pylab inline
from ipypublish import nb_setup
The Linear Models that we discussed in Chapter LinearLearningModels work well if the input dataset is approximately linearly separable, but they have limited accuracy for complex datasets. Some of the issues with Linear Models are the following:
If the input data is not linearly separable, then the designer has to expend a lot of effort in finding an appropriate feature map that makes it so. It would be nice to have a model that solves this problem automatically, by learning the best feature map from the data itself.
We showed that the model weight parameters could be regarded as a filter, so that for $K$ classes, the operation of the system is equivalent to trying to match the input with $K$ different filters. The limitations of this approach can be seen in the filter for the "horse" class in Figure LC2. The filter looks like a horse with two heads, since it is trying its best to match a horse image irrespective of the direction in which the horse is facing. This type of filtering will clearly not work if the horse is standing in some other orientation, or if it is located in a corner of the image. The fact that the best accuracy that can be achieved with linear classifiers on the CIFAR-10 Dataset is only about 40% is a reflection of this shortcoming. The linear system tries to do classification by taking each and every pixel into account, which is a difficult task. What if it were possible to create representations for higher level features in the image, say the head of the horse or its legs, and then use these for classification instead? This would enable the system to identify a horse irrespective of its orientation and its location in the image. This is precisely what Deep Learning systems do.
In general, one way to make a model more powerful is to increase the number of parameters. However, in a Linear Model the number of parameters is constrained to $KN + K$ by the size of the input data and the number of output classes, which limits its modeling power.
#LC2
nb_setup.images_hconcat(["DL_images/LC2.png"], width=600)
Dense Feed Forward Networks were designed with the objective of overcoming these shortcomings. As Figure DFN1 shows, we are looking for a functional block between the input vector $(x_1,...,x_N)$ and the output logits $(a_1,...,a_K)$, that can create a new representation vector $(z_1,...,z_P)$ which satisfies the approximate linear separability property. One way to do this is shown in Figure DFN2, which is a Deep Feed Forward Network with a single Hidden Layer. Note the following:
The Input and Output Layers are as before, but we have added a third layer, the so-called Hidden Layer, in between. The Input Layer is fully connected to the Hidden Layer, i.e., each node in the Input Layer is connected to every node in the Hidden Layer, and the same holds true for connections between the Hidden Layer and the Output Layer. DLNs with these characteristics are called Dense Feed Forward Neural Networks. Later in this monograph we will come across examples of DLNs where these properties don’t apply, either because the fully connected property does not hold (as in Convolutional Neural Networks), or because the DLN incorporates feedback loops (as in Recurrent Neural Networks).
The $j$-th node in the Hidden Layer performs the following computation on the input variables $x_i$ to generate an output $z_j^{(1)}, 1 \leq j \leq P$ given by $$ a_j^{(1)} = \sum_{i=1}^N w_{ji}^{(1)} x_i + b_j^{(1)} $$ $$ z_j^{(1)} = f(a_j^{(1)}) $$ The vector $(a^{(1)}_1,...,a_P^{(1)})$, which we call the Pre-Activation, is computed as a simple linear combination of the Input Vector. The output of the Hidden Layer $(z^{(1)}_1,...,z_P^{(1)})$ which we call the Activation, is computed as an elementwise non-linear function of the Pre-Activations.
The Output Layer operates on the Activations $z_j^{(1)}$ from the Hidden Layer, and computes the logits for the $K$ classes $(a_1^{(2)},...,a_K^{(2)})$. $$ a_k^{(2)} = \sum_{i=1}^P w_{ki}^{(2)} z_i^{(1)} + b_k^{(2)}, \ \ 1\le k\le K $$ The classification probabilities $y_k, 1\le k\le K$ are obtained by applying the Softmax function to the logits. $$ y_k = \frac{\exp(a_k^{(2)})}{\sum_{j=1}^K \exp(a_j^{(2)})}, \ \ 1\le k\le K $$ Note that the logit and classification probability computations are identical to those done in Linear Systems, with the inputs $X$ now replaced by the activations $Z$.
The weight parameters $w_{ij}^{(1)}, 1\le i\le P,1\le j\le N; w_{ij}^{(2)}, 1\le i\le K,1\le j\le P$ and the bias parameters $b_i^{(1)}, 1\le i\le P; b_i^{(2)}, 1\le i\le K$ have to be learnt using the training data, as in Linear Models. The total number of parameters needed to describe this network is given by $NP + P + PK + K$, which is now dependent on the number of nodes in the Hidden Layer $P$. Hence we can build a Dense Feed Forward model with more powerful classification ability by increasing the number of nodes in the Hidden Layer, which is an option that does not exist in Linear Systems.
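The computations listed above can be traced in a few lines of NumPy. This is only a minimal sketch under stated assumptions: the sizes $N=3072$, $P=20$, $K=10$, the random weights, and the helper names `relu`, `softmax` and `forward` are chosen for illustration, and the function $f$ is assumed to be $\max(0, a)$.

```python
import numpy as np

def relu(a):
    # Assumed choice for the activation f: elementwise max(0, a)
    return np.maximum(0.0, a)

def softmax(a):
    # Subtracting the maximum leaves the result unchanged but avoids overflow
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    a1 = W1 @ x + b1      # pre-activations of the Hidden Layer, shape (P,)
    z1 = relu(a1)         # activations of the Hidden Layer, shape (P,)
    a2 = W2 @ z1 + b2     # logits, shape (K,)
    return softmax(a2)    # classification probabilities, shape (K,)

# Illustrative sizes: a flattened 32x32x3 image, 20 hidden nodes, 10 classes
N, P, K = 32 * 32 * 3, 20, 10
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.01, size=(P, N)), np.zeros(P)
W2, b2 = rng.normal(scale=0.01, size=(K, P)), np.zeros(K)

x = rng.random(N)
y = forward(x, W1, b1, W2, b2)

n_params = W1.size + b1.size + W2.size + b2.size   # NP + P + PK + K
linear_params = K * N + K                          # Linear Model with the same N and K
print(y.shape, float(y.sum()))
print(n_params, linear_params)
```

With these sizes the single Hidden Layer roughly doubles the parameter count relative to the Linear Model, and the count keeps growing as $P$ is increased.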
#DFN1
nb_setup.images_hconcat(["DL_images/DFN1.png"], width=600)
#DFN2
nb_setup.images_hconcat(["DL_images/DFN2.png"], width=600)
The activations $(z^{(1)}_1,...,z_P^{(1)})$ correspond to the new data representation that we are looking for. They filter the input and create higher layer representations, which are then used by the logit layer for classification. Note that the filtering done by the Hidden Layer is non-linear due to the presence of the non-linear Function $f$. This function is called the Activation Function, and plays an important role in system performance. The most popular Activation Function in use is called the Rectified Linear Unit, or ReLU, and is shown in Figure DFN3. It simply passes on the pre-activations that are greater than zero, and blocks those that are less.
The presence of the Activation Function is critical to the functioning of the DLN, and it can be easily shown that if it were omitted, then the Hidden and Output layers could be collapsed together, so that the resulting model would be equivalent to a Linear Model. Indeed the presence of Activation Functions gives the system its modeling power, and in general we will see later in the book that DLN systems can be made more powerful by increasing the amount of non-linear processing. The appropriate choice of Activation Function has a big influence on the performance of the DLN, and the discovery of more effective Activation Functions such as the ReLU has helped make DLNs easier to train.
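The collapse argument can be checked numerically. The sketch below uses small hand-picked weights (assumptions chosen only so that the arithmetic is easy to follow): without an activation, the two layers equal a single linear layer with weights $W^{(2)}W^{(1)}$ and bias $W^{(2)}b^{(1)} + b^{(2)}$, while inserting a ReLU breaks this equivalence.

```python
import numpy as np

# Hand-picked illustrative values: 2 inputs, 2 hidden nodes, 1 output
W1 = np.array([[1.0, -1.0], [2.0, 0.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])
x = np.array([1.0, 2.0])

# Two layers with the activation omitted
two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ x + b_eq

# Two layers with a ReLU between them
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print(two_layer, one_layer)   # [0.5] [0.5]
print(with_relu)              # [1.5]
```

The first hidden pre-activation is negative for this input, so the ReLU zeroes it and the output can no longer be reproduced by the collapsed linear map.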
nb_setup.images_hconcat(["DL_images/DFN3.png"], width=600)
The system shown in Figure DFN2 incorporates only a single Hidden Layer. Why not continue the process and enable the model to create higher level representations by adding additional hidden layers? This is certainly possible and the resulting network is shown in Figure DFN4. It shows a Dense Feed Forward Network with $R$ hidden layers, such that layer $r$ consists of $P^r$ nodes. The equations describing this network can be written as:
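The elementwise equations do not appear at this point in the text. A reconstruction that is consistent with the single Hidden Layer equations above and with the matrix form given further on, under the notational assumptions $z_i^{(0)} = x_i$ and $P^0 = N$, would read:

$$ a_j^{(r)} = \sum_{i=1}^{P^{r-1}} w_{ji}^{(r)} z_i^{(r-1)} + b_j^{(r)}, \ \ z_j^{(r)} = f(a_j^{(r)}), \ \ 1\le j\le P^r,\ \ 1\le r\le R $$

$$ a_k^{(R+1)} = \sum_{i=1}^{P^R} w_{ki}^{(R+1)} z_i^{(R)} + b_k^{(R+1)}, \ \ y_k = \frac{\exp(a_k^{(R+1)})}{\sum_{j=1}^K \exp(a_j^{(R+1)})}, \ \ 1\le k\le K $$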
With each successive Hidden Layer, this network creates representations at higher levels of abstraction.
Using matrix notation, these equations can be compactly written as (with $Z^{(0)} = X$):
$$ A^{(r)} = W^{(r)}Z^{(r-1)} + B^{(r)},\ \ Z^{(r)} = f(A^{(r)}),\ \ 1\le r\le R $$
$$ A^{(R+1)} = W^{(R+1)}Z^{(R)} + B^{(R+1)},\ \ Y = h(A^{(R+1)}) $$
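In these equations $f$ and $h$ represent the Activation and Softmax functions respectively. The Activation Function $f$ is applied elementwise across all the matrix entries, while the Softmax function $h$ normalizes across the $K$ logits, as in the expression for $y_k$ given earlier.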
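The matrix form maps directly onto a loop over layers. The following sketch rests on several assumptions that are not in the text: each column of $X$ is taken to be one input sample, $f$ is taken to be the ReLU, and the layer sizes and the names `forward` and `softmax_columns` are illustrative.

```python
import numpy as np

def relu(A):
    return np.maximum(0.0, A)

def softmax_columns(A):
    # h: normalize over the K logits of each sample, i.e. down each column
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def forward(X, weights, biases):
    """X has shape (N, M), one column per sample; weights and biases hold R+1 layers."""
    Z = X                                              # Z^(0) = X
    for W, B in zip(weights[:-1], biases[:-1]):        # hidden layers r = 1..R
        Z = relu(W @ Z + B)                            # A^(r) = W^(r) Z^(r-1) + B^(r)
    return softmax_columns(weights[-1] @ Z + biases[-1])   # Y = h(A^(R+1))

# Illustrative sizes: N = 12 inputs, hidden layers of 8 and 6 nodes (R = 2), K = 4 classes
sizes = [12, 8, 6, 4]
rng = np.random.default_rng(0)
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]   # each bias is broadcast over samples

X = rng.normal(size=(sizes[0], 5))   # a batch of 5 samples
Y = forward(X, weights, biases)
print(Y.shape)          # (4, 5)
print(Y.sum(axis=0))    # every column sums to 1
```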
nb_setup.images_hconcat(["DL_images/DFN4.png"], width=600)
We have introduced two degrees of freedom in DLN design in this chapter: (1) The number of Hidden Layers, and (2) The number of nodes per Hidden Layer. This leads to the following questions:
Unfortunately few theoretical results exist in this area that can give definite answers to these questions. However there is one interesting theorem regarding Deep Feed Forward Networks with a single Hidden Layer, whose proof was given by Cybenko in 1989:
Given an arbitrary continuous function $g$ of $n$ variables such as
$$ y = g(x_1,...,x_n) $$
it is always possible to find a Deep Feed Forward Network with a single Hidden Layer, such that the output of the network approximates $g$, and the approximation can be made as close as we want by adding nodes to the Hidden Layer.
This property is of course dependent on the form of the Activation Function used, but it has been proven to be true for the most commonly used functions. Hence it should in principle be possible to solve any classification problem with a Dense Feed Forward Network containing a single Hidden Layer. However the theorem does not specify the number of hidden nodes needed for a particular problem.
In practice, it has been observed that to increase the modeling power of a DLN it is advantageous to add Hidden Layers, for the following reasons:
More layers allow the model to develop a hierarchical representation of the input data, which simplifies the task of the linear classifier in the final layer.
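The effect of adding hidden nodes can be illustrated, though of course not proven, with a small numerical experiment. The sketch below makes several simplifying assumptions that are not part of the theorem: the target function $g$ is an arbitrary one-dimensional example, the hidden layer uses ReLU nodes with randomly drawn weights that are never trained, and only the output weights are fitted, by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 400)
g = np.sin(2.0 * x) + 0.5 * x          # assumed target function, for illustration only

# Random hidden layer with up to P_max ReLU nodes (weights drawn once, not trained)
P_max = 200
w = rng.normal(scale=2.0, size=P_max)
b = rng.uniform(-3.0, 3.0, size=P_max)
H = np.maximum(0.0, np.outer(x, w) + b)    # hidden activations, shape (400, P_max)

def fit_error(P):
    # Use the first P hidden nodes plus a constant column for the output bias
    features = np.column_stack([H[:, :P], np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(features, g, rcond=None)
    return float(np.sqrt(np.mean((features @ coef - g) ** 2)))   # RMS approximation error

errors = {P: fit_error(P) for P in (2, 10, 50, 200)}
for P, err in errors.items():
    print(P, err)
```

The approximation error shrinks as nodes are added, which is the behavior the theorem describes. Fitting only the output layer is a shortcut taken here to keep the example short, and it is not how the networks in this chapter are trained.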
Having additional layers increases the amount of non-linearity and thus the modeling capacity.
This still leaves open the question of how wide the network should be. There has been some progress on this more recently [Li, Xu, et al.](https://arxiv.org/pdf/1712.09913.pdf), and their key finding is shown in Figure convnet46.
#convnet46
nb_setup.images_hconcat(["DL_images/convnet46.png"], width=700)
As illustrated in the figure, the width of the network has a critical effect on the smoothness of its Loss Function. The figure shows four contour plots for the Loss Function of an increasingly wide network, and as can be seen the Loss Function landscape becomes progressively smoother as we move from left to right. This makes the optimization task much easier. This effect is more pronounced for the very deep networks with hundreds of layers that we will study later in the course, and less of an issue in a network with only a few layers.
If the Loss Function is highly chaotic as in the leftmost plot, then the optimization becomes highly dependent on the initialization values, since a bad initialization can cause the trajectory to get caught in the ups and downs of the uneven loss landscape. Increasing the width of the network promotes flat minimizers and prevents the transition to chaotic behavior, which also improves the generalization ability of the network.
The other question that we raised is whether the DLN performance keeps improving as we add more and more Hidden Layers. This is actually not the case: the model performance is constrained by the following factors:
The Vanishing Gradient Problem: In order to train a multilayer Deep Feed Forward Network, the gradients $\frac{\partial L}{\partial w^{(r)}_{ij}}$ and $\frac{\partial L}{\partial b^{(r)}_i}$ have to be computed. It turns out that if the number of layers is large, the gradients of the weights that are either in the first few layers or the last few layers converge towards zero as the training progresses. Once this happens, the corresponding weights stop adapting to new training data, and thus the training process grinds to a halt. This phenomenon is known as the Vanishing Gradient problem, and its causes are explained in detail in Chapter GradientDescentTechniques. In addition, adding more layers makes the Loss Landscape more chaotic, as shown in Figure convnet46, which makes optimization very difficult. This problem constrains the number of layers that can be added to the network to about 20 or so without degrading the training process. In order to get around this problem, we can increase the width of the network as explained above, or use a recent advance in DLN architecture called Residual Connections, which allows much deeper networks containing hundreds of layers.
The Overfitting Problem: Larger models with more layers have a larger number of parameters, and this in turn requires larger training datasets. As explained in Chapter ImprovingModelGeneralization, modeling is an exercise in matching the Capacity of the Model with the Complexity of the Dataset. If the Capacity of the Model is greater than the Complexity of the Dataset (which can happen if we add more layers than necessary), then it leads to overfitting. This problem constrains the model's generalization ability.
As this discussion shows, there is no formula or theoretical result which tells us the number of layers or the nodes per layer to use in the model. These numbers, which are also called hyper-parameters, are a function of the dataset that we are trying to model, and the only way to find the best numbers is by trial and error. Hence when building the model, the designer has to do several trial runs with different values for these hyper-parameters before settling on the best ones.
In Chapter ImprovingModelGeneralization we provide some guidelines that can be used to make this process more efficient.
There are two ways to define a Dense Feed Forward Network in Keras:
The code shown below uses the Layers Module to define a Dense Feed Forward Network with two hidden layers with 20 and 15 nodes respectively. The first hidden layer is constrained to accept input tensors of shape (32 * 32 * 3,). Note that the batch dimension of this tensor is left unspecified, which allows the system to feed this layer with batches of data of any size. The input tensor is transformed into a tensor of shape (20,) by the first hidden layer, and this tensor is then processed by the second hidden layer with 15 nodes. There is no need to specify an input shape argument for the second layer, since Keras automatically infers it from the output of the first layer.
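The shape bookkeeping described above can be mimicked in NumPy without Keras. This is a sketch under assumptions: the weights are random stand-ins, biases and the softmax are left out since only shapes matter here, the 10-node output layer is the one used in the chapter's code, and the helper name `output_shapes` is hypothetical.

```python
import numpy as np

# Input size, two hidden layers, and the 10-node output layer
layer_sizes = [32 * 32 * 3, 20, 15, 10]
rng = np.random.default_rng(0)
# A dense layer's weight matrix has shape (inputs, nodes) when samples are stored as rows
weights = [rng.normal(scale=0.01, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def output_shapes(batch_size):
    x = rng.random((batch_size, layer_sizes[0]))   # one row per flattened image
    shapes = [x.shape]
    for W in weights:
        x = np.maximum(0.0, x @ W)                 # only the shapes are of interest
        shapes.append(x.shape)
    return shapes

print(output_shapes(4))     # [(4, 3072), (4, 20), (4, 15), (4, 10)]
print(output_shapes(128))   # [(128, 3072), (128, 20), (128, 15), (128, 10)]

# Weights plus one bias per node, for each layer
params = [W.size + W.shape[1] for W in weights]
print(params, sum(params))  # [61460, 315, 160] 61935
```

The leading dimension is never fixed by the layer definitions, which is why any batch size passes through unchanged.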
Comparing the results of the Linear Model from the previous chapter and the Dense Feed Forward Model, the accuracy increased from about 40% to 45%. This is a significant jump, but still not good enough. One of the main factors holding back the Dense Feed Forward model is that it is only able to process images after they have been flattened into a vector. Thus a lot of information that is present in the original 3D image shape is lost, especially information about which pixels are in proximity to each other in the original image. In order to process images in their native 3D shape, we will need a more sophisticated class of Neural Network models called Convolutional Neural Networks, which are discussed in one of the later chapters.
import keras
keras.__version__
from keras import models
from keras import layers
from keras.datasets import cifar10
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()
train_images = train_images.reshape((50000, 32 * 32 * 3))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 32 * 32 * 3))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
network = models.Sequential()
network.add(layers.Dense(20, activation='relu', input_shape=(32 * 32 * 3,)))
network.add(layers.Dense(15, activation='relu'))
network.add(layers.Dense(10, activation='softmax'))
network.compile(optimizer='sgd',
loss='categorical_crossentropy',
metrics=['accuracy'])
history = network.fit(train_images, train_labels, epochs=100, batch_size=128, validation_split=0.2)
history_dict = history.history
history_dict.keys()
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf() # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
#DFN5
nb_setup.images_hconcat(["DL_images/DFN5.png"], width=600)
The previous model was built using the Keras Sequential Layers Module, and it is compactly represented in Part (a) of Figure DFN5. As the name implies, the flow of data in this model is strictly sequential from left to right, with each layer working on the output of the previous layer in the model. However there are important cases in which the data flow is not sequential, as shown in Parts (b) to (e) of the figure:
Part (b) shows an example of what is known as Model Ensembling, which is used to improve the performance of a network by averaging the outputs of multiple copies of the same network (this is discussed in more detail in Chapter Training Neural Networks Part 2). In order to implement this architecture, the same input is fed into multiple copies of the network with different initializations, and then the output from each is further processed to get the final output.
Part (c) shows an example of a model that features Residual Connections, which provide a path for a copy of the signal to travel to a deeper part of the network after bypassing the intervening layers. It is then added to the rest of the signal that propagates using the usual sequential processing modules. This mechanism has been shown to improve the flow of gradients during training, and has enabled the use of networks with hundreds of layers.
Part (d) shows an example of a model in which the final output is a function of multiple types of input datasets; in this case it depends on a mixture of tabular, image and text data. Each dataset is processed along its own branch (perhaps using different types of sub-networks, optimal for its type of data) and the results are then combined to give the final output.
Part (e) is an example of Multi-Label Classification. In this case an image can contain multiple objects, each of which has to be identified. This is done by having multiple output nodes, each of which makes a yes/no decision for the presence of one of the objects in the image.
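The yes/no decisions in Part (e) can be sketched as follows. The text does not say how each decision is made, so the sigmoid output function and the 0.5 threshold used here are assumptions (a common choice), and the class names and logit values are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class_names = ["person", "dog", "car", "bicycle"]   # hypothetical labels
logits = np.array([2.0, -1.5, 0.3, -3.0])           # made-up output-layer values

# Each output node is treated independently, unlike Softmax,
# so the probabilities do not have to sum to 1
probs = sigmoid(logits)
present = probs > 0.5

detected = [name for name, flag in zip(class_names, present) if flag]
print(detected)   # ['person', 'car']
```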
All these models use non-sequential flow of data, which can be modeled using the Keras Functional API. In this system, tensors are manipulated directly and layers are used as functions to take tensors and return tensors. As an example, we take the CIFAR-10 sequential model and recast it in functional form:
import keras
keras.__version__
from keras import Sequential, Model
from keras import layers
from keras import Input
from keras.datasets import cifar10
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()
train_images = train_images.reshape((50000, 32 * 32 * 3))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 32 * 32 * 3))
test_images = test_images.astype('float32') / 255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
input_tensor = Input(shape=(32 * 32 * 3,))
x = layers.Dense(20, activation='relu')(input_tensor)
y = layers.Dense(15, activation='relu')(x)
output_tensor = layers.Dense(10, activation='softmax')(y)
model = Model(input_tensor, output_tensor)
model.compile(optimizer='sgd',
loss='categorical_crossentropy',
metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=10, batch_size=128, validation_split=0.2)
model.summary()
Exercise: Change the code in the model shown above so that it conforms to the architectures shown in Parts (b) and (c) of Figure DFN5.
#DFN6
nb_setup.images_hconcat(["DL_images/DFN6.png"], width=600)
In the Keras code that we have seen so far, the input data was already formatted into a tensor form that could be fed into the model directly. In practical applications the data exists in raw form that has to be processed before being fed into Keras. Some examples of this are shown in Figure DFN6 for three of the most common types of datasets:
Image Datasets: These usually exist as image files in the png or jpg format. These have to be converted into the RGB format and then cropped so that they all have the same dimensions.
Text Datasets: The words and characters in these datasets have to be pre-processed to remove unnecessary elements, and then vectorized.
TimeSeries Datasets: The data in this case is already in numerical form so requires the least amount of pre-processing.
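For the Text case, the two steps (clean-up followed by vectorization) can be sketched in plain Python. This is a minimal illustration rather than the Keras implementation: the helper names `standardize`, `build_vocabulary` and `vectorize`, the reserved index 0 for unknown words, and the sample sentences are all assumptions.

```python
import string

def standardize(text):
    # Remove elements that carry no information here: letter case and punctuation
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(texts):
    vocab = {"[UNK]": 0}                  # index 0 is reserved for unknown words
    for text in texts:
        for token in standardize(text).split():
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(text, vocab):
    # Replace every word by its integer index
    return [vocab.get(token, 0) for token in standardize(text).split()]

corpus = ["The movie was great!", "The plot was thin, the acting great."]
vocab = build_vocabulary(corpus)
print(vocab)
print(vectorize("Great acting, weak plot.", vocab))   # [4, 7, 0, 5]
```

The word "weak" does not occur in the corpus, so it is mapped to the unknown index 0.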
In its latest release, Keras has provided a number of dataset pre-processing functions that can be used for this purpose. In particular:
In each of these cases, the dataset function creates the training, validation or test datasets by pairing the actual data with its appropriate labels. For the Image and Text datasets, where the data resides in directories, these labels are created by making use of the directory structure for the data, for example all images belonging to a particular category are placed in the same directory. Also note that in each of these cases, the training datasets are not pre-defined and stored, but instead are created on the fly during the training phase (the same applies for validation and test datasets). This has the following benefits:
If the data is already in the form of a tensor, and the objective is to predict a missing row element, then the Dataset.from_tensor_slices function is used to do the pairing between the data and the corresponding label (as shown in the following section).
In this example we feed the model with data in tabular format with 303 rows. Each row of the table consists of features recorded for a single patient, with the last column (named target) indicating whether the patient has heart disease (with a label of 1 in case they do). The objective of the model is to predict this label from the patient features.
The following table has a description of each feature. Some of the features are numerical, but several other features are categorical, while one of the features is both numerical and categorical. In order to feed this data into the Neural Network, the categorical features are converted into one-hot encoded values.
#cancer_ds
nb_setup.images_hconcat(["DL_images/cancer_ds.png"], width=600)
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
We start by downloading the data and storing it in a Pandas dataframe.
file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)
dataframe.shape
A preview of the samples:
dataframe.head()
The data is randomly split into validation and training sets:
val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)
print(
"Using %d samples for training and %d for validation"
% (len(train_dataframe), len(val_dataframe))
)
The following procedure invokes the Dataset.from_tensor_slices procedure in order to create labels for each input and pair it with the rest of the data in each row. This results in the formation of the training and validation datasets.
def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("target")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds
train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)
for x, y in train_ds.take(1):
    print("Input:", x)
    print("Target:", y)
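To make the pairing concrete, the following plain-Python sketch mimics what slicing does: it walks along the rows and yields one (features, label) pair per row. It is an illustration only, not the TensorFlow implementation; the helper name `slice_rows` is hypothetical and the three rows are made-up values.

```python
# Made-up table with two feature columns and a label column
columns = {"age": [63, 67, 41], "chol": [233, 286, 204], "target": [0, 1, 0]}

def slice_rows(columns, label_name):
    features = {k: v for k, v in columns.items() if k != label_name}
    labels = columns[label_name]
    for i in range(len(labels)):
        # One dictionary of feature values per row, paired with that row's label
        yield {name: values[i] for name, values in features.items()}, labels[i]

pairs = list(slice_rows(columns, "target"))
print(pairs[0])   # ({'age': 63, 'chol': 233}, 0)
```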
The datasets are batched:
train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)
We define the following two procedures for pre-processing the data:
The encode_numerical_feature procedure invokes the Normalization function to normalize a column containing numerical features.
The encode_categorical_feature procedure converts categorical features into either integers or one-hot encoded features (we use the latter in this example). If the categorical feature is an integer, then the IntegerLookup function is invoked, and if it is a string, then the StringLookup function is invoked. Both these functions use a table lookup method to do the mapping.
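The two kinds of encoding can be sketched in NumPy as follows. This only approximates what the Keras layers do: the helper names are hypothetical, the sample values are made up, and the lookup table below orders its entries by sorted value and reserves index 0 for values not seen during fitting, both of which are assumptions about the layout rather than a statement of the library's behavior.

```python
import numpy as np

def normalize(values):
    # Numerical feature: subtract the mean and divide by the standard deviation
    values = np.asarray(values, dtype="float64")
    return (values - values.mean()) / values.std()

def fit_lookup(values):
    # Categorical feature: build the lookup table (index 0 kept for unseen values)
    return {v: i + 1 for i, v in enumerate(sorted(set(values)))}

def one_hot(values, table):
    out = np.zeros((len(values), len(table) + 1))
    for row, v in enumerate(values):
        out[row, table.get(v, 0)] = 1.0
    return out

chol = [233, 286, 204, 250]   # made-up numerical column
cp = [1, 4, 4, 3]             # made-up integer-coded categorical column

z = normalize(chol)
table = fit_lookup(cp)
encoded = one_hot([4, 1, 2], table)   # the value 2 was never seen, so it maps to index 0

print(z.mean(), z.std())
print(table)      # {1: 1, 3: 2, 4: 3}
print(encoded)
```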
from tensorflow.keras.layers import IntegerLookup
from tensorflow.keras.layers import Normalization
from tensorflow.keras.layers import StringLookup
def encode_numerical_feature(feature, name, dataset):
    # Create a Normalization layer for our feature
    normalizer = Normalization()
    # Prepare a Dataset that only yields our feature
    # expand_dims returns a tensor with a length 1 axis inserted at index axis.
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
    # Learn the statistics of the data
    normalizer.adapt(feature_ds)
    # Normalize the input feature
    encoded_feature = normalizer(feature)
    return encoded_feature

def encode_categorical_feature(feature, name, dataset, is_string):
    lookup_class = StringLookup if is_string else IntegerLookup
    # Create a lookup layer which will turn strings (or integers) into one-hot vectors
    lookup = lookup_class(output_mode="one_hot")
    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
    # Learn the set of possible values and assign them a fixed integer index
    lookup.adapt(feature_ds)
    # Turn the input into one-hot encoded values
    encoded_feature = lookup(feature)
    return encoded_feature
# Categorical features encoded as integers
sex = keras.Input(shape=(1,), name="sex", dtype="int64")
cp = keras.Input(shape=(1,), name="cp", dtype="int64")
fbs = keras.Input(shape=(1,), name="fbs", dtype="int64")
restecg = keras.Input(shape=(1,), name="restecg", dtype="int64")
exang = keras.Input(shape=(1,), name="exang", dtype="int64")
ca = keras.Input(shape=(1,), name="ca", dtype="int64")
# Categorical feature encoded as string
thal = keras.Input(shape=(1,), name="thal", dtype="string")
# Numerical features
age = keras.Input(shape=(1,), name="age")
trestbps = keras.Input(shape=(1,), name="trestbps")
chol = keras.Input(shape=(1,), name="chol")
thalach = keras.Input(shape=(1,), name="thalach")
oldpeak = keras.Input(shape=(1,), name="oldpeak")
slope = keras.Input(shape=(1,), name="slope")
all_inputs = [
sex,
cp,
fbs,
restecg,
exang,
ca,
thal,
age,
trestbps,
chol,
thalach,
oldpeak,
slope,
]
# Integer categorical features
sex_encoded = encode_categorical_feature(sex, "sex", train_ds, False)
cp_encoded = encode_categorical_feature(cp, "cp", train_ds, False)
fbs_encoded = encode_categorical_feature(fbs, "fbs", train_ds, False)
restecg_encoded = encode_categorical_feature(restecg, "restecg", train_ds, False)
exang_encoded = encode_categorical_feature(exang, "exang", train_ds, False)
ca_encoded = encode_categorical_feature(ca, "ca", train_ds, False)
# String categorical features
thal_encoded = encode_categorical_feature(thal, "thal", train_ds, True)
# Numerical features
age_encoded = encode_numerical_feature(age, "age", train_ds)
trestbps_encoded = encode_numerical_feature(trestbps, "trestbps", train_ds)
chol_encoded = encode_numerical_feature(chol, "chol", train_ds)
thalach_encoded = encode_numerical_feature(thalach, "thalach", train_ds)
oldpeak_encoded = encode_numerical_feature(oldpeak, "oldpeak", train_ds)
slope_encoded = encode_numerical_feature(slope, "slope", train_ds)
all_features = layers.concatenate(
[
sex_encoded,
cp_encoded,
fbs_encoded,
restecg_encoded,
exang_encoded,
slope_encoded,
ca_encoded,
thal_encoded,
age_encoded,
trestbps_encoded,
chol_encoded,
thalach_encoded,
oldpeak_encoded,
]
)
x = layers.Dense(32, activation="relu")(all_features)
x = layers.Dropout(0.5)(x)
output = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(all_inputs, output)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
history = model.fit(train_ds, epochs=50, validation_data=val_ds)
history_dict = history.history
history_dict.keys()
import matplotlib.pyplot as plt
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf() # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
An important class of problems in Machine Learning has to do with predicting the next term in a sequence. In this case the DLN has to learn the pattern of data as it evolves in time, an example of which is shown in Figure DFN8 (this example is taken from Section 10.2 of Chollet; the dataset can be downloaded from http://www.bgc-jena.mpg.de/wetter/). The table consists of 14 pieces of meteorological data collected once every 10 minutes over the course of several years (the file used below covers 2009 to 2016). The objective is to predict the temperature (which is one of the variables) one day in the future, based on all the data collected over a time period called lookback, which spans several days.
#DFN8
nb_setup.images_hconcat(["DL_images/DFN8.png"], width=600)
Once again we start by reading in the data file and storing the information in the data structure 'lines'. Since most of the processing is very similar to that described for the previous example, we will limit our comments to places where the two are different.
import tensorflow
import keras
keras.__version__
from keras import models
from keras import layers
import os
#data_dir = '/home/ubuntu/data/'
data_dir = '/Users/subirvarma/handson-ml/datasets/'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')
with open(fname) as f:
    data = f.read()
lines = data.split("\n")
header = lines[0].split(",")
lines = lines[1:]
print(header)
print(len(lines))
import numpy as np
temperature = np.zeros((len(lines),))
raw_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    # Skip the first column (the timestamp) and convert the rest to floats
    values = [float(x) for x in line.split(",")[1:]]
    temperature[i] = values[1]
    raw_data[i, :] = values[:]
The variation of temperature with time is plotted below. There is a periodicity in this pattern that reflects the variation of temperature over the course of a year.
from matplotlib import pyplot as plt
temp = raw_data[:, 1] # temperature (in degrees Celsius)
plt.plot(range(len(temp)), temp)
plt.show()
Half the data in the table is used for training, and the remainder is split equally between validation and testing.
num_train_samples = int(0.5 * len(raw_data))
num_val_samples = int(0.25 * len(raw_data))
num_test_samples = len(raw_data) - num_train_samples - num_val_samples
print("num_train_samples:", num_train_samples)
print("num_val_samples:", num_val_samples)
print("num_test_samples:", num_test_samples)
In the next step we normalize all the variables individually by subtracting their mean and dividing by the standard deviation, both computed on the training portion of the data. Normalization, which is discussed in greater detail in Chapter Gradient Descent Techniques, equalizes variables whose values are very different in magnitude, such as temperature and pressure, and this improves the training process.
mean = raw_data[:num_train_samples].mean(axis=0)
raw_data -= mean
std = raw_data[:num_train_samples].std(axis=0)
raw_data /= std
We invoke the timeseries_dataset_from_array procedure to create the training, validation and test datasets for the model. Note the following:
Since the sampling_rate is set to 6 and the raw readings are recorded every 10 minutes, the model uses a single sample of the data per hour.
The model uses a sequence of the prior 120 hours of data (i.e. 5 days) in order to predict the temperature 24 hours after the end of the sequence.
sampling_rate = 6
sequence_length = 120
delay = sampling_rate * (sequence_length + 24 - 1)
batch_size = 256
train_dataset = tensorflow.keras.utils.timeseries_dataset_from_array(
raw_data[:-delay],
targets=temperature[delay:],
sampling_rate=sampling_rate,
sequence_length=sequence_length,
shuffle=True,
batch_size=batch_size,
start_index=0,
end_index=num_train_samples)
val_dataset = tensorflow.keras.utils.timeseries_dataset_from_array(
raw_data[:-delay],
targets=temperature[delay:],
sampling_rate=sampling_rate,
sequence_length=sequence_length,
shuffle=True,
batch_size=batch_size,
start_index=num_train_samples,
end_index=num_train_samples + num_val_samples)
test_dataset = tensorflow.keras.utils.timeseries_dataset_from_array(
raw_data[:-delay],
targets=temperature[delay:],
sampling_rate=sampling_rate,
sequence_length=sequence_length,
shuffle=True,
batch_size=batch_size,
start_index=num_train_samples + num_val_samples)
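The relationship between sampling_rate, sequence_length and delay can be verified with plain index arithmetic. The sketch below is an illustration only: the helper window_indices is hypothetical, and it assumes that the sequence starting at row i is paired with the i-th entry of the targets array, which is how the targets argument is used above.

```python
sampling_rate = 6      # keep one reading out of every 6 (one per hour)
sequence_length = 120  # number of hourly readings in each input sequence
delay = sampling_rate * (sequence_length + 24 - 1)

def window_indices(start):
    """Rows of the raw data that make up the sequence starting at row `start`."""
    return [start + k * sampling_rate for k in range(sequence_length)]

start = 0
indices = window_indices(start)
last_input_index = indices[-1]

# targets=temperature[delay:] pairs the sequence starting at row `start`
# with the temperature stored at row `start + delay`.
target_index = start + delay

print(len(indices))                     # 120
print(last_input_index)                 # 714
print(target_index)                     # 858
print(target_index - last_input_index)  # 144 rows, i.e. 24 hours at 6 rows per hour
```

The gap of 144 rows between the last input reading and the target confirms that the model is asked to predict the temperature 24 hours after the end of the sequence.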
We are going to use a Dense Feed Forward Network to process the data, hence the 2D input tensor has to be flattened into a 1D shape using the Flatten layer in Keras before it can be fed into the rest of the network.
from tensorflow import keras
from tensorflow.keras import layers
inputs = keras.Input(shape=(sequence_length, raw_data.shape[-1]))
x = layers.Flatten()(inputs)
x = layers.Dense(16, activation="relu")(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
history = model.fit(train_dataset,
epochs=10,
validation_data=val_dataset)
This example is taken from Section 11.3 in Chollet.
The objective of this model is to classify movie reviews as either positive or negative, given the text of the review. The IMDB dataset used in this example can be downloaded from: http://ai.stanford.edu/~amaas/data/sentiment/. There are a total of 50,000 reviews in the dataset, of which we will use 25,000 reviews for training and the remainder for testing. Of all the reviews in this dataset, 50% are positive.
We will go through the following steps in pre-processing the text data before it can be fed into the Neural Network:
The downloaded dataset already has the reviews sorted into Training and Test directories, and within each directory they are further sorted into negative and positive reviews. We start the process of creating a training dataset by creating two lists:
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = '/Users/subirvarma/handson-ml/datasets/aclImdb'
#val_dir = base_dir/"val"
#train_dir = base_dir/"train"
train_ds = keras.utils.text_dataset_from_directory(
"/Users/subirvarma/handson-ml/datasets/aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
"/Users/subirvarma/handson-ml/datasets/aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
"/Users/subirvarma/handson-ml/datasets/aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)
#DFN10
nb_setup.images_hconcat(["DL_images/DFN10.png"], width=600)
The next piece of code invokes the TextVectorization procedure, which takes each review and converts it from text to integers. It does so by restricting the vocabulary to the 20,000 most frequently occurring words (specified by the parameter max_tokens), and then mapping each word to a unique integer in the range 0 to 19,999 (after removing all punctuation). It furthermore truncates each review to a maximum of max_length = 600 words, and pads reviews shorter than 600 words with zeros. The resulting 2D array is illustrated in Figure DFN10.
from tensorflow.keras import layers
max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
max_tokens=max_tokens,
output_mode="int",
output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)
int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))
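The mapping from text to padded integer sequences can be illustrated in a few lines of plain Python. This is a simplified sketch and not the Keras implementation: the functions standardize, build_vocabulary and vectorize are hypothetical names, and the toy reviews are made up. It follows the convention of reserving index 0 for padding and index 1 for words outside the vocabulary.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent words; indices 0 and 1 are reserved."""
    counts = Counter(word for text in texts for word in standardize(text))
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}

def vectorize(text, vocabulary, max_length):
    """Map words to integers, truncate to max_length, and pad with zeros."""
    ids = [vocabulary.get(word, 1) for word in standardize(text)][:max_length]
    return ids + [0] * (max_length - len(ids))

toy_reviews = ["A great, great movie!", "A terrible movie."]
vocabulary = build_vocabulary(toy_reviews, max_tokens=5)
print(vocabulary)
print(vectorize("A great film", vocabulary, max_length=6))
```

In this toy example the word "film" does not appear in the vocabulary, so it is mapped to the reserved index 1, and the three-word review is padded with zeros up to max_length.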
The Dense Feed Forward network model is defined next. Note that the first layer of the model is an Embedding Layer. Its function is to take a sample of the input review, which is a 1D array of shape max_length (see Figure DFN9), and convert it into a 2D array of shape (max_length, embedding_dim). Hence after this transformation, each review is represented by a matrix, which is then fed into the rest of the network. Since this is a Dense Feed Forward network, it can only accept 1D vectors, hence the matrix is flattened before it is forwarded on, as shown in Figure DFN9.
#DFN9
nb_setup.images_hconcat(["DL_images/DFN9.png"], width=600)
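The shape changes produced by an embedding lookup followed by flattening can be sketched in NumPy. This is an illustration under stated assumptions: the sizes below are deliberately small hypothetical values rather than the ones used in the model, and random numbers stand in for the trained embedding weights.

```python
import numpy as np

max_tokens = 20     # hypothetical small vocabulary size
max_length = 6      # hypothetical short review length
embedding_dim = 4   # hypothetical embedding size

rng = np.random.default_rng(0)
# One row of weights per integer token (random values stand in for trained ones).
embedding_matrix = rng.normal(size=(max_tokens, embedding_dim))

review = np.array([2, 3, 1, 0, 0, 0])  # one vectorized review, shape (max_length,)
embedded = embedding_matrix[review]     # row lookup, shape (max_length, embedding_dim)
flattened = embedded.reshape(-1)        # shape (max_length * embedding_dim,)

print(review.shape, embedded.shape, flattened.shape)
```

Each integer in the review selects one row of the matrix, so the embedding step is a table lookup; the flattened vector is what the first Dense layer receives.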
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
embedding_dim = 100
model = Sequential()
model.add(Embedding(max_tokens, embedding_dim, input_length=max_length))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['acc'])
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10)
Exercise: The performance of this model is not very good. Part of the reason for this is that Dense Feed Forward Networks are not very well suited for processing sequences such as those that arise in NLP. Later in this book we will study other models such as Recurrent Neural Networks, LSTMs and Transformers that are better at this task.
Try out the following changes to see whether any of them help improve it:
This example is taken from Section 8.2 of Chollet. The dataset is from a Kaggle competition and consists of 25,000 labeled images, evenly divided between those of cats and dogs (the dataset can be downloaded from https://www.kaggle.com/c/dogs-vs-cats/data). The first step after downloading the data is to split up the images into training, validation and test directories, and furthermore within each of these, create separate sub-directories for cat and dog images (see Figure DFN13). The images in this dataset are labeled using their file names, which cannot be directly used during the training process. The Keras dataset generator assumes that each image category occupies its own sub-directory and then generates the training labels by making use of this information.
See the example in Chollet on how to create and populate the directory structure; we will assume that this step has already been done.
#DFN13
nb_setup.images_hconcat(["DL_images/DFN13.png"], width=600)
import os, shutil, pathlib
train_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/train'
validation_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/validation'
train_cats_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/train/cats'
train_dogs_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/train/dogs'
validation_cats_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/validation/cats'
validation_dogs_dir = '/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small/validation/dogs'
new_base_dir = pathlib.Path("/Users/subirvarma/handson-ml/datasets/cats_and_dogs_small")
We will use only a small subset of the images in the dataset: 2000 images for training and 1000 for validation.
print('total training cat images:', len(os.listdir(train_cats_dir)))
print('total training dog images:', len(os.listdir(train_dogs_dir)))
print('total validation cat images:', len(os.listdir(validation_cats_dir)))
print('total validation dog images:', len(os.listdir(validation_dogs_dir)))
We now make use of the Keras utility called image_dataset_from_directory which carries out the following functions:
Another Dataset is defined for the Validation data, with exactly the same structure.
from tensorflow.keras.utils import image_dataset_from_directory
train_dataset = image_dataset_from_directory(
new_base_dir / "train",
image_size=(150, 150),
batch_size=20)
validation_dataset = image_dataset_from_directory(
new_base_dir / "validation",
image_size=(150, 150),
batch_size=20)
The output of one of these Dataset objects looks as follows:
for data_batch, labels_batch in train_dataset:
    print('data batch shape:', data_batch.shape)
    print('labels batch shape:', labels_batch.shape)
    break
We now define a Dense Feed Forward model with four hidden layers to process the data. Since this model can only process 1D tensors, we use the Flatten layer to convert the (150,150,3) image tensor into a 1D tensor of size 67,500. The model has more than 4 million parameters, almost all of which are concentrated in the first layer of weights, due to the large number of image pixels.
from keras.layers import Embedding, Flatten, Dense
from tensorflow.keras import optimizers
from keras import Model
from keras import layers
from keras import Input
input_tensor = Input(shape=(150,150,3,))
a = layers.Flatten()(input_tensor)
a = layers.Rescaling(1./255)(a)
layer_1 = layers.Dense(64, activation='relu')(a)
layer_2 = layers.Dense(64, activation='relu')(layer_1)
layer_3 = layers.Dense(64, activation='relu')(layer_2)
layer_4 = layers.Dense(64, activation='relu')(layer_3)
output_tensor = layers.Dense(1, activation='sigmoid')(layer_4)
model = Model(input_tensor, output_tensor)
model.compile(loss='binary_crossentropy',
optimizer=optimizers.RMSprop(learning_rate=1e-4),
metrics=['acc'])
model.summary()
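The parameter count reported by model.summary() can be reproduced by hand. The following sketch assumes only the layer sizes defined above; the helper dense_params is a hypothetical name, and each Dense layer is counted as one weight per input-unit pair plus one bias per unit.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

flattened_size = 150 * 150 * 3     # size of the flattened image tensor
layer_sizes = [64, 64, 64, 64, 1]  # four hidden layers and the output layer

per_layer = []
n_inputs = flattened_size
for n_units in layer_sizes:
    per_layer.append(dense_params(n_inputs, n_units))
    n_inputs = n_units

total = sum(per_layer)
print(flattened_size)        # 67500
print(per_layer)             # [4320064, 4160, 4160, 4160, 65]
print(total)                 # 4332609
print(per_layer[0] / total)  # fraction of parameters in the first layer
```

The first layer accounts for well over 99% of the parameters, which is the concentration referred to in the text.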
We are finally ready to run the model, which is done using the fit command. The execution of the model is illustrated in Figure DFN14: The training dataset samples batches of 20 images, does the pre-processing and conversion into tensors, and feeds them into the model.
The Dataset Generator computes the number of batches to be fed into the model using the formula
$$ batches\ per\ epoch = {{total\ sample\ size\ per\ epoch}\over{batch\ size}} $$
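As a worked instance of this formula, the sketch below uses the 2000 training images and the batch size of 20 from this example. The rounding up is an assumption about how a final partial batch is handled: when the sample size is not a multiple of the batch size, the last batch is taken to be smaller rather than dropped.

```python
import math

num_train_images = 2000  # training subset used in this example
batch_size = 20

batches_per_epoch = math.ceil(num_train_images / batch_size)
print(batches_per_epoch)  # 100

# Hypothetical case where the sample size is not a multiple of the batch size:
# the final batch holds the 10 leftover images.
print(math.ceil(2010 / batch_size))  # 101
```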
In a previous version of Keras, there was a steps_per_epoch argument in the fit command that could be used to control the number of batches per epoch. This allowed more images to be fed into the model than actually existed in the directory. This was very useful when images were augmented with random changes before entering the model, thus enabling an increase in the effective number of images available for training. The steps_per_epoch parameter is still present, but it is not clear whether it still performs this function.
The Validation Generator feeds the validation data into the model using a similar process.
#DFN14
nb_setup.images_hconcat(["DL_images/DFN14.png"], width=600)
history = model.fit(
train_dataset,
epochs=100,
validation_data=validation_dataset)
history_dict = history.history
history_dict.keys()
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf() # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
The Validation Accuracy for this system is about 60%, which is not great. There are several ways to improve this: