Linear Learning Models

In [1]:
from ipypublish import nb_setup

Logistic Regression

The subject of this chapter are Linear Neural models used for classification, which also go by the name Logistic Regression. These models serve as the building block for more sophisticated DLN classification models, all of whom use a Logistic Regression layer for their final stage. We proceed in two stages, we start with the case $K=2$ for classification into two categories, followed by the general case of $K$ categories.

In [2]:
#LR10
nb_setup.images_hconcat(["DL_images/LR10.png"], width=600)
Out[2]:

In Figure LR10 we reproduce the classification model from the previous chapter, specialized for the case $K=2$. The model processes the input $X$ and produces a probability distribution for the two output values. Not that only a single output $P(T=1|X) = y$ is needed, since the other output is always $P(T=0|X)=1-y$. We now propose our first model for the function $h$, which is shown in Figure LR2.

In [3]:
#LR2
nb_setup.images_hconcat(["DL_images/LR2.png"], width=600)
Out[3]:

Sigmoid Function

Using the notation from the previous chapter, the functional form of the simplest linear model with single node in the Output Layer (corresponding to $K=2$) is given by

\begin{equation} a = \sum_{i=1}^N w_i x_i + b \quad \quad \quad \mbox{(**aWX**)} \end{equation}

In this equation $(w_1,...,w_N)$ are called weights and $b$ is known as the bias (see Figure LR2)). As the reader can readily see, $a$ can not function as the output of a classification system, since we cannot ensure that $0 \le a\le 1$. In order to satisfy this requirement we add a function called the sigmoid to the system (see Figure sigmoid). The sigmoid function is also referred to as the “squashing function” since it maps the entire real axis into the finite interval $[0, 1]$. The resulting output $y$ satisfies the condition $0\le y\le 1$ and hence is a suitable output for the Logistic Regression system.

$$ y = \sigma(a) = \frac{1}{1+\exp(-a)} $$

In [4]:
#sigmoid
nb_setup.images_hconcat(["DL_images/sigmoid.png"], width=600)
Out[4]:

We can do classification using this model by using the following criteria: If $y\ge 0.5$, then the model predicts that input to be in class 1, or else in class 2. Note that the condition $y\ge 0.5$ is equivalent to the condition $$ a = \sum_{i=1}^N w_i x_i + b \ge 0 $$ so that input points on one side of the hyperplane $\sum_{i=1}^N w_i x_i + b$ are categorized in class 1, and those on the other side are categorized in class 2. This leads to the question: Why not use $a$ to do the classification, thus avoidng the extra step of computing $y$. If we use the criteria $y\ge 0.5$ to do the classification, then indeed there is no difference in whether we choose to use $y$ or $a$. However if we change the classification criteria to the following: Compute $y$ and use it to generate a Random Number such that the probability of the input being in class 1 is $y$, and similarly for class 2. With this criteria, there is a finite chance that the predicted class is $2$, even though $y\ge 0.5$ (especially if $y$ is close to $0.5$). This means that input points that are close to the hyperplane $a = \sum_{i=1}^N w_i x_i + b$ have an equal chance of being classified in either Class 1 or Class 2. This distinction is important, since it enables the Logistic Regression model to use probabilistic inference and do classification even in cases in which the two classes are not strictly linearly separated. The interpretation of the Linear Model as a hyperplane also explains the need for the bias parameter $b$, since in the absence of $b$, all hyperplanes would be constrained to pass through the origin, which obviously limits their modeling power.

Linear Separator

In [5]:
#linearSeparator
nb_setup.images_hconcat(["DL_images/linearSeparator.png"], width=600)
Out[5]:

Figure linearSeparator shows the results of applying Logistic Regression in the two dimensional case to a test dataset. As we noted earlier, point that are close to the linear separator may get classified in either of the two classes due to the statistical nature of the algorithm.

Now that we have obtained the structural form for our classification system for the special case $K=2$, the next step is to estimate the model parameters $b, w_i, i=1,...,N$ from the training data. From the discussion in the prior chapter, this can be done by choosing the parameters so that they minimize the Cross Entropy Loss Function. Given that $(X{(m)},T{(m)}),m=1,...,M$ is the training sequence, the Loss Function is defined as:

$$ L = -{1\over M}\sum_{m=1}^M [t{(m)} \log y{(m)} + (1-t{(m)}) \log(1-y{(m)})] $$

where the output $y{(m)}$ in response to the $m^{th}$ input vector $X{(m)}$ is given by

$$ y{(m)} = \sigma(a{(m)})\ \ where\ \ a{(m)} = \left(\sum_{i=1}^N w_i x_i{(m)} + b \right), \ \ 1\le m\le M $$

Even for this simple model there is no closed form solution for determining the values of $(W,b)$ that minimize $L$, so we resort to the use of the Gradient Descent algorithm:

$$ w_i \leftarrow w_i - \eta \frac{\partial L}{\partial w_i},\ \ 1\le i\le N\ and\ b\leftarrow b - \eta \frac{\partial L}{\partial b} $$

where $\eta$ is a constant called the Learning Rate. There are various ways in which the iteration shown above can be improved and is discussed in detail in Chapter GradientDescentTechniques. In order to proceed we need to compute the gradients $\frac{\partial L}{\partial w_i}$ and $\frac{\partial L}{\partial b}$.

The Loss Function $\mathcal L$ for a single sample of the training dataset is given by:

$$ \mathcal L = -[t\log y + (1-t)\log (1-y)] $$

Using the Chain Rule of Derivatives,

$$ \frac{\partial \mathcal L}{\partial w_i} = \frac{\partial \mathcal L}{\partial y}\frac{\partial y}{\partial a}\frac{\partial a}{\partial w_i} $$

Note that

$$ \frac{\partial \mathcal L}{\partial y} = \frac{y-t}{y(1-y)},\ \ \frac{\partial y}{\partial a} = y(1-y)\ \ and\ \ \frac{\partial a}{\partial w_i} = x_i $$

which results in

$$ \frac{\partial \mathcal L}{\partial w_i} = x_i(y-t),\ 1\le i\le N $$

We can now write down the expression for the derivative of the Loss Function $L$ for the entire training dataset:

\begin{equation} \frac{\partial L}{\partial w_i} = {1\over M}\sum_{m=1}^M x_i(m)(y{(m)}-t{(m)}),\ 1\le i\le N \quad \quad \mbox{(**dLdW**)} \end{equation}

Gradient Descent

In [6]:
#LR11
nb_setup.images_hconcat(["DL_images/LR11.png"], width=600)
Out[6]:

Using equation dLdW, the following iterative algorithm can be used to train the model:

Gradient Descent Algorithm

  1. Initialize all the weights $w_i, 1\le i\le N$ and the bias term $b$ to zero.

  2. In response to the training sequence $X{(m)}, 1\le m\le M$, compute the model predictions $y(m), 1\le m\le M$ (using the latest values of the parameters $(W,b)$): $$ y(m) = \sigma\left(\sum_{i=1}^N w_i x_i{(m)} + b \right), \ \ 1\le m\le M $$

  3. Change the model parameters according to the formulae

\begin{equation} w_i \leftarrow w_i - {\eta\over M} \sum_{m=1}^M x_i(m)(y{(m)}-t{(m)}),\ 1\le i\le N \quad \quad \mbox{(**L1**)} \end{equation}

$$ b \leftarrow b - {\eta\over M} \sum_{m=1}^M (y{(m)}-t{(m)}) $$

  1. Compute the Loss Function $L$ over the Validation Dataset given by $$ L = -{1\over V}\sum_{m=1}^V t{(m)} \ln y{(m)} + (1-t{(m)} \ln(1-y{(m)}) $$ where $V$ is the number of samples in the Validation Dataset. If $L$ drops below some threshold, then stop. Otherwise go back to Step 2.

The Gradient Descent Algorithm is our first encounter with training or learning algorithms in this monograph. The mathematical setup and general procedure that we used in this algorithm holds true for more complex models. Step 2 of the algorithm in which the output of the model is computed is known as the Forward Pass. Step 3 of the model in which the gradients are computed and then used to update the model parameters, is known as the Backward Pass. This model (and all other DLN models) proceeds by repeated cycles of Forward and Backward Passes, until the stopping condition is satisfied. This is also illustrated in Figure LR11. All DLN training algorithms share this structure of alternating Forward and Backward passes, the only difference being that the calculations become more complex in the Forward Pass and the Backward Pass is implemented using the Backprop algorithm.

An alternative to using the Loss Function as a stopping criteria, is to use a performance measure called Accuracy. This is defined as the percentage of correct predictions on the Validation Dataset, so that, the Accuracy $A$ is given by $$ A = \frac{ \sum_{m=1}^V 1_{y(m)=t(m)}} {V} $$ If the value of $A$ exceeds some threshold, terminate the algorithm.

We can get some intuition into the workings of this training algorithm as follows: Consider a training example in which the label $t=1$. From Equation L1 it follows that the training iteration is given by $$ w_i \leftarrow w_i - {\eta\over M} \sum_{m=1}^M x_i(m)(y(m)-1) $$ Since $ y-1 \le 0$, it follows that all the weights $w_i$ increase during this iteration of the training loop. This would cause the the value of $y$ to increase for the input $X$. This makes intuitive sense, since we would want the probability of $X$ being classified in Category 1 to increase as a result of this training sample.

Conversely, if the label for the input $X$ was $t=0$, then the training iteration becomes: $$ w_i \leftarrow w_i - {\eta\over M} \sum_{m=1}^M x_i(m) y(m) $$ This expression shows that the iteration results in a decrease of model weights, so that next time the same input is presented, the out $y$ of the model will be smaller. As expected this implies that the probability of $X$ being classified in the Category 1 decreases as a result of this training sample.

Stochastic Gradient Descent

In ML algorithms, an epoch is defined as the time required to do a single pass through all the elements in the training dataset. In the algorithm described above, note that model parameters are updated only once per epoch. In practice, this considerably slows down the speed of convergence, especially for large training datasets. We now present a modification to the algorithm, called Stochastic Gradient Descent (SGD), which updates the parameters more frequently, indeed after processing each element in the training dataset. The key to SGD is the observation that the parameter update equation can be written as:

\begin{equation} w_i \leftarrow w_i - \eta x_i(m)(y{(m)}-t{(m)}),\ 1\le i\le N \quad \quad \mbox{(**L2**)} \end{equation}

Hence for input vector $X(m)$, we compute the corresponding output $y(m)$, and then use the above equation to immediately change the model parameters, and then repeat the process with the next input. This considerably speeds up the algorithm, and thus the learning process. Effectively by doing this, we are using noisy estimates of the gradient to do the iteration, which causes the convergence to be not as smooth as with Gradient Descent (see Figure SGD). However it has been observed that the noise introduced to SGD also has the benefit of helping the algorithm avoid getting stuck in non-optimal minima, as well as escape from saddle point quickly.

In [7]:
#SGD
nb_setup.images_hconcat(["DL_images/SGD.png"], width=600)
Out[7]:

Stochastic Gradient Descent Algorithm (SGD)

  1. Initialize all the weights $w_i, 1\le i\le N$ and the bias term $b$ to zero.

  2. For $m = 1,...,M$ repeat the following steps (a) - (c):

    (a) For the input vector $X{(m)}$, compute the model prediction $y(m)$ given by $$ y(m) = \sigma\left(\sum_{i=1}^N w_i x_i{(m)} + b \right) $$

    (b) Change the model weights according to the formula

$$ w_i \leftarrow w_i - \eta x_i(m)(y{(m)}-t{(m)}),\ 1\le i\le N $$

$$ b \leftarrow b - \eta (y{(m)}-t{(m)}) $$

(c) Increment $m$ by one and go back to step $(a)$.

  1. Compute the Loss Function $L$ over the Validation Dataset given by $$ L = -{1\over V}\sum_{m=1}^V t{(m)} \log y{(m)} + (1-t{(m)} \log(1-y{(m)}) $$ If $L$ has dropped below some threshold, then stop. Otherwise go back to Step 2.

Note that in the SGD algorithm we update the parameter values after processing each element in the training dataset, but we check the stopping condition only once per epoch.

Batch Stochastic Gradient Descent Algorithm (B-SGD)

There is a third way of solving the optmization problem, which is midway between the two extremes of Gradient Descent and Stochastic Gradient Descent, and is known as Batch Stochastic Gradient Descent (B-SGD). This is based on the following parameter update rule: $$ w_j \leftarrow w_j - {\eta\over B} \sum_{b=1}^B x_i(b)(y{(b)}-t{(b)}) $$ This algorithm works by dividing up the training dataset into batches of size $B$, and then updates the parameters only once per batch. By varying the batch size, we can trade-off the smoothness of the convergence algorithm with the speed of convergence.

  1. Initialize all the weights $w_i, 1\le i\le N$ and the bias term $b$ to zero.
  2. For $r = 0,...,{M\over B}-1$ repeat the following steps (a) - (c):

    (a) For the training inputs $X_i(m), rB\le i\le (r+1)B$, compute the model predictions $y(m)$ given by $$ y(m) = \sigma\left(\sum_{i=1}^N w_i x_i{(m)} + b \right) $$ (b) Change the model weights according to the formula $$ w_j \leftarrow w_j - {\eta\over B} \sum_{m=rB}^{(r+1)B} x_i(m)(y{(m)}-t{(m)}),\ 1\le j\le N $$ $$ b \leftarrow b - {\eta\over B} \sum_{m=rB}^{(r+1)B}(y{(m)}-t{(m)}) $$

    (c) Increment $r$ by one and go back to step $(a)$.

  3. Compute the Loss Function $L$ over the Validation Dataset given by $$ L = -{1\over V}\sum_{m=1}^V t{(m)} \ln y{(m)} + (1-t{(m)} \ln(1-y{(m)}) $$ If $L$ has dropped below some threshold, then stop. Otherwise go back to Step 2.

In [8]:
#GDComp
nb_setup.images_hconcat(["DL_images/GDComp.png"], width=600)
Out[8]:

Figure GDComp gives a summary of the three Gradient Descent algorithms types, with the red arrows representing the number of input samples that are used in the computation of a single update of the model parameters.

Multiclass Logistic Regression

In [9]:
#multiclassLogit
nb_setup.images_hconcat(["DL_images/multiclassLogit.png"], width=600)
Out[9]:

If the input can belong to one of $K$ categories, then the Logistic Regression model can be expanded to accommodate this, as shown in Figure multiclassLogit. Define the quantities $a_k$ by

$$ a_k = \sum_{i=1}^N w_{ki} x_i + b_k, \quad 1 \leq k \leq K $$

These quantities are referred to as logits in Machine Learning literature, and are defined as the outputs of the final layer of neurons. They are also sometimes referred to as "Scoring Functions", since a higher score for a class means that it stands a greater chance of being the selected category.

We need to convert these logits into output probabilities $y_k$. Note that the sigmoid function cannot be used for this, since in that case the condition $\sum_{k=1}^K y_k = 1$ is not guaranteed to be satisfied. Instead the probabilities are generated using the following equation:

$$ y_k = \frac{\exp(a_k)}{\sum_{i=1}^K \exp(a_i)}, \quad 1 \leq k \leq K $$

which is called the Normalized Exponential or Softmax function and is a multiclass generalization of the sigmoid. Indeed for the case $K = 2$ it is easy to show that softmax and the sigmoid result in identical outputs.

Using the results from the previous chapter on Supervised Learning, the Loss Function for this model can be written as

$$ L = -{1\over M}\sum_{m=1}^M \sum_{k=1}^K t_k(m)\log y_k(m) = {1\over M}\sum_{m=1}^M \mathcal L(m) $$

where

$$ \mathcal L(m) = -\sum_{k=1}^K t_k(m)\log y_k(m) $$

In these equation, the labels $t_k$ are defined such that if the label corresponds to the $q^{th}$ category, then $t_q = 1$ and $t_k = 0, k\ne q$. In order to train the model using Gradient Descent, we need to compute $\frac{\partial \mathcal L}{\partial w_{ki}}$ and $\frac{\partial \mathcal L}{\partial b_{k}}$. Using the Chain Rule of derivatives

$$ \frac{\partial \mathcal L}{\partial w_{ki}} = \frac{\partial \mathcal L}{\partial a_{k}}\frac{\partial a_k}{\partial w_{ki}} $$

Note that $\frac{\partial a_k}{\partial w_{ki}} = x_i$ while $\frac{\partial \mathcal L}{\partial a_{k}}$ can be written as

$$ \frac{\partial \mathcal L}{\partial a_{k}} = \sum_{j=1}^K \frac{\partial \mathcal L}{\partial y_{j}} \frac{\partial y_j}{\partial a_{k}} $$

In order to justify this equation, note that the logit $a_k$ influences the Loss Function through all of the outputs $y_j, 1\le j\le K$, due to the form of the softmax equation. Note that

$$ \frac{\partial \mathcal L}{\partial y_{j}} = -\frac{t_j}{y_j} $$

while it is not difficult to show that

$$ \frac{\partial y_j}{\partial a_{j}} = y_j(1-y_j) $$

and

$$ \frac{\partial y_j}{\partial a_{k}} = -y_jy_k\ \ for\ \ j\ne k $$

It follows that

$$ \frac{\partial \mathcal L}{\partial a_{k}} = -\frac{t_k}{y_k} y_k(1-y_k) - \sum_{j\ne k}\frac{t_k}{y_j}(-y_jy_k) = -(t_k - t_ky_k-\sum_{j\ne k}t_jy_k) = -(t_k-\sum_{j=1}^K t_jy_k) = y_k-t_k $$

Putting all these equations together, it follows that

$$ \frac{\partial \mathcal L}{\partial w_{ki}} = x_i(y_k - t_k) $$

Similarly it can be shown that

$$ \frac{\partial \mathcal L}{\partial b_{k}} = (y_k - t_k) $$

Using these expressions, we end up with the following Gradient Descent Training Algorithm for Logistic Regression systems with $K$ output categories:

In [10]:
#LR12
nb_setup.images_hconcat(["DL_images/LR12.png"], width=600)
Out[10]:

Gradient Descent Algorithm

  1. Initialize all the weights $w_{ki}, 1\le k\le K, 1\le i\le N$ and the bias terms $b_k, 1\le k\le K$ to zero.

  2. Loop through (a) and (b) for $M\over B$ batches:

    a. Forward Pass: In response to the training sequence $X{(m)}, 1\le m\le M$, compute the model predictions $y_k(m), 1\le k\le K, 1\le m\le B$ (using the latest values of the parameters $(W,b)$):

$$ y_k(m) = {{\exp{a_k(m)}}\over{\sum_{i=1}^K \exp{a_i(m)}}},\ \ where\ \ a_k(m) = \sum_{i=1}^N w_{ki} x_i{(m)} + b_k,\ \ 1\le k\le K,\ \ 1\le m\le B $$

b. Backward Pass: Change the model parameters according to the formulae

$$ w_{ki} \leftarrow w_{ki} - {\eta\over B} \sum_{m=1}^B x_i(m)(y_k{(m)}-t_k{(m)}),\ 1\le k\le K,\ 1\le i\le N $$

$$ b_k \leftarrow b_k - {\eta\over B} \sum_{m=1}^B (y_k{(m)}-t_k{(m)}), \ \ 1\le k\le K $$

  1. Compute the Loss Function $L$ over the Validation Dataset given by

$$ L = -{1\over V}\sum_{m=1}^V\sum_{k=1}^K t_k{(m)} \log y_k{(m)} $$

where $V$ is the number of samples in the Validation Dataset. If $L$ drops below some threshold, then stop. Otherwise go back to Step 2.

Just as for the case $K=2$, we can verify that the weight update equations in this algorithm cause the output probabilities to change in the right way, as follows: Consider the case that the input $X$ is such that the label corresponds to the $q^{th}$ category so that $t_q = 1$ and $t_k = 0, k\ne q$. Then the weight updates can be written as

$$ w_{qi} \leftarrow w_{qi} - {\eta\over M} \sum_{m=1}^M x_i(m)(y_q{(m)}-1),\ \ 1\le i\le N $$

and

$$ w_{ki} \leftarrow w_{ki} - {\eta\over M} \sum_{m=1}^M x_i(m)y_k{(m)},\ \ k\ne q,\ 1\le i\le N $$

From these equations it is clear the weights that are incident on the $q^{th}$ output increase, while all the other weights decrease. Hence it follows that the next time the same input is sent into the model, the value of the probability $y_q$ will be higher and the other $y_k, k\ne q$ will be smaller.

Image Classification Using Linear Models

In [11]:
#LC1
nb_setup.images_hconcat(["DL_images/LC1.png"], width=600)
Out[11]:
In [12]:
#LC2
nb_setup.images_hconcat(["DL_images/LC2.png"], width=600)
Out[12]:

A Linear Classifier can be interpreted as a straight line (in 2-D) or hyperplane that separates out the points belonging to the classes being categorized. There is another interpretation of Linear Classifiers which is best understood for the case when the input is an image, and this is in terms of Template Matching. We introduced the MNIST image dataset in Chapter 2, and we now introduce the CIFAR-10 image dataset which has also played an important role in the development of DLNs (see Figure LC1). This dataset consists 50,000 training images and 10,000 test images, each of which is $32 \times 32 \times 3$ pixels. Each image contains an object which can belong to one of ten categories, as shown in the figure.

In order to input a CIFAR-10 image into the classifier, it has to be stretched out into a vector of 3072 dimensions. It is then fed into the 10-ary classification model of the type shown in Figure multiclassLogit. A fully trained linear model results in a classification accuracy of about 40%, which is significantly higher than the accuracy of 10% or so one would expect if we were to make a purely random choice. We can get some understanding of the way the images are being classified by examining the weight matrix $\{w_{ki}\}, 1\le k\le 10, 1\le i\le 3072$. Recall that the logits $a_k, 1\le k\le 10$ are defined as

\begin{equation} a_k = \sum_{i=1}^{3072 } w_{ki} x_i,\ \ 1\le k\le 10 \quad \quad \mbox{(**TP**)} \end{equation}

where $x_i$ represent the pixels in the image. Lets assume that the input image belongs to the category $q$. In order to correctly classify this image, the logit $a_q$ should be significantly larger than the other logits. How does the system accomplish this? A plausible explanation can be obtained by considering Equation (TP) to be a form of template matching. Just as we obtained the vector $x_i, 1\le i\le 3072$ by stretching out the image pixels, we can also reverse this process and create an image from the weight vector $w_{qi}, 1\le i\le 3072$ that is used to compute the $q^{th}$ logit $a_q$. The images thus created from the weight vectors for the ten CIFAR-10 categories are shown in Figure LC2. The reader will notice that there is a faint but distinct resemblance between these "weight generated" images and the category to which the weights belong. This can be explained by interpreting the "weight generated" images as an image template. Hence a fully trained model creates ten of these templates, one for each image category, and then it does classification by using Equation (TP) to do template matching (it can be shown that under a norm constraint, the sum in Equation (TP) is maximized if the vectors $w_{.i}$ and $x_i$ match).

The explanation that we gave in terms of template matching, points to the interpretation of the model weight parameters as a filter. This interpretation is extremely important, and serves as the inspiration for a class of DLN models called Convolutional Neural Networks (see Chapter ConvNets).

Beyond Linear Models

In [13]:
#RP1
nb_setup.images_hconcat(["DL_images/RP1.png"], width=600)
Out[13]:

Note that Linear Models work well only in cases in which the data categories are approximately linearly separable. In the 2-dimensional case this corresponds to drawing a straight line that separates out most of the points in the two categories. However there are cases in which this assumption does not hold, an example of which is shown in the diagram on the LHS of Figure RP1. Some of these cases can be solved by finding new representations for the input data in which the linear separability property holds. For example if we change the co-ordinates in Figure RP1 from Cartesian to Polar, then the two classes become linearly separable. The change in representation for the general case (for K = 2) is shown in Figure featureLogit, so that the output of this network is given by: $$ y = \sigma(\sum_{i=1}^N w_i f_i(x_1,...,x_N) + b) $$

In [14]:
#featureLogit
nb_setup.images_hconcat(["DL_images/featureLogit.png"], width=600)
Out[14]:

Before the discovery of DLNs, a lot of the research in the classification area was focused on pre-processing the input values and discovering appropriate functions $f_i$ (known as Feature Extractors) which could then be fed into the Logistic Regression network. The Feature Extractor selection involves a deep analysis of the original input to discover functions that can transform the input data in a way that makes it amenable to classification using a simple linear system.

DLNs have done away with the intermediate step of manual feature set selection. As explained in the next chapter, DLNs automatically learn the appropriate representation from the available training data, using an algorithm that is a generalization of the one we discussed for Logistic Regression.

Example of a Linear Model in Keras

The Keras model shown below does Linear Classification for the CIFAR-10 dataset. The flow of the program is the identical to the MNIST model from the previous chapter, with the exception that each image is now a 3D tensor of shape (32, 32, 3). Before it can be processed by the model, it has to flattened into a vector of size 32 x 32 x 3.

The results of running the Linear Model shows that asymtotically the validation accuracy converges to about 40% which is much better than the 10% accuracy one would expect from random guessing. In the following chapters we will introduce more sophisticated models which improve the accuracy beyond this number (the best models achieve an accuracy of over 90% for this dataset).

In [28]:
import keras
keras.__version__
from keras import models
from keras import layers

from keras.datasets import cifar10

(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()
In [29]:
train_images.shape
Out[29]:
(50000, 32, 32, 3)
In [30]:
test_images.shape
Out[30]:
(10000, 32, 32, 3)
In [31]:
train_labels
Out[31]:
array([[6],
       [9],
       [9],
       ...,
       [9],
       [1],
       [1]], dtype=uint8)
In [32]:
image = train_images[4]
digit = image

import matplotlib.pyplot as plt
import numpy as np
plt.imshow(digit, cmap = plt.cm.binary)
plt.show()
In [33]:
from tensorflow.keras.utils import to_categorical

train_images = train_images.reshape((50000, 32 * 32 * 3))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 32 * 32 * 3))
test_images = test_images.astype('float32') / 255

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

network = models.Sequential()
network.add(layers.Dense(15, activation='relu'))
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='sgd',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

history = network.fit(train_images, train_labels, epochs=100, batch_size=128, validation_split=0.2)
Epoch 1/100
313/313 [==============================] - 1s 3ms/step - loss: 2.1691 - accuracy: 0.1817 - val_loss: 2.0872 - val_accuracy: 0.2151
Epoch 2/100
313/313 [==============================] - 1s 3ms/step - loss: 2.0082 - accuracy: 0.2719 - val_loss: 1.9559 - val_accuracy: 0.3018
Epoch 3/100
313/313 [==============================] - 1s 3ms/step - loss: 1.9197 - accuracy: 0.3169 - val_loss: 1.9114 - val_accuracy: 0.3129
Epoch 4/100
313/313 [==============================] - 1s 3ms/step - loss: 1.8710 - accuracy: 0.3376 - val_loss: 1.8810 - val_accuracy: 0.3356
Epoch 5/100
313/313 [==============================] - 1s 3ms/step - loss: 1.8414 - accuracy: 0.3516 - val_loss: 1.8478 - val_accuracy: 0.3468
Epoch 6/100
313/313 [==============================] - 1s 3ms/step - loss: 1.8209 - accuracy: 0.3591 - val_loss: 1.8431 - val_accuracy: 0.3472
Epoch 7/100
313/313 [==============================] - 1s 3ms/step - loss: 1.8027 - accuracy: 0.3644 - val_loss: 1.8858 - val_accuracy: 0.3220
Epoch 8/100
313/313 [==============================] - 1s 4ms/step - loss: 1.7877 - accuracy: 0.3665 - val_loss: 1.8595 - val_accuracy: 0.3409
Epoch 9/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7769 - accuracy: 0.3730 - val_loss: 1.8028 - val_accuracy: 0.3649
Epoch 10/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7632 - accuracy: 0.3769 - val_loss: 1.8087 - val_accuracy: 0.3640
Epoch 11/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7531 - accuracy: 0.3794 - val_loss: 1.8320 - val_accuracy: 0.3573
Epoch 12/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7394 - accuracy: 0.3854 - val_loss: 1.7666 - val_accuracy: 0.3761
Epoch 13/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7294 - accuracy: 0.3895 - val_loss: 1.8085 - val_accuracy: 0.3575
Epoch 14/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7186 - accuracy: 0.3954 - val_loss: 1.7770 - val_accuracy: 0.3728oss: 1.7190 - ac
Epoch 15/100
313/313 [==============================] - 1s 3ms/step - loss: 1.7090 - accuracy: 0.4000 - val_loss: 1.7492 - val_accuracy: 0.3815
Epoch 16/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6998 - accuracy: 0.4019 - val_loss: 1.7653 - val_accuracy: 0.3753
Epoch 17/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6919 - accuracy: 0.4028 - val_loss: 1.7480 - val_accuracy: 0.3889
Epoch 18/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6833 - accuracy: 0.4090 - val_loss: 1.7128 - val_accuracy: 0.3991
Epoch 19/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6774 - accuracy: 0.4101 - val_loss: 1.7607 - val_accuracy: 0.3743
Epoch 20/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6710 - accuracy: 0.4112 - val_loss: 1.7291 - val_accuracy: 0.3929
Epoch 21/100
313/313 [==============================] - 1s 4ms/step - loss: 1.6641 - accuracy: 0.4142 - val_loss: 1.7041 - val_accuracy: 0.4025
Epoch 22/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6568 - accuracy: 0.4163 - val_loss: 1.7658 - val_accuracy: 0.3778
Epoch 23/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6545 - accuracy: 0.4172 - val_loss: 1.6995 - val_accuracy: 0.4035
Epoch 24/100
313/313 [==============================] - 1s 4ms/step - loss: 1.6486 - accuracy: 0.4193 - val_loss: 1.7082 - val_accuracy: 0.3984
Epoch 25/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6440 - accuracy: 0.4218 - val_loss: 1.7086 - val_accuracy: 0.3981
Epoch 26/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6399 - accuracy: 0.4226 - val_loss: 1.6876 - val_accuracy: 0.4092
Epoch 27/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6358 - accuracy: 0.4259 - val_loss: 1.7045 - val_accuracy: 0.3968
Epoch 28/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6321 - accuracy: 0.4259 - val_loss: 1.7651 - val_accuracy: 0.3800
Epoch 29/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6283 - accuracy: 0.4271 - val_loss: 1.6882 - val_accuracy: 0.4097
Epoch 30/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6250 - accuracy: 0.4260 - val_loss: 1.7156 - val_accuracy: 0.3912
Epoch 31/100
313/313 [==============================] - 1s 4ms/step - loss: 1.6203 - accuracy: 0.4292 - val_loss: 1.7635 - val_accuracy: 0.3766
Epoch 32/100
313/313 [==============================] - 1s 4ms/step - loss: 1.6206 - accuracy: 0.4272 - val_loss: 1.7107 - val_accuracy: 0.3926
Epoch 33/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6153 - accuracy: 0.4319 - val_loss: 1.6890 - val_accuracy: 0.4091
Epoch 34/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6137 - accuracy: 0.4313 - val_loss: 1.6615 - val_accuracy: 0.4192
Epoch 35/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6124 - accuracy: 0.4331 - val_loss: 1.7048 - val_accuracy: 0.3927
Epoch 36/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6060 - accuracy: 0.4333 - val_loss: 1.6860 - val_accuracy: 0.4112
Epoch 37/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6052 - accuracy: 0.4326 - val_loss: 1.6716 - val_accuracy: 0.4155
Epoch 38/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6012 - accuracy: 0.4328 - val_loss: 1.6621 - val_accuracy: 0.4178
Epoch 39/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6017 - accuracy: 0.4347 - val_loss: 1.7222 - val_accuracy: 0.3914
Epoch 40/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5996 - accuracy: 0.4361 - val_loss: 1.7250 - val_accuracy: 0.3968
Epoch 41/100
313/313 [==============================] - 1s 3ms/step - loss: 1.6000 - accuracy: 0.4343 - val_loss: 1.7160 - val_accuracy: 0.3920
Epoch 42/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5974 - accuracy: 0.4363 - val_loss: 1.6727 - val_accuracy: 0.4090
Epoch 43/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5928 - accuracy: 0.4367 - val_loss: 1.6482 - val_accuracy: 0.4210
Epoch 44/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5924 - accuracy: 0.4373 - val_loss: 1.6848 - val_accuracy: 0.4028
Epoch 45/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5880 - accuracy: 0.4372 - val_loss: 1.6880 - val_accuracy: 0.4045
Epoch 46/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5885 - accuracy: 0.4397 - val_loss: 1.6572 - val_accuracy: 0.4163
Epoch 47/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5855 - accuracy: 0.4414 - val_loss: 1.7324 - val_accuracy: 0.3965
Epoch 48/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5836 - accuracy: 0.4423 - val_loss: 1.6413 - val_accuracy: 0.4241
Epoch 49/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5826 - accuracy: 0.4403 - val_loss: 1.6916 - val_accuracy: 0.4049
Epoch 50/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5800 - accuracy: 0.4431 - val_loss: 1.6702 - val_accuracy: 0.4086
Epoch 51/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5818 - accuracy: 0.4405 - val_loss: 1.6729 - val_accuracy: 0.4097
Epoch 52/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5767 - accuracy: 0.4444 - val_loss: 1.6404 - val_accuracy: 0.4248
Epoch 53/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5744 - accuracy: 0.4432 - val_loss: 1.6615 - val_accuracy: 0.4171
Epoch 54/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5732 - accuracy: 0.4453 - val_loss: 1.7366 - val_accuracy: 0.3972
Epoch 55/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5729 - accuracy: 0.4442 - val_loss: 1.6550 - val_accuracy: 0.4165
Epoch 56/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5706 - accuracy: 0.4442 - val_loss: 1.6395 - val_accuracy: 0.4215
Epoch 57/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5655 - accuracy: 0.4466 - val_loss: 1.6454 - val_accuracy: 0.4219
Epoch 58/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5655 - accuracy: 0.4474 - val_loss: 1.6448 - val_accuracy: 0.4200
Epoch 59/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5666 - accuracy: 0.4445 - val_loss: 1.6504 - val_accuracy: 0.4188
Epoch 60/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5651 - accuracy: 0.4455 - val_loss: 1.6463 - val_accuracy: 0.4189
Epoch 61/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5643 - accuracy: 0.4469 - val_loss: 1.6407 - val_accuracy: 0.4224
Epoch 62/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5607 - accuracy: 0.4466 - val_loss: 1.6697 - val_accuracy: 0.4096
Epoch 63/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5611 - accuracy: 0.4474 - val_loss: 1.6581 - val_accuracy: 0.4156
Epoch 64/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5598 - accuracy: 0.4473 - val_loss: 1.6482 - val_accuracy: 0.4130
Epoch 65/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5585 - accuracy: 0.4483 - val_loss: 1.6588 - val_accuracy: 0.4205
Epoch 66/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5577 - accuracy: 0.4489 - val_loss: 1.6796 - val_accuracy: 0.4066
Epoch 67/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5544 - accuracy: 0.4512 - val_loss: 1.6509 - val_accuracy: 0.4217
Epoch 68/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5547 - accuracy: 0.4498 - val_loss: 1.6505 - val_accuracy: 0.4197
Epoch 69/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5555 - accuracy: 0.4510 - val_loss: 1.7310 - val_accuracy: 0.3914
Epoch 70/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5527 - accuracy: 0.4499 - val_loss: 1.6405 - val_accuracy: 0.4265: 0s - loss: 1.5515 - accuracy: 0.45
Epoch 71/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5505 - accuracy: 0.4540 - val_loss: 1.6756 - val_accuracy: 0.4107
Epoch 72/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5498 - accuracy: 0.4496 - val_loss: 1.6907 - val_accuracy: 0.4009
Epoch 73/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5502 - accuracy: 0.4529 - val_loss: 1.6474 - val_accuracy: 0.4215
Epoch 74/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5443 - accuracy: 0.4530 - val_loss: 1.6733 - val_accuracy: 0.4078
Epoch 75/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5444 - accuracy: 0.4526 - val_loss: 1.7008 - val_accuracy: 0.4076
Epoch 76/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5439 - accuracy: 0.4546 - val_loss: 1.6755 - val_accuracy: 0.4085
Epoch 77/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5445 - accuracy: 0.4513 - val_loss: 1.6860 - val_accuracy: 0.4138
Epoch 78/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5415 - accuracy: 0.4548 - val_loss: 1.6580 - val_accuracy: 0.4216
Epoch 79/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5412 - accuracy: 0.4539 - val_loss: 1.7150 - val_accuracy: 0.4012
Epoch 80/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5387 - accuracy: 0.4540 - val_loss: 1.6315 - val_accuracy: 0.4232
Epoch 81/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5399 - accuracy: 0.4532 - val_loss: 1.6219 - val_accuracy: 0.4313
Epoch 82/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5365 - accuracy: 0.4554 - val_loss: 1.6510 - val_accuracy: 0.4202
Epoch 83/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5360 - accuracy: 0.4579 - val_loss: 1.6385 - val_accuracy: 0.4272
Epoch 84/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5322 - accuracy: 0.4557 - val_loss: 1.6277 - val_accuracy: 0.4287
Epoch 85/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5322 - accuracy: 0.4577 - val_loss: 1.6287 - val_accuracy: 0.4293
Epoch 86/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5299 - accuracy: 0.4577 - val_loss: 1.7126 - val_accuracy: 0.3940
Epoch 87/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5281 - accuracy: 0.4566 - val_loss: 1.6612 - val_accuracy: 0.4183
Epoch 88/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5349 - accuracy: 0.4555 - val_loss: 1.6278 - val_accuracy: 0.4309
Epoch 89/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5315 - accuracy: 0.4564 - val_loss: 1.6538 - val_accuracy: 0.4143
Epoch 90/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5254 - accuracy: 0.4594 - val_loss: 1.6351 - val_accuracy: 0.4276
Epoch 91/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5286 - accuracy: 0.4594 - val_loss: 1.7053 - val_accuracy: 0.3965
Epoch 92/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5288 - accuracy: 0.4588 - val_loss: 1.6285 - val_accuracy: 0.4307
Epoch 93/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5248 - accuracy: 0.4590 - val_loss: 1.6103 - val_accuracy: 0.4328
Epoch 94/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5264 - accuracy: 0.4573 - val_loss: 1.6708 - val_accuracy: 0.4154
Epoch 95/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5217 - accuracy: 0.4607 - val_loss: 1.6210 - val_accuracy: 0.4276
Epoch 96/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5226 - accuracy: 0.4607 - val_loss: 1.6313 - val_accuracy: 0.4276
Epoch 97/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5212 - accuracy: 0.4604 - val_loss: 1.6116 - val_accuracy: 0.4355
Epoch 98/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5190 - accuracy: 0.4642 - val_loss: 1.6911 - val_accuracy: 0.4079
Epoch 99/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5177 - accuracy: 0.4629 - val_loss: 1.6215 - val_accuracy: 0.4302
Epoch 100/100
313/313 [==============================] - 1s 3ms/step - loss: 1.5171 - accuracy: 0.4623 - val_loss: 1.6339 - val_accuracy: 0.4274
In [34]:
history_dict = history.history
history_dict.keys()
Out[34]:
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
In [35]:
import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)
#epochs = range(1, len(loss) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()
In [36]:
plt.clf()   # clear figure
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

References and Slides