Image Generation Using Diffusion Models

Introduction

One of the important applications of Neural Networks has been image generation. Given a set of training samples ${X_1,…,X_N}$, the preferred approach to generating a new datapoint $X = (x_1,…,x_n)$ is to first estimate a distribution $p_\theta(X)$ that approximates the actual distribution $p(X)$ from which the training samples were drawn (using Unsupervised Learning), and then sample from $p_\theta(X)$ to generate a new $X$. There are several ways in which this program can be put into practice; Diffusion Models, the subject of this post, are currently among the most successful.

Diffusion Models were first described by Sohl-Dickstein et al., who laid down the theory behind the method. Further work in 2020 by Ho, Jain and Abbeel resulted in an improved algorithm called DDPM (Denoising Diffusion Probabilistic Model), while Y. Song et al. independently came up with an equivalent algorithm called Score Based Diffusion. The DDPM algorithm was further improved by J. Song et al. in an algorithm called DDIM (Denoising Diffusion Implicit Model) with a faster sampling technique, which serves as the method of choice at the present time.

Diffusion Models have been coupled with Large Language Models (LLMs) to create images from captions, the two best known models in this category being DALL-E-2 from OpenAI and Imagen from Google. A recent entry into this field is Stable Diffusion from researchers at the University of Munich (Rombach, Ommer et al.), which significantly reduces the computational load required to generate images. These systems are able to create images whose quality (and creativity) is superior to those from GANs, and as a result attention in Generative Modeling has shifted to Diffusion Models.

Our objective in this blog post is to give a rigorous description of Diffusion Models, including the underlying mathematical framework.

Overview

Figure 1 From CVPR 2022 Tutorial on Diffusion Models

A high level view of Diffusion Models is shown in Figure 1. A fully trained model generates images by taking a Latent Variable consisting of random noise ${\bf z}$, shown on the right of the figure, and gradually ‘de-noising’ it in multiple stages, as shown by the dotted arrows moving from right to left (typically 1000 stages in large networks such as DALL-E-2). This is done by feeding the output of stage $t$ into stage $t-1$, until we get to the final stage, resulting in the image ${\bf x_0}$. This is illustrated in Figure 2, with the middle row showing the de-noising process starting from random data at stage $T$, followed by the partially re-constructed image at stage ${T\over 2}$ and the final image at stage $0$.

Diffusion Models are trained by running the process just described in the opposite direction, i.e., from left to right in Figure 1, as shown by the solid arrows. An image ${\bf x_0}$ from the training dataset is fed into stage 1, resulting in image ${\bf x_1}$. This stage, as well as the following ones, adds a small amount of noise to the image, so it becomes progressively blurrier, until all that is left at stage $T$ is pure noise (this is illustrated in the top row of Figure 2). Since we know precisely how much noise is added in each stage, the DDPM algorithm works by training a Neural Network to predict the noise that was added, given the noisy image as input. This optimization problem is posed as the minimization of a simple regression (squared error) loss, which is much easier to solve than the minimax objective used in GANs.

Figure 2 From Sohl-Dickstein

Latent Variables and the ELBO Bound

Figure 3 From Stanford CS236

Given a set of image samples $(X_1,…,X_N)$ we would like to estimate their Probability Density Function $q(X)$, so that we can sample from it and thereby generate new images. However, in general $q(X)$ is a very complex function, and a simpler approach is to define a Latent Variable $Z$ that encodes semantically useful information about $X$, and then estimate $q(X\vert Z)$, with the hope that this is an easier function to estimate. The idea behind Latent Variables is illustrated in Figure 3: The LHS of the figure shows how new images can be generated in a two step process: first sample a Latent Variable $Z$, assuming that we know its distribution $q(Z)$, and then sample from $q(X\vert Z)$ to generate a new image. The RHS of the figure gives some intuition behind the concept of Latent Variables: it shows how the contents of an image of a human face are controlled by variables such as gender, hair color, etc., and if these variables are specified, then the face generation problem becomes simpler.

Figure 4 From Stanford CS236

For another perspective on the problem, consider Figure 4, which shows the data distribution $q(X)$, which in general can be a very complicated function (here acting as a stand-in for an image distribution). However, note that by using the Law of Total Probability, $q(X)$ can also be written as

\[q(X) = \int q(X,Z) dZ = \int q(X|Z) q(Z) dZ\]

If $q(X\vert Z)$ is a simpler function that can be approximated by a Gaussian $p_\theta(X\vert Z) = N(X; \mu_\theta(Z),\sigma_\theta(Z))$, then

\[q(X) \approx \int N(X; \mu_\theta(Z),\sigma_\theta(Z))\, q(Z) dZ\]

Thus a complex $q(X)$ can be built up from a number of these Gaussians superimposed together and weighted by the distribution $q(Z)$, as shown in Figure 4. This is somewhat analogous to approximating a complex function by its Fourier Transform coefficients, which act as weights for simple sinusoidal functions.
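To make this concrete, here is a small sketch (with made-up toy values, not taken from the text) of a complicated one-dimensional density built as a superposition of Gaussians weighted by a discrete latent distribution $q(Z)$:

```python
import numpy as np

# A complex 1-D density built as a mixture of simple Gaussians, one per
# value of a discrete latent variable Z, weighted by q(Z). The means,
# standard deviations and weights below are arbitrary toy values.
mu = np.array([-2.0, 0.5, 3.0])       # mu_theta(Z) for Z = 0, 1, 2
sigma = np.array([0.5, 1.0, 0.3])     # sigma_theta(Z)
q_z = np.array([0.3, 0.5, 0.2])       # q(Z), must sum to 1

def gaussian(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-5, 6, 1000)
# q(X) ~= sum_Z q(Z) * N(X; mu(Z), sigma(Z)^2)  -- a multi-modal density
q_x = sum(w * gaussian(x, m, s) for w, m, s in zip(q_z, mu, sigma))
```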

We now show how to obtain the approximations $p_\theta(X\vert Z)$ by solving an optimization problem.

The Evidence Lower Bound, or ELBO, is a useful technique from Bayesian Statistics that has found use in Neural Network modeling in recent years. It works by converting the problem of estimating an unknown probability distribution into an optimization problem, which can then be solved using Neural Networks.

Figure 5

Consider the following problem: Given two random variables $X$ and $Z$, the distributions $q(Z\vert X)$ and $q(Z)$ are known, and we are asked to invert the conditional probability and compute $q(X\vert Z)$. For example, $X$ may be an image and $Z$ its Latent Variable representation. In this case the problem can be formulated as: we know how to get the latent representation from the image, but we don’t know how to convert a latent representation back into an image.

The most straightforward way of inverting the conditional probability is by means of Bayes Theorem, which states

\[q(X|Z) = {q(Z|X)q(X)\over{\sum_U q(Z|U)q(U)}}\]

However this formula is very difficult to compute because the sum in the denominator is often intractable. In order to solve this problem using Neural Networks, we have to turn it into an optimization problem. This is done by means of a technique called ELBO (Evidence Lower Bound), also known as VLB (Variational Lower Bound), which works as follows: let’s model the unknown $q(X\vert Z)$ by a parametrized distribution $p_\theta(X\vert Z)$; together with the prior over $Z$ this defines a model joint distribution $p_\theta(X,Z)$, and hence a model posterior $p_\theta(Z\vert X)$. In order to find the best $p_\theta$, we can try to minimize the “distance” between the known $q(Z\vert X)$ and the model’s $p_\theta(Z\vert X)$. A measure of distance between probability distributions is the Kullback-Leibler Divergence or KL Divergence. The KL Divergence between probability distributions $f(X)$ and $g(X)$ is defined as:

\[D_{KL}(f(X), g(X)) = \sum f(X)\log{f(X)\over g(X)}\]

Substituting $f = q(Z\vert X)$ and $g = p_\theta(Z\vert X)$, and making use of the Law of Conditional Probability, it can be shown that

\[D_{KL}(q(Z|X), p_\theta(Z|X)) = \log p_\theta(X) - \sum_Z q(Z|X)\log{ p_\theta(X,Z)\over q(Z|X)}\]

Since the KL Divergence on the LHS is always non-negative, the quantity $\sum_Z q(Z\vert X)\log{ p_\theta(X,Z)\over q(Z\vert X)}$ is a lower bound on $\log p_\theta(X)$; hence instead of maximizing the intractable log likelihood directly, we maximize this lower bound (or equivalently, minimize its negative $\sum_Z q(Z\vert X)\log{q(Z\vert X)\over p_\theta(X,Z)}$). We will refer to this lower bound as the ELBO or VLB, i.e.,

\[ELBO = \sum_Z q(Z|X)\log{p_\theta(X,Z)\over q(Z|X)} \quad\quad\quad\quad (1)\]

Figure 6 From Stanford CS236

Since by definition $D_{KL}(q(Z\vert X), p_\theta(Z\vert X)) \ge 0$, it follows that \(\log p_\theta(X) \ge ELBO\), with the difference between the two being equal to the KL Divergence $D_{KL}(q(Z\vert X), p_\theta(Z\vert X))$, as illustrated in Figure 6. Maximizing the ELBO therefore simultaneously pushes up the model evidence $\log p_\theta(X)$ and pulls the model posterior $p_\theta(Z\vert X)$ towards $q(Z\vert X)$.

Note: In Diffusion Models the distribution $q(Z\vert X)$ is known (as explained in the next section); however, this is not the case in other Latent Variable models such as VAEs. There $q(Z\vert X)$ is also approximated by a parametrized distribution $q_\phi(Z\vert X)$, which is assumed to be Gaussian as well. Thus building a VAE model requires a joint optimization over the parameters $(\theta,\phi)$.

Hence even though we don’t have access to the true distribution $q(X)$, we can still fit the model to the data by maximizing the ELBO, with the parameters $\theta$ (or $(\theta,\phi)$) chosen so as to make the bound as tight as possible.

The ELBO thus serves as a convenient optimization objective that can be used to train a Neural Network. To make the problem more tractable, it is also assumed that the unknown distribution $p_\theta$ belongs to the family of Gaussian distributions $N(\mu_\theta,\sigma_\theta)$, which reduces the problem to that of estimating the mean $\mu_\theta$ and variance $\sigma_\theta$. These quantities can be estimated by a Deep Neural Network with parameters $\theta$, with the (negative) ELBO serving as the Loss Function used to optimize the network. This optimization procedure underlies Diffusion Models.

Forward Diffusion Process

Figure 7 From CVPR 2022 Tutorial on Diffusion Models

Note: We will use the notation $N(X; \mu,\Sigma)$ for a multivariate Gaussian (or Normal) Distribution over $X$ with mean vector $\mu = (\mu_1,…,\mu_N)$ and covariance matrix $\Sigma$ (here is a short introduction to multivariate Normal Distributions). For the special case when the covariance matrix is diagonal with a common variance $\sigma^2$, this reduces to $N(X; \mu,\sigma^2 I)$ where $I$ is the $N\times N$ identity matrix.

Note: Given a Gaussian Distribution $N(X;\mu,\sigma^2 I)$, it is possible to generate a sample from it by using the Re-Parametrization Trick, which states that a sample $X$ can be expressed as

\[X = \mu + \sigma \epsilon\]

where the random vector $\epsilon$ is distributed as per the Unit Gaussian Distribution $N(0,I)$.
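As a one-line illustration of the trick (the mean vector and standard deviation below are arbitrary toy values):

```python
import numpy as np

# Re-Parametrization Trick: a sample from N(mu, sigma^2 I) is obtained by
# scaling and shifting a standard Gaussian sample eps ~ N(0, I).
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])      # toy mean vector
sigma = 0.1                          # common standard deviation
eps = rng.standard_normal(mu.shape)  # eps ~ N(0, I)
x = mu + sigma * eps                 # x ~ N(mu, sigma^2 I)
```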

The setup used in Diffusion Models is shown in Figure 7. We start with an image $X_0 = (x_{01},…,x_{0N})$, shown on the left. Note that even though $X_0$ is a 3D tensor, for this discussion we will flatten it out into a 1D vector. $X_0$ is distributed according to some unknown distribution $q(X_0)$ to which we don’t have access.

We gradually add small amounts of Gaussian Noise to $X_0$ in $T$ steps (typically T = 1000 in large models), such that at the $t^{th}$ step the resulting conditional distribution is given by the Gaussian

\[q(X_t|X_{t-1}) = N(X_t; \sqrt{1-\beta_t} X_{t-1}, \beta_t I)\]

At this step the ‘signal’ $X_{t-1}$ in the image is suppressed by the factor $\sqrt{1-\beta_t}$, while Gaussian noise with variance $\beta_t$ is added to it, i.e., the image gets more and more blurry as $t$ increases. The sequence $\beta_t$ is chosen such that $\beta_1 < \beta_2 < … < \beta_T < 1$, i.e., the amount of noise being added increases monotonically; this sequence is known as the Variance Schedule. It can be chosen to be linear, quadratic, cosine, etc., or even learnt.

By using the Re-parametrization Trick, we can sample $X_t$ from the distribution $q(X_t\vert X_{t-1})$

\[X_t = \sqrt{1-\beta_t}X_{t-1} + \sqrt{\beta_t}\epsilon_{t-1} \quad\quad\quad (2)\]

where $\epsilon_{t-1}$ is sampled from the $N$-dimensional Gaussian distribution $N(0,I)$. From this equation it is clear that the sequence $X_t, 0\le t\le T$ forms a Markov Chain. This is also referred to as a Discrete Time Diffusion, and later we will see that if the total amount of noise added is kept fixed while the number of stages in the model is increased, then Equation (2) has a continuous time limit as $T\rightarrow\infty$, with $X_t$ converging to a continuous time Diffusion Process (in the index $t$) driven by a Wiener Process.

Since Equation (2) is a recursion in the $X_t$ sequence, it can be unrolled to express the $X_t$ sample directly in terms of the starting image vector $X_0$ as follows:

\[X_t = \sqrt{\gamma_t} X_0 + \sqrt{1 - \gamma_t}\epsilon \quad\quad\quad (3)\]

where $\epsilon$ is sampled from $N(0,I)$, and \(\gamma_t = \prod_{i=1}^t \alpha_i\) with $\alpha_t = 1 - \beta_t$. This equation allows us to obtain a sample of $X_t$ at the $t^{th}$ stage directly from the original $X_0$ sample. This property is very useful when training the Neural Network, since it allows us to optimize randomly chosen terms of the Loss Function, as we will see shortly.

Note that since the $\gamma_t$ sequence is a product of factors that are less than one, it follows that $\gamma_1 \ge\gamma_2\ge … \ge\gamma_T$. These sequences are chosen so that $\gamma_t\rightarrow 0$ as $t$ increases, and from equation (3) this implies that the distribution of $X_T$ approaches $N(0,I)$ which is also referred to as “white” noise. This is illustrated in Figure 8 which shows the distribution of a single pixel $x$, starting from the unknown distribution $q(x_0)$ on the left and ending up as the Standard Gaussian on the right.

Figure 8 From CVPR 2022 Tutorial on Diffusion Models
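For concreteness, here is a minimal sketch of the forward process, assuming a simple linear Variance Schedule; the names (`betas`, `gammas`, `q_sample`) are illustrative and not taken from any particular library:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # a linear Variance Schedule (assumed)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
gammas = np.cumprod(alphas)          # gamma_t = prod_{i<=t} alpha_i

def q_sample(x0, t, rng):
    """Sample X_t directly from X_0 using equation (3)."""
    eps = rng.standard_normal(x0.shape)                             # eps ~ N(0, I)
    return np.sqrt(gammas[t]) * x0 + np.sqrt(1.0 - gammas[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(3 * 32 * 32)    # a flattened toy "image"
x500 = q_sample(x0, t=500, rng=rng)      # heavily noised version of x0
print(gammas[-1])                        # close to 0, so X_T is close to N(0, I)
```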

Reverse Diffusion Process

Figure 9 From CVPR 2022 Tutorial on Diffusion Models

In order to generate an image starting from Gaussian noise, we need to reverse the process shown in Figure 7, i.e., we start with the white noise image $X_T$ sampled from $N(0,I)$ and then gradually “de-noise” it as we move from right to left, as shown in Figure 9. In order to do so we need knowledge of the reverse distribution $q(X_{t-1}\vert X_t)$, which in general is intractable to compute. However we can try to approximate the reverse distribution with a parametrized distribution $p_\theta$ such that $p_\theta(X_{t-1}\vert X_t)\approx q(X_{t-1}\vert X_t)$. We will further assume that $p_\theta(X_{t-1}\vert X_t)$ has a Gaussian distribution given by

\[p_\theta(X_{t-1}|X_t) = N(X_{t-1};\mu_\theta(X_t,t),\sigma_\theta (X_t,t)I)\quad\quad\quad (4)\]

where $\mu_\theta(X_t,t)$ and $\sigma_\theta(X_t,t)$ are complex and unknown functions parametrized by $\theta$. The Gaussian approximation to $q(X_{t-1}\vert X_t)$ works well for the case when the per stage added noise $\beta_t$ in the forward direction is sufficiently small.

If the distributions $p_\theta$ were known, then it would be possible to recover the distribution of the original image, as illustrated in Figure 10.

Figure 10 From CVPR 2022 Tutorial on Diffusion Models

Optimization Objective

Under the Gaussian assumption, the problem of finding the distributions $p_\theta(X_{t-1}\vert X_t)$ which best approximate $q(X_{t-1}\vert X_t)$ reduces to finding the parametrized functions $\mu_\theta$ and $\sigma_\theta$. In order to find the optimum set of parameters, we need a Loss Function whose minimization gives us these optimum parameters. Note that for a single stage this problem formulation falls within the purview of the ELBO technique. It can be extended across all $T$ stages in the model as follows: define the joint probabilities

\[p_\theta(X_{0:T}) = p(X_T)\prod_{t=1}^T p_\theta(X_{t-1}|X_t)\] \[q(X_{1:T}|X_0) = \prod_{t=1}^T q(X_t|X_{t-1})\]

Maximizing the ELBO of equation (1), with the Latent Variable taken to be $Z = X_{1:T}$ and $X = X_0$, is then equivalent to minimizing

\[L_{ELBO} = E_{q(X_{0:T})}\left[\log {q(X_{1:T}|X_0) \over{p_\theta(X_{0:T})}}\right]\]

After some algebraic manipulations and making use of the Law of Conditional Probability, this expression can be re-written as

\[L_{ELBO} = E_q\left[{\log{q(X_T|X_0)\over{p_\theta}(X_T)}} + \sum_{t=2}^T \log{q(X_{t-1}|X_t,X_0)\over {p_\theta(X_{t-1}|X_t)}} - \log p_\theta(X_0|X_1)\right]\]

which is the same as

\[L_{ELBO} = E_q\left[D_{KL}(q(X_T|X_0)||p_\theta(X_T) ) + \sum_{t=2}^T D_{KL}(q(X_{t-1}|X_t,X_0)||p_\theta(X_{t-1}|X_t) ) - \log p_\theta(X_0|X_1) \right]\]

Defining

\[L_T = D_{KL}[q(X_T|X_0)||p_\theta(X_T) ]\] \[L_t = D_{KL}\left[q(X_t|X_{t+1},X_0)||p_\theta(X_t|X_{t+1})\right]\quad 1\le t\le T-1\] \[L_0 = - \log p_\theta(X_0|X_1)\]

$L_{ELBO}$ can be written as

\[L_{ELBO} = L_T + L_{T-1} + ... +\ L_0\]

In this expression the $L_T$ term can be ignored for optimization purposes, since $q(X_T\vert X_0)$, being (approximately) equal to $N(0,I)$, is not a function of $\theta$, and neither is $p_\theta(X_T)$. Even though $q(X_{t-1}\vert X_t)$ is an intractable distribution, it can be shown that $q(X_{t-1}\vert X_t,X_0)$ is actually a Gaussian distribution, which makes the loss terms $L_t, 1\le t\le T-1$ computable. After some algebraic manipulations, the parameters of this distribution can be computed as follows:

\[q(X_{t-1}|X_t,X_0) = N(X_{t-1}; {\tilde\mu}(X_t,X_0),{\tilde\beta_t} I) \quad\quad\quad (5)\]

where

\[{\tilde\beta_t} = {1-\gamma_{t-1}\over{1-\gamma_t}}\beta_t\]

and

\[{\tilde\mu}(X_t,X_0) = {\sqrt{\alpha_t}(1-\gamma_{t-1})\over{1-\gamma_t }}X_t + {\sqrt{\gamma_{t-1}}\beta_t\over{1-\gamma_t }}X_0 \quad\quad\quad (6)\]
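Written out as code, equations (5) and (6) become the following sketch (reusing the `betas` and `gammas` arrays from the forward-process snippet above, with $\gamma_0$ taken to be 1):

```python
import numpy as np

def posterior_params(x_t, x_0, t, betas, gammas):
    """Mean and variance of q(X_{t-1} | X_t, X_0), equations (5) and (6)."""
    alpha_t = 1.0 - betas[t]
    gamma_t = gammas[t]
    gamma_prev = gammas[t - 1] if t > 0 else 1.0
    beta_tilde = (1.0 - gamma_prev) / (1.0 - gamma_t) * betas[t]           # eq (5)
    mu_tilde = (np.sqrt(alpha_t) * (1.0 - gamma_prev) / (1.0 - gamma_t) * x_t
                + np.sqrt(gamma_prev) * betas[t] / (1.0 - gamma_t) * x_0)  # eq (6)
    return mu_tilde, beta_tilde
```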

Note that we are trying to approximate $q(X_t\vert X_{t+1},X_0)$ by $p_\theta(X_t\vert X_{t+1})$ by minimizing \(L_t = D_{KL}[q(X_t\vert {X_{t+1},X_0})||p_\theta(X_t\vert X_{t+1})], 1\le t\le T-1\). Using the fact that $q(X_{t-1}\vert {X_t,X_0})$ has a Normal Distribution given by equation (5), and that $p_\theta(X_{t-1}\vert X_t)$ also has a Normal Distribution given by equation (4), and plugging them into the formula for the KL Divergence between two Gaussians (see the Wikipedia article on Multivariate Normal Distributions for the formula), results in the following:

\[L_t = E \left[{1\over{2||\Sigma_\theta(X_t,t)||^2}} ||\tilde\mu_t(X_t,X_0) - \mu_\theta(X_t,t)||^2 \right]\quad 1\le t\le T-1 \quad\quad\quad (7)\]

According to equation (7), we should design our Neural Network to learn $\mu_\theta$ while using $\tilde\mu_t$ as the ground truth.

However Ho et al. discovered that the system performance improves if the Neural Network is instead trained to learn the added noise $\epsilon_t$. This can be done by expressing $X_0$ in terms of $X_t$ and $\epsilon_t$ using equation (3), so that

\[X_0 = {1\over\sqrt{\gamma_t}}(X_t - \sqrt{1-\gamma_t}\epsilon_t)\quad\quad\quad (8)\]

Substituting this expression for $X_0$ into equation (6) results in

\[{\tilde\mu}(X_t,X_0) = {1\over\sqrt\alpha_t} \left[X_t - {\beta_t\over{\sqrt{1-\gamma_t}}}\epsilon_t\right]\]

Note that in this equation $\epsilon_t$ is the noise that is added to the image $X_0$ during training in order to get $X_t$. We will get an estimate $\epsilon_\theta(X_t,t)$ of $\epsilon_t$ using a Neural Network, and plugging this into the expression for ${\tilde\mu}$ above results in an estimate $\mu_\theta(X_t,t)$ given by

\[\mu_\theta(X_t,t) = {1\over\sqrt\alpha_t}\left[X_t - {\beta_t\over{\sqrt{1-\gamma_t}}}\epsilon_\theta(X_t,t)\right]\]

Substituting these equations back into (7), we get

\[L_t = E \left[{1\over{2||\Sigma_\theta(X_t,t)||^2}} ||{1\over\sqrt\alpha_t}(X_t - {\beta_t\over{\sqrt{1-\gamma_t}}}\epsilon_t) - {1\over\sqrt\alpha_t}(X_t - {\beta_t\over{\sqrt{1-\gamma_t}}}{\epsilon_\theta}(X_t,t))||^2\right]\] \[= E\left[{\beta_t^2\over{2\alpha_t(1-\gamma_t) ||\Sigma_\theta(X_t,t)||^2}} || \epsilon_t - {\epsilon_\theta}(X_t,t)||^2 \right]\] \[= E \left[{\beta_t^2\over{2\alpha_t(1-\gamma_t) ||\Sigma_\theta(X_t,t)||^2}} ||\epsilon_t - \epsilon_\theta(\sqrt\gamma_t X_0 + \sqrt{1-\gamma_t}\epsilon_t,t)||^2 \right]\]

Ho et al. also discovered that training the diffusion model is easier if the weighting term is ignored, so that the optimization problem becomes

\[L_t^{simple} = E\left[||\epsilon_t - \epsilon_\theta(\sqrt\gamma_t X_0 + \sqrt{1-\gamma_t}\epsilon_t,t)||^2 \right]\]

The DDPM Algorithm

We collect all the results from the previous sections and put them together in the DDPM algorithm described below.

Figure 11 From Ho, Jain, Abbeel

The pseudocode for the training procedure (Algorithm 1) is shown on the left in Figure 11, and proceeds as follows: an image $X_0$ is sampled from the training set, along with a stage index $t$ chosen uniformly from $\{1,…,T\}$ and a noise sample $\epsilon\sim N(0,I)$; a gradient descent step is then taken on $||\epsilon - \epsilon_\theta(\sqrt{\gamma_t} X_0 + \sqrt{1-\gamma_t}\epsilon, t)||^2$, and these steps are repeated until convergence.
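A minimal PyTorch-style sketch of this training step is given below; the network `model`, which maps a noisy image and a stage index to a predicted noise tensor, is a placeholder, and the linear schedule is an assumption:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear schedule (assumed)
gammas = torch.cumprod(1.0 - betas, dim=0)      # gamma_t

def training_loss(model, x0):
    """One evaluation of L^simple for a batch x0 of training images."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                              # random stage per sample
    eps = torch.randn_like(x0)                                 # eps ~ N(0, I)
    g = gammas[t].view(b, *([1] * (x0.dim() - 1)))             # broadcast gamma_t
    x_t = torch.sqrt(g) * x0 + torch.sqrt(1.0 - g) * eps       # equation (3)
    eps_pred = model(x_t, t)                                   # predicted noise
    return torch.nn.functional.mse_loss(eps_pred, eps)         # ||eps - eps_theta||^2
```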

Once we have a trained model, we can use it to generate new images by using the procedure outlined in Algorithm 2. At the first step we start with a Gaussian noise sample $X_T$, and then gradually de-noise it in steps $X_{T-1}, X_{T-2},…,X_1$ until we get to the final image $X_0$. The de-noising is carried out by sampling from the Gaussian Distribution $N(\mu_\theta(X_t,t),\beta_t I)$, so that

\[X_{t-1} = \mu_\theta(X_t,t) + \sqrt{\beta_t} \epsilon\]

$\mu_\theta(X_t,t)$ is computed by running the model to estimate $\epsilon_\theta$ and then using the following equation to get $\mu_\theta$

\[\mu_\theta(X_t,t) = {1\over\sqrt\alpha_t}\left[X_t - {\beta_t\over{\sqrt{1-\gamma_t}}}\epsilon_\theta(X_t,t)\right]\]

Note that images are generated in a probabilistic manner starting from the initial Gaussian noise $X_T$, so that the same noise sample can generate different images on successive runs (which accounts for the P in DDPM).
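Putting the last two equations together, here is a sketch of the sampling loop, reusing `model`, `betas`, `gammas` and `T` from the training sketch above (as in the paper's Algorithm 2, no noise is added at the very last step):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape):
    """Generate images by reversing the diffusion, one stage at a time."""
    alphas = 1.0 - betas
    x = torch.randn(shape)                                    # X_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_pred = model(x, torch.full((shape[0],), t))       # estimate of the added noise
        mu = (x - betas[t] / torch.sqrt(1.0 - gammas[t]) * eps_pred) / torch.sqrt(alphas[t])
        if t > 0:
            x = mu + torch.sqrt(betas[t]) * torch.randn_like(x)   # sample X_{t-1}
        else:
            x = mu                                            # no noise at the last step
    return x                                                  # X_0
```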

The Neural Network

Figure 12 From Ronneberger et al.

A Neural Network appropriate for this application is one that takes an input image and converts it into another image of the same dimensions at the output. The class of networks that do this are called Auto-Encoders, and the one chosen for the job is the U-net (Ronneberger et al., see Figure 12). A short description of this network follows: it consists of a contracting (encoder) path that progressively downsamples the input through a series of convolutional layers, followed by an expanding (decoder) path that upsamples it back to the original resolution, with skip connections between encoder and decoder layers of the same resolution. In Diffusion Models the stage index $t$ is also supplied to the network, typically through an embedding that is added to the feature maps at each resolution level.

Another implementation decision is the Variance Schedule used for adding noise in the Diffusion stages, i.e., the sequence $\beta_t, 1\le t\le T$. The amount of noise added is progressively increased with $t$, so that $\beta_1 < \beta_2 < … < \beta_T$. In the original paper the authors employed a linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. Later it was discovered that a cosine schedule works better (see Nichol and Dhariwal); a sketch of both schedules is given below.

Figure 13 From CVPR 2022 Tutorial on Diffusion Models
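For reference, here is a sketch of both schedules; the linear values follow the ones quoted above, while the cosine form follows the Nichol and Dhariwal proposal (reconstructed from memory, so treat the details as approximate):

```python
import numpy as np

def linear_beta_schedule(T, beta_1=1e-4, beta_T=0.02):
    """The linear schedule used in the original DDPM paper."""
    return np.linspace(beta_1, beta_T, T)

def cosine_beta_schedule(T, s=0.008):
    """Cosine schedule: define gamma_t directly, then recover beta_t."""
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    gammas = f / f[0]                          # gamma_t, decreasing from 1 towards 0
    betas = 1.0 - gammas[1:] / gammas[:-1]     # beta_t = 1 - gamma_t / gamma_{t-1}
    return np.clip(betas, 0.0, 0.999)          # clip to avoid beta_T ~ 1
```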

Speeding Up Diffusion Models: The DDIM Algorithm

One of the issues with the DDPM algorithm is that the inference (or generation) step takes a long time, due to the 1000 or so times that the model has to be run in order to generate a single image. For example, it takes about 20 hours to sample 50K images of size $32\times 32$ from a DDPM on an Nvidia 2080 Ti GPU. The large number of steps is required because the backward process tries to approximate the reverse of the forward Gaussian diffusion process, and this Gaussian approximation is accurate only if small amounts of noise are introduced at each step.

Song, Meng, and Ermon came up with an ingenious way to speed up these models. They made the observation that DDPM does not really make use of the Markovian nature of the forward diffusion process anywhere in the algorithm, so if the forward diffusion were to be replaced by a non-Markovian process, then the number of steps in the backward process could be reduced, thus speeding up the algorithm. This resulted in an algorithm they called the Denoising Diffusion Implicit Model or DDIM, which indeed works much faster.

Song et al. used a slightly different parametrization of the forward diffusion process, which is described next. Note that in this section $\alpha_t$ plays the role of the cumulative product $\gamma_t$ from the previous sections. The forward recursion in DDPM is then written as

\[q(X_t|X_{t-1}) = N\left(\sqrt{\alpha_t\over\alpha_{t-1}}X_{t-1}, \left(1-{\alpha_t\over\alpha_{t-1}}\right) I\right)\]

which results in

\[X_t = \sqrt{\alpha_t}X_0 + \sqrt{1-\alpha_t}\epsilon\]

so that the convergence of $X_T$ to white Gaussian noise happens if $\alpha_T\rightarrow 0$. The DDPM objective, written with a general set of positive weights $\gamma_t$ (not to be confused with the cumulative product defined earlier; the simplified objective corresponds to all weights equal to one), is given by

\[L_\gamma(\epsilon_\theta) = \sum_{t=1}^T \gamma_t E_q\left[||\epsilon_\theta^{(t)}(\sqrt{\alpha_t}X_0 + \sqrt{1-\alpha_t}\epsilon_t) - \epsilon_t||^2 \right]\]

Song et al. made the following key observation: the objective $L_\gamma$ depends on the forward process $q$ only through the marginals $q(X_t\vert X_0)$, and not on the full joint distribution $q(X_{1:T}\vert X_0)$; hence any inference process with the same marginals leads to the same objective.

Accordingly, Song et al. came up with a non-Markovian inference process that has the same marginals, and further demonstrated that this process leads to the same surrogate objective function as DDPM. This means that trained DDPM models can be re-used as-is, with the DDIM algorithm employed only for generation, at a much faster execution time.

Consider the family Q of inference distributions indexed by a real vector $\sigma \in R^T_{\ge 0}$:

\[q_\sigma(X_{1:T}|X_0) = q_\sigma(X_T|X_0)\prod_{t=2}^T q_\sigma(X_{t-1}|X_t, X_0)\]

where $q_\sigma(X_T\vert X_0) = N(\sqrt{\alpha_T}X_0, (1-\alpha_T) I)$ and for all $t>1$

\[q_\sigma(X_{t-1}|X_t,X_0) = N(\sqrt{\alpha_{t-1}}X_0 + \sqrt{1-\alpha_{t-1}-{\sigma_t}^2}. {X_t - \sqrt{\alpha_t}X_0\over{\sqrt{1-\alpha_t}}}, {\sigma_t}^2 I) \quad\quad\quad (9)\]

Using this formula, it can be shown that $q_\sigma(X_t\vert X_0) = N(\sqrt{\alpha_t}X_0, (1-\alpha_t) I), \forall t$, so the marginals of this inference process and of the DDPM forward process match. The forward process itself can be computed by using the formula:

\[q_\sigma(X_t|X_{t-1},X_0) = {q_\sigma(X_{t-1}|X_t,X_0)q_\sigma(X_t|X_0)\over{q_\sigma(X_{t-1}|X_0)}}\]

This is again a Gaussian distribution; however, note that the forward process is no longer Markovian, since $X_t$ depends on $X_0$ in addition to $X_{t-1}$.

Let $p_\theta(X_{0:T})$ be the Generative process, in which ${p_\theta}^{(t)}(X_{t-1}\vert X_t)$ is defined using $q_\sigma(X_{t-1}\vert X_t,{\hat X_0})$, where ${\hat X_0}$ is the model’s prediction of $X_0$. This is done in a 2-step process:

\[{\hat X_0} = {( X_t - \sqrt{1-\alpha_t}.{\epsilon_\theta}^{(t)}(X_t))\over{\sqrt{\alpha_t}}}\] \[{p_\theta}^{(t)}(X_{t-1}|X_t) = q_\sigma(X_{t-1}|X_t,{\hat X_0})\]

Note that changing the parameter $\sigma$ results in different generative processes, while the same DDPM based training process is used throughout. Two special choices of $\sigma$ are the following: if

\(\sigma_t = \sqrt{ {1-\alpha_{t-1}\over{1-\alpha_t}} }\sqrt{ 1-{\alpha_t\over\alpha_{t-1}} }, \quad\forall t\)

then the forward process becomes Markovian and the generative process reduces to DDPM, while if $\sigma_t = 0$ for all $t$ then the generative process becomes deterministic given $X_T$, which is the case referred to as DDIM.

DDIM Accelerated Sampling Process

Figure 14 From CVPR 2022 Tutorial on Diffusion Models

Song et al. made use of the fact that the $X_t$ sequence need not be Markov, and proposed the following Accelerated Sampling Process, which is illustrated in Figure 14:

As Part (a) of the figure shows, we can still use the same training procedure as DDPM, since DDPM and DDIM share the same Loss Function.

The Generative Procedure however is different. As shown in Part (b) of the figure, instead of running the generation over all $T$ steps, DDIM chooses an increasing sub-sequence $\tau=[\tau_1,\tau_2,…,\tau_S]$ of the forward sequence $[1,…,T]$, of length $S$ with $\tau_S = T$, and then runs the generation only over this sub-sequence (shown using the blue arrows in the figure, assuming that $2$ and $4$ belong to $\tau$). Clearly $X(\tau_1),…,X(\tau_S)$ don’t form a Markov Chain, but the same machinery goes through as long as the forward process $q_\sigma(X_{\tau_{i-1}}\vert X_{\tau_i},X_0)$ satisfies equation (9), i.e.,

\[q_\sigma(X_{\tau_{i-1}}|X_{\tau_i},X_0) = N(\sqrt{\alpha_{\tau_{i-1}}}X_0 + \sqrt{1-\alpha_{\tau_{i-1}}-{\sigma_{\tau_i}}^2}. {X_{\tau_i} - \sqrt{\alpha_{\tau_i}}X_0\over{\sqrt{1-\alpha_{\tau_i}}}}, {\sigma_{\tau_i}}^2 I)\]

where the coefficients $\sigma$ are chosen so that

\[q_{\sigma,\tau}(X_{\tau_i}|X_0) = N\left[\sqrt{\alpha_{\tau_i}}X_0, (1 - \alpha_{\tau_i}) I\right],\quad \forall i\in \{1,…,S\}\]

The reverse process conditionals are given by

\[{p_\theta}^{(\tau_i)}({X_{\tau_{i-1}}|X_{\tau_i}}) = q_{\sigma,\tau}(X_{\tau_{i-1}}|X_{\tau_i},{\hat X_0})\]

where

\[{\hat X_0} = {( X_{\tau_i} - \sqrt{1-\alpha_{\tau_i}}.{\epsilon_\theta}^{(\tau_i)}(X_{\tau_i}))\over{\sqrt{\alpha_{\tau_i}}}}\]

The generative process now samples latent variables according to the reversed $\tau$, which is referred to as the sampling trajectory. If the length $S$ of the sampling trajectory is much smaller than $T$, this leads to significant gains in computational efficiency. DDIM is able to produce samples of quality comparable to 1000 step models within 20 to 100 steps. Indeed, Song et al. showed that images generated with only 20 steps are already very similar to the ones generated with 1000 steps in terms of high level features, with only minor differences in the details.
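Here is a sketch of the accelerated sampler for the deterministic case $\sigma = 0$, reusing `model`, `gammas` and `T` from the DDPM sketches above (recall that the $\alpha_t$ of this section corresponds to the cumulative product $\gamma_t$ of the earlier sections):

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, S=50):
    """DDIM generation over a length-S sub-sequence tau of [1, ..., T], with sigma = 0."""
    tau = torch.linspace(0, T - 1, S).long()              # sub-sequence of stage indices
    x = torch.randn(shape)                                # start from N(0, I)
    for i in reversed(range(S)):
        t = int(tau[i])
        g_t = gammas[t]
        g_prev = gammas[int(tau[i - 1])] if i > 0 else torch.tensor(1.0)
        eps_pred = model(x, torch.full((shape[0],), t))
        x0_hat = (x - torch.sqrt(1.0 - g_t) * eps_pred) / torch.sqrt(g_t)   # predicted X_0
        # sigma = 0: the update below is deterministic given X_T
        x = torch.sqrt(g_prev) * x0_hat + torch.sqrt(1.0 - g_prev) * eps_pred
    return x
```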

The Latent Diffusion Model (LDM)

Figure 15 From Ho, Jain, Abbeel

Ho et al. made the following interesting observation in their DDPM paper: during the inference process, the progress from $X_T$ towards $X_0$ can be monitored in two ways: in terms of the Rate, i.e., the amount of information about the final image that has been received by stage $t$, and in terms of the Distortion \({ \sqrt{ \left\vert X_0 - {\hat X_0}\right\vert^2}\over D }\), where ${\hat X_0}$ is the prediction that the system makes for $X_0$, given $X_t$. The Distortion gives us some idea of how the predicted image quality changes with $t$ as the inference proceeds, and is graphed on the LHS of Figure 15, alongside the corresponding Rate curve.

The curve on the left shows that there is an initial very steep decrease in the Distortion for small values of $T - t$, followed by a linear decrease. The Rate curve on the other hand shows that little information is passed on to ${\hat X_0}$, until the very end of the reverse process, at which point there is a very steep increase in Rate.

By definition, Distortion is a measure of the difference between the pixel content of $X_0$ and ${\hat X_0}$. For example, if two images have the same shapes but are textured differently, then they will have a large distortion (another way of putting this is that Distortion is sensitive to differences in the high frequency content of the two images). Rate, on the other hand, is a measure of the amount of information transmitted about the image, most of which goes into the texture and fine details rather than the semantic content, i.e., the specific objects in the image and how they relate to each other. Taken together, the curves show that the initial stages of the reconstruction process establish the semantic content of the image, while the later, high Rate, stages fill in the texture and finer details.

Figure 16 From Rombach, Ommer et al.

This is further illustrated in Figure 16, which plots Distortion vs Rate and shows that the inference process can be divided into two parts: Part 1 is referred to as Semantic Compression, and leads to a steep decrease in Distortion for a small expenditure of Rate, while Part 2 is referred to as Perceptual Compression, and consumes most of the Rate while mainly refining the perceptual (high frequency) details of the re-constructed image. Rombach, Ommer et al. made the observation that Diffusion Models excel at generating the semantic content of an image, but spend a lot of their processing on the generation of this perceptual content. This leads to the idea of breaking up the image generation process into two parts: a conventionally trained Autoencoder handles the perceptual part, while the Diffusion Model operates in the Autoencoder's latent space and handles the semantic part. The result is a model called the Latent Diffusion Model or LDM (see Figure 17).

Given an input image $X\in R^{H\times W\times C}$ in RGB space, the Encoder encodes $X$ into the latent representation $Z\in R^{h\times w\times c}$, where $f={H\over h} = {W\over w}$ is the downsampling factor. Rombach, Ommer et al. found that $f=4$ to $16$ strikes a good balance between efficiency and perceptually faithful results.

The LDM Model has been shown to produce images that are comparable in quality to those from DDPM/DDIM models, while using much less computational resources. As a result it forms the core of the Stable Diffusion model that was recently released to the general public (see Stable Diffusion).

Figure 17 From Rombach, Ommer et al.

Conditional Diffusion Models

Figure 18 From Rombach, Ommer et al.

The LDM model in Figure 17 also incorporates support for conditional image generation, where the conditioning input could be a piece of text, another image, etc. Considering the case when text is used as the input, the system works as follows: the text prompt is first mapped to an encoding $\tau_\theta(Y)\in R^{M\times d_\tau}$ by a text encoder (typically a Transformer), and this encoding is then injected into the intermediate layers of the U-net by means of Cross Attention.

The Cross Attention is implemented by using the flattened “image” tensor of shape $h.w\times d.n_h$ to generate the Query, and using the text tensor of shape $M\times d_\tau$ to generate the Key and Value.
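As a minimal sketch of this step (the shapes follow the notation above, while the projection dimension `d` and the use of `torch.nn.Linear` layers are illustrative choices rather than the exact LDM implementation):

```python
import torch

h, w, c = 32, 32, 4            # latent spatial grid and channel count (toy values)
M, d_tau = 77, 768             # number of text tokens and their embedding size (toy values)
d = 64                         # attention dimension

z = torch.randn(1, h * w, c)          # flattened "image" tensor, shape (B, h.w, c)
text = torch.randn(1, M, d_tau)       # text encoding, shape (B, M, d_tau)

W_q = torch.nn.Linear(c, d)           # Query from the image features
W_k = torch.nn.Linear(d_tau, d)       # Key from the text encoding
W_v = torch.nn.Linear(d_tau, d)       # Value from the text encoding

Q, K, V = W_q(z), W_k(text), W_v(text)
attn = torch.softmax(Q @ K.transpose(1, 2) / d ** 0.5, dim=-1)   # (B, h.w, M)
out = attn @ V                                                   # (B, h.w, d), fed back into the U-net
```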