Diffusion model

Deep learning algorithm

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure.[1] The goal of diffusion models is to learn a diffusion process that generates a probability distribution for a given dataset from which we can then sample new images. They learn the latent structure of a dataset by modeling the way in which data points diffuse through their latent space.[2]

In the case of computer vision, diffusion models can be applied to a variety of tasks, including image denoising, inpainting, super-resolution, and image generation. This typically involves training a neural network to sequentially denoise images blurred with Gaussian noise.[2][3] The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise for the network to iteratively denoise. Announced on 13 April 2022, OpenAI's text-to-image model DALL-E 2 is an example that uses diffusion models for both the model's prior (which produces an image embedding given a text caption) and the decoder that generates the final image.[4] Diffusion models have recently found applications in natural language processing (NLP),[5] particularly in areas like text generation[6][7] and summarization.[8]

Diffusion models are typically formulated as Markov chains and trained using variational inference.[9] Examples of generic diffusion modeling frameworks used in computer vision are denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.[10]

Denoising diffusion model

Non-equilibrium thermodynamics

Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution. The method used techniques from non-equilibrium thermodynamics, especially diffusion.[11]

Consider, for example, how one might model the distribution of all naturally occurring photos. Each image is a point in the space of all images, and the distribution of naturally occurring photos is a "cloud" in that space. Repeatedly adding noise to the images makes the cloud diffuse out to the rest of the image space, until it becomes all but indistinguishable from a Gaussian distribution $N(0,I)$. A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution.

The equilibrium distribution is the Gaussian distribution $N(0,I)$, with pdf $\rho(x)\propto e^{-\frac{1}{2}\|x\|^{2}}$. This is just the Maxwell–Boltzmann distribution of particles in a potential well $V(x)=\frac{1}{2}\|x\|^{2}$ at temperature 1. The initial distribution, being very much out of equilibrium, diffuses towards the equilibrium distribution by making biased random steps that are a sum of pure randomness (like a Brownian walker) and gradient descent down the potential well. The randomness is necessary: if the particles underwent only gradient descent, they would all fall to the origin, collapsing the distribution.

Denoising Diffusion Probabilistic Model (DDPM)

A 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by using variational inference.[9]

Forward diffusion

To present the model, we need some notation.

  • $\beta_{1},\dots,\beta_{T}\in(0,1)$ are fixed constants.
  • $\alpha_{t}:=1-\beta_{t}$
  • $\bar{\alpha}_{t}:=\alpha_{1}\cdots\alpha_{t}$
  • $\tilde{\beta}_{t}:=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$
  • $\tilde{\mu}_{t}(x_{t},x_{0}):=\frac{\sqrt{\alpha_{t}}(1-\bar{\alpha}_{t-1})x_{t}+\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_{t})x_{0}}{1-\bar{\alpha}_{t}}$
  • $N(\mu,\Sigma)$ is the normal distribution with mean $\mu$ and variance $\Sigma$, and $N(x|\mu,\Sigma)$ is the probability density at $x$.
  • A vertical bar denotes conditioning.

A forward diffusion process starts at some starting point $x_{0}\sim q$, where $q$ is the probability distribution to be learned, then repeatedly adds noise to it by

$$x_{t}=\sqrt{1-\beta_{t}}\,x_{t-1}+\sqrt{\beta_{t}}\,z_{t}$$
where $z_{1},\dots,z_{T}$ are IID samples from $N(0,I)$. This is designed so that for any starting distribution of $x_{0}$, the distribution of $x_{t}|x_{0}$ converges to $N(0,I)$ as $t$ grows.

The entire diffusion process then satisfies

$$q(x_{0:T})=q(x_{0})q(x_{1}|x_{0})\cdots q(x_{T}|x_{T-1})=q(x_{0})N(x_{1}|\sqrt{\alpha_{1}}x_{0},\beta_{1}I)\cdots N(x_{T}|\sqrt{\alpha_{T}}x_{T-1},\beta_{T}I)$$
or
$$\ln q(x_{0:T})=\ln q(x_{0})-\sum_{t=1}^{T}\frac{1}{2\beta_{t}}\|x_{t}-\sqrt{1-\beta_{t}}x_{t-1}\|^{2}+C$$
where $C$ is a normalization constant and often omitted. In particular, we note that $x_{1:T}|x_{0}$ is a Gaussian process, which affords us considerable freedom in reparameterization. For example, by standard manipulation of Gaussian processes,
$$x_{t}|x_{0}\sim N\left(\sqrt{\bar{\alpha}_{t}}x_{0},(1-\bar{\alpha}_{t})I\right)$$
$$x_{t-1}|x_{t},x_{0}\sim N(\tilde{\mu}_{t}(x_{t},x_{0}),\tilde{\beta}_{t}I)$$
In particular, notice that for large $t$, the variable $x_{t}|x_{0}\sim N\left(\sqrt{\bar{\alpha}_{t}}x_{0},(1-\bar{\alpha}_{t})I\right)$ converges to $N(0,I)$. That is, after a long enough diffusion process, we end up with some $x_{T}$ whose distribution is very close to $N(0,I)$, with all traces of the original $x_{0}\sim q$ gone.

For example, since

$$x_{t}|x_{0}\sim N\left(\sqrt{\bar{\alpha}_{t}}x_{0},(1-\bar{\alpha}_{t})I\right)$$
we can sample $x_{t}|x_{0}$ directly "in one step", instead of going through all the intermediate steps $x_{1},x_{2},\dots,x_{t-1}$.
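The one-step sampling of $x_{t}|x_{0}$ can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the linear schedule from $10^{-4}$ to $0.02$ with $T=1000$ is an assumed choice, and the helper names `linear_betas`, `alpha_bars` and `q_sample` are hypothetical.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced beta_1..beta_T (an assumed, illustrative schedule)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = (1 - beta_1) ... (1 - beta_t)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t | x_0 in one step; t is 1-indexed, x0 is a list of floats."""
    a = abar[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in x0]

betas = linear_betas(1000)
abar = alpha_bars(betas)
rng = random.Random(0)
x0 = [1.0, -2.0, 0.5]
x_mid = q_sample(x0, 500, abar, rng)   # partially noised
x_end = q_sample(x0, 1000, abar, rng)  # almost pure noise
```

With this schedule $\bar{\alpha}_{T}$ is of order $10^{-5}$, so `x_end` retains almost nothing of `x0`.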

Derivation by reparameterization

We know $x_{t-1}|x_{0}$ is a Gaussian, and $x_{t}|x_{t-1}$ is another Gaussian. We also know that these are independent. Thus we can perform a reparameterization:

$$x_{t-1}=\sqrt{\bar{\alpha}_{t-1}}x_{0}+\sqrt{1-\bar{\alpha}_{t-1}}z$$
$$x_{t}=\sqrt{\alpha_{t}}x_{t-1}+\sqrt{1-\alpha_{t}}z'$$
where $z,z'$ are IID Gaussians.

There are 5 variables $x_{0},x_{t-1},x_{t},z,z'$ and two linear equations. The two sources of randomness are $z,z'$, which can be reparameterized by rotation, since the IID Gaussian distribution is rotationally symmetric.

By plugging in the equations, we can solve for the first reparameterization:

$$x_{t}=\sqrt{\bar{\alpha}_{t}}x_{0}+\underbrace{\sqrt{\alpha_{t}-\bar{\alpha}_{t}}z+\sqrt{1-\alpha_{t}}z'}_{=\sqrt{1-\bar{\alpha}_{t}}z''}$$
where $z''$ is a Gaussian with mean zero and variance one.

To find the second one, we complete the rotation matrix:

$$\begin{bmatrix}z''\\z'''\end{bmatrix}=\begin{bmatrix}\frac{\sqrt{\alpha_{t}-\bar{\alpha}_{t}}}{\sqrt{1-\bar{\alpha}_{t}}}&\frac{\sqrt{\beta_{t}}}{\sqrt{1-\bar{\alpha}_{t}}}\\?&?\end{bmatrix}\begin{bmatrix}z\\z'\end{bmatrix}$$

Since rotation matrices are all of the form $\begin{bmatrix}\cos\theta&\sin\theta\\-\sin\theta&\cos\theta\end{bmatrix}$, we know the matrix must be

$$\begin{bmatrix}z''\\z'''\end{bmatrix}=\begin{bmatrix}\frac{\sqrt{\alpha_{t}-\bar{\alpha}_{t}}}{\sqrt{1-\bar{\alpha}_{t}}}&\frac{\sqrt{\beta_{t}}}{\sqrt{1-\bar{\alpha}_{t}}}\\-\frac{\sqrt{\beta_{t}}}{\sqrt{1-\bar{\alpha}_{t}}}&\frac{\sqrt{\alpha_{t}-\bar{\alpha}_{t}}}{\sqrt{1-\bar{\alpha}_{t}}}\end{bmatrix}\begin{bmatrix}z\\z'\end{bmatrix}$$
and since the inverse of a rotation matrix is its transpose,
$$\begin{bmatrix}z\\z'\end{bmatrix}=\begin{bmatrix}\frac{\sqrt{\alpha_{t}-\bar{\alpha}_{t}}}{\sqrt{1-\bar{\alpha}_{t}}}&-\frac{\sqrt{\beta_{t}}}{\sqrt{1-\bar{\alpha}_{t}}}\\\frac{\sqrt{\beta_{t}}}{\sqrt{1-\bar{\alpha}_{t}}}&\frac{\sqrt{\alpha_{t}-\bar{\alpha}_{t}}}{\sqrt{1-\bar{\alpha}_{t}}}\end{bmatrix}\begin{bmatrix}z''\\z'''\end{bmatrix}$$

Plugging back, and simplifying, we have

$$x_{t}=\sqrt{\bar{\alpha}_{t}}x_{0}+\sqrt{1-\bar{\alpha}_{t}}z''$$
$$x_{t-1}=\tilde{\mu}_{t}(x_{t},x_{0})-\sqrt{\tilde{\beta}_{t}}z'''$$
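As a sanity check on the rotation argument, the following sketch compares, for scalar inputs, the mean and variance of $x_{t-1}$ obtained by substituting $z=\cos\theta\,z''-\sin\theta\,z'''$ with the closed forms $\tilde{\mu}_{t}$ and $\tilde{\beta}_{t}$. The function names and the numerical values are illustrative assumptions.

```python
import math

def posterior_from_rotation(x0, xt, alpha_t, abar_prev):
    """Mean and variance of x_{t-1} given (x_t, x_0) via the rotation."""
    abar_t = alpha_t * abar_prev
    beta_t = 1.0 - alpha_t
    # z'' is fixed once x_t and x_0 are known
    z2 = (xt - math.sqrt(abar_t) * x0) / math.sqrt(1.0 - abar_t)
    cos_th = math.sqrt(alpha_t - abar_t) / math.sqrt(1.0 - abar_t)
    sin_th = math.sqrt(beta_t) / math.sqrt(1.0 - abar_t)
    # x_{t-1} = sqrt(abar_prev) x0 + sqrt(1 - abar_prev) (cos z'' - sin z''')
    mean = math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * cos_th * z2
    std = math.sqrt(1.0 - abar_prev) * sin_th  # coefficient of z'''
    return mean, std * std

def posterior_closed_form(x0, xt, alpha_t, abar_prev):
    """mu_tilde_t(x_t, x_0) and beta_tilde_t from the notation list."""
    abar_t = alpha_t * abar_prev
    mean = (math.sqrt(alpha_t) * (1.0 - abar_prev) * xt
            + math.sqrt(abar_prev) * (1.0 - alpha_t) * x0) / (1.0 - abar_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * (1.0 - alpha_t)
    return mean, var

m_rot, v_rot = posterior_from_rotation(1.3, -0.7, 0.98, 0.5)
m_cf, v_cf = posterior_closed_form(1.3, -0.7, 0.98, 0.5)
```

The two pairs agree up to floating-point rounding, which is what the algebraic simplification claims.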

Backward diffusion

The key idea of DDPM is to use a neural network parametrized by $\theta$. The network takes in two arguments $x_{t},t$, and outputs a vector $\mu_{\theta}(x_{t},t)$ and a matrix $\Sigma_{\theta}(x_{t},t)$, such that each step in the forward diffusion process can be approximately undone by $x_{t-1}\sim N(\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t))$. This then gives us a backward diffusion process $p_{\theta}$ defined by

$$p_{\theta}(x_{T})=N(x_{T}|0,I)$$
$$p_{\theta}(x_{t-1}|x_{t})=N(x_{t-1}|\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t))$$
The goal now is to learn the parameters such that $p_{\theta}(x_{0})$ is as close to $q(x_{0})$ as possible. To do that, we use maximum likelihood estimation with variational inference.

Variational inference

The ELBO inequality states that $\ln p_{\theta}(x_{0})\geq E_{x_{1:T}\sim q(\cdot|x_{0})}[\ln p_{\theta}(x_{0:T})-\ln q(x_{1:T}|x_{0})]$, and taking one more expectation, we get

$$E_{x_{0}\sim q}[\ln p_{\theta}(x_{0})]\geq E_{x_{0:T}\sim q}[\ln p_{\theta}(x_{0:T})-\ln q(x_{1:T}|x_{0})]$$
We see that maximizing the quantity on the right would give us a lower bound on the likelihood of observed data. This allows us to perform variational inference.

Define the loss function

$$L(\theta):=-E_{x_{0:T}\sim q}[\ln p_{\theta}(x_{0:T})-\ln q(x_{1:T}|x_{0})]$$
and now the goal is to minimize the loss by stochastic gradient descent. The expression may be simplified to[12]
$$L(\theta)=\sum_{t=1}^{T}E_{x_{t-1},x_{t}\sim q}[-\ln p_{\theta}(x_{t-1}|x_{t})]+E_{x_{0}\sim q}[D_{KL}(q(x_{T}|x_{0})\|p_{\theta}(x_{T}))]+C$$
where $C$ does not depend on the parameter, and thus can be ignored. Since $p_{\theta}(x_{T})=N(x_{T}|0,I)$ also does not depend on the parameter, the term $E_{x_{0}\sim q}[D_{KL}(q(x_{T}|x_{0})\|p_{\theta}(x_{T}))]$ can also be ignored. This leaves just $L(\theta)=\sum_{t=1}^{T}L_{t}$ with $L_{t}=E_{x_{t-1},x_{t}\sim q}[-\ln p_{\theta}(x_{t-1}|x_{t})]$ to be minimized.

Noise prediction network

Since $x_{t-1}|x_{t},x_{0}\sim N(\tilde{\mu}_{t}(x_{t},x_{0}),\tilde{\beta}_{t}I)$, this suggests that we should use $\mu_{\theta}(x_{t},t)=\tilde{\mu}_{t}(x_{t},x_{0})$; however, the network does not have access to $x_{0}$, and so it has to estimate it instead. Now, since $x_{t}|x_{0}\sim N\left(\sqrt{\bar{\alpha}_{t}}x_{0},(1-\bar{\alpha}_{t})I\right)$, we may write $x_{t}=\sqrt{\bar{\alpha}_{t}}x_{0}+\sqrt{1-\bar{\alpha}_{t}}z$, where $z$ is some unknown Gaussian noise. Now we see that estimating $x_{0}$ is equivalent to estimating $z$.

Therefore, let the network output a noise vector $\epsilon_{\theta}(x_{t},t)$, and let it predict

$$\mu_{\theta}(x_{t},t)=\tilde{\mu}_{t}\left(x_{t},\frac{x_{t}-\sqrt{1-\bar{\alpha}_{t}}\epsilon_{\theta}(x_{t},t)}{\sqrt{\bar{\alpha}_{t}}}\right)=\frac{x_{t}-\epsilon_{\theta}(x_{t},t)\beta_{t}/\sqrt{1-\bar{\alpha}_{t}}}{\sqrt{\alpha_{t}}}$$
It remains to design $\Sigma_{\theta}(x_{t},t)$. The DDPM paper suggested not learning it (since it resulted in "unstable training and poorer sample quality"), but fixing it at some value $\Sigma_{\theta}(x_{t},t)=\sigma_{t}^{2}I$, where either $\sigma_{t}^{2}=\beta_{t}$ or $\sigma_{t}^{2}=\tilde{\beta}_{t}$ yielded similar performance.
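The resulting sampling loop can be sketched as follows, for scalar $x$ and with $\sigma_{t}^{2}=\tilde{\beta}_{t}$. Everything here is an illustrative assumption rather than the paper's code: the schedule, the names `make_schedule`, `p_sample_step` and `make_oracle`, and in particular the noise predictor, which is not a trained network but the exact predictor for a toy data distribution concentrated on a single point `c`.

```python
import math
import random

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear betas and their cumulative alpha_bar products."""
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    abar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abar.append(prod)
    return betas, abar

def p_sample_step(xt, t, eps_model, betas, abar, rng):
    """One reverse step x_t -> x_{t-1} (t is 1-indexed), sigma_t^2 = beta_tilde_t."""
    beta_t = betas[t - 1]
    abar_t = abar[t - 1]
    abar_prev = abar[t - 2] if t > 1 else 1.0  # alpha_bar_0 = 1 (empty product)
    eps = eps_model(xt, t)
    mean = (xt - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(1.0 - beta_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)

def sample(eps_model, betas, abar, rng):
    """Start from N(0, 1) noise and apply all T reverse steps."""
    x = rng.gauss(0.0, 1.0)
    for t in range(len(betas), 0, -1):
        x = p_sample_step(x, t, eps_model, betas, abar, rng)
    return x

def make_oracle(c, abar):
    """Exact noise predictor when every data point equals c (toy stand-in)."""
    def eps_model(xt, t):
        a = abar[t - 1]
        return (xt - math.sqrt(a) * c) / math.sqrt(1.0 - a)
    return eps_model

betas, abar = make_schedule(200)
x_final = sample(make_oracle(3.0, abar), betas, abar, random.Random(0))
```

Because $\tilde{\beta}_{1}=0$, the last step adds no noise, and with the exact predictor the chain lands on the single data point 3.0 up to rounding.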

With this, the loss simplifies to

$$L_{t}=\frac{\beta_{t}^{2}}{2\alpha_{t}(1-\bar{\alpha}_{t})\sigma_{t}^{2}}E_{x_{0}\sim q;z\sim N(0,I)}\left[\left\|\epsilon_{\theta}(x_{t},t)-z\right\|^{2}\right]+C$$
which may be minimized by stochastic gradient descent. The paper noted empirically that an even simpler loss function
$$L_{simple,t}=E_{x_{0}\sim q;z\sim N(0,I)}\left[\left\|\epsilon_{\theta}(x_{t},t)-z\right\|^{2}\right]$$
resulted in better models.
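A Monte Carlo estimate of the simplified loss can be sketched as below for scalar data. Drawing $t$ uniformly from $1,\dots,T$ is an assumption made here to turn the sum over $t$ into a single expectation; the schedule, the function names, and the two toy predictors (an exact one for single-point data, and one that always outputs zero) are likewise illustrative.

```python
import math
import random

def alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for an assumed linear beta schedule."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_start + (beta_end - beta_start) * i / (T - 1))
        out.append(prod)
    return out

def simple_loss(eps_model, data, abar, n_samples, rng):
    """Estimate of L_simple averaged over uniformly drawn t (scalar data)."""
    total = 0.0
    for _ in range(n_samples):
        x0 = rng.choice(data)
        t = rng.randint(1, len(abar))      # uniform over 1..T, inclusive
        z = rng.gauss(0.0, 1.0)
        a = abar[t - 1]
        xt = math.sqrt(a) * x0 + math.sqrt(1.0 - a) * z
        total += (eps_model(xt, t) - z) ** 2
    return total / n_samples

abar = alpha_bars(1000)
data = [3.0]                               # toy dataset: a single point

def oracle(xt, t):
    """Exact noise predictor for the single-point dataset."""
    a = abar[t - 1]
    return (xt - math.sqrt(a) * data[0]) / math.sqrt(1.0 - a)

def zero_model(xt, t):
    """Baseline that ignores its input."""
    return 0.0

loss_oracle = simple_loss(oracle, data, abar, 2000, random.Random(0))
loss_zero = simple_loss(zero_model, data, abar, 2000, random.Random(0))
```

The exact predictor drives the loss to zero up to rounding, while the zero baseline sits near $E[z^{2}]=1$; a trained network would be expected to fall between the two.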

Score-based generative model

Score-based generative models are another formulation of diffusion modelling. They are also called noise conditional score networks (NCSN) or score matching with Langevin dynamics (SMLD).[13][14]

Score matching

The idea of score functions

Consider the problem of image generation. Let $x$ represent an image, and let $q(x)$ be the probability distribution over all possible images. If we have $q(x)$ itself, then we can say for certain how likely a certain image is. However, this is intractable in general.

Most often, we are uninterested in knowing the absolute probability of a certain image. Instead, we are usually only interested in knowing how likely a certain image is compared to its immediate neighbors. For example, how much more likely is an image of a cat compared to some small variants of it? Is it more likely if the image contains two whiskers, or three, or with some Gaussian noise added?

Consequently, we are actually quite uninterested in $q(x)$ itself, but rather in $\nabla_{x}\ln q(x)$. This has two major effects:

  • One, we no longer need to normalize $q(x)$, but can use any $\tilde{q}(x)=Cq(x)$, where $C=\int\tilde{q}(x)dx>0$ is any unknown constant that is of no concern to us.
  • Two, we are comparing $q(x)$ with its neighbors $q(x+dx)$, by $\frac{q(x)}{q(x+dx)}=e^{-\langle\nabla_{x}\ln q,dx\rangle}$

Let the score function be $s(x):=\nabla_{x}\ln q(x)$; then consider what we can do with $s(x)$.

As it turns out, $s(x)$ allows us to sample from $q(x)$ using thermodynamics. Specifically, if we have a potential energy function $U(x)=-\ln q(x)$, and a lot of particles in the potential well, then the distribution at thermodynamic equilibrium is the Boltzmann distribution $q_{U}(x)\propto e^{-U(x)/k_{B}T}=q(x)^{1/k_{B}T}$. At temperature $k_{B}T=1$, the Boltzmann distribution is exactly $q(x)$.

Therefore, to model $q(x)$, we may start with a particle sampled from any convenient distribution (such as the standard Gaussian distribution), then simulate the motion of the particle forwards according to the Langevin equation

$$dx_{t}=-\nabla_{x_{t}}U(x_{t})dt+\sqrt{2}\,dW_{t}$$
and the Boltzmann distribution is, by the Fokker–Planck equation, the unique thermodynamic equilibrium. So no matter what distribution $x_{0}$ has, the distribution of $x_{t}$ converges in distribution to $q$ as $t\to\infty$.
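A discretized version of this sampler (often called the unadjusted Langevin algorithm) can be sketched as follows. The update $x\leftarrow x+\eta\,s(x)+\sqrt{2\eta}\,z$ with step size $\eta$ is the Euler–Maruyama discretization of overdamped Langevin dynamics at $k_{B}T=1$; the Gaussian target, the step size and the function names are assumptions chosen so that the score is known in closed form.

```python
import math
import random

def langevin_sample(score, x_init, step, n_steps, rng):
    """Iterate x <- x + step * score(x) + sqrt(2 * step) * z (scalar x)."""
    x = x_init
    for _ in range(n_steps):
        x = x + step * score(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x

MU, SIGMA = 2.0, 0.5   # toy target q = N(MU, SIGMA^2)

def gaussian_score(x):
    """Score of the toy target: d/dx ln q(x) = -(x - MU) / SIGMA^2."""
    return -(x - MU) / SIGMA ** 2

rng = random.Random(0)
# Every particle starts from a standard Gaussian, far from the target.
samples = [langevin_sample(gaussian_score, rng.gauss(0.0, 1.0), 0.01, 300, rng)
           for _ in range(1000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The empirical mean and variance end up close to 2 and 0.25. The finite step size introduces a small bias in the variance, which shrinks as the step is reduced.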

Learning the score function

Given a density $q$, we wish to learn a score function approximation $f_{\theta}\approx\nabla\ln q$. This is score matching.[15] Typically, score matching is formalized as minimizing the Fisher divergence $E_{q}[\|f_{\theta}(x)-\nabla\ln q(x)\|^{2}]$. By expanding the integral, and performing an integration by parts,

$$E_{q}[\|f_{\theta}(x)-\nabla\ln q(x)\|^{2}]=E_{q}[\|f_{\theta}\|^{2}+2\nabla\cdot f_{\theta}]+C$$
giving us a loss function, also known as the Hyvärinen scoring rule, that can be minimized by stochastic gradient descent.
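A one-dimensional toy example makes the objective concrete. Assume, purely for illustration, the linear model $f_{\theta}(x)=-\theta x$, whose divergence is $-\theta$, so the objective becomes $E_{q}[\theta^{2}x^{2}-2\theta]$. For data drawn from $N(0,\sigma^{2})$ the true score is $-x/\sigma^{2}$, so the fit should recover $\theta\approx1/\sigma^{2}$ without ever evaluating $q$. Full-batch gradient descent is used below in place of stochastic gradient descent, and all names are hypothetical.

```python
import random

def hyvarinen_loss(theta, xs):
    """Empirical E[f(x)^2 + 2 f'(x)] for the 1-D model f(x) = -theta * x."""
    return sum((theta * x) ** 2 - 2.0 * theta for x in xs) / len(xs)

def fit_theta(xs, lr=0.05, n_steps=200):
    """Full-batch gradient descent on the (quadratic) Hyvarinen loss."""
    m2 = sum(x * x for x in xs) / len(xs)   # empirical second moment
    theta = 0.0
    for _ in range(n_steps):
        grad = 2.0 * theta * m2 - 2.0       # d/dtheta of theta^2 m2 - 2 theta
        theta -= lr * grad
    return theta

rng = random.Random(0)
xs = [rng.gauss(0.0, 2.0) for _ in range(5000)]   # sigma = 2, true theta = 0.25
theta_hat = fit_theta(xs)
```

Only samples from $q$ enter the loss, which is the point of the integration-by-parts step.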

Annealing the score function

Suppose we need to model the distribution of images, and we want to start from $x_{0}\sim N(0,I)$, a white-noise image. Now, most white-noise images do not look like real images, so $q(x_{0})\approx0$ for large swaths of $x_{0}\sim N(0,I)$. This presents a problem for learning the score function, because if there are no samples around a certain point, then we can't learn the score function at that point. If we do not know the score function $\nabla_{x_{t}}\ln q(x_{t})$ at that point, then we cannot impose the time-evolution equation on a particle:

$$dx_{t}=\nabla_{x_{t}}\ln q(x_{t})dt+\sqrt{2}\,dW_{t}$$
To deal with this problem, we perform annealing. If $q$ is too different from a white-noise distribution, then progressively add noise until it is indistinguishable from one. That is, we perform a forward diffusion, then learn the score function, then use the score function to perform a backward diffusion.

Continuous diffusion processes

Forward diffusion process

Consider again the forward diffusion process, but this time in continuous time:

$$x_{t}=\sqrt{1-\beta_{t}}\,x_{t-1}+\sqrt{\beta_{t}}\,z_{t}$$
By taking the $\beta_{t}\to\beta(t)dt,\ \sqrt{dt}\,z_{t}\to dW_{t}$ limit, we obtain a continuous diffusion process, in the form of a stochastic differential equation:
$$dx_{t}=-\frac{1}{2}\beta(t)x_{t}dt+\sqrt{\beta(t)}\,dW_{t}$$
where $W_{t}$ is a Wiener process (multidimensional Brownian motion).

Now, the equation is exactly a special case of the overdamped Langevin equation

$$dx_{t}=-\frac{D}{k_{B}T}(\nabla_{x}U)dt+\sqrt{2D}\,dW_{t}$$
where $D$ is the diffusion tensor, $T$ is the temperature, and $U$ is the potential energy field. If we substitute in $D=\frac{1}{2}\beta(t)I,\ k_{B}T=1,\ U=\frac{1}{2}\|x\|^{2}$, we recover the above equation. This explains why the phrase "Langevin dynamics" is sometimes used in diffusion models.

Now the above equation is for the stochastic motion of a single particle. Suppose we have a cloud of particles distributed according to $q$ at time $t=0$; then after a long time, the cloud of particles would settle into the stable distribution of $N(0,I)$. Let $\rho_{t}$ be the density of the cloud of particles at time $t$; then we have

$$\rho_{0}=q;\quad\rho_{T}\approx N(0,I)$$
and the goal is to somehow reverse the process, so that we can start at the end and diffuse back to the beginning.

By the Fokker–Planck equation, the density of the cloud evolves according to

$$\partial_{t}\ln\rho_{t}=\frac{1}{2}\beta(t)\left(n+(x+\nabla\ln\rho_{t})\cdot\nabla\ln\rho_{t}+\Delta\ln\rho_{t}\right)$$
where $n$ is the dimension of space, and $\Delta$ is the Laplace operator.

Backward diffusion process

If we have solved $\rho_{t}$ for time $t\in[0,T]$, then we can exactly reverse the evolution of the cloud. Suppose we start with another cloud of particles with density $\nu_{0}=\rho_{T}$, and let the particles in the cloud evolve according to

$$dy_{t}=\frac{1}{2}\beta(T-t)y_{t}dt+\beta(T-t)\underbrace{\nabla_{y_{t}}\ln\rho_{T-t}\left(y_{t}\right)}_{\text{score function}}dt+\sqrt{\beta(T-t)}\,dW_{t}$$
then by plugging into the Fokker–Planck equation, we find that $\partial_{t}\rho_{T-t}=\partial_{t}\nu_{t}$. Thus this cloud of points is the original cloud, evolving backwards.[16]

Noise conditional score network (NCSN)

At the continuous limit,

$$\bar{\alpha}_{t}=(1-\beta_{1})\cdots(1-\beta_{t})=e^{\sum_{i}\ln(1-\beta_{i})}\to e^{-\int_{0}^{t}\beta(t)dt}$$
and so
$$x_{t}|x_{0}\sim N\left(e^{-\frac{1}{2}\int_{0}^{t}\beta(t)dt}x_{0},\left(1-e^{-\int_{0}^{t}\beta(t)dt}\right)I\right)$$
In particular, we see that we can directly sample from any point in the continuous diffusion process without going through the intermediate steps, by first sampling $x_{0}\sim q,z\sim N(0,I)$, then setting $x_{t}=e^{-\frac{1}{2}\int_{0}^{t}\beta(t)dt}x_{0}+\sqrt{1-e^{-\int_{0}^{t}\beta(t)dt}}\,z$. The noise is scaled by the standard deviation, that is, the square root of the variance in the preceding formula. That is, we can quickly sample $x_{t}\sim\rho_{t}$ for any $t\geq0$.
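The closed-form marginal can be checked against a step-by-step discretization of the forward SDE. The sketch below assumes a linear schedule $\beta(t)=0.1+19.9\,t$ on $[0,1]$ (an illustrative choice) and scalar $x$. Instead of simulating random paths, it propagates the mean and variance of Euler–Maruyama steps, which evolve deterministically for this linear SDE. All function names are hypothetical.

```python
import math
import random

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed linear schedule on t in [0, 1]

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def int_beta(t):
    """Closed-form integral of beta from 0 to t."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t

def marginal(x0, t):
    """Mean and variance of x_t | x_0 from the closed form."""
    B = int_beta(t)
    return math.exp(-0.5 * B) * x0, 1.0 - math.exp(-B)

def sample_xt(x0, t, rng):
    """Direct sample: mean + sqrt(variance) * z."""
    mean, var = marginal(x0, t)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)

def euler_moments(x0, t, n_steps=3000):
    """Mean/variance after n Euler-Maruyama steps of the forward SDE."""
    dt = t / n_steps
    mean, var = x0, 0.0
    for k in range(n_steps):
        b = beta(k * dt)
        shrink = 1.0 - 0.5 * b * dt        # x <- shrink * x + sqrt(b dt) * z
        mean = shrink * mean
        var = shrink * shrink * var + b * dt
    return mean, var

m_exact, v_exact = marginal(1.5, 0.3)
m_euler, v_euler = euler_moments(1.5, 0.3)
x_t = sample_xt(1.5, 0.3, random.Random(0))
```

The two sets of moments agree to within the discretization error, so `sample_xt` can replace thousands of small steps with a single draw.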

Now, define a certain probability distribution $\gamma$ over $[0,\infty)$, then the score-matching loss function is defined as the expected Fisher divergence:

$$L(\theta)=E_{t\sim\gamma,x_{t}\sim\rho_{t}}[\|f_{\theta}(x_{t},t)\|^{2}+2\nabla\cdot f_{\theta}(x_{t},t)]$$
After training, $f_{\theta}(x_{t},t)\approx\nabla\ln\rho_{t}$, so we can perform the backwards diffusion process by first sampling $x_{T}\sim N(0,I)$, then integrating the SDE from $t=T$ to $t=0$:
$$x_{t-dt}=x_{t}+\frac{1}{2}\beta(t)x_{t}dt+\beta(t)f_{\theta}(x_{t},t)dt+\sqrt{\beta(t)}\,dW_{t}$$
This may be done by any SDE integration method, such as the Euler–Maruyama method.
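An Euler–Maruyama integration of the backward process might look like the sketch below. To keep it self-contained, the trained network $f_{\theta}$ is replaced by the exact score of $\rho_{t}$ for a toy one-dimensional data distribution $q=N(2,0.5^{2})$, for which $\rho_{t}$ stays Gaussian. The schedule $\beta(t)=0.1+19.9\,t$ on $[0,1]$, the step count and the function names are all assumptions.

```python
import math
import random

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed linear schedule on t in [0, 1]
M0, S0 = 2.0, 0.5                # toy data distribution q = N(M0, S0^2)

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def int_beta(t):
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t

def score(x, t):
    """Exact score of rho_t for Gaussian q; stands in for a trained f_theta."""
    decay = math.exp(-int_beta(t))
    mean = math.sqrt(decay) * M0
    var = decay * S0 * S0 + 1.0 - decay
    return -(x - mean) / var

def reverse_sde_sample(score_fn, rng, T=1.0, n_steps=500):
    """Integrate the backward SDE from t = T down to t = 0 (scalar x)."""
    dt = T / n_steps
    x = rng.gauss(0.0, 1.0)                # x_T ~ N(0, 1)
    for k in range(n_steps):
        t = T - k * dt
        b = beta(t)
        drift = 0.5 * b * x + b * score_fn(x, t)
        x = x + drift * dt + math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [reverse_sde_sample(score, rng) for _ in range(400)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
```

Starting from pure noise, the samples end up with mean near 2 and standard deviation near 0.5, the parameters of the toy data distribution.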

The name "noise conditional score network" is explained thus:

  • "network", because $f_{\theta}$ is implemented as a neural network.
  • "score", because the output of the network is interpreted as approximating the score function $\nabla\ln\rho_{t}$.
  • "noise conditional", because $\rho_{t}$ is equal to $\rho_{0}$ blurred by an added Gaussian noise that increases with time, and so the score function depends on the amount of noise added.

Their equivalence

DDPM and score-based generative models are equivalent.[17] This means that a network trained using DDPM can be used as an NCSN, and vice versa.

We know that $x_{t}|x_{0}\sim N\left(\sqrt{\bar{\alpha}_{t}}x_{0},(1-\bar{\alpha}_{t})I\right)$, so by Tweedie's formula, we have

$$\nabla_{x_{t}}\ln q(x_{t})=\frac{1}{1-\bar{\alpha}_{t}}\left(-x_{t}+\sqrt{\bar{\alpha}_{t}}E_{q}[x_{0}|x_{t}]\right)$$
As described previously, the DDPM loss function is $\sum_{t}L_{simple,t}$ with
$$L_{simple,t}=E_{x_{0}\sim q;z\sim N(0,I)}\left[\left\|\epsilon_{\theta}(x_{t},t)-z\right\|^{2}\right]$$
where $x_{t}=\sqrt{\bar{\alpha}_{t}}x_{0}+\sqrt{1-\bar{\alpha}_{t}}z$. By a change of variables,
$$L_{simple,t}=E_{x_{0},x_{t}\sim q}\left[\left\|\epsilon_{\theta}(x_{t},t)-\frac{x_{t}-\sqrt{\bar{\alpha}_{t}}x_{0}}{\sqrt{1-\bar{\alpha}_{t}}}\right\|^{2}\right]=E_{x_{t}\sim q,x_{0}\sim q(\cdot|x_{t})}\left[\left\|\epsilon_{\theta}(x_{t},t)-\frac{x_{t}-\sqrt{\bar{\alpha}_{t}}x_{0}}{\sqrt{1-\bar{\alpha}_{t}}}\right\|^{2}\right]$$
and the term inside becomes a least squares regression, so if the network actually reaches the global minimum of loss, then we have $\epsilon_{\theta}(x_{t},t)=\frac{x_{t}-\sqrt{\bar{\alpha}_{t}}E_{q}[x_{0}|x_{t}]}{\sqrt{1-\bar{\alpha}_{t}}}=-\sqrt{1-\bar{\alpha}_{t}}\nabla_{x_{t}}\ln q(x_{t})$.
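The relation between the optimal noise prediction and the score can be verified numerically in a case where everything is available in closed form. The sketch assumes scalar data with $q=N(0,s_{0}^{2})$, so that $x_{t}\sim N(0,\bar{\alpha}_{t}s_{0}^{2}+1-\bar{\alpha}_{t})$ and $E_{q}[x_{0}|x_{t}]$ follows from the usual formula for jointly Gaussian variables. The function names and test values are illustrative.

```python
import math

def posterior_mean_x0(xt, abar, s0):
    """E[x_0 | x_t] when q = N(0, s0^2): Cov(x_0, x_t) / Var(x_t) * x_t."""
    return math.sqrt(abar) * s0 * s0 * xt / (abar * s0 * s0 + 1.0 - abar)

def score_via_tweedie(xt, abar, s0):
    """Tweedie's formula: (-x_t + sqrt(abar) E[x_0 | x_t]) / (1 - abar)."""
    return (-xt + math.sqrt(abar) * posterior_mean_x0(xt, abar, s0)) / (1.0 - abar)

def score_exact(xt, abar, s0):
    """Score of the Gaussian marginal of x_t: -x_t / Var(x_t)."""
    return -xt / (abar * s0 * s0 + 1.0 - abar)

def optimal_eps(xt, abar, s0):
    """Least-squares optimal noise prediction at x_t."""
    return (xt - math.sqrt(abar) * posterior_mean_x0(xt, abar, s0)) / math.sqrt(1.0 - abar)

xt, abar, s0 = 0.8, 0.3, 2.0
s_tweedie = score_via_tweedie(xt, abar, s0)
s_exact = score_exact(xt, abar, s0)
eps_star = optimal_eps(xt, abar, s0)
```

Here `s_tweedie` equals `s_exact`, and `eps_star` equals `-sqrt(1 - abar) * s_exact`, so a noise-prediction network and a score network differ only by a known rescaling.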

Now, the continuous limit $x_{t-1}=x_{t-dt},\ \beta_{t}=\beta(t)dt,\ z_{t}\sqrt{dt}=dW_{t}$ of the backward equation

$$x_{t-1}=\frac{x_{t}}{\sqrt{\alpha_{t}}}-\frac{\beta_{t}}{\sqrt{\alpha_{t}(1-\bar{\alpha}_{t})}}\epsilon_{\theta}(x_{t},t)+\sqrt{\beta_{t}}z_{t};\quad z_{t}\sim N(0,I)$$
gives us precisely the same equation as score-based diffusion:
$$x_{t-dt}=x_{t}(1+\beta(t)dt/2)+\beta(t)\nabla_{x_{t}}\ln q(x_{t})dt+\sqrt{\beta(t)}\,dW_{t}$$

Main variants

Denoising Diffusion Implicit Model (DDIM)

The original DDPM method for generating images is slow, since the forward diffusion process usually takes $T\sim1000$ steps to make the distribution of $x_{T}$ appear close to Gaussian. However, this means the backward diffusion process also takes 1000 steps. Unlike the forward diffusion process, which can skip steps as $x_{t}|x_{0}$ is Gaussian for all $t\geq1$, the backward diffusion process does not allow skipping steps. For example, to sample $x_{t-2}|x_{t-1}\sim N(\mu_{\theta}(x_{t-1},t-1),\Sigma_{\theta}(x_{t-1},t-1))$ requires the model to first sample $x_{t-1}$. Attempting to directly sample $x_{t-2}|x_{t}$ would require us to marginalize out $x_{t-1}$, which is generally intractable.

DDIM[18] is a method to take any model trained on the DDPM loss, and use it to sample with some steps skipped, sacrificing an adjustable amount of quality. If we generalize the Markov chain of DDPM to a non-Markovian process, DDIM corresponds to the case where the reverse process has variance equal to 0. In other words, the reverse process (and also the forward process) is deterministic. When using fewer sampling steps, DDIM outperforms DDPM.
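A zero-variance DDIM sampler over a shortened list of timesteps can be sketched as follows. The update rule is the one given in the DDIM paper for zero variance and is not derived in the text above: first estimate $\hat{x}_{0}=(x_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\epsilon_{\theta})/\sqrt{\bar{\alpha}_{t}}$, then jump to an earlier step $s<t$ via $x_{s}=\sqrt{\bar{\alpha}_{s}}\,\hat{x}_{0}+\sqrt{1-\bar{\alpha}_{s}}\,\epsilon_{\theta}$. The schedule, the function names, and the exact noise predictor for single-point data are illustrative assumptions.

```python
import math
import random

def alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for an assumed linear beta schedule."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_start + (beta_end - beta_start) * i / (T - 1))
        out.append(prod)
    return out

def ddim_sample(eps_model, abar, timesteps, x_T):
    """Deterministic sampling along a decreasing list of 1-indexed timesteps."""
    x = x_T
    for i, t in enumerate(timesteps):
        s = timesteps[i + 1] if i + 1 < len(timesteps) else 0  # next (earlier) step
        a_t = abar[t - 1]
        a_s = abar[s - 1] if s > 0 else 1.0                    # alpha_bar_0 = 1
        eps = eps_model(x, t)
        x0_hat = (x - math.sqrt(1.0 - a_t) * eps) / math.sqrt(a_t)
        x = math.sqrt(a_s) * x0_hat + math.sqrt(1.0 - a_s) * eps
    return x

abar = alpha_bars(1000)
C = 3.0                                   # toy dataset: every point equals C

def oracle(xt, t):
    """Exact noise predictor for the single-point dataset."""
    a = abar[t - 1]
    return (xt - math.sqrt(a) * C) / math.sqrt(1.0 - a)

timesteps = list(range(1000, 0, -100))    # 10 steps instead of 1000
x_final = ddim_sample(oracle, abar, timesteps, random.Random(0).gauss(0.0, 1.0))
```

With the exact predictor the ten-step trajectory reaches the data point 3.0. With a trained network, using fewer steps trades sample quality for speed.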

Latent diffusion model (LDM)

Since the diffusion model is a general method for modelling probability distributions, if one wants to model a distribution over images, one can first encode the images into a lower-dimensional space by an encoder, then use a diffusion model to model the distribution over encoded images. Then to generate an image, one can sample from the diffusion model, then use a decoder to decode it into an image.[19]

The encoder-decoder pair is most often a variational autoencoder (VAE).

Classifier guidance

Suppose we wish to sample not from the entire distribution of images, but conditional on the image description. We don't want to sample a generic image, but an image that fits the description "black cat with red eyes". Generally, we want to sample from the distribution p ( x | y ) {\displaystyle p(x|y)} , where x {\displaystyle x} ranges over images, and y {\displaystyle y} ranges over classes of images (a description "black cat with red eyes" is just a very detailed class, and a class "cat" is just a very vague description).

Taking the perspective of the noisy channel model, we can understand the process as follows: To generate an image $x$ conditional on description $y$, we imagine that the requester really had in mind an image $x$, but the image was passed through a noisy channel and came out garbled, as $y$. Image generation is then nothing but inferring which $x$ the requester had in mind.

In other words, conditional image generation is simply "translating from a textual language into a pictorial language". Then, as in the noisy-channel model, we use Bayes' theorem to get

$$p(x|y) \propto p(y|x)p(x)$$
In other words, if we have a good model of the space of all images, and a good image-to-class translator, we get a class-to-image translator "for free". In the equation for backward diffusion, the score $\nabla \ln p(x)$ can be replaced by
$$\nabla_x \ln p(x|y) = \nabla_x \ln p(y|x) + \nabla_x \ln p(x)$$
where $\nabla_x \ln p(x)$ is the score function, trained as previously described, and $\nabla_x \ln p(y|x)$ is found by using a differentiable image classifier.

With temperature

The classifier-guided diffusion model samples from $p(x|y)$, which is concentrated around the maximum a posteriori estimate $\arg\max_x p(x|y)$. If we want to force the model to move towards the maximum likelihood estimate $\arg\max_x p(y|x)$, we can use

$$p_\beta(x|y) \propto p(y|x)^\beta p(x)$$
where $\beta > 0$ is interpretable as inverse temperature. In the context of diffusion models, it is usually called the guidance scale. A high $\beta$ would force the model to sample from a distribution concentrated around $\arg\max_x p(y|x)$. This often improves the quality of generated images.[20]

This can be done simply by stochastic gradient Langevin dynamics (SGLD) with

$$\nabla_x \ln p_\beta(x|y) = \beta \nabla_x \ln p(y|x) + \nabla_x \ln p(x)$$
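The effect of the guidance scale can be illustrated on a one-dimensional toy problem where every quantity is known in closed form. The sketch below assumes a prior $p(x) = N(0, 1)$ and a "classifier" $p(y|x) = N(y; x, 1)$, both chosen purely for illustration; `guided_score` and `langevin_sample` are hypothetical names, and the sampler is plain (unadjusted) Langevin dynamics with exact gradients rather than a trained model.

```python
import numpy as np

def guided_score(x, y, beta, s2=1.0):
    """beta * grad_x ln p(y|x) + grad_x ln p(x) for the toy Gaussian model."""
    prior_score = -x                 # grad_x ln p(x) for p(x) = N(0, 1)
    classifier_grad = (y - x) / s2   # grad_x ln p(y|x) for p(y|x) = N(y; x, s2)
    return beta * classifier_grad + prior_score

def langevin_sample(y, beta, n_chains=5000, n_steps=500, step=0.01, seed=0):
    """Run independent Langevin chains targeting p_beta(x|y)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_chains)
    for _ in range(n_steps):
        z = rng.standard_normal(n_chains)
        x = x + 0.5 * step * guided_score(x, y, beta) + np.sqrt(step) * z
    return x

y = 2.0
samples_b1 = langevin_sample(y, beta=1.0)   # ordinary posterior p(x|y)
samples_b4 = langevin_sample(y, beta=4.0)   # stronger guidance
```

In this toy model $p_\beta(x|y)$ is Gaussian with mean $\beta y / (\beta + 1)$ and variance $1 / (\beta + 1)$, so increasing $\beta$ pulls samples towards $y = \arg\max_x p(y|x)$ and narrows their spread.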

Classifier-free guidance (CFG)

If we do not have a classifier $p(y|x)$, we could still extract one out of the image model itself:[21]

$$\nabla_x \ln p_\beta(x|y) = (1-\beta)\nabla_x \ln p(x) + \beta \nabla_x \ln p(x|y)$$
Such a model is usually trained by presenting it with both $(x, y)$ and $(x, \mathrm{None})$, allowing it to model both $\nabla_x \ln p(x|y)$ and $\nabla_x \ln p(x)$.
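Both ingredients can be sketched briefly. `cfg_score` combines the two model outputs as in the formula above, and `drop_labels` shows one possible way to present the model with $(x, \mathrm{None})$ during training. The function names, the drop probability `p_uncond`, and the use of `-1` as the "None" label are assumptions made for this illustration.

```python
import numpy as np

def cfg_score(score_uncond, score_cond, beta):
    """(1 - beta) * grad ln p(x) + beta * grad ln p(x|y)."""
    return (1.0 - beta) * score_uncond + beta * score_cond

def drop_labels(labels, p_uncond, rng, null_label=-1):
    """Randomly replace labels with a null label so one network learns both scores."""
    mask = rng.random(labels.shape) < p_uncond
    return np.where(mask, null_label, labels)

# Toy check: p(x) = N(0, 1) and p(y|x) = N(y; x, 1), so p(x|y) = N(y/2, 1/2).
x = np.linspace(-3.0, 3.0, 7)
y, beta = 2.0, 3.0
score_uncond = -x              # grad_x ln p(x)
score_cond = -2.0 * x + y      # grad_x ln p(x|y)
combined = cfg_score(score_uncond, score_cond, beta)

rng = np.random.default_rng(0)
train_labels = drop_labels(np.arange(10), p_uncond=0.1, rng=rng)
```

In the toy case the combined score equals $\beta \nabla_x \ln p(y|x) + \nabla_x \ln p(x)$, the same expression used in classifier guidance, without ever evaluating a separate classifier.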

Samplers

Given a diffusion model, one may regard it either as a continuous process, and sample from it by integrating an SDE, or as a discrete process, and sample from it by iterating the discrete steps. The choice of the "noise schedule" $\beta_t$ can also affect the quality of samples. In the DDPM perspective, one can use the DDPM itself (with noise), or DDIM (with an adjustable amount of noise). The case where one adds noise is sometimes called ancestral sampling.[22] One can interpolate between noise and no noise. The amount of noise is denoted $\eta$ ("eta value") in the DDIM paper, with $\eta = 0$ denoting no noise (as in deterministic DDIM), and $\eta = 1$ denoting full noise (as in DDPM).

From the SDE perspective, one can use any numerical integration method, such as the Euler–Maruyama method, Heun's method, or linear multistep methods. Just as in the discrete case, one can add an adjustable amount of noise during the integration.
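The difference between integrators can be seen on a toy problem with a known answer. The sketch below makes several assumptions for illustration: the data distribution is $N(0, \sigma_d^2)$, the noise level at time $t$ is $\sigma(t) = t$, and no noise is added during integration, so the sampler solves the deterministic ODE $\mathrm{d}x/\mathrm{d}t = -t \nabla_x \ln p_t(x)$. Under these assumptions the score is available in closed form, so no neural network is involved.

```python
import numpy as np

SIGMA_D = 0.5  # std of the toy data distribution N(0, SIGMA_D^2) (assumption)

def score(x, t):
    """Exact score of p_t = N(0, SIGMA_D^2 + t^2)."""
    return -x / (SIGMA_D**2 + t**2)

def velocity(x, t):
    """Right-hand side of the deterministic sampling ODE for sigma(t) = t."""
    return -t * score(x, t)

def integrate(x, ts, method="euler"):
    """Integrate from ts[0] down to ts[-1] with Euler or Heun steps."""
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        h = t_next - t_cur
        d_cur = velocity(x, t_cur)
        x_euler = x + h * d_cur
        if method == "heun":
            # Second-order correction: average the slopes at both ends.
            d_next = velocity(x_euler, t_next)
            x = x + 0.5 * h * (d_cur + d_next)
        else:
            x = x_euler
    return x

T = 10.0
ts = np.linspace(T, 0.0, 51)          # 50 uniform steps from t = T to t = 0
x_T = np.array([3.0, -7.0])
exact = x_T * SIGMA_D / np.sqrt(SIGMA_D**2 + T**2)
err_euler = np.abs(integrate(x_T, ts, "euler") - exact).max()
err_heun = np.abs(integrate(x_T, ts, "heun") - exact).max()
```

Heun's method costs two velocity evaluations per step instead of one, but for the same step count its error on this problem is smaller than that of the Euler method.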

Karras et al. survey and compare samplers in the context of image generation.[23]

Flow-based diffusion model

Abstractly speaking, the idea of a diffusion model is to take an unknown probability distribution (the distribution of natural-looking images), then progressively convert it to a known probability distribution (the standard Gaussian distribution), then learn a neural network that reverses the process.

In denoising diffusion models, the forward process adds noise, and the backward process removes noise. Both the forward and backward processes are SDEs, though the forward process is integrable in closed-form, so it can be done at no computational cost. The backward process is not integrable in closed-form, so it must be integrated step-by-step by standard SDE solvers, which can be very expensive.

In flow-based diffusion models, the forward process is a deterministic flow along a time-dependent vector field, and the backward process follows the same vector field, but going backwards. Both processes are solutions to ODEs. If the vector field is well-behaved, the ODE will also be well-behaved.

Given two distributions $\pi_0$ and $\pi_1$, a flow-based model is a time-dependent velocity field $\mathbf{v}(\mathbf{Z}_t, t)$ in $\mathbb{R}^d \times [0,1]$, such that if we start by sampling a point $\mathbf{Z}_0 \sim \pi_0$, and let it move according to the velocity field:

$$\mathrm{d}\mathbf{Z}_t = \mathbf{v}(\mathbf{Z}_t, t)\,\mathrm{d}t, \quad t \in [0,1], \quad \text{starting from } \mathbf{Z}_0 \sim \pi_0$$
we end up with a point $\mathbf{Z}_1 \sim \pi_1$.

Rectified flow

Given two distributions $\pi_0$ and $\pi_1$, there are infinitely many possible velocity fields to transport between them. Some are more well-behaved than others. The idea of rectified flow[24][25] is to learn a flow model such that the velocity is nearly constant along each flow path. This is beneficial, because we can integrate along such a vector field with very few steps. For example, if an ODE $\mathrm{d}\mathbf{Z}_t = \mathbf{v}(\mathbf{Z}_t, t)\,\mathrm{d}t$ follows perfectly straight paths, it simplifies to $\mathbf{Z}_t = \mathbf{Z}_0 + t \cdot \mathbf{v}(\mathbf{Z}_0, 0)$, allowing for exact solutions in one step. In practice, we cannot reach such perfection, but when the flow field is nearly so, we can take a few large steps instead of many little steps.
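The benefit of straight paths can be checked numerically. The sketch below compares two hand-written velocity fields, both hypothetical and chosen only for illustration: `v_straight` moves every point along a straight line at constant velocity, while `v_curved` rotates points along circular arcs. A single Euler step is exact for the first and badly wrong for the second.

```python
import numpy as np

def euler_flow(v, z0, n_steps):
    """Integrate dZ_t = v(Z_t, t) dt from t = 0 to t = 1 with Euler steps."""
    z, dt = np.array(z0, dtype=float), 1.0 / n_steps
    for k in range(n_steps):
        z = z + dt * v(z, k * dt)
    return z

A = 3.0  # hypothetical scaling: straight paths Z_t = (1 + t (A - 1)) Z_0

def v_straight(z, t):
    """Velocity is constant along each path (equal to (A - 1) Z_0)."""
    return (A - 1.0) * z / (1.0 + t * (A - 1.0))

def v_curved(z, t):
    """Rotation by pi/2 over unit time: paths are circular arcs."""
    return (np.pi / 2) * np.array([-z[1], z[0]])

z0 = np.array([1.0, 0.0])
one_step_straight = euler_flow(v_straight, z0, 1)   # equals A * z0
one_step_curved = euler_flow(v_curved, z0, 1)       # far from the true endpoint (0, 1)
many_step_curved = euler_flow(v_curved, z0, 1000)   # close to (0, 1)
```

The curved field needs many small steps to reach its endpoint accurately, which is the cost that rectified flow tries to avoid.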

The general idea is to start with two distributions $\pi_0$ and $\pi_1$, then construct a flow field $\boldsymbol{Z}^0 = \{\mathbf{Z}_t : t \in [0,1]\}$ from them, then repeatedly apply a "reflow" operation to obtain successive flow fields $\boldsymbol{Z}^1, \boldsymbol{Z}^2, \dots$, each straighter than the previous one. When the flow field is straight enough for the application, we stop. See the images in [26] for intuition.

Generally, for any time-differentiable process $\mathbf{X}(t)$, $\mathbf{v}$ can be estimated by solving:

$$\min_{\mathbf{v}} \int_0^1 \mathbb{E}\left[\lVert \dot{\mathbf{X}}_t - \mathbf{v}(\mathbf{X}_t, t) \rVert^2\right]\,\mathrm{d}t.$$


Rectified flow injects the strong prior that intermediate trajectories are straight. This gives it both theoretical relevance for optimal transport and computational efficiency, as ODEs with straight paths can be simulated exactly without time discretization.

Specifically, rectified flow seeks to match an ODE with the marginal distributions of the linear interpolation between points from distributions $\pi_0$ and $\pi_1$. Given observations $\mathbf{X}_0 \sim \pi_0$ and $\mathbf{X}_1 \sim \pi_1$, the canonical linear interpolation $\mathbf{X}_t = t\mathbf{X}_1 + (1-t)\mathbf{X}_0, t \in [0,1]$ yields a trivial case $\dot{\mathbf{X}}_t = \mathbf{X}_1 - \mathbf{X}_0$, which cannot be causally simulated without $\mathbf{X}_1$. To address this, $\mathbf{X}_t$ is "projected" into a space of causally simulatable ODEs, expressed as $\mathrm{d}\mathbf{Z}_t = \mathbf{v}(\mathbf{Z}_t, t)\,\mathrm{d}t$, by minimizing the least squares loss with respect to the direction $\mathbf{X}_1 - \mathbf{X}_0$:

$$\min_{\mathbf{v}} \int_0^1 \mathbb{E}\left[\lVert (\mathbf{X}_1 - \mathbf{X}_0) - \mathbf{v}(\mathbf{X}_t, t) \rVert^2\right]\,\mathrm{d}t.$$

The data pair $(\mathbf{X}_0, \mathbf{X}_1)$ can be any coupling of $\pi_0$ and $\pi_1$, typically independent (i.e., $(\mathbf{X}_0, \mathbf{X}_1) \sim \pi_0 \times \pi_1$), obtained by randomly combining observations from $\pi_0$ and $\pi_1$. This process ensures that the $\mathbf{Z}_t$ trajectories closely mirror the density map of $\mathbf{X}_t$ trajectories but reroute at intersections to ensure causality. This rectifying process is also known as Flow Matching,[27] Stochastic Interpolation,[28] and Alpha-Blending.[citation needed]
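The least squares objective can be estimated by Monte Carlo, as in the sketch below. The function name `rectified_flow_loss` and the toy setup are assumptions for illustration: $\pi_0 = N(0, I)$ and $\pi_1 = N(m, I)$ are coupled independently, which makes the minimizing velocity field $\mathbb{E}[\mathbf{X}_1 - \mathbf{X}_0 \mid \mathbf{X}_t]$ available in closed form (`v_opt`). In practice $\mathbf{v}$ would be a neural network trained by gradient descent on this loss.

```python
import numpy as np

def rectified_flow_loss(v, x0, x1, rng):
    """Monte Carlo estimate of E || (X1 - X0) - v(X_t, t) ||^2 with t ~ U[0, 1]."""
    t = rng.random((x0.shape[0], 1))
    xt = t * x1 + (1.0 - t) * x0          # linear interpolation X_t
    residual = (x1 - x0) - v(xt, t)
    return np.mean(np.sum(residual**2, axis=1))

rng = np.random.default_rng(0)
n, m = 20000, np.array([4.0, 0.0])
x0 = rng.standard_normal((n, 2))          # samples from pi_0 = N(0, I)
x1 = m + rng.standard_normal((n, 2))      # samples from pi_1 = N(m, I), independent of x0

def v_const(xt, t):
    """Baseline: ignore the position and always move by the mean shift m."""
    return np.broadcast_to(m, xt.shape)

def v_opt(xt, t):
    """Conditional mean of X1 - X0 given X_t (valid only for this Gaussian toy)."""
    return m + (2 * t - 1) / (t**2 + (1 - t) ** 2) * (xt - t * m)

loss_const = rectified_flow_loss(v_const, x0, x1, np.random.default_rng(1))
loss_opt = rectified_flow_loss(v_opt, x0, x1, np.random.default_rng(1))
```

The loss of the conditional-mean field is lower than that of the constant baseline, but it does not reach zero: with an independent coupling, interpolation lines cross, and a velocity field that depends only on $(\mathbf{X}_t, t)$ cannot follow every line individually.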

A distinctive aspect of rectified flow is its capability for "reflow", which straightens the trajectory of ODE paths. Denote the rectified flow $\boldsymbol{Z}^0 = \{\mathbf{Z}_t : t \in [0,1]\}$ induced from $(\mathbf{X}_0, \mathbf{X}_1)$ as $\boldsymbol{Z}^0 = \mathsf{Rectflow}((\mathbf{X}_0, \mathbf{X}_1))$. Recursively applying this $\mathsf{Rectflow}(\cdot)$ operator generates a series of rectified flows $\boldsymbol{Z}^{k+1} = \mathsf{Rectflow}((\mathbf{Z}_0^k, \mathbf{Z}_1^k))$, starting with $(\mathbf{Z}_0^0, \mathbf{Z}_1^0) = (\mathbf{X}_0, \mathbf{X}_1)$, where $\boldsymbol{Z}^k$ is the $k$-th iteration of rectified flow induced from $(\mathbf{X}_0, \mathbf{X}_1)$. This "reflow" process not only reduces transport costs but also straightens the paths of rectified flows, making $\boldsymbol{Z}^k$ paths straighter with increasing $k$.

Rectified flow includes a nonlinear extension where the linear interpolation $\mathbf{X}_t$ is replaced with any time-differentiable curve that connects $\mathbf{X}_0$ and $\mathbf{X}_1$, given by $\mathbf{X}_t = \alpha_t \mathbf{X}_1 + \beta_t \mathbf{X}_0$. This framework encompasses DDIM and probability flow ODEs as special cases, with particular choices of $\alpha_t$ and $\beta_t$. However, in the case where the path of $\mathbf{X}$ is not straight, the reflow process no longer ensures a reduction in convex transport costs, and also no longer straightens the paths of $\mathbf{Z}_t$.[24]

Choice of architecture

[Figure: Architecture of Stable Diffusion]
[Figure: The denoising process used by Stable Diffusion]

Diffusion model

For generating images by DDPM, we need a neural network that takes a time $t$ and a noisy image $x_t$, and predicts a noise $\epsilon_\theta(x_t, t)$ from it. Since predicting the noise is equivalent to predicting the denoised image and then subtracting it from $x_t$ (up to scaling), denoising architectures tend to work well. For example, the U-Net, which was found to be good for denoising images, is often used for denoising diffusion models that generate images.[29]

For DDPM, the underlying architecture does not have to be a U-Net. It just has to predict the noise somehow. For example, the diffusion transformer (DiT) uses a Transformer to predict the mean and diagonal covariance of the noise, given the conditioning and the partially denoised image. It is the same as a standard U-Net-based denoising diffusion model, with a Transformer replacing the U-Net.[30]

DDPM can be used to model general data distributions, not just natural-looking images. For example, Human Motion Diffusion[31] models human motion trajectory by DDPM. Each human motion trajectory is a sequence of poses, represented by either joint rotations or positions. It uses a Transformer network to generate a less noisy trajectory out of a noisy one.

Conditioning

The base diffusion model can only generate unconditionally from the whole distribution. For example, a diffusion model learned on ImageNet would generate images that look like a random image from ImageNet. To generate images from just one category, one would need to impose the condition. Whatever condition one wants to impose, one needs to first convert the conditioning into a vector of floating point numbers, then feed it into the underlying diffusion model neural network. However, one has freedom in choosing how to convert the conditioning into a vector.

Stable Diffusion, for example, imposes conditioning through a cross-attention mechanism, where the query is an intermediate representation of the image in the U-Net, and both key and value are the conditioning vectors. The conditioning can be selectively applied to only parts of an image, and new kinds of conditioning can be fine-tuned upon the base model, as used in ControlNet.[32]
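The roles of query, key, and value in such a layer can be sketched as follows. This is a generic single-head cross-attention in NumPy, not Stable Diffusion's actual implementation: the function name, the random projection matrices, and all dimensions are assumptions for illustration, and the output projection and residual connection that a real layer would normally have are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, cond_vecs, W_q, W_k, W_v):
    """Queries come from image features; keys and values from the conditioning."""
    Q = image_feats @ W_q                      # (n_pixels, d)
    K = cond_vecs @ W_k                        # (n_tokens, d)
    V = cond_vecs @ W_v                        # (n_tokens, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n_pixels, n_tokens)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_pixels, n_tokens, d_img, d_cond, d = 64, 5, 32, 48, 16   # hypothetical sizes
image_feats = rng.standard_normal((n_pixels, d_img))   # flattened 8x8 feature map
cond_vecs = rng.standard_normal((n_tokens, d_cond))    # e.g. text token embeddings
W_q = rng.standard_normal((d_img, d))
W_k = rng.standard_normal((d_cond, d))
W_v = rng.standard_normal((d_cond, d))
out, weights = cross_attention(image_feats, cond_vecs, W_q, W_k, W_v)
```

Each row of `weights` says how strongly one image location attends to each conditioning vector, which is why the conditioning can influence different parts of the image differently.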

As a particularly simple example, consider image inpainting. The conditions are $\tilde{x}$, the reference image, and $m$, the inpainting mask. The conditioning is imposed at each step of the backward diffusion process, by first sampling $\tilde{x}_t \sim N\left(\sqrt{\bar{\alpha}_t}\tilde{x}, (1-\bar{\alpha}_t)I\right)$, a noisy version of $\tilde{x}$, then replacing $x_t$ with $(1-m) \odot x_t + m \odot \tilde{x}_t$, where $\odot$ means elementwise multiplication.[33]
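The replacement step can be written directly from the formula. In the sketch below, `inpaint_condition` is a hypothetical name, `alpha_bar_t` is the cumulative noise-schedule product at the current step, and the mask is assumed to be 1 where pixels are taken from the reference image and 0 where the model generates freely.

```python
import numpy as np

def inpaint_condition(x_t, x_ref, mask, alpha_bar_t, rng):
    """Overwrite the masked region of x_t with a noised copy of the reference."""
    noise = rng.standard_normal(x_ref.shape)
    x_ref_t = np.sqrt(alpha_bar_t) * x_ref + np.sqrt(1.0 - alpha_bar_t) * noise
    return (1.0 - mask) * x_t + mask * x_ref_t

rng = np.random.default_rng(0)
x_ref = np.full((8, 8), 0.5)          # reference image (toy values)
x_t = rng.standard_normal((8, 8))     # current sample in the backward process
mask = np.zeros((8, 8))
mask[:, :4] = 1.0                     # left half comes from the reference
x_t_cond = inpaint_condition(x_t, x_ref, mask, alpha_bar_t=0.3, rng=rng)
```

This function would be called once per backward step, before the denoising network is applied, so that the masked region always matches the reference at the current noise level.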

Conditioning is not limited to just generating images from a specific category, or according to a specific caption (as in text-to-image). For example, the Human Motion Diffusion Model[31] demonstrated generating human motion conditioned on an audio clip of human walking (allowing syncing motion to a soundtrack), a video of human running, or a text description of human motion.

Upscaling

As generating an image takes a long time, one can try to generate a small image with a base diffusion model, then upscale it with other models. Upscaling can be done by a GAN,[34] a Transformer,[35] or signal processing methods like Lanczos resampling.

Diffusion models themselves can be used to perform upscaling. A cascading diffusion model stacks multiple diffusion models one after another, in the style of Progressive GAN. The lowest level is a standard diffusion model that generates a 32×32 image; the image is then upscaled by a diffusion model specifically trained for upscaling, and the process repeats.[29]

In more detail, the diffusion upscaler is trained as follows:[29]

  • Sample $(x_0, z_0, c)$, where $x_0$ is the high-resolution image, $z_0$ is the same image but scaled down to a low resolution, and $c$ is the conditioning, which can be the caption of the image, the class of the image, etc.
  • Sample two white noises $\epsilon_x, \epsilon_z$ and two time-steps $t_x, t_z$. Compute the noisy versions of the high-resolution and low-resolution images: $\begin{cases} x_{t_x} &= \sqrt{\bar{\alpha}_{t_x}} x_0 + \sqrt{1-\bar{\alpha}_{t_x}} \epsilon_x \\ z_{t_z} &= \sqrt{\bar{\alpha}_{t_z}} z_0 + \sqrt{1-\bar{\alpha}_{t_z}} \epsilon_z \end{cases}$.
  • Train the denoising network to predict $\epsilon_x$ given $x_{t_x}, z_{t_z}, t_x, t_z, c$. That is, apply gradient descent on $\theta$ on the L2 loss $\|\epsilon_\theta(x_{t_x}, z_{t_z}, t_x, t_z, c) - \epsilon_x\|_2^2$.
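The three steps above can be sketched as a single loss computation. Everything named here is an assumption for illustration: `eps_model` is a placeholder for the denoising network (here it returns zeros), the low-resolution image is produced by 4× average pooling, and the noise schedule is linear. The gradient step itself is omitted, since in practice it would be handled by an automatic differentiation framework.

```python
import numpy as np

def noisy(x, alpha_bar_t, eps):
    """sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * eps

def upscaler_loss(eps_model, x0, z0, c, alpha_bar, rng):
    """L2 loss for one training example of the diffusion upscaler."""
    T = len(alpha_bar)
    t_x, t_z = rng.integers(1, T + 1, size=2)   # two independent time-steps in 1..T
    eps_x = rng.standard_normal(x0.shape)       # noise for the high-res image
    eps_z = rng.standard_normal(z0.shape)       # noise for the low-res image
    x_t = noisy(x0, alpha_bar[t_x - 1], eps_x)
    z_t = noisy(z0, alpha_bar[t_z - 1], eps_z)
    pred = eps_model(x_t, z_t, t_x, t_z, c)
    return np.sum((pred - eps_x) ** 2)

def dummy_eps_model(x_t, z_t, t_x, t_z, c):
    """Placeholder for a trained network: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # assumed schedule
x0 = rng.random((64, 64))                                # high-resolution image
z0 = x0.reshape(16, 4, 16, 4).mean(axis=(1, 3))          # 16x16 downscaled copy
loss = upscaler_loss(dummy_eps_model, x0, z0, c="a caption", alpha_bar=alpha_bar, rng=rng)
```

Note that only the noise added to the high-resolution image is predicted; the noised low-resolution image and its time-step serve purely as conditioning inputs.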

Examples

This section collects some notable diffusion models, and briefly describes their architecture.

OpenAI

The DALL-E series by OpenAI are text-conditional diffusion models of images.

The first version of DALL-E (2021) is not actually a diffusion model. Instead, it uses a Transformer architecture that generates a sequence of tokens, which is then converted to an image by the decoder of a discrete VAE. Released with DALL-E was the CLIP classifier, which was used by DALL-E to rank generated images according to how close the image fits the text.

GLIDE (2022-03)[36] is a 3.5-billion-parameter diffusion model, and a small version was released publicly.[4] Soon after, DALL-E 2 was released (2022-04).[37] DALL-E 2 is a 3.5-billion-parameter cascaded diffusion model that generates images from text by "inverting the CLIP image encoder", a technique which they termed "unCLIP".

Sora (2024-02) is a diffusion Transformer model (DiT).

Stability AI

Stable Diffusion (2022-08), released by Stability AI, consists of a denoising latent diffusion model (860 million parameters), a VAE, and a text encoder. The denoising network is a U-Net, with cross-attention blocks to allow for conditional image generation.[38][19]

Stable Diffusion 3 (2024-02)[39] changed the latent diffusion model from the U-Net to a Transformer model, and so it is a DiT. It uses rectified flow.

Google Imagen

Imagen (2022-05)[40][41] uses a T5 language model to encode the input text into embeddings. It is a cascaded diffusion model with three steps. The first step denoises white noise to a 64×64 image, conditional on the text embedding. The second step upscales the image from 64×64 to 256×256, conditional on the text embedding. The third step is similar, upscaling from 256×256 to 1024×1024. The three denoising networks are all U-Nets.

Imagen 2 (2023-12) is also diffusion-based. It can generate images based on a prompt that mixes images and text. No further technical details are available.[42]

Further reading

  • Guidance: a cheat code for diffusion models. Overview of classifier guidance and classifier-free guidance, light on mathematical details.
  • Mathematical details omitted in the article.
    • "Power of Diffusion Models". AstraBlog. 2022-09-25. Retrieved 2023-09-25.
    • Weng, Lilian (2021-07-11). "What are Diffusion Models?". lilianweng.github.io. Retrieved 2023-09-25.

References

  1. ^ Chang, Ziyi; Koulieris, George Alex; Shum, Hubert P. H. (2023). "On the Design Fundamentals of Diffusion Models: A Survey". arXiv:2306.04542 [cs.LG].
  2. ^ a b Song, Yang; Sohl-Dickstein, Jascha; Kingma, Diederik P.; Kumar, Abhishek; Ermon, Stefano; Poole, Ben (2021-02-10). "Score-Based Generative Modeling through Stochastic Differential Equations". arXiv:2011.13456 [cs.LG].
  3. ^ Gu, Shuyang; Chen, Dong; Bao, Jianmin; Wen, Fang; Zhang, Bo; Chen, Dongdong; Yuan, Lu; Guo, Baining (2021). "Vector Quantized Diffusion Model for Text-to-Image Synthesis". arXiv:2111.14822 [cs.CV].
  4. ^ a b GLIDE, OpenAI, 2023-09-22, retrieved 2023-09-24
  5. ^ Li, Yifan; Zhou, Kun; Zhao, Wayne Xin; Wen, Ji-Rong (August 2023). "Diffusion Models for Non-autoregressive Text Generation: A Survey". Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization. pp. 6692–6701. arXiv:2303.06574. doi:10.24963/ijcai.2023/750. ISBN 978-1-956792-03-4.
  6. ^ Han, Xiaochuang; Kumar, Sachin; Tsvetkov, Yulia (2023). "SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control". Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics: 11575–11596. arXiv:2210.17432. doi:10.18653/v1/2023.acl-long.647.
  7. ^ Xu, Weijie; Hu, Wenxiang; Wu, Fanyou; Sengamedu, Srinivasan (2023). "DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM". Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg, PA, USA: Association for Computational Linguistics: 9040–9057. arXiv:2310.15296. doi:10.18653/v1/2023.findings-emnlp.606.
  8. ^ Zhang, Haopeng; Liu, Xiao; Zhang, Jiawei (2023). "DiffuSum: Generation Enhanced Extractive Summarization with Diffusion". Findings of the Association for Computational Linguistics: ACL 2023. Stroudsburg, PA, USA: Association for Computational Linguistics: 13089–13100. arXiv:2305.01735. doi:10.18653/v1/2023.findings-acl.828.
  9. ^ a b Ho, Jonathan; Jain, Ajay; Abbeel, Pieter (2020). "Denoising Diffusion Probabilistic Models". Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 6840–6851.
  10. ^ Croitoru, Florinel-Alin; Hondru, Vlad; Ionescu, Radu Tudor; Shah, Mubarak (2023). "Diffusion Models in Vision: A Survey". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (9): 10850–10869. arXiv:2209.04747. doi:10.1109/TPAMI.2023.3261988. PMID 37030794. S2CID 252199918.
  11. ^ Sohl-Dickstein, Jascha; Weiss, Eric; Maheswaranathan, Niru; Ganguli, Surya (2015-06-01). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (PDF). Proceedings of the 32nd International Conference on Machine Learning. 37. PMLR: 2256–2265.
  12. ^ Weng, Lilian (2021-07-11). "What are Diffusion Models?". lilianweng.github.io. Retrieved 2023-09-24.
  13. ^ "Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song". yang-song.net. Retrieved 2023-09-24.
  14. ^ Song, Yang; Sohl-Dickstein, Jascha; Kingma, Diederik P.; Kumar, Abhishek; Ermon, Stefano; Poole, Ben (2021-02-10). "Score-Based Generative Modeling through Stochastic Differential Equations". arXiv:2011.13456 [cs.LG].
  15. ^ "Sliced Score Matching: A Scalable Approach to Density and Score Estimation | Yang Song". yang-song.net. Retrieved 2023-09-24.
  16. ^ Anderson, Brian D.O. (May 1982). "Reverse-time diffusion equation models". Stochastic Processes and Their Applications. 12 (3): 313–326. doi:10.1016/0304-4149(82)90051-5. ISSN 0304-4149.
  17. ^ Luo, Calvin (2022). "Understanding Diffusion Models: A Unified Perspective". arXiv:2208.11970v1 [cs.LG].
  18. ^ Song, Jiaming; Meng, Chenlin; Ermon, Stefano (3 Oct 2023). "Denoising Diffusion Implicit Models". arXiv:2010.02502 [cs.LG].
  19. ^ a b Rombach, Robin; Blattmann, Andreas; Lorenz, Dominik; Esser, Patrick; Ommer, Björn (13 April 2022). "High-Resolution Image Synthesis With Latent Diffusion Models". arXiv:2112.10752 [cs.CV].
  20. ^ Dhariwal, Prafulla; Nichol, Alex (2021-06-01). "Diffusion Models Beat GANs on Image Synthesis". arXiv:2105.05233 [cs.LG].
  21. ^ Ho, Jonathan; Salimans, Tim (2022-07-25). "Classifier-Free Diffusion Guidance". arXiv:2207.12598 [cs.LG].
  22. ^ Yang, Ling; Zhang, Zhilong; Song, Yang; Hong, Shenda; Xu, Runsheng; Zhao, Yue; Zhang, Wentao; Cui, Bin; Yang, Ming-Hsuan (2022). "Diffusion Models: A Comprehensive Survey of Methods and Applications". arXiv:2209.00796 [cs.CV].
  23. ^ Karras, Tero; Aittala, Miika; Aila, Timo; Laine, Samuli (2022). "Elucidating the Design Space of Diffusion-Based Generative Models". arXiv:2206.00364v2 [cs.CV].
  24. ^ a b Liu, Xingchao; Gong, Chengyue; Liu, Qiang (2022-09-07). "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow". arXiv:2209.03003 [cs.LG].
  25. ^ Liu, Qiang (2022-09-29). "Rectified Flow: A Marginal Preserving Approach to Optimal Transport". arXiv:2209.14577 [stat.ML].
  26. ^ "Rectified Flow — Rectified Flow". www.cs.utexas.edu. Retrieved 2024-04-04.
  27. ^ Lipman, Yaron; Chen, Ricky T. Q.; Ben-Hamu, Heli; Nickel, Maximilian; Le, Matt (2023-02-08), Flow Matching for Generative Modeling, arXiv:2210.02747
  28. ^ Albergo, Michael S.; Vanden-Eijnden, Eric (2023-03-09), Building Normalizing Flows with Stochastic Interpolants, arXiv:2209.15571
  29. ^ a b c Ho, Jonathan; Saharia, Chitwan; Chan, William; Fleet, David J.; Norouzi, Mohammad; Salimans, Tim (2022-01-01). "Cascaded diffusion models for high fidelity image generation". The Journal of Machine Learning Research. 23 (1): 47:2249–47:2281. arXiv:2106.15282. ISSN 1532-4435.
  30. ^ Peebles, William; Xie, Saining (March 2023). "Scalable Diffusion Models with Transformers". arXiv:2212.09748v2 [cs.CV].
  31. ^ a b Tevet, Guy; Raab, Sigal; Gordon, Brian; Shafir, Yonatan; Cohen-Or, Daniel; Bermano, Amit H. (2022). "Human Motion Diffusion Model". arXiv:2209.14916 [cs.CV].
  32. ^ Zhang, Lvmin; Rao, Anyi; Agrawala, Maneesh (2023). "Adding Conditional Control to Text-to-Image Diffusion Models". arXiv:2302.05543 [cs.CV].
  33. ^ Lugmayr, Andreas; Danelljan, Martin; Romero, Andres; Yu, Fisher; Timofte, Radu; Van Gool, Luc (2022). "RePaint: Inpainting Using Denoising Diffusion Probabilistic Models". arXiv:2201.09865v4 [cs.CV].
  34. ^ Wang, Xintao; Xie, Liangbin; Dong, Chao; Shan, Ying (2021). "Real-ESRGAN: Training Real-World Blind Super-Resolution With Pure Synthetic Data" (PDF). Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021. International Conference on Computer Vision. pp. 1905–1914. arXiv:2107.10833.
  35. ^ Liang, Jingyun; Cao, Jiezhang; Sun, Guolei; Zhang, Kai; Van Gool, Luc; Timofte, Radu (2021). "SwinIR: Image Restoration Using Swin Transformer" (PDF). Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. International Conference on Computer Vision, 2021. pp. 1833–1844. arXiv:2108.10257v1.
  36. ^ Nichol, Alex; Dhariwal, Prafulla; Ramesh, Aditya; Shyam, Pranav; Mishkin, Pamela; McGrew, Bob; Sutskever, Ilya; Chen, Mark (2022-03-08). "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models". arXiv:2112.10741 [cs.CV].
  37. ^ Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022-04-12). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125 [cs.CV].
  38. ^ Alammar, Jay. "The Illustrated Stable Diffusion". jalammar.github.io. Retrieved 2022-10-31.
  39. ^ Esser, Patrick; Kulal, Sumith; Blattmann, Andreas; Entezari, Rahim; Müller, Jonas; Saini, Harry; Levi, Yam; Lorenz, Dominik; Sauer, Axel (2024-03-05), Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, arXiv:2403.03206
  40. ^ "Imagen: Text-to-Image Diffusion Models". imagen.research.google. Retrieved 2024-04-04.
  41. ^ Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily L.; Ghasemipour, Kamyar; Gontijo Lopes, Raphael; Karagol Ayan, Burcu; Salimans, Tim; Ho, Jonathan; Fleet, David J.; Norouzi, Mohammad (2022-12-06). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". Advances in Neural Information Processing Systems. 35: 36479–36494. arXiv:2205.11487.
  42. ^ "Imagen 2 - our most advanced text-to-image technology". Google DeepMind. Retrieved 2024-04-04.