Diffusion Model Learning

Summary of Diffusion Models#

Structure#

  • A fixed forward noise-adding process
  • A U-Net used to learn how to denoise

Noise-Adding Process#

“Fixed” is the key to understanding the noise-adding process!

Noise-adding is a fixed process with no learnable parameters. Given an image (and a fixed noise schedule), adding noise step by step 100 times and adding the equivalent amount of noise in a single shot give the same result in distribution; this is the “nice property” derived later.

How to Obtain Noise?#

First, we introduce the concept of a distribution. We assume the distribution of real images is $q(\mathbf{x}_0)$, and $\mathbf{x}_0$ is a real image sampled from $q(\mathbf{x}_0)$.

You can think of the distribution as a class with 50 students, and the sampled $\mathbf{x}_0$ is one student.

The noise-adding process is done step by step. Each step is denoted as $t$, and a total of $T$ noise-adding steps are required.

We obtain noise through a Gaussian distribution. Here, it is important to know that a Gaussian distribution is determined by two parameters, mean and variance, denoted $\mu$ and $\sigma^2$. At noise-adding step $t$, the noisy image is drawn from a Gaussian distribution with mean $\mu_t=\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}$ and variance $\sigma_t^2=\beta_t$. This can equivalently be expressed as: at each step, sample an $\epsilon$ from the standard normal distribution, i.e., $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, and combine it with the previous image as in the formula below.

The Noisy Image at Step $t$, $\mathbf{x}_t$#

The image at noise-adding step $t$ is expressed as $\mathbf{x}_t=\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}+\sqrt{\beta_t}\,\epsilon$.

Note that $\beta_t$ here is not a constant; it changes with $t$ and satisfies $\beta_1<\beta_2<\beta_3<\cdots<\beta_T<1$. The schedule can be linear, quadratic, cosine, etc.

The final noisy image $\mathbf{x}_T$ should be pure noise.
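As a minimal concrete sketch of this step-by-step noise-adding process (PyTorch, with an assumed linear schedule and an assumed $T=1000$; `x0` is just a stand-in tensor for a real image):

```python
import torch

T = 1000                                # assumed total number of noise-adding steps
betas = torch.linspace(1e-4, 0.02, T)   # an assumed linear beta schedule (grows with t)

def add_noise_step(x_prev, t):
    """One forward step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    eps = torch.randn_like(x_prev)      # eps ~ N(0, I)
    return torch.sqrt(1 - betas[t]) * x_prev + torch.sqrt(betas[t]) * eps

x0 = torch.randn(3, 64, 64)             # stand-in for a real image x0 sampled from q(x0)
x = x0
for t in range(T):                      # after T steps, x is essentially pure noise
    x = add_noise_step(x, t)
```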

Denoising Process#

The denoising process transforms the noisy result $\mathbf{x}_T$ step by step back into the original image $\mathbf{x}_0$.

What is the purpose of learning this denoising process? Once denoising is learned, we can sample fresh pure noise and denoise it step by step to generate entirely new images, giving the network the ability to generate on its own.

We denote this denoising process as $p(\mathbf{x}_{t-1}|\mathbf{x}_t)$, but this distribution cannot be computed directly. Therefore, we use a neural network to approximate it, i.e., $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$, where $\theta$ represents the parameters of the neural network.

How to Fit the Denoising Process?#

Here, we assume that the denoising process also follows a Gaussian distribution. That is, the neural network needs to learn the two parameters mentioned earlier: $\mu_t$ and $\sigma_t^2$. (DDPM fixes the variance and only learns the mean, which has been improved in subsequent papers.)
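To make this concrete, here is a rough sketch of a single denoising step under that assumption: a network (called `mu_theta` here, a hypothetical name) predicts the mean, while the variance is kept fixed, e.g. at $\beta_t$, one of the fixed choices used in DDPM:

```python
import torch

def denoise_step(x_t, t, mu_theta, betas):
    """One reverse step: sample x_{t-1} from N(mu_theta(x_t, t), beta_t * I)."""
    mean = mu_theta(x_t, t)                 # learned mean of p_theta(x_{t-1} | x_t)
    if t == 0:
        return mean                         # no extra noise on the final step
    z = torch.randn_like(x_t)               # fresh standard Gaussian noise
    return mean + torch.sqrt(betas[t]) * z  # fixed variance sigma_t^2 = beta_t
```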

Defining the Objective Function#

To drive the neural network to learn the mean during the denoising process, the authors treat the noise-adding process $q$ and the denoising process $p_\theta$ as a VAE (variational auto-encoder). If you are not familiar with VAEs, you can skip this part; just know how the final loss function is calculated. (This involves KL divergence, the ELBO (evidence lower bound), and other probability-theory concepts. If curious, you can explore each one.)

I haven’t fully understood this part yet.

After a series of derivations, a “nice property” is obtained:

$$q(\mathbf{x}_t|\mathbf{x}_0)=\mathcal{N}\!\left(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\,(1-\bar{\alpha}_t)\mathbf{I}\right)$$

where $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\prod_{s=1}^t\alpha_s$. This property means:

  1. Noise can be sampled from a Gaussian distribution, and through appropriate scaling, $\mathbf{x}_0$ can be directly transformed into $\mathbf{x}_t$, where $\bar{\alpha}_t$ can be calculated from the known $\beta_t$.
  2. The network predicting the mean can be converted into a network predicting noise.

The final objective function is defined as: $\|\epsilon-\epsilon_\theta(\mathbf{x}_t,t)\|^2=\|\epsilon-\epsilon_\theta(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,\,t)\|^2$
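A small sketch of how this objective can be computed for one image, reusing the assumed linear schedule from before (`eps_theta` is the noise-prediction network, a hypothetical placeholder):

```python
import torch
import torch.nn.functional as F

betas = torch.linspace(1e-4, 0.02, 1000)     # same assumed linear schedule as above
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{s<=t} alpha_s

def diffusion_loss(eps_theta, x0, t):
    """|| eps - eps_theta(sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, t) ||^2"""
    eps = torch.randn_like(x0)                                    # eps ~ N(0, I)
    a_bar = alpha_bars[t]
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps  # jump to step t in one shot
    return F.mse_loss(eps_theta(x_t, t), eps)                     # squared error on the noise
```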

Summary#

  1. Randomly sample an image $\mathbf{x}_0$ from the real-world data distribution $q(\mathbf{x}_0)$.
  2. Uniformly sample a noise level $t$ from $1$ to $T$.
  3. Sample noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ from a Gaussian distribution and corrupt the sampled image $\mathbf{x}_0$ to obtain $\mathbf{x}_t=\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$.
  4. The neural network predicts the noise $\epsilon_\theta(\mathbf{x}_t,t)$ from the corrupted image $\mathbf{x}_t$, and training minimizes $\|\epsilon-\epsilon_\theta(\mathbf{x}_t,t)\|^2$ (see the training-loop sketch below).
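Putting steps 1–4 together, a minimal training-loop sketch; the `UNet` model, `dataloader`, and hyperparameters below are hypothetical placeholders, not something specified in this post:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # assumed linear schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

model = UNet()                                     # hypothetical noise-prediction network eps_theta
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

for x0 in dataloader:                              # step 1: real images x0 ~ q(x0)
    t = torch.randint(0, T, (x0.shape[0],))        # step 2: noise levels (0-indexed here)
    eps = torch.randn_like(x0)                     # step 3: eps ~ N(0, I)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)        # broadcast over (C, H, W)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    loss = F.mse_loss(model(x_t, t), eps)          # step 4: predict the noise and compare
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```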