Understanding Diffusion Models in AI

Reprinted from | GiantPandaTV
Author | Liang Depeng

0. Introduction

Recently, AI image generation has become very popular, and one of the core technologies behind it is the Diffusion Model. Fully understanding the Diffusion Model and its formula derivations requires a solid mathematical foundation, but that need not stop us from grasping its principles. The following explains what the Diffusion Model is, based on the author's understanding.

1. What is the Diffusion Model

Forward Diffusion Process

The Diffusion Model first defines a forward diffusion process consisting of T time steps in total, as shown in the figure below:

[Figure: the forward diffusion process, from the real image x_0 to pure Gaussian noise x_T]

The leftmost blue circle x_0 represents the real natural image, corresponding to the dog photo in the figure.

The rightmost blue circle x_T represents pure Gaussian noise, corresponding to the noise image in the figure.

The middle blue circle x_t represents a noisy version of x_0, corresponding to the noisy dog image in the figure.

The arrow underneath, q(x_t|x_{t-1}), represents a Gaussian distribution whose mean is a scaled copy of the previous state x_{t-1}; x_t is sampled from it.

The forward diffusion process can be understood as a Markov chain (see reference [7]) that gradually adds Gaussian noise to a real image until it eventually becomes pure Gaussian noise.

So how exactly is noise added? The formula is as follows:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)$$

At each time step, x_t is sampled from a Gaussian distribution whose mean is √(1-β_t) · x_{t-1} and whose variance is β_t.

Here β_t, t ∈ [1, T], is a series of fixed values generated by a schedule.

Reference [2] sets T = 1000, β_1 = 0.0001, β_T = 0.02, and generates all β_t values with a single line of code:

import torch

# https://pytorch.org/docs/stable/generated/torch.linspace.html
betas = torch.linspace(start=0.0001, end=0.02, steps=1000)

When sampling x_t, instead of sampling directly from the Gaussian distribution q(x_t|x_{t-1}), a reparameterization trick is used (see reference [4], page 5).

In simple terms, to sample from a Gaussian distribution with arbitrary mean μ and variance σ², you can first sample ε from the standard Gaussian distribution (mean 0, variance 1); then μ + σ·ε is equivalent to a sample from the target distribution. The formula is as follows:

$$x = \mu + \sigma \cdot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$
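A quick numerical check of this trick (an illustrative snippet added here, not part of the original derivation):

import torch

mu, sigma = 3.0, 0.5
# Sample from the standard Gaussian, then scale and shift
eps = torch.randn(100000)
samples = mu + sigma * eps
# The empirical statistics match N(mu, sigma^2)
print(samples.mean().item())  # ~= 3.0
print(samples.std().item())   # ~= 0.5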

Next, let's look at how the noisy image x_t is sampled:

$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

First, sample from a standard Gaussian distribution, then multiply by the standard deviation and add the mean, pseudo-code is as follows:

# https://pytorch.org/docs/stable/generated/torch.randn_like.html
betas = torch.linspace(start=0.0001, end=0.02, steps=1000)
noise = torch.randn_like(x_prev)  # x_prev denotes x_{t-1}
x_t = torch.sqrt(1 - betas[t]) * x_prev + torch.sqrt(betas[t]) * noise

The forward diffusion process also has a useful property: the noisy image x_t at any intermediate time step can be sampled directly from x_0. The formula is as follows:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right)$$

where α_t and ᾱ_t denote:

$$\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$$

The detailed derivation can be found in reference [4], page 11. The pseudo-code is as follows:

betas = torch.linspace(start=0.0001, end=0.02, steps=1000)
alphas = 1 - betas
# cumprod is equivalent to calculating a prefix product for each time step t
# https://pytorch.org/docs/stable/generated/torch.cumprod.html
alphas_cum = torch.cumprod(alphas, 0)
alphas_cum_s = torch.sqrt(alphas_cum)
alphas_cum_sm = torch.sqrt(1 - alphas_cum)

# Apply the reparameterization trick to sample xt
noise = torch.randn_like(x_0)
x_t = alphas_cum_s[t] * x_0 + alphas_cum_sm[t] * noise

Through the above explanation, readers should have a clear understanding of the forward diffusion process of the Diffusion Model.

But our goal is to generate images, right?

So far we have only turned a real image from the dataset into noise, so how exactly do we generate images?

Reverse Diffusion Process

[Figure: the diffusion process, with the reverse transition q(x_{t-1}|x_t, x_0) drawn as a pink arrow]

The reverse diffusion process q(x_{t-1}|x_t, x_0) (the pink arrow in the figure) is the posterior distribution of the forward transition q(x_t|x_{t-1}).

In contrast to the forward process, it starts from the pure Gaussian noise image on the right and samples step by step until the real image x_0 is obtained.

The posterior q(x_{t-1}|x_t, x_0) can be derived using Bayes' theorem (the derivation can be found in reference [4], page 12):

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t\,\mathbf{I}\right)$$

It is also a Gaussian distribution.

Its variance can be treated as a constant, and the variance values for all time steps can be computed in advance:

$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

The calculation pseudo-code is as follows:

betas = torch.linspace(start=0.0001, end=0.02, steps=1000)
alphas = 1 - betas
alphas_cum = torch.cumprod(alphas, 0)
alphas_cum_prev = torch.cat((torch.tensor([1.0]), alphas_cum[:-1]), 0)
posterior_variance = betas * (1 - alphas_cum_prev) / (1 - alphas_cum)

Next, let's look at the calculation of the mean:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t$$

For the reverse diffusion process, when sampling x_{t-1}, x_t is already known, and the remaining coefficients are constants that can be computed in advance.

However, when generating images through the reverse process, we do not know x_0: it is precisely the image we are trying to generate.

This looks like a chicken-and-egg problem, so what can we do?

Diffusion Model Training Objective

When a probability distribution q is hard to work with directly, we can change our approach (see references [5, 6]): artificially construct a new distribution p, and make it our goal to minimize the difference between p and q.

By continuously adjusting the parameters of p to shrink that difference, p can stand in for q once the two are sufficiently similar.

Returning to the reverse diffusion process, we apply exactly this idea, since the posterior q(x_{t-1}|x_t, x_0) cannot be computed directly.

[Figure: the constructed distribution p(x_{t-1}|x_t) drawn as a green arrow next to the posterior q(x_{t-1}|x_t, x_0)]

We construct a Gaussian distribution p(x_{t-1}|x_t) (the green arrow in the figure) whose variance is set to match that of the posterior q(x_{t-1}|x_t, x_0):

$$p(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \tilde{\beta}_t\,\mathbf{I}\right)$$

Its mean is set to:

$$\mu_\theta(x_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_\theta(x_t, t) + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t$$

The difference from q(x_{t-1}|x_t, x_0) is that x_0 is replaced by x_θ(x_t, t), a prediction made by a deep learning model whose inputs are the noisy image x_t and the time step t.

We then minimize the difference between the distributions p(x_{t-1}|x_t) and q(x_{t-1}|x_t, x_0), which turns into optimizing the following objective function (the derivation can be found in reference [4], page 13):

$$\underset{\theta}{\arg\min}\ \frac{1}{2\tilde{\beta}_t}\,\frac{\bar{\alpha}_{t-1}\,\beta_t^2}{(1-\bar{\alpha}_t)^2}\,\bigl\|x_\theta(x_t, t) - x_0\bigr\|^2$$
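Why the gap between the two distributions turns into a squared distance between means: for two Gaussians that share the same variance, the KL divergence has the closed form

$$D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_q,\ \sigma^2\mathbf{I})\ \|\ \mathcal{N}(\mu_p,\ \sigma^2\mathbf{I})\right) = \frac{1}{2\sigma^2}\,\|\mu_q - \mu_p\|^2,$$

so once the variance of p is fixed to match that of q, only the means remain to be matched.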

However, asking the model to predict x_0 directly from x_t is too hard a fitting task, so we change our approach once more.

In the earlier introduction of the forward diffusion process, it was mentioned that x_t can be obtained directly from x_0:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

Rearranging this formula yields:

$$x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}$$

Substituting it into the mean expression of q(x_{t-1}|x_t, x_0) gives (the derivation can be found in reference [4], page 15):

$$\tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right)$$

Observing the transformed expression, we find that the mean of the posterior q(x_{t-1}|x_t, x_0) depends only on x_t and the noise ε added at time step t of the forward diffusion.
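As a sanity check, the two forms of the posterior mean can be compared numerically; the snippet below (an illustrative addition, reusing the earlier variable names) confirms they agree:

import torch

betas = torch.linspace(start=0.0001, end=0.02, steps=1000)
alphas = 1 - betas
alphas_cum = torch.cumprod(alphas, 0)
alphas_cum_prev = torch.cat((torch.tensor([1.0]), alphas_cum[:-1]), 0)

t = 500
x0 = torch.randn(3, 64, 64)
eps = torch.randn_like(x0)
# Forward-sample x_t directly from x_0
xt = torch.sqrt(alphas_cum[t]) * x0 + torch.sqrt(1 - alphas_cum[t]) * eps

# Posterior mean written in terms of x_0 and x_t
mean_a = (torch.sqrt(alphas_cum_prev[t]) * betas[t] * x0
          + torch.sqrt(alphas[t]) * (1 - alphas_cum_prev[t]) * xt) / (1 - alphas_cum[t])
# Posterior mean rewritten in terms of x_t and the forward noise only
mean_b = (xt - betas[t] / torch.sqrt(1 - alphas_cum[t]) * eps) / torch.sqrt(alphas[t])

print(torch.allclose(mean_a, mean_b, atol=1e-5))  # True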

So we modify the mean of the constructed distribution p(x_{t-1}|x_t) accordingly:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

The model is now trained to predict the Gaussian noise ε added at forward time step t, with the noisy image x_t and the time step t as inputs:

$$\epsilon_\theta(x_t, t)$$

The objective function to optimize then becomes (the derivation can be found in reference [4], page 15):

$$\underset{\theta}{\arg\min}\ \frac{\beta_t^2}{2\,\tilde{\beta}_t\,\alpha_t\,(1-\bar{\alpha}_t)}\,\bigl\|\epsilon - \epsilon_\theta(x_t, t)\bigr\|^2$$

The training procedure is then described by the following algorithm; the coefficients in front of the final objective function are dropped because they are constants:

[Figure: the training algorithm from reference [2]]

It can be seen that although the derivation process is complex, the training process is quite simple.

In each iteration, a real image x_0 is drawn from the dataset and a time step t is sampled from a uniform distribution.

Then noise ε is sampled from the standard Gaussian distribution, and x_t is computed according to the closed-form formula above.

Next, x_t and t are fed into the model, which outputs a prediction of the noise ε; the model is updated by gradient descent until convergence.

The deep learning model used has a structure similar to a UNet (see reference [2], Appendix B).

The pseudo-code for the training process is as follows:

betas = torch.linspace(start=0.0001, end=0.02, steps=1000)
alphas = 1 - betas
alphas_cum = torch.cumprod(alphas, 0)
alphas_cum_s = torch.sqrt(alphas_cum)
alphas_cum_sm = torch.sqrt(1 - alphas_cum)

def diffusion_loss(model, x0, t, noise):
    # Compute x_t from x_0 in one step using the closed-form formula;
    # reshape the per-sample coefficients so they broadcast over (C, H, W)
    xt = (alphas_cum_s[t].view(-1, 1, 1, 1) * x0
          + alphas_cum_sm[t].view(-1, 1, 1, 1) * noise)
    # The model predicts the noise that was added
    predicted_noise = model(xt, t)
    # MSE between the predicted and true noise
    return torch.nn.functional.mse_loss(predicted_noise, noise)

for x0 in data_loader:
    # x0 is a batch of real images from the dataset
    # Sample a time step uniformly for each image in the batch
    t = torch.randint(0, 1000, (x0.shape[0],))
    # Sample Gaussian noise
    noise = torch.randn_like(x0)
    loss = diffusion_loss(model, x0, t, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
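The model above is left abstract. Any module that takes the noisy image x_t plus the time step t and returns a tensor shaped like x_t fits the interface; as a hypothetical stand-in (far simpler than the actual UNet of reference [2]):

import torch
import torch.nn as nn

class ToyNoisePredictor(nn.Module):
    """Hypothetical minimal model with the right interface; the real
    architecture in reference [2] is a UNet with attention and
    sinusoidal time-step embeddings."""
    def __init__(self, channels=3, dim=64, num_steps=1000):
        super().__init__()
        self.time_embed = nn.Embedding(num_steps, dim)
        self.conv_in = nn.Conv2d(channels, dim, kernel_size=3, padding=1)
        self.conv_out = nn.Conv2d(dim, channels, kernel_size=3, padding=1)

    def forward(self, x_t, t):
        h = self.conv_in(x_t)
        # Inject the time step by broadcasting its embedding over H and W
        h = h + self.time_embed(t).view(-1, h.shape[1], 1, 1)
        return self.conv_out(torch.relu(h))

model = ToyNoisePredictor()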

Diffusion Model Image Generation Process

After the model is trained, actual inference must generate images step by step, starting from time step T. The algorithm is described as follows:

[Figure: the sampling algorithm from reference [2]]

At the start, pure noise is sampled from the standard Gaussian distribution. Then, at each time step t, the current image x_t is fed into the model to predict the noise; fresh noise is sampled from the standard Gaussian distribution, and x_{t-1} is computed using the reparameterization trick together with the posterior mean and variance formulas. This repeats down to time step 1.
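Putting that description into pseudo-code in the same style as the training loop (my own sketch of the procedure, reusing the quantities computed earlier):

import torch

betas = torch.linspace(start=0.0001, end=0.02, steps=1000)
alphas = 1 - betas
alphas_cum = torch.cumprod(alphas, 0)
alphas_cum_prev = torch.cat((torch.tensor([1.0]), alphas_cum[:-1]), 0)
posterior_variance = betas * (1 - alphas_cum_prev) / (1 - alphas_cum)

@torch.no_grad()
def sample(model, shape):
    # Start from pure Gaussian noise x_T
    x_t = torch.randn(shape)
    for t in reversed(range(1000)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        # The model predicts the forward noise at step t
        predicted_noise = model(x_t, t_batch)
        # Posterior mean, written in terms of x_t and the predicted noise
        mean = (x_t - betas[t] / torch.sqrt(1 - alphas_cum[t]) * predicted_noise) \
               / torch.sqrt(alphas[t])
        if t > 0:
            # Reparameterization trick with the posterior variance
            noise = torch.randn_like(x_t)
            x_t = mean + torch.sqrt(posterior_variance[t]) * noise
        else:
            # No noise is added at the final step
            x_t = mean
    return x_t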

2. Improving the Diffusion Model

Paper [3] proposes several improvements to the Diffusion Model.

Improving the Variance Schedule β_t

As mentioned earlier, the β_t values are generated by dividing a fixed range uniformly into T parts, with each time step taking one point:

betas = torch.linspace(start=0.0001, end=0.02, steps=1000)

Paper [3] observes experimentally that generating the variance β_t this way adds too much noise at the later time steps of forward diffusion.

As a consequence, the later forward time steps contribute little during reverse sampling, and skipping them has no significant impact on the generated results.

Paper [3] therefore proposes a new generation strategy for β_t. The figure below compares it with the original strategy during forward diffusion:

[Figure: forward diffusion under the original linear schedule (top row) and the improved cosine schedule (bottom row)]

In the first row, the original strategy, the image has already become pure Gaussian noise well before the final time step, while in the second row the improved strategy adds noise more slowly, which looks more reasonable.

[Figure: evaluation metric when skipping a prefix of the reverse diffusion time steps, for both strategies]

Experiments on ImageNet 64×64 images show that, under the original strategy, even skipping the first 20% of the reverse diffusion time steps does not significantly hurt the metrics.

Now let's look at the formula for the newly proposed strategy:

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos\!\left(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\right)^{2}$$
$$\beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}$$

where s is set to 0.008 and β_t is capped at a maximum value of 0.999. The pseudo-code is as follows:

import math
import torch

T = 1000
s = 8e-3
ts = torch.arange(T + 1, dtype=torch.float64) / T + s
alphas = ts / (1 + s) * math.pi / 2
alphas = torch.cos(alphas).pow(2)
alphas = alphas / alphas[0]
betas = 1 - alphas[1:] / alphas[:-1]
betas = betas.clamp(max=0.999)
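To see the slower-noising behavior concretely, compare the cumulative products ᾱ_t of the two schedules (an illustrative check added here; the numbers are approximate and are mine, not from the paper):

linear_betas = torch.linspace(start=0.0001, end=0.02, steps=1000)
linear_alphas_cum = torch.cumprod(1 - linear_betas, 0)
cosine_alphas_cum = torch.cumprod(1 - betas, 0)  # betas from the cosine code above

# Halfway through the forward process, the cosine schedule has destroyed
# far less of the signal (roughly 0.49 of alpha_bar remaining vs roughly 0.08)
print(linear_alphas_cum[500].item(), cosine_alphas_cum[500].item())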

Reducing the Number of Time Steps in the Generation Process

Originally, the model is trained with T time steps, and image generation must also traverse all steps from T down to 1. Paper [3] proposes a method that reduces the number of generation steps without retraining, which significantly speeds up generation.

Simply put, the original T time steps are replaced by a shorter sequence of S time steps, and each step s in the S-step sequence is mapped to a step in the original T-step sequence. The pseudo-code is as follows:

T = 1000
S = 100
frac_stride = (T - 1) / (S - 1)
cur_idx = 0.0
s_timesteps = []
for _ in range(S):
    # Map each of the S steps to the nearest original time step
    s_timesteps.append(round(cur_idx))
    cur_idx += frac_stride

Next, the new β values are computed. S_t below denotes the computed s_timesteps:

$$\beta_{S_t} = 1 - \frac{\bar{\alpha}_{S_t}}{\bar{\alpha}_{S_{t-1}}}$$

The pseudo-code is as follows:

alphas = 1 - betas
alphas_cum = torch.cumprod(alphas, 0)
last_alpha_cum = 1.0
new_betas = []
# Iterate through the original alpha prefix product sequence
for i, alpha_cum in enumerate(alphas_cum):
    # When the index i in the original sequence T is in the new sequence S, calculate the new beta
    if i in s_timesteps:
        new_betas.append(1 - alpha_cum / last_alpha_cum)
        last_alpha_cum = alpha_cum
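The resulting new_betas has length S and simply replaces the original schedule. A sketch of how it plugs in (assuming the sampling loop sketched earlier):

new_betas = torch.tensor(new_betas)
assert len(new_betas) == S  # one beta per retained time step
new_alphas_cum = torch.cumprod(1 - new_betas, 0)
# When sampling with the shortened schedule, step i queries the trained
# model at the corresponding original time step, e.g.:
#   predicted_noise = model(x_t, torch.full((batch,), s_timesteps[i]))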

Let’s look at the experimental results:

[Figure: evaluation metric versus number of sampling steps, from paper [3]]

Looking at the solid lines, it can be seen that reducing the number of sampling steps from 1000 to 100 does not significantly degrade the metrics.

References

  • [1] https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/
  • [2] https://arxiv.org/pdf/2006.11239.pdf
  • [3] https://arxiv.org/pdf/2102.09672.pdf
  • [4] https://arxiv.org/pdf/2208.11970.pdf
  • [5] https://www.zhihu.com/question/41765860/answer/1149453776
  • [6] https://www.zhihu.com/question/41765860/answer/331070683
  • [7] https://zh.wikipedia.org/wiki/%E9%A9%AC%E5%B0%94%E5%8F%AF%E5%A4%AB%E9%93%BE
  • [8] https://github.com/rosinality/denoising-diffusion-pytorch
  • [9] https://github.com/openai/improved-diffusion