Understanding Stable Diffusion: A Comprehensive Guide

Author丨tian-feng@Zhihu (Authorized)
Source丨https://zhuanlan.zhihu.com/p/634573765
Editor丨Jishi Platform

Jishi Guide

A detailed interpretation of the Stable Diffusion paper: after reading this article, you will never struggle to understand it again!

Personal website: https://tianfeng.space/

1. Introduction (Can be skipped)

Hello everyone, I am Tian-Feng. Today I will explain some of the principles behind Stable Diffusion in an easy-to-understand way. Since I usually play with AI painting, I decided to write an article on how it works. It took quite a long time to write, so if you find it useful, please give it a thumbs up. Thank you.

Stable Diffusion, the open-source image generation model from Stability AI, made a splash comparable to ChatGPT's and has built momentum that is in no way inferior to Midjourney's. With the support of its many plugins, its capabilities have also been greatly extended, although it is slightly more complex to use than Midjourney.

As for why it is open source, the founder put it roughly like this: I do this because I believe it is part of a shared narrative; someone needs to publicly show what is happening, and open source should be the default assumption. Value does not reside in any proprietary model or data, so we will build auditable open-source models, even if they contain licensed data. Without further ado, let's get started.

2. Stable Diffusion

The figures from the original paper above may look hard to understand at first, but that's okay. I will break the diagram into individual modules, interpret each one, and then put them back together; I believe you will then understand what each step does.

First, I will draw a simplified model diagram corresponding to the original figure for easier understanding. Let's start with the training phase. You may notice that the VAE decoder is missing: that is because training happens entirely in latent space; the decoder comes up in the second, sampling phase. The Stable Diffusion WebUI we usually use for drawing works in the sampling phase. As for the training phase, most ordinary people currently cannot run it: the training time is measured in GPU-years (roughly a year on a single V100), so with 100 cards you could finish in about a month. For ChatGPT, the electricity bill alone runs into the tens of millions of dollars on clusters of thousands of GPUs; it feels like AI is all about compute nowadays. But I digress, let's come back.

[Figure: simplified training and sampling diagram]

1. CLIP

Let's start with the prompt. We input something like 'a black and white striped cat'. CLIP maps the text onto a vocabulary in which every word and punctuation mark has a corresponding number; each such unit is called a token. Previously, Stable Diffusion had a limit of 75 tokens per prompt (this is no longer the case). In this example, six words correspond to eight tokens, because a start token and an end token are added. Each number then maps to a 768-dimensional vector, which you can think of as the word's ID card, and words with very similar meanings map to similar 768-dimensional vectors. Through CLIP, we obtain an (8, 768) text vector to pair with the image.
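To make this concrete, here is a minimal sketch (my own illustration, not the author's code) of turning a prompt into such a text vector with the Hugging Face transformers library; Stable Diffusion v1.x uses OpenAI's openai/clip-vit-large-patch14 text encoder, whose hidden size is 768.

```python
# Minimal sketch: prompt -> (tokens, 768) text embedding with CLIP.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a black and white striped cat"
# Tokenize: a start token and an end token are added around the words.
tokens = tokenizer(prompt, return_tensors="pt")
print(tokens.input_ids)           # e.g. tensor([[49406, ..., 49407]])

with torch.no_grad():
    text_embeddings = text_encoder(**tokens).last_hidden_state
print(text_embeddings.shape)      # expected [1, 8, 768] if the 6 words map to 6 tokens plus start/end
```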

Stable Diffusion uses OpenAI's pre-trained CLIP model, i.e., a model that has already been trained. So how is CLIP trained? How does it link image and text information? (The following digression can be read or skipped without affecting understanding; just know that CLIP converts the prompt into the text vector that conditions the generated image.)

CLIP's training data consists of images and their captions; the dataset contains about 400 million image-text pairs, presumably scraped from the web, with the accompanying text used as labels. The training process is as follows:

CLIP is a combination of an image encoder and a text encoder: the two encoders encode the image and the text separately, and the cosine similarity between the resulting embeddings is compared. Initially, even when the text description matches the image, their similarity will be low.

[Figure: CLIP's image encoder and text encoder trained with contrastive similarity]

As the model is updated, the embeddings the two encoders produce for a matching image and caption gradually become similar. This process is repeated over the entire dataset with a large batch size. In the end, the encoders can produce embeddings in which an image of a dog and the sentence 'a picture of a dog' are close to each other.
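The training objective described above can be written as a symmetric contrastive loss. The toy PyTorch snippet below only illustrates that idea; the random tensors stand in for real encoder outputs, and the temperature value is an assumption, not CLIP's exact training code.

```python
# Toy sketch of CLIP-style contrastive training on one batch.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize so that the dot product equals cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) matrix: similarity of every image with every caption.
    logits = image_features @ text_features.t() / temperature

    # Matching pairs sit on the diagonal; push them up, push the rest down.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)        # image -> correct text
    loss_t = F.cross_entropy(logits.t(), targets)    # text  -> correct image
    return (loss_i + loss_t) / 2

# Example with random "embeddings" standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(32, 768), torch.randn(32, 768))
```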

At inference time, given several candidate prompt texts, compute the similarity of the image with each prompt and pick the one with the highest probability.

[Figure: using CLIP for zero-shot classification by comparing similarities]

2. Diffusion Model

Now that we have one input for the UNet, we also need an input noise image. Suppose we take a 3x512x512 cat image. We do not process the cat image directly; instead, the VAE encoder compresses the 512×512 image from pixel space to a 4x64x64 latent, reducing the data volume by a factor of about 48 (the spatial resolution shrinks by 64x while the channels go from 3 to 4).

[Figure: the VAE encoder compressing the image from pixel space to latent space]

The latent space is simply a compressed representation of the data; compression means encoding information with fewer digits than the original representation. Dimensionality reduction loses some information, but in this case that is not a bad thing: by reducing dimensions we filter out less important information and keep only the most critical parts.
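For a concrete picture of this compression step, here is a sketch assuming the diffusers library and the SD v1.5 VAE weights; the model name and the 0.18215 scaling factor are the ones commonly used with SD v1.x, and this is an illustration rather than the article's own code.

```python
# Sketch: 3x512x512 image -> 4x64x64 latent via the VAE encoder, and back.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)          # stand-in for a normalized cat image in [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215
print(latents.shape)                          # torch.Size([1, 4, 64, 64])

# The decoder used at sampling time reverses this:
with torch.no_grad():
    reconstructed = vae.decode(latents / 0.18215).sample
print(reconstructed.shape)                    # torch.Size([1, 3, 512, 512])
```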

After obtaining the latent vector, we come to the diffusion model itself. Why can an image with added noise be restored? The secret lies in the equations. Here I will use the DDPM paper as the theoretical reference; there are of course improved versions such as DDIM, which you can read about if you are interested.

Forward Diffusion

  • First comes forward diffusion, the noise-adding process: eventually the image becomes pure noise.
  • Gaussian noise is added at every time step, and each step is obtained from the previous one by adding noise.
[Figure: the forward noising equations, adding Gaussian noise at each time step]

So must each noising step be derived from the previous one? Can we obtain the noisy image at an arbitrary step directly? The answer is yes. This matters because during training the noise level assigned to an image is random: suppose we draw step 100 (with 200 total time steps); looping from the first step all the way to step 100 would be far too slow. In fact, the added noise follows a pattern, and our goal now is to obtain the noisy image at any moment directly from the original image X0, without deriving it step by step.

[Figure: derivation of the noisy image at an arbitrary time step from X0]

Let me explain the figure above; everything in it is labeled.

First, αt ranges from 0.9999 down to 0.998.

Second, the noise added to the image follows a Gaussian distribution: the noise added to the latent vector has mean 0 and variance 1. When we substitute Xt-1 into the expression for Xt, why can the two noise terms be merged? Because Z1 and Z2 are both Gaussian, their sum Z2′ is also Gaussian, and the variances add; so we sum the two variances (what sits under the square root is the standard deviation). If this is hard to follow, you can treat it as a theorem. More precisely, for Z → a + bZ, a Gaussian with parameters (0, σ) becomes (a, bσ). We have now derived the relationship between Xt and Xt-2.

Third, substituting Xt-2 in the same way gives the relationship with Xt-3; spotting the pattern yields the cumulative product of the α's, and ultimately the relationship between Xt and X0. With this formula we can obtain the noisy image at any time step directly.
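For reference, the recursion and its unrolled closed form can be written in standard DDPM notation as follows (a restatement of the figure's content, not a copy of it), with βt the noise schedule and αt = 1 − βt:

```latex
% One noising step, and the closed form obtained by unrolling it down to x_0:
x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, z_t,
\qquad z_t \sim \mathcal{N}(0, I)

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \bar{z},
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s
```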

Fourth, the α values are fixed in advance: if you set the number of time steps to 200, the interval from 0.9999 to 0.998 is divided into 200 values, one α per step. Looking at the formula relating Xt to X0, the cumulative product of the α's keeps shrinking, so the noise grows faster and faster over time; the cumulative product runs roughly from 1 down to 0.13. At step 0 it is 1, so Xt is just the original image; by step 200 the image's weight is about 0.13 and the noise occupies the remaining 0.87. Because the α's are multiplied together, noise accumulates at an accelerating rate rather than uniformly.

Fifth, one more point: the reparameterization trick. If X ~ N(μ, σ²), then X can be written as X = μ + σZ, where Z ~ N(0, 1). This is the reparameterization trick.

The reparameterization trick lets us sample from a parameterized distribution while keeping the computation differentiable. If we sample directly (sampling is a discrete, non-differentiable operation), there is no gradient information, so the parameters cannot be updated during backpropagation. The trick lets us sample while retaining gradient information.
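Putting the closed form and the reparameterization trick together, a minimal PyTorch sketch of "get the noisy latent at any step directly from X0" might look like this; the linear schedule endpoints are chosen to match the α range quoted above and are otherwise an assumption.

```python
# Sketch: closed-form noising with the reparameterization trick (mean + std * Z).
import torch

T = 200
betas = torch.linspace(1e-4, 2e-3, T)            # assumed linear schedule so alpha runs 0.9999 -> 0.998
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)    # cumulative product of alpha

def q_sample(x0, t, noise=None):
    """Return the noisy latent x_t directly from x_0, no step-by-step loop."""
    if noise is None:
        noise = torch.randn_like(x0)             # Z ~ N(0, 1)
    sqrt_ab = alphas_cumprod[t].sqrt()
    sqrt_one_minus_ab = (1.0 - alphas_cumprod[t]).sqrt()
    # Reparameterization: mean + std * Z, differentiable with respect to x0.
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise

x0 = torch.randn(1, 4, 64, 64)                   # a latent, as in Stable Diffusion
x_100 = q_sample(x0, t=100)                      # noisy latent at step 100 in one shot
```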

Reverse Diffusion

  • With forward diffusion done, we now move on to reverse diffusion, which may be harder than the forward case. How do we gradually recover the original image from a noisy one? That is the key question.
[Figure: the reverse diffusion process]
  • Starting the reverse process: our goal is to go from the noisy Xt back to a noise-free X0. First we derive Xt-1 from Xt. Here we assume for the moment that X0 is known (never mind why for now); we will substitute it out later. How? Forward diffusion gives the relationship between Xt and X0, so with Xt known we can express X0 in terms of Xt, which leaves only the unknown noise Z. This is where the UNet comes in: its job is to predict that noise.
  • We use Bayes' theorem here (i.e., conditional probability) and rely directly on its result. I wrote about it previously: https://tianfeng.space/279.html
[Figure: applying Bayes' theorem to the reverse step]

We are seeking Xt-1 given Xt. We do not know how to derive this backward step directly, but we do know the forward direction: if X0 is known, every term on the right-hand side can be derived.

[Figure: the three Gaussian terms produced by Bayes' theorem]

Let's start interpreting. All three terms follow Gaussian distributions, so we substitute the Gaussian (normal) density for each. Why does their product turn into a sum? Because the densities are exponentials: e^a · e^b = e^(a+b). We now have a single overall expression, which we can keep simplifying.

First, expand the squares; the only unknown left is Xt-1, so arrange the exponent in the form AX² + BX + C. Do not forget that the result is still a Gaussian. Matching it against the standard Gaussian density, the red part is the reciprocal of the variance, and multiplying the blue part by the variance and dividing by 2 gives the mean μ (the simplified results are shown below; you can work through the algebra yourself if interested). Returning to X0: earlier we assumed it was known; now we express it in terms of Xt (which is known), substitute into μ, and the only unknown left is Zt.
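Written out in standard DDPM notation, the simplified mean and variance the text refers to are (again a restatement, with Zt the noise to be predicted):

```latex
% Posterior q(x_{t-1} | x_t, x_0) is Gaussian with
\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}
  \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, z_t \right),
\qquad
\tilde{\sigma}_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t
```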

[Figure: the simplified mean and variance, with X0 expressed through Xt and Zt]
  • Zt is the noise we need to estimate at each time step, and this is what the UNet model predicts. The model takes three inputs: the current noisy latent Xt, the time step t, and the text vector from earlier; its output is the predicted noise. That is the whole process.
[Figure: DDPM Algorithm 1 (training) and Algorithm 2 (sampling)]
  • The Algorithm 1 above represents the training process.

In the second step, the training data is generally drawn from a particular category (cats, dogs, etc.) or a particular style; we cannot throw arbitrary mixtures of images together, or the model will not learn effectively.

The third step says that each image is assigned a randomly chosen noise time step (as discussed above).

The fourth step states that the noise follows a Gaussian distribution.

The fifth step computes the loss between the real noise and the predicted noise (DDPM has no text-vector input, so it does not appear in the algorithm; just think of it as one extra input here) and updates the parameters until the predicted noise is very close to the true noise. That completes the training of the UNet model.
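A minimal sketch of one such training step (Algorithm 1) follows, assuming a noise-prediction network called as unet(x_t, t, text_emb) and the precomputed alphas_cumprod from the earlier snippet; this is illustrative, not the actual Stable Diffusion training code.

```python
# Sketch of one DDPM / Stable Diffusion training step (Algorithm 1).
import torch
import torch.nn.functional as F

def training_step(unet, optimizer, x0, text_emb, alphas_cumprod):
    # Step 3: pick a random time step for every image in the batch.
    t = torch.randint(0, alphas_cumprod.shape[0], (x0.shape[0],), device=x0.device)
    # Step 4: sample Gaussian noise.
    noise = torch.randn_like(x0)
    # Closed-form forward diffusion: x_t from x_0 in one shot.
    ab = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
    # Step 5: loss between the true noise and the predicted noise.
    pred_noise = unet(x_t, t, text_emb)       # assumed signature, for illustration only
    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```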

  • Next comes Algorithm 2, which describes the sampling process (a code sketch of this loop follows after the list).
  1. X_T is initialized as pure Gaussian noise.
  2. The loop runs T times, deriving Xt-1 from Xt step by step down to X0; there are T time steps.
  3. Xt-1 follows the formula we derived in reverse diffusion, Xt-1 = μ + σZ: the mean and variance are known, and the only unknown, the noise Z, is predicted by the UNet model; εθ denotes the trained UNet.
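Here is the promised sketch of the sampling loop; εθ is represented by the trained unet (same assumed signature as before), and using √βt as σ is one common, simple choice of variance rather than the only one.

```python
# Sketch of the sampling loop (Algorithm 2): x_{t-1} = mu + sigma * z.
import torch

@torch.no_grad()
def sample(unet, text_emb, shape, betas, alphas, alphas_cumprod):
    T = betas.shape[0]
    x = torch.randn(shape)                                   # step 1: x_T ~ N(0, I)
    for t in reversed(range(T)):                             # step 2: T iterations
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = unet(x, torch.full((shape[0],), t), text_emb)  # predicted noise (epsilon_theta)
        ab, a = alphas_cumprod[t], alphas[t]
        mean = (x - (1 - a) / (1 - ab).sqrt() * eps) / a.sqrt()
        sigma = betas[t].sqrt()                               # a simple, common choice of variance
        x = mean + sigma * z                                  # step 3: x_{t-1} = mu + sigma * z
    return x                                                  # x_0: pass this latent to the VAE decoder
```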

Sampling Diagram

  • For easier understanding, I have drawn diagrams for text-to-image and image-to-image; anyone who has used the Stable Diffusion WebUI will find them very familiar. Text-to-image simply initializes a noise latent and samples from it.
  • Image-to-image first adds noise on top of your original image, and you control the noise weight; the 'denoising strength' slider in the WebUI interface is exactly this.
  • The iteration count corresponds to the sampling steps in the WebUI interface.
  • The random seed determines the initial noise image, so to reproduce the same image the seed must stay the same (a short sketch mapping these parameters to code follows below).
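As a concrete mapping of these WebUI parameters onto code, here is a sketch assuming the diffusers library; the model name and file paths are placeholders.

```python
# Sketch: WebUI parameters -> an image-to-image pipeline call.
# strength ~ denoising strength, num_inference_steps ~ sampling steps, generator seed ~ random seed.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
init_image = Image.open("cat.png").convert("RGB").resize((512, 512))   # placeholder input image

result = pipe(
    prompt="a black and white striped cat",
    image=init_image,
    strength=0.6,                                   # how much noise to add to the original image
    num_inference_steps=30,                         # iteration count / sampling steps
    generator=torch.Generator().manual_seed(42),    # fix the seed to reproduce the same image
).images[0]
result.save("cat_restyled.png")
```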
[Figure: text-to-image sampling diagram]
[Figure: image-to-image sampling diagram]

Stage Summary

Now look at this diagram again: apart from the UNet, which I have not explained yet (I will introduce it separately below), isn't it much simpler? The leftmost part is the encoder and decoder in pixel space, the rightmost part is CLIP converting text into a text vector, the upper middle is the noising process, and the lower middle is the UNet predicting noise, followed by repeated sampling and finally decoding to obtain the output image. This is the sampling diagram from the original paper; the training process is not depicted.

[Figure: the Stable Diffusion architecture diagram from the original paper]

3. UNet Model

Many of you probably know the UNet model to some extent. It performs multi-scale feature fusion, similar in spirit to FPN feature pyramids, PAN, and many other designs. Typically a ResNet is used as the backbone (the downsampling encoder), which yields feature maps at multiple scales; during upsampling, these are concatenated with the corresponding feature maps from the downsampling path. That is a standard UNet.

[Figure: a standard UNet with downsampling, upsampling, and skip connections]

So what is different about the UNet in Stable Diffusion? Here is a diagram I found; I admire the author's patience and have borrowed her image.

[Figure: the Stable Diffusion UNet structure]

I will explain the ResBlock and SpatialTransformer modules. There are three inputs: timestep_embedding, context, and input, i.e., the time step, the text vector, and the noisy image. You can think of the time step as the positional encoding in transformers, which in natural language processing tells the model the position of each word in a sentence, since different positions can carry very different meanings. Here, adding the time-step information can be understood as telling the model at which step the noise was added (at least, that is my understanding).

The timestep_embedding uses sine and cosine (sinusoidal) encoding.

[Figure: sinusoidal timestep embedding]
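A minimal sketch of that sinusoidal embedding follows; the dimension 320 is illustrative (it matches the base channel width commonly used in SD v1.x UNets, but treat it as an assumption).

```python
# Sketch: map an integer time step to a sin/cos embedding vector.
import math
import torch

def timestep_embedding(t, dim=320):
    """Return a (batch, dim) sinusoidal embedding for integer time steps t."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

emb = timestep_embedding(torch.tensor([100]))   # embedding for time step 100
print(emb.shape)                                # torch.Size([1, 320])
```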

The ResBlock module takes the time embedding and the convolved image features and adds them together; that is its role. I won't go into the details; it is just convolutions and fully connected layers, which are quite simple.

The SpatialTransformer module takes the text vector and the output from the previous ResBlock.

[Figure: the SpatialTransformer (cross-attention) module]

Its core is cross-attention; the rest is dimensional reshaping, convolutions, and various normalizations such as GroupNorm and LayerNorm.

Through cross-attention, the latent-space features are fused with the features of another modality (the text vector) and injected into the reverse process of the diffusion model. The UNet predicts the noise to be removed at each step, and the gradient is computed from the loss between the ground-truth noise and the predicted noise.

Looking at the diagram in the lower right corner, Q comes from the latent-space features, while K and V are the text vector passed through two fully connected layers. The rest is standard transformer arithmetic: Q is multiplied by K, a softmax produces scores, the scores weight V, and the result is reshaped for output. You can think of the transformer here as a feature extractor that highlights the important information for us (this is just an intuition). The subsequent blocks work the same way, and the final output is the predicted noise.
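To make the Q/K/V flow concrete, here is a toy single-head cross-attention module; the dimensions are illustrative, and it omits the multi-head and output-projection details of the real implementation.

```python
# Toy cross-attention: Q from latent (image) features, K and V from the text vector.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768, inner_dim=320):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, inner_dim, bias=False)  # Q from latent features
        self.to_k = nn.Linear(text_dim, inner_dim, bias=False)    # K from text vector
        self.to_v = nn.Linear(text_dim, inner_dim, bias=False)    # V from text vector
        self.scale = inner_dim ** -0.5

    def forward(self, latent_tokens, text_tokens):
        q, k, v = self.to_q(latent_tokens), self.to_k(text_tokens), self.to_v(text_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # QK^T -> softmax scores
        return attn @ v                                                     # weight V by the scores

# A 64x64 latent flattened to 4096 "image tokens", conditioned on 8 text tokens.
out = CrossAttention()(torch.randn(1, 4096, 320), torch.randn(1, 8, 768))
print(out.shape)   # torch.Size([1, 4096, 320])
```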

You should already be familiar with transformers and know what self-attention and cross-attention are; if not, find an article to read, as it cannot be explained in a sentence or two.

That wraps up the theory; to finish, here are some WebUI comparison images.

3. Stable Diffusion WebUI Extensions

Parameter: CLIP

[Figures: WebUI comparison images for the CLIP parameter]
