Source: AI Algorithms and Image Processing
This article is about 4300 words, recommended reading time is 8 minutes.
A detailed interpretation of the Stable Diffusion paper; after reading this, you will no longer find it hard to understand!
1. Introduction (Can Be Skipped)
Hello everyone, I am Tian-Feng. Today I will introduce the principles of Stable Diffusion in a way that is easy to understand. I often play with AI drawing, so I wrote this article to explain how it works. It genuinely took me quite a while to write, so if you find it useful, I hope you will give it a like. Thank you.
Stable Diffusion, an open-source image generation model from Stability AI, is no less significant than ChatGPT. Its momentum is on par with Midjourney, and with the support of numerous plugins its standing has risen even further. Of course, the methods involved are slightly more complex than Midjourney's.
As for why it is open source, the founder put it this way: I do this because I believe it is part of a shared narrative; someone needs to publicly demonstrate what is happening. I emphasize again that it should be open source by nature, because the value does not reside in any proprietary model or data. We will build auditable open-source models, even if they contain licensed data. Without further ado, let's get started.
2. Stable Diffusion
The figure above from the original paper may be hard to follow at first, but that doesn't matter. I will break it down into individual modules, interpret each one, and then put them back together. I believe you will then understand what each step in the figure does.
First, I will draw a simplified model diagram corresponding to the original figure, which is easier to understand. Let's start with the training phase. You might notice that the VAE decoder is missing: that is because the training process takes place entirely in latent space, and the decoder belongs to the second phase, sampling, which we discuss later. The Stable Diffusion WebUI drawing we normally do corresponds to the sampling phase. As for the training phase, most ordinary people cannot complete it: the training time is measured in GPU-years (a single V100 GPU would take about a year; with 100 cards it should be possible to finish in roughly a month). For comparison, ChatGPT's electricity costs run into millions of dollars, on clusters of thousands of GPUs. It feels like AI today is all about computing power. But I digress; let's get back to the topic.
1. CLIP
Let's start with the prompt. We input the prompt "a black and white striped cat". CLIP maps the text onto a vocabulary in which every word and punctuation mark corresponds to a number; each such unit is called a token. Previously, Stable Diffusion limited the input to 75 tokens (that limit is now gone). You might notice that 6 words correspond to 8 tokens here, because a start token and an end token are also added. Each token number then maps to a 768-dimensional vector, which you can think of as the word's ID card, and words with very similar meanings map to similar 768-dimensional vectors. Through CLIP we thus obtain an (8, 768) text vector corresponding to the image.
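As a small illustration, here is a minimal sketch using the Hugging Face transformers library (not necessarily the exact code path the WebUI uses); the token count can vary slightly with the tokenizer's subword splits:

```python
from transformers import CLIPTokenizer, CLIPTextModel

# Stable Diffusion v1.x uses OpenAI's CLIP ViT-L/14 as its text encoder
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a black and white striped cat", return_tensors="pt")
print(tokens.input_ids)              # token ids, including the start and end tokens

text_vector = text_encoder(**tokens).last_hidden_state
print(text_vector.shape)             # roughly (1, 8, 768): one 768-d vector per token
```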
Stable Diffusion uses OpenAI's pre-trained CLIP model, i.e., a model that others have already trained. So how was CLIP trained, and how does it align images with text? (The following expansion can be read or skipped; it does not affect understanding. Just know that CLIP is used to convert the prompt into the text vector that conditions image generation.)
CLIP requires image-caption pairs as data; its dataset contains roughly 400 million images and their descriptions, presumably crawled directly from the web with the accompanying text serving as labels. The training process is as follows:
CLIP combines an image encoder and a text encoder, encoding the two modalities separately. Cosine similarity is then used to compare the resulting embeddings. At the start of training, even when a text description matches its image, their similarity is bound to be low.
As the model keeps updating, the embeddings produced by the image and text encoders for matching pairs gradually become similar. This process is repeated over the entire dataset with a large batch size, ultimately yielding an embedding space in which the image of a dog and the sentence "a picture of a dog" are close together.
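As a minimal sketch of this contrastive idea (assuming a batch of already-encoded image and text embeddings; this is not OpenAI's exact training code), the loss pulls matching pairs together and pushes non-matching pairs apart:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is exactly the cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # logits[i, j] compares image i with caption j; matching pairs sit on the diagonal
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img = F.cross_entropy(logits, targets)       # image -> matching text
    loss_txt = F.cross_entropy(logits.t(), targets)   # text -> matching image
    return (loss_img + loss_txt) / 2

loss = clip_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```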
At inference time, you can provide several candidate prompt texts, compute the similarity between the image and each prompt, and pick the one with the highest score.
2. Diffusion Model
Now that we have one input for the UNet, we still need the noisy image input. If we take a 3×512×512 cat image, we do not process the cat image directly. Instead, the VAE encoder compresses the 512×512 image from pixel space into latent space (4×64×64), reducing the amount of data by roughly a factor of 48 (spatially the image shrinks by a factor of 64, from 512×512 to 64×64, while the channels go from 3 to 4).
Latent space is simply a compressed representation of the data. Compression means encoding information with fewer bits than the original representation. Reducing dimensionality loses some information, but in some cases that is not a bad thing: it filters out the less important details and keeps only the most significant information.
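A minimal sketch of this compression step, assuming the diffusers library and a publicly available SD v1 checkpoint (0.18215 is the latent scaling constant used by SD v1.x):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

image = torch.randn(1, 3, 512, 512)                   # stand-in for a normalized cat image
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # pixel space -> latent space
latents = latents * 0.18215                           # SD v1.x latent scaling factor
print(latents.shape)                                  # (1, 4, 64, 64)
```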
After obtaining the latent-space vector, we come to the diffusion model itself. Why can an image be recovered after noise is added to it? The secret lies in the formulas. Here I will use the DDPM paper as the theoretical basis; there are also improved versions such as DDIM, which you can look up if interested.
Forward Diffusion
First comes forward diffusion, the noise-adding process, which ultimately turns the image into pure noise.
At each moment Gaussian noise is added, and the next moment is obtained from the previous moment by adding that noise.
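In DDPM's notation, this single noising step can be written as:

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,z_t,\qquad z_t \sim \mathcal{N}(0, I)$$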
So do we have to derive each step's noise from the previous step, or can we obtain the noisy image at any step directly? The answer is yes, we can. The reason: during training, the noise level assigned to an image is random. If we land at, say, step 100 of noise (assuming the number of time steps is set to 200), adding noise step by step from the first step every time would be far too slow. Fortunately, the added noise follows a pattern. Our goal is that, as long as we have the original image X0, we can obtain the noisy image at any moment without deriving it step by step.
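That pattern is exactly the closed-form expression DDPM derives, where the cumulative product of the α's appears:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,z,\qquad \bar\alpha_t = \prod_{s=1}^{t}\alpha_s,\quad z \sim \mathcal{N}(0, I)$$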
Let me explain the above step by step.
First, αt ranges from 0.9999 down to about 0.98 (in DDPM, βt = 1 − αt grows linearly from 0.0001 to 0.02).
Second, the noise added to the image follows a Gaussian distribution, i.e., the noise added to the latent-space vector is sampled with mean 0 and variance 1. When we substitute Xt-1 into the expression for Xt, why can the two noise terms be combined? Because Z1 and Z2 are both Gaussian, their sum Z2' is also Gaussian, and its variance is the sum of their variances (the standard deviation is the square root of that). If you cannot follow this, you can treat it as a theorem. To elaborate: for Z -> a + bZ, a Gaussian with mean 0 and standard deviation σ becomes one with mean a and standard deviation bσ. We now have the relationship between Xt and Xt-2.
Third, if you substitute Xt-2 as well, you get the relationship with Xt-3. Spotting the pattern, the coefficient becomes the cumulative product of the α's, and we finally obtain the relationship between Xt and X0. From this equation we can now obtain the noisy image at any moment directly.
Fourth, because the initial noise level for each image is random, setting the number of time steps to 200 means dividing the interval from 0.9999 to 0.98 into 200 parts, giving the α value at each moment. From the formula relating Xt and X0, since the coefficient is a cumulative product of α (and therefore keeps shrinking), you can see that noise is added faster the further along you go: the cumulative product roughly spans 1 down to 0.13. At moment 0, Xt is the image itself; by moment 200, the cumulative product is about 0.13, meaning noise accounts for roughly 0.87. Because it is a cumulative product, the noise grows at an accelerating rate rather than uniformly (see the sketch below).
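Here is a minimal sketch of points one through four, assuming DDPM's standard linear β schedule (β from 0.0001 to 0.02, so αt runs from 0.9999 down to 0.98); it jumps straight to any noise step without a loop:

```python
import torch

T = 200
betas = torch.linspace(1e-4, 0.02, T)       # linear beta schedule
alphas = 1.0 - betas                        # alpha_t from 0.9999 down to 0.98
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product of the alphas

def q_sample(x0, t, noise=None):
    """Jump directly to the noisy latent at step t (0-indexed)."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t]
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

x0 = torch.randn(1, 4, 64, 64)    # a latent-space "image", just for illustration
x_100 = q_sample(x0, t=99)        # noisy latent at step 100, no step-by-step loop
print(alpha_bars[-1])             # ≈ 0.13, matching the value in the text
```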
Fifth, a quick note about the Reparameterization Trick.
If X ~ N(μ, σ²), then X can be written as X = μ + σZ, where Z ~ N(0, 1). This is the reparameterization trick.
The reparameterization trick lets us sample from a parameterized distribution while keeping gradients. If we sampled directly (sampling is a discrete, non-differentiable operation), there would be no gradient information, and the parameters would not be updated during backpropagation. The trick ensures that we can sample and still retain gradient information.
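A tiny sketch of the trick (the variable names are only illustrative):

```python
import torch

mu = torch.zeros(4, requires_grad=True)      # illustrative distribution parameters
sigma = torch.ones(4, requires_grad=True)

x = mu + sigma * torch.randn(4)              # sample = mu + sigma * Z, with Z ~ N(0, 1)
x.sum().backward()                           # gradients flow back into mu and sigma
```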
Reverse Diffusion
With forward diffusion done, we proceed to reverse diffusion, which may be harder than the previous part. How do we recover the original image from a noisy one, step by step? That is the key question.
Starting the reverse process, our goal is to go from the noisy image Xt back to the noise-free X0. First we derive Xt-1 from Xt. For now, assume X0 is known (ignore for the moment why we may assume this); it will be substituted away later. How? Forward diffusion already gives the relationship between Xt and X0: since Xt is known, we can express X0 in terms of Xt, leaving only an unknown noise Z, which we will predict with the UNet.
Here we use Bayes' theorem (i.e., conditional probability). We take its result directly; I previously wrote a note on it (https://tianfeng.space/279.html).
We want to derive Xt-1 from the known Xt. We do not know how to go backwards directly, but we do know the forward direction: if X0 is known, every term on the right-hand side can be written down.
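Concretely, the Bayes decomposition used here is:

$$q(x_{t-1}\mid x_t, x_0) = \frac{q(x_t\mid x_{t-1}, x_0)\,q(x_{t-1}\mid x_0)}{q(x_t\mid x_0)}$$

and every term on the right is a forward-diffusion Gaussian we already know.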
Let's start interpreting. All three of these terms are Gaussian, so we substitute each into the Gaussian (normal) density. Why does multiplying them turn into adding the exponents? Because e^2 × e^3 = e^(2+3); these are exponentials (exp, i.e., powers of e). We now have one overall expression, and next we keep simplifying:
First, expand the squares; the only unknown left is Xt-1, and the expression fits the form AX² + BX + C. Don't forget that the result is still Gaussian, so we match it against the standard Gaussian density: the red part is the reciprocal of the variance, and the blue part multiplied by the variance and divided by 2 gives the mean μ (the simplified result is shown below; work through it yourself if you are interested). Returning to X0: we assumed it was known, so we now express it in terms of Xt (which is known) and substitute it into μ, leaving only the unknown Zt.
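For reference, after the simplification and the substitution of X0, the mean in DDPM's notation comes out as:

$$\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,z_t\right)$$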
Zt is exactly the noise we need to estimate at each moment.
- Here we use the UNet model to predict it.
- The model takes three inputs: the current distribution Xt, the time step t, and the text vector from earlier; it outputs the predicted noise (see the sketch below). That is the whole process.
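A minimal sketch of that call, assuming the diffusers library (the checkpoint name is just one common public example, not necessarily what the WebUI loads):

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet")

latents = torch.randn(1, 4, 64, 64)        # current noisy latent X_t
t = torch.tensor([100])                    # time step
text_vector = torch.randn(1, 77, 768)      # CLIP text vector (prompts are padded to 77 tokens)

with torch.no_grad():
    noise_pred = unet(latents, t, encoder_hidden_states=text_vector).sample
print(noise_pred.shape)                    # (1, 4, 64, 64): the predicted noise
```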
Algorithm 1 in the paper is the training process.
The second step refers to collecting the data, generally a single category of images, such as cats or dogs, or images of one particular style. You cannot mix arbitrary images, or the model will not learn well.
The third step states that each image is randomly assigned a moment of noise (as mentioned earlier).
The fourth step indicates that the noise conforms to a Gaussian distribution.
The fifth step computes the loss between the real noise and the predicted noise (DDPM's input does not include a text vector, so it is not written here; just understand that Stable Diffusion feeds in that extra input) and updates the parameters until the predicted noise is very close to the real noise. That completes the training of the UNet model.
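Putting Algorithm 1 together as a hedged sketch (helper names such as `vae_encode` and the `unet(...)` call signature are illustrative, not any specific library's API):

```python
import torch
import torch.nn.functional as F

def train_step(unet, vae_encode, images, text_vector, alpha_bars, optimizer):
    """One training step in the style of Algorithm 1 (sketch)."""
    latents = vae_encode(images)                               # pixel space -> latent space
    t = torch.randint(0, len(alpha_bars), (latents.size(0),))  # step 3: random time step
    noise = torch.randn_like(latents)                          # step 4: Gaussian noise
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    noisy = ab.sqrt() * latents + (1 - ab).sqrt() * noise      # closed-form noising
    noise_pred = unet(noisy, t, text_vector)                   # UNet predicts the noise
    loss = F.mse_loss(noise_pred, noise)                       # step 5: real vs. predicted
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```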
Next, we move to Algorithm 2, the sampling process.
Doesn't the first step just say that XT follows a Gaussian distribution, i.e., we start from pure noise?
The loop executes T times, deriving Xt-1 from Xt in sequence until we reach X0, covering the T moments.
Xt-1 is given by the formula we derived during reverse diffusion, Xt-1 = μ + σZ, where the mean and variance are known; the only unknown, the noise, is predicted by the UNet model.
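As a hedged sketch of Algorithm 2 (the `unet(...)` signature is illustrative, and σ uses one common variance choice, the square root of 1 − αt):

```python
import torch

@torch.no_grad()
def sample(unet, text_vector, alphas, alpha_bars, T=200, shape=(1, 4, 64, 64)):
    x = torch.randn(shape)                                # X_T: pure Gaussian noise
    for t in reversed(range(T)):
        z = torch.randn(shape) if t > 0 else torch.zeros(shape)
        eps = unet(x, torch.tensor([t]), text_vector)     # predicted noise Z_t
        a, ab = alphas[t], alpha_bars[t]
        mean = (x - (1 - a) / (1 - ab).sqrt() * eps) / a.sqrt()
        sigma = (1 - a).sqrt()
        x = mean + sigma * z                              # X_{t-1} = mu + sigma * Z
    return x                                              # latent X_0, to be decoded by the VAE
```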
Sampling Diagram
For ease of understanding, I have drawn both text-to-image and image-to-image. If you use the Stable Diffusion WebUI, this should look familiar. For text-to-image, we simply initialize a random noise and run the sampling.
Image-to-image instead adds noise on top of an existing image, and you control the noise weight yourself. Isn't that exactly the denoising strength (redraw amplitude) slider in the WebUI interface?
The number of iterations corresponds to the sampling steps in the webui interface.
The random seed determines the initial random noise image, so if you want to reproduce the same image, the seed must stay the same.
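Tying these WebUI parameters together as a minimal diffusers sketch (the checkpoint is just one public example):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

generator = torch.Generator().manual_seed(42)   # fixed seed -> same initial noise -> same image
image = pipe(
    "a black and white striped cat",
    num_inference_steps=30,                     # the WebUI's "sampling steps"
    generator=generator,
).images[0]
image.save("cat.png")
```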
Stage Summary
Now look at this figure again. Apart from the UNet, which I have not explained yet (I will introduce it separately), isn't it much simpler? The far left is the pixel-space encoder and decoder, the far right is CLIP converting the text into a text vector, the upper middle is the noising process, and the lower middle is the UNet predicting the noise. We then sample repeatedly and finally decode to obtain the output image. Note that this is the sampling diagram from the original paper; it does not depict the training process.
3. UNet Model
Many of you may have heard of the UNet model: it is about multi-scale feature fusion, similar in spirit to FPN feature pyramids, PAN, and so on. Typically a ResNet is used as the backbone (the downsampling path) serving as the encoder, which yields feature maps at multiple scales; during upsampling, the upsampled features are concatenated with the corresponding downsampled feature maps. That is a typical UNet.
So what’s different about the UNet in stable diffusion? Here’s an image I found, and I admire the patience of this young lady, so I’ll borrow her image.
Let me explain the ResBlock module and the SpatialTransformer module. The inputs are timestep_embedding, context, and input, i.e., the time step, the text vector, and the noisy image. You can think of the time step like the positional encoding in Transformers, which in natural language processing tells the model the position of each token in a sentence, since different positions can carry very different meanings. Here, adding the time-step information can be understood as telling the model which noising step it is currently at (this is my own interpretation, of course).
Timestep Embedding Uses Sinusoidal Encoding
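A small sketch of that sinusoidal encoding (320 here is just Stable Diffusion's base channel width, used for illustration):

```python
import math
import torch

def timestep_embedding(t, dim):
    """Sinusoidal embedding of the diffusion time step."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half) / half)
    angles = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)

emb = timestep_embedding(torch.tensor([100]), 320)   # shape (1, 320)
```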
The ResBlock module takes the time embedding and the image features after convolution and adds them together; that is its whole job. I will not go into the details: it is just convolutions and fully connected layers, which are quite simple.
The SpatialTransformer module takes the text vector and the output from the previous ResBlock.
Here I mainly want to discuss cross-attention; the rest is just dimension reshaping, convolution operations, and various normalizations such as GroupNorm and LayerNorm.
Cross-attention fuses the latent-space features with the features of the other modality (the text vector) and injects them into the reverse process of the diffusion model. The UNet predicts the amount of noise to remove at each step, and the gradient is computed from the loss between the ground-truth noise and the predicted noise.
Looking at the diagram in the lower right corner, Q comes from the latent-space features, while K and V are the text vector passed through two fully connected layers. The rest is the standard Transformer computation: multiply Q by K, take a softmax to get the attention scores, multiply by V, and reshape the result for output. You can think of the Transformer as a feature extractor that highlights the important information for us (just as an intuition). The subsequent blocks are similar, and the predicted noise is the final output.
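A single-head sketch of that cross-attention (the real block is multi-head; 320 and 768 follow SD's latent and text dimensions, and the class name is my own):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head sketch of the cross-attention inside the SpatialTransformer."""
    def __init__(self, latent_dim=320, text_dim=768):
        super().__init__()
        self.scale = latent_dim ** -0.5
        self.to_q = nn.Linear(latent_dim, latent_dim, bias=False)  # Q from latent features
        self.to_k = nn.Linear(text_dim, latent_dim, bias=False)    # K from text vector
        self.to_v = nn.Linear(text_dim, latent_dim, bias=False)    # V from text vector
        self.to_out = nn.Linear(latent_dim, latent_dim)

    def forward(self, latent, text):
        # latent: (B, H*W, latent_dim)   text: (B, num_tokens, text_dim)
        q, k, v = self.to_q(latent), self.to_k(text), self.to_v(text)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.to_out(attn @ v)     # text information injected into the spatial features

out = CrossAttention()(torch.randn(1, 64 * 64, 320), torch.randn(1, 8, 768))
```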
You should already be familiar with Transformers and know what self-attention and cross-attention are. If not, find an article to read; it is not something that can be explained in a few sentences.
That’s all. Goodbye, and here are some webui comparison images.
3. Stable Diffusion WebUI Extensions
Parameters of CLIP
Editor: Yu Tengkai
Proofreader: Cheng Anle