
The breakthrough for diffusion models came with the introduction of the Denoising Diffusion Probabilistic Model (DDPM) in 2020. Before delving into the details of how DDPM works, let’s first look at the earlier developments in generative artificial intelligence that form its foundations:
VAE
VAEs consist of an encoder, a probabilistic latent space, and a decoder. During training, the encoder predicts a mean and a variance for each image; a latent vector is then sampled from the Gaussian distribution defined by these values and passed to the decoder, which is expected to reconstruct the input image. The loss combines a reconstruction term with a KL-divergence term that keeps the latent distribution close to a standard Gaussian. One significant advantage of VAEs is their ability to generate a wide variety of images: during the sampling phase, simply drawing from the Gaussian prior allows the decoder to create a new image.
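For reference, this is the standard VAE training objective (the evidence lower bound, or ELBO), whose two terms correspond to reconstruction quality and latent-space regularization:

$$
\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\right)
$$

Sampling during training uses the reparameterization trick, $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, so that gradients can flow through the sampling step.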
GAN
Shortly after the introduction of Variational Autoencoders (VAEs), a groundbreaking family of generative models emerged: Generative Adversarial Networks (GANs). A GAN pairs two neural networks, a generator and a discriminator, in an adversarial training process. The generator aims to produce realistic data, such as images, from random noise, while the discriminator strives to distinguish between real and generated data. Throughout training, the two networks refine their abilities through this competition: the generator produces increasingly convincing data, while the discriminator improves its ability to discern real samples from generated ones. The process converges when the generator produces data realistic enough to fool the discriminator. During the sampling phase, after training, the generator transforms input random noise into new samples that typically resemble real examples.
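This competition is captured by the original GAN minimax objective, in which the discriminator $D$ maximizes its classification accuracy while the generator $G$ minimizes it:

$$
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
$$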
Why We Need Diffusion Models: DDPM
The two models have complementary strengths and weaknesses: GANs excel at generating realistic images that closely resemble those in the training set, but they struggle with diversity (mode collapse is a well-known failure case), while VAEs are adept at creating a diverse array of images but tend to produce blurry ones. No existing model had successfully combined these two capabilities, creating images that are both highly realistic and diverse. This challenge presented a significant barrier for researchers to address.
Six years after the first GAN paper was published and seven years after the VAE paper, a groundbreaking model emerged: the Denoising Diffusion Probabilistic Model (DDPM). DDPM combines the advantages of both models, excelling at creating diverse and realistic images.
In this article, we will delve into the complexities of DDPM, covering its training process, including forward and reverse processes, and exploring how to perform sampling. Throughout this exploration, we will use PyTorch to build DDPM from scratch and complete its full training.
It is assumed that you are already familiar with the basics of deep learning and have a solid foundation in deep computer vision; we will not cover those fundamentals here. Our goal is to generate images that humans would believe to be real.
Diffusion Model DDPM
The Denoising Diffusion Probabilistic Model (DDPM) is a cutting-edge approach among generative models. DDPM operates by iteratively denoising through a diffusion process: noise is gradually added to an image, and a model learns to remove it. The fundamental idea is that by transforming a simple distribution, such as a Gaussian, through a series of diffusion steps, a complex and expressive image distribution can be obtained. In other words, since the forward process transfers samples from the image distribution to a Gaussian distribution, we can train a model to reverse it. This lets us start from pure Gaussian noise and end at the image distribution, effectively generating new images.
Training a DDPM involves two fundamental processes: a fixed, non-learnable forward process that generates noisy images, and a learned reverse process whose goal is to denoise those images using a machine learning model.
Forward Diffusion Process
The forward process is a fixed and non-learnable step, but it requires some predefined settings. Before delving into these settings, let’s understand how it works.
The core concept is to start with a clean image and, over a number of steps denoted by “T”, gradually introduce small amounts of noise sampled from a Gaussian distribution.
Noise is added incrementally at each step; let’s now delve into the mathematical representation of this process.
The noise is sampled from a Gaussian distribution, and a small amount is introduced at each step via a Markov chain: to generate the image at the current timestep, we only need the image from the previous timestep. This Markov property is key and is crucial for the mathematical details that follow.
A Markov chain is a stochastic process in which the probability of transitioning to any state depends only on the current state, not on the sequence of events that preceded it. This property simplifies the modeling of the noise-addition process and makes it easier to analyze mathematically.
The variance parameter, denoted beta, is intentionally set to very small values, so that only a minimal amount of noise is introduced at each step.
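In the notation of the DDPM paper, a single forward step is the Gaussian transition

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\right)
$$

where $\beta_t$ is the (small) variance used at step $t$.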
The step parameter “T” determines the number of steps needed to produce a fully noisy image. In this article it is set to 1000, which may seem large. Do we really need to create 1000 noisy images for each original image in the dataset? The Markov-chain structure proves helpful here: since each step depends only on the previous one and the noise schedule is fixed in advance, we can avoid simulating the chain step by step and jump directly to the noisy image at any given timestep. The reparameterization trick allows us to simplify the equations further.
Defining $\alpha_t = 1 - \beta_t$ and the cumulative product $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, and substituting them into the single-step transition above, yields a closed-form expression for $x_t$ directly in terms of the original image $x_0$.
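The resulting closed form, as derived in the DDPM paper, is

$$
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar\alpha_t}\, x_0,\; (1-\bar\alpha_t) I\right),
\qquad
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$

so a noisy image at any timestep $t$ can be produced in a single operation.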
Reverse Diffusion Process
Having introduced noise to the images, the next step is to perform the reverse operation. Mathematically, the exact reverse transition is intractable unless we condition on the clean image at t = 0, but our goal is to sample new images directly from noise, where no such information is available. We therefore need a method to gradually denoise the image without knowing the final result, and the solution is to use a deep learning model to approximate this complex mathematical function.
With some mathematical work, the model will approximate the reverse transition shown below. A noteworthy detail is that we will stick to the original DDPM paper and keep the variance fixed, although later work shows it can also be learned.
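The learned reverse transition is a Gaussian whose mean is predicted by the model and whose variance is fixed:

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \sigma_t^2 I\right)
$$

where the paper fixes $\sigma_t^2$ to $\beta_t$ (or to the posterior variance $\tilde\beta_t$ defined below).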
One option is to have the model predict the mean of this transition, effectively the noise removed between the current timestep and the previous one. But what if instead we have the model predict the total noise added to the original image up to the current timestep?
Mathematically, the reverse transition only becomes tractable when we condition on the noise-free initial image, so let’s start from the forward-process posterior and its variance. The model’s task is then to predict the noise that was added to the initial image to reach timestep t, and the closed-form forward process is exactly what lets us generate such training pairs, taking a clean image to a noisy image at timestep t.
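Conditioned on $x_0$, the forward-process posterior is the Gaussian

$$
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde\mu_t(x_t, x_0),\; \tilde\beta_t I\right),
\qquad
\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t
$$

with mean

$$
\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\, x_t .
$$

Substituting $x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon\right)$ shows that predicting the noise $\epsilon$ is enough to recover the posterior mean.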
Training Algorithm
We assume the architecture used for prediction is a U-Net. The objective during the training phase is: for each image in the dataset, randomly select a timestep in the range [1, T] and compute the closed-form forward diffusion process. This yields a noisy version of the image at that timestep, along with the actual noise that was used. Then, following our formulation of the reverse process, the model predicts the noise that was added to the image. With both the real and the predicted noise, we have effectively turned this into a supervised machine learning problem.
The main question arises: which loss function should we use to train our model? Since we are dealing with a probabilistic latent space, Kullback-Leibler (KL) divergence is a suitable choice.
KL divergence measures the difference between two probability distributions, in our case, the distribution predicted by the model and the expected distribution. Including KL divergence in the loss function not only guides the model to produce accurate predictions but also ensures that the latent space representation conforms to the expected probabilistic structure.
Because the variances are fixed, the KL divergence between these two Gaussians reduces to an L2 distance between their means, which in the noise parameterization simplifies to an L2 loss between the true and predicted noise:
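This is the simplified objective from the DDPM paper:

$$
\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\right) \right\|^2\right]
$$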
Ultimately, we arrive at the training algorithm proposed in the paper.
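A minimal PyTorch sketch of one training step (Algorithm 1 of the paper), assuming a `model(x_t, t)` that predicts the noise and a precomputed `alpha_hat` tensor holding the cumulative products $\bar\alpha_t$:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x0, alpha_hat, T):
    """One DDPM training step -- a sketch, not the repository's exact code."""
    batch_size = x0.shape[0]
    t = torch.randint(0, T, (batch_size,), device=x0.device)  # t ~ Uniform
    noise = torch.randn_like(x0)                              # eps ~ N(0, I)
    a_hat = alpha_hat[t].view(-1, 1, 1, 1)
    # closed-form forward process: jump straight to the noisy image at step t
    x_t = a_hat.sqrt() * x0 + (1 - a_hat).sqrt() * noise
    loss = F.mse_loss(model(x_t, t), noise)                   # L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```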
Sampling
The reverse process has been explained; now let’s see how to use it. Starting from pure Gaussian noise at time T and applying the reverse process T times, we ultimately reach time 0. This constitutes the second algorithm outlined in this article.
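A sketch of the corresponding sampling loop (Algorithm 2 of the paper), using the fixed variance $\sigma_t^2 = \beta_t$; the schedule tensors are assumed to be precomputed:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas, alphas, alpha_hat, T, device="cuda"):
    """Run the reverse process from pure noise -- a sketch."""
    x = torch.randn(shape, device=device)       # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                 # predicted noise
        alpha, a_hat, beta = alphas[t], alpha_hat[t], betas[t]
        # posterior mean with the model's noise prediction substituted in
        mean = (x - beta / (1 - a_hat).sqrt() * eps) / alpha.sqrt()
        if t > 0:
            x = mean + beta.sqrt() * torch.randn_like(x)  # add sigma_t * z
        else:
            x = mean                            # no noise at the final step
    return x
```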
Parameters
We have introduced several parameters: beta, beta_tilde, alpha, alpha_hat, and so on, and we have not yet discussed how to choose them. The only value fixed so far is T, set to 1000.
All of the other parameters derive from beta. In a sense, beta determines how much noise we add at each step, so choosing it carefully is crucial to the success of the algorithm. For the remaining parameters, please refer to the paper for details.
The choice of noise schedule was explored experimentally. The original DDPM paper used a linear schedule, which can add too little noise early on and become overly noisy toward the end. To address this, follow-up work (“Improved DDPM”) introduced the now widely used cosine schedule, which provides smoother and more consistent noise addition.
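A sketch of both schedules; the linear version follows the original paper’s values ($\beta_1 = 10^{-4}$, $\beta_T = 0.02$), and the cosine version follows the Improved DDPM formulation:

```python
import math
import torch

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear schedule used in the original DDPM paper."""
    return torch.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008):
    """Cosine schedule from 'Improved DDPM' (Nichol & Dhariwal, 2021)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_hat = f / f[0]                          # cumulative alpha-bar curve
    betas = 1 - (alpha_hat[1:] / alpha_hat[:-1])  # recover per-step betas
    return torch.clip(betas, 0, 0.999).float()
```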
PyTorch Implementation of Diffusion Models
We will use the U-Net architecture for noise prediction. U-Net is well suited to image processing: it captures spatial information at multiple scales and produces an output of the same size as its input.
Given the complexity of the task and the requirement to use the same model at every step (the same weights must denoise both almost-clean and completely noisy images), the plain U-Net needs adjustments: richer convolutional blocks, and awareness of the current timestep via sinusoidal embeddings. These enhancements aim to make the model an expert at the denoising task. Before building the complete model, we introduce each block.
ConvNeXt Block
To meet the need for increased model capacity, the convolutional blocks play a crucial role. Here we cannot rely solely on the basic blocks from the U-Net paper; we will incorporate ConvNeXt-style blocks instead.
The input consists of “x”, the image, and a time embedding “t” of size “time_embedding_dim”. Thanks to their internal complexity and the residual connection between input and final layer, these blocks are key to learning spatial and feature mappings throughout the network.
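A sketch of such a block, assuming the time embedding is injected additively after the depthwise convolution (names and exact layer choices are illustrative, not the repository’s exact code):

```python
import torch
from torch import nn

class ConvNextBlock(nn.Module):
    """ConvNeXt-style block with time-embedding injection -- a sketch."""

    def __init__(self, in_channels, out_channels, time_embedding_dim):
        super().__init__()
        self.time_mlp = nn.Sequential(
            nn.GELU(), nn.Linear(time_embedding_dim, in_channels)
        )
        # depthwise 7x7 convolution, as in ConvNeXt
        self.ds_conv = nn.Conv2d(in_channels, in_channels, 7,
                                 padding=3, groups=in_channels)
        self.net = nn.Sequential(
            nn.GroupNorm(1, in_channels),
            nn.Conv2d(in_channels, out_channels * 2, 3, padding=1),
            nn.GELU(),
            nn.GroupNorm(1, out_channels * 2),
            nn.Conv2d(out_channels * 2, out_channels, 3, padding=1),
        )
        # residual connection; 1x1 conv when the channel counts differ
        self.res_conv = (nn.Conv2d(in_channels, out_channels, 1)
                         if in_channels != out_channels else nn.Identity())

    def forward(self, x, t):
        h = self.ds_conv(x)
        h = h + self.time_mlp(t)[:, :, None, None]  # inject time embedding
        h = self.net(h)
        return h + self.res_conv(x)                 # residual connection
```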
Sinusoidal Timestamp Embedding
One of the key blocks in the model is the sinusoidal timestep embedding, which encodes a given timestep into a vector that tells the model which noise level it is operating at, since the same weights are used for all timesteps.
This is a very classic implementation, used in many places, so we give the code directly:
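The widely used form, identical to the Transformer paper’s positional encoding:

```python
import math
import torch
from torch import nn

class SinusoidalPosEmb(nn.Module):
    """Classic sinusoidal position/timestep embedding."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        device = t.device
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = t[:, None].float() * emb[None, :]
        return torch.cat((emb.sin(), emb.cos()), dim=-1)  # shape [B, dim]
```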
DownSample & UpSample
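These modules change the spatial resolution between U-Net levels. A minimal sketch (the repository may implement them differently):

```python
from torch import nn

def downsample(dim, dim_out):
    """Halve the spatial resolution with a strided convolution."""
    return nn.Conv2d(dim, dim_out, kernel_size=4, stride=2, padding=1)

def upsample(dim, dim_out):
    """Double the spatial resolution: nearest-neighbor upsampling plus a conv."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(dim, dim_out, kernel_size=3, padding=1),
    )
```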
Temporal Multilayer Perceptron
This module creates a temporal representation from the given timestep t. The output of this multilayer perceptron (MLP) serves as the input “t” for all the modified ConvNeXt blocks.
Here, “dim” is a hyperparameter of the model that indicates the number of channels required for the first block. It serves as the basis for calculating the number of channels in subsequent blocks.
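A sketch of this MLP, assuming a common choice of `4 * dim` for the embedding size (this multiplier is an assumption, not mandated by the paper):

```python
from torch import nn

def build_time_mlp(dim):
    """Map a raw timestep to the time embedding used by every ConvNeXt block."""
    time_embedding_dim = dim * 4  # an assumed, commonly used multiplier
    return nn.Sequential(
        SinusoidalPosEmb(dim),               # the embedding block defined above
        nn.Linear(dim, time_embedding_dim),
        nn.GELU(),
        nn.Linear(time_embedding_dim, time_embedding_dim),
    )
```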
Attention
This is an optional component of the U-Net. Attention helps the skip connections contribute more effectively to learning: an attention gate, computed from the skip-connection features coming from the left side of the U-Net and the coarser feature maps from the deeper layers, emphasizes the important spatial information. It originates from the ACC-UNet paper.
The gate is the upsampled output of the lower block, while the residual is the skip connection at the level where attention is applied.
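A sketch of a generic additive attention gate with these two inputs; the exact ACC-UNet formulation may differ:

```python
import torch
from torch import nn

class AttentionGate(nn.Module):
    """Additive attention gate -- a generic sketch, not necessarily the
    exact ACC-UNet formulation."""

    def __init__(self, gate_channels, residual_channels, inter_channels):
        super().__init__()
        self.w_gate = nn.Conv2d(gate_channels, inter_channels, 1)
        self.w_res = nn.Conv2d(residual_channels, inter_channels, 1)
        self.psi = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(inter_channels, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, gate, residual):
        # gate: upsampled output of the lower block
        # residual: skip connection at this level (same spatial size as gate)
        attn = self.psi(self.w_gate(gate) + self.w_res(residual))  # [B,1,H,W]
        return residual * attn  # re-weight the skip connection
```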
Final Integration
We now integrate all the blocks discussed earlier (excluding the attention block) into a U-Net. Each block contains two residual connections instead of one, a modification made to address potential overfitting issues.
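A compact skeleton of how these pieces might be wired together, reusing the sketches above; the depth, channel multipliers, and names are illustrative assumptions:

```python
import torch
from torch import nn

class Unet(nn.Module):
    """Compact U-Net skeleton built from the blocks sketched above."""

    def __init__(self, dim=64, channels=3, dim_mults=(1, 2, 4)):
        super().__init__()
        dims = [dim * m for m in dim_mults]      # e.g. [64, 128, 256]
        time_dim = dim * 4
        self.time_mlp = build_time_mlp(dim)      # defined above
        self.init_conv = nn.Conv2d(channels, dim, 3, padding=1)

        self.downs = nn.ModuleList()
        in_dim = dim
        for out_dim in dims:
            self.downs.append(nn.ModuleList([
                ConvNextBlock(in_dim, out_dim, time_dim),   # two blocks per
                ConvNextBlock(out_dim, out_dim, time_dim),  # level, each residual
                downsample(out_dim, out_dim),
            ]))
            in_dim = out_dim

        self.mid = ConvNextBlock(in_dim, in_dim, time_dim)

        self.ups = nn.ModuleList()
        for out_dim in reversed(dims):
            self.ups.append(nn.ModuleList([
                upsample(in_dim, out_dim),
                ConvNextBlock(out_dim * 2, out_dim, time_dim),  # *2: concat skip
                ConvNextBlock(out_dim, out_dim, time_dim),
            ]))
            in_dim = out_dim

        self.final_conv = nn.Conv2d(dim, channels, 1)

    def forward(self, x, t):
        t = self.time_mlp(t)
        x = self.init_conv(x)
        skips = []
        for block1, block2, down in self.downs:
            x = block2(block1(x, t), t)
            skips.append(x)
            x = down(x)
        x = self.mid(x, t)
        for up, block1, block2 in self.ups:
            x = up(x)
            x = torch.cat((x, skips.pop()), dim=1)  # skip connection
            x = block2(block1(x, t), t)
        return self.final_conv(x)
```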
Implementation of Diffusion Code
Finally, we introduce how diffusion is implemented. Since we have already covered all the mathematical theories used for the forward, reverse, and sampling processes, here we will focus on the code.
The diffusion class implements the training part of the model and exposes a sampling interface that lets us generate samples with the trained model.
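A sketch of how such a wrapper might look, packaging the schedule buffers, the training loss, and the sampling interface (it reuses `cosine_beta_schedule` and `ddpm_sample` from the sketches above; all names are assumptions):

```python
import torch
from torch import nn
import torch.nn.functional as F

class GaussianDiffusion(nn.Module):
    """Wraps a noise-prediction model with the forward process (training)
    and the reverse process (sampling) -- a sketch."""

    def __init__(self, model, image_size, T=1000):
        super().__init__()
        self.model = model
        self.image_size = image_size
        self.T = T
        betas = cosine_beta_schedule(T)          # schedule defined earlier
        alphas = 1.0 - betas
        # buffers move with .to(device) and are stored in checkpoints
        self.register_buffer("betas", betas)
        self.register_buffer("alphas", alphas)
        self.register_buffer("alpha_hat", torch.cumprod(alphas, dim=0))

    def forward(self, x0):
        """Training interface: return L_simple for a batch of clean images."""
        t = torch.randint(0, self.T, (x0.shape[0],), device=x0.device)
        noise = torch.randn_like(x0)
        a_hat = self.alpha_hat[t].view(-1, 1, 1, 1)
        x_t = a_hat.sqrt() * x0 + (1 - a_hat).sqrt() * noise
        return F.mse_loss(self.model(x_t, t), noise)

    @torch.no_grad()
    def sample(self, batch_size=16, channels=3):
        """Sampling interface: run the reverse loop sketched earlier."""
        shape = (batch_size, channels, self.image_size, self.image_size)
        return ddpm_sample(self.model, shape, self.betas, self.alphas,
                           self.alpha_hat, self.T, device=self.betas.device)
```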
Summary of Training Points
For training, we ran 37,000 steps with a batch size of 16. Due to GPU memory limits, the image size was restricted to 128×128. We maintained an exponential moving average (EMA) of the model weights to smooth sampling, generating samples and saving model checkpoints every 1,000 steps.
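A sketch of the EMA update; the decay value is an assumption:

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.995):
    """Blend the live weights into the EMA copy after each optimizer step."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)

# usage: ema_model = copy.deepcopy(model), then update_ema(ema_model, model)
# after every training step; generate samples from ema_model.
```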
In the first 1,000 training steps, the model began to capture some features but still missed certain areas. Around 10,000 steps, it started producing promising results and progress became more evident. By the final 30,000 steps, the quality improved significantly, although some all-black images still appeared. This was simply due to insufficient sample variety: the data distribution of the real images had not been fully mapped onto the Gaussian distribution.
With the final model weights, we can generate some images. Although the image quality is limited due to the 128×128 size restriction, the model’s performance is still commendable.
Note: The dataset used in this article consists of satellite images of forest terrains; for specific acquisition methods, please refer to the ETL section in the source code.
Conclusion
We have comprehensively introduced the necessary knowledge about diffusion models and provided a complete implementation using PyTorch.
Code for this article:
https://github.com/Camaltra/this-is-not-real-aerial-imagery/
Related papers:
DDPM: https://arxiv.org/abs/2006.11239
ConvNeXt: https://arxiv.org/abs/2201.03545
U-Net: https://arxiv.org/abs/1505.04597
ACC-UNet: https://arxiv.org/abs/2308.13680