Diffusion models have recently become a major focus of AI research. In their latest study, researchers from Google Research and UT-Austin take a close look at the ‘corruption’ process and propose a design framework for diffusion models with more general corruption processes.
Score-based models and denoising diffusion probabilistic models (DDPMs) are two powerful classes of generative models that produce samples by reversing a diffusion process. The two were unified into a single framework in the paper ‘Score-Based Generative Modeling through Stochastic Differential Equations’ by Yang Song et al., and are now widely referred to as diffusion models.
Diffusion models have since achieved great success in a range of applications, including image, audio, and video generation as well as solving inverse problems. Tero Karras and colleagues analyzed the design space of diffusion models in the paper ‘Elucidating the Design Space of Diffusion-Based Generative Models’ and identified three stages: i) choosing the schedule of noise levels, ii) choosing the network parameterization (each parameterization yields a different loss function), and iii) designing the sampling algorithm.
Recently, in an arXiv paper titled ‘Soft Diffusion: Score Matching for General Corruptions’, researchers from Google Research and UT-Austin point out that this design space omits an important component: the corruption process itself. Typically, corruption means adding noise of varying magnitude, and DDPM additionally rescales the image. Although some work has experimented with diffusing under other distributions, a general framework has been lacking. The researchers therefore propose a diffusion model design framework for more general corruption processes.
Specifically, they introduce a new training objective called Soft Score Matching and a novel sampling method called the Momentum Sampler. Their theoretical results show that, for corruption processes satisfying a regularity condition, Soft Score Matching provably learns their scores (i.e., gradients of the log-likelihood); the condition requires the diffusion to map any image to any other image with non-zero likelihood.
In the experimental section, the researchers trained models on CelebA and CIFAR-10, achieving a state-of-the-art FID score of 1.85 on CelebA. In addition, compared with models trained with standard Gaussian denoising diffusion, their models sample significantly faster.
Paper link: https://arxiv.org/pdf/2209.05442.pdf
Overview of the Method
Generally speaking, diffusion models generate images by reversing a corruption process that gradually adds noise. The researchers show how to learn to reverse diffusions that combine linear deterministic degradation with stochastic additive noise.
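As a concrete toy illustration of such a corruption (not code from the paper; the blur kernel, noise level, and function names are all illustrative assumptions), the sketch below applies a linear deterministic degradation, here a simple box blur, followed by stochastic additive Gaussian noise:

```python
import numpy as np

def box_blur(x, k=3):
    """Linear deterministic degradation: separable box blur via 1-D convolutions."""
    kernel = np.ones(k) / k
    # Blur rows then columns; mode="same" keeps the image size fixed.
    x = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, x)
    x = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, x)
    return x

def corrupt(x0, t, sigma=0.1, rng=np.random.default_rng(0)):
    """One corruption at diffusion level t: blur t times, then add Gaussian noise."""
    xt = x0
    for _ in range(t):
        xt = box_blur(xt)
    return xt + sigma * rng.normal(size=x0.shape)

x0 = np.zeros((8, 8))
x0[3:5, 3:5] = 1.0          # toy "image": a bright square
x3 = corrupt(x0, t=3)       # blurred and noised version of x0
print(x3.shape)             # (8, 8)
```

Repeated blurring destroys high-frequency detail while the additive noise keeps the process stochastic, which is exactly the combination the method is designed to reverse.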
Specifically, they present a framework for training diffusion models with more general corruption models. The framework has three parts: the new training objective Soft Score Matching, the novel sampling method Momentum Sampler, and the scheduling of the corruption mechanism.
First, consider the training objective, Soft Score Matching, named after soft focus: a photography term for filters that remove fine detail. It provably learns the score of a conventional linear corruption process. The objective incorporates the corruption filter into the network and trains the model to predict a clean image that, once corrupted, matches the diffused observation.
This training objective provably learns the score as long as the diffusion assigns non-zero probability to every pair of clean and corrupted images, a condition that always holds when the corruption includes additive noise.
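A minimal sketch of what such an objective could look like, assuming a linear corruption with operator C_t and a network that predicts the clean image: the prediction residual is penalized only after being passed back through C_t, so the model is judged by how well its output matches the observation *after* corruption. The function names and exact weighting here are illustrative, not the paper's exact formulation.

```python
import numpy as np

def soft_score_matching_loss(C_t, x0, xt, predict):
    """Illustrative "soft" loss: the clean-image prediction is compared to the
    target only after re-applying the corruption operator C_t, so the model
    is not penalized on components that the corruption destroys anyway."""
    x0_hat = predict(xt)              # network's estimate of the clean image
    residual = C_t @ (x0_hat - x0)    # filter the residual through C_t
    return float(np.mean(residual ** 2))

# Toy check: a rank-deficient C_t ignores errors in its null space.
C_t = np.diag([1.0, 1.0, 0.0])        # corruption destroys the 3rd component
x0 = np.array([1.0, 2.0, 3.0])
xt = C_t @ x0                         # noiseless corruption, for simplicity
predict = lambda xt: xt + np.array([0.0, 0.0, 5.0])  # wrong only in the null space
print(soft_score_matching_loss(C_t, x0, xt, predict))  # → 0.0
```

The toy check shows the "soft" character of the objective: an error that lives entirely in the null space of the corruption incurs zero loss.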
Specifically, the researchers explored corruption processes of the following form.
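The paper's equation is not reproduced here; based on the description above (linear deterministic degradation plus stochastic additive noise), the corruption can be written, in our own notation, as

```latex
x_t = C_t\, x_0 + \sigma_t\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)
```

where C_t is a linear degradation operator (e.g., a blur matrix) and σ_t scales the additive Gaussian noise. This is a reconstruction from context, not a verbatim copy of the paper's formula.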
During this process, the researchers found that noise is important both empirically (it yields better results) and theoretically (it is needed to learn the scores). This is also the key distinction from Cold Diffusion, concurrent work on reversing purely deterministic corruptions.
Next is the sampling method, the Momentum Sampler. The researchers show that the choice of sampler has a significant impact on the quality of generated samples, and they propose the Momentum Sampler to reverse general linear corruption processes. The sampler uses convex combinations of the corruption at different diffusion levels and is inspired by momentum methods in optimization.
This sampling method was inspired by the continuous formulation of diffusion models proposed in the paper by Yang Song et al. The algorithm for Momentum Sampler is shown below.
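The exact algorithm is given in the paper; as a rough, purely illustrative sketch of the idea, the loop below re-corrupts the model's clean-image estimate using a convex combination of the corruption operators at adjacent diffusion levels, controlled by a momentum weight η. Every name and update rule here is an assumption for illustration, not the paper's algorithm.

```python
import numpy as np

def momentum_sampler(predict, C, sigmas, eta=0.5, rng=np.random.default_rng(0)):
    """Illustrative reverse sampler for a linear corruption x_t = C[t] @ x0 + sigma_t * noise.
    At each step the clean-image estimate is re-corrupted with a convex
    combination of adjacent corruption operators (the "momentum" idea)."""
    d = C[0].shape[0]
    T = len(C) - 1
    x = sigmas[T] * rng.normal(size=d)             # start from the most corrupted level
    for t in range(T, 0, -1):
        x0_hat = predict(x, t)                     # network's clean-image estimate
        # Convex combination of the corruption operators at levels t-1 and t.
        C_mix = eta * C[t - 1] + (1.0 - eta) * C[t]
        x = C_mix @ x0_hat + sigmas[t - 1] * rng.normal(size=d)
    return x

# Toy run: identity "network" and progressively stronger (shrinking) operators.
C = [np.eye(2) * (1 - 0.1 * t) for t in range(4)]  # C[0] = I (no corruption)
sigmas = [0.0, 0.1, 0.2, 0.3]
sample = momentum_sampler(lambda x, t: x, C, sigmas)
print(sample.shape)  # (2,)
```

Averaging adjacent corruption levels, rather than jumping directly to level t−1 as a naive sampler would, is what plays the role of momentum here.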
The following figure illustrates the impact of the sampling method on the quality of generated samples. The image on the left, produced with the Naive Sampler, shows repeated patterns and lacks detail, while the image on the right, produced with the Momentum Sampler, has markedly better sample quality and FID score.
Finally, there is scheduling. Even when the type of degradation is fixed in advance (e.g., blurring), deciding how much to corrupt at each diffusion step is not easy. The researchers propose a principled tool for designing the corruption process: they choose the schedule by minimizing the Wasserstein distance between successive distributions along the corruption path. Intuitively, the goal is a smooth transition from the fully corrupted distribution to the clean distribution.
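The idea is easy to illustrate in one dimension, where the 2-Wasserstein distance between Gaussians has the closed form W2(N(m1, s1²), N(m2, s2²))² = (m1 − m2)² + (s1 − s2)²: intermediate corruption levels can be placed so that consecutive distributions along the path are equidistant in W2. The Gaussian family and the specific schedule below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def w2_gauss_1d(m1, s1, m2, s2):
    """Closed-form 2-Wasserstein distance between 1-D Gaussians."""
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def schedule_sigmas(sigma_min, sigma_max, steps):
    """Place noise levels so that consecutive N(0, sigma^2) distributions are
    equidistant in W2.  For zero-mean 1-D Gaussians W2 = |s1 - s2|, so the
    equidistant-in-W2 schedule is simply linear in sigma."""
    return np.linspace(sigma_min, sigma_max, steps)

sigmas = schedule_sigmas(0.01, 1.0, 10)
gaps = [w2_gauss_1d(0, sigmas[i], 0, sigmas[i + 1]) for i in range(9)]
print(np.allclose(gaps, gaps[0]))  # all consecutive W2 gaps are equal -> True
```

Equal W2 gaps between neighbouring steps are one simple way to make the path from the fully corrupted distribution to the clean one "smooth" in the sense described above.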
Experimental Results
The researchers evaluated the proposed method on CelebA-64 and CIFAR-10, both standard baselines for image generation. The main objective of the experiments was to understand the role of the type of corruption.
The researchers first tried corruption via blurring plus low-magnitude noise. Their model achieved state-of-the-art results on CelebA with an FID score of 1.85, surpassing all other methods that only add noise and possibly rescale the image. On CIFAR-10, the model obtained an FID of 4.64, which, while not state-of-the-art, remains competitive.
Moreover, on both CIFAR-10 and CelebA, the method also performs better on another metric: sampling time, which translates into a significant computational advantage. Compared with denoising-based image generation, deblurring (with almost no added noise) appears to be a more efficient operation.
The following figure shows how the FID score varies with the number of function evaluations (NFE). The results show that, on both CIFAR-10 and CelebA, the researchers’ model achieves the same or better quality than the standard Gaussian denoising diffusion model while using significantly fewer steps.