Alex from Qubit AI | WeChat Official Account QbitAI
The free and open-source Stable Diffusion has been played with in new ways:
This time, it has been used for image compression.
For the same source image, Stable Diffusion not only compresses it to a smaller file size but also performs visibly better than JPEG and WebP: the compressed image retains more detail and shows fewer compression artifacts.
However, the software engineer Matthias Bühlmann (let’s just call him MB) pointed out that this method also has significant limitations.
It does not handle faces and text well, and the decoded, re-expanded image sometimes even contains features that do not exist in the original.
Here is an example (the effect can be quite striking):
△ Left is the original image, right is the generated image after compression and expansion by Stable Diffusion
That said—
How Does Stable Diffusion Compress Images?
To explain how Stable Diffusion compresses images, it’s helpful to start with some important principles of how Stable Diffusion works.
Stable Diffusion is a special type of diffusion model called Latent Diffusion.
Unlike a standard diffusion model, latent diffusion operates in a lower-dimensional latent space rather than in actual pixel space.
This means the latent-space representation is a lower-resolution but higher-precision compressed version of the image.
A note here: an image's resolution and its precision are two different things. Resolution describes how many samples (pixels) the image contains, while precision describes how many bits each sample stores.
For example, take this close-up photo of a camel: the original file is 768 kB, with a resolution of 512×512 and a precision of 3×8 bits (three 8-bit color channels).
After Stable Diffusion compresses it to 4.98 kB, the resolution drops to 64×64, while the precision rises to 4×32 bits (four 32-bit floating-point channels).
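The storage arithmetic behind these numbers can be checked directly. A quick sketch (the final 4.98 kB figure additionally relies on the palette-and-dithering step described further below, which this sketch does not model):

```python
# Raw storage cost, in bytes, of each representation.
def size_bytes(width, height, channels, bits_per_channel):
    return width * height * channels * bits_per_channel // 8

pixel_space = size_bytes(512, 512, 3, 8)    # original RGB image
latent_fp32 = size_bytes(64, 64, 4, 32)     # latent: 4 float32 channels
latent_u8   = size_bytes(64, 64, 4, 8)      # latents quantized to uint8

print(pixel_space // 1024)  # 768 kB
print(latent_fp32 // 1024)  # 64 kB
print(latent_u8 // 1024)    # 16 kB
```

So the raw float latent is already 12× smaller than the pixel data, and quantizing it to 8 bits shrinks it by another 4×.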
So it appears that the compressed image from Stable Diffusion is not much different from the original image.
To be more specific, this latent diffusion model has three main components:
a VAE (Variational Autoencoder), a U-Net, and a text encoder.
However, in this image compression test, the text encoder is not particularly useful.
The main role is played by the VAE, which consists of two parts: an encoder and a decoder.
The encoder maps an image from pixel space into a latent-space representation, and the decoder maps that latent representation back into an image.
MB found that the VAE's decoder is remarkably robust to quantization of the latent representation.
By scaling, clamping, and remapping, the latents can be quantized from floating point to 8-bit unsigned integers with very little visible distortion:
First, quantize the latents to 8-bit unsigned integers; the data size is then 64×64×4×8 bit = 16 kB (versus 512×512×3×8 bit = 768 kB for the original image).
Then, by applying a palette and dithering, the data can be reduced further to about 5 kB while largely preserving image fidelity.
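A minimal numpy sketch of the scale–clamp–remap quantization step. The clamp range and the random stand-in latents here are illustrative assumptions, not MB's exact values, which are tuned to the SD VAE's actual latent distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a VAE latent: 64×64 spatial grid with 4 float32 channels.
latents = rng.normal(0.0, 1.0, size=(64, 64, 4)).astype(np.float32)

# Clamp to a fixed range, then remap [-clip, clip] -> [0, 255].
clip = 5.0  # illustrative clamp range, not MB's exact value
clamped = np.clip(latents, -clip, clip)
quantized = np.round((clamped + clip) / (2 * clip) * 255).astype(np.uint8)

# Dequantize: map [0, 255] back to [-clip, clip].
restored = quantized.astype(np.float32) / 255 * (2 * clip) - clip

# Rounding error per element is bounded by half a quantization step,
# which is what makes feeding the restored latents to the decoder viable.
step = 2 * clip / 255
print(float(np.abs(restored - clamped).max()) <= step / 2 + 1e-6)
```

The point of MB's observation is that this bounded rounding error barely affects what the VAE decoder produces, so the uint8 latents can serve as the compressed payload.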
As a diligent programmer, MB did not rely on visual inspection alone; he also analyzed image quality quantitatively.
Measured by two common image-quality metrics, PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity), however, Stable Diffusion's compression results are not significantly better than JPEG and WebP.
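Of the two metrics, PSNR is straightforward to compute by hand; a minimal sketch (SSIM requires a windowed structural comparison and is typically taken from a library such as scikit-image):

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the original."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * np.log10(max_val ** 2 / mse)

# Toy comparison: a flat gray image vs. the same image with one corrupted pixel.
a = np.full((64, 64), 128, dtype=np.uint8)
b = a.copy()
b[0, 0] = 138  # a single 10-level error
print(round(psnr(a, b), 1))
```

PSNR rewards pixel-exact reproduction, which is part of why a generative decoder that invents plausible detail scores no better than JPEG or WebP even when its output looks subjectively sharper.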
Additionally, when the latent representation is decoded back up to the original resolution, the main features of the image remain visible, but the VAE also invents new high-resolution detail for those pixels.
In plain terms, the reconstructed image often differs from the original, containing many newly generated details that were not in the source.