How Does a GAN Control Image Generation Style? A Detailed Look at the Evolution of StyleGAN


Source: Machine Heart (WeChat official account), reposted with authorization.

Do you know your own style? Most GAN models don't. So, can a GAN systematically control the style of the images it generates?


The original GAN generates images from a latent factor z. Typically, z is sampled from a normal or uniform distribution, and it determines the type and style of the generated content.


Based on this, we need to answer the following two important questions:
  1. Why is z uniformly or normally distributed?

  2. Since z contains meta-information, should it play a more significant role at every convolutional layer of the generation process, rather than serving only as the input to the first layer?

Note: This article will use “style” to refer to the meta-information, which includes type information and style information.
The following image is generated by StyleGAN2:

[Image: sample faces generated by StyleGAN2]

Latent factor z
In machine learning, latent factors are usually kept independent of one another to simplify model training. For example, height and weight are highly correlated (taller people usually weigh more), so the body mass index (BMI), computed from both, is commonly used as a single measure of obesity; a model built on it can be simpler. Independent factors also make a model easier to interpret.
In a GAN, the distribution of z should resemble the distribution of the latent factors of real images. If we sample z from a normal or uniform distribution, the optimized model may need z to embed information beyond type and style. For example, suppose we generate images of soldiers and visualize the training data's distribution over two latent factors: masculinity and hair length. The missing upper-left corner in the image below indicates that male soldiers are not allowed to have long hair.

[Image: training-data distribution over masculinity and hair length; the upper-left region (long-haired male soldiers) is empty]

If we sample this space uniformly, the generator will try to produce images of long-haired male soldiers, and it will fail because no such training data exists. From another perspective: when we sample from a normal or uniform distribution, what latent factors is the model actually learning? The picture becomes murkier. As the StyleGAN paper puts it, "this leads to some degree of unavoidable entanglement."
Just as logistic regression uses a change of basis to make the two classes linearly separable, StyleGAN uses a deep network, called the mapping network, to transform the latent factor z into an intermediate latent code w.

How GAN Controls Image Generation Style? Detailed StyleGAN Evolution

Conceptually, StyleGAN warps the space that can be sampled with a uniform or normal distribution (shown in the image below) into the latent feature space, from which images are easier to generate. The mapping network is meant to produce mutually independent features, so that the generator can render more easily while avoiding feature combinations that never appear in the training dataset.

[Image: the mapping network warps the sampling space into the latent feature space]

StyleGAN introduces a mapping network f consisting of eight fully connected layers, which transforms the 512-dimensional latent code z into a 512-dimensional intermediate latent code w. You can think of w as a new z (z′).

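To make this concrete, here is a minimal PyTorch sketch of such a mapping network. The eight fully connected layers and the 512-dimensional z and w follow the text above; the LeakyReLU activation and the normalization of z are simplified stand-ins for the official implementation's details.

```python
import torch
import torch.nn as nn

# A minimal sketch of StyleGAN's mapping network f: z -> w.
class MappingNetwork(nn.Module):
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers = []
        dim = z_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, w_dim), nn.LeakyReLU(0.2)]
            dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Normalize z first (StyleGAN applies a pixel-norm style normalization).
        z = z / torch.sqrt(torch.mean(z ** 2, dim=1, keepdim=True) + 1e-8)
        return self.net(z)

w = MappingNetwork()(torch.randn(4, 512))  # 4 latent codes z -> 4 style codes w
```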

Style-based generator
In the original GAN, the latent factor z is only used as input to the first layer of the deep network. We might think that as the network deepens, the role of z gradually diminishes.
However, the style-based generator applies a separately learned affine transformation A to w at each layer; the transformed w then acts as style information on the spatial data.


The StyleGAN paper started from the Progressive GAN (ProGAN) network and reused many of its hyperparameters, including the Adam optimizer settings. The researchers then modified the design step by step, running experiments to check whether each change improved model performance.


The first improvement (B) replaced the nearest-neighbor upsampling/downsampling in the discriminator and generator with bilinear sampling, and trained the model longer with further hyperparameter tuning.
The second improvement (C) added the mapping network and styles. For the latter, AdaIN (adaptive instance normalization) replaced PixelNorm to apply styles to the spatial data.


AdaIN is defined as follows:

$$\text{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

In this operation, instance normalization is first applied to the input feature map xᵢ (μ and σ are its mean and standard deviation, respectively). StyleGAN then uses the style information to scale each normalized spatial feature map and add a bias: for each layer, it computes style values (y(s,i), y(b,i)) from w and applies them as the scale and bias of spatial feature map i. Because the features are normalized first, the style controls the degree of styling applied at each spatial location.
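A minimal sketch of the AdaIN operation just defined, assuming the per-channel style scale ys and bias yb (the y(s,i) and y(b,i) above) have already been produced from w by the learned affine transform A:

```python
import torch

def adain(x, ys, yb, eps=1e-8):
    # x: (N, C, H, W) feature maps; ys, yb: (N, C) style scale and bias from A(w)
    mu = x.mean(dim=(2, 3), keepdim=True)            # per-feature-map mean
    sigma = x.std(dim=(2, 3), keepdim=True) + eps    # per-feature-map std
    x_norm = (x - mu) / sigma                        # instance normalization
    return ys[:, :, None, None] * x_norm + yb[:, :, None, None]
```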
In the original GAN, the input to the first layer is the latent factor z. Experiments showed that feeding a variable input to StyleGAN's first layer brings no benefit, so the variable input is replaced with a constant.
In improvement (D), the first-layer input is a learned constant tensor of size 4×4×512.


In the StyleGAN paper, "style" refers to the main attributes of the data, such as pose and identity. In improvement (E), StyleGAN injects noise into the spatial data to create stochastic variation.


For example, the injected noise creates different random variations for hair (see the image below), stubble, freckles, or pores.
[Image: different noise realizations produce different hair details]
Concretely, for an 8×8 spatial layer we create an 8×8 matrix of uncorrelated Gaussian noise. This matrix is shared across all feature maps, but StyleGAN learns a separate scaling factor for each feature map, multiplies it with the noise matrix, and adds the result to the output of the previous layer.
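A minimal sketch of this noise injection for one layer; the single shared noise image and the learned per-feature-map scaling factors follow the description above, while the module interface is illustrative:

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # One learned scaling factor per feature map, initialized to zero.
        self.scale = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        n, _, h, w = x.shape
        noise = torch.randn(n, 1, h, w, device=x.device)  # shared across maps
        return x + self.scale * noise                     # per-map scaled noise
```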


Noise creates rendering variants; its advantage over using no noise, or applying noise only at certain resolutions, is shown in the image below. The StyleGAN paper also notes that noise alleviates the repetitive-pattern problem commonly encountered in other GAN methods.

[Image: noise at all layers vs. no noise vs. noise at selected resolutions]

In summary, a style is applied globally to a feature map, so it controls the key attributes of the image, while noise introduces pixel-level local variation and generates local variants of features.
The final improvement (F) involves style mixing regularization.
Style Mixing and Mixing Regularization
Previously, we generated a single latent factor z and used it as the only source of styles. With mixing regularization, once a chosen spatial resolution is reached, the model switches to a second latent factor z₂ to generate the remaining styles.


As shown in the image below, we use the latent factor of generated image "source B" to provide the coarse styles (spatial resolutions 4×4 to 8×8) and the latent factor of "source A" for the finer resolutions. The generated image therefore has source B's high-level styles, such as pose, hairstyle, face shape, and glasses, while all colors (eyes, hair, lighting) and the finer facial features come from source A.

[Image: style mixing with coarse styles (4×4 to 8×8) from source B and the rest from source A]

As shown in the image below, if we use the medium-resolution (16×16 to 32×32) styles of source B, the generated image inherits smaller-scale facial features, hairstyle, and eye state (open/closed) from source B, while the pose, face shape, and glasses from source A are preserved. In the last column, the model copies the high-resolution styles (64×64 to 1024×1024) from source B, which mainly affect the color scheme and microstructure of the image.
[Image: style mixing at medium (16×16 to 32×32) and high (64×64 to 1024×1024) resolutions]
During training, a certain percentage of images are generated using two random latent codes instead of just one, as in the sketch below.
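A rough sketch of how such mixing might be implemented, assuming a hypothetical mapping network f and a synthesis network that consumes one style code per layer; the mixing probability of 0.9 here is illustrative:

```python
import torch

def mixed_styles(f, batch_size, num_layers, z_dim=512, mix_prob=0.9):
    w1 = f(torch.randn(batch_size, z_dim))             # styles from z1
    w2 = f(torch.randn(batch_size, z_dim))             # styles from z2
    ws = w1.unsqueeze(1).repeat(1, num_layers, 1)      # one style code per layer
    if torch.rand(()) < mix_prob:
        cut = int(torch.randint(1, num_layers, ()))    # random crossover layer
        ws[:, cut:] = w2.unsqueeze(1)                  # later layers use z2's styles
    return ws                                          # (batch, num_layers, 512)
```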
Training
Compared with CelebA-HQ, the FFHQ (Flickr-Faces-HQ) dataset is higher quality and more varied, covering a wider range of ages, ethnicities, image backgrounds, and accessories such as glasses and hats. When training on CelebA-HQ, StyleGAN uses the WGAN-GP loss; for FFHQ it uses the non-saturating GAN loss with R₁ regularization, shown below:

$$R_1(\psi) = \frac{\gamma}{2}\,\mathbb{E}_{p_{\mathcal{D}}(x)}\!\left[\lVert \nabla D_\psi(x) \rVert^2\right]$$

Truncation Technique in w
Low-probability-density regions of z or w may not contain enough training data for the model to learn them accurately.


Therefore, when generating images, we can avoid these regions and improve image quality at the cost of some variation. This can be achieved by truncating z or w; StyleGAN truncates w:

$$\bar{w} = \mathbb{E}_{z \sim P(z)}[f(z)], \qquad w' = \bar{w} + \psi\,(w - \bar{w})$$

where ψ is the style scale.
However, truncation is only performed on low-resolution layers (e.g., spatial layers from 4×4 to 32×32, ψ = 0.7). This ensures that high-resolution details are not affected.
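A minimal sketch of this truncation, assuming ws holds one style code per layer and w_avg is the running average w̄ from the formula above; cutoff = 8 (covering the layers up to 32×32) is an assumption of this sketch:

```python
import torch

def truncate(ws, w_avg, psi=0.7, cutoff=8):
    # ws: (N, num_layers, 512) style codes; w_avg: (512,) running average of f(z)
    ws = ws.clone()
    ws[:, :cutoff] = w_avg + psi * (ws[:, :cutoff] - w_avg)  # w' = w̄ + ψ(w − w̄)
    return ws
```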
When ψ = 0, the generator produces the average face, shown in the image below. As ψ is varied, attributes such as gaze direction, glasses, age, skin tone, hair length, and gender change, for example from wearing glasses to not wearing glasses.

[Image: faces generated at different ψ values; ψ = 0 gives the average face]

Perceptual Path Length
The StyleGAN paper also proposes a new metric for measuring GAN performance: perceptual path length. The idea starts from gradually changing a specific dimension of the latent factor z and visualizing the semantics of the change.


This type of latent-space interpolation can produce surprisingly nonlinear visual changes. For example, features absent from both endpoint images may appear in the middle of the path, a sign that the latent space is entangled and the factors of variation are not properly separated. We can therefore quantify these changes by accumulating them along the interpolation path.
First, we use a VGG16 embedding to measure the perceptual difference between two images. If we subdivide the latent-space interpolation path into linear segments, we can sum the perceptual differences over all segments. A lower total indicates higher-quality GAN images. See the StyleGAN paper for the detailed mathematical definition.
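A rough sketch of the idea, using the lpips package (whose 'vgg' variant measures VGG16-feature-based perceptual distance) and a hypothetical generator G; the paper's actual metric uses tiny perturbations ε with a 1/ε² scaling, omitted here:

```python
import torch
import lpips  # pip install lpips; LPIPS(net='vgg') uses VGG16 features

def perceptual_path_length(G, z0, z1, steps=100):
    dist = lpips.LPIPS(net='vgg')
    total = 0.0
    with torch.no_grad():
        prev = G(z0)
        for i in range(1, steps + 1):
            t = i / steps
            cur = G((1 - t) * z0 + t * z1)   # linear interpolation between codes
            total += dist(prev, cur).item()  # perceptual difference per segment
            prev = cur
    return total
```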
Problems with StyleGAN
Images generated by StyleGAN exhibit droplet-like blob artifacts, which are even more pronounced in the generator's intermediate feature maps. The problem appears in all feature maps starting around 64×64 resolution and grows worse at higher resolutions.


GAN technology has matured to the point where people zoom into generated images, looking for anomalous patterns when trying to detect fakes.


The StyleGAN2 paper attributes the problem to the instance normalization inside AdaIN. AdaIN was originally designed for style transfer, and in that setting some important information from the input is inevitably lost.


The StyleGAN2 paper expresses this finding as follows:
We pinpoint the problem to the AdaIN operation that normalizes the mean and variance of each feature map separately, thereby potentially destroying any information found in the magnitudes of the features relative to each other. We hypothesize that the droplet artifact is a result of the generator intentionally sneaking signal strength information past instance normalization: by creating a strong, localized spike that dominates the statistics, the generator can effectively scale the signal as it likes elsewhere.
Furthermore, StyleGAN2 proposes an alternative design to address the problems caused by progressive growing, the scheme StyleGAN used to stabilize high-resolution training.

[Image: with progressive growing, details such as the gap in the teeth (blue line) stay fixed as the face turns]

As shown in the image above, even when the face images generated using progressive growing change direction, the gaps in their teeth (blue line) do not change.
Before discussing StyleGAN2, let's redraw StyleGAN's design diagram (shown on the right of the image below). The AdaIN module is split into two parts here, and the bias, omitted in the original diagram, is drawn explicitly. (Note that the model design itself has not changed yet.)

[Image: StyleGAN design redrawn with AdaIN split into two modules and biases shown]

StyleGAN2
Weight Demodulation
With the support of experimental results, StyleGAN2 made the following changes:
  • Simplified how the constant input is processed at the start;

  • Removed the mean from feature normalization (only the standard deviation is used);

  • Moved the noise module outside the style module.


StyleGAN2 then simplifies the design further with weight demodulation, shown in the image below. It revisits instance normalization (Norm std), intending to replace it with another normalization method that does not cause droplet artifacts. The right side of the image shows the new design using weight demodulation.

[Image: from instance normalization to weight demodulation; the right side shows the new design]

Weight demodulation makes the following changes:
1. Modulation (Mod std) followed by convolution (Conv 3×3) can be folded into a single operation that scales the convolution weights, implemented as Mod in the image above (this does not change the model's behavior):

$$w'_{ijk} = s_i \cdot w_{ijk}$$

where w and w′ are the original and modulated weights, sᵢ is the scale for the i-th input feature map, and j and k enumerate the output feature maps and the spatial footprint of the convolution.
2. Then normalize the weights using Demod:

$$\sigma_j = \sqrt{\sum_{i,k} {w'_{ijk}}^{2}}$$

The new normalized weight is:

$$w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} {w'_{ijk}}^{2} + \epsilon}}$$

A small ε is added to avoid numerical instability. Although this is not mathematically identical to instance normalization, it normalizes each output feature map toward unit standard deviation (under statistical assumptions about the inputs) and achieves the same goal as other normalization methods, namely a more stable training process. Experiments show that the droplet artifacts are resolved.
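A minimal sketch of modulated convolution with demodulation, following the three formulas above; the grouped-convolution trick for applying per-sample weights is a common implementation device, not part of the math itself:

```python
import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, s, eps=1e-8):
    # x: (N, in_ch, H, W); weight: (out_ch, in_ch, k, k); s: (N, in_ch) styles
    n, in_ch, h, w_ = x.shape
    out_ch, _, k, _ = weight.shape
    w1 = weight[None] * s[:, None, :, None, None]            # Mod: w' = s_i * w
    sigma = torch.sqrt((w1 ** 2).sum(dim=(2, 3, 4)) + eps)   # Demod: sigma_j
    w2 = w1 / sigma[:, :, None, None, None]                  # w'' = w' / sigma_j
    # Grouped convolution applies each sample's own weights in one call.
    out = F.conv2d(x.reshape(1, n * in_ch, h, w_),
                   w2.reshape(n * out_ch, in_ch, k, k),
                   padding=k // 2, groups=n)
    return out.reshape(n, out_ch, h, w_)
```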
Improvements Made by StyleGAN2
Now, let’s look at the improved version of StyleGAN2. The image below summarizes various model changes and their corresponding FID score improvements (the smaller the FID score, the better the model performance).

[Table: StyleGAN2 model changes and their FID improvements]

Lazy Regularization
StyleGAN applies R₁ regularization when training on FFHQ. Lazy regularization observes that omitting the regularization term from most cost computations causes no harm: even if regularization is performed only once every 16 mini-batches, model performance is unaffected, while the computational cost drops.
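A rough sketch of lazy R₁ regularization on the discriminator, assuming a hypothetical discriminator D and the non-saturating loss mentioned earlier; multiplying the penalty by the interval, to compensate for applying it infrequently, follows the lazy scheme described in the paper:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, reals, fakes, step, gamma=10.0, lazy_interval=16):
    # Non-saturating GAN loss for the discriminator.
    loss = F.softplus(D(fakes)).mean() + F.softplus(-D(reals)).mean()
    if step % lazy_interval == 0:                      # regularize lazily
        reals = reals.detach().requires_grad_(True)
        grad, = torch.autograd.grad(D(reals).sum(), reals, create_graph=True)
        r1 = grad.pow(2).sum(dim=(1, 2, 3)).mean()     # E[ ||grad D(x)||^2 ]
        loss = loss + (gamma / 2) * r1 * lazy_interval # scale for infrequency
    return loss
```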
Path Length Regularization
As mentioned earlier, perceptual path length can measure GAN performance. One potential problem is that the perceptual distances of different segments along an interpolation path can vary greatly. In short, we want consecutive linear interpolation points to be perceptually equidistant: a displacement in latent space should produce a change of the same magnitude in image space, regardless of the latent values. We therefore add the following regularization term:

$$\mathbb{E}_{w,\,y \sim \mathcal{N}(0,\mathbf{I})}\left( \left\lVert \mathbf{J}_w^{\mathsf{T}} y \right\rVert_2 - a \right)^{2}, \quad \text{where } \mathbf{J}_w = \partial g(w) / \partial w$$

The cost increases when the observed change in image space does not match the expected displacement a. The image-space change is computed via gradients, while a is maintained as a running average of the observed lengths.
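A rough sketch of this penalty, assuming fake = G(w) was computed with gradients enabled; contracting the image with random noise y lets ‖Jᵀ_w y‖ be computed with a single backward pass, and pl_mean plays the role of the running average a:

```python
import torch

def path_length_penalty(fake, w, pl_mean, decay=0.01):
    # Grad of sum(fake * y) w.r.t. w equals J_w^T y for random y.
    y = torch.randn_like(fake) / (fake.shape[2] * fake.shape[3]) ** 0.5
    grad, = torch.autograd.grad((fake * y).sum(), w, create_graph=True)
    lengths = grad.pow(2).sum(dim=-1).sqrt()           # ||J_w^T y|| per sample
    pl_mean = pl_mean + decay * (lengths.mean().item() - pl_mean)  # running a
    penalty = (lengths - pl_mean).pow(2).mean()
    return penalty, pl_mean
```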
We will not elaborate further here; the official code is available at https://github.com/NVlabs/stylegan2/blob/7d3145d23013607b987db30736f89fb1d3e10fad/training/loss.py, and readers can step through it with a debugger.
Progressive Growing
StyleGAN used progressive growing to stabilize the training of high-resolution images, with the problems noted above. StyleGAN2 therefore seeks an alternative design that lets deeper networks train stably. ResNet achieves this with skip connections, so StyleGAN2 explores skip connections and ResNet-style residual designs. In these designs, bilinear filters upsample/downsample the output of the previous layer, and each layer learns a residual on top of it.
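As an illustration, here is a minimal residual discriminator block in the spirit of these designs: bilinear downsampling on the skip path while the convolutions learn the residual. The channel counts and the variance-preserving 1/√2 scaling are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Downsampling residual block: bilinear skip path + learned residual."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # match channels

    def forward(self, x):
        y = F.leaky_relu(self.conv1(x), 0.2)
        y = F.leaky_relu(self.conv2(y), 0.2)
        y = F.interpolate(y, scale_factor=0.5, mode='bilinear',
                          align_corners=False)
        skip = F.interpolate(self.skip(x), scale_factor=0.5, mode='bilinear',
                             align_corners=False)
        return (y + skip) / 2 ** 0.5           # keep activation variance stable
```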


The image below shows the MSG-GAN model, which connects intermediate layers of the generator to the corresponding layers of the discriminator.

[Image: MSG-GAN, with generator layers connected to matching discriminator layers]

The table below shows the performance improvement of different methods.

[Table: FID improvements for the different generator/discriminator designs]

Large Networks
After these changes, we can further analyze the impact of the high-resolution layers on image generation. The StyleGAN2 paper measures how much each layer contributes to the output image over the course of training (left image below, with the horizontal axis representing training progress).
In the early stages of training, the low-resolution layers dominate. However, as training proceeds, the contribution of the high-resolution layers (especially the 1024×1024 layer) stays lower than expected. The researchers suspected that these layers lacked capacity, and indeed, doubling the number of feature maps in the high-resolution layers significantly increases their influence (right image).

[Image: per-layer contribution to outputs over training (left); with doubled feature maps in the high-resolution layers (right)]

Original link: https://medium.com/@jonathan_hui/gan-stylegan-stylegan2-479bdf256299