How Attention Mechanism Enhances GAN Image Quality

| Previous Work
The papers Unsupervised Attention-guided Image-to-Image Translation and Attention-GAN for Object Transfiguration in Wild Images have both studied the combination of attention mechanisms and GANs, and both use attention to separate the foreground from the background. The main approach is:

Split the generator network into two parts: the first is a prediction network, used to predict the area of interest, and the second is a converter network, used for image conversion between the two domains.

The main idea of the paper Attention-GAN for Object Transfiguration in Wild Images is:

Use the segmentation annotations of the input image as additional supervisory information to train the attention network, then apply the attention map to the output of the converter network. The background of the input image is thereby reused as the background of the output, improving the quality of the generated images.

| SPA-GAN

Title: SPA-GAN: Spatial Attention GAN for Image-to-Image Translation
Journal: IEEE Transactions on Multimedia (TMM), 2020
Authors: Hajar Emami, Majid Moradi Aliabadi, Ming Dong, and Ratna Babu Chinnam

Affiliation: Computer Science Department, Wayne State University, Detroit, Michigan, United States

Main Content

SPA-GAN is built on the CycleGAN network structure. It outputs attention maps from the discriminator and feeds them into the generator, helping the generator focus on the most distinguishable regions of the image. It also modifies the cycle consistency loss and adds a feature map loss (computed on the output of the first layer of the decoder). As the most recent of the methods surveyed here, SPA-GAN achieves the lowest KID and the highest classification accuracy. However, its theoretical foundation is lacking: in particular, the ablation study on which encoder and decoder layers should supply the outputs for the feature map loss offers no explanation and is analyzed only from experimental results.
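The discriminator-to-generator attention feedback can be sketched in NumPy under simple assumptions: channel-wise pooling of discriminator activations and min-max normalization, with the resulting map multiplied element-wise into the generator's input. The pooling choice and function names here are illustrative, not the paper's exact formulation.

```python
import numpy as np

def spatial_attention_map(disc_features):
    """Collapse discriminator feature maps (C, H, W) into one spatial
    attention map (H, W): sum absolute activations over channels, then
    normalize to [0, 1] so it can act as a per-pixel weight."""
    amap = np.abs(disc_features).sum(axis=0)
    amap = amap - amap.min()
    if amap.max() > 0:
        amap = amap / amap.max()
    return amap

def apply_attention(image, amap):
    """Reweight an input image (C, H, W) by the attention map so the
    generator concentrates on the most discriminative regions."""
    return image * amap[None, :, :]

rng = np.random.default_rng(0)
features = rng.standard_normal((64, 8, 8))  # stand-in discriminator activations
image = rng.standard_normal((3, 8, 8))      # stand-in input image
attended = apply_attention(image, spatial_attention_map(features))
print(attended.shape)  # (3, 8, 8)
```

In the full model the map would be upsampled to the input resolution before being applied; that step is omitted here because the toy shapes already match.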

Main Contributions

(1) The attention mechanism is applied in the discriminator, and its result, a spatial attention map indicating the local regions the discriminator uses to judge the authenticity of the input image, is fed back to the generator. This allows the generator to assign high weights to the most distinguishable areas. The authors also state that this approach better preserves domain-specific features: in the generation network, it drives the feature map obtained at the first layer of the decoder to match the areas of interest identified in the real and generated images. Attention thus serves as a mechanism for transferring knowledge from the discriminator to the generator, letting the discriminator help the generator identify distinguishable areas more clearly.

(2) The cycle consistency loss is modified, and a new generator feature map loss is added (aimed at preserving domain-specific features).

(3) Unlike previous GANs that added attention mechanisms (which either required additional supervisory information or a separate attention network, increasing the computational burden on the GPU), SPA-GAN is a lightweight model.
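The two loss changes in contribution (2) can be sketched as follows. This is a minimal NumPy illustration assuming L1 distances and an attention map already at image resolution; the function names and exact weighting scheme are assumptions for the sketch, not the paper's definitions.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference between two arrays."""
    return np.abs(a - b).mean()

def attended_cycle_loss(x, x_reconstructed, attention_map):
    """Cycle consistency computed on attention-weighted images, so
    reconstruction errors in discriminative regions cost more."""
    w = attention_map[None, :, :]  # broadcast the (H, W) map over channels
    return l1(x * w, x_reconstructed * w)

def feature_map_loss(feat_real, feat_fake):
    """L1 distance between decoder first-layer feature maps of the real
    and generated images, encouraging domain-specific features to survive."""
    return l1(feat_real, feat_fake)

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 8, 8))
amap = rng.random((8, 8))
loss = attended_cycle_loss(x, x + 0.1, amap) + feature_map_loss(
    rng.standard_normal((64, 4, 4)), rng.standard_normal((64, 4, 4)))
print(loss > 0.0)  # True
```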

Figure 1 Comparison of CycleGAN and SPA-GAN structures

Figure 2 Comparison of style transfer results of different algorithms

Figure 3 Comparison of style transfer results for apple ↔ orange translation

Evaluation Criteria

KID, classification accuracy, and human visual judgment, along with additional ablation studies. KID is defined as the squared maximum mean discrepancy (MMD) between the Inception representations of real and generated images. It is an unbiased estimator that makes no assumptions about the form of the activation distributions, which makes it more reliable than FID. A smaller KID indicates higher visual similarity between real and generated images.
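The KID definition above can be made concrete with the standard unbiased MMD² estimator and the degree-3 polynomial kernel commonly used for KID; random vectors stand in for Inception features in this sketch.

```python
import numpy as np

def polynomial_kernel(X, Y):
    """Degree-3 polynomial kernel typically used for KID:
    k(x, y) = (x . y / d + 1)^3, where d is the feature dimension."""
    d = X.shape[1]
    return (X @ Y.T / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    """Unbiased estimate of the squared MMD between two feature sets
    (in practice, Inception activations of real and generated images)."""
    m, n = len(real_feats), len(fake_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_ff = polynomial_kernel(fake_feats, fake_feats)
    k_rf = polynomial_kernel(real_feats, fake_feats)
    # exclude diagonal (self-similarity) terms for the unbiased within-set sums
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return term_rr + term_ff - 2.0 * k_rf.mean()

rng = np.random.default_rng(2)
real = rng.standard_normal((100, 16))
same = rng.standard_normal((100, 16))  # same distribution -> KID near 0
shifted = same + 2.0                   # shifted distribution -> much larger KID
print(kid(real, same) < kid(real, shifted))  # True
```

Because the estimator is unbiased, KID for two samples of the same distribution fluctuates around zero, whereas FID's estimator is biased for finite sample sizes; this is the practical sense in which KID is "more reliable."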

Paper Evaluation

This paper is substantial in content. The application of attention mechanisms to GANs has gone through multiple structural transformations:

- SAGAN (the earliest): attention mechanisms in both the generator and the discriminator.
- AttentionGAN: decouples the generator into two structures, a prediction network and a converter network.
- Attention-GAN for Object Transfiguration in Wild Images: adds segmentation annotations as additional supervisory information.
- AGGAN: adds a separate attention network.

Earlier image-translation methods form the other line of prior work:

- CycleGAN: proposed the cycle consistency loss; uses residual networks for image conversion.
- DualGAN: uses WGAN's loss function for higher stability.
- DiscoGAN: uses the simplest CNN encoder-decoder, with a fully connected network as the converter.
- UNIT: shared latent space assumption, cycle loss, VAE.
- MUNIT: two latent space assumptions, one for content and one for style, usable across multiple domains; the content code remains unchanged while the style varies.
- DRIT: decouples the latent space into a domain-shared content space (capturing common information) and a domain-specific attribute space.

Against this background, SPA-GAN builds on the CycleGAN structure: it feeds the discriminator's attention maps into the generator to help it focus on the most distinguishable regions, modifies the cycle consistency loss, and adds a feature map loss computed on the output of the first decoder layer. As the most recent of these methods, it achieves the lowest KID and the highest classification accuracy, though, as noted above, its ablation study on the choice of encoder and decoder layers for the feature map loss rests on experimental results alone.

References

[1] Emami H, Aliabadi M M, Dong M, et al. SPA-GAN: Spatial Attention GAN for Image-to-Image Translation[J]. IEEE Transactions on Multimedia, 2020.
