ViTGAN: A New Approach to Image Generation Using Transformers

Transformers have brought tremendous advances to natural language tasks and have recently begun to make inroads into computer vision, showing promise in tasks long dominated by CNNs. A recent study from the University of California, San Diego, and Google Research proposed using Vision Transformers to train GANs. To make this approach work, the researchers also proposed several improvements that allow the new method to match state-of-the-art CNN-based models on several metrics.
Convolutional Neural Networks (CNNs) have become the dominant technology in computer vision thanks to their powerful convolution (weight sharing and local connectivity) and pooling (translation invariance) operations. Recently, however, the Transformer architecture has begun to rival CNNs in image and video recognition tasks. Notably, the Vision Transformer (ViT) interprets an image as a sequence of tokens (analogous to words in natural language). Research by Dosovitskiy et al. shows that ViT can achieve classification accuracy comparable to CNNs on the ImageNet benchmark at a lower computational cost. Unlike the local connectivity of CNNs, ViT relies on representations considered in a global context, where each patch is processed in relation to all patches of the same image.
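For readers unfamiliar with this tokenization, here is a minimal PyTorch-style sketch (not the authors' code; the patch size and embedding dimension are illustrative) of how an image is split into non-overlapping patches that are flattened, linearly projected, and given positional embeddings:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=32, patch_size=4, in_chans=3, embed_dim=384):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # Shared linear projection applied to every flattened patch.
        self.proj = nn.Linear(in_chans * patch_size * patch_size, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                                     # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Slice the image into non-overlapping P x P patches and flatten each one.
        patches = x.unfold(2, P, P).unfold(3, P, P)           # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        return self.proj(patches) + self.pos_embed            # (B, N, D) token sequence

tokens = PatchEmbedding()(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 64, 384])
```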
Although ViT and its variants are still in their early stages, studies have demonstrated their strong promise for modeling non-local contextual dependencies, as well as their excellent efficiency and scalability. Since its introduction, ViT has been used in various tasks such as object detection, video recognition, and multi-task pre-training.
Recently, a study from the University of California, San Diego, and Google Research proposed using Vision Transformers to train GANs. The research question of the paper is: can Vision Transformers accomplish image generation tasks without convolution or pooling? More specifically, can ViT be used to train Generative Adversarial Networks (GANs) and achieve quality comparable to the widely studied CNN-based GANs?


Paper link: https://arxiv.org/pdf/2107.04589.pdf
To this end, the researchers trained GANs using the most basic ViT design (shown in Figure 2(A)). The challenge is that GAN training becomes very unstable when coupled with ViT: adversarial training is frequently hindered by high-variance gradients (gradient spikes) in the later stages of discriminator training. Moreover, traditional regularization methods such as gradient penalty and spectral normalization, while effective for CNN-based GAN models (see Figure 4), do not resolve this instability. Since such instability is uncommon in CNN-based GAN training once appropriate regularization is applied, this is a challenge unique to ViT-based GANs.
To address these issues and achieve stable training dynamics while facilitating convergence of ViT-based GANs, this paper proposes several necessary modifications.
For the discriminator, the researchers re-examined the Lipschitz properties of self-attention and designed an improved spectral normalization that strengthens Lipschitz continuity. Unlike traditional spectral normalization, which fails to tame the instability, these techniques effectively stabilize the training dynamics of ViT-based discriminators. The researchers also conducted ablation studies to validate the effectiveness of the newly proposed techniques. For the ViT-based generator, they experimented with various architectural designs and identified two critical modifications concerning layer normalization and the output mapping layer. Experiments showed that, regardless of whether the discriminator is based on ViT or CNN, the modified ViT-based generator better facilitates adversarial training.
To provide more convincing evidence, the researchers conducted experiments on three standard image synthesis benchmarks. The results indicate that the newly proposed model, ViTGAN, significantly outperforms previous Transformer-based GANs and achieves performance comparable to leading CNN-based GANs such as StyleGAN2, even without using convolution or pooling. The authors state that ViTGAN is one of the earliest attempts to use Vision Transformers in GANs, and, more importantly, that this research is the first to demonstrate that Transformers can rival the current best convolutional architectures on standard image generation benchmarks such as CIFAR, CelebA, and LSUN bedroom.
Methods
Figure 1 illustrates the newly proposed ViTGAN architecture, comprising a ViT discriminator and a ViT-based generator. The researchers found that directly using ViT as a discriminator leads to unstable training. To stabilize the training dynamics and promote convergence, they introduced new techniques for both networks: (1) regularization of the ViT discriminator and (2) a new generator architecture.
Figure 1: Schematic diagram of the newly proposed ViTGAN framework. Both the generator and discriminator are designed based on visual Transformers (ViT). The discriminator score is derived from classification embeddings (noted as * in the figure); the generator generates pixels patch by patch based on patch embeddings.
Enhancing the Lipschitz Properties of the Transformer Discriminator. Lipschitz continuity plays a crucial role in GAN discriminators. It was first highlighted as a condition for approximating the Wasserstein distance in WGAN, and its importance has since been confirmed in GAN settings beyond the Wasserstein loss. Notably, the ICML 2019 paper "Lipschitz Generative Adversarial Nets" demonstrated that Lipschitz discriminators guarantee the existence of an optimal discriminator function and a unique Nash equilibrium. However, the ICML 2021 paper "The Lipschitz Constant of Self-Attention" showed that the Lipschitz constant of standard dot-product self-attention can be unbounded, which breaks Lipschitz continuity in ViT. To enhance the Lipschitz properties of the ViT discriminator, the researchers adopted the L2 attention proposed in that paper: as shown in Equation 7, the dot-product similarity is replaced with Euclidean distance, and the projection matrices for queries and keys in self-attention are tied. This improvement enhances the stability of Transformers used in GAN discriminators.
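Below is a minimal sketch of such an L2 self-attention layer in PyTorch (an illustrative re-implementation, not the authors' code; head count and dimensions are assumptions). The dot product is replaced by the negative squared Euclidean distance between projected tokens, and a single projection matrix is shared by queries and keys:

```python
import math
import torch
import torch.nn as nn

class L2MultiheadSelfAttention(nn.Module):
    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Tied weights: the same projection is used for queries and keys,
        # which is what makes the L2 attention Lipschitz-bounded.
        self.qk_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                                       # x: (B, N, D)
        B, N, _ = x.shape
        qk = self.qk_proj(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Squared Euclidean distance between all pairs of projected tokens:
        # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
        sq = (qk ** 2).sum(dim=-1)                              # (B, H, N)
        dist = sq.unsqueeze(-1) - 2 * qk @ qk.transpose(-2, -1) + sq.unsqueeze(-2)
        # Negative distance replaces the dot product as the attention logit.
        attn = torch.softmax(-dist / math.sqrt(self.head_dim), dim=-1)
        out = attn @ v                                          # (B, H, N, head_dim)
        return self.out_proj(out.transpose(1, 2).reshape(B, N, -1))

y = L2MultiheadSelfAttention()(torch.randn(2, 64, 384))
print(y.shape)  # torch.Size([2, 64, 384])
```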
Improved Spectral Normalization. To further strengthen Lipschitz continuity, the researchers also applied spectral normalization when training the discriminator. Standard spectral normalization uses power iteration to estimate the spectral norm of each layer's projection matrix and then divides the weight matrix by that estimate, yielding a projection matrix whose Lipschitz constant equals 1. The researchers found that Transformer modules are highly sensitive to the scale of the Lipschitz constant, and that training slows down significantly when this standard spectral normalization is applied. Similarly, they found that the R1 gradient penalty impairs GAN training when a ViT-based discriminator is used. Other studies have shown that if the Lipschitz constant of the MLP module is too small, the Transformer's output can collapse to a rank-1 matrix. To address this, the researchers proposed increasing the spectral norm of the projection matrices.
They found that simply multiplying the normalized weight matrix of each layer by its spectral norm at initialization is sufficient to resolve the issue. Specifically, the update rule for this improved spectral normalization is W_ISN = σ(W_init) · W / σ(W), where σ(W) is the standard spectral norm of the weight matrix W computed by power iteration and W_init is the weight matrix at initialization.
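A rough sketch of this improved spectral normalization, wrapped around a linear layer in PyTorch (an illustrative implementation based on the rule above, not the authors' code), might look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImprovedSpectralNormLinear(nn.Module):
    def __init__(self, linear: nn.Linear, n_power_iterations: int = 1):
        super().__init__()
        self.linear = linear
        self.n_power_iterations = n_power_iterations
        self.register_buffer("u", torch.randn(linear.weight.size(0)))
        # Spectral norm of the weight at initialization (kept fixed).
        self.register_buffer("sigma_init", self._estimate_sigma(linear.weight.detach()))

    def _estimate_sigma(self, w):
        # Power iteration estimates the largest singular value of w.
        with torch.no_grad():
            u = self.u
            for _ in range(self.n_power_iterations):
                v = F.normalize(w.t() @ u, dim=0)
                u = F.normalize(w @ v, dim=0)
            self.u.copy_(u)
        return (u @ w @ v).abs()

    def forward(self, x):
        sigma = self._estimate_sigma(self.linear.weight)
        # W_ISN = sigma(W_init) * W / sigma(W): the Lipschitz constant is kept
        # at its initial scale rather than forced down to 1.
        w = self.sigma_init * self.linear.weight / sigma
        return F.linear(x, w, self.linear.bias)

layer = ImprovedSpectralNormLinear(nn.Linear(384, 384))
print(layer(torch.randn(2, 64, 384)).shape)  # torch.Size([2, 64, 384])
```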
Overlapping Image Patches. Because the ViT discriminator has excessive learning capacity, it is prone to overfitting. In this study, the discriminator and generator use the same image representation: an image is partitioned into a sequence of non-overlapping patches according to a predefined P×P grid. If not carefully designed, these arbitrary grid partitions may encourage the discriminator to memorize local cues and thus fail to provide a meaningful loss for the generator. To solve this problem, the researchers adopted a simple trick: allowing overlap between patches. Each patch is extended by o pixels on every side, making the effective patch size (P+2o)×(P+2o).
This results in the same sequence length as before, but with reduced sensitivity to the predefined grid. This may also help the Transformer better understand which neighboring patches are relevant to the current patch, thereby improving the understanding of local features.
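The overlapping-patch trick can be sketched with a single unfold call in PyTorch (illustrative, not the authors' code); note that the number of patches stays the same while each patch grows to (P+2o)×(P+2o):

```python
import torch
import torch.nn.functional as F

def overlapping_patches(x, patch_size=4, overlap=2):
    """x: (B, C, H, W) -> (B, N, C * (P + 2o)^2) with N = (H/P) * (W/P)."""
    P, o = patch_size, overlap
    # Pad the borders so patches at the image edge can also be extended.
    x = F.pad(x, (o, o, o, o))
    # Window size P + 2o with stride P gives overlapping patches on the same
    # H/P x W/P grid as the non-overlapping case.
    patches = F.unfold(x, kernel_size=P + 2 * o, stride=P)   # (B, C*(P+2o)^2, N)
    return patches.transpose(1, 2)

tokens = overlapping_patches(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 64, 192]): 64 patches of 3 x 8 x 8 pixels
```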
Generator Design
Designing a generator based on the ViT architecture is not an easy task, with one major challenge being to shift the function of ViT from predicting a set of class labels to generating pixels in a spatial area.
Figure 2: Generator architecture. Left: the three generator architectures studied by the researchers: (A) adds the intermediate latent embedding w to every positional embedding, (B) prepends w to the token sequence, and (C) replaces normalization with self-modulated layer normalization (SLN) computed from w. Right: details of the self-modulation operation used in the Transformer module.
The researchers first explored several generator architectures and found that none could match the performance of CNN-based generators. They therefore proposed a novel generator that follows the design principles of ViT. Figure 2(C) shows this ViTGAN generator, which consists of two main components: a Transformer module and an output mapping layer.
To facilitate the training process, the researchers made two improvements to the newly proposed generator:
  • Self-Modulated Layer Normalization (SLN). Instead of feeding the noise vector z to the ViT as an input, the new approach uses z to modulate the layer normalization operation (see the first sketch after this list). The operation is termed self-modulation because it requires no external information;

  • Implicit Neural Representations for Patch Generation. To learn a continuous mapping from patch embeddings to patch pixel values, the researchers used implicit neural representations (see the second sketch after this list). Combined with Fourier features or sine activation functions, implicit representations constrain the generated samples to the space of smoothly varying natural signals. The researchers found that implicit representations are particularly beneficial when training GANs with ViT-based generators.

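The first sketch below illustrates self-modulated layer normalization in PyTorch (illustrative, not the authors' code): a latent vector w derived from z predicts the scale and shift applied to the normalized hidden activations of each Transformer block:

```python
import torch
import torch.nn as nn

class SelfModulatedLayerNorm(nn.Module):
    def __init__(self, dim=384, w_dim=384):
        super().__init__()
        # No fixed affine parameters; scale and shift come from w instead.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(w_dim, dim)
        self.to_beta = nn.Linear(w_dim, dim)

    def forward(self, h, w):                    # h: (B, N, D) tokens, w: (B, w_dim)
        gamma = self.to_gamma(w).unsqueeze(1)   # (B, 1, D), broadcast over tokens
        beta = self.to_beta(w).unsqueeze(1)
        return gamma * self.norm(h) + beta

sln = SelfModulatedLayerNorm()
out = sln(torch.randn(2, 64, 384), torch.randn(2, 384))
print(out.shape)  # torch.Size([2, 64, 384])
```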
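The second sketch illustrates implicit-representation patch generation (again illustrative, not the authors' code): fixed Fourier features of the pixel coordinates inside a patch are concatenated with the patch embedding, and a small MLP maps each coordinate to an RGB value:

```python
import math
import torch
import torch.nn as nn

class FourierPatchDecoder(nn.Module):
    def __init__(self, embed_dim=384, patch_size=4, n_freqs=8, hidden=256):
        super().__init__()
        # Fixed 2-D pixel coordinates inside one patch, normalized to [-1, 1].
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, patch_size), torch.linspace(-1, 1, patch_size),
            indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)        # (P*P, 2)
        freqs = 2.0 ** torch.arange(n_freqs) * math.pi
        feats = torch.cat([torch.sin(coords[:, :, None] * freqs),
                           torch.cos(coords[:, :, None] * freqs)], dim=-1)
        self.register_buffer("fourier", feats.reshape(patch_size ** 2, -1))  # (P*P, 4*n_freqs)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + 4 * n_freqs, hidden), nn.GELU(),
            nn.Linear(hidden, 3))                                    # RGB per pixel

    def forward(self, patch_embed):              # patch_embed: (B, N, D)
        B, N, D = patch_embed.shape
        E = patch_embed[:, :, None, :].expand(B, N, self.fourier.size(0), D)
        F_ = self.fourier[None, None].expand(B, N, -1, -1)
        return self.mlp(torch.cat([E, F_], dim=-1))                  # (B, N, P*P, 3)

pixels = FourierPatchDecoder()(torch.randn(2, 64, 384))
print(pixels.shape)  # torch.Size([2, 64, 16, 3]): 64 patches of 4x4 RGB pixels
```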
It should be noted that because the generator and discriminator use different image grids, their sequence lengths also differ. Further study showed that when scaling the model to higher-resolution images, it suffices to increase the sequence length or feature dimension of the discriminator.
Experimental Results
Table 1: Comparison of several representative GAN architectures on unconditional image generation benchmarks. Conv and Pool represent convolution and pooling, respectively. ↓ indicates lower is better; ↑ indicates higher is better.
Table 1 presents the main results on three standard image synthesis benchmarks. The new method is compared against the following baseline architectures. TransGAN is currently the only GAN that is built entirely on Transformers without any convolution; the comparison uses its best variant, TransGAN-XL. Vanilla-ViT is a ViT-based GAN that uses the generator shown in Figure 2(A) and a vanilla ViT discriminator, without the improvements proposed in this paper.
The generator architecture shown in Figure 2(B) is compared separately in Table 3a. In addition, BigGAN and StyleGAN2, the best CNN-based GAN models, are included in the comparison.
Figure 3: Qualitative comparison. On the CIFAR-10 32×32, CelebA 64×64, and LSUN Bedroom 64×64 datasets, ViTGAN results are compared with StyleGAN2, the best Transformer baseline, and a vanilla ViT generator and discriminator.
Figure 4: (a-c) Gradient magnitudes of the ViT discriminator (L2 norm across all parameters), (d-f) FID scores (lower is better) over training iterations.
Here the newly proposed method is compared with two vanilla ViT discriminator baselines, one using the R1 penalty and one using spectral norm; the rest of the architecture is identical across all methods. The new method overcomes the spikes in gradient magnitude and achieves significantly lower FID (on CIFAR and CelebA) or comparable FID (on LSUN).
Table 3: Ablation studies of ViTGAN on the CIFAR-10 dataset. Left: ablation on the generator architecture. Right: ablation on the discriminator architecture.
