
Development of Generative Adversarial Networks (GAN)
Author: Military Eagle Think Tank | Source: Military Eagle Dynamics
The Generative Adversarial Network (GAN) is a deep learning generative model proposed by Ian Goodfellow and colleagues. GAN is structurally inspired by the two-player zero-sum game of game theory, in which the players' payoffs sum to zero: one party's gain is the other party's loss. The system consists of a generator and a discriminator. The generator captures the latent distribution of the real data samples and generates new data samples; the discriminator is a binary classifier that judges whether an input is a real sample or a generated one. Both the generator and the discriminator can be implemented with any of the currently popular deep neural networks.

The introduction of GAN meets research and application needs in many fields and has injected new development momentum into them; GAN has become one of the hottest research directions in artificial intelligence. Image and vision tasks are currently the most widely researched and applied domain for GAN, which can generate handwritten digits, faces, and a variety of realistic indoor and outdoor scenes; restore original images from segmented images; colorize black-and-white images; recover object images from outlines; and produce high-resolution images from low-resolution inputs. GAN has also begun to be applied to research in speech and language processing, computer virus detection, and board-game-playing programs.
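For concreteness, the adversarial game can be written as the standard minimax objective from Goodfellow et al. (2014), in which the discriminator D maximizes the value function while the generator G minimizes it:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

Here z is a noise vector drawn from a simple prior p_z, and G(z) is a generated sample.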
1. The Evolution of GAN Structure
(1) Fully Connected GAN
The first GAN architecture, introduced in 2014, used fully connected neural networks for both the generator and the discriminator. This architecture was applied to relatively simple image datasets: handwritten digits (MNIST), natural images (CIFAR-10), and the Toronto Face Dataset (TFD).
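A minimal sketch of such a fully connected GAN and its alternating training updates, in PyTorch (layer sizes here are illustrative choices for MNIST-scale 28×28 images, not the original paper's configuration):

```python
import torch
import torch.nn as nn

# Fully connected generator: maps a noise vector to a flattened 28x28 image.
G = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh(),          # outputs in [-1, 1]
)

# Fully connected discriminator: binary classifier, real vs. generated.
D = nn.Sequential(
    nn.Linear(784, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),         # probability that input is real
)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real):                        # real: (batch, 784) tensor
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    z = torch.randn(batch, 100)
    fake = G(z).detach()                     # don't backprop into G here
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: maximize log D(G(z)) (the non-saturating form).
    z = torch.randn(batch, 100)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Each step first updates the discriminator to separate real from generated samples, then updates the generator to fool the refreshed discriminator, mirroring the minimax objective above.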
(2) Convolutional GAN
Moving from fully connected networks to convolutional neural networks (CNNs) is a very natural extension, since CNNs are particularly well suited to image data. However, early experiments on CIFAR-10 indicated that it was relatively difficult to train generator and discriminator networks using CNNs with the same capacity and representational power as those used in supervised learning.
The Laplacian Pyramid GAN (LAPGAN) offered one solution to this problem through a multi-scale generation process: real images are decomposed into a Laplacian pyramid, and images are then generated layer by layer, iteratively from coarse to fine.
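A Laplacian pyramid stores, at each scale, the detail lost by downsampling, which is what each stage of LAPGAN learns to generate. A minimal PyTorch sketch of the decomposition itself (not of the LAPGAN training procedure; it assumes H and W are divisible by 2**levels):

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(img, levels=3):
    """Decompose a (batch, channels, H, W) image into a Laplacian pyramid."""
    pyramid, current = [], img
    for _ in range(levels):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, scale_factor=2, mode='bilinear',
                           align_corners=False)
        pyramid.append(current - up)   # high-frequency detail at this scale
        current = down
    pyramid.append(current)            # coarsest residual image
    return pyramid
```

Generation runs in the opposite direction: starting from the coarsest level, each conditional generator produces the detail band to add after upsampling.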
Additionally, Radford et al. proposed a network architecture called DCGAN (Deep Convolutional GAN) in 2015 that makes it possible to train a pair of deep convolutional generator and discriminator networks. DCGAN uses strided convolutions (and their fractionally strided, i.e., transposed, counterparts), allowing the networks to learn their own spatial downsampling and upsampling kernels during training.
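A sketch of a DCGAN-style generator in PyTorch (the 64×64 output size and channel widths follow the general pattern of Radford et al., but the exact configuration here is illustrative):

```python
import torch.nn as nn

# DCGAN-style generator: transposed (fractionally strided) convolutions
# learn the upsampling from a 100-d noise vector to a 64x64 RGB image.
dcgan_generator = nn.Sequential(
    nn.ConvTranspose2d(100, 512, kernel_size=4, stride=1, padding=0,
                       bias=False),                      # -> (512, 4, 4)
    nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),   # -> (256, 8, 8)
    nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),   # -> (128, 16, 16)
    nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),    # -> (64, 32, 32)
    nn.BatchNorm2d(64), nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),      # -> (3, 64, 64)
    nn.Tanh(),
)
```

The discriminator mirrors this structure with strided Conv2d layers, so downsampling is likewise learned rather than fixed by pooling.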
Extending GAN beyond 2D image synthesis, Wu et al. proposed a GAN that synthesizes 3D samples using volumetric convolutions, generating new objects including chairs, tables, and cars; they also proposed a method for mapping 2D images to the 3D objects they depict.
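The DCGAN pattern carries over to 3D by swapping 2D convolutions for volumetric ones. A sketch (the 32³ voxel resolution and channel widths are illustrative assumptions, not the exact configuration of Wu et al.):

```python
import torch.nn as nn

# Volumetric generator: transposed 3D convolutions map a 200-d noise
# vector to a 32x32x32 voxel occupancy grid.
voxel_generator = nn.Sequential(
    nn.ConvTranspose3d(200, 256, kernel_size=4, stride=1, padding=0,
                       bias=False),                      # -> (256, 4, 4, 4)
    nn.BatchNorm3d(256), nn.ReLU(True),
    nn.ConvTranspose3d(256, 128, 4, 2, 1, bias=False),   # -> (128, 8, 8, 8)
    nn.BatchNorm3d(128), nn.ReLU(True),
    nn.ConvTranspose3d(128, 64, 4, 2, 1, bias=False),    # -> (64, 16, 16, 16)
    nn.BatchNorm3d(64), nn.ReLU(True),
    nn.ConvTranspose3d(64, 1, 4, 2, 1, bias=False),      # -> (1, 32, 32, 32)
    nn.Sigmoid(),                                        # voxel occupancy
)
```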
(3) Conditional GAN
In 2014, Mirza et al. made both the generator and the discriminator conditional, feeding additional information such as class labels to each network; conditional GANs have stronger representational power for generating multi-modal data. In a parallel line of work, InfoGAN decomposes the noise source into an incompressible noise source and a "latent code," and attempts to discover the latent factors of variation by maximizing the mutual information between the latent code and the generator's output. The latent code can be used to discover object classes in a purely unsupervised manner, although the latent code is not strictly necessary. The representations learned by InfoGAN appear to be semantically meaningful, capturing complex entangled factors of image appearance, including variations in pose, lighting, and expression in face images.
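A sketch of the conditioning mechanism in PyTorch (a minimal illustration that one-hot-encodes a 10-class label and concatenates it to the inputs; Mirza et al. describe the idea at roughly this level of generality, but the dimensions here are hypothetical):

```python
import torch
import torch.nn as nn

NOISE, CLASSES, IMG = 100, 10, 784

class CondGenerator(nn.Module):
    """Generator conditioned by concatenating the class label to the noise."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE + CLASSES, 256), nn.ReLU(),
            nn.Linear(256, IMG), nn.Tanh(),
        )
    def forward(self, z, y_onehot):
        return self.net(torch.cat([z, y_onehot], dim=1))

class CondDiscriminator(nn.Module):
    """Discriminator that judges 'real AND consistent with the label'."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG + CLASSES, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )
    def forward(self, x, y_onehot):
        return self.net(torch.cat([x, y_onehot], dim=1))
```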
(4) Inference Model GAN
In its original formulation, GAN lacks a way to map a given observation back to a vector in latent space (i.e., an inference mechanism). As GAN developed, several techniques were proposed for inverting the generators of pre-trained GANs. Adversarially Learned Inference (ALI) and Bidirectional GAN (BiGAN) provide simple yet effective extensions: both introduce an inference model and use the discriminator to fit the joint distribution of latent codes and observed data.
In this formulation, the generator consists of two networks: an "encoder" (the inference network) and a "decoder." The discriminator receives a (latent code, observation) pair at a time and must determine whether the pair is a real sample together with its inferred encoding, or a generated sample together with the latent code that produced it.
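A sketch of this joint discriminator in PyTorch (a minimal illustration with hypothetical dimensions; ALI and BiGAN differ in details not shown here):

```python
import torch
import torch.nn as nn

Z, X = 64, 784

encoder = nn.Sequential(nn.Linear(X, 256), nn.ReLU(), nn.Linear(256, Z))
decoder = nn.Sequential(nn.Linear(Z, 256), nn.ReLU(),
                        nn.Linear(256, X), nn.Tanh())

# Joint discriminator: scores (observation, latent code) pairs,
# not observations alone.
joint_disc = nn.Sequential(
    nn.Linear(X + Z, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

def discriminator_scores(real_x, noise_z):
    # "Real" pair: a data sample with its inferred latent code.
    real_pair = torch.cat([real_x, encoder(real_x)], dim=1)
    # "Fake" pair: a generated sample with the code that produced it.
    fake_pair = torch.cat([decoder(noise_z), noise_z], dim=1)
    return joint_disc(real_pair), joint_disc(fake_pair)
```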
Ideally, in an encoder-decoder model the reconstructed output should closely resemble the input. In practice, however, the fidelity of data samples reconstructed with ALI/BiGAN is poor, and improving that fidelity can come at growing adversarial cost as the complexity of the data samples increases.
(5) Adversarial Autoencoders
An autoencoder consists of an "encoder" and a "decoder": the encoder learns a deterministic mapping from data space (e.g., images) to an internal latent space, and the decoder learns the mapping from latent space back to data space. The composition of these two mappings forms a "reconstruction," and the encoder and decoder are trained together so that the reconstructed image is as close as possible to the original. An adversarial autoencoder additionally imposes a discriminator on the latent space, pushing the distribution of encoder outputs toward a chosen prior.
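A sketch of this combination in PyTorch (a minimal illustration assuming a Gaussian prior on the latent space; dimensions are hypothetical):

```python
import torch
import torch.nn as nn

Z, X = 32, 784
encoder = nn.Sequential(nn.Linear(X, 256), nn.ReLU(), nn.Linear(256, Z))
decoder = nn.Sequential(nn.Linear(Z, 256), nn.ReLU(),
                        nn.Linear(256, X), nn.Tanh())

# Latent-space discriminator: encoder outputs vs. samples from the prior.
latent_disc = nn.Sequential(nn.Linear(Z, 128), nn.LeakyReLU(0.2),
                            nn.Linear(128, 1), nn.Sigmoid())
bce, mse = nn.BCELoss(), nn.MSELoss()

def losses(x):
    z_fake = encoder(x)                    # codes inferred from the data
    z_real = torch.randn_like(z_fake)      # codes from the Gaussian prior
    ones = torch.ones(x.size(0), 1)
    zeros = torch.zeros(x.size(0), 1)
    recon = mse(decoder(z_fake), x)        # reconstruction objective
    d_loss = bce(latent_disc(z_real), ones) + \
             bce(latent_disc(z_fake.detach()), zeros)
    e_loss = bce(latent_disc(z_fake), ones)  # encoder fools the critic
    return recon, d_loss, e_loss
```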
(6) GAN Model Optimization
During gradient-descent training, GAN suffers from vanishing gradients: when the real samples and the generated samples have little or no overlap, the Jensen-Shannon divergence in its objective function is a constant, so the optimization target is effectively discontinuous. To address this, Arjovsky et al. proposed two versions of Wasserstein GAN (W-GAN). W-GAN replaces the Jensen-Shannon divergence with the Earth-Mover distance to measure the gap between the distributions of real and generated samples, and uses a critic function in place of GAN's discriminator; the critic must satisfy a Lipschitz continuity constraint.

Additionally, GAN's discriminator has unbounded modeling capacity: it can separate real from generated samples no matter how complex they are, which easily leads to overfitting. To limit the model's capacity, Loss-Sensitive GAN (LS-GAN) was proposed; it restricts the loss function minimized in its objective to the class of Lipschitz-continuous functions, and it also provides quantitative analysis of the vanishing-gradient regime. Note that W-GAN and LS-GAN do not alter the structure of the GAN model; they only improve the optimization method.
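A sketch of the W-GAN modification in PyTorch (a minimal illustration; the clipping constant 0.01 and the RMSprop learning rate follow common choices reported for W-GAN, while the critic body is a hypothetical stand-in):

```python
import torch
import torch.nn as nn

# Critic: like a discriminator, but with an unbounded scalar output
# (no sigmoid); its score gap approximates the Earth-Mover distance.
critic = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                       nn.Linear(256, 1))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

def critic_step(real, fake):
    # Maximize E[critic(real)] - E[critic(fake)] (minimize the negation).
    loss = critic(fake).mean() - critic(real).mean()
    opt_c.zero_grad(); loss.backward(); opt_c.step()
    # Enforce the Lipschitz constraint crudely by clipping weights.
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-0.01, 0.01)
    return loss

def generator_loss(fake):
    # The generator raises the critic's score on generated samples.
    return -critic(fake).mean()
```

Because the critic's score gap stays informative even when the two distributions barely overlap, the generator keeps receiving useful gradients where the original Jensen-Shannon objective would saturate.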
2. Application Areas of GAN
As a model with "unlimited" generative capability, GAN's most direct application is modeling: generating data samples consistent with the distribution of real data, such as images and video. GAN can be used for learning problems where labeled data is scarce, such as unsupervised and semi-supervised learning. GAN can also be used for speech and language processing, for example generating dialogue or generating images from text.
(1) Image and Vision Fields
GAN can generate images consistent with the distribution of real data. A typical application comes from Twitter, whose researchers used GAN to transform a low-resolution, blurry image into a high-resolution image with rich detail, with a VGG-style network as the discriminator and a parameterized residual network as the generator. GAN has also begun to be used for generating autonomous-driving scenes: Santana et al. used GAN to generate images consistent with the distribution of actual traffic scenes, then trained a transition model based on recurrent neural networks (RNNs) on top of it for prediction purposes. GAN can serve semi-supervised and unsupervised learning tasks in autonomous applications such as drones and self-driving cars, and the GAN generator can be continually updated in real time with video frames of actual scenes.
Simulated images can be used together with real images as training samples for human-eye (gaze) detection, but there is a distribution gap between the simulated images and real ones. Shrivastava et al. proposed a GAN-based method, called SimGAN, that uses unlabeled real images to enrich and refine the simulated images, making the synthetic images more realistic. The method introduces a self-regularization term to minimize the synthesis error while preserving the annotation content of the simulated image as much as possible, and it applies a local adversarial loss that discriminates each local image patch, enriching local detail.
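A sketch of the refiner objective in PyTorch (a heavily simplified illustration: the network bodies and the weight LAMBDA are hypothetical placeholders, and the patch-level adversarial loss is shown only schematically):

```python
import torch
import torch.nn as nn

# Refiner: image-to-image network that makes a synthetic image more realistic.
refiner = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 3, padding=1),
)
# Patch discriminator: a grid of real/fake logits, one per local image patch.
patch_disc = nn.Sequential(
    nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 3, stride=2, padding=1),
)
bce = nn.BCEWithLogitsLoss()
LAMBDA = 0.5  # hypothetical weight on the self-regularization term

def refiner_loss(synthetic):
    refined = refiner(synthetic)
    scores = patch_disc(refined)                     # per-patch logits
    adv = bce(scores, torch.ones_like(scores))       # fool every local patch
    self_reg = (refined - synthetic).abs().mean()    # L1: preserve annotations
    return adv + LAMBDA * self_reg
```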
(2) Speech and Language Fields
There are now several works applying GAN to speech and language processing. For example, GAN has been used to model the implicit correlation between utterances in order to generate dialogue text. In one approach to GAN-based text generation, a CNN serves as the discriminator; the discriminator fits the output of an LSTM generator, and the optimization problem is solved by moment matching. During training, unlike the traditional scheme of updating the discriminator several times before each generator update, the generator is updated several times before the CNN discriminator is updated.

SeqGAN trains its generator with policy gradients, obtaining the reward signal fed back to the generator through Monte Carlo search over the discriminator's judgments; experiments indicate that SeqGAN can outperform traditional methods in speech, poetry, and music generation. Reed et al. proposed generating images from text descriptions with GAN: the text encoding is used as the conditional input to the generator and, to exploit the text information further, is also fed as side information into specific layers of the discriminator, improving its accuracy in judging whether a generated image matches the text description. Experimental results show that the generated images are highly relevant to the text descriptions.
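A sketch of the policy-gradient update at the heart of SeqGAN (a heavily simplified PyTorch illustration: the reward here is a sequence-level discriminator score standing in for the per-step Monte Carlo rollouts of the full algorithm, disc_score is a hypothetical placeholder for the trained discriminator, and all dimensions are illustrative):

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID, SEQ = 1000, 32, 64, 20

embed = nn.Embedding(VOCAB, EMB)
lstm = nn.LSTM(EMB, HID, batch_first=True)
head = nn.Linear(HID, VOCAB)

def disc_score(tokens):
    # Placeholder for D(sequence) in [0, 1]; a real SeqGAN uses a trained
    # sequence discriminator here.
    return torch.rand(tokens.size(0))

def generator_pg_loss(batch=8):
    tokens = torch.zeros(batch, 1, dtype=torch.long)  # start token (id 0)
    log_probs, state = [], None
    for _ in range(SEQ):
        out, state = lstm(embed(tokens[:, -1:]), state)
        dist = torch.distributions.Categorical(logits=head(out[:, -1]))
        nxt = dist.sample()                           # sample the next token
        log_probs.append(dist.log_prob(nxt))
        tokens = torch.cat([tokens, nxt.unsqueeze(1)], dim=1)
    reward = disc_score(tokens)                       # sequence-level reward
    # REINFORCE: weight the sequence log-probability by the reward.
    return -(torch.stack(log_probs, dim=1).sum(dim=1) * reward).mean()
```

Because sampling discrete tokens blocks ordinary backpropagation from the discriminator into the generator, this reward-weighted log-likelihood gradient is what lets the discriminator's judgment reach the LSTM's parameters.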
(3) Other Fields
Beyond the image and vision and speech and language fields, GAN can also be combined with reinforcement learning, as in the aforementioned SeqGAN. Researchers have likewise fused GAN with imitation learning and combined GAN with actor-critic methods. MalGAN helps with malicious-code detection by using GAN to generate adversarial virus code samples; experimental results indicate that the GAN-based method can outperform methods based on traditional black-box detection models. Chidambaram et al. proposed an extended GAN generator based on style transfer, using the discriminator to regularize the generator rather than a fixed loss function, and demonstrated the effectiveness of the proposed method through chess experiments.
