
Source | PaperWeekly
©PaperWeekly Original · Author|Sun Yudao
School|PhD student at Beijing University of Posts and Telecommunications
Research Direction|GAN Image Generation, Emotion Adversarial Sample Generation

A Mathematical Introduction to Generative Adversarial Nets
https://arxiv.org/abs/2009.00169

Introduction
Since the groundbreaking work of Goodfellow on GANs in 2014, GANs have received significant attention, leading to an explosive growth of new ideas, techniques, and applications.
The original GAN paper is not fully rigorous in places, and many simplifications were made in the algorithm for the sake of training efficiency. This is common in the field; during a summer course on the mathematical principles of deep learning at Peking University, the instructor remarked that deep learning is only about 60% mathematically rigorous.
The implication is that proofs in this field are not as rigorous as in pure mathematics. Derivations written from an engineering perspective often rest on assumptions that contradict practical reality, yet the conclusions drawn from them frequently agree with experimental results or offer useful guidance for solving practical problems.
The author is an AI researcher with a strong mathematical background. The purpose of the paper is to give an overview of GANs from a mathematical perspective, making it a rare theoretical treatment of the mathematical principles behind GANs. The paper is dense with formal results and notation, and it can be treated as a theoretical manual for GANs, allowing readers to study any section of interest in depth.

Background on GANs
The Turing Award winner Yann LeCun, one of the three pioneers of deep learning, has stated that "the introduction of GANs is one of the most interesting ideas in deep learning over the past decade."
The figure below shows the monthly count of GAN-related papers from 2014 to 2018. After the 2014 introduction, the number of papers remained relatively low until 2016; from around 2016-2017 onward, there has been a genuine explosion of related work.
GANs have a wide range of applications, including image synthesis, image editing, style transfer, image super-resolution, image transformation, data augmentation, and more.
In this paper, the author will attempt to introduce GANs from a more mathematical perspective for beginners. To fully understand GANs, one must also study their algorithms and applications; however, understanding the mathematical principles is the crucial first step in grasping GANs. With this foundation, it will be easier to master other variants of GANs.
The goal of GANs is to solve the following problem: given a dataset of objects, such as a set of cat images, handwritten Chinese characters, or paintings by Van Gogh, a GAN can generate similar objects from noise. The underlying principle is that deep neural networks can approximate essentially any function, and the network parameters are adjusted appropriately using the training dataset.
With the discriminator function D and the generator function G both modeled as neural networks, the overall GAN model is shown in the figure below. A GAN actually consists of two networks: a generator network G that produces fake samples, and a discriminator network D that judges whether samples are real or fake. Through the adversarial loss, the generator is trained adversarially to produce high-quality samples.

Specifically, adversarial learning is carried out by solving a minimax problem over the discriminator function and the generator function. The generator $G$ transforms random samples $z$ drawn from a noise distribution $p_z$ into generated samples $G(z)$. The discriminator $D$ tries to distinguish these from real training samples drawn from the data distribution $p_{\text{data}}$, while $G$ tries to make the generated samples resemble the training samples in distribution.

The Mathematical Formula of GAN
The adversarial game of GAN is formulated mathematically as a minimax problem between the discriminator function and the generator function: the generator $G$ maps random samples $z \sim p_z$ to generated samples $G(z)$, the discriminator $D$ tries to tell these apart from real training samples $x \sim p_{\text{data}}$, and $G$ tries to make the generated samples match the training data in distribution. The adversarial objective function is:

$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
In the formula, the expectations are taken with respect to the distributions indicated in the subscripts. The minimax value that GAN solves is:

$$\min_G \max_D \; V(D, G)$$
Intuitively, for a given generator, optimizing the discriminator to distinguish generated samples means trying to assign high values $D(x)$ to real samples $x \sim p_{\text{data}}$ and low values to generated samples $G(z)$.
Conversely, for a given discriminator, optimizing the generator attempts to "fool" the discriminator into assigning high values to generated samples. Let $p_g$ denote the distribution of the generated samples $G(z)$; the objective can then be rewritten in terms of $p_{\text{data}}$ and $p_g$ as:

$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D(x))]$$
The minimax problem can therefore be written as:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D(x))]$$
Assume that $p_{\text{data}}$ has density $p(x)$ and $p_g$ has density $q(x)$ (note that this holds only when both distributions admit densities). The problem can finally be written as:

$$\min_{q} \max_{D} \int \big[\, p(x) \log D(x) + q(x) \log\big(1 - D(x)\big) \,\big]\, dx$$
It can be shown that, under suitable constraints on the distributions, the inner maximum over $D$ equals $-\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$, so minimizing over the generator amounts to minimizing the Jensen-Shannon divergence between the data and generator distributions.
The paper contains a large number of formulas and proofs, with propositions and theorems interspersed, which can be overwhelming to read. To lower the barrier for readers, I have selected some of the important theorems and propositions and reorganized them.
Proposition 1: Given probability distributions $\mu$ and $\nu$ on $\mathbb{R}^n$ with probability density functions $p$ and $q$, the following holds:

$$\sup_{D} \; \mathbb{E}_{x \sim \mu}[\log D(x)] + \mathbb{E}_{x \sim \nu}[\log(1 - D(x))] = \int \Big[\, p \log \frac{p}{p+q} + q \log \frac{q}{p+q} \,\Big]\, dx = -\log 4 + 2\,\mathrm{JSD}(\mu \,\|\, \nu)$$

where the supremum is taken over all Borel functions $D$ with values in $(0, 1)$ and is attained at $D^*(x) = p(x)/(p(x) + q(x))$; the value ranges over $[-\log 4, 0]$ (sup denotes the supremum).
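A quick sketch of why this holds (the standard pointwise argument, essentially the computation in Goodfellow's 2014 paper): for fixed densities $p$ and $q$, the integrand can be maximized pointwise in $D \in (0, 1)$,

$$\frac{\partial}{\partial D}\big(\, p \log D + q \log(1 - D) \,\big) = \frac{p}{D} - \frac{q}{1 - D} = 0 \;\;\Longrightarrow\;\; D^*(x) = \frac{p(x)}{p(x) + q(x)} ,$$

and substituting $D^*$ back gives

$$\int p \log \frac{p}{p + q} + q \log \frac{q}{p + q}\, dx = -\log 4 + \mathrm{KL}\!\Big(p \,\Big\|\, \tfrac{p + q}{2}\Big) + \mathrm{KL}\!\Big(q \,\Big\|\, \tfrac{p + q}{2}\Big) = -\log 4 + 2\,\mathrm{JSD}(p \,\|\, q) .$$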
Theorem 1: Let $p$ be a probability density function on $\mathbb{R}^n$. For probability distributions with density functions $p$ and $q$, the minimax problem can be described as:

$$\min_{q} \max_{D} \int \big[\, p(x) \log D(x) + q(x) \log\big(1 - D(x)\big) \,\big]\, dx$$

For all $x$, its solution is given by $q(x) = p(x)$ and $D(x) = 1/2$.
Theorem 1 states that the solution of this minimax problem is exactly the GAN optimum derived by Ian Goodfellow in his 2014 paper.
Theorem 2: Let be a given probability distribution on . For probability distributions and functions , the following holds:

Where and exist almost everywhere for .
Proposition 2: If the discriminator is allowed to reach its optimum $D^*$ given $G$ at each training cycle, and the generator distribution $p_g$ is then updated according to the minimization criterion, then $p_g$ converges to $p_{\text{data}}$.
From a purely mathematical perspective, this proposition is not rigorous. However, it provides a practical framework for solving the GAN minimax problem.
The propositions and theorems above can be organized into the following GAN algorithm framework, which I have reorganized for clarity.
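As a concrete illustration of this framework, below is a minimal PyTorch-style sketch of the alternating gradient updates. The toy architectures, optimizers, data distribution, and the single discriminator step per iteration are illustrative assumptions of mine, not details from the paper.

```python
import torch
import torch.nn as nn

# Illustrative toy networks; neither the paper nor the framework prescribes an architecture.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))               # noise -> sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())  # sample -> probability

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def sample_real(batch):
    # Stand-in for the data distribution p_data (here: a shifted 2-D Gaussian).
    return torch.randn(batch, 2) * 0.5 + torch.tensor([2.0, 2.0])

for step in range(1000):
    # --- Discriminator step: maximize E[log D(x)] + E[log(1 - D(G(z)))] ---
    x_real = sample_real(64)
    z = torch.randn(64, 16)
    x_fake = G(z).detach()                       # do not backprop into G here
    d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(x_fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator step: fool the discriminator (non-saturating "log D" form) ---
    z = torch.randn(64, 16)
    g_loss = bce(D(G(z)), torch.ones(64, 1))     # minimize -E[log D(G(z))]
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```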


f-Divergence and f-GAN
4.1 f-Divergence
The core essence of GAN is to bring the distribution of random noise closer to the distribution of real data through adversarial training, which requires a metric for comparing probability distributions, i.e., a divergence. The well-known KL divergence is given by the following formula (KL divergence is asymmetric, so it is not a true distance metric):

$$D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$$
And the JS divergence is given by the following formula (JS divergence remedies the asymmetry of KL divergence):

$$D_{\mathrm{JS}}(P \,\|\, Q) = \frac{1}{2} D_{\mathrm{KL}}\!\Big(P \,\Big\|\, \frac{P + Q}{2}\Big) + \frac{1}{2} D_{\mathrm{KL}}\!\Big(Q \,\Big\|\, \frac{P + Q}{2}\Big)$$
However, all the divergences we are familiar with can actually be summarized into a larger category known as f-divergence, which is defined as follows:
Definition 1: Let $p$ and $q$ be two probability density functions on $\mathbb{R}^n$. The f-divergence between $p$ and $q$ is defined as:

$$D_f(p \,\|\, q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx ,$$

where $f$ is a convex function with $f(1) = 0$; in particular, if $p = q$, then $D_f(p \,\|\, q) = 0$.
It is important to note that the f-divergence depends on the choice of $f$: different functions $f$ yield different divergences, with different symmetry properties.
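For intuition, the definition can be checked numerically on discrete distributions. The sketch below is my own illustration (not code from the paper); it evaluates $D_f$ for the KL and JS generator functions and compares against the direct KL formula.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for discrete p, q with q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(q * f(p / q))

f_kl = lambda x: x * np.log(x)                                   # KL generator
f_js = lambda x: x * np.log(x) - (x + 1) * np.log((x + 1) / 2)   # 2x Jensen-Shannon generator

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

print(f_divergence(p, q, f_kl))      # equals sum p*log(p/q)
print(np.sum(p * np.log(p / q)))     # direct KL formula, same value
print(f_divergence(p, p, f_js))      # 0: the divergence vanishes when p == q
```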
Proposition 3: Let $f$ be a strictly convex function on its domain such that $f(1) = 0$. Then, under mild absolute-continuity assumptions on $p$ and $q$ (roughly, that $q(x) = 0$ implies $p(x) = 0$), $D_f(p \,\|\, q) \ge 0$, and $D_f(p \,\|\, q) = 0$ if and only if $p = q$ almost everywhere.
4.2 Convex Functions and Convex Conjugates
The convex conjugate of a convex function is also known as the Fenchel transform or Fenchel-Legendre transform; it generalizes the classical Legendre transform. Let $f$ be a convex function defined on an interval $I \subseteq \mathbb{R}$. Its convex conjugate is defined as:

$$f^*(y) = \sup_{x \in I} \big\{ x y - f(x) \big\}$$
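As a quick worked example (the KL generator $f(x) = x \log x$, which reappears in Section 4.4), the supremum in the definition is attained where the derivative of $xy - f(x)$ vanishes:

$$\frac{d}{dx}\big(\, x y - x \log x \,\big) = y - \log x - 1 = 0 \;\;\Longrightarrow\;\; x = e^{y - 1}, \qquad f^*(y) = x y - x \log x \,\Big|_{x = e^{y - 1}} = e^{y - 1} .$$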
Lemma 1: Assume $f$ is strictly convex and continuously differentiable on its domain. Then, for $y$ in the range of $f'$, the supremum defining $f^*(y)$ is attained at the point $x$ with $f'(x) = y$, that is:

$$f^*\big(f'(x)\big) = x\, f'(x) - f(x) .$$
Proposition 4: Let $f$ be a convex function on its domain, with values in $(-\infty, +\infty]$. Then its conjugate $f^*$ is convex and lower semi-continuous. Furthermore, if $f$ itself is lower semi-continuous, then Fenchel duality holds: $f^{**} = f$.
The table below lists some common convex functions and their convex duals:
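A few standard pairs from the f-GAN literature, listed here for reference (the paper's own table may contain additional entries):
- KL divergence: $f(x) = x \log x$, $f^*(y) = e^{y - 1}$, domain of $f^*$: all of $\mathbb{R}$.
- Reverse KL: $f(x) = -\log x$, $f^*(y) = -1 - \log(-y)$, domain: $y < 0$.
- Jensen-Shannon: $f(x) = x \log x - (x + 1) \log \frac{x + 1}{2}$, $f^*(y) = -\log(2 - e^{y})$, domain: $y < \log 2$.
- Total variation: $f(x) = \frac{1}{2}|x - 1|$, $f^*(y) = y$, domain: $-\frac{1}{2} \le y \le \frac{1}{2}$.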
4.3 Estimating f-Divergence Using Convex Duality
To estimate the f-divergence from samples, the convex conjugate of $f$ can be used. Applying Fenchel duality pointwise inside the definition gives:

$$D_f(P \,\|\, Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx = \int q(x)\, \sup_{t}\Big\{ t\,\frac{p(x)}{q(x)} - f^*(t) \Big\}\, dx .$$

Replacing the pointwise supremum by a fixed Borel function $T$ (taking values in the domain of $f^*$) and then taking the supremum over all such Borel functions yields the variational bound:

$$D_f(P \,\|\, Q) \;\ge\; \sup_{T}\; \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}\big[f^*(T(x))\big] .$$
Proposition 5: Let $f$ be strictly convex and continuously differentiable on its domain. Let $P$ and $Q$ be Borel probability measures on $\mathbb{R}^n$ with densities $p$ and $q$ such that $D_f(P \,\|\, Q) < \infty$. Then the bound above is attained at $T^*(x) = f'\!\big(p(x)/q(x)\big)$:

$$D_f(P \,\|\, Q) = \mathbb{E}_{x \sim P}\big[T^*(x)\big] - \mathbb{E}_{x \sim Q}\big[f^*\big(T^*(x)\big)\big] .$$
Theorem 3: Let $f$ be convex, with the domain of $f^*$ containing the range of the functions considered. Let $P$ and $Q$ be Borel probability measures on $\mathbb{R}^n$ with densities $p$ and $q$. Then for all Borel functions $T$ taking values in the domain of $f^*$:

$$D_f(P \,\|\, Q) \;\ge\; \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}\big[f^*(T(x))\big] .$$
Theorem 4: Let $f$ be a lower semi-continuous convex function whose domain contains $1$. Let $P$ and $Q$ be Borel probability measures on $\mathbb{R}^n$ with densities $p$ and $q$. Then the bound of Theorem 3 is tight, i.e.,

$$D_f(P \,\|\, Q) \;=\; \sup_{T}\; \mathbb{E}_{x \sim P}[T(x)] - \mathbb{E}_{x \sim Q}\big[f^*(T(x))\big] ,$$

where the supremum is taken over all Borel functions $T$ with values in the domain of $f^*$.
4.4 f-GAN and VDM
The f-divergence yields a generalization of GAN. For a given target distribution $p_{\text{data}}$, the goal of f-GAN is to find a generator distribution $p_g$ that minimizes the f-divergence $D_f(p_{\text{data}} \,\|\, p_g)$. Using the variational bound above over the sample space, f-GAN solves the following minimax problem:

$$\min_G \max_T \; \mathbb{E}_{x \sim p_{\text{data}}}[T(x)] - \mathbb{E}_{x \sim p_g}\big[f^*(T(x))\big]$$
This optimization problem is referred to as Variational Divergence Minimization (VDM). Note that VDM has the same shape as the GAN minimax problem, with the Borel function $T$ playing the role of the discriminator; it is referred to as the judging function.
Theorem 5: Let $f$ be a lower semi-continuous, strictly convex function whose domain contains $1$, with $f(1) = 0$; further assume $f$ is continuously differentiable on its domain. Let $p_{\text{data}}$ and $p_g$ be Borel probability distributions on the sample space. Then the f-GAN minimax problem is solved by $p_g = p_{\text{data}}$, with the corresponding optimal judging function $T^* \equiv f'(1)$.
In summary, below is the algorithm flow framework of f-GAN.
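As a sketch of how this framework translates into code, below is a minimal PyTorch-style VDM loss for one particular choice of divergence (KL, with $f^*(t) = e^{t-1}$). The toy architecture and the 2-D sample space are my own illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

# Judging (variational) function T: maps a sample to a real number.
T = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

def fstar_kl(t):
    # Convex conjugate of the KL generator f(x) = x log x:  f*(t) = exp(t - 1).
    return torch.exp(t - 1)

def vdm_objective(x_real, x_fake):
    # E_{p_data}[T(x)] - E_{p_g}[f*(T(x))]: ascend in T, descend in the generator
    # that produced x_fake, exactly as in the GAN training loop sketched earlier.
    return T(x_real).mean() - fstar_kl(T(x_fake)).mean()
```

Swapping in a different conjugate $f^*$ (and, where needed, an output activation that keeps $T$ inside the domain of $f^*$) changes which f-divergence is being minimized.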
4.4.1 Example 1: $f(x) = x \log x$
When $f(x) = x \log x$, the f-divergence is the well-known KL divergence. Its conjugate function is $f^*(y) = e^{y - 1}$, with domain $\mathbb{R}$, and the corresponding objective function is:

$$\min_G \max_T \; \mathbb{E}_{x \sim p_{\text{data}}}[T(x)] - \mathbb{E}_{x \sim p_g}\big[e^{T(x) - 1}\big]$$

Substituting $T(x) = 1 + \log D(x)$ and ignoring the constant $+1$ term, we obtain an objective of GAN form:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] - \mathbb{E}_{x \sim p_g}[D(x)]$$
4.4.2 Example 2: $f(x) = x \log x - (x + 1) \log \frac{x + 1}{2}$
This choice gives the Jensen-Shannon divergence. Its conjugate function is $f^*(y) = -\log(2 - e^{y})$, with domain $(-\infty, \log 2)$. The corresponding f-GAN objective function is:

$$\min_G \max_T \; \mathbb{E}_{x \sim p_{\text{data}}}[T(x)] + \mathbb{E}_{x \sim p_g}\big[\log\big(2 - e^{T(x)}\big)\big]$$

Substituting $T(x) = \log 2 + \log D(x)$, so that $2 - e^{T(x)} = 2(1 - D(x))$, this simplifies further to:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D(x))] ,$$

where the constant term $\log 4$ has been ignored; this is exactly the original GAN objective.
4.4.3 Example 3: $f(x) = \frac{1}{2}|x - 1|$ (total variation)
The conjugate of this function is $f^*(y) = y$, with domain $[-\frac{1}{2}, \frac{1}{2}]$. Here $f$ is neither strictly convex nor continuously differentiable, and the corresponding f-GAN objective function is:

$$\min_G \max_{|T| \le \frac{1}{2}} \; \mathbb{E}_{x \sim p_{\text{data}}}[T(x)] - \mathbb{E}_{x \sim p_g}[T(x)]$$
The form of this objective function is the objective function of Wasserstein GAN.
4.4.4 Example4:
Given , this function is a strictly convex function. The corresponding f-GAN objective function is:

Further solving the above formula yields:

In this form, the "$\log D$" trick from the original GAN paper appears: the generator is trained to maximize $\log D(G(z))$ rather than to minimize $\log(1 - D(G(z)))$, which addresses the gradient-saturation problem early in GAN training.
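To make the "$\log D$" trick concrete, here is the difference between the saturating and non-saturating generator losses in code (a generic illustration, not specific to this example's particular $f$):

```python
import torch

def g_loss_saturating(d_fake):
    # Original minimax generator loss: minimize E[log(1 - D(G(z)))].
    # Its gradient vanishes when D(G(z)) is close to 0, i.e. early in training.
    return torch.log(1 - d_fake).mean()

def g_loss_non_saturating(d_fake):
    # "log D" trick: maximize E[log D(G(z))], i.e. minimize -E[log D(G(z))].
    # Same fixed points, but much stronger gradients when D rejects the fakes.
    return -torch.log(d_fake).mean()
```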

Specific Examples of GAN
For examples of GAN applications, rather than the examples given in the paper, I introduce a few GANs that I find interesting from my own perspective.
Training a GAN can be challenging, often encountering various issues, the main three being:
- Vanishing gradients: This often occurs when the discriminator becomes too good, which hinders the improvement of the generator. Training with an (almost) optimal discriminator can fail because the vanishing gradients provide too little information for the generator to improve.
- Mode collapse: The generator starts to produce the same output (or a small set of outputs) over and over again. If the discriminator gets stuck in a local minimum, the generator can easily find the output that this discriminator considers most plausible and keeps producing it, and the discriminator never learns its way out of the trap.
- Failure to converge: Due to many factors (both known and unknown), GANs frequently fail to converge.
5.1 WGAN
WGAN makes a simple modification: it replaces the Jensen-Shannon-divergence-based loss of the original GAN with the Wasserstein distance (also known as the Earth Mover's distance). Do not underestimate this modification: it is one of the most important advances on this topic since the birth of GANs, since the EM distance effectively addresses some prominent shortcomings of divergence-based GANs and alleviates common failure modes in GAN training.
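Below is a minimal sketch of the WGAN critic update with weight clipping, following the original WGAN recipe (the toy architecture, RMSprop optimizer, and clipping constant 0.01 are the reference defaults, used here only for illustration):

```python
import torch
import torch.nn as nn

# The critic outputs an unbounded score: note there is no sigmoid.
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

def critic_step(x_real, x_fake, clip=0.01):
    # Maximize E[critic(real)] - E[critic(fake)], an estimate of the EM distance.
    loss = -(critic(x_real).mean() - critic(x_fake).mean())
    opt_c.zero_grad(); loss.backward(); opt_c.step()
    # Weight clipping to (crudely) enforce the 1-Lipschitz constraint.
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-clip, clip)
```

The generator step then simply minimizes $-\mathbb{E}[\mathrm{critic}(G(z))]$.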
5.2 DCGAN
DCGAN introduces convolutional architectures into GAN training, leading to better parameterization and training behavior. It provides several guidelines for training GANs, specifically the following five points:
- Replace any pooling layers with strided convolutions in the discriminator and fractionally-strided (transposed) convolutions in the generator.
- Use batch normalization in both the generator and the discriminator.
- Remove fully connected hidden layers to achieve deeper architectures.
- Use ReLU activation in all layers of the generator, except for the output, which uses Tanh.
- Use LeakyReLU activation in all layers of the discriminator.
It is important to note that the discriminator is a convolution process from high-dimensional vectors to low-dimensional vectors, as shown in the figure below:

The generator is a deconvolution process from low-dimensional vectors to high-dimensional vectors. Deconvolution, also known as transposed convolution or fractionally-strided convolution, differs from ordinary convolution mainly in how the input is padded. The following image illustrates this:

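Putting the five guidelines and the transposed-convolution view together, a DCGAN-style generator for 64×64 RGB images might look as follows (the channel widths and kernel settings are the common reference choices, shown purely as an illustration):

```python
import torch
import torch.nn as nn

# DCGAN-style generator: latent vector (nz) -> 64x64 RGB image.
# Strided ConvTranspose2d layers upsample, BatchNorm + ReLU throughout,
# Tanh on the output layer, no pooling and no fully connected hidden layers.
nz, ngf = 100, 64
netG = nn.Sequential(
    nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),      # 4x4
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),  # 8x8
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),  # 16x16
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf), nn.ReLU(True),          # 32x32
    nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False), nn.Tanh(),                                         # 64x64
)

z = torch.randn(16, nz, 1, 1)      # a batch of latent vectors
fake_images = netG(z)              # shape: (16, 3, 64, 64)
```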
Another significant innovation of DCGAN is the introduction of vector arithmetic in the latent space into image generation (adding and subtracting latent codes to manipulate semantic attributes), as shown in the figure below:
5.3 StyleGAN
StyleGAN proposes a new generator architecture that can control high-level attributes of the generated images, such as hairstyle and freckles; the generated images also score better on several evaluation criteria. As an unsupervised GAN, it can generate high-resolution images from noise. The algorithm framework is shown in the figure below:

The clarity and detail richness of the images generated by StyleGAN are evident from the figure below, though the training cost is also high.
5.4 zi2zi
zi2zi is one of my favorite applications of GANs: it uses a GAN to transfer font styles. Previous work on Chinese font transfer did not give very satisfactory results: the generated glyphs were often quite blurry, handling stylistically diverse fonts was difficult, and each model could only learn and output a limited set of target styles. GANs handle these issues effectively; the model framework is shown in the figure below:

Chinese characters, whether simplified or traditional, or Japanese characters, are constructed using the same principles and share a series of common foundations, with the specific implementation results shown in the figure below: