Explaining CNNs from the Frequency Domain Perspective

Convolution in the time domain equals multiplication in the frequency domain. Most of the computation in a convolutional neural network happens in the convolution layers. How should one think about convolutional neural networks from the frequency-domain perspective? And how can ResNet be explained from the frequency domain?
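As a quick sanity check of this premise, here is a minimal numpy sketch (my addition, not part of any answer below): a circular convolution computed directly agrees with the inverse FFT of the product of the two FFTs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)   # a signal
k = rng.standard_normal(64)   # a kernel (zero-pad it to the signal length in practice)

# Direct circular convolution: (x ⊛ k)[i] = sum_j x[j] * k[(i - j) mod N].
direct = np.array([sum(x[j] * k[(i - j) % 64] for j in range(64)) for i in range(64)])

# Frequency-domain route: FFT both, multiply pointwise, inverse FFT.
via_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

assert np.allclose(direct, via_fft)   # identical up to floating-point error
```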
Author: Ruoyu
Link: https://www.zhihu.com/question/59532432/answer/1510340606
The work I find most enlightening is by Xu Zhiqin of Shanghai Jiao Tong University.
https://ins.sjtu.edu.cn/people/xuzhiqin/fprinciple/index.html
His Bilibili lecture
Undergraduate course in the Mathematics Department, "Statistical Computing and Machine Learning", Lecture 3: The Frequency Principle https://www.bilibili.com/video/av94808183?p=2
Additionally, I have attended his lectures offline about neural networks and Fourier transforms, as well as Fourier analysis.
Training behavior of deep neural network in frequency domain
https://arxiv.org/pdf/1807.01251.pdf
This paper argues that the generalization ability of neural networks stems from their training dynamics, which favor low-frequency components.
The fitting process of neural networks on CIFAR-10 and MNIST (thanks to @Jimmy for the correction): blue indicates a relatively large error and red a relatively small one. As training progresses, higher frequencies (larger frequency index) converge more slowly; at any given epoch, the low-frequency error is already small (red) while the high-frequency error is still large (blue).
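To see the F-Principle concretely, here is a self-contained toy sketch (my own construction in PyTorch, not the paper's code, with an assumed target sin(x) + sin(5x)): the relative error at the low frequency drops well before the high-frequency error does.

```python
import numpy as np
import torch

torch.manual_seed(0)
x = torch.linspace(-np.pi, np.pi, 257)[:-1].unsqueeze(1)  # 256 periodic samples
y = torch.sin(x) + torch.sin(5 * x)                       # low + high frequency target

net = torch.nn.Sequential(
    torch.nn.Linear(1, 200), torch.nn.Tanh(),
    torch.nn.Linear(200, 200), torch.nn.Tanh(),
    torch.nn.Linear(200, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

Y = np.fft.rfft(y.squeeze().numpy())                      # target spectrum
for step in range(2001):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        P = np.fft.rfft(net(x).detach().squeeze().numpy())
        # relative spectral error at the two target frequencies (bins 1 and 5)
        e1, e5 = (abs(P[b] - Y[b]) / abs(Y[b]) for b in (1, 5))
        print(f"step {step}: low-freq err {e1:.3f}, high-freq err {e5:.3f}")
```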
Theory of the frequency principle for general deep neural networks
https://arxiv.org/pdf/1906.09235v2.pdf
This paper gives a substantial mathematical derivation of the F-Principle, splitting the proof into the initial, intermediate, and final stages of training. It may be heavy going for readers outside mathematics.
Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks
https://arxiv.org/pdf/1905.10264.pdf
Why deep neural networks (DNNs) with more parameters than samples can often generalize well remains a mystery. One attempt to understand this is to uncover the implicit biases of the DNN training process, such as the frequency principle (F-Principle), which states that DNNs typically fit target functions from low frequencies to high frequencies. Inspired by the F-Principle, this paper proposes an effective linear F-Principle (LFP) dynamics model that accurately predicts the learning outcomes of wide two-layer ReLU neural networks (NNs). The LFP dynamics is justified by the linearized mean-field residual dynamics of the NN. Importantly, the long-time limit of the LFP dynamics is equivalent to the solution of a constrained optimization problem that explicitly minimizes an FP norm, under which high-frequency components of feasible solutions are penalized more severely. Using this optimization formulation, an a priori estimate of the generalization error bound is given, showing that the higher the FP norm of the target function, the larger the generalization error. Overall, by interpreting the implicit bias of the F-Principle as an explicit penalty for two-layer NNs, this work takes a step toward quantitatively understanding the learning and generalization of general DNNs.
This is a schematic of the LFP model for two-dimensional data such as images.
Professor Xu’s earlier introduction:
The LFP model provides a new perspective for the quantitative understanding of neural networks. First, the LFP model effectively characterizes the key features of the training process of such a parameter-rich system using a simple differential equation, and can accurately predict the learning outcomes of neural networks. Therefore, this model establishes a relationship between differential equations and neural networks from a new angle. Since differential equations are a very mature research field, we believe that tools from this field can help us further analyze the training behavior of neural networks.

Secondly, similar to statistical physics, the LFP model only relates to certain macroscopic statistical quantities of network parameters, and is independent of the specific behavior of individual parameters. This statistical characterization can help us accurately understand the learning process of DNNs when there are many parameters, thus explaining the good generalization ability of DNNs when parameters far exceed the number of training samples.

In this work, we analyze the evolution of the LFP dynamics through an equivalent optimization problem and provide an a priori estimate of the network's generalization error. We find that the generalization error can be controlled by a certain F-Principle norm (whose definition appeared as an image in the original; it is built from a weight function γ(ξ) that decays with frequency, so that high-frequency components are penalized more heavily).

It is worth noting that our error estimate targets the learning process of the neural network itself and does not require adding extra regularization terms in the loss function. We will further elaborate on this error estimate in a subsequent introductory article.

Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks

https://arxiv.org/pdf/1901.06523.pdf

The paper's key theorem (shown as an image in the original) indicates that for any two non-converged frequencies, the gradient of the low-frequency loss dominates that of the high-frequency loss exponentially when the weights are small. By Parseval's theorem, the MSE loss in the spatial domain equals the L2 loss in the Fourier domain. To build intuition for why the low-frequency part of the loss decays fastest, the paper considers training, in the Fourier domain, on a target with only two non-zero frequencies.
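The Parseval step is easy to verify numerically; a minimal numpy check (my addition, using numpy's unnormalized FFT convention):

```python
import numpy as np

r = np.random.default_rng(1).standard_normal(512)    # a residual f_model - f_target
spatial = np.sum(r ** 2)                             # spatial-domain squared error
fourier = np.sum(np.abs(np.fft.fft(r)) ** 2) / len(r)  # Fourier-domain L2, scaled by 1/N
assert np.allclose(spatial, fourier)                 # Parseval's identity holds
```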


This also explains why the choice of activation function matters: tanh is smooth in the spatial domain, so its derivative decays exponentially with frequency in the Fourier domain.

Professor Xu’s popular science articles on the F-Principle:

https://zhuanlan.zhihu.com/p/42847582
https://zhuanlan.zhihu.com/p/72018102
https://zhuanlan.zhihu.com/p/56077603
https://zhuanlan.zhihu.com/p/57906094
On the Spectral Bias of Neural Networks

This is work from the Bengio group; I previously wrote a rough analysis note on it:

https://zhuanlan.zhihu.com/p/160806229
  1. Analyzing the Fourier spectrum of ReLU networks via their continuous piecewise linear structure.
  2. Finding empirical evidence of a spectral bias toward low-frequency components; learning the low-frequency components first also helps the network's robustness under adversarial perturbation.
  3. Providing a learning-theoretic analysis through manifold theory.

Using Stokes' theorem from topology, the paper evaluates the Fourier transform of the ReLU network (a continuous piecewise linear function over compact regions), which underpins the convergence analysis. What about the later activations, Swish and Mish? (tongue-in-cheek)

Thus, in high-dimensional space, the spectral decay of a ReLU network is strongly anisotropic, and the amplitude of its Fourier transform is bounded in terms of the network's Lipschitz constant.
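The decay-rate contrast can be checked numerically. A small sketch (my addition; a triangular bump stands in for the piecewise linear output of ReLU units, so that the periodic extension has no boundary jump):

```python
import numpy as np

t = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
hat = np.maximum(0.0, 1.0 - np.abs(t))   # continuous piecewise linear, with corners
smooth = np.exp(np.cos(t))               # infinitely differentiable and periodic

for name, f in [("piecewise linear", hat), ("smooth", smooth)]:
    c = np.abs(np.fft.rfft(f)) / len(t)
    print(name, " ".join(f"|c_{k}|={c[k]:.1e}" for k in (2, 8, 32, 128)))
# The hat function's amplitudes fall off like 1/k^2 (times an oscillating factor),
# while the smooth function's hit the numerical noise floor within a few dozen modes.
```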

Experiments:

Central finding: low-frequency components are learned with high priority.
  1. Experiments on synthetic functions (the target function itself was shown as an image in the original).
     [Figures: the function's Fourier transform; the learning process of the function; the model's normalized spectral components.]
  2. Learning MNIST data in a noisy environment.
     [Figures: validation losses; fitting of the frequency components of the MNIST data.]
Neural networks can approximate arbitrary functions, but researchers found that they prefer low-frequency components and thus exhibit a bias toward smooth functions, a phenomenon known as spectral bias.
Manifold Hypothesis
The more complex the data manifold, the easier the learning process becomes. This hypothesis may break the "structural risk minimization" assumption and potentially lead to overfitting.
For complex datasets such as ImageNet, the search space is very large, and some mechanism is needed to make the components "work in harmony" and train properly.
Bengio appears to believe this has implications for regularization in deep learning.
Machine Learning from a Continuous Viewpoint
https://arxiv.org/pdf/1912.12777.pdf
Mathematician Weinan E's counterargument indicates that the frequency principle does not always hold.
The derivation (whose equations appeared as images in the original and are omitted here) runs as follows: assume a target function and a probability measure over the inputs; derive the training dynamics based on kernel functions; decompose the residual into Fourier coefficients; and obtain a characteristic function that governs the decay of each coefficient. From this, the paper gives the boundary of where the frequency principle applies: the conditions under which it works, and the conditions under which it does not.
If Weinan E delineates the boundary of the Frequency Principle from a mathematician's perspective, then readers on the engineering side should definitely look at the next paper.
A Fourier Perspective on Model Robustness in Computer Vision
https://arxiv.org/pdf/1906.08988.pdf
The code has also been open-sourced:
https://github.com/google-research/google-research/tree/master/frequency_analysis
The authors' aim is robustness, not the wholesale discarding of high-frequency features.
Figure caption (translated): Models can achieve high accuracy using input information that humans cannot recognize. Shown are models trained and tested with strict high-pass and low-pass filtering applied at the input. With aggressive low-pass filtering, when the image appears as a simple blob of color, the model still exceeds 30% accuracy on ImageNet. With high-pass (HP) filtering, the model achieves over 50% accuracy using input features that are nearly invisible to humans; as shown on the right, the high-pass-filtered images must be normalized for the high-frequency features to be visible at all (using the method given in the paper's appendix).
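The strict low-/high-pass filtering described here is straightforward to reproduce; a minimal sketch (assumptions mine: a single-channel square image and a disk-shaped mask on the centered spectrum; the paper's released code may differ in details):

```python
import numpy as np

def fft_filter(img, bandwidth, mode="low"):
    """Keep only frequencies inside (low) or outside (high) a centered disk."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - h // 2, xx - w // 2)       # distance from DC after fftshift
    mask = dist <= bandwidth if mode == "low" else dist > bandwidth
    spec = np.fft.fftshift(np.fft.fft2(img))        # centered spectrum
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))

img = np.random.default_rng(2).random((32, 32))     # stand-in for a CIFAR image
low = fft_filter(img, bandwidth=4, mode="low")      # the "blob of color" version
high = fft_filter(img, bandwidth=4, mode="high")    # the near-invisible residual
assert np.allclose(img, low + high)                 # the two bands partition the image
```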
Figure caption (translated): Left: Fourier spectrum of natural images; E[|F(X)[i,j]|] is estimated by averaging over all CIFAR-10 validation images. Right: Fourier spectra of the CIFAR-10-C corruptions at severity 3; for each corruption, E[|F(C(X)−X)[i,j]|] is estimated by averaging over all validation images. Additive noise concentrates in the high-frequency band, while corruptions such as fog and contrast concentrate in the low-frequency band.
Figure caption (translated): Sensitivity of models to additive noise aligned with different Fourier basis vectors, on CIFAR-10. The additive noise is fixed to an L2 norm of 4, and three models are evaluated: naturally trained, adversarially trained, and trained with Gaussian data augmentation. The average error rate is computed over 1000 images randomly sampled from the test set. The bottom row shows images perturbed by noise along the corresponding Fourier basis vector. The naturally trained model is highly sensitive to all but the lowest-frequency additive noise; adversarial training and Gaussian data augmentation markedly improve robustness to high frequencies while sacrificing some of the naturally trained model's robustness to low frequencies (in both models, the central blue region is smaller than the naturally trained model's).
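The perturbations behind these Fourier heat maps can be sketched as follows (my reconstruction, not the authors' code: a real image supported on one Fourier basis pair, scaled to a fixed L2 norm and added with random sign):

```python
import numpy as np

def fourier_basis_image(h, w, i, j):
    """A real image whose spectrum lives on the (i, j) frequency pair, unit L2 norm."""
    spec = np.zeros((h, w), dtype=complex)
    spec[i % h, j % w] = 1.0
    spec[(-i) % h, (-j) % w] = 1.0        # conjugate partner keeps the image real
    basis = np.real(np.fft.ifft2(spec))
    return basis / np.linalg.norm(basis)

img = np.random.default_rng(3).random((32, 32))          # stand-in for a CIFAR image
noise = 4.0 * fourier_basis_image(32, 32, i=5, j=7)      # L2 norm fixed at 4
perturbed = img + np.random.default_rng(4).choice([-1.0, 1.0]) * noise  # random sign
```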
Figure caption (translated): Sensitivity of models to additive noise aligned with different Fourier basis vectors, on ImageNet validation images. The basis vectors are fixed to an L2 norm of 15.7. The error rate is averaged over the entire ImageNet validation set; a 63×63 square centered on the lowest frequency in the Fourier domain is shown. Again, the naturally trained model is highly sensitive to all but the lowest-frequency additive noise, while Gaussian data augmentation improves robustness to high frequencies at the cost of robustness to low-frequency perturbations. For AutoAugment, its Fourier heat map has the largest blue/yellow region around the center, indicating that AutoAugment is relatively robust to low-to-mid-frequency corruptions.
Figure caption (translated): Relationship between the high-frequency energy fraction of CIFAR-10-C corruptions and test accuracy. Each scatter point represents a specific model evaluated against a specific corruption type. The x-axis is the fraction of the corruption's energy in high frequencies; the y-axis is the change in test accuracy relative to the naturally trained model. Overall, Gaussian data augmentation, adversarial training, and a low-pass filter at the input improve robustness to high-frequency corruptions while reducing robustness to low-frequency ones; a high-pass filter at the input causes a larger accuracy drop for high-frequency corruptions than for low-frequency ones. AutoAugment improves robustness to nearly all corruptions and achieves the best overall performance. The legend at the bottom gives the slope (k) and residual (r) of each fitted line.
Figure caption (translated): (a) and (b): Fourier spectra of adversarial perturbations. Given an image X, a PGD attack produces an adversarial example C(X), and the Fourier spectrum of the misclassifying perturbation is estimated. (a) is the spectrum for the naturally trained model; (b) for the adversarially trained one. The naturally trained model's adversarial perturbations are spread uniformly across frequency components, whereas adversarial training biases them toward lower frequencies. (c) and (d): Adding Fourier basis vectors with large norm to images is a simple way to generate content-preserving black-box adversarial examples.

Key Conclusions:

1) Adversarial training attends to some high-frequency components rather than fixating only on low-frequency ones.
2) AutoAugment helps improve robustness.
The open-source code mainly shows how to reproduce the paper's diagrams.
Another paper, from Eric Xing's group, was previously covered by a self-media account on Zhihu:
High-frequency Component Helps Explain the Generalization of Convolutional Neural Networks
https://arxiv.org/pdf/1905.13545.pdf

Visualization of convolution kernels from natural training and from adversarial training

This paper experiments with several methods:

  • For a trained model, adjusting its weights to make the convolution kernels smoother;
  • Directly filtering high-frequency information out of the trained convolution kernels (see the sketch after this list);
  • Adding regularization during training so that weights at adjacent positions stay close.
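As referenced above, here is a sketch of the second method (assumptions mine: a single 2-D kernel and a simple square low-pass mask; the paper's exact procedure may differ):

```python
import numpy as np

def smooth_kernel(kernel, keep=1):
    """Zero out all but the `keep` lowest frequencies along each axis."""
    spec = np.fft.fft2(kernel)
    h, w = kernel.shape
    mask = np.zeros((h, w))
    for u in range(h):
        for v in range(w):
            # wrap-around indices: low frequencies sit near both 0 and N-1
            if min(u, h - u) <= keep and min(v, w - v) <= keep:
                mask[u, v] = 1.0
    return np.real(np.fft.ifft2(spec * mask))   # mask is symmetric, result is real

k = np.random.default_rng(4).standard_normal((5, 5))  # stand-in for a trained kernel
print(smooth_kernel(k, keep=1))                       # visibly smoother weights
```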

Then the conclusion is drawn:

Focusing on low-frequency information helps generalization, while high-frequency components may be related to adversarial vulnerability, though this should not be taken as absolute.

A further contribution is detailed experimental evidence that Batch Normalization helps the model fit high-frequency components and improves generalization.


In the end, though, much of this is still argument rather than settled conclusion.

On one side, Professor Xu's theory appeals to the smoothness of the activation function to explain optimization; on the other side, there is a recent work, Bandlimiting Neural Networks Against Adversarial Attacks:

https://arxiv.org/pdf/1905.12797.pdf

A ReLU network computes a continuous piecewise linear function (the equation appeared as an image in the original; roughly, f(x) = Σ_k 1_{R_k}(x) (w_k·x + b_k), one linear function per linear region R_k), and this function can be decomposed into many frequency components.

For a hidden layer with N = 1000 nodes and input dimension n = 200, the maximum number of linear regions is roughly 10^200. In other words, even a moderately sized neural network partitions the input space into a vast number of sub-regions, easily exceeding the total number of atoms in the universe. When we train a neural network, we cannot expect every region to contain at least one training sample. In the regions without any training samples, the linear functions can be arbitrary, since those regions contribute nothing to the training objective. Of course, most of these regions are very small; when measuring the expected loss over the whole space, their contribution is negligible, because a randomly sampled point rarely falls into them. Adversarial attacks, however, pose a new challenge, because adversarial examples are not sampled naturally. Since the total number of regions is enormous, these tiny regions are almost everywhere in the input space: for essentially any given point, we can find a nearby tiny region on which the linear function is arbitrary. If a point inside such a region is chosen, the network's output can be unexpected. These tiny regions are the fundamental reason neural networks are vulnerable to adversarial attacks.
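The region count quoted above can be sanity-checked with the classical hyperplane-arrangement bound (my choice of formula; the paper may count differently): each hidden ReLU unit contributes one hyperplane where its pre-activation changes sign, and N hyperplanes split R^n into at most Σ_{i=0}^{n} C(N, i) regions.

```python
import math

def max_regions(N, n):
    """Zaslavsky bound: max regions cut out of R^n by N hyperplanes."""
    return sum(math.comb(N, i) for i in range(n + 1))

r = max_regions(1000, 200)
print(f"about 10^{len(str(r)) - 1} regions")
# prints about 10^217: the same astronomical scale as the answer's 10^200,
# and far beyond the ~10^80 atoms in the observable universe.
```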

The paper then proposes an adversarial defense method, which I did not fully understand; readers can consult the paper themselves, and I welcome explanations in the comments.


Author: Feitian Youxia https://www.zhihu.com/question/59532432/answer/1458221308

In digital signal processing terms, I think this should be viewed through periodic convolution and circular convolution with zero-padding. Below I give two ways of computing a convolution: one by sliding the kernel, one by matrix multiplication.

Suppose there is an image X and a convolution kernel K (both matrices were shown as images in the original and are omitted here).

The convolution with a stride of 1 can then be computed in two ways, both yielding the same output matrix.

The first method is the usual one, computing the output one element at a time: the first element is the sum of the element-wise product of the kernel with the top-left patch of the image, and the other elements follow similarly. This is not the focus here.

The second method is to expand the convolution kernel, with circular zero-padding, into a matrix and then multiply, which is close to the "multiplication" mentioned in the question. Look at this expanded matrix column by column (this is important): the shift from one column to the next follows the convolution stride, and the number of padded zeros is determined by the size of the image and the size of the kernel. Multiplying this matrix with the flattened image vector and regrouping the entries gives the convolution result.
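Both methods are easy to verify in code. A runnable sketch (sizes and values are my own example, since the original matrices were images; it uses a row-wise variant of the answer's column-wise construction):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.integers(0, 5, (4, 4))          # image
K = rng.integers(0, 3, (3, 3))          # kernel
oh, ow = X.shape[0] - K.shape[0] + 1, X.shape[1] - K.shape[1] + 1

# Method 1: slide the kernel, one output element at a time.
direct = np.array([[np.sum(X[i:i+3, j:j+3] * K) for j in range(ow)]
                   for i in range(oh)])

# Method 2: build the (oh*ow) x 16 matrix whose rows are the kernel written
# into a zero image at each valid position, then multiply the flattened X.
M = np.zeros((oh * ow, X.size))
for i in range(oh):
    for j in range(ow):
        row = np.zeros_like(X, dtype=float)
        row[i:i+3, j:j+3] = K
        M[i * ow + j] = row.ravel()

via_matrix = (M @ X.ravel()).reshape(oh, ow)
assert np.allclose(direct, via_matrix)   # both methods agree
```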

Author: Liu Qi https://www.zhihu.com/question/59532432/answer/1461984696

You can look at the book "Deep Learning Illustrated": a CNN can be seen as a special case of a GCN, and since GCNs can be explained from the frequency-domain (spectral) perspective, CNNs can be too.
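A small check of this connection (my own sketch, not from the book): on a cycle graph, the shift operator is a circulant matrix whose eigenvectors are the DFT basis, and a polynomial filter in it reproduces an ordinary circular convolution, which is the spectral view of 1-D CNN convolution.

```python
import numpy as np

n = 16
x = np.random.default_rng(6).standard_normal(n)
k = np.zeros(n); k[:3] = [0.25, 0.5, 0.25]       # a small circular kernel

# Ordinary circular convolution via the DFT (spectral filtering).
conv = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

# The same filter phrased as a graph convolution: the cycle graph's shift
# operator is the rotation matrix S with (S x)[i] = x[i-1]; a polynomial
# in S applies the kernel taps.
S = np.roll(np.eye(n), 1, axis=0)
graph_conv = (0.25 * np.eye(n) + 0.5 * S + 0.25 * (S @ S)) @ x
assert np.allclose(conv, graph_conv)             # graph filter == circular conv
```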

☆ END ☆