Time-domain convolution equals frequency-domain multiplication, and most of the computation in a convolutional neural network happens in the convolutions. How should we think about convolutional neural networks from the frequency-domain perspective? How can ResNet be explained from the frequency domain?

Author: Ruoyu
Link: https://www.zhihu.com/question/59532432/answer/1510340606

The work I find most enlightening is by Xu Zhiqin of Shanghai Jiao Tong University:
https://ins.sjtu.edu.cn/people/xuzhiqin/fprinciple/index.html

His Bilibili lecture, part of the Mathematics Department undergraduate course "Statistical Computing and Machine Learning", Part 3: Frequency Principle:
https://www.bilibili.com/video/av94808183?p=2

I have also attended his offline lectures on neural networks and Fourier transforms, as well as on Fourier analysis.

Training behavior of deep neural network in frequency domain
https://arxiv.org/pdf/1807.01251.pdf

This paper argues that the generalization ability of neural networks comes from their training process, which favors low-frequency components.

[Figure: the fitting process of neural networks on CIFAR-10 and MNIST. Thanks to @Jimmy for the correction: blue indicates relatively large error, red relatively small error.] As training progresses, higher frequencies (larger frequency index) converge more slowly; that is, at a given epoch the low-frequency error is already small (red) while the high-frequency error is still large (blue).

Theory of the frequency principle for general deep neural networks
https://arxiv.org/pdf/1906.09235v2.pdf

This paper develops a substantial mathematical derivation to prove the F-Principle, splitting the proof into the initial, intermediate, and final stages of training. It may be heavy going for readers outside mathematics.

Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks
https://arxiv.org/pdf/1905.10264.pdf

Why deep neural networks (DNNs) with more parameters than samples can often generalize well remains a mystery. One attempt to understand this is to discover the implicit biases of the DNN training process, such as the frequency principle (F-Principle): DNNs typically fit target functions from low frequencies to high frequencies. Inspired by the F-Principle, this paper proposes an effective linear F-Principle (LFP) dynamic model that accurately predicts the learning outcomes of wide two-layer ReLU neural networks (NNs). The LFP dynamics are justified by the linearized mean-field residual dynamics of the NNs. Importantly, the long-time limit of the LFP dynamics is equivalent to the solution of a constrained optimization problem that explicitly minimizes an FP norm, under which the high frequencies of feasible solutions are penalized more heavily. Using this optimization formulation, an a priori estimate of the generalization error bound is given, showing that the higher the FP norm of the target function, the larger the generalization error. Overall, by interpreting the implicit bias of the F-Principle as an explicit penalty for two-layer NNs, this work takes a step toward a quantitative understanding of the learning and generalization of general DNNs.

[Figure: schematic of the LFP model for two-dimensional data such as images.]

Professor Xu's earlier introduction: the LFP model provides a new perspective for the quantitative understanding of neural networks.
First, the LFP model effectively characterizes the key features of the training process of such a parameter-rich system using a simple differential equation, and can accurately predict the learning outcomes of neural networks. Therefore, this model establishes a relationship between differential equations and neural networks from a new angle. Since differential equations are a very mature research field, we believe that tools from this field can help us further analyze the training behavior of neural networks.
Secondly, similar to statistical physics, the LFP model only relates to certain macroscopic statistical quantities of network parameters, and is independent of the specific behavior of individual parameters. This statistical characterization can help us accurately understand the learning process of DNNs when there are many parameters, thus explaining the good generalization ability of DNNs when parameters far exceed the number of training samples.
In this work, we analyze the evolution of the LFP dynamics through an equivalent optimization problem and provide an a priori estimate of the network's generalization error. We find that the generalization error can be controlled by a certain F-Principle norm, defined (up to the paper's normalization) as $\|f\|_{\mathrm{FP}} = \left(\int \gamma(\xi)^{-1}\,|\hat{f}(\xi)|^{2}\,d\xi\right)^{1/2}$, where γ(ξ) is a weight function that decays with frequency, so that high-frequency content is penalized more heavily.
It is worth noting that our error estimate targets the learning process of the neural network itself and does not require adding extra regularization terms in the loss function. We will further elaborate on this error estimate in a subsequent introductory article.
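Before moving on to the next paper, the F-Principle itself is easy to observe numerically. The sketch below is my own illustration, not taken from the papers above; it assumes PyTorch is available, and the target function, network width, and learning rate are arbitrary choices. It trains a small tanh network on a two-frequency signal and prints the relative error of each Fourier mode; the low mode (k=1) typically converges long before the high mode (k=5).

```python
# Minimal F-Principle demo: a small network fits the low frequency of a
# two-frequency target before the high frequency. Illustrative only.
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

# Target: low frequency k=1 plus high frequency k=5 on [0, 2*pi).
n = 256
x = torch.linspace(0, 2 * np.pi, n + 1)[:-1].unsqueeze(1)
y = torch.sin(x) + torch.sin(5 * x)

model = nn.Sequential(nn.Linear(1, 200), nn.Tanh(), nn.Linear(200, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def freq_error(pred):
    """Relative error of each Fourier mode of the prediction."""
    fp = np.fft.rfft(pred.detach().numpy().ravel())
    fy = np.fft.rfft(y.numpy().ravel())
    return np.abs(fp - fy) / (np.abs(fy) + 1e-12)

for step in range(5001):
    opt.zero_grad()
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        e = freq_error(model(x))
        # Mode k=1 (low) typically converges well before mode k=5 (high).
        print(f"step {step:5d}  loss {loss.item():.4f}  "
              f"err@k=1 {e[1]:.3f}  err@k=5 {e[5]:.3f}")
```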
FREQUENCY PRINCIPLE: FOURIER ANALYSIS SHEDS LIGHT ON DEEP NEURAL NETWORKS
This indicates that, for any two frequencies that have not yet converged, the gradient contributed by the low frequency exponentially dominates that of the high frequency when the weights are small. By Parseval's theorem, the MSE loss in the spatial domain equals the L2 loss in the Fourier domain. To build intuition for why the low-frequency part of the loss decays faster, consider training, in the Fourier domain, on a loss function with only two non-zero frequencies.
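The Parseval step is easy to verify numerically; a quick sketch (plain NumPy, with random signals of my own choosing) shows that the spatial-domain MSE equals the suitably normalized L2 distance in the Fourier domain:

```python
# Parseval's theorem: MSE in the spatial domain equals the normalized
# L2 distance in the Fourier domain.
import numpy as np

rng = np.random.default_rng(0)
n = 1024
f, g = rng.standard_normal(n), rng.standard_normal(n)  # signal and "target"

mse_spatial = np.mean((f - g) ** 2)
# np.fft.fft is unnormalized, so divide the squared distance by n^2
# (one factor of n from Parseval, one from converting a sum to a mean).
mse_fourier = np.sum(np.abs(np.fft.fft(f) - np.fft.fft(g)) ** 2) / n**2

print(mse_spatial, mse_fourier)  # the two agree to machine precision
```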
It also sheds light on the role of the activation function: tanh is smooth in the spatial domain, so its derivative's Fourier transform decays exponentially with frequency.
Professor Xu’s popular science articles on the F-Principle:
Analyzing the Fourier spectrum of ReLU networks via their continuous piecewise linear structure.
Finding empirical evidence of spectral bias: low-frequency components are learned first; moreover, this preference for low frequencies helps the network's robustness under adversarial perturbations.
Providing a theoretical analysis of the learning process through manifold theory.
Using Stokes' theorem from topology, the Fourier transform of the piecewise linear function computed by a ReLU network can be evaluated, which supports the convergence analysis. What about the later Swish and Mish activations? (Dog head)
Consequently, in high-dimensional space the spectral decay of a ReLU network is strongly anisotropic, and the amplitude of its Fourier transform is bounded in terms of the network's Lipschitz constant.
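As a sanity check on that decay claim, the one-dimensional case can be written out directly. This is the standard integration-by-parts argument, not the paper's full anisotropic high-dimensional statement: for a compactly supported, continuous, piecewise linear $f$ with breakpoints $x_m$,

```latex
% One-dimensional sketch: integrating by parts twice moves the decay onto
% 1/k^2, since f'' is a sum of Dirac deltas at the breakpoints x_m:
\hat{f}(k)
  = \int f(x)\, e^{-ikx}\, dx
  = \frac{1}{(ik)^2} \sum_{m} \Delta f'(x_m)\, e^{-ik x_m},
\qquad
|\hat{f}(k)| \;\le\; \frac{1}{k^2} \sum_{m} \bigl|\Delta f'(x_m)\bigr|,
```

where $\Delta f'(x_m)$ is the jump of the slope at breakpoint $x_m$, each bounded by twice the Lipschitz constant. So the amplitude decays like $1/k^2$ with a constant controlled by the Lipschitz constant; the high-dimensional statement refines this with the anisotropy noted above.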
Experiments:
Central point: low-frequency components are learned first, with high priority.
1. Experiments on synthetic functions:
[Figures: the Fourier transform of the target, the learning process of the function, and the normalized spectral components of the model.]

2. Learning MNIST data in a noisy environment

[Figures: validation losses under different settings, and the fitting of the frequency components of the MNIST data.]

Neural networks can approximate arbitrary functions, but researchers have found that they prefer low-frequency components and thus exhibit a bias toward smooth functions, a phenomenon known as spectral bias.

Manifold hypothesis: the more complex the data manifold, the easier the learning process becomes. This hypothesis may break the "structural risk minimization" assumption and potentially lead to overfitting. For complex datasets (such as ImageNet) the search space is very large, and some method is needed to make everything "work in harmony" and tune properly. Bengio appears to believe this has implications for regularization in deep learning.

Machine Learning from a Continuous Viewpoint
https://arxiv.org/pdf/1912.12777.pdf

The mathematician Weinan E argues that the frequency principle does not always hold. [The derivation, given as images in the original, assumes a target function under a probability measure, proceeds via kernel functions, decomposes the Fourier coefficients, and works with the characteristic function; the paper then gives the boundary of when the frequency principle holds, with explicit conditions under which it works and under which it does not.] If Weinan E delineates the boundary of the Frequency Principle from a mathematician's perspective, then readers on the engineering side should definitely look at the next paper.

A Fourier Perspective on Model Robustness in Computer Vision
https://arxiv.org/pdf/1906.08988.pdf
The code has also been open-sourced:
https://github.com/google-research/google-research/tree/master/frequency_analysis

The authors' intention is to focus on robustness, not to discard high-frequency features altogether.

Figure caption translation: Models can reach high accuracy using input information that humans cannot recognize. The models shown were trained and tested with strict high-pass or low-pass filtering applied at the input. With aggressive low-pass filtering, even when the image looks like simple colored blobs, the model still exceeds 30% accuracy on ImageNet. With high-pass (HP) filtering, using input features that are nearly invisible to humans, the model reaches over 50% accuracy. As shown in the right image, high-pass filtered images must be normalized for the high-frequency features to be visible (using the method given in the appendix).

Figure caption translation: Left: the Fourier spectrum of natural images; we estimate E[|F(X)[i,j]|] by averaging over all CIFAR-10 validation images. Right: the Fourier spectrum of the corruptions of CIFAR-10-C at severity 3. For each corruption, we estimate E[|F(C(X)−X)[i,j]|] by averaging over all validation images. Additive noise concentrates in the high-frequency band, while fog, contrast, and similar corruptions concentrate in the low-frequency band.

Figure caption translation: Sensitivity of models to additive noise along different Fourier basis vectors on CIFAR-10. We fix the additive noise to an L2 norm of 4 and evaluate three models: naturally trained, adversarially trained, and trained with Gaussian data augmentation. The average error rate is computed on 1000 images randomly sampled from the test set.
In the bottom row, images perturbed by noise along the corresponding Fourier basis vectors are shown. The naturally trained model is highly sensitive to all additive noise except at the lowest frequencies. Adversarial training and Gaussian data augmentation markedly improve robustness to high frequencies while sacrificing the naturally trained model's robustness to low frequencies (in both models, the central blue area is smaller than in the naturally trained model).

Figure caption translation: Sensitivity of models to additive noise along different Fourier basis vectors on ImageNet validation images. The basis vectors are fixed to an L2 norm of 15.7. The error rate is averaged over the entire ImageNet validation set; a 63×63 square centered on the lowest frequency in the Fourier domain is shown. Again, the naturally trained model is highly sensitive to all additive noise except at the lowest frequencies. Gaussian data augmentation, on the other hand, improves robustness to high frequencies at the cost of robustness to low-frequency perturbations. For AutoAugment, its Fourier heat map has the largest blue/yellow area around the center, indicating that AutoAugment is relatively robust to low- and mid-frequency perturbations.

Figure caption translation: The relationship between the high-frequency energy fraction of a CIFAR-10-C corruption and test accuracy. Each scatter point represents the evaluation of a specific model against a specific corruption type. The x-axis is the high-frequency energy fraction of the corruption; the y-axis is the change in test accuracy relative to the naturally trained model. Overall, Gaussian data augmentation, adversarial training, and adding a low-pass filter improve robustness to high-frequency corruptions while reducing robustness to low-frequency corruptions. Applying a high-pass filter at the front end causes a larger accuracy drop for high-frequency corruptions than for low-frequency ones. AutoAugment improves robustness to nearly all corruptions and achieves the best overall performance. The legend at the bottom gives the slope (k) and residual (r) of each fitted line.

Figure caption translation: (a) and (b): the Fourier spectrum of adversarial perturbations. Given an image X, a PGD attack produces an adversarial sample C(X), and the Fourier spectrum of the perturbation that causes misclassification is estimated. (a) is the spectrum under natural training; (b) under adversarial training. The adversarial perturbations of the naturally trained model are spread uniformly across frequency components, whereas adversarial training biases them toward lower frequencies. (c) and (d): adding Fourier basis vectors with large norm to images is a simple way to generate content-preserving black-box adversarial examples.
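The strict low-pass/high-pass filtering used in these experiments is straightforward to reproduce in spirit. Below is a minimal sketch (plain NumPy; the cutoff radius, image size, and random stand-in image are my own choices, and the paper's exact protocol may differ) that keeps or removes a centered disk of frequencies:

```python
# FFT-based low-pass / high-pass filtering of an image, in the spirit of
# the filtering experiments above. Cutoff and image are illustrative.
import numpy as np

def fourier_filter(img, radius, keep="low"):
    """Keep only frequencies inside (low) or outside (high) a centered disk."""
    h, w = img.shape
    spec = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)  # distance from the DC bin
    mask = dist <= radius if keep == "low" else dist > radius
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))

img = np.random.rand(32, 32)             # stand-in for a CIFAR-10 channel
low = fourier_filter(img, radius=4, keep="low")
high = fourier_filter(img, radius=4, keep="high")
# The two bands sum back to the original image (up to float error).
print(np.allclose(low + high, img))
```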
Key Conclusions:
1) Adversarial training attends to some high-frequency components rather than obsessing only over low-frequency ones.
2) AutoAugment helps improve robustness.

The open-source code mainly shows how to draw the kind of schematic diagrams that appear in the paper.

Another paper, from Eric Xing's group, was previously covered by a self-media account on Zhihu:

High-frequency Component Helps Explain the Generalization of Convolutional Neural Networks
https://arxiv.org/pdf/1905.13545.pdf
[Figure: visualization of convolution kernels from natural training versus adversarial training.]
This paper experimented with several methods (a minimal sketch of the first two ideas follows the list):
For a trained model, we adjust its weights to make the convolution kernel smoother;
Directly filtering out high-frequency information from the trained convolution kernel;
Increasing regularization during the training of convolutional neural networks so that weights at adjacent positions are closer.
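The following sketch illustrates the first two ideas. The kernel values are made up, a real experiment would operate on trained weights, and the paper's exact procedures may differ:

```python
# Sketch of the first two methods above: smooth a kernel by averaging
# neighboring taps, or drop its high-frequency DFT content.
import numpy as np

def smooth_kernel(k, alpha=0.5):
    """Blend each tap with the mean of its 4-neighbors (zero-padded)."""
    p = np.pad(k, 1)
    neigh = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
    return (1 - alpha) * k + alpha * neigh

def drop_high_freq(k, keep=1):
    """Zero all 2-D DFT modes farther than `keep` bins from DC."""
    spec = np.fft.fftshift(np.fft.fft2(k))
    h, w = k.shape
    yy, xx = np.ogrid[:h, :w]
    mask = np.hypot(yy - h // 2, xx - w // 2) <= keep
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))

k = np.random.randn(5, 5)  # stand-in for a trained 5x5 convolution kernel
for name, kf in [("smoothed", smooth_kernel(k)), ("low-passed", drop_high_freq(k))]:
    # Adjacent-tap differences typically shrink, i.e. the kernel is smoother.
    print(name, np.abs(np.diff(kf, axis=0)).mean(),
          "vs original", np.abs(np.diff(k, axis=0)).mean())
```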
Then the conclusion is drawn:
Focusing on low-frequency information helps improve generalization, while high-frequency components may be related to adversarial attacks, though this should not be taken as absolute.
The contribution is to provide detailed experimental evidence that Batch Normalization is useful for fitting high-frequency components and improving generalization.
Finally, it’s all about talking.
On one side, Professor Xu's work argues that the smoothness properties of ReLU networks help optimization; on the other side, a recent work, Bandlimiting Neural Networks Against Adversarial Attacks, takes the opposite angle.
A ReLU network computes a piecewise linear function, which can be decomposed into many frequency components.
For a hidden layer with N = 1000 nodes and input dimension n = 200, the maximum number of regions is roughly 10^200. In other words, even a moderately sized neural network partitions the input space into a vast number of sub-regions, easily exceeding the total number of atoms in the universe. When we train such a network, we cannot expect every region to contain at least one training sample. On regions without any training samples, the linear function can be arbitrary, since it contributes nothing to the training objective. Of course, most of these regions are very small; when we measure the expected loss over the whole space, their contribution is negligible, because the chance that a randomly sampled point falls into one of them is tiny. Adversarial attacks, however, pose a new challenge, because adversarial samples are not sampled naturally. Given that the total number of regions is enormous, these tiny regions are almost everywhere in the input space: for any given data point, we can almost always find a nearby tiny region on which the linear function is arbitrary. If a point in such a region is chosen, the network's output can be unexpected. These tiny regions are the fundamental reason neural networks are vulnerable to adversarial attacks.
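For reference, the region count quoted above follows from the standard hyperplane-arrangement bound; the estimate below is my own check, and the paper's exact constant may differ:

```latex
% Maximum number of linear regions that N hyperplanes (one per hidden
% ReLU unit) can carve out of R^n (Zaslavsky's bound):
R(N, n) \;=\; \sum_{i=0}^{n} \binom{N}{i},
\qquad
R(1000,\, 200) \;\ge\; \binom{1000}{200} \;\approx\; 10^{216}.
```

This comfortably exceeds the roughly 10^80 atoms in the observable universe, in line with the figure quoted above.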
Then, a method of adversarial defense is proposed, but I did not understand it well, so readers can read the paper themselves, and I welcome you to point it out in the comments after reading.
From a digital signal processing standpoint, I think the zero-padding used in periodic and circular convolution is worth considering here. Below I give two ways to compute a convolution: one by sliding the kernel, and one by matrix multiplication.
Assume an image as shown below:
[image matrix]
and a convolution kernel as follows:
[kernel matrix]
Then the convolution with a stride of 1 can be computed in two ways, both yielding the same result matrix.
The first method is the usual one: as shown in the figure, slide the window and compute the outputs one by one. For example, the first output element is the sum of the elementwise products of the kernel with the first window, and the other elements are computed similarly; this is not the focus here.
The second method is to zero-pad and circularly shift the convolution kernel into a matrix and then multiply, which is similar to the multiplication you mentioned.
Expand the kernel into a matrix whose rows are zero-padded, shifted copies of the kernel. Read this matrix column-wise (very important): the shift from one column to the next is determined by the convolution stride, and the number of padded zeros is determined by the sizes of the image and the kernel. Multiplying this matrix by the flattened image vector and regrouping the resulting vector gives the convolution result.
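Since the original matrices were given as images, the sketch below substitutes a made-up 4×4 image and 3×3 kernel (stride 1, no padding) and checks that the sliding-window computation and the expand-and-multiply computation agree:

```python
# Two ways to compute a stride-1, no-padding 2-D convolution
# (cross-correlation, as CNNs use): sliding window vs. matrix multiply.
# The 4x4 image and 3x3 kernel are made-up stand-ins for the original figures.
import numpy as np

img = np.arange(16, dtype=float).reshape(4, 4)   # "image"
ker = np.array([[1., 0., -1.],
                [1., 0., -1.],
                [1., 0., -1.]])                  # "convolution kernel"
H, W = img.shape
kh, kw = ker.shape
oh, ow = H - kh + 1, W - kw + 1                  # output is 2x2 here

# Method 1: slide the kernel and sum elementwise products, one output at a time.
out1 = np.zeros((oh, ow))
for i in range(oh):
    for j in range(ow):
        out1[i, j] = np.sum(img[i:i + kh, j:j + kw] * ker)

# Method 2: unroll the kernel into a (oh*ow) x (H*W) matrix whose rows are
# zero-padded, shifted copies of the kernel, then multiply by the flat image.
M = np.zeros((oh * ow, H * W))
for i in range(oh):
    for j in range(ow):
        row = np.zeros((H, W))
        row[i:i + kh, j:j + kw] = ker            # place kernel at this window
        M[i * ow + j] = row.ravel()
out2 = (M @ img.ravel()).reshape(oh, ow)

print(np.allclose(out1, out2))                   # True: both give the same matrix
```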
You can also study the book "Deep Learning Illustrated". A CNN can be seen as a special case of a GCN; since a GCN can be explained from the frequency-domain perspective, so can a CNN.