Why Bigger Neural Networks Are Better: A NeurIPS Study


Reported by New Intelligence

Editor: LRS

[New Intelligence Overview] It has almost become a consensus that bigger neural networks are better, yet the idea contradicts traditional function-fitting theory. Recently, researchers from Microsoft published a paper at NeurIPS that mathematically proves large-scale neural networks are necessary, and even suggests they should be larger than we currently expect.
As neural network research gradually shifts toward ultra-large pre-trained models, the goal seems to have become building networks with more parameters, more training data, and more diverse training tasks.
And the approach does work: as neural networks grow larger, they absorb and master more data, and on certain specific tasks they even surpass human performance.
Mathematically, however, the scale of modern neural networks looks bloated: the parameter count often far exceeds what the prediction task seems to require, a situation known as overparameterization.
A recent NeurIPS paper offers a novel explanation for this phenomenon: such larger-than-expected networks are not wasteful but genuinely necessary to avoid certain fundamental problems, and the paper's findings provide a more general insight into the question.
Paper link: https://arxiv.org/abs/2105.12806
The first author of the paper, Sébastien Bubeck, leads the Machine Learning Foundations group at Microsoft Research Redmond, conducting interdisciplinary research across machine learning and theoretical computer science.

A neural network should be this big

A common task for neural networks is to recognize target objects in images.
To create a network capable of performing this task, researchers first provide it with many images and corresponding target labels, training it to learn the correlations between them. Afterwards, the network will correctly identify targets in images it has seen.
In other words, the training process allows the neural network to memorize this data.
Once the network has memorized enough training data, it can also predict the labels of unseen objects with varying degrees of accuracy, a process known as generalization.
The size of the network determines how much it can remember.
We can understand this geometrically. Suppose two data points are placed on an xy-plane; they can be connected by a line described by two parameters: its slope and the height at which it crosses the y-axis (the intercept). Anyone who knows these two parameters, together with the x-coordinate of one of the original data points, can read the corresponding y-coordinate straight off the line.
In that sense the line has memorized the two data points, and what a neural network does is quite similar.
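As a tiny illustration of this idea (a minimal sketch with made-up numbers, not anything from the paper), the snippet below computes the two parameters of the line through two points and then recovers a y-coordinate from an x-coordinate:

```python
# Two made-up data points on the xy-plane.
x1, y1 = 1.0, 3.0
x2, y2 = 4.0, 9.0

# The line through them is fully described by two parameters.
slope = (y2 - y1) / (x2 - x1)   # parameter 1: slope
intercept = y1 - slope * x1     # parameter 2: y-axis intercept

# Knowing only the two parameters and an x-coordinate is enough
# to recover ("remember") the corresponding y-coordinate.
def predict(x):
    return slope * x + intercept

print(predict(1.0))  # 3.0 -> the line has memorized (1, 3)
print(predict(4.0))  # 9.0 -> ... and (4, 9)
```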
The difference is that a neural network's data points live in much higher dimensions. An image, for example, is described by hundreds or thousands of values, one per pixel, and this collection of free values can be treated mathematically as the coordinates of a point in a high-dimensional space; the number of coordinates is called the dimension.
Traditional mathematics says that to fit n data points with a curve, you need a function with n parameters; in the example above, two points are described by a curve with two parameters.
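This classical rule of thumb can be seen directly with polynomial interpolation (an illustrative sketch with arbitrary numbers): a polynomial with n coefficients, i.e. n parameters, passes exactly through n points.

```python
import numpy as np

# n arbitrary data points (here n = 5).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)

# A degree-(n-1) polynomial has exactly n coefficients (parameters)
# and can interpolate the n points exactly.
coeffs = np.polyfit(x, y, deg=n - 1)
fitted = np.polyval(coeffs, x)

print(np.allclose(fitted, y))  # True: n parameters memorize n points
```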
When neural networks first emerged as a new model in the 1980s, researchers believed the same rule should hold: only n parameters should be needed to fit n data points, regardless of the data's dimension.
Alex Dimakis of the University of Texas at Austin notes that this is no longer what happens in practice: the number of parameters in today's neural networks far exceeds the number of training samples, which means the textbooks need to be rewritten and corrected.
Researchers have been studying the robustness of neural networks, that is, a network's ability to cope with small changes to its input. For example, a non-robust network may have learned to recognize a giraffe, yet misclassify a nearly identical image as a gerbil.
In 2019, Bubeck and his colleagues were seeking to prove a theorem regarding this issue when they realized that the problem was related to the scale of the network.
In their new proof, the researchers show that overparameterization is necessary for a network to be robust. The argument works through smoothness, a mathematical property that plays the same role as robustness, and asks how many parameters are needed to fit data points with a smooth curve.
To understand this, imagine again a curve on a plane where the x-coordinate represents a pixel’s color and the y-coordinate represents an image label.
Because the curve is smooth, slightly modifying the pixel's color moves you only a short distance along the curve, so the predicted value changes only slightly. A jagged curve, by contrast, lets a small change in the x-coordinate (the color) cause a huge change in the y-coordinate (the label), turning a giraffe into a gerbil.
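The contrast can be made concrete with a toy experiment (a sketch on synthetic 1-D data, not the paper's setting): compare how strongly a smooth low-degree fit and an exactly interpolating, wiggly high-degree fit react to the same tiny input perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: x plays the role of a pixel value,
# y plays the role of a noisy label score.
x = np.linspace(-1.0, 1.0, 15)
y = np.sign(x) + 0.3 * rng.standard_normal(x.size)

smooth = np.polyfit(x, y, deg=3)           # few parameters: smooth curve
jagged = np.polyfit(x, y, deg=x.size - 1)  # one parameter per point: exact but wiggly

# Worst-case change in the prediction caused by a tiny input change,
# measured over a grid of query points.
eps = 1e-3
grid = np.linspace(-0.95, 0.95, 400)
for name, coeffs in [("smooth", smooth), ("jagged", jagged)]:
    change = np.abs(np.polyval(coeffs, grid + eps) - np.polyval(coeffs, grid))
    print(f"{name:6s} fit: worst-case |prediction change| = {change.max():.4f}")
# The exactly interpolating, jagged fit is far more sensitive to the
# same tiny perturbation than the smooth, low-degree fit.
```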
Bubeck and Sellke proved in their paper that smoothly fitting high-dimensional data points requires not just n parameters but on the order of n×d parameters, where d is the input dimension (for example, an image with 784 pixels has an input dimension of 784).
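The quantitative version of this trade-off is usually summarized as a lower bound on the Lipschitz constant, the standard mathematical measure of smoothness; the form below is an informal paraphrase, since the theorem in the paper carries additional technical conditions on the data and the parameterization.

```latex
% Informal "universal law of robustness" (a paraphrase, not the exact statement):
% any model f with p parameters that fits n noisy, d-dimensional training
% points below the noise level must have Lipschitz constant at least about
\operatorname{Lip}(f) \;\gtrsim\; \sqrt{\frac{n\,d}{p}} .
% Consequently, a smooth fit (Lipschitz constant of order 1) forces
% p \gtrsim n d parameters, not merely p \gtrsim n.
```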
In other words, if you want a network to robustly memorize its training data, overparameterization is not just helpful but necessary. The proof relies on a fact about high-dimensional geometry: points scattered randomly on the surface of a high-dimensional sphere are almost all far apart from one another, nearly a diameter away, and that vast spacing means fitting them with a smooth curve requires many extra parameters.
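The geometric fact is easy to check numerically (a quick sketch, not part of the paper): sample random points on unit spheres of increasing dimension and look at their pairwise distances. In high dimension the distances concentrate tightly around a single large value, about √2 times the radius, meaning almost every pair of points is nearly orthogonal and far apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_distance_stats(dim, num_points=500):
    """Sample points uniformly on the unit sphere in `dim` dimensions
    and return the mean and spread of their pairwise distances."""
    pts = rng.standard_normal((num_points, dim))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)  # project onto the sphere
    gram = pts @ pts.T                                 # pairwise inner products
    sq_dists = np.clip(2.0 - 2.0 * gram, 0.0, None)    # |u - v|^2 for unit vectors
    dists = np.sqrt(sq_dists[np.triu_indices(num_points, k=1)])
    return dists.mean(), dists.std()

for dim in (2, 10, 100, 1000):
    mean, std = pairwise_distance_stats(dim)
    print(f"d = {dim:4d}: mean pairwise distance = {mean:.3f}, std = {std:.3f}")
# As the dimension grows, the spread shrinks toward zero and the mean
# approaches sqrt(2): almost every pair of random points is maximally spread out.
```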
Amin Karbasi of Yale University praised the proof as very concise: it does not need pages of heavy formulas, yet it makes a very general statement.
This proof also provides a new avenue for understanding why the simple strategy of expanding neural networks is so effective.
Other studies reveal additional reasons why overparameterization is beneficial. For example, it can improve the efficiency of the training process and enhance the network’s generalization ability.
While we now know that overparameterization is necessary for robustness, it remains unclear how necessary robustness itself is for other desirable properties. By linking robustness to overparameterization, however, the new proof suggests it may be more important than previously thought, and it may pave the way for further studies explaining the benefits of large models.
Robustness does look like a prerequisite for generalization: if you build a system that falls apart under the slightest perturbation, what kind of system is that? Clearly not a reasonable one.
Bubeck therefore regards it as a very fundamental, basic requirement.

References:

https://www.quantamagazine.org/computer-scientists-prove-why-bigger-neural-networks-do-better-20220210/
