Introduction
This article analyzes why model training fails to converge, from both the data and the model perspective. It summarizes four possible causes on the data side and nine on the model side, describes the symptoms each problem produces, and gives the conventional remedies.
Author: Feng Ying Ren Zhe @ Zhihu (Authorized)
Editor: CV Technical Guide
Article: https://zhuanlan.zhihu.com/p/285601835
When facing a model that does not converge, first make sure it has been trained for enough iterations. During training the loss does not decrease monotonically and the accuracy does not improve monotonically; some fluctuation is normal, and as long as the overall trend is toward convergence it is fine. If the model still does not converge after sufficient training (typically thousands to tens of thousands of iterations, or several dozen epochs), then consider the measures below.
1. Data and Label Aspects
1. No data preprocessing. Are the class labels accurate? Is the data clean?
2. No data normalization. Different features often have different scales and units, which distorts the analysis. To remove the effect of scale and make features comparable, the data should be standardized; after standardization all features are on the same order of magnitude and can be compared fairly. Moreover, most of the neural network pipeline assumes that inputs and outputs are distributed around 0, from weight initialization to activation functions to the optimization algorithm used for training. Center the data by subtracting the mean and dividing by the standard deviation (a short sketch follows this list).
3. The sample space carries more information than the network has capacity to fit. (A dataset that is too small, by contrast, mainly causes overfitting.) Check whether the loss on your training set has converged: if it fails to converge only on the validation set, that indicates overfitting, and you should turn to the usual anti-overfitting tricks such as dropout, SGD, increasing the number of minibatches, reducing the number of nodes in the fully connected layers, momentum, finetuning, etc.
4. Are the labels set correctly?
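For point 2, here is a minimal normalization sketch, assuming a PyTorch/torchvision image pipeline; the mean/std values are the commonly used ImageNet statistics, not values from this article, and should ideally be computed from your own training set.

```python
import torchvision.transforms as T

# Per-channel mean/std; the ImageNet statistics below are only a common default.
# Ideally compute these from your own training set.
normalize = T.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])

train_transform = T.Compose([
    T.ToTensor(),   # scales pixel values to [0, 1]
    normalize,      # (x - mean) / std, channel-wise
])
```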
2. Model Aspects
1. Network configuration is unreasonable.
If a very complex classification task is tackled with a very shallow network, training may fail to converge. Choose an appropriate network, or try deepening the current one. That said, deeper is not automatically better: start with a network of 3-8 layers, and only once it performs well consider a deeper network to improve accuracy. Starting with a small network also means faster training, which lets you vary the parameters and observe their effect on the network rather than simply stacking more layers.
2. Inappropriate learning rate.
If too large, it will cause non-convergence; if too small, it will cause very slow convergence.
When training a new network, you can start from 0.1. If the loss does not go down, divide by 10 and try 0.01; 0.01 usually converges, and if not, try 0.001. A learning rate that is too large easily causes oscillation, but it is also not advisable to set it too small at the start, especially in the early phase of training, or the loss will hardly come down.
My own approach is to step down gradually: 0.1, 0.08, 0.06, 0.05, and so on, until training stabilizes. Sometimes a learning rate that is too low cannot escape local minima; increasing the momentum is one remedy, and appropriately increasing the minibatch size helps reduce the fluctuations.
A learning rate set too high can make the loss diverge (suddenly become very large). This is the situation beginners run into most often: why does the network look like it is converging and then suddenly blow up? The most likely reason is using ReLU as the activation function together with softmax, or another loss containing exp, in the classification layer.
When a training sample reaches the last layer and some node is strongly over-activated (say 100), exp(100) = Inf overflows; after backpropagation all the weights become NaN and stay NaN from then on, so the loss skyrockets. A learning rate set too high can push training into a divergence it cannot recover from; when that happens, pause and inspect the weights of any layer, and you will very likely find they are all NaN. In this situation it is recommended to binary-search the learning rate between 0.1 and 0.0001; the optimal learning rate differs across models and tasks.
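A minimal sketch of the two checks described above, assuming PyTorch; the exp-overflow issue is avoided by feeding raw logits to a loss that applies log-softmax internally, and `model` and `logits` stand in for your own network and its raw outputs.

```python
import torch
import torch.nn as nn

def has_bad_weights(model: nn.Module) -> bool:
    """Return True if any parameter contains NaN or Inf (the 'loss skyrockets' case)."""
    return any(not torch.isfinite(p).all() for p in model.parameters())

# Feed raw logits (no softmax in the last layer) to CrossEntropyLoss:
# it applies log-softmax internally in a numerically stable way,
# so an over-activated node does not turn into exp(100) = Inf.
criterion = nn.CrossEntropyLoss()
# loss = criterion(logits, targets)
```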
3. Incorrect number of hidden layer neurons.
Using too many or too few neurons can make training difficult. Too few neurons lack the capacity to express the task, while too many neurons can slow down training and make it hard for the network to filter out noise.
The number of hidden-layer neurons can be set starting in the 256 to 1024 range, and then you can look at the numbers used by researchers on similar problems; if their numbers differ significantly from this, think about the reasons behind the difference. Before settling on the number of hidden units, the key is to consider the minimum number of values the network actually needs to carry the information, and then gradually increase from there.
For regression tasks, you can consider using 2 to 3 times the number of input or output variables. In practice, compared with other factors, the number of hidden units usually has a relatively small effect on performance, and in many cases adding hidden units beyond what is required merely slows training down.
4. Incorrect initialization of network parameters.
If the network weights are not correctly initialized, the network cannot be trained.
Commonly used weight initialization schemes include He, LeCun, and Xavier initialization. All of them work very well in practice, and biases are usually initialized to 0; choose the initialization scheme that best suits your task.
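A minimal sketch in PyTorch (layer sizes are arbitrary; Kaiming/He initialization is shown for a ReLU network, with Xavier as the alternative noted in the comment):

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        # He/Kaiming init suits ReLU; use nn.init.xavier_uniform_ for tanh/sigmoid nets.
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)   # biases start at 0

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(init_weights)
```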
5. No regularization.
Typical regularization methods include dropout, adding noise, etc. Even if the data volume is large or you believe the network cannot overfit, regularization is still very necessary.
Dropout is usually started with a keep probability of 0.75 or 0.9, adjusted according to how likely you judge overfitting to be; if you are confident the network will not overfit, you can push it to 0.99. Regularization not only prevents overfitting, it also speeds up training and helps cope with outliers in the data by preventing extreme weight configurations. Data augmentation also has a regularizing effect, and the best way to avoid overfitting is still to have a large amount of training data.
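A minimal sketch, assuming PyTorch. Note that the 0.75/0.9 above reads as a keep probability, whereas `nn.Dropout(p)` takes the drop probability, so keeping 90% of units means p=0.1; the weight-decay value is likewise only an illustrative default.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),       # drop 10% of activations, i.e. keep probability 0.9
    nn.Linear(64, 10),
)

# L2 regularization via weight decay; 1e-4 is only an illustrative value.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
```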
6. Batch size too large.
Setting the batch size too large can reduce network accuracy because it decreases the randomness of gradient descent. Additionally, given the same conditions, a larger batch size typically requires training more epochs to achieve the same accuracy.
We can try smaller batch sizes such as 16, 8, or even 1. A smaller batch size allows more weight updates per epoch, which brings two benefits: it makes it easier to escape local minima, and it tends to generalize better. A minimal sketch follows below.
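A minimal sketch, assuming PyTorch; the dummy tensors stand in for your real dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data purely for illustration; replace with your own Dataset.
train_dataset = TensorDataset(torch.randn(1000, 128), torch.randint(0, 10, (1000,)))

# Smaller batches mean more weight updates per epoch and noisier gradients;
# 16 is just one value to try, going down to 8 or even 1.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
```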
7. Gradient clipping interfering with the learning rate.
Many deep learning frameworks enable gradient clipping by default. It handles gradient explosion and is very useful, but when it is on by default it can also make it hard to find the optimal learning rate. If you have cleaned the data properly, removed outliers, and set a sensible learning rate, you may not need gradient clipping at all. Occasionally you will still hit gradient explosion, in which case you can turn it on, but that usually signals some other problem with the data, and clipping is only a stopgap.
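If you do need clipping, here is a minimal sketch of adding it explicitly in a PyTorch training step (the tiny model and max_norm=1.0 are only illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Clip after backward() and before step(); caps the total gradient norm at max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```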
8. Incorrect activation function in the last layer.
Using the wrong activation function in the last layer can prevent the network from producing the expected range of values. The most common mistake is using ReLU in the last layer, which can only output non-negative values.
For a regression task, most of the time no activation function is needed at all, unless you know something specific about the expected output values. Think about what your data values actually represent and what their range is after normalization; the most likely case is an output that is unbounded and can be either positive or negative, in which case the last layer should have no activation. If the output only makes sense within a specific range, for example probabilities between 0 and 1, then the last layer can use a sigmoid.
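A minimal sketch of the two cases, assuming PyTorch (layer sizes are arbitrary):

```python
import torch.nn as nn

# Regression: unbounded target, so no activation on the last layer.
regressor = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

# Probability in [0, 1]: sigmoid on the output (or, equivalently and more stably,
# keep raw logits and train with nn.BCEWithLogitsLoss).
classifier = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
```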
9. Network has bad gradients.
If you have trained for several epochs and the error has not changed at all, the culprit may be ReLU; try switching the activation to leaky ReLU. ReLU has gradient 1 for positive inputs and 0 for negative inputs, so for some weights the gradient of the cost function is exactly 0; we say such units are 'dead' because they can never be updated again.
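A minimal sketch of the swap in PyTorch (0.01 is the library's default negative slope):

```python
import torch.nn as nn

# Leaky ReLU keeps a small gradient (negative_slope) for negative inputs,
# so units cannot get permanently stuck at zero the way dead ReLUs do.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.LeakyReLU(negative_slope=0.01),
    nn.Linear(64, 10),
)
```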
How to judge the current state of the network from the train loss and test loss?
1. Train loss continuously decreases, test loss continuously decreases, indicating the network is still learning;
2. Train loss continuously decreases, test loss stabilizes, indicating overfitting;
3. Train loss stabilizes while test loss continuously decreases: there is almost certainly something wrong with the data or the train/test split;
4. Train loss stabilizes, test loss stabilizes, indicating learning has hit a bottleneck; reduce learning rate or batch size;
5. Train loss continuously increases, test loss continuously increases, indicating improper network structure design, improper training hyperparameter settings, or issues with the cleaned dataset.
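Telling these cases apart requires tracking both curves. A minimal sketch of computing them per epoch, assuming PyTorch; the model, loss, optimizer, and loaders are your own objects, passed in as arguments.

```python
import torch

def run_epoch(model, loader, criterion, optimizer=None):
    """Average loss over one pass; pass optimizer=None for evaluation."""
    model.train(optimizer is not None)
    total, n = 0.0, 0
    with torch.set_grad_enabled(optimizer is not None):
        for x, y in loader:
            loss = criterion(model(x), y)
            if optimizer is not None:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item() * len(x)
            n += len(x)
    return total / n

# for epoch in range(num_epochs):
#     train_loss = run_epoch(model, train_loader, criterion, optimizer)
#     val_loss = run_epoch(model, val_loader, criterion)   # compare the two trends
```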