Summary of Reasons for Neural Network Training Failures

This article analyzes why model training fails to converge or fails outright, from both the data and the model perspective. It summarizes four possible causes on the data side and nine potential issues on the model side, describes the symptoms each problem produces, and gives conventional remedies.

When facing a model that does not converge, first make sure the number of training iterations is sufficient. During training, the loss does not decrease monotonically and the accuracy does not increase monotonically; some fluctuation is normal, and as long as the overall trend is toward convergence, it is fine. If training has run long enough (generally thousands to tens of thousands of iterations, or several dozen epochs) and the model still does not converge, then consider the measures below. For further reading on the training process: A Comprehensive Guide to Deep Learning Modeling and Prediction Process (Python)

1. Data and Label Issues

1. No data preprocessing. Are the class labels accurate? Is the data clean?

2. No data normalization. Because different evaluation metrics often have different dimensions and units, the raw scales can distort the results of data analysis. To eliminate these dimensional differences, the data must be standardized so that indicators become comparable; after standardization, all indicators are on the same order of magnitude and suitable for joint evaluation. In addition, most parts of a neural network pipeline assume inputs and outputs distributed around zero, from weight initialization to activation functions to the optimization algorithms used for training. In practice: subtract the mean and divide by the standard deviation (see the sketch after this list).

3. The amount of information in the samples is too large, so the network cannot fit the whole sample space. A small sample size, by contrast, mainly causes overfitting. Check whether your training-set loss converges: if only the validation set fails to converge, the model is overfitting, and you should consider the usual anti-overfitting tricks, such as dropout, SGD, increasing the number of mini-batches, reducing the number of nodes in the fully connected layers, momentum, fine-tuning, etc.

4. Is the label setting correct?
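
As a concrete illustration of point 2, here is a minimal sketch of the subtract-mean, divide-by-standard-deviation step, assuming the data is held in NumPy-style arrays; `X_train` and `X_val` are placeholder names, not from the original article.

```python
# Minimal sketch of per-feature standardization; `X_train` and `X_val`
# are placeholder names for NumPy arrays of shape (num_samples, num_features).
def fit_standardizer(X_train, eps=1e-8):
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps  # eps guards against zero-variance features
    return mean, std

def standardize(X, mean, std):
    return (X - mean) / std

# Usage: compute the statistics once on the training set, then reuse them
# for validation and test data.
# mean, std = fit_standardizer(X_train)
# X_train_n = standardize(X_train, mean, std)
# X_val_n   = standardize(X_val, mean, std)
```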

2. Model Issues

1. Network settings are unreasonable.

If you are performing a very complex classification task but only using a shallow network, training may struggle to converge. Choose an appropriate network, or try deepening the current one. That said, deeper is not always better: start by building a 3-8 layer network, and once it performs well, consider experimenting with deeper networks to improve accuracy. Starting with a small network also means faster training and lets you vary individual parameters to observe their effect on the network, rather than simply stacking more layers.
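
A minimal sketch of such a small starter network, assuming PyTorch; the layer widths, input size, and `num_classes` are illustrative placeholders rather than recommendations from the article.

```python
# Minimal sketch of a small (3-layer) starter classifier, assuming PyTorch;
# layer widths, in_features, and num_classes are illustrative placeholders.
import torch.nn as nn

def build_small_net(in_features=784, num_classes=10):
    return nn.Sequential(
        nn.Linear(in_features, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, num_classes),  # raw logits, to pair with CrossEntropyLoss
    )
```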

2. The learning rate is inappropriate.

If it is too large, it will cause non-convergence; if too small, it will slow down convergence significantly.

When training a new network, you can start from 0.1; if the loss does not decrease, divide by 10 and try 0.01, which usually converges; if not, try 0.001. A learning rate that is too high causes oscillation, but it is also not advisable to set the learning rate too low at the beginning, especially in the early training phase, or the loss will barely decrease and training will appear not to converge.

My approach is to try reducing it gradually from 0.1, 0.08, 0.06, 0.05… until training behaves normally. Sometimes a learning rate that is too low cannot escape a local minimum; increasing the momentum is one remedy, and appropriately increasing the mini-batch size also helps reduce fluctuations.
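
A minimal sketch of this kind of manual learning-rate sweep, assuming PyTorch; `build_small_net` (see the earlier sketch) and `train_for_a_few_epochs` are placeholders for your own model constructor and a short training loop that returns, for example, the final average training loss.

```python
# Minimal sketch of a coarse manual learning-rate sweep, assuming PyTorch.
# `build_small_net` and `train_for_a_few_epochs` are placeholders.
import torch

candidate_lrs = [0.1, 0.08, 0.06, 0.05, 0.01, 0.001]
results = {}

for lr in candidate_lrs:
    model = build_small_net()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    results[lr] = train_for_a_few_epochs(model, optimizer)  # final avg loss

best_lr = min(results, key=results.get)
print("best learning rate among candidates:", best_lr)
```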

Setting the learning rate too high can cause the loss to explode (the loss suddenly becomes very large). This is a common situation for beginners: the network seems to be converging, then suddenly blows up. The most likely reason is that you used ReLU as the activation function while also using softmax, or another loss function containing exp, in the classification layer.

When a forward pass reaches the last layer, if some node is activated to an excessively large value (e.g., 100), then exp(100) = Inf overflows; after backpropagation all weights become NaN, and from then on they stay NaN, so the loss explodes. If the learning rate is set too high, the loss can explode in this unrecoverable way. In that case, pause training and inspect the weights of any layer; they are very likely all NaN. For this situation it is advisable to use a binary-search approach on the learning rate, between 0.1 and 0.0001; the optimal learning rate varies across models and tasks.
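
A minimal sketch of such a weight inspection, assuming a PyTorch model; `model` stands for whatever network you are training.

```python
# Minimal sketch of inspecting a PyTorch model for NaN/Inf weights,
# e.g. right after the loss suddenly explodes; `model` is a placeholder.
import torch

def report_bad_weights(model):
    for name, param in model.named_parameters():
        n_nan = torch.isnan(param).sum().item()
        n_inf = torch.isinf(param).sum().item()
        if n_nan or n_inf:
            print(f"{name}: {n_nan} NaN values, {n_inf} Inf values")

# Usage: call report_bad_weights(model) when the loss blows up; if every
# layer is full of NaNs, lower the learning rate and restart training.
```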

3. Incorrect number of hidden layer neurons.

In some cases, using too many or too few neurons makes the network difficult to train. Too few neurons lack the capacity to express the task, while too many slow down training and make it hard for the network to filter out noise.

As a starting point, the number of hidden-layer neurons can be set between 256 and 1024; then look at the numbers used by researchers on similar problems, and if those numbers differ significantly, consider the principles behind the difference. Before deciding on the number of hidden units, the key is to think about the minimum number of values that actually need to be expressed through this network, and then gradually increase that number.

If you are doing a regression task, consider using a number of neurons that is 2 to 3 times the number of input or output variables. In fact, compared with other factors, the number of hidden units usually has a relatively small impact on neural network performance; in many cases, increasing it merely slows down training.

4. Incorrect initialization of network parameters.

If the network weights are not initialized correctly, then the network will not be trainable.

Commonly used weight initialization methods include 'he', 'lecun', and 'xavier'. In practice these methods perform very well, and the network bias is usually initialized to 0; choose the initialization method that best suits your task.
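
A minimal sketch of applying these initializers, assuming PyTorch; which scheme fits best depends on the activations you use, and the choice of Kaiming ('he') here is only an example.

```python
# Minimal sketch of applying He (Kaiming) or Xavier initialization to a
# PyTorch model; the choice of scheme is illustrative.
import torch.nn as nn

def init_weights(module):
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")  # 'he'
        # Alternative: nn.init.xavier_uniform_(module.weight)        # 'xavier'
        if module.bias is not None:
            nn.init.zeros_(module.bias)  # biases are usually initialized to 0

# Usage: model.apply(init_weights)
```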

5. No regularization.

Regularization typically includes dropout, adding noise, etc. Even if the amount of data is large or you believe the network will not overfit, it is still worthwhile to apply regularization.

Dropout is usually configured with a keep probability of 0.75 or 0.9 (i.e., dropping 25% or 10% of units), and you can adjust this based on how likely overfitting is; if you are confident the network will not overfit, you can set the keep probability as high as 0.99. Regularization not only prevents overfitting, it can also speed up training and help handle anomalies in the data by preventing extreme weight configurations. Data augmentation also has a regularizing effect, but the best way to avoid overfitting is still to have a large amount of training data.
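
A minimal sketch of adding dropout to a small network, assuming PyTorch; note that `nn.Dropout` takes the drop rate, so the keep probabilities above have to be converted, and the layer sizes are placeholders.

```python
# Minimal sketch of dropout on the hidden layers, assuming PyTorch.
# nn.Dropout takes the DROP rate, so keep_prob=0.75 corresponds to p=0.25.
import torch.nn as nn

def build_regularized_net(in_features=784, num_classes=10, keep_prob=0.75):
    p_drop = 1.0 - keep_prob
    return nn.Sequential(
        nn.Linear(in_features, 256), nn.ReLU(), nn.Dropout(p=p_drop),
        nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p=p_drop),
        nn.Linear(128, num_classes),
    )

# Remember to call model.train() during training and model.eval() at
# inference time so dropout is switched off when evaluating.
```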

6. Batch size is too large.

Setting the batch size too large can reduce the accuracy of the network, because it decreases the stochasticity of gradient descent. Moreover, all else being equal, the larger the batch size, the more epochs are typically needed to reach the same accuracy.

We can try smaller batch sizes such as 16, 8, or even 1. A smaller batch size allows more weight updates per epoch, which has two benefits: first, it can help escape local minima; second, it tends to give better generalization.
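
A minimal sketch of configuring a smaller batch size, assuming PyTorch; `train_dataset` is a placeholder for your own `Dataset`.

```python
# Minimal sketch of trying a smaller batch size with a PyTorch DataLoader;
# `train_dataset` is a placeholder for your own torch Dataset.
from torch.utils.data import DataLoader

def make_loader(train_dataset, batch_size=16):
    # shuffle=True keeps the gradient estimates stochastic between epochs
    return DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Usage: start small (e.g. 16 or 8) and only increase the batch size if
# training becomes too slow or too noisy.
# loader = make_loader(train_dataset, batch_size=16)
```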

7. Gradient clipping masking an incorrect learning rate.

Many deep learning frameworks enable gradient clipping by default. Clipping handles the gradient-explosion problem and is very useful, but with it on by default it is also hard to find the optimal learning rate. If you have cleaned the data correctly, removed outliers, and set a proper learning rate, you may not need gradient clipping at all. If you still occasionally hit gradient explosion, you can enable it, but this usually indicates other problems with the data, and clipping is only a temporary fix.
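
A minimal sketch of explicit gradient-norm clipping in a training step, assuming PyTorch; `model`, `optimizer`, and `loss` come from your own training loop, and `max_norm=1.0` is only an illustrative value.

```python
# Minimal sketch of gradient-norm clipping in a PyTorch training step;
# model, optimizer, and loss are placeholders from your own loop.
import torch

def training_step(model, optimizer, loss, max_norm=1.0):
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm before the update to limit explosions.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
```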

8. Incorrect activation function used in the last layer.

Using the wrong activation function in the last layer can prevent the network from producing outputs in the expected range. The most common mistake is using ReLU in the last layer, which means the network can only output non-negative values.

If it is a regression task, most of the time there is no need for an activation function on the last layer, unless you know something specific about the expected output values. Think about what your data values actually represent and what their range is after normalization; the most likely scenario is that the output is unbounded in both the positive and negative directions, in which case the last layer should not use an activation function. If your output values only make sense within a specific range, such as probabilities in the range 0 to 1, then the last layer can use the sigmoid function.
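
A minimal sketch of these two last-layer choices, assuming PyTorch; `hidden_dim` and the output sizes are placeholders.

```python
# Minimal sketch of last-layer choices for different output ranges,
# assuming PyTorch; hidden_dim and output sizes are placeholders.
import torch.nn as nn

hidden_dim = 128

# Unbounded regression target: no activation on the last layer.
regression_head = nn.Linear(hidden_dim, 1)

# Probability in [0, 1]: sigmoid on the last layer (or keep the raw logit
# and use nn.BCEWithLogitsLoss for better numerical stability).
probability_head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
```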

9. The network has bad gradients.

If you have trained for several epochs and the error does not change at all, it may be because you are using ReLU; try switching the activation function to leaky ReLU. The ReLU activation has a gradient of 1 for positive inputs and 0 for negative inputs, so the slope of the cost function with respect to some weights can be exactly 0; in that case we say those units are "dead", because they can no longer be updated.
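
A minimal sketch of swapping ReLU for leaky ReLU, assuming PyTorch; the layer sizes and slope are illustrative placeholders.

```python
# Minimal sketch of replacing ReLU with LeakyReLU to avoid "dead" units,
# assuming PyTorch; layer sizes are illustrative placeholders.
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(784, 256),
    nn.LeakyReLU(negative_slope=0.01),  # small nonzero gradient for negative inputs
    nn.Linear(256, 10),
)
```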

How can we judge the current state of the network from the train loss and test loss?

1. Train loss continuously decreases, and test loss continuously decreases, indicating the network is still learning;

2. Train loss continuously decreases, and test loss stabilizes, indicating the network is overfitting;

3. Train loss stabilizes while test loss continuously decreases, indicating that the dataset almost certainly has problems;

4. Train loss stabilizes, and test loss stabilizes, indicating learning has hit a bottleneck; reduce the learning rate or batch size;

5. Train loss continuously increases, and test loss continuously increases, indicating improper network structure design, incorrect training hyperparameter settings, or issues with the cleaned dataset.
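
To make these five patterns visible, here is a minimal sketch of logging train and test loss per epoch; `train_one_epoch`, `evaluate`, `model`, `optimizer`, the data loaders, and `num_epochs` are placeholders from your own setup.

```python
# Minimal sketch of logging train/test loss per epoch so the patterns
# above can be diagnosed; all names are placeholders from your own setup.
history = {"train_loss": [], "test_loss": []}

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, optimizer, train_loader)
    test_loss = evaluate(model, test_loader)
    history["train_loss"].append(train_loss)
    history["test_loss"].append(test_loss)
    print(f"epoch {epoch}: train_loss={train_loss:.4f} test_loss={test_loss:.4f}")
```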
