Deep Learning Parameter Tuning Techniques Summary


Editor: Amusi | Source: Zhihu, originally from: cver

https://www.zhihu.com/question/25097993

This article is for academic sharing only. If there is infringement, the article will be deleted.

What Are the Techniques for Tuning Parameters in Deep Learning?

The effectiveness of deep learning largely depends on how well the parameters are tuned. So how can we quickly and effectively adjust the parameters? Seeking answers.

Author: Jarvix https://www.zhihu.com/question/25097993/answer/153674495

Let me just say one thing: initialization.

A painful lesson: I once used normal initialization for a CNN's parameters and got an accuracy of only about 70%. Simply switching to Xavier initialization brought the accuracy up to 98%.

Another instance was initializing word embeddings. At first I used TensorFlow's default initializer (glorot_uniform_initializer, the "mindlessly use Xavier" default). Training was slow and the results were poor. Switching to uniform initialization dramatically improved both training speed and results.

Therefore, initialization is like black magic: get it right and you may barely need to tune the other hyperparameters; get it wrong and the results look as if the model has bugs, which is painful to watch.
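The anecdote above is framework-specific (TensorFlow's glorot_uniform_initializer); as a rough sketch of what "just changing the initializer" looks like, here is the same idea in PyTorch (the layer shape and std value are my own illustration, not from the original answer):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)

# Naive normal initialization: an ill-chosen std can slow or stall training.
nn.init.normal_(conv.weight, mean=0.0, std=0.1)

# Xavier (Glorot) initialization scales the variance by fan-in/fan-out,
# which usually keeps activations and gradients in a healthy range.
nn.init.xavier_uniform_(conv.weight)
nn.init.zeros_(conv.bias)
```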

Author: BBuf https://www.zhihu.com/question/25097993/answer/934100939

I have been tuning CNNs for nearly a year (from January 2019 to today), and I find this topic quite interesting. Here are my summaries.

Engineering Practices

  • 3×3 convolution is the mainstream component of CNNs. When I design networks for classification and regression tasks, the convolution kernels are generally 3×3. If you want to know why, ask VGG16: stacking two 3×3 convolution kernels gives the same receptive field as one 5×5 kernel with fewer parameters, so it is highly recommended.

  • You can appropriately use 1×N and N×1 convolutions. Why does this matter? Because 1×N convolutions reduce computational load, and they also emphasize the receptive field in a particular direction. For example, if you want to classify a rectangular target, you can pair 1×N kernels with N×1 kernels to give the long-edge direction a larger receptive field, which may improve generalization.

  • The ACNet structure. This work comes from ICCV 2019: on top of a 3×3 kernel, it adds 1×3 and 3×1 side-path convolution kernels, and during the inference phase the three kernels are fused into a single 3×3 kernel, which yields roughly a 1-point improvement on many classic CV tasks. You can check out this article for more interpretation: 3*3 convolution + 1*3 convolution + 3*1 convolution = free accuracy improvement.

  • Convolution kernel weight initialization. For weights I generally use Xavier initialization; you can also try He (Kaiming) initialization. Biases are initialized to 0.

  • Batch Normalization. This is a technique I use all the time; it can significantly speed up convergence, so it is recommended to include BN when building your own network. If a fully connected layer is already followed by BN, there is no need to add Dropout. (A minimal building-block sketch combining these points follows at the end of this list.)

  • Do not blindly remove the FPN structure in object detection. When tuning a detector such as YOLOv3 on your own data, do not blindly cut off the FPN branches. Even if your analysis suggests that a certain branch's anchors are unlikely to match your targets, removing that branch directly may lead to missed detections.

  • Optimizer selection. I generally use SGD with momentum; if optimization gets stuck, you can try Adam.

  • Activation function. Start with ReLU for a first version; if you want to squeeze out more accuracy, try replacing ReLU with PReLU. I prefer to just use ReLU.

  • batch_size: its impact differs across task types. See the article "How batch_size affects model performance" from the AI developer public account.

  • Initial learning rate. I generally set it to 0.01. I personally think the initial learning rate interacts with the learning-rate decay strategy, but it should be neither too large nor too small; 0.01 and 0.1 are common choices. I usually use the multistep decay strategy, and the step_size setting depends on your max_iter.

  • Zero-centering of the data. I first encountered this term in the cs231n videos. There are two steps: first subtract the mean, then divide by the standard deviation, so the input ends up with zero mean and unit variance. Subtracting the mean is the most common step; whether dividing by the standard deviation helps may need to be tested on your data.

  • Residual structures and dense connections. The residual structure of ResNet and the dense connections of DenseNet are rarely usable in their complete form in engineering, but you can replace certain modules of your network with residual or densely connected blocks. When doing so, reduce their complexity appropriately, for example by halving the number of channels or keeping only half of the dense connections. Run ablation experiments to verify that the change actually improves accuracy.

  • About loss. A good loss function generally improves the model's generalization, but using a new loss is rarely as simple as dropping it in; you need to understand the math behind it to apply it correctly, for example how to apply Focal Loss to YOLOv3 to improve the mAP. See this post: https://www.zhihu.com/question/293369755.

  • Find reliable evaluation metrics when tuning. For each parameter adjustment, record the model's evaluation metrics such as accuracy, mAP, mIoU, and so on. It also helps to encode the adjusted parameters and the test-set accuracy into a string used to rename the model file, which makes later review much faster.

  • For networks with a backbone, such as VGG16-SSD, prefer fine-tuning a pretrained model. Training from scratch is not only time-consuming and labor-intensive but may also fail to converge.

  • In segmentation experiments, I found that upsampling plus 1×1 convolution gives smoother outputs than transposed convolution, and the mIoU difference is not significant, so I believe both can be used.

  • Some anchor-based object detection algorithms improve accuracy by generating a large number of boxes. The AP indeed increases, but so do the false positives, and those false positives are not regressed well, so they cannot be filtered out in the NMS phase. In engineering, reducing false positives is often more important than improving AP; Gaussian YOLOv3, for example, reduces false positives by 40% compared to YOLOv3. I am not very familiar with anchor-free algorithms, so I won't elaborate on them.
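Pulling together the notes above on 3×3 kernels, weight initialization, and BN, here is a minimal building-block sketch (PyTorch by my own choice; none of the module names or values are prescribed by the original answer):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """3x3 conv -> BN -> ReLU, the kind of block recommended above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # bias=False because BN provides its own learnable shift
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Xavier init for weights, as suggested; He init is a common alternative for ReLU nets
        nn.init.xavier_uniform_(self.conv.weight)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

x = torch.randn(1, 3, 32, 32)
print(ConvBlock(3, 64)(x).shape)  # torch.Size([1, 64, 32, 32])
```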

Competition Practices

  • Feature extraction. VGG16, VGG19, ResNet50, and Xception are very useful feature extraction models. It is recommended to use pretrained classic models to extract feature vectors from the dataset and store them locally, which is convenient to reuse and significantly reduces GPU memory consumption (a small feature-extraction sketch follows this list).

  • Ensemble:

    • Combine the feature vectors extracted by different classic networks. Suppose the feature vector extracted by VGG16 has shape [N, c1], the one from ResNet50 has shape [N, c2], and the one from Xception has shape [N, c3]. We can use three coefficients a, b, c to combine them into a feature of shape [N, a*c1+b*c2+c*c3], where a, b, c control how much of each model's features we use. If it is a classification or regression competition, we can attach a feature-processing network afterward. Different values of a, b, c yield different features, which can then be combined through methods such as voting or soft voting. Generally the results won't be too bad.

    • You can train models using different initialization methods, and then perform ensemble.

    • You can train different models with different hyperparameters (such as learning rate, batch_size, optimizer), and then perform an ensemble.
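As a concrete illustration of extracting and caching backbone features for later ensembling (the torchvision calls, shapes, and file name below are my own assumptions; newer torchvision versions use a `weights=` argument instead of `pretrained=True`):

```python
import numpy as np
import torch
import torchvision.models as models

# Pretrained backbone used as a fixed feature extractor
backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()   # drop the classification head, keep 2048-d features
backbone.eval()

@torch.no_grad()
def extract(images):                # images: [N, 3, 224, 224], already normalized
    return backbone(images).cpu().numpy()

feats = extract(torch.randn(8, 3, 224, 224))   # placeholder batch
np.save("resnet50_feats.npy", feats)           # cache locally, reuse without a GPU

# Later, load the cached features from several backbones and concatenate
# (and/or weight or subset per model, as described above) before the final classifier:
# combined = np.concatenate([f_vgg, f_resnet, f_xception], axis=1)
```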

Since I have only participated in a little bit of entry-level competitions, the methods described above have yielded quite good results, so I’m sharing them here. The methods are indeed quite straightforward, so feel free to laugh it off. After thinking about it, I seem to have nothing else to share aside from these experiences. If I think of anything else later, I’ll add it.

Author: Captain Jack https://www.zhihu.com/question/25097993/answer/127472322 I am similar to @Yang Jun, also a latecomer. My current work mainly involves using CNN for CV tasks. I’ve been tuning parameters for about two years. My answer may focus more on industrial applications, with the technology limited to CNN.

First, let me state my viewpoint: tuning parameters is trial-and-error. There is no shortcut. The only difference is that some people try blindly, while others think before trying. Quick attempts, quick corrections are the keys to tuning parameters.

I read Yang Jun’s answer. Regarding this answer, the comments below by @Ji Qiujia are correct. The main content of this answer focuses more on understanding the network rather than training the network.

I want to emphasize again that Yang Jun’s answer is more about understanding the network rather than training the network. Yes, that’s right. After reading all the content in the answer, you still won’t know how to actually train a network, especially under complex tasks (because simple tasks don’t require it; the results will be good directly unless you want to climb the leaderboard for simple tasks).

  • First, let’s talk about visualization:

In my personal understanding, visualization is more about helping humans observe the network in a familiar way. You cannot adjust parameters while observing the network. You can only visualize after training is complete (or after the accuracy reaches a certain stage). Before that, if the network has not learned good parameters, visualizing it is meaningless. When the network achieves a decent accuracy, looking at it is just a formality. Similarly, if your network is a complete mess, visualizing it won’t help; the only thing you can see is the intermediate results being chaotic or entirely black or white. At this point, you can just check the final accuracy to know that the network is beyond saving.

  • Regarding weight visualization [Visualize Layer Weights] (whether it still makes sense to demand smoothness nowadays is discussed below):

Likewise, if you see weights that are not smooth, you know the network has not trained well, but why? Bad data? Missing preprocessing? A problem with the network structure? A learning rate that is too large or too small? Or perhaps just a missing LRN layer (I ran into this before; adding an LRN gave smooth weights, which is related to preprocessing)?

Smoothness is worth checking, just to have a sense of it. However, specifically how to tune parameters is still out of reach. First, you cannot tell the network to learn a boundary detection function at this layer. Second, different tasks will have different weights (although there is significant commonality in the underlying features). Why do you think you can guide a machine that can process images faster than you?

Now, should we demand smoothness at all? The current trend encourages small filters, 3×3 in size, with more layers (which improves non-linearity). In other words, a 3×3 filter has only 9 weights; how are you supposed to judge smoothness from that? Of course, if you use larger filters, say 5×5 and up, you might see smooth results if you are lucky.

Let’s discuss another extreme: a network runs perfectly (as long as it meets application requirements, it’s perfect), but upon inspection, the weights are not smooth. What do you plan to do? Indeed, networks with non-smooth weights can still achieve excellent results (I have become accustomed to this situation).

  • So is visualizing the network unimportant?

It is very important, but not in the training aspect, rather it helps understand the principles of the network. After understanding the principles of the network, you can have a sense of it when designing structures (just a sense). If the network has issues or you are dissatisfied in certain situations, you will have a better intuition for adjustments (yes, just intuition; although some adjustments logically should work based on the principles of the network, they might not work, and you can’t argue with the machine).

  • So how do you train a good network?

This is a good link that explains how to continuously trial-and-error from scratch: Using convolutional neural nets to detect facial keypoints tutorial.

======================================================== My own experience includes the following:

Basic Principle: Quick Trial-and-Error

Some Major Considerations:

1. At first, start with a small-scale dataset, scale the model up, and as long as it doesn’t exceed GPU memory, if you can use 256 filters, don’t use 128. Go straight for overfitting. Yes, train the network to overfit, and you can skip validation on the test set.

Why?

  • You need to verify that your training script is correct. With a small dataset this step runs quickly, and the script stays identical to the later large-scale training (apart from running fewer loops).

  • If such a large network cannot get results on a small dataset, you need to reflect: is the model's input/output correct? Should you check your code (never doubt the library unless you have modified its source)? Is the problem definition correct? Is your understanding of the application scenario flawed? Never doubt the capabilities of NN, never doubt the capabilities of NN, never doubt the capabilities of NN. The probability that an NN fails to fit the kind of problems we parameter tuners face is incredibly small.

  • You can choose to skip this step, but if you spend two days preparing data and only then discover issues that force you to regenerate it, your week is wasted.
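A minimal sketch of this overfit-a-tiny-subset sanity check (PyTorch by my own choice; the dataset, model, and loop counts are placeholders you would swap for your own):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def overfit_sanity_check(model, small_loader, epochs=200, lr=1e-2):
    """Train on a tiny subset; if the loss does not approach ~0, suspect the pipeline, not the NN."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        for x, y in small_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
        if epoch % 50 == 0:
            print(f"epoch {epoch}: loss = {loss.item():.4f}")

# usage sketch: 64 random samples, deliberately over-sized model
tiny = TensorDataset(torch.randn(64, 128), torch.randint(0, 10, (64,)))
big_model = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))
overfit_sanity_check(big_model, DataLoader(tiny, batch_size=16))
```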

2. The design of the loss function must be reasonable.

  • Generally, classification uses softmax (cross-entropy) and regression uses L2 loss. However, pay attention to the numeric range of the loss (mainly for regression): if a target label is 10000 and the model outputs 0, work out how large that loss is, and that is only the univariate case; the result is usually nan. Therefore, not only should the input be normalized, the output should be as well.

  • In multi-task scenarios, the individual losses should be kept on the same order of magnitude, or at least brought to the same order of magnitude eventually; initially you can focus on one task's loss.
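A small sketch of the two points above: normalize regression targets and balance multi-task losses (the weighting value is an illustrative assumption, not a recommendation from the original answer):

```python
import torch
import torch.nn as nn

# Normalize regression targets so the L2 loss stays in a sane range
y = torch.tensor([10000.0, 9500.0, 10500.0])
y_mean, y_std = y.mean(), y.std()
y_norm = (y - y_mean) / y_std           # train against y_norm, de-normalize at inference

# Multi-task: keep the losses on the same order of magnitude with a fixed weight
cls_loss = nn.CrossEntropyLoss()(torch.randn(4, 10), torch.randint(0, 10, (4,)))
reg_loss = nn.MSELoss()(torch.randn(4), torch.randn(4))
total = cls_loss + 0.1 * reg_loss       # 0.1 chosen so both terms are comparable
```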

3. Observing loss is more valuable than observing accuracy.

Although accuracy is a measurement metric, it’s still essential to pay attention to loss during training. You may find that in some cases, accuracy can fluctuate suddenly, remaining at 0 for a long time, then suddenly jumping to 1. If you prematurely stop training because of this, only heaven can feel sorry for you. Loss does not exhibit such bizarre behaviors, as the optimization target is loss. Give NN some time; you must leave enough room for NN to learn based on the task. You cannot say that if there is no improvement in the early stages, you should abandon it. In some cases, there may be no visible improvement in the early stages, and then stable learning begins.

4. Ensure that the classification network learns sufficiently.

The classification network learns the boundaries between categories. You will find that the network gradually moves from ambiguous categories to clear categories. How to find out? Look at the probability distribution of the Softmax output. For binary classification, you will see that the network’s predictions are initially around 0.5, which is quite vague. As learning progresses, the network’s predictions will slowly move toward the extremes of 0 and 1. Therefore, if your network’s prediction distribution is centered, continue learning.
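The probability-distribution check described here is easy to script; a tiny self-contained sketch (the dummy model is just a placeholder):

```python
import torch

def prediction_sharpness(model, x):
    """Mean top-class softmax probability: ~1/num_classes early on, drifting toward 1.0 as boundaries form."""
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)
    return probs.max(dim=1).values.mean().item()

# usage sketch with a dummy binary classifier
model = torch.nn.Linear(8, 2)
print(prediction_sharpness(model, torch.randn(16, 8)))   # roughly 0.5 for an untrained model
```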

5. Set a reasonable learning rate.

  • Too large: the loss explodes or becomes nan.

  • Too small: the loss decreases too slowly (note that needing to reduce the LR also shows up this way; here it helps to visualize the network's intermediate results rather than the weights, which is a different kind of visualization. With an LR that is too small, the intermediate results can look wavy or noisy because the filters learn too slowly; it is very obvious once you try it).

  • Needs further reduction: the loss has already dropped significantly under the current LR, but has not decreased for a long time.

  • For a more complex task, monitor and adjust the LR by hand at first; once you are familiar with the learning characteristics of the task and network, you can leave it to run on its own.

  • If you cannot design the loss as described above, it can easily explode at the start, so begin with a small LR to guarantee it does not explode; once the loss starts to decrease you can slowly raise the LR, although this is quite tedious.

  • Keep the LR slightly below the largest value that still works, to avoid killing ReLU neurons. Of course, I am impatient and tend to set it on the large side.
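One way to automate the "reduce the LR when the loss stalls" rule above is a plateau scheduler; a minimal PyTorch sketch (the factor and patience values are just assumptions):

```python
import torch

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Cut the LR by 10x when the monitored loss has not improved for 5 evaluations
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min", factor=0.1, patience=5)

for epoch in range(30):
    val_loss = torch.rand(1).item()       # placeholder for the real validation loss
    sched.step(val_loss)                  # the scheduler decides whether to reduce the LR
```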

6. Compare training and validation set losses. This is the basis for judging overfitting, determining whether training is sufficient, and whether early stopping is needed; these are standard principles, so I won’t elaborate further.

7. Be aware of the size of the receptive field. In CV tasks, the context window is very important. Therefore, you need to have a clear understanding of the size of the receptive field of your model. This significantly impacts the effectiveness, especially when using FCN, as large targets require a large receptive field. Unlike fully connected networks, where there is at least an FC to catch the global information.
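For reference, the receptive field of a stack of conv/pool layers can be computed with the standard recurrence r_out = r_in + (k - 1) * j_in, j_out = j_in * s; a small helper of my own as a sketch:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field of the final output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# three stacked 3x3 convs (stride 1) -> RF 7, same as a single 7x7 conv
print(receptive_field([(3, 1), (3, 1), (3, 1)]))                       # 7
# VGG-style stack: conv3, conv3, pool2, conv3, conv3
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]))       # 14
```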

Brief Considerations:

  1. Preprocessing: -mean/std zero-center is sufficient; PCA, whitening, etc., are unnecessary. In my personal view, CNNs can learn encoders, and whether or not to use PCA is not crucial; at worst, the network will learn it itself.

  2. Shuffle, shuffle, shuffle.

  3. Understanding the principles of the network is crucial; for example, you should understand how a conv layer in a CNN can implement edge detection the way the Sobel operator does.

  4. Dropout, Dropout, Dropout (not only can it prevent overfitting, but it also serves as the lowest-cost ensemble; of course, training will be slower than without Dropout, and you should correspondingly increase the network parameters, yes, this will slow it down further).

  5. CNNs are more suitable for training to answer yes/no questions; if the task is complex, consider training a model on a classification task first and then finetuning.

  6. Mindlessly use ReLU (in the CV field).

  7. Mindlessly use 3×3.

  8. Mindlessly use Xavier.

  9. LRN-style layers can generally be omitted. If things don't work, you can try adding them back.

  10. Filter counts should be a power of 2.

  11. Multi-scale image input (or using results from multi-scale within the network) can significantly improve performance.

  12. Ensure that the number of filters in the first layer is not too small; otherwise, it will not learn effectively (lower-level features are crucial).

  13. For choosing SGD, Adam, etc., it depends on personal preference. Generally, they are not decisive for the network. I usually mindlessly use SGD + momentum.

  14. I have never used batch normalization, although I know it’s beneficial; I just haven’t used it out of laziness. Therefore, I encourage the use of batch normalization.

  15. Don’t fully trust what’s in the papers. If you think a structure might be effective, go ahead and try it.

  16. You have a 95% probability of not using models with more than 40 layers.

  17. Shortcut connections are effective.

  18. Brutal parameter tuning is the most advisable; after all, your life is the most important. After tuning this model, you might discard it in a couple of days.

  19. Machines, machines, machines.

  20. Carefully review Google’s inception paper.

  21. Understand some traditional methods. I have used a 1×14 handwritten filter in my programs, and once you see the 1×7, 7×1 in inception, you will smile knowingly.

Author: Random Walker Fool https://www.zhihu.com/question/25097993/answer/951804080

1. First, before tuning parameters, sort out your mindset. Don't misunderstand me; I mean let yourself go a little crazy, because sometimes this really is a mysterious art: you can tune for a long time with nothing to show for it, and then changing a single initial value pushes you above 95% within a minute. Yes, you actually did nothing, but most of the time this thing behaves very much like a person: its "birth", that is, the starting point, is crucial.

2. As mentioned above, good starting points + suitable LR + good optimization methods can basically solve most problems. If it still doesn’t work, consider changing the loss function. Other tricks are often too many and quite superficial.

3. Always remember to save your results in real-time and familiarize yourself with using various seeds; develop good habits. Sometimes you might think a not-so-good result is actually the best you can achieve, and you don’t want to find out later that you can’t retrieve it because you didn’t save it, right? The saying goes, “Today you look down on me, tomorrow I’ll make you look up to me.” Don’t ask me how I know this.

4. When beginners first start tuning parameters, they often lack experience, so they must be humble! What does being humble mean? It means that when you first start tuning, don’t think too far ahead; within a manageable range, try to use as many filters as possible and keep the data minimal, aiming for overfitting directly! This is called small-step trial-and-error, quick iteration; internet companies do it this way. Although overfitting isn’t ideal, there are many tricks for it, and the problem of underfitting is far more terrifying. After all, if you can’t even train the results, what’s the point of thinking far?

5. There are many mindless configurations you can try, such as 3×3 convolution kernels, ReLU activation functions, adding shuffle, data augmentation, BN, Dropout, etc. Dropout can be increased from 0.5 and above; for optimizers, you can use Adam or SGD + 0.8/0.9 momentum. Most of the time, these experiences are more valuable than the intricate tricks you painstakingly discover, but they are not absolute.

6. Always remember to print some results in real-time, such as training loss, training accuracy, validation accuracy; if you can graph them, do so. Watching the graphs can reveal many issues, especially regarding learning rates and overfitting. Additionally, as a certain expert pointed out earlier, loss is much more useful than accuracy when analyzing graphs because accuracy can fluctuate dramatically; it may significantly change in the next step, while loss tends to show a relatively stable downward trend.

7. In quiet moments or when you’re not busy, remember to think more about the principles. Study others’ excellent results, especially mature architectures and some state-of-the-art results; you can also revisit your dataset. When you have time, engage in some visualization; it not only hones your skills but also helps in discovery. Besides tuning parameters, you might directly use certain layers from others in your applications, which can save a lot of time.

8. Finally, nothing is absolute. Many theoretical articles are just for reading. Everything is conditional; without those conditions, it becomes meaningless, and sometimes that condition might just be luck. So if you can’t reproduce it, don’t take it to heart; let it go if necessary. Tuning parameters is tough, but don’t forget to adjust your mindset at the same time: do good deeds, record timely, avoid boasting, and spend more time on Zhihu.

Author: JD BaiTiao https://www.zhihu.com/question/25097993/answer/651617880

Many friends who just start with deep learning feel that tuning parameters in deep learning is like a mysterious art; sometimes, when the parameters are tuned well, the model converges quickly, and when the parameters are not tuned well, the loss value may become NaN after several iterations.

I remember when I first started studying deep learning, I worked on two small examples. One was using TensorFlow to build a very simple network for MNIST handwritten digit recognition with only one input layer and one softmax output layer. The first time I initialized the weight matrix W and bias b with a normal distribution, I iterated for 20 epochs, and by the end of the first epoch, the predicted accuracy was only about 10% (similar to random guessing, since MNIST is a ten-class problem). By the end of twenty epochs, the accuracy only reached about 60%.

Then I simply changed the weight matrix W initialization to all zeros while keeping the other parameters unchanged, and the result was that after the first epoch, the predicted accuracy exceeded 85%, and after 20 epochs, it reached 92%. The other example was a regression prediction problem. At that time, I used the SGD optimizer, and initially set the learning rate to 0.1. The model could train normally, but the training speed was a bit slow. I tried adjusting the learning rate to 0.3, hoping to speed up training, but after a few iterations, the loss became NaN. Since then, I have deeply felt the importance of parameter tuning in training deep learning models.

Now, let me share some insights on tuning techniques in deep learning. Although reading this won’t turn you into a tuning expert, it can at least provide some ideas on tuning parameters.

1. Activation function selection:

Common activation functions include ReLU, leaky ReLU, sigmoid, tanh, etc. For the output layer, use softmax for multi-class tasks, sigmoid for binary classification tasks, and linear output for regression tasks. For the hidden layers, prefer using the ReLU activation function (ReLU effectively solves the gradient vanishing problem associated with sigmoid and tanh, and experiments show it converges faster than other activation functions). Additionally, when constructing recurrent neural networks (RNNs), prioritize using the tanh activation function.
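As a compact illustration of these output-layer choices (PyTorch by my own choice; layer sizes are arbitrary, and the loss pairings in the comments follow standard PyTorch practice rather than the original text):

```python
import torch.nn as nn

hidden = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # ReLU for hidden layers

multi_class_head = nn.Linear(64, 10)   # pair with CrossEntropyLoss, which applies softmax internally
binary_head      = nn.Linear(64, 1)    # pair with BCEWithLogitsLoss, which applies sigmoid internally
regression_head  = nn.Linear(64, 1)    # linear output, e.g. with MSELoss

rnn = nn.RNN(input_size=32, hidden_size=64, nonlinearity="tanh")  # tanh is the usual RNN choice
```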

2. Learning rate setting:

Typically, the learning rate starts from 0.1 or 0.01. A learning rate that is too large can lead to instability in training, even resulting in NaN, while a learning rate that is too small can cause the loss to decrease too slowly. The learning rate should generally decay during training. Set decay factors of 0.1, 0.3, or 0.5; the timing of decay can be when the validation accuracy stops increasing or automatically after a fixed number of training cycles.
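A minimal sketch of such a decay schedule, assuming a step decay by a factor of 0.1 at fixed epochs (the milestone epochs are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Multiply the LR by 0.1 at epochs 30 and 60
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... train one epoch here ...
    sched.step()
```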

3. Preventing overfitting:

Common methods to prevent overfitting include using L1 regularization, L2 regularization, dropout, early stopping, and data augmentation. If the model performs well on the training set but poorly on the test set, consider increasing the penalty strength of L1 or L2 regularization (L2 regularization is generally set to 1.0, rarely exceeding 10), or increasing the random dropout probability (a common choice for dropout is 0.5); or when the test set performance declines while training continues, use early stopping. Certainly, the most effective method is to increase the size of the training set. If acquiring new data is challenging, you can use data augmentation methods, such as cropping, flipping, and translating in CV tasks, which often improve the final model’s test accuracy.
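A short sketch combining several of these measures: L2 via the optimizer's weight decay, Dropout, and simple augmentation (all specific values are illustrative assumptions; torchvision is assumed to be available):

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Data augmentation for CV tasks: random crop (with padding, giving small translations) and flip
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(256, 10))

# L2 regularization is usually applied through the optimizer's weight_decay term
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
```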

4. Optimizer selection:

If the data is sparse, use adaptive methods like Adagrad, Adadelta, RMSprop, or Adam. Overall, Adam is the best choice. While SGD can reach a minimum, it takes longer than other algorithms and may get stuck at saddle points. To achieve faster convergence or train deeper and more complex neural networks, use an adaptive algorithm.
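For reference, constructing the two common choices side by side (the learning rates are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)

# SGD with momentum: often good final accuracy, but may need more tuning and time
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam: adaptive per-parameter learning rates, usually converges faster
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```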

5. Residual blocks and BN layers:

If you wish to train a deeper and more complex network, residual blocks are definitely an important component, allowing your network to train deeper.

BN layers accelerate training speed, effectively prevent gradient vanishing and exploding, and help prevent overfitting, so it’s best to include this component when constructing networks.

6. Automated tuning methods:

(1) Grid Search: This method involves looping through all candidate parameter selections and trying every possibility. The best-performing parameters become the final result, similar to finding the maximum value in an array. The downside is that it is time-consuming, especially for neural networks, where many parameter combinations are often infeasible.

(2) Random Search: In practice, Random Search is generally more effective than Grid Search. The typical approach is to first use Grid Search to obtain all candidate parameters, then randomly select from them for training. Random Search often combines with a coarse-to-fine tuning strategy, where more detailed searches are conducted around parameters that show better performance.

(3) Bayesian Optimization: This method considers the experimental results corresponding to different parameters, leading to more efficient time usage. Bayesian tuning requires fewer iterations than Grid Search, is faster, and remains robust for non-convex problems.
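A toy random-search loop with log-uniform sampling of the learning rate, which pairs naturally with the coarse-to-fine strategy mentioned above (train_and_eval is a placeholder for your own training routine; the ranges are assumptions):

```python
import math
import random

def sample_config():
    # Log-uniform sampling for the learning rate, categorical choice for batch size
    lr = 10 ** random.uniform(-4, -1)          # 1e-4 ... 1e-1
    batch_size = random.choice([32, 64, 128, 256])
    return {"lr": lr, "batch_size": batch_size}

def train_and_eval(cfg):                       # placeholder: should return validation accuracy
    return random.random()

best_cfg, best_acc = None, -math.inf
for _ in range(20):                            # 20 random trials
    cfg = sample_config()
    acc = train_and_eval(cfg)
    if acc > best_acc:
        best_cfg, best_acc = cfg, acc
print(best_cfg, best_acc)
# Coarse-to-fine: narrow the ranges around best_cfg and repeat the search.
```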

7. Parameter random initialization and data preprocessing:

Parameter initialization is crucial; it determines the training speed of the model and whether it can avoid local minima. For ReLU activation function initialization, He normal is recommended; for tanh, Glorot normal is recommended, which is also known as Xavier normal initialization. The common data preprocessing method is data normalization.
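A minimal sketch of these initialization and normalization recommendations (PyTorch init helpers; the statistics are computed from a random placeholder batch):

```python
import torch
import torch.nn as nn

relu_layer = nn.Linear(256, 128)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")   # "He normal" for ReLU layers

tanh_layer = nn.Linear(256, 128)
nn.init.xavier_normal_(tanh_layer.weight)                         # Glorot/Xavier normal for tanh layers

# Data normalization: zero mean, unit variance per feature
x = torch.randn(32, 256) * 5 + 3
x = (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-8)
```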

Public Account: AI Snail Car
Stay Humble, Stay Disciplined, Stay Progressive

