Deep Learning Hyperparameter Tuning Experience

This article is adapted from DataWhale

Training techniques matter a great deal in deep learning. It is a highly experimental field: the same network architecture trained in different ways can produce noticeably different results. Below I summarize my experience from the past year; comments and corrections are welcome.

Parameter Initialization

Choose one of the following methods; the results are generally similar. However, proper initialization is essential: otherwise it may slow down convergence, hurt the final result, or even cause a cascade of problems such as NaN values.

In the formulas below, n_in is the layer's input size, n_out is its output size, and n is either n_in or (n_in + n_out) * 0.5, depending on the variant. A runnable sketch follows the list.

Xavier initialization paper:

http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf

He initialization paper:

https://arxiv.org/abs/1502.01852

  • Uniform distribution initialization: w = np.random.uniform(low=-scale, high=scale, size=[n_in,n_out])

    • Xavier initialization, suitable for common activation functions (tanh, sigmoid): scale = np.sqrt(3/n)

    • He initialization, suitable for ReLU: scale = np.sqrt(6/n)

  • Normal distribution initialization: w = np.random.randn(n_in,n_out) * stdev # stdev is the standard deviation of the Gaussian distribution, mean set to 0

    • Xavier initialization, suitable for common activation functions (tanh, sigmoid): stdev = np.sqrt(1/n)

    • He initialization, suitable for ReLU: stdev = np.sqrt(2/n)

  • SVD initialization: has good effects on RNNs. Reference paper: https://arxiv.org/abs/1312.6120
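
A minimal NumPy sketch of the recipes above (the layer sizes are assumptions for illustration):

    import numpy as np

    n_in, n_out = 256, 128                      # example layer sizes (assumed)
    n_xavier = (n_in + n_out) * 0.5             # Xavier uses the average of fan-in and fan-out
    n_he = n_in                                 # He uses the fan-in

    # Uniform variants
    w_xavier_u = np.random.uniform(-np.sqrt(3 / n_xavier), np.sqrt(3 / n_xavier), size=[n_in, n_out])
    w_he_u = np.random.uniform(-np.sqrt(6 / n_he), np.sqrt(6 / n_he), size=[n_in, n_out])

    # Normal (Gaussian) variants, mean 0
    w_xavier_n = np.random.randn(n_in, n_out) * np.sqrt(1 / n_xavier)
    w_he_n = np.random.randn(n_in, n_out) * np.sqrt(2 / n_he)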

Data Preprocessing Methods

  • Zero-center and normalize, this is quite common: first X -= np.mean(X, axis=0) (zero-center), then X /= np.std(X, axis=0) (normalize). A runnable sketch follows this list.

  • PCA whitening, this is used less frequently.
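
A minimal NumPy sketch of both steps, assuming X is an (n_samples, n_features) array; the epsilon terms are assumptions to guard against division by zero:

    import numpy as np

    X = np.random.randn(1000, 20)             # placeholder data (samples x features)

    # Zero-center and normalize each feature
    X = X - np.mean(X, axis=0)
    X = X / (np.std(X, axis=0) + 1e-8)

    # PCA whitening (used less frequently)
    cov = np.dot(X.T, X) / X.shape[0]         # covariance of the centered data
    U, S, _ = np.linalg.svd(cov)              # eigenvectors U, eigenvalues S
    X_rot = np.dot(X, U)                      # decorrelate the features
    X_white = X_rot / np.sqrt(S + 1e-5)       # scale each dimension to unit variance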

Training Techniques

  • Gradient normalization must be performed, which means dividing the computed gradient by the minibatch size

  • Gradient clipping: limit the maximum gradient norm. Compute value = sqrt(g1^2 + g2^2 + ...) over all gradient components; if value exceeds a threshold (commonly 5, 10, or 15), multiply the gradients by threshold / value so that the norm equals the threshold. A sketch combining this with gradient normalization follows this list.

  • Dropout is very effective at preventing overfitting on small datasets and is usually set to 0.5. On small data, dropout + SGD gave a significant improvement in most of my experiments, so it is highly recommended if feasible. The placement of dropout is quite important; for RNNs, it is recommended to place it on the input->RNN and RNN->output connections. For how to use dropout in RNNs, refer to this paper: http://arxiv.org/abs/1409.2329 (a model sketch follows this list).

  • Adam, Adadelta, etc.: on small data, in my experiments their results are not as good as SGD's. SGD converges more slowly, but the final result is generally better. If using SGD, start with a learning rate of 1.0 or 0.1, check the validation set after a while, and halve the learning rate whenever the cost stops decreasing; many papers do this and my own experiments also worked well (a sketch follows this list). Alternatively, you can train with the Ada family first and, once it has roughly converged, switch to SGD to continue training, which also brings an improvement. Adadelta is said to generally perform better on classification problems, while Adam performs better on generation problems.

  • Apart from gates and similar places where the output must be restricted to (0, 1), try not to use sigmoid; use activation functions such as tanh or ReLU instead. 1. The sigmoid function has a significant gradient only roughly in the range -4 to 4; outside this range the gradient approaches 0, which easily leads to vanishing gradients. 2. The output of sigmoid lies in (0, 1) and is always positive, so it is not zero-centered even when the input has zero mean.

  • The RNN hidden size and embedding size are generally tuned starting from around 128. The batch size is generally tuned starting from around 128; finding an appropriate batch size matters most, and bigger is not always better.

  • Initializing embeddings with Word2Vec vectors can effectively improve convergence speed and final results on small data.

  • Try to shuffle the data as much as possible.

  • Initializing the LSTM forget gate bias to 1.0 or a larger value can give better results, as suggested in this paper: http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf. In my experiments, setting it to 1.0 improved convergence speed; in practice, different tasks may require trying different values (see the RNN sketch after this list).

  • Batch Normalization is said to improve performance, but I have not tried it; I recommend it as a final means of boosting the model. Reference paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

  • If your model contains fully connected layers (MLP) whose input and output sizes are the same, consider replacing the MLP with a Highway Network. I tried it and saw a slight improvement; it is recommended as a final means of boosting the model. The principle is simple: a gate is added to control the information flow, y = T(x) * H(x) + (1 - T(x)) * x, where T(x) is a sigmoid gate. For details, refer to this paper: http://arxiv.org/abs/1505.00387

  • A tip from @Zhang Xinyu: train alternately, one epoch with regularization and one epoch without, repeatedly.
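
A minimal NumPy sketch of gradient normalization plus norm-based clipping as described above; the gradient shapes, batch size, and threshold of 5 are assumptions for illustration:

    import numpy as np

    def normalize_and_clip(grads, batch_size, threshold=5.0):
        """Divide summed gradients by the minibatch size, then clip the global norm."""
        grads = [g / batch_size for g in grads]                   # gradient normalization
        value = np.sqrt(sum(np.sum(g ** 2) for g in grads))       # global norm sqrt(g1^2 + g2^2 + ...)
        if value > threshold:
            scale = threshold / value                             # decay coefficient
            grads = [g * scale for g in grads]
        return grads

    # Example with dummy gradients
    grads = [np.random.randn(256, 128) * 10, np.random.randn(128) * 10]
    clipped = normalize_and_clip(grads, batch_size=128)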
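
A PyTorch sketch of the "halve the learning rate when the validation cost stops decreasing" schedule mentioned above; the tiny linear model, random data, and per-epoch check are assumptions for illustration (torch.optim.lr_scheduler.ReduceLROnPlateau implements a similar idea):

    import torch
    import torch.nn as nn

    model = nn.Linear(20, 2)                                  # placeholder model (assumed)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

    X_train, y_train = torch.randn(512, 20), torch.randint(0, 2, (512,))
    X_val, y_val = torch.randn(128, 20), torch.randint(0, 2, (128,))

    best_val = float("inf")
    for epoch in range(20):
        model.train()
        optimizer.zero_grad()
        criterion(model(X_train), y_train).backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(X_val), y_val).item()

        if val_loss >= best_val:                              # cost did not decrease on the validation set
            for group in optimizer.param_groups:
                group["lr"] *= 0.5                            # halve the learning rate
        best_val = min(best_val, val_loss)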
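
A PyTorch sketch combining the dropout placement (input->RNN and RNN->output) and the forget-gate bias initialization from the list above; the vocabulary size, 0.5 dropout rate, and single-layer LSTM classifier are assumptions for illustration:

    import torch
    import torch.nn as nn

    class RNNClassifier(nn.Module):
        def __init__(self, vocab_size=10000, embed_size=128, hidden_size=128, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_size)
            self.drop_in = nn.Dropout(0.5)                    # dropout between input and RNN
            self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
            self.drop_out = nn.Dropout(0.5)                   # dropout between RNN and output
            self.fc = nn.Linear(hidden_size, num_classes)

            # Set the LSTM forget-gate bias to 1.0
            # (PyTorch orders the gates as input, forget, cell, output).
            for name, param in self.lstm.named_parameters():
                if "bias" in name:
                    n = param.size(0)
                    param.data[n // 4: n // 2].fill_(1.0)

        def forward(self, tokens):
            x = self.drop_in(self.embed(tokens))
            out, _ = self.lstm(x)
            return self.fc(self.drop_out(out[:, -1]))         # classify from the last time step

    model = RNNClassifier()
    logits = model(torch.randint(0, 10000, (32, 50)))         # batch of 32 sequences of length 50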

Ensemble

Ensembling is the ultimate weapon for pushing up results in papers. In deep learning, the following approaches are commonly used:

  • Same parameters, different initialization methods

  • Different parameters, selecting the best several groups through cross-validation

  • Same parameters, models at different training stages, i.e., models at different iteration counts.

  • Linear fusion of different model types, e.g., an RNN and traditional models (a simple averaging sketch follows this list).
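
A minimal NumPy sketch of linear fusion over class-probability outputs; the three dummy models, uniform weights, and five classes are assumptions for illustration (the weights could instead be tuned on a validation set):

    import numpy as np

    def ensemble_average(prob_list, weights=None):
        """Linearly fuse class-probability predictions from several models."""
        probs = np.stack(prob_list)                       # (n_models, n_samples, n_classes)
        if weights is None:
            weights = np.ones(len(prob_list)) / len(prob_list)
        fused = np.tensordot(weights, probs, axes=1)      # weighted average over models
        return fused.argmax(axis=1)

    # Dummy per-model probabilities for 100 samples and 5 classes
    preds = [np.random.dirichlet(np.ones(5), size=100) for _ in range(3)]
    labels = ensemble_average(preds)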
