Training techniques matter a great deal in deep learning. Deep learning is a highly empirical field, and the same network architecture trained with different methods can give significantly different results. Here I summarize my experience from the past year and share it; additions and corrections are welcome.
Parameter Initialization
Any of the following schemes can be used; the results are generally similar. However, initialization must be done properly, otherwise it may slow convergence, hurt the final result, or even produce NaNs.
Here, n_in is the layer's input size and n_out its output size; n is (n_in+n_out)*0.5 in the Xavier formulas below and n_in in the He formulas.
Xavier initialization paper:
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
He initialization paper:
https://arxiv.org/abs/1502.01852
- Uniform distribution initialization: w = np.random.uniform(low=-scale, high=scale, size=[n_in, n_out])
  - Xavier initialization, suitable for the common activations tanh and sigmoid: scale = np.sqrt(3/n)
  - He initialization, suitable for ReLU: scale = np.sqrt(6/n)
- Normal distribution initialization: w = np.random.randn(n_in, n_out) * stdev  # stdev is the standard deviation of the Gaussian; the mean is 0
  - Xavier initialization, suitable for the common activations tanh and sigmoid: stdev = np.sqrt(1/n)
  - He initialization, suitable for ReLU: stdev = np.sqrt(2/n)
- SVD (orthogonal) initialization: works better for RNNs. Reference paper: https://arxiv.org/abs/1312.6120
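Putting the formulas above together, a minimal numpy sketch of the four variants (the helper names are mine, chosen for illustration):

    import numpy as np

    def xavier_uniform(n_in, n_out):
        # Xavier, uniform: scale = sqrt(3/n) with n = (n_in + n_out) * 0.5
        scale = np.sqrt(3.0 / ((n_in + n_out) * 0.5))
        return np.random.uniform(low=-scale, high=scale, size=[n_in, n_out])

    def he_uniform(n_in, n_out):
        # He, uniform (for ReLU): scale = sqrt(6/n) with n = n_in
        scale = np.sqrt(6.0 / n_in)
        return np.random.uniform(low=-scale, high=scale, size=[n_in, n_out])

    def xavier_normal(n_in, n_out):
        # Xavier, normal: stdev = sqrt(1/n) with n = (n_in + n_out) * 0.5
        stdev = np.sqrt(1.0 / ((n_in + n_out) * 0.5))
        return np.random.randn(n_in, n_out) * stdev

    def he_normal(n_in, n_out):
        # He, normal (for ReLU): stdev = sqrt(2/n) with n = n_in
        stdev = np.sqrt(2.0 / n_in)
        return np.random.randn(n_in, n_out) * stdev

    W = xavier_uniform(256, 128)  # e.g. a 256 -> 128 tanh layer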
Data Preprocessing Methods
- Zero-centering, which is quite common: X -= np.mean(X, axis=0)  # zero-center; X /= np.std(X, axis=0)  # normalize
- PCA whitening, which is used less frequently.
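For reference, a minimal numpy sketch of PCA whitening on zero-centered data; the small eps is mine, added to avoid division by zero:

    import numpy as np

    def pca_whiten(X, eps=1e-5):
        # X: [num_samples, num_features], assumed already zero-centered
        cov = np.dot(X.T, X) / X.shape[0]   # feature covariance matrix
        U, S, _ = np.linalg.svd(cov)        # eigenvectors U, eigenvalues S
        X_rot = np.dot(X, U)                # decorrelate (rotate into the eigenbasis)
        return X_rot / np.sqrt(S + eps)     # whiten: unit variance per dimension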
Training Techniques
- Normalize the gradient: divide the gradient summed over the minibatch by the minibatch size.
- Gradient clipping: cap the gradient norm. Compute value = sqrt(g1^2 + g2^2 + ...) over all gradient components; if value exceeds a threshold (common choices are 5, 10, or 15), rescale the gradients by threshold/value so the norm equals the threshold (see the sketch after this list).
- Dropout is very effective against overfitting on small datasets and is typically set to 0.5. In my experiments on small datasets, dropout + SGD gave a clear improvement, so it is well worth trying if you can. Placement is crucial: for RNNs, put dropout on the input -> RNN and RNN -> output connections (see the sketch after this list). For more on dropout in RNNs, see this paper: http://arxiv.org/abs/1409.2329
- In my experiments on small datasets, Adam, Adadelta, etc. did not reach results as good as SGD. SGD converges more slowly, but the final result is usually better. If you use SGD, start with a learning rate of 1.0 or 0.1 and check the validation set periodically; if the cost stops decreasing, halve the learning rate (see the sketch after this list). Many papers do this, and it also worked well in my experiments. You can also train with an Ada-family optimizer first and then switch to SGD for fine-tuning, which can bring further gains. It is said that Adadelta tends to do better on classification problems, while Adam does better on generative tasks.
- Except where an output must lie in (0, 1), such as gates, avoid sigmoid and prefer activations like tanh or ReLU, for two reasons: 1. sigmoid has a significant gradient only for inputs roughly in the range -4 to 4; outside that range the gradient is close to 0, which easily causes vanishing gradients. 2. The output of sigmoid is not zero-centered (it lies in (0, 1)), which shifts the mean of the inputs to the next layer.
- For the RNN hidden size and the embedding size, start tuning from around 128. For the batch size, also start from around 128. Choosing an appropriate batch size matters; bigger is not always better.
- Initializing embeddings with pretrained word2vec vectors can noticeably improve convergence speed and final results on small datasets.
- Shuffle the training data as thoroughly as possible.
- Initializing the LSTM forget gate bias to 1.0 or a larger value can give better results, as noted in this paper: http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf. In my experiments, setting it to 1.0 improved convergence speed; in practice, different tasks may call for different values (see the sketch after this list).
- Batch normalization is said to improve performance, but I have not tried it myself; I recommend it as a final step for squeezing more out of the model. Reference paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
- If your model contains fully connected (MLP) layers whose input and output sizes are equal, consider replacing them with Highway Networks. I found this gives a small improvement, and I recommend it as a final step for squeezing more out of the model. The idea is simple: add a gate that controls how much information flows through (see the sketch after this list). For details, see this paper: http://arxiv.org/abs/1505.00387
- From @Zhang Xinyu: alternate training rounds with and without regularization.
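Gradient normalization and clipping: a minimal numpy sketch of the two steps described above; the grads list, batch_size, and threshold are illustrative.

    import numpy as np

    def normalize_and_clip(grads, batch_size, threshold=5.0):
        # grads: list of arrays holding gradients summed over the minibatch
        grads = [g / batch_size for g in grads]             # gradient normalization
        norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))  # global L2 norm
        if norm > threshold:
            grads = [g * (threshold / norm) for g in grads] # rescale so norm == threshold
        return grads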
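Dropout placement for RNNs: a minimal inverted-dropout sketch in numpy. rnn_step and W_out are stand-ins for your own model; the recurrent connection is left untouched, as in the paper cited above.

    import numpy as np

    def dropout(x, p=0.5, train=True):
        # inverted dropout: scale at train time so nothing changes at test time
        if not train:
            return x
        mask = (np.random.rand(*x.shape) > p) / (1.0 - p)
        return x * mask

    # h = rnn_step(dropout(x_t), h)   # dropout on the input -> RNN connection
    # y = np.dot(dropout(h), W_out)   # dropout on the RNN -> output connection
    # the recurrent h -> h connection itself is left untouched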
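Learning-rate halving for SGD: a sketch of the schedule described above. model, train_one_epoch, and evaluate_validation_cost are placeholders for your own training code.

    max_epochs = 50
    lr = 0.1                                    # or 1.0, as suggested above
    best_cost = float("inf")

    for epoch in range(max_epochs):
        train_one_epoch(model, lr)              # placeholder: one epoch of plain SGD at rate lr
        cost = evaluate_validation_cost(model)  # placeholder: cost on the validation set
        if cost < best_cost:
            best_cost = cost
        else:
            lr *= 0.5                           # cost stopped decreasing: halve the learning rate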
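LSTM forget-gate bias: a sketch of the 1.0 initialization, assuming PyTorch's nn.LSTM; the layer sizes are illustrative.

    import torch
    import torch.nn as nn

    hidden_size = 128
    lstm = nn.LSTM(input_size=100, hidden_size=hidden_size)

    # nn.LSTM keeps two bias vectors (bias_ih, bias_hh) that are summed, with gates
    # ordered (input, forget, cell, output); set the forget slice so the total is 1.0.
    with torch.no_grad():
        lstm.bias_ih_l0[hidden_size:2 * hidden_size].fill_(1.0)
        lstm.bias_hh_l0[hidden_size:2 * hidden_size].fill_(0.0)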
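Highway layer: a minimal numpy sketch of the gating idea above. A transform gate T mixes the layer output H(x) with the untouched input x; W_h, W_t, b_h, b_t are the layer's parameters (the names are mine).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def highway_layer(x, W_h, b_h, W_t, b_t):
        # x: [batch, d]; W_h, W_t: [d, d] -- input and output sizes must match
        H = np.tanh(np.dot(x, W_h) + b_h)   # candidate transform H(x)
        T = sigmoid(np.dot(x, W_t) + b_t)   # transform gate in (0, 1)
        return T * H + (1.0 - T) * x        # gated mix of transform and carry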
Ensemble
Ensembling is the ultimate weapon for pushing up results in papers. In deep learning, the common approaches are:
- Same parameters, different initializations.
- Different parameters (hyperparameters), keeping the best few sets chosen by cross-validation.
- Same parameters, models from different training stages, i.e., snapshots from different iterations.
- Different models, combined by linear fusion, e.g., an RNN together with a traditional model (see the sketch below).
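A minimal sketch of linear fusion over the predicted class probabilities of several trained models; the equal weights are illustrative and would normally be tuned on a validation set.

    import numpy as np

    def linear_fusion(prob_list, weights=None):
        # prob_list: list of [num_samples, num_classes] predicted probabilities, one per model
        if weights is None:
            weights = np.ones(len(prob_list)) / len(prob_list)  # plain averaging
        fused = sum(w * p for w, p in zip(weights, prob_list))
        return np.argmax(fused, axis=1)                         # fused class predictions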