Training techniques matter a great deal in deep learning. Deep learning is a highly empirical field, and the same network architecture trained in different ways can give significantly different results. Here I summarize my experience from the past year and share it with everyone; additions and corrections are welcome.
Parameter Initialization
Choose any one of the following methods; the results are generally similar. But initialization is essential: skipping it can slow down convergence, hurt the final result, or even cause problems such as NaN.
In the formulas below, n_in is the layer's input size, n_out is its output size, and n is either n_in or (n_in+n_out)*0.5, depending on the method. A NumPy sketch of these formulas follows the list below.
Xavier initialization paper: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
He initialization paper: https://arxiv.org/abs/1502.01852
- Uniform distribution initialization: w = np.random.uniform(low=-scale, high=scale, size=[n_in, n_out])
  - Xavier initialization, suitable for common activation functions (tanh, sigmoid): scale = np.sqrt(3/n)
  - He initialization, suitable for ReLU: scale = np.sqrt(6/n)
- Normal (Gaussian) distribution initialization: w = np.random.randn(n_in, n_out) * stdev # stdev is the standard deviation of the Gaussian; the mean is 0
  - Xavier initialization, suitable for common activation functions (tanh, sigmoid): stdev = np.sqrt(1/n)
  - He initialization, suitable for ReLU: stdev = np.sqrt(2/n)
- SVD initialization: works well for RNNs. See the paper: https://arxiv.org/abs/1312.6120
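A minimal NumPy sketch of the four variants above. The specific choice of n here (n_in for He, (n_in+n_out)*0.5 for Xavier) follows the usual convention and should be treated as an assumption; the layer sizes in the example are placeholders.

```python
import numpy as np

def xavier_uniform(n_in, n_out):
    # n = (n_in + n_out) * 0.5, scale = sqrt(3 / n)
    n = (n_in + n_out) * 0.5
    scale = np.sqrt(3.0 / n)
    return np.random.uniform(low=-scale, high=scale, size=[n_in, n_out])

def he_uniform(n_in, n_out):
    # n = n_in, scale = sqrt(6 / n)
    scale = np.sqrt(6.0 / n_in)
    return np.random.uniform(low=-scale, high=scale, size=[n_in, n_out])

def xavier_normal(n_in, n_out):
    # n = (n_in + n_out) * 0.5, stdev = sqrt(1 / n)
    n = (n_in + n_out) * 0.5
    return np.random.randn(n_in, n_out) * np.sqrt(1.0 / n)

def he_normal(n_in, n_out):
    # n = n_in, stdev = sqrt(2 / n)
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

# Example: a 256 -> 128 fully connected layer followed by ReLU
W = he_normal(256, 128)
```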
Data Preprocessing Methods
- Zero-center and normalize, which is quite common: X -= np.mean(X, axis=0) (zero-center), then X /= np.std(X, axis=0) (normalize).
- PCA whitening, which is used less frequently.
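For reference, a minimal NumPy sketch of PCA whitening; the small eps term is an assumption added to avoid division by zero.

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    # X: [num_samples, num_features], rows are examples
    X = X - np.mean(X, axis=0)         # zero-center
    cov = np.dot(X.T, X) / X.shape[0]  # covariance matrix
    U, S, _ = np.linalg.svd(cov)       # eigen-decomposition via SVD
    X_rot = np.dot(X, U)               # decorrelate (rotate into the eigenbasis)
    return X_rot / np.sqrt(S + eps)    # divide by sqrt of eigenvalues to whiten
```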
Training Techniques
- Gradient normalization is necessary: divide the computed gradient by the minibatch size.
- Gradient clipping: limit the maximum gradient norm. Compute value = sqrt(g1^2 + g2^2 + ...) over all gradient components; if value exceeds a threshold (commonly 5, 10, or 15), scale the gradient by a decay coefficient so that its norm equals the threshold. See the sketch after this list.
- Dropout is very effective at preventing overfitting on small datasets, generally set to 0.5. In most of my small-data experiments, dropout + SGD brought a clear improvement, so try it if at all possible. The placement of dropout matters: for RNNs, put it between input -> RNN and RNN -> output. For how to use dropout in RNNs, see this paper: http://arxiv.org/abs/1409.2329
- Adam, Adadelta, etc. did not perform as well as SGD on small datasets in my experiments. SGD converges more slowly, but the final result is generally better. If you use SGD, start with a learning rate of 1.0 or 0.1, check the validation set after a while, and halve the learning rate whenever the cost stops decreasing (see the sketch after this list). Many papers do this, and my own results agree. You can also train with one of the Ada-family optimizers first and switch to SGD for further training when close to convergence, which also yields a gain. Adadelta is said to work better on classification problems, while Adam works better on generative problems.
- Except in places like gates where the output must be limited to 0-1, avoid sigmoid and use activation functions such as tanh or ReLU instead. 1. Sigmoid has a significant gradient only roughly in the range -4 to 4; outside that range the gradient approaches 0, easily causing vanishing gradients. 2. Even with zero-mean input, the output of sigmoid is not zero-centered (it is always positive).
- The hidden dim and embedding size of an RNN: start tuning from around 128. Batch size: start from around 128 as well. A suitable batch size matters most; bigger is not necessarily better.
- Initializing with word2vec vectors can noticeably improve both convergence speed and final results on small datasets.
- Shuffle the data whenever possible.
- For the LSTM forget gate, initializing the bias to 1.0 or a larger value gives better results, per this paper: http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf. In my experiments, setting it to 1.0 improved convergence speed; in practice, different tasks may call for different values.
- Batch Normalization is said to improve results, but I have not tried it; I suggest it as a final means of boosting the model. See the paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
- If the model contains fully connected layers (MLPs) whose input and output sizes are the same, consider replacing the MLP with a Highway Network. I saw a small improvement, and also suggest it as a final means of boosting the model. The idea is simple: add a gate to control the flow of information. For details, see this paper: http://arxiv.org/abs/1505.00387
- A tip from @Zhang Xinyu: alternate between training with and without regularization in successive rounds.
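A minimal NumPy sketch of two of the tricks above: clipping the gradient by its global norm, and halving the learning rate when the validation cost stops decreasing. The function and variable names are illustrative rather than taken from any particular framework.

```python
import numpy as np

def clip_by_global_norm(grads, threshold=5.0):
    """Scale a list of gradient arrays so their global L2 norm is at most `threshold`."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > threshold:
        decay = threshold / global_norm     # decay coefficient that brings the norm down to the threshold
        grads = [g * decay for g in grads]
    return grads

def maybe_halve_lr(lr, val_costs):
    """Halve the learning rate if the most recent validation cost did not decrease."""
    if len(val_costs) >= 2 and val_costs[-1] >= val_costs[-2]:
        lr *= 0.5
    return lr

# Example usage inside a training loop (gradients and costs are placeholders):
grads = [np.random.randn(4, 4), np.random.randn(4)]
grads = clip_by_global_norm(grads, threshold=5.0)
lr = maybe_halve_lr(1.0, val_costs=[2.3, 2.4])  # -> 0.5
```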
Ensemble
Ensembling is the ultimate weapon for improving results in papers. In deep learning, there are generally the following approaches:
- Same parameters, different initialization methods.
- Different parameters, with the best few configurations selected by cross-validation.
- Same parameters, but models from different training stages, i.e., different numbers of iterations.
- Different models, combined by linear fusion, e.g., an RNN together with a traditional model.
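As an illustration, a minimal sketch of the simplest form of fusion: a weighted average of the class probabilities predicted by several models. The predict_proba interface is an assumption here, not any specific library's API.

```python
import numpy as np

def ensemble_predict(models, X, weights=None):
    """Linearly fuse models by averaging their predicted class probabilities."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)  # equal weights by default
    probs = sum(w * m.predict_proba(X) for w, m in zip(weights, models))
    return np.argmax(probs, axis=1)                  # fused class predictions
```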