Author: z.defying
Reprinted from: Datawhale
Table of Contents:
1 Specify GPU ID
2 View Model Layer Output Details
3 Gradient Clipping
4 Expand Dimensions of a Single Image
5 One-Hot Encoding
6 Prevent Out of Memory When Validating Model
7 Learning Rate Decay
8 Freeze Parameters of Certain Layers
9 Use Different Learning Rates for Different Layers
1. Specify GPU ID
Set the current GPU device so that only device 0 is used; its name is /gpu:0:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
Set the current GPU devices to 0 and 1, with names /gpu:0 and /gpu:1:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
This indicates that device 0 is used first, followed by device 1.
The statement that specifies the GPU must come before any operations involving the neural network.
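A minimal sketch (MyModel is a hypothetical placeholder); the environment variable should be set before the first CUDA call:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before CUDA is initialized

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model = MyModel().to(device)  # hypothetical model moved onto the visible GPU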
2. View Model Layer Output Details
Keras provides a concise API for viewing the output size of each layer of a model, which is very useful for debugging a network. The same functionality is now available in PyTorch as well.
It is simple to use, as shown below:
from torchsummary import summary
summary(your_model, input_size=(channels, H, W))
input_size must be set according to the input size of your own network model.
https://github.com/sksq96/pytorch-summary
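For example, a hedged usage sketch (assuming torchvision and torchsummary are installed and a GPU is available):
import torch
from torchvision import models
from torchsummary import summary

vgg = models.vgg16().cuda()              # torchsummary builds its test input on the GPU by default
summary(vgg, input_size=(3, 224, 224))   # prints each layer's output shape and parameter count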
3. Gradient Clipping
import torch.nn as nn
outputs = model(data)
loss = loss_fn(outputs, target)
optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
optimizer.step()
The parameters of nn.utils.clip_grad_norm_:
- parameters – an iterable of Tensors whose gradients will be normalized
- max_norm – the maximum norm of the gradients
- norm_type – the type of the norm used, default is the L2 norm
Zhihu user @不椭的椭圆 pointed out: Gradient clipping may consume a lot of computation time on certain tasks.
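As a related option, PyTorch also provides nn.utils.clip_grad_value_, which clips each gradient element to a fixed range instead of rescaling by the overall norm; a minimal sketch (the threshold 5 is an arbitrary example value):
import torch.nn as nn

# clip every gradient element into [-5, 5]; call this after loss.backward() and before optimizer.step()
nn.utils.clip_grad_value_(model.parameters(), clip_value=5)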
4. Expand Dimensions of a Single Image
During training the data dimensions are usually (batch_size, c, h, w), but at test time only a single image is fed in, so its dimensions need to be expanded. There are several ways to do this:
import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.view(1, *image.size())
print(img.size())
# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])
or
import cv2
import numpy as np
image = cv2.imread(img_path)
print(image.shape)
img = image[np.newaxis, :, :, :]
print(img.shape)
# output:
# (h, w, c)
# (1, h, w, c)
or (thanks to Zhihu user @coldleaf for the addition)
import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.unsqueeze(dim=0)
print(img.size())
img = img.squeeze(dim=0)
print(img.size())
# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])
# torch.Size([h, w, c])
tensor.unsqueeze(dim): expands the dimensions by inserting a new dimension of size 1 at position dim.
tensor.squeeze(dim): removes the dimension at position dim if its size is 1; if that dimension's size is greater than 1, squeeze() has no effect. If dim is not specified, all dimensions of size 1 are removed.
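Note that cv2.imread returns an array of shape (h, w, c), while the (batch_size, c, h, w) layout mentioned above also requires reordering the channels; a hedged sketch of the full conversion (img_path is a placeholder):
import cv2
import torch

image = cv2.imread(img_path)                   # numpy array of shape (h, w, c)
image = torch.tensor(image).permute(2, 0, 1)   # -> (c, h, w)
img = image.unsqueeze(dim=0)                   # -> (1, c, h, w)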
5. One-Hot Encoding
When using the cross-entropy loss function in PyTorch, the label is passed as class indices and handled internally, so manual one-hot conversion is not needed. However, when using MSE, the label does need to be converted to one-hot encoding manually.
import torch
class_num = 8
batch_size = 4
def one_hot(label):
    """Convert a one-dimensional label tensor to one-hot encoding"""
    label = label.resize_(batch_size, 1)
    m_zeros = torch.zeros(batch_size, class_num)
    # scatter_(dim, index, value): write `value` at the positions given by `index` along `dim`
    onehot = m_zeros.scatter_(1, label, 1)  # (dim, index, value)

    return onehot.numpy()  # Tensor -> Numpy

label = torch.LongTensor(batch_size).random_() % class_num  # random integers modulo class_num
print(one_hot(label))

# output:
# [[0. 0. 0. 1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0.]]
https://discuss.pytorch.org/t/convert-int-into-one-hot-format/507/3
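In more recent PyTorch versions (1.1+, as far as I know), torch.nn.functional.one_hot provides a built-in alternative; a minimal sketch:
import torch
import torch.nn.functional as F

label = torch.tensor([3, 4, 2, 1])
onehot = F.one_hot(label, num_classes=8).float()  # shape (4, 8), cast to float for use with MSE
print(onehot)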
6. Prevent Out of Memory When Validating Model
During model validation, gradient computation is not needed, so turning off autograd improves speed and saves memory; if it is not turned off, it may lead to an out-of-memory error.
with torch.no_grad():
    # Code that runs the model for prediction
    pass
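For reference, a slightly fuller hedged sketch of a validation loop (model and val_loader are hypothetical placeholders):
model.eval()  # also switches layers such as dropout and batch norm to evaluation mode
with torch.no_grad():
    for images, labels in val_loader:
        outputs = model(images)
        # compute metrics here; no computation graph is built, so memory usage stays low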
Thanks to Zhihu user @zhaz for the reminder; I have updated the explanation of torch.cuda.empty_cache() accordingly.
This was the original answer:
During PyTorch training, useless temporary variables may accumulate, leading to out of memory. The following statement can be used to clean up these unnecessary variables.
The explanation on the official website is:
Releases all unoccupied cached memory currently held by the caching allocator so that those can be used in other GPU applications and visible in nvidia-smi.
torch.cuda.empty_cache()
This means that PyTorch's caching allocator pre-allocates a certain amount of GPU memory; even if the tensors do not use all of it, that memory cannot be used by other applications. The allocation is triggered by the first CUDA memory access.
The role of torch.cuda.empty_cache() is to release the unoccupied cached memory currently held by the caching allocator, so that it can be used by other GPU applications and becomes visible in nvidia-smi. Note that this command does not release the memory occupied by tensors.
For unused data variables, PyTorch can automatically reclaim them to free up the corresponding memory.
For more detailed optimizations, see:
Optimize Memory Usage: https://blog.csdn.net/qq_28660035/article/details/80688427
Memory Utilization Issues: https://oldpan.me/archives/pytorch-gpu-memory-usage-track
7. Learning Rate Decay
import torch.optim as optim
from torch.optim import lr_scheduler
# Initialization before training
optimizer = optim.Adam(net.parameters(), lr=0.001)
scheduler = lr_scheduler.StepLR(optimizer, 10, 0.1) # Every 10 epochs, multiply the learning rate by 0.1
# During training
for n in range(n_epoch):
    scheduler.step()
    ...
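Note that since PyTorch 1.1, scheduler.step() is expected to be called after optimizer.step(); a minimal hedged sketch of a per-epoch loop (train_loader and loss_fn are placeholders):
for n in range(n_epoch):
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(net(data), target)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decay the learning rate once per epoch, after the optimizer updates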
8. Freeze Parameters of Certain Layers
Reference: Freezing a Certain Layer of a Pre-trained Model in PyTorch
https://www.zhihu.com/question/311095447/answer/589307812
When loading a pre-trained model, we sometimes want to freeze the first few layers so that their parameters do not change during training.
We first need to know the names of each layer, which can be printed using the following code:
net = Network() # Get custom network structure
for name, value in net.named_parameters():
    print('name: {0}, grad: {1}'.format(name, value.requires_grad))
Assuming the information of the first few layers is as follows:
name: cnn.VGG_16.convolution1_1.weight, grad: True
name: cnn.VGG_16.convolution1_1.bias, grad: True
name: cnn.VGG_16.convolution1_2.weight, grad: True
name: cnn.VGG_16.convolution1_2.bias, grad: True
name: cnn.VGG_16.convolution2_1.weight, grad: True
name: cnn.VGG_16.convolution2_1.bias, grad: True
name: cnn.VGG_16.convolution2_2.weight, grad: True
name: cnn.VGG_16.convolution2_2.bias, grad: True
The True at the end indicates that the parameters of that layer are trainable. Then we define a list of layers to freeze:
no_grad = [
    'cnn.VGG_16.convolution1_1.weight',
    'cnn.VGG_16.convolution1_1.bias',
    'cnn.VGG_16.convolution1_2.weight',
    'cnn.VGG_16.convolution1_2.bias',
]
The freezing method is as follows:
net = Net.CTPN() # Get network structure
for name, value in net.named_parameters():
    if name in no_grad:
        value.requires_grad = False
    else:
        value.requires_grad = True
After freezing, we print the information of each layer again:
name: cnn.VGG_16.convolution1_1.weight, grad: False
name: cnn.VGG_16.convolution1_1.bias, grad: False
name: cnn.VGG_16.convolution1_2.weight, grad: False
name: cnn.VGG_16.convolution1_2.bias, grad: False
name: cnn.VGG_16.convolution2_1.weight, grad: True
name: cnn.VGG_16.convolution2_1.bias, grad: True
name: cnn.VGG_16.convolution2_2.weight, grad: True
name: cnn.VGG_16.convolution2_2.bias, grad: True
We can see that the weight and bias of the first two layers have requires_grad set to False, indicating they are not trainable.
Finally, when defining the optimizer, only the parameters with requires_grad set to True are passed in, so only they will be updated:
optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.01)
9. Use Different Learning Rates for Different Layers
We use different learning rates for different layers of the model.
Using this model as an example:
net = Network() # Get custom network structure
for name, value in net.named_parameters():
    print('name: {}'.format(name))
# Output:
# name: cnn.VGG_16.convolution1_1.weight
# name: cnn.VGG_16.convolution1_1.bias
# name: cnn.VGG_16.convolution1_2.weight
# name: cnn.VGG_16.convolution1_2.bias
# name: cnn.VGG_16.convolution2_1.weight
# name: cnn.VGG_16.convolution2_1.bias
# name: cnn.VGG_16.convolution2_2.weight
# name: cnn.VGG_16.convolution2_2.bias
To set different learning rates for convolution1 and convolution2, first separate their parameters into different lists:
conv1_params = []
conv2_params = []
for name, parms in net.named_parameters():
    if "convolution1" in name:
        conv1_params += [parms]
    else:
        conv2_params += [parms]
# Then perform the following operations in the optimizer:
optimizer = optim.Adam(
    [
        {"params": conv1_params, 'lr': 0.01},
        {"params": conv2_params, 'lr': 0.001},
    ],
    weight_decay=1e-3,
)
We split the model's parameters into two groups and pass them to the optimizer as a list of dictionaries, one per group, each with its own learning rate. Options shared by both groups, such as weight_decay above, are passed outside the list as global settings.
A global learning rate can also be passed outside the list; parameter groups that define their own lr in their dictionary use that local value, while groups that do not fall back to the global learning rate.
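For instance, a minimal sketch combining a global learning rate with a per-group override (reusing the parameter lists built above):
optimizer = optim.Adam(
    [
        {"params": conv1_params, "lr": 0.01},  # this group overrides the global lr
        {"params": conv2_params},              # this group falls back to the global lr=0.001
    ],
    lr=0.001,            # global learning rate
    weight_decay=1e-3,   # global weight decay, shared by both groups
)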