Collection of PyTorch Tricks

Author: z.defying @ Zhihu (authorized repost)
Source: https://zhuanlan.zhihu.com/p/76459295
Editor: Jishi Platform

Jishi Guide

This article collects 13 tips for using PyTorch, including specifying GPU IDs, gradient clipping, and expanding the dimensions of a single image, which can help practitioners work more efficiently.

Table of Contents

1. Specify GPU ID
2. View Model Output Details for Each Layer
3. Gradient Clipping
4. Expand Dimensions of a Single Image
5. One Hot Encoding
6. Prevent Out of Memory Errors During Model Validation
7. Learning Rate Decay
8. Freeze Parameters of Certain Layers
9. Use Different Learning Rates for Different Layers
10. Model Related Operations
11. Built-in One Hot Function in PyTorch
12. Network Parameter Initialization
13. Load Built-in Pre-trained Models

1. Specify GPU ID

  • To make only device 0 visible (device name /gpu:0): os.environ["CUDA_VISIBLE_DEVICES"] = "0"
  • To make devices 0 and 1 visible (device names /gpu:0 and /gpu:1): os.environ["CUDA_VISIBLE_DEVICES"] = "0,1". Device 0 is used first, then device 1.

The statement that specifies the GPU must come before any operations involving the neural network (in practice, before the first CUDA call).
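A minimal sketch of where the statement goes and a quick check of what PyTorch then sees:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # set before torch touches the GPU

import torch

# Only GPUs 0 and 1 are now visible; inside this process they are renumbered cuda:0 and cuda:1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.device_count())  # prints 2 when both GPUs are visible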

2. View Model Output Details for Each Layer

Keras has a concise API for viewing the output size of each layer of a model, which is very useful for debugging networks. The same functionality is now available for PyTorch as well, via the third-party torchsummary package.

It is simple to use, as shown below:

from torchsummary import summary
summary(your_model, input_size=(channels, H, W))

input_size should be set according to the input size of your own network model.
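For example, a quick sketch with torchvision's ResNet-18 (this assumes torchsummary is installed via pip install torchsummary and a GPU is available; the input size (3, 224, 224) is simply the usual ImageNet shape):

from torchsummary import summary
import torchvision.models as models

model = models.resnet18().cuda()  # torchsummary places its test input on the GPU by default
summary(model, input_size=(3, 224, 224))  # prints each layer's output shape and parameter count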

3. Gradient Clipping

import torch.nn as nn
outputs = model(data)
loss = loss_fn(outputs, target)
optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
optimizer.step()

The parameters for nn.utils.clip_grad_norm_ are:

  • parameters – an iterable of tensors whose gradients will be normalized
  • max_norm – The maximum norm of the gradients
  • norm_type – Specifies the type of norm, default is L2

@不椭的椭圆 noted: gradient clipping can add noticeable computation time on some tasks; see the comments on the original post for details.
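As a small aside, clip_grad_norm_ returns the total norm it measured before clipping, which can be logged to see how often clipping actually kicks in (a sketch, reusing the names from the snippet above):

total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
if total_norm > 20:
    print('gradient norm {:.2f} was clipped down to 20'.format(float(total_norm)))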

4. Expand Dimensions of a Single Image

During training, the data dimensions are generally (batch_size, c, h, w), but during testing, only a single image is input, so the dimensions need to be expanded. There are several methods to expand dimensions:

import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.view(1, *image.size())
print(img.size())
# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])

or

import cv2
import numpy as np
image = cv2.imread(img_path)
print(image.shape)
img = image[np.newaxis, :, :, :]
print(img.shape)
# output:
# (h, w, c)
# (1, h, w, c)

or (thanks to @coldleaf for the addition)

import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.unsqueeze(dim=0)  
print(img.size())
img = img.squeeze(dim=0)
print(img.size())
# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])
# torch.Size([h, w, c])

tensor.unsqueeze(dim): expands the dimension, dim specifies which dimension to expand.

tensor.squeeze(dim): removes the dimension specified by dim if it has size 1; if that dimension's size is greater than 1, squeeze() has no effect; if dim is not specified, all dimensions of size 1 are removed.
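A tiny sketch of these two rules (the shape is chosen arbitrarily for illustration):

import torch

x = torch.zeros(1, 3, 1, 5)
print(x.squeeze(dim=1).size())    # torch.Size([1, 3, 1, 5]) – dim 1 has size 3, so nothing happens
print(x.squeeze(dim=0).size())    # torch.Size([3, 1, 5])    – dim 0 has size 1 and is removed
print(x.squeeze().size())         # torch.Size([3, 5])       – all size-1 dimensions are removed
print(x.unsqueeze(dim=0).size())  # torch.Size([1, 1, 3, 1, 5]) – a new size-1 dimension is inserted at dim 0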

5. One Hot Encoding

The cross-entropy loss function in PyTorch takes class-index labels directly and handles the conversion internally, so there is no need to convert them to one-hot manually. However, when using MSE, the labels need to be converted to one-hot encoding by hand.

import torch
class_num = 8
batch_size = 4
def one_hot(label):
    """Convert a 1D label tensor to one-hot encoding."""
    label = label.resize_(batch_size, 1)
    m_zeros = torch.zeros(batch_size, class_num)
    # scatter_(dim, index, value): writes `value` at the positions given by `index` along `dim`
    onehot = m_zeros.scatter_(1, label, 1)
    return onehot.numpy()  # Tensor -> Numpy

label = torch.LongTensor(batch_size).random_() % class_num  # random integers modulo class_num
print(one_hot(label))

# output:
# [[0. 0. 0. 1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0.]]

Note: There is a simpler method in item 11.

6. Prevent Out of Memory Errors During Model Validation

During model validation, there is no need to compute gradients, so turning off autograd can increase speed and save memory. If not turned off, it may cause out of memory errors.

with torch.no_grad():
    # code that runs the model for prediction
    pass
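A slightly fuller sketch of a typical validation loop (model, val_loader, and device are placeholder names):

model.eval()                # also disables dropout and uses running BatchNorm statistics
with torch.no_grad():       # no computation graph is built, so activations are freed immediately
    for data, target in val_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        ...                 # compute metrics here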

Thanks to @zhaz for the reminder, I updated the reason for using torch.cuda.empty_cache().

This is the original answer:

Unnecessary temporary variables during PyTorch training may accumulate, leading to out of memory; you can use the statement below to clean up these unnecessary variables.

According to the official explanation:

"Releases all unoccupied cached memory currently held by the caching allocator so that those can be used in other GPU applications and visible in nvidia-smi."

torch.cuda.empty_cache()

This means that PyTorch’s caching allocator will pre-allocate some fixed GPU memory, and even if tensors do not actually use all of this memory, it cannot be used by other applications. This allocation process is triggered by the first CUDA memory access.

The role of torch.cuda.empty_cache() is to release the cached memory currently held by the caching allocator that is unoccupied so that this memory can be used by other GPU applications and is visible through the nvidia-smi command. Note that using this command will not release the memory occupied by tensors.

For unused data variables, PyTorch can automatically recycle them to free up the corresponding memory.
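A minimal sketch of how this is typically combined with dropping references to tensors that are no longer needed (the variable name is a placeholder, and memory_reserved requires a reasonably recent PyTorch):

import torch

intermediate = torch.randn(1024, 1024, device="cuda")  # a large tensor that is no longer needed
del intermediate              # drop the Python reference so the memory can be reclaimed
torch.cuda.empty_cache()      # return cached, unoccupied blocks to the GPU driver

print(torch.cuda.memory_allocated())  # memory held by live tensors
print(torch.cuda.memory_reserved())   # memory still reserved by the caching allocator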

For more detailed optimizations, see Optimizing Memory Usage and Memory Utilization Issues.

7. Learning Rate Decay

import torch.optim as optim
from torch.optim import lr_scheduler
# Initialization before training
optimizer = optim.Adam(net.parameters(), lr=0.001)
scheduler = lr_scheduler.StepLR(optimizer, 10, 0.1)  # multiply the learning rate by 0.1 every 10 epochs
# During training
for epoch in range(n_epoch):
    ...                # train for one epoch (optimizer.step() is called here)
    scheduler.step()   # since PyTorch 1.1, scheduler.step() should come after optimizer.step()

The learning rate value can be checked at any time: optimizer.param_groups[0]['lr'].

There are also other ways to update the learning rate:

1. Custom update formula:

scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch:1/(epoch+1))

2. Update learning rate without relying on epochs:

lr_scheduler.ReduceLROnPlateau() dynamically reduces the learning rate when a monitored metric stops improving during training; its parameter descriptions are easy to find. One reminder: whether to use mode='min' or mode='max' depends on whether the monitored quantity is a loss or an accuracy, i.e., whether you call scheduler.step(loss) or scheduler.step(acc).
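A minimal sketch of the plateau scheduler, reusing the imports from the snippet above (validate and n_epoch are placeholders):

optimizer = optim.Adam(net.parameters(), lr=0.001)
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)

for epoch in range(n_epoch):
    ...                        # train for one epoch
    val_loss = validate(net)   # placeholder: compute the validation loss
    scheduler.step(val_loss)   # the learning rate drops when val_loss stops improving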

8. Freeze Parameters of Certain Layers

Reference: https://www.zhihu.com/question/311095447/answer/589307812

When loading a pre-trained model, we sometimes want to freeze the first few layers so that their parameters do not change during training.

We need to know the names of each layer, which can be printed using the code below:

net = Network()  # Get custom network structure
for name, value in net.named_parameters():    print('name: {0},	 grad: {1}'.format(name, value.requires_grad))

Assuming the information of the first few layers is as follows:

name: cnn.VGG_16.convolution1_1.weight,    grad: True
name: cnn.VGG_16.convolution1_1.bias,    grad: True
name: cnn.VGG_16.convolution1_2.weight,    grad: True
name: cnn.VGG_16.convolution1_2.bias,    grad: True
name: cnn.VGG_16.convolution2_1.weight,    grad: True
name: cnn.VGG_16.convolution2_1.bias,    grad: True
name: cnn.VGG_16.convolution2_2.weight,    grad: True
name: cnn.VGG_16.convolution2_2.bias,    grad: True

The trailing True indicates that the parameters of this layer are trainable, then we define a list of layers to freeze:

no_grad = [
    'cnn.VGG_16.convolution1_1.weight',
    'cnn.VGG_16.convolution1_1.bias',
    'cnn.VGG_16.convolution1_2.weight',
    'cnn.VGG_16.convolution1_2.bias'
]

The method to freeze is as follows:

net = Net.CTPN()  # Get network structure
for name, value in net.named_parameters():
    if name in no_grad:
        value.requires_grad = False
    else:
        value.requires_grad = True

After freezing, we print the information of each layer again:

name: cnn.VGG_16.convolution1_1.weight,    grad: False
name: cnn.VGG_16.convolution1_1.bias,    grad: False
name: cnn.VGG_16.convolution1_2.weight,    grad: False
name: cnn.VGG_16.convolution1_2.bias,    grad: False
name: cnn.VGG_16.convolution2_1.weight,    grad: True
name: cnn.VGG_16.convolution2_1.bias,    grad: True
name: cnn.VGG_16.convolution2_2.weight,    grad: True
name: cnn.VGG_16.convolution2_2.bias,    grad: True

We can see that the requires_grad for the weights and biases of the first two layers are now False, indicating they are not trainable.

Finally, when defining the optimizer, pass in only the parameters whose requires_grad is True:

optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.01)

9. Use Different Learning Rates for Different Layers

We will use different learning rates for different layers of the model.

Still using this model as an example:

net = Network()  # Get custom network structure
for name, value in net.named_parameters():
    print('name: {}'.format(name))

# Output:
# name: cnn.VGG_16.convolution1_1.weight
# name: cnn.VGG_16.convolution1_1.bias
# name: cnn.VGG_16.convolution1_2.weight
# name: cnn.VGG_16.convolution1_2.bias
# name: cnn.VGG_16.convolution2_1.weight
# name: cnn.VGG_16.convolution2_1.bias
# name: cnn.VGG_16.convolution2_2.weight
# name: cnn.VGG_16.convolution2_2.bias

Set different learning rates for convolution1 and convolution2, first separate them into different lists:

conv1_params = []
conv2_params = []
for name, parms in net.named_parameters():
    if "convolution1" in name:
        conv1_params += [parms]
    else:
        conv2_params += [parms]

# Then do the following in the optimizer:
optimizer = optim.Adam(
    [
        {"params": conv1_params, 'lr': 0.01},
        {"params": conv2_params, 'lr': 0.001},
    ],
    weight_decay=1e-3,
)

We split the model's parameters into two lists; each list corresponds to one of the dictionaries above, and each dictionary sets its own learning rate. Options that are shared by both groups, such as weight_decay above, are passed outside the list as global arguments.

A global learning rate can also be set outside the list. A group whose dictionary specifies its own learning rate uses that local value; any group that does not falls back to the global one.
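A minimal sketch of that fallback behaviour, reusing the two parameter lists from above:

optimizer = optim.Adam(
    [
        {"params": conv1_params, 'lr': 0.01},  # uses its own learning rate
        {"params": conv2_params},              # no local lr, falls back to the global one
    ],
    lr=0.001,           # global learning rate
    weight_decay=1e-3,  # global weight decay, shared by both groups
)

print([group['lr'] for group in optimizer.param_groups])  # [0.01, 0.001]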

10. Model Related Operations

This content is quite extensive, so I wrote an article: https://zhuanlan.zhihu.com/p/73893187
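The linked article covers this in depth; as a quick reminder, here is a sketch of the most common such operation, saving and loading a state_dict (the file name and the Network class are placeholders):

import torch

# Save only the parameters (the recommended way)
torch.save(net.state_dict(), 'checkpoint.pth')

# Load them back into a model with the same structure
net = Network()                                   # placeholder: your model class
net.load_state_dict(torch.load('checkpoint.pth'))
net.eval()                                        # switch to evaluation mode before inference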

11. Built-in One Hot Function in PyTorch

Thanks to @yangyangyang for the addition: After PyTorch 1.1, one-hot can be directly used with torch.nn.functional.one_hot.

Then I upgraded PyTorch to version 1.2 and tried the one-hot function, which is indeed very convenient.

The specific usage is as follows:

import torch.nn.functional as F
import torch
tensor =  torch.arange(0, 5) % 3  # tensor([0, 1, 2, 0, 1])
one_hot = F.one_hot(tensor)
# Output:
# tensor([[1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1],
#         [1, 0, 0],
#         [0, 1, 0]])

F.one_hot will automatically detect the number of different categories and generate the corresponding one-hot encoding. We can also specify the number of categories ourselves:

tensor =  torch.arange(0, 5) % 3  # tensor([0, 1, 2, 0, 1])
one_hot = F.one_hot(tensor, num_classes=5)
# Output:
# tensor([[1, 0, 0, 0, 0],
#         [0, 1, 0, 0, 0],
#         [0, 0, 1, 0, 0],
#         [1, 0, 0, 0, 0],
#         [0, 1, 0, 0, 0]])

Command to upgrade PyTorch (CPU version): conda install pytorch torchvision -c pytorch

(Hope upgrading PyTorch won’t affect the project code)

12. Network Parameter Initialization

Initializing a neural network is an important foundational step in the training process; it significantly affects the model’s performance, convergence, and convergence speed.

The following introduces two commonly used initialization operations.

(1) Use the built-in torch.nn.init method in PyTorch.

Common initialization operations, such as normal distribution, uniform distribution, xavier initialization, kaiming initialization, etc., have been implemented and can be used directly. For details, see the Chinese documentation for torch.nn.init in PyTorch.

from torch.nn import init
init.xavier_uniform_(net1[0].weight)  # the trailing underscore marks the in-place version
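A short sketch of applying the built-in initializers to a whole model (here net1 is assumed to be an nn.Sequential, matching the indexing in the line above):

import torch.nn as nn
from torch.nn import init

net1 = nn.Sequential(nn.Linear(30, 40), nn.ReLU(), nn.Linear(40, 10))

for layer in net1.modules():
    if isinstance(layer, nn.Linear):
        init.kaiming_normal_(layer.weight)  # Kaiming (He) normal initialization
        init.constant_(layer.bias, 0)       # zero the biases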

(2) For more flexible initialization methods, you can use numpy.

For custom initialization schemes, tensor operations are sometimes not as flexible as numpy, so you can implement the initialization in numpy and then convert the result to a tensor.

for layer in net1.modules():
    if isinstance(layer, nn.Linear):  # check whether it is a linear layer
        param_shape = layer.weight.shape
        # normal distribution with mean 0 and standard deviation 0.5
        layer.weight.data = torch.from_numpy(
            np.random.normal(0, 0.5, size=param_shape)
        ).float()  # cast back to float32 to match the layer's default dtype

13. Load Built-in Pre-trained Models

The torchvision.models submodule contains the following models:

  • AlexNet
  • VGG
  • ResNet
  • SqueezeNet
  • DenseNet

The method to import these models is:

import torchvision.models as models
resnet18 = models.resnet18()
alexnet = models.alexnet()
vgg16 = models.vgg16()

A very important parameter is pretrained, which defaults to False, indicating that only the structure of the model is imported, and the weights are randomly initialized.

If pretrained is set to True, it indicates that the model pre-trained on the ImageNet dataset is imported.

import torchvision.models as models
resnet18 = models.resnet18(pretrained=True)
alexnet = models.alexnet(pretrained=True)
vgg16 = models.vgg16(pretrained=True)

For more models, see: https://pytorch-cn.readthedocs.io/zh/latest/torchvision/torchvision-models/

