Collection of PyTorch Tricks

Author: z.defying

https://zhuanlan.zhihu.com/p/76459295

This article is authorized by the author, and reprinting is not allowed without permission

Table of Contents:

  1. Specify GPU ID

  2. View details of each layer output of the model

  3. Gradient Clipping

  4. Expand the dimension of a single image

  5. One-Hot Encoding

  6. Prevent out of memory errors when validating the model

  7. Learning Rate Decay

  8. Freeze parameters of certain layers

  9. Use different learning rates for different layers

1. Specify GPU ID

  • Use only device 0: os.environ["CUDA_VISIBLE_DEVICES"] = "0". Inside the process, this GPU is visible to PyTorch as cuda:0.

  • Use both devices 0 and 1: os.environ["CUDA_VISIBLE_DEVICES"] = "0,1". The two GPUs are visible as cuda:0 and cuda:1, in the order listed.

Set this variable before CUDA is initialized, that is, before any GPU-related code runs. The safest place is the top of the script, ahead of import torch.
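As a minimal sketch of the ordering described above, the variable can be set through a small helper before any GPU code runs. The helper name select_gpus is hypothetical and not part of PyTorch.

```python
import os

def select_gpus(gpu_ids):
    """Restrict this process to the given physical GPU ids (hypothetical helper)."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Call this at the very top of the script.
visible = select_gpus([0, 1])
print(visible)

# Only import torch and build the model after this point, so that CUDA
# is initialized with the restricted device list.
```

The same effect can be had from the shell by prefixing the launch command with CUDA_VISIBLE_DEVICES=0,1, which avoids any ordering concerns inside the script.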

2. View details of each layer output of the model

Keras has a concise API for viewing the output size of each layer of a model, which is very useful when debugging a network. The third-party torchsummary package now provides the same functionality for PyTorch.

It is very simple to use, as shown below:

from torchsummary import summary
summary(your_model, input_size=(channels, H, W))

input_size should be set according to the input size of your own network model.

https://github.com/sksq96/pytorch-summary
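If installing an extra package is not an option, a rough equivalent can be sketched with forward hooks from plain PyTorch. This is not how torchsummary is implemented; the function layer_output_shapes and the toy network are assumptions made for illustration, and the sketch assumes every leaf module returns a single tensor.

```python
import torch
import torch.nn as nn

def layer_output_shapes(model, input_size):
    """Return (layer name, output shape) pairs for every leaf module (hypothetical helper)."""
    shapes, hooks = [], []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            def hook(mod, inputs, output, name=name):
                shapes.append((name, tuple(output.shape)))
            hooks.append(module.register_forward_hook(hook))

    was_training = model.training
    model.eval()
    with torch.no_grad():
        model(torch.zeros(1, *input_size))  # dummy batch of size 1
    model.train(was_training)

    for h in hooks:
        h.remove()  # leave the model without hooks
    return shapes

toy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 4 * 4, 10),
)
shapes = layer_output_shapes(toy, (3, 4, 4))
for name, shape in shapes:
    print(name, shape)
```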

3. Gradient Clipping

import torch.nn as nn

outputs = model(data)
loss = loss_fn(outputs, target)
optimizer.zero_grad()
loss.backward()
# Clip after backward() and before step(), so the optimizer sees the clipped gradients
nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
optimizer.step()

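The parameters of nn.utils.clip_grad_norm_:

  • parameters – An iterable of tensors (or a single tensor) whose gradients will be normalized

  • max_norm – The maximum allowed norm of the gradients; the norm is computed over all gradients together, as if they were concatenated into a single vector

  • norm_type – Specifies the type of norm, default is 2 (the L2 norm)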
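The effect of these parameters can be checked with a small self-contained sketch. The toy model, the data, and the value max_norm=1.0 are assumptions chosen so that clipping actually triggers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

data = torch.randn(8, 4) * 100  # large inputs to provoke large gradients
target = torch.randn(8, 2)

optimizer.zero_grad()
loss = loss_fn(model(data), target)
loss.backward()

# clip_grad_norm_ returns the total gradient norm measured before clipping
norm_before = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2)

# Recompute the total L2 norm over all parameter gradients after clipping
norm_after = torch.norm(torch.stack([p.grad.norm(2) for p in model.parameters()]), 2)
optimizer.step()

print(float(norm_before), float(norm_after))
```

The gradients are rescaled in place by a common factor, so their direction is preserved and only the overall magnitude changes.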

4. Expand the dimension of a single image

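Training data usually has shape (batch_size, c, h, w), but at test time a single image is often fed to the model, so a batch dimension must be added. Note that cv2.imread returns an array in (h, w, c) order, so the examples below produce (1, h, w, c); a model that expects (batch_size, c, h, w) also needs the channel axis moved to the front. There are several ways to add the dimension: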

import cv2
import torch

image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())

img = image.view(1, *image.size())
print(img.size())

# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])

Or

import cv2
import numpy as np

image = cv2.imread(img_path)
print(image.shape)
img = image[np.newaxis, :, :, :]
print(img.shape)

# output:
# (h, w, c)
# (1, h, w, c)

Or (thanks to Zhihu user coldleaf for the addition)

import cv2
import torch

image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())

img = image.unsqueeze(dim=0)  
print(img.size())

img = img.squeeze(dim=0)
print(img.size())

# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])
# torch.Size([h, w, c])

tensor.unsqueeze(dim): Insert a new dimension of size 1 at position dim.

tensor.squeeze(dim): Remove the dimension at position dim if its size is 1; if its size is greater than 1, squeeze() has no effect. When dim is not specified, all dimensions of size 1 are removed.
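Putting the pieces together, a sketch that turns an (h, w, c) image into a (1, c, h, w) batch might look like the following. A zero-filled NumPy array stands in for the result of cv2.imread, and the image size 32 x 48 is an arbitrary assumption.

```python
import numpy as np
import torch

# Stand-in for cv2.imread(img_path): an (h, w, c) uint8 array
image = np.zeros((32, 48, 3), dtype=np.uint8)

tensor = torch.from_numpy(image)

# (h, w, c) -> (c, h, w) -> (1, c, h, w), converted to float for the model
batch = tensor.permute(2, 0, 1).unsqueeze(0).float()
print(batch.shape)

# squeeze(0) removes the batch dimension again
restored = batch.squeeze(0)
print(restored.shape)

# squeeze on a dimension whose size is not 1 leaves the shape unchanged
unchanged = batch.squeeze(1)
print(unchanged.shape)
```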

5. One-Hot Encoding

PyTorch's cross-entropy loss function accepts class indices directly as labels, so there is no need to convert them to one-hot manually. Using MSE, on the other hand, requires manual conversion to one-hot encoding.

import torch

class_num = 8
batch_size = 4

def one_hot(label):
    """
    Convert a 1-D tensor of class indices to a one-hot NumPy array
    """
    label = label.view(-1, 1)  # (batch_size,) -> (batch_size, 1), without modifying the input in place
    m_zeros = torch.zeros(label.size(0), class_num)
    # scatter_(dim, index, value): in each row, write value at the column given by index
    onehot = m_zeros.scatter_(1, label, 1)

    return onehot.numpy()  # Tensor -> NumPy

label = torch.randint(0, class_num, (batch_size,))  # Random class indices in [0, class_num)
print(one_hot(label))

# output (the labels are random, so the rows vary between runs):
# [[0. 0. 0. 1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0.]]

https://discuss.pytorch.org/t/convert-int-into-one-hot-format/507/3
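Recent PyTorch versions also ship torch.nn.functional.one_hot, which can replace the hand-written scatter_ approach. The sketch below is an alternative to the function above, not the author's method; the label values are chosen to match the sample output.

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([3, 4, 2, 1])  # class indices, dtype int64

# one_hot returns an integer tensor; cast to float for use with MSE
onehot = F.one_hot(labels, num_classes=8).float()
print(onehot)
```

Passing num_classes explicitly keeps the width fixed even when the highest class does not occur in the batch.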

6. Prevent out of memory errors when validating the model

When validating the model, there is no need to compute gradients, so turning off autograd can improve speed and save memory. If not turned off, it may cause out of memory errors.

with torch.no_grad():
    # Code for making predictions using the model
    pass
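A fuller sketch of a validation pass is shown here, with a toy linear model and random batches standing in for a real network and data loader. Calling model.eval() is a common companion to no_grad, since it switches layers such as dropout and batch normalization to inference behavior.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                              # stand-in for a trained network
batches = [torch.randn(8, 4) for _ in range(3)]      # stand-in for a validation loader

model.eval()
predictions = []
with torch.no_grad():
    for batch in batches:
        last_output = model(batch)                   # no autograd graph is recorded here
        predictions.append(last_output.argmax(dim=1))
predictions = torch.cat(predictions)

print(predictions.shape, last_output.requires_grad)
```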

Thanks to Zhihu user zhaz for the reminder, I updated the reason for using torch.cuda.empty_cache().

This was the original answer:

PyTorch may accumulate unnecessary temporary variables during training, leading to out-of-memory errors, and torch.cuda.empty_cache() can be used to clean up these unnecessary variables.

The explanation on the official website is:

Releases all unoccupied cached memory currently held by the caching allocator so that those can be used in other GPU application and visible in nvidia-smi.

torch.cuda.empty_cache()

This means that PyTorch's caching allocator pre-allocates some memory and holds on to it. Even if tensors are not actually using all of this memory, other applications cannot use it. This allocation process is triggered by the first CUDA memory access.

The role of torch.cuda.empty_cache() is to release the cached memory that the caching allocator currently holds but that is not occupied, so that it can be used by other GPU applications and is visible through the nvidia-smi command. Note that this command does not release memory occupied by tensors.

For data variables that are no longer used, PyTorch automatically recycles and releases the corresponding memory.
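The difference between memory occupied by tensors and memory merely held by the caching allocator can be observed with torch.cuda.memory_allocated() and torch.cuda.memory_reserved(). The helper gpu_memory_report below is hypothetical, and the sketch simply returns None on a machine without CUDA.

```python
import torch

def gpu_memory_report():
    """Return allocated/reserved bytes on the current GPU, or None without CUDA (hypothetical helper)."""
    if not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated(),  # bytes occupied by live tensors
        "reserved": torch.cuda.memory_reserved(),    # bytes held by the caching allocator
    }

if torch.cuda.is_available():
    x = torch.zeros(1024, 1024, device="cuda")
    del x                                # the tensor is gone, but its block stays cached
    before = gpu_memory_report()
    torch.cuda.empty_cache()             # hand unoccupied cached blocks back to the driver
    after = gpu_memory_report()
else:
    before = after = gpu_memory_report()

print(before, after)
```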

For more detailed optimizations, see Optimizing Memory Usage and Memory Utilization Issues.

7. Learning Rate Decay

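import torch.optim as optim
from torch.optim import lr_scheduler

# Initialization before training
optimizer = optim.Adam(net.parameters(), lr=0.001)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # Every 10 epochs, multiply the learning rate by 0.1

# During training
for epoch in range(n_epoch):
    ...  # Train for one epoch, calling optimizer.step() for each batch
    scheduler.step()  # Since PyTorch 1.1.0, call this after the epoch's optimizer.step() calls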
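To see the schedule in action, the following sketch records the learning rate at the start of each epoch. The toy network, the single random batch per epoch, and the 25-epoch run are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

net = nn.Linear(4, 2)
optimizer = optim.Adam(net.parameters(), lr=0.001)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

lrs = []
for epoch in range(25):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used in this epoch

    loss = net(torch.randn(8, 4)).pow(2).mean()  # one dummy batch per epoch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    scheduler.step()  # advance the schedule once per epoch, after optimizer.step()

print(lrs[0], lrs[10], lrs[20])
```

With these settings, epochs 0 to 9 run at 0.001, epochs 10 to 19 at 0.0001, and epochs 20 onward at 0.00001 (up to floating-point rounding).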

8. Freeze parameters of certain layers

Reference: Freezing a Certain Layer of a Pre-trained Model in Pytorch

When loading a pre-trained model, sometimes we want to freeze the first few layers so that their parameters do not change during training.

We first need to know the name of each parameter, which can be printed with the following code:

net = Network()  # Get custom network structure
for name, value in net.named_parameters():
    print('name: {0},	 grad: {1}'.format(name, value.requires_grad))

Assuming the information of the first few layers is as follows:

name: cnn.VGG_16.convolution1_1.weight,     grad: True
name: cnn.VGG_16.convolution1_1.bias,     grad: True
name: cnn.VGG_16.convolution1_2.weight,     grad: True
name: cnn.VGG_16.convolution1_2.bias,     grad: True
name: cnn.VGG_16.convolution2_1.weight,     grad: True
name: cnn.VGG_16.convolution2_1.bias,     grad: True
name: cnn.VGG_16.convolution2_2.weight,     grad: True
name: cnn.VGG_16.convolution2_2.bias,     grad: True

The True at the end indicates that the parameter is trainable. Next, we define a list of the parameters to be frozen:

no_grad = [
    'cnn.VGG_16.convolution1_1.weight',
    'cnn.VGG_16.convolution1_1.bias',
    'cnn.VGG_16.convolution1_2.weight',
    'cnn.VGG_16.convolution1_2.bias'
]

The freezing method is as follows:

net = Net.CTPN()  # Get network structure
for name, value in net.named_parameters():
    if name in no_grad:
        value.requires_grad = False
    else:
        value.requires_grad = True

After freezing, we print the information of each layer again:

name: cnn.VGG_16.convolution1_1.weight,     grad: False
name: cnn.VGG_16.convolution1_1.bias,     grad: False
name: cnn.VGG_16.convolution1_2.weight,     grad: False
name: cnn.VGG_16.convolution1_2.bias,     grad: False
name: cnn.VGG_16.convolution2_1.weight,     grad: True
name: cnn.VGG_16.convolution2_1.bias,     grad: True
name: cnn.VGG_16.convolution2_2.weight,     grad: True
name: cnn.VGG_16.convolution2_2.bias,     grad: True

It can be seen that the weight and bias of the first two layers have their requires_grad set to False, indicating that they are not trainable.

Finally, when defining the optimizer, only the parameters of layers where requires_grad is True are updated.

optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.01)
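The whole procedure can be verified end to end on a toy model. In this sketch the two-layer network and the parameter names "0.weight" and "0.bias" are assumptions standing in for the VGG layers above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first linear layer by parameter name
no_grad = ["0.weight", "0.bias"]
for name, param in net.named_parameters():
    param.requires_grad = name not in no_grad

optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.01)

frozen_before = net[0].weight.detach().clone()
trainable_before = net[2].bias.detach().clone()

# One training step on random data
loss = net(torch.randn(16, 4)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

frozen_unchanged = torch.equal(net[0].weight, frozen_before)
trainable_changed = not torch.equal(net[2].bias, trainable_before)
print(frozen_unchanged, trainable_changed)
```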

9. Use different learning rates for different layers

We will use different learning rates for different layers of the model.

Still using this model as an example:

net = Network()  # Get custom network structure
for name, value in net.named_parameters():
    print('name: {}'.format(name))

# Output:
# name: cnn.VGG_16.convolution1_1.weight
# name: cnn.VGG_16.convolution1_1.bias
# name: cnn.VGG_16.convolution1_2.weight
# name: cnn.VGG_16.convolution1_2.bias
# name: cnn.VGG_16.convolution2_1.weight
# name: cnn.VGG_16.convolution2_1.bias
# name: cnn.VGG_16.convolution2_2.weight
# name: cnn.VGG_16.convolution2_2.bias

To set different learning rates for convolution1 and convolution2, first separate their parameters into different lists:

conv1_params = []
conv2_params = []

for name, params in net.named_parameters():
    if "convolution1" in name:
        conv1_params += [params]
    else:
        conv2_params += [params]  # Every other parameter falls into the second group

# Then perform the following operation in the optimizer:
optimizer = optim.Adam(
    [
        {"params": conv1_params, 'lr': 0.01},
        {"params": conv2_params, 'lr': 0.001},
    ],
    weight_decay=1e-3,
)

We divide the model into two parts and place them in a list. Each part corresponds to one dictionary in the list, and each dictionary sets its own learning rate. Options that are the same for both parts, such as weight_decay above, go outside the list as global parameters.

You can also set a global learning rate outside the list. A part whose dictionary sets a local learning rate uses that rate; a part whose dictionary does not set one falls back to the global learning rate.
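The fallback behavior can be confirmed by inspecting optimizer.param_groups. The sketch below uses a toy two-layer network in place of the VGG model; the split by the name prefix "0." is an assumption tied to that toy network.

```python
import torch.nn as nn
import torch.optim as optim

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

first_params, other_params = [], []
for name, param in net.named_parameters():
    if name.startswith("0."):
        first_params.append(param)
    else:
        other_params.append(param)

optimizer = optim.Adam(
    [
        {"params": first_params, "lr": 0.01},  # local learning rate
        {"params": other_params},              # no local rate: uses the global one
    ],
    lr=0.001,           # global learning rate
    weight_decay=1e-3,  # global option shared by both groups
)

group_lrs = [group["lr"] for group in optimizer.param_groups]
group_wds = [group["weight_decay"] for group in optimizer.param_groups]
print(group_lrs, group_wds)
```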
