PyTorch Tricks Compilation

Author | z.defying@Zhihu
Source | https://zhuanlan.zhihu.com/p/76459295
Editor | Jishi Platform
Shared for academic purposes only; please contact us for removal in case of infringement.

Table of Contents

1. Specify GPU ID
2. View model output details for each layer
3. Gradient Clipping
4. Expand image dimensions
5. One-hot encoding
6. Prevent out-of-memory during model validation
7. Learning rate decay
8. Freeze parameters of certain layers
9. Use different learning rates for different layers
10. Model-related operations
11. Built-in one-hot function in PyTorch
12. Initialize network parameters
13. Load built-in pre-trained models

1. Specify GPU ID

  • Set the current GPU device to only device 0, named /gpu:0: os.environ["CUDA_VISIBLE_DEVICES"] = "0"
  • Set the current GPU devices to devices 0 and 1, named /gpu:0 and /gpu:1: os.environ["CUDA_VISIBLE_DEVICES"] = "0,1", indicating that device 0 is used first, then device 1.

The command to specify the GPU must be placed before a series of operations related to the neural network.
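
A minimal sketch of this pattern (the Linear layer here is only a stand-in for whatever model you actually use):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before any CUDA-related code runs

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(10, 2).to(device)  # any model; a small Linear layer is used here just for illustration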

2. View model output details for each layer

Keras has a concise API to view the output size of each layer of the model, which is very useful for debugging the network. This functionality can now also be achieved in PyTorch.

It is very simple to use, as shown below:

from torchsummary import summary
summary(your_model, input_size=(channels, H, W))

input_size should be set to the input size expected by your own network.
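
For example, a minimal sketch assuming the torchsummary package is installed (pip install torchsummary), a GPU is available, and a torchvision ResNet-18 is used as the model:

import torchvision.models as models
from torchsummary import summary

model = models.resnet18().cuda()          # torchsummary runs a forward pass on the GPU by default
summary(model, input_size=(3, 224, 224))  # prints each layer's output shape and parameter count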

3. Gradient Clipping

import torch.nn as nn
outputs = model(data)
loss = loss_fn(outputs, target)
optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
optimizer.step()

The parameters of nn.utils.clip_grad_norm_ are:

  • parameters – an iterable of Tensors (typically model.parameters()) whose gradients will be normalized
  • max_norm – The maximum norm of the gradient
  • norm_type – Specifies the type of norm, default is L2

@不椭的椭圆 pointed out that gradient clipping can add significant computation time on some tasks; see the comments on the original post for details.

4. Expand image dimensions

During training the data generally has shape (batch_size, c, h, w), but at test time a single image is often fed in, so its dimensions need to be expanded. There are several ways to do this:

import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.view(1, *image.size())
print(img.size())
# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])

Or

import cv2
import numpy as np
image = cv2.imread(img_path)
print(image.shape)
img = image[np.newaxis, :, :, :]
print(img.shape)
# output:
# (h, w, c)
# (1, h, w, c)

Or (thanks to @coldleaf for the addition)

import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.unsqueeze(dim=0)
print(img.size())
img = img.squeeze(dim=0)
print(img.size())
# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])
# torch.Size([h, w, c])

tensor.unsqueeze(dim): Expands the dimension, dim specifies which dimension to expand.

tensor.squeeze(dim): Removes the dimension specified by dim if its size is 1; if that dimension's size is greater than 1, squeeze() has no effect. If dim is not specified, all dimensions of size 1 are removed.

5. One-hot encoding

PyTorch's cross-entropy loss function takes class-index labels directly, so there is no need to convert them to one-hot manually. However, when using MSE loss, the labels do need to be converted to one-hot encoding by hand.

import torch
class_num = 8
batch_size = 4
def one_hot(label):
    """
    Convert a one-dimensional list to one-hot encoding
    """
    label = label.resize_(batch_size, 1)
    m_zeros = torch.zeros(batch_size, class_num)
    # Take values from value, and assign to the corresponding position based on dim and index
    onehot = m_zeros.scatter_(1, label, 1)  # (dim,index,value)
    return onehot.numpy()  # Tensor -> Numpy
label = torch.LongTensor(batch_size).random_() % class_num  # Take remainder of random number
print(one_hot(label))
# output:
# [[0. 0. 0. 1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0.]]

Note: item 11 below describes a simpler method.

6. Prevent out-of-memory during model validation

During model validation, gradient calculation is not needed, so turning off autograd can improve speed and save memory. If not turned off, it may cause out-of-memory errors.

with torch.no_grad():
    # Code for predicting using the model
    pass

Thanks to @zhaz for the reminder; I have updated the explanation of why torch.cuda.empty_cache() is used.

This was the original answer:

During PyTorch training, unused temporary variables may accumulate and lead to out-of-memory errors; the following statement can be used to clean them up.

The official documentation of torch.cuda.empty_cache() explains:

Releases all unoccupied cached memory currently held by the caching allocator so that those can be used in other GPU applications and visible in nvidia-smi.

This means that PyTorch's caching allocator pre-allocates a certain amount of GPU memory; even if tensors have not actually used all of it, that memory cannot be used by other applications. The allocation is triggered by the first CUDA memory access.

The role of torch.cuda.empty_cache() is to release the cached GPU memory that the caching allocator currently holds but does not occupy, so that it can be used by other GPU applications and becomes visible in nvidia-smi. Note that this call does not release memory occupied by tensors.

Data variables that are no longer referenced are automatically reclaimed by PyTorch, which releases the corresponding memory.
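
Putting both points together, a minimal sketch of a validation loop, assuming `model` and `val_loader` are already defined and a GPU is in use:

import torch

model.eval()               # switch layers such as Dropout/BatchNorm to evaluation mode
with torch.no_grad():      # no gradient bookkeeping during validation
    for inputs, targets in val_loader:
        outputs = model(inputs.cuda())
        # ... compute metrics here ...

torch.cuda.empty_cache()   # release unoccupied cached memory after validation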

For more detailed optimizations, see Memory Optimization and Memory Utilization Issues.

7. Learning rate decay

import torch.optim as optim
from torch.optim import lr_scheduler

# Initialization before training
optimizer = optim.Adam(net.parameters(), lr=0.001)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # Every 10 epochs, multiply the learning rate by 0.1

# During training
for epoch in range(n_epoch):
    ...                # training code for one epoch
    scheduler.step()   # update the learning rate once per epoch

You can check the value of the learning rate at any time: optimizer.param_groups[0]['lr'].

There are other ways to update the learning rate:

1. Custom update formula:

scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch:1/(epoch+1))

2. Update learning rate without relying on epoch:

lr_scheduler.ReduceLROnPlateau() provides a way to dynamically reduce the learning rate based on some measurement during training; its parameters are explained in the official documentation. One reminder: whether to use mode='min' or mode='max' depends on whether you monitor a loss or an accuracy, i.e. whether you call scheduler.step(loss) or scheduler.step(acc).
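
A minimal sketch of ReduceLROnPlateau, reusing optimizer from above and assuming a hypothetical validate() function that returns the validation loss each epoch:

scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)
for epoch in range(n_epoch):
    # ... one epoch of training ...
    val_loss = validate(net)       # hypothetical function returning the validation loss
    scheduler.step(val_loss)       # the learning rate is reduced when val_loss stops improving for `patience` epochs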

8. Freeze parameters of certain layers

Reference: https://www.zhihu.com/question/311095447/answer/589307812

When loading a pre-trained model, we sometimes want to freeze the first few layers so that their parameters do not change during training.

We need to know the name of each layer, which can be printed using the following code:

net = Network()  # Get custom network structure
for name, value in net.named_parameters():
    print('name: {0},\t grad: {1}'.format(name, value.requires_grad))

Assuming the information for the first few layers is as follows:

name: cnn.VGG_16.convolution1_1.weight, grad: True
name: cnn.VGG_16.convolution1_1.bias, grad: True
name: cnn.VGG_16.convolution1_2.weight, grad: True
name: cnn.VGG_16.convolution1_2.bias, grad: True
name: cnn.VGG_16.convolution2_1.weight, grad: True
name: cnn.VGG_16.convolution2_1.bias, grad: True
name: cnn.VGG_16.convolution2_2.weight, grad: True
name: cnn.VGG_16.convolution2_2.bias, grad: True

The True at the end indicates that the parameters of this layer are trainable. Then we define a list of layers to freeze:

no_grad = [
    'cnn.VGG_16.convolution1_1.weight',
    'cnn.VGG_16.convolution1_1.bias',
    'cnn.VGG_16.convolution1_2.weight',
    'cnn.VGG_16.convolution1_2.bias']

The freezing method is as follows:

net = Net.CTPN()  # Get network structure
for name, value in net.named_parameters():
    if name in no_grad:
        value.requires_grad = False
    else:
        value.requires_grad = True

After freezing, we print the information of each layer again:

name: cnn.VGG_16.convolution1_1.weight, grad: False
name: cnn.VGG_16.convolution1_1.bias, grad: False
name: cnn.VGG_16.convolution1_2.weight, grad: False
name: cnn.VGG_16.convolution1_2.bias, grad: False
name: cnn.VGG_16.convolution2_1.weight, grad: True
name: cnn.VGG_16.convolution2_1.bias, grad: True
name: cnn.VGG_16.convolution2_2.weight, grad: True
name: cnn.VGG_16.convolution2_2.bias, grad: True

As can be seen, the requires_grad of the weight and bias of the first two layers is False, indicating that they are not trainable.

Finally, when defining the optimizer, only update the parameters of layers where requires_grad is True.

optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.01)

9. Use different learning rates for different layers

We use different learning rates for different layers of the model.

Using this model as an example:

net = Network()  # Get custom network structure
for name, value in net.named_parameters():
    print('name: {}'.format(name))
# Output:
# name: cnn.VGG_16.convolution1_1.weight
# name: cnn.VGG_16.convolution1_1.bias
# name: cnn.VGG_16.convolution1_2.weight
# name: cnn.VGG_16.convolution1_2.bias
# name: cnn.VGG_16.convolution2_1.weight
# name: cnn.VGG_16.convolution2_1.bias
# name: cnn.VGG_16.convolution2_2.weight
# name: cnn.VGG_16.convolution2_2.bias

Set different learning rates for convolution1 and convolution2 by first separating them into different lists:

conv1_params = []
conv2_params = []
for name, parms in net.named_parameters():
    if "convolution1" in name:
        conv1_params += [parms]
    else:
        conv2_params += [parms]
# Then perform the following operation in the optimizer:
optimizer = optim.Adam(
    [
        {"params": conv1_params, 'lr': 0.01},
        {"params": conv2_params, 'lr': 0.001},
    ],
    weight_decay=1e-3,)

We split the model's parameters into two lists; each list corresponds to one of the dictionaries above, where its own learning rate is set. Options shared by both groups, such as weight_decay above, are placed outside the list as global parameters.

You can also set a global learning rate outside the list: where a group's dictionary sets a local learning rate, the local value is used; otherwise the global learning rate applies.
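
For example, a minimal sketch reusing the conv1_params and conv2_params lists from above, with a global learning rate set outside the list:

optimizer = optim.Adam(
    [
        {"params": conv1_params, 'lr': 0.01},  # uses its own local learning rate
        {"params": conv2_params},              # falls back to the global lr below
    ],
    lr=0.001,           # global learning rate
    weight_decay=1e-3,  # global weight decay shared by both groups
)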

10. Model-related operations

This topic is quite extensive, so I wrote it up in a separate article: https://zhuanlan.zhihu.com/p/73893187

11. Built-in one-hot function in PyTorch

Thanks to @yangyangyang for the addition: After PyTorch 1.1, one-hot can be directly used with torch.nn.functional.one_hot.

Then I upgraded PyTorch to version 1.2, tried the one-hot function, and it is indeed very convenient.

The specific usage is as follows:

import torch.nn.functional as F
import torch
tensor = torch.arange(0, 5) % 3  # tensor([0, 1, 2, 0, 1])
one_hot = F.one_hot(tensor)
# Output:
# tensor([[1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1],
#         [1, 0, 0],
#         [0, 1, 0]])

F.one_hot will automatically detect the number of different categories and generate corresponding one-hot encoding. We can also specify the number of categories:

tensor = torch.arange(0, 5) % 3  # tensor([0, 1, 2, 0, 1])
one_hot = F.one_hot(tensor, num_classes=5)
# Output:
# tensor([[1, 0, 0, 0, 0],
#         [0, 1, 0, 0, 0],
#         [0, 0, 1, 0, 0],
#         [1, 0, 0, 0, 0],
#         [0, 1, 0, 0, 0]])
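
If the one-hot result is then fed into a loss such as MSE (see item 5), note that F.one_hot returns an integer tensor, so it must first be cast to float. A minimal sketch, where predictions stands for a hypothetical model output of shape (3, 3):

labels = torch.tensor([0, 2, 1])
targets = F.one_hot(labels, num_classes=3).float()  # cast to float before computing MSE
loss = F.mse_loss(predictions, targets)             # `predictions`: a hypothetical (3, 3) float tensor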

Command to upgrade PyTorch (cpu version): conda install pytorch torchvision -c pytorch

(I hope the PyTorch upgrade won’t affect project code)

12. Initialize network parameters

Initialization of neural networks is an important foundational step in the training process, which can significantly impact the model’s performance, convergence, and convergence speed.

Below are two commonly used initialization operations.

(1) Use PyTorch’s built-in torch.nn.init method.

Common initialization operations, such as normal distribution, uniform distribution, xavier initialization, kaiming initialization, etc., have been implemented and can be used directly. For details, see the Chinese documentation for torch.nn.init in PyTorch.

from torch.nn import init
init.xavier_uniform_(net1[0].weight)  # xavier_uniform is deprecated in favor of xavier_uniform_
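
A minimal sketch of applying these built-in initializers across a whole network, assuming net1 is an nn.Module as above; Kaiming initialization of the convolutional layers is used here as an example:

import torch.nn as nn
from torch.nn import init

for m in net1.modules():
    if isinstance(m, nn.Conv2d):
        init.kaiming_normal_(m.weight, nonlinearity='relu')  # Kaiming (He) initialization for ReLU networks
        if m.bias is not None:
            init.constant_(m.bias, 0)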

(2) For some more flexible initialization methods, you can use numpy.

For custom initialization schemes, tensors are sometimes not as powerful and flexible as numpy, so the initialization can be implemented with numpy and the result converted back to a tensor.

import numpy as np
import torch
import torch.nn as nn

for layer in net1.modules():
    if isinstance(layer, nn.Linear):  # Check whether it is a linear layer
        param_shape = layer.weight.shape
        layer.weight.data = torch.from_numpy(np.random.normal(0, 0.5, size=param_shape)).float()  # Normal distribution with mean 0 and std 0.5; cast to float32

13. Load built-in pre-trained models

The torchvision.models module provides, among others, the following models:

  • AlexNet
  • VGG
  • ResNet
  • SqueezeNet
  • DenseNet

The method to import these models is:

import torchvision.models as models
resnet18 = models.resnet18()
alexnet = models.alexnet()
vgg16 = models.vgg16()

An important parameter is pretrained, which defaults to False, meaning that only the model structure is created and its weights are randomly initialized.

If pretrained is set to True, it indicates that the model pre-trained on the ImageNet dataset is imported.

import torchvision.models as models
resnet18 = models.resnet18(pretrained=True)
alexnet = models.alexnet(pretrained=True)
vgg16 = models.vgg16(pretrained=True)

For more models, see: https://pytorch-cn.readthedocs.io/zh/latest/torchvision/torchvision-models/
