1. Specify GPU ID
- Set the current GPU device to device 0 only, named /gpu:0: os.environ["CUDA_VISIBLE_DEVICES"] = "0"
- Set the current GPU devices to 0 and 1, named /gpu:0 and /gpu:1: os.environ["CUDA_VISIBLE_DEVICES"] = "0,1", indicating a preference for device 0, then device 1.
The statement that specifies the GPU must be placed before any operations related to the neural network.
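For example, a minimal sketch that sets the visible devices at the very top of the script, before any CUDA work is done:
import os
# Must be set before the first CUDA call, otherwise it has no effect
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
import torch
# Inside the process, the visible devices are re-indexed as cuda:0 and cuda:1
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")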
2. View model output details for each layer
Keras has a concise API to view the output size of each layer of the model, which is very useful for debugging the network. This functionality can now also be achieved in PyTorch.
It is very simple to use, as shown below:
from torchsummary import summary
summary(your_model, input_size=(channels, H, W))
input_size is set according to the input size of your own network model.
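As a concrete illustration, a sketch with a torchvision VGG-16 (this assumes torchsummary is installed, e.g. via pip install torchsummary; on a CPU-only machine you may also need to pass device='cpu', depending on the torchsummary version):
import torch
from torchvision import models
from torchsummary import summary

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vgg = models.vgg16().to(device)

# Prints every layer's output shape and parameter count for a 3x224x224 input
summary(vgg, input_size=(3, 224, 224))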
3. Gradient Clipping
import torch.nn as nn
outputs = model(data)
loss = loss_fn(outputs, target)
optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
optimizer.step()
The parameters of nn.utils.clip_grad_norm_ are:
- parameters – an iterable of Tensors whose gradients will be normalized
- max_norm – the maximum norm of the gradients
- norm_type – the type of norm to use, default is L2
@不椭的椭圆 pointed out that gradient clipping can consume a significant amount of computation time on some tasks; see the comments for details.
4. Expand image dimensions
Since the data dimensions during training are generally (batch_size, c, h, w), but only one image is input during testing, it is necessary to expand the dimensions. There are multiple methods to expand dimensions:
import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.view(1, *image.size())
print(img.size())
# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])
Or
import cv2
import numpy as np
image = cv2.imread(img_path)
print(image.shape)
img = image[np.newaxis, :, :, :]
print(img.shape)
# output:
# (h, w, c)
# (1, h, w, c)
Or (thanks to @coldleaf for the addition)
import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.unsqueeze(dim=0)
print(img.size())
img = img.squeeze(dim=0)
print(img.size())
# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])
# torch.Size([h, w, c])
tensor.unsqueeze(dim): adds a dimension of size 1 at position dim.
tensor.squeeze(dim): removes the dimension at position dim if it has size 1; if that dimension's size is greater than 1, squeeze() has no effect. If dim is not specified, all dimensions of size 1 are removed.
5. One-hot encoding
PyTorch's cross-entropy loss (nn.CrossEntropyLoss) takes class-index labels directly, so there is no need to convert them to one-hot manually. When using MSE, however, the labels do need to be converted to one-hot encoding.
import torch
class_num = 8
batch_size = 4
def one_hot(label):
    """
    Convert a 1-D label tensor to one-hot encoding
    """
    label = label.resize_(batch_size, 1)
    m_zeros = torch.zeros(batch_size, class_num)
    # scatter_(dim, index, value): writes value at the positions given by index along dim
    onehot = m_zeros.scatter_(1, label, 1)
    return onehot.numpy()  # Tensor -> NumPy

label = torch.LongTensor(batch_size).random_() % class_num  # random labels in [0, class_num)
print(one_hot(label))
# output:
# [[0. 0. 0. 1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0.]]
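For context, a minimal sketch of how such one-hot labels might be used with an MSE loss (the outputs here are random placeholders, just to show the shapes lining up):
import torch
import torch.nn as nn

batch_size, class_num = 4, 8
mse = nn.MSELoss()

# Placeholder model outputs with shape (batch_size, class_num)
outputs = torch.rand(batch_size, class_num)

# One-hot targets must have the same shape as the outputs
labels = torch.LongTensor(batch_size).random_() % class_num
targets = torch.zeros(batch_size, class_num).scatter_(1, labels.view(-1, 1), 1)

loss = mse(outputs, targets)
print(loss.item())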
Note: item 11 describes a simpler method.
6. Prevent out-of-memory during model validation
During model validation, gradient calculation is not needed, so turning off autograd can improve speed and save memory. If not turned off, it may cause out-of-memory errors.
with torch.no_grad():
    # Code that runs the model for prediction
    pass
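For reference, a slightly fuller sketch of a typical validation pass; model, val_loader, and criterion are assumed to already exist, and model.eval() is included because it switches layers such as dropout and batch norm to evaluation mode:
model.eval()  # put dropout / batch norm layers into evaluation mode
val_loss = 0.0
with torch.no_grad():  # gradients are not tracked, saving memory
    for data, target in val_loader:
        output = model(data)
        val_loss += criterion(output, target).item()
print(val_loss / len(val_loader))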
Thanks to @zhaz for the reminder; I have updated the explanation of why torch.cuda.empty_cache() is used.
This was the original answer:
Unused temporary variables may accumulate during PyTorch training, leading to out of memory errors. The following statement can be used to clean up these unnecessary variables.
The explanation on the official website is:
Releases all unoccupied cached memory currently held by the caching allocator so that those can be used in other GPU applications and visible in nvidia-smi.
torch.cuda.empty_cache()
This means that PyTorch's caching allocator pre-allocates a certain amount of GPU memory; even if the tensors do not actually use all of it, this memory cannot be used by other applications. The allocation is triggered by the first CUDA memory access.
The role of torch.cuda.empty_cache() is to release the cached GPU memory that the caching allocator currently holds but does not actually occupy, so that it can be used by other GPU applications and becomes visible via the nvidia-smi command. Note that this command does not release memory occupied by tensors.
PyTorch automatically reclaims data variables that are no longer referenced and releases the corresponding memory.
For more detailed optimizations, see Memory Optimization and Memory Utilization Issues.
7. Learning rate decay
import torch.optim as optim
from torch.optim import lr_scheduler
# Initialization before training
optimizer = optim.Adam(net.parameters(), lr=0.001)
scheduler = lr_scheduler.StepLR(optimizer, 10, 0.1) # Every 10 epochs, multiply the learning rate by 0.1
# During training
for n in range(n_epoch):
    ...  # training code for one epoch (optimizer.step() calls)
    scheduler.step()  # since PyTorch 1.1, scheduler.step() should be called after optimizer.step()
You can check the value of the learning rate at any time with optimizer.param_groups[0]['lr'].
There are other ways to update the learning rate:
1. Custom update formula:
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch:1/(epoch+1))
2. Update the learning rate without relying on the epoch:
lr_scheduler.ReduceLROnPlateau() dynamically reduces the learning rate based on some measured quantity during training. Its parameters are documented widely; one reminder is that mode='min' or 'max' depends on whether you are optimizing a loss or an accuracy, i.e. whether you call scheduler.step(loss) or scheduler.step(acc), as in the sketch below.
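A minimal sketch of ReduceLROnPlateau (train_one_epoch and validate are hypothetical helpers standing in for your own training and validation code):
optimizer = optim.Adam(net.parameters(), lr=0.001)
# Reduce the lr by a factor of 0.1 once the monitored value stops improving for 5 epochs
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)

for epoch in range(n_epoch):
    train_one_epoch(net, optimizer)  # hypothetical helper: one epoch of training
    val_loss = validate(net)         # hypothetical helper: returns the validation loss
    scheduler.step(val_loss)         # lr is reduced when val_loss stops decreasing
    print(optimizer.param_groups[0]['lr'])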
8. Freeze parameters of certain layers
Reference: https://www.zhihu.com/question/311095447/answer/589307812
When loading a pre-trained model, we sometimes want to freeze the first few layers so that their parameters do not change during training.
We need to know the name of each layer, which can be printed using the following code:
net = Network()  # Get the custom network structure
for name, value in net.named_parameters():
    print('name: {0}, grad: {1}'.format(name, value.requires_grad))
Assuming the information for the first few layers is as follows:
name: cnn.VGG_16.convolution1_1.weight, grad: True
name: cnn.VGG_16.convolution1_1.bias, grad: True
name: cnn.VGG_16.convolution1_2.weight, grad: True
name: cnn.VGG_16.convolution1_2.bias, grad: True
name: cnn.VGG_16.convolution2_1.weight, grad: True
name: cnn.VGG_16.convolution2_1.bias, grad: True
name: cnn.VGG_16.convolution2_2.weight, grad: True
name: cnn.VGG_16.convolution2_2.bias, grad: True
The True at the end indicates that the parameters of this layer are trainable. Then we define a list of layers to freeze:
no_grad = [
    'cnn.VGG_16.convolution1_1.weight',
    'cnn.VGG_16.convolution1_1.bias',
    'cnn.VGG_16.convolution1_2.weight',
    'cnn.VGG_16.convolution1_2.bias',
]
The freezing method is as follows:
net = Net.CTPN()  # Get the network structure
for name, value in net.named_parameters():
    if name in no_grad:
        value.requires_grad = False
    else:
        value.requires_grad = True
After freezing, we print the information of each layer again:
name: cnn.VGG_16.convolution1_1.weight, grad: False
name: cnn.VGG_16.convolution1_1.bias, grad: False
name: cnn.VGG_16.convolution1_2.weight, grad: False
name: cnn.VGG_16.convolution1_2.bias, grad: False
name: cnn.VGG_16.convolution2_1.weight, grad: True
name: cnn.VGG_16.convolution2_1.bias, grad: True
name: cnn.VGG_16.convolution2_2.weight, grad: True
name: cnn.VGG_16.convolution2_2.bias, grad: True
As can be seen, the requires_grad of the weight and bias of the first two layers is now False, indicating that they are not trainable.
Finally, when defining the optimizer, only pass in the parameters whose requires_grad is True:
optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.01)
9. Use different learning rates for different layers
We use different learning rates for different layers of the model.
Using this model as an example:
net = Network()  # Get the custom network structure
for name, value in net.named_parameters():
    print('name: {}'.format(name))
# Output:
# name: cnn.VGG_16.convolution1_1.weight
# name: cnn.VGG_16.convolution1_1.bias
# name: cnn.VGG_16.convolution1_2.weight
# name: cnn.VGG_16.convolution1_2.bias
# name: cnn.VGG_16.convolution2_1.weight
# name: cnn.VGG_16.convolution2_1.bias
# name: cnn.VGG_16.convolution2_2.weight
# name: cnn.VGG_16.convolution2_2.bias
Set different learning rates for convolution1 and convolution2 by first separating them into different lists:
conv1_params = []
conv2_params = []

for name, parms in net.named_parameters():
    if "convolution1" in name:
        conv1_params += [parms]
    else:
        conv2_params += [parms]

# Then perform the following operation in the optimizer:
optimizer = optim.Adam(
    [
        {"params": conv1_params, 'lr': 0.01},
        {"params": conv2_params, 'lr': 0.001},
    ],
    weight_decay=1e-3,
)
We split the model's parameters into two groups stored in a list; each group corresponds to one of the dictionaries above, where a different learning rate is set. Options shared by both groups, such as weight_decay above, are placed outside the list as global settings.
You can also set a global learning rate outside the list; when a group's dictionary specifies a local learning rate, the local one is used, otherwise the global learning rate applies, as in the sketch below.
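For instance, a minimal sketch (reusing conv1_params and conv2_params from above) where the second group falls back to the global learning rate:
optimizer = optim.Adam(
    [
        {"params": conv1_params, 'lr': 0.01},  # local lr overrides the global one
        {"params": conv2_params},              # no local lr, so the global lr below is used
    ],
    lr=0.001,           # global learning rate
    weight_decay=1e-3,  # global weight decay shared by both groups
)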
10. Model-related operations
This topic is quite extensive, so I wrote it up in a separate article: https://zhuanlan.zhihu.com/p/73893187
11. Built-in one-hot function in PyTorch
Thanks to @yangyangyang for the addition: since PyTorch 1.1, one-hot encoding can be done directly with torch.nn.functional.one_hot.
I then upgraded PyTorch to version 1.2 and tried the one-hot function; it is indeed very convenient.
The specific usage is as follows:
import torch.nn.functional as F
import torch
tensor = torch.arange(0, 5) % 3 # tensor([0, 1, 2, 0, 1])
one_hot = F.one_hot(tensor)
# Output:
# tensor([[1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1],
#         [1, 0, 0],
#         [0, 1, 0]])
F.one_hot automatically detects the number of distinct classes and generates the corresponding one-hot encoding. We can also specify the number of classes explicitly:
tensor = torch.arange(0, 5) % 3 # tensor([0, 1, 2, 0, 1])
one_hot = F.one_hot(tensor, num_classes=5)
# Output:
# tensor([[1, 0, 0, 0, 0],
#         [0, 1, 0, 0, 0],
#         [0, 0, 1, 0, 0],
#         [1, 0, 0, 0, 0],
#         [0, 1, 0, 0, 0]])
Command to upgrade PyTorch (CPU version): conda install pytorch torchvision -c pytorch (hopefully the upgrade won't affect existing project code).
12. Initialize network parameters
Initialization of neural networks is an important foundational step in the training process, which can significantly impact the model’s performance, convergence, and convergence speed.
Below are two commonly used initialization operations.
(1) Use PyTorch's built-in torch.nn.init methods.
Common initialization schemes, such as normal, uniform, Xavier, and Kaiming initialization, are already implemented and can be used directly; see the Chinese documentation for torch.nn.init in PyTorch for details. For example, applying Xavier uniform initialization to the weights of the first layer (net1 here is assumed to be an nn.Sequential):
from torch.nn import init
init.xavier_uniform_(net1[0].weight)
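A common pattern is to wrap the init calls in a function and apply it to the whole network with net.apply; a minimal sketch, assuming a network built from nn.Conv2d and nn.Linear layers:
import torch.nn as nn
from torch.nn import init

def init_weights(m):
    # Called once for every submodule of the network
    if isinstance(m, nn.Conv2d):
        init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        if m.bias is not None:
            init.constant_(m.bias, 0)
    elif isinstance(m, nn.Linear):
        init.xavier_uniform_(m.weight)
        if m.bias is not None:
            init.constant_(m.bias, 0)

net.apply(init_weights)  # recursively applies init_weights to every submodule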
(2) For more flexible custom initialization, numpy can be used.
Tensor operations are sometimes not as flexible as numpy's, so the initialization can be implemented in numpy and the result then converted to a tensor:
import numpy as np
import torch
import torch.nn as nn
for layer in net1.modules():
    if isinstance(layer, nn.Linear):  # Check whether it is a linear layer
        param_shape = layer.weight.shape
        # Normal distribution with mean 0 and standard deviation 0.5; .float() matches the layer's float32 dtype
        layer.weight.data = torch.from_numpy(np.random.normal(0, 0.5, size=param_shape)).float()
13. Load built-in pre-trained models
The torchvision.models module contains the following models:
- AlexNet
- VGG
- ResNet
- SqueezeNet
- DenseNet
The method to import these models is:
import torchvision.models as models
resnet18 = models.resnet18()
alexnet = models.alexnet()
vgg16 = models.vgg16()
A very important parameter is pretrained, which defaults to False, meaning that only the model's structure is imported and its weights are randomly initialized.
If pretrained is set to True, the model pre-trained on the ImageNet dataset is imported.
import torchvision.models as models
resnet18 = models.resnet18(pretrained=True)
alexnet = models.alexnet(pretrained=True)
vgg16 = models.vgg16(pretrained=True)
For more models, see: https://pytorch-cn.readthedocs.io/zh/latest/torchvision/torchvision-models/