Reprinted from | Machine Learning Algorithms and Related Matters
Author | z.defying
Source | DataWhale
1. Specify GPU Number
Set the current GPU device to device 0 only, with device name `/gpu:0`:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
Set the current GPU devices to 0 and 1, with names `/gpu:0` and `/gpu:1`:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
This indicates that device 0 is used first, followed by device 1.
The statement that specifies the GPU should be placed before any operations related to the neural network.
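For example, a minimal sketch (the model here is only a placeholder) with the environment variable set at the very top of the script, before any CUDA work is done:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must run before the first CUDA call

import torch

# From PyTorch's point of view, the visible card is now cuda:0,
# regardless of its physical index on the machine.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(10, 2).to(device)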
2. View Model Output Details for Each Layer
Keras has a concise API to view the output size of each layer of the model, which is very useful for debugging the network. This functionality can now also be implemented in PyTorch.
It’s easy to use, as shown below:
from torchsummary import summary
summary(your_model, input_size=(channels, H, W))
`input_size` should be set according to your own network model's input size.
https://github.com/sksq96/pytorch-summary
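For instance, a quick sketch with a torchvision model (this assumes torchvision is installed; the `device="cpu"` argument is available in recent versions of torchsummary):
from torchsummary import summary
from torchvision import models

model = models.resnet18()  # any nn.Module works here
# For a 3-channel 224x224 input, print each layer's output shape and
# parameter count, similar to model.summary() in Keras.
summary(model, input_size=(3, 224, 224), device="cpu")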
3. Gradient Clipping
import torch.nn as nn

outputs = model(data)
loss = loss_fn(outputs, target)
optimizer.zero_grad()
loss.backward()
# Clip gradients in place after backward() and before optimizer.step()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
optimizer.step()
`nn.utils.clip_grad_norm_` parameters:
- parameters – an iterable of Tensors whose gradients will be normalized
- max_norm – the maximum allowed norm of the gradients
- norm_type – the type of norm to use; defaults to the L2 norm
Zhihu user @不椭的椭圆 noted that gradient clipping can add a significant amount of computation time on some tasks.
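As a side note, `clip_grad_norm_` also returns the total gradient norm computed before clipping; a small sketch of using that value for logging (the threshold of 20 simply mirrors the example above):
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
# The returned value is the overall gradient norm before clipping,
# which makes it easy to see how often clipping actually kicks in.
if float(total_norm) > 20:
    print('gradient norm {:.2f} was clipped down to 20'.format(float(total_norm)))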
4. Expand Dimensions of a Single Image
During training the data is usually batched, with shape (batch_size, c, h, w), but at test time a single image is often fed in on its own, so a batch dimension has to be added. There are several ways to expand the dimensions:
import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.view(1, *image.size())
print(img.size())
# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])
or
import cv2
import numpy as np
image = cv2.imread(img_path)
print(image.shape)
img = image[np.newaxis, :, :, :]
print(img.shape)
# output:
# (h, w, c)
# (1, h, w, c)
or (thanks to Zhihu user @coldleaf for the addition)
import cv2
import torch
image = cv2.imread(img_path)
image = torch.tensor(image)
print(image.size())
img = image.unsqueeze(dim=0)
print(img.size())
img = img.squeeze(dim=0)
print(img.size())
# output:
# torch.Size([h, w, c])
# torch.Size([1, h, w, c])
# torch.Size([h, w, c])
`tensor.unsqueeze(dim)`: adds a dimension of size 1 at position dim.
`tensor.squeeze(dim)`: removes the dimension at position dim if its size is 1; if its size is greater than 1, squeeze() has no effect. If dim is not specified, all dimensions of size 1 are removed.
5. One-Hot Encoding
When using the cross-entropy loss in PyTorch (`nn.CrossEntropyLoss`), the labels are passed as class indices and handled internally, so there is no need to convert them to one-hot yourself; with a loss such as MSE, however, you do have to convert the labels to one-hot encoding manually.
import torch

class_num = 8
batch_size = 4

def one_hot(label):
    """Convert a one-dimensional label tensor to one-hot encoding"""
    label = label.resize_(batch_size, 1)
    m_zeros = torch.zeros(batch_size, class_num)
    onehot = m_zeros.scatter_(1, label, 1)  # (dim, index, value)
    return onehot.numpy()  # Tensor -> Numpy

label = torch.LongTensor(batch_size).random_() % class_num  # random integers reduced modulo class_num
print(one_hot(label))

# output:
# [[0. 0. 0. 1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0.]]
https://discuss.pytorch.org/t/convert-int-into-one-hot-format/507/3
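In more recent PyTorch versions (1.1 and later), the built-in helper `torch.nn.functional.one_hot` is a shorter alternative; a minimal sketch:
import torch
import torch.nn.functional as F

label = torch.tensor([3, 4, 2, 1])                # class indices
onehot = F.one_hot(label, num_classes=8).float()  # same 4 x 8 matrix as above
print(onehot)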
6. Prevent Out of Memory When Validating the Model
When validating the model, gradients are not needed, so turning off autograd speeds up inference and saves GPU memory; leaving it on can lead to an out-of-memory error.
with torch.no_grad():
# Code for making predictions with the model
pass
Thanks to Zhihu user @zhaz for the reminder; I have updated the explanation of why `torch.cuda.empty_cache()` is used.
This is the original answer:
During training in PyTorch, useless temporary variables may accumulate, leading to out of memory. The following statement can be used to clean these unnecessary variables.
The explanation of `torch.cuda.empty_cache()` in the official documentation is:
Releases all unoccupied cached memory currently held by the caching allocator so that those can be used in other GPU applications and visible in nvidia-smi.
This means that PyTorch's caching allocator reserves a certain amount of GPU memory in advance; even if the tensors do not actually use all of it, this memory cannot be used by other applications. The allocation is triggered by the first CUDA memory access.
`torch.cuda.empty_cache()` releases the cached GPU memory that the caching allocator currently holds but does not occupy, so that it can be used by other GPU applications and becomes visible in the `nvidia-smi` command. Note that this call does not release the GPU memory occupied by tensors.
For unused data variables, PyTorch can automatically recycle them to free the corresponding GPU memory.
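Putting the two together, a typical validation step might look like the following sketch (`model`, `val_loader`, and `device` are placeholder names):
model.eval()  # switch layers such as dropout and batch norm to evaluation mode
with torch.no_grad():  # no autograd graph is built, saving memory and time
    for data, target in val_loader:
        output = model(data.to(device))
        # ... compute validation metrics here ...
torch.cuda.empty_cache()  # hand cached-but-unused GPU memory back for other applications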
For more detailed optimizations, see:
Optimizing GPU Memory Usage: https://blog.csdn.net/qq_28660035/article/details/80688427
GPU Memory Utilization Issues: https://oldpan.me/archives/pytorch-gpu-memory-usage-track
7. Learning Rate Decay
import torch.optim as optim
from torch.optim import lr_scheduler
# Initialization before training
optimizer = optim.Adam(net.parameters(), lr=0.001)
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # every 10 epochs, multiply the learning rate by 0.1

# During training
for epoch in range(n_epoch):
    ...  # training code for one epoch
    scheduler.step()  # in PyTorch 1.1+, call scheduler.step() after the epoch's optimizer updates
8. Freeze Parameters of Certain Layers
Reference: Freezing a Certain Layer of a Pre-trained Model in PyTorch: https://www.zhihu.com/question/311095447/answer/589307812
When loading a pre-trained model, sometimes we want to freeze the first few layers so that their parameters do not change during training.
We need to know the names of each layer, which can be printed using the following code:
net = Network()  # Get the custom network structure
for name, value in net.named_parameters():
    print('name: {0}, grad: {1}'.format(name, value.requires_grad))
Assuming the information for the first few layers is as follows:
name: cnn.VGG_16.convolution1_1.weight, grad: True
name: cnn.VGG_16.convolution1_1.bias, grad: True
name: cnn.VGG_16.convolution1_2.weight, grad: True
name: cnn.VGG_16.convolution1_2.bias, grad: True
name: cnn.VGG_16.convolution2_1.weight, grad: True
name: cnn.VGG_16.convolution2_1.bias, grad: True
name: cnn.VGG_16.convolution2_2.weight, grad: True
name: cnn.VGG_16.convolution2_2.bias, grad: True
The trailing True indicates that the parameters of that layer are trainable. We then define a list of the layers to freeze:
no_grad = [
    'cnn.VGG_16.convolution1_1.weight',
    'cnn.VGG_16.convolution1_1.bias',
    'cnn.VGG_16.convolution1_2.weight',
    'cnn.VGG_16.convolution1_2.bias',
]
The freezing method is as follows:
net = Net.CTPN()  # Get the network structure
for name, value in net.named_parameters():
    if name in no_grad:
        value.requires_grad = False
    else:
        value.requires_grad = True
After freezing, we print the information of each layer again:
name: cnn.VGG_16.convolution1_1.weight, grad: False
name: cnn.VGG_16.convolution1_1.bias, grad: False
name: cnn.VGG_16.convolution1_2.weight, grad: False
name: cnn.VGG_16.convolution1_2.bias, grad: False
name: cnn.VGG_16.convolution2_1.weight, grad: True
name: cnn.VGG_16.convolution2_1.bias, grad: True
name: cnn.VGG_16.convolution2_2.weight, grad: True
name: cnn.VGG_16.convolution2_2.bias, grad: True
It can be seen that the weight and bias of the first two layers have requires_grad set to False, indicating that they are not trainable.
Finally, when defining the optimizer, only the parameters of layers with requires_grad set to True are updated.
optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.01)
9. Use Different Learning Rates for Different Layers
We can use different learning rates for different layers of the model.
Still using this model as an example:
net = Network()  # Get the custom network structure
for name, value in net.named_parameters():
    print('name: {}'.format(name))
# Output:
# name: cnn.VGG_16.convolution1_1.weight
# name: cnn.VGG_16.convolution1_1.bias
# name: cnn.VGG_16.convolution1_2.weight
# name: cnn.VGG_16.convolution1_2.bias
# name: cnn.VGG_16.convolution2_1.weight
# name: cnn.VGG_16.convolution2_1.bias
# name: cnn.VGG_16.convolution2_2.weight
# name: cnn.VGG_16.convolution2_2.bias
Set different learning rates for convolution1 and convolution2 by first separating them into different lists:
conv1_params = []
conv2_params = []

for name, parms in net.named_parameters():
    if "convolution1" in name:
        conv1_params += [parms]
    else:
        conv2_params += [parms]

# Then, in the optimizer, do the following:
optimizer = optim.Adam(
    [
        {"params": conv1_params, 'lr': 0.01},
        {"params": conv2_params, 'lr': 0.001},
    ],
    weight_decay=1e-3,
)
We split the model's parameters into two groups and put them in a list, where each group corresponds to one of the dictionaries above, and the learning rate is set inside each dictionary. Options shared by both groups, such as `weight_decay` above, are passed outside the list as global arguments.
A global learning rate can also be set outside the list; parameter groups that specify a local learning rate in their dictionary use that value, and the remaining groups fall back to the global learning rate.
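For example, a brief sketch of that fallback behaviour (the learning-rate values here are only illustrative):
optimizer = optim.Adam(
    [
        {"params": conv1_params, 'lr': 0.01},  # local learning rate for this group
        {"params": conv2_params},              # no local lr: falls back to the global lr below
    ],
    lr=0.001,           # global learning rate
    weight_decay=1e-3,  # shared by both parameter groups
)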