Master these 17 methods to accelerate your PyTorch deep learning training in the most effortless way.
Recently, a post on Reddit has gone viral. The topic is about how to speed up PyTorch training. The original author is Lorenz Kuhn, a master’s student in computer science at ETH Zurich, and the article introduces us to the 17 most effortless and effective methods to accelerate training deep models using PyTorch.

The methods mentioned in this article assume that you are training models in a GPU environment. The specific content is as follows.
17 Methods to Accelerate PyTorch Training
1. Consider Switching Learning Rate Schedules
The choice of learning rate schedule has a significant impact on the convergence speed and generalization ability of the model. Leslie N. Smith et al. proposed cyclical learning rates and the 1Cycle learning rate schedule in their papers “Cyclical Learning Rates for Training Neural Networks” and “Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates”. Subsequently, Jeremy Howard and Sylvain Gugger from fast.ai promoted it. The following image illustrates the 1Cycle learning rate schedule:

Sylvain writes: 1Cycle consists of two phases of equal length, one going from a lower learning rate up to a higher one, and the other coming back down to the minimum. The maximum comes from the value picked with a learning rate finder, and the lower value can be roughly ten times smaller. The length of this cycle should be slightly less than the total number of epochs, and in the final stage of training the learning rate should be allowed to drop several orders of magnitude below the minimum.
Compared to traditional learning rate schedules, this schedule achieves significant acceleration in the best cases (Smith refers to it as super convergence). For example, using the 1Cycle strategy to train ResNet-56 on the ImageNet dataset reduces the training iterations to 1/10 of the original, while the model performance can still match the level in the original paper. This schedule seems to perform well across common architectures and optimizers.
PyTorch has already implemented these two methods: “torch.optim.lr_scheduler.CyclicLR” and “torch.optim.lr_scheduler.OneCycleLR”.
Reference documentation: https://pytorch.org/docs/stable/optim.html
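As a rough sketch of how OneCycleLR is wired in (the model, optimizer, maximum learning rate, and epoch counts below are placeholder assumptions, not recommendations):

import torch

model = torch.nn.Linear(10, 2)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, steps_per_epoch=100, epochs=10)

# Inside the training loop, call scheduler.step() once per batch,
# right after optimizer.step().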
2. Use Multiple Workers and Pin Memory in DataLoader
When using torch.utils.data.DataLoader, set num_workers > 0 instead of the default 0, and set pin_memory=True instead of the default False.
Szymon Micacz, a senior CUDA deep learning algorithm software engineer from NVIDIA, achieved a 2x speedup in a single epoch using four workers and pinned memory. A common rule of thumb for choosing the number of workers is to set it to four times the number of available GPUs; setting it higher or lower will reduce training speed. Note that increasing num_workers will increase CPU memory consumption.
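A minimal sketch of such a DataLoader (the dataset and batch size below are placeholders; the worker count follows the rule of thumb above):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,     # > 0: load batches in background worker processes
    pin_memory=True,   # page-locked host memory speeds up CPU-to-GPU copies
)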
3. Maximize Batch Size
Maximizing batch size is a somewhat controversial point. Generally, if you maximize the batch size within the limits of GPU memory, your training will be faster. However, you also need to adjust other hyperparameters, such as the learning rate. A good rule of thumb is to double the learning rate when doubling the batch size.
OpenAI’s paper “An Empirical Model of Large-Batch Training” demonstrates how many steps different batch sizes require to converge. In the article “How to get 4x speedup and better generalization using the right batch size,” author Daniel Huynh conducted some experiments using different batch sizes (also using the 1Cycle strategy discussed above). Ultimately, he increased the batch size from 64 to 512, achieving a 4x speedup.
However, the downside of using large batches is that it may lead to poorer generalization compared to using smaller batches.
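As an illustration of the linear-scaling rule of thumb above (the base values here are arbitrary placeholders):

base_batch_size, base_lr = 64, 0.001
batch_size = 512                                 # 8x larger batches
lr = base_lr * (batch_size / base_batch_size)    # 8x larger learning rate: 0.008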
4. Use Automatic Mixed Precision (AMP)
PyTorch 1.6 includes native support for automatic mixed precision training. Compared with single precision (FP32), some operations run faster in half precision (FP16) without losing accuracy. AMP automatically decides which operations should run in which precision, which speeds up training and reduces memory usage.
In the best case, the usage of AMP is as follows:
import torch

# Creates the gradient scaler once at the beginning of training
scaler = torch.cuda.amp.GradScaler()

for data, label in data_iter:
    optimizer.zero_grad()
    # Casts operations to mixed precision
    with torch.cuda.amp.autocast():
        loss = model(data)

    # Scales the loss, and calls backward()
    # to create scaled gradients
    scaler.scale(loss).backward()

    # Unscales gradients and calls
    # or skips optimizer.step()
    scaler.step(optimizer)

    # Updates the scale for next iteration
    scaler.update()
5. Consider Using Another Optimizer
AdamW is an Adam variant with weight decay (instead of L2 regularization) promoted by fast.ai, implemented in PyTorch as torch.optim.AdamW. AdamW seems to consistently outperform Adam in terms of error and training time.
Both Adam and AdamW work well with the 1Cycle strategy mentioned above.
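A minimal sketch combining AdamW with the 1Cycle schedule (the model, learning rates, weight decay, and step counts below are illustrative assumptions):

import torch

model = torch.nn.Linear(10, 2)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, steps_per_epoch=100, epochs=10)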
Currently, some non-native optimizers have also attracted significant attention, most notably LARS and LAMB. NVIDIA's APEX implements fused versions of some common optimizers, such as Adam. Compared with the Adam implementation in PyTorch, this implementation avoids numerous passes to and from GPU memory, yielding speed improvements of around 5%.
6. Enable cudnn Benchmarking
If your model architecture and input size remain constant, set torch.backends.cudnn.benchmark = True. This lets cuDNN benchmark its available convolution algorithms once and then use the fastest one.
7. Be Careful with Frequent Data Transfers Between CPU and GPU
Frequent use of tensor.cpu() to transfer tensors from GPU to CPU (or tensor.cuda() to transfer tensors from CPU to GPU) is costly. The same goes for .item() and .numpy(); use .detach() instead.
If you create a new tensor, you can assign it to the GPU using the keyword argument device=torch.device('cuda:0').
If you need to transfer data, you can use .to(non_blocking=True), provided there are no synchronization points after the transfer.
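A short sketch of these points (the tensor shapes and names are placeholders):

import torch

device = torch.device('cuda:0')

# Create the tensor directly on the GPU instead of creating it on the CPU
# and moving it afterwards.
x = torch.zeros(128, 128, device=device)

# For batches loaded with pin_memory=True, the copy can overlap with compute:
# batch = batch.to(device, non_blocking=True)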
8. Use Gradient/Activation Checkpointing
Checkpointing trades compute for memory: instead of storing all intermediate activations of the computation graph for the backward pass, it recomputes them. It can be applied to any part of the model.
Specifically, during the forward pass, the function runs in torch.no_grad() mode, not storing intermediate activations. Instead, the input tuples and function parameters are saved during the forward pass. During the backward pass, the inputs and function are retrieved, and the forward pass is computed again on the function. Then, the intermediate activations are tracked, using these activation values to compute gradients.
Therefore, while this may slightly increase the runtime for a given batch size, it significantly reduces memory usage. This, in turn, allows for further increases in the batch size used, improving GPU utilization.
Although checkpointing is implemented in torch.utils.checkpoint, it still requires some thought and effort to implement correctly. Priya Goyal wrote a great tutorial introducing the key aspects of checkpointing.
Priya Goyal’s tutorial link:
https://github.com/prigoyal/pytorch_memonger/blob/master/tutorial/Checkpointing_for_PyTorch_models.ipynb
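A minimal sketch of checkpointing a sequential model with torch.utils.checkpoint (the layer sizes and number of segments below are placeholder assumptions):

import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
x = torch.randn(32, 256, requires_grad=True)

# Split the model into 2 segments; only the segment boundaries keep their
# activations, everything else is recomputed during the backward pass.
out = checkpoint_sequential(model, 2, x)
out.sum().backward()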
9. Use Gradient Accumulation
Another way to increase batch size is to accumulate gradients over multiple .backward() passes before calling optimizer.step().
Thomas Wolf from Hugging Face’s article “Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups” discusses how to use gradient accumulation. Gradient accumulation can be implemented as follows:
model.zero_grad() # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                   # Forward pass
    loss = loss_function(predictions, labels)     # Compute loss function
    loss = loss / accumulation_steps              # Normalize our loss (if averaged)
    loss.backward()                               # Backward pass
    if (i + 1) % accumulation_steps == 0:         # Wait for several backward steps
        optimizer.step()                          # Now we can do an optimizer step
        model.zero_grad()                         # Reset gradients tensors
        if (i + 1) % evaluation_steps == 0:       # Evaluate the model when we...
            evaluate_model()                      # ...have no gradients accumulated
This method is primarily developed to circumvent GPU memory limitations.
10. Use Distributed Data Parallel for Multi-GPU Training
There are many ways to accelerate distributed training, but a simple method is to use torch.nn.DistributedDataParallel instead of torch.nn.DataParallel. This way, each GPU will be driven by a dedicated CPU core, avoiding the GIL issues of DataParallel.
Distributed training documentation link: https://pytorch.org/tutorials/beginner/dist_overview.html
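A minimal sketch of DistributedDataParallel, assuming the script is launched with torchrun, which sets the environment variables read below:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')              # one process per GPU
local_rank = int(os.environ['LOCAL_RANK'])           # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 2).cuda(local_rank)      # placeholder model
model = DDP(model, device_ids=[local_rank])
# ... build the DataLoader with a DistributedSampler and train as usual ...

Such a script would then be launched with something like torchrun --nproc_per_node=<num_gpus> train.py, so that each process drives one GPU.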
11. Set Gradients to None Instead of 0
Set gradients using .zero_grad(set_to_none=True) instead of .zero_grad(). This allows the memory allocator to manage the gradients rather than setting them to 0. As the documentation states, setting gradients to None provides moderate speedup, but do not expect miracles. Note that this approach also has downsides; see the documentation for details.
Documentation link: https://pytorch.org/docs/stable/optim.html
12. Use .as_tensor() Instead of .tensor()
torch.tensor() always copies data. If you are converting a numpy array, use torch.as_tensor() or torch.from_numpy() to avoid copying data.
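A small illustration of the difference (array sizes are placeholders):

import numpy as np
import torch

arr = np.zeros((3, 3), dtype=np.float32)
t_copy = torch.tensor(arr)        # always copies the data
t_view = torch.from_numpy(arr)    # shares memory with arr, no copy

arr[0, 0] = 1.0
print(t_copy[0, 0].item(), t_view[0, 0].item())   # 0.0 1.0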
13. Enable Debugging Tools When Necessary
PyTorch provides many debugging tools, such as autograd.profiler, autograd.gradcheck, and autograd.detect_anomaly. Make sure to enable them only when you actually need to debug, and turn them off promptly otherwise, as they slow down training.
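For example, a minimal sketch of profiling a single forward pass with the autograd profiler (the model and input are placeholders); remove it for normal training runs:

import torch

model = torch.nn.Linear(512, 512).cuda()             # placeholder model
x = torch.randn(64, 512, device='cuda')

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    y = model(x)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))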
14. Use Gradient Clipping
To avoid the problem of exploding gradients in RNNs, experiments and theory have confirmed that gradient clipping (gradient = min(gradient, threshold)) can accelerate convergence. HuggingFace's Transformers implementation is a clear example of how to use gradient clipping, and it can be combined with other methods mentioned in this article, such as AMP.
In PyTorch, gradient clipping can be implemented using torch.nn.utils.clip_grad_norm_.
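A minimal placement sketch: clipping happens after backward() and before the optimizer step (the model, loss, and max_norm value below are placeholders):

import torch

model = torch.nn.Linear(10, 2)                        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(4, 10)).sum()                # placeholder loss

loss.backward()
# Clip after backward() has produced the gradients and before step() uses them.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()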
15. Disable Bias Before BatchNorm
Disable the bias layer before starting the BatchNormalization layer. For a 2-D convolutional layer, you can set the bias keyword to False: torch.nn.Conv2d(…, bias=False, …).
16. Disable Gradient Calculation During Validation
During validation, disable gradient calculation by wrapping the validation code in a torch.no_grad() context.
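A minimal sketch (the model and validation data are placeholders):

import torch

model = torch.nn.Linear(10, 2)                          # placeholder model
val_batches = [torch.randn(8, 10) for _ in range(3)]    # placeholder validation data

model.eval()
with torch.no_grad():              # no autograd graph is built during validation
    for x in val_batches:
        preds = model(x)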
17. Normalize Inputs and Batches
Double-check that your inputs are normalized, and consider whether batch normalization is being used.
Original link: https://efficientdl.com/faster-deep-learning-in-pytorch-a-guide/