Reproduced from: Machine Heart
Master these 17 methods to accelerate your PyTorch deep learning training with minimal effort.
Recently, a post about how to speed up PyTorch training became hugely popular on Reddit. Its author, Lorenz Kuhn, is a master's student in computer science at ETH Zurich, and the article introduces the 17 most effective, lowest-effort methods for training deep models with PyTorch.
These methods assume that you are training your model in a GPU environment. The details are as follows.

17 Ways to Speed Up PyTorch Training

1. Consider Changing the Learning Rate Schedule

The choice of learning rate schedule has a large effect on both convergence speed and generalization. Leslie N. Smith et al. proposed cyclical learning rates and the 1Cycle learning rate schedule in the papers "Cyclical Learning Rates for Training Neural Networks" and "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates," and Jeremy Howard and Sylvain Gugger of fast.ai later popularized them. The original post includes a figure illustrating the 1Cycle learning rate schedule.

Sylvain describes it as follows: 1Cycle consists of two phases of equal length, one going from a lower learning rate up to a higher one, and the other coming back down to the minimum. The maximum is the value picked by a learning rate finder, and the lower value can be roughly ten times smaller. The length of the cycle should be slightly less than the total number of epochs, and in the last part of training the learning rate should be allowed to drop several orders of magnitude below that minimum.

Compared with traditional learning rate schedules, this schedule achieves a significant speedup in the best case (Smith calls it super-convergence). For example, using the 1Cycle strategy to train ResNet-56 on the ImageNet dataset reduced the number of training iterations to one tenth of the original while still matching the performance reported in the original paper, and the schedule appears to work well across common architectures and optimizers. PyTorch implements both schedulers: torch.optim.lr_scheduler.CyclicLR and torch.optim.lr_scheduler.OneCycleLR. Reference documentation: https://pytorch.org/docs/stable/optim.html
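These schedulers drop into an ordinary training loop. Below is a minimal sketch (not code from the original post); it assumes that model, train_loader, and loss_fn already exist and that 0.1 was chosen as the peak learning rate, for example with a learning-rate finder:

import torch

# Assumptions for illustration only: `model`, `train_loader`, and `loss_fn` are defined,
# and 0.1 was picked as the maximum learning rate with a learning-rate finder.
epochs = 10
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    steps_per_epoch=len(train_loader),
    epochs=epochs,
)

for epoch in range(epochs):
    for data, label in train_loader:
        loss = loss_fn(model(data), label)
        loss.backward()
        optimizer.step()
        scheduler.step()   # OneCycleLR is stepped after every batch, not every epoch
        optimizer.zero_grad()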
2. Use Multiple Workers and Pin Memory in DataLoader

When using torch.utils.data.DataLoader, set num_workers > 0 rather than the default of 0, and set pin_memory=True rather than the default of False. Szymon Migacz, a senior CUDA deep learning algorithm software engineer at NVIDIA, reported a 2x speedup for a single epoch by using four workers and pinned memory. A rule of thumb for choosing the number of workers is to set it to four times the number of available GPUs; setting it much higher or lower will slow training down. Note that increasing num_workers also increases CPU memory consumption. (The sketch after the next tip shows both settings on a DataLoader.)

3. Maximize the Batch Size

Maximizing the batch size is a somewhat controversial recommendation, but in general training is faster if you use the largest batch that fits in GPU memory. You then have to adjust other hyperparameters as well, in particular the learning rate; a good rule of thumb is to double the learning rate when the batch size is doubled. OpenAI's paper "An Empirical Model of Large-Batch Training" shows nicely how many steps different batch sizes need in order to converge. In "How to get 4x speedup and better generalization using the right batch size," Daniel Huynh runs experiments with different batch sizes (also using the 1Cycle strategy discussed above) and achieves a 4x speedup by increasing the batch size from 64 to 512. The downside of large batches is that they may generalize worse than smaller ones.
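As a sketch of tips 2 and 3 together (train_dataset is a placeholder for an existing torch.utils.data.Dataset, and the numbers are purely illustrative), both changes are just constructor arguments to DataLoader:

from torch.utils.data import DataLoader

# Illustrative values only: push batch_size as high as GPU memory allows (adjusting the
# learning rate accordingly); num_workers follows the rough "4 x number of GPUs" rule of
# thumb for a single-GPU machine.
train_loader = DataLoader(
    train_dataset,        # assumed to be an existing torch.utils.data.Dataset
    batch_size=512,
    shuffle=True,
    num_workers=4,
    pin_memory=True,      # pinned host memory enables faster, asynchronous copies to the GPU
)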
4. Use Automatic Mixed Precision (AMP)

PyTorch 1.6 includes a native implementation of automatic mixed precision training. Some operations run faster in half precision (FP16) than in single precision (FP32) without losing accuracy, and AMP automatically decides which precision to use for which operation, speeding up training while reducing memory usage. In the best case, AMP is used as follows:

import torch

# Creates the gradient scaler once at the beginning of training
scaler = torch.cuda.amp.GradScaler()

for data, label in data_iter:
    optimizer.zero_grad()

    # Casts operations to mixed precision
    with torch.cuda.amp.autocast():
        loss = model(data)

    # Scales the loss and calls backward() to create scaled gradients
    scaler.scale(loss).backward()

    # Unscales the gradients and calls (or skips) optimizer.step()
    scaler.step(optimizer)

    # Updates the scale for the next iteration
    scaler.update()
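Note that in this pattern only the forward pass (and the loss computation) runs under autocast(); per the PyTorch AMP documentation, the corresponding backward operations automatically run in the same precision that autocast chose for the matching forward operations.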
5. Consider Using Another Optimizer

AdamW is an Adam variant with weight decay (rather than L2 regularization) popularized by fast.ai and implemented in PyTorch as torch.optim.AdamW. AdamW seems to consistently beat Adam in both error and training time, and both Adam and AdamW pair well with the 1Cycle strategy described above. Some non-native optimizers have also been attracting a lot of attention, most notably LARS and LAMB. In addition, NVIDIA's APEX provides fused versions of common optimizers such as Adam; compared with the Adam implementation in PyTorch, it avoids repeated round trips to and from GPU memory and improves speed by around 5%.

6. Set the cuDNN Benchmark

If your model architecture stays fixed and the input size stays constant, set torch.backends.cudnn.benchmark = True.

7. Be Careful About Frequent Data Transfers Between CPU and GPU

Frequently transferring tensors from GPU to CPU with tensor.cpu() (or from CPU to GPU with tensor.cuda()) is very expensive. item() and .numpy() incur similar costs; use .detach() instead. When creating a new tensor, you can allocate it on the GPU directly with the keyword argument device=torch.device('cuda:0'). If you do need to transfer data, .to(non_blocking=True) can help, as long as there are no synchronization points after the transfer.

8. Use Gradient/Activation Checkpointing

Checkpointing trades computation for memory: instead of storing all intermediate activations of the entire computation graph for the backward pass, it recomputes them. It can be applied to any part of the model. Specifically, during the forward pass the checkpointed function runs under torch.no_grad() and intermediate activations are not stored; instead, the input tuple and the function itself are saved. During the backward pass, the saved inputs and function are retrieved and the forward pass is recomputed for that function, this time tracking the intermediate activations, which are then used to compute the gradients. So while this may slightly increase the run time for a given batch size, it significantly reduces memory usage, which in turn allows a further increase in batch size and better GPU utilization. Checkpointing is implemented in torch.utils.checkpoint, but getting it right still takes some thought and effort. Priya Goyal wrote a good tutorial covering the key aspects of checkpointing: https://github.com/prigoyal/pytorch_memonger/blob/master/tutorial/Checkpointing_for_PyTorch_models.ipynb
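For a purely sequential model, torch.utils.checkpoint.checkpoint_sequential keeps the changes small. The sketch below is not taken from Priya Goyal's tutorial; the toy model and sizes are made up for illustration:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy sequential model standing in for the expensive part of a real network.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
).cuda()

# The input must require grad so that the recomputed segments stay connected to the graph.
x = torch.randn(256, 1024, device="cuda", requires_grad=True)

# Split the model into 4 segments: only each segment's input is kept, and the
# activations inside a segment are recomputed during the backward pass.
out = checkpoint_sequential(model, 4, x)
loss = out.sum()
loss.backward()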
9. Use Gradient Accumulation

Another way to increase the batch size is to accumulate gradients over several .backward() calls before calling optimizer.step(). Thomas Wolf of Hugging Face explains how to use gradient accumulation in "Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups." It can be implemented as follows:

model.zero_grad()                                     # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                       # Forward pass
    loss = loss_function(predictions, labels)         # Compute loss function
    loss = loss / accumulation_steps                  # Normalize our loss (if averaged)
    loss.backward()                                   # Backward pass
    if (i + 1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                              # Now we can do an optimizer step
        model.zero_grad()                             # Reset gradients tensors
        if (i + 1) % evaluation_steps == 0:           # Evaluate the model when we...
            evaluate_model()                          # ...have no gradients accumulated

This method was developed primarily to work around GPU memory limitations.

10. Use Distributed Data Parallel for Multi-GPU Training

There are many ways to accelerate distributed training, but a simple one is to use torch.nn.parallel.DistributedDataParallel instead of torch.nn.DataParallel. With DistributedDataParallel, each GPU is driven by its own dedicated process, which avoids the GIL contention that the single-process, multi-threaded DataParallel suffers from. Distributed training documentation: https://pytorch.org/tutorials/beginner/dist_overview.html
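Below is a minimal sketch of a DistributedDataParallel setup (not the code from the linked documentation); it assumes one process per GPU launched with torchrun (e.g. torchrun --nproc_per_node=NUM_GPUS train_ddp.py) and uses a toy model and dataset:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")      # one process per GPU; torchrun sets the env vars
    local_rank = int(os.environ["LOCAL_RANK"])   # provided by torchrun
    torch.cuda.set_device(local_rank)

    # Toy model and dataset for illustration only.
    model = DDP(torch.nn.Linear(20, 2).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    dataset = TensorDataset(torch.randn(1024, 20), torch.randint(0, 2, (1024,)))

    # DistributedSampler gives each process its own shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)

    for epoch in range(2):
        sampler.set_epoch(epoch)                 # reshuffle consistently across processes
        for x, y in loader:
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()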
11. Set Gradients to None Instead of 0

Call .zero_grad(set_to_none=True) instead of .zero_grad(). This lets the memory allocator handle the gradients rather than actively setting them to 0. As the documentation notes, setting gradients to None gives only a moderate speedup, so do not expect miracles, and there are some downsides to doing this; see the documentation for details: https://pytorch.org/docs/stable/optim.html

12. Use .as_tensor() Instead of .tensor()

torch.tensor() always copies its data. If you are converting a numpy array, use torch.as_tensor() or torch.from_numpy() to avoid the copy.

13. Enable Debugging Tools Only When Necessary

PyTorch provides many debugging tools, such as autograd.profiler, autograd.gradcheck, and autograd.detect_anomaly. Make sure to turn them on only when you actually need to debug and to turn them off afterwards, because they slow down training.

14. Use Gradient Clipping

For avoiding exploding gradients in RNNs, there is experimental and theoretical evidence that gradient clipping (gradient = min(gradient, threshold)) accelerates convergence. Hugging Face's Transformers implementation is a very clear example of how to use gradient clipping, and it can be combined with other methods in this article, such as AMP. In PyTorch this is done with torch.nn.utils.clip_grad_norm_.

15. Disable Bias Before BatchNorm

Disable the bias in layers that come directly before a BatchNormalization layer. For a 2-D convolution layer, set the bias keyword argument to False: torch.nn.Conv2d(..., bias=False, ...).

16. Disable Gradient Calculation During Validation

During validation, disable gradient calculation by wrapping the validation code in torch.no_grad().

17. Use Input and Batch Normalization

Double-check that your inputs are normalized, and check whether you are using batch normalization.

Original link: https://efficientdl.com/faster-deep-learning-in-pytorch-a-guide/