Practical Guide to PyTorch Training Acceleration Techniques

Introduction

This article describes how to use PyTorch for mixed precision computation, data parallelism, and distributed training.

Author丨Not Important What Name @ Zhihu

Link丨https://zhuanlan.zhihu.com/p/360697168

A recent project needed to run fast and I wanted results quickly, so I spent some time learning about mixed precision computation and parallel training. Since there are already many articles explaining the underlying principles, this article only describes how to use PyTorch for mixed precision computation, data parallelism, and distributed training, without going into the theory.

Mixed Precision

Automatic Mixed Precision (AMP) training can significantly reduce training costs and increase training speed. Previously, automatic mixed precision computation was implemented using NVIDIA’s Apex tool. Starting from PyTorch 1.6.0, PyTorch has included the AMP module, so the following will mainly introduce the simple usage of the PyTorch built-in amp module.

## Import amp toolkit 
from torch.cuda.amp import autocast, GradScaler

model.train()

## GradScaler scales the loss before backward so that float16 gradients
## do not underflow (i.e., become too small to represent)
scaler = GradScaler()

batch_size = train_loader.batch_size
num_batches = len(train_loader)
end = time.time()
for i, (images, target) in tqdm.tqdm(
    enumerate(train_loader), ascii=True, total=len(train_loader)
):
    # measure data loading time
    data_time.update(time.time() - end)
    optimizer.zero_grad()
    if args.gpu is not None:
        images = images.cuda(args.gpu, non_blocking=True)

    target = target.cuda(args.gpu, non_blocking=True)
    # Automatically selects precision for GPU ops to enhance training performance without reducing model accuracy
    with autocast():
        # compute output
        output = model(images)
        loss = criterion(output, target)

    scaler.scale(loss).backward()
    # scaler.step() first unscales the gradients and then calls optimizer.step(),
    # replacing the plain optimizer.step() call used without AMP
    scaler.step(optimizer)
    scaler.update()

Data Parallelism

When a server has multiple GPUs, we can use them together to speed up training (for example, when a single GPU is too slow or its memory is insufficient). To do this, we need a way to distribute the model's computation across multiple GPUs.

PyTorch provides a simple interface, nn.DataParallel, that makes this easy. We only need to wrap the model with nn.DataParallel and set a few parameters to run it on multiple GPUs.

# multigpu lists the ids of the GPUs to use
multigpu = [0, 1, 2, 3, 4, 5, 6, 7]
# Set the primary GPU, which gathers the outputs, computes the loss, and performs the gradient update
torch.cuda.set_device(multigpu[0])
# Replicate the model across the listed GPUs; gradients are gathered on gpu[0]
model = torch.nn.DataParallel(model, device_ids=multigpu).cuda(multigpu[0])
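After wrapping, the training loop itself does not change: nn.DataParallel splits each input batch along the batch dimension across the listed GPUs and gathers the outputs back on the primary GPU. Below is a minimal sketch of one training step (criterion, optimizer, and train_loader are assumed to be the same objects as in the earlier snippets):

for images, target in train_loader:
    images = images.cuda(multigpu[0], non_blocking=True)
    target = target.cuda(multigpu[0], non_blocking=True)
    optimizer.zero_grad()
    # the batch is scattered to all GPUs inside DataParallel's forward
    output = model(images)
    # the loss is computed on the primary GPU, multigpu[0]
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()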

Using nn.DataParallel with Mixed Precision Computation

Using mixed precision computation together with nn.DataParallel requires some extra configuration; otherwise the worker GPUs will not actually run in mixed precision. autocast is designed to be "thread local", so enabling autocast only in the main thread does not affect the side threads that nn.DataParallel spawns for the other GPUs. Referencing (https://zhuanlan.zhihu.com/p/348554267), here is an example of incorrect usage:

model = MyModel()
dp_model = nn.DataParallel(model)

with autocast():
    # dp_model's internal threads won't autocast.
    # The main thread's autocast state has no effect.
    output = dp_model(input)
    # loss_fn still autocasts, but it's too late...
    loss = loss_fn(output)

There are two solutions, which are introduced below. 1. Apply the autocast decorator to the model's forward method:

class MyModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...

2. Another correct approach is to set the autocast region inside the forward function:

class MyModel(nn.Module):
    ...
    def forward(self, input):
        with autocast():
            ...

After modifying the forward function, use autocast in the main thread as before:

model = MyModel()
dp_model = nn.DataParallel(model)

with autocast():
    output = dp_model(input)
    loss = loss_fn(output)

Disadvantages of nn.DataParallel

In every training batch, nn.DataParallel gathers the outputs of all GPUs to gpu[0] and computes the loss there, which can mean transferring several GB of data per batch, and the loss computation is confined to a single GPU. This easily leads to unbalanced GPU load: the load on gpu[0] is typically much higher than on the other GPUs. In addition, the speed of data transfer between GPUs becomes a significant bottleneck for training, which is clearly unsatisfactory. The underlying principles are covered in the single-machine multi-GPU article (distributed DataParallel, mixed precision, Horovod) (https://zhuanlan.zhihu.com/p/158375055); next we introduce the alternative, nn.DistributedDataParallel.

Distributed Computation

nn.DistributedDataParallel uses multiple processes, one per GPU, to train the model jointly.

Advantages

Each process controls one GPU, so model computation is not held up by communication between GPUs and the load across GPUs stays relatively even. However, compared with single-machine single-GPU training, or single-machine multi-GPU training with nn.DataParallel, there are several extra issues to handle:

  1. Synchronizing model parameters across different GPUs, especially the BatchNormalization layers
  2. Informing each process of its rank and which GPU it should use, specified by the args.local_rank parameter
  3. Ensuring that each process reads different data (DistributedSampler)

Usage Introduction

Starting the Program

Since the author has only worked with single-machine multi-GPU setups, the focus here is on that case. Unlike running an ordinary Python program, we need to start the program with PyTorch's built-in launcher torch.distributed.launch.

# CUDA_VISIBLE_DEVICES specifies which of the machine's GPUs are visible to the program
# nproc_per_node is the number of processes to launch on this node (typically one per GPU)
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py

Configuring the Main Program

# The local_rank argument tells each process its rank and which GPU to use;
# torch.distributed.launch fills it in automatically for every process it starts
parser.add_argument('--local_rank', type=int, default=0, help='node rank for distributed training')

Initializing GPU Communication and Parameter Retrieval

# Specify the GPU for this process
torch.cuda.set_device(args.local_rank)
# Initialize the process group: NCCL is the backend PyTorch uses for GPU-to-GPU
# communication, and init_method='env://' reads the connection settings
# (address, port, world size) from environment variables set by the launcher
torch.distributed.init_process_group(
    backend='nccl',
    init_method='env://',
    rank=args.local_rank
)

Reconfiguring DataLoader

from torch.utils.data.distributed import DistributedSampler

kwargs = {"num_workers": args.workers, "pin_memory": True} if use_cuda else {}

train_sampler = DistributedSampler(train_dataset)
self.train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=train_sampler,
    **kwargs
)

# Note: since a custom sampler is provided, shuffle must not also be passed to the DataLoader
# (a custom batch_sampler is even stricter and excludes batch_size, shuffle, sampler, and
# drop_last, as in the PyTorch source quoted below; see also the epoch-shuffling note after it)
'''PyTorch dataloader.py, lines 192-197:
        if batch_sampler is not None:
            # auto_collation with custom batch_sampler
            if batch_size != 1 or shuffle or sampler is not None or drop_last:
                raise ValueError('batch_sampler option is mutually exclusive '
                                 'with batch_size, shuffle, sampler, and '
                                 'drop_last')'''
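One related detail that is easy to miss (part of the standard DistributedSampler API, not shown in the original snippet): the sampler does its own shuffling, and calling set_epoch at the start of every epoch is what changes the shuffle order from epoch to epoch while keeping each process's shard disjoint. A minimal sketch, reusing the loader variables above (args.epochs is an assumed argument, not from the original article):

for epoch in range(args.epochs):
    # change the shuffle order for this epoch; without this call every epoch
    # iterates the data in the same order
    train_sampler.set_epoch(epoch)
    for images, target in self.train_loader:
        ...  # forward / backward / optimizer step as usual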

pin_memory refers to page-locked (pinned) host memory. When a DataLoader is created with pin_memory=True, the tensors it produces are allocated in pinned memory, which speeds up copying them from host memory to GPU memory.
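This is also why the transfer lines in the earlier snippets can pass non_blocking=True: with pinned batches, the host-to-device copy is asynchronous and can overlap with computation. A small sketch of the transfer step inside the training loop, assuming args.local_rank is the target GPU:

images = images.cuda(args.local_rank, non_blocking=True)
target = target.cuda(args.local_rank, non_blocking=True)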

Model Initialization

torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)
model.to(device)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(
        model,
        device_ids=[args.local_rank],
        output_device=args.local_rank,
        find_unused_parameters=True,
        )
# cudnn.benchmark makes the program spend a little extra time at the start searching for the
# fastest convolution algorithm for each convolution layer, which speeds up the rest of training
# (most effective when the input sizes do not change between iterations)
torch.backends.cudnn.benchmark = True

DistributedDataParallel aggregates the gradients computed on the different GPUs, so that the model replica on every GPU is updated identically.

Synchronizing BatchNormalization Layers

For training tasks that consume a lot of GPU memory, the batch size that fits on a single GPU is often too small, which hurts model convergence. Cross-GPU synchronized Batch Normalization normalizes using statistics gathered from all GPUs, which effectively 'increases' the batch size and ensures that training quality is not affected by the number of GPUs used. As noted in the single-machine multi-GPU article (distributed DataParallel, mixed precision, Horovod), recent versions of PyTorch natively support synchronizing BatchNormalization layers.

  • torch.nn.SyncBatchNorm
  • torch.nn.SyncBatchNorm.convert_sync_batchnorm: Automatically converts BatchNormalization layers to torch.nn.SyncBatchNorm for synchronization of BatchNormalization layers across different GPUs

For the concrete usage, refer to the model initialization code above: model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

Synchronizing Random Seeds for Model Initialization

I have not yet tried using different random seeds in different processes. To be safe, it is recommended to use the same initialization seed in every process, so that the model replicas on all GPUs start out identical.
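A minimal sketch of such seeding (the helper name set_seed and the value 42 are illustrative choices, not from the original article); call it in every process before building the model:

import random

import numpy as np
import torch

def set_seed(seed=42):
    # Using the same seed in every process makes the model replicas
    # start from identical randomly initialized weights.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)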

Conclusion

Standing on the shoulders of giants, I learned about model acceleration recently, encountered many pitfalls, and finally summarized some specific code, referencing many other blogs. I hope this can help everyone.

References (in no particular order):

  1. PyTorch 21. Single-machine multi-GPU operation (distributed DataParallel, mixed precision, Horovod)
  2. PyTorch Source Code Interpretation of torch.cuda.amp: Detailed Explanation of Automatic Mixed Precision
  3. PyTorch’s Automatic Mixed Precision (AMP)
  4. Speeding Up Training by 60%! Just 5 Lines of Code, PyTorch 1.6 Will Natively Support Automatic Mixed Precision Training
  5. torch.backends.cudnn.benchmark ?!
  6. Revisiting the Eight Writing Methods of Python Decorators, Feel Free to Ask~
