PyTorch Training Acceleration Techniques

The MLNLP community is a well-known machine learning and natural language processing community at home and abroad, covering NLP master's and doctoral students, university teachers, and industry researchers.
The community's vision is to promote communication and progress between academia and industry in natural language processing and machine learning, especially for beginners.
Reprinted from | Zhihu
Author | The Name Doesn’t Matter
Address | https://zhuanlan.zhihu.com/p/360697168

This article discusses how to use PyTorch to implement mixed precision computation, data parallelism, and distributed training.

Because my programs recently needed to run fast and deliver results quickly, I took some time to study mixed precision computation and parallel operation. Since many articles have already introduced the relevant principles, this article only discusses how to use PyTorch to implement mixed precision computation, data parallelism, and distributed training, without going into the principles.

1

『Mixed Precision』

Automatic Mixed Precision (AMP) training can significantly reduce training costs and improve training speed. Previously, automatic mixed precision computation was implemented using NVIDIA’s Apex tool. Starting from PyTorch 1.6.0, PyTorch has included an AMP module, so the following mainly introduces the use of the built-in amp module in PyTorch.
## Import the amp toolkit
import time

import torch
import tqdm
from torch.cuda.amp import autocast, GradScaler

model.train()

## GradScaler scales the loss so that the resulting float16 gradients
## do not underflow (i.e. become too small to represent), which would slow or stall convergence
scaler = GradScaler()

batch_size = train_loader.batch_size
num_batches = len(train_loader)
end = time.time()
for i, (images, target) in tqdm.tqdm(
    enumerate(train_loader), ascii=True, total=len(train_loader)
):
    # measure data loading time (data_time is a timing meter defined elsewhere in the training script)
    data_time.update(time.time() - end)
    optimizer.zero_grad()
    if args.gpu is not None:
        images = images.cuda(args.gpu, non_blocking=True)

    target = target.cuda(args.gpu, non_blocking=True)
    # autocast automatically chooses the precision for each GPU op,
    # improving training speed without reducing model accuracy
    with autocast():
        # compute output
        output = model(images)
        loss = criterion(output, target)

    # Scale the loss, then backpropagate to produce scaled gradients
    scaler.scale(loss).backward()
    # scaler.step() replaces the usual optimizer.step(); it unscales the gradients
    # and skips the parameter update if they contain inf/NaN
    scaler.step(optimizer)
    scaler.update()
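
If gradient clipping is also needed, the gradients must be unscaled before clipping. Below is a minimal sketch reusing the names from the loop above; the max_norm value is only illustrative.

# Sketch: combining GradScaler with gradient clipping
scaler.scale(loss).backward()
# Bring the gradients back to their true (unscaled) values in place
scaler.unscale_(optimizer)
# Clip as usual, now that the gradients are unscaled (max_norm is illustrative)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)  # skips the update if inf/NaN gradients were found
scaler.update()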

2

『Data Parallelism』

When a server has multiple GPUs, we can speed up training (or cope with a model or batch that a single GPU cannot handle) by using several GPUs on one machine. To do this, we need a way to spread the training work across multiple GPUs.
In PyTorch, nn.DataParallel provides a simple interface for parallelizing a model: we only need to wrap the model with nn.DataParallel and set a few parameters to get multi-GPU parallelism.
# multigpu lists the GPU device ids to use
multigpu = [0, 1, 2, 3, 4, 5, 6, 7]
# Set the primary GPU, which gathers the outputs, computes the loss, and performs the gradient update
torch.cuda.set_device(multigpu[0])
# Gradients from all GPUs are collected on gpu[0]
model = torch.nn.DataParallel(model, device_ids=multigpu).cuda(multigpu[0])
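
Once wrapped, a training step looks the same as in the single-GPU case: nn.DataParallel splits each batch along dimension 0 across the listed GPUs and gathers the outputs on gpu[0]. A minimal sketch, assuming train_loader, criterion, and optimizer are defined as usual:

for images, target in train_loader:
    # Place the batch on the primary GPU; DataParallel scatters it to the other GPUs internally
    images = images.cuda(multigpu[0], non_blocking=True)
    target = target.cuda(multigpu[0], non_blocking=True)

    output = model(images)            # replicated forward pass on every GPU
    loss = criterion(output, target)  # outputs were gathered on gpu[0]

    optimizer.zero_grad()
    loss.backward()                   # gradients are reduced onto gpu[0]
    optimizer.step()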

Using nn.DataParallel with Mixed Precision

Using nn.DataParallel together with mixed precision requires some special handling; otherwise autocast will not actually take effect inside the model. autocast is designed to be "thread local", and nn.DataParallel runs the model in side threads, so enabling autocast only in the main thread is not enough. Here is an example of the wrong way to do it:
model = MyModel()
dp_model = nn.DataParallel(model)

with autocast():
    # dp_model's internal threads won't autocast.
    # The main thread's autocast state has no effect.
    output = dp_model(input)
    # loss_fn still autocasts, but it's too late...
    loss = loss_fn(output)

There are two ways to fix this, introduced below. 1. Add the autocast decorator to the forward method of the model module:

class MyModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...

2. Alternatively, set the autocast region inside the forward method:

class MyModel(nn.Module):
    ...
    def forward(self, input):
        with autocast():
            ...

After modifying the forward method, use autocast in the main thread as before:

model = MyModel()
dp_model = nn.DataParallel(model)

with autocast():
    output = dp_model(input)
    loss = loss_fn(output)

Disadvantages of nn.DataParallel

In every training batch, nn.DataParallel gathers all outputs onto gpu[0] and reduces the gradients there, which concentrates data transfer and loss computation on a single GPU and easily leads to an uneven GPU load; in practice the load on gpu[0] is often noticeably higher than on the other GPUs. In addition, the data transfer between GPUs can become a significant bottleneck for training, which is clearly undesirable. For the underlying principles, see "Single-machine Multi-GPU Operations (DistributedDataParallel, Mixed Precision, Horovod)" (https://zhuanlan.zhihu.com/p/158375055).

3

『Distributed Computing』

nn.DistributedDataParallel: multiple processes, each controlling one GPU, jointly train the model.

Advantages

Each process controls one GPU, so model computation is not held back by communication between GPUs, and the load is distributed fairly evenly across the GPUs. However, compared with single-machine single-GPU or single-machine multi-GPU (nn.DataParallel) training, several extra things must be handled:
  1. Synchronizing model parameters, especially BatchNormalization statistics, across the different GPUs
  2. Telling each process which position it occupies and which GPU it should use, via the args.local_rank parameter
  3. Making sure each process reads different data (DistributedSampler)

Usage Introduction

Launching the Program

Since the author has only practiced single-machine multi-GPU operation, this section mainly covers the single-machine multi-GPU case. Unlike running an ordinary Python program, we need to use the built-in PyTorch launcher torch.distributed.launch to start the program.
# CUDA_VISIBLE_DEVICES specifies which GPUs are visible to the program
# nproc_per_node is the number of processes to launch (typically one per GPU)
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py
Configuring the Main Program
parser.add_argument('--local_rank', type=int, default=0, help='node rank for distributed training')
# Configure the local_rank parameter to inform each process of its position and which GPU to use
Initializing GPU Communication and Parameter Retrieval
# Specify the GPU used by this process
torch.cuda.set_device(args.local_rank)
# Initialize the communication backend (NCCL) and how processes discover each other:
# 'env://' means the master address/port and world size are read from environment
# variables set by torch.distributed.launch. PyTorch uses NCCL for GPU-to-GPU communication.
torch.distributed.init_process_group(
    backend='nccl',
    init_method='env://',
    rank=args.local_rank
)
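
Once init_process_group has returned, each process can query its global rank and the world size; this is handy for restricting logging or checkpoint saving to a single process. A small sketch (the is_main_process name is only illustrative):

import torch.distributed as dist

# Each process prints its own rank; on a single machine the world size equals nproc_per_node
print(f"rank {dist.get_rank()} / {dist.get_world_size()}, using GPU {torch.cuda.current_device()}")

# Common pattern: only rank 0 writes logs or checkpoints, to avoid duplicated files
is_main_process = dist.get_rank() == 0
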
Reconfiguring DataLoader
from torch.utils.data.distributed import DistributedSampler

kwargs = {"num_workers": args.workers, "pin_memory": True} if use_cuda else {}

# DistributedSampler makes each process read a different, non-overlapping subset of the dataset
train_sampler = DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=train_sampler,
    **kwargs
)

# Note: because a custom sampler is passed, shuffle must not be set on the DataLoader
# (and a batch_sampler would additionally exclude batch_size, sampler, and drop_last)
'''PyTorch dataloader.py 192-197 code
        if batch_sampler is not None:
            # auto_collation with custom batch_sampler
            if batch_size != 1 or shuffle or sampler is not None or drop_last:
                raise ValueError('batch_sampler option is mutually exclusive '
                                 'with batch_size, shuffle, sampler, and '
                                 'drop_last')'''
pin_memory refers to page-locked (pinned) memory. When creating the DataLoader, setting pin_memory=True means the returned tensors are allocated in pinned host memory, which speeds up transferring them from host memory to GPU memory.
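
One detail worth noting: DistributedSampler derives its shuffling order from the epoch number, so set_epoch should be called at the start of every epoch, otherwise each epoch sees the same ordering. A minimal sketch reusing train_sampler and train_loader from above (args.epochs is assumed to exist):

for epoch in range(args.epochs):
    # Gives each epoch a different (but process-consistent) shuffle
    train_sampler.set_epoch(epoch)
    for images, target in train_loader:
        ...  # normal training step
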
Model Initialization
torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)
model.to(device)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(
        model,
        device_ids=[args.local_rank],
        output_device=args.local_rank,
        find_unused_parameters=True,
        )
# cudnn.benchmark makes the program spend a little extra time at startup searching for the
# fastest convolution algorithm for each convolution layer, which then speeds up the network
torch.backends.cudnn.benchmark = True

DistributedDataParallel automatically all-reduces (averages) the gradients computed on the different GPUs during the backward pass, so every process updates its own model replica with the same gradients.
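
The per-iteration code therefore stays the same as in single-GPU training; the all-reduce happens automatically inside loss.backward(). A minimal sketch, assuming criterion and optimizer were created after the model was moved to device:

for images, target in train_loader:
    images = images.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)

    optimizer.zero_grad()
    output = model(images)
    loss = criterion(output, target)
    loss.backward()   # gradients are all-reduced (averaged) across processes here
    optimizer.step()  # every process applies the same averaged gradients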

Synchronizing BatchNormalization Layers

For training tasks that use a lot of GPU memory, the batch size that fits on a single GPU is often too small, which hurts model convergence. Cross-GPU synchronized Batch Normalization computes the statistics over the global batch, effectively "increasing" the batch size, so the training result is no longer affected by the number of GPUs used. See "Single-machine Multi-GPU Operations (DistributedDataParallel, Mixed Precision, Horovod)". Fortunately, recent PyTorch versions natively support synchronized BatchNormalization layers.
  • torch.nn.SyncBatchNorm
  • torch.nn.SyncBatchNorm.convert_sync_batchnorm: Automatically converts BatchNormalization layers to torch.nn.SyncBatchNorm to synchronize BatchNormalization layers across different GPUs.
For the concrete usage, see the model initialization code above:
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
Synchronizing Random Seeds for Model Initialization
I have not tried using different random seeds in different processes. To be safe, it is recommended to keep the random seed for model initialization the same in every process, so that the model replicas on all GPU processes start out identical.
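
A simple way to do this is to fix all relevant seeds to the same value in every process before the model is built; a minimal sketch (the seed value and helper name are only illustrative):

import random

import numpy as np
import torch

def set_seed(seed: int = 42):
    # Use the same seed in every process so that all model replicas are initialized identically
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)  # call this in each process before creating the model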

4

『Summary』

Standing on the shoulders of giants, I taught myself model acceleration over the past while, ran into quite a few pitfalls along the way, and finally put together the concrete code above, drawing on many other blog posts. I hope it is helpful to everyone.

References (in no particular order):

  1. PyTorch 21. Single-machine Multi-GPU Operations (DistributedDataParallel, Mixed Precision, Horovod)
  2. PyTorch Source Code Interpretation of torch.cuda.amp: Detailed Explanation of Automatic Mixed Precision
  3. Automatic Mixed Precision (AMP) in PyTorch
  4. Speeding Up Training by 60%! Just 5 Lines of Code, PyTorch 1.6 Will Natively Support Automatic Mixed Precision Training
  5. torch.backends.cudnn.benchmark ?!
  6. I Have Reviewed Eight Ways to Write Python Decorators, Feel Free to Ask~