Programs these days need to run fast and deliver results quickly, so I spent some time studying mixed precision computation and parallel training. Since many articles already explain the underlying principles, this article only covers how to use PyTorch to achieve mixed precision computation, data parallelism, and distributed training, without going into the theory.
1
『Mixed Precision』
## Import the amp toolkit (available since PyTorch 1.6)
from torch.cuda.amp import autocast, GradScaler
import time
import tqdm

model.train()
## GradScaler scales the loss before backward() so that float16 gradients
## do not underflow (i.e. become too small to represent)
scaler = GradScaler()
batch_size = train_loader.batch_size
num_batches = len(train_loader)
end = time.time()
for i, (images, target) in tqdm.tqdm(
    enumerate(train_loader), ascii=True, total=len(train_loader)
):
    # measure data loading time (data_time is an AverageMeter-style tracker defined elsewhere in the script)
    data_time.update(time.time() - end)
    optimizer.zero_grad()
    if args.gpu is not None:
        images = images.cuda(args.gpu, non_blocking=True)
        target = target.cuda(args.gpu, non_blocking=True)
    # autocast automatically selects the precision of each GPU op,
    # speeding up training without reducing model accuracy
    with autocast():
        # compute output
        output = model(images)
        loss = criterion(output, target)
    # backward runs on the scaled loss; it does not need to be inside the autocast region
    scaler.scale(loss).backward()
    # scaler.step() replaces the usual optimizer.step()
    scaler.step(optimizer)
    scaler.update()
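If you also clip gradients, they must be unscaled before clipping. Below is a minimal sketch of how the loop body above could be extended under that assumption; the max_norm value is only an illustrative choice, not something prescribed by the article.

```python
import torch

# ... inside the training loop, replacing the backward/step lines above ...
scaler.scale(loss).backward()
# Unscale the gradients held by the optimizer so the clipping
# threshold applies to the true (unscaled) gradient values
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # illustrative value
scaler.step(optimizer)  # skips the step if any gradient is inf/NaN
scaler.update()
```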
2
『Data Parallelism』
# multigpu lists the GPU device ids to use
multigpu = [0, 1, 2, 3, 4, 5, 6, 7]
# Make gpu[0] the primary device: it gathers the outputs, computes the loss and gradients,
# and performs the parameter update
torch.cuda.set_device(multigpu[0])
# Replicate the model across the listed GPUs; all gradients are collected on gpu[0]
model = torch.nn.DataParallel(model, device_ids=multigpu).cuda(multigpu[0])
Using nn.DataParallel with Mixed Precision
model = MyModel()
dp_model = nn.DataParallel(model)

# autocast here only affects the main thread; dp_model's internal worker threads
# won't autocast, so the main thread's autocast state has no effect on them
with autocast():
    output = dp_model(input)
    loss = loss_fn(output)  # loss_fn still autocasts, but it's too late...
There are two ways to fix this.

1. Apply autocast as a decorator on the model's forward method:
class MyModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...
2. Alternatively, open an autocast region inside the forward method:

class MyModel(nn.Module):
    ...
    def forward(self, input):
        with autocast():
            ...
After modifying forward, using autocast in the main thread works as expected:

model = MyModel()
dp_model = nn.DataParallel(model)

with autocast():
    output = dp_model(input)
    loss = loss_fn(output)
Disadvantages of nn.DataParallel
In every training batch, nn.DataParallel gathers the outputs and losses back to gpu[0] for backpropagation, so gpu[0] carries extra data transfer and loss computation and the GPU load becomes uneven; in practice the load on gpu[0] is noticeably higher than on the other GPUs. The data transfer between GPUs also becomes a significant bottleneck for training, which is clearly undesirable. The underlying principles are explained in "PyTorch 21. Single-machine Multi-GPU Operations (DistributedDataParallel, Mixed Precision, Horovod)" (https://zhuanlan.zhihu.com/p/158375055).
3
『Distributed Computing』
Advantages
DistributedDataParallel launches one process per GPU; each process computes its loss and gradients locally and only the gradients are all-reduced between GPUs, so no single gpu[0] has to gather everything. This avoids the load imbalance and transfer bottleneck of nn.DataParallel described above.
Usage Introduction
# CUDA_VISIBLE_DEVICES specifies which GPUs are visible to the program
# nproc_per_node is the number of processes to launch, normally one per GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py
parser.add_argument('--local_rank', type=int, default=0,
                    help='local process rank, passed by torch.distributed.launch')
# torch.distributed.launch passes --local_rank to every process so that each process
# knows its position and which GPU it should use
# Bind this process to its GPU
torch.cuda.set_device(args.local_rank)
# Initialize the process group: PyTorch uses the NCCL backend for GPU communication,
# and init_method='env://' reads the connection parameters from environment variables
torch.distributed.init_process_group(
    backend='nccl',
    init_method='env://',
    rank=args.local_rank
)
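After init_process_group returns, each process can verify its place in the job. A small sanity check, assuming the setup above and using only standard torch.distributed calls:

```python
import torch.distributed as dist

# Print this process's global rank and the total number of processes.
# With the single-machine launch command above, the global rank equals local_rank.
if dist.is_initialized():
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} processes")
```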
from torch.utils.data.distributed import DistributedSampler

kwargs = {"num_workers": args.workers, "pin_memory": True} if use_cuda else {}
# DistributedSampler hands each process its own disjoint shard of the dataset
train_sampler = DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=train_sampler,
    **kwargs
)
# Note: because a sampler is supplied, the DataLoader must not also set shuffle;
# a custom batch_sampler would additionally exclude batch_size, sampler and drop_last,
# as the PyTorch source shows:
'''PyTorch dataloader.py, lines 192-197
if batch_sampler is not None:
    # auto_collation with custom batch_sampler
    if batch_size != 1 or shuffle or sampler is not None or drop_last:
        raise ValueError('batch_sampler option is mutually exclusive '
                         'with batch_size, shuffle, sampler, and '
                         'drop_last')'''
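One related detail, assuming an epoch-based training loop around the code above: DistributedSampler derives its shuffling seed from the epoch number, so set_epoch should be called at the start of every epoch, otherwise every epoch sees the same ordering. A minimal sketch (train() here is a hypothetical per-epoch training routine):

```python
for epoch in range(args.epochs):
    # Re-seed the sampler so the shuffle order changes from epoch to epoch
    train_sampler.set_epoch(epoch)
    train(train_loader, model, criterion, optimizer, epoch)  # hypothetical training routine
```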
torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)
model.to(device)
# Replace every BatchNorm layer with SyncBatchNorm so its statistics are synchronized across GPUs
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
    # allow parameters that are not used in a given forward pass
    find_unused_parameters=True,
)
# cudnn.benchmark makes the program spend a little extra time at startup searching for the
# fastest convolution algorithm for every convolution layer in the network, which then speeds up training
torch.backends.cudnn.benchmark = True
# During the backward pass, DistributedDataParallel all-reduces the gradients computed on the
# different GPUs so that every process applies the same parameter update
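Under DDP every process runs the same script, so without a guard each process would write the same checkpoint file. A common pattern is to save only from rank 0 and unwrap the DDP container via model.module; a minimal sketch (the checkpoint path is just an example):

```python
import torch
import torch.distributed as dist

if dist.get_rank() == 0:
    # model.module is the original model wrapped inside DistributedDataParallel
    torch.save(model.module.state_dict(), "checkpoint.pth")  # example path
```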
Synchronizing BatchNormalization Layers
- torch.nn.SyncBatchNorm
- torch.nn.SyncBatchNorm.convert_sync_batchnorm: automatically converts the BatchNormalization layers in a model to torch.nn.SyncBatchNorm, so that batch statistics are synchronized across GPUs.

model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
4
『Summary』
This article walked through torch.cuda.amp (autocast and GradScaler) for mixed precision training, nn.DataParallel for single-process data parallelism, and DistributedDataParallel with DistributedSampler and SyncBatchNorm for distributed training.
Citations (in no particular order):

- PyTorch 21. Single-machine Multi-GPU Operations (DistributedDataParallel, Mixed Precision, Horovod)
- PyTorch Source Code Interpretation of torch.cuda.amp: Detailed Explanation of Automatic Mixed Precision
- Automatic Mixed Precision (AMP) in PyTorch
- Speeding Up Training by 60%! Just 5 Lines of Code, PyTorch 1.6 Will Natively Support Automatic Mixed Precision Training
- torch.backends.cudnn.benchmark ?!
- I Have Reviewed Eight Ways to Write Python Decorators, Feel Free to Ask~