NVIDIA’s 50-Minute BERT Training: Beyond Just GPUs

Selected from arXiv

Author: Mohammad Shoeybi et al.

Translated by Machine Heart

Contributor: Mo Wang

Previously, Machine Heart covered a study in which NVIDIA broke three records in the NLP field: reducing BERT's training time to 53 minutes, reducing BERT's inference time to 2.2 milliseconds, and scaling GPT-2 to 8.3 billion parameters (the largest previous GPT-2 had 1.5 billion parameters). Many attributed these achievements simply to NVIDIA's hardware advantage: an abundance of GPUs. However, NVIDIA's recent paper reveals the model parallelism method behind the work: intra-layer model parallelism. The method requires no new compiler or library changes and can be implemented entirely by embedding a few communication operations in PyTorch.
Recent research on unsupervised language modeling has shown that training large neural language models drives state-of-the-art results in natural language processing applications. However, for very large models, memory constraints limit the size of model that can actually be trained. Model parallelism makes it possible to train larger models by splitting the parameters and distributing them across multiple processors.
NVIDIA's recent study introduces a simple and efficient intra-layer model parallelism approach that enables training state-of-the-art transformer language models with billions of parameters. The approach requires no new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be implemented entirely by embedding a few communication operations in PyTorch. Using this approach, the researchers trained a transformer language model with 8.3 billion parameters to convergence on 512 GPUs, making it the largest transformer model to date, 24 times the size of BERT and 5.6 times the size of GPT-2.
Figure 2 is a schematic of the model:

Figure 2: GPT-2 transformer architecture. The purple rectangles represent fully connected layers, and each blue rectangle represents a single transformer layer (repeated N times).
The entire application sustains 15.1 petaFLOPs with 76% scaling efficiency, while the strong single-GPU baseline sustains 39 teraFLOPs, which is 30% of the single GPU's peak FLOPs. Training this model to convergence on 174 GB of text data takes 9.2 days and roughly 12 zettaFLOPs of compute. Applying this language model to the WikiText103 and LAMBADA datasets yields state-of-the-art results: a perplexity of 10.8 on WikiText103 (previous SOTA: 16.4) and an accuracy of 66.5% on LAMBADA (previous SOTA: 63.2%). NVIDIA has released the training and evaluation code, as well as the weights of a smaller, portable model.
  • Paper link: https://arxiv.org/abs/1909.08053v1

  • Code link: https://github.com/NVIDIA/Megatron-LM

Research Contribution
NVIDIA researchers efficiently trained a transformer language model with 8.3 billion parameters using intra-layer model parallelism. They built a simple model-parallel implementation on the inherent structure of the transformer, one that trains efficiently in PyTorch without any custom C++ code or a new compiler. The approach is orthogonal to pipeline-based model parallelism.
To demonstrate the scalability of the approach, the researchers first established a baseline: a model with 1.2 billion parameters trained on a single NVIDIA V100 32GB GPU, which sustains 39 teraFLOPs, or 30% of the theoretical peak FLOPs of a single GPU in a DGX-2H server, making it a very strong baseline. Scaling the model to 8.3 billion parameters and training it with 8-way model parallelism on 512 GPUs sustains 15.1 petaFLOPs, a 76% scaling efficiency relative to the single-GPU case. Training this model to convergence on 174 GB of text data takes 9.2 days and roughly 12 zettaFLOPs of compute.
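As a quick sanity check on that 76% figure, scaling efficiency here is simply the sustained throughput divided by what 512 GPUs would deliver if each matched the 39-teraFLOP single-GPU baseline. A back-of-the-envelope calculation, using only the numbers quoted above:

```python
# Scaling efficiency = achieved throughput / (number of GPUs x single-GPU baseline).
baseline_tflops = 39        # sustained teraFLOPs of the 1.2B-parameter single-GPU baseline
num_gpus = 512
achieved_pflops = 15.1      # sustained petaFLOPs of the 8.3B-parameter, 512-GPU run

ideal_pflops = num_gpus * baseline_tflops / 1000      # perfect linear scaling
efficiency = achieved_pflops / ideal_pflops
print(f"ideal: {ideal_pflops:.1f} PFLOPs, efficiency: {efficiency:.1%}")
# -> ideal: 20.0 PFLOPs, efficiency: 75.6% (reported as 76%)
```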
Detailed scaling results are shown in Figure 1; as the number of GPUs increases, the effective compute they deliver grows nearly linearly.

Figure 1: FLOPs performance of model parallelism (blue) and model + data parallelism (green); the x-axis is the number of GPUs.
Model parallel (blue): up to 8-way model parallelism with weak scaling at roughly 1 billion parameters per GPU (e.g., 2 billion parameters on 2 GPUs and 4 billion on 4 GPUs). Model + data parallel (green): the same model-parallel configuration combined with 64-way data parallelism.
The researchers evaluated the trained models on the WikiText103 and LAMBADA datasets and found that as model size increases, perplexity on WikiText103 decreases and accuracy on LAMBADA increases, reaching state-of-the-art results on both tasks.
Model Parallel Transformer
The researchers utilized the structure of the transformer network, adding only a few synchronization primitives to create a simple model parallel implementation. They applied model parallelism to both the self-attention module and the multi-layer perceptron (MLP) module in the transformer.
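To see why this split works for the MLP block Z = GeLU(XA)B: if the first weight matrix A is partitioned by columns, the element-wise GeLU can be applied independently on each partition with no communication, and if the second weight matrix B is partitioned by rows, each partition produces a partial result for Z that only needs to be summed once at the end (a single all-reduce). Below is a minimal single-machine sketch (not NVIDIA's released Megatron-LM code) that checks this equivalence numerically; the explicit sum over partitions stands in for the all-reduce:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden, ffn, world_size = 8, 32, 2            # toy sizes; 2-way "model parallelism"
X = torch.randn(4, hidden)                    # a batch of 4 token vectors
A = torch.randn(hidden, ffn)                  # first MLP weight
B = torch.randn(ffn, hidden)                  # second MLP weight

# Unpartitioned reference: Z = GeLU(X A) B
Z_ref = F.gelu(X @ A) @ B

# Split A by columns and B by rows; GeLU stays local to each partition,
# and the partial outputs only need one reduction at the end.
A_parts = A.chunk(world_size, dim=1)          # A = [A_1, A_2]
B_parts = B.chunk(world_size, dim=0)          # B = [B_1; B_2]
Z = sum(F.gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_parts, B_parts))

print(torch.allclose(Z, Z_ref, atol=1e-5))    # True: the split is mathematically exact
```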

Figure 3: Transformer module after applying model parallelism. f and g are conjugate; f is the identity operator during forward propagation and all-reduce during backward propagation, while g is all-reduce during forward propagation and the identity operator during backward propagation.
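Here is a hedged sketch of what f and g could look like as PyTorch autograd functions, mirroring the caption's description rather than NVIDIA's released code; the process group passed to `dist.all_reduce` would be the model-parallel group and is omitted here for brevity:

```python
import torch
import torch.distributed as dist

class F_Op(torch.autograd.Function):
    """f: identity in the forward pass, all-reduce in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.clone()
        dist.all_reduce(grad)    # sum partial input gradients across model-parallel ranks
        return grad

class G_Op(torch.autograd.Function):
    """g: all-reduce in the forward pass, identity in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        out = x.clone()
        dist.all_reduce(out)     # sum partial activations across model-parallel ranks
        return out

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

# Hypothetical usage inside a model-parallel MLP forward, where fc1/fc2 hold this
# rank's column slice of A and row slice of B respectively:
#   y = torch.nn.functional.gelu(fc1(F_Op.apply(x)))   # f: identity now, all-reduce of dX later
#   z = G_Op.apply(fc2(y))                              # g: all-reduce of partial outputs now
```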
Figure 4: Communication operations in transformer layers. There are a total of 4 communication operations in the forward and backward propagation of a single model-parallel transformer layer.
Hybrid Model and Data Parallelism
Model parallelism and data parallelism are orthogonal, so both can be used simultaneously to train large models in a reasonable amount of time. Figure 5 shows how GPUs are grouped for hybrid model and data parallelism.

Figure 5: GPU grouping for hybrid model and data parallelism, with 8-way model parallelism and 64-way data parallelism.
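A minimal sketch of how 512 GPUs might be organized into such groups (the exact group construction in the released code may differ): consecutive ranks form the 8-way model-parallel groups that jointly hold one sharded copy of the model, and ranks holding the same shard across those groups form the 64-way data-parallel groups within which gradients are all-reduced.

```python
world_size = 512            # 32 DGX-2H servers x 16 GPUs
model_parallel_size = 8     # ranks that jointly hold one sharded copy of the model
data_parallel_size = world_size // model_parallel_size    # 64 model replicas

# Model-parallel groups: 64 groups of 8 consecutive ranks, e.g. [0..7], [8..15], ...
model_parallel_groups = [
    list(range(i * model_parallel_size, (i + 1) * model_parallel_size))
    for i in range(data_parallel_size)
]

# Data-parallel groups: 8 groups of 64 ranks that own the same model shard,
# e.g. [0, 8, 16, ...], [1, 9, 17, ...]; gradient all-reduce happens within these.
data_parallel_groups = [
    list(range(j, world_size, model_parallel_size))
    for j in range(model_parallel_size)
]

print(model_parallel_groups[0])      # [0, 1, 2, 3, 4, 5, 6, 7]
print(data_parallel_groups[0][:4])   # [0, 8, 16, 24]
```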
Experiments
All experiments were conducted on an NVIDIA DGX SuperPOD, using up to 32 DGX-2H servers (512 Tesla V100 SXM3 32GB GPUs in total).
To test the scalability achieved in the research, the researchers considered four parameter settings for the GPT-2 model, as shown in the table below:

Table 1: Parameters used in the scalability study. The hidden size per attention head is fixed at 96.
Figure 6 below shows the weak scaling efficiency of model parallelism and model + data parallelism; good scaling efficiency is achieved in both settings.

Figure 6: Weak scaling efficiency of model parallelism (a) and model + data parallelism (b); the x-axis is the number of GPUs.
To study the effect of the number of attention heads on model-parallel scaling, the researchers used 8-way model parallelism for the 8.3-billion-parameter model and set the number of attention heads to 16, 24, and 32. The results are shown in Table 2 below:

Table 2: Effect of the number of attention heads with 8-way model parallelism on the 8.3-billion-parameter model.
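The head count enters the picture because, in this scheme, each model-parallel GPU computes self-attention only for its own subset of heads; heads never interact inside the attention computation itself, so the split is exact and needs no communication until the output projection. A toy single-machine check of that independence (the tensor shapes and the 2-way split are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
seq, num_heads, head_dim, mp = 4, 8, 3, 2        # toy sizes; 8 heads split over 2 "GPUs"
q = torch.randn(num_heads, seq, head_dim)
k = torch.randn(num_heads, seq, head_dim)
v = torch.randn(num_heads, seq, head_dim)

def attn(q, k, v):
    # Scaled dot-product attention, batched over heads.
    scores = torch.softmax(q @ k.transpose(-1, -2) / head_dim ** 0.5, dim=-1)
    return scores @ v

# Reference: attention over all heads at once.
out_ref = attn(q, k, v)

# "Model parallel": each rank handles only its own block of heads; concatenating
# the per-rank results reproduces the reference exactly.
out_split = torch.cat([attn(*(t.chunk(mp)[r] for t in (q, k, v))) for r in range(mp)])
print(torch.allclose(out_ref, out_split))        # True
```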
The model parallelism in this study is aimed at training models that exceed the memory capacity of a single GPU, as well as accelerating the training of smaller models without increasing the batch size. To measure the acceleration effect, the researchers used a fixed model with 1.2 billion parameters, and the results are shown in Table 3:

Table 3: Acceleration obtained by training a 1.2 billion parameter model with model parallelism (batch size held constant).
To demonstrate how large language models drive SOTA results, Figure 7 shows perplexity on the validation set as a function of the number of training iterations.

Figure 7: Validation set perplexity. All language models were trained for 300k iterations. Larger language models converge noticeably faster and reach lower validation perplexities than their smaller counterparts.
This article was compiled by Machine Heart; please contact this public account for reprint authorization.