Essential Tool for PyTorch: Accelerate Mixed Precision Training with Apex

Author: Nicolas

Affiliation: Researcher at Zhuiyi Technology AI Lab

Research Direction: Information Extraction, Machine Reading Comprehension

Do you want to experience double the training speed?
Do you want to instantly double your GPU memory?
If I tell you that it only takes three lines of code, would you believe it?
In this article, the author will explain mixed precision computing in detail and introduce a tool developed by NVIDIA for mixed precision training acceleration based on PyTorch—Apex. Recently, Apex has updated its API, allowing for different levels of mixed precision acceleration to be achieved with just three lines of code, effectively halving the training time.
Without further ado, let’s teach you how to use it.

PyTorch Implementation

from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")  # note: the letter "O" + "1", not "01"
with amp.scale_loss(loss, optimizer) as scaled_loss:  # replaces the usual loss.backward()
    scaled_loss.backward()
Yes, it’s that simple. If you are not willing to spend time delving deeper, you can basically use it directly after reading this.
However, if you wish to gain a deeper understanding of FP16 and Apex, or if you run into unexplained "NaN" issues while using it, keep reading. The rest of the article covers some interesting theory as well as the various bugs the author ran into while using Apex over the past month; once you understand and work around them, you can leave slow FP32 training behind for good.

Theoretical Part

To fully understand the principles of mixed precision and the use of the API, let’s supplement some basic theoretical knowledge.

1. What is FP16?

Half-precision floating-point is a binary floating-point data type used by computers, stored in 2 bytes (16 bits).

Figure: Comparison of the range and precision represented by FP16 and FP32
Here the sign bit encodes the sign, the 5 exponent bits encode the exponent (a power of two with a bias of 15, i.e. 2^(x-15)), and the 10 fraction bits encode the fraction (m/1024), so a normal value equals (-1)^sign × 2^(x-15) × (1 + m/1024). When the exponent bits are all zero, the term to the left of the plus sign in the figure below is 0 (a subnormal number); in all other cases it is 1.

Figure: Example of an FP16 representation
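If you want to check these limits for yourself, PyTorch exposes them through torch.finfo. A quick sketch (the values in the comments are the standard IEEE 754 limits):

import torch

fp16 = torch.finfo(torch.float16)
fp32 = torch.finfo(torch.float32)

print(fp16.max)   # 65504.0    -> largest representable FP16 value
print(fp16.tiny)  # ~6.10e-05  -> smallest positive normal FP16 value
print(fp16.eps)   # ~9.77e-04  -> smallest step above 1.0 in FP16

print(fp32.max)   # ~3.40e+38
print(fp32.tiny)  # ~1.18e-38
print(fp32.eps)   # ~1.19e-07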

2. Why do we need FP16?

Before using FP16, I would like to reiterate why we use FP16.
  • Reduce GPU memory usage: Models keep getting larger, and pre-trained models like BERT alone often occupy more than half of the GPU memory, which makes larger batch sizes hard to use. Since FP16 takes half the memory of FP32, it naturally saves about half of the GPU memory during training.

  • Accelerate training and inference: Unlike the usual space-for-time trade-offs, FP16 saves memory and speeds up training at the same time. In most tests, FP16-based training roughly doubles the speed (like watching a soap opera at 2x).

  • Tensor Cores are everywhere: Hardware progress is also pushing model computation forward. With NVIDIA Tensor Cores now widely available, 16-bit computation has matured, and low-precision computation is an important trend in deep learning. If you don't pick it up now, you will fall behind.

3. Problems Brought by FP16: Quantization Error

This part is the most important theoretical core of the entire article.
Having covered the benefits of FP16, are there any problems with using it? Of course there are. FP16 brings two main problems: 1. overflow/underflow errors; 2. rounding errors.

Overflow Errors (Grad Overflow / Underflow): Because the dynamic range of FP16 (roughly 6×10^-8 to 65504) is much narrower than that of FP32 (roughly 1.4×10^-45 to 3.4×10^38), it is easy to run into overflow (Overflow, g > 65504) and underflow (Underflow, g < 6×10^-8) during computation. Once gradients overflow, "NaN" values start to appear.

In deep learning, since the gradients of activation functions are often smaller than the weight gradients, underflow is more likely to occur.

Figure: The underflow problem
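Both failure modes are easy to reproduce on CPU tensors; a tiny illustration (the literal values are just examples):

import torch

# Overflow: anything above 65504 becomes inf in FP16.
print(torch.tensor(70000.0, dtype=torch.float16))  # inf

# Underflow: values below the smallest FP16 subnormal (~6e-8) flush to zero.
print(torch.tensor(1e-8, dtype=torch.float16))     # 0.0

# The same values are unproblematic in FP32.
print(torch.tensor(70000.0, dtype=torch.float32))  # 70000.0
print(torch.tensor(1e-8, dtype=torch.float32))     # 1e-08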

Rounding Errors: A rounding error occurs when the gradient is smaller than the minimum representable interval at its current magnitude, so the corresponding weight update is simply lost. A diagram shows this clearly:

Figure: Rounding error
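The effect is just as easy to demonstrate: near 1.0 the FP16 grid spacing (eps) is about 9.77e-4, so any update smaller than half of that is rounded away. A minimal illustration:

import torch

weight = torch.tensor(1.0, dtype=torch.float16)
update = torch.tensor(1e-4, dtype=torch.float16)

print(weight + update)                   # 1.0    -> the update is lost in FP16
print(weight.float() + update.float())   # 1.0001 -> FP32 accumulation keeps it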

4. Solutions to the Problems: Mixed Precision Training + Dynamic Loss Scaling

Mixed Precision Training: The essence of mixed precision training is to store values and perform multiplications in FP16 to speed up computation, while doing accumulations in FP32 to avoid rounding errors. This strategy effectively alleviates the rounding-error problem.

Loss Scaling: Even with mixed precision training, convergence can still fail because the activation gradients are too small and underflow. The idea of loss scaling is as follows (a concrete manual sketch follows the two steps below):

  • Before backpropagation, manually multiply the loss by a scale factor 2^k, so that the intermediate values produced during backpropagation (the activation-function gradients) no longer underflow;

  • After backpropagation, divide the weight gradients by the same factor 2^k to restore their normal values.
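To make the two steps concrete, here is a minimal manual sketch with a static scale factor and a hypothetical toy model; AMP automates exactly this, and additionally adjusts the scale dynamically:

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
scale = 2.0 ** 10  # a static scaling factor 2^k

loss = nn.functional.mse_loss(model(x), y)
(loss * scale).backward()            # step 1: enlarge the loss before backward

for p in model.parameters():         # step 2: shrink the gradients back to normal
    if p.grad is not None:
        p.grad.div_(scale)

optimizer.step()
optimizer.zero_grad()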

New API in Apex: Automatic Mixed Precision (AMP)

The previous Apex mixed precision API required manually converting the model and the input data to half precision, which was cumbersome. The new API needs only three lines of code for painless use:
from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")  # note: the letter "O" + "1", not "01"
with amp.scale_loss(loss, optimizer) as scaled_loss:  # replaces the usual loss.backward()
    scaled_loss.backward()
opt_level

The only argument the user needs to configure is opt_level (a complete training-loop sketch follows the list of levels below):

O0 : Pure FP32 training, can be used as a baseline for accuracy;

O1 : Mixed precision training (recommended), automatically decides whether to use FP16 (GEMM, convolution) or FP32 (Softmax) for computation based on a whitelist and blacklist;

O2 : “Almost FP16” mixed precision training, with no whitelist and blacklist, almost everything except Batch Norm is computed using FP16;

O3 : Pure FP16 training, very unstable, but can be used as a speed baseline.
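Putting the three lines into context, here is a minimal end-to-end training-loop sketch. The model, optimizer, and random data are hypothetical placeholders; it assumes Apex is installed and a CUDA device is available:

import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Wrap once, before the training loop; "O1" is the recommended mixed precision level.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for step in range(100):
    x = torch.randn(32, 512).cuda()
    y = torch.randint(0, 10, (32,)).cuda()

    optimizer.zero_grad()
    loss = criterion(model(x), y)

    # Replaces loss.backward(): scale the loss so small activation gradients survive FP16.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    optimizer.step()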

Dynamic Loss Scaling
AMP uses dynamic loss scaling by default to make full use of the FP16 range and mitigate rounding errors. It tries to use the largest loss scale possible; whenever an overflow occurs, the parameter update for that step is skipped and the scale is reduced. After a number of overflow-free steps (e.g., 2000), it tries a larger scale again to make full use of the FP16 range:

Figure: The dynamic loss scaling strategy in AMP
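Schematically, the strategy amounts to the logic below; the constants and helper names are illustrative, not Apex's exact defaults:

# Illustrative sketch of dynamic loss scaling (not Apex's internal code).
scale = 2.0 ** 16
good_steps = 0

def on_training_step(grads_overflowed):
    """grads_overflowed: whether inf/NaN was found after backward with loss * scale."""
    global scale, good_steps
    if grads_overflowed:
        scale /= 2.0        # back off and skip this parameter update
        good_steps = 0
        return False        # caller skips optimizer.step()
    good_steps += 1
    if good_steps >= 2000:  # after enough stable steps, try a larger scale again
        scale *= 2.0
        good_steps = 0
    return True             # caller unscales the gradients and calls optimizer.step()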

Key Takeaways: Pitfalls Encountered

This section contains the most valuable insights of the entire article, detailing all the pitfalls the author encountered while using Apex recently. Since Apex error messages are not obvious, debugging can often be frustrating, but by paying attention to the following points, 95% of issues can be resolved smoothly:

1. Check whether your GPU supports FP16: supported GPUs have Tensor Cores (2080Ti, Titan, Tesla, etc.); unsupported ones (e.g., the Pascal series) are not recommended for these experiments;

2. Constant Range: To ensure calculations do not overflow, first ensure that manually set constants (including those in the source code) do not overflow, such as various epsilon, INF, etc.;

3. Dimensions should preferably be multiples of 8: According to NVIDIA’s official documentation, performance is best when dimensions are multiples of 8;

4. Be cautious with operations involving sum, as they can easily overflow. For operations like Softmax, it is recommended to use official APIs and define them as layers in model initialization;

5. Model writing should be standardized: Custom layers should be written in the model initialization function, and graph calculations should be placed in the forward function;

6. Some less commonly used functions need to be registered before use: amp.register_float_function(torch, 'sigmoid') (see the snippet after this list);

7. Some functions (such as einsum) currently do not support FP16 acceleration, so it is advised not to use them heavily. The implementation of XLNet in FP16 [4] troubled me for a long time;

8. Modules that need to operate on model parameters (like EMA) should use the AMP-wrapped model;

9. Modules that need to operate on gradients must be within the optimizer’s step, otherwise AMP cannot determine if grad is NaN;

10. Additions to this list are welcome.
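For point 6 above, the registration call goes before amp.initialize, following the Apex documentation; the model and optimizer here are just placeholders:

import torch
import torch.nn as nn
from apex import amp

# Force torch.sigmoid to run in FP32 under AMP; register before amp.initialize.
amp.register_float_function(torch, 'sigmoid')

model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")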

Conclusion

This article gave a theoretical and practical introduction to mixed precision computing and the new Apex API (AMP). The author now switches to mixed precision training by default when developing deep learning models, because it is fast and preserves accuracy, making it an essential tool for hyperparameter tuning. At the time of writing there were hardly any Chinese blog posts covering AMP and the pitfalls of using it, so this article aims to save readers some debugging time. If you discover new pitfalls, I welcome the exchange and will add them to the column blog [5].

References

[1] Intel’s Low Precision Representation for Deep Learning Training and Inference

http://market.itcgb.com/Contents/Intel/OR_AI_BJ/images/Brian_DeepLearning_LowNumericalPrecision.pdf

[2] Official NVIDIA Mixed Precision Training Documentation
https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
[3] Official Apex Usage Documentation
https://nvidia.github.io/apex/amp.html
[4] XLNet Implementation Change to FP16
https://github.com/NVIDIA/apex/issues/394
[5] Column Blog
https://zhuanlan.zhihu.com/p/79887894


