Author: Nicolas
Affiliation: Researcher at Zhuiyi Technology AI Lab
Research Direction: Information Extraction, Machine Reading Comprehension
PyTorch Implementation
from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")  # Note: "O1" (the letter O plus one), not "01" (zero-one)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
Theoretical Part
1. What is FP16?
FP16 (half-precision floating point) is the IEEE 754 binary16 format: 1 sign bit, 5 exponent bits, and 10 mantissa bits, compared with FP32's 1 + 8 + 23. This gives a maximum representable value of 65504 and a smallest positive normal value of about 6.1e-5, i.e. a much narrower dynamic range and coarser precision than FP32.

▲ Example of FP16 representation
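These limits are easy to verify with NumPy (a minimal check; exact print formatting may vary by NumPy version):

import numpy as np

info = np.finfo(np.float16)
print(info.max)   # 65504.0  -> largest finite FP16 value
print(info.tiny)  # ~6.1e-05 -> smallest positive normal FP16 value
print(info.eps)   # ~9.8e-04 -> spacing between 1.0 and the next FP16 value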
2. Why do we need FP16?
- Reduced GPU memory usage. Models keep getting larger; a pre-trained model such as BERT can easily occupy more than half of the GPU memory, which makes larger batch sizes hard to fit. Since FP16 takes half the memory of FP32, it roughly halves the memory footprint during training.
- Faster training and inference. Unlike ordinary space-for-time trade-offs, FP16 saves memory and speeds up training at the same time. In most benchmarks, FP16-based methods deliver roughly a 2x training speedup (like watching a drama at double speed).
- Widespread availability of Tensor Cores. Hardware progress is also pushing model computation forward. With NVIDIA Tensor Cores now widely deployed, 16-bit computation has matured, and low-precision computation is an important trend in deep learning; it is worth learning now rather than falling behind.
3. Problems Brought by FP16: Quantization Error
Overflow Errors (Grad Overflow / Underflow). Because FP16 has a much narrower dynamic range than FP32, computation easily runs into overflow (g > 65504) and underflow (g < 6 × 10⁻⁸) errors. Once values overflow, the dreaded "NaN" problem appears.
In deep learning, the gradients of activation functions are often smaller than the weight gradients, so underflow is the more common case.
▲ Underflow Issue
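Both failure modes are easy to reproduce with NumPy's float16 (a minimal illustration; PyTorch half tensors behave the same way):

import numpy as np

# Overflow: anything beyond 65504 becomes inf in FP16 (NumPy also emits an overflow warning)
print(np.float16(65504) * np.float16(2))   # inf

# Underflow: values below ~6e-8 flush to zero in FP16
print(np.float16(1e-8))                    # 0.0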
Rounding Errors (Rounding Error). A rounding error occurs when the gradient is smaller than the minimum representable interval at the weight's current magnitude, so the gradient update is silently lost. The classic example: a weight of 2⁻³ plus a gradient of 2⁻¹⁴ remains 2⁻³ in FP16, because the spacing between adjacent FP16 values around 2⁻³ is 2⁻¹³, as demonstrated below:
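A minimal NumPy sketch of this lost update:

import numpy as np

w = np.float16(2.0 ** -3)   # weight: 0.125
g = np.float16(2.0 ** -14)  # gradient, too small relative to the FP16 spacing (2**-13) around 0.125
print(w + g == w)           # True: the update is rounded away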
4. Solutions to the Problems: Mixed Precision Training + Dynamic Loss Scaling
Mixed Precision Training (Mixed Precision). The essence of mixed precision training is to "store weights and perform multiplications in FP16 to save memory and accelerate computation, while accumulating in FP32 to avoid rounding errors." This strategy effectively alleviates the rounding-error problem.
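A minimal sketch of the "FP16 compute, FP32 accumulate" idea with an explicit FP32 master weight (illustration only; this is similar in spirit to the master weights Apex keeps at the O2 level):

import torch

w16 = torch.tensor(1.0, dtype=torch.float16)          # FP16 weight
g16 = torch.tensor(2.0 ** -14, dtype=torch.float16)   # tiny FP16 gradient

print(w16 - g16 == w16)    # tensor(True): in pure FP16 the update is rounded away

master = w16.float()       # FP32 master copy of the weight
master -= g16.float()      # accumulate the update in FP32; it survives (~0.99994)
w16 = master.half()        # cast back to FP16 for the next forward pass
                           # (a single tiny step may not move w16 yet, but the FP32 master keeps accumulating)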
Loss Scaling (Loss Scaling). Even with mixed precision training, training may still fail to converge because the activation gradients are too small and underflow. The idea of loss scaling is:
- Before backpropagation, manually scale the loss (dLoss) up by a factor of 2^k, so that the intermediate values produced during backpropagation (the activation-function gradients) do not underflow;
- After backpropagation, scale the weight gradients back down by the same factor 2^k to restore their true values.
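A hand-rolled sketch of static loss scaling in PyTorch (illustration only, with a toy model; Apex AMP below does this, plus dynamic adjustment of the scale, automatically):

import torch
import torch.nn as nn

model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
inputs, targets = torch.randn(8, 16), torch.randn(8, 1)

scale = 2 ** 10                            # illustrative static scale factor

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
(loss * scale).backward()                  # scale the loss up before backprop
for p in model.parameters():               # scale the weight gradients back down afterwards
    if p.grad is not None:
        p.grad.div_(scale)
optimizer.step()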
New API in Apex: Automatic Mixed Precision (AMP)
from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")  # Note: "O1" (the letter O plus one), not "01" (zero-one)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
- O0: pure FP32 training, useful as an accuracy baseline;
- O1: mixed precision training (recommended); based on a whitelist/blacklist, it automatically decides whether each op runs in FP16 (GEMM, convolutions) or FP32 (Softmax);
- O2: "almost FP16" mixed precision training; there is no whitelist/blacklist, and almost everything except Batch Norm runs in FP16;
- O3: pure FP16 training, very unstable, but useful as a speed baseline.
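Putting it together, a minimal end-to-end training-loop sketch with Apex AMP (the tiny model and random data here are placeholders):

import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

# amp.initialize wraps the model and optimizer once, before the training loop
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for step in range(100):
    inputs, targets = torch.randn(8, 16).cuda(), torch.randn(8, 1).cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()   # backward on the scaled loss; AMP un-scales the grads
    optimizer.step()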

Key Takeaways: Pitfalls Encountered
This section contains the most valuable insights of the entire article, detailing all the pitfalls the author encountered while using Apex recently. Since Apex error messages are not obvious, debugging can often be frustrating, but by paying attention to the following points, 95% of issues can be resolved smoothly:
1. Check if your GPU supports FP16: Supported GPUs have Tensor Cores (2080Ti, Titan, Tesla, etc.), while unsupported ones (Pascal series) are not recommended for experimentation;
2. Constant Range: To ensure calculations do not overflow, first ensure that manually set constants (including those in the source code) do not overflow, such as various epsilon, INF, etc.;
3. Dimensions should preferably be multiples of 8: According to NVIDIA’s official documentation, performance is best when dimensions are multiples of 8;
4. Be cautious with operations involving sum, as they can easily overflow. For operations like Softmax, it is recommended to use official APIs and define them as layers in model initialization;
5. Model writing should be standardized: Custom layers should be written in the model initialization function, and graph calculations should be placed in the forward function;
6. Some less commonly used functions need to be registered before use, e.g. amp.register_float_function(torch, 'sigmoid') (see the sketch after this list);
7. Some functions (such as einsum) currently do not support FP16 acceleration, so it is advised not to use them heavily. The implementation of XLNet in FP16 [4] troubled me for a long time;
8. Modules that need to operate on model parameters (like EMA) should use the AMP-wrapped model;
9. Modules that need to operate on gradients must be within the optimizer’s step, otherwise AMP cannot determine if grad is NaN;
10. Additions to this list are welcome.
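For point 6, a minimal sketch of registering a function so that it always runs in FP32 under AMP (the registration has to happen before amp.initialize; torch.sigmoid is just an example):

import torch
from apex import amp

# Tell AMP to always run torch.sigmoid in FP32 (its inputs are cast to float)
amp.register_float_function(torch, 'sigmoid')

# Register first, then initialize as usual (model/optimizer assumed to exist)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")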
Conclusion
This article has given both a theoretical and a practical introduction to mixed precision computation and the new Apex API (AMP). The author now switches to mixed precision training as a matter of course when developing deep learning models: it is fast, it preserves accuracy, and it is a great help when tuning hyperparameters. At the time of writing there are few Chinese blog posts covering AMP and the pitfalls of using it, so hopefully this article saves readers some debugging time. If you run into new pitfalls, please get in touch and I will add them to this post.
References
http://market.itcgb.com/Contents/Intel/OR_AI_BJ/images/Brian_DeepLearning_LowNumericalPrecision.pdf