Detailed Guide on Converting Pytorch to ONNX


Author: Diving Champion @ Zhihu
Source: https://zhuanlan.zhihu.com/p/272767300
Editor: Jishi Platform

Jishi Guide

The author summarizes their experience converting models from Pytorch to ONNX, covering why the conversion matters, the deployment paths a model can take, and the limitations of Pytorch itself.

In the past few months, I have been working on ONNX conversion in OpenMMlab (GitHub account: drcut), with the main goal of supporting the export of some models from Pytorch to ONNX. Although I haven't achieved much in that time, I have stepped on plenty of pitfalls, and I'm documenting them here in the hope of helping others.

This is the first, theoretical part, covering macro-level issues unrelated to code. Next, I will write a practical part that analyzes specific code in OpenMMlab and explains coding techniques and precautions for the Pytorch-to-ONNX conversion.

(1) Significance of Converting Pytorch to ONNX

Generally speaking, converting to ONNX is only a means to an end. After obtaining the ONNX model, further conversion is usually needed, for example to TensorRT for deployment; some pipelines even insert an intermediate step, converting ONNX to Caffe and then Caffe to TensorRT, because Caffe is friendlier to TensorRT (I will define "friendly" later).

Therefore, before starting the ONNX conversion work, it is essential to clarify the target backend. ONNX is just a format, similar to JSON. As long as certain rules are met, it is considered valid. Thus, simply converting from Pytorch to an ONNX file is quite straightforward. However, the ONNX accepted by different backend devices varies, which is the source of many pitfalls.

The ONNX generated by Pytorch’s built-in torch.onnx.export, the ONNX required by ONNXRuntime, and the ONNX needed by TensorRT are all different.
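
To make the baseline concrete, here is a minimal export sketch (the model choice, shapes, and filenames are illustrative):

import torch
import torchvision
import onnx

# Export a torchvision model to ONNX (untrained weights; purely illustrative).
model = torchvision.models.resnet18().eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx", opset_version=11)

# The result is spec-valid ONNX, which is all the checker verifies:
onnx.checker.check_model(onnx.load("resnet18.onnx"))
# Passing this check says nothing about whether ONNXRuntime or TensorRT
# can actually consume every node in the graph.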

Here's a simple example involving MaxUnpool. MaxUnpool can be seen as the inverse operation of MaxPool, so let's first look at MaxPool. Suppose we have a tensor of shape C*H*W (concretely [2, 3, 3]), where every channel holds the same 2D matrix:

[Figure: the 3×3 matrix shared by both channels]

If we call MaxPool(kernel_size=2, stride=1, pad=0) on it in Pytorch, we get two outputs. The first output is the value after MaxPool:

[Figure: the 2×2 pooled values per channel]

The other is the MaxPool Idx, which records, for each output element, the position of the input element it came from, so that during backpropagation the gradient of an output can be passed directly to the corresponding input:

[Figure: the MaxPool Idx, with counting restarting from 0 in each channel]

Careful readers may notice that the MaxPool Idx also admits another representation:

[Figure: the MaxPool Idx numbered consecutively across channels]

one that numbers positions across all channels consecutively, rather than restarting from 0 in each channel. Both representations are valid, as long as forward and backward remain consistent.

However, when I was supporting OpenMMEditing, I encountered MaxUnpool, the inverse operation of MaxPool: given the MaxPool Idx and the output of MaxPool, it reconstructs the input of MaxPool (with zeros at the non-maximal positions).

Pytorch's MaxUnpool implementation expects each channel's idx to count from 0, while ONNXRuntime expects the idx to be flattened across the entire tensor. Therefore, to get the same result from ONNXRuntime, you must additionally process the input Idx (i.e., the Idx that matches Pytorch's convention). In other words, the neural network graph Pytorch exports and the one ONNXRuntime expects are different.
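
The difference can be seen from Pytorch alone. Below is a minimal sketch with illustrative values, showing the per-channel idx that Pytorch produces and the whole-tensor offset that an ONNXRuntime-style idx needs:

import torch
import torch.nn as nn

# A [2, 3, 3] tensor whose two channels hold the same 3x3 matrix (illustrative values).
x = torch.arange(1., 10.).reshape(1, 3, 3).repeat(2, 1, 1)

pool = nn.MaxPool2d(kernel_size=2, stride=1, return_indices=True)
out, idx = pool(x)
print(idx)  # both channels: [[4, 5], [7, 8]] -- counting restarts at 0 per channel

# An ONNXRuntime-style idx is flattened over the whole tensor, so channel c
# must be offset by c * H * W: channel 1 becomes [[13, 14], [16, 17]].
h, w = x.shape[-2:]
flat_idx = idx + torch.arange(x.shape[0]).view(-1, 1, 1) * h * w

# Pytorch's MaxUnpool2d expects the per-channel convention:
unpool = nn.MaxUnpool2d(kernel_size=2, stride=1)
print(unpool(out, idx))  # the maxima back at their input positions, zeros elsewhere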

(2) ONNX and Caffe

There are two mainstream paths for model deployment, taking TensorRT as an example: one is Pytorch -> ONNX -> TensorRT, and the other is Pytorch -> Caffe -> TensorRT. Personally, I believe the latter is currently more mature, mainly due to the properties of ONNX, Caffe, and TensorRT.

[Table: key differences between ONNX and Caffe]

The table above lists some key differences between ONNX and Caffe, with the most important difference being the granularity of ops. For example, if we convert the Attention layer of Bert, ONNX will break it down into a combination of MatMul, Scale, and SoftMax, while Caffe may generate a layer directly called Multi-Head Attention and tell CUDA engineers: “You go write a big kernel” (I suspect that eventually, ResNet50 will be turned into a single layer…).
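
This is easy to verify. Here is a minimal sketch (the module and filename are illustrative) that exports a bare attention computation and prints the op types in the resulting graph:

import torch
import onnx

# A stripped-down attention computation (illustrative, not Bert's actual module).
class TinyAttention(torch.nn.Module):
    def forward(self, q, k, v):
        scores = torch.matmul(q, k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
        return torch.matmul(torch.softmax(scores, dim=-1), v)

q = k = v = torch.randn(1, 4, 8)
torch.onnx.export(TinyAttention(), (q, k, v), "attention.onnx")

print([node.op_type for node in onnx.load("attention.onnx").graph.node])
# e.g. ['Transpose', 'MatMul', 'Div', 'Softmax', 'MatMul'] -- a pile of
# fine-grained ops, not one big fused attention layer.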

Thus, if one day a researcher proposes a new state-of-the-art op, it can likely be converted to ONNX directly (as long as the op is implemented purely with Pytorch's ATen library), whereas on the Caffe side an engineer has to handwrite a new kernel.
The advantage of fine-grained ops is that they are very flexible, while the downside is that they may be slower. In recent years, there has been a lot of work on op fusion (such as combining convolution and its subsequent relu into one computation). Both XLA and TVM have invested significantly in op fusion, which combines small ops into larger ops.
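
As one concrete example of such fusion (a sketch assuming you have some model.onnx on disk; filenames are illustrative), ONNXRuntime runs fusion passes when a session is created and can dump the optimized graph for inspection:

import onnxruntime as ort

# Ask ONNXRuntime to write out the graph after its optimization passes.
so = ort.SessionOptions()
so.optimized_model_filepath = "model_optimized.onnx"
sess = ort.InferenceSession("model.onnx", so)
# Inspecting model_optimized.onnx typically shows patterns such as a
# convolution and its following activation collapsed into a single fused node.
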
TensorRT is NVIDIA's deployment framework, where performance is naturally the primary consideration, so its layer granularity is quite coarse. Here, converting from Caffe models has a natural advantage.
Additionally, coarse granularity can also address branching issues. To TensorRT, a neural network is simply a directed acyclic graph (DAG): given an input of fixed shape, performing the same computation yields an output of fixed shape.
*Currently, one direction of TensorRT development is support for dynamic shapes, but it is still quite immature.

Consider the following network:
tensor i = funcA();
if(i==0)
  j = funcB(i);
else
  j = funcC(i);
funcD(j);
For the above network, assume funcA, funcB, funcC, and funcD are all fine-grained operators supported by ONNX. ONNX then faces a dilemma: the exported DAG will be either funcA -> funcB -> funcD or funcA -> funcC -> funcD, depending on which branch the example input takes, and either way one branch is lost.
Caffe can bypass this problem using coarse granularity.
tensor i = funcA();
coarse_func(tensor i) {
  if(i==0) return funcB(i);
  else return funcC(i);
}
funcD(coarse_func(i));
Thus, the resulting DAG is: funcA -> coarse_func -> funcD.
Of course, the cost for Caffe is that the hardworking HPC engineers need to manually write a coarse_func kernel… (Hopefully, the Deep Learning Compiler can liberate HPC engineers soon).
(3) Limitations of Pytorch Itself
Those familiar with deep learning frameworks know that Pytorch managed to break through and capture a significant market share while TensorFlow was already dominant, mainly thanks to its flexibility. An inappropriate analogy: TensorFlow is like C++, while Pytorch is like Python.
TensorFlow compiles the entire neural network before running it, generating a directed acyclic graph (DAG) to execute. Pytorch, on the other hand, operates step by step, calculating the result at each node before determining what to compute next.
ONNX essentially converts the network model from the upper-level deep learning framework into a graph. Since TensorFlow already has a graph, it can directly take this graph, make some adjustments, and use it.
However, for Pytorch there is no such graph. To convert Pytorch to ONNX, ONNX has to take notes while Pytorch runs, recording every operation it encounters, and then abstract the record into a graph (this recording process is known as tracing). As a result, converting from Pytorch to ONNX has two inherent limitations.
1. The conversion result is valid only for the specific input used during tracing. If a different input would change the network structure, ONNX cannot detect it. The most common case is an if statement in the network: if the example input goes through the if branch, ONNX generates only the graph for that branch and discards everything in the else branch (see the sketch after these two points).
2. It requires a considerable amount of computation, since the network genuinely has to be run once.
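
A minimal sketch of limitation 1 (module and values are illustrative); tracing bakes in whichever branch the example input happens to take:

import torch

class Branchy(torch.nn.Module):
    def forward(self, x):
        if x.sum() > 0:
            return x * 2
        return x - 1

# Trace with an input that takes the "if" branch; the "else" branch is discarded.
traced = torch.jit.trace(Branchy(), torch.ones(3))

neg = torch.full((3,), -5.0)
print(Branchy()(neg))  # eager: tensor([-6., -6., -6.])    (else branch)
print(traced(neg))     # traced: tensor([-10., -10., -10.]) (still the if branch)
# torch.onnx.export records the graph by tracing in the same way.
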
PS: Regarding the above two limitations, my undergraduate thesis proposed a solution: use compiler-style lexical and syntactic analysis to scan the Pytorch or TensorFlow source code directly and recover the graph structure. This allows a lightweight conversion to ONNX while also capturing branch information. Here's the GitHub link (https://github.com/drcut/NN_transform); I hope everyone supports it.
*Currently, Pytorch’s official solution aims to address branching issues through TorchScript, but as far as I know, it is still not very mature.
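
For contrast, a minimal sketch of what scripting does (the function is illustrative): torch.jit.script parses the Python source instead of tracing it, so both branches survive in the graph:

import torch

@torch.jit.script
def branchy(x: torch.Tensor) -> torch.Tensor:
    if bool(x.sum() > 0):
        return x * 2
    return x - 1

print(branchy.graph)  # the IR contains a prim::If node holding both branches;
# script-based ONNX export can map such nodes to ONNX's If operator.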