Introduction: This article summarizes the author's experience converting models from Pytorch to ONNX, covering the significance of the conversion work, the paths for model deployment, and the limitations of Pytorch itself.
Author: Champion of Diving at Overpass @ Zhihu
Source: https://zhuanlan.zhihu.com/p/272767300
Shared for academic purposes only; please contact us for deletion in case of infringement.
In the past few months, I participated in the work of converting models to ONNX for OpenMMlab (github account: drcut), with the main goal of supporting the conversion of some models from Pytorch to ONNX. Although I haven’t achieved much in these months, I’ve encountered many pitfalls, and I hope to record them here to help others.
This is the first part, the theoretical part, mainly introducing some macro issues unrelated to code. Next, I will write a practical part, analyzing some specific codes in OpenMMlab, explaining some coding techniques and precautions in the Pytorch to ONNX conversion process.
(1) The Significance of Converting Pytorch to ONNX
Generally speaking, converting to ONNX is only a means: after obtaining the ONNX model, further conversion is needed for deployment, for example to TensorRT. Some people add an extra step, converting from ONNX to Caffe and then from Caffe to TensorRT, because Caffe is friendlier to TensorRT (what "friendly" means is discussed later).
Therefore, before starting the ONNX conversion work, it is essential to clarify the target backend. ONNX is just a format, similar to JSON. As long as you meet certain rules, it is considered valid. Thus, simply converting from Pytorch to an ONNX file is straightforward. However, the ONNX accepted by different backend devices is different, which is the source of the pitfalls.
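As a minimal illustration of how straightforward producing some valid ONNX file is, here is my own sketch (the model choice is arbitrary; I use torchvision's ResNet-18 as a stand-in):

import torch
import torchvision

# Any (model, example input) pair produces a syntactically valid .onnx
# file, regardless of which backend will eventually consume it.
model = torchvision.models.resnet18().eval()  # weights don't matter here
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet18.onnx", opset_version=11)

The hard part is not producing this file, but producing a file the target backend will accept.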
The ONNX generated by Pytorch’s built-in torch.onnx.export, the ONNX required by ONNXRuntime, and the ONNX needed by TensorRT are all different.
Here’s a simple example of Maxpool:
Maxunpool can be regarded as the inverse operation of Maxpool. Let's first look at a Maxpool example. Suppose we have a C*H*W tensor of shape [2, 3, 3] in which both channels contain the same 2D matrix (the original illustration figures are omitted; a runnable sketch follows below).
If we call Pytorch's MaxPool on it with kernel_size=2, stride=1, padding=0, we get two outputs. The first is the pooled values. The other is the Maxpool Idx, which records which input element each output value came from, so that gradients can be propagated back directly during backpropagation.
Careful readers will notice that the Maxpool Idx could also take another form: each channel's indices are grouped together (offset by the channel's position) rather than restarting from 0 in every channel. Both forms are fine, as long as pooling and backpropagation stay consistent.
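Here is a small runnable sketch of this example; the concrete matrix values are my own, since the original figures are omitted:

import torch
import torch.nn.functional as F

# C*H*W tensor of shape [2, 3, 3]; both channels hold the same 2D matrix.
x = torch.arange(9, dtype=torch.float32).reshape(1, 3, 3).repeat(2, 1, 1)

# kernel_size=2, stride=1, padding=0; return_indices=True also returns
# the Maxpool Idx that MaxUnpool later consumes.
out, idx = F.max_pool2d(x.unsqueeze(0), kernel_size=2, stride=1,
                        padding=0, return_indices=True)
print(out[0])  # pooled values, identical in both channels
print(idx[0])  # Pytorch's convention: indices restart at 0 in each channel

Both channels print the same Idx block, [[4, 5], [7, 8]], which is the "restart from 0 in each channel" convention; under the alternative convention the second channel would be offset by H*W = 9, giving [[13, 14], [16, 17]].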
However, when I was adding support for OpenMMEditing, it involved Maxunpool, the inverse operation of Maxpool: given the Maxpool Idx and Maxpool's output, recover Maxpool's input.
Pytorch's MaxUnpool implementation accepts the Idx format in which each channel starts from 0, while ONNXRuntime expects the opposite. Therefore, to get the same result from ONNXRuntime, you must perform additional processing on the input Idx (i.e., on the same input you would hand to Pytorch). In other words, the neural network graph generated from Pytorch and the one required by ONNXRuntime are different.
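Continuing the sketch above, here is a hedged sketch of the extra Idx processing, assuming (per the behavior described here) that ONNXRuntime's MaxUnpool wants channel-offset indices while Pytorch's wants per-channel ones:

# Pytorch round-trips with the unmodified, per-channel Idx:
y = F.max_unpool2d(out, idx, kernel_size=2, stride=1, output_size=(3, 3))

# For ONNXRuntime, the exported graph must first offset each channel's
# indices by c * H * W (hypothetical pre-processing for illustration; in
# practice this fix lives in the symbolic function used during
# torch.onnx.export):
C, H, W = 2, 3, 3
idx_ort = idx + torch.arange(C).view(1, C, 1, 1) * H * W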
(2) ONNX and Caffe
There are two mainstream model deployment paths, taking TensorRT as an example. One is Pytorch->ONNX->TensorRT, and the other is Pytorch->Caffe->TensorRT. Personally, I believe the latter is currently more mature, mainly due to the inherent characteristics of ONNX, Caffe, and TensorRT.
[Comparison table of ONNX and Caffe omitted.]
The most important difference between ONNX and Caffe is the granularity of ops. For example, when converting a Bert Attention layer, ONNX turns it into a combination of MatMul, Scale, and SoftMax, while Caffe may directly generate a layer called Multi-Head Attention and tell the CUDA engineer, "go write a big kernel" (I suspect this road eventually leads to ResNet50 becoming a single layer...).
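A quick way to see this fine granularity is to export a bare attention core and list the resulting ONNX nodes. This is my own sketch of a single-head attention, not OpenMMlab code:

import torch
import onnx

class TinyAttention(torch.nn.Module):
    # Bare single-head attention core: score, scale, softmax, weight.
    def forward(self, q, k, v):
        scores = torch.matmul(q, k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
        return torch.matmul(torch.softmax(scores, dim=-1), v)

q = k = v = torch.randn(1, 4, 8)
torch.onnx.export(TinyAttention(), (q, k, v), "attn.onnx", opset_version=11)

# The exported graph is a chain of fine-grained ops, e.g.
# ['Transpose', 'MatMul', 'Div', 'Softmax', 'MatMul'] (the exact list
# depends on exporter and opset version); there is no "Attention" node.
print([node.op_type for node in onnx.load("attn.onnx").graph.node])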

Therefore, if one day a researcher proposes a new state-of-the-art op, it can likely be converted to ONNX directly (provided the op is assembled entirely from Pytorch's ATen operators), whereas Caffe engineers would have to write a new kernel by hand.
The advantage of fine-grained ops is flexibility; the disadvantage is that they can be slower. In recent years a great deal of work has gone into op fusion, i.e., combining small ops into large ones (for example, merging a convolution with the ReLU that follows it); XLA and TVM both invest heavily here.
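As a concrete taste of op fusion, here is a minimal sketch using Pytorch's own fuse_modules utility (my example; compilers such as XLA and TVM perform this kind of fusion automatically):

import torch

class ConvReLU(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 16, kernel_size=3)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))

m = ConvReLU().eval()  # fusion for inference requires eval mode
# Replaces the (conv, relu) pair with a single fused module, so the
# activation runs inside the convolution kernel instead of as its own op.
fused = torch.quantization.fuse_modules(m, [["conv", "relu"]])
print(fused)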
TensorRT is a deployment framework launched by NVIDIA, so performance is the primary consideration, and its layer granularity is accordingly coarse. Here, converting via Caffe has a natural advantage.
Moreover, coarse granularity can also solve branching issues. TensorRT views a neural network as a simple DAG: given fixed-shape inputs, performing the same operations yields fixed-shape outputs. (One current development direction of TensorRT is support for dynamic shapes, but it is still immature.) Now consider the following pseudocode:
tensor i = funcA();
tensor j;
if (i == 0)
    j = funcB(i);
else
    j = funcC(i);
funcD(j);
For the network above, suppose funcA, funcB, funcC, and funcD are all fine-grained operators supported by ONNX. ONNX then faces a dilemma: the DAG it produces is either funcA->funcB->funcD or funcA->funcC->funcD, and either way one branch is lost. Caffe, by contrast, can bypass the problem with coarse granularity:
tensor coarse_func(tensor i) {
    if (i == 0) return funcB(i);
    else        return funcC(i);
}

tensor i = funcA();
funcD(coarse_func(i));
Thus, the resulting DAG is simply funcA->coarse_func->funcD. The cost for Caffe, of course, is that some hard-working HPC engineer has to hand-write a coarse_func kernel... (I hope deep learning compilers will liberate HPC engineers soon.)
(3) Limitations of Pytorch Itself
Anyone familiar with deep learning frameworks knows that the reason Pytorch could emerge under TensorFlow's dominance and capture half the market is mainly its flexibility. To put it loosely, TensorFlow is like C++, while Pytorch is like Python. TensorFlow compiles the entire neural network before running, generating a DAG (directed acyclic graph), and then executes that graph. Pytorch instead executes step by step, computing each node's result only when execution reaches it.
ONNX essentially converts a network model from an upper-level deep learning framework into a graph. Since TensorFlow already has a graph, ONNX can take that graph directly, adjust it, and use it. Pytorch, however, has no concept of a graph at all. To complete the conversion from Pytorch to ONNX, ONNX has to take notes while Pytorch runs, recording everything it encounters, and then abstract those records into a graph. As a result, the Pytorch-to-ONNX conversion has two inherent limitations.
1. The conversion result is valid only for the specific input used. If a different input would change the network structure, ONNX cannot detect it. The most common case is an if statement in the network: if the example input takes the if branch, ONNX generates only the graph for that branch and discards everything in the else branch (see the sketch at the end of this section).
2. It is relatively expensive, because the entire neural network must actually be run once.
PS: For the two limitations above, my undergraduate thesis proposed a solution: scan the Pytorch or TensorFlow source code directly, using a compiler's lexical and syntactic analysis to obtain the graph structure. This completes the model-to-ONNX conversion lightweightly while also recovering branch information. Here is the github link (https://github.com/drcut/NN_transform); I hope everyone can support it.
(The Pytorch team currently hopes to solve the branch-statement issue through TorchScript, but as far as I know it is still not very mature.)
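To make limitation 1 concrete, here is a minimal sketch with a toy module of my own (not from the article):

import torch
import onnx

class Gate(torch.nn.Module):
    def forward(self, x):
        # Data-dependent branch: tracing bakes in whichever path the
        # example input happens to take and silently drops the other.
        if x.sum() > 0:
            return x * 2
        return x - 1

x = torch.ones(3)  # this example input takes the `if` branch
torch.onnx.export(Gate(), (x,), "gate.onnx")  # emits a TracerWarning

# Only the x * 2 path appears in the graph (e.g. ['Constant', 'Mul'],
# depending on the exporter version); the else branch is gone.
print([node.op_type for node in onnx.load("gate.onnx").graph.node])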