Unlocking CNN and Transformer Integration

Click the "Little White Learns Vision" above, select to add "Star" or "Top"
Heavyweight content, delivered at the first time

For academic sharing only, does not represent the position of this public account, contact for deletion if infringing
Reprinted from: Machine Heart
Due to their complex attention mechanisms and model designs, most existing Vision Transformers (ViTs) cannot run as efficiently as Convolutional Neural Networks (CNNs) in real industrial deployment scenarios. This raises a question: can a visual neural network infer as fast as a CNN while being as powerful as a ViT?
Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, but their overall performance is far from satisfactory. To this end, researchers from ByteDance proposed Next-ViT, a next-generation Vision Transformer that can be effectively deployed in real industrial scenarios. In terms of the latency/accuracy trade-off, Next-ViT rivals both strong CNNs and strong ViTs.

Paper: Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios
Paper link: https://arxiv.org/pdf/2207.05501.pdf
To capture local and global information with deployment-friendly mechanisms, the Next-ViT research team developed a novel Next Convolution Block (NCB) and Next Transformer Block (NTB). The study then proposed a new hybrid strategy, NHS, which stacks NCBs and NTBs in an efficient hybrid paradigm to improve performance on various downstream tasks.
Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of the latency/accuracy trade-off across various visual tasks. On TensorRT, Next-ViT surpasses ResNet by 5.4 mAP on COCO detection (40.4 vs. 45.8) and by 8.2% mIoU on ADE20K segmentation (38.8% vs. 47.0%), while matching the performance of CSWin with 3.6× faster inference. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP on COCO detection (42.6 vs. 47.2) and by 3.5% mIoU on ADE20K segmentation (45.2% vs. 48.7%).
Method
The overall architecture of Next-ViT is shown in Figure 2. Next-ViT follows a hierarchical pyramid architecture, equipped at each stage with a patch embedding layer and a series of convolution or Transformer blocks. The spatial resolution is progressively reduced to 1/32 of the input, while the channel dimension is expanded stage by stage.
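To make the pyramid layout concrete, here is a minimal PyTorch sketch of a four-stage hierarchical backbone in which the overall stride reaches 1/32 and the channel width grows stage by stage. The strided-convolution patch embedding and the channel widths are illustrative assumptions, not the paper's exact configuration, and the convolution/Transformer blocks inside each stage are omitted.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Illustrative patch embedding: a strided convolution that reduces the
    spatial resolution and changes the channel width (an assumption used
    here only to show the pyramid layout)."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.proj(x)

class PyramidBackbone(nn.Module):
    """Four-stage hierarchical pyramid: the overall stride reaches 1/32 of
    the input resolution while the channel dimension grows stage by stage."""
    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        self.stem = PatchEmbed(3, channels[0], stride=4)     # 1/4 resolution
        self.downs = nn.ModuleList(
            PatchEmbed(c_in, c_out)                          # 1/8, 1/16, 1/32
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, x):
        x = self.stem(x)
        feats = [x]                    # multi-scale features for downstream tasks
        for down in self.downs:
            x = down(x)
            feats.append(x)
        return feats

feats = PyramidBackbone()(torch.randn(1, 3, 224, 224))
print([tuple(f.shape) for f in feats])   # strides 4, 8, 16, 32
```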

The researchers first carefully designed the core blocks for information interaction, developing the powerful NCB and NTB to model short-term and long-term dependencies in visual data, respectively. The NTB also fuses local and global information, further enhancing the modeling capability. Finally, to overcome the inherent shortcomings of existing methods, the study systematically investigated how convolution and Transformer blocks should be integrated and proposed the NHS strategy, which stacks NCBs and NTBs to construct the new CNN-Transformer hybrid architecture.
NCB
The researchers analyzed several classic structural designs, as shown in Figure 3. The BottleNeck block introduced by ResNet has long dominated visual neural networks thanks to its inherent inductive bias and its ease of deployment on most hardware platforms. Unfortunately, it is less effective than the Transformer block. The ConvNeXt block modernizes the BottleNeck block by mimicking the design of the Transformer block; although it improves network performance, its inference speed on TensorRT/CoreML is severely limited by inefficient components. The Transformer block achieves excellent results across various visual tasks, but its complex attention mechanism makes its inference much slower than the BottleNeck block on TensorRT and CoreML, which most real industrial scenarios cannot afford.

To overcome the issues of the aforementioned blocks, the study proposed the Next Convolution Block (NCB), which retains the deployment advantages of the BottleNeck block while achieving the outstanding performance of the Transformer block. As shown in Figure 3(f), NCB follows the general architecture of MetaFormer (which has been proven to be crucial for Transformer blocks).
In addition, an efficient token mixer that plays the role of attention is equally important. The study designed Multi-Head Convolution Attention (MHCA) as an efficient token mixer built from deployment-friendly convolution operations, and constructed the NCB from MHCA and MLP layers within the MetaFormer paradigm.
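The following is a hedged PyTorch sketch of such a block: a MetaFormer-style layout in which a convolutional token mixer plays the role of attention. Implementing MHCA as a grouped 3×3 convolution with a point-wise projection, and the exact placement of BatchNorm/ReLU, are assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MHCA(nn.Module):
    """Multi-Head Convolution Attention, sketched as a grouped 3x3 convolution
    (one group per head) followed by a point-wise projection. The exact
    formulation is an illustrative assumption."""
    def __init__(self, dim, head_dim=32):
        super().__init__()
        self.group_conv = nn.Conv2d(dim, dim, 3, padding=1,
                                    groups=dim // head_dim, bias=False)
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.ReLU(inplace=True)
        self.proj = nn.Conv2d(dim, dim, 1, bias=False)

    def forward(self, x):
        return self.proj(self.act(self.norm(self.group_conv(x))))

class ConvMLP(nn.Module):
    """Channel MLP built from 1x1 convolutions, with ReLU for deployment
    friendliness."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        hidden = dim * mlp_ratio
        self.fc1 = nn.Conv2d(dim, hidden, 1)
        self.act = nn.ReLU(inplace=True)
        self.fc2 = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

class NCB(nn.Module):
    """Next Convolution Block in a MetaFormer-style layout:
    token mixer (MHCA) + MLP, each wrapped in a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.mixer = MHCA(dim)
        self.mlp = ConvMLP(dim)

    def forward(self, x):
        x = x + self.mixer(x)   # token mixing (local interaction)
        x = x + self.mlp(x)     # channel mixing
        return x

print(NCB(64)(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```
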
NTB
The NCB effectively learns local representations; the next step is to capture global information. The Transformer architecture is strong at capturing low-frequency signals, which provide global information such as global shape and structure.
However, related studies have found that Transformer blocks may somewhat deteriorate high-frequency information, such as local texture. Signals in different frequency bands are all essential to the human visual system, and they are fused in a specific way to extract more essential and distinctive features.
Motivated by these observations, the study developed the Next Transformer Block (NTB) to capture multi-frequency signals with a lightweight mechanism. The NTB thus serves as an effective multi-frequency signal mixer, further enhancing the overall modeling capability.
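Below is a rough PyTorch sketch of such a multi-frequency block, assuming a global (low-frequency) attention branch and a local (high-frequency) convolution branch whose outputs are concatenated and mixed by an MLP, with a shrink ratio r controlling the channel split. The efficient attention here simply average-pools keys/values; this and the branch wiring are illustrative assumptions, and the MHCA and ConvMLP classes are reused from the NCB sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMHSA(nn.Module):
    """Efficient multi-head self-attention sketch: keys/values are spatially
    down-sampled (average pooling here) before standard attention, one common
    way to keep global attention affordable at higher resolutions."""
    def __init__(self, dim, num_heads=4, sr=2):
        super().__init__()
        self.sr = sr
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)           # (B, H*W, C)
        kv = F.avg_pool2d(x, self.sr).flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(B, C, H, W)

class NTB(nn.Module):
    """Next Transformer Block sketch: a global (low-frequency) attention branch
    and a local (high-frequency) convolution branch, concatenated and mixed by
    an MLP. The channel split via the shrink ratio r is an assumption."""
    def __init__(self, dim, shrink_ratio=0.75):
        super().__init__()
        attn_dim = int(dim * shrink_ratio)         # channels for the global branch
        conv_dim = dim - attn_dim                  # channels for the local branch
        self.proj_attn = nn.Conv2d(dim, attn_dim, 1)
        self.attn = EMHSA(attn_dim)
        self.proj_conv = nn.Conv2d(attn_dim, conv_dim, 1)
        self.local = MHCA(conv_dim)                # MHCA from the NCB sketch above
        self.mlp = ConvMLP(dim)                    # ConvMLP from the NCB sketch above

    def forward(self, x):
        z1 = self.proj_attn(x)
        z1 = z1 + self.attn(z1)                    # low-frequency / global signals
        z2 = self.proj_conv(z1)
        z2 = z2 + self.local(z2)                   # high-frequency / local signals
        z3 = torch.cat([z1, z2], dim=1)            # multi-frequency fusion
        return z3 + self.mlp(z3)

print(NTB(128)(torch.randn(1, 128, 14, 14)).shape)   # torch.Size([1, 128, 14, 14])
```
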
NHS
Recent works have attempted to combine CNNs and Transformers for efficient deployment. As shown in Figures 4(b) and (c), almost all of them adopt convolution blocks in the shallow stages and stack Transformer blocks only in the last one or two stages. This combination works for classification, but the study found that such hybrid strategies easily reach performance saturation on downstream tasks such as segmentation and detection. The reason is that classification uses only the output of the last stage for prediction, whereas downstream tasks rely on the features of every stage for better results; because traditional hybrid strategies stack Transformer blocks only in the last few stages, the shallow stages cannot capture global information.

The study proposed a new hybrid strategy, NHS, which creatively combines convolution blocks (NCB) and Transformer blocks (NTB) in an (NCB × N + NTB × 1) × L hybrid paradigm. By controlling the proportion of Transformer blocks, NHS significantly improves model performance on downstream tasks while remaining efficient to deploy.
First, to enable the shallow layers to capture global information, the study proposed an (NCB × N + NTB × 1) hybrid pattern that stacks N NCBs followed by one NTB in each stage, as shown in Figure 4(d). Placing the Transformer block (NTB) at the end of each stage allows the model to learn global representations even in the shallow layers. The study conducted a series of experiments to verify the superiority of the proposed hybrid strategy; the performance of the different hybrid strategies is shown in Table 1.
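A minimal sketch of this per-stage pattern, reusing the NCB and NTB classes from the sketches above (the channel width is chosen arbitrarily for illustration):

```python
import torch.nn as nn

def make_stage(dim, n_ncb, repeats=1):
    """One stage of the (NCB x N + NTB x 1) x L hybrid pattern: N convolution
    blocks followed by a single Transformer block, repeated L times."""
    blocks = []
    for _ in range(repeats):                        # L groups
        blocks += [NCB(dim) for _ in range(n_ncb)]  # N convolution blocks
        blocks.append(NTB(dim))                     # one NTB at the end of the group
    return nn.Sequential(*blocks)

# e.g. a stage following the (NCB x 3 + NTB x 1) pattern
stage = make_stage(dim=128, n_ncb=3)
```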

Furthermore, as shown in Table 2, the performance of the large models gradually saturates. This indicates that scaling the model up by enlarging N in the (NCB × N + NTB × 1) pattern, i.e., simply adding more convolution blocks, is not the best choice, and that the value of N may significantly affect model performance.

Therefore, the researchers explored the impact of N on model performance through extensive experiments. As shown in Table 2 (middle), the study built models with different values of N in the third stage. For a fair comparison, models with similar latency were constructed: when N is small, L groups of the (NCB × N + NTB × 1) pattern are stacked.
As shown in Table 2, the model with N = 4 in the third stage achieves the best trade-off between performance and latency. The study then built larger models by expanding the (NCB × 4 + NTB × 1) × L pattern in the third stage. As shown in Table 2 (bottom), the Base (L = 4) and Large (L = 6) models perform significantly better than the small model, verifying the general effectiveness of the proposed (NCB × N + NTB × 1) × L pattern.
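Continuing the hypothetical make_stage helper above, scaling to the Base and Large configurations described in Table 2 only changes the repeat count L of the third stage (the channel width is again illustrative):

```python
# (NCB x 4 + NTB x 1) x L in the third stage:
# L = 4 for the Base model and L = 6 for the Large model (per Table 2).
stage3_base  = make_stage(dim=256, n_ncb=4, repeats=4)
stage3_large = make_stage(dim=256, n_ncb=4, repeats=6)
```
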
Finally, for a fair comparison with existing SOTA networks, the researchers proposed three typical variants, namely Next-ViT-S/B/L.

Experimental Results
Classification Task on ImageNet-1K
Compared with the latest SOTA methods (CNNs, ViTs, and hybrid networks), Next-ViT achieves the best trade-off between accuracy and latency, as shown in Table 4.

Semantic Segmentation Task on ADE20K
The study compared Next-ViT with CNNs, ViTs, and several recent hybrid architectures for the semantic segmentation task. As shown in Table 5, extensive experiments demonstrate that Next-ViT has excellent potential in segmentation tasks.

Object Detection and Instance Segmentation
In the object detection and instance segmentation tasks, the study compared Next-ViT with SOTA models, with results shown in Table 6.

Ablation Studies and Visualization
To better understand Next-ViT, the researchers analyzed the role of each key design by evaluating its performance on ImageNet-1K classification and downstream tasks, and visualized the Fourier spectrum and heat maps of the output features to show the inherent advantages of Next-ViT.
As shown in Table 7, NCB achieves the best latency/accuracy trade-off across all three tasks.

For the NTB block, the study explored the effect of the shrink ratio r on the overall performance of Next-ViT. The results in Table 8 show that reducing the shrink ratio r lowers model latency.

Moreover, the models with r = 0.75 and r = 0.5 perform better than the pure Transformer model (r = 1), indicating that appropriately fusing multi-frequency signals enhances the model's representation learning ability. In particular, the model with r = 0.75 achieves the best latency/accuracy trade-off. These results highlight the effectiveness of the NTB block.
The study further analyzed the impact of different normalization layers and activation functions in Next-ViT. As shown in Table 9, LN and GELU bring some performance improvement but significantly increase inference latency on TensorRT. BN and ReLU, by contrast, achieve the best latency/accuracy trade-off across the tasks as a whole. Therefore, Next-ViT uses BN and ReLU throughout for efficient deployment in real industrial scenarios.
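One reason BN is deployment-friendly is that, unlike LN, it can be folded into the preceding convolution at inference time. The small sketch below demonstrates this with PyTorch's built-in fusion utility; it is a general illustration, not part of the Next-ViT codebase.

```python
import torch
import torch.nn as nn

# A convolution followed by BatchNorm can be fused into a single convolution
# at inference time, removing the normalization cost entirely.
conv_bn = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64)).eval()
fused = torch.ao.quantization.fuse_modules(conv_bn, [["0", "1"]])

x = torch.randn(1, 64, 56, 56)
print(torch.allclose(conv_bn(x), fused(x), atol=1e-5))   # True: same output, fewer ops
```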

Finally, the researchers visualized the Fourier spectrum and heat maps of the output features of ResNet, Swin Transformer, and Next-ViT, as shown in Figure 5(a). The spectral distribution of ResNet indicates that convolution blocks tend to capture high-frequency signals and struggle to focus on low-frequency signals; ViT excels at capturing low-frequency signals while neglecting high-frequency signals; whereas Next-ViT can capture high-quality multi-frequency signals simultaneously, demonstrating the effectiveness of NTB.
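For readers who want to reproduce this kind of analysis, a Fourier spectrum of a backbone's output features can be computed in a few lines of PyTorch; the sketch below is a generic illustration of the visualization, not the authors' exact plotting code.

```python
import torch

def fourier_spectrum(feat):
    """Log-magnitude 2D Fourier spectrum of a feature map (B, C, H, W),
    averaged over batch and channels and shifted so that low frequencies
    sit at the centre, as in Figure 5(a)-style visualizations."""
    f = torch.fft.fft2(feat.float())
    f = torch.fft.fftshift(f, dim=(-2, -1))
    return f.abs().add(1e-6).log().mean(dim=(0, 1))    # (H, W)

spec = fourier_spectrum(torch.randn(1, 64, 56, 56))    # e.g. stage-1 features
print(spec.shape)                                      # torch.Size([56, 56])
```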

Additionally, as shown in Figure 5(b), Next-ViT can capture richer texture information and more accurate global information compared to ResNet and Swin, indicating that Next-ViT has stronger modeling capability.