Transformers in Computer Vision

This article is reprinted from AI Park.

Author: Cheng He

Translation: ronghuaiyang

Introduction

Applying Transformers to computer vision tasks is becoming increasingly common; this article collects some of the related progress.

The Transformer architecture has achieved state-of-the-art results on many natural language processing tasks. Perhaps the most visible milestone was the release of GPT-3 in mid-2020, whose paper received a Best Paper award at NeurIPS 2020.

In the field of computer vision, CNNs have dominated visual tasks since 2012. With the emergence of increasingly efficient structures, computer vision and natural language processing are converging, and using Transformers to accomplish visual tasks has become a new research direction to reduce structural complexity and explore scalability and training efficiency.

Here are several well-known projects in related work:

  • DETR (End-to-End Object Detection with Transformers), using Transformers for object detection and segmentation.
  • Vision Transformer (AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE), using Transformers for image classification.
  • Image GPT (Generative Pretraining from Pixels), using Transformers for pixel-level image completion, just like other GPT text completions.
  • End-to-end Lane Shape Prediction with Transformers, using Transformers for lane marking detection in autonomous driving.

Architecture

Overall, there are mainly two model architectures for adopting Transformers in CV. One is a pure Transformer structure, and the other is a hybrid structure that combines CNNs/backbones with Transformers.

  • Pure Transformer
  • Hybrid (CNNs + Transformers)

Vision Transformer is based on a complete self-attention Transformer structure without using CNNs, while DETR is an example of a hybrid model structure that combines Convolutional Neural Networks (CNNs) and Transformers.

Some Questions

  • Why use Transformers in CV? How to use them?
  • What are the results on benchmarks?
  • What are the constraints and challenges of using Transformers in CV?
  • Which structure is more efficient and flexible? Why?

You will find answers in the in-depth studies below on ViT, DETR, and Image GPT.

Vision Transformer

Vision Transformer (ViT) applies the pure Transformer architecture directly to a series of image patches for classification tasks, achieving excellent results. It outperforms state-of-the-art convolutional networks on many image classification tasks while significantly reducing the required pre-training computational resources (at least 4 times less).

Figure: Vision Transformer Model Structure

Image Sequence Patches

First, the image is divided into fixed-size patches; the linear projections of these patches, along with their image positions, are fed into the Transformer. The remaining steps are a clean, standard Transformer encoder.

In the patch embedding step, positional embeddings are added so that the model retains global spatial/positional information. The paper experiments with several encoding strategies, including no positional encoding, 1D and 2D positional embeddings, and relative positional embeddings.

Figure: Comparison of Different Position Encoding Strategies

An interesting finding is that, compared to one-dimensional positional embeddings, two-dimensional positional embeddings do not bring significant performance improvements.
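
As a concrete illustration of the patch-embedding step, here is a minimal PyTorch sketch (hyperparameters and class names are assumed for illustration; this is not the paper's code). It splits a 224×224 image into 16×16 patches, linearly projects them, prepends a learnable class token, and adds learned 1D positional embeddings:

```python
# Minimal sketch of ViT-style patch + positional embeddings (illustrative, not the paper's code).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting the image into
        # non-overlapping patches and linearly projecting each one.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned 1D positional embeddings: one per patch plus one for the class token.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                          # x: (B, 3, H, W)
        x = self.proj(x)                           # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)           # (B, N, dim) patch sequence
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)             # prepend the class token
        return x + self.pos_embed                  # add positional information

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                # torch.Size([2, 197, 768])
```

The resulting token sequence is then processed by a standard Transformer encoder, and the final class-token representation feeds the classification head.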

Datasets

The model is pre-trained on several large datasets (with duplicate data removed) and then fine-tuned on smaller datasets for downstream tasks.

  • ILSVRC-2012 ImageNet dataset has 1k classes and 1.3 million images.
  • ImageNet-21k has 21k classes and 14 million images.
  • JFT has 18k classes and 303 million high-resolution images.

Model Variants

Like other popular Transformer models (GPT, BERT, RoBERTa), ViT (Vision Transformer) also has different model sizes (base, large, and huge) and different numbers of transformer layers and heads. For example, ViT-L/16 can be interpreted as a large (24-layer) ViT model with a 16×16 input image patch size.

Note that the smaller the input patch size, the more computationally expensive the model becomes, because the number of input patches is N = HW/P², where (H, W) is the resolution of the original image and P is the resolution of each patch. This means that a 14×14 patch size is computationally more expensive than a 16×16 patch size.
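
For concreteness, here is a quick calculation at a 224×224 input resolution (the resolution is chosen here purely for illustration):

```python
# Sequence length N = H*W / P^2 grows as the patch size P shrinks.
H, W = 224, 224
for P in (16, 14):
    N = (H * W) // (P * P)
    print(f"{P}x{P} patches -> sequence length N = {N}")
# 16x16 patches -> sequence length N = 196
# 14x14 patches -> sequence length N = 256
```

Since self-attention cost grows roughly quadratically with the sequence length, the longer 14×14-patch sequence is noticeably more expensive.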

Benchmark Results

Figure: Image Classification Benchmark

The above results indicate that the model outperforms existing SOTA models on several popular benchmark datasets.

The Vision Transformer (ViT-H/14, ViT-L/16) pre-trained on the JFT-300M dataset outperforms all ResNet models (ResNet152x4, pre-trained on the same JFT-300M dataset) on all test datasets while significantly reducing the computational resources (TPUv3 core days) used during pre-training. Even the ViT pre-trained on ImageNet-21K performs better than the baseline.

Model Performance vs Dataset Size

Figure: Pre-training Dataset Size vs Model Performance

The above figure shows the impact of dataset size on model performance. When the size of the pre-training dataset is small, ViT does not perform well; when sufficient training data is available, it outperforms previous SOTA.

Which Structure is More Efficient?

As mentioned at the beginning, there are different architectural designs for using Transformers in computer vision: some replace CNNs entirely (ViT), some replace them partially, and some combine CNNs with Transformers (DETR). The results below show the performance of various model architectures under the same computational budget.

Figure: Performance and Computational Cost of Different Model Architectures

The above experiments indicate:

  • The pure Transformer architecture (ViT) is more efficient and scalable than traditional CNNs (BiT ResNets), in both model size and compute.
  • The hybrid architecture (CNNs + Transformer) outperforms pure Transformers at smaller model sizes, while performance is very close at larger model sizes.

Key Points of ViT (Vision Transformer)

  • Uses Transformer architecture (pure or hybrid)
  • Input images are tiled into multiple patches
  • Beats SOTA on multiple image recognition benchmarks
  • Pre-training on large datasets is cheaper
  • More scalable and computationally efficient

DETR

DETR is the first framework to successfully use Transformers as a main building block in the object detection pipeline. It matches the performance of previous SOTA methods (highly optimized Faster R-CNN) while having a simpler and more flexible pipeline.

Figure: DETR combines CNN and Transformer in the object detection pipeline

The image above shows DETR, a hybrid pipeline with a CNN and a Transformer as its main building blocks. The process is as follows (a minimal code sketch appears after the list):

  1. CNN is used to learn the 2D representation of the image and extract features.
  2. The output of the CNN is flattened and supplemented with positional encoding to feed into the standard Transformer encoder.
  3. The Transformer decoder produces output embeddings that are passed through a feed-forward network (FFN) to predict classes and bounding boxes.
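
To make the data flow concrete, below is a highly simplified sketch of a DETR-style forward pass. It assumes a torchvision ResNet-50 backbone and PyTorch's built-in nn.Transformer, and it omits the positional encodings and the bipartite-matching loss; it is not the official DETR implementation.

```python
# Simplified DETR-style pipeline: CNN features -> flatten -> Transformer -> FFN heads.
import torch
import torch.nn as nn
import torchvision

class MiniDETR(nn.Module):
    def __init__(self, num_classes=91, num_queries=100, dim=256):
        super().__init__()
        backbone = torchvision.models.resnet50()                          # random weights by default
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])    # keep the 2D feature map
        self.input_proj = nn.Conv2d(2048, dim, kernel_size=1)             # reduce channel depth
        self.transformer = nn.Transformer(d_model=dim, batch_first=True)
        self.query_embed = nn.Parameter(torch.randn(num_queries, dim))    # learned object queries
        self.class_head = nn.Linear(dim, num_classes + 1)                 # +1 for "no object"
        self.bbox_head = nn.Linear(dim, 4)                                # (cx, cy, w, h)

    def forward(self, images):                          # images: (B, 3, H, W)
        feat = self.input_proj(self.backbone(images))   # (B, dim, h, w)
        src = feat.flatten(2).transpose(1, 2)           # flatten the 2D map into a sequence
        queries = self.query_embed.unsqueeze(0).expand(images.shape[0], -1, -1)
        hs = self.transformer(src, queries)             # decoder output embeddings
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

logits, boxes = MiniDETR()(torch.randn(1, 3, 512, 512))
print(logits.shape, boxes.shape)                        # (1, 100, 92) (1, 100, 4)
```

In the actual DETR, 2D positional encodings are added to the flattened features, and training matches the fixed-size set of predictions to the ground-truth boxes with a Hungarian (bipartite) matching loss.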

Simpler Pipeline

Figure: Comparison of Traditional Object Detection Pipeline and DETR

Traditional object detection methods such as Faster R-CNN involve multiple hand-designed steps, including anchor generation and non-maximum suppression (NMS). DETR drops these hand-crafted components, which significantly simplifies the object detection pipeline.

Amazing Results When Extended to Panoptic Segmentation

In this paper, they further extend the DETR pipeline to panoptic segmentation, a recently popular and challenging pixel-level recognition task. Briefly, panoptic segmentation unifies two different tasks: traditional semantic segmentation (assigning a class label to every pixel) and instance segmentation (detecting and segmenting each object instance). Solving both tasks (classification and segmentation) with a single model architecture is a very clever idea.

Figure: Pixel-level Panoptic Segmentation

The above image shows an example of panoptic segmentation. Through DETR’s unified pipeline, it surpasses very competitive baselines.

Attention Visualization

The following image shows the attention of the Transformer decoder on the predictions. The attention scores of different objects are represented in different colors.

Looking at the colors/attention, it is striking how well the model understands the image globally through self-attention and resolves the problem of overlapping bounding boxes. In particular, the zebra legs highlighted in orange are still classified correctly even though they overlap with the blue and green instances.

Figure: Visualization of Attention on Predicted Objects

Key Points of DETR

  • Using Transformers leads to a simpler and more flexible pipeline.
  • Can match SOTA on object detection tasks.
  • Efficiently outputs the final prediction set in parallel.
  • Unified architecture for object detection and segmentation.
  • Significantly improves detection performance for large objects, but performance for small object detection decreases.

Image GPT

Image GPT is a GPT-2-style transformer model trained on pixel sequences for image completion. Like a generic pre-trained language model, it is designed to learn high-quality unsupervised image representations. It autoregressively predicts the next pixel without any knowledge of the 2D structure of the input image.
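
As a rough illustration of the objective (not OpenAI's code), the sketch below flattens images into 1D pixel sequences and trains a small causal Transformer to predict each next pixel. The real Image GPT quantizes RGB colors into a 512-entry palette and uses a much larger GPT-2 architecture; positional embeddings are also omitted here for brevity.

```python
# Toy next-pixel prediction: flatten an image to a token sequence and mask the future.
import torch
import torch.nn as nn

vocab_size, dim, seq_len = 256, 128, 32 * 32            # 32x32 image -> 1024 pixel tokens

model = nn.TransformerEncoder(                          # encoder layers used with a causal mask, GPT-style
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
tok_embed = nn.Embedding(vocab_size, dim)
head = nn.Linear(dim, vocab_size)

pixels = torch.randint(0, vocab_size, (4, seq_len))     # a batch of flattened images
inputs, targets = pixels[:, :-1], pixels[:, 1:]         # shift by one: predict the next pixel
causal_mask = torch.triu(                               # block attention to future positions
    torch.full((seq_len - 1, seq_len - 1), float("-inf")), diagonal=1)

hidden = model(tok_embed(inputs), mask=causal_mask)
loss = nn.functional.cross_entropy(
    head(hidden).reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```

After pre-training with this objective, the learned representations can be evaluated with a linear probe or by fine-tuning on a labeled classification dataset.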

The features from pre-trained Image GPT achieve state-of-the-art performance on some classification benchmarks and come close to state-of-the-art unsupervised accuracy on ImageNet.

The following image shows model-generated completions: given half of an image as input, the model creatively completes the rest.

Figure: Image Completion from Image GPT

Key Points of Image GPT:

  • Uses the same transformer architecture as GPT-2 in NLP.
  • Unsupervised learning, no manual labeling required.
  • Requires more computation to generate competitive representations.
  • The learned features achieve SOTA performance on classification benchmarks of low-resolution datasets.

Conclusion

Following their tremendous success in natural language processing, Transformers are now being explored in computer vision and have become a new research direction.

  • Transformers have proven to be a simple and scalable framework for computer vision tasks such as image recognition, classification, and segmentation, or simply learning global image representations.
  • Significant advantages in training efficiency compared to traditional methods.
  • Architecturally, they can be used in a pure Transformer manner or in a hybrid manner with CNNs.
  • They also face challenges, such as poor performance in detecting small objects in DETR, and suboptimal performance in Vision Transformer (ViT) when the pre-training dataset is small.
  • Transformers are becoming a more general framework for learning sequential data (including text, images, and time series data).

Original article: https://towardsdatascience.com/transformer-in-cv-bbdb58bf335e
