Why Transformers for NLP Tasks Can Be Applied to Computer Vision?

Almost all natural language processing tasks, from language modeling and masked word prediction to translation and question answering, have undergone revolutionary changes since the Transformer architecture first appeared in 2017. It took only two to three years for the Transformer to excel at computer vision tasks as well. In this article, we explore two foundational architectures that enabled the Transformer to break into the world of computer vision.
Table of Contents
  • Vision Transformer
    • Main Ideas
    • Operations
    • Hybrid Architecture
    • Loss of Structure
    • Results
    • Self-Supervised Learning via Masking
  • Masked Autoencoding Vision Transformer
    • Main Ideas
    • Architecture
    • Final Comments and Examples
Vision Transformer
Main Ideas

The goal of the Vision Transformer is to generalize the standard Transformer architecture to process and learn from image inputs. A key point about the architecture, which the authors state explicitly, is:
“Inspired by the Transformer scaling successes in NLP, we attempt to apply a standard Transformer directly to images, with as few modifications as possible.”
Operations

“As few modifications as possible” can be taken quite literally, since the authors made almost no changes to the architecture itself. What they did modify was the input structure:
  • In NLP, the Transformer encoder takes a sequence of one-hot vectors (or equivalent token indices) representing input sentences/paragraphs and returns a sequence of context embedding vectors usable for further tasks (e.g., classification).
  • To generalize to computer vision, the Vision Transformer takes a sequence of patch vectors representing the input image and returns a sequence of context embedding vectors usable for further tasks (e.g., classification).
Specifically, assuming the input image has dimensions (n,n,3), to pass it as input to the Transformer, the operations of the Vision Transformer are as follows:
  • Divide it into k² patches, where k is a chosen value (e.g., k=3).
  • Now each patch will be of size (n/k,n/k,3), and the next step is to flatten each patch into a vector.
The patch vector will have dimension 3*(n/k)*(n/k). For example, if the image is (900,900,3) and we use k=3, then each patch vector has dimension 300*300*3 = 270,000, representing the pixel values of the flattened patch. In the paper, the authors used 16×16-pixel patches, hence the title “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”: instead of one-hot vectors representing words, the input is a sequence of pixel vectors representing image patches.
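To make the patching step concrete, here is a minimal PyTorch sketch (not the authors' code); the function name, the reshape-based implementation, and the example sizes are illustrative assumptions:

```python
import torch

def image_to_patch_vectors(img, k):
    """Split an (n, n, 3) image into k*k patches and flatten each patch.

    img: tensor of shape (n, n, 3); n must be divisible by k.
    Returns a tensor of shape (k*k, 3 * (n//k) * (n//k)).
    """
    n = img.shape[0]
    p = n // k  # side length of each patch
    # (n, n, 3) -> (k, p, k, p, 3) -> (k, k, p, p, 3)
    patches = img.reshape(k, p, k, p, 3).permute(0, 2, 1, 3, 4)
    # flatten each patch into a single vector
    return patches.reshape(k * k, p * p * 3)

# Example: a (900, 900, 3) image with k=3 gives 9 vectors of length 270,000.
img = torch.rand(900, 900, 3)
vectors = image_to_patch_vectors(img, k=3)
print(vectors.shape)  # torch.Size([9, 270000])
```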
The rest of the operations remain unchanged from the original Transformer encoder:
  • These patch vectors are passed through a trainable embedding layer
  • Add position embedding to each vector to preserve spatial information in the image
  • The output is num_patches encoder representations (one for each patch), usable for classification at the patch or image level
  • More commonly (as in the paper), a CLS token is prepended to the sequence, and its output representation is used for predictions about the entire image (as in BERT)
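Putting these pieces together, here is a hedged sketch of a tiny ViT-style classifier built from standard PyTorch modules; the model size, layer counts, and class/variable names are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Illustrative Vision Transformer classifier (not the paper's exact model)."""

    def __init__(self, patch_dim, num_patches, d_model=256, num_classes=10):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)                  # trainable patch embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable CLS token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, patch_vectors):            # (batch, num_patches, patch_dim)
        x = self.embed(patch_vectors)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed   # prepend CLS, add position embeddings
        x = self.encoder(x)
        return self.head(x[:, 0])                # classify from the CLS output representation

# Example: a 224x224 image with 16x16-pixel patches -> 196 patches of dimension 16*16*3 = 768.
model = TinyViT(patch_dim=768, num_patches=196)
logits = model(torch.rand(2, 196, 768))
print(logits.shape)  # torch.Size([2, 10])
```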
What about the Transformer decoder?
Recall that it is like the Transformer encoder, except that it uses masked self-attention instead of (unmasked) self-attention; the input signature is otherwise unchanged. In any case, a decoder-only Transformer is rarely useful here, since simply predicting the next patch is not a particularly interesting task.
Hybrid Architecture

The authors also mention that a hybrid architecture can be formed by feeding CNN feature maps, rather than raw images, into the Vision Transformer (the CNN output is passed to the Vision Transformer). In this case, the input is treated as a general (n,n,p) feature map, and the patch vectors have dimension (n/k)*(n/k)*p.
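As a rough illustration of the hybrid idea (the backbone, shapes, and patching parameters below are arbitrary choices, not the paper's setup), a CNN feature map can be patched in exactly the same way as raw pixels:

```python
import torch
import torch.nn as nn

# Hypothetical small CNN backbone producing an (n, n, p) feature map.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
)

img = torch.rand(1, 3, 224, 224)
feat = backbone(img)                       # (1, 64, 112, 112): here n=112, p=64
feat = feat.permute(0, 2, 3, 1)            # to (batch, n, n, p)
k = 7                                      # split into 7x7 patches of side 112/7 = 16
p_side = feat.shape[1] // k
patches = feat.reshape(1, k, p_side, k, p_side, 64).permute(0, 1, 3, 2, 4, 5)
patch_vectors = patches.reshape(1, k * k, p_side * p_side * 64)
print(patch_vectors.shape)                 # torch.Size([1, 49, 16384])
```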
Loss of Structure

You might expect this architecture not to work well, because it treats the image as a linear sequence, which it is not. The authors acknowledge that this is intentional, noting:
The two-dimensional neighborhood structure is used very sparingly…position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch
We will see that the Transformer is capable of learning this, as evidenced by its good performance in experiments, and more importantly, in the architecture of the next paper.
Results

The main takeaway from the results is that the Vision Transformer often cannot surpass CNN-based models on small datasets, but matches or surpasses them on large datasets, while requiring significantly less compute:
Here we can see that for the JFT-300M dataset (300 million images), the ViT models pre-trained on it surpassed the ResNet-based baselines while requiring substantially less pre-training compute. Their largest Vision Transformer (ViT-Huge, 632M parameters) used about 25% of the compute of the ResNet baseline and still outperformed it, and performance did not degrade much even with ViT-Large, which used less than 6.8% of the compute.
Meanwhile, the results also show that when training uses only the 1.3 million images of ImageNet-1K, the ResNet baseline performs noticeably better.
Self-Supervised Learning via Masking

The authors also conducted preliminary experiments on self-supervised masked patch prediction, mimicking the masked language modeling task used in BERT (i.e., masking patches and trying to predict them):
“We employ the masked patch prediction objective for preliminary self-supervision experiments. To do so, we corrupt 50% of patch embeddings by either replacing their embeddings with a learnable [mask] embedding (80%), a random other patch embedding (10%), or just keeping them as is (10%).”
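Here is a hedged sketch of that corruption scheme, assuming per-patch embeddings of shape (num_patches, d_model); the function and variable names are illustrative, and the paper describes the procedure only at the level quoted above:

```python
import torch

def corrupt_patch_embeddings(x, mask_embed, corrupt_frac=0.5):
    """x: (num_patches, d_model) patch embeddings; mask_embed: (d_model,) learnable [mask] vector.

    Returns the corrupted embeddings and a boolean mask of the selected positions.
    """
    num_patches = x.size(0)
    selected = torch.rand(num_patches) < corrupt_frac        # select ~50% of the patches
    out = x.clone()
    for i in torch.nonzero(selected).flatten().tolist():
        r = torch.rand(1).item()
        if r < 0.8:                                          # 80%: replace with the [mask] embedding
            out[i] = mask_embed
        elif r < 0.9:                                        # 10%: replace with a random other patch
            out[i] = x[torch.randint(num_patches, (1,)).item()]
        # remaining 10%: keep the embedding as is
    return out, selected

# Example usage with illustrative sizes (196 patches, embedding width 256).
x = torch.rand(196, 256)
corrupted, selected = corrupt_patch_embeddings(x, mask_embed=torch.zeros(256))
print(int(selected.sum()), "of", x.size(0), "patch embeddings were selected for corruption")
```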
Through self-supervised pre-training, their smaller ViT-Base/16 model achieved 79.9% accuracy on ImageNet, a significant 2% improvement over training from scratch. However, it still lags behind supervised pre-training by 4%.
Masked Autoencoding Vision Transformer

Main Ideas

As we saw in the Vision Transformer paper, the gains from pre-training by masking patches of the input image are not as large as in NLP, where masked pre-training can achieve state-of-the-art results on downstream fine-tuning tasks.
This paper proposes a Vision Transformer architecture with both an encoder and a decoder. When pre-trained with masking, it yields significant improvements over the base Vision Transformer model (up to 6% better than training the base-sized Vision Transformer in a supervised fashion).
The paper shows example triplets of (masked input, reconstructed output, ground truth). In a sense, the model is an autoencoder: it attempts to reconstruct the input while filling in the missing patches.
Architecture

Their encoder is just the ordinary Vision Transformer encoder described earlier; during training and inference it only takes the “observed” patches. Their decoder is also an ordinary Vision Transformer encoder, but its input is:
  • The mask token vector for each missing patch
  • The encoder output vectors for the known patches
So for the image [[A, B, X], [C, X, X], [X, D, E]], where X denotes a missing patch, the decoder takes the patch vector sequence [Enc(A), Enc(B), Vec(X), Enc(C), Vec(X), Vec(X), Vec(X), Enc(D), Enc(E)], where Enc(·) is the encoder output vector for the given patch and Vec(X) is the vector representing the mask token.
The last layer of the decoder is a linear layer that maps each context embedding (produced by the Vision Transformer encoder acting as the decoder) to a vector with the same length as the flattened patch vector. The loss is the mean squared error between the original patch vector and the vector predicted by this layer. The loss only counts the decoder predictions at masked positions; predictions for the observed patches (i.e., Dec(A), Dec(B), Dec(C), etc.) are ignored.
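To make the decoder input and the masked loss concrete, here is a minimal PyTorch sketch; the function names, shapes, and the mask layout mirroring the A..E example above are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def build_decoder_input(enc_out, mask, mask_token):
    """Assemble the decoder input sequence for a masked autoencoder.

    enc_out:    (num_visible, d) encoder outputs for the observed patches, in patch order
    mask:       (num_patches,) bool, True where a patch was masked out
    mask_token: (d,) learnable vector standing in for every missing patch
    """
    seq = torch.empty(mask.numel(), enc_out.size(1))
    seq[~mask] = enc_out                 # visible positions get their encoder outputs
    seq[mask] = mask_token               # masked positions get the shared mask token
    return seq

def masked_mse(pred_patches, true_patches, mask):
    """MSE computed only over the masked patches; predictions for visible patches are ignored."""
    return F.mse_loss(pred_patches[mask], true_patches[mask])

# Illustrative shapes: 9 patches (4 masked, as in the A..E example), width 256, patch vector length 768.
mask = torch.tensor([0, 0, 1, 0, 1, 1, 1, 0, 0], dtype=torch.bool)
enc_out = torch.rand(int((~mask).sum()), 256)
decoder_in = build_decoder_input(enc_out, mask, mask_token=torch.zeros(256))
loss = masked_mse(torch.rand(9, 768), torch.rand(9, 768), mask)
print(decoder_in.shape, loss.item())
```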
Final Comments and Examples

The authors suggest masking about 75% of the patches in the image, which may be surprising given that BERT only masks about 15% of the words. They explain this as follows:
Images are natural signals with heavy spatial redundancy — e.g., a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes. To overcome this difference and encourage learning useful features, we mask a very high portion of random patches.