Visual Prompt Engineering: Adapting General Models for Various Tasks

Reprinted from | Jishi Platform
Author | Tech Beast

TL;DR

This paper addresses the question: how can a pre-trained vision model be adapted to new downstream tasks without task-specific fine-tuning or any model modification? Note the key constraints: no fine-tuning, no model modification, a single pre-trained model, and adaptation to different new tasks.
NLP has already achieved this with GPT-3, and the solution is the prompt.
When facing a new task, GPT-3 is given input-output examples at inference time (the prompt), and it automatically generates outputs consistent with the given examples.
This article aims to achieve the same in vision: at inference time, provide input-output image examples of the new task together with a new input image, with the goal of automatically generating an output image consistent with the given examples.
So how is this problem solved? The conclusion of this article is: as long as training is conducted on the right data and the problem is treated as image inpainting, it can be solved.
The author trained on a dataset of 88k unlabeled figures collected from Arxiv papers and demonstrated the method's effectiveness on many image-to-image downstream tasks, such as foreground segmentation, single object detection, colorization, and edge detection.

What this article did

  1. Proved that many computer vision tasks can be treated as image inpainting tasks, requiring only a few task input and output examples and a query image.
  2. Constructed a large dataset containing 88,000 samples, allowing the model to learn the image inpainting task without any annotation information or task-related descriptions.
  3. Demonstrated that adding additional data (such as ImageNet) to the training dataset can yield better results.
Paper Title: Visual Prompting via Image Inpainting (NeurIPS 2022)

Paper Link: http://arxiv.org/pdf/2209.00647.pdf
Paper Homepage: http://yossigandelsman.github.io/visual_prompt/

Can the characteristics of a general model performing various downstream tasks in language models be transferred to the visual domain?

In recent years, self-supervised learning has become increasingly popular in computer vision and natural language processing. The growing capacity of modern deep learning models makes them prone to overfitting when trained on smaller labeled datasets. Self-supervised learning provides a good solution to this problem, addressing the data hunger of these high-capacity deep learning models.
However, the features learned through self-supervised learning are not “ready for use” and typically require fine-tuning on some labeled datasets to adapt to the given downstream tasks. So, is it possible to avoid this fine-tuning?
This issue has largely been addressed in NLP through prompting: at test time, the model is given example inputs and outputs, together with a query, for the specific language task. For example, the following is a prompt:
Je suis désolé → I’m sorry
J’adore la glace →
The model’s output is:
I love ice cream
Can the practice of providing prompts for different downstream tasks at test time be extended to the visual domain? In other words, instead of the current CV paradigm where one model performs one task, can we have a general model that performs multiple user-specified tasks without any weight fine-tuning?
This article proposes: As long as a large-scale image inpainting model is trained on the correct data, it can serve as a tool for visual prompting.
As shown in Figure 1, the author constructed a grid image. The input-output examples of the task and the new query are marked with green boxes, and the model generates results by simply inpainting the remaining parts of the image, which are marked in red. The only requirement of the method in this article is that the task must be defined as image-to-image translation, which is a very large subset of visual problems.

Figure 1: Visual Prompting through Image Inpainting
In this way, visual prompting is effectively reduced to image inpainting. When a new downstream task (such as segmentation) arises, only an input-output example and a query need to be provided; arranging them in a grid and feeding the grid to the model yields the corresponding result.

MAE-VQGAN Method Introduction

Given an input image x ∈ R^{H×W×3} and a binary mask m ∈ {0,1}^{H×W}, the goal of the image inpainting function f is to synthesize a new image y ∈ R^{H×W×3} with the masked locations filled in:

$$y = f(x, m)$$
To train the function f, this article proposes the MAE-VQGAN method, which consists of the MAE[1] and VQGAN[2] models, as shown in Figure 2.

Figure 2: Introduction to the MAE-VQGAN Method
During training, the input image is divided into patches as in ViT; some patches are masked and the rest are fed to the MAE encoder. Unlike MAE, which directly regresses pixels, MAE-VQGAN predicts visual tokens through a softmax layer. To obtain the ground-truth visual tokens, the author maps the image to visual-token indices with the VQGAN encoder. For each masked token, the decoder outputs a distribution over the pre-trained VQGAN codebook, and the model is trained with a cross-entropy loss.
Let ẑ = (ẑ_1, …, ẑ_k) be the ordered sequence of predicted visual tokens. Each ẑ_i is obtained with an argmax over the predicted distribution:

$$\hat{z}_i = \arg\max_{z} \; p(z_i = z \mid x, m)$$
Finally, to decode the visual tokens back to pixels, the VQGAN Decoder is used to output y based on z hat.
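To make the training objective above concrete, here is a minimal, hedged PyTorch sketch. It assumes a frozen pre-trained VQGAN encoder that maps an image to codebook indices and an MAE-style backbone whose head predicts codebook logits; the class names (ToyVQGANEncoder, ToyMAEVQGAN), layer sizes, and masking ratio are illustrative stand-ins, not the paper's actual implementation.

```python
# Minimal sketch of the MAE-VQGAN training objective (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 1024   # assumed VQGAN codebook size
NUM_TOKENS    = 196    # 14x14 token grid for a 224x224 image with 16x16 patches
EMBED_DIM     = 256

class ToyVQGANEncoder(nn.Module):
    """Stand-in for the frozen VQGAN encoder: image -> ground-truth codebook indices."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(3, CODEBOOK_SIZE, kernel_size=16, stride=16)
    def forward(self, img):                       # img: (B, 3, 224, 224)
        logits = self.proj(img)                   # (B, K, 14, 14)
        return logits.flatten(2).argmax(1)        # (B, 196) token indices

class ToyMAEVQGAN(nn.Module):
    """Stand-in for the MAE backbone whose head outputs codebook logits."""
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, EMBED_DIM, kernel_size=16, stride=16)
        self.mask_token  = nn.Parameter(torch.zeros(1, 1, EMBED_DIM))
        layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head   = nn.Linear(EMBED_DIM, CODEBOOK_SIZE)
    def forward(self, img, mask):                 # mask: (B, 196) bool, True = hidden
        x = self.patch_embed(img).flatten(2).transpose(1, 2)         # (B, 196, D)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.head(self.blocks(x))          # (B, 196, K) codebook logits

vq_encoder, model = ToyVQGANEncoder().eval(), ToyMAEVQGAN()
img  = torch.randn(2, 3, 224, 224)
mask = torch.rand(2, NUM_TOKENS) < 0.75           # mask ~75% of the tokens

with torch.no_grad():
    target = vq_encoder(img)                      # ground-truth visual tokens
logits = model(img, mask)
# Cross-entropy only on the masked positions, as described above.
loss = F.cross_entropy(logits[mask], target[mask])
loss.backward()

# At inference, each ẑ_i is the argmax over the codebook; a VQGAN decoder
# (omitted here) would then map the predicted indices back to pixels.
pred_tokens = logits.argmax(-1)
```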

Adding Prompts to the Trained Image Inpainting Model

The author constructed a visual prompt, a grid image consisting of task input-output examples and new query images. The inpainting model must fill in the blank parts of the image.
Define S = {(x_i, y_i)}, i = 1, …, n, as the input-output image examples, where x_i is an input image and y_i is the corresponding output image (e.g., a segmentation mask). Given a new query image x_q, the goal is to predict the corresponding label y_q. The author defines a function g that maps the example set S and the query image x_q to a new image and mask:

$$(x_{vp}, m) = g(S, x_q)$$
where x_vp is the visual prompt image and the mask m marks the region the inpainting model should predict.
The inpainting model then outputs y_vp from the visual prompt x_vp without any additional training, and y_q is obtained by extracting the masked region from the result.

Design of Visual Prompts

The function g maps the example set S and the query image x_q to a new image and mask. In this article, g creates a grid image containing (n+1)×2 cells: the i-th input-output example pair occupies the i-th row, and the query image sits in the left cell of the last row, with the right cell left blank for inpainting. The author discusses various design schemes for visual prompts and their corresponding results in the experimental section.
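As a concrete illustration of g under the layout just described, here is a minimal NumPy sketch; the per-cell resolution, white background, and the function name build_visual_prompt are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of the prompt-construction function g: (n+1) rows x 2 columns,
# example pairs on top, the query in the bottom-left cell, and the
# bottom-right cell left blank for the inpainting model to fill in.
import numpy as np

CELL = 111  # assumed per-cell resolution in pixels

def build_visual_prompt(examples, query):
    """examples: list of (input, output) uint8 arrays of shape (CELL, CELL, 3);
    query: uint8 array of shape (CELL, CELL, 3).
    Returns the prompt canvas and a binary mask marking the cell to inpaint."""
    n = len(examples)
    canvas = np.full(((n + 1) * CELL, 2 * CELL, 3), 255, dtype=np.uint8)
    mask   = np.zeros(((n + 1) * CELL, 2 * CELL), dtype=np.uint8)
    for i, (x_i, y_i) in enumerate(examples):
        canvas[i * CELL:(i + 1) * CELL, :CELL] = x_i   # input on the left
        canvas[i * CELL:(i + 1) * CELL, CELL:] = y_i   # output on the right
    canvas[n * CELL:, :CELL] = query                    # query, bottom-left
    mask[n * CELL:, CELL:]   = 1                        # inpaint bottom-right
    return canvas, mask

# Example usage with dummy images (n = 1 gives the 2x2 grid used later):
ex = [(np.zeros((CELL, CELL, 3), np.uint8), np.full((CELL, CELL, 3), 255, np.uint8))]
q  = np.zeros((CELL, CELL, 3), np.uint8)
x_vp, m = build_visual_prompt(ex, q)   # x_vp: (222, 222, 3), m: (222, 222)
```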

Dataset

The images generated by the function g are not natural images. Specifically, they have a grid-like structure because they are stitched together from images of different distributions, such as natural images and segmentation masks. Therefore, models trained on standard datasets like ImageNet may struggle to handle these grid-like images. To address this, the author created a new dataset, Computer Vision Figures.
This dataset contains 88,645 images, which are closer to the structure of the visual prompts in this article. The dataset was collected from Arxiv. The author downloaded all papers from 2010 to 2022 and extracted those from the cs.CV partition, as they contain images more similar to grid structures, as shown in Figure 3.

Figure 3: Random image examples from Computer Vision Figures
To remove irrelevant source images, such as icons, the author manually labeled 2,000 images and trained a binary image classifier. The classifier assigns high scores to images with graphic structures and at least one natural image. The author then used this classifier on the entire dataset, retaining only the most informative images from 23,302 different papers. The author randomly split 90% of the data for training and the rest for validation.
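For illustration, here is a hedged sketch of how such a filter might be applied once trained: a binary head on an ImageNet-pretrained ResNet-18 scores each candidate figure, and only high-scoring ones are kept. The backbone choice, threshold, and preprocessing are assumptions; the paper only states that a binary classifier was trained on the hand-labeled figures, and the fine-tuning loop itself is omitted here.

```python
# Hedged sketch of the figure-filtering step (backbone and threshold are assumptions).
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Binary head on top of an ImageNet-pretrained ResNet-18 (assumed backbone).
clf = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
clf.fc = nn.Linear(clf.fc.in_features, 2)
clf = clf.to(device)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

@torch.no_grad()
def keep_figure(path, threshold=0.5):
    """Score one candidate figure; keep it if the 'useful figure' probability
    (grid-like structure with at least one natural image) exceeds the threshold."""
    clf.eval()
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    prob_useful = clf(img).softmax(dim=-1)[0, 1].item()
    return prob_useful > threshold
```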

Experimental Results

To study the impact of model selection on prompting results, the author experimented with different models, including MAE-VQGAN, VQGAN, BEiT, and several other inpainting models.

Downstream Task Experimental Results

Construction strategy for visual prompts: given one example pair and one query image, visual prompts are constructed in the same way for all tasks: a 2×2 grid containing four sub-images, with the input-output example in the first row and the query image in the bottom-left cell of the second row, as shown in Figure 1.
The author evaluated the image inpainting model on three visual tasks.
Foreground segmentation task: The goal is to segment the query image into binary foreground and background. The input-output example is an image and its corresponding binary segmentation mask, and the query is a new image whose binary segmentation mask is to be predicted. The author used the Pascal-5i dataset and reported the mIoU metric.
Single object detection task: Similar to foreground segmentation, this task also produces a binary segmentation of the object appearing in the query image, but it is more challenging because the mask is derived from a bounding box. The author used the Pascal VOC 2012 dataset along with its detection boxes. For simplicity, only annotations containing images with a single object were used, and images where the object covered more than 50% of the area were filtered out. The binary segmentation masks were then obtained as in foreground segmentation, and the mIoU metric is reported.
Image colorization task: The goal is to map grayscale images to color images. The example pair is a grayscale image and the corresponding color image, as shown in Figure 4. The author randomly selected 1,000 example pairs and image queries from the ImageNet validation set and converted them to grayscale to obtain the grayscale and color versions of each image, reporting the MSE loss and LPIPS metrics.
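Since the prediction lives in the bottom-right cell of the inpainted 2×2 grid, evaluation requires cropping that cell and comparing it to the ground truth. Below is a minimal sketch of this readout plus a binary IoU, assuming the canvas layout of the earlier build_visual_prompt sketch; the mid-gray binarization threshold is an assumption.

```python
# Sketch: read out the inpainted answer from a 2x2 prompt grid and score it with IoU.
import numpy as np

def extract_prediction(canvas, cell=111):
    """Return the bottom-right cell of a 2x2 prompt canvas (the inpainted answer)."""
    return canvas[cell:2 * cell, cell:2 * cell]

def binary_iou(pred_rgb, gt_mask):
    """pred_rgb: (H, W, 3) uint8 inpainted cell; gt_mask: (H, W) bool ground truth."""
    pred = pred_rgb.mean(axis=-1) > 127          # binarize: bright pixels = foreground
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return inter / union if union > 0 else 1.0

# mIoU over a set of queries is then just the mean of the per-image IoU values.
```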

Figure 4: Example predictions of visual prompts
The experimental results are shown in Figures 4 and 5. MAE-VQGAN significantly outperformed other models in detection and segmentation and generated clearer images than MAE. It was found that VQGAN struggled to produce accurate results, possibly due to sequential decoding. The BEiT model performed worse than MAE.

Figure 5: Visual Prompt Results for Visual Tasks

Synthetic Data Study

To evaluate the compositional prediction capabilities of the inpainting models, the author created three simple synthetic tasks and their combinations, evaluating each model on 100 examples per task.
Visual prompting strategy: given two example pairs and one query image, visual prompts are constructed in the same way for all tasks: a 3×2 grid containing six sub-images, with the input-output examples in the first two rows and the query image in the bottom-left cell of the third row, as shown in Figure 6.

Figure 6: Results of Synthetic Data Study
Resize: Each example pair contains an image of a circle and a corresponding image of a smaller circle. Given a query image, the goal is to predict the correspondingly resized version.
Shape: Each example pair consists of an image with a circle and a corresponding image with a rectangle of similar size in the same position. Given a new query image, the goal is to predict the corresponding image with a rectangle.
Color: Each example pair contains a circle that appears in the same position, with the color changing from green to blue. Given a new query image, the goal is to predict the corresponding image with the circle colored blue.
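For concreteness, here is a small PIL sketch of how example pairs of this kind could be drawn; positions, sizes, and exact colors are illustrative assumptions rather than the paper's generation code.

```python
# Sketch: drawing synthetic example pairs for the resize, shape, and color tasks.
from PIL import Image, ImageDraw

def draw_shape(shape="circle", color="green", size=40, pos=(55, 55), canvas=111):
    img = Image.new("RGB", (canvas, canvas), "white")
    d = ImageDraw.Draw(img)
    box = [pos[0] - size // 2, pos[1] - size // 2,
           pos[0] + size // 2, pos[1] + size // 2]
    if shape == "circle":
        d.ellipse(box, fill=color)
    else:
        d.rectangle(box, fill=color)
    return img

# Resize task: circle -> smaller circle; Shape task: circle -> rectangle;
# Color task: green circle -> blue circle.
resize_pair = (draw_shape(size=40), draw_shape(size=20))
shape_pair  = (draw_shape("circle"), draw_shape("rectangle"))
color_pair  = (draw_shape(color="green"), draw_shape(color="blue"))
```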
Evaluation: The author mapped each predicted pixel to its nearest color from a predefined set: black, white, blue, or green. A color-aware mIoU was then computed by treating pixels of the ground-truth shape color as foreground and the rest as background.
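The following is a hedged sketch of that color-aware metric: snap each pixel to the nearest palette color, then compute IoU on the pixels matching the ground-truth shape color. The exact RGB values assumed for "blue" and "green" are illustrative.

```python
# Sketch: color-aware mIoU via nearest-neighbor palette snapping (palette values assumed).
import numpy as np

PALETTE = {
    "black": (0, 0, 0), "white": (255, 255, 255),
    "blue": (0, 0, 255), "green": (0, 128, 0),
}

def snap_to_palette(img):
    """img: (H, W, 3) uint8 -> (H, W) array of palette color names."""
    colors = np.array(list(PALETTE.values()), dtype=np.float32)          # (4, 3)
    dists = np.linalg.norm(img[..., None, :].astype(np.float32) - colors, axis=-1)
    names = np.array(list(PALETTE.keys()))
    return names[dists.argmin(axis=-1)]

def color_aware_iou(pred_img, gt_img, shape_color):
    """Pixels matching the ground-truth shape color count as foreground."""
    pred_fg = snap_to_palette(pred_img) == shape_color
    gt_fg   = snap_to_palette(gt_img) == shape_color
    union = np.logical_or(pred_fg, gt_fg).sum()
    return np.logical_and(pred_fg, gt_fg).sum() / union if union > 0 else 1.0
```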
The experimental results are shown in Figure 7. If the inpainting models are not pre-trained on the dataset proposed in this article, they fail to generalize to these previously unseen tasks; once pre-trained on it, all models improve. However, the models still struggle with combinations of tasks because of the added complexity. VQGAN's sequential decoding leads to poor performance due to a lack of context. MAE outperformed MAE-VQGAN on the color task, while BEiT performed poorly on the resize task; these models rely on a dVAE or codebook, which may not be well suited to such tasks.

Figure 7: Results of Synthetic Data Study

Impact of Dataset Size

The author evaluated the impact of the pre-training data and its size, training on ImageNet alone, on the proposed Figures dataset alone, and on combinations of the two. As shown in Figure 8, the foreground segmentation results on Pascal-5i show that the MAE-VQGAN trained only on ImageNet consistently scores about 5 mIoU points lower than models trained on the Figures data, while the model trained on the combined data performs best, indicating that MAE-VQGAN can benefit from large amounts of unlabeled images.

Figure 8: Experimental Results of Foreground Segmentation on Pascal-5i. Pre-training MAE-VQGAN on more data can improve visual prompting results.

Visual Prompt Engineering

The author explored different visual prompt constructions for the foreground segmentation task and their corresponding MAE-VQGAN results. The model generates reasonable completions when the prompt layout is changed, such as horizontal versus vertical ordering (as shown in Figure 8), and when the mask colors or textures are changed or only edges are used (as shown in Figure 9).

Figure 9: Visual Prompt Experimental Results. Average Attention Map of Selected Patches (Annotated with Black Bounding Box)

Figure 10: Visual Prompt Engineering
The mIoU results in Figure 11 indicate that the model performs best when the segmentation mask is black and white and the prompt uses a vertical layout. Interestingly, by averaging the attention maps of the masked tokens across heads, it can be observed that the attention pattern changes with the prompt layout.
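As a rough illustration of that kind of inspection (not the authors' exact procedure), the sketch below averages the attention rows of masked tokens across heads, given attention tensors exported from a ViT-style model; how those tensors are obtained from the model is implementation-specific and assumed here.

```python
# Sketch: average attention map of masked tokens across heads (attention export assumed).
import torch

def average_mask_attention(attn, masked_idx, grid=(14, 14)):
    """attn: (heads, N, N) attention weights from one layer;
    masked_idx: LongTensor of masked-token positions.
    Returns a (grid_h, grid_w) map of where the masked tokens attend on average."""
    rows = attn[:, masked_idx, :]        # (heads, n_masked, N)
    avg = rows.mean(dim=(0, 1))          # average over heads and masked tokens
    return avg.reshape(grid)

# Example with random weights:
attn = torch.softmax(torch.randn(8, 196, 196), dim=-1)
masked = torch.arange(98, 196)           # e.g. the bottom half of the token grid
heatmap = average_mask_attention(attn, masked)   # (14, 14)
```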

Figure 11: Experimental Results of Prompt Engineering Using Different Prompts
References
  1. Masked Autoencoders Are Scalable Vision Learners
  2. Taming Transformers for High-Resolution Image Synthesis

