How Is Computer Vision Evolving? A Comprehensive Review of Segmentation Large Models (SAM/SegGPT/SEEM)

Source|Heart of Autonomous Driving
Image segmentation in computer vision is an important subfield that aims to assign each pixel in an image to different categories or objects. This technology is widely used in various applications such as image recognition, scene understanding, and medical image processing, demonstrating significant practical value.
Previously, there were roughly two approaches to the segmentation problem.
The first method is interactive segmentation, which allows the segmentation of any category of objects but requires a human to guide the method by iteratively refining the masks. The second method is automatic segmentation, which allows the segmentation of predefined specific object categories (e.g., cats or chairs) but requires a large amount of manually annotated objects for training (e.g., thousands or even tens of thousands of examples of segmented cats). Neither of these methods provides a universal, fully automated segmentation solution.
The field of computer vision is also witnessing a trend towards universal models. With the improvement of model generalization capabilities in computer vision, there is hope to promote the development of universal multimodal AI systems, which can be applied in industrial manufacturing, general robotics, smart homes, gaming, virtual reality, and other fields. This article introduces the recent developments in large segmentation models.

1. SAM

SAM (Segment Anything Model) is a state-of-the-art image segmentation model released by Meta’s FAIR lab. This model introduces the prompt paradigm from the field of natural language processing into computer vision, enabling precise image segmentation through three interaction methods: clicking, box selection, and automatic recognition, significantly enhancing the efficiency of image segmentation.

1.1. Tasks

SAM has been trained on millions of images and over a billion masks and can return a valid segmentation mask for any prompt. Here, a prompt specifies what to segment: it can be foreground/background points, a rough box or mask, clicks, text, or, in general, any information indicating what to segment in the image. The task of returning a valid mask for any prompt is also used as the pre-training objective for the model.
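As a concrete illustration, below is a minimal usage sketch based on the public `segment_anything` package; the checkpoint path and input image are placeholders, and return values may differ slightly between versions.

```python
# Minimal sketch of prompting SAM with a point and a box,
# assuming the official `segment_anything` package and a downloaded ViT-H checkpoint.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # checkpoint path is a placeholder
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)    # placeholder input image
predictor.set_image(image)  # the heavy image embedding is computed once here

# A single foreground click (label 1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks for an ambiguous prompt
)

# A rough box prompt reuses the same image embedding, so it is cheap.
box_masks, _, _ = predictor.predict(box=np.array([100, 100, 400, 400]))
```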

1.2. Network Architecture

The architecture of SAM consists of three components that work together to return effective segmentation masks:
  • Image encoder, which computes a one-time image embedding for each input image.
  • Prompt encoder, which generates prompt embeddings; prompts can be points, boxes, masks, or text.
  • A lightweight mask decoder, which combines the prompt embeddings with the image embedding to predict masks.
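To make the division of labor concrete, here is a toy PyTorch sketch of the data flow; the module internals are simplified stand-ins, not Meta's implementation. The point is that the heavy image embedding is computed once, while prompt tokens and the decoder are cheap.

```python
# Toy schematic of SAM's data flow; module internals are stand-ins, not Meta's code.
import torch
import torch.nn as nn

class ToySAM(nn.Module):
    def __init__(self, emb_dim=256):
        super().__init__()
        # Heavy ViT image encoder stands in as a patchify conv producing a 64x64 embedding map.
        self.image_encoder = nn.Sequential(nn.Conv2d(3, emb_dim, kernel_size=16, stride=16), nn.GELU())
        # Sparse prompt encoder stand-in: project (x, y) point coordinates to the embedding dim.
        self.point_proj = nn.Linear(2, emb_dim)
        # Lightweight decoder: attention over prompt tokens and image tokens.
        self.decoder = nn.TransformerDecoderLayer(d_model=emb_dim, nhead=8, batch_first=True)
        self.mask_head = nn.Linear(emb_dim, 64 * 64)

    def forward(self, image, points):
        img_emb = self.image_encoder(image)              # (B, C, 64, 64), computed once per image
        img_tokens = img_emb.flatten(2).transpose(1, 2)  # (B, 4096, C)
        prompt_tokens = self.point_proj(points)          # (B, N, C), cheap per prompt
        out = self.decoder(prompt_tokens, img_tokens)    # prompt tokens attend to image tokens
        # Toy output: one low-res mask per prompt token (the real decoder uses output tokens).
        return self.mask_head(out).view(points.shape[0], -1, 64, 64)

sam = ToySAM()
masks = sam(torch.randn(1, 3, 1024, 1024), torch.rand(1, 2, 2))
print(masks.shape)  # torch.Size([1, 2, 64, 64])
```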

1.2.1. Image Encoder

At a high level, the image encoder (an MAE-pre-trained Vision Transformer, ViT) generates a one-time image embedding that can be computed before any prompt is given, so the expensive part of the model runs only once per image.

1.2.2. Prompt Encoder

The prompt encoder encodes points, masks, bounding boxes, or text into embedding vectors in real time. The study considers two types of prompts: sparse (points, boxes, text) and dense (masks). Points and boxes are represented by positional encodings, with learned embeddings added for each prompt type; free-form text prompts are represented by the pre-trained text encoder from CLIP. Dense prompts such as masks are embedded with convolutions and summed element-wise with the image embedding.
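A simplified sketch of the two prompt paths is shown below, under my own assumptions about shapes and layer choices; SAM's real prompt encoder differs in detail.

```python
# Sketch of the two prompt paths: sparse prompts become positional encodings plus a learned
# per-type embedding; dense mask prompts are downsampled with convolutions and added to the
# image embedding. Simplified stand-in, not Meta's code.
import torch
import torch.nn as nn

emb_dim = 256
num_freqs = 128
freq_matrix = torch.randn(2, num_freqs)          # fixed random Fourier features (positional encoding)
type_embeddings = nn.Embedding(4, emb_dim)       # e.g. fg point / bg point / box corner 1 / box corner 2

def encode_points(coords, labels):
    """coords: (N, 2) normalized to [0, 1]; labels: (N,) indices into type_embeddings."""
    proj = 2 * torch.pi * coords @ freq_matrix                 # (N, num_freqs)
    pos = torch.cat([proj.sin(), proj.cos()], dim=-1)          # (N, 2*num_freqs) == (N, emb_dim)
    return pos + type_embeddings(labels)                       # add a learned embedding per prompt type

mask_downscaler = nn.Sequential(                               # dense prompt: conv stack to embedding size
    nn.Conv2d(1, 4, kernel_size=2, stride=2), nn.GELU(),
    nn.Conv2d(4, emb_dim, kernel_size=2, stride=2),
)

image_embedding = torch.randn(1, emb_dim, 64, 64)
sparse = encode_points(torch.tensor([[0.3, 0.6]]), torch.tensor([0]))   # one foreground click
dense = image_embedding + mask_downscaler(torch.zeros(1, 1, 256, 256))  # element-wise sum with image embedding
print(sparse.shape, dense.shape)  # torch.Size([1, 256]) torch.Size([1, 256, 64, 64])
```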

1.2.3. Mask Decoder

The lightweight mask decoder predicts segmentation masks from the image and prompt embeddings. It maps the image embedding, the prompt embeddings, and an output token to a mask. All embeddings are updated by decoder blocks that use self-attention and cross-attention in both directions (prompt-to-image and image-to-prompt). The predicted masks are then annotated and used to update the model weights; this loop grows the dataset and lets the model improve over time, making the overall process efficient and flexible.
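The bidirectional attention can be sketched as follows; this is an illustrative block with layer norms omitted, not the released decoder code.

```python
# Rough sketch of one "two-way" decoder block: prompt/output tokens attend to image features
# and the image features attend back to the tokens.
import torch
import torch.nn as nn

class TwoWayBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.token_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_token = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens, image_feats):
        tokens = tokens + self.self_attn(tokens, tokens, tokens)[0]                  # tokens talk to each other
        tokens = tokens + self.token_to_image(tokens, image_feats, image_feats)[0]   # tokens read the image
        tokens = tokens + self.mlp(tokens)
        image_feats = image_feats + self.image_to_token(image_feats, tokens, tokens)[0]  # image reads the tokens back
        return tokens, image_feats

block = TwoWayBlock()
tokens = torch.randn(1, 5, 256)            # e.g. 1 output token + 4 prompt tokens
image_feats = torch.randn(1, 64 * 64, 256)
tokens, image_feats = block(tokens, image_feats)
print(tokens.shape, image_feats.shape)
```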

1.3. Dataset

The Segment Anything 1 Billion Masks (SA-1B) dataset is the largest labeled segmentation dataset to date, designed for developing and evaluating advanced segmentation models. Annotators used SAM to interactively annotate images, and the newly annotated data was in turn used to update SAM, so the model and the dataset improved each other. With this method, interactively annotating one mask takes about 14 seconds. Compared with previous large-scale segmentation data collection efforts, Meta's SAM-assisted approach is 6.5 times faster than COCO's fully manual polygon-based mask annotation and twice as fast as the previous largest data annotation effort. The final dataset contains over 1.1 billion segmentation masks collected from approximately 11 million licensed and privacy-preserving images. SA-1B has 400 times more masks than any existing segmentation dataset, and human evaluation confirmed that these masks are of high quality and diversity, in some cases comparable in quality to masks from previous, much smaller, fully manually annotated datasets.

1.4. Zero-Shot Transfer Experiments

1.4.1. Zero-Shot Single Point Valid Mask Evaluation


1.4.2. Zero-Shot Edge Detection


1.4.3. Zero-Shot Object Proposals


1.4.4. Zero-Shot Instance Segmentation


1.4.5. Zero-Shot Text to Mask


1.4.6. Ablation Studies


2. Grounded-SAM

Shortly after the release of SAM, derivative models combining multiple foundation capabilities emerged. For example, the IDEA Research Institute, founded by Dr. Xiangyang Shen, former chief scientist at Microsoft Research Asia, built Grounded-SAM on top of SAM, its own Grounding DINO model, and Stable Diffusion, enabling image detection, segmentation, and generation driven directly by text descriptions. Leveraging Grounding DINO's powerful zero-shot detection capability, Grounded-SAM can locate any object in an image from a text description, then produce a fine mask with SAM's segmentation capability, and finally use Stable Diffusion for controllable text-to-image generation on the segmented region.
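A pipeline sketch of this text-to-detection-to-segmentation-to-generation chain is shown below. The `detect_boxes` helper is a hypothetical stand-in for a Grounding DINO inference call; the SAM and diffusers parts follow their public interfaces as I understand them, and the checkpoint and image names are placeholders.

```python
# Pipeline sketch: text -> boxes (Grounding DINO) -> masks (SAM) -> inpainting (Stable Diffusion).
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor
from diffusers import StableDiffusionInpaintPipeline

def detect_boxes(image: np.ndarray, text: str) -> np.ndarray:
    """Hypothetical stand-in for Grounding DINO: return (N, 4) xyxy boxes for the text query."""
    return np.array([[100, 100, 400, 400]])           # dummy box; a real pipeline calls Grounding DINO here

image = np.array(Image.open("example.jpg").convert("RGB"))
boxes = detect_boxes(image, "the dog on the left")    # 1) open-set detection from text

predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth"))
predictor.set_image(image)
masks, _, _ = predictor.predict(box=boxes[0], multimask_output=False)   # 2) box -> fine mask

pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
result = pipe(
    prompt="a corgi wearing sunglasses",              # 3) regenerate only the masked region
    image=Image.fromarray(image).resize((512, 512)),
    mask_image=Image.fromarray(masks[0].astype(np.uint8) * 255).resize((512, 512)),
).images[0]
result.save("edited.jpg")
```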
Examples of Grounding DINO
Examples of Grounded-Segment-Anything

3. SegGPT

The vision team at the Zhiyuan Research Institute (Beijing Academy of Artificial Intelligence, BAAI) in China proposed SegGPT (Segmenting Everything in Context), a universal segmentation model and the first general-purpose vision model that completes a variety of segmentation tasks from visual context. For example, annotating a rainbow in one image allows rainbows in a batch of other images to be segmented automatically.
Compared with SAM, the main difference is the visual model's in-context capability:
  1. SegGPT is "one-to-many": with one or a few example images and their corresponding masks, it can segment a large number of test images. A user can annotate one type of object in a single image and then recognize and segment all similar objects in batch, whether in the same image or in other images and video contexts.
  2. SAM is "one-touch": through an interactive prompt on the target image (a point, a bounding box, or a sentence), it identifies and segments the specified object in that image.
This means SAM's fine-grained annotation capability can be combined with SegGPT's batch annotation and segmentation capability to build new CV applications. Concretely, SegGPT is a derivative of Zhiyuan's universal vision model Painter, optimized for the goal of segmenting everything. Once trained, SegGPT requires no fine-tuning; given only examples, it automatically infers and completes the corresponding segmentation task, covering instances, categories, parts, contours, text, faces, and more, in both images and videos.
The model has the following advantageous capabilities:
  • Universal capability: SegGPT has contextual reasoning ability, allowing the model to adaptively adjust predictions based on the masks provided in the context (prompt), achieving segmentation of “everything”, including instances, categories, parts, contours, text, faces, medical images, etc.
  • Flexible reasoning ability: supports any number of prompts; supports tuned prompts for specific scenarios; can represent different targets with different color masks, achieving parallel segmentation reasoning.
  • Automatic video segmentation and tracking capability: using the first video frame and its object mask as the context example, SegGPT can automatically segment subsequent video frames and use the mask colors as object IDs for automatic tracking.

3.1. Methods

The SegGPT training framework redefines the output space of vision tasks as "images" and unifies different tasks into a single image inpainting problem: given an input whose patches are randomly masked, the model reconstructs the missing pixels. To keep things simple and general, the authors do not modify the architecture or the loss function, using a vanilla ViT and a simple smooth-ℓ1 loss, but they design a new random coloring scheme for in-context training to obtain better generalization.
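A simplified sketch of this masked-inpainting objective is given below (not the actual SegGPT code), assuming the prompt pair and the query pair are stitched into one canvas and only the query's target region is randomly masked.

```python
# Simplified sketch of the inpainting-style objective: stitch the prompt image/target pair and the
# query image/target pair into one canvas, hide random patches of the query's target, and regress
# the missing pixels with a smooth-L1 loss.
import torch
import torch.nn.functional as F

def training_step(model, prompt_img, prompt_tgt, query_img, query_tgt, patch=16, mask_ratio=0.75):
    prompt_pair = torch.cat([prompt_img, prompt_tgt], dim=-2)    # (B, 3, 2H, W): image on top, target below
    query_pair = torch.cat([query_img, query_tgt], dim=-2)
    canvas = torch.cat([prompt_pair, query_pair], dim=-1)        # (B, 3, 2H, 2W): prompt left, query right

    # Randomly drop patches, but only over the query's target (bottom-right quadrant).
    B, _, H2, W2 = canvas.shape
    drop = torch.rand(B, H2 // patch, W2 // patch) < mask_ratio
    drop[:, : H2 // (2 * patch), :] = False                      # keep the two input images visible
    drop[:, :, : W2 // (2 * patch)] = False                      # keep the whole prompt pair visible
    drop_px = drop.repeat_interleave(patch, 1).repeat_interleave(patch, 2).unsqueeze(1)

    pred = model(canvas.masked_fill(drop_px, 0.0))               # plain ViT-style model: pixels in, pixels out
    sel = drop_px.expand_as(pred)
    return F.smooth_l1_loss(pred[sel], canvas[sel])

# Usage with a dummy "model" that just returns its input:
loss = training_step(lambda x: x, *(torch.rand(2, 3, 64, 64) for _ in range(4)))
print(loss.item())
```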

3.1.1. In-Context Coloring

In the traditional Painter framework, the color space of each task is predefined, which often causes the solution to collapse into ordinary multi-task learning rather than in-context learning. The proposed random coloring scheme for in-context coloring works as follows: sample another image with a similar context, map the colors of its targets to random colors, and train with this mixed context so the model relies on the context rather than on specific color values. Unifying the segmentation datasets also allows a consistent, task-aware data sampling strategy: different contexts are defined for different data types (e.g., semantic vs. instance segmentation), and the same color is used to refer to the same category or instance.
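A tiny illustration of the random coloring idea, assuming integer id maps and a palette re-drawn for every training sample (my own illustration, not BAAI's code):

```python
# The same segmentation ids are rendered with a fresh random palette for every training sample,
# so the model must read the color meaning from the context instead of memorizing a fixed
# class-to-color table.
import torch

def random_colorize(seg_map: torch.Tensor, palette: torch.Tensor) -> torch.Tensor:
    """seg_map: (H, W) integer ids; palette: (num_ids, 3) -> (3, H, W) colored target image."""
    return palette[seg_map].permute(2, 0, 1)

prompt_seg = torch.randint(0, 5, (64, 64))
query_seg = torch.randint(0, 5, (64, 64))
palette = torch.rand(5, 3)                              # re-drawn for every training sample
prompt_target = random_colorize(prompt_seg, palette)    # prompt and query share one palette,
query_target = random_colorize(query_seg, palette)      # so color meaning comes from the context
print(prompt_target.shape)  # torch.Size([3, 64, 64])
```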

3.1.2. Context Ensemble

Once training is complete, this capability can be unleashed at inference time. SegGPT supports arbitrary in-context segmentation, for example using a single example image and its target image. The target image can use a single color (ignoring the background) or multiple colors, e.g., to segment several categories or objects of interest in one pass. Concretely, given a test image, we stitch it together with the example images and feed the result to SegGPT to obtain the corresponding in-context prediction. To provide a more accurate and specific context, multiple examples can be used. One approach is the spatial ensemble, where multiple examples are tiled into an n×n grid and then resampled to the size of a single example; this matches the intuition of in-context coloring and extracts the semantic information of multiple examples at almost no extra cost. Another approach is the feature ensemble, where multiple examples are combined along the batch dimension and computed independently, except that the features of the query image are averaged after each attention layer; in this way, the query image gathers information from all examples during inference.
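The spatial ensemble can be sketched as follows, assuming all examples share the same resolution; this is illustrative only, since the real model operates on stitched image/target canvases.

```python
# Sketch of the spatial-ensemble idea: n*n example images are tiled into a grid and resized back
# to the footprint of a single example, so several prompts cost roughly the same as one.
import torch
import torch.nn.functional as F

def spatial_ensemble(examples: torch.Tensor) -> torch.Tensor:
    """examples: (n*n, C, H, W) -> (C, H, W) grid of all examples resampled to one example's size."""
    count, C, H, W = examples.shape
    n = int(count ** 0.5)
    grid = examples.view(n, n, C, H, W).permute(2, 0, 3, 1, 4).reshape(C, n * H, n * W)
    return F.interpolate(grid.unsqueeze(0), size=(H, W), mode="bilinear", align_corners=False)[0]

four_examples = torch.rand(4, 3, 128, 128)
merged = spatial_ensemble(four_examples)
print(merged.shape)  # torch.Size([3, 128, 128])
```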

3.1.3. In-Context Tuning

SegGPT can adapt to a specific use case without updating the model parameters: the entire model is frozen and a learnable image tensor is initialized as the input context. During tuning, only this learnable tensor is updated; everything else, including the loss function, stays the same. Afterwards, the authors take the learned image tensor out and use it as a plug-and-play key for the specific application.
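A minimal sketch of this recipe, with a stand-in backbone: freeze every model parameter and optimize only a learnable prompt tensor with the usual loss.

```python
# Sketch of in-context tuning: the backbone is frozen, only a learnable "prompt image" tensor
# is optimized. Simplified illustration, not the actual SegGPT code.
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)          # stand-in for the frozen SegGPT backbone
for p in model.parameters():
    p.requires_grad_(False)

learnable_prompt = torch.nn.Parameter(torch.zeros(1, 3, 64, 64))   # the only trainable weights
optimizer = torch.optim.AdamW([learnable_prompt], lr=1e-3)

for step in range(100):
    query, target = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    canvas = torch.cat([learnable_prompt, query], dim=-1)           # prepend the learned prompt
    loss = torch.nn.functional.smooth_l1_loss(model(canvas)[..., 64:], target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After tuning, `learnable_prompt` is saved and reused as a plug-and-play prompt for this task.
```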

3.2. Experiments


4. SEEM

SEEM is a promptable, interactive segmentation model. It retains dialogue history through learnable memory prompts and can segment everything in an image at once, covering semantic, instance, and panoptic segmentation, while supporting various types of prompts and arbitrary combinations of them.
The authors point out that SEEM has the following four highlights:
  • Versatility: handles various types of prompts, such as clicks, boxes, polygons, sketches, text, and reference images;
  • Compositionality: handles any combination of prompts;
  • Interactivity: supports multi-turn interaction with the user, thanks to memory prompts that store the session history;
  • Semantic awareness: provides a semantic label for any predicted mask.

4.1. Methods

The SEEM model adopts a generic encoder-decoder architecture and focuses on the sophisticated interaction between queries and prompts. Text prompts are encoded by a text encoder, while visual prompts are encoded by a visual sampler into pooled image features; both are turned into learnable queries that are fed into the SEEM decoder, which outputs masks and semantic labels. Inside the decoder, the prompts interact with the queries through self-attention and cross-attention, as shown in (a):
Multi-turn interactions between SEEM and humans are illustrated in (b), mainly consisting of the following three steps:
  1. The user provides a prompt;
  2. The model sends prediction results to the user;
  3. The model updates the memory prompt.

4.1.1. Versatility

In addition to text input, SEEM introduces visual prompts to handle all non-text inputs, such as points, boxes, sketches, and referred regions from another image. When a text prompt cannot pin down the correct segmentation region, non-text prompts provide useful complementary information for localizing it accurately. Previous interactive segmentation methods typically either converted spatial queries into masks and fed them into the image backbone, or used a different prompt encoder for each input type (points, boxes); the former is overly cumbersome and the latter is hard to generalize. To address this, SEEM uses visual prompts to unify all non-text inputs: they are uniformly represented as tokens living in the same visual embedding space, so a single mechanism can process every non-text input. In addition, SEEM continuously learns a universal visual-semantic space through panoptic and referring segmentation, which lets visual prompts align naturally with text prompts and thus better guide the segmentation. When learning semantic labels, prompt features are mapped into the same space to compute a similarity matrix, which better coordinates the segmentation tasks.
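One way to picture the visual sampler is sketched below: whatever the user draws is rasterized to a region on the feature grid, and the features under that region are pooled into a fixed number of prompt tokens. This is my own simplification, not the SEEM code.

```python
# Rough sketch of "visual prompt as tokens": sample image features under the user's
# point/box/scribble region into a fixed number of prompt tokens.
import torch

def visual_sampler(feat_map: torch.Tensor, region: torch.Tensor, num_tokens: int = 16) -> torch.Tensor:
    """feat_map: (C, H, W) image features; region: (H, W) bool mask of the user's prompt."""
    feats = feat_map[:, region]                          # (C, P) features inside the prompt region
    if feats.shape[1] == 0:
        return torch.zeros(num_tokens, feat_map.shape[0])
    idx = torch.randint(0, feats.shape[1], (num_tokens,))
    return feats[:, idx].T                               # (num_tokens, C) prompt tokens

feat_map = torch.randn(256, 64, 64)
box_region = torch.zeros(64, 64, dtype=torch.bool)
box_region[10:30, 20:50] = True                          # a box prompt rasterized to the feature grid
tokens = visual_sampler(feat_map, box_region)
print(tokens.shape)  # torch.Size([16, 256])
```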

4.1.2. Compositional

Users can express their intent with different or combined input types, so compositional prompting is crucial in practice. However, two issues arise during training. First, the training data typically covers only a single interaction type (e.g., none, text, or visual). Second, although all non-text prompts are unified as visual prompts and aligned with text prompts, their embedding spaces remain inherently different. To address this, the paper proposes matching different types of prompts to different outputs during training. After training, the SEEM model is familiar with all prompt types and supports various combinations, such as no prompt, a single prompt type, or visual and text prompts used together. Notably, even prompt combinations never seen during training can simply be concatenated and fed into the SEEM decoder.

4.1.3. Interactive

SEEM introduces memory prompts for multi-turn interactive segmentation, allowing segmentation results to be further optimized. Memory prompts are used to convey segmentation results from previous iterations, encoding historical information into the model for use in the current round. Unlike previous works that use a single network to encode masks, SEEM employs a mask-guided cross-attention mechanism to encode historical information, effectively utilizing segmentation history for optimization in the next round. Notably, this approach can also be extended to simultaneous interactive segmentation of multiple objects.
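A sketch of a mask-guided cross-attention step for the memory prompt is shown below, under my own simplifying assumptions: the memory token is only allowed to attend to image positions covered by the previous round's mask, which is how history is folded back into the next round.

```python
# Mask-guided cross-attention sketch: the memory prompt attends only to image tokens
# inside the previously predicted mask. Illustrative shapes, not the SEEM implementation.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

image_tokens = torch.randn(1, 64 * 64, 256)              # flattened image features
memory_prompt = torch.randn(1, 1, 256)                   # carries last round's result

prev_mask = torch.zeros(64, 64, dtype=torch.bool)
prev_mask[16:48, 16:48] = True                           # mask predicted in the previous round
ignore_mask = ~prev_mask.flatten().unsqueeze(0)          # True = position NOT allowed to attend

updated_memory, _ = attn(
    memory_prompt, image_tokens, image_tokens,
    key_padding_mask=ignore_mask,                        # restrict attention to the old mask region
)
print(updated_memory.shape)  # torch.Size([1, 1, 256])
```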

4.1.4. Semantic-aware

Unlike previous category-agnostic interactive segmentation methods, SEEM attaches semantic labels to masks produced from any combination of prompts, because its visual prompt features are aligned with text features in a joint visual-semantic space. Although no semantic labels are used when training the interactive segmentation branch, this joint space allows a similarity matrix to be computed between mask embeddings and the text/visual prompt embeddings, so the resulting logits align well and semantic labels can be predicted for interactive masks at inference time.
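The label assignment itself reduces to a similarity lookup in that joint space, roughly as follows (illustrative names and shapes, not the actual SEEM pipeline):

```python
# A class-agnostic mask gets a semantic label by comparing its embedding against the text
# embeddings of candidate class names in the shared space and taking the argmax.
import torch
import torch.nn.functional as F

class_names = ["cat", "dog", "car"]
text_embeddings = F.normalize(torch.randn(len(class_names), 512), dim=-1)   # from the text encoder
mask_embedding = F.normalize(torch.randn(1, 512), dim=-1)                   # from the mask decoder

logits = mask_embedding @ text_embeddings.T              # (1, num_classes) similarity matrix
label = class_names[logits.argmax(dim=-1).item()]
print(label, logits.softmax(dim=-1))
```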

4.2. Experiments

Visual prompts perform significantly better than textual ones, and the highest IoU is achieved when visual and textual prompts are used together.