Comprehensive Survey on Segment Anything Model (SAM)

Source | Machine Heart

Editor | Jishi Platform

Jishi Introduction

This paper is the first comprehensive survey of progress on the SAM foundation model. It focuses on SAM's applications across various tasks and data types, and discusses its historical development, recent advances, and profound impact on a broad range of applications.

Artificial intelligence (AI) is evolving toward artificial general intelligence (AGI), meaning AI systems that can perform a wide range of tasks and exhibit human-like intelligence, in contrast to narrow AI built to execute specific tasks efficiently. Designing general-purpose foundation models is therefore an urgent need. Foundation models are trained on broad data and can be adapted to many downstream tasks. The recently proposed Segment Anything Model (SAM) from Meta breaks through the boundaries of segmentation and greatly advances the development of foundation models for computer vision.

SAM is a prompt-driven model trained on more than 1 billion masks across 11 million images, and it achieves strong zero-shot generalization. Many researchers regard this as "the GPT-3 moment for CV," because SAM has learned the general concept of what an object is, even for unknown objects, unfamiliar scenes (such as underwater or cell microscopy imagery), and ambiguous cases, demonstrating its great potential as a foundation model for computer vision.

To provide a full picture of SAM, researchers from the Hong Kong University of Science and Technology (Guangzhou), Shanghai Jiao Tong University, and other institutions conducted an in-depth study and jointly published the paper "A Comprehensive Survey on Segment Anything Model for Vision and Beyond".


Paper: https://arxiv.org/abs/2305.08196

As the first comprehensive survey of progress on the SAM foundation model, this paper focuses on SAM's applications across various tasks and data types, and discusses its historical development, recent advances, and profound impact on a broad range of applications.

The article first introduces the background and terminology of foundation models, including SAM, as well as state-of-the-art methods that are significant for segmentation tasks;

Then, the study analyzes and summarizes the advantages and limitations of SAM across various image processing applications, including software scenarios, real-world scenarios, and complex scenarios. Importantly, it offers insights to guide future research toward more versatile foundation models and improvements to SAM's architecture;

Finally, the study summarizes SAM’s applications in vision and other fields.

Let’s take a look at the specific contents of the paper.

SAM Model Overview

SAM originated from Meta's Segment Anything (SA) project in 2023. Observing that foundation models emerging in NLP and CV show strong performance, the researchers sought to build a similar model that unifies the entire image segmentation task. However, data available in the segmentation field is relatively scarce, which conflicted with this design goal. Therefore, as shown in Figure 1, the researchers divided the path into three interconnected parts: task, model, and data.

[Figure 1: The Segment Anything project's path, divided into task, model, and data]

The SAM architecture is illustrated below and consists of three main parts: an image encoder, a prompt encoder, and a mask decoder.

[Figure: The SAM architecture, comprising an image encoder, a prompt encoder, and a mask decoder]
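For readers who want to see how these three components are exercised in practice, below is a minimal sketch using the officially released segment-anything Python package; the checkpoint filename, image path, and point coordinates are placeholder assumptions.

```python
# Minimal sketch of prompt-driven mask prediction with the released
# segment-anything package (pip install segment-anything).
# The checkpoint file, image path, and point coordinates are placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM (image encoder + prompt encoder + mask decoder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The heavy image encoder runs once per image; prompts can then vary cheaply.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point prompt (label 1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks to resolve ambiguity
)
best_mask = masks[np.argmax(scores)]  # H x W boolean mask
```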

After gaining a preliminary understanding of SAM, the study then introduces SAM’s use in image processing.

SAM for Image Processing

This section is mainly divided by scenarios, including: software scenarios, real-world scenarios, and complex scenarios.

Software Scenarios

Software scenarios involve image editing and inpainting operations such as removing objects, filling objects, and replacing objects. However, existing inpainting works such as [99], [100], [101], and [102] require fine-grained annotations for each mask to achieve good performance, which is labor-intensive. SAM [20] can produce accurate masks from simple prompts such as points or boxes, which aids these image editing scenarios.

Inpaint Anything (IA) [39] designed a pipeline that combines the strengths of SAM, state-of-the-art image inpainters [99], and AI-generated content (AIGC) models [103] to solve inpainting-related problems. The pipeline is illustrated in Figure 3. For object removal, the pipeline consists of SAM and a state-of-the-art inpainter such as LaMa [99]: user clicks serve as prompts for SAM to generate a mask of the object region, the mask is refined with erosion and dilation operations, and LaMa then fills the hole. For object filling and replacement, the second step instead uses an AIGC model such as Stable Diffusion (SD) [103] to fill the selected region with a newly generated object driven by a text prompt.

[Figure 3: The Inpaint Anything (IA) pipeline]
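As a rough illustration of this click-to-mask-to-fill idea (not the official Inpaint Anything code), the sketch below combines SAM with the Stable Diffusion inpainting pipeline from the diffusers library for the fill/replace step; all paths, prompt text, and click coordinates are assumptions.

```python
# Rough sketch of the click -> mask -> dilate -> fill idea (not the official
# Inpaint Anything code). The diffusers Stable Diffusion inpainting pipeline
# stands in for the fill/replace step; paths, prompt text, and the click
# coordinates are placeholders.
import numpy as np
import cv2
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor
from diffusers import StableDiffusionInpaintPipeline

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# 1) The user's click becomes a point prompt; SAM returns the object mask.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[420, 300]]), point_labels=np.array([1])
)
mask = masks[np.argmax(scores)].astype(np.uint8)

# 2) Dilate the mask so the hole fully covers the object boundary.
mask = cv2.dilate(mask, np.ones((15, 15), np.uint8), iterations=1)

# 3) Fill the hole, here with text-guided Stable Diffusion inpainting.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
result = pipe(
    prompt="a wooden bench",  # the replacement object described in text
    image=Image.fromarray(image),
    mask_image=Image.fromarray(mask * 255),
).images[0]
result.save("edited.png")
```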

A similar idea can also be seen in Edit Everything [40], as shown in Figure 4, which allows users to edit images using simple text instructions.

[Figure 4: The Edit Everything pipeline]

Real-World Scenarios

Researchers indicate that SAM can assist with many real-world scenarios, such as real-world object detection, object counting, and moving object detection. Recently, [108] evaluated SAM's performance across diverse real-world segmentation scenarios (e.g., natural images, agriculture, manufacturing, remote sensing, and healthcare). The paper found that SAM generalizes well in common scenarios such as natural images, but performs poorly in low-contrast scenarios and requires strong prior knowledge in complex scenarios.

For instance, in civil infrastructure defect assessment, [42] used SAM to detect cracks in concrete structures and compared its performance with the baseline U-Net [109]. The crack detection process is illustrated in Figure 6. The results show that SAM outperforms U-Net on longitudinal cracks, which are more likely to resemble the ordinary scenes found in its training images, whereas on less common cases such as spalling, SAM performs worse than U-Net.

[Figure 6: Process of using SAM and U-Net for crack detection. Image from the original paper [42].]

Unlike the complex imagery involved in crack detection, crater detection is a better fit for SAM as a detection tool, because craters are mostly circular or elliptical. Craters are among the most important morphological features in planetary exploration, and detecting and counting them is an important yet time-consuming task in planetary science. Although existing machine learning and computer vision work has successfully tackled specific problems in crater detection, it relies on particular types of data and therefore does not transfer well across data sources.

In [110], researchers proposed a general crater detection scheme that exploits SAM's zero-shot generalization to unfamiliar objects. The pipeline uses SAM to segment the input images without restrictions on data type or resolution, then applies circularity and ellipticity indices to filter out masks that are not circular or elliptical, and finally uses a post-processing filter to remove duplicates, artifacts, and false positives. The pipeline shows significant potential as a general tool for this field, and the authors also discuss the drawback of recognizing only specific shapes.
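A hedged sketch of this segment-everything-then-filter idea is shown below; it is not the code of [110], and the circularity measure and the 0.8 threshold are illustrative assumptions.

```python
# Sketch of "segment everything, then keep circular or elliptical masks".
# This is not the code of [110]; the circularity measure and the 0.8
# threshold are illustrative assumptions.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("planet_tile.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: "segmentation", "area", "bbox", ...

def circularity(seg: np.ndarray) -> float:
    """4*pi*area / perimeter^2, which is 1.0 for a perfect circle."""
    contours, _ = cv2.findContours(
        seg.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    if not contours:
        return 0.0
    c = max(contours, key=cv2.contourArea)
    perimeter = cv2.arcLength(c, True)
    return 4.0 * np.pi * cv2.contourArea(c) / (perimeter ** 2 + 1e-6)

# Keep only roughly circular masks as crater candidates; a real pipeline
# would also filter duplicates, artifacts, and false positives.
craters = [m for m in masks if circularity(m["segmentation"]) > 0.8]
print(f"{len(craters)} crater candidates out of {len(masks)} masks")
```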

Complex Scenarios

Beyond the conventional scenarios above, whether SAM can solve segmentation problems in complex scenarios (such as low-contrast scenes) is also a meaningful question, as it would expand SAM's range of application. To explore SAM's generalization in more complex scenarios, Ji et al. [22] quantitatively compared it with state-of-the-art models in three settings: camouflaged animals, industrial defects, and medical lesions. They ran experiments on three camouflaged object segmentation (COS) datasets, CAMO [116] with 250 samples, COD10K [117] with 2,026 samples, and NC4K [118] with 4,121 samples, comparing SAM with the Transformer-based models CamoFormer-P/S [119] and HitNet [120]. The results indicate that SAM is not proficient in concealed scenes, and potential solutions may rely on prior knowledge from the specific domain. The same conclusion is drawn in [29], where the authors compared SAM with 22 state-of-the-art camouflaged object detection methods on the same three datasets.

Cao et al. [115] proposed Segment Any Anomaly+ (SAA+), a new framework for zero-shot anomaly segmentation, as shown in Figure 7. The framework uses hybrid prompt regularization to improve the adaptability of modern foundation models, enabling more accurate anomaly segmentation without domain-specific fine-tuning. The authors conducted detailed experiments on four anomaly segmentation benchmarks, VisA [122], MVTec AD [123], MTD [124], and KSDD2 [125], and achieved state-of-the-art performance.

[Figure 7: The Segment Any Anomaly+ (SAA+) framework]

He et al. [126] proposed WSSAM, the first method that uses SAM for weakly supervised concealed object segmentation, addressing the challenge of segmenting objects that blend into their surroundings from sparsely annotated data (see Figure 8). WSSAM includes SAM-based pseudo-labeling and multi-scale feature grouping to improve model learning and to distinguish concealed objects from the background. The authors found that with only scribble supervision [127], SAM can generate segmentation masks good enough to train a segmenter.

[Figure 8: The WSSAM pipeline for weakly supervised concealed object segmentation]
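The sketch below illustrates the general scribble-to-pseudo-mask idea (not the WSSAM implementation): points sampled from a foreground scribble serve as positive prompts for SAM, and the highest-scoring mask is kept as a pseudo-label; the file paths and the number of sampled points are assumptions.

```python
# Illustrative sketch of the scribble -> SAM -> pseudo-mask idea (not the
# WSSAM implementation). Points sampled from a foreground scribble become
# positive prompts, and the highest-scoring mask is kept as a pseudo-label.
# File paths and the number of sampled points are assumptions.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("camouflaged.jpg"), cv2.COLOR_BGR2RGB)
scribble = cv2.imread("scribble.png", cv2.IMREAD_GRAYSCALE)  # nonzero = foreground scribble
predictor.set_image(image)

# Sample a handful of prompt points along the scribble.
ys, xs = np.nonzero(scribble)
idx = np.random.choice(len(xs), size=min(5, len(xs)), replace=False)
points = np.stack([xs[idx], ys[idx]], axis=1)  # (N, 2) in (x, y) order

masks, scores, _ = predictor.predict(
    point_coords=points,
    point_labels=np.ones(len(points), dtype=int),
    multimask_output=True,
)
pseudo_mask = masks[np.argmax(scores)]  # used downstream as a training pseudo-label
```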

More Models and Applications: Vision and Beyond

Vision Related

First, medical imaging. The purpose of medical image segmentation is to reveal the anatomical or pathological structure of the corresponding tissue, and it can support computer-aided diagnosis and intelligent clinical surgery.

Figure 10 provides an overview of SAM in medical imaging, covering computed tomography (CT) images, magnetic resonance imaging (MRI) images, colonoscopy images, multi-format images, H&E-stained tissue section images, and more.

[Figure 10: Overview of SAM applications in medical imaging]

Second, video. In computer vision, video object tracking (VOT) and video segmentation are regarded as crucial, indispensable tasks. VOT involves locating a specific target in a video frame and then tracking it through the rest of the video, and it has many practical applications, such as surveillance and robotics.

SAM has made notable contributions in the VOT field. Reference [46] introduced the Track Anything Model (TAM), which achieves high-quality interactive tracking and segmentation in videos with great efficiency. Figure 11 shows the TAM pipeline.

[Figure 11: The Track Anything Model (TAM) pipeline]

Another tracking model is SAMTrack, detailed in reference [172]. SAMTrack is a video segmentation framework that achieves target tracking and segmentation through both interactive and automatic methods. Figure 12 shows the SAMTrack pipeline.

[Figure 12: The SAMTrack pipeline]
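At a high level, both pipelines follow an interactive initialize-then-propagate pattern. The sketch below captures only that pattern: SAM converts a first-frame click into a mask, and a video object segmentation tracker propagates it; the VosTracker object and its initialize/step interface are hypothetical stand-ins for a real tracker such as XMem, not the actual TAM or SAMTrack API.

```python
# High-level sketch of the interactive initialize-then-propagate pattern:
# SAM turns a user click on the first frame into a mask, and a video object
# segmentation tracker carries that mask through the remaining frames.
# The tracker's initialize/step methods are hypothetical stand-ins for a
# real tracker such as XMem; this is not the TAM or SAMTrack API.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def track_from_click(frames, click_xy, tracker):
    """frames: list of H x W x 3 RGB arrays; click_xy: (x, y) on frame 0."""
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    # 1) Interactive initialization: one click -> first-frame mask via SAM.
    predictor.set_image(frames[0])
    masks, scores, _ = predictor.predict(
        point_coords=np.array([click_xy]), point_labels=np.array([1])
    )
    init_mask = masks[np.argmax(scores)]

    # 2) Temporal propagation: the tracker carries the mask across frames.
    tracker.initialize(frames[0], init_mask)  # hypothetical interface
    return [init_mask] + [tracker.step(f) for f in frames[1:]]  # hypothetical interface
```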

Figure 13 shows a lightweight SAM-guided refinement module (SEEM) designed to enhance the performance of existing methods.

[Figure 13: A lightweight SAM-guided refinement module designed to enhance existing methods]

Next is data annotation. SAMText [180] is a scalable pipeline for scene text mask annotation in videos. It uses SAM to generate mask annotations for the large-scale dataset SAMText-9M, which contains over 2,400 video clips and more than 9 million mask annotations.
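The core of such box-to-mask annotation pipelines is SAM's box prompt. The sketch below shows the idea (it is not the SAMText code); the frame path and the example boxes are placeholders.

```python
# Sketch of converting existing bounding-box annotations into mask
# annotations with SAM's box prompt (not the SAMText code). The frame path
# and the example boxes are placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

frame = cv2.cvtColor(cv2.imread("frame_000123.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)

# Existing scene text boxes in (x0, y0, x1, y1) format.
text_boxes = [(120, 40, 310, 85), (50, 400, 260, 455)]

mask_annotations = []
for box in text_boxes:
    masks, _, _ = predictor.predict(
        box=np.array(box),
        multimask_output=False,  # one mask per box is enough for annotation
    )
    mask_annotations.append(masks[0])  # H x W boolean mask for this text instance
```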

Moreover, reference [143] combined existing remote sensing object detection datasets with the data-centric foundation model SAM to construct SAMRS, a large-scale remote sensing image segmentation dataset containing object category, location, and instance information, which can support research on semantic segmentation, instance segmentation, and object detection.

Beyond Vision

First, 3D reconstruction. Beyond fine-grained 3D segmentation, SA3D [183] can also be used for 3D reconstruction. With the 3D mask grids, researchers can determine the space an object occupies in 3D and reconstruct it in various ways. Figure 14 illustrates the overall SA3D pipeline.

[Figure 14: The overall SA3D pipeline]

Reference [186] proposed a new object removal pipeline ORNeRF, which removes objects from 3D scenes using points or text prompts on a single view. By using a point projection strategy to rapidly propagate user annotations to all views, this method achieves better performance in less time than previous works. Figure 15 shows the framework of ORNeRF.

[Figure 15: The ORNeRF framework]

Next is non-Euclidean domains. To handle different feature dimensions for various tasks, the SNA method shown in Figure 16 introduces a dedicated, scalable graph convolution layer that can dynamically activate or deactivate channels based on the input feature dimensions.

[Figure 16: The SNA method for segmentation in non-Euclidean domains]

Then, robotics. Figure 17 illustrates the overall pipeline of Instruct2Act [190]. In the perception part, predefined APIs are used to access multiple foundation models: SAM [20] accurately locates candidate objects, and CLIP [13] classifies them. The framework leverages the expertise of foundation models together with robotic abilities to translate complex high-level instructions into precise policy code.

[Figure 17: The overall Instruct2Act pipeline]
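A hedged sketch of this "SAM proposes, CLIP classifies" perception step is given below; it is not the Instruct2Act code, the crop-and-score scheme is a simplification, and the model checkpoints, image path, and label list are assumptions.

```python
# Hedged sketch of the "SAM proposes, CLIP classifies" perception step
# (not the Instruct2Act code). Model checkpoints, image path, and label
# list are assumptions, and the crop-and-score scheme is a simplification.
import numpy as np
import cv2
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model, preprocess = clip.load("ViT-B/32", device=device)

image = cv2.cvtColor(cv2.imread("tabletop.jpg"), cv2.COLOR_BGR2RGB)
candidates = mask_generator.generate(image)  # SAM: class-agnostic object proposals

labels = ["a red block", "a blue bowl", "a robot gripper"]
text = clip.tokenize(labels).to(device)

results = []
for cand in candidates:
    x, y, w, h = [int(v) for v in cand["bbox"]]  # XYWH box around each proposal
    crop = Image.fromarray(image[y:y + h, x:x + w])
    with torch.no_grad():
        logits_per_image, _ = clip_model(preprocess(crop).unsqueeze(0).to(device), text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]
    results.append((labels[int(probs.argmax())], float(probs.max()), cand["bbox"]))
```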

Next is video text localization. Figure 18 shows SAMText [180], a scalable and efficient solution for generating mask annotations for video text localization. By applying SAM to bounding-box annotations, it can produce mask annotations for large-scale video text datasets.

[Figure 18: The SAMText pipeline for video text mask annotation]

There is also image captioning. Wang et al. [44] proposed Caption Anything (CAT), a controllable image captioning method, as shown in Figure 20. The CAT framework introduces multimodal control into image captioning, producing captions with diverse visual focal points and language styles that align with human intent.

[Figure 20: The Caption Anything (CAT) framework]
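To illustrate the spirit of region-controlled captioning (not the CAT implementation), the sketch below selects an object with a SAM point prompt and captions the cropped region with an off-the-shelf BLIP captioner; CAT's additional language-model-based style control is omitted, and the paths and click coordinates are assumptions.

```python
# Illustrative sketch of region-controlled captioning in the spirit of CAT
# (not the CAT implementation): a click selects an object via SAM, and the
# cropped region is captioned with an off-the-shelf BLIP model. CAT's
# language-model-based style control is omitted; paths and coordinates are
# assumptions.
import numpy as np
import cv2
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor
from transformers import BlipProcessor, BlipForConditionalGeneration

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Visual control: a click chooses the object the caption should focus on.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[640, 360]]), point_labels=np.array([1])
)
ys, xs = np.nonzero(masks[np.argmax(scores)])
crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

# Caption only the selected region.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
inputs = processor(images=Image.fromarray(crop), return_tensors="pt")
out = blip.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```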

Conclusion

This article provides the first comprehensive review of research progress on the SAM foundation model in computer vision and beyond. It first summarizes the history of foundation models (large language models, large vision models, and multimodal large models) and the basic terminology of SAM, with a focus on SAM's applications to various tasks and data types, and it summarizes and compares methods concurrent with SAM as well as its follow-up works. The researchers also discuss SAM's immense potential across a wide range of image processing applications, including software scenarios, real-world scenarios, and complex scenarios.

Furthermore, the researchers analyze and summarize SAM's advantages and limitations across these applications. These observations can offer insights for developing stronger foundation models and for further improving SAM's robustness and generalization. The article concludes by summarizing SAM's many other remarkable applications in vision and beyond.

Editor / Garvey

Review / Fan Ruiqiang

Verification / Garvey
