First Published Foundation Model for SAR Image Target Recognition

Published by Machine Heart
Machine Heart Editorial Department
Synthetic Aperture Radar (SAR) is an active electromagnetic-wave detection technology that provides all-weather, day-and-night Earth observation. It has become an indispensable tool for Earth observation, with significant applications in both military and civilian fields.
Automatic Target Recognition (ATR) is the core problem of intelligent SAR image interpretation: automatically locating and classifying typical targets (usually vehicles, ships, and aircraft) in SAR images. Achieving high-precision, agile, robust, and resource-efficient SAR target recognition in complex, open, and adversarial environments still faces many difficulties. Currently, these challenges fall at two levels.
  • Technical level: Most SAR target recognition methods are supervised, static, single-task, single-model, and single-platform. Detection and classification of specific categories each require their own algorithm model, and every task must be learned independently from scratch, leading to computational redundancy, long algorithm design cycles, weak generalization, and heavy dependence on annotation.
  • Ecological level: Because SAR image data are sensitive and annotation is expensive, the field lacks a healthy ecosystem of open-source code, evaluation benchmarks, and data. Many SAR target recognition algorithms are never open-sourced, there is no unified evaluation benchmark, and no publicly available, large-scale, high-quality SAR target recognition benchmark dataset at the million or ten-million scale currently exists.
In an era of rapidly developing AI foundation model technology, the SAR image interpretation field urgently needs breakthroughs in both technological innovation and its development ecosystem.
Figure 1. Specialized SAR ATR datasets and tasks. SAR ATR spans many imaging conditions (i.e., operating conditions), including targets, scenes, and sensors. Because collection is expensive, datasets are usually built for specific tasks and settings: MSTAR, for example, classifies 10 types of vehicle targets in X-band grassland scenes, while SAR-AIRcraft detects 7 types of aircraft collected from three airports by a C-band satellite. Differences in target characteristics, scene information, and sensor parameters make it hard for existing algorithms to generalize. The team therefore aims to establish a SAR ATR foundation model as a universal approach to these varied tasks.
To address these technical challenges, the team of Professors Liu Yongxiang and Liu Li from the School of Electronic Science, National University of Defense Technology, proposed SARATR-X 1.0, the first publicly released foundation model for SAR image target recognition.
Technical level: ① Pioneered self-supervised representation learning for SAR target features. ② Proposed SAR-JEPA (Joint Embedding Predictive Architecture for SAR ATR), a new joint embedding-predictive self-supervised learning framework suited to SAR images: the network predicts only the sparse, important gradient-feature representations of a SAR image, which effectively suppresses coherent speckle noise and avoids regressing the raw pixel intensities that contain it. ③ Developed SARATR-X (66 million parameters, Transformer-based), the first foundation model for SAR image target recognition, breaking the bottleneck of heavy dependence on large-scale, high-quality annotated data when learning SAR target features in complex scenes and significantly enhancing the pre-trained model's cognitive ability.
Ecological level: The team is committed to building a healthy open-source ecosystem for SAR image target recognition to accelerate innovation in the field. ① Standardized and integrated existing public datasets into SARDet-180K, a large-scale SAR land-and-sea target recognition dataset. ② To replace MSTAR (10 vehicle types), spent two years building NUDT4MSTAR, a SAR vehicle target recognition dataset (40 vehicle types, more challenging real-world scenes, publicly available, 10 times larger than comparable datasets) with detailed performance evaluations. ③ Open-sourced the associated target recognition code and evaluation benchmarks.

The research results, titled “SARATR-X: Towards Building A Foundation Model for SAR Target Recognition” and “Predicting Gradient is Better: Exploring Self-Supervised Learning for SAR ATR with a Joint-Embedding Predictive Architecture,” have been accepted by the top international journal IEEE Transactions on Image Processing and published in the ISPRS Journal of Photogrammetry and Remote Sensing, respectively.

The team’s representative works have attracted attention and positive evaluations from peers in China and abroad since publication. Citing institutions include the U.S. Air Force Research Laboratory, Gustave Eiffel University in France, Nanyang Technological University in Singapore, Peking University, Wuhan University, and Beihang University.

For example, Clement Mallet, editor-in-chief of the ISPRS Journal and director of the LASTIG laboratory, wrote in the paper “AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities”: “SAR-JEPA [41] introduces the first implementation of JEPA concepts for EO, focusing exclusively on SAR data. In this paper, we combine JEPA with a versatile spatial encoder architecture, allowing a single model to handle diverse data scales, resolutions, and modalities.”

Furthermore, the team is accelerating development of SARATR-X 2.0, expected to reach a parameter scale of 300 million and a SAR target chip sample size of 2 million; the collected data will be released as an open-source dataset to support the ecosystem. The NUDT4MSTAR SAR vehicle target recognition dataset will also be released soon.

Technical Solution
The team aims to build a universal foundation model for SAR image target recognition that meets diverse recognition needs in practice. SARATR-X 1.0, the first publicly released foundation model of this kind, learns general feature representations from large-scale unannotated SAR target images, overcoming the adaptability limits of traditional supervised algorithms and providing a basis for efficient adaptation to various downstream tasks. Across a series of works, the team studied the pre-training set, model architecture, self-supervised learning method, and evaluation benchmarks for such a model.
Pre-training Set: To adapt to diverse downstream tasks, the pre-training set covers a variety of target categories and imaging conditions. The team incorporated most open-source datasets, 14 classification and detection datasets in total, into a new pre-training corpus for exploring the potential of the foundation model (a minimal data-merging sketch follows Table 1).
Table 1. 14 open-source synthetic aperture radar datasets used for pre-training SARATR-X.
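As a concrete illustration, here is a minimal sketch, not the team's released code, of how several open-source SAR datasets might be merged into one unlabeled pre-training corpus with PyTorch's ConcatDataset; the SARChipFolder class, the .npy storage format, and the folder names are all hypothetical.

```python
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import ConcatDataset, Dataset


class SARChipFolder(Dataset):
    """Hypothetical minimal dataset: loads single-channel SAR amplitude
    chips stored as .npy arrays in one folder."""

    def __init__(self, root: str):
        self.files = sorted(Path(root).glob("*.npy"))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, i: int) -> torch.Tensor:
        chip = np.load(self.files[i]).astype(np.float32)
        return torch.from_numpy(chip).unsqueeze(0)  # (1, H, W)


# Merge open-source datasets into one unlabeled pre-training corpus
# (folder names are placeholders standing in for the 14 datasets of Table 1).
pretrain_set = ConcatDataset(
    [SARChipFolder(f"data/{name}") for name in ("MSTAR", "SSDD", "SAR-AIRcraft")]
)
```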
Model Architecture: SARATR-X adopts the HiViT architecture to obtain better spatial representations of remote sensing images, especially for small targets in large scenes. HiViT retains the high-resolution input of Swin Transformer while allowing masked patches to be dropped during masked-image-modeling self-supervised learning, which improves training efficiency.
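The efficiency gain comes from encoding only the visible tokens. Below is a generic MAE-style random-masking sketch in PyTorch, an illustration of the patch-dropping idea rather than the HiViT implementation; the tensor shapes and the 75% mask ratio are assumptions.

```python
import torch


def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (B, N, D) patch embeddings. Keeps a random subset of patches
    and drops the rest before the encoder, so encoder compute scales with
    the keep ratio instead of the full token count."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)               # random permutation of patches
    ids_keep = ids_shuffle[:, :num_keep]             # indices of visible patches
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)    # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask, ids_keep
```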
Self-Supervised Learning: Speckle noise from coherent SAR imaging degrades image quality, and the visual features of SAR amplitude images are far less prominent than those of optical RGB images. The main task of SAR SSL is therefore to improve the quality of the learned features and the target signals. The earlier work SAR-JEPA focused on designing self-supervised learning methods tailored to these characteristics of SAR images.
SAR-JEPA draws on works such as JEPA, MaskFeat, and FG-MAE, which run the self-supervised task in a feature space rather than the raw pixel space, compressing the redundancy of the image space and learning properties such as target attributes and deep semantics. To handle the noise in SAR images, SAR-JEPA performs self-supervision in a denoised feature space: it combines traditional feature operators to extract target edge gradient information while removing speckle interference, enabling large-scale self-supervised learning on unannotated, noisy SAR data (see the sketch below). Results show that self-supervised models keep improving with data volume across different SAR target classification datasets, which motivated building a universal SAR target recognition foundation model on large-scale data for efficient reuse across different targets, scenes, sensors, and recognition tasks.
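To make the idea concrete, here is a minimal sketch of a gradient-based SSL target and loss. The papers use multi-scale, ratio-based SAR gradient operators; as a simplification, this sketch substitutes a plain Sobel gradient magnitude and a MaskFeat-style masked regression loss, so treat it as an illustration of the principle rather than the authors' exact features.

```python
import torch
import torch.nn.functional as F


def gradient_target(sar: torch.Tensor) -> torch.Tensor:
    """sar: (B, 1, H, W) amplitude image. Returns a gradient-magnitude map
    as the SSL regression target. NOTE: a Sobel operator stands in for the
    papers' multi-scale, ratio-based SAR gradient features."""
    kx = torch.tensor([[-1.0, 0.0, 1.0],
                       [-2.0, 0.0, 2.0],
                       [-1.0, 0.0, 1.0]], device=sar.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(sar, kx, padding=1)
    gy = F.conv2d(sar, ky, padding=1)
    # Edge maps are sparse: flat, speckle-dominated regions yield weak
    # gradients, so the target emphasizes target structure over noise.
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)


def masked_feature_loss(pred: torch.Tensor, target: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """pred/target: (B, N, D) per-patch features; mask: (B, N), 1 = masked.
    Regress the gradient features only at masked positions, so the network
    must predict target structure instead of copying visible pixels."""
    per_token = ((pred - target) ** 2).mean(dim=-1)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```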
SARATR-X is therefore trained on the basis of SAR-JEPA: it is first pre-trained on ImageNet data to obtain a better, more diverse initialization, and then pre-trained on SAR images using high-quality target signals (a combined sketch follows Figure 2).
Figure 2. The two-step pre-training process: Step 1 pre-trains on ImageNet data for a better, more diverse initialization; Step 2 pre-trains on SAR images with high-quality target signals, e.g., suppressing speckle noise and extracting multi-scale gradient features of target edges.
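Putting the pieces together, the following sketch shows an assumed version of the two-step recipe, reusing gradient_target and masked_feature_loss from above. The patchify layer, encoder, head, checkpoint path, and mask ratio are all illustrative stand-ins; for brevity, masked patches are replaced here by a learnable mask token rather than dropped as in HiViT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH, DIM = 16, 512
embed = nn.Conv2d(1, DIM, kernel_size=PATCH, stride=PATCH)   # patchify SAR chips
encoder = nn.TransformerEncoder(                             # generic ViT-style stand-in
    nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True), num_layers=8)
head = nn.Linear(DIM, PATCH * PATCH)                         # regress per-patch gradient target
mask_token = nn.Parameter(torch.zeros(1, 1, DIM))

# Step 1 (assumed): initialize from an ImageNet MIM checkpoint (path is hypothetical).
# encoder.load_state_dict(torch.load("imagenet_mim_init.pth"), strict=False)


def ssl_step(sar: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
    """Step 2: one SAR self-supervised step; sar is (B, 1, H, W)."""
    target = gradient_target(sar)                                 # (B, 1, H, W) edge map
    tgt = F.unfold(target, PATCH, stride=PATCH).transpose(1, 2)   # (B, N, PATCH*PATCH)
    tokens = embed(sar).flatten(2).transpose(1, 2)                # (B, N, DIM)
    B, N, _ = tokens.shape
    mask = torch.rand(B, N, device=sar.device) < mask_ratio       # True = masked
    tokens = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, DIM), tokens)
    pred = head(encoder(tokens))                                  # (B, N, PATCH*PATCH)
    return masked_feature_loss(pred, tgt, mask.float())
```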
Evaluation Tasks: To comprehensively evaluate the foundation model, the team used 3 open-source target datasets. It first constructed SAR-VSA, a 25-category fine-grained classification dataset, to assess the effectiveness of the proposed improvements, and then compared SARATR-X 1.0 comprehensively with existing methods on public classification and detection datasets.
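A typical downstream evaluation pattern, sketched below as a continuation of the snippet above under the same assumptions, is to attach a task head to the pre-trained encoder and fine-tune on labeled data; the 25-class head mirrors the SAR-VSA setting, and everything else is illustrative.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(DIM, 25)   # e.g., the 25-category SAR-VSA classification setting
opt = torch.optim.AdamW(
    [*embed.parameters(), *encoder.parameters(), *classifier.parameters()], lr=1e-4)


def finetune_step(sar: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """One supervised fine-tuning step on labeled SAR chips."""
    tokens = embed(sar).flatten(2).transpose(1, 2)   # reuse pre-trained patchify + encoder
    feats = encoder(tokens).mean(dim=1)              # global average pooling over patches
    loss = nn.functional.cross_entropy(classifier(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.detach()
```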
Model Performance
Limited by the scale of publicly available SAR target recognition datasets, SARATR-X 1.0 has only 66 million parameters, yet it learns general feature representations from large-scale unannotated SAR target images. Its performance on multiple downstream tasks (8 benchmark target recognition tasks, including few-shot recognition, robust recognition, and target detection) reaches an internationally advanced or leading level (Figure 3). On the fine-grained MSTAR vehicle dataset, its classification performance surpasses the existing SSL method BIDFC by 4.5%.
It also performs well under extended operating conditions (EOCs), e.g., depression angle (EOCs-Depression), target configuration (EOCs-Config), and target version (EOCs-Version). SARATR-X is likewise competitive in target detection across categories (multi-class SARDet-100K and OGSOD, ship SSDD, and aircraft SAR-AIRcraft), with an average improvement of about 4%. In addition, the method scales well with data and parameter volume, leaving room for further gains.
Figure 3. Results of classification and detection for SARATR-X 1.0.
Detection Result Analysis: Detection visualizations are shown in Figure 4. False alarms and missed detections are common in SAR images, especially where similar targets overlap and in complex scenes. Although the proposed method effectively improves detection performance by learning contextual information in images, detecting targets in complex scenes and low-quality images remains very challenging.
Figure 4. Visualization of detection on SARDet-100K.
Attention Diversity Analysis: Figure 5 visualizes the attention range of different models. The diversity of attention for SAR target recognition comes from the model architecture (panel a vs. b), the initialization weights (panel a vs. c), and SSL (panel d vs. e), i.e., the HiViT architecture, ImageNet weights, and SAR target features.
Figure 5. Average attention distance of different attention heads (the x-axis is the layer index of the attention heads; point colors denote different layers for easier visualization), where attention distance indicates the size of the receptive field.
Scalability: Masked image modeling is known to improve with more data and larger models, but can the proposed method remain scalable on noisy data such as SAR? Figure 6 presents experiments along three axes: dataset size, model parameters, and training epochs. Even though the pre-training set contains only 180,000 images, far smaller than ImageNet-1K, downstream performance in Figures 6(a) and (b) rises markedly as data and parameters grow. This indicates that, with high-quality features as guiding signals, the foundation model can realize its potential for SAR target recognition. However, the limited data volume makes the model prone to overfitting as training epochs scale up, and SAR image noise and low resolution further exacerbate this.
Figure 6. Scalability of SARATR-X with respect to dataset size, model parameters, and training epochs. Although the method benefits from all three, excessive training epochs tend to cause overfitting at this dataset size.
More detailed analyses and figures can be found in the original papers.
Paper Portal
SARATR-X
  • Title: SARATR-X: Towards Building A Foundation Model for SAR Target Recognition
  • Journal: IEEE Transactions on Image Processing
  • Paper: https://arxiv.org/abs/2405.09365
  • Code: https://github.com/waterdisappear/SARATR-X
  • Year: 2025
  • Affiliation: National University of Defense Technology, Shanghai Artificial Intelligence Laboratory
  • Authors: Li Weijie, Yang Wei, Hou Yuenan, Liu Li, Liu Yongxiang, Li Xiang
SAR-JEPA
  • Title: Predicting Gradient is Better: Exploring Self-Supervised Learning for SAR ATR with a Joint-Embedding Predictive Architecture
  • Journal: ISPRS Journal of Photogrammetry and Remote Sensing
  • Paper: https://www.sciencedirect.com/science/article/pii/S0924271624003514
  • Code: https://github.com/waterdisappear/SAR-JEPA
  • Year: 2024
  • Affiliation: National University of Defense Technology, Shanghai Artificial Intelligence Laboratory, Nankai University
  • Authors: Li Weijie, Yang Wei, Liu Tianpeng, Hou Yuenan, Li Yuxuan, Liu Zhen, Liu Yongxiang, Liu Li
© THE END