Written by Mei Tao
Abstract
Image recognition is an important branch of artificial intelligence, and its performance on specific academic benchmark datasets (such as ImageNet) has even surpassed human levels. However, in the transition from academia to real applications, image recognition still faces the “3S” challenges: ultra-fine space (Space: tiny targets are hard to see clearly), ultra-fine classes (Scale: massive numbers of categories are hard to cover exhaustively), and ultra-fine semantics (Semantics: subtle semantics are hard to understand). These challenges have given rise to the next-generation computer vision research direction of ultra-fine visual recognition. This article first introduces ultra-fine image recognition technology aimed at subtle targets, global objects, and rich semantics, along with its broad application scenarios and representative methods, and then discusses the challenges ultra-fine vision will face in the future.
Keywords
Ultra-fine vision; image classification; image parsing; image understanding
1 Introduction
After more than 60 years of continuous breakthroughs and innovation, computer vision technology has become an important driving force for social progress and for a new round of technological and industrial revolution, profoundly affecting economic and social development. In recent years especially, innovation and entrepreneurship around computer vision have flourished, including the growth of industrial ecosystems built on technologies such as facial recognition, smart security, and autonomous driving, which have significantly changed how people produce and live and generated positive, far-reaching social impact.
As the core of computer vision technology, image recognition serves as an important technological foundation for various related applications. Image recognition technology mainly includes several directions such as image parsing, image classification, and image semantic understanding. The significance of image recognition technology is closely related to national strategies and the economy. Multiple national policies, development reports, and market research reports issued in China have repeatedly reflected the strategic significance and economic value of image recognition. The “New Generation Artificial Intelligence Development Plan” issued by the State Council in 2017 pointed out the need to develop cognitive computing methods centered on graphics and images. Furthermore, a global market research report on image recognition in 2021 showed that the market value of image recognition technology is expected to reach hundreds of billions by 2030.
Looking back at the development history of image recognition technology, it can be roughly divided into two main stages.
● The specific image recognition stage in the 1990s mainly focused on object recognition in specific fields (such as handwritten digits and postal codes), mostly using manual features and shallow learning methods;
● In the 2010s, with the breakthrough progress of deep learning technology in general image recognition, people’s attention shifted to common objects and general scenes, improving the generalization ability of image recognition through end-to-end deep feature learning.
However, the application scenarios of artificial intelligence (AI) in the above two stages are relatively limited, and image recognition in practical settings often faces subtle targets, massive numbers of categories, and complex semantic environments. The next generation of image recognition technology must therefore achieve “precision,” “completeness,” and “understanding.” For example, in intelligent manufacturing, quality inspection often involves subtle defects visible only under hundred-fold magnification, and accurately locating and detecting such flaws requires image recognition that can “precisely” analyze small regions of an image; online shopping platforms carry over a billion distinct product categories, including hundreds of kinds of cola with only tiny differences between them, requiring “comprehensive” classification over large-scale category sets; service robots capture images of real scenes through cameras and must understand the complex semantic relationships among people, objects, and scenes to achieve precise human-machine interaction. These challenges have given rise to the new generation of image recognition technology of the 2020s, ultra-fine image recognition, which confronts greatly expanded category counts and the new difficulties posed by ultra-small detail regions, shifting the research emphasis to recognition of global objects, ultra-fine differences, and refined semantic understanding.
From the dimensions of category number, spatial size, and semantic richness, ultra-fine image recognition primarily studies a new generation of image recognition technology aimed at 100,000 to 1 million categories, with about 1% or even smaller proportions of regions of interest, and comprehensive semantics in complex scenes (see Figure 1).

Figure 1 Three Stages of Image Recognition Technology Development
In summary, ultra-fine image recognition technology can be summarized as an image recognition technology aimed at subtle targets (Space), global objects (Scale), and rich semantics (Semantics).
2 Ultra-Fine Visual Applications
The development and application of image recognition technology in China’s traditional industries remain at a limited level and cannot yet meet those industries’ needs for digitalization and intelligence. To achieve high-quality development in related industries, it is necessary to leverage the advantages of image recognition technology in industrial upgrading, product development, and service innovation, promoting the deep integration of technology and the real economy. In this context, ultra-fine image recognition technology, which targets ultra-large-scale object categories, high-precision/high-resolution image features, and inferable image semantic descriptions, is receiving increasing attention. Specifically, as shown in Figure 2, the application value and significance of ultra-fine visual technology in healthcare, manufacturing, retail, aerospace, and agriculture are mainly reflected in the following five aspects.

Figure 2 Several Representative Applications of Ultra-Fine Visual Technology
(1) Applications in Healthcare. The two most important application directions in the “AI + Healthcare” field are AI medical imaging and AI-assisted pharmaceuticals. Among them, AI medical imaging analysis is currently the most accepted scenario, which uses image recognition technology to automatically detect and locate lesions and abnormal areas in medical images. Specifically, the parsing of lesion areas needs to achieve millimeter-level accuracy, while the parsing of capillaries even requires micron-level precision. Ultra-fine visual technology can significantly enhance the automation and precision of medical diagnosis, contributing to the development of inclusive healthcare and improving the overall medical level and services in society.
(2) Applications in Manufacturing. In recent years, the downstream requirements for product inspection and intelligence in the manufacturing industry have been continuously increasing, driving the widespread application of machine vision technology in industrial fields. The popularization and application of ultra-fine image recognition technology in the detection of external defects in industrial products have become a breakthrough point for the deep integration of artificial intelligence technology and the manufacturing industry. Among them, the demand for ultra-fine visual inspection and high-precision visual positioning products has grown significantly. For example, in the defect detection of electronic products and mechanical components, it is required that the defect imaging range achieves micron-level accuracy with an error of less than 10 pixels. Ultra-fine visual technology will play an important role as a foundational technological capability in the future intelligent manufacturing industry, liberating productivity in industrial design, production, and quality inspection.
(3) Applications in Retail Scenarios. The rapid development of artificial intelligence has empowered the retail industry, effectively reconstructing the elements of “people, goods, and scenarios” in retail, improving efficiency in all links, and ultimately enhancing the consumer shopping experience, driving a transformation in the retail industry. Among them, the convenient and personalized user shopping experience is what users care about the most. Users need to quickly find what they want and quickly check out and refund, which requires machines to automatically recognize and retrieve product categories on a scale of hundreds of millions. Ultra-fine image recognition and understanding technology can achieve fast and accurate product perception, search, and recommendation, effectively enhancing user experience and stimulating corresponding purchasing demands.
(4) Applications in High-Tech Fields such as Aerospace. The aerospace industry is a cutting-edge technology field where many tasks involving complex logical reasoning and numerous constraints need to be solved using artificial intelligence systems. For example, locating ground targets in satellite remote sensing images generally requires controlling the positioning error of the object to be recognized within 0.5 to 1 pixel at a 1:N thousand scale; in a space environment, achieving precise target grasping by a robotic arm under high-speed motion of the aircraft while ensuring visual measurement accuracy is also reliant on ultra-fine visual technology.
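Sub-pixel positioning of the kind described above is commonly achieved with techniques such as intensity-weighted centroids. As a rough, self-contained illustration (a standard technique, not a method named in the article), the following NumPy sketch estimates a bright spot's centre to better than one-pixel precision; the synthetic patch and its values are purely hypothetical:

```python
import numpy as np

def subpixel_centroid(patch: np.ndarray) -> tuple[float, float]:
    """Estimate a target's centre to sub-pixel precision via an
    intensity-weighted centroid over a small image patch."""
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
    total = patch.sum()
    return float((ys * patch).sum() / total), float((xs * patch).sum() / total)

# Synthetic 7x7 patch: a bright spot whose true centre lies between pixels.
patch = np.zeros((7, 7))
patch[3, 3] = 1.0
patch[3, 4] = 1.0  # equal weights -> column centre falls at 3.5
cy, cx = subpixel_centroid(patch)
```

Because the estimate is a weighted average over many pixels, its precision is limited by noise rather than by the pixel grid, which is how error budgets below one pixel become attainable.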
(5) Applications in Agricultural Production. By 2020, China’s agricultural production value had reached 16 trillion yuan, accounting for 16.47% of GDP, but crop pests and diseases affected 4.5 billion mu (approximately 740 million acres). Although China has vast agricultural land, there are only around 50,000 agricultural protection experts. Therefore, introducing artificial intelligence technology to assist in pest control and crop analysis is particularly important. To cover the complex and diverse farming environments, seasonal climates, and crop varieties, artificial intelligence image recognition systems are often required to classify and detect millions of plants and tens of thousands of pests, which falls within the realm of ultra-fine image recognition technology.
Similar applications exist in areas such as AI-assisted content generation, which are not listed here. It is evident that ultra-fine visual technology has gradually penetrated all aspects of human life and social production and will play an increasingly important role in the future.
3 Ultra-Fine Visual Challenges
To promote the large-scale application of ultra-fine image recognition technologies in various scenarios and achieve high-performance, robust, and inferable ultra-fine image recognition and understanding, it is urgent to conduct a series of technical research and tackle the difficulties faced by ultra-fine image recognition technology. Specifically, ultra-fine image recognition faces the following three technical challenges.
(1) Difficulty in Detecting Subtle Targets. Image understanding and analysis have raised automation levels in retail, industrial manufacturing, and other scenarios. Compared with general image recognition, fine image parsing, as a high-level, refined technology for extracting and fusing image information, has broader application prospects. In real, complex scenarios, however, fine image parsing models struggle to detect subtle targets. Resolving this application dilemma requires research that breaks through subtle-target detection within the fine parsing process, so that high-performance fine image parsing can be achieved in practice.
(2) Difficulty in Distinguishing Global Categories. Unlike traditional general image recognition, fine-grained image recognition must represent and differentiate subtle visual detail features, and it is often applied in place of human experts to complete fine classification tasks in various vertical fields, giving it high practical and commercial value. However, because general fine-grained recognition technology has limited ability to represent fine-grained discriminative information, supports only a limited number of labels, and is sensitive to image style, it cannot be directly applied to recognition tasks over large-scale global categories. Therefore, learning image features that are both highly discriminative and robust has become a key research focus in this field.
(3) Difficulty in Expressing Fine Semantics. The semantic structure of images in real scenes is complex and difficult to express. Moreover, aligning high-level semantic information between images and descriptive texts, which are two different modalities, is also challenging. Additionally, due to training in closed environments, image semantic description models struggle to describe new objects in open environments. To address these challenges in real scene applications, inferable image semantic description has become a key area of current research.
4 Ultra-Fine Visual Technology
Corresponding to the three major technical challenges of “difficult target detection,” “difficult category distinction,” and “difficult semantic expression” in ultra-fine image recognition, three key technologies can be summarized as ultra-fine image parsing (seeing details), ultra-fine image classification (seeing comprehensively), and ultra-fine image semantic description (understanding).
4.1 Ultra-Fine Image Parsing
Annotating data for fine image parsing is expensive, and inference on the high-resolution images found in real scenes is slow, which limits the deployment of parsing models in practice. To address these issues, researchers have mainly pursued three directions:
● To mitigate the shortage of training data caused by high annotation costs, self-supervised learning methods have been proposed to strengthen feature representations and thereby improve parsing accuracy;
● To address slow inference caused by the reliance on high-resolution inputs, constrained, efficient network architectures have been designed that reduce inference time while preserving parsing accuracy;
● To counter the accuracy drop that fine parsing models suffer when the data distribution changes, fine-grained category alignment has been proposed to improve generalization and sustain high-precision parsing across data distributions.
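One common engineering answer to the high-resolution inference problem (an illustrative technique, not one the article attributes to specific work) is to run the model over overlapping tiles and average the overlaps, so the full image never has to fit in memory at once. A minimal NumPy sketch, with a toy thresholding "model" standing in for a real segmentation network:

```python
import numpy as np

def tiled_parse(image, model, tile=256, overlap=32):
    """Run a per-pixel parsing model over a high-resolution image in
    overlapping tiles, averaging predictions where tiles overlap."""
    h, w = image.shape[:2]
    acc = np.zeros((h, w))   # accumulated per-pixel scores
    cnt = np.zeros((h, w))   # number of tiles covering each pixel
    step = tile - overlap
    for y in range(0, h, step):
        for x in range(0, w, step):
            y1, x1 = min(y + tile, h), min(x + tile, w)
            acc[y:y1, x:x1] += model(image[y:y1, x:x1])
            cnt[y:y1, x:x1] += 1
    return acc / cnt

# Toy "model": thresholds intensity; a stand-in for a segmentation network.
model = lambda t: (t > 0.5).astype(float)
img = np.random.rand(600, 800)
mask = tiled_parse(img, model)
```

Averaging in the overlap regions suppresses the seam artifacts that hard tile boundaries would otherwise introduce.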
Additionally, considering that collecting a large amount of densely annotated training data is usually a labor-intensive task, the latest advancements in computer graphics provide new alternative solutions to expensive manual annotation: namely, obtaining photo-realistic images with pixel-level annotations through physics-based rendering and computer graphics at low cost. However, when models trained using synthetic data (source domain) are applied to real scenes (target domain), performance decline is often observed due to different distributions of data from different domains. This phenomenon is known as the domain transfer problem, which presents new challenges for cross-domain tasks. Therefore, ultra-fine image parsing combined with domain adaptation methods is also receiving increasing attention.
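One simple illustration of statistics-based domain alignment between synthetic and real features is correlation alignment (CORAL), which matches the source features' mean and covariance to the target's; the article does not name this specific method, and the "synthetic" and "real" feature matrices below are random stand-ins:

```python
import numpy as np

def coral(source: np.ndarray, target: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Align source-domain features to target-domain statistics:
    whiten with the source covariance, re-colour with the target's."""
    cs = np.cov(source, rowvar=False) + eps * np.eye(source.shape[1])
    ct = np.cov(target, rowvar=False) + eps * np.eye(target.shape[1])

    def sqrtm(m, inv=False):
        # Matrix square root via eigendecomposition (m is symmetric PSD).
        vals, vecs = np.linalg.eigh(m)
        vals = np.maximum(vals, eps)
        return vecs @ np.diag(vals ** (-0.5 if inv else 0.5)) @ vecs.T

    src = source - source.mean(axis=0)
    return src @ sqrtm(cs, inv=True) @ sqrtm(ct) + target.mean(axis=0)

rng = np.random.default_rng(0)
synthetic = rng.normal(0.0, 1.0, (500, 8))   # stand-in for rendered-image features
real = rng.normal(2.0, 3.0, (500, 8))        # stand-in for real-scene features
aligned = coral(synthetic, real)
```

After alignment, a classifier trained on the transformed synthetic features sees inputs whose first- and second-order statistics match the real domain, which is one cheap way to soften the domain transfer problem described above.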
4.2 Ultra-Fine Image Classification
The difficulty of ultra-fine image classification tasks lies in the need for models to recognize “comprehensively,” reflected in the vast number of object types that can be recognized and classified, corresponding to two key research technologies: detecting fine-grained visual differences between different types of objects and learning highly similar fine image features.
Research on detecting fine-grained visual differences primarily focuses on fine-grained image classification tasks. Currently, the most widely used technical frameworks for fine-grained image classification tasks can be divided into three categories: ① Two-step calculation methods that rely on object detection models to extract subject contours and background information for differentiation, and then classify images based on object subject image features; ② Systems based on attention models, relying on a large number of additional parameters to learn attention scores, thus distinguishing between target object subjects and backgrounds in images; ③ Discriminative feature/object structural feature learning methods based on self-supervised tasks, which do not require object bounding box annotations or a large number of additional computations, achieving better efficiency and effectiveness compared to the other two methods.
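The attention-based family (②) can be illustrated with a minimal, framework-free sketch: regions are scored against a query vector, and a softmax over the scores decides how much each region contributes to the pooled image feature. Both the query and the region features below are hypothetical stand-ins for learned quantities:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(region_feats: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Pool N region features (N x D) into one image feature by scoring
    each region against a query vector and softmax-weighting them, so
    discriminative parts (e.g. a bird's beak) dominate the representation."""
    scores = region_feats @ query   # (N,) relevance of each region
    weights = softmax(scores)       # attention distribution over regions
    return weights @ region_feats   # (D,) weighted sum of regions

rng = np.random.default_rng(1)
regions = rng.normal(size=(6, 4))  # 6 candidate part regions, 4-d features
query = rng.normal(size=4)         # stand-in for a learned attention query
pooled = attention_pool(regions, query)
```

In a real attention model the query (or a scoring sub-network) is learned end-to-end, which is the source of the "large number of additional parameters" the text mentions.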
Learning highly similar fine image features is crucial for similar image retrieval, open-domain image classification, and more. However, in practical application scenarios, when the data distribution during testing is significantly different from the training dataset or when test samples belong to previously unseen categories (open domain), or when there are only a few training samples per class, severe overfitting issues often arise, affecting the model’s generalization. Therefore, integrating additional constraints such as probability density into feature metric learning or conducting visual feature transfer learning around multi-source data and open domains, maximizing the utilization of the embedding space’s expressive capabilities, becomes particularly important for enhancing performance in ultra-fine image classification and retrieval tasks.
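The retrieval side of this can be sketched concisely: once images are embedded, similar-image retrieval reduces to nearest-neighbour search under a metric such as cosine similarity. The embeddings below are random stand-ins for the output of a learned feature extractor:

```python
import numpy as np

def retrieve(query: np.ndarray, gallery: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k gallery embeddings most similar to the
    query under cosine similarity (a common metric for open-domain
    retrieval, since new categories need no retraining)."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]

rng = np.random.default_rng(2)
gallery = rng.normal(size=(100, 16))               # 100 embedded gallery images
query = gallery[42] + 0.01 * rng.normal(size=16)   # near-duplicate of item 42
top = retrieve(query, gallery)
```

Because only the embedding space is compared, unseen categories can be retrieved without retraining; this is precisely why the quality of the metric-learned embedding, and the constraints placed on it, dominate open-domain performance.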
4.3 Ultra-Fine Image Understanding
Images in real scenes have complex semantic structures that are difficult to express; aligning high-level semantics between images and descriptive texts, two different modalities, is challenging; and because they are trained in closed environments, image description generation models struggle to generalize to new objects in open environments. Accordingly, research has emerged on semantic-structure-guided image understanding, cross-modal interactive image semantic description, and knowledge-fusion-based image semantic representation for open environments.
First, to address the difficulty of expressing fine semantics in real scenes, researchers have proposed generating image semantic descriptions guided by the rich semantic structures (graph structures, tree structures) among objects, improving the richness and accuracy of image semantic understanding. By hierarchically strengthening the model’s visual interpretation of image structure and using this semantic-structure topology to integrate learning over image-level, region-level, and instance-level features, such models ultimately transform the multi-level semantic tree structures among objects in an image into the corresponding image semantics and inference information.
Second, to address the challenge of aligning semantics across the image and text modalities, attention mechanisms are generally used to achieve feature interaction between modalities within a multi-modal feature-association framework: the attention mechanism integrates text features with encoded image-region features and assigns different attention weights when matching corresponding features across modalities, thereby enhancing the semantic consistency between images and their descriptive texts.
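The cross-modal weighting just described can be sketched in a few lines: each text token attends over the image-region features, and the softmax weights decide which regions ground that token. The token and region features below are random stand-ins for encoder outputs that would share one embedding space:

```python
import numpy as np

def cross_modal_attention(text_feats, image_feats):
    """For each text token, compute dot-product affinities to all image
    regions, softmax them into attention weights, and return the
    weighted sum of region features, grounding each token in the image."""
    scores = text_feats @ image_feats.T              # (T, R) token-region affinity
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over regions
    return weights @ image_feats                     # (T, D) attended features

rng = np.random.default_rng(3)
tokens = rng.normal(size=(5, 8))    # 5 caption tokens, 8-d features
regions = rng.normal(size=(7, 8))   # 7 image regions, same feature space
attended = cross_modal_attention(tokens, regions)
```

Training pushes matching token-region pairs to high affinity, which is what makes the resulting attention weights a usable alignment between the two modalities.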
Finally, for the challenge of generalizing image descriptions in open scenes, knowledge fusion-based open environment image semantic description models can incorporate information from image-text knowledge bases into image description generation models, achieving understanding and description of new objects in open scenes.
4.4 Ultra-Fine Visual Datasets
To promote the continuous development of technology in the ultra-fine visual field and attract more researchers to this topic, several relevant datasets and competitions have been proposed in recent years. For example, the Visual Genome dataset released by Stanford University in 2016 combines structured visual information with language information, covering 2.3 million common semantic relationships and 5.4 million image description annotations. FAIR open-sourced the LVIS (large vocabulary instance segmentation) dataset in 2019, a large-scale fine-grained-vocabulary annotated dataset covering over 1,000 object classes with about 2 million high-quality instance segmentation annotations. The AI-TOD (tiny object detection in aerial images) dataset, open-sourced in 2020, contains 700,000 instance objects with an average target size of only 12.8 pixels. The JD AI Research Institute open-sourced the largest annotated product image dataset in the industry, Product-10K, in 2019, containing over 10,000 common product categories; in 2022, it released the YOVO-10M dataset, which includes 10 million video segments with corresponding text description labels. Built on these ultra-fine visual datasets, various competitions have been held at academic conferences and forums, attracting many experts in machine learning and computer vision and significantly promoting technological progress in this field.
5 Future Prospects for Ultra-Fine Visual Technology
The service direction of ultra-fine image recognition technology currently covers multiple fields such as smartphones, autonomous driving, healthcare, and security. However, the applications in these fields mainly rely on one or two core technologies among ultra-fine image parsing, classification, and understanding. In the future, service robots and general artificial intelligence will rely on the simultaneous support of these three ultra-fine visual technologies, bringing new significant challenges and opportunities to this research direction. Currently, the functionalities of robots released on the market are still relatively singular. To truly achieve practical use of robots in open scenes, they need to accurately understand the semantic relationships between people, objects, and scenes, precisely locate target objects, and possess global object recognition capabilities. Meanwhile, the development of general artificial intelligence is still in its early stages. Although large models such as BERT, DALL-E, and ChatGPT have been successively released, cross-modal ultra-fine content understanding and generation still remain in their infancy.
We believe that with the continuous development of ultra-fine visual technology, artificial intelligence will gradually improve its visual cognitive capabilities, effectively enhancing human productivity and improving social welfare and people’s living standards.
(References omitted)