Computer vision is a discipline that studies how to use computers to simulate the functions of human or biological visual systems. Its purpose is to enable computers to perceive and understand the surrounding world based on images, specifically to recognize, measure, and understand information such as scenes, objects, and behaviors in image or video data. Computer vision is one of the important research fields in artificial intelligence.
The premise and foundation of computer vision is imaging technology. As early as the time of the State of Lu in ancient China, Mozi discovered pinhole imaging. It wasn’t until the 19th century that Joseph Nicéphore Niépce and Louis-Jacques-Mandé Daguerre invented photography. In the same century, Wheatstone invented the mirror stereoscope, which confirmed the phenomenon of binocular parallax: two 2D images can create a 3D stereoscopic impression. In the 1940s, Gibson proposed the concept of optical flow, hypothesizing that 3D spatial motion parameters and structural parameters could be recovered from the optical flow field of a 2D plane. Starting in the 1960s, Ulf Grenander integrated algebra, set theory, and probability theory from a mathematical perspective, proposing the idea of Analysis-by-Synthesis, which laid an important pioneering theoretical foundation for computer vision. During the same period, in visual pattern recognition research, King-Sun Fu proposed syntactic structural representations and computations, supporting bottom-up and top-down visual computation processes. In the 1970s, David Marr attempted to simulate the human visual process with computers, enabling them to achieve human-like stereoscopic vision. Marr’s visual computation theory is grounded in computer science and systematically summarizes important achievements in psychology, neuroscience, and related fields of the time. Its key feature is that it makes the study of visual information processing rigorous, elevating visual research from a descriptive level to a mathematically grounded and computable level, marking the establishment of computer vision as an independent discipline. Since Marr’s visual theory was proposed, computer vision has developed rapidly. Although Marr’s theoretical framework has shortcomings, it still occupies a central position in computer vision today. According to Marr’s computational theory framework, computer vision is divided into low-level image feature extraction and processing, mid-level 3D computer vision, and high-level object recognition and scene understanding. Owing to the systematic and dominant nature of Marr’s visual theory, most significant research progress over the past few decades has been concentrated within this theoretical framework.
In the 2012 ImageNet Large Scale Visual Recognition Challenge, a convolutional neural network (CNN) model trained with deep learning methods brought a significant breakthrough. Subsequently, deep learning-based face recognition and other applications have been widely adopted across industries. With the rapid development of computational resources and the increasing demand for practical artificial intelligence applications, aspects of Marr’s visual theory that were once controversial have received clearer interpretations. For example, scholars who criticized Marr’s theory proposed “Active Vision” and “Purposive and Qualitative Vision,” arguing that the visual process necessarily involves interaction between the observer and the environment and that vision must be purposeful; moreover, in many applications an explicit 3D reconstruction step is not necessary. However, as deep learning and artificial intelligence have driven the development of computer vision, purely 2D visual tasks can no longer meet practical application needs: various depth cameras are emerging, 2D visual tasks are expanding into 3D, and the growing body of work on 3D point cloud analysis and processing is gradually validating the soundness of Marr’s visual theory. Specialized artificial intelligence has now developed substantially, and research will gradually move towards general artificial intelligence. General artificial intelligence requires computational capabilities for “time,” “space,” and “reasoning.” The Marr visual theory framework possesses the first two, and when combined with “reasoning,” Marr’s visual theory will become a cornerstone of general computer vision intelligence. Understanding the significant research progress made under this framework in the past will therefore also be of great significance for guiding future research.
This report analyzes and summarizes the research progress in the field of computer vision in the past and introduces 13 significant research advances that have had a major impact or driving force on the development of the discipline and application technology. These significant research advances are reflected in Computational Imaging, Early Vision, Image Enhancement and Restoration, Image Feature Extraction and Matching, Multi-View Geometry Theory, Camera Calibration and Localization, 3D Reconstruction, Object Detection and Recognition, Image Segmentation, Image Scene Understanding, Image Retrieval, Object Tracking, and Behavior and Event Analysis.
1. Computational Imaging
Light rays propagating in free space carry rich information about the three-dimensional world and are among the most important media and carriers for human perception of the external world. Light is a high-dimensional signal with attributes such as wavelength (𝜆) and propagation time (𝑡); during propagation in free space it also has position and direction attributes, namely three-dimensional coordinates (𝑥,𝑦,𝑧) and angles (𝜃,𝜙). Computational Imaging combines computation, optical systems, and intelligent illumination, innovatively integrating the acquisition capabilities of the imaging system with the processing capabilities of the computer. It moves visual information processing and computation forward into the imaging process itself, proposing new imaging mechanisms, designing new imaging paths, and developing new image reconstruction methods. This enables qualitative breakthroughs in the dimensions, scales, and resolutions of visual information sampling, making it possible to sample light signals at high dimensions and high resolutions.
In 1936, Andrey Gershun began studying the distribution of light rays in space and first proposed the concept of the “Light Field” to describe the radiative characteristics of light in three-dimensional space. In 1991, Adelson and colleagues further expanded and refined light field theory by proposing the plenoptic function, which characterizes the spatial distribution of light rays with a 7D function, i.e., 𝑃(𝑥,𝑦,𝑧,𝜃,𝜙,𝜆,𝑡). In 1992, based on plenoptic theory, Adelson and colleagues developed a prototype light field camera. Omitting wavelength and time (𝜆 and 𝑡) and assuming that radiance does not attenuate along a ray propagating in free space, Gortler and colleagues proposed the concept of the Lumigraph, further reducing the 7D plenoptic function to 4D and representing a light ray with only the four dimensions (𝑥,𝑦) and (𝜃,𝜙), which capture its spatial and angular information. In 1996, Marc Levoy and Pat Hanrahan introduced the light field into computer graphics, proposing light field rendering and the two-plane parameterization of the four-dimensional light field. In May 2005, researchers from MIT, Stanford University, and Microsoft Research held the first Computational Photography workshop at MIT. Since 2009, the IEEE International Conference on Computational Photography has been held annually. Ren Ng, a Stanford University PhD, detailed the hardware and software problems of consumer-grade light field cameras and their solutions in his dissertation, founded Lytro in 2006, and released a handheld light field camera based on the Plenoptic 1.0 design. Subsequently, companies such as Raytrix and Pelican also released light field cameras, proposing various light field imaging structures. Alongside the development of light field theory, various light field imaging devices have been developed and manufactured in recent decades, especially various types of industrial-grade and consumer-grade light field cameras. Representative designs of light field imaging devices include light field gantries, camera arrays, microlens-based light field cameras, and programmable-aperture cameras. In recent years, light field imaging technology has been widely used in immersive devices such as VR/AR. Meanwhile, light field imaging has also been applied in microscopy: researchers from MIT and the University of Vienna used a light field microscope to produce, for the first time, a 3D image of an entire zebrafish larva brain on a millisecond time scale, with the results published in Nature Methods.
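To make the 4D two-plane representation concrete, the following is a minimal sketch (not taken from any of the systems above) of synthetic refocusing by shifting and averaging the sub-aperture views of a 4D light field; the array layout, the integer-shift approximation, and the parameter alpha are illustrative assumptions.

```python
import numpy as np

def refocus(light_field, alpha):
    """Synthetic refocusing from a 4D light field by shift-and-add.

    light_field : (U, V, S, T) array, i.e. a U x V grid of sub-aperture
                  images of size S x T (two-plane parameterization).
    alpha       : relative focal depth; alpha = 1 keeps the original focus,
                  other values shift each sub-aperture view before averaging.
    Returns the refocused S x T image. The integer shifts and sign convention
    are a simplification for illustration only.
    """
    U, V, S, T = light_field.shape
    out = np.zeros((S, T), dtype=np.float64)
    cu, cv = (U - 1) / 2.0, (V - 1) / 2.0
    for u in range(U):
        for v in range(V):
            # disparity of this view relative to the central one
            du = int(round((u - cu) * (1.0 - 1.0 / alpha)))
            dv = int(round((v - cv) * (1.0 - 1.0 / alpha)))
            out += np.roll(light_field[u, v], shift=(du, dv), axis=(0, 1))
    return out / (U * V)
```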
Compared to traditional optical imaging, light field imaging technology represents a significant technological innovation, bringing new opportunities for the development and innovation of disciplines such as pattern recognition and computer vision with its outstanding characteristics of multi-view, large depth of field, and multi-focus imaging. It has already been applied in visual tasks such as depth estimation, 3D reconstruction, automatic refocusing, synthetic aperture imaging, segmentation, and recognition. In addition to classic visual tasks, light field imaging has also been applied to visual odometry, scene-flow estimation, camera rotation estimation, video stabilization, and panoramic stitching.
In addition to light field cameras, imaging technologies that consider the spatial position and propagation direction of light include typical representatives such as coded imaging, scattering imaging, and holographic imaging. Methods that collect light fields based on the time and phase dimensions of light propagation include single-photon imaging and time-of-flight (ToF) imaging, while research based on wavelength and spectral levels has led to various imaging technologies, including visible light, near-infrared, and hyperspectral imaging. Additionally, there are imaging techniques utilizing the wave properties of light, such as polarization imaging.
2. Early Vision
The processing of visual information in humans includes early vision and high-level vision. Early vision primarily obtains information about the position, shape, appearance, and motion of objects by analyzing changes in the input visual signals, with little involvement in the semantic understanding of scene information. Similar to the human visual information processing process, computer vision is also divided into early vision and high-level vision. Early vision mainly involves the preprocessing and encoding of visual information, specifically including image filtering, edge extraction, texture analysis, stereo vision, optical flow, image enhancement, and restoration. Whether semantic understanding is involved, such as recognizing objects, analyzing behaviors, and interpreting events, is the main criterion for distinguishing early vision from high-level vision.
Image filtering is one of the main means of image preprocessing, aimed at highlighting useful information in the image and suppressing unwanted information. Depending on the domain in which the filtering operates, image filtering can be divided into spatial-domain filtering and frequency-domain filtering; based on the computational characteristics of the filtering operation, it can be divided into linear filtering and nonlinear filtering; and based on the purpose of filtering, it can be divided into smoothing filtering, morphological filtering, bilateral filtering, guided filtering, etc. Gaussian filtering is the most commonly used linear filter, while the Gabor filter conforms to the information processing characteristics of the primary visual cortex and is frequently used in image feature extraction. Bilateral filtering and guided filtering have good edge-preserving properties while still smoothing non-edge regions effectively. Compared to bilateral filtering, guided filtering is more efficient and can preserve more types of image structures. The concept of local image features evolved from image filtering, among which LBP and Haar are two local image features with far-reaching influence. The former encodes features based on the gray-level relationships between adjacent pixels, exhibits good illumination robustness and discriminative ability, and plays an important role in face recognition and texture analysis. The latter defines a series of rectangular regions and conducts discriminative analysis based on their average pixel differences; combined with the AdaBoost feature selection algorithm, it marked a milestone in face detection and has been widely applied to other object detection tasks. Image enhancement and restoration techniques also developed from image filtering, with early methods focusing on filter design, such as Wiener filtering, constrained least squares filtering, and the Lucy-Richardson deconvolution algorithm. After 2000, sparse coding methods represented by regularization and dictionary learning gradually became mainstream due to their outstanding performance, such as the BM3D algorithm, the LSC algorithm, and the FOE model for image denoising, and TV-regularization and L1-regularization algorithms for image deblurring. More recently, deep learning-based methods for image enhancement and restoration have also emerged. In early studies of edge extraction, methods primarily designed filters based on the physical properties of edges, with the Canny edge operator being a representative work; after 2000, these experience-based filter designs were gradually replaced by learning-based methods such as Pb and gPb; in recent years, deep learning has further advanced edge detection, with early works including DeepContour and DeepEdge, as well as end-to-end trainable edge detection algorithms such as HED, of which RCF is currently one of the best-performing.
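As a concrete illustration of the filtering and edge extraction operations described above, here is a minimal OpenCV-based sketch; the file name and all parameter values are assumptions chosen for illustration.

```python
import cv2

# Classic early-vision operations: smoothing filters and Canny edge extraction.
gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)

gauss = cv2.GaussianBlur(gray, (5, 5), sigmaX=1.5)            # linear smoothing
bilateral = cv2.bilateralFilter(gray, d=9, sigmaColor=75,     # edge-preserving smoothing
                                sigmaSpace=75)
edges = cv2.Canny(gauss, threshold1=50, threshold2=150)       # gradient-based edge map
```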
For early vision problems based on matching correspondences, such as stereo vision and optical flow, methods that model global constraints with Markov random fields were a representative class of approaches before the emergence of deep learning; these methods typically use graph cuts, belief propagation, dynamic programming, and related techniques to solve the resulting minimization problems. For stereo matching, global optimization methods are usually slow, while semi-global and feature-based local methods are more practical; among them, the semi-global block matching algorithm (SGBM) strikes a good balance between speed and accuracy. The basic assumption for solving optical flow is brightness (color) constancy under motion, which can be addressed through variational methods, region-based methods, feature-based methods, frequency-domain methods, and the recently emerging CNN-based methods. Before deep learning, variational methods dominated optical flow research, and most high-performing optical flow algorithms belonged to this category: the data term of the optimization objective is built on the brightness constancy assumption, complemented by smoothness constraints, and the flow field is obtained by solving the resulting optimization problem. Recently emerging CNN-based methods compute the optical flow of the input images through a single forward pass of the network and are therefore far more efficient, with computation speeds tens of times faster than traditional methods, showing great potential; representative works include the FlowNet series, SpyNet, TVNet, and PWC-Net.
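The following hedged sketch shows how the semi-global matching and dense optical flow ideas above are typically exercised through OpenCV; file names and parameter settings are illustrative, and the built-in Farneback method stands in for the broader family of dense flow algorithms.

```python
import cv2

# Semi-global block matching (SGBM) disparity on a rectified stereo pair.
# numDisparities must be a multiple of 16 and blockSize an odd number.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = sgbm.compute(left, right).astype("float32") / 16.0  # fixed-point to pixels

# Dense optical flow between two consecutive frames: a per-pixel (dx, dy) field.
# Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
frame0 = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
frame1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
flow = cv2.calcOpticalFlowFarneback(frame0, frame1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
```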
Research in early vision has had a wide-ranging impact, such as the convolution derived from image filtering being a core component of convolutional neural networks, optical flow computation being the most fundamental processing method in video behavior analysis, and RGBD cameras developed from stereo vision technology serving as an important supplement to traditional image sensors in many applications, while image super-resolution and video deblurring techniques have been widely used in various digital imaging products.
3. Image Enhancement and Restoration
Image enhancement and restoration are classic problems in the field of image processing. During the imaging, storage, and transmission of images, various external factors can lead to different types of quality degradation. Image enhancement and restoration primarily study how to improve the visual quality of images or restore their original appearance based on image priors and degradation models. The two are slightly different: enhancement usually aims to improve the visual quality of images and often serves as a preprocessing step for subsequent image processing and analysis, whereas restoration aims to recover the original appearance of images, so the restoration process often needs to consider the degradation mechanism and construct a degradation model. Classic problems in image enhancement and restoration include image denoising, deblurring, dehazing, deraining, shadow removal, image super-resolution, and geometric distortion correction. It should be noted that, because restoration requires inverting the degradation model, image enhancement and restoration typically involve solving a class of inverse problems and are therefore typically ill-posed. There is no unified processing method for image enhancement and restoration; appropriate solution methods usually have to be constructed for the specific problem, the degradation model, and the available image priors.
Early methods for image enhancement and restoration primarily included various filtering methods. Since noise and image content usually have different spectra, they can be processed separately in different spectral bands to ensure that noise is removed while minimizing damage to image content. These methods primarily target image denoising and deblurring issues, with representative methods including median filtering, homomorphic filtering, Wiener filtering, constrained least squares filtering, weighted least squares method, and the Lucy-Richardson deconvolution algorithm, among others. Subsequently, sparse coding methods represented by regularization and dictionary learning gradually became mainstream restoration methods due to their outstanding performance. From a Bayesian perspective, regularization terms correspond to the prior distribution of images, thus the quality of image restoration is closely related to the selected image priors. Compared to filtering methods, sparse coding provides a more accurate and effective means to characterize image priors, often achieving excellent performance. During this period, numerous research works and high-performing algorithms emerged targeting image denoising and deblurring issues, such as the Fields of Experts (FOE) model, Block-Matching 3D (BM3D) algorithm for natural image denoising, and various regularization algorithms such as TV norm, L1 norm, and Lp norm for image deblurring. In recent years, with the rise of deep learning, data-driven, end-to-end learning-based image restoration methods have gradually gained favor among researchers. Benefiting from the powerful model representation capabilities of neural networks, researchers have attempted to implicitly characterize image priors and degradation models using deep neural networks. By integrating them into a generative adversarial network framework, the image restoration problem can be transformed into an image generation problem. The advantage of this method lies in its ability to address various types of image enhancement and restoration problems within a unified computational framework. In the future, image restoration will continue to be an area that warrants further research. Effectively embedding relevant knowledge and constructing efficient and convenient computational models will remain the focus of research in image enhancement and restoration.
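To illustrate the regularization view described above, here is a toy sketch that minimizes a data term plus a smoothed total-variation prior by gradient descent; it is a didactic approximation (crude border handling, fixed step size), not any of the published algorithms mentioned.

```python
import numpy as np

def tv_denoise(y, lam=0.1, step=0.2, n_iter=200, eps=1e-3):
    """Toy denoiser minimizing 0.5*||x - y||^2 + lam * TV(x).

    The data term keeps x close to the noisy observation y, while the
    (smoothed) total-variation prior favors piecewise-smooth images.
    """
    x = y.astype(np.float64).copy()
    for _ in range(n_iter):
        # forward differences of x (horizontal and vertical)
        dx = np.diff(x, axis=1, append=x[:, -1:])
        dy = np.diff(x, axis=0, append=x[-1:, :])
        mag = np.sqrt(dx ** 2 + dy ** 2 + eps)       # smoothed gradient magnitude
        px, py = dx / mag, dy / mag
        # negative divergence of (px, py) approximates the gradient of TV(x)
        div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
        grad = (x - y) - lam * div                   # gradient of the full objective
        x -= step * grad
    return x
```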
Since image enhancement and restoration research involves solving ill-posed problems and representing and learning image priors in high-dimensional spaces, this research has objectively promoted the advancement of image sparse coding, image deep coding, image prior representation, and regularization learning. Additionally, as a classic research problem in the field of image processing, image enhancement and restoration has become a litmus test for new image representation theories and algorithm research. As an effective means to enhance the visual quality of images, image enhancement and restoration have been widely applied in many fields, including low-level vision, computational imaging, text recognition, iris recognition, fingerprint recognition, face recognition, object tracking, and video surveillance.
4. Image Feature Extraction and Matching
The purpose of image feature extraction and matching is to establish correspondences between identical or similar primitives in different images, with the primitives also referred to as image features. Common image features include points, lines/curves, and regions, so, based on the features used, image feature matching can be divided into point matching, line/curve matching, and region matching, while the process of automatically extracting these features from images is called image feature extraction. Relatively speaking, point matching is the most widely applied and receives the most attention from researchers. Point matching can be further divided into dense point matching and sparse point matching. The task of dense point matching is to establish pixel-wise correspondences between images, widely used in computer vision tasks such as stereo vision, optical flow, and motion field estimation. Sparse point matching, commonly referred to as feature point matching, includes three parts, feature point detection, feature point description, and robust estimation of the matching model, and aims to establish sparse point correspondences between images.
For dense point matching, early works primarily combined local matching with global optimization, with representative works being graph cut-based and belief propagation-based methods; current research focuses on solving this problem with deep learning. Compared to dense point matching, feature point matching is more widely applied and is the mainstream feature matching approach. Feature point detection algorithms detect corners and blobs in images so that the same points can be repeatedly detected in different images, which is the basic prerequisite for feature point matching. The early Harris corner detector is still in use today and has spawned many improved algorithms, while the FAST corner detector is the preferred algorithm for rapid feature point detection. Among blob detection algorithms, representative works include the SIFT feature point detector and its integral-image-based improvement, the SURF algorithm. The purpose of feature point description is to build a vector representation from the image information around a feature point so that correspondences of the same feature point can be established across images; methods are divided into those based on expert knowledge and those based on learning. The SIFT descriptor, designed around block gradient orientation histograms, is an outstanding representative of the expert-knowledge-based feature description methods. Notable descriptors that improve on it include SURF, which for a long time served as a faster substitute for SIFT in speed-critical scenarios and has also been widely applied. With the rise of deep learning, the field of feature point description essentially completed the transition from expert-designed methods to deep learning-based methods around 2017, using the powerful representation capabilities of convolutional neural networks to automatically learn robust and discriminative descriptors from pairs of matching/non-matching image patches. Currently, the network structure most commonly used for feature description is L2Net. Additionally, unifying the two intrinsically related tasks of feature point detection and description within a single deep network is a popular approach today, with representative works including LIFT, RF-Net, D2Net, and R2D2. Research on robust model estimation computes the true transformation model from a set of point matches that contains erroneous matches, with RANSAC being the most widely used method. Moreover, how to remove mismatches from feature point matching results has always attracted attention, with the primary methods including graph matching-based and motion consistency-based methods, such as GMS and CODE. In recent years, methods using deep learning to filter erroneous matches have also emerged; the overall idea is to treat a pair of matching feature points as a four-dimensional vector and use deep networks to explore the contextual relationships among the points in the set, thereby inferring which matches are erroneous.
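A typical feature point matching pipeline (detection, description, ratio-test matching, and RANSAC-based robust model estimation) can be sketched with OpenCV as follows; image file names and thresholds are illustrative.

```python
import cv2
import numpy as np

# Detection + description (SIFT), nearest-neighbor matching with Lowe's ratio
# test, and robust homography estimation with RANSAC.
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kps1, desc1 = sift.detectAndCompute(img1, None)
kps2, desc2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
good = []
for pair in matcher.knnMatch(desc1, desc2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])                          # ratio test keeps distinctive matches

pts1 = np.float32([kps1[m.queryIdx].pt for m in good])
pts2 = np.float32([kps2[m.trainIdx].pt for m in good])
# RANSAC rejects the remaining mismatches while estimating the transformation model
H, inlier_mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
```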
Image feature extraction and matching have had a wide-ranging impact. For example, the HOG feature, inspired by SIFT, had a significant influence on object detection and was the preferred feature in that field before the advent of deep learning. Local feature point extraction and description directly spawned research on image representation based on the bag-of-words model, which was the main method for image classification and recognition in the pre-deep learning era. Panoramic image stitching based on feature point matching has become a standard feature of consumer devices and is widely used in daily life. Furthermore, feature point matching is widely applied in 3D reconstruction, visual localization, camera calibration, and other 3D computer vision tasks, playing an important role in emerging applications such as augmented reality, vision-based localization, urban digitization, and autonomous driving.
5. Multi-View Geometry Theory
Multi-view geometry is the basic mathematical theory used in geometric computer vision research. It mainly studies the geometric constraints and computational methods relating corresponding points in 2D images taken from different viewpoints under projective transformations, as well as those relating image points, the 3D scene, and the camera model, thereby recovering and understanding the 3D geometric properties of scenes from 2D images. Multi-view geometry is built on rigorous algebraic and geometric theory and has produced a series of analytical computational methods and nonlinear optimization algorithms, serving as the basic mathematical theory for 3D reconstruction, visual SLAM, visual localization, and other 3D geometric vision problems. Representative figures in the study of multi-view geometry include R. Hartley of the Australian National University, A. Zisserman of the University of Oxford, and O. Faugeras of the French National Institute for Research in Computer Science and Automation (INRIA). The book “Multiple View Geometry in Computer Vision,” co-authored by R. Hartley and A. Zisserman in 2000, systematically summarizes research in this area. It can be said that the theory of multi-view geometry was essentially complete by around 2000.
Multi-view geometry primarily studies the epipolar geometry constraints between corresponding points in two images, the trifocal tensor constraints among corresponding points in three images, and the homography constraints between points on a spatial plane and image points, or among multiple image points. The core algorithms of multi-view geometry include triangulation, the eight-point method for estimating the fundamental matrix, the five-point method for estimating the essential matrix, multi-view factorization methods, and camera self-calibration based on the Kruppa equations, among others. The most central theory in multi-view geometry is the hierarchical (stratified) reconstruction theory established from around 1990 to 2000. Its basic idea is to decompose the reconstruction from images to 3D Euclidean space into steps: first obtain a projective reconstruction from the images (each projective camera matrix has 11 unknowns), then upgrade the projective reconstruction to an affine reconstruction (3 unknowns), and finally upgrade the affine reconstruction to a Euclidean (metric) reconstruction (5 unknowns). In hierarchical reconstruction, the projective reconstruction from corresponding image points determines the projection matrices of each image under a projective frame; the upgrade from projective to affine reconstruction amounts to determining the coordinate vector of the plane at infinity under the projective reconstruction (a specific projective coordinate system); and the upgrade from affine to metric reconstruction essentially amounts to determining the camera’s intrinsic parameter matrix, i.e., camera self-calibration. Since any geometric vision problem can ultimately be cast as a multi-parameter nonlinear optimization problem, and the difficulty of nonlinear optimization lies in finding a reasonable initial value, the more parameters to be optimized, the more complex the solution space and the harder it is to find suitable initial values. Therefore, if an optimization problem can group its parameters for stepwise optimization, its difficulty can generally be reduced significantly. Because each step of hierarchical reconstruction involves few unknowns and has clear geometric meaning, the theory has effectively improved the robustness of reconstruction algorithms.
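For reference, the core two-view constraints mentioned above can be written compactly as follows (standard notation; sign and normalization conventions vary between texts):

```latex
% x_1, x_2 are homogeneous corresponding image points, K_1, K_2 the intrinsic
% matrices, (R, t) the relative motion, and (n, d) a scene plane.
\begin{align}
  x_2^{\top} F\, x_1 &= 0 && \text{epipolar constraint with the fundamental matrix } F \\
  E &= K_2^{\top} F\, K_1, \qquad E = [t]_{\times} R && \text{essential matrix} \\
  x_2 &\simeq H\, x_1, \qquad H = K_2\!\left(R - \tfrac{t\,n^{\top}}{d}\right)\!K_1^{-1} && \text{homography induced by the plane } (n, d)
\end{align}
```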
Multi-view geometry and hierarchical reconstruction are important theoretical achievements in the development of computer vision, and their theoretical framework is relatively complete. With improvements in camera manufacturing, the intrinsic parameters of a camera under the traditional pinhole model can usually be simplified so that only the focal length needs calibration, and a rough value of the focal length can usually be read from the image’s EXIF header, so the intrinsic parameters can often be treated as known. In this case, based on the essential matrix constraint between two images, the extrinsic parameters (rotation and translation) between the two views can be solved with the five-point method, enabling direct 3D reconstruction without hierarchical reconstruction. Nevertheless, owing to their theoretical elegance and mathematical completeness, multi-view geometry and hierarchical reconstruction remain indispensable in geometric vision and in computer vision more broadly.
6. Camera Calibration and Visual Localization
The parameters of a camera include intrinsic and extrinsic parameters. Intrinsic parameters include the focal length, aspect ratio, skew, and principal point, and are inherent attributes of the camera. Extrinsic parameters refer to the motion parameters of the camera, namely the rotation matrix and translation vector of the camera’s motion. Determining the intrinsic and extrinsic parameters of the camera is collectively referred to as camera calibration, while determining the extrinsic parameters alone is also referred to as camera localization or visual localization.
Camera intrinsic calibration can be divided into two categories: calibration based on prior information and self-calibration. For the former, Tsai proposed a two-step method using a three-dimensional calibration object in 1986. Because three-dimensional calibration objects are demanding to manufacture and prone to occlusion, Zhang proposed a calibration method based on a two-dimensional checkerboard in 1999, which is simple and easy to use and has been widely adopted in both industry and academia. Among self-calibration methods, the most important is the method based on the Kruppa equations proposed by Faugeras in 1992: by computing the fundamental matrix between matched points in images, equations in the camera’s intrinsic parameters can be established. Generally, calibration based on prior information is a linear problem, while self-calibration is nonlinear. Because the principle of the Kruppa equations is simple and the equations are easy to establish, how to solve such nonlinear problems has attracted many researchers; when the camera has few unknown parameters, the Kruppa equations can also be reduced to a linear problem. Subsequently, in 1997, Triggs proposed a significant self-calibration method based on the absolute dual quadric, which requires a projective reconstruction and is more complex than self-calibration based on the Kruppa equations but avoids some degeneracies. The importance of self-calibration based on the absolute dual quadric also lies in the fact that, once the camera is self-calibrated, the projective reconstruction can be upgraded directly to a metric reconstruction.
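A minimal sketch of checkerboard-based intrinsic calibration in the spirit of Zhang’s method, using OpenCV; the board size, square size, and file pattern are assumptions.

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                     # inner corners per row / column
square = 0.025                       # checkerboard square size in meters
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob("calib/*.jpg"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1),
                                   (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Returns the RMS reprojection error, intrinsic matrix K, distortion coefficients,
# and per-view extrinsics (rotation and translation vectors).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points,
                                                 gray.shape[::-1], None, None)
```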
Camera localization can be divided into two major categories: methods where the environment is known and methods where it is unknown. With known environmental information, research primarily concerns the PnP problem; with unknown environmental information, it mainly concerns SLAM (Simultaneous Localization and Mapping). The study of the PnP problem dates back to 1841: Grunert (1841) and later Finsterwalder and Scheufele (1903) showed that the P3P problem has at most four solutions, while the P4P problem has a unique solution, initiating a long series of studies on the PnP problem. In 1999, Quan and Lan provided approximately linear methods for P4P and P5P. When n is greater than or equal to 6, the PnP problem is linear, and the earliest influential method for solving it was the Direct Linear Transformation proposed by Abdel-Aziz and Karara in 1971; currently, the most effective method is the EPnP method proposed by Lepetit et al. in 2008. SLAM was first proposed by Smith and Cheeseman in 1986 and was officially named at the 1995 Robotics Research Workshop. SLAM has significant theoretical and practical value and is regarded by many scholars as the key to achieving true autonomy for mobile robots, even referred to as the holy grail of autonomous mobile robotics. In 2002, Andrew Davison implemented the first monocular real-time SLAM system, MonoSLAM, which adopted a filtering approach; since then it has been possible for robots to localize in real time using a monocular camera, laying an important foundation for monocular augmented reality. With the development of computer hardware and the gradual maturing of multi-view geometry theory, Klein and Murray proposed PTAM (Parallel Tracking and Mapping) in 2007, abandoning the then-mainstream filtering framework and instead parallelizing the tracking and mapping processes on the basis of multi-view geometry. The subsequently widely popular ORB-SLAM, proposed by Mur-Artal and Tardós, builds on the PTAM framework. In contrast, direct methods do not rely on feature points but work directly on image gradient information and photometric consistency: Engel et al. proposed such a direct SLAM method in 2014 that requires neither feature extraction nor descriptor computation and achieves a high tracking speed. In recent years, a series of deep learning-based visual localization methods have emerged, with representative works including CNN-SLAM by Tateno et al. in 2017, CodeSLAM by Bloesch et al. in 2018, and the VO method with memory modules by Xue et al. in 2019. Compared to traditional methods, deep learning-based methods exhibit higher robustness.
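Given known 3D-2D correspondences and calibrated intrinsics, camera localization via the PnP problem can be sketched as follows with OpenCV’s EPnP solver; the arrays below are placeholders standing in for real correspondences obtained by matching against a map or model.

```python
import cv2
import numpy as np

# object_points: 3D points in a known world/map frame; image_points: their
# detected projections; K and dist: previously calibrated intrinsics.
object_points = np.random.rand(10, 3).astype(np.float32)   # placeholder data
image_points = np.random.rand(10, 2).astype(np.float32)    # placeholder data
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)
dist = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_EPNP)      # EPnP solver
R, _ = cv2.Rodrigues(rvec)   # rotation matrix of the estimated camera pose
```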
Camera intrinsic parameter calibration is fundamental in computer vision, and many applications rely on calibrated intrinsic parameters as a prerequisite. Camera localization is a key technology in robotics, autonomous driving, augmented reality, and virtual reality, with extensive application value, being applicable not only in industrial fields but also possessing a broad market in consumer-grade fields, attracting significant research and attention.
7. 3D Reconstruction
3D reconstruction aims to recover the three-dimensional structure of a scene from 2D images taken from multiple viewpoints and can be seen as the inverse process of camera imaging. The earliest theory of 3D reconstruction was proposed by D. Marr in 1982 as part of his visual computation theory. Marr believed that the primary function of human vision is to recover the visible geometric surfaces of 3D scenes, i.e., to solve the 3D reconstruction problem, and he proposed a complete computational theory and method proceeding from the primal sketch to the 2.5D sketch and finally to a 3D description of objects, arguing that this recovery of 3D geometric structure from 2D images can be accomplished through computation. From around 1990 to 2000, the introduction of hierarchical reconstruction theory based on projective geometry effectively improved the robustness of 3D reconstruction algorithms. Hierarchical reconstruction builds computational methods that proceed from projective space to affine space and then to Euclidean space, has clear geometric meaning and few unknowns at each step, and serves as the foundational theory of modern 3D reconstruction algorithms. In recent years, with the growing demand for large-scale 3D reconstruction applications, research has begun to focus on large scenes and massive image data, primarily addressing robustness and computational efficiency in the reconstruction of large scenes.
Recovering the three-dimensional structure of a scene from 2D images taken from multiple viewpoints mainly includes two sequential steps: sparse reconstruction and dense reconstruction. Sparse reconstruction computes a sparse 3D point cloud of the scene from the matched feature points across the input images while simultaneously estimating the intrinsic parameters (focal length, principal point, distortion parameters, etc.) and extrinsic parameters (camera position and orientation). Sparse reconstruction algorithms can be categorized as incremental or global: incremental sparse reconstruction starts from two views and continuously adds new cameras with overall optimization, gradually reconstructing the entire scene and calibrating all cameras; global sparse reconstruction first estimates the orientations of all cameras as a whole, then computes the positions of all cameras, and finally computes the sparse point cloud by triangulation. The final step of sparse reconstruction typically uses the bundle adjustment algorithm to jointly optimize all camera parameters and 3D point positions. Bundle adjustment minimizes the sum of squared reprojection errors of all 3D points, is a high-dimensional nonlinear optimization problem, and is the core step determining the quality of the sparse reconstruction. After sparse reconstruction, dense reconstruction computes dense spatial point clouds pixel by pixel based on the camera poses obtained during sparse reconstruction. The main dense reconstruction methods include voxel-based methods, methods based on spatial diffusion of sparse points, and methods based on depth map fusion. Voxel-based methods first partition the 3D space into a regular grid of voxels, turning dense reconstruction into a labeling problem that marks each voxel as inside or outside; solving this globally with graph cut algorithms yields the interface between inner and outer voxels, which corresponds to the surface of the scene or object. Methods based on sparse point diffusion use the sparse point cloud as an initial value and iteratively optimize the parameters (position, normal, etc.) of neighboring 3D points by minimizing an image consistency function, thereby diffusing the point cloud through space. Depth map fusion methods first compute the depth map of each image through two-view or multi-view stereo, then cross-filter and fuse the depth maps from different viewpoints to obtain a dense point cloud. In recent years, deep learning methods have also begun to be applied to depth map computation: the basic idea is to extract image features with weight-sharing convolutional neural networks, warp the features of neighboring images onto fronto-parallel planes at different depths in front of the current camera (a plane-sweep construction), merge the features across depths by computing their variance, and finally refine the result with 3D convolutions to obtain the depth map of the current image.
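The bundle adjustment step mentioned above corresponds to the following standard objective:

```latex
% Bundle adjustment: jointly refine all camera parameters C_i and 3D points X_j
% by minimizing the sum of squared reprojection errors over the observed image
% points x_{ij} (v_{ij} = 1 if point j is visible in image i, 0 otherwise):
\begin{equation}
  \min_{\{C_i\},\,\{X_j\}} \; \sum_{i}\sum_{j} v_{ij}\,
  \bigl\| x_{ij} - \pi\!\left(C_i, X_j\right) \bigr\|^{2},
\end{equation}
% where \pi(\cdot) projects a 3D point into image i using the intrinsic and
% extrinsic parameters contained in C_i.
```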
In addition to recovering the three-dimensional structure of a scene from multiple 2D images, the field of computer vision has also developed a series of methods to recover 3D structure from image shading, photometry, texture, focus, and other cues, collectively referred to as Shape-from-X. Shape from shading establishes a reflectance equation relating the object’s surface shape, the light source, and the image, and, under smoothness assumptions on the scene surface, computes the three-dimensional shape from the gray-level brightness of a single image. Shape from photometric stereo is likewise based on the reflectance equation but uses multiple controllable light sources to change the image brightness in turn, constructing multiple constraint equations that make the computation of the three-dimensional shape more accurate and reliable. Shape from texture exploits the changes in size, shape, and gradient of regularly repeated texture primitives under projection to infer scene structure, but it is limited by the scene’s texture priors and is rarely used in practical applications. Shape from focus/defocus exploits the blurring that occurs when an object moves away from the focal plane in lens imaging: by moving the focal plane or the object and detecting the sharply imaged points, the distance from each pixel to the camera’s optical center can be inferred.
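As an illustration of the photometric stereo idea, here is a minimal least-squares sketch under the Lambertian model with known distant light sources; it ignores shadows and specularities, and all variable names are illustrative.

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """Recover per-pixel surface normals and albedo under the Lambertian model.

    images     : (k, h, w) array of grayscale images taken under k distant
                 point light sources (same viewpoint, shadows ignored).
    light_dirs : (k, 3) array of unit light direction vectors.

    Lambertian model: I_k = albedo * (n . l_k); stacking the k equations gives
    a linear least-squares problem per pixel, solved here for all pixels at once.
    """
    k, h, w = images.shape
    I = images.reshape(k, -1)                              # (k, h*w)
    g, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)     # g = albedo * n, (3, h*w)
    albedo = np.linalg.norm(g, axis=0)                     # (h*w,)
    normals = g / (albedo + 1e-8)                          # unit surface normals
    return normals.T.reshape(h, w, 3), albedo.reshape(h, w)
```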
Theories and methods of 3D reconstruction have been continuously developed in response to the demands of various application fields, such as constructing environmental maps and navigation for robots, large-scale aerial 3D modeling of cities, and digital preservation of cultural heritage. Especially for large-scale complex scene 3D modeling, due to the low cost and convenience of image sensors, they often become the preferred choice for such applications. For example, in the field of geographic information, 3D modeling based on aerial oblique photography has replaced traditional airborne LiDAR modeling in many instances. In recent years, with further improvements in the robustness and computational efficiency of image 3D reconstruction algorithms, their applications in indoor modeling and navigation, high-precision map construction for autonomous driving, and other fields are also continuously expanding.
8. Object Detection and Recognition
Object detection and recognition have long been important research tasks in computer vision and pattern recognition, laying the foundation for more complex tasks such as object segmentation, behavior analysis, event understanding, and visual-language interaction. Specifically, object recognition predicts the category of the people or objects appearing in images or videos, while object detection further predicts the location of the object in the image on top of the recognized category.
Traditional object recognition methods typically adopt a two-stage approach. 1) Feature extraction and encoding: discriminative local features are extracted from images or videos, usually based on hand-designed feature descriptors, with representative methods including SIFT, Gabor, LBP, and SURF. In addition, there are methods based on analyzing the geometric shape of objects, which can be robust to large motion changes such as rotation and scaling as well as to distortion or damage of the object’s shape, with representative methods including GHT, CTT, and shape context. On top of local features, feature encoding is usually performed to further enhance the representational power of the features, with representative methods including bag-of-words (BoW) and sparse coding. 2) Classifier training: the mapping from visual features to categories is learned, with the support vector machine (SVM) being a representative method; alternatively, metric learning and template matching strategies can be used to find the categories of samples close to the query sample. These two stages are learned independently, and the first stage typically does not use supervisory information such as category labels. Since 2012, deep learning models represented by CNNs have adopted end-to-end joint feature and classifier learning, learning discriminative, classification-oriented feature representations in a data-driven way. The most representative deep learning models include AlexNet, VGGNet, GoogLeNet, ResNet, DenseNet, and SENet, achieving performance far exceeding traditional methods. Deep learning-based algorithms continually set new state-of-the-art results in object recognition from 2012 to 2017 and ultimately surpassed human recognition performance on the million-image ImageNet database. Since then, the problem of general object recognition has been regarded as basically solved, with related technologies widely used in practical scenarios such as face recognition, plant recognition, and animal recognition. Currently, researchers are more focused on efficient object recognition with small networks, with representative models including MobileNet, ShuffleNet, and IGCNet.
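End-to-end recognition with a pretrained CNN can be sketched as follows with a recent version of torchvision (the model choice, image path, and use of ImageNet weights are illustrative assumptions):

```python
import torch
from torchvision import models
from PIL import Image

# Classify one image with a pretrained ResNet-50.
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()            # resizing, cropping, normalization

img = Image.open("example.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)         # (1, 3, H, W)
with torch.no_grad():
    logits = model(batch)
probs = logits.softmax(dim=1)
top_prob, top_class = probs.max(dim=1)
print(weights.meta["categories"][top_class.item()], float(top_prob))
```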
Early object detection algorithms mostly targeted specific object categories, such as face detection and pedestrian detection. Among them, the AdaBoost algorithm proposed for face detection has also been widely applied to other specific object detection problems. After AdaBoost, the deformable parts model (DPM) became the most representative multi-category object detection method of the traditional era. However, traditional object detection algorithms have obvious shortcomings: 1) the sliding-window region selection strategy is unfocused, resulting in high time complexity and redundant windows; 2) hand-crafted features are not robust enough to the diversity of object appearance variations. With the introduction of the deep learning-based R-CNN method in 2014, object detection fully entered the deep learning era, and deep learning-based detectors brought a qualitative leap over the previous combination of hand-crafted features and DPM. Current methods can be divided into two-stage detectors based on candidate box extraction and one-stage detectors based on regression. Notable methods such as Fast R-CNN, Faster R-CNN, FPN, Mask R-CNN, and the Cascade R-CNN series belong to the former category, achieving higher accuracy but slower running speeds than one-stage detectors. One-stage detectors emerged around 2016, with representative works including the SSD, YOLO, and RetinaNet series of algorithms. In recent years, there has also been growing interest in organically combining these two types of detectors, and related technologies are widely used in biomedical image analysis, traffic safety, and other fields.
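A two-stage detector of the Faster R-CNN family can likewise be exercised through a recent torchvision release, as in the following sketch; the image path and score threshold are illustrative.

```python
import torch
from torchvision import models
from PIL import Image

weights = models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = models.detection.fasterrcnn_resnet50_fpn(weights=weights).eval()
preprocess = weights.transforms()                  # converts the PIL image to a tensor

img = preprocess(Image.open("street.jpg").convert("RGB"))
with torch.no_grad():
    (pred,) = detector([img])                      # one dict per input image
keep = pred["scores"] > 0.5                        # discard low-confidence boxes
boxes, labels = pred["boxes"][keep], pred["labels"][keep]
names = [weights.meta["categories"][i] for i in labels.tolist()]
```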
9. Image Segmentation
Different from object detection and recognition tasks, image segmentation is a more challenging task that has developed rapidly in recent years. The purpose of image segmentation is to divide an image or video into regions with distinct characteristics and extract the target of interest. It can be seen as an extension of the object detection task, which not only requires identifying the targets appearing in the image or video but also locating the target positions and segmenting their contours. Image segmentation has evolved to include four main task types: 1) Ordinary segmentation, which separates pixel regions belonging to different targets without distinguishing categories, such as separating the region of a foreground dog from the region of background grass; 2) Semantic segmentation, which builds on ordinary segmentation to determine the category of each region, including countable things (like dogs) and uncountable stuff (like grass); 3) Instance segmentation, which builds on semantic segmentation to assign numbers to each countable thing (target), for example, one target is car A, and another target is car B; 4) Panoptic segmentation, which combines semantic segmentation and instance segmentation, segmenting both countable things and uncountable stuff while numbering each countable thing.
Many traditional image segmentation algorithms measure the similarity between pixels based on gray value, color, texture, and so on, and are unsupervised; examples include thresholding methods, region growing methods, edge detection methods, feature clustering methods, and histogram-based methods. The watershed algorithm is a representative segmentation method: it views the high and low gray values of an image as “peaks” and “valleys,” continuously injects “water” with different labels into different “valley” regions, and erects “watersheds” where the “water” from adjacent “valleys” meets, thereby achieving region segmentation. Although these algorithms are fast, they easily produce incomplete regions and missed segmentations for complex visual content. To alleviate these issues, graph-based methods model all pixels of an image as a graph: Normalized Cut partitions the graph by solving a spectral relaxation of the normalized-cut criterion, while graph cut methods use maximum flow/minimum cut algorithms to obtain two disjoint subsets corresponding to the foreground and background pixel sets, effectively completing the segmentation. Another commonly used method is the active contour algorithm, which represents the target boundary as a continuous curve and turns segmentation into the minimization of a designed energy functional. This family includes two implementation approaches, parametric active contours and geometric active contours, with Snake and Level Set being the respective representative methods. Additionally, before the rise of deep learning, there were many image segmentation methods based on probabilistic graphical models, with representative methods including MRF, CRF, and Auto-Context.
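The watershed idea described above is commonly combined with thresholding and a distance transform, as in the following OpenCV-based sketch; the file name and morphology parameters are illustrative.

```python
import cv2
import numpy as np

# Otsu thresholding to separate foreground from background, then marker-based
# watershed to split touching objects.
img = cv2.imread("cells.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# sure background / sure foreground from morphology and a distance transform
kernel = np.ones((3, 3), np.uint8)
sure_bg = cv2.dilate(binary, kernel, iterations=3)
dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)
sure_fg = sure_fg.astype(np.uint8)
unknown = cv2.subtract(sure_bg, sure_fg)

# each connected foreground component becomes one labeled "valley" for watershed
_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1
markers[unknown == 255] = 0              # unknown region: decided by the watershed
markers = cv2.watershed(img, markers)    # region boundaries are labeled -1
```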
After the emergence of deep learning in 2012, various CNN extensions have been applied to image segmentation. The milestone model for semantic segmentation is the Fully Convolutional Network (FCN), which performs pixel-wise category prediction efficiently by replacing all fully connected operations with convolutions, thereby avoiding the loss of spatial information caused by compressing two-dimensional feature maps into one-dimensional vectors in conventional CNNs. To obtain outputs that are both accurate and high-resolution, models such as U-Net, DeconvNet, SegNet, and HRNet gradually fuse shallow high-resolution features through cross-layer (skip) connections to recover detailed high-resolution predictions, while models such as DeepLab and PSPNet introduce dilated convolutions to maintain a large output resolution. As segmentation accuracy has improved significantly, segmentation efficiency has also attracted much attention, and methods such as ICNet and BiSeNet significantly improve inference efficiency through multi-branch network designs.
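A minimal sketch of fully convolutional semantic segmentation inference, using a pretrained FCN from a recent torchvision release (the model choice and image path are illustrative): the network outputs a score map per class and the per-pixel argmax gives the predicted category of every pixel.

```python
import torch
from torchvision import models
from PIL import Image

weights = models.segmentation.FCN_ResNet50_Weights.DEFAULT
net = models.segmentation.fcn_resnet50(weights=weights).eval()
preprocess = weights.transforms()              # resize and normalize the input

batch = preprocess(Image.open("scene.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    out = net(batch)["out"]                    # (1, num_classes, H, W) score maps
mask = out.argmax(dim=1)[0]                    # (H, W) per-pixel class indices
```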
Instance segmentation requires not only segmenting the semantics of objects but also distinguishing different instances. The milestone model is Mask R-CNN, which adds a mask branch on top of the Faster R-CNN detector and performs semantic segmentation within each detection box. However, the RoI operation inherited from object detection limits the precision of the output masks, so with the development of object detection and semantic segmentation, methods such as FCOS, SOLO, and CondInst directly output higher-precision segmentation maps without relying on RoIs. Panoptic segmentation combines the characteristics of semantic and instance segmentation, requiring both the segmentation of uncountable stuff and the separation of different instances of countable things. The task was proposed in 2018 and, although relatively new, has attracted a growing number of researchers. Models such as Panoptic FPN, UPSNet, OANet, and Panoptic-DeepLab primarily use a semantic segmentation branch to segment stuff and an instance segmentation branch to segment things, then fuse the two outputs to obtain the final panoptic segmentation map.
Although the academic community is still conducting in-depth research on refined image segmentation algorithms, related technologies have already been applied in various fields such as pedestrian segmentation and lesion segmentation. Additionally, image segmentation technology is widely used as a preprocessing operation for other complex visual content understanding tasks such as gait recognition and pedestrian re-identification, with its segmentation robustness directly determining the final performance of subsequent tasks. Therefore, researching robust image segmentation under complex conditions such as complex backgrounds, occlusions, and blurriness is an urgent issue to be addressed.
10. Image Scene Understanding
Image scene understanding is a broad concept, with key technologies involved primarily including scene parsing and semantic description, both of which have rapidly developed in recent years.
Scene Parsing: Scene parsing assigns a target category label to every pixel in the image and is also known as image semantic segmentation. Unlike coarse image recognition, scene parsing is a high-level, fine-grained image analysis and recognition task; from pixel-level category labels it is easy to obtain the position, contour, and category of the targets in the image. The difficulty of scene parsing lies in how to integrate high-level target semantics with low-level contours to obtain high-resolution, fine-grained parsing results: high-level semantics require deep features and a large receptive field to capture macro concepts, while low-level contours require shallow, high-resolution features and a relatively limited receptive field to keep edges sharp. Current mainstream scene parsing technologies are primarily based on fully convolutional networks (FCN) and can be roughly divided into two categories. 1) Encoder-decoder models: models such as U-Net, DeconvNet, GCN, RefineNet, and DFN gradually introduce shallow high-resolution features on top of low-resolution high-level semantic features to recover detailed high-resolution parsing results. 2) Dilated convolution models: methods such as DeepLab, PSPNet, and PSANet use dilated (atrous) convolutions to keep the output’s high-level semantic features at high resolution. Scene parsing provides fine-grained image analysis and recognition results and is in particularly high demand in fields that require precise positioning and operation, such as autonomous driving, autonomous robotics, and video surveillance.
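The effect of dilated convolution can be seen in a few lines of PyTorch: increasing the dilation enlarges the receptive field of a 3x3 kernel without downsampling the feature map (tensor sizes below are arbitrary).

```python
import torch
import torch.nn as nn

# With dilation d, a 3x3 kernel covers a (2d+1) x (2d+1) neighborhood, enlarging
# the receptive field without reducing resolution or adding parameters.
x = torch.randn(1, 64, 128, 128)                    # an arbitrary feature map
conv_d1 = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_d4 = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)
print(conv_d1(x).shape, conv_d4(x).shape)           # both stay (1, 64, 128, 128)
```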
Semantic Description: Although most visual research currently focuses on classic tasks such as detection, segmentation, and recognition, the human visual system often works in conjunction with the auditory and language systems when processing information, which allows visual information to be processed and abstracted into high-level semantic information. Semantic description is a cutting-edge research area in computer vision that studies the problem of generating descriptive text from a given image, striving to align with the descriptions humans would give. Current image semantic description technology traces back to the renowned Visual Genome project led by Dr. Fei-Fei Li, which aims to connect images with semantics. Current image semantic description methods typically combine convolutional neural networks (CNN) and recurrent neural networks (RNN) into a single network. Semantic description is regarded as a starting point of the transition from perceptual intelligence to cognitive intelligence; it is not only a typical cross-modal pattern recognition problem but also has broad application prospects. The current technical challenges of image description focus on two aspects: grammatical correctness, where the mapping process must follow the grammar of natural language so that the results are readable; and richness, where the generated description must accurately depict the details of the corresponding image and be sufficiently complex. To address these issues, scholars have introduced techniques such as attention mechanisms and generative adversarial networks (GANs) to generate image descriptions that are closer to human natural language.
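A skeletal CNN+RNN captioning model of the kind described above might look like the following PyTorch sketch; the backbone choice (ResNet-50), vocabulary size, and teacher-forcing formulation are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CaptionModel(nn.Module):
    """Minimal CNN encoder + LSTM decoder for image captioning (teacher forcing)."""
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        backbone = resnet50(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # global image feature
        self.img_proj = nn.Linear(2048, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # captions: (B, T) token ids; the projected image feature is fed as the first "word"
        feat = self.img_proj(self.encoder(images).flatten(1)).unsqueeze(1)  # (B, 1, E)
        words = self.embed(captions[:, :-1])                                # shifted inputs
        hidden, _ = self.lstm(torch.cat([feat, words], dim=1))
        return self.out(hidden)   # (B, T, vocab) logits, trained with cross-entropy
```

Attention-based variants replace the single global image feature with a grid of region features that the decoder attends over at each word step.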
11. Image Retrieval
Image retrieval aims to query and filter, conveniently, quickly, and accurately, the images a user needs or is interested in from a massive image database containing rich visual information, given a user query. The main steps of retrieval include user input of the query (Query), query analysis, indexing and vocabulary construction, content filtering, result recall, and result ranking and display. Queries can take many forms, including text, color maps, image instances, video samples, conceptual images, shape images, sketches, voice, QR codes, and combinations of these. To better provide the images users need, retrieval systems use relevance feedback and interactive feedback, fully exploiting user-provided feedback information (such as browsing history, click records, and repeated searches) to better understand the search intention expressed by the user and obtain better results. Based on how image content is described, image retrieval methods can be divided into text-based image retrieval and content-based image retrieval. Research in this area includes automatic image annotation, image feature extraction and representation, feature encoding and aggregation, and large-scale search.
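The recall-then-rank core of this pipeline can be reduced to a few lines. The numpy sketch below ranks database images by cosine similarity to a query; the hypothetical `extract_feature` function stands in for any of the descriptors discussed later in this section.

```python
import numpy as np

def cosine_rank(query_vec, db_matrix, top_k=10):
    """Rank database feature vectors by cosine similarity to a query vector."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    db = db_matrix / (np.linalg.norm(db_matrix, axis=1, keepdims=True) + 1e-12)
    scores = db @ q                       # (N,) cosine similarities
    order = np.argsort(-scores)[:top_k]   # highest similarity first
    return order, scores[order]

# Usage (features come from any extractor, e.g. SIFT + BoW or a CNN embedding):
# idx, sims = cosine_rank(extract_feature(query_img), db_features)
```

Everything that follows (annotation, feature learning, hashing, indexing) exists to make this ranking step accurate and fast at the scale of billions of images.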
Automatic Image Annotation refers to the process of automatically adding textual information that reflects the content of an image (such as color, shape, regional attribute annotations, and conceptual categories) through machine learning. Through automatic annotation, the image retrieval problem can be transformed into the relatively mature problem of text information processing. Depending on the annotation model, automatic image annotation methods mainly include statistical classification, probabilistic modeling, and deep learning-based annotation. Statistical classification methods treat each semantic concept of an image as a class, turning automatic annotation into a multi-class classification problem. Probabilistic modeling methods attempt to infer the correlation or joint probability distribution between images and semantic concepts. Deep learning methods are suited to automatically learning high-level semantic features of images and to classifying and annotating massive image collections.
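Treating annotation as a classification problem over a tag vocabulary, a minimal deep-learning formulation might look like the sketch below (PyTorch); the tag vocabulary size, feature dimension, and the choice of independent sigmoid outputs are assumptions.

```python
import torch
import torch.nn as nn

class TagAnnotator(nn.Module):
    """Multi-label image annotation: one sigmoid score per tag in the vocabulary."""
    def __init__(self, feature_dim=2048, num_tags=1000):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_tags)

    def forward(self, image_features):          # features from any CNN backbone
        return self.classifier(image_features)  # raw logits, one per tag

criterion = nn.BCEWithLogitsLoss()               # targets are 0/1 vectors over the tag vocabulary
# At inference: predicted_tags = torch.sigmoid(logits) > 0.5
```

Once images carry predicted tags, text retrieval machinery (inverted indices over tag terms) applies directly, which is exactly the transformation described above.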
Image Feature Extraction and Representation is the first phase of content-based image retrieval. Common feature extraction and representation methods used in classification and visual object recognition (such as SIFT, SURF, bag-of-words, and CNN features) can also be applied to image retrieval. Because computing the similarity or distance between floating-point features is expensive and their storage footprint is large, binary features have attracted considerable attention for their compact storage and the low computational cost of Hamming distance. Hashing encodes high-dimensional data into binary representations while preserving the similarity of images or videos. Traditional methods encode floating-point features into binary features, for example spectral hashing; deep learning methods directly learn and output binary feature representations, such as compact feature representations based on Hamming embedding, binary hashing coding, deep supervised hashing, and deep discrete hashing.
To mitigate the curse of dimensionality caused by the original feature dimensions, feature encoding and aggregation form the second phase of content-based image retrieval. This phase mainly clusters the image features obtained during feature extraction and generates codebooks, which facilitates the construction of inverted indices, and can be divided into small-scale and large-scale codebooks. Depending on the encoding method, small-scale codebooks include feature aggregation based on sparse coding (bag of words, BoW), the vector of locally aggregated descriptors (VLAD), and Fisher vector encoding; large-scale codebooks include hierarchical K-means and approximate K-means. In the deep learning era, earlier works combined convolutional neural networks with traditional encoding and aggregation methods, such as CNN+VLAD, CNN+BoW, and Fisher encoding+CNN. Later, researchers proposed various end-to-end trained deep convolutional networks for image retrieval, eliminating explicit encoding or aggregation steps; representative works include visual similarity learning based on Siamese (twin) networks with contrastive loss and the VLAD-inspired NetVLAD. Binary encoding is also an important part of feature encoding, with significant progress in both data-independent and data-dependent hashing. Representative data-independent hashing works include random projection hashing, locality-sensitive hashing, and weighted min-wise independent permutation locality-sensitive hashing. Data-dependent hashing algorithms learn hash functions from training data and are therefore data-sensitive; they are generally divided into unsupervised, semi-supervised, and supervised hashing. Thanks to the powerful feature learning capability of deep learning and the ability to learn hash functions end to end, many deep hashing algorithms have gained attention, with representative works including convolutional neural network hashing, deep regularized similarity comparison hashing, deep supervised hashing, cross-modal deep hashing, and ranking-based semantic hashing. Deep unsupervised hashing methods do not require any label information and instead obtain similarity information from feature distances; they are mainly divided into three categories: deep hashing with similarity removal, deep hashing based on generative models, and deep hashing based on pseudo-labels.
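The storage and speed advantage of the binary codes produced by any of the hashing methods above comes from packing bits into machine words and using XOR plus popcount for Hamming distance. A numpy sketch under these assumptions (64-bit codes, random data in place of learned hashes):

```python
import numpy as np

def pack_codes(bits):
    """Pack a (N, n_bits) array of 0/1 values into bytes for compact storage."""
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_distances(query_packed, db_packed):
    """Hamming distance between one packed query code and all packed database codes."""
    xor = np.bitwise_xor(db_packed, query_packed)   # differing bit pattern
    return np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per database item

# 64-bit codes: 8 bytes each instead of 64 floats; distances are small integer bit counts
db = pack_codes(np.random.randint(0, 2, size=(100000, 64)))
q = pack_codes(np.random.randint(0, 2, size=(1, 64)))
nearest = np.argsort(hamming_distances(q, db))[:10]
```

The quality of the retrieval then depends entirely on whether the hash function preserves the original similarity structure, which is what the supervised and unsupervised deep hashing methods above compete on.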
In recent years, multi-modal deep hashing technology has attracted significant research interest, with representative works including various cross-modal hashing and cross-modal deep hashing, self-supervised adversarial hashing, and deep multi-level semantic hashing.
Fast lookup technologies for large-scale image search include lookup optimization methods (such as building inverted indices and optimizing the retrieval structure for performance without altering the vectors themselves) and vector optimization methods (mapping high-dimensional floating-point vectors to low-dimensional vectors, or into Hamming space, to reduce computational complexity and storage). Lookup optimization methods can be divided into nearest neighbor search and approximate nearest neighbor search. Representative nearest neighbor search works include KD-trees and large-scale indexing methods based on query-driven iterative nearest neighbor graph search. Approximate nearest neighbor search greatly improves efficiency by shrinking the search space and finding matches at approximately the nearest distance; commonly used methods include locality-sensitive hashing, inverted file indexing, inverted multi-indexing, and non-orthogonal inverted multi-indices tailored to deep features. Vector optimization methods remap feature vectors, mapping high-dimensional floating-point vectors into other spaces where distances can be computed more efficiently; hashing algorithms are among the most representative of these technologies.
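A minimal random-projection (locality-sensitive) hashing sketch is given below; the number of bits and the Gaussian projection matrix are assumptions. Nearby floating-point vectors tend to receive similar bit patterns, so only candidate buckets need to be probed instead of scanning the whole database.

```python
import numpy as np

class RandomProjectionLSH:
    """Data-independent LSH: the sign of random projections gives a binary code."""
    def __init__(self, dim, n_bits=32, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((dim, n_bits))   # random hyperplanes

    def hash(self, vectors):
        # vectors: (N, dim) -> (N, n_bits) binary codes
        return (vectors @ self.planes > 0).astype(np.uint8)

# Vectors close in the original space collide in many bits, so a hash-table lookup
# on (a prefix of) the code shrinks the search space before exact re-ranking.
lsh = RandomProjectionLSH(dim=128, n_bits=32)
codes = lsh.hash(np.random.randn(1000, 128))
```

Data-dependent hashing replaces the random hyperplanes with learned projections, trading the simplicity and guarantees of LSH for better adaptation to the actual feature distribution.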
Moreover, image retrieval has many extensions in defining relevance, including semantic relevance, texture relevance, and appearance relevance. To better obtain image retrieval results, ranking algorithms and re-ranking algorithms are often applied in image retrieval systems. To better interact with users or commercialize advertising recommendations, the reasonable presentation of retrieval results is also a significant focus for major internet companies. In summary, image retrieval has driven the development of fields such as computer vision, pattern recognition, and machine learning, and its technology has been widely applied, including in search engines from Baidu, Google, and Microsoft, vertical searches for products in e-commerce platforms like Alibaba, JD, and Pinduoduo, and medical assistance from IBM.
12. Visual Tracking
In the most general sense, visual tracking involves determining the state of a specified target in every frame of an image sequence through algorithms. The state of the object to be tracked in the first frame is determined by humans or other algorithms. The target state typically includes its center position in the image, the rectangular box that precisely surrounds the object, and the rotation angle of that rectangular box. For objects undergoing significant deformation during tracking, multiple rectangular boxes can be used to collectively approximate their position and posture. Alternatively, polygonal or image segmentation algorithms can be utilized to divide the pixels within the bounding box of the tracked object into target pixels and background pixels, enhancing the accuracy of marking the tracked object. There are many types of tracking algorithms, which can be categorized based on whether the algorithm tracks objects online or offline. Online tracking refers to algorithms that can only utilize the current and previous images to locate the object, while offline tracking refers to algorithms that can use the entire video to determine the state of an object in any frame. Clearly, online tracking is more challenging but also more widely applicable. Tracking algorithms can also be categorized based on whether the tracked object or its type is known in advance. If the tracking algorithm can only use information from the object in the initial frame, it is generally referred to as a model-free tracking problem; if it can know the tracked object or its type in advance, it can collect a large number of relevant samples and design and train a tracker to reduce misjudgments during tracking, thus significantly improving tracking performance. Tracking algorithms can also be further subdivided based on whether they need to track a single target or multiple targets in a frame. Single-target tracking algorithms generally consist of an appearance model, motion model, and search strategy, while multi-target tracking algorithms typically comprise two parts: locating multiple objects in the same frame and associating the same object across adjacent frames. In practical applications, tracking algorithms can also be further refined based on whether the background or camera is stationary, whether three-dimensional tracking is performed, and whether cross-camera tracking is required, among others. Cross-camera tracking often targets specific types of objects and involves more efficient target detection, re-identification, or many-to-many matching problems.
For the most basic setting, single-target visual tracking, the techniques employed have evolved from early methods built on generative object models, such as affine correspondence, Kalman filtering, and particle filtering, to discriminative object modeling introduced around the turn of the 21st century, and then to correlation filtering and deep network-based tracking algorithms in the 2010s. Supported by big data, combining correlation filtering with deep features, as well as embedding correlation filtering into deep network trackers, has significantly improved localization performance while maintaining high processing frame rates. With continued research into correlation filtering trackers, breakthroughs have also been made in correlation filtering theory itself, and its high speed no longer relies solely on the fast Fourier transform. Regression network-based tracking algorithms have gained attention in recent years; they directly regress from the search region or a rough object state to a precise object state. Meta-learning-based tracking algorithms currently achieve the best balance between accuracy and speed: they train deep networks through meta-learning so that the tracker template can quickly adapt to the object template and its surrounding background, exhibiting strong discrimination and robustness.
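The appeal of the correlation filtering mentioned above is that both training and detection reduce to element-wise operations in the Fourier domain. The numpy sketch below is a simplified, single-channel MOSSE-style filter; the Gaussian label width, the regularization constant, and the absence of an online update rule are assumptions made to keep the example short.

```python
import numpy as np

def train_filter(patch, sigma=2.0, lam=1e-3):
    """Learn a correlation filter whose response to `patch` is a centered Gaussian."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))  # desired response
    F, G = np.fft.fft2(patch), np.fft.fft2(g)
    H_conj = (G * np.conj(F)) / (F * np.conj(F) + lam)   # closed-form solution per frequency
    return H_conj

def detect(H_conj, search_patch):
    """Correlate the filter with a new patch; the response peak gives the new position."""
    response = np.real(np.fft.ifft2(H_conj * np.fft.fft2(search_patch)))
    return np.unravel_index(np.argmax(response), response.shape)
```

Because every step is an FFT or an element-wise product, such trackers run at hundreds of frames per second, which is why they became the backbone of many later deep-feature trackers.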
Visual tracking is a very challenging yet widely applicable foundational problem in computer vision. Current tracking algorithms often heavily borrow techniques from other fields in computer vision, especially from the field of object detection, and adapt them to the specific problems of visual tracking.
13. Behavior and Event Analysis
Behavior and event analysis is an important task in high-level computer vision. Behavior analysis uses visual information (images or videos) to analyze what a subject is doing, which corresponds to a deeper level of understanding in the human visual system than object detection and classification. An event refers to behavior triggered by specific conditions or external stimuli and represents a more complex form of behavior analysis, covering the targets, the scene, and the relationships before and after the behavior. Event analysis is a higher-level stage of behavior analysis that can provide semantic descriptions through longer-term analysis of targets. Behavior recognition can serve as the foundation for event analysis, but event analysis has its own particularities and cannot be adequately solved by behavior recognition alone. The core tasks of behavior and event analysis involve classification but are not limited to it, also covering spatial and temporal localization and prediction.
Behavior analysis began in the 1970s. The general process includes two steps: first, feature extraction to remove redundant information from the video, and second, recognition analysis using methods such as classification and comparison. Early research was largely limited to simple, pre-segmented actions shot from fixed angles, and methods based on global feature representation were the most representative early behavior recognition methods. Typical methods first use background subtraction to obtain human silhouettes, then accumulate these differential silhouettes into motion energy images (MEI) or motion history images (MHI) and classify the behavior in the video with template matching; or they extract contour information from each frame, apply linear dynamical models, hidden Markov models, and similar tools for temporal modeling, and perform recognition with state-space methods. However, methods based on global feature representation depend on background segmentation and are sensitive to noise, viewing angle, occlusion, and so on, which makes it difficult to analyze complex behaviors and events against complex backgrounds. In the early 2000s, many methods based on local feature representation emerged; they overcame some of the problems of global feature methods, showed a certain invariance to changes in viewpoint, lighting, appearance, and partial occlusion, and achieved better results. The general pipeline of these methods involves local region extraction, local feature extraction, local feature encoding and pooling, and classifier learning. Local patches are typically obtained by dense sampling or by sampling around spatio-temporal interest points, which are locations in the video with significant motion that are assumed to be critical for recognizing human behavior. Local feature descriptors represent the characteristics of local patches in images or videos, with typical examples including histograms of oriented gradients (HOG), histograms of optical flow (HOF), the scale-invariant feature transform (SIFT), SURF features, motion boundary histograms (MBH), and tracklet features. Local features then need to be encoded and pooled to form a representation of the entire video; the most common encoding methods include visual bag-of-words models, vector quantization (VQ), sparse coding, Fisher vectors, locality-constrained linear coding (LLC), and the vector of locally aggregated descriptors (VLAD). The most commonly used classifiers in this period were SVMs combined with multiple kernel learning and metric learning. Over the past decade, deep learning methods have achieved breakthroughs in various visual tasks and have been widely applied to behavior analysis. Behavior recognition methods based on convolutional neural networks describe the video sequence separately through RGB and optical flow channels (two streams), with the final prediction for the whole video being a weighted average of the two channels. Methods based on three-dimensional convolutional neural networks extend 2D convolutions to 3D and feed the entire video as a whole into a 3D deep convolutional network for end-to-end training.
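As a concrete example of the early global-representation pipeline mentioned above, the motion history image can be maintained with a few array operations; the difference threshold and the duration parameter below are assumptions, and random frames stand in for a real grayscale sequence.

```python
import numpy as np

def update_mhi(mhi, prev_gray, curr_gray, tau=30, diff_thresh=25):
    """Motion history image update: recent motion is bright, older motion fades out."""
    moving = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16)) > diff_thresh
    return np.where(moving, tau, np.maximum(mhi - 1, 0))

# Run over a grayscale frame sequence; the final MHI is a 2D motion template that can
# be matched against stored templates (e.g. via shape moments) to label the action.
frames = [np.random.randint(0, 256, (240, 320), dtype=np.uint8) for _ in range(10)]
mhi = np.zeros((240, 320), dtype=np.int16)
for prev, curr in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev, curr)
```

The sensitivity of this representation to background segmentation and viewpoint is visible directly in the code: any frame-differencing noise is written straight into the template, which is why local-feature and deep-learning methods later displaced it.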
Methods based on recurrent neural networks model the deep features extracted from each video frame as a temporal sequence, for example first using convolutional networks to extract low-level visual features and then using an LSTM to model those features at a higher level. Many methods add spatial, temporal, or channel attention modules to focus the network on more discriminative regions. Some methods also use graph convolutional networks to model high-level features and their relationships to enhance the model's expressive power; owing to the structured nature of human skeleton data, graph convolutional networks are even more widely used in skeleton-based behavior recognition. Finally, these neural network-based methods are often combined with dense-trajectory-based methods to further boost final performance.
For group behavior analysis, in addition to the holistic methods above, some scholars have proposed frameworks based on individual segmentation: the multi-person interaction process is roughly decomposed into the action processes of several individuals, and high-level feature description and interaction recognition methods are then used to obtain the final interaction result. Behaviors generally occur over short time spans, and current video behavior analysis methods are applicable to a variety of shooting angles and scenes, showing a certain invariance to changes in viewpoint and scene. Events, however, often last longer and require analysis across cameras, for example in large-scale surveillance environments with many cameras. Complex events in such large-scale, multi-camera scenes usually involve multiple interrelated behavior units, and there is still relatively little research that directly performs this kind of associative behavior analysis. Nevertheless, technologies targeting specific individuals in cross-camera networks, such as pedestrian re-identification, pedestrian tracking, and identity recognition under different poses and environments, are currently hot research topics in the cross-camera field. Using these technologies, associated behavior units across cameras can be linked, enabling further event analysis.
Behavior and event analysis is a highly challenging task, encompassing not only the perception of static targets in videos but also the analysis of dynamic changes. Its performance has improved significantly as methods moved from local feature descriptions around spatio-temporal interest points to neural network-based methods, and for complex real-world scenes with abundant samples a relatively high level has already been reached. This opens up broad application space for behavior and event analysis, including intelligent video surveillance, robotic vision systems, human-computer interaction, medical care, virtual reality, motion analysis, and game control: for instance, detecting athletic actions in basketball or football videos, recognizing and predicting the behavior of elderly patients in monitoring videos, and providing early warning of violent events and group behavior analysis in public safety scenarios.