Introduction to Computer Vision: History, Current Status, and Development Trends

This article, written by Professor Hu Zhanyi from the Institute of Automation, Chinese Academy of Sciences, provides a brief summary of the development of computer vision over the past 40 years, including: Marr’s computational theory of vision, active and purpose-driven vision, multi-view geometry and camera self-calibration, and learning-based vision. Based on this foundation, some prospects for the future development trends of computer vision are presented.

The article was published on the WeChat public account of the Machine Vision Research Group, which is affiliated with the National Key Laboratory of Pattern Recognition at the Institute of Automation, Chinese Academy of Sciences. The Deep Blue Academy has not made any modifications to the article.

What is Computer Vision?

Like many other disciplines, a field that has been studied by many people over many years is difficult to define strictly; this is true of pattern recognition, of the currently popular artificial intelligence, and of computer vision.
Concepts closely related to computer vision include visual perception, visual cognition, and image and video understanding. These concepts share some commonalities but also have essential differences. Broadly speaking, computer vision is the discipline of “endowing machines with natural visual capabilities”.
Natural visual capabilities refer to the visual abilities exhibited by biological visual systems. However, biological natural vision cannot be strictly defined, and this broad definition of vision is “all-encompassing” and does not accurately reflect the research status of computer vision over the past 40 years. Therefore, this “broad definition of computer vision” is impeccable but lacks substantive content; it is merely a circular, word-game style definition. In reality, computer vision fundamentally studies visual perception issues.


Visual perception, according to the definition in Wikipedia, refers to the process of organizing, recognizing, and interpreting visual information so as to express and understand the environment. Based on this definition, the goal of computer vision is to express and understand the environment, with the core issue being how to organize input image information, recognize objects and scenes, and subsequently interpret the content of images.
Computer vision is closely related to artificial intelligence but also fundamentally different. While artificial intelligence emphasizes reasoning and decision-making, computer vision currently remains primarily focused on image information representation and object recognition.
“Object recognition and scene understanding” also involve reasoning and decision-making based on image features, but they differ fundamentally from the reasoning and decision-making in artificial intelligence. No serious computer vision researcher would consider AlphaGo or AlphaZero as part of computer vision; they would be regarded as typical artificial intelligence content.
In short, computer vision is a discipline that takes images (videos) as input and aims to express and understand the environment by studying the organization of image information, object and scene recognition, and subsequently providing explanations for events. Currently, research mainly focuses on the organization and recognition of image information, with little involvement in event interpretation, which is still at a very preliminary stage.
It is important to emphasize that due to differences in background, preferences, and knowledge, individuals may have different viewpoints on the same issue, which can lead to significant discrepancies. The above is the author’s understanding of computer vision, which may be biased or incorrect.
Many people believe that “texture analysis” is an important research direction in computer vision, but the author does not agree. Additionally, in many contexts, people also consider “image processing” to be part of “computer vision,” which is also inappropriate. Image processing is an independent discipline that studies image denoising, image enhancement, and so on, where both the input and the output are images. Computer vision makes use of image processing techniques for image preprocessing, but image processing itself does not constitute the core content of computer vision.
It is worth mentioning that many people currently do not distinguish between “perception” and “cognition,” which leads to unnecessary confusion and misunderstanding for readers.
In many cases, some “visual experts” use “cognition” and “reasoning and decision-making” as parallel concepts, which is factually inaccurate. According to Wikipedia, “cognition” refers to the mental process of acquiring knowledge and understanding through senses, experiences, and thoughts.
Cognition includes knowledge formation, attention, memory, reasoning, problem-solving, decision-making, and language production. Therefore, “perception” and “cognition” are different; reasoning and decision-making are typical cognitive processes and important components of cognition, indicating a containment relationship rather than a parallel relationship.

The Four Main Stages of Computer Vision Development

Although there are different opinions regarding the starting time and development history of computer vision, it can be said that the publication of Marr’s “Vision” in 1982 marked the establishment of computer vision as an independent discipline.
The research content of computer vision can be roughly divided into two main parts: object vision and spatial vision. Object vision focuses on fine classification and identification of objects, while spatial vision aims to determine the position and shape of objects, serving the purpose of “action”.
As the famous cognitive psychologist J.J. Gibson said, the main function of vision is to “adapt to the external environment and control one’s own movement.” Adapting to the external environment and controlling one’s movement is essential for biological survival, and these functions require coordination between object vision and spatial vision.
Over the past 40 years of development in computer vision, despite the numerous theories and methods proposed, computer vision has generally gone through four main stages: Marr’s computational vision, active and purpose-driven vision, multi-view geometry and hierarchical 3D reconstruction, and learning-based vision. The following is a brief introduction to these four main contents.

1

Marr’s Computational Vision

Many current computer vision researchers may not fully understand “Marr’s computational vision,” which is indeed a regrettable situation. Currently, tuning “deep networks” on computers to improve object recognition accuracy seems equivalent to engaging in “vision research.” In fact, Marr’s computational vision, both theoretically and methodologically, is of epoch-making significance.
Marr’s computational vision is divided into three levels: computational theory, representation and algorithms, and algorithm implementation. Since Marr believed that algorithm implementation does not affect the function and effect of the algorithm, the theory of Marr’s computational vision primarily discusses the contents of “computational theory” and “representation and algorithms.”
Marr believed that the neural computation of the brain and numerical computation of computers are fundamentally indistinguishable, so he did not explore “algorithm implementation” at all.
From the current advancements in neuroscience, there may be essential differences between “neural computation” and numerical computation in some cases, such as the currently emerging neuromorphic computing; however, overall, “numerical computation” can “simulate neural computation.” At least for now, “different implementation approaches of algorithms” do not affect the essential properties of Marr’s computational vision theory.

Computational Theory

Computational theory requires clarifying the purpose of vision, that is, what the main function of vision is. In the 1970s, people had only a very rudimentary understanding of the brain, and the non-invasive imaging techniques commonly used today, such as functional MRI (fMRI), had not yet become widespread.
Therefore, people primarily relied on pathological and psychological results to infer physiological functions. Even now, there is still no consensus on what the “main function of vision” is.
For example, in recent years, MIT’s DiCarlo and others proposed the so-called “goal-driven perceptual information modeling” method (Yamins & DiCarlo, 2016a). They speculated that the responses of neurons in the monkey’s IT area (IT: inferior temporal cortex, the object recognition area) to objects “can be modeled by hierarchical convolutional neural networks” (HCNN: Hierarchical Convolutional Neural Networks).
They believe that as long as an HCNN is trained on image object classification tasks, the trained HCNN can quantitatively predict the responses of IT-area neurons well (Yamins et al., 2014; Yamins & DiCarlo, 2016b). Since a network whose training merely controls image classification performance already predicts the responses of IT neurons quantitatively (the collective responses of neurons to a given input image object constitute the expression, or encoding, of that object), they termed this framework a “goal-driven framework.”
The goal-driven framework provides a new and relatively general way to model the encoding of population neurons, but it also has significant shortcomings. Whether it is truly possible, as the authors claim, to quantitatively predict neuronal responses to image objects solely by “training the image classification HCNN” remains a topic that requires further in-depth research.
Marr believed that regardless of how many functions vision has, its main function is to “recover the visible three-dimensional surface shape of spatial objects from the two-dimensional images formed on the retina,” which he referred to as “3D reconstruction.” Moreover, Marr believed that this reconstruction process is not innate but can be accomplished through computation. Psychologists such as J.J. Gibson and the Gestalt school, by contrast, believed that many functions of vision are innate. If a visual function were innate and could not be modeled, it could not be discussed in computational terms, and perhaps the discipline of “computer vision” would not exist today.
So, what is Marr’s computational theory? On this point, Marr does not seem to give a particularly specific account in his book. He uses the example of purchasing goods to illustrate the importance of computational theory: at a store checkout, addition should be used rather than multiplication. Imagine if multiplication were used at checkout: if each item costs 1 yuan, then no matter how many items you buy, you would only need to pay one yuan.
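
The point is trivial but can be made concrete in a few lines; the basket of 1-yuan items below is of course made up:

```python
# Marr's checkout example: the "computational theory" of a cash register is
# addition. With multiplication, a basket of 1-yuan items always costs 1 yuan.
prices = [1.0, 1.0, 1.0, 1.0]                      # four items at 1 yuan each (made-up data)

total_by_addition = sum(prices)                    # correct theory: 4.0 yuan
total_by_multiplication = 1.0
for p in prices:
    total_by_multiplication *= p                   # wrong theory: still 1.0 yuan

print(total_by_addition, total_by_multiplication)  # 4.0 vs 1.0
```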


Marr’s computational theory posits that images are projections of physical space onto the retina, thus image information contains intrinsic information about physical space. Therefore, any computational theory and method in computer vision should start from images and fully explore the intrinsic attributes corresponding to physical space contained in images.
In other words, Marr’s vision computational theory aims to “exploit the intrinsic attributes of the imaged physical scene to solve the corresponding visual computation problems.” This is because, from a mathematical perspective, many visual problems are ambiguous if one starts from images alone. For example, the classical correspondence problem between left- and right-eye images cannot be solved uniquely without any prior knowledge.
For any animal or person, the environment in which they live is not random; consciously or unconsciously, they are constantly using this prior knowledge to interpret the scenes they see and guide their daily behavior and actions. For instance, in a scene where a cup is placed on a table, people will correctly interpret it as a cup placed on the table rather than seeing it as a new object.
Of course, humans can also make mistakes, as seen in many optical illusion phenomena. From this perspective, whether letting computers mimic human vision is necessarily a good approach remains an unknown proposition. The flight of an airplane requires knowledge of aerodynamics, not merely mechanically mimicking how birds fly.

Representation and Algorithms

Before recognizing objects, whether it is a computer or a human, there must be a stored form of that object in the brain (or computer memory), known as object representation. Marr’s visual computational theory posits that the representation of an object is its three-dimensional geometric shape.
Marr speculated that since humans’ recognition of objects is independent of the viewpoint from which the object is observed, and the same object has different retinal images from different viewpoints, the representation of the object in the brain cannot be two-dimensional but may be three-dimensional, as three-dimensional shapes do not depend on the viewpoint from which they are observed.
Additionally, pathological studies at the time found that some patients could not recognize a “teacup” yet could easily draw its shape, which Marr took as further support for his speculation. Current research on the brain indicates that the brain’s functions are compartmentalized.
The “geometric shape” and “semantics” of an object are stored in different brain regions. Furthermore, object recognition is not absolutely independent of viewpoint, but only independent within a relatively small range of variations. Therefore, based on current research, Marr’s conjecture about the “three-dimensional representation” of objects is fundamentally incorrect, or at least not entirely correct, but Marr’s computational theory still holds significant theoretical meaning and practical value.
In short, Marr’s visual computational theory of “object representation” refers to the “three-dimensional shape representation in the object coordinate system.” Note that, mathematically, a three-dimensional geometric shape can have different expression functions depending on the chosen coordinate system. For instance, a sphere can be simply expressed as: x^2+y^2+z^2=1 if the center of the sphere is chosen as the origin of the coordinate system.
However, if the observer is at a position twice the radius away along the x-axis, the visible part of the sphere in the observer’s coordinate system would be expressed as: x=2-sqrt(1-y^2-z^2). This shows that the same object can have different expression methods depending on the chosen coordinate system. Marr referred to the “three-dimensional geometric shape representation in the observer’s coordinate system” as “2.5-dimensional representation,” while the representation in the object coordinate system is referred to as “three-dimensional representation.” Therefore, in the subsequent algorithm section, Marr focused on how to calculate the “2.5-dimensional representation” from images and then convert it into the calculation method and process for the “three-dimensional representation.”
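
As a quick numerical check of the two expressions above (a unit sphere centered at the origin of the object frame, with the observer on the x-axis at distance 2 and depth measured along the viewing direction), the sketch below confirms that the object-frame and observer-frame expressions describe the same visible surface point:

```python
import numpy as np

# Unit sphere in the object coordinate system: x^2 + y^2 + z^2 = 1.
# The observer sits on the x-axis at distance 2 from the sphere center.
y, z = 0.3, 0.4                                    # a point on the visible hemisphere
x_object = np.sqrt(1 - y**2 - z**2)                # object-frame expression (3D representation)
x_observer = 2 - np.sqrt(1 - y**2 - z**2)          # observer-frame depth (2.5D representation)

assert np.isclose(x_object + x_observer, 2.0)      # same surface point, two coordinate systems
print(x_object, x_observer)
```
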
The algorithm section is the main content of Marr’s computational vision. Marr believed that to move from image to three-dimensional representation, three computational levels must be traversed: first, obtaining some primitives from the image (primal sketch), then promoting these primitives to a 2.5-dimensional representation through stereo vision and other modules, and finally elevating them to a three-dimensional representation.
The following diagram summarizes the algorithm process of Marr’s visual computational theory:
Figure 1: The three computational levels of algorithms in Marr’s computational theory
As shown in Figure 1, the first step is to extract edge information from the image (zero-crossings of the second derivative), then to extract primitives such as blobs, edge segments, and bars, and subsequently to combine these primary primitives (raw primal sketch) into complete primitives (full primal sketch). This process is the feature-extraction phase of the visual computational theory. Based on this, the primitives are elevated to a 2.5-dimensional representation through the stereo vision and motion vision modules.
Finally, the 2.5-dimensional representation is elevated to a three-dimensional representation. In Marr’s book, “Vision,” the focus is on the computation methods corresponding to feature extraction and 2.5-dimensional representation. In the 2.5-dimensional representation section, only stereo vision and motion vision sections are emphasized.
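
A minimal sketch of the first step in Figure 1, extracting edge primitives as zero-crossings of a Laplacian-of-Gaussian response (in the spirit of the Marr-Hildreth edge detector); the synthetic test image and the scale sigma are illustrative choices:

```python
import numpy as np
from scipy import ndimage

def log_zero_crossings(image, sigma=2.0):
    """Edge primitives as zero-crossings of the Laplacian of Gaussian."""
    log = ndimage.gaussian_laplace(image.astype(float), sigma=sigma)
    # A pixel is marked as a zero-crossing if its LoG response changes sign
    # against its right or lower neighbour.
    zc = np.zeros_like(log, dtype=bool)
    zc[:, :-1] |= np.signbit(log[:, :-1]) != np.signbit(log[:, 1:])
    zc[:-1, :] |= np.signbit(log[:-1, :]) != np.signbit(log[1:, :])
    return zc

# Usage on a synthetic image: a bright square on a dark background.
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0
edges = log_zero_crossings(img, sigma=2.0)
print(edges.sum(), "edge pixels found")
```
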
Since the relative positions of the two eyes (left and right cameras) are known (referred to as the camera’s extrinsic parameters in computer vision), stereo vision transforms into the “correspondence problem of left and right image points” (image point correspondence). Therefore, Marr emphasizes the issue of matching image points in the stereo vision section, specifically how to eliminate mismatches and provides corresponding algorithms.
The three-dimensional points obtained from stereo vision and similar computations are still expressed in the “observer’s coordinate system” and therefore constitute only a 2.5-dimensional representation of the object. On how to further elevate this to a three-dimensional representation in the object coordinate system, Marr offers some ideas, but this part is quite vague. For instance, his idea of determining an object’s principal axes is similar to the “skeleton models” later proposed by other researchers.
It is worth noting that Marr’s visual computational theory is a theoretical system. Within this system, specific computational modules can be further enriched to construct a “general vision system.” Unfortunately, Marr (Jan. 15, 1945 – Nov. 17, 1980) passed away due to leukemia at the end of 1980, and his book “Vision” was published posthumously.
Marr’s untimely death is undoubtedly a significant loss to the computer vision community. Due to Marr’s contributions, the biennial International Conference on Computer Vision (ICCV) awards the Marr Prize for the best paper at the conference. Additionally, there is also a Marr Prize in the field of cognitive science due to Marr’s substantial contributions to cognitive science.
Establishing awards in different fields for the same person is quite rare, underscoring the profound impact Marr has had on computer vision.
As S. Edelman and L. M. Vaina noted in the “International Encyclopedia of the Social & Behavioral Sciences,” “Marr’s early integration of mathematics and neurobiology for understanding the brain has earned him a significant place in the scientific hall of fame of British empiricism for two and a half centuries… however, he further proposed a more influential computational vision theory.”
Thus, it is indeed regrettable that those engaged in computer vision research are not familiar with Marr’s computational vision.

2

The Ephemeral Stage of Active and Purpose-Driven Vision

Many people do not introduce this part as a separate section when discussing computer vision, mainly because “active vision and purpose-driven vision” have not had a sustained impact on subsequent research in computer vision. However, as an important stage in the development of computer vision, it is still necessary to introduce it briefly.
After Marr’s visual computational theory was proposed in the early 1980s, a wave of interest in “computer vision” arose in academia. One direct application of this theory was to endow industrial robots with visual capabilities, with the typical system being the so-called “parts-based system.”
However, over ten years of research revealed that, despite Marr’s computational vision theory being very elegant, it lacked sufficient “robustness” and was challenging to apply broadly in the industrial sector as envisioned. This led to skepticism regarding the rationality of this theory, with even sharp criticisms emerging.
Two main criticisms of Marr’s computational vision theory are: one is that this three-dimensional reconstruction process is a “pure bottom-up process,” lacking high-level feedback; the second is that “reconstruction” lacks “purposefulness and proactivity.” Due to varying requirements for reconstruction precision based on different usages, it seems unreasonable to “blindly reconstruct a three-dimensional model suitable for any task” without considering specific tasks.
Representative figures criticizing Marr’s visual computational theory include J. Y. Aloimonos from the University of Maryland, R. Bajcsy from the University of Pennsylvania, and A. K. Jain from Michigan State University. Bajcsy argued that the visual process inevitably involves interaction between humans and the environment and proposed the concept of active vision. Aloimonos believed that vision should be purposeful and that in many applications strict three-dimensional reconstruction is unnecessary, proposing the concepts of “purposive and qualitative vision.”
Jain emphasized the importance of application and proposed the concept of “practicing vision.” From the late 1980s to the early 1990s, it can be said that this was a “wandering phase” in the field of computer vision, where the criticisms were incessant, and the future of vision appeared uncertain.
In response to this situation, a well-known publication in the vision field (CVGIP: Image Understanding) organized a special issue in 1994 to debate the computational vision theory.
Initially, M. J. Tarr from Yale University and M. J. Black from Brown University wrote a highly controversial position paper (Tarr & Black, 1994), arguing that Marr’s computational vision does not exclude proactivity, and that overemphasizing “applied vision” at the expense of Marr’s “general vision theory” is a “myopic” approach. Although general vision cannot be strictly defined, “human vision” serves as its best model.
Following the publication of this opinion piece, over 20 renowned vision experts worldwide expressed their views and comments. The general consensus was that while “proactivity” and “purposefulness” are reasonable, the challenge lies in how to provide new theories and methods.

However, many of the active vision methods proposed at that time were merely algorithmic improvements lacking theoretical innovations, and they could entirely fit within Marr’s computational vision framework.
Therefore, after this visual debate in 1994, active vision has not made much substantial progress in the computer vision community.
This “wandering phase” did not last long and had minimal impact on the subsequent development of computer vision, resembling a fleeting moment.
It is worth noting that while “active vision” is a very good concept, the difficulty lies in “how to compute it.” Active vision often requires “visual attention” and necessitates studying the feedback mechanisms from high-level areas of the cerebral cortex to lower-level areas. Even today, despite significant advancements in brain science and neuroscience compared to 20 years ago, there is still a lack of “computational progress” to provide substantial references for computer vision researchers.
In recent years, the development of various brain imaging techniques, particularly the progress in “connectomics,” is expected to provide computer vision researchers with insights into the “feedback pathways and connection strengths” in studying brain feedback mechanisms.

3

Multi-View Geometry and Hierarchical 3D Reconstruction

In the early 1990s, computer vision transitioned from a “depression” to further “prosperity” mainly due to two factors: first, the targeted application field shifted from “industrial applications” requiring high precision and robustness to areas with lower requirements, particularly those needing only “visual effects,” such as remote video conferencing, archaeology, virtual reality, and video surveillance. On the other hand, it was discovered that the multi-view geometry theory could effectively improve the robustness and accuracy of three-dimensional reconstruction.
The representative figures in multi-view geometry include O. Faugeras from INRIA, R. Hartley from GE Research (now back at the Australian National University), and A. Zisserman from the University of Oxford.
It can be said that the theory of multi-view geometry was basically perfected by 2000. The book co-authored by Hartley and Zisserman in 2000 (Hartley & Zisserman 2000) provided a systematic summary of this content, while subsequent work mainly focused on how to improve the computational efficiency of robust reconstruction under big data conditions. Big data requires fully automated reconstruction, and full automation necessitates repeated optimization, which consumes substantial computational resources. Therefore, how to achieve rapid three-dimensional reconstruction of large scenes while ensuring robustness has become a focal point of later research.
For example, if one wants to reconstruct the three-dimensional structure of the Zhongguancun area in Beijing, a large number of ground and drone images would need to be obtained to ensure the completeness of the reconstruction.
Suppose 10,000 high-resolution ground images (4000×3000) and 5,000 high-resolution drone images (8000×7000) were obtained (this scale of images is typical today). Three-dimensional reconstruction would require matching these images, selecting suitable image sets, calibrating camera position information, and reconstructing the three-dimensional structure of the scene. Given such a large volume of data, manual intervention is impossible, so the entire three-dimensional reconstruction process must be fully automated.
This necessitates that the reconstruction algorithms and systems possess very high robustness; otherwise, full automation of three-dimensional reconstruction would be impossible. Under the assurance of robustness, the efficiency of three-dimensional reconstruction is also a significant challenge. Therefore, current research in this area focuses on how to quickly and robustly reconstruct large scenes.
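
As a rough illustration of one building block of such a fully automatic pipeline, the sketch below matches SIFT features between two overlapping images with OpenCV and recovers their relative camera pose; the image file names and the intrinsic matrix K are placeholder assumptions, and a real system would chain thousands of such pairs and refine everything with bundle adjustment.

```python
import cv2
import numpy as np

# Placeholder inputs: two overlapping images and an assumed pinhole intrinsic matrix K.
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)
K = np.array([[3000.0, 0.0, 2000.0],
              [0.0, 3000.0, 1500.0],
              [0.0, 0.0, 1.0]])

# Detect and describe local features in both images.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Nearest-neighbour matching with Lowe's ratio test to discard ambiguous matches.
matches = []
for pair in cv2.BFMatcher().knnMatch(des1, des2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        matches.append(pair[0])
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Essential matrix with RANSAC (robust to mismatches), then the relative pose R, t.
E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
print("relative rotation:\n", R, "\nrelative translation direction:", t.ravel())
```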

Multi-View Geometry

Since the imaging process of images is a central projection process, “multi-view geometry” essentially studies the constraint theory and computational methods between corresponding points in images under projective transformations, as well as between spatial points and their projected image points (note: the pinhole camera model is a type of central projection; when the camera has distortion, the distorted image points must be corrected to undistorted points before using the multi-view geometry theory).
In the field of computer vision, multi-view geometry primarily studies the epipolar geometry constraints between two images, the trifocal tensor constraints between three images, and the homography constraints between spatial plane points and image points or between spatial points and their projections on multiple images.
In multi-view geometry, invariants under projective transformations, such as the image of the absolute conic, the image of the absolute quadric, and the homography induced by the plane at infinity, are crucial concepts that serve as “referential objects” for camera self-calibration.
Since these quantities are projections of “referential objects” at infinity onto images, they are independent of the camera’s position and motion (in principle, any finite motion does not affect the properties of objects at infinity), so these “projective invariants” can be used for camera self-calibration. For detailed content regarding multi-view geometry and camera self-calibration, please refer to the book co-authored by Hartley and Zisserman (Hartley & Zisserman, 2000).
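
To make the two-view constraint concrete: for corresponding image points x and x' the fundamental matrix F satisfies x'^T F x = 0, and F can be estimated linearly from eight or more matches. Below is a minimal normalized eight-point estimate in plain NumPy; the matched points are assumed given, and real pipelines wrap such an estimate in RANSAC and refine it nonlinearly.

```python
import numpy as np

def normalize(pts):
    """Translate points to zero mean and scale them to average distance sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0, -scale * mean[0]],
                  [0, scale, -scale * mean[1]],
                  [0, 0, 1]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def eight_point(pts1, pts2):
    """Normalized eight-point algorithm: solve x2^T F x1 = 0 over all matches."""
    x1, T1 = normalize(pts1)
    x2, T2 = normalize(pts2)
    A = np.column_stack([x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
                         x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
                         x1[:, 0], x1[:, 1], np.ones(len(x1))])
    _, _, Vt = np.linalg.svd(A)                 # least-squares solution = last right singular vector
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)                 # enforce the rank-2 constraint det(F) = 0
    F = U @ np.diag([S[0], S[1], 0]) @ Vt
    F = T2.T @ F @ T1                           # undo the normalization
    return F / F[2, 2]
```
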
Overall, in terms of theory, multi-view geometry cannot be considered new content within projective geometry. Hartley, Faugeras, Zisserman, and others introduced multi-view geometry theory into computer vision, proposing hierarchical three-dimensional reconstruction theory and camera self-calibration theory, enriching Marr’s three-dimensional reconstruction theory, enhancing the robustness of three-dimensional reconstruction, and adapting it to big data, thus significantly promoting the application range of three-dimensional reconstruction. Therefore, the study of multi-view geometry in computer vision is an important stage and event in the development history of computer vision.
Multi-view geometry requires a mathematical foundation in projective geometry. Projective geometry is non-Euclidean and involves abstract concepts such as parallel lines intersecting and parallel planes intersecting, and expression and computation must be conducted in homogeneous coordinates, which poses considerable difficulties for engineering students. Therefore, anyone intending to engage in this research must first establish a solid foundation, at least possessing the necessary knowledge of projective geometry. Otherwise, working in this area would be akin to wasting time.

Hierarchical 3D Reconstruction

Hierarchical 3D reconstruction, as illustrated in the following diagram, refers to the process of recovering the three-dimensional structure of Euclidean space from multiple two-dimensional images, not through a single step from images to the three-dimensional structure in Euclidean space, but rather through a stepwise hierarchical approach.
Specifically, it involves first reconstructing corresponding spatial points under projective space from corresponding points in multiple images (i.e., projective reconstruction), then elevating the points reconstructed in projective space to affine space (i.e., affine reconstruction), and finally elevating the points reconstructed in affine space to Euclidean space (or metric space: metric reconstruction) (note: metric space differs from Euclidean space by a constant factor).
Since hierarchical 3D reconstruction relies solely on images for spatial point reconstruction and lacks known “absolute scales” (such as “the length of the window is 1 meter”), it can only recover spatial points to metric space from images.
Introduction to Computer Vision: History, Current Status, and Development Trends
Figure 2: Hierarchical 3D reconstruction diagram
Several concepts require clarification. Taking the three-dimensional reconstruction of spatial points as an example, “projective reconstruction” refers to reconstructed points whose coordinates differ from their Euclidean coordinates by a “projective transformation”; “affine reconstruction” refers to reconstructed points whose coordinates differ from their Euclidean coordinates by an “affine transformation”; and “metric reconstruction” refers to reconstructed points whose coordinates differ from their Euclidean coordinates by a “similarity transformation.”
Almost any visual problem can ultimately be cast as a nonlinear optimization problem over many parameters, and the difficulty of nonlinear optimization lies in finding a reasonable initial value. Generally, the more parameters to be optimized, the more complex the solution space, and the harder it becomes to find suitable initial values. Thus, if an optimization problem can be solved by grouping the parameters and optimizing them step by step, the difficulty of the optimization can usually be reduced significantly.
The computational reasonableness of hierarchical 3D reconstruction leverages this “grouped stepwise” optimization strategy. For example, direct reconstruction of three-dimensional points in metric space from corresponding points in images requires nonlinear optimization of 16 parameters (assuming the camera’s internal parameters remain unchanged: 5 camera internal parameters, the rotation and translation parameters of the second and third images relative to the first image, excluding one constant factor, so 5+2×(3+3)-1=16), which is a very challenging optimization problem. However, moving from image corresponding points to projective reconstruction requires a linear estimation of 22 parameters, and since it is a linear optimization, the optimization problem is not difficult.
Moving from projective reconstruction to affine reconstruction requires nonlinear optimization of three parameters (the three planar parameters of the infinite plane), while moving from affine reconstruction to metric reconstruction requires nonlinear optimization of five parameters (the five internal parameters of the camera). Consequently, hierarchical 3D reconstruction only needs to solve the nonlinear optimization problems of 3 and 5 parameters step by step, significantly reducing the computational complexity of three-dimensional reconstruction.
Another characteristic of hierarchical 3D reconstruction is its theoretical elegance. Under projective reconstruction, the projections of spatial lines remain straight, and two intersecting lines will still project to intersecting lines; however, the parallelism and orthogonality of spatial lines are no longer preserved. In affine reconstruction, parallelism can be maintained, but orthogonality cannot. In metric reconstruction, both parallelism and orthogonality can be preserved. In practical applications, these properties can be utilized to progressively enhance reconstruction results.
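
The two upgrade steps can be written as 4×4 homographies applied to the reconstructed homogeneous points, with the three plane-at-infinity parameters fixing the affine upgrade and the five intrinsic parameters fixing the metric upgrade. The sketch below is a minimal illustration of this composition, not a reconstruction algorithm; it assumes the projective frame is chosen so that the first camera is the canonical [I | 0], and the values of p and K are made up.

```python
import numpy as np

# Plane at infinity in the projective frame, written as (p, 1), and assumed intrinsics K.
p = np.array([0.1, -0.2, 0.3])
K = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])

H_pa = np.eye(4)                               # projective -> affine: 3 parameters (plane at infinity)
H_pa[3, :3] = p
H_am = np.eye(4)                               # affine -> metric: 5 parameters (camera intrinsics)
H_am[:3, :3] = np.linalg.inv(K)

X_proj = np.array([1.0, 2.0, 3.0, 1.0])        # a homogeneous point from the projective reconstruction
X_metric = (H_am @ H_pa) @ X_proj              # upgraded point, defined up to a similarity
print(X_metric / X_metric[3])

# Cameras transform contravariantly: the canonical first camera becomes K[I | 0].
P1_proj = np.hstack([np.eye(3), np.zeros((3, 1))])
P1_metric = P1_proj @ np.linalg.inv(H_am @ H_pa)
print(np.allclose(P1_metric, K @ P1_proj))     # True
```
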
The theory of hierarchical 3D reconstruction can be considered one of the most important and influential theories in the field of computer vision since Marr’s computational vision theory was proposed. Many large companies’ 3D vision applications, such as Apple’s 3D maps, Baidu’s 3D maps, Nokia’s Streetview, and Microsoft’s virtual earth, rely on hierarchical 3D reconstruction technology as a crucial core supporting technology.

Camera Self-Calibration

Camera calibration, in a narrow sense, is the process of determining the internal mechanical and optical parameters of a camera, such as focal length, the intersection of the optical axis with the image plane, etc. Although cameras are marked with some standard parameters when they leave the factory, these parameters are generally not precise enough for direct application in three-dimensional reconstruction and visual measurements. Therefore, to improve the accuracy of three-dimensional reconstruction, these internal parameters of the camera need to be estimated. The process of estimating the camera’s internal parameters is called camera calibration. In the literature, sometimes the estimation of the camera’s coordinates in a given object coordinate system or the mutual positioning relationship between cameras is referred to as extrinsic parameter calibration. However, unless explicitly specified, camera calibration typically refers to the calibration of internal parameters of the camera.
Camera calibration includes two aspects: “selecting an imaging model” and “estimating model parameters.” During camera calibration, it is first necessary to determine a “reasonable camera imaging model,” such as whether it is a pinhole model and whether there is distortion. Currently, there is no solid guiding theory regarding camera model selection; it can only be determined based on specific cameras and applications.
With advancements in camera manufacturing processes, ordinary cameras (excluding special cameras such as fisheye or ultra-wide-angle lenses) can generally be described adequately by the pinhole imaging model with first- or second-order radial distortion; other distortions are minimal and can often be disregarded. Once the camera imaging model is determined, the corresponding model parameters must be estimated. In the literature, people often reduce camera calibration to the estimation of the imaging-model parameters, which is not the whole picture. In fact, selecting the camera model is the most critical step in camera calibration.
If a camera is undistorted but distortion is considered during calibration, or if distortion is present but not considered, significant errors will arise. Visual application personnel should pay particular attention to the issue of “camera model selection.”
Camera parameter estimation typically requires a “calibration reference object” with a known three-dimensional structure, such as a planar checkerboard or a stereo block. Camera calibration is the process of establishing constraint equations for model parameters using the known calibration reference object and its projected image under a known imaging model, thus estimating the model parameters.
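
A hedged sketch of such classical calibration with a planar checkerboard in OpenCV; the board geometry (9×6 inner corners, 25 mm squares) and the image folder are placeholder assumptions:

```python
import glob
import cv2
import numpy as np

# Assumed setup: a 9x6 inner-corner checkerboard with 25 mm squares,
# photographed from several different viewpoints.
pattern = (9, 6)
square = 25.0
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob("calib/*.jpg"):                    # placeholder image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)                          # known 3D board points
        img_points.append(corners)                       # their detected image projections

# Estimate the pinhole model with radial distortion: K (5 intrinsics) plus distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection error:", rms, "\nK =\n", K)
```
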
“Self-calibration” refers to the process of estimating model parameters using only the correspondence relationships between image feature points, without the need for specific physical calibration reference objects.
“Traditional calibration” requires the use of calibration reference objects with known dimensions, while self-calibration does not require such physical calibration objects, as previously mentioned in the multi-view geometry section, using abstract “absolute conics” and “absolute quadrics” at infinity as references. From this perspective, self-calibration also requires reference objects, albeit “virtual reference objects at infinity.”
Camera self-calibration relies on constraints between images, such as the fundamental matrix and the essential matrix between two images and the trifocal tensor constraints among three images; the Kruppa equations are another important tool. These topics are essential content of multi-view geometry and will be detailed in subsequent chapters.

4

Learning-Based Vision

Learning-based vision refers to computer vision research primarily utilizing machine learning techniques. The literature generally divides learning-based vision research into two stages: the early 2000s, represented by manifold learning and subspace methods, and the current stage, represented by deep neural networks and deep learning.

Manifold Learning

As previously mentioned, object representation is the core issue in object recognition. Given an image object, such as a face image, different representations yield varying classification and recognition rates. Furthermore, directly using image pixels as representations constitutes “over-representation” and is not a good representation. Manifold learning theory posits that an image object exists within its “intrinsic manifold,” which serves as a high-quality representation of that object.
Thus, manifold learning is the process of learning the intrinsic manifold representation from image representations, which generally involves a nonlinear optimization process.
Manifold learning began with two papers published in Science in 2000 (Tenenbaum et al., 2000; Roweis & Saul, 2000). A challenging aspect of manifold learning is the lack of a rigorous theory for determining the dimension of the intrinsic manifold.
It has been found that in many cases the results of manifold learning are not as good as those of traditional PCA, LDA, and MDS methods. Representative manifold learning methods include LLE (Locally Linear Embedding) (Roweis & Saul, 2000), Isomap (Tenenbaum et al., 2000), and Laplacian Eigenmaps (Belkin & Niyogi, 2001).
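
A small illustration with scikit-learn: embedding the standard 3D “Swiss roll” data, whose intrinsic manifold is two-dimensional, with Isomap and LLE alongside a linear PCA baseline. The neighbourhood size and the target dimension are illustrative choices, not values from the text.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# A 3D "Swiss roll" whose intrinsic manifold is 2-dimensional.
X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

embeddings = {
    "PCA":    PCA(n_components=2).fit_transform(X),           # linear baseline
    "Isomap": Isomap(n_neighbors=10, n_components=2).fit_transform(X),
    "LLE":    LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                                     random_state=0).fit_transform(X),
}
for name, Y in embeddings.items():
    print(name, Y.shape)   # each method maps the 3D points to a 2D intrinsic representation
```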

Deep Learning

The success of deep learning (LeCun et al. 2015) is primarily attributed to the accumulation of data and improvements in computational power. The concept of deep networks was proposed in the 1980s, but at that time, it was found that “deep networks” performed worse than “shallow networks,” which hindered their development.
Currently, it seems that computer vision is increasingly perceived as an application of deep learning, as evidenced by the papers published in recent years at the three major international conferences in computer vision: the International Conference on Computer Vision (ICCV), the European Conference on Computer Vision (ECCV), and the Conference on Computer Vision and Pattern Recognition (CVPR). The current basic situation is that people are using deep learning to “replace” traditional methods in computer vision. Researchers have become “machines for tuning programs,” which is an abnormal “mass movement.”
Concise, explanatory theories such as Newton’s law of universal gravitation, Maxwell’s equations of electromagnetism, Einstein’s mass-energy equation, and Schrödinger’s equation in quantum mechanics still seem to be the kind of goal that researchers should pursue.
Regarding deep networks and deep learning, the following points should be emphasized:
(1) Deep learning has demonstrated significant advantages over traditional methods in object vision, but in spatial vision, such as three-dimensional reconstruction and object localization, it still cannot compete with geometric methods. This is primarily because deep learning struggles to handle mismatches between image features.
In geometry-based three-dimensional reconstruction, robust outlier-removal modules such as RANSAC (Random Sample Consensus) can be applied repeatedly, whereas integrating RANSAC-like outlier-removal mechanisms into deep learning remains difficult (a minimal sketch of such a module is given after this list). The author believes that if deep networks cannot effectively integrate outlier-removal modules, deep learning will struggle to compete with geometric methods in three-dimensional reconstruction and may even face challenges in spatial vision applications more broadly.
(2) Deep learning has matured in static image object recognition, which is why the object classification competition on ImageNet is no longer held;
(3) Current deep networks are primarily feedforward networks. Different networks mainly differ in the cost functions used. The next step is expected to explore hierarchical networks with “feedback mechanisms.” Feedback mechanisms need to draw on the mechanisms of brain neural networks, particularly the results of connectomics.
(4) Currently, for video processing, RCNN (recurrent neural networks) has been proposed. Recurrence is an effective mechanism for same-layer interactions, but it cannot replace feedback. The long-distance feedback in the cerebral cortex (introduced in the chapter on biological vision) may be the neural basis for the different specific functions of the various cortical regions. Therefore, research on feedback mechanisms, especially deep networks with “long-distance feedback” (across multiple layers), will be an important direction for future research in image understanding;
(5) Despite the transformative achievements of deep learning and deep networks in image object recognition, there is still a lack of solid theoretical foundations to explain why “deep learning” has yielded such excellent results. Some research in this area has been conducted, but systematic theory is still lacking.
In fact, “hierarchical structure” is essential; not only deep networks but also other hierarchical models, such as the Hmax model (Riesenhuber & Poggio, 1999) and the HTM (Hierarchical Temporal Memory) model (George & Hawkins, 2009), face the same theoretical confusion. The reason why “hierarchical structures” have advantages remains a significant mystery.
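
Returning to point (1) above: the sketch below is a generic RANSAC loop in NumPy, fitting a 2D line to data contaminated with gross outliers. It illustrates the explicit, model-based outlier rejection that is hard to express inside an end-to-end network; the thresholds, iteration count, and synthetic data are illustrative choices.

```python
import numpy as np

def ransac_line(points, n_iters=200, threshold=0.05, rng=np.random.default_rng(0)):
    """Fit y = a*x + b robustly: fit on minimal samples, keep the largest consensus set."""
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if np.isclose(x1, x2):
            continue                                      # degenerate minimal sample
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = residuals < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit by least squares on the consensus set only, ignoring the outliers.
    a, b = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], deg=1)
    return a, b, best_inliers

# Synthetic data: points on y = 2x + 1 with small noise, plus 20 gross outliers.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
y = 2 * x + 1 + rng.normal(0, 0.01, 100)
y[:20] += rng.uniform(-5, 5, 20)                          # corrupt 20 points
a, b, inliers = ransac_line(np.column_stack([x, y]))
print(f"a≈{a:.2f}, b≈{b:.2f}, inliers={inliers.sum()}")
```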

Several Development Trends in Computer Vision

The rapid development of information science makes predicting the development trends over the next decade feel somewhat like “fortune-telling.” For computer vision, the author has the following outlook for future developments:
(1) Learning-based object vision and geometry-based spatial vision will continue to operate “independently.” Deep learning is unlikely to replace geometric vision in the short term. How to integrate a “robust outlier removal module” into deep networks will be an exploratory direction, but substantial progress is unlikely in the near term;
(2) Vision-based localization will increasingly trend toward “applied research,” particularly in multi-sensor fusion localization technologies;
(3) Three-dimensional point cloud reconstruction technology has matured; how to transition from “point clouds” to “semantics” will be a key research focus. “Semantic reconstruction” involves simultaneously conducting point cloud reconstruction, object segmentation, and object recognition, which is a prerequisite for practical three-dimensional reconstruction;
(4) For outdoor scene three-dimensional reconstruction, how to reconstruct models that comply with “urban management standards” is a problem to be solved. For indoor scene reconstruction, the most significant potential application is “home service robots.” Given the lack of specific application needs and drivers for indoor reconstruction, coupled with the complexity of indoor environments, significant breakthroughs in the next 3-5 years are unlikely;
(5) For object recognition, learning-based object recognition is likely to evolve from “general recognition” to “specific domain object recognition.” “Specific domains” can provide clearer and more specific prior information, which can effectively improve recognition accuracy and efficiency, making it more practical;
(6) The current trend based on RCNN for video understanding will continue;
(7) Analyzing the mechanisms of deep networks holds significant theoretical significance and challenges; given the complexity of deep networks, substantial breakthroughs are unlikely in the near term;
(8) Research into deep network structures (architecture) with “feedback mechanisms” will undoubtedly be the next research hotspot.

Several Typical Theories of Object Representation

As previously mentioned, object representation is a core scientific issue in computer vision. Here, “object representation theory” should be distinguished from “object representation model”: “representation theory” refers to approaches that are widely recognized in the literature, whereas “representation model” could easily be misunderstood as merely “a mathematical description of a particular object.” In the field of computer vision, the well-known object representation theories include the following three:

1

Marr’s Three-Dimensional Object Representation

As introduced earlier, Marr’s visual computational theory posits that the representation of an object is its three-dimensional representation in the object coordinate system.

2

Two-Dimensional Image-Based Object Representation

Although theoretically, a three-dimensional object can be imaged as infinitely many different two-dimensional images, the human visual system can only recognize a “limited number of images.” Given the advancements in neuroscience regarding the ventral pathway in monkeys (which is thought to be the object recognition pathway), T. Poggio and others proposed the concept of image-based object representation (Poggio & Bizzi, 2004), which posits that the expression of a three-dimensional object is a set of typical two-dimensional images of that object.
Currently, some believe that Poggio et al.’s “views” should not be narrowly understood as two-dimensional images but also include three-dimensional representations under the observer’s coordinate system, i.e., Marr’s 2.5-dimensional representation (Anzai & DeAngelis, 2010).

3

Inverse Generative Model Representation

For a long time, it has been believed that object recognition models are “discriminative models” rather than “generative models.” Recent research on object recognition in the ventral pathway of monkeys indicates that the IT region of the monkey’s cerebral cortex (Inferior Temporal: object expression area) may encode the object and its imaging parameters (such as lighting, posture, geometry, texture, etc.) (Yildirim et al. 2015) (Yamins & DiCarlo, 2016b).
Since knowing these parameters allows for the generation of corresponding images, encoding these parameters can be considered inverse generative model representation. Inverse generative model representation can explain why the encoder-decoder networks in deep learning (Badrinarayanan et al. 2015) can achieve relatively good results, as the encoder is essentially the inverse generative model of the image.
Additionally, the concept of “inverse graphics” proposed in deep learning (Kulkarni et al. 2015) is fundamentally an inverse generative model. Inverse graphics refers to learning the image generation parameters from images and then classifying images of the same object under different parameters as the same object, achieving final “invariant object recognition” through this “equivariant recognition.”
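
A minimal convolutional encoder-decoder sketch in PyTorch, meant only to illustrate the structural idea discussed above: the encoder maps an image to a compact code (playing the role of inferred generation parameters), and the decoder regenerates the image from that code. It is a generic autoencoder trained with a reconstruction loss, not the specific networks cited above.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Encoder infers a compact code from the image; decoder generates the image back."""
    def __init__(self, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(                               # 1x64x64 -> code_dim
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),    # 16x32x32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32x16x16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, code_dim),
        )
        self.decoder = nn.Sequential(                               # code_dim -> 1x64x64
            nn.Linear(code_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)           # "inverse generative" parameters of the image
        return self.decoder(code), code

model = EncoderDecoder()
x = torch.rand(8, 1, 64, 64)             # a batch of dummy images
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction objective
print(recon.shape, code.shape, loss.item())
```
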
In summary, this article summarizes and forecasts the theories, current status, and future development trends of computer vision, hoping to provide readers with some assistance in understanding this field. It is particularly important to note that much of the content here is merely a summary of the author’s “personal views” and “personal preferences,” aimed at helping readers without causing misguidance.
Furthermore, the author always believes that there are not many core key documents in any discipline, and to facilitate readers’ reading, this article only presents some necessary representative literature.

References

Marr D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman and Company.
Yamins D. L. K. & DiCarlo J. J. (2016a). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, Vol. 19, No. 3, pp. 356-365.
Yamins D. L. K. et al. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. PNAS, Vol. 111, No. 23, pp. 8619-8624.
Yamins D. L. K. & DiCarlo J. J. (2016b). Explicit information for category-orthogonal object properties increases along the ventral stream. Nature Neuroscience, Vol. 19, No. 4, pp. 613-622.
Edelman S. & Vaina L. M. (2015). Marr, David (1945–80). International Encyclopedia of the Social & Behavioral Sciences (Second Edition), pp. 596-598.
Tarr M. J. & Black M. J. (1994). A computational and evolutionary perspective on the role of representation in vision. CVGIP: Image Understanding, Vol. 60, No. 1, pp. 65-73.
Hartley R. & Zisserman A. (2000). Multiple View Geometry in Computer Vision. Cambridge University Press.
Faugeras O. (1993). Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press.
Tenenbaum J. B. et al. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, Vol. 290, No. 5500, pp. 2319-2323.
Roweis S. & Saul L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, Vol. 290, No. 5500, pp. 2323-2326.
Belkin M. & Niyogi P. (2001). Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems 14, pp. 586-691, MIT Press.
LeCun Y. et al. (2015). Deep learning. Nature, Vol. 521, pp. 436-444.
Riesenhuber M. & Poggio T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, Vol. 2, pp. 1019-1025.
George D. & Hawkins J. (2009). Towards a mathematical theory of cortical micro-circuits. PLoS Computational Biology, Vol. 5, No. 10, pp. 1-26.
Poggio T. & Bizzi E. (2004). Generalization in vision and motor control. Nature, Vol. 431, pp. 768-774.
Anzai A. & DeAngelis G. (2010). Neural computations underlying depth perception. Current Opinion in Neurobiology, Vol. 20, No. 3, pp. 367-375.
Yildirim I. et al. (2015). Efficient analysis-by-synthesis in vision: a computational framework, behavioral tests, and comparison with neural representations. In Proceedings of the 37th Annual Conference of the Cognitive Science Society.
Badrinarayanan V. et al. (2015). SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561.
Kulkarni T. D. et al. (2015). Deep convolutional inverse graphics network. NIPS 2015.
