Before sharing my learning experiences, allow me to introduce myself. I have been studying computer vision for nearly two years and am currently a first-year master’s student whose research focuses on visual SLAM.
Computer vision has long held an important position in the development of the artificial intelligence industry. Like other disciplines, the field has gone through continuous technical innovation and iteration by countless researchers, which has driven the thriving AI industry we see today. Broadly speaking, computer vision is the discipline of ‘endowing machines with natural visual capabilities,’ where natural visual capability refers to the abilities exhibited by biological visual systems. Put simply, it is the technology that enables computers to understand images and videos, which is why some aptly call computer vision the eyes of robots in the era of artificial intelligence.
Computer vision is a vast field, and based on my personal learning experience, I have summarized the following insights for newcomers entering CV:
1. Required Skills
1. Basic Skills
These are mainly mathematical and theoretical foundations, such as linear algebra, probability theory and mathematical statistics, mathematical analysis, deep learning, 3D rigid-body motion, camera imaging principles, statistical machine learning, visual neuroscience, convex optimization, etc.
2. Intermediate Skills
Extracting information from images and video streams using mathematical principles and physical models, such as image segmentation, image classification, object detection, object tracking, and video (sequential-image) analysis.
3. Application Layer Skills
Computer vision has a wide range of applications, including face recognition, gesture recognition, image recognition, image retrieval, OCR, neural network chips, medical imaging diagnosis, autonomous driving, industrial vision, 3D reconstruction, the integration of vision and NLP, intelligent video analysis, etc.
How does this differ from image processing? Image processing generally refers to analyzing and manipulating digital images with algorithms for classification, extraction, editing, and filtering, primarily to enhance the information in an image, whereas computer vision aims to understand images and put them to practical use.
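To make that distinction concrete, here is a minimal Python/OpenCV sketch of my own; the file name input.jpg, the filter parameters, and the bundled Haar face detector are only placeholders for illustration, not a recommendation of any particular method.

```python
# Image processing vs. computer vision, in a few lines (assumes OpenCV is
# installed and an image file named "input.jpg" exists in the working directory).
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Image processing: transform pixels to enhance or extract low-level information.
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.5)        # noise suppression
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)  # edge map

# Computer vision: interpret the content, e.g. detect faces with a pretrained
# Haar cascade that ships with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = cascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5)
print(f"{len(faces)} face(s) detected")
```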
2. Current Limitations in the Development of Computer Vision
1. The working principles of human vision and machine vision are not the same. Human vision is subject to illusions and has unique mechanisms that computers cannot yet simulate.
2. Humans are prone to misperceptions, such as seeing faces in things that merely resemble faces.
3. The human eye usually understands an image, or the real world, in 3D form, unconsciously applying principles such as perspective; this does not come naturally to a computer.
3. Required Programming Languages
In learning computer vision, theory and practice are equally important. From my own perspective, theoretical knowledge is endless, and it is only through the needs of a project that one truly learns the relevant knowledge.
Take the simplest example of switching libraries or changing parameters: the resulting bugs are not written in any book, and neither are the methods for fixing them. Whether you use Python, C++, or another language, unforeseen issues will arise that can only be worked out in practice, and working through them is how you advance in CV.
Writing code in Python is concise, readable, and versatile, making it the natural choice for many CV enthusiasts, while C++ often seems daunting because it is lower level, more complex, and has a steeper learning curve.
However, in large algorithmic systems such as SLAM, C++ meets the requirements for real-time performance and execution efficiency. Most people starting out in CV therefore choose C++, set up VS Code, and configure the OpenCV library, which all seems very reasonable, and indeed it is.
Gradually, as your knowledge expands, C++ may leave you feeling overwhelmed. When you need to create charts for analysis, you have to post-process the data in MATLAB; when you want to quickly learn or test an algorithm, C++ can be frustrating; when you want to study deep learning, you will not hesitate to set C++ aside… At that point, there are many reasons pushing you to learn Python.
Therefore, what I want to express is that both C++ and Python are extremely important programming languages in computer vision.
4. Popular Directions in Computer Vision
Based on current popular directions, computer vision can be divided into two areas:
One is classification through machine learning and deep learning. Traditional machine learning includes decision trees, neural networks, support vector machines, and Bayesian methods, with Li Hang’s “Statistical Learning Methods” and Zhou Zhihua’s “Machine Learning” being classic introductory texts (a small code sketch of this direction appears below, after both directions are introduced).
The other is geometry-based SLAM, which is currently a highly valuable research direction. Because of its high barrier to entry and the breadth of knowledge required, it can easily deter newcomers. In addition, visual SLAM still has many open problems, owing to hardware constraints and other factors.
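To give newcomers a feel for the first direction, here is a toy classification sketch with scikit-learn; the dataset and the two models (a decision tree and an SVM, both mentioned above) are chosen only for illustration.

```python
# A toy classification example: train a decision tree and an SVM on the
# built-in Iris dataset and report their test accuracy (requires scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for model in (DecisionTreeClassifier(max_depth=3), SVC(kernel="rbf", C=1.0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "accuracy:", model.score(X_test, y_test))
```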
Discussing the development of computer vision against the background of SLAM, based on my own learning experience, may be biased, but it is not without merit.
5. Final Learning Experiences and Insights
I am fortunate that during my master’s studies I have been able to complete projects with the stereo camera and depth camera purchased by our laboratory, combining theoretical learning with practical application.
In visual SLAM, computer vision is used to preprocess the images acquired by the camera carried by the robot, extract features, and match features between adjacent frames, and then to solve for the camera’s motion through physical models and mathematical expressions in order to estimate the robot’s current pose.
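As a rough sketch of that front-end pipeline (not the exact code of any particular SLAM system), the following Python/OpenCV snippet extracts ORB features from two hypothetical consecutive frames, matches them, and recovers the relative camera motion; the frame file names and the camera intrinsics are placeholder values.

```python
# Feature extraction, matching, and relative pose estimation between two frames
# (assumes OpenCV and two consecutive images "frame1.png" and "frame2.png").
import cv2
import numpy as np

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# 1. Extract ORB keypoints and descriptors in each frame.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# 2. Match descriptors between adjacent frames (brute force, Hamming distance).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 3. Solve for camera motion: estimate the essential matrix with RANSAC, then
#    recover the relative rotation R and (unit-scale) translation t.
fx, fy, cx, cy = 520.9, 521.0, 325.1, 249.7          # placeholder intrinsics
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
print("R =\n", R, "\nt =\n", t)
```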
In Gao Xiang’s “Fourteen Lectures on Visual SLAM,” he first introduces 3D rigid-body motion together with the necessary mathematics, including Lie groups and Lie algebras, quaternions, and Euler angles. Later chapters cover camera imaging principles and feature extraction and matching on images, and finally how SLAM builds maps.
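To connect two of those building blocks, here is a small numpy sketch of my own (not code from the book): it converts a unit quaternion into a rotation matrix and projects a 3D point with the pinhole camera model; the quaternion, pose, and intrinsics are made-up values.

```python
# Quaternion -> rotation matrix, and pinhole projection of a 3D point.
import numpy as np

def quat_to_rot(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def project(point_w, R, t, fx, fy, cx, cy):
    """Pinhole model: world point -> pixel coordinates (u, v)."""
    p_c = R @ point_w + t              # transform the point into the camera frame
    u = fx * p_c[0] / p_c[2] + cx      # perspective division plus intrinsics
    v = fy * p_c[1] / p_c[2] + cy
    return u, v

R = quat_to_rot(np.array([0.995, 0.0, 0.0998, 0.0]))   # small rotation about y
t = np.array([0.1, 0.0, 0.0])
print(project(np.array([1.0, 0.5, 4.0]), R, t, 500.0, 500.0, 320.0, 240.0))
```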
I personally believe there is still great potential for computer vision in SLAM. Of course, this is also where SLAM needs breakthroughs, for example feature extraction and matching in weakly textured scenes and highly dynamic environments. How to eliminate redundant features and handle mismatches still needs research; our laboratory, for instance, is studying how to handle redundant feature points in environments with passing pedestrians in order to improve matching efficiency.
Additionally, visual SLAM uses loop-closure detection to correct the cumulative error that builds up during long-term operation, relying on a pre-trained bag-of-words model to give the system a kind of memory. However, since the world keeps changing, how to have a computer “memorize” every existing object remains an open problem, much like our Xinhua Dictionary, which must be constantly revised and expanded.
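As a very rough illustration of the bag-of-words idea (real systems such as DBoW2 in ORB-SLAM use hierarchical vocabularies and TF-IDF weighting, which this sketch ignores), the snippet below clusters ORB descriptors into a small vocabulary and compares two images by their normalized word histograms; all image file names are hypothetical.

```python
# A toy bag-of-words similarity check for loop-closure candidates
# (requires OpenCV and scikit-learn; image paths are placeholders).
import cv2
import numpy as np
from sklearn.cluster import KMeans

def orb_descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, des = cv2.ORB_create(nfeatures=500).detectAndCompute(img, None)
    return des.astype(np.float32)

# Build a tiny "vocabulary" of visual words by clustering training descriptors.
train_des = np.vstack([orb_descriptors(p) for p in ["img0.png", "img1.png"]])
vocab = KMeans(n_clusters=64, random_state=0, n_init=10).fit(train_des)

def bow_vector(des):
    words = vocab.predict(des)                              # assign each descriptor
    hist = np.bincount(words, minlength=vocab.n_clusters)   # word histogram
    hist = hist.astype(np.float64)
    return hist / np.linalg.norm(hist)                      # L2-normalized

v1 = bow_vector(orb_descriptors("query_a.png"))
v2 = bow_vector(orb_descriptors("query_b.png"))
print("similarity:", float(v1 @ v2))   # near 1.0 suggests a loop-closure candidate
```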
At present, this memory capability is mainly pursued through learning-based methods, which is why machine learning and deep learning have become closely tied to SLAM. SLAM (Simultaneous Localization and Mapping) means determining one’s location and constructing a map at the same time: a local map is built while localizing, and the two processes depend on and reinforce each other.
During map construction, point-cloud registration is used to stitch locally captured point clouds into a connected 3D map that depicts the real environment. This has great practical significance for operations in areas inaccessible to humans or in harsh environments, and it is a hot research topic in industry.
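For a concrete, simplified picture of stitching, here is a sketch assuming Open3D is available and two overlapping scans exist as scan1.pcd and scan2.pcd (both names are hypothetical); a real SLAM system would instead place each cloud using the optimized keyframe poses, but ICP alignment conveys the idea.

```python
# Align two overlapping point clouds with point-to-point ICP and merge them.
import numpy as np
import open3d as o3d

source = o3d.io.read_point_cloud("scan1.pcd")
target = o3d.io.read_point_cloud("scan2.pcd")

# Estimate the transform that aligns source to target, starting from identity.
result = o3d.pipelines.registration.registration_icp(
    source, target, 0.05, np.eye(4),
    o3d.pipelines.registration.TransformationEstimationPointToPoint())

source.transform(result.transformation)   # apply the estimated pose in place
stitched = source + target                # concatenate into a single map cloud
o3d.io.write_point_cloud("stitched_map.pcd", stitched)
print("ICP fitness:", result.fitness)     # fraction of inlier correspondences
```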
Moreover, SLAM has many applications in drones, AR, and other areas. SLAM demands robustness, accuracy, and real-time performance, all of which must be driven forward by computer vision.
In my journey of learning computer vision, the courses at Deep Blue Academy have provided me with a clearer learning direction. Through studying “The Theory and Practice of Visual SLAM” and “C++ Basics and Deep Analysis”, I now consider myself a beginner-level SLAM learner.
The biggest takeaway from studying at Deep Blue Academy is that I not only gained knowledge but also encountered many fellow learners like myself who helped me avoid many detours in my learning journey.
I hope that every student entering the field of computer vision does so out of passion, not trend, as many difficulties encountered during the learning process are unknown and require daily accumulation and persistence. Only love can withstand the long passage of time.
As I conclude this writing, I am deeply moved. My knowledge is limited, and my words may not fully express my thoughts. I sincerely wish everyone progress in their studies!