When face recognition comes up, most people’s first thought is simply “scanning a face.” Let’s look at the definition: face recognition is a biometric technology that verifies identity based on facial feature information. A camera captures images or video streams containing faces, the system automatically detects and tracks the faces in them, and a series of recognition techniques is then applied to the detected faces. It is also commonly called portrait recognition or facial recognition.
From this definition, we can see that face recognition involves three steps: face image acquisition and detection, face image feature extraction, and face image matching and recognition. These are also the three components of a face recognition system.
Face Image Acquisition and Detection
Mainstream approaches to face detection and acquisition currently include the AdaBoost face detection algorithm, feature-based methods, and template-based methods.
Let’s look at the AdaBoost face detection algorithm first. It builds on integral images, cascaded detectors, and the AdaBoost learning algorithm, and it can detect frontal faces quickly. Its core idea is to automatically select and weight a number of weak classifiers and combine them into a strong classifier with high discriminative power.
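To make this concrete, here is a minimal sketch using OpenCV’s bundled Haar-cascade frontal-face detector, an implementation of this cascade-of-boosted-weak-classifiers approach (the Viola-Jones detector); the input image path is an illustrative assumption.

```python
# Detect frontal faces with OpenCV's pre-trained Haar cascade,
# which was trained with AdaBoost in the Viola-Jones framework.
import cv2

# Load the frontal-face cascade file shipped with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("photo.jpg")                  # path is illustrative
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # the detector works on grayscale

# scaleFactor and minNeighbors trade off speed against false positives.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detected.jpg", image)
```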
The downside is that in complex backgrounds the AdaBoost detector becomes unstable and produces a high false-positive rate, often mistaking face-like regions for faces.
Face Image Feature Extraction
The features available to a face recognition system generally fall into visual features, pixel statistical features, transform-coefficient features of the face image, and algebraic features of the face image. Face feature extraction targets specific characteristics of the face; it is also known as face representation, the process of modeling a face’s features. Extraction methods can be broadly grouped into two types: knowledge-based representation methods, and representation methods based on algebraic features or statistical learning.
Knowledge-based representation methods primarily obtain feature data that assists in face classification based on the shapes of facial organs and the distances between them. Their feature components usually include Euclidean distances, curvature, and angles between feature points. A face is composed of local parts such as the eyes, nose, mouth, and chin. The geometric description of these local parts and their structural relationships can serve as important features for face recognition, known as geometric features.
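As an illustrative sketch of such geometric features, the snippet below computes normalized distances and an angle from a few landmark points; the coordinates are made-up stand-ins for the output of any landmark detector, and normalizing by the inter-eye distance is one common way to gain scale invariance.

```python
# Geometric (knowledge-based) features from facial landmarks.
import numpy as np

# Hypothetical landmark coordinates (pixels); a real system would get
# these from a landmark detector.
landmarks = {
    "left_eye":  np.array([112.0, 140.0]),
    "right_eye": np.array([188.0, 141.0]),
    "nose_tip":  np.array([150.0, 190.0]),
    "mouth":     np.array([151.0, 235.0]),
}

def distance(a, b):
    return float(np.linalg.norm(a - b))

def angle(vertex, p1, p2):
    """Angle in degrees at `vertex`, formed by points p1 and p2."""
    v1, v2 = p1 - vertex, p2 - vertex
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

eye_dist = distance(landmarks["left_eye"], landmarks["right_eye"])
# Dividing by the inter-eye distance makes the features scale-invariant.
feature_vector = np.array([
    distance(landmarks["nose_tip"], landmarks["mouth"]) / eye_dist,
    distance(landmarks["left_eye"], landmarks["mouth"]) / eye_dist,
    angle(landmarks["nose_tip"], landmarks["left_eye"], landmarks["right_eye"]),
])
print(feature_vector)
```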
Statistical-learning-based methods use statistical analysis and machine learning to find the distinguishing features of face and non-face samples, construct classifiers from those features, and use the classifiers for face detection. This family mainly includes neural network methods, support vector machine methods, and hidden Markov model methods. These methods derive their representational rules from sample learning rather than from human intuition, which reduces errors caused by incomplete or inaccurate human observation. However, they require large sample sets and extensive training, which can be time-consuming and labor-intensive.
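As a hedged sketch of this route, the snippet below trains a kernel SVM with scikit-learn to separate face patches from non-face patches; the random arrays are placeholders for real labeled patch data, which would be prepared elsewhere.

```python
# Train an SVM classifier on flattened image patches (face vs. non-face).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 24 * 24))     # placeholder: 200 flattened 24x24 patches
y = rng.integers(0, 2, size=200)   # placeholder labels: 1 = face, 0 = non-face

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = SVC(kernel="rbf", C=1.0)     # classic kernel SVM
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```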
Face Image Matching and Recognition
The extracted feature data of a face image is searched and matched against the feature templates stored in a database. A threshold is set, and when the similarity exceeds it, the match is reported. In other words, face recognition compares the features of the face to be recognized against enrolled face feature templates and judges identity by the degree of similarity. This process comes in two forms: verification, a one-to-one comparison referred to as 1:1, and identification, a one-to-many comparison referred to as 1:N.
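Here is a minimal sketch of this threshold-based matching, assuming feature extraction has already produced fixed-length vectors; the cosine measure and the 0.6 threshold are illustrative choices, not standards.

```python
# Threshold-based matching of feature vectors (the 1:1 verification case).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

THRESHOLD = 0.6  # illustrative; real systems tune this on evaluation data

def verify(probe_feature, enrolled_template):
    """1:1 - does the probe match the claimed identity's template?"""
    return cosine_similarity(probe_feature, enrolled_template) >= THRESHOLD

rng = np.random.default_rng(1)
template = rng.standard_normal(128)                   # enrolled feature vector
genuine = template + 0.1 * rng.standard_normal(128)   # same person, slight noise
impostor = rng.standard_normal(128)                   # unrelated person
print(verify(genuine, template))    # True: similarity well above threshold
print(verify(impostor, template))   # False: near-zero similarity
```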
1:1 answers the question “Is this person who they claim to be?”
For example, when you go through the security check at a train station, the inspector compares the photo on your ID card with your actual appearance to verify that the card really belongs to you. This is a typical 1:1 scenario.
According to relevant statistics, human visual recognition reaches an accuracy of about 95%. But human eyes tire, so inspectors must rotate shifts regularly to keep accuracy at a consistent level. In such scenarios, face recognition technology can reach 97% accuracy or higher, and the system never suffers from fatigue.
1:N answers the question “Who is this person?”
For instance, face recognition systems are deployed in crowded places such as train stations and in important locations such as pedestrian streets and urban villages. Such systems are characterized as dynamic and non-cooperative. “Dynamic” means recognition runs not on still photos but on live video streams captured by front-end cameras; “non-cooperative” means the subjects do not need to know where the camera is or actively cooperate for recognition to complete. The whole process is convenient and unobtrusive. However, 1:N recognition is affected by location, environment, lighting, and even reflections from glass, all of which can degrade accuracy, making 1:N comparatively more challenging.
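Extending the verification sketch above, here is an illustrative 1:N identification loop over a gallery of enrolled templates, with an open-set threshold so that an unenrolled probe returns no match; all names and values are assumptions.

```python
# 1:N identification: search a gallery for the best-matching identity.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe, gallery, threshold=0.6):
    """Return (best identity, score), or (None, score) if no enrolled
    template clears the threshold (open-set 1:N)."""
    best_name, best_score = None, -1.0
    for name, template in gallery.items():
        score = cosine_similarity(probe, template)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= threshold else (None, best_score)

rng = np.random.default_rng(2)
gallery = {f"person_{i}": rng.standard_normal(128) for i in range(5)}
probe = gallery["person_3"] + 0.1 * rng.standard_normal(128)  # noisy copy of an enrollee
print(identify(probe, gallery))   # expected: ("person_3", score close to 1.0)
```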
Face recognition technology is now widely used in fields such as finance, justice, military, public security, border control, and security. With the construction and development of safe cities, smart communities, intelligent buildings, and intelligent transportation, face recognition technology will increasingly penetrate our lives.