This article is reprinted from | Computer Vision Alliance
Harris Corner Detection
Def. [Corner point]
A point whose grayscale changes sufficiently strongly in every direction within its neighborhood; on an image edge curve, it is a point of locally maximal curvature.
- [Corner detection based on grayscale images] includes gradient-based methods (determining corners by computing the curvature of edges), template-based methods (considering the grayscale changes of neighboring pixels and defining points with sufficiently large brightness contrast to their neighbors as corners), and methods combining templates and gradients.
- [Corner detection based on binary images] treats the binary image as an independent detection target and applies the various grayscale-based corner detection methods.
- [Corner detection based on contour curves] extracts corners through corner strength or curve curvature.
The basic idea of corner detection is: using a corner detection operator, compute a corner response function for every pixel in the image, threshold the response function (choosing the threshold according to the actual situation), perform non-maximum suppression on the thresholded response, and take the remaining non-zero points as corners. A small sliding window is used to probe for corners in every direction: if the grayscale values inside the window change significantly as the window shifts, the window center is a corner. Define the grayscale change as

E(Δx, Δy) = Σ_{x,y} w(x, y) [I(x + Δx, y + Δy) − I(x, y)]²,

where w is the window function, I is the image grayscale, and E measures the intensity of the grayscale change. In 1977, Moravec first proposed the following corner detection method:
- For the original image, take offsets (Δx, Δy) of (1,0), (1,1), (0,1), (−1,1) and compute the grayscale change E for each pixel (x_i, y_i).
- For each pixel (x_i, y_i), compute the corner response function R(x_i, y_i) = min E.
- Set a threshold T and set to 0 every value of R(x_i, y_i) that is below T.
- Perform non-maximum suppression within a window: traverse the corner response function, and if a pixel's response is not the maximum within its window, set it to 0.
- Take the remaining non-zero points as the detected corners (a NumPy sketch of these steps is given below).
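As a concrete illustration of these steps, here is a minimal NumPy sketch of the Moravec procedure; the window size and threshold are illustrative choices, and the input is assumed to be a 2D grayscale float array:

```python
import numpy as np

def moravec(gray, window=3, threshold=100.0):
    """Minimal Moravec corner detector sketch; gray is a 2D float array."""
    offsets = [(1, 0), (1, 1), (0, 1), (-1, 1)]   # the four shift directions
    h, w = gray.shape
    r = window // 2
    R = np.zeros_like(gray, dtype=np.float64)

    for y in range(r + 1, h - r - 1):
        for x in range(r + 1, w - r - 1):
            patch = gray[y - r:y + r + 1, x - r:x + r + 1]
            # grayscale change E for each of the four offsets
            E = [np.sum((gray[y - r + dy:y + r + 1 + dy,
                              x - r + dx:x + r + 1 + dx] - patch) ** 2)
                 for dx, dy in offsets]
            R[y, x] = min(E)                       # corner response = min E

    R[R < threshold] = 0                           # thresholding

    corners = []                                   # non-maximum suppression (3x3)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if R[y, x] > 0 and R[y, x] == R[y - 1:y + 2, x - 1:x + 2].max():
                corners.append((x, y))
    return corners
```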
Disadvantages of Moravec corner detection:
- The binary window function leads to a non-smooth corner response function.
- Only four directions of grayscale change are computed, so the corner response function fires strongly in many places.
- Only the minimum of E is considered at each point, making the algorithm respond strongly to edges.
In 1988, Harris and Stephens (the detector is also known as the Plessey operator) improved Moravec's method and proposed the classic Harris corner detection algorithm. Harris first replaced the binary window function of the Moravec algorithm with a two-dimensional Gaussian, and examined small shifts via a Taylor expansion: to make the maximum of E clearly identify corners, expand E for a small shift (u, v), yielding

E(u, v) ≈ Σ_{x,y} w(x, y) (I_x u + I_y v)² = [u v] M [u v]^T.

Note that the above expression, set equal to a constant, is the equation of an ellipse. The autocorrelation matrix

M = Σ_{x,y} w(x, y) [[I_x², I_x I_y], [I_x I_y, I_y²]]

describes the trend of grayscale change in the local image area, and the shape of the ellipse can be used to determine corners.
Proposition. For the ellipse [x y] M [x y]^T = 1 with semi-major and semi-minor axes a and b, the eigenvalues of the matrix M are 1/a² and 1/b².
Proof. The squared distance from any point on the ellipse to the origin (the center of the ellipse) is d² = x² + y², subject to the constraint [x y] M [x y]^T = 1, with Lagrange function

L(x, y, λ) = x² + y² − λ([x y] M [x y]^T − 1).

Setting the gradient to zero gives (x, y)^T = λ M (x, y)^T, which can also be rewritten as M (x, y)^T = (1/λ)(x, y)^T. Since (x, y) is not the zero vector, 1/λ is an eigenvalue of the matrix M, and at such a point d² = x² + y² = λ [x y] M [x y]^T = λ. Thus λ1 and λ2 correspond to the two extrema of d², which are a² and b², so the eigenvalues of M are 1/a² and 1/b². This also gives the area of the ellipse, πab = π/√(det M). q.e.d.
What is the use of obtaining the eigenvalues of M? Viewed from the perspective of singular value decomposition: if J is the matrix whose columns are the gradients (I_x, I_y) of the window pixels, then (up to the window weights) M = JJ^T, so the square roots of the eigenvalues of M are the singular values of J, and the eigenvalues of M therefore reflect the relative magnitudes of I_x and I_y.
Now we define the corner response function to discriminate further. Let

R = det(M) − k (trace(M))² = λ1 λ2 − k (λ1 + λ2)²   (generally k = 0.04 ~ 0.06),

where λ1 and λ2 are the eigenvalues of M, det(M) = λ1 λ2, and trace(M) = λ1 + λ2.
The qualitative judgment method is:
- [Edge] λ1 >> λ2 or λ2 >> λ1; E changes strongly in one direction and changes little along the edge.
- [Corner] λ1 and λ2 are both large; E varies significantly in all directions.
- [Smooth area] λ1 and λ2 are both small; E remains basically unchanged in all directions.
The quantitative judgment method is:
- [Edge] R << 0.
- [Corner] R >> 0.
- [Smooth area] |R| is very small.
Generally, increasing the value of k will reduce the corner response value R, decrease the sensitivity of corner detection, and reduce the number of corners detected; decreasing k will increase the corner response value R, increase the sensitivity of corner detection, and increase the number of corners detected.
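For reference, OpenCV exposes exactly this response through cv2.cornerHarris; below is a short sketch in which the file name and the 0.01 threshold factor are placeholders chosen for illustration:

```python
import cv2
import numpy as np

img = cv2.imread("image.png")                       # placeholder file name
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# blockSize: neighbourhood window, ksize: Sobel aperture, k: the constant above
R = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# keep only strong responses (R >> 0) and mark them in red
img[R > 0.01 * R.max()] = [0, 0, 255]
cv2.imwrite("harris_corners.png", img)
```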
Next, we introduce how to use Harris corner features for feature point matching.
The process of feature matching between two images is:
- Build a database of feature points for the images, including each point's position coordinates, scale, orientation, and feature vector.
- For each feature point in the new image, match it against the database one by one: search for its nearest and second-nearest neighbors by the Euclidean distance between feature vectors; if the ratio (nearest-neighbor distance / second-nearest-neighbor distance) is below a certain threshold, the match is accepted.
The Harris corner detector can only detect interest points in the image, but does not provide a method for finding matching corners by comparing interest points between images. We need to add descriptor information at each point and provide a method for comparing these descriptors.
An interest point descriptor is a vector assigned to an interest point that describes the appearance of the image around that point. The better the descriptor, the better the corresponding points that can be found. We use corresponding points, or point correspondences, to denote the pixels formed by the same object or scene point in different images. A Harris corner descriptor usually consists of the grayscale values of a surrounding image patch, compared using normalized cross-correlation. The image patch is the rectangular region of the image centered on that pixel. The normalized cross-correlation of two (same size) patches I1(x) and I2(x) is typically defined as

ncc(I1, I2) = (1 / (n − 1)) Σ_x [(I1(x) − μ1) / σ1] · [(I2(x) − μ2) / σ2],

where n is the number of pixels in a patch and μi, σi are the mean and standard deviation of patch Ii.
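A minimal NumPy sketch of this patch-based matching follows; the 0.5 acceptance threshold is illustrative, and the patches are assumed to be equally sized float arrays with non-zero variance:

```python
import numpy as np

def ncc(patch1, patch2):
    """Normalized cross-correlation of two equally sized grayscale patches."""
    p1 = (patch1 - patch1.mean()) / patch1.std()
    p2 = (patch2 - patch2.mean()) / patch2.std()
    return np.sum(p1 * p2) / (patch1.size - 1)

def match_patches(patches1, patches2, threshold=0.5):
    """For each patch in patches1, return the index of the best match in
    patches2, or -1 if the best NCC score is below the threshold."""
    matches = []
    for d1 in patches1:
        scores = [ncc(d1, d2) for d2 in patches2]
        best = int(np.argmax(scores))
        matches.append(best if scores[best] > threshold else -1)
    return matches
```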
Advantages of Harris corners:
- Simple computation.
- The extracted point features are distributed uniformly and reasonably.
- Stable: the Harris operator is insensitive to image rotation, brightness changes, noise, and viewpoint changes.
Limitations of the Harris operator:
- Very sensitive to scale; no scale invariance.
- The accuracy of the extracted corners is only at the pixel level.
- A corner matching algorithm still has to be designed separately.
In 1994, Shi and Tomasi improved Harris corner detection by defining the corner response function as R = min(λ1, λ2). The improvement aims to compute high-quality features suitable for tracking ("good features to track"), gives a more uniform feature distribution, and reaches sub-pixel detection accuracy.
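This variant is what OpenCV implements as goodFeaturesToTrack; a short sketch (the file name and parameter values are placeholders):

```python
import cv2

img = cv2.imread("image.png")                       # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# quality is judged by R = min(lambda1, lambda2) relative to the strongest corner
corners = cv2.goodFeaturesToTrack(gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=10)
for cx, cy in corners.reshape(-1, 2):
    cv2.circle(img, (int(cx), int(cy)), 3, (0, 255, 0), -1)
cv2.imwrite("shi_tomasi.png", img)
```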
SIFT Algorithm Prerequisites: Scale Space and Image Pyramid
In 1999, Lowe proposed the SIFT (Scale-invariant feature transform) feature detection algorithm, and in 2003, he improved and summarized it; in 2006, Bay and others improved the SIFT algorithm, enhancing its efficiency and proposing the SURF (Speeded Up Robust Features) algorithm.
Characteristics of the SIFT feature detection algorithm:
- SIFT features are local image features, invariant to rotation, scale change, and brightness change, and they remain fairly stable under viewpoint change, affine transformation, and noise.
- Rich in information, suitable for matching against massive feature databases.
- Abundant: even a few objects can generate a large number of SIFT features.
- High speed: an optimized SIFT matching algorithm can even reach real-time performance.
Steps for SIFT feature detection:
- Detect extreme points in scale space.
- Precisely localize the feature points (keypoints).
- Assign orientation parameters to the feature points.
- Generate the feature point descriptors (128-dimensional vectors).
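Before walking through the theory, here is how the whole pipeline looks when driven through OpenCV (cv2.SIFT_create requires opencv-python 4.4 or newer; the file name is a placeholder):

```python
import cv2

gray = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# each keypoint carries a position (pt), scale (size) and orientation (angle);
# descriptors is an N x 128 float array
print(len(keypoints), descriptors.shape)

out = cv2.drawKeypoints(gray, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", out)
```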
If the pixel count of a photo is continuously reduced, or the observer moves farther away from it, the photo gradually becomes blurred; the continuous variable that drives this change in what the photo presents can be called scale. Observation at different scales presents an object differently. In computer vision it is necessary to represent structures at all scales, because one cannot predict in advance which scale of an object's structure is meaningful. The following part on scale space theory can also be found at:
http://www.cnblogs.com/ronny/p/3886013.html
By the image pyramid method we obtain a series of images of different sizes, forming a pyramid model from large to small, bottom to top. Let the image at the l-th layer of the Gaussian pyramid be G_l; then

G_l(i, j) = Σ_{m=−2..2} Σ_{n=−2..2} ω(m, n) G_{l−1}(2i + m, 2j + n),   1 ≤ l ≤ N, 0 ≤ i < R_l, 0 ≤ j < C_l.

In this expression, N is the number of layers of the Gaussian pyramid, R_l and C_l are the numbers of rows and columns of the l-th layer, and ω(m, n) is a two-dimensional separable 5×5 window function,

ω = (1/256) [1 4 6 4 1]^T [1 4 6 4 1].
Writing it in the above form is to illustrate that the convolution operator of the two-dimensional window can be expressed as the product of one-dimensional convolution kernels (binomial kernels) in two directions. The formula in the above convolution form actually completes two steps: 1) Gaussian smoothing; 2) dimensionality reduction. The G_0, G_1,…, G_N generated according to the above steps constitute the Gaussian pyramid of the image, where the bottom layer is the same as the original image, and the top layer is the pyramid’s peak. The current layer image of the Gaussian pyramid is generated by first applying Gaussian low-pass filtering to the previous layer image and then performing downsampling (removing even rows and even columns).
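As a quick illustration, OpenCV's cv2.pyrDown performs exactly this smooth-then-subsample step (a 5×5 Gaussian-like kernel followed by dropping every other row and column); the file name and number of levels below are placeholders:

```python
import cv2

G = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)   # bottom layer G_0
gaussian_pyramid = [G]
for _ in range(4):                                  # four additional levels
    G = cv2.pyrDown(G)                              # smooth, then downsample by 2
    gaussian_pyramid.append(G)

for i, level in enumerate(gaussian_pyramid):
    print(f"G_{i}: {level.shape}")
```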
Interpolating G_l (the interpolation used here is not bilinear but the same filter used during downsampling) yields an enlarged image G*_l whose size matches that of the next lower layer G_{l−1}:

G*_l(i, j) = 4 Σ_{m=−2..2} Σ_{n=−2..2} ω(m, n) G_l((i + m)/2, (j + n)/2),

where only the terms for which (i + m)/2 and (j + n)/2 are integers are included. The coefficient 4 appears because, for each output pixel, the weights that actually take part in the sum add up to 4/256 of the kernel, which follows from the value of ω used above. Now let

LP_l = G_l − G*_{l+1}   for 0 ≤ l ≤ N − 1,   and   LP_N = G_N,
Where N is the top layer number of the Laplacian pyramid, and LP_l is the image of the l-th layer of Laplacian pyramid decomposition. The pyramid formed by LP_0, LP_1,…, LP_l,…, LP_N is called the Laplacian pyramid. Each layer of the image is the difference between the current layer image of the Gaussian pyramid and the interpolated enlarged image of the image from the previous higher layer; this process is equivalent to band-pass filtering, thus the Laplacian pyramid is also called band-pass pyramid decomposition.
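A short sketch of this construction with OpenCV, using cv2.pyrUp as the expand step and float images so the differences are not clipped (the file name and level count are placeholders):

```python
import cv2
import numpy as np

G = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
gaussian = [G]
for _ in range(4):
    G = cv2.pyrDown(G)
    gaussian.append(G)

laplacian = []
for l in range(len(gaussian) - 1):
    # expand G_{l+1} back to the size of G_l, then take the difference
    expanded = cv2.pyrUp(gaussian[l + 1], dstsize=gaussian[l].shape[::-1])
    laplacian.append(gaussian[l] - expanded)        # LP_l = G_l - expand(G_{l+1})
laplacian.append(gaussian[-1])                      # top layer LP_N = G_N
```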
Building an image pyramid gives an efficient (and computationally cheap) multi-scale representation of the image, but it lacks a solid theoretical foundation and cannot analyze the structures of objects in the image across a continuum of scales.
We know that the scale space of signals was initially proposed to filter the original signal into a group of low-frequency signals using a series of single-parameter, increasingly wide Gaussian filters. An obvious question is: can other low-pass filters with parameter t also be used to generate a scale space besides Gaussian filtering? Later, Koenderink, Lindeberg, Florack, and others proved through precise mathematical forms that the Gaussian kernel is the only transformation kernel for achieving scale transformation.
Although many researchers have derived that the Gaussian filter is the optimal filter for establishing linear scale space based on its separability, rotation invariance, causality, and other properties, in digital image processing, it is necessary to sample the kernel function, and the discrete Gaussian function does not satisfy some of the excellent properties of the continuous Gaussian function. Therefore, some nonlinear filter groups have emerged to establish scale space, such as B-spline kernel functions.
Using Gaussian filters to construct the scale space pyramid for images gives the scale space the following properties:
- Weighted average and finite aperture effect
  - The representation of the signal at scale t can be seen as a family of weighted spatial averages of the original signal, with Gaussian kernels of different scale parameters as the weights.
  - The representation of the signal at scale t also corresponds to observing the signal through a non-directional aperture function of characteristic length σ; fine structures in the signal with characteristic length smaller than σ are suppressed, which for a one-dimensional signal can be understood as fluctuations smaller than σ being smoothed out.
  - Layered smoothing, also known as the semi-group property of the Gaussian kernel family: the convolution of two Gaussian kernels is again a Gaussian kernel with a different kernel parameter. This property means that the smoothing of the image by different Gaussian kernels is continuous.
- Local extremum recursion
  - This property can be understood from the visual principles of the human eye: the farther away an object is, the less detail is seen, and the detail features decrease. Physiological studies have shown that the receptive fields of mammalian retinas and visual cortex can be well modeled by Gaussian derivatives up to fourth order.
  - The Gaussian kernel suppresses local details when filtering the image.
  - Scale invariance: when a transformation is applied to the original signal and the transformed signal is put through the Gaussian-kernel scale-space generation, the extreme points and other features of the new signal remain unchanged.
SIFT Algorithm
Let the original image be the first layer of the pyramid, and let each new image obtained by downsampling form one layer (one image per layer), giving n layers in total. The number of layers is determined by the original image size and by the size of the image at the top of the pyramid:

n = log2(min(M, N)) − t,   t ∈ [0, log2(min(M, N))],

where M and N are the length and width of the original image and t is the base-2 logarithm of the smallest dimension of the top image. Next, we use scale space theory to simulate the multi-scale characteristics of image data. Taking the Gaussian function as the only linear kernel that realizes scale transformation, the scale space of a two-dimensional image I(x, y) is defined as

L(x, y, σ) = G(x, y, σ) * I(x, y),   with   G(x, y, σ) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²)),

where * denotes convolution and σ is the scale parameter.
From the properties of the Gaussian function we know that the effective width of the Gaussian window is about 6σ, i.e., the response of the filter outside the interval [−3σ, 3σ] is essentially 0, so Gaussian functions with different standard deviations serve as different scale measures. In fact, taking the Fourier transform of a Gaussian yields another Gaussian:

F{exp(−x² / (2σ²))} ∝ exp(−σ²ω² / 2),

that is, a Gaussian kernel remains Gaussian in the frequency domain, in one and two dimensions alike. We can therefore define the Difference of Gaussian (DoG) operator

D(x, y, σ) = [G(x, y, kσ) − G(x, y, σ)] * I(x, y) = L(x, y, kσ) − L(x, y, σ),

which satisfies G(x, y, kσ) − G(x, y, σ) ≈ (k − 1) σ²∇²G, where ∇²G is the Laplacian of Gaussian (LoG) operator and σ²∇²G is the scale-normalized LoG; the DoG is thus an approximation of the normalized LoG.
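A small sketch of one octave of this DoG construction; sigma0 = 1.6, S = 3 and the file name are assumed values:

```python
import cv2
import numpy as np

gray = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

S = 3
k = 2 ** (1.0 / S)
sigma0 = 1.6
sigmas = [sigma0 * k ** i for i in range(S + 3)]    # S + 3 scale-space images

# L(x, y, sigma_i) for one octave, then S + 2 DoG images D = L(k*sigma) - L(sigma)
L = [cv2.GaussianBlur(gray, (0, 0), s) for s in sigmas]
D = [L[i + 1] - L[i] for i in range(len(L) - 1)]
print(len(D), "DoG images in this octave")
```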
Convolution measures the similarity between the image and a given two-dimensional function; likewise, convolving the image with the Laplacian of Gaussian measures the similarity between the image and the LoG function. When the size of a blob in the image approaches the support of the LoG kernel, the Laplacian response of the image reaches its maximum.
From a probabilistic perspective: suppose the original image is the density function of a random variable X related to position, and LOG is the density function of random variable Y, then the density distribution function of random variables X+Y is the convolution form of the two functions. To maximize X+Y, it is best for X and Y to keep pace, meaning that when X rises, Y also rises, and when X is at its maximum, Y is also at its maximum. In practice:
Each downsampled image forms one level of the pyramid, like an octave in music; to make the scale vary continuously, the Gaussian pyramid adds Gaussian filtering on top of the simple downsampling. Each image in a pyramid level is blurred with Gaussian kernels of different parameters, so every level of the pyramid contains several Gaussian-blurred images; such a group of images is collectively called an octave. The pyramid has one group per level, so the number of groups equals the number of levels, and each group contains several images (also called layers, or intervals). Moreover, during downsampling, the initial image of a group of the Gaussian pyramid is obtained by downsampling the third-to-last image of the previous group, and in the next octave the standard deviation of the Gaussian operator doubles.
In the DoG scale space, local maxima or minima are detected by comparing each point with its eight neighbors at the same scale and with the corresponding 9 × 2 points at the two adjacent scales, so that extreme points are detected both in scale space and in the two-dimensional image space. The position of an extreme point is then refined using the derivatives of the DoG function. Extreme points detected in discrete space are not true extreme points; the method of interpolating known discrete samples to obtain extreme points in continuous space is called sub-pixel interpolation. To improve the stability of the keypoints, curve fitting is performed on the DoG function in scale space. The Taylor expansion (fitting function) of the DoG function around a sample point is

D(X) = D + (∂D/∂X)^T X + (1/2) X^T (∂²D/∂X²) X,   with X = (x, y, σ)^T.

Differentiating and setting the derivative to zero gives the offset of the extreme point:

\hat X = −(∂²D/∂X²)^{-1} (∂D/∂X).
For the extreme points obtained this way, points with a very small absolute DoG value (generally |D(\hat X)| < 0.03) are discarded first, and extreme points with excessive edge response are eliminated through the principal curvatures, i.e., by computing the Hessian matrix of the difference image D:

H = [[D_xx, D_xy], [D_xy, D_yy]].

Only the extreme points satisfying

Tr(H)² / Det(H) < (r + 1)² / r   (generally r = 10)

are retained.
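The sketch below applies these two tests to a candidate extremum found in a DoG stack D (a list of 2D float arrays for one octave); it is a single refinement step under the usual values contrast_thr = 0.03 and r = 10, and assumes the indices stay inside the arrays:

```python
import numpy as np

def refine_keypoint(D, s, y, x, contrast_thr=0.03, r=10):
    """One sub-pixel refinement step plus the contrast and edge tests."""
    # first derivatives of D at (x, y, s) by central differences
    dD = np.array([(D[s][y, x + 1] - D[s][y, x - 1]) / 2,
                   (D[s][y + 1, x] - D[s][y - 1, x]) / 2,
                   (D[s + 1][y, x] - D[s - 1][y, x]) / 2])
    # second derivatives (3x3 Hessian in x, y, sigma)
    dxx = D[s][y, x + 1] + D[s][y, x - 1] - 2 * D[s][y, x]
    dyy = D[s][y + 1, x] + D[s][y - 1, x] - 2 * D[s][y, x]
    dss = D[s + 1][y, x] + D[s - 1][y, x] - 2 * D[s][y, x]
    dxy = (D[s][y + 1, x + 1] - D[s][y + 1, x - 1]
           - D[s][y - 1, x + 1] + D[s][y - 1, x - 1]) / 4
    dxs = (D[s + 1][y, x + 1] - D[s + 1][y, x - 1]
           - D[s - 1][y, x + 1] + D[s - 1][y, x - 1]) / 4
    dys = (D[s + 1][y + 1, x] - D[s + 1][y - 1, x]
           - D[s - 1][y + 1, x] + D[s - 1][y - 1, x]) / 4
    H3 = np.array([[dxx, dxy, dxs], [dxy, dyy, dys], [dxs, dys, dss]])

    offset = -np.linalg.solve(H3, dD)               # sub-pixel offset (assumes H3 invertible)
    contrast = D[s][y, x] + 0.5 * dD.dot(offset)    # D at the interpolated extremum
    if abs(contrast) < contrast_thr:
        return None                                 # low-contrast point: discard

    tr, det = dxx + dyy, dxx * dyy - dxy ** 2       # 2x2 spatial Hessian
    if det <= 0 or tr ** 2 / det >= (r + 1) ** 2 / r:
        return None                                 # edge-like point: discard
    return offset, contrast
```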
Now, differencing L(x, y, σ) yields the gradient magnitude and orientation in the keypoint's neighborhood (the Roberts operator is also commonly used for this calculation), from which the gradient orientation histogram is built, i.e.,

m(x, y) = sqrt[(L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²],
θ(x, y) = arctan[(L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y))].
Subsequently, the directional parameters of the features are determined:
- Range of the gradient orientation histogram: [0°, 360°].
- Main orientation of the feature point: the peak of the histogram.
- Secondary orientation of the feature point: the second peak of the histogram, provided its energy exceeds 80% of the main peak's energy.
When the image rotates, the main direction of the feature point remains unchanged relative to the pixel’s gradient direction; rotating all images to align the feature point direction to zero before matching ensures that the features are rotation invariant.
Now rotate the coordinate axes to the orientation of the feature point, take a window centered on the feature point, and apply Gaussian weighting so that the gradient information of pixels near the feature point contributes more. The window is divided into 4 × 4 sub-blocks; in each sub-block an 8-direction gradient orientation histogram is computed and the accumulated gradient values form a seed point, giving a 4 × 4 × 8 = 128-dimensional feature vector. SIFT features are invariant to rotation, scale change, and brightness change, and remain fairly stable under viewpoint change, affine transformation, and noise.
Further processing of the descriptor:
- [Remove lighting effects] Normalize the feature vector to unit length.
- [Reduce distortion effects] Affine transformations.
Several Questions About the SIFT Algorithm
This section mainly references:
http://www.cnblogs.com/ronny/p/4028776.html#3962356
The first question: Generation of the first layer image of the first group.
Here we need to consider two cases: one sets the index of the first group to 0, the other sets it to −1. We first consider the case where the index of the first group is 0. We know that the first-layer image of the first group is generated by convolving the original image with a Gaussian filter of σ0 (generally σ0 = 1.6). For anti-aliasing, we usually assume that the input image has already been smoothed by a Gaussian filter with σ = 0.5, i.e., half a pixel.
That is, the collected image I(x, y) has already been smoothed by a Gaussian with σ = 0.5. Therefore we cannot directly smooth I(x, y) with a σ0 Gaussian; instead we smooth it with a Gaussian of σ = sqrt(1.6² − 0.5²), i.e.,

FirstLayer(x, y) = I(x, y) * G(x, y, sqrt(1.6² − 0.5²)),
where FirstLayer(x,y) represents the image of the first group and the first layer of the entire scale space.
Now consider the case where the index of the first group is set to −1. The first question is why set the index to −1 at all: if the index is 0, the first-layer image of the first group is already generated from a blurred original image, which means detail information has already been lost and the original image is not fully exploited. Based on this consideration, we first enlarge the image by a factor of 2, so that the details of the original image are preserved within it.
From the analysis of the first case, we already regard I(x, y) as an image blurred with σ = 0.5. When I(x, y) is enlarged by a factor of 2 to obtain Is(x, y), it can be regarded as an image blurred by a Gaussian with σ = 2 × 0.5 = 1.0. Hence the Gaussian filter used to generate the first-layer image of this first group has σ = sqrt(1.6² − 1.0²).
The second question: How many images does the scale space generate?
We know that S is the number of layers in which we ultimately search for feature points in the Gaussian difference (DoG) images, and searching for a local extremum in a given layer requires the DoG images of the layer above and the layer below. Therefore, to search S layers for feature points we need S + 2 DoG images, and the search runs from the 2nd to the (S + 1)-th of them.
Each Gaussian difference image D(x, y, σ) is generated as the difference of two scale-space images, L(x, y, kσ) and L(x, y, σ). Assuming S = 3, we need S + 2 = 5 Gaussian difference images, namely

D(x, y, σ), D(x, y, kσ), D(x, y, k²σ), D(x, y, k³σ), D(x, y, k⁴σ),

of which the middle three, D(x, y, kσ), D(x, y, k²σ), D(x, y, k³σ), are used to search for local extreme points. Thus we need S + 3 = 6 scale-space images to generate the Gaussian difference images above, namely

L(x, y, σ), L(x, y, kσ), L(x, y, k²σ), L(x, y, k³σ), L(x, y, k⁴σ), L(x, y, k⁵σ).
From the above analysis, we know that for the scale space, we need a total of S+3 layers of images to construct S+2 layers of Gaussian difference images. Therefore, if the entire scale space has n groups, with each group having S+3 layers of images, there are a total of n(S+3) scale images.
The third question: Why take the third-to-last image of the previous group for downsampling as the first image of the next group?
For the scale space of the o-th group, the scale of the image at the s-th layer is

σ(o, s) = σ0 · 2^(o + s/S),   s = 0, 1, …, S + 2.

Thus, starting from the 0-th group, the scales of the layers are (taking S = 3):
The 0-th group: σ0, 2^(1/3)σ0, 2^(2/3)σ0, 2σ0, 2^(4/3)σ0, 2^(5/3)σ0.
The 1-st group: 2σ0, 2·2^(1/3)σ0, 2·2^(2/3)σ0, 4σ0, 2·2^(4/3)σ0, 2·2^(5/3)σ0.
……
It can be seen that the third-to-last image of the 0-th group and the first image of the 1-st group have the same scale, 2σ0. Therefore, we do not need to go back to the original image to generate the first image of each group; we only need to downsample the third-to-last image of the previous group.
We can also continue and analyze the scales of the Gaussian difference images obtained from the 0-th group:

σ0, 2^(1/3)σ0, 2^(2/3)σ0, 2σ0, 2^(4/3)σ0,

and the scales of the Gaussian difference images obtained from the 1-st group:

2σ0, 2·2^(1/3)σ0, 2·2^(2/3)σ0, 4σ0, 2·2^(4/3)σ0.

If we extract the middle three terms of each group and put them together, the scales are

2^(1/3)σ0, 2^(2/3)σ0, 2σ0, 2·2^(1/3)σ0, 2·2^(2/3)σ0, 4σ0, …,

which form a continuous geometric sequence. This continuity directly benefits the search for extreme points in scale space: no extreme point at any scale is missed, and the quantized scale factors are covered comprehensively (a small numeric check follows below).
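A small numeric check of this bookkeeping, assuming sigma0 = 1.6 and S = 3:

```python
sigma0, S = 1.6, 3

def layer_scales(octave):
    """Scales of the S + 3 scale-space images in one group."""
    return [sigma0 * 2 ** (octave + s / S) for s in range(S + 3)]

print([round(v, 3) for v in layer_scales(0)])   # group 0: 1.6 ... 3.2 ... 5.08
print([round(v, 3) for v in layer_scales(1)])   # group 1 starts at 3.2 = 2 * sigma0
# the third-to-last scale of group 0 equals the first scale of group 1, and the
# middle DoG scales of successive groups form one continuous geometric sequence
```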
The fourth question: How to generate the i-th layer image from the i-1 layer image?
Each layer image in the scale space (except for the first layer) is generated by convolving the image from the previous layer with a Gaussian filter corresponding to a relative σ, rather than from the original image and the corresponding scale Gaussian filter. This is partly because, as I mentioned earlier, there is no so-called “original image”; our input image I(x,y) is already an image with a scale of σ=0.5. On the other hand, if the original image is used for calculations, the scales between adjacent layers are actually very small, which would lead to most values tending towards 0 when generating the Gaussian difference images, making it difficult to detect feature points later.
Based on these two reasons, each (i+1)-th layer image in a group is generated by convolving the i-th layer image with a Gaussian filter of the corresponding relative scale.
Consider the 0-th group: the relative scale between the (i+1)-th and the i-th layer is

σ_rel(i) = sqrt(σ_{i+1}² − σ_i²) = σ0 · 2^(i/S) · sqrt(2^(2/S) − 1).

To maintain scale continuity, the subsequent groups reuse this same set of relative scales (this is how the actual SIFT code does it). One might guess that, for the group whose base scale is 2σ0, the relative scale between the i-th and (i+1)-th layer should instead be

2σ0 · 2^(i/S) · sqrt(2^(2/S) − 1);

however, the relative scales of the 0-th group can still be used, because the images of this group have been downsampled by a factor of 2, which halves the blur values when measured on the new pixel grid.
The fifth question: The difference between extreme points obtained in discrete space and those in continuous space.
To find extreme points in the scale space, each sampling point must be compared with all its neighboring points to see if it is larger or smaller than the neighboring points in both the image domain and the scale domain. For any detection point, it must be compared with eight neighboring points at the same scale and nine × two corresponding points at adjacent scales, totaling 26 points to compare, ensuring that extreme points are detected in both scale space and two-dimensional image position space. In other words, the comparison is done within a 3× 3× 3 cube.
The search process starts from the second layer of each group, taking the second layer as the current layer, and for each point in the DoG image of the second layer, a 3× 3× 3 cube is taken, with the upper and lower layers being the first and third layers. Thus, the extreme points found have both position coordinates (DoG image coordinates) and spatial scale coordinates (layer coordinates). After completing the search for the second layer, the third layer is taken as the current layer, and the process is similar to that of the second layer. When S=3, three layers need to be searched within each group.
The search for extreme points is conducted in discrete space, and the detected extreme points are not true extreme points in a meaningful sense. The method of interpolating known discrete space points to obtain continuous space extreme points is called sub-pixel interpolation.
SURF Algorithm
SURF (Speeded Up Robust Features) is a further optimization of SIFT: the pyramid scale space is built from the Hessian matrix, the two-dimensional Gaussian filtering is approximated with box filters, and no downsampling is needed; the main orientation of a feature point is determined with Haar wavelet features, and the resulting descriptor is 64-dimensional. Because of these differences the pyramid images constructed by SURF differ markedly from SIFT's, which increases detection speed. SIFT uses DoG images, while SURF uses images of approximate Hessian-matrix determinant values, also written as the DoH operator.
Since calculating the Hessian requires Gaussian smoothing first and then taking second-order derivatives, these two operations can be combined into one template instead. Compared to the Gaussian pyramid construction process of the SIFT algorithm, the SURF algorithm speeds up the process. In the SIFT algorithm, each group (octave) of images is of different sizes, with the next group being a downsampling (1/4 size) of the previous group; within each group, the several images are of the same size, differing only in the scales used. Moreover, during the blurring process, the size of their Gaussian templates remains constant, with only the scale changing. For the SURF algorithm, the image size remains constant, only the size of the Gaussian blur template changes; of course, the scale is also changing, but there is no need for the downsampling process, saving time.
To ensure rotation invariance, SURF does not build a gradient histogram but instead accumulates Haar wavelet features in the feature point's neighborhood. Specifically, centered on the feature point, the total Haar wavelet responses in the x (horizontal) and y (vertical) directions are computed within a 60-degree sector of a circle of radius 6s (s being the scale of the feature point), and the responses are weighted with Gaussian coefficients so that responses close to the feature point contribute more and distant ones contribute less. The responses within the 60-degree sector are summed to form a vector; the sector is swept over the whole circular area, and the direction of the longest vector is taken as the main orientation of the feature point. Then a square window aligned with the main orientation is taken around the feature point and divided into 4 × 4 = 16 sub-regions, and Haar wavelet features are accumulated in each sub-region: the sum of the horizontal responses, the sum of the vertical responses, and the sums of their absolute values form a 16 × 4 = 64-dimensional feature vector.
After extraction, a database of the image's feature points is built; each feature point's record includes its position coordinates, scale, orientation, and feature vector (128- or 64-dimensional). For each feature point in a new image, match it against the database one by one: search for its nearest and second-nearest neighbors by the Euclidean distance between feature vectors, and if the ratio of the nearest-neighbor distance to the second-nearest-neighbor distance is below a certain threshold, the match is accepted.
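An OpenCV sketch of this nearest/second-nearest matching with the ratio test; the 0.8 threshold and the file names are placeholder choices:

```python
import cv2

img1 = cv2.imread("scene1.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("scene2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

bf = cv2.BFMatcher(cv2.NORM_L2)                 # Euclidean distance on float descriptors
pairs = bf.knnMatch(des1, des2, k=2)            # nearest and second-nearest neighbours

good = [p[0] for p in pairs
        if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]
print(f"{len(good)} matches passed the ratio test")
```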
In summary, SURF obtains the local extrema of the image from the Hessian matrix quite stably, but in the main-orientation stage it relies too heavily on the gradient directions of the pixels in a local area: even a small deviation in the chosen main orientation causes large errors in the subsequent feature vector extraction and matching, leading to failed matches. In addition, if the layers of the image pyramid are not spaced tightly enough, scale errors arise, and the subsequent feature vector extraction also depends on the corresponding scale; here only a compromise is possible: use an appropriate number of layers and then interpolate.
Comparison of SIFT features and SURF features:
- [Constructing the image pyramid] SIFT convolves images of different sizes with Gaussian difference filters; SURF convolves the original image with box filters of different sizes.
- [Feature descriptors] SIFT uses a 4×4×8 = 128-dimensional descriptor; SURF uses a 4×4×4 = 64-dimensional descriptor.
- [Feature point detection] SIFT first performs non-maximum suppression, then removes low-contrast points, and finally uses the Hessian matrix to eliminate points with excessive edge response; SURF first selects candidate points with the Hessian matrix and then performs non-maximum suppression.
- [Feature point main orientation] SIFT computes a histogram of gradient magnitudes in a square region and takes the histogram maximum as the main orientation, allowing multiple main orientations; SURF computes Haar wavelet responses in the sectors of a circular region and takes the direction of the sector with the maximal response as the main orientation.
ORB Features
This section mainly references:
https://blog.csdn.net/zouzoupaopao229/article/details/52625678
ORB features combine the FAST feature point detection method with the BRIEF feature descriptor, making improvements and optimizations based on their original foundations. The speed of the ORB algorithm is approximately 100 times that of SIFT and 10 times that of SURF.
ORB (Oriented FAST and Rotated BRIEF) is a fast feature point extraction and description algorithm. This algorithm was proposed by Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R.Bradski in a 2011 paper titled “ORB: An Efficient Alternative to SIFT or SURF.” The ORB algorithm consists of two parts: feature point extraction and feature point description. Feature extraction is derived from the FAST (Features from Accelerated Segment Test) algorithm, and feature point description is based on improvements to the BRIEF (Binary Robust Independent Elementary Features) feature description algorithm.
ORB uses the ID3 algorithm to train a decision tree, inputting the 16 pixels around the feature point into the decision tree to filter out the optimal FAST feature points.
Next, non-maximum suppression is applied to remove locally dense feature points: it eliminates the problem of several feature points lying very close together. For each feature point a response value is computed as the sum of the absolute differences between feature point P and the 16 pixels on its surrounding circle; among closely located feature points, the one with the larger response is kept and the others are deleted.
A pyramid can now be built to give the feature points multi-scale invariance. A scale factor scaleFactor (1.2 by default in OpenCV) and a number of pyramid levels nlevels (8 by default in OpenCV) are set, and the original image is scaled down into nlevels images: I' = I / scaleFactor^k (k = 1, 2, …, nlevels). The oFAST feature points extracted from all of these differently scaled images together constitute the oFAST feature points of the image.
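A short OpenCV sketch of ORB extraction and Hamming-distance matching; the parameter values mirror the defaults mentioned above, and the file names are placeholders:

```python
import cv2

img1 = cv2.imread("scene1.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("scene2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500, scaleFactor=1.2, nlevels=8)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# binary rBRIEF descriptors are compared with the Hamming distance
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} ORB matches")
```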
The ORB algorithm uses the moment (intensity centroid) method to determine the orientation of a FAST feature point: the centroid of the image patch within a radius r of the feature point is computed, and the vector from the feature point to the centroid gives the feature point's orientation. The moments are defined as

m_pq = Σ_{x,y} x^p y^q I(x, y),   p, q ∈ {0, 1},

where I(x, y) is the grayscale value of the image. The centroid of these moments is

C = (m_10 / m_00, m_01 / m_00).

Let the corner point be O; then the angle of the vector OC is the orientation of the feature point, computed as

θ = arctan(m_01 / m_10).
The rBRIEF feature description is an improvement of the BRIEF feature description that incorporates a rotation factor. The BRIEF algorithm computes a binary string as the feature descriptor. It does this by selecting n pairs of pixel points pi, qi (i=1,2,…,n) within a neighborhood of a feature point. It then compares the grayscale values of each pair. If I(pi)> I(qi), a 1 is generated in the binary string; otherwise, it generates a 0. After comparing all point pairs, a binary string of length n is generated. Typically, n is set to 128, 256, or 512, with OpenCV defaulting to 256. Additionally, it is worth noting that to enhance the noise resistance of the feature descriptor, the algorithm first requires Gaussian smoothing of the image.
The matching quality of the BRIEF descriptors is very high for images with minimal rotation; in most cases tested by the authors, it surpasses SURF. However, once the rotation exceeds 30°, the matching rate of BRIEF quickly drops to around 0. The time taken by BRIEF is very short; under the same conditions, calculating descriptors for 512 feature points takes 335ms for SURF and only 8.18ms for BRIEF; matching SURF descriptors takes 28.3ms while BRIEF only takes 2.19ms. In situations where the requirements are not too high, BRIEF descriptors can more easily achieve real-time performance.
Improved BRIEF Algorithm—rBRIEF (Rotation-Aware Brief)
(1) Steered BRIEF (Rotation Invariance Improvement)
The feature points computed with the oFAST algorithm include an orientation angle θ. Suppose the original BRIEF algorithm selects n pairs of points in an S×S neighborhood (S is generally 31); these point pairs are rotated by the angle θ to obtain new point pairs, and the grayscale comparisons of the new pairs form the binary string descriptor. Note that the feature points are extracted at different scales by oFAST, so when computing the BRIEF descriptor the image must first be converted to the corresponding scale, and the S×S neighborhood around the feature point is taken in that scale image before the point pairs are selected and rotated to obtain the binary string descriptor.
(2) rBRIEF—Improved Correlation of Feature Descriptors
The feature descriptors obtained using the steered BRIEF method have rotation invariance, but they do not perform as well as the original BRIEF algorithm in another property. What is this property? It is the distinctiveness of the descriptors, or their correlation. This property significantly impacts the quality of feature matching. The descriptor expresses the characteristics of the feature points. The descriptors we compute should express the uniqueness of the feature points as much as possible. If the distinctiveness of the descriptors from different feature points is poor, it becomes challenging to find corresponding matching points during matching, leading to mismatches.
To address the issues of descriptor distinctiveness and correlation, ORB uses statistical learning methods to reselect the set of point pairs.
First, a test set of 300k feature points is established, and for each point in the test set its 31×31 neighborhood is considered. Unlike the original BRIEF algorithm, after Gaussian smoothing of the image, the average grayscale of a 5×5 sub-window is used in place of the value of a single point of a point pair, which further improves noise resistance. Integral images can also be used to speed up the computation of the 5×5 neighborhood averages.
From the above, it can be seen that within a 31×31 neighborhood, there are (31-5+1)x(31-5+1)=729 such sub-windows, meaning there are M=265356 methods for selecting point pairs, and we need to select 256 methods from these M methods, with the principle of selection being that the correlation among these 256 methods is minimal. How to select?
- For each of the 300k feature points, select point pairs in its 31×31 neighborhood using each of the M methods and compare each pair, producing a 300k × M binary matrix Q; each column of Q holds the binary results of one selection method over the 300k points.
- Compute the mean of each column of Q and reorder the columns by the distance of their mean from 0.5, obtaining a matrix T.
- Put the first column vector of T into R.
- For the next column vector of T, compute its correlation with all column vectors already in R; if the correlation is below a set threshold, move that column vector from T into R.
- Repeat the previous step until there are 256 vectors in R.
This is the rBRIEF algorithm.