A Detailed Explanation of Object Detection Loss Functions: IOU, GIOU, DIOU, CIOU

The MLNLP community is a well-known machine learning and natural language processing community, both domestically and internationally. Its audience includes NLP master’s and doctoral students, university teachers, and corporate researchers. The vision of the community is to promote communication and progress between the academic and industrial circles of natural language processing and machine learning, especially for beginners.

Reprinted from | Jishi Platform

Author | Memory’s Maze

Source | https://zhuanlan.zhihu.com/p/359982543

『IOU Loss Function』


The figure shows three pairs of overlapping rectangles: green is the ground truth (GT) box where the real object lies, and black is the predicted box. The third prediction is clearly the best, because its position is closest to the real target. Yet the L2 loss is the same for all three pairs, 8.41, even though their IOU values differ. The L2 loss therefore cannot accurately reflect how much two bounding boxes overlap, which motivated the IOU loss function.
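The same effect can be reproduced with a small sketch. The boxes below are hypothetical (they are not the ones in the figure, so the loss value is not 8.41), boxes are assumed to be in (x1, y1, x2, y2) corner format, and the L2 loss is taken here as the sum of squared coordinate differences:

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection, clamped at zero when disjoint.
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def squared_l2(box_a, box_b):
    """Sum of squared differences over the four box coordinates."""
    return sum((a - b) ** 2 for a, b in zip(box_a, box_b))


# Hypothetical ground truth and three hypothetical predictions.
gt = (2, 2, 6, 6)
preds = [(3, 3, 7, 7), (2, 2, 6, 8), (4, 2, 6, 6)]

l2_values = [squared_l2(p, gt) for p in preds]
iou_values = [iou(p, gt) for p in preds]

print(l2_values)                          # [4, 4, 4]
print([round(v, 3) for v in iou_values])  # [0.391, 0.667, 0.5]
```

All three predictions get the same L2 value, while their IOU values are clearly different.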

The figure shows how the IOU loss is computed: the green box is the position of the real target, and the blue box is the predicted box. IOU is simply the area of the intersection of the two boxes divided by the area of their union. Taking the natural logarithm of that value and negating it gives the IOU loss: L(IOU) = -ln(IOU). A perfect overlap (IOU = 1) gives a loss of 0, and the loss grows as the overlap shrinks.
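A minimal sketch of this calculation in plain Python follows. It assumes boxes in (x1, y1, x2, y2) corner format; the small `eps` term is an assumption added here to avoid taking the logarithm of zero for non-overlapping boxes, and is not part of the formula described above.

```python
import math


def iou(box_a, box_b):
    """Intersection area divided by union area, boxes as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection, clamped at zero when disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, gt, eps=1e-7):
    """Negative natural log of IoU; eps (an assumption) guards against log(0)."""
    return -math.log(iou(pred, gt) + eps)


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))       # intersection 1, union 7
print(iou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # close to 0 for a perfect match
```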

『GIOU Loss Function』

As shown in the figure, green is the real target bounding box, red is the predicted bounding box, and the outer blue box is the smallest rectangle that encloses both. Ac is the area of the blue rectangle, and u is the area of the union of the red and green rectangles. GIOU subtracts from IOU the fraction of the enclosing box that the union does not cover: GIOU = IOU - (Ac - u) / Ac.

If the red and green rectangles overlap perfectly, then IOU = 1 and Ac = u = the area of the predicted box, so GIOU = 1 - 0 = 1. If the two boxes are very far apart, IOU = 0 and Ac becomes very large while u stays equal to the sum of the two box areas, so (Ac - u) / Ac tends to 1 and GIOU tends to 0 - 1 = -1. The range of GIOU is therefore (-1, 1].

The final expression for the GIOU loss function is L(GIOU) = 1 – GIOU
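The GIOU computation can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's reference implementation; it assumes boxes in (x1, y1, x2, y2) corner format with positive area, and the function names are chosen here for illustration.

```python
def giou(box_a, box_b):
    """GIoU = IoU - (Ac - u) / Ac for boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union areas.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area Ac of the smallest rectangle enclosing both boxes.
    ac = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (ac - union) / ac


def giou_loss(pred, gt):
    """L(GIoU) = 1 - GIoU."""
    return 1.0 - giou(pred, gt)


print(giou((0, 0, 2, 2), (0, 0, 2, 2)))          # perfect overlap gives 1.0
print(giou((0, 0, 1, 1), (100, 100, 101, 101)))  # far apart, close to -1
```

Unlike IOU, the value keeps changing as two non-overlapping boxes move closer together, so the loss still carries a training signal when the boxes do not intersect.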

The table above shows the experimental results from the original paper: the first column (AP-IoU column) using MSE (L2 loss) has mAP=0.461, while using IoU loss yields mAP=0.466, a slight improvement. If GIoU loss is used, it can reach 0.477, which performs better than IOU.

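However, GIOU also has its drawbacks. When the predicted box and the GT box have the same width and height and overlap along the same horizontal (or vertical) line, the smallest enclosing box coincides with their union, the penalty term becomes 0, and GIOU degenerates to IOU. The same happens when one box lies entirely inside the other. In addition, both GIOU and IOU share two shortcomings: slow convergence and insufficient regression accuracy.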
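The degenerate case can be checked numerically. The sketch below uses hypothetical boxes in (x1, y1, x2, y2) format and computes only the GIOU penalty term (Ac - u) / Ac, where Ac is the enclosing-box area and u the union area; when the penalty is 0, GIOU equals IOU.

```python
def giou_penalty(box_a, box_b):
    """Return (Ac - u) / Ac, the term GIoU subtracts from IoU."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest rectangle enclosing both boxes.
    ac = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return (ac - union) / ac


gt = (0, 0, 2, 2)
aligned = (1, 0, 3, 2)   # same size, shifted along the horizontal axis only
diagonal = (1, 1, 3, 3)  # same size, shifted diagonally

print(giou_penalty(gt, aligned))   # 0.0, so GIoU equals IoU here
print(giou_penalty(gt, diagonal))  # positive, so GIoU is below IoU
```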

『DIOU Loss Function』

Before introducing DIOU, let’s first look at its effect. In the figure, black is the anchor box, blue and red are the predicted boxes, and green is the GT box where the real target lies. The goal is for the blue and red boxes to overlap the green box as much as possible. The first row trains the network with GIOU, and the prediction only roughly overlaps the GT box after 400 iterations. The second row trains with DIOU, and the prediction overlaps the GT box almost completely after 120 iterations. Compared to GIOU, DIOU not only converges faster but is also more accurate.

Consider another figure, which shows three ways a predicted box can sit relative to the target box. The positions clearly differ, and the third case, where the centers of the two boxes coincide, is the one we want. The IOU loss and GIOU loss are identical across the three cases, so neither expresses the positional relationship of the boxes well. The DIOU loss differs across the three cases, which makes it the more reasonable measure.

DIOU = IOU - ρ²(b, b(gt)) / c², where ρ is the Euclidean distance between b and b(gt).

To read the formula against the figure: b is the center of the predicted box (the black box), and b(gt) is the center of the real target bounding box (the green box). ρ² is the squared distance between the two center points, that is, the square of d (the red line) in the figure, and c is the length of the diagonal of the smallest rectangle enclosing both boxes (the blue line). If the two boxes overlap perfectly, d = 0 and IOU = 1, so DIOU = 1 - 0 = 1. If the two boxes are very far apart, d²/c² approaches 1 and IOU = 0, so DIOU approaches 0 - 1 = -1. The range of DIOU is therefore also (-1, 1].

The final loss function for DIOU is: L(DIoU) = 1 – DIOU
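A minimal sketch of DIOU, assuming boxes in (x1, y1, x2, y2) corner format. The three predictions below are hypothetical boxes of equal size placed inside the same ground-truth box, so their IOU (and GIOU) values are identical and only the center distance term separates them.

```python
def diou(pred, gt):
    """DIoU = IoU - d^2 / c^2 for boxes in (x1, y1, x2, y2) format."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union
    # d^2: squared distance between the two box centers.
    d2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + \
         ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # c^2: squared diagonal of the smallest enclosing rectangle.
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + \
         (max(py2, gy2) - min(py1, gy1)) ** 2
    return iou - d2 / c2


def diou_loss(pred, gt):
    """L(DIoU) = 1 - DIoU."""
    return 1.0 - diou(pred, gt)


gt = (0, 0, 4, 4)
corner = (0, 0, 2, 2)    # IoU 0.25, center far from the GT center
offset = (2, 1, 4, 3)    # IoU 0.25, center one unit from the GT center
centered = (1, 1, 3, 3)  # IoU 0.25, centers coincide

for box in (corner, offset, centered):
    print(box, diou_loss(box, gt))
```

The loss is smallest for the centered box, which matches the intuition that aligning the centers is the preferred way to overlap.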

In the paper's results table, the IOU loss baseline reaches a mAP of 46.57. DIOU loss improves accuracy over both IOU and GIOU, a relative gain of about 3% compared to IOU.

『CIOU Loss Function』

In the paper, the authors state that a good localization regression loss should consider three geometric factors: overlap area, center point distance, and aspect ratio. CIOU builds on DIOU by adding a penalty on the mismatch between the aspect ratios of the predicted box and the real box, so that the predicted box's width and height match the real box more closely.

The three components of CIOU thus correspond to IOU, center point distance, and aspect ratio: CIOU = IOU - (ρ²(b, b(gt)) / c² + αv), and CIOU loss = 1 - CIOU. Here v measures the consistency of the aspect ratios, v = (4 / π²) · (arctan(w(gt) / h(gt)) - arctan(w / h))², and α is a trade-off weight, α = v / ((1 - IOU) + v). w, h and w(gt), h(gt) are the width and height of the predicted box and the real box, respectively.
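The aspect ratio term can be sketched as follows, using the standard CIOU definitions v = (4 / π²) · (arctan(w(gt) / h(gt)) - arctan(w / h))² and α = v / ((1 - IOU) + v). This is an illustrative plain-Python sketch assuming (x1, y1, x2, y2) boxes with positive width and height; the guard for v = 0 is an assumption added here to avoid dividing by zero when the boxes match exactly.

```python
import math


def ciou(pred, gt):
    """CIoU = IoU - (d^2 / c^2 + alpha * v), boxes as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    w, h = px2 - px1, py2 - py1
    wg, hg = gx2 - gx1, gy2 - gy1
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter)
    # Center distance term, as in DIoU.
    d2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + \
         ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + \
         (max(py2, gy2) - min(py1, gy1)) ** 2
    # Aspect ratio consistency v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0  # guard is an assumption
    return iou - (d2 / c2 + alpha * v)


def ciou_loss(pred, gt):
    """CIoU loss = 1 - CIoU."""
    return 1.0 - ciou(pred, gt)


gt = (0, 0, 4, 4)
print(ciou((1, 1, 3, 3), gt))  # same center, same aspect ratio: equals IoU
print(ciou((0, 1, 4, 3), gt))  # same center, different aspect ratio: below IoU
```

In the second call the centers coincide, so the distance term is 0 and the whole gap between CIOU and IOU comes from the aspect ratio penalty.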

With CIOU loss, mAP reaches 49.21%, about 1.5 percentage points higher than GIOU. CIOU(D) denotes additionally using DIOU in place of IOU as the overlap criterion in non-maximum suppression (DIoU-NMS) at inference time, which further improves accuracy.

In actual detection results, CIOU finds a better-fitting bounding box than GIOU. As shown in the images above, in the first row, the first cat, detected with the GIOU loss, has one ear outside the box, while the second cat, detected with the CIOU loss, is boxed accurately. Similarly, in the second row, the first dog is fully enclosed but the box does not follow the dog's outline closely, while in the second image the bounding box fits just right.

Thanks: The content of this article is summarized from the uploader Pili Bala Wz

Reference video source: https://www.bilibili.com/video/BV1yi4y1g7ro
