Current Research Status of Object Detection Algorithms Based on Transformer

Object detection is a fundamental task in computer vision that requires locating and classifying objects in an image. The groundbreaking R-CNN family[1]-[3], along with ATSS[4], RetinaNet[5], FCOS[6], PAA[7], and a series of variants[8]-[10], has achieved significant breakthroughs in object detection. At their core is one-to-many label assignment, which assigns each ground-truth box as a supervisory target to multiple coordinates in the detector's output, in collaboration with proposals, anchors, or window centers. Although these detectors perform well, they depend heavily on manually designed components such as non-maximum suppression[11] and anchor generation[12]. To achieve more convenient end-to-end detection, the DEtection TRansformer (DETR)[13] was proposed. As the first end-to-end Transformer-based[14] detection model, it treats object detection as a set prediction problem and introduces a one-to-one set matching scheme built on the Transformer encoder-decoder architecture. In this way, each ground-truth box is assigned to a single query, eliminating the need for manually designed components that encode prior knowledge.
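The one-to-one set matching that DETR introduces can be sketched with the Hungarian algorithm. The cost terms and weight below are simplified stand-ins (DETR's actual matching cost also includes a generalized IoU term), and the function name and toy inputs are illustrative, not taken from any implementation.

```python
# Sketch of DETR-style one-to-one set matching via the Hungarian algorithm.
# Assumption: classification probability + L1 box distance as the only cost
# terms; real DETR additionally uses a GIoU cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries_to_targets(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """Assign each ground-truth box to exactly one query.

    pred_probs: (num_queries, num_classes) softmax scores
    pred_boxes: (num_queries, 4) predicted boxes (cx, cy, w, h)
    gt_labels:  (num_gt,) class indices
    gt_boxes:   (num_gt, 4) ground-truth boxes
    Returns (query_indices, gt_indices) of the optimal assignment.
    """
    # Classification cost: negative probability of the target class.
    cost_class = -pred_probs[:, gt_labels]              # (num_queries, num_gt)
    # Box cost: L1 distance between predicted and ground-truth boxes.
    cost_box = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)
    cost = cost_class + 5.0 * cost_box                  # illustrative weighting
    return linear_sum_assignment(cost)                  # Hungarian algorithm

# Toy example: 4 queries, 2 ground-truth objects, 3 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=4)
boxes = rng.random((4, 4))
q_idx, g_idx = match_queries_to_targets(probs, boxes,
                                        np.array([0, 2]),
                                        rng.random((2, 4)))
# One-to-one: each ground truth is matched to a distinct query.
assert len(set(q_idx)) == len(q_idx) == 2
```

Because the assignment is bipartite and one-to-one, duplicate predictions are penalized during training, which is what lets DETR drop non-maximum suppression at inference time.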
DETR achieved an excellent 42.0 AP on the COCO[15] dataset, comparable to a finely tuned Faster R-CNN. However, DETR suffers from slow convergence and weak detection of small objects. To address these issues, Zhu et al.[16] proposed an optimized DETR-style detector, Deformable DETR, which introduces a deformable attention mechanism that sparsely samples feature maps at multiple scales, letting the model focus on learning key, meaningful positions. This both accelerates convergence and improves detection accuracy on small objects. In the same year, Conditional DETR[17] was proposed, which learns a conditional spatial query from each query's content features to better match image features. Efficient DETR[18] introduces a dense prediction module to select the top-K object queries, and Anchor DETR[19] represents queries as 2D anchor points; both associate each query with a specific spatial position. However, all of the aforementioned works use only 2D positions as anchor points, without considering the spatial scale of objects.
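The sparse-sampling idea behind Deformable DETR's attention can be illustrated in isolation: instead of attending over every pixel, a query aggregates a handful of bilinearly interpolated samples around a reference point. In the sketch below the offsets and attention weights are random placeholders for values that linear layers would predict from the query, it covers a single feature scale and attention head, and all names are illustrative.

```python
# Sketch of the sampling step in deformable attention (single scale, single
# head). Assumption: offsets/weights are random stand-ins for the outputs of
# learned linear projections.
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly interpolate feat (H, W, C) at continuous coords (x, y)."""
    H, W, _ = feat.shape
    # Clamp the integer corner so all four neighbors stay inside the map.
    x0 = int(np.clip(np.floor(x), 0, W - 2))
    y0 = int(np.clip(np.floor(y), 0, H - 2))
    wx = np.clip(x - x0, 0.0, 1.0)
    wy = np.clip(y - y0, 0.0, 1.0)
    return ((1 - wx) * (1 - wy) * feat[y0, x0] +
            wx * (1 - wy) * feat[y0, x0 + 1] +
            (1 - wx) * wy * feat[y0 + 1, x0] +
            wx * wy * feat[y0 + 1, x0 + 1])

def deformable_attention(feat, ref_point, offsets, weights):
    """Aggregate K sparse samples around ref_point (x, y).

    feat:    (H, W, C) feature map
    offsets: (K, 2) sampling offsets relative to the reference point
    weights: (K,) attention weights summing to 1
    """
    samples = np.stack([bilinear_sample(feat, ref_point[0] + dx,
                                        ref_point[1] + dy)
                        for dx, dy in offsets])        # (K, C)
    return weights @ samples                            # (C,)

rng = np.random.default_rng(0)
feat = rng.random((8, 8, 16))
out = deformable_attention(feat, (3.5, 4.2),
                           rng.normal(scale=1.0, size=(4, 2)),
                           np.full(4, 0.25))
assert out.shape == (16,)
```

With only K samples per query (K = 4 here) rather than all H x W positions, the attention cost no longer grows quadratically with the feature-map size, which is why deformable attention converges faster and scales to the high-resolution maps needed for small objects.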

References:

[1] Girshick R. Fast R-CNN[C]. In Proceedings of the IEEE International Conference on Computer Vision, 2015:1440–1448.

[2] He K M, Gkioxari G, Dollar P, et al. Mask R-CNN[C]. In Proceedings of the IEEE International Conference on Computer Vision, 2017: 2961–2969.

[3] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]. Advances in Neural Information Processing Systems, 2015, 28.

[4] Zhang S F, Chi C, Yao Y Q, et al. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection[C]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 9759–9768.

[5] Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]. In Proceedings of the IEEE International Conference on Computer Vision, 2017: 2980–2988.

[6] Tian Z, Shen C H, Chen H, et al. FCOS: Fully convolutional one-stage object detection[C]. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 9627–9636.

[7] Kim K and Lee H S. Probabilistic anchor assignment with IoU prediction for object detection[C]. In European Conference on Computer Vision, Springer, 2020: 355–371.

[8] Song G L, Liu Y, and Wang X G. Revisiting the sibling head in object detector[C]. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 11563–11572.

[9] Xue Z Y, Liang J M, Song G L, et al. Large-batch optimization for dense visual predictions[C]. In Advances in Neural Information Processing Systems, 2022.

[10] Zong Z F, Cao Q G, and Leng B. RCNet: Reverse feature pyramid and cross-scale shift network for object detection[C]. In Proceedings of the 29th ACM International Conference on Multimedia, 2021: 5637–5645.

[11] Hosang J, Benenson R, and Schiele B. Learning non-maximum suppression[C]. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 4507-4515.

[12] Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[13] Carion N, Massa F, Synnaeve G, et al. End-to-end object detection with transformers[C]. In European Conference on Computer Vision, Springer, 2020: 213–229.

[14] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]. In Advances in Neural Information Processing Systems, 2017: 5998–6008.

[15] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context[C]. In European Conference on Computer Vision, Springer, 2014: 740–755.

[16] Zhu X Z, Su W J, Lu L W, et al. Deformable DETR: Deformable transformers for end-to-end object detection[C]. In International Conference on Learning Representations, 2021.

[17] Meng D P, Chen X K, Fan Z J, et al. Conditional DETR for fast training convergence. arXiv preprint arXiv:2108.06152, 2021.

[18] Yao Z Y, Ai J B, Li B X, et al. Efficient DETR: Improving end-to-end object detector with dense prior. arXiv preprint arXiv:2104.01318, 2021.

[19] Wang Y M, Zhang X Y, Yang T, et al. Anchor DETR: Query design for transformer-based detector. arXiv preprint arXiv:2109.07107, 2021.


WeChat Official Account: Artificial Intelligence Perception Information Processing Algorithm Research Institute
