Inspired by these studies, Shilong Liu et al. studied the cross-attention module in the Transformer decoder in depth and proposed using 4D box coordinates (x, y, w, h), i.e., anchor boxes, as the queries in DETR, updating them layer by layer. This query formulation introduces better spatial priors into the cross-attention module, simplifies the implementation, and deepens our understanding of the role queries play in DETR. Despite these advances, little work has focused on the bipartite matching component as a way to improve training efficiency. Because of the stochastic nature of optimization, the discrete bipartite matching step is unstable, especially in the early stages of training, and this leads to slow convergence: for the same image, a query often matches different objects at different training stages, making optimization ambiguous and unstable.
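To make the anchor-box query idea concrete, the following is a minimal sketch (not the authors' released code) of how 4D anchors (x, y, w, h) can serve directly as decoder queries: each coordinate is mapped to a sinusoidal embedding and projected to the query dimension. The module and function names, the feature size, and the omission of the per-layer anchor refinement are all illustrative assumptions.

```python
import torch
import torch.nn as nn


def sine_embed(coords: torch.Tensor, num_feats: int = 64, temperature: float = 10000.0) -> torch.Tensor:
    """Encode each scalar coordinate of (x, y, w, h) with a sinusoidal embedding of size num_feats."""
    dim_t = torch.arange(num_feats, dtype=torch.float32, device=coords.device)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_feats)
    pos = coords.unsqueeze(-1) / dim_t                                   # (..., 4, num_feats)
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(-3)                                               # (..., 4 * num_feats)


class AnchorBoxQueries(nn.Module):
    """Learnable (x, y, w, h) anchors used as decoder queries.

    In the full model each decoder layer predicts a box offset and updates the
    anchors layer by layer; that refinement loop is omitted from this sketch.
    """

    def __init__(self, num_queries: int = 300, hidden_dim: int = 256):
        super().__init__()
        # Anchors are stored in unnormalized space so that sigmoid keeps them in [0, 1].
        self.anchors = nn.Parameter(torch.randn(num_queries, 4))
        self.proj = nn.Linear(4 * 64, hidden_dim)  # map sine features to the query dimension

    def forward(self) -> torch.Tensor:
        boxes = self.anchors.sigmoid()             # (num_queries, 4) in [0, 1]
        return self.proj(sine_embed(boxes))        # (num_queries, hidden_dim) positional queries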
To address this issue, Feng Li et al. proposed a new training method that stabilizes bipartite matching by introducing a query denoising task. Specifically, noised ground-truth bounding boxes are used as noised queries and fed into the Transformer decoder together with the learnable anchor queries. Both types of queries share the same (x, y, w, h) format, so they can be input to the decoder simultaneously. The noised queries are trained with a denoising task that reconstructs their corresponding ground-truth boxes, while the learnable anchor queries use the same training loss as standard DETR, including bipartite matching. Since the noised boxes do not need to go through bipartite matching, the denoising task can be viewed as a simpler auxiliary task that helps DETR sidestep the unstable discrete matching and learn bounding-box prediction faster. The denoising task is also easier to optimize, because the added random noise is usually small. To fully exploit this auxiliary task, each decoder query is treated as a bounding box plus a class-label embedding, so that box denoising and label denoising can be performed simultaneously.
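The sketch below illustrates how such noised queries might be constructed (it is an illustration of the idea, not the paper's reference implementation): ground-truth boxes are jittered in center and scale, a fraction of labels are flipped to random classes, and the results are returned as the positional and content parts of the denoising queries. The noise scales, probabilities, and tensor names are assumptions.

```python
import torch
import torch.nn as nn


def make_denoising_queries(gt_boxes: torch.Tensor,      # (num_gt, 4), normalized (cx, cy, w, h)
                           gt_labels: torch.Tensor,     # (num_gt,) integer class ids
                           label_embed: nn.Embedding,   # class-label embedding table
                           num_classes: int,
                           box_noise: float = 0.4,
                           label_flip_prob: float = 0.2):
    """Return (noised_boxes, noised_label_embeds) to feed the decoder alongside the anchor queries."""
    cx, cy, w, h = gt_boxes.unbind(-1)

    # Box denoising: shift the center within the box and slightly rescale width/height.
    dx = (torch.rand_like(cx) * 2 - 1) * box_noise * w
    dy = (torch.rand_like(cy) * 2 - 1) * box_noise * h
    dw = w * (1 + (torch.rand_like(w) * 2 - 1) * box_noise)
    dh = h * (1 + (torch.rand_like(h) * 2 - 1) * box_noise)
    noised_boxes = torch.stack((cx + dx, cy + dy, dw, dh), dim=-1).clamp(0, 1)

    # Label denoising: flip a fraction of the labels to random classes.
    flip = torch.rand_like(gt_labels, dtype=torch.float32) < label_flip_prob
    random_labels = torch.randint_like(gt_labels, num_classes)
    noised_labels = torch.where(flip, random_labels, gt_labels)

    # Each denoising query = noised box (positional part) + label embedding (content part).
    return noised_boxes, label_embed(noised_labels)
```

In a full model, the returned noised boxes would be encoded the same way as the anchor-box queries above, and the label embeddings would serve as the content part of the queries; the denoising branch is then supervised with plain box-regression and classification losses against the original ground truth, with no matching step, while the anchor queries keep the standard DETR loss.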
References:
[1] Li F, Zhang H, Liu S L, et al. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. arXiv preprint arXiv:2203.01305, 2022.