Latest Review on Multi-Modal 3D Object Detection in Autonomous Driving


Source | Public account: Heart of Autonomous Driving

Autonomous vehicles require continuous environmental perception to understand the distribution of obstacles for safe driving. Specifically, 3D object detection is a crucial functional module, as it simultaneously predicts the category, location, and size of surrounding objects. Autonomous cars are generally equipped with multiple sensors, including cameras and LiDAR, and the unsatisfactory performance of single-modal methods has prompted the use of multi-modal inputs to compensate for single-sensor failures. Despite the existence of numerous multi-modal fusion detection algorithms, there is still a lack of comprehensive analysis that elucidates how to fuse multi-modal data effectively. This article therefore reviews the latest advances in fusion detection methods. It first introduces the broad context of multi-modal 3D detection and summarizes the characteristics of widely used datasets and their evaluation metrics. Second, it classifies and analyzes fusion methods from three aspects (feature representation, alignment, and fusion) rather than the traditional pre-, feature-, and post-fusion taxonomy, revealing how these fusion methods are realized in essence. Third, it examines their advantages and disadvantages in depth and compares their performance on mainstream datasets. Finally, it summarizes the current challenges and research trends toward fully exploiting the potential of multi-modal 3D detection.

Figure 2 gives a chronological overview of notable multi-modal works, and Figure 3 shows the overall structure of this review. The paper also outlines the progress in this field and comprehensively compares state-of-the-art methods.
The main contributions of this work can be summarized as follows:
  • This is the first comprehensive review of multi-modal 3D detection in autonomous driving, rather than treating it as a trivial subset of 3D detection;

  • This paper proposes a taxonomy for multi-modal 3D detection that surpasses the traditional pre-, feature-, and post-fusion paradigms, consisting of representation, alignment, and fusion aspects;

  • The latest advancements in multi-modal 3D detection are presented;

  • The paper comprehensively compares existing methods on several public datasets and provides in-depth analysis.


Background

This section introduces the general background of 3D detection and the relationship between single-modal and multi-modal 3D detection. It also discusses common datasets and evaluation metrics.

3D Object Detection

Problem Definition: 3D object detection is dedicated to predicting properties of targets in a three-dimensional scene, including location, size, and category. Typically, it can be represented as:
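The equation image from the original post is not reproduced here. A typical formulation, stated as an assumption based on the surrounding definition rather than the paper's exact notation, expresses the detector f_det as a mapping from the sensor inputs X (images and/or point clouds) to a set of parameterized 3D boxes:

B = f_det(X),  B = {B_1, ..., B_N},  B_i = (x_i, y_i, z_i, l_i, w_i, h_i, θ_i, c_i),

where (x_i, y_i, z_i) is the box center, (l_i, w_i, h_i) its size, θ_i the heading angle, and c_i the predicted category.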
Sensors: In 3D object detection, several popular sensors are shown in Table I, including monocular cameras, stereo cameras, LiDAR, and radar. Their advantages and disadvantages are also compared.
Cameras capture images with rich color and texture, and benefit from high frame rates and low cost. However, they lack depth information and are susceptible to lighting conditions. Point clouds, on the other hand, are data collected by LiDAR or radar: a large collection of points describing the spatial distribution of targets and the spectral properties of target surfaces in a common spatial reference system. LiDAR provides high-precision, high-density, and high-resolution point cloud data for object detection. However, processing this data requires substantial computational resources, and LiDAR is sensitive to adverse weather conditions. Radar can measure over a wide range, is largely unaffected by environmental conditions, and can detect moving objects, but its measurement accuracy and object resolution are relatively low and it may suffer from reflection interference.
Single-Modal: In autonomous driving, detection using a single sensor is unsatisfactory. Specifically, single-modal methods have inherent flaws that lead to insufficient environmental perception in 3D scenes. For example, camera-based 3D detectors achieve relatively low accuracy because images cannot provide sufficient depth information. LiDAR-based methods overcome the lack of depth information, but they inherit LiDAR's drawbacks, such as low resolution, sparsity, and poor texture.
Multi-Modal: Multi-modal 3D detection integrates multiple sensors to leverage the advantages of each modality for better performance. Compared to single-modal methods, it can fully exploit the strengths of multi-modal data (e.g., depth information from point clouds, texture information from images), offering significant potential for autonomous driving perception. However, it also introduces many problems and challenges. For instance, the pioneering multi-modal 3D detector MV3D[52] attempted to combine data from two modalities but overlooked the gap between the heterogeneous modalities. This heterogeneity gap remains a key challenge in multi-modal learning.
Common datasets are summarized in Table II.

Feature Representation

As various sensors with different characteristics perceive the 3D environment in a multi-modal setting, data representation becomes a key design choice for fusing information from different sensors. In autonomous driving scenarios, the input data mainly consists of images and LiDAR point clouds. However, to better utilize different data modalities, more data fusion representations have been proposed.
In multi-modal learning, data representation is a crucial component that determines the input of the modeling task. To this end, the paper reviews popular representations in multi-modal 3D detection, as shown in Table III and Figure 4. To help understand the breadth of existing methods, the paper categorizes them into two types: unified representation and raw representation. Unified representation methods aim to convert heterogeneous data into a homogeneous format; they are more challenging to construct, since a shared space must be built for the heterogeneous data. Raw representation refers to using the heterogeneous data directly for prediction, without converting it into a shared format first.

Unified Representation

Unified representation aims to process heterogeneous data (or features) in a consistent format to narrow the heterogeneity gap. Based on the type of representation, these methods can be categorized into three types: mixed-based, 3D-based, and BEV-based.
Mixed-Based Representation: Mixed-based methods aim to combine heterogeneous information in a homogeneous format, for instance, by converting 3D point clouds into 2D representations (similar to images). Mixed-based methods address the multi-modal detection problem from two aspects: designing new representations that can cope with heterogeneity and selecting appropriate learning perspectives. As a pioneering work, MV3D[52] represents the raw points from two different viewpoints, namely range view and bird’s-eye view. Specifically, MV3D proposes a coding method for front view (similar to range view) and bird’s-eye view, which contains height, density, and intensity. In this way, 3D representations can be transformed into 2D pseudo-images, allowing the network to use 2D convolutions to extract geometric details. Many works have followed this design philosophy.
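To make the idea concrete, below is a minimal NumPy sketch of an MV3D-style BEV encoding with height, intensity, and density channels. The grid size, value ranges, and single height channel are simplifying assumptions (MV3D itself uses several height slices), so this is an illustration of the encoding idea rather than the paper's exact scheme.

```python
# A minimal sketch of an MV3D-style BEV pseudo-image (assumed grid and ranges).
import numpy as np

def points_to_bev(points, x_range=(0, 70.4), y_range=(-40, 40), z_min=-3.0, res=0.1):
    """points: (N, 4) array of (x, y, z, intensity) in LiDAR coordinates."""
    W = int((x_range[1] - x_range[0]) / res)
    H = int((y_range[1] - y_range[0]) / res)
    height_map = np.zeros((H, W), dtype=np.float32)
    intensity_map = np.zeros((H, W), dtype=np.float32)
    density_map = np.zeros((H, W), dtype=np.float32)

    # Keep only points inside the BEV grid.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    xi = ((pts[:, 0] - x_range[0]) / res).astype(np.int32)
    yi = ((pts[:, 1] - y_range[0]) / res).astype(np.int32)

    for x, y, p in zip(xi, yi, pts):
        height_map[y, x] = max(height_map[y, x], p[2] - z_min)   # max height above z_min
        intensity_map[y, x] = max(intensity_map[y, x], p[3])     # reflectance
        density_map[y, x] += 1.0                                 # point count per cell

    density_map = np.minimum(1.0, np.log(density_map + 1) / np.log(64))
    return np.stack([height_map, intensity_map, density_map], axis=0)  # (3, H, W)
```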
3D-Based: Unlike mixed-based methods, where representations are densely distributed in 2D space, 3D-based methods aim to fuse heterogeneous representations in 3D space by lifting 2D representations into 3D. Several works[58] propose converting images into pseudo point clouds that contain both geometric and texture information. Since generating pseudo point clouds requires depth information for each pixel, these methods typically rely on depth estimation models, such as depth completion. SFD[58] proposes a basic pipeline for combining raw voxel features and pseudo point cloud features, eliminating the heterogeneity gap between the raw data representations.
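As an illustration of how such pseudo point clouds can be built, the sketch below lifts an RGB image into camera-frame 3D points using a predicted (or completed) depth map and the camera intrinsics K. The array names and shapes are assumptions for this example, not tied to any specific method.

```python
# A minimal sketch: lift an image to a colored pseudo point cloud from depth + intrinsics.
import numpy as np

def depth_to_pseudo_points(depth, rgb, K):
    """depth: (H, W) metric depth, rgb: (H, W, 3), K: (3, 3) camera intrinsics."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))       # pixel grids, shape (H, W)
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]                       # back-project with the pinhole model
    y = (v - K[1, 2]) * z / K[1, 1]
    xyz = np.stack([x, y, z], axis=-1).reshape(-1, 3)     # camera-frame coordinates
    colors = rgb.reshape(-1, 3)
    valid = xyz[:, 2] > 0                                  # drop pixels without depth
    return np.concatenate([xyz[valid], colors[valid]], axis=1)  # (M, 6): xyz + color
```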
BEV-Based: BEV representation is widely used in 3D perception because of its strong interpretability, which makes it easy to extend to additional sensor modalities and to serve downstream tasks. BEV representation can also mitigate challenging issues in autonomous driving scenarios, such as vehicle occlusion and representation sparsity. For point clouds, changing the viewpoint is easy; in contrast, changing the camera viewpoint requires careful use of calibration parameters and view-transformation strategies. Advances in purely visual BEV works[96]–[98] have promoted the development of BEV-based fusion methods. The model proposed in [60] achieves efficient camera-to-BEV conversion and effective semantic merging of BEV representations.

Raw Representation

An alternative to unified multi-modal representation is raw representation, which avoids cross-modal representation translation or re-encoding in order to retain as much of the available information as possible.
Since most high-performance single-modal detectors are point cloud-based, several multi-modal methods combine raw representations from other modalities, such as cameras, to decorate the point cloud. For instance, the model proposed in [104] introduces a new paradigm that decorates raw point clouds with semantic scores produced by a semantic segmentation task. This extra multi-modal benefit is derived from powerful 2D vision tasks applied to the raw representations, such as 2D detection or 2D semantic segmentation. To fully utilize raw representations, F-PointNet[54] uses the raw 2D representation and 2D detection to narrow down the search range in 3D, producing accurate foreground information for prediction. Many works follow this design paradigm. Although this approach can alleviate the gap between features, these methods cannot fully utilize the original information in the heterogeneous data at the feature level.
Several methods leverage simple feature extractors to utilize the complete raw representations. PointFusion[99] uses simple backbones, PointNet[27] for 3D and ResNet[128] for 2D, to extract features directly from the raw representations; [101] follows a similar design. [108] proposes a pillar-based encoding method that converts the raw representation into pillars and processes the pillar features with [30]. Unlike the earlier simple feature extractors [109], [129] proposes an encoder-decoder structure to enhance the interaction and fusion of heterogeneous representations. Because of the richness of raw 2D representations, more variants of 2D auxiliary tasks become possible: [112] runs 2D detection on the raw image in the image branch and performs ROI (Region of Interest) pooling for both 2D and 3D detection. Feature fusion that combines the outputs of simple backbones and their variants is becoming increasingly common in multi-modal methods, mainly because raw representations preserve more information from the original sensors and are more suitable for multi-modal reasoning.

Conclusion and Discussion

This paper identifies two major categories of multi-modal representations in 3D detection: unified representation and raw representation. Unified representation projects multi-modal data (or features) into a unified format (or space) and addresses the misalignment of representations or formats. It has been widely used in popular 3D detectors and significantly enhances performance and efficiency, especially in BEV-based paradigms. On the other hand, raw representation does not require transformations on the original representation to retain the maximum original information. Typically, it introduces auxiliary tasks for raw features, such as semantic segmentation and auxiliary object detection. Table IV summarizes their advantages and disadvantages. Finally, multi-modal representations in 3D detection are under development, and we may see more efficient representations in the future.

Feature Alignment

Multi-modal inputs come in different forms of feature representation, which are often heterogeneous. Establishing correspondences between the data of different modalities therefore becomes an important step, which the paper summarizes as alignment: directly using unaligned features from different modalities is likely to diminish the gains of multi-modal data, or even backfire. Feature alignment that establishes correspondences between the modalities is thus crucial.
Multi-modal feature alignment refers to establishing correspondences between the features of different modalities. In multi-modal 3D detection, point cloud data (as shown in Figure 4) provides accurate geometric and depth information but, owing to its inherent sparsity and irregular distribution, lacks resolution and texture. In contrast, images (as shown in Figure 4) contain fine-grained texture and color but lack depth information. The features extracted from these two kinds of data by neural networks are likewise heterogeneous, which makes aligning them very challenging.
The correspondence between LiDAR and cameras is achieved by projection matrices[130], [131], which consist of intrinsic and extrinsic parameters used to convert 3D world coordinates into 2D image coordinates. Multiple works utilize calibration parameters to find correspondences between 3D and 2D for feature alignment. This method is effective, but it compromises the semantic information of the image. To better address this issue, many researchers have adopted deep learning techniques to achieve feature alignment. Based on these considerations, the paper categorizes feature alignment methods into two categories: 1) projection-based methods and 2) model-based methods, as shown in Figure 5 and Table V.
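The sketch below illustrates this projection-based correspondence: LiDAR points are transformed with an extrinsic matrix T_cam_lidar and an intrinsic matrix K and mapped to pixel coordinates. The matrix names and shapes are assumptions for illustration, not tied to a particular dataset's calibration format.

```python
# A minimal sketch of LiDAR-to-image projection for feature alignment.
import numpy as np

def project_lidar_to_image(points, T_cam_lidar, K, img_hw):
    """points: (N, 3) LiDAR coords. Returns pixel coords (N, 2) and a validity mask."""
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)  # (N, 4)
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]            # LiDAR frame -> camera frame
    uv = (K @ pts_cam.T).T                                # pinhole projection
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)      # normalize by depth
    h, w = img_hw
    valid = (pts_cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < h)            # in front of camera and inside image
    return uv, valid
```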

Projection-Based Feature Alignment

Earlier works primarily use camera projection matrices to align image and point cloud features in a deterministic manner, which is efficient and fast and maintains positional consistency. Projection-based methods can be broadly categorized into global projection and local projection, as shown in Figure 6.
Global Projection: Global projection refers to taking image features processed by instance segmentation networks, or images converted into BEV, as input, projecting the point cloud onto the processed image to gather the corresponding features, and feeding the result into the 3D backbone for further processing.
For example, popular detection methods such as PointPainting[104] and PI-RCNN[105] enhance the raw LiDAR point cloud with semantic features obtained from image-based semantic segmentation: the image is passed through a segmentation network to obtain pixel-level semantic labels, and a point-to-pixel projection attaches those labels to the 3D points. Complexer-YOLO and FusionPainting[111] also follow this paradigm. MVP[110] borrows the idea of PointPainting, first running image instance segmentation and aligning the instance masks with the point cloud through the projection matrices, but it differs in that MVP randomly samples pixels within each instance mask, assigns each sampled pixel the depth of its nearest projected LiDAR point, and then unprojects these pixels back into the LiDAR coordinate system to obtain virtual LiDAR points. MVX-Net[101] does not use PointNet[27] to extract point cloud features; instead, it voxelizes the raw LiDAR point cloud so that more advanced single-modal 3D detection backbones can be used, and it attaches the corresponding image feature vectors to the voxels through projection, appending ROI image feature vectors to the dense feature vector of each voxel.
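A minimal PointPainting-style decoration might look like the sketch below, which reuses the projection helper sketched above; it is an illustration of the idea, not the authors' code, and the array names (e.g., seg_scores) are assumptions.

```python
# A minimal sketch of point painting: attach per-pixel class scores to projected points.
import numpy as np

def paint_points(points, seg_scores, T_cam_lidar, K):
    """points: (N, 3 or 4); seg_scores: (H, W, C) softmax output of a 2D segmenter."""
    h, w, c = seg_scores.shape
    uv, valid = project_lidar_to_image(points[:, :3], T_cam_lidar, K, (h, w))
    painted = np.zeros((points.shape[0], c), dtype=np.float32)
    u = np.clip(uv[valid, 0].astype(np.int32), 0, w - 1)
    v = np.clip(uv[valid, 1].astype(np.int32), 0, h - 1)
    painted[valid] = seg_scores[v, u]                    # hard point-to-pixel association
    return np.concatenate([points, painted], axis=1)     # (N, 3/4 + C) painted points
```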
ContFuse[91], BEVFusion[60], and 3D-CVF[23] express the data of both modalities in a uniform way: image features are converted into BEV representations via projection and aligned with the point cloud BEV representation. In ContFuse, image features are projected into BEV space with the help of an MLP. For each target pixel in the BEV plane, the K nearest LiDAR points are retrieved; these points are projected onto the image through the projection matrix to fetch the corresponding image features, and the fetched features together with the coordinate offsets between source and target pixels are fed into the MLP to produce the image feature of that BEV location. The result is then fused with the point cloud BEV feature map to form a dense feature map. Inspired by the LSS[134] algorithm, BEVFusion lifts camera images into 3D ego-vehicle coordinates and uses a BEV encoder module to convert them into a BEV representation. 3D-CVF[23] converts 2D camera features into a smooth spatial feature map through auto-calibrated projection, so that it corresponds as closely as possible to the LiDAR features in BEV; this feature map is likewise a BEV representation.
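As a rough illustration of the LSS-style camera-to-BEV conversion mentioned above, the sketch below shows only the "lift" step, in which each pixel predicts a categorical depth distribution whose outer product with the pixel's context feature spreads that feature along the camera ray. The channel and depth-bin sizes are assumptions, and the subsequent splatting onto the BEV grid is omitted.

```python
# A minimal sketch of the LSS-style "lift" step (illustrative, not BEVFusion's exact code).
import torch
import torch.nn as nn

class LiftModule(nn.Module):
    def __init__(self, in_ch=256, feat_ch=64, n_depth_bins=48):
        super().__init__()
        self.n_depth_bins = n_depth_bins
        # Jointly predict depth logits and the context features to be lifted.
        self.head = nn.Conv2d(in_ch, n_depth_bins + feat_ch, kernel_size=1)

    def forward(self, img_feat):                  # img_feat: (B, in_ch, H, W)
        out = self.head(img_feat)
        depth = out[:, :self.n_depth_bins].softmax(dim=1)   # (B, D, H, W) depth distribution
        ctx = out[:, self.n_depth_bins:]                     # (B, C, H, W) context features
        # Outer product: one feature copy per depth bin, weighted by its probability.
        frustum = depth.unsqueeze(1) * ctx.unsqueeze(2)      # (B, C, D, H, W)
        return frustum                                       # splatting to BEV would follow
```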
Local Projection: Local projection uses 2D detection to extract knowledge from images to narrow down candidate target areas in the 3D point cloud, transferring image knowledge to the point cloud, and ultimately inputting the enhanced point cloud into LiDAR-based 3D detectors.
Frustum PointNet[54] extends each 2D box into a 3D frustum bounded by predicted near and far planes. First, the image is processed by a 2D detector to generate 2D bounding boxes around the objects of interest. Each 2D box is then lifted into a 3D frustum using the calibration parameters, and the LiDAR points falling inside the frustum are selected, thereby aligning the image with the point cloud. Works such as Frustum ConvNet[56], Faraway Frustum[57], Frustum PointPillars[55], and RoarNet[100] follow this setup and add their own innovations. Specifically, Frustum ConvNet aggregates point cloud features into frustum-level feature vectors, which are assembled into a feature map processed by a fully convolutional network (FCN); this spatially fuses the frustum feature vectors and supports end-to-end, continuous estimation of oriented boxes in 3D space. Frustum PointPillars uses pillars to accelerate computation.
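The following sketch shows the core of local projection, keeping only the LiDAR points whose image projection falls inside a given 2D detection box (again reusing the projection helper sketched above); it is a simplified illustration rather than the exact Frustum PointNet pipeline.

```python
# A minimal sketch of frustum point selection from a 2D detection box.
import numpy as np

def points_in_frustum(points, box2d, T_cam_lidar, K, img_hw):
    """box2d: (x1, y1, x2, y2) from a 2D detector; points: (N, 3+) LiDAR points."""
    uv, valid = project_lidar_to_image(points[:, :3], T_cam_lidar, K, img_hw)
    x1, y1, x2, y2 = box2d
    inside = valid & (uv[:, 0] >= x1) & (uv[:, 0] <= x2) \
                   & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
    return points[inside]        # the frustum subset passed to the 3D box estimator
```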
MV3D[52] generates proposals by projecting the LiDAR point cloud into BEV and front view (FV) and then fuses BEV, FV, and image features to predict the final 3D bounding box. In this process, a 3D proposal network generates high-precision 3D candidate boxes, and the 3D proposals are projected onto the feature maps of the multiple views for feature alignment between the two modalities. AVOD[53] adopts the same idea, but unlike MV3D, AVOD removes the FV input and proposes a more fine-grained region proposal scheme.
PointAugmenting[115] does not use features from an image instance segmentation network but instead uses the feature maps of a 2D object detection network, mainly because segmentation annotations are costly while 2D box annotations are easy to obtain. SFD[58] proposes a method based on pseudo point clouds: the point cloud branch processes the raw points to generate ROIs, and the projection matrix relates the point cloud to the image so that colored pseudo point clouds can be generated, thereby aligning the two kinds of data; the generated ROIs finally restrict the search range in the point cloud.

Model-Based Feature Alignment

Unlike the methods above, which align the two kinds of data with camera projection matrices, some recent multi-modal 3D detection methods align camera images and point clouds through learning, primarily with attention. For example, AutoAlign[117] and DeepFusion[119] adopt a cross-attention mechanism for feature alignment: voxel features serve as the queries q, and camera features serve as the keys k and values v. For each query (i.e., voxel unit), an inner product between the query and the keys yields the correlations between the voxel and all of its corresponding camera features; a softmax normalizes these correlations, which are then used to compute a weighted aggregation of the values v that carry the camera information. To reduce the computational load, AutoAlignV2[118], inspired by Deformable DETR[135], proposes a cross-domain deformable CAFA operation. DeformCAFA uses deformable cross-attention, where q and k follow the AutoAlign setup while v is obtained differently: the projection matrix first retrieves the image features corresponding to each voxel, then an MLP learns sampling offsets, and the image features at the offset locations are taken as the values v. Cross-attention allows each voxel to attend to the entire image, achieving feature alignment between the two modalities.
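A minimal sketch of this kind of model-based (soft) alignment is given below, in the spirit of AutoAlign/DeepFusion but not their exact architectures: voxel features act as queries and flattened image features as keys and values, with the channel sizes chosen arbitrarily for illustration.

```python
# A minimal sketch of cross-modal attention for feature alignment.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, voxel_ch=128, img_ch=256, d_model=128, n_heads=4):
        super().__init__()
        self.q_proj = nn.Linear(voxel_ch, d_model)
        self.kv_proj = nn.Linear(img_ch, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, voxel_feats, img_feats):
        # voxel_feats: (B, Nv, voxel_ch); img_feats: (B, C, H, W)
        B, C, H, W = img_feats.shape
        img_tokens = img_feats.flatten(2).transpose(1, 2)   # (B, H*W, C) image tokens
        q = self.q_proj(voxel_feats)                        # voxel queries
        kv = self.kv_proj(img_tokens)                       # image keys/values
        aligned, _ = self.attn(q, kv, kv)                   # each voxel attends to all pixels
        return aligned                                      # (B, Nv, d_model) aligned features
```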

Conclusion and Discussion

Applying camera projection matrices to align images and point clouds is effective, but point clouds are sparse while images are dense. Using projection matrices to find correspondences between LiDAR points and image pixels is a hard association that aggregates image information in a coarse manner and may compromise the semantic information in the image. For example, a car may have only 100 points in the point cloud while occupying thousands of pixels in the corresponding image; even though each point is projected onto the image plane and alignment is performed at the pixel level, the sparsity of the point cloud means that contextual semantic information in the image can still be lost. Model-based methods instead use a soft association: a cross-attention mechanism finds correspondences between LiDAR points and image pixels and can dynamically focus on pixel-level information. Each point cloud feature queries the entire image, so point cloud features aggregate image information in a fine-grained manner and obtain a pixel-level semantic alignment. Although this captures the semantic information in the image better, attention matches every pixel in the image, so the computational load is high and inference is time-consuming. AutoAlignV2[118] uses the DeformCAFA module to reduce the number of image-feature queries and the computational load.

Feature Fusion

The paper summarizes the fusion methods for multi-modal 3D detection, which are regarded as the most important part of multi-modal approaches, since well-designed fusion is what ultimately improves 3D detection. Currently, the primary fusion strategy in multi-modal 3D detection is complementarity, i.e., enhancing one modality with another. Analysis shows that multi-modal methods mainly exploit the complementarity between image features and point cloud features: in 3D detection, the accuracy of point cloud-based detectors is significantly higher than that of image-based ones, as shown in Figure 8, because the lack of depth information in images leads to low 3D accuracy, while images carry rich semantic information that can supplement the point cloud.
The current multi-modal complementary methods are implemented through different fusion methods. The main difference lies in whether learning is required during the multi-modal 3D detection fusion process. To help understand existing fusion methods, the paper categorizes them into two categories: learning-agnostic and learning-based. Learning-agnostic methods perform arithmetic operations and concatenation on features. These methods are simple to operate and computationally easy but lack good scalability and robustness. Learning-based methods utilize attention to fuse features, which is relatively complex and increases the number of parameters. However, learning-based methods can focus on high-weight important information while ignoring low-weight irrelevant information, thus having higher scalability and robustness. An overview of multi-modal 3D detection fusion methods is shown in Figure 7 and Table VI.

Learning-Agnostic Fusion

Traditional fusion methods focus on arithmetic operations and concatenation of features. Learning-agnostic methods are exactly this kind of fusion: they combine features through operations and concatenation without learned fusion weights. There are two main types: element-wise operations (sum, average) and concatenation.
Element-wise Operations: Element-wise operations combine features of the same dimension with arithmetic operations (sum, average). They are easy to parallelize, computationally simple, and convenient to apply, merging the two features into a single composite vector. Averaging or summing across channels enriches the information carried by the point cloud features without increasing their dimensionality; only the amount of information per channel grows, and this added information can improve detection accuracy.
In earlier works, MV3D[52] pioneered this approach, using the mean to fuse features from three different views; the fusion process is easy to apply and computationally simple. AVOD[53] uses MV3D[52] as a baseline and generates new fused features from the feature maps of its two views through element-wise averaging, inheriting the low computational cost of the MV3D fusion process and effectively fusing feature maps of the same shape. ContFuse[91] correlates features via the sensor coordinate correspondences and uses element-wise summation to combine features of the same dimension. In recent research, element-wise fusion has been adopted by only a few methods, mainly because element-wise operations cannot accurately isolate the correct foreground information and often carry noise. SCANet[93] and MMF[103] also adopt element-wise operations, but unlike the earlier studies, MMF[103] uses multiple auxiliary tasks to assist detection and feature fusion in the backbone.
Concatenation: Feature concatenation first brings the multi-modal features to compatible sizes and then concatenates the image feature vector with the point cloud feature vector. An overview of concatenation fusion is shown in Figure 9. Unlike element-wise operations, concatenation is computationally denser because it merges across channels, but it avoids the information loss caused by direct element-wise operations and is not limited by the number of channels, which makes it popular in multi-modal 3D detection. PointFusion[99] pioneered the use of concatenation in multi-modal 3D detection, connecting point-wise features and image features to retain the maximum information from each modality. Subsequent works extend the single-modal input of VoxelNet[19] to multi-modal input and further improve performance: MVX-Net[101] and SEGVoxelNet[25] use concatenation to append the corresponding image features to the 3D points or voxels. Compared with element-wise operations, concatenation retains modality information to a greater extent and loses less information. PointPainting[104] obtains pixel-wise segmentation scores from a semantic segmentation network and uses concatenation to attach those scores to the point cloud, retaining both the point cloud information and the segmentation scores. Overall, these earlier multi-modal methods show that concatenation is simple and preserves more feature information.
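The difference between the two learning-agnostic fusions can be summarized in a few lines of code; the shapes below are illustrative and assume the two feature maps have already been brought into a shared spatial grid.

```python
# Element-wise fusion keeps the channel count fixed; concatenation stacks channels.
import torch

img_feat = torch.randn(2, 128, 200, 176)    # image features in a shared (e.g., BEV) grid
pts_feat = torch.randn(2, 128, 200, 176)    # point cloud features, same spatial shape

fused_mean = (img_feat + pts_feat) / 2          # element-wise average: (2, 128, 200, 176)
fused_sum = img_feat + pts_feat                 # element-wise sum:     (2, 128, 200, 176)
fused_cat = torch.cat([img_feat, pts_feat], 1)  # concatenation:        (2, 256, 200, 176)
```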

Learning-Based Fusion

In 2020, DETR combined neural networks with attention for detection tasks, enabling end-to-end object detection and significantly simplifying the detection pipeline. Later, DETR3D brought attention to 3D detection. With the development of attention, cross-modal attention offers a new fusion approach for multi-modal methods. Learning-based methods learn weight distributions, so different parts of the input data or feature maps receive different weights: high weights retain important information while low weights suppress irrelevant information. As a result, learning-based fusion methods are more robust.
DETR is a milestone in applying attention to object detection. In the same period, several multi-modal 3D detection methods also brought attention into their fusion stage, such as 3D-CVF[23], MVAF-Net[106], and MAFF-Net. 3D-CVF proposes an adaptive gated fusion network in which attention maps are generated by 3×3 convolution layers followed by a sigmoid function; the attention maps weight the projected image features before they are added to the point cloud features. This kind of fusion concentrates on the useful information to be fused and makes the fusion learnable. The MVFF stage of MVAF-Net works with its APF module to adaptively fuse multi-view features using attention. MAFF-Net proposes a point cloud attention fusion (PAF) module, which fuses, for each 3D point, a single image feature with two attention-derived features to obtain adaptively fused features. Since cameras are easily affected by lighting, occlusion, and other factors, supplementing point cloud features with image features risks introducing interference; to address this, EPNet[109] uses attention to adaptively estimate the importance of the image features for fusion.
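A minimal sketch of an adaptive gated fusion in the spirit of 3D-CVF, though not its exact design, is shown below: a 3×3 convolution followed by a sigmoid produces per-location attention maps that weight each modality before summation.

```python
# A minimal sketch of gated (attention-weighted) fusion of two BEV feature maps.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.gate_img = nn.Conv2d(2 * ch, 1, kernel_size=3, padding=1)
        self.gate_pts = nn.Conv2d(2 * ch, 1, kernel_size=3, padding=1)

    def forward(self, img_feat, pts_feat):        # both: (B, ch, H, W) in BEV
        joint = torch.cat([img_feat, pts_feat], dim=1)
        w_img = torch.sigmoid(self.gate_img(joint))   # attention map for the camera stream
        w_pts = torch.sigmoid(self.gate_pts(joint))   # attention map for the LiDAR stream
        return w_img * img_feat + w_pts * pts_feat    # adaptively weighted fusion
```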
With the development of attention, many attention-based fusion methods have emerged in multi-modal 3D detection, such as FusionPainting, AutoAlign, AutoAlignV2, DeepFusion, CAT-Det, BEVFusion[59], and BEVFusion[60]. These models use attention to weight key information highly and redundant information lowly, which significantly improves fusion effectiveness and prevents interference information from degrading detection.

Discussion and Conclusion

This chapter discusses multi-modal fusion methods and divides data fusion into learning-agnostic and learning-based categories. Learning-agnostic methods consist primarily of two operations, element-wise operations and concatenation, while learning-based methods use attention to adaptively estimate the importance of each modality for fusion. Multi-modal fusion is a widely researched topic; many solutions have been proposed, each with its advantages and disadvantages. Learning-agnostic methods are suitable for smaller datasets, while learning-based methods provide better robustness. Despite these advancements, multi-modal fusion still faces the following challenges:
  • Data information experiences varying degrees of information loss during feature transformation;

  • Current fusion methods use image features to complement point cloud features, and image features may encounter issues when using point cloud baselines, such as domain gaps;

  • Learning-agnostic methods need to consider fusion issues based on the importance of information;

  • Learning-based methods have many parameters, requiring consideration of parameter count optimization issues.

Challenges and Trends

Despite the many fusion methods developed, the image and point cloud fusion algorithms in autonomous driving face numerous challenges due to the demands for accuracy, robustness, and real-time performance. Additionally, the alignment of data between point clouds and images is still widely explored and far from mature. This section discusses the challenges and trends in multi-modal 3D detection.
  • Data Noise: Effectively fusing multi-modal information has always been a major challenge in multi-modal learning. For different sensors, there are information gaps between the modalities, leading to desynchronized information. This introduces significant noise into feature fusion, which harms representation learning; for example, ROIs of mismatched dimensions during fusion cause two-stage detectors to mix background features from the image into the fused representation. Recent works[59], [60] use BEV representations to unify the heterogeneous modalities, providing a new perspective on this issue that deserves further exploration.

  • Limited Receptive Field in Open-Source Datasets: Insufficient sensor coverage adversely affects the performance of multi-modal detection. Recently, more and more multi-modal works have focused on nuScenes[79] because of its excellent perception range (both the point clouds and the cameras cover 360 degrees), which aids multi-modal learning, especially for autonomous driving perception tasks. Working with datasets that offer good sensor coverage, such as nuScenes[79] and Waymo[80], may improve the coverage of multi-modal detection systems and enhance their performance in complex environments, offering a possible remedy to the limited receptive field of open-source datasets.

  • Compact Representation: A compact representation carries more information per unit of data. Although existing works attempt to encode sparse 3D representations into 2D representations, significant information is lost during encoding; for instance, range-image projection may cause multiple points to fall into the same pixel. Recently, the Waymo open dataset has provided high-resolution range images, but only a few works have examined them, and high-quality representation remains an open challenge. Advanced encoding techniques, such as deep autoencoders and generative adversarial networks, may be used to obtain more compact 3D representations.

  • Information Loss: Maximizing the retention of multi-modal information has always been a key challenge in multi-modal 3D detection. Fusing information from multiple modalities can cause information loss; for example, when image features are used to complement point cloud features during the fusion stage, semantic information from the images may be lost, so the fusion cannot fully exploit the image features and model performance becomes suboptimal. State-of-the-art models in multi-modal learning[109], [117] may prove beneficial for sensor fusion in 3D detection, and new fusion methods and neural network architectures can be explored to retain as much multi-modal information as possible.

  • Unlabeled Data: Unlabeled data is abundant in autonomous driving scenarios, and unsupervised learning can provide more robust representation learning, which has been studied to some extent in related tasks such as 2D detection[145]-[153]. In current multi-modal 3D detection, however, there is no compelling study of unsupervised representations. Learning good unsupervised representations for multi-modal data is a challenging research topic; in future research, the difficulty will revolve around representing multi-modal data jointly despite the cross-modal differences.

  • High Computational Complexity: An important challenge for multi-modal 3D object detection is to detect objects quickly and in real time in autonomous driving scenarios. Because multi-modal methods process several streams of information, parameters and computation increase, training and inference times grow, and applications may fail to meet real-time requirements. Recent multi-modal methods have started to consider runtime; for example, MVP[110] and BEVFusion[59] already report FPS as an evaluation metric in their nuScenes experiments, as shown in Table VII. To alleviate the high computational complexity, future work is encouraged to explore model pruning and quantization, which simplify model structures and reduce parameters for efficient deployment and require further research in autonomous driving scenarios.

  • Long-Tail Effect: Addressing the long-tail effect, i.e., the performance variation across categories, is an important challenge in multi-modal 3D object detection. In autonomous driving, most models focus on detecting cars, but other objects, such as pedestrians, are equally essential. As shown in Table VIII, autonomous driving scenes contain many categories, and a model that detects cars well may perform poorly on pedestrians, as is the case for SFD[58], leading to uneven detection across categories. Future work may explore loss functions and sampling strategies as potential solutions to this issue.

  • Cross-Modal Data Augmentation: Data augmentation is key to competitive results in 3D detection, but it has mainly been applied to single-modal methods and is rarely considered in multi-modal settings. Because point clouds and images are heterogeneous, achieving synchronized cross-modal augmentation is challenging, and naive augmentation leads to serious cross-modal misalignment; applying gt-aug (ground-truth sampling) to point clouds and camera data without distortion is difficult. Some methods augment only the point cloud branch and ignore the image branch; others keep the original image unchanged and apply inverse transformations to the point cloud to maintain the correspondence between images and points. PointAugmenting[115] proposes a more elaborate cross-modal augmentation, but it relies on additional mask annotations in the image branch and is prone to noise. None of these methods fully solves the synchronization problem of cross-modal augmentation. A potential solution could be to convert the heterogeneous data into a unified representation through representation reconstruction and augment both modalities simultaneously.

  • Temporal Synchronization: Temporal synchronization is a key issue in multi-modal 3D detection. Because different sensors differ in sampling rate, working mode, and acquisition speed, there are temporal offsets between the data they collect, which misaligns the multi-modal data and hurts the accuracy and efficiency of multi-modal 3D detection. First, the timestamps of different sensors may contain errors; even with hardware timing synchronization, it is difficult to guarantee fully consistent timestamps, and such hardware can be expensive. Software synchronization methods can be used instead, such as timestamp interpolation, Kalman filter-based time synchronization, and deep learning-based time synchronization (a minimal timestamp-interpolation sketch follows this list). Second, sensor data may suffer frame loss or delay, which also affects detection accuracy; possible remedies include caching mechanisms for delayed or lost data and interpolation or extrapolation to fill the gaps. Temporal synchronization in multi-modal 3D detection is a complex issue that requires a combination of techniques.
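As referenced above, the sketch below illustrates the simplest software-level remedy, timestamp interpolation: a pose sampled at one sensor's rate is linearly interpolated to another sensor's capture time so the data can be motion-compensated before fusion. The timestamps, poses, and linear interpolation are illustrative assumptions; real systems would also interpolate rotations (e.g., with slerp on quaternions).

```python
# A minimal sketch of timestamp interpolation for temporal alignment of sensors.
import numpy as np

def interpolate_pose(t_query, t0, pose0, t1, pose1):
    """Linearly interpolate a translation (3-vector) between two timestamped poses."""
    alpha = (t_query - t0) / (t1 - t0)
    return (1.0 - alpha) * pose0 + alpha * pose1

# Example: a camera frame captured between two LiDAR/ego pose samples.
t_cam = 0.55
pose_at_cam = interpolate_pose(t_cam,
                               0.50, np.array([10.0, 0.0, 0.0]),
                               0.60, np.array([10.8, 0.1, 0.0]))
```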

Conclusion

This paper comprehensively reviews and analyzes multi-modal 3D detection. It first analyzes why multi-modal 3D detection emerged, introduces existing datasets and evaluation metrics, and compares the datasets. It then proposes a new taxonomy for multi-modal 3D detection, analyzing existing methods from the perspectives of data representation, feature alignment, and feature fusion, and reviews the advantages and disadvantages of the methods in each category from different angles. Finally, it summarizes recent development trends, current challenges, and open issues, and looks ahead to future research directions in multi-modal 3D detection.

References

[1] Multi-modal 3D Object Detection in Autonomous Driving: A Survey and Taxonomy
