Innovative Applications of Diffusion Models in Point Cloud Data

Paper Title:

Towards Dense and Accurate Radar Perception Via Efficient Cross-Modal Diffusion Model

Authors:

Ruibin Zhang, Donglai Xue, Yuhan Wang, Ruixu Geng, and Fei Gao

Project Address: https://github.com/ZJU-FAST-Lab/Radar-Diffusion

Compiled by: xlh

Reviewed by: Los

Introduction:

Point clouds produced by millimeter-wave radar are noisy and relatively sparse. This paper proposes a generative-model-based method for constructing dense and accurate millimeter-wave radar point clouds for the autonomous navigation of Micro Aerial Vehicles (MAVs). The research team also incorporates recent diffusion model inference acceleration techniques so that the method can run on MAVs with limited computational resources. On benchmark datasets, the generated point clouds are of higher quality than those of other methods, and the method generalizes well to new scenes.

Millimeter-wave (mmWave) radar has garnered significant attention in academia and industry due to its ability to operate under extreme weather conditions. However, challenges related to sparsity and noise interference hinder its application in the autonomous navigation of Micro Aerial Vehicles (MAVs). To address this, this paper proposes a new method to construct dense and accurate millimeter-wave radar point clouds through cross-modal learning. Specifically, the researchers introduced a diffusion model, which has state-of-the-art performance in generative modeling, to predict LiDAR-like point clouds from matched raw radar data. The author team also incorporated the latest diffusion model inference acceleration techniques to ensure that the proposed method can be realized on MAVs with limited computational resources. Extensive dataset comparisons and practical experiments validate the proposed method, demonstrating its superior performance and generalization capabilities.

Millimeter-wave radar inherently suffers from poor angular resolution and sensor noise, so it produces sparse and noisy point clouds that are difficult to use in SLAM applications. A typical way to make better use of millimeter-wave radar is to fuse filtered radar point clouds with data from other sensors. However, this is not sufficient for accurate scene mapping when MAVs fly autonomously in cluttered environments. Furthermore, due to payload limitations, most MAVs can only carry a single-chip millimeter-wave radar (with an angular resolution of approximately 15°). Therefore, the research team proposed a cross-modal supervised learning method that generates pseudo-LiDAR point clouds from raw radar data.

Because millimeter-wave radar signals are sparse and noisy, using LiDAR point clouds to supervise the generation of high-quality radar point clouds is essentially a cross-modal denoising and super-resolution task. To apply diffusion models to this MAV task, the research team framed it as image restoration: learning, in the image domain, the mapping between the "degraded data" (millimeter-wave radar point clouds) and the "clean data" (LiDAR point clouds). A limitation of diffusion models is their reliance on an iterative sampling process, which makes inference slow. Therefore, the research team incorporated diffusion model inference acceleration techniques that support single-step generation.


▲Figure 1｜LiDAR data, millimeter-wave radar data, and data generated using the proposed method ©️【Deep Blue AI】 Compilation

■3.1 Prior Knowledge

Modern commercial millimeter-wave radars primarily transmit frequency-modulated continuous-wave (FMCW) chirps, signals whose frequency increases linearly over time. (More generally, a chirp is a signal whose frequency increases or decreases over time; in some fields the term is used interchangeably with sweep signal.) A complete radar frame consists of a sequence of chirps spaced at equal time intervals. With the aid of multiple-input multiple-output (MIMO) technology, $N_{TX}$ transmitting antennas (TX) and $N_{RX}$ receiving antennas (RX) can form a virtual antenna array of up to $N_{TX} \times N_{RX}$ elements. Distance, speed, and angle can then be estimated as follows.

● Distance Estimation

After the RX antennas capture the signals reflected by surrounding objects, a mixer combines the TX and RX signals into an intermediate-frequency (IF) signal, whose frequency equals the frequency difference between the TX and RX signals. For FMCW, the frequency of the IF signal is constant and proportional to the time delay between the TX and RX signals. Therefore, the distance to the detected object can be calculated as $d = \frac{c\, f_{IF}}{2S}$, where $c$ is the speed of light, $f_{IF}$ is the frequency of the IF signal, and $S$ is the slope of the FMCW chirp.
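The range relation above can be tried out on a simulated IF signal. The following is a minimal sketch, not the paper's processing chain: the chirp slope, sampling rate, and target distance are hypothetical values chosen only for illustration, and a single ideal reflector without noise is assumed.

```python
import numpy as np

C = 3.0e8  # speed of light in m/s

def estimate_range(if_samples, sample_rate_hz, chirp_slope_hz_per_s):
    """Estimate target range from one chirp of real-valued IF samples."""
    n = len(if_samples)
    # Window to reduce spectral leakage, then take the one-sided spectrum.
    spectrum = np.abs(np.fft.rfft(if_samples * np.hanning(n)))
    spectrum[0] = 0.0  # ignore the DC bin
    peak_bin = int(np.argmax(spectrum))
    f_if = peak_bin * sample_rate_hz / n  # IF (beat) frequency in Hz
    return C * f_if / (2.0 * chirp_slope_hz_per_s)

# Hypothetical chirp parameters, chosen only for illustration.
SLOPE = 30e12          # chirp slope: 30 MHz/us expressed in Hz/s
SAMPLE_RATE = 10e6     # ADC sampling rate in Hz
NUM_SAMPLES = 512
TRUE_RANGE_M = 12.0

t = np.arange(NUM_SAMPLES) / SAMPLE_RATE
beat_hz = 2.0 * SLOPE * TRUE_RANGE_M / C      # f_IF = 2 S d / c
if_signal = np.cos(2.0 * np.pi * beat_hz * t)

estimated_range_m = estimate_range(if_signal, SAMPLE_RATE, SLOPE)
# Size of one FFT range bin: the estimate is quantized to this grid.
range_bin_m = C * SAMPLE_RATE / (2.0 * SLOPE * NUM_SAMPLES)
print(f"estimated range: {estimated_range_m:.3f} m (bin size {range_bin_m:.3f} m)")
```

Because the peak is read from a discrete FFT bin, the estimate can differ from the true distance by up to one range bin.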

▲Figure 2｜Millimeter-wave radar signal model and data preprocessing ©️【Deep Blue AI】 Compilation

● Speed Estimation

The relative speed between the radar sensor and surrounding objects is measured by the phase difference between different chirps.
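As a small worked example of this idea, the standard FMCW relation $v = \frac{\lambda\, \Delta\phi}{4\pi T_c}$ converts the phase change $\Delta\phi$ between two consecutive chirps into a radial speed. The sketch below assumes a 77 GHz carrier and a 50 µs chirp period; both values, and the sign convention of the phase, are illustrative assumptions rather than settings from the paper.

```python
import cmath
import math

def velocity_from_phase(phase_diff_rad, wavelength_m, chirp_period_s):
    """Radial velocity from the phase change between two consecutive chirps."""
    return wavelength_m * phase_diff_rad / (4.0 * math.pi * chirp_period_s)

# Hypothetical parameters: 77 GHz carrier, 50 us chirp repetition period.
WAVELENGTH = 3.0e8 / 77e9
CHIRP_PERIOD = 50e-6
TRUE_VELOCITY = 2.0  # m/s

# The range-FFT peaks of two consecutive chirps differ only by a phase rotation.
true_phase = 4.0 * math.pi * TRUE_VELOCITY * CHIRP_PERIOD / WAVELENGTH
peak_chirp_0 = cmath.rect(1.0, 0.3)
peak_chirp_1 = cmath.rect(1.0, 0.3 + true_phase)

# Phase difference recovered from the complex peaks (wrapped to (-pi, pi]).
measured_phase = cmath.phase(peak_chirp_1 * peak_chirp_0.conjugate())
estimated_velocity = velocity_from_phase(measured_phase, WAVELENGTH, CHIRP_PERIOD)

# Speeds above this limit wrap around because the phase exceeds pi.
max_unambiguous_velocity = WAVELENGTH / (4.0 * CHIRP_PERIOD)
print(f"estimated velocity: {estimated_velocity:.3f} m/s")
```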

● Angle Estimation

The different ranges from objects to different RX antennas produce detectable phase differences, allowing for the calculation of the azimuth and elevation angles.
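The same phase-based reasoning gives the angle of arrival: for two antennas spaced $l$ apart, a phase difference $\Delta\phi$ corresponds to $\theta = \arcsin\left(\frac{\lambda\, \Delta\phi}{2\pi l}\right)$. The sketch below assumes half-wavelength spacing and a hypothetical array of 2 TX × 4 RX antennas; the resolution formula $2/N$ radians is the usual boresight approximation for $N$ virtual antennas at half-wavelength spacing, used here only to show why a small array yields coarse angular resolution.

```python
import math

def phase_difference(angle_rad, wavelength_m, antenna_spacing_m):
    """Phase difference between adjacent RX antennas for a given arrival angle."""
    return 2.0 * math.pi * antenna_spacing_m * math.sin(angle_rad) / wavelength_m

def angle_of_arrival(phase_diff_rad, wavelength_m, antenna_spacing_m):
    """Invert the phase difference back into an arrival angle in radians."""
    sin_theta = wavelength_m * phase_diff_rad / (2.0 * math.pi * antenna_spacing_m)
    sin_theta = max(-1.0, min(1.0, sin_theta))  # guard against rounding error
    return math.asin(sin_theta)

def boresight_resolution_rad(num_virtual_antennas):
    """Approximate angular resolution at boresight for half-wavelength spacing."""
    return 2.0 / num_virtual_antennas

# Hypothetical setup: 77 GHz carrier, half-wavelength antenna spacing.
WAVELENGTH = 3.0e8 / 77e9
SPACING = WAVELENGTH / 2.0

true_angle = math.radians(20.0)
delta_phi = phase_difference(true_angle, WAVELENGTH, SPACING)
estimated_angle_deg = math.degrees(angle_of_arrival(delta_phi, WAVELENGTH, SPACING))

# A hypothetical 2 TX x 4 RX array gives 8 virtual antennas.
resolution_deg = math.degrees(boresight_resolution_rad(2 * 4))
print(f"angle: {estimated_angle_deg:.2f} deg, resolution: {resolution_deg:.1f} deg")
```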

● Point Cloud Generation

After multiple FFTs are applied along the range, speed, and angle dimensions, the raw data is transformed into a 4D tensor, in which any two dimensions can be combined to form a continuous heatmap (e.g., a range-Doppler heatmap, RDH, or a range-azimuth heatmap, RAH). Typically, traditional target detectors (such as the various CFAR, constant false alarm rate, variants) are applied to extract valid targets from noise and interference. A CFAR detector dynamically estimates the noise level from the cells surrounding the cell under test and classifies cells whose signal strength exceeds a threshold derived from that noise level as valid targets. Because the sources of noise and interference in cluttered environments are diverse, CFAR cannot estimate the noise level accurately. When the threshold coefficient is set too low, a significant amount of noise is incorrectly reported as targets. Conversely, when it is set too high, many valid targets are missed.
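The threshold trade-off described above can be illustrated with a minimal one-dimensional cell-averaging CFAR. This is a sketch of the simplest CFAR variant, not the OS-CFAR baseline used in the experiments; the noise model, target strengths, window sizes, and threshold coefficients are all hypothetical.

```python
import numpy as np

def ca_cfar_1d(power, num_train, num_guard, threshold_scale):
    """Cell-averaging CFAR over a 1D power profile; returns a boolean mask."""
    n = len(power)
    detections = np.zeros(n, dtype=bool)
    half = num_train + num_guard
    for i in range(half, n - half):
        # Training cells on both sides, skipping guard cells next to cell i.
        leading = power[i - half : i - num_guard]
        trailing = power[i + num_guard + 1 : i + half + 1]
        noise_level = np.mean(np.concatenate((leading, trailing)))
        detections[i] = power[i] > threshold_scale * noise_level
    return detections

rng = np.random.default_rng(0)
profile = rng.exponential(scale=1.0, size=256)   # noise-only power samples
target_bins = [60, 130, 200]
profile[target_bins] += [100.0, 12.0, 4.0]       # strong, medium, weak targets

# Low coefficient: more detections, including noise; high: fewer detections.
loose = ca_cfar_1d(profile, num_train=12, num_guard=2, threshold_scale=3.0)
strict = ca_cfar_1d(profile, num_train=12, num_guard=2, threshold_scale=15.0)
print("loose detections:", int(loose.sum()), "strict detections:", int(strict.sum()))
```

Every cell reported with the stricter coefficient is also reported with the looser one, so raising the coefficient can only remove detections, false alarms and weak true targets alike.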

■3.2 Generative Model

Inspired by non-equilibrium thermodynamics, diffusion models describe the generation of data samples as the progressive removal of noise, starting from pure Gaussian noise. In the forward process, noise is gradually added to the data samples over time until they become indistinguishable from pure Gaussian noise. A neural network is then trained to reverse this noising process, i.e., to learn the data distribution and how to generate new samples. Currently, such generative models are mainly divided into two categories: denoising diffusion probabilistic models (DDPM) and noise-conditioned score networks (NCSN). Both can be expressed within a unified score-based generative modeling framework. The diffusion process is represented as $\{x_t\}_{t \in [0, T]}$, where $t$ is a continuous time variable, and is the solution to the stochastic differential equation (SDE):

$$dx = f(x, t)\, dt + g(t)\, dw$$

where $w$ is the standard Wiener process, and $f$ and $g$ are the drift and diffusion coefficients, respectively, which are pre-designed and contain no learnable parameters. The reverse of the diffusion process is also a diffusion process, running backward in time, and is given by the reverse-time SDE:

$$dx = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{w}$$

where $\bar{w}$ is also a standard Wiener process, and $\nabla_x \log p_t(x)$ is referred to as the score function of each marginal distribution, which in the generative model is approximated by training a noise prediction model $\epsilon_\theta$. For conditional generation, the model is given the noised sample $x_t$, the timestamp $t$, and the corresponding condition $c$. The neural network is optimized through the following loss function:

$$\mathcal{L} = \mathbb{E}_{x_0,\, c,\, t,\, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|_2^2 \right]$$
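The conditional noise-prediction objective can be sketched in a few lines. The example below assumes a DDPM-style discretization with a linear noise schedule, which is one common instance of the SDE framework and not necessarily the schedule used in the paper; `zero_model` is a hypothetical placeholder for the trained network, and the random arrays merely stand in for a LiDAR image and its paired radar condition.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors of a linear DDPM noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Forward process: jump directly from the clean sample x0 to step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def noise_prediction_loss(eps_model, x0, condition, t, alpha_bar, rng):
    """Squared error between the injected noise and the model's prediction."""
    x_t, eps = add_noise(x0, t, alpha_bar, rng)
    eps_hat = eps_model(x_t, t, condition)
    return float(np.mean((eps - eps_hat) ** 2))

def zero_model(x_t, t, condition):
    """Hypothetical untrained stand-in for the network: always predicts zero."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
lidar_image = rng.random((32, 32))   # stand-in for the clean target x0
radar_image = rng.random((32, 32))   # stand-in for the condition c

loss = noise_prediction_loss(zero_model, lidar_image, radar_image, 500, alpha_bar, rng)
print(f"loss of the untrained stand-in: {loss:.3f}")
```

Since the injected noise has unit variance, a model that always predicts zero scores a loss close to 1; training pushes this value down by making the prediction depend on the noised sample and the condition.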

■3.3 Point Cloud Generation Model

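The researchers directly predict the clean data instead of learning the score function. First, they optimize the following MSE loss function:

$$\mathcal{L}_{\mathrm{MSE}} = \mathbb{E}_{x_0,\, c,\, t,\, \epsilon} \left[ \left\| x_0 - x_\theta(x_t, t, c) \right\|_2^2 \right]$$

where $x_\theta(x_t, t, c)$ is the network's prediction of the clean sample $x_0$ from the noised sample $x_t$, the timestamp $t$, and the condition $c$.

The MSE metric may lead to perceptual mismatches between the generated samples and the original samples. For example, in the case studied, the accuracy of wall structures in LiDAR BEV images is more critical for autonomous navigation than the accuracy of grass. However, since the former occupies fewer pixels than the latter, its contribution to the loss is also smaller. Therefore, a diffusion model trained with the MSE loss alone penalizes subtle pixel-level mismatches more severely than errors in significant structural features. To address this issue, the researchers introduced the Learned Perceptual Image Patch Similarity (LPIPS) metric, which uses deep features extracted from a pretrained neural network as the training loss and outperforms previous perceptual metrics:

$$\mathcal{L}_{\mathrm{LPIPS}} = \mathbb{E}_{x_0,\, c,\, t,\, \epsilon} \left[ \mathrm{LPIPS}\left( x_0,\; x_\theta(x_t, t, c) \right) \right]$$

The complete training loss is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda\, \mathcal{L}_{\mathrm{LPIPS}}$$

where $\lambda$ is a weighting coefficient that balances the two terms.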
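The pixel-count imbalance described above can be seen in a toy example. In the sketch below, a thin wall covers 64 pixels and a grass patch covers 400 pixels; a prediction that drops the wall receives a lower MSE than one that drops the grass. The perceptual term here is a simple gradient-based stand-in and is explicitly not LPIPS (which relies on deep features from a pretrained network); the image layout and the weight are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

def mse(pred, target):
    """Plain pixel-wise mean squared error."""
    return float(np.mean((pred - target) ** 2))

def edge_feature_distance(pred, target):
    """Stand-in perceptual term (NOT LPIPS): MSE between image gradients."""
    grad_pred = np.stack(np.gradient(pred))
    grad_target = np.stack(np.gradient(target))
    return float(np.mean((grad_pred - grad_target) ** 2))

def total_loss(pred, target, perceptual_weight):
    """Pixel loss plus a weighted feature-space loss."""
    return mse(pred, target) + perceptual_weight * edge_feature_distance(pred, target)

# Toy BEV target: a thin wall (64 pixels) and a large grass patch (400 pixels).
target = np.zeros((64, 64))
target[10, :] = 1.0
target[30:50, 20:40] = 0.5

missing_wall = target.copy()
missing_wall[10, :] = 0.0             # prediction that loses the wall
missing_grass = target.copy()
missing_grass[30:50, 20:40] = 0.0     # prediction that loses the grass

WEIGHT = 10.0  # hypothetical weight for the stand-in perceptual term

for name, pred in (("missing wall", missing_wall), ("missing grass", missing_grass)):
    print(f"{name}: mse={mse(pred, target):.4f}, "
          f"total={total_loss(pred, target, WEIGHT):.4f}")
```

With MSE alone, losing the wall looks like the smaller error; once a feature-space term is added, the ranking flips and the structurally important wall is penalized more.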

The diffusion model follows an iterative method during inference, where each step requires forward propagation through the network. This poses a significant challenge for real-time applications. To solve this problem, the researchers adopted the latest consistency model to support fast single-step generation while still allowing multi-step sampling to trade off computation for sample quality.
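Single-step and multi-step consistency sampling can be sketched as follows. This is a toy illustration under strong assumptions: the "dataset" is a single constant value, for which the consistency function is known in closed form, so `toy_consistency_fn` stands in for the distilled network; the noise levels are hypothetical, and the radar condition is omitted for brevity.

```python
import numpy as np

SIGMA_MIN = 0.002   # smallest noise level
SIGMA_MAX = 80.0    # largest noise level
DATA_MEAN = 1.5     # toy "dataset": every sample equals this value

def toy_consistency_fn(x, sigma):
    """Closed-form consistency function for a point-mass data distribution.

    Maps a noisy sample at noise level sigma to the start of its
    probability-flow ODE trajectory; it is the identity at SIGMA_MIN.
    """
    return DATA_MEAN + (SIGMA_MIN / sigma) * (x - DATA_MEAN)

def consistency_sample(consistency_fn, shape, sigmas, rng):
    """Single-step sampling if sigmas is empty, multi-step otherwise."""
    # One network evaluation maps pure noise straight to a sample.
    x = consistency_fn(SIGMA_MAX * rng.standard_normal(shape), SIGMA_MAX)
    for sigma in sigmas:  # decreasing intermediate noise levels
        # Re-noise the current sample, then denoise it again in one step.
        noise = rng.standard_normal(shape)
        x_noisy = x + np.sqrt(sigma**2 - SIGMA_MIN**2) * noise
        x = consistency_fn(x_noisy, sigma)
    return x

rng = np.random.default_rng(0)
one_step = consistency_sample(toy_consistency_fn, (16,), [], rng)
multi_step = consistency_sample(toy_consistency_fn, (16,), [10.0, 1.0], rng)
print("one-step error:  ", float(np.max(np.abs(one_step - DATA_MEAN))))
print("multi-step error:", float(np.max(np.abs(multi_step - DATA_MEAN))))
```

In this toy case both variants land very close to the data value because the consistency function is exact. With a learned network, each additional step costs one more forward pass and can improve sample quality, which is the trade-off mentioned above.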

To summarize the above, the diffusion model is first pre-trained and then distilled into a consistency model that shares the same network architecture. During inference, a single frame of radar RAH data is fed to the network as the condition for predicting the paired LiDAR BEV image. Finally, the BEV image is converted into a 2D point cloud whose density and accuracy are similar to those of a LiDAR point cloud.
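The final conversion step can be sketched as a simple thresholding of the predicted image. The coordinate convention (Cartesian grid, sensor at a given pixel, x forward and y to the left), the threshold, and the grid resolution below are all assumptions made for illustration; the source text does not specify how the BEV image is parameterized.

```python
import numpy as np

def bev_to_points(bev, threshold, meters_per_pixel, origin_rc):
    """Convert a BEV occupancy image into an (N, 2) array of x/y points in meters."""
    rows, cols = np.nonzero(bev > threshold)
    # Assumed convention: x points forward (up in the image), y points left.
    x = (origin_rc[0] - rows) * meters_per_pixel
    y = (origin_rc[1] - cols) * meters_per_pixel
    return np.stack((x, y), axis=1)

# Tiny hypothetical prediction: sensor at the bottom-center pixel (row 4, col 2).
bev = np.zeros((5, 5))
bev[0, 2] = 1.0   # confident return straight ahead
bev[4, 0] = 0.9   # confident return to the left
bev[2, 2] = 0.2   # weak response, below the threshold

points = bev_to_points(bev, threshold=0.5, meters_per_pixel=0.5, origin_rc=(4, 2))
print(points)
```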

The physical platform used for point cloud acquisition is shown below:

▲Figure 3｜Custom handheld sensor platform ©️【Deep Blue AI】 Compilation

■4.1 Qualitative Comparison

Figure 4 shows examples of the ground-truth LiDAR point clouds alongside the radar point clouds generated by the proposed method and the baseline methods.

The results indicate that the proposed method surpasses the baseline methods, generating denser and more accurate point clouds with fewer clutter points. Structural features such as straight walls in indoor environments, curved rock walls in mining scenarios, and various obstacles in outdoor environments are also predicted accurately, which the baseline methods cannot achieve. Among the baselines, the results of OS-CFAR and RPDNet contain a large amount of clutter because they rely on traditional DOA estimation, which makes it difficult to filter noise in complex environments. Additionally, because they lack LiDAR supervision in the azimuth dimension, their angular resolution is much lower than that of the ground truth. In contrast, the results from RadarHD exhibit higher angular resolution and fewer clutter points, as it employs an end-to-end approach to learn the perception characteristics of LiDAR. However, due to its limited generative capability, RadarHD cannot retain fine texture details of the environment as accurately as the proposed method, and it fails to predict environmental structures in outdoor scenes.

On the Longboard dataset, both the proposed method and the baseline methods perform worse than on the other datasets (see Figure 4). This is because the scenes in this dataset contain many low-reflectivity objects, such as grass and leaves, which millimeter-wave radar struggles to detect. Moreover, these objects are highly unstructured and difficult to learn. Consequently, the generated radar point clouds are sparser than the LiDAR point clouds.

▲Figure 4｜Qualitative comparison of 2D point clouds across different datasets ©️【Deep Blue AI】 Compilation

■4.2 Quantitative Comparison

As seen in Tables 1 and 2, the proposed methods (Propose-EDM and Propose-CD) outperform all baseline methods on the quantitative metrics in every scene except Longboard. These results demonstrate the superiority of the proposed method in generating high-quality millimeter-wave radar point clouds.

This paper proposes a method for constructing single-chip millimeter-wave radar point clouds for MAV autonomous navigation. Based on cross-modal supervision and utilizing aligned LiDAR point clouds and generative learning of diffusion models, the proposed method can generate high-quality LiDAR-like point clouds from single-frame sparse and noisy radar data. Furthermore, the author team has combined the latest diffusion model inference acceleration techniques to achieve single-step generation and real-time inference on onboard computational platforms, addressing the slow inference speed of diffusion models. The quality of point clouds from the proposed method surpasses baseline methods on the public ColoRadar dataset. Additionally, the proposed method’s generalization capabilities were validated against novel scenes and different sensor configurations.

In the future, we can attempt to combine elevation measurement information from millimeter-wave radar to generate high-quality 3D point clouds and deploy such methods on MAVs to perform autonomous navigation tasks in cluttered and visually degraded environments.
