This article assumes you are familiar with the Transformer Attention mechanism. If not, that’s okay; let me explain briefly.
The Attention mechanism is about the focus point: the same event can have different focal points for different people. For instance, the teacher says: “Xiao Ming skipped class again to play basketball.” The teacher’s focus is on “Xiao Ming,” “again,” and “skipping class,” indicating that the teacher is thinking: “It’s this Xiao Ming, skipping class again.” However, Xiao Ming’s good friend Xiao Qiang might focus on “Xiao Ming” and “basketball,” thinking: “Xiao Ming is playing basketball without inviting me.” This is Attention.
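To make this concrete, here is a minimal sketch of scaled dot-product attention in Python (NumPy). The token embeddings and the two “reader” queries are made-up values chosen purely to illustrate how the same sentence yields different focus weights for different queries:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # similarity of each query to each token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax
    return weights @ V, weights

# Toy sentence: "Xiao Ming skipped class again to play basketball"
tokens = ["Xiao Ming", "skipped class", "again", "basketball"]
K = V = np.array([[1.0, 0.0],   # Xiao Ming
                  [0.0, 1.0],   # skipped class
                  [0.1, 0.9],   # again
                  [0.9, 0.2]])  # basketball

teacher_q = np.array([[0.3, 1.0]])  # the teacher cares about the skipping
friend_q  = np.array([[1.0, 0.1]])  # the friend cares about Xiao Ming / basketball

for name, q in [("teacher", teacher_q), ("friend", friend_q)]:
    _, w = attention(q, K, V)
    print(name, {t: round(float(x), 2) for t, x in zip(tokens, w[0])})
```

Running this, the teacher’s query puts the largest weights on “skipped class” and “again,” while the friend’s query shifts weight to “Xiao Ming” and “basketball” — the same tokens, two different focal points.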
Now let’s look at Attention for images. The image below should be familiar to everyone; it depicts a scene where a Li Auto vehicle was rear-ended. The cause was that Li Auto’s visual recognition system mistakenly identified the car on the billboard as a real vehicle, braked, and was hit from behind.
Let’s analyze this image. From the perspective of image recognition, it indeed shows two cars. From a driver’s Attention perspective, however, it is a billboard standing on the highway that does not obstruct traffic. In other words, Attention incorporates the human focal point into visual perception.
When a driver sees this scene, their first concern is whether there are cars on the road; they do not focus on the billboard, because they know it is an advertisement. Only afterwards might they notice the car on the billboard. Clearly, Li Auto’s visual system lacks Attention; it merely performs image recognition and confirms that there are indeed two cars. To abstract further: Attention encompasses human understanding of things.
Many analyses suggest that if Li Auto had Lidar, the rear-end collision could have been avoided. On the surface this seems correct: Lidar can easily detect that the object is above the road and does not block traffic, and so would treat it as a non-obstacle. Lidar does not identify whether it is a billboard or something else; it merely registers an object that does not impede travel, allowing the vehicle to pass normally without braking. However, consider the following image:
I added the words “Landslide Ahead” to the billboard, which Lidar cannot comprehend. Lidar lacks the ability to recognize text; it can only detect the object, not interpret the information displayed on it. Some might argue that Huawei’s GOD network merges Lidar and vision, which is correct, but this fusion can introduce interference: Lidar registers the shape and position of the billboard, while when we drive past and see “Landslide Ahead,” we focus on the meaning of those words rather than on the billboard itself.
We know that Tesla has purchased many Lidar units for its data-collection vehicles to annotate objects in videos. Note that this annotation can also be done manually; the annotation represents the driver’s “Attention” and tells the large model where to focus during subsequent training. For the previous image, Tesla’s annotations might include “road,” “guardrail by the road,” and “words on the billboard”! Tesla would not annotate the position of the billboard, since the driver does not need to focus on it; the driver only needs to confirm there are no cars on the road and read the words on the billboard, without attending to the billboard’s position or the object itself. This is because Attention encompasses human understanding of things, which differs from Lidar’s understanding.
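As a purely hypothetical sketch (the schema and field names are mine, not Tesla’s), such Attention-oriented annotation could look like this:

```python
# Hypothetical annotation for the billboard scene; the schema is
# illustrative only and not Tesla's actual labeling format.
frame_annotation = {
    "frame_id": 1024,
    "attention_targets": [
        {"label": "road",               "relevant": True},
        {"label": "guardrail by road",  "relevant": True},
        {"label": "text on billboard",  "relevant": True,
         "content": "Landslide Ahead"},
        # The billboard itself is deliberately NOT annotated:
        # its position is not something the driver attends to.
    ],
}
```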
Looking at Huawei’s ADS 2.0 and 3.0, both incorporate Lidar. This means that, regardless of how it merges with visual information, the Lidar will first register the position of the billboard and identify it as an object, and only then fuse with the visual data. This creates a false Attention focus: the algorithm wastes capacity processing an irrelevant input, which can slow inference and, in severe cases, lead to erroneous decisions.
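A toy way to see this interference: if fusion injects the billboard as one more high-salience token, the softmax redistributes weight toward it and dilutes the tokens that actually matter. The salience numbers below are invented for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Invented salience scores for the visual tokens the planner attends to.
vision_scores = {"road": 2.0, "billboard text": 1.5, "guardrail": 0.5}
print({k: round(float(v), 2)
       for k, v in zip(vision_scores,
                       softmax(np.array(list(vision_scores.values()))))})

# Lidar fusion adds the billboard as a large, nearby "object" token;
# attention weight shifts away from the text that actually matters.
fused_scores = dict(vision_scores, **{"billboard object (lidar)": 2.5})
print({k: round(float(v), 2)
       for k, v in zip(fused_scores,
                       softmax(np.array(list(fused_scores.values()))))})
```

In this toy example the billboard-object token grabs nearly half of the attention weight, and the weight on “billboard text” drops from about 0.33 to about 0.17.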
For example, when we drive and see a road sign that says “Road Closed Ahead,” we stop and turn around. If ADS learns from this human driving behavior while focusing on the information provided by Lidar, it learns the association “obstacle shaped like a road sign → turn around.” What ADS doesn’t realize is that the driver turned around because they saw the words “Road Closed Ahead”; the driver’s Attention was on “Road Closed Ahead,” not on the road sign itself. The next time ADS operates independently and Lidar detects a similar-shaped sign, it will turn around, even though this time the sign says “Welcome Home!”
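In code, this failure mode is a policy conditioned on the wrong feature. The following is a deliberately naive illustration, not real ADS logic:

```python
# Deliberately naive illustration of the spurious correlation.
def ads_policy_shape_based(detection):
    # Learned association: "sign-shaped obstacle" -> turn around.
    if detection["shape"] == "road sign":
        return "turn around"
    return "continue"

def driver_policy_text_based(detection):
    # The human behavior was actually conditioned on the words, not the shape.
    if detection.get("text") == "Road Closed Ahead":
        return "turn around"
    return "continue"

sign = {"shape": "road sign", "text": "Welcome Home!"}
print(ads_policy_shape_based(sign))    # -> "turn around"  (wrong)
print(driver_policy_text_based(sign))  # -> "continue"     (right)
```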
In summary, visual information aligns best with the human “Attention” mechanism, and it will integrate all the more naturally with AI agents. Under normal circumstances, Lidar can recognize obstacles, but it attends to every object that reflects its laser pulses, which can interfere with the “Attention” over visual signals, especially when training data is insufficient, degrading the Transformer’s understanding of the visual input.