Currently, there are two major trends in the tech industry: one is the wave of large models sparked by ChatGPT, and the other is humanoid robots, or more broadly the wave of embodied intelligence. After seeing the investments and demos that companies showed off in humanoid robots at last week's World Robot Conference, I can only say that the era of robots is approaching!
What is embodied intelligence? What are its key components?
Embodied intelligence is the ability to understand the world, interact, and accomplish tasks through learning and evolution in both physical and digital realms. It is generally considered to consist of the ‘body’ and the ‘agent’ that perform tasks in complex environments.
The ultimate goal is for the agent to adapt to new environments, learn new knowledge, and solve real-world problems through interaction with its environment, whether virtual or real.
- Body: The robot body that perceives and executes tasks in physical or virtual environments.
- Agent: The intelligent core embodied on top of the body, responsible for perception, understanding, decision-making, and control.
- Data: Used for generalization and training.
What is the cornerstone of the technology stack for embodied intelligence?
From this definition, the hope is that embodied agents can help people solve real problems and thereby free up our productivity.
Under the existing model, how does a robot help solve problems? The most common approach is to define the requirements first, and then have engineers customize a solution for the specific scenario through programming or teaching; the robot itself cannot think or find solutions beyond its code.
The embodied intelligence model is different: the embodied agent typically carries vision, audio, and other sensors, and by combining visual signals with language input it can understand its environment, decompose the task at hand, and then generate its own plan of action to accomplish the goal.
The difference between the two models is that one involves humans teaching machines to work, while the other involves robots learning to work by mimicking humans. You will find that embodied intelligence is somewhat like a combination of deep learning and traditional robotics.
- Large models can help robots understand and digest knowledge, forming the robot's agent;
- The robot body continues to leverage traditional robotics knowledge to solve actual physical tasks.
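To make the contrast between the two models concrete, here is a minimal, hypothetical Python sketch. The `robot` object, its skills, and the `query_llm` helper are placeholders assumed purely for illustration, not a real robot or model API: the first function is the traditional "humans teach the machine" pattern, while the second is the embodied-intelligence pattern in which a large model decomposes a natural-language goal into sub-tasks that traditional robotic skills then execute.

```python
# Hypothetical sketch only: `robot` and `query_llm` are stand-ins,
# not a real robot SDK or model API.

def query_llm(prompt: str) -> list[str]:
    """Placeholder for a call to a large language model that returns
    an ordered list of sub-task descriptions."""
    raise NotImplementedError

# Traditional model: the engineer hard-codes every step for one scenario.
def pick_and_place_hardcoded(robot):
    robot.move_to("bin_A")
    robot.close_gripper()
    robot.move_to("bin_B")
    robot.open_gripper()

# Embodied-intelligence model: the agent combines perception (a scene
# description here) with a large model to decompose the goal, then maps
# each sub-task onto low-level skills built with traditional robotics.
def execute_goal(robot, goal: str, scene_description: str):
    plan = query_llm(
        f"Scene: {scene_description}\n"
        f"Goal: {goal}\n"
        "List the sub-tasks needed to achieve the goal."
    )
    for subtask in plan:
        robot.execute_skill(subtask)  # e.g. "grasp cup", "move to table"
```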
What are the cutting-edge research areas in embodied intelligence?
Robot Bodies
| Robot Type | Main Application Areas | Technical Details | Representative Robots |
| --- | --- | --- | --- |
| Fixed-base Robots | Laboratory automation, education and training, industrial manufacturing | High-precision sensors and actuators, programming flexibility, micron-level precision | Franka Emika Panda, KUKA iiwa, Sawyer |
| Wheeled Robots | Logistics, warehousing, security inspection | Simple structure, low cost, high efficiency, fast movement | Kiva Robot, Jackal Robot |
| Crawler Robots | Agriculture, construction, disaster recovery, military applications | Strong off-road capability and maneuverability, stability, and traction | PackBot |
| Quadrupedal Robots | Exploring complex terrains, rescue missions, military applications | Multi-joint design, strong adaptability, strong environmental perception capabilities | Unitree A1, Go1, Boston Dynamics Spot, ANYmal C |
| Humanoid Robots | Service industry, healthcare, collaborative environments | Humanoid shape, multi-degree-of-freedom hand design, ability to perform complex tasks | Atlas, HRP series, ASIMO, Pepper |
| Bionic Robots | Healthcare, environmental monitoring, biological research | Mimic the movements and functions of natural organisms; flexible materials and structures | Fish robots, insect robots, soft robots |
Data Source — Simulators
Simulators play a crucial role in embodied intelligence by providing virtual environments that help researchers conduct cost-effective, safe, and highly scalable experiments and tests.
General Simulators
General simulators provide a virtual environment that closely resembles the physical world, used for algorithm development and model training, offering significant cost, time, and safety advantages.
Specific simulator case studies:
- Isaac Sim: An advanced platform for robot and AI research simulation, featuring high-fidelity physics simulation, real-time ray tracing, and a rich library of robot models, applicable to scenarios including autonomous driving, industrial automation, and human-robot interaction.
- Gazebo: An open-source robot research simulator that supports various sensor simulations and multi-robot system simulations, mainly used for robot navigation and control.
- PyBullet: A Python interface to the Bullet physics engine, easy to use, supporting real-time physics simulation, mainly used for reinforcement learning and robot simulation.
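As a taste of what working with such a simulator looks like, below is a minimal PyBullet sketch (assuming `pip install pybullet`). It loads the ground plane and the KUKA iiwa model bundled with `pybullet_data` and steps the physics engine; this is roughly the skeleton of any PyBullet-based experiment, not a complete research setup.

```python
import pybullet as p
import pybullet_data

physics_client = p.connect(p.DIRECT)  # headless mode; use p.GUI for a window
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

plane_id = p.loadURDF("plane.urdf")
robot_id = p.loadURDF("kuka_iiwa/model.urdf", basePosition=[0, 0, 0])

for _ in range(240):                  # ~1 second at the default 240 Hz timestep
    p.stepSimulation()

position, orientation = p.getBasePositionAndOrientation(robot_id)
print(position, orientation)
p.disconnect()
```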
Real-World Scene-Based Simulators
These simulators create highly realistic 3D scenes by collecting real-world data, making them a preferred choice for embodied intelligence research on household activities.
Specific simulator case studies:
- AI2-THOR: An indoor embodied-scene simulator based on Unity3D, containing rich interactive scene objects and physical properties, suitable for multi-agent simulation and complex task research.
- Matterport3D: A large-scale RGB-D dataset containing rich indoor scenes, widely used as a benchmark for embodied navigation.
- Habitat: An open-source, large-scale embodied AI simulator with Bullet-based physics, providing high-performance, fast, parallel 3D simulation and rich interfaces, suitable for reinforcement learning research in embodied intelligence.
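For a sense of how lightweight these APIs can be, here is a minimal AI2-THOR sketch (assuming `pip install ai2thor`). It opens one of the built-in kitchen scenes, takes a single navigation step, and reads back the agent state; the scene name and action are just common defaults, not a specific benchmark setup.

```python
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")  # FloorPlan1 is a built-in kitchen scene
event = controller.step(action="MoveAhead")  # move the agent forward one step

# Each event carries an RGB frame plus metadata about the agent and scene objects.
print(event.metadata["agent"]["position"])
print(len(event.metadata["objects"]), "objects in the scene")

controller.stop()
```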
Agents
| Research Area | Main Goals | Specific Methods |
| --- | --- | --- |
| Embodied Perception | Visual Simultaneous Localization and Mapping (vSLAM) | Traditional vSLAM (MonoSLAM, PTAM, ORB-SLAM), semantic vSLAM (SLAM++, DynaSLAM) |
| | 3D Scene Understanding | Projection methods (MV3D), voxel methods (VoxNet), point cloud methods (PointNet) |
| | Active Visual Perception | Interactive environment exploration (Pinto et al.), exploration based on visual direction changes (Jayaraman et al.) |
| | Tactile Perception | Non-visual tactile sensors (BioTac), visual tactile sensors (GelSight) |
| Embodied Interaction | 3D Visual Localization | Two-stage methods (ReferIt3D, TGNN), single-stage methods (3D-SPS, BUTD-DETR) |
| | Visual Language Navigation (VLN) | Memory and understanding-based methods (LVERG), future prediction-based methods (LookBY) |
| | Embodied Interaction in Dialogue Systems | Large model-based dialogue systems (DialFRED), multi-agent collaboration (DiscussNav) |
| Embodied Agents | Multimodal Foundational Models | Multimodal data fusion and representation (VisualBERT), representative models and applications (UNITER) |
| | Embodied Task Planning | Task decomposition and execution (HAPI), planning and realization of complex tasks (TAMP) |
| Sim-to-Real Adaptation | Embodied World Models | Simulation and understanding of world models (Dreamer), real-world application case studies (PlaNet) |
| | Data Collection and Training | Creation and optimization of datasets (Gibson) |
| | Embodied Control | Control algorithms and strategies (PPO), instances and applications (DRL) |
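Since the table mentions PPO for embodied control, here is a minimal sketch of training a control policy with it (assuming `pip install gymnasium stable-baselines3`). Gymnasium's Pendulum task stands in for a robot joint-control environment; this is illustrative of the PPO training loop, not any specific embodied-control pipeline from the table above.

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")        # continuous torque control of a single joint
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)  # short run, just to demonstrate the API

# Roll out the learned policy for one episode.
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()
```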
Basic Knowledge for Embodied Intelligence Development
Brief Introduction
You will also find that whether you work on the body or on the agent's learning, there are many subdivisions, but some of the foundations are shared. Next, I will introduce this general foundational knowledge:
- Programming Languages and Data Structures
  - C++: for efficient embedded function execution and inference-engine development; future articles on this will be published by GuYue Academy.
  - Python: for rapid functionality verification;
  - MATLAB: for quick theoretical algorithm validation;
  - Basic data structures;
  - ROS: a universal robot middleware for quickly deploying basic robot functions; many LLM projects now include typical ROS example cases (see the first sketch after this list).
- Deep Learning
  - Deep learning fundamentals: basic convolutional network architectures such as AlexNet and ResNet; sequence models such as RNNs and LSTMs; and the Transformer architecture built on self-attention;
  - Deep learning frameworks: PyTorch (see the second sketch after this list);
  - (Advanced) Robot deep learning architectures: RT-1, RT-2, AutoRT/SARA-RT/RT-Trajectory, RT-H;
- Embedded Development
  - Development on common chip families such as ST, ESP, GD, and Infineon;
  - Ability to read schematics and PCB layouts;
  - Development of general Linux kernel drivers.
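First, a minimal ROS 1 (rospy) sketch of a velocity-publisher node. The `/cmd_vel` topic and the 10 Hz rate are common conventions rather than requirements, and the node assumes a mobile base that subscribes to `geometry_msgs/Twist` messages.

```python
#!/usr/bin/env python3
import rospy
from geometry_msgs.msg import Twist

def main():
    rospy.init_node("simple_mover")
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=10)
    rate = rospy.Rate(10)      # publish at 10 Hz

    cmd = Twist()
    cmd.linear.x = 0.2         # drive forward at 0.2 m/s
    cmd.angular.z = 0.0        # no rotation

    while not rospy.is_shutdown():
        pub.publish(cmd)
        rate.sleep()

if __name__ == "__main__":
    main()
```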
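Second, a minimal PyTorch sketch: a small convolutional classifier run on a dummy image batch. The layer sizes are arbitrary; the point is only to show the framework's basic module-and-forward-pass pattern.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling to a 32-dim feature
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x).flatten(1)
        return self.classifier(x)

model = TinyCNN()
dummy_batch = torch.randn(4, 3, 64, 64)  # 4 RGB images, 64x64
logits = model(dummy_batch)
print(logits.shape)                      # torch.Size([4, 10])
```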
Introduction to Humanoid Robot Bodies
The image below is a schematic diagram of the joints and structure of the Qinglong full-size general humanoid robot.
Core Joints of the Robot
The core joints of the robot are mainly divided into linear joints, rotational joints, joint sensors, and joint drive systems.
The complexity of humanoid robots largely stems from their requirement for many degrees of freedom, which corresponds to the need for many joints in the robot body, involving a complex supply chain.
Looking back at the World Robot Conference, the exhibiting component manufacturers came in many different varieties.
What are the uses of these components? We need to go back to the linear joints, rotational joints, joint sensors, and joint drive systems.
- Linear joints are a combination of a motor and a screw, allowing the robot to perform linear motion;
- Rotational joints are a combination of a motor and a reducer, enabling the robot to perform rotational motion.
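A rough back-of-the-envelope sketch of these two joint types is below; the numbers (a 3000 rpm motor, a 4 mm screw lead, a 100:1 reducer, 80% efficiency) are illustrative assumptions, not the specs of any particular robot.

```python
def linear_joint_speed(motor_rpm: float, screw_lead_mm: float) -> float:
    """Linear speed (mm/s) of a motor + lead-screw joint: one motor
    revolution advances the nut by one screw lead."""
    return motor_rpm / 60.0 * screw_lead_mm

def rotational_joint_output(motor_rpm: float, motor_torque_nm: float,
                            gear_ratio: float, efficiency: float = 0.8):
    """Output speed (rpm) and torque (N*m) of a motor + reducer joint:
    the reducer divides speed and multiplies torque (minus losses)."""
    return motor_rpm / gear_ratio, motor_torque_nm * gear_ratio * efficiency

print(linear_joint_speed(3000, 4))               # -> 200.0 mm/s
print(rotational_joint_output(3000, 0.5, 100))   # -> (30.0, 40.0)
```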
Motors
A motor is a device that converts electrical energy into rotational kinetic energy. A motor typically consists of a stator and a rotor: the stator is the fixed part, while the rotor is the rotating part. When power is applied, current flows through the windings and generates a magnetic field; the interaction between the stator and rotor fields produces a torque that drives the rotor to rotate.
Reducers
Screws