Currently, there are two major trends in the tech industry: one is the wave of large models sparked by ChatGPT, and the other is humanoid robots, or more broadly the wave of embodied intelligence. After seeing the investments and demos that companies showed off in humanoid robots at last week's World Robot Conference, I can only say that the era of robots is approaching!
What is embodied intelligence? What are its key components?
Embodied intelligence is the ability to understand the world, interact, and accomplish tasks through learning and evolution in both physical and digital realms. It is generally considered to consist of the ‘body’ and the ‘agent’ that perform tasks in complex environments.
The ultimate goal is for the agent to adapt to new environments, learn new knowledge, and solve real-world problems through interaction with its environment, whether virtual or real.
- Body: The robot body that perceives and executes tasks in physical or virtual environments.
- Agent: The intelligent core embodied on top of the body, responsible for perception, understanding, decision-making, and control.
- Data: Used for generalization and training.
What is the cornerstone of the technology stack for embodied intelligence?
From this definition, the hope is that embodied agents can help people solve real problems and thereby free up our productivity.
Under the existing model, how does a robot help solve problems? The most common approach is to define the requirements first, and then have engineers customize a solution for the specific scenario through programming or teaching; the robot itself cannot think or find solutions beyond its code.
The embodied intelligence model is different: the embodied agent typically carries vision, audio, and other sensors, and by combining visual signals with language input it can understand its environment, decompose the task at hand, and then generate its own plan of action to accomplish the goal.
The difference between the two models is that one involves humans teaching machines to work, while the other involves robots learning to work by mimicking humans. You will find that embodied intelligence is somewhat like a combination of deep learning and traditional robotics.
- Large models can help robots understand and digest knowledge, forming the robot's agent;
- The robot body continues to leverage traditional robotics knowledge to solve actual physical tasks.
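To make the contrast between the two models concrete, here is a minimal, hypothetical Python sketch. The `robot` object, its skills, and the `query_llm` helper are placeholders assumed purely for illustration, not a real robot or model API: the first function is the traditional "humans teach the machine" pattern, while the second is the embodied-intelligence pattern in which a large model decomposes a natural-language goal into sub-tasks that traditional robotic skills then execute.

```python
# Hypothetical sketch only: `robot` and `query_llm` are stand-ins,
# not a real robot SDK or model API.

def query_llm(prompt: str) -> list[str]:
    """Placeholder for a call to a large language model that returns
    an ordered list of sub-task descriptions."""
    raise NotImplementedError

# Traditional model: the engineer hard-codes every step for one scenario.
def pick_and_place_hardcoded(robot):
    robot.move_to("bin_A")
    robot.close_gripper()
    robot.move_to("bin_B")
    robot.open_gripper()

# Embodied-intelligence model: the agent combines perception (a scene
# description here) with a large model to decompose the goal, then maps
# each sub-task onto low-level skills built with traditional robotics.
def execute_goal(robot, goal: str, scene_description: str):
    plan = query_llm(
        f"Scene: {scene_description}\n"
        f"Goal: {goal}\n"
        "List the sub-tasks needed to achieve the goal."
    )
    for subtask in plan:
        robot.execute_skill(subtask)  # e.g. "grasp cup", "move to table"
```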
What are the cutting-edge research areas in embodied intelligence?
Robot Bodies
| Robot Type | Main Application Areas | Technical Details | Representative Robots |
| --- | --- | --- | --- |
| Fixed-base Robots | Laboratory automation, education and training, industrial manufacturing | High-precision sensors and actuators, programming flexibility, micron-level precision | Franka Emika Panda, KUKA iiwa, Sawyer |
| Wheeled Robots | Logistics, warehousing, security inspection | Simple structure, low cost, high efficiency, fast movement | Kiva Robot, Jackal Robot |
| Crawler Robots | Agriculture, construction, disaster recovery, military applications | Strong off-road capability and maneuverability, stability, and traction | PackBot |
| Quadrupedal Robots | Exploring complex terrains, rescue missions, military applications | Multi-joint design, strong adaptability, strong environmental perception capabilities | Unitree A1, Go1, Boston Dynamics Spot, ANYmal C |
| Humanoid Robots | Service industry, healthcare, collaborative environments | Humanoid shape, multi-degree-of-freedom hand design, ability to perform complex tasks | Atlas, HRP series, ASIMO, Pepper |
| Bionic Robots | Healthcare, environmental monitoring, biological research | Mimic the movements and functions of natural organisms; flexible materials and structures | Fish robots, insect robots, soft robots |
Data Source — Simulators
Simulators play a crucial role in embodied intelligence by providing virtual environments that help researchers conduct cost-effective, safe, and highly scalable experiments and tests.
General Simulators
General simulators provide a virtual environment that closely resembles the physical world, used for algorithm development and model training, offering significant cost, time, and safety advantages.
Specific simulator case studies:
- Isaac Sim: An advanced platform for robot and AI research simulation, featuring high-fidelity physics simulation, real-time ray tracing, and a rich library of robot models, applicable to scenarios including autonomous driving, industrial automation, and human-robot interaction.
- Gazebo: An open-source robot research simulator that supports various sensor simulations and multi-robot system simulations, mainly used for robot navigation and control.
- PyBullet: A Python interface to the Bullet physics engine, easy to use, supporting real-time physics simulation, mainly used for reinforcement learning and robot simulation.
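As a taste of what working with such a simulator looks like, below is a minimal PyBullet sketch (assuming `pip install pybullet`). It loads the ground plane and the KUKA iiwa model bundled with `pybullet_data` and steps the physics engine; this is roughly the skeleton of any PyBullet-based experiment, not a complete research setup.

```python
import pybullet as p
import pybullet_data

physics_client = p.connect(p.DIRECT)  # headless mode; use p.GUI for a window
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

plane_id = p.loadURDF("plane.urdf")
robot_id = p.loadURDF("kuka_iiwa/model.urdf", basePosition=[0, 0, 0])

for _ in range(240):                  # ~1 second at the default 240 Hz timestep
    p.stepSimulation()

position, orientation = p.getBasePositionAndOrientation(robot_id)
print(position, orientation)
p.disconnect()
```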
Real-World Scene-Based Simulators
These simulators create highly realistic 3D scenes by collecting real-world data, making them a preferred choice for embodied intelligence research on household activities.
Specific simulator case studies:
- AI2-THOR: An indoor embodied-scene simulator based on Unity3D, containing rich interactive scene objects and physical properties, suitable for multi-agent simulation and complex task research.
- Matterport3D: A large-scale RGB-D dataset containing rich indoor scenes, widely used as a benchmark for embodied navigation.
- Habitat: An open-source, large-scale embodied AI simulator with Bullet-based physics, providing high-performance, fast, parallel 3D simulation and rich interfaces, suitable for reinforcement learning research in embodied intelligence.
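For a sense of how lightweight these APIs can be, here is a minimal AI2-THOR sketch (assuming `pip install ai2thor`). It opens one of the built-in kitchen scenes, takes a single navigation step, and reads back the agent state; the scene name and action are just common defaults, not a specific benchmark setup.

```python
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")  # FloorPlan1 is a built-in kitchen scene
event = controller.step(action="MoveAhead")  # move the agent forward one step

# Each event carries an RGB frame plus metadata about the agent and scene objects.
print(event.metadata["agent"]["position"])
print(len(event.metadata["objects"]), "objects in the scene")

controller.stop()
```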
Agents
| Research Area | Main Goals | Specific Methods |
| --- | --- | --- |
| Embodied Perception | Visual Simultaneous Localization and Mapping (vSLAM) | Traditional vSLAM (MonoSLAM, PTAM, ORB-SLAM), semantic vSLAM (SLAM++, DynaSLAM) |
| | 3D Scene Understanding | Projection methods (MV3D), voxel methods (VoxNet), point cloud methods (PointNet) |
| | Active Visual Perception | Interactive environment exploration (Pinto et al.), exploration based on visual direction changes (Jayaraman et al.) |
| | Tactile Perception | Non-visual tactile sensors (BioTac), visual tactile sensors (GelSight) |
| Embodied Interaction | 3D Visual Localization | Two-stage methods (ReferIt3D, TGNN), single-stage methods (3D-SPS, BUTD-DETR) |
| | Visual Language Navigation (VLN) | Memory and understanding-based methods (LVERG), future prediction-based methods (LookBY) |
| | Embodied Interaction in Dialogue Systems | Large model-based dialogue systems (DialFRED), multi-agent collaboration (DiscussNav) |
| Embodied Agents | Multimodal Foundational Models | Multimodal data fusion and representation (VisualBERT), representative models and applications (UNITER) |
| | Embodied Task Planning | Task decomposition and execution (HAPI), planning and realization of complex tasks (TAMP) |
| Sim-to-Real Adaptation | Embodied World Models | Simulation and understanding of world models (Dreamer), real-world application case studies (PlaNet) |
| | Data Collection and Training | Creation and optimization of datasets (Gibson) |
| | Embodied Control | Control algorithms and strategies (PPO), instances and applications (DRL) |
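Since the table mentions PPO for embodied control, here is a minimal sketch of training a control policy with it (assuming `pip install gymnasium stable-baselines3`). Gymnasium's Pendulum task stands in for a robot joint-control environment; this is illustrative of the PPO training loop, not any specific embodied-control pipeline from the table above.

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")        # continuous torque control of a single joint
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)  # short run, just to demonstrate the API

# Roll out the learned policy for one episode.
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()
```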
Basic Knowledge for Embodied Intelligence Development
Brief Introduction
You will also find that whether you work on the body or on the agent's learning, there are many subdivisions, but some of the foundations are shared. Next, I will introduce this general foundational knowledge:
- Programming Languages and Data Structures
  - C++: for efficient embedded function execution and inference-engine development; future articles on this will be published by GuYue Academy.
  - Python: for rapid functionality verification;
  - MATLAB: for quick theoretical algorithm validation;
  - Basic data structures;
  - ROS: a universal robot middleware for quickly deploying basic robot functions; many LLM projects now include typical ROS example cases (see the first sketch after this list).
- Deep Learning
  - Deep learning fundamentals: basic convolutional network architectures such as AlexNet and ResNet; sequence models such as RNNs and LSTMs; and the Transformer architecture built on self-attention;
  - Deep learning frameworks: PyTorch (see the second sketch after this list);
  - (Advanced) Robot deep learning architectures: RT-1, RT-2, AutoRT/SARA-RT/RT-Trajectory, RT-H;
- Embedded Development
  - Development on common chip families such as ST, ESP, GD, and Infineon;
  - Ability to read schematics and PCB layouts;
  - Development of general Linux kernel drivers.
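First, a minimal ROS 1 (rospy) sketch of a velocity-publisher node. The `/cmd_vel` topic and the 10 Hz rate are common conventions rather than requirements, and the node assumes a mobile base that subscribes to `geometry_msgs/Twist` messages.

```python
#!/usr/bin/env python3
import rospy
from geometry_msgs.msg import Twist

def main():
    rospy.init_node("simple_mover")
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=10)
    rate = rospy.Rate(10)      # publish at 10 Hz

    cmd = Twist()
    cmd.linear.x = 0.2         # drive forward at 0.2 m/s
    cmd.angular.z = 0.0        # no rotation

    while not rospy.is_shutdown():
        pub.publish(cmd)
        rate.sleep()

if __name__ == "__main__":
    main()
```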
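Second, a minimal PyTorch sketch: a small convolutional classifier run on a dummy image batch. The layer sizes are arbitrary; the point is only to show the framework's basic module-and-forward-pass pattern.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling to a 32-dim feature
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x).flatten(1)
        return self.classifier(x)

model = TinyCNN()
dummy_batch = torch.randn(4, 3, 64, 64)  # 4 RGB images, 64x64
logits = model(dummy_batch)
print(logits.shape)                      # torch.Size([4, 10])
```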
Introduction to Humanoid Robot Bodies
The image below is a schematic diagram of the joints and structure of the Qinglong full-size general humanoid robot.
Core Joints of the Robot
The core joints of the robot are mainly divided into linear joints, rotational joints, joint sensors, and joint drive systems.
The complexity of humanoid robots largely stems from their requirement for many degrees of freedom, which corresponds to the need for many joints in the robot body, involving a complex supply chain.
Looking back at the World Robot Conference, the exhibiting component manufacturers came in many different varieties.
What are the uses of these components? We need to go back to the linear joints, rotational joints, joint sensors, and joint drive systems.
- Linear joints are a combination of a motor and a screw, allowing the robot to perform linear motion;
- Rotational joints are a combination of a motor and a reducer, enabling the robot to perform rotational motion.
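A rough back-of-the-envelope sketch of these two joint types is below; the numbers (a 3000 rpm motor, a 4 mm screw lead, a 100:1 reducer, 80% efficiency) are illustrative assumptions, not the specs of any particular robot.

```python
def linear_joint_speed(motor_rpm: float, screw_lead_mm: float) -> float:
    """Linear speed (mm/s) of a motor + lead-screw joint: one motor
    revolution advances the nut by one screw lead."""
    return motor_rpm / 60.0 * screw_lead_mm

def rotational_joint_output(motor_rpm: float, motor_torque_nm: float,
                            gear_ratio: float, efficiency: float = 0.8):
    """Output speed (rpm) and torque (N*m) of a motor + reducer joint:
    the reducer divides speed and multiplies torque (minus losses)."""
    return motor_rpm / gear_ratio, motor_torque_nm * gear_ratio * efficiency

print(linear_joint_speed(3000, 4))               # -> 200.0 mm/s
print(rotational_joint_output(3000, 0.5, 100))   # -> (30.0, 40.0)
```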
Motors
A motor is a device that converts electrical energy into rotational kinetic energy. A motor typically consists of a stator and a rotor: the stator is the fixed part, while the rotor is the rotating part. When power is applied, current flows through the windings and generates a magnetic field; the interaction between the stator and rotor fields produces a torque that drives the rotor to rotate.
Reducers
Screws