This is an article published in the “Heroic Journey” column by the Hero.
2025 Issue No. 15 Total Issue No. 98
This article has a total of 6191 words and takes about 10 minutes to read.
Since the emergence of GenAI (Generative AI) driven by ChatGPT, the topic of artificial intelligence seems to be closer to our work and life than ever before.
Humans both create and develop AI while being influenced and changed by it. In this unprecedented era of great change, the best way for humans to survive in a world of co-creation and mutual influence with AI is to engage in high-dimensional thinking and listen to multiple perspectives.
Thus, the Hero will take you on a global tour to see what kind of digital intelligence stories are unfolding around the world.
Join the Hero in AI and keep pace with the Hero!
Overview of the Best AI Agent Papers for 2024
Hero Original + Juteq Official Website
Main Text:
AI agents have become one of the most popular technology trends of 2024, but they are still new and require significant improvements. This year, we have seen outstanding research and advancements in agents that even prompt us to rethink the overall concept.
Because top companies and universities published so many novel research papers this year, we have grouped the list into categories, from frameworks to surveys.
Let’s take a look at this year’s collection of research papers that make AI agents more interesting:
1. Microsoft’s Magentic-One
Magentic-One is a generalist multi-agent system built on Microsoft’s AutoGen framework, designed to solve open-ended web and file-based tasks across various domains.
- Multi-agent architecture for handling complex tasks
- Capable of processing web-based and file-based inputs
- Universal approach for cross-domain versatility
The system may employ a set of coordinated specialized agents, each focusing on different aspects of task resolution, such as information retrieval, reasoning, and output generation. We can infer that Magentic-One might use a hierarchical structure where a central coordinator manages task distribution among various specialized agents.
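The coordinator idea inferred above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not Magentic-One's code or the AutoGen API: the agent functions (`retrieve`, `reason`, `write`), the `Coordinator` class, and the fixed three-step plan are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical specialized agents: each reads the shared context and returns a string.
Agent = Callable[[Dict[str, str]], str]

def retrieve(ctx: Dict[str, str]) -> str:
    # Stand-in for information retrieval (e.g. a web or file lookup).
    return f"facts about {ctx['task']}"

def reason(ctx: Dict[str, str]) -> str:
    # Stand-in for a reasoning step over the retrieved material.
    return f"conclusion drawn from {ctx['retrieve']}"

def write(ctx: Dict[str, str]) -> str:
    # Stand-in for output generation.
    return f"Report: {ctx['reason']}"

class Coordinator:
    """Central coordinator that routes each plan step to a named agent."""

    def __init__(self, agents: Dict[str, Agent]) -> None:
        self.agents = agents

    def run(self, task: str, plan: List[str]) -> Dict[str, str]:
        ctx = {"task": task}
        for step in plan:
            # Each agent sees the results of all earlier steps.
            ctx[step] = self.agents[step](ctx)
        return ctx

coordinator = Coordinator({"retrieve": retrieve, "reason": reason, "write": write})
result = coordinator.run("quarterly sales", ["retrieve", "reason", "write"])
print(result["write"])  # prints: Report: conclusion drawn from facts about quarterly sales
```

In a real system the plan would be produced and revised by the coordinator itself rather than passed in as a fixed list.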
2. Agent-oriented Planning in Multi-Agent Systems
This framework introduces a new method for planning in multi-agent systems using a meta-agent architecture. The system aims to enhance coordination and decision-making among multiple AI agents.
- Meta-agent architecture for supervised planning
- Improved coordination among multiple agents
- Enhanced decision-making capabilities
We can envision a system where the meta-agent supervises and coordinates the planning activities of individual agents. This meta-agent has a global view of tasks and can optimize the overall planning strategy by considering the strengths and limitations of each agent in the system.
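To make the meta-agent's role concrete, here is a minimal sketch of one thing it might do: match sub-tasks to agents by their declared strengths. The agent names, the skill labels, and the `assign` function are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set

# Hypothetical capability profiles that the meta-agent knows about.
AGENTS: Dict[str, Set[str]] = {
    "search_agent": {"search"},
    "code_agent": {"code", "math"},
    "writer_agent": {"summarize"},
}

def assign(subtasks: List[Dict[str, str]], agents: Dict[str, Set[str]]) -> Dict[str, str]:
    """Give each sub-task to the first agent that has the required skill."""
    plan = {}
    for task in subtasks:
        candidates = [name for name, skills in agents.items() if task["skill"] in skills]
        if not candidates:
            # A limitation the meta-agent must surface: nobody can do this step.
            raise ValueError(f"no agent can handle: {task['name']}")
        plan[task["name"]] = candidates[0]
    return plan

subtasks = [
    {"name": "find 2024 revenue figures", "skill": "search"},
    {"name": "compute year-over-year growth", "skill": "math"},
    {"name": "write a two-line summary", "skill": "summarize"},
]
plan = assign(subtasks, AGENTS)
for task_name, agent_name in plan.items():
    print(f"{task_name} -> {agent_name}")
```

A fuller planner would also decompose the user request into these sub-tasks and re-plan when an agent's answer is unsatisfactory; this sketch covers only the assignment step.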
3. Amazon’s KGLA
Amazon’s KGLA (Knowledge Graph-enhanced Agent) framework aims to improve knowledge retrieval across various domains. The system utilizes knowledge graphs to enhance the functionality of AI agents.
- Integration of knowledge graphs with AI agents
- Improved knowledge retrieval capabilities
- Applicability across multiple domains
The KGLA architecture may consist of several key components:
- Knowledge Graph: Structured representation of domain knowledge
- Agent Interface: Allows agents to query and interact with the knowledge graph
- Retrieval Mechanism: Efficiently extracts relevant information from the graph
- Reasoning Module: Combines retrieved knowledge with agent functionality
This integration enables agents to access and utilize structured knowledge more effectively, potentially improving their performance in complex tasks requiring extensive domain knowledge.
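The four components listed above can be illustrated with a tiny in-memory example. This is a sketch under our own assumptions, not KGLA's implementation: the `KnowledgeGraph` class, the `retrieve` and `build_prompt` functions, and the sample triples are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

class KnowledgeGraph:
    """Tiny in-memory triple store indexed by subject."""

    def __init__(self, triples: List[Triple]) -> None:
        self.by_subject: Dict[str, List[Triple]] = defaultdict(list)
        for triple in triples:
            self.by_subject[triple[0]].append(triple)

    def neighbors(self, entity: str) -> List[Triple]:
        # Agent interface: return every triple whose subject is `entity`.
        return list(self.by_subject.get(entity, []))

def retrieve(graph: KnowledgeGraph, entity: str, hops: int = 2) -> List[Triple]:
    """Retrieval mechanism: collect triples reachable from `entity` within `hops` steps."""
    seen, frontier, found = {entity}, [entity], []
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for s, r, o in graph.neighbors(node):
                found.append((s, r, o))
                if o not in seen:
                    seen.add(o)
                    next_frontier.append(o)
        frontier = next_frontier
    return found

def build_prompt(question: str, facts: List[Triple]) -> str:
    """Reasoning-module stand-in: place the retrieved facts in front of the question."""
    lines = [f"{s} {r} {o}." for s, r, o in facts]
    return "Known facts:\n" + "\n".join(lines) + f"\nQuestion: {question}"

graph = KnowledgeGraph([
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "is_a", "anticoagulant"),
    ("ibuprofen", "treats", "fever"),
])
facts = retrieve(graph, "aspirin", hops=2)
print(build_prompt("Is aspirin safe with anticoagulants?", facts))
```

Only the three triples connected to "aspirin" reach the prompt; the unrelated "ibuprofen" fact is left out, which is the point of structured retrieval.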
4. Harvard University’s FINCON
FINCON is an LLM-based multi-agent framework developed by researchers at Harvard University, specifically designed for various financial tasks. It employs conceptual verbal reinforcement, a form of language-based feedback, to enhance agent performance.
- Specialized in financial domain tasks
- Conceptual verbal reinforcement
The FINCON architecture may include:
1. Multiple specialized agents: each focusing on different aspects of financial tasks
2. Conversational interface: allows agents to communicate and learn from each other
3. Language reinforcement learning module: provides feedback to improve agent performance
4. Task coordinator: manages the distribution and integration of sub-tasks
The framework’s focus on financial tasks and its use of language reinforcement make it particularly suitable for complex financial decision-making and analysis scenarios.
5. OmniParser for GUI-based AI Agents
OmniParser introduces a multi-agent approach specifically for UI navigation in GUI-based AI agents. The system aims to enhance the interaction capabilities of AI agents with graphical user interfaces.
- Specialized for GUI navigation
- Multi-agent system for parsing visual elements
- Enhanced interaction of AI agents with graphical interfaces
The OmniParser architecture may include:
- Visual Element Detector: identifies UI components in images
- Semantic Interpreter: understands the function and context of detected elements
- Navigation Planner: determines the optimal path for GUI interactions
- Action Executor: performs planned interactions with the GUI
This system focuses on visual parsing and GUI navigation, making it particularly valuable for tasks involving automated software testing, user interface analysis, and the development of more intuitive AI assistants.
6. IBM’s AutoRestTest
AutoRestTest is a framework developed by IBM for testing REST APIs using multi-agent and semantic graph approaches. The system aims to improve the efficiency and effectiveness of the API testing process.
- Specialized for REST API testing
- Utilizes a multi-agent approach
- Combines semantic graphs for improved understanding
The AutoRestTest architecture may include:
- API Parser: extracts API specifications and structures
- Semantic Graph Generator: creates graphical representations of API relationships
- Test Case Generator: designs test scenarios based on the semantic graph
- Multi-agent Executor: runs tests using multiple dedicated agents
- Results Analyzer: interprets test results and identifies issues
This framework employs semantic graphs and multi-agent execution, allowing for more comprehensive and intelligent API testing, potentially uncovering issues that traditional testing methods may overlook.
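One way to picture a graph of API relationships is as a dependency graph between operations: an endpoint that needs a `user_id` should be tested after the endpoint that creates one. The sketch below is our own simplification, not AutoRestTest's algorithm; the endpoints, the `consumes`/`produces` fields, and both functions are hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical API description: the fields each operation needs and returns.
OPERATIONS: Dict[str, Dict[str, Set[str]]] = {
    "POST /users": {"consumes": set(), "produces": {"user_id"}},
    "POST /orders": {"consumes": {"user_id"}, "produces": {"order_id"}},
    "GET /orders/{id}": {"consumes": {"order_id"}, "produces": set()},
    "DELETE /users/{id}": {"consumes": {"user_id"}, "produces": set()},
}

def build_dependency_graph(ops: Dict[str, Dict[str, Set[str]]]) -> Dict[str, Set[str]]:
    """Map each operation to the operations that produce a field it consumes."""
    graph = {}
    for name, spec in ops.items():
        graph[name] = {
            other
            for other, other_spec in ops.items()
            if other != name and spec["consumes"] & other_spec["produces"]
        }
    return graph

def plan_test_order(ops: Dict[str, Dict[str, Set[str]]]) -> List[str]:
    """Order operations so that every producer runs before its consumers."""
    return list(TopologicalSorter(build_dependency_graph(ops)).static_order())

order = plan_test_order(OPERATIONS)
print(order)
```

The exact order of independent operations may vary, but `POST /users` always comes before anything that needs a `user_id`. A test generator could walk this order and feed each response's fields into the next request.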
7. Microsoft’s AIOpsLab
AIOpsLab is a comprehensive framework developed by Microsoft for designing, developing, and evaluating autonomous AIOps (AI for IT Operations) agents. The system aims to advance the field of AI-driven IT operations.
- Holistic approach to AIOps agent development
- Supports design, implementation, and evaluation phases
- Focus on autonomous operations in IT environments
The AIOpsLab architecture may include:
- Agent Development Environment: tools for creating and training AIOps agents
- Simulated IT Infrastructure: replicates real IT environments for testing
- Scenario Generator: creates diverse operational challenges
- Performance Evaluation Module: assesses agent effectiveness
- Feedback Loop: allows for iterative improvements to agents
This framework provides a standardized platform for advancing the AIOps field, potentially achieving more efficient and reliable IT operations management.
8. Alibaba’s GraphReader
GraphReader is a framework proposed by Alibaba that uses graph-based agents to enhance the long-context capabilities of LLMs. The goal is better performance on tasks that require understanding a large amount of context.
- Graph-based long text representation
- Agent-driven graph exploration
- Enhanced long-context understanding for LLMs
The GraphReader architecture includes:
- Text-to-Graph Converter: transforms long texts into graphical structures
- Graph Explorer Agent: navigates the graph to extract relevant information
- LLM Interface: integrates graph-based knowledge with LLM functionality
- Response Generator: generates responses based on graph exploration and LLM processing
This innovative approach enables LLMs to effectively handle longer contexts by leveraging the structure of graphs and targeted exploration, potentially overcoming the limitations of traditional sequential processing.
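The text-to-graph and exploration steps can be pictured with a toy example. This is a deliberately crude sketch, not the paper's method: picking capitalized words as key terms stands in for whatever extraction step the real system uses, and the functions `key_terms`, `build_graph`, and `explore` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def key_terms(text: str) -> Set[str]:
    # Crude stand-in for key-element extraction: capitalized words only.
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))

def build_graph(chunks: List[str]) -> Tuple[Dict[str, Set[int]], Dict[str, Set[str]]]:
    """Nodes are key terms; two terms are linked when they appear in the same chunk."""
    node_chunks: Dict[str, Set[int]] = defaultdict(set)
    edges: Dict[str, Set[str]] = defaultdict(set)
    for i, chunk in enumerate(chunks):
        terms = key_terms(chunk)
        for term in terms:
            node_chunks[term].add(i)
            edges[term] |= terms - {term}
    return node_chunks, edges

def explore(question: str, chunks: List[str], max_hops: int = 1) -> List[str]:
    """Start at nodes named in the question, walk to neighbors, collect their chunks."""
    node_chunks, edges = build_graph(chunks)
    frontier = key_terms(question) & set(node_chunks)
    visited = set(frontier)
    for _ in range(max_hops):
        frontier = {n for node in frontier for n in edges[node]} - visited
        visited |= frontier
    chunk_ids = sorted({i for node in visited for i in node_chunks[node]})
    return [chunks[i] for i in chunk_ids]

chunks = [
    "Alice founded Acme in Berlin.",
    "Acme later acquired Globex.",
    "Globex builds reactors in Oslo.",
    "Bob enjoys sailing near Lisbon.",
]
context = explore("Where did Alice start her company?", chunks, max_hops=1)
print(context)  # prints: ['Alice founded Acme in Berlin.', 'Acme later acquired Globex.']
```

Only the chunks reachable from "Alice" are handed to the model, so the prompt stays short no matter how long the source text is. Raising `max_hops` widens the search at the cost of a longer context.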
9. DynaSaur by Adobe and the University of Maryland
DynaSaur is a framework for LLM agents that can dynamically create and compose actions online. The system aims to enhance the flexibility and adaptability of AI agents in various tasks.
- Dynamic action creation and combination
- Online learning and adaptation
- Higher flexibility for LLM agents
While specific architecture details are not provided, we can infer that DynaSaur may include:
- Action Generator: creates new operations based on task requirements
- Combination Module: combines actions to form complex behaviors
- Online Learning Component: adjusts agent behavior in real-time
- Task Analyzer: determines appropriate actions for given situations
This framework allows for the dynamic generation and combination of actions, enabling the use of more general and adaptive AI agents, potentially improving performance across various tasks and environments.
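Dynamic action creation is easier to grasp with a small example. In the sketch below an agent adds new actions at runtime as Python source code, and a later action reuses an earlier one. Everything here is an assumption made for illustration: the `ActionLibrary` class is hypothetical, and the source strings are hard-coded where a real agent would have an LLM write them.

```python
from typing import Any, Callable, Dict

class ActionLibrary:
    """Registry of actions that can grow while the agent is running."""

    def __init__(self) -> None:
        self.actions: Dict[str, Callable[..., Any]] = {}

    def create(self, name: str, source: str) -> None:
        # Run the generated source in a namespace that already holds the
        # existing actions, so a new action can call earlier ones by name.
        # NOTE: exec on untrusted model output is unsafe; a real system
        # would need sandboxing and validation.
        namespace: Dict[str, Any] = dict(self.actions)
        exec(source, namespace)
        self.actions[name] = namespace[name]

    def call(self, name: str, *args: Any) -> Any:
        return self.actions[name](*args)

library = ActionLibrary()

# Hard-coded stand-ins for code an LLM would generate on the fly.
library.create("word_count", "def word_count(text):\n    return len(text.split())\n")
library.create("is_long", "def is_long(text):\n    return word_count(text) > 5\n")

print(library.call("word_count", "agents can write their own tools"))  # prints: 6
print(library.call("is_long", "agents can write their own tools"))     # prints: True
```

The second action, `is_long`, is a composition: it is built on top of `word_count`, which did not exist when the program started.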
10. ShowUI: A Vision-Language-Action Model for GUI Visual Agents (Microsoft)
ShowUI is a vision-language-action model developed by Microsoft researchers to improve UI element recognition in GUI-based AI agents. It aims to help AI agents interact more reliably with graphical user interfaces.
- Specialized for GUI element recognition
- Integration of visual, language, and action modeling
- Improved performance of GUI-based AI agents
The ShowUI architecture may include:
- Visual Encoder: processes GUI images to recognize elements
- Language Model: interprets textual information in the GUI
- Action Predictor: determines appropriate interactions with GUI elements
- Multi-modal Fusion Module: integrates visual, textual, and action information
This model focuses on GUI element recognition and interaction, potentially significantly improving the performance of AI agents in tasks involving software testing, user interface analysis, and automated GUI navigation.
11. Automated Design of Agentic Systems (University of British Columbia)
This framework focuses on the automated invention of novel components (building blocks) and their combination to create innovative agents. It aims to advance the field of AI agent design by introducing automation and creativity into the process.
- Automated invention of agent components
- Innovative combination of components
- Creation of innovative AI agents
Although specific architecture details are not provided, we can envision a system that includes:
1. Component Generator: creates new agent building blocks
2. Compatibility Analyzer: determines how components can be combined
3. Agent Assembly: builds agents from compatible components
4. Performance Evaluator: assesses the effectiveness of created agents
5. Evolutionary Optimizer: iteratively improves agent design
This automated approach to agent design may lead to the discovery of novel and efficient AI architectures, potentially pushing the field beyond human-designed systems.
Experimentation and Analysis Papers
(Experimentation & Analysis)
1. Can Graph Learning Improve Planning for LLM-based Agents?
Microsoft’s research demonstrates how graph learning can enhance the planning capabilities of LLM-based agents, especially when using GPT-4 as the core model. This groundbreaking study provides empirical evidence for integrating graph structures into agent planning systems.
- High-level graph learning integration with LLMs
- GPT-4 core model optimization
- Enhanced planning capabilities for AI agents
The architecture includes:
- Graph Learning Module: processes and analyzes graph structures
- Planning Optimizer: enhances decision-making
- GPT-4 Integration Layer: connects graph learning with language models
- Performance Analysis System: measures and validates improvements
This research significantly advances the field of AI planning systems by showcasing the practical advantages of graph learning integration.
2. Thousand-Individual Generative Agent Simulation – Stanford University and Google DeepMind
The collaboration between Stanford University and Google DeepMind simulated 1,000 real individuals, building each agent from a two-hour audio interview with the person it represents.
- Large-scale behavior simulation
- Efficient audio data processing
- Advanced generative modeling
The architecture includes:
- Audio Processing Engine: analyzes and extracts behavior patterns
- Simulation Generator: creates individual behavior models
- Scaling Module: manages large-scale simulations
- Validation System: ensures the accuracy of simulated behaviors
This breakthrough opens new possibilities for large-scale behavioral modeling and simulation.
3. ByteDance’s Bug Fix Analysis
ByteDance conducted comprehensive testing to identify the most effective LLMs for automated bug fixing, providing valuable insights for implementing agent-based code repair systems.
- Automated bug detection and analysis
- LLM performance comparison framework
- Real-time code fixing capabilities
The architecture includes:
- Defect Detection Engine: identifies and categorizes code issues
- LLM Integration Layer: connects multiple language models
- Code Analysis Module: evaluates repair suggestions
- Performance Monitoring System: tracks repair success rates
This research significantly advances the processes of automated code maintenance and quality assurance.
4. Google DeepMind’s Improved Multi-Agent Debate System with Sparse Communication Topology
Research on a multi-agent debate system with sparse communication topology shows improved performance despite limited information sharing.
- Sparse communication optimization
- Enhanced agent debate protocols
- Efficient information sharing mechanisms
The architecture includes:
- Communication Topology Manager: optimizes information flow
- Debate Protocol Engine: manages agent interactions
- Information Sharing Module: controls data exchange
- Performance Analysis System: measures communication efficiency
This breakthrough enhances our understanding of efficient multi-agent communication systems.
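A toy simulation shows what a sparse communication topology saves. The sketch below is our own illustration, not DeepMind's experimental setup: the majority-vote update rule, the `ring` topology, and the starting answers are assumptions. It demonstrates only the reduction in messages and does not reproduce the paper's accuracy results.

```python
from collections import Counter
from typing import Dict, List, Tuple

def fully_connected(n: int) -> Dict[int, List[int]]:
    # Every agent reads every other agent's answer.
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def ring(n: int) -> Dict[int, List[int]]:
    # Sparse topology: each agent only reads its two immediate neighbors.
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def debate(
    answers: List[str], topology: Dict[int, List[int]], rounds: int
) -> Tuple[List[str], int]:
    """Each round, every agent adopts the majority answer among itself and its neighbors."""
    messages = 0
    for _ in range(rounds):
        updated = []
        for agent, neighbors in topology.items():
            visible = [answers[agent]] + [answers[j] for j in neighbors]
            messages += len(neighbors)  # one message read per neighbor
            updated.append(Counter(visible).most_common(1)[0][0])
        answers = updated
    return answers, messages

initial = ["A", "A", "B", "A", "B", "A"]
for name, topology in [("full", fully_connected(6)), ("ring", ring(6))]:
    final, cost = debate(initial, topology, rounds=2)
    print(name, final, cost)
```

In this run both topologies converge on answer "A" after two rounds, but the ring needs 24 messages where the fully connected debate needs 60.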
5. Improving AI Agents through Symbolic Learning
A comprehensive study of the progress and challenges of LLM-based multi-agent systems, focusing on problem-solving and world simulation applications.
- Extensive analysis of current LLM-MA systems
- Problem-solving capability assessment
- World simulation application review
The architecture includes:
- Analysis Framework: evaluates system capabilities
- Comparison Engine: assesses different approaches
- Challenge Identification System: maps current limitations
- Future Direction Mapper: outlines project development paths
This survey provides important insights for the future development of LLM-based multi-agent systems.
Survey Papers
1. Survey on LLM-based Multi-Agent Systems
A comprehensive study of the progress and challenges of LLM-MA systems, focusing on applications in problem-solving and world simulation scenarios.
- Comprehensive analysis of LLM-MA systems
- Progress tracking and challenge identification
- Application-centered evaluation framework
The architecture includes:
- System Analysis Framework: evaluates current LLM-MA implementations
- Challenge Mapping Module: identifies key obstacles and limitations
- Application Evaluation Engine: reviews practical applications
- Future Direction Predictor: outlines project development trajectories
This survey provides important insights for advancing the development of LLM-based multi-agent systems.
2. Survey of LLM-brained GUI Agents
A broad analysis of the evolution and complexity of GUI-based agents across various domains, highlighting key developments and challenges.
- Historical evolution analysis
- Cross-domain complexity assessment
- Implementation pattern evaluation
The architecture includes:
1. Evolution Tracker: maps development progress
2. Complexity Analysis Engine: assesses implementation challenges
3. Domain Comparison Module: evaluates cross-domain applications
4. Pattern Recognition System: identifies successful implementations
This survey greatly aids in understanding the development patterns and future directions of GUI agents.
3. Dawn of GUI Agents: Case Study of Sonnet 3.5
A comprehensive analysis of Anthropic’s computer-use capability in Claude 3.5 Sonnet across multiple domains, providing practical insights into real-world applications.
- Multi-domain usability testing
- Performance metric analysis
- Real application evaluation
The architecture includes:
1. Usability Testing Framework: evaluates interface interactions
2. Performance Measurement System: tracks success metrics
3. Domain Adaptation Module: assesses cross-domain capabilities
4. Implementation Guidelines: provides practical deployment insights
This case study provides valuable insights for practical GUI agent implementation.
4. CSIRO’s AgentOps Taxonomy
A systematic classification of AI Agent operations, providing standardized terminology and operational frameworks.
- Standardized terminology framework
- Operational classification system
- Implementation guidelines
The architecture includes:
1. Terminology Database: maintains standardized definitions
2. Classification Engine: organizes business categories
3. Relationship Mapper: links related concepts
4. Implementation Framework: guides practical applications
This taxonomy establishes critical standards for the development and deployment of AI agents.
5. OpenAI Governance Practices for AI Systems
A comprehensive framework outlining seven key principles for implementing safe and responsible AI agent systems in business environments.
- Business implementation guidelines
The architecture includes:
1. Safety Assessment Module: evaluates potential risks
2. Accountability Framework: ensures responsible deployment
3. Implementation Guidelines: provides practical steps
4. Monitoring System: tracks compliance and performance
These guidelines establish fundamental standards for the responsible deployment of AI agents.
(Benchmarks for AI Agents)
1. PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-Agent Tasks
A comprehensive evaluation framework for assessing planning and reasoning capabilities in multi-agent systems, focusing on human-agent coordination.
- Human-agent coordination metrics
- Planning capability assessment
- Household activity simulations
The architecture includes:
1. Task Generation Engine: creates testing scenarios
2. Coordination Assessment Module: evaluates interaction quality
3. Performance Metrics System: measures success rates
4. Analysis Framework: provides detailed insights
This benchmark provides valuable metrics for improving human-agent coordination systems.
2. Salesforce’s CRMArena
A benchmark for evaluating AI agents on complex customer support scenarios inside a realistic sandbox environment.
- Sandbox testing environment
The architecture includes:
1. Scenario Generator: creates real customer cases
2. Response Evaluation System: assesses agent performance
3. Metrics Analysis Engine: measures effectiveness
4. Integration Testing Module: verifies CRM compatibility
This benchmark advances the evaluation of AI agents in customer service applications.
The above AI agent papers represent significant advancements in addressing challenges related to long-context processing, multi-agent systems, and automated AI design. By leveraging techniques such as graph-based representations, dynamic action combinations, and automated component generation, these methods are pushing the boundaries of AI agents and large language models.
2024 is just the beginning of the future landscape of AI agents. As agents become more complex and versatile, they will require less human intervention, and we believe that what will happen in 2025 will be very exciting.
(Due to space limitations, the article has been abridged. Please refer to the end of the article for access to the full text link.)
The future is an era where humans dance with agents.
The future is an era of achieving a “human-centered” production and lifestyle that strives for excellence.
The future is on its way.
Alright! That’s all for this issue! See you next time!
Heroic Journey Column: Professional perspective, popular language, interpreting the technological dynamics in the Jianghu.