MetaGPT: A Multi-Agent Collaborative Metaprogramming Framework

Abstract: MetaGPT is an innovative metaprogramming framework that encodes standard operating procedures into prompt sequences, enabling agents with human domain expertise to validate intermediate results and reduce errors for a smoother workflow. MetaGPT utilizes an assembly line paradigm to assign different roles to various agents, efficiently breaking down complex tasks into subtasks that involve collaboration among many agents. In collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems.

Original link: 2308.00352.pdf (arxiv.org)

1. Introduction

Autonomous agents utilizing large language models (LLMs) can enhance and replicate human workflows. However, in practical applications, existing systems often oversimplify complexity. This is particularly challenging when meaningful collaborative interactions are required, making it difficult to achieve effective, coherent, and accurate problem-solving processes.

This paper discusses the importance of standard operating procedures (SOPs) in collaborative practices. SOPs support task decomposition and effective coordination, clarify team members’ responsibilities, and establish standards for intermediate outputs, thereby improving task consistency and accuracy. The authors designed MetaGPT, a GPT-based framework that benefits from SOPs by requiring agents to generate structured outputs, such as high-quality requirements documents.

MetaGPT is an SOP-based metaprogramming approach that raises the success rate of target code generation by producing structured intermediate outputs. It employs strict workflows and standardized handoffs to reduce unproductive chatter between LLM agents. Metaprogramming here means “programming to program,” as distinct from meta-learning, which means “learning to learn.”

MetaGPT is a unique solution that achieves efficient metaprogramming through a set of specialized agents, each with specific roles and expertise, adhering to established standards. This allows for automatic requirement analysis, system design, code generation, modification, execution, and debugging at runtime, highlighting how agent-based techniques enhance metaprogramming.

This paper introduces a metaprogramming framework, MetaGPT, based on LLMs for multi-agent collaboration. MetaGPT features high flexibility and convenience, including role definition and message sharing, which can be utilized to develop LLM-based multi-agent systems. The framework improves code generation quality and robustness by integrating human-like SOPs and a novel execution feedback mechanism. Experimental results show that MetaGPT achieves state-of-the-art performance on benchmarks such as HumanEval and MBPP, making it a promising metaprogramming framework.

2. Related Work

Automated programming: The history of automated programming dates back to the last century, and the field has since evolved into an industry offering paid features such as Microsoft Copilot.

– Recent automated programming approaches draw on natural language processing techniques, such as LLM-based agents and Toolformer, as well as research using role-playing frameworks.

– LLM-based autonomous agents have attracted significant attention in both industry and academia.

– While existing research has improved productivity, it has not fully leveraged effective workflows with structured output formats, which makes complex software engineering problems harder to handle.

LLM-based multi-agent frameworks: Multiple research works enhance LLMs’ problem-solving capabilities through discussions among multi-agents, with some focusing on sociological phenomena like Generative Agents and NLSOM, while others emphasize cooperation and competition in planning and strategy. In practice, multi-agent collaboration faces challenges such as maintaining consistency and avoiding ineffective cycles, necessitating the application of advanced concepts like standard operating procedures in software development.

3. MetaGPT: A Metaprogramming Framework

MetaGPT is a metaprogramming framework for LLM-based multi-agent systems. Sec. 3.1 explains the application of role specialization, workflows, and structured communication within the framework, demonstrating how to organize multi-agent systems in the context of SOPs. Sec. 3.2 introduces a communication protocol that enhances role communication efficiency. The authors also implemented structured communication interfaces and an effective publish-subscribe mechanism. These methods enable agents to obtain directional information from other roles and common information from the environment. Finally, Sec. 3.3 introduces executable feedback, a self-correcting mechanism that further improves code generation quality at runtime.

3.1 Agents in SOPs

The paper breaks complex work down into smaller, more specific tasks, mirroring collaboration in a software company. It defines five roles (product manager, architect, project manager, engineer, and quality assurance engineer) and details the responsibilities and skills of each. Every role follows ReAct-style behavior and monitors the environment so that it detects important information promptly. The authors also introduce the SOPs used in software development to ensure that all roles collaborate in sequence.

The SOP workflow of MetaGPT runs as follows. The product manager formulates a product requirement document (PRD) based on user needs, including user stories and a requirement pool. The architect translates the requirements into system design components, which are passed to the project manager for task allocation. Engineers implement the classes and functions specified in the design. QA engineers formulate test cases to ensure code quality. Ultimately, MetaGPT generates a complete software solution. The paper provides detailed workflow diagrams and examples.
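The hand-offs in this workflow can be sketched as a fixed sequence of role functions, each reading the artifacts produced so far and adding one structured artifact of its own. This is a minimal illustration and not MetaGPT's actual API: every function name and dictionary key below is hypothetical, and the role bodies are stubs where a real system would call an LLM.

```python
from typing import Any, Callable

Artifacts = dict[str, Any]  # shared, structured outputs keyed by document name


def product_manager(a: Artifacts) -> Artifacts:
    # Turns the raw user requirement into a PRD (stubbed, no LLM call).
    return {"prd": {"user_stories": [f"As a user, I want: {a['requirement']}"],
                    "requirement_pool": ["basic drawing"]}}


def architect(a: Artifacts) -> Artifacts:
    # Translates the PRD into system design components.
    return {"design": {"modules": ["gui", "drawing", "files"]}}


def project_manager(a: Artifacts) -> Artifacts:
    # Breaks the design into one task per module.
    return {"tasks": [f"implement {m}" for m in a["design"]["modules"]]}


def engineer(a: Artifacts) -> Artifacts:
    # Produces a code placeholder for every assigned task.
    return {"code": {task: "# TODO" for task in a["tasks"]}}


def qa_engineer(a: Artifacts) -> Artifacts:
    # Writes one test-case description per implemented task.
    return {"tests": [f"test: {task}" for task in a["code"]]}


# The SOP is simply the fixed order in which the roles act.
SOP: list[Callable[[Artifacts], Artifacts]] = [
    product_manager, architect, project_manager, engineer, qa_engineer,
]


def run_sop(requirement: str) -> Artifacts:
    artifacts: Artifacts = {"requirement": requirement}
    for role in SOP:
        artifacts.update(role(artifacts))  # each role adds its structured output
    return artifacts
```

Calling `run_sop("draw images")` returns a dictionary whose keys appear in the order the roles acted, which makes the sequential dependency between roles explicit.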

3.2 Communication Protocol

– Most current LLM-based multi-agent frameworks use natural language as a communication interface, but there are doubts about whether pure natural language communication is sufficient for solving complex tasks.

– Inspired by human social structures, the authors propose using structured communication to standardize communication among agents.

– The framework establishes patterns and formats for each role and requires agents to provide the necessary outputs based on their specific roles and contexts.

– Agents in MetaGPT communicate through documents and diagrams rather than conversations to avoid irrelevant or missing content.

– Because information sharing is crucial in collaboration, the authors introduce a publish-subscribe mechanism to simplify communication topologies and improve efficiency.

By storing information in a global message pool, communication efficiency challenges can be addressed. Agents can publish and access structured messages directly in the message pool without querying other agents and waiting for their responses. This improves communication efficiency.

To avoid information overload, agents typically only want to receive information related to the task during execution, avoiding distractions from irrelevant details. The subscription mechanism is a simple and effective solution, allowing agents to select information to focus on based on their role configuration. In practical implementation, agents only activate their operations after receiving all prerequisites. Architects primarily focus on the PRD provided by the product manager, while documentation from roles like QA engineers may be less critical.
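The message pool and subscription mechanism described above can be sketched in a few lines. This is an illustrative model under stated assumptions, not the framework's real interface: the `Message`, `MessagePool`, and `Agent` classes, their fields, and the document kinds such as `"PRD"` are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    sender: str   # role that published the message
    kind: str     # document type, e.g. "PRD" or "TestReport"
    content: str


@dataclass
class MessagePool:
    """Global pool: agents publish here instead of messaging each other."""
    messages: list[Message] = field(default_factory=list)

    def publish(self, message: Message) -> None:
        self.messages.append(message)

    def fetch(self, subscriptions: set[str]) -> list[Message]:
        # Return only the messages whose kind the caller subscribed to.
        return [m for m in self.messages if m.kind in subscriptions]


@dataclass
class Agent:
    name: str
    subscriptions: set[str]  # kinds this role chooses to receive
    prerequisites: set[str]  # kinds required before acting (subset of subscriptions)

    def ready(self, pool: MessagePool) -> bool:
        # The agent activates only once every prerequisite has been published.
        seen = {m.kind for m in pool.fetch(self.subscriptions)}
        return self.prerequisites <= seen


pool = MessagePool()
architect = Agent("Architect", subscriptions={"PRD"}, prerequisites={"PRD"})

pool.publish(Message("QA Engineer", "TestReport", "all tests passed"))
ready_before = architect.ready(pool)  # the architect ignores test reports

pool.publish(Message("Product Manager", "PRD", "drawing app requirements"))
ready_after = architect.ready(pool)   # the PRD it was waiting for has arrived
```

Filtering on the subscriber side keeps the pool itself simple: publishers never need to know who is listening, which is what reduces the communication topology to a single shared channel.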

3.3 Iterative Development

The paper discusses the importance of debugging and optimization in daily programming tasks and the shortcomings of existing methods in this respect. To address errors and defects in generated code, it proposes an executable feedback mechanism that improves code through iterative testing and debugging. Specifically, engineers write code based on the original product requirements and design, then keep improving it using a memory of past executions and debugging results. If the test results are satisfactory, additional development tasks are initiated. Otherwise, engineers debug the code and continue programming. This iterative testing process continues until the tests pass or the maximum retry count is reached.
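The loop described above (write, run tests, feed errors back, stop on success or after a retry limit) can be sketched as follows. This is a simplified illustration rather than the paper's implementation: the function names are hypothetical, the "engineer" is a stub that returns a fixed correction instead of calling an LLM, and the default retry limit of 3 is an arbitrary assumption.

```python
from typing import Callable, Optional


def iterate_until_pass(
    write_code: Callable[[list[str]], str],
    run_tests: Callable[[str], Optional[str]],
    max_retries: int = 3,
) -> tuple[str, bool, list[str]]:
    """Return (final code, whether tests passed, debugging memory)."""
    memory: list[str] = []  # error messages from earlier attempts
    code = ""
    for _ in range(max_retries + 1):  # first attempt plus the allowed retries
        code = write_code(memory)
        error = run_tests(code)
        if error is None:
            return code, True, memory
        memory.append(error)          # feed the failure back into the next attempt
    return code, False, memory


def stub_engineer(memory: list[str]) -> str:
    # Stand-in for an LLM call: buggy on the first try, fixed once it sees feedback.
    if not memory:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def run_add_tests(code: str) -> Optional[str]:
    # Executes the candidate code and reports the first failure as text.
    namespace: dict = {}
    try:
        exec(code, namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None
```

Running `iterate_until_pass(stub_engineer, run_add_tests)` succeeds on the second attempt, with the first failure recorded in the returned memory. In a real system the executed code would need sandboxing, which this sketch omits.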

4. Experiments

4.1 Experimental Setup

The paper evaluates on three programming task datasets: HumanEval, MBPP, and SoftwareDev. HumanEval contains 164 handwritten programming tasks and MBPP contains 427 Python tasks, while SoftwareDev is a more challenging dataset of automatically generated software development tasks, containing 70 representative tasks. The authors used the pass@k metric to evaluate the functional accuracy of the generated code. For the comparative evaluation, they randomly selected seven representative tasks.
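For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator commonly used with HumanEval: given n generated samples for a task, of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. This summary does not state which estimator the authors applied, so treating it as this one is an assumption; the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, where each task is an (n, c) pair."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to the fraction of correct samples, so `pass_at_k(10, 3, 1)` is 0.3.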

This paper describes the practical evaluation metrics used in the software development domain, including executability, cost, code statistics, productivity, and human revision costs. The authors compared MetaGPT with other domain-specific and general LLMs, providing a comprehensive comparison with AutoGPT, LangChain, AgentVerse, and ChatDev. The authors also modified the role prompts in MetaGPT to generate code suitable for the target problem.

4.2 Key Results

MetaGPT outperformed all previous methods on the HumanEval and MBPP benchmarks. When paired with GPT-4, MetaGPT significantly improved pass@k on HumanEval compared to GPT-4 alone. On these two public benchmarks, it achieved 85.9% and 87.7%, respectively. Furthermore, as shown in Table 1 of the paper, MetaGPT outperformed ChatDev on almost all metrics on the challenging SoftwareDev dataset. For instance, on executability, MetaGPT scored 3.75, very close to 4 (perfect). It also required only 503 seconds, significantly less time than ChatDev. On code statistics and human revision costs, it also significantly outperformed ChatDev. Although MetaGPT consumed more tokens (24,613 or 31,255, while ChatDev required only 19,292), it needed only 126.5/124.3 tokens to generate a line of code. In contrast, ChatDev used 248.9 tokens per line. These results highlight the benefits of SOPs in collaboration among multiple agents. Moreover, the authors showcase MetaGPT’s autonomous software generation capabilities through visual samples (Figure 5 of the paper). For more experiments and analyses, see Appendix C of the paper.
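The tokens-per-line figures can be read as a simple ratio of total tokens to lines of code. The line counts themselves are not quoted in this summary, so the sketch below inverts the ratio to estimate the code size each figure implies. Pairing 24,613 with 126.5 and 31,255 with 124.3 follows the order in which the numbers are listed and is an assumption, as are the dictionary labels.

```python
def implied_lines(total_tokens: float, tokens_per_line: float) -> float:
    # productivity = total_tokens / lines_of_code, solved for lines_of_code
    return total_tokens / tokens_per_line


# (total tokens, tokens per line of code) as quoted in the text above
reported = {
    "MetaGPT (first figure)": (24_613, 126.5),
    "MetaGPT (second figure)": (31_255, 124.3),
    "ChatDev": (19_292, 248.9),
}

for name, (tokens, ratio) in reported.items():
    print(f"{name}: about {implied_lines(tokens, ratio):.0f} lines of code")
```

The implied sizes are roughly 195, 251, and 78 lines, which shows why the higher total token count still corresponds to a lower cost per line: MetaGPT produces considerably more code per run.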

4.3 Capability Analysis

MetaGPT is a framework for software engineering tasks, demonstrating broader capabilities to efficiently handle complex and specialized development tasks compared to other open-source baseline methods and autonomous agents. The introduction of SOPs (such as role-playing expertise, structured communication, and process optimization) can significantly enhance the efficiency of code generation. Other baseline methods can also easily integrate similar SOP designs to improve performance.

4.4 Ablation Study

Through experiments on two tasks, it was found that the addition of different roles significantly improved performance in generating effective code and calculating average statistics, despite slightly increasing overhead. Excluding certain roles led to non-functional code generation, indicating the effectiveness of various roles. Adding roles different from engineers can continuously improve revisions and executability.

5. Conclusion

This paper introduces MetaGPT, an SOP-based metaprogramming framework that enhances the problem-solving capabilities of LLM-based multi-agent systems. MetaGPT simulates a software company with a set of agents, utilizing role specialization, workflow management, and efficient sharing mechanisms, making it a flexible and portable platform for autonomous agents and multi-agent frameworks. It employs an executable feedback mechanism to improve code generation quality at runtime. MetaGPT achieves state-of-the-art performance across multiple benchmarks. The successful integration of human SOPs should inspire future research on human-inspired techniques for artificial multi-agent systems. The authors view this as an early attempt to standardize LLM-based multi-agent frameworks.

6. Innovations of the Paper

1. Introduction of a Metaprogramming Framework: MetaGPT serves as a metaprogramming framework for multi-agent collaboration, based on large language models (LLMs). This framework offers high convenience and flexibility, with clearly defined functions such as role definition and message sharing, making it a useful platform for developing LLM-based multi-agent systems.

2. Integration of Human-like Standard Operating Procedures (SOPs): The design of MetaGPT innovatively integrates human-like standard operating procedures (SOPs), significantly enhancing system robustness and reducing ineffective collaboration among LLM-based agents.

3. Novel Execution Feedback Mechanism: A novel execution feedback mechanism is introduced for debugging and executing code at runtime, significantly improving code generation quality. For example, on the MBPP (Mostly Basic Python Programming) benchmark, it yields an absolute improvement of 5.4%.

4. Achieving State-of-the-Art Performance: MetaGPT has achieved state-of-the-art performance in multiple benchmarks such as HumanEval and MBPP.

These innovations indicate that MetaGPT possesses significant technical advantages in building LLM-based multi-agent systems, particularly in enhancing collaboration efficiency, code quality, and system robustness.

7. Main Limitations of the Paper

1. Limitations of Independent Project Execution: In the current version of MetaGPT, each software project is executed independently. However, in real software development teams, active teamwork should enable the team to learn from the development experiences of each project, becoming more compatible and successful over time.

2. Preliminary Implementation of Self-Improvement: MetaGPT explores a self-referential mechanism that recursively modifies constraint prompts based on information observed by agents during software development. Although this mechanism allows agents to continuously learn from past project experiences and enhance the entire multi-agent system by improving each individual in the company, this summarization-based optimization currently only alters the constraints of role specialization (Section 3.1) and does not change the structured communication interface in the communication protocol (Section 3.2).

3. Unexplored Future Developments: The paper mentions that it has yet to explore how to implement this summarization-based optimization on the structured communication interface in the communication protocol, which is a direction for future development.

Overall, MetaGPT has made significant progress in the design and implementation of multi-agent systems, but there is still room for further improvement and development, particularly in inter-project learning and the structuring of communication interfaces.

8. Case Studies

8.1 Case Role Assignment

Take the task “Write a Python 3 graphical user interface (GUI) application that allows users to draw images.” Under the MetaGPT framework, it can be worked through step by step according to the responsibilities of each role:

1. Product Manager:

– Determine requirements: Develop a Python 3 GUI application that allows users to draw images.

– Define user stories: Users should be able to open the application, draw on the canvas using a mouse or touchpad, and select different brush colors and thicknesses.

– Formulate a requirement pool: Including basic drawing functions, color and brush selection, saving and loading images, etc.

2. Architect:

– Design system architecture: Choose Python 3 as the programming language and decide to use Tkinter as the GUI framework due to its simplicity and ease of use.

– Define data structures: Define the data structures to be used for handling image drawing and storage.

– Interface definition: Design the user interface of the application, including the canvas area, toolbar, etc.

3. Project Manager:

– Task allocation: Divide the application development into several subtasks, such as GUI design, drawing logic implementation, functional testing, etc.

– Monitor progress: Ensure that tasks are proceeding as planned and adjust resource allocation to fit the actual situation.

4. Engineer:

– Write code: Implement the GUI interface, including windows, buttons, canvas, etc.

– Implement drawing functionality: Write logic for handling user input (such as mouse movements and clicks) and drawing images on the canvas.

– Integrate functions: Ensure that color and brush selection functions work correctly and implement saving and loading image features.

5. Quality Assurance Engineer:

– Test the application: Conduct functional and non-functional testing to ensure all required features work as expected.

– Provide feedback: Based on test results, offer improvement suggestions, such as interface enhancements, performance optimizations, etc.

This process demonstrates how to systematically develop a fully functional Python 3 GUI drawing application according to the MetaGPT framework. The specialized abilities and specific responsibilities of each role collectively drive the project’s efficient progress and successful completion.

8.2 Product Manager

For the development task of this Python 3 GUI drawing application, the specific outputs of the product manager include a detailed product requirement document (PRD), user stories, and a requirement pool. Here is an example:

### Product Requirement Document (PRD):

**Project Overview**:

Develop a Python 3 GUI application that allows users to draw images. The application should be intuitive, easy to use, and provide basic drawing tools.

**Target Users**:

– Individuals interested in drawing

– Students and professionals needing simple drawing tools

**Main Features**:

1. Drawing: Users can freely draw on the canvas using a mouse or touchpad.

2. Tool Selection: Provide different brush options, including at least three different colors and at least two brush thicknesses.

3. Image Saving and Loading: Users can save their work and be able to load and edit previously saved images.

**Additional Features (considered based on time and resources)**:

– Undo and redo functionality

– Custom color picker

– Layer support

### User Stories:

1. As a user, I want to draw on the canvas to express my creativity and ideas.

2. As a user, I want to be able to choose different colors and brush thicknesses to create more colorful images.

3. As a user, I need to save my work to view or continue editing in the future.

4. As a user, I want to load my previous drawing work for further editing and improvement.

### Requirement Pool:

– GUI Design: Including drawing canvas, toolbar (color and brush options), and menu bar (file saving and loading options).

– Drawing Logic: Handle user drawing operations on the canvas, including mouse movement and click responses.

– File Management: Implement image saving and loading functions.

– User Interface Friendliness: Ensure the application interface is intuitive and user-friendly, meeting the needs of different user groups.

Through these outputs, the product manager provides clear guidance and objectives for the development team, ensuring that the development work can proceed smoothly while meeting the core needs and expectations of users.

8.3 Architect

Based on the requirements defined by the product manager, the architect’s specific outputs should include an overall plan for system design, selection of the tech stack, and design of the interface and data structures. Here is an example output:

### System Architecture Design Document:

**Tech Stack Selection**:

– **Programming Language**: Python 3, due to its extensive community support and rich library resources.

– **GUI Framework**: Tkinter, due to its simplicity, cross-platform compatibility, and being built into the Python standard library.

**Application Component Design**:

1. **GUI Components**:

– **Canvas**: The main drawing area for capturing and responding to user drawing actions.

– **Toolbar**: Provides brush selection (including color and thickness).

– **Menu Bar**: Provides file operation functions (saving and loading images).

2. **Backend Logic Components**:

– **Drawing Processing**: Handles mouse events, converting user actions into images on the canvas.

– **File Management**: Implements image saving and loading functions.

3. **Data Structures**:

– **Brush Attributes**: Stores the currently selected color and thickness.

– **Image Data**: Stores the current canvas state for saving and loading operations.

**System Design Components**:

1. **Modularization**: Divide the application into independent modules (such as GUI module, drawing logic module, file management module) for easier maintenance and expansion.

2. **Interface Definition**: Define interfaces between modules to ensure efficient collaboration among modules.

**Performance and Usability Considerations**:

– Ensure high efficiency in resource usage and fast responsiveness of the application.

– The interface design should be intuitive and easy to use, adapting to different user operating habits.

**Security and Maintainability**:

– Ensure code quality by adhering to best practices.

– Design with future potential feature expansions and maintenance in mind.

Through these design decisions and planning, the architect ensures that the application structure is robust, extensible, and capable of meeting the requirements set by the product manager. This planning also provides clear implementation guidance for the project manager and development team.
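The two data structures named in the design document, brush attributes and image data, could be modeled as shown below. The document does not specify their fields, so this sketch assumes a vector representation in which the image is a list of strokes. All class and field names are hypothetical, and JSON is used only to make the save/load round trip easy to follow.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BrushAttributes:
    color: str = "black"  # currently selected color
    thickness: int = 2    # line width in pixels


@dataclass
class Stroke:
    points: list[tuple[int, int]]  # mouse positions along the stroke
    brush: BrushAttributes         # brush settings when the stroke was drawn


@dataclass
class ImageData:
    width: int
    height: int
    strokes: list[Stroke] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses to plain dicts.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "ImageData":
        raw = json.loads(text)
        strokes = [
            Stroke(points=[tuple(p) for p in s["points"]],  # JSON lists back to tuples
                   brush=BrushAttributes(**s["brush"]))
            for s in raw["strokes"]
        ]
        return cls(width=raw["width"], height=raw["height"], strokes=strokes)
```

Storing the brush with each stroke, rather than once per image, means a saved drawing can be replayed correctly even when the user changed color or thickness between strokes.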

8.4 Project Manager

Based on the system design provided by the architect, the project manager’s main responsibilities include planning project implementation, assigning tasks, monitoring progress, and ensuring all work proceeds smoothly. Here is an example output:

### Project Implementation Plan:

**Task Breakdown and Assignment**:

1. **GUI Design Task**:

– Responsible: Frontend Developer

– Task Content: Design and implement the user interface, including canvas, toolbar, and menu bar.

– Estimated Time: 1 week

2. **Drawing Logic Implementation Task**:

– Responsible: Backend Developer

– Task Content: Implement mouse event handling and drawing logic.

– Estimated Time: 2 weeks

3. **File Management Function Implementation Task**:

– Responsible: Backend Developer

– Task Content: Write code to implement image saving and loading functions.

– Estimated Time: 1 week

4. **Integration and Testing Task**:

– Responsible: Quality Assurance Engineer

– Task Content: Integrate all modules, conduct system testing to ensure application stability and usability.

– Estimated Time: 1 week

**Milestone Plan**:

– **Milestone 1** (End of Week 1): Complete GUI design.

– **Milestone 2** (End of Week 3): Complete drawing logic and file management function implementation.

– **Milestone 3** (End of Week 4): Complete overall system integration and testing.

**Progress Monitoring and Resource Allocation**:

– Regular (weekly) project meetings to monitor progress and resolve issues.

– Adjust resources and task assignments based on actual progress.

**Risk Management and Response Strategies**:

– Anticipate potential technical and time risks and develop response plans.

– Reserve time buffers in critical tasks to address unforeseen delays.

**Communication and Collaboration**:

– Ensure effective communication among teams, using collaboration tools to document and track task status.

– Provide regular feedback and adjustment suggestions to ensure the project proceeds smoothly as planned.

Through this project management, the project manager ensures that all tasks are effectively assigned and monitored, orderly advancing the project timeline and promptly addressing potential issues and challenges. Such planning helps ensure that the final product can be delivered on time and meet expected quality standards.

8.5 Engineer

Based on the project manager’s plan and the architect’s design, the engineer’s output will focus on actually writing code to implement the application’s features. Here is an example output:

### Engineer Implementation Report:

**1. GUI Design Implementation**:

– **Completed Content**:

– Designed and implemented the main interface of the application, including a large canvas area, a toolbar with brush options, and a menu bar for file operations (saving and loading).

– Implemented using the Tkinter framework, ensuring cross-platform compatibility.

– **Code Snippet**:

– Code example for creating the main window and canvas.

– Code for implementing the toolbar, including color and brush selection.

– **Issues Encountered and Solutions**:

– Encountered layout adjustment issues, resolved by modifying Tkinter layout manager parameters.

**2. Drawing Logic Implementation**:

– **Completed Content**:

– Implemented drawing functionality on the canvas, including responding to mouse drag events to draw lines.

– Added options for different brush colors and thicknesses.

– **Code Snippet**:

– Implementation code for mouse event handling functions.

– Relevant code for line drawing and attribute settings.

– **Issues Encountered and Solutions**:

– Resolved the issue of brush continuity by optimizing the drawing algorithm for smooth rendering.

**3. File Management Function Implementation**:

– **Completed Content**:

– Implemented image saving and loading functionality.

– Users can save their works on the canvas and load previously saved images for editing.

– **Code Snippet**:

– Implementation code for file saving and loading functionality.

– **Issues Encountered and Solutions**:

– Addressed file format compatibility issues by choosing a universal image format (such as PNG).

**4. Testing and Optimization**:

– Collaborated with the quality assurance engineer to conduct functional and non-functional testing.

– Performed performance optimizations and interface adjustments based on feedback.

Through these specific implementation steps and code implementations, the engineer successfully transformed the design into a fully functional application. These outputs cover not only the main features of the application but also the issues encountered during development and the strategies adopted to resolve them, ensuring the quality and performance of the software.
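The report above lists code snippets without showing them. As an illustration of the canvas and mouse-handling pieces, here is a minimal Tkinter sketch. It is not the output of MetaGPT: the `DrawingController` class and `main` function are hypothetical names, and saving, loading, and the toolbar are left out. Each drag event draws a segment from the previous mouse position to the current one, which is one simple way to keep strokes continuous.

```python
class DrawingController:
    """Turns mouse events into line segments on a canvas-like object."""

    def __init__(self, canvas, color="black", thickness=2):
        self.canvas = canvas
        self.color = color
        self.thickness = thickness
        self._last = None  # previous mouse position, or None between strokes

    def on_press(self, event):
        self._last = (event.x, event.y)

    def on_drag(self, event):
        if self._last is not None:
            x0, y0 = self._last
            # Connect the previous point to the current one so the stroke has no gaps.
            self.canvas.create_line(x0, y0, event.x, event.y,
                                    fill=self.color, width=self.thickness,
                                    capstyle="round")
        self._last = (event.x, event.y)

    def on_release(self, event):
        self._last = None  # the next press starts a new stroke


def main():
    # Imported here so DrawingController can be used and tested without a display.
    import tkinter as tk

    root = tk.Tk()
    root.title("Drawing sketch")
    canvas = tk.Canvas(root, width=640, height=480, bg="white")
    canvas.pack(fill=tk.BOTH, expand=True)

    controller = DrawingController(canvas, color="black", thickness=2)
    canvas.bind("<ButtonPress-1>", controller.on_press)
    canvas.bind("<B1-Motion>", controller.on_drag)
    canvas.bind("<ButtonRelease-1>", controller.on_release)
    root.mainloop()
```

Call `main()` to open the window. Keeping the event logic in a class that only needs a `create_line` method means it can be exercised with a fake canvas, which suits the automated testing step described for the QA role.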

8.6 Quality Assurance Engineer

Based on the project manager’s plan and the engineer’s implementation, the quality assurance engineer’s main responsibility is to test the application to ensure it meets predetermined functional and performance requirements and provide feedback for product improvement. Here is an example output:

### Quality Assurance Report:

**1. Functional Testing**:

– **Test Content**:

– Tested drawing functionality, including line drawing, color and brush changes.

– Verified the correctness and stability of file saving and loading functionality.

– Checked the responsiveness and accuracy of user interface elements.

– **Test Results**:

– All drawing functionalities worked as expected.

– File operation functions were stable, with no data loss or corruption detected.

– The user interface responded quickly and aligned with user actions.

**2. Non-Functional Testing**:

– **Test Content**:

– Tested the application’s performance, including response time and resource usage.

– Tested the application’s compatibility to ensure stable operation across different operating systems and hardware configurations.

– **Test Results**:

– The application ran smoothly, with reasonable resource usage.

– Performed well on mainstream operating systems, with no significant compatibility issues.

**3. User Experience Testing**:

– **Test Content**:

– Invited several potential users to test the application and collect their usage experiences and feedback.

– **Test Results and Feedback**:

– Users generally found the application interface intuitive and easy to use.

– Suggestions received included adding more customizable drawing options and improving the toolbar layout.

**4. Post-Test Optimization Suggestions**:

– Based on user feedback, it is suggested to add more drawing tools and customization options in future versions.

– Recommendations were made to redesign the toolbar layout to enhance usability.

**5. Documentation and Reporting**:

– Completed a detailed testing report, including test cases, test results, and identified issues.

– Provided suggestions for improving code and design.

Through these detailed tests and feedback, the quality assurance engineer ensures that the software quality meets project requirements and provides valuable guidance for future improvements. These outputs cover not only the main features of the application but also tests related to user experience and performance, ensuring the comprehensiveness and usability of the software.
