MetaGPT: A Multi-Agent Collaboration Framework for AI Development

DeepSeek’s explosive popularity has once again showcased the power of LLMs. To make better use of them, let’s take a look at MetaGPT, the highest-scoring paper in the Agent field at ICLR 2024 and the one with the most GitHub stars.

Let’s briefly review the research motivation: existing LLM-based multi-agent systems can solve simple dialogue tasks, but run into logical inconsistencies on more complex ones. These inconsistencies stem from cascading hallucinations that arise when LLMs are naively chained together.

To address this problem, the research team proposed MetaGPT, which uses an assembly-line paradigm to assign a distinct role to each agent and decomposes complex tasks into subtasks that multiple agents complete collaboratively. This approach aims to generate more coherent solutions than previous chat-based multi-agent systems, particularly on collaborative software engineering benchmarks.

Project address: https://github.com/geekan/MetaGPT/

Method Introduction

MetaGPT is a multi-agent collaboration framework that encodes Standard Operating Procedures (SOPs) into prompts to ensure a structured approach to problem-solving. It requires agents to collaborate as domain experts and to generate structured outputs as needed, such as high-quality requirement documents, architectural design diagrams, and flowcharts. These structured outputs act as a higher-level chain of thought for each individual agent and give downstream roles a clear, goal-oriented context. With roles clearly divided, complex work can be decomposed into smaller, more specific tasks, which improves the output quality of the LLMs.

Let’s look at a simple example to understand how MetaGPT operates: Software Company

MetaGPT takes a line of requirements as input and outputs user stories/competitive analysis/requirements/data structures/API/documentation, etc.

Internally, MetaGPT includes product managers/architects/project managers/engineers. It provides the entire process of a software company along with meticulously orchestrated SOPs.

Requirement Analysis

Upon receiving the requirements, this process begins. This phase focuses on clarifying the functionalities and requirements needed for the software.

Product Manager

The product manager initiates the entire process based on the requirements and feasibility analysis. They are responsible for understanding the requirements and setting a clear direction for the project.

Architect

Once the requirements are clarified, the architect creates a technical design plan for the project. They are responsible for constructing the system interface design to ensure that the technical implementation meets the requirements. In MetaGPT, the architect agent can automatically generate system interface designs, such as the development of a content recommendation engine.

Project Manager

The project manager uses sequence flow diagrams to meet each requirement. They ensure that the project progresses as planned, with each phase executed in a timely manner.

Engineer

The engineers are responsible for the actual code development. They use the designs and flowcharts to transform them into fully functional code.

Quality Assurance (QA) Engineer

After the development phase, the QA engineer conducts comprehensive testing. They ensure that the software meets the required standards and has no errors or issues.

Details Introduction

Below, we will detail how MetaGPT organizes multi-agent systems through role specialization, workflow definition, and structured communication:

Role Specialization

Clear role division: To achieve efficient task decomposition, MetaGPT defines multiple roles with different skills and expertise. For example, in a simulated software company environment, five key roles are defined: Product Manager, Architect, Project Manager, Engineer, and QA Engineer. Each role has its specific tasks and responsibilities.

Role configuration and behavior patterns: Each agent follows a ReAct-style behavior pattern: it continuously monitors the environment (i.e., the message pool) and triggers the corresponding action when it detects a relevant event or piece of information. Furthermore, every agent’s operations are constrained by preset conditions so that its actions stay aligned with the expected workflow.
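To make the observe-then-act pattern concrete, here is a minimal sketch. It is not MetaGPT’s actual implementation: the `Message` and `Role` classes, their fields, and the `step` method are hypothetical names chosen for illustration, and the LLM call is replaced by a simple string template.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str   # role that produced the message
    kind: str     # e.g. "requirement", "prd", "design"
    content: str


@dataclass
class Role:
    name: str
    watches: set[str]   # message kinds that trigger this role
    produces: str       # message kind this role emits
    seen: int = 0       # index of the next unread message in the pool

    def observe(self, pool: list[Message]) -> list[Message]:
        """Return unread messages this role cares about and mark all as read."""
        fresh = [m for m in pool[self.seen:] if m.kind in self.watches]
        self.seen = len(pool)
        return fresh

    def act(self, triggers: list[Message]) -> Message:
        """Stand-in for an LLM call: describe what triggered the action."""
        sources = ", ".join(m.kind for m in triggers)
        return Message(self.name, self.produces, f"{self.produces} based on {sources}")

    def step(self, pool: list[Message]) -> bool:
        """One observe-then-act cycle; returns True if the role acted."""
        triggers = self.observe(pool)
        if not triggers:
            return False
        pool.append(self.act(triggers))
        return True


pool = [Message("user", "requirement", "Build a 2048 game")]
pm = Role("ProductManager", {"requirement"}, "prd")
architect = Role("Architect", {"prd"}, "design")

acted_first = architect.step(pool)    # nothing to do yet: no PRD in the pool
pm.step(pool)                         # the requirement triggers the product manager
acted_second = architect.step(pool)   # the new PRD now triggers the architect
```

The `watches` set plays the part of the "preset conditions": a role stays idle until a message of a kind it cares about appears in the pool.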

Workflow Definition

Basic workflow establishment

By defining agents’ roles and operational skills, a basic workflow can be established. In this process, MetaGPT follows SOPs in software development, allowing all agents to collaborate sequentially to form an orderly and efficient work chain.

Task handover and feedback optimization

Each agent, after completing its task, will pass relevant information to other agents, forming a closed-loop workflow. For example, after the engineer generates the initial code, if errors are found, they can self-correct by reviewing past messages and comparing them with PRD, system design, and code files. This iterative programming approach enhances the quality of the final output.

Figure 1 describes the standard operating procedures (SOPs) for software development within the MetaGPT framework and shows how closely they mirror a real-world human team. After receiving user requirements, the product manager conducts a comprehensive analysis and formulates a detailed PRD (Product Requirement Document), which includes user stories and a requirement pool and serves as the initial functional decomposition. The structured PRD is then passed to the architect, who translates the requirements into system design components such as file lists, data structures, and interface definitions. Once the information is captured in the system design, it is passed to the project manager for task assignment. Engineers then implement the specified classes and functions (see Figure 2). In the subsequent stage, QA engineers formulate test cases to ensure code quality. Finally, MetaGPT generates a carefully designed software solution.
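The handover chain described above can be sketched as a simple sequential pipeline in which each role reads the artifacts produced so far and adds its own. This is only an illustration under stated assumptions: the function names, the artifact keys, and the placeholder contents are hypothetical, and each function stands in for what would be an LLM call in the real framework.

```python
from typing import Callable

Artifacts = dict  # maps an artifact name to its content


def product_manager(a: Artifacts) -> Artifacts:
    req = a["requirement"]
    prd = {"user_stories": [f"As a user, I want: {req}"], "requirement_pool": [req]}
    return {**a, "prd": prd}


def architect(a: Artifacts) -> Artifacts:
    design = {"file_list": ["main.py", "game.py"],
              "interfaces": ["class Game: move(direction)"]}
    return {**a, "design": design}


def project_manager(a: Artifacts) -> Artifacts:
    # One task per file named in the system design.
    return {**a, "tasks": [f"implement {f}" for f in a["design"]["file_list"]]}


def engineer(a: Artifacts) -> Artifacts:
    code = {t.removeprefix("implement "): "# placeholder code" for t in a["tasks"]}
    return {**a, "code": code}


def qa_engineer(a: Artifacts) -> Artifacts:
    return {**a, "tests": [f"test_{name.removesuffix('.py')}" for name in a["code"]]}


# The SOP is just the fixed order in which roles hand work to one another.
SOP: list[Callable[[Artifacts], Artifacts]] = [
    product_manager, architect, project_manager, engineer, qa_engineer,
]


def run_sop(requirement: str) -> Artifacts:
    artifacts: Artifacts = {"requirement": requirement}
    for stage in SOP:
        artifacts = stage(artifacts)
    return artifacts


result = run_sop("a command-line 2048 game")
```

Each stage only depends on artifacts produced upstream, which is what lets the PRD act as the foundation for everything that follows.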

Structured Communication Interface

Most existing multi-agent frameworks based on large language models use unconstrained natural language as the communication interface. Although natural language offers flexibility, it may fall short in solving complex tasks. For example, in a “telephone game” or “whispering game,” the original information may undergo significant distortion after several rounds of transmission. To address this issue, MetaGPT proposes using structured communication methods.

Communication between agents no longer relies on dialogue forms but is conducted through structured outputs such as documents and diagrams. These documents contain all necessary information, preventing irrelevant or missing content. For instance, the architect agent generates system interface designs and sequence flow diagrams, which serve as crucial reference materials for engineers executing tasks.
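One way to picture structured communication is a fixed schema that an agent’s output must satisfy before it is passed on. The sketch below is an assumption for illustration, not MetaGPT’s real document format: the `SystemDesign` class and its three fields are hypothetical names taken from the design components mentioned in this article.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SystemDesign:
    file_list: list[str]
    data_structures: list[str]
    interface_definitions: list[str]

    @classmethod
    def from_json(cls, raw: str) -> "SystemDesign":
        """Parse an agent's output and reject missing or unexpected sections."""
        data = json.loads(raw)
        required = {f.name for f in fields(cls)}
        missing = required - set(data)
        extra = set(data) - required
        if missing or extra:
            raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
        return cls(**data)


raw = json.dumps({
    "file_list": ["main.py", "game.py"],
    "data_structures": ["Board: 4x4 grid of ints"],
    "interface_definitions": ["Game.move(direction) -> bool"],
})
design = SystemDesign.from_json(raw)
```

Because the schema is checked at the boundary, a downstream engineer agent never has to guess whether a section was omitted or whether unrelated chatter slipped in.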

Publish-Subscribe Mechanism

To simplify information sharing, MetaGPT introduces a shared global message pool through which all agents exchange messages directly. Agents can publish their own structured messages to the pool and can also transparently access messages published by other agents. This removes the need for one-on-one communication each time and makes communication more efficient.

Sharing everything keeps the process transparent, but it can also lead to information overload. MetaGPT therefore adds a subscription mechanism for managing and disseminating information: each agent selectively subscribes to the messages that are relevant to its role configuration. For example, architects mainly focus on the product requirement documents (PRDs) provided by product managers, while messages from QA engineers may matter much less to them. This mechanism ensures that agents only receive the information needed for their tasks, reducing distractions.
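The combination of a shared pool and role-based subscriptions can be sketched in a few lines. This is a simplified illustration, not MetaGPT’s API: `MessagePool`, `subscribe`, `publish`, and `inbox` are hypothetical names, and topics are plain strings.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    topic: str
    content: str


class MessagePool:
    def __init__(self) -> None:
        self.history: list[Message] = []                       # everything ever published
        self._subscriptions: dict[str, set[str]] = defaultdict(set)

    def subscribe(self, agent: str, *topics: str) -> None:
        self._subscriptions[agent].update(topics)

    def publish(self, message: Message) -> None:
        self.history.append(message)

    def inbox(self, agent: str) -> list[Message]:
        """Messages on topics the agent subscribed to, excluding its own."""
        wanted = self._subscriptions[agent]
        return [m for m in self.history if m.topic in wanted and m.sender != agent]


pool = MessagePool()
pool.subscribe("Architect", "prd")
pool.subscribe("Engineer", "design", "tasks")

pool.publish(Message("ProductManager", "prd", "PRD for a 2048 game"))
pool.publish(Message("QAEngineer", "test_report", "2 tests failed"))

architect_inbox = pool.inbox("Architect")
engineer_inbox = pool.inbox("Engineer")
```

The full history stays available in the pool, but each agent’s inbox is filtered by its subscriptions, so the architect here sees the PRD and not the QA report.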

Iterative Programming with Executable Feedback

Iterative Programming Process

Initial code generation: Engineers write preliminary code based on the Product Requirement Document (PRD) and system design. This is the first step in the entire development process, where engineers translate high-level designs into specific implementation details.
Running and error detection: Once the code is generated, engineers immediately run the code to check for any syntax or logical errors. This phase is similar to unit testing in traditional software development, aiming to identify potential issues early.

Executable Feedback Mechanism

Self-correction: If errors are found during execution, engineers initiate a self-correction process. This process reviews past messages stored in memory and compares them with the PRD, system design, and current code files. This comparison helps engineers identify the problem and make the necessary adjustments.
Iterative optimization: This is a repetitive process where engineers continuously generate code, run the code, detect errors, and correct them until the code runs correctly or reaches a predetermined maximum retry count. Each iteration improves the code, ultimately enhancing its quality and reliability.
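The generate, run, correct loop above can be sketched as follows. This is a toy under stated assumptions rather than MetaGPT’s implementation: `run_code`, `iterate`, and `toy_engineer` are hypothetical names, the "engineer" is a hard-coded function instead of an LLM, and `exec` is used only for illustration (a real system should run generated code in a sandbox).

```python
import traceback
from typing import Callable, Optional


def run_code(source: str) -> Optional[str]:
    """Execute source in a fresh namespace; return the error text, or None on success.

    exec() on untrusted code is unsafe; it is used here only to keep the sketch short.
    """
    try:
        exec(compile(source, "<generated>", "exec"), {})
        return None
    except Exception:
        return traceback.format_exc(limit=1)


def iterate(write_code: Callable[[Optional[str], Optional[str]], str],
            max_retries: int = 3) -> tuple[str, int, bool]:
    """Generate, run, and revise until the code runs or the retry budget is spent."""
    code: Optional[str] = None
    error: Optional[str] = None
    for attempt in range(1, max_retries + 1):
        code = write_code(code, error)   # first call sees no previous code or error
        error = run_code(code)
        if error is None:
            return code, attempt, True
    return code or "", max_retries, False


def toy_engineer(previous: Optional[str], error: Optional[str]) -> str:
    """Hard-coded stand-in for the engineer agent."""
    if previous is None:
        # First draft calls a function that was never defined.
        return "result = add(2, 3)\nassert result == 5\n"
    if "NameError" in (error or ""):
        # "Self-correction": use the error message to add the missing definition.
        return "def add(a, b):\n    return a + b\n\n" + previous
    return previous


final_code, attempts, succeeded = iterate(toy_engineer)
```

In this toy run the first draft fails with a `NameError`, the error text is fed back to the engineer, and the second attempt passes. The `max_retries` argument corresponds to the predetermined maximum retry count.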

Figure 2 illustrates the communication protocols (left) and iterative programming with executable feedback mechanisms (right) within the MetaGPT framework.

The left side depicts a shared message pool accessible to all agents. Agents can publish structured messages to the pool and subscribe to relevant messages published by other agents. This allows agents to retrieve the required information directly from the shared pool without waiting for responses from other agents, thus enhancing communication efficiency. Each agent selectively subscribes to relevant information based on its role configuration. For instance, architects mainly focus on the product requirement documents (PRDs) provided by product managers, while messages from QA engineers may not be as important to them. This mechanism avoids information overload and ensures agents only receive the information needed for their tasks.

The right side shows the process of the engineer agent writing code based on the original product requirements and design. After generating the initial code, the engineer runs the code and checks for errors. If errors are found during execution, the engineer agent reviews past messages stored in memory and compares them with the product requirement document (PRD), system design, and code files to identify the problem and make appropriate adjustments. This process is iterative, with engineers continuously improving the code until no new errors are found or the maximum retry count is reached.

Figure 3 illustrates the detailed process of software development within the MetaGPT framework, emphasizing its high reliance on standardized operating procedures (SOPs). This figure specifically depicts the entire process from obtaining user requirements to ultimately generating a carefully designed software solution.

Below is a detailed interpretation of this figure:

Starting point (user requirement collection): The process begins with the product manager obtaining user requirements and conducting a thorough analysis to formulate a product requirement document (PRD) that includes user stories and a requirement pool. This is the initial functional decomposition stage.
Handover to the architect: The structured PRD is passed to the architect, who translates these requirements into system design components such as file lists, data structures, and interface definitions. This step ensures that high-level design intentions are accurately translated into technical implementation plans.
Task assignment and execution: The system design information is then handed over to the project manager for task assignment. Engineers implement the specific classes and functions assigned to them, developing according to the established design requirements.
Quality assurance: After the engineers complete the coding, QA engineers step in to formulate test cases and conduct rigorous code quality checks, ensuring that the code meets high-quality standards.
Final output: Finally, MetaGPT generates a carefully designed software solution. The whole process reflects the importance of SOPs, which ensure that each step adheres to established standards and best practices.

The diagram clearly delineates each role and their responsibilities, including the product manager, architect, project manager, engineer, and QA engineer. The flow of information indicated by arrows shows how different roles interact with one another. For example, the PRD provided by the product manager serves as the foundation for all subsequent work; the system design documents generated by the architect guide the engineers’ specific implementations; while the QA engineer’s work is based on the code submitted by the engineers. The diagram also implies the concept of iterative programming, where engineers can detect errors during runtime and self-correct by reviewing past messages. This continuous improvement process helps enhance code quality and the overall reliability of the project.

Experiments

Datasets

Two publicly available benchmark datasets were used: HumanEval and MBPP, as well as a self-generated, more challenging software development benchmark dataset called SoftwareDev.

HumanEval: Contains 164 manually crafted programming tasks. These tasks cover function specifications, descriptions, reference code, and tests.
MBPP: Contains 427 Python tasks. These tasks cover core concepts and standard library functions, including descriptions, reference code, and automated tests.
SoftwareDev: Contains 70 representative software development tasks, each with its unique task prompt (see Table 5). These tasks have a diverse range (see Figure 5), such as mini-games, image processing algorithms, and data visualization, providing a robust testing platform for real development tasks. Unlike HumanEval and MBPP, SoftwareDev focuses on engineering aspects.

Evaluation Metrics

For HumanEval and MBPP, unbiased Pass@k is used to evaluate the functional accuracy of the generated code.

As a brief introduction, the core idea of Pass@k is to estimate the probability that at least one of k code samples generated by the model passes the given test cases. The evaluation process is as follows:

Average over multiple experiments: To reduce randomness, unbiased Pass@k requires multiple repetitions of experiments for each test question, generating multiple code samples. For example, n code samples can be generated for each question (n≥k), where k is the number of samples considered in the experiment. By performing multiple experiments, the model’s performance can be estimated more accurately.
Calculate the pass rate: After generating n code samples, count the number of correct samples that can pass unit tests, denoted as c. Then, use this data to calculate the unbiased estimate. This method reduces variance and improves the accuracy of the estimate.
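The two steps above correspond to the standard unbiased Pass@k estimator, which for a single problem is 1 - C(n - c, k) / C(n, k) and is then averaged over all problems. The sketch below assumes that this standard estimator is the one meant here; the function names and the sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct sample;
    # it is 0 whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts: 10 samples per problem, with 3, 0, and 10 correct.
results = [(10, 3), (10, 0), (10, 10)]
score_k1 = mean_pass_at_k(results, k=1)
score_k5 = mean_pass_at_k(results, k=5)
```

For k = 1 the estimator reduces to c / n for each problem, so Pass@1 is simply the average fraction of correct samples.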

For SoftwareDev, practical applications are prioritized, and performance is evaluated through human assessments (A, E) or statistical analysis (B, C, D):

(A) Executability: This metric scores the code from 1 (failure/non-functional) to 4 (perfect). ‘1’ indicates non-functional, ‘2’ indicates runnable but flawed, ‘3’ indicates nearly perfect, and ‘4’ indicates perfect code.

(B) Cost: This assessment covers (1) running time, (2) token usage, and (3) monetary expenses.

(C) Code Statistics: Includes (1) number of code files, (2) number of lines of code in each file, and (3) total lines of code.

(D) Productivity: Basically defined as token usage divided by lines of code, indicating the token consumption per line of code.

(E) Cost of Manual Revisions: Quantified by the number of revision rounds needed to ensure the code runs smoothly, indicating the frequency of human intervention, such as debugging or importing packages.
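The statistical metrics (C) and (D) are simple arithmetic, as the sketch below shows. The function names, the two sample files, and the token count are made up for illustration, and "lines of code in each file" is computed here as an average per file, which is an assumption about how the metric is reported.

```python
def code_statistics(files: dict[str, str]) -> dict[str, float]:
    """Metric (C): number of files, average lines per file, and total lines."""
    lines = {name: len(source.splitlines()) for name, source in files.items()}
    total = sum(lines.values())
    return {
        "num_files": len(files),
        "lines_per_file": total / len(files),
        "total_lines": total,
    }


def productivity(token_usage: int, total_lines: int) -> float:
    """Metric (D): tokens consumed per line of code (lower is better)."""
    return token_usage / total_lines


# Hypothetical project output and token count.
files = {
    "main.py": "import game\n\ngame.run()\n",
    "game.py": "def run():\n    print('2048')\n",
}
stats = code_statistics(files)
tokens_per_line = productivity(token_usage=1250, total_lines=int(stats["total_lines"]))
```

Note the direction of the productivity metric: because it measures tokens spent per line, a smaller value means the framework produced code more economically.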

Main Results

MetaGPT significantly outperformed previous methods on both the HumanEval and MBPP public benchmarks. Working in conjunction with GPT-4, it reached Pass@1 scores of 85.9% on HumanEval and 87.7% on MBPP, significantly higher than using GPT-4 alone. This indicates that MetaGPT not only enhances the performance of an individual LLM but also boosts the overall efficiency of multi-agent systems.

Table 1 shows that MetaGPT also performed excellently on the challenging SoftwareDev dataset. In terms of code executability in particular, MetaGPT achieved an almost perfect score of 3.75 (out of 4), far surpassing ChatDev’s score of 2.25. Additionally, MetaGPT made significant progress in reducing manual revision costs, dropping from 2.25 to 0.83, while maintaining low time consumption (503 seconds). These results highlight the efficiency gains brought by SOPs in the multi-agent collaboration process.

Capability Analysis

Compared to open-source baseline methods (such as AutoGPT, LangChain, AgentVerse, and ChatDev), MetaGPT offers more functionalities tailored to software engineering tasks. For instance, it can generate product requirement documents (PRDs), technical design documents, and technical interfaces, capabilities not available in other methods. Table 2 summarizes the functional differences between different frameworks, highlighting MetaGPT’s advantages in handling complex and specialized development tasks.

Ablation Study

To further understand the impact of different roles on the final results, an ablation over roles was conducted. The findings indicate that adding roles (such as the architect and the project manager) consistently reduces the number of revisions and improves code executability. Although this may slightly increase costs, overall performance improves significantly.

Additionally, the effectiveness of the executable feedback mechanism was verified by comparing MetaGPT with and without it. The results, as shown in Figure 4, indicate that Pass@1 on HumanEval and MBPP improved significantly, by 4.2% and 5.4% respectively. Furthermore, Table 1 shows that the feedback mechanism increased executability (from 3.67 to 3.75) and reduced manual revision costs (from 2.25 to 0.83). These results show that the designed feedback mechanism leads to higher-quality code.

Conclusion

MetaGPT significantly enhances the automated problem-solving capabilities based on large language models (LLMs) by introducing standardized operating procedures (SOPs) and an efficient multi-agent collaboration framework. It not only achieves efficient decomposition and execution of complex tasks, reducing logical inconsistencies and error rates but also demonstrates superior performance across multiple benchmark tests, generating more coherent and high-quality solutions. Moreover, MetaGPT provides new insights for the future development of artificial intelligence systems, showcasing how to enhance machine collaboration by simulating human workflows, advancing the application of AI in software engineering and other fields.
