Three Agents Surpass GPT-4 Using Open Source Models

The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, covering NLP master's and doctoral students, university faculty, and industry researchers.
The community's vision is to promote communication and progress between academia and industry in natural language processing and machine learning, especially for beginners.
Reprinted from | Quantum Bit
Author | Multi-LLM Team

It truly is a case of "three cobblers with their wits combined can outsmart Zhuge Liang":

Three agents based on open-source small models collaborate to achieve tool invocation results comparable to GPT-4!

Without further ado, let's look directly at two execution records of the system.

In the first, the user says they are a music enthusiast who wants to explore different music genres and musicians, and instructs the model to use the Deezer and Shazam APIs to search for some music tracks and the corresponding artist information.

The agents, playing three different roles, collaborated and completed the task in two steps.

[Figure: execution record of the music search task]
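
Since the figure cannot be reproduced here, the following is a hypothetical reconstruction of what such a two-step record might look like; the API names (search_for_deezer, artist_info_for_shazam) and the message fields are illustrative assumptions, not the actual tool names from the run.

```python
# Hypothetical two-step execution record (all names and fields are
# illustrative, not taken from the actual run).
trace = [
    {"role": "planner", "rationale": "The user wants tracks and artist info; query Deezer first.",
     "next": "caller"},
    {"role": "caller", "action": "search_for_deezer(query='jazz')"},
    {"role": "observation", "result": "[list of tracks with artist ids]"},
    {"role": "planner", "rationale": "Tracks found; now look up the artists via Shazam.",
     "next": "caller"},
    {"role": "caller", "action": "artist_info_for_shazam(artist_id='...')"},
    {"role": "observation", "result": "[artist details]"},
    {"role": "planner", "rationale": "Enough information gathered.", "next": "summarizer"},
    {"role": "summarizer", "final_answer": "Here are several tracks and their artists: ..."},
]
```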

A more challenging task does not specify the tools: the model must find the most popular landscape painting tutorial video and the details of the channel that uploaded it.

In such cases, the model often runs into tool-status changes, such as APIs being taken down or their required parameters changing.

[Figure: execution record of the video search task]

In this run, the model attempted to use video_for_simple_youtube_search at step 0 to obtain the video details, but found that this API was broken and could not be called.

The agent playing the planner role therefore adjusted its strategy and directed the caller agent to try a different API; the details were ultimately retrieved through the new API, solving the user's task.
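
As a hedged illustration of this recovery behavior, the fragment below shows how such a trajectory might look; the fallback API name video_details_for_youtube_v3 is a made-up placeholder, not the API actually used.

```python
# Hypothetical fragment of the recovery: the planner reads the error
# observation and redirects the caller to an alternative API.
recovery = [
    {"role": "caller", "action": "video_for_simple_youtube_search(video_id='...')"},
    {"role": "observation", "result": "Error: this API is currently unavailable"},
    {"role": "planner", "rationale": "video_for_simple_youtube_search is broken; "
                                     "try another video-details API instead.",
     "next": "caller"},
    {"role": "caller", "action": "video_details_for_youtube_v3(video_id='...')"},
    {"role": "observation", "result": "[video and channel details]"},
]
```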

This is α-UMi, a multi-model collaborative agent framework built on open-source small models, proposed jointly by Sun Yat-sen University and Alibaba's Tongyi Lab.

α-UMi fine-tunes multiple open-source small models to work in concert, achieving performance comparable to GPT-4 on tool invocation and other benchmarks.

In summary, compared with frameworks built on closed-source model APIs, α-UMi has the following advantages:

  • In the α-UMi multi-model collaborative framework, three small models (planner, caller, and summarizer) handle path planning, tool invocation, and response summarization respectively, dividing the workload so that no single small model is overburdened.

  • Compared with single-model agents, it supports more flexible prompt design, and it outperforms single-model agent frameworks on multiple benchmarks, including ToolBench and the ToolAlpaca corpus, achieving performance comparable to GPT-4.

  • It proposes a “global-local” multi-stage fine-tuning paradigm (GLPFT) that successfully trains a multi-model collaborative framework on open-source small models. Experiments show that this two-stage paradigm is currently the best training paradigm explored for multi-model collaborative agents, and it can be applied broadly.

What Does the Multi-Model Collaborative Framework α-UMi Look Like?

Tool-learning agents that let large models call APIs, functions, and code interpreters, such as OpenAI's Code Interpreter and AutoGPT, have attracted widespread attention in both industry and academia.

With the support of external tools, large models can autonomously complete more complex tasks such as web browsing, data analysis, and address navigation, making AI agents an important direction for putting large models into practice.

However, the mainstream projects above are built primarily on the closed-source ChatGPT and GPT-4, models that are already strong enough in reasoning, step planning, tool-request generation, and summarization.

In contrast, open-source small models, limited by model capacity and pre-training, cannot match large models on tasks such as reasoning and planning, tool invocation, and response generation.

To address this issue, the researchers proposed α-UMi.

α-UMi includes three small models: planner, caller, and summarizer.

[Figure: the α-UMi framework, with the planner, caller, and summarizer models]

Among them, the planner model is the system's brain: at each agent execution step, it decides whether to activate the caller or the summarizer and provides a corresponding rationale as guidance;

the caller and summarizer act on the planner's guidance, with the caller generating the instructions for interacting with tools and the summarizer composing the final response that is fed back to the user.

All three models are fine-tuned from open-source small models on different types of data.
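
To make the division of labor concrete, here is a minimal sketch of how the three-role loop could be wired together; the .generate() interface, the prompt layout, and the "Next:" marker are illustrative assumptions, not the project's actual implementation.

```python
# Minimal sketch of an alpha-UMi-style three-role loop (assumptions
# throughout; the real project's interfaces differ).

def run_agent(task, planner, caller, summarizer, execute_tool, max_steps=8):
    history = f"Task: {task}\n"
    for _ in range(max_steps):
        # The planner emits a rationale plus which role should act next.
        plan = planner.generate(history)  # e.g. "Rationale: ... Next: caller"
        history += plan + "\n"
        if "Next: summarizer" in plan:
            # The summarizer turns the accumulated trajectory into the answer.
            return summarizer.generate(history)
        # Otherwise the caller produces a concrete tool invocation.
        action = caller.generate(history)   # e.g. "search_for_deezer(query='jazz')"
        observation = execute_tool(action)  # run the tool and capture its output
        history += f"Action: {action}\nObservation: {observation}\n"
    # Step budget exhausted: summarize whatever has been gathered so far.
    return summarizer.generate(history)
```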

Additionally, the researchers proposed the “global-local” multi-stage fine-tuning paradigm, GLPFT.

Implementing a multi-model collaborative framework on open-source small models is not a simple task, as two opposing factors are at play:

First, training on rationale, action, and final-answer generation together lets the three mutually reinforce one another and strengthens the model's overall understanding of the agent task. This is why most existing work trains a single model to generate all three.

Second, limited model capacity and the differing data ratios required by the sub-tasks make it hard for a single model to reach peak performance on all three tasks at once.

The following figure shows that a single-model agent needs different training data volumes to peak on different metrics, making it hard to find one data volume and one checkpoint that peak on all metrics simultaneously.

Through multi-model collaboration, this issue can be resolved.

[Figure: single-model agent performance on each metric as training data volume grows]

Weighing these two points, the researchers proposed the “global-local” multi-stage training method: first exploit the mutual reinforcement of rationale, action, and final answer during training to obtain a better single-model initialization, then fine-tune separate models to sharpen each sub-task's performance.

[Figure: the two-stage GLPFT fine-tuning process]

The above figure illustrates the multi-stage fine-tuning process. In the first stage, a pre-trained LLM is fine-tuned on the full tool invocation agent task, yielding a single-model agent LLM as the initialization.

In the second stage, the researchers reconstructed the training data for the tool invocation agent task, breaking it down into three sub-tasks (rationale generation, tool-interaction action generation, and final-response generation); the single-LLM agent trained in the first stage is then copied three times, and each copy is further fine-tuned on one sub-task.
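
A minimal sketch of this stage-two data reconstruction, assuming a simple trajectory schema (the field names task, steps, rationale, next_role, action, observation, and final_answer are invented for illustration):

```python
# Sketch of GLPFT stage two: each trajectory is split into three sub-task
# datasets, and the stage-one checkpoint is copied three times, with each
# copy fine-tuned on one split. The trajectory schema is an assumption.

def split_trajectory(traj):
    planner_data, caller_data, summarizer_data = [], [], []
    context = f"Task: {traj['task']}\n"
    for step in traj["steps"]:
        # Planner sub-task: context -> rationale + which role acts next.
        planner_data.append({"input": context,
                             "target": f"{step['rationale']} Next: {step['next_role']}"})
        context += step["rationale"] + "\n"
        if step["next_role"] == "caller":
            # Caller sub-task: context -> concrete tool invocation.
            caller_data.append({"input": context, "target": step["action"]})
            context += f"Action: {step['action']}\nObservation: {step['observation']}\n"
        else:
            # Summarizer sub-task: context -> final answer for the user.
            summarizer_data.append({"input": context, "target": step["final_answer"]})
    return planner_data, caller_data, summarizer_data
```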

Performance Comparable to GPT-4

Static Evaluation

[Figure: static evaluation results]

In the static evaluation, the paper compares every method's outputs against the labeled reference outputs. The results show that:

  • The α-UMi system significantly outperforms ChatGPT and the open-source tool invocation model ToolLLaMA, reaching performance comparable to GPT-4.

It is worth mentioning that ToolLLaMA requires a context length of 8192 to obtain satisfactory results, while α-UMi needs only 4096, thanks to the more flexible prompt design afforded by the multi-model framework.

  • In the comparison of fine-tuning schemes for the multi-model framework, neither directly fine-tuning three separate models nor multi-task fine-tuning of a single model makes the collaborative framework work well; only the multi-stage GLPFT achieves the best performance, pointing the way for subsequent multi-model collaborative training.

Real API Call Evaluation

The authors also introduced a real API call evaluation method on the ToolBench dataset, with the following experimental results:

[Figure: real API call evaluation results on ToolBench]

In the real API call experimental results, α-UMi still outperformed ChatGPT and ToolLLaMA, achieving success rates comparable to GPT-4.

Model Costs

At this point, some may wonder whether multi-model collaboration incurs higher costs. The authors compared the costs of the multi-model collaborative framework during the training, inference, and storage phases:

[Figure: cost comparison across training, inference, and storage]

Overall, the multi-model collaborative framework does introduce higher costs in training and model parameter storage, but its inference speed is comparable to that of single-model frameworks.

More importantly, since the multi-model collaborative framework built on a 7B base far exceeds a 13B single-model agent in performance, its overall cost is effectively lower. One can therefore choose a small-model multi-model collaborative framework to cut costs while still outperforming a single-model agent framework built on a larger model.
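
As a back-of-the-envelope illustration (an assumption-based estimate, not figures from the paper): three 7B models mean storing 3 × 7B = 21B parameters, roughly 1.6 times the 13B single model; but because only one 7B model is active at each step, per-step inference compute is that of a 7B model, roughly half that of the 13B agent. This is consistent with the observation that inference speed stays comparable while performance is higher.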

Finally, the researchers concluded that multi-agent collaboration is the trend in future agent development, and that how to train and strengthen the multi-agent collaboration capabilities of open-source small models is a key question for practical deployment. This paper opens up new ideas for multi-agent collaboration based on open-source small models, achieving tool invocation results on multiple benchmarks that exceed single-model agent baselines and are comparable to GPT-4.

Future work will focus on improving the planner's generalization so it can be applied to a wider range of agent scenarios, deploying the caller model privately for local tool invocation tasks, and building a “large-small” collaborative framework that combines cloud-based large models with local small models.

Project links:
[1] Paper: https://arxiv.org/abs/2401.07324
[2] Code: https://github.com/X-PLUG/Multi-LLM-agent
[3] Model: https://modelscope.cn/models/iic/alpha-umi-planner-7b/summary

About Us

The MLNLP community is a grassroots academic community jointly built by machine learning and natural language processing scholars in China and abroad. It has grown into a well-known community in the field, aiming to promote progress among academia, industry, and enthusiasts in machine learning and natural language processing.
The community provides an open communication platform for practitioners' further education, employment, and research. Everyone is welcome to follow and join us.
