The Ultimate Combination: HuggingFace + ChatGPT — HuggingGPT is here!
Source | Quantum Bit
Just give it an AI task, such as “What animals are in the picture below, and how many of each are there?”
It can automatically analyze which AI models are needed, and then directly call the corresponding models from HuggingFace to execute and complete the task.

Throughout the entire process, all you need to do is output your requirements in natural language.
This collaboration between Zhejiang University and Microsoft Research Asia has rapidly gained popularity since its release.

This is the most interesting paper I’ve read this week. Its ideas are very close to the “Everything App” (where everything is an app, and AI directly reads information).
And one netizen exclaimed:
Isn’t this just ChatGPT acting as a “model dispatcher”?
The speed of AI evolution is astonishing; at least leave us humans a way to make a living…
So, what exactly is going on?
HuggingGPT: Your AI Model “Dispatcher”
In this system, language serves as a universal interface. Thus, HuggingGPT was born. Its engineering process is divided into four steps:
- First, task planning: ChatGPT parses the user’s request into a task list and determines the execution order and resource dependencies between tasks.
- Second, model selection: ChatGPT assigns appropriate models to the tasks based on the descriptions of the expert models hosted on HuggingFace.
- Next, task execution: the selected expert models, running on hybrid endpoints (both local inference and HuggingFace inference), execute the assigned tasks according to the task order and dependencies, then return execution information and results to ChatGPT.
- Finally, result output: ChatGPT summarizes each model’s execution logs and inference results to produce the final answer.
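The four stages above can be sketched as a minimal orchestration loop. All names below are hypothetical illustrations, not the project’s actual API: a hard-coded plan stands in for ChatGPT’s task planning, and simple functions stand in for HuggingFace expert models.

```python
# A minimal sketch of a HuggingGPT-style pipeline (hypothetical names,
# not the real JARVIS code). An LLM plans tasks, picks expert models,
# runs them in dependency order, and summarizes the results.

def plan_tasks(request):
    # Stage 1 (task planning): parse the request into tasks with
    # explicit resource dependencies. Here the plan is hard-coded.
    return [
        {"id": 0, "type": "pose-detection", "deps": []},
        {"id": 1, "type": "pose-to-image", "deps": [0]},
        {"id": 2, "type": "image-to-text", "deps": [1]},
        {"id": 3, "type": "text-to-speech", "deps": [2]},
    ]

def select_model(task, model_zoo):
    # Stage 2 (model selection): pick an expert model whose
    # description matches the task type.
    return model_zoo[task["type"]]

def execute_plan(tasks, model_zoo):
    # Stage 3 (task execution): run tasks in dependency order,
    # feeding each task the outputs of its dependencies.
    results = {}
    for task in tasks:  # assume tasks arrive topologically sorted
        inputs = [results[d] for d in task["deps"]]
        model = select_model(task, model_zoo)
        results[task["id"]] = model(inputs)
    return results

def summarize(results):
    # Stage 4 (result output): an LLM would summarize logs and
    # results into a final answer; here we just chain them.
    return " -> ".join(results[i] for i in sorted(results))

# Toy "expert models": each just tags its output.
model_zoo = {
    "pose-detection": lambda ins: "pose",
    "pose-to-image": lambda ins: f"image({ins[0]})",
    "image-to-text": lambda ins: f"caption({ins[0]})",
    "text-to-speech": lambda ins: f"audio({ins[0]})",
}

tasks = plan_tasks("girl reading a book, pose from example, read aloud")
print(summarize(execute_plan(tasks, model_zoo)))
```

The key design point is that dependencies are resolved by the planner up front, so downstream models can consume the generated resources (an image, a caption) of upstream models without user intervention.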
As shown in the figure below, suppose we give the following request:
Please generate an image of a girl reading a book, with her posture matching that of the boy in example.jpg. Then describe the new image with your voice.
We can see how HuggingGPT breaks it down into six sub-tasks and selects models to execute them to obtain the final result.
What are the specific results?
The authors conducted practical tests using gpt-3.5-turbo and text-davinci-003, which can be accessed publicly through the OpenAI API.
As shown in the figure below, when there are resource dependencies between tasks, HuggingGPT can correctly parse specific tasks from the user’s abstract request and complete the image transformation.
In audio and video tasks, it also demonstrates the ability to organize collaboration between models, completing a video and voiceover of an “astronaut walking in space” by executing two models in parallel and serially.
Additionally, it can integrate multiple input resources from the user to perform simple reasoning, such as counting how many zebras appear across the following three images.
In summary: HuggingGPT can perform well on various forms of complex tasks.
The Project is Open Source, Named “JARVIS”
Currently, the paper on HuggingGPT has been published, and the project is under construction; only part of the code has been open-sourced, garnering 1.4k stars.
We note that its project name is interesting: rather than its original name HuggingGPT, the repository is named after JARVIS, the AI assistant from Iron Man.
Some have found it very similar to the recently released Visual ChatGPT, with HuggingGPT extending the range of callable models in both number and type.
Indeed, the two share a common origin: Microsoft Research Asia.
Specifically, the first author of Visual ChatGPT is MSRA senior researcher Wu Chenfei, and the corresponding author is MSRA chief researcher Duan Nan.
HuggingGPT has two co-first authors: Shen Yongliang of Zhejiang University, who completed this work during an internship at MSRA, and Song Kaitao, a researcher at MSRA.
The corresponding author is Professor Zhuang Yueting from the Computer Science Department of Zhejiang University.
Paper address: https://arxiv.org/abs/2303.17580
Project link: https://github.com/microsoft/JARVIS
Finally, netizens are very excited about the arrival of this powerful new tool. Some say:
ChatGPT has become the commander of all AI created by humans.
Some have also suggested:
AGI might not be a single LLM, but multiple interconnected models connected by an “intermediary” LLM.
So, have we already entered the “semi-AGI” era?
