Comparative Evaluation of ChatGPT and Similar Models

Machine Heart Reports

Machine Heart Editorial Team

The “Seven Heroes” of large language models compete to see who comes out on top.
Large Language Models (LLMs) are gaining popularity worldwide, and one important application is chatbots, which are used in Q&A, customer service, and many other areas. However, chatbots are notoriously difficult to evaluate: it is often unclear under which circumstances these models perform well, which makes systematic evaluation of LLMs all the more important.
Previously, a Medium blogger named Marco Tulio Ribeiro tested Vicuna-13B, MPT-7B-Chat, and ChatGPT 3.5 on some complex tasks. The results indicated that Vicuna is a viable alternative to ChatGPT (3.5) for many tasks, while MPT is not yet ready for real-world use.
Recently, CMU Associate Professor Graham Neubig conducted a detailed evaluation of seven existing chatbots and developed an open-source tool for automatic comparison, resulting in a comprehensive evaluation report.
In this report, the evaluator presents initial assessments and comparative results of various chatbots, aiming to make it easier for people to understand the current state of all recently released open-source models and API-based models.
Specifically, the evaluator created a new open-source toolkit, Zeno Build, for evaluating LLMs. The toolkit combines: (1) a unified interface for using open-source LLMs via Hugging Face or online APIs; (2) an online interface for browsing and analyzing results with Zeno; and (3) state-of-the-art metrics for evaluating generated text via Critique.
For detailed results, visit: https://zeno-ml-chatbot-report.hf.space/
Here is a summary of the evaluation results:
  • The evaluator assessed 7 language models: GPT-2, LLaMa, Alpaca, Vicuna, MPT-Chat, Cohere Command, and ChatGPT (gpt-3.5-turbo);
  • These models were evaluated based on their ability to generate human-like responses in a customer service dataset;
  • ChatGPT came out on top, but the open-source model Vicuna is also very competitive;
  • The evaluator found that using chat-tuned models with longer context windows is very important;
  • In the early rounds of conversation, prompt engineering is very useful for improving model performance, but its effectiveness diminishes in later rounds with more context;
  • Even powerful models like ChatGPT exhibit many obvious problems, such as hallucinations, failure to seek more information, and repeated content.
Below are the detailed evaluation settings.
Setup
Dataset and Models
The evaluator used the DSTC11 customer service dataset. DSTC11 is a dataset from the 11th Dialog System Technology Challenge, which aims to support more informative and engaging task-oriented dialogues by leveraging subjective knowledge from review posts.
The DSTC11 dataset includes multiple sub-tasks, such as multi-turn dialogue and multi-domain dialogue. For example, one sub-task involves multi-turn dialogue based on movie reviews, where the conversation between the user and the system aims to help the user find movies that suit their taste.
They tested the following 7 models:
  • GPT-2: A classic language model from 2019. The evaluator included it as a baseline to see how recent advancements in language modeling impact the development of better chat models.
  • LLaMa: A language model originally trained by Meta AI with a direct language-modeling objective. The 7B version was used in testing, and the open-source models listed below also use similarly sized versions;
  • Alpaca: A model based on LLaMa, but fine-tuned for instruction;
  • Vicuna: A model based on LLaMa, further fine-tuned for chatbot applications;
  • MPT-Chat: A model trained from scratch in a manner similar to Vicuna, with a more commercially usable license;
  • Cohere Command: An API-based model released by Cohere, fine-tuned for instruction following;
  • ChatGPT (gpt-3.5-turbo): The standard API-based chat model developed by OpenAI.
For all models, the evaluator used default parameter settings, including a temperature of 0.3, a context window of 4 previous dialogue turns, and a standard prompt: “You are a chatbot tasked with making small talk with people”.
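As a rough illustration, here is a minimal sketch (using the legacy openai Python client, not zeno-build’s actual code) of how these defaults translate into a gpt-3.5-turbo call:

```python
# Minimal sketch, not zeno-build's actual code: apply the default settings
# (temperature 0.3, last 4 dialogue turns, standard system prompt).
import openai

SYSTEM_PROMPT = "You are a chatbot tasked with making small talk with people."
CONTEXT_WINDOW = 4   # number of previous dialogue turns to keep
TEMPERATURE = 0.3

def respond(history):
    """history: list of {"role": "user"/"assistant", "content": ...} dicts."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages += history[-CONTEXT_WINDOW:]   # keep only the most recent turns
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=TEMPERATURE,
    )
    return response["choices"][0]["message"]["content"]
```

The same pattern applies to the open-source models, with the API call swapped for a Hugging Face generation call.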
Evaluation Metrics
The evaluator assessed these models based on the similarity of their outputs to human customer service responses. This was accomplished using metrics provided by the Critique toolbox:
  • chrf: Measures character n-gram overlap between the output and the reference;
  • BERTScore: Measures the similarity between two texts by comparing their contextual embeddings;
  • UniEval Coherence: Predicts how coherent the output is with respect to the previous chat turn.
They also measured the length ratio, dividing the length of the output by the length of the gold standard human response to assess whether the chatbot is verbose.
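As a hedged sketch, these similarity scores and the length ratio can be approximated locally with the sacrebleu and bert_score packages (this stands in for Critique’s hosted metrics; UniEval coherence is omitted because it requires its own package):

```python
# Approximate the report's similarity metrics with open-source packages.
from sacrebleu.metrics import CHRF
from bert_score import score as bert_score

def evaluate_response(output: str, reference: str) -> dict:
    # sacrebleu reports chrF on a 0-100 scale; divide by 100 to match the
    # 0-1 scale used in the report.
    chrf = CHRF().sentence_score(output, [reference]).score / 100
    _, _, f1 = bert_score([output], [reference], lang="en")   # embedding similarity
    length_ratio = len(output) / max(len(reference), 1)       # verbosity proxy
    return {"chrf": chrf, "bert_score": f1.item(), "length_ratio": length_ratio}

print(evaluate_response("Thanks! I have booked the hotel for you.",
                        "Great, I will reserve that hotel right away."))
```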
Further Analysis
To delve deeper into the results, the evaluator used Zeno’s analysis interface, particularly its report generator, to categorize examples by their position in the dialogue (beginning, early, middle, and late) and by the length of the gold-standard human response (short, medium, long). They then used Zeno’s exploration interface to view examples with poor automatic scores and to better understand each model’s failure modes.
Results
How did the models perform overall?
Based on all these metrics, gpt-3.5-turbo is the clear winner and Vicuna is the best open-source model; GPT-2 and LLaMa performed poorly, underscoring the importance of training directly on chat data.
These rankings also roughly correspond with the rankings from the lmsys chat arena, which uses human A/B testing to compare models, but the Zeno Build results were obtained without any human scoring.
Regarding output length, gpt-3.5-turbo produced much longer outputs than the other models, and it appears that models tuned for chat generally produce verbose outputs.
Accuracy by Gold-Standard Response Length
Next, the evaluator used the Zeno report UI to perform a more in-depth analysis. First, they measured accuracy by the length of the gold-standard human response, categorizing responses as short (≤35 characters), medium (36-70 characters), or long (≥71 characters) and assessing accuracy for each group separately.
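The thresholds come from the report; the helper below is only an illustrative assumption of how the bucketing might look:

```python
# Illustrative bucketing by gold-response length (thresholds from the report;
# the helper itself is an assumption, not the evaluator's exact code).
def length_bucket(gold_response: str) -> str:
    n = len(gold_response)
    if n <= 35:
        return "short"
    if n <= 70:
        return "medium"
    return "long"

print(length_bucket("Sure, I can help with that."))   # -> "short"
```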
gpt-3.5-turbo and Vicuna maintained accuracy even in longer dialogue turns, while the accuracy of other models declined.
The next question is: how important is the size of the context window? The evaluator experimented with Vicuna, varying the context window from 1 to 4 previous turns. Model performance improved as the context window grew, indicating that a larger context window is important.
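A minimal sketch of this experiment, assuming placeholder `generate` and `score` functions rather than the report’s actual pipeline:

```python
# Truncate the dialogue history to the last k turns, generate a response,
# and average a metric over the dev set; `generate` and `score` are
# placeholders for the model call and the evaluation metric.
def average_score(dialogues, k, generate, score):
    """dialogues: list of (history, gold) pairs; history is a list of turns."""
    scores = [score(generate(history[-k:]), gold) for history, gold in dialogues]
    return sum(scores) / len(scores)

# for k in range(1, 5):   # context windows of 1-4 previous turns
#     print(k, average_score(dev_set, k, generate_fn, score_fn))
```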
The evaluation results show that longer contexts are especially important in the middle and later stages of conversation, as responses in these positions rely less on templates and more on what has been said previously.
More context is particularly important when the model tries to match shorter gold-standard outputs (possibly because such responses are more ambiguous).
How Important Are Prompts?
The evaluator tried 5 different prompts, four generic and one tailored specifically to customer-service chat in the insurance domain (they are collected in a code sketch after this list):
  • Standard: “You are a chatbot tasked with making small talk with people.”
  • Friendly: “You are a kind, friendly chatbot, tasked with chatting with people in a pleasant manner.”
  • Polite: “You are a very polite chatbot, speaking very formally and trying to avoid making any mistakes in your responses.”
  • Cynical: “You are a cynical chatbot, with a very dark view of the world, often pointing out any potential issues.”
  • Insurance-specific: “You are a staff member at Rivertown Insurance Services, primarily helping with insurance claim issues.”
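For reference, the five system prompts can be collected and swept roughly as follows (the wording follows this article’s translation, not necessarily the report’s exact English):

```python
# The five system prompts, as they might be swept during evaluation.
PROMPTS = {
    "standard": "You are a chatbot tasked with making small talk with people.",
    "friendly": "You are a kind, friendly chatbot tasked with chatting with people in a pleasant manner.",
    "polite": "You are a very polite chatbot that speaks very formally and tries to avoid making any mistakes in its responses.",
    "cynical": "You are a cynical chatbot with a very dark view of the world that tends to point out any potential issues.",
    "insurance": "You are a staff member at Rivertown Insurance Services who mainly helps with insurance claim issues.",
}

# for name, system_prompt in PROMPTS.items():
#     run_evaluation(system_prompt)   # hypothetical evaluation entry point
```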
Overall, the evaluator did not find significant differences across these prompts, but the “cynical” chatbot performed slightly worse, while the tailored “insurance” chatbot performed slightly better overall.
In the first round of conversation, the differences between prompts were most pronounced, indicating that prompts matter most when there is little other context to leverage.
Identified Errors and Possible Mitigations
Finally, the evaluator used Zeno’s exploration UI to identify possible errors made by gpt-3.5-turbo. Specifically, they examined all examples with low chrf (<0.1) and manually reviewed them to identify trends.
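A minimal sketch of that filter, assuming per-example records with “output”, “reference”, and “chrf” fields:

```python
# Collect examples whose chrf falls below 0.1 (0-1 scale) for manual review.
# The `results` list is a stand-in for the records produced during evaluation.
results = [
    {"output": "Thank you! Thank you for contacting us.",
     "reference": "You're welcome.", "chrf": 0.07},
    {"output": "The hotel offers free on-site parking.",
     "reference": "Yes, parking at the hotel is free.", "chrf": 0.42},
]

suspicious = [r for r in results if r["chrf"] < 0.1]
for r in suspicious:
    print("MODEL:", r["output"])
    print("GOLD :", r["reference"])
    print("-" * 40)
```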
Probe Failures
Sometimes the model fails to probe for more information when it is needed. For instance, it still handles numbers imperfectly (phone numbers must be 11 digits, but the model’s outputs did not match that length). This can be mitigated by modifying the prompt to remind the model of the required length of certain pieces of information.
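As an illustration of that mitigation (an assumption, not the report’s exact prompt), the insurance prompt could be extended with an explicit reminder:

```python
# Hypothetical prompt tweak: remind the model that phone numbers have 11 digits.
SYSTEM_PROMPT = (
    "You are a staff member at Rivertown Insurance Services who mainly helps "
    "with insurance claim issues. Phone numbers must contain exactly 11 digits; "
    "if the number a customer gives you does not, ask them to repeat it."
)
```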
Content Repetition
Sometimes the model repeats the same content multiple times, for example saying “thank you” twice in a row.
Reasonable Responses That Differ from the Human Ones
Sometimes the model’s response is reasonable but simply differs from the human response.
That concludes the evaluation results. Finally, the evaluator hopes this report will be helpful to researchers! If you wish to continue experimenting with other models, datasets, prompts, or other hyperparameter settings, you can jump to the chatbot example in the zeno-build repository and try it out.
Original article link: https://github.com/zeno-ml/zeno-build/tree/main/tasks/chatbot/report