Is Meta’s Open Source ChatGPT Alternative Worth Using?


The MLNLP community is a well-known machine learning and natural language processing community both in China and abroad, covering NLP master's and doctoral students, university faculty, and industry researchers.
The community's vision is to promote communication and progress between academia and industry in natural language processing and machine learning, especially for beginners.
Reprinted from | Machine Heart
The evaluation of Meta’s open-source large model series LLaMA has been released, showing that there is still a gap compared to ChatGPT.
The continuous popularity of ChatGPT has made major tech companies restless.
Just last week, Meta "open-sourced" a new series of large models, LLaMA (Large Language Model Meta AI), with parameter counts ranging from 7 billion to 65 billion. Because LLaMA has fewer parameters than many previously released large models yet performs better, its release excited many researchers.
For example, the 13-billion-parameter LLaMA model outperforms the 175-billion-parameter GPT-3 on "most benchmarks" and can run on a single V100 GPU, while the largest, 65-billion-parameter LLaMA model is competitive with DeepMind's Chinchilla-70B and Google's PaLM-540B.
The reduction in parameters is beneficial for ordinary researchers and commercial institutions, but does LLaMA really perform as well as claimed in the paper? Can LLaMA hold its own against the current ChatGPT? To answer these questions, some researchers have already tested this model.
Some companies are also trying to address LLaMA’s shortcomings, exploring whether training methods like RLHF can enhance LLaMA’s performance.

Preliminary Evaluation of LLaMA

This evaluation comes from a Medium author named @Enryu. It compares the performance of LLaMA and ChatGPT on three challenging tasks: joke explanation, zero-shot classification, and code generation. The related blog post is titled “Mini-post: first look at LLaMA”.
The author ran LLaMA 7B/13B versions on RTX 3090/RTX 4090, and the 33B version on a single A100.
It is important to note that, unlike ChatGPT, the LLaMA models are not instruction-tuned, so the prompts have to be structured differently.
Joke Explanation
This is a use case presented in Google’s original PaLM paper: given a joke, the model needs to explain why it is funny. This task requires combining world knowledge with some basic logic. All previous models before PaLM failed at this. The author extracted some examples from the PaLM paper to compare the performance of LLaMA-7B, LLaMA-13B, and LLaMA-33B with ChatGPT.
As can be seen, the results are poor. These models catch on to some punchlines but do not truly understand them; they simply generate streams of loosely related text. ChatGPT performs about as poorly as LLaMA-33B (and the other models worse still), but it follows a different strategy: it generates a large amount of text in the hope that at least part of the answer is correct (though most of it clearly is not), which is reminiscent of how many people approach exam questions.
However, ChatGPT did at least get the joke about Schmidhuber. Overall, these models' zero-shot joke-explanation performance is far from PaLM's (unless PaLM's examples were cherry-picked).
Zero-Shot Classification
The second task the author considered is more challenging: clickbait classification. Since even humans cannot agree on what counts as clickbait, the author provided a few labeled examples in the prompt (so it is actually few-shot rather than zero-shot). Below is the prompt for LLaMA:
I will tell whether the following news titles are clickbait:
1) The WORST care homes in England: Interactive map reveals the lowest-rated 2,530 residences - so is there one near you?
Clickbait: yes
2) Netflix's top 10 most-watched movies of all time
Clickbait: yes
3) Peering Through the Fog of Inflation
Clickbait: no
4) You'll never believe which TV cook this cheeky chap grew up to be
Clickbait: yes
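A harness for this kind of few-shot classification can be sketched as follows. This is an illustrative sketch only: the actual model call is stubbed out (LLaMA inference setup is outside the scope of this snippet), and the example titles are taken from the prompt above.

```python
# Minimal few-shot clickbait-classification harness (illustrative sketch).
# In practice the completion would come from a LLaMA checkpoint; here we only
# show prompt assembly and answer parsing.

FEW_SHOT_EXAMPLES = [
    ("Netflix's top 10 most-watched movies of all time", "yes"),
    ("Peering Through the Fog of Inflation", "no"),
]

def build_prompt(title: str) -> str:
    """Assemble a completion-style prompt: labeled examples, then the query
    title, ending with the 'Clickbait:' cue for the model to complete."""
    lines = ["I will tell whether the following news titles are clickbait:"]
    for i, (example, label) in enumerate(FEW_SHOT_EXAMPLES, start=1):
        lines.append(f"{i}) {example}")
        lines.append(f"Clickbait: {label}")
    lines.append(f"{len(FEW_SHOT_EXAMPLES) + 1}) {title}")
    lines.append("Clickbait:")
    return "\n".join(lines)

def parse_label(completion: str) -> str:
    """Read the model's yes/no answer from the first token of the completion;
    anything else is treated as a format failure."""
    token = completion.strip().split()[0].lower().rstrip(".,")
    return token if token in ("yes", "no") else "unknown"
```

Because the base LLaMA models are not instruction-tuned, this completion-style framing (examples followed by a trailing cue) is what makes the task workable at all; format failures in `parse_label` are exactly the errors the smaller models make.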
The following image shows more example results from LLaMA-7B, LLaMA-13B, LLaMA-33B, and ChatGPT.
Clearly, the winner is LLaMA-33B: it is the only model that follows the requested format (yes/no) throughout and whose predictions are reasonable. ChatGPT also does okay, but some of its predictions are questionable and its formatting contains errors. The smaller models (7B/13B) are not suited to this task.
Code Generation
Although LLMs shine on humanities-style tasks, they do worse on STEM tasks. LLaMA has published benchmark results for code, but the author tried something different in the code-generation area: converting natural language into SQL queries zero-shot. This is not very practical, since in real life it is usually faster to write the query directly, but it serves as an example of a code-generation task.
In the prompt, the author provided a table schema and the desired goal, asking the model to provide an SQL query. Below are some random examples, and honestly, ChatGPT seems to perform better.
-- We have a table `events` with columns `event_id`, `timestamp`, `user_id`, `rpc_method`.
-- We want to count how many users issued request "Login" over the last week.
-- SQL query for this is:

-- 7b
SELECT count(*)
FROM events
WHERE user_id IN (SELECT user_id FROM events WHERE timestamp >= NOW() - INTERVAL 7 DAY)
  AND rpc_method = 'Login';

-- 13b
SELECT COUNT(*) FROM events WHERE rpc_method = "Login" AND UNIX_TIMESTAMP(timestamp) >= UNIX_TIMESTAMP(CURRENT_DATE - INTERVAL 7 DAY)

-- 33b
SELECT user_id, COUNT(DISTINCT user_id) AS total
FROM events
WHERE timestamp BETWEEN TIMESTAMP '2013-08-14 00:00:00' AND TIMESTAMP '2013-08-21 00:00:00'
  AND rpc_method = 'Login'
GROUP BY user_id

-- ChatGPT
SELECT COUNT(DISTINCT user_id)
FROM events
WHERE rpc_method = 'Login'
  AND timestamp >= DATE_SUB(NOW(), INTERVAL 1 WEEK);
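The intended semantics of the task (count distinct users who issued "Login" in the last 7 days) can be sanity-checked on a toy table. The queries above are in MySQL dialect; the sketch below uses SQLite, so the date functions are rewritten with `datetime('now', '-7 days')`. The toy rows are invented for illustration.

```python
# Sanity-check the intended query semantics on a toy `events` table:
# count DISTINCT users who issued a "Login" request in the last 7 days.
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id INTEGER, timestamp TEXT, "
    "user_id INTEGER, rpc_method TEXT)"
)

now = datetime.utcnow()  # SQLite's datetime('now') is also UTC
rows = [
    (1, (now - timedelta(days=1)).isoformat(sep=" "), 101, "Login"),
    (2, (now - timedelta(days=2)).isoformat(sep=" "), 101, "Login"),   # same user again
    (3, (now - timedelta(days=3)).isoformat(sep=" "), 102, "Login"),
    (4, (now - timedelta(days=30)).isoformat(sep=" "), 103, "Login"),  # too old
    (5, (now - timedelta(days=1)).isoformat(sep=" "), 104, "Logout"),  # wrong method
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

count = conn.execute(
    """
    SELECT COUNT(DISTINCT user_id)
    FROM events
    WHERE rpc_method = 'Login'
      AND timestamp >= datetime('now', '-7 days')
    """
).fetchone()[0]
print(count)  # 2: users 101 and 102
```

Against this reference, ChatGPT's query matches the intent (distinct users, correct filter), while the 7B answer counts events rather than users and the 33B answer hard-codes dates and groups away the aggregate.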
From the test results, LLaMA performs well on some tasks, but there are still gaps compared to ChatGPT on others. If it could incorporate some “training secrets” like ChatGPT, would the performance improve significantly?

Adding RLHF: Startup Nebuly AI Open Sources ChatLLaMA Training Method

Although LLaMA was favored by many researchers at its release, it still lacks the boost from RLHF (reinforcement learning from human feedback), as the evaluation results above suggest.
Three days after LLaMA's release, the startup Nebuly AI open-sourced ChatLLaMA, an RLHF training recipe for LLaMA. Its training process is similar to ChatGPT's, allowing ChatGPT-style services to be built on top of a pre-trained LLaMA model. The project reached 5.2K stars within two days of launch.
Project link: https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/chatllama
The ChatLLaMA training implementation claims to be faster and cheaper than ChatGPT's, which it supports with the following four points:
  • ChatLLaMA is a complete open-source implementation that allows users to build ChatGPT-style services based on the pre-trained LLaMA model;
  • Compared to ChatGPT, the LLaMA architectures are smaller, which makes the training process and single-GPU inference faster and cheaper;
  • ChatLLaMA has built-in support for DeepSpeed ZeRO to accelerate the fine-tuning process;
  • This library also supports all LLaMA model architectures (7B, 13B, 33B, 65B), allowing users to fine-tune models based on training time and inference performance preferences.
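The ChatGPT-style RLHF recipe that ChatLLaMA implements starts by fitting a reward model to human preference pairs. Its core objective is a pairwise loss that pushes the score of the preferred response above the rejected one; a minimal sketch of that loss in plain Python is shown below (illustrative only, not ChatLLaMA's actual code, and with scalar rewards standing in for reward-model outputs):

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style pairwise loss commonly used for RLHF reward models:
    -log(sigmoid(r_chosen - r_rejected)). Small when the chosen response scores
    well above the rejected one, large when the ordering is wrong."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss rewards correct ordering of the preference pair:
print(pairwise_preference_loss(2.0, 0.0))  # small: chosen scored higher
print(pairwise_preference_loss(0.0, 2.0))  # large: rejected scored higher
```

The trained reward model then provides the scalar signal that a policy-gradient step (PPO in ChatGPT's recipe) optimizes the language model against.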
Image source: https://openai.com/blog/chatgpt
Some researchers have even claimed that ChatLLaMA is up to 15 times faster in training than ChatGPT.
However, some have questioned this claim, noting that the project does not provide a precise measurement methodology.
The project has only been online for two days and is still at an early stage. Users can extend it further with the following additions:
  • Checkpoints with fine-tuning weights;
  • Optimization techniques for rapid inference;
  • Support for packaging models into effective deployment frameworks.
Nebuly AI hopes more people will join in to create more efficient and open ChatGPT-like assistants.
How to use it? First, install the package using pip:
pip install chatllama-py
Next, clone the LLaMA model:
git clone https://github.com/facebookresearch/llama.git
cd llama
pip install -r requirements.txt
pip install -e .
Once everything is ready, you can run it. The project introduces training examples for ChatLLaMA 7B, and interested parties can check the original project.
Reference links:
https://www.linkedin.com/posts/activity-7035964259431763970-YdMK/
https://medium.com/@enryu9000/mini-post-first-look-at-llama-4403517d41a1