Defeating GPT-3 with 1/10 Parameter Size: In-Depth Analysis of Meta’s LLaMA

On February 25, 2023 (Beijing time), Yann LeCun announced that Meta AI had publicly released LLaMA (Large Language Model Meta AI), a family of large language models in four sizes: 7 billion, 13 billion, 33 billion, and 65 billion parameters. The aim is to promote research on the miniaturization and democratization of LLMs.
Guillaume Lample claimed on Twitter (with the announcement copy reportedly drafted by LLaMA itself) that the 13-billion-parameter LLaMA outperforms OPT and the 175-billion-parameter GPT-3 on most benchmarks, while the 65-billion-parameter version is broadly competitive with much larger models such as Chinchilla (70 billion parameters) and PaLM (540 billion parameters).
LLaMA was developed by Meta AI’s FAIR team, trained from December 2022 to February 2023. The version released on GitHub is the V1 version of this model. Similar to the GPT series, LLaMA is also an autoregressive language model based on the Transformer architecture. For those unfamiliar with the Transformer architecture, you can read this article: “The Eve of the AI LLM Revolution: Understanding the Transformer Model that Swept Natural Language Processing”.
(1) Source Code
https://github.com/facebookresearch/llama
(2) Paper
https://research.facebook.com/file/1574548786327032/LLaMA--Open-and-Efficient-Foundation-Language-Models.pdf
Meta AI also released the paper “LLaMA: Open and Efficient Foundation Language Models” on its official site.
Yann LeCun was visibly excited about the release, which also reads as a pointed response to the criticism he had drawn earlier for his comments on ChatGPT.
Next, based on the currently available information about LLaMA, we will quickly browse through some key details, including the model architecture, parameters, training and evaluation datasets, computational carbon footprint, and some examples of LLaMA’s use. Let’s set sail together.

01

Model Architecture and Parameters

Like other large language models, LLaMA generates text autoregressively: it takes a sequence of words as input and repeatedly predicts the next word. To train the model, Meta chose text from the 20 languages with the most speakers, focusing on those written in Latin and Cyrillic scripts.
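To make the autoregressive setup concrete, here is a minimal greedy-decoding sketch in Python; the toy vocabulary and the dummy scoring function are stand-ins for the real Transformer forward pass and are assumptions purely for illustration:

    import random

    # A stand-in "model": given a token sequence, return scores over a toy vocabulary.
    # In the real model this would be a Transformer forward pass; the dummy scorer here
    # only exists to illustrate the loop.
    VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

    def dummy_next_token_scores(tokens):
        rng = random.Random(len(tokens))  # deterministic toy scores
        return {tok: rng.random() for tok in VOCAB}

    def generate(prompt_tokens, max_new_tokens=10):
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            scores = dummy_next_token_scores(tokens)  # model(tokens) in practice
            next_token = max(scores, key=scores.get)  # greedy: take the highest-scoring token
            if next_token == "<eos>":
                break
            tokens.append(next_token)                 # feed the extended sequence back in
        return tokens

    print(generate(["the", "cat"]))

Each new word depends on everything generated so far, which is exactly the autoregressive pattern GPT-style models follow.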
The model architecture is also based on the Transformer, but with several significant improvements:
(1) Pre-normalization, inspired by GPT-3: to improve training stability, LLaMA normalizes the input of each Transformer sub-layer rather than only normalizing outputs, using the RMSNorm proposed by Zhang and Sennrich (2019)[1] (see the code sketch after this list).
(2) SwiGLU activation function inspired by PaLM: The SwiGLU activation function proposed by Shazeer (2020)[2] replaces the familiar ReLU activation function.
(3) RoPE inspired by GPTNeo: the absolute positional embeddings are removed from the Transformer and replaced with the rotary positional embeddings (RoPE) proposed by Su et al. (2021)[3].
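To make these components more concrete, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block, written from the cited papers rather than taken from Meta's released code; the tensor shapes and 7B-scale dimensions below are only an example, and RoPE is omitted for brevity:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RMSNorm(nn.Module):
        """Root-mean-square layer norm (Zhang & Sennrich, 2019): no mean-centering, no bias."""
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Scale by the reciprocal RMS of the last dimension, then apply a learned gain.
            rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
            return x * rms * self.weight

    class SwiGLUFeedForward(nn.Module):
        """Feed-forward block with the SwiGLU activation (Shazeer, 2020): silu(x W1) * (x W3), then W2."""
        def __init__(self, dim: int, hidden_dim: int):
            super().__init__()
            self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
            self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
            self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.w2(F.silu(self.w1(x)) * self.w3(x))

    # Pre-normalization: each sub-layer sees RMSNorm(x) at its input, with a residual connection.
    x = torch.randn(2, 16, 4096)           # (batch, sequence length, model width)
    norm = RMSNorm(4096)
    ffn = SwiGLUFeedForward(4096, 11008)   # hidden size roughly 2/3 of 4x the model width
    y = x + ffn(norm(x))
    print(y.shape)                         # torch.Size([2, 16, 4096])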
Meta AI has announced the hyperparameters for its various sizes of LLaMA models:
params | dimension | n heads | n layers | learning rate | batch size | n tokens
6.7B   | 4096      | 32      | 32       | 3.0E-4        | 4M         | 1.0T
13.0B  | 5120      | 40      | 40       | 3.0E-4        | 4M         | 1.0T
32.5B  | 6656      | 52      | 60       | 1.5E-4        | 4M         | 1.4T
65.2B  | 8192      | 64      | 80       | 1.5E-4        | 4M         | 1.4T
Comparing these configurations with GPT-3's, we can see that among the four LLaMA versions:
(1) LLaMA-7B lines up with the GPT-3 6.7B version: both have 32 layers, 32 attention heads, and a model width of 4096. LLaMA's learning rate of 3.0E-4 is higher than GPT-3's 1.2E-4, and its batch size of 4M tokens is larger.
(2) LLaMA-13B lines up with the GPT-3 13B version: both have 40 layers and 40 attention heads, with model widths of 5120 and 5140 respectively. Again, LLaMA's learning rate of 3.0E-4 is higher than GPT-3's 1.0E-4, and its batch size of 4M tokens is larger.
(3) LLaMA-33B and LLaMA-65B have no direct GPT-3 counterparts; both remain far smaller than the largest 175B GPT-3 model. Meta AI's goal is to show that much smaller models can match or even surpass the huge GPT-3, which is a driving force behind model miniaturization.

02

Data


Training Data for LLaMA

LLaMA's training data comes from the following sources, mixed in these proportions (a small sampling sketch follows the list):
(1) CommonCrawl, processed with the CCNet pipeline: 67%.
(2) C4: 15%. Those in the NLP field will be familiar with it; its full name is the Colossal Clean Crawled Corpus, first popularized by the Google T5 paper “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” By comparison, 60% of GPT-3's data came from Common Crawl, which OpenAI filtered with a quality classifier trained on positive and negative examples; LLaMA's C4 portion instead relies on heuristic quality filtering.
(3) GitHub: 4.5%, the code-hosting platform now owned by Microsoft.
(4) Wikipedia: 4.5%; GPT-3, by contrast, used only English Wikipedia.
(5) Books: 4.5%, compared with GPT-3, which sourced 16% of its data from books.
(6) ArXiv: 2.5%, the well-known open-access archive of academic preprints, founded in 1991 and now operated by Cornell University.
(7) Stack Exchange: 2%, a network of online Q&A communities of which Stack Overflow, the programmers' site, is the best known.
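As a rough illustration of how such a mixture translates into sampling, here is a small Python sketch that draws training sources in these proportions; the source labels are just strings, and the sampler is an assumption for illustration rather than Meta's actual data loader:

    import random

    # Sampling proportions reported for LLaMA's pre-training mixture (not disk sizes).
    MIXTURE = {
        "CommonCrawl (CCNet)": 0.67,
        "C4": 0.15,
        "GitHub": 0.045,
        "Wikipedia": 0.045,
        "Books": 0.045,
        "ArXiv": 0.025,
        "StackExchange": 0.02,
    }

    def sample_source(rng: random.Random) -> str:
        """Pick a data source with probability proportional to its mixture weight."""
        return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]

    rng = random.Random(0)
    counts = {name: 0 for name in MIXTURE}
    for _ in range(10_000):
        counts[sample_source(rng)] += 1
    print(counts)  # empirical counts should roughly track the 67/15/4.5/... split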

Evaluation Data and Performance of LLaMA

As these sources show, the training data comes primarily from the web, and Meta AI acknowledges that it contains offensive, harmful, and biased content. Meta AI therefore evaluated the model's biases on the RAI datasets, measuring bias with respect to gender, religion, race, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. It also assessed the toxicity of the model's generations as a function of how harmful the provided context is:
As mentioned earlier, Meta AI also classified web text: pages whose content resembles Wikipedia, or that are cited by Wikipedia, are treated as high quality. This is similar in spirit to how the WebText corpus used by OpenAI treated pages linked from Reddit posts with at least 3 karma as high quality. For this step, Meta AI used a Kneser-Ney language model and a fastText linear classifier.
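As a hedged illustration of what such a classifier could look like, here is a small sketch using fastText's Python bindings; the toy training file, the label scheme, and the threshold are assumptions for illustration and not Meta's actual pipeline:

    import fasttext

    # Write a tiny toy training file in fastText's supervised format ("__label__" prefix).
    # A real pipeline would use pages cited by Wikipedia as "high" and random web pages
    # as "low"; the two lines below are illustrative assumptions only.
    with open("quality_train.txt", "w") as f:
        f.write("__label__high The Peace of Westphalia ended the Thirty Years' War in 1648 .\n")
        f.write("__label__low click here for cheap deals best prices buy now limited offer !!!\n")

    model = fasttext.train_supervised(input="quality_train.txt", epoch=25, wordNgrams=2)

    def keep_page(text: str, threshold: float = 0.5) -> bool:
        """Keep a page if the linear classifier thinks it resembles Wikipedia-cited content."""
        labels, probs = model.predict(text.replace("\n", " "))
        return labels[0] == "__label__high" and probs[0] >= threshold

    print(keep_page("The treaty was signed in 1648 and reshaped European politics."))

The Kneser-Ney language model mentioned above is used in the CCNet pipeline for perplexity-based filtering, complementing this kind of classifier.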
Meta AI states that the training data covers 20 languages, but the majority of the content is still in English, so, like GPT-3, the model performs best in English. OpenAI has likewise noted that because English content dominates, model outputs tend to align more closely with the values of English-speaking populations, which is a potential issue.
LLaMA was evaluated on the following datasets: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU, BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, and CrowS-Pairs. The paper reports a large number of experimental results; a few excerpts follow.
In some reasoning tasks, LLaMA performed as follows:
Performance comparison on NaturalQuestions with other models:
Comparison of reading comprehension performance:
Performance comparison on TriviaQA for zero-shot and few-shot question answering:
Performance comparison in code generation:
Performance comparison on massive multitask language understanding (MMLU) is as follows; complete per-model results can be found in Table 16 of Appendix B of the paper.

03

Computational Power

Meta AI also estimated the carbon footprint of training as a way of quantifying energy consumption. As LLM applications proliferate, such environmental concerns will draw increasing attention.
Meta AI compared the carbon footprint of training OPT, BLOOM, and the LLaMA models as if they had been trained in the same data center. For the power draw of an A100-80GB it used the thermal design power (TDP) of the NVLink system, 400W, together with a PUE of 1.1 and a carbon intensity factor set to the U.S. national average of 0.385 kg CO2e per KWh.
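The arithmetic behind these estimates is straightforward: energy = GPU-hours × GPU power × PUE, and emissions = energy × carbon intensity. Below is a small Python sketch of that formula; the GPU-hour figure is an illustrative order of magnitude, not one of the exact per-model numbers reported in the paper:

    def carbon_footprint(gpu_hours: float, gpu_power_w: float = 400.0,
                         pue: float = 1.1, kg_co2e_per_kwh: float = 0.385):
        """Return (energy in MWh, emissions in tCO2e) for a training run."""
        energy_mwh = gpu_hours * gpu_power_w * pue / 1_000_000  # W*h -> MWh
        tco2e = energy_mwh * kg_co2e_per_kwh                    # 0.385 kg/KWh == 0.385 t/MWh
        return energy_mwh, tco2e

    # Roughly one million A100-80GB GPU-hours (an assumed round number for illustration).
    print(carbon_footprint(1_000_000))  # ~440 MWh and ~170 tCO2e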

04

Can I try it out?

Let’s check out some examples provided by LLaMA.

LLaMA is not yet open for use, but you can apply for access at the following link and enter the waiting list:
https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform
However, we can take a look at the examples released by Meta AI, which are quite interesting. Below is the output from LLaMA-65B (without instruction fine-tuning), with the bolded parts being the prompts given:
Writing a recommendation letter:
Given the definition and comments of a Python function, continuing the code:
The Meta AI team even jokingly teased their boss LeCun:
Demonstrating the ability to create fictional dialogues based on a given scenario:
Meta AI also showcased several examples generated by LLaMA-I, which is the result of fine-tuning LLaMA-65B using the protocol and instruction dataset from Chung et al. (2022). The first is a dialogue between the Sun and Pluto:
Next, generating two code examples in JavaScript for sending HTTP requests:
Writing a regular expression in Python to remove HTML tags, as well as one to extract function definitions (honestly, using a model like ChatGPT to write regular expressions is very efficient; writing them by hand is famously unintuitive):
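As a point of comparison (my own sketch, not the model's output), regular expressions of this kind in Python typically look like the following:

    import re

    html = "<p>Hello, <b>LLaMA</b>!</p>"
    code = "def foo(x):\n    return x\n\ndef bar(a, b=2):\n    return a + b\n"

    # Strip HTML tags: remove anything between '<' and the next '>'.
    # Good enough for simple markup; a real HTML parser is safer on messy pages.
    plain_text = re.sub(r"<[^>]+>", "", html)
    print(plain_text)  # Hello, LLaMA!

    # Extract Python function definitions: capture the name and parameter list after 'def'.
    for name, params in re.findall(r"def\s+(\w+)\s*\(([^)]*)\)\s*:", code):
        print(name, params)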
In a series of continuous multi-turn dialogues, LLaMA also performed very well:
Writing a short essay is also within its capabilities:
Inventing a theory explaining that cats never existed (the researchers have quite the imagination):
Writing a scene of an argument between Julius Caesar and Napoleon:
Sending an email requesting people to use language models responsibly:
Another multi-turn dialogue, this time involving many real-world entities to probe the accuracy of the model's world knowledge; it correctly stated that it was Einstein who proposed mass-energy equivalence:
Letting LLaMA pretend to be a runnable bash terminal:
These examples are quite exciting for me.

05

Some Comments and Future Impact

This time Meta AI has released the code and made the model weights available to researchers, promoting the miniaturization and democratization of models. This will be a great help to startups and researchers active in the AI field and deserves close attention. Meta AI chose to release on a Friday, catching the other big companies off guard and letting the news ferment over the weekend; for everyone in the ecosystem outside those companies, though, this is good news.
In the paper, Meta AI concludes that LLaMA-13B outperforms GPT-3 while being more than 10 times smaller, and that LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B. Unlike much previous work, it shows that state-of-the-art performance can be achieved by training exclusively on publicly available data, without resorting to proprietary datasets. Meta hopes that releasing these models to the research community will accelerate the development of large language models and help improve their robustness while mitigating known issues such as toxicity and bias. Meta AI also observed that instruction fine-tuning, following Chung et al. (2022), yields further improvements, and plans to investigate this in future work.
Meta AI also notes that, in its experiments so far, performance keeps improving as parameter counts and data scale grow (see the scaling curves in the paper). Accordingly, Meta AI stated at release that it plans to publish models trained at larger data and parameter scales in the future.
It can be anticipated that 2023 will be a year of great activity, and we will undoubtedly witness many significant events in the development of models that will be recorded in the history of AI technology and applications.

References

[1] Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32.

[2] Noam Shazeer. 2020. GLU variants improve transformer. arXiv preprint arXiv:2002.05202.

[3] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.

[4] LLaMA GitHub repository from Meta AI Research – FAIR team, https://github.com/facebookresearch/llama

[5] Large Language Model LLaMA from Meta AI FAIR team, https://ai.facebook.com/blog/large-language-model-llama-meta-ai/

[6] LLaMA: Open and Efficient Foundation Language Models, https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/

[7] Yann LeCun's tweet on the LLaMA release, https://twitter.com/ylecun/status/1629243179068268548

[8] LLaMA announcement tweet from Guillaume Lample (Meta AI), https://twitter.com/GuillaumeLample/status/1629151231800115202

[9] Hyung Won Chung, Le Hou, S. Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Wei Yu, Vincent Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed Huai hsin Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc Le, and Jason Wei. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
