Ten Research Directions Worth Exploring in Large Models

Source: OpenBMB Open Source Community

This article is 14,629 words long; a 20-minute read is recommended.
In this article, Professor Liu Zhiyuan, an associate professor in the Department of Computer Science and Technology at Tsinghua University and one of the main initiators of OpenBMB, answers the Zhihu question, "What are the academic research directions in the field of large models (LLM)?"

△ This article is a repost of Professor Liu Zhiyuan’s answer to the Zhihu question, “What are the academic research directions in the field of large models (LLM)?”
Original answer:
I feel a responsibility to answer this question, and I happened to write this answer while on the high-speed train. At the beginning of 2022, I gave a talk titled "Ten Questions About Large Models," sharing ten issues we believe are worth exploring in large models. At that time, large models were not yet so popular; now they are well known and evolving rapidly. Still, most of the ten issues raised then have not become outdated. Since the talk aligns well with this question, I will use its framework as a blueprint, slightly updated, as my answer, hoping that more researchers can find their research direction in the era of large models.

I have seen some comments saying that after the emergence of large models, there is not much left to do in NLP. In my view, when a technological revolution like large models occurs, although many old problems are solved or disappear, our tools for understanding and transforming the world become stronger, and many new problems and scenarios emerge, waiting for us to explore. Therefore, whether in natural language processing or other related fields of artificial intelligence, students should be grateful that a technological revolution is happening in their field, right next to them. They are much closer to the center of this revolution than others and are better prepared to embrace this new era, with more opportunities to make foundational innovations. I hope more students can actively embrace this new change, quickly stand on the shoulders of giants in large models, and actively explore or even carve out their own direction, methods, and applications.

🧾 Outline

1. Fundamental Theory: What is the fundamental theory of large models?

2. Network Architecture: Is Transformer the ultimate framework?

3. Efficient Computation: How to make large models more efficient?

4. Efficient Adaptation: How to adapt large models to downstream tasks?

5. Controllable Generation: How to achieve controllable generation with large models?

6. Safe and Trustworthy: How to address the safety and ethical issues of large models?

7. Cognitive Learning: How to enable large models to acquire advanced cognitive abilities?

8. Innovative Applications: What innovative applications do large models have?

9. Data Evaluation: How to evaluate the performance of large models?

10. Usability: How to lower the threshold for using large models?

Direction One: The Fundamental Theory of Large Models

With the accumulation of rich empirical data on large models worldwide, people have found that large models exhibit many characteristics unlike those of earlier statistical learning models, deep learning models, or even pre-trained small models. Some of these, such as Few/Zero-Shot Learning, In-Context Learning, and Chain-of-Thought abilities, have already attracted academic attention; others, such as Emergence, Scaling Prediction, Parameter-Efficient Learning (which we call Delta Tuning), sparse activation, and functional partitioning, are not yet widely recognized. A solid theoretical foundation needs to be established for large models to ensure their steady progress. We have many questions regarding large models, for example:

► What——What have large models learned?

What do large models know and not know? What abilities can only be learned by large models that cannot be learned by small models? In 2022, Google published an article exploring the emergence phenomenon of large models, pointing out that many capabilities emerge magically after the model scale increases [1]. So what surprises are still hidden in large models? This question remains for us to explore.

► How——How to train large models well?

As model scale continues to grow (Scaling), grasping the laws of training large models [2] involves many issues: how to prepare and mix data, how to find the optimal training configuration, and how to predict downstream task performance in advance [3]. These are the How questions.
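
As a concrete illustration of scaling prediction, one can fit a power law of the form L(N) = a·N^(−α) + c to the losses of a family of smaller models and extrapolate to a larger one, in the spirit of [2]. Below is a minimal sketch; the loss values are made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative (parameter count, validation loss) pairs; real values would
# come from a family of models trained with the same data and recipe.
sizes = np.array([1e7, 1e8, 1e9, 1e10])
losses = np.array([3.9, 3.3, 2.8, 2.4])

def power_law(n, a, alpha, c):
    # L(N) = a * N^(-alpha) + c, the functional form of Kaplan-style fits.
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 1.0))

# Extrapolate before committing compute to a larger training run.
print(f"predicted loss at 1e11 params: {power_law(1e11, a, alpha, c):.2f}")
```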

► Why——Why are large models good?

This area already has many very important research theories [4,5,6], including over-parameterization and others, but the ultimate theoretical framework has not yet been unveiled.

Regarding questions like What, How, and Why, there are many theoretical issues worth exploring in large models. I remember a few years ago, Professor Huang Tiejun gave an example saying that the airplane was invented first, and then aerodynamics emerged. I think this kind of elevation from practice to theory is an inevitable historical trend and will also occur in the field of large models. This will undoubtedly become the foundation of the entire artificial intelligence discipline, thus it is listed as the first of the ten questions.

We also believe it is necessary to document the various characteristics presented by large models for in-depth research exploration. To this end, we plan to open-source a repository called BMPrinciples [https://github.com/openbmb/BMPrinciples] to collect and document phenomena during the development of large models, which will help the open-source community train better large models and understand them.

References:
[1] Wei et al. Emergent Abilities of Large Language Models. TMLR 2022.
[2] Kaplan et al. Scaling Laws for Neural Language Models. 2020.
[3] OpenAI. GPT-4 Technical Report. 2023.
[4] Nakkiran et al. Deep double descent: Where bigger models and more data hurt. ICLR 2020.
[5] Bubeck et al. A universal law of robustness via isoperimetry. NeurIPS 2021.
[6] Aghajanyan et al. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. ACL 2021.

Direction Two: The Network Architecture of Large Models

Currently, the mainstream network architecture for large models, Transformer, was proposed in 2017. As model scales grow, we also observe diminishing returns in performance improvement. Is Transformer the ultimate framework? Can we find a better and more efficient network architecture than Transformer? This is a fundamental question worth exploring.

In fact, the design of artificial neural networks in deep learning has long drawn inspiration from disciplines such as neuroscience. For the next generation of AI network architectures, we can likewise seek support and inspiration from related disciplines. For example, some scholars, inspired by mathematics, have proposed manifold networks in non-Euclidean space [1], attempting to incorporate geometric prior knowledge into the model; these are relatively novel research directions.

Some scholars have also attempted to draw inspiration from engineering and physics, such as State Space Models and dynamic systems. Neuroscience is also an important source of ideas for exploring new network architectures, with brain-like computing attempting architectures like Spiking Neural Networks. So far, there is no significant conclusion on what the next generation of foundational model network frameworks will be, and it remains an urgent issue to explore.
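
To make the state space idea concrete, here is a minimal sketch of the discrete linear recurrence at the core of such models (cf. the structured state space work cited below); the random parameterization is purely illustrative, not the structured one those papers actually use:

```python
import numpy as np

# Core state-space recurrence: h[t] = A h[t-1] + B u[t], y[t] = C h[t].
# S4-style models build on this with careful parameterization and fast
# convolution algorithms; this sketch shows only the basic computation.
rng = np.random.default_rng(0)
d_state, seq_len = 4, 10

A = 0.9 * np.eye(d_state) + 0.01 * rng.standard_normal((d_state, d_state))
B = rng.standard_normal((d_state, 1))
C = rng.standard_normal((1, d_state))

h = np.zeros((d_state, 1))
u = rng.standard_normal(seq_len)  # a toy 1-D input sequence
ys = []
for t in range(seq_len):
    h = A @ h + B * u[t]      # update hidden state from the current input
    ys.append(float(C @ h))   # linear readout of the hidden state
print(np.round(ys, 3))
```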

References:
[1] Chen et al. Fully Hyperbolic Neural Networks. ACL 2022.
[2] Gu et al. Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022.
[3] Gu et al. Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers. NeurIPS 2021.
[4] E, Weinan. A Proposal on Machine Learning via Dynamical Systems. Communications in Mathematics and Statistics 2017.
[5] Maass, Wolfgang. Networks of Spiking Neurons: The Third Generation of Neural Network Models. Neural Networks 1997.

Direction Three: The Efficient Computation of Large Models

Currently, large models often contain billions, hundreds of billions, or even trillions of parameters, and as their scale keeps growing, computational and storage costs rise significantly. Some scholars have proposed the concept of Green AI, treating computational energy consumption as a first-class factor in the design and training of AI models. For this reason, we believe it is necessary to build an efficient computation system for large models.

First, we need to build a more efficient system of distributed training algorithms. High-performance computing researchers have already explored this area extensively. For example, model parallelism [9], pipeline parallelism [8], and ZeRO-3 [1] distribute large model parameters across multiple GPUs; tensor offloading and optimizer offloading [2] move part of the burden from GPUs to cheaper CPUs and memory; recomputation [7] reduces the GPU memory footprint of the computation graph; mixed-precision training [10] uses Tensor Cores to speed up training; and automatic tuning algorithms [11, 12] select distributed operator strategies.
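
Two of the techniques above, mixed-precision training [10] and recomputation [7], can be tried in a few lines of standard PyTorch. This is a minimal sketch with a toy model and random data, not BMTrain's actual interface:

```python
import torch
from torch.utils.checkpoint import checkpoint

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # loss scaling keeps fp16 gradients stable

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

with torch.cuda.amp.autocast():  # mixed precision: fp16 matmuls on Tensor Cores
    # Recomputation: activations inside `model` are not stored; they are
    # recomputed during backward, trading extra compute for GPU memory.
    y = checkpoint(model, x, use_reentrant=False)
    loss = torch.nn.functional.mse_loss(y, target)

scaler.scale(loss).backward()  # backprop on the scaled loss
scaler.step(optimizer)         # unscales gradients, skips step on overflow
scaler.update()
```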

Currently, many influential open-source tools have been established in the model acceleration field. Internationally famous ones include Microsoft DeepSpeed and NVIDIA Megatron-LM, while well-known domestic tools include OneFlow and ColossalAI. In this regard, our OpenBMB community has launched BMTrain, which can reduce the training cost of large models at the scale of GPT-3 by over 90%.

In the future, how to automatically select the most suitable combination of optimization strategies based on hardware resource conditions among numerous optimization strategies is a problem worth further exploration. Additionally, existing work typically designs optimization strategies for general deep neural networks, and how to create targeted optimizations based on the characteristics of Transformer large models remains to be further researched.

Once a large model is trained and ready for use, inference efficiency becomes an equally important issue. One approach is to compress the trained model while sacrificing as little performance as possible. Techniques here include model pruning, knowledge distillation, parameter quantization, and so on. Recently, we also discovered that the sparse activation phenomenon exhibited by large models can be used to improve inference efficiency: the basic idea is to cluster neurons into groups according to their sparse activation patterns, so that each input activates only a very small number of neuron modules to complete the computation. We call this algorithm MoEfication [5].

In terms of model compression, we have also launched an efficient compression tool called BMCook [4], which significantly improves the compression ratio by integrating various compression techniques. Currently, it has implemented four mainstream compression methods, and different compression methods can be combined arbitrarily based on demand. Simple combinations can maintain about 98% of the original model’s performance at a 10x compression ratio. In the future, how to automatically implement combinations of compression methods based on the characteristics of large models is worth further exploration.
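
As one example of the kind of technique BMCook integrates, the following is a minimal sketch of a standard knowledge-distillation loss (softened teacher targets blended with cross-entropy); it illustrates the general recipe, not BMCook's actual implementation:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a KL term that pulls the student's
    softened distribution toward the teacher's (Hinton-style distillation)."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    return alpha * ce + (1 - alpha) * kl

# Toy usage: a batch of 4 examples over 10 classes.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.tensor([1, 0, 3, 7])
distillation_loss(student, teacher, labels).backward()
```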

Here, we provide some more detailed information about MoEfication [5]: based on the sparse activation phenomenon, we propose converting the feed-forward network into a mixture of experts network, dynamically selecting experts to enhance model efficiency. Experiments have shown that using only 10% of the feed-forward network’s computational capacity can achieve approximately 97% of the original model’s performance. Compared to traditional pruning methods that focus on parameter sparsity, the phenomenon of neuron sparse activation has not been widely researched, and relevant mechanisms and algorithms need to be explored.
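
The sketch below illustrates the MoEfication idea in a deliberately simplified form: neurons of a feed-forward layer are clustered into expert groups offline, and at inference only a few groups are computed. The clustering features and routing rule here are naive stand-ins for those actually studied in [5]:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, k = 64, 256, 8, 2

W1 = rng.standard_normal((d_ff, d_model))  # FFN up-projection
W2 = rng.standard_normal((d_model, d_ff))  # FFN down-projection

# Offline: group neurons into "experts". The paper clusters by co-activation
# patterns; clustering weight rows is a simplified stand-in here.
groups = KMeans(n_clusters=n_experts, n_init=10, random_state=0).fit_predict(W1)

def moefied_ffn(x):
    # Route: score each expert cheaply (here, by mean pre-activation) and
    # run only the top-k expert groups instead of all d_ff neurons.
    pre = W1 @ x
    scores = [pre[groups == e].mean() for e in range(n_experts)]
    top = np.argsort(scores)[-k:]
    mask = np.isin(groups, top)
    h = np.maximum(pre[mask], 0.0)  # ReLU on the selected neurons only
    return W2[:, mask] @ h

y = moefied_ffn(rng.standard_normal(d_model))
print(y.shape)  # (64,)
```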

References:

[1] Samyam Rajbhandari et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC 2020.
[2] Jie Ren et al. ZeRO-Offload: Democratizing Billion-Scale Model Training. USENIX ATC 2021.
[3] Dettmers et al. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022.
[4] Zhang et al. BMCook: A Task-agnostic Compression Toolkit for Big Models. EMNLP 2022 Demo.
[5] MoEfication: Transformer Feed-forward Layers are Mixtures of Experts. Findings of ACL 2022.
[6] The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers. ICLR 2023.
[7] Training Deep Nets with Sublinear Memory Cost. 2016.
[8] Fast and Efficient Pipeline Parallel DNN Training. 2018.
[9] Megatron-LM: Training Multi-billion Parameter Language Models Using Model Parallelism. 2019.
[10] Mixed Precision Training. 2017.
[11] Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization. OSDI 2022.
[12] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. OSDI 2022.

Direction Four: The Efficient Adaptation of Large Models

Once a large model is trained, how do we adapt it to downstream tasks? Model adaptation studies how to effectively apply the model to downstream tasks; the currently popular term for this is "alignment".

Traditionally, model adaptation focused more on the performance of specific scenarios or tasks. However, with the launch of ChatGPT, model adaptation has also begun to focus on enhancing general capabilities and aligning with human values. We know that the larger the foundational model, the better it performs on known tasks, while also showing potential for supporting complex tasks. Correspondingly, the computational and storage overhead for adapting larger foundational models to downstream tasks will also significantly increase.

This greatly raises the application threshold for foundational models. According to our statistics from papers before 2022, although pre-trained language models have become infrastructure, the proportion of papers actually using large models is still very low. A very important reason is that even though a large number of large models have been open-sourced worldwide, many research institutions still do not have sufficient computational resources to adapt large models to downstream tasks. Here, we can explore at least two solutions to improve model adaptation efficiency.

Solution one is prompt learning, which adds prompts to the input [1, 2, 3] so that various downstream tasks are recast as the language modeling task used during pre-training, thereby unifying downstream tasks with pre-training and improving model adaptation efficiency. In fact, the currently popular instruction tuning (Instruction Tuning) is a specific instance of the prompt learning idea.
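
A minimal sketch of the idea: a template plus a verbalizer recasts sentiment classification as a cloze-style language-modeling task, in the spirit of [2, 3]. The template wording is made up, and this is not OpenPrompt's API:

```python
# Sentiment classification recast as a cloze-style language-modeling task.
template = "Review: {text} Overall, the movie was [MASK]."
verbalizer = {"great": "positive", "terrible": "negative"}

prompt = template.format(text="The plot was gripping from start to finish.")
# A masked language model scores candidate words at the [MASK] position;
# whichever verbalizer word ("great" vs. "terrible") scores higher decides
# the predicted label -- no task-specific classification head is needed.
print(prompt)
```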

I commented on Weibo last year that prompt learning will become the feature engineering of the large model era. Now, many prompt engineering tutorials have emerged, indicating that prompt learning has become a standard for adapting large models.

Solution two is parameter-efficient tuning (Delta Tuning) [4, 5, 6], which keeps the vast majority of parameters unchanged and adjusts only a very small set of parameters in the large model. This greatly reduces the storage and computational costs of adapting large models, and when the foundational model is large enough (e.g., billions or hundreds of billions of parameters), parameter-efficient tuning can achieve results comparable to full-parameter fine-tuning. Parameter-efficient tuning has not yet received as much attention as prompt learning, but it actually reflects the unique characteristics of large models.

To explore the characteristics of parameter-efficient tuning, we conducted a systematic study last year and proposed a unified modeling framework: theoretically, we analyzed it from the perspectives of optimization and optimal control; empirically, we ran experiments on over 100 downstream tasks, examining overall performance, convergence efficiency, transferability, model-scale effects, and computational efficiency, and drew many novel conclusions about parameter efficiency. For example, parameter-efficient tuning exhibits a pronounced Power of Scale phenomenon: as the foundational model grows beyond a certain scale, the performance gap between different parameter-efficient tuning methods narrows, and their performance becomes comparable to full-parameter fine-tuning. This paper became a cover article of Nature Machine Intelligence this year [4], and everyone is welcome to download and read it.
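
As a concrete example, LoRA [6] is one representative method: the pretrained weight is frozen and only a low-rank update is trained. A minimal sketch, with illustrative hyperparameters:

```python
import torch

class LoRALinear(torch.nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update
    B @ A, following the idea of LoRA; only A and B receive gradients."""
    def __init__(self, linear: torch.nn.Linear, r: int = 8, scale: float = 1.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False  # the base model stays frozen
        self.A = torch.nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(linear.out_features, r))
        self.scale = scale  # zero-initialized B means no change at the start

    def forward(self, x):
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(torch.nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12,288 trainable parameters vs. ~590k frozen ones
```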

In these two directions, we have open-sourced two tools: OpenPrompt [7] and OpenDelta to promote research and application of large model adaptation. Among them, OpenPrompt is the first unified framework for prompt learning, which won the Best Demo Paper Award at ACL 2022; OpenDelta is the first parameter-efficient tuning toolkit that does not require any modification to model code and has also been accepted into the ACL 2023 Demo Track.

References:

[1] Tom Brown et al. Language Models are Few-shot Learners. NeurIPS 2020.
[2] Timo Schick et al. Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference. EACL 2021.
[3] Tianyu Gao et al. Making Pre-trained Language Models Better Few-shot Learners. ACL 2021.
[4] Ning Ding et al. Parameter-efficient Fine-tuning of Large-scale Pre-trained Language Models. Nature Machine Intelligence 2023.
[5] Neil Houlsby et al. Parameter-Efficient Transfer Learning for NLP. ICML 2019.
[6] Edward Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
[7] Ning Ding et al. OpenPrompt: An Open-Source Framework for Prompt-learning. ACL 2022 Demo.

Direction Five: The Controllable Generation of Large Models

A few years ago, I once imagined in a popular science report that natural language processing would leap from consuming existing data (natural language understanding) to producing new data (natural language generation), which would be a significant transformation. This wave of technological change in large models has greatly promoted the performance of AIGC, becoming a hot topic for research and application. How to accurately incorporate conditions or constraints into the generation process is an important exploration direction for large models.

Before the emergence of ChatGPT, there were already many exploratory solutions for controllable generation, such as using prompts in prompt learning to control the generation process. There have also long been some open questions in controllable generation, such as how to establish a unified controllable generation framework and how to build a scientific and objective evaluation method.

ChatGPT has made significant progress in controllable generation, and now there are relatively mature practices in controllable generation:

(1) Improving the large model's intent-understanding ability through instruction tuning [1, 2, 3], so that it accurately understands human input and responds accordingly; (2) writing appropriate prompts through prompt engineering to elicit the desired model output. This approach of controlling generation purely through natural language has achieved very good results. For complex tasks, we can also guide the model's generation with techniques like chain-of-thought [4], as in the example below.
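
A chain-of-thought prompt in the spirit of [4] prepends a worked example whose intermediate reasoning is written out; the example below is made up:

```python
# A one-shot chain-of-thought prompt: the worked example demonstrates
# intermediate reasoning, nudging the model to reason step by step.
prompt = """Q: A library has 120 books and lends out 45. Then 17 are returned.
How many books are in the library?
A: Start with 120. Lending 45 leaves 120 - 45 = 75. Returning 17 gives
75 + 17 = 92. The answer is 92.

Q: A farm has 58 sheep and sells 23. Then it buys 9 more. How many sheep now?
A:"""
# The model is expected to continue in the same style:
# "58 - 23 = 35, then 35 + 9 = 44. The answer is 44."
```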

The core goal of this technical solution is to enable the model to establish instruction-following capabilities. Recent research has found that acquiring this capability does not require particularly complex techniques; collecting enough diverse instruction data for fine-tuning can yield a good model. This is why so many custom open-source models have emerged recently.

To promote the development of such models, our laboratory has systematically designed a process to automatically generate diverse, high-quality multi-turn instruction dialogue data called UltraChat [5], and conducted detailed manual post-processing. We have now open-sourced all English data, totaling over 1.5 million entries, making it one of the largest sets of high-quality instruction data in the open-source community, and we look forward to everyone using it to train more powerful models.

References:
[1] Jason Wei et al. Finetuned Language Models are Zero-shot Learners. ICLR 2022.
[2] Victor Sanh et al. Multitask Prompted Training Enables Zero-Shot Task Generalization. ICLR 2022.
[3] Srinivasan Iyer et al. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization. Preprint 2022.
[4] Jason Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
[5] Ning Ding et al. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. Preprint 2023.

Direction Six: The Safety and Ethical Issues of Large Models

With large models represented by ChatGPT increasingly penetrating human daily life, the safety and ethical issues of large models have become increasingly prominent. OpenAI has invested a lot of effort into making ChatGPT better serve humanity in this regard. Numerous experiments have shown that large models exhibit good robustness against traditional adversarial attacks and OOD sample attacks [1], but in practical applications, they can still be susceptible to attacks.

Moreover, with the widespread use of ChatGPT, people have discovered many new attack methods. For example, the recent jailbreak of ChatGPT [2] (also known as prompt injection attacks), which exploits the model’s tendency to follow user instructions to induce it to provide incorrect or even dangerous responses. We need to recognize that as the capabilities of large models become stronger, any security vulnerabilities or loopholes in them could lead to more severe consequences than before. How to prevent and correct these vulnerabilities has become a hot topic after ChatGPT’s jailbreak [3].

Additionally, the content generated by large models and related applications also face various ethical issues. For instance, what to do if someone uses large models to generate fake news? How to prevent large models from producing biased and discriminatory content? What about students using large models to do their homework? These are real-world problems that have occurred, and there are currently no satisfactory solutions, making them excellent research topics.

Specifically, in terms of large model safety, we found that although large models exhibit good robustness against adversarial attacks, they are particularly vulnerable to deliberately implanted backdoors, which can cause large models to respond in specific ways in certain scenarios [4]. This is a very important security issue for large models. In this regard, we have previously developed two toolkits, OpenAttack and OpenBackdoor, aimed at providing researchers with a more standardized and extensible platform.

Furthermore, more and more large model providers are beginning to offer only the inference API of the model, which to some extent protects the model’s security and intellectual property. However, this paradigm also makes downstream adaptation of the model more difficult. To address this issue, we proposed a method called Decoder Tuning for downstream adaptation of black-box large models at the output end, achieving a 200-fold acceleration and state-of-the-art results in understanding tasks compared to existing methods. The related paper has been accepted into ACL 2023, and we welcome everyone to try it out.

In terms of large model ethics, how to align large models with human values is an important proposition. Previous research has shown that the larger the model, the more biased it tends to become [5]. The alignment algorithms such as RLHF and RLAIF that emerged after ChatGPT can effectively alleviate this issue, making large models more aligned with human preferences and generating higher-quality outputs. Compared to pre-training and instruction tuning techniques, feedback-based alignment is a novel research direction, and reinforcement learning is notoriously difficult to train, presenting many questions worth exploring.
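
As one concrete piece of the RLHF pipeline, a reward model is typically first trained on human preference pairs with a pairwise ranking loss before any reinforcement learning begins. Below is a minimal sketch of that loss, reflecting the standard recipe rather than any specific system discussed here:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    # Pairwise (Bradley-Terry style) loss for reward-model training:
    # push the score of the human-preferred answer above the other one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar reward-model scores for 16 preference pairs.
chosen = torch.randn(16, requires_grad=True)
rejected = torch.randn(16, requires_grad=True)
preference_loss(chosen, rejected).backward()
```
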
References:

[1] Wang et al. On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective. arXiv 2023.
[2] Ali Borji. A Categorical Archive of ChatGPT Failures. arXiv 2023.
[3] https://openai.com/blog/governance-of-superintelligence
[4] Cui et al. A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks. NeurIPS 2022 Datasets & Benchmarks.
[5] Lin et al. TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL 2022.

Direction Seven: The Cognitive Learning of Large Models

ChatGPT shows that large models have basically mastered human language: through instruction tuning, they understand user intent and complete tasks. Looking ahead, we can ask: what unique cognitive abilities do humans possess that large models still lack? In my view, advanced human cognition shows itself in solving complex tasks: decomposing a previously unseen complex task into simple tasks with known solutions, then reasoning over those simple tasks to complete the whole. Moreover, in this process we do not need to keep all information in our heads; instead, we make effective use of various external tools, as the saying goes, "The gentleman is not different by nature; he is good at making use of things."

This will be an important direction worth exploring for large models in the future. Although large models have made significant breakthroughs in many aspects, the problem of generating hallucinations remains serious, and they face challenges of being untrustworthy and unprofessional in specialized tasks. These tasks often require specialized tools or domain knowledge support to solve. Therefore, large models need to possess the ability to learn to use various specialized tools to better complete complex tasks.

Tool learning is expected to address the timeliness of knowledge, supplement professional expertise, and improve interpretability. In understanding complex data and scenarios, large models have begun to exhibit human-like reasoning and planning abilities, giving rise to the tool learning paradigm [1]. The core of this paradigm is to combine specialized tools with the strengths of large models to achieve greater accuracy, efficiency, and autonomy. Works like WebGPT / WebCPM [2, 3] have already enabled large models to use search engines and surf the web like humans, locating useful information to complete specific tasks.

Recently, the emergence of ChatGPT Plugins allows it to support the use of tools such as internet access and mathematical calculations, marking what is referred to as OpenAI’s “App Store” moment. Tool learning will undoubtedly become an important exploration direction for large models. To support the open-source community’s exploration of large model tool learning capabilities, we developed the tool learning engine BMTools [4], which is an open-source, scalable tool learning platform based on large language models. It standardizes and automates the calling processes of various tools (such as text-to-image models, search engines, stock queries, etc.) under the same framework. Developers can use BMTools to call various tool interfaces to complete tasks using given large model APIs (such as ChatGPT, GPT-4) or open-source models.
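
To show the general shape of tool learning, here is a heavily simplified sketch of a tool-calling loop. Every name in it (the registry, the JSON protocol, the `llm` callable) is a hypothetical placeholder, not BMTools' actual interface:

```python
import json

# Hypothetical registry: each tool is a plain Python function the model may call.
TOOLS = {
    "search": lambda query: f"(top search results for {query!r})",
    "calculator": lambda expression: str(eval(expression)),  # demo only, unsafe
}

def run_with_tools(llm, question, max_steps=5):
    """Ask the model to either call a tool (as JSON) or answer directly,
    feeding tool results back into the context -- the basic tool-use loop."""
    context = (
        'Reply with JSON: {"tool": "search", "input": "..."} to call a tool, '
        'or {"answer": "..."} to answer.\n'
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        reply = json.loads(llm(context))  # `llm` is any text-in/text-out model API
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["input"])
        context += f"Tool {reply['tool']} returned: {result}\n"
    return "(no answer within the step budget)"
```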

Furthermore, most existing efforts focus on improving the capabilities of a single pre-trained model. Now that a single large model is already quite capable, the future will see a leap from single-agent to multi-agent intelligence, with multiple models interacting, collaborating, or competing. For instance, Stanford University recently built a virtual town whose characters are played by large models [5]; the different roles interact and collaborate in a virtual sandbox environment, exhibiting a certain degree of social behavior. Interaction, collaboration, and competition among multiple models will be a highly promising research direction. Since there is as yet no mature solution for building multi-model interaction environments, we developed the open-source framework AgentVerse [6], which lets researchers build such environments with simple configuration files and a few lines of code. AgentVerse also integrates with BMTools: by adding tool links in the configuration files, models can access tools, achieving multi-model interaction with tools. In the future, we may even hire a "large model assistant team" that collaboratively calls tools to solve complex problems.

References:

[1] Qin, Yujia, et al. Tool Learning with Foundation Models. arXiv preprint arXiv:2304.08354 (2023).
[2] Nakano, Reiichiro, et al. WebGPT: Browser-assisted Question-answering with Human Feedback. arXiv preprint arXiv:2112.09332 (2021).
[3] Qin, Yujia, et al. WebCPM: Interactive Web Search for Chinese Long-form Question Answering. arXiv preprint arXiv:2305.06849 (2023).
[4] BMTools: https://github.com/OpenBMB/BMTools
[5] Park, Joon Sung, et al. Generative Agents: Interactive Simulacra of Human Behavior. arXiv preprint arXiv:2304.03442 (2023).
[6] AgentVerse: https://github.com/OpenBMB/AgentVerse

Direction Eight: The Innovative Applications of Large Models

Large models have tremendous application potential in numerous fields. In recent years, various applications have appeared on the cover of Nature, with large models playing a crucial role among them [2,3]. A well-known work in this regard is AlphaFold, which has had a revolutionary impact on protein structure prediction.

In the future, the key question in this direction will be how to integrate domain knowledge into AI’s strengths in large-scale data modeling and the generation process of large models. This is an important proposition for utilizing large models for innovative applications.

In this regard, we have already conducted some explorations in legal intelligence and biomedicine. For example, back in 2021, we launched the first Chinese legal intelligence pre-trained model, Lawformer, in collaboration with Power Law Intelligence, which can better handle long legal documents; we also proposed a unified pre-trained model capable of modeling both chemical expressions and natural language, KV-PLM, which can outperform human experts in specific biomedical tasks. Relevant results were published in Nature Communications and selected for the Editor’s Highlights section.

References:
[1] Jumper et al. Highly Accurate Protein Structure Prediction with AlphaFold. Nature 2021.
[2] Xiao et al. Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents. AI Open 2021.
[3] Zeng et al. A Deep-learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals. Nature Communications 2022.

Direction Nine: The Data and Evaluation Issues of Large Models

Throughout the development of deep learning and large models, the general principle that "more data brings more intelligence" has been consistently validated. Learning more open and complex knowledge from data of various modalities will be an important way to expand the capabilities of large models and raise their level of intelligence. Recently, OpenAI's GPT-4 [1] has extended language models with a deep understanding of visual signals, and Google's PaLM-E [2] has further integrated embodied signals for robotic control. Reviewing these cutting-edge developments, a mainstream technical route is emerging: take a large language model as the foundation and incorporate signals from other modalities, so that the knowledge and capabilities of the language model are carried into multimodal computation.

In this regard, our recent work [3] found that transferring visual modules across different large language model bases can significantly reduce the pre-training cost of multimodal large models. Our recent experiments also indicate that the newly open-sourced bilingual model CPM-Bee supports rapid training of multimodal large models, enabling open-domain Chinese-English multimodal dialogue about images with good interactive performance. Looking ahead, learning knowledge from more modalities and larger-scale data is the inevitable path for the development of large model technology.

On one hand, as large models grow larger, the types of structures, data sources, and training objectives are also becoming increasingly diverse. How much improvement in performance do these models actually achieve? In which areas do we still need to work hard? Regarding the performance evaluation of large models, we need a scientific standard to assess their strengths and weaknesses.

This has been an important proposition even before the emergence of ChatGPT; evaluation sets like GLUE and SuperGLUE have profoundly influenced the development of pre-trained models. We have also launched the CUGE Chinese Understanding and Generation Evaluation Benchmark [4] in the past few years, systematically evaluating models based on their scores across different metrics, datasets, tasks, and capabilities. This automated answer matching evaluation method was the primary evaluation method in the field of natural language processing before the rise of generative AI, with the advantage of fixed evaluation standards and fast evaluation speed. However, for generative AI, models tend to generate highly divergent and longer content, making it challenging to assess the diversity and creativity of generated content using automated evaluation metrics, thus bringing new challenges and research opportunities. Recently, large model evaluation methods can be roughly divided into the following categories:

Automated Evaluation Methods: Many researchers have proposed new automated evaluation methods, such as using multiple-choice questions [5], collecting human exam questions from elementary school to university and professional examinations in finance, law, etc., allowing large models to read options directly and provide answers for automated evaluation. This method is suitable for assessing large models’ capabilities in knowledge retention, logical reasoning, semantic understanding, and other dimensions.
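
A minimal sketch of such automated evaluation: score each option by the likelihood the model assigns to it and count how often the gold option wins. The `llm_logprob` callable is a placeholder for whatever scoring API a given model exposes:

```python
def multiple_choice_accuracy(llm_logprob, questions):
    """Accuracy under a likelihood-based multiple-choice protocol.
    `llm_logprob(prompt, option)` should return log P(option | prompt)."""
    correct = 0
    for q in questions:
        scores = {opt: llm_logprob(q["prompt"], opt) for opt in q["options"]}
        correct += max(scores, key=scores.get) == q["gold"]
    return correct / len(questions)

# Toy usage with a fake scorer that prefers shorter options.
fake_scorer = lambda prompt, option: -len(option)
questions = [{"prompt": "2 + 2 = ?", "options": ["4", "22"], "gold": "4"}]
print(multiple_choice_accuracy(fake_scorer, questions))  # 1.0
```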

Model Evaluation Methods: Some researchers have proposed using more powerful large models as judges [6]. For example, directly providing GPT-4 and other models with the original questions and two models’ responses, using prompts to have GPT-4 score the responses from the two models. This method has some issues, such as being limited by the judging model’s capabilities and the judging model’s tendency to favor one model’s responses, but it has the advantage of being able to execute automatically without the need for evaluators, providing a certain degree of reference for assessing model capabilities.
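
A minimal sketch of this model-as-judge setup; the prompt wording and the `judge` callable are illustrative placeholders. Scoring both answer orderings and averaging is one simple way to dampen the judge's tendency to favor answers in a particular position:

```python
import json

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
score each from 1 to 10 for helpfulness and accuracy, then explain briefly.
Output JSON: {{"score_a": ..., "score_b": ..., "reason": "..."}}

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def pairwise_judge(judge, question, answer_a, answer_b):
    """Score two answers with a stronger model; `judge` is any callable that
    takes a prompt and returns a JSON string. Both orderings are scored and
    averaged to reduce ordering effects."""
    first = json.loads(judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b)))
    swapped = json.loads(judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_b, answer_b=answer_a)))
    score_a = (first["score_a"] + swapped["score_b"]) / 2
    score_b = (first["score_b"] + swapped["score_a"]) / 2
    return score_a, score_b
```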

Manual Evaluation Methods: Manual evaluation is currently the most reliable method. However, due to the diversity of generated content, how to design a reasonable evaluation system and align the cognitive levels of annotators has become a new issue. Currently, both domestic and international research institutions have launched “arenas” for evaluating large model capabilities, requiring users to blind evaluate the responses of different models to the same question. There are many interesting questions involved, such as whether automated metrics can be designed to assist annotators during the evaluation process, whether a question’s answer can be scored from different dimensions, and how to select relatively reliable answers from online crowd-sourced annotators. These questions are worth practicing and exploring.

References:

[1] OpenAI. GPT-4 Technical Report. 2023.
[2] Driess D, Xia F, Sajjadi M S M, et al. PaLM-E: An Embodied Multimodal Language Model. arXiv preprint arXiv:2303.03378, 2023.
[3] Zhang A, Fei H, Yao Y, et al. Transfer Visual Prompt Generator across LLMs. arXiv preprint arXiv:2305.01278, 2023.
[4] Yao Y, Dong Q, Guan J, et al. CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark. arXiv preprint arXiv:2112.13610, 2021.
[5] Chiang, Wei-Lin et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. 2023.
[6] Huang, Yuzhen et al. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv preprint arXiv:2305.08322, 2023.

Direction Ten: The Usability of Large Models

Large models have shown a strong trend towards generality, specifically reflected in the increasingly unified Transformer network architecture and the increasingly unified foundational models across various fields. This brings the possibility of establishing standardized large model systems (Big Model Systems) to deploy artificial intelligence capabilities with low thresholds across various industries. Inspired by the successful realization of standardization in database systems and big data analysis systems in computer history, we should encapsulate complex, efficient algorithms at the system level while providing users with understandable yet powerful interfaces.

It is precisely in this spirit that, since 2021, we have been building the OpenBMB open-source community (short for Open Lab for Big Model Base) with the goal of "bringing large models into thousands of households." We have successively released a complete tool system covering efficient training, fine-tuning, compression, inference, and application, including the efficient training tool BMTrain, the efficient compression tool BMCook, the low-cost inference tool BMInf, the tool learning engine BMTools, and more. The OpenBMB system fully supports our self-developed Chinese CPM series of large models, and we have just open-sourced the latest version, the bilingual foundational model CPM-Bee. In my view, a large model must not only perform well but also come with a robust tool system to be truly usable. We will therefore continue to deepen our work on the CPM models and the OpenBMB tool system, striving to build the best large model system in the Chinese-speaking world. We welcome everyone to use these tools and give us suggestions and feedback, to jointly build this open-source large model community that belongs to all of us.

——END——
