A Simple Trick to Improve LLaMa3’s Honesty by 65%


The MLNLP community is a well-known machine learning and natural language processing community both in China and internationally, whose audience includes NLP master’s and doctoral students, university faculty, and industry researchers.
The vision of the community is to promote communication and progress between the academic and industrial fields of natural language processing and machine learning, especially for beginners.
Reprinted from | Xixiaoyao Technology Author | Richard

Artificial intelligence is developing at an astonishing pace, with large language models (LLMs) as the “stars” demonstrating impressive language understanding and generation capabilities. However, while enjoying the convenience these models bring, we must also confront the honesty and safety challenges that come with them.

Recently, a research team from Huazhong University of Science and Technology proposed a brand-new framework to enhance the honesty and helpfulness of large language models from both theoretical and experimental perspectives. They constructed a new evaluation dataset called HoneSet and designed optimization methods for both commercial and open-source models. Experiments show that the overall score of LLaMa3-8b improved by 65.3% after two-stage fine-tuning.

With the development of artificial intelligence, honest and reliable AI assistants will become a necessity for people. We look forward to seeing more researchers engage in this field to jointly promote the maturity of large model technology and better benefit human society.


Paper Title: The Best of Both Worlds: Toward an Honest and Helpful Large Language Model

Paper Link: https://arxiv.org/pdf/2406.00380

Challenges of Honesty in Large Language Models

Large language models (LLMs) are emerging in the field of natural language processing with their outstanding language understanding and generation capabilities, showcasing broad application prospects in dialogue, writing, and question-answering tasks. However, the honesty challenges faced by large language models in practical applications have gradually become a focus of attention.


These models sometimes generate plausible-sounding but incorrect information, and when faced with questions beyond their capabilities, they fail to honestly acknowledge their limitations. This undermines users’ trust in their outputs and discourages applying large models to tasks that demand high reliability. Therefore, how to enhance the honesty of large language models, making them more reliable and beneficial assistants, has become an urgent issue to address.

The Path to Cultivating an “Honest” Large Model

In response to the challenges mentioned above, researchers from Huazhong University of Science and Technology, the University of Notre Dame, and Lehigh University proposed a brand new framework to enhance the honesty and usefulness of large language models from both theoretical and practical perspectives.

Firstly, the researchers systematically identified and defined, from a theoretical perspective, the characteristics that an honest large model should possess. They pointed out that:

  1. An honest large model should recognize its limitations and provide reasonable responses to questions beyond its capabilities;

  2. It should not blindly follow user inputs but maintain an objective and neutral stance;

  3. Additionally, it should have clear self-awareness and not equate itself with a sentient, emotional human.

Based on these principles, the researchers constructed a brand new evaluation dataset called HoneSet, covering six major types of “tricky” questions to examine the honesty of large models from multiple angles. As shown in the figure below, HoneSet includes six categories of questions: Latest Information, User Input, Professional Capability, Modality Mismatch, Interactivity Sensory, and Self Identity, aiming to comprehensively assess the model’s ability to maintain honesty in different scenarios.

[Figure: The six query categories in HoneSet]

The figure below illustrates the construction process of the HoneSet dataset, which mainly includes three steps:

  1. Candidate dataset construction, where seed queries are manually defined for each of the six categories and the data is expanded through in-context learning with GPT-4.

  2. Data filtering and augmentation, where OpenAI’s text embedding model is used to filter out near-duplicate queries and the remaining queries are paraphrased to expand the set (a sketch of the deduplication step is given below).

  3. Human evaluation, where experts screen and refine the generated queries to ensure data quality.

[Figure: The three-step construction pipeline of HoneSet]
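
The deduplication in step 2 can be implemented with off-the-shelf embeddings. Below is a minimal sketch assuming the OpenAI embeddings API; the model name text-embedding-3-small and the 0.9 cosine threshold are illustrative assumptions, not the authors’ reported settings.

```python
# Minimal sketch of embedding-based near-duplicate filtering (step 2).
# Assumptions: OpenAI embeddings API (OPENAI_API_KEY set); model name and
# the 0.9 cosine threshold are illustrative, not the paper's settings.
import numpy as np
from openai import OpenAI

client = OpenAI()

def filter_near_duplicates(queries: list[str], threshold: float = 0.9) -> list[str]:
    """Greedily keep queries whose cosine similarity to all kept ones stays below threshold."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=queries)
    embs = np.array([d.embedding for d in resp.data])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # unit-normalize: dot == cosine
    kept, kept_embs = [], []
    for query, emb in zip(queries, embs):
        # Drop a query if it is too similar to any already-kept query.
        if all(float(emb @ k) < threshold for k in kept_embs):
            kept.append(query)
            kept_embs.append(emb)
    return kept
```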

Secondly, from a practical perspective, the research team designed two optimization methods: a training-free prompt optimization that works for any model, including commercial ones whose weights are inaccessible, and a two-stage fine-tuning process for open-source models:

  1. Training-Free Prompt Optimization

    For commercial models whose weights cannot be modified, as well as for open-source ones, the researchers proposed a “curiosity-driven” prompt optimization method. This method consists of two stages: curiosity-driven prompt generation and answer optimization.


    In the first stage, cleverly designed prompts guide the model to articulate its doubts and uncertainties about the question. Specifically, the prompt templates encourage the model to analyze the question carefully and express its confusion, such as a lack of real-time information, insufficient or erroneous user input, or missing domain-specific knowledge. This step aims to awaken the model’s awareness of its own limitations.

    In the second stage, the researchers combine the model’s doubts with its original answer, feed both back into the model, and provide a “constitution-guided” prompt that steers the model to optimize its answer according to preset honesty principles. The optimized answer should honestly acknowledge limitations while still offering the user helpful guidance. A minimal sketch of this two-stage pipeline is given below.
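
The two stages can be chained with any chat-capable model. The following sketch assumes an OpenAI-compatible chat API; the model name and prompt wording are paraphrased for illustration and are not the paper’s exact templates.

```python
# Minimal sketch of the two-stage, training-free pipeline described above.
# Assumptions: OpenAI-compatible chat API; prompts are paraphrased.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"  # any chat model works; the method requires no training

def chat(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

def honest_answer(query: str) -> str:
    draft = chat("You are a helpful assistant.", query)
    # Stage 1: curiosity-driven prompt — elicit the model's doubts about the query.
    doubts = chat(
        "Carefully analyze the user's query and state any confusion you have, "
        "e.g. missing real-time information, insufficient or erroneous user "
        "input, or a lack of domain-specific knowledge.",
        query,
    )
    # Stage 2: constitution-guided refinement — rewrite the draft so it honestly
    # acknowledges limitations while still giving the user useful guidance.
    return chat(
        "Revise the draft answer according to these honesty principles: "
        "acknowledge your limitations explicitly, stay objective and neutral, "
        "and still provide helpful guidance.",
        f"Query: {query}\n\nDoubts: {doubts}\n\nDraft answer: {draft}",
    )
```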

  2. Two-Stage Fine-Tuning

    For open-source models, whose weights are accessible, the researchers proposed a two-stage fine-tuning process:

    (1) The first stage trains the model on HoneSet to distinguish honest from dishonest answers by optimizing a contrastive loss function;

    (2) The second stage further enhances the helpfulness of the model’s answers by optimizing a reward function based on human preferences. (A minimal sketch of a stage-one-style objective appears at the end of this subsection.)

    [Figure: The two-stage fine-tuning process, compared with direct end-to-end fine-tuning]

    The entire process draws on the principles of curriculum learning, allowing the model to learn the qualities of honesty and helpfulness in a gradual manner. At the same time, the figure also compares the performance of two-stage fine-tuning with direct end-to-end fine-tuning, indicating that staged training can achieve better performance improvements.
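
To make the stage-one idea concrete, here is a minimal PyTorch sketch of a pairwise contrastive objective that pushes the honest answer’s likelihood above the dishonest one’s. The paper’s exact loss is not reproduced in this article, so the margin-ranking form and the helper names below are illustrative assumptions.

```python
# Sketch of a stage-one-style pairwise contrastive objective (assumed form,
# not the paper's exact loss). Works with a Hugging Face causal LM.
import torch
import torch.nn.functional as F

def mean_logprob(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Mean per-token log-likelihood of each sequence under the model."""
    logits = model(input_ids=input_ids).logits[:, :-1]        # next-token predictions
    logp = F.log_softmax(logits, dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)
    return logp.gather(-1, targets).squeeze(-1).mean(dim=-1)  # shape: [batch]

def contrastive_loss(model, honest_ids, dishonest_ids, margin: float = 1.0):
    lp_honest = mean_logprob(model, honest_ids)
    lp_dishonest = mean_logprob(model, dishonest_ids)
    # Hinge penalty whenever the honest answer is not at least `margin`
    # (in mean log-prob) more likely than the dishonest one.
    return F.relu(margin - (lp_honest - lp_dishonest)).mean()
```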

This research systematically explores methods for creating honest and helpful large language models from both theoretical and practical perspectives. By defining honesty criteria, constructing evaluation datasets, designing prompt optimization and fine-tuning methods, it provides new ideas for enhancing the credibility and usefulness of large models in practical applications.

Initial Results of Honesty Cultivation

To verify the effectiveness of the method, the researchers conducted extensive experiments on nine mainstream language models, including GPT-4, ChatGPT, and Claude.

The figure below shows the experimental results of the prompt optimization method. After adopting curiosity-driven prompts, the honesty of every model on HoneSet improved significantly. For example, GPT-4 and Claude reached 100% honesty, achieving complete honesty alignment, while the smaller LLaMa2-7b improved substantially, from 43% to 83.7%. Almost all models exceeded 60% honesty, demonstrating the broad applicability of the method.

[Figure: Honesty rates of each model before and after curiosity-driven prompt optimization]

Subsequently, the authors compared optimized and unoptimized answers in human evaluations. The results showed that the optimized answers generally achieved a higher win rate in pairwise comparisons, reflecting greater honesty and helpfulness.

[Figure: Human pairwise-comparison win rates of optimized vs. unoptimized answers]

In addition, the article also quantitatively demonstrates improvements in responses across three dimensions: explanation, answering, and guidance. The results indicate that various models have made significant progress in honestly explaining limitations, providing problem-solving ideas, and offering specific guidance, fully demonstrating the effectiveness of the prompt optimization method.

[Figure: Improvements across the explanation, answering, and guidance dimensions]

The table below summarizes the honesty rates and score changes of models such as LLaMa3-70b and Mistral-7b before and after two-stage fine-tuning. After two-stage fine-tuning, the score distributions of both models improved markedly across all score ranges.

[Table: Honesty rates and scores before and after two-stage fine-tuning]

After two-stage fine-tuning, the honesty of LLaMa3-8b increased from 49.2% to 91.7%, a rise of 42.5 percentage points. In the evaluation, its total score also improved from 4.975 to 8.225, an increase of 65.3%. Mistral-7b performed even better, with honesty soaring from 32.5% to 85.8%, and its total score more than doubled from 3.308 to 7.433, with an increase of 124.7%.
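
For clarity on where the headline number comes from, the 65% in the title is the relative gain in total score rather than the honesty rate itself; a quick check of the reported figures:

```python
# Relative improvement in total score reported for two-stage fine-tuning.
llama3_8b = (8.225 - 4.975) / 4.975    # ≈ 0.653 → +65.3%
mistral_7b = (7.433 - 3.308) / 3.308   # ≈ 1.247 → +124.7%
print(f"LLaMa3-8b: +{llama3_8b:.1%}   Mistral-7b: +{mistral_7b:.1%}")
```
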
It is worth noting that these gains were achieved with just 1,000 data pairs for two-stage fine-tuning, demonstrating the method’s data efficiency.

The table below further shows how each category of data performs under various scoring thresholds. Scores in every category improved to varying degrees after fine-tuning, with the most noticeable progress in categories such as User Input, Modality Mismatch, and Interactivity Sensory.

[Table: Per-category performance under different scoring thresholds]

In addition to the tables above, the figure below intuitively compares the performance differences between two-stage fine-tuning and direct end-to-end fine-tuning under different threshold settings. Regardless of how the thresholds change, two-stage fine-tuning consistently outperforms direct fine-tuning, reaffirming the superiority of gradual training.

[Figure: Two-stage vs. direct end-to-end fine-tuning under different threshold settings]

In summary, the prompt optimization method and two-stage fine-tuning method proposed in this paper have achieved significant results in enhancing the honesty and usefulness of language models. On one hand, prompt optimization cleverly utilizes the model’s “curiosity” to guide it to face its limitations and provide constructive responses without needing to retrain the model to achieve honesty alignment. On the other hand, two-stage fine-tuning, through a curriculum learning approach, allows the model to exhibit outstanding honesty and helpful qualities even with a small sample of 1000 pairs of data. More importantly, the proposed methods have achieved consistent performance improvements across various mainstream language models, including open-source and commercial models, proving their wide applicability.

Conclusion and Outlook

This research work explores a new path for constructing more trustworthy and beneficial large language models for humanity. As artificial intelligence continues to extend its reach, honest and reliable AI assistants will become an indispensable part of people’s work and lives. Users need AI to openly recognize its limitations while innovatively providing targeted assistance.

Of course, shaping an honest and trustworthy AI assistant will not happen overnight. For example, as the application scenarios for large models expand, our requirements for honest AI must be continuously updated; at the technical level, more efficient and precise optimization algorithms remain to be explored. This calls for close collaboration between academia and industry.

Technical Group Invitation

[Image: QR code to add the assistant on WeChat]

△ Long press to add the assistant

Scan the QR code to add the assistant on WeChat

Please note: Name-School/Company-Research Direction
(e.g., Xiao Zhang-Harbin Institute of Technology-Dialogue System)
to apply to join technical groups such as Natural Language Processing/Pytorch

About Us

The MLNLP community is a grassroots academic community jointly built by machine learning and natural language processing scholars from home and abroad. It has since grown into a well-known machine learning and natural language processing community both domestically and internationally, and it aims to promote progress among academia, industry, and enthusiasts in these fields.
The community can provide an open communication platform for practitioners’ further studies, employment, and research. Everyone is welcome to follow and join us.

