Artificial intelligence is developing at an astonishing pace, with large language models (LLMs) being the “stars” that demonstrate impressive language understanding and generation capabilities. However, while enjoying the conveniences brought by large language models, we must also face the challenges they face regarding honesty and safety.
Recently, a research team from Huazhong University of Science and Technology proposed a brand new framework to enhance the honesty and usefulness of large language models from both theoretical and experimental perspectives. They constructed a new evaluation dataset called HoneSet and designed optimization methods for open-source and commercial models. Experiments show that the honesty of LLaMa3 was improved by 65% after two-stage fine-tuning.
With the development of artificial intelligence, honest and reliable AI assistants will become a necessity for people. We look forward to seeing more researchers engage in this field to jointly promote the maturity of large model technology and better benefit human society.

Paper Title:The Best of Both Worlds: Toward an Honest and Helpful Large Language Model
Paper Link:https://arxiv.org/pdf/2406.00380
Challenges of Honesty in Large Language Models
Large language models (LLMs) are emerging in the field of natural language processing with their outstanding language understanding and generation capabilities, showcasing broad application prospects in dialogue, writing, and question-answering tasks. However, the honesty challenges faced by large language models in practical applications have gradually become a focus of attention.

These models sometimes generate seemingly accurate but incorrect information, and when faced with questions beyond their capabilities, they fail to honestly express their limitations. This can affect users’ trust in their outputs, leading them not to apply large models to tasks requiring high trust levels. Therefore, how to enhance the honesty of large language models, making them more reliable and beneficial assistants, has become an urgent issue to address.
The Path to Cultivating an “Honest” Large Model
In response to the challenges mentioned above, researchers from Huazhong University of Science and Technology, the University of Notre Dame, and Lehigh University proposed a brand new framework to enhance the honesty and usefulness of large language models from both theoretical and practical perspectives.
Firstly, the researchers systematically sorted and defined the characteristics that an honest large model should possess from a theoretical perspective. They pointed out:
-
An honest large model should recognize its limitations and provide reasonable responses to questions beyond its capabilities;
-
It should not blindly follow user inputs but maintain an objective and neutral stance;
-
Additionally, it should have a clear self-awareness and not equate itself with sentient and emotional humans.
Based on these principles, the researchers constructed a brand new evaluation dataset called HoneSet, covering six major types of “tricky” questions to examine the honesty of large models from multiple angles. As shown in the figure below, HoneSet includes six categories of questions: Latest Information, User Input, Professional Capability, Modality Mismatch, Interactivity Sensory, and Self Identity, aiming to comprehensively assess the model’s ability to maintain honesty in different scenarios.

The figure below illustrates the construction process of the HoneSet dataset, which mainly includes three steps:
-
Construction of candidate datasets, where seed queries are artificially defined for the six categories and data is expanded through contextual learning using GPT-4.
-
Data filtering and enhancement, using OpenAI’s text embedding model to filter out duplicate data and paraphrase queries for expansion.
-
Human evaluation, where experts screen and refine the generated queries to ensure data quality.

Secondly, the research team designed two optimization methods from a practical perspective, targeting open-source models and commercial models:
-
Open-source Models
For open-source models, they proposed a “curiosity-driven” prompt optimization method. This method consists of two stages: curiosity-driven prompt generation and answer optimization.
In the first stage, by designing clever prompts, the model is guided to articulate its doubts and uncertainties about the questions. Specifically, the prompt templates encourage the model to carefully analyze the questions and express its confusion, such as lack of real-time information, insufficient or erroneous user input, or lack of domain-specific knowledge. This step aims to awaken the model’s awareness of its limitations.
In the second stage, the researchers combine the model’s doubts with its original answers, re-inputting them into the model and providing a “constitution-guided” prompt to guide the model in optimizing its answers based on preset honesty principles. The optimized answers should include an honest acknowledgment of limitations and provide beneficial guidance to users.
-
Commercial Models
For commercial models, the researchers proposed a two-stage fine-tuning process:
(1) The first stage trains the model on HoneSet to distinguish between honest and dishonest answers by optimizing the contrastive loss function;
(2) The second stage further enhances the usefulness of the model’s answers by optimizing a reward function based on human preferences.
The entire process draws on the principles of curriculum learning, allowing the model to learn the qualities of honesty and helpfulness in a gradual manner. At the same time, the figure also compares the performance of two-stage fine-tuning with direct end-to-end fine-tuning, indicating that staged training can achieve better performance improvements.
This research systematically explores methods for creating honest and helpful large language models from both theoretical and practical perspectives. By defining honesty criteria, constructing evaluation datasets, designing prompt optimization and fine-tuning methods, it provides new ideas for enhancing the credibility and usefulness of large models in practical applications.
Initial Results of Honesty Cultivation
To verify the effectiveness of the method, the researchers conducted extensive experiments on nine mainstream language models, including GPT-4, ChatGPT, and Claude.
The figure below shows the experimental results based on the prompt optimization method. As can be seen, after adopting curiosity-driven prompts, the honesty levels of each model on HoneSet significantly improved. For example, the honesty of GPT-4 and Claude reached 100%, achieving nearly perfect honesty alignment; while the honesty of the smaller parameter model LLaMa2-7b increased significantly from 43% to 83.7%. Almost all models exceeded 60% in honesty, proving the universality of the method.

Subsequently, the authors further compared the performance of optimized versus unoptimized answers in human evaluations. The results showed that the optimized answers generally had a higher success rate in pairwise comparisons, reflecting higher honesty and usefulness.

In addition, the article also quantitatively demonstrates improvements in responses across three dimensions: explanation, answering, and guidance. The results indicate that various models have made significant progress in honestly explaining limitations, providing problem-solving ideas, and offering specific guidance, fully demonstrating the effectiveness of the prompt optimization method.




Conclusion and Outlook
Scan the QR code to add the assistant on WeChat