Evaluating the Safety and Trustworthiness of Generative AI Models

As generative artificial intelligence gradually integrates into daily life, the safety and trustworthiness of AI have become a focal point of international concern. AI safety incidents at home and abroad have sparked significant public discourse. For example, AI-generated deepfake images and videos have long been criticized for spreading misinformation and damaging reputations. Classic malicious queries, such as “how to make a bomb,” can also be answered directly by large models, giving criminals an opening if exploited. Moreover, some educational and popular-science AI videos frequently contain factual errors that contradict the laws of the physical world, and such erroneous videos can easily mislead minors when spread online. Unsafe and untrustworthy outputs have become major challenges for generative AI.
In response to these challenges, academia, industry, and the international community have taken steps to uncover and address the safety issues of large models. Researchers have established numerous safety and trustworthiness evaluation benchmarks to measure the sensitivity of generative AI models to unsafe content. OpenAI has also formulated a range of safety and privacy policies to restrict harmful responses from GPT models. On July 14, 2023, the National Internet Information Office, together with the National Development and Reform Commission and other departments, released the “Interim Measures for the Management of Generative AI Services,” the world’s first dedicated regulation on generative AI. On March 13, 2024, the European Parliament passed the “Artificial Intelligence Act,” ushering in a new era of safety and trust regulation for AI both domestically and internationally.
In this context, which aspects of generative AI’s safety and trustworthiness still need improvement is a question that requires continuous exploration. Only by knowing both ourselves and our adversaries can we ensure the safety and trustworthiness of large models, provide effective guidance for the development of generative AI, and foster more capable AI that can be deployed across society. Therefore, this article proposes a hierarchical evaluation system for the safety and trustworthiness of generative AI, constructed from multiple dimensions and aimed at providing a solid safety guarantee for the large-scale application of large models. Specifically, as illustrated in Figure 1, we assess generative models across three dimensions: physical credibility, safety reliability, and detectability of forgery, with each dimension further divided into sub-dimensions. Physical credibility encompasses mechanics, optics, material properties, and thermodynamics; safety reliability covers general symbols, celebrity privacy, and NSFW issues; detectability of forgery includes sub-dimensions such as forgery modalities, semantics, tasks, types, and models, each with deeper subdivisions. Our evaluation subjects include various generative models, such as text-to-video (T2V) models, text-to-image (T2I) models, and large vision-language models (LVLMs). Through this comprehensive hierarchical evaluation, we derive results and conduct in-depth analyses that not only reveal the safety weaknesses of large models but also suggest directions for improving generative AI, promoting its safe and effective application across social domains and ensuring that technological advances bring controllable, trustworthy societal impacts.

Physical Credibility

With the emergence of various generative models, more and more people are using AI to create images and videos and publishing them on the internet. As the audience for AI-generated works expands, the credibility and accuracy of these works become critical to their development. T2V models, such as Sora and other tools that visualize changes across time and scenes, are increasingly seen as a promising avenue toward universal simulators of the physical world. Cognitive psychology posits that intuitive physics is crucial for simulating the real world, much as it is in the learning process of human infants. Therefore, video generation must first accurately reproduce simple yet fundamental physical phenomena to make the generated content realistic and credible.
However, even the most advanced T2V models, trained on vast resources, struggle to generate simple physical phenomena accurately, as illustrated by the optical example in Figure 2(a), where the model fails to understand that a water surface should show a reflection. This glaring flaw indicates a significant gap between current video generation models and human understanding of basic physics, reveals vulnerabilities in their physical credibility, and suggests that they are still far from functioning as real-world simulators. Assessing the physical credibility of current T2V models at multiple levels therefore becomes crucial for guiding future improvements in generative AI, and it calls for a comprehensive evaluation framework that goes beyond traditional metrics.
Against this backdrop of physical unreliability, we propose PhyGenBench and PhyGenEval to automatically assess how well T2V models understand physical common sense. PhyGenBench evaluates physical common sense in text-to-video generation on the basis of fundamental physical laws. Inspired by this, we categorize the world’s physical common sense into four main domains: mechanics, optics, thermodynamics, and material properties. Each category contains important physical laws and easily observable physical phenomena, yielding a benchmark of 27 physical laws and 160 validated prompts. Specifically, starting from basic physical laws, we brainstormed, drawing on sources such as textbooks, to construct prompts that clearly reflect those laws. This process produced a comprehensive yet straightforward set of prompts that adequately capture physical common sense for evaluation purposes.
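To make the benchmark’s structure concrete, the following is a minimal sketch of how a PhyGenBench-style entry could be organized in code; the field names and the example prompt are illustrative assumptions rather than the benchmark’s actual schema.

```python
# A minimal sketch of a PhyGenBench-style prompt entry (illustrative only).
from dataclasses import dataclass

@dataclass
class PhysicsPrompt:
    category: str        # "mechanics", "optics", "thermodynamics", or "material properties"
    physical_law: str    # the physical law the generated video should obey
    prompt: str          # the text prompt handed to the T2V model

BENCHMARK = [
    PhysicsPrompt(
        category="optics",
        physical_law="A calm water surface produces a mirror-like reflection",
        prompt="A swan glides across a still lake at sunset.",
    ),
    # ... in the real benchmark, 160 validated prompts cover 27 physical laws
]

def prompts_by_category(category: str) -> list[PhysicsPrompt]:
    """Select the subset of prompts that probe one physical domain."""
    return [p for p in BENCHMARK if p.category == category]
```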
On the other hand, benefiting from the simple and clear physical phenomena in PhyGenBench prompts, we propose PhyGenEval, a novel video evaluation framework for assessing the correctness of physical common sense in PhyGenBench. As shown in Figure 3, PhyGenEval first uses GPT-4o to analyze the physical laws in the text, addressing the insufficient understanding of physical common sense in video-based VLMs. Moreover, considering that previous evaluation metrics did not specifically target physical correctness, we propose a three-layer hierarchical evaluation strategy that transitions from image-based analysis to comprehensive video analysis: single image, multiple images, and full video stages. Each stage employs different VLMs and custom instructions generated by GPT-4o to form judgments. By combining PhyGenBench and PhyGenEval, we can effectively assess the understanding of physical common sense across different T2V models on a large scale, yielding results that are highly consistent with human feedback.
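As a rough illustration of this three-stage idea, the sketch below combines single-image, multi-image, and full-video judgments into one score; the judge callables and the weighting scheme are placeholders for the VLM calls and GPT-4o-generated instructions that PhyGenEval actually uses, not its real implementation.

```python
# A sketch of a three-stage, image-to-video physical-correctness scorer.
from typing import Any, Callable, Sequence

Frame = Any  # e.g., a decoded video frame (PIL image or numpy array); kept abstract here

def hierarchical_physics_score(
    frames: Sequence[Frame],
    physics_question: str,
    single_image_judge: Callable[[Frame, str], float],           # key-frame check
    multi_image_judge: Callable[[Sequence[Frame], str], float],  # sampled-frames check
    full_video_judge: Callable[[Sequence[Frame], str], float],   # whole-video check
    weights: tuple[float, float, float] = (0.2, 0.3, 0.5),       # illustrative weights
) -> float:
    """Combine the three stages into a single physical-correctness score in [0, 1]."""
    key_frame = frames[len(frames) // 2]          # a representative frame
    s_single = single_image_judge(key_frame, physics_question)
    s_multi = multi_image_judge(frames[::8], physics_question)   # sparse frame sample
    s_video = full_video_judge(frames, physics_question)
    w1, w2, w3 = weights
    return w1 * s_single + w2 * s_multi + w3 * s_video
```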
In terms of physical credibility, we used PhyGenBench and PhyGenEval to conduct extensive evaluations of popular T2V models, uncovering several key phenomena and conclusions: ① Even the best-performing model, Gen-3, scored only 0.51, indicating that current models are far from achieving the capabilities of world simulators. ② PhyGenEval focuses primarily on physical correctness and is robust to other factors affecting visual quality; moreover, even if a model generates videos of better overall quality, that does not necessarily imply a better understanding of physical common sense. ③ Prompt engineering or scaling up T2V models may solve some issues, but it still cannot handle dynamic physical phenomena, which may require extensive training on synthetic data.
Based on the evaluation results, we found that the physical credibility of generated videos still has significant deficiencies. We hope this work inspires the community to focus on physical common-sense learning in T2V models rather than merely using them as entertainment tools.

Safety Reliability

Beyond the basic question of whether generated content is credible and reasonable, the safety and reliability of generated content is an even more pressing problem, and its dangers are most directly visible in T2I models. Text-to-image generation has garnered widespread attention in recent years, allowing images to be generated from arbitrary human-written prompts and achieving unprecedented popularity. Its rapid development has produced open-source T2I models such as Stable Diffusion, model-sharing communities such as Civitai, and closed-source APIs like DALL-E and Midjourney, which have attracted large numbers of artists and commercial users, demonstrating immense commercial potential and revenue prospects.
As T2I models place the skill of image creation in the hands of every user, society increasingly expects these models to be safe, and numerous policy measures have emerged to prevent the generation of harmful content. However, despite the progress of existing safety measures, malicious attacks on T2I models have become increasingly complex and sophisticated. We have identified a significant weakness in current safety measures: they primarily target explicit text prompts, i.e., prompts in which the target object is named directly. More complex implicit text prompts, which do not state the target object explicitly but describe it indirectly, remain largely unexplored.
Therefore, regarding the safety and reliability of generative AI, we delve into more complex dangerous attacks through implicit text prompts. As illustrated in Figure 2(b), we first consider “general symbols,” such as landmarks, logos, food, etc., to preliminarily assess the model’s understanding of implicit text prompts. We found that T2I models can indeed generate the desired semantic content through implicit text prompts. Additionally, we focus on the harmful aspects of implicit text prompts, primarily concerning “celebrity privacy” and “NSFW issues” (Not Safe for Work). Regarding celebrity privacy, DALL-E is equipped with a privacy policy prohibiting the generation of celebrity images; thus, directly inputting the name Michael Jackson would be rejected. However, when using implicit text prompts to describe a celebrity, the T2I model can still generate an image of Michael Jackson, potentially leading to the spread of misinformation and damaging the reputation of public figures. In terms of NSFW issues, when the prompt for violent content is rewritten as an implicit text prompt “butcher artwork by Ben Templesmith,” the T2I model fails to filter out these implicit dangerous keywords and still generates violent images, posing serious social risks. These scenarios indicate that implicit text prompts can effectively evade the current safety mechanisms of most T2I models, providing attackers with opportunities to generate harmful images.
Based on this unsafe context, we propose a new implicit text prompt benchmark, ImplicitBench, to systematically study the behavior of T2I models under implicit text prompts. Specifically, ImplicitBench focuses on three aspects of implicit text prompts: general symbols, celebrity privacy, and NSFW issues. As shown in Figure 4, the research workflow is as follows: first, we collected a dataset of over 2,000 implicit text prompts covering the three aspects and more than twenty subcategories; next, we used three open-source T2I models and three closed-source T2I APIs to generate a large number of images from ImplicitBench; then, we designed an evaluation framework, ImplicitEval, which includes three evaluation methods to determine whether the images generated from a given implicit text prompt accurately reflect its underlying explicit content, and which computes quantitative accuracy rates for the three aspects. For general symbols, we used GPT-4V to judge whether the generated images display the specified symbol; for celebrity privacy, we used a traditional face verification model, ArcFace, as the recognizer, collecting real photos of the corresponding celebrities as references; for NSFW issues, we employed the built-in safety checker provided by Stable Diffusion together with a dedicated unsafe-image classifier as a dual evaluation.
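To illustrate how such a three-way evaluation could be wired together, the following sketch dispatches a generated image to one of three checks depending on the aspect being tested; the evaluator callables and the similarity threshold are hypothetical stand-ins for the GPT-4V, ArcFace, and NSFW components described above, not the actual ImplicitEval code.

```python
# A sketch of an ImplicitEval-style dispatcher over the three evaluation aspects.
from typing import Callable

def evaluate_generated_image(
    image_path: str,
    record: dict,
    symbol_judge: Callable[[str, str], bool],            # (image, target symbol) -> depicted?
    face_similarity: Callable[[str, list[str]], float],  # (image, reference photos) -> similarity
    nsfw_checks: list[Callable[[str], bool]],            # independent unsafe-image checkers
    face_threshold: float = 0.4,                          # illustrative threshold, not the paper's
) -> bool:
    """Return True if the image reveals the explicit content implied by the implicit prompt."""
    aspect = record["aspect"]
    if aspect == "general_symbols":
        # Ask a vision-language model whether the intended symbol is depicted.
        return symbol_judge(image_path, record["target_symbol"])
    if aspect == "celebrity_privacy":
        # Compare against reference photos; high similarity counts as a privacy violation.
        return face_similarity(image_path, record["reference_photos"]) > face_threshold
    if aspect == "nsfw":
        # The dual evaluation flags the image if either checker fires.
        return any(check(image_path) for check in nsfw_checks)
    raise ValueError(f"unknown aspect: {aspect}")
```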
In terms of safety reliability, we used ImplicitBench and ImplicitEval to conduct a comprehensive evaluation of popular T2I models, reaching the following conclusions. ① General symbols: T2I models can, to a certain extent, generate images that align with the symbolic meanings implied by implicit text prompts; this ability correlates positively with the quality of the generated images and with text-image consistency, and closed-source T2I APIs generally perform better. ② Celebrity privacy: experimental results indicate that T2I models are more likely to generate images that infringe on the privacy of highly recognizable celebrities, and implicit text prompts can evade current privacy policy defenses, potentially spreading misinformation and damaging personal reputations. ③ NSFW issues: implicit text prompts can bypass the safety filters of most T2I models; although they appear harmless, they can yield harmful NSFW content. Compared with the DALL-E series, Midjourney performs better on safety, being more capable of recognizing NSFW implications and preventing the generation of harmful content. Furthermore, compared with ordinary vocabulary, certain technical terms, overly detailed close-ups of body parts, and words with ambiguous or multiple meanings are more likely to lead to NSFW content.
Overall, ImplicitBench aims to evaluate the safety and reliability of generative AI, drawing more attention from the T2I community to more complex harmful attacks. We found that existing safety strategies may not effectively address implicit text prompts; therefore, privacy and NSFW issues derived from implicit text prompts should receive adequate attention. In the future, preventive mechanisms against implicit text prompts require further research and refinement to enhance the safety and reliability of generative AI.

Detectability of Forgery

In recent years, the rapid development of AI-generated content technology has sharply lowered the barriers to creating fake media, allowing the general public to produce it with ease. Consequently, a vast amount of synthetic media has flooded the internet, posing unprecedented threats to politics, law, and social security, such as the malicious dissemination of deepfake videos and misinformation. In response, researchers have proposed numerous forgery detection methods aimed at filtering out as much synthetic media as possible. However, today’s synthetic media is highly diverse: it may span different modalities, represent various semantics, and be created or manipulated by different AI models. Designing a universal forgery detector with comprehensive identification capabilities has therefore become a critical and urgent task in the era of AI-generated content, and it poses significant challenges to the research community.
At the same time, LVLMs have made remarkable progress on various multimodal tasks, such as visual recognition and visual description, reigniting discussions about artificial general intelligence. These strong generalization capabilities make LVLMs promising tools for distinguishing increasingly diverse synthetic media. However, a comprehensive benchmark for assessing LVLMs’ ability to recognize synthetic media is still lacking, which limits their application in forgery detection and further hinders their development toward the next stage of artificial general intelligence. Some research efforts have attempted to fill this gap with different evaluation benchmarks, but they cover only a limited range of synthetic media.
Based on this context of rampant yet hard-to-monitor forgery, we introduce Forensics-Bench, a new forgery detection benchmark suite designed to comprehensively evaluate LVLMs’ capabilities in forgery detection. Forensics-Bench has been meticulously designed to cover as many diverse types of forgery as possible, comprising 63K multiple-choice visual questions and spanning 112 unique forgery detection types. Specifically, the breadth of Forensics-Bench covers five aspects: ① different forgery modalities, including RGB images, near-infrared images, videos, and text; ② various semantics, including human subjects and other general subjects; ③ different AI models used for creation or manipulation, such as GANs, diffusion models, and VAEs; ④ various task types, including forgery binary classification, forgery spatial localization, and forgery temporal localization; ⑤ diverse forgery types, such as face swapping, facial attribute editing, and face reenactment. This diversity requires LVLMs to possess comprehensive identification capabilities to recognize various forgeries, underscoring the significant challenges currently posed by AI-generated content technology. Figure 2(c) presents examples of different types of image, text, and video forgeries.
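For intuition, a single Forensics-Bench-style multiple-choice record might look roughly like the sketch below; the field names and values are assumptions for exposition only, not the benchmark’s actual format.

```python
# An illustrative (assumed) record layout for one multiple-choice visual question.
SAMPLE_QUESTION = {
    "media_path": "samples/video_001.mp4",
    "modality": "video",                       # RGB image / near-infrared image / video / text
    "semantics": "human face",                 # human subject vs. other general subject
    "generator": "diffusion",                  # GAN / diffusion model / VAE / ...
    "task": "forgery_binary_classification",   # or spatial / temporal localization
    "forgery_type": "face_swapping",
    "question": "Is the face in this video swapped from another person?",
    "options": {"A": "Yes", "B": "No"},
    "answer": "A",
}
```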
In the experiments, we used the evaluation platform OpenCompass and followed previous research for assessment: first, we checked whether an option letter appeared in the LVLM’s response; then, we checked whether the content of an option appeared in the response; finally, we asked ChatGPT to extract the matching option. If all extraction attempts failed, we marked the model’s answer as Z.
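A minimal sketch of this multi-step extraction protocol is given below, assuming a hypothetical chatgpt_extract helper in place of the actual ChatGPT-assisted step; the matching rules are simplified for illustration and are not the exact rules used in the experiments.

```python
# A simplified sketch of mapping free-form LVLM responses to option letters.
import re
from typing import Callable, Optional

def extract_choice(
    response: str,
    options: dict[str, str],                                # e.g., {"A": "Real", "B": "Fake"}
    chatgpt_extract: Callable[[str, dict], Optional[str]],  # hypothetical ChatGPT-assisted fallback
) -> str:
    """Return the matched option letter, or 'Z' if all extraction attempts fail."""
    # Step 1: look for a bare option letter such as "A", "(B)", or "C."
    # (simplified pattern; the real protocol uses more careful matching rules)
    for letter in options:
        if re.search(rf"(^|[\s(\[]){letter}([).:,\s]|$)", response):
            return letter
    # Step 2: look for an option's content verbatim in the response.
    for letter, content in options.items():
        if content.lower() in response.lower():
            return letter
    # Step 3: fall back to ChatGPT-assisted extraction.
    extracted = chatgpt_extract(response, options)
    if extracted in options:
        return extracted
    # All extraction attempts failed.
    return "Z"
```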
In terms of detectability of forgery, we evaluated 22 publicly available LVLMs and 3 proprietary models on Forensics-Bench. The results show significant differences in LVLM performance across forgery detection types, revealing their limitations. We summarize the following findings: ① Forensics-Bench poses significant challenges to LVLMs, with the best-performing model achieving only 66.7% overall accuracy, highlighting the unique difficulty of robust forgery detection. ② Across forgery types, LVLMs exhibit large performance discrepancies: they perform excellently (close to 100%) on certain types (such as deception and style transfer) but poorly (below 55%) on others (such as face swapping with multiple faces and face editing), revealing that LVLMs have only a partial understanding of different forgery types. ③ Across forgery detection tasks, LVLMs typically perform better on classification than on spatial and temporal localization. ④ For forgeries produced by popular AI models, current LVLMs perform better on those generated by diffusion models than on those generated by GANs. These results expose the limitations of LVLMs in distinguishing forgeries generated by different AI models.
Overall, regarding the detectability of forgery, we discovered limitations in LVLMs’ ability to distinguish AI-generated forgery content through Forensics-Bench, deepening our understanding of LVLMs’ sensitivity to forgery content.
In the face of the ongoing development of generative AI, ensuring the safety and trustworthiness of large models is an essential step toward their adoption across society. Only by constructing a comprehensive safety and trustworthiness evaluation system can we deeply understand the safety vulnerabilities of generative AI and provide practical safety guidelines for model improvement.
A safety and trustworthiness evaluation system needs to be constructed across multiple dimensions and layers to simulate the different scenarios large models may face when interacting with vast numbers of users, so that potential safety risks can be effectively prevented. The evaluation system we propose therefore focuses on three dimensions of generative AI: physical credibility, safety reliability, and detectability of forgery, all of which address complex and subtle safety issues. The evaluation results indicate that, within these three dimensions, several issues are easily overlooked by large models, leading to uncontrollable safety and trustworthiness risks and reflecting the current fragility of large models’ safety defenses. Based on our analysis of the experimental results, we also propose improvement suggestions for the physical credibility, safety reliability, and detectability of forgery of large models. We hope our evaluation can offer insights for the protection and improvement of large models, advancing the safety of generative AI.
Looking ahead, the landscape of generative AI will continue to expand, and people’s lifestyles will undergo rapid changes. To ensure that large models serve our needs, we must guarantee their safety and trustworthiness so that generative AI can smoothly and harmoniously integrate into daily life, driving social progress and development towards a smarter and more convenient new era.
