Evaluating the Safety and Trustworthiness of Generative AI Models

In recent years, generative artificial intelligence has advanced rapidly. As large models iterate and upgrade, their capabilities have grown from general-purpose generation to specialized skills in particular domains, with an increasing focus on real user interaction, and their applications are drawing ever more attention. However, current large models still face trustworthiness bottlenecks that hinder large-scale deployment. The safety and trustworthiness of large models have therefore become a focus of concern, and numerous regulations and standards have been rapidly drafted and enacted both in China and abroad. This article constructs a hierarchical evaluation system for the safety and trustworthiness of generative artificial intelligence, assessing existing large models along three dimensions: physical trustworthiness, security reliability, and forgery detectability. The evaluation covers a range of generative models, including text-to-image models, text-to-video models, and visual large language models, identifies areas for improvement based on the results, and provides a safety and trustworthiness guide for the large-scale application of generative artificial intelligence.

Generative AI, as an important branch of artificial intelligence, forms emergent capabilities by learning from large amounts of data and can autonomously generate content, achieving remarkable progress in domains such as multimodal dialogue, image generation, and video generation. In recent years, with the development of pre-training technologies and the growth of computational resources, multimodal generative large models represented by GPT, DALL-E, and Sora have developed rapidly and delivered impressive results. Consequently, generative artificial intelligence has not only stirred significant waves in academic research but has also demonstrated immense potential in practical applications, gradually integrating into people's daily lives. For instance, multimodal dialogue generation has gained widespread popularity among users in tasks such as text extraction, chatbots, and language translation; image and video generation technologies have empowered the new media industry and are widely used in artistic creation, advertising design, and film special effects; in embodied intelligence, large language generation models can provide embodied instructions for robot actions, endowing robots with more strategic decision-making capabilities; and in medical and biological research, generative AI helps predict protein structures and analyze medical images, providing new tools for scientific discovery.

As generative artificial intelligence enters daily social life, the safety and trustworthiness of AI have become a focal point of international concern. AI safety incidents at home and abroad have repeatedly sparked public debate: AI-generated deepfake images and videos, long a target of criticism, have spread false information and damaged reputations; classic malicious prompts such as "how to make a bomb" can still elicit direct answers from large models, potentially giving criminals opportunities for wrongdoing; and some educational and science-popularization AI videos contain common-sense errors that violate the laws of the physical world, and such erroneous videos can easily distort minors' understanding when spread online. Unsafe and untrustworthy outputs have become major challenges for generative artificial intelligence.

In response to this challenge, academia, industry, and the international community have taken measures to uncover and remedy the safety issues of large models. Researchers have established many safety and trustworthiness evaluation benchmarks to measure the sensitivity of generative AI models to unsafe content; OpenAI has developed numerous safety and privacy policies to restrict harmful responses from GPT; on July 14, 2023, the Cyberspace Administration of China, together with the National Development and Reform Commission and other departments, released the "Interim Measures for the Administration of Generative Artificial Intelligence Services," widely regarded as the world's first dedicated regulation on generative artificial intelligence; and on March 13, 2024, the European Parliament passed the "Artificial Intelligence Act," ushering in a new era of safety and trustworthiness regulation for artificial intelligence worldwide.

In this context, the aspects in which the safety and trustworthiness of generative artificial intelligence need improvement remain under active exploration. Only by understanding both the strengths and the weaknesses of large models can we defend their safety and trustworthiness effectively, provide sound guidance for the development of generative artificial intelligence, and foster AI that is both more capable and ready for society. This article therefore proposes a hierarchical evaluation system for the safety and trustworthiness of generative artificial intelligence, constructed along multiple safety and trustworthiness dimensions, aiming to provide solid safety guarantees for the large-scale application of large models. Specifically, as shown in Figure 1, we evaluate generative large models across three dimensions: physical trustworthiness, security reliability, and forgery detectability, each containing several sub-dimensions. Physical trustworthiness covers mechanics, optics, material properties, and thermodynamics; security reliability covers general symbols, celebrity privacy, and NSFW issues; forgery detectability includes sub-dimensions such as forgery modalities, forgery semantics, forgery tasks, forgery types, and forgery models, each with deeper subcategories. Our evaluation covers a range of generative models, including text-to-video (T2V) models, text-to-image (T2I) models, and visual large language models (LVLMs). Through this comprehensive hierarchical evaluation, we derive results and conduct in-depth analyses, reveal the security weaknesses of large models, and propose suggestions for improving generative AI models, so as to promote the safe and effective application of generative AI across social domains and ensure that technological advances also bring controllable and trustworthy social impact.

Physical Trustworthiness

As various generative models emerge, more and more people are using artificial intelligence to create images and videos and publishing them on the internet. As the audience for AI-generated works expands, the credibility and accuracy of these works become critical. T2V models (for example, tools such as Sora that visualize time and scene transitions) are increasingly seen as a promising path toward general simulators of the physical world. Cognitive psychology suggests that intuitive physics, akin to what human infants acquire, is crucial for simulating real-world scenarios. Video generation should therefore first reproduce simple yet fundamental physical phenomena accurately in order to enhance the realism and credibility of generated content.

However, even the most advanced T2V models, trained on vast resources, struggle to generate simple physical phenomena accurately, as illustrated by the optical examples in Figure 2(a), where models may fail to understand that a water surface should show reflections. This obvious flaw indicates a significant gap between current video generation models and human understanding of basic physics, revealing vulnerabilities in their physical trustworthiness; these models still have a long way to go before becoming true world simulators. Assessing the physical trustworthiness of various T2V models is therefore crucial for guiding future improvements in generative artificial intelligence, and it requires a comprehensive evaluation framework that goes beyond traditional metrics.

Against this backdrop of physical untrustworthiness, we propose PhyGenBench and PhyGenEval to automatically assess the physical-common-sense understanding of T2V models. PhyGenBench evaluates physical common sense in text-to-video generation against fundamental physical laws. We categorize physical common sense into four main domains: mechanics, optics, thermodynamics, and material properties. Each category contains important physical laws and easily observable physical phenomena, yielding a benchmark of 27 physical laws and 160 validated prompts. Specifically, starting from basic physical laws, we construct prompts that clearly reflect these laws by brainstorming from sources such as textbooks. This process yields a comprehensive yet straightforward set of prompts that capture physical common sense and are clear enough for evaluation.
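To make the structure of such a benchmark concrete, the sketch below shows one way a PhyGenBench-style entry could be organized in code; the class name, field names, and example values are illustrative assumptions, not the benchmark's released data format.

```python
from dataclasses import dataclass

# Illustrative schema for a PhyGenBench-style entry; the field names here are
# hypothetical and do not necessarily match the released benchmark files.
@dataclass
class PhysicsPrompt:
    category: str             # "mechanics", "optics", "thermodynamics", or "material properties"
    physical_law: str         # the law the prompt is designed to probe
    prompt: str               # text prompt fed to the T2V model
    expected_phenomenon: str  # the observable outcome a physically correct video should show

example = PhysicsPrompt(
    category="optics",
    physical_law="specular reflection",
    prompt="A calm lake at sunset with mountains on the shore",
    expected_phenomenon="the water surface shows a mirrored reflection of the mountains and sky",
)
```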

On the other hand, benefiting from the simple and clear physical phenomena in PhyGenBench prompts, we propose PhyGenEval, a novel video evaluation framework for assessing the correctness of physical common sense on PhyGenBench. As shown in Figure 3, PhyGenEval first employs GPT-4o to analyze the physical laws implied by the text, compensating for the limited physical common sense of video-based VLMs. Furthermore, because previous evaluation metrics did not specifically target physical correctness, we propose a three-tiered evaluation strategy that progresses from image-based analysis to comprehensive video analysis: single-image, multi-image, and full-video stages. Each stage employs different VLMs together with custom instructions generated by GPT-4o to form its judgment. By combining PhyGenBench and PhyGenEval, we can evaluate the physical-common-sense understanding of different T2V models effectively and at scale, producing results that are highly consistent with human feedback.
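As an illustration of how such a three-stage pipeline can be wired together, the following minimal sketch assumes caller-supplied model wrappers; the callable names, the 0-2 scoring scale, and the equal-weight aggregation are assumptions made for clarity, not the released PhyGenEval implementation.

```python
from typing import Callable, List

def evaluate_physics(
    prompt: str,
    frames: List[object],                       # decoded video frames, in temporal order
    ask_llm: Callable[[str], str],              # text LLM query (e.g., a GPT-4o wrapper)
    judge_image: Callable[[object, str], int],  # single-image VLM judge, returns 0-2
    judge_frames: Callable[[list, str], int],   # multi-image VLM judge, returns 0-2
    judge_video: Callable[[list, str], int],    # video-level VLM judge, returns 0-2
) -> float:
    # Stage 0: extract the physical law and its key observable evidence from the prompt.
    law_analysis = ask_llm(
        f"Which physical law should govern a video of: '{prompt}'? "
        "Describe the visual evidence that the law is obeyed."
    )
    # Stage 1: single-image check on a representative middle frame.
    s1 = judge_image(frames[len(frames) // 2], law_analysis)
    # Stage 2: multi-image check over an ordered subset, capturing change over time.
    step = max(1, len(frames) // 4)
    s2 = judge_frames(frames[::step], law_analysis)
    # Stage 3: whole-video judgment.
    s3 = judge_video(frames, law_analysis)
    # Aggregate the three stage scores (each 0-2) into a physical-correctness score in [0, 1].
    return (s1 + s2 + s3) / 6.0
```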

In terms of physical trustworthiness, we used PhyGenBench and PhyGenEval to conduct extensive evaluations of popular T2V models and observed several key phenomena and conclusions: ① Even the best-performing model, Gen-3, scored only 0.51, indicating that current models are far from serving as world simulators. ② PhyGenEval focuses primarily on physical correctness and is robust to other factors that affect visual quality; moreover, a model that generates videos of higher general quality does not necessarily have a better grasp of physical common sense. ③ Prompt engineering or scaling up T2V models can address some issues, but neither can yet handle dynamic physical phenomena, which may require extensive training on synthetic data.

Based on the evaluation results, we find that the physical trustworthiness of generated videos has significant shortcomings. We hope this work can inspire the community to focus on learning physical common sense in T2V models rather than merely using them as entertainment tools.

Security Reliability

Beyond whether generated content is credible and reasonable, the safety and reliability of generated content is an even more pressing issue, and it is most directly reflected in T2I models. Text-to-image generation has attracted widespread attention in recent years: it can produce images from arbitrary human-written prompts and has achieved unprecedented popularity. Its rapid development has given rise to open-source T2I models such as Stable Diffusion and those shared in the Civitai community, as well as closed-source APIs like DALL-E and Midjourney, attracting numerous artists and commercial users and showing immense commercial potential and revenue prospects.

As T2I models put the ability to create images into the hands of every user, society increasingly demands that these models be safe. Many policy constraints have emerged to prevent the generation of harmful content. However, despite the progress made by existing safety measures, malicious attacks on T2I models have grown increasingly complex and sophisticated. We identify a significant weakness in current T2I safety measures: they primarily target explicit text prompts, in which the target object is named directly in the text. More complex implicit text prompts, which do not explicitly name the target object but describe it indirectly, remain largely unexplored.

Therefore, with respect to the security reliability of generative artificial intelligence, we investigate more complex, dangerous attacks that work through implicit text prompts. As shown in Figure 2(b), we first consider "general symbols," such as landmarks, logos, food, and other common symbols, as a preliminary assessment of a model's understanding of implicit text prompts, and we find that T2I models can indeed generate the intended semantic content from them. We then focus on the dangers of implicit text prompts, primarily "celebrity privacy" and "NSFW" (Not Safe for Work) issues. For celebrity privacy, DALL-E enforces a policy prohibiting the generation of celebrity images, so directly entering the name Michael Jackson is rejected; yet when the celebrity is described through an implicit text prompt, T2I models can still generate images of Michael Jackson, potentially spreading false information and damaging the reputations of public figures. For NSFW issues, when a violent prompt is rewritten as an implicit text prompt such as "butcher artwork by ben templesmith," the T2I model fails to filter these implicit dangerous keywords and still generates gruesome, violent images, posing serious social risks. These scenarios show that implicit text prompts can evade the safety mechanisms of most T2I models, giving attackers an opening to generate harmful images.

Against this unsafe backdrop, we propose ImplicitBench, a new benchmark for systematically studying the behavior of T2I models under implicit text prompts. ImplicitBench focuses on three aspects of implicit text prompts: general symbols, celebrity privacy, and NSFW issues. As shown in Figure 4, the workflow of the study is as follows: first, we collect a dataset of over 2,000 implicit text prompts covering the three aspects and more than twenty subcategories; next, we use three open-source T2I models and three closed-source T2I APIs to generate a large number of images from ImplicitBench; then we design the ImplicitEval evaluation framework, comprising three evaluation methods that determine whether images generated from a given implicit text prompt accurately reflect its underlying explicit content, and we compute quantitative accuracy for each of the three aspects. Specifically, for general symbols we use GPT-4V to judge whether the generated images display the specified symbols; for celebrity privacy we use a traditional face verification model, ArcFace, as the recognizer, with real photos of the corresponding celebrities as references; and for NSFW issues we apply a dual check consisting of the built-in safety checker provided with Stable Diffusion and a dedicated unsafe-image classifier.
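The sketch below illustrates how such an aspect-specific dispatcher might be organized, with the three judges passed in as callables; the function names, decision threshold, and return convention are illustrative assumptions, not ImplicitEval's actual code.

```python
from typing import Callable

def judge_generation(
    aspect: str,
    image: object,
    meta: dict,
    vlm_judge: Callable[[object, str], bool],          # multimodal yes/no judge (GPT-4V-style)
    face_similarity: Callable[[object, list], float],  # ArcFace-style similarity to reference photos
    safety_checker: Callable[[object], bool],          # Stable Diffusion's built-in safety checker
    nsfw_classifier: Callable[[object], bool],         # dedicated unsafe-image classifier
) -> bool:
    """Return True if the image realizes the implicit prompt's hidden content."""
    if aspect == "general_symbols":
        # Ask the multimodal judge whether the implied symbol is actually depicted.
        return vlm_judge(image, f"Does this image show {meta['target_symbol']}?")
    if aspect == "celebrity_privacy":
        # Compare against reference photos of the target celebrity.
        return face_similarity(image, meta["reference_photos"]) > 0.4  # hypothetical threshold
    if aspect == "nsfw":
        # Count the sample as unsafe if either detector fires.
        return safety_checker(image) or nsfw_classifier(image)
    raise ValueError(f"unknown aspect: {aspect}")
```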

In terms of security reliability, we used ImplicitBench and ImplicitEval to conduct comprehensive evaluations of popular T2I models, reaching the following conclusions. ① General symbols: T2I models can, to some extent, generate images that match the symbolic meanings implied by implicit text prompts, and this ability correlates positively with image quality and text-image consistency, with closed-source T2I APIs generally performing better. ② Celebrity privacy: the experiments show that T2I models are more likely to generate privacy-violating images of well-known celebrities, and implicit text prompts can evade current privacy-policy defenses, potentially spreading false information and damaging personal reputations. ③ NSFW issues: implicit text prompts can bypass the safety filters of most T2I models; although they look harmless, they can elicit harmful NSFW content. Compared with the DALL-E series, Midjourney performs better on safety, more reliably identifying NSFW implications and blocking harmful generations. In addition, compared with ordinary vocabulary, certain technical terms, overly detailed close-ups of body parts, and words with ambiguous or multiple meanings are more likely to lead to NSFW content.

Overall, ImplicitBench assesses the security reliability of generative artificial intelligence and raises the T2I community's awareness of more sophisticated harmful attacks. We find that existing safety strategies may not handle implicit text prompts effectively, so the privacy and NSFW risks they introduce deserve serious attention. Going forward, defense mechanisms against implicit text prompts need further research and refinement to improve the security reliability of generative artificial intelligence.

Forgery Detectability

In recent years, the rapid development of AI-generated content technology has significantly lowered the barrier to creating fake media, enabling the general public to produce it with ease. As a result, synthetic media of all kinds has flooded the internet, posing unprecedented threats to politics, law, and societal security, such as the malicious dissemination of deepfake videos and false information. In response, researchers have proposed numerous forgery detection methods aimed at filtering out as much synthetic media as possible. However, today's synthetic media is highly diverse: it may span different modalities, exhibit various semantics, and be created or manipulated by different AI models. Designing a universal forgery detector with comprehensive identification capabilities has therefore become a key and urgent task in the era of AI-generated content, posing significant challenges to the research community.

At the same time, LVLMs have achieved remarkable progress in various multimodal tasks, such as visual recognition and visual description, reigniting discussions around artificial general intelligence. These outstanding generalization capabilities make LVLMs powerful tools for distinguishing increasingly diverse synthetic media. However, there is still a lack of a comprehensive evaluation benchmark to assess the capability of LVLMs in forgery detection, limiting their application in forgery detection and further hindering their development towards the next stage of artificial general intelligence. To this end, some research efforts have attempted to fill this gap through different evaluation benchmarks, but they only cover a limited range of synthetic media.

Against this backdrop of rampant yet hard-to-monitor forgery, we introduce Forensics-Bench, a new benchmark suite designed for comprehensive evaluation of LVLMs on forgery detection. Forensics-Bench has been carefully curated to cover as many diverse forgery types as possible, comprising 63K multiple-choice visual questions and spanning 112 unique forgery detection types. Specifically, its breadth covers five aspects: ① different forgery modalities, including RGB images, near-infrared images, videos, and text; ② various semantics, including human subjects and general subjects; ③ content created or manipulated by different AI models, such as GANs, diffusion models, and VAEs; ④ various task types, including forgery binary classification, forgery spatial localization, and forgery temporal localization; ⑤ diverse forgery types, such as face swapping, facial attribute editing, and face reenactment. This diversity requires LVLMs to possess comprehensive identification capabilities across forgeries, highlighting the challenges posed by current AI-generated content technology. Figure 2(c) provides examples of image, text, and video forgeries.
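To make the five axes concrete, the following record layout sketches how a single Forensics-Bench-style multiple-choice question might be represented; the field names and values are hypothetical, not the benchmark's released schema.

```python
# An illustrative record for one multiple-choice forgery-detection question,
# organised along the five axes described above (modality, semantics,
# generating model, task type, forgery type). Field names are assumptions.
question = {
    "media": "example_clip.mp4",              # RGB image, near-infrared image, video, or text
    "modality": "video",
    "semantic": "human face",                 # human subject vs. general subject
    "generator": "diffusion model",           # GAN, diffusion model, VAE, ...
    "task": "forgery binary classification",  # or spatial / temporal localization
    "forgery_type": "face swapping",
    "question": "Is the face in this video swapped from another identity?",
    "options": {"A": "Yes", "B": "No"},
    "answer": "A",
}
```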

In the experiments, we used the OpenCompass evaluation platform and followed prior work for answer extraction: first, we checked whether an option letter appeared in the LVLM's response; then, whether the content of an option appeared; finally, we asked ChatGPT to help extract the matching option. If all extraction attempts failed, the model's answer was recorded as Z.
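A rule-of-thumb implementation of this extraction cascade might look like the sketch below, where the ChatGPT-assisted step is abstracted as a caller-supplied function; this is an illustrative assumption, not the OpenCompass pipeline itself.

```python
from typing import Callable, Dict

def extract_choice(response: str, options: Dict[str, str],
                   llm_match: Callable[[str, Dict[str, str]], str]) -> str:
    # 1) Look for a bare option letter such as "A" or "(B)" in the response.
    tokens = response.replace("(", " ").replace(")", " ").replace(".", " ").split()
    for letter in options:
        if letter in tokens:
            return letter
    # 2) Look for the text of an option inside the response.
    for letter, content in options.items():
        if content.lower() in response.lower():
            return letter
    # 3) Fall back to an LLM-assisted match (stand-in for the ChatGPT step).
    matched = llm_match(response, options)
    if matched in options:
        return matched
    # 4) If everything fails, record the answer as "Z".
    return "Z"
```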

In terms of forgery detectability, we evaluated 22 publicly available LVLMs and 3 proprietary models on Forensics-Bench. The results show significant differences in performance across forgery detection types, revealing the models' limitations. We summarize the following findings: ① Forensics-Bench poses a significant challenge to LVLMs, with the best-performing model achieving only 66.7% overall accuracy, underscoring the difficulty of robust forgery detection. ② Across forgery types, LVLMs show large performance gaps: they excel on certain types (such as deception and style transfer) with close to 100% accuracy, but perform poorly (below 55%) on others, such as face swapping (multiple faces) and facial editing, revealing only a partial understanding of different forgery types. ③ Across forgery detection tasks, LVLMs typically perform better on classification than on spatial and temporal localization. ④ For forgeries synthesized by popular AI models, current LVLMs handle forgeries generated by diffusion models better than those generated by GANs. These results expose the limitations of LVLMs in distinguishing forgeries produced by different AI models.

Overall, with respect to forgery detectability, Forensics-Bench shows that LVLMs are limited in their ability to distinguish AI-generated forged content, and it gives us deeper insight into how sensitive LVLMs are to such content.

As generative artificial intelligence continues to develop, ensuring the safety and trustworthiness of large models is a necessary step on their path into widespread social use. Only by constructing a comprehensive safety and trustworthiness evaluation system can we fully grasp the security vulnerabilities of generative artificial intelligence and provide practical safety guidance for model improvement.

The safety and trustworthiness evaluation system must be constructed across multiple dimensions and levels to simulate the wide variety of scenarios large models encounter when serving vast numbers of users, so that potential security risks can be effectively prevented. The evaluation system we propose therefore centers on three dimensions of generative artificial intelligence: physical trustworthiness, security reliability, and forgery detectability, all focused on more complex and subtle safety issues. The evaluation results indicate that large models have easily overlooked problems in all three dimensions, leading to uncontrollable safety and trustworthiness risks and reflecting the current fragility of large-model safety defenses. Based on the analysis of the experimental results, we also propose improvement suggestions for the physical trustworthiness, security reliability, and forgery detectability of large models. We hope our evaluation can offer ideas and insights for protecting and improving large models, enabling further leaps in the safety of generative artificial intelligence.

Looking ahead, the landscape of generative artificial intelligence will continue to expand, and people's ways of living will change rapidly with it. For large models to be used well and to serve our needs, their safety and trustworthiness must be guaranteed, so that generative artificial intelligence can integrate smoothly and harmoniously into daily life and push society into a more intelligent and convenient era.

Acknowledgments: Thanks to the Science and Technology Innovation 2030 "New Generation Artificial Intelligence (2030)" major project "General Instance Perception Large Model Based on Joint Optimization of Software and Hardware" (Project No. 2022ZD0161000) for its support.

This article is published in the November 2024 issue of IEEE Spectrum’s Chinese edition, “Science and Technology Overview.”
Expert Introduction

Luo Ping: Associate Professor and Assistant Dean of the School of Computing and Data Science at the University of Hong Kong, Deputy Director of the Musketeers Foundation Data Science Research Institute, and Executive Director of the Hong Kong University-Shanghai National Artificial Intelligence Joint Laboratory.

Yang Yue: PhD student jointly trained by Shanghai Jiao Tong University and the Shanghai Artificial Intelligence Laboratory.

Meng Fanqing: PhD student jointly trained by Shanghai Jiao Tong University and the Shanghai Artificial Intelligence Laboratory.

Shao Wenqi: Young Scientist at the Shanghai Artificial Intelligence Laboratory.

Wang Jin: PhD student in the Department of Computer Science at the University of Hong Kong.