
As a major school of iconology, the Warburg School examines images within the context of art history, thus forming the basic knowledge framework of image hermeneutics. In traditional iconology, the primary objects of image interpretation are artistic images, while subsequent image hermeneutics has striven to “catch up” with each newly emerging form of image. This, however, is clearly an “unfinished project.” With the rise of generative AI models such as Midjourney, Runway, and Sora, a new image species—the generative AI image—has quietly emerged, deeply embedded in the image world of the digital age. Unlike technical images in general, generative AI images are algorithmically generated images. Understanding and interpreting generative AI images urgently requires a reconstruction of the knowledge framework of image hermeneutics. Professor Liu Tao’s article “The Disappeared Traces: Generative AI Images and the Reconstruction of Knowledge Framework in Image Hermeneutics” is a response to this cutting-edge issue. The article critically reflects on the impact and challenges that generative AI images pose to the three foundational propositions of image hermeneutics—“representation,” “symbolism,” and “intertextuality”—thus outlining the knowledge framework of image hermeneutics in the intelligent era. The author proposes the original concept of the “universal image,” arguing that the algorithmic “black box” should be incorporated into the knowledge horizon of hermeneutics, thereby revealing the possibility of equivalence between the semiotic and the visual in the “computational” dimension. Exploring the interpretive concepts, connotations, and rules of generative AI images helps to advance the reconstruction of the knowledge system of image hermeneutics in the era of intelligent media.
—— Chen Changfeng, Professor and Doctoral Supervisor, School of Journalism and Communication, Tsinghua University
Abstract
The rise of generative AI images has posed tremendous challenges to image hermeneutics. Only by critically reflecting on the three foundational propositions of image hermeneutics—“representation,” “symbolism,” and “intertextuality”—can we reconstruct the knowledge framework of image hermeneutics in the era of artificial intelligence. As a symbolic model, “representation” reveals the world’s dependence on symbols and the linguistic foundation of image interpretation. If the essence of traditional image representation lies in transitivity, mimetic quality, and mirroring, generative AI images deny the representational foundation of image existence and the representational language of image production; they are, in essence, intransitive, anti-representational, and generative images. The significance of Sora as a world simulator lies not merely in a visual understanding of the world but in a deeper apparatus that forms a set of “universal images” of the world, thus realizing the possibility of equivalence between verbal semiotics and visual imagery. Compared with linguistic texts, the intertextual context grounded in intertextuality provides indispensable interpretive rules and an anchoring system for image interpretation. The emergence of generative AI images, however, signals the demise of texts and the collapse of the intertextual world, forcing images into a lonely world devoid of symbolic “traces.”
1. Image Hermeneutics and Its Reflection in the Era of Artificial Intelligence
The formation of human culture is inseparable from the construction of visual order, that is, from inventing and producing corresponding forms of images and thereby establishing a system for understanding the world. To understand the world, “humans began to see the world as an object and based on this constructed various images—this also constitutes the characteristic of this era.” As an “evolutionary” form within the family of images, modern images possess a universal and profound logic of technical generation, deeply embedded in the overall writing pattern of the era. Images inscribe the marks of the times, carry the language of media, and condense the forms of thought. From early traditional images to later simulated images and on to today’s digital images, the forms and fates of images have long been embedded in the overall trajectory of technological evolution—in the technology-driven world of images, humanity enters into images and, within the dimension of images, recognizes and discovers the different logical patterns of each era. There thus exists a complex “co-writing” model among images, media, and the era, the result of which is the formation of corresponding cultural forms and orders. Taking media as the measure of images, Régis Debray keenly identified distinct “image eras” in the history of human civilization.
When images and media are deeply embedded in each other’s worlds, image interpretation urgently needs to break through the “symbolic dependency” of the Warburg School and turn to an image hermeneutics that is “media-based.” The so-called “media-based” emphasizes the introduction of the material logic of media, rescuing media from its originally silent role as a carrier, exploring how media technology configures image representation and its generative scenarios, thus transcending purely representational structures and seeking models and mechanisms of meaning “conduction” based on “things” and “images.” However, media technology does not follow a unified apparatus principle, but possesses different media logics, thereby forming different worlds of images. Even so-called digital media or digital images still present different material connotations and media characteristics due to differences in their apparatus systems and programming languages.
Unlike the digital technologies of the Web 2.0 era, current artificial intelligence (AI) technology has transcended traditional material connotations and procedural logic in the general sense. It is against this backdrop of technological transformation that a brand-new form of image—generative AI images—has fully entered the online world, eagerly seeking out every possible application scenario and profoundly rewriting the underlying rules of the “image era.” Compared with digital images in general, generative AI images, as a typical form of AI-generated content (AIGC), rewrite not only the rules by which digital images are formed but also the generative models between images and technology. The rise of AI drawing tools such as Midjourney, DALL-E 3, and Stable Diffusion has already announced the arrival of the generative AI image era. Sora, with its unparalleled video generation capabilities, has pushed the imagination of generative AI images to a new height, allowing people to imagine the world with relative freedom and to form images of the world. In fact, even before the birth of Sora, similar text-to-video and image-to-video tools had already attracted widespread social attention. Runway, for example, not only allows users to create AI videos from prompt input and image input but also lets them tune and train personalized visual models to achieve conversion and generation from text or image to AI video. It should be particularly emphasized that the generative AI images referred to in this article include both static images and dynamic images, namely video texts. Clearly, from AI drawing to AI video, generative AI images truly realize the significance and function of a “world simulator”—in addition to presenting the “moment” of world images, their deeper significance is to break through the constraints of the time dimension, endowing images with the attribute of movement and thus achieving an understanding of the “whole” of the world.
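To make this prompt-to-image workflow concrete, the sketch below shows the basic prompt-in, image-out loop using the open-source Hugging Face diffusers library with a Stable Diffusion checkpoint. It is a minimal illustration under assumptions of my own (the checkpoint name, prompt, and settings are placeholders), not an example taken from the article or from any specific product named above.

```python
# Minimal text-to-image sketch (assumed setup: Hugging Face `diffusers` + a public
# Stable Diffusion checkpoint; checkpoint name and prompt are illustrative).
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image diffusion pipeline (weights download on first run).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,  # use torch.float32 when running on CPU
)
pipe = pipe.to("cuda")  # or "cpu" if no GPU is available

# The natural-language "spell": the prompt is the only input the image answers to.
prompt = "a foggy harbor at dawn, oil painting, soft warm light"
image = pipe(prompt, num_inference_steps=30).images[0]
image.save("harbor.png")
```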
So how do we explain this emerging image “species,” the generative AI image, and rethink image hermeneutics for the artificial intelligence era? An answer clearly cannot be found within the representational system of images themselves. Only by returning to the technical rules and procedural systems of generative artificial intelligence, and starting from the rules of the game by which images are formed, can we truly reconstruct the discourse of image hermeneutics in the era of artificial intelligence. On this basis, this article takes generative AI images as its research object and focuses on the three core propositions of image hermeneutics—the problem of representation in the dimension of hermeneutic ontology, the problem of symbolism in the dimension of hermeneutic language, and the problem of intertextuality in the dimension of hermeneutic context, with an emphasis on the core issues and theoretical approaches confronting image hermeneutics in the era of artificial intelligence. These three major propositions directly address the foundational questions of image hermeneutics, namely “What does the image carry?” in the dimension of image ontology, “How does the image present?” in the dimension of image language, and “How is meaning anchored?” in the dimension of image context. Compared with other forms of images in the digital media era, generative AI images fundamentally challenge all three, prompting us to reassess the theoretical discourse of image hermeneutics. In light of this, the article employs the three foundational propositions of image hermeneutics—“representation,” “symbolism,” and “intertextuality”—as conceptual tools to critically examine the interpretive dilemmas of traditional image hermeneutics, with the aim of revealing the core conceptual categories and the deep knowledge discourse of image hermeneutics in the era of artificial intelligence.
2. Generative AI Images: From Representation to Generation
(1) Images, Representation, and Mimicry
As early as Aristotle’s “Poetics,” “representation” was regarded as the narrative basis and principle of meaning of literary works, giving rise to a mimetic theory of representation: the reason literary works can carry themes and convey meanings is, fundamentally, that they rest on a foundational model of representation—mimicking the world. Mimicking the external world in order to understand it is an important mode and method of human cognitive activity.
The relationship between images and the world fundamentally reflects a representational relationship in the semiotic sense—images use certain visual languages to represent the external forms of things, and also represent ways of understanding things, thus endowing things with a semiotic mode of perception and understanding. Semiotic modes of world cognition follow different referential logics, and images undoubtedly provide a representational model based on iconicity. In Vilém Flusser’s philosophy of images, images are seen as meaningful planes, essentially revealing the possibility of transforming four-dimensional spacetime into a two-dimensional plane. The language and mechanism of this transformation is representation, manifested above all in the representation of the external world. Here, the “external world” refers not only to objective things in the external world but also to concepts and understandings related to it. Mimicry, as the essence of image representation, therefore manifests primarily as a “capturing” behavior directed toward the external world, which ultimately “outputs” a visual form related to that world. Correspondingly, the connotation of image representation reflects the formal purpose of “representation”: depicting, displaying, and highlighting the characteristics of things through another medium.
So, how to understand the “thing” represented by the image? This requires returning to examine the ontological dimension of the image’s symbols. Specifically, the reason why images possess the potential for representational meaning is that they have a referential framework based on the forms of the external world and can “reach” various meanings of the world on this basis. For image symbols, the reason why the signifier and the signified can establish a certain corresponding structure is that the signified reveals and interprets the “iconic” characteristics or connotations of the signifier from a conceptual dimension. Clearly, certain features or attributes of the image project onto the object, achieving a symbolic connection between the image and the world.
Representation, as a symbolic model, reveals the world’s dependence on symbols, the symbolic foundation of image meaning, and the linguistic foundation of image interpretation. The mechanism of image representation can be understood from the following four dimensions: first, the object of image representation points to an external world that “looks like this”; second, the representation language of the image embodies a representational structure based on visual thinking; third, the representation pattern of the image is constrained by the technical logic of the medium itself and the rules set by its apparatus; fourth, the mechanism of image representation essentially reflects a semiotic model based on symbolic methods. Although the interpretive basis of representation is iconicity, “the likeness of representation is not a replica of reality.” In other words, representation implies a series of complex operational “languages” that are not a simple mimicry of the external world but are ultimately “outputted” as a visual system or imaginative model based on the semantic foundation of mimicry through the debugging and configuration of media apparatus. Accordingly, the connection between images and the world, or the arrival of images toward the world, primarily relies on representation as a symbolic basis, that is, by using certain visual codes to mimic, depict, or portray the forms of the world, to realize the possibility of images referring to the world, images replacing the world, and images transcending the world.
(2) Representation: The Basis of the Connection between Images and the World
If we return to the philosophical “destiny” of images, representation is an inescapable “fate” of images, accompanying their “life” and “death.” Images cannot directly replace the world; only by resorting to the idea and means of representation can they achieve a symbolic grasp of the world. The emergence of images in the history of philosophy has never escaped the “burden of representation”—images cannot directly capture truth; their grasp of the world is essentially representational; and the objects of representation do not point to the universe of truth but merely to the mimicry of tangible things in the real world. The representation of images has therefore often been seen as a low-level mimicry, unable to approach the perfect Idea or reach the essence of understanding. Plato, as early as in “The Republic,” explained the essence of image representation, holding that the “thing” represented by an image is an “imitation of an imitation,” far removed from the world of truth.
Plato affirmed the representational essence of images, but he denied the potential meaning of representation—images represent the external world, and although this “world” is far from the world of truth, by mimicking the external world they become shadows of truth projected into reality. From this perspective, even in ancient Greek philosophy images were recognized within a representational structure. Nietzsche opened the curtain on the critique of rationalism, reviving the sensuous and granting images a “rebirth,” thereby rescuing them from their lowly, subordinate, marginal, and non-essential philosophical “position” and endowing them with a positive cognitive function. The representational “fate” of images continues to this day: whether in ancient Greek Platonism or in modern Western philosophy after Nietzsche, both acknowledge the representational essence of images and take representation as the starting point for thinking about their deeper significance—the two simply offer different understandings of the “content” of representation. The former holds that the representational capacity of images is limited to the mimicry of objects and cannot open the world of reason; the latter breaks down the barriers between images and reason, holding that images can represent richer meaningful content and reach the world of ideas. Regardless of which philosophical context one enters, representation remains a fundamental issue in the study of images; all questions concerning the language, function, interpretation, and practice of images essentially revolve around representation and ultimately respond to the proposition of representation.
The essence of image representation also finds a similar answer in Flusser’s philosophy of images. Flusser examines images in the relationship between language and images, opening up the cognitive channel between images and concepts. How should the relationship between images and the world be understood? Flusser offers a far-reaching conclusion: images are the intermediaries between the world and humanity. The realization of this “intermediary” function relies on the meaning apparatus formed by images—representation. Since humans exist in the world and are surrounded by it, the world is unfamiliar to them and difficult to grasp directly—as the poet Su Shi put it, one cannot see the true face of Mount Lu precisely because one stands within the mountain. Only by projecting the world onto some surface, making it an object of sight, can the world gain the possibility of being recognized and understood, and representation precisely reveals the internal language of this “projection.” Through the operational technique of representation that “invades” the world, the world is transformed into a “microcosm” of images, and the way to “open” this “microcosm” depends fundamentally on the way the world is “represented” before humanity.
The representational system of technical images has already transcended the direct description of real experience and risen to the reconfiguration and assembly of the experiential world. The representational attributes of images in the semiotic sense therefore reveal the essence of the existence of images and the basis of the connection between images and the world. It can be seen that images fundamentally reveal the way the world is imagined and perceived through images. When the world is pushed back by the lens and becomes a “microcosm” at the end of sight, images no longer accurately indicate the world but instead exist in forms such as canvases, screens, and interfaces, “not presenting the world but disguising it.” The disguise of the world by images is thus built precisely on the representational model of images and realized through the means of representation.
(3) AI Images: Intransitive Images
In the representational model of images, images possess transitivity, that is, images always invoke objects, summoning the manifestation of the external world. Like an open “container,” images are more like shadows of the external world projected onto the canvas; they do not completely enter the world of thought like words but establish a connection with the world through the means of representing the world. It is precisely in the sense of transitivity that images are not solitary entities, nor does the world end here; on the contrary, images stubbornly narrate the possible “appearance” of the external world, attempting to provide an interpretative “mirror” for the external world in the visual and formal dimensions. Likewise, precisely because of the deep connections and joining attributes of transitivity, images not only possess the ability to summon objects to “appear” or “show themselves” but also exhibit the idea and potential to mimic or even replace the world. If transitivity reveals the ontological attributes of images in the semiotic sense, that is, the indicative ability of images based on iconicity, then Sartre focuses on the pure negation characteristic of images, discovering another intransitive connotation of images, that is, through the condensation of time and space, discovering and presenting the pure materiality of things. In summary, the transitivity of images reveals the essence of the representation of images—manifesting both as the indicative representation of the external world that transcends the limitations of the frame and as the restoration and representation of pure materiality by rescuing things from the constraints of time and space.
If the symbolic premise of representation is the transitivity of images, generative AI images signify a brand-new concept of the image—an intransitive image, characterized by terminating the connection with the external world, fundamentally denying the representational foundation of image formation, and rejecting the representational mechanism of image meaning. In traditional images, the external world is at once the mimetic object of representation, the reference coordinate of representation, and the mysterious ghost of representation. Generative AI images, however, completely overturn the representational essence of images—images no longer open onto the external world; their purpose is not to activate or summon external objects; they come from the algorithmic “black box” and obtain their interpretive resources from the algorithmic apparatus. In short, the representational essence of the traditional image is manifested as transitivity, mimetic quality, and mirroring, whereas generative AI images deviate from the representational foundation of image existence, deny the representational language of image production, and are in essence intransitive images, anti-representational images, and generative images. If traditional images, from the moment of their birth, enter unbidden into the realm of media and communication and into the perspective of the subject’s intent, struggling to seek the abode of meaning and ultimately “returning home,” then generative AI images are images that “run away from home,” discarding the “attachments” of the external world and entering a mysterious world filled with “spells”—a world without grammar, a mysterious world that must be faced “alone,” where everything is in a state of perpetual generation, ever-changing, with no established rules to follow.
In fact, “generation” reveals the ontological characteristics that distinguish generative AI images from representational images—“representation” is linguistic, symbolic, and traceable; it leaves traces and can be recognized and understood through certain forms of symbolic reasoning. “Generation,” by contrast, discards the cognitive framework presupposed by representation, resolutely negating and abandoning knowledge discourse in the dimensions of language, order, pattern, object, and structure, forcing people to accept the legitimate position of the algorithmic “black box” and to acknowledge the autonomy of the image itself in interpretive activities.
In generative AI images, the external world is no longer the homeland of images. Images emerge under the “captivity” of “spells” and “dance” with the changes of the “spells,” ultimately gaining freedom with the disappearance of the “spells.” Furthermore, generative AI images shake the external order upon which images exist, deny the linguistic rules that enable images to be interpreted, and abandon the media games that allow images to emerge. This makes images no longer point to the external world and no longer exist depending on some external reference object; in the philosophical sense of images, it signifies a generative “thing” from “within” to “without.” For instance, the interpretation of generative AI images departs from the representational, linguistic, symbolic, imagery, and other rhetorical systems relied upon by traditional image hermeneutics, turning instead to the deep learning system of large language models to directly generate an “image about images.” The algorithmic “black box” of large language models cuts off all clues and imaginations from the outside—at least those that human consciousness, language, and thought cannot capture—and draws a strict dividing line between the “internal” and “external” of images—the “internal” is the algorithmic “black box,” the “image itself,” the “programmatic desire” unknown to the outside world, where an inexhaustible generating energy flows, providing endless energy supply for images; the “external” is a lonely world, a world exiled by images, where it is no longer the homeland of images and does not provide much interpretive language for images, but instead waits for the “redemption” of images.
3. The Birth of the “Universal Image”: How Does Text Become Image?
Regarding the relationship between images and objects, current academic discussion focuses mainly on mimicry theory, reflection theory, simulation theory, and substitution theory, each revealing a mode by which images cognize the world. As a symbolic form, images can not only represent the characteristics and forms of things but also possess the potential to substitute for them—based on “likeness,” occurring across domains, and ultimately reflecting a metaphorical model of this (the image) and that (the thing).
Accordingly, the imagination of images also reflects their symbolic capacity—when the intended content of an image points to some allusive meaning, such as discourse, ideology, or cultural themes, the relationship between images and the external world presents enormous joining potential. In the symbolic system of images, symbolism is often based on metaphor, even though its connotation points to some conventional symbolic meaning. As the implicit layer of image meaning, symbolic meaning is necessarily semiotic, discursive, and rhetorical. Generative AI images, whether in their mode of text production or their mechanism of meaning transmission, transcend the representational model that references the external world, and also transcend the visual grammatical logic of traditional representational structures. How, then, should we understand the problem of symbolism in image interpretation, and how should we comprehend the “opening” of images onto extratextual space? This requires returning to the mental foundation on which symbolism operates, namely the dimension of imagination, and taking the relationship between text and image as the object of investigation in order to re-examine the imagination of generative AI images.
(1) Symbolism: The Form of Concepts
Images are not an absolutely closed system. Beyond the frame of the image, there inevitably exists an extratextual space, and the basic idea of image interpretation is to strive to summon the world “outside the frame.” Compared to the order and structure of the “inside the frame,” the “outside the frame” is always imaginative—at least it needs to be “filled” or “extended” by imagination.
Imagination is a fundamental condition of human existence; without it, the production of meaning and order would be out of the question. Compared with the imaginative mechanism of language, the iconic qualities of images in the semiotic dimension determine the foundational role of imagination in image interpretation. That the two-dimensional plane of an image can express four-dimensional spacetime rests, fundamentally, on imagination’s storing and releasing of meaning. According to Flusser, imagination is both a conceptual level of thinking and a fundamental “technique” of images, comprising the techniques of image drawing and image interpretation. Compared with the referential nature of language, the virtual nature of images means that imagination plays an extremely important coordinating role in image cognition: it “depicts” relationships in real time-space as relationships on a two-dimensional plane, replacing the relations of reality with the relations of imagination and thus realizing the “imaging” of the world in the dimension of images. The result of imagination is the creation of a special kind of symbol, allowing images to break free from their original symbolic domain and become symbols that connect the form of the image with conceptual meaning.
Image interpretation is based on imagination. Whether it is the image thinking of the cognitive dimension, the image grammar of the linguistic dimension, or the image rhetoric of the practical dimension, they are all established on the foundation of imagination, thus establishing the rationale and legitimacy of image cognition. Rudolf Arnheim elevated imagination to a form of visual thinking and proposed the assertion that “visual perception has cognitive power,” thus bridging the long-standing cognitive gap between sensibility and reason, perception and thinking, art and science.
Since imagination plays a crucial role in the formation and expression of images, what is the imagination of artificial intelligence, and how should we understand the imagination of generative AI images? Unlike traditional ways of imagining images, AI images are at once products of prompt-driven output and images that “descend” suddenly from large language models. We can therefore understand the imagination of AI images along two dimensions: first, from the relationship between language and images, recognizing the “drawing” potential of AI images in responding to the “conceptual” dimension of language; second, from the technical apparatus logic of the algorithmic “black box,” exploring the “creative” potential of AI images to “bring forth” something from “nothing,” namely the imagination of “creating worlds.”
On one hand, regarding the “drawing of concepts”: the generation of AI images is primarily based on computational models. When users input instructions, the large language model instantly activates its “thinking” switch, seeking the optimal match between the language “input” and the image “output.” It must be acknowledged that the traditional imagination of images mainly reflects what Deleuze calls the second kind of calling and manifestation ability of the extratextual space, particularly the conceptual generation of the visual-symbolic dimension, that is, the use of images as a “medium” to produce intended concepts. The cognitive passage from images to concepts relies on the symbolic practice of visual rhetoric, and a common rhetorical strategy is to activate specific visual imagery or visual frameworks to unlock meaningful connotations beyond the image. Compared with traditional images’ mimicry of reality, generative AI images shift toward an understanding of reality—an understanding that is symbolic, representational, and above all conceptual. In the generative chain of AI images, one end is the user’s input prompt and the other is the symbolic form produced for that prompt. In other words, the images formed by AI ultimately represent a conceptual, formal world, which essentially reflects a “diagrammatic” form created with concepts as the basis of understanding. Large language models, represented by Sora, generate images through training on and understanding of massive quantities of other images: on one hand, they use images as tools to form image relationships that cognize the real world and, on this basis, create corresponding conceptual systems; on the other hand, they use computation as a method to build fitting models between concepts and images, exploring the possible image forms that concepts may “release.” The imagination of AI images thus mainly reflects the ability of images to understand prompts and their conceptual logic, namely the symbolic capacity of images to annotate, diagram, and restore concepts. If digital images in general are based mainly on programmatic control, that is, activating, according to the conditions of the “input,” the corresponding program configuration to achieve a specific “output” function, then generative AI images present a brand-new generative logic, forming possible “forms of concepts” on the basis of the descriptive method of prompts in the linguistic dimension.
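The idea of a “fitting model between concepts and images” can be illustrated, at a much smaller scale, with a contrastive text-image encoder of the CLIP family, which scores how closely a verbal concept and a visual form sit in a shared embedding space. The sketch below uses OpenAI’s publicly released CLIP weights through the Hugging Face transformers library; the image path and captions are placeholders, and nothing here claims to reproduce the internal architecture of Sora or of any other system named in this article.

```python
# Scoring how well verbal "concepts" fit an image in a shared embedding space.
# Assumed setup: `transformers` + public CLIP weights; file path and captions are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image file
captions = [
    "a harbor at dawn",
    "a crowded subway platform",
    "a bowl of fruit on a table",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability = the concept (caption) and the image lie closer together.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```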
On the other hand, regarding the “creation of worlds”: if traditional technical images take the “signals” of reality as input and output the “landscapes” of reality, then the information processing between input and output essentially follows the language of the apparatus—the mechanical motor of the camera determines the speed at which the film “turns” and the way it “unfolds,” forming a mode of imaging based on the mechanism of persistence of vision. Whether in the accumulation of frames in the film camera or the scanning imaging of the video camera, technical images are essentially a way of encoding the world. Long before images are “outputted,” the technical apparatus has already tacitly sanctioned the forms and meanings of the world, completing the conceptual “operation” of images. For this reason, technical images are organized according to the logic of concepts, ultimately outputting a diagrammatic mode of concepts.
It must be acknowledged that in Flusser’s theory of technical images, the source of images’ mimicry is primarily reality, or the “projected” image of reality. Generative AI images, however, deviate from the mimetic nature of images: through deep learning on massive image resources they master the general rules of image composition, and thus generate possible forms of the world under the “activation” and “guidance” of prompts. Traditional technical images are still constrained by the external world; their imaginative mode mainly reflects a one-off “distant view” with “reality” as origin or prototype—whether the accurate reflection of the real world by documentary images, the free association upon reality by artistic images, or the virtual simulation of reality by digital images, none has fundamentally escaped the shadow of reality. As products of deep learning, however, AI images are inherently computational, and their imaginative mode enters an intelligent, computational world—how the elements within an image are combined, what layouts and structures are formed, and what symbolic consequences arise have already transcended the factual character of the dimension of reality as well as the associative principles of the artistic dimension, turning instead to a principle of plausibility based on possibility. If electronic images simulate the compositional forms of the external world, generative AI images simulate its organizational laws and rules of motion; they simulate, as it were, its language model. From this perspective, generative AI images signify a brand-new way of imagining the world, transcending the theological imaginative schemas followed by traditional images as well as the spectacular imaginative schemas of electronic images, and revealing the meta-image attributes of schemas that can “generate,” “derive,” and “configure” images from the meta-language of schemas.
(2) Sora’s World: The “Universal Image” of Language
Generative AI images undoubtedly complete a magical symbolic “transcoding” project, creating an imaginative way from language to images, thus realizing the symbolic connection between the two. The reason why two entities can establish an imaginative relationship of “this” and “that” fundamentally lies in the discovery of their similarity, thus forming an imaginative relationship based on similarity as a thinking foundation. Text and images originally belong to different symbolic domains; the exchange of meaning between them mainly occurs through the intertextual context between verbal and visual. However, the “transcoding” project realized by generative AI images—from concepts to forms—fundamentally relies on the invention of “imagery,” that is, by discovering the similarity connection between images and concepts, forming an imaginative space for the exchange of meaning—the summons of text for specific imaginative “forms” and the construction of specific meaning “concepts” by images occur within the same imaginative model and achieve connections and equivalences between the two in their “negotiation.”
So what kind of similarity exists between text and images, which belong to heterogeneous categories? And how does the large language model achieve the “conversion” from language to images on the basis of this similarity? Only by finding the basis of equivalence between text and images can we truly build the “medium” of communication between them. It must be acknowledged that, whether for the conversion from language to images or the interpretation from images to language, neither can do without the shared code that establishes the connection—imagery. If images are material, bearing the materiality of the external world, imagery belongs to the spiritual domain; it possesses a strong capacity for reproduction and cloning, existing within images, flowing between media, and propagating endlessly.
In fact, the reason the “image turn” has sparked a wave of “images replacing words” lies fundamentally in the vitality of imagery, whose capacity for circulation and proliferation has already surpassed that of words, dominating the imaginative mode of an era—to the point that the understanding and cognition of language must, in some cases, resort to imagery. So-called “verbal imagery” not only signifies a visual form related to language but has also risen to a symbolic schema, revealing language’s deep dependence on imagery. As Mitchell points out, “language and writing themselves are two mediums, one gaining substance through sound imagery, the other through graphic imagery.” It is not difficult to see that in language, imagery provides a way for meaning to “unfold,” while in images, imagery reveals the imaginative mode of the text. For this reason imagery, as the “currency” of exchange between text and images, plays an active “medium” role, realizing the transition and conversion between the two.
Examining the mode of conversion from text to images must therefore focus on the “medium” mechanism of imagery, so as to truly reveal the principles by which generative AI images are formed. Although artificial intelligence follows a computational logic and its understanding of imagery evidently differs from human models of thinking, as a generative program device the deep “learning” of large language models must reflect a “learning” of the imagery of images, and the final result of “training” points to a certain “secret of imagery,” reflected above all in the modes of processing the conversion between text and images, such as schemas and images. As typical forms of imagery, schemas and images resemble a generative model, storing the “internal language” of images, such as the mechanisms, rules, and element relationships of image composition, and thus continuously “releasing” corresponding image forms. In current large language models, in order to restore the referential connotations of prompts as fully as possible, creators often need to adopt more detailed descriptive methods: accurately stating the style and tone of the image, precisely depicting the structural and relational elements within it, thoroughly describing the subject’s manner and course of movement, and so on, so as to capture the imagery of language as fully as possible and achieve the visual “proliferation” from imagery to images.
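As a purely illustrative sketch of this “detailed descriptive method” (and of the reusable style “spells” discussed in the following paragraphs), the snippet below assembles a prompt from explicit fields for subject, style, tone, composition, and motion. The field names and example values are hypothetical and are not drawn from the article or from any particular tool’s documentation.

```python
# Assembling a detailed prompt from explicit descriptive fields (all values are
# hypothetical examples; nothing here is specific to Midjourney, Sora, or Runway).
from dataclasses import dataclass

@dataclass
class PromptSpec:
    subject: str       # what the image is of
    style: str         # overall visual style, the reusable "spell" part
    tone: str          # lighting and mood
    composition: str   # structural and relational elements inside the frame
    motion: str = ""   # how the subject moves (relevant for text-to-video)

    def render(self) -> str:
        parts = [self.subject, self.style, self.tone, self.composition]
        if self.motion:
            parts.append(self.motion)
        return ", ".join(parts)

spec = PromptSpec(
    subject="an elderly calligrapher writing at a wooden desk",
    style="ink-wash painting style",
    tone="soft morning light, muted colors",
    composition="desk in the foreground, open window behind, shallow depth of field",
    motion="slow push-in toward the brush as it touches the paper",
)
print(spec.render())  # the string handed to a text-to-image or text-to-video model
```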
Unlike ordinary automated program devices, Sora grants subjects greater freedom of adjustment and greater imaginative authority—like a magical world opened by “spells,” prompts essentially serve the production of imagery, forming imaginative ways of seeing the world. The imaginative mode here manifests mainly as a metaphorical model, namely identifying and establishing a similarity connection between text and images and transforming this similarity into shared visual schemas, images, and the like, thus opening a path of equivalence between text and images by way of imagery. Even the currently much-discussed image-to-video application of “bringing old photos to life” essentially still operates under the prompting of linguistic “spells,” endowing images with a certain temporal extension and motion attributes so as to restore the “past” scene.
In fact, every change of the prompt means an adjustment of the parameters of the large model, and the result is the formation of a “universal image” of language. Precisely because of the formation of the “universal image,” the verbal “imagery” and the visual “imagery” not only have the possibility of conversion but also possess a “linguistic” foundation for equivalence. Through the exploration and practice of countless developers, a series of style code “spells” related to large language model applications such as Midjourney and Stable Diffusion have emerged online, allowing users to simply copy the corresponding code to obtain the desired visual style. Clearly, the standardized production of these “spells” indicates that the “universal image” of language is becoming increasingly stable and mature. Therefore, the significance of Sora as a world simulator is not merely reflected in the visual understanding of the world; its deeper connotation lies in forming a set of “universal images” about the world through the algorithmic “black box”—people “diagram” the possible forms of the world in an imagistic manner and construct a set of universal visual forms in the dimension of “verbal imagery.” Accordingly, the connection between language and images gradually breaks free from the symbolic meaning of indicative models and unfolds along the algorithmic logic, forming a relatively stable language of equivalence through the repeated probing between the two.
4. The Invisible Symbolic “Traces”: The Collapse of the Intertextual World
Traditional images have a definite way of coming into being—whether artistic, electronic, or digital, they inevitably carry a relatively stable “factory setting.” This setting, as the meta-language of image interpretation, not only determines the foundational language and rules of interpretation but also profoundly shapes the possible trajectory of an image’s “destiny.” In other words, where do images come from, where are they going, and what functions and purposes do they serve? These questions have long been inscribed in the linguistic depths of images like a visual “gene”—when images are interpreted, the codes, factors, and clues hidden deep within them are “summoned” as a stubborn interpretive “force” or interpretive “condition,” limiting the direction of their meaning.
Traditional image hermeneutics grants immense interpretive efficacy to context, believing that image interpretation heavily relies on the contextual relationships of images. Compared to linguistic texts, image interpretation relies on a certain anchoring system; different “anchoring” methods determine different ways of interpreting images. If the interpretation of linguistic texts possesses relative certainty, then the inherent symbolic characteristics of image texts make their meaning interpretation possess greater variability and uncertainty. This requires the anchoring of context to establish the starting point and possible direction of image interpretation.
(1) Intertextual Context: The Anchoring System of Image Interpretation
Image interpretation involves three common forms of context: cultural context, situational context, and intertextual context. As a fundamental form of context, intertextual context reveals the mode of existence of texts and the rules of meaning on which interpretive action relies. Specifically, the existence and emergence of any text inevitably carry the “traces” and “shadows” of other texts, and it is in intertextual relationships with other texts that it gains interpretive “clues” from outside. These textual forms that provide “resources” for interpretation are what semiotics calls accompanying texts (co-texts).
As resources or clues for image interpretation, the accompanying text of images is not an external “intrusion” but carries a kind of conventional interpretive “language,” where certain universal interpretive “codes” are stored. It can be imagined that without the accompanying texts diffused around the image, image interpretation is destined to be difficult, if not impossible. It is precisely within the intertextual context constructed by accompanying texts that images draw inspiration and nourishment for production, also obtaining the basis and resources for interpretation, ultimately gaining the necessary anchoring system and symbolic rules for image interpretation.
Unlike the relatively stable regulatory systems of traditional images, the algorithmic “black box” of generative AI images denies the foundational significance of intertextual relationships in image interpretation, as well as the intertextual context. For instance, the authorial “mark” carried by traditional images is difficult to find in AI images. According to the basic assumption of authorial theory, traditional texts harbor certain authorial intentions and authorial identities, which endow textual interpretation with background information and interpretive perspectives. However, the premise of authorial theory is that the author of the text possesses clear referentiality, namely, the author exists, the authorial intention is identifiable, the authorial identity is recognizable, and the authorial style is also traceable. As a typical accompanying text, the “revival” of the author in the text provides crucial interpretive basis for textual interpretation. However, the generative AI images based on prompts reject the “author” subject of image production and the “personal” background it carries. Therefore, any attempt to find interpretive “clues” through the author as an accompanying text is destined to be powerless and futile in AI images.
(2) Deep Learning: The Illusion of Intertextuality
The premise and foundation of the formation of intertextual contexts lie in the “intrusion” or “manifestation” of other information related to images in the text. This intertextual information from other texts carries indispensable deciphering information for image interpretation and constructs the essential external context for image interpretation.
In contrast, generative AI images accept other images only in a limited way: they extricate themselves from the intertextual world, rejecting both the intertextual context of image interpretation and the influence of other images on interpretive action. If traditional images exist within intertextual contexts and their interpretation cannot do without the presence of accompanying texts, then generative AI images announce the collapse of intertextual contexts and the decline of the intertextual world—since everything happens silently within the algorithmic “black box,” AI images are pushed into a device system “isolated from the world,” where they accept the computation and tuning of large language models and take on an image form far removed from other images. Whether in the process of text generation or in the mode of meaning interpretation, generative AI images have already stepped out of the intertextual world; they sever their connections with the outside world and the intertextual chain with other images, entrusting their “destiny” entirely to algorithms, programs, and code. When algorithms dominate the internal “language” of AI images, other images are merely “scenery” in a parallel world, silently observing the self-referential, magical expressions of AI images. Generative AI images have thus escaped the constraints of the external world and jumped out of the intertextual structure with other images, ultimately being neither a “mirror” of the external world nor a “shadow” of other images cast into the “black box.”
Indeed, the formation of generative AI images relies on deep learning of other images, and the common “learning” approach is to draw “nourishment” from other images; however, does this “learning” process oriented toward other images still imply an intertextual relationship? Only by re-understanding the relationship between AI images and other images can we truly grasp the applicability of intertextuality to image interpretation. Although the training model of AI images is built on a massive foundation of images and realized through algorithms, an undeniable fact is that the process of deep learning does not simply mimic the external forms of other images; rather, it embodies the acquisition of the mysteries of the composition of the world using other images as “media,” ultimately forming a pattern of understanding about the connotations and essence of the world in the dimension of images. Midjourney, Runway, Sora, Genie, and Pika’s “memories” store the “image codes” of the essence of the world—scattered fragments of other images in the online world merely participate in the training process of large language models, and once the training is complete, their mission is declared over, like “digital orphans” discarded by the large model, returning to the intertextual world, back to the “dust” of the original network “position.”
Unlike the mutual presence of texts or the bidirectional flow of meaning presupposed by intertextual theory, the relationship between generative AI images and the objects of their “learning”—other images scattered across the online world—is merely a one-way, weak, and fleeting connection. The diffusion model underlying Sora, for example, surpasses earlier recurrent neural networks (RNNs) and generative adversarial networks (GANs). The basic principle of the GAN is the machine’s imitation of humans, achieving a corresponding level of “creation” through the mechanical imitation of other images; Sora’s diffusion model, by contrast, aims at “thinking like humans,” mastering the corresponding rules of composition, element relationships, and temporal and spatial structures through repeated tuning and training of the model, and on this basis forming a pattern of understanding of the world, such as schemas of “beauty,” “horror,” and “harmony.”
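To make the contrast concrete: where a GAN trains a generator to fool a discriminator on finished images, a diffusion model is trained to undo noise that has been added to images at random timesteps, and in doing so it internalizes rules of composition rather than copying any single source image. The sketch below is a generic, toy-scale DDPM-style denoising training step in PyTorch, written under simplifying assumptions of my own (random placeholder data, a toy network, no timestep or text conditioning); it is not OpenAI’s unpublished Sora implementation.

```python
# One schematic denoising-diffusion training step (generic DDPM-style objective,
# toy scale; placeholder data and network, no text or timestep conditioning).
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)      # cumulative signal-retention schedule

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 3 * 32 * 32))  # toy noise predictor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x0 = torch.rand(8, 3, 32, 32)                      # placeholder batch of "training images"
t = torch.randint(0, T, (x0.size(0),))             # a random timestep for each sample
eps = torch.randn_like(x0)                         # the noise the network must recover

a = alpha_bar[t].view(-1, 1, 1, 1)
x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps       # image corrupted to timestep t

eps_pred = model(x_t).view_as(eps)                 # predict the injected noise
loss = nn.functional.mse_loss(eps_pred, eps)       # denoising objective, not an adversarial game

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```

A real system additionally conditions the noise predictor on the timestep and, for text-to-image or text-to-video generation, on an embedding of the prompt; sampling then runs the learned denoiser in reverse, step by step, from pure noise.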
Therefore, the training method of Sora does not merely stay at the level of simple image “creation” but aims to become a “person” that can “create,” fundamentally denying the intertextual relationship between generative AI images and other images. Specifically, there does not exist a recognizable, communicable, or equivalent meaning channel between AI images and other images; it is difficult for people to identify the “shadows” of other images from the representational structure of generative AI images, nor can they regard them as interpretive “clues” in the intertextual sense. The “contribution” of other images to Sora merely reflects a data relationship at the model level, ultimately influencing the parameters and indicators of the model. At the moment other images depart, Sora becomes an independent algorithmic device, generating image content relying on the “input” of prompts.
The reason why generative AI images escape the constraints of intertextual relationships also lies in the fact that the training data of large language models includes not only images scattered in the external world but also images generated by the models themselves. This makes the image relationships and their deep data systems established on the basis of “feeding” and “training” surpass the simple dimensions of “learning” and “imitation,” presenting complex connotations and processes that intertextual theory struggles to explain—the “input” process of images, on the surface, establishes a structure of intertextuality in the name of “learning,” but in reality, it is an illusion of intertextuality. In AI images, we can no longer accurately identify the “learning” objects of the large model in the boundless online world, nor can we identify the symbolic “traces” in the hermeneutic sense. In other words, the other images in the online world do not possess the significance of accompanying texts for AI images; they cannot assist generative AI images in realizing the tracing of the source of image “emergence,” the anchoring of image “semantics,” or the tracking of image “destiny.” In summary, when the algorithmic “black box” dominates the “learning” process of large language models, the connection between generative AI images and other images is essentially tightly bound to the dimensions of computation and code rather than cognition and consciousness—other images have not provided AI images with anchoring, limiting, or understanding modes in the interpretive dimension; on the contrary, their meaning is confined to the dimensions of model training parameters and indicators, enabling the possibility of “images becoming images.”
In summary, under the impact of the algorithmic “black box,” an image hermeneutics grounded in representation, symbolism, and intertextuality urgently calls for an expansion and renewal of its knowledge discourse. Only by using representation, symbolism, and intertextuality as conceptual tools to critically examine the interpretive dilemmas generative AI images create and the space of knowledge discourse they open up can we truly grasp what changes and what remains unchanged in the knowledge system of image hermeneutics. Specifically, regarding the problem of representation in the dimension of interpretive ontology, the foundational basis of generative AI images is generation rather than representation; image interpretation therefore urgently needs to move beyond the symbolic representational logic and visual language system of the representational dimension and return to the ontological attribute of “generation,” reconstructing the cognitive relationship between images and the world on the basis of the image’s “self-reference.” Regarding the problem of symbolism in the dimension of interpretive language, generative AI images produce “universal images,” reconciling the contradictions between text and image and providing a certain basis for equivalence between the two, thus answering the theoretical question of “how concepts become images.” Regarding the problem of intertextuality in the dimension of interpretive context, generative AI images escape the meaning rules of intertextuality, freeing themselves from the limitations and influences of external images on interpretive activity and thereby calling for a brand-new set of rules of image interpretation. It should be particularly emphasized that when the unknowable, traceless, grammar-less algorithmic “black box” has already become the foundational apparatus of AI images, how to assign the algorithmic “black box” a theoretical position, that is, how to study it within the knowledge framework of hermeneutics, is undoubtedly a theoretical proposition that image hermeneutics must answer in the era of artificial intelligence.
(The original text is 24,000 words; this text is an excerpt. For detailed discussions, please refer to the original text.)

This article is reproduced from Nanjing Social Sciences.