New Opportunities for Intelligent Film Production: A Review of Multimodal Technology Development at CVPR 2024

This article was published in Modern Film Technology, 2024, Issue 7.

Expert Commentary

Film is an organic combination of visual and auditory art, delivering an unparalleled audiovisual experience to the audience through sight and sound. Multimodal technology synergistically exploits visual, auditory, textual, and other information to accomplish tasks that are difficult to achieve through a single sensory channel, effectively improving the performance and robustness of information processing. Unlike traditional visual generation tasks, film content production requires audio-visual synchronization, which single-modality techniques struggle to deliver; multimodal technology can synthesize and process audiovisual content in a synchronized manner, making the automatic generation of film content possible. Moreover, sound effect creation in film differs from conventional audio signal processing: a film's sound effect tracks must be finely controlled around each individual visual event, and the spatial character of the sound must adapt to changes in the picture. These requirements pose challenges for sound designers' creative editing. Multimodal technology can automatically generate sound effects that correspond to the visual scene from the input data, achieving a logically coherent fusion of picture and sound, which helps inspire sound designers' creativity and effectively improves creative efficiency. The article "New Opportunities for Intelligent Film Production: A Review of Multimodal Technology Development at CVPR 2024" discusses and analyzes the cutting-edge multimodal research presented at the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), objectively presenting the current frontier of multimodal technology and exploring the new opportunities it may bring to intelligent film production. It offers high reference value for film practitioners and related researchers.

——Liu Shiguang

Professor

Tianjin University, School of Intelligence and Computing, Doctoral Supervisor

Author Introduction
Xie Zhifeng

Associate Professor at Shanghai Film Academy, Shanghai University, and the Shanghai Film Special Effects Engineering Technology Research Center; Doctoral Supervisor. Main research areas: advanced film technology, artificial intelligence.

Yu Shengye

Master's student at Shanghai Film Academy, Shanghai University. Main research areas: multimodal models, film sound effect generation.

Abstract

To explore new opportunities for intelligent film production, this article analyzes in depth the cutting-edge multimodal research presented at the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024). Specifically, it focuses on three modalities (visual, textual, and audio) and on the key applications of multimodal technology in film production: video generation, video editing, and trailer cutting; video description generation and video content understanding; and audiovisual synchronization, sound effect generation, and video scoring. The review shows that integrating multimodal technology into the film production pipeline not only significantly improves production efficiency but also enhances artistic expressiveness. Finally, the article summarizes the challenges currently facing multimodal technology and looks ahead to the future development of these technologies in film production.

Keywords

Artificial Intelligence; Film Production; Multimodal Technology; Large Language Models; Computer Vision

1 Introduction

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), first held in Washington, D.C. in 1983, has developed into the most influential annual event in the field of computer vision. As a CCF-A conference, CVPR attracts researchers from around the world each year to share their latest results; these results not only guide future research directions but also drive the practical application of technology. As of July 8, 2024, Google Scholar Metrics shows that CVPR has an H5 index① of 422, ranking fourth among all publications worldwide and first among engineering and computer science publications.

CVPR is known for its rigorous reviewing and low acceptance rate, and its accepted papers cover a wide range of research directions, from image processing and object detection to deep learning. CVPR 2024 was held from June 17 to 21, 2024, in Seattle, Washington, USA. According to the results released by CVPR on April 5, the conference received 11,532 valid submissions, of which 2,719 were accepted, for an overall acceptance rate of approximately 23.6%[1]. An analysis of the accepted papers shows that diffusion models, 3D vision, neural radiance fields (NeRF), large language models (LLMs), multimodal learning, and semantic segmentation have become hot topics. These studies are not only academically significant but also strongly promote innovation in applications such as film, gaming, animation, and interaction.

2 Integration of Multimodal Technology in Film Production

"Multimodal" refers to combining information from multiple sensory channels, such as vision, language, and hearing, to improve a machine's ability to understand its environment. In this way, models can not only process images and videos but also understand and generate text describing that visual content, or respond to voice commands. Multimodal technology enables computers to understand complex scenes and interactions more comprehensively, which is particularly important in natural language processing (NLP), image and video analysis, robotics, and user interface interaction.

In contemporary film production, the application of multimodal technology significantly enhances the artistic quality of film works by deeply integrating the three core modalities of vision, text, and sound, promoting innovation in film technology, and further deepening the emotional conveyance and visual impact of films.

As the fundamental element of film, the visual modality captures and presents images through advanced cinematography and meticulous visual design. Excellent cinematography attends not only to composition and color management but also enhances the visual dynamism of the story through camera movements such as push-in, pull-out, and rotation, allowing the audience to feel the authenticity of the scene. The textual modality unfolds through scripts and dialogue, giving films structure and narrative depth. The script is not only the blueprint for story development but also the core of emotional conflict and character growth; effective dialogue deepens the characters, drives the plot forward, and reveals deeper themes and meanings. The sound modality strengthens a film's emotional expression through carefully designed sound effects and music, encompassing not only background music and theme songs but also ambient sound, the sounds of character actions, and more; used precisely at the right moments, these elements can greatly heighten the tension or emotional depth of a scene. When the three modalities are effectively integrated in a film, they complement one another to construct a rich, multi-layered, multi-sensory experience: the impact of the visuals, the depth of the textual narrative, and the emotional guidance of the sound together give viewers a fully immersive experience.

The multimodal research presented at CVPR 2024 is expected to bring technological innovation to the film industry, providing technical support for streamlining production workflows, enhancing the artistic value of works, and increasing market competitiveness. This article examines the specific applications of these technologies in film production and the changes they bring: from video generation and video editing to trailer cutting, through advances in video description generation and video content understanding, to innovative uses of audio technology in audiovisual synchronization, sound effect generation, and video scoring. It also summarizes current challenges and future prospects, exploring how multimodal technology continues to drive innovation in film production.

3 Overview of Multimodal Technology in Film Production at CVPR 2024

3.1 Visual Modality and Film Production

The visual modality is the most direct and impactful form of expression in films. Early films were primarily black and white and silent, relying solely on visuals to tell stories and express emotions. With technological advancements, especially the introduction of color films and digital imaging technology, the visual expressiveness of films has been significantly enhanced. In modern film production, technologies such as high-definition photography, special effects, and computer-generated imagery (CGI) are widely used, allowing creators to present more refined and stunning visual effects.

(1) Video Generation

The video generation task uses generative models to create video content automatically. Such models generate corresponding videos from textual descriptions, images, and other inputs, can produce highly realistic scenes and characters, and are widely applied in film production, advertising, virtual reality (VR), and animation and games.

Wu et al.[2] proposed the LAMP (Learn A Motion Pattern) technology, a method for efficient and low-cost video generation by fine-tuning a pretrained text-to-image model on a small amount of video data. LAMP decouples content and motion generation, optimizes inter-frame communication, and employs a shared noise strategy, effectively enhancing video quality and motion pattern learning while demonstrating good generalization capabilities.
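The shared-noise idea can be illustrated with a minimal PyTorch sketch: every frame's starting latent mixes one noise tensor shared across all frames with a small frame-specific component, so the frames begin from correlated latents. The mixing weight `alpha` and the tensor shapes below are illustrative assumptions for exposition, not details taken from the LAMP paper.

```python
import torch

def shared_noise(batch, frames, channels, height, width, alpha=0.2):
    """Mix one base noise tensor shared across frames with per-frame noise,
    so all frames start from correlated latents. `alpha` controls how much
    frame-specific noise is added (an illustrative knob, not from the paper)."""
    base = torch.randn(batch, 1, channels, height, width)        # shared across frames
    per_frame = torch.randn(batch, frames, channels, height, width)
    # convex mix of variances keeps the result unit-variance
    return (1 - alpha) ** 0.5 * base + alpha ** 0.5 * per_frame

if __name__ == "__main__":
    z = shared_noise(batch=1, frames=16, channels=4, height=64, width=64)
    print(z.shape)   # torch.Size([1, 16, 4, 64, 64])
```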

Moreover, Wang et al.[3] developed the MicroCinema method, which addresses appearance and temporal coherence issues in video generation through a two-stage innovative process. The method first uses a text-to-image generator to create key frames, and then in the second stage, employs a Stable Diffusion model to add temporal layers for high-quality motion modeling. The introduced appearance injection network and appearance-aware noise strategies ensure that the video maintains visual consistency while showcasing smooth dynamic effects.

Text-driven video generation holds great potential: it can produce rich and varied visual content from simple textual descriptions, and its free, diverse outputs greatly broaden creators' imagination. However, to achieve the fine control and high-quality output required in film production, these techniques must satisfy additional conditions in practical applications.

Zeng et al.[4] proposed the PixelDance technology, which combines diffusion models with text and image instructions to generate content-rich dynamic videos. The core innovation of this method is the joint use of image instructions for the video's first and last frames together with text instructions, allowing the model to construct complex scenes and actions more precisely while offering finer control.
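One plausible way to wire such frame conditions into a latent diffusion denoiser is to attach the encoded first- and last-frame images to every frame's noisy latent, with the text handled separately through cross-attention. The sketch below (channel concatenation, placeholder shapes) is an assumption for illustration rather than PixelDance's exact architecture.

```python
import torch

def build_conditioned_input(noisy_latents, first_latent, last_latent):
    """Attach first- and last-frame image conditions to every frame's noisy
    latent along the channel axis; the text condition would enter through
    cross-attention inside the denoiser (not shown).
    noisy_latents: (B, F, C, H, W); first_latent, last_latent: (B, C, H, W)."""
    b, f, c, h, w = noisy_latents.shape
    first = first_latent.unsqueeze(1).expand(b, f, c, h, w)
    last = last_latent.unsqueeze(1).expand(b, f, c, h, w)
    return torch.cat([noisy_latents, first, last], dim=2)   # (B, F, 3C, H, W)

if __name__ == "__main__":
    x = build_conditioned_input(torch.randn(1, 16, 4, 32, 32),
                                torch.randn(1, 4, 32, 32),
                                torch.randn(1, 4, 32, 32))
    print(x.shape)   # torch.Size([1, 16, 12, 32, 32])
```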

Jain et al.[5] introduced the PEEKABOO method, which incorporates spatio-temporal control into a UNet-based video generation model. This method achieves precise control over detailed content in videos while maintaining low latency by adjusting the attention mechanism. This not only improves video generation quality but also allows users to interactively control the size, position, posture, and motion of objects in the video, enhancing the personalization and application potential of video content.
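The attention-adjustment idea can be sketched generically: attention logits toward tokens outside a user-specified spatial region are penalized, so the denoiser concentrates the requested object inside that region. This is a simplified masked-attention sketch under my own assumptions, not PEEKABOO's exact formulation.

```python
import torch

def masked_attention(q, k, v, keep_mask, penalty=-1e4):
    """Scaled dot-product attention with a per-key spatial mask: logits to
    keys outside the user-specified region receive a large negative bias,
    steering generated content toward the requested layout.
    q, k, v: (B, N, D); keep_mask: (B, N) boolean, True = inside the region."""
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bqd,bkd->bqk", q, k) * scale          # (B, N, N)
    bias = (~keep_mask).unsqueeze(1).float() * penalty           # (B, 1, N)
    weights = torch.softmax(logits + bias, dim=-1)
    return torch.einsum("bqk,bkd->bqd", weights, v)

if __name__ == "__main__":
    B, N, D = 1, 16, 8
    region = torch.zeros(B, N, dtype=torch.bool)
    region[:, :4] = True                      # only the first 4 tokens are "inside"
    out = masked_attention(torch.randn(B, N, D), torch.randn(B, N, D),
                           torch.randn(B, N, D), region)
    print(out.shape)   # torch.Size([1, 16, 8])
```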

Cai et al.[6] proposed the Generative Rendering method, which pushes the boundaries of video generation technology further. This diffusion-based method initializes noise in UV space, enhances the self-attention layers, and uses depth cues for guidance to achieve high-fidelity, frame-consistent generation of 4D-guided stylized animations. It renders untextured 3D animated scenes directly into stylized animations, with the style specified through text prompts, giving image generation models a higher level of user control.

In film production, text-based video generation can be applied to previsualization, animation, concept validation, and storyboarding. A storyboard, for example, usually serves as a "visual script": a series of illustrations and notes arranged in chronological order, drawn by storyboard artists according to the director's instructions and the script, detailing camera angles, camera movements, key events, and other specifics of a scene or action. Storyboards not only help the production team preview the film's visual presentation but also act as a tool for communication and collaboration, ensuring that the film's visual style and narrative pacing are unified and executed accurately. The traditional storyboarding process, however, is complex and time-consuming, and is limited in how well it can convey complex actions, special effects, and dynamic scenes. Text-based video generation shows great potential here: it can turn textual descriptions into dynamic, interactive 3D storyboards, giving directors and producers a more intuitive and detailed way to preview. By instantly generating the corresponding dynamic scenes and shots from simple text input, directors and producers can preview key shots and adjust them as needed; this immediate feedback greatly improves the accuracy and efficiency of decision-making. The studies above[2-6] not only demonstrate innovation in text-guided video generation but also underline how essential detail management and dynamic control are to high-quality film production.

(2) Video Editing

The video editing task involves refining adjustments to the visual elements of the video through algorithms and models, such as visual style, characters, and scenes, to enhance video quality and visual effects, achieving the artistic intent of the creator.

At CVPR 2024, Yang et al.[7] proposed a novel zero-shot diffusion framework called FRESCO, focusing on maintaining spatial-temporal consistency in video editing. This framework significantly improves the consistency and coverage of video editing by combining optical flow guidance and self-similarity optimization features. Users only need to provide the input video, and FRESCO can re-render the video according to the target text prompts while retaining the original semantic content and actions. This framework is compatible with various auxiliary technologies such as ControlNet, SDEdit, and LoRA, providing flexible and personalized video transformation and editing capabilities.

Feng et al.[8] proposed CCEdit, an advanced generative video editing framework that achieves precise control over structure and appearance through a trident network structure. This framework includes three main branches: a primary generative branch for text-to-video, a structure control branch, and an appearance control branch. The primary generative branch adapts a pretrained text-to-image model for video generation, while the structure control branch handles the structural information of each frame of the input video, and the appearance control branch allows editing reference frames for precise appearance control. These branches are integrated through learnable temporal layers, ensuring temporal consistency of video frames.

Ma et al.[9] proposed MaskINT, a text-based video editing framework that improves editing efficiency and quality through a two-stage process: it first edits key frames with a pretrained text-to-image model, then generates all intermediate frames in parallel with a structure-aware frame-interpolation module built on a non-autoregressive masked generative transformer. MaskINT significantly accelerates video editing; experiments show it is comparable to diffusion-based methods in temporal consistency and text alignment while being 5-7 times faster at inference. The framework offers an efficient text-driven video editing solution for advertising, live streaming, and the film industry.

Xing et al.[10] proposed SimDA (Simple Diffusion Adapter), an efficient video diffusion model that fine-tunes an existing large image diffusion model (such as Stable Diffusion) with only about 2% of its parameters. SimDA improves temporal modeling through latent-shift attention (LSA), significantly enhancing processing efficiency and video quality. The model reduces GPU memory requirements and time costs during both training and inference, achieving inference roughly 39 times faster than autoregressive methods such as CogVideo and training about 3 times faster, and it can also be applied to video super-resolution and editing. SimDA not only improves video generation and editing performance but also substantially reduces training costs.
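The adapter idea behind such methods can be sketched minimally: a small trainable module projects features down, mixes information across frames by shifting a fraction of the channels one step along the time axis, and projects back up as a residual, while the image backbone stays frozen. The bottleneck size and shift ratio are illustrative assumptions; this is not SimDA's exact latent-shift attention.

```python
import torch
import torch.nn as nn

class TemporalShiftAdapter(nn.Module):
    """Lightweight adapter sketch: down-project, shift part of the channels one
    step along the frame axis (cheap temporal mixing), up-project, and add the
    result as a residual. Only this module would be trained; the frozen image
    diffusion backbone is not shown."""
    def __init__(self, dim, bottleneck=64, shift_ratio=0.25):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        self.shift = int(bottleneck * shift_ratio)

    def forward(self, x):              # x: (batch, frames, tokens, dim)
        h = self.act(self.down(x))     # (B, F, N, bottleneck)
        shifted = torch.roll(h[..., :self.shift], shifts=1, dims=1)  # roll along frames
        h = torch.cat([shifted, h[..., self.shift:]], dim=-1)
        return x + self.up(h)

if __name__ == "__main__":
    x = torch.randn(1, 8, 16, 320)
    print(TemporalShiftAdapter(320)(x).shape)   # torch.Size([1, 8, 16, 320])
```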

The studies above operate on 2D video, while other work focuses on editing 3D scenes and characters. Jiang et al.[11] proposed a novel cinematic behavior transfer method that uses differentiable filming based on neural radiance fields (NeRF) to extract camera trajectories and character actions from existing films and transfer them to new characters and scenes, allowing attributes such as lighting, character dynamics, and scene settings to be modified.

Liu et al.[12] proposed DynVideo-E, a video editing framework that, for the first time, applies dynamic neural radiance fields to human-centered video editing. Diffusion-based video editing struggles to maintain high temporal consistency on long videos or videos with large-scale motion and viewpoint changes. DynVideo-E integrates the video information into a 3D dynamic human space and a 3D background space and uses a human-pose-guided deformation field to propagate edits consistently throughout the video. The technique also supports high-fidelity novel view synthesis with a 360° free viewpoint and significantly outperforms current state-of-the-art methods, with human preference improvements ranging from 50% to 95%. DynVideo-E not only improves the temporal consistency and visual quality of video editing but also enhances editing quality and animation capability in the 3D dynamic human space through multi-view, multi-pose score distillation sampling (SDS), super-resolution, and style transfer strategies.

Through these methods, directors and production teams can preview and optimize different shooting effects before actual filming, or make dynamic adjustments as needed in post-production. They not only reduce the need for reshoots and custom animation but also significantly improve production efficiency and artistic expressiveness. For example, "Spider-Man: Across the Spider-Verse" builds on the concept of parallel universes, bringing together more than 280 Spider-Man characters, each with a distinctive style, such as punk Spider-Man, LEGO Spider-Man, and dinosaur Spider-Man. The film breaks the uniform-style convention of traditional animated features and, by blending styles such as watercolor, pencil sketch, and comics, delivers a seamless, richly colored audiovisual experience with unprecedented visual impact and emotional resonance. Applying the video editing technologies above could make this kind of stylistic innovation in film more efficient and less costly.

(3) Trailer Cutting

In the film industry, trailers play a crucial marketing role. Trailers spark audience anticipation and interest by showcasing captivating key scenes, storylines, and cast members, serving as a key marketing tool before the film’s release. However, the traditional trailer production process is not only time-consuming but also relies on expertise, often involving cumbersome shot selection and sequencing.

To address these challenges, Argaw et al.[13] proposed an automated solution called the Trailer Generation Transformer (TGT). The framework automatically selects shots from the full film and assembles them into a logically coherent trailer. Drawing on ideas from machine translation, TGT models both the film and the trailer as sequences of shots and casts trailer generation as a sequence-to-sequence task. It adopts a deep learning encoder-decoder architecture: the film encoder uses self-attention to embed each film shot within the overall context, capturing the complex relationships between shots, while the trailer decoder autoregressively predicts the feature of the next trailer shot, explicitly accounting for the temporal order of shots in the trailer. This automated editing technology not only streamlines trailer production but also significantly improves its efficiency and quality.
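The sequence-to-sequence formulation can be made concrete with a small PyTorch sketch: a transformer encoder contextualizes movie-shot features, a causal decoder regresses the feature of the next trailer shot, and the closest movie shot is retrieved as the next pick. The feature dimension, the cosine-similarity retrieval step, and all hyperparameters here are illustrative assumptions rather than the TGT authors' exact design.

```python
import torch
import torch.nn as nn

class TrailerSeq2Seq(nn.Module):
    """Sketch: encode the movie's shot-feature sequence, then autoregressively
    predict the feature of the next trailer shot. Shot features would come
    from a frozen video encoder (not shown)."""
    def __init__(self, dim=512, heads=8, layers=4):
        super().__init__()
        self.model = nn.Transformer(d_model=dim, nhead=heads,
                                    num_encoder_layers=layers,
                                    num_decoder_layers=layers,
                                    batch_first=True)
        self.head = nn.Linear(dim, dim)          # predicts the next shot feature

    def forward(self, movie_shots, trailer_shots):
        # movie_shots: (B, M, dim); trailer shots chosen so far: (B, T, dim)
        causal = self.model.generate_square_subsequent_mask(trailer_shots.shape[1])
        out = self.model(movie_shots, trailer_shots, tgt_mask=causal)
        return self.head(out)                    # (B, T, dim)

def pick_next_shot(pred_feature, movie_shots):
    """Retrieve the movie shot whose feature is closest to the prediction."""
    sims = torch.cosine_similarity(pred_feature.unsqueeze(1), movie_shots, dim=-1)
    return sims.argmax(dim=1)                    # (B,) index of the selected shot

if __name__ == "__main__":
    model = TrailerSeq2Seq()
    movie, trailer = torch.randn(1, 120, 512), torch.randn(1, 10, 512)
    pred = model(movie, trailer)[:, -1]          # feature of the next trailer shot
    print(pick_next_shot(pred, movie))
```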

3.2 Text Modality and Film Production

The application of the text modality in films can be traced back to the title cards of silent films, which were used to explain plot developments or display dialogues. With the advent of sound films, text directly participated in sound storytelling through dialogues and scripts. The script serves as the foundation of film production, providing a structured storyline while containing detailed scene descriptions, character dialogues, and action instructions, which are central to film narrative and emotional expression.

(1) Video Description

Video description technology utilizes natural language processing (NLP) algorithms to automatically generate textual descriptions based on video content. This technology analyzes the visual and audio information of the video, extracts key features, and transforms them into natural language descriptions, widely applied in video search, recommendation systems, and accessibility assistance, significantly improving the accessibility and retrieval efficiency of video content.

Zhou et al.[14] proposed a novel streaming dense video description generation model that employs a K-means clustering-based memory mechanism and streaming decoding algorithm, capable of processing long video sequences and generating descriptions in real-time, showcasing the real-time application potential of this technology.
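The core of such a memory can be sketched simply: keep a fixed number of cluster centers over the incoming frame features and fold each new feature into its nearest center with a running mean, so memory stays constant no matter how long the stream runs. This is a simplified online variant for illustration, not the paper's exact K-means-based mechanism.

```python
import torch

def update_memory(centers, counts, new_features):
    """Streaming memory sketch: for each incoming frame feature, find the
    nearest cluster center and update it with a running mean, keeping the
    memory footprint fixed. centers: (K, D); counts: (K,); new_features: (N, D)."""
    for feat in new_features:
        dists = torch.cdist(feat.unsqueeze(0), centers).squeeze(0)   # (K,)
        k = int(dists.argmin())
        counts[k] += 1
        centers[k] += (feat - centers[k]) / counts[k]                # running mean
    return centers, counts

if __name__ == "__main__":
    K, D = 16, 256
    centers, counts = torch.randn(K, D), torch.ones(K)
    for _ in range(10):                     # simulate a stream of frame features
        centers, counts = update_memory(centers, counts, torch.randn(4, D))
    print(centers.shape)                    # torch.Size([16, 256])
```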

Xu et al.[15] further advanced video description generation by constructing a unified representation space for first-person and third-person videos and proposing EgoInstructor, a retrieval-augmented description generation method. The method builds pseudo-paired videos through an automated process and trains a cross-view retrieval module with the EgoExoNCE loss, effectively aligning video features. This not only improves the accuracy and relevance of the generated descriptions but also strengthens first-person video captioning by drawing on third-person videos, mirroring how humans naturally learn by observing others.

Kim et al.[16] developed the CM2 model, a cross-modal dense video description generation framework based on external memory, which effectively improves the localization and description of significant events in videos by retrieving relevant textual cues and combining visual and textual cross-attention mechanisms. The CM2 model can generate video descriptions naturally and fluently while significantly enhancing the understanding and interactive experience of video content.

Islam et al.[17] proposed the Video ReCap model, designed specifically for long videos, which processes and generates descriptions at different levels through a recursive video-language architecture, effectively addressing the challenges posed by video lengths ranging from seconds to hours. This model employs a hierarchical learning strategy and pseudo-summary data training, achieving significant performance improvements in long video description generation tasks. Additionally, its potential applications in long video understanding and complex video question-answering tasks make Video ReCap well-suited for scenarios requiring in-depth analysis and description of video content.

Raajesh et al.[18] proposed MICap, a new single-stage film description model that unifies fill-in-the-blank identity labeling and full description generation in a single autoregressive sequence-to-sequence approach. MICap uses a transformer-based encoder-decoder to handle video description and character identity labeling simultaneously, improving both efficiency and accuracy. The model is particularly suited to scenarios that require consistent character identities across multiple videos, generating descriptive captions that include specific character identities, as needed in film and television production.

Jin et al.[19] proposed an innovative video text retrieval (VTR) method called MV-Adapter (Multimodal Video Adapter), designed to enhance task efficiency and performance. This method employs a dual-branch structure and achieves efficient processing of video and text through a bottleneck architecture (downsampling, transformer, upsampling). To enhance temporal modeling capabilities, the MV-Adapter introduces a temporal adaptation (TA) module, which dynamically generates weights based on the global and local features of the video. Meanwhile, the cross-modality tying (CMT) module generates weights by sharing modality parameter spaces, improving cross-modal learning efficiency. The high efficiency and flexibility of this method make it suitable for various application scenarios requiring rapid and accurate retrieval of video and text, such as automated media analysis and content review.
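For readers less familiar with the task, video-text retrieval ultimately reduces to scoring the similarity between a video embedding and a text embedding; adapter modules such as those above are inserted into the backbone that produces these embeddings. The sketch below shows only this retrieval step under simple assumptions (mean pooling over frames, cosine similarity), not MV-Adapter's adapter design itself.

```python
import torch
import torch.nn.functional as F

def retrieval_scores(frame_features, text_features):
    """Video-text retrieval sketch: mean-pool per-frame features into one video
    embedding, L2-normalize both sides, and rank candidates by cosine similarity.
    frame_features: (num_videos, frames, D); text_features: (num_texts, D)."""
    video = F.normalize(frame_features.mean(dim=1), dim=-1)   # (V, D)
    text = F.normalize(text_features, dim=-1)                 # (T, D)
    return text @ video.T                                     # (T, V) similarity matrix

if __name__ == "__main__":
    sims = retrieval_scores(torch.randn(100, 12, 512), torch.randn(5, 512))
    print(sims.argmax(dim=1))   # index of the best-matching video for each query
```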

Video description generation technology plays multiple roles in film production. It can automatically generate plot summaries and scene descriptions, helping directors, screenwriters, and editors quickly review and adjust plot developments, significantly enhancing editing efficiency. Additionally, it can extract key scenes and highlights from videos, providing materials for trailers or promotional short films. Furthermore, this technology can assist in automatically dividing video chapters and generating corresponding descriptions and summaries, helping audiences better understand and navigate video content. In terms of content review, this technology can also assist reviewers in quickly understanding video content, ensuring compliance with relevant laws and regulations, and effectively labeling and adjusting sensitive content.

(2) Video Understanding

Video understanding technology uses computer vision algorithms to build a comprehensive understanding of video content. Most current research still focuses on basic plot development and the interactions among visual elements, but early results are emerging in exploring higher-level artistic meaning and deeper social significance.

Song et al.[20] proposed the MovieChat framework, integrating visual models and large language models (LLM), specifically designed for long video understanding tasks. MovieChat extracts video features through an efficient memory management mechanism and sliding window method, processing these features through short-term and long-term memory systems, significantly reducing computational complexity and memory costs while enhancing continuity for long time sequences. This model can provide answers based on audience questions, such as explaining plot backgrounds or character relationships, which not only helps audiences better understand and discuss film plots but also significantly enhances audience engagement and satisfaction.
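The memory consolidation step can be sketched as follows: when a short-term buffer of frame tokens fills up, the two most similar adjacent tokens are repeatedly merged until the buffer fits a fixed budget, trading temporal resolution for bounded memory. Tensor shapes and the merging budget below are illustrative assumptions, not MovieChat's exact implementation.

```python
import torch
import torch.nn.functional as F

def consolidate(tokens, target_len):
    """Long-video memory sketch: repeatedly average the two most similar
    adjacent frame tokens (by cosine similarity) until only `target_len`
    tokens remain. tokens: (N, D) with N >= target_len."""
    tokens = tokens.clone()
    while tokens.shape[0] > target_len:
        sims = F.cosine_similarity(tokens[:-1], tokens[1:], dim=-1)   # (N-1,)
        i = int(sims.argmax())
        merged = (tokens[i] + tokens[i + 1]) / 2
        tokens = torch.cat([tokens[:i], merged.unsqueeze(0), tokens[i + 2:]], dim=0)
    return tokens

if __name__ == "__main__":
    print(consolidate(torch.randn(64, 256), target_len=16).shape)   # torch.Size([16, 256])
```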

Wang et al.[21] developed the OmniViD framework, which treats video tasks as video-grounded language modeling. Through an encoder-decoder architecture and multimodal feature extraction, OmniViD introduces conditions such as text, time, and bounding boxes to handle different video tasks in a unified way. This effectively unifies output formats and training objectives and improves processing efficiency, and OmniViD performs strongly across multiple video tasks, including action recognition, video captioning, video question answering, and visual object tracking.

Nguyen et al.[22] proposed the Hierarchical Interlacement Graph (HIG) framework, aimed at deeply understanding the complex interactive dynamics within videos. HIG simplifies operational processes and enhances comprehensive grasp of interactions between objects in video content through its unique hierarchical structure and unified layers. This framework can adapt to different video sequences and flexibly adjust its structure to capture various interactive activities between characters and objects in videos.

Jin et al.[23] proposed Chat-UniVi, a novel unified visual language model that simultaneously understands images and videos through dynamic visual tokens. This model employs a multi-scale representation approach, gradually merging visual tokens through a density peak clustering K-nearest neighbors (DPC-KNN) algorithm, achieving comprehensive capture of spatial details in images and temporal relationships in videos. Chat-UniVi can be directly applied to image and video understanding tasks without fine-tuning, demonstrating superior performance in these tasks.
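The DPC-KNN merging step can be sketched directly: each visual token's local density is estimated from its k-nearest-neighbour distances, its separation from denser tokens is measured, the tokens with the highest density-times-separation scores become cluster centers, and the remaining tokens are averaged into their nearest center. Token counts and the choice of k below are illustrative; this is a simplified rendition rather than Chat-UniVi's exact implementation.

```python
import torch

def dpc_knn_centers(tokens, num_clusters, k=5):
    """Density-peaks sketch: density from k-NN distances, separation (delta) as
    the distance to the nearest denser token, centers = top density * delta.
    tokens: (N, D)."""
    dist = torch.cdist(tokens, tokens)                        # (N, N)
    knn_dist, _ = dist.topk(k + 1, largest=False)             # includes self (distance 0)
    density = (-knn_dist[:, 1:].pow(2).mean(dim=1)).exp()     # higher = denser
    denser = density.unsqueeze(0) > density.unsqueeze(1)      # [i, j]: is j denser than i?
    delta = dist.masked_fill(~denser, float("inf")).min(dim=1).values
    delta[torch.isinf(delta)] = dist.max()                    # densest token gets max delta
    return (density * delta).topk(num_clusters).indices

def merge_tokens(tokens, num_clusters, k=5):
    """Assign every token to its nearest center and average each cluster."""
    centers = dpc_knn_centers(tokens, num_clusters, k)
    assign = torch.cdist(tokens, tokens[centers]).argmin(dim=1)      # (N,)
    return torch.stack([tokens[assign == c].mean(dim=0) for c in range(num_clusters)])

if __name__ == "__main__":
    print(merge_tokens(torch.randn(196, 384), num_clusters=32).shape)  # torch.Size([32, 384])
```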

Tores et al.[24] proposed a new computer vision task for detecting objectification in films. They built a dataset named ObyGaze12, containing 1,914 clips from 12 films, meticulously annotated by experts with a range of objectification concepts. The team used concept bottleneck models (CBMs) to evaluate and improve the model's ability to analyze shot type, gaze, posture, and appearance with respect to these concepts. The technology is aimed primarily at film production, seeking to quantify and identify gender objectification in film works and to further explore and challenge gender bias on screen, providing new tools and perspectives for gender equality assessment and academic research in the film industry.

Video understanding is a process of in-depth analysis and comprehension of the connotation of films, which is crucial for audiences, film researchers, and creators. For audiences, interpreting films not only deepens their understanding of plots, characters, and emotions but also enhances their aesthetic appreciation of visual elements and narrative structures, prompting them to reflect on the themes and artistic expressions behind the films. For researchers, video understanding promotes the development of film theory, helping to understand the relationship between films and culture, history, and society, revealing how films reflect the zeitgeist and social ideas through visual storytelling and plot development. Meanwhile, interpreting classic films provides creators with sources of learning and inspiration, further exploring new modes of expression and themes, and better understanding audience needs to create more profound and impactful works.

3.3 Audio Modality and Film Production

Sound in films marks the transition from the silent film era to sound films. This transition began in the late 1920s, and the introduction of sound not only changed narrative techniques but also greatly enhanced emotional expression in films and the audience’s immersion. With technological advancements, the introduction of surround sound systems and multi-channel stereo systems further enriched the sound layers of films, making sound design an indispensable part of film art.

(1) Audiovisual Synchronization

The synchronization of sound and visuals is a fundamental requirement for all video content providers, covering two key aspects: audiovisual track time synchronization and audiovisual content synchronization.

Audiovisual track time synchronization concerns the precise temporal alignment of the video and audio streams. Synchronization errors can arise anywhere in the pipeline from shooting to playback, including during content editing or encoding. Research shows that even small offsets, on the order of 45 milliseconds, can noticeably degrade the viewing experience. Although various commercial solutions exist, they often struggle to meet the demands of large-scale production.
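The scale of such offsets can be appreciated with a toy estimator: normalize an audio energy envelope and a visual activity signal sampled at the same rate, then search for the lag that maximizes their cross-correlation. Learned systems such as DiVAS operate end-to-end on the raw streams; the sketch below, with synthetic signals and a hypothetical 100 Hz analysis rate, is only a baseline illustration of how an offset is measured.

```python
import numpy as np

def estimate_av_offset(audio_env, visual_act, rate_hz, max_lag_s=1.0):
    """Find the lag (in ms) that best aligns an audio energy envelope with a
    visual activity signal via cross-correlation. Positive values mean the
    audio lags behind the video. Both inputs are 1-D arrays at `rate_hz`."""
    a = (audio_env - audio_env.mean()) / (audio_env.std() + 1e-8)
    v = (visual_act - visual_act.mean()) / (visual_act.std() + 1e-8)
    max_lag = int(max_lag_s * rate_hz)
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [np.dot(a[max(0, l):len(a) + min(0, l)],
                     v[max(0, -l):len(v) + min(0, -l)]) for l in lags]
    return lags[int(np.argmax(scores))] / rate_hz * 1000.0

if __name__ == "__main__":
    rate = 100                                                    # 10 ms per sample
    t = np.arange(0, 10, 1 / rate)
    visual = (np.sin(2 * np.pi * 0.7 * t) > 0.95).astype(float)   # sparse "impact" events
    audio = np.roll(visual, 5)                                    # audio delayed by 50 ms
    print(f"estimated offset: {estimate_av_offset(audio, visual, rate):.0f} ms")
    # ≈ 50 ms, i.e. above the roughly 45 ms threshold mentioned above
```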

Audiovisual content synchronization, by contrast, mainly concerns whether the audio matches the visual elements in the picture, a frequent problem in dubbed films. Dubbed versions require precise adjustment of lip movements and language so that dialogue sounds natural and fluent, and during dubbing, translators must make real-time adjustments in the recording studio to keep audio and video consistent. Despite the greater investment that dubbed versions require, mismatches between characters' lip movements and the spoken language, together with linguistic differences, can reduce the naturalness of the dubbing, which is why dubbed versions are generally less favored by audiences than the original-sound version.

To address the challenge of audiovisual track time synchronization, Fernandez-Labrador et al.[25] developed a transformer-based audio-video synchronization model called DiVAS, which directly processes raw audio-video data, effectively tackling the challenges posed by different frame rates (FPS) and sampling rates. DiVAS has demonstrated superior synchronization accuracy and processing speed across various media content, including action movies and TV dramas, and can perform audio-video synchronization analysis for segments and entire works, providing a comprehensive and effective solution for content creators and analysts. However, this synchronization technology mainly addresses the alignment of audio tracks and visual tracks in terms of time, without involving the corresponding audiovisual content, such as the naturalness of dubbing and language matching issues.

For audiovisual content synchronization, Choi et al.[26] proposed an audiovisual speech translation (AV2AV) framework that directly translates audiovisual input into audiovisual output in the target language, addressing the audiovisual inconsistencies common in traditional speech translation systems. Leveraging the modality-agnostic properties of the AV-HuBERT model and a specially designed AV renderer, the system preserves the speaker's voice and facial characteristics during translation, changing only the language and the lip movements, making it suitable for a range of cross-language communication scenarios, including the localization of foreign films.

(2) Sound Effect Generation

Sound effect generation technology utilizes multimodal generative models to automatically generate various sound effects based on input data. This technology can generate sound effects that match scenes according to textual descriptions, images, or video content, widely applied in film production, game development, and interactive media.

Xing et al.[27] developed the Seeing and Hearing framework, which uses pretrained unimodal generative models and the ImageBind aligner to generate visual and audio content synchronously in a shared multimodal embedding space. The framework connects vision and audio through bidirectional guidance signals, requires no training on large-scale datasets, and consumes few resources, demonstrating strong performance and broad applicability across video-to-audio (V2A), image-to-audio (I2A), audio-to-video (A2V), and joint video-audio (Joint-VA) generation tasks.

However, a film's sound effect tracks require fine control over each individual event. Xie et al.[28] proposed SonicVisionLM, a controllable sound effect generation framework based on vision-language models, which categorizes sound effects as on-screen or off-screen according to whether their source is visible. The model automatically identifies and generates on-screen sound effects for the film and provides an interactive module through which sound designers can create and edit off-screen effects, further stimulating creative ideas. Technically, it addresses the challenges of synchronizing generated sound effects with on-screen actions in time and keeping them highly consistent with the film content, ultimately achieving a logically coherent integration of picture and on-screen sound effects along with flexible editing of off-screen effects.

Since the 1990s, the widespread application of multi-channel systems and digital technology has had a profound impact on film sound creation. Sound effects are no longer merely seen as ancillary elements of films but have become key factors in enhancing story atmosphere and realism, with their roles in film art becoming increasingly important. The sources of film sound effects are rich and diverse, including natural sounds, indoor and outdoor environmental sounds, and character actions, all of which together construct the auditory background and coherent atmosphere of the scene. Environmental sound effects such as wind, water, and background music closely integrate with visuals, sketching out the auditory background for film scenes. Hard sound effects include various sounds generated by character and object movements, such as door opening and closing sounds and action fighting sounds, while Foley techniques synchronize recorded sounds in post-production to simulate interactions between characters and environments. The application of the above sound effect generation technologies increases the feasibility of automatic generation of film sound effects, significantly reducing the time and labor costs of film sound production, and effectively shortening the film production cycle.

(3) Video Scoring

Video scoring technology automatically generates or recommends suitable music based on video content and emotional tone. This technology analyzes the visual and audio features of the video, identifying changes in plot and emotion, and matching corresponding music segments, widely applied in film, advertising, games, and multimedia production.

Li et al.[29] developed the Diff-BGM model, a diffusion-based generative framework for generating background music that is highly aligned with video content. This model integrates the semantic and dynamic features of the video, utilizing segment-aware cross-attention layers to achieve precise synchronization of audio and video during the diffusion process. This technology not only enhances the appeal and expressiveness of the video but also provides automated scoring for video content such as films, short films, advertisements, and social media, greatly reducing reliance on copyrighted music and avoiding copyright issues.
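The segment-aware idea can be sketched with a simple cross-attention in which each music token may only attend to video features from the segment it is temporally aligned with, so musical changes track scene boundaries. The masking scheme and shapes below are illustrative assumptions, not the Diff-BGM layer itself.

```python
import torch

def segment_aware_cross_attention(music_q, video_kv, music_seg, video_seg):
    """Cross-attention in which each music token only attends to video features
    from its own temporal segment. music_q: (M, D); video_kv: (V, D);
    music_seg: (M,) and video_seg: (V,) hold integer segment ids."""
    scale = music_q.shape[-1] ** -0.5
    logits = music_q @ video_kv.T * scale                     # (M, V)
    same_segment = music_seg.unsqueeze(1) == video_seg.unsqueeze(0)
    logits = logits.masked_fill(~same_segment, float("-inf"))
    return torch.softmax(logits, dim=-1) @ video_kv           # (M, D)

if __name__ == "__main__":
    music_seg = torch.arange(4).repeat_interleave(8)    # 32 music tokens, 4 segments
    video_seg = torch.arange(4).repeat_interleave(16)   # 64 video features, 4 segments
    out = segment_aware_cross_attention(torch.randn(32, 64), torch.randn(64, 64),
                                        music_seg, video_seg)
    print(out.shape)   # torch.Size([32, 64])
```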

In terms of user interaction, Dong et al.[30] proposed MuseChat, a conversational music recommendation system designed for video content. This system adjusts music selections in real-time through natural language dialogue to better align with users’ specific needs and preferences. By combining music recommendation and sentence generation modules, MuseChat enables users to specify details such as music style, emotion, and instrument usage, thus generating music that is highly consistent with video content and user preferences. This system is particularly suitable for social media and personal video production, helping users quickly and accurately match suitable background music.

Chowdhury et al.[31] developed MeLFusion, a novel diffusion model that generates music consistent with both images and text, overcoming the limitations of traditional music generation models under multimodal conditions. A "visual synapse" mechanism extracts features directly from the image and the text prompt and converts them into inputs for music generation. MeLFusion provides social media content creators with an efficient music creation tool, supporting the generation of music consistent with the visual content across a variety of creative settings.

The above technologies provide flexible, efficient, and cost-effective music solutions for film production. Film music falls mainly into two categories, score and songs: the score includes theme music, scene music, and background music, while songs include theme songs and interludes. Music is the soul of film art, driving plot development, deepening a film's themes, and shaping its characters. For instance, "The Pianist" uses its score effectively to participate in storytelling and deepen emotional expression. With video scoring technology, composers can quickly establish the tone of the music and draw inspiration for more detailed creation.

4 Conclusion and Outlook

Multimodal technology is gradually transforming the film production field, opening up infinite possibilities for innovation. These technologies not only enhance the automation level of content generation but also improve the understanding of complex scenes and the depth of plot analysis. By integrating visual, auditory, and textual data, multimodal technology can accurately generate visual scenes and audio content that match script descriptions, greatly enhancing the quality of immersive experiences and personalized content. Furthermore, it promotes interdisciplinary collaborative creation, enabling screenwriters, directors, voice actors, sound designers, and visual effects artists to work efficiently on real-time collaborative platforms, quickly responding to feedback and adjusting creative ideas.

Future research will focus on further exploring the applications of multimodal technology in addressing more complex scene understanding and plot construction. For instance, advanced algorithms can automatically analyze and generate plot summaries and provide detailed character interactions and emotional dynamic maps, helping creative teams delve deeper into script potential and precisely control the rhythm of storytelling and emotional flow. At the same time, utilizing advanced machine learning models, multimodal technology will be able to analyze audience behavior and reactions, providing highly targeted and engaging personalized recommendations.

Despite the many benefits brought by multimodal technology, it also faces several challenges in practical applications. The integration and processing of data require precise technical support to ensure seamless connections and consistency of information between different modalities. The complexity and opacity of deep learning models present another issue that needs to be addressed, calling for the development of more advanced explainable AI technologies to make the creative process more transparent and controllable. Additionally, improvements in real-time processing capabilities, protection of data privacy and security, and generation of multilingual and cross-cultural content are all important obstacles that need to be overcome in technological development.

Globally, the development of multimodal technology will continue to drive transformation in film production. With ongoing technological advancements and innovations, these tools are expected to make the film production process more efficient, cost-effective, and capable of creating unprecedented viewing experiences. As research deepens and technology matures, multimodal technology will play an increasingly critical role in future film production, opening up new modes of artistic expression and business models.

Notes and References


① The H5 index (h5-index) measures a publication's impact over the most recent five years: it is the largest number h such that h papers published in that period have each been cited at least h times. Because it is not skewed by a handful of extremely highly cited papers, it is broadly representative and relatively objective.

[1] CVPR. #CVPR2024[EB/OL]. (2024-04-05)[2024-07-10]. https://x.com/CVPR/status/1775979633717952965.

[2] Wu R, Chen L, Yang T, et al. LAMP: Learn A Motion Pattern for Few-Shot Video Generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 7089-7098.

[3] Wang Y, Bao J, Weng W, et al. MicroCinema: A divide-and-conquer approach for text-to-video generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 8414-8424.

[4] Zeng Y, Wei G, Zheng J, et al. Make pixels dance: High-dynamic video generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 8850-8860.

[5] Jain Y, Nasery A, Vineet V, et al. PEEKABOO: Interactive video generation via masked-diffusion[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 8079-8088.

[6] Cai S, Ceylan D, Gadelha M, et al. Generative rendering: Controllable 4d-guided video generation with 2d diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 7611-7620.

[7] Yang S, Zhou Y, Liu Z, et al. FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 8703-8712.

[8] Feng R, Weng W, Wang Y, et al. CCEdit: Creative and controllable video editing via diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 6712-6722.

[9] Ma H, Mahdizadehaghdam S, Wu B, et al. MaskINT: Video editing via interpolative non-autoregressive masked transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 7403-7412.

[10] Xing Z, Dai Q, Hu H, et al. SimDA: Simple diffusion adapter for efficient video generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 7827-7839.

[11] Jiang X, Rao A, Wang J, et al. Cinematic Behavior Transfer via NeRF-based Differentiable Filming[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 6723-6732.

[12] Liu J W, Cao Y P, Wu J Z, et al. DynVideo-E: Harnessing dynamic nerf for large-scale motion-and view-change human-centric video editing[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 7664-7674.

[13] Argaw D M, Soldan M, Pardo A, et al. Towards Automated Movie Trailer Generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 7445-7454.

[14] Zhou X, Arnab A, Buch S, et al. Streaming dense video captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 18243-18252.

[15] Xu J, Huang Y, Hou J, et al. Retrieval-augmented egocentric video captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 13525-13536.

[16] Kim M, Kim H B, Moon J, et al. Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 13894-13904.

[17] Islam M M, Ho N, Yang X, et al. Video ReCap: Recursive Captioning of Hour-Long Videos[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 18198-18208.

[18] Raajesh H, Desanur N R, Khan Z, et al. MICap: A Unified Model for Identity-aware Movie Descriptions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 14011-14021.

[19] Jin X, Zhang B, Gong W, et al. MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 27144-27153.

[20] Song E, Chai W, Wang G, et al. MovieChat: From dense token to sparse memory for long video understanding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 18221-18232.

[21] Wang J, Chen D, Luo C, et al. OmniViD: A generative framework for universal video understanding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 18209-18220.

[22] Nguyen T T, Nguyen P, Luu K. HIG: Hierarchical interlacement graph approach to scene graph generation in video understanding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 18384-18394.

[23] Jin P, Takanobu R, Zhang W, et al. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 13700-13710.

[24] Tores J, Sassatelli L, Wu H Y, et al. Visual Objectification in Films: Towards a New AI Task for Video Interpretation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 10864-10874.

[25] Fernandez-Labrador C, Akçay M, Abecassis E, et al. DiVAS: Video and Audio Synchronization with Dynamic Frame Rates[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 26846-26854.

[26] Choi J, Park S J, Kim M, et al. AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 27325-27337.

[27] Xing Y, He Y, Tian Z, et al. Seeing and hearing: Open-domain visual-audio generation with diffusion latent aligners[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 7151-7161.

[28] Xie Z, Yu S, He Q, et al. SonicVisionLM: Playing sound with vision language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 26866-26875.

[29] Li S, Qin Y, Zheng M, et al. Diff-BGM: A Diffusion Model for Video Background Music Generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 27348-27357.

[30] Dong Z, Liu X, Chen B, et al. MuseChat: A conversational music recommendation system for videos[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 12775-12785.

[31] Chowdhury S, Nag S, Joseph K J, et al. MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 26826-26835.

