Exploring the Technology Behind Vidu: A Domestic Video Generator Comparable to Sora

Author: Fan Bao et al.

Interpretation: AI Generated Future

Paper link: https://arxiv.org/pdf/2405.04233
Project page: https://www.shengshu-ai.com/vidu

Vidu is China’s first large AI model for long video generation, jointly released by Tsinghua University and Shengshu Technology. Many impressive demo results have been released recently, and this newly published report explains the technology behind Vidu. Let’s take a look together.

This article introduces Vidu, a high-performance text-to-video generator capable of producing 1080p videos up to 16 seconds long in a single generation. Vidu is a diffusion model with a U-ViT backbone, which gives it the scalability and the capacity to model long videos. Vidu exhibits strong coherence and dynamics, can generate both realistic and imaginative videos, and shows an understanding of some professional cinematography techniques, making it comparable to Sora, the most powerful text-to-video generator reported to date. Finally, preliminary experiments on other forms of controllable video generation, including edge-map-to-video generation, video prediction, and subject-driven generation, all show promising results.

Introduction

Diffusion models have made breakthrough progress in generating high-quality images, videos, and other types of data, surpassing alternatives such as autoregressive networks. Previous video generation models mainly relied on diffusion models with U-Net backbones and were limited to short durations of around 4 seconds. Vidu, the model presented in this paper, demonstrates that a text-to-video diffusion model with a U-ViT backbone can break this duration limit by leveraging the scalability and long-sequence modeling capability of transformers. Vidu can generate 1080p videos up to 16 seconds long in a single generation, down to single-frame images treated as videos.

Furthermore, Vidu exhibits strong coherence and dynamics and can generate both realistic and imaginative videos. Vidu also shows a preliminary understanding of some professional cinematography techniques, such as transition effects, camera movements, lighting effects, and emotional expression. To some extent, Vidu’s generation performance is comparable to that of Sora, currently the most powerful text-to-video generator, and far surpasses other text-to-video generators. Lastly, preliminary experiments on other forms of controllable video generation, including edge-map-to-video generation, video prediction, and subject-driven generation, all demonstrate promising results.

Text-to-Video Generation

Vidu first employs a video autoencoder to reduce the spatial and temporal dimensions of videos for efficient training and inference. After that, Vidu uses U-ViT as a noise prediction network to model these compressed representations. Specifically, as shown in Figure 1 below, U-ViT segments the compressed video into 3D patches, treating all inputs (including time, text conditions, and noisy 3D patches) as tokens, and uses long skip connections between shallow and deep layers of the transformer. By leveraging the transformer’s ability to handle variable-length sequences, Vidu can process videos of different durations.

[Figure 1]
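To make the U-ViT design more concrete, below is a minimal PyTorch sketch of how a compressed video latent can be cut into 3D patches and concatenated with time and text tokens before passing through transformer blocks with long skip connections. This is not Vidu’s actual code; all shapes, hyperparameters, and module names are illustrative assumptions.

```python
# Minimal, illustrative sketch of a U-ViT-style backbone for video latents.
# Shapes, hyperparameters, and module names are assumptions, not Vidu's actual code.
import torch
import torch.nn as nn

class UViTSketch(nn.Module):
    def __init__(self, in_ch=4, patch=(4, 2, 2), dim=512, depth=6, heads=8):
        super().__init__()
        pt, ph, pw = patch
        # 3D patchify: each (pt x ph x pw) block of the latent video becomes one token
        self.to_tokens = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.text_proj = nn.Linear(768, dim)  # project text-encoder features to token dim
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        ])
        # Long skip connections fuse shallow and deep features (U-ViT's key idea)
        self.skip_fuse = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(depth // 2)])
        self.to_out = nn.Linear(dim, in_ch * pt * ph * pw)  # per-patch noise prediction

    def forward(self, z_noisy, t, text_emb):
        # z_noisy: (B, C, T, H, W) noisy video latent; t: (B, 1); text_emb: (B, L, 768)
        tokens = self.to_tokens(z_noisy).flatten(2).transpose(1, 2)        # (B, N, dim)
        cond = torch.cat([self.time_mlp(t).unsqueeze(1), self.text_proj(text_emb)], dim=1)
        x = torch.cat([cond, tokens], dim=1)             # time, text, and patches all as tokens
        skips, half = [], len(self.blocks) // 2
        for i, blk in enumerate(self.blocks):
            if i < half:
                skips.append(x)                          # store shallow features
            else:
                x = self.skip_fuse[i - half](torch.cat([x, skips.pop()], dim=-1))
            x = blk(x)
        # Drop the conditioning tokens; the rest would be un-patchified back to the latent shape
        return self.to_out(x[:, cond.shape[1]:])
```

Because the token sequence length simply follows the number of 3D patches, the same backbone can, in principle, process clips of different durations, which is the property the paper highlights.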

Vidu is trained on a large number of text-video pairs, but it is impractical to have every video labeled by humans. To solve this, a high-performance video captioner, optimized for understanding the dynamic information in videos, is first trained and then used to automatically label all training videos. During inference, a re-captioning technique reformulates the user’s input into a form better suited to the model.
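The re-captioning step can be pictured as a small prompt-rewriting stage in front of the generator. The sketch below is purely illustrative: the `rewrite_prompt` interface and the target caption style are assumptions, since the report does not publish the actual captioner or prompt template.

```python
# Illustrative re-captioning wrapper (hypothetical; not Vidu's actual pipeline).
# Idea: user prompts are rewritten into the dense, caption-like style the model
# saw during training, where videos were auto-labeled by a video captioner.

def rewrite_prompt(user_prompt: str, llm) -> str:
    """Ask an instruction-following model to expand a short user prompt into a
    detailed, training-caption-style description (hypothetical interface)."""
    instruction = (
        "Rewrite the following video request as a detailed caption describing "
        "subjects, motion, camera, and lighting:\n" + user_prompt
    )
    return llm(instruction)

def generate_video(user_prompt: str, llm, vidu_model):
    detailed_prompt = rewrite_prompt(user_prompt, llm)  # re-captioning at inference
    return vidu_model(detailed_prompt)                  # text-to-video generation
```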

Generating Videos of Different Lengths

Since Vidu is trained on videos of various lengths, it can generate 1080p videos of any length up to 16 seconds, including single-frame images treated as videos. Examples are presented in Figure 2 below.

[Figure 2]

3D Consistency

The videos generated by Vidu exhibit strong 3D consistency. As the camera rotates, the video presents projections of the same object from different angles. For example, as shown in Figure 3 below, the generated cat’s fur is naturally occluded as the camera rotates.

[Figure 3]

Generating Transitions

Vidu is capable of generating videos that include transitions. As shown in Figure 4 below, these videos present different perspectives of the same scene by switching camera angles while maintaining the consistency of the subjects in the scene.

[Figure 4]

Generating Transition Effects

Vidu is capable of generating videos with transition effects in a single generation. As shown in Figure 5 below, these transition effects can connect two different scenes in an engaging way.

[Figure 5]

Camera Movements

Camera movements are physical adjustments or motions of the camera during shooting; they enhance visual storytelling and convey different perspectives and emotions within a scene. Vidu has learned these techniques from its training data, enriching the viewer’s visual experience. For example, as shown in Figure 6, Vidu can generate videos that include camera movements such as zooming, panning, and tilting.

[Figure 6]

Lighting Effects

Vidu is capable of generating videos with impressive lighting effects, which helps enhance the overall atmosphere. For example, as shown in Figure 7 below, the generated video can evoke a sense of mystery and tranquility. Thus, in addition to the entities within the video content, Vidu also has a preliminary ability to convey some abstract emotions.

[Figure 7]

Emotional Portrayal

Vidu can effectively depict the emotions of characters. For example, as shown in Figure 8 below, Vidu can express emotions such as happiness, loneliness, awkwardness, and joy.

[Figure 8]

Imagination

In addition to generating scenes from the real world, Vidu also possesses rich imagination. As shown in Figure 9 below, Vidu can generate scenes that do not exist in the real world.

[Figure 9]

Comparison with Sora

Sora is currently the most powerful text-to-video generator, capable of producing high-definition videos with high consistency. However, since Sora is not publicly accessible, the comparison was made by feeding the example prompts released by Sora directly into Vidu. Figures 10 and 11 show the comparison between Vidu and Sora, indicating that, to some extent, Vidu’s generation quality is comparable to Sora’s.

[Figures 10 and 11]

Other Controllable Video Generation

Several preliminary experiments on other forms of controllable video generation have also been conducted at a resolution of 512, including edge-map-to-video generation, video prediction, and subject-driven generation. All of these show promising results.

Edge-Map-to-Video Generation

Vidu can accept additional control signals using techniques similar to ControlNet; as shown in Figure 12 below, it can generate videos conditioned on edge maps.

[Figure 12]
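As a concrete picture of what the control signal looks like, the snippet below extracts per-frame Canny edge maps from a reference video with OpenCV. How these maps are injected into the diffusion backbone (e.g., through a ControlNet-like branch) is not detailed in the report, so this sketch only covers the conditioning-input side.

```python
# Extract per-frame Canny edge maps as a control signal (illustrative preprocessing only;
# the way Vidu injects this signal into the diffusion backbone is not published).
import cv2
import numpy as np

def video_to_edge_maps(path, low=100, high=200):
    cap = cv2.VideoCapture(path)
    edges = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges.append(cv2.Canny(gray, low, high))  # binary edge map per frame
    cap.release()
    return np.stack(edges) if edges else np.empty((0,))

# edge_maps = video_to_edge_maps("reference.mp4")  # shape: (num_frames, H, W)
```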

Video Prediction

As shown in Figure 13 below, Vidu can generate subsequent frames from a single input image or several input frames (marked with red boxes).

[Figure 13]
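A common way to implement this kind of frame-conditioned prediction with latent diffusion models is to re-noise the observed frames to the current noise level at each step and denoise only the future frames. The sketch below illustrates that generic “replacement-style” conditioning; it is an assumption, not Vidu’s published sampler, and `denoiser` and `scheduler` stand in for a noise-prediction network and a diffusers-style scheduler.

```python
# Illustrative frame-conditioned sampling loop. This is a generic replacement-style
# conditioning trick for latent diffusion models, NOT Vidu's published method.
import torch

@torch.no_grad()
def predict_future_frames(denoiser, scheduler, observed_latents, num_future, text_emb):
    # observed_latents: (B, C, T_obs, H, W) latents of the frames we condition on
    B, C, T_obs, H, W = observed_latents.shape
    future = torch.randn(B, C, num_future, H, W, device=observed_latents.device)
    for t in scheduler.timesteps:                          # high noise -> low noise
        # Re-noise the observed frames to the current noise level so the whole
        # clip looks like one partially denoised sample.
        noisy_obs = scheduler.add_noise(
            observed_latents, torch.randn_like(observed_latents), t)
        z = torch.cat([noisy_obs, future], dim=2)          # observed + future frames
        noise_pred = denoiser(z, t, text_emb)              # predict noise for all frames
        z = scheduler.step(noise_pred, t, z).prev_sample   # one denoising step
        future = z[:, :, T_obs:]                           # keep only the future part
    return future
```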

Subject-Driven Generation

Surprisingly, we found that Vidu can perform subject-driven video generation after being fine-tuned only on images rather than videos. For example, we used DreamBooth to bind the learned subject to a special identifier token <V> during fine-tuning. As shown in Figure 14 below, the generated videos faithfully reproduce the learned subject.

[Figure 14]
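Since this setup follows DreamBooth, the core recipe is fine-tuning on a handful of subject images whose prompts contain a rare identifier token. A minimal sketch of building such a fine-tuning set is shown below; the prompt templates and class word are illustrative assumptions, and the video model is simply reused for single-frame (image) training as described above.

```python
# Illustrative DreamBooth-style data preparation: bind a subject to the rare token "<V>"
# using image-only fine-tuning pairs. Templates and the class word are assumptions.
SUBJECT_TOKEN = "<V>"
CLASS_WORD = "dog"  # coarse class of the subject (illustrative)

TEMPLATES = [
    "a photo of {} {}",
    "a close-up photo of {} {}",
    "{} {} in a sunny park",
]

def build_finetune_pairs(image_paths):
    """Pair each subject image with a prompt containing the special identifier."""
    pairs = []
    for i, path in enumerate(image_paths):
        prompt = TEMPLATES[i % len(TEMPLATES)].format(SUBJECT_TOKEN, CLASS_WORD)
        pairs.append((path, prompt))
    return pairs

# After fine-tuning the text-to-video model on these image-prompt pairs, a prompt such as
# f"{SUBJECT_TOKEN} {CLASS_WORD} running on a beach" should reproduce the learned subject.
```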

Conclusion

Vidu, a high-definition text-to-video generator, demonstrates strong capabilities in many aspects, including the duration, coherence, and dynamics of the generated videos, and is comparable to Sora. There is still room for improvement: details are occasionally flawed, and interactions between different subjects in a video sometimes violate physical laws. It is believed that further scaling of Vidu can effectively resolve these issues.

References

[1] Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models
