Exploring the Technology Behind Vidu: A Domestic Video Generator Comparable to Sora

Author: Fan Bao et al.

Interpretation: AI Generated Future

Paper link: https://arxiv.org/pdf/2405.04233
Project page: https://www.shengshu-ai.com/vidu

Vidu is China’s first large AI model for long video generation, jointly released by Tsinghua University and Shengshu Technology. Many impressive demo results have been released recently, and this newly published report explains the technology behind Vidu. Let’s take a look together.

This article introduces Vidu, a high-performance text-to-video generator capable of producing 1080p videos up to 16 seconds long in a single generation. Vidu is a diffusion model with a U-ViT backbone, which gives it the scalability and the capacity to model long videos. Vidu exhibits strong coherence and dynamics, can generate both realistic and imaginative videos, and shows an understanding of some professional cinematography techniques, making it comparable to Sora, the most powerful text-to-video generator reported to date. Finally, preliminary experiments on other forms of controllable video generation, including edge-map-to-video generation, video prediction, and subject-driven generation, all show promising results.

Introduction

Diffusion models have made breakthrough progress in generating high-quality images, videos, and other types of data, surpassing alternatives such as autoregressive networks. Previous video generation models mainly relied on diffusion models with U-Net backbones and were limited to short durations of around 4 seconds. Vidu, the model presented in this paper, demonstrates that a text-to-video diffusion model with a U-ViT backbone can break this duration limit by leveraging the scalability and long-sequence modeling capability of transformers. Vidu can generate 1080p videos up to 16 seconds long in a single generation, down to single-frame images treated as videos.

Furthermore, Vidu exhibits strong coherence and dynamics and can generate both realistic and imaginative videos. Vidu also shows a preliminary understanding of some professional cinematography techniques, such as transition effects, camera movements, lighting effects, and emotional expression. To some extent, Vidu’s generation performance is comparable to that of Sora, currently the most powerful text-to-video generator, and far surpasses other text-to-video generators. Lastly, preliminary experiments on other forms of controllable video generation, including edge-map-to-video generation, video prediction, and subject-driven generation, all demonstrate promising results.

Text-to-Video Generation

Vidu first employs a video autoencoder to reduce the spatial and temporal dimensions of videos for efficient training and inference. After that, Vidu uses U-ViT as a noise prediction network to model these compressed representations. Specifically, as shown in Figure 1 below, U-ViT segments the compressed video into 3D patches, treating all inputs (including time, text conditions, and noisy 3D patches) as tokens, and uses long skip connections between shallow and deep layers of the transformer. By leveraging the transformer’s ability to handle variable-length sequences, Vidu can process videos of different durations.

[Figure 1]
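To make the U-ViT design more concrete, below is a minimal PyTorch sketch of how a compressed video latent can be cut into 3D patches and concatenated with time and text tokens before passing through transformer blocks with long skip connections. This is not Vidu’s actual code; all shapes, hyperparameters, and module names are illustrative assumptions.

```python
# Minimal, illustrative sketch of a U-ViT-style backbone for video latents.
# Shapes, hyperparameters, and module names are assumptions, not Vidu's actual code.
import torch
import torch.nn as nn

class UViTSketch(nn.Module):
    def __init__(self, in_ch=4, patch=(4, 2, 2), dim=512, depth=6, heads=8):
        super().__init__()
        pt, ph, pw = patch
        # 3D patchify: each (pt x ph x pw) block of the latent video becomes one token
        self.to_tokens = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.text_proj = nn.Linear(768, dim)  # project text-encoder features to token dim
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        ])
        # Long skip connections fuse shallow and deep features (U-ViT's key idea)
        self.skip_fuse = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(depth // 2)])
        self.to_out = nn.Linear(dim, in_ch * pt * ph * pw)  # per-patch noise prediction

    def forward(self, z_noisy, t, text_emb):
        # z_noisy: (B, C, T, H, W) noisy video latent; t: (B, 1); text_emb: (B, L, 768)
        tokens = self.to_tokens(z_noisy).flatten(2).transpose(1, 2)        # (B, N, dim)
        cond = torch.cat([self.time_mlp(t).unsqueeze(1), self.text_proj(text_emb)], dim=1)
        x = torch.cat([cond, tokens], dim=1)             # time, text, and patches all as tokens
        skips, half = [], len(self.blocks) // 2
        for i, blk in enumerate(self.blocks):
            if i < half:
                skips.append(x)                          # store shallow features
            else:
                x = self.skip_fuse[i - half](torch.cat([x, skips.pop()], dim=-1))
            x = blk(x)
        # Drop the conditioning tokens; the rest would be un-patchified back to the latent shape
        return self.to_out(x[:, cond.shape[1]:])
```

Because the token sequence length simply follows the number of 3D patches, the same backbone can, in principle, process clips of different durations, which is the property the paper highlights.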

Vidu is trained on a large number of text-video pairs, but it is impractical to have every video labeled by humans. To solve this, a high-performance video captioner, optimized for understanding the dynamic information in videos, is first trained and then used to automatically label all training videos. During inference, a re-captioning technique reformulates the user’s input into a form better suited to the model.
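The re-captioning step can be pictured as a small prompt-rewriting stage in front of the generator. The sketch below is purely illustrative: the `rewrite_prompt` interface and the target caption style are assumptions, since the report does not publish the actual captioner or prompt template.

```python
# Illustrative re-captioning wrapper (hypothetical; not Vidu's actual pipeline).
# Idea: user prompts are rewritten into the dense, caption-like style the model
# saw during training, where videos were auto-labeled by a video captioner.

def rewrite_prompt(user_prompt: str, llm) -> str:
    """Ask an instruction-following model to expand a short user prompt into a
    detailed, training-caption-style description (hypothetical interface)."""
    instruction = (
        "Rewrite the following video request as a detailed caption describing "
        "subjects, motion, camera, and lighting:\n" + user_prompt
    )
    return llm(instruction)

def generate_video(user_prompt: str, llm, vidu_model):
    detailed_prompt = rewrite_prompt(user_prompt, llm)  # re-captioning at inference
    return vidu_model(detailed_prompt)                  # text-to-video generation
```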

Generating Videos of Different Lengths

Since Vidu is trained on videos of various lengths, it can generate 1080p videos of any length up to 16 seconds, including single-frame images treated as videos. Examples are presented in Figure 2 below.

[Figure 2]

3D Consistency

The videos generated by Vidu exhibit strong 3D consistency. As the camera rotates, the video presents projections of the same object from different angles. For example, as shown in Figure 3 below, the generated cat’s fur is naturally occluded as the camera rotates.

[Figure 3]

Generating Transitions

Vidu is capable of generating videos that include transitions. As shown in Figure 4 below, these videos present different perspectives of the same scene by switching camera angles while maintaining the consistency of the subjects in the scene.

[Figure 4]

Generating Transition Effects

Vidu is capable of generating videos with transition effects in a single generation. As shown in Figure 5 below, these transition effects can connect two different scenes in an engaging way.

[Figure 5]

Camera Movements

Camera movements are physical adjustments or motions of the camera during shooting; they enhance visual storytelling and convey different perspectives and emotions within a scene. Vidu has learned these techniques from its training data, enriching the viewer’s visual experience. For example, as shown in Figure 6, Vidu can generate videos that include camera movements such as zooming, panning, and tilting.

[Figure 6]

Lighting Effects

Vidu is capable of generating videos with impressive lighting effects, which helps enhance the overall atmosphere. For example, as shown in Figure 7 below, the generated video can evoke a sense of mystery and tranquility. Thus, in addition to the entities within the video content, Vidu also has a preliminary ability to convey some abstract emotions.

[Figure 7]

Emotional Portrayal

Vidu can effectively depict the emotions of characters. For example, as shown in Figure 8 below, Vidu can express emotions such as happiness, loneliness, awkwardness, and joy.

[Figure 8]

Imagination

In addition to generating scenes from the real world, Vidu also possesses rich imagination. As shown in Figure 9 below, Vidu can generate scenes that do not exist in the real world.

[Figure 9]

Comparison with Sora

Sora is currently the most powerful text-to-video generator, capable of producing high-definition videos with high consistency. However, since Sora is not publicly accessible, the comparison was made by feeding the example prompts released by Sora directly into Vidu. Figures 10 and 11 show the comparison between Vidu and Sora, indicating that, to some extent, Vidu’s generation quality is comparable to Sora’s.

[Figures 10 and 11]

Other Controllable Video Generation

Several preliminary experiments on other forms of controllable video generation have also been conducted at a resolution of 512, including edge-map-to-video generation, video prediction, and subject-driven generation. All of these show promising results.

Edge-Map-to-Video Generation

Vidu can accept additional control signals using techniques similar to ControlNet; as shown in Figure 12 below, it can generate videos conditioned on edge maps.

[Figure 12]
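As a concrete picture of what the control signal looks like, the snippet below extracts per-frame Canny edge maps from a reference video with OpenCV. How these maps are injected into the diffusion backbone (e.g., through a ControlNet-like branch) is not detailed in the report, so this sketch only covers the conditioning-input side.

```python
# Extract per-frame Canny edge maps as a control signal (illustrative preprocessing only;
# the way Vidu injects this signal into the diffusion backbone is not published).
import cv2
import numpy as np

def video_to_edge_maps(path, low=100, high=200):
    cap = cv2.VideoCapture(path)
    edges = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges.append(cv2.Canny(gray, low, high))  # binary edge map per frame
    cap.release()
    return np.stack(edges) if edges else np.empty((0,))

# edge_maps = video_to_edge_maps("reference.mp4")  # shape: (num_frames, H, W)
```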

Video Prediction

As shown in Figure 13 below, Vidu can generate subsequent frames from a single input image or several input frames (marked with red boxes).

[Figure 13]
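A common way to implement this kind of frame-conditioned prediction with latent diffusion models is to re-noise the observed frames to the current noise level at each step and denoise only the future frames. The sketch below illustrates that generic “replacement-style” conditioning; it is an assumption, not Vidu’s published sampler, and `denoiser` and `scheduler` stand in for a noise-prediction network and a diffusers-style scheduler.

```python
# Illustrative frame-conditioned sampling loop. This is a generic replacement-style
# conditioning trick for latent diffusion models, NOT Vidu's published method.
import torch

@torch.no_grad()
def predict_future_frames(denoiser, scheduler, observed_latents, num_future, text_emb):
    # observed_latents: (B, C, T_obs, H, W) latents of the frames we condition on
    B, C, T_obs, H, W = observed_latents.shape
    future = torch.randn(B, C, num_future, H, W, device=observed_latents.device)
    for t in scheduler.timesteps:                          # high noise -> low noise
        # Re-noise the observed frames to the current noise level so the whole
        # clip looks like one partially denoised sample.
        noisy_obs = scheduler.add_noise(
            observed_latents, torch.randn_like(observed_latents), t)
        z = torch.cat([noisy_obs, future], dim=2)          # observed + future frames
        noise_pred = denoiser(z, t, text_emb)              # predict noise for all frames
        z = scheduler.step(noise_pred, t, z).prev_sample   # one denoising step
        future = z[:, :, T_obs:]                           # keep only the future part
    return future
```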

Subject-Driven Generation

Surprisingly, we found that Vidu can perform subject-driven video generation after being fine-tuned only on images rather than videos. For example, we used DreamBooth to bind the learned subject to a special identifier token <V> during fine-tuning. As shown in Figure 14 below, the generated videos faithfully reproduce the learned subject.

[Figure 14]
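Since this setup follows DreamBooth, the core recipe is fine-tuning on a handful of subject images whose prompts contain a rare identifier token. A minimal sketch of building such a fine-tuning set is shown below; the prompt templates and class word are illustrative assumptions, and the video model is simply reused for single-frame (image) training as described above.

```python
# Illustrative DreamBooth-style data preparation: bind a subject to the rare token "<V>"
# using image-only fine-tuning pairs. Templates and the class word are assumptions.
SUBJECT_TOKEN = "<V>"
CLASS_WORD = "dog"  # coarse class of the subject (illustrative)

TEMPLATES = [
    "a photo of {} {}",
    "a close-up photo of {} {}",
    "{} {} in a sunny park",
]

def build_finetune_pairs(image_paths):
    """Pair each subject image with a prompt containing the special identifier."""
    pairs = []
    for i, path in enumerate(image_paths):
        prompt = TEMPLATES[i % len(TEMPLATES)].format(SUBJECT_TOKEN, CLASS_WORD)
        pairs.append((path, prompt))
    return pairs

# After fine-tuning the text-to-video model on these image-prompt pairs, a prompt such as
# f"{SUBJECT_TOKEN} {CLASS_WORD} running on a beach" should reproduce the learned subject.
```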

Conclusion

Vidu, a high-definition text-to-video generator, demonstrates strong capabilities in many aspects, including the duration, coherence, and dynamics of the generated videos, and is comparable to Sora. There is still room for improvement: details are occasionally flawed, and interactions between different subjects in a video sometimes violate physical laws. It is believed that further scaling of Vidu can effectively resolve these issues.

References

[1] Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models
