Domestic Sora: Generate 16-Second High-Resolution Videos

The "domestic Sora" is Vidu, a large video model jointly released by Tsinghua University and the startup Shenshu Technology. Built on the self-developed U-ViT architecture, it can generate high-definition video of up to 16 seconds at 1080P resolution. It performs well in multi-camera language and temporal-spatial consistency, and can even create surreal imagery.

1. Features of Vidu

Simulating the Real Physical World: Vidu can generate complex and detailed scenes following real physical laws, such as precise lighting effects and vivid facial expressions.

Imaginative: Vidu can create images that transcend the real world. This capacity for surreal creation lets it meet creative video needs and provides strong technical support for film production, advertising design, and other industries.

Multi-Camera Language: Vidu can seamlessly generate and switch between various camera perspectives, including long shots, tracking shots, transitions, and other professional effects.

Outstanding Video Duration: Vidu can generate videos of up to 16 seconds, which is rare among large video models, while keeping the footage continuous and smooth, with consistent detail and logical coherence.

Temporal-Spatial Consistency: Vidu maintains coherence and smoothness over the 16-second duration, ensuring that characters and scenes remain consistent in time and space as the camera moves.

Understanding Chinese Elements: Vidu can generate distinctively Chinese elements, giving it an advantage in understanding and expressing Chinese cultural characteristics.

2. Comparison of Vidu and Sora

Vidu and Sora are both advanced AI video generation models, sharing similarities in technology but also having some differences.

2.1 Technical Architecture

Vidu uses U-ViT, an original architecture developed by Shenshu Technology that combines Diffusion and Transformer. It can generate high-definition video of up to 16 seconds at 1080P resolution from textual descriptions.
Sora, on the other hand, uses a Diffusion plus Transformer architecture called DiT, which can also generate high-quality video, with a maximum duration of 1 minute.
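
To make "Diffusion plus Transformer" more concrete, here is a minimal structural sketch of the U-ViT idea as it is described in the public U-ViT research paper: the timestep, the text condition, and the noisy patches are all treated as tokens, and shallow blocks are linked to deep blocks by long skip connections (concatenation followed by a linear projection). Everything below is illustrative. The function names, dimensions, and the simplified `block` (which stands in for a real attention block) are assumptions, and nothing here is taken from Vidu's actual implementation.

```python
import random


def linear(x, w):
    """Multiply one token (list of floats) by a weight matrix given as rows."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def block(tokens, w):
    """Simplified stand-in for a Transformer block (hypothetical):
    tokens are mixed through their mean, then passed through a residual
    linear map. A real block would use self-attention and an MLP."""
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    out = []
    for t in tokens:
        mixed = [ti + mi for ti, mi in zip(t, mean)]
        out.append([a + b for a, b in zip(t, linear(mixed, w))])
    return out


def uvit_like_forward(tokens, depth, rng):
    """Shallow half -> middle block -> deep half with long skip connections."""
    dim = len(tokens[0])
    skips = []
    x = tokens
    for _ in range(depth):                      # shallow half: remember outputs
        x = block(x, rand_matrix(dim, dim, rng))
        skips.append(x)
    x = block(x, rand_matrix(dim, dim, rng))    # middle block
    for _ in range(depth):                      # deep half: reuse saved outputs
        skip = skips.pop()                      # last saved pairs with first deep block
        w_proj = rand_matrix(2 * dim, dim, rng)
        # Long skip connection: concatenate per token, project back to dim.
        x = [linear(t + s, w_proj) for t, s in zip(x, skip)]
        x = block(x, rand_matrix(dim, dim, rng))
    return x


rng = random.Random(0)
dim = 8
time_token = [[0.5] * dim]                      # stand-in timestep embedding
text_token = [[0.1] * dim]                      # stand-in text-condition embedding
patch_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
tokens = time_token + text_token + patch_tokens

out = uvit_like_forward(tokens, depth=2, rng=rng)
print(len(out), len(out[0]))                    # token count and width are preserved
```

In a sketch like this, the denoising prediction would be read from the patch positions (`out[2:]`), while the timestep and text tokens only serve as conditioning.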

2.2 Video Generation Capability

Vidu supports one-click generation of videos up to 16 seconds in length, while Sora's maximum duration is 1 minute.
Both can simulate the real physical world, generate complex, detail-rich scenes, and produce imaginative imagery.

2.3 Technical Challenges and Optimizations

Vidu achieved significant performance improvements within two months, extending its maximum generation length from 8 seconds to 16 seconds.
Sora already had the capability for longer video generation at the time of its release.

2.4 Multi-Modal Capability

Vidu’s architecture supports multi-modal capabilities and may be compatible with a wider range of applications in the future.
Sora’s specific multi-modal capabilities are not detailed in the sources consulted.

2.5 Understanding Cultural Elements

Vidu can understand and generate images rich in Chinese cultural characteristics, such as pandas and dragons.

2.6 Performance Comparison

Reports indicate that Vidu has reached Sora’s level in consistency and range of motion in video generation. Although its 16-second maximum is still well short of Sora’s 60 seconds, the two models are otherwise broadly comparable.

2.7 Future Prospects

Vidu is iterating rapidly and may catch up with Sora in performance in the future.

3. Introduction to Shenshu Technology

Shenshu Technology is a startup founded in 2023. Its core members come from the Tsinghua University Artificial Intelligence Research Institute, and it is dedicated to independently developing world-leading, controllable, multi-modal general large models. The company’s CEO, Tang Jiayu, is a graduate of the Tsinghua University Computer Science Department, and its Chief Scientist is Zhu Jun, deputy director of the Tsinghua AI Research Institute. The CTO, Bao Fan, is a PhD student in the Tsinghua University Computer Science Department and a member of Professor Zhu Jun’s research group, with a long-standing research focus on diffusion models.
Shenshu Technology’s main product is a video large model called Vidu, which can generate high-definition video content of up to 16 seconds in length and 1080P resolution with one click. The Vidu model uses the self-developed U-ViT architecture that combines Diffusion and Transformer.

Additionally, Shenshu Technology has significant experience in the field of multi-modal large models and is currently one of the highest-valued startups in the multi-modal large model track. The company’s entrepreneurial direction focuses on the research and development of multi-modal general large models and application products.

Shenshu Technology has completed three rounds of financing since its establishment. In June 2023, it completed an angel round of nearly 100 million RMB, with investors including Ant Group, BV Baidu Ventures, Zhuoyuan Asia, and Zhuoyuan Capital. In August 2023, it completed an angel+ round of tens of millions of RMB, with Jinqiu Fund as the investor. In March 2024, it completed a Series A round of hundreds of millions of RMB, with new investors including Qiming Venture Partners, Datang Capital, and Zhipu AI, along with existing shareholders BV Baidu Ventures and Zhuoyuan Asia.
