Author|Zhou Yixiao, Wan Wan Youbei
Editor|Wang Zhaoyang
The Chinese text-to-video model that can rival Sora has arrived.
On the morning of April 27, at the 2024 Zhongguancun Forum, Shengshu Technology, in collaboration with Tsinghua University, launched China’s first long-duration, high-consistency, and high-dynamic video large model — Vidu, whose effects immediately went viral.
According to Shengshu Technology, Vidu supports one-click generation of 16-second, 1080p videos. Judging from the demos, Vidu’s consistency and range of motion have reached Sora’s level. Although its duration still cannot match Sora’s maximum of 60 seconds, overall it already stands comparison with Sora.
Shengshu’s release was as low-key as usual, without any press conference. However, the results attracted widespread attention, and the term “Chinese Sora” quickly emerged.
We reached out to Shengshu right away. The company, whose core team comes from the Tsinghua University AI Research Institute and is led by Dr. Zhu Jun, the institute’s deputy director, told us:
Vidu’s video duration will continue to break through. “Additionally, our architecture supports multi-modal capabilities; the video modality is just the most important at this stage.”
According to Shengshu, Vidu is currently accelerating its iterations and improvements. Looking ahead, Vidu’s flexible model architecture will also be able to accommodate a broader range of multi-modal capabilities.
In other words, saying Shengshu Technology is the “Chinese Sora” shows a lack of imagination.
Shengshu’s ambitions are even greater.
Frame-by-frame Comparison of Vidu and Sora
In a discussion on March 12 of this year, Shengshu Technology’s co-founder and CEO Tang Jiayu told us:
“We will definitely achieve the effects of Sora’s current version within this year, but it is hard to say whether it will be three months or six months.”
As far as we know, Shengshu achieved 8-second video generation in March and broke through to 16 seconds in April. Today’s release represents significant progress in just two months.
In this showcase, what details are worth noting? We immediately conducted a frame-by-frame comparison of Vidu and Sora. Without further ado, let’s take a look together.
Classic Walking Scene
Sora’s video of a stylish woman walking down a street also went viral across major social media. Vidu came out swinging: it generates not only women walking down the street, but also handsome men and even bears!
First, looking at the characters and backgrounds, Vidu’s output is indeed on par with Sora’s. However, the coordination of the characters’ movements is still slightly weaker than Sora’s.
Vidu
Sora
Off-Road Vehicle in Motion
The off-road vehicle weaves along jungle paths. Vidu’s jungle background has a slight 3D-animation look, resembling a game scene, while Sora’s background appears more realistic.
Vidu
Sora
Chinese Dragon
In this scene, the two models’ styles differ significantly: Vidu places a virtual dragon in a real-world setting, while Sora depicts a real dragon and lion dance. Still, each captures the dragon’s details in its own style.
Beyond the dragon itself, both backgrounds look very realistic, though Sora’s video is visually richer.
Vidu
Sora
Close-up of Character’s Eyes
Who can tell whether this is a real shot or AI-generated? In this round, I feel Vidu is truly on par with Sora!
Vidu
Sora
Television Compilation
Vidu is indeed not afraid of comparison! The richness of this scene and the camera movement are not at all inferior to Sora’s.
Vidu
Sora
Dog
Sora’s dog has a stronger sense of motion and realism, but Vidu’s rendering of the fur floating around the dog’s legs as it swims is also quite impressive.
Vidu
Sora
Cat
Vidu presents a “cat with pearls,” which is a bit fantastical, but after the camera rotates, the fur detail is also quite good.
Vidu
Sora
Ship and Sea
Vidu’s rolling waves conform closely to physical rules; it can be said to be on par with Sora.
Moreover, both released their prompts, allowing a direct comparison that reveals many interesting differences.
Vidu: “A ship in the studio sails towards the camera”

Sora: “A realistic close-up video showing two pirate ships fighting each other inside a cup of coffee.”
Astronaut
Vidu emphasizes the astronaut’s life in space, while Sora focuses more on the close-up of the astronaut’s face.
Vidu
Sora
How Vidu Was Created: The Right Technical Route Plus Engineering Transfer
The videos in this release show a clearly visible leap in quality. How was that achieved?
This breakthrough appears to be the result of Shengshu’s long-term accumulation.
OpenAI’s Sora is built on the DiT architecture, which fuses Diffusion and Transformer, achieving image generation quality comparable to GANs while offering better scalability and computational efficiency. By replacing the U-Net commonly used in traditional diffusion models with a Transformer, DiT processes data more efficiently, especially at large scale, significantly reducing the required computational resources while demonstrating strong emergent capabilities in visual tasks.
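To make the idea concrete, here is a minimal PyTorch sketch of a DiT-style denoiser: the noisy image is cut into patch tokens and processed by a Transformer instead of a U-Net, with the diffusion timestep injected as an extra conditioning token. The class name TinyDiT and all sizes are our own illustrative choices, not OpenAI’s implementation (DiT itself conditions via adaptive LayerNorm rather than a token).

```python
# A minimal sketch of the DiT idea: the diffusion denoiser is a Transformer
# over image patches instead of a U-Net. Names and sizes are illustrative.
import torch
import torch.nn as nn

class TinyDiT(nn.Module):
    def __init__(self, img_size=32, patch=4, dim=256, depth=4, heads=4):
        super().__init__()
        n_tokens = (img_size // patch) ** 2
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.to_pixels = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)

    def forward(self, x_noisy, t):
        # Patchify the noisy image into a token sequence.
        h = self.to_tokens(x_noisy)                  # (B, dim, H/p, W/p)
        B, D, Hp, Wp = h.shape
        tokens = h.flatten(2).transpose(1, 2) + self.pos
        # Inject the diffusion timestep as an extra conditioning token.
        t_tok = self.t_embed(t.view(-1, 1).float()).unsqueeze(1)
        tokens = self.blocks(torch.cat([t_tok, tokens], dim=1))[:, 1:]
        # Un-patchify back to an image-shaped noise prediction.
        h = tokens.transpose(1, 2).reshape(B, D, Hp, Wp)
        return self.to_pixels(h)

model = TinyDiT()
eps_hat = model(torch.randn(2, 3, 32, 32), torch.tensor([10, 500]))
print(eps_hat.shape)  # torch.Size([2, 3, 32, 32])
```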
In terms of technology, Vidu adopts the same fused Diffusion-Transformer architecture as Sora. It is built on U-ViT, Shengshu’s self-developed architecture proposed by the team in September 2022. U-ViT was in fact the first architecture to fuse Diffusion and Transformer, predating Sora’s DiT.

Caption: The paper “All are Worth Words: A ViT Backbone for Diffusion Models” proposed the U-ViT network architecture, the most important technical foundation for Vidu.
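For readers who want the gist of the paper, here is a minimal sketch of U-ViT’s two signature ideas: treat the timestep, the condition, and the noisy-image patches uniformly as tokens, and add U-Net-style long skip connections between shallow and deep Transformer blocks. Names and dimensions are illustrative, not the paper’s exact configuration.

```python
# A minimal sketch of U-ViT: everything is a token, and long skip
# connections link shallow blocks to deep ones, as in a U-Net.
import torch
import torch.nn as nn

class UViTBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.skip_proj = nn.Linear(dim * 2, dim)  # fuses the long skip branch

    def forward(self, x, skip=None):
        if skip is not None:  # long skip connection from a shallow block
            x = self.skip_proj(torch.cat([x, skip], dim=-1))
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        return x

dim, B, n_patches = 256, 2, 64
t_tok = torch.randn(B, 1, dim)          # timestep as a token
c_tok = torch.randn(B, 1, dim)          # text condition as a token
x_tok = torch.randn(B, n_patches, dim)  # noisy-image patches as tokens
tokens = torch.cat([t_tok, c_tok, x_tok], dim=1)

shallow = [UViTBlock(dim) for _ in range(2)]
deep = [UViTBlock(dim) for _ in range(2)]
skips = []
for blk in shallow:   # down path: stash activations
    tokens = blk(tokens)
    skips.append(tokens)
for blk in deep:      # up path: consume skips in reverse order
    tokens = blk(tokens, skip=skips.pop())
print(tokens.shape)   # torch.Size([2, 66, 256])
```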
Some video generation tools on the market increase video length with frame interpolation, which improves smoothness and duration by inserting extra frames between the original ones. Frame interpolation can be implemented with various algorithms, including traditional motion estimation and compensation (MEMC), deep learning methods, or codec-assisted smart interpolation. NVIDIA’s Super SloMo, for example, uses a deep learning model to predict and insert intermediate frames, enabling high-frame-rate playback.
However, frame interpolation also has drawbacks, such as potential quality degradation: with fast motion or occlusions in particular, distortion or blurring may appear.
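As a toy illustration of where those artifacts come from, consider the simplest possible interpolation scheme, which just blends neighboring frames linearly. Real systems (MEMC, Super SloMo) estimate motion instead, precisely because blending smears anything that moves; this sketch and its function name are our own, not any vendor’s code.

```python
# Naive frame interpolation by linear blending: ghosting appears
# wherever objects moved between the two source frames.
import numpy as np

def interpolate_frames(frames: np.ndarray, factor: int = 2) -> np.ndarray:
    """frames: (T, H, W, C) uint8 video; returns roughly factor*T frames."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        for k in range(1, factor):
            alpha = k / factor
            out.append(((1 - alpha) * a + alpha * b).astype(np.uint8))
    out.append(frames[-1])
    return np.stack(out)

video = np.random.randint(0, 256, (8, 64, 64, 3), dtype=np.uint8)
print(interpolate_frames(video).shape)  # (15, 64, 64, 3)
```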
Additionally, some tools produce seemingly longer videos by combining different models and techniques. For instance, a tool may first use an image generation model such as Stable Diffusion or Midjourney to create single images, then turn those images into short clips with image-to-video techniques, and finally splice the clips together into longer footage.
These methods do lengthen the video, but they fundamentally rest on a “short clip generation” workflow. As a result, the output can be less coherent in content and visuals, lack natural transitions, and fall short of a true long video in narrative tightness and logic, as the sketch below illustrates.
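Here is that splicing workflow in miniature. The function generate_clip is a hypothetical stand-in for any image-to-video model; the point is that each clip is produced independently, so nothing enforces coherence across the seams.

```python
# The "generate short, then splice" workflow: clips are independent,
# so subjects, lighting, and camera motion can jump at every boundary.
import torch

def generate_clip(prompt: str, frames: int = 16) -> torch.Tensor:
    # Placeholder: a real image-to-video model would go here.
    return torch.rand(frames, 3, 256, 256)

prompts = ["a ship sets sail", "the ship meets a storm", "calm after the storm"]
clips = [generate_clip(p) for p in prompts]
# Naive splice: concatenate along the time axis.
long_video = torch.cat(clips, dim=0)
print(long_video.shape)  # torch.Size([48, 3, 256, 256])
```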
Vidu, based on the U-ViT architecture, involves no frame interpolation or splicing: it performs direct, continuous text-to-video generation. Perceptually, it feels like a “one-take” shot, with the video generated continuously from start to finish and no interpolation artifacts.
In addition to the innovative U-ViT architecture underneath, Vidu also benefits from the Shengshu team’s engineering groundwork.
In March 2023, building on the U-ViT architecture, Shengshu trained UniDiffuser, a one-billion-parameter multi-modal model, on LAION-5B, the large-scale open-source text-image dataset, and open-sourced it. UniDiffuser specializes in text-image tasks, supporting arbitrary generation and conversion between the text and image modalities.
It is understood that UniDiffuser was the first to validate the fused architecture’s scalability, running the full U-ViT pipeline end to end in large-scale training. Notably, UniDiffuser was released a year before Stable Diffusion 3, which only recently switched to the DiT architecture.
Moreover, a video can be viewed as a sequence of images extended along the time axis, so techniques and experience from image processing transfer to video processing. For example, Sora uses DALL·E 3’s re-captioning technique to generate detailed text descriptions for its visual training data, which helps the model follow user instructions more accurately when generating videos.
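That “time axis” point is easy to see in code: a video tensor is just images stacked along one more dimension, so an image-style patch pipeline extends naturally by also slicing time. The following shape walk-through uses our own illustrative sizes, not any model’s actual configuration.

```python
# Video as images extended along time: the same patch-token pipeline
# works if we cut patches along the temporal axis too.
import torch

B, T, C, H, W = 2, 16, 3, 64, 64
video = torch.randn(B, T, C, H, W)

pt, ph, pw = 2, 8, 8  # temporal and spatial patch sizes
patches = (
    video.reshape(B, T // pt, pt, C, H // ph, ph, W // pw, pw)
         .permute(0, 1, 4, 6, 2, 3, 5, 7)  # group the patch dims last
         .reshape(B, (T // pt) * (H // ph) * (W // pw), pt * C * ph * pw)
)
print(patches.shape)  # torch.Size([2, 512, 384])
```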
These accumulated engineering experiences have laid the foundation for Shengshu’s technological migration from text-image tasks to video tasks.
In fact, Vidu reuses many of the engineering techniques Shengshu accumulated on text-image tasks, including training acceleration, parallel training, and low-memory training, to optimize the video training pipeline. Through video-data compression and a self-developed distributed training framework, it substantially improves communication efficiency, reduces memory overhead, and increases training speed while preserving numerical accuracy.
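Shengshu has not published the details of that framework, but for a flavor of what “low-memory training” looks like in practice, here is a generic PyTorch sketch combining mixed precision with activation checkpointing. It assumes a CUDA GPU and is a standard stand-in, not Shengshu’s actual stack.

```python
# Two generic memory/speed techniques: fp16 mixed precision and
# activation checkpointing (recompute activations in the backward pass
# instead of storing them). Requires a CUDA device.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.GELU()) for _ in range(8)]).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # keeps fp16 gradients numerically stable

x = torch.randn(16, 512, device="cuda")
with torch.cuda.amp.autocast():       # mixed precision: fp16 matmuls
    h = x
    for block in model:
        # Trade compute for memory: this block's activations are
        # recomputed during backward rather than kept in memory.
        h = checkpoint(block, h, use_reentrant=False)
    loss = h.pow(2).mean()

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```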
From unifying image tasks to integrating video capabilities, Vidu can be viewed as a general-purpose visual model able to generate more diverse, longer-duration video content.
According to Dr. Zhu Jun, Vidu stands for “We do, We did, We do together.” Shengshu has also launched the “Vidu Large Model Partnership Program.”
“The main goal is to attract industry application partners interested in AI video scenarios, including companies, institutions, and individual creators, to explore application scenarios together.”
In addition to its self-developed large models, Shengshu Technology builds vertical application products, including the visual creative design platform PixWeaver and the 3D asset creation tool VoxCraft, monetized through subscriptions and other models.
As for Vidu’s commercialization, Shengshu Technology kept us in suspense, answering Silicon Star with just four words:

