How AI Video Tool Sora Generates Videos from Text

On February 16, 2024, OpenAI announced its new text-to-video model, Sora, on social media. Almost overnight, Sora went viral across the internet.

Screenshot of a video generated by Sora. Image/Internet

Text-to-video is not a brand new technology, so why did the emergence of Sora cause such a huge stir?

Currently, most video generation models can only produce clips of about 4 seconds and are subject to many other limitations. Sora has completely broken this norm: it can generate videos up to 60 seconds long, supports cuts between shots, and works at different resolutions. It's like a student in your class entering a very challenging math competition: while most students score in the 30s, this student named Sora astonishes everyone with a score of 70.

How Difficult Is It to Generate a 60-Second High-Quality Video?

A video is composed of a series of images arranged in a specific sequence. It's like a homemade flipbook: it contains a series of coherent images of an action, and when the pages are flipped quickly, the phenomenon of visual persistence makes the images seem to come alive, creating an animation effect. Videos work on exactly this principle.

Text-to-video generation is, in essence, motion modeling in the time dimension. To keep the generated video coherent, the model must work along the time axis, capturing, understanding, and generating motion information. This greatly increases the model's complexity.

A homemade flipbook. Image/Visual China

Moreover, video data is more complex than image data, demanding larger-scale and higher-quality training data, yet the publicly available high-quality text-video paired data is very limited. Training a video generation model also requires enormous computing resources and time, making the training cost extremely high. Generating a high-quality 60-second video is therefore very challenging!
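
To get a feel for the scale involved, here is a rough back-of-envelope calculation in Python; the frame rate and resolution are illustrative assumptions, not Sora's actual settings:

```python
# Rough size of a raw, uncompressed 60-second video
# (illustrative numbers: 30 fps at 1080p, not Sora's actual settings).
frames = 60 * 30                          # 60 seconds at 30 frames per second
height, width, channels = 1080, 1920, 3   # one Full HD RGB frame
values_per_frame = height * width * channels
total_values = frames * values_per_frame

print(f"{frames} frames, {values_per_frame:,} values per frame")
print(f"{total_values:,} raw values (~{total_values / 1e9:.1f} billion)")
# One image is ~6.2 million values; a 60-second clip is 1,800 times that,
# which is why video models need far more data and compute than image models.
```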

So, how does Sora achieve this?

A Student-Friendly Technical Explanation of Sora

First, Sora uses a technology known as a "video compression network" to compress the input images or videos into a compact representation. It's like the last question on a math exam, where the teacher usually breaks one big problem into three smaller questions. The first small question is usually relatively simple, the second one's solution is hidden in the first, and to solve the most difficult third question, we rely on the previous two. The teacher could ask you to solve the third question directly, but without the groundwork from the first two, tackling it would be very hard; breaking the problem down makes it manageable. In the same spirit, the video compression network simplifies complex video data while retaining its key information, significantly reducing the computational load and allowing Sora to process large amounts of data more efficiently during training.
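
OpenAI has not published the architecture of Sora's compression network, so the following is only a minimal sketch of the general idea, assuming a toy encoder built from 3D convolutions that shrink a clip in both space and time:

```python
import torch
import torch.nn as nn

# Minimal sketch of a "video compression network" encoder.
# Assumption: Sora's real architecture is unpublished; strided 3D
# convolutions are just one common way to build such an encoder.
class VideoEncoder(nn.Module):
    def __init__(self, latent_channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            # Each strided Conv3d halves time (T), height (H), and width (W).
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(128, latent_channels, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 3, T, H, W) -> latent: (batch, C, T/8, H/8, W/8)
        return self.net(video)

encoder = VideoEncoder()
clip = torch.randn(1, 3, 32, 256, 256)    # 32 frames of 256x256 RGB
latent = encoder(clip)
print(latent.shape)                        # torch.Size([1, 16, 4, 32, 32])
```

In this toy setup the latent representation is nearly a hundred times smaller than the raw clip, which is what makes the later processing steps affordable.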

Sora then further decomposes the compressed video into "spatio-temporal patches". These patches are small building blocks of the video that contain not only local spatial information but also capture dynamic changes along the time dimension.

To picture spatio-temporal patches, compare them to the frames of a movie. If we view each frame as a still photograph, each photograph can be torn into many small pieces, with each piece containing a small part of the image's information. Unlike a photo scrap, however, a spatio-temporal patch also spans a short stretch of time. In Sora, these patches let the model handle each small segment of video content more finely while tracking how it changes over time.
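
The "torn-up photograph" idea can be made concrete. Below is a minimal sketch of slicing a compressed latent video (the one from the previous sketch) into spacetime patches; the patch size of 2 frames by 4x4 latent pixels is an illustrative assumption, since Sora's actual values are unpublished:

```python
import torch

# Cut a latent video into spacetime patches: small tubes that span a few
# frames in time and a small region in space. Patch sizes are assumptions.
def to_spacetime_patches(latent, pt=2, ph=4, pw=4):
    b, c, t, h, w = latent.shape           # (batch, channels, T, H, W)
    patches = latent.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    patches = patches.permute(0, 2, 4, 6, 1, 3, 5, 7)   # patch dims go last
    # Flatten to a sequence of tokens, one token per spacetime patch.
    return patches.reshape(b, (t // pt) * (h // ph) * (w // pw), c * pt * ph * pw)

latent = torch.randn(1, 16, 4, 32, 32)     # output shape of the toy encoder
tokens = to_spacetime_patches(latent)
print(tokens.shape)                         # torch.Size([1, 128, 512])
```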

Image/Visual China

After extracting this information, Sora begins the video generation process. Built on the Transformer architecture, it combines the given text prompt with the extracted spatio-temporal patches to create the video content.
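
As a rough sketch of how a Transformer might combine these two kinds of input, here is a toy block in which patch tokens first attend to one another and then to the embedded text prompt; Sora's real model is a far larger diffusion transformer whose internals are unpublished:

```python
import torch
import torch.nn as nn

# Toy text-conditioned Transformer block (an assumption-laden sketch,
# not Sora's actual model): patches attend to each other, then to the prompt.
class TextConditionedBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, patch_tokens, text_tokens):
        x = patch_tokens
        x = x + self.self_attn(x, x, x)[0]                        # patches see each other
        x = x + self.cross_attn(x, text_tokens, text_tokens)[0]   # patches see the prompt
        return x + self.ff(x)

patches = torch.randn(1, 128, 512)   # spacetime patch tokens, as above
prompt = torch.randn(1, 20, 512)     # placeholder for an embedded text prompt
out = TextConditionedBlock()(patches, prompt)
print(out.shape)                      # torch.Size([1, 128, 512])
```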

For example, if you ask Sora to generate a video of "participating in the 100-meter race at the school sports meet and winning first place," that sentence is your prompt. What does Sora do with it? First, it works out what the sentence means; then, based on that understanding, it searches its "brain" for related memory fragments (spatio-temporal patches). Drawing on these fragments, it uses its "imagination" to fill in the scenes and arrange them along a timeline, making sure, for instance, that the starting-line scene comes before the sprint to the finish. After repeated supplementation and refinement, Sora produces the desired video.

In this process, Sora refines an initially noisy video (one with imperfect visuals and a chaotic timeline), filtering out irrelevant information and adding necessary details. Through repeated optimization, it ultimately generates a video that closely matches the text prompt.

As mentioned earlier, Sora initially generates a noisy video, which is a flawed video. At this stage, each pixel in the video is randomly assigned color values, resulting in a chaotic display. Anyone who has seen a “snowy TV” should be familiar with this kind of image, which appears when the television has no signal.

However, through continuous training and optimization, Sora can accurately adjust parameters such as the position, size, angle, and brightness of image blocks, ultimately predicting the clear image behind these noisy visuals.
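
The denoising loop itself can be sketched in a few lines. The update rule below is a deliberately simplified stand-in for a real diffusion sampler, and `toy_model` is a placeholder rather than a trained network:

```python
import torch

# Simplified illustration of iterative denoising: start from pure noise
# ("TV static") and repeatedly subtract a slice of the predicted noise.
# Real diffusion samplers use a more careful update rule than this.
def denoise(model, shape, steps=50):
    x = torch.randn(shape)               # random values, like a no-signal TV
    for t in reversed(range(steps)):
        predicted_noise = model(x, t)    # the model guesses the noise in x
        x = x - predicted_noise / steps  # peel away a little noise each step
    return x                             # ideally, a clean latent video

# Placeholder "model" so the loop runs end to end; a real denoiser would
# be a trained neural network conditioned on the text prompt.
toy_model = lambda x, t: 0.1 * x
video_latent = denoise(toy_model, shape=(1, 16, 4, 32, 32))
print(video_latent.shape)                # torch.Size([1, 16, 4, 32, 32])
```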

“Snowy TV” in no-signal state. Image/Visual China

This process is akin to the writing of this very article: at first there is only an outline that roughly structures the overall content, deciding how many sections to write and what each should cover; the text and images are then filled in continuously until a logically clear, well-developed article emerges. For video, this means Sora must predict multiple frames at once, turning noisy multi-frame images into a clear and coherent sequence. When these clear images play in succession, they form the final smooth, natural video.

The New Changes Brought by Sora

The emergence of Sora can be said to break people’s traditional understanding of text-to-video technology.

First, Sora demonstrates powerful multi-format video generation capabilities. When we shoot videos with our phones or other devices, we choose landscape or portrait mode depending on our needs, producing different aspect ratios. Sora can easily handle videos of various aspect ratios, meeting diverse viewing needs. It can also quickly build a draft of the content at low resolution and then refine it at full resolution, all within the same model, which enhances creative flexibility and simplifies the generation process.
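
One reason the patch representation makes this flexibility possible: a video of any shape or length simply becomes a longer or shorter sequence of tokens, and a Transformer can consume variable-length sequences directly. A small illustrative calculation (the latent and patch sizes are assumptions):

```python
# Token counts for differently shaped videos under one patch scheme
# (latent dimensions and patch sizes here are illustrative assumptions).
def token_count(t, h, w, pt=2, ph=4, pw=4):
    return (t // pt) * (h // ph) * (w // pw)

for name, (t, h, w) in {
    "landscape draft": (16, 32, 56),   # latent sizes, not pixels
    "portrait clip":   (16, 56, 32),
    "small square":    (8, 16, 16),
}.items():
    print(f"{name}: {token_count(t, h, w)} tokens")
# The same model processes all three; only the sequence length changes.
```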

Secondly, Sora shows significant improvements in video composition and framing. Traditional training pipelines crop videos to a square, which can leave parts of the scene out of frame. In contrast, Sora keeps the subject of the video fully in view more reliably.

Finally, thanks to OpenAI's experience with its large language model ChatGPT and the technology accumulated around it, Sora has a deep understanding of text. It can accurately interpret user instructions given as text and, based on them, create characters and lively scenes rich in detail and emotional expression. This makes the conversion from simple text prompts to complex video content feel natural and fluid: whether it's an action-packed scene or a subtle emotional moment, Sora can capture and present it precisely.

The emergence of Sora will lower the barriers to video creation. Image/Visual China

If the emergence of ChatGPT changed the way people produce text, then the emergence of Sora lowers the barriers to video creation. For the vast majority of people, various social media content in the future will no longer be limited to text and images.

Source: Science Popularization Magazine | Editor: Bai Yulei
