How AI Video Tool Sora Generates Videos from Text

On February 16, 2024, OpenAI announced its new text-to-video model, Sora, on X (formerly Twitter).

This model can generate videos up to 60 seconds long, and while doing so it can switch camera angles by itself and even cut to close-ups. Below are translations of two prompts, along with the “works” Sora generated directly from the original English prompts.

A fashionable woman walks down the neon-lit streets of Tokyo, wearing a black leather jacket, a red long skirt, and black boots, carrying a black handbag. She wears sunglasses and red lipstick, walking with both confidence and casualness. The street is wet, and the water on the ground reflects the colorful lights like a mirror, with many pedestrians coming and going.

Video source: OpenAI official website

A 3D animation shows a small, round, fuzzy creature exploring a vibrant, magical forest. This creature is a mix of a rabbit and a squirrel, with soft blue fur and a fluffy striped tail. It hops along a sparkling stream, its eyes filled with curiosity. The forest is filled with magical elements: glowing flowers that change color, trees with purple and silver leaves, and floating lights similar to fireflies. The creature eventually stops to play with a group of fairies dancing around a mushroom, looking up in awe at a giant glowing tree that seems to be the heart of the forest.

Video source: OpenAI official website

At first glance, you might think these videos were produced by a professional filming team or an animation studio. Many comments from users in the OpenAI community express concern that Sora might take away jobs from animators.

Image: machine-translated screenshot from community.openai.com

Some people are also concerned that such technology could be used to forge videos or even be used as false evidence in court.

Image: machine-translated screenshot from X

So how does Sora generate such videos? Is it truly omnipotent and will it take away human jobs?

How Does Sora Generate Videos?

Since the second half of 2022, applications like Midjourney and Stable Diffusion have been able to generate images from text prompts. In September 2023, the combination of GPT-4 and DALL·E 3 also made it possible to generate and modify images in a conversational manner.

AI-generated videos are not a novelty either. Before the release of Sora, there were already some video-generating AIs, such as Pika, Stable Video, RunwayML, etc. However, compared to Sora, other models generate shorter videos and are much weaker in terms of camera movement and shot transitions.

Video source: X post by Gabor Cselle

So, how does Sora generate videos?

OpenAI released a technical report on Sora, mentioning that “Sora is a diffusion model”.

Sora is a diffusion model, image source: OpenAI official website

Diffusion models are inherently complex, and we won’t delve into the specifics, but we can understand the general idea of a diffusion model through a simple example.

Suppose we now have a photo of a dog; we can gradually add noise to this photo, making it increasingly blurry until it eventually becomes a pile of chaotic noise.

Adding noise and removing noise, image source: reference [3]

If we reverse this process, we can start from a pile of chaotic noise and gradually remove the noise until the target image is restored. The key to the diffusion model is learning to reverse the noise-adding process, that is, learning how to remove the noise step by step.
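The noise-adding and noise-removing idea can be sketched in a few lines of Python. This is only a toy illustration, not Sora's implementation: the four-number "photo", the function names `add_noise` and `remove_noise`, and the `alpha_bar` values are all assumptions made for the example. It uses the standard closed-form mixing rule from the diffusion literature, and it cheats by reusing the exact noise that was added, where a real model would have to predict that noise with a trained neural network.

```python
import math
import random


def add_noise(pixels, alpha_bar, rng):
    """Blend clean pixel values with Gaussian noise (closed-form forward diffusion).

    alpha_bar close to 1.0 keeps the picture nearly intact;
    alpha_bar close to 0.0 leaves almost nothing but noise.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal_scale * p + noise_scale * n for p, n in zip(pixels, noise)]
    return noisy, noise


def remove_noise(noisy, predicted_noise, alpha_bar):
    """Undo add_noise, given an estimate of the noise that was mixed in."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [(x - noise_scale * n) / signal_scale
            for x, n in zip(noisy, predicted_noise)]


rng = random.Random(0)
dog_photo = [0.1, 0.5, 0.9, 0.3]  # hypothetical stand-in for a real image

# Lower alpha_bar means more noise and less of the original picture.
for alpha_bar in (0.99, 0.5, 0.01):
    noisy, noise = add_noise(dog_photo, alpha_bar, rng)
    # Here the true noise is known; a trained model would have to predict it.
    restored = remove_noise(noisy, noise, alpha_bar)
    print(alpha_bar, [round(v, 3) for v in noisy], [round(v, 3) for v in restored])
```

Because the exact noise is handed back to `remove_noise`, the restored values match the original "photo" up to rounding. The hard part, and the part a diffusion model is actually trained to do, is estimating that noise when only the noisy input is available.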

Of course, diffusion models can be used not only to generate images but also to generate videos.

For instance, Sora’s technical report mentions that OpenAI processed video data in a way that allows it to be used directly for training the model, enabling Sora to generate videos based on prompts.

Sora processes video data, image source: OpenAI official website
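The technical report [1] describes this processing as turning videos into small “spacetime patches”, after first compressing the video into a lower-dimensional representation. The sketch below illustrates only the patch-cutting idea, applied directly to a tiny grid of numbers. The function name `to_spacetime_patches`, the nested-list video format, and the patch sizes are assumptions for illustration, not details taken from OpenAI.

```python
def to_spacetime_patches(video, pt, ph, pw):
    """Cut a video (frames x height x width nested lists) into flat patches.

    Each patch spans pt frames, ph rows and pw columns, so it covers a small
    block of both space and time.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")

    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # Flatten one block of pixels into a single list of values.
                patch = [video[t][y][x]
                         for t in range(t0, t0 + pt)
                         for y in range(y0, y0 + ph)
                         for x in range(x0, x0 + pw)]
                patches.append(patch)
    return patches


# Hypothetical 4-frame, 4x4-pixel grayscale "video" with a unique value per pixel.
video = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)]
         for t in range(4)]

patches = to_spacetime_patches(video, pt=2, ph=2, pw=2)
print(len(patches), "patches of", len(patches[0]), "values each")
```

Cutting videos into uniform pieces like this gives the model one common input format, in much the same way that text is split into tokens before it is fed to a language model.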

The Powerful Video Creation Capabilities of Sora

According to OpenAI, Sora “inherits” OpenAI’s understanding of text: it can generate high-quality images and videos based on prompts, and it can extend videos forward or backward in time. For example, it can extend a video from the same beginning to produce different endings, or start from different beginnings that eventually converge on the same ending.

These three video beginnings will ultimately lead to the same ending, image taken from: OpenAI official website

Additionally, Sora can not only generate videos from text but also take images or videos directly as input and edit them.

For instance, it can transform a car driving on a regular road into a more “cyberpunk” version.

Image taken from: OpenAI official website

Furthermore, Sora has demonstrated some previously unimagined abilities, such as being able to follow objects with moving camera angles while still maintaining the coherence and completeness of the surrounding scenery.

Video taken from: OpenAI official website

The “Powerful Sora” Still Has Some Flaws

Although Sora has shown powerful capabilities, it is still not perfect at this stage.

Sora does not produce a satisfying video every time. According to Will Douglas Heaven of MIT Technology Review, the videos OpenAI released were hand-picked as the best of a large number of results. However, even these hand-picked videos are not perfect.

Sora’s technical report also admits that the videos generated at this stage have some flaws. For instance, in the video clip of “archaeologists digging up a plastic chair,” the chair clearly does not obey the laws of physics.

Additionally, the process of a glass cup breaking is also not very “scientific”—the liquid inside the cup flowed out before the cup broke.

Therefore, Sora still has many areas for improvement. But there is no doubt that the capabilities demonstrated by Sora indicate that this is a very promising path.

Is Sora Safe? Will It Replace Humans?

In recent days, videos generated by Sora have gone viral on social media. While people are amazed by Sora’s prowess, they have also raised concerns in two main areas.

The first concern is: the ability of Sora to generate videos is truly impressive. If such technology is used for forgery, wouldn’t that be terrifying? How will we know if the videos we see in the future are real or fake?

The second concern mainly comes from professionals in the video industry. If models like Sora become widespread, will video industry workers lose their jobs?

First, let’s talk about safety. OpenAI has in fact considered the potential safety issues that Sora may bring. Currently, Sora is only open to a small number of people, and it will not be made available to the public until steps have been taken to keep it from being used for malicious purposes.

So, will Sora replace human video workers?

The emergence of Sora may well threaten some creators of animation materials.

For example, in January 2024, The Hollywood Reporter reported on a survey of 300 entertainment industry leaders [4], in which three-quarters of respondents stated that AI will reduce future job positions, with about 200,000 jobs expected to be affected in the next three years. Sora’s outstanding performance is likely to exacerbate this impact.

However, from another perspective, every emerging technology brings new opportunities alongside its threats.

Video-generating AIs, including Sora, are just tools; the creative ideas behind videos still need to come from humans. Sora may help people produce videos more efficiently while also giving everyone a chance to make creative videos of their own.

References

[1] https://openai.com/research/video-generation-models-as-world-simulators

[2] https://openai.com/Sora

[3] https://scholar.harvard.edu/binxuw/classes/machine-learning-scratch/materials/foundation-diffusion-generative-models

[4] https://www.hollywoodreporter.com/business/business-news/ai-hollywood-workers-job-cuts-1235811009/

Source: Science Popularization China
