How Sora Generates Video from Text Using AI

On February 16, 2024, OpenAI announced on X (formerly Twitter) the launch of its new text-to-video model, Sora.

This model can generate videos up to 60 seconds long, and while doing so it can switch camera angles on its own and even cut to close-ups. Below are two of the prompts (paraphrased) and the clips Sora generated directly from the original English prompts.

A fashionable woman walks down the neon-lit streets of Tokyo. She is wearing a black leather jacket, a red long skirt, and black boots, carrying a black handbag. She wears sunglasses and has red lipstick on. She walks with both confidence and ease. The street is wet, and the water on the ground reflects the colorful lights like a mirror, with many pedestrians coming and going.

Video source: OpenAI official website

A 3D animation showcases a small, round, fluffy creature exploring a vibrant, magical forest. This creature is a mix between a rabbit and a squirrel, with soft blue fur and a fluffy striped tail. It hops along a sparkling stream, its eyes filled with curiosity. The forest is filled with magical elements: glowing flowers that change color, trees with purple and silver leaves, and floating lights resembling fireflies. The creature eventually stops to play with a group of fairies dancing around a mushroom. It looks up in awe at a giant glowing tree, which seems to be the heart of the forest.

Video source: OpenAI official website

At first glance, these clips could pass for short films shot by a professional film crew or produced by an animation studio. On the OpenAI community forum, some users have even voiced concern that Sora might take jobs away from animators.


Image: machine-translated excerpt from community.openai.com

Others worry that such technology could be used to forge videos, or even to fabricate evidence in court.

Image: machine-translated excerpt from X

So how does Sora generate such videos? Is it really omnipotent and will it take away human jobs?

How does Sora generate videos?

Since the second half of 2022, applications such as Midjourney and Stable Diffusion have been able to generate images from text prompts. In September 2023, combining GPT-4 with DALL·E 3 made it possible to generate and modify images conversationally.

AI-generated video is nothing new. Before Sora was released, there were already video-generation tools such as Pika, Stable Video, and RunwayML. Compared with Sora, however, these models generate shorter videos and are considerably weaker at camera movement and shot transitions.

Video source: post on X by Gabor Cselle

So how does Sora generate videos?

OpenAI released a technical report on Sora, which states that “Sora is a diffusion model.”

Sora is a diffusion model. Image source: OpenAI official website

Diffusion models are inherently complex, and we won’t go into the details here, but a simple example conveys the basic idea.

Suppose we have a photo of a dog. We can gradually add noise to this photo, making it noisier and noisier until it becomes nothing but a pile of random noise.

Adding and removing noise. Image source: Reference [3]
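For readers who want to see the noising step concretely, here is a minimal Python sketch. It assumes the widely used DDPM formulation, in which the noisy version of an image after t steps can be computed in one jump as x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε, where ε is drawn from a standard normal distribution and ᾱ_t shrinks toward 0 as t grows. The linear noise schedule, the 1,000 steps, the function names and the four-pixel “dog photo” are all illustrative assumptions; Sora’s actual noise schedule has not been published.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return alpha_bar_t for t = 0..num_steps-1 under an assumed linear beta schedule.

    alpha_bar_t is the fraction of the original signal's variance left after step t.
    """
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)  # noise added at step t
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Jump straight to a noise level: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # fresh standard-normal noise, one value per pixel
    xt = [math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * e for p, e in zip(x0, eps)]
    return xt, eps


if __name__ == "__main__":
    rng = random.Random(0)
    dog_photo = [0.8, -0.3, 0.5, 0.1]  # hypothetical 4-pixel "image", values scaled to [-1, 1]
    alpha_bars = linear_alpha_bars(1000)
    for t in (0, 250, 500, 999):
        noisy, _ = add_noise(dog_photo, alpha_bars[t], rng)
        print(f"step {t:4d}: alpha_bar {alpha_bars[t]:.5f}, pixels {[round(v, 2) for v in noisy]}")
```

By the last step the signal weight √ᾱ_t is well under 1%, so what remains is essentially pure random noise, the “pile of random noise” described above.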

If we reverse this process, we can start from a pile of random noise, remove the noise step by step, and recover the target image. The key to a diffusion model is learning this reverse, noise-removing process.
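The reverse direction can be sketched the same way. In a real diffusion model, a trained neural network looks at a noisy image and predicts the noise in it; in this Python sketch that network is replaced by a hypothetical “oracle” that cheats by already knowing the target image, so the example only illustrates the shape of the generation loop, not the learning. The update rule is a deterministic DDIM-style step (one common sampler, chosen here as an assumption; OpenAI has not said which sampler Sora uses), and the linear noise schedule and all function names are likewise assumptions.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for an assumed linear beta schedule (fraction of signal variance left at step t)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def oracle_noise_predictor(xt, alpha_bar, target):
    """Hypothetical stand-in for the trained network.

    Because it already knows the one target image, it can solve
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps for eps exactly.
    A real diffusion model has to learn this prediction from training data.
    """
    return [(v - math.sqrt(alpha_bar) * p) / math.sqrt(1.0 - alpha_bar) for v, p in zip(xt, target)]


def generate(target, num_steps=1000, seed=0):
    """Start from pure noise and remove a little of it at every step."""
    rng = random.Random(seed)
    alpha_bars = alpha_bar_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # "a pile of random noise"
    errors = []  # largest per-pixel distance from the target after each step
    for t in range(num_steps - 1, -1, -1):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0  # alpha_bar = 1 means no noise left
        eps = oracle_noise_predictor(x, ab_t, target)
        # Clean image implied by the current noisy image and the predicted noise.
        x0_guess = [(v - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t) for v, e in zip(x, eps)]
        # DDIM-style update (eta = 0): re-noise the guess to the slightly lower level of step t-1.
        x = [math.sqrt(ab_prev) * g + math.sqrt(1.0 - ab_prev) * e for g, e in zip(x0_guess, eps)]
        errors.append(max(abs(v - p) for v, p in zip(x, target)))
    return x, errors


if __name__ == "__main__":
    dog_photo = [0.8, -0.3, 0.5, 0.1]  # hypothetical 4-pixel target "image"
    result, errors = generate(dog_photo)
    for i in (0, 499, 899, 999):
        print(f"after {i + 1:4d} denoising steps, max pixel error {errors[i]:.4f}")
    print("final:", [round(v, 3) for v in result])
```

Running it shows the error generally shrinking as the loop moves from heavy noise toward a clean image. Because the oracle knows the answer, the loop always lands on the dog photo; a trained model instead has to infer what the image should look like from what it learned during training and from the prompt.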

Of course, diffusion models can generate not only images but also videos.

For instance, Sora’s technical report explains that OpenAI transforms video data so that it can be used directly to train the model: videos are first compressed into a lower-dimensional representation and then cut into small “spacetime patches.” This is what allows Sora to generate videos directly from prompts.

Sora transforms video data. Image source: OpenAI official website
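OpenAI’s technical report [1] describes turning videos into “spacetime patches”: small blocks that each span a few frames and a small region of the picture, which the model then handles as tokens, much as a language model handles pieces of text. The Python sketch below shows only the cutting-into-patches idea, under simplifying assumptions: the report applies it to a compressed representation of the video, whereas this toy version works directly on raw grayscale pixel values, and the function name and patch sizes are hypothetical.

```python
def to_spacetime_patches(video, pt, ph, pw):
    """Cut a video into non-overlapping spacetime patches, each flattened into one token.

    video: a list of T frames; each frame is a list of H rows of W pixel values.
    pt, ph, pw: patch size along time, height and width (must divide T, H, W).
    Returns a list of tokens ordered by (time block, row block, column block).
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, T, pt):          # which group of frames
        for y0 in range(0, H, ph):      # which band of rows
            for x0 in range(0, W, pw):  # which band of columns
                token = [video[t][y][x]
                         for t in range(t0, t0 + pt)
                         for y in range(y0, y0 + ph)
                         for x in range(x0, x0 + pw)]
                tokens.append(token)
    return tokens


if __name__ == "__main__":
    T, H, W = 4, 4, 6
    # Each pixel value encodes its own position: 100*frame + 10*row + column.
    video = [[[100 * t + 10 * y + x for x in range(W)] for y in range(H)] for t in range(T)]
    tokens = to_spacetime_patches(video, pt=2, ph=2, pw=3)
    print(len(tokens), "tokens of length", len(tokens[0]))  # 8 tokens of length 12
    print(tokens[0])  # [0, 1, 2, 10, 11, 12, 100, 101, 102, 110, 111, 112]
```

Because the number of tokens simply grows or shrinks with a clip’s length and frame size, the same procedure handles videos of different durations and resolutions, which is one reason a patch representation is convenient for training on varied video data.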

Sora’s Powerful Video Creation Capabilities

According to OpenAI, Sora “inherits” OpenAI’s strength in understanding text: it can generate high-quality images and videos from prompts, and it can extend videos forward or backward in time. For example, starting from the same clip, it can extend the video forward to produce different endings, or extend it backward to produce different beginnings that all converge on the same ending.

These three different beginnings all lead to the same ending. Image excerpt from: OpenAI official website

Beyond generating videos from text, Sora can also take images or videos as input and edit them.

For instance, it can make a car driving on a regular road look more “cyberpunk”.

Image excerpt from: OpenAI official website

Sora has also shown some unexpected abilities. For example, it can move the camera to follow an object while keeping the surrounding scene coherent and consistent as the viewing angle changes.

Video source: OpenAI official website

The “Powerful Sora” Still Has Some Flaws

Although Sora has shown powerful capabilities, it is still not perfect at this stage.

Sora does not produce a satisfactory video every time. According to Will Douglas Heaven of MIT Technology Review, “The videos released by Sora are already the cream of the crop selected from a large number of results.” Yet even these hand-picked clips are not perfect.

Sora’s technical report also acknowledges that the videos it generates still have flaws. For example, in the clip of “archaeologists digging up a plastic chair,” the chair clearly defies the laws of physics.

Likewise, the way a glass breaks is not very “scientific”: the liquid spills out before the glass has shattered.

In short, Sora still has plenty of room for improvement. But there is no doubt that the capabilities it has already demonstrated show this to be a very promising path.

Is Sora Safe? Will It Replace Humans?

Recently, videos generated by Sora have gone viral on social media. While people marvel at Sora’s capabilities, they have also raised concerns, mainly on two fronts.

The first concern is: Sora’s ability to generate videos is so impressive that if such technology is used for forgery, wouldn’t that be terrifying? How will we know if the videos we see in the future are real or fake?

The other concern mainly comes from professionals in the video industry. If models like Sora become widespread, will video industry professionals lose their jobs?

First, safety. OpenAI has in fact considered the potential safety issues Sora may bring. For now, Sora is available only to a limited number of people, and it will not be made available to the public until OpenAI is confident it will not be used for malicious purposes.

So will Sora replace human video workers?

Sora’s emergence may indeed threaten some creators of animation assets.

For example, in January 2024, The Hollywood Reporter reported on a survey of 300 entertainment industry leaders: three-quarters of respondents said AI would lead to fewer jobs, with roughly 200,000 positions expected to be affected over the next three years. Sora’s strong performance is likely to amplify this impact.

However, looking at it from another perspective, every emerging technology brings new opportunities alongside the threats.

AI video generators, Sora included, are merely tools; the creativity behind a video still has to come from people. Sora may help people produce videos more efficiently, and it may also give ordinary people the chance to make creative videos of their own.

References

[1] https://openai.com/research/video-generation-models-as-world-simulators

[2] https://openai.com/Sora

[3] https://scholar.harvard.edu/binxuw/classes/machine-learning-scratch/materials/foundation-diffusion-generative-models

[4] https://www.hollywoodreporter.com/business/business-news/ai-hollywood-workers-job-cuts-1235811009/

Planning and Production

This article is a work of the Science Popularization China – Starry Sky Project

Produced by | Science Popularization Department of China Association for Science and Technology

Supervised by | China Science and Technology Publishing House Co., Ltd., Beijing Zhongke Xinghe Cultural Media Co., Ltd.

Author | Xiao Wei, Science Popularization Creator

Reviewed by | Qin Zengchang, Associate Professor, School of Automation Science and Electrical Engineering, Beihang University

Planning | Xu Lai

Editor | He Tong


The cover image and the images in this article come from a copyrighted image library; reproducing them may lead to copyright disputes.
