
New Intelligence Report
[New Intelligence Guide] OpenAI’s first AI video model, Sora, has arrived, once again making history. The accompanying technical report, which casts video generation models as “world simulators”, was also released today, though specific training details have not been made public.
Yesterday during the day, “Reality No Longer Exists” began trending across the internet.
“Have we really entered the next era so quickly? Sora is simply explosive.”
“This is the future of filmmaking!”
Google’s Gemini 1.5 Pro had barely made a splash when, at dawn, the spotlight turned to OpenAI’s Sora.
Once Sora was released, all other video models bowed down.
Just a few hours later, the technical report for OpenAI Sora was also released!
In it, “milestone” stands out as a key term.

Report link: https://openai.com/research/video-generation-models-as-world-simulators
The technical report mainly discusses two aspects:
(1) Methods for converting different types of visual data into a unified format for large-scale training of generative models;
(2) Qualitative assessments of Sora’s capabilities and limitations.
Unfortunately, the report does not include model and implementation details. Well, OpenAI is still the same “OpenAI”.
Even Musk was shocked by the effects generated by Sora, stating “gg humanity”.

Creating a Virtual World Simulator
Previously, OpenAI researchers had been exploring a challenging question: how to apply large-scale training of generative models to video data?
To this end, researchers trained on videos and images of varying durations, resolutions, and aspect ratios, based on a text-conditioned diffusion model.
They adopted a Transformer architecture capable of handling the latent codes of spatiotemporal segments in videos and images.
The result was Sora, their most powerful model yet, capable of generating high-quality videos up to a minute long.
OpenAI researchers discovered an exciting point: scaling up video generative models is a very promising direction for building a universal simulator of the physical world.
In other words, following this direction, scaled-up video generation models may indeed become world models!
What sets Sora apart?
It’s important to note that many previous studies modeled video data through various techniques, such as recurrent networks, generative adversarial networks, autoregressive Transformers, and diffusion models.
These typically focused only on specific types of visual data, shorter videos, or fixed-size videos.
Sora is different; it is a universal visual data model capable of generating videos and images of various durations, aspect ratios, and resolutions, even high-definition videos lasting up to one minute.
Some netizens remarked, “Although Sora has some imperfections (which can be detected), it will revolutionize many industries.
Imagine generating dynamic, personalized advertisement videos for precise targeting; this could become a trillion-dollar industry!”
To test Sora’s capabilities, industry veteran Gabor Cselle compared it with Pika, RunwayML, and Stable Video.
He first used the same prompt as in OpenAI’s examples.
The results showed that the other mainstream tools generated clips of only about 5 seconds, while Sora maintained action and visual consistency across a 17-second scene.
He then used Sora’s starting frame as a reference and tried to reproduce similar results with the other models by adjusting prompts and controlling camera movements.
Sora clearly came out ahead in handling longer video scenes.
Given such striking results, it is no wonder industry insiders are saying Sora is genuinely revolutionary for AI video production.
Transforming Visual Data into Patches
The success of LLMs is attributed to their training on internet-scale data, which grants them broad capabilities.
A key to their success is the use of tokens, which elegantly unify various forms of text—code, mathematical formulas, and various natural languages.
OpenAI researchers found inspiration from this.
How can a generative model for visual data inherit the advantages of tokens?
Note that, unlike the text tokens used in LLMs, Sora uses visual patches.
Previous research has shown that patches are very effective for modeling visual data.
OpenAI researchers were pleasantly surprised to find that this highly scalable and effective representation form is suitable for training generative models that can handle various types of videos and images.
From a macro perspective, researchers first compressed the video into a low-dimensional latent space, and then decomposed this representation into spatiotemporal patches, thus achieving the transformation from video to patches.
Video Compression Network
Researchers developed a network to reduce the dimensionality of visual data.
This network can take raw video as input and output a latent representation compressed in both time and space.
Sora is trained in this compressed latent space, which is then used to generate videos.
Additionally, researchers designed a corresponding decoder model to convert the generated latent data back into pixel space.
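To make the idea concrete, here is a minimal PyTorch sketch of what such a compression network could look like. The 3D-convolutional design, the downsampling factors (2x in time, 8x in space), and the latent channel count are our own assumptions; OpenAI has not released the actual architecture.

```python
# A minimal sketch of a spatiotemporal compression network (not OpenAI's actual design).
# Assumed: 3D convolutions, 2x temporal and 8x spatial downsampling, 16 latent channels.
import torch
import torch.nn as nn

class VideoCompressor(nn.Module):
    def __init__(self, latent_channels: int = 16):
        super().__init__()
        # Encoder: compress the video in both time and space.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(2, 4, 4), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, stride=(1, 2, 2), padding=1),
        )
        # Decoder: map generated latents back to pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=(4, 6, 6), stride=(2, 4, 4), padding=1),
        )

    def encode(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width)
        return self.encoder(video)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.decoder(latents)

model = VideoCompressor()
clip = torch.randn(1, 3, 16, 256, 256)       # a short raw RGB clip
latents = model.encode(clip)
print(latents.shape)                          # (1, 16, 8, 32, 32): compressed in time and space
print(model.decode(latents).shape)            # (1, 3, 16, 256, 256): back to pixel space
```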
Latent Space Patches
For a compressed input video, researchers extract a series of spatial patches to use as tokens for the Transformer.
This approach is also applicable to images, as images can be viewed as a single frame of video.
Based on the patch representation method, researchers enabled Sora to handle videos and images of different resolutions, durations, and aspect ratios.
During inference, the size of the generated video can be controlled by appropriately arranging randomly initialized patches in a suitably sized grid.
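Below is a rough sketch of how compressed latents might be cut into spacetime patch tokens, and how the output size could be controlled at inference time by laying out randomly initialized latents on a grid of the desired shape. The patch sizes and compression factors are illustrative assumptions, continuing the sketch above.

```python
# Illustrative only: cut compressed latents into spacetime patch tokens for the Transformer.
import torch

def patchify(latents: torch.Tensor, pt: int = 2, ph: int = 4, pw: int = 4) -> torch.Tensor:
    """(B, C, T, H, W) latents -> (B, num_patches, patch_dim) tokens."""
    b, c, t, h, w = latents.shape
    x = latents.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)   # (B, grid_t, grid_h, grid_w, C, pt, ph, pw)
    return x.reshape(b, (t // pt) * (h // ph) * (w // pw), c * pt * ph * pw)

tokens = patchify(torch.randn(1, 16, 8, 32, 32))
print(tokens.shape)   # (1, 256, 512): 4*8*8 spacetime patches, each flattened to 512 dims

# At inference, the requested duration and frame size determine the grid of
# randomly initialized latents that the model then denoises.
def noise_grid(seconds: float, height: int, width: int, fps: int = 16) -> torch.Tensor:
    t = int(seconds * fps) // 2          # assumed 2x temporal compression
    h, w = height // 8, width // 8       # assumed 8x spatial compression
    return torch.randn(1, 16, t, h, w)

widescreen = patchify(noise_grid(4.0, 288, 512))   # landscape clip
vertical = patchify(noise_grid(4.0, 512, 288))     # portrait clip from the same model
print(widescreen.shape, vertical.shape)            # (1, 2304, 512) each, laid out differently
```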
Scaling Transformers
Specifically, the video model Sora is a diffusion model: it takes noisy patches (along with conditioning information such as text prompts) as input and is trained to predict the original “clean” patches.
Importantly, Sora is based on a Transformer diffusion model. Historically, Transformers have exhibited exceptional scalability in various fields, including language modeling, computer vision, and image generation.
Surprisingly, in this work, researchers discovered that the diffusion Transformer as a video model can also effectively scale.
The comparison below shows video samples generated with fixed seeds and inputs at different amounts of training compute.
As training compute increases, sample quality improves markedly.
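The report gives no architectural details, but the basic training recipe of a diffusion Transformer over patch tokens can be sketched as follows: corrupt the clean tokens with noise and train a Transformer, conditioned on the text embedding and the noise level, to recover them. Everything here, from the stand-in denoiser to the corruption rule and dimensions, is an assumption for illustration (positional embeddings are omitted for brevity).

```python
# Sketch of one diffusion-Transformer training step over patch tokens (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDenoiser(nn.Module):
    """Stand-in denoiser: a plain Transformer encoder over patch tokens."""
    def __init__(self, dim: int = 512, text_dim: int = 512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.text_proj = nn.Linear(text_dim, dim)
        self.time_proj = nn.Linear(1, dim)

    def forward(self, noisy_tokens, noise_level, text_emb):
        # Crude conditioning: add projected text and noise-level embeddings to every token.
        cond = self.text_proj(text_emb) + self.time_proj(noise_level)
        return self.blocks(noisy_tokens + cond.unsqueeze(1))   # predicts the clean patches

model = PatchDenoiser()
clean = torch.randn(2, 256, 512)           # clean patch tokens from the video encoder
text_emb = torch.randn(2, 512)             # embedding of the (detailed) text prompt
level = torch.rand(2, 1)                   # random noise level per sample
noisy = (1 - level.unsqueeze(-1)) * clean + level.unsqueeze(-1) * torch.randn_like(clean)
pred = model(noisy, level, text_emb)
loss = F.mse_loss(pred, clean)             # train to recover the original "clean" patches
loss.backward()
```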
Diverse Durations, Resolutions, and Aspect Ratios
Traditionally, image and video generation techniques often resize, crop, or trim videos to a standard size, such as 4-second clips at 256×256 resolution.
However, OpenAI researchers found that training directly on the original size of videos brings numerous benefits.
Flexible Video Production
Sora can produce videos of various sizes, from widescreen 1920×1080 to vertical 1080×1920, and everything in between.
This means that Sora can create content that fits the screen ratios of various devices!
It can also quickly create video prototypes at lower resolutions and then produce full-resolution videos with the same model.
Better Framing and Composition
Experiments found that training on videos at their native aspect ratios significantly improves framing and composition.
To verify this, researchers compared Sora with a version of the model trained on videos cropped to squares, a common practice when training generative models.
Compared with the square-cropped model, the videos generated by Sora (shown on the right) have noticeably better framing and composition.
Deep Language Understanding
Training a text-to-video generation system requires a large number of videos with text descriptions.
Researchers applied the re-captioning technique from DALL·E 3 to videos.
First, they trained a model that produces highly descriptive captions, then used it to generate text descriptions for every video in the training set.
They found that training on these detailed video captions improves not only how faithfully the generated videos follow the text, but also their overall quality.
Similar to DALL·E 3, researchers also utilized GPT to convert users’ brief prompts into detailed descriptions, which are then input into the video model.
This way, Sora can generate high-quality, accurate videos based on specific user requirements.
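In outline, the text pipeline combines two pieces: a captioner that re-labels training videos with detailed descriptions, and an LLM that expands terse user prompts into similarly detailed captions at generation time. The sketch below is purely illustrative; the captioner, the instruction wording, and the `llm`/`sora` interfaces are hypothetical stand-ins, not a published API.

```python
# Hypothetical sketch of the two-stage text pipeline; none of these interfaces are real APIs.

def recaption_training_set(videos, captioner):
    """Stage 1: give every training clip a detailed caption (DALL-E 3-style re-captioning)."""
    return {video.id: captioner.describe(video) for video in videos}

def expand_prompt(short_prompt: str, llm) -> str:
    """Stage 2 (at generation time): turn a terse user prompt into a detailed caption."""
    instruction = (
        "Rewrite this video idea as a long, highly detailed caption describing the "
        "subjects, setting, lighting, camera motion, and style:\n" + short_prompt
    )
    return llm.complete(instruction)   # assumed GPT-style text-completion interface

# detailed = expand_prompt("a corgi surfing at sunset", llm)
# video = sora.generate(detailed)      # `sora.generate` is purely illustrative
```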
Diversified Prompts for Images and Videos
While the showcased cases are demos of Sora converting text to video, Sora’s capabilities extend beyond that.
It can also accept other forms of input, such as images or videos.
This allows Sora to perform a range of image and video editing tasks, such as creating seamless looping videos, adding dynamics to static images, and extending the length of videos on the timeline, etc.
Bringing DALL·E Images to Life
Sora can accept an image and a text prompt, then generate video based on these inputs.
Below are videos generated by Sora from DALL·E 2 and DALL·E 3 images.
A Shiba Inu wearing a beret and a black turtleneck.
An illustration of a family of five monsters, featuring a furry brown monster, a smooth black monster with antennas, a green spotted monster, and a small polka-dotted monster, all playing together in a cheerful scene.
A realistic photo of a cloud with the word “SORA” written on it.
In an elegant old hall, a huge wave is about to crash down. Two surfers seize the opportunity, skillfully gliding on the crest of the wave.
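One plausible way to implement this image conditioning, sketched below, is to encode the still image as the first latent frame and fill the remaining frames with noise for the model to denoise. This mechanism is our assumption; the report only states that Sora accepts images (and videos) as input.

```python
# Assumed mechanism for animating a still image: fix frame 0, denoise the rest.
import torch

def image_to_video_init(image_latent: torch.Tensor, num_frames: int) -> torch.Tensor:
    """image_latent: (1, C, 1, H, W) from the compressor -> (1, C, T, H, W) initial latents."""
    b, c, _, h, w = image_latent.shape
    noise = torch.randn(b, c, num_frames - 1, h, w)
    return torch.cat([image_latent, noise], dim=2)   # keep frame 0 fixed while denoising

dalle_image_latent = torch.randn(1, 16, 1, 32, 32)   # placeholder for the encoded DALL-E image
init = image_to_video_init(dalle_image_latent, num_frames=8)
print(init.shape)                                     # (1, 16, 8, 32, 32)
```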
Flexible Expansion of Video Timeline
Sora can not only generate videos but also extend them along the timeline, both forward and backward.
In the demo, all the videos were extended backward in time starting from the same generated segment. Although each begins differently, they all converge on the same ending.
Through this method, we can extend videos in both directions, creating seamless looping videos.
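A plausible way to read this, sketched below, is as latent-space in-painting over time: the latents of the known segment are held fixed while newly added noise frames before (or after) it are denoised, and a seamless loop falls out of pinning the last frame to match the first. The report does not disclose the actual conditioning scheme, so treat this as an assumption.

```python
# Assumed in-painting-style setup for extending a clip backward in time and for looping.
import torch

def extend_backward(segment_latents: torch.Tensor, extra_frames: int) -> torch.Tensor:
    """Prepend noise frames to be denoised so the new footage leads into the known segment."""
    b, c, t, h, w = segment_latents.shape
    prefix = torch.randn(b, c, extra_frames, h, w)
    return torch.cat([prefix, segment_latents], dim=2)   # only the prefix gets regenerated

def loop_init(video_latents: torch.Tensor) -> torch.Tensor:
    """Pin the final frame to the first one, then regenerate the frames in between."""
    looped = video_latents.clone()
    looped[:, :, -1] = video_latents[:, :, 0]
    return looped
```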
Image Generation Capability
Similarly, Sora also possesses the ability to generate images.
To achieve this, researchers arrange Gaussian noise patches on a spatial grid, with a time span of one frame.
The model can generate images of different sizes, with a maximum resolution of 2048×2048 pixels.
Left: A close-up photo of a woman in autumn, rich in detail with a blurred background.
Right: A vibrant coral reef inhabited by colorful fish and marine life.
Left: A digital painting depicting a young tiger under an apple tree, in exquisite matte painting style.
Right: A snow-covered mountain village, with cozy cottages and a magnificent northern light, detailed and realistic, photographed with a 50mm f/1.2 lens.
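Under this formulation an image is just a one-frame video, so the only change from video generation is a grid with temporal extent 1. The compression factor and channel count below are the same illustrative assumptions used earlier.

```python
# Images as one-frame videos: a spatial grid of noise patches with temporal extent 1.
import torch

def image_noise_grid(height: int, width: int, channels: int = 16) -> torch.Tensor:
    return torch.randn(1, channels, 1, height // 8, width // 8)   # T = 1 frame

grid = image_noise_grid(2048, 2048)   # up to 2048x2048 output per the report
print(grid.shape)                      # (1, 16, 1, 256, 256)
```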
Changing Video Styles and Environments
Using diffusion models, we can edit images and videos through text prompts.
Here, researchers applied a technique called SDEdit to Sora, enabling it to change the style and environment of an input video zero-shot, without any example videos.
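SDEdit’s core trick is simple: partially noise the latents of the source video, then denoise them under the new prompt so the overall structure survives while style and environment change. The sketch below assumes a simple linear noising rule and a stand-in `denoise` function; Sora’s actual sampler is not public.

```python
# SDEdit-style video editing, sketched; `denoise` stands in for the model's reverse-diffusion loop.
import torch

def sdedit(video_latents: torch.Tensor, new_prompt_emb: torch.Tensor, denoise, strength: float = 0.6):
    # strength in (0, 1): higher = more noise = a larger departure from the source video.
    noised = (1 - strength) * video_latents + strength * torch.randn_like(video_latents)
    return denoise(noised, new_prompt_emb, start_level=strength)   # resume denoising mid-way
```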
Seamless Transition Between Videos
Additionally, Sora can create smooth transitions between two different videos, even if the themes and scenes of these videos are completely different.
In the demo below, the middle video achieves a smooth transition from the left video to the right video.
One is a castle, and the other is a snowy cottage, blending very naturally into one scene.
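One way to picture the transition effect, again as latent-space in-painting: pin the opening frames to the latents of the first video and the closing frames to the second, then let the model fill in the middle. This is an illustrative reading, not a disclosed method.

```python
# Assumed setup for a smooth transition: generate only the middle frames between two clips.
import torch

def transition_init(a_latents: torch.Tensor, b_latents: torch.Tensor, middle_frames: int) -> torch.Tensor:
    """a_latents, b_latents: (1, C, T, H, W) latents of the two source clips."""
    b, c, _, h, w = a_latents.shape
    gap = torch.randn(b, c, middle_frames, h, w)
    return torch.cat([a_latents, gap, b_latents], dim=2)   # only the gap is regenerated
```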
Emerging Simulation Capabilities
As large-scale training progresses, the video model exhibits many exciting new capabilities.
Sora leverages these capabilities to simulate certain features of humans, animals, and natural environments without needing to set specific rules for 3D space, objects, etc.
The emergence of these capabilities is entirely attributed to the scaling of the model.
Realism in 3D Space
Sora can create videos with dynamic perspective changes, making the movement of characters and scene elements in three-dimensional space appear very natural.
For example, in a video of a couple strolling through snowy Tokyo, the generated camera movement closely resembles real footage.
In another example, Sora takes a wider view, generating sweeping mountain scenery with hikers that is reminiscent of drone footage.
Consistency of Videos and Persistence of Objects
Maintaining continuity of scenes and objects over time while generating long videos has always been a challenge.
Sora handles this well: even when objects are occluded or leave the frame, it can preserve their presence.
It can also show the same character multiple times in a video while keeping their appearance consistent throughout.
In the example below, the spotted dog on the windowsill keeps a consistent appearance as multiple passersby walk past.
In another, a cyberpunk-style robot rotates a full circle from front to back without any skipped frames.
Interaction with the World
Moreover, Sora can simulate simple behaviors that affect the state of the world.
For example, an artist painting a cherry blossom tree leaves strokes on the watercolor paper that persist over time.
Or, the bite marks left on a hamburger are clearly visible, with Sora’s generation conforming to the rules of the physical world.
Simulation of the Digital World
Sora can simulate not only the real world but also the digital world, such as video games.
For example, Sora can control the player character in “Minecraft” while rendering the game world and its dynamics in high fidelity.
What’s more, these capabilities can be elicited simply by prompting Sora with captions that mention “Minecraft”.
These new capabilities suggest that continuing to scale video models is a very promising direction, one that could lead to highly capable simulators of the physical and digital worlds, and of the objects, animals, and people within them.
Limitations
Of course, as a simulator, Sora still has many limitations.
For example, it does not accurately model the physics of many basic interactions, such as glass shattering.
Nor does simulating the process of eating food always produce the correct changes in an object’s state.
On the homepage of the website, OpenAI has detailed common issues with the model, such as logical inconsistencies in long videos, or objects appearing out of nowhere.
Finally, OpenAI states that the capabilities Sora already demonstrates show that continuing to scale video models is an exciting direction.
Continue down this path, and perhaps one day world models will come into being.
Netizens: Future Games Will Be Spoken into Existence
OpenAI has provided numerous official demonstrations, suggesting that Sora may pave the way for more realistic generated games, where games could be generated from text descriptions alone.
This is both exciting and frightening.
FutureHouseSF’s co-founder speculated, “Perhaps Sora can simulate my world. Maybe the next generation of game consoles will be a ‘Sora box’, and games will ship as 2-3 paragraphs of text.”
OpenAI engineer Evan Morikawa stated, “Of the Sora videos OpenAI released, this one opened my eyes. Rendering this scene with a classic renderer would be very difficult. Sora simulates physics in a way that differs from how we do it. It will certainly still make mistakes, but I did not predict it could do this so realistically.”
Some netizens remarked, “People didn’t take the phrase ‘everyone will become a filmmaker’ seriously.”
“I made this 1920s-style trailer in 15 minutes using clips from OpenAI’s Sora, a David Attenborough voiceover from ElevenLabs, and some nature music sampled from YouTube, edited together in iMovie.”
Some even stated, “In five years, you will be able to generate fully immersive worlds and experience them in real time; ‘holographic decks’ are about to become a reality!”
Some people expressed that they were completely amazed by the outstanding effects of Sora’s AI video generation.
“It makes existing video models look like silly toys. Everyone will become a filmmaker.”
“A new generation of filmmakers is about to emerge with OpenAI’s Sora. In ten years, it will be an interesting competition!”
“OpenAI’s Sora will not replace Hollywood anytime soon. It will provide tremendous momentum for Hollywood as well as individual filmmakers and content creators.
Imagine, with just a team of three, completing a first draft of a 120-minute feature film in a week for audience testing. That is our goal.”
Reference: https://openai.com/research/video-generation-models-as-world-simulators
