In-Depth Analysis of Sora Official Technical Report and AI Video Prompts Collection

On February 16th, OpenAI dropped a bombshell by releasing its first text-to-video model, Sora. Sora can reportedly output videos up to 60 seconds long, featuring highly detailed backgrounds, complex multi-angle shots, and multiple characters with vivid emotions.

Core Highlights

60s of 3D Motion Video

In the past, image and video generation methods typically resized, cropped, or trimmed training videos to a standard size, such as 4 seconds at 256×256 resolution. Sora breaks this convention by training directly on data at its native size, which brings several advantages.

While many AI video tools struggle to stay coherent for even 4 seconds, Sora claims support for videos up to 60 seconds long.

In terms of 3D consistency, Sora is capable of generating videos with dynamic camera movements. As the camera moves and rotates, characters and scene elements consistently follow the laws of motion in three-dimensional space.

In the following case, we can see that this single-shot video lasts 60s, and not only is the main character stable, but even the characters in the background remain unbelievably stable. Scene transitions are also very smooth, seamlessly switching from wide shots to close-ups of faces.

Video Multi-Angle Consistency

Current AI video workflows produce a single shot per generation. With Sora, a single video can contain shots from multiple angles while keeping the subject consistent, which was unimaginable before. An important challenge in video generation is maintaining temporal and spatial coherence in longer generated videos.

Sora often, though not always, models both short-range and long-range dependencies effectively. For example, in the generated video, characters, animals, and objects can still be accurately preserved and presented even after being occluded or leaving the frame. Similarly, Sora can generate multiple shots of the same character within a single sample while keeping the character's appearance consistent throughout the video.

World Models

World models have always been one of the hardest problems to solve, in part because they require extensive data collection and cleaning. As a result, Runway’s world model has remained stagnant. However, as shown in the example below, Sora appears to have learned some of the laws of physics and can sometimes simulate actions that affect the state of the world in simple ways.

Official Technical Report of Text-to-Video Model Sora

We explored large-scale training of generative models on video data. Our results suggest that scaling up video generation models is a promising path toward building general-purpose simulators of the physical world.

More Flexible Sampling

Sora samples flexibly: it can handle widescreen 1920×1080 video, vertical 1080×1920 video, and everything in between. This means Sora can generate content directly at the native aspect ratio of different devices.

Sora can also quickly prototype content at smaller sizes before generating it at full resolution, all with the same model.

Caption: Sora can generate content directly at the native aspect ratio of different devices
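The idea of prototyping at a smaller size before rendering at full resolution comes down to simple arithmetic: shrink the target while keeping its aspect ratio. The sketch below is only an illustration of that arithmetic. The function name `preview_size`, the 512-pixel cap, and the rounding to multiples of 8 are assumptions made for the example, not details from the report.

```python
from typing import Tuple


def preview_size(width: int, height: int,
                 max_side: int = 512, multiple: int = 8) -> Tuple[int, int]:
    """Shrink (width, height) so the longer side is at most about max_side.

    The aspect ratio is preserved up to rounding. Each side is snapped to a
    multiple of `multiple` (a hypothetical model constraint, assumed here).
    """
    # Never upscale: sizes already below the cap keep a scale of 1.0.
    scale = min(1.0, max_side / max(width, height))

    def snap(value: int) -> int:
        # Round the scaled side to the nearest allowed multiple, never below one.
        return max(multiple, round(value * scale / multiple) * multiple)

    return snap(width), snap(height)


# Widescreen, vertical, and square targets keep their shape at preview size.
for w, h in [(1920, 1080), (1080, 1920), (1080, 1080)]:
    print((w, h), "->", preview_size(w, h))
```

Under these assumptions, a 1920×1080 target maps to a 512×288 preview and a 1080×1920 target to 288×512, so the draft and the final render share the same composition.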

Improved Composition and Framing

Our experimental results show that training on videos at their original aspect ratios improves composition and framing. To verify this, we compared Sora with a version of the model that cropped all training videos to squares. The model trained on square crops sometimes generated videos in which the subject was only partially in view. In contrast, videos from Sora showed better framing.

Caption: Compared to the model trained on square crops (left), Sora frames the subject better
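To get a feel for why square crops can cut off the subject, it helps to work out how much of a widescreen frame a square crop discards. The sketch below assumes a centered crop, which is an assumption for illustration: the report does not say how the crops were taken, and the function names are hypothetical.

```python
from typing import Tuple


def center_square_crop(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return the (left, top, right, bottom) box of a centered square crop."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def kept_fraction(width: int, height: int) -> float:
    """Fraction of the frame's pixels that survive a centered square crop."""
    side = min(width, height)
    return side * side / (width * height)


print(center_square_crop(1920, 1080))  # (420, 0, 1500, 1080)
print(kept_fraction(1920, 1080))       # 0.5625
```

For a 1920×1080 frame, only the columns from x = 420 to x = 1500 survive, so a subject near the left or right edge is partly or wholly removed from the training data.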

More Video-Related Features

All the results above, like our demonstrations, are text-to-video samples. However, Sora’s capabilities extend well beyond this: it can also be prompted with other inputs, such as pre-existing images or videos. This allows Sora to perform a wide range of image and video editing tasks, such as creating perfectly looping videos, animating static images, and extending videos forward or backward in time.

Sora can generate videos when given an image and a prompt as input. The example videos below were generated from DALL·E 2 and DALL·E 3 images. These examples demonstrate both Sora’s capabilities and its potential for image and video editing.

Video: Generated from an image of a Shiba Inu wearing a beret and a black turtleneck

Video: Generated from an image of a realistic cloud that spells “SORA”, and from an image of a magnificent historical hall in which a massive wave peaks and begins to break, with two surfers seizing the moment and skillfully gliding across the face of the wave

Sora can not only generate videos but also extend them forward or backward in time. The following three videos were all extended backward in time from the same generated video segment. As a result, each starts differently, yet all three lead to the same ending.

Video: These videos have different starting points, but the endings are almost the same

This demonstrates Sora’s capability for temporal extension, which can even be used to create seamless infinite loops.

Video: Sora can even create infinite loop videos

Diffusion models have enabled a variety of methods for editing images and videos from text prompts. Here, we apply one of these techniques, SDEdit [32], to Sora. This technique enables Sora to transform the style and environment of input videos zero-shot, opening up new possibilities for video editing.
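The core idea of SDEdit is to add noise to the input only up to an intermediate level and then run the denoising process from there under the new prompt, so that the coarse structure of the input survives while its style can change. The following is a minimal sketch of that idea on a plain list of numbers, not Sora's implementation. The names `sdedit` and `toy_denoiser`, the `strength` parameter, and the simple noise blend are assumptions for illustration, and the toy denoiser merely stands in for a prompt-conditioned model.

```python
import math
import random
from typing import Callable, List

Signal = List[float]  # stand-in for video pixels or latents


def sdedit(source: Signal,
           denoise_step: Callable[[Signal, float], Signal],
           strength: float,
           num_steps: int = 50,
           seed: int = 0) -> Signal:
    """Partially noise `source`, then denoise it with a guided model.

    strength in [0, 1]: 0 returns the input unchanged, 1 discards it entirely.
    """
    rng = random.Random(seed)
    steps_to_run = round(strength * num_steps)
    sigma = steps_to_run / num_steps
    # Forward perturbation: blend the input with Gaussian noise.
    x = [math.sqrt(1.0 - sigma ** 2) * v + sigma * rng.gauss(0.0, 1.0)
         for v in source]
    # Reverse process, started from the intermediate noise level.
    for i in range(steps_to_run, 0, -1):
        x = denoise_step(x, i / num_steps)
    return x


def toy_denoiser(target: float) -> Callable[[Signal, float], Signal]:
    """Hypothetical 'model' that nudges every value toward `target`."""
    def step(x: Signal, noise_level: float) -> Signal:
        return [v + 0.2 * (target - v) for v in x]
    return step


source = [0.0, 0.5, -0.5]
print(sdedit(source, toy_denoiser(1.0), strength=0.1))  # stays near the input
print(sdedit(source, toy_denoiser(1.0), strength=0.8))  # pulled to the target
```

The `strength` value captures the trade-off: a low value keeps the result faithful to the input, while a high value gives the guiding prompt more freedom to change it.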

New Simulation Capabilities

During large-scale training, we found that video models exhibit a number of exciting emergent capabilities. These capabilities enable Sora to simulate certain aspects of characters, animals, and environments in the real world.

Notably, the emergence of these attributes did not rely on any explicit 3D modeling, object recognition, or other inductive biases, but emerged naturally through the scaling up of the model.

Video: Characters and scene elements consistently maintain coherence in three-dimensional space

Sora can also simulate artificial processes, such as video games. It can control the player in “Minecraft” with a basic policy while simultaneously rendering the world and its dynamics in high fidelity. These capabilities require no additional training data or adjustments to model parameters; prompting Sora with a caption that mentions “Minecraft” is enough to elicit them.

These new capabilities indicate that the continuous expansion of video models provides a promising path for developing high-performance physical and digital world simulators. By simulating entities such as objects, animals, and people living in these worlds, we can gain deeper insights into the operating laws of the real world and develop more realistic and natural video generation technologies.

Current Shortcomings

OpenAI does not shy away from Sora’s current weaknesses, pointing out that it may struggle to accurately simulate the physics of complex scenes and may fail to understand causal relationships.

For example, in the scenario of “five gray wolf pups playing and chasing each other on a secluded gravel road,” the number of wolves may change, with some pups appearing or disappearing out of thin air.

The model may also confuse spatial details of a prompt, such as mixing up left and right, and may struggle with precise descriptions of events that unfold over time, such as following a specific camera trajectory.

For instance, in the video generated based on the prompt “a basketball goes through the hoop and then explodes,” the basketball fails to bounce off the edge of the hoop and simply goes through.

OpenAI stated that they are teaching AI to understand and simulate the physical world in motion, aiming to train models to assist people in solving problems that require real-world interaction.

At the same time, OpenAI explained how Sora works. Sora is a diffusion model: it starts from a video that looks like static noise and gradually removes the noise over many steps, transforming random pixels into a clear scene. Sora uses a Transformer architecture, which scales well.
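The denoising process described here can be sketched as a short loop: start from pure noise, ask a model for its current guess of the clean result, and move a little closer to that guess at every step. The code below is a toy illustration on a list of numbers under a simple linear noise schedule, not Sora's actual sampler. The names `sample` and `predict_clean` are hypothetical, and the fixed-target predictor stands in for a trained network.

```python
import random
from typing import Callable, List

Signal = List[float]  # stand-in for the pixels or latents of a video


def sample(predict_clean: Callable[[Signal, float], Signal],
           num_values: int,
           num_steps: int = 20,
           seed: int = 0) -> Signal:
    """Turn random noise into a sample by repeated partial denoising."""
    rng = random.Random(seed)
    # Time t = 1: the "video" is pure Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(num_values)]
    for i in range(num_steps, 0, -1):
        t, s = i / num_steps, (i - 1) / num_steps  # current and next noise level
        clean = predict_clean(x, t)                # model's guess of the result
        # Keep a fraction s / t of the remaining noise and drop the rest.
        x = [c + (s / t) * (v - c) for v, c in zip(x, clean)]
    return x


# Hypothetical stand-in for a trained model: it always predicts the same scene.
target = [0.2, 0.4, 0.6, 0.8]
result = sample(lambda x, t: target, num_values=len(target))
print(result)
```

With this toy predictor the last step removes all remaining noise, so the loop ends exactly on the target. A real model's prediction changes as the noise is removed, which is why many small steps are used instead of one large one.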

Currently, Sora is available to select red teamers, who are assessing critical areas for potential harms or risks. At the same time, OpenAI has invited a group of visual artists, designers, and filmmakers to try the model, hoping to gain feedback that will advance the model and better support creative workers.

Appreciation of Works

A train travels through the suburbs of Tokyo, reflecting charming scenes on the window.

Prompt: Reflections in the window of a train traveling through the Tokyo suburbs.

Several giant woolly mammoths tread through a snowy meadow, their long furry coats blowing lightly in the wind, with snow-covered trees and dramatic snow-capped mountains in the distance. The afternoon light, with wispy clouds and a sun high in the distance, creates a warm glow. The stunning low camera angle captures the large furry mammals with beautiful photography and depth of field.

Prompt: Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.

A drone overlooks the rugged cliffs near Big Sur’s Garay Point beach, with waves crashing against the rocks, forming white-tipped waves, illuminated by the golden light of the setting sun. In the distance, a small island with a lighthouse stands, and green vegetation covers the cliff’s edge. The steep drop from the road down to the beach showcases the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.

Prompt: Drone view of waves crashing against the rugged cliffs along Big Sur’s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff’s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff’s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.

During the blue hour, an aerial view of Santorini showcases the stunning architecture of white Cycladic buildings with blue domes. The caldera views are breathtaking, and the lighting creates a beautiful, serene atmosphere.

Prompt: Aerial view of Santorini during the blue hour, showcasing the stunning architecture of white Cycladic buildings with blue domes. The caldera views are breathtaking, and the lighting creates a beautiful, serene atmosphere.

A young man in his 20s sits on a cloud in the sky, immersed in a book.

Prompt: A young man at his 20s is sitting on a piece of cloud in the sky, reading a book.

A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.

Prompt: A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.

In a row of brightly colored buildings in Burano, Italy, a cute Dalmatian looks out curiously through a window. Meanwhile, people come and go on the street, some walking, some cycling.

Prompt: The camera directly faces colorful buildings in Burano, Italy. An adorable Dalmatian looks through a window on a building on the ground floor. Many people are walking and cycling along the canal streets in front of the buildings.

A tilt-shift photograph of a construction site filled with workers, equipment, and heavy machinery.

Prompt: Tiltshift of a construction site filled with workers, equipment, and heavy machinery.

A petri dish with a bamboo forest growing within it that has tiny red pandas running around.

Prompt: A petri dish with a bamboo forest growing within it that has tiny red pandas running around.

A cartoon kangaroo is disco dancing.

Prompt: A cartoon kangaroo disco dances.

A photorealistic close-up video of two pirate ships battling each other as they sail inside a cup of coffee.

Prompt: Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee.

Shenzhen Longgang Intelligent Audio-Visual Research Institute