Creating Video From Text

Sora is an AI model that can create realistic and imaginative scenes from text instructions.

All videos on this page were generated directly by Sora without modification.

Capabilities (Sora)

We are teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems requiring real-world interaction.

Introducing Sora, our text-to-video model. Sora can generate videos up to one minute long while maintaining visual quality and adhering to the user’s prompt.

Sora is now available to red teamers, who will assess critical areas for harms or risks. We are also giving access to a number of visual artists, designers, and filmmakers to gather feedback on how to improve the model so that it best serves creative professionals.

We are sharing our research progress early so that we can begin working with and getting feedback from people outside of OpenAI, and to give the public a sense of what AI capabilities are on the horizon.

Sora can generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.

The model has a deep understanding of language, which lets it interpret prompts accurately and generate compelling characters that express vivid emotions. Sora can also create multiple shots within a single generated video that keep the characters and visual style consistent.

There are flaws in the current model. It might have trouble faithfully reproducing the physics of a complicated scene and might not be able to comprehend particular cases of cause and effect. A person might bite into a cookie, for instance, but the cookie might not have a bite mark afterward.

The model may also confuse spatial details of a prompt, for example by mixing up left and right, and may struggle to follow precise descriptions of events that unfold over time, such as a specific camera trajectory.


Before integrating Sora into OpenAI’s products, we’ll be taking a number of crucial safety precautions. We will be collaborating with red teamers, who are subject matter experts in areas such as biased content, hate speech, and disinformation, to conduct adversarial testing on the model.

Additionally, we are developing tools to assist in identifying deceptive content, like a detection classifier that can determine whether a video was produced by Sora. If we use the model in an OpenAI product in the future, we intend to incorporate C2PA metadata.

In addition to developing new techniques to prepare for deployment, we are utilising the existing safety methods that we built for our products that use DALL·E 3, which are applicable to Sora as well.

For instance, once Sora is integrated into an OpenAI product, our text classifier will check and reject text input prompts that violate our usage policies, such as those that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the intellectual property of others. We have also developed robust image classifiers that review the frames of every generated video to help ensure that it complies with our usage policies before it is shown to the user.

We will engage policymakers, educators, and artists around the world to understand their concerns and to identify positive use cases for this new technology. Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all of the ways people will abuse it. That is why we believe that learning from real-world use is a critical part of creating and releasing increasingly safe AI systems over time.

Research Techniques

Sora is a diffusion model: it generates a video by starting from one that looks like static noise and gradually transforming it, removing the noise over many steps.
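
The step-by-step noise removal described above can be sketched as a short loop. This is a toy illustration under stated assumptions, not Sora’s implementation: toy_denoiser is a hypothetical stand-in for the learned network (it simply returns a known target), the "video" is a flat list of numbers, and the blending schedule is chosen only to make the loop easy to follow.

```python
import random

def toy_denoiser(noisy, target):
    """Hypothetical stand-in for a learned network.

    A real model would estimate the clean signal from the noisy input
    (and the text prompt). Here we ignore the input and return the known
    target so that only the structure of the loop is on display.
    """
    return list(target)

def sample(target, num_steps=10, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise with the same shape as the output.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    errors = []
    for step in range(num_steps):
        estimate = toy_denoiser(x, target)
        # Move part of the way toward the estimate. The weight grows each
        # step and reaches 1.0 on the last one (an assumed schedule).
        weight = 1.0 / (num_steps - step)
        x = [(1 - weight) * xi + weight * ei for xi, ei in zip(x, estimate)]
        # Track how far the current sample is from the clean signal.
        errors.append(max(abs(xi - ti) for xi, ti in zip(x, target)))
    return x, errors

final, errors = sample([0.2, 0.5, 0.8, 0.5], num_steps=5)
print(errors)  # the remaining noise shrinks at every step
```

In a real diffusion model the estimate comes from a trained network and is imperfect, which is one reason many small steps are used rather than a single jump.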

Sora can generate entire videos all at once or extend generated videos to make them longer. By giving the model foresight of many frames at a time, we have addressed the difficult problem of keeping a subject the same even when it temporarily goes out of view.

Sora employs a transformer architecture, just like GPT models, to provide better scaling performance.

We use collections of smaller data units called patches to represent images and videos; each patch is similar to a token in GPT. We can train diffusion transformers on a larger range of visual data than was previously possible, spanning different durations, resolutions, and aspect ratios, by standardising the way we represent data.
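
To make the patch idea concrete, here is a minimal sketch that cuts a small video into non-overlapping blocks and flattens each block into one token-like list. Everything in it is an assumption for illustration: the function names patchify and make_video, the patch size of 2 x 2 x 2, and the use of raw values in nested lists. It is not the representation Sora actually uses.

```python
def patchify(video, pt, ph, pw):
    """Split a video stored as nested lists [time][height][width] into
    non-overlapping patches of size pt x ph x pw, each flattened to a list."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                # Collect every value inside this block in a fixed order.
                patches.append([video[t][y][x]
                                for t in range(t0, t0 + pt)
                                for y in range(y0, y0 + ph)
                                for x in range(x0, x0 + pw)])
    return patches

def make_video(T, H, W):
    """Build a dummy video whose values are just running indices."""
    return [[[t * H * W + y * W + x for x in range(W)]
             for y in range(H)] for t in range(T)]

# Two clips with different durations and aspect ratios.
short_square = patchify(make_video(2, 4, 4), 2, 2, 2)
long_wide = patchify(make_video(4, 4, 8), 2, 2, 2)

print(len(short_square), len(short_square[0]))
print(len(long_wide), len(long_wide[0]))
```

Both clips produce patches of the same length and differ only in how many patches they yield. That is the sense in which one shared representation can cover visual data of varying duration, resolution, and aspect ratio, much as texts of different lengths become token sequences of different lengths.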

Sora expands on earlier work in the GPT and DALL·E models. It makes use of DALL·E 3’s recaptioning technique, which entails creating extremely detailed captions for the visual training data. Consequently, the model can more accurately follow the user’s text instructions in the generated video.

In addition to creating a video from text instructions alone, the model can take an existing still image and generate a video from it, animating the image’s contents with accuracy and attention to small detail. The model can also take an existing video and extend it or fill in missing frames. Read our technical report to find out more.

We think that being able to understand and simulate the real world will be a crucial step towards achieving artificial general intelligence (AGI), and Sora provides the foundation for such models.



