Soon, you'll be able to generate realistic videos with ChatGPT. OpenAI today announced its newest model, Sora, an impressive text-to-video system that can generate vibrant videos of up to 60 seconds from a text prompt alone. Sora isn't the first model capable of producing strikingly realistic video (Runway and Pika got there earlier), but from what we have seen, it is very likely already the most advanced.
Its ability to understand detailed prompts and recreate the dynamics of the physical world through motion and visual storytelling is uncanny. Take a look at these examples and see for yourself:
Prompt: "A petri dish with a bamboo forest growing within it that has tiny red pandas running around."
Prompt: "Reflections in the window of a train traveling through the Tokyo suburbs."
Prompt: "The camera directly faces colorful buildings in Burano, Italy. An adorable Dalmatian looks through a window on a building on the ground floor. Many people are walking and cycling along the canal streets in front of the buildings."
Prompt: "A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in snow."
As outlined in the official announcement, Sora builds on OpenAI's prior work in language and image generation. While DALL·E could only generate static images, Sora introduces flow and continuity, transitioning seamlessly between shots while maintaining context and fidelity to the original text prompt.
Prompt: "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors."
Prompt: "A drone camera circles around a beautiful historic church built on a rocky outcropping along the Amalfi Coast, the view showcases historic and magnificent architectural details and tiered pathways and patios, waves are seen crashing against the rocks below as the view overlooks the horizon of the coastal waters and hilly landscapes of the Amalfi Coast Italy, several distant people are seen walking and enjoying vistas on patios of the dramatic ocean views, the warm glow of the afternoon sun creates a magical and romantic feeling to the scene, the view is stunning captured with beautiful photography."
Notice how the videos are filled with intricate detail: flowing hair and clothing, emotionally expressive faces, naturalistic movement. The model also demonstrates an acute understanding of lighting, physics, and camera work, automatically composing dynamic scenes with angles, movements, and transitions that are nowhere specified in the original text.
Prompt: "Extreme close up of a 24 year old woman's eye blinking, standing in Marrakech during magic hour, cinematic film shot in 70mm, depth of field, vivid colors, cinematic."
Prompt: "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about."
How does it work? Sora is a diffusion model, using a technique similar to image generators like Midjourney: it begins with pure noise and progressively refines the video, step by step, until a vibrant, coherent scene emerges. The model is built on a transformer architecture similar to GPT, which allows for remarkable scalability and efficiency. It represents videos as collections of spacetime patches of data, analogous to tokens in a language model, and learns the mapping between textual descriptions and how those patches should look and move from frame to frame. This unified representation allows Sora to process visual data across a wide range of durations, resolutions, and aspect ratios.
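To make the patch-and-denoise idea concrete, here is a minimal sketch in PyTorch of the general technique described above. It is not OpenAI's code: every name in it (video_to_patches, PatchDenoiser, the toy update rule in generate) is hypothetical, and a real system would use a proper diffusion noise schedule, a far larger model, and a learned video compressor rather than raw pixels.

```python
import torch
import torch.nn as nn

def video_to_patches(video, patch=16, frames_per_patch=4):
    """Split a video tensor of shape (T, C, H, W) into flat spacetime patches."""
    t, c, h, w = video.shape
    # Trim so the video divides evenly into patches.
    video = video[: t - t % frames_per_patch, :, : h - h % patch, : w - w % patch]
    patches = (video.unfold(0, frames_per_patch, frames_per_patch)   # time
                    .unfold(2, patch, patch)                          # height
                    .unfold(3, patch, patch))                         # width
    # (T', C, H', W', f, p, p) -> one flat row per spacetime patch
    return patches.permute(0, 2, 3, 1, 4, 5, 6).reshape(-1, c * frames_per_patch * patch * patch)

class PatchDenoiser(nn.Module):
    """Hypothetical transformer that predicts the noise added to each patch."""
    def __init__(self, patch_dim, text_dim=512, width=512, layers=4, heads=8):
        super().__init__()
        self.in_proj = nn.Linear(patch_dim, width)
        self.text_proj = nn.Linear(text_dim, width)
        block = nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.out_proj = nn.Linear(width, patch_dim)

    def forward(self, noisy_patches, text_embedding):
        # Prepend the text embedding as an extra token so every patch can attend to it.
        text_token = self.text_proj(text_embedding).unsqueeze(1)
        tokens = torch.cat([text_token, self.in_proj(noisy_patches)], dim=1)
        return self.out_proj(self.backbone(tokens)[:, 1:])  # drop the text token

@torch.no_grad()
def generate(model, text_embedding, num_patches, patch_dim, steps=50):
    """Start from pure noise and refine it step by step (the diffusion loop)."""
    x = torch.randn(1, num_patches, patch_dim)
    for step in range(steps):
        predicted_noise = model(x, text_embedding)
        x = x - predicted_noise / (steps - step)  # toy update; real samplers are more careful
    return x  # denoised patches, ready to be laid back out into frames
```

Here the prompt enters as a text embedding, for example a (1, 512) vector from any text encoder, and the returned patches would be reassembled into frames by inverting video_to_patches.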
Drawing inspiration from DALL·E 3's re-captioning technique, Sora pairs its visual training data with rich, descriptive captions. Consequently, the model adeptly interprets free-form natural language prompts and stays more faithful to the user's instructions than its predecessors. Sora can likewise extend existing videos and fill in missing frames, pivotal capabilities for long-form generation.
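Extending and in-filling fit naturally into the same denoising loop: patches that come from existing footage can be clamped back to their observed values after every step, so the model only has to invent the missing region. The sketch below reuses the hypothetical PatchDenoiser from above; it illustrates the general masked-denoising idea, not OpenAI's actual method.

```python
import torch

@torch.no_grad()
def fill_missing(model, text_embedding, known_patches, known_mask, steps=50):
    """known_mask is True wherever a patch comes from existing footage."""
    x = torch.randn_like(known_patches)
    for step in range(steps):
        x = torch.where(known_mask, known_patches, x)  # keep observed content fixed
        predicted_noise = model(x, text_embedding)
        x = x - predicted_noise / (steps - step)       # same toy update as before
    return torch.where(known_mask, known_patches, x)
```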
Before a wide release, OpenAI is giving access to a select group of red teamers, visual artists, designers, and filmmakers. The goal is to assess the model's potential risks and harms while gathering invaluable feedback on how to refine Sora to better serve the creative community. Such collaboration is crucial for tailoring the model to meet the nuanced needs of professionals across various fields, from entertainment to design.
However, conscious of risks, OpenAI is working closely with experts across ethics, policy and content moderation to preemptively address dangers around misinformation, bias and harmful content. This includes adversarial testing, the development of detection classifiers to identify Sora-generated content, and the application of robust safety protocols developed for previous models like DALL·E 3.
Prompt: "Borneo wildlife on the Kinabatangan River."
Prompt: "A Chinese Lunar New Year celebration video with Chinese Dragon."
Simply put, Sora is a game changer for AI video generation. There is still room for improvement in long-form coherence and physics simulation, but Sora's breathtakingly versatile visual conjuring brims with disruptive potential.