Credit: @WhimsyNico
Hey Everyone,
Some interesting news broke recently that I wanted to talk about real quick. Sora can generate realistic and imaginative scenes of up to 60 seconds.
Prompt Example
Prompt: “A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.”
Where I actually think this has implications is in gaming.
“Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, ‘intuitive’ physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths.” - Jim Fan (on X)
“Sora is an AI model that can create realistic and imaginative scenes from text instructions.” - OpenAI
OpenAI said Sora is still in the research stage and is not yet being added to any of the company's products.
The simulator instantiates two exquisite 3D assets: pirate ships with different decorations.
Sora has to solve text-to-3D implicitly in its latent space.
The 3D objects are consistently animated as they sail and avoid each other's paths.
The fluid dynamics of the coffee are captured, even the foam that forms around the ships. Fluid simulation is an entire sub-field of computer graphics that traditionally requires very complex algorithms and equations.
The photorealism is striking, almost like rendering with ray tracing.
The simulator takes into account the small size of the cup compared to oceans, and applies tilt-shift photography to give a "minuscule" vibe.
The semantics of the scene do not exist in the real world, but the engine still implements the physical rules we expect.
Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions.
Prompt
Prompt: “Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.”
Sora is a diffusion model that is able to "generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background."
Convenient.
Sora will be able to understand the nuances of the prompt as well as how various objects behave in the physical world.
Sora also generates an entire video at once, rather than creating it frame by frame. That helps avoid what has been a challenge with other approaches — ensuring a subject stays the same even when it goes out of view temporarily.
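To make the "entire video at once" idea concrete, here is a toy sketch of diffusion-style denoising over a whole spacetime latent. Everything here is hypothetical (the shapes, the step count, and the `toy_denoiser` stand-in are my own illustration, not Sora's actual model); the point is only that every frame is denoised jointly in one tensor, rather than generated frame by frame.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy shapes; Sora's real latent sizes are not public.
T, H, W, C = 8, 4, 4, 3   # frames, height, width, channels of a video latent
STEPS = 50                # number of denoising steps

def toy_denoiser(x, t):
    """Stand-in for the learned model: just nudge the latent toward zero.
    A real diffusion model would predict the noise to remove at step t."""
    return x * 0.9

# Start from pure noise over the ENTIRE spacetime volume at once.
# Because all frames share one tensor through every step, a subject
# can stay consistent even when it temporarily leaves the frame.
x = rng.standard_normal((T, H, W, C))
for t in reversed(range(STEPS)):
    x = toy_denoiser(x, t)

print(x.shape)  # (8, 4, 4, 3): all frames denoised jointly
```

A frame-by-frame generator would instead run this loop once per frame, with nothing tying frame 7 back to frame 0, which is exactly where subject drift creeps in.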
The images feel, well, sort of like a deepfake might, or as lonely as a room might feel with a Vision Pro on.
OpenAI says it is working with red teamers — domain experts in areas like misinformation, hateful content, and bias — who are adversarially testing the model. The videos don’t come with audio, as far as I can tell.
Runway, Pika Labs, and Google’s Lumiere are all tools similar to Sora.
“We think building models that can understand video, and understand all these very complex interactions of our world, is an important step for all future AI systems,” says Tim Brooks, a scientist at OpenAI.
The way OpenAI presents things reminds me more of synthetic AI training worlds and the future of gaming than it does video generation for the entertainment industry. Videographers shouldn’t fear these tools quite yet.
Similar to Sora, Lumiere gives users text-to-video tools and also lets them create videos from a still image. Where does this even lead to though? Cinematic deepfakes or art? Or just more spam on the internet?
People also have weird reactions. “I am absolutely terrified that this kind of thing will sway a narrowly contested election,” said Oren Etzioni, a professor at the University of Washington who specializes in artificial intelligence, in The New York Times.
OpenAI calls its new system Sora, after the Japanese word for sky. The team behind the technology, including the researchers Tim Brooks and Bill Peebles, chose the name because it “evokes the idea of limitless creative potential.”
Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance.
Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data.
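OpenAI’s technical report describes feeding the transformer “spacetime patches” of compressed video, much the way GPT models consume text tokens. Here is a rough sketch of that tokenization step; the latent and patch sizes below are made up for illustration, since the real dimensions aren’t public.

```python
import numpy as np

# Hypothetical toy sizes; the real latent and patch dimensions are not public.
T, H, W, C = 8, 16, 16, 4    # video latent: frames, height, width, channels
pt, ph, pw = 2, 4, 4         # spacetime patch size (time, height, width)

latent = np.zeros((T, H, W, C))

# Carve the latent into spacetime patches and flatten each one into a
# token vector, analogous to how GPT tokenizes text into a sequence.
patches = (latent
           .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
           .transpose(0, 2, 4, 1, 3, 5, 6)
           .reshape(-1, pt * ph * pw * C))

print(patches.shape)  # (64, 128): 64 tokens, each a flattened 2x4x4x4 patch
```

Treating video as one flat token sequence is what lets the same transformer machinery scale across resolutions, aspect ratios, and durations.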
In addition to being able to generate a video solely from text instructions, the model is able to take an existing still image and generate a video from it, animating the image’s contents with accuracy and attention to small detail.
OpenAI Bizarrely Tries to Tie This to AGI
“Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI,” OpenAI writes.
Deepfake Threat
I can only imagine Sora will make the internet a yet more dangerous place.
One example prompt describes a stack of vintage televisions showing different programs: 1950s sci-fi movies, horror movies, news, static, a 1970s sitcom, etc., set inside a large New York museum gallery.
I think these tools are great. They rapidly expand image-idea prototyping and allow people with creative vision, but without the trained manual muscle memory of a visual artist, to bring images into existence that others can enjoy, hate, or disdain. It’s a wonderful thing.
As for all the handwringing about "elections," enough already - all those people mean is "I hate freedom and I don't want anyone convincing someone to vote in a way I don't like." They need to get over it.
Imagine how amazing role playing games are going to be in like 5 years.