Sora 2: how to write prompts the model actually understands
OpenAI · Updated:
Sora 2 is OpenAI's video model with native audio, support for up to two characters via the Characters API, and clips of 4-20 seconds. The prompt works like a brief for a director of photography: style first, then subject, action, camera, and sound. Duration and resolution are set only through API parameters, never in the text.
What Sora 2 does
Sora 2 generates clips of 4, 8, 12, 16, or 20 seconds at 720×1280 or 1280×720 (the Pro tier adds 1024×1792, 1792×1024, 1080×1920, 1920×1080). The model produces native audio — dialogue, ambience, SFX, and music. Through the Characters API you can upload a short character video (MP4, 2-4 seconds, 720p-1080p) and reuse it across clips; up to two characters are supported at once.
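The limits above are easy to check before a request is sent. The following is a minimal sketch, not an official client: the function name, the `pro` flag, and the `"1280x720"` string format for sizes are assumptions made for illustration; only the numeric values come from the paragraph above.

```python
# Values copied from the article; names and string format are hypothetical.
ALLOWED_SECONDS = (4, 8, 12, 16, 20)
BASE_SIZES = {"720x1280", "1280x720"}
PRO_EXTRA_SIZES = {"1024x1792", "1792x1024", "1080x1920", "1920x1080"}


def check_clip_settings(seconds: int, size: str, pro: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if seconds not in ALLOWED_SECONDS:
        problems.append(f"seconds={seconds} is not one of {ALLOWED_SECONDS}")
    # Pro is assumed to include the base sizes plus the extra ones.
    allowed_sizes = BASE_SIZES | PRO_EXTRA_SIZES if pro else BASE_SIZES
    if size not in allowed_sizes:
        problems.append(f"size={size!r} is not available on this tier")
    return problems


print(check_clip_settings(8, "1280x720"))    # no problems
print(check_clip_settings(10, "1920x1080"))  # bad duration, Pro-only size
```

Returning a list of problems rather than raising on the first one makes it possible to report every invalid setting at once.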
A distinctive feature is video extension: a clip can be extended up to 6 times, to a maximum of 120 seconds in total. On extension, the model uses the full original clip as context, not just the last frame, which keeps motion stable across joins. When iterating, it is often easier to assemble a scene from short stitched clips, for example two four-second takes instead of one eight-second take, because the model follows instructions more reliably in shorter takes.
- Clips of 4-20 seconds, native audio and dialogue
- Up to 2 characters via the Characters API (MP4 reference)
- Extension up to 120 seconds with the full clip in context
- Image-to-video: input photo anchors the first frame
- Video Edit for surgical changes to an existing clip
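The two extension limits (at most 6 extensions, at most 120 seconds in total) interact, so it can help to check a planned chain up front. This is a small sketch under one assumption that the article does not spell out: that the 120-second cap counts the original clip as well as the extensions. The function name is hypothetical.

```python
MAX_EXTENSIONS = 6        # from the article
MAX_TOTAL_SECONDS = 120   # from the article; assumed to include the base clip


def plan_extensions(base_seconds: int, extensions: list[int]) -> int:
    """Return the total length in seconds, or raise ValueError if a limit is exceeded."""
    if len(extensions) > MAX_EXTENSIONS:
        raise ValueError(
            f"{len(extensions)} extensions requested, limit is {MAX_EXTENSIONS}"
        )
    total = base_seconds + sum(extensions)
    if total > MAX_TOTAL_SECONDS:
        raise ValueError(f"total of {total}s exceeds the {MAX_TOTAL_SECONDS}s cap")
    return total


print(plan_extensions(20, [20, 20, 20, 20, 20]))  # 120
```

Under that assumption, a 20-second clip with six 20-second extensions would total 140 seconds and be rejected, which is why both limits are checked.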
Prompt structure
Optimal order: [Style/Aesthetic] + [Subject/Character] + [Scene/Environment] + [Action/Motion] + [Camera: shot + movement] + [Lighting/Color] + [Mood] + [Sound/Dialogue].
The main rule — style goes first. It is the most powerful control lever: the same details look radically different under «Hollywood drama», «handheld smartphone clip», or «grainy vintage commercial». Then comes a concrete subject (not «a person» but «a woman in a red coat»), concrete action with verbs («pedals three times, brakes, stops at the crosswalk» instead of «moves quickly»), and always a shot size plus camera movement.
A prompt describes one shot, not the whole story. Build long scenes from a series of short clips via extension or post-production cut.
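One way to keep the recommended order consistent across many prompts is to assemble them from named fields. The sketch below is illustrative only: the `Shot` class, its field names, and the sample scene text are assumptions, not part of any Sora tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class Shot:
    # Field order mirrors the order recommended in the article.
    style: str
    subject: str
    scene: str
    action: str
    camera: str
    lighting: str = ""
    mood: str = ""
    sound: str = ""


def build_prompt(shot: Shot) -> str:
    """Join the non-empty parts in declaration order, one sentence per part."""
    parts = [getattr(shot, f.name).strip() for f in fields(shot)]
    return " ".join(p if p.endswith(".") else p + "." for p in parts if p)


shot = Shot(
    style="Handheld smartphone clip",
    subject="A woman in a red coat on a bicycle",
    scene="A quiet city street",
    action="She pedals three times, brakes, stops at the crosswalk",
    camera="Medium shot, slow tracking from the side",
    sound="Background Sound: distant traffic hiss",
)
print(build_prompt(shot))
```

Because the order lives in the dataclass definition, style always lands first and optional parts such as mood can be left empty without leaving gaps in the text.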
Template with Cinematography and Actions blocks
The official Sora 2 template breaks the prompt into blocks. At the top goes a prose description of the scene, characters, wardrobe, and set, followed by labelled blocks:
Cinematography:
Camera: medium close-up, slow push-in
Lighting: warm key from overhead practical, cool spill from window
Mood: gentle, whimsical, a touch of suspense

Actions:
- The robot taps the bulb; sparks crackle.
- It flinches, dropping the bulb.
- A puff of steam escapes its chest.

Dialogue:
- Robot: "Almost lost it... but I got it!"

Background Sound:
Rain, ticking clock, soft mechanical hum.
The model reads this structure as a shot breakdown. For longer clips add a second-by-second layout: «0.00-2.40 — Arrival Drift (32mm, slow dolly left)» — the model anchors actions to timecodes.
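If prompts are generated programmatically, the block layout can be rendered from structured data so that every shot uses the same labels. This is a rough sketch: `render_blocks` and its argument names are hypothetical, and the opening description line in the usage example is invented for illustration.

```python
def render_blocks(description, camera, lighting, mood,
                  actions, dialogue, background_sound):
    """Render the block layout: prose first, then labelled blocks.

    actions is a list of strings; dialogue is a list of (speaker, line) tuples.
    """
    lines = [
        description, "",
        "Cinematography:",
        f"Camera: {camera}",
        f"Lighting: {lighting}",
        f"Mood: {mood}",
        "", "Actions:",
    ]
    lines += [f"- {action}" for action in actions]
    lines += ["", "Dialogue:"]
    lines += [f'- {speaker}: "{text}"' for speaker, text in dialogue]
    lines += ["", "Background Sound:", background_sound]
    return "\n".join(lines)


print(render_blocks(
    description="A small robot repairs a light bulb at a cluttered workbench.",
    camera="medium close-up, slow push-in",
    lighting="warm key from overhead practical, cool spill from window",
    mood="gentle, whimsical, a touch of suspense",
    actions=["The robot taps the bulb; sparks crackle.",
             "It flinches, dropping the bulb."],
    dialogue=[("Robot", "Almost lost it... but I got it!")],
    background_sound="Rain, ticking clock, soft mechanical hum.",
))
```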
Sound, dialogue, and palette
Sora 2 generates audio together with video. Even for quiet scenes specify at least one rhythmic sound — «distant traffic hiss», «a crisp snap», «faint mechanical hum» — otherwise the model will invent the background on its own. Put dialogue in a separate block with character name and emotion: «Detective (low voice): "You're lying. I can hear it in your silence."».
For the color palette use 3-5 anchors separated by commas: «amber, cream, walnut brown» or «teal and orange». This is critical for cross-clip stability when cutting a series. Describe lighting through sources, not brightness: not «brightly lit» but «soft window light with warm lamp fill, cool rim from hallway». Concrete lens parameters («Anamorphic 2.0x, shallow DOF, volumetric light») work far better than the abstract «cinematic look».
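For a series of clips, one simple way to keep palette and lighting stable is to append exactly the same wording to every prompt. The helper below is a sketch with a hypothetical name; the two shot descriptions are placeholders, while the palette and lighting strings are the examples from the paragraph above.

```python
def with_shared_look(shot_prompts: list[str], palette: list[str], lighting: str) -> list[str]:
    """Append identical Lighting and Palette lines to every prompt in a series."""
    look = f"Lighting: {lighting}. Palette: {', '.join(palette)}."
    return [f"{prompt.rstrip()} {look}" for prompt in shot_prompts]


series = with_shared_look(
    ["A woman in a red coat pours tea at a kitchen table.",
     "She carries the cup to the window and looks out."],
    palette=["amber", "cream", "walnut brown"],
    lighting="soft window light with warm lamp fill, cool rim from hallway",
)
for prompt in series:
    print(prompt)
```

Reusing one string avoids small wording differences between clips, which could otherwise nudge the grade in different directions.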
Common mistakes
1. Duration and resolution written in the prompt text
«Make this an 8-second 1080p video»: the model does not read those parameters from text. Duration (seconds), size, and characters are set outside the prompt, through API parameters or the equivalent controls in the UI. In the prompt they are noise and can conflict with the actual settings. Strip them from the text and set them via the UI or API.
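A quick pre-flight check can catch duration and resolution wording before a prompt is submitted. The patterns below are heuristics chosen for this illustration, not an exhaustive list, and the function name is hypothetical.

```python
import re

# Heuristic patterns for duration and resolution mentions; not exhaustive.
SETTING_PATTERNS = [
    re.compile(r"\b\d+\s*-?\s*(?:seconds?|secs?)\b", re.IGNORECASE),  # "8-second"
    re.compile(r"\b(?:480|720|1080|2160)p\b", re.IGNORECASE),         # "1080p"
    re.compile(r"\b\d{3,4}\s*[x×]\s*\d{3,4}\b", re.IGNORECASE),       # "1920x1080"
    re.compile(r"\b4k\b", re.IGNORECASE),                             # "4K"
]


def find_setting_mentions(prompt: str) -> list[str]:
    """Return every fragment that looks like a duration or resolution setting."""
    found = []
    for pattern in SETTING_PATTERNS:
        found += [match.group(0) for match in pattern.finditer(prompt)]
    return found


print(find_setting_mentions("Make this an 8-second 1080p video"))
# ['8-second', '1080p']
print(find_setting_mentions("Neo-noir style, shot on 35mm film, Tokyo street at 2am"))
# []
```

The patterns deliberately leave out bare suffixes such as "s", so that stylistic phrases like «90s documentary» or «35mm» are not flagged.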
2. Vague motion instead of a concrete verb
«Person moves quickly» — the model does not know how the subject moves. Use concrete verbs with timing: «sprinting», «tiptoeing», «gliding», «pedals three times, brakes, stops». The more precise the verb, the less the model has to invent and the more stable the result is across generations.
3. Several scenes packed into one prompt
One prompt equals one shot. If you describe «she leaves the cafe, walks to the car, drives away», the model tries to fit three actions into a single clip and the result degrades into morphing artifacts. Break the story into a series of 4-8 second clips and join them via extension or a cut in post. You get both stability and control.
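The cafe example can be split into one request per shot. The sketch below only builds plain dictionaries; the keys `prompt`, `seconds`, and `size` follow the article's wording and are assumptions about the request shape, the `"1280x720"` string format is assumed, and the beat descriptions are written for illustration.

```python
def plan_series(style: str, beats: list[str], seconds: int = 4,
                size: str = "1280x720") -> list[dict]:
    """Turn story beats into one request per shot, repeating the style line."""
    return [
        {"prompt": f"{style} {beat}", "seconds": seconds, "size": size}
        for beat in beats
    ]


beats = [
    "A woman in a red coat pushes open the cafe door and steps onto the sidewalk. "
    "Camera: medium shot, static.",
    "She walks to a parked car and unlocks the driver's door. "
    "Camera: wide shot, slow tracking from the side.",
    "The car pulls away from the curb and turns the corner. "
    "Camera: wide shot, static.",
]

for request in plan_series("Handheld smartphone clip.", beats):
    print(request["seconds"], request["prompt"])
```

Each dictionary carries a single action and its own camera line, while duration and size stay outside the prompt text.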
4. Dialogue in quotes without a named speaker
«She says "hello there"» works worse than a dialogue block with an explicit speaker and emotion: «Woman (warmly): "Hello there."». For multiple characters, state clearly who is speaking. With two characters, use the Characters API so they do not drift visually between generations.
5. Abstract «cinematic look» instead of parameters
The word «cinematic» alone gives the model no direction, so it falls back on a generic average of everything that carries that label. Replace it with specifics: «Anamorphic 2.0x lens, shallow DOF, volumetric light», «shot on Kodak Vision3 500T», «warm Kodak grade with subtle halation». Concrete lens and film parameters are the strongest stylistic lever.
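Several of the mistakes above come down to a handful of vague phrases, so a small linter can flag them before generation. This is a sketch: the phrase list is taken from the article's own examples and is far from complete, and `lint_prompt` is a hypothetical name.

```python
import re

# Vague phrase -> suggestion, both drawn from the article's examples.
VAGUE_TERMS = {
    "cinematic look": "name lens and film stock, e.g. 'Anamorphic 2.0x lens, shallow DOF'",
    "moves quickly": "use a concrete verb, e.g. 'sprinting' or 'pedals three times, brakes'",
    "brightly lit": "describe light sources, e.g. 'soft window light with warm lamp fill'",
    "a person": "name a concrete subject, e.g. 'a woman in a red coat'",
}


def lint_prompt(prompt: str) -> list[str]:
    """Return one warning per vague phrase found (case-insensitive, whole words)."""
    warnings = []
    for phrase, hint in VAGUE_TERMS.items():
        if re.search(rf"\b{re.escape(phrase)}\b", prompt, re.IGNORECASE):
            warnings.append(f"'{phrase}': {hint}")
    return warnings


for warning in lint_prompt("a person walking down a street at night"):
    print(warning)
```

A check like this only catches known phrases; it does not judge whether the replacement wording is specific enough.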
Before / after examples
Example 1
Before
a person walking down a street at night
After
Cinematic neo-noir style, shot on 35mm film with subtle halation and natural grain. Wide-angle shot slowly pushing forward down a rain-soaked Tokyo street at 2am, neon signs reflecting in puddles. A woman in a black trench coat walks past a ramen shop, hands in pockets, breath visible in the cold air. Camera: wide shot at eye level, slow dolly-in. Lighting: cyan key from neon, warm spill from shop windows. Palette: teal, magenta, amber. Mood: lonely, tense. Background Sound: distant traffic hiss, rain on pavement, faint izakaya chatter.
Style first, concrete subject with wardrobe, action with verbs, explicit light sources and palette, a rhythmic sound bed.
Example 2
Before
an old man tells a story
After
In a 90s documentary-style interview, an elderly Swedish fisherman sits in a dim study lined with maritime maps. He wears a wool sweater and has a weathered face with deep wrinkles.

Cinematography:
Camera: medium close-up, static on tripod with slight handheld micro-shake
Lighting: soft window light from screen-left, warm practical lamp fill
Mood: nostalgic, intimate

Actions:
- He looks down at his hands, then up to camera.
- A faint smile crosses his face.

Dialogue:
- Fisherman (quietly): "I still remember when I was young."

Background Sound:
distant foghorn, ticking wall clock, faint creak of the chair.
Block structure (Cinematography, Actions, Dialogue, Background Sound): the model reads this as a shot breakdown, and the dialogue sits in its own block with a named speaker and emotion rather than as an inline quote.
Example 3
Before
a product video of headphones
After
Commercial photography style, clean studio aesthetic with soft shadows. Smooth 360-degree rotating shot of matte-black wireless headphones on a white marble pedestal against a seamless white cyclorama. Subtle reflection on the pedestal surface. Camera: medium close-up, slow orbit at eye level, shallow depth of field. Lighting: large softbox key from above, gentle rim light from behind, gradient fill. Palette: white, charcoal, brushed metal accents. Mood: premium, minimal, confident. Background Sound: a single subtle electronic chime at the start, then ambient room tone.
A product shot does not need drama but needs specifics: material type, exact camera motion, lighting setup, minimal but intentional sound.