Sora 2: how to write prompts the model actually understands

OpenAI · Updated:

Sora 2 is OpenAI's video model with native audio, support for up to two characters via the Characters API, and clips of 4-20 seconds. The prompt works like a brief for a director of photography: style first, then subject, action, camera, and sound. Duration and resolution are set only through API parameters, never in the text.

What Sora 2 does

Sora 2 generates clips of 4, 8, 12, 16, or 20 seconds at 720×1280 or 1280×720 (the Pro tier adds 1024×1792, 1792×1024, 1080×1920, 1920×1080). The model produces native audio — dialogue, ambience, SFX, and music. Through the Characters API you can upload a short character video (MP4, 2-4 seconds, 720p-1080p) and reuse it across clips; up to two characters are supported at once.

A distinctive feature is video extension: a clip can be extended up to 6 times, to a total of 120 seconds. On extension, the model uses the full original clip as context, not just the last frame, which keeps motion stable across joins. When iterating, it is often easier to assemble a scene from two stitched four-second clips than to generate one longer take, because the model follows instructions more reliably in shorter takes.
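The extension limits can be checked before a job is queued. The following is a minimal Python sketch: the function name and the idea of passing the planned extension lengths as a list are assumptions made for illustration, and only the two limits (at most 6 extensions, at most 120 seconds in total) come from the text. It does not call any API.

```python
MAX_EXTENSIONS = 6        # limit stated in the text
MAX_TOTAL_SECONDS = 120   # limit stated in the text


def check_extension_plan(base_seconds: int, extension_seconds: list[int]) -> int:
    """Return the total planned length in seconds, or raise ValueError.

    Hypothetical helper: it only enforces the two limits quoted above.
    """
    if len(extension_seconds) > MAX_EXTENSIONS:
        raise ValueError(
            f"{len(extension_seconds)} extensions planned, limit is {MAX_EXTENSIONS}"
        )
    total = base_seconds + sum(extension_seconds)
    if total > MAX_TOTAL_SECONDS:
        raise ValueError(f"{total}s planned, limit is {MAX_TOTAL_SECONDS}s")
    return total


total_seconds = check_extension_plan(8, [8, 8, 4])
print(total_seconds)
```

A plan that fails either check is better split into separate clips and joined in post-production.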

  • Clips of 4-20 seconds, native audio and dialogue
  • Up to 2 characters via the Characters API (MP4 reference)
  • Extension up to 120 seconds with the full clip in context
  • Image-to-video: input photo anchors the first frame
  • Video Edit for surgical changes to an existing clip
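Because duration and resolution live in API parameters, it can help to validate them separately from the prompt text. Below is a minimal Python sketch that builds a request payload as a plain dictionary. The allowed values are taken from the text; the model identifiers «sora-2» and «sora-2-pro», the «1280x720» string format, the value types, and the payload shape are assumptions for illustration, so check them against the current API reference before use.

```python
ALLOWED_SECONDS = {4, 8, 12, 16, 20}
BASE_SIZES = {"720x1280", "1280x720"}
PRO_EXTRA_SIZES = {"1024x1792", "1792x1024", "1080x1920", "1920x1080"}


def build_request(prompt: str, seconds: int, size: str, pro: bool = False) -> dict:
    """Validate duration and size, then return a payload dictionary.

    Hypothetical helper: field names and model identifiers are assumptions.
    """
    if seconds not in ALLOWED_SECONDS:
        raise ValueError(f"seconds must be one of {sorted(ALLOWED_SECONDS)}")
    allowed_sizes = (BASE_SIZES | PRO_EXTRA_SIZES) if pro else BASE_SIZES
    if size not in allowed_sizes:
        raise ValueError(f"size {size!r} is not available on this tier")
    return {
        "model": "sora-2-pro" if pro else "sora-2",  # assumed identifiers
        "prompt": prompt,        # text only: no duration or resolution here
        "seconds": seconds,
        "size": size,
    }


request = build_request("Neo-noir street scene.", seconds=8, size="1280x720")
print(request)
```

Keeping the parameters out of the prompt string means the text and the settings cannot contradict each other.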

Prompt structure

Optimal order: [Style/Aesthetic] + [Subject/Character] + [Scene/Environment] + [Action/Motion] + [Camera: shot + movement] + [Lighting/Color] + [Mood] + [Sound/Dialogue].

The main rule — style goes first. It is the most powerful control lever: the same details look radically different under «Hollywood drama», «handheld smartphone clip», or «grainy vintage commercial». Then comes a concrete subject (not «a person» but «a woman in a red coat»), concrete action with verbs («pedals three times, brakes, stops at the crosswalk» instead of «moves quickly»), and always a shot size plus camera movement.

A prompt describes one shot, not the whole story. Build long scenes from a series of short clips via extension or post-production cut.
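The ordering above can be enforced mechanically. Here is a minimal Python sketch that assembles a single-shot prompt from named parts in the recommended order; the key names and the function are hypothetical and exist only for this illustration.

```python
# Recommended order from the text; the key names themselves are assumptions.
PROMPT_ORDER = [
    "style", "subject", "scene", "action",
    "camera", "lighting", "mood", "sound",
]


def assemble_prompt(parts: dict[str, str]) -> str:
    """Join the given parts in the recommended order, one sentence each."""
    unknown = set(parts) - set(PROMPT_ORDER)
    if unknown:
        raise ValueError(f"unknown prompt parts: {sorted(unknown)}")
    if not parts.get("style", "").strip():
        raise ValueError("style is required and always goes first")
    sentences = []
    for key in PROMPT_ORDER:
        text = parts.get(key, "").strip()
        if text:
            sentences.append(text if text.endswith(".") else text + ".")
    return " ".join(sentences)


prompt = assemble_prompt({
    "camera": "Medium shot, slow dolly-in",
    "action": "She pedals three times, brakes, stops at the crosswalk",
    "style": "Grainy vintage commercial",
    "subject": "A woman in a red coat on a bicycle",
})
print(prompt)
```

The parts can be supplied in any order; the function always emits style first and sound last.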

Template with Cinematography and Actions blocks

The official Sora 2 template breaks the prompt into blocks. At the top — a prose description of the scene, characters, wardrobe, set. Then:

Cinematography:
Camera: medium close-up, slow push-in
Lighting: warm key from overhead practical, cool spill from window
Mood: gentle, whimsical, a touch of suspense

Actions:
- The robot taps the bulb; sparks crackle.
- It flinches, dropping the bulb.
- A puff of steam escapes its chest.

Dialogue:
- Robot: "Almost lost it... but I got it!"

Background Sound: Rain, ticking clock, soft mechanical hum.

The model reads this structure as a shot breakdown. For longer clips add a second-by-second layout: «0.00-2.40 — Arrival Drift (32mm, slow dolly left)» — the model anchors actions to timecodes.
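The block template lends itself to a small data structure that renders the same layout every time. The sketch below is one possible way to do this in Python; the «Shot» class and its field names are hypothetical, and the opening scene description is invented for the example because the text does not quote one.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    """One shot in the block layout described above (hypothetical helper)."""
    description: str
    camera: str
    lighting: str
    mood: str
    actions: list[str] = field(default_factory=list)
    dialogue: list[tuple[str, str]] = field(default_factory=list)  # (speaker, line)
    background_sound: str = ""

    def render(self) -> str:
        lines = [
            self.description,
            "",
            "Cinematography:",
            f"Camera: {self.camera}",
            f"Lighting: {self.lighting}",
            f"Mood: {self.mood}",
        ]
        if self.actions:
            lines += ["", "Actions:"] + [f"- {a}" for a in self.actions]
        if self.dialogue:
            lines += ["", "Dialogue:"] + [
                f'- {speaker}: "{text}"' for speaker, text in self.dialogue
            ]
        if self.background_sound:
            lines += ["", f"Background Sound: {self.background_sound}"]
        return "\n".join(lines)


shot = Shot(
    description="A small robot repairs a light bulb at a cluttered workbench.",
    camera="medium close-up, slow push-in",
    lighting="warm key from overhead practical, cool spill from window",
    mood="gentle, whimsical, a touch of suspense",
    actions=["The robot taps the bulb; sparks crackle.",
             "It flinches, dropping the bulb."],
    dialogue=[("Robot", "Almost lost it... but I got it!")],
    background_sound="Rain, ticking clock, soft mechanical hum.",
)
text = shot.render()
print(text)
```

Rendering from structured fields keeps the block order identical across a series of clips, which makes differences between generations easier to attribute to the content rather than the layout.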

Sound, dialogue, and palette

Sora 2 generates audio together with video. Even for quiet scenes specify at least one rhythmic sound — «distant traffic hiss», «a crisp snap», «faint mechanical hum» — otherwise the model will invent the background on its own. Put dialogue in a separate block with character name and emotion: «Detective (low voice): "You're lying. I can hear it in your silence."».

For the color palette use 3-5 anchors separated by commas: «amber, cream, walnut brown» or «teal and orange». This is critical for cross-clip stability when cutting a series. Describe lighting through sources, not brightness: not «brightly lit» but «soft window light with warm lamp fill, cool rim from hallway». Concrete lens parameters («Anamorphic 2.0x, shallow DOF, volumetric light») work far better than the abstract «cinematic look».
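Dialogue lines and palette anchors follow fixed patterns, so they are easy to format with helpers. This is a minimal Python sketch; both function names are hypothetical, and the «Palette:» label follows the usage in the examples later in this article. The 3-5 bound is enforced exactly as stated above, so a two-color palette is rejected by this sketch.

```python
def dialogue_line(speaker: str, emotion: str, text: str) -> str:
    """Format one dialogue entry as: Speaker (emotion): "text"."""
    return f'{speaker} ({emotion}): "{text}"'


def palette_line(anchors: list[str]) -> str:
    """Format 3-5 color anchors as a comma-separated palette line."""
    if not 3 <= len(anchors) <= 5:
        raise ValueError("use 3-5 color anchors")
    return "Palette: " + ", ".join(anchors) + "."


line = dialogue_line("Detective", "low voice",
                     "You're lying. I can hear it in your silence.")
palette = palette_line(["amber", "cream", "walnut brown"])
print(line)
print(palette)
```

Reusing the same palette line verbatim in every clip of a series is one simple way to keep the color anchors stable across cuts.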

Common mistakes

  1. Duration and resolution written in the prompt text

    «Make this an 8-second 1080p video» — the model does not read those parameters from text. Duration (seconds), size, and characters are set through API parameters only. In the prompt they become noise and can conflict with the actual settings. Strip them from the text and set them via the UI or API.

  2. Vague motion instead of a concrete verb

    «Person moves quickly» — the model does not know how the subject moves. Use concrete verbs with timing: «sprinting», «tiptoeing», «gliding», «pedals three times, brakes, stops». The more precise the verb, the less the model has to invent and the more stable the result is across generations.

  3. Several scenes packed into one prompt

    One prompt equals one shot. If you describe «she leaves the cafe, walks to the car, drives away» the model tries to fit three actions into a single clip and slides into morphing. Break the story into a series of 4-8 second clips and join them via extension or post cut. You get both stability and control.

  4. Dialogue in quotes without a named speaker

    «She says "hello there"» works worse than a dialogue block with an explicit speaker and emotion: «Woman (warmly): "Hello there."». For multiple characters, state clearly who is speaking. With two characters, use the Characters API so they do not drift visually between generations.

  5. Abstract «cinematic look» instead of parameters

    The word «cinematic» alone gives the model no direction — it interprets it statistically. Replace it with specifics: «Anamorphic 2.0x lens, shallow DOF, volumetric light», «shot on Kodak Vision3 500T», «warm Kodak grade with subtle halation». Concrete lens and film parameters are the strongest stylistic lever.
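Several of these mistakes can be caught with simple text checks before a prompt is submitted. The following Python sketch is a rough heuristic linter: the check names, regular expressions, and hints are all assumptions made for illustration, not an official rule set, and it makes no attempt to detect several scenes packed into one prompt.

```python
import re

# (code, pattern, hint) triples; every pattern is an illustrative heuristic.
CHECKS = [
    ("duration-in-text",
     re.compile(r"\b\d+(?:\.\d+)?[\s-]?(?:seconds?|secs?)\b", re.IGNORECASE),
     "set duration through the API parameter, not the prompt"),
    ("resolution-in-text",
     re.compile(r"\b(?:\d{3,4}p|\d{3,4}\s*[x×]\s*\d{3,4}|4k)\b", re.IGNORECASE),
     "set resolution through the API parameter, not the prompt"),
    ("vague-motion",
     re.compile(r"\bmov(?:es|ing)\s+(?:quickly|fast|slowly)\b", re.IGNORECASE),
     "replace with a concrete verb and beats"),
    ("unnamed-dialogue",
     re.compile(r"\b(?:says|said|asks|replies)\s+[\"“]", re.IGNORECASE),
     "move the line into a Dialogue block with a named speaker"),
    ("vague-style",
     re.compile(r"\bcinematic\b", re.IGNORECASE),
     "name a lens, film stock, or grade instead"),
]


def lint_prompt(prompt: str) -> list[str]:
    """Return a list of 'code: hint' findings for the given prompt text."""
    findings = [
        f"{code}: {hint}" for code, pattern, hint in CHECKS
        if pattern.search(prompt)
    ]
    if "background sound:" not in prompt.lower():
        findings.append("no-background-sound: add an explicit Background Sound line")
    return findings


findings = lint_prompt("Make this an 8-second 1080p video")
for finding in findings:
    print(finding)
```

The patterns are deliberately narrow: a phrase such as «90s documentary-style» is not treated as a duration, while «8-second» and «1080p» are flagged.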

Before / after examples

Example 1

Before

a person walking down a street at night

After

Neo-noir style, shot on 35mm film with subtle halation and natural grain. Wide-angle shot slowly pushing forward down a rain-soaked Tokyo street at 2am, neon signs reflecting in puddles. A woman in a black trench coat walks past a ramen shop, hands in pockets, breath visible in the cold air. Camera: low angle, slow dolly-in. Lighting: cyan key from neon, warm spill from shop windows. Palette: teal, magenta, amber. Mood: lonely, tense. Background Sound: distant traffic hiss, rain on pavement, faint izakaya chatter.

Style first, concrete subject with wardrobe, action with verbs, explicit light sources and palette, a rhythmic sound bed.

Example 2

Before

an old man tells a story

After

In a 90s documentary-style interview, an elderly Swedish fisherman sits in a dim study lined with maritime maps. He wears a wool sweater and has a weathered face with deep wrinkles.

Cinematography:
Camera: medium close-up, static on tripod with slight handheld micro-shake
Lighting: soft window light from screen-left, warm practical lamp fill
Mood: nostalgic, intimate

Actions:
- He looks down at his hands, then up to camera.
- A faint smile crosses his face.

Dialogue:
- Fisherman (quietly): "I still remember when I was young."

Background Sound: distant foghorn, ticking wall clock, faint creak of the chair.

The block structure (Cinematography, Actions, Dialogue, Background Sound) reads to the model as a shot breakdown, and the dialogue names its speaker before a colon instead of relying on inline quotes alone.

Example 3

Before

a product video of headphones

After

Commercial photography style, clean studio aesthetic with soft shadows. Smooth 360-degree rotating shot of matte-black wireless headphones on a white marble pedestal against a seamless white cyclorama. Subtle reflection on the pedestal surface. Camera: medium close-up, slow orbit at eye level, shallow depth of field. Lighting: large softbox key from above, gentle rim light from behind, gradient fill. Palette: white, charcoal, brushed metal accents. Mood: premium, minimal, confident. Background Sound: a single subtle electronic chime at the start, then ambient room tone.

A product shot does not need drama but needs specifics: material type, exact camera motion, lighting setup, minimal but intentional sound.

Frequently asked

How long is a single Sora 2 clip?
A single clip is 4, 8, 12, 16, or 20 seconds. Duration is set through the API parameter seconds, not in the prompt text. From there you can extend the clip up to 6 times, summing to 120 seconds — during extension the model uses the full original clip as context. For unstable scenes 4-second takes work more reliably than long ones.
How is Sora 2 Pro different from regular Sora 2?
Sora 2 Pro supports more resolutions: in addition to 720×1280 and 1280×720, it adds 1024×1792, 1792×1024, 1080×1920, and 1920×1080. This gives true Full HD in both orientations and vertical formats for Reels, Shorts, and TikTok. Prompting logic is identical; the only difference is the set of output resolutions exposed via the API.
How do I add my own character to a video?
Through the Characters API. Upload a short video (MP4, 2-4 seconds, 720p-1080p, 16:9 or 9:16); the model learns the character and returns an ID. In prompts, reference the name and ID so that the character appears in different scenes with a consistent appearance. The maximum is two characters per generation; beyond that the model breaks down and slides into morphing.
Can I write prompts in languages other than English?
Technically yes, but English gives noticeably more stable results, especially for camera terms, film stocks, and stylistic references. The model handles cinematographic vocabulary («Anamorphic 2.0x», «Kodak Vision3 500T», «slow dolly-in») most reliably in English. Keep the prompt in English; in-clip dialogue can be in any language.
Why does the model add strange audio I never asked for?
If background sound is not described explicitly, Sora 2 will invent it, and often badly: «studio audience laughter» appears in dramatic scenes, or a random saxophone drops in. Always state Background Sound explicitly — even for quiet scenes give one rhythmic anchor: «distant traffic hiss», «ticking wall clock», «soft mechanical hum».
What do I do if results differ every time with the same prompt?
This is by design: even an identical prompt yields varied results across generations in Sora 2. Do not try to hit the right result by re-rolling; refine the prompt instead. Add a specific lens, color anchors, and a second-by-second action breakdown to narrow the interpretation space. For a series of shots with one character, use the Characters API.
Does Opten support Sora 2?
Yes, the Opten extension detects Sora 2 on OpenAI and fal.ai platforms and scores prompts against the structure outlined above: style first, concrete subject and action, mandatory camera, explicit background sound, long stories split into shots. One click gives you a rewrite in the right structure, with duration and resolution stripped from the text.

Opten is a Chrome extension that scores AI prompts for the specific model. Supports 60+ image and video models — Midjourney, GPT Image 2, Kling, Sora, Nano Banana, Flux — and rewrites them in one click inside the Syntx, Higgsfield, and Freepik interfaces. From $2.99/month.

© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672