Sora: how to write prompts the model actually understands

OpenAI · Updated:

Sora is OpenAI's line of video models that produce 4-20 second clips with consistent characters. The prompt works as a brief for a director of photography: style first, then subject, action, camera, lighting, and sound. English gives the most stable results, especially for camera and film-stock vocabulary.

What Sora does

Sora generates clips of 4-20 seconds per run. Base resolution is 720×1280 or 1280×720, and Pro variants add up to 1080×1920 and 1920×1080. Up to two characters are supported via the Characters API: a short reference video (MP4, 2-4 seconds, 720p-1080p) becomes a reusable character with consistent appearance.

A clip can be extended up to 6 times, to a maximum total length of 120 seconds; the model uses the full original clip as context. Image-to-Video accepts a photo or AI-generated art as a visual anchor for the first frame, and the prompt describes what happens next. Video Edit makes surgical changes to an existing clip, such as «same shot, switch to 85mm» or «change the color of the monster to orange».

  • Clips of 4-20 seconds per run
  • Up to 2 consistent characters via the Characters API
  • Extension up to 120 seconds with the full clip in context
  • Image-to-Video: a photo anchors the first frame
  • Video Edit for surgical changes to an existing clip
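The limits listed above can be checked before a request is sent. The following is a minimal sketch, assuming the numbers in this article are accurate. `ClipRequest`, `check_request`, and `extensions_needed` are hypothetical names rather than part of any OpenAI SDK, and the length added per extension is an assumed input because the article does not state it.

```python
import math
from dataclasses import dataclass

# Limits copied from this article, not verified against the live API.
BASE_SIZES = {"720x1280", "1280x720"}
PRO_SIZES = BASE_SIZES | {"1080x1920", "1920x1080"}
MIN_SECONDS, MAX_SECONDS = 4, 20
MAX_CHARACTERS = 2
MAX_EXTENSIONS = 6
MAX_TOTAL_SECONDS = 120


@dataclass
class ClipRequest:
    """Hypothetical settings container; not an OpenAI SDK type."""
    seconds: int
    size: str
    pro: bool = False
    character_ids: tuple = ()


def check_request(req):
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if not MIN_SECONDS <= req.seconds <= MAX_SECONDS:
        problems.append(f"duration {req.seconds}s is outside {MIN_SECONDS}-{MAX_SECONDS}s")
    allowed = PRO_SIZES if req.pro else BASE_SIZES
    if req.size not in allowed:
        problems.append(f"size {req.size} is not available for this tier")
    if len(req.character_ids) > MAX_CHARACTERS:
        problems.append(f"{len(req.character_ids)} characters exceed the limit of {MAX_CHARACTERS}")
    return problems


def extensions_needed(initial_seconds, target_seconds, seconds_per_extension):
    """Count the extensions needed to reach target_seconds.

    seconds_per_extension is an assumption supplied by the caller.
    """
    if seconds_per_extension <= 0:
        raise ValueError("seconds_per_extension must be positive")
    if target_seconds > MAX_TOTAL_SECONDS:
        raise ValueError(f"target exceeds the {MAX_TOTAL_SECONDS}s total limit")
    if target_seconds <= initial_seconds:
        return 0
    count = math.ceil((target_seconds - initial_seconds) / seconds_per_extension)
    if count > MAX_EXTENSIONS:
        raise ValueError(f"{count} extensions exceed the limit of {MAX_EXTENSIONS}")
    return count
```

For example, `check_request(ClipRequest(seconds=8, size="1280x720"))` returns an empty list, while a 1080p size without `pro=True` is reported as a problem.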

Prompt structure

Optimal order: [Style/Aesthetic] + [Subject/Character] + [Scene/Environment] + [Action/Motion] + [Camera: shot + movement] + [Lighting/Color] + [Mood] + [Sound/Dialogue].

Style goes first — it is the most powerful control lever. The same details look radically different under «1970s romantic drama, shot on 35mm film», «16mm black-and-white documentary», or «90s documentary-style interview». Then comes a concrete subject (not «a person» but «a woman in a red coat»), a physical action with verbs and timing, and always a shot size plus camera movement.

One prompt describes one shot, not the whole story. Build long scenes from a series of short clips via extension or post cut.
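The ordering rule can be kept consistent across a series by assembling each prompt from named parts. This is an illustrative sketch: the field order follows the article, while `build_prompt`, the dictionary keys, and the choice to label only the camera, lighting, mood, and sound parts are assumptions modeled on the example prompts in this guide.

```python
# Field order taken from the article; function and key names are illustrative.
FIELD_ORDER = ["style", "subject", "scene", "action", "camera", "lighting", "mood", "sound"]
# Labels mirror the example prompts in this guide.
LABELS = {
    "camera": "Camera",
    "lighting": "Lighting",
    "mood": "Mood",
    "sound": "Background Sound",
}


def build_prompt(fields):
    """Join the supplied parts in the recommended order, one sentence per part."""
    unknown = set(fields) - set(FIELD_ORDER)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    parts = []
    for key in FIELD_ORDER:
        text = fields.get(key, "").strip().rstrip(".")
        if not text:
            continue  # optional parts are simply skipped
        label = LABELS.get(key)
        parts.append(f"{label}: {text}." if label else f"{text}.")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_prompt({
        "sound": "distant traffic hiss",
        "style": "16mm black-and-white documentary",
        "subject": "A woman in a red coat",
        "camera": "wide establishing shot, eye level, slow dolly-in",
    }))
```

Because the order lives in one list, style always comes first regardless of the order in which the parts were written down.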

Camera, lighting, color

Set the camera through both shot size and movement: «Wide establishing shot, eye level» + «slow dolly-in». For shooting style, use «handheld», «Steadicam», «shoulder-mounted», or «static tripod». For angle, use «eye level», «low angle», «aerial», or «Dutch angle». For depth, use «shallow depth of field», «deep focus», or «rack focus».

Describe lighting through sources, not brightness: not «brightly lit» but «soft window light with warm lamp fill, cool rim from hallway». For the palette, use 3-5 color anchors separated by commas, such as «amber, cream, walnut brown», or a named scheme such as «teal and orange». Consistent anchors are critical for cross-clip stability when cutting a series. Concrete lens, film-stock, and lighting terms («Anamorphic 2.0x», «Kodak Vision3 500T», «volumetric light») work far better than the abstract «cinematic look».
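One way to keep lighting and palette identical across a series is to define them once and append them to every shot. The sketch below assumes a strict reading of the 3-5 anchor guideline, so it rejects a two-color scheme; `palette_line` and `with_shared_look` are hypothetical helper names.

```python
def palette_line(anchors):
    """Format 3-5 color anchors as a comma-separated palette sentence."""
    cleaned = [a.strip() for a in anchors if a.strip()]
    if not 3 <= len(cleaned) <= 5:
        raise ValueError("use 3-5 palette anchors")
    return "Palette: " + ", ".join(cleaned) + "."


def with_shared_look(shot_prompts, lighting, anchors):
    """Append the same lighting and palette text to every shot in a series."""
    suffix = f"Lighting: {lighting.strip().rstrip('.')}. {palette_line(anchors)}"
    return [f"{prompt.rstrip()} {suffix}" for prompt in shot_prompts]


if __name__ == "__main__":
    shots = ["A cyclist brakes hard.", "A tram passes the crosswalk."]
    for line in with_shared_look(
        shots, "overcast natural daylight", ["slate grey", "yellow", "asphalt black"]
    ):
        print(line)
```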

Sound and dialogue

Even for quiet scenes specify at least one rhythmic sound — «distant traffic hiss», «a crisp snap», «faint mechanical hum». Otherwise the model invents the background on its own, often badly. Put dialogue in a separate block with character name and emotion:

Dialogue:
- Detective (low voice): "You're lying. I can hear it in your silence."
- Suspect (tired): "Or maybe I'm just tired of talking."

With multiple characters state clearly who is speaking — this matters both for audio and for camera focus. For a series of shots with one character, use the Characters API so appearance does not drift across generations.
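The dialogue block can be generated from structured data so that every line carries a speaker and an emotion. This is a small sketch; `format_dialogue` is a hypothetical helper, and the one-line-per-speaker layout is an assumption based on the example in this section.

```python
def format_dialogue(lines):
    """Build a dialogue block from (speaker, emotion, text) tuples."""
    out = ["Dialogue:"]
    for speaker, emotion, text in lines:
        if not speaker.strip():
            # An unnamed line leaves the model guessing who speaks.
            raise ValueError("every dialogue line needs a named speaker")
        out.append(f'- {speaker.strip()} ({emotion.strip()}): "{text.strip()}"')
    return "\n".join(out)


if __name__ == "__main__":
    print(format_dialogue([
        ("Detective", "low voice", "You're lying. I can hear it in your silence."),
        ("Suspect", "tired", "Or maybe I'm just tired of talking."),
    ]))
```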

Common mistakes

  1. Too short a prompt without details

    «A cat playing with a ball»: the model has to invent everything, including breed, lighting, angle, and background, so the result is unpredictable. Minimum for stability: a concrete subject with details («tabby cat»), action with a verb («batting a red yarn ball»), environment («across hardwood floors»), camera, and light.

  2. Vague lighting

    «Bright» or «dark» does not tell the model WHERE the light comes from. Specify sources and direction: «soft window light from screen-left with warm tungsten fill from above, cool rim from hallway». Even a simple «golden hour, natural sunlight» works better than the abstract «brightly lit».

  3. Several scenes in one prompt

    One prompt equals one shot. A description like «she leaves the cafe, drives to the airport, boards a plane» pushes the model to fit three actions into one clip and it slides into morphing. Break the story into a series of 4-8 second clips and join them via extension or a post cut.

  4. Duration or resolution in the prompt text

    «Make this 1080p and 12 seconds long» — the model does not read those parameters from text. Duration and resolution are set through API parameters or the UI only. In the text they become noise and can conflict with the actual settings. Strip them from the prompt.

  5. Abstract «cinematic look» instead of parameters

    «Cinematic» alone means nothing to the model. Replace it with specifics: «Anamorphic 2.0x lens, shallow DOF, volumetric light», «shot on Kodak Vision3 500T», «warm Kodak grade with halation». Concrete film stock and lens parameters are the strongest stylistic lever in Sora.
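Several of these mistakes can be caught mechanically before a prompt is submitted. The following is a rough heuristic sketch, not a complete checker: `lint_prompt`, the word-count threshold, and the regular expressions are assumptions made for illustration, and the multi-scene mistake is left out because it is hard to detect with patterns.

```python
import re

MIN_WORDS = 15  # arbitrary threshold chosen for this sketch
VAGUE_LIGHT = re.compile(r"\b(?:brightly lit|bright|dark)\b", re.IGNORECASE)
# Matches phrases like "12 seconds", "1080p", "1920x1080"; it will miss other phrasings.
SETTINGS = re.compile(
    r"\b(?:\d{1,3}\s*-?\s*(?:seconds?|secs?)|\d{3,4}p|\d{3,4}\s*[x×]\s*\d{3,4})\b",
    re.IGNORECASE,
)
CINEMATIC = re.compile(r"\bcinematic\b", re.IGNORECASE)
# A few concrete terms that make "cinematic" acceptable alongside them.
CONCRETE = re.compile(
    r"\b(?:anamorphic|kodak|halation|film grain|depth of field|\d+mm)\b",
    re.IGNORECASE,
)


def lint_prompt(prompt):
    """Return warnings for the mistakes that simple patterns can detect."""
    warnings = []
    if len(prompt.split()) < MIN_WORDS:
        warnings.append("too short: add subject details, action, environment, camera, light")
    if VAGUE_LIGHT.search(prompt):
        warnings.append("vague lighting: name the light sources and their direction")
    if SETTINGS.search(prompt):
        warnings.append("duration or resolution in text: set them via API parameters or the UI")
    if CINEMATIC.search(prompt) and not CONCRETE.search(prompt):
        warnings.append("abstract cinematic: name a lens, film stock, or grade instead")
    return warnings


if __name__ == "__main__":
    for warning in lint_prompt("Make this 1080p and 12 seconds long"):
        print(warning)
```

The patterns deliberately ignore decade and lens tokens such as «90s» or «35mm», so style references are not mistaken for duration or resolution.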

Before / after examples

Example 1

Before

a beautiful street at night

After

Cinematic neo-noir style, shot on 35mm film with natural grain and subtle halation. Wide-angle shot slowly tracking forward down a rain-soaked Tokyo street at 2am, wet asphalt, zebra crosswalk, neon signs reflecting in puddles. Camera: low angle, slow dolly-in, shallow depth of field. Lighting: cyan key from neon, warm spill from a ramen shop window, cool rim from the alley. Palette: teal, magenta, amber. Mood: cinematic, lonely, tense. Background Sound: distant traffic hiss, rain on pavement, faint izakaya chatter.

Style first, concrete environmental details, explicit camera motion and lighting setup, palette as anchor, a rhythmic sound bed.

Example 2

Before

person moves quickly

After

Handheld ENG camera style, 16mm documentary look with natural film grain. A cyclist in a yellow rain jacket pedals three times across a wet intersection, brakes hard, and stops just before a zebra crosswalk as a tram passes. Camera: medium shot at eye level, handheld with subtle micro-shake, follows the cyclist in a slow lateral track. Lighting: overcast natural daylight, soft and even, cool color temperature. Palette: slate grey, yellow, asphalt black. Mood: gritty, observational. Background Sound: tram bell, wet tyres on pavement, distant city hum.

Abstract «moves quickly» replaced with concrete action, verbs, and timing — the model knows exactly how the subject moves and where it stops.

Example 3

Before

a product spinning

After

Commercial photography style, clean studio aesthetic. Smooth 360-degree rotating shot of matte-black wireless headphones on a white marble pedestal against a seamless white cyclorama. Camera: medium close-up, slow continuous orbit at eye level, shallow depth of field with smooth bokeh on the backdrop. Lighting: large softbox key from above, gentle rim light from behind, subtle gradient fill from screen-right. Palette: white, charcoal, brushed metal accents. Mood: premium, minimal, confident. Background Sound: a single subtle electronic chime at the start, then ambient room tone.

Product shot: material specifics, exact camera motion (smooth orbit), three-source lighting setup, a minimal sound used as rhythm.

Frequently asked

How is Sora different from Sora 2?
Sora is the umbrella identifier for OpenAI's entire model line; Sora 2 is the current concrete version with native audio, Characters API, and tighter prompt adherence. At the prompting level the approach is the same: style first, concrete subject, physical action, camera, light, sound. Sora 2 is stricter about structure and the Cinematography/Actions/Dialogue block format.
How long can a single clip be?
A single run is 4 to 20 seconds. From there the clip can be extended up to 6 times, summing to 120 seconds. On extension the model uses the full original clip as context, not just the last frame — this gives more stable motion across joins. For unstable scenes 4-second takes work more reliably.
Can I write prompts in languages other than English?
Technically yes, but English gives noticeably more stable results, especially for camera terms («wide establishing shot», «slow dolly-in»), film formats, and stylistic references. The model handles cinematographic vocabulary best in English. Keep the prompt in English; in-clip dialogue can be in any language.
What is the Characters API and why use it?
The Characters API lets you upload a short character video (MP4, 2-4 seconds, 720p-1080p, 16:9 or 9:16) and get back an ID. In prompts you reference the name and ID, and the model reproduces the same character in different scenes with consistent appearance. Maximum two characters per generation; beyond that the model breaks down and slides into morphing.
Why do results differ every run with the same prompt?
This is by design, not a bug: the model samples from a distribution, so an identical prompt yields varied results across runs. Do not rely on re-rolls to land the shot; refine the prompt instead. Add a specific lens, color anchors, and a second-by-second layout to narrow the interpretation. For a series of shots with one character, use the Characters API.
How do I avoid «studio audience laughter» in the background?
It is a typical artifact when background sound is not described explicitly — the model drops «laughter» into any scene with multiple people. The fix: always state Background Sound explicitly, even for quiet scenes. One rhythmic anchor — «distant traffic hiss», «ticking wall clock», «faint mechanical hum» — removes the problem.
Does Opten support Sora?
Yes, the Opten extension detects Sora on OpenAI platforms (ChatGPT, API) and scores prompts against the structure outlined above: style first, concrete subject, physical action, mandatory camera, explicit sound bed. One click gives you a rewrite in the right structure, with duration and resolution stripped from the prompt text.



© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672