Veo 3: how to write prompts the model actually understands

Google · Updated:

Veo 3 is the first Google DeepMind model to generate audio natively together with video: dialogue, background sounds, music, SFX. Clips run about 8 seconds at 720p in 16:9. The prompt must describe the audio layer; otherwise the model invents it, often badly. English prompts give the most stable results.

What is new in Veo 3

The headline change in Veo 3 versus prior versions is native audio generation. Every video ships with sound: character dialogue, ambient background, action-tied SFX, and mood music. This changes prompting: you cannot stay silent about sound — the model will generate it anyway, and often not what you wanted.

Other specs: roughly 8-second duration, base resolution 720p (1280×720), 16:9 aspect ratio, standard frame rate. The prompt limit is around 1500 characters. Consistency is very high: the same prompt yields nearly identical results even across seeds, so for variation you have to change the prompt itself rather than re-roll.

  • Native audio — dialogue, ambience, SFX, music
  • Clips ~8 seconds, 720p, 16:9 format
  • Prompt limit ~1500 characters
  • Very high consistency — change the prompt to get variation
  • Platforms: Google AI Studio, Vertex AI, Replicate, Flow
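Because re-rolling changes little, variation has to come from the prompt text. The sketch below is one possible way to produce variants by swapping the camera and lighting blocks; `prompt_variants` and the sample lens and lighting strings are illustrative assumptions, not part of any Veo tooling.

```python
from itertools import product
from typing import List, Sequence


def prompt_variants(base: str, lenses: Sequence[str], lightings: Sequence[str]) -> List[str]:
    """Return one prompt per (lens, lighting) pair, appended to the same base scene."""
    base = base.rstrip()
    return [
        f"{base} Camera: {lens}. Lighting: {light}."
        for lens, light in product(lenses, lightings)
    ]


variants = prompt_variants(
    "A woman walks through a bustling farmers market.",
    lenses=["35mm lens, medium tracking shot", "85mm lens, shallow depth of field"],
    lightings=["golden hour sunlight", "overcast, soft diffused light"],
)
for text in variants:
    print(text)
```

Each variant differs in wording, so the model has something new to work with; sending the same string four times would not.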

Prompt structure

Optimal order: [Subject + Appearance] + [Context/Scene] + [Action] + [Camera Movement] + [Style/Mood] + [Lighting] + [Dialogue/Audio].

The key point: the audio block is mandatory. Without it the model invents sound at random; the most common artifact is «studio audience laughter» dropped into dramatic scenes with several characters.

Example of a strong prompt: «A man in his 40s with short brown hair, wearing a blue jacket, sits at a podcast desk in a dimly lit studio. He leans into the microphone and says: My name is Ben, and today we're talking about why most startups fail in year two. Camera: medium close-up, static. Lighting: warm key from a desk lamp, cool rim from a monitor. Background Sound: faint room tone, soft electronic hum. (no subtitles!)»
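The block order above can be captured in a small helper. This is a minimal sketch, not an official tool: the `Veo3Prompt` class, its field names, and the `MAX_PROMPT_CHARS` constant are assumptions made for illustration, and the spoken line is kept inside the action sentence, as in the example prompt.

```python
from dataclasses import dataclass

# Assumption: taken from the "around 1500 characters" figure in this guide,
# not from an official constant.
MAX_PROMPT_CHARS = 1500


def _sentence(text: str) -> str:
    """Trim a block and make sure it ends with sentence punctuation."""
    text = text.strip()
    return text if text.endswith((".", "!", "?")) else text + "."


@dataclass
class Veo3Prompt:
    """Hypothetical container for the prompt blocks described above."""
    subject_and_scene: str
    action: str            # may include a spoken line written as "says: ..."
    background_sound: str  # mandatory audio block
    camera: str = ""
    style: str = ""
    lighting: str = ""
    no_subtitles: bool = True

    def render(self) -> str:
        if not self.background_sound.strip():
            raise ValueError("Background Sound is required; otherwise the model invents audio")
        # Order follows the structure recommended in the text.
        labelled = [
            ("", self.subject_and_scene),
            ("", self.action),
            ("Camera: ", self.camera),
            ("Style: ", self.style),
            ("Lighting: ", self.lighting),
            ("Background Sound: ", self.background_sound),
        ]
        text = " ".join(label + _sentence(value) for label, value in labelled if value.strip())
        if self.no_subtitles:
            text += " (no subtitles!)"
        if len(text) > MAX_PROMPT_CHARS:
            raise ValueError(f"prompt is {len(text)} characters; limit is about {MAX_PROMPT_CHARS}")
        return text


prompt = Veo3Prompt(
    subject_and_scene="A man in his 40s with short brown hair, wearing a blue jacket, "
                      "sits at a podcast desk in a dimly lit studio",
    action="He leans into the microphone and says: My name is Ben, and today we're "
           "talking about why most startups fail in year two",
    camera="medium close-up, static",
    lighting="warm key from a desk lamp, cool rim from a monitor",
    background_sound="faint room tone, soft electronic hum",
).render()
print(prompt)
```

Making `background_sound` a required field means a prompt without an audio block fails early instead of reaching the model.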

Dialogue: colon, not quotes

Veo 3 accepts two approaches to dialogue:

Explicit — exact text after a colon: «A guy says: My name is Ben». Use this for precise control over spoken words.

Implicit — a description of what the character is saying: «A guy tells us his name». Use this when the model can invent the line itself.

Critical: write dialogue via colon, not in quotes. `says: My name is Ben` works better than `says "My name is Ben"` — quotes push the model to render embedded subtitles at the bottom of the frame, often with typos. Add `(no subtitles!)` at the end of the prompt as a safeguard. With multiple characters state clearly who is speaking: «The woman in pink says: ... The man with glasses replies: ...».
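The quote-to-colon rule is mechanical enough to automate. The following sketch rewrites quoted lines after a few common speech verbs and appends the safeguard tag; the function name `use_colon_dialogue` and the verb list are assumptions, and the regular expression is a heuristic that will not catch every phrasing.

```python
import re

NO_SUBTITLES_TAG = "(no subtitles!)"

# A speech verb, an optional colon or comma, then a line wrapped in straight,
# curly, or angle quotes. The verb list is illustrative, not exhaustive.
_QUOTED_LINE = re.compile(
    r'\b(says|replies|asks|shouts|whispers)\s*[:,]?\s*["“«]\s*(.+?)\s*["”»]'
)


def use_colon_dialogue(prompt: str) -> str:
    """Rewrite `says "..."` as `says: ...` and append the no-subtitles tag once."""
    rewritten = _QUOTED_LINE.sub(r"\1: \2", prompt)
    if NO_SUBTITLES_TAG not in rewritten:
        rewritten = rewritten.rstrip() + " " + NO_SUBTITLES_TAG
    return rewritten


print(use_colon_dialogue('A guy says "My name is Ben"'))
# A guy says: My name is Ben (no subtitles!)
```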

Background sounds and music

If the prompt has characters but the background is not described, Veo 3 fills the silence on its own, often inappropriately. The classic artifact: «studio audience laughter» in a dramatic scene, a random saxophone in a quiet setting, sitcom-style crowd noise. The fix — always state background sounds explicitly:

  • «sounds of distant bands, noisy crowd, ambient background of a busy festival field»
  • «ambient sounds of rain on windows, distant thunder, soft piano music»
  • «faint room tone, soft electronic hum, ticking wall clock»

For music, specify genre, mood, and style: «a tense cinematic score plays in the background», «a cheerful upbeat pop melody», «a melancholic orchestral score swells». Even a plain «no background music, ambient room tone only» works better than silence.
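One way to make sure no prompt leaves without an audio description is a small guard that adds a default ambience when none is present. This is a sketch under assumptions: `ensure_background_sound` is a hypothetical helper, the check is a plain substring match on «Background Sound», and the default ambience string is borrowed from the examples in this guide.

```python
from typing import Optional

NO_SUBTITLES_TAG = "(no subtitles!)"


def ensure_background_sound(
    prompt: str,
    ambience: str = "faint room tone, ambient hum",
    music: Optional[str] = None,
) -> str:
    """Append a Background Sound block if the prompt does not already have one."""
    if "background sound" in prompt.lower():
        return prompt
    body = prompt.strip()
    # Keep the no-subtitles tag at the very end of the prompt.
    has_tag = body.endswith(NO_SUBTITLES_TAG)
    if has_tag:
        body = body[: -len(NO_SUBTITLES_TAG)].rstrip()
    audio = f"Background Sound: {ambience}."
    if music:
        audio += f" {music} plays in the background."
    else:
        audio += " No background music, ambient sounds only."
    result = f"{body} {audio}"
    if has_tag:
        result += f" {NO_SUBTITLES_TAG}"
    return result


print(ensure_background_sound("Two friends argue in a kitchen."))
print(ensure_background_sound(
    "A detective opens the door.",
    ambience="rain on windows, distant thunder",
    music="A tense cinematic score",
))
```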

Common mistakes

  1. Dialogue in quotes instead of with a colon

    `says "hello"` pushes the model to generate embedded subtitles at the bottom of the frame — often with typos and bad accent rendering. Use the `says: hello` format with a colon and append `(no subtitles!)` at the end. If subtitles still appear, repeat: «No subtitles. No subtitles!» — for reliability.

  2. No background sound described

    If the scene has characters but no background is described, Veo 3 invents the audio randomly. The most common artifact: «studio audience laughter» — sitcom-style crowd noise in any scene with multiple people. The fix — always state Background Sound explicitly, even a simple «faint room tone, ambient hum» removes the problem.

  3. Dialogue too long or too short

    With a 50-word line in an 8-second clip, the model speaks unnaturally fast, swallowing pauses and intonation. With a 1-2 word line, the model fills the remaining time with AI mumbling. Aim for 12-25 words per 8 seconds and leave room for natural pauses and emotional beats.

  4. Re-rolling the same prompt instead of changing it

    Veo 3 is very consistent: an identical prompt yields a nearly identical result even with different seeds. To get variation you have to change the prompt, not re-roll. Use a different lens, change the lighting, or swap the palette; that produces real variation. Re-rolling the same text only wastes generations.

  5. Trying to get vertical format

    Veo 3 natively generates 16:9 only — horizontal format. «Vertical video» or «9:16» in the prompt is ignored. For vertical content use Veo 3.1 (9:16 is native there) or crop in post. Do not specify format in a Veo 3 prompt — it is just noise.
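Most of the mistakes above can be caught before a prompt is sent. The sketch below is a rough pre-flight check, assuming a hypothetical `lint_veo3_prompt` helper: spoken lines are passed in separately to keep word counting simple, the regular expressions are heuristics, and mistake 4 (re-rolling) is not covered because it cannot be detected from a single prompt.

```python
import re
from typing import List, Sequence

MIN_WORDS, MAX_WORDS = 12, 25  # per 8-second clip, as recommended above

_QUOTED_DIALOGUE = re.compile(r'\b(?:says|replies|asks)\s*[:,]?\s*["“«]')
_FORMAT_HINT = re.compile(r"9:16|vertical video", re.IGNORECASE)


def lint_veo3_prompt(prompt: str, spoken_lines: Sequence[str] = ()) -> List[str]:
    """Return human-readable warnings for the mistakes listed above."""
    warnings = []
    if _QUOTED_DIALOGUE.search(prompt):
        warnings.append("dialogue in quotes: use 'says: ...' and append (no subtitles!)")
    if "background sound" not in prompt.lower():
        warnings.append("no Background Sound block: the model will invent audio")
    total_words = sum(len(line.split()) for line in spoken_lines)
    if spoken_lines and not MIN_WORDS <= total_words <= MAX_WORDS:
        warnings.append(
            f"dialogue is {total_words} words: aim for {MIN_WORDS}-{MAX_WORDS} per 8-second clip"
        )
    if _FORMAT_HINT.search(prompt):
        warnings.append("format hint (9:16 / vertical) is ignored by Veo 3")
    return warnings


line = "My name is Ben, and today we're talking about why most startups fail in year two."
good = f"A man at a podcast desk says: {line} Background Sound: faint room tone. (no subtitles!)"
print(lint_veo3_prompt(good, [line]))  # []
print(lint_veo3_prompt('A man says "hi" in a 9:16 vertical video', ["hi"]))
```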

Before / after examples

Example 1

Before

a man talking to camera about his startup

After

A man in his 40s with short brown hair and a closely trimmed beard, wearing a navy blue jacket over a grey t-shirt, sits at a podcast desk in a dimly lit studio. He leans toward the microphone and says: My name is Ben, and today we're talking about why most startups fail in year two. Camera: medium close-up, static, slight handheld micro-shake. Lighting: warm key from a desk lamp on screen-left, cool rim from a monitor behind. Mood: intimate, thoughtful. Background Sound: faint room tone, soft electronic hum from the equipment. (no subtitles!)

Detailed subject for consistency, dialogue via colon (not quotes), explicit background sound, «(no subtitles!)» appended.

Example 2

Before

a woman walking through a market

After

A young woman with long auburn hair tied in a low ponytail, wearing a green linen dress and a straw hat, walks through a bustling outdoor farmers market on a sunny Saturday morning. She picks up an apple, examines it, and smiles. Camera: medium tracking shot following her from the side, slow steadicam motion. Lighting: golden hour natural sunlight, warm tones. Mood: warm, casual, observational. Background Sound: lively crowd chatter, distant vendor calls, faint acoustic guitar playing somewhere nearby, occasional bird song. No background music — just ambient market sounds.

Concrete subject, explicit action described with verbs, camera motion, lighting and color tone, background sound written out with a «no background music» caveat.

Example 3

Before

a selfie video of someone in nature

After

A selfie video of a young man with messy brown hair and a denim jacket, hiking along a misty mountain trail at dawn. He holds the camera at arm's length, arm clearly visible in frame, occasionally looking into the lens with an excited grin. Background: pine trees, low fog, soft mountain silhouettes. Lighting: soft diffused dawn light, cool blue palette with warm spill from his face. Style: slightly grainy, film-like, vlog aesthetic. He says: I can't believe how quiet it is up here. Background Sound: distant bird calls, soft wind through pine needles, the crunch of his footsteps on gravel. (no subtitles!)

Full selfie structure: visible arm, occasional glances into the lens, a line via colon, layered background sound; «slightly grainy» counteracts the overly clean AI look.

Frequently asked

How is Veo 3 different from Veo 2?
The main difference is native audio generation. Veo 2 produced silent video; Veo 3 generates dialogue, ambient sounds, SFX, and music together with the video track. This changes prompting: the audio block is now mandatory, otherwise the model invents the sound and often badly. Base quality and resolution remain at 720p, format stays 16:9.
How do I avoid embedded subtitles in frame?
Three techniques work together. Write dialogue via colon, not in quotes: `says: hello` instead of `says "hello"`. Append `(no subtitles!)` to the end of the prompt. If subtitles still show up, repeat several times: «No subtitles. No subtitles!». Quotes are the main subtitle trigger; the colon is interpreted by the model as a spoken line without visual rendering.
Why does the model add audience laughter I never asked for?
It is a typical Veo 3 artifact when the scene has several characters but background sound is not described explicitly. The model «remembers» that videos with people usually have some background, and falls back on «studio audience laughter» as the most frequent pattern from its training data. The fix: always state Background Sound explicitly; even a single ambient anchor such as «faint room tone» removes the problem.
Can I write dialogue in languages other than English?
Technically yes — Veo 3 will pronounce other-language words, but quality is noticeably lower than in English: pronunciation can warp, intonation feels off, long words are problematic. For production work English dialogue is recommended. If you need another language, use phonetic spelling for tricky words and test on short phrases before long scenes.
What dialogue length is optimal for an 8-second clip?
Roughly 12-25 words. With fewer, the model fills the pauses with AI mumbling. With more, it speaks unnaturally fast and without intonation. Ideal pattern: a short opener, the main idea, a short closer. For example: «So, here's the thing. Most startups fail in year two because they scale too fast. It's not the product, it's the timing.»
How do I get character consistency across scenes?
Veo 3 with an identical prompt produces a nearly identical character — exploit that. Build a detailed description: «John, a man in his 40s with short brown hair and a closely trimmed beard, wearing a navy blue jacket over a grey t-shirt, looking thoughtful». Repeat that description verbatim in every generation. The more unique the detail set, the more stable the consistency.
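Repeating the description verbatim is easiest when it lives in one constant. A minimal sketch, assuming a hypothetical `scene_prompt` helper and made-up scene and audio strings:

```python
# Single source of truth for the character; reused verbatim in every scene.
JOHN = (
    "John, a man in his 40s with short brown hair and a closely trimmed beard, "
    "wearing a navy blue jacket over a grey t-shirt, looking thoughtful,"
)


def scene_prompt(character: str, scene: str, audio: str) -> str:
    """Combine the fixed character description with a scene and its audio block."""
    return f"{character} {scene} Background Sound: {audio}. (no subtitles!)"


scenes = [
    ("sits at a podcast desk in a dimly lit studio.", "faint room tone, soft electronic hum"),
    ("walks down a rainy street at night.", "rain on pavement, distant traffic"),
]
prompts = [scene_prompt(JOHN, scene, audio) for scene, audio in scenes]
for text in prompts:
    print(text)
```

Editing the constant changes the character in every scene at once, which avoids small wording drifts between generations.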
Does Opten support Veo 3?
Yes, the Opten extension detects Veo 3 on Google AI Studio, Vertex AI, Replicate, and Flow and scores prompts against the structure outlined above: it checks for the audio block, the colon dialogue format, the «(no subtitles!)» tag, background-sound description, and reasonable dialogue length. One click gives you a rewrite in the right structure.

Opten is a Chrome extension that scores AI prompts for the specific model. Supports 60+ image and video models — Midjourney, GPT Image 2, Kling, Sora, Nano Banana, Flux — and rewrites them in one click inside the Syntx, Higgsfield, and Freepik interfaces. From $2.99/month.

© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672