Veo 3: how to write prompts the model actually understands
Veo 3 is the first Google DeepMind model to generate audio natively together with video: dialogue, background sounds, music, SFX. Clips run about 8 seconds at 720p in 16:9. The prompt must describe the audio layer; otherwise the model invents one, often badly. English gives the most stable results.
What is new in Veo 3
The headline change in Veo 3 versus prior versions is native audio generation. Every video ships with sound: character dialogue, ambient background, action-tied SFX, and mood music. This changes prompting: you cannot stay silent about sound, because the model will generate it anyway, and often not what you wanted.
Other specs: roughly 8-second duration, base resolution 720p (1280×720), 16:9 format, and a standard frame rate (24 FPS). The prompt limit is around 1500 characters. Consistency is very high: the same prompt yields nearly identical results even across seeds, so to get variation you have to change the prompt itself rather than re-roll.
- Native audio — dialogue, ambience, SFX, music
- Clips ~8 seconds, 720p, 16:9 format
- Prompt limit ~1500 characters
- Very high consistency — change the prompt to get variation
- Platforms: Google AI Studio, Vertex AI, Replicate, Flow
Prompt structure
Optimal order: [Subject + Appearance] + [Context/Scene] + [Action] + [Camera Movement] + [Style/Mood] + [Lighting] + [Dialogue/Audio].
The key trait: the audio block is mandatory. Without it the model invents sound randomly — the most common artifact is «studio audience laughter» that drops into dramatic scenes with several characters.
Example of a strong prompt: «A man in his 40s with short brown hair, wearing a blue jacket, sits at a podcast desk in a dimly lit studio. He leans into the microphone and says: My name is Ben, and today we're talking about why most startups fail in year two. Camera: medium close-up, static. Lighting: warm key from a desk lamp, cool rim from a monitor. Background Sound: faint room tone, soft electronic hum. (no subtitles!)»
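The recommended order can be sketched as a small prompt builder. This is a minimal illustration, not an official tool: the `VeoPrompt` class, its field names, and the `PROMPT_CHAR_LIMIT` constant are hypothetical, and the 1500-character figure is the approximate limit quoted in this guide. It only assembles a string; it does not call any Veo API.

```python
from dataclasses import dataclass

# Approximate limit quoted in this guide; an assumption, not an official constant.
PROMPT_CHAR_LIMIT = 1500


def _sentence(text: str) -> str:
    """Trim a fragment and make sure it ends with sentence punctuation."""
    text = text.strip()
    return text if text.endswith((".", "!", "?")) else text + "."


@dataclass
class VeoPrompt:
    subject: str            # subject + appearance
    scene: str              # context / scene
    action: str             # what the subject does
    camera: str             # camera movement and framing
    mood: str               # style / mood
    lighting: str
    audio: str              # background sound and music; required
    dialogue: str = ""      # optional, e.g. "He says: My name is Ben"
    no_subtitles: bool = True

    def render(self) -> str:
        """Join the blocks in the recommended order and enforce the audio block."""
        if not self.audio.strip():
            raise ValueError("describe the background sound explicitly")
        parts = [
            self.subject,
            self.scene,
            self.action,
            f"Camera: {self.camera}",
            f"Mood: {self.mood}",
            f"Lighting: {self.lighting}",
            self.dialogue,
            f"Background Sound: {self.audio}",
        ]
        text = " ".join(_sentence(p) for p in parts if p.strip())
        if self.no_subtitles:
            text += " (no subtitles!)"
        if len(text) > PROMPT_CHAR_LIMIT:
            raise ValueError(
                f"prompt is {len(text)} characters, limit is about {PROMPT_CHAR_LIMIT}"
            )
        return text
```

Making `audio` a required field mirrors the rule above: a prompt without a sound description fails early instead of letting the model improvise.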
Dialogue: colon, not quotes
Veo 3 accepts two approaches to dialogue:
Explicit: the exact text after a colon, for example «A guy says: My name is Ben». Use this for precise control over the spoken words.
Implicit: a description of what the character is saying, for example «A guy tells us his name». Use this when the exact wording does not matter and the model can improvise the line.
Critical: write dialogue with a colon, not in quotes. `says: My name is Ben` works better than `says "My name is Ben"`, because quotes push the model to render embedded subtitles at the bottom of the frame, often with typos. Add `(no subtitles!)` at the end of the prompt as a safeguard. With multiple characters, state clearly who is speaking: «The woman in pink says: ... The man with glasses replies: ...».
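The colon rule is easy to apply mechanically before sending a prompt. The sketch below is a hypothetical helper, not part of any Veo tooling: the function names and the short list of speech verbs are assumptions, and the regular expression only covers simple one-line cases.

```python
import re

# Matches a speech verb followed by a quoted line, e.g. says "Hello" or replies: «Hi».
# The verb list is illustrative; extend it for your own prompts.
_QUOTED_LINE = re.compile(
    r'(\b(?:says|replies|asks|shouts|whispers))\s*[:,]?\s*["“«]\s*(.+?)\s*["”»]'
)

NO_SUBTITLES_TAG = "(no subtitles!)"


def dialogue_to_colon(prompt: str) -> str:
    """Rewrite `says "text"` as `says: text`."""
    return _QUOTED_LINE.sub(r"\1: \2", prompt)


def ensure_no_subtitles(prompt: str) -> str:
    """Append the safeguard tag unless the prompt already contains it."""
    if NO_SUBTITLES_TAG in prompt.lower():
        return prompt
    return f"{prompt.rstrip()} {NO_SUBTITLES_TAG}"
```

For example, `dialogue_to_colon('A guy says "My name is Ben" to the camera.')` yields `A guy says: My name is Ben to the camera.`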
Background sounds and music
If the prompt has characters but the background is not described, Veo 3 fills the silence on its own, often inappropriately. Classic artifacts include «studio audience laughter» in a dramatic scene, a random saxophone in a quiet setting, or sitcom-style crowd noise. The fix is to always state background sounds explicitly:
«sounds of distant bands, noisy crowd, ambient background of a busy festival field»
«ambient sounds of rain on windows, distant thunder, soft piano music»
«faint room tone, soft electronic hum, ticking wall clock»
For music, specify genre, mood, and style: «a tense cinematic score plays in the background», «a cheerful upbeat pop melody», «a melancholic orchestral score swells». Even a plain «no background music, ambient room tone only» works better than silence.
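A rough pre-flight check can catch prompts that never mention sound. This is a keyword heuristic under stated assumptions: the marker list and both function names are hypothetical, and a match does not guarantee the audio description is any good.

```python
# Phrases that suggest the prompt already describes its audio layer.
# The list is illustrative, not exhaustive.
AUDIO_MARKERS = (
    "background sound",
    "ambient",
    "room tone",
    "music",
    "score",
    "sounds of",
)


def has_audio_block(prompt: str) -> bool:
    """Return True if the prompt mentions any known audio marker."""
    lowered = prompt.lower()
    return any(marker in lowered for marker in AUDIO_MARKERS)


def with_default_ambience(prompt: str, fallback: str = "faint room tone, ambient hum") -> str:
    """Append a neutral Background Sound block when none is present."""
    if has_audio_block(prompt):
        return prompt
    return f"{prompt.rstrip()} Background Sound: {fallback}."
```

The fallback is deliberately bland; a scene-specific description written by hand will usually serve the clip better.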
Common mistakes
1. Dialogue in quotes instead of with a colon
`says "hello"` pushes the model to generate embedded subtitles at the bottom of the frame, often with typos and bad accent rendering. Use the `says: hello` format with a colon and append `(no subtitles!)` at the end. If subtitles still appear, repeat the instruction for reliability: «No subtitles. No subtitles!»
2. No background sound described
If the scene has characters but no background is described, Veo 3 invents the audio randomly. The most common artifact is «studio audience laughter»: sitcom-style crowd noise in any scene with multiple people. The fix is to always state Background Sound explicitly. Even a simple «faint room tone, ambient hum» removes the problem.
3. Dialogue too long or too short
With a 50-word line in an 8-second clip, the model speaks unnaturally fast, swallowing pauses and intonation. With a 1-2 word line, the model fills the rest of the time with AI mumbling. Aim for 12-25 words per 8 seconds, which is roughly 1.5 to 3 words per second, and leave room for natural pauses and emotional beats.
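The 12-25 word guideline is simple to check before generating. A minimal sketch: the function name is hypothetical, words are counted by whitespace, and scaling the range linearly for other clip lengths is an assumption rather than something the guideline states.

```python
# Guideline from this article: 12-25 spoken words per 8-second clip.
MIN_WORDS_PER_8S = 12
MAX_WORDS_PER_8S = 25


def dialogue_pace(line: str, clip_seconds: float = 8.0) -> str:
    """Classify a spoken line as 'too short', 'ok', or 'too long' for the clip."""
    words = len(line.split())
    # Linear scaling for non-8-second clips is an assumption.
    low = MIN_WORDS_PER_8S * clip_seconds / 8
    high = MAX_WORDS_PER_8S * clip_seconds / 8
    if words < low:
        return "too short"
    if words > high:
        return "too long"
    return "ok"
```

The 16-word line from the podcast example, «My name is Ben, and today we're talking about why most startups fail in year two.», falls inside the range.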
4. Re-rolling the same prompt instead of changing it
Veo 3 is very consistent: an identical prompt yields a nearly identical result even with different seeds. To get variation you have to change the prompt, not re-roll. Add a different lens, change the lighting, or swap the palette; that produces real variation. Re-rolling the same text just wastes generations.
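One way to vary a prompt systematically is to cross a fixed base description with lists of lens and lighting options. The sketch below is illustrative only: the function name and the `Lens:` / `Lighting:` labels are assumptions, and the base prompt should not already contain its own lighting block.

```python
from itertools import product


def prompt_variants(base: str, lenses: list[str], lightings: list[str]) -> list[str]:
    """Return one prompt per (lens, lighting) combination."""
    base = base.rstrip()
    return [
        f"{base} Lens: {lens}. Lighting: {lighting}."
        for lens, lighting in product(lenses, lightings)
    ]


variants = prompt_variants(
    "A woman walks through a farmers market. Background Sound: lively crowd chatter.",
    lenses=["35mm wide", "85mm portrait"],
    lightings=["soft overcast daylight", "warm low sun"],
)
```

Two lenses and two lighting setups give four distinct prompts, each of which should produce a visibly different clip rather than a near-duplicate.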
5. Trying to get vertical format
Veo 3 natively generates 16:9 only — horizontal format. «Vertical video» or «9:16» in the prompt is ignored. For vertical content use Veo 3.1 (9:16 is native there) or crop in post. Do not specify format in a Veo 3 prompt — it is just noise.
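For the crop-in-post route, the crop rectangle follows from simple arithmetic. The helper below is a hypothetical sketch that computes a centered crop box for a target aspect ratio; it rounds dimensions down to even numbers because many video encoders expect them. Pass the resulting box to whatever editing tool you use.

```python
def center_crop_box(src_w: int, src_h: int, ratio_w: int, ratio_h: int) -> tuple[int, int, int, int]:
    """Return (x, y, width, height) of the largest centered crop with the target ratio."""
    if src_w * ratio_h > src_h * ratio_w:
        # Source is wider than the target: keep full height, trim the sides.
        crop_h = src_h
        crop_w = src_h * ratio_w // ratio_h
    else:
        # Source is as narrow as or narrower than the target: keep full width.
        crop_w = src_w
        crop_h = src_w * ratio_h // ratio_w
    # Round down to even dimensions.
    crop_w -= crop_w % 2
    crop_h -= crop_h % 2
    return ((src_w - crop_w) // 2, (src_h - crop_h) // 2, crop_w, crop_h)


# A 1280x720 clip cropped to 9:16 keeps a 404x720 column from the middle of the frame.
vertical_box = center_crop_box(1280, 720, 9, 16)
```

Only about a third of the frame width survives the crop, so frame the subject near the center if you plan to reformat a 16:9 clip this way.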
Before / after examples
Example 1
Before
a man talking to camera about his startup
After
A man in his 40s with short brown hair and a closely trimmed beard, wearing a navy blue jacket over a grey t-shirt, sits at a podcast desk in a dimly lit studio. He leans toward the microphone and says: My name is Ben, and today we're talking about why most startups fail in year two. Camera: medium close-up, static, slight handheld micro-shake. Lighting: warm key from a desk lamp on screen-left, cool rim from a monitor behind. Mood: intimate, thoughtful. Background Sound: faint room tone, soft electronic hum from the equipment. (no subtitles!)
Detailed subject for consistency, dialogue via colon (not quotes), explicit background sound, «(no subtitles!)» appended.
Example 2
Before
a woman walking through a market
After
A young woman with long auburn hair tied in a low ponytail, wearing a green linen dress and a straw hat, walks through a bustling outdoor farmers market on a sunny Saturday morning. She picks up an apple, examines it, and smiles. Camera: medium tracking shot following her from the side, slow steadicam motion. Lighting: golden hour natural sunlight, warm tones. Mood: warm, casual, observational. Background Sound: lively crowd chatter, distant vendor calls, faint acoustic guitar playing somewhere nearby, occasional bird song. No background music — just ambient market sounds.
Concrete subject, explicit action with verbs, camera motion, lighting and color tone, background sound written out with a «no background music» caveat.
Example 3
Before
a selfie video of someone in nature
After
A selfie video of a young man with messy brown hair and a denim jacket, hiking along a misty mountain trail at dawn. He holds the camera at arm's length, arm clearly visible in frame, occasionally looking into the lens with an excited grin. Background: pine trees, low fog, soft mountain silhouettes. Lighting: soft diffused dawn light, cool blue palette with warm spill from his face. Style: slightly grainy, film-like, vlog aesthetic. He says: I can't believe how quiet it is up here. Background Sound: distant bird calls, soft wind through pine needles, the crunch of his footsteps on gravel. (no subtitles!)
Full selfie structure: visible arm, natural eye motion, a line via colon, layered background sound, «slightly grainy» counteracts AI cleanliness.