Veo 3.1: how to write prompts the model actually understands
Veo 3.1 is Google DeepMind's updated video model with stronger prompt adherence, native 1080p, vertical 9:16 format, and image-to-video. It inherits audio generation from Veo 3: dialogue, ambience, SFX, music. The prompt limit grows to about 2000 characters, and clips can run to several minutes in extended mode.
What is new in Veo 3.1
Veo 3.1 adds five concrete upgrades on top of Veo 3. First, improved prompt adherence: the model follows descriptions more precisely and invents less. Second, native vertical 9:16 support for TikTok, Reels, and Shorts, with no cropping in post. Third, image-to-video: the model animates a starting frame, and the prompt describes the motion, not the frame itself.
Fourth, camera presets: built-in, platform-specific movement presets that supplement the text description. Fifth, longer clips than the 8-second Veo 3 ceiling. Base Veo 3.1 outputs up to 1080p; the Fast and Fast Relax variants run at 720p with higher speed and lower cost. The prompt limit also grows to about 2000 characters.
- Up to 1080p (Veo 3.1), 720p (Fast / Fast Relax)
- Native 9:16 — TikTok, Reels, Shorts without cropping
- Image-to-Video: animate a starting frame
- Camera presets + extended duration
- Prompt limit ~2000 characters, audio inherited from Veo 3
Prompt structure
Optimal order: [Subject + Appearance] + [Context/Scene] + [Action] + [Camera Movement/Composition] + [Style/Mood] + [Lighting] + [Dialogue: text] + [Audio/Ambiance] + [(no subtitles!)].
Veo 3.1 fully inherits the Veo 3 audio logic: always include a sound block, introduce dialogue with a colon, and append `(no subtitles!)` at the end. Thanks to stricter prompt adherence, the tag works more reliably than in Veo 3.
For complex scenes use a structured prompt with explicit blocks:
Scene: A busy cafe in Paris, morning light streaming through large windows. Character: A young woman with auburn hair, wearing a cream sweater. Action: She lifts a cup of coffee, takes a sip, looks out the window. Camera: Slow dolly-in from medium shot to close-up. Audio: Ambient cafe sounds, clinking cups, soft jazz piano. Mood: Warm, nostalgic, golden hour tones. (no subtitles!)
The model reads this layout better than a single long paragraph.
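The block layout above can be assembled programmatically. The following is a minimal Python sketch, not an official tool: the `ScenePrompt` class, its field names, and the `PROMPT_LIMIT` constant are hypothetical names chosen for illustration. It joins the non-empty blocks in a fixed order, appends `(no subtitles!)`, and rejects prompts longer than the approximate 2000-character limit quoted in this guide.

```python
from dataclasses import dataclass

PROMPT_LIMIT = 2000  # approximate limit quoted in this guide, not an exact API value
NO_SUBTITLES_TAG = "(no subtitles!)"


@dataclass
class ScenePrompt:
    """Hypothetical container for the labeled blocks of a structured prompt."""
    scene: str
    character: str
    action: str
    camera: str
    audio: str
    lighting: str = ""
    mood: str = ""
    dialogue: str = ""  # e.g. "The woman says: I missed this place."

    def render(self) -> str:
        # Fixed block order; empty optional blocks are skipped.
        blocks = [
            ("Scene", self.scene),
            ("Character", self.character),
            ("Action", self.action),
            ("Camera", self.camera),
            ("Lighting", self.lighting),
            ("Mood", self.mood),
            ("Dialogue", self.dialogue),
            ("Audio", self.audio),
        ]
        parts = [f"{label}: {text.strip()}" for label, text in blocks if text.strip()]
        prompt = " ".join(parts + [NO_SUBTITLES_TAG])
        if len(prompt) > PROMPT_LIMIT:
            raise ValueError(f"prompt is {len(prompt)} characters, limit is about {PROMPT_LIMIT}")
        return prompt


cafe = ScenePrompt(
    scene="A busy cafe in Paris, morning light streaming through large windows.",
    character="A young woman with auburn hair, wearing a cream sweater.",
    action="She lifts a cup of coffee, takes a sip, looks out the window.",
    camera="Slow dolly-in from medium shot to close-up.",
    audio="Ambient cafe sounds, clinking cups, soft jazz piano.",
    mood="Warm, nostalgic, golden hour tones.",
)
print(cafe.render())
```

Keeping the blocks as separate fields makes it easy to change one block at a time without retyping the rest of the prompt.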
Vertical video and image-to-video
For vertical 9:16 the format is chosen in the platform (Google AI Studio, Vertex AI), not in the prompt. Adapt the prompt to a portrait composition: more close-ups, portrait orientation for the subject, minimal wide landscape shots (they get lost in 9:16). Selfie style fits the vertical format particularly well.
For image-to-video the model uses the uploaded image as the first frame, and the prompt describes motion and action, NOT the original frame. Weak: «A woman in a cafe drinking coffee» (that is already shown in the photo). Strong: «The woman slowly lifts the cup to her lips and takes a sip. Camera: slow dolly-in to extreme close-up on her eyes. Background Sound: faint cafe chatter, distant espresso machine.» A description of the initial state is just noise, so focus on motion only.
Dialogue, audio, subtitles
Veo 3.1 fully inherits Veo 3 audio capabilities. Write dialogue with a colon, not in quotes: `says: text` works better than `says "text"`, because quotes trigger embedded subtitles. Append `(no subtitles!)` to the end of the prompt.
Dialogue length must fit the clip duration: roughly 12-25 words per 8-second take. If the line is too long, the model speaks unnaturally fast; if it is too short, the model fills the pauses with AI mumbling. With multiple characters, state clearly who is speaking: «The woman in red says: ... The man with the beard replies: ...».
Write background sounds out explicitly — even a plain «ambient room tone» removes the risk of «studio audience laughter». For music specify genre and mood: «a melancholic orchestral score swells», «upbeat electronic music with a driving beat», «no background music — just ambient room tone». Veo 3.1 follows these instructions more precisely than Veo 3.
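The 12-25 word guideline for an 8-second take is easy to check before submitting a prompt. Below is a small Python sketch under stated assumptions: the function names are hypothetical, each dialogue line is passed in separately, and the spoken text is taken to be everything after the first colon on that line.

```python
MIN_WORDS, MAX_WORDS = 12, 25  # range quoted in this guide for one 8-second take


def spoken_word_count(lines: list[str]) -> int:
    """Count the words after the first colon of each dialogue line."""
    total = 0
    for line in lines:
        # "The woman in red says: Hello there." -> speech is " Hello there."
        _, _, speech = line.partition(":")
        total += len(speech.split())
    return total


def check_dialogue(lines: list[str]) -> str:
    """Return 'too short', 'too long', or 'ok' for a single 8-second take."""
    count = spoken_word_count(lines)
    if count > MAX_WORDS:
        return "too long"
    if count < MIN_WORDS:
        return "too short"
    return "ok"


take = [
    "The woman in red says: I thought you had already left the city for good.",
    "The man with the beard replies: Not without saying goodbye to you first.",
]
print(spoken_word_count(take), check_dialogue(take))
```

A line without a colon contributes zero words, which doubles as a reminder to use the `says: text` form.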
Common mistakes
1. Describing the starting frame in Image-to-Video
In image-to-video mode the image ALREADY locks the first frame. Writing «A woman sitting in a cafe drinking coffee» in the prompt is just empty repetition of what the photo already shows. Describe MOTION only: «She slowly lifts the cup, takes a sip, looks out the window. Camera: slow dolly-in». Focus on dynamics, not statics.
2. Format specified in the prompt text
«Vertical video», «9:16», «1080p» in the prompt text are ignored — these are generation parameters set on the platform or via the API. In the prompt they become noise. If you need vertical, pick it in Google AI Studio / Vertex AI and adapt the composition: «portrait close-up», «subject centered», close shots.
3. Horizontal composition with 9:16 selected
If the vertical format is selected but the prompt still says «wide establishing shot of a city skyline» — the subject will be cropped and the frame loses meaning. For 9:16 adapt the composition: more close-ups, portrait orientation for people, minimal wide landscapes. Selfie style fits the vertical format particularly well.
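Mistakes 2 and 3 can both be caught with a simple text check before generation. The sketch below is illustrative only: `lint_prompt`, its `aspect_ratio` argument, and the `WIDE_TERMS` list are hypothetical and not part of any Google API. The format terms come from the examples above; the wide-shot phrases are an assumed starting list that should be extended for real use.

```python
import re

# Format terms that this guide says belong in generation settings, not in the prompt.
FORMAT_TERMS = re.compile(r"\b(vertical video|9:16|1080p)\b", re.IGNORECASE)

# Assumed examples of wide compositions that get lost in a vertical frame.
WIDE_TERMS = ("wide establishing shot", "wide shot", "wide landscape")


def lint_prompt(prompt: str, aspect_ratio: str) -> list[str]:
    """Return warnings for format terms and, for 9:16, wide compositions."""
    warnings = []
    for match in FORMAT_TERMS.finditer(prompt):
        warnings.append(f"format term in prompt text: {match.group(0)!r}")
    if aspect_ratio == "9:16":
        lowered = prompt.lower()
        for term in WIDE_TERMS:
            if term in lowered:
                warnings.append(f"wide composition in a vertical frame: {term!r}")
    return warnings


for warning in lint_prompt(
    "Vertical video, 9:16, wide establishing shot of a city skyline", "9:16"
):
    print(warning)
```

The aspect ratio is passed as a separate argument on purpose: it mirrors the idea that format is a setting chosen on the platform, while the prompt text carries only composition.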
4. Dialogue in quotes without «no subtitles»
Veo 3.1 inherits Veo 3 subtitle behavior: quotes around dialogue trigger embedded captions at the bottom of the frame, often with typos. Use the `says: text` format with a colon and append `(no subtitles!)`. In Veo 3.1 the tag works more reliably than in Veo 3 thanks to improved prompt adherence.
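The quote-to-colon rewrite can be automated for simple cases. This is a rough Python sketch, assuming dialogue is introduced by the verbs `says` or `replies` followed by a quoted string; the function name and the verb list are hypothetical, and other phrasings would need extra patterns.

```python
import re

NO_SUBTITLES_TAG = "(no subtitles!)"

# Matches: says "text", says, "text", replies “text” (straight or curly double quotes).
QUOTED_DIALOGUE = re.compile(r'\b(says|replies)\s*,?\s*["“]([^"”]+)["”]')


def fix_dialogue(prompt: str) -> str:
    """Rewrite quoted dialogue to the colon form and make sure the tag is appended."""
    fixed = QUOTED_DIALOGUE.sub(lambda m: f"{m.group(1)}: {m.group(2)}", prompt)
    fixed = fixed.rstrip()
    if not fixed.endswith(NO_SUBTITLES_TAG):
        fixed = f"{fixed} {NO_SUBTITLES_TAG}"
    return fixed


print(fix_dialogue('The woman says "I missed this place." Audio: soft jazz.'))
```

Running the function on an already fixed prompt leaves it unchanged, so it is safe to apply as a last step before submission.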
5. Re-rolling an identical prompt
Veo 3.1, like Veo 3, is very consistent — an identical prompt yields a similar result. For variation CHANGE the prompt: swap the lens, change the lighting, alter the color palette, add a character detail. Re-rolling without changes wastes tokens; real variation comes only from edits.
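One way to vary the prompt systematically is to change a single block per variant and keep everything else fixed. The following Python sketch assumes the prompt is stored as a dictionary of labeled blocks; the function names and the sample values are hypothetical.

```python
def variations(base: dict[str, str], changes: dict[str, list[str]]) -> list[dict[str, str]]:
    """Build variants that each differ from the base prompt in exactly one block."""
    variants = []
    for label, options in changes.items():
        for option in options:
            variant = dict(base)  # copy, so the base prompt stays untouched
            variant[label] = option
            variants.append(variant)
    return variants


def render(blocks: dict[str, str]) -> str:
    """Join labeled blocks into one prompt and append the subtitle tag."""
    body = " ".join(f"{label}: {text}" for label, text in blocks.items())
    return f"{body} (no subtitles!)"


base = {
    "Action": "She lifts a cup of coffee and takes a sip.",
    "Camera": "Slow dolly-in from medium shot to close-up.",
    "Lighting": "Warm golden hour light through the windows.",
}
changes = {
    "Camera": ["Static close-up on an 85mm lens."],
    "Lighting": ["Cool overcast daylight.", "Neon sign glow from the street."],
}
for variant in variations(base, changes):
    print(render(variant))
```

Changing one block at a time also makes it clear which edit produced which difference in the output.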
Before / after examples
Example 1
Before
a woman drinking coffee in a cafe
After
Scene: A busy cafe in Paris on a Saturday morning, golden light streaming through large arched windows. Character: A young woman with auburn hair tied in a low ponytail, wearing a cream cable-knit sweater, sits at a small marble table by the window. Action: She lifts a small espresso cup, takes a slow sip, then sets it down and looks out the window with a pensive expression. Camera: Slow dolly-in from medium shot to close-up on her face. Lighting: Warm golden hour light through the windows, soft fill from a nearby lamp. Mood: Warm, nostalgic, contemplative. Audio: Ambient cafe sounds — clinking cups, soft conversation in French, a jazz piano playing quietly in the background. (no subtitles!)
Structured prompt with explicit blocks (Scene, Character, Action, Camera, Lighting, Mood, Audio). Veo 3.1 reads this better than a single long paragraph.
Example 2
Before
vertical video of a person in the city
After
A young man with messy dark hair and a black hoodie, leaning against a graffiti-covered wall in a neon-lit Tokyo alley. He looks down at his phone, smiles, then glances up at the camera. Camera: portrait close-up, slight handheld micro-shake, slow push-in. Lighting: cyan neon key from screen-left, warm spill from a noodle shop sign on screen-right. Style: slightly grainy, film-like, cinematic vlog aesthetic. Mood: cool, urban, intimate. Background Sound: distant traffic hum, faint J-pop playing from a nearby shop, light rain on metal awnings. (no subtitles!)
For 9:16: explicit «portrait close-up», subject positioned for a vertical frame, minimal wide shots. The format itself is set on the platform, not in the prompt.
Example 3
Before
animate this product photo of headphones
After
[Image-to-Video: starting frame is a product shot of matte-black wireless headphones on a white marble pedestal] The headphones begin a slow, smooth 360-degree rotation on the pedestal. Camera: slow continuous orbit around the headphones at eye level, shallow depth of field maintained throughout. Lighting: existing softbox key and rim light from the starting frame, with subtle highlight movement as the headphones rotate. Style: clean commercial photography. Mood: premium, refined. Audio: subtle electronic ambient tone, soft mechanical hum, a gentle chime at the start of rotation. (no subtitles!)
Image-to-Video: the prompt describes MOTION, not the contents of the source photo. Lighting is inherited from the starting frame; the prompt covers only dynamics.