Veo 3.1: how to write prompts the model actually understands

Google

Veo 3.1 is Google DeepMind's updated video model with stronger prompt adherence, native 1080p, vertical 9:16 format, and image-to-video. It inherits audio generation from Veo 3: dialogue, ambience, SFX, music. The prompt limit grows to 2000 characters, and clips can run to several minutes in extended mode.

What is new in Veo 3.1

Veo 3.1 adds five concrete upgrades on top of Veo 3. First, improved prompt adherence: the model follows descriptions more precisely and invents less. Second, native vertical 9:16 support for TikTok, Reels, and Shorts, so there is no more cropping in post. Third, image-to-video: the model animates a starting frame, with the prompt describing motion, not the frame itself.

Fourth, camera presets: built-in movement presets (platform-specific) that supplement the text description. Fifth, longer clips than the 8-second Veo 3 ceiling. Base Veo 3.1 resolution is up to 1080p; the Fast and Fast Relax variants run at 720p with higher speed and lower cost. The prompt limit grows to about 2000 characters.

  • Up to 1080p (Veo 3.1), 720p (Fast / Fast Relax)
  • Native 9:16 — TikTok, Reels, Shorts without cropping
  • Image-to-Video: animate a starting frame
  • Camera presets + extended duration
  • Prompt limit ~2000 characters, audio inherited from Veo 3

Prompt structure

Optimal order: [Subject + Appearance] + [Context/Scene] + [Action] + [Camera Movement/Composition] + [Style/Mood] + [Lighting] + [Dialogue: text] + [Audio/Ambiance] + [(no subtitles!)].

Veo 3.1 fully inherits Veo 3 audio logic: the sound block is mandatory, dialogue is introduced with a colon, and `(no subtitles!)` goes at the very end of the prompt. Thanks to stricter prompt adherence the tag works more reliably than in Veo 3.

For complex scenes use a structured prompt with explicit blocks:

Scene: A busy cafe in Paris, morning light streaming through large windows. Character: A young woman with auburn hair, wearing a cream sweater. Action: She lifts a cup of coffee, takes a sip, looks out the window. Camera: Slow dolly-in from medium shot to close-up. Audio: Ambient cafe sounds, clinking cups, soft jazz piano. Mood: Warm, nostalgic, golden hour tones. (no subtitles!)

The model reads this layout better than a single long paragraph.
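The block layout above can be assembled programmatically. The following is a minimal sketch, not an official tool: the function name, the fixed block order, and the hard failure on a missing audio block or an over-long prompt are assumptions of this example, built on the labels and the ~2000-character limit quoted in this guide.

```python
# Block labels come from the structured example above; the fixed order is an assumption.
BLOCK_ORDER = ["Scene", "Character", "Action", "Camera", "Lighting", "Mood", "Audio"]
NO_SUBTITLES_TAG = "(no subtitles!)"
PROMPT_LIMIT = 2000  # approximate character limit quoted in this guide


def build_structured_prompt(blocks: dict[str, str]) -> str:
    """Join labeled blocks in a fixed order and append the no-subtitles tag."""
    unknown = set(blocks) - set(BLOCK_ORDER)
    if unknown:
        raise ValueError(f"Unknown blocks: {sorted(unknown)}")
    # The guide treats the sound block as mandatory, so refuse to build without it.
    if not blocks.get("Audio", "").strip():
        raise ValueError("Audio block is required")

    lines = [
        f"{label}: {blocks[label].strip()}"
        for label in BLOCK_ORDER
        if blocks.get(label, "").strip()
    ]
    lines.append(NO_SUBTITLES_TAG)
    prompt = "\n".join(lines)

    if len(prompt) > PROMPT_LIMIT:
        raise ValueError(f"Prompt is {len(prompt)} characters, limit is {PROMPT_LIMIT}")
    return prompt


if __name__ == "__main__":
    print(build_structured_prompt({
        "Scene": "A busy cafe in Paris, morning light streaming through large windows.",
        "Action": "She lifts a cup of coffee, takes a sip, looks out the window.",
        "Audio": "Ambient cafe sounds, clinking cups, soft jazz piano.",
    }))
```

Empty or missing optional blocks are simply skipped, so the same helper works for short and detailed scenes.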

Vertical video and image-to-video

For vertical 9:16 the format is chosen in the platform (Google AI Studio, Vertex AI), not in the prompt. Adapt the prompt to a portrait composition: more close-ups, portrait orientation for the subject, minimal wide landscape shots (they get lost in 9:16). Selfie style fits the vertical format particularly well.
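One way to keep the format out of the prompt text is to carry it as a separate setting. The sketch below uses a hypothetical `VideoJob` container and a hypothetical `make_vertical_job` helper; neither is a real Google API object, and the field names only stand in for whatever settings the platform exposes.

```python
from dataclasses import dataclass


@dataclass
class VideoJob:
    """Hypothetical container for one generation request (not a real API type)."""
    prompt: str
    aspect_ratio: str = "16:9"


# Composition hint taken from the advice above; the exact wording is an assumption.
PORTRAIT_HINT = "Camera: portrait close-up, subject centered."


def make_vertical_job(prompt: str) -> VideoJob:
    """Put 9:16 in the settings and only composition hints in the prompt text."""
    text = prompt.strip()
    if "portrait" not in text.lower():
        text = f"{text} {PORTRAIT_HINT}"
    return VideoJob(prompt=text, aspect_ratio="9:16")


if __name__ == "__main__":
    job = make_vertical_job("A young man leans against a graffiti-covered wall.")
    print(job.aspect_ratio)
    print(job.prompt)
```

The aspect ratio never appears in the prompt string; the prompt only gains a composition hint that suits a portrait frame.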

For image-to-video the model uses the uploaded image as the first frame, and the prompt describes motion and action, NOT the original frame. Weak: «A woman in a cafe drinking coffee» (that is already shown in the photo). Strong: «The woman slowly lifts the cup to her lips and takes a sip. Camera: slow dolly-in to extreme close-up on her eyes. Background Sound: faint cafe chatter, distant espresso machine.» A description of the initial state is just noise, so focus on motion only.

Dialogue, audio, subtitles

Veo 3.1 fully inherits Veo 3 audio capabilities. Dialogue via colon, not in quotes: `says: text` works better than `says "text"` — quotes trigger embedded subtitles. Append `(no subtitles!)` to the end of the prompt.
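The quote-to-colon rewrite is mechanical enough to automate. This is a rough sketch under stated assumptions: the list of speech verbs is illustrative, only straight double quotes are handled, and `to_colon_dialogue` is a name invented for this example.

```python
import re

NO_SUBTITLES_TAG = "(no subtitles!)"

# Matches a speech verb followed by straight-quoted text, e.g.: says "Hello".
# The verb list is an assumption; extend it to match your own prompts.
QUOTED_DIALOGUE = re.compile(
    r'\b(says|replies|asks|whispers|shouts)\s*[:,]?\s*"([^"]+)"'
)


def to_colon_dialogue(prompt: str) -> str:
    """Rewrite `says "text"` as `says: text` and make sure the tag closes the prompt."""
    converted = QUOTED_DIALOGUE.sub(
        lambda m: f"{m.group(1)}: {m.group(2).strip()}", prompt
    ).rstrip()
    if not converted.endswith(NO_SUBTITLES_TAG):
        converted = f"{converted} {NO_SUBTITLES_TAG}"
    return converted


if __name__ == "__main__":
    print(to_colon_dialogue('The woman says "We should go." Audio: light rain.'))
```

Running the helper on a prompt that already follows the convention leaves it unchanged, so it is safe to apply repeatedly.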

Dialogue length must fit the clip duration: roughly 12-25 words per 8-second take. If the line is too long, the model speaks unnaturally fast; if it is too short, the model fills the pauses with AI mumbling. With multiple characters, state clearly who is speaking: «The woman in red says: ... The man with the beard replies: ...».
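The word budget can be checked before generating. The sketch below takes the 12-25 words per 8 seconds range from this guide and scales it linearly to other durations; the linear scaling and the function names are assumptions of this example, not documented model behavior.

```python
MIN_WORDS_PER_8S = 12
MAX_WORDS_PER_8S = 25


def dialogue_budget(duration_s: float) -> tuple[int, int]:
    """Scale the 12-25 words per 8 seconds range linearly (an assumption)."""
    scale = duration_s / 8.0
    return round(MIN_WORDS_PER_8S * scale), round(MAX_WORDS_PER_8S * scale)


def check_dialogue(text: str, duration_s: float = 8.0) -> str:
    """Classify a spoken line as 'too short', 'ok', or 'too long' for the take."""
    low, high = dialogue_budget(duration_s)
    words = len(text.split())
    if words > high:
        return "too long"
    if words < low:
        return "too short"
    return "ok"


if __name__ == "__main__":
    line = "I kept the table by the window because the light is better here in the morning."
    print(len(line.split()), check_dialogue(line))
```

Whitespace splitting is a crude word count, but it is close enough to catch lines that are far outside the range.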

Write background sounds out explicitly — even a plain «ambient room tone» removes the risk of «studio audience laughter». For music specify genre and mood: «a melancholic orchestral score swells», «upbeat electronic music with a driving beat», «no background music — just ambient room tone». Veo 3.1 follows these instructions more precisely than Veo 3.

Common mistakes

  1. Describing the starting frame in Image-to-Video

    In image-to-video mode the image ALREADY locks the first frame. Writing «A woman sitting in a cafe drinking coffee» in the prompt is just empty repetition of what the photo already shows. Describe MOTION only: «She slowly lifts the cup, takes a sip, looks out the window. Camera: slow dolly-in». Focus on dynamics, not statics.

  2. Format specified in the prompt text

    «Vertical video», «9:16», «1080p» in the prompt text are ignored — these are generation parameters set on the platform or via the API. In the prompt they become noise. If you need vertical, pick it in Google AI Studio / Vertex AI and adapt the composition: «portrait close-up», «subject centered», close shots.

  3. Horizontal composition with 9:16 selected

    If the vertical format is selected but the prompt still says «wide establishing shot of a city skyline» — the subject will be cropped and the frame loses meaning. For 9:16 adapt the composition: more close-ups, portrait orientation for people, minimal wide landscapes. Selfie style fits the vertical format particularly well.

  4. Dialogue in quotes without «no subtitles»

    Veo 3.1 inherits Veo 3 subtitle behavior: quotes around dialogue trigger embedded captions at the bottom of the frame, often with typos. Use the `says: text` format with a colon and append `(no subtitles!)`. In Veo 3.1 the tag works more reliably than in Veo 3 thanks to improved prompt adherence.

  5. Re-rolling an identical prompt

    Veo 3.1, like Veo 3, is very consistent — an identical prompt yields a similar result. For variation CHANGE the prompt: swap the lens, change the lighting, alter the color palette, add a character detail. Re-rolling without changes wastes tokens; real variation comes only from edits.
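Several of the mistakes above can be caught with a simple text check before a prompt is submitted. This is a heuristic sketch only: the regular expressions, the keyword lists, and the `lint_prompt` name are assumptions of this example, and it cannot detect composition problems or a described starting frame.

```python
import re

NO_SUBTITLES_TAG = "(no subtitles!)"

# Format words that belong in platform settings, not in the prompt text.
FORMAT_TERMS = re.compile(r"\b(vertical video|9:16|16:9|1080p|720p)\b", re.IGNORECASE)
# A speech verb directly followed by a straight quote, e.g.: says "Hello".
QUOTED_SPEECH = re.compile(r'\b(says|replies)\s*"', re.IGNORECASE)
# Rough signal that some sound description is present.
AUDIO_HINT = re.compile(r"\b(audio|background sound|ambient|music)\b", re.IGNORECASE)


def lint_prompt(prompt: str) -> list[str]:
    """Return warnings for mistakes that are detectable from the text alone."""
    warnings = []
    if FORMAT_TERMS.search(prompt):
        warnings.append("format term in prompt text; set it on the platform instead")
    if QUOTED_SPEECH.search(prompt):
        warnings.append("quoted dialogue; use the colon form instead")
    if not prompt.rstrip().endswith(NO_SUBTITLES_TAG):
        warnings.append("missing (no subtitles!) at the end")
    if not AUDIO_HINT.search(prompt):
        warnings.append("no audio description")
    return warnings


if __name__ == "__main__":
    for warning in lint_prompt('vertical video of a man who says "hi"'):
        print("-", warning)
```

An empty list means none of these four checks fired; it does not mean the prompt is good.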

Before / after examples

Example 1

Before

a woman drinking coffee in a cafe

After

Scene: A busy cafe in Paris on a Saturday morning, golden light streaming through large arched windows.
Character: A young woman with auburn hair tied in a low ponytail, wearing a cream cable-knit sweater, sits at a small marble table by the window.
Action: She lifts a small espresso cup, takes a slow sip, then sets it down and looks out the window with a pensive expression.
Camera: Slow dolly-in from medium shot to close-up on her face.
Lighting: Warm golden hour light through the windows, soft fill from a nearby lamp.
Mood: Warm, nostalgic, contemplative.
Audio: Ambient cafe sounds — clinking cups, soft conversation in French, a jazz piano playing quietly in the background.
(no subtitles!)

Structured prompt with explicit blocks (Scene, Character, Action, Camera, Lighting, Mood, Audio). Veo 3.1 reads this better than a single long paragraph.

Example 2

Before

vertical video of a person in the city

After

A young man with messy dark hair and a black hoodie, leaning against a graffiti-covered wall in a neon-lit Tokyo alley. He looks down at his phone, smiles, then glances up at the camera. Camera: portrait close-up, slight handheld micro-shake, slow push-in. Lighting: cyan neon key from screen-left, warm spill from a noodle shop sign on screen-right. Style: slightly grainy, film-like, cinematic vlog aesthetic. Mood: cool, urban, intimate. Background Sound: distant traffic hum, faint J-pop playing from a nearby shop, light rain on metal awnings. (no subtitles!)

For 9:16: explicit «portrait close-up», subject positioned for a vertical frame, minimal wide shots. The format itself is set on the platform, not in the prompt.

Example 3

Before

animate this product photo of headphones

After

[Image-to-Video: starting frame is a product shot of matte-black wireless headphones on a white marble pedestal]

The headphones begin a slow, smooth 360-degree rotation on the pedestal. Camera: slow continuous orbit around the headphones at eye level, shallow depth of field maintained throughout. Lighting: existing softbox key and rim light from the starting frame, with subtle highlight movement as the headphones rotate. Style: clean commercial photography. Mood: premium, refined. Audio: subtle electronic ambient tone, soft mechanical hum, a gentle chime at the start of rotation. (no subtitles!)

Image-to-Video: the prompt describes MOTION, not the contents of the source photo. Lighting is inherited from the starting frame; the prompt covers only dynamics.

Frequently asked

How is Veo 3.1 different from Veo 3?
Five upgrades: improved prompt adherence (less guesswork), native vertical 9:16, image-to-video mode, camera presets, and longer clips. Base resolution rises to 1080p (vs. 720p in Veo 3). Audio capabilities are fully inherited — dialogue, ambience, SFX, music. The prompt limit grows from ~1500 to ~2000 characters.
What is the difference between Veo 3.1, Fast, and Fast Relax?
Veo 3.1 — maximum quality at 1080p, standard speed. Veo 3.1 Fast — 720p, noticeably faster, for iteration and prototyping. Veo 3.1 Fast Relax — 720p, even cheaper, for mass generation and tests. Prompting logic is identical across all three variants: the same structure blocks, the same audio and dialogue techniques.
How do I make a vertical video for TikTok / Reels / Shorts?
The 9:16 format is chosen on the platform (Google AI Studio or Vertex AI), not in the prompt. In the prompt adapt the composition: more close-ups, portrait orientation for the subject, explicit «portrait close-up» or «vertical composition». Minimize wide landscape shots — they get lost in vertical format. Selfie style is especially well suited.
How do I use Image-to-Video mode?
Upload a starting image (product shot, illustration, photo) and describe ONLY motion in the prompt — do not repeat what is already in the source frame. Focus on what moves, where the camera goes, what sounds appear. Lighting carries over from the starting frame. This is ideal for animating product photography and bringing static illustrations to life.
Can I write dialogue in languages other than English?
Technically yes — Veo 3.1 will pronounce other-language words, but quality is noticeably lower than in English: pronunciation and intonation can warp. For production work English is recommended. If you need non-English dialogue, use phonetic spelling for tricky words and test on short phrases before long scenes. Veo 3.1 is slightly more accurate than Veo 3 across non-English languages.
What is the optimal prompt length?
The recommended limit is around 2000 characters. That gives room for detailed descriptions of characters, environment, action, camera, lighting, audio, and style without losing quality. Prompts longer than 2000 characters start to drop details: the model cannot process the entire description in full. For very complex scenes break the prompt into structured blocks (Scene/Character/Action/Camera/Audio).
Does Opten support Veo 3.1?
Yes, the Opten extension detects Veo 3.1, Fast, and Fast Relax on Google AI Studio, Vertex AI, and Flow and scores prompts against the structure outlined above: it checks the audio block, the colon dialogue format, the «(no subtitles!)» tag, composition adapted to the format, and the correct image-to-video shape. One click gives you a rewrite.



© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672