
Veo: how to write prompts the model actually understands

Google · Updated:

Google Veo is DeepMind's line of video models that produce 5-8 second clips at a base resolution of 720p. The prompt works as a brief for a director: subject, context, action, camera, style, and lighting. English gives the most stable results. The family spans several versions: Veo 1 and Veo 2 are video-only, while Veo 3 and later add native audio.

What Veo does

Veo generates 16:9 videos of 5-8 seconds (version-dependent). Base resolution is 720p (1280×720); upscaling to 4K is done with external tools. The recommended prompt limit is around 1500 characters; beyond that the model starts dropping details.

Veo is available on Google AI Studio, Vertex AI, and Flow. It has two main modes: Text-to-Video (generation purely from text) and Image-to-Video (animating a starting frame; availability varies by version and platform). Audio arrives only with Veo 3; Veo 1 and Veo 2 output silent video. Vertical format is not natively supported in the base Veo line; it is available only through post-processing or special variants (Veo 3.1).

  • Clips of 5-8 seconds, 16:9 format, base resolution 720p
  • Prompt limit ~1500 characters
  • Text-to-Video and Image-to-Video modes
  • Audio — only from Veo 3 onward
  • Platforms: Google AI Studio, Vertex AI, Flow
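The limits listed above can be turned into a quick pre-flight check before a prompt is submitted. The following is a minimal sketch: the version keys (`veo-1`, `veo-3.1`, and so on) are hypothetical labels chosen for illustration and are not API model IDs, and the capability values are simply transcribed from this page, so they should be verified against the platform you use.

```python
from dataclasses import dataclass
from typing import List

# Approximate recommendation quoted on this page, not an enforced API cap.
PROMPT_CHAR_BUDGET = 1500


@dataclass(frozen=True)
class VeoCapabilities:
    native_audio: bool
    native_vertical: bool


# Hypothetical version labels; values mirror the claims made in the text.
CAPABILITIES = {
    "veo-1": VeoCapabilities(native_audio=False, native_vertical=False),
    "veo-2": VeoCapabilities(native_audio=False, native_vertical=False),
    "veo-3": VeoCapabilities(native_audio=True, native_vertical=False),
    "veo-3.1": VeoCapabilities(native_audio=True, native_vertical=True),
}


def preflight(version: str, prompt: str,
              wants_audio: bool = False,
              wants_vertical: bool = False) -> List[str]:
    """Return human-readable issues; an empty list means nothing was flagged."""
    caps = CAPABILITIES[version]
    issues = []
    over = len(prompt) - PROMPT_CHAR_BUDGET
    if over > 0:
        issues.append(f"prompt is {over} characters over the ~{PROMPT_CHAR_BUDGET} budget")
    if wants_audio and not caps.native_audio:
        issues.append(f"{version} outputs silent video; add sound in post")
    if wants_vertical and not caps.native_vertical:
        issues.append(f"{version} has no native 9:16; crop in post or use veo-3.1")
    return issues
```

Calling `preflight("veo-2", prompt, wants_vertical=True)` would flag the vertical request, while the same call with `"veo-3.1"` would not.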

Prompt structure

Optimal order: [Subject] + [Context/Scene] + [Action] + [Camera Movement] + [Style/Mood] + [Lighting/Ambiance] + [Audio (where supported)].

Using every element is not mandatory — the mix depends on the type of video. The more concrete the description, the better the result. Write as if you were briefing a director who is seeing the script for the first time.
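One way to keep that order consistent is to assemble the prompt from named fields. The sketch below is illustrative only: `VeoBrief` and `build_prompt` are hypothetical helpers, and the labeled layout («Camera: …», «Lighting: …») is just one reasonable formatting choice. Subject, context, and action are merged into one opening sentence, optional elements are skipped when empty, and audio is included only when the target version supports it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VeoBrief:
    subject: str
    context: Optional[str] = None
    action: Optional[str] = None
    camera: Optional[str] = None
    style: Optional[str] = None
    lighting: Optional[str] = None
    audio: Optional[str] = None  # only meaningful where audio is supported


def build_prompt(brief: VeoBrief, supports_audio: bool = False) -> str:
    """Join the filled-in elements in the recommended order."""
    # Subject, context, and action read best as a single opening sentence.
    scene_parts = (brief.subject, brief.context, brief.action)
    scene = " ".join(p.strip() for p in scene_parts if p and p.strip())
    sections = [scene.rstrip(".") + "."]

    labeled = [
        ("Camera", brief.camera),
        ("Style", brief.style),
        ("Lighting", brief.lighting),
    ]
    if supports_audio:
        labeled.append(("Audio", brief.audio))

    for label, value in labeled:
        if value and value.strip():
            sections.append(f"{label}: {value.strip().rstrip('.')}.")
    return " ".join(sections)


brief = VeoBrief(
    subject="A man in a green trench coat",
    action="picks up a rotary phone",
    camera="slow dolly-in from eye level",
    audio="phone ringing",
)
print(build_prompt(brief))
```

Passing `supports_audio=True` appends the audio line; leaving it off keeps the same brief usable for versions that output silent video.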

Key contrast:

  • Weak: «A man answers a phone».
  • Strong: «A shaky dolly zoom goes from a far away blur to a close-up cinematic shot of a desperate man in a weathered green trench coat as he picks up a rotary phone mounted on a gritty brick wall, bathed in the eerie glow of a green neon sign».

Concrete details of appearance, environment, lighting, and camera motion are the main quality lever.

Camera and motion

Veo understands camera terms well; they are the model's primary control language. In every prompt specify at least one of: shot size, movement, angle, or focus.

  • Shot size: wide shot, medium shot, close-up, extreme close-up, establishing shot
  • Movement: dolly shot, zoom in, zoom out, pan left/right, tracking shot, orbit
  • Angle: eye level, high angle, low angle, worm's eye, top-down, aerial shot
  • Focus: shallow depth of field, rack focus, deep focus
  • Special techniques: dolly zoom, one-take, handheld, steadicam, crane shot

Concrete techniques work better than an abstract «cinematic camera»: «slow dolly-in from eye level» or «shaky handheld tracking shot» gives the model a clear direction.
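The rule of naming at least one shot size, movement, angle, or focus term can be checked mechanically. The sketch below is a plain substring match over the vocabulary listed in this section; the helper names are hypothetical, and a keyword scan like this will miss paraphrases, so treat it as a rough reminder and not as a parser.

```python
from typing import Dict, List

# Vocabulary copied from this section; extend it as needed.
CAMERA_TERMS: Dict[str, List[str]] = {
    "shot size": ["extreme close-up", "close-up", "medium shot",
                  "wide shot", "establishing shot"],
    "movement": ["dolly", "zoom in", "zoom out", "pan left", "pan right",
                 "tracking shot", "orbit"],
    "angle": ["eye level", "high angle", "low angle", "worm's eye",
              "top-down", "aerial shot"],
    "focus": ["shallow depth of field", "rack focus", "deep focus"],
}


def camera_coverage(prompt: str) -> Dict[str, List[str]]:
    """Map each camera category to the listed terms found in the prompt."""
    text = prompt.lower()
    return {
        category: [term for term in terms if term in text]
        for category, terms in CAMERA_TERMS.items()
    }


def has_camera_direction(prompt: str) -> bool:
    """True if at least one category has a matching term."""
    return any(camera_coverage(prompt).values())


print(camera_coverage("Slow dolly-in from eye level, shallow depth of field"))
print(has_camera_direction("A man answers a phone"))
```

A prompt for which `has_camera_direction` returns `False` is a candidate for adding one concrete camera instruction.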

Style, lighting, mood

Stylistic modifiers go in through the «In the style of [style]:» prefix: LEGO, Claymation, Pixar animation, Anime, Graphic novel, 8-bit retro, Stop-motion, Origami, Blueprint, Marble. This gives a radical visual switch while keeping the rest of the parameters intact.
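Because the prefix leaves the rest of the prompt untouched, it is easy to produce several stylistic variants of one scene for comparison. A minimal sketch, with `style_variants` as a hypothetical helper name:

```python
from typing import Dict, Iterable


def style_variants(prompt: str, styles: Iterable[str]) -> Dict[str, str]:
    """Prefix the same prompt with each style, leaving the body unchanged."""
    return {style: f"In the style of {style}: {prompt}" for style in styles}


base = "A fox runs through a snowy forest, tracking shot at eye level"
for style, text in style_variants(base, ["LEGO", "Claymation", "Origami"]).items():
    print(text)
```

Each variant differs only in the prefix, which makes it easier to attribute visual differences to the style and not to a change in the scene description.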

  • Quality: Cinematic, film grain, HDR, 4K, professional
  • Genre: Hollywood blockbuster, indie film, documentary, commercial, music video, vlog
  • Color: warm tones, cool tones, high contrast, desaturated, neon, golden hour
  • Lighting: natural light, rim light, backlight, volumetric, neon glow, silhouette, blue light

For a selfie style: start with «A selfie video of...», state a visible arm («holds the camera at arm's length, arm clearly visible in frame»), and add natural eye movement. This removes the synthetic feel and locks in the POV.

Common mistakes

  1. Too short a prompt without details

    «A beautiful video» or «a cool scene» — the model invents everything and the result becomes unpredictable. Minimum: a concrete subject with appearance details, a physical action with a verb, an environment, and at least one camera direction. Without these four elements Veo collapses into a «generic pretty frame» with no direction.

  2. Abstract phrasing instead of specifics

    «Cinematic look», «beautiful lighting», «high quality» tell the model nothing — these are subjective words. Replace them with specifics: «shallow depth of field», «golden hour sunlight», «35mm film grain», «soft window light with warm tungsten fill». Concrete parameters work, abstract evaluative adjectives do not.

  3. Conflicting camera instructions

    «Zoom in and zoom out», «static shot with tracking», «wide angle close-up» — the model cannot honor a contradiction and either ignores part of the instruction or produces chaotic motion. Pick one camera move per clip. Build complex shot lists from several clips in post.

  4. No action described

    A static scene with no dynamics — Veo will generate a near-frozen video with minimal motion, looking like a GIF. Describe physical action: «picks up the phone, turns around, walks», «leaves blow across the empty street», «steam rises slowly from the coffee cup». Without action the video loses meaning.

  5. Attempting vertical video in base Veo

    Veo 1 and Veo 2 do not natively support vertical format; the output is always 16:9. If you ask for it in the prompt («vertical video», «9:16»), the model ignores the request and returns a horizontal frame. For vertical output use Veo 3.1, where 9:16 is supported natively, or crop in post-processing.
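Several of the mistakes above are mechanical enough to catch with a simple lint pass before submitting a prompt. The sketch below is only an illustration: the phrase lists come from the examples in this section, `MIN_WORDS` is an arbitrary threshold and not a documented Veo limit, and the «no action» mistake is left out because detecting a missing action needs more than keyword matching.

```python
import re
from typing import List

ABSTRACT_PHRASES = ["cinematic look", "beautiful lighting", "high quality",
                    "beautiful video", "cool scene"]
CONFLICTING_PAIRS = [("zoom in", "zoom out"),
                     ("static shot", "tracking"),
                     ("wide angle", "close-up")]
VERTICAL_HINTS = ["vertical video", "9:16"]
MIN_WORDS = 12  # illustrative threshold, chosen for this sketch only


def lint_prompt(prompt: str) -> List[str]:
    """Return warnings for the keyword-detectable mistakes listed above."""
    text = prompt.lower()
    warnings = []

    if len(re.findall(r"\w+", text)) < MIN_WORDS:
        warnings.append("too short: add subject details, action, "
                        "environment, and a camera direction")

    for phrase in ABSTRACT_PHRASES:
        if phrase in text:
            warnings.append(f"abstract phrasing: '{phrase}'")

    for first, second in CONFLICTING_PAIRS:
        if first in text and second in text:
            warnings.append(f"conflicting camera instructions: "
                            f"'{first}' and '{second}'")

    if any(hint in text for hint in VERTICAL_HINTS):
        warnings.append("vertical format requested in prompt text; "
                        "set it through the platform instead")
    return warnings


for warning in lint_prompt("Zoom in and zoom out, vertical video"):
    print(warning)
```

An empty list does not mean the prompt is good, only that none of these specific patterns were found.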

Before / after examples

Example 1

Before

a man answers a phone

After

A shaky dolly zoom goes from a far away blur to a close-up cinematic shot of a desperate man in a weathered green trench coat as he picks up a rotary phone mounted on a gritty brick wall, bathed in the eerie glow of a green neon sign. Camera: handheld with subtle micro-shake, dolly zoom effect. Lighting: green neon key from above, deep shadows in the alley. Mood: tense, noir, claustrophobic.

Concrete character with wardrobe, emotion baked into the description («desperate»), precise camera move (dolly zoom), lighting setup with source and direction, explicit mood.

Example 2

Before

a selfie video of someone in the city

After

A selfie video of a young woman with curly red hair and a black leather jacket walking through Tokyo's Shibuya crossing at night. She holds the camera at arm's length, arm clearly visible in frame, occasionally looking into the lens and smiling. Background: neon signs, crowd of pedestrians, light rain. Lighting: cool neon glow with warm spill from storefronts. Style: slightly grainy, film-like, vlog aesthetic.

Selfie format: explicit visible arm, natural eye motion, concrete background details, environment color characterization. Veo responds well to «slightly grainy, film-like»: it takes the overly clean AI look off the footage.

Example 3

Before

a product video of headphones

After

Commercial product shot. Smooth 360-degree orbit around matte-black wireless headphones on a white marble pedestal against a seamless white background. Camera: slow continuous orbit at eye level, shallow depth of field, medium close-up. Lighting: large softbox key from above-left, gentle rim light from behind, soft gradient fill from the right. Style: clean commercial photography, premium minimalism. Mood: confident, refined.

Concrete camera motion (smooth orbit), material and background, three-point lighting setup with explicit sources, stylistic reference «commercial photography».

Frequently asked

Which Veo versions exist?
Google DeepMind's Veo line spans several generations: Veo 1 and Veo 2 (video without audio, base quality), Veo 3 (native sound, dialogue, ambience), and Veo 3.1 with Fast and Fast Relax variants (better prompt adherence, vertical video, image-to-video). This page is the general overview for the whole line; per-version specifics live on dedicated pages.

How long is a single Veo clip?
A standard Veo clip is 5-8 seconds; exact duration depends on the version and the platform. Veo 3.1 extended mode supports longer videos. Duration is set through the UI or API parameters, not via prompt text. You do not need to write «make it 8 seconds long» in the prompt: it is ignored and only clutters the description.

Can I write prompts in languages other than English?
Technically yes, but English gives noticeably more stable results, especially for camera terminology and stylistic references. The model handles cinematographic vocabulary («dolly zoom», «shallow depth of field», «golden hour») most reliably in English. For production work English is recommended; for experiments other languages also work.

Does Veo support vertical video?
Veo 1 and Veo 2: no, the output is always 16:9 (horizontal). Veo 3 is also predominantly 16:9. Veo 3.1 natively supports the 9:16 vertical format, a new feature in 3.1 that fits TikTok, Reels, and Shorts. For vertical content use Veo 3.1 or crop a 16:9 output in post-processing.

How do I get a cinematic result?
Use specifics instead of abstractions. Not «cinematic», but «35mm film grain, anamorphic lens, shallow DOF, volumetric lighting». Not «beautiful light», but «soft window light from screen-left with warm tungsten fill, cool rim from hallway». Not «moves smoothly», but «slow dolly-in from eye level, then tilts up to follow the subject». Concrete parameters are the main quality lever.

What about moving text and logos in frame?
At the time of writing Veo handles in-frame text imperfectly: letters can warp, especially under motion. If the brief absolutely needs exact text or a logo, add it in post over the generated video. For in-frame text it is better to generate the frame with an image model (GPT Image 2) and animate that result than to ask Veo directly.

Does Opten support Veo?
Yes, the Opten extension detects Veo on Google AI Studio, Vertex AI, and Flow and scores prompts against the structure outlined above: it checks for a concrete subject with details, physical action, camera motion, style, and lighting. One click gives you a rewrite in the right structure, with abstract «cinematic» / «beautiful» stripped out.


© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672