Veo: how to write prompts the model actually understands
Google Veo is DeepMind's line of video models that produce 5-8 second clips at a base resolution of 720p. The prompt works as a brief for a director: subject, context, action, camera, style, and lighting. English gives the most stable results. The family spans several versions: Veo 1 and Veo 2 are video-only, while Veo 3 and later add native audio.
What Veo does
Veo generates 16:9 videos of 5-8 seconds (version-dependent). Base resolution is 720p (1280×720); upscaling to 4K is done with external tools. The recommended prompt limit is around 1500 characters; beyond that the model starts dropping details.
Veo is available on Google AI Studio, Vertex AI, and Flow. It has two main modes: Text-to-Video (generation purely from text) and Image-to-Video (animating a starting frame; availability varies by version and platform). Audio is available only from Veo 3 onward; Veo 1 and Veo 2 output silent video. Vertical format is not natively supported in the base Veo line; it is available only through post-processing or special variants (Veo 3.1).
- Clips of 5-8 seconds, 16:9 format, base resolution 720p
- Prompt limit ~1500 characters
- Text-to-Video and Image-to-Video modes
- Audio — only from Veo 3 onward
- Platforms: Google AI Studio, Vertex AI, Flow
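The character guideline above can be checked before a prompt is submitted. The following is a minimal sketch, not part of any Veo SDK: the constant `RECOMMENDED_PROMPT_CHARS` and the function `check_prompt_length` are hypothetical names, and the 1500 figure is the recommendation from this text rather than a limit enforced by the API.

```python
# Soft limit taken from the guideline above (assumption: a recommendation,
# not a hard limit enforced by any platform).
RECOMMENDED_PROMPT_CHARS = 1500


def check_prompt_length(prompt: str, limit: int = RECOMMENDED_PROMPT_CHARS) -> tuple[bool, int]:
    """Return (within_limit, overflow), where overflow counts characters past the limit."""
    length = len(prompt.strip())
    overflow = max(0, length - limit)
    return overflow == 0, overflow


if __name__ == "__main__":
    ok, overflow = check_prompt_length(
        "A slow dolly-in on a steaming coffee cup at golden hour."
    )
    print("within limit:", ok, "| characters over:", overflow)
```

If the check fails, trimming decorative adjectives first tends to cost less than dropping the subject, action, or camera direction.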
Prompt structure
Optimal order: [Subject] + [Context/Scene] + [Action] + [Camera Movement] + [Style/Mood] + [Lighting/Ambiance] + [Audio (where supported)].
Using every element is not mandatory — the mix depends on the type of video. The more concrete the description, the better the result. Write as if you were briefing a director who is seeing the script for the first time.
Key contrast:
- Weak: «A man answers a phone».
- Strong: «A shaky dolly zoom goes from a far away blur to a close-up cinematic shot of a desperate man in a weathered green trench coat as he picks up a rotary phone mounted on a gritty brick wall, bathed in the eerie glow of a green neon sign».
Concrete details of appearance, environment, lighting, and camera motion are the main quality lever.
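The element order described in this section can be sketched as a small prompt builder. This is an illustrative helper under stated assumptions: `VeoBrief` and `build_prompt` are hypothetical names, the labelled "Camera: / Style: / Lighting: / Audio:" layout is just one reasonable way to serialize the elements, and the audio element is only emitted when the caller says the target version supports it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VeoBrief:
    """Hypothetical container for the prompt elements, in the recommended order."""
    subject: str
    context: Optional[str] = None
    action: Optional[str] = None
    camera: Optional[str] = None
    style: Optional[str] = None
    lighting: Optional[str] = None
    audio: Optional[str] = None  # only meaningful for Veo 3 and later


def build_prompt(brief: VeoBrief, supports_audio: bool = False) -> str:
    """Join the filled-in elements in order, skipping any that are empty."""
    # Subject, context, and action read as one scene description.
    scene_parts = (brief.subject, brief.context, brief.action)
    scene = " ".join(p.strip() for p in scene_parts if p and p.strip())
    sentences = [scene.rstrip(".") + "."]

    labelled = [
        ("Camera", brief.camera),
        ("Style", brief.style),
        ("Lighting", brief.lighting),
    ]
    if supports_audio:
        labelled.append(("Audio", brief.audio))

    for label, value in labelled:
        if value and value.strip():
            sentences.append(f"{label}: {value.strip().rstrip('.')}.")
    return " ".join(sentences)


if __name__ == "__main__":
    brief = VeoBrief(
        subject="A desperate man in a weathered green trench coat",
        action="picks up a rotary phone mounted on a gritty brick wall",
        camera="shaky dolly zoom from a far away blur to a close-up",
        lighting="eerie glow of a green neon sign",
    )
    print(build_prompt(brief))
```

Keeping the elements as separate fields makes it easy to see which ones a given prompt leaves out, which matters because not every element is mandatory.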
Camera and motion
Veo understands camera terms well; they are the model's primary control language. In every prompt specify at least one of: shot size, movement, angle, or focus.
- Shot size: wide shot, medium shot, close-up, extreme close-up, establishing shot
- Movement: dolly shot, zoom in, zoom out, pan left/right, tracking shot, orbit
- Angle: eye level, high angle, low angle, worm's eye, top-down, aerial shot
- Focus: shallow depth of field, rack focus, deep focus
- Special techniques: dolly zoom, one-take, handheld, steadicam, crane shot
Concrete techniques work better than an abstract «cinematic camera»: «slow dolly-in from eye level» or «shaky handheld tracking shot» gives the model a clear direction.
Style, lighting, mood
Apply stylistic modifiers with the «In the style of [style]:» prefix, for example LEGO, Claymation, Pixar animation, Anime, Graphic novel, 8-bit retro, Stop-motion, Origami, Blueprint, or Marble. This switches the visual style radically while keeping the rest of the parameters intact.
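The style prefix lends itself to a one-line helper for producing several stylistic variants of the same scene. This is a minimal sketch; `with_style` is a hypothetical name, and the base prompt is an invented example used only for illustration.

```python
def with_style(prompt: str, style: str) -> str:
    """Prepend the 'In the style of [style]:' prefix and leave the prompt itself unchanged."""
    return f"In the style of {style.strip()}: {prompt.strip()}"


if __name__ == "__main__":
    # Invented example scene; only the prefix changes between variants.
    base = "A fox runs across a snowy field, low-angle tracking shot, soft morning light."
    for style in ("LEGO", "Claymation", "Origami"):
        print(with_style(base, style))
```

Because the scene description stays identical, any difference between the resulting clips can be attributed to the style modifier.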
Other modifiers by category:
- Quality: Cinematic, film grain, HDR, 4K, professional
- Genre: Hollywood blockbuster, indie film, documentary, commercial, music video, vlog
- Color: warm tones, cool tones, high contrast, desaturated, neon, golden hour
- Lighting: natural light, rim light, backlight, volumetric, neon glow, silhouette, blue light
For a selfie style: start with «A selfie video of...», describe a visible arm («holds the camera at arm's length, arm clearly visible in frame»), and add natural eye movement. This removes the synthetic feel and locks in the POV.
Common mistakes
1. Too short a prompt without details
With a prompt like «A beautiful video» or «a cool scene», the model invents everything and the result becomes unpredictable. Minimum: a concrete subject with appearance details, a physical action with a verb, an environment, and at least one camera direction. Without these four elements Veo collapses into a «generic pretty frame» with no direction.
2. Abstract phrasing instead of specifics
«Cinematic look», «beautiful lighting», and «high quality» tell the model nothing because they are subjective words. Replace them with specifics: «shallow depth of field», «golden hour sunlight», «35mm film grain», «soft window light with warm tungsten fill». Concrete parameters work; abstract evaluative adjectives do not.
3. Conflicting camera instructions
«Zoom in and zoom out», «static shot with tracking», «wide angle close-up» — the model cannot honor a contradiction and either ignores part of the instruction or produces chaotic motion. Pick one camera move per clip. Build complex shot lists from several clips in post.
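A simple text check can catch the contradictions listed in this item before a prompt is sent. The sketch below is a heuristic only: `CONFLICTING_PAIRS` and `find_camera_conflicts` are hypothetical names, the pairs are limited to the three examples given above, and plain phrase matching will miss paraphrases.

```python
import re

# Pairs taken from the contradictory examples above; not an exhaustive list.
CONFLICTING_PAIRS = [
    ("zoom in", "zoom out"),
    ("static shot", "tracking"),
    ("wide angle", "close-up"),
]


def find_camera_conflicts(prompt: str) -> list[tuple[str, str]]:
    """Return every conflicting pair whose two terms both appear in the prompt."""
    text = prompt.lower()

    def has(term: str) -> bool:
        # Whole-phrase match so that a term is not found inside a longer word.
        return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

    return [(a, b) for a, b in CONFLICTING_PAIRS if has(a) and has(b)]


if __name__ == "__main__":
    for pair in find_camera_conflicts("Static shot with tracking of a runner at dawn"):
        print("conflicting camera instructions:", pair)
```

When a conflict is reported, the advice from this item applies: keep one camera move per clip and assemble the sequence from several clips.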
4. No action described
For a static scene with no dynamics, Veo generates a near-frozen video with minimal motion that looks like a GIF. Describe physical action: «picks up the phone, turns around, walks», «leaves blow across the empty street», «steam rises slowly from the coffee cup». Without action the video loses meaning.
5. Attempting vertical video in base Veo
Veo 1 and Veo 2 do not natively support vertical format; the output is always 16:9. If you request it in the prompt («vertical video», «9:16»), the model ignores it and returns a horizontal frame. For vertical video, use Veo 3.1, where 9:16 is supported natively, or crop in post-processing.
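The vertical-format rule can be written down as a small planning helper. This is a sketch under the assumptions of this text only: `NATIVE_VERTICAL_VERSIONS`, `plan_aspect_ratio`, and `vertical_crop_width` are hypothetical names, version strings such as "2" and "3.1" are illustrative labels rather than API model identifiers, and only Veo 3.1 is treated as supporting 9:16 natively.

```python
# Assumption from the text above: only Veo 3.1 generates 9:16 natively.
NATIVE_VERTICAL_VERSIONS = {"3.1"}


def plan_aspect_ratio(version: str, want_vertical: bool) -> dict:
    """Return the ratio to generate at and whether a crop in post is needed."""
    if not want_vertical:
        return {"generate": "16:9", "crop_in_post": False}
    if version in NATIVE_VERTICAL_VERSIONS:
        return {"generate": "9:16", "crop_in_post": False}
    # Other versions: generate horizontally and crop afterwards.
    return {"generate": "16:9", "crop_in_post": True}


def vertical_crop_width(frame_height: int) -> int:
    """Width in pixels of a 9:16 window cut from a frame of the given height."""
    return frame_height * 9 // 16


if __name__ == "__main__":
    print(plan_aspect_ratio("2", want_vertical=True))
    print("crop width for a 720p frame:", vertical_crop_width(720))
```

For the 1280×720 base frame, a 9:16 crop keeps a window 405 pixels wide, so the subject should be framed near the center of the horizontal shot if cropping is planned.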
Before / after examples
Example 1
Before
a man answers a phone
After
A shaky dolly zoom goes from a far away blur to a close-up cinematic shot of a desperate man in a weathered green trench coat as he picks up a rotary phone mounted on a gritty brick wall, bathed in the eerie glow of a green neon sign. Camera: handheld with subtle micro-shake, dolly zoom effect. Lighting: green neon key from above, deep shadows in the alley. Mood: tense, noir, claustrophobic.
Concrete character with wardrobe, emotion baked into the description («desperate»), precise camera move (dolly zoom), lighting setup with source and direction, explicit mood.
Example 2
Before
a selfie video of someone in the city
After
A selfie video of a young woman with curly red hair and a black leather jacket walking through Tokyo's Shibuya crossing at night. She holds the camera at arm's length, arm clearly visible in frame, occasionally looking into the lens and smiling. Background: neon signs, crowd of pedestrians, light rain. Lighting: cool neon glow with warm spill from storefronts. Style: slightly grainy, film-like, vlog aesthetic.
Selfie format: explicit visible arm, natural eye motion, concrete background details, color characterization of the environment. Veo responds well to «slightly grainy, film-like»; it removes the overly clean AI look.
Example 3
Before
a product video of headphones
After
Commercial product shot. Smooth 360-degree orbit around matte-black wireless headphones on a white marble pedestal against a seamless white background. Camera: slow continuous orbit at eye level, shallow depth of field, medium close-up. Lighting: large softbox key from above-left, gentle rim light from behind, soft gradient fill from the right. Style: clean commercial photography, premium minimalism. Mood: confident, refined.
Concrete camera motion (smooth orbit), material and background, three-point lighting setup with explicit sources, stylistic reference «commercial photography».