How do you write an image to video prompt?

Start with the source frame: who or what is present, where the subject sits, and what background and light already exist. Then add one action, one camera move, duration, and constraints. A strong prompt for video generation reads like a short director's brief, not a pile of aesthetic tags.

Why does image to video AI distort faces or backgrounds?

Usually the prompt does not say what to preserve. The model treats the still image as editable material and may redraw the face, hands, outfit, or background. Add a preserve block: keep face, proportions, pose, clothing, background, and composition; change only motion and light.

What camera motion works best for short AI video clips?

The safest options are a slow push-in, slight pull-back, smooth pan, or steady hold with movement inside the scene. For a 4-6 second clip, avoid stacking several camera moves. The simpler the camera path, the more stable the image-to-video result.

How much text should appear in screenshots or video frames?

For educational screenshots, keep text large and short, then repeat the meaning in the article body and alt text. In generated video frames, avoid small text because image-to-video models can smear letters between frames, especially when the camera moves.

Guide

Image to video AI: prompt workflow that works

Vlad Voronezhtsev · May 26, 2026 · Updated: May 27, 2026 · 6 min read

Cover image for an image to video AI prompting guide

Image to video AI turns a still frame into a short clip, but the useful result comes from a structured prompt, not from a lucky render. In Kling 3.0, Veo 3.1, and Seedance 2.0, a good image-to-video brief defines the source frame, camera motion, subject action, lighting, pace, and constraints so the model keeps identity, background, and composition stable.

1.
Describe the source frame and the target clip
Image-to-video generation starts with what the still frame already contains and what should change over time. In the first sentence, name the subject, setting, current state, and intended clip type: a product motion shot, an atmospheric establishing shot, a smooth portrait, or a social short. That gives the AI video generator a stable anchor: it knows which elements to preserve and which elements can move.
Before
```
Animate this image and make it cinematic.
```
After
```
Image to video prompt: preserve the source composition. A person stands on a wet mountain road. Create a 5-second clip: jacket moves in the wind, clouds slowly open, camera gently pushes in.
```
2.
Break the prompt into scene, subject, light, and pace
A reliable image-to-video prompt has four plain blocks: scene, subject, light, and pace. Scene defines the environment and mood, subject defines the main object and action, light controls readability, and pace controls how quickly the frame changes. Kling 3.0 exposes this on multi-shot and people shots, Veo 3.1 on image-to-video with native audio, and Seedance 2.0 on reference-heavy scenes with timing control. If one block is missing, the model guesses it, which often creates extra drama, chaotic camera motion, or identity drift.
Before
```
Woman outside, video, beautiful, realistic.
```
After
```
Scene: quiet evening street after rain. Subject: woman in a light coat walks forward and glances aside. Light: soft shop-window reflections on wet asphalt. Pace: slow, no abrupt jumps.
```
3.
Use one camera move
Short clips work best with one camera move: push-in, pull-back, orbit, pan, or a steady hold. A prompt that asks for push-in, orbit, drone rise, and hard zoom at once usually causes shake and frame breaks. State the movement and the constraint together: no shake, no angle change, no sudden cut. This keeps image to video AI closer to the original composition.
Before
```
Camera flies around, pushes in, then rapidly rises overhead.
```
After
```
Camera: slow 10% push-in, eye level, no rotation. Preserve horizon and subject position. No shake, no lens change, no cuts.
```
4.
Check face, hands, background, and pace before render
Before the final render, check four risk areas: face, hands, background, and pace. In a real short fashion-shot test, the first Kling 3.0 render gave the model six fingers on one hand; rerunning with `preserve finger count, keep both hands anatomically correct` fixed the artifact without changing the pose. For people, explicitly preserve facial features, finger count, and body proportions. For objects, lock shape and material. For backgrounds, block new objects from appearing. Keep duration modest: 4-6 seconds is usually safer than a long clip with several events.
Before
```
Make it 12 seconds, the character walks, waves, camera changes angle, background comes alive.
```
After
```
Duration 5 seconds. Preserve face, hands, outfit, and background. Motion only: the subject takes half a step forward, fabric moves slightly. No new people, no hand deformation, no scene change.
```

Image to video AI: prompt workflow that works

Describe the source frame and the target clip

Break the prompt into scene, subject, light, and pace

Use one camera move

Check face, hands, background, and pace before render

FAQ

Related posts

How to write prompts for GPT Image 2: 5 steps from random output to precise result