Midjourney Video: how to write prompts the model actually understands
Midjourney Video is Midjourney's Image-to-Video model for short animations of still images. Pure Text-to-Video isn't supported: a reference image is mandatory. The prompt describes what moves and how the camera moves, not the subject's appearance, which is already set by the image. English is the primary prompt language, and the optimal prompt length is 20-60 words.
What Midjourney Video does
Midjourney Video is Midjourney's new model, specialized in animating a single still image into a short clip. It's a fundamentally different tool from Midjourney V7 or Niji: those models generate a frame from scratch from text, while this one brings an existing frame to life.
Composition, color palette, style, and the subject's appearance are all defined by the reference image. The prompt sets subject motion (head turn, walking, hair blowing), camera motion (push in, pan, orbit, static), and atmosphere (slowly, dramatic, peaceful). Clips are short, the optimal prompt length is 20-60 words, and official documentation is limited.
- Image-to-Video only — a reference frame is required
- Prompt describes motion, not appearance
- Camera: push in, pan, orbit, static, tracking
- Optimal 20-60 words; 1-3 sentences
- Image defines the starting frame and the style
Prompt structure
Optimal formula: [Subject motion/action] + [Camera motion] + [Tempo/Mood].
Example: «The woman slowly turns her head toward the camera, wind gently blowing her hair, slow dolly push in, soft ambient light.» The main rule — don't restate what's already visible in the image. If the photo shows a girl in a red jacket, don't write «girl in red jacket.» Those are empty tokens that can conflict with what the model has already parsed from the reference.
Brevity and a focus on motion deliver better results than long descriptions. One primary action — don't load three simultaneous moves.
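The formula above can be sketched as a small helper. This is only an illustration: `build_motion_prompt` and its parameter names are hypothetical, not part of any Midjourney tooling, and the result is plain text that you would paste into the prompt field yourself.

```python
def build_motion_prompt(subject_motion: str, camera_motion: str = "", mood: str = "") -> str:
    """Join [subject motion] + [camera motion] + [tempo/mood] into one prompt.

    Hypothetical helper for illustration only; it just assembles text.
    """
    if not subject_motion.strip():
        raise ValueError("a subject motion is required")
    parts = [subject_motion, camera_motion, mood]
    # Drop empty parts, trim whitespace and trailing punctuation, join with commas
    cleaned = [p.strip().rstrip(".,") for p in parts if p.strip()]
    return ", ".join(cleaned) + "."


prompt = build_motion_prompt(
    "The woman slowly turns her head toward the camera",
    "slow dolly push in",
    "soft ambient light",
)
print(prompt)
# The woman slowly turns her head toward the camera, slow dolly push in, soft ambient light.
```

Keeping the three parts as separate arguments makes it harder to slip in a second primary action or a description of appearance.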
Subject motion
A concrete physical action gives predictable animation: «turns her head», «walks forward», «waves ripple», «hair blowing in the wind», «dress flowing», «leaves falling.» Abstract verbs like «something happens» or «she does something» yield chaotic results.
For portraits, small motions work — blinking, slight head turn, subtle smile. For nature — wind, water, clouds, fire. For objects — rotation, floating, falling, dissolving. The more precise the verb, the fewer artifacts at the edges of motion.
Camera motion
Without a camera instruction, the model picks for you, and the result often comes out static or chaotic. Core camera moves:
- In/out: push in, pull out, dolly in, dolly out, zoom in, zoom out
- Pan: pan left, pan right, pan up, pan down
- Tracking: tracking shot, follow shot
- Orbit: orbit, rotating around
- Rise/fall: crane up, crane down
- Static: static camera, locked off
Camera tempo matters too — «slowly» and «gently» give cinematic results; «suddenly» and «rapidly» give dynamic but sometimes artifact-prone output. Don't combine conflicting moves: «zoom in and zoom out simultaneously» or «pan + orbit + tracking at the same time.»
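One way to catch conflicting camera moves before submitting a prompt is a plain keyword check. The sketch below groups the camera phrases from this section into families and flags a prompt that mixes more than one family at the same time. The names `CAMERA_FAMILIES`, `camera_families`, and `has_camera_conflict` are hypothetical, the matching is simple substring search, and it says nothing about how the model itself parses a prompt.

```python
import re

# Camera phrases from this section, grouped by family.
# "in" and "out" are kept apart so that «zoom in and zoom out» is flagged.
CAMERA_FAMILIES = {
    "in": ["push in", "dolly in", "zoom in"],
    "out": ["pull out", "dolly out", "zoom out"],
    "pan": ["pan left", "pan right", "pan up", "pan down"],
    "tracking": ["tracking shot", "follow shot"],
    "orbit": ["orbit", "rotating around"],
    "rise/fall": ["crane up", "crane down"],
    "static": ["static camera", "locked off"],
}


def camera_families(text: str) -> set[str]:
    """Return the camera families whose phrases appear in the text."""
    lowered = text.lower()
    return {
        family
        for family, phrases in CAMERA_FAMILIES.items()
        if any(phrase in lowered for phrase in phrases)
    }


def has_camera_conflict(prompt: str) -> bool:
    """True if any segment of the prompt mixes more than one camera family.

    Segments separated by the word "then" are treated as sequential moves
    and checked independently.
    """
    segments = re.split(r"\bthen\b", prompt, flags=re.IGNORECASE)
    return any(len(camera_families(segment)) > 1 for segment in segments)


print(has_camera_conflict("zoom in and zoom out simultaneously"))      # True
print(has_camera_conflict("slow dolly push in, soft ambient light"))   # False
```

This check does not catch opposite directions inside one family, such as «pan left» together with «pan right».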
Common mistakes
1. Describing the subject's appearance
Appearance is already locked in the reference, so restating it is useless and can conflict with what the model sees. In «Beautiful young woman with blonde hair in red dress walks forward», everything before «walks forward» is empty tokens. Write only motion and camera.
2. Trying text-to-video without an image
Midjourney Video doesn't support pure T2V. The model requires a reference frame. If you submit text-only without uploading an image, generation isn't possible. This isn't a prompt bug — it's an architectural limit of the version.
3. Prompt too long (>60 words)
The model loses focus on long prompts: motions become chaotic and artifacts can appear. Optimal is 1-3 sentences, 20-60 words. If your description won't fit — trim to one primary subject motion + one camera move + tempo.
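The 20-60 word guideline is easy to check mechanically. The sketch below counts whitespace-separated words, which is an assumption about what counts as a word; `check_prompt_length` is a hypothetical helper, and the thresholds are simply the figures quoted in this guide.

```python
def check_prompt_length(prompt: str, low: int = 20, high: int = 60) -> tuple[int, str]:
    """Count whitespace-separated words and compare against the 20-60 guideline."""
    count = len(prompt.split())
    if count > high:
        verdict = "too long"
    elif count < low:
        verdict = "short"
    else:
        verdict = "ok"
    return count, verdict


print(check_prompt_length("beautiful nature"))  # (2, 'short')
```

A «too long» verdict is the cue to trim back to one subject motion, one camera move, and a tempo.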
4. Conflicting motions
«Walks left while running right», «zoom in and zoom out simultaneously», «pan + orbit + tracking at the same time» — the model can't resolve the conflict and outputs chaotic, shaky results. One primary subject motion + one camera move. If you need multiple camera moves, describe them sequentially with «then.»
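If a shot really needs more than one camera move, the sequential phrasing can be assembled like this. `sequence_camera_moves` is a hypothetical helper that only joins text with «then»; how reliably the model follows a sequence is not something this sketch can guarantee.

```python
def sequence_camera_moves(moves: list[str]) -> str:
    """Join camera moves so they read as sequential steps, not simultaneous ones."""
    # Trim whitespace and trailing punctuation, skip empty entries
    cleaned = [m.strip().rstrip(".,") for m in moves if m.strip()]
    return ", then ".join(cleaned)


print(sequence_camera_moves(["slow push in", "pan right"]))
# slow push in, then pan right
```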
5. Quality spam and tag soup
«cinematic, masterpiece, 8K, ultra detailed, best quality, trending on artstation» — noise that clogs the prompt and doesn't affect output. Video quality is determined by reference quality and motion-description precision. Spend tokens on a concrete verb and a concrete camera move instead.
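Stripping this kind of filler can be automated with a simple filter. The sketch below removes comma-separated fragments that exactly match a tag list; `QUALITY_TAGS` contains only the tags quoted above, and both the list and `strip_quality_tags` are hypothetical, not an official blocklist.

```python
# Filler tags quoted in this section; extend the set as needed
QUALITY_TAGS = {
    "cinematic",
    "masterpiece",
    "8k",
    "ultra detailed",
    "best quality",
    "trending on artstation",
}


def strip_quality_tags(prompt: str) -> str:
    """Drop comma-separated fragments that are nothing but a quality tag."""
    fragments = [f.strip() for f in prompt.split(",")]
    kept = [f for f in fragments if f and f.lower().rstrip(".") not in QUALITY_TAGS]
    return ", ".join(kept)


print(strip_quality_tags("cinematic, masterpiece, 8K, slow dolly push in, best quality"))
# slow dolly push in
```

Only whole fragments are removed, so a phrase such as «soft cinematic atmosphere» is left untouched.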
Before / after examples
Example 1
Before
beautiful girl walks down the street
After
The woman slowly walks forward toward the camera, hair gently swaying with each step. Slow dolly push in, shallow depth of field. Soft cinematic atmosphere, peaceful tempo.
Appearance is dropped because it's already in the reference. The prompt describes only subject motion, camera motion, and tempo: one primary action (walking) + one camera move (push in).
Example 2
Before
beautiful nature
After
Tall grass and wildflowers gently sway in the wind, soft afternoon light filtering through the trees. Slow lateral tracking shot from left to right. Peaceful, dreamlike atmosphere, gradual light shifting from warm to cool.
«Beautiful nature» is abstract. The rewrite names concrete motion of environmental elements (grass, flowers), a concrete camera move (lateral tracking), a tempo (slow), and an atmospheric light shift.
Example 3
Before
product spinning
After
The bottle slowly rotates 360 degrees on its axis, catching the studio light on its glossy surface. Static camera, locked off frame. Smooth even tempo, commercial product showcase aesthetic.
Concrete motion (360° rotation), direction (on its axis), camera (static locked off), tempo (smooth even). Without an explicit camera, the model might add a random pan and ruin a product shot.