Midjourney Video: how to write prompts the model actually understands

Midjourney Video is Midjourney's Image-to-Video model for short animations of still images. Pure Text-to-Video isn't supported: a reference image is mandatory. The prompt describes what moves and how the camera moves — not the subject's appearance, which is already set by the image. English is the primary language; optimal length 20-60 words.

What Midjourney Video does

Midjourney Video is Midjourney's new model, specialized in animating a single still image into a short clip. It's a fundamentally different tool from Midjourney V7 or Niji: there the model generates a frame from scratch from text, here it brings an existing frame to life.

Composition, color palette, style, and the subject's appearance are all defined by the reference image. The prompt sets subject motion (head turn, walking, hair blowing), camera motion (push in, pan, orbit, static), and atmosphere (slowly, dramatic, peaceful). Clips are short, the optimal prompt length is 20-60 words, and documentation is limited.

  • Image-to-Video only — a reference frame is required
  • Prompt describes motion, not appearance
  • Camera: push in, pan, orbit, static, tracking
  • Optimal 20-60 words; 1-3 sentences
  • Image defines the starting frame and the style

Prompt structure

Optimal formula: [Subject motion/action] + [Camera motion] + [Tempo/Mood].

Example: «The woman slowly turns her head toward the camera, wind gently blowing her hair, slow dolly push in, soft ambient light.» The main rule — don't restate what's already visible in the image. If the photo shows a girl in a red jacket, don't write «girl in red jacket.» Those are empty tokens that can conflict with what the model has already parsed from the reference.

Brevity and a focus on motion deliver better results than long descriptions. Stick to one primary action; don't stack three simultaneous moves.
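
As a minimal sketch, the formula can be treated as three slots filled in a fixed order. The Python helper below is illustrative only: `build_motion_prompt` and its parameter names are hypothetical and not part of any Midjourney tooling. It simply joins the slots into ordinary prompt text, one sentence per slot.

```python
def build_motion_prompt(subject_motion: str, camera_motion: str, mood: str) -> str:
    """Join the three formula slots into one prompt, one sentence per slot.

    Hypothetical helper: the slot names mirror the formula
    [Subject motion/action] + [Camera motion] + [Tempo/Mood].
    """
    sentences = []
    for part in (subject_motion, camera_motion, mood):
        text = part.strip().rstrip(".")
        if not text:
            raise ValueError("every slot of the formula needs a value")
        # Capitalize the first letter and close the sentence with a period.
        sentences.append(text[0].upper() + text[1:] + ".")
    return " ".join(sentences)


prompt = build_motion_prompt(
    "the woman slowly turns her head toward the camera, wind gently blowing her hair",
    "slow dolly push in",
    "soft ambient light",
)
print(prompt)
```

Keeping each slot as a separate argument makes it harder to slip appearance details into the prompt by accident, since every argument has to answer "what moves", "how the camera moves", or "at what tempo".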

Subject motion

A concrete physical action gives predictable animation: «turns her head», «walks forward», «waves ripple», «hair blowing in the wind», «dress flowing», «leaves falling.» Abstract verbs like «something happens» or «she does something» yield chaotic results.

For portraits, small motions work — blinking, slight head turn, subtle smile. For nature — wind, water, clouds, fire. For objects — rotation, floating, falling, dissolving. The more precise the verb, the fewer artifacts at the edges of motion.
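
The guidance above can be kept as a small lookup table plus a check for vague actions. This is a rough sketch under stated assumptions: the names `SUGGESTED_MOTIONS`, `VAGUE_PHRASES`, and `has_vague_action` are hypothetical, and the phrase lists contain only the examples named in this section, not an exhaustive vocabulary.

```python
# Motions this section suggests for each kind of reference image.
SUGGESTED_MOTIONS = {
    "portrait": ["blinking", "slight head turn", "subtle smile"],
    "nature": ["wind", "water", "clouds", "fire"],
    "object": ["rotation", "floating", "falling", "dissolving"],
}

# Abstract wording the section warns against (illustrative, not exhaustive).
VAGUE_PHRASES = ["something happens", "does something"]


def has_vague_action(prompt: str) -> bool:
    """Return True if the prompt contains one of the known vague phrases."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in VAGUE_PHRASES)


print(has_vague_action("She does something"))        # vague wording
print(has_vague_action("She slowly turns her head"))  # concrete verb
print(SUGGESTED_MOTIONS["portrait"])
```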

Camera motion

Without a camera call, the model picks for you, and the result often comes out static or chaotic. Core camera moves:

  • In/out: push in, pull out, dolly in, dolly out, zoom in, zoom out
  • Pan: pan left, pan right, pan up, pan down
  • Tracking: tracking shot, follow shot
  • Orbit: orbit, rotating around
  • Rise/fall: crane up, crane down
  • Static: static camera, locked off

Camera tempo matters too — «slowly» and «gently» give cinematic results; «suddenly» and «rapidly» give dynamic but sometimes artifact-prone output. Don't combine conflicting moves: «zoom in and zoom out simultaneously» or «pan + orbit + tracking at the same time.»
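
One way to catch conflicting camera calls before generating is a simple phrase check against the camera families listed in this section. The sketch below is an assumption-laden illustration: `CAMERA_FAMILIES`, `find_camera_moves`, and `camera_warnings` are hypothetical names, matching is plain substring search, and a prompt containing «then» is assumed to describe its moves sequentially.

```python
# Camera vocabulary grouped by family, taken from this section.
CAMERA_FAMILIES = {
    "in/out": ["push in", "pull out", "dolly in", "dolly out", "zoom in", "zoom out"],
    "pan": ["pan left", "pan right", "pan up", "pan down"],
    "tracking": ["tracking shot", "follow shot"],
    "orbit": ["orbit", "rotating around"],
    "rise/fall": ["crane up", "crane down"],
    "static": ["static camera", "locked off"],
}
# Families whose phrases end in a direction word (in/out, left/right, up/down).
DIRECTIONAL = {"in/out", "pan", "rise/fall"}


def find_camera_moves(prompt: str) -> dict:
    """Map each camera family to the phrases from it found in the prompt."""
    text = prompt.lower()
    found = {}
    for family, phrases in CAMERA_FAMILIES.items():
        hits = [phrase for phrase in phrases if phrase in text]
        if hits:
            found[family] = hits
    return found


def camera_warnings(prompt: str) -> list:
    """Return human-readable warnings about missing or conflicting camera moves."""
    found = find_camera_moves(prompt)
    if not found:
        return ["no camera move named; the model will pick one for you"]
    if " then " in prompt.lower():
        return []  # assumed sequential: moves joined with "then" are accepted
    warnings = []
    if len(found) > 1:
        warnings.append("several camera families at once: " + ", ".join(found))
    for family, hits in found.items():
        if family in DIRECTIONAL:
            directions = {hit.split()[-1] for hit in hits}
            if len(directions) > 1:
                warnings.append("opposing directions in " + family + ": " + ", ".join(hits))
    return warnings


print(camera_warnings("zoom in and zoom out simultaneously"))
print(camera_warnings("slow pan left, then crane up"))
```

Substring matching is deliberately crude; it can miss paraphrases such as «the camera circles the subject», so treat an empty warning list as "nothing obvious", not as proof the prompt is clean.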

Common mistakes

  1. Describing the subject's appearance

    Appearance is already locked in the reference — restating it is useless and can conflict with what the model sees. «Beautiful young woman with blonde hair in red dress walks forward» is empty tokens up to «walks forward.» Write only motion and camera.

  2. Trying text-to-video without an image

    Midjourney Video doesn't support pure T2V: the model requires a reference frame. If you submit text without uploading an image, nothing is generated. This isn't a prompt bug; it's an architectural limit of the current version.

  3. Prompt too long (>60 words)

    The model loses focus on long prompts: motions become chaotic and artifacts can appear. Optimal is 1-3 sentences, 20-60 words. If your description won't fit — trim to one primary subject motion + one camera move + tempo.

  4. Conflicting motions

    «Walks left while running right», «zoom in and zoom out simultaneously», «pan + orbit + tracking at the same time» — the model can't resolve the conflict and outputs chaotic, shaky results. One primary subject motion + one camera move. If you need multiple camera moves, describe them sequentially with «then.»

  5. Quality spam and tag soup

    «cinematic, masterpiece, 8K, ultra detailed, best quality, trending on artstation» — noise that clogs the prompt and doesn't affect output. Video quality is determined by reference quality and motion-description precision. Spend tokens on a concrete verb and a concrete camera move instead.
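
Two of the mistakes above, prompt length and quality tags, are mechanical enough to check with a few lines of code. The following Python sketch is only an illustration: `lint_prompt`, `QUALITY_TAGS`, and the word-count thresholds are assumptions drawn from this article's 20-60 word guidance, and words are counted by splitting on whitespace. «cinematic» is left out of the tag list on purpose, because the rewritten prompts in this article use it as a mood word.

```python
MIN_WORDS, MAX_WORDS = 20, 60  # range recommended in this article

# Quality tags named above as noise ("cinematic" intentionally omitted).
QUALITY_TAGS = [
    "masterpiece",
    "8k",
    "ultra detailed",
    "best quality",
    "trending on artstation",
]


def lint_prompt(prompt: str) -> list:
    """Return a list of issues; an empty list means nothing was flagged."""
    issues = []
    word_count = len(prompt.split())
    if word_count < MIN_WORDS:
        issues.append(f"too short: {word_count} words (aim for {MIN_WORDS}-{MAX_WORDS})")
    elif word_count > MAX_WORDS:
        issues.append(f"too long: {word_count} words (aim for {MIN_WORDS}-{MAX_WORDS})")
    lowered = prompt.lower()
    tags = [tag for tag in QUALITY_TAGS if tag in lowered]
    if tags:
        issues.append("quality tags add noise: " + ", ".join(tags))
    return issues


print(lint_prompt("beautiful girl walks down the street"))
print(lint_prompt(
    "The woman slowly walks forward toward the camera, hair gently swaying "
    "with each step. Slow dolly push in, shallow depth of field. "
    "Soft cinematic atmosphere, peaceful tempo."
))
```

The check says nothing about restated appearance or conflicting motions; those depend on what the reference image shows and still need a human read.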

Before / after examples

Example 1

Before

beautiful girl walks down the street

After

The woman slowly walks forward toward the camera, hair gently swaying with each step. Slow dolly push in, shallow depth of field. Soft cinematic atmosphere, peaceful tempo.

Appearance is dropped because it's already in the reference. The prompt describes only subject motion, camera motion, and tempo: one primary action (walking) plus one camera move (push in).

Example 2

Before

beautiful nature

After

Tall grass and wildflowers gently sway in the wind, soft afternoon light filtering through the trees. Slow lateral tracking shot from left to right. Peaceful, dreamlike atmosphere, gradual light shifting from warm to cool.

«Beautiful nature» is abstract. The rewrite has concrete motion of environmental elements (grass, flowers), a concrete camera move (lateral tracking), tempo (slow), and an atmospheric light shift.

Example 3

Before

product spinning

After

The bottle slowly rotates 360 degrees on its axis, catching the studio light on its glossy surface. Static camera, locked off frame. Smooth even tempo, commercial product showcase aesthetic.

The rewrite names a concrete motion (360° rotation), a direction (on its axis), a camera (static, locked off), and a tempo (smooth, even). Without an explicit camera, the model might add a random pan and ruin a product shot.

Frequently asked

How is Midjourney Video different from Midjourney V7?
V7 is an image generator that draws a frame from text using Midjourney's signature syntax (`--ar`, `--style`, `--chaos`). Midjourney Video is a separate model that animates a finished image. V7 parameters don't work here: aspect ratio is set by the reference, style is set by the reference, and the prompt describes motion only. They're two different tools under one brand.
Can I generate video from text alone?
No, Midjourney Video is strictly Image-to-Video. A reference image is mandatory. If you need T2V, first generate a frame in V7 or Niji (or another image model), then feed it into Midjourney Video with a motion prompt. It's a two-stage pipeline: image → video.
Why describe camera motion if I only have a subject?
Without a camera call the model picks behavior on its own — often static, or a random pan that breaks the composition. An explicit «slow dolly push in» or «static camera» yields a predictable shot. This is especially critical for product shots and portraits — without a static camera the product may drift out of frame, and a portrait's angle may shift.
What's the optimal prompt length?
20-60 words, in 1-3 sentences. A prompt that is too short (<10 words) gives chaotic animation because the model fills in the gaps itself. One that is too long (>60 words) leads to lost focus and artifacts. The formula «subject motion + camera motion + tempo» in 2-3 sentences covers most scenarios.
Can I ask for rain, wind, or fire that aren't in the photo?
You can, but it's risky. If there's no rain in the reference and you ask for «rain falls,» the model will try to overlay rain on the existing scene, often with edge artifacts. Atmospheric tweaks consistent with the photo work better: an overcast photo + «light wind picks up» works; a sunny day + «sudden rain» yields odd output.
How do I get a cinematic result?
Stack: slow tempo (slowly, gently, gradually) + an explicit camera move (slow dolly push in, slow lateral tracking) + one atmospheric detail (soft ambient light, gentle wind, light shifting). The anti-stack — «suddenly», «rapidly», «explosive», «chaotic» — gives energy but often with shake and artifacts. For cinematic, hold the slow tempo and one motion.
Does Opten support Midjourney Video?
Yes, the Opten extension auto-detects Midjourney Video and scores prompts against the structure above: it checks that a reference image is present, that appearance isn't restated (it's on the image), that subject motion and camera motion are explicit, and that length is within 20-60 words. One click gives you a rewrite in the correct «motion + camera + tempo» formula.

Opten is a Chrome extension that scores AI prompts for the specific model. Supports 60+ image and video models — Midjourney, GPT Image 2, Kling, Sora, Nano Banana, Flux — and rewrites them in one click inside the Syntx, Higgsfield, and Freepik interfaces. From $2.99/month.

© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672