How is Midjourney Video different from Midjourney V7?

V7 is an image generator that draws a frame from text using Midjourney's signature syntax (`--ar`, `--style`, `--chaos`). Midjourney Video is a separate model that animates a finished image. V7 parameters don't work here: aspect ratio is set by the reference, style is set by the reference, and the prompt describes motion only. They're two different tools under one brand.

Can I generate video from text alone?

No, Midjourney Video is strictly Image-to-Video. A reference image is mandatory. If you need T2V, first generate a frame in V7 or Niji (or another image model), then feed it into Midjourney Video with a motion prompt. It's a two-stage pipeline: image → video.

Why describe camera motion if I only have a subject?

Without a camera call the model picks behavior on its own — often static, or a random pan that breaks the composition. An explicit «slow dolly push in» or «static camera» yields a predictable shot. This is especially critical for product shots and portraits — without a static camera the product may drift out of frame, and a portrait's angle may shift.

What's the optimal prompt length?

20-60 words, 1-3 sentences. Too short (under 20 words) leaves the motion underspecified; too long (over 60 words) leads to lost focus and artifacts. The formula «subject motion + camera motion + tempo» in 2-3 sentences covers most scenarios.
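
The 20-60 word window is easy to check mechanically before submitting a prompt. The following is a minimal sketch, assuming a plain whitespace word count is a good enough proxy; the helper names `word_count` and `length_verdict` are hypothetical and not part of any Midjourney tool.

```python
def word_count(prompt: str) -> int:
    """Count whitespace-separated words in a prompt."""
    return len(prompt.split())


def length_verdict(prompt: str, low: int = 20, high: int = 60) -> str:
    """Classify a prompt as 'too short', 'ok', or 'too long'.

    The 20 and 60 defaults come from the guideline above; they are
    recommendations, not limits enforced by the model.
    """
    n = word_count(prompt)
    if n < low:
        return "too short"
    if n > high:
        return "too long"
    return "ok"


prompt = (
    "The woman slowly turns her head toward the camera, wind gently "
    "blowing her hair, slow dolly push in, soft ambient light."
)
print(word_count(prompt), length_verdict(prompt))
```

A two-word prompt such as «product spinning» comes back as "too short", which is a cue to add a camera move and a tempo rather than more description of the subject.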

Can I ask for rain, wind, or fire that aren't in the photo?

You can, but it's risky. If there's no rain on the reference and you ask «rain falls,» the model will try to overlay rain on the existing scene — often with edge artifacts. It works better for atmospheric tweaks consistent with the photo: an overcast photo + «light wind picks up» works; a sunny day + «sudden rain» yields odd output.

How do I get a cinematic result?

Stack: slow tempo (slowly, gently, gradually) + an explicit camera move (slow dolly push in, slow lateral tracking) + one atmospheric detail (soft ambient light, gentle wind, light shifting). The anti-stack — «suddenly», «rapidly», «explosive», «chaotic» — gives energy but often with shake and artifacts. For cinematic, hold the slow tempo and one motion.

Does Opten support Midjourney Video?

Yes, the Opten extension auto-detects Midjourney Video and scores prompts against the structure above: it checks that a reference image is present, that appearance isn't restated (it's on the image), that subject motion and camera motion are explicit, and that length is within 20-60 words. One click gives you a rewrite in the correct «motion + camera + tempo» formula.

Midjourney Video: how to write prompts the model actually understands

Midjourney · Updated: May 19, 2026

Midjourney Video is Midjourney's Image-to-Video model for short animations of still images. Pure Text-to-Video isn't supported: a reference image is mandatory. The prompt describes what moves and how the camera moves — not the subject's appearance, which is already set by the image. English is the primary language; optimal length 20-60 words.

What Midjourney Video does

Midjourney Video is Midjourney's new model, specialized in animating a single still image into a short clip. It's a fundamentally different tool from Midjourney V7 or Niji: those models generate a frame from scratch from text, while Midjourney Video brings an existing frame to life.

Composition, color palette, style, and the subject's appearance are all defined by the reference image. The prompt sets subject motion (head turn, walking, hair blowing), camera motion (push in, pan, orbit, static), and atmosphere (slowly, dramatic, peaceful). Clips are short, the optimal prompt length is 20-60 words, and official documentation is limited.

Image-to-Video only — a reference frame is required
Prompt describes motion, not appearance
Camera: push in, pan, orbit, static, tracking
Optimal 20-60 words; 1-3 sentences
Image defines the starting frame and the style

Prompt structure

Optimal formula: [Subject motion/action] + [Camera motion] + [Tempo/Mood].

Example: «The woman slowly turns her head toward the camera, wind gently blowing her hair, slow dolly push in, soft ambient light.» The main rule — don't restate what's already visible in the image. If the photo shows a girl in a red jacket, don't write «girl in red jacket.» Those are empty tokens that can conflict with what the model has already parsed from the reference.

Brevity and a focus on motion deliver better results than long descriptions. Use one primary action; don't stack three simultaneous moves.
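
The three-part formula can be treated as a small template. The sketch below is one possible way to assemble a prompt from the three slots; the `MotionPrompt` class and its field names are hypothetical illustrations, not an official prompt format.

```python
from dataclasses import dataclass


@dataclass
class MotionPrompt:
    subject_motion: str  # one primary action, e.g. "The woman slowly turns her head"
    camera_motion: str   # one camera move, e.g. "slow dolly push in"
    tempo_mood: str      # tempo or atmosphere, e.g. "soft ambient light"

    def render(self) -> str:
        """Join the non-empty slots in formula order, comma-separated."""
        parts = [self.subject_motion, self.camera_motion, self.tempo_mood]
        cleaned = [p.strip().rstrip(".,") for p in parts if p.strip()]
        if not cleaned:
            raise ValueError("at least one slot must be filled")
        return ", ".join(cleaned) + "."


prompt = MotionPrompt(
    subject_motion="The woman slowly turns her head toward the camera",
    camera_motion="slow dolly push in",
    tempo_mood="soft ambient light",
).render()
print(prompt)
```

Keeping the slots separate makes it harder to slip appearance details into the prompt, since there is no field for them.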

Subject motion

A concrete physical action gives predictable animation: «turns her head», «walks forward», «waves ripple», «hair blowing in the wind», «dress flowing», «leaves falling.» Abstract verbs like «something happens» or «she does something» yield chaotic results.

For portraits, small motions work — blinking, slight head turn, subtle smile. For nature — wind, water, clouds, fire. For objects — rotation, floating, falling, dissolving. The more precise the verb, the fewer artifacts at the edges of motion.

Camera motion

Without a camera call, the result often comes out static or chaotic, because the model picks for you. Core camera moves:

In/out: push in, pull out, dolly in, dolly out, zoom in, zoom out
Pan: pan left, pan right, pan up, pan down
Tracking: tracking shot, follow shot
Orbit: orbit, rotating around
Rise/fall: crane up, crane down
Static: static camera, locked off

Camera tempo matters too — «slowly» and «gently» give cinematic results; «suddenly» and «rapidly» give dynamic but sometimes artifact-prone output. Don't combine conflicting moves: «zoom in and zoom out simultaneously» or «pan + orbit + tracking at the same time.»
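
Conflicting camera moves can be spotted with a simple keyword pass. This is a rough sketch under stated assumptions: the regex patterns are a hypothetical approximation of the move families listed above, the in and out directions are split into separate categories so that opposite moves register as a conflict, and a prompt that contains «then» is treated as sequential rather than simultaneous.

```python
import re

# Hypothetical keyword patterns, one per family of camera moves.
CAMERA_PATTERNS = {
    "in": r"\b(push|dolly|zoom) in\b",
    "out": r"\b(pull|dolly|zoom) out\b",
    "pan": r"\bpan\b",
    "tracking": r"\b(tracking|follow shot)\b",
    "orbit": r"\b(orbit|rotating around)\b",
    "crane": r"\bcrane (up|down)\b",
    "static": r"\b(static camera|locked off)\b",
}


def camera_categories(prompt: str) -> set[str]:
    """Return the camera-move categories mentioned in the prompt."""
    text = prompt.lower()
    return {
        name for name, pattern in CAMERA_PATTERNS.items()
        if re.search(pattern, text)
    }


def has_camera_conflict(prompt: str) -> bool:
    """True when several categories appear and no 'then' sequences them."""
    sequenced = re.search(r"\bthen\b", prompt.lower()) is not None
    return len(camera_categories(prompt)) > 1 and not sequenced


print(has_camera_conflict("zoom in and zoom out simultaneously"))
print(has_camera_conflict("Slow push in, then pan left"))
```

Keyword matching will miss paraphrases (for example «the camera circles the subject»), so a clean result means only that no listed move names collide.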

Common mistakes

1. Describing the subject's appearance
Appearance is already locked in the reference — restating it is useless and can conflict with what the model sees. «Beautiful young woman with blonde hair in red dress walks forward» is empty tokens up to «walks forward.» Write only motion and camera.
2. Trying text-to-video without an image
Midjourney Video doesn't support pure T2V. The model requires a reference frame. If you submit text-only without uploading an image, generation isn't possible. This isn't a prompt bug — it's an architectural limit of the version.
3. Prompt too long (>60 words)
The model loses focus on long prompts: motions become chaotic and artifacts can appear. Optimal is 1-3 sentences, 20-60 words. If your description won't fit — trim to one primary subject motion + one camera move + tempo.
4. Conflicting motions
«Walks left while running right», «zoom in and zoom out simultaneously», «pan + orbit + tracking at the same time» — the model can't resolve the conflict and outputs chaotic, shaky results. One primary subject motion + one camera move. If you need multiple camera moves, describe them sequentially with «then.»
5. Quality spam and tag soup
«cinematic, masterpiece, 8K, ultra detailed, best quality, trending on artstation» — noise that clogs the prompt and doesn't affect output. Video quality is determined by reference quality and motion-description precision. Spend tokens on a concrete verb and a concrete camera move instead.
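
Tag soup from mistake 5 is straightforward to filter before a prompt is sent. The sketch below assumes that a quality tag is a comma-separated fragment consisting of nothing but the tag itself, so a phrase like «soft cinematic atmosphere» is left alone; the tag list is taken from the example above and the function names are hypothetical.

```python
import re

# Tags quoted in mistake 5; extend as needed.
QUALITY_TAGS = {
    "cinematic", "masterpiece", "8k", "ultra detailed",
    "best quality", "trending on artstation",
}


def split_fragments(prompt: str) -> list[str]:
    """Split a prompt on commas, periods, and semicolons."""
    return [f.strip() for f in re.split(r"[,.;]", prompt) if f.strip()]


def find_quality_tags(prompt: str) -> list[str]:
    """Return fragments that are nothing but a known quality tag."""
    return [f for f in split_fragments(prompt) if f.lower() in QUALITY_TAGS]


def strip_quality_tags(prompt: str) -> str:
    """Drop bare quality tags and rejoin the rest with commas."""
    kept = [f for f in split_fragments(prompt) if f.lower() not in QUALITY_TAGS]
    return ", ".join(kept)


raw = "cinematic, masterpiece, 8K, ultra detailed, the bottle slowly rotates, static camera"
print(find_quality_tags(raw))
print(strip_quality_tags(raw))
```

Note that `strip_quality_tags` flattens sentence punctuation into commas, so it is best used on tag-style drafts rather than on finished multi-sentence prompts.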

Before / after examples

Example 1

Before

beautiful girl walks down the street

After

The woman slowly walks forward toward the camera, hair gently swaying with each step. Slow dolly push in, shallow depth of field. Soft cinematic atmosphere, peaceful tempo.

Appearance is dropped because it's already on the reference. Only motion-related elements are described: subject motion, camera motion, and tempo. One primary action (walking) + one camera move (push in).

Example 2

Before

beautiful nature

After

Tall grass and wildflowers gently sway in the wind, soft afternoon light filtering through the trees. Slow lateral tracking shot from left to right. Peaceful, dreamlike atmosphere, gradual light shifting from warm to cool.

«Beautiful nature» is abstract. The rewrite has concrete motion of environmental elements (grass, flowers), a concrete camera move (lateral tracking), tempo (slow), and an atmospheric light shift.

Example 3

Before

product spinning

After

The bottle slowly rotates 360 degrees on its axis, catching the studio light on its glossy surface. Static camera, locked off frame. Smooth even tempo, commercial product showcase aesthetic.

Concrete motion (360° rotation), direction (on its axis), camera (static locked off), tempo (smooth even). Without an explicit camera, the model might add a random pan and ruin a product shot.

Related models

Google Veo 3.1 (incl. Veo 3.1 Fast and Veo 3.1 Fast Relax)

Google Veo 3

Google Veo (General)
