PixVerse V6: how to write prompts the model actually understands

PixVerse V6 is a video model from PixVerse with native audio generation, multi-shot mode, and 20+ cinematic lens controls. It supports T2V and I2V, up to 15 seconds at 1080p, negative prompts, and custom seeds. It responds best to literal physical descriptions — not metaphors.

What's new in V6

The headline feature is native audio in a single pass: background music, SFX, ambient, even dialogue. Audio is described explicitly in the prompt («Loud engine roaring sound. Tires hitting gravel sound.») and generated in sync with the video.

Second, the multi-shot engine: short films with native transitions and character consistency. Third, 20+ cinematic lens controls (focal length, aperture, DoF, lens distortion, chromatic aberration, vignetting) as production parameters, not prompt hints. And fourth, up to 15 seconds at 1080p in a single generation, versus 5–8 seconds at 1080p on V5.5.

  • Native audio (BGM, SFX, dialogue, ambient) in one pass
  • Multi-shot engine with transitions and character consistency
  • 20+ cinematic lens controls as parameters, not text
  • Up to 15 seconds at 1080p (5–8 sec on V5.5)
  • Negative prompts and custom seed supported
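A quick way to keep the duration limits straight is a small lookup. This is a minimal sketch built only from the numbers quoted in this guide; the table layout, the function name `check_clip`, and the choice of 8 seconds as the V5.5 ceiling at 1080p are illustrative assumptions, not part of any PixVerse API.

```python
# Limits as quoted in this guide. The structure and names are illustrative.
MAX_SECONDS = {
    ("v6", "1080p"): 15,
    ("v5.5", "1080p"): 8,   # guide says 5-8 s; the upper bound is assumed here
    ("v5.5", "720p"): 10,
}


def check_clip(version: str, resolution: str, seconds: int) -> str:
    """Return 'ok' or a short suggestion for a requested clip length."""
    limit = MAX_SECONDS.get((version, resolution))
    if limit is None:
        return "unknown combination: not covered by this guide"
    if seconds <= limit:
        return "ok"
    # The one case the guide gives explicit advice for.
    if (version == "v5.5" and resolution == "1080p"
            and seconds <= MAX_SECONDS[("v5.5", "720p")]):
        return "over the V5.5 1080p limit: switch to V6 or accept 720p"
    return f"over the limit of {limit} s for {version} at {resolution}"


if __name__ == "__main__":
    print(check_clip("v5.5", "1080p", 10))
```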

Prompt structure

PixVerse reads literally — no metaphors, no abstractions. The base formula: [Subject] + [Action] + [Environment] + [Camera movement] + [Audio description].

Describe only what you SEE and HEAR. «Tears of the sky» is noise; «Heavy rain falling on pavement» is a working prompt. This matters extra on V6, because the new lens controls only kick in when the model has a clear grip on the physical scene.
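The base formula can be turned into a small template so no slot gets forgotten. The sketch below is one possible way to do it, assuming you fill each slot with a literal phrase; the `Scene` class and `build_prompt` function are hypothetical helpers, not something PixVerse provides.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    """One slot per element of the base formula."""
    subject: str
    action: str
    environment: str
    camera: str
    audio: str


def _sentence(text: str) -> str:
    """Trim the text and make sure it ends with exactly one period."""
    return text.strip().rstrip(".") + "."


def build_prompt(scene: Scene) -> str:
    """Join the slots in formula order: visual part, camera, then audio."""
    visual = _sentence(f"{scene.subject} {scene.action} {scene.environment}")
    return " ".join([visual, _sentence(scene.camera), _sentence(scene.audio)])


if __name__ == "__main__":
    scene = Scene(
        subject="A black sports car",
        action="drives",
        environment="through a wet downtown street at night",
        camera="Camera tracks the car from a low angle",
        audio="Loud engine roaring sound",
    )
    print(build_prompt(scene))
```

Keeping audio as its own slot makes it hard to ship a prompt without a sound description by accident.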

Prompt length ranges from 2 to 2,048 characters. The `thinking_type` parameter (enabled/disabled/auto) toggles automatic prompt optimization — on short prompts, `enabled` can noticeably lift quality.
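The length limit and the `thinking_type` choice are easy to check before sending anything. The following is a rough sketch: the character bounds come from this guide, the word thresholds follow the rule of thumb given in the FAQ (short prompts up to about 20 words, long ones from 100 words), and the function names are made up for illustration.

```python
MIN_CHARS, MAX_CHARS = 2, 2048
THINKING_TYPES = ("enabled", "disabled", "auto")


def validate_length(prompt: str) -> int:
    """Return the character count, or raise if it is outside 2..2048."""
    n = len(prompt)
    if not MIN_CHARS <= n <= MAX_CHARS:
        raise ValueError(f"prompt has {n} characters, expected {MIN_CHARS}-{MAX_CHARS}")
    return n


def suggest_thinking_type(prompt: str) -> str:
    """Pick a thinking_type value from the word count (heuristic only)."""
    words = len(prompt.split())
    if words <= 20:    # short prompt: let the model optimize it
        return "enabled"
    if words >= 100:   # long, detailed prompt: keep it as written
        return "disabled"
    return "auto"      # everything in between


if __name__ == "__main__":
    p = "Heavy rain falling on pavement"
    print(validate_length(p), suggest_thinking_type(p))
```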

Audio in the prompt

V6 is one of the first public video models with native audio. Describe sounds explicitly in the prompt text: «Loud engine roaring sound. Tires hitting gravel sound. Wind rushing past.»

Supported categories: SFX (engine, footsteps, splashes, impacts), ambient (forest, urban street, ocean waves), BGM (simplified — «soft piano music», «driving bass beat»), dialogue in quotes with lip-sync. The more specific the sound, the better — «soft synth pad» beats «nice music».

Audio sits in a separate block in the prompt, usually after the visual part. It may look like extra text, but it's a working V6 feature: skip it and you get silence or, at best, unpredictable ambient sound.
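One way to keep the audio block explicit is to assemble it from the four categories listed above. This is only a sketch: the function `audio_block` is hypothetical, the `speaker says: "line"` phrasing for dialogue is an assumed convention (the guide only says dialogue goes in quotes), and falling back to «Silent.» when nothing is given follows the FAQ advice on controlled silence.

```python
from typing import Optional, Sequence, Tuple


def audio_block(
    sfx: Sequence[str] = (),
    ambient: Sequence[str] = (),
    bgm: Optional[str] = None,
    dialogue: Optional[Tuple[str, str]] = None,
) -> str:
    """Build the audio part of a prompt as short, literal sentences."""
    sentences = [cue.strip().rstrip(".") + "." for cue in list(sfx) + list(ambient)]
    if bgm:
        sentences.append(bgm.strip().rstrip(".") + ".")
    if dialogue:
        speaker, line = dialogue
        # Assumed phrasing: the spoken line goes in double quotes.
        sentences.append(f'{speaker} says: "{line}"')
    # Nothing described: ask for silence explicitly instead of leaving it open.
    return " ".join(sentences) if sentences else "Silent."


if __name__ == "__main__":
    print(audio_block(
        sfx=["Loud engine roaring sound", "Tires hitting gravel sound"],
        ambient=["Wind rushing past"],
    ))
```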

Style presets and lens controls

Styles are set via parameter, not in the prompt text: `anime`, `3d_animation`, `clay`, `comic`, `cyberpunk`. Writing «in anime style» in the text works noticeably worse than picking the preset through the parameter. Each preset supports lip-sync for dialogue scenes.

Cinematic lens controls are also parameters: focal length (24mm wide, 50mm normal, 85mm portrait), aperture (f/1.4 shallow DoF, f/8 deep), lens distortion, chromatic aberration, vignetting. These are production settings, like on a real camera. V5.5 didn't have them — you had to write it all in text; V6 lifts that out into dedicated fields.
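To show what "parameters, not text" might look like in practice, here is a hedged sketch of a settings object. The field names `style`, `focal_length_mm`, and `aperture` are invented for illustration and are not taken from PixVerse documentation; only the five preset names and the example lens values come from this guide.

```python
from typing import Optional

# Preset names as listed in this guide.
STYLE_PRESETS = {"anime", "3d_animation", "clay", "comic", "cyberpunk"}


def build_settings(
    style: Optional[str] = None,
    focal_length_mm: Optional[float] = None,
    aperture: Optional[float] = None,
) -> dict:
    """Collect style and lens choices as separate fields (hypothetical names)."""
    settings: dict = {}
    if style is not None:
        if style not in STYLE_PRESETS:
            raise ValueError(f"unknown style preset: {style!r}")
        settings["style"] = style
    if focal_length_mm is not None:
        if focal_length_mm <= 0:
            raise ValueError("focal length must be positive")
        settings["focal_length_mm"] = focal_length_mm
    if aperture is not None:
        if aperture <= 0:
            raise ValueError("aperture f-number must be positive")
        settings["aperture"] = f"f/{aperture:g}"  # 1.4 -> "f/1.4", 8 -> "f/8"
    return settings


if __name__ == "__main__":
    # Portrait look: 85mm, shallow depth of field, anime preset.
    print(build_settings(style="anime", focal_length_mm=85, aperture=1.4))
```

The point of the design is that the prompt text stays about the scene, while look and optics travel in their own fields.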

Common mistakes

  1. Metaphors instead of literal description

    PixVerse reads literally. «Tears of the sky» will be interpreted literally — you might get a crying sky or a weird artifact. «Heavy rain falling on pavement» yields exactly what you want. Describe physically: what moves, how exactly, in what environment.

  2. Styles in the prompt text instead of the parameter

    «In anime style» or «as a clay animation» in the text works noticeably worse than picking the matching style preset through the parameter (`anime`, `clay`, `3d_animation`, `comic`, `cyberpunk`). Styles live in a separate field for a reason — use it and keep the text for scene content.

  3. No audio description on V6

    V6 can generate audio natively, but it only follows what you describe in the prompt. Without an audio block the clip comes out silent or with unpredictable ambient sound. At minimum, mention ambient («urban street ambient»). At most, add concrete SFX, BGM, and quoted dialogue. It's a real model feature, not filler.

  4. 1080p × 10 seconds on V5.5

    On V5.5, 1080p maxes out at 5–8 seconds; 10 seconds is only available at 720p. V6 removes that limit — up to 15 seconds at 1080p in one pass. If a V5.5 request for «1080p, 10s» fails, switch to V6 or accept 720p.

  5. Forgetting character descriptors in multi-shot

    In multi-shot mode the model loses the character between shots unless you repeat the key descriptors in each one. «The woman in red coat» in shot 1, «she» in shot 2 — drift is almost guaranteed. Repeat the short descriptor («the woman in red coat») in every shot to hold identity.
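Some of the mistakes above can be caught with a crude pre-flight check. The sketch below is a heuristic only: the keyword lists, the regular expression, and the `lint_prompt` function are assumptions made for illustration, and a keyword match is no proof that the model will read the prompt as intended. It assumes shots are labeled «Shot 1:», «Shot 2:», and so on.

```python
import re
from typing import List, Optional

# Phrases like "in anime style" or "as a clay animation" (heuristic).
STYLE_PHRASE = re.compile(r"\b(?:in|as an?)\s+(?:anime|clay|comic|cyberpunk|3d)\b",
                          re.IGNORECASE)
# Words that suggest the prompt says something about sound (heuristic).
AUDIO_HINTS = ("sound", "music", "ambient", "silent", "room tone")


def lint_prompt(prompt: str, descriptor: Optional[str] = None) -> List[str]:
    """Return warnings for style-in-text, missing audio, and descriptor drift."""
    warnings = []
    lowered = prompt.lower()
    if STYLE_PHRASE.search(prompt):
        warnings.append("style named in text: use the style preset parameter")
    if not any(hint in lowered for hint in AUDIO_HINTS):
        warnings.append("no audio description found")
    if descriptor:
        # Text before the first "Shot N:" label is not a shot, so drop it.
        shots = re.split(r"Shot \d+:", prompt)[1:]
        missing = [i for i, shot in enumerate(shots, 1)
                   if descriptor.lower() not in shot.lower()]
        if len(shots) > 1 and missing:
            warnings.append(f"descriptor missing in shot(s): {missing}")
    return warnings


if __name__ == "__main__":
    print(lint_prompt("A girl cries, in anime style"))
```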

Before / after examples

Example 1

Before

beautiful cinematic video of a car in the city at night

After

A black sports car drives through a wet downtown street at night. Neon signs reflect on the asphalt. Camera tracks the car from a low angle, smooth dolly motion. Loud engine roaring sound, tires hissing on wet pavement, distant urban ambient. 50mm focal length, shallow DoF.

Literal physical detail (wet street, neon reflections), explicit audio (engine, tires, ambient), separate lens block (50mm, shallow DoF). V5.5 would force lens settings into the text; V6 takes them as parameters.

Example 2

Before

anime clip where a girl cries from sadness, emotional music

After

A young woman sits on a windowsill, soft tears running down her cheeks. Rain on the glass behind her, grey overcast light. Camera slowly pushes in from medium shot to close-up. Soft piano music, gentle rain ambient. Style preset: anime (set via parameter, not in prompt).

The anime style moves into the parameter, not the text. Emotion is carried by physical detail (tears, posture, rain), not the abstract «sad». Audio is its own block.

Example 3

Before

product video of sneakers on the street

After

Shot 1: Close-up of running shoes on wet asphalt, water splashing as the foot lifts off. Shot 2: Medium tracking shot, the runner sprints down an empty street at sunrise. Shot 3: Wide shot, the runner crosses the frame, golden light flaring through buildings. Footsteps slapping pavement, rhythmic breath, upbeat electronic music. Negative prompt: blurry, watermark.

Multi-shot structure (3 labeled shots), «the runner» repeated in shots 2 and 3 for consistency, audio in a separate block, and the negative prompt set apart from the scene description (it belongs in its own field). This is the V6 sweet spot.
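The structure of Example 3 can be assembled mechanically, with the negative prompt kept out of the main text. This is a sketch under assumptions: `build_multishot` and the keys `prompt` and `negative_prompt` are illustrative names, not confirmed PixVerse field names. The comma-separated format for negatives is the one described in the FAQ.

```python
from typing import Dict, Sequence


def build_multishot(shots: Sequence[str], audio: str,
                    negative: Sequence[str] = ()) -> Dict[str, str]:
    """Label the shots, append the audio block, keep negatives separate."""
    labeled = " ".join(
        f"Shot {i}: {shot.strip().rstrip('.')}." for i, shot in enumerate(shots, 1)
    )
    # Normalize and de-duplicate negative terms, keeping first-seen order.
    terms = []
    for term in negative:
        cleaned = term.strip().lower()
        if cleaned and cleaned not in terms:
            terms.append(cleaned)
    return {
        "prompt": f"{labeled} {audio.strip()}",
        "negative_prompt": ", ".join(terms),
    }


if __name__ == "__main__":
    request = build_multishot(
        shots=[
            "Close-up of the runner's shoes on wet asphalt",
            "Medium tracking shot, the runner sprints down an empty street at sunrise",
            "Wide shot, the runner crosses the frame",
        ],
        audio="Footsteps slapping pavement, rhythmic breath, upbeat electronic music.",
        negative=["blurry", "watermark"],
    )
    print(request["prompt"])
    print(request["negative_prompt"])
```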

Frequently asked

How does PixVerse V6 differ from V5.5?

Four main differences: native audio generation (BGM, SFX, dialogue) in one pass, up to 15 seconds at 1080p versus 5–8 on V5.5, 20+ cinematic lens controls as parameters, and a multi-shot engine with native transitions. V5.5 remains useful for short clips with effects (46 effect templates), but for serious content V6 is a clear upgrade.

Do I need to describe audio in the prompt if I want a silent clip?

Yes, explicitly. If audio isn't described, V6 either generates silence or adds random ambient, so the result is unpredictable. For a quiet clip, write «silent» or specify a thin ambient: «very faint room tone». Controlled silence beats accidental silence; that's the main rule for working with V6 audio.

Are negative prompts supported?

Yes, it's a documented V6 feature (also V5.5). The negative prompt is a separate field or API parameter. Format: a comma-separated list of things to exclude, such as «blurry, distorted hands, extra limbs, watermark, text». Unlike Runway Gen-4/4.5 where negatives don't work, in PixVerse this is a working tool.

How do I keep a character consistent across multi-shot frames?

Two tools: repeating character descriptors in every shot and multi-image reference (up to 3 character photos as input). Best practice is to combine both: upload 2–3 reference photos and repeat a short text descriptor («the woman in red coat») in every shot. That gives maximum consistency.

What should I do with the thinking_type parameter?

Three values: `enabled` (model auto-optimizes the prompt before generation), `disabled` (prompt goes through as written), `auto` (model decides by prompt complexity). For short prompts of 10–20 words, `enabled` noticeably improves quality. For long detailed prompts of 100+ words, `disabled` preserves your control. `auto` is a sensible default.

How long should the prompt be?

From 2 to 2,048 characters technically. In practice 50–200 words is optimal for most scenes. Short prompts (10–20 words) pair well with `thinking_type=enabled`. Long multi-shot prompts can hit 300+ words with three shot blocks plus an audio description, which is normal for the format.

Does Opten support PixVerse V6?

Yes, the Opten extension recognizes PixVerse inside pixverse.ai and scores prompts against the V6-specific structure: it checks for an audio description block, literal physical phrasing, style preset used as a parameter (not in text), repeated character descriptors in multi-shot, and a sensible negative prompt.


© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672