PixVerse V6: how to write prompts the model actually understands
PixVerse V6 is a video model from PixVerse with native audio generation, multi-shot mode, and 20+ cinematic lens controls. It supports text-to-video (T2V) and image-to-video (I2V), up to 15 seconds at 1080p, negative prompts, and custom seeds. It responds best to literal physical descriptions, not metaphors.
What's new in V6
The headline feature is native audio in a single pass: background music, SFX, ambient, even dialogue. Audio is described explicitly in the prompt («Loud engine roaring sound. Tires hitting gravel sound.») and generated in sync with the video.
Second, the multi-shot engine: short films with native transitions and character consistency. Third, 20+ cinematic lens controls (focal length, aperture, DoF, lens distortion, chromatic aberration, vignetting) as production parameters, not prompt hints. Fourth, up to 15 seconds at 1080p in a single generation, versus 5–8 seconds at 1080p on V5.5.
- Native audio (BGM, SFX, dialogue, ambient) in one pass
- Multi-shot engine with transitions and character consistency
- 20+ cinematic lens controls as parameters, not text
- Up to 15 seconds at 1080p (5–8 sec on V5.5)
- Negative prompts and custom seed supported
Prompt structure
PixVerse reads literally — no metaphors, no abstractions. The base formula: [Subject] + [Action] + [Environment] + [Camera movement] + [Audio description].
Describe only what you SEE and HEAR. «Tears of the sky» is noise; «Heavy rain falling on pavement» is a working prompt. This matters even more on V6, because the new lens controls only take effect when the model has a clear grasp of the physical scene.
Prompt length ranges from 2 to 2,048 characters. The `thinking_type` parameter (enabled/disabled/auto) toggles automatic prompt optimization — on short prompts, `enabled` can noticeably lift quality.
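The formula and the length limit above can be sketched as a small helper. This is only an illustration: `build_prompt` and its argument names are made up for this example and are not part of any PixVerse SDK; the 2 to 2,048 character range is the one quoted above.

```python
MIN_LEN, MAX_LEN = 2, 2048  # prompt length limits quoted in this article


def build_prompt(subject, action, environment, camera, audio=""):
    """Join the five formula slots into one literal prompt string.

    Hypothetical helper for illustration only, not a PixVerse API.
    """
    visual = f"{subject} {action} {environment}"
    parts = [visual, camera, audio]
    # Drop empty slots and make sure each part ends with exactly one period.
    sentences = [p.strip().rstrip(".") + "." for p in parts if p.strip()]
    prompt = " ".join(sentences)
    if not MIN_LEN <= len(prompt) <= MAX_LEN:
        raise ValueError(
            f"prompt is {len(prompt)} characters; expected {MIN_LEN} to {MAX_LEN}"
        )
    return prompt


prompt = build_prompt(
    subject="A black sports car",
    action="drives through",
    environment="a wet downtown street at night",
    camera="Camera tracks the car from a low angle",
    audio="Loud engine roaring sound",
)
print(prompt)
```

Keeping each slot as a separate argument makes it harder to slip in an abstract phrase where a physical description belongs, and the length check fails early instead of at submission time.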
Audio in the prompt
V6 is one of the first public video models with native audio. Describe sounds explicitly in the prompt text: «Loud engine roaring sound. Tires hitting gravel sound. Wind rushing past.»
Supported categories: SFX (engine, footsteps, splashes, impacts), ambient (forest, urban street, ocean waves), BGM (simplified — «soft piano music», «driving bass beat»), dialogue in quotes with lip-sync. The more specific the sound, the better — «soft synth pad» beats «nice music».
Audio sits in its own block in the prompt, usually after the visual part. It may look like extra text, but it is a working V6 feature: skip it and you get silence.
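One way to keep the audio block separate and specific is to assemble it from the four categories listed above. This is a minimal sketch: `audio_block` is a hypothetical helper, and the `speaker says: "line"` dialogue format is an assumption, since the article only says that dialogue goes in quotes.

```python
def audio_block(sfx=(), ambient=(), bgm="", dialogue=()):
    """Return the audio part of a prompt as short literal sentences.

    Hypothetical helper; the categories follow the article's list
    (SFX, ambient, BGM, quoted dialogue).
    """
    sentences = [f"{s} sound." for s in sfx]
    sentences += [f"{a} ambient." for a in ambient]
    if bgm:
        sentences.append(f"{bgm}.")
    # Assumed phrasing for dialogue: the spoken line stays inside quotes.
    sentences += [f'{speaker} says: "{line}"' for speaker, line in dialogue]
    return " ".join(sentences)


visual = "A black sports car drives through a wet downtown street at night."
audio = audio_block(
    sfx=["Loud engine roaring", "Tires hitting gravel"],
    ambient=["Distant urban street"],
    bgm="Driving bass beat",
)
# The audio block goes after the visual description.
full_prompt = f"{visual} {audio}"
print(full_prompt)
```

An empty result from `audio_block()` is a useful signal on its own: by the rule above, that clip would come out silent.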
Style presets and lens controls
Styles are set via parameter, not in the prompt text: `anime`, `3d_animation`, `clay`, `comic`, `cyberpunk`. Writing «in anime style» in the text works noticeably worse than picking the preset through the parameter. Each preset supports lip-sync for dialogue scenes.
Cinematic lens controls are also parameters: focal length (24mm wide, 50mm normal, 85mm portrait), aperture (f/1.4 shallow DoF, f/8 deep), lens distortion, chromatic aberration, vignetting. These are production settings, like on a real camera. V5.5 didn't have them — you had to write it all in text; V6 lifts that out into dedicated fields.
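The split between prompt text and parameters might look like this in a request body. Treat it as a sketch under stated assumptions: the field names `style`, `lens`, `focal_length_mm`, and `aperture` are invented for illustration and may not match the real PixVerse API; only the preset names and `thinking_type` values come from this article.

```python
STYLE_PRESETS = {"anime", "3d_animation", "clay", "comic", "cyberpunk"}
THINKING_TYPES = {"enabled", "disabled", "auto"}


def build_request(prompt, style=None, focal_length_mm=None, aperture=None,
                  thinking_type="auto"):
    """Build a request body that keeps style and lens out of the prompt text.

    All field names except `thinking_type` are hypothetical placeholders.
    """
    if style is not None and style not in STYLE_PRESETS:
        raise ValueError(f"unknown style preset: {style!r}")
    if thinking_type not in THINKING_TYPES:
        raise ValueError(f"unknown thinking_type: {thinking_type!r}")

    payload = {"prompt": prompt, "thinking_type": thinking_type}
    if style is not None:
        payload["style"] = style  # assumed field name
    lens = {}
    if focal_length_mm is not None:
        lens["focal_length_mm"] = focal_length_mm  # assumed field name
    if aperture is not None:
        lens["aperture"] = aperture  # assumed field name
    if lens:
        payload["lens"] = lens
    return payload


request = build_request(
    prompt="A young woman sits on a windowsill, soft tears running down her cheeks.",
    style="anime",
    focal_length_mm=85,
    aperture="f/1.4",
)
print(request)
```

The prompt string carries scene content only; anything that reads like a camera or style setting goes into its own field, where a typo such as an unsupported preset is rejected before the request is sent.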
Common mistakes
1. Metaphors instead of literal description
PixVerse reads literally. «Tears of the sky» will be interpreted literally — you might get a crying sky or a weird artifact. «Heavy rain falling on pavement» yields exactly what you want. Describe physically: what moves, how exactly, in what environment.
2. Styles in the prompt text instead of the parameter
«In anime style» or «as a clay animation» in the text works noticeably worse than picking the matching style preset through the parameter (`anime`, `clay`, `3d_animation`, `comic`, `cyberpunk`). Styles live in a separate field for a reason — use it and keep the text for scene content.
3. No audio description on V6
V6 can generate audio natively, but only if you describe it in the prompt. Without an audio block the clip is silent. Minimum — mention ambient («urban street ambient»). Maximum — concrete SFX, BGM, and quoted dialogue. It's a real model feature, not filler.
4. 1080p × 10 seconds on V5.5
On V5.5, 1080p maxes out at 5–8 seconds; 10 seconds is only available at 720p. V6 removes that limit — up to 15 seconds at 1080p in one pass. If a V5.5 request for «1080p, 10s» fails, switch to V6 or accept 720p.
5. Forgetting character descriptors in multi-shot
In multi-shot mode the model loses the character between shots unless you repeat the key descriptors in each one. «The woman in red coat» in shot 1, «she» in shot 2 — drift is almost guaranteed. Repeat the short descriptor («the woman in red coat») in every shot to hold identity.
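Mistakes 3 to 5 are mechanical enough to catch before submitting. The check below is a rough sketch, not an official tool: the model labels `"v6"` and `"v5.5"` are hypothetical, the duration limits are the ones quoted in this article, and it assumes that V6 allows 15 seconds at every resolution and that V5.5 allows 10 seconds at anything below 1080p (the article only states 1080p and 720p respectively).

```python
def max_duration_s(model, resolution):
    """Longest clip in seconds, using the limits quoted in this article."""
    if model == "v6":
        return 15  # stated for 1080p; assumed here for other resolutions
    if model == "v5.5":
        return 8 if resolution == "1080p" else 10  # 10 s stated for 720p
    raise ValueError(f"unknown model: {model!r}")


def preflight(model, resolution, duration_s, shots, descriptor, audio):
    """Return a list of warnings for mistakes 3, 4 and 5."""
    warnings = []
    limit = max_duration_s(model, resolution)
    if duration_s > limit:
        warnings.append(
            f"{duration_s}s exceeds the {limit}s limit for {model} at {resolution}"
        )
    if model == "v6" and not audio.strip():
        warnings.append("no audio block: the clip will be silent")
    # Every shot should repeat the short character descriptor.
    for i, shot in enumerate(shots, start=1):
        if descriptor.lower() not in shot.lower():
            warnings.append(f"shot {i}: missing descriptor '{descriptor}'")
    return warnings


issues = preflight(
    model="v5.5",
    resolution="1080p",
    duration_s=10,
    shots=[
        "The woman in red coat walks into a cafe.",
        "She sits down by the window.",
    ],
    descriptor="the woman in red coat",
    audio="",
)
for issue in issues:
    print(issue)
```

In this example the check flags the 10-second request at 1080p on V5.5 and the second shot, where «she» has replaced the descriptor.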
Before / after examples
Example 1
Before
beautiful cinematic video of a car in the city at night
After
A black sports car drives through a wet downtown street at night. Neon signs reflect on the asphalt. Camera tracks the car from a low angle, smooth dolly motion. Loud engine roaring sound, tires hissing on wet pavement, distant urban ambient. 50mm focal length, shallow DoF.
Literal physical detail (wet street, neon reflections), explicit audio (engine, tires, ambient), separate lens block (50mm, shallow DoF). On V5.5 the lens settings had to stay in the prompt text; on V6 they go into the dedicated lens parameters.
Example 2
Before
anime clip where a girl cries from sadness, emotional music
After
A young woman sits on a windowsill, soft tears running down her cheeks. Rain on the glass behind her, grey overcast light. Camera slowly pushes in from medium shot to close-up. Soft piano music, gentle rain ambient. Style preset: anime (set via parameter, not in prompt).
The anime style moves into the parameter, not the text. Emotion is carried by physical detail (tears, posture, rain), not the abstract «sad». Audio is its own block.
Example 3
Before
product video of sneakers on the street
After
Shot 1: Close-up of running shoes on wet asphalt, water splashing as the runner's foot lifts off. Shot 2: Medium tracking shot, the runner sprints down an empty street at sunrise. Shot 3: Wide shot, the runner crosses the frame, golden light flaring through buildings. Footsteps slapping pavement, rhythmic breath, upbeat electronic music. Negative prompt: blurry, watermark.
Multi-shot structure (three labeled shots), «the runner» repeated in each for consistency, audio in a separate block, negative prompt moved out of the scene description. This is the V6 sweet spot.
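A multi-shot prompt along the lines of Example 3 can be assembled from a list of shots, which keeps the numbering and the audio block consistent. This is a minimal sketch: `multi_shot_prompt` is a hypothetical helper, and the `negative_prompt` field name is an assumption rather than a documented PixVerse parameter.

```python
def multi_shot_prompt(shots, audio=""):
    """Label each shot as 'Shot N:' and append the audio block at the end."""
    labelled = [
        f"Shot {i}: {text.strip()}" for i, text in enumerate(shots, start=1)
    ]
    if audio.strip():
        labelled.append(audio.strip())
    return " ".join(labelled)


shots = [
    "Close-up of running shoes on wet asphalt, water splashing as the runner's foot lifts off.",
    "Medium tracking shot, the runner sprints down an empty street at sunrise.",
    "Wide shot, the runner crosses the frame, golden light flaring through buildings.",
]
audio = "Footsteps slapping pavement, rhythmic breath, upbeat electronic music."

request = {
    "prompt": multi_shot_prompt(shots, audio),
    # Assumed field name: the negative prompt stays out of the scene text.
    "negative_prompt": "blurry, watermark",
}
print(request["prompt"])
```

Building the prompt from a list also makes it easy to verify that every shot repeats the character descriptor before anything is submitted.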