Kling 3.0: how to write prompts the model actually understands
Kuaishou
Kling 3.0 is Kuaishou's flagship video model on klingai.com. It generates clips up to 15 seconds long, supports Multi-shot with up to 6 shots in one generation, and produces native dialogue and audio with voice-tone control, multilingual accents, and code-switching. It is built to understand directorial intent, not just an object list.
What's new in Kling 3.0
Kling 3.0 is a major upgrade to Kuaishou's video model. Single-generation length grew from 10 to 15 seconds, enough for real narrative progression. The headline feature is Multi-shot: up to 6 shots in one generation with automatic angle variation and preserved narrative continuity.
Native audio generation appears for the first time in the family: dialogue with a unique voice tone per character, ambient sound, music, SFX. Multilingual dialogue is supported, with accents and code-switching inside a single scene. In I2V, the model reliably preserves identity, layout, and even text from the source image, which is critical for branded content.
- Duration up to 15 seconds (vs 10 in Kling 2.6 Pro)
- Multi-shot: up to 6 shots with narrative continuity
- Native audio: dialogue, ambient, music, SFX
- Multilingual dialogue with accents and code-switching
- Text and logos from the source image preserved in I2V
Prompt structure as a director's script
Kling 3.0 is built to understand cinematic intent, not just visual descriptions. Prompts should be written as directorial instructions, not as an object list.
Optimal structure: [Scene Setup + Atmosphere] + [Character Introduction] + [Action/Dialogue Sequence] + [Camera Direction] + [Audio/Sound Design]. Anchor characters at the start of the prompt and keep their descriptions consistent across all shots — the model locks in character, object, and environment traits.
Describe camera motion explicitly: «tracking», «following», «freezing», «panning», «moving in sync». Long takes work better when the camera's relationship to the subject is clearly described. The sweet spot for prompt length is 50–200 words (longer for multi-shot).
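The five-part structure above can be kept honest with a small helper. This is a minimal sketch, not part of any Kling tooling: `ScenePrompt`, `word_count`, and `in_sweet_spot` are hypothetical names introduced here, and the only facts taken from the guide are the order of the parts and the 50–200 word range.

```python
from dataclasses import dataclass


@dataclass
class ScenePrompt:
    """Five prompt parts in the order the guide recommends (hypothetical helper)."""
    scene: str      # scene setup + atmosphere
    character: str  # character introduction, anchored early in the prompt
    action: str     # action / dialogue sequence
    camera: str     # explicit camera direction
    audio: str      # audio / sound design

    def render(self) -> str:
        """Join the non-empty parts into one prompt, preserving the order."""
        parts = [self.scene, self.character, self.action, self.camera, self.audio]
        return " ".join(p.strip() for p in parts if p.strip())


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def in_sweet_spot(text: str, low: int = 50, high: int = 200) -> bool:
    """True when the prompt length falls inside the 50-200 word range."""
    return low <= word_count(text) <= high
```

Filling the fields one at a time makes it harder to forget the camera or audio part, and `in_sweet_spot(prompt.render())` gives a quick length check before submitting. The bounds are parameters because multi-shot prompts may reasonably run longer.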
Multi-shot: up to 6 shots in one generation
Multi-shot is the key feature of Kling 3.0: it produces a storyboard of several shots in one generation while keeping narrative continuity.
Formula:
Master Prompt: [overall scene description]
Multi shot Prompt 1: [shot 1 description] (Duration: Xs)
Multi shot Prompt 2: [shot 2 description] (Duration: Xs)
Each shot must have its own framing, action, and duration. The Master Prompt sets the overall context. The model automatically varies angles and composition while preserving narrative continuity between shots. Supported types: profile shots, macro close-ups, tracking shots, POV, shot-reverse-shot for dialogue. Multi-shot without clear shot markers is the mode's main anti-pattern.
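The formula can also be assembled programmatically so that every shot gets its marker and duration. The sketch below is illustrative only: `Shot` and `build_multishot_prompt` are hypothetical names. It assumes that per-shot durations add up to the total clip length (so their sum is checked against the 15-second limit) and that separating blocks with newlines is acceptable; neither point is confirmed by the guide.

```python
from dataclasses import dataclass

MAX_SHOTS = 6           # from the guide: up to 6 shots per generation
MAX_TOTAL_SECONDS = 15  # from the guide: up to 15 seconds per generation


@dataclass
class Shot:
    description: str  # framing + action for this shot
    seconds: int      # duration of this shot


def build_multishot_prompt(master: str, shots: list[Shot]) -> str:
    """Render a Master Prompt plus numbered, explicitly marked shot blocks."""
    if not 1 <= len(shots) <= MAX_SHOTS:
        raise ValueError(f"expected 1-{MAX_SHOTS} shots, got {len(shots)}")
    # Assumption: shot durations sum to the length of the whole generation.
    total = sum(shot.seconds for shot in shots)
    if total > MAX_TOTAL_SECONDS:
        raise ValueError(f"total duration {total}s exceeds {MAX_TOTAL_SECONDS}s")
    lines = [f"Master Prompt: {master.strip()}"]
    for number, shot in enumerate(shots, start=1):
        lines.append(
            f"Multi shot Prompt {number}: {shot.description.strip()} "
            f"(Duration: {shot.seconds}s)"
        )
    return "\n".join(lines)
```

Because the markers are generated rather than typed by hand, a prompt built this way cannot fall into the unmarked-shots anti-pattern.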
Dialogue with native audio
Kling 3.0 supports character-anchored dialogue generation. Four required principles:
Structural naming: unique character labels throughout the prompt — `[Character A: Black-suited Agent]` and `[Character B: Female Assistant]`. Pronouns instead of labels («he says…») are an anti-pattern.
Visual anchoring: physical action BEFORE the line. «The agent slams his hand on the table. [Agent, angrily]: "Where is the truth?"»
Audio details: a unique tone and emotion per character. «[Agent, raspy, deep voice]: "Don't move." [Assistant, clear, fearful voice]: "I'm scared."»
Temporal control: linking words between lines. «[Agent]: "Why?" Immediately, [Assistant]: "Because it's time."» Without a connector, the model can merge lines.
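The four principles map naturally onto a small data structure. The following is a sketch under stated assumptions: `Line`, `render_line`, and `render_dialogue` are hypothetical helpers, and the exact punctuation of the rendered tag simply mirrors the examples in this guide rather than any documented Kling syntax.

```python
from dataclasses import dataclass


@dataclass
class Line:
    label: str           # unique character label, e.g. "Agent"
    tone: str            # voice and emotion, e.g. "raspy deep voice, cold"
    action: str          # physical action that comes BEFORE the spoken line
    text: str            # the spoken words
    connector: str = ""  # linking word before the label, e.g. "Immediately"


def render_line(line: Line) -> str:
    """Render action first, then an optional connector, then the labeled line."""
    if not line.action.strip():
        raise ValueError("each line needs a preceding physical action")
    tag = f"[{line.label}, {line.tone}]" if line.tone else f"[{line.label}]"
    lead = f"{line.connector}, " if line.connector else ""
    return f'{line.action.strip()} {lead}{tag}: "{line.text}"'


def render_dialogue(lines: list[Line]) -> str:
    """Join rendered lines into one dialogue passage."""
    return " ".join(render_line(line) for line in lines)
```

For instance, `Line("Assistant", "clear fearful voice", "The assistant flinches.", "I told you everything I know.", connector="Immediately")` renders with the action first, then the connector, then the labeled and toned line, which is the order the principles call for.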
Common mistakes
1. Pronouns instead of character labels in dialogue
«He says…», «Then she replies…» — the model doesn't know who's speaking and merges lines or changes voice between them. Use unique structural labels `[Character A: description]` and `[Character B: description]` throughout the prompt. Every line gets an explicit character label.
2. Multi-shot without marking individual shots
Describing several scenes in a row without `Multi shot Prompt 1:`, `Multi shot Prompt 2:` markers makes the model treat it as one long shot and stumble on transitions. Each shot needs to be a separate block with its own framing, action, and duration.
3. Dialogue without visual anchoring
If the line comes first and the action after — «[Agent]: "Where is the truth?" The agent slams the table» — the model often desyncs sound and motion. Correct order: physical action BEFORE the line. This gives the model a clear audio-visual anchor.
4. Dialogue without tonal descriptors
«[Agent]: "Don't move"» without tonal information produces a flat, TTS-like voice. Add voice characteristics: «[Agent, raspy deep voice, threatening]: "Don't move"». This is what unlocks Kling 3.0's native-audio advantage: emotional and tonal control.
5. Describing the scene in an I2V prompt
As in other Kling models, in Image-to-Video the model already sees the image and preserves its layout, text, and identity. Describing appearance or setting inside an I2V prompt conflicts with the actual picture. Keep an I2V prompt to 20–40 words and describe ONLY motion and scene evolution.
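Several of these mistakes can be caught mechanically before a prompt is submitted. The linter below is a rough sketch, not an official tool: `lint_prompt` and its regular expressions are hypothetical, and they only recognize the bracketed label and `Multi shot Prompt N:` conventions used in this guide. It covers mistakes 1, 2, 4, and 5; mistake 3 (action before the line) needs a human read.

```python
import re

# A spoken line looks like [Label]: "..." or [Label, tone]: "..."
LINE_TAG = re.compile(r'\[([^\[\]]+)\]:\s*"')
# Pronoun-led speech such as "he says" or "then she replies"
PRONOUN_SPEECH = re.compile(
    r"\b(he|she|they)\s+(says?|replies|reply|asks?|answers?)\b", re.IGNORECASE
)
SHOT_MARKER = re.compile(r"Multi shot Prompt \d+:")


def lint_prompt(prompt: str, *, multi_shot: bool = False, i2v: bool = False) -> list[str]:
    """Return a list of human-readable issues; an empty list means no findings."""
    issues = []
    if PRONOUN_SPEECH.search(prompt):
        issues.append("pronoun used for a speaker; use a character label")
    if multi_shot and not SHOT_MARKER.search(prompt):
        issues.append("multi-shot prompt has no 'Multi shot Prompt N:' markers")
    for tag in LINE_TAG.findall(prompt):
        # A tag without a comma has a label but no voice or emotion descriptor.
        if "," not in tag:
            issues.append(f"line for [{tag}] has no tonal descriptor")
    if i2v and not 20 <= len(prompt.split()) <= 40:
        issues.append("I2V prompt should be 20-40 words")
    return issues
```

The checks are deliberately simple heuristics, so treat a clean result as "nothing obvious found" rather than proof that the prompt is well formed.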
Before / after examples
Example 1
Before
woman in a kitchen at night
After
A dim kitchen late at night, warm tungsten light from overhead fixture casting long shadows. [Character A: Sarah, mid-30s woman in a worn flannel bathrobe, exhausted eyes, hair in a messy ponytail]. Sarah sets a ceramic plate down too hard on the granite counter. Sharp ceramic clink. She turns slowly from the sink, exhaustion evident in her posture. Camera holds steady medium shot at eye level, shallow depth of field with soft bokeh from the kitchen window behind her. Quiet ambient sound of a ticking wall clock, distant traffic muffled through the window.
Full structure: scene setup + atmosphere, an anchored character up front, visual anchoring before sound, camera described separately from subject, audio design (clink, ticking clock, traffic).
Example 2
Before
Multi-shot: «joker dances down stairs»
After
Master Prompt: A theatrical figure begins his iconic dance descent down a long flight of concrete stairs in a moody urban setting at dusk, cinematic gritty atmosphere.
Multi shot Prompt 1: A man in a vibrant red suit starts dancing at the top of the stairs, taking first exaggerated steps down, arms spreading wide, head tilting back in laughter, wide shot from below capturing the full silhouette against amber sky (Duration: 5 seconds).
Multi shot Prompt 2: Continuing wild dance down concrete steps, spinning and kicking, coat tails flapping dramatically, reaching the bottom step with triumphant arms-raised pose, medium tracking shot following from the side with smooth dolly motion (Duration: 5 seconds).
Pure Multi-shot: Master Prompt sets the overall context, two marked shots with their own framing, action, and duration.
Example 3
Before
dialogue between an agent and an assistant
After
A tense interrogation room with a single overhead bulb casting harsh shadows. [Character A: Agent, black-suited man in his 40s with closely-cropped grey hair]. [Character B: Assistant, young woman in a beige sweater, nervous expression]. The agent slams his hand on the metal table, the impact echoes sharply. [Agent, raspy deep voice, cold]: "Where is the truth?" The assistant flinches, looks down at her trembling hands. Immediately, [Assistant, clear fearful voice]: "I told you everything I know." Camera holds tight medium shot, shallow depth of field, dim tungsten lighting with hard shadows.
Dialogue by all the rules: unique character labels, visual action before each line, tonal voice descriptors, the linking word «Immediately» between lines.