Kling O1: how to write prompts the model actually understands
Kuaishou
Kling O1 is Kuaishou's reasoning video model on klingai.com. Duration up to 10 seconds, resolution up to 1080p, four specialized modes: I2V, V2V Transform, Reference-to-Video, and V2V Edit. Each mode needs its own prompting strategy — applying the wrong strategy gives unstable results, even with a detailed prompt.
What Kling O1 is
Kling O1 is a reasoning model: unlike previous versions, it interprets the intent of a prompt rather than just matching keywords. It runs an internal scene analysis before generation, which is especially helpful for complex, compound tasks.
It has four modes, each with its own prompting strategy. Image-to-Video (I2V) animates still images. Video-to-Video (V2V) Transform applies style transfers that preserve the original motion. Reference-to-Video (Ref2V) generates video with element consistency from 1–4 references. V2V Edit offers surgical precision: it modifies specific elements while preserving everything else. Output quality is driven more by prompt structure than by word count.
- Reasoning model: analyzes intent, not just words
- Four modes: I2V, V2V Transform, Ref2V, V2V Edit
- Duration up to 10 seconds, resolution up to 1080p
- Up to 4 references in Reference-to-Video
- Surgical precision in V2V Edit with explicit preservation anchors
General prompt structure
Baseline structure for all modes: [Subject + Primary Action] → [Environmental Context] → [Camera Movement/Perspective] → [Style/Quality Descriptors]. The key rule: start with the subject and the primary action. Each element gives the model a concrete visual anchor.
Weak prompt: «A car driving through a city at sunset». Strong: «A sleek silver sports car accelerates through a rain-slicked downtown street as golden sunset light breaks through storm clouds, camera tracking alongside at street level, cinematic lighting with volumetric light rays, photorealistic rendering». The difference lies in the concrete visual anchors: car appearance, street condition, lighting quality, camera behavior, and the desired aesthetic. The sweet spot for length is 50–150 words.
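The four-part structure can be kept honest with a small helper. The sketch below is only an illustration: `ScenePrompt`, `word_count`, and `in_sweet_spot` are hypothetical names, nothing here calls a Kling API, and the 50–150 range is simply the guideline quoted above.

```python
from dataclasses import dataclass

# Length range quoted in the text; treated here as a soft guideline.
SWEET_SPOT_WORDS = (50, 150)


@dataclass
class ScenePrompt:
    """Hypothetical container for the four-part baseline structure."""
    subject_action: str  # subject + primary action, always first
    environment: str     # environmental context
    camera: str          # camera movement / perspective
    style: str           # style / quality descriptors

    def render(self) -> str:
        if not self.subject_action.strip():
            raise ValueError("start with the subject and the primary action")
        parts = [self.subject_action, self.environment, self.camera, self.style]
        # Skip empty optional parts but keep the prescribed order.
        return ", ".join(p.strip() for p in parts if p.strip())


def word_count(prompt: str) -> int:
    """Count whitespace-separated words."""
    return len(prompt.split())


def in_sweet_spot(prompt: str) -> bool:
    """True when the prompt length falls inside the quoted range."""
    low, high = SWEET_SPOT_WORDS
    return low <= word_count(prompt) <= high


prompt = ScenePrompt(
    subject_action="A sleek silver sports car accelerates through a rain-slicked downtown street",
    environment="golden sunset light breaking through storm clouds",
    camera="camera tracking alongside at street level",
    style="cinematic lighting with volumetric light rays, photorealistic rendering",
).render()
print(prompt)
print(word_count(prompt), "words")
```

Keeping the parts as separate fields makes it harder to bury the subject behind style descriptors, since `render` always emits them in the same order.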
I2V and V2V Transform: different strategies
An I2V prompt describes ONLY motion, in 20–40 words. Separate subject motion from camera motion: «Camera slowly pushes in while the subject turns their head to look over their shoulder». Temporal descriptors control rhythm: «gradually», «suddenly», «smoothly», «rhythmically». Describing what's already in the image is an anti-pattern.
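As a rough self-check for I2V prompts, the sketch below joins camera motion and subject motion as two explicit clauses and reports length and temporal descriptors. The function names are hypothetical, and the descriptor set is an assumption: the four words listed above plus «slowly» from the example.

```python
import re

I2V_WORD_RANGE = (20, 40)  # range quoted in the text
# Descriptors listed in the text, plus "slowly" from its example (assumption).
TEMPORAL_DESCRIPTORS = {"gradually", "suddenly", "smoothly", "rhythmically", "slowly"}


def build_i2v_prompt(camera_motion: str, subject_motion: str) -> str:
    """Keep camera motion and subject motion as two explicit clauses."""
    return f"{camera_motion.strip()} while {subject_motion.strip()}"


def describe_i2v_prompt(prompt: str) -> dict:
    """Report word count, range fit, and which temporal descriptors appear."""
    count = len(prompt.split())
    low, high = I2V_WORD_RANGE
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return {
        "words": count,
        "in_range": low <= count <= high,
        "temporal": sorted(words & TEMPORAL_DESCRIPTORS),
    }


prompt = build_i2v_prompt(
    camera_motion="Camera slowly pushes in",
    subject_motion="the subject turns their head to look over their shoulder",
)
report = describe_i2v_prompt(prompt)
print(prompt)
print(report)
```

The report is only a hint. It cannot tell whether a word describes motion or appearance, so the anti-pattern of re-describing the image still needs a human read.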
V2V Transform applies style transfers that preserve motion. Formula: «Transform into [target style] + while maintaining original motion and composition + [specific changes]». The required anchor is «maintaining the original camera movement and subject blocking». Without it, the model may introduce unwanted changes to the motion. Example: «Transform into a cyberpunk cityscape with neon signs, holographic advertisements, and rain-slicked streets reflecting colored lights, maintaining the original camera movement and subject blocking, add volumetric fog and lens flares».
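One way to make sure the anchor is never forgotten is to bake it into a template. This is a minimal sketch with a hypothetical `build_transform_prompt` helper; the anchor string is copied from the text, and the comma-joined layout follows the example above.

```python
# Anchor phrase quoted in the text; every generated prompt includes it.
PRESERVATION_ANCHOR = "maintaining the original camera movement and subject blocking"


def build_transform_prompt(target_style: str, extra_changes: str = "") -> str:
    """Assemble: target style, then the preservation anchor, then optional changes."""
    if not target_style.strip():
        raise ValueError("a target style is required")
    prompt = f"Transform into {target_style.strip()}, {PRESERVATION_ANCHOR}"
    if extra_changes.strip():
        prompt += f", {extra_changes.strip()}"
    return prompt


prompt = build_transform_prompt(
    target_style=(
        "a cyberpunk cityscape with neon signs, holographic advertisements, "
        "and rain-slicked streets reflecting colored lights"
    ),
    extra_changes="add volumetric fog and lens flares",
)
print(prompt)
```

With these inputs the helper reproduces the example prompt from the text word for word.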
Reference-to-Video and V2V Edit
Ref2V generates video with element consistency from 1–4 reference images. Formula: [Character from ref 1] + [Action/interaction] + [Spatial relations] + [Setting from ref N]. Each reference must be explicitly tied to an element in the scene: «Character A (reference 1) stands in the foreground left, turning to hand an object to Character B (reference 2) who enters from the right background». Consistent terminology is critical: if you say «the red jacket», don't switch to «crimson coat».
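A quick check can catch references that were uploaded but never tied to anything in the prompt. The sketch below assumes the «(reference N)» wording from the example; `unbound_references` is a hypothetical helper, and whether the model itself requires this exact wording is not something the text confirms.

```python
import re
from typing import List

MAX_REFERENCES = 4  # upper limit stated in the text


def unbound_references(prompt: str, reference_count: int) -> List[int]:
    """Return the reference numbers that the prompt never mentions."""
    if not 1 <= reference_count <= MAX_REFERENCES:
        raise ValueError(f"Ref2V takes 1 to {MAX_REFERENCES} references")
    mentioned = {
        int(number)
        for number in re.findall(r"\breference\s+(\d+)\b", prompt, flags=re.IGNORECASE)
    }
    return [n for n in range(1, reference_count + 1) if n not in mentioned]


prompt = (
    "Character A (reference 1) stands in the foreground left, turning to hand an "
    "object to Character B (reference 2) who enters from the right background"
)
print(unbound_references(prompt, 2))
print(unbound_references(prompt, 3))
```

An empty list means every uploaded reference is named at least once; any number in the result is a reference the prompt leaves unexplained.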
V2V Edit is for surgical precision. Formula: «Keeping [what to preserve] identical + change only [what to change] + [specific change description]». Start with what does NOT change: «Keeping all camera movement, subject blocking, and background elements identical, change only the sky to a dramatic sunset with purple and orange clouds». Negative instructions are allowed: «Do not alter facial features, do not change body proportions».
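The preserve-first ordering is easy to enforce in a template. The following is a sketch under assumptions: `build_edit_prompt` and `oxford_join` are hypothetical helpers, and the «to» connector between the target and its description is taken from the example rather than from any documented syntax.

```python
from typing import Iterable


def oxford_join(items: Iterable[str]) -> str:
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    items = list(items)
    if not items:
        raise ValueError("list what to preserve before describing the change")
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def build_edit_prompt(preserve: Iterable[str], target: str, change: str,
                      do_not: Iterable[str] = ()) -> str:
    """Preservation list first, then the single target, then optional negatives."""
    prompt = f"Keeping {oxford_join(preserve)} identical, change only {target} to {change}."
    negatives = ", ".join(f"do not {item}" for item in do_not)
    if negatives:
        # Capitalize the first negative so it reads as its own sentence.
        prompt += " " + negatives[0].upper() + negatives[1:] + "."
    return prompt


prompt = build_edit_prompt(
    preserve=["all camera movement", "subject blocking", "background elements"],
    target="the sky",
    change="a dramatic sunset with purple and orange clouds",
    do_not=["alter facial features", "change body proportions"],
)
print(prompt)
```

Because `oxford_join` rejects an empty list, the helper cannot produce an edit prompt that skips the preservation clause.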
Common mistakes
1. Applying a text-to-video (T2V) strategy to I2V
Describing character appearance, clothing, or setting inside an I2V prompt is redundant: the model already sees the image, and a scene description can conflict with the actual picture. Keep the prompt to 20–40 words and describe ONLY motion and scene evolution. Separate subject motion from camera motion, which is critical for O1.
2. V2V Transform without a preservation anchor
Without «maintaining the original camera movement and subject blocking» in a V2V Transform prompt, the model often injects unwanted changes — the subject changes pose, the camera drifts. The preservation anchor is required in every V2V Transform prompt.
3. Inconsistent terminology in Ref2V
If the first sentence calls it «the red jacket» and the third switches to «crimson coat», the model treats them as two different objects and can mix or swap them. Use one consistent phrasing for each referenced element throughout the prompt.
4. V2V Edit without isolating the change
Just writing «change the sky to sunset» without an explicit preservation anchor makes V2V Edit alter the whole scene (lighting, shadows, background colors) instead of only the target element. Start with what to preserve: «Keeping camera movement, subject blocking, and ground lighting identical, change only the sky…».
5. Conflicting descriptions in a single prompt
Phrases such as «Bright sunny day with dark moody shadows» or «cheerful upbeat scene with melancholic atmosphere» are internal contradictions. As a reasoning model, O1 tries to resolve the conflict and outputs an uncontrolled mix. Keep the description stylistically consistent, or state the progression explicitly («scene transitions from bright morning to moody evening»).
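The five mistakes above lend themselves to a simple pre-flight check. The sketch below is a rough heuristic, not a validator: `lint_prompt`, the mode labels, and the short list of conflicting word pairs are all assumptions for illustration, and the alias map has to be supplied by whoever writes the prompt.

```python
import re
from typing import Dict, List, Optional

TRANSFORM_ANCHOR = "maintaining the original camera movement and subject blocking"
I2V_WORD_RANGE = (20, 40)
# Illustrative pairs only (an assumption); extend with your own vocabulary.
CONFLICTING_TERMS = [("bright", "dark"), ("sunny", "moody"), ("cheerful", "melancholic")]


def lint_prompt(prompt: str, mode: str,
                aliases: Optional[Dict[str, List[str]]] = None) -> List[str]:
    """Return human-readable warnings for one prompt in the given mode."""
    text = prompt.lower()
    words = set(re.findall(r"[a-z]+", text))
    issues = []

    # Mistake 1: I2V prompt outside the 20-40 word range.
    if mode == "i2v":
        low, high = I2V_WORD_RANGE
        if not low <= len(prompt.split()) <= high:
            issues.append("I2V prompt is outside 20-40 words")

    # Mistake 2: V2V Transform without the preservation anchor.
    if mode == "v2v_transform" and TRANSFORM_ANCHOR not in text:
        issues.append("missing preservation anchor")

    # Mistake 3: a canonical term and one of its variants both appear.
    for canonical, variants in (aliases or {}).items():
        drift = [v for v in variants if v.lower() in text]
        if canonical.lower() in text and drift:
            issues.append(f"inconsistent naming: '{canonical}' vs {drift}")

    # Mistake 4: V2V Edit that does not open with what to preserve.
    if mode == "v2v_edit" and not text.lstrip().startswith("keeping"):
        issues.append("edit prompt does not start with what to preserve")

    # Mistake 5: conflicting descriptors, unless a progression is stated.
    if "transitions from" not in text:
        for first, second in CONFLICTING_TERMS:
            if first in words and second in words:
                issues.append(f"possible contradiction: '{first}' and '{second}'")
    return issues


print(lint_prompt("make it cyberpunk", "v2v_transform"))
print(lint_prompt("Bright sunny day with dark moody shadows", "ref2v"))
```

A clean result only means none of these patterns matched. Word-pair matching will miss subtler contradictions and can flag deliberate contrasts, so treat the warnings as prompts for a second look.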
Before / after examples
Example 1
Before
I2V: «person walks to the sea»
After
Walks slowly toward the ocean with relaxed steps, hair and clothing moving gently in the warm sea breeze, waves rolling onto shore in the background at a steady rhythm, camera slowly pushes in from behind while gradually tilting up to reveal the horizon
The prompt now fits the I2V mode: motion only, no appearance description; subject motion separated from camera motion; temporal descriptors «slowly», «gradually»; layered description (foreground subject, background waves).
Example 2
Before
V2V Transform: «make it cyberpunk»
After
Transform into a cyberpunk cityscape with neon signs, holographic advertisements floating between buildings, and rain-slicked streets reflecting saturated magenta and cyan colored lights, maintaining the original camera movement and subject blocking. Add volumetric fog at street level, lens flares on neon signs, and chromatic aberration on bright lights. High-contrast Blade Runner aesthetic with warm amber and cool blue color grading.
Explicit preservation anchor «maintaining the original camera movement and subject blocking», concrete style anchors (Blade Runner), effect and color-grading descriptions.
Example 3
Before
V2V Edit: «change the sky to sunset»
After
Keeping all camera movement, subject blocking, foreground objects, and ground lighting identical, change only the sky to a dramatic sunset with deep purple, orange, and pink cloud formations. Increase contrast in the sky by 15% to match the dramatic mood. Do not alter facial features, do not change body proportions, do not modify the lighting direction on the subject.
V2V Edit structure: first what to preserve (camera, blocking, foreground, ground lighting), then what to change (sky only), then negative instructions as a guarantee. Masking language isolates the change.