Kling O1: how to write prompts the model actually understands

Kuaishou

Kling O1 is Kuaishou's reasoning video model on klingai.com. It generates clips up to 10 seconds long at up to 1080p and offers four specialized modes: I2V, V2V Transform, Reference-to-Video, and V2V Edit. Each mode needs its own prompting strategy; applying the wrong one gives unstable results, even with a detailed prompt.

What Kling O1 is

Kling O1 is a reasoning model: unlike previous versions, it interprets the intent of a prompt rather than just matching keywords. It runs an internal scene analysis before generation, which is especially helpful for complex compound tasks.

The four modes each call for a different prompting strategy. Image-to-Video (I2V) animates still images. Video-to-Video (V2V) Transform applies style transfers that preserve the original motion. Reference-to-Video (Ref2V) generates video with element consistency from 1–4 references. V2V Edit modifies specific elements with surgical precision while preserving everything else. Output quality is driven more by prompt structure than by word count.

  • Reasoning model: analyzes intent, not just words
  • Four modes: I2V, V2V Transform, Ref2V, V2V Edit
  • Duration up to 10 seconds, resolution up to 1080p
  • Up to 4 references in Reference-to-Video
  • Surgical precision in V2V Edit with explicit preservation anchors

General prompt structure

Baseline structure for all modes: [Subject + Primary Action] → [Environmental Context] → [Camera Movement/Perspective] → [Style/Quality Descriptors]. The key rule: start with the subject and the primary action. Each element gives the model a concrete visual anchor.

Weak prompt: «A car driving through a city at sunset». Strong: «A sleek silver sports car accelerates through a rain-slicked downtown street as golden sunset light breaks through storm clouds, camera tracking alongside at street level, cinematic lighting with volumetric light rays, photorealistic rendering». The difference is the concrete visual anchors: the car's appearance, the street condition, the lighting quality, the camera behavior, and the desired aesthetic. The sweet-spot length for this full structure is 50–150 words; I2V prompts are the exception at 20–40 words.
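
The baseline structure can be sketched as a small helper. This is a minimal illustration in Python, not part of Kling or any official tooling: `build_prompt` and `word_count` are hypothetical names, and joining the parts with commas is only one possible way to assemble them.

```python
def build_prompt(subject_action: str, environment: str, camera: str, style: str) -> str:
    """Join the four baseline parts in the recommended order (hypothetical helper)."""
    if not subject_action.strip():
        # Key rule of the baseline structure: lead with subject + primary action.
        raise ValueError("subject_action must not be empty")
    parts = [subject_action, environment, camera, style]
    # Skip empty optional parts so no stray separators appear.
    return ", ".join(p.strip() for p in parts if p.strip())


def word_count(prompt: str) -> int:
    """Whitespace-based word count, a rough proxy for prompt length."""
    return len(prompt.split())


prompt = build_prompt(
    "A sleek silver sports car accelerates through a rain-slicked downtown street",
    "as golden sunset light breaks through storm clouds",
    "camera tracking alongside at street level",
    "cinematic lighting with volumetric light rays, photorealistic rendering",
)
print(prompt)
print(word_count(prompt), "words")
```

Keeping the four parts as separate arguments makes it easy to spot a missing anchor before the prompt is sent.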

I2V and V2V Transform: different strategies

An I2V prompt describes ONLY motion, in 20–40 words. Separate subject motion from camera motion: «Camera slowly pushes in while the subject turns their head to look over their shoulder». Temporal descriptors control rhythm: «gradually», «suddenly», «smoothly», «rhythmically». Describing what's already in the image is an anti-pattern, because the model already sees it.
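
A rough sketch of the I2V rules in Python, assuming a whitespace word count is a good enough length measure. The class and function names are hypothetical, and the descriptor list is simply the one given in this article plus «slowly» from its example.

```python
from dataclasses import dataclass

# Rhythm and pacing words named in this article (illustrative, not exhaustive).
TEMPORAL_DESCRIPTORS = ("gradually", "suddenly", "smoothly", "rhythmically", "slowly")


@dataclass
class I2VMotion:
    """Keeps camera motion and subject motion as two separate fields."""
    camera_motion: str
    subject_motion: str

    def to_prompt(self) -> str:
        # Two independent clauses: how the camera moves, then who moves.
        return f"{self.camera_motion} while {self.subject_motion}"


def temporal_descriptors_used(prompt: str) -> list:
    """Return the known temporal descriptors that appear in the prompt."""
    words = [w.strip(".,;:!?\"'«»").lower() for w in prompt.split()]
    return [d for d in TEMPORAL_DESCRIPTORS if d in words]


def i2v_length_ok(prompt: str, low: int = 20, high: int = 40) -> bool:
    """True when the prompt falls inside the 20-40 word range for I2V."""
    return low <= len(prompt.split()) <= high


motion = I2VMotion(
    camera_motion="Camera slowly pushes in",
    subject_motion="the subject turns their head to look over their shoulder",
)
print(motion.to_prompt())
print(temporal_descriptors_used(motion.to_prompt()))
```

Holding the two motions in separate fields means a prompt cannot be built without stating both.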

V2V Transform applies style transfers that preserve motion. Formula: «Transform into [target style] + while maintaining original motion and composition + [specific changes]». The required anchor is «maintaining the original camera movement and subject blocking»; without it the model may inject unwanted changes into motion. Example: «Transform into a cyberpunk cityscape with neon signs, holographic advertisements, and rain-slicked streets reflecting colored lights, maintaining the original camera movement and subject blocking, add volumetric fog and lens flares».
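
The Transform formula can be expressed as a template that always includes the preservation anchor. A minimal Python sketch, with hypothetical function names; the anchor wording is taken from this article, not from any Kling API.

```python
# Preservation anchor quoted from the article's V2V Transform guidance.
ANCHOR = "maintaining the original camera movement and subject blocking"


def transform_prompt(target_style: str, changes: str = "") -> str:
    """Build «Transform into [style], [anchor], [specific changes]»."""
    prompt = f"Transform into {target_style}, {ANCHOR}"
    if changes:
        prompt += f", {changes}"
    return prompt


def has_anchor(prompt: str) -> bool:
    """Check that a hand-written Transform prompt contains the anchor."""
    return ANCHOR in prompt.lower()


prompt = transform_prompt(
    "a cyberpunk cityscape with neon signs, holographic advertisements, "
    "and rain-slicked streets reflecting colored lights",
    "add volumetric fog and lens flares",
)
print(prompt)
print(has_anchor("make it cyberpunk"))
```

Because the anchor is baked into the template, it cannot be forgotten; `has_anchor` covers prompts written by hand.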

Reference-to-Video and V2V Edit

Ref2V generates video with element consistency from 1–4 reference images. Formula: [Character from ref 1] + [Action/interaction] + [Spatial relations] + [Setting from ref N]. Each reference must be explicitly tied to an element in the scene: «Character A (reference 1) stands in the foreground left, turning to hand an object to Character B (reference 2) who enters from the right background». Consistent terminology is critical: if you say «the red jacket», don't switch to «crimson coat».
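
One way to check reference binding before submitting a Ref2V prompt is to look for every «reference N» mention. The sketch below is a Python illustration under that assumption; the function names and the «reference N» / «ref N» spelling convention are hypothetical, not something Kling enforces.

```python
import re


def referenced_indices(prompt: str) -> set:
    """Collect the numbers from mentions such as «reference 2» or «ref 2»."""
    return {int(n) for n in re.findall(r"\bref(?:erence)?\s+(\d+)\b", prompt, flags=re.IGNORECASE)}


def check_ref_binding(prompt: str, num_refs: int) -> list:
    """Return a list of binding problems; an empty list means every reference is tied in."""
    if not 1 <= num_refs <= 4:
        raise ValueError("Reference-to-Video takes 1-4 reference images")
    used = referenced_indices(prompt)
    problems = []
    for i in range(1, num_refs + 1):
        if i not in used:
            problems.append(f"reference {i} is not tied to any scene element")
    for i in sorted(used):
        if i > num_refs:
            problems.append(f"prompt mentions reference {i}, but only {num_refs} supplied")
    return problems


prompt = (
    "Character A (reference 1) stands in the foreground left, turning to hand an object "
    "to Character B (reference 2) who enters from the right background"
)
print(check_ref_binding(prompt, num_refs=2))
print(check_ref_binding(prompt, num_refs=3))
```

The check only confirms that each reference is named; whether the terminology stays consistent still needs a human read.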

V2V Edit is the mode for surgical precision. Formula: «Keeping [what to preserve] identical + change only [what to change] + [specific change description]». Start with what does NOT change: «Keeping all camera movement, subject blocking, and background elements identical, change only the sky to a dramatic sunset with purple and orange clouds». Negative instructions are allowed: «Do not alter facial features, do not change body proportions».
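
The Edit formula lends itself to a template that forces the preservation list to come first. A minimal Python sketch with hypothetical helper names; the sentence layout mirrors the examples in this article and is an assumption, not a required syntax.

```python
from collections.abc import Sequence


def oxford_join(items: Sequence) -> str:
    """Join items as «a», «a and b», or «a, b, and c»."""
    items = list(items)
    if not items:
        raise ValueError("at least one element to preserve is required")
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def edit_prompt(preserve: Sequence, target: str, change: str, forbid: Sequence = ()) -> str:
    """Preservation anchor first, then the single change, then optional negatives."""
    prompt = f"Keeping {oxford_join(preserve)} identical, change only {target} to {change}."
    if forbid:
        prompt += " Do not " + ", do not ".join(forbid) + "."
    return prompt


print(edit_prompt(
    preserve=["all camera movement", "subject blocking", "background elements"],
    target="the sky",
    change="a dramatic sunset with purple and orange clouds",
    forbid=["alter facial features", "change body proportions"],
))
```

Requiring a non-empty `preserve` list makes it impossible to produce a bare «change the sky to sunset» prompt.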

Common mistakes

  1. Applying T2V strategy to I2V

    Don't describe character appearance, clothing, or setting inside an I2V prompt: the model already sees the image, and a scene description can conflict with the actual picture. Keep the prompt to 20–40 words covering ONLY motion and scene evolution. Separate subject motion from camera motion, which is critical for O1.

  2. V2V Transform without a preservation anchor

    Without «maintaining the original camera movement and subject blocking» in a V2V Transform prompt, the model often injects unwanted changes — the subject changes pose, the camera drifts. The preservation anchor is required in every V2V Transform prompt.

  3. Inconsistent terminology in Ref2V

    If the first sentence calls it «the red jacket» and the third switches to «crimson coat», the model treats them as two different objects and can mix or swap them. Use one consistent phrasing for each referenced element throughout the prompt.

  4. V2V Edit without isolating the change

    Writing only «change the sky to sunset» with no explicit preservation anchor lets V2V Edit alter the whole scene (lighting, shadows, background colors) instead of just the target element. Start with what to preserve: «Keeping camera movement, subject blocking, and ground lighting identical, change only the sky…».

  5. Conflicting descriptions in a single prompt

    Phrases such as «Bright sunny day with dark moody shadows» or «cheerful upbeat scene with melancholic atmosphere» contain internal contradictions. As a reasoning model, O1 tries to resolve the conflict and outputs an uncontrolled mix. Keep the description stylistically consistent, or state the progression explicitly («scene transitions from bright morning to moody evening»).
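
Conflicting descriptors are the one mistake in this list that a simple word check can catch. The Python sketch below is only an illustration: the pair list and progression markers are hypothetical, hand-picked from the examples above, and a real checker would need a much larger vocabulary.

```python
import re

# Hypothetical pairs of descriptors that pull the mood in opposite directions.
CONFLICTING_PAIRS = [
    ("bright", "dark"),
    ("sunny", "moody"),
    ("cheerful", "melancholic"),
    ("upbeat", "melancholic"),
]

# Hypothetical phrases that signal an intentional progression rather than a clash.
PROGRESSION_MARKERS = ("transitions from", "gradually", "by the end")


def find_conflicts(prompt: str) -> list:
    """Return the conflicting pairs present in a prompt, unless a progression is stated."""
    text = prompt.lower()
    if any(marker in text for marker in PROGRESSION_MARKERS):
        return []
    words = set(re.findall(r"[a-z]+", text))
    return [(a, b) for a, b in CONFLICTING_PAIRS if a in words and b in words]


print(find_conflicts("Bright sunny day with dark moody shadows"))
print(find_conflicts("scene transitions from bright morning to dark evening"))
```

An explicit progression phrase switches the check off, matching the advice to state transitions rather than mix moods.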

Before / after examples

Example 1

Before

I2V: «person walks to the sea»

After

Walks slowly toward the ocean with relaxed steps, hair and clothing moving gently in the warm sea breeze, waves rolling onto shore in the background at a steady rhythm, camera slowly pushes in from behind while gradually tilting up to reveal the horizon

The I2V strategy applied correctly: motion only, with no appearance description; subject motion separated from camera motion; temporal descriptors («slowly», «gradually»); and a layered description (foreground subject, background waves).

Example 2

Before

V2V Transform: «make it cyberpunk»

After

Transform into a cyberpunk cityscape with neon signs, holographic advertisements floating between buildings, and rain-slicked streets reflecting saturated magenta and cyan colored lights, maintaining the original camera movement and subject blocking. Add volumetric fog at street level, lens flares on neon signs, and chromatic aberration on bright lights. High-contrast Blade Runner aesthetic with warm amber and cool blue color grading.

Explicit preservation anchor «maintaining the original camera movement and subject blocking», concrete style anchors (Blade Runner), effect and color-grading descriptions.

Example 3

Before

V2V Edit: «change the sky to sunset»

After

Keeping all camera movement, subject blocking, foreground objects, and ground lighting identical, change only the sky to a dramatic sunset with deep purple, orange, and pink cloud formations. Increase contrast in the sky by 15% to match the dramatic mood. Do not alter facial features, do not change body proportions, do not modify the lighting direction on the subject.

V2V Edit structure: first what to preserve (camera, blocking, foreground, ground lighting), then what to change (sky only), then negative instructions as a safeguard. The «change only» wording works like a mask that isolates the change.

Frequently asked

How is Kling O1 different from Kling 2.6 Pro and 3.0?
Kling O1 is a reasoning model: it runs internal scene analysis before generation and understands intent rather than just keywords. It offers four specialized modes (I2V, V2V Transform, Ref2V, V2V Edit), each with its own prompting strategy. 2.6 Pro and 3.0 are general text-to-video (T2V) and I2V models; O1 is tuned for transformations and editing with surgical precision.
What's the difference between V2V Transform and V2V Edit?
V2V Transform changes the overall style or atmosphere of the whole video (realism → anime, day → night, modern → cyberpunk). It preserves the original motion and rewrites the visual aesthetic. V2V Edit is a surgical modification of a specific element (sky swap, clothing color change, object removal) while keeping EVERYTHING else. Edit requires explicit preservation anchors and often negative instructions.
How many references should I use in Reference-to-Video?
1–4 reference images. Each reference must be explicitly tied to a scene element in the prompt: «Character A (reference 1) stands in the foreground left». Spatial relationships between references are critical — who's where, who's turning toward whom. Consistent terminology throughout the prompt is required: don't switch from «red jacket» to «crimson coat» in the next sentence.
Why separate subject motion from camera motion?
In I2V and other modes, O1 responds strongly to separating «who moves» from «how the camera moves». «Camera slowly pushes in while the subject turns their head» — the model clearly sees that camera and subject are two independent elements. Without separation a reasoning model can confuse who initiates motion and generate a desynced scene.
Which temporal descriptors work?
Rhythm and pacing markers: «gradually», «suddenly», «smoothly», «rhythmically». Point-in-time markers: «at the 3-second mark», «by the end of the clip», «in the first 2 seconds». Progressions: «light fog at the start gradually thickens to dense mist by the end». O1 as a reasoning model handles temporal progression and lighting choreography especially well.
Can I write prompts in languages other than English?
You can, but quality drops. Kling O1 responds best to structured English prompts, especially because of technical cinematic vocabulary and mode-specific formulas (V2V Transform, V2V Edit). The reasoning mode works in any language, but English exposes a wider vocabulary of style anchors. For production, translate to English.
Does Opten support Kling O1?
Yes, the Opten extension auto-detects Kling O1 and its four modes inside klingai.com. Each mode is scored with its own strategy: I2V — short motion-only prompt; V2V Transform — presence of a preservation anchor; Ref2V — reference-to-element binding and consistent terminology; V2V Edit — change isolation with explicit «Keeping … identical». One click delivers a rewrite in the correct structure.

Ready to write Kling O1 prompts in one click?

  • Auto-detects the model inside its native interface
  • Scores every line of your prompt
  • One-click rewrite into the correct structure

Pro — $2.99/month or ₽199/month · cancel anytime

Stop Guessing. Generate On The First Try.

Install Opten in 30 seconds and score your next prompt.

Opten is a Chrome extension that scores AI prompts for the specific model. Supports 60+ image and video models — Midjourney, GPT Image 2, Kling, Sora, Nano Banana, Flux — and rewrites them in one click inside the Syntx, Higgsfield, and Freepik interfaces. From $2.99/month.

© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672