How is Kling O1 different from Kling 2.6 Pro and 3.0?

Kling O1 is a reasoning model: it runs an internal scene analysis before generation and interprets the intent of a prompt rather than just its keywords. It has four specialized modes (I2V, V2V Transform, Ref2V, V2V Edit), each with its own prompting strategy. Kling 2.6 Pro and 3.0 are general T2V/I2V models, whereas O1 is tuned for transformations and for precise, localized editing.

What's the difference between V2V Transform and V2V Edit?

V2V Transform changes the overall style or atmosphere of the whole video (realism → anime, day → night, modern → cyberpunk): it preserves the original motion and rewrites the visual aesthetic. V2V Edit is a surgical modification of one specific element (sky swap, clothing color change, object removal) that keeps everything else unchanged. Edit prompts need explicit preservation anchors and often negative instructions.

How many references should I use in Reference-to-Video?

Use 1–4 reference images. Each reference must be explicitly tied to a scene element in the prompt: «Character A (reference 1) stands in the foreground left». Spatial relationships between references are critical: who stands where and who turns toward whom. Keep terminology consistent throughout the prompt: don't switch from «red jacket» to «crimson coat» in the next sentence.

Why separate subject motion from camera motion?

In I2V and other modes, O1 responds strongly when «who moves» is separated from «how the camera moves». With a prompt such as «Camera slowly pushes in while the subject turns their head», the model can treat camera and subject as two independent elements. Without that separation, it can confuse which element initiates the motion and generate a desynced scene.

Which temporal descriptors work?

Rhythm and pacing markers: «gradually», «suddenly», «smoothly», «rhythmically». Point-in-time markers: «at the 3-second mark», «by the end of the clip», «in the first 2 seconds». Progressions: «light fog at the start gradually thickens to dense mist by the end». O1 as a reasoning model handles temporal progression and lighting choreography especially well.

Can I write prompts in languages other than English?

You can, but quality drops. Kling O1 responds best to structured English prompts, especially because of technical cinematic vocabulary and mode-specific formulas (V2V Transform, V2V Edit). The reasoning mode works in any language, but English exposes a wider vocabulary of style anchors. For production, translate to English.

Does Opten support Kling O1?

Yes. The Opten extension auto-detects Kling O1 and its four modes inside klingai.com. Each mode is scored against its own strategy: I2V on a short motion-only prompt; V2V Transform on the presence of a preservation anchor; Ref2V on reference-to-element binding and consistent terminology; V2V Edit on change isolation with an explicit «Keeping … identical» clause. One click delivers a rewrite in the correct structure.

Kling O1: how to write prompts the model actually understands

Kuaishou · Updated: May 19, 2026

Kling O1 is Kuaishou's reasoning video model on klingai.com. It generates clips up to 10 seconds long at up to 1080p and offers four specialized modes: I2V, V2V Transform, Reference-to-Video, and V2V Edit. Each mode needs its own prompting strategy; applying the wrong one gives unstable results, even with a detailed prompt.

What Kling O1 is

Kling O1 is a reasoning model: unlike previous versions, it interprets the intent of a prompt rather than just its keywords. It runs an internal scene analysis before generation, which helps most on complex compound tasks.

There are four modes, each with its own prompting strategy. Image-to-Video animates still images. Video-to-Video Transform applies style transfers that preserve the original motion. Reference-to-Video generates with element consistency from 1–4 references. V2V Edit modifies specific elements with surgical precision while preserving everything else. Output quality is driven more by prompt structure than by word count.

Reasoning model: analyzes intent, not just words
Four modes: I2V, V2V Transform, Ref2V, V2V Edit
Duration up to 10 seconds, resolution up to 1080p
Up to 4 references in Reference-to-Video
Surgical precision in V2V Edit with explicit preservation anchors

General prompt structure

Baseline structure for all modes: [Subject + Primary Action] → [Environmental Context] → [Camera Movement/Perspective] → [Style/Quality Descriptors]. The key rule: start with the subject and the primary action. Each element gives the model a concrete visual anchor.

Weak prompt: «A car driving through a city at sunset». Strong prompt: «A sleek silver sports car accelerates through a rain-slicked downtown street as golden sunset light breaks through storm clouds, camera tracking alongside at street level, cinematic lighting with volumetric light rays, photorealistic rendering». The difference is the concrete visual anchors: car appearance, street condition, lighting quality, camera behavior, and the desired aesthetic. The sweet-spot length is 50–150 words.
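The four-part baseline structure can be sketched as a small helper. This is a minimal illustration, not part of Kling or any official tooling: the names build_prompt and word_count are hypothetical, and joining the parts with commas is an assumption made for readability.

```python
def build_prompt(subject_action: str, environment: str, camera: str, style: str) -> str:
    """Join the four baseline parts in the order the guide recommends."""
    parts = [subject_action, environment, camera, style]
    # Skip empty parts; trim whitespace and stray trailing commas.
    cleaned = [p.strip().rstrip(",") for p in parts if p.strip()]
    return ", ".join(cleaned)


def word_count(prompt: str) -> int:
    """Count whitespace-separated words (a rough proxy for prompt length)."""
    return len(prompt.split())


# The strong example from the guide, split into its four parts.
prompt = build_prompt(
    "A sleek silver sports car accelerates through a rain-slicked downtown street",
    "as golden sunset light breaks through storm clouds",
    "camera tracking alongside at street level",
    "cinematic lighting with volumetric light rays, photorealistic rendering",
)
print(prompt)
print(word_count(prompt), "words")
```

Keeping the parts as separate arguments makes it harder to forget one of them and keeps the subject and primary action first.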

I2V and V2V Transform: different strategies

An I2V prompt describes ONLY motion, in 20–40 words. Separate subject motion from camera motion: «Camera slowly pushes in while the subject turns their head to look over their shoulder». Temporal descriptors control rhythm: «gradually», «suddenly», «smoothly», «rhythmically». Describing what's already in the image is an anti-pattern.
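The I2V rules above lend themselves to a rough pre-flight check. The sketch below is hypothetical (check_i2v is not a Kling or Opten function) and relies on simple keyword heuristics: it assumes that a camera clause contains the word «camera» and that the temporal descriptors are the ones this guide names.

```python
# Temporal descriptors taken from this guide; the set is illustrative, not exhaustive.
TEMPORAL_DESCRIPTORS = {"gradually", "suddenly", "smoothly", "rhythmically", "slowly"}


def check_i2v(prompt: str) -> list[str]:
    """Return a list of heuristic warnings for an I2V prompt (empty if none)."""
    words = [w.strip(".,;:!?\"'«»").lower() for w in prompt.split()]
    words = [w for w in words if w]
    issues = []
    if not 20 <= len(words) <= 40:
        issues.append(f"length is {len(words)} words; the guide suggests 20-40")
    if "camera" not in words:
        issues.append("no camera clause; state camera motion separately from subject motion")
    if TEMPORAL_DESCRIPTORS.isdisjoint(words):
        issues.append("no temporal descriptor such as 'gradually' or 'smoothly'")
    return issues


sample = "Camera slowly pushes in while the subject turns their head to look over their shoulder"
for issue in check_i2v(sample):
    print(issue)
```

The sample sentence is a single motion clause of 15 words, so the sketch reports it as shorter than the suggested range. A keyword check like this cannot tell whether a prompt describes appearance instead of motion; that still needs a human read.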

V2V Transform handles style transfers that preserve motion. Formula: «Transform into [target style] + while maintaining original motion and composition + [specific changes]». The required anchor is «maintaining the original camera movement and subject blocking»; without it the model may inject unwanted changes into motion. Example: «Transform into a cyberpunk cityscape with neon signs, holographic advertisements, and rain-slicked streets reflecting colored lights, maintaining the original camera movement and subject blocking, add volumetric fog and lens flares».
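One way to make sure the preservation anchor is never forgotten is to build V2V Transform prompts from a template. This is a minimal sketch under that assumption; build_transform_prompt is a hypothetical helper name, and the anchor wording is copied from this guide.

```python
PRESERVATION_ANCHOR = "maintaining the original camera movement and subject blocking"


def build_transform_prompt(target_style: str, extra_changes: str = "") -> str:
    """Assemble: Transform into [style], [preservation anchor], [specific changes]."""
    if not target_style.strip():
        raise ValueError("target_style must not be empty")
    prompt = f"Transform into {target_style.strip()}, {PRESERVATION_ANCHOR}"
    if extra_changes.strip():
        prompt += f", {extra_changes.strip()}"
    return prompt


print(build_transform_prompt(
    "a cyberpunk cityscape with neon signs, holographic advertisements, "
    "and rain-slicked streets reflecting colored lights",
    "add volumetric fog and lens flares",
))
```

Because the anchor is a constant rather than an argument, every prompt produced this way contains it.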

Reference-to-Video and V2V Edit

Ref2V generates with element consistency from 1–4 reference images. Formula: [Character from ref 1] + [Action/interaction] + [Spatial relations] + [Setting from ref N]. Each reference must be explicitly tied to an element in the scene: «Character A (reference 1) stands in the foreground left, turning to hand an object to Character B (reference 2) who enters from the right background». Consistent terminology is critical: if you say «the red jacket», don't switch to «crimson coat».
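The reference-binding rule can be checked mechanically if prompts follow the «(reference N)» notation used in the example above. The sketch below assumes that notation; check_ref2v is a hypothetical helper, and it does not attempt to detect inconsistent terminology.

```python
import re


def check_ref2v(prompt: str, num_references: int) -> list[str]:
    """Warn when uploaded references and '(reference N)' mentions do not line up."""
    issues = []
    if not 1 <= num_references <= 4:
        issues.append("Ref2V takes 1-4 reference images")
    cited = {int(n) for n in re.findall(r"\(reference (\d+)\)", prompt)}
    # Every supplied reference should be tied to a scene element.
    for n in range(1, num_references + 1):
        if n not in cited:
            issues.append(f"reference {n} is not tied to a scene element")
    # Every cited reference should actually exist.
    for n in sorted(cited):
        if n > num_references:
            issues.append(f"prompt cites reference {n}, but only {num_references} supplied")
    return issues


example = (
    "Character A (reference 1) stands in the foreground left, turning to hand an object "
    "to Character B (reference 2) who enters from the right background"
)
print(check_ref2v(example, 2))
print(check_ref2v(example, 3))
```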

V2V Edit is for surgical precision. Formula: «Keeping [what to preserve] identical + change only [what to change] + [specific change description]». Start with what does NOT change: «Keeping all camera movement, subject blocking, and background elements identical, change only the sky to a dramatic sunset with purple and orange clouds». Negative instructions are allowed: «Do not alter facial features, do not change body proportions».
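The V2V Edit formula can likewise be sketched as a template that forces the preservation list to come first. build_edit_prompt and its parameter names are hypothetical; the sentence pattern is taken from the example above.

```python
from typing import Sequence


def _join_items(items: Sequence[str]) -> str:
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    items = list(items)
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def build_edit_prompt(preserve: Sequence[str], target: str, change: str,
                      do_not: Sequence[str] = ()) -> str:
    """Assemble: Keeping [preserve] identical, change only [target] [change]. Do not ..."""
    if not preserve:
        raise ValueError("list at least one element to preserve")
    prompt = f"Keeping {_join_items(preserve)} identical, change only {target} {change}"
    if do_not:
        # First negative is capitalized, the rest continue the same sentence.
        negatives = [f"do not {item}" for item in do_not]
        negatives[0] = negatives[0].capitalize()
        prompt += ". " + ", ".join(negatives)
    return prompt


print(build_edit_prompt(
    ["all camera movement", "subject blocking", "background elements"],
    "the sky",
    "to a dramatic sunset with purple and orange clouds",
    do_not=["alter facial features", "change body proportions"],
))
```

Raising an error on an empty preservation list reflects the guide's point that an edit without an anchor tends to spill over into the rest of the scene.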

Common mistakes

1. Applying T2V strategy to I2V
Describing character appearance, clothing, or setting inside an I2V prompt is redundant because the model already sees the image, and a scene description can conflict with the actual picture. Keep the prompt to 20–40 words and describe ONLY motion and scene evolution. Separating subject motion from camera motion is critical for O1.
2. V2V Transform without a preservation anchor
Without «maintaining the original camera movement and subject blocking» in a V2V Transform prompt, the model often injects unwanted changes: the subject changes pose or the camera drifts. Include the preservation anchor in every V2V Transform prompt.
3. Inconsistent terminology in Ref2V
If the first sentence calls it «the red jacket» and the third switches to «crimson coat», the model treats them as two different objects and can mix or swap them. Use one consistent phrasing for each referenced element throughout the prompt.
4. V2V Edit without isolating the change
Writing just «change the sky to sunset» without an explicit preservation anchor can make V2V Edit alter the whole scene (lighting, shadows, background colors) instead of only the target element. Start with what to preserve: «Keeping camera movement, subject blocking, and ground lighting identical, change only the sky…».
5. Conflicting descriptions in a single prompt
Phrases such as «Bright sunny day with dark moody shadows» or «cheerful upbeat scene with melancholic atmosphere» contain internal contradictions. As a reasoning model, O1 tries to resolve the conflict and outputs an uncontrolled mix. Keep the description stylistically consistent, or state the progression explicitly («scene transitions from bright morning to moody evening»).
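Mistakes 2 and 4 in the list above are mechanical enough to catch with a simple lint pass before submitting a prompt. The following is a hedged sketch: lint_prompt and the mode labels "v2v_transform" and "v2v_edit" are hypothetical names chosen for illustration, and the checks are plain string matching against the anchor phrases this guide recommends.

```python
import re

TRANSFORM_ANCHOR = "maintaining the original camera movement and subject blocking"
# An Edit prompt should open with a "Keeping ... identical" clause.
EDIT_OPENING = re.compile(r"^\s*keeping\b.+?\bidentical\b", re.IGNORECASE | re.DOTALL)


def lint_prompt(mode: str, prompt: str) -> list[str]:
    """Return warnings for missing preservation anchors in V2V prompts."""
    issues = []
    lowered = prompt.lower()
    if mode == "v2v_transform":
        if TRANSFORM_ANCHOR not in lowered:
            issues.append("missing preservation anchor for camera movement and subject blocking")
    elif mode == "v2v_edit":
        if not EDIT_OPENING.search(prompt):
            issues.append("prompt does not open with a 'Keeping ... identical' clause")
        if "change only" not in lowered:
            issues.append("prompt does not isolate the edit with 'change only'")
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return issues


print(lint_prompt("v2v_transform", "make it cyberpunk"))
print(lint_prompt("v2v_edit", "change the sky to sunset"))
```

Mistakes 3 and 5 (inconsistent terminology, conflicting descriptions) depend on meaning rather than fixed phrases, so they are left out of this sketch.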

Before / after examples

Example 1

Before

I2V: «person walks to the sea»

After

The subject walks slowly toward the ocean with relaxed steps, hair and clothing moving gently in the sea breeze, waves rolling onto shore in the background, camera slowly pushes in from behind while gradually tilting up to reveal the horizon

This prompt uses I2V correctly: motion only, with no appearance description; subject motion separated from camera motion; temporal descriptors («slowly», «gradually»); and a layered description (foreground subject, background waves).

Example 2

Before

V2V Transform: «make it cyberpunk»

After

Transform into a cyberpunk cityscape with neon signs, holographic advertisements floating between buildings, and rain-slicked streets reflecting saturated magenta and cyan colored lights, maintaining the original camera movement and subject blocking. Add volumetric fog at street level, lens flares on neon signs, and chromatic aberration on bright lights. High-contrast Blade Runner aesthetic with warm amber and cool blue color grading.

The rewrite adds the explicit preservation anchor «maintaining the original camera movement and subject blocking», a concrete style anchor (Blade Runner), and descriptions of effects and color grading.

Example 3

Before

V2V Edit: «change the sky to sunset»

After

Keeping all camera movement, subject blocking, foreground objects, and ground lighting identical, change only the sky to a dramatic sunset with deep purple, orange, and pink cloud formations. Increase contrast in the sky by 15% to match the dramatic mood. Do not alter facial features, do not change body proportions, do not modify the lighting direction on the subject.

The rewrite follows the V2V Edit structure: first what to preserve (camera, blocking, foreground, ground lighting), then what to change (sky only), then negative instructions as a safeguard. This masking-style language isolates the change.
