Kling: how to write prompts the model actually understands
Kuaishou
Kling is Kuaishou's video model family, available at klingai.com. It generates clips up to 10 seconds long (up to 15 seconds in Kling 3.0) and supports text-to-video (T2V), image-to-video (I2V), and Motion Control. Prompts accept roughly 2,500 characters; the sweet spot is 50–150 words. English yields the most stable results, and a negative prompt field is supported.
What Kling does well
Kling is a text-to-video and image-to-video model aimed at cinematic scenes and product content. Standard duration is 5–10 seconds (15 seconds in Kling 3.0), resolution up to 1080p, with Elements support — up to 4 reference images for character and object consistency.
Motion Control transfers motion from a reference video onto a new character from an image — the foundation for AI influencers, virtual presenters, and dance performances. A negative prompt is supported as a separate field — a key difference from Imagen and many other models. Keyframes (exactly 2 anchor frames) are also supported.
- T2V up to 10 seconds (15 in Kling 3.0), resolution up to 1080p
- Image-to-Video for animating still images
- Motion Control: transferring motion from a reference video
- Elements — up to 4 references for consistency
- Negative prompt as a separate field
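The limits listed above can be written down as a small pre-flight check. This is only a sketch: `ClipRequest`, its field names, and the version key "3.0" are hypothetical and do not describe Kling's real API; only the numeric limits come from this guide.

```python
from dataclasses import dataclass, field

MAX_PROMPT_CHARS = 2500                      # approximate limit quoted in this guide
MAX_ELEMENTS = 4                             # Elements: up to 4 reference images
KEYFRAME_COUNT = 2                           # keyframes: exactly 2 anchor frames
MAX_DURATION_S = {"default": 10, "3.0": 15}  # seconds; the keys are hypothetical labels


@dataclass
class ClipRequest:
    """Hypothetical request shape used for illustration, not Kling's API."""
    prompt: str
    negative_prompt: str = ""
    duration_s: int = 5
    model_version: str = "default"
    elements: list = field(default_factory=list)
    keyframes: list = field(default_factory=list)


def validate(req: ClipRequest) -> list:
    """Return human-readable problems; an empty list means the request fits the limits."""
    problems = []
    if len(req.prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt is {len(req.prompt)} chars, limit is about {MAX_PROMPT_CHARS}")
    limit = MAX_DURATION_S.get(req.model_version, MAX_DURATION_S["default"])
    if req.duration_s > limit:
        problems.append(f"duration {req.duration_s}s exceeds {limit}s")
    if len(req.elements) > MAX_ELEMENTS:
        problems.append(f"{len(req.elements)} reference images, maximum is {MAX_ELEMENTS}")
    if req.keyframes and len(req.keyframes) != KEYFRAME_COUNT:
        problems.append(f"{len(req.keyframes)} keyframes given, exactly {KEYFRAME_COUNT} expected")
    return problems
```

Calling `validate(ClipRequest(prompt="...", duration_s=15))` reports the duration problem, while the same request with `model_version="3.0"` passes.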
Prompt structure
Optimal T2V structure: [Subject/Character] + [Action/Motion] + [Scene/Environment] + [Camera Movement] + [Style/Mood/Lighting]. Order matters: the model weights elements at the start of the prompt more heavily, so put the most important element first.
Each block needs concrete detail: «35-year-old woman with shoulder-length auburn hair wearing an emerald green coat» instead of «a person»; «walking purposefully through fallen leaves» instead of «moving around»; «smooth tracking shot following from the side» instead of no camera direction at all. Limit the environment to 3–4 elements; once a scene lists more than ten, the model overloads and loses focus. Sweet-spot length is 50–150 words.
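One way to keep the five blocks in order is to assemble the prompt from named parts. The sketch below is illustrative: `T2VPrompt`, `render`, and `length_warning` are hypothetical helper names, and the scene and style strings in the usage example are invented for the demonstration.

```python
from dataclasses import dataclass


@dataclass
class T2VPrompt:
    """The five blocks of a T2V prompt, in the order the model should read them."""
    subject: str
    action: str
    scene: str
    camera: str
    style: str

    def render(self) -> str:
        # Subject first: elements at the start of the prompt carry more weight.
        parts = [self.subject, self.action, self.scene, self.camera, self.style]
        return ", ".join(p.strip() for p in parts if p.strip())


def length_warning(text: str, low: int = 50, high: int = 150):
    """Return a warning string if the word count is outside the sweet spot, else None."""
    n = len(text.split())
    if n < low:
        return f"too short: {n} words (target {low}-{high})"
    if n > high:
        return f"too long: {n} words (target {low}-{high})"
    return None


prompt = T2VPrompt(
    subject="35-year-old woman with shoulder-length auburn hair wearing an emerald green coat",
    action="walking purposefully through fallen leaves",
    scene="quiet city park in autumn, wet footpath, bare trees",  # invented for illustration
    camera="smooth tracking shot following from the side",
    style="soft overcast light, muted warm palette",              # invented for illustration
).render()
print(prompt)
print(length_warning(prompt))
```

Keeping the blocks as separate fields makes it obvious when one is missing, and the word-count check flags drafts that still need more detail.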
T2V, I2V, and Motion Control modes
Each mode needs its own strategy. T2V — describe EVERYTHING: subject, action, environment, camera, style. Formula: (Subject + details) + (Action + tempo) + (Environment + lighting) + (Camera) + (Style).
I2V — describe ONLY motion, not the scene. The model already sees the image. Formula: (Subject motion) + (Environmental motion) + (Camera). Length 20–40 words. Describing what's already in the picture is an anti-pattern.
Motion Control — describe ONLY the character's appearance and setting. Motion is taken from the reference video automatically. Formula: [Character style + clothing] + [Setting/background] + [Visual quality]. Motion, gesture, and expression instructions in Motion Control are the main anti-pattern.
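The per-mode rules above can be approximated with a simple checker. This is a rough sketch: the mode labels, the function name `check_prompt`, and the `MOTION_WORDS` list are assumptions made for illustration, and a keyword list will miss many real motion phrases.

```python
import re

# Illustrative, deliberately incomplete list of motion and expression verbs.
MOTION_WORDS = {"dances", "dancing", "waves", "walks", "runs", "jumps", "gestures", "smiles", "nods"}


def check_prompt(mode: str, prompt: str) -> list:
    """Return a list of issues for the given mode ('t2v', 'i2v' or 'motion_control')."""
    words = re.findall(r"[a-z']+", prompt.lower())
    issues = []
    if mode == "t2v":
        if not 50 <= len(words) <= 150:
            issues.append(f"T2V: {len(words)} words, sweet spot is 50-150")
    elif mode == "i2v":
        if not 20 <= len(words) <= 40:
            issues.append(f"I2V: {len(words)} words, expected 20-40 describing motion only")
    elif mode == "motion_control":
        hits = sorted(set(words) & MOTION_WORDS)
        if hits:
            issues.append(f"Motion Control: motion words found: {', '.join(hits)}")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return issues


print(check_prompt("motion_control", "character dances"))
print(check_prompt("i2v", "woman walks to the sea"))
```

Both sample calls return one issue each: the first contains a motion verb, the second is far shorter than 20 words.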
Common mistakes
1. Describing the scene in an I2V prompt
In Image-to-Video the model already sees the source image. Describing appearance, clothing, or setting wastes tokens and either gets ignored or conflicts with the actual picture. An I2V prompt should be 20–40 words and describe ONLY motion and scene evolution.
2. Motion instructions in Motion Control
Motion Control transfers motion from the reference video automatically. Phrases like «character dances», «waves hand», «walks forward» in the prompt are either ignored or conflict with motion from the video. The prompt is art direction (how it looks), not motion direction (how it moves).
3. Conflicting camera moves
«360-degree rotation around subject while zooming in and panning left» — three simultaneous transforms almost guarantee geometry distortion. Use one primary camera move at a time: either orbit, or zoom, or pan. For complex transitions, use Multi-shot in Kling 3.0.
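A draft can be scanned for stacked camera moves before it is submitted. The sketch below covers only the three moves named above; the regular expressions and the function name `camera_moves` are illustrative assumptions, not an exhaustive vocabulary.

```python
import re

# Patterns for the three moves named above; illustrative, not exhaustive.
CAMERA_MOVES = {
    "orbit": r"\b(orbit\w*|360-degree rotation|rotat\w+ around)\b",
    "zoom": r"\bzoom\w*\b",
    "pan": r"\bpan(s|ning)?\b",
}


def camera_moves(prompt: str) -> list:
    """Return the names of the camera moves mentioned in the prompt."""
    text = prompt.lower()
    return [name for name, pattern in CAMERA_MOVES.items() if re.search(pattern, text)]


moves = camera_moves("360-degree rotation around subject while zooming in and panning left")
if len(moves) > 1:
    print("conflicting camera moves:", ", ".join(moves))
```

A prompt that names more than one of these moves is a candidate for splitting into separate shots.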
4. Prompts that are too short or too abstract
A prompt under 15 words leaves too much freedom — the model fills the scene on its own. Abstract phrases like «something beautiful happens», «make it look dynamic», «cool vibes» give no visual anchors. Concrete details and physical actions give the model something to grip.
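A quick check for this mistake might look like the sketch below. The 15-word threshold and the three phrases come from the paragraph above; the function name `vagueness_issues` is hypothetical, and a real phrase list would need to be much longer.

```python
ABSTRACT_PHRASES = ("something beautiful happens", "make it look dynamic", "cool vibes")
MIN_WORDS = 15


def vagueness_issues(prompt: str) -> list:
    """Flag prompts that are too short or lean on abstract phrases."""
    issues = []
    n = len(prompt.split())
    if n < MIN_WORDS:
        issues.append(f"only {n} words, under {MIN_WORDS}")
    lowered = prompt.lower()
    for phrase in ABSTRACT_PHRASES:
        if phrase in lowered:
            issues.append(f"abstract phrase: {phrase!r}")
    return issues


print(vagueness_issues("car drives through a city at sunset"))
```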
5. Negative phrasing in the main prompt
Kling supports a negative prompt only as a separate field, not inside the main prompt. «No people, no text, not blurry» inside the main prompt is either ignored or produces the opposite effect. Move unwanted elements to the dedicated negative prompt field.
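Moving negations out of the main prompt can be partly automated. The following is a minimal sketch that assumes negations appear as comma-separated clauses starting with «no», «not», «without», or «never»; the function name `split_negatives` is hypothetical, and anything phrased differently will slip through.

```python
import re

# A clause counts as negative when it starts with one of these words.
NEGATION = re.compile(r"^(no|not|without|never)\s+(.*)$", re.IGNORECASE)


def split_negatives(prompt: str) -> tuple:
    """Split a draft into (main prompt, negative prompt) by comma-separated clause."""
    positive, negative = [], []
    for clause in (c.strip() for c in prompt.split(",")):
        if not clause:
            continue
        match = NEGATION.match(clause)
        if match:
            negative.append(match.group(2).strip())  # keep only the unwanted element
        else:
            positive.append(clause)
    return ", ".join(positive), ", ".join(negative)


main, negative = split_negatives("A red kite over a beach, no people, no text, not blurry")
print(main)      # A red kite over a beach
print(negative)  # people, text, blurry
```

The second string is what would go into the separate negative prompt field.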
Before / after examples
Example 1
Before
car drives through a city at sunset
After
A sleek silver sports car with chrome wheels accelerates through a rain-slicked downtown street as golden sunset light breaks through storm clouds, camera tracking alongside at street level, smooth dolly motion, cinematic lighting with volumetric light rays reflecting off wet asphalt, photorealistic rendering, shot on virtual anamorphic lens, 24mm, f/2.8, warm color grading with deep contrast.
Key changes: concrete car details, the street's state, camera behavior described separately from the subject, the cinematic stack, a temporal marker for rhythm.
Example 2
Before
I2V from a photo of a woman on the beach: «woman walks to the sea»
After
Walks slowly toward the ocean, hair and clothing moving gently in the breeze, waves rolling onto shore in the background, camera slowly pushes in
I2V is short (20–40 words) and describes ONLY motion: what the subject does, what's happening in the environment, how the camera moves. Describing appearance or scene would be an anti-pattern — the model already sees the image.
Example 3
Before
Motion Control for a dance video: «character dances»
After
Style the character as a confident urban dancer wearing oversized black streetwear and white sneakers, placed in a moody underground parking lot with flickering fluorescent lights and concrete walls, cinematic realism with grainy 35mm film aesthetic, high contrast color grading, shallow depth of field with bokeh on background lights.
Motion Control describes APPEARANCE and SETTING, not motion. The dance and timing come from the reference video. Instructions like «dances energetically» here are the main anti-pattern.