LTX 2: how to write prompts the model actually understands
LTX 2 is Lightricks' open-source video model at ltx.io. It comes in two versions: Fast (up to 20 seconds, 2x faster) and Pro (up to 10 seconds, plus Audio-to-Video and Retake). Native 4K up to 50 FPS, audio generation, Apache license. The prompt is written as a cinematographer's shot list — optimal length around 200 words in English.
What LTX 2 does well
LTX 2 is an open-source video model built on a Diffusion Transformer (DiT) architecture. Key technical advantages: native 4K (2160p) up to 50 FPS — the highest resolution among surveyed models; native audio generation (dialogue, music, ambient, SFX) in sync with video; full weights on HuggingFace under an Apache license; LoRA fine-tuning support for custom styles and motion.
Two versions solve different problems. LTX 2 Fast — up to 20 seconds, 2x faster, 1/10 the compute cost, optimal for prototyping and long tests. LTX 2 Pro — up to 10 seconds with exclusive modes: Audio-to-Video (video generation from an audio track), Retake (regenerate a segment without restarting), Extend. Negative prompts are supported in both versions.
- Native 4K (2160p) up to 50 FPS, the highest resolution among the surveyed models
- LTX 2 Fast: up to 20 seconds, 2x faster, 1/10 compute
- LTX 2 Pro: up to 10 seconds, A2V, Retake, Extend
- Native audio in sync with video
- Open source, Apache license, LoRA fine-tuning
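As a rough illustration, the split between the two versions can be written down as a small selection helper. The function name `pick_variant`, its parameters, and the mode labels are hypothetical and not part of any LTX API; only the duration limits and the Pro-only modes come from the description above.

```python
from typing import Iterable

FAST_MAX_SECONDS = 20  # Fast: up to 20 seconds
PRO_MAX_SECONDS = 10   # Pro: up to 10 seconds
# Hypothetical labels for the modes listed as exclusive to Pro.
PRO_ONLY_MODES = {"audio_to_video", "retake", "extend"}


def pick_variant(duration_s: float, modes: Iterable[str] = ()) -> str:
    """Return "pro" or "fast" for a requested clip (illustrative helper)."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if PRO_ONLY_MODES & set(modes):
        if duration_s > PRO_MAX_SECONDS:
            raise ValueError("Pro-only modes are limited to 10 seconds")
        return "pro"
    if duration_s > FAST_MAX_SECONDS:
        raise ValueError("no variant supports clips longer than 20 seconds")
    # Assumption: without Pro-only modes, default to Fast for cheap drafts.
    return "fast"
```

Defaulting to Fast is a choice made for this sketch; a final render may still justify Pro even when none of its exclusive modes is needed.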
The 6-element prompt structure
Lightricks' official guidance is to write the prompt like a cinematographer's shot list: a detailed, chronological description in paragraph form. It has six elements:
1. Shot type / camera position: cinematic terms (wide shot, medium close-up, low-angle establishing).
2. Environment: lighting, color palette, textures, atmosphere.
3. Action: a natural sequence in the present tense, from start to finish.
4. Character details: age, hair, clothing, distinctive features.
5. Camera movement: how and when the camera moves; describing how the frame looks after the move helps.
6. Audio description: ambient sound, music, dialogue, vocals.
Not all six are mandatory for simple scenes, but the 6-element structure is the ideal for production work.
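One way to keep the six elements separate while drafting is a small container that joins them into a single paragraph. This is only a sketch: the class `ShotPrompt` and its field names are hypothetical, and the render order (character before action) follows the worked examples in this article, not any documented requirement.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    shot: str            # 1. shot type / camera position
    environment: str     # 2. lighting, palette, textures, atmosphere
    action: str          # 3. present-tense action, start to finish
    character: str = ""  # 4. age, hair, clothing, distinctive features
    camera: str = ""     # 5. camera movement: how and when
    audio: str = ""      # 6. ambient sound, music, dialogue, vocals

    def render(self) -> str:
        """Join the non-empty elements into one paragraph."""
        # Character is placed before action so the subject is introduced first.
        parts = [self.shot, self.environment, self.character,
                 self.action, self.camera, self.audio]
        sentences = []
        for part in parts:
            text = part.strip()
            if not text:
                continue  # optional elements may be left out for simple scenes
            if text[-1] not in ".!?":
                text += "."
            sentences.append(text)
        return " ".join(sentences)


prompt = ShotPrompt(
    shot="Wide establishing shot at golden hour",
    environment="An empty coast with warm amber light on wet sand",
    action="She walks slowly toward the receding waves",
    audio="Gentle ambient sound of waves and distant seagulls",
)
print(prompt.render())
```

Leaving `character` and `camera` empty mirrors the point that simple scenes do not need all six elements.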
The key principle: prompt length = video length
A unique LTX 2 feature is the correlation between prompt length and video duration. A short prompt for a long video causes «rushing»: the model crams everything into the start and then has nothing left to do. For a 10-second video you need ~200 words of chronological description.
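The rule of thumb can be turned into a quick word-budget check. The sketch below assumes the ~200 words per 10 seconds guideline scales linearly to other durations, which the guideline itself does not state; the function names and the 0.5 tolerance are likewise hypothetical.

```python
# Derived from the ~200 words per 10-second clip guideline.
WORDS_PER_SECOND = 200 / 10


def target_word_count(duration_s: float) -> int:
    """Approximate prompt length for a clip, assuming linear scaling."""
    return round(duration_s * WORDS_PER_SECOND)


def is_underwritten(prompt: str, duration_s: float, tolerance: float = 0.5) -> bool:
    """True when the prompt has fewer than tolerance * target words."""
    return len(prompt.split()) < tolerance * target_word_count(duration_s)


draft = "girl walks along the beach at sunset"
print(target_word_count(10), is_underwritten(draft, 10))
```

A check like this only counts words; it cannot tell whether the description actually progresses from start to finish.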
Lens/aperture language reduces artifacts: «50mm, f/2.8» cuts edge flicker. Explicit camera paths (dolly, crane, orbit) reduce temporal jitter — specify a concrete camera trajectory, not a generic «cinematic camera». When generating 4K, add «no high-frequency patterns» to the negative prompt — otherwise moiré artifacts can appear on textures.
For automatic prompt enhancement, the `enhance_prompt=True` flag is available — the model expands the description to optimal length on its own.
Common mistakes
1. Short prompt for a long video
A unique LTX 2 anti-pattern: prompt length should match video duration. A 10-word prompt for a 10-second clip causes «rushing» — the model crams everything into the start. For 10 seconds you need ~200 words of chronological description with progression from start to finish.
2. Conflicting descriptions
«Still peaceful lake with dramatic waves crashing» and «bright sunny day with dark moody shadows» are internally contradictory. LTX 2 tries to reconcile the irreconcilable and produces unpredictable results. Keep the description stylistically consistent, or state the temporal progression explicitly.
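A crude keyword lint can catch the most obvious contradictions before a prompt is submitted. Everything in this sketch is an assumption: the word pairs, the progression markers, and the function name `find_conflicts` are hypothetical, and a keyword match is no substitute for rereading the prompt.

```python
import re

# Hypothetical pairs of terms that rarely belong in the same moment of a shot.
CONFLICTING_PAIRS = [
    ("still", "crashing"),
    ("sunny", "moody"),
    ("silent", "loud"),
]
# Words that suggest the change over time is intentional.
PROGRESSION_MARKERS = ("then", "gradually", "suddenly", "by the end")


def has_progression(text: str) -> bool:
    """True if the text contains a whole-word progression marker."""
    return any(re.search(rf"\b{re.escape(m)}\b", text) for m in PROGRESSION_MARKERS)


def find_conflicts(prompt: str) -> list[tuple[str, str]]:
    """Return conflicting pairs present in a prompt with no stated progression."""
    text = prompt.lower()
    if has_progression(text):
        return []  # assume the contrast is a deliberate temporal change
    words = set(re.findall(r"[a-z\-]+", text))
    return [(a, b) for a, b in CONFLICTING_PAIRS if a in words and b in words]


print(find_conflicts("Still peaceful lake with dramatic waves crashing"))
```

Skipping the check whenever a progression marker appears is deliberately permissive, so it errs on the side of not flagging prompts that describe a change over time.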
3. No audio description
LTX 2 generates audio natively, and describing the audio landscape significantly improves the result. Without an explicit description the model picks an «average» audio variant, which is often less expressive. Add an audio block such as «Ambient sound of…», «Soft piano in the background…», or «Character speaks in…»: it is the sixth element of the 6-element structure in its own right.
4. High-frequency patterns in 4K without a negative guardrail
When generating 4K, high-frequency patterns (thin stripes, fine grids, dense textures) can cause moiré artifacts. Add «no high-frequency patterns, no moiré, no aliasing» to the negative prompt. This guardrail matters at 2K and higher resolutions.
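The guardrail is easy to apply automatically when the output resolution is known. In this sketch the helper name `with_moire_guardrail` and the `min_height` threshold are hypothetical; the default of 2160 corresponds to the 4K case, and it can be lowered for 2K output.

```python
MOIRE_GUARDRAIL = "no high-frequency patterns, no moiré, no aliasing"


def with_moire_guardrail(negative_prompt: str, height_px: int,
                         min_height: int = 2160) -> str:
    """Append the moiré guardrail to a negative prompt for high-resolution output."""
    base = negative_prompt.strip()
    if height_px < min_height:
        return base  # below the threshold, leave the negative prompt unchanged
    if MOIRE_GUARDRAIL in base:
        return base  # already present, avoid duplicating it
    return f"{base}, {MOIRE_GUARDRAIL}" if base else MOIRE_GUARDRAIL


print(with_moire_guardrail("blurry, low quality", 2160))
```

The returned string is meant for whatever negative-prompt field the chosen interface exposes; how that field is named depends on the tool.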
5. Describing the image in I2V instead of motion
As in Kling, the model already sees the source image in Image-to-Video mode, so describing appearance, clothing, or setting inside an I2V prompt conflicts with the actual picture. Keep the prompt to 20–40 words and describe ONLY motion and scene evolution: what moves, how, and at what tempo.
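Both I2V constraints, the 20–40 word range and the focus on motion, can be checked mechanically. This is a minimal sketch: the function name `check_i2v_prompt` and the list of appearance words are hypothetical, and the word list is far from complete.

```python
I2V_MIN_WORDS, I2V_MAX_WORDS = 20, 40
# Hypothetical list of words that describe appearance rather than motion.
APPEARANCE_WORDS = {"wearing", "dressed", "hair", "clothing", "coat", "dress"}


def check_i2v_prompt(prompt: str) -> list[str]:
    """Return a list of problems found in an Image-to-Video prompt."""
    words = [w.strip(".,;:!?").lower() for w in prompt.split()]
    problems = []
    if not I2V_MIN_WORDS <= len(words) <= I2V_MAX_WORDS:
        problems.append(f"length is {len(words)} words, expected 20-40")
    found = sorted(APPEARANCE_WORDS.intersection(words))
    if found:
        problems.append("describes appearance: " + ", ".join(found))
    return problems


motion_only = (
    "She turns slowly toward the window, lifts her hand, and waves. "
    "The curtains sway gently in the breeze while the camera pushes in "
    "at a steady, unhurried pace."
)
print(check_i2v_prompt(motion_only))
print(check_i2v_prompt("A woman wearing a red coat walks"))
```

An empty list means neither rule was triggered; it does not guarantee that the prompt matches what is actually in the source image.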
Before / after examples
Example 1
Before
girl walks along the beach at sunset
After
Wide establishing shot at golden hour. A long stretch of empty Pacific coast with warm amber sunlight bathing the wet sand, soft pastel pink and orange sky reflecting on shallow waves, low rolling fog at the horizon. A young woman in her late twenties with long auburn hair tied loosely, wearing a cream linen sundress and bare feet, walks slowly from the right side of the frame toward the receding waves. She pauses, lifts her face to the sun, then continues walking parallel to the shoreline. Camera follows her with a smooth tracking dolly from a medium distance, gradually pulling back to reveal the vastness of the coast by the end of the clip. Shot on 50mm lens at f/2.8, shallow depth of field with soft bokeh on the background. Gentle ambient sound of waves rolling in and seagulls in the distance, soft acoustic guitar melody fades in around the 4-second mark.
Full 6-element structure: shot type, environment, character, action, camera movement, audio. Length ~150 words for a 10-second video, lens language (50mm, f/2.8), chronological progression from start to finish.
Example 2
Before
foggy street scene
After
Medium low-angle tracking shot at pre-dawn blue hour. A narrow cobblestone alley in a European old town, dense morning fog drifts at ankle level, wet cobblestones reflecting muted blue light from antique street lamps, brick walls covered in ivy, deep shadows between buildings. A man in his forties wearing a long charcoal wool coat and grey fedora walks deliberately away from the camera into the fog, hands in pockets. Camera dollies forward at the same pace as the subject, maintaining constant distance for the first 5 seconds, then gradually slows as he disappears into the fog. 35mm lens at f/2.0, anamorphic flares from street lamps, film grain texture. Ambient sound of distant church bells and faint footsteps on wet stone, a low cello drone gradually builds tension throughout the clip.
Lens/aperture language (35mm, f/2.0), explicit camera path (dolly forward, gradually slows), chronological rhythm with timestamps («for the first 5 seconds», «throughout the clip»), full audio design.
Example 3
Before
watch product shot
After
Macro close-up product shot in studio. A premium stainless-steel automatic watch with sapphire crystal face, exposed mechanical movement visible through the case, dark navy leather strap with white stitching, placed on a black slate surface with soft directional rim lighting from the right. Camera orbits slowly around the watch at the same elevation, completing a quarter turn over the duration of the clip, revealing different angles of the case and dial. Shot on 100mm macro lens at f/4, razor-sharp focus on the mechanical movement, soft falloff into the background. Subtle ambient sound of the mechanical tick-tock of the watch movement clearly audible, distant soft piano in the background. No high-frequency patterns.
4K product scene with a moiré guardrail («no high-frequency patterns», appended to the prompt itself here; where a separate negative prompt is available, it belongs there), explicit camera path (orbit, quarter turn), lens language (100mm macro, f/4), audio description to emphasize mechanics.