What is the optimal prompt length?

~20 words for a simple shot is the sweet spot. The default template «[Subject] [does action] in [setting], [time of day], [one atmosphere or camera cue]» covers 80% of tasks. Longer is justified only when the shot leans on camera language (then the cue goes at the end) or for multi-beat scenes with a timecoded shot list.

How does joint audio-video work?

Sound and video are generated in one forward pass and synchronized by default. Control sound via text: «dialogue in English: '...'», «ambient: distant traffic», «Foley: footsteps on gravel». If sound isn't described, the model invents it from visual logic. This is a unique Happy Horse feature — most video models generate sound separately or not at all.
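A minimal Python sketch of composing these audio cues, assuming they are simply appended to the visual prompt as extra sentences. The helper name `add_audio_cues` and the sentence separator are illustrative assumptions, not a documented Happy Horse API; only the cue wording comes from the examples above.

```python
def add_audio_cues(visual_prompt, dialogue=None, language="English",
                   ambient=None, foley=None):
    """Append text audio cues to a visual prompt (hypothetical helper)."""
    # Drop a trailing period so every part can be joined uniformly.
    parts = [visual_prompt.strip().rstrip(".")]
    if dialogue:
        parts.append(f"dialogue in {language}: '{dialogue}'")
    if ambient:
        parts.append(f"ambient: {ambient}")
    if foley:
        parts.append(f"Foley: {foley}")
    # Assumption: cues are separated as plain sentences.
    return ". ".join(parts) + "."


print(add_audio_cues(
    "A woman walks down a gravel path at dusk.",
    ambient="distant traffic",
    foley="footsteps on gravel",
))
```

Leaving all three cue arguments empty returns the visual prompt unchanged, which matches the case where the model is left to invent the sound itself.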

Which languages are supported for lip-sync?

Seven: English, Mandarin, Cantonese, Japanese, Korean, German, and French, all with ultra-low WER (word error rate). Specify the language in the DIALOGUE block: «dialogue in Korean: '...'». Joint audio-video generation synchronizes speech and lip movement automatically. Note: visuals render better when the rest of the prompt is written in English, even when the dialogue is in another language.
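As a small illustration, the language list can be checked before the dialogue cue is written. This is a sketch only: `dialogue_block` is a hypothetical helper, and the check mirrors the seven languages listed above rather than any official validation logic.

```python
# The seven lip-sync languages named in this guide.
LIP_SYNC_LANGUAGES = (
    "English", "Mandarin", "Cantonese", "Japanese",
    "Korean", "German", "French",
)


def dialogue_block(language, line):
    """Return a dialogue cue, rejecting languages outside the listed seven."""
    normalized = language.strip().capitalize()
    if normalized not in LIP_SYNC_LANGUAGES:
        raise ValueError(f"No lip-sync listed for language: {language!r}")
    return f"dialogue in {normalized}: '{line}'"


print(dialogue_block("korean", "annyeonghaseyo"))
```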

When should a timecoded shot list be used?

For multi-beat scenes — when one clip needs several different shots or actions. Format: «Shot 1 (wide establishing, 0-1s): ...», «Shot 2 (close-up, 1-3s): ...». In fal.ai tests a shot list separates beats correctly, while the same scene as flat prose collapses. For a single simple shot a shot list is overkill — use the default 20-word template.

What video duration is supported?

5–8 seconds by default, up to 12 on Lite, up to 15 on the paid tier. Native 1080p without upscaling. Generation time ~10 seconds on average, ~38 seconds for 1080p on NVIDIA H100, ~2 seconds for a 5-sec 256p preview. Aspect ratios: 16:9, 9:16, 4:3, 21:9, 1:1.
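The limits above can be written down as a small pre-flight check. This is a sketch under assumptions: the tier keys `"lite"` and `"paid"` and the function `check_clip_settings` are hypothetical names, and the numbers are taken only from this answer, not from an official API schema.

```python
# Maximum clip length per tier, in seconds, as stated in this guide.
MAX_SECONDS = {"lite": 12, "paid": 15}  # tier keys are hypothetical
ASPECT_RATIOS = {"16:9", "9:16", "4:3", "21:9", "1:1"}


def check_clip_settings(duration_s, aspect_ratio, tier="lite"):
    """Return a list of problems; an empty list means the settings look fine."""
    problems = []
    if duration_s <= 0:
        problems.append("duration must be positive")
    if tier not in MAX_SECONDS:
        problems.append(f"unknown tier: {tier}")
    elif duration_s > MAX_SECONDS[tier]:
        problems.append(f"{duration_s}s exceeds the {MAX_SECONDS[tier]}s limit on {tier}")
    if aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {aspect_ratio}")
    return problems


print(check_clip_settings(14, "16:9", tier="lite"))
```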

What works better, T2V or I2V?

Both modes share the same weights and perform equally well. I2V is handy when there's a concrete visual anchor (product photo, portrait, concept art): the prompt then describes motion and atmosphere rather than re-describing the picture in detail. T2V is for generating a scene from scratch.

Does Opten support Happy Horse?

Yes, the Opten extension auto-detects Happy Horse 1.0 and scores prompts against the structure outlined above: it checks alignment with the default 20-word template, absence of hedge epithets and quality boosters, physical details instead of abstract emotion, and English for visuals. One click delivers a rewrite in the correct structure.


Happy Horse 1.0: how to write prompts the model actually understands

Name: Happy Horse 1.0
Brand: Alibaba

Alibaba · Updated: May 19, 2026

Happy Horse 1.0 (快乐小马) is the video model from Alibaba's ATH AI Innovation Unit: 15B parameters, unified single-stream Transformer. It generates 5–8 seconds of native 1080p video. Generation time averages ~10 seconds and reaches ~38 seconds for 1080p on an NVIDIA H100. Joint audio-video in a single forward pass, lip-sync in 7 languages, open source. The core prompting rule: brevity wins, ~20 words for a simple shot.

What Happy Horse 1.0 does well

Happy Horse is an open-source model ranked #1 on Artificial Analysis Video Arena (T2V Elo 1333, I2V Elo 1392). T2V and I2V share the same weights; native 1080p without upscaling.

Key feature — joint audio-video: video and sound are generated in one forward pass and synchronized by default. Lip-sync in 7 languages (English, Mandarin, Cantonese, Japanese, Korean, German, French) with ultra-low WER. Up to 12 multimodal inputs: text + reference images + reference videos + audio references. Duration 5–8 seconds by default, up to 15 on the paid tier.

15B parameters, unified single-stream Transformer
Native 1080p without upscaling, 5–8 second duration
Joint audio-video in a single forward pass
Lip-sync in 7 languages with ultra-low WER
#1 on Artificial Analysis Video Arena (T2V and I2V)

Prompt structure and the 20-word rule

The default template covers 80% of tasks: «[Subject] [does action] in [setting], [time of day], [one atmosphere or camera cue]». About 20 words.

Working prompt examples: «A young woman in a red coat walks down a wet city street at night, neon reflections». «A 1965 cherry-red Mustang convertible drives along a winding California coastal highway at midday». «An orange tabby cat coiled on a velvet sofa leaps to a tall oak bookshelf».

Golden rule: brevity wins. The model has a finite «attention budget», and every extra word steals power from rendering. Long prompts literally degrade results: faces blur toward an averaged look, hands lose geometry, gait flattens.
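The default template and the word budget can be sketched in a few lines of Python. The function names `build_prompt` and `word_count` are illustrative assumptions, and counting whitespace-separated words is only a rough proxy for whatever the model actually tokenizes.

```python
def build_prompt(subject, action, setting, time_of_day, cue=None):
    """Fill the default template:
    [Subject] [does action] in [setting], [time of day], [one cue]."""
    prompt = f"{subject} {action} in {setting}, {time_of_day}"
    if cue:
        prompt += f", {cue}"
    return prompt + "."


def word_count(prompt):
    """Count whitespace-separated words as a rough length measure."""
    return len(prompt.split())


prompt = build_prompt(
    subject="An orange tabby cat",
    action="stretches",
    setting="a sunlit living room",
    time_of_day="late afternoon",
    cue="slow dolly-in",
)
print(prompt)
print(word_count(prompt), "words")
```

The template accepts a single optional cue on purpose: it allows one atmosphere or camera cue, not a stack of them.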

When a long prompt is justified

A long prompt is justified in two cases: when the shot leans on camera language (Steadicam push, slow dolly-in, helicopter aerial) and when the scene has multiple beats. Put the camera cue at the end of the prompt, where it gets maximum weight.

For multi-beat scenes use a shot list with timecodes: «Shot 1 (wide establishing, 0-1s): ...», «Shot 2 (mid tracking, 1-4s): ...», «Shot 3 (slow push-in close, 4-5s): ...». In fal.ai tests, a timecoded shot list separates beats correctly, while the same scene as flat prose collapses into one blurred motion.
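A hedged sketch of generating that shot-list format from structured data. The `Shot` dataclass and `format_shot_list` are hypothetical names. The rule that timecodes must start at 0 and be contiguous is an assumption drawn from the examples above, not a documented requirement.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    framing: str      # e.g. "wide establishing"
    start_s: int      # start timecode in seconds
    end_s: int        # end timecode in seconds
    description: str  # what happens in this beat


def format_shot_list(shots):
    """Render shots as 'Shot N (framing, start-ends): description' lines."""
    lines = []
    previous_end = 0
    for number, shot in enumerate(shots, start=1):
        # Assumption: beats start at 0 and follow each other without gaps.
        if shot.start_s != previous_end or shot.end_s <= shot.start_s:
            raise ValueError(f"Shot {number}: timecodes must be contiguous and increasing")
        lines.append(
            f"Shot {number} ({shot.framing}, {shot.start_s}-{shot.end_s}s): {shot.description}"
        )
        previous_end = shot.end_s
    return "\n".join(lines)


print(format_shot_list([
    Shot("wide establishing", 0, 1, "A detective enters a dim hotel room."),
    Shot("close-up", 1, 3, "His hand picks up a folded note."),
]))
```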

Markdown sections (## Subject, ## Action, ## Setting, ## Camera, ## Lighting, ## Mood) suit single-take shots with many control axes. Use them ONLY when there is content for most sections; empty headers hurt.
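One way to enforce the "most sections" rule is to emit headers only for filled fields and refuse the sectioned format otherwise. This is a sketch: `build_sectioned_prompt` is a hypothetical helper, and reading "most" as "more than half of the six sections" is an assumption.

```python
SECTION_ORDER = ("Subject", "Action", "Setting", "Camera", "Lighting", "Mood")


def build_sectioned_prompt(fields):
    """Return a Markdown-sectioned prompt, or None if too few sections have content."""
    filled = [
        (name, fields[name].strip())
        for name in SECTION_ORDER
        if fields.get(name, "").strip()
    ]
    # Assumption: "most sections" means more than half (at least 4 of 6).
    if len(filled) <= len(SECTION_ORDER) // 2:
        return None
    # Empty sections are skipped entirely, so no bare headers are emitted.
    return "\n\n".join(f"## {name}\n{text}" for name, text in filled)


result = build_sectioned_prompt({
    "Subject": "A detective in a wool coat",
    "Action": "enters a dim hotel room",
    "Setting": "1940s hotel, rain outside",
    "Camera": "slow dolly-in",
    "Lighting": "",
})
print(result)
```

When the function returns None, the default one-sentence template is the better fit.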

Strengths and weaknesses

Strengths (use them): camera movements (Steadicam push, slow dolly-in, helicopter aerial — the model understands English camera vocabulary unusually well); atmospheric lighting (blue hour alley, neon noir, single hard top-down key with deep falloff, warm amber backlight + cool blue ambient); cars, metal, chrome, reflections; fabric and hair in wind (secondary motion holds throughout the clip); fire and sparks with correct warmth.

Weaknesses: long human action sequences with faces in focus; storytelling prose instead of production notes (the model executes instructions, not narration); emotion as abstraction («sad woman», «happy moment»), which should be translated into physical details: micro-expressions, gaze direction, breath rhythm, pauses.

Common mistakes

1. Too long a prompt for a simple scene
The model's main antipattern. Long prompts for simple scenes literally degrade output: faces blur toward an averaged look, hands lose geometry, gait flattens. ~20 words is the sweet spot. Longer is justified only when camera language or a multi-beat scene calls for it.
2. Hedge epithets and quality boosters
«Beautiful, stunning, gorgeous, masterpiece, epic, breathtaking, insane detail, ultra detailed, hyperrealistic» eat the token budget and pull the output toward an averaged look. Replace them with specifics: «overcast daylight, wet asphalt», «neon pink and cyan reflections», «35mm telephoto, shallow depth of field».
3. Emotion as abstraction
«Sad woman thinking about her past», «happy moment», «emotional scene»: Happy Horse doesn't read emotion as a concept. Translate it into physical details such as micro-expressions, gaze direction, and breath rhythm: «close-up of a young woman standing still, soft wind moving her hair, neutral expression, slow blink, shallow depth of field».
4. Mandarin for visuals
Although the model comes from Alibaba, English yields better visual rendering than Mandarin. Use Mandarin ONLY in the DIALOGUE block for Chinese lip-sync, and write all production notes (subject, action, setting, camera, lighting) in English.
5. Booru tags, JSON, weighted parentheses
Comma-separated keywords without sentences (Booru style), JSON objects, and weighted parentheses `(keyword:1.2)` (Stable Diffusion syntax) all measurably lose to English prose. Happy Horse is trained on natural language. Write sentences and production notes.
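Several of the mistakes above are mechanical enough to check automatically. The following is a rough linter sketch, not the scoring logic of any real tool: `lint_prompt`, the regular expression, and the 25-word soft limit (about 20 words plus some slack) are all assumptions chosen for illustration.

```python
import json
import re

# Quality boosters quoted in mistake 2.
QUALITY_BOOSTERS = (
    "beautiful", "stunning", "gorgeous", "masterpiece", "epic",
    "breathtaking", "insane detail", "ultra detailed", "hyperrealistic",
)
# Stable Diffusion style weights such as (keyword:1.2).
WEIGHTED_PAREN = re.compile(r"\([^()]*:\s*\d+(?:\.\d+)?\)")


def _looks_like_json(text):
    """True if the whole prompt parses as a JSON object or array."""
    stripped = text.strip()
    if not stripped.startswith(("{", "[")):
        return False
    try:
        json.loads(stripped)
    except json.JSONDecodeError:
        return False
    return True


def lint_prompt(prompt, soft_word_limit=25):
    """Return a list of findings; an empty list means nothing was flagged."""
    findings = []
    lowered = prompt.lower()
    hits = [t for t in QUALITY_BOOSTERS
            if re.search(rf"\b{re.escape(t)}\b", lowered)]
    if hits:
        findings.append("quality boosters: " + ", ".join(hits))
    if WEIGHTED_PAREN.search(prompt):
        findings.append("weighted parentheses")
    if _looks_like_json(prompt):
        findings.append("JSON instead of prose")
    words = len(prompt.split())
    if words > soft_word_limit:
        # Assumption: long prompts are only flagged, since camera language
        # or a timecoded shot list can justify the extra length.
        findings.append(f"long prompt: {words} words")
    return findings


print(lint_prompt("A stunning (red coat:1.2) woman, masterpiece, ultra detailed"))
```

Abstract emotion and Mandarin production notes are left out deliberately: detecting them reliably needs more than keyword matching.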

Before / after examples

Example 1

Before

A beautiful gorgeous stunning woman in a magnificent red coat masterpiece walking elegantly down a breathtaking wet city street at night with insane neon reflections, ultra detailed, hyperrealistic, 8k cinematography

After

A young woman in a red coat walks down a wet city street at night, neon reflections, 35mm telephoto, slow tracking dolly.

Anti-slop rule: hedge epithets (beautiful, gorgeous, stunning, masterpiece, ultra detailed) eat the token budget and pull the output toward an averaged look. About 20 words with a camera cue at the end is the sweet spot.

Example 2

Before

happy man walking and feeling good about life in a nice park

After

A young man walks through a sunlit park in autumn, slow exhale visible in cool air, soft smile, hand brushing fallen leaves, golden hour, slow side tracking.

The model can't read emotion as abstraction («happy», «feeling good»). Translating it into physical details (slow exhale, soft smile, hand brushing leaves) yields visible motion.

Example 3

Before

A complex cinematic scene where a detective enters the dimly lit room, looks around suspiciously, finds a clue on the table, picks it up, examines it carefully, and then walks out the door

After

Shot 1 (wide establishing, 0-1s): A detective in a wool coat enters a dim hotel room; single hard top-down key, deep falloff to black.
Shot 2 (close-up, 1-3s): His hand picks up a folded note from the wood desk; warm amber practical light.
Shot 3 (medium tracking, 3-5s): He turns and walks toward the door; slow side tracking, neon glow through the blinds.

Flat prose with multiple actions collapses into one blurred motion. A timecoded shot list separates beats correctly.

