Happy Horse 1.0: how to write prompts the model actually understands
Alibaba
Happy Horse 1.0 (快乐小马) is the video model from Alibaba's ATH AI Innovation Unit: 15B parameters, a unified single-stream Transformer. It generates 5–8 seconds of 1080p video in about 10 seconds on an H100, produces audio and video jointly in a single forward pass, supports lip-sync in 7 languages, and is open source. The core prompting rule: brevity wins, with about 20 words for a simple shot.
What Happy Horse 1.0 does well
Happy Horse is an open-source model ranked #1 on the Artificial Analysis Video Arena (T2V Elo 1333, I2V Elo 1392). T2V and I2V share the same weights, and output is native 1080p without upscaling.
Key feature: joint audio-video. Video and sound are generated in one forward pass and synchronized by default. Lip-sync works in 7 languages (English, Mandarin, Cantonese, Japanese, Korean, German, French) with an ultra-low word error rate (WER). The model accepts up to 12 multimodal inputs: text, reference images, reference videos, and audio references. Duration is 5–8 seconds by default, up to 15 seconds on the paid tier.
- 15B parameters, unified single-stream Transformer
- Native 1080p without upscaling, 5–8 second duration
- Joint audio-video in a single forward pass
- Lip-sync in 7 languages with ultra-low WER
- #1 on Artificial Analysis Video Arena (T2V and I2V)
Prompt structure and the 20-word rule
The default template covers 80% of tasks: «[Subject] [does action] in [setting], [time of day], [one atmosphere or camera cue]». About 20 words.
Working prompt examples: «A young woman in a red coat walks down a wet city street at night, neon reflections». «A 1965 cherry-red Mustang convertible drives along a winding California coastal highway at midday». «An orange tabby cat coiled on a velvet sofa leaps to a tall oak bookshelf».
Golden rule: brevity wins. The model has a finite «attention budget», and every extra word takes some of it away from the details that matter. Long prompts degrade results: faces blur toward an averaged look, hands lose geometry, gait flattens.
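The template and the 20-word budget can be sketched as a small helper. This is only an illustration: `build_prompt` and its parameters are hypothetical names, not part of any Happy Horse API, and the setting argument is assumed to carry its own preposition («down a wet city street») so the sentence reads naturally.

```python
def build_prompt(subject, action, setting, time_of_day, cue, max_words=20):
    """Fill the default template and check it against a word budget.

    Hypothetical helper: the 20-word default mirrors the rule of thumb
    from the article, it is not a limit enforced by the model.
    """
    prompt = f"{subject} {action} {setting}, {time_of_day}, {cue}"
    word_count = len(prompt.split())
    return prompt, word_count, word_count <= max_words


prompt, words, within_budget = build_prompt(
    subject="A young woman in a red coat",
    action="walks",
    setting="down a wet city street",
    time_of_day="at night",
    cue="neon reflections",
)
print(prompt)
print(words, within_budget)
```

Returning the count alongside the prompt makes it easy to see how much room is left for a single camera cue before the budget is spent.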
When a long prompt is justified
A long prompt is justified when the shot leans on camera language (Steadicam push, slow dolly-in, helicopter aerial) or when the scene has several beats. Put the camera cue at the end of the prompt, where it gets maximum weight.
For multi-beat scenes use a shot list with timecodes: «Shot 1 (wide establishing, 0-1s): ...», «Shot 2 (mid tracking, 1-4s): ...», «Shot 3 (slow push-in close, 4-5s): ...». In fal.ai tests, a timecoded shot list separates beats correctly, while the same scene as flat prose collapses into one blurred motion.
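A timecoded shot list is easy to assemble programmatically. The sketch below assumes the «Shot N (framing, start-end s): description» layout shown above; the `Shot` class and `format_shot_list` function are hypothetical names introduced for illustration, and the contiguity check is an added safeguard, not a documented requirement of the model.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    framing: str       # e.g. "wide establishing"
    start: int         # start time in seconds
    end: int           # end time in seconds
    description: str   # production note for this beat


def format_shot_list(shots):
    """Render shots as 'Shot N (framing, start-ends): description'.

    Raises ValueError if the timecodes leave a gap, overlap,
    or describe a zero-length shot.
    """
    parts = []
    expected_start = 0
    for number, shot in enumerate(shots, start=1):
        if shot.start != expected_start or shot.end <= shot.start:
            raise ValueError(f"Shot {number} has a gap, overlap, or zero length")
        parts.append(
            f"Shot {number} ({shot.framing}, {shot.start}-{shot.end}s): "
            f"{shot.description}"
        )
        expected_start = shot.end
    return " ".join(parts)


shots = [
    Shot("wide establishing", 0, 1, "A detective in a wool coat enters a dim hotel room."),
    Shot("close-up", 1, 3, "His hand picks up a folded note from the wood desk."),
    Shot("medium tracking", 3, 5, "He turns and walks toward the door."),
]
print(format_shot_list(shots))
```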
Markdown sections (## Subject, ## Action, ## Setting, ## Camera, ## Lighting, ## Mood) suit single-take shots with many control axes. Use them ONLY when there is content for most sections; empty headers hurt the result.
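The «no empty headers» rule can be enforced when the sectioned prompt is built. In this sketch, `build_sectioned_prompt` is a hypothetical helper, and reading «most sections» as at least 4 of the 6 is an assumption made for the example. When too few sections have content it returns `None`, signalling that a short prose prompt is the better choice.

```python
SECTION_ORDER = ["Subject", "Action", "Setting", "Camera", "Lighting", "Mood"]


def build_sectioned_prompt(fields, min_filled=4):
    """Build a Markdown-sectioned prompt, skipping empty sections.

    `min_filled=4` is an assumed threshold for "most sections".
    Returns None when fewer sections than that have content.
    """
    filled = []
    for name in SECTION_ORDER:
        text = (fields.get(name) or "").strip()
        if text:
            filled.append((name, text))
    if len(filled) < min_filled:
        return None
    return "\n\n".join(f"## {name}\n{text}" for name, text in filled)


fields = {
    "Subject": "A detective in a wool coat",
    "Action": "picks up a folded note from the wood desk",
    "Setting": "a dim hotel room",
    "Camera": "slow push-in close",
    "Lighting": "single hard top-down key with deep falloff",
    "Mood": "",  # left empty on purpose, so no header is emitted for it
}
print(build_sectioned_prompt(fields))
```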
Strengths and weaknesses
Strengths (use them):
- Camera movements: Steadicam push, slow dolly-in, helicopter aerial. The model understands English camera vocabulary unusually well.
- Atmospheric lighting: blue hour alley, neon noir, single hard top-down key with deep falloff, warm amber backlight + cool blue ambient.
- Cars, metal, chrome, reflections.
- Fabric and hair in wind (secondary motion holds throughout the clip).
- Fire and sparks with correct warmth.
Weaknesses:
- Long human action sequences with faces in focus.
- Storytelling prose instead of production notes: the model executes instructions, not narration.
- Emotion as abstraction («sad woman», «happy moment»). Translate it into physical details: micro-expressions, gaze direction, breath rhythm, pauses.
Common mistakes
1. Too long a prompt for a simple scene
This is the most common antipattern with this model. Long prompts for simple scenes degrade output: faces blur toward an averaged look, hands lose geometry, gait flattens. About 20 words is the sweet spot. Longer is justified only when camera language or a multi-beat scene calls for it.
2. Filler epithets and quality boosters
«Beautiful, stunning, gorgeous, masterpiece, epic, breathtaking, insane detail, ultra detailed, hyperrealistic» eat the token budget and pull the result toward an averaged look. Replace them with specifics: «overcast daylight, wet asphalt», «neon pink and cyan reflections», «35mm telephoto, shallow depth of field».
3. Emotion as abstraction
«Sad woman thinking about her past», «happy moment», «emotional scene»: Happy Horse doesn't read emotion as a concept. Translate it into physical details such as micro-expressions, gaze direction, and breath rhythm: «close-up of a young woman standing still, soft wind moving her hair, neutral expression, slow blink, shallow depth of field».
4. Mandarin for visuals
Although the model comes from Alibaba and is Chinese in origin, English yields better visual rendering. Use Mandarin ONLY in the DIALOGUE block for Chinese lip-sync. Write all production notes (subject, action, setting, camera, lighting) in English.
5. Booru tags, JSON, weighted parentheses
Comma-separated keywords without sentences (Booru style), JSON objects, and weighted parentheses `(keyword:1.2)` (Stable Diffusion syntax) all measurably lose to English prose. Happy Horse is trained on natural language. Write sentences and production notes.
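The five mistakes above can be caught with a rough pre-flight check before a prompt is sent. The sketch below is a heuristic illustration only: `lint_prompt`, its warning codes, and its thresholds are hypothetical, the word lists are taken from the examples in this article, the tag-style test (three or more comma-separated chunks of at most three words each) is an assumption, and the character range covers only the common CJK Unified Ideographs. It is meant for the production-notes part of a simple single-shot prompt, not for the DIALOGUE block or a timecoded shot list.

```python
import json
import re

BOOSTERS = ["beautiful", "stunning", "gorgeous", "masterpiece", "epic",
            "breathtaking", "insane detail", "ultra detailed", "hyperrealistic"]
ABSTRACT_EMOTIONS = ["sad", "happy", "emotional", "feeling good"]
WEIGHTED_PAREN = re.compile(r"\([^()]+:\d+(?:\.\d+)?\)")  # e.g. (keyword:1.2)
CJK = re.compile(r"[\u4e00-\u9fff]")  # common CJK Unified Ideographs only


def has_phrase(text, phrase):
    """Whole-word, case-insensitive match for a word or phrase."""
    return re.search(rf"\b{re.escape(phrase)}\b", text, re.IGNORECASE) is not None


def looks_like_json(text):
    """True if the whole text parses as a JSON object or array."""
    stripped = text.strip()
    if not stripped.startswith(("{", "[")):
        return False
    try:
        json.loads(stripped)
    except ValueError:
        return False
    return True


def looks_like_tags(text, max_words_per_chunk=3, min_chunks=3):
    """Heuristic: several short comma-separated chunks and no longer clause."""
    chunks = [c.strip() for c in text.split(",") if c.strip()]
    return len(chunks) >= min_chunks and all(
        len(c.split()) <= max_words_per_chunk for c in chunks
    )


def lint_prompt(notes, max_words=20):
    """Return a list of warning codes for the production notes of a prompt."""
    warnings = []
    if len(notes.split()) > max_words:
        warnings.append("too_long")
    warnings += [f"booster:{w}" for w in BOOSTERS if has_phrase(notes, w)]
    warnings += [f"abstract_emotion:{w}" for w in ABSTRACT_EMOTIONS
                 if has_phrase(notes, w)]
    if CJK.search(notes):
        warnings.append("cjk_in_production_notes")
    if WEIGHTED_PAREN.search(notes):
        warnings.append("weighted_parentheses")
    if looks_like_json(notes):
        warnings.append("json")
    elif looks_like_tags(notes):
        warnings.append("tag_style")
    return warnings


print(lint_prompt("happy man walking and feeling good about life in a nice park"))
print(lint_prompt("A young woman in a red coat walks down a wet city street at night, neon reflections"))
```

An empty list means none of these heuristics fired. It does not guarantee a good prompt, only that the obvious antipatterns are absent.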
Before / after examples
Example 1
Before
A beautiful gorgeous stunning woman in a magnificent red coat masterpiece walking elegantly down a breathtaking wet city street at night with insane neon reflections, ultra detailed, hyperrealistic, 8k cinematography
After
A young woman in a red coat walks down a wet city street at night, neon reflections, 35mm telephoto, slow tracking dolly.
Anti-slop rule: filler epithets (beautiful, gorgeous, stunning, masterpiece, ultra detailed) eat the token budget and pull the result toward an averaged look. About 20 words with a camera cue at the end is the sweet spot.
Example 2
Before
happy man walking and feeling good about life in a nice park
After
A young man walks through a sunlit park in autumn, slow exhale visible in cool air, soft smile, hand brushing fallen leaves, golden hour, slow side tracking.
The model can't read emotion as abstraction («happy», «feeling good»). Translating it into physical details (slow exhale, soft smile, hand brushing leaves) yields visible motion.
Example 3
Before
A complex cinematic scene where a detective enters the dimly lit room, looks around suspiciously, finds a clue on the table, picks it up, examines it carefully, and then walks out the door
After
Shot 1 (wide establishing, 0-1s): A detective in a wool coat enters a dim hotel room; single hard top-down key, deep falloff to black. Shot 2 (close-up, 1-3s): His hand picks up a folded note from the wood desk; warm amber practical light. Shot 3 (medium tracking, 3-5s): He turns and walks toward the door; slow side tracking, neon glow through the blinds.
Flat prose with multiple actions collapses into one blurred motion. A timecoded shot list separates beats correctly.