Happy Horse 1.0: how to write prompts the model actually understands

Alibaba · Updated:

Happy Horse 1.0 (快乐小马) is a video model from Alibaba's ATH AI Innovation Unit: 15B parameters, unified single-stream Transformer. It generates 5–8 seconds of native 1080p video; generation takes ~10 seconds on average and ~38 seconds for 1080p on an NVIDIA H100. Audio and video are generated jointly in a single forward pass, with lip-sync in 7 languages, and the model is open source. The core prompting rule: brevity wins, at ~20 words for a simple shot.

What Happy Horse 1.0 does well

Happy Horse is an open-source model ranked #1 on Artificial Analysis Video Arena (T2V Elo 1333, I2V Elo 1392). T2V and I2V share the same weights; native 1080p without upscaling.

Key feature: joint audio-video. Video and sound are generated in one forward pass and synchronized by default. Lip-sync covers 7 languages (English, Mandarin, Cantonese, Japanese, Korean, German, French) with ultra-low WER (word error rate). The model accepts up to 12 multimodal inputs: text, reference images, reference videos, and audio references. Duration is 5–8 seconds by default, up to 15 on the paid tier.

  • 15B parameters, unified single-stream Transformer
  • Native 1080p without upscaling, 5–8 second duration
  • Joint audio-video in a single forward pass
  • Lip-sync in 7 languages with ultra-low WER
  • #1 on Artificial Analysis Video Arena (T2V and I2V)

Prompt structure and the 20-word rule

The default template covers 80% of tasks: «[Subject] [does action] in [setting], [time of day], [one atmosphere or camera cue]». About 20 words.

Working prompt examples: «A young woman in a red coat walks down a wet city street at night, neon reflections». «A 1965 cherry-red Mustang convertible drives along a winding California coastal highway at midday». «An orange tabby cat coiled on a velvet sofa leaps to a tall oak bookshelf».

Golden rule: brevity wins. The model has a finite «attention budget», and every extra word steals power from rendering. Long prompts literally degrade results: faces blur toward an averaged look, hands lose geometry, gait flattens.
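The default template and the ~20-word budget can be sketched as a small helper. This is only an illustration: `build_prompt`, `word_count`, and `within_budget` are hypothetical names, and the 5-word tolerance around the ~20-word target is an assumption, not something the model documents.

```python
def build_prompt(subject, action, setting, time_of_day, cue):
    """Fill the default template:
    [Subject] [does action] in [setting], [time of day], [one cue]."""
    return f"{subject} {action} in {setting}, {time_of_day}, {cue}"


def word_count(prompt):
    """Count whitespace-separated words (punctuation stays attached)."""
    return len(prompt.split())


def within_budget(prompt, target=20, tolerance=5):
    """True if the prompt stays near the ~20-word sweet spot.
    The tolerance is an assumed value chosen for illustration."""
    return word_count(prompt) <= target + tolerance


prompt = build_prompt(
    subject="A young woman in a red coat",
    action="walks alone",
    setting="a wet city street",
    time_of_day="at night",
    cue="neon reflections",
)
print(prompt)
print(word_count(prompt), within_budget(prompt))  # 18 True
```

Counting words before submitting is a cheap way to notice when a simple shot has drifted past the budget.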

When a long prompt is justified

A long prompt is justified when the shot leans on camera language (Steadicam push, slow dolly-in, helicopter aerial) or when the scene has several beats. Put the camera cue at the end of the prompt: that's where it gets maximum weight.

For multi-beat scenes use a shot list with timecodes: «Shot 1 (wide establishing, 0-1s): ...», «Shot 2 (mid tracking, 1-4s): ...», «Shot 3 (slow push-in close, 4-5s): ...». In fal.ai tests, a timecoded shot list separates beats correctly, while the same scene as flat prose collapses into one blurred motion.
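The shot-list format above is regular enough to generate from structured data. The sketch below assumes a hypothetical `Shot` record and `format_shot_list` helper; the check that timecodes are contiguous and start at 0 is an assumption drawn from the examples, not a documented requirement.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    framing: str      # e.g. "wide establishing"
    start: int        # start time in seconds
    end: int          # end time in seconds
    description: str  # production note for this beat


def format_shot_list(shots):
    """Render shots as 'Shot N (framing, start-ends): description' lines."""
    lines = []
    previous_end = 0
    for number, shot in enumerate(shots, start=1):
        # Assumed rule: each shot begins where the previous one ended.
        if shot.start != previous_end or shot.end <= shot.start:
            raise ValueError(f"Shot {number} has non-contiguous timecodes")
        lines.append(
            f"Shot {number} ({shot.framing}, {shot.start}-{shot.end}s): "
            f"{shot.description}"
        )
        previous_end = shot.end
    return "\n".join(lines)


print(format_shot_list([
    Shot("wide establishing", 0, 1, "A detective enters a dim hotel room."),
    Shot("close-up", 1, 3, "His hand picks up a folded note from the desk."),
    Shot("medium tracking", 3, 5, "He turns and walks toward the door."),
]))
```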

Markdown sections (## Subject, ## Action, ## Setting, ## Camera, ## Lighting, ## Mood) — for single-take shots with many control axes. Use ONLY when there's content for most sections. Empty headers hurt.
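One way to respect the «no empty headers» rule is to build the sectioned prompt from a dictionary and emit only the sections that have content. This is a sketch: `build_sectioned_prompt` is a hypothetical helper, and reading «most sections» as at least 4 of 6 is an assumption.

```python
SECTIONS = ["Subject", "Action", "Setting", "Camera", "Lighting", "Mood"]


def build_sectioned_prompt(notes, min_filled=4):
    """Render '## Section' blocks, skipping sections without content.

    min_filled=4 is an assumed reading of "most sections" (4 of 6).
    """
    filled = [
        (name, (notes.get(name) or "").strip())
        for name in SECTIONS
        if (notes.get(name) or "").strip()
    ]
    if len(filled) < min_filled:
        raise ValueError(
            "Too few sections have content; use the short default template"
        )
    return "\n\n".join(f"## {name}\n{text}" for name, text in filled)


print(build_sectioned_prompt({
    "Subject": "A 1965 cherry-red Mustang convertible",
    "Action": "drives along a winding coastal highway",
    "Setting": "California coast at midday",
    "Camera": "helicopter aerial, slow push-in",
    "Mood": "",  # empty, so no header is emitted for it
}))
```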

Strengths and weaknesses

Strengths (use them): camera movements (Steadicam push, slow dolly-in, helicopter aerial — the model understands English camera vocabulary unusually well); atmospheric lighting (blue hour alley, neon noir, single hard top-down key with deep falloff, warm amber backlight + cool blue ambient); cars, metal, chrome, reflections; fabric and hair in wind (secondary motion holds throughout the clip); fire and sparks with correct warmth.

Weaknesses: long human action sequences with faces in focus. Storytelling prose instead of production notes (the model executes instructions, not narration). Emotion as abstraction («sad woman», «happy moment») — translate into physical details: micro-expressions, gaze direction, breath rhythm, pauses.

Common mistakes

  1. Too long a prompt for a simple scene

    The model's main antipattern. Long prompts for simple scenes literally degrade output: faces blur toward an averaged look, hands lose geometry, gait flattens. ~20 words is the sweet spot. Longer is justified only when camera language or a multi-beat scene calls for it.

  2. Hedge epithets and quality boosters

    «Beautiful, stunning, gorgeous, masterpiece, epic, breathtaking, insane detail, ultra detailed, hyperrealistic» eat the token budget and pull toward average-look. Replace with specifics: «overcast daylight, wet asphalt», «neon pink and cyan reflections», «35mm telephoto, shallow depth of field».

  3. Emotion as abstraction

    «Sad woman thinking about her past», «happy moment», «emotional scene» — Happy Horse doesn't read emotion as a concept. Translate into physical details: «close-up of a young woman standing still, soft wind moving her hair, neutral expression, slow blink, shallow depth of field». Micro-expressions, gaze direction, breath rhythm.

  4. Mandarin for visuals

    Although the model comes from Alibaba, English yields better visual rendering than Mandarin. Use Mandarin ONLY in the DIALOGUE block for Chinese lip-sync. Write all production notes (subject, action, setting, camera, lighting) in English.

  5. Booru tags, JSON, weighted parentheses

    Comma-separated keywords without sentences (Booru style), JSON objects, and weighted parentheses `(keyword:1.2)` (Stable Diffusion syntax) — measurably lose to English prose. Happy Horse is trained on natural language. Write sentences and production notes.
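Several of the mistakes above are mechanical enough to catch with a rough linter before submitting a prompt. The sketch below is heuristic and illustrative only: `lint_prompt` and its issue codes are hypothetical, the word lists are small samples taken from the examples above, and the tag-list and dialogue-block patterns are assumptions about how such prompts typically look. Prompt length is left out here because long shot lists are legitimate.

```python
import json
import re

BOOSTERS = ["beautiful", "stunning", "gorgeous", "masterpiece", "epic",
            "breathtaking", "insane detail", "ultra detailed", "hyperrealistic"]
EMOTION_WORDS = ["sad", "happy", "emotional"]  # tiny illustrative sample


def _any_word(words):
    """Compile a case-insensitive whole-word matcher for a word list."""
    body = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(" + body + r")\b", re.IGNORECASE)


BOOSTER_RE = _any_word(BOOSTERS)
EMOTION_RE = _any_word(EMOTION_WORDS)
WEIGHTED_RE = re.compile(r"\([^()]*:\s*\d+(?:\.\d+)?\)")  # (keyword:1.2)
DIALOGUE_RE = re.compile(r"dialogue in \w+:\s*'[^']*'", re.IGNORECASE)
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def _is_json(text):
    try:
        return isinstance(json.loads(text), (dict, list))
    except ValueError:
        return False


def lint_prompt(prompt):
    """Return a list of issue codes; an empty list means nothing was flagged."""
    text = prompt.strip()
    issues = []
    if BOOSTER_RE.search(text):
        issues.append("quality_booster")
    if EMOTION_RE.search(text):
        issues.append("abstract_emotion")
    if WEIGHTED_RE.search(text):
        issues.append("weighted_parentheses")
    if _is_json(text):
        issues.append("json")
    # Assumed Booru heuristic: four or more short comma-separated fragments.
    tags = [t for t in text.split(",") if t.strip()]
    if len(tags) >= 4 and all(len(t.split()) <= 3 for t in tags):
        issues.append("tag_list")
    # Chinese characters are fine inside a dialogue block, flagged elsewhere.
    if CJK_RE.search(DIALOGUE_RE.sub("", text)):
        issues.append("non_english_visuals")
    return issues


print(lint_prompt("a cat (fluffy:1.2) on a sofa"))
print(lint_prompt("A young woman in a red coat walks down a wet city street at night"))
```

A linter like this cannot judge whether a prompt is good; it only flags the surface patterns the list above warns about.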

Before / after examples

Example 1

Before

A beautiful gorgeous stunning woman in a magnificent red coat masterpiece walking elegantly down a breathtaking wet city street at night with insane neon reflections, ultra detailed, hyperrealistic, 8k cinematography

After

A young woman in a red coat walks down a wet city street at night, neon reflections, 35mm telephoto, slow tracking dolly.

Anti-slop rule: hedge epithets (beautiful, gorgeous, stunning, masterpiece, ultra detailed) eat the token budget and pull toward average-look. ~20 words with a camera cue at the end is the sweet spot.

Example 2

Before

happy man walking and feeling good about life in a nice park

After

A young man walks through a sunlit park in autumn, slow exhale visible in cool air, soft smile, hand brushing fallen leaves, golden hour, slow side tracking.

Emotion as abstraction («happy», «feeling good») — the model can't read it. Translating into physical details (slow exhale, soft smile, hand brushing leaves) yields visible motion.

Example 3

Before

A complex cinematic scene where a detective enters the dimly lit room, looks around suspiciously, finds a clue on the table, picks it up, examines it carefully, and then walks out the door

After

Shot 1 (wide establishing, 0-1s): A detective in a wool coat enters a dim hotel room; single hard top-down key, deep falloff to black.
Shot 2 (close-up, 1-3s): His hand picks up a folded note from the wood desk; warm amber practical light.
Shot 3 (medium tracking, 3-5s): He turns and walks toward the door; slow side tracking, neon glow through the blinds.

Flat prose with multiple actions collapses into one blurred motion. A timecoded shot list separates beats correctly.

Frequently asked

What is the optimal prompt length?
~20 words for a simple shot is the sweet spot. The default template «[Subject] [does action] in [setting], [time of day], [one atmosphere or camera cue]» covers 80% of tasks. Longer is justified only when the shot leans on camera language (then the cue goes at the end) or for multi-beat scenes with a timecoded shot list.
How does joint audio-video work?
Sound and video are generated in one forward pass and synchronized by default. Control sound via text: «dialogue in English: '...'», «ambient: distant traffic», «Foley: footsteps on gravel». If sound isn't described, the model invents it from visual logic. This is a unique Happy Horse feature — most video models generate sound separately or not at all.
Which languages are supported for lip-sync?
Seven: English, Mandarin, Cantonese, Japanese, Korean, German, French. With ultra-low WER (word error rate). Specify the language in the DIALOGUE block: «dialogue in Korean: '...'». Joint audio-video synchronizes speech and lip movement automatically. Note: visuals render better in English even when dialogue is in another language.
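The audio cues and the language list from the two answers above can be combined in a small helper. This is a sketch under assumptions: `add_audio_cues` is a hypothetical function, and putting each cue on its own line is an illustrative layout choice, not a documented format.

```python
LIP_SYNC_LANGUAGES = {
    "English", "Mandarin", "Cantonese", "Japanese",
    "Korean", "German", "French",
}


def add_audio_cues(visual_prompt, dialogue=None, language="English",
                   ambient=None, foley=None):
    """Append dialogue, ambient, and Foley cues to an English visual prompt."""
    parts = [visual_prompt.strip()]
    if dialogue:
        if language not in LIP_SYNC_LANGUAGES:
            raise ValueError(f"No lip-sync support listed for {language}")
        parts.append(f"dialogue in {language}: '{dialogue}'")
    if ambient:
        parts.append(f"ambient: {ambient}")
    if foley:
        parts.append(f"Foley: {foley}")
    # One cue per line is an assumed layout.
    return "\n".join(parts)


print(add_audio_cues(
    "A man walks up a gravel path toward a cabin at dusk, slow dolly-in",
    dialogue="안녕하세요",
    language="Korean",
    ambient="distant traffic",
    foley="footsteps on gravel",
))
```

The visual description stays in English while only the quoted dialogue switches language, matching the note above.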
When should a timecoded shot list be used?
For multi-beat scenes — when one clip needs several different shots or actions. Format: «Shot 1 (wide establishing, 0-1s): ...», «Shot 2 (close-up, 1-3s): ...». In fal.ai tests a shot list separates beats correctly, while the same scene as flat prose collapses. For a single simple shot a shot list is overkill — use the default 20-word template.
What video duration is supported?
5–8 seconds by default, up to 12 on Lite, up to 15 on the paid tier. Native 1080p without upscaling. Generation time ~10 seconds on average, ~38 seconds for 1080p on NVIDIA H100, ~2 seconds for a 5-sec 256p preview. Aspect ratios: 16:9, 9:16, 4:3, 21:9, 1:1.
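The limits in this answer can be checked before a request is sent. The sketch below is illustrative: `check_request` and the tier keys `default`, `lite`, and `paid` are hypothetical names, and applying the 5-second minimum to every tier is an assumption.

```python
ASPECT_RATIOS = {"16:9", "9:16", "4:3", "21:9", "1:1"}
MAX_SECONDS = {"default": 8, "lite": 12, "paid": 15}  # hypothetical tier keys
MIN_SECONDS = 5  # assumed to apply to every tier


def check_request(duration_s, aspect_ratio, tier="default"):
    """Return a list of problems with the requested clip settings."""
    problems = []
    if tier not in MAX_SECONDS:
        problems.append(f"unknown tier: {tier}")
    elif not MIN_SECONDS <= duration_s <= MAX_SECONDS[tier]:
        problems.append(
            f"duration {duration_s}s is outside "
            f"{MIN_SECONDS}-{MAX_SECONDS[tier]}s for tier '{tier}'"
        )
    if aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {aspect_ratio}")
    return problems


print(check_request(12, "16:9"))
print(check_request(12, "16:9", tier="lite"))
```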
What works better, T2V or I2V?
Both modes share the same weights and perform equally well. I2V is handy when there's a concrete visual anchor (product photo, portrait, concept art) — then the prompt describes motion rather than re-describing the picture. T2V — for from-scratch scene generation. For I2V, don't describe the visual in detail; focus on motion and atmosphere.
Does Opten support Happy Horse?
Yes, the Opten extension auto-detects Happy Horse 1.0 and scores prompts against the structure outlined above: it checks alignment with the default 20-word template, absence of hedge epithets and quality boosters, physical details instead of abstract emotion, and English for visuals. One click delivers a rewrite in the correct structure.


Opten is a Chrome extension that scores AI prompts for the specific model. Supports 60+ image and video models — Midjourney, GPT Image 2, Kling, Sora, Nano Banana, Flux — and rewrites them in one click inside the Syntx, Higgsfield, and Freepik interfaces. From $2.99/month.

© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672