OmniHuman 1.5: how to write prompts the model actually understands
ByteDance
OmniHuman 1.5 is ByteDance's specialized video model for animating people via Image + Audio → Video. Output is 1024×1024 at 30fps, up to 30 seconds via API. The primary driver is audio (lip-sync and body language); the text prompt is a supplement for scene and camera. Write prompts in English; the audio can be in any language.
What OmniHuman does
OmniHuman isn't a universal video generator. It's a narrow specialized model for bringing a single human image to life using audio. The architecture is dual-system: Diffusion Transformer (System 1) for visuals + Multimodal LLM (System 2) for context understanding. Trained on 18,700 hours of human motion video; context window 50,000 tokens.
Three inputs work together: image (required — portrait, half-body, or full-body), audio (for lip-sync and body language), text prompt (a supplement for scene, camera, action, emotion). Quality = consistency across all three inputs. Subjects supported: real people, animals, stylized characters, and 3D models.
- Image + Audio → Video, specialized in human animation
- 1024×1024 at 30fps, up to 30 seconds via API
- Audio-driven lip-sync with emotionally responsive body language
- Subjects: real people, animals, stylized characters, 3D
- Multi-character scenes with explicit speaker assignment
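The 30-second ceiling is easy to check before uploading. The sketch below assumes that clip length follows audio length (this guide does not state that explicitly) and that the audio is an uncompressed WAV file. `MAX_SECONDS`, `FPS`, and the function names are illustrative and are not part of any OmniHuman API.

```python
import wave

MAX_SECONDS = 30   # API limit quoted above
FPS = 30           # output frame rate quoted above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frame_budget(seconds: float) -> int:
    """Number of video frames a clip of this length needs at FPS."""
    return round(seconds * FPS)


def fits_limit(seconds: float) -> bool:
    """True if the audio is non-empty and fits inside the 30-second limit."""
    return 0 < seconds <= MAX_SECONDS
```

A full 30-second clip at 30fps works out to 900 frames. Audio longer than the limit would have to be trimmed or split before submission.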
Prompt structure
The text prompt is a supplement to audio. Audio sets tempo, emotion, lip-sync; the prompt describes scene, camera, action. Don't write long appearance descriptions — appearance is set by the image.
Formula: [Character description + pose] + [Action/movement] + [Camera] + [Emotional tone].
Example: «A male DJ performing live on stage, wearing headphones and mixing music on a DJ controller, focused expression, subtle head movement following the beat.» Short natural scenarios work better than keyword lists. Optimal length 15-40 words. The main thing is consistency with audio and image, not verbosity.
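The formula and the 15-40 word target can be sketched as a small helper. This is a minimal illustration, not an official tool: `build_prompt`, `word_count`, and `in_target_range` are hypothetical names, and the camera slot ("static medium shot") is filled in here only to show all four parts, since the example above has no camera phrase.

```python
def build_prompt(character: str, action: str, camera: str, tone: str) -> str:
    """Join the four formula slots into one prompt, skipping empty slots."""
    parts = [p.strip().rstrip(".,") for p in (character, action, camera, tone)]
    return ", ".join(p for p in parts if p) + "."


def word_count(prompt: str) -> int:
    """Count whitespace-separated words."""
    return len(prompt.split())


def in_target_range(prompt: str, low: int = 15, high: int = 40) -> bool:
    """True if the prompt length falls in the 15-40 word target."""
    return low <= word_count(prompt) <= high


prompt = build_prompt(
    character="A male DJ performing live on stage",
    action="wearing headphones and mixing music on a DJ controller",
    camera="static medium shot",  # assumed camera phrase, for illustration
    tone="focused expression, subtle head movement following the beat",
)
print(prompt)
print(word_count(prompt), in_target_range(prompt))
```

Keeping the slots separate makes it harder to slip appearance details into the prompt, because each argument has one job.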
Three-input consistency is the main rule
Output quality = consistency of image + audio + prompt. Most failed generations come from breaking this rule.
Inconsistent example: the image is a businessman in an office; the audio is rock music; the prompt is «DJ performing on stage.» The model can't resolve the conflict and outputs something strange. Consistent example: image — DJ in headphones; audio — electronic music; prompt — «male DJ performing live on stage, focused expression, subtle head movement following the beat.» All three inputs say the same thing.
If you want lip-sync, the audio must contain speech or vocals. If you want a dance to a beat, the audio must contain rhythmic music. If you want a calm talking head, the audio should be a podcast, not a rock track.
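The goal-to-audio pairing above can be written down as a lookup table. In this sketch the audio is labeled by hand; nothing here analyzes the file. The goal keys, `AudioKind`, and `audio_matches_goal` are hypothetical names chosen for illustration.

```python
from enum import Enum


class AudioKind(Enum):
    SPEECH_OR_VOCALS = "speech or vocals"
    RHYTHMIC_MUSIC = "rhythmic music"
    CALM_SPEECH = "calm speech, such as a podcast"


# Goal -> kind of audio the paragraph above says it needs.
REQUIRED_AUDIO = {
    "lip_sync": AudioKind.SPEECH_OR_VOCALS,
    "dance": AudioKind.RHYTHMIC_MUSIC,
    "talking_head": AudioKind.CALM_SPEECH,
}


def audio_matches_goal(goal: str, audio_kind: AudioKind) -> bool:
    """True if the hand-assigned audio label fits what the goal needs."""
    required = REQUIRED_AUDIO[goal]
    # Calm speech is still speech, so it also satisfies plain lip-sync.
    if required is AudioKind.SPEECH_OR_VOCALS and audio_kind is AudioKind.CALM_SPEECH:
        return True
    return audio_kind is required
```

For instance, `audio_matches_goal("talking_head", AudioKind.RHYTHMIC_MUSIC)` returns `False`, which is the rock-track-for-a-podcast case described above.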
Talking head and presentations
OmniHuman's main production scenario is animating a speaker from a single photo. It covers podcasts, video lessons, corporate clips, explainers: anything that needs lip-sync without filming. It can save a day of production: one photo, one audio track → finished clip.
For talking head the prompt is minimal: «A speaker addressing the camera with a calm professional tone, slight natural head movement, occasional hand gestures off-frame.» Audio sets the rest — pauses, intonation, emotion. Use a static camera or light zoom-in — this matches talking-head aesthetics and doesn't distract from the face.
Common mistakes
1. Using it as text-to-video
OmniHuman ALWAYS requires a reference image of the subject. It's not a general video generator. If you submit only a text prompt without uploading a reference, generation isn't possible. For T2V use Veo, Sora, Kling, or Hailuo. OmniHuman is a narrow specialized model for animating one photo, not an alternative to general video models.
2. No audio
OmniHuman's headline feature is audio-driven lip-sync with emotionally responsive body language. Without audio the model can't sync lips and gets no signal about tempo or emotion. The result degrades sharply: a static portrait or chaotic facial movement. Every generation needs audio, even just an ambient background.
3. Input mismatch
DJ in the prompt + classical music in the audio + businessman portrait as the reference = a conflict the model won't resolve. All three inputs must say the same thing. Before generating, check three things: does the subject in the image match the description in the prompt; does the audio's emotional tone match the action in the prompt; does the visual scene match the acoustic environment?
4. Describing the subject's appearance
Appearance is locked in by the input image. A long description like «handsome young man with blonde hair, blue eyes, wearing a black suit» just spends tokens before the prompt reaches the action. Write only what the character does, how the camera moves, what the emotional tone is, and what scene surrounds them. 15-40 words is more than enough.
5. Expecting high resolution
OmniHuman outputs 1024×1024 at 30fps. That's not 4K and not widescreen 1080p. For YouTube-resolution production video you need a post-upscale (Topaz, a separate super-resolution pass). For social content (Reels, Shorts, vertical TikTok) 1024×1024 is fine with a light crop. For presentations and podcasts it's also OK. For broadcast or cinema it's insufficient.
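Several of the mistakes above can be caught mechanically before a request is sent. The sketch below covers the missing image, the missing audio, prompt length, and the upscale factor. Input mismatch and appearance-heavy prompts still need a human read. `Request`, `preflight`, and the field names are hypothetical and do not mirror any real OmniHuman request schema.

```python
from dataclasses import dataclass
from typing import List, Optional

NATIVE_SIDE = 1024      # output is 1024x1024
WORD_RANGE = (15, 40)   # prompt length target from this guide


@dataclass
class Request:
    image_path: Optional[str]
    audio_path: Optional[str]
    prompt: str
    target_height: int = 1024   # height needed for delivery, in pixels


def preflight(req: Request) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if not req.image_path:
        problems.append("no reference image: OmniHuman is not text-to-video")
    if not req.audio_path:
        problems.append("no audio: lip-sync and body language have no driver")
    words = len(req.prompt.split())
    if not WORD_RANGE[0] <= words <= WORD_RANGE[1]:
        problems.append(f"prompt has {words} words, target is 15-40")
    if req.target_height > NATIVE_SIDE:
        factor = req.target_height / NATIVE_SIDE
        problems.append(
            f"needs a {factor:.2f}x post-upscale to reach {req.target_height}px"
        )
    return problems
```

Calling `preflight(Request(None, None, "DJ playing music", 2160))` reports all four problems at once, including a 2.11x upscale for a 2160-pixel target.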
Before / after examples
Example 1
Before
animate my business partner for a presentation
After
A professional speaker addressing the camera with a calm confident tone, slight natural head movements, occasional subtle hand gestures appearing at the bottom of frame. Static camera, mid-shot framing, neutral business office background visible behind. Focused friendly expression, executive presentation aesthetic.
Appearance isn't described; it comes from the reference. Tone (calm confident), motion (slight natural head movements), camera (static mid-shot), and emotional tone (focused friendly) are explicit. At roughly 40 words, the length sits at the upper edge of the 15-40 word target range.
Example 2
Before
DJ playing music
After
A male DJ performing live on a club stage, wearing headphones, hands operating a DJ controller, subtle head and shoulder movement following the beat of the audio. Tracking shot slowly orbiting from left to right. Energetic focused expression, club lighting atmosphere with magenta and blue accents.
Consistent with the assumed audio (electronic beat). Instrument interaction described (operating DJ controller), motion in tempo (following the beat), camera (tracking orbit), atmosphere (club lighting).
Example 3
Before
two people talking on a podcast
After
Two people in a warmly-lit podcast studio. The man on the left is speaking (lip-sync to audio), occasional emphatic hand gestures, engaged expression. The woman on the right is listening attentively, slight nods and subtle micro-reactions on her face. Static two-shot framing, soft warm key light, intimate atmosphere.
Multi-character: both roles are explicit, the speaker (man on the left, lip-sync to audio) and the listener (woman on the right, micro-reactions). Without this OmniHuman doesn't know whose lips to sync.
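Explicit speaker assignment can be enforced when the prompt is assembled. This sketch assumes a single audio track with exactly one active speaker, as in Example 3. That restriction is an assumption made here, not a documented model limit. `Character` and `describe_characters` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Character:
    label: str       # position-based label, e.g. "The man on the left"
    speaking: bool   # True for the one character whose lips follow the audio
    behavior: str    # short motion or expression description


def describe_characters(characters: List[Character]) -> str:
    """Build the per-character part of a prompt with exactly one speaker."""
    speakers = [c for c in characters if c.speaking]
    if len(speakers) != 1:
        raise ValueError(f"expected exactly one speaker, got {len(speakers)}")
    sentences = []
    for c in characters:
        role = "is speaking (lip-sync to audio)" if c.speaking else "is listening"
        sentences.append(f"{c.label} {role}, {c.behavior}.")
    return " ".join(sentences)
```

Scene, camera, and lighting phrases would be added around this text. The function only guarantees that every character has a role and that the lip-sync target is unambiguous.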