OmniHuman 1.5: how to write prompts the model actually understands

ByteDance · Updated:

OmniHuman 1.5 is ByteDance's specialized video model for animating people via Image + Audio → Video. It outputs 1024×1024 at 30fps, up to 30 seconds via API. Audio is the primary driver (lip-sync and body language); the text prompt is a supplement for scene and camera. Write prompts in English; the audio can be in any language.

What OmniHuman does

OmniHuman isn't a universal video generator. It's a narrow specialized model for bringing a single human image to life using audio. The architecture is dual-system: Diffusion Transformer (System 1) for visuals + Multimodal LLM (System 2) for context understanding. Trained on 18,700 hours of human motion video; context window 50,000 tokens.

Three inputs work together: image (required — portrait, half-body, or full-body), audio (for lip-sync and body language), text prompt (a supplement for scene, camera, action, emotion). Quality = consistency across all three inputs. Subjects supported: real people, animals, stylized characters, and 3D models.

  • Image + Audio → Video, specialized in human animation
  • 1024×1024 at 30fps, up to 30 seconds via API
  • Audio-driven lip-sync with emotionally responsive body language
  • Subjects: real people, animals, stylized characters, 3D
  • Multi-character scenes with explicit speaker assignment
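The way the three inputs relate can be sketched as a small pre-flight check. This is only an illustration: `GenerationInputs` and `check_inputs` are hypothetical names, not part of any OmniHuman API, and the severity labels are an assumption drawn from the description above (image required, audio as the main driver, prompt as a supplement).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GenerationInputs:
    """Hypothetical container for the three inputs; not an official schema."""
    image_path: Optional[str]   # required: portrait, half-body, or full-body
    audio_path: Optional[str]   # primary driver for lip-sync and body language
    prompt: str = ""            # supplement: scene, camera, action, emotion


def check_inputs(inputs: GenerationInputs) -> List[str]:
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    if not inputs.image_path:
        problems.append("error: an input image is required")
    if not inputs.audio_path:
        problems.append("warning: no audio, expect weak lip-sync and motion")
    if not inputs.prompt.strip():
        problems.append("note: no prompt, scene and camera are left to the model")
    return problems
```

Calling `check_inputs(GenerationInputs("dj.png", "set.wav", "A male DJ performing live on stage"))` returns an empty list, while a request with neither image nor audio returns all three messages.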

Prompt structure

The text prompt is a supplement to audio. Audio sets tempo, emotion, lip-sync; the prompt describes scene, camera, action. Don't write long appearance descriptions — appearance is set by the image.

Formula: [Character description + pose] + [Action/movement] + [Camera] + [Emotional tone].

Example: «A male DJ performing live on stage, wearing headphones and mixing music on a DJ controller, focused expression, subtle head movement following the beat.» Short natural scenarios work better than keyword lists. Optimal length 15-40 words. The main thing is consistency with audio and image, not verbosity.
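The formula and the 15-40 word target can be sketched as two small helpers. The function names are hypothetical, and the `camera` value in the usage example ("static mid-shot") is an illustrative addition to the DJ example, not something the model requires.

```python
def build_prompt(character: str, action: str, camera: str, tone: str) -> str:
    """Join the four formula slots into one short prompt, skipping empty slots."""
    slots = [s.strip().rstrip(".,") for s in (character, action, camera, tone)]
    filled = [s for s in slots if s]
    return ", ".join(filled) + "." if filled else ""


def word_count(prompt: str) -> int:
    """Count whitespace-separated words."""
    return len(prompt.split())


def in_target_range(prompt: str, low: int = 15, high: int = 40) -> bool:
    """True if the prompt length falls inside the 15-40 word target."""
    return low <= word_count(prompt) <= high


prompt = build_prompt(
    character="A male DJ performing live on stage, wearing headphones",
    action="mixing music on a DJ controller",
    camera="static mid-shot",  # illustrative value
    tone="focused expression, subtle head movement following the beat",
)
print(word_count(prompt), in_target_range(prompt))  # 25 True
```

Keeping the slots separate makes it easy to see which part of the formula is missing before the prompt is sent.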

Three-input consistency is the main rule

Output quality = consistency of image + audio + prompt. Breaking this rule is behind most failed generations.

Inconsistent example: the image is a businessman in an office; the audio is rock music; the prompt is «DJ performing on stage.» The model can't resolve the conflict and outputs something strange. Consistent example: image — DJ in headphones; audio — electronic music; prompt — «male DJ performing live on stage, focused expression, subtle head movement following the beat.» All three inputs say the same thing.

If you want lip-sync, the audio must contain speech or vocals. If you want a dance to a beat, the audio must contain rhythmic music. If you want a calm talking head, the audio should be a podcast, not a rock track.
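The goal-to-audio rules above amount to a lookup table. A minimal sketch follows; the goal names and audio labels are made-up identifiers for illustration, and tagging the audio with its content types is assumed to happen by hand or in a separate tool.

```python
from typing import Dict, Set

# Hypothetical labels: what the audio must contain for each goal.
REQUIRED_AUDIO: Dict[str, Set[str]] = {
    "lip_sync": {"speech", "vocals"},
    "dance": {"rhythmic_music"},
    "talking_head": {"speech"},
}


def audio_matches_goal(goal: str, audio_content: Set[str]) -> bool:
    """True if the audio contains at least one content type the goal needs."""
    return bool(REQUIRED_AUDIO[goal] & audio_content)


print(audio_matches_goal("lip_sync", {"vocals", "rhythmic_music"}))  # True
print(audio_matches_goal("talking_head", {"rhythmic_music"}))        # False
```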

Talking head and presentations

OmniHuman's main production scenario — animating a speaker from a single photo. Podcasts, video lessons, corporate clips, explainers — anything that needs lip-sync without filming. Saves a day of production: one photo, one audio track → finished clip.

For talking head the prompt is minimal: «A speaker addressing the camera with a calm professional tone, slight natural head movement, occasional hand gestures off-frame.» Audio sets the rest — pauses, intonation, emotion. Use a static camera or light zoom-in — this matches talking-head aesthetics and doesn't distract from the face.

Common mistakes

  1. Using it as text-to-video

    OmniHuman ALWAYS requires a human image. It's not a general video generator. If you submit only a text prompt without uploading a reference, generation isn't possible. For T2V use Veo, Sora, Kling, or Hailuo. OmniHuman is a narrow specialized model for animating one photo, not an alternative to general video models.

  2. No audio

    OmniHuman's headline feature is audio-driven lip-sync with emotionally responsive body language. Without audio the model can't sync lips and gets no signal about tempo or emotion. The result degrades sharply: a static portrait or chaotic facial movement. Every generation needs audio, even if it's just an ambient background.

  3. Input mismatch

    DJ in the prompt + classical music in the audio + businessman portrait in the reference image = a conflict the model won't resolve. All three inputs must say the same thing. Before generating, check three things: does the subject in the image match the description in the prompt; does the audio's emotional tone match the action in the prompt; does the visual scene match the acoustic environment.

  4. Describing the subject's appearance

    Appearance is locked in by the input image. A long description like «handsome young man with blonde hair, blue eyes, wearing a black suit» spends tokens on nothing before the prompt even reaches the action. Write only what the character does, how the camera moves, what the emotional tone is, and what scene surrounds them. 15-40 words is more than enough.

  5. Expecting high resolution

    OmniHuman outputs 1024×1024 at 30fps. That's not 4K and not widescreen 1080p. For YouTube-resolution production video you need a post-upscale (Topaz or a separate super-resolution pass). For social content (Reels, Shorts, vertical TikTok) 1024×1024 is fine with a light crop. For presentations and podcasts it's also OK. For broadcast or cinema it's insufficient.
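The resolution point in the last item is simple arithmetic: how much of the 1024×1024 frame survives a crop to the target aspect ratio, and how far that crop must be upscaled. A short sketch, with `crop_and_scale` as a hypothetical helper name and the target sizes chosen as common examples:

```python
from typing import Tuple

SOURCE = 1024  # the model's output is 1024x1024


def crop_and_scale(target_w: int, target_h: int) -> Tuple[int, int, float]:
    """Largest crop of the square frame with the target aspect ratio,
    plus the upscale factor needed to reach the target size."""
    if target_w >= target_h:
        crop_w, crop_h = SOURCE, round(SOURCE * target_h / target_w)
    else:
        crop_w, crop_h = round(SOURCE * target_w / target_h), SOURCE
    return crop_w, crop_h, target_w / crop_w


print(crop_and_scale(1920, 1080))  # (1024, 576, 1.875)
print(crop_and_scale(1080, 1920))  # (576, 1024, 1.875)
print(crop_and_scale(3840, 2160))  # (1024, 576, 3.75)
```

A full 16:9 or 9:16 crop keeps 576 of the 1024 pixels on the short side, so reaching 1080p takes a 1.875× upscale and 4K takes 3.75×.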

Before / after examples

Example 1

Before

animate my business partner for a presentation

After

A professional speaker addressing the camera with a calm confident tone, slight natural head movements, occasional subtle hand gestures appearing at the bottom of frame. Static camera, mid-shot framing, neutral business office background visible behind. Focused friendly expression, executive presentation aesthetic.

Appearance isn't described because it's on the reference. Tone (calm confident), motion (slight natural head movements), camera (static mid-shot), and expression (focused friendly) are explicit. At about 40 words, the prompt sits at the top of the 15-40 word target range.
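A rough way to catch appearance descriptions before sending a prompt is a word-list check. The sketch below is only an illustration: the word list is a small made-up sample, and a real check would need a broader vocabulary and some tolerance for false positives.

```python
import re
from typing import List

# Illustrative sample only; extend it for your own prompts.
APPEARANCE_WORDS = {
    "handsome", "beautiful", "blonde", "brunette", "hair", "eyes",
    "beard", "suit", "dress", "tall", "young", "old",
}


def appearance_terms(prompt: str) -> List[str]:
    """Return appearance-related words found in the prompt, in order."""
    words = re.findall(r"[a-z]+", prompt.lower())
    return [w for w in words if w in APPEARANCE_WORDS]


bad = "handsome young man with blonde hair, blue eyes, wearing a black suit"
good = "A professional speaker addressing the camera with a calm confident tone"
print(appearance_terms(bad))   # ['handsome', 'young', 'blonde', 'hair', 'eyes', 'suit']
print(appearance_terms(good))  # []
```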

Example 2

Before

DJ playing music

After

A male DJ performing live on a club stage, wearing headphones, hands operating a DJ controller, subtle head and shoulder movement following the beat of the audio. Tracking shot slowly orbiting from left to right. Energetic focused expression, club lighting atmosphere with magenta and blue accents.

Consistent with the assumed audio (electronic beat). Instrument interaction described (operating DJ controller), motion in tempo (following the beat), camera (tracking orbit), atmosphere (club lighting).

Example 3

Before

two people talking on a podcast

After

Two people in a warmly-lit podcast studio. The man on the left is speaking (lip-sync to audio), occasional emphatic hand gestures, engaged expression. The woman on the right is listening attentively, slight nods and subtle micro-reactions on her face. Static two-shot framing, soft warm key light, intimate atmosphere.

Multi-character: both roles are explicit, the speaker (man on the left, lip-sync to audio) and the listener (woman on the right, micro-reactions). Without this OmniHuman doesn't know whose lips to sync.
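Explicit speaker assignment can be sketched as a template that refuses to build a prompt without a named speaker. The helper names and the exact wording of the generated sentences are assumptions for illustration; only the idea of naming one speaker and marking the others as listeners comes from the example above.

```python
from typing import List


def _cap(text: str) -> str:
    """Capitalize the first letter without lowercasing the rest."""
    text = text.strip()
    return text[:1].upper() + text[1:]


def multi_character_prompt(scene: str, speaker: str,
                           listeners: List[str], camera: str) -> str:
    """Build a prompt that names exactly one speaker and marks the rest as listeners."""
    if not speaker.strip():
        raise ValueError("name the speaker explicitly, e.g. 'the man on the left'")
    sentences = [_cap(scene), f"{_cap(speaker)} is speaking (lip-sync to audio)"]
    sentences += [f"{_cap(name)} is listening attentively" for name in listeners]
    sentences.append(_cap(camera))
    return " ".join(s.rstrip(".") + "." for s in sentences if s)


print(multi_character_prompt(
    scene="Two people in a warmly-lit podcast studio",
    speaker="the man on the left",
    listeners=["the woman on the right"],
    camera="Static two-shot framing",
))
```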

Frequently asked

How is OmniHuman different from Veo or Sora?
Veo and Sora are general video models for generating any scene from text (T2V) or image (I2V). OmniHuman is a narrow specialized model ONLY for human animation via Image + Audio → Video. The headline feature is audio-driven lip-sync with emotional body language. It's not «better or worse than Veo» — it's a different class of tool for a specific task: bringing a single portrait to life with audio.
Can I use OmniHuman without audio?
Technically yes, but not recommended. Audio-driven lip-sync is the model's headline feature; without audio OmniHuman loses the main signal for tempo, emotion, and body language. The result degrades to a static portrait or chaotic facial movement. If there's no speech audio, use at least an ambient background or a music track to set the motion rhythm. OmniHuman isn't designed for silence.
Is the model suitable for dialog between two people?
Only partially. OmniHuman supports multi-character scenes with one speaker and background reactions from the other characters. But a real dialog (two speaking turns) in one pass isn't possible: the model syncs lips to a single audio track. The workaround is two passes with different audio plus a post-edit, or pre-edited audio with explicit speaker markers.
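The two-pass workaround starts with splitting the dialog into per-speaker intervals. A minimal sketch, assuming the turn timings are already known (for example from a transcript) and that each pass gets an audio track containing only that speaker's intervals; both the `Turn` layout and that audio-preparation step are assumptions, not documented behaviour.

```python
from typing import Dict, List, Tuple

Turn = Tuple[str, float, float]  # (speaker, start_sec, end_sec)


def passes_by_speaker(turns: List[Turn]) -> Dict[str, List[Tuple[float, float]]]:
    """Group speaking intervals by speaker: one generation pass per speaker."""
    passes: Dict[str, List[Tuple[float, float]]] = {}
    for speaker, start, end in turns:
        if end <= start:
            raise ValueError(f"empty or reversed turn for {speaker}")
        passes.setdefault(speaker, []).append((start, end))
    return passes


turns = [("host", 0.0, 4.0), ("guest", 4.0, 9.0), ("host", 9.0, 12.0)]
print(passes_by_speaker(turns))
# {'host': [(0.0, 4.0), (9.0, 12.0)], 'guest': [(4.0, 9.0)]}
```

Each speaker's interval list then tells the editor where that pass is the one on screen when the two clips are combined in post.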
Which subject types are supported?
Real people (the main scenario), animals (animating talking cats and dogs works surprisingly well), stylized or animated characters (cartoon, anime), 3D models, and avatars. The key requirement: the reference must have one subject as the «main character.» Multi-character scenes with one speaker and background reactions also work, but need explicit speaker assignment.
What duration is available?
Via API — up to 30 seconds. The research version of the model supports over a minute, but that version isn't publicly available. 30 seconds is enough for a talking-head presentation, a short podcast, a music clip, a product video. For longer videos — generate several 30-second segments and stitch them in post. For short social clips (Reels, Shorts) the limit is irrelevant.
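Planning the segments for a longer video is a short calculation. The sketch below splits a total duration into equal segments that each fit the 30-second limit; `plan_segments` is a hypothetical helper, and equal-length cuts are a simplifying assumption, since in practice cut points are better placed on pauses in the speech.

```python
import math
from typing import List, Tuple

MAX_SEGMENT_SEC = 30.0  # API limit stated above
FPS = 30                # output frame rate stated above


def plan_segments(total_sec: float,
                  max_sec: float = MAX_SEGMENT_SEC) -> List[Tuple[float, float]]:
    """Split [0, total_sec] into equal-length segments no longer than max_sec."""
    if total_sec <= 0:
        raise ValueError("total_sec must be positive")
    count = math.ceil(total_sec / max_sec)
    length = total_sec / count
    return [(i * length, (i + 1) * length) for i in range(count)]


print(plan_segments(75))  # [(0.0, 25.0), (25.0, 50.0), (50.0, 75.0)]
print(int(MAX_SEGMENT_SEC * FPS))  # 900 frames in one full-length segment
```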
What's the optimal prompt length?
15-40 words. A short prompt is the norm for OmniHuman, not a shortcoming. The text prompt supplements the audio; it doesn't replace it. You don't need to describe appearance (it's on the image) or the emotional arc (it's in the audio). It's enough to specify what the character does in frame, how the camera behaves, what the emotional tone is, and what scene surrounds them.
Does Opten support OmniHuman 1.5?
Yes, the Opten extension auto-detects ByteDance OmniHuman and scores prompts against the structure above: it checks that an input image is present, consistency across image + audio + prompt, absence of appearance description (it's on the reference), focus on action and camera, and optimal 15-40 word length. For multi-character — it checks for explicit speaker assignment. One click gives you a rewrite in the correct formula.

Ready to write OmniHuman 1.5 prompts in one click?

  • Auto-detects the model inside its native interface
  • Scores every line of your prompt
  • One-click rewrite into the correct structure

Pro — $2.99/month or ₽199/month · cancel anytime

Opten is a Chrome extension that scores AI prompts for the specific model. Supports 60+ image and video models — Midjourney, GPT Image 2, Kling, Sora, Nano Banana, Flux — and rewrites them in one click inside the Syntx, Higgsfield, and Freepik interfaces. From $2.99/month.

© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672