Veed Fabric: how to write prompts the model actually understands
VEED
Veed Fabric 1.0 is a specialized lip-sync model, not a general video generator. The input is an image plus audio (or a TTS speech script), and the output is an animated talking head with lip, head, and hand motion. It supports 30+ languages and up to 5 minutes via API. A traditional text prompt does not apply here.
How Fabric works and how it differs
Fabric is not text-to-video. It is a lip-sync / talking head system built on a Diffusion Transformer (DiT) architecture that animates a still image to audio. The input is a pair: one image and one audio file. The model produces lip-sync, adds natural head and body motion, and adds hand gestures tied to the speech rhythm.
The image can be in any style: photo, illustration, anime, 3D render, clay mascot, brand character. This is the key difference from classic avatar generators: Fabric does not require a photorealistic face. Audio is speech or music. Output resolutions are 480p and 720p at 25 fps, in 16:9, 9:16, or 1:1 aspect ratio. On speed, Fabric 1.0 Fast is roughly 2.5× faster than Standard, and the 480p path renders 10 seconds of video in about 1.5 minutes.
- Image + Audio → lip-synced video (not T2V)
- Any input image style — photo, illustration, anime, 3D
- 30+ languages, up to 5 minutes via API
- Resolutions 480p and 720p, 25 fps, aspect ratios 16:9 / 9:16 / 1:1
- Fast variant runs ~2.5× faster than Standard
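The limits in the list above can be checked before a job is submitted. The following is a minimal sketch, not VEED's API: the class name and field names are hypothetical, and only the allowed values come from this article.

```python
from dataclasses import dataclass

# Allowed values come from the list above. The field names are hypothetical
# and do not mirror any real VEED API parameter.
RESOLUTIONS = {"480p", "720p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1"}
MAX_DURATION_SECONDS = 5 * 60  # "up to 5 minutes via API"


@dataclass(frozen=True)
class RenderSettings:
    resolution: str
    aspect_ratio: str
    audio_seconds: float

    def problems(self) -> list[str]:
        """Collect every setting that falls outside the documented limits."""
        found = []
        if self.resolution not in RESOLUTIONS:
            found.append(f"resolution must be one of {sorted(RESOLUTIONS)}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            found.append(f"aspect ratio must be one of {sorted(ASPECT_RATIOS)}")
        if not 0 < self.audio_seconds <= MAX_DURATION_SECONDS:
            found.append(f"audio must be between 0 and {MAX_DURATION_SECONDS} seconds")
        return found


print(RenderSettings("720p", "9:16", 42.0).problems())    # no problems
print(RenderSettings("1080p", "4:3", 400.0).problems())   # three problems
```

Returning a list of problems rather than raising on the first one lets a caller report everything that needs fixing in a single pass.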
Two working modes
Audio mode: you bring your own audio file (mp3/wav/m4a/aac/ogg, up to 10 MB). There is no text prompt here at all — the model only syncs lips and motion to the sound. Quality depends on the image and audio pair: clean recording with no background noise, frontal image with a visible face, no extreme angles.
TTS mode (via VEED): audio is generated from a speech script by the ElevenLabs V3 engine. Here the "prompt" is the script itself: the text spoken on screen plus inline bracketed tags that control emotions, pacing, accent, and sound effects. The script can be in any of the 30+ supported languages.
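Because the script is plain text with bracketed cues, it can be assembled from structured pieces instead of typed by hand. The sketch below is one possible way to do that. `Segment` and `build_script` are illustrative names, not part of any VEED or ElevenLabs SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    text: str                  # words spoken on screen
    tag: Optional[str] = None  # optional bracketed cue placed before the text


def build_script(segments: list[Segment]) -> str:
    """Join segments into one script string, prefixing tagged segments with [tag]."""
    parts = []
    for seg in segments:
        parts.append(f"[{seg.tag}] {seg.text}" if seg.tag else seg.text)
    return " ".join(parts)


script = build_script([
    Segment("Hello, friends! It's so good to see you again.", tag="happy"),
    Segment("I've been waiting all week to share this with you."),
    Segment("Have you ever wondered what makes our community special?", tag="curious"),
])
print(script)
```

Allowing at most one tag per segment keeps the tag count tied to the number of phrases, so a script built this way is less likely to end up with a tag on every word.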
Fabric Emotions: inline tags inside the script
In TTS mode the script carries [tag] markers for emotional expression. These are not formatting markers — they are directorial cues for the voice engine:
Emotions: [excited], [happy], [sad], [angry], [curious], [nervous], [confident]. Reactions: [laughs], [sighs], [gasps], [clears throat]. Volume: [whispers], [shouting]. Pacing: [pause], [long pause], [breathes], [rushed], [drawn out]. Sound effects: [applause], [gunshot], [door creaks]. Accent: [American accent], [British accent].
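The tag vocabulary above can be kept as a lookup table, which makes it easy to spot typos in a script before audio is generated. This is a sketch under one assumption: the table holds only the tags listed in this article, so "unknown" means "not listed here" and not necessarily "rejected by the voice engine".

```python
import re

# Tag vocabulary copied from the paragraph above. Tags outside this table
# are reported as "unknown"; the engine may still accept them.
TAG_CATEGORIES = {
    "emotion": {"excited", "happy", "sad", "angry", "curious", "nervous", "confident"},
    "reaction": {"laughs", "sighs", "gasps", "clears throat"},
    "volume": {"whispers", "shouting"},
    "pacing": {"pause", "long pause", "breathes", "rushed", "drawn out"},
    "sound effect": {"applause", "gunshot", "door creaks"},
    "accent": {"american accent", "british accent"},
}

TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def classify_tags(script: str) -> list[tuple[str, str]]:
    """Return (tag, category) pairs in script order."""
    result = []
    for match in TAG_PATTERN.finditer(script):
        tag = match.group(1).strip()
        category = next(
            (name for name, tags in TAG_CATEGORIES.items() if tag.lower() in tags),
            "unknown",
        )
        result.append((tag, category))
    return result


for tag, category in classify_tags(
    "[British accent] [confident] Welcome back. [pause] [sparkles] Bye."
):
    print(f"{tag}: {category}")
```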
One rule: do not overload. Use one tag per one or two sentences, spaced out so the delivery stays natural. A tag before every word breaks the intonation: the model clips each word with awkward pauses and abrupt shifts in tone.
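The "one tag per one or two sentences" rule can be turned into a rough numeric check. The sketch below makes two assumptions that are not in the source: pause tags are treated like punctuation and left out of the count, and a sentence is anything ending in `.`, `!`, or `?`. A ratio above 1.0 suggests the script is overloaded.

```python
import re

TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")
# Assumption of this sketch: pause tags act like punctuation, so they are
# not counted as directorial tags.
IGNORED_TAGS = {"pause", "long pause"}


def tags_per_sentence(script: str) -> float:
    """Ratio of counted tags to sentences; values above 1.0 suggest overload."""
    tags = [t.strip().lower() for t in TAG_PATTERN.findall(script)]
    counted = [t for t in tags if t not in IGNORED_TAGS]
    spoken = TAG_PATTERN.sub(" ", script)  # drop tags, keep the spoken words
    sentences = [s for s in re.split(r"[.!?]+", spoken) if s.strip()]
    return len(counted) / max(len(sentences), 1)


overloaded = "[excited] Hello [happy] everyone [laughs] today [confident] I want"
paced = "[confident] Hey there! I'm Otto. [pause] [excited] Twelve full hours!"

print(tags_per_sentence(overloaded))  # 4 tags over 1 sentence
print(tags_per_sentence(paced))       # 2 counted tags over 3 sentences
```

The threshold is a heuristic. A script can pass this check and still sound unnatural, so listening to the result remains the real test.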
Requirements for the input image
The image determines the visual quality of the whole result. Base rules: frontal or near-frontal face, no heavy Dutch angles, no 90-degree profiles. The face should be well lit, with no deep shadow on one half. No occlusions (hands in front of the face, masks, glasses with strong reflections over the eyes): the model cannot produce lip-sync if it cannot see the mouth.
Formats: jpg, jpeg, png, webp, gif, avif, up to 10 MB. Style is not critical: Fabric animates photos, anime illustrations, 3D-rendered clay characters, and corporate mascots equally well. But in every case you need one clear face in frame — not a crowd, not two characters, not a profile without a visible mouth.
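The file-level limits for both inputs (format and size) are simple to check before upload. The sketch below covers only those limits. It cannot tell whether a face is visible or the recording is clean. The function and constant names are illustrative, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Extensions and size limits are taken from this article.
# Assumption: 1 MB is counted as 1024 * 1024 bytes.
TEN_MB = 10 * 1024 * 1024
LIMITS = {
    "image": ({".jpg", ".jpeg", ".png", ".webp", ".gif", ".avif"}, TEN_MB),
    "audio": ({".mp3", ".wav", ".m4a", ".aac", ".ogg"}, TEN_MB),
}


def file_problems(kind: str, filename: str, size_bytes: int) -> list[str]:
    """Return readable problems for one input file; an empty list means it passes."""
    extensions, max_bytes = LIMITS[kind]
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in extensions:
        problems.append(f"unsupported {kind} format: {ext or '(none)'}")
    if size_bytes > max_bytes:
        problems.append(f"{kind} is {size_bytes} bytes, limit is {max_bytes}")
    return problems


print(file_problems("image", "mascot.PNG", 2_000_000))   # passes
print(file_problems("audio", "voice.flac", 12_000_000))  # wrong format and too large
```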
Common mistakes
1. Describing a scene instead of writing a script
Fabric is not T2V. A prompt like "a man in a forest at sunset, walking and explaining the product" will be ignored: the model does not generate the forest, the sunset, or the walking. Supply a ready image (background plus face) and a speech script with tags. Some other tool draws the scene; Fabric only animates the face.
2. Overloading with emotion tags
A script like "[excited] Hello [happy] everyone [laughs] today [confident] I want" makes the model jitter, pause, and break into unnatural transitions. Use one tag per one or two sentences, not per word. Place reactions like [laughs] and [sighs] between phrases, not inside them. Tags work as director's notes, not as per-token markup.
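The "between phrases, not inside them" rule for reactions can be linted mechanically. The sketch below assumes a reaction is correctly placed when it sits at the start of the script or right after sentence-ending punctuation, ignoring any tags stacked directly in front of it. That definition is an assumption of this example, not an official rule.

```python
import re

TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")
REACTION_TAGS = {"laughs", "sighs", "gasps", "clears throat"}  # from this article's tag list
TRAILING_TAGS = re.compile(r"(?:\s*\[[^\[\]]+\])+\s*$")


def mid_phrase_reactions(script: str) -> list[str]:
    """Return reaction tags that interrupt a phrase instead of sitting between phrases."""
    flagged = []
    for match in TAG_PATTERN.finditer(script):
        tag = match.group(1).strip().lower()
        if tag not in REACTION_TAGS:
            continue
        # Text before the tag, ignoring any tags stacked directly in front of it.
        before = TRAILING_TAGS.sub("", script[: match.start()]).rstrip()
        if before and before[-1] not in ".!?":
            flagged.append(tag)
    return flagged


overloaded = "[excited] Hello [happy] everyone [laughs] today [confident] I want"
clean = "[happy] Hello, friends! [laughs] It's so good to see you again."

print(mid_phrase_reactions(overloaded))  # the [laughs] tag interrupts a phrase
print(mid_phrase_reactions(clean))       # nothing flagged
```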
3. Dirty audio or heavy background noise
In Audio mode lip-sync quality depends directly on how clean the sound is. Heavy background noise, echo, or music layered over speech confuses the model: lips start to drift and sync breaks. Record speech alone and add background music in post, after generation, rather than mixing it into the source audio file.
4. Extreme angles in the input image
With a strong profile, a Dutch angle, or a face hidden behind a hand, mask, or reflective glasses, the model cannot produce lip-sync. Use a frontal or near-frontal image with a clearly visible mouth and even lighting. Art style is not critical; the angle is.
5. Expecting camera moves or action
Fabric does not do dolly, push-in, or tracking shots — the model does not move the camera and does not change the shot. If the brief needs cinematic motion and physical action in an environment, that is a job for Sora 2, Veo 3.1, or Kling. Fabric covers a different case: a fixed frame brought to life by face and speech.
Before / after examples
Example 1
Before
person talking about a product
After
[TTS script for Veed Fabric, paired with a frontal product-shot image of a brand mascot] [confident] Hey there! I'm Otto, and today I'm showing you something special. [pause] Our new wireless earbuds give you twelve hours of battery on a single charge. [excited] Twelve full hours — that's almost a whole workday! [pause] [drawn out] No more low-battery anxiety. Tap the link below to grab yours.
This is a TTS script, not a scene description. The tags [confident], [excited], and [drawn out] are placed between phrases, not on every word.
Example 2
Before
a brand mascot says hello to viewers
After
[TTS script paired with a frontal illustration of the brand mascot] [happy] Hello, friends! [laughs] It's so good to see you again. [pause] I've been waiting all week to share this with you. [curious] Have you ever wondered what makes our community special? [pause] [confident] Stick around — I'll show you in the next sixty seconds.
Short script, tags distributed: one emotion → one or two phrases → pause. Reactions like [laughs] make the talking head feel alive.
Example 3
Before
explain something in two languages
After
[TTS script for Veed Fabric, paired with a frontal image of an animated instructor — illustration style] [British accent] [confident] Welcome back to the channel. Today we're tackling something most beginners get wrong. [pause] [curious] What if I told you the trick is in the timing, not the tools? [drawn out] Let me show you. [pause] In the next clip I'll walk through it step by step.
Accent is set by the [British accent] tag at the start and carries forward. The script is sized for about 15 seconds of speech — it does not try to cram a full lecture.
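The "about 15 seconds" sizing can be sanity-checked with a word count. The sketch below assumes a speaking rate of roughly 150 words per minute, a common figure for conversational English and not a value from VEED or ElevenLabs. It ignores the extra time added by pause tags, so the result is a lower bound.

```python
import re

TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")
# Assumption: about 150 spoken words per minute. The real pace depends on
# the voice, the language, and the pacing tags in the script.
WORDS_PER_MINUTE = 150


def estimated_seconds(script: str, words_per_minute: int = WORDS_PER_MINUTE) -> float:
    """Rough speech duration: spoken words (tags removed) divided by speaking rate."""
    spoken = TAG_PATTERN.sub(" ", script)
    words = spoken.split()
    return len(words) * 60 / words_per_minute


example_3 = (
    "[British accent] [confident] Welcome back to the channel. "
    "Today we're tackling something most beginners get wrong. "
    "[pause] [curious] What if I told you the trick is in the timing, not the tools? "
    "[drawn out] Let me show you. "
    "[pause] In the next clip I'll walk through it step by step."
)
print(round(estimated_seconds(example_3), 1))  # 42 words -> 16.8
```

At the assumed rate, the 42 spoken words in Example 3 come to just under 17 seconds, which is in line with the stated target once pauses are included.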