Veed Fabric: how to write prompts the model actually understands

VEED

Veed Fabric 1.0 is a specialized lip-sync model, not a general video generator. The input is an image plus audio (or a TTS speech script), and the output is an animated talking head with lip, head, and hand motion. It supports 30+ languages and clips of up to 5 minutes via the API. A traditional text prompt does not apply here.

How Fabric works and how it differs

Fabric is not text-to-video. It is a lip-sync / talking head system built on a Diffusion Transformer (DiT) architecture that animates a still image to audio. The input is a pair: one image and one audio file. The model produces lip-sync, adds natural head and body motion, and adds hand gestures tied to the speech rhythm.

The image can be in any style: photo, illustration, anime, 3D render, clay mascot, brand character. This is the key difference from classic avatar generators: Fabric does not require a photorealistic face. Audio is speech or music. Resolutions are 480p and 720p, frame rate is 25 fps, and the supported aspect ratios are 16:9, 9:16, and 1:1. Speed: Fabric 1.0 Fast is roughly 2.5× faster than Standard; at 480p, Standard renders 10 seconds of video in about 1.5 minutes.

  • Image + Audio → lip-synced video (not T2V)
  • Any input image style — photo, illustration, anime, 3D
  • 30+ languages, up to 5 minutes via API
  • Resolutions 480p and 720p, 25 fps, formats 16:9 / 9:16 / 1:1
  • Fast variant runs ~2.5× faster than Standard
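
The speed figures quoted in this article can be turned into a rough render-time estimate. The sketch below is only an illustration: it assumes render time grows linearly with clip length and that the ~2.5× Fast speedup applies at both resolutions, neither of which the article confirms. The function name and constants are hypothetical, and the 720p figure (about 5 minutes per 10 seconds on Standard) is the one given in the FAQ.

```python
# Approximate Standard render time, in minutes per 10 seconds of video,
# as quoted in this article. Real timings will vary.
STANDARD_MINUTES_PER_10S = {"480p": 1.5, "720p": 5.0}

# "Roughly 2.5x faster"; assumed here to hold at both resolutions.
FAST_SPEEDUP = 2.5


def estimate_render_minutes(clip_seconds: float,
                            resolution: str = "480p",
                            fast: bool = False) -> float:
    """Rough render-time estimate, assuming linear scaling with clip length."""
    if resolution not in STANDARD_MINUTES_PER_10S:
        raise ValueError(f"unsupported resolution: {resolution}")
    minutes = STANDARD_MINUTES_PER_10S[resolution] * clip_seconds / 10.0
    return minutes / FAST_SPEEDUP if fast else minutes


if __name__ == "__main__":
    for use_fast in (False, True):
        variant = "Fast" if use_fast else "Standard"
        print(variant, estimate_render_minutes(30, "720p", fast=use_fast))
```

Under these assumptions a 30-second clip at 720p comes to about 15 minutes on Standard and about 6 minutes on Fast, which is the practical argument for iterating on Fast.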

Two working modes

Audio mode: you bring your own audio file (mp3/wav/m4a/aac/ogg, up to 10 MB). There is no text prompt here at all — the model only syncs lips and motion to the sound. Quality depends on the image and audio pair: clean recording with no background noise, frontal image with a visible face, no extreme angles.

TTS mode (via VEED): audio is generated from a speech script by the ElevenLabs V3 engine. Here the "prompt" is the script itself: the text spoken on screen plus inline bracketed tags that control emotions, pacing, accent, and sound effects. The script can be in any of the 30+ supported languages.

Fabric Emotions: inline tags inside the script

In TTS mode the script carries [tag] markers for emotional expression. These are not formatting markers — they are directorial cues for the voice engine:

Emotions: [excited], [happy], [sad], [angry], [curious], [nervous], [confident]. Reactions: [laughs], [sighs], [gasps], [clears throat]. Volume: [whispers], [shouting]. Pacing: [pause], [long pause], [breathes], [rushed], [drawn out]. Sound effects: [applause], [gunshot], [door creaks]. Accent: [American accent], [British accent].
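
To see which categories a draft script actually uses, the tags can be pulled out and grouped. The sketch below copies the tag list from the paragraph above; the voice engine may well accept tags not listed here, so "unlisted" only means "not named in this article", not "invalid". The names `TAG_CATEGORIES` and `classify_tags` are hypothetical.

```python
import re

# Tag vocabulary copied from the list above; not an exhaustive engine spec.
TAG_CATEGORIES = {
    "emotion": {"excited", "happy", "sad", "angry", "curious", "nervous", "confident"},
    "reaction": {"laughs", "sighs", "gasps", "clears throat"},
    "volume": {"whispers", "shouting"},
    "pacing": {"pause", "long pause", "breathes", "rushed", "drawn out"},
    "sound effect": {"applause", "gunshot", "door creaks"},
    "accent": {"american accent", "british accent"},
}

# Matches one [tag] and captures the text between the brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def classify_tags(script: str) -> list[tuple[str, str]]:
    """Return (tag, category) pairs in script order."""
    pairs = []
    for raw in TAG_PATTERN.findall(script):
        name = raw.strip().lower()
        category = next(
            (cat for cat, names in TAG_CATEGORIES.items() if name in names),
            "unlisted",  # not named in this article; may still be accepted
        )
        pairs.append((raw.strip(), category))
    return pairs


if __name__ == "__main__":
    demo = "[British accent] [confident] Welcome back. [pause] Let me show you."
    for tag, category in classify_tags(demo):
        print(f"{tag}: {category}")
```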

One rule: do not overload. Use one tag per 1-2 sentences, distributed gradually for natural delivery. A tag before every word breaks the intonation: the model starts clipping each word with awkward pauses and abrupt shifts in tone.
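
The density rule can be checked mechanically before a script is submitted. This is a minimal sketch, assuming a naive sentence split on `.`, `!`, and `?` and treating "more than one tag per sentence on average" as overloaded; the threshold and function names are illustrative choices, not anything defined by VEED or ElevenLabs.

```python
import re

TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")


def tag_density(script: str) -> float:
    """Average number of [tags] per sentence (naive sentence split)."""
    tag_count = len(TAG_PATTERN.findall(script))
    spoken = TAG_PATTERN.sub(" ", script)
    # Naive split: abbreviations and decimals will be miscounted.
    sentences = [s for s in re.split(r"[.!?]+", spoken) if s.strip()]
    return tag_count / max(len(sentences), 1)


def is_overloaded(script: str, max_tags_per_sentence: float = 1.0) -> bool:
    """True when the script averages more tags per sentence than allowed."""
    return tag_density(script) > max_tags_per_sentence


if __name__ == "__main__":
    crowded = "[excited] Hello [happy] everyone [laughs] today [confident] I want"
    paced = "[confident] Hey there! I'm Otto. [pause] Our earbuds last twelve hours."
    print(is_overloaded(crowded), is_overloaded(paced))
```

An average is a blunt measure: it will not catch a single crowded sentence in an otherwise sparse script, so treat it as a first-pass warning rather than a verdict.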

Requirements for the input image

The image sets the entire visual quality. Base rules: frontal or near-frontal face, no heavy Dutch angles, no 90-degree profiles. The face should be well lit, with no deep shadow on one half. No occlusions (hands in front of the face, masks, glasses with strong reflections on the eyes): the model cannot produce lip-sync if it cannot see the mouth.

Formats: jpg, jpeg, png, webp, gif, avif, up to 10 MB. Style is not critical: Fabric animates photos, anime illustrations, 3D-rendered clay characters, and corporate mascots equally well. But in every case you need one clear face in frame — not a crowd, not two characters, not a profile without a visible mouth.
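
The file-level requirements for both inputs (image formats here, audio formats under "Two working modes") are easy to check before upload. The sketch below takes a file name and a size in bytes so it can run without touching disk. It assumes "10 MB" means 10,000,000 bytes, which is the stricter reading; the actual limit may be defined differently. All names are hypothetical, and the check says nothing about face angle or lighting.

```python
from pathlib import Path

# "Up to 10 MB" read conservatively as decimal megabytes (an assumption).
MAX_BYTES = 10 * 1000 * 1000

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".gif", ".avif"}
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".aac", ".ogg"}


def check_input(name: str, size_bytes: int, allowed: set[str]) -> list[str]:
    """Return a list of problems for one file; empty means it looks acceptable."""
    problems = []
    ext = Path(name).suffix.lower()
    if ext not in allowed:
        problems.append(f"{name}: extension {ext or '(none)'} is not supported")
    if size_bytes > MAX_BYTES:
        problems.append(f"{name}: {size_bytes} bytes exceeds the 10 MB limit")
    return problems


def check_pair(image: str, image_bytes: int,
               audio: str, audio_bytes: int) -> list[str]:
    """Check an image + audio pair against the stated formats and size limit."""
    return (check_input(image, image_bytes, IMAGE_EXTS)
            + check_input(audio, audio_bytes, AUDIO_EXTS))


if __name__ == "__main__":
    for problem in check_pair("otto.bmp", 2_000_000, "voice.wav", 12_000_000):
        print(problem)
```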

Common mistakes

  1. Describing a scene instead of writing a script

    Fabric is not T2V. A prompt like "a man in a forest at sunset, walking and explaining the product" will be ignored: the model does not generate the forest, the sunset, or the walking. Supply a finished image (background plus face) and a speech script with tags. The scene has to come from another tool; Fabric only animates the face.

  2. Overloading with emotion tags

    [excited] Hello [happy] everyone [laughs] today [confident] I want — the model will jitter, pause, and break into unnatural transitions. One tag per 1-2 sentences, not per word. Place reactions like [laughs] and [sighs] between phrases, not inside them. Tags work as director's notes, not as per-token markup.

  3. Dirty audio or heavy background noise

    In Audio mode, lip-sync quality depends directly on how clean the sound is. Heavy background noise, echo, or music layered over speech confuses the model: lips start to drift and sync breaks. Record speech alone and add background music in post, after generation, rather than mixing it into the source audio file.

  4. Extreme angles in the input image

    With a strong profile, a Dutch angle, or a face hidden behind a hand, mask, or reflective glasses, the model cannot produce lip-sync. Use a frontal or near-frontal image with a clearly visible mouth and even lighting. Art style is not critical; the angle is.

  5. Expecting camera moves or action

    Fabric does not do dolly, push-in, or tracking shots — the model does not move the camera and does not change the shot. If the brief needs cinematic motion and physical action in an environment, that is a job for Sora 2, Veo 3.1, or Kling. Fabric covers a different case: a fixed frame brought to life by face and speech.

Before / after examples

Example 1

Before

person talking about a product

After

[TTS script for Veed Fabric, paired with a frontal product-shot image of a brand mascot]

[confident] Hey there! I'm Otto, and today I'm showing you something special. [pause] Our new wireless earbuds give you twelve hours of battery on a single charge. [excited] Twelve full hours — that's almost a whole workday! [pause] [drawn out] No more low-battery anxiety. Tap the link below to grab yours.

This is a TTS script, not a scene description. Emotion tags [confident], [excited], [drawn out] are placed between phrases, not on every word.

Example 2

Before

a brand mascot says hello to viewers

After

[TTS script paired with a frontal illustration of the brand mascot]

[happy] Hello, friends! [laughs] It's so good to see you again. [pause] I've been waiting all week to share this with you. [curious] Have you ever wondered what makes our community special? [pause] [confident] Stick around — I'll show you in the next sixty seconds.

Short script, tags distributed: one emotion → one or two phrases → pause. Reactions like [laughs] make the talking head feel alive.

Example 3

Before

explain something in two languages

After

[TTS script for Veed Fabric, paired with a frontal image of an animated instructor — illustration style]

[British accent] [confident] Welcome back to the channel. Today we're tackling something most beginners get wrong. [pause] [curious] What if I told you the trick is in the timing, not the tools? [drawn out] Let me show you. [pause] In the next clip I'll walk through it step by step.

Accent is set by the [British accent] tag at the start and carries forward. The script is sized for about 15 seconds of speech — it does not try to cram a full lecture.
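
Script length can be sized against a target duration by counting only the spoken words. The sketch below strips bracketed tags and divides by an assumed speaking rate of 150 words per minute; that rate is a common rule of thumb, not a figure from VEED or ElevenLabs, and pauses or [drawn out] passages will lengthen real output. The function name is hypothetical.

```python
import re

TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")
# A word is a run of word characters, optionally joined by apostrophes (we're, I'll).
WORD_PATTERN = re.compile(r"\w+(?:['’]\w+)*")


def estimate_speech_seconds(script: str, words_per_minute: float = 150.0) -> float:
    """Estimate spoken duration, ignoring [tags]; the speaking rate is an assumption."""
    spoken = TAG_PATTERN.sub(" ", script)
    word_count = len(WORD_PATTERN.findall(spoken))
    return word_count / words_per_minute * 60.0


if __name__ == "__main__":
    script = (
        "[British accent] [confident] Welcome back to the channel. "
        "Today we're tackling something most beginners get wrong. [pause] "
        "[curious] What if I told you the trick is in the timing, not the tools? "
        "[drawn out] Let me show you. [pause] "
        "In the next clip I'll walk through it step by step."
    )
    print(round(estimate_speech_seconds(script), 1))
```

For the Example 3 script this comes out at roughly 17 seconds at the assumed rate, in line with the "about 15 seconds" sizing above.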

Frequently asked

How is Veed Fabric different from a regular text-to-video model?
Fabric does not generate scenes, backgrounds, or camera motion — it animates a still image to speech. Input is an image + audio pair (or a TTS script), output is a talking head with lip-sync, head motion, and gestures. It is a specialized tool for one case: a speaking character in a fixed frame. For cinematic clips reach for Sora 2, Veo 3.1, or Kling.
Which languages does Fabric support?
Fabric supports 30+ languages, including English, Russian, Spanish, French, German, Chinese, Japanese, and Arabic. The audio or TTS script can be in any of them. That makes Fabric convenient for localizing talking head content: the same visual character can speak several languages from different scripts.
What is the difference between Fabric 1.0 Standard and Fabric 1.0 Fast?
Standard delivers maximum quality; Fast runs roughly 2.5× faster on the same DiT architecture. At 480p, Standard renders 10 seconds in about 1.5 minutes; Fast is noticeably quicker. At 720p, Standard takes around 5 minutes per 10 seconds. Pick Fast for iteration and prototyping, Standard for final production.
What are the requirements for the input image?
Formats: jpg, jpeg, png, webp, gif, avif, up to 10 MB. The face must be frontal or near-frontal, well lit, with no occlusions (hands, masks, strong glasses reflections). One character per frame, not a crowd. Style is not critical — photo, illustration, anime, 3D all work equally. Angle is critical: the model has to see the mouth.
What are Fabric Emotions and how do I use them?
Fabric Emotions are inline bracketed tags embedded in the speech script in TTS mode. Categories: emotions ([excited], [sad]), reactions ([laughs], [sighs]), volume ([whispers], [shouting]), pacing ([pause], [rushed]), sound effects ([applause]), accents ([British accent]). Distribute them gradually — one tag per 1-2 phrases, not per word.
Can Fabric be used for long videos?
Via the API, yes, up to 5 minutes per clip. In Studio the limit is around 30 seconds per clip. For long content use the API directly, or split long speech into several clips and join them in post. Lip-sync stays stable on long takes if the input audio is clean and free of sharp tempo changes.
Does Opten support Veed Fabric?
Yes, the Opten extension detects Fabric inside the VEED interface and scores the speech script against the structure outlined above: it checks for the input image, script length matched to target duration, sensible distribution of emotion tags, and that you did not write a scene description instead of TTS text. One click gives you a rewrite of the script in the right format.
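
For the long-video case covered in the FAQ above, a long script can be cut into clip-sized pieces at sentence boundaries before generation. This is a minimal sketch under stated assumptions: a speaking rate of 150 words per minute (a rule of thumb, not a documented figure), a default of 30 seconds per clip to match the approximate Studio limit, and a naive sentence split. A single sentence longer than the budget is kept whole rather than cut mid-sentence. The function name is hypothetical.

```python
import re

TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")


def split_script(script: str,
                 max_seconds: float = 30.0,
                 words_per_minute: float = 150.0) -> list[str]:
    """Group sentences into clips whose estimated duration fits max_seconds."""
    word_budget = max_seconds * words_per_minute / 60.0
    # Split after ., ! or ? so leading [tags] stay with the sentence they introduce.
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())

    clips: list[str] = []
    current: list[str] = []
    current_words = 0
    for sentence in sentences:
        if not sentence:
            continue
        words = len(TAG_PATTERN.sub(" ", sentence).split())
        # Start a new clip when adding this sentence would exceed the budget.
        if current and current_words + words > word_budget:
            clips.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += words
    if current:
        clips.append(" ".join(current))
    return clips


if __name__ == "__main__":
    demo = "[happy] Hello, friends! [pause] Here is part one. [curious] Ready for part two?"
    for index, clip in enumerate(split_script(demo, max_seconds=3.0), start=1):
        print(index, clip)
```

Because an accent or emotion tag at the start of the script would end up only in the first clip, it may need to be repeated at the top of each piece.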

Ready to write Veed Fabric 1.0 prompts in one click?

  • Auto-detects the model inside its native interface
  • Scores every line of your prompt
  • One-click rewrite into the correct structure

Pro — $2.99/month or ₽199/month · cancel anytime

Stop Guessing. Generate On The First Try.

Install Opten in 30 seconds and score your next prompt.

Opten is a Chrome extension that scores AI prompts for the specific model. Supports 60+ image and video models — Midjourney, GPT Image 2, Kling, Sora, Nano Banana, Flux — and rewrites them in one click inside the Syntx, Higgsfield, and Freepik interfaces. From $2.99/month.

© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672