Runway Act-Two: how to prepare inputs the model actually understands
Runway Act-Two is a performance transfer model, not text-to-video. You feed it a driving video with an actor's performance and a character reference (image or video), and the model transfers body motion, facial expression, and lip-sync onto the character. Text prompts play a minimal role here — quality is set by the inputs.
What Act-Two does well
Act-Two works like AI motion capture without mocap suits: record an actor's performance on a regular webcam, pick a character reference, and the model transfers body motion, facial expression, and audio lip-sync onto that character. Output is 720p video at 5 credits/sec.
This is a fundamentally different class of model — neither T2V nor I2V. The text prompt barely influences the result. The Facial Expressiveness parameter (1–5 scale) controls how strongly facial motion transfers — values above 3 risk artifacts. If the character reference is an image (not video), you also get gesture control.
- Performance transfer — NOT text-to-video and NOT prompt-driven
- Driving video + character reference are mandatory
- Transfers: body motion, facial expression, lip-sync (audio)
- Facial Expressiveness 1–5 (above 3 risks artifacts)
- 720p, 5 credits/sec
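The pricing above makes cost easy to estimate before you record. The sketch below is illustrative only: the function name is hypothetical, and rounding up to whole seconds is an assumption, since this article states only the per-second rate.

```python
import math

# Rate quoted in this article: 5 credits per second of output.
CREDITS_PER_SECOND = 5


def estimate_credits(duration_seconds: float) -> int:
    """Estimate the credit cost of a clip.

    Assumption: partial seconds are billed as a full second.
    Check your actual usage to confirm how rounding works.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return math.ceil(duration_seconds) * CREDITS_PER_SECOND


# A 15-second performance at 5 credits/sec.
print(estimate_credits(15))  # 75
```

Running the estimate before a recording session helps decide whether a long take is worth splitting into shorter ones.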
What to feed as input
Driving video: your performance footage. It can be a webcam recording or a prepared clip. Key requirements: even lighting on the face without harsh shadows, and clear audio for lip-sync. Ideally, open the clip with your palms facing the camera; this helps the model capture the hands and transfer gestures more accurately later.
Character reference — who to transfer the performance onto. Can be a still image or a short video. An image unlocks gesture control (extra hand control); a video gives better facial consistency on longer scenes. In both cases the lighting and pose should be clear, the face unobstructed.
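The image-versus-video trade-off can be written down as a small decision helper. This is a sketch, not part of any Runway tooling: the names are hypothetical, and giving gesture control priority over scene length (and defaulting to an image) is an assumption made for illustration.

```python
from enum import Enum


class ReferenceKind(Enum):
    IMAGE = "image"
    VIDEO = "video"


def pick_reference_kind(needs_gesture_control: bool, long_scene: bool) -> ReferenceKind:
    """Suggest a character reference type from the trade-off described above."""
    # An image reference unlocks gesture control.
    if needs_gesture_control:
        return ReferenceKind.IMAGE
    # A video reference gives better facial consistency on longer scenes.
    if long_scene:
        return ReferenceKind.VIDEO
    # Assumption: with no special requirement, a still image is the simpler choice.
    return ReferenceKind.IMAGE


print(pick_reference_kind(needs_gesture_control=False, long_scene=True).value)  # video
```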
The role of the text prompt
Act-Two is input-driven. The text prompt plays a minimal, almost decorative role. Everything about the performance that you'd normally describe in a prompt (movements, expression, lip-sync) comes from the driving video instead; everything about appearance (clothing, face, background) comes from the character reference.
If you write a detailed prompt like «a man in a suit, walking, smiling, saying hello», it will either be ignored or conflict with the inputs. If you want specific movements, act them out in the driving video. If you want a specific look, pick the right character reference. Leave the prompt empty or only briefly describe the scene context.
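One way to catch over-detailed prompts before a run is a simple word check. The sketch below is a rough heuristic under stated assumptions: the word list is hypothetical and far from complete, and the function only flags words for a human to review.

```python
import re

# Hypothetical word list: terms that describe motion, expression, or speech.
# In Act-Two these belong in the driving video, not in the prompt.
PERFORMANCE_WORDS = {
    "walk", "walks", "walking",
    "smile", "smiles", "smiling",
    "say", "says", "saying",
    "speak", "speaks", "speaking",
    "gesture", "gestures", "gesturing",
}


def performance_words_in(prompt: str) -> list[str]:
    """Return the performance-related words found in a prompt, sorted."""
    words = re.findall(r"[a-z]+", prompt.lower())
    return sorted({w for w in words if w in PERFORMANCE_WORDS})


print(performance_words_in("a man in a suit, walking, smiling, saying hello"))
# ['saying', 'smiling', 'walking']
print(performance_words_in("product explainer scene"))
# []
```

An empty result means the prompt reads as scene context only, which is what this guide recommends.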
Tuning Facial Expressiveness
The 1–5 scale controls how strongly facial expression transfers. Values 1–2 give a calm, restrained expression with minimal artifact risk. Value 3 is the recommended default and transfers most expressions naturally. Values 4–5 give maximum expression, but artifact risk rises non-linearly: the face can melt, eyes can twitch, and expressions can look overdone.
Rule: start at 3, raise only if the result looks visibly flat. For dramatic scenes 4 can work, but problems usually start above that. If artifacts appear, lower Expressiveness — don't try to compensate with the prompt.
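The tuning rule can be expressed as a small review loop step. This is a minimal sketch with hypothetical names; changing the value by one step at a time, and letting artifacts take priority over flatness, are assumptions rather than anything Runway specifies.

```python
DEFAULT_EXPRESSIVENESS = 3  # recommended starting point
MIN_EXPRESSIVENESS = 1
MAX_EXPRESSIVENESS = 5


def next_expressiveness(current: int, looks_flat: bool, has_artifacts: bool) -> int:
    """Suggest the Facial Expressiveness value for the next attempt."""
    if not MIN_EXPRESSIVENESS <= current <= MAX_EXPRESSIVENESS:
        raise ValueError("Facial Expressiveness must be between 1 and 5")
    # Artifacts take priority: lower the value instead of editing the prompt.
    if has_artifacts:
        return max(MIN_EXPRESSIVENESS, current - 1)
    # Raise only when the result is visibly flat.
    if looks_flat:
        return min(MAX_EXPRESSIVENESS, current + 1)
    return current


value = DEFAULT_EXPRESSIVENESS
value = next_expressiveness(value, looks_flat=True, has_artifacts=False)
print(value)  # 4
value = next_expressiveness(value, looks_flat=False, has_artifacts=True)
print(value)  # 3
```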
Common mistakes
1. Detailed text prompt as primary control
Act-Two is input-driven, not prompt-driven. A prompt that describes movements and expression is either ignored or conflicts with the driving video. If you want specific motion, act it out in front of the camera. Leave the prompt empty or include only a brief scene context.
2. Missing driving video or character reference
Act-Two cannot run without both inputs. The driving video sets the performance; the character reference determines who gets animated. If either is missing, generation either won't start or produces unusable output. Verify both slots in Generation Settings before running.
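If you queue jobs from your own scripts or notes, a pre-flight check for the two mandatory inputs is easy to sketch. The function below is hypothetical and does not call any Runway API; it only reports which input is absent.

```python
from typing import Optional


def missing_inputs(
    driving_video: Optional[str],
    character_reference: Optional[str],
) -> list[str]:
    """Return the names of the mandatory inputs that are not set."""
    missing = []
    # None and empty strings both count as "not provided".
    if not driving_video:
        missing.append("driving video")
    if not character_reference:
        missing.append("character reference")
    return missing


print(missing_inputs("performance.mp4", None))
# ['character reference']
print(missing_inputs("performance.mp4", "character.png"))
# []
```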
3. Facial Expressiveness above 3 by default
Values 4–5 can deliver striking expression, but artifact risk grows non-linearly: the face melts, eyes twitch, and the expression looks overdone. Always start at 3 and raise it only if the output is clearly flat. If artifacts appear, lowering Expressiveness is a better fix than regenerating at the same setting.
4. Dark or noisy driving video
Harsh facial shadows break face tracking; noisy audio breaks lip-sync. The performance should be shot in even soft lighting (window, softbox) with clean audio. No prompt optimization can fix this — reshooting the driving video is always faster and more effective.
5. Using Act-Two like a generic T2V or I2V model
Act-Two is a performance transfer system, not a scene generator. Prompts like «a man walks across the room» don't work here because motion isn't generated — it's copied from the driving video. If you need a scene generator, use Gen-4.5 or Gen-4, not Act-Two.
Before / after examples
Example 1
Before
Detailed text prompt: «A young woman in a red sweater speaks to the camera, smiling warmly, gesturing with her hands as she explains a new product.»
After
Driving video: 15-second webcam recording, actress delivers the line clearly, palms facing the camera at the start of the clip, even lighting. Character reference: portrait image of the character in a red sweater. Prompt: (empty or brief: «product explainer scene»). Facial Expressiveness: 3.
Text prompts in Act-Two are useless for controlling motion and expression — those transfer from the driving video. Replace the prompt with a quality performance recording.
Example 2
Before
Character reference: dramatic painted portrait, Facial Expressiveness: 5
After
Character reference: clear photo or live video reference of the character, even lighting, face unobstructed. Facial Expressiveness: 3.
Painted or stylized references transfer expression poorly. Expressiveness 5 on any reference almost guarantees artifacts. Drop to 3, pick a clear reference — the result stabilizes.
Example 3
Before
Driving video: dark recording with harsh shadows, noisy audio
After
Driving video: recording in even light (natural window light or softbox), clean audio without noise, palms visible at the start of the clip. Character reference + Expressiveness 3.
Driving video quality directly determines transfer quality. Harsh shadows break face tracking, noisy audio breaks lip-sync. Reshooting the performance is the best «prompt optimization» in Act-Two.