
Runway Act-Two: how to prepare inputs the model actually understands

Runway · Updated:

Runway Act-Two is a performance transfer model, not text-to-video. You feed it a driving video with an actor's performance and a character reference (image or video), and the model transfers body motion, facial expression, and lip-sync onto the character. Text prompts play a minimal role here — quality is set by the inputs.

What Act-Two does well

Act-Two works like AI motion capture without mocap suits: record an actor's performance on a regular webcam, pick a character reference, and the model transfers body motion, facial expression, and audio lip-sync onto that character. Output is 720p video at 5 credits/sec.

This is a fundamentally different class of model: neither text-to-video (T2V) nor image-to-video (I2V). The text prompt barely influences the result. The Facial Expressiveness parameter (1–5 scale) controls how strongly facial motion transfers; values above 3 risk artifacts. If the character reference is an image rather than a video, you also get gesture control.

  • Performance transfer — NOT text-to-video and NOT prompt-driven
  • Driving video + character reference are mandatory
  • Transfers: body motion, facial expression, lip-sync (audio)
  • Facial Expressiveness 1–5 (above 3 risks artifacts)
  • 720p, 5 credits/sec
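To make the input contract concrete, here is a minimal Python sketch that models an Act-Two job locally. `ActTwoSettings`, `ReferenceKind`, `errors()` and `warnings()` are hypothetical names invented for illustration. They are not part of any Runway API or SDK. The rules they encode come from this article: both inputs are mandatory, Facial Expressiveness runs 1–5 with artifact risk above 3, and gesture control is available only for image references.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ReferenceKind(Enum):
    IMAGE = "image"
    VIDEO = "video"


@dataclass
class ActTwoSettings:
    """Hypothetical local description of one Act-Two generation (not a Runway type)."""

    driving_video: Optional[str]        # path/URL of the actor's performance clip
    character_reference: Optional[str]  # path/URL of the character image or video
    reference_kind: ReferenceKind
    facial_expressiveness: int = 3      # 1-5; 3 is the recommended starting point

    def errors(self) -> List[str]:
        """Problems that stop the job from running at all."""
        found = []
        if not self.driving_video:
            found.append("missing driving video")
        if not self.character_reference:
            found.append("missing character reference")
        if not 1 <= self.facial_expressiveness <= 5:
            found.append("facial expressiveness must be between 1 and 5")
        return found

    def warnings(self) -> List[str]:
        """Settings that run but are likely to cause artifacts."""
        if 3 < self.facial_expressiveness <= 5:
            return ["facial expressiveness above 3 risks artifacts"]
        return []

    @property
    def gesture_control_available(self) -> bool:
        # Gesture control is only offered when the character reference is an image.
        return self.reference_kind is ReferenceKind.IMAGE


if __name__ == "__main__":
    job = ActTwoSettings("performance.mp4", "hero.png", ReferenceKind.IMAGE)
    print(job.errors(), job.warnings(), job.gesture_control_available)
```

Values above 3 are reported as warnings rather than errors because the article says 4 can work for dramatic scenes; only a missing input or an out-of-range value blocks the job.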

What to feed as input

Driving video: your performance footage, either a webcam recording or a prepared clip. Key requirements: even lighting on the face without harsh shadows, clear audio for lip-sync, and ideally palms facing the camera at the start of the clip, which helps the model capture the hands and transfer gestures more accurately later on.

Character reference: who the performance is transferred onto, either a still image or a short video. An image unlocks gesture control (extra control over the hands); a video gives better facial consistency in longer scenes. In both cases, use clear lighting and a readable pose, with the face unobstructed.

The role of the text prompt

Act-Two is input-driven. The text prompt plays a minimal, almost decorative role. Everything you'd normally describe in a prompt (movements, expression, lip-sync) here comes from the driving video; everything about appearance (clothing, face, background) comes from the character reference.

If you write a detailed prompt like «a man in a suit, walking, smiling, saying hello», it will either be ignored or conflict with the inputs. If you want specific movements, act them out in the driving video. If you want a specific look, pick the right character reference. Leave the prompt empty or only briefly describe the scene context.
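A rough way to apply this advice before generating is to lint the prompt. The Python sketch below is a hypothetical heuristic, not how Runway or Opten actually evaluates prompts: `prompt_warnings`, the `MAX_PROMPT_WORDS` threshold and the `MOTION_WORDS` set are arbitrary assumptions chosen for illustration.

```python
import re
from typing import List

# Hypothetical heuristics for illustration only; Runway documents no such limits.
MAX_PROMPT_WORDS = 6
MOTION_WORDS = {
    "walk", "walks", "walking", "smile", "smiles", "smiling",
    "gesture", "gestures", "gesturing", "nod", "nods", "nodding",
    "say", "says", "saying", "speak", "speaks", "speaking",
}


def prompt_warnings(prompt: str) -> List[str]:
    """Flag prompts that try to steer motion or expression through text."""
    words = re.findall(r"[a-z']+", prompt.lower())
    warnings = []
    if len(words) > MAX_PROMPT_WORDS:
        warnings.append(
            f"prompt has {len(words)} words; keep it empty or a brief scene context"
        )
    motion = sorted(set(words) & MOTION_WORDS)
    if motion:
        warnings.append(
            "motion and expression come from the driving video, not the prompt: "
            + ", ".join(motion)
        )
    return warnings


if __name__ == "__main__":
    print(prompt_warnings("product explainer scene"))  # []
    for w in prompt_warnings("a man in a suit, walking, smiling, saying hello"):
        print(w)
```

An empty prompt or a short scene label passes cleanly, while the detailed prompt from the paragraph above triggers both warnings.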

Tuning Facial Expressiveness

The 1–5 scale controls how strongly facial expression transfers:

  • 1–2: calm, restrained expression with minimal artifact risk.
  • 3: the recommended default; transfers most expressions naturally.
  • 4–5: maximum expression, but artifact risk rises non-linearly (the face can melt, eyes can twitch, expressions can look overdone).

Rule: start at 3, raise only if the result looks visibly flat. For dramatic scenes 4 can work, but problems usually start above that. If artifacts appear, lower Expressiveness — don't try to compensate with the prompt.
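The same rule can be expressed as a tiny decision function. `next_expressiveness` is a hypothetical helper, not a Runway setting. Capping automatic increases at 4 is an assumption drawn from the advice that problems usually start above 4. Artifacts take priority over flatness because the article says to lower the value when they appear.

```python
DEFAULT_EXPRESSIVENESS = 3  # recommended starting value
AUTO_RAISE_CAP = 4          # assumption: never step up to 5 automatically


def next_expressiveness(current: int, has_artifacts: bool, looks_flat: bool) -> int:
    """Suggest the next Facial Expressiveness value after reviewing a result."""
    if not 1 <= current <= 5:
        raise ValueError("Facial Expressiveness must be between 1 and 5")
    if has_artifacts:
        # Artifacts win: lower the value instead of rewording the prompt.
        return max(1, current - 1)
    if looks_flat and current < AUTO_RAISE_CAP:
        return current + 1
    return current  # result looks fine, keep the value


if __name__ == "__main__":
    value = DEFAULT_EXPRESSIVENESS
    value = next_expressiveness(value, has_artifacts=False, looks_flat=True)  # 4
    value = next_expressiveness(value, has_artifacts=True, looks_flat=False)  # 3
    print(value)  # 3
```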

Common mistakes

  1. Detailed text prompt as primary control

    Act-Two is input-driven, not prompt-driven. Describing movements and expression in the prompt is either ignored or conflicts with the driving video. If you want specific motion, act it out in front of the camera. Leave the prompt empty or include only a brief scene context.

  2. Missing driving video or character reference

    Act-Two cannot run without both inputs. The driving video sets the performance; the character reference picks who gets animated. If either one is missing, generation either won't start or produces garbage. Verify both slots in Generation Settings before running.

  3. Facial Expressiveness above 3 by default

    Values 4–5 can deliver striking expression, but artifact risk grows non-linearly: face melts, eyes twitch, expression looks overdone. Always start at 3, raise only if the output is clearly flat. Lowering Expressiveness is a better fix for artifacts than regenerating.

  4. Dark or noisy driving video

    Harsh facial shadows break face tracking; noisy audio breaks lip-sync. The performance should be shot in even soft lighting (window, softbox) with clean audio. No prompt optimization can fix this — reshooting the driving video is always faster and more effective.

  5. Using Act-Two like a generic T2V or I2V model

    Act-Two is a performance transfer system, not a scene generator. Prompts like «a man walks across the room» don't work here because motion isn't generated — it's copied from the driving video. If you need a scene generator, use Gen-4.5 or Gen-4, not Act-Two.

Before / after examples

Example 1

Before

Detailed text prompt: «A young woman in a red sweater speaks to the camera, smiling warmly, gesturing with her hands as she explains a new product.»

After

Driving video: 15-second webcam recording; the actress delivers the line clearly, palms toward the camera at the start of the clip, even lighting.
Character reference: portrait image of the character in a red sweater.
Prompt: (empty or brief: «product explainer scene»).
Facial Expressiveness: 3.

Text prompts in Act-Two are useless for controlling motion and expression — those transfer from the driving video. Replace the prompt with a quality performance recording.

Example 2

Before

Character reference: dramatic painted portrait, Facial Expressiveness: 5

After

Character reference: clear photo or live-action video of the character, even lighting, face unobstructed.
Facial Expressiveness: 3.

Painted or stylized references transfer expression poorly. Expressiveness 5 on any reference almost guarantees artifacts. Drop to 3, pick a clear reference — the result stabilizes.

Example 3

Before

Driving video: dark recording with harsh shadows, noisy audio

After

Driving video: recording in even light (natural window light or a softbox), clean audio without noise, palms visible at the start of the clip.
Character reference + Expressiveness 3.

Driving video quality directly determines transfer quality. Harsh shadows break face tracking, noisy audio breaks lip-sync. Reshooting the performance is the best «prompt optimization» in Act-Two.

Frequently asked

Can I use Act-Two with text prompts only?
No. Act-Two is a performance transfer model, not text-to-video; without a driving video and a character reference it has nothing to generate from. If you need to produce video from a text description, use Runway Gen-4.5, which supports full T2V. Act-Two is for when you already have a performance or plan to record one.
What's the difference between driving video and character reference?
Driving video is the actor's performance recording with motion and expression — the source of what gets transferred. Character reference is an image or video of the character it transfers onto. Driving sets HOW to move, reference sets HOW to look. Both are needed simultaneously; without either, Act-Two does not work.
What Facial Expressiveness value should I use?
Default is 3, Runway's recommended value for most scenes. 1–2 gives restrained expression and suits a documentary tone. 4 is worth trying for dramatic scenes, but artifacts are likely. 5 almost always produces a «melting» face and is not recommended. If artifacts appear, lower Expressiveness rather than trying to compensate with the prompt.
Can I get gesture control?
Yes, but only if the character reference is provided as an image (not a video). Image mode unlocks extra hand-gesture transfer control. For best capture, start the driving video with palms toward the camera — this helps the model lock onto the hands and transfer gestures accurately throughout the clip.
Is Act-Two good for dubbing into another language?
Yes, it's one of the strongest use cases. Driving video is a new recording with speech in the target language (your own voice works); character reference is an image or frame from the original video. Act-Two transfers lip-sync to match the new language while preserving the character's appearance. Lip-sync quality depends on clean audio in the driving video.
What are Act-Two's duration limits?
Output duration matches the driving video, since the model transfers it frame by frame. The longer the driving video, the more credits you spend (5 credits/sec). For short lines and micro-scenes that's economical; for multi-minute monologues it's better to split the recording into separate generations.
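The credit arithmetic is simple enough to script. In this sketch, `credits_for` and `split_segments` are hypothetical helpers that work in whole seconds. The 30-second chunk length in the example is an arbitrary choice, not a documented Act-Two limit, and real billing may round partial seconds differently.

```python
from typing import List, Tuple

CREDITS_PER_SECOND = 5  # Act-Two rate quoted in this article


def credits_for(seconds: int) -> int:
    """Credits for one generation; output length equals driving-video length."""
    return seconds * CREDITS_PER_SECOND


def split_segments(total_seconds: int, max_segment_seconds: int) -> List[Tuple[int, int]]:
    """Cut a long driving video into (start, end) windows, in whole seconds."""
    if max_segment_seconds <= 0:
        raise ValueError("max_segment_seconds must be positive")
    segments = []
    start = 0
    while start < total_seconds:
        end = min(start + max_segment_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    print(credits_for(15))  # 75, the 15-second clip from Example 1
    monologue = split_segments(180, 30)  # 3-minute monologue, 30 s chunks
    print(len(monologue), sum(credits_for(e - s) for s, e in monologue))  # 6 900
```

Splitting does not lower the total (a 3-minute monologue costs 900 credits either way); the gain is that a segment with artifacts can be regenerated on its own instead of paying for the whole monologue again.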
Does Opten support Runway Act-Two?
Yes, the Opten extension recognizes Act-Two inside runwayml.com and accounts for its input-driven nature: if you write a detailed text prompt, Opten warns that the model is controlled by video input, not text. It also checks that both inputs are present (driving video + character reference) and that Facial Expressiveness is set sensibly.


Ready to write Runway Act-Two prompts in one click?

  • Auto-detects the model inside its native interface
  • Scores every line of your prompt
  • One-click rewrite into the correct structure

Pro — $2.99/month or ₽199/month · cancel anytime

Stop Guessing. Generate On The First Try.

Install Opten in 30 seconds and score your next prompt.

Opten is a Chrome extension that scores AI prompts against the target model. Supports 60+ image and video models (Midjourney, GPT Image 2, Kling, Sora, Nano Banana, Flux) and rewrites them in one click inside the Syntx, Higgsfield, and Freepik interfaces. From $2.99/month.

© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672