Seedance 2.0: how to write prompts the model actually understands
ByteDance · Updated:
Seedance 2.0 is a video model from ByteDance (Jimeng platform) that generates 4–15 seconds per pass at resolutions up to 2K. It accepts multimodal input: up to 9 images, 3 videos, and 3 audio files per request. It offers 10 generation types, timestamp storyboarding for longer videos, and native voice control. Prompts can be up to 2,000 characters long.
What Seedance 2.0 does well
Seedance 2.0 is one of the most feature-rich public video models. It combines ten generation types in one product: T2V, Consistency Control with @-references, copying camera moves from a reference video, copying VFX, story completion, video extension, voice cloning, one-take long shot, video editing, and beat sync to music.
Multimodal input: up to 9 images (jpeg/png/webp/bmp/tiff/gif, <30MB), up to 3 videos (mp4/mov, 2–15s, <50MB, 480p–720p), and up to 3 audio files (mp3/wav, ≤15s combined, <15MB), with a maximum of 12 files per request. Each pass generates 4–15 seconds; longer content is built by sequential extension via @Video.
- 10 generation types including voice cloning and beat sync
- Multimodal input: 9 images + 3 videos + 3 audio files
- Duration 4–15 seconds, resolution up to 2K
- @-references for character and scene consistency control
- Timestamp storyboarding for 13–15 second narratives
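The input limits above lend themselves to a quick pre-flight check before uploading. The sketch below is one hypothetical way to encode them in Python. The `Ref` record and the `check_refs` function are illustrative names, not part of any Seedance or Jimeng API. It assumes the size limits apply per file and that `jpg` is accepted as an alias of `jpeg`; video resolution (480p–720p) is not checked.

```python
from dataclasses import dataclass

# Limits taken from the text above. Names and structure are hypothetical.
LIMITS = {
    # kind: (max files, allowed extensions, max size in MB, assumed per file)
    "image": (9, {"jpeg", "jpg", "png", "webp", "bmp", "tiff", "gif"}, 30),
    "video": (3, {"mp4", "mov"}, 50),
    "audio": (3, {"mp3", "wav"}, 15),
}
MAX_FILES = 12


@dataclass
class Ref:
    kind: str             # "image", "video", or "audio"
    ext: str              # file extension without the dot
    size_mb: float
    seconds: float = 0.0  # clip length; ignored for images


def check_refs(refs):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if len(refs) > MAX_FILES:
        problems.append(f"{len(refs)} files, limit is {MAX_FILES}")
    for kind, (max_count, exts, max_mb) in LIMITS.items():
        group = [r for r in refs if r.kind == kind]
        if len(group) > max_count:
            problems.append(f"{len(group)} {kind} files, limit is {max_count}")
        for r in group:
            if r.ext.lower() not in exts:
                problems.append(f"{kind} format .{r.ext} is not listed")
            if r.size_mb >= max_mb:
                problems.append(f"{kind} file is {r.size_mb} MB, must be under {max_mb}")
    for r in refs:
        if r.kind not in LIMITS:
            problems.append(f"unknown reference kind: {r.kind}")
        if r.kind == "video" and not 2 <= r.seconds <= 15:
            problems.append(f"video is {r.seconds}s, must be 2-15s")
    audio_total = sum(r.seconds for r in refs if r.kind == "audio")
    if audio_total > 15:
        problems.append(f"audio totals {audio_total}s, limit is 15s combined")
    return problems
```

Calling `check_refs` on the planned upload set and fixing whatever it reports is cheaper than discovering a rejected file after the request is sent.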
Basic prompt structure
Optimal formula: [Subject/Character] + [Scene/Environment] + [Action/Motion] + [Camera Movement] + [Timing Breakdown] + [Audio/Sound] + [Style/Mood]. You don't have to use every element — composition depends on video type.
The more specific, the better. Prefer active verbs to abstractions («walks, turns, picks up» beats «something happens»). Include at least one shot-size or camera-movement directive per prompt, and describe the scene and environment in concrete physical terms.
Prompts can be up to 2,000 characters long. On syntx.ai (an English-language platform), English is recommended; on native Jimeng, Chinese yields slightly better results. English is fine either way, because the model is trained bilingually.
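As an illustration of the formula, the following Python sketch assembles a prompt from named parts in the order given above and skips any part that is left out. The function name `build_prompt` and the field names are hypothetical conveniences; the model itself only sees the resulting text.

```python
# Order follows the formula above; the field names are hypothetical labels.
FIELD_ORDER = ["subject", "scene", "action", "camera", "timing", "audio", "style"]


def build_prompt(**parts):
    """Join the supplied parts in formula order, one sentence per part."""
    unknown = sorted(set(parts) - set(FIELD_ORDER))
    if unknown:
        raise ValueError(f"unknown prompt parts: {unknown}")
    pieces = []
    for name in FIELD_ORDER:
        text = parts.get(name, "").strip()
        if text:
            # Normalize to exactly one trailing period per part.
            pieces.append(text.rstrip(".") + ".")
    return " ".join(pieces)


prompt = build_prompt(
    subject="A man in a black hoodie",
    scene="Narrow alley at dusk, wet pavement reflecting neon signs",
    action="He sprints, knocks over a fruit stall, stumbles, keeps running",
    camera="Wide side tracking shot at chest height",
    audio="Footsteps, distant crowd murmur, heavy breathing",
    style="Cinematic noir tone",
)
print(prompt)
```

Keeping the parts separate makes it easy to vary one element, such as the camera move, while holding the rest of the prompt fixed.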
The 10 generation types
- T2V: text-only generation.
- Consistency Control: lock a character, product, or scene via @-references.
- Copy Camera: upload a reference video to copy camera moves and choreography.
- Copy VFX: replicate transitions and effects from a reference video.
- Story Completion: the model continues a narrative from a storyboard or image sequence.
- Video Extension: smooth continuation of an existing video.
- Voice Control: voice cloning, dialogue generation, sound design.
- One-Take Long Shot: continuous shot without cuts.
- Video Editing: character swaps, plot changes.
- Beat Sync: visual rhythm synced to music via reference audio.
Each type has its own prompt formula (see platform documentation).
Timestamp storyboarding
The most powerful technique for 13–15 second videos is per-second breakdown. It gives precise control over narrative pacing:
0-3s: [scene + camera + sound]
4-8s: [scene + camera + sound]
9-12s: [scene + camera + sound]
13-15s: [scene + camera + sound]
Key rule: keep timecodes realistic. A full action needs 2–3 seconds, a short gesture 1 second. Don't try to cram «walking across a room» into 0.5 seconds. For 4–8 second videos, timestamps aren't required; one or two key moments are enough. For 9–12 seconds, timestamps are recommended. For 13–15 seconds, they are mandatory for a good result.
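The timing rules above can be sketched as a small Python helper that formats a storyboard and rejects unrealistic timecodes. The names `storyboard` and `timing_advice` are hypothetical, and the checks only encode what the text states: segments in order without overlap, at least 1 second each, and nothing past the clip length.

```python
def timing_advice(duration_s):
    """Map clip length in whole seconds to the guidance in the text."""
    if not 4 <= duration_s <= 15:
        raise ValueError("a single pass is 4-15 seconds")
    if duration_s <= 8:
        return "optional"
    if duration_s <= 12:
        return "recommended"
    return "required"


def storyboard(segments, total_seconds=15):
    """Format (start, end, description) tuples as 'start-ends: description' lines."""
    lines = []
    prev_end = -1
    for start, end, text in segments:
        if start <= prev_end:
            raise ValueError(f"segment {start}-{end}s overlaps or is out of order")
        if end - start < 1:
            raise ValueError(f"segment {start}-{end}s is shorter than 1 second")
        if end > total_seconds:
            raise ValueError(f"segment {start}-{end}s runs past {total_seconds}s")
        lines.append(f"{start}-{end}s: {text}")
        prev_end = end
    return "\n".join(lines)


print(storyboard([
    (0, 3, "Wide shot, a woman walks toward a cabin. Slow forward dolly, wind."),
    (4, 8, "Medium shot, she opens the door. Camera slowly pushes in."),
]))
```

The helper cannot judge whether an action fits its slot, so the 2–3 seconds per full action rule still has to be applied by eye.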
Common mistakes
1. Prompt too short or too long
Under 15 words, the model invents too much and results are unpredictable. Over 2,000 characters (the stated prompt limit), detail overload sets in and the model starts ignoring parts of the prompt. The sweet spot for most scenes is 50–200 words; for timestamp storyboards, 300–500 words with explicit scenes, as long as the total stays within the 2,000-character limit.
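A minimal Python sketch of the two hard thresholds mentioned above. The function name `length_warnings` is hypothetical, and it assumes the platform counts characters the way Python's `len` does, which may not hold for every script.

```python
MIN_WORDS = 15    # below this the model fills in too much, per the text
MAX_CHARS = 2000  # stated prompt limit


def length_warnings(prompt):
    """Return warnings for prompts that are too short or too long."""
    words = len(prompt.split())
    chars = len(prompt)
    warnings = []
    if words < MIN_WORDS:
        warnings.append(f"only {words} words, fewer than {MIN_WORDS}")
    if chars > MAX_CHARS:
        warnings.append(f"{chars} characters, more than {MAX_CHARS}")
    return warnings


print(length_warnings("video where a man runs down the street"))
```

Word counts inside the 50–200 range are deliberately not enforced here, since the sweet spot is guidance rather than a limit.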
2. Conflicting camera moves at once
«Zoom in while panning left and orbiting around» — the model can't fit three simultaneous moves into 5–10 seconds of screen time. Pick one main move per scene plus an optional speed modifier. If you need different moves, split them across timestamp segments.
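One rough way to catch this mistake before submitting is a keyword scan per scene or timestamp segment. The Python sketch below is a heuristic only: the keyword list and the names `camera_moves` and `too_many_moves` are hypothetical, and simple pattern matching will miss moves phrased in other ways.

```python
import re

# Hypothetical keyword patterns; extend or adjust for your own vocabulary.
CAMERA_MOVES = {
    "zoom": r"\bzoom",
    "pan": r"\bpan(?:s|ning|ned)?\b",
    "orbit": r"\borbit",
    "dolly": r"\bdoll(?:y|ies)\b",
    "tilt": r"\btilt",
    "tracking": r"\btracking\b",
}


def camera_moves(text):
    """Return the distinct camera-move keywords found in one scene description."""
    lowered = text.lower()
    return [name for name, pattern in CAMERA_MOVES.items() if re.search(pattern, lowered)]


def too_many_moves(segment, limit=1):
    """True when a single segment asks for more than `limit` distinct moves."""
    return len(camera_moves(segment)) > limit


print(camera_moves("Zoom in while panning left and orbiting around"))
```

Running the check on each timestamp segment separately, rather than on the whole prompt, matches the advice to split different moves across segments.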
3. Asking for more than 15 seconds in one pass
15 seconds is a hard platform limit per generation. A «30-second video» request either truncates to 15 or errors out. For longer content, use the multi-segment approach via Video Extension: segment by segment with smooth handoffs.
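Planning the multi-segment approach is simple arithmetic, sketched here in Python. The function name `plan_passes` is hypothetical, and the sketch only splits a target length into passes of 4–15 whole seconds; how Video Extension accounts for the length of the source clip is not covered by the text and is not modeled.

```python
import math


def plan_passes(total_seconds, min_len=4, max_len=15):
    """Split a target length into the fewest passes, each min_len..max_len seconds."""
    if total_seconds < min_len:
        raise ValueError(f"shortest possible pass is {min_len} seconds")
    passes = math.ceil(total_seconds / max_len)
    base, extra = divmod(total_seconds, passes)
    # Spread the remainder over the first passes so lengths differ by at most 1.
    return [base + 1 if i < extra else base for i in range(passes)]


print(plan_passes(30))  # [15, 15]
print(plan_passes(40))  # [14, 13, 13]
```

Even pass lengths keep each segment comfortably inside the limit, which leaves room to give every pass its own timestamp storyboard.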
4. Abstract phrasing instead of physical actions
«Something beautiful happens», «emotional moment», «mood shifts» — the model doesn't understand abstractions. Describe concrete physical actions: «she slowly turns her head», «light fades from warm to cool», «petals fall onto the table». This delivers predictable, controllable results.
5. Realistic human faces in uploaded references
The Jimeng platform blocks uploads of realistic human faces as references. This is ByteDance policy, not a limit you can work around. For image-to-video (I2V) with humans, use stylized references (illustration, painting, cartoon) or generate the human-containing scene via T2V without a reference image.
Before / after examples
Example 1
Before
video where a man runs down the street
After
Wide tracking shot of a man in a black hoodie sprinting down a narrow alley at dusk. Side tracking, camera moves at chest height parallel to him. He knocks over a fruit stall, stumbles, gets back up, keeps running. Wet pavement reflects neon signs. Loud panicked footsteps, distant crowd murmur, heavy breathing. 16:9, 24fps, 8 seconds, cinematic noir tone.
Concrete scene (alley at dusk), physical detail (knocks over stall, wet pavement), camera parameters (chest height parallel), sound (footsteps, crowd, breathing), technical parameters at the end. A working T2V prompt.
Example 2
Before
long 15-second video with a hero's story
After
0-3s: Wide shot, a woman in a red coat walks toward a wooden cabin in a snowy forest. Slow forward dolly, soft ambient wind, crunching snow.
4-8s: Medium shot, she opens the cabin door, warm orange light spills onto the snow. Camera slowly pushes in.
9-12s: Interior close-up, she sets a lantern on a wooden table, takes off her gloves. Soft crackling fireplace ambient.
13-15s: Wide interior shot, she sits by the fire, exhales. Camera pulls back to reveal the warm, intimate room. Soft piano music begins.
Timestamp storyboarding with realistic pacing (3–4 seconds per beat), consistent character (the woman in red coat), smooth narrative across 4 scenes. Sound varies by scene. The Seedance 2.0 sweet spot.
Example 3
Before
product spot with my brand using 3 shots
After
Use @Image1 (product hero shot) and @Image2 (lifestyle context). Beat sync to @Audio1 (brand music track).
0-2s: Close-up of @Image1 product rotating slowly on a marble surface. Soft side light, shallow DoF.
3-5s: Cut to @Image2 lifestyle scene, person holds the product naturally, smiles slightly. Hand-held camera, warm afternoon light.
6-8s: Wide editorial shot, product centered with brand color palette around it. Smooth dolly out. Beat hit at 8s.
16:9, 24fps.
Multimodal prompt with @-references (Image1, Image2, Audio1), beat sync to music, timestamp breakdown for 3 shots, technical parameters. This is the production scenario Seedance 2.0 is designed for.
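With several @-references in one prompt, it is easy to mention a file that was never attached. The Python sketch below scans a prompt for tags of the form used in the example (@Image1, @Image2, @Audio1) and reports any that point past the number of uploaded files. The function name `dangling_refs` is hypothetical, and it assumes references are numbered from 1 in upload order, which the text does not state.

```python
import re

# Matches tags such as @Image1, @Video2, @Audio1 as used in the example above.
REF_TAG = re.compile(r"@(Image|Video|Audio)(\d+)")


def dangling_refs(prompt, images=0, videos=0, audio=0):
    """Return tags in the prompt that have no matching uploaded file."""
    uploaded = {"Image": images, "Video": videos, "Audio": audio}
    missing = []
    for kind, number in REF_TAG.findall(prompt):
        tag = f"@{kind}{number}"
        # Assumption: numbering starts at 1 and follows upload order.
        if not 1 <= int(number) <= uploaded[kind] and tag not in missing:
            missing.append(tag)
    return missing


prompt = "Use @Image1 and @Image2. Beat sync to @Audio1. 0-2s: Close-up of @Image1."
print(dangling_refs(prompt, images=1, audio=1))  # ['@Image2']
```

An empty list means every tag in the prompt has a file behind it; it says nothing about whether the files themselves meet the upload limits.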