Seedream 4.5: how to write prompts the model actually understands
ByteDance
Seedream 4.5 is the mainstream version of ByteDance's image model. It supports text-to-image, image-to-image, and multi-image blending at up to 4K. The optimal prompt length is 30–100 words. It is available via fal.ai, YouMind, and flux-ai.io. This version brought readable in-image text rendering, scene spatial understanding, and precise adherence to complex instructions, which makes it the line's main production choice.
What is new in 4.5 versus 4.0
4.5 is a generational jump over 4.0 across the board: stronger aesthetics with carefully worked light and shadow, high consistency on complex scenes, and precise adherence to complex prompts with visual control.
Key upgrades: spatial understanding (realistic proportions, object placement, scene layout), rich world knowledge (scientific and technical grounding), readable in-image text rendering (posters, signs, infographics), and multi-image blending — combining several reference images into one result.
Resolution is raised to 4K (vs 2K in 4.0). An editing endpoint is supported: inpainting and modifications of existing images are applied precisely, rather than treating the source as a loose «starting point».
- Text-to-Image, Image-to-Image, Multi-Image Blending
- Resolution up to 4K (vs 2K in 4.0)
- Optimal prompt length 30–100 words
- Precise rendering of readable text
- Editing endpoint (inpainting, precise modifications)
Prompt structure
Canonical formula: `[Subject] + [Style] + [Composition] + [Lighting/Atmosphere] + [Technical parameters]`. Prioritization hierarchy is the same as in 4.0 — subject always first.
But 4.5 handles much more detailed prompts without losing focus. You can safely write 60–100 words of specifics across every level — the model holds all elements.
Example: «A young woman in soft natural light, photorealistic portrait style, 85mm lens, shallow depth of field, subtle expression, smooth bokeh background, clean composition, --ar 4:5.» That is 23 words before the aspect-ratio flag, slightly under the 30-word sweet spot, yet all five hierarchy levels are filled. On a prompt like this 4.5 reliably delivers production quality.
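The formula can be sketched as a small helper that joins the five levels in priority order and reports the prompt length. This is a minimal illustration, not part of any Seedream SDK: `PromptParts`, `build_prompt`, and `word_count` are hypothetical names, and leaving `--flag value` pairs out of the word count is an assumption about how length is measured.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    """The five levels of the formula, in priority order (subject first)."""
    subject: str
    style: str
    composition: str
    lighting: str
    technical: str


def build_prompt(parts: PromptParts) -> str:
    # Join the non-empty levels in formula order, comma-separated.
    levels = [parts.subject, parts.style, parts.composition,
              parts.lighting, parts.technical]
    return ", ".join(level.strip() for level in levels if level.strip())


def word_count(prompt: str) -> int:
    # Count whitespace-separated words, skipping "--flag value" pairs
    # such as "--ar 4:5" (assumption: flags do not count toward length).
    count = 0
    skip_next = False
    for token in prompt.split():
        if skip_next:
            skip_next = False
        elif token.startswith("--"):
            skip_next = True
        else:
            count += 1
    return count


parts = PromptParts(
    subject="A young woman with a subtle expression",
    style="photorealistic portrait style",
    composition="clean composition, smooth bokeh background",
    lighting="soft natural light",
    technical="85mm lens, shallow depth of field, --ar 4:5",
)
prompt = build_prompt(parts)
print(prompt)
print(word_count(prompt), "words")
```

Keeping the levels as separate fields makes it easy to see which one is empty before sending the prompt, and the subject always lands first.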
Text rendering
The main 4.5 upgrade is readable in-image text. Posters with titles, signs, infographics, packaging — everything that was a 4.0 weak spot now works.
Rules are the same as in other models with in-image text: exact text in quotes («text "BEYOND THE STARS"»), explicit font style («bold metallic sans-serif»), explicit placement («centered at top», «bottom left corner»), explicit format («--ar 2:3» for a poster).
For long strings — split into separate elements. «Movie poster, text "BEYOND THE STARS" centered at top, subtitle "a journey beyond imagination" at bottom» works better than one long string. Latin script yields the most stable results; Cyrillic is readable but less precise.
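The text rules above (exact string in quotes, font style, placement, one element per string) can be sketched as a clause builder. The helper names `text_element` and `poster_prompt` are hypothetical, and the clause wording is only one plausible phrasing, not a syntax the model requires.

```python
def text_element(text: str, font_style: str = "", placement: str = "",
                 role: str = "text") -> str:
    # Exact string in double quotes, then optional font style and placement.
    clause = f'{role} "{text}"'
    if font_style:
        clause += f" in {font_style}"
    if placement:
        clause += f" {placement}"
    return clause


def poster_prompt(scene: str, elements: list[str],
                  aspect_ratio: str = "2:3") -> str:
    # One clause per text element instead of a single long string.
    return ", ".join([scene, *elements, f"--ar {aspect_ratio}"])


title = text_element("BEYOND THE STARS", "bold metallic sans-serif",
                     "centered at top")
subtitle = text_element("a journey beyond imagination",
                        placement="at bottom", role="subtitle")
poster = poster_prompt("Movie poster", [title, subtitle])
print(poster)
```

Each string stays short and carries its own placement, which is the splitting approach the paragraph above recommends.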
Multi-Image Blending
Blending two reference images into one result is a capability introduced in 4.5. Steps: 1) prepare the base images; 2) upload the two images to blend; 3) write a description of the desired result; 4) state which stylistic elements to preserve from each source.
Typical scenario: character from one photo + setting from another. «Take the character from image 1 and place them in the environment from image 2. Preserve the character's exact facial features and wardrobe from image 1. Use the lighting and atmosphere from image 2.»
Another scenario: style blend. «Blend the color palette of image 1 with the composition style of image 2.» Here the model synthesizes an intermediate visual. This is stronger than plain style transfer, because the model actually understands what to take from each reference.
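A blending prompt for the character-plus-setting scenario can be assembled from explicit lists of what to keep from each image. The sketch below is hypothetical (`blend_prompt` and `join_items` are illustrative names) and only builds the text. Uploading the two reference images depends on the platform and is not shown.

```python
def join_items(items: list[str]) -> str:
    # "a", "a and b", or "a, b, and c".
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def blend_prompt(subject: str, setting: str,
                 preserve: list[str], borrow: list[str]) -> str:
    # Refuse to build a prompt without explicit guidance for both images.
    if not preserve or not borrow:
        raise ValueError("list what to preserve from image 1 and use from image 2")
    return (
        f"Take the {subject} from image 1 and place them in the {setting} "
        f"from image 2. "
        f"Preserve the {subject}'s {join_items(preserve)} from image 1. "
        f"Use the {join_items(borrow)} from image 2."
    )


blend = blend_prompt(
    subject="character",
    setting="environment",
    preserve=["exact facial features", "wardrobe"],
    borrow=["lighting", "atmosphere"],
)
print(blend)
```

Raising an error on an empty list is a deliberate choice: it forces the explicit per-image instructions that step 4 asks for.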
Common mistakes
1. Using 4.5 as a «fast» 5
5 Lite is better at everything, but 4.5 is the line's production standard as of release. Don't write a 4.5 prompt by 5's rules (120 words, extended styles, improved anatomy): the model loses focus. The sweet spot for 4.5 is 30–100 words; stick to the standard style set.
2. Multi-Image Blending without an explicit preserve list
Blending two images requires explicit guidance on what to take from each. «Take the character from image 1 and place in the scene from image 2» on its own is too abstract. Follow it with a preserve list: «Preserve the person's exact facial features, wardrobe, and pose from image 1. Use the lighting and color palette from image 2.»
3. Long text in a single string
A poster with one long string («text "BEYOND THE STARS A JOURNEY BEYOND IMAGINATION"») renders worse in 4.5 than the same content split into parts. Better: «text "BEYOND THE STARS" centered at top, subtitle "a journey beyond imagination" at bottom». Long strings can get mangled even on 4.5.
4. Negatives in the main text
As in 4.0, negative prompts on 4.5 go in the platform's separate negative_prompt field, not in the main text. «No watermark, no text» in the main prompt is an anti-pattern: naming the unwanted element can lead the model to add it. Use the separate field or phrase the request positively.
5. Conflicting styles
«Photorealistic oil painting cartoon» works a bit better on 4.5 than on 4.0, but still produces an unpredictable result. Pick one dominant style and at most one compatible modifier. «Cinematic with film grain», «photorealistic with subtle painterly touches» — fine. «Realistic anime» — no.
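Some of the mistakes in this section can be caught mechanically before a prompt is sent. The linter below is a rough sketch under stated assumptions: `lint_prompt` is a hypothetical helper, the 4-word limit for quoted text is an arbitrary threshold chosen for illustration, and `--flag value` pairs are excluded from the length. It does not attempt to detect conflicting styles or a missing preserve list.

```python
import re

MIN_WORDS, MAX_WORDS = 30, 100  # sweet spot stated for 4.5
MAX_QUOTED_WORDS = 4            # hypothetical cutoff for "long" in-image text


def lint_prompt(prompt: str) -> list[str]:
    warnings = []
    # Drop "--flag value" pairs such as "--ar 2:3" before counting words.
    body = re.sub(r"--\w+\s+\S+", "", prompt)
    n_words = len(body.split())
    if not MIN_WORDS <= n_words <= MAX_WORDS:
        warnings.append(
            f"length: {n_words} words, outside {MIN_WORDS}-{MAX_WORDS}")
    # Ignore quoted in-image text when looking for negations.
    unquoted = re.sub(r'"[^"]*"', "", body)
    if re.search(r"\b(no|without)\s+\w+", unquoted, flags=re.IGNORECASE):
        warnings.append(
            "negation: use a separate negative prompt field or phrase positively")
    # Flag long quoted strings that should be split into separate elements.
    for quoted in re.findall(r'"([^"]*)"', body):
        if len(quoted.split()) > MAX_QUOTED_WORDS:
            warnings.append(f'long text: split "{quoted}" into separate elements')
    return warnings


bad = 'Poster, text "BEYOND THE STARS A JOURNEY BEYOND IMAGINATION", no watermark'
for warning in lint_prompt(bad):
    print(warning)
```

A check like this only flags candidates for review; whether a given negation or long string actually hurts the result still has to be judged on the generated image.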
Before / after examples
Example 1
Before
nice food photo for a restaurant menu
After
Bowl of artisan ramen with soft-boiled egg, sliced pork belly, and fresh green onions on a dark stone surface, food photography, soft overhead lighting from the upper-left, steam rising from the bowl, shallow depth of field with sharp focus on the egg yolk, warm earthy color palette, close-up overhead angle, --ar 1:1.
The rewrite adds a concrete subject (what is actually in frame), a food photography style, explicit lighting with direction, an overhead composition, and depth of field. At 50 words before the aspect-ratio flag, it is a working length for 4.5. At this level of detail 4.5 delivers a nearly production-ready result.
Example 2
Before
horror movie poster with a title and creepy atmosphere
After
Horror movie poster with text "THE LAST NIGHT" in bold weathered sans-serif typography centered at the upper third, dark abandoned hallway receding into shadow, single bare bulb hanging from the ceiling, dramatic low-key lighting with hard shadows, cold blue-grey color palette with one accent of red light at the far end, subtle film grain, cinematic 35mm aesthetic, --ar 2:3.
Text in quotes, explicit font style, explicit placement in frame. A spatially understood scene (hallway, bulb, far red accent). This is what breaks in 4.0 and works in 4.5.
Example 3
Before
blend my photo with a landscape as background
After
Take the person from image 1 and place them in the mountain landscape from image 2. Preserve the person's exact facial features, wardrobe, and pose from image 1. Use the lighting, atmosphere, and golden hour color palette from image 2. Match the scale so the person stands naturally in the mid-ground, with the mountain peaks rising behind them. Cinematic style, shallow depth of field, --ar 16:9.
A Multi-Image Blending prompt: explicit on what to take from image 1 (appearance, wardrobe, pose) and from image 2 (light, atmosphere, palette), plus instructions on scale and placement. Without an explicit preserve list, the model may «improve» the face or change the wardrobe.