Z-Image: how to write prompts the model actually understands
Z-Image is Alibaba Tongyi-MAI's compact 6B image model with open Apache 2.0 weights. Its key features are bilingual text rendering (English plus Chinese) and a built-in Prompt Enhancer. It is available in two variants: Base (50 steps, negative prompt support) and Turbo (8 steps, sub-second inference on an H800). It runs on consumer GPUs from the RTX 3060 upward.
What Z-Image does
Z-Image has 6 billion parameters and is built on the S3-DiT (Scalable Single-Stream Diffusion Transformer) architecture. The Turbo variant is distilled to 8 steps, generates in under a second on H800 GPUs, and took first place among open-source models in the Artificial Analysis ranking. The Base variant runs the full 50 steps, supports negative prompts and LoRA training, works with ControlNet (canny, depth), and supports the Z-Image-Edit mode.
Key numbers: flexible resolution up to roughly 4 megapixels, with a hardware target of an RTX 3060 with 16 GB VRAM. The Apache 2.0 license permits commercial use. Run it locally via Hugging Face, through the fal.ai API, or integrate it into your own stack. Both English and Chinese are natively supported, for the prompt itself and for in-image text rendering.
- 6B parameters on S3-DiT — more compact than competitors
- Bilingual text: EN + CN inside images
- Turbo — sub-second on H800, Base — negative prompt + LoRA
- ControlNet (canny, depth) + Z-Image-Edit
- Open-source under Apache 2.0, RTX 3060+ (16 GB VRAM)
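To make the "roughly 4 megapixels" figure concrete, here is a minimal Python sketch that picks the largest width and height for a given aspect ratio under a pixel budget. The budget of 2048 × 2048 pixels and the rounding to multiples of 16 are both assumptions made for illustration, not documented Z-Image limits, and `largest_size` is a hypothetical helper.

```python
import math

# Assumption: "roughly 4 megapixels" is read as 2048 * 2048 pixels.
MAX_PIXELS = 2048 * 2048
# Assumption: dimensions are rounded down to a multiple of 16.
MULTIPLE = 16


def largest_size(aspect_w: int, aspect_h: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return the largest (width, height) close to the aspect ratio within max_pixels."""
    # Integer math avoids floating-point rounding surprises.
    height = math.isqrt(max_pixels * aspect_h // aspect_w)
    width = height * aspect_w // aspect_h
    # Round both sides down so the result never exceeds the budget.
    width -= width % MULTIPLE
    height -= height % MULTIPLE
    return width, height
```

Under these assumptions a 1:1 request resolves to 2048 × 2048 and a 16:9 request to 2720 × 1536. Check the limits of the platform you actually run on before relying on these numbers.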
Prompt structure
Detailed descriptive prompts work best:
[Subject with details] + [Style keyword] + [Lighting] + [Composition] + [Quality modifiers]
Keywords Z-Image responds well to, by slot:
- Style: «oil painting», «3D render», «anime style», «photorealistic», «watercolor», «pencil sketch»
- Lighting: «natural light», «studio lighting», «golden hour», «dramatic shadow», «neon glow»
- Composition: «close-up», «wide shot», «bird's eye», «centered», «rule of thirds»
- Quality modifiers: «ultra-detailed», «high-resolution», «crisp», «sharp»
In Z-Image the quality modifiers actually change the output, unlike in many open-source models.
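The template above can be assembled programmatically. The following is a minimal sketch, not an official Z-Image utility: `build_prompt` is a hypothetical helper, and joining the slots with commas is an assumption about formatting rather than something the model requires.

```python
def build_prompt(subject: str, style: str, lighting: str, composition: str, quality: list[str]) -> str:
    """Join the five slots in the order the template lists them."""
    parts = [subject, style, lighting, composition, *quality]
    # Drop empty slots so a missing value does not leave a stray comma.
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = build_prompt(
    subject="a British Shorthair cat with green eyes sitting in a Japanese moss garden",
    style="photorealistic",
    lighting="golden hour",
    composition="close-up",
    quality=["ultra-detailed", "sharp"],
)
```

Keeping the slots as separate arguments makes it easy to vary one of them (for example the lighting) while holding the rest of the prompt fixed.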
To render text inside the image, give the exact wording in quotes: «A vintage poster with the title "Spring Festival" in red bold letters». Z-Image renders both Latin script and Chinese characters, which is its key feature versus competitors of similar size.
Prompt Enhancer and ambiguous prompts
Z-Image ships with a built-in Prompt Enhancer (PE) — a component that injects reasoning and common sense at the moment the prompt is processed. This lets it produce sensible output even from short ambiguous descriptions: the model fills in the missing pieces with plausible detail.
This is useful for rapid prototyping and creative experiments, but it does not replace a good prompt. If predictability matters, write the details out: PE patches gaps; it does not make key decisions on your behalf. In practice, «cat in a garden» leaves PE to invent the breed, time of day, and garden type, while «A British Shorthair cat sitting in a Japanese moss garden at dawn» gives a more predictable result that is closer to intent.
PE plus a descriptive prompt is the best usage pattern for Z-Image. PE covers small gaps while the main description locks in direction.
Bilingual text inside images
Z-Image's main advantage over similarly sized models is accurate rendering of both English and Chinese text inside images. This is convenient for bilingual banners, two-language posters, ads aimed at the Chinese market, memes with English text, and infographics with Chinese captions.
For precise rendering, specify the text explicitly in quotes inside the prompt:
- «A coffee shop sign that reads "Morning Brew" in elegant gold script»
- «A poster with the Chinese title "春节快乐" (Happy Spring Festival) in red calligraphy»
- «A book cover with the English title "The Silent Mountain" and subtitle "A Journey Through Tibet"»
Note that Z-Image is not Qwen Image, which is a different model by another Alibaba team. For reliable rendering, add details: font (calligraphy, bold, sans-serif), color, and placement in the frame. The more precisely the text and its parameters are specified, the higher the chance of an error-free render.
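The advice on quoting text and naming font, color, and placement can be captured in a small helper. This is an illustrative sketch only: `text_clause` and its wording pattern are hypothetical, chosen to mirror the example prompts in this guide.

```python
def text_clause(text: str, font: str, color: str, placement: str = "") -> str:
    """Describe one piece of in-image text: exact wording in quotes, then color, font, placement."""
    if '"' in text:
        # Double quotes delimit the rendered string, so they cannot appear inside it.
        raise ValueError("text must not contain double quotes")
    clause = f'the text "{text}" in {color} {font}'
    if placement:
        clause += f", {placement}"
    return clause


english = text_clause("Morning Brew", font="elegant script", color="gold")
chinese = text_clause("春节快乐", font="calligraphy", color="red", placement="centered at the top")
```

The two clauses can then be dropped into a longer scene description, one per language, so each string keeps its own font and color.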
Common mistakes
1. Too minimal a prompt
«A cat»: Prompt Enhancer will try to fill in the gaps, but without direction the result is generic. PE patches gaps; it does not replace a description. The minimum for stable results is a concrete subject with 2-3 details («a British Shorthair cat with green eyes»), a style (photorealistic / anime / oil painting), lighting, and at least one composition cue.
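A rough way to catch under-specified prompts before sending them is to check for the style, lighting, and composition keywords this guide lists. The sketch below is a plain substring heuristic, assuming those lists are a reasonable starting vocabulary. They are not exhaustive, and `missing_slots` is a hypothetical helper.

```python
# Keyword lists taken from this guide; extend them for your own vocabulary.
STYLE = ("oil painting", "3d render", "anime style", "photorealistic", "watercolor", "pencil sketch")
LIGHTING = ("natural light", "studio lighting", "golden hour", "dramatic shadow", "neon glow")
COMPOSITION = ("close-up", "wide shot", "bird's eye", "centered", "rule of thirds")


def missing_slots(prompt: str) -> list[str]:
    """Return the names of slots for which no known keyword appears in the prompt."""
    text = prompt.lower()
    slots = {"style": STYLE, "lighting": LIGHTING, "composition": COMPOSITION}
    return [name for name, words in slots.items() if not any(w in text for w in words)]
```

For «A cat» this reports all three slots as missing. A prompt that names one keyword from each list passes, although that alone does not guarantee a good subject description.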
2. Text without explicit quotes
«Make a poster about spring festival»: Z-Image does not know what text to render and will often produce mangled glyphs or substitute its own wording. Always put the exact text in quotes and specify font and color: «with the title "Spring Festival" in red bold calligraphy». This is critical for bilingual rendering, the model's signature feature.
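If prompts are generated or stored in bulk, a quick check for quoted strings can flag the ones that ask for text without saying what it should be. This is a minimal sketch using Python's standard `re` module. `quoted_texts` is a hypothetical helper, and treating both straight and curly double quotes as delimiters is an assumption.

```python
import re

# Straight or curly double quotes around at least one character.
QUOTED = re.compile(r'"([^"]+)"|“([^”]+)”')


def quoted_texts(prompt: str) -> list[str]:
    """Return every double-quoted string found in the prompt, in order."""
    # findall yields one (straight, curly) tuple per match; exactly one side is non-empty.
    return [straight or curly for straight, curly in QUOTED.findall(prompt)]
```

An empty result for a prompt that mentions a poster, sign, or title is a hint that the exact wording still needs to be added.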
3. Negative prompt in Turbo instead of Base
Negative prompt support is officially documented only for the Base variant. In Turbo (8 steps, distilled) the negative prompt is either ignored or affects output unpredictably. If the task requires excluding watermarks, hand artifacts, or text errors, use Z-Image Base with an explicit negative prompt in platform settings.
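One way to avoid this mistake in your own tooling is to validate settings before a request is sent. The sketch below is hypothetical: `Settings` and `make_settings` are illustrative names rather than part of any Z-Image API, and only the step counts (50 for Base, 8 for Turbo) come from this guide.

```python
from dataclasses import dataclass
from typing import Optional

# Step counts as stated in this guide: Base runs 50 steps, Turbo is distilled to 8.
STEPS = {"base": 50, "turbo": 8}


@dataclass(frozen=True)
class Settings:
    variant: str
    steps: int
    negative_prompt: Optional[str] = None


def make_settings(variant: str, negative_prompt: Optional[str] = None) -> Settings:
    """Build settings, refusing a negative prompt on any variant other than Base."""
    key = variant.lower()
    if key not in STEPS:
        raise ValueError(f"unknown variant: {variant!r}")
    if negative_prompt and key != "base":
        raise ValueError("negative prompts are documented only for Base")
    return Settings(key, STEPS[key], negative_prompt)
```

Raising an error instead of silently dropping the negative prompt is a design choice: it surfaces the mismatch early rather than leaving you to wonder why watermarks still appear.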
4. Expecting video or vision capabilities
Z-Image is an image generator, not a video model and not an analyzer. Prompts like «animate this scene» or «describe what's in this photo» do not work. For video, use Sora 2, Veo 3.1, Kling, or Wan-video. For image analysis, use the Qwen-VL family or GPT-4V. Z-Image covers only text-to-image (T2I) and image-to-image (I2I).
5. Confusing it with Qwen Image
Z-Image and Qwen Image are different models from different Alibaba teams: Z-Image is built by Tongyi-MAI, Qwen Image by the Qwen team. Architecture, training data, and strengths differ. A Qwen prompt may not work optimally in Z-Image and vice versa. Check which specific model the prompt is written for, especially when exporting between platforms.
Before / after examples
Example 1
Before
a cafe sign
After
A vintage coffee shop sign hanging from a brass chain, with the text "Morning Brew" written in elegant cursive gold script on a deep navy background. Worn wooden frame around the sign, slight weathering on the edges. Mounted on a brick wall, soft afternoon sunlight from the left creating warm shadows. Photorealistic, ultra-detailed, sharp focus, editorial photography style, 50mm lens, shallow depth of field.
Text explicitly in quotes with font and color specified. Concrete material and environment. Lighting with direction. Quality modifiers «ultra-detailed, sharp focus» actually work in Z-Image.
Example 2
Before
billboard with chinese text
After
A modern billboard in a busy Shanghai street at twilight, featuring the bold Chinese title "新春快乐" (Happy New Year) in red calligraphy on a yellow background. Below the title, smaller English subtitle "Spring Festival 2026" in clean white sans-serif. Neon city lights reflected on wet pavement below. Wide-angle low-angle shot. Cinematic, photorealistic, ultra-detailed, sharp focus on the text.
Bilingual render: Chinese and English text both in quotes with font, color, and size specified. Z-Image is one of the few models that reliably renders both languages in a single image.
Example 3
Before
anime character illustration
After
A young woman with long pink hair tied in twin braids, wearing a white school uniform with a navy blue tie, standing in a cherry blossom park at golden hour. Soft warm sunlight filtering through the petals creating bokeh in the background. Detailed eyes with reflective highlights, hand-drawn linework. Anime style, ultra-detailed, sharp focus, vibrant colors, cinematic composition, rule of thirds.
Style keyword «anime style» at the start of the style block. Concrete character, environment, and lighting details. Quality modifiers stacked in sequence.