Qwen Image: how to write prompts the model actually understands
Alibaba
Qwen Image is the Alibaba Qwen team's image model, known for leading text rendering: commercial-grade English and Chinese, multi-line layouts, and full paragraphs. V2.0 has 7B parameters, a native resolution of 2048×2048, and a prompt budget of up to 1,000 tokens, and it directly generates infographics, PPT slides, posters, and comics with speech bubbles.
What Qwen Image does well
The headline feature is commercial-grade text rendering in EN and CN: multi-line, paragraphs, headlines, fine captions. On AI Arena, Qwen Image holds #1 in both T2I and image editing (V2.0). It also leads 9 public benchmarks (GenEval, DPG, OneIG-Bench, GEdit, and more).
V2.0 is the workhorse: 7B parameters (lighter than V1's 20B), much lower VRAM, native 2048×2048 without upscaling, and unified generation + editing in one model. V1 remains an option for heavy production pipelines that have 40–58GB of VRAM available. The license is Apache 2.0, which permits commercial use.
- Commercial-grade text rendering (EN + CN)
- V2.0: native 2048×2048, up to 1,000 tokens, 7B parameters
- Direct generation of infographics, PPT, posters, comics
- ControlNet (canny, depth, pose, lineart, softedge, normal, openpose)
- Unified generation + editing in one model
Prompt structure
Detailed descriptive prompts with scene composition work best. Base formula: [Main subject] + [Scene composition] + [Style] + [Text content to render] + [Layout details].
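The base formula can be sketched as a small helper that joins the five slots in order. This is only an illustration: `PromptParts` and `build_prompt` are hypothetical names, not part of any Qwen Image API, and the ". " separator is an assumption about what reads cleanly, not a requirement of the model.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    """The five slots of the base formula (hypothetical container)."""
    subject: str
    composition: str
    style: str
    text_content: str
    layout: str


def build_prompt(parts: PromptParts) -> str:
    """Join the slots in formula order, skipping any slot left empty."""
    slots = [parts.subject, parts.composition, parts.style,
             parts.text_content, parts.layout]
    # Drop empty slots and trailing periods so the join adds exactly one.
    cleaned = [s.strip().rstrip(".") for s in slots if s.strip()]
    if not cleaned:
        return ""
    return ". ".join(cleaned) + "."


parts = PromptParts(
    subject="A retail sale poster",
    composition="photorealistic background with shopping bags and gift boxes",
    style="commercial-grade typography",
    text_content='Bold serif headline (EXACT): "BLACK FRIDAY" in red, centered top',
    layout="two-column grid layout, vertical orientation",
)
print(build_prompt(parts))
```

Keeping the slots separate makes it easy to see at a glance which part of the formula a weak prompt is missing.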
For documents with text, ALWAYS specify the exact text in the prompt — the model won't «guess» what should be in a slide headline. For V2.0 you can use up to 1,000 tokens, and for infographics or comics that's not «too much» — it's optimal: the model handles dense composition well.
For editing, the prompt is an instruction, not a full description. «Change the text to "Q4 2026"» works; «A poster with text saying...» in edit mode does not.
Rendering text in the image
Qwen Image and GPT Image 2 lead the market on text accuracy. Multi-line layouts, paragraph-level text, infographics with charts and text blocks, PPT slides, comics with speech bubbles, posters with headlines: V2.0 generates all of these directly, without a separate typography engine.
Rules: write exact text in quotes, specify font and size («bold serif headline», «small sans-serif caption»), set layout («centered», «left-aligned», «two-column grid»). For bilingual layouts (EN + CN in the same image) declare both languages as separate blocks — this is a Qwen Image strength.
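As a rough sketch of these rules, the helper below formats one text block with its font, role, exact quoted text, and placement, plus an optional language label for bilingual layouts. The function name `text_block` and the output wording are assumptions made for illustration. Rejecting embedded double quotes is a design choice of this sketch, so that the quoted span stays unambiguous.

```python
from typing import Optional


def text_block(role: str, text: str, font: str, placement: str,
               language: Optional[str] = None) -> str:
    """Describe one piece of in-image text: font, role, exact text, placement."""
    if '"' in text:
        # A stray quote would make it unclear where the exact text ends.
        raise ValueError("text must not contain double quotes")
    lang = f" in {language}" if language else ""
    return f'{font} {role}{lang} (EXACT): "{text}", {placement}'


# Bilingual layout: each language is declared as its own block.
blocks = [
    text_block("speech bubble", "明天的会议准备好了吗?", "Comic-style",
               "panel 1", language="Chinese"),
    text_block("phone message", "Meeting moved to Friday", "Small sans-serif",
               "panel 2", language="English"),
]
print(". ".join(blocks) + ".")
```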
Bilingual EN + CN workflow
Qwen Image is the only top image model that treats Chinese as a native language, which reflects its origin on Alibaba's Qwen team. You can write the prompt in Chinese, in English, or in a mix of both. The in-image text can be in either language or bilingual.
Concrete scenarios: marketing materials for the Chinese market with Chinese headlines and English brand names, bilingual infographics for international teams, comics with CJK speech bubbles, product packaging for Chinese e-commerce. This is an area where Qwen Image is stronger than most Western models.
Common mistakes
1. No explicit text for document prompts
If you're generating an infographic, poster, or PPT slide and don't specify exact text in quotes, the model will invent the headline and captions — usually not what you want. Every text field should be in quotes with an EXACT marker. For bilingual layouts, declare both languages as separate blocks.
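A quick pre-flight check can catch this mistake before a prompt is sent. The sketch below only looks for straight double-quoted spans; the function names are hypothetical, and it does not verify that every text field is covered, only that at least one exact string is present.

```python
import re

# Matches any non-empty span between straight double quotes.
QUOTED = re.compile(r'"([^"]+)"')


def quoted_strings(prompt: str) -> list[str]:
    """Return every double-quoted span found in the prompt, in order."""
    return QUOTED.findall(prompt)


def needs_exact_text(prompt: str) -> bool:
    """True when a document-style prompt contains no quoted text at all."""
    return not quoted_strings(prompt)


before = "infographic about company sales"
after = 'Title (EXACT): "Q4 Revenue Breakdown". Cards: "$2.4M Total", "+18% YoY"'

print(needs_exact_text(before))  # True: the model would invent the text
print(quoted_strings(after))
```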
2. Confusing Qwen Image with Qwen2.5-VL
Qwen Image is an image generator. Qwen2.5-VL is a vision-language model for image analysis. They are two different models with different jobs. If a tutorial or API is about «Qwen2.5-VL», it is NOT about image generation. For generation you need Qwen Image V1 or V2.0 specifically, so check the repo name before launching.
3. V1 on a weak GPU
Qwen Image V1 requires 40–58GB VRAM, which is A100/H100 territory. On consumer GPUs (24GB and below), V1 either won't run at all or runs only with severe offloading at low speed. For local execution and most cloud pipelines, use V2.0: 7B parameters and much lower VRAM.
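The choice can be expressed as a simple rule of thumb. In this sketch the 40GB threshold is the lower end of the range quoted above; the function name is hypothetical, and because no VRAM figure for V2.0 is given here, the function does not claim that V2.0 fits on any particular card.

```python
# Lower end of the 40–58GB range quoted above for V1.
V1_MIN_VRAM_GB = 40


def recommend_variant(vram_gb: float) -> str:
    """Suggest which Qwen Image variant to try for a given amount of VRAM."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    if vram_gb >= V1_MIN_VRAM_GB:
        # Enough headroom for V1, though V2.0 remains the lighter option.
        return "V1 or V2.0"
    # Below the V1 range: V2.0 is the only realistic candidate.
    return "V2.0"


print(recommend_variant(24))  # consumer GPU
print(recommend_variant(80))  # data-center GPU
```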
4. Prompt too short for complex composition
V2.0 supports up to 1,000 tokens specifically for complex compositions — infographics, PPT, comics. If you ask for a 4-panel comic in one sentence, the model invents the contents at random. Use the full prompt budget to enumerate panels, exact text, layout, fonts, and colors.
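One way to make sure every panel is spelled out is to generate the prompt from a list of panels. The sketch below is illustrative only: the helper names are hypothetical, and the token estimate uses a rough four-characters-per-token rule of thumb for English, which is an assumption and not the model's real tokenizer (CJK text will count differently).

```python
TOKEN_BUDGET = 1000      # V2.0 prompt budget stated above
CHARS_PER_TOKEN = 4      # assumption: rough rule of thumb for English text


def comic_prompt(style: str, panels: list[tuple[str, str]]) -> str:
    """Enumerate each panel with its scene and exact speech-bubble text."""
    lines = [f"{len(panels)}-panel {style} comic."]
    for i, (scene, bubble) in enumerate(panels, start=1):
        lines.append(f'Panel {i}: {scene}. Speech bubble: "{bubble}".')
    return " ".join(lines)


def rough_token_estimate(prompt: str) -> int:
    """Ceiling of character count divided by the assumed chars per token."""
    return -(-len(prompt) // CHARS_PER_TOKEN)


prompt = comic_prompt("manga-style", [
    ("A young woman holds a coffee cup, looking out a window", "Is the meeting ready?"),
    ("Close-up of her phone screen", "Meeting moved to Friday"),
])
print(prompt)
print(rough_token_estimate(prompt) <= TOKEN_BUDGET)
```

If the estimate comes close to the budget, check the count with the actual tokenizer rather than relying on this approximation.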
5. Full scene description in edit mode
In V2.0 unified generation+editing, edit-mode prompts should be instructions, not full descriptions. «Change the title text to "Q4 2026"» works; «A poster with a Q4 2026 title and modern design» in edit mode pushes the model to redraw everything. If you want a new poster, switch to T2I mode.
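A crude guard against this mistake is to check whether an edit prompt starts with an imperative verb. The verb list and function name below are assumptions chosen for illustration; this is a heuristic, not something the model itself enforces.

```python
# Hypothetical list of verbs that typically open an edit instruction.
EDIT_VERBS = {"change", "replace", "remove", "add", "move",
              "make", "swap", "delete", "recolor", "resize"}


def looks_like_edit_instruction(prompt: str) -> bool:
    """True when the first word of the prompt is a known imperative verb."""
    words = prompt.strip().split()
    if not words:
        return False
    # Ignore leading quote characters before comparing.
    first = words[0].lower().strip('«"\'')
    return first in EDIT_VERBS


print(looks_like_edit_instruction('Change the title text to "Q4 2026"'))
print(looks_like_edit_instruction("A poster with a Q4 2026 title and modern design"))
```

A prompt that fails the check is a hint to either rephrase it as an instruction or switch to T2I mode.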
Before / after examples
Example 1
Before
a beautiful poster with sale text
After
A retail sale poster, photorealistic background with shopping bags and gift boxes. Bold serif headline (EXACT): "BLACK FRIDAY" in red, centered top. Subheadline below in white sans-serif: "Up to 70% off — November 28–30". Bottom-right corner: small caption "Free shipping over $50". Two-column grid layout, vertical orientation. Commercial-grade typography.
Exact text in quotes with EXACT marker, explicit fonts (serif headline + sans-serif sub), colors and layout. Without these the model invents text and placement — usually not what you want.
Example 2
Before
infographic about company sales
After
Corporate infographic, white background, clean grid layout. Title (EXACT, centered top, bold sans-serif): "Q4 Revenue Breakdown". Four metric cards in a 2×2 grid, each with a number and label: "$2.4M Total", "+18% YoY", "3 New Markets", "86% Retention". Use Inter sans-serif for all numbers, brand color #1E40AF for highlights, light grey rules between cards. Print-ready commercial typography.
Complex composition (2×2 grid) with concrete numbers and labels in quotes. Font (Inter), color (#1E40AF), and print-ready marker specified. V2.0's 1,000-token budget handles this density.
Example 3
Before
comic with dialogue between two characters in Chinese and English
After
Two-panel manga-style comic. Panel 1: A young woman in business attire holds a coffee cup, looking out a window. Speech bubble (Chinese): "明天的会议准备好了吗?". Panel 2: Close-up of her phone screen showing a message in English: "Meeting moved to Friday". Clean line art, black ink with light grey shading, white background. Comic-style typography, speech bubbles with thin black borders.
Bilingual text (CN in panel 1, EN in panel 2), both explicitly in quotes. Style (manga, line art, ink) and typography (comic-style) specified. This is a scenario where Qwen Image beats most Western models.