Grok Imagine: how to write prompts the model actually understands
Grok Imagine (Aurora) is xAI's image model with an autoregressive MoE Transformer architecture, not diffusion. It excels at photorealistic portraits and accurate in-image text rendering. It supports up to 2K resolution, prompts up to 10,000 characters, 14+ aspect ratios, and up to 10 images per request. Negative prompts do not work.
What Grok Imagine does well
Grok Imagine is autoregressive, not diffusion. This delivers high prompt fidelity and stable in-image text rendering — one of the key differentiators from competitors.
Strengths: photorealistic human portraits, accurate in-image text (logos, signs, banners), and stylistic flexibility within one model (photorealism, anime, watercolor, oil, pop art). Multi-turn editing via POST /v1/images/edits lets you build an iterative edit chain, and multi-image compositing accepts up to 5 input images per generation. The model also places fewer restrictions on real objects than most competitors.
- Resolution: 1K (default) or 2K (via the `resolution` parameter)
- Up to 10,000 characters per prompt
- 14+ aspect ratios, up to 10 images per request
- Edit mode via /v1/images/edits — up to 5 input images
- Pro variant — higher quality, better text
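The limits listed above can be checked on the client before a request is sent. The sketch below is a minimal illustration, not the official client: the field names `model`, `prompt`, `n`, and `aspect_ratio`, the accepted strings for `resolution`, and the model identifier passed in by the caller are all assumptions that should be verified against the xAI API reference.

```python
from typing import Any, Dict, Optional

# Limits taken from the list above.
MAX_PROMPT_CHARS = 10_000
MAX_IMAGES_PER_REQUEST = 10
# Assumption: the API accepts these exact strings for the resolution parameter.
RESOLUTIONS = {"1K", "2K"}


def build_generation_payload(
    prompt: str,
    model: str,
    n: int = 1,
    resolution: str = "1K",
    aspect_ratio: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate a request against the documented limits and return a JSON-ready dict.

    All key names in the returned dict are hypothetical.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt is {len(prompt)} chars, limit is {MAX_PROMPT_CHARS}")
    if not 1 <= n <= MAX_IMAGES_PER_REQUEST:
        raise ValueError(f"n must be between 1 and {MAX_IMAGES_PER_REQUEST}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")

    payload: Dict[str, Any] = {
        "model": model,
        "prompt": prompt,
        "n": n,
        "resolution": resolution,
    }
    # Only send an aspect ratio when the caller asked for one.
    if aspect_ratio is not None:
        payload["aspect_ratio"] = aspect_ratio
    return payload
```

Validating locally keeps an oversized prompt or an out-of-range image count from costing a round trip to the API.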
Prompt structure
Formula: [Subject] + [Style/Mood] + [Lighting] + [Camera Angle] + [Finishing Details].
Grok Imagine accepts natural language — descriptive sentences, NOT tags. Use cinematic language: camera position, lens type, light direction, time of day.
Specific atmosphere beats generic: «nostalgic», «melancholic», «electric» instead of «happy», «cool», «nice». Describe one clear scene per generation — multi-scene prompts with conflicting elements confuse the model.
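One way to keep prompts consistent with the formula above is to fill in its slots separately and join them into sentences. This is only a sketch: the `ScenePrompt` class and its field names are hypothetical helpers, not part of any xAI SDK.

```python
from dataclasses import dataclass


@dataclass
class ScenePrompt:
    """One scene, split into the slots of the formula above."""

    subject: str
    style_mood: str
    lighting: str
    camera: str
    finishing: str = ""

    def render(self) -> str:
        """Join the non-empty slots into natural-language sentences."""
        parts = [self.subject, self.style_mood, self.lighting, self.camera, self.finishing]
        sentences = []
        for part in parts:
            text = part.strip().rstrip(".")
            if text:
                # Capitalize the first letter and close the sentence with a period.
                sentences.append(text[0].upper() + text[1:] + ".")
        return " ".join(sentences)


prompt = ScenePrompt(
    subject="a close-up portrait of a young woman with freckles",
    style_mood="melancholic mood, editorial photography",
    lighting="golden hour rim light from behind",
    camera="shot on 85mm f/1.4, shallow depth of field",
).render()
print(prompt)
```

Because each slot is a separate field, changing one variable at a time between generations becomes a one-field change.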
The API returns a `revised_prompt` field — the model can internally refine the prompt before generation. This is part of the architecture, not a bug.
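Logging `revised_prompt` next to the original prompt makes it easier to see what the model actually rendered. The sketch below works on an already parsed JSON response. Only the `revised_prompt` field name comes from the text above; the surrounding shape (a top-level `data` list with one object per image) is an assumption to check against the real response.

```python
from typing import Any, Dict, List, Tuple


def revised_prompts(response: Dict[str, Any], original: str) -> List[Tuple[str, bool]]:
    """Return (prompt_used, was_revised) for each image in a parsed response.

    Assumes a hypothetical shape: {"data": [{"revised_prompt": "..."}, ...]}.
    Falls back to the original prompt when the field is missing or empty.
    """
    results = []
    for item in response.get("data", []):
        used = item.get("revised_prompt") or original
        results.append((used, used.strip() != original.strip()))
    return results


# Hand-written sample response for illustration only.
sample = {"data": [{"revised_prompt": "A red fox in fresh snow, soft morning light"}, {}]}
for used, changed in revised_prompts(sample, "a red fox in snow"):
    print("revised" if changed else "unchanged", "->", used)
```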
What does NOT work
Main limitation: negative prompts are not supported. «No X», «don't include Y», «without Z» — the model completely ignores them. Describe ONLY what you want. This is a critical antipattern that breaks output.
Also non-functional: special syntax (weights such as `(word:1.2)`, special tokens, LoRA references), keyword stacking («masterpiece, best quality, 8k, ultra detailed», which is counterproductive for an autoregressive architecture), and generic adjectives («nice», «cool», «good», which are empty words).
Don't expect pixel-level control in Edit mode — editing is prompt-driven and holistic. When iterating, change one variable at a time — otherwise the model changes everything at once.
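The prompt-level antipatterns in this section (negatives, special syntax, keyword stacks, generic adjectives) can be caught with a small linter before a prompt is sent. The word lists and regular expressions below are illustrative assumptions, and the checks are heuristics: for example, «cool» is flagged even in a legitimate phrase like «cool tones».

```python
import re
from typing import List

# Heuristic patterns; all lists here are illustrative, not exhaustive.
NEGATION = re.compile(r"\b(?:no|without|don't|do not|never)\s+\w+", re.IGNORECASE)
WEIGHT = re.compile(r"\([^()]+:\d+(?:\.\d+)?\)")      # e.g. (word:1.2)
LORA = re.compile(r"<lora:[^>]+>", re.IGNORECASE)      # e.g. <lora:name:0.8>
QUALITY_TAGS = ("masterpiece", "best quality", "8k", "ultra detailed", "hyperrealistic")
GENERIC_ADJECTIVES = ("nice", "cool", "good", "beautiful")


def lint_prompt(prompt: str) -> List[str]:
    """Return a list of human-readable warnings; an empty list means no findings."""
    warnings = []
    lowered = prompt.lower()

    # One warning per negative phrase such as "no blur".
    for match in NEGATION.finditer(prompt):
        warnings.append(f"negative phrasing: {match.group(0)!r}")

    if WEIGHT.search(prompt) or LORA.search(prompt):
        warnings.append("special syntax (weights or LoRA tags) is not supported")

    # Two or more quality tags count as a keyword stack.
    stacked = [tag for tag in QUALITY_TAGS if tag in lowered]
    if len(stacked) >= 2:
        warnings.append("keyword stack: " + ", ".join(stacked))

    generic = [w for w in GENERIC_ADJECTIVES if re.search(rf"\b{w}\b", lowered)]
    if generic:
        warnings.append("generic adjectives: " + ", ".join(generic))
    return warnings


for warning in lint_prompt("masterpiece, best quality, 8k, (red dress:1.3), no blur"):
    print(warning)
```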
Edit mode — image editing
Grok Imagine Edit runs on the same model backend; it is not a separate model. Access it via POST /v1/images/edits. It accepts 1–5 input images plus a prompt.
Key rule: when editing a single image, aspect ratio is taken from the source. The prompt describes only WHAT TO CHANGE, not the whole scene. «Change the sky to sunset» works better than redescribing the entire frame.
Iterate one variable at a time. Don't contradict the input image: if it shows daylight, don't ask for «midnight» in a single prompt; prefer «evening light». For multi-image compositing, describe exactly how to combine the inputs, for example «place the person from Image 1 into the scene from Image 2».
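The edit rules above (1–5 input images, one change per prompt, each step building on the previous result) can be sketched as a small chain runner. Everything about the wire format here is an assumption: the payload keys `model`, `prompt`, and `images` are hypothetical, and `send` stands in for whatever function actually performs the POST to /v1/images/edits and returns a reference to the new image.

```python
from typing import Any, Callable, Dict, List

MAX_INPUT_IMAGES = 5  # limit stated above


def build_edit_payload(prompt: str, images: List[str], model: str) -> Dict[str, Any]:
    """Build one edit request; the key names are hypothetical."""
    if not 1 <= len(images) <= MAX_INPUT_IMAGES:
        raise ValueError(f"edit mode takes 1 to {MAX_INPUT_IMAGES} input images")
    if not prompt.strip():
        raise ValueError("describe what to change")
    return {"model": model, "prompt": prompt, "images": list(images)}


def run_edit_chain(
    source_image: str,
    changes: List[str],
    model: str,
    send: Callable[[Dict[str, Any]], str],
) -> str:
    """Apply one change per request, feeding each result into the next step.

    `send` is supplied by the caller: it submits the payload and returns
    a reference (for example a URL) to the edited image.
    """
    current = source_image
    for change in changes:
        current = send(build_edit_payload(change, [current], model))
    return current
```

A caller would pass a list such as `["Change the sky to sunset", "Add warm evening light"]`, so each request changes a single variable and a bad step can be retried without redoing the whole chain.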
Common mistakes
1. Negative prompts
«No X», «don't include Y», «without Z» — Grok Imagine completely ignores negatives. This is a core architectural limitation. Describe ONLY what you want. To get «no people», don't mention people at all and describe the empty scene.
2. Keyword stacking «masterpiece, best quality, 8k»
A stack of generic quality tags («masterpiece, best quality, 8k, ultra detailed, hyperrealistic») is counterproductive for autoregressive models. Concrete terms (lens, lighting, mood adjective) work significantly better than any quality stack.
3. SD syntax: weights, LoRA, embeddings
Weights like `(word:1.5)`, LoRA references, embeddings, and special tokens are not supported by Grok Imagine. They land in the prompt as literal noise or get ignored. Set priorities through word order and coherent descriptions instead.
4. Generic adjectives instead of atmospheric ones
«Nice», «cool», «good», «beautiful» give the model no direction. Use specific atmospheric words: «nostalgic», «melancholic», «electric», «dramatic», «serene», «ominous», «ethereal». They shift output noticeably more than generic adjectives.
5. Complex multi-scene prompts
One clear scene per generation. A prompt with multiple scenes, conflicting elements, or an attempt to tell a whole story confuses the model. For storytelling, run multiple generations. For editing, change one variable at a time in Edit mode.
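When migrating prompts written for Stable Diffusion tools, mistakes 2 and 3 can be cleaned up mechanically as a first pass. The sketch below unwraps `(word:1.5)` weights, drops `<lora:...>` tags (the tag form used by common Stable Diffusion front ends), and removes quality tags. The tag list is an illustrative assumption, and the result is still a tag list that needs to be rewritten by hand into descriptive sentences.

```python
import re

WEIGHTED = re.compile(r"\(([^():]+):\d+(?:\.\d+)?\)")   # (red dress:1.5) -> red dress
LORA_TAG = re.compile(r"<lora:[^>]*>", re.IGNORECASE)    # <lora:film:0.8> -> removed
# Illustrative list, taken from the quality stack quoted above.
QUALITY_TAGS = {"masterpiece", "best quality", "8k", "ultra detailed", "hyperrealistic"}


def strip_sd_syntax(prompt: str) -> str:
    """Remove SD-specific syntax and quality tags from a comma-separated prompt."""
    text = LORA_TAG.sub("", prompt)
    text = WEIGHTED.sub(r"\1", text)
    # Split on commas, trim each piece, and drop empty pieces and quality tags.
    parts = [part.strip() for part in text.split(",")]
    kept = [part for part in parts if part and part.lower() not in QUALITY_TAGS]
    return ", ".join(kept)


print(strip_sd_syntax("masterpiece, best quality, (red dress:1.5), garden <lora:film:0.8>, 8k"))
```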
Before / after examples
Example 1
Before
beautiful portrait of a woman, beautiful, high quality, no blur, no watermark
After
A close-up portrait of a young woman with freckles and short auburn hair, wearing a black wool turtleneck. Golden hour rim light from behind, warm amber tones, melancholic mood. Shot on 85mm f/1.4, shallow depth of field, subtle film grain. Editorial photography.
Grok Imagine ignores the negatives «no blur, no watermark», and «beautiful, high quality» are empty words. A concrete subject, lighting, lens, and atmospheric adjective hit the target.
Example 2
Before
vintage shop sign
After
A weathered metal sign mounted above a 1950s diner entrance. The sign reads "JOE'S DINER" in bold red script with cyan accents and small star icons. Twilight neon glow, wet asphalt below reflecting the lights, nostalgic mood. 35mm film photography, shallow depth of field.
The After prompt puts the exact text in quotes and specifies a font and color, an era, and the atmospheric «nostalgic». Grok Imagine is top-tier for text, so use that.
Example 3
Before
masterpiece, best quality, 8k, ultra detailed, photorealistic, woman, dress, garden, no blur
After
A young woman in her twenties wearing a flowing pale yellow linen dress, standing in a sunlit cottage garden in early summer. Soft golden hour light catches her hair, electric mood, shallow depth of field. Shot on 85mm at f/1.8, candid documentary style.
Keyword stacking («masterpiece, best quality, 8k, ultra detailed») is counterproductive for an autoregressive architecture. A coherent description with cinematic language works noticeably better.