How is Grok Imagine different from Midjourney and DALL-E?

Grok Imagine is built on an autoregressive MoE Transformer architecture, not diffusion. This gives it high prompt fidelity and accurate in-image text rendering, which are its main differences from Midjourney and DALL-E. It is comparable to top-tier models in stylistic flexibility and photorealism, but applies fewer restrictions on real objects. It also supports prompts of up to 10,000 characters and an Edit mode with multi-image compositing.

What is the difference between Standard and Pro?

Standard is the fast variant (up to 300 requests/min) with good quality for most tasks. Pro delivers higher quality, better text rendering, and cleaner detail. Choose Pro for branding, typography, product photography, and final production; choose Standard for prototyping and quick iteration.

What resolutions are supported?

The default is 1K; 2K is available via the `resolution` parameter. Supported aspect ratios: 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3, 2:1, 1:2, 19.5:9, 9:19.5, 20:9, 9:20, and auto. In Edit mode, auto takes the source image's ratio. Each request can return up to 10 images.
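
The limits above can be checked before a request is sent. The following is a minimal sketch, not the official client: only the `resolution` parameter name comes from this guide, while the field names `aspect_ratio` and `n`, the value spellings `"1k"` and `"2k"`, and the helper `build_generation_payload` are assumptions made for illustration.

```python
# Documented aspect ratios, copied from the list above.
ASPECT_RATIOS = {
    "1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3", "2:1", "1:2",
    "19.5:9", "9:19.5", "20:9", "9:20", "auto",
}
RESOLUTIONS = {"1k", "2k"}   # assumed spelling of the resolution values
MAX_IMAGES = 10              # documented: up to 10 images per request
MAX_PROMPT_CHARS = 10_000    # documented: up to 10,000 characters per prompt


def build_generation_payload(prompt, aspect_ratio="1:1", resolution="1k", n=1):
    """Validate the documented limits and return a JSON-ready dict.

    The keys "aspect_ratio" and "n" are hypothetical field names.
    """
    if not prompt or len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt must be 1 to 10,000 characters")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    if not 1 <= n <= MAX_IMAGES:
        raise ValueError("n must be between 1 and 10")
    return {
        "prompt": prompt,
        "aspect_ratio": aspect_ratio,
        "resolution": resolution,
        "n": n,
    }
```

Validating locally keeps an out-of-range value from costing a round trip; check the real field names against the API reference before relying on this shape.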

How does Edit mode work?

Send a POST request to /v1/images/edits with 1 to 5 input images plus a prompt describing the change. The prompt should describe only what to change, not the whole scene, because the model already sees the source. When editing a single image, the aspect ratio is taken from that image. For multi-image compositing, describe how to combine the inputs: «place the person from Image 1 into the scene from Image 2».
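
As a rough illustration of the request body, the sketch below enforces the 1 to 5 image limit before anything is sent to /v1/images/edits. The field names `prompt` and `images`, the base64 encoding, and the helper `build_edit_payload` are assumptions; the guide does not specify how images are attached.

```python
import base64

MAX_INPUT_IMAGES = 5  # documented: 1 to 5 input images per edit


def build_edit_payload(prompt, images):
    """Return a JSON-ready dict for an edit request.

    `images` is a list of raw image bytes. Sending them as base64
    strings under an "images" key is an assumption for illustration.
    """
    if not prompt.strip():
        raise ValueError("the prompt must describe the change")
    if not 1 <= len(images) <= MAX_INPUT_IMAGES:
        raise ValueError("pass between 1 and 5 input images")
    encoded = [base64.b64encode(img).decode("ascii") for img in images]
    return {"prompt": prompt, "images": encoded}
```

A call such as `build_edit_payload("Change the sky to sunset", [source_bytes])` keeps the prompt limited to the change itself, as recommended above.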

What is revised_prompt in the API response?

The `revised_prompt` field shows a refined version of your prompt — the model may internally tweak the wording before generation. This is part of the autoregressive MoE Transformer architecture, not a bug. Use it to understand how the model interpreted the request and to debug unexpected results.
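
One way to use the field for debugging is to compare it word by word with what was sent. This is a sketch under assumptions: it treats the response as a plain dict with `revised_prompt` at the top level, which may not match the real response shape, and `prompt_changes` is a hypothetical helper name.

```python
import difflib


def prompt_changes(original, response):
    """List the words the model added to or dropped from the prompt.

    Assumes `response` is a dict with a top-level "revised_prompt" key;
    falls back to the original prompt when the field is missing.
    """
    revised = response.get("revised_prompt") or original
    before, after = original.split(), revised.split()
    added, removed = [], []
    matcher = difflib.SequenceMatcher(None, before, after)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(before[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(after[j1:j2])
    return {"revised": revised, "added": added, "removed": removed}
```

If the added words explain an unexpected result, fold the intended wording into the next prompt explicitly.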

What language should prompts be written in?

English is the most reliable. The model is trained predominantly on English data, and quality in other languages drops noticeably. For production tasks, use English only. Prompts in other languages work for experiments, but the output is less precise in style, atmospheric nuance, and photographic terminology.

Does Opten support Grok Imagine?

Yes, the Opten extension auto-detects Grok Imagine and scores prompts against the structure outlined in this guide: coherent description, concrete lighting and camera angle, atmospheric adjectives, and no negative prompts or keyword stacking. One click delivers a rewrite in the correct structure.

Grok Imagine: how to write prompts the model actually understands

Grok · Updated: May 19, 2026

Grok Imagine (Aurora) is xAI's image model with an autoregressive MoE Transformer architecture, not diffusion. It excels at photorealistic portraits and accurate in-image text rendering. It supports up to 2K resolution, prompts up to 10,000 characters, 14+ aspect ratios, and up to 10 images per request. Negative prompts do not work.

What Grok Imagine does well

Grok Imagine is autoregressive, not diffusion. This delivers high prompt fidelity and stable in-image text rendering — one of the key differentiators from competitors.

Strengths: photorealistic human portraits, accurate in-image text (logos, signs, banners), and stylistic flexibility within one model (photorealism, anime, watercolor, oil, pop art). Multi-turn editing via POST /v1/images/edits supports an iterative edit chain, and multi-image compositing accepts up to 5 input images per generation. The model also applies fewer restrictions on real objects than most competitors.

Resolution: 1K (default), 2K (via the `resolution` parameter)
Up to 10,000 characters per prompt
14+ aspect ratios, up to 10 images per request
Edit mode via /v1/images/edits — up to 5 input images
Pro variant — higher quality, better text

Prompt structure

Formula: [Subject] + [Style/Mood] + [Lighting] + [Camera Angle] + [Finishing Details].

Grok Imagine accepts natural language — descriptive sentences, NOT tags. Use cinematic language: camera position, lens type, light direction, time of day.

Specific atmosphere beats generic: «nostalgic», «melancholic», «electric» instead of «happy», «cool», «nice». Describe one clear scene per generation — multi-scene prompts with conflicting elements confuse the model.
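
The formula above can be turned into a small template that always yields sentences rather than tags. This is an illustrative sketch: `PromptParts` and `compose_prompt` are hypothetical names, and the five fields simply mirror the formula's slots.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    subject: str
    style_mood: str
    lighting: str
    camera: str
    finishing: str


def compose_prompt(parts):
    """Join the five slots into natural-language sentences, in formula order."""
    slots = [parts.subject, parts.style_mood, parts.lighting,
             parts.camera, parts.finishing]
    # Drop empty slots and trailing periods so every sentence ends once.
    cleaned = [c for c in (s.strip().rstrip(".") for s in slots) if c]
    return " ".join(c[0].upper() + c[1:] + "." for c in cleaned)


prompt = compose_prompt(PromptParts(
    subject="a weathered fisherman mending nets on a pier",
    style_mood="melancholic documentary style",
    lighting="overcast dawn light from the left",
    camera="shot at eye level on a 50mm lens",
    finishing="subtle film grain",
))
```

Keeping one slot per sentence also makes it easy to change a single variable between generations.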

The API returns a `revised_prompt` field — the model can internally refine the prompt before generation. This is part of the architecture, not a bug.

What does NOT work

Main limitation: negative prompts are not supported. «No X», «don't include Y», «without Z» — the model completely ignores them. Describe ONLY what you want. This is a critical antipattern that breaks output.

Also non-functional: special syntax (no weights `(word:1.2)`, tokens, LoRA references), keyword stacking («masterpiece, best quality, 8k, ultra detailed» — counterproductive for autoregressive architecture), generic adjectives («nice», «cool», «good» — empty words).

Don't expect pixel-level control in Edit mode — editing is prompt-driven and holistic. When iterating, change one variable at a time — otherwise the model changes everything at once.

Edit mode — image editing

Grok Imagine Edit is the same model backend, not a separate model. Access via POST /v1/images/edits. Accepts 1–5 input images plus a prompt.

Key rule: when editing a single image, aspect ratio is taken from the source. The prompt describes only WHAT TO CHANGE, not the whole scene. «Change the sky to sunset» works better than redescribing the entire frame.

Iterate one variable at a time. Don't contradict the input image: if the source is in daylight, don't jump to «midnight» in a single prompt; ask for «evening light» instead. For multi-image compositing, describe exactly how to combine the inputs: «place the person from Image 1 into the scene from Image 2».
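
A compositing prompt is easy to get wrong when it names an image that was never attached. The sketch below checks «Image N» references against the number of inputs. It assumes the «Image 1», «Image 2» wording from the example refers to inputs in upload order, which this guide implies but does not state; `check_image_refs` is a hypothetical helper.

```python
import re

MAX_INPUT_IMAGES = 5  # documented: up to 5 input images


def check_image_refs(prompt, image_count):
    """Return the referenced image numbers that have no matching input."""
    if not 1 <= image_count <= MAX_INPUT_IMAGES:
        raise ValueError("image_count must be between 1 and 5")
    refs = {int(n) for n in re.findall(r"\bImage (\d+)\b", prompt, flags=re.IGNORECASE)}
    return sorted(n for n in refs if n < 1 or n > image_count)
```

An empty list means every reference points at an attached image; anything else is worth fixing before the request is sent.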

Common mistakes

1. Negative prompts
«No X», «don't include Y», «without Z» — Grok Imagine completely ignores negatives. This is a core architectural limitation. Describe ONLY what you want. To get «no people», don't mention people at all and describe the empty scene.
2. Keyword stacking «masterpiece, best quality, 8k»
A stack of generic quality tags («masterpiece, best quality, 8k, ultra detailed, hyperrealistic») is counterproductive for autoregressive models. Concrete terms (lens, lighting, mood adjective) work significantly better than any quality stack.
3. SD syntax: weights, LoRA, embeddings
Weights like `(word:1.5)`, LoRA references, embeddings, special tokens — Grok Imagine doesn't support them. They either land in the prompt as literal noise or get ignored. Set priorities through word order and coherent descriptions instead.
4. Generic adjectives instead of atmospheric ones
«Nice», «cool», «good», «beautiful» give the model no direction. Use specific atmospheric words: «nostalgic», «melancholic», «electric», «dramatic», «serene», «ominous», «ethereal». They shift output noticeably more than generic adjectives.
5. Complex multi-scene prompts
One clear scene per generation. A prompt with multiple scenes, conflicting elements, or attempts to describe a story confuses the model. For storytelling do multiple generations. For editing change one variable at a time in Edit mode.
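
Several of the mistakes above are mechanical enough to catch with a simple check before sending a prompt. The following is a heuristic sketch, not how Opten or the model itself works: the word lists are taken from the examples in this guide, the patterns will miss many phrasings, and `lint_prompt` is a hypothetical helper.

```python
import re

# "no X", "without X", "don't include X": phrasing the model ignores.
NEGATION = re.compile(r"\b(no|without|don't include|do not include)\s+\w+", re.IGNORECASE)
# Stable Diffusion style weights such as (word:1.2).
WEIGHT = re.compile(r"\(\s*[^()]+:\s*\d+(\.\d+)?\s*\)")
QUALITY_TAGS = ("masterpiece", "best quality", "8k", "ultra detailed", "hyperrealistic")
GENERIC = ("nice", "cool", "good", "beautiful")


def lint_prompt(prompt):
    """Return the names of the common mistakes found in the prompt."""
    issues = []
    lower = prompt.lower()
    if NEGATION.search(prompt):
        issues.append("negative phrasing")
    if WEIGHT.search(prompt):
        issues.append("weight syntax")
    # Two or more quality tags together count as a stack.
    if sum(tag in lower for tag in QUALITY_TAGS) >= 2:
        issues.append("keyword stacking")
    if any(re.search(rf"\b{word}\b", lower) for word in GENERIC):
        issues.append("generic adjectives")
    return issues
```

Multi-scene prompts are left out on purpose, since detecting them reliably needs more than pattern matching.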

Before / after examples

Example 1

Before

beautiful portrait of a woman, beautiful, high quality, no blur, no watermark

After

A close-up portrait of a young woman with freckles and short auburn hair, wearing a black wool turtleneck. Golden hour rim light from behind, warm amber tones, melancholic mood. Shot on 85mm f/1.4, shallow depth of field, subtle film grain. Editorial photography.

Grok Imagine ignores the negatives «no blur, no watermark», and «beautiful, high quality» are empty words. A concrete subject, lighting, lens, and atmospheric adjective give the model something to work with.

Example 2

Before

vintage shop sign

After

A weathered metal sign mounted above a 1950s diner entrance. The sign reads "JOE'S DINER" in bold red script with cyan accents and small star icons. Twilight neon glow, wet asphalt below reflecting the lights, nostalgic mood. 35mm film photography, shallow depth of field.

The rewrite puts the exact text in quotes and specifies the lettering style and color, the era, and the atmospheric adjective «nostalgic». Text rendering is one of Grok Imagine's strengths, so use it.

Example 3

Before

masterpiece, best quality, 8k, ultra detailed, photorealistic, woman, dress, garden, no blur

After

A young woman in her twenties wearing a flowing pale yellow linen dress, standing in a sunlit cottage garden in early summer. Soft golden hour light catches her hair, electric atmospheric mood, shallow depth of field. Shot on 85mm at f/1.8, candid documentary style.

Keyword stacking («masterpiece, best quality, 8k, ultra detailed») is counterproductive for an autoregressive architecture, and the negative «no blur» is ignored. A coherent description with cinematic language works noticeably better.
