How to write prompts for GPT Image 2: 5 steps from random output to precise result

Vlad Voronezhtsev · 8 min read

GPT Image 2 is OpenAI's "thinking" image model. It processes the prompt sequentially: what comes first gets the most visual weight. Unlike Midjourney, which is happy with a tag soup, and Nano Banana, which defaults to a bright bubblegum exposure, GPT Image 2 expects a structured brief with declared purpose and a calm, neutral palette. Write Midjourney-style prompts for it and half your credits go to noise. These 5 steps turn random output into predictable results for everything from ad billboards to dense-text infographic slides.

  1. Structure beats tag soup

    GPT Image 2 reads the prompt top to bottom and gives the most weight to the first paragraph. Bury the main subject at the end and the model won't surface it: your shot ends up being about something else. The working order: [Background/Scene] → [Subject] → [Key details] → [Style/Medium] → [Lighting/Composition] → [Text in quotes] → [Constraints: what to keep, what to avoid]. The block format itself can be anything (natural language, a JSON-like structure, an instruction list); all of them work. What matters is that the intent and the key constraints appear in the first 30-40 words. Stable-Diffusion-style tag soup (`girl, redhair, summer, masterpiece, 8k, octane render`) doesn't work for GPT Image 2: the model tries to use every tag, but a flat list gives it no hierarchy, so the output is random.

    Before

    summer, girl, red hair, beach, golden hour, cinematic, 35mm, photorealistic, masterpiece

    After

    Candid photograph: a young woman with red hair walking along an empty beach at golden hour. Subject centered, looking away from camera. Photorealistic, 35mm film, shallow depth of field, warm natural light, subtle film grain.
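
The working order from this step can be sketched as a small prompt builder. This is a minimal illustration, not an official tool: the `ImagePrompt` class and its field names are hypothetical, and only the section order comes from the step above.

```python
from dataclasses import dataclass, field


@dataclass
class ImagePrompt:
    """Hypothetical helper: holds the sections of a structured brief."""
    scene: str
    subject: str
    details: str = ""
    style: str = ""
    lighting: str = ""
    exact_text: str = ""
    constraints: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Fixed order: scene, subject, details, style, lighting, text, constraints.
        parts = [self.scene, self.subject, self.details, self.style, self.lighting]
        if self.exact_text:
            # Literal text is wrapped in double quotes.
            parts.append(f'Text: "{self.exact_text}"')
        if self.constraints:
            parts.append("Constraints: " + ", ".join(self.constraints))
        # Drop empty sections and end every remaining section with one period.
        return " ".join(p.strip().rstrip(".") + "." for p in parts if p.strip())


prompt = ImagePrompt(
    scene="Empty beach at golden hour",
    subject="A young woman with red hair walking along the shoreline, looking away from camera",
    style="Candid photograph, photorealistic, 35mm film",
    lighting="Warm natural light, shallow depth of field, subject centered",
    constraints=["no text", "no watermark"],
)
print(prompt.render())
```

Keeping the order inside `render` means a forgotten section simply drops out instead of reshuffling the rest of the brief.
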
  2. Write a brief, not a description

    The single most useful trick: declare the purpose. Not "a nice product image" but "premium campaign image for streetwear brand Thread." Not "a UI screen" but "iPhone mockup for the onboarding flow of a fintech app." The purpose triggers the right template stack in the model: ads imply tight composition and tagline space; pitch-deck slides imply a grid and readable numbers; product shots imply a neutral backdrop and precise material lighting. With no declared purpose, the model guesses, and it guesses differently each time. This is the single most common reason the same prompt gives three different outputs in a row. Bonus: state the audience or use context ("for an investor deck", "for teen-audience social media") and the model adapts the visual tone to match.

    Before

    beautiful advertising image of a new smartphone

    After

    Premium product campaign image for "Aurora" smartphone (mid-range, target audience: 25-35 urban professionals). Hero shot on a neutral grey gradient background, soft three-point studio lighting, phone tilted 15° to show edge profile, subtle shadow. Tagline area on left third (reserve empty space). One integrated lifestyle cue: a faint, blurred coffee cup in the background.
  3. Exact text always in quotes

    GPT Image 2 is state of the art at rendering text inside images, which is its main win over Midjourney and Stable Diffusion. But if you don't wrap exact text in quotes, the model treats the words as scene description and routinely warps letters, adds extra characters, or changes letter case. The rule: anything that must appear literally goes inside `"..."` or in ALL CAPS. Specify the typeface (`bold sans-serif, Inter`), the size (`large headline`), the color, and the placement (`centered top third`). For rare words, brand names, or non-English spellings, spell them out in brackets. For dense or small text (chart legends, fine print) always set `quality="high"`: on `medium` and `low`, micro-text comes back with artifacts. Multilingual support: text can be in Cyrillic, Chinese, Japanese, Korean, Hindi, Bengali, or Arabic, and all of them render cleanly.

    Before

    billboard with text Fresh and Clean about a cleaning product, modern design

    After

    Outdoor billboard for a cleaning product brand. Billboard text (EXACT, verbatim, no extra characters, no logo drift): "Fresh and Clean". Typography: bold sans-serif, Inter, white on deep teal background, centered, large size. Below the tagline (smaller, 30% of headline size): "Available nationwide". Quality: high.
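
The quoting rule can be enforced with a tiny helper. This is a sketch under assumptions: the function names are hypothetical, the `"gpt-image-2"` model identifier is assumed rather than confirmed, and the `request` dictionary is a plain Python structure, not a call to any real client.

```python
def exact_text_clause(text: str, typeface: str, size: str, color: str, placement: str) -> str:
    """Wrap literal text in double quotes and attach the typography spec."""
    if '"' in text:
        # A quote inside the literal would end the quoted span early.
        raise ValueError("literal text must not contain double quotes")
    return (
        f'Text (EXACT, verbatim, no extra characters): "{text}". '
        f"Typography: {typeface}, {size}, {color}, {placement}."
    )


def choose_quality(has_dense_or_small_text: bool) -> str:
    """Use "high" only for dense or small text, "medium" otherwise."""
    return "high" if has_dense_or_small_text else "medium"


clause = exact_text_clause(
    "Fresh and Clean",
    typeface="bold sans-serif, Inter",
    size="large headline",
    color="white on deep teal",
    placement="centered top third",
)

request = {
    "model": "gpt-image-2",  # assumed identifier, check your provider's docs
    "prompt": "Outdoor billboard for a cleaning product brand. " + clause,
    "quality": choose_quality(has_dense_or_small_text=True),
}
print(request["prompt"])
```

Rejecting embedded quotes up front is a design choice: it fails loudly before a credit is spent rather than letting the model guess where the literal text ends.
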
  4. Change / Preserve / Constraints template for editing

    When you need to change one thing and keep everything else, the model drifts unless you give it an explicit preserve block: it shifts the face along with the outfit, the lighting along with the background, the camera angle along with the weather. The template for surgical edits: `Change: [what changes]` / `Preserve: [face, pose, lighting, angle, background, geometry, text, layout]` / `Constraints: [no extra objects, no redesign, no logo drift, no watermark]`. Its advantage is that it names everything that must stay fixed, so drift has nowhere to start. This is especially critical for virtual try-on (swapping clothing on a person), interior swaps (one piece of furniture for another), and weather or season changes. Repeat the preserve list on every iteration; otherwise, by the 3rd or 4th pass the model forgets the original identity constraints and gradually "redraws" the character.

    Before

    make her hair red

    After

    Change: hair color from brown to natural red (auburn).
    Preserve: face, facial features, skin tone, eye color, expression, pose, lighting direction, background, clothing, all other identity markers.
    Constraints: no extra objects, no redesign of any element except hair, no watermark, no logo drift.
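
One way to make sure the preserve list is repeated on every pass is to generate each edit prompt from the same function. A minimal sketch: `edit_prompt` and the constant names are hypothetical, and the default lists are copied from the template above.

```python
PRESERVE = ("face", "pose", "lighting", "angle", "background", "geometry", "text", "layout")
CONSTRAINTS = ("no extra objects", "no redesign", "no logo drift", "no watermark")


def edit_prompt(
    change: str,
    preserve: tuple[str, ...] = PRESERVE,
    constraints: tuple[str, ...] = CONSTRAINTS,
) -> str:
    """Build one self-contained edit prompt; the preserve block is always included."""
    return "\n".join([
        f"Change: {change.strip().rstrip('.')}.",
        f"Preserve: {', '.join(preserve)}.",
        f"Constraints: {', '.join(constraints)}.",
    ])


# Each pass changes one thing and restates the full preserve block.
passes = [
    "hair color from brown to natural red (auburn)",
    "jacket from denim to black leather",
]
prompts = [edit_prompt(change) for change in passes]
print(prompts[1])
```

Because every prompt is built from scratch, the second pass carries the same preserve and constraints lines as the first, which is the repetition this step asks for.
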
  5. Iterate, don't overload

    It's tempting to cram every requirement into one prompt: style, lighting, text, constraints, aspect ratio, identity preservation. Don't. The model can't hold 15 orthogonal requirements simultaneously, and one of them collapses (usually text or identity). The right workflow: clean base prompt → evaluate the output → targeted single-axis edit. Examples of single-axis edits: `make lighting warmer`, `remove the extra tree on the left`, `replace the typography with Inter bold`, `restore the original background`. This is far faster than rewriting from scratch. Use `quality="high"` only when you actually need it (dense text, close-up portraits, identity-sensitive editing); `medium` works for 80% of jobs and is 2-3× faster. Last note: GPT Image 2 does NOT understand Midjourney or Stable Diffusion syntax (`--ar 16:9`, `::`, `(keyword:1.2)`). Specify aspect ratio as an explicit pixel size and express weight in natural language ("emphasize the cat", "de-emphasize the background").
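
A quick pre-flight check can catch leftover syntax from other tools before a prompt is sent. This is a rough sketch: the regular expressions are my own approximations of the three patterns named in the last step, not an exhaustive grammar, and the function name is hypothetical.

```python
import re

# Approximate patterns for the syntax the step above says the model ignores.
LEGACY_SYNTAX = {
    "parameter flag such as --ar 16:9": re.compile(r"(?<!\S)--[a-z]+\b"),
    "multi-prompt separator ::": re.compile(r"::"),
    "weighted keyword such as (keyword:1.2)": re.compile(r"\([^()]+:\d+(?:\.\d+)?\)"),
}


def find_legacy_syntax(prompt: str) -> list[str]:
    """Return a label for every legacy pattern found in the prompt."""
    return [label for label, pattern in LEGACY_SYNTAX.items() if pattern.search(prompt)]


for candidate in (
    "a cat on a sofa --ar 16:9",
    "Landscape 1536x1024. Emphasize the cat, de-emphasize the background.",
):
    print(candidate, "->", find_legacy_syntax(candidate) or "clean")
```

A non-empty result is a hint to rewrite that part in plain language, for example an explicit pixel size instead of `--ar`.
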

FAQ

Why does the same prompt give different results in Midjourney and GPT Image 2?
Different engines, different habits. Midjourney was trained on aesthetic data and reads tag soup like `cinematic, 8k, octane render, masterpiece` as style cues. GPT Image 2 is a "thinking" model: it expects a structured brief with a declared purpose and processes text sequentially (important things first). In addition, GPT Image 2 has a neutral, calm default exposure, while Midjourney pulls toward bright and saturated. For the same idea, "moody coffee shop interior" is enough in Midjourney; GPT Image 2 wants "Atmospheric coffee shop interior at dusk. Subject: empty wooden bar table in foreground. Style: documentary realism, desaturated palette, no warming filters. Lighting: ambient indoor, single warm pendant light overhead. Camera: 35mm, eye-level, medium shot."
Can I ask GPT Image 2 to draw a specific actor or politician?
No. This is OpenAI policy, not a bug. The model blocks generation of recognizable public figures (actors, politicians, historical personalities past a certain era). The strict moderation layer also triggers on combinations of individually innocent words: `real person` + `young woman` + `bathroom` + `suggestive` almost guarantees a refusal, even though each word is fine on its own. Alternatives: use Midjourney or Nano Banana for recognizable faces (they filter too, but less aggressively), or, for editorial and fashion work where you only need a type, reword without attaching a real person ("editorial portrait of a woman in her 30s with red hair"). Don't try to slip past the filter with euphemisms: it is semantic, not keyword-based, and euphemisms only make the request look less trustworthy.
Why declare the purpose ("this is an ad", "this is a UI mockup") if I just want a nice picture?
The purpose activates the right processing mode in the model. Ads imply tight composition, tagline space, and a single focal point. Pitch-deck slides imply a structured grid and readable labels. Product shots imply a neutral background and precise material lighting. Documentary realism implies a desaturated palette with no auto-warming. Without a declared purpose, the model mixes all those modes at random — and the output drifts wildly between passes. "Beautiful coffee shop interior" can come back as ad photography one time, stock photo the next, illustration after that. The purpose gives the model an anchor and the shot becomes predictable.
What's the max resolution GPT Image 2 supports?
Technically up to 4K (3840×2160), but reliably only up to 2K (2560×1440). Above 2K is experimental: artifacts can creep in and render times grow significantly. Minimum: 655,360 total pixels (e.g. 1024×640). Both sides must be multiples of 16. Maximum long-to-short side ratio: 3:1 (no skinny 1×10 panoramas). Reliable common sizes: 1024×1024 (square), 1024×1536 (portrait), 1536×1024 (landscape), 2560×1440 (presentation-wide). For 4K and dense text always set `quality="high"`: on `medium` and `low`, fine details "float."
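
The size rules in this answer are easy to check before sending a request. A minimal sketch, assuming the limits exactly as stated above; treating 3840 px as the maximum for the longer side is my reading of the 4K figure, and the function name is hypothetical.

```python
MIN_PIXELS = 655_360   # minimum total pixel count stated above
MAX_EDGE = 3840        # assumed cap on the longer side, from the 4K figure
MAX_RATIO = 3          # maximum long-to-short side ratio (3:1)
STEP = 16              # both sides must be multiples of this


def size_problems(width: int, height: int) -> list[str]:
    """Return the rules a width x height pair breaks (empty list means OK)."""
    if width <= 0 or height <= 0:
        return ["sides must be positive"]
    problems = []
    if width % STEP or height % STEP:
        problems.append("both sides must be multiples of 16")
    if width * height < MIN_PIXELS:
        problems.append("below the 655,360-pixel minimum")
    if max(width, height) > MAX_EDGE:
        problems.append("longer side exceeds 3840 px")
    if max(width, height) > MAX_RATIO * min(width, height):
        problems.append("aspect ratio exceeds 3:1")
    return problems


for size in [(1024, 1024), (2560, 1440), (1000, 1000), (4096, 1024)]:
    print(size, size_problems(*size) or "ok")
```

Returning a list of problems rather than a single boolean makes it clear which rule to fix when a size breaks more than one.
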
How do I keep the face unchanged when editing a person?
Use an explicit preserve block and repeat it on every iteration. Template: `Change: [only what changes]` / `Preserve: face, facial features, skin tone, eye color, body shape, pose; do not alter identity in any way` / `Constraints: replace only the [clothing / background / lighting], no other changes`. For virtual try-on (swapping clothing), additionally lock pose, hair, and camera angle. Repeat the full preserve block on every subsequent iteration: by pass 3 or 4 the model forgets the original constraints and starts to "redraw" the character. The most common editing-flow mistake is assuming that "I already said don't change the face in the first prompt" still applies. Write each prompt as if the model sees only the current one, not the conversation history.


© 2026 Opten · IE Nikolai Shupletsov · Tax ID 306389672