GPT Image: how to write prompts the model actually understands
OpenAI · Updated:
GPT Image is OpenAI's image model family (1, 1.5, 2). It takes natural-language prompts and responds best to scene descriptions with concrete visual detail. It supports 1024×1024, 1536×1024, and 1024×1536 resolutions, transparent backgrounds, and three quality tiers. Its standout feature is rendering readable in-image text.
What GPT Image does well
The family's main strength is accurate in-image text: signs, menus, labels, UI mockups, posters. The model handles font, size, color, placement, and multilingual typography.
GPT Image works with natural language, not tags. It supports transparent backgrounds (via a dedicated parameter), three quality tiers (high/medium/low), and a wide stylistic range from photorealism to watercolor and concept art. OpenAI's content policy is among the strictest: NSFW content, real celebrities, and violence are blocked.
- Resolutions: 1024×1024, 1536×1024, 1024×1536
- Output formats: PNG, JPEG, WebP
- Transparency: via a dedicated parameter
- Three quality tiers: high / medium / low
- Top-tier in-image text rendering
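As a rough illustration of how these options fit together, the sketch below validates a set of request parameters against the values listed above and returns them as a plain dict. It makes no network call. The helper `build_image_request` is hypothetical, and the parameter names (`size`, `quality`, `output_format`, `background`) and the default model name `gpt-image-1` are assumptions to verify against the current API documentation.

```python
# Values as listed in this article; verify against current API documentation.
SIZES = ("1024x1024", "1536x1024", "1024x1536")
FORMATS = ("png", "jpeg", "webp")
QUALITIES = ("high", "medium", "low")


def build_image_request(prompt, size="1024x1024", quality="medium",
                        output_format="png", transparent=False,
                        model="gpt-image-1"):
    """Validate the options and return a plain dict of request parameters.

    Hypothetical helper: parameter names and the model name are assumptions.
    """
    if size not in SIZES:
        raise ValueError(f"unsupported size: {size}")
    if quality not in QUALITIES:
        raise ValueError(f"unsupported quality: {quality}")
    if output_format not in FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    if transparent and output_format == "jpeg":
        # JPEG has no alpha channel, so transparency needs PNG or WebP.
        raise ValueError("transparent background requires png or webp")

    params = {
        "model": model,
        "prompt": prompt,
        "size": size,
        "quality": quality,
        "output_format": output_format,
    }
    if transparent:
        params["background"] = "transparent"
    return params


params = build_image_request(
    "A sticker design of a red fox, flat colors, thick white outline",
    transparent=True,
)
print(params)
```

Assuming the official `openai` Python package, a dict like this would typically be passed as `client.images.generate(**params)`; that call is deliberately left out here.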
Prompt structure
General formula: [Visual medium] + [Subject] + [Environment/Scene] + [Lighting/Mood] + [Composition] + [Details] + [Constraints].
The core principle: describe the scene like a story, but with visual specificity. «A foggy mountain valley at dawn, golden light filtering through pine trees, reflected in a mirror-still lake» gives the model far more to work with than «a beautiful landscape».
Start with the visual medium: «photograph», «watercolor painting», «3D render», «technical illustration», «vintage poster». This sets the generation mode for the model.
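The formula above can be sketched as a small prompt builder. This is only one possible way to assemble the parts: `PromptParts` and `build_prompt` are hypothetical names, and joining the components as short sentences is a design choice made here, not something the model requires.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    """One field per slot of the general formula; only the first two are required."""
    medium: str
    subject: str
    environment: str = ""
    lighting: str = ""
    composition: str = ""
    details: str = ""
    constraints: str = ""


def build_prompt(parts: PromptParts) -> str:
    """Join the non-empty parts into sentences, starting with the visual medium."""
    opening = f"{parts.medium} of {parts.subject}"
    if parts.environment:
        opening += f" {parts.environment}"
    pieces = [opening, parts.lighting, parts.composition,
              parts.details, parts.constraints]
    # Drop empty slots and trailing periods, then capitalize each sentence.
    cleaned = [p.strip().rstrip(".") for p in pieces if p.strip()]
    return " ".join(p[0].upper() + p[1:] + "." for p in cleaned)


prompt = build_prompt(PromptParts(
    medium="a photograph",
    subject="a ginger tabby cat",
    environment="sitting on an old wooden windowsill",
    lighting="warm afternoon light filtering through lace curtains",
    composition="close-up portrait, shot on a 50mm lens with shallow depth of field",
    details="soft autumn garden visible through the window",
    constraints="no text or watermarks",
))
print(prompt)
```

Keeping the medium in the first slot mirrors the advice to open the prompt with it; the remaining slots can stay empty when a scene does not need them.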
Camera and lighting for photorealism
Camera terms work significantly better than generic quality phrases like «8K, ultra HD».
Lenses: 35mm, 50mm, 85mm, macro. Depth: shallow depth of field, bokeh, sharp focus. Angle: low angle, bird's eye view, eye level, Dutch angle. Shot type: candid, portrait, product shot, aerial.
For lighting, avoid the generic «good lighting». Use specifics: «dramatic side lighting creating strong shadows», «soft box lighting eliminating harsh shadows», «golden hour», «fluorescent overhead», «neon glow», «candlelight». The more precisely you describe the light, the more precisely the mood and atmosphere come through in the image.
In-image text
GPT Image is one of the best models for in-image text. Rules:
- Put the exact text in quotes: `"CAFE LUNA"`, `"OPEN 24/7"`.
- Specify the font style: «elegant handwriting», «bold sans-serif», «neon sign lettering».
- State the placement: «centered at the top», «on the wooden sign above the door».
- Spell complex or rare words letter by letter: `C-A-F-E L-U-N-A`.
For dense text (menus, infographics), set `quality="high"` in the request. At low or medium quality, small type can break up. Specify typeface, size, and color: the model uses these when rendering the text.
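The text rules can be captured in a few helpers that quote the exact wording, optionally spell it out letter by letter, and attach font style and placement. This is a minimal sketch; the function names and the sentence template are illustrative choices, not a required format.

```python
def quoted(text: str) -> str:
    """Wrap the exact wording in double quotes."""
    return f'"{text}"'


def spell_out(text: str) -> str:
    """Spell each word letter by letter, keeping spaces between words."""
    return " ".join("-".join(word) for word in text.split())


def text_instruction(text: str, font_style: str, placement: str,
                     spell: bool = False) -> str:
    """Build one sentence describing the in-image text for the prompt."""
    phrase = f"The text {quoted(text)} in {font_style}, {placement}"
    if spell:
        # Useful for rare or easily misspelled words.
        phrase += f", spelled {spell_out(text)}"
    return phrase + "."


print(spell_out("CAFE LUNA"))  # C-A-F-E L-U-N-A
print(text_instruction("CAFE LUNA", "neon sign lettering",
                       "on the wooden sign above the door", spell=True))
```

The resulting sentence can be appended to the scene description; the quality tier is still set separately on the request.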
Common mistakes
1. Abstract adjectives only
«Beautiful, amazing, stunning, gorgeous» give the model no visual information: no color, texture, material, or shape. Replace them with specifics: «weathered brick wall, warm afternoon light, shallow depth of field». Aim for at least 2-3 concrete details per scene.
2. Stable Diffusion syntax
Weights like `(word:1.5)`, comma-separated tags such as `1girl, masterpiece, best quality`, embeddings, and LoRA references are Stable Diffusion conventions. GPT Image works with natural language, not tags, so these constructions are ignored or degrade the output. Write sentences.
3. Quality boosters «8K, ultra HD, masterpiece»
Generic quality praise barely affects GPT Image. Concrete camera terms («85mm, shallow DOF, golden hour»), style references («editorial photography», «watercolor illustration»), and lighting descriptions are far more effective than any stack of quality keywords.
4. Missing visual medium
If you don't say whether the image is a photograph, an illustration, or a 3D render, the decision is left to the model and the output becomes unpredictable. Start the prompt with a medium: «photograph», «watercolor painting», «3D render», «technical illustration», «vintage poster», «sticker design». This sets the generation mode.
5. Conflicting styles in one prompt
Pairings like «photorealistic cartoon», «minimalist detailed», or «realistic stylized» conflict unless you explain how they should combine. The model can't reconcile mutually exclusive instructions. If a stylistic blend is needed, describe it explicitly: «realistic rendering with subtle anime-inspired proportions».
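The five mistakes above lend themselves to a quick pre-flight check. The sketch below is a deliberately naive linter: the word lists are taken from the examples in this article, the medium check is a simple substring heuristic, and `lint_prompt` is a hypothetical helper rather than part of any official tooling. Expect false positives and misses.

```python
import re

# Heuristic lists drawn from the examples in this article; extend as needed.
WEIGHT_RE = re.compile(r"\([^()]+:\d+(?:\.\d+)?\)")  # matches e.g. (word:1.5)
VAGUE = ("beautiful", "amazing", "stunning", "gorgeous")
SD_TAGS = ("1girl", "masterpiece", "best quality")
BOOSTERS = ("8k", "ultra hd", "masterpiece")
MEDIUM_HINTS = ("photo", "painting", "illustration", "render", "poster", "sticker")
CONFLICTS = (("photorealistic", "cartoon"),
             ("minimalist", "detailed"),
             ("realistic", "stylized"))


def has_phrase(text: str, phrase: str) -> bool:
    """True if phrase occurs in lowercased text as a whole word or word sequence."""
    pattern = rf"(?<![a-z0-9]){re.escape(phrase)}(?![a-z0-9])"
    return re.search(pattern, text) is not None


def lint_prompt(prompt: str) -> list[str]:
    """Return one warning per mistake the prompt appears to make."""
    text = prompt.lower()
    warnings = []
    if any(has_phrase(text, w) for w in VAGUE):
        warnings.append("abstract adjectives: replace with concrete details")
    if WEIGHT_RE.search(prompt) or any(has_phrase(text, t) for t in SD_TAGS):
        warnings.append("Stable Diffusion syntax: write sentences instead")
    if any(has_phrase(text, b) for b in BOOSTERS):
        warnings.append("quality boosters: use camera and lighting terms")
    if not any(hint in text for hint in MEDIUM_HINTS):
        warnings.append("missing visual medium: start with one")
    if any(has_phrase(text, a) and has_phrase(text, b) for a, b in CONFLICTS):
        warnings.append("conflicting styles: explain the blend explicitly")
    return warnings


for warning in lint_prompt("masterpiece, best quality, 8K, 1girl, beautiful, dress, garden"):
    print(warning)
```

A check like this only catches surface patterns; a prompt can pass every rule and still be too thin on visual detail.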
Before / after examples
Example 1
Before
beautiful ginger cat
After
A close-up portrait of a ginger tabby cat sitting on an old wooden windowsill, warm afternoon light filtering through lace curtains. Soft autumn garden visible through the window in soft bokeh. Shot on 50mm lens, shallow depth of field, photorealistic, muted warm palette.
Key change: visual specificity instead of a generic adjective. Concrete environment, camera terms, lighting, medium.
Example 2
Before
café with a menu
After
A chalkboard café menu mounted on an exposed brick wall, listing "Espresso $3", "Flat White $4.50", and "Lavender Latte $5" in elegant white chalk handwriting. Warm pendant lighting from above, shallow depth of field, blurred coffee shop interior in the background. Editorial café photography.
Exact text in quotes, a specific lettering style, placement, lighting. Set `quality="high"` as a request parameter rather than writing it into the prompt; for clean small text it is mandatory.
Example 3
Before
masterpiece, best quality, 8K, ultra HD, hyper-realistic, 1girl, beautiful, dress, garden
After
A young woman in her twenties wearing a flowing pale yellow linen dress, walking through a sunlit cottage garden in early summer. Soft natural light, golden hour warmth, shallow depth of field. Shot on 85mm lens at f/1.8, candid documentary style, subtle film grain, muted earthy palette.
Stable Diffusion style (comma-separated tags, quality boosters, `1girl`) is ignored or handled poorly by GPT Image. A coherent description with camera terms hits the target.