GPT Image 1: how to write prompts the model actually understands
OpenAI
GPT Image 1 is an OpenAI image model with natural-language prompting and strong in-image text rendering. It runs in ChatGPT and via the API, and supports resolutions up to 1536×1024, transparent backgrounds, three quality tiers, and image-to-image editing. Prompts of up to about 500 words work best.
What GPT Image 1 does well
The main strengths are accurate, readable in-image text (signs, menus, labels, UI mockups), high prompt adherence, photorealism driven by camera terms, and built-in transparent background support (ideal for stickers and assets).
In ChatGPT the model uses multi-turn context, so images can be refined iteratively in a single conversation. In the API every request is independent: no context carries over between calls. Image-to-image editing is supported via a dedicated endpoint.
- Resolutions: 1024×1024, 1536×1024, 1024×1536
- Formats: PNG, JPEG, WebP, plus a dedicated transparency parameter
- Quality: high / medium / low
- Image-to-image editing via API
- Prompt length: up to ~4000 tokens, best results up to about 500 words
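The options in this list can be checked before a request is sent. The sketch below is a hypothetical helper, `build_image_request`, that validates them and returns a plain dictionary. The key names (`size`, `quality`, `output_format`, `background`) are assumed to match the parameters of the OpenAI Images API; verify them against the current API reference before relying on them.

```python
# Hypothetical helper: validates the options listed above.
# Key names are assumed to mirror the OpenAI Images API parameters.
ALLOWED_SIZES = {"1024x1024", "1536x1024", "1024x1536"}
ALLOWED_QUALITIES = {"low", "medium", "high"}
ALLOWED_FORMATS = {"png", "jpeg", "webp"}


def build_image_request(prompt, size="1024x1024", quality="medium",
                        output_format="png", transparent=False):
    """Return keyword arguments for an image request, or raise ValueError."""
    if size not in ALLOWED_SIZES:
        raise ValueError(f"unsupported size: {size}")
    if quality not in ALLOWED_QUALITIES:
        raise ValueError(f"unsupported quality: {quality}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {output_format}")
    if transparent and output_format == "jpeg":
        # JPEG has no alpha channel, so transparency needs PNG or WebP.
        raise ValueError("transparent background requires png or webp")

    request = {
        "model": "gpt-image-1",
        "prompt": prompt,
        "size": size,
        "quality": quality,
        "output_format": output_format,
    }
    if transparent:
        request["background"] = "transparent"
    return request
```

With the official `openai` Python package the result would presumably be passed as `client.images.generate(**build_image_request(...))`. That call is not made here, so the helper runs without network access.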
Prompt structure
Layered formula: [Visual medium/Style] + [Subject] + [Environment/Scene] + [Lighting/Mood] + [Composition/Angle] + [Details and textures] + [Constraints/Exclusions].
The model understands natural language, so no tags or special syntax are needed. Describe the scene like a story, but with concrete visual details.
Specificity is the main rule. «A foggy mountain valley at dawn, golden light filtering through pine trees, reflected in a mirror-still lake» works far better than «a beautiful landscape». Include at least 2-3 descriptive details per scene: color, texture, material, shape.
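One way to keep the layered formula in view while writing is to fill the layers in separately and join them afterwards. The sketch below is illustrative only: `PromptLayers` and its field names are hypothetical, chosen to mirror the formula above, and nothing about the model requires this structure.

```python
from dataclasses import dataclass


@dataclass
class PromptLayers:
    """One field per layer of the formula; only style and subject are required."""
    style: str
    subject: str
    environment: str = ""
    lighting: str = ""
    composition: str = ""
    details: str = ""
    constraints: str = ""

    def render(self) -> str:
        """Join the non-empty layers into plain sentences, in formula order."""
        layers = [self.style, self.subject, self.environment, self.lighting,
                  self.composition, self.details, self.constraints]
        sentences = []
        for layer in layers:
            text = layer.strip().rstrip(".")
            if text:
                # Capitalize the first letter so each layer reads as a sentence.
                sentences.append(text[0].upper() + text[1:])
        return ". ".join(sentences) + "."


prompt = PromptLayers(
    style="Landscape photograph",
    subject="a foggy mountain valley at dawn",
    lighting="golden light filtering through pine trees",
    details="reflected in a mirror-still lake",
).render()
```

Empty layers are skipped, so a draft makes it obvious which parts of the scene (environment, composition, constraints) have not been described yet.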
Camera and photorealism
Camera terms work significantly better than generic boosters like «8K, ultra-detailed».
Shot size: close-up, medium shot, wide angle, aerial view. Lenses: 50mm, 35mm, macro, fisheye. Focus: shallow depth of field, bokeh, sharp focus throughout. Angle: low angle, bird's eye view, eye level, Dutch angle.
For lighting avoid generic «good lighting». Use specifics: «dramatic side lighting creating strong shadows», «soft box lighting eliminating harsh shadows», «golden hour», «fluorescent overhead», «neon glow», «candlelight». The more precise the light, the more precise the mood.
In-image text and iterative work
GPT Image 1 is top-tier for in-image text. Always put the exact text in quotes or CAPS: `"OPEN 24/7"`, `"CAFE LUNA"`. Specify the font style («elegant handwriting», «bold sans-serif», «neon sign lettering»), size, color, and placement. For difficult words (brands, rare spellings), spell them out letter by letter: `C-A-F-E L-U-N-A`.
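The text rules above are mechanical enough to script. The following sketch uses two hypothetical helpers: `spell_out` produces the letter-by-letter form, and `sign_text_clause` assembles a quoted text instruction with font, color, and placement. The sentence template is an assumption, not a required wording.

```python
def spell_out(text: str) -> str:
    """Hyphenate the letters of each word, keeping spaces between words."""
    return " ".join("-".join(word) for word in text.split())


def sign_text_clause(text, font_style, color, placement, spell=False):
    """Build one sentence that states the exact text in quotes plus its styling."""
    clause = f'The text reads "{text}" in {color} {font_style} lettering, {placement}'
    if spell:
        # Add the letter-by-letter spelling for brands or rare words.
        clause += f", spelled {spell_out(text)}"
    return clause + "."


clause = sign_text_clause("CAFE LUNA", "neon sign", "pink",
                          "above the entrance", spell=True)
```

The resulting sentence can be dropped into a longer scene description as-is.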
In ChatGPT use an iterative approach. Start with a base prompt, then refine in small steps: «Same scene, but make the lighting warmer», «Add a person sitting on the bench on the left», «Remove the tree in the background». A series of precise edits beats one overloaded prompt.
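Because API requests do not share context, the same step-by-step approach has to be reproduced by feeding each result back in as the next input. The sketch below is one possible way to do that, under several assumptions: `client` is an `OpenAI()` instance from the official `openai` package, the editing endpoint is exposed as `client.images.edit`, and the image comes back base64-encoded in `data[0].b64_json`. The function name `refine_in_steps` and the output file names are hypothetical.

```python
import base64
from pathlib import Path


def refine_in_steps(client, source_path, steps, out_dir):
    """Apply each instruction to the previous result; return the saved paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    current = Path(source_path)
    saved = []
    for number, instruction in enumerate(steps, start=1):
        with current.open("rb") as image_file:
            # Assumed call shape for the image-to-image editing endpoint.
            result = client.images.edit(
                model="gpt-image-1",
                image=image_file,
                prompt=instruction,
            )
        # Assumed response shape: base64 image data in data[0].b64_json.
        current = out_dir / f"step_{number}.png"
        current.write_bytes(base64.b64decode(result.data[0].b64_json))
        saved.append(current)
    return saved
```

A call might look like `refine_in_steps(client, "base.png", ["Make the lighting warmer", "Add a person sitting on the bench on the left"], "out")`. Since the API keeps no conversation history, each instruction should name what it changes without relying on earlier messages.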
Common mistakes
1. Stable Diffusion syntax
Weights like `(word:1.5)` or `(masterpiece:1.3)`, comma-separated tags such as `1girl, masterpiece, best quality`, embeddings, and LoRA references do not work: GPT Image 1 reads natural language, not tags. These constructions either end up in the prompt as literal noise or degrade the output.
2. Quality boosters «8K, ultra HD, masterpiece»
Generic quality praise barely affects GPT Image 1. Concrete camera terms («85mm at f/1.8», «shallow DOF», «golden hour»), style references, and lighting descriptions work far better than any quality stack.
3. Missing environment
«A red sports car» and «a red sports car on an empty desert highway with mountains on the horizon» produce dramatically different results. Without context the model fills in the scene on its own, and the output is unpredictable. Even a minimal background description noticeably improves the frame.
4. Conflicting styles in one prompt
«Photorealistic cartoon», «minimalist detailed», «realistic stylized»: these pairs conflict unless you explain how the styles should combine, and the model cannot decide what to prioritize. If a stylistic blend is needed, describe it explicitly: «realistic photography with subtle painterly post-processing».
5. Negatives without a positive alternative
«Don't draw background», «no people, no text, no clutter» are less effective than positive descriptions. «Transparent background» beats «no background». «Clean composition» beats «no clutter». Describe what you want, not what you don't.
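Three of these mistakes (Stable Diffusion syntax, quality boosters, and negative phrasing) leave textual traces that a rough check can catch before a prompt is sent. The sketch below is a heuristic only: the function `lint_prompt`, the patterns, and the warning labels are all hypothetical, and the word lists are far from exhaustive. Missing environment and conflicting styles are not detected.

```python
import re

# Parenthetical weights such as "(masterpiece:1.3)".
SD_WEIGHT = re.compile(r"\([^()]*:\s*\d+(?:\.\d+)?\)")
# Booru-style subject tags such as "1girl" or "2boys".
SD_TAG = re.compile(r"\b\d(?:girl|boy)s?\b", re.IGNORECASE)
# A small, non-exhaustive list of generic quality boosters.
BOOSTER = re.compile(
    r"\b(?:8k|4k|ultra[- ]?hd|ultra[- ]detailed|masterpiece|best quality)\b",
    re.IGNORECASE,
)
# Negative instructions such as "no people" or "don't draw".
NEGATIVE = re.compile(r"\b(?:no|without|don't|do not)\s+\w+", re.IGNORECASE)


def lint_prompt(prompt):
    """Return a list of warning labels for patterns the guide advises against."""
    warnings = []
    if SD_WEIGHT.search(prompt) or SD_TAG.search(prompt):
        warnings.append("stable-diffusion-syntax")
    if BOOSTER.search(prompt):
        warnings.append("quality-booster")
    if NEGATIVE.search(prompt):
        warnings.append("negative-phrasing")
    return warnings
```

A flagged prompt is not necessarily wrong. The warnings are a cue to rephrase in natural language, replace boosters with camera or lighting terms, or state the desired result positively.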
Before / after examples
Example 1
Before
beautiful portrait
After
Editorial portrait of a woman in her thirties with freckles and short auburn hair, wearing a cream-colored cashmere sweater. Soft natural light from a north-facing window, calm contemplative expression, shallow depth of field. Shot on 85mm lens at f/1.8, subtle film grain, muted warm palette, fashion editorial style.
Concrete subject, appearance details, specific lighting, camera terms, style reference. «Beautiful» is an empty word.
Example 2
Before
café sign on an old brick wall
After
A weathered metal café sign mounted on a red brick wall in a 1920s Brooklyn neighborhood. The sign reads "BREW & BEAN" in bold cream-colored sans-serif lettering with a small coffee cup icon. Warm afternoon light catches the metal, soft shadows on the brick. Documentary photography, shallow depth of field, muted warm palette.
Exact text in quotes, specific font and color, era, surface material, lighting type. Without this the model invents all details.
Example 3
Before
(masterpiece:1.5), (best quality:1.3), 1girl, blue dress, beautiful, garden, photorealistic, 8k
After
A young woman in her twenties wearing a flowing pale blue linen dress, walking through a sunlit cottage garden in early summer. Soft natural light, golden hour warmth, shallow depth of field. Shot on 85mm lens at f/1.8, candid documentary style, subtle film grain.
Parenthetical weights `(word:1.5)` and comma-separated tags are Stable Diffusion syntax. GPT Image 1 doesn't support them. A coherent description with camera terms hits the target.
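When migrating an old tag-style prompt, it can help to first strip the syntax and see which tags carry actual content. The sketch below does only that: the helper `content_tags` and the `EMPTY_TAGS` list are hypothetical, and the rewrite into a coherent description still has to be done by hand.

```python
import re

# Tags that add no visual information (illustrative, not exhaustive).
EMPTY_TAGS = {"masterpiece", "best quality", "beautiful", "8k", "4k",
              "ultra hd", "ultra-detailed"}


def content_tags(sd_prompt):
    """Unwrap weights, drop empty boosters, and return the remaining tags."""
    tags = []
    for raw in sd_prompt.split(","):
        tag = raw.strip()
        # Unwrap a weighted tag: "(word:1.5)" becomes "word".
        match = re.fullmatch(r"\((.+?):\s*[\d.]+\)", tag)
        if match:
            tag = match.group(1).strip()
        if tag and tag.lower() not in EMPTY_TAGS:
            tags.append(tag)
    return tags


remaining = content_tags(
    "(masterpiece:1.5), (best quality:1.3), 1girl, blue dress, "
    "beautiful, garden, photorealistic, 8k"
)
```

For the «Before» prompt of Example 3 this leaves the subject, the dress, the garden, and the photorealistic style: the raw material that the «After» version expands with age, fabric, season, lighting, and lens.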