
AI image generation is closer to directing a scene than writing a prompt

I tried AgnesAI after watching the demo at E27 Echelon. Here is what made the difference between images that looked AI-generated and ones that held together.

June 5, 2026 · #ai #building #tools #content

A grounded tabletop scene: cricket ball, bat, notebook, laptop, warm light. Scene-directed, object-led, no humans.

Yesterday at E27 Echelon, the AgnesAI team ran a live demo. I watched from the audience and immediately wanted to try it.

The first images I generated looked exactly like what people complain about when they call something “AI-generated.”

Too loud. Too artificial. Humans with visible defects. Fake text screaming from screens and sticky notes. Every image announcing itself as AI.

That was with direct generation. Describe a subject, get an image. It is the obvious first approach, and it produced the obvious result.

What changed when I tried a different workflow

Later that day, I generated a version of the same concept with Codex assisting the direction. The output felt noticeably more grounded. Not perfect, but believable.

That made me ask a different question. Was the difference in the model, or in how I was directing the image?

Before generating, I asked Claude to give me five possible scenes. Not prompts. Scenes.

Each one described:

The actual setting, not just the topic
Camera angle
Lighting
Mood
Specific objects that should appear
What should stay in focus
What should be avoided entirely
Whether humans should appear at all

That last decision matters more than most people realize. Faces and hands are still the clearest place where AI defects show up. Unless an image genuinely needs a human, it is usually better to design around them. Object-led scenes hold together: desks, notebooks, paper, tools, gear, warm light, coffee, process artifacts.

When I picked the best scene from the five options and generated from that, the results were meaningfully different.

The text artifact comparison

The clearest place to see the difference is text.

The same concept, two different approaches. Direct generation produced loud, garbled text on sticky notes and screens. The scene-directed version treated text as background texture.

Image: AgnesAI direct generation. Text on sticky notes and the laptop screen is loud and visibly garbled.

Image: Codex-directed, scene-directed version. Text stays soft and blurred, reading as paper texture rather than fake words.

When text is visible but wrong, your brain catches the artifact immediately. When text is blurred and ambient, the image holds together.

The Cricket OS example

I had been generating images for a Cricket OS concept: a shared operating system for a cricket team, pulling match data, player history, and coaching decisions into one structured layer.

The first prompt was something like “AI cricket system dashboard.”

The result looked exactly like that. Visual noise. Garbled text. Data panels designed to look futuristic. Nothing you would actually use.

Image: early Cricket OS generation. The dashboard prompt produced a dashboard: loud, with garbled text and unreadable panels.

The image changed significantly when I shifted to a grounded scene brief: cricket ball, bat, notebook, laptop, warm light, and working notes scattered across a table. Something that looked like a real captain’s session rather than a generated UI concept.

Image: better Cricket OS image. The tabletop brief produced something believable: object-led, no humans, warm light, no dashboard aesthetic.

It felt like a real moment. Not a generated poster.

What I now put in every scene brief

The five-scene approach forces you to explore multiple framings before committing to generation. Sometimes the third option is obviously better than what you would have prompted directly. Sometimes it shows you that none of the obvious angles work, which saves a dozen failed generations.

For each scene I now specify:

Where the camera is pointed
What the light is doing
What stays out of frame
Whether humans are necessary
Whether text should be readable or just visual texture
What mood the image should carry

When you work through those decisions, the prompt almost writes itself. You end up with a direction brief, not a keyword list.
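
One way to keep those decisions explicit is to hold them in a small structure and render the brief from it. What follows is a minimal Python sketch, not part of any tool's API: the `SceneBrief` class, its field names, and the example values are all assumptions made for illustration. The rendered string is just text you would paste into whichever image model you use.

```python
from dataclasses import dataclass, field


@dataclass
class SceneBrief:
    """Hypothetical container for the scene decisions listed above."""
    setting: str
    camera: str
    lighting: str
    mood: str
    objects: list[str] = field(default_factory=list)
    out_of_frame: list[str] = field(default_factory=list)
    humans: bool = False          # design around people unless the image needs them
    readable_text: bool = False   # False means text is treated as blurred texture

    def render(self) -> str:
        """Turn the brief into a plain-language direction paragraph."""
        lines = [
            f"Setting: {self.setting}.",
            f"Camera: {self.camera}.",
            f"Lighting: {self.lighting}.",
            f"Mood: {self.mood}.",
        ]
        if self.objects:
            lines.append("Objects in frame: " + ", ".join(self.objects) + ".")
        if self.out_of_frame:
            lines.append("Keep out of frame: " + ", ".join(self.out_of_frame) + ".")
        lines.append("Humans: include." if self.humans else "No humans, faces, or hands.")
        if self.readable_text:
            lines.append("Text: legible where it appears.")
        else:
            lines.append(
                "Text on screens and papers should be blurred, "
                "treated as background detail, not legible."
            )
        return "\n".join(lines)


# Example values loosely based on the tabletop scene described earlier (illustrative only).
brief = SceneBrief(
    setting="a table after a captain's session, working notes scattered around",
    camera="slightly above the table, looking down at an angle",
    lighting="warm light",
    mood="grounded, a real moment",
    objects=["cricket ball", "bat", "notebook", "laptop"],
    out_of_frame=["dashboards", "futuristic data panels"],
)
print(brief.render())
```

Writing five of these for one concept makes the framings easy to compare before spending any generations on them.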

Model behavior still shows through

This is where scene direction has limits.

AgnesAI and Codex handle text, composition, and light differently by default. Those defaults show through even when the prompt is strong.

I now explicitly include in scene briefs: “text on screens and papers should be blurred, treated as background detail, not legible.” That instruction changes the output noticeably in some models and does almost nothing in others. Which tells you something about where the defaults sit.
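
A quick way to find out where a given model's defaults sit is to run the same brief twice, once with the text instruction and once without, and compare the outputs side by side. The sketch below only builds the two prompt variants; it does not call any model. The function name `text_ab_variants` and the sample brief are assumptions for illustration.

```python
TEXT_INSTRUCTION = (
    "Text on screens and papers should be blurred, "
    "treated as background detail, not legible."
)


def text_ab_variants(base_brief: str) -> dict[str, str]:
    """Return the same brief with and without the text instruction."""
    base = base_brief.strip()
    return {
        "without_instruction": base,
        "with_instruction": f"{base} {TEXT_INSTRUCTION}",
    }


# Illustrative brief; generate both variants in each model you want to compare.
variants = text_ab_variants(
    "Tabletop scene: cricket ball, bat, notebook, laptop, warm light, no humans."
)
for label, prompt in variants.items():
    print(f"[{label}] {prompt}")
```

If the two outputs from a model look the same, the instruction is probably not moving that model's default handling of text.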

The bigger variable is still direction. Most prompts I see describe a topic, not a scene. That is where the obvious AI aesthetic comes from.

But model behavior is a real second variable, and worth knowing before you start.

Curious what others are doing: which image generation models are you using for production content right now, and what direction techniques help you get outputs that feel more grounded?