Find the ceiling before you set the floor
A founder sits down to scope a new agentic feature. They know their customers, their domain, what would justify the build. They open with the questions that have always worked.
What's it going to cost? How good will it be?
They expect numbers. They get conditionals.
It depends on how much we let the agent do. It depends on what you'd accept as good enough. It depends on which tools we wire up. It depends on how often it runs.
Those answers aren't wrong. They're honest. They're also useless for someone trying to decide whether to fund something.
The questions are right. The substrate is different. The mental model that makes "what will it cost?" answerable for traditional software doesn't fit agentic systems...yet. And no one wants to fund a thing whose cost is "depends."
This post is one place to start.
The triangle that doesn't quite fit
For traditional software, cost and quality are mostly settled at build time. You write the code, you ship the binary, and once it's running, the per-user cost is largely fixed. Quality is a property of the build.
Agentic systems are different. The agent decides at runtime what to do, which tools to call, how many cycles to run, when to stop. Quality and cost are both functions of how much you let it do. Neither is settled by the build. Both are runtime properties shaped by the envelope you set.
That's the category shift. The old questions are right. The way they're answerable is new. We've written before about how AI lowered the floor and raised the ceiling on what people can build at all. This post is about a narrower ceiling: the one inside a single agentic feature.
The first move: run it once, all the way out
Stop trying to answer "what will it cost? how good?" in the abstract. Pick one realistic task. Run the agent on that task with no budget cap. Maximum iterations, every relevant tool, full reasoning depth, however long it takes.
Save the output. Save the cost. Save the trace of what the agent did to produce both.
That run is the ceiling probe. It tells you the best output your current architecture can produce, what it cost to produce, and what the agent actually did along the way.
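Here's a minimal sketch of what a probe harness could look like, in Python. Everything in it is hypothetical: `Envelope`, `run_agent`, and the result fields are stand-ins for whatever your agent framework actually exposes, not a real library's API.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical envelope for one agent run. Field names are illustrative;
# None means "uncapped".
@dataclass
class Envelope:
    max_minutes: float | None = None
    max_iterations: int | None = None
    max_sources: int | None = None
    budget_usd: float | None = None
    reasoning_depth: str = "full"

def ceiling_probe(task: str, run_agent) -> dict:
    """Run one realistic task with no caps; save output, cost, and trace.

    `run_agent(task, envelope)` is assumed to be your framework's entry
    point, returning an object with .output, .cost_usd, and .trace.
    """
    result = run_agent(task, Envelope())  # all caps off: the ceiling run

    artifact = {
        "task": task,
        "output": result.output,      # the best your current architecture produces
        "cost_usd": result.cost_usd,  # what producing it actually cost
        "trace": result.trace,        # what the agent did along the way
        "envelope": asdict(Envelope()),
    }
    with open("ceiling_probe.json", "w") as f:
        json.dump(artifact, f, indent=2, default=str)
    return artifact
```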
Take an agentic content writer as an example. Constrained: two minutes, five sources, single-pass draft, costs cents. Unconstrained: 30 minutes, 50+ sources, multiple draft-and-critique cycles, a fact-check pass, a polish pass, costs around $20. The gap between those two outputs is the operating space you're choosing within. You can't choose well without seeing both ends.
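In terms of the sketch above, the two ends of that space are just two envelopes. The dollar figure on the constrained side is illustrative, not prescriptive:

```python
# The two ends of the example, as envelopes (numbers illustrative).
constrained = Envelope(
    max_minutes=2,
    max_sources=5,
    max_iterations=1,       # single-pass draft
    budget_usd=0.50,        # "costs cents"; the exact cap is yours to pick
    reasoning_depth="shallow",
)
unconstrained = Envelope()  # no caps; the 30 minutes is what it took, not a limit
```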
A ceiling probe is bounded, not open-ended. One task, one execution, you stop when it stops. It typically runs 5 to 20 times the cost of a single constrained run. For most tasks that lands somewhere around $20. It's a one-time investment per task type, not recurring. Compared to building the wrong thing for six weeks, it's the cheapest piece of information you'll ever buy.
Now the business questions become answerable
Before the ceiling probe, "what will it cost?" was a vibe. After, it's a range with two real endpoints: anywhere between $X (constrained) and $Y (ceiling). Where on that range do you want to operate?
Before, "how good will it be?" was conditional. After, it's a comparison. Here's the ceiling output. Here's the constrained output. Pick where you'd accept landing.
The conversation shifts from open-ended to bounded. From "how much?" to "how much of this to land there?" That's the question a founder can actually answer. Without ceiling and floor, they were being asked to approve an unbounded cost for an unbounded outcome. With them, the question is finally fair.
Your job is calibration, not approval
PMs and founders managing agentic products instinctively reach for the mental model from traditional software: approve a budget, ship the build, watch the metrics. That model doesn't fit. There's no single budget to approve. There's a band the agent operates inside, and the band has endpoints you have to find first.
Your job is calibrating that band, not approving a number.
Once you have ceiling data, plot quality against spend at a few intermediate budgets. Sometimes the curve is steep: small budget cuts cost you a lot of quality. Sometimes it's flat: you can cut 80% of the cost for a 5% quality loss. The shape is specific to each task; don't generalize across tasks.
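Continuing the hypothetical harness from above, one way to trace that curve is to sweep the budget cap and score each run. The scoring function is the hard part and entirely domain-specific; `score` below is a placeholder, not a thing any framework gives you.

```python
def calibration_curve(task: str, run_agent, score, budgets=(1, 2, 5, 10, 20)):
    """Run the same task at several budget caps; record quality vs. spend.

    `score(output) -> float` is your quality measure: a human rubric,
    an eval suite, an LLM judge. Nothing here defines "good" for you.
    """
    points = []
    for budget in budgets:
        result = run_agent(task, Envelope(budget_usd=budget))
        points.append({
            "budget_usd": budget,
            "actual_cost_usd": result.cost_usd,
            "quality": score(result.output),
        })
    return points  # plot quality against actual cost and read the slope
```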
The right question stops being "is this the cheapest acceptable config?" It becomes "is this the right point on the curve for our business?" Those are different conversations. The second one is the one founders should be in.
The real payoff: the ceiling probe trains the agent
Here's the part that matters most.
Most teams treat a ceiling probe as one-off calibration. Run it, look at the curve, pick a point, ship. The real leverage is the loop (sketched in code after the list):
- Run unconstrained. Capture the trace, not just the output.
- A human reviewer (someone who knows what good looks like in the domain) reads the high-quality output and notices what made it good. The patterns. The order of operations. The specific moves the agent made that worked.
- Those patterns become rules and few-shot examples the planner uses on future constrained runs.
- The constrained agent now optimizes for the patterns that mattered, not arbitrary ones.
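A sketch of what that loop might look like, continuing the same hypothetical harness. The review step is deliberately human; `reviewer_notes` stands in for a person writing down the patterns that made the ceiling output good.

```python
def mine_ceiling_run(artifact: dict, reviewer_notes: list[str], store: list) -> None:
    """Fold one reviewed ceiling run into the planner's guidance store.

    `artifact` is the saved probe (output + cost + trace); `reviewer_notes`
    are the human-identified patterns: the moves, the ordering, the checks.
    """
    store.append({
        "task": artifact["task"],
        "exemplar": artifact["output"],  # few-shot example of "good"
        "rules": reviewer_notes,         # distilled planner rules
    })

def constrained_planner_prompt(task: str, store: list) -> str:
    """Build the planner prompt for a constrained run from mined guidance."""
    rules = [rule for entry in store for rule in entry["rules"]]
    shots = [entry["exemplar"] for entry in store[-2:]]  # most recent exemplars
    return (
        f"Task: {task}\n\n"
        "Follow these patterns from prior high-quality runs:\n- "
        + "\n- ".join(rules)
        + "\n\nExamples of the target quality:\n\n"
        + "\n---\n".join(shots)
    )
```

The design choice that matters: the store grows with every probe, so each constrained run plans with more of the ceiling's patterns in context than the run before it.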
The result: 80% of ceiling quality at 20% of cost. But only because someone mined the ceiling for what made it good. The ceiling pulls the floor up. And it keeps pulling, because every probe produces new exemplars.
If you're a founder, you don't have to build this loop yourself. You have to know to ask for it. Without the loop, you're paying for ceiling runs you only use once. With it, every probe makes the constrained system better than the one before.
And sometimes, the ceiling tells you to stop
There's one more thing the probe does that nothing else can.
A good product brief tells you what's worth building. A ceiling probe tells you whether your current agent can build it. If the unconstrained output is great, you have a feature worth optimizing toward. If the unconstrained output is mediocre, no budget will rescue it. That's not a tuning problem: it's an architecture problem, and it usually means redesigning the agent or killing the feature.
A mediocre ceiling on a $20 test is the cheapest "no" you'll ever buy. It can save you a quarter's worth of build effort that would have hit a wall anyway. This is the most directly business-relevant use of the probe: before you fund the build, you get to see whether the build can produce the thing you want it to produce. That used to be a question you could only answer six weeks in.
Where this leaves you
If you're staring at a new agentic feature and you're not sure how to scope it, you don't have a worse playbook than your peers. You have no playbook. Most people don't, yet. The category is new enough that the standard PM moves haven't fully formed.
The first move is small, cheap, and concrete. Pick a realistic task. Run the agent on it with no constraints. Spend the $20. Get the ceiling.
After that, every other question gets a real frame. Cost, quality, scope, viability. The conversation you couldn't have before becomes the one you've been wanting to have.
Find the ceiling before you set the floor.
Next post: the engineering side. How to instrument the agentic lifecycle, capture useful traces, and turn ceiling outputs into planner heuristics that compound. For the team building the loop.