10 Measure & improve · Pillar

You feel like the agent is getting better. How accurate is that feeling?

Without a number, there is no improvement — only a shifting feeling. And a feeling cannot tell the difference between 'actually getting better' and 'forgetting the last few failures'.

Read: 4 min
Topics: evaluation · metrics · improvement · measurement
TL;DR

"The agent feels solid lately" is a metric — but a metric of your memory, not of reality. Human memory is biased toward remembering successes and forgetting silent failures. To know whether you are genuinely improving, you need three numbers: catch rate (what percentage of errors you catch before output goes out), rework rate (what percentage of output requires significant correction), and autonomy ceiling (the longest task the agent runs without needing you to intervene).

A colleague asks: "Three months with the agent — has it gotten better?"

You nod. "Much better. Feels noticeably smoother."

"Better how, in numbers?"

You pause. You actually do not know.

The concerning part is not that you lack the number — it is that you believe you are improving based on something that cannot measure improvement. "Feels smoother" is real, but it is a measurement of what you remember, not what is happening.

01 Why the feeling lies in a positive direction

Measuring by feeling

Remember clearly: times the agent finished cleanly, you were satisfied, the task moved fast
Forget quickly: times the agent was wrong but you fixed it and moved on — no "incident" worth recording
Conclusion: "getting better and better" — even if the error pattern has not changed at all

Measuring by number

Track: when was the last error caught, how long to fix, which outputs had to be redone
Pattern appears: catch rate, rework rate, autonomy ceiling — comparable to last month
Trustworthy conclusion: you can tell genuine improvement apart from just forgetting more

Human memory is a filter, not a recorder. It retains things with strong emotion — significant successes or failures — and discards most of the small errors that were quietly fixed. "Feeling solid" is synthesized from that filtered version, not from the full data.

Silent errors are particularly dangerous here. When the agent gets something wrong, you catch it and fix it quickly — you do not count that as a memorable event. You count it as normal workflow. But ten of those in a week, fifteen minutes each, is two and a half hours you never measured. And you still say "the agent has been great lately."
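The arithmetic above is easy to check once the quiet fixes are written down at all. A minimal sketch, assuming you jot one entry per fix; the `QuietFix` record and its fields are hypothetical names chosen for illustration, not part of any tool:

```python
from dataclasses import dataclass


@dataclass
class QuietFix:
    """One error you caught and fixed without treating it as an incident."""
    note: str
    minutes: int


def hidden_hours(fixes: list[QuietFix]) -> float:
    """Total time spent on quiet fixes, converted from minutes to hours."""
    return sum(f.minutes for f in fixes) / 60


# Ten quiet fixes in one week at fifteen minutes each, as in the text.
week = [QuietFix(note=f"fix {i + 1}", minutes=15) for i in range(10)]
print(hidden_hours(week))  # 2.5
```

The value of the log is less the total than the fact that each fix now leaves a trace that memory would otherwise drop.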

02 Three metrics that actually measure improvement

No complicated system needed. These three numbers, even tracked informally, give a clearer picture than any feeling:

1. Catch rate: what percentage of errors you catch before output goes out

In the last 10 tasks, how often did you find an error before the output was used for real? High catch rate = verification is working. Low catch rate = either there are no errors (good) or errors are slipping through (dangerous). To tell those two cases apart, also note every error that surfaced only after the output was already in use.

2. Rework rate: what percentage of output requires significant correction

Minor fixes (one line, one word) are normal and do not count. Significant rework = reading it again from the start, refactoring logic, or redoing a section. What was last month's number? This month's? Which direction is the trend moving?

3. Autonomy ceiling: the longest task the agent runs without needing your intervention

Not "the longest task you are willing to release" but "the longest task it completed well without an interruption." If this number is growing, you are genuinely expanding what you can delegate. If it has not moved in three months, you are standing still.

These three numbers do not require a spreadsheet. You can estimate them after a week of observation. The key point: having a rough number to compare is far better than having nothing to compare at all.
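If you do keep even a rough task log, the three numbers fall out of it directly. The sketch below is one possible way to compute them, not a prescribed method. The `TaskRecord` fields are hypothetical, the sample log is invented for illustration, and counting errors that escaped is an added assumption needed to give catch rate a denominator:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """One delegated task. All field names are illustrative assumptions."""
    errors_caught: int      # errors found before the output was used
    errors_escaped: int     # errors found only after the output was used
    needed_rework: bool     # significant rework, not a one-line fix
    duration_minutes: int   # how long the task ran
    interrupted: bool       # True if you had to step in mid-task
    completed_well: bool    # True if the final output was acceptable


def catch_rate(log: list[TaskRecord]) -> Optional[float]:
    """Share of all known errors that were caught before output went out."""
    caught = sum(t.errors_caught for t in log)
    total = caught + sum(t.errors_escaped for t in log)
    return caught / total if total else None  # None: no known errors at all


def rework_rate(log: list[TaskRecord]) -> Optional[float]:
    """Share of tasks whose output required significant correction."""
    return sum(t.needed_rework for t in log) / len(log) if log else None


def autonomy_ceiling(log: list[TaskRecord]) -> int:
    """Longest task (in minutes) completed well with no intervention."""
    clean = [t.duration_minutes for t in log
             if t.completed_well and not t.interrupted]
    return max(clean, default=0)


# Invented sample log: caught, escaped, rework, minutes, interrupted, done well
log = [
    TaskRecord(2, 0, False, 20, False, True),
    TaskRecord(1, 1, True, 45, True, True),
    TaskRecord(0, 0, False, 35, False, True),
    TaskRecord(1, 0, False, 60, False, False),
]

print(f"catch rate: {catch_rate(log):.0%}")               # catch rate: 80%
print(f"rework rate: {rework_rate(log):.0%}")             # rework rate: 25%
print(f"autonomy ceiling: {autonomy_ceiling(log)} min")   # autonomy ceiling: 35 min
```

Note that the 60-minute task does not raise the ceiling because it did not complete well, and the 45-minute task does not count because it needed an intervention.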

03 Four signs you are measuring the wrong things

Are you measuring it this way?
"The agent finishes faster": measuring output speed, not real speed. The agent producing more does not mean you ship more if rework rate is high; real speed = usable output, not created output.
"Fewer back-and-forth corrections": measuring prompt iterations, not output quality. Fewer rounds might mean you have gotten better at prompting, or it might mean you are accepting lower-quality output. The two are worth distinguishing.
"No serious incidents": measuring by incident, ignoring accumulated small errors. No major failure ≠ doing well; you may be spending continuous energy on small errors without realizing how much they add up.
"I feel more productive": measuring the feeling, not the output. The feeling of productivity usually comes from busyness, not from output; busy ≠ productive, with or without an agent.

This is not to dismiss those observations — they all carry information. But they measure what is easy to notice, not what is important. And measuring what is easy to notice instead of what is important is precisely how someone ends up confident they are progressing when they are standing still.
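The first sign is the easiest to make concrete. A minimal sketch, assuming you count outputs created and outputs that needed significant rework; the numbers are invented for illustration, and treating reworked output as not yet usable is a deliberate simplification:

```python
def usable_outputs(created: int, reworked: int) -> int:
    """Outputs that were usable without significant rework."""
    if not 0 <= reworked <= created:
        raise ValueError("reworked must be between 0 and created")
    return created - reworked


# Hypothetical weeks: the second produces more output but ships less.
slower_week = usable_outputs(created=20, reworked=5)
faster_week = usable_outputs(created=30, reworked=18)

print(slower_week, faster_week)  # 15 12
```

In this made-up case the agent "finished faster" in the second week, yet the count of usable output went down.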

04 If you cannot measure it, you cannot improve it in the direction you intend

There is something even experienced agent users tend to overlook: the agent does not learn from the mistakes it made with you. Every session starts fresh. Improvement in your workflow with the agent only comes from one direction: you changing how you delegate, how you check, how you structure context.

That means if you are not tracking what you are changing — nothing is actually changing. You are doing exactly what you did last month, just feeling more familiar with it and calling that better.
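One way to make "tracking what you are changing" tangible is to store each period's numbers next to the change you tried, then compare. This is a rough sketch under stated assumptions: the `Snapshot` fields, the sample values, and the "added a review checklist" change are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """The three numbers for one period, plus the change you tried."""
    label: str
    change_tried: str
    catch_rate: float
    rework_rate: float
    autonomy_ceiling_min: int


def trend(prev: float, curr: float, higher_is_better: bool) -> str:
    """Classify the movement of one number between two periods."""
    if curr == prev:
        return "flat"
    return "better" if (curr > prev) == higher_is_better else "worse"


def direction(prev: Snapshot, curr: Snapshot) -> dict[str, str]:
    """Report which way each of the three numbers moved."""
    return {
        "catch_rate": trend(prev.catch_rate, curr.catch_rate, True),
        "rework_rate": trend(prev.rework_rate, curr.rework_rate, False),
        "autonomy_ceiling": trend(prev.autonomy_ceiling_min,
                                  curr.autonomy_ceiling_min, True),
    }


# Invented values for two consecutive months.
last_month = Snapshot("last month", "none", 0.70, 0.30, 35)
this_month = Snapshot("this month", "added a review checklist", 0.80, 0.30, 30)

print(direction(last_month, this_month))
# {'catch_rate': 'better', 'rework_rate': 'flat', 'autonomy_ceiling': 'worse'}
```

Keeping `change_tried` beside the numbers is the point of the design: it ties a movement in a metric to something you actually did differently.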

Catch rate, rework rate, autonomy ceiling. Three numbers that do not ask much. But they give you something feeling cannot: a baseline to compare against over time, and a direction that is actually forward.

The author

Each story here wraps a lesson paid for in full.

craftagent · someone building and learning at once