A colleague asks: "Three months with the agent — has it gotten better?"
You nod. "Much better. Feels noticeably smoother."
"Better how, in numbers?"
You pause. You actually do not know.
The concerning part is not that you lack the number — it is that you believe you are improving based on something that cannot measure improvement. "Feels smoother" is real, but it is a measurement of what you remember, not what is happening.
01 Why the feeling lies in a positive direction
✕ Measuring by feeling
✓ Measuring by number
Human memory is a filter, not a recorder. It retains events that carry strong emotion (significant successes or failures) and discards most of the small errors that were quietly fixed. The sense that things "feel smoother" is synthesized from that filtered version, not from the full data.
Silent errors are particularly dangerous here. When the agent gets something wrong, you catch it and fix it quickly — you do not count that as a memorable event. You count it as normal workflow. But ten of those in a week, fifteen minutes each, is two and a half hours you never measured. And you still say "the agent has been great lately."
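The arithmetic behind that figure is worth writing out once, because it scales quickly. The sketch below is only an illustration: the function name is made up, and the 13-week quarter is an assumption added here, not a number from the example.

```python
def hidden_fix_hours(fixes_per_week: int, minutes_per_fix: float) -> float:
    """Weekly time spent on quiet fixes, converted from minutes to hours."""
    return fixes_per_week * minutes_per_fix / 60


# Figures from the example above: ten quiet fixes a week, fifteen minutes each.
weekly = hidden_fix_hours(10, 15)
print(f"{weekly:.1f} hours per week")          # 2.5 hours per week

# Assumption: three months is roughly 13 weeks.
print(f"{weekly * 13:.1f} hours per quarter")  # 32.5 hours per quarter
```

None of those hours show up in "feels smoother", which is exactly why they need to be counted rather than remembered.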
02 Three metrics that actually measure improvement
No complicated system needed. These three numbers, even tracked informally, give a clearer picture than any feeling:
Catch rate. In the last 10 tasks, how often did you find an error before the output was used for real? A high catch rate means verification is working. A low catch rate means either there are no errors (good) or errors are slipping through (dangerous).
Rework rate. Minor fixes (one line, one word) are normal and do not count. Significant rework means reading the output again from the start, refactoring logic, or redoing a section. What was last month's number? This month's? Which direction is the trend moving?
Autonomy ceiling. Not "the longest task you are willing to hand off" but "the longest task it completed well without an interrupt." If this number is growing, you are genuinely expanding something. If it has not moved in three months, you are standing still.
These three numbers do not require a spreadsheet. You can estimate them after a week of observation. The key point: having a rough number to compare is far better than having nothing to compare at all.
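If you do decide to jot the last ten tasks down, the three numbers reduce to two proportions and a maximum. The following is a minimal sketch, not a prescribed tool: the `Task` fields, the function names, the choice of minutes as the unit of task length, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical fields; record whatever is easiest for you to note down.
    minutes: int               # how long the task ran (assumed unit of "length")
    completed_well: bool       # finished acceptably without an interrupt
    caught_before_use: bool    # an error was found before the output was used
    significant_rework: bool   # reread from the start, refactored, or redid a section


def _share(flags: list[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def catch_rate(tasks: list[Task]) -> float:
    return _share([t.caught_before_use for t in tasks])


def rework_rate(tasks: list[Task]) -> float:
    return _share([t.significant_rework for t in tasks])


def autonomy_ceiling(tasks: list[Task]) -> int:
    """Longest task (in minutes) that was completed well without an interrupt."""
    return max((t.minutes for t in tasks if t.completed_well), default=0)


# Made-up sample of the last 10 tasks.
last_10 = [
    Task(20, True, False, False), Task(45, True, True, False),
    Task(30, False, True, True),  Task(15, True, False, False),
    Task(60, False, False, True), Task(25, True, True, False),
    Task(40, True, False, False), Task(10, True, False, False),
    Task(35, True, False, False), Task(50, False, True, True),
]

print(f"catch rate:       {catch_rate(last_10):.0%}")
print(f"rework rate:      {rework_rate(last_10):.0%}")
print(f"autonomy ceiling: {autonomy_ceiling(last_10)} min")
```

Note that the 60-minute task in the sample does not raise the ceiling, because it was not completed well. The ceiling counts only what actually worked, not what was attempted.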
03 Four signs you are measuring the wrong things
This is not to dismiss those observations — they all carry information. But they measure what is easy to notice, not what is important. And measuring what is easy to notice instead of what is important is precisely how someone ends up confident they are progressing when they are standing still.
04 If you cannot measure it, you cannot improve it (in the direction you intend)
There is something even experienced agent users tend to overlook: the agent does not learn from the mistakes it made with you. Every session starts fresh. Improvement in your workflow with the agent only comes from one direction: you changing how you delegate, how you check, how you structure context.
That means if you are not tracking what you are changing — nothing is actually changing. You are doing exactly what you did last month, just feeling more familiar with it and calling that better.
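One way to make "tracking what you are changing" concrete is to date each change to how you delegate or check, then compare a metric before and after that date. The sketch below assumes a hypothetical log format and invented sample entries. With this few tasks, a before/after gap is a hint, not proof that the change caused it.

```python
from datetime import date

# Hypothetical task log: (date, needed_significant_rework).
task_log = [
    (date(2024, 5, 2), True),  (date(2024, 5, 9), True),
    (date(2024, 5, 16), False), (date(2024, 5, 23), True),
    (date(2024, 6, 6), False), (date(2024, 6, 13), True),
    (date(2024, 6, 20), False), (date(2024, 6, 27), False),
]

# Hypothetical change log: when you changed something, and what.
changes = [
    (date(2024, 6, 1), "added acceptance criteria to every task brief"),
]


def rework_rate(entries):
    """Share of entries flagged as significant rework, or None if there are none."""
    if not entries:
        return None
    return sum(flag for _, flag in entries) / len(entries)


def before_after(change_date, log):
    """Rework rate for tasks before the change and from the change date onward."""
    before = [e for e in log if e[0] < change_date]
    after = [e for e in log if e[0] >= change_date]
    return rework_rate(before), rework_rate(after)


def fmt(rate):
    return "n/a" if rate is None else f"{rate:.0%}"


for when, what in changes:
    before, after = before_after(when, task_log)
    print(f"{when} {what}: rework {fmt(before)} -> {fmt(after)}")
```

A month with an empty change log is itself a finding: it suggests the workflow is the same as last month, whatever it feels like.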
Catch rate, rework rate, autonomy ceiling. Three numbers that do not ask much. But they give you something feeling cannot: a distance to compare across time, and a direction that is actually forward.