Your agent is good at the familiar — you've watched it a hundred times. Then you hand it a task in a completely new domain and assume it's just as good there. Same agent, but good at code review doesn't mean good at legal analysis or reading numbers from an unfamiliar industry. You find out only after it's done the whole big task — wrong in a way you didn't see coming.
People who've done this a while don't trust a new domain straight off. They run one small slice first, watch how the agent approaches it, and only then set how much to trust it. Not out of suspicion — because the same model behaves differently by domain, and a small slice tells you where it differs before that difference gets expensive.
✕ Hand over the big task
✓ Probe a small slice first
Paste it the first time you hand over a new kind of task, not every time. The trick is making it show its thinking, not just the result — because what you're measuring isn't "is the answer right," it's "does its approach match what this domain actually needs." A small slice done the right way is more trustworthy than a big one that came out right by luck.