Oversight Has a Capacity

In plain terms.

Give an AI agent real power — let it write code, delete files, push to production — and the obvious safety move is to put a human in front of it: anything risky, pause and ask a person first.

Everyone assumes the same thing: the more the agent asks, the safer you are. More checks, more eyes, fewer disasters. Right?

That assumption is wrong — and that’s what the paper is about.

The problem is the human, not the agent

Picture an airport screener. Guard A checks 5 bags a day — fresh, alert, reads every one. Guard B checks 500 — and by bag #400 they’re fried, bored, rubber-stamping everything. Now hide one dangerous bag in a stream of 499 boring ones. Guard A catches it. Guard B waves it through. More oversight produced the worse outcome, because attention isn’t infinite — it’s a battery that drains.

There’s no perfect answer key

I had reviewers label 125 agent actions as safe or risky. They only agreed about half the time. “Risky” turns out to be an opinion, not a fact — like three chefs disagreeing on whether a soup’s too salty. So you can’t even say “this guard is 99% accurate,” because there’s no objective truth to score against.
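As a rough illustration of how "agreed about half the time" might be measured, here is a minimal sketch that counts the share of actions on which every reviewer gave the same label. The function name, the reviewer names, and the labels are all made up for the example; they are not the study's data, and the paper may use a different agreement measure.

```python
def unanimous_agreement(labels_by_reviewer: dict[str, list[str]]) -> float:
    """Fraction of actions on which all reviewers gave the same label."""
    # Regroup the labels so each tuple holds every reviewer's verdict on one action.
    per_action = list(zip(*labels_by_reviewer.values()))
    if not per_action:
        raise ValueError("need at least one labelled action")
    agreed = sum(1 for verdicts in per_action if len(set(verdicts)) == 1)
    return agreed / len(per_action)


# Hypothetical toy labels for four actions (illustration only).
toy_labels = {
    "reviewer_a": ["safe", "risky", "safe", "risky"],
    "reviewer_b": ["safe", "risky", "risky", "safe"],
    "reviewer_c": ["safe", "risky", "safe", "safe"],
}

print(f"unanimous on {unanimous_agreement(toy_labels):.0%} of actions")
```

Raw agreement like this does not correct for agreement by chance; a chance-corrected statistic such as Cohen's or Fleiss' kappa is a stricter alternative.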

So safety is a curve, not a dial

Ask too little, and the agent does dangerous things on its own. Ask too much, and you exhaust the human until they rubber-stamp the danger. The safe zone is in the middle: escalate the things that matter and let the agent handle the rest, so the human never burns out. More asking is not more safety.
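The shape of that curve can be sketched with a toy model. Everything here is an assumption made for illustration, not the paper's model: the reviewer's chance of catching a dangerous item decays exponentially with the number of reviews already done (`capacity` sets how fast), and a hypothetical risk scorer puts a concave share of the dangerous actions into the escalated set.

```python
import math


def average_catch_rate(reviews: int, capacity: float) -> float:
    """Mean chance of catching a dangerous item across `reviews` checks.

    Assumed fatigue model: the k-th review succeeds with probability
    exp(-k / capacity), so attention drains like a battery.
    """
    if reviews == 0:
        return 0.0
    return sum(math.exp(-k / capacity) for k in range(reviews)) / reviews


def expected_misses(escalation_rate: float, n_actions: int = 500,
                    n_dangerous: int = 5, capacity: float = 50.0) -> float:
    """Expected number of dangerous actions that get through."""
    reviews = round(escalation_rate * n_actions)
    # Assumed scorer: escalating the top fraction of actions by risk score
    # captures a concave share of the dangerous ones.
    share_escalated = escalation_rate ** 0.25
    caught = share_escalated * average_catch_rate(reviews, capacity)
    return n_dangerous * (1.0 - caught)


# Sweep escalation rates from 0% to 100% and find the safest one.
rates = [i / 100 for i in range(101)]
best_rate = min(rates, key=expected_misses)

for rate in (0.0, best_rate, 1.0):
    print(f"escalate {rate:.0%} of actions -> "
          f"{expected_misses(rate):.2f} expected misses")
```

Under these assumptions the fewest misses occur at an escalation rate strictly between "never ask" and "always ask". The exact location of the minimum depends entirely on the made-up parameters.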

Attackers can weaponize this

Flood the system with 100 harmless requests, then slip the malicious one in at the end. A naive “ask about everything” setup tires the human out, and the worn-out human rubber-stamps the bad one. A smart one stays quiet on the noise, so the human is still sharp when the real threat arrives.
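A minimal sketch of that flooding scenario, under stated assumptions: the same hypothetical exponential fatigue model as an attention budget, made-up risk scores (0.1 for harmless requests, 0.9 for the malicious one), and an invented 0.5 threshold for the selective policy. It also assumes the risk scorer ranks the malicious request above the threshold, which a real attacker would try to defeat.

```python
import math


def catch_probability(reviews_so_far: int, capacity: float = 50.0) -> float:
    """Assumed fatigue model: attention decays with each review already done."""
    return math.exp(-reviews_so_far / capacity)


def chance_attack_is_caught(stream, escalate) -> float:
    """Walk the stream in order and return the reviewer's chance of catching
    the malicious request, or 0.0 if it is never escalated."""
    reviews = 0
    for request in stream:
        if not escalate(request):
            continue  # the agent proceeds without asking
        if request["malicious"]:
            return catch_probability(reviews)
        reviews += 1
    return 0.0


def ask_about_everything(request) -> bool:
    return True


def ask_when_risky(request) -> bool:
    return request["risk"] >= 0.5  # hypothetical threshold


# 100 harmless requests, then the malicious one at the end.
stream = [{"risk": 0.1, "malicious": False} for _ in range(100)]
stream.append({"risk": 0.9, "malicious": True})

naive = chance_attack_is_caught(stream, ask_about_everything)
selective = chance_attack_is_caught(stream, ask_when_risky)
print(f"ask about everything: {naive:.2f}")
print(f"ask only when risky:  {selective:.2f}")
```

In this toy setup the selective policy reaches the malicious request with a fresh reviewer, while the naive policy reaches it after 100 prior reviews have drained most of the reviewer's attention.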

The bottom line

To keep AI agents safe, the answer isn’t “add more human review.” Human attention is a limited resource. The real question isn’t whether to involve a human — it’s when, so they still have the attention left to catch the thing that actually matters.

I built an open-source firewall — and the eval to measure exactly where that line sits.

Read the paper (arXiv) · Live demo · The technical version
