The Threshold Problem — Who Sets the Line, and How Do You Know It's Right?
Threshold calibration is not a technical problem. It is a governance problem, and it fails in two opposite directions.
A smoke detector calibrated for a laboratory is installed in a kitchen. Every time you make toast, it fires. Within a week, you’ve learned to ignore it. By the night it fires for a real reason, you’ve already trained yourself not to respond.
The detector was not malfunctioning. It was correctly calibrated — for the wrong environment. And that miscalibration produced a failure mode worse than a detector that didn’t exist: a detector the receiver had been conditioned to dismiss.
Two Failure Modes, Both Governance Failures
Threshold calibration fails in two opposite directions. Both can pass for normal operation when you look only at the outputs. Neither is detectable from within the system alone.
Failure mode one: too sensitive. The threshold sits below the noise floor of normal operations. Every routine event crosses it. Notifications fire constantly. The receiver cannot distinguish signal from noise because everything surfaces at the same priority.
This is alarm fatigue — and it is more dangerous than it sounds. The system is not malfunctioning. It is detecting exactly what it was calibrated to detect. The failure is that the calibration environment did not match the deployment environment, so detection became noise.
Alarm fatigue has a compounding cost: when the real event arrives, the one the system was designed to surface, the receiver has already been conditioned to dismiss it. The threshold fires correctly. The notification is real. Nobody responds. The notification infrastructure has become the obstacle to the notification.
Failure mode two: too conservative. The threshold sits above the events you need to surface. Real events don’t cross it. The channel is quiet — not because the system is clean, but because the threshold is missing what it should catch.
This failure mode is harder to diagnose. There is no alarm to investigate. Absence of notifications looks like system health when it may instead mean the threshold never fires on the events it should catch. In a system producing governance artifacts (logs, dashboards, clean audit results), a conservatively miscalibrated threshold generates the same output as a correctly calibrated one watching a genuinely healthy system. From the outside, they are indistinguishable.
This is the structural core of the problem. You cannot distinguish correct calibration from either failure mode by reading output data alone.
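A minimal sketch of that indistinguishability, using made-up anomaly scores and a hypothetical `alerts` helper (none of these names or values come from a real system): a sensible threshold watching a healthy system and an overly conservative threshold watching a system with real events both produce the same empty alert list.

```python
def alerts(scores, threshold):
    """Return the scores that would fire a notification (score >= threshold)."""
    return [s for s in scores if s >= threshold]

# Hypothetical anomaly scores in [0, 1]; the values are illustrative only.
healthy_system = [0.05, 0.10, 0.08, 0.12, 0.07]
# Assume 0.62 and 0.71 are real events the governance process wants surfaced.
system_with_real_events = [0.05, 0.62, 0.08, 0.71, 0.07]

well_calibrated = alerts(healthy_system, threshold=0.5)
too_conservative = alerts(system_with_real_events, threshold=0.9)

# Both channels are silent; the output alone does not say which case you are in.
print(well_calibrated, too_conservative)
```

Only outside knowledge, such as knowing that 0.62 and 0.71 were real events, separates the two cases.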
Why It’s a Governance Problem
A technical problem has a correct answer derivable from the system’s own data. A governance problem requires judgment about what the system is for — what events matter, to whom, in what context, under what operational conditions.
The system cannot answer this. A confidence classifier doesn’t know what your deployment’s acceptable false positive rate is. An anomaly detector doesn’t know which anomalies your team is equipped to act on.
Only the people responsible for governance can answer these questions — and threshold calibration is the act of encoding those governance judgments into the system’s operating parameters.
The consequence of treating threshold calibration as a technical problem is that the governance judgment gets made by default. Someone chose a parameter during model development. That parameter was calibrated against development data. Your deployment context is not development data. The threshold was set somewhere. It was not set for your context.
The Calibration Context Problem in AI
Most AI deployments have implicit thresholds — confidence scores, content classifiers, anomaly detectors — calibrated during development against training distribution data. The training distribution is not your deployment context.
The Feynman-style question to ask here: what is the threshold actually measuring? Not what label it carries, but what underlying quantity it tracks.
A content safety classifier measures a proxy: text features correlated with harmful content in training data. In your deployment, that proxy may overfire on legitimate professional vocabulary your domain requires. Or it may miss content your specific context flags as harmful that the training data didn’t represent. The measurement is technically correct. The calibration is wrong for your environment.
Four Properties of Governed Threshold Design
A threshold that satisfies the Feynman test has four required properties:
Explicit. The threshold is documented as a governance decision: a specific value, the rationale for it, the expected false positive and negative rates, the conditions under which it should be reviewed, and the name of the person accountable for it. Not implicit in a model parameter that ships as a library default. An authored decision with a named owner.
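One way such a record might look, as a sketch only: the `ThresholdDecision` class, its field names, and the example values are all assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ThresholdDecision:
    """A threshold written down as an authored decision (hypothetical schema)."""
    name: str
    value: float
    rationale: str
    expected_false_positive_rate: float
    expected_false_negative_rate: float
    review_conditions: str
    owner: str
    decided_on: date

    def __post_init__(self):
        # Refuse records that leave the governance fields blank.
        for field_name in ("rationale", "review_conditions", "owner"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} must not be empty")
        # Expected error rates are proportions.
        for field_name in ("expected_false_positive_rate",
                           "expected_false_negative_rate"):
            if not 0.0 <= getattr(self, field_name) <= 1.0:
                raise ValueError(f"{field_name} must be between 0 and 1")

# Illustrative values only.
decision = ThresholdDecision(
    name="content_safety_score",
    value=0.80,
    rationale="Set from the first month of labeled operational data.",
    expected_false_positive_rate=0.02,
    expected_false_negative_rate=0.05,
    review_conditions="Quarterly, or when the fire rate leaves its expected band.",
    owner="Named accountable person",
    decided_on=date(2024, 1, 15),
)
```

The point of the validation is narrow: a record with no owner or no rationale cannot be created at all, so a library default cannot quietly stand in for a decision.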
Calibrated to the deployment environment. You cannot set a correct threshold before you have operational data from your specific context. The development threshold is an initial estimate, not a final decision. Calibrating to deployment means collecting operational data: which events actually occur, at what rates, and which of them the governance process requires surfacing.
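A rough sketch of what recalibration against operational data could look like, assuming you have classifier scores for events that reviewers labeled benign. The function name, the target rate, and both score lists are hypothetical; a real calibration would also weigh false negatives and use far more data.

```python
def threshold_for_false_positive_rate(benign_scores, target_fpr):
    """Lowest observed score t such that firing on score >= t keeps the
    false positive rate on benign_scores at or below target_fpr.
    Returns None if no observed score meets the target."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    n = len(benign_scores)
    for t in sorted(set(benign_scores)):
        false_positive_rate = sum(s >= t for s in benign_scores) / n
        if false_positive_rate <= target_fpr:
            return t
    return None

# Illustrative scores for benign events; the deployment domain is assumed to
# use professional vocabulary that pushes benign scores higher.
development_benign = [0.10, 0.20, 0.15, 0.30, 0.25, 0.20, 0.10, 0.35, 0.30, 0.40]
deployment_benign = [0.40, 0.55, 0.60, 0.50, 0.45, 0.70, 0.65, 0.50, 0.60, 0.75]

dev_threshold = threshold_for_false_positive_rate(development_benign, 0.1)
deploy_threshold = threshold_for_false_positive_rate(deployment_benign, 0.1)

# Rate at which the development threshold fires on benign deployment events.
inherited_fpr = sum(s >= dev_threshold for s in deployment_benign) / len(deployment_benign)
print(dev_threshold, deploy_threshold, inherited_fpr)
```

In this toy data the inherited development threshold fires on every benign deployment event, which is the alarm-fatigue setup described earlier.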
Revisable on a defined schedule. The deployment environment changes. A threshold correctly calibrated at launch may be miscalibrated at ninety days. Governed threshold design includes a review cadence — a defined cycle with a defined owner who has access to the operational data required to make the calibration judgment.
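The cadence itself can be checked mechanically. The sketch below assumes a hypothetical `review_status` helper and a ninety-day cycle; the dates are invented for the example.

```python
from datetime import date, timedelta

def review_status(last_reviewed, cadence_days, today):
    """Report when the next review is due and whether it has been missed."""
    due = last_reviewed + timedelta(days=cadence_days)
    return {
        "due": due,
        "overdue": today > due,
        "days_overdue": max((today - due).days, 0),
    }

# Illustrative: last reviewed at launch, checked three and a half months later.
status = review_status(date(2024, 1, 1), cadence_days=90, today=date(2024, 4, 15))
print(status)
```

A check like this only flags a missed date. The calibration judgment still belongs to the named owner.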
Auditable. You can inspect what the threshold fired on. You can compare the actual fire rate against the rate expected at calibration time. Without this record, outputs alone cannot tell you whether you are looking at correct calibration, alarm fatigue, or silent failure.
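As a sketch of the fire-rate comparison, assume the expected rate was written down at calibration time and that a relative tolerance band is an acceptable first check. The function name, the band width, and the counts are assumptions for illustration.

```python
def audit_fire_rate(fired, total_events, expected_rate, tolerance=0.5):
    """Compare the observed fire rate with the rate expected at calibration.
    tolerance is a relative band: 0.5 allows +/-50% around expected_rate."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    observed = fired / total_events
    lower = expected_rate * (1 - tolerance)
    upper = expected_rate * (1 + tolerance)
    if observed > upper:
        finding = "above_expected"   # investigate possible over-sensitivity
    elif observed < lower:
        finding = "below_expected"   # investigate possible missed events
    else:
        finding = "within_band"
    return observed, finding

# Illustrative counts against an expected fire rate of 2%.
noisy = audit_fire_rate(fired=120, total_events=1000, expected_rate=0.02)
silent = audit_fire_rate(fired=0, total_events=1000, expected_rate=0.02)
steady = audit_fire_rate(fired=20, total_events=1000, expected_rate=0.02)
print(noisy, silent, steady)
```

A result inside the band is not proof of correct calibration. It only means the threshold is behaving as it was expected to when the decision was documented, which is why inspecting what it actually fired on remains part of the audit.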
The Practical Starting Point
For every threshold in the system:
- Who set it, and when?
- What data was it calibrated against?
- What is the documented rationale for the specific value?
- When was it last reviewed?
- Who is named as the accountable owner for the next review?
If you can’t answer all five questions for a threshold, it has not been governed. It has been inherited.
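The five questions can be run as a simple inventory check. This is a sketch under assumptions: the record keys (`set_by`, `set_on`, and so on) and the two example records are hypothetical, chosen only to mirror the questions above.

```python
# Each question maps to the record fields that would answer it (assumed names).
QUESTIONS = {
    "Who set it, and when?": ("set_by", "set_on"),
    "What data was it calibrated against?": ("calibration_data",),
    "What is the documented rationale for the specific value?": ("rationale",),
    "When was it last reviewed?": ("last_reviewed",),
    "Who is named as the accountable owner for the next review?": ("next_review_owner",),
}

def unanswered_questions(record):
    """Return the questions this threshold record cannot answer."""
    return [
        question
        for question, fields in QUESTIONS.items()
        if any(not record.get(field) for field in fields)
    ]

def governance_status(record):
    return "governed" if not unanswered_questions(record) else "inherited"

# Illustrative records.
library_default = {"set_by": "upstream library", "calibration_data": "training set"}
reviewed = {
    "set_by": "A. Owner", "set_on": "2024-01-15",
    "calibration_data": "first 30 days of operational events",
    "rationale": "keeps benign fire rate near 2%",
    "last_reviewed": "2024-04-15", "next_review_owner": "A. Owner",
}

print(governance_status(library_default), unanswered_questions(library_default))
print(governance_status(reviewed))
```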
Next: NOT·3 — The Obligation Gap. The threshold fires correctly. The notification reaches the receiver. And then nothing happens. Not because they ignored it — because the system had no obligation architecture.
