COG-GOV-COHERENCY-IS-NOT-A · Paper 05

Coherency Is Not a Feature

Why Consistent Output Is a Precondition for Trust, Not a Quality Level

ALSI Inc. · Advanced Governance Intelligence Research Division
Human.Exe Publishing · Research Paper Series
April 2026

"A system that produces the correct output ninety percent of the time but contradicts itself freely is not ninety percent reliable. Its reliability is unknown. You cannot trust a system whose consistency you cannot measure."

A Problem That Doesn't Look Like One

Consider two AI systems deployed in production.

System A is accurate 85% of the time, measured against ground truth. It is occasionally wrong, it knows it is occasionally wrong, and its errors are randomly distributed rather than concentrated where they would systematically deceive. It also answers the same question the same way however the question is framed, so when you check its output you find a stable error rate that you can measure.

System B is accurate 90% of the time. Higher score. Preferred by every standard evaluation metric. But System B contradicts itself. Ask the same question in two different framings and you get two different answers — both confident, both apparently well-reasoned, both internally consistent in isolation. Ask a follow-up question that probes the consequences of an earlier answer and you sometimes get a response that is inconsistent with that answer. The contradiction is not flagged. It is not noticed. The system continues with apparent confidence.

Which system would you use to inform a consequential decision?

Most people, on reflection, choose System A. Not because it scores better — it doesn't — but because its failures are knowable. You can characterize them. You can account for them. You can build a process that catches them. System B's failures are not knowable in the same way. You cannot characterize incoherence by examining any single output. You can only characterize it by examining outputs in relation to each other — a comparison the system itself does not make.

The choice reveals something important: reliability and accuracy are not the same property. A system can be highly accurate and unreliable. It can be moderately accurate and highly reliable. Reliability requires not just accuracy but coherency — the property that the system's outputs are consistent with each other across contexts, framings, and sessions.
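The separation between the two properties can be shown with a small calculation. The following is a minimal sketch in Python, not an established metric: the data is invented to mirror Systems A and B (20 hypothetical yes/no questions, each asked under two framings), and the function names `accuracy` and `framing_agreement` are illustrative assumptions.

```python
def accuracy(answers, truth):
    """Fraction of individual outputs that match ground truth."""
    outputs = [(q, a) for q, framings in answers.items() for a in framings]
    return sum(a == truth[q] for q, a in outputs) / len(outputs)


def framing_agreement(answers):
    """Fraction of questions answered identically under every framing."""
    return sum(len(set(framings)) == 1 for framings in answers.values()) / len(answers)


# Hypothetical data: 20 yes/no questions whose true answer is "yes",
# each asked under two framings.
truth = {q: "yes" for q in range(20)}

# System A: wrong on 3 questions, but wrong the same way under both framings.
system_a = {q: ["no", "no"] if q < 3 else ["yes", "yes"] for q in range(20)}

# System B: wrong on only 4 outputs, each time contradicting its other framing.
system_b = {q: ["yes", "no"] if q < 4 else ["yes", "yes"] for q in range(20)}

for name, system in [("A", system_a), ("B", system_b)]:
    print(name, accuracy(system, truth), framing_agreement(system))
```

With this invented data, System B scores higher on accuracy (0.90 against 0.85) and lower on framing agreement (0.80 against 1.00). The second number is only available because the same question was asked twice and the answers were compared.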

This paper argues that coherency is not a quality level. It is a precondition for trust. And it is almost entirely absent from how the AI field currently evaluates systems.

Part One: What Coherency Actually Means

Before the argument, the definition.

Coherency in an AI system is the property that the system's outputs are internally consistent across the relevant dimensions: across contexts, across sessions, across framings of the same question, across the space of outputs that would follow from the same underlying state of knowledge.

This definition has several implications worth unpacking.

Coherency is not the same as accuracy. An inaccurate system can be perfectly coherent: it produces the wrong answer, but it produces the same wrong answer every time. A coherent system is predictable, even when it is wrong. This is not a trivial property: a predictable error can be corrected. An unpredictable one cannot.

Coherency is not the same as confidence. Many AI systems are trained to produce confident outputs because confident outputs are preferred by human evaluators — a training artifact with significant consequences. A highly confident system can simultaneously assert two contradictory things in different contexts, each with equal apparent confidence. Coherency requires that confidence tracks the actual relationship between outputs, not just the stylistic preference for confident-sounding language.

Coherency is not the same as consistency in a shallow sense. A system that produces the same output for identical inputs is consistent but not necessarily coherent. Coherency requires consistency across related inputs: inputs that share structural similarity, that ask about the same domain from different angles, that probe the logical consequences of earlier outputs. This is a much stronger requirement.
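One way to picture a check on logical consequences is as a set of implication rules applied to recorded answers. The sketch below is an illustration only: the `Implication` type, the `find_violations` function, and the compliance questions are all hypothetical, and it assumes answers have already been normalized to short strings, which in practice is a hard problem in its own right.

```python
from typing import NamedTuple


class Implication(NamedTuple):
    """If `if_question` was answered `if_answer`, `then_question` must be `then_answer`."""
    if_question: str
    if_answer: str
    then_question: str
    then_answer: str


def find_violations(answers, rules):
    """Return every rule whose premise holds but whose consequence was answered differently."""
    violations = []
    for rule in rules:
        premise_holds = answers.get(rule.if_question) == rule.if_answer
        consequence = answers.get(rule.then_question)
        # A follow-up that was never asked cannot be judged, so it is skipped.
        if premise_holds and consequence is not None and consequence != rule.then_answer:
            violations.append(rule)
    return violations


# Hypothetical answers collected from one system.
answers = {
    "Is approval required before export?": "yes",
    "Can export proceed without approval?": "yes",  # contradicts the first answer
    "Is the export log retained?": "yes",
}

rules = [
    Implication("Is approval required before export?", "yes",
                "Can export proceed without approval?", "no"),
    Implication("Is the export log retained?", "yes",
                "Can the export log be deleted on request?", "no"),  # never asked
]

violations = find_violations(answers, rules)
print(len(violations), "violation(s) found")
```

Each answer in this example could pass review on its own. The violation appears only when the first two answers are read together against a rule that links them.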

Coherency has a temporal dimension. A coherent system is coherent not just within a session but across sessions. Its understanding of a domain does not shift arbitrarily between interactions. The conceptual framework it applies is stable. The principles it invokes are consistent. It does not contradict today what it asserted yesterday without acknowledging and explaining the change.

The temporal dimension is where production AI systems most commonly fail. A system that is coherent within sessions but incoherent across sessions becomes less trustworthy as the deployment extends — because each session may independently be coherent while the accumulated outputs are not. [Internal: Coherency Framework Protocol research, R-02, filed February 2026.]
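The difference between within-session and cross-session coherency can be sketched with a simple log. This is a hedged illustration, not the Coherency Framework Protocol cited above: the record format `(session, topic, claim)`, the function names, and the retention example are assumptions, and the sketch presumes that outputs have already been reduced to comparable claims per topic.

```python
from collections import defaultdict


def contradictions(records):
    """Return each topic for which more than one distinct claim was recorded."""
    claims = defaultdict(set)
    for _session, topic, claim in records:
        claims[topic].add(claim)
    return {topic: sorted(vals) for topic, vals in claims.items() if len(vals) > 1}


def within_session(records):
    """Run the same check separately inside each session."""
    by_session = defaultdict(list)
    for record in records:
        by_session[record[0]].append(record)
    return {session: contradictions(recs) for session, recs in by_session.items()}


# Hypothetical log of claims extracted from two sessions.
log = [
    ("mon", "retention_period", "7 years"),
    ("mon", "retention_scope", "all invoices"),
    ("thu", "retention_period", "5 years"),
    ("thu", "retention_scope", "all invoices"),
]

print(within_session(log))   # every session is clean when inspected alone
print(contradictions(log))   # the conflict appears only in the accumulated log
```

Each session passes when inspected alone. The conflicting retention periods surface only when the sessions are pooled, which is the comparison that ordinary use of the system never performs.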

Part Two: Why Incoherence Is Invisible

The most serious property of incoherence in AI systems is not that it happens. It is that it is invisible.

When an AI system produces an inaccurate output, the inaccuracy is potentially detectable. The output can be checked against ground truth. The factual claim can be verified. The prediction can be tested against outcomes. Accuracy is a property of individual outputs, and individual outputs can be evaluated.

Incoherence is a property of the relationship between outputs. It cannot be detected by examining any single output. Two conflicting outputs may each be plausible, well-reasoned, and internally consistent. The incoherence exists only in the gap between them: in the fact that they cannot both be true, or that they rest on incompatible assumptions, or that the reasoning in one contradicts the reasoning in the other.

This means that standard evaluation processes, which examine outputs one at a time against ground truth, cannot detect incoherence. A system can pass every benchmark evaluation with excellent results and be profoundly incoherent. The benchmarks do not measure the relevant property.

More concretely: in a production deployment, the people generating outputs from AI systems are usually different from the people making decisions based on those outputs, and neither group systematically compares outputs to other outputs from the same system. The person asking about compliance options in one session does not compare the answer they receive to the answer a colleague received to an adjacent question last week. The incoherence between those two sessions is never visible to anyone, yet the decisions made on the basis of each answer may conflict in ways that carry real consequences.

The invisibility of incoherence is what makes it dangerous in proportion to the scale of deployment. A single operator working with an incoherent system affects only their own work. At organizational scale, incoherent AI outputs propagate through decision-making processes that no single person can monitor. The incoherence compounds silently. The consequences appear as unexplained inconsistencies in outcomes: anomalies attributed to human error or process variation that are actually the visible signature of invisible output incoherence. [Internal: Reasoning CE, cross-session coherency monitoring analysis, R-11.]

Part Three: Coherency as Precondition

The claim of this paper is not that coherency is a desirable property that AI systems should have. It is the stronger claim: coherency is a precondition for trust, and a system for which coherency cannot be established is a system that cannot be trusted, regardless of its accuracy score.

The argument follows from the structure of trust.

Trust in a system is built through the accumulation of reliable predictions about the system's future behavior. You trust a system when you can accurately anticipate how it will perform in conditions you have not yet observed, because it has performed consistently in the conditions you have observed and your model of its behavior extrapolates correctly.

This kind of trust requires consistency. If the system behaves differently in conditions that appear structurally identical, your model of its behavior cannot extrapolate correctly, because the variation that determines its behavior is not visible in the inputs you observe. The system has hidden degrees of freedom. Your trust is therefore not warranted by the evidence — you are not predicting behavior from a model, you are hoping behavior is good.

Coherency is the property that constrains those hidden degrees of freedom. A coherent system's outputs are governed by an underlying consistent framework. The framework may have limits — it may not cover all domains, it may be imprecise at boundaries — but within its scope, the outputs follow from it predictably. A coherent system can be trusted in the deep sense: you can build a model of how it reasons, and your model will make correct predictions.

An incoherent system cannot be trusted in this sense, regardless of its accuracy. Even if every output you have checked is correct, the relationship between outputs is not governed by a consistent framework that you can model and trust. The outputs are individually good but collectively ungoverned. You can observe that the system has been reliable so far. You cannot predict that it will continue to be reliable, because the variation that produces inconsistency is not visible to you. [Internal: Behavioural & Governance Structuring research, constitutional architecture and trust anchor analysis, R-12.]

Part Four: The Measurement Gap

If coherency is a precondition for trust, and if AI systems are widely deployed in consequential decision-support roles, then coherency should be a primary dimension in AI system evaluation. It is not.

The dominant evaluation paradigm — benchmark performance on curated test sets, accuracy against labeled datasets, human preference ratings in pairwise comparisons — measures accuracy and fluency. These are real properties and meaningful to measure. They are not coherency. None of the standard evaluation methods require or test for consistency across outputs, across framings, or across the temporal span of production deployment.

The reason for this gap is structural. Coherency measurement is harder than accuracy measurement. Accuracy requires ground truth for individual outputs. Coherency requires a comparison set — a structured collection of related outputs that can be analyzed in relation to each other. Building that comparison set requires knowing which outputs are related, which questions probe the same domain from different angles, which follow-up questions test the logical consequences of earlier answers. This is more complex than building a labeled dataset. It requires understanding the conceptual structure of the domain.
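To see why a comparison set carries more structure than a labeled dataset, consider what a single entry would need to hold. The following is a minimal sketch under stated assumptions: the `RelatedGroup` and `FollowUp` types and their field names are hypothetical, chosen only to illustrate the structure described above.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class FollowUp:
    """A question that tests a logical consequence of an earlier answer."""
    prompt: str
    # Maps an answer to the parent question onto the follow-up answer it implies.
    implied_answer: dict[str, str]


@dataclass
class RelatedGroup:
    """Prompts that a coherent system should answer compatibly."""
    topic: str
    framings: list[str]
    follow_ups: list[FollowUp] = field(default_factory=list)

    def comparisons(self):
        """Every pair of framings whose answers must be compared."""
        return list(combinations(self.framings, 2))


# Hypothetical entry in a comparison set.
group = RelatedGroup(
    topic="invoice retention",
    framings=[
        "How long must invoices be kept?",
        "When may an invoice be destroyed?",
        "Is a six-year-old invoice still subject to retention?",
    ],
    follow_ups=[
        FollowUp(
            prompt="Can a four-year-old invoice be shredded today?",
            implied_answer={"7 years": "no", "3 years": "yes"},
        )
    ],
)

print(len(group.comparisons()), "pairwise comparisons for", len(group.framings), "framings")
```

A labeled dataset needs one label per item. An entry like this one needs someone who understands the domain well enough to say which prompts belong together and what each answer implies for the follow-ups, and the number of pairwise comparisons grows as n(n-1)/2 with the number of framings.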

There is also a commercial incentive problem. Coherency failures are invisible to users who cannot compare outputs. A system that scores well on standard benchmarks will be preferred in commercial evaluations even if it is incoherent, because the incoherence is not visible in the evaluation protocol. The commercial incentive is to optimize for visible metrics. Coherency is not a visible metric.

The result is a systematic gap in AI evaluation that is consequential at the scale of current deployment. Every major AI system in production deployment today has an unknown coherency profile. We know a great deal about their accuracy. We know almost nothing about their consistency across the space of related outputs in production conditions. [Internal: Coherency Framework Protocol, evaluation gap analysis, R-02.]

Part Five: Constitutional Coherency

If coherency does not emerge reliably from current training paradigms, and if it cannot be adequately measured by current evaluation methods, the question becomes: how do you produce it?

The answer, on the evidence of both the governance research and the AI architecture research, is constitutional: you produce coherency by designing it into the architecture rather than hoping it emerges from training.

A constitutionally coherent system is one in which the outputs are generated within an architectural framework that enforces consistency by construction rather than by aspiration. The framework specifies what the system can and cannot claim, what principles govern its reasoning, what the relationships between concepts in its domain are. These specifications are not soft guidelines or training preferences. They are architectural constraints that apply to every output the system generates.

The analogy to legal constitutionalism is precise. A legal system that produces consistent judgments across cases is not one where judges happen to be consistent individuals. It is one where the law imposes the requirement of consistency architecturally — through precedent, through the requirement that decisions cite and be consistent with prior decisions, through appeals processes that explicitly compare current decisions to prior ones and require any divergence to be explicitly justified.

The same architectural principle applies to AI systems. Consistency of output does not emerge from training alone, because training signals are too noisy and too local to enforce global consistency. It must be enforced at the architecture level — through structured output protocols that require consistency with prior outputs in the relevant domain, through explicit coherency checking as part of the generation pipeline, through governance frameworks that treat contradiction as a failure mode to be caught rather than a variation to be tolerated.
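As a rough illustration of what explicit coherency checking in the generation pipeline could look like, the sketch below gates each draft output against claims the system has already committed to. It is not the architecture referenced in this paper, whose details are not described here. Every name in it (`CommitmentStore`, `gated_generate`, `CoherencyError`, `stub_generate`) is hypothetical, and it assumes the generator can return its claims in a structured `(topic, claim)` form, which is a strong assumption.

```python
class CoherencyError(Exception):
    """Raised when a draft output contradicts a previously committed claim."""


class CommitmentStore:
    """Remembers the claim the system has committed to on each topic."""

    def __init__(self):
        self._claims = {}

    def conflicts(self, claims):
        """Return (topic, prior_claim, new_claim) for each contradicted topic."""
        return [
            (topic, self._claims[topic], claim)
            for topic, claim in claims
            if topic in self._claims and self._claims[topic] != claim
        ]

    def commit(self, claims):
        self._claims.update(dict(claims))


def gated_generate(generate, store, prompt):
    """Release a draft only if it is consistent with every prior commitment."""
    text, claims = generate(prompt)
    found = store.conflicts(claims)
    if found:
        # Contradiction is treated as a failure to be caught, not tolerated.
        raise CoherencyError(found)
    store.commit(claims)
    return text


def stub_generate(prompt):
    """Stand-in for a real generator; returns canned text plus structured claims."""
    canned = {
        "How long do we keep invoices?":
            ("Keep invoices for 7 years.", [("retention_period", "7 years")]),
        "When can invoices be destroyed?":
            ("Invoices can be destroyed after 5 years.", [("retention_period", "5 years")]),
    }
    return canned[prompt]


store = CommitmentStore()
print(gated_generate(stub_generate, store, "How long do we keep invoices?"))
try:
    gated_generate(stub_generate, store, "When can invoices be destroyed?")
except CoherencyError as err:
    print("blocked:", err.args[0])
```

The design choice that matters is where the check sits: the comparison with prior outputs happens before release, so a contradiction stops the pipeline instead of reaching a decision-maker. A fuller version would presumably allow a divergence that is explicitly justified, as in the appeals analogy above.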

Constitutional coherency is not a feature to add to existing systems. It is an architectural design decision that must be made at system design time. Systems designed without it cannot be retroactively made coherent through fine-tuning or prompt engineering. The coherency must be in the structure. [Internal: Coherency Framework Protocol, constitutional coherency architecture, R-02; Reasoning CE, structural consistency enforcement, R-11.]

Part Six: The Missing Metric

The practical implication of this paper is that the field needs a coherency metric — a standardized method for measuring consistency across the space of related outputs — and that this metric should be as central to AI evaluation as accuracy metrics are today.

What would such a metric require?

A relational output set. Coherency cannot be measured from a single output. The evaluation set must contain questions that are related — that probe the same domain from different angles, that follow the logical consequences of each other, that ask about the same underlying state from different framings. Building this set is more work than building an accuracy dataset. It is also more informative.

Cross-output consistency analysis. The evaluation method must compare outputs to each other, not just to ground truth. The relevant question is not "is this output correct?" but "is this output consistent with the other outputs that a coherent system with this knowledge state would produce?" This requires a model of the conceptual relationships between questions, not just a set of correct answers.

Temporal coherency tracking. A single evaluation snapshot cannot measure cross-session coherency. The metric requires tracking outputs over time — across multiple sessions, across updates to the underlying model, across changes to the deployment context. This is a continuous measurement, not a point-in-time benchmark.

A coherency score distinct from accuracy. The metric should produce a coherency score that is independent of the accuracy score. A system should be characterizable as high-accuracy/low-coherency, high-accuracy/high-coherency, low-accuracy/high-coherency, or low-accuracy/low-coherency. Each of these combinations has different implications for how the system can be trusted and for which applications it is suitable.
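The four requirements above can be combined in a small sketch. It is one possible reading, not a proposed standard: pairwise agreement is used as a stand-in coherency score, the 0.9 threshold is arbitrary, and the function names, weekly snapshots, and accuracy figure are all invented for illustration.

```python
from itertools import combinations


def coherency_score(groups):
    """Fraction of answer pairs that agree, taken within each group of related prompts."""
    agree = total = 0
    for answers in groups.values():
        for first, second in combinations(answers, 2):
            total += 1
            agree += first == second
    return agree / total if total else float("nan")


def profile(accuracy, coherency, threshold=0.9):
    """Place a system in one of the four accuracy/coherency combinations."""
    acc = "high" if accuracy >= threshold else "low"
    coh = "high" if coherency >= threshold else "low"
    return f"{acc}-accuracy/{coh}-coherency"


# Hypothetical snapshots: normalized answers to the same related prompts,
# collected in two different weeks of a deployment.
snapshots = {
    "week 1": {"retention": ["7 years", "7 years", "7 years"], "approval": ["yes", "yes"]},
    "week 2": {"retention": ["7 years", "5 years", "7 years"], "approval": ["yes", "yes"]},
}

measured_accuracy = 0.92  # assumed to come from a separate, conventional benchmark

for week, groups in snapshots.items():
    score = coherency_score(groups)
    print(week, score, profile(measured_accuracy, score))
```

In this invented example the accuracy figure never changes, while the coherency score falls from 1.0 to 0.5 between snapshots and the profile moves from high-accuracy/high-coherency to high-accuracy/low-coherency. A point-in-time accuracy benchmark would report no difference between the two weeks.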

The field does not have this metric today. Building it is not a simple research problem. It requires significant methodological work in defining coherency, building evaluation sets, and establishing reliability of the measurement process. That work is worth doing. A field that can characterize AI system accuracy but not AI system coherency does not have the measurement infrastructure required to make well-founded trust decisions. It can tell you how often the system is right. It cannot tell you whether the system can be trusted. [Internal: Coherency Framework Protocol, measurement architecture research, R-02.]

Conclusion: Trust Requires More Than Accuracy

The AI field has built powerful measurement infrastructure for accuracy, fluency, and human preference. These are real properties and meaningful to measure. They are not sufficient for trust.

A system can score well on every standard benchmark — high accuracy, high fluency, high human preference ratings — and be systematically untrustworthy because its outputs are not governed by a consistent underlying framework. The inconsistency is invisible to any evaluation method that examines outputs one at a time. It appears only in the gaps between outputs, in the relationships between them, in the long-run pattern of what the system says in one context versus what it says in another.

Coherency is the property that closes that gap. A coherent system is one whose outputs follow from a stable, consistent framework. You can build a model of how it reasons. Your model makes correct predictions. The trust you extend to the system is warranted by the structure of its behavior, not just by the record of its past outputs.

The current state of AI evaluation does not measure this property. The current state of AI architecture does not reliably produce it. The result is a gap between the level of trust that AI systems are extended in consequential deployment contexts and the level of trust that the evidence about those systems actually warrants.

Coherency is not a feature on a roadmap. It is a precondition for the kind of trust that makes AI systems genuinely useful in consequential work. Treating it as a quality enhancement rather than a foundational requirement is the error. Correcting that error requires both better measurement and better architecture. Neither is simple. Both are necessary.

ALSI Inc. · Advanced Governance Intelligence
Human.Exe · Research Division
For source material inquiries or research access: subject to NDA
All frameworks, methodologies, and structural analyses described herein are proprietary IP of ALSI Inc.