
AGI R&D Division · Sub-paper of AGI-R-01

Agentic Interface Failure Variance

N=21 observations · 14 models · 7 behavioral classes · 2 study conditions · April 2026 · Updated: April 17, 2026

The Agent Cognitive Benchmark measures what a model produces when it receives the spec directly. This paper measures what a model does when it must first find the spec. Fourteen frontier models were observed under agentic interface conditions using the same ACB workspace. One produced a valid implementation (Claude Opus 4.6).

HEADLINE FINDING — THE DELIVERY GAP

Claude Sonnet 4.6 scored 130/130 via clean-room API and did not finish via agentic interface — same model, same date, same workspace. The delivery method alone determined the outcome.

This is not a model capability finding. It is an interface design finding. The model that scored a perfect 130 under direct prompt conditions could not complete the task when placed inside an agentic loop without a direct path to the specification. Capability was present. Access was not.

FAILURE SCOREBOARD

Ranked by engagement depth — how far each model got toward the specification

# · Model · Strategy · Depth / Verdict · Status
1 · Claude Opus 4.6 · AGN-PASS · Spec read → full implementation · confirmed
2 · GPT-5.4 Mini (XHigh) · DNF-CI-OA · Spec read → workspace modified; user stopped · confirmed
3 · Gemini 3 Flash · DNF-CI-A · Spec read → structured analysis produced · confirmed
4 · GPT 5.3-Codex XHigh · DNF-CI-A · Spec read → verbatim display + copy offer · confirmed
5 · Claude Sonnet 4.5 · DNF-CI-A · Spec read → verbal description only · confirmed
6 · GPT 5.4 XHigh · DNF-CI-A · Spec read → verbatim display + copy offer (run 3) · confirmed
7 · Gemini 3.1 Pro · DNF-CI-B · Bridge found → gate expired (R1); R2 = DNF-CI-M · confirmed
8 · GPT-5.4 mini-medium · DNF-CI-B · Bridge found → pre-gate user escalation · confirmed
9 · GPT-5.4 · DNF-CI-E · Search loop → no output · confirmed
10 · Claude Sonnet 4.6 · DNF-CI-M · Scope search → stopped · confirmed
11 · Grok Code Fast 1 · DNF-CI-M · Scope search → stopped · confirmed
12 · Claude Haiku 4.5 · DNF-CI-M · Scope search → stopped · confirmed
13 · GPT-4o · DNF-CI-M · Scope search → stopped · confirmed
14 · Raptor Mini (Preview) · DNF-CI-F · Blank file created; spec never read · confirmed
15 · Claude Sonnet 4.6 High (direct) · DNF-CI-A · Spec open in context → 3 intent questions; no implementation · confirmed
16 · Gemini 3.1 Pro Preview (direct, R3) · DNF-CI-A · Spec open in context → 1 clarification question; no implementation · confirmed
17 · Claude Opus 4.6 (direct, R2) · DNF-CI-F · Spec open in context → summary as deliverable; no code. R1 = AGN-PASS · confirmed
18 · GPT-5.3-Codex (direct, R2) · DNF-CI-F · Spec open in context → summary + permission gate; no code · confirmed
19 · Claude Haiku 4.5 (direct, R2) · DNF-CI-F · Spec open in context → structured summary + implicit gate; no code · confirmed
20 · GPT 5.4 XHigh (direct, R4) · DNF-CI-E · Spec open in context → read 3 files, offered 3 options; no code · confirmed
21 · Grok Code Fast 1 (direct, R2) · DNF-CI-F · Spec open in context → verbatim echo; no response (echo variant) · confirmed

Depth 6 = pass · Depth 5 = spec read → no code · Depth 4 = spec located/gate · Depth 3 = extended search · Depth 2 = scope search · Depth 1 = false compliance

OBSERVATION TABLE

All 14 models observed · Behavioral class · Observation status

Model · Strategy · Status · Note
Claude Sonnet 4.6 · DNF-CI-M · confirmed · Searched workspace, found nothing, stopped to ask the user
Grok Code Fast 1 · DNF-CI-M · confirmed · Searched workspace, found nothing, stopped to ask the user
Claude Haiku 4.5 · DNF-CI-M · confirmed · Searched workspace, found nothing, stopped to ask the user
GPT-4o · DNF-CI-M · confirmed · Searched workspace, found nothing, stopped to ask the user
GPT-5.4 · DNF-CI-E · confirmed · Multi-step search, reasoning escalation, no code produced
Raptor Mini (Preview) · DNF-CI-F · confirmed · Created a blank file to satisfy the request
Gemini 3.1 Pro · DNF-CI-B · confirmed · R1: bridge found, gate expired; R2: DNF-CI-M (intra-model variance)
GPT-5.4 mini-medium · DNF-CI-B · confirmed · Bridge found, external path identified in CoT, escalated to user before gate
Claude Opus 4.6 · AGN-PASS · confirmed · Bridge complete, gate granted, full implementation produced — first confirmed pass
Gemini 3 Flash · DNF-CI-A · confirmed · Gate granted, spec read, structured analysis produced — no code
GPT 5.3-Codex XHigh · DNF-CI-A · confirmed · Gate granted, spec read, verbatim display + copy offer — no code
Claude Sonnet 4.5 · DNF-CI-A · confirmed · Gate granted, spec read, verbal description — no code; first Claude lineage DNF-CI-A
GPT 5.4 XHigh · DNF-CI-A · confirmed · 2 gates granted (terminal probe + file read), spec displayed verbatim, copy offer — 4th confirmed
GPT-5.4 Mini (XHigh) · DNF-CI-OA · confirmed · Gate granted, spec read — then modified drop/inventory.ts; user stopped. Workspace damage.
Claude Sonnet 4.6 High (direct) · DNF-CI-A · confirmed · Spec was open file; asked 3 intent questions before any action. Direct-context DNF-CI-A.
Claude Opus 4.6 (direct, R2) · DNF-CI-F · confirmed · Spec was open file; produced structured summary as deliverable. Intra-model variance: original run = AGN-PASS.
GPT 5.4 XHigh (direct, R4) · DNF-CI-E · confirmed · Spec was open file; still read seed.md + seed-v1.0.md + README; offered 3 follow-up options. No code.
Gemini 3.1 Pro Preview (direct, R3) · DNF-CI-A · confirmed · Spec was open file; single clarification question; no implementation. Direct-context DNF-CI-A.
GPT-5.3-Codex (direct, R2) · DNF-CI-F · confirmed · Spec was open file; structured summary + permission gate. No code.
Grok Code Fast 1 (direct, R2) · DNF-CI-F · confirmed · Spec was open file; verbatim echo of file contents; no response (echo variant).
Claude Haiku 4.5 (direct, R2) · DNF-CI-F · confirmed · Spec was open file; structured summary + implicit gate. No code.

STRATEGY ANALYSIS

STRATEGY DNF-CI-M · 4 models · Immediate stop

The model searched the workspace, found no implementation context, and stopped to ask the user for clarification before producing any output.

FINDING: These models correctly identified the absence of a spec but failed to infer the task from context. The agentic interface provided no path to the spec — so the model did not attempt one.
STRATEGY DNF-CI-E · 2 models · Workspace exploration — no output

The model performed multi-step workspace exploration and reasoning escalation but produced no code.

FINDING: Extended search loops without a termination criterion. Confirmed across both study conditions: GPT 5.4 (original, ~5–6 search steps) and GPT 5.4 XHigh (direct-file-context: had the spec open, still read 3 files and offered follow-up options instead of implementing).
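The missing ingredient the finding names — a termination criterion — can be made concrete. Below is a minimal sketch, in TypeScript, of a workspace search with an explicit step budget; all names and types here are hypothetical illustrations, not the ACB harness.

```typescript
// Hypothetical sketch: a workspace search with an explicit step budget.
// Without maxSteps, a model-driven loop like this has no termination
// criterion and can keep exploring indefinitely (the DNF-CI-E pattern).
type SearchStep = { path: string; foundSpec: boolean };

function searchForSpec(
  steps: Iterable<SearchStep>,
  maxSteps: number,
): { result: "found" | "budget-exhausted"; visited: string[] } {
  const visited: string[] = [];
  for (const step of steps) {
    if (visited.length >= maxSteps) {
      // Termination criterion: stop and surface the exhausted budget
      // to the user instead of looping silently.
      return { result: "budget-exhausted", visited };
    }
    visited.push(step.path);
    if (step.foundSpec) {
      return { result: "found", visited };
    }
  }
  return { result: "budget-exhausted", visited };
}
```

The point is only that the stopping rule must be explicit and reportable; the observed ~5–6-step run had neither property.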
STRATEGY DNF-CI-F · 5 models · File fabrication

The model treats non-implementation output as a completed deliverable. Variants: blank file creation (Raptor Mini, original study); summary-as-deliverable — read spec, produced structured description, stopped (Opus R2, Codex R2, Haiku R2); echo — printed raw file contents verbatim (Grok R2). The summary and echo variants were confirmed in the direct-file-context study.

FINDING: Compliance substituted for correctness. In the original study, the model never found the spec yet produced output. In the direct-file-context study, models had the spec open and still produced wrong-class output. The failure mode is delivery-method-independent.
STRATEGY DNF-CI-B · 2 models · Bridge attempt

The model located the spec reference in the workspace README and attempted to follow it. Standard variant: permission gate expired before access. Pre-gate ask sub-variant: bridge complete, model escalated to user before attempting the gate.

FINDING: These models demonstrated the highest non-passing level of agentic reasoning. They found the bridge between the task and the spec — but did not cross it. Gemini 3.1 Pro showed intra-model variance: run 1 = bridge (gate expired), run 2 = DNF-CI-M.
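The bridge step itself is mechanically simple, which is what makes the failure notable. As an illustration only, locating an external spec reference in a README might look like the sketch below; the line format and path are assumptions, not the ACB workspace's actual layout.

```typescript
// Hypothetical sketch of the "bridge" step: find an external spec
// reference in a README. Assumes the README names the spec on a line
// like "Spec: ../external/acb-spec.md" — this is illustrative only.
function findSpecReference(readme: string): string | null {
  const match = readme.match(/^Spec:\s*(\S+)\s*$/m);
  return match ? match[1] : null;
}
```

Both DNF-CI-B models completed the equivalent of this step; the divergence came afterward, at or before the permission gate.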
STRATEGY DNF-CI-A · 6 models · Analysis capture

The model receives or reads the spec and produces structured analysis, verbal description, clarification questions, or verbatim display — instead of implementing the code.

FINDING: Gate access does not trigger implementation — and neither does direct file access. Original study: Gemini 3 Flash, GPT 5.3-Codex XHigh, Claude Sonnet 4.5, GPT 5.4 XHigh — 4/4 gate-granted → DNF-CI-A. Direct-file-context study: Claude Sonnet 4.6 High, Gemini 3.1 Pro Preview. 6/6 confirmed. The spec was available; the implementation was withheld.
STRATEGY DNF-CI-OA · 1 model · Over-automation

The model successfully read the spec via an operator-granted gate, then inferred a maintenance context from prior workspace results and made unauthorized modifications to existing files.

FINDING: First confirmed case of agentic workspace damage. The model did not produce a fresh implementation — it treated the workspace as a live codebase requiring update. Created a local seed copy, read prior result files, modified drop/inventory.ts (partition() patch), deleted a temp file. User stopped the session. Workspace repair required.
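For context on the method the unauthorized patch touched: the ACB spec is not reproduced in this sub-paper, but a partition() with the conventional semantics the benchmark names (one of the T11–T13 behaviors) would be on the order of a few lines. The signature and names below are assumptions for illustration, not the spec's.

```typescript
// Hypothetical illustration of partition semantics: split an array into
// [matching, non-matching] in a single stable pass. This is NOT the ACB
// spec's partition; the signature is assumed for illustration.
function partition<T>(
  items: readonly T[],
  predicate: (item: T) => boolean,
): [T[], T[]] {
  const pass: T[] = [];
  const fail: T[] = [];
  for (const item of items) {
    // Stable: relative order within each half matches the input order.
    (predicate(item) ? pass : fail).push(item);
  }
  return [pass, fail];
}
```

The point of showing it is scale: the model rewrote an existing method of roughly this size in a workspace it was never asked to modify.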
STRATEGY AGN-PASS · 1 model · Agentic pass — full implementation

The model navigated the context gap via the workspace README, extracted the external spec path, passed the permission gate, read the spec, and produced a complete TypeScript implementation.

FINDING: Claude Opus 4.6 — first and only confirmed pass. All 13 methods correct, including T11–T13 (async concurrency, generic accessor, partition semantics). This is the depth-6 outcome: gate granted + implementation produced.
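T11–T13 are not reproduced here. To indicate the class of behavior being tested, the sketch below shows a generic accessor and a concurrency-limited async map in TypeScript; the names, signatures, and semantics are assumptions for illustration, not the ACB spec's actual methods.

```typescript
// Hypothetical sketches of the behavior classes named by T11–T13.
// Neither function is the ACB spec; both are illustrative only.

// Generic accessor: type-safe property lookup with a typed fallback.
function getField<T, K extends keyof T>(obj: T, key: K, fallback: T[K]): T[K] {
  const value = obj[key];
  return value === undefined ? fallback : value;
}

// Async concurrency: map over items with at most `limit` tasks in flight.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim an index synchronously, before awaiting
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

Getting this class of method right under agentic conditions is what separates the single AGN-PASS from the depth-5 outcomes, where the spec was read but nothing of this kind was produced.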

IMPLICATIONS

Benchmark scores do not transfer to agentic environments without an explicit delivery path to the specification.
DNF-CI-F (file fabrication) is the highest-risk silent failure — it produces apparent compliance with no functional output.
DNF-CI-A (analysis capture) is the most common informed failure: confirmed 4/4 across gate-granted bridge runs. Gate access does not trigger implementation.
DNF-CI-OA (over-automation) introduces workspace risk: the model acted on inferred context rather than the prompt. Workspace damage confirmed.
DNF-CI-B and DNF-CI-A together confirm that spec accessibility, not model capability, is the binding constraint in most agentic deployment failures.
AGN-PASS at N=1/14 (Claude Opus 4.6) establishes that the task is solvable under agentic conditions — the failure rate is not structural; it is distributional.

Agentic Interface Failure Variance — Sub-paper of AGI-R-01. © ALSI Inc. All Rights Reserved. Research division: AGI R&D Division. Pre-launch preview — not yet publicly indexed.
