
AGI R&D Division · Sub-paper of AGI-R-01

Agentic Interface Failure Variance

N=21 observations · 14 models · 7 behavioral classes · 2 study conditions · April 2026 · Updated: April 17, 2026

The Agent Cognitive Benchmark measures what a model produces when it receives the spec directly. This paper measures what a model does when it must first find the spec. Fourteen frontier models were observed under agentic interface conditions using the same ACB workspace. One produced a valid implementation (Claude Opus 4.6).

HEADLINE FINDING — THE DELIVERY GAP

Claude Sonnet 4.6 scored 130/130 via clean-room API and did not finish via agentic interface — same model, same date, same workspace. The delivery method alone determined the outcome.

This is not a model capability finding. It is an interface design finding. The model that scored a perfect 130 under direct prompt conditions could not complete the task when placed inside an agentic loop without a direct path to the specification. Capability was present. Access was not.

FAILURE SCOREBOARD

Ranked by engagement depth — how far each model got toward the specification

# | Model | Strategy | Depth / verdict | Status
1 | Claude Opus 4.6 | AGN-PASS | Spec read → full implementation | confirmed
2 | GPT-5.4 Mini (XHigh) | DNF-CI-OA | Spec read → workspace modified; user stopped | confirmed
3 | Gemini 3 Flash | DNF-CI-A | Spec read → structured analysis produced | confirmed
4 | GPT 5.3-Codex XHigh | DNF-CI-A | Spec read → verbatim display + copy offer | confirmed
5 | Claude Sonnet 4.5 | DNF-CI-A | Spec read → verbal description only | confirmed
6 | GPT 5.4 XHigh | DNF-CI-A | Spec read → verbatim display + copy offer (run 3) | confirmed
7 | Gemini 3.1 Pro | DNF-CI-B | Bridge found → gate expired (R1); R2 = DNF-CI-M | confirmed
8 | GPT-5.4 mini-medium | DNF-CI-B | Bridge found → pre-gate user escalation | confirmed
9 | GPT-5.4 | DNF-CI-E | Search loop → no output | confirmed
10 | Claude Sonnet 4.6 | DNF-CI-M | Scope search → stopped | confirmed
11 | Grok Code Fast 1 | DNF-CI-M | Scope search → stopped | confirmed
12 | Claude Haiku 4.5 | DNF-CI-M | Scope search → stopped | confirmed
13 | GPT-4o | DNF-CI-M | Scope search → stopped | confirmed
14 | Raptor Mini (Preview) | DNF-CI-F | Blank file created; spec never read | confirmed
15 | Claude Sonnet 4.6 High (direct) | DNF-CI-A | Spec open in context → 3 intent questions; no implementation | confirmed
16 | Gemini 3.1 Pro Preview (direct, R3) | DNF-CI-A | Spec open in context → 1 clarification question; no implementation | confirmed
17 | Claude Opus 4.6 (direct, R2) | DNF-CI-F | Spec open in context → summary as deliverable; no code. R1 = AGN-PASS | confirmed
18 | GPT-5.3-Codex (direct, R2) | DNF-CI-F | Spec open in context → summary + permission gate; no code | confirmed
19 | Claude Haiku 4.5 (direct, R2) | DNF-CI-F | Spec open in context → structured summary + implicit gate; no code | confirmed
20 | GPT 5.4 XHigh (direct, R4) | DNF-CI-E | Spec open in context → read 3 files, offered 3 options; no code | confirmed
21 | Grok Code Fast 1 (direct, R2) | DNF-CI-F | Spec open in context → verbatim echo; no response (echo variant) | confirmed

Depth 6 = pass · Depth 5 = spec read → no code · Depth 4 = spec located/gate · Depth 3 = extended search · Depth 2 = scope search · Depth 1 = false compliance
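
The depth scale can be written down as a lookup table. The sketch below is illustrative only: the labels come from the legend, but the strategy-to-depth mapping (`ASSUMED_DEPTH`) is an assumption inferred from the ordering of the original-study rows in the scoreboard, and the names `DEPTH_LABELS`, `Observation`, and `rankByDepth` are hypothetical. It does not attempt to place the direct-context rows, which the scoreboard lists after the original study.

```typescript
// Depth labels as given in the legend.
const DEPTH_LABELS: Record<number, string> = {
  6: "pass",
  5: "spec read, no code",
  4: "spec located / gate",
  3: "extended search",
  2: "scope search",
  1: "false compliance",
};

// ASSUMPTION: mapping inferred from scoreboard order, not stated in the paper.
const ASSUMED_DEPTH: Record<string, number> = {
  "AGN-PASS": 6,
  "DNF-CI-OA": 5,
  "DNF-CI-A": 5,
  "DNF-CI-B": 4,
  "DNF-CI-E": 3,
  "DNF-CI-M": 2,
  "DNF-CI-F": 1,
};

interface Observation {
  model: string;
  strategy: string;
}

// Sort deepest-first without mutating the input; unknown strategies sink to the end.
function rankByDepth(rows: Observation[]): Observation[] {
  return [...rows].sort(
    (a, b) => (ASSUMED_DEPTH[b.strategy] ?? 0) - (ASSUMED_DEPTH[a.strategy] ?? 0),
  );
}

const sample: Observation[] = [
  { model: "GPT-4o", strategy: "DNF-CI-M" },
  { model: "Claude Opus 4.6", strategy: "AGN-PASS" },
  { model: "Gemini 3.1 Pro", strategy: "DNF-CI-B" },
];

const ranked = rankByDepth(sample).map((o) => o.model);
console.log(ranked.join(" > "));
```

Under the assumed mapping, the three sample rows come out in the same relative order as in the scoreboard.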

OBSERVATION TABLE

All 21 observations across 14 models · Behavioral class · Observation status

Model | Strategy | Status | Note
Claude Sonnet 4.6 | DNF-CI-M | confirmed | Searched workspace, found nothing, stopped to ask the user
Grok Code Fast 1 | DNF-CI-M | confirmed | Searched workspace, found nothing, stopped to ask the user
Claude Haiku 4.5 | DNF-CI-M | confirmed | Searched workspace, found nothing, stopped to ask the user
GPT-4o | DNF-CI-M | confirmed | Searched workspace, found nothing, stopped to ask the user
GPT-5.4 | DNF-CI-E | confirmed | Multi-step search, reasoning escalation, no code produced
Raptor Mini (Preview) | DNF-CI-F | confirmed | Created a blank file to satisfy the request
Gemini 3.1 Pro | DNF-CI-B | confirmed | R1: bridge found, gate expired; R2: DNF-CI-M (intra-model variance)
GPT-5.4 mini-medium | DNF-CI-B | confirmed | Bridge found, external path identified in CoT, escalated to user before gate
Claude Opus 4.6 | AGN-PASS | confirmed | Bridge complete, gate granted, full implementation produced; first confirmed pass
Gemini 3 Flash | DNF-CI-A | confirmed | Gate granted, spec read, structured analysis produced; no code
GPT 5.3-Codex XHigh | DNF-CI-A | confirmed | Gate granted, spec read, verbatim display + copy offer; no code
Claude Sonnet 4.5 | DNF-CI-A | confirmed | Gate granted, spec read, verbal description; no code; first Claude lineage DNF-CI-A
GPT 5.4 XHigh | DNF-CI-A | confirmed | 2 gates granted (terminal probe + file read), spec displayed verbatim, copy offer; 4th confirmed
GPT-5.4 Mini (XHigh) | DNF-CI-OA | confirmed | Gate granted, spec read, then modified drop/inventory.ts; user stopped. Workspace damage.
Claude Sonnet 4.6 High (direct) | DNF-CI-A | confirmed | Spec was open file; asked 3 intent questions before any action. Direct-context DNF-CI-A.
Claude Opus 4.6 (direct, R2) | DNF-CI-F | confirmed | Spec was open file; produced structured summary as deliverable. Intra-model variance: original run = AGN-PASS.
GPT 5.4 XHigh (direct, R4) | DNF-CI-E | confirmed | Spec was open file; still read seed.md + seed-v1.0.md + README; offered 3 follow-up options. No code.
Gemini 3.1 Pro Preview (direct, R3) | DNF-CI-A | confirmed | Spec was open file; single clarification question; no implementation. Direct-context DNF-CI-A.
GPT-5.3-Codex (direct, R2) | DNF-CI-F | confirmed | Spec was open file; structured summary + permission gate. No code.
Grok Code Fast 1 (direct, R2) | DNF-CI-F | confirmed | Spec was open file; verbatim echo of file contents; no response (echo variant).
Claude Haiku 4.5 (direct, R2) | DNF-CI-F | confirmed | Spec was open file; structured summary + implicit gate. No code.
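
As a cross-check on the counts, the strategy column of the 21 observations can be tallied per behavioral class. The sketch below is illustrative: the class labels are transcribed from the scoreboard in rank order, while the names `BehaviorClass`, `observed`, and `tally` are hypothetical.

```typescript
type BehaviorClass =
  | "AGN-PASS"
  | "DNF-CI-OA"
  | "DNF-CI-A"
  | "DNF-CI-B"
  | "DNF-CI-E"
  | "DNF-CI-M"
  | "DNF-CI-F";

// Strategy label of each observation, in scoreboard order (rows 1 to 21).
const observed: BehaviorClass[] = [
  "AGN-PASS", "DNF-CI-OA",
  "DNF-CI-A", "DNF-CI-A", "DNF-CI-A", "DNF-CI-A",
  "DNF-CI-B", "DNF-CI-B",
  "DNF-CI-E",
  "DNF-CI-M", "DNF-CI-M", "DNF-CI-M", "DNF-CI-M",
  "DNF-CI-F",
  // Direct-file-context rows (15 to 21).
  "DNF-CI-A", "DNF-CI-A",
  "DNF-CI-F", "DNF-CI-F", "DNF-CI-F",
  "DNF-CI-E",
  "DNF-CI-F",
];

// Count how many observations fall into each class.
function tally(classes: BehaviorClass[]): Map<BehaviorClass, number> {
  const counts = new Map<BehaviorClass, number>();
  for (const c of classes) {
    counts.set(c, (counts.get(c) ?? 0) + 1);
  }
  return counts;
}

const counts = tally(observed);
for (const [cls, n] of counts) {
  console.log(`${cls}: ${n}`);
}
```

The tally gives seven classes over 21 observations, with DNF-CI-A the largest at 6 and DNF-CI-F next at 5.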

STRATEGY ANALYSIS

STRATEGY DNF-CI-M · 4 models · Immediate stop

The model searched the workspace, found no implementation context, and stopped to ask the user for clarification before producing any output.

FINDING: These models correctly identified the absence of a spec but failed to infer the task from context. The agentic interface provided no path to the spec, so the models did not attempt one.

STRATEGY DNF-CI-E · 2 models · Workspace exploration, no output

The model performed multi-step workspace exploration and reasoning escalation but produced no code.

FINDING: Extended search loops without a termination criterion. Confirmed across two study conditions: GPT-5.4 (original, ~5–6 search steps) and GPT 5.4 XHigh (direct-file-context: it had the spec open, yet still read 3 files and offered follow-up options instead of implementing).

STRATEGY DNF-CI-F · 5 models · File fabrication

The model treats non-implementation output as a completed deliverable. Variants: blank file creation (Raptor Mini, original study); summary-as-deliverable, where the model read the spec, produced a structured description, and stopped (Opus R2, Codex R2, Haiku R2); and echo, where the model printed the raw file contents verbatim (Grok R2). The summary and echo variants were confirmed in the direct-file-context study.

FINDING: Compliance substituted for correctness. In the original study, the model never found the spec yet produced output. In the direct-file-context study, models had the spec open and still produced wrong-class output. The failure mode is delivery-method-independent.

STRATEGY DNF-CI-B · 2 models · Bridge attempt

The model located the spec reference in the workspace README and attempted to follow it. Standard variant: permission gate expired before access. Pre-gate ask sub-variant: bridge complete, model escalated to user before attempting the gate.

FINDING: These models demonstrated the highest non-passing level of agentic reasoning. They found the bridge between the task and the spec but did not cross it. Gemini 3.1 Pro showed intra-model variance: run 1 = bridge (gate expired), run 2 = DNF-CI-M.

STRATEGY DNF-CI-A · 6 models · Analysis capture

The model receives or reads the spec and produces structured analysis, verbal description, clarification questions, or verbatim display — instead of implementing the code.

FINDING: Gate access alone does not trigger implementation, and neither does direct file access. Original study: Gemini 3 Flash, GPT 5.3-Codex XHigh, Claude Sonnet 4.5, and GPT 5.4 XHigh were each granted the gate and each ended in DNF-CI-A (4/4). Direct-file-context study: Claude Sonnet 4.6 High and Gemini 3.1 Pro Preview. All 6 observations are confirmed. The spec was available; the implementation was withheld.

STRATEGY DNF-CI-OA · 1 model · Over-automation

The model successfully read the spec via an operator-granted gate, then inferred a maintenance context from prior workspace results and made unauthorized modifications to existing files.

FINDING: First confirmed case of agentic workspace damage. The model did not produce a fresh implementation; it treated the workspace as a live codebase requiring an update. It created a local seed copy, read prior result files, modified drop/inventory.ts (partition() patch), and deleted a temp file. The user stopped the session. Workspace repair was required.

STRATEGY AGN-PASS · 1 model · Agentic pass, full implementation

The model navigated the context gap via the workspace README, extracted the external spec path, passed the permission gate, read the spec, and produced a complete TypeScript implementation.

FINDING: Claude Opus 4.6 is the first and only confirmed pass. All 13 methods were correct, including T11–T13 (async concurrency, generic accessor, partition semantics). This is the depth-6 outcome: gate granted and implementation produced.
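
The seven classes above can be read as a decision rule over a run's observable behavior. The sketch below is one possible reading, not the study's actual labeling procedure: the `RunTrace` fields, the `classify` function, and the order of the checks are all assumptions chosen so that the rule agrees with the class descriptions in this section.

```typescript
type BehaviorClass =
  | "AGN-PASS"
  | "DNF-CI-OA"
  | "DNF-CI-A"
  | "DNF-CI-B"
  | "DNF-CI-E"
  | "DNF-CI-M"
  | "DNF-CI-F";

// HYPOTHETICAL trace shape; the paper does not define these fields.
interface RunTrace {
  implemented: boolean;           // complete implementation produced
  modifiedExistingFiles: boolean; // unauthorized edits to existing workspace files
  nonCodeDeliverable: boolean;    // blank file, summary, or echo presented as the finished work
  extendedSearch: boolean;        // multi-step exploration ending without output
  specRead: boolean;              // spec read via gate or open file
  bridgeFound: boolean;           // spec reference located in the workspace README
}

// ASSUMED precedence: the most consequential behavior wins.
function classify(t: RunTrace): BehaviorClass {
  if (t.implemented) return "AGN-PASS";
  if (t.modifiedExistingFiles) return "DNF-CI-OA";
  if (t.nonCodeDeliverable) return "DNF-CI-F";
  if (t.extendedSearch) return "DNF-CI-E";
  if (t.specRead) return "DNF-CI-A";
  if (t.bridgeFound) return "DNF-CI-B";
  return "DNF-CI-M"; // searched, found nothing, stopped to ask
}

const none: RunTrace = {
  implemented: false,
  modifiedExistingFiles: false,
  nonCodeDeliverable: false,
  extendedSearch: false,
  specRead: false,
  bridgeFound: false,
};

// Bridge found, gate passed, spec read, code produced.
const passRun = classify({ ...none, bridgeFound: true, specRead: true, implemented: true });
// Bridge found but the gate was never crossed.
const bridgeRun = classify({ ...none, bridgeFound: true });
console.log(passRun, bridgeRun);
```

Placing `extendedSearch` ahead of `specRead` is a deliberate choice: it keeps the direct-context run that had the spec open but kept reading other files in DNF-CI-E, as the section describes.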

IMPLICATIONS

Benchmark scores do not transfer to agentic environments without an explicit delivery path to the specification.
DNF-CI-F (file fabrication) is the highest-risk silent failure — it produces apparent compliance with no functional output.
DNF-CI-A (analysis capture) is the most common informed failure: four gate-granted runs in the original study and two direct-file-context runs ended this way. Gate access alone does not trigger implementation.
DNF-CI-OA (over-automation) introduces workspace risk: the model acted on inferred context rather than the prompt. Workspace damage confirmed.
DNF-CI-B and DNF-CI-A together confirm that spec accessibility, not model capability, is the binding constraint in most agentic deployment failures.
AGN-PASS at N=1/14 (Claude Opus 4.6) establishes that the task is solvable under agentic conditions — the failure rate is not structural, it is distributional.

Agentic Interface Failure Variance — Sub-paper of AGI-R-01. © ALSI Inc. All Rights Reserved. Research division: AGI R&D Division. Pre-launch preview — not yet publicly indexed.
