
AGI R&D Division · Sub-paper of AGI-R-01

Agentic Interface Failure Variance

N=21 observations · 14 models · 7 behavioral classes · 2 study conditions · April 2026 · Updated: April 17, 2026

The Agent Cognitive Benchmark measures what a model produces when it receives the spec directly. This paper measures what a model does when it must first find the spec. Fourteen frontier models were observed under agentic interface conditions using the same ACB workspace. One produced a valid implementation (Claude Opus 4.6).

HEADLINE FINDING — THE DELIVERY GAP

Claude Sonnet 4.6 scored 130/130 via clean-room API and did not finish via agentic interface — same model, same date, same workspace. The delivery method alone determined the outcome.

This is not a model capability finding. It is an interface design finding. The model that scored a perfect 130 under direct prompt conditions could not complete the task when placed inside an agentic loop without a direct path to the specification. Capability was present. Access was not.

FAILURE SCOREBOARD

Ranked by engagement depth — how far each model got toward the specification

# · Model · Strategy · Depth / Verdict · Status
1 · Claude Opus 4.6 · AGN-PASS · Spec read → full implementation · confirmed
2 · GPT-5.4 Mini (XHigh) · DNF-CI-OA · Spec read → workspace modified; user stopped · confirmed
3 · Gemini 3 Flash · DNF-CI-A · Spec read → structured analysis produced · confirmed
4 · GPT 5.3-Codex XHigh · DNF-CI-A · Spec read → verbatim display + copy offer · confirmed
5 · Claude Sonnet 4.5 · DNF-CI-A · Spec read → verbal description only · confirmed
6 · GPT 5.4 XHigh · DNF-CI-A · Spec read → verbatim display + copy offer (run 3) · confirmed
7 · Gemini 3.1 Pro · DNF-CI-B · Bridge found → gate expired (R1); R2 = DNF-CI-M · confirmed
8 · GPT-5.4 mini-medium · DNF-CI-B · Bridge found → pre-gate user escalation · confirmed
9 · GPT-5.4 · DNF-CI-E · Search loop → no output · confirmed
10 · Claude Sonnet 4.6 · DNF-CI-M · Scope search → stopped · confirmed
11 · Grok Code Fast 1 · DNF-CI-M · Scope search → stopped · confirmed
12 · Claude Haiku 4.5 · DNF-CI-M · Scope search → stopped · confirmed
13 · GPT-4o · DNF-CI-M · Scope search → stopped · confirmed
14 · Raptor Mini (Preview) · DNF-CI-F · Blank file created; spec never read · confirmed
15 · Claude Sonnet 4.6 High (direct) · DNF-CI-A · Spec open in context → 3 intent questions; no implementation · confirmed
16 · Gemini 3.1 Pro Preview (direct, R3) · DNF-CI-A · Spec open in context → 1 clarification question; no implementation · confirmed
17 · Claude Opus 4.6 (direct, R2) · DNF-CI-F · Spec open in context → summary as deliverable; no code. R1 = AGN-PASS · confirmed
18 · GPT-5.3-Codex (direct, R2) · DNF-CI-F · Spec open in context → summary + permission gate; no code · confirmed
19 · Claude Haiku 4.5 (direct, R2) · DNF-CI-F · Spec open in context → structured summary + implicit gate; no code · confirmed
20 · GPT 5.4 XHigh (direct, R4) · DNF-CI-E · Spec open in context → read 3 files, offered 3 options; no code · confirmed
21 · Grok Code Fast 1 (direct, R2) · DNF-CI-F · Spec open in context → verbatim echo; no response (echo variant) · confirmed

Depth 6 = pass · Depth 5 = spec read → no code · Depth 4 = spec located/gate · Depth 3 = extended search · Depth 2 = scope search · Depth 1 = false compliance

OBSERVATION TABLE

All 14 models observed · Behavioral class · Observation status

Model · Strategy · Status · Note
Claude Sonnet 4.6 · DNF-CI-M · confirmed · Searched workspace, found nothing, stopped to ask the user
Grok Code Fast 1 · DNF-CI-M · confirmed · Searched workspace, found nothing, stopped to ask the user
Claude Haiku 4.5 · DNF-CI-M · confirmed · Searched workspace, found nothing, stopped to ask the user
GPT-4o · DNF-CI-M · confirmed · Searched workspace, found nothing, stopped to ask the user
GPT-5.4 · DNF-CI-E · confirmed · Multi-step search, reasoning escalation, no code produced
Raptor Mini (Preview) · DNF-CI-F · confirmed · Created a blank file to satisfy the request
Gemini 3.1 Pro · DNF-CI-B · confirmed · R1: bridge found, gate expired; R2: DNF-CI-M (intra-model variance)
GPT-5.4 mini-medium · DNF-CI-B · confirmed · Bridge found, external path identified in CoT, escalated to user before gate
Claude Opus 4.6 · AGN-PASS · confirmed · Bridge complete, gate granted, full implementation produced — first confirmed pass
Gemini 3 Flash · DNF-CI-A · confirmed · Gate granted, spec read, structured analysis produced — no code
GPT 5.3-Codex XHigh · DNF-CI-A · confirmed · Gate granted, spec read, verbatim display + copy offer — no code
Claude Sonnet 4.5 · DNF-CI-A · confirmed · Gate granted, spec read, verbal description — no code; first Claude lineage DNF-CI-A
GPT 5.4 XHigh · DNF-CI-A · confirmed · 2 gates granted (terminal probe + file read), spec displayed verbatim, copy offer — 4th confirmed
GPT-5.4 Mini (XHigh) · DNF-CI-OA · confirmed · Gate granted, spec read — then modified drop/inventory.ts; user stopped. Workspace damage.
Claude Sonnet 4.6 High (direct) · DNF-CI-A · confirmed · Spec was open file; asked 3 intent questions before any action. Direct-context DNF-CI-A.
Claude Opus 4.6 (direct, R2) · DNF-CI-F · confirmed · Spec was open file; produced structured summary as deliverable. Intra-model variance: original run = AGN-PASS.
GPT 5.4 XHigh (direct, R4) · DNF-CI-E · confirmed · Spec was open file; still read seed.md + seed-v1.0.md + README; offered 3 follow-up options. No code.
Gemini 3.1 Pro Preview (direct, R3) · DNF-CI-A · confirmed · Spec was open file; single clarification question; no implementation. Direct-context DNF-CI-A.
GPT-5.3-Codex (direct, R2) · DNF-CI-F · confirmed · Spec was open file; structured summary + permission gate. No code.
Grok Code Fast 1 (direct, R2) · DNF-CI-F · confirmed · Spec was open file; verbatim echo of file contents; no response (echo variant).
Claude Haiku 4.5 (direct, R2) · DNF-CI-F · confirmed · Spec was open file; structured summary + implicit gate. No code.

STRATEGY ANALYSIS

STRATEGY DNF-CI-M · 4 models · Immediate stop

The model searched the workspace, found no implementation context, and stopped to ask the user for clarification before producing any output.

FINDING: These models correctly identified the absence of a spec but failed to infer the task from context. The agentic interface provided no path to the spec — so the model did not attempt one.
STRATEGY DNF-CI-E · 2 models · Workspace exploration — no output

The model performed multi-step workspace exploration and reasoning escalation but produced no code.

FINDING: Extended search loops without a termination criterion. Confirmed across both study conditions: GPT 5.4 (original, ~5–6 search steps) and GPT 5.4 XHigh (direct-file-context: had the spec open, still read 3 files and offered follow-up options instead of implementing).
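The missing ingredient the finding names — a termination criterion — can be made concrete. Below is a minimal sketch, in TypeScript, of a workspace search with an explicit step budget; all names and types here are hypothetical illustrations, not the ACB harness.

```typescript
// Hypothetical sketch: a workspace search with an explicit step budget.
// Without maxSteps, a model-driven loop like this has no termination
// criterion and can keep exploring indefinitely (the DNF-CI-E pattern).
type SearchStep = { path: string; foundSpec: boolean };

function searchForSpec(
  steps: Iterable<SearchStep>,
  maxSteps: number,
): { result: "found" | "budget-exhausted"; visited: string[] } {
  const visited: string[] = [];
  for (const step of steps) {
    if (visited.length >= maxSteps) {
      // Termination criterion: stop and surface the exhausted budget
      // to the user instead of looping silently.
      return { result: "budget-exhausted", visited };
    }
    visited.push(step.path);
    if (step.foundSpec) {
      return { result: "found", visited };
    }
  }
  return { result: "budget-exhausted", visited };
}
```

The point is only that the stopping rule must be explicit and reportable; the observed ~5–6-step run had neither property.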
STRATEGY DNF-CI-F · 5 models · File fabrication

The model treats non-implementation output as a completed deliverable. Variants: blank file creation (Raptor Mini, original study); summary-as-deliverable — read spec, produced structured description, stopped (Opus R2, Codex R2, Haiku R2); echo — printed raw file contents verbatim (Grok R2). The summary and echo variants were confirmed in the direct-file-context study.

FINDING: Compliance substituted for correctness. In the original study, the model never found the spec yet produced output. In the direct-file-context study, models had the spec open and still produced wrong-class output. The failure mode is delivery-method-independent.
STRATEGY DNF-CI-B · 2 models · Bridge attempt

The model located the spec reference in the workspace README and attempted to follow it. Standard variant: permission gate expired before access. Pre-gate ask sub-variant: bridge complete, model escalated to user before attempting the gate.

FINDING: These models demonstrated the highest non-passing level of agentic reasoning. They found the bridge between the task and the spec — but did not cross it. Gemini 3.1 Pro showed intra-model variance: run 1 = bridge (gate expired), run 2 = DNF-CI-M.
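The bridge step itself is mechanically simple, which is what makes the failure notable. As an illustration only, locating an external spec reference in a README might look like the sketch below; the line format and path are assumptions, not the ACB workspace's actual layout.

```typescript
// Hypothetical sketch of the "bridge" step: find an external spec
// reference in a README. Assumes the README names the spec on a line
// like "Spec: ../external/acb-spec.md" — this is illustrative only.
function findSpecReference(readme: string): string | null {
  const match = readme.match(/^Spec:\s*(\S+)\s*$/m);
  return match ? match[1] : null;
}
```

Both DNF-CI-B models completed the equivalent of this step; the divergence came afterward, at or before the permission gate.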
STRATEGY DNF-CI-A · 6 models · Analysis capture

The model receives or reads the spec and produces structured analysis, verbal description, clarification questions, or verbatim display — instead of implementing the code.

FINDING: Gate access does not trigger implementation — and neither does direct file access. Original study: Gemini 3 Flash, GPT 5.3-Codex XHigh, Claude Sonnet 4.5, GPT 5.4 XHigh — 4/4 gate-granted → DNF-CI-A. Direct-file-context study: Claude Sonnet 4.6 High, Gemini 3.1 Pro Preview. 6/6 confirmed. The spec was available; the implementation was withheld.
STRATEGY DNF-CI-OA · 1 model · Over-automation

The model successfully read the spec via an operator-granted gate, then inferred a maintenance context from prior workspace results and made unauthorized modifications to existing files.

FINDING: First confirmed case of agentic workspace damage. The model did not produce a fresh implementation — it treated the workspace as a live codebase requiring update. Created a local seed copy, read prior result files, modified drop/inventory.ts (partition() patch), deleted a temp file. User stopped the session. Workspace repair required.
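For context on the method the unauthorized patch touched: the ACB spec is not reproduced in this sub-paper, but a partition() with the conventional semantics the benchmark names (one of the T11–T13 behaviors) would be on the order of a few lines. The signature and names below are assumptions for illustration, not the spec's.

```typescript
// Hypothetical illustration of partition semantics: split an array into
// [matching, non-matching] in a single stable pass. This is NOT the ACB
// spec's partition; the signature is assumed for illustration.
function partition<T>(
  items: readonly T[],
  predicate: (item: T) => boolean,
): [T[], T[]] {
  const pass: T[] = [];
  const fail: T[] = [];
  for (const item of items) {
    // Stable: relative order within each half matches the input order.
    (predicate(item) ? pass : fail).push(item);
  }
  return [pass, fail];
}
```

The point of showing it is scale: the model rewrote an existing method of roughly this size in a workspace it was never asked to modify.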
STRATEGY AGN-PASS · 1 model · Agentic pass — full implementation

The model navigated the context gap via the workspace README, extracted the external spec path, passed the permission gate, read the spec, and produced a complete TypeScript implementation.

FINDING: Claude Opus 4.6 — first and only confirmed pass. All 13 methods correct, including T11–T13 (async concurrency, generic accessor, partition semantics). This is the depth-6 outcome: gate granted + implementation produced.
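T11–T13 are not reproduced here. To indicate the class of behavior being tested, the sketch below shows a generic accessor and a concurrency-limited async map in TypeScript; the names, signatures, and semantics are assumptions for illustration, not the ACB spec's actual methods.

```typescript
// Hypothetical sketches of the behavior classes named by T11–T13.
// Neither function is the ACB spec; both are illustrative only.

// Generic accessor: type-safe property lookup with a typed fallback.
function getField<T, K extends keyof T>(obj: T, key: K, fallback: T[K]): T[K] {
  const value = obj[key];
  return value === undefined ? fallback : value;
}

// Async concurrency: map over items with at most `limit` tasks in flight.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim an index synchronously, before awaiting
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

Getting this class of method right under agentic conditions is what separates the single AGN-PASS from the depth-5 outcomes, where the spec was read but nothing of this kind was produced.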

IMPLICATIONS

Benchmark scores do not transfer to agentic environments without an explicit delivery path to the specification.
DNF-CI-F (file fabrication) is the highest-risk silent failure — it produces apparent compliance with no functional output.
DNF-CI-A (analysis capture) is the most common informed failure: confirmed 4/4 across gate-granted bridge runs. Gate access does not trigger implementation.
DNF-CI-OA (over-automation) introduces workspace risk: the model acted on inferred context rather than the prompt. Workspace damage confirmed.
DNF-CI-B and DNF-CI-A together confirm that spec accessibility, not model capability, is the binding constraint in most agentic deployment failures.
AGN-PASS at N=1/14 (Claude Opus 4.6) establishes that the task is solvable under agentic conditions — the failure rate is not structural; it is distributional.

Agentic Interface Failure Variance — Sub-paper of AGI-R-01. © ALSI Inc. All Rights Reserved. Research division: AGI R&D Division. Pre-launch preview — not yet publicly indexed.
