npm - ultimate-pi - Versions diffs - 0.1.2 → 0.1.3 - Mend

ultimate-pi 0.1.2 → 0.1.3


Files changed (516)

package/vault/wiki/concepts/feedforward-feedback-harness.md ADDED

@@ -0,0 +1,60 @@
+---
+type: concept
+title: "Feedforward-Feedback Harness Controls"
+created: 2026-04-30
+updated: 2026-04-30
+status: seed
+tags:
+  - harness
+  - controls
+  - feedforward
+  - feedback
+related:
+  - "[[agentic-harness]]"
+  - "[[harness-implementation-plan]]"
+  - "[[inline-post-edit-validation]]"
+sources:
+  - "[[bockeler2026-harness-engineering]]"
+  - "[[meng2026-agent-harness-survey]]"
+---
+# Feedforward-Feedback Harness Controls
+From Böckeler (2026, Martin Fowler): A cybernetic model of harness controls as guides (feedforward) and sensors (feedback), each split into computational and inferential execution types.
+## The Framework
+```
+FEEDFORWARD (Guides)                   FEEDBACK (Sensors)
+├─ Computational                       ├─ Computational
+│  ├─ Language servers                 │  ├─ Tests (unit, integration)
+│  ├─ CLIs, scripts                    │  ├─ Linters (ESLint, ruff)
+│  └─ Codemods                         │  ├─ Type checkers
+│                                      │  ├─ Mutation testing
+├─ Inferential                         │  └─ Structural tests (ArchUnit)
+│  ├─ AGENTS.md, skills                │
+│  ├─ Rules, conventions               ├─ Inferential
+│  ├─ Reference docs                   │  ├─ AI code review agents
+│  └─ How-to guides                    │  ├─ LLM-as-judge
+│                                      │  └─ Semantic analysis
+```
+## Mapping to Our Pipeline
+| Control Type | Our Implementation |
+|-------------|-------------------|
+| Feedforward-Computational | Tool schemas, `tsc --noEmit`, JSON schema validation |
+| Feedforward-Inferential | SKILL.md files, ADRs, wiki pages, AGENTS.md |
+| Feedback-Computational | Inline post-edit validation (Phase 12), final lint gate (Phase 16) |
+| Feedback-Inferential | L4 adversarial verification, L2 plan review, L1 spec review |
+## Key Insight
+> Separately, agents keep repeating the same mistakes (feedback-only) or encode rules but never find out whether they worked (feedforward-only). Both are needed.
+## Unsolved: Behaviour Harness
+Functional correctness verification remains the hardest problem. Current approach (AI-generated tests + manual testing) is insufficient. The "approved fixtures" pattern helps selectively but is not a wholesale answer.
+## Steering Loop
+Human's role: when an issue happens multiple times, improve the feedforward or feedback controls. This is harness engineering as ongoing practice, not one-time configuration.
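The two-by-two grid above can be sketched as a small data model. This is only an illustration: the `Control` type and `missingQuadrants` helper are hypothetical names, not part of the package, and the example entries are copied from the mapping table.

```typescript
// Hypothetical model of the 2x2 control grid; all names here are illustrative.
type Direction = "feedforward" | "feedback";
type Execution = "computational" | "inferential";

interface Control {
  name: string;
  direction: Direction;
  execution: Execution;
}

// One example per quadrant, taken from the mapping table.
const controls: Control[] = [
  { name: "tsc --noEmit", direction: "feedforward", execution: "computational" },
  { name: "SKILL.md files", direction: "feedforward", execution: "inferential" },
  { name: "Inline post-edit validation", direction: "feedback", execution: "computational" },
  { name: "L4 adversarial verification", direction: "feedback", execution: "inferential" },
];

// Returns the quadrants that have no control registered at all.
function missingQuadrants(cs: Control[]): string[] {
  const directions: Direction[] = ["feedforward", "feedback"];
  const executions: Execution[] = ["computational", "inferential"];
  const present = new Set(cs.map((c) => `${c.direction}-${c.execution}`));
  const missing: string[] = [];
  for (const d of directions) {
    for (const e of executions) {
      const key = `${d}-${e}`;
      if (!present.has(key)) missing.push(key);
    }
  }
  return missing;
}
```

A feedback-only harness would report both feedforward quadrants as missing, which is the gap the Key Insight section warns about.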

package/vault/wiki/concepts/five-root-cause-metrics-sentrux.md ADDED

@@ -0,0 +1,40 @@
+---
+type: concept
+title: "Five Root Cause Metrics (sentrux)"
+created: 2026-05-03
+tags:
+  - sentrux
+  - code-metrics
+  - graph-theory
+related:
+  - "[[Quality Signal (sentrux)]]"
+  - "[[sentrux]]"
+sources:
+  - "[[sentrux-docs-root-cause-metrics]]"
+---
+# Five Root Cause Metrics (sentrux)
+sentrux computes 5 independent metrics covering the complete structural properties of a directed attributed dependency graph:
+| # | Metric | Theory | Measures |
+|---|--------|--------|----------|
+| 1 | Modularity | Newman 2004 | Graph community detection — do files cluster into independent modules? |
+| 2 | Acyclicity | Martin 2003 | Circular dependency detection via Tarjan's SCC |
+| 3 | Depth | Lakos 1996 | Longest dependency chain length |
+| 4 | Equality | Gini 1912 | Even distribution of complexity (no god files) |
+| 5 | Redundancy | Kolmogorov 1963 | Dead/duplicate code |
+## Dimensional Completeness
+3 edge properties + 2 node properties = 5 total dimensions:
+- **Edge properties:** modularity (clustering), acyclicity (cycles), depth (chain length)
+- **Node properties:** equality (concentration), redundancy (unnecessary nodes)
+Adding more metrics would either overlap (entropy ≈ Gini) or measure runtime behavior outside static analysis.
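To make the Equality row concrete, here is a minimal sketch of a Gini coefficient over per-file complexity values. The page does not say how sentrux derives its equality score from Gini, so treat the `gini` function as a textbook calculation, not sentrux's implementation.

```typescript
// Gini coefficient of non-negative values.
// 0 means perfectly even; values near 1 mean complexity is concentrated in few files.
function gini(values: number[]): number {
  const n = values.length;
  if (n === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const total = sorted.reduce((sum, v) => sum + v, 0);
  if (total === 0) return 0;
  // G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, for i = 1..n over sorted x.
  let weighted = 0;
  sorted.forEach((v, i) => {
    weighted += (i + 1) * v;
  });
  return (2 * weighted) / (n * total) - (n + 1) / n;
}
```

Four files of equal complexity give 0, while one file holding all the complexity out of four gives 0.75, the maximum for that sample size.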
+## Why These Five?
+- **Modularity replaces** coupling, cohesion, god file detection, hotspot detection — all symptoms of low Q
+- **Acyclicity is fundamental:** cycles make build order undefined and testing impossible
+- **Depth captures** change propagation risk — deep chains mean small changes ripple far
+- **Equality targets** god files — the #1 source of AI agent confusion
+- **Redundancy captures** dead code — expands AI agent's search space without contributing behavior
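The Acyclicity row names Tarjan's SCC algorithm. The sketch below shows how circular dependencies could be found that way on a small import graph. The `Graph` type and `findCycles` helper are illustrative assumptions, and the recursive form is used for readability, not because sentrux is known to implement it this way.

```typescript
// Adjacency list: file -> files it depends on. Hypothetical representation.
type Graph = Map<string, string[]>;

// Tarjan's algorithm: returns the strongly connected components of a directed graph.
function tarjanScc(graph: Graph): string[][] {
  let counter = 0;
  const index = new Map<string, number>();
  const lowlink = new Map<string, number>();
  const onStack = new Set<string>();
  const stack: string[] = [];
  const components: string[][] = [];

  function visit(v: string): void {
    index.set(v, counter);
    lowlink.set(v, counter);
    counter++;
    stack.push(v);
    onStack.add(v);

    for (const w of graph.get(v) ?? []) {
      if (!index.has(w)) {
        visit(w);
        lowlink.set(v, Math.min(lowlink.get(v)!, lowlink.get(w)!));
      } else if (onStack.has(w)) {
        lowlink.set(v, Math.min(lowlink.get(v)!, index.get(w)!));
      }
    }

    // v is the root of a component: pop the stack down to v.
    if (lowlink.get(v) === index.get(v)) {
      const component: string[] = [];
      let w: string;
      do {
        w = stack.pop()!;
        onStack.delete(w);
        component.push(w);
      } while (w !== v);
      components.push(component);
    }
  }

  for (const v of graph.keys()) {
    if (!index.has(v)) visit(v);
  }
  return components;
}

// A cycle exists when a component has more than one node, or a node depends on itself.
function findCycles(graph: Graph): string[][] {
  return tarjanScc(graph).filter(
    (c) => c.length > 1 || (graph.get(c[0]) ?? []).includes(c[0]),
  );
}
```

An acyclic graph yields only single-node components, so `findCycles` returns an empty array.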

package/vault/wiki/concepts/fork-safe-spec-storage.md ADDED

@@ -0,0 +1,89 @@
+---
+type: concept
+title: "Fork-Safe Spec Storage"
+status: developing
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - harness
+  - spec-storage
+  - github-issues
+  - fork
+  - multi-tenant
+  - isolation
+related:
+  - "[[Research: GitHub Issues as Harness Spec Storage]]"
+  - "[[content-addressed-spec-identity]]"
+  - "[[spec-hardening]]"
+  - "[[harness-implementation-plan]]"
+sources:
+  - "[[github-fork-issues-discussion]]"
+  - "[[Research: GitHub Issues as Harness Spec Storage]]"
+---
+# Fork-Safe Spec Storage
+How ultimate-pi's harness keeps spec storage isolated across forks. When someone forks a project using ultimate-pi, zero upstream spec state leaks into the fork.
+## The Problem
+| Threat | Mechanism |
+|--------|-----------|
+| Stale local cache | `.pi/harness/specs/<id>.json` committed to git → forked with upstream issue URLs |
+| Empty issue tracker | Fork's Issues tab is disabled by default (historical) or starts empty (post-Dec 2025) |
+| Wrong repo context | `gh` CLI authenticated to upstream, not fork |
+| Divergent issue numbers | Fork issue #5 ≠ upstream issue #5. Merge = collision |
+## The Solution: Three-Layer Isolation
+### Layer 1: Gitignored Runtime Cache
+`.pi/harness/specs/` is in `.gitignore`. The cache is, by definition, runtime-only. It's never committed, never pushed, never forked. Each clone starts fresh.
+**Rationale**: The local JSON is a **speed cache**, not a source of truth. GitHub Issues are the durable store. Caches should never be version-controlled.
+### Layer 2: `harness init` Bootstrap
+On first run in any repo (fork or not), `ultimate-pi harness init`:
+```
+ultimate-pi harness init
+  ├─ Detect fork: gh repo view --json isFork
+  ├─ Enable Issues: if disabled, guide user or auto-enable via API
+  ├─ gh auth status: prompt login if missing
+  ├─ gh repo set-default: point to current repo (fork, not upstream)
+  ├─ Create labels: gh label create harness-spec, harness-task, layer-1..8, status:*
+  ├─ Create templates: .github/ISSUE_TEMPLATE/harness-spec.yml
+  ├─ Gitignore: ensure .pi/harness/specs/ is in .gitignore
+  └─ Done. Idempotent — re-running is a no-op.
+```
+### Layer 3: GitHub's Native Repo Scoping
+GitHub Issues are scoped to a repository. A fork is a separate repository. Upstream's issues are never copied to forks. Each fork creates its own issues in its own namespace.
+| Scope | Upstream (`owner/project`) | Fork (`forker/project`) |
+|-------|---------------------------|------------------------|
+| Issue #1 | "Initial spec: auth system" | (empty — fork starts fresh) |
+| Issue #2 | "Bug: memory leak" | (empty) |
+| After `harness init` | Unchanged | Issue #1: "Spec: forker's first feature" |
+## Why Labels Instead of Issue Numbers
+Labels are the only shared artifact between upstream and fork. This is by design:
+- **Labels are convention, not data**. `harness-spec`, `layer-3`, `status:active` are semantic tags that mean the same thing in any repo.
+- **Labels are cheap to recreate**. `harness init` creates them in 8 API calls.
+- **Labels don't collide**. Issue #5 in upstream with `harness-spec` and issue #5 in fork with `harness-spec` are different issues with the same semantic category. No conflict.
+## The Merge Problem — SOLVED
+Fork issue #5 and upstream issue #5 are different specs. When code merges, spec identities must be reconciled. **Solution**: content-addressed spec identity, using SHA-256 fingerprinting plus the `gh issue transfer` command. See [[content-addressed-spec-identity]] for the full architecture.
+Summary: every spec carries a content hash in its issue body (`<!-- spec-fp: <hash> -->`). Harness resolves by hash search across repos, not by brittle issue numbers. `ultimate-pi harness migrate` transfers specs from fork to upstream on merge. Idempotent, ~2-3 days to implement.
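The hash-based lookup summarized above might look like the following sketch. Only the marker format `<!-- spec-fp: <hash> -->` comes from the page. The `Issue` shape, the function names, and the assumption that the hash is written as hex digits are all hypothetical.

```typescript
// Hypothetical shape of an issue as the harness might see it.
interface Issue {
  repo: string; // e.g. "owner/project"
  number: number;
  body: string;
}

// Matches the marker quoted above. Assumes the hash is a hex string.
const SPEC_FP = /<!--\s*spec-fp:\s*([0-9a-f]+)\s*-->/i;

// Pull the fingerprint out of an issue body, or null if there is no marker.
function extractFingerprint(body: string): string | null {
  const match = SPEC_FP.exec(body);
  return match ? match[1].toLowerCase() : null;
}

// Find every issue, in any repo, that carries the given fingerprint.
function resolveByFingerprint(issues: Issue[], fingerprint: string): Issue[] {
  const wanted = fingerprint.toLowerCase();
  return issues.filter((issue) => extractFingerprint(issue.body) === wanted);
}
```

Because the lookup never reads the issue number, fork issue #5 and upstream issue #5 cannot be confused with each other.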
+## Enforcement
+- L7 Schema Orchestration enforces that every harness run checks `gh repo view` before creating issues
+- `harness init` is a prerequisite gate — harness refuses to create issues until init completes
+- Config file `.pi/harness/config.json` stores the canonical repo for issue operations
+- Runtime validation: if `gh repo view --json nameWithOwner` doesn't match config, harness warns and offers re-init

package/vault/wiki/concepts/fts5-sandbox.md ADDED

@@ -0,0 +1,19 @@
+---
+type: concept
+title: "fts5-sandbox"
+created: 2026-04-30
+updated: 2026-04-30
+status: seed
+tags: [#concept, #context-mode, #architecture]
+related:
+  - "[[context-mode]]"
+  - "[[Research: context-mode vs lean-ctx]]"
+---
+# FTS5 Sandbox
+> [!stub] This is a stub page. See [[context-mode]] for full documentation.
+context-mode's core architecture: intercept agent tool calls, run them in a sandboxed subprocess, index the output into SQLite FTS5 with BM25 ranking, and let the agent search the indexed output on demand. Raw tool output never enters the agent's context window — the agent queries the sandbox instead.
+Achieves up to 99.5% token reduction on large tool outputs (e.g., 56.2KB Playwright output → 299B).

package/vault/wiki/concepts/fuzzy-edit-matching.md ADDED

@@ -0,0 +1,71 @@
+---
+type: concept
+title: "Fuzzy Edit Matching"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - agent-tools
+  - token-reduction
+  - diff-algorithms
+related:
+  - "[[wozcode]]"
+  - "[[research-wozcode-token-reduction]]"
+  - "[[harness-implementation-plan]]"
+status: developing
+---
+# Fuzzy Edit Matching
+Fuzzy edit matching is a tool-level enhancement that makes code edits tolerant of minor formatting differences between what the model outputs and what exists on disk. Instead of failing on exact string mismatch, the edit tool applies a similarity-tolerant diff that lands near-misses without a retry round-trip.
+## Problem
+Standard edit tools (Claude Code's `edit`, our harness's `edit`) require exact `oldText` match. When the model generates an edit with:
+- Whitespace drift (tabs vs spaces, trailing whitespace)
+- Indentation differences (2-space vs 4-space blocks)
+- Visually-identical characters (curly quotes `“ ”` vs straight quotes `"`, em-dash `—` vs `--`)
+- Line-ending variance (`\n` vs `\r\n`)
+...the edit fails. The model receives the error, reformulates, and retries. Each retry is a full API round-trip costing input + output tokens.
+## Solution
+Fuzzy edit matching applies these normalization passes before attempting the match:
+1. **Whitespace normalization**: Collapse variable whitespace, strip trailing spaces
+2. **Character normalization**: Map curly quotes → straight, em-dashes → double-hyphens
+3. **Indentation-aware matching**: Match by content ignoring leading whitespace differences
+4. **Line-ending normalization**: Treat `\r\n` and `\n` as equivalent
+If the normalized `oldText` matches a unique region in the normalized file, the edit proceeds with the original (non-normalized) `newText` applied at the matched position.
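The normalization passes and the unique-match rule can be sketched as a line-based matcher. This is an assumption-laden illustration, not the harness's `edit` tool: `normalizeLine`, `fuzzyEdit`, and `EditResult` are hypothetical names, and matching whole lines is one simple way to map a normalized match back to the original file.

```typescript
// Normalize one line for comparison only; the file content itself is never rewritten with this.
function normalizeLine(line: string): string {
  return line
    .replace(/\r$/, "") // line-ending variance
    .replace(/[\u201C\u201D]/g, '"') // curly double quotes -> straight
    .replace(/[\u2018\u2019]/g, "'") // curly single quotes -> straight
    .replace(/\u2014/g, "--") // em-dash -> double hyphen
    .trim() // ignore indentation and trailing spaces
    .replace(/\s+/g, " "); // collapse variable whitespace
}

type EditResult =
  | { ok: true; content: string }
  | { ok: false; reason: "no-match" | "ambiguous" };

function fuzzyEdit(file: string, oldText: string, newText: string): EditResult {
  const fileLines = file.split("\n");
  const normalized = fileLines.map(normalizeLine);
  const target = oldText.split("\n").map(normalizeLine);

  // Collect every position where the normalized oldText lines match a run of file lines.
  const hits: number[] = [];
  for (let start = 0; start + target.length <= normalized.length; start++) {
    if (target.every((line, i) => line === normalized[start + i])) hits.push(start);
  }
  if (hits.length === 0) return { ok: false, reason: "no-match" };
  if (hits.length > 1) return { ok: false, reason: "ambiguous" };

  // Apply the original (non-normalized) newText at the matched position.
  const start = hits[0];
  const result = [
    ...fileLines.slice(0, start),
    ...newText.split("\n"),
    ...fileLines.slice(start + target.length),
  ];
  return { ok: true, content: result.join("\n") };
}
```

Returning `ambiguous` without editing leaves the fallback decision to the caller. Note that collapsing whitespace and mapping quotes also applies inside string literals, so a production version would need the guards that the Failure Modes section describes.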
+## Token Savings
+WOZCODE claims this eliminates a significant fraction of retry round-trips (Source: [[wozcode]]). Each avoided retry saves:
+- Input tokens: re-sending file content + error message + conversation history
+- Output tokens: reformulated edit attempt
+For a typical session, edit retries can account for 5-15% of total token spend.
+## Failure Modes
+- **Ambiguous match**: Normalized oldText matches multiple locations → must fall back to exact match
+- **Over-normalization**: Aggressive normalization changes semantics (e.g., normalizing quotes inside string literals)
+- **Silent wrong edits**: Fuzzy match lands on wrong location with similar content → bug introduced silently
+- **Confidence in large files**: Single-line matches in large files may be ambiguous
+## Relationship to WOZCODE Quality Loop
+Fuzzy edit matching is one component of WOZCODE's three-lever quality loop:
+1. **Fuzzy matching** → lands near-misses (prevents retries)
+2. **Post-edit validation** → catches actual errors (prevents cascading failures)
+3. **Better error context** → when an error reaches the model, it gets actionable details (reduces retry count)
+## Implementation Path for Our Harness
+1. Add normalization layer to the `edit` tool in `lib/harness-executor.ts`
+2. Configurable normalization: `fuzzy_edit: { normalize_whitespace: true, normalize_quotes: true }`
+3. Ambiguity detection: if multiple matches, log warning and fall back to exact
+4. Integration point: L3 [[grounding-checkpoints]], intercept edit failures and re-attempt with fuzzy matching before surfacing to model
+5. Track fuzzy-match rate vs exact-match rate as observability metric (L5 [[automated-observability]])

package/vault/wiki/concepts/gemini-cli-architecture.md ADDED

@@ -0,0 +1,104 @@
+---
+type: concept
+title: "Gemini CLI Architecture (SOTA)"
+created: 2026-05-01
+updated: 2026-05-01
+status: stable
+tags:
+  - gemini-cli
+  - architecture
+  - coding-agents
+  - harness
+related:
+  - "[[harness-engineering-first-principles]]"
+  - "[[agent-skills-pattern]]"
+  - "[[policy-engine-pattern]]"
+sources:
+  - "[[Source: Google Gemini CLI Architecture Docs]]"
+  - "[[Source: Gemini CLI Changelogs]]"
+  - "[[Source: Google Blog - Gemini CLI Announcement]]"
+---
+# Gemini CLI Architecture (SOTA)
+## Overview
+Gemini CLI is Google's open-source AI coding agent (Apache 2.0, launched June 2025, now v0.40+ with 103k GitHub stars, 6,005 commits, 476 releases). It brings Gemini models (2.5 Pro, 3.0 Pro/Flash, 3.1 Pro) directly into the terminal with a ReAct loop, built-in tools, MCP support, and an extensible architecture.
+## Core Architecture
+```
+User Input → packages/cli (frontend) → packages/core (backend) → Gemini API
+                  ↑                          ↓                        ↓
+            Display rendering         Tool execution          Tool request / response
+                                      State management        Prompt construction
+                                      API client
+```
+Two-package separation: CLI handles UX (input, history, display, themes, config). Core handles logic (API client, prompts, tool registry, state management, session config).
+## Key SOTA Harness Components
+### 1. Agent Skills (v0.23+)
+Progressive disclosure: skills loaded on-demand via `activate_skill` tool. Prevents context rot. Formalized with frontmatter, `/memory inbox` for human review of extracted skills, `skill-creator` meta-skill. [[agent-skills-pattern]]
+### 2. Plan Mode (v0.29+)
+Structured task decomposition: `/plan` command, todo tracking, annotations, research subagents, external editor support. Enabled by default v0.34. Model steering allows human to guide plan direction.
+### 3. Codebase Investigator (v0.12+)
+JIT context discovery: subagent automatically explores workspace, resolves relevant files, loads into context. Enhanced with JIT context injection in v0.36. Configurable turn limits.
+### 4. Policy Engine (v0.18+)
+Pre-execution tool gates: project-level policies, MCP server wildcards, tool annotation matching, persistent approvals, context-aware policies. [[policy-engine-pattern]]
+### 5. Event-Driven Hooks (v0.27+)
+Event-driven scheduler for tool execution. Hooks for compaction, continuation, lint checks. MessageBus architecture for internal communication. Queued tool confirmations.
+### 6. Context Compression Service (v0.38+)
+Advanced context management distilling conversation history. Configurable threshold. Complements (doesn't replace) progressive disclosure.
+### 7. Chapters Narrative Flow (v0.38+)
+Sessions grouped by intent and tool usage. Provides structural narrative across long sessions. Enhances human review UX.
+### 8. Subagents + Remote Agents (v0.32+)
+Generalist agent for task routing. Specialist subagents with JIT context. Remote agents via A2A protocol (v0.33). Resilient tool rejection with contextual feedback.
+### 9. Memory System (v0.39+)
+Four-tier memory: prompt-driven transition from static files. Auto Memory (experimental). `/memory inbox` for reviewing extracted patterns.
+### 10. Multi-Registry Architecture (v0.36+)
+Extensions, skills, MCP servers as separate registries. Extensions loaded in parallel. Partner ecosystem: 20+ extensions launched by v0.12.
+### 11. Browser Agent (v0.31+)
+Experimental browser agent with persistent sessions. Dynamic tool discovery. Chrome DevTools Protocol access.
+### 12. Model Routing (v0.12+)
+Auto-selects Flash for simple tasks, Pro for complex. Configurable. Model steering in workspace (v0.32).
+### 13. Sandboxing Stack (v0.34+)
+Docker, gVisor, LXC, macOS Seatbelt, Windows sandboxing. Dynamic expansion. Tool isolation via SandboxManager.
+### 14. Git Worktrees (v0.36+)
+Isolated parallel agent sessions. Multiple agents on same repo without conflicts.
+### 15. Extensions Ecosystem (v0.8+)
+Partner extensions (Hugging Face, Redis, Eleven Labs, Browserbase, etc.). Custom extensions. A2A protocol. SDK package (v0.30).
+## Technology Stack
+- **Language**: TypeScript 98% + JavaScript 1.8%
+- **Packaging**: npm (`@google/gemini-cli`), Homebrew, MacPorts, Anaconda
+- **Build**: esbuild
+- **Testing**: Integration tests, E2E tests, performance tests, memory tests
+- **Models**: Gemini 2.5 Pro, 3.0 Pro/Flash, 3.1 Pro, Gemma (local)
+- **Context**: 1M token window (2.5 Pro)
+## Free Tier Economics
+- 60 requests/minute, 1,000 requests/day
+- Personal Google account (OAuth)
+- Billed by Google as the industry's largest free allowance
+- Also supports: Gemini API key, Vertex AI, Code Assist licenses
+## Relevance to Ultimate-PI
+Gemini CLI's architecture validates our two-layer separation (CLI/Core ≈ our harness pipeline + tool layer). Their rapid iteration model (weekly releases, experimental → preview → stable → default) is a deployment pattern we should adopt. Their ecosystem-first approach (extensions, skills registries) suggests our tool/extension system should be designed for community contribution.

package/vault/wiki/concepts/generator-evaluator-architecture.md ADDED

@@ -0,0 +1,64 @@
+---
+type: concept
+title: "Generator-Evaluator Architecture"
+created: 2026-04-30
+updated: 2026-04-30
+status: seed
+tags:
+  - harness
+  - multi-agent
+  - evaluator
+  - generator
+  - adversarial
+related:
+  - "[[adversarial-verification]]"
+  - "[[consensus-debate]]"
+  - "[[agentic-harness]]"
+sources:
+  - "[[anthropic2026-harness-design]]"
+---
+# Generator-Evaluator Architecture
+A GAN-inspired multi-agent pattern where an evaluator agent grades a generator agent's output against explicit criteria. From Anthropic's harness design work (Rajasekaran, 2026).
+## Core Pattern
+```
+Generator ──produces──► Output
+                           │
+                           ▼
+Evaluator ◄──tests────────┘
+    │
+    ├─ Grade against criteria
+    ├─ Write detailed critique
+    └─ Feedback → Generator (next iteration)
+```
+## Why Separate Generator and Evaluator
+**Fundamental finding**: When asked to evaluate their own work, agents "respond by confidently praising the work — even when, to a human observer, the quality is obviously mediocre." Separating the two roles solves this. Tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work.
+## Evaluator Tuning
+Claude is "a poor QA agent out of the box": it identifies legitimate issues, then talks itself into deciding they aren't a big deal. Making it a useful evaluator requires an explicit tuning loop:
+1. Read evaluator logs
+2. Find examples where its judgment diverges from human judgment
+3. Update QA prompt to solve for those issues
+4. Repeat until evaluator grades reasonably
+## Sprint Contracts
+Before writing code, generator and evaluator negotiate a contract:
+- Generator proposes what it will build and how success will be verified
+- Evaluator reviews proposal to ensure it builds the right thing
+- Iterate until agreement
+- Generator builds against the agreed-upon contract
+- Communication via files
+## Relevance to Our Harness
+Our L4 (Adversarial Verification) maps to the evaluator role, but with one critical gap: we lack **explicit grading criteria with hard thresholds**. Our critics give narrative feedback, not falsifiable pass/fail on specific criteria. The sprint contract pattern (agree on "done" before code) could be integrated at L2 (plan review) before L3 execution.
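One way to picture "explicit grading criteria with hard thresholds" is a loop where the evaluator returns a score per criterion and the harness, not the evaluator's narrative, decides pass or fail. Everything below is a hypothetical sketch: the types, the 0 to 1 score scale, and the synchronous function signatures are simplifying assumptions (real agent calls would be asynchronous).

```typescript
// Hypothetical types; scores are assumed to lie in the range 0..1.
interface Criterion {
  name: string;
  threshold: number; // minimum passing score
}

interface Grade {
  scores: Record<string, number>;
  critique: string;
}

type Generator = (task: string, feedback: string | null) => string;
type Evaluator = (output: string, criteria: Criterion[]) => Grade;

// Names of criteria whose score falls below the agreed threshold.
function failedCriteria(grade: Grade, criteria: Criterion[]): string[] {
  return criteria
    .filter((c) => (grade.scores[c.name] ?? 0) < c.threshold)
    .map((c) => c.name);
}

// Generate, grade, and feed the critique back until every criterion passes.
function runLoop(
  task: string,
  generate: Generator,
  evaluate: Evaluator,
  criteria: Criterion[],
  maxIterations: number,
): { output: string; passed: boolean; iterations: number } {
  let feedback: string | null = null;
  let output = "";
  for (let i = 1; i <= maxIterations; i++) {
    output = generate(task, feedback);
    const grade = evaluate(output, criteria);
    const failed = failedCriteria(grade, criteria);
    if (failed.length === 0) return { output, passed: true, iterations: i };
    feedback = `Failed: ${failed.join(", ")}. ${grade.critique}`;
  }
  return { output, passed: false, iterations: maxIterations };
}
```

The `criteria` array plays the role of the sprint contract: it is fixed before generation starts, so "done" is falsifiable and is not renegotiated after the fact.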
+## When Is It Worth It?
+The evaluator is worth the cost when the task sits beyond what the current model does reliably on its own. As models improve, that boundary moves: tasks that used to need evaluation may no longer need it. Whether to use an evaluator is therefore a judgment to revisit, not a fixed yes/no decision.

package/vault/wiki/concepts/guardian-agent-pattern.md ADDED

@@ -0,0 +1,67 @@
+---
+aliases: ["guardian agent", "pre-execution safety", "agent guardrails"]
+type: concept
+title: "Guardian Agent Pattern"
+created: 2026-04-30
+status: developing
+tags:
+  - concept
+  - guardian-agent
+  - safety
+  - agent-reliability
+related:
+  - "[[Research: Meta-Agent Context Drift Detection]]"
+  - "[[meta-agent-context-pruning]]"
+  - "[[context-drift-in-agents]]"
+  - "[[vectara-guardian-agents]]"
+  - "[[ironclaw-drift-monitor]]"
+updated: 2026-05-02
+---
+# Guardian Agent Pattern
+A design pattern where a separate agent (or rule-based system) monitors and validates another agent's actions — either before execution (proactive) or after detecting failure patterns (reactive). Emerging as a standard approach for production agent reliability.
+## Two Variants
+### Proactive Guardian (Pre-Execution)
+Validates proposed actions BEFORE they execute. Blocks unsafe, incorrect, or incomplete tool calls.
+**Example**: Vectara's Guardian Agents check three things before any tool executes:
+1. Unnecessary tools (should this tool be part of the plan?)
+2. Missing required tools (does the plan include all needed tools?)
+3. Argument validation (are parameters correct, present, properly structured?)
+Feedback aggregated and returned to agent for plan revision. (Source: [[vectara-guardian-agents]])
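The three pre-execution checks can be illustrated with a rule-based stand-in. This is not Vectara's implementation, which the page does not describe in detail: `ToolCall`, `PlanPolicy`, and `reviewPlan` are hypothetical, and argument validation here only checks that required arguments are present.

```typescript
// Hypothetical types for a planned tool call and the policy it is checked against.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

interface PlanPolicy {
  tools: Map<string, { requiredArgs: string[] }>; // tools that belong in this plan
  requiredTools: string[]; // tools the plan must include
}

// Runs the three checks and returns aggregated feedback for plan revision.
function reviewPlan(plan: ToolCall[], policy: PlanPolicy): string[] {
  const issues: string[] = [];
  const used = new Set(plan.map((call) => call.tool));

  for (const call of plan) {
    const spec = policy.tools.get(call.tool);
    if (spec === undefined) {
      // Check 1: unnecessary tools.
      issues.push(`unnecessary tool: ${call.tool}`);
      continue;
    }
    // Check 3: argument validation (presence only in this sketch).
    for (const arg of spec.requiredArgs) {
      if (!Object.prototype.hasOwnProperty.call(call.args, arg)) {
        issues.push(`missing argument: ${call.tool}.${arg}`);
      }
    }
  }

  // Check 2: missing required tools.
  for (const tool of policy.requiredTools) {
    if (!used.has(tool)) issues.push(`missing required tool: ${tool}`);
  }
  return issues;
}
```

An empty result means the plan may execute. A non-empty result is what would be returned to the agent for revision.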
+### Reactive Guardian (Post-Execution)
+Monitors tool call history, detects failure patterns after they emerge, and injects corrective actions.
+**Example**: ironclaw's DriftMonitor detects 5 stuck patterns (repetition, failure spiral, tool cycling, silence drift, rework churn) and injects system messages. (Source: [[ironclaw-drift-monitor]])
+**Example**: The proposed meta-agent context pruning extends this by also removing dead-end context entries.
+## Academic Treatment
+**GuardAgent** (OpenReview): First LLM agent serving as guardrail for other LLM agents. Checks inputs/outputs against guard requests (safety rules, privacy policies). Training-free, in-context-learning-based.
+**GUARDIAN** (NeurIPS 2025): Models multi-agent collaboration as temporal attributed graph. Detects propagation dynamics of hallucinations and errors. State-of-the-art accuracy with efficient resource utilization.
+## Complementary Approaches
+Proactive (Guardian Agents) and reactive (Meta-Agent / DriftMonitor) are complementary:
+```
+User Request → [Guardian Agent: validate plan] → [Agent executes] → [Meta-Agent: monitor for drift]
+                     ↑ blocks bad plans                                  ↑ prunes context if stuck
+```
+Guardian Agents prevent execution-level errors. Meta-Agent recovers when the agent gets past the guardian but still gets stuck (context pollution, reasoning drift, unexpected tool behavior).
+## See Also
+- [[meta-agent-context-pruning]] — Reactive guardian with context pruning
+- [[vectara-guardian-agents]] — Source: benchmark and proactive guardian design
+- [[ironclaw-drift-monitor]] — Source: reactive DriftMonitor proposal
+- [[context-drift-in-agents]] — The problem both approaches address

package/vault/wiki/concepts/harness-configuration-layers.md ADDED

@@ -0,0 +1,89 @@
+---
+type: concept
+title: "Harness Configuration Layers"
+aliases: ["four-layer harness", "harness layers", "L1-L4 harness"]
+created: 2026-04-30
+updated: 2026-05-01
+tags: [#concept, #agents, #harness-design]
+status: redesign
+related:
+  - "[[provider-native-prompting]]"
+  - "[[model-adaptive-harness]]"
+  - "[[forgecode-gpt5-agent-improvements]]"
+  - "[[Research: Model-Specific Prompting Guides]]"
+sources:
+  - "[[openai-prompt-guidance]]"
+  - "[[anthropic-prompt-best-practices]]"
+  - "[[gemini-3-prompting-guide]]"
+  - "[[forgecode-gpt5-agent-improvements]]"
+---
+# Harness Configuration Layers
+A four-layer model for agent harness design. Each layer has configurable dimensions that vary by LLM model. **Updated May 2026** with official provider guidance as the primary source, replacing the earlier empirical-only approach.
+> [!important] The old design principle ("write once for strict, relax for forgiving") is retired. See [[provider-native-prompting]] for the replacement: generate provider-native prompts from a semantic spec.
+## Layer Architecture
+```
+L4 COMPLETION MODEL — How "done" is determined and verified
+L3 STATE CHANNEL    — How system state reaches the model
+L2 GATE DESIGN      — How transitions between phases are controlled
+L1 SIGNAL DESIGN    — How instructions are formatted for model consumption
+```
+## L1: Signal Design
+How instructions, constraints, and structure are formatted.
+| Dimension | OpenAI GPT | Anthropic Claude | Google Gemini | Source |
+|---|---|---|---|---|
+| Structure | XML-like sections | XML tags | Plain text sections | Official guides |
+| Density | Concise, outcome-first (5.5+) | General over prescriptive | Concise by default | OpenAI 5.5 guide, Anthropic BP, Gemini 3 guide |
+| Constraint Ordering | FIRST (early tokens weighted higher) | Flexible | LAST (constraints at end) | Forge Code, Gemini 3 guide |
+| Emphasis | Explicit markers for invariants only | Role setting + why explanation | Persona definitions binding | All three guides |
+| Nesting Depth | Flat (older GPT), relaxed (GPT-5.5+) | Hierarchical OK | Flat recommended | Forge Code, official guides |
+## L2: Gate Design
+How transitions between phases are controlled.
+| Dimension | OpenAI GPT | Anthropic Claude | Google Gemini | Source |
+|---|---|---|---|---|
+| Enforcement | Hard-gate (verification loop) | Self-check at end | Split-step verify→generate | GPT-5.4 guide, Anthropic BP, Gemini 3 guide |
+| Granularity | Per-step | Per-round (auto-calibrated) | Per-step | Forge Code, Anthropic BP |
+| Evidence | Checklist + tool output | Self-assessment + quote extraction | Explicit capability verification | GPT-5.4 guide, Anthropic BP, Gemini 3 guide |
+| Retry | Auto-loop (tool persistence rules) | Flag-and-continue (effort-controlled) | Escalate after 1-2 fallback strategies | GPT-5.4 guide, Anthropic BP, Gemini 3 guide |
+## L3: State Channel
+How system state reaches the model.
+| Dimension | OpenAI GPT | Anthropic Claude | Google Gemini | Source |
+|---|---|---|---|---|
+| Truncation | In-band warning + compaction API | Context awareness (tracks own budget) | System instruction for budget | GPT guide, Anthropic BP |
+| Progress | Explicit counters + user update spec | Auto-calibrated progress updates | Less verbose, steer explicitly | GPT-5.1 guide, Anthropic Opus 4.7, Gemini 3 guide |
+| Error | Structured (phase parameter) | Natural (adaptive thinking) | System instruction routing | GPT-5.3 Codex, Anthropic BP |
+| Grounding | Citation rules + retrieval budgets | Quote extraction from documents | Explicit "context is truth" statement | All three guides |
+## L4: Completion Model
+How "done" is determined.
+| Dimension | OpenAI GPT | Anthropic Claude | Google Gemini | Source |
+|---|---|---|---|---|
+| Criteria | Falsifiable checklist + completeness contract | Completion-signal + self-check | Split-step verification pass | GPT-5.4, Anthropic BP, Gemini 3 guide |
+| Self-Audit | Enforced (verification loop) | Natural (may need prompting) | Enforced (verify before generate) | GPT-5.4, Anthropic BP, Gemini 3 guide |
+| Partial-Work | Reject (completeness contract) | Accept-with-gaps (autonomous continuation) | STOP if unverified | GPT-5.4, Anthropic BP, Gemini 3 guide |
+## Design Principle (v2)
+**Generate provider-native prompts from a semantic spec.** See [[provider-native-prompting]] for the renderer architecture.
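As a rough illustration of the principle, the sketch below renders one semantic spec into three prompt layouts using only the Structure and Constraint Ordering rows of the L1 table. The actual renderer architecture lives in [[provider-native-prompting]] and is not shown in this diff, so `SemanticSpec`, `Provider`, and `renderPrompt` are hypothetical, and the tag names are placeholders.

```typescript
type Provider = "openai" | "anthropic" | "gemini";

// Hypothetical provider-neutral spec: what to do and which rules must hold.
interface SemanticSpec {
  task: string;
  constraints: string[];
}

function renderPrompt(spec: SemanticSpec, provider: Provider): string {
  const bullets = spec.constraints.map((c) => `- ${c}`).join("\n");
  switch (provider) {
    case "openai":
      // XML-like sections, constraints first.
      return `<constraints>\n${bullets}\n</constraints>\n<task>\n${spec.task}\n</task>`;
    case "anthropic":
      // XML tags; the table lists ordering as flexible, so task-first is an arbitrary choice.
      return `<task>\n${spec.task}\n</task>\n<constraints>\n${bullets}\n</constraints>`;
    case "gemini":
      // Plain-text sections, constraints last.
      return `Task:\n${spec.task}\n\nConstraints:\n${bullets}`;
  }
}
```

The spec stays the single source of truth. Only the rendering differs per provider, which is the point of replacing the retired "write once for strict, relax for forgiving" approach.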
+## Sources
+- [[openai-prompt-guidance]] — OpenAI official, 2026
+- [[anthropic-prompt-best-practices]] — Anthropic official, 2026
+- [[gemini-3-prompting-guide]] — Google Cloud official, 2026-04-29
+- [[forgecode-gpt5-agent-improvements]] — Forge Code empirical failure modes, 2026