ultimate-pi 0.1.2 → 0.1.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.agents/skills/ck-search/SKILL.md +99 -0
- package/.agents/skills/defuddle/SKILL.md +90 -0
- package/.agents/skills/find-skills/SKILL.md +142 -0
- package/.agents/skills/firecrawl/SKILL.md +150 -0
- package/.agents/skills/firecrawl/rules/install.md +82 -0
- package/.agents/skills/firecrawl/rules/security.md +26 -0
- package/.agents/skills/firecrawl-agent/SKILL.md +57 -0
- package/.agents/skills/firecrawl-build-interact/SKILL.md +67 -0
- package/.agents/skills/firecrawl-build-onboarding/SKILL.md +102 -0
- package/.agents/skills/firecrawl-build-onboarding/references/auth-flow.md +39 -0
- package/.agents/skills/firecrawl-build-onboarding/references/project-setup.md +20 -0
- package/.agents/skills/firecrawl-build-onboarding/references/sdk-installation.md +17 -0
- package/.agents/skills/firecrawl-build-scrape/SKILL.md +68 -0
- package/.agents/skills/firecrawl-build-search/SKILL.md +68 -0
- package/.agents/skills/firecrawl-crawl/SKILL.md +58 -0
- package/.agents/skills/firecrawl-download/SKILL.md +69 -0
- package/.agents/skills/firecrawl-interact/SKILL.md +83 -0
- package/.agents/skills/firecrawl-map/SKILL.md +50 -0
- package/.agents/skills/firecrawl-parse/SKILL.md +61 -0
- package/.agents/skills/firecrawl-scrape/SKILL.md +68 -0
- package/.agents/skills/firecrawl-search/SKILL.md +59 -0
- package/.agents/skills/obsidian-bases/SKILL.md +299 -0
- package/.agents/skills/obsidian-markdown/SKILL.md +237 -0
- package/.agents/skills/posthog-analyst/SKILL.md +306 -0
- package/.agents/skills/posthog-analyst/evals/evals.json +23 -0
- package/.agents/skills/wiki/SKILL.md +215 -0
- package/.agents/skills/wiki/references/css-snippets.md +122 -0
- package/.agents/skills/wiki/references/frontmatter.md +107 -0
- package/.agents/skills/wiki/references/git-setup.md +58 -0
- package/.agents/skills/wiki/references/mcp-setup.md +149 -0
- package/.agents/skills/wiki/references/modes.md +259 -0
- package/.agents/skills/wiki/references/plugins.md +96 -0
- package/.agents/skills/wiki/references/rest-api.md +124 -0
- package/.agents/skills/wiki-autoresearch/SKILL.md +211 -0
- package/.agents/skills/wiki-autoresearch/references/program.md +75 -0
- package/.agents/skills/wiki-fold/SKILL.md +204 -0
- package/.agents/skills/wiki-fold/references/fold-template.md +133 -0
- package/.agents/skills/wiki-ingest/SKILL.md +288 -0
- package/.agents/skills/wiki-lint/SKILL.md +183 -0
- package/.agents/skills/wiki-query/SKILL.md +176 -0
- package/.agents/skills/wiki-save/SKILL.md +128 -0
- package/.ckignore +41 -0
- package/.env.example +9 -0
- package/.github/workflows/lint.yml +33 -0
- package/.github/workflows/publish-github-packages.yml +35 -0
- package/.github/workflows/publish-npm.yml +1 -1
- package/.pi/SYSTEM.md +107 -40
- package/.pi/agents/pi-pi/agent-expert.md +205 -0
- package/.pi/agents/pi-pi/cli-expert.md +47 -0
- package/.pi/agents/pi-pi/config-expert.md +67 -0
- package/.pi/agents/pi-pi/ext-expert.md +53 -0
- package/.pi/agents/pi-pi/keybinding-expert.md +123 -0
- package/.pi/agents/pi-pi/pi-orchestrator.md +103 -0
- package/.pi/agents/pi-pi/prompt-expert.md +83 -0
- package/.pi/agents/pi-pi/skill-expert.md +52 -0
- package/.pi/agents/pi-pi/theme-expert.md +46 -0
- package/.pi/agents/pi-pi/tui-expert.md +100 -0
- package/.pi/agents/rethink.md +140 -0
- package/.pi/agents/wiki-ingest.md +67 -0
- package/.pi/agents/wiki-lint.md +75 -0
- package/.pi/auto-commit.json +20 -0
- package/.pi/extensions/banner.png +0 -0
- package/.pi/extensions/ck-enforce.ts +216 -0
- package/.pi/extensions/custom-footer.ts +308 -0
- package/.pi/extensions/custom-header.ts +116 -0
- package/.pi/extensions/dotenv-loader.ts +170 -0
- package/.pi/internal/cursor-sdk-transcript-parser.ts +59 -0
- package/.pi/model-router.json +95 -0
- package/.pi/npm/.gitignore +2 -0
- package/.pi/prompts/git-sync.md +124 -0
- package/.pi/prompts/harness-setup.md +509 -0
- package/.pi/prompts/save.md +16 -0
- package/.pi/prompts/wiki-autoresearch.md +19 -0
- package/.pi/prompts/wiki.md +23 -0
- package/.pi/providers/cursor-sdk-provider.test.mjs +476 -0
- package/.pi/providers/cursor-sdk-provider.ts +1085 -0
- package/.pi/settings.json +14 -4
- package/.pi/skills/agent-router/SKILL.md +174 -0
- package/.pi/sounds/alert/1-kaching-track.mp3 +0 -0
- package/.pi/sounds/error/1-ksi-wth-track.mp3 +0 -0
- package/.pi/sounds/error/2-smash-track.mp3 +0 -0
- package/.pi/sounds/error/3-buzzer-track.mp3 +0 -0
- package/.pi/sounds/notification/1-soft-notification-track.mp3 +0 -0
- package/.pi/sounds/project-sounds.json +25 -0
- package/.pi/sounds/reminder/1-soft-notification-track.mp3 +0 -0
- package/.pi/sounds/success/1-tada-track.mp3 +0 -0
- package/.pi/sounds/success/2-jobs-done-track.mp3 +0 -0
- package/.pi/sounds/success/3-yay-track.mp3 +0 -0
- package/CONTRIBUTING.md +116 -0
- package/README.md +32 -39
- package/biome.json +34 -0
- package/firecrawl/.env.template +58 -0
- package/firecrawl/README.md +49 -0
- package/firecrawl/docker-compose.yaml +201 -0
- package/firecrawl/searxng/searxng.env +3 -0
- package/firecrawl/searxng/settings.yml +85 -0
- package/lefthook.yml +8 -0
- package/package.json +55 -24
- package/vault/AGENTS.md +37 -0
- package/vault/wiki/_templates/comparison.md +39 -0
- package/vault/wiki/_templates/concept.md +40 -0
- package/vault/wiki/_templates/decision.md +21 -0
- package/vault/wiki/_templates/entity.md +32 -0
- package/vault/wiki/_templates/flow.md +14 -0
- package/vault/wiki/_templates/module.md +18 -0
- package/vault/wiki/_templates/question.md +31 -0
- package/vault/wiki/_templates/source.md +39 -0
- package/vault/wiki/concepts/AST-Aware Code Chunking.md +44 -0
- package/vault/wiki/concepts/Build-Time Prompt Compilation.md +107 -0
- package/vault/wiki/concepts/Context Engine (AI Coding).md +47 -0
- package/vault/wiki/concepts/Context-Aware System Reminders.md +61 -0
- package/vault/wiki/concepts/Contextualized Text Embedding.md +42 -0
- package/vault/wiki/concepts/Contractor vs Employee AI Model.md +55 -0
- package/vault/wiki/concepts/Dual-Model Agent Architecture.md +65 -0
- package/vault/wiki/concepts/Late Chunking vs Early Chunking.md +43 -0
- package/vault/wiki/concepts/Majority Vote Ensembling.md +68 -0
- package/vault/wiki/concepts/Meta-Harness.md +16 -0
- package/vault/wiki/concepts/Multi-Agent AI Coding Architecture.md +75 -0
- package/vault/wiki/concepts/Prompt Enhancement.md +90 -0
- package/vault/wiki/concepts/Prompt Renderer.md +89 -0
- package/vault/wiki/concepts/Semantic Codebase Indexing.md +67 -0
- package/vault/wiki/concepts/additive-config-hierarchy.md +16 -0
- package/vault/wiki/concepts/agent-artifacts-verifiable-deliverables.md +71 -0
- package/vault/wiki/concepts/agent-browser-browser-automation.md +99 -0
- package/vault/wiki/concepts/agent-codebase-interface.md +43 -0
- package/vault/wiki/concepts/agent-harness-architecture.md +67 -0
- package/vault/wiki/concepts/agent-loop-detection-patterns.md +133 -0
- package/vault/wiki/concepts/agent-search-enforcement.md +126 -0
- package/vault/wiki/concepts/agent-skills-ecosystem.md +74 -0
- package/vault/wiki/concepts/agent-skills-pattern.md +68 -0
- package/vault/wiki/concepts/agentic-harness-context-enforcement.md +91 -0
- package/vault/wiki/concepts/agentic-harness.md +34 -0
- package/vault/wiki/concepts/agentic-orchestration-pipeline.md +56 -0
- package/vault/wiki/concepts/agentic-search-no-embeddings.md +18 -0
- package/vault/wiki/concepts/anthropic-context-engineering.md +13 -0
- package/vault/wiki/concepts/antigravity-agent-first-architecture.md +61 -0
- package/vault/wiki/concepts/ast-compression.md +19 -0
- package/vault/wiki/concepts/ast-truncation.md +66 -0
- package/vault/wiki/concepts/barrel-files.md +37 -0
- package/vault/wiki/concepts/browser-harness-agent.md +41 -0
- package/vault/wiki/concepts/browser-subagent-visual-verification.md +82 -0
- package/vault/wiki/concepts/codebase-intelligence-ecosystem-comparison.md +192 -0
- package/vault/wiki/concepts/codebase-intelligence-harness-integration.md +161 -0
- package/vault/wiki/concepts/codebase-to-context-ingestion.md +46 -0
- package/vault/wiki/concepts/codex-harness-innovations.md +147 -0
- package/vault/wiki/concepts/consensus-debate-flow.md +17 -0
- package/vault/wiki/concepts/consensus-debate.md +206 -0
- package/vault/wiki/concepts/content-addressed-spec-identity.md +166 -0
- package/vault/wiki/concepts/context-anxiety.md +57 -0
- package/vault/wiki/concepts/context-compression-techniques.md +19 -0
- package/vault/wiki/concepts/context-continuity.md +22 -0
- package/vault/wiki/concepts/context-drift-in-agents.md +106 -0
- package/vault/wiki/concepts/context-engineering.md +62 -0
- package/vault/wiki/concepts/context-folding.md +67 -0
- package/vault/wiki/concepts/context-mode.md +38 -0
- package/vault/wiki/concepts/cursor-harness-innovations.md +107 -0
- package/vault/wiki/concepts/deterministic-session-compaction.md +79 -0
- package/vault/wiki/concepts/drift-detection-unified.md +296 -0
- package/vault/wiki/concepts/execution-feedback-loop.md +46 -0
- package/vault/wiki/concepts/feedforward-feedback-harness.md +60 -0
- package/vault/wiki/concepts/five-root-cause-metrics-sentrux.md +40 -0
- package/vault/wiki/concepts/fork-safe-spec-storage.md +89 -0
- package/vault/wiki/concepts/fts5-sandbox.md +19 -0
- package/vault/wiki/concepts/fuzzy-edit-matching.md +71 -0
- package/vault/wiki/concepts/gemini-cli-architecture.md +104 -0
- package/vault/wiki/concepts/generator-evaluator-architecture.md +64 -0
- package/vault/wiki/concepts/guardian-agent-pattern.md +67 -0
- package/vault/wiki/concepts/harness-configuration-layers.md +89 -0
- package/vault/wiki/concepts/harness-control-frameworks.md +155 -0
- package/vault/wiki/concepts/harness-engineering-first-principles.md +90 -0
- package/vault/wiki/concepts/harness-h-formalism.md +53 -0
- package/vault/wiki/concepts/hybrid-code-search.md +61 -0
- package/vault/wiki/concepts/inline-post-edit-validation.md +112 -0
- package/vault/wiki/concepts/legendary-engineering-patterns-harness.md +110 -0
- package/vault/wiki/concepts/lifecycle-hooks.md +94 -0
- package/vault/wiki/concepts/mcp-tool-routing.md +102 -0
- package/vault/wiki/concepts/memory-system-of-record-vs-ephemeral-cache.md +47 -0
- package/vault/wiki/concepts/meta-agent-context-pruning.md +151 -0
- package/vault/wiki/concepts/model-adaptive-harness.md +122 -0
- package/vault/wiki/concepts/model-routing-agents.md +101 -0
- package/vault/wiki/concepts/monorepo-architecture.md +45 -0
- package/vault/wiki/concepts/multi-agent-specialization.md +61 -0
- package/vault/wiki/concepts/permission-subsystem.md +16 -0
- package/vault/wiki/concepts/pi-messenger-analysis.md +243 -0
- package/vault/wiki/concepts/pi-vscode-extension-landscape.md +37 -0
- package/vault/wiki/concepts/policy-engine-pattern.md +78 -0
- package/vault/wiki/concepts/progressive-disclosure-agents.md +53 -0
- package/vault/wiki/concepts/progressive-skill-disclosure.md +17 -0
- package/vault/wiki/concepts/provider-native-prompting.md +203 -0
- package/vault/wiki/concepts/quality-signal-sentrux.md +37 -0
- package/vault/wiki/concepts/repo-map-ranking.md +42 -0
- package/vault/wiki/concepts/result-monad-error-handling.md +47 -0
- package/vault/wiki/concepts/safety-defense-in-depth.md +83 -0
- package/vault/wiki/concepts/sandbox-os-enforcement.md +18 -0
- package/vault/wiki/concepts/selective-debate-routing.md +70 -0
- package/vault/wiki/concepts/self-evolving-harness.md +60 -0
- package/vault/wiki/concepts/sentrux-mcp-integration.md +36 -0
- package/vault/wiki/concepts/sentrux-rules-engine.md +49 -0
- package/vault/wiki/concepts/shell-pattern-compression.md +24 -0
- package/vault/wiki/concepts/skill-first-architecture.md +166 -0
- package/vault/wiki/concepts/structured-compaction.md +78 -0
- package/vault/wiki/concepts/subagent-orchestration.md +17 -0
- package/vault/wiki/concepts/subagent-worktree-isolation.md +68 -0
- package/vault/wiki/concepts/superpowers-methodology.md +78 -0
- package/vault/wiki/concepts/think-in-code.md +73 -0
- package/vault/wiki/concepts/ts-execution-layer.md +100 -0
- package/vault/wiki/concepts/typescript-strict-mode.md +37 -0
- package/vault/wiki/concepts/vcc-conversation-compaction-for-pi.md +51 -0
- package/vault/wiki/concepts/verification-drift-detection.md +19 -0
- package/vault/wiki/consensus/consensus-records.md +58 -0
- package/vault/wiki/decisions/2026-04-30-pi-lean-ctx-native.md +122 -0
- package/vault/wiki/decisions/adr-008.md +40 -0
- package/vault/wiki/decisions/adr-009.md +46 -0
- package/vault/wiki/decisions/adr-010.md +55 -0
- package/vault/wiki/decisions/adr-011.md +165 -0
- package/vault/wiki/decisions/adr-012.md +102 -0
- package/vault/wiki/decisions/adr-013.md +59 -0
- package/vault/wiki/decisions/adr-014.md +73 -0
- package/vault/wiki/decisions/adr-015.md +81 -0
- package/vault/wiki/decisions/adr-016.md +91 -0
- package/vault/wiki/decisions/adr-017.md +79 -0
- package/vault/wiki/decisions/adr-018.md +100 -0
- package/vault/wiki/decisions/adr-019.md +75 -0
- package/vault/wiki/decisions/adr-020.md +106 -0
- package/vault/wiki/decisions/adr-021.md +86 -0
- package/vault/wiki/decisions/adr-022.md +113 -0
- package/vault/wiki/decisions/adr-023.md +113 -0
- package/vault/wiki/decisions/adr-024.md +73 -0
- package/vault/wiki/decisions/adr-025.md +130 -0
- package/vault/wiki/decisions/adr-026.md +56 -0
- package/vault/wiki/decisions/colocate-wiki.md +34 -0
- package/vault/wiki/entities/Anders Hejlsberg.md +29 -0
- package/vault/wiki/entities/Anthropic.md +17 -0
- package/vault/wiki/entities/Augment Code.md +49 -0
- package/vault/wiki/entities/Bjarne Stroustrup.md +26 -0
- package/vault/wiki/entities/Bolt.new (StackBlitz).md +39 -0
- package/vault/wiki/entities/Boris Cherny.md +11 -0
- package/vault/wiki/entities/Claude Code.md +19 -0
- package/vault/wiki/entities/Dennis Ritchie.md +26 -0
- package/vault/wiki/entities/Emergent Labs.md +32 -0
- package/vault/wiki/entities/Google Cloud.md +16 -0
- package/vault/wiki/entities/Guido van Rossum.md +28 -0
- package/vault/wiki/entities/Ken Thompson.md +28 -0
- package/vault/wiki/entities/Lee et al.md +16 -0
- package/vault/wiki/entities/Linus Torvalds.md +28 -0
- package/vault/wiki/entities/Lovable (company).md +40 -0
- package/vault/wiki/entities/Martin Fowler.md +16 -0
- package/vault/wiki/entities/Meng et al.md +16 -0
- package/vault/wiki/entities/OpenAI.md +16 -0
- package/vault/wiki/entities/Rocket.new.md +38 -0
- package/vault/wiki/entities/VILA-Lab.md +15 -0
- package/vault/wiki/entities/autodev-codebase.md +18 -0
- package/vault/wiki/entities/ck-tool.md +59 -0
- package/vault/wiki/entities/codesearch.md +18 -0
- package/vault/wiki/entities/disler-indydevdan.md +33 -0
- package/vault/wiki/entities/gsd-get-shit-done.md +56 -0
- package/vault/wiki/entities/javascript-runtimes.md +48 -0
- package/vault/wiki/entities/jesse-vincent.md +38 -0
- package/vault/wiki/entities/lean-ctx.md +32 -0
- package/vault/wiki/entities/opendev.md +41 -0
- package/vault/wiki/entities/ops-codegraph-tool.md +18 -0
- package/vault/wiki/entities/pi-coding-agent.md +53 -0
- package/vault/wiki/entities/sentrux.md +54 -0
- package/vault/wiki/entities/vgrep-tool.md +57 -0
- package/vault/wiki/entities/vitest.md +41 -0
- package/vault/wiki/flows/harness-wiki-pipeline.md +204 -0
- package/vault/wiki/hot.md +932 -0
- package/vault/wiki/index.md +437 -0
- package/vault/wiki/log.md +418 -0
- package/vault/wiki/meta/dashboard.md +30 -0
- package/vault/wiki/meta/lint-report-2026-04-30.md +86 -0
- package/vault/wiki/meta/lint-report-2026-05-02.md +251 -0
- package/vault/wiki/meta/overview.canvas +43 -0
- package/vault/wiki/modules/adversarial-verification.md +57 -0
- package/vault/wiki/modules/automated-observability.md +54 -0
- package/vault/wiki/modules/bench.md +20 -0
- package/vault/wiki/modules/extensions.md +23 -0
- package/vault/wiki/modules/grounding-checkpoints.md +62 -0
- package/vault/wiki/modules/harness-implementation-plan.md +345 -0
- package/vault/wiki/modules/harness-wiki-skill-mapping.md +135 -0
- package/vault/wiki/modules/harness.md +86 -0
- package/vault/wiki/modules/persistent-memory.md +85 -0
- package/vault/wiki/modules/schema-orchestration.md +68 -0
- package/vault/wiki/modules/skills.md +27 -0
- package/vault/wiki/modules/spec-hardening.md +58 -0
- package/vault/wiki/modules/structured-planning.md +53 -0
- package/vault/wiki/modules/think-in-code-enforcement.md +153 -0
- package/vault/wiki/modules/wiki-query-interface.md +64 -0
- package/vault/wiki/overview.md +51 -0
- package/vault/wiki/questions/Research-pi-vs-claude-code-agentic-orchestration-pipeline.md +87 -0
- package/vault/wiki/questions/Research-sentrux-dev.md +123 -0
- package/vault/wiki/questions/Research-superpowers-skill-for-agentic-coding-agents.md +164 -0
- package/vault/wiki/questions/Research: Augment Code Context Engine.md +244 -0
- package/vault/wiki/questions/Research: Automating Software Engineering - Lovable, Bolt, Emergent, Rocket.md +112 -0
- package/vault/wiki/questions/Research: Claude Code State-of-the-Art Harness Improvements.md +209 -0
- package/vault/wiki/questions/Research: Codex State-of-the-Art Harness Improvements.md +99 -0
- package/vault/wiki/questions/Research: Engineering Workflows of Legendary Programmers and AI Harness Mapping.md +107 -0
- package/vault/wiki/questions/Research: Fallow Codebase Intelligence Harness Integration.md +72 -0
- package/vault/wiki/questions/Research: Gemini CLI SOTA Harness Integration.md +166 -0
- package/vault/wiki/questions/Research: GitHub Issues as Harness Spec Storage.md +188 -0
- package/vault/wiki/questions/Research: Google Antigravity Harness Integration.md +120 -0
- package/vault/wiki/questions/Research: Meta-Agent Context Drift Detection.md +236 -0
- package/vault/wiki/questions/Research: Model-Adaptive Agent Harness Design.md +95 -0
- package/vault/wiki/questions/Research: Model-Specific Prompting Guides.md +165 -0
- package/vault/wiki/questions/Research: Prompt Renderer for Multi-Model Agent Harness.md +216 -0
- package/vault/wiki/questions/Research: Skill-First Harness Architecture.md +91 -0
- package/vault/wiki/questions/Research: TypeScript Best Practices and Codebase Structure.md +88 -0
- package/vault/wiki/questions/Research: TypeScript Execution Layer for Agent Tool Calling.md +81 -0
- package/vault/wiki/questions/Research: claude-mem over Obsidian for Harness Layer.md +71 -0
- package/vault/wiki/questions/Research: claude-mem over obsidian wiki as the knowledge base for our agentic harness pipeline. think from first principles. does this replace or complement our current setup? no hard feelings about previous decisions. gimme accurate points.md +80 -0
- package/vault/wiki/questions/Research: context-mode vs lean-ctx.md +72 -0
- package/vault/wiki/questions/Research: cursor.sh Harness Innovations.md +92 -0
- package/vault/wiki/questions/Research: executor.sh Harness Integration.md +170 -0
- package/vault/wiki/questions/Research: how GSD fits into our coding harness setup.md +97 -0
- package/vault/wiki/questions/Research: how claude-mem fits into our workflow. and whether it should replace obsidian in the codebase. no hard feelings about previous actions, rethink from first principles always.md +80 -0
- package/vault/wiki/questions/Research: pi-vcc.md +113 -0
- package/vault/wiki/questions/Research: semantic code search tools.md +69 -0
- package/vault/wiki/questions/Research: vcc extension for pi coding agent.md +73 -0
- package/vault/wiki/questions/how-to-enable-semantic-code-search-now.md +111 -0
- package/vault/wiki/questions/mvp-implementation-blueprint.md +552 -0
- package/vault/wiki/questions/research-agent-first-codebase-exploration.md +199 -0
- package/vault/wiki/questions/research-agentic-coding-harness-latest-papers.md +142 -0
- package/vault/wiki/questions/research-gitingest-gitreverse-integration.md +100 -0
- package/vault/wiki/questions/research-wozcode-token-reduction.md +67 -0
- package/vault/wiki/questions/resolved-context-pruning-inplace-vs-restart.md +95 -0
- package/vault/wiki/questions/resolved-context-window-economics.md +167 -0
- package/vault/wiki/questions/resolved-imad-debate-gating-transfer.md +126 -0
- package/vault/wiki/questions/resolved-mcp-tool-preference.md +112 -0
- package/vault/wiki/questions/resolved-small-model-meta-agents.md +107 -0
- package/vault/wiki/questions/resolved-treesitter-dynamic-languages.md +95 -0
- package/vault/wiki/sources/Auggie Context MCP Server.md +63 -0
- package/vault/wiki/sources/Augment Code Codacy AI Giants.md +61 -0
- package/vault/wiki/sources/Augment Code MCP SiliconAngle.md +49 -0
- package/vault/wiki/sources/Augment Code WorkOS ERC 2025.md +55 -0
- package/vault/wiki/sources/Augment Context Engine Official.md +71 -0
- package/vault/wiki/sources/Augment SWE-bench Agent GitHub.md +74 -0
- package/vault/wiki/sources/Augment SWE-bench Pro Blog.md +58 -0
- package/vault/wiki/sources/Source: AgentBus Jinja2 Prompt Pipelines.md +75 -0
- package/vault/wiki/sources/Source: Arxiv — Don't Break the Cache.md +85 -0
- package/vault/wiki/sources/Source: Augment - Harness Engineering for AI Coding Agents.md +58 -0
- package/vault/wiki/sources/Source: Blake Crosley Agent Architecture Guide.md +100 -0
- package/vault/wiki/sources/Source: Bolt.new Architecture & Case Study.md +75 -0
- package/vault/wiki/sources/Source: Build-Time Prompt Compilation Architecture.md +107 -0
- package/vault/wiki/sources/Source: Claude API Agent Skills Overview.md +70 -0
- package/vault/wiki/sources/Source: Gemini CLI Changelogs.md +88 -0
- package/vault/wiki/sources/Source: Google Blog - Gemini CLI Announcement.md +57 -0
- package/vault/wiki/sources/Source: Google Gemini CLI Architecture Docs.md +53 -0
- package/vault/wiki/sources/Source: LangChain - Anatomy of Agent Harness.md +65 -0
- package/vault/wiki/sources/Source: Lovable Architecture & Clone Analysis.md +83 -0
- package/vault/wiki/sources/Source: Martin Fowler - Harness Engineering.md +70 -0
- package/vault/wiki/sources/Source: OpenAI Harness Engineering Five Principles.md +58 -0
- package/vault/wiki/sources/Source: OpenAI Harness Engineering — 0 Lines of Human Code.md +101 -0
- package/vault/wiki/sources/Source: OpenDev — Building AI Coding Agents for the Terminal.md +100 -0
- package/vault/wiki/sources/Source: Render AI Coding Agents Benchmark 2025.md +53 -0
- package/vault/wiki/sources/Source: Rocket.new — Vibe Solutioning Platform.md +70 -0
- package/vault/wiki/sources/Source: SwirlAI Agent Skills Progressive Disclosure.md +71 -0
- package/vault/wiki/sources/Source: TianPan Prompt Caching Architecture.md +89 -0
- package/vault/wiki/sources/Source: Vercel Labs agent-browser.md +155 -0
- package/vault/wiki/sources/Source: browser-harness CDP Harness.md +126 -0
- package/vault/wiki/sources/agent-drift-academic-paper.md +79 -0
- package/vault/wiki/sources/aider-repomap-tree-sitter.md +42 -0
- package/vault/wiki/sources/anthropic-compaction-api.md +58 -0
- package/vault/wiki/sources/anthropic-effective-harnesses.md +42 -0
- package/vault/wiki/sources/anthropic-prompt-best-practices.md +100 -0
- package/vault/wiki/sources/anthropic2026-harness-design.md +63 -0
- package/vault/wiki/sources/barrel-files-tkdodo.md +38 -0
- package/vault/wiki/sources/birth-of-unix-kernighan-interview.md +57 -0
- package/vault/wiki/sources/bockeler2026-harness-engineering.md +69 -0
- package/vault/wiki/sources/cast-code-chunking-paper.md +50 -0
- package/vault/wiki/sources/ck-semantic-search.md +78 -0
- package/vault/wiki/sources/claude-code-architecture-karaxai-2026.md +71 -0
- package/vault/wiki/sources/claude-code-architecture-qubytes-2026.md +50 -0
- package/vault/wiki/sources/claude-code-architecture-vila-lab-2026.md +64 -0
- package/vault/wiki/sources/claude-code-security-architecture-penligent-2026.md +70 -0
- package/vault/wiki/sources/claude-context-editing-docs.md +13 -0
- package/vault/wiki/sources/cloudflare-codemode.md +63 -0
- package/vault/wiki/sources/code-chunk-library-supermemory.md +63 -0
- package/vault/wiki/sources/codeact-apple-2024.md +62 -0
- package/vault/wiki/sources/codex-dsc-rfc-8573.md +41 -0
- package/vault/wiki/sources/codex-open-source-agent-2026.md +110 -0
- package/vault/wiki/sources/coir-code-retrieval-benchmark.md +51 -0
- package/vault/wiki/sources/colinmcnamara-context-optimization-codemode.md +48 -0
- package/vault/wiki/sources/context-folding-paper.md +61 -0
- package/vault/wiki/sources/context-mode-website.md +63 -0
- package/vault/wiki/sources/cursor-agent-best-practices-2026.md +62 -0
- package/vault/wiki/sources/cursor-fork-29b-2025.md +50 -0
- package/vault/wiki/sources/cursor-harness-april-2026.md +76 -0
- package/vault/wiki/sources/cursor-instant-apply-2024.md +45 -0
- package/vault/wiki/sources/cursor-shadow-workspace-2024.md +52 -0
- package/vault/wiki/sources/cursor-shipped-coding-agent-2026.md +53 -0
- package/vault/wiki/sources/cursor-vs-antigravity-2026.md +51 -0
- package/vault/wiki/sources/disler-pi-vs-claude-code.md +69 -0
- package/vault/wiki/sources/distill-deterministic-context-compression.md +53 -0
- package/vault/wiki/sources/embedding-models-benchmark-supermemory-2025.md +48 -0
- package/vault/wiki/sources/executor-rhyssullivan.md +122 -0
- package/vault/wiki/sources/fallow-rs-codebase-intelligence.md +125 -0
- package/vault/wiki/sources/fan2025-imad.md +60 -0
- package/vault/wiki/sources/forgecode-gpt5-agent-improvements.md +63 -0
- package/vault/wiki/sources/gemini-3-prompting-guide.md +78 -0
- package/vault/wiki/sources/gh-cli-sub-issue-rfc.md +50 -0
- package/vault/wiki/sources/gh-sub-issue-extension.md +72 -0
- package/vault/wiki/sources/github-fork-issues-discussion.md +44 -0
- package/vault/wiki/sources/github-issue-dependencies-docs.md +49 -0
- package/vault/wiki/sources/github-sub-issues-docs.md +51 -0
- package/vault/wiki/sources/gitingest.md +91 -0
- package/vault/wiki/sources/gitreverse.md +63 -0
- package/vault/wiki/sources/google-antigravity-official-blog.md +47 -0
- package/vault/wiki/sources/google-antigravity-wikipedia.md +53 -0
- package/vault/wiki/sources/gsd-codecentric-deep-dive.md +57 -0
- package/vault/wiki/sources/gsd-github-repo.md +51 -0
- package/vault/wiki/sources/gsd-hn-discussion.md +59 -0
- package/vault/wiki/sources/guido-python-design-philosophy.md +56 -0
- package/vault/wiki/sources/hejlsberg-7-learnings.md +48 -0
- package/vault/wiki/sources/ironclaw-drift-monitor.md +80 -0
- package/vault/wiki/sources/langsight-loop-detection.md +80 -0
- package/vault/wiki/sources/leanctx-website.md +69 -0
- package/vault/wiki/sources/lee2026-meta-harness.md +59 -0
- package/vault/wiki/sources/linux-kernel-coding-workflow.md +50 -0
- package/vault/wiki/sources/lou2026-autoharness.md +53 -0
- package/vault/wiki/sources/martin-fowler-harness-engineering.md +73 -0
- package/vault/wiki/sources/mcp-architecture-docs.md +13 -0
- package/vault/wiki/sources/meng2026-agent-harness-survey.md +79 -0
- package/vault/wiki/sources/mindstudio-four-agent-types.md +68 -0
- package/vault/wiki/sources/ms-chat-history-management.md +13 -0
- package/vault/wiki/sources/openai-prompt-guidance.md +104 -0
- package/vault/wiki/sources/openclaw-session-pruning.md +13 -0
- package/vault/wiki/sources/opencode-dcp.md +13 -0
- package/vault/wiki/sources/opendev-arxiv-2603.05344v1.md +79 -0
- package/vault/wiki/sources/openhands-platform.md +39 -0
- package/vault/wiki/sources/oss-guide-codebase-exploration.md +53 -0
- package/vault/wiki/sources/pi-compaction-extensions-ecosystem.md +102 -0
- package/vault/wiki/sources/pi-context-prune-github-repo.md +38 -0
- package/vault/wiki/sources/pi-mono-compaction-docs.md +38 -0
- package/vault/wiki/sources/pi-omni-compact-github-repo.md +50 -0
- package/vault/wiki/sources/pi-rtk-optimizer-github-repo.md +45 -0
- package/vault/wiki/sources/pi-vcc-github-repo.md +69 -0
- package/vault/wiki/sources/pi-vscode-marketplace.md +41 -0
- package/vault/wiki/sources/pi-vscode-model-provider-marketplace.md +39 -0
- package/vault/wiki/sources/py-tree-sitter.md +13 -0
- package/vault/wiki/sources/sentrux-dev-landing.md +40 -0
- package/vault/wiki/sources/sentrux-docs-pro-architecture.md +75 -0
- package/vault/wiki/sources/sentrux-docs-quality-signal.md +46 -0
- package/vault/wiki/sources/sentrux-docs-root-cause-metrics.md +57 -0
- package/vault/wiki/sources/sentrux-docs-rules-engine.md +58 -0
- package/vault/wiki/sources/sentrux-github-repo.md +56 -0
- package/vault/wiki/sources/superpowers-github-repo.md +56 -0
- package/vault/wiki/sources/superpowers-release-blog.md +54 -0
- package/vault/wiki/sources/superpowers-termdock-analysis.md +45 -0
- package/vault/wiki/sources/swe-agent-aci.md +42 -0
- package/vault/wiki/sources/swe-bench.md +45 -0
- package/vault/wiki/sources/swe-pruner-context-pruning.md +13 -0
- package/vault/wiki/sources/think-in-code-blog.md +48 -0
- package/vault/wiki/sources/tree-sitter-docs.md +13 -0
- package/vault/wiki/sources/ts-best-practices-2025-devto.md +42 -0
- package/vault/wiki/sources/ts-folder-structure-mingyang.md +58 -0
- package/vault/wiki/sources/ts-monorepo-koerselman.md +44 -0
- package/vault/wiki/sources/ts-result-error-handling-kkalamarski.md +52 -0
- package/vault/wiki/sources/ts-runtimes-comparison-betterstack.md +42 -0
- package/vault/wiki/sources/ts-strict-mode-rishikc.md +43 -0
- package/vault/wiki/sources/unix-philosophy.md +48 -0
- package/vault/wiki/sources/vectara-chunking-vs-embedding-naacl2025.md +39 -0
- package/vault/wiki/sources/vectara-guardian-agents.md +79 -0
- package/vault/wiki/sources/vgrep-semantic-search.md +76 -0
- package/vault/wiki/sources/vitest-official.md +41 -0
- package/vault/wiki/sources/vscode-pi-community-extension.md +40 -0
- package/vault/wiki/sources/wozcode.md +79 -0
- package/.agents/skills/compress/SKILL.md +0 -111
- package/.agents/skills/compress/scripts/__init__.py +0 -9
- package/.agents/skills/compress/scripts/__main__.py +0 -3
- package/.agents/skills/compress/scripts/benchmark.py +0 -78
- package/.agents/skills/compress/scripts/cli.py +0 -73
- package/.agents/skills/compress/scripts/compress.py +0 -227
- package/.agents/skills/compress/scripts/detect.py +0 -121
- package/.agents/skills/compress/scripts/validate.py +0 -189
- package/.agents/skills/emil-design-eng/SKILL.md +0 -679
- package/.agents/skills/lean-ctx/SKILL.md +0 -149
- package/.agents/skills/lean-ctx/scripts/install.sh +0 -95
- package/.agents/skills/scrapling-official/LICENSE.txt +0 -28
- package/.agents/skills/scrapling-official/SKILL.md +0 -390
- package/.agents/skills/scrapling-official/examples/01_fetcher_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/02_dynamic_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/03_stealthy_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/04_spider.py +0 -58
- package/.agents/skills/scrapling-official/examples/README.md +0 -45
- package/.agents/skills/scrapling-official/references/fetching/choosing.md +0 -78
- package/.agents/skills/scrapling-official/references/fetching/dynamic.md +0 -352
- package/.agents/skills/scrapling-official/references/fetching/static.md +0 -432
- package/.agents/skills/scrapling-official/references/fetching/stealthy.md +0 -255
- package/.agents/skills/scrapling-official/references/mcp-server.md +0 -214
- package/.agents/skills/scrapling-official/references/migrating_from_beautifulsoup.md +0 -86
- package/.agents/skills/scrapling-official/references/parsing/adaptive.md +0 -212
- package/.agents/skills/scrapling-official/references/parsing/main_classes.md +0 -586
- package/.agents/skills/scrapling-official/references/parsing/selection.md +0 -494
- package/.agents/skills/scrapling-official/references/spiders/advanced.md +0 -344
- package/.agents/skills/scrapling-official/references/spiders/architecture.md +0 -94
- package/.agents/skills/scrapling-official/references/spiders/getting-started.md +0 -164
- package/.agents/skills/scrapling-official/references/spiders/proxy-blocking.md +0 -235
- package/.agents/skills/scrapling-official/references/spiders/requests-responses.md +0 -196
- package/.agents/skills/scrapling-official/references/spiders/sessions.md +0 -205
- package/PLAN.md +0 -11
- package/extensions/lean-ctx-enforce.ts +0 -166
- package/skills-lock.json +0 -35
- package/wiki/README.md +0 -19
- package/wiki/decisions/0001-establish-project-wiki-and-decision-record-format.md +0 -25
- package/wiki/decisions/0002-add-project-banner-to-readme.md +0 -26
- package/wiki/decisions/0003-remove-redundant-readme-title-heading.md +0 -26
- package/wiki/decisions/0004-publish-package-to-npm-as-ultimate-pi.md +0 -26
- package/wiki/decisions/0005-automate-npm-publish-with-github-actions.md +0 -27
- package/wiki/decisions/0006-switch-to-npm-trusted-publishing.md +0 -26
- package/wiki/decisions/0007-use-absolute-banner-url-for-npm-readme-rendering.md +0 -26
- package/wiki/decisions/0008-rename-banner-asset-for-cache-busting.md +0 -26
- package/wiki/decisions/0009-force-oidc-path-by-clearing-node-auth-token-in-publish-step.md +0 -25
- package/wiki/decisions/0010-simplify-setup-node-for-npm-trusted-publishing.md +0 -26
- package/wiki/decisions/0011-add-noop-workflow-change-to-force-fresh-publish-run.md +0 -25
- package/wiki/decisions/0012-align-workflow-runtime-with-npm-trusted-publishing-requirements.md +0 -26
- package/wiki/decisions/0013-add-package-repository-url-for-provenance-validation.md +0 -25

@@ -0,0 +1,44 @@
---
type: concept
title: "AST-Aware Code Chunking"
created: 2026-04-30
status: developing
tags:
  - chunking
  - code-rag
  - embeddings
  - semantic-search
related:
  - "[[Semantic Codebase Indexing]]"
  - "[[Context Engine (AI Coding)]]"
  - "[[Contextualized Text Embedding]]"
sources:
  - "[[cast-code-chunking-paper]]"
  - "[[code-chunk-library-supermemory]]"
updated: 2026-05-02
---

# AST-Aware Code Chunking

## Definition

Splitting source code into retrievable chunks at Abstract Syntax Tree (AST) boundaries instead of arbitrary character/line limits. This preserves semantic coherence: functions, classes, and methods remain intact within chunks rather than being split mid-body.

## Why It Matters

Fixed-size chunking (e.g., splitting every 500 characters) produces fragments like `subtotal += item.price * item.qu` — meaningless without surrounding context. AST-aware chunking ensures each chunk is a complete, self-contained syntactic unit.

## Algorithm (cAST / code-chunk)

1. **Parse** source into an AST via tree-sitter
2. **Extract entities**: functions, methods, classes, imports with signatures and docstrings
3. **Build scope tree**: hierarchical parent-child relationships (method → class → file)
4. **Greedy window assignment**: pack complete AST nodes into chunks up to a max non-whitespace character limit
5. **Recurse into oversized nodes**: if a single function exceeds the chunk limit, recurse into its children
6. **Merge adjacent small windows**: reduce fragmentation
7. **Add contextualized text**: prepend file path, scope chain, signatures, imports before the raw code
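
The greedy packing loop (steps 4-6) can be sketched in simplified form. This is an illustrative sketch over pre-extracted nodes, not the cAST implementation; `Node` and the budget helper are stand-ins for what a tree-sitter parse would provide:

```python
from dataclasses import dataclass

@dataclass
class Node:
    text: str       # source text of a complete AST node (function, class, ...)
    children: list  # child nodes, used when the node itself is too big

def nonws(s: str) -> int:
    """The budget is counted in non-whitespace characters (step 4)."""
    return sum(1 for c in s if not c.isspace())

def pack(nodes: list, limit: int) -> list:
    """Greedily pack whole nodes into chunks; recurse into oversized ones."""
    chunks, cur = [], []

    def flush():
        if cur:
            chunks.append("\n".join(cur))
            cur.clear()

    for n in nodes:
        if nonws(n.text) > limit and n.children:
            # Step 5: the node alone exceeds the limit, so recurse into children
            flush()
            chunks.extend(pack(n.children, limit))
        elif nonws("\n".join(cur + [n.text])) > limit:
            # Adding this node would overflow the window: start a new chunk
            # (an oversized leaf with no children stays whole in its own chunk)
            flush()
            cur.append(n.text)
        else:
            cur.append(n.text)
    flush()

    # Step 6: merge adjacent small chunks to reduce fragmentation
    merged = []
    for c in chunks:
        if merged and nonws(merged[-1]) + nonws(c) <= limit:
            merged[-1] = merged[-1] + "\n" + c
        else:
            merged.append(c)
    return merged
```

Because only whole nodes are packed, no chunk ever ends mid-function; the trade-off is variable chunk sizes.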

## Performance Impact

- cAST paper: +4.3 Recall@5 on RepoEval, +2.67 Pass@1 on SWE-bench generation
- code-chunk implementation: 70.1% Recall@5 vs 42.4% fixed-size baseline (with hard negatives + IoU threshold)
- Vectara NAACL 2025: chunking strategy matters as much as or more than embedding model choice

@@ -0,0 +1,107 @@
---
type: concept
title: "Build-Time Prompt Compilation"
created: 2026-05-02
updated: 2026-05-02
tags:
  - prompt-compilation
  - build-time
  - deterministic-builds
  - npm-packaging
status: developing
related:
  - "[[Prompt Renderer]]"
  - "[[research: Prompt Renderer for Multi-Model Agent Harness]]"
sources:
  - "[[Source: Build-Time Prompt Compilation Architecture]]"
---

# Build-Time Prompt Compilation

The practice of compiling prompts from a base specification into validated, model-specific output at **build time** (not runtime), then shipping the compiled output as static assets in an npm package.

## Why Build-Time?

| Aspect | Build-Time | Runtime |
|--------|-----------|---------|
| **Latency** | Zero at runtime | Template parse + render per request |
| **Caching** | Built-in (compiled output IS the cache) | Requires cache warming, TTL management |
| **Validation** | Syntax errors caught at build | Errors at request time |
| **npm distribution** | Static JSON files | Ship template engine + templates |
| **Determinism** | Hash-verifiable output | Runtime variables may differ |
| **Dependency weight** | None (compiled JSON is raw data) | Must ship template engine in bundle |
| **Real-world tools** | Microsoft prompt-engine (abandoned, 2.8K stars), PromptWeaver (active, MIT) | Jinja2, Handlebars runtime rendering |

## Pipeline Design

```
Source: prompts/*.yaml (base specs)
    ↓ [parser]
PromptConfig[] (validated AST)
    ↓ [model-renderers]
Per-model output: prompts/gpt/*.json, prompts/claude/*.json, prompts/gemini/*.json
    ↓ [validator]
Check: syntax, token thresholds, variable completeness
    ↓ [serializer]
Deterministic JSON (sorted keys, fixed formatting, hash manifest)
    ↓ [packager]
Included in npm package as static assets
```

## Deterministic Build Requirements

1. **Version-locked renderer**: Pinned compiler version in the build manifest
2. **Input hashing**: `sha256sum` over all source spec files
3. **Build manifest**: Records compiler version, source files, file hashes, build time
4. **Same input → same output**: Guaranteed identical compiled prompts for the same spec + compiler version
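
The serializer and manifest steps can be sketched minimally. This assumes a Python build script for illustration rather than the repo's actual `compile-prompts.ts`; the function names are hypothetical:

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Content hash used for both the input manifest and output verification."""
    return hashlib.sha256(data).hexdigest()

def serialize(compiled: dict) -> str:
    """Deterministic JSON: sorted keys + fixed separators give byte-identical
    output for identical input, so the result is hash-verifiable."""
    return json.dumps(compiled, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)

def build_manifest(compiler_version: str, sources: dict) -> dict:
    """Record the pinned compiler version and a hash per source spec file.

    `sources` maps spec path -> raw file bytes.
    """
    return {
        "compiler": compiler_version,
        "inputs": {path: sha256_hex(data)
                   for path, data in sorted(sources.items())},
    }
```

Comparing a fresh manifest against the previous one is also what drives the incremental-compilation check described below.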

## Incremental Compilation

Only recompile prompts whose spec hash changed:

1. Hash each source spec file
2. Compare against the previous build manifest
3. Recompile only the changed specs, plus specs that depend on changed fragments
4. Run a full compilation only for release builds

## Two-Phase Variable Model

```yaml
# Compile-time vars: resolved at build, produce multiple variants
model_name: { compile: true, values: [gpt-5.2, claude-sonnet-4.5] }
# → generates 2 compiled prompts per spec

# Runtime vars: resolved at call time, injected into compiled template
user_query: { compile: false }
# → placeholder left in compiled output for runtime substitution
```

## Token Budget Awareness

Compiled prompts must be validated against model-specific constraints:

- **Minimum cache threshold**: 1,024 tokens (OpenAI/Anthropic), 4,096 (Google)
- **Maximum context window**: Model-specific (e.g., 200K for Claude)
- **Cost estimate**: Embedded in compiled output metadata

## Integration with npm

```
ultimate-pi/
├── prompts/                  # Source specs (committed to git)
│   ├── base/
│   │   ├── system.yaml
│   │   ├── spec-hardening.yaml
│   │   └── verify.yaml
│   └── fragments/
│       └── code-context.yaml
├── dist/
│   └── prompts/              # Compiled output (shipped in npm)
│       ├── gpt/
│       │   ├── system.json
│       │   └── verify.json
│       ├── claude/
│       │   └── system.json
│       ├── gemini/
│       │   └── system.json
│       └── manifest.json     # Build manifest with hashes
└── scripts/
    └── compile-prompts.ts    # Build script
```

@@ -0,0 +1,47 @@
---
type: concept
title: "Context Engine (AI Coding)"
created: 2026-04-30
status: developing
tags:
  - ai-coding
  - context
  - semantic-search
  - rag
aliases:
  - Codebase Context Engine
related:
  - "[[Semantic Codebase Indexing]]"
  - "[[Prompt Enhancement]]"
  - "[[Contractor vs Employee AI Model]]"
sources:
  - "[[Augment Context Engine Official]]"
  - "[[Augment Code WorkOS ERC 2025]]"
updated: 2026-05-02
---

# Context Engine (AI Coding)

A context engine is a system that provides AI coding agents with deep, semantic understanding of a codebase beyond what text search (grep) can provide. It is the differentiator between agents that merely generate code and agents that write code that fits the codebase.

## Core Properties

1. **Semantic indexing**: Embeds code into vector space, understanding relationships between files, functions, classes, and services.
2. **Real-time sync**: Maintains a live understanding as code changes, with millisecond-level sync.
3. **Relationship mapping**: Tracks dependencies, call graphs, imports, and architectural patterns.
4. **Intelligent retrieval**: Returns only what's relevant to the current task — not the entire codebase.
5. **Multi-source**: Goes beyond code to include commit history, team patterns, documentation, and tribal knowledge.

## Why It Matters

The same LLM produces dramatically different results depending on context quality. Augment Code demonstrated that Claude Opus 4.5 scored 51.80% through Auggie vs 45.89% through the SWE-Agent baseline — a 6-point gap from context alone. When used as a context provider for other agents, improvements of 30-80% were observed.

## Key Insight

> Context quality determines code quality more than model intelligence. A weaker model with excellent context outperforms a stronger model with poor context.

## Implementation Approaches

1. **Embedding-based**: Use local embedding models (e.g., all-MiniLM-L6-v2) to index code files into a vector database (LanceDB, ChromaDB).
2. **Hybrid retrieval**: Combine keyword (BM25) + semantic (cosine similarity) search for best recall.
3. **Graph-based**: Build dependency/call graphs using tree-sitter AST analysis.
4. **MCP exposure**: Wrap the context engine as an MCP server for any AI agent to consume.
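
One common way to combine the two rankings in approach 2 is reciprocal-rank fusion. A minimal sketch, not tied to any particular BM25 or embedding library (the ranked lists are assumed to come from those systems):

```python
def rrf(keyword_ranked: list, semantic_ranked: list, k: int = 60) -> list:
    """Fuse a keyword (BM25) ranking and a semantic ranking by summing
    reciprocal-rank scores; documents high in either list rise to the top."""
    scores = {}
    for ranking in (keyword_ranked, semantic_ranked):
        for rank, doc_id in enumerate(ranking):
            # Standard RRF score: 1 / (k + rank), with rank starting at 1
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant `k` damps the influence of top ranks so neither retriever dominates.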

@@ -0,0 +1,61 @@
---
type: concept
title: "Context-Aware System Reminders"
created: 2026-05-03
updated: 2026-05-03
status: developing
tags:
  - context-engineering
  - long-horizon
  - behavioral-steering
  - attention-decay
related:
  - "[[context-engineering]]"
  - "[[context-anxiety]]"
  - "[[Source: OpenDev — Building AI Coding Agents for the Terminal]]"
sources:
  - "[[Source: OpenDev — Building AI Coding Agents for the Terminal]]"
---

# Context-Aware System Reminders

A mechanism for **counteracting instruction fade-out** in long-running agent sessions, from OpenDev ([Section 2.3.4](https://arxiv.org/html/2603.05344v1#S2.SS3.SSS4)). The core problem: after 30+ tool calls, agents silently stop following system prompt instructions — even though the instructions are still in the context window.

## How It Works

Instead of putting all instructions in the system prompt (which sits at the conversation's beginning and loses influence over time), inject **short, targeted reminders at the exact point of decision**.

Each reminder is a brief `role: user` message placed at maximum recency in the conversation, immediately before the next LLM call. The model treats it as "something that just happened requiring a response."

## Event Detectors

Eight conditions trigger reminders:

- Tool failure without retry (6 error-specific templates)
- Exploration spiral (5+ consecutive reads)
- Denied tool re-attempts
- Premature completion with incomplete todos
- Continued work after all todos are done
- Plan approval without follow-through
- Unprocessed subagent results
- Empty completion messages

## Guardrail Counters

Each reminder type has a counter or one-shot flag to prevent it from degenerating into noise:

- Incomplete-todo nudges: max 2
- Error-recovery nudges: max 3
- Plan-approved, all-todos-complete, and completion-summary reminders: fire exactly once
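
The injection mechanism with guardrail caps can be sketched as follows, assuming a simple message-list harness; the class and cap names are hypothetical, not OpenDev's API:

```python
from collections import Counter

# Hypothetical per-reminder-type caps mirroring the guardrail counters above
CAPS = {"incomplete_todo": 2, "error_recovery": 3, "plan_approved": 1}

class ReminderInjector:
    def __init__(self):
        self.fired = Counter()  # how many times each reminder type has fired

    def maybe_inject(self, messages: list, kind: str, text: str) -> bool:
        """Append a user-role reminder at maximum recency, unless its cap
        is already spent (one-shot types default to a cap of 1)."""
        if self.fired[kind] >= CAPS.get(kind, 1):
            return False
        self.fired[kind] += 1
        messages.append({"role": "user", "content": text})
        return True
```

Called at iteration boundaries, just before the next LLM call, so the reminder sits at the position of highest recency.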

## Why `role: user` Beats `role: system`

Early experiments with `role: system` injection confirmed this: user-role reminders produced noticeably higher compliance rates. After 40 turns, yet another system message blends into the background, while a user message appears at the position of highest recency — the model treats it as something demanding a response.

## Relevance to Our Harness

Our harness's instruction drift is a known problem in long sessions. System reminders provide a lightweight, targeted mechanism:

1. Catalog our failure modes (which instructions does our agent stop following?)
2. Create a reminder template for each
3. Detect the triggering condition at iteration boundaries
4. Inject at the decision point with `role: user`
5. Cap frequency to prevent noise

This is **cheaper than context resets** and **more targeted than full system prompt re-injection**.

@@ -0,0 +1,42 @@
---
type: concept
title: "Contextualized Text Embedding"
created: 2026-04-30
status: developing
tags:
  - embeddings
  - chunking
  - code-rag
  - semantic-search
related:
  - "[[AST-Aware Code Chunking]]"
  - "[[Semantic Codebase Indexing]]"
sources:
  - "[[code-chunk-library-supermemory]]"
updated: 2026-05-02
---

# Contextualized Text Embedding

## Definition

Prepending semantic metadata (file path, scope chain, signatures, imports) to raw code before embedding it. This transforms the code from a bare syntactic fragment into a natural-language-like description that embedding models (trained primarily on natural language) can process effectively.

## Format

```
# src/services/user.ts
# Scope: UserService > getUser
# Defines: async getUser(id: string): Promise<User>
# Uses: Database
# After: constructor

async getUser(id: string): Promise<User> { ... }
```

## Why It Works

Embedding models like MiniLM-L6-v2 are trained on natural-language corpora (Wikipedia, books, web text). They understand sentences and paragraphs, not raw code syntax. By prepending a natural-language description of what the code is, where it lives, and what it depends on, the embedding captures semantic relationships that pure code misses.

## Impact on MiniLM-L6-v2

MiniLM-L6-v2 was not trained on code. Without contextualized text, it embeds `async getUser(id)` as a sequence of tokens without understanding that it sits inside a UserService class or uses a Database. Contextualized text bridges this gap, making smaller general-purpose models viable for code retrieval.
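
Assembling the header shown in the Format section can be sketched as below; `contextualize` and its parameters are hypothetical, not the code-chunk library's API:

```python
def contextualize(path: str, scope: list, signature: str,
                  uses: list, code: str) -> str:
    """Prepend the metadata header (file path, scope chain, signature,
    dependencies) to raw code before it is sent to the embedding model."""
    lines = [f"# {path}",
             f"# Scope: {' > '.join(scope)}",
             f"# Defines: {signature}"]
    if uses:
        lines.append(f"# Uses: {', '.join(uses)}")
    return "\n".join(lines) + "\n\n" + code
```

The resulting string, not the bare code, is what gets embedded.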

@@ -0,0 +1,55 @@
---
type: concept
title: "Contractor vs Employee AI Model"
created: 2026-04-30
status: developing
tags:
  - ai-strategy
  - context
  - model-selection
aliases:
  - Context vs Intelligence
related:
  - "[[Context Engine (AI Coding)]]"
  - "[[Augment Code]]"
sources:
  - "[[Augment Code Codacy AI Giants]]"
updated: 2026-05-02
---

# Contractor vs Employee AI Model

A mental model coined by Vinay Perneti (VP of Engineering at Augment Code) that frames the relationship between LLM intelligence and codebase context in AI coding tools.

## The Analogy

| Aspect | Contractor | Employee |
|--------|-----------|----------|
| **Intelligence** | Has it (borrowed) | Has it |
| **Context** | Missing | Has it (accumulated over time) |
| **Result** | Needs constant re-explanation | Understands the codebase deeply |
| **AI equivalent** | Raw LLM with no codebase context | LLM + Context Engine |

## Why This Matters

Most AI coding tools focus on model intelligence — chasing higher benchmark scores by using more powerful models. Augment's insight: context is the bottleneck, not intelligence.

### Evidence

- Claude Opus 4.5 scored 45.89% through SWE-Agent (baseline context).
- The same model scored 51.80% through Auggie (semantic context).
- A weaker model (Sonnet) with Augment's context can outperform Opus without it.

## Practical Implications

### For Tool Builders

- Invest in context infrastructure before chasing model upgrades.
- Build contextual understanding that persists across sessions.
- Treat context as a first-class engineering problem, not an afterthought.

### For Model Selection

- Augment offers only 3 model choices (vs competitors' 20+).
- Philosophy: "Let us solve the hard problems of what models make sense. You shouldn't spend mental cycles on how to get the best context."

### For Our Harness

- The context layer should be independent of the model layer.
- Swap models without rebuilding context infrastructure.
- Invest in semantic indexing, knowledge persistence, and pattern recognition before model optimization.

@@ -0,0 +1,65 @@
---
type: concept
title: "Dual-Model Agent Architecture"
created: 2026-04-30
status: developing
tags:
  - agent-architecture
  - llm
  - ensembling
  - swe-bench
aliases:
  - Two-Model Agent
related:
  - "[[Majority Vote Ensembling]]"
  - "[[Agentic Coding Harness]]"
sources:
  - "[[Augment SWE-bench Agent GitHub]]"
  - "[[Augment SWE-bench Pro Blog]]"
updated: 2026-05-02
---

# Dual-Model Agent Architecture

An agent architecture that uses two different LLMs for distinct phases: a fast, capable model for iterative reasoning/coding, and a more deliberative model for solution selection/verification.

## Augment Code's Implementation

### Phase 1: Core Reasoning (Claude Sonnet 3.7)

- Handles the iterative coding loop: read files, write code, run tests, debug.
- Fast, capable, good at following instructions.
- Runs in a loop with tool access (bash, file edit, sequential thinking).

### Phase 2: Solution Ensembling (OpenAI o1)

- Runs after generating N candidate solutions (typically 8).
- Presents all candidates to o1 along with their evaluation outcomes.
- o1 analyzes the candidates and selects the best solution.
- o1 is slower but more deliberative — better at comparative analysis.

## Why Two Models?

1. **Cost optimization**: The fast model handles 95% of the work; the expensive model runs only for selection.
2. **Complementary strengths**: Claude excels at code generation; o1 excels at analysis and comparison.
3. **Error reduction**: Majority vote ensembling catches errors that any single run might miss.
4. **Separation of concerns**: Generation and evaluation use different reasoning patterns.

## Alternative Patterns

### Single-Model Multi-Pass

- The same model generates multiple solutions, then self-reviews.
- Simpler but less effective than cross-model ensembling.

### Model Cascade

- Start with a fast/cheap model; escalate to a stronger model on failure.
- Used by SWE-agent and some production systems.

### Committee of Models

- 3+ different models generate solutions independently.
- Selection by voting or an LLM judge.

## Implementation for Our Harness

We can implement the dual-model architecture as a configurable strategy:

- **Primary model**: Claude (fast, code-capable) for the main agent loop.
- **Ensembler model**: GPT-5 or o1 for solution verification and selection.
- Generate 3-5 candidate solutions, then use the ensembler to pick the best.
- Configurable via harness config.

@@ -0,0 +1,43 @@
---
type: concept
title: "Late Chunking vs Early Chunking"
created: 2026-04-30
status: developing
tags:
  - chunking
  - embeddings
  - rag
  - semantic-search
related:
  - "[[AST-Aware Code Chunking]]"
  - "[[Contextualized Text Embedding]]"
sources:
  - "[[vectara-chunking-vs-embedding-naacl2025]]"
updated: 2026-05-02
---

# Late Chunking vs Early Chunking

## Definitions

- **Early chunking (standard)**: Split text → embed each chunk separately. Each chunk's embedding sees only its own text.
- **Late chunking**: Embed the entire document first (producing token-level embeddings), then pool the token embeddings into chunk-level embeddings using the chunk boundaries. Each chunk's embedding "sees" the full document context.
- **Contextual retrieval**: An intermediate approach: prepend document-level context to each chunk before embedding. Simpler than late chunking; captures some cross-chunk context.
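
The pooling step that distinguishes late chunking can be sketched as mean-pooling token vectors per chunk boundary. Pure Python for illustration; a real implementation would pool the encoder's output tensor:

```python
def late_chunk(token_embeddings: list, boundaries: list) -> list:
    """Mean-pool full-document token embeddings into one vector per chunk.

    token_embeddings come from a single encoder pass over the whole document,
    so every pooled chunk vector reflects document-wide context.
    boundaries is a list of [start, end) token spans, one per chunk.
    """
    chunks = []
    for start, end in boundaries:
        span = token_embeddings[start:end]
        dim = len(span[0])
        # Average each dimension across the chunk's tokens
        chunks.append([sum(v[d] for v in span) / len(span) for d in range(dim)])
    return chunks
```

Early chunking, by contrast, would encode each span in isolation, so no token vector ever attends outside its own chunk.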

## Trade-offs

| Approach | Semantic Coherence | Compute Cost | Implementation Complexity |
|----------|-------------------|--------------|---------------------------|
| Early chunking | Lowest | Lowest | Simplest |
| Contextual retrieval | Medium | Medium | Moderate |
| Late chunking | Highest | Highest | Complex |

## Research Findings (arXiv:2504.19754)

Late chunking and contextual retrieval were evaluated for RAG systems:

- Contextual retrieval preserves semantic coherence more effectively than early chunking
- But it requires greater computational resources (it embeds full documents)
- For code, contextual retrieval (prepending scope/file context) is the sweet spot — better than bare early chunking, cheaper than full late chunking

## Relevance to Our Implementation

We implement **contextual retrieval** (not full late chunking): prepend the file path, scope chain, signatures, and imports to each chunk before embedding. This gives us much of the benefit at moderate cost.

@@ -0,0 +1,68 @@
---
type: concept
title: "Majority Vote Ensembling"
created: 2026-04-30
status: developing
tags:
  - agent-architecture
  - llm
  - ensembling
aliases:
  - Solution Ensembling
related:
  - "[[Dual-Model Agent Architecture]]"
sources:
  - "[[Augment SWE-bench Agent GitHub]]"
updated: 2026-05-02
---

# Majority Vote Ensembling

A technique where an agent generates multiple candidate solutions to the same problem, then uses an LLM (or a voting mechanism) to select the best one. Used by Augment Code's SWE-bench agent to boost success rates.

## How Augment Implements It

1. Run the core agent (Claude Sonnet 3.7) N times on the same problem (typically N=8).
2. Each run produces a candidate solution (a diff).
3. Run the evaluation harness on each candidate to get pass/fail outcomes.
4. Feed all candidates + outcomes to OpenAI o1 with a prompt asking it to select the best solution.
5. o1 returns the index of the selected solution.

## Input Format

```json
{
  "id": "problem-1",
  "instruction": "Fix the login timeout issue",
  "diffs": ["diff1", "diff2", "..."],
  "eval_outcomes": [
    {"is_success": true},
    {"is_success": false}
  ]
}
```

## Why It Works

1. **Variance reduction**: Multiple independent runs reduce the impact of any single bad generation.
2. **Complementary failures**: Different runs fail on different aspects; ensembling can pick the run that succeeded.
3. **LLM-as-judge**: o1's reasoning capabilities are better suited to comparative analysis than to code generation.
4. **Evaluation-guided**: Including eval outcomes helps the ensembler distinguish functionally correct solutions from incorrect ones.

## Cost Consideration

Running N candidates multiplies cost by N. Augment's approach: use a fast/cheap model (Sonnet) for the N runs, then an expensive model (o1) only for the single ensembling step.

## Implementation for Our Harness

```python
def ensemble_solutions(problem: str, candidates: int = 5) -> str:
    # run_agent, evaluate, and llm_ensembler are harness-provided hooks
    solutions = []
    for _ in range(candidates):
        # Run the agent independently on the same problem
        diff = run_agent(problem)
        result = evaluate(diff)
        solutions.append({"diff": diff, "success": result.passed})

    # Select the best candidate via the LLM ensembler
    best = llm_ensembler.select_best(problem, solutions)
    return best["diff"]
```

@@ -0,0 +1,16 @@
---
type: concept
status: stub
created: 2026-05-02
updated: 2026-05-02
tags: [concept, harness, meta-learning]
---

# Meta-Harness

An outer-loop harness-optimization framework from Lee et al. (Stanford/Together AI): a harness that optimizes the inner harness, selecting the best configurations, prompts, and patterns across multiple agent runs.

## References

- [[lee2026-meta-harness]]
- [[self-evolving-harness]]

@@ -0,0 +1,75 @@
---
type: concept
title: "Multi-Agent AI Coding Architecture"
created: 2026-05-03
updated: 2026-05-03
status: developing
tags:
  - multi-agent
  - architecture
  - agentic-coding
  - harness
related:
  - "[[subagent-orchestration]]"
  - "[[generator-evaluator-architecture]]"
  - "[[agentic-harness]]"
  - "[[Source: Lovable Architecture & Clone Analysis]]"
  - "[[anthropic2026-harness-design]]"
sources:
  - "[[Source: Lovable Architecture & Clone Analysis]]"
  - "[[anthropic2026-harness-design]]"
  - "[[Source: OpenAI Harness Engineering — 0 Lines of Human Code]]"
  - "[[Source: OpenDev — Building AI Coding Agents for the Terminal]]"
---

# Multi-Agent AI Coding Architecture

The decomposition of software engineering tasks across specialized agents, each with a defined role, input/output contract, and tool surface. This is the **universal pattern** across all successful AI coding platforms.

## Three Common Decompositions

### Lovable/Clone Pattern: Planner → Architect → Coder

```
User prompt → Planner (structured Plan) → Architect (TaskPlan) → Coder (files on disk)
```

- Each agent receives Pydantic-validated inputs
- LangGraph orchestrates with conditional edges
- The Coder uses the ReAct pattern with file system tools

### Anthropic Pattern: Planner → Generator → Evaluator

```
User prompt → Planner (product spec) → Generator (implements) ⇄ Evaluator (grades)
```

- The Generator and Evaluator negotiate "sprint contracts" before coding
- The Evaluator uses Playwright to actually click through the app
- Hard thresholds on grading criteria — fall below any one of them and the sprint fails

### OpenAI Pattern: Agent-to-Agent Review Loops

```
Codex generates → Codex reviews locally → Additional agent review (cloud) → Human/agent feedback → Iterate
```

- "Ralph Wiggum Loop": the agent reviews its own changes, requests additional reviews, responds to feedback, and iterates until all agent reviewers are satisfied
- Humans may review PRs but aren't required to
- Pushed "almost all review effort towards being handled agent-to-agent"

## First-Principles Architecture

### 1. Separate Planning from Execution

Do not let the same agent plan and code in one step. The Planner should have read-only tools — structurally prevented from writing code. This forces deliberation before action and prevents premature implementation.

### 2. Structured Handoffs Between Agents

Every handoff must be a validated data contract, not free text: Pydantic schemas, typed dicts, or structured files. The downstream agent processes objects, not unstructured descriptions.
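
A minimal sketch of such a handoff contract, using stdlib dataclasses for self-containment (the Lovable clone uses Pydantic; the type and field names here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Task:
    path: str         # file the Coder must create or edit
    description: str  # what to implement there

@dataclass(frozen=True)
class TaskPlan:
    goal: str
    tasks: list = field(default_factory=list)

    def __post_init__(self):
        # Validate at the handoff boundary, not inside the downstream agent:
        # a plan with no tasks is rejected before the Coder ever sees it
        if not self.tasks:
            raise ValueError("TaskPlan must contain at least one Task")
```

The downstream agent then iterates over `plan.tasks` rather than re-parsing a prose plan.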

### 3. Independent Evaluator with Hard Criteria

The agent that builds cannot be trusted to evaluate. Use a separate evaluator with explicit, gradable criteria, each with a hard, non-negotiable threshold. "Claude is a poor QA agent out of the box" — the evaluator requires explicit tuning to be skeptical.

### 4. Sprint Contracts (Agree on "Done" Before Work)

Before coding starts, the implementer and evaluator negotiate what success looks like. This prevents scope creep and provides concrete verification targets. Communication happens via files, not chat.

### 5. Tool Surface = Agent Capability Boundary

Each agent's available tools define its actual capability — not its prompt, not its role description. Remove write tools from planners. Remove subagent-spawning from subagents. Make capabilities structural, not aspirational.

## Relevance to Our Harness

- L2 (Planning) should be a separate agent with read-only tools
- L3 (Execution) should work from L2's structured output
- L4 (Verification) needs hard criteria with thresholds, not narrative feedback
- Negotiate sprint contracts between L2 and L4 before L3 begins