npm - ultimate-pi - Versions diffs - 0.1.0 → 0.1.3 - Mend

ultimate-pi 0.1.0 → 0.1.3

Files changed (509)

package/vault/wiki/decisions/adr-013.md ADDED

@@ -0,0 +1,59 @@
+---
+type: decision
+title: "ADR-013: Biome for Phase 16 Deterministic Quality Gate"
+status: accepted
+priority: 1
+date: "2026-05-02"
+tags: [adr, harness, phase-16, linting, formatting, biome, deterministic-gate]
+sources:
+  - "[[HARNESS-PRD]]"
+  - "[[package.json]]"
+  - "[[biome.json]]"
+related:
+  - "[[adr-012]]"
+supersedes: "PRD Q5 (ESLint+Prettier recommendation)"
+created: 2026-05-02
+updated: 2026-05-02
+---
+# ADR-013: Biome for Phase 16 Deterministic Quality Gate
+## Context
+PRD Q5 originally resolved Phase 16 gate to "ESLint + Prettier." The project already uses Biome 2.0.6 (`package.json`: `"lint": "biome check"`, `"format": "biome format --write"`) with lefthook pre-commit integration. Adding ESLint+Prettier as new dependencies would duplicate existing tooling.
+The original concern was Biome's type-aware linting gap. With Biome 2.0.6 + TypeScript 6.0.3, type-aware rules have improved. The remaining gap is covered by `tsc --noEmit` as a separate deterministic step.
+## Decision
+**Use Biome for lint + format in Phase 16. Replace `ESLint + Prettier` with `biome check --write` + `tsc --noEmit` + `fallow audit`.**
+The Phase 16 gate runs three deterministic steps and uses 0 LLM tokens:
+1. `biome check --write`: lint + format in one pass
+2. `tsc --noEmit`: type checking for what Biome doesn't cover (type errors; note that `tsc` alone does not report floating promises)
+3. `fallow audit --changed-since main`: dead code, duplication, complexity
+All three are pure CLI tools with exit codes. No LLM involvement.
+## Rationale
+- **Already configured**: Biome is installed, configured (`biome.json`), and integrated with lefthook. Zero setup cost.
+- **Single tool for lint+format**: Biome replaces both ESLint and Prettier. One dependency instead of two.
+- **TypeScript type-checking via `tsc`**: Covers what Biome can't. `tsc --noEmit` is already in `package.json` scripts (`"check:ts"`).
+- **Zero incremental dependencies**: No ESLint, no prettier, no eslint-config-prettier, no @typescript-eslint packages.
+## Consequences
+### Positive
+- Fewer dependencies. Lower maintenance.
+- Matches existing project conventions.
+- lefthook integration already works.
+### Negative
+- Some ESLint rules have no Biome equivalent (rare edge cases).
+- `tsc --noEmit` is slower than Biome's native linting (but acceptable as a separate gate step).
+### Mitigations
+- If a specific ESLint-only rule is needed, evaluate case-by-case. Most are cosmetic — Biome's defaults are sufficient for a deterministic quality gate.
+- `tsc --noEmit` can be run with `--skipLibCheck` for speed.
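
The three-step gate that ADR-013 describes could be sequenced roughly as follows. This is a minimal sketch, not the package's implementation: `GateStep`, `Exec`, `runGate`, and `PHASE16_STEPS` are hypothetical names, stopping at the first failing step is an assumption, and the Biome step uses the plain `biome check` form from the project's `lint` script. The executor is injected so the sequencing can be exercised without the CLIs installed.

```typescript
// Hypothetical sketch of the Phase 16 gate; all names here are illustrative.
type GateStep = { name: string; command: string; args: string[] };

// Returns the exit code of the command; injected so the logic stays testable.
type Exec = (command: string, args: string[]) => number;

type GateResult = { passed: boolean; failedStep?: string; ran: string[] };

const PHASE16_STEPS: GateStep[] = [
  // Flags as in the project's `lint` script; the ADR's fix-applying flag is omitted.
  { name: "biome", command: "biome", args: ["check"] },
  { name: "tsc", command: "tsc", args: ["--noEmit"] },
  { name: "fallow", command: "fallow", args: ["audit", "--changed-since", "main"] },
];

// Runs steps in order and stops at the first non-zero exit code (assumed policy).
function runGate(steps: GateStep[], exec: Exec): GateResult {
  const ran: string[] = [];
  for (const step of steps) {
    ran.push(step.name);
    if (exec(step.command, step.args) !== 0) {
      return { passed: false, failedStep: step.name, ran };
    }
  }
  return { passed: true, ran };
}
```

In a real gate the executor would presumably wrap a process spawn (for example Node's `child_process.spawnSync`) and return its exit status; no LLM call appears anywhere in the path.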

package/vault/wiki/decisions/adr-014.md ADDED

@@ -0,0 +1,73 @@
+---
+type: decision
+title: "ADR-014: isolated-vm for P43 TypeScript Execution Sandbox"
+status: accepted
+priority: 1
+date: "2026-05-02"
+tags: [adr, harness, p43, typescript-execution, sandbox, isolated-vm, security]
+sources:
+  - "[[HARNESS-PRD]]"
+  - "[[adr-012]]"
+related:
+  - "[[adr-012]]"
+supersedes:
+created: 2026-05-02
+updated: 2026-05-02
+---
+# ADR-014: isolated-vm for P43 TypeScript Execution Sandbox
+## Context
+P43 TypeScript Execution Layer replaces flat tool calling with a single `write_ts` tool backed by a sandboxed runtime. Agent writes TypeScript orchestrating tools; runtime executes the code.
+Three sandbox options evaluated:
+| Criterion | Node.js VM (`node:vm`) | Deno subprocess | `isolated-vm` |
+|---|---|---|---|
+| Isolation | Weak — same process, `process.exit()` kills harness | Strong — separate OS process | Strong — V8 isolate, separate heap |
+| Performance | Fastest — no IPC | IPC overhead per tool call | Fast — in-process but isolated |
+| Setup | Zero deps | Install Deno (new runtime dep) | Native addon (C++ compilation) |
+| Node compat | Full | Partial (Deno APIs differ) | Full |
+| Security | Low — `require('child_process')` escapes | Medium — `--allow-*` flags | High — no `require` unless granted |
+| Maturity | Built-in | Young | Mature (7K+ stars, Fly.io, Netlify) |
+Pi runs on Node.js. Adding Deno as a dependency for just the sandbox is heavy. `node:vm` is too weak — `process.exit()` kills the harness. PRD P38 (OS-level sandbox with bubblewrap/Seatbelt) is a separate phase and won't be ready when P43 ships.
+## Decision
+**Use `isolated-vm` as the P43 sandbox runtime.**
+- Separate V8 isolate with its own heap. Cannot crash the harness.
+- No `require` access unless explicitly granted via the sandbox API.
+- Tool functions (`read`, `edit`, `bash`, `find`, `grep`, `ck_search`) are exposed via explicit host function registration, not via Node.js module resolution.
+- TypeScript agent code is compiled with esbuild (`tsc` type-strips, esbuild bundles) to plain JS before injection into the isolate.
+- P38 bubblewrap/Seatbelt adds defense-in-depth later. `isolated-vm` is the inner sandbox; P38 is the outer sandbox.
+### Fallback
+If `isolated-vm` native addon compilation fails in a given environment, fall back to `node:vm` + P38 bubblewrap as the outer enforcement layer. The fallback is less secure but functional.
+## Rationale
+- **Security**: The agent writes arbitrary TypeScript. We cannot trust it. `isolated-vm` limits blast radius to the isolate.
+- **Performance**: In-process. No IPC overhead. Tool calls dispatch via typed host functions.
+- **Maturity**: 7K+ GitHub stars. Used by Fly.io for customer code execution and Netlify for edge functions. Battle-tested.
+- **Composability**: P43 sandbox serves double duty as P15b pre-verification sandbox. Same isolate, different execution context.
+## Consequences
+### Positive
+- Strong isolation without process overhead.
+- Reuses same sandbox for P15b pre-verification.
+- Explicit host function registration = auditable tool surface.
+### Negative
+- Native addon requires C++ build toolchain (`node-gyp`). Adds `dev` setup step.
+- Not available in all environments (e.g., some CI runners without C++ toolchain). Fallback needed.
+- Learning curve — `isolated-vm` API differs from `node:vm`.
+### Mitigations
+- Document `isolated-vm` build requirements in README.
+- Implement `node:vm` fallback path from day one.
+- P38 OS-level sandbox provides outer defense for fallback mode.
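
ADR-014's "explicit host function registration" can be pictured as an allowlist that the sandbox dispatches through. The sketch below is an assumption-laden illustration: `HostRegistry` and `HostFn` are hypothetical names, and the code deliberately does not use the `isolated-vm` API, so the wiring of these functions into an actual isolate is not shown.

```typescript
// Hypothetical allowlist of host functions exposed to sandboxed agent code.
type HostFn = (...args: unknown[]) => unknown;

class HostRegistry {
  private readonly fns = new Map<string, HostFn>();

  // Each tool must be granted explicitly; duplicate names are rejected.
  register(name: string, fn: HostFn): void {
    if (this.fns.has(name)) {
      throw new Error(`host function already registered: ${name}`);
    }
    this.fns.set(name, fn);
  }

  // The complete tool surface, e.g. for an audit log.
  surface(): string[] {
    return Array.from(this.fns.keys()).sort();
  }

  // Dispatches a call coming from the sandbox; unknown names are refused.
  call(name: string, args: unknown[]): unknown {
    const fn = this.fns.get(name);
    if (!fn) throw new Error(`host function not granted: ${name}`);
    return fn(...args);
  }
}
```

Because nothing is resolvable by name unless it was registered, a request for something like `require` simply fails, and `surface()` gives the auditable list the ADR mentions.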

package/vault/wiki/decisions/adr-015.md ADDED

@@ -0,0 +1,81 @@
+---
+type: decision
+title: "ADR-015: Pipeline-First Build Order"
+status: accepted
+priority: 1
+date: "2026-05-02"
+tags: [adr, harness, build-order, mvp, incremental-delivery]
+sources:
+  - "[[HARNESS-PRD]]"
+  - "[[adr-012]]"
+  - "[[adr-014]]"
+related:
+  - "[[adr-012]]"
+  - "[[adr-014]]"
+supersedes: "PRD Section 16.1 (original 10-group build order)"
+created: 2026-05-02
+updated: 2026-05-02
+---
+# ADR-015: Pipeline-First Build Order
+## Context
+The original PRD specified 10 build groups with P43 (TypeScript Execution Layer) in Group 6 — after L1, L2, L3, L2.5, L4, and Post-Verification. Two competing strategies emerged:
+- **Option A (P43-first)**: Foundation → L1/L2 → P43 → L3 survivors → L2.5 → L4. Front-load the biggest context reduction.
+- **Option B (Pipeline-first)**: Foundation → L1/L2 → L2.5 → L4 → P43 + L3 survivors → Post-Verification. Validate quality gates before investing in execution layer.
+Initial preference was Option A to avoid rebuilding L3 integration. Re-evaluated: L2.5 (drift monitor) and L4 (adversarial verification) do not depend on P43. They work with pi's existing flat tool calling. Validating the full L1→L2→L2.5→L4 pipeline before P43 means we prove the gate model works before committing to the execution layer.
+## Decision
+**Pipeline-first (Option B). Validate gates before P43 investment.**
+### New Build Order
+```
+Group 1: Foundation (F0) + L1 Spec Hardening + L2 Structured Planning
+Group 2: L2.5 Runtime Drift Monitor (rule-based, works with pi's existing tool calling)
+Group 3: L4 Adversarial Verification (critic agents, selective debate)
+Group 4: P43 TypeScript Execution Layer + L3 survivors (P8/P9/P11/P13/P15)
+Group 5: Post-Verification (P20-P24: lint gate, observability, memory, orchestration, wiki query)
+Group 6: Cross-Cutting Capabilities (P25-P42: router, anxiety guard, error class, browser, hooks, compaction, permissions, etc.)
+Group 7: Self-Evolving Infrastructure (P45-P48: auto-optimize, behaviour harness, auto-learn, sandbox infra)
+```
+### L3 Survivors Absorbed into P43
+P8 (grounding checkpoints), P9 (AST truncation), P11 (inline validation), P12 (post-edit hooks), P13 (ck search), and P15 (gitingest) are implemented inside P43's `isolated-vm` runtime, not on top of flat tools. P14 (Think-in-Code) is absorbed by P43, because P43 is itself think-in-code. P10 (fuzzy edit matching) moves into P43's `edit()` host function. P15b (pre-verification sandbox) reuses the same `isolated-vm` isolate.
+### Incremental Delivery
+1. **After Group 1**: Harness blocks ambiguous tasks. Specs hardened. Plans structured.
+2. **After Group 2**: Agent stuckness detected and auto-corrected. Drift spirals prevented.
+3. **After Group 3**: Every change passes critic attack. Consensus debates filed.
+4. **After Group 4**: 3-4x context reduction on all tool workflows.
+5. **After Group 5**: Keep Rate tracked. Memory persists. Pipeline orchestrated.
+6. **After Group 6**: Full SOTA harness feature set.
+7. **After Group 7**: Harness self-evolves.
+## Rationale
+- **Risk reduction**: P43 is the hardest single phase (CodeAct-level complexity). Validating the simpler gate pipeline (L1/L2/L2.5/L4) first proves the architecture before committing to the execution layer.
+- **No throwaway work**: L2.5 and L4 work with any tool-calling mechanism. When P43 arrives, drift monitor and critics monitor P43 tool calls the same way they monitor flat tool calls — through `tool_result` events.
+- **Faster to first working pipeline**: Groups 1-3 produce an end-to-end harness (harden → plan → monitor drift → verify) in ~7 weeks. Users get value before P43.
+- **Parallelizable**: Group 4 (P43) can begin in parallel with Groups 2-3 if multiple developers/agents are available.
+## Consequences
+### Positive
+- Gates proven before execution layer.
+- Earlier user value.
+- P43 benefits from lessons learned in Groups 2-3 about tool calling patterns.
+### Negative
+- More total time to 3-4x context reduction (P43 at Group 4 vs Group 2).
+- L3 survivors (P8/P9/P11) delayed until P43 ships — grounding checkpoints not available in Groups 1-3.
+### Mitigations
+- If P43 proves simpler than expected, Group 4 can be fast-tracked.
+- Drift monitor (Group 2) provides partial grounding — catches context loops even without formal checkpoints.

package/vault/wiki/decisions/adr-016.md ADDED

@@ -0,0 +1,91 @@
+---
+type: decision
+title: "ADR-016: @tintinweb/pi-subagents for L4 Critic and Sub-Agent Infrastructure"
+status: accepted
+priority: 1
+date: "2026-05-02"
+tags: [adr, harness, l4, subagents, critic, pi-subagents, tintinweb]
+sources:
+  - "[[HARNESS-PRD]]"
+  - "[[adr-011]]"
+  - "[[adr-012]]"
+related:
+  - "[[adr-011]]"
+  - "[[adr-012]]"
+supersedes:
+created: 2026-05-02
+updated: 2026-05-02
+---
+# ADR-016: @tintinweb/pi-subagents for L4 Critic and Sub-Agent Infrastructure
+## Context
+L4 Adversarial Verification requires a separate agent process (critic) with its own context window, system prompt, and tool set. ADR-011 specifies multi-agent debate with separate sessions. ADR-012 specifies extension-based integration without forking pi.
+Two existing pi subagent packages were evaluated:
+- `@tintinweb/pi-subagents` (v0.6.3, 26 versions) — full-featured: RPC, event bus, custom agents, worktree isolation, memory, graceful turn limits
+- `@mjakl/pi-subagent` (v1.4.1) — minimal: depth guards, cycle prevention, spawn/fork modes. No RPC or event bus.
+Pi's philosophy: "No sub-agents built in. Build your own with extensions, or install a package." Both packages follow this model — they are pi extensions that spawn sub-agents as separate pi processes.
+## Decision
+**Use `@tintinweb/pi-subagents` as the sub-agent infrastructure. Define L4 critic as a custom agent type.**
+### Critic Agent Definition
+`.pi/agents/critic.md`:
+```yaml
+---
+description: Adversarial code reviewer — attacks code changes with hard-threshold pass/fail criteria
+tools: read, grep, find, ls, bash
+model: inherit
+thinking: high
+max_turns: 15
+prompt_mode: replace
+---
+```
+Critic runs with `prompt_mode: replace` — standalone system prompt, no parent context inheritance. This ensures true generator-evaluator separation (FP #8). The critic's system prompt contains hard-threshold pass/fail criteria extracted from the sprint contract.
+### Harness Integration
+The harness extension uses the package's cross-extension RPC to spawn and manage critics:
+1. Harness writes critic prompt to `.pi/harness/critics/<spec-hash>.md` (spec, diff, criteria)
+2. Harness emits `subagents:rpc:spawn` with `type: "critic"`, `prompt: "@.pi/harness/critics/<hash>.md"`
+3. Critic runs in separate pi process (`prompt_mode: replace` = clean context)
+4. Harness listens for `subagents:completed` event to get verdict
+5. Harness files consensus to `wiki/consensus/` (ADR-011)
+### Multi-Round Debate
+For selective multi-round debate (ADR-011), the harness can spawn multiple critic agents with different attack angles and use `steer_subagent` via RPC to inject counter-arguments.
+## Rationale
+- **Event bus + RPC**: The cross-extension RPC (`subagents:rpc:spawn`, `subagents:rpc:stop`) is essential for programmatic harness integration. `@mjakl/pi-subagent` lacks this.
+- **Separate pi processes**: Each sub-agent gets its own context window, model, and tool set. True adversarial separation.
+- **Mature**: 26 versions, active maintenance, ~4.6K monthly downloads.
+- **Custom agent types via `.md` files**: Clean, declarative. No code changes to define new agent roles.
+- **Graceful turn limits**: Critic won't spin forever. Gets wrap-up warning before abort.
+- **Compatibility**: Pi-native. Uses pi's session management, tool system, and extension API. No external LLM SDK needed.
+## Consequences
+### Positive
+- L4 critic runs in isolated context. No generator-evaluator contamination.
+- Extensible to other sub-agent roles (P25 specialization router, P30 browser agent).
+- Event bus enables other extensions to react to sub-agent lifecycle.
+### Negative
+- New dependency: `@tintinweb/pi-subagents`. Must be installed via `pi install npm:@tintinweb/pi-subagents`.
+- Sub-agent token cost is additive (critic tokens + proposer tokens). Mitigated by selective debate routing (ADR-011).
+- Relies on third-party package maintenance. If abandoned, fall back to direct pi SDK usage.
+### Mitigations
+- Package is MIT licensed. Can be forked and maintained if needed.
+- Fallback: direct `createAgentSession()` SDK usage if the package becomes unavailable.

package/vault/wiki/decisions/adr-017.md ADDED

@@ -0,0 +1,79 @@
+---
+type: decision
+title: "ADR-017: Harness Project Structure — src/harness/ Library + Extension Wiring"
+status: superseded
+priority: 1
+date: "2026-05-02"
+tags: [adr, harness, project-structure, foundation, f0]
+sources:
+  - "[[HARNESS-PRD]]"
+  - "[[adr-012]]"
+related:
+  - "[[adr-012]]"
+  - "[[skill-first-architecture]]"
+supersedes: "PRD Section 17 (original lib/ file structure)"
+superseded_by: "Pi built-in event bus (2026-05-04) — custom event bus no longer needed"
+created: 2026-05-02
+updated: 2026-05-04
+---
+# ADR-017: Harness Project Structure
+## Context
+The PRD specified a `lib/` directory with ~30 TypeScript files for harness logic. The project is a pi package with `.pi/extensions/` and `.pi/skills/`, not a standalone Node.js library. The integration model (ADR-012) is extension-based — harness logic wires into pi's `ExtensionAPI`.
+> [!warning] Superseded (2026-05-04)
+> Pi's latest version ships a built-in event bus, making the custom `events.ts` and `harness-event-bus.ts` wiring layer redundant. The code layer now consists of 3 files: `types.ts`, `config.ts`, `drift-monitor.ts`. Skills register directly with pi's native event bus. See [[skill-first-architecture]] for the updated architecture.
+Three structures considered:
+- ~~**Monolithic extension**: all logic in `.pi/extensions/harness-event-bus.ts`~~ (event bus removed)
+- **Multiple extensions**: one per layer (`.pi/extensions/harness-l1.ts`, etc.)
+- ~~**Library + wiring**: `src/harness/` for pure logic, `.pi/extensions/harness-event-bus.ts` for pi integration~~ (event bus removed)
+## Decision
+**Use `src/harness/` as the harness library. Skills register with pi's built-in event bus directly (no custom event bus needed).**
+```
+src/harness/
+  types.ts           # All harness types (Spec, Plan, DriftEvent, CriticVerdict, Config)
+  config.ts          # Load .pi/harness/config.json with code defaults
+  drift-monitor.ts   # L2.5: LLM-first drift detection + rule pre-filter
+.pi/extensions/
+  wiki-hooks.ts         # Existing (unchanged)
+  dotenv-loader.ts      # Existing (unchanged)
+.pi/agents/
+  critic.md             # L4 critic agent definition (ADR-016)
+```
+### Rules
+- `src/harness/` modules are **pure TypeScript**. No pi imports (`ExtensionAPI`, etc.). Testable without pi runtime.
+- Skills register event handlers directly with pi's built-in event bus — no custom wiring extension needed.
+- Shared state between harness modules uses pi's native event bus and typed interfaces in `types.ts`.
+## Rationale
+- **Separation of concerns**: Harness logic (spec hardening, drift detection, critic management) is independent of pi's API. Can be tested with plain vitest.
+- **Preserves PRD modularity**: The 30-file structure condenses into `src/harness/` modules but maintains the same logical separation.
+- **Single extension load**: pi loads one harness extension. No startup ordering issues.
+- **Minimal pi surface**: Pi's built-in event bus handles all Event API calls. Skills register directly with pi's native events.
+## Consequences
+### Positive
+- Testable without pi runtime.
+- Clean dependency direction: `pi native event bus → skills → src/harness/ → nothing external`.
+- Fits standard TypeScript project structure.
+- Fewer files: 3 code files vs 4 (event bus removed).
+### Negative
+- `src/harness/` modules must avoid importing from `@mariozechner/pi-coding-agent`. Type-only imports are OK.
+- Skills must correctly register with pi's built-in event bus API (pi's responsibility, not ours).
+### Mitigations
+- Pi extensions are TypeScript natively — pi runs them via `tsx`. No build step needed for development.
+- Type-only imports from the pi SDK are safe (`import type { ExtensionAPI }`).
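
ADR-017 requires `src/harness/` modules to be pure TypeScript with no pi imports, and describes `drift-monitor.ts` as having a rule pre-filter. One such rule might look like the sketch below. The `ToolCall` shape and the `isRepetitionLoop` name are hypothetical (the real `types.ts` is not shown in the diff), and treating "the last N calls are identical" as a repetition loop is an assumption.

```typescript
// Hypothetical shape; the actual harness types are not part of this diff.
type ToolCall = { tool: string; input: string };

// Returns true when the last `threshold` calls are identical.
// Pure function: no pi imports, so it can be unit-tested without the pi runtime.
function isRepetitionLoop(history: ToolCall[], threshold: number): boolean {
  if (threshold < 2 || history.length < threshold) return false;
  const recent = history.slice(-threshold);
  // Serialize each call to a comparable key and count the distinct keys.
  const distinct = new Set(recent.map((c) => JSON.stringify([c.tool, c.input])));
  return distinct.size === 1;
}
```

A skill handler could feed this function from tool-result events and pass the threshold in from configuration, keeping all pi-specific wiring outside the module.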

package/vault/wiki/decisions/adr-018.md ADDED

@@ -0,0 +1,100 @@
+---
+type: decision
+title: "ADR-018: Single Harness Config File — .pi/harness/config.json"
+status: accepted
+priority: 1
+date: "2026-05-02"
+tags: [adr, harness, config, foundation, f0]
+sources:
+  - "[[HARNESS-PRD]]"
+  - "[[adr-017]]"
+related:
+  - "[[adr-017]]"
+supersedes: "PRD Section 17 (multiple config files)"
+created: 2026-05-02
+updated: 2026-05-02
+---
+# ADR-018: Single Harness Config File
+## Context
+The PRD specified multiple harness config files (`.pi/harness/drift-monitor.json`, `.pi/harness/ts-exec.json`, `.pi/harness/fallow-gate.json`, etc.). This fragments configuration and adds cognitive overhead. Pi already has `.pi/settings.json` for its own config.
+Three approaches considered:
+- **Separate files per subsystem** — original PRD approach
+- **Extend `.pi/settings.json`** with a `harness` key — mixes harness config with pi config
+- **Single `.pi/harness/config.json`** with all harness settings
+## Decision
+**Use a single `.pi/harness/config.json` file. Project-local. No cascade. Defaults in code.**
+```json
+{
+  "driftMonitor": {
+    "enabled": true,
+    "patterns": {
+      "repetitionLoops": { "threshold": 3 },
+      "failureSpirals": { "threshold": 3 },
+      "toolCycling": { "threshold": 5 },
+      "silenceBatching": { "threshold": 6 },
+      "rework": { "threshold": 2 },
+      "excessiveSearch": { "threshold": 8 }
+    },
+    "escalation": {
+      "softNudgeAfter": 2,
+      "strongNudgeAfter": 4,
+      "restartAfter": 6
+    }
+  },
+  "critics": {
+    "maxRounds": 3,
+    "maxTokensPerRound": 6000,
+    "model": "inherit"
+  },
+  "specs": {
+    "storagePath": ".pi/harness/specs",
+    "maxClarificationRounds": 3
+  },
+  "debate": {
+    "enabled": true,
+    "gatingMode": "imad",
+    "budget": {
+      "l1MaxTokens": 6000,
+      "l2MaxTokens": 10000,
+      "l4MaxTokens": 8000
+    }
+  },
+  "phase16": {
+    "biome": true,
+    "tsc": true,
+    "fallow": false
+  }
+}
+```
+### Rules
+- All keys have defaults in `src/harness/config.ts`. User config merges on top.
+- File is project-local only (`.pi/harness/config.json`). No global cascade.
+- User creates from `harness.example.json` or edits by hand.
+- Missing file = all defaults. No error.
+## Rationale
+- **Single source of truth**: One file to understand and edit. No hunting across multiple files.
+- **Defaults in code**: Sensible defaults ship with the harness. Users only override what they need.
+- **No cascade complexity**: Project-local only. Avoids implementing a separate cascade system when pi already has one for its settings.
+- **Flat structure**: Top-level keys correspond to harness subsystems. Clear ownership.
+## Consequences
+### Positive
+- Simple. One file to read, one file to write.
+- Discoverable — single `harness.example.json` shows all options.
+- Merge from code defaults means config file can be minimal.
+### Negative
+- File grows as subsystems are added. Mitigated by flat top-level keys.
+- No per-user global defaults. Users who want the same harness config across projects must copy the file.
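
The "defaults in code, user config merges on top, missing file means all defaults" rules in ADR-018 can be sketched as a small recursive merge. This is an illustration under assumptions rather than the contents of `src/harness/config.ts`: `mergeConfig`, `loadConfig`, and the `Json` types are hypothetical names, and replacing arrays and scalars wholesale (instead of merging them) is an assumed policy.

```typescript
// Hypothetical JSON types for illustration.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

function isObject(v: Json | undefined): v is JsonObject {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Recursively merges `user` on top of `defaults`.
// Nested objects are merged; arrays and scalars are replaced wholesale.
function mergeConfig(defaults: JsonObject, user: JsonObject): JsonObject {
  const out: JsonObject = { ...defaults };
  for (const [key, override] of Object.entries(user)) {
    const base = out[key];
    out[key] = isObject(base) && isObject(override)
      ? mergeConfig(base, override)
      : override;
  }
  return out;
}

// `raw` is the file contents, or undefined when the file does not exist.
function loadConfig(raw: string | undefined, defaults: JsonObject): JsonObject {
  if (raw === undefined) return defaults; // missing file = all defaults, no error
  const parsed: Json = JSON.parse(raw);
  if (!isObject(parsed)) throw new Error("config.json must contain a JSON object");
  return mergeConfig(defaults, parsed);
}
```

With this shape a user file containing only `{"driftMonitor":{"escalation":{"restartAfter":10}}}` would override that one value and leave every other default in place.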

package/vault/wiki/decisions/adr-019.md ADDED

@@ -0,0 +1,75 @@
+---
+type: decision
+title: "ADR-019: Tool-Based Q&A for L1 Spec Clarification"
+status: accepted
+priority: 1
+date: "2026-05-02"
+tags: [adr, harness, l1, spec-hardening, qa, tool]
+sources:
+  - "[[HARNESS-PRD]]"
+  - "[[adr-012]]"
+  - "[[adr-017]]"
+related:
+  - "[[adr-012]]"
+  - "[[adr-017]]"
+supersedes:
+created: 2026-05-02
+updated: 2026-05-02
+---
+# ADR-019: Tool-Based Q&A for L1 Spec Clarification
+## Context
+L1 spec hardening may detect unresolved ambiguities. When automatic resolution fails (the clarification loop is exhausted), the harness must surface structured questions to the user. The harness extension has no direct UI, so it must communicate through the LLM or via registered tools.
+Three approaches considered:
+- **System prompt injection**: LLM rephrases and asks user. Fragile — harness must parse LLM's rephrasing.
+- **Tool-based Q&A**: Harness registers a `harness_ask` tool. LLM calls it with structured questions. Tool handles user interaction via pi's TUI.
+- **Pre-execution gate**: Block before LLM sees the task. Poor UX in pi's conversation model.
+## Decision
+**Register a `harness_ask` tool that the LLM calls when L1 requires user clarification.**
+### Flow
+1. L1 ambiguity detector finds unresolved decisions in user request
+2. Harness injects system prompt: "Call `harness_ask` to clarify these ambiguities before proceeding"
+3. LLM calls `harness_ask({ questions: [{ id, question, options? }] })`
+4. Tool presents structured questions in pi's TUI (using `ctx.ui` API, same pattern as `wiki-hooks.ts` notifications)
+5. User answers via structured input (select from options, free text)
+6. Tool returns `{ answers: [{ id, answer }] }` to LLM
+7. Harness re-checks spec hardness. If resolved, proceed. If not, loop.
+### Fallback
+If `harness_ask` tool registration fails or pi's TUI API is insufficient, fall back to system prompt injection: "ASK THE USER THESE EXACT QUESTIONS: ...". The LLM becomes the intermediary.
+### Constraints
+- Maximum 3 clarification rounds per spec (configurable in `.pi/harness/config.json` → `specs.maxClarificationRounds`)
+- Questions must be multiple-choice when possible (reduces user effort, prevents LLM reinterpretation)
+- User can skip individual questions (allow partial resolution)
+## Rationale
+- **Structured**: Harness formats questions. LLM doesn't re-interpret. Answers are typed.
+- **Natural UX**: LLM mediates the conversation but harness controls the questions.
+- **Proven pattern**: `@tintinweb/pi-subagents` uses pi's TUI for agent widgets and conversation viewers. Tool-based UI is established in pi's extension model.
+- **Extensible**: Same `harness_ask` tool can be used by L2 (plan clarification) and L4 (critic follow-up questions).
+## Consequences
+### Positive
+- Structured Q&A prevents LLM from rephrasing or skipping questions.
+- Reusable across pipeline layers.
+- User sees clear, intentional questions — not LLM-generated ambiguity.
+### Negative
+- Requires pi TUI API support. If insufficient, falls back to system prompt injection (less reliable).
+- Adds latency — tool call round-trip for every clarification round.
+### Mitigations
+- Multiple questions batched in a single `harness_ask` call.
+- `maxClarificationRounds: 3` prevents infinite loops.
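
The clarification flow in ADR-019 (batched questions, typed answers, a bounded number of rounds) can be sketched as a loop. The types below are inferred from the `harness_ask` payload in the flow and are hypothetical, as are the `clarify` function and its callback parameters; pi's real tool-registration and TUI APIs are not used, and the loop is synchronous only to keep the sketch short, whereas a real tool call would be asynchronous.

```typescript
// Hypothetical shapes inferred from the ADR's flow; not pi's actual tool API.
type Question = { id: string; question: string; options?: string[] };
type Answer = { id: string; answer: string };

// `ask` stands in for the harness_ask tool; `unresolved` for the L1 hardness re-check,
// returning whatever questions are still open given the answers so far.
function clarify(
  initial: Question[],
  ask: (questions: Question[]) => Answer[],
  unresolved: (answers: Answer[]) => Question[],
  maxRounds = 3,
): { resolved: boolean; rounds: number; answers: Answer[] } {
  const answers: Answer[] = [];
  let pending = initial;
  let rounds = 0;
  while (pending.length > 0 && rounds < maxRounds) {
    rounds += 1;
    answers.push(...ask(pending)); // one batched call per round
    pending = unresolved(answers); // skipped questions may come back here
  }
  return { resolved: pending.length === 0, rounds, answers };
}
```

The `maxRounds` default mirrors `specs.maxClarificationRounds`; when the loop ends unresolved, the caller decides whether to proceed with a partial spec or block the task.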

package/vault/wiki/decisions/adr-020.md ADDED

@@ -0,0 +1,106 @@
+---
+type: decision
+title: "ADR-020: YAML Task DAG and Sprint Contract Format"
+status: accepted
+priority: 1
+date: "2026-05-02"
+tags: [adr, harness, l2, planning, yaml, dag, sprint-contract]
+sources:
+  - "[[HARNESS-PRD]]"
+  - "[[adr-011]]"
+related:
+  - "[[adr-011]]"
+  - "[[adr-015]]"
+supersedes:
+created: 2026-05-02
+updated: 2026-05-02
+---
+# ADR-020: YAML Task DAG and Sprint Contract Format
+## Context
+L2 structured planning must produce a machine-readable task dependency graph with falsifiable "done" criteria (sprint contracts). The output must be:
+- Parseable by L3 for grounding checkpoint tracking
+- Version-controllable (diff-friendly in git)
+- Human-readable enough for debugging
+Three formats considered: JSON DAG, Markdown + YAML frontmatter, pure YAML.
+## Decision
+**Use pure YAML. Store in `.pi/harness/plans/<spec-hash>.yaml`.**
+### Schema
+```yaml
+spec: sha256:abc123...
+generated: "2026-05-02T14:30:00Z"
+model: anthropic/claude-sonnet-4-6
+tasks:
+  - id: "add-auth-middleware"
+    description: "Add JWT authentication middleware to Express app"
+    dependsOn: ["add-user-model"]
+    doneCriteria:
+      - type: "tests_pass"
+        pattern: "auth/**"
+      - type: "lint_passes"
+      - type: "no_regression"
+        baseline: "main"
+      - type: "spec_requirement"
+        requirement: "JWT tokens must be validated on every /api/* route"
+    estimatedTokens: 5000
+    checkpoint: false
+  - id: "add-user-model"
+    description: "Create User model with password hashing"
+    dependsOn: []
+    doneCriteria:
+      - type: "tests_pass"
+        pattern: "models/user*"
+      - type: "lint_passes"
+      - type: "typescript_compiles"
+    estimatedTokens: 3000
+    checkpoint: true
+```
+### Done Criteria Types
+| Type | Verification | Description |
+|------|--------------|-------------|
+| `tests_pass` | Deterministic | `vitest run --reporter json` pass for given pattern |
+| `lint_passes` | Deterministic | `biome check` pass |
+| `typescript_compiles` | Deterministic | `tsc --noEmit` pass |
+| `no_regression` | Deterministic | Tests that passed on `baseline` still pass |
+| `spec_requirement` | LLM-judged | Specific spec requirement satisfied (L4 critic verifies) |
+| `no_new_dead_code` | Deterministic | `fallow audit --changed-since main` pass |
+### Checkpoints
+Tasks with `checkpoint: true` are grounding checkpoints (P8). L3 pauses after completing a checkpoint task, runs all deterministic criteria, and checks for spec drift. Checkpoint tasks should be the minimal verifiable change (MVC).
+### Sprint Contract
+The entire plan file IS the sprint contract. Sign-off = plan file committed to git. L3 reads it, L4 critic uses `doneCriteria` as attack surface.
+## Rationale
+- **YAML over JSON**: Diff-friendly. No trailing comma issues. Comments supported (`#`). Human-readable without tooling.
+- **YAML over Markdown+YAML**: Single format. No parsing two formats from one file.
+- **Content-addressed**: File named by `spec-hash`. Immutable after generation. Regenerating plan = new hash = new file. Old plans preserved for audit.
+- **Typed `doneCriteria`**: Deterministic criteria can be auto-verified. LLM-judged criteria route to L4 critic. Clear separation.
+## Consequences
+### Positive
+- Single file per plan. Git-friendly diffs.
+- L3 reads YAML directly for checkpoint tracking.
+- Deterministic criteria auto-verified. LLM criteria deferred to L4.
+### Negative
+- YAML parser needed in harness (already available via `js-yaml` dependency from P22b).
+- `spec_requirement` type relies on the L4 critic. If L4 is not yet built (Groups 1-2), these criteria are unchecked.
+- No narrative planning doc. Task descriptions are the only human-readable content.
+### Mitigations
+- `spec_requirement` criteria are skipped until L4 is active (Group 3). During Groups 1-2, only deterministic criteria are enforced.
+- Task `description` fields should be specific enough to serve as the narrative.
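
Because the plan in ADR-020 is a dependency graph, a consumer needs an execution order that respects `dependsOn`. The sketch below shows one way to derive it with a Kahn-style topological sort. It assumes the YAML has already been parsed into objects; the `Task` type keeps only the fields needed here, and `executionOrder` is a hypothetical name, not something the ADR defines.

```typescript
// Hypothetical in-memory form of a parsed plan; field names follow the ADR's schema.
type Task = { id: string; dependsOn: string[]; checkpoint: boolean };

// Kahn-style ordering: every dependency appears before the tasks that depend on it.
// Throws on unknown dependencies or cycles, since a cyclic plan is not a DAG.
function executionOrder(tasks: Task[]): string[] {
  const remaining = new Map<string, Set<string>>();
  for (const t of tasks) remaining.set(t.id, new Set(t.dependsOn));
  for (const t of tasks) {
    for (const d of t.dependsOn) {
      if (!remaining.has(d)) throw new Error(`unknown dependency: ${d}`);
    }
  }

  const order: string[] = [];
  while (remaining.size > 0) {
    // Tasks whose dependencies have all been scheduled are ready to run.
    const ready = Array.from(remaining.entries())
      .filter(([, deps]) => deps.size === 0)
      .map(([id]) => id);
    if (ready.length === 0) throw new Error("cycle detected in task graph");
    for (const id of ready) {
      order.push(id);
      remaining.delete(id);
    }
    // Remove the scheduled tasks from everyone else's dependency set.
    remaining.forEach((deps) => ready.forEach((id) => deps.delete(id)));
  }
  return order;
}
```

For the two-task example in the schema, this places `add-user-model` before `add-auth-middleware`, so the checkpoint task runs first.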