npm - ultimate-pi - Versions diffs - 0.1.2 → 0.1.3 - Mend

ultimate-pi 0.1.2 → 0.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (516)

package/vault/wiki/concepts/vcc-conversation-compaction-for-pi.md ADDED

@@ -0,0 +1,51 @@
+---
+type: concept
+status: developing
+created: 2026-05-05
+tags:
+  - pi-agent
+  - vcc
+  - compaction
+  - memory
+related:
+  - "[[pi-vcc-github-repo]]"
+  - "[[context-continuity]]"
+  - "[[structured-compaction]]"
+---
+# VCC Conversation Compaction for Pi
+## Definition
+In the Pi context, VCC refers to the transcript-preserving, deterministic compaction approach adopted by `pi-vcc`, inspired by the View-oriented Conversation Compiler. It compresses sessions without calling an LLM, then adds recall over the raw lineage history.
+## Core Mechanics
+- Algorithmic extraction, not model summarization
+- Stable sectioned output: goal, files, commits, outstanding context, preferences
+- Explicit recall API (`vcc_recall`) with regex and lineage scope
+- High token reduction on long sessions (often 90%+)
+## Practical Impact
+For long-running coding sessions, VCC-style compaction reduces cost and hallucination risk during summarization while preserving retrievability of older context.
+## Competitive Position in Pi Ecosystem
+pi-vcc is the only fully deterministic (no-LLM) compaction extension. Three other Pi compaction extensions exist but all use LLM calls:
+- **pi-model-aware-compaction**: Per-model threshold triggers (timing control, not algorithm change)
+- **pi-custom-compaction**: Swap compaction model/template (still LLM-based)
+- **pi-agentic-compaction**: Virtual filesystem + sandboxed tools (still LLM-based)
+See [[pi-compaction-extensions-ecosystem]] for full comparison and [[deterministic-session-compaction]] for the broader pattern.
+## Broader Pattern Validation
+The deterministic compaction pattern is independently validated by:
+- **Codex DSC RFC** (openai/codex#8573): Proposed identical approach for Codex, closed as not_planned (Source: [[codex-dsc-rfc-8573]])
+- **Distill** (143 stars): Deterministic context preprocessing, different layer but same no-LLM principle (Source: [[distill-deterministic-context-compression]])
+- **MemoSift**: 6-layer deterministic compression engine with framework adapters
+## Clarification
+VCC here is **not** the VS Code extension acronym. It is a compaction method and a Pi package category.
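
The "explicit recall API with regex and lineage scope" described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `TranscriptEntry` and `RecallQuery` shapes, the `recall` function name, and the default limit are hypothetical and are not taken from `pi-vcc`, whose real `vcc_recall` signature may differ.

```typescript
// Hypothetical data model: pi-vcc's real transcript format may differ.
interface TranscriptEntry {
  sessionId: string; // session this entry was recorded in
  role: "user" | "assistant" | "tool";
  text: string;
}

// Hypothetical query shape for a regex recall restricted to a session lineage.
interface RecallQuery {
  pattern: string; // regular expression source
  lineage: string[]; // session ids in scope (current session plus ancestors)
  limit?: number; // maximum number of hits to return
}

// Deterministic recall: no model call, only a regex filter over raw entries.
function recall(entries: TranscriptEntry[], query: RecallQuery): TranscriptEntry[] {
  const re = new RegExp(query.pattern, "i"); // no "g" flag, so test() is stateless
  const scope = new Set(query.lineage);
  const hits = entries.filter((e) => scope.has(e.sessionId) && re.test(e.text));
  return hits.slice(0, query.limit ?? 20);
}
```

Because the lookup is a plain filter, the same query over the same history always returns the same entries, which is the property the page contrasts with LLM summarization.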

package/vault/wiki/concepts/verification-drift-detection.md ADDED

@@ -0,0 +1,19 @@
+---
+type: concept
+title: "verification-drift-detection"
+created: 2026-04-30
+updated: 2026-04-30
+status: seed
+tags: [concept, harness, testing]
+related:
+  - "[[execution-feedback-loop]]"
+  - "[[grounding-checkpoints]]"
+---
+# Verification Drift Detection
+> [!stub] See [[grounding-checkpoints]] for the harness implementation.
+Detects when an agent's implementation drifts away from the spec or when verification results become stale. Part of the execution feedback loop: after each change, verify that the output still matches expected behavior. Drift detection triggers re-grounding — forcing the agent to re-read the spec before continuing.
+In the ultimate-pi harness, this is implemented by Layer 3 ([[grounding-checkpoints]]), which enforces smallest-verifiable-change + drift detection on every checkpoint.

package/vault/wiki/consensus/consensus-records.md ADDED

@@ -0,0 +1,58 @@
+---
+type: index
+title: Consensus Records
+created: 2026-04-30
+updated: 2026-04-30
+status: active
+tags: [consensus, debate, alignment, index]
+related:
+  - "[[consensus-debate]]"
+  - "[[adr-011]]"
+  - "[[harness-implementation-plan]]"
+---
+# Consensus Records
+Permanent alignment records for all agent debates. **Every debate verdict — win, lose, or deadlock — is filed here.**
+Future agents query this directory before forming positions. Contradicting a filed consensus triggers a harness block (L7 enforcement).
+## Directory Convention
+- Filename: `[layer]-[topic-slug].md`
+- Layers: `spec` (L1), `plan` (L2), `verify` (L4)
+- Example: `spec-idempotency-key-design.md`
+## Consensus Page Template
+```markdown
+---
+type: consensus
+layer: spec | plan | verify
+verdict: CONSENSUS_REACHED | DEADLOCK | BUDGET_EXHAUSTED | TIMEOUT
+date: YYYY-MM-DD
+participants: [agent-a, agent-b]
+topic: "Brief description"
+related: page-refs (wikilinks to related pages)
+---
+# [Topic]
+## Final Position
+[The winning / final agreed position]
+## Key Rounds Summary
+| Round | Attacker | Defender | Outcome |
+|-------|----------|----------|---------|
+| 1 | ... | ... | ... |
+## Evidence References
+- (wikilinks to evidence sources)
+## Rationale
+Why this consensus was reached. What was settled.
+```
+## No records yet
+Consensus filing begins with Phase P19b of the [[harness-implementation-plan]].
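
The directory convention above (`[layer]-[topic-slug].md`) can be sketched as a small helper. The slug rule used here (lowercase, runs of non-alphanumerics collapsed to a hyphen) is an assumption; the page only gives the overall pattern and one example, and `slugify` and `consensusFilename` are hypothetical names.

```typescript
// Layers named in the directory convention: spec (L1), plan (L2), verify (L4).
type ConsensusLayer = "spec" | "plan" | "verify";

// Assumed slug rule: lowercase, collapse anything non-alphanumeric to "-",
// and trim leading or trailing hyphens.
function slugify(topic: string): string {
  return topic
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Builds a record filename following `[layer]-[topic-slug].md`.
function consensusFilename(layer: ConsensusLayer, topic: string): string {
  return `${layer}-${slugify(topic)}.md`;
}
```

Under these assumptions, `consensusFilename("spec", "Idempotency Key Design")` reproduces the example filename given in the convention.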

package/vault/wiki/decisions/2026-04-30-pi-lean-ctx-native.md ADDED

@@ -0,0 +1,122 @@
+---
+type: decision
+status: accepted
+created: 2026-04-30
+tags:
+  - lean-ctx
+  - pi
+  - extensions
+  - token-optimization
+  - mcp
+related:
+  - "[[lean-ctx]]"
+  - "[[leanctx-website]]"
+  - "[[Research: context-mode vs lean-ctx]]"
+updated: 2026-04-30
+title: "ADR: Adopt pi-lean-ctx Native Package, Drop Custom Extension"
+---
+# ADR: Adopt pi-lean-ctx Native Package, Drop Custom Extension
+## Context
+The ultimate-pi harness had a custom extension (`extensions/lean-ctx-enforce.ts`) that:
+- Detected lean-ctx binary availability
+- Wrapped `bash` commands with `lean-ctx -c` prefix
+- Overrode `read` to call `lean-ctx read` with simple mode selection
+- Registered `/lean-ctx-status` diagnostic command
+This custom extension was a stopgap — basic, manual wrapping, no MCP bridge, no auto mode selection, no compression stats.
+Meanwhile, the pi-lean-ctx npm package (v3.4.5, published by yvgude) provides a native Pi extension with full lean-ctx integration.
+## Alternatives Considered
+1. **Keep custom extension** — Simple, self-contained, no external npm dependency. But misses MCP bridge (48 tools), auto read-mode selection, ls/find/grep tools, compression stats, and reconnection logic.
+2. **Adopt pi-lean-ctx** — npm package maintained by lean-ctx author. Full MCP integration, all 48 lean-ctx MCP tools as native Pi tools, rich read mode selection, spawnHook bash wrapping, compression stats footer, reconnect/timeout-handling.
+3. **Hybrid: keep custom + add MCP manually** — Would duplicate effort. pi-lean-ctx already does everything better.
+## Decision
+**Replace the custom `lean-ctx-enforce.ts` extension with the `pi-lean-ctx` npm package.**
+## Changes Made
+| File | Action | Detail |
+|------|--------|--------|
+| `extensions/lean-ctx-enforce.ts` | Deleted | Replaced by pi-lean-ctx |
+| `.pi/settings.json` | Edited | Added `"npm:pi-lean-ctx"` to packages array |
+| `.pi/SYSTEM.md` | Edited | Updated skill routing line |
+| `package.json` | Edited | Updated `check:ts` script to dotenv-loader |
+| `.pi/skills/lean-ctx/SKILL.md` | Edited | Added integration note at top |
+| `.pi/npm/node_modules/pi-lean-ctx` | Installed | v3.4.5 + all deps |
+## What pi-lean-ctx Provides
+### Tool Overrides
+| Tool | Custom Ext | pi-lean-ctx |
+|------|-----------|-------------|
+| `bash` | Prepends `lean-ctx -c` | SpawnHook wraps `lean-ctx -c sh -lc` (preserves env, aliases). `raw=true` bypass option. |
+| `read` | Basic `lean-ctx read -m lines/…` | Auto mode selection: full (<8KB code), map (8KB–96KB), signatures (>96KB). Syntax highlighting. Compression stats footer. Truncation handling. |
+| `ls` | Not handled | Routes through `lean-ctx ls` with limit support |
+| `find` | Not handled | Routes through `lean-ctx find` with glob + limit |
+| `grep` | Not handled | Routes through `lean-ctx -c rg` with full ripgrep flags |
+| `cat` blocking | Not enforced | Read tool description warns: "Do NOT use bash to read files (cat/head/tail)" |
+### MCP Bridge
+- Auto-connects to lean-ctx MCP server (stdio transport)
+- Registers all 48 lean-ctx MCP tools as native Pi tools
+- Auto-reconnect (3 attempts, exponential backoff 2s/4s/8s)
+- 120s tool timeout with retry for idempotent tools
+- Tools excluded from bridge: `ctx_read`, `ctx_multi_read`, `ctx_shell`, `ctx_search`, `ctx_tree` (already handled via Pi-native tools)
+### Diagnostic Command
+`/lean-ctx` — Shows binary path, MCP bridge status, registered tool count, reconnect attempts, last hung/errored tool.
+## Dependencies
+- **Runtime**: `lean-ctx` binary (v3.4.2 installed via npm/cargo)
+- **npm**: `pi-lean-ctx@3.4.5` with `@modelcontextprotocol/sdk@^1.29.0`
+- **Peer**: `@mariozechner/pi-coding-agent@>=0.50.0` (we have 0.70.x)
+- **Peer**: `@mariozechner/pi-tui@*` (available via pi-coding-agent)
+- **TypeBox alias**: pi-agent's jiti loader aliases `@sinclair/typebox` → `typebox`
+## Consequences
+### Positive
+- 48 lean-ctx MCP tools available to agent: `ctx_session`, `ctx_knowledge`, `ctx_semantic_search`, `ctx_impact`, `ctx_architecture`, `ctx_workflow`, `ctx_gain`, etc.
+- Richer read modes: auto mode selection based on file size + extension
+- Proper compression stats on every tool output
+- Graceful reconnection if MCP server dies
+- Upstream-maintained (by lean-ctx author yvgude)
+### Negative
+- External npm dependency (mitigated: published by same author as lean-ctx, Apache 2.0)
+- MCP bridge adds startup latency (~200ms for tool discovery)
+- One more package to keep updated
+### Neutral
+- `/lean-ctx-status` command removed; replaced by `/lean-ctx`
+- Skill routing in SYSTEM.md changed from "default layer" to "native Pi package" description
+## Verification
+- `lean-ctx` binary v3.4.2 installed ✓
+- `pi-lean-ctx` v3.4.5 installed in `.pi/npm/node_modules` ✓
+- All peer dependencies satisfied ✓
+- `tsc` check on remaining extensions passes ✓
+- @sinclair/typebox aliased by jiti loader ✓
+## Next
+- Restart pi agent; pi-lean-ctx loads at session start
+- Run `/lean-ctx` to verify MCP bridge connected
+- Monitor `lean-ctx gain` after a few sessions for token savings data
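
Two of the behaviours listed in this ADR are simple enough to sketch: the size-based read-mode selection from the Tool Overrides table and the reconnect schedule from the MCP Bridge list. This is an illustrative sketch, not pi-lean-ctx's code. It assumes "KB" means 1024 bytes, treats exactly 8KB and exactly 96KB as `map`, and ignores the file-extension check the ADR also mentions. The function names are hypothetical.

```typescript
type ReadMode = "full" | "map" | "signatures";

const KB = 1024; // assumption: the ADR's "KB" is 1024 bytes

// Thresholds from the table: full (<8KB), map (8KB to 96KB), signatures (>96KB).
function selectReadMode(sizeBytes: number): ReadMode {
  if (sizeBytes < 8 * KB) return "full";
  if (sizeBytes <= 96 * KB) return "map";
  return "signatures";
}

// Exponential backoff: each reconnect attempt waits twice as long as the last.
// Defaults mirror the ADR's "3 attempts, 2s/4s/8s".
function backoffDelaysMs(attempts: number = 3, baseMs: number = 2000): number[] {
  return Array.from({ length: attempts }, (_, i) => baseMs * 2 ** i);
}
```

With the defaults, `backoffDelaysMs()` yields 2000, 4000 and 8000 milliseconds, matching the schedule stated in the MCP Bridge section.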

package/vault/wiki/decisions/adr-008.md ADDED

@@ -0,0 +1,40 @@
+---
+type: decision
+title: "ADR-008: Spec-Only Black-Box QA"
+status: active
+priority: 1
+date: "2026-04-28"
+tags: [adr, qa, testing, harness, layer-4]
+sources:
+  - "[[harness-implementation-plan]]"
+related:
+  - "[[adversarial-verification]]"
+  - "[[agentic-harness]]"
+created: 2026-04-30
+updated: 2026-04-30
+---
+# ADR-008: Spec-Only Black-Box QA
+## Context
+Layer 4 (Adversarial Verification) needs to generate tests. Where should the test specifications come from — the implementation code or the specification?
+## Decision
+Tests are generated from the **specification only** — never from implementation code. This is black-box testing enforced at the architectural level.
+The prompt for the QA test writer **never includes implementation code**. The `spec_only` flag is immutable.
+## Rationale
+- Implementation-aware tests can be gamed by the implementation itself
+- Spec-only tests verify behavior, not implementation details
+- Prevents the common failure mode where agents write tests that pass by construction
+## Consequences
+- **Positive**: Tests are honest arbiters of correctness
+- **Positive**: No temptation to "test the implementation" rather than the behavior
+- **Negative**: May miss implementation-specific edge cases (mitigated by critic review in Layer 5)
+- **Negative**: Requires well-specified success criteria (enforced by Layer 1 Spec Hardening)
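
One way to picture "enforced at the architectural level" is a prompt builder whose input type has no field for implementation code at all, paired with a frozen `spec_only` flag. This is a sketch under assumptions: `QaInput`, `buildQaPrompt`, `QA_CONFIG`, and the prompt wording are hypothetical and not taken from the harness.

```typescript
// The flag is typed as the literal `true` and frozen at runtime,
// so it cannot be flipped by later code.
interface QaPromptConfig {
  readonly spec_only: true;
}
const QA_CONFIG: QaPromptConfig = Object.freeze({ spec_only: true as const });

// The input deliberately has no place to put implementation code.
interface QaInput {
  spec: string;
  successCriteria: string[];
}

// Builds the test-writer prompt from the specification alone.
function buildQaPrompt(input: QaInput): string {
  const criteria = input.successCriteria
    .map((c, i) => `${i + 1}. ${c}`)
    .join("\n");
  return [
    "Write black-box tests from the specification below.",
    "Assume nothing about how the behavior is implemented.",
    "",
    "## Specification",
    input.spec,
    "",
    "## Success criteria",
    criteria,
  ].join("\n");
}
```

The design choice is that leaking implementation code into the prompt becomes a type error rather than a convention someone has to remember.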

package/vault/wiki/decisions/adr-009.md ADDED

@@ -0,0 +1,46 @@
+---
+type: decision
+title: "ADR-009: claude-obsidian Mode B for Persistent Memory"
+status: active
+priority: 1
+date: "2026-04-28"
+tags: [adr, memory, wiki, harness, layer-6, claude-obsidian]
+sources:
+  - "[[harness-implementation-plan]]"
+related:
+  - "[[persistent-memory]]"
+  - "[[wiki-query-interface]]"
+  - "[[agentic-harness]]"
+created: 2026-04-30
+updated: 2026-04-30
+---
+# ADR-009: claude-obsidian Mode B for Persistent Memory
+## Context
+Layer 6 (Persistent Memory) needs a knowledge base that persists across sessions, supports cross-referencing, and enables retrieval. Previous approach used a custom `WikiKnowledgeBase` class with Vectra BM25+vector search (~87MB deps).
+## Decision
+Replace custom WikiKnowledgeBase + Vectra with **claude-obsidian skills in GitHub Mode B**. Search is LLM-native: `hot.md` → `index.md` → pages. No custom code, no embedding model.
+## Comparison
+| Aspect | Before (ADR-007) | After (ADR-009) |
+|--------|-------------------|------------------|
+| Cross-session memory | None | hot.md ~500-word cache |
+| Source provenance | No tracking | .raw/ immutable sources + manifest delta |
+| Repository structure | Flat patterns/ dirs | Mode B: modules, components, decisions, dependencies, flows |
+| Search | Vectra BM25+vector (~80MB model) | LLM-native: hot.md → index.md → pages |
+| Lint / health | None | 8+ category checks |
+| Contradiction flagging | None | `> [!contradiction]` callouts |
+| Dependencies | ~87MB | ~50KB skills + optional ollama |
+## Consequences
+- **Positive**: Eliminates ~87MB dependency footprint
+- **Positive**: LLM-native search leverages existing Claude capabilities
+- **Positive**: Obsidian wiki is human-readable and browseable
+- **Negative**: Search quality depends on LLM context management (mitigated by 3-mode depth system)
+- **Negative**: No semantic similarity search without optional ollama setup

package/vault/wiki/decisions/adr-010.md ADDED

@@ -0,0 +1,55 @@
+---
+type: decision
+title: "ADR-010: Agentic Harness ↔ Wiki Tight-Coupling Contract"
+status: active
+priority: 1
+created: "2026-04-28"
+updated: "2026-04-28"
+tags: [decision, harness, wiki, integration, pipeline, adr]
+sources:
+  - "[[harness]]"
+  - "[[harness-implementation-plan]]"
+  - "[[adr-009]]"
+  - "[[adr-008]]"
+  - "[[persistent-memory]]"
+  - "[[wiki-query-interface]]"
+related:
+  - "[[colocate-wiki]]"
+  - "[[harness-wiki-skill-mapping]]"
+  - "[[harness-wiki-pipeline]]"
+---
+# ADR-010: Agentic Harness ↔ Wiki Tight-Coupling Contract
+## Context
+The harness has 8 layers. Layers 6 (Persistent Memory) and 8 (Wiki Query Interface) already reference the wiki. But the other 6 layers have **no explicit read/write contract** with the wiki. This creates two failure modes:
+1. **Design drift**: An agent makes a decision that contradicts an existing ADR or module spec because it never read the wiki first.
+2. **Wiki staleness**: After a pipeline event changes the codebase, the wiki is not updated — decisions, patterns, and statuses go stale.
+[[adr-009]] replaced Vectra with LLM-native wiki search. [[colocate-wiki]] put the wiki in-repo. Now we need the **contract** that makes the harness and wiki a single synchronized system.
+## Decision
+**Every harness layer reads relevant wiki docs before acting, and writes back to the wiki after every state transition.** The contract is enforced at the extension layer (L7 schema orchestration), not by convention.
+Two axioms:
+1. **Read-first**: No layer acts without querying the wiki for relevant ADRs, module specs, and patterns.
+2. **Write-after**: No state transition completes without a wiki write that keeps docs current.
+## Rationale
+- **Consistency**: If ADR-008 says "black-box QA", no layer should generate implementation-coupled tests. Reading the wiki first prevents this.
+- **Traceability**: Every state transition produces a wiki artifact. Future sessions can reconstruct the full decision chain.
+- **Staleness elimination**: The wiki is the single source of truth for design decisions. If code changes, the wiki reflects it. If the wiki says something, the code respects it.
+- **Self-healing**: wiki-lint after every 10-15 writes catches contradictions, orphans, and stale claims before they compound.
+## Consequences
+- **Positive**: Harness decisions are always grounded in documented architecture.
+- **Positive**: Wiki stays current automatically — no manual doc updates needed.
+- **Positive**: New sessions can pick up exactly where the last left off via hot.md.
+- **Negative**: Extra token cost per subtask (~500-1500 for reads, ~500-1500 for writes).
+- **Negative**: Requires discipline at the extension-layer hooks — must not skip wiki reads.
+- **Negative**: Lint runs add latency but prevent long-term decay.
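
The two axioms, read-first and write-after, can be sketched as a wrapper that every layer action passes through. This is a minimal, synchronous sketch for brevity; a real extension hook would presumably be asynchronous. The `Wiki` interface, `WikiRecord`, and `withWikiContract` are hypothetical names introduced for illustration.

```typescript
// Hypothetical wiki access interface.
interface Wiki {
  query(topic: string): string[]; // relevant ADRs, module specs, patterns
  write(page: string, body: string): void;
}

interface WikiRecord {
  page: string;
  body: string;
}

// Enforces the contract: the action cannot run without wiki context,
// and the call does not return until the result has been written back.
function withWikiContract<T>(
  wiki: Wiki,
  topic: string,
  act: (context: string[]) => T,
  toRecord: (result: T) => WikiRecord,
): T {
  const context = wiki.query(topic); // read-first
  const result = act(context);
  const record = toRecord(result);
  wiki.write(record.page, record.body); // write-after
  return result;
}
```

Routing every state transition through one wrapper is what lets the contract be enforced in code rather than by convention.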

package/vault/wiki/decisions/adr-011.md ADDED

@@ -0,0 +1,165 @@
+---
+type: decision
+title: "ADR-011: Multi-Agent Consensus Debate in the Harness Pipeline"
+status: accepted
+priority: 1
+date: "2026-04-30"
+tags: [adr, harness, consensus, debate, multi-agent, pi-messenger, selective-routing, imad]
+sources:
+  - "[[harness-implementation-plan]]"
+  - "[[pi-messenger-analysis]]"
+  - "[[fan2025-imad]]"
+related:
+  - "[[agentic-harness]]"
+  - "[[adversarial-verification]]"
+  - "[[spec-hardening]]"
+  - "[[structured-planning]]"
+  - "[[consensus-debate]]"
+  - "[[selective-debate-routing]]"
+  - "[[drift-detection-unified]]"
+supersedes:
+created: 2026-04-30
+updated: 2026-04-30
+---
+# ADR-011: Multi-Agent Consensus Debate with Selective Routing
+## Context
+The current harness pipeline is single-agent sequential: one agent hardens the spec (L1), one agent plans (L2), one agent executes (L3), one critic attacks once (L4). L4 has `max_attack_rounds: 2` — a retry loop, not a debate.
+The best human software decisions come from back-and-forth argument: thesis → antithesis → synthesis. Single-pass review is weak because the reviewer's first objection is often shallow, and the proposer's rebuttal frequently reveals a deeper truth neither saw alone.
+Agents lack intuition. Multi-round argument is a substitute for intuition — each round forces the opponent to find a new attack surface, and each rebuttal forces the defender to articulate deeper justification.
+pi-messenger (nicobailon/pi-messenger, 532 ⭐) demonstrates that agents CAN communicate peer-to-peer via the file system — registry, inboxes, `fs.watch` for real-time delivery. No server, no daemon, just files. This is the right transport primitive for consensus debates.
+**UPDATE (2026-04-30)**: iMAD (Fan et al., AAAI 2026) demonstrates that debate is NOT always beneficial. Multi-agent debate can overturn correct single-agent answers. Always-on debate wastes tokens AND can reduce accuracy. Selective routing via a pre-debate gating classifier saves up to 92% of tokens AND improves accuracy by up to 13.5%.
+## Decision
+**Add a consensus debate capability to the harness pipeline. Use pi-messenger's file-based message passing as the transport layer. Strip all UI overlays. Build a consensus protocol on top. Gate debate with iMAD-style selective routing — trigger debate only when a pre-debate classifier detects hesitation/uncertainty cues in single-agent self-critique.**
+### iMAD Integration: Pre-Debate Gate
+Before spawning a debate, the system:
+1. Single agent produces structured self-critique response
+2. Extract hesitation cues: uncertainty markers ("might", "could be", "I think"), contradictory statements, missing evidence references, low confidence indicators
+3. Lightweight classifier → debate or skip
+4. If confidence high + no hesitation → skip debate, save tokens
+5. If uncertainty detected → trigger full consensus debate
+Expected reduction: ~92% token savings on high-confidence tasks (~80% of tasks in early estimate).
+### What we adopt from pi-messenger:
+| Component | Purpose |
+|-----------|---------|
+| Agent registry (`.pi/messenger/registry/`) | Agent discovery, presence |
+| Per-agent inbox directories | Message delivery |
+| `fs.watch`-based message detection | Real-time delivery without polling |
+| JSON message format | Structured inter-agent communication |
+| Atomic file write patterns | Race-free message delivery |
+| Stale registration cleanup | Dead agent garbage collection |
+| Memorable name generation | Debug-friendly agent identification |
+### What we strip (the "overlays"):
+- Chat overlay UI (`/messenger`)
+- Status bar indicators (●3, on fire, debugging...)
+- Activity feed timeline
+- Emoji-based status messages
+- Human-as-participant features
+- Crew orchestration (planner→worker→reviewer DAG) — L7 handles this
+- Swarm claim/complete — L7 handles task tracking
+- Message budgets (per-coordination-level) — consensus budget replaces this
+### Consensus Protocol:
+A structured debate protocol with:
+1. **DebateSession**: N agents, M rounds, defined topic/scope
+2. **ConsensusBudget**: Max rounds, max tokens per round, max wall-clock time
+3. **Turn protocol**: Structured messages with `{ role, round, claim, counter_to, evidence_refs }`
+4. **Convergence detection**: When positions stabilize for K consecutive rounds
+5. **Verdict**: `CONSENSUS_REACHED`, `DEADLOCK` (positions unchanged after N rounds), `BUDGET_EXHAUSTED`
+### Integration points (selective — triggered only when pre-debate gate signals uncertainty):
+| Layer | Debate purpose | Agents | Budget |
+|-------|---------------|--------|--------|
+| L1 (Spec Hardening) | Argue about ambiguity resolution | Spec proposer + Spec critic | 3 rounds, ~6K tokens |
+| L2 (Structured Planning) | Argue about plan structure, dependencies | Planner + Plan critic | 3 rounds, ~10K tokens |
+| L4 (Adversarial Verification) | Multi-round attack on implementation | Defender + Attacker | 4 rounds, ~8K tokens |
+## Rationale
+### Why file-based messaging (not MCP, not HTTP, not in-process)?
+1. **Zero infrastructure**: No server, no daemon, no port management. pi.dev extensions already have filesystem access.
+2. **Process isolation**: Each debate participant is a separate LLM session. Filesystem is the natural IPC boundary.
+3. **Crash safety**: Messages are files. If an agent crashes, its messages persist. Debate can resume.
+4. **Observability**: Debate transcripts are files on disk. Debuggable without tooling.
+5. **pi-messenger already solved this**: Registry format, inbox pattern, watcher debouncing, stale cleanup — all battle-tested.
+### Why selective routing (not always-on debate)?
+iMAD shows that always-on debate:
+- Uses far more tokens: selective routing cuts token usage by up to 92%
+- Can overturn correct single-agent answers (accuracy regression)
+- Is only beneficial when the single agent shows hesitation/uncertainty
+Single agent self-critique is cheaper than full debate. Route to debate only when needed.
+### Why not use pi-messenger's Crew orchestration?
+The harness has L7 (Schema Orchestration via Archon) for DAG execution, loop nodes, approval gates, and worktree isolation. pi-messenger's Crew (planner→worker→reviewer waves) is a competing orchestration model. Using both would create conflicting DAG executors.
+### Why consensus budgets?
+Without budgets, debates consume unlimited tokens. A budget forces convergence — agents must prioritize their strongest arguments. This mirrors real human meetings with time limits.
+## Consequences
+### Positive
+- **Better decisions when needed**: Multi-round argument surfaces deeper issues than single-pass review
+- **Token-efficient**: Selective routing avoids debating settled questions
+- **Defense in depth**: Adversarial verification becomes genuine debate when warranted
+- **Spec quality**: L1 debates catch ambiguous specs before implementation
+- **Permanent agent alignment**: Winning consensus filed to `wiki/consensus/` — future agents query and align to resolved debates, preventing re-litigation
+- **Observable reasoning**: Debate transcripts are file artifacts that can be audited
+### Negative
+- **Complexity**: New consensus protocol layer, pre-debate classifier, message schema, convergence detection
+- **Classifier accuracy risk**: Pre-debate gate may miss cases where debate would help (false negatives)
+- **Latency**: When debate IS triggered, multi-round adds wall-clock time
+- **Agent quality variance**: Cheap models may produce shallow arguments
+### Mitigations
+- Debate is opt-in per layer (configurable `consensus: { enabled: true/false }`)
+- Pre-debate classifier is conservative: when uncertain, trigger debate
+- Budgets prevent runaway token consumption
+- **Winning consensus MUST be filed in wiki (`wiki/consensus/`)** as a permanent alignment record — NOT optional. Future agents query consensus before making decisions. Contradicting a filed consensus triggers harness block.
+- Hard-threshold pass/fail criteria (not narrative self-assessment) as primary L4 mechanism; debate is supplementary
+## Token Budget Impact (with Selective Routing)
+| Activity | Always-Debate | With Selective (80% skip rate) |
+|----------|--------------|-------------------------------|
+| L1 Spec Debate | +4,000 | +800 avg |
+| L2 Plan Debate | +5,000 | +1,000 avg |
+| L4 Adversarial Debate | +4,000 | +1,200 avg |
+| **Total added per subtask** | **~13,000** | **~3,000 avg** |
+## Implementation
+See [[harness-implementation-plan]] Phases P17-P19 and P2b.
+## Related
+- [[consensus-debate]] — concept page for the consensus protocol
+- [[selective-debate-routing]] — iMAD concept and mechanism
+- [[pi-messenger-analysis]] — full analysis of pi-messenger and what we adopt/strip
+- [[harness-implementation-plan]] — master plan with build phases
+- [[drift-detection-unified]] — how consensus debate complements drift detection
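
The pre-debate gate and the selective-routing arithmetic in this ADR can be sketched as follows. The rule-based gate below is a stand-in chosen for illustration, not iMAD's classifier. It uses the three uncertainty markers quoted in the ADR plus the missing-evidence and low-confidence cues, and it omits contradiction detection. The `SelfCritique` shape, the `confidence` field, and the 0.8 threshold are assumptions.

```typescript
// Uncertainty markers quoted in the ADR.
const UNCERTAINTY_MARKERS = ["might", "could be", "i think"];

// Hypothetical shape of a structured self-critique response.
interface SelfCritique {
  text: string;
  evidenceRefs: string[];
  confidence: number; // 0..1, assumed to be self-reported
}

interface GateDecision {
  debate: boolean;
  cues: string[];
}

// Conservative gate: any hesitation cue triggers a debate.
function preDebateGate(critique: SelfCritique, minConfidence: number = 0.8): GateDecision {
  const cues = UNCERTAINTY_MARKERS
    .filter((m) => new RegExp(`\\b${m}\\b`, "i").test(critique.text))
    .map((m) => `marker:${m}`);
  if (critique.evidenceRefs.length === 0) cues.push("missing-evidence");
  if (critique.confidence < minConfidence) cues.push("low-confidence");
  return { debate: cues.length > 0, cues };
}

// Expected added tokens per subtask when a share of debates is skipped.
function expectedAddedTokens(alwaysDebateCost: number, skipRate: number): number {
  return Math.round(alwaysDebateCost * (1 - skipRate));
}
```

With an 80% skip rate this reproduces the table's L1 and L2 averages (800 and 1,000). The same uniform rate would give 800 for L4, so the table's 1,200 presumably assumes a lower skip rate at that layer.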

package/vault/wiki/decisions/adr-012.md ADDED

@@ -0,0 +1,102 @@
+---
+type: decision
+title: "ADR-012: Extension-Based Harness Orchestrator — Leveraging Pi's Native Event System"
+status: accepted
+priority: 1
+date: "2026-05-02"
+updated: "2026-05-04"
+tags: [adr, harness, integration, extensions, orchestrator, pi-extension-api]
+sources:
+  - "[[HARNESS-PRD]]"
+  - "[[dotenv-loader]]"
+  - "[[custom-footer]]"
+related:
+  - "[[adr-010]]"
+  - "[[adr-011]]"
+  - "[[adr-026-one-thing]]"
+supersedes:
+created: 2026-05-02
+---
+# ADR-012: Extension-Based Harness Orchestrator — Leveraging Pi's Native Event System
+**UPDATED 2026-05-04**: Original ADR incorrectly stated pi has only 5 native events. Pi v0.70.2 provides 30+ event types via `ExtensionAPI.on()`. The decision — extension-based, no fork — remains correct. Updated to reflect actual pi capabilities.
+## Context
+The HARNESS-PRD specifies an 8-layer mandatory execution pipeline. The pi coding agent provides an `ExtensionAPI` with **30+ native event types**: `session_start`, `session_before_compact`, `session_compact`, `session_shutdown`, `before_agent_start`, `agent_start`, `agent_end`, `turn_start`, `turn_end`, `message_start`, `message_update`, `message_end`, `tool_execution_start`, `tool_execution_update`, `tool_execution_end`, `tool_call` (per-tool types: bash, read, edit, write, grep, find, ls, custom), `tool_result` (per-tool types), `context`, `before_provider_request`, `after_provider_response`, `model_select`, `input`, `user_bash`, `resources_discover`, and more.
+Three integration paths were considered:
+- **A) Fork pi to add hook points.** Maintenance burden. Blocks on upstream.
+- **B) Wrap pi at process boundary.** Fragile, breaks on pi updates.
+- **C) Build a harness orchestrator extension that listens to pi's native events and routes them through a skill pipeline.** Zero pi changes. Full control within extensions.
+## Decision
+**Use an extension-based harness orchestrator. No fork. No process wrapping. No custom event bus needed.**
+A single `harness-orchestrator` extension subscribes to pi's native events directly — no intermediate event bus layer. State machine tracks pipeline position (L1→L2→L3→L4→L5-L8). Phase transitions detected via tool result patterns and turn boundaries. Skills are activated via `pi.sendMessage()` with `deliverAs: "steer"` to inject steering prompts at the right pipeline phase.
+### Event-to-Pipeline Mapping
+Pi's native events map directly to harness pipeline phases without translation:
+| Pi Native Event | Harness Pipeline Action |
+|---|---|
+| `turn_start` | Initialize phase context. Inject L1 spec-hardening steer if entering L1. |
+| `tool_call` | Track which tools the agent invokes. Detect phase transitions (e.g., write tool = entering L3 execution). |
+| `tool_result` | Route to drift monitor during L3 execution. Detect gate conditions (compile failures, lint errors). |
+| `turn_end` | Trigger L4 critic if in verification phase. Accumulate token budget. |
+| `before_agent_start` | Inject harness state into system prompt. Reinject after compaction. |
+| `agent_end` | Finalize pipeline. Trigger L5 observability + L6 memory writes. |
+| `session_start` | Bootstrap harness state. Load config. Warm wiki cache. |
+| `session_compact` | Persist harness state. Reinject after compaction completes. |
+| `session_shutdown` | Flush observations. Write keep-rate samples. |
+### Enforcement Model — Updated
+With `tool_call` events, the orchestrator gains new enforcement capabilities not possible with only 5 events:
+- **Pre-execution tool blocking**: `tool_call` handlers can return `{ block: true, reason: "..." }` to prevent tool execution. This enables: blocking edits when spec isn't hardened, blocking writes when drift detected, blocking bash when sandbox isn't configured.
+- **Result mutation**: `tool_result` handlers can modify content/details/isError. This enables: injecting warnings into results, marking results as errors based on drift detection, adding structural analysis annotations.
+- **Context injection**: `before_agent_start` can replace the system prompt entirely. This enables: switching between "spec hardening mode" and "execution mode" and "verification mode" prompts.
+Not 100% software-enforced for all layers, but `tool_call` blocking + `tool_result` mutation + `before_agent_start` prompt injection achieves high compliance. L7 orchestration (later phase) may add process-level enforcement via `pi.exec()`-based gate scripts.
+## Rationale
+- **Zero pi dependency**: Works with pi v0.70.2 today. No upstream PRs, no fork maintenance.
+- **All harness logic in one extension**: The orchestrator is a single `.pi/extensions/harness-orchestrator.ts`. No intermediate event bus layer.
+- **Proven pattern**: `custom-footer.ts` already demonstrates using `turn_start`, `context`, `model_select`, and `session_start` events. The harness orchestrator is the same pattern, scaled.
+- **Upgrade path**: If pi adds new native events, the orchestrator can subscribe to them without architectural changes.
+- **Tool call blocking is new**: pi's `tool_call` event supports `{ block: true }` return — this is a hard enforcement mechanism not available in the original 5-event assumption.
+## Consequences
+### Positive
+- Ships immediately. No external dependencies beyond pi.
+- Harness is self-contained in `.pi/extensions/`.
+- `tool_call` blocking provides hard enforcement for critical gates.
+- ~150 lines, not ~290 (no intermediate event bus).
+### Negative
+- Still ~95% compliance for skill-level steering (LLM can ignore steering prompts).
+- Pattern detection in `tool_result` is heuristic, not guaranteed.
+- `pi.sendMessage()` steering behavior unverified — need to test skill activation.
+### Mitigations
+- `tool_call` blocking for critical safety gates (no edit without spec, no write with drift).
+- Multiple defense layers (L1 hardening + L2.5 drift + L4 adversarial + P20 deterministic) mean a single-layer bypass is caught downstream.
+- Compliance monitoring in L5 observability tracks bypass rates per layer.
+- L7 orchestration (P23) adds process-level enforcement via `pi.exec()` gate scripts.
+## Correction from Original (2026-05-04)
+| | Original ADR-012 | Corrected |
+|---|---|---|
+| Pi native events | "5 native events" | **30+ native events** |
+| Architecture | Event bus layer on top of 5 events | **Orchestrator listens to 30+ events directly** |
+| File | `harness-event-bus.ts` | **`harness-orchestrator.ts`** |
+| Lines | ~200+ | **~100** (thinner — no bus layer) |
+| Tool blocking | Not possible (assumed) | **Possible via `tool_call` event** |
+| Pre-execution gates | Prompt-only | **Prompt OR `tool_call` blocking** |
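
The pre-execution gates described under the enforcement model can be sketched as a pure decision function. The ADR states that `tool_call` handlers can return `{ block: true, reason: "..." }`. Everything else here is an assumption: the `ToolCallEvent` and `HarnessState` shapes, the `toolName` field, and `gateToolCall` are hypothetical and do not reproduce pi's actual types.

```typescript
// Hypothetical minimal event shape; pi's real payload may carry more fields.
interface ToolCallEvent {
  toolName: string;
}

// Hypothetical harness state tracked by the orchestrator.
interface HarnessState {
  specHardened: boolean;
  driftDetected: boolean;
  sandboxConfigured: boolean;
}

// Return shape named in the ADR; undefined means "let the tool run".
type ToolCallVerdict = { block: true; reason: string } | undefined;

// Applies the three gates listed in the ADR.
function gateToolCall(event: ToolCallEvent, state: HarnessState): ToolCallVerdict {
  if (event.toolName === "edit" && !state.specHardened) {
    return { block: true, reason: "Spec is not hardened yet (L1 incomplete)." };
  }
  if (event.toolName === "write" && state.driftDetected) {
    return { block: true, reason: "Drift detected; re-ground against the spec first." };
  }
  if (event.toolName === "bash" && !state.sandboxConfigured) {
    return { block: true, reason: "Sandbox is not configured." };
  }
  return undefined;
}
```

In an extension, a function like this would presumably be called from a handler registered for `tool_call` through `ExtensionAPI.on()`. Keeping the decision pure makes the gates testable without a running agent.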