ultimate-pi 0.1.2 → 0.1.3

Files changed (516)

package/vault/wiki/concepts/skill-first-architecture.md ADDED

@@ -0,0 +1,166 @@
+---
+type: concept
+title: "Skill-First Harness Architecture"
+created: 2026-05-03
+updated: 2026-05-03
+tags: [concept, harness, skills, architecture, first-principles]
+related:
+  - "[[Research: Skill-First Harness Architecture]]"
+  - "[[harness-implementation-plan]]"
+  - "[[mvp-implementation-blueprint]]"
+  - "[[agent-skills-pattern]]"
+  - "[[harness-engineering-first-principles]]"
+  - "[[progressive-disclosure-agents]]"
+  - "[[drift-detection-unified]]"
+---
+# Skill-First Harness Architecture
+## Definition
+A harness architecture where pipeline layers are implemented as **markdown-based skills** (`.pi/skills/harness-*/SKILL.md`) rather than TypeScript code modules. Only deterministic infrastructure — the drift monitor, shared types, and config — remains as code. Event routing is handled by pi's built-in event bus. Skills are progressively loaded on-demand via the three-tier disclosure pattern: Discovery → Activation → Execution.
+## First Principles
+1. **Skill is the atomic unit of harness behavior.** Not a code file, not a function. A skill is a self-contained markdown directory with frontmatter metadata, a body of instructions, and optional supporting files. It activates when the LLM determines it's relevant.
+2. **Code is for determinism, not logic.** If a behavior must fire on every matching event with zero exceptions — it's code (hooks, drift monitor). If it's probabilistic evaluation (spec quality, plan correctness, code review) — it's a skill. The model is better at evaluation than imperative code.
+3. **Markdown skills ARE the specification.** No separate spec file per harness layer. The `SKILL.md` body is simultaneously the spec, the implementation instructions, and the documentation. Supporting files (`reference.md`, `scripts/`) provide execution-layer resources loaded on demand.
+4. **Pi's built-in event bus handles routing.** No custom event bus needed — pi's native event system wires events to skill invocations. Pipeline ordering is enforced by skill activation sequence — pi fires `harness-l1-activated` → L1 skill runs → returns → pi fires `harness-l2-activated` → L2 skill runs → etc.
+5. **Progressive disclosure is the memory model.** Skills load in three tiers: Discovery (metadata only, ~80 tokens/skill, always loaded), Activation (full SKILL.md body, ~2,000 tokens, loaded when relevant), Execution (supporting files, unlimited, loaded on demand). This keeps context lean while enabling unlimited bundled knowledge.
+6. **Zero-compile iteration.** Changing a skill is editing markdown. No TypeScript compilation, no npm build step, no restart. Agent picks up changes on next activation. This collapses the edit→build→test cycle for harness behavior.
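The three-tier memory model in principle 5 can be made concrete with a small cost calculation. This is a minimal sketch, not pi's loader: `SkillEntry`, `contextCost`, `totalCost`, and the 1,500-token support-file size are hypothetical, while the ~80 and ~2,000 token figures come from the text.

```typescript
// Hypothetical shapes for illustration; pi's real skill loader may differ.
type Tier = "discovery" | "activation" | "execution";

interface SkillEntry {
  name: string;
  bodyTokens: number;    // SKILL.md body, loaded at activation
  supportTokens: number; // reference.md, scripts/, loaded at execution (assumed size)
}

const DISCOVERY_TOKENS = 80; // per-skill metadata estimate from the text

// Tokens one skill contributes to context at a given tier.
function contextCost(skill: SkillEntry, tier: Tier): number {
  let cost = DISCOVERY_TOKENS;
  if (tier !== "discovery") cost += skill.bodyTokens;
  if (tier === "execution") cost += skill.supportTokens;
  return cost;
}

// Total cost when every skill not listed in `active` stays at discovery.
function totalCost(skills: SkillEntry[], active: Record<string, Tier>): number {
  return skills.reduce(
    (sum, s) => sum + contextCost(s, active[s.name] || "discovery"),
    0
  );
}

const harnessSkills: SkillEntry[] = ["spec", "plan", "critic", "observe", "gate", "memory"].map(
  (suffix) => ({ name: "harness-" + suffix, bodyTokens: 2000, supportTokens: 1500 })
);

const idleCost = totalCost(harnessSkills, {});
const planningCost = totalCost(harnessSkills, { "harness-plan": "activation" });
console.log(idleCost, planningCost); // 480 2480
```

With all six skills idle the context carries only discovery metadata; activating one skill adds its body and nothing else.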
+## Architecture Diagram
+```
+┌─────────────────────────────────────────────────────────────┐
+│                     THE SKILL-FIRST HARNESS                   │
+├─────────────────────────────────────────────────────────────┤
+│  CODE LAYER (TypeScript — deterministic, always-on)          │
+│  ┌──────────────┐  ┌───────────────────┐                    │
+│  │Drift Monitor │  │ Types + Config    │                    │
+│  │(pattern match│  │ (shared infra)    │                    │
+│  │ every tool_  │  │                   │                    │
+│  │ result event)│  │                   │                    │
+│  └──────────────┘  └───────────────────┘                    │
+│  EVENT BUS: pi's built-in native event system               │
+│                                                              │
+│  SKILL LAYER (Markdown — probabilistic, on-demand)           │
+│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │
+│  │ L1 Spec  │ │ L2 Plan  │ │ L4 Critic│ │ L5 Obsrv │       │
+│  │ Hardening│ │(DAG gen) │ │(advers.) │ │(metrics) │       │
+│  └──────────┘ └──────────┘ └──────────┘ └──────────┘       │
+│  ┌──────────┐ ┌──────────┐ ┌──────────┐                     │
+│  │ P20 Gate │ │ L6 Memory│ │ L7 Orch. │  L8 already         │
+│  │(biome+   │ │(wiki wr. │ │(Archon   │  wiki-based         │
+│  │ tsc+fal) │ │ contract)│ │ YAML)    │  skills             │
+│  └──────────┘ └──────────┘ └──────────┘                     │
+│                                                              │
+│  WIKI LAYER (Obsidian — persistent, cross-session)           │
+│  ┌──────────────────────────────────────────────────────┐   │
+│  │  ADRs, specs, plans, consensus, hot cache, index     │   │
+│  └──────────────────────────────────────────────────────┘   │
+└─────────────────────────────────────────────────────────────┘
+```
+## What Becomes Skills (and Why)
+| Harness Layer | Old: TypeScript | New: Skill | Reason |
+|---------------|----------------|------------|--------|
+| L1 Spec Hardening | `l1-spec.ts` (~300 lines) | `harness-spec/SKILL.md` | LLM evaluates ambiguity — probabilistic by nature |
+| L2 Planning | `l2-planner.ts` (~400 lines) | `harness-plan/SKILL.md` | LLM generates task DAG — probabilistic by nature |
+| L4 Adversarial | `l4-critics.ts` (~300 lines) | `harness-critic/SKILL.md` + `.pi/agents/critic.md` | LLM attacks code — probabilistic by nature |
+| P20 Gate | `p20-gate.ts` (~100 lines) | `harness-gate/SKILL.md` | Bash commands — skill provides instructions, bash executes |
+| L5 Observability | `l5-observability.ts` (~200 lines) | `harness-observe/SKILL.md` | LLM evaluates quality — probabilistic by nature |
+| L6 Memory | `l6-memory.ts` (~150 lines) | Already wiki-based skills, plus `harness-memory/SKILL.md` | Only the read-first/write-after contract is added |
+## What Stays Code (and Why)
+| Component | File | Reason |
+|-----------|------|--------|
+| Drift Monitor | `drift-monitor.ts` (~500 lines) | Real-time pattern matching with sliding windows on every `tool_result` event. Sub-millisecond latency required. LLM evaluation every 8 turns is the PRIMARY detection, but the rule-based pre-filter and escalation ladder are deterministic code. |
+| Shared Types | `types.ts` (~200 lines) | TypeScript interfaces for Spec, Plan, DriftEvent, CriticVerdict, Config. Needed by drift monitor and config. |
+| Config Loader | `config.ts` (~100 lines) | Loads `.pi/harness/config.json`, merges with code defaults. Deterministic — no LLM involvement. |
+> [!note] Event Bus Removed (2026-05-04)
+> Pi's latest version ships a built-in event bus. The custom `events.ts` (~200 lines) and `.pi/extensions/harness-event-bus.ts` are no longer needed. Skills register directly with pi's native event system.
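To illustrate why the rule-based pre-filter belongs in code, here is a minimal sketch of a sliding-window check that runs on every `tool_result` event without calling an LLM. The event shape, the class name `SlidingWindowDriftFilter`, and the thresholds are assumptions; the real `drift-monitor.ts` is not shown in this diff.

```typescript
// Hypothetical event shape; the real `tool_result` payload is not shown in the source.
interface ToolResultEvent {
  tool: string;
  output: string;
  isError: boolean;
}

type DriftLevel = "ok" | "warn" | "intervene";

class SlidingWindowDriftFilter {
  private recent: ToolResultEvent[] = [];
  private windowSize: number;
  private warnAt: number;
  private interveneAt: number;

  constructor(windowSize: number, warnAt: number, interveneAt: number) {
    this.windowSize = windowSize;
    this.warnAt = warnAt;
    this.interveneAt = interveneAt;
  }

  // Called on every tool_result event; pure in-memory work, no LLM call.
  observe(ev: ToolResultEvent): DriftLevel {
    this.recent.push(ev);
    if (this.recent.length > this.windowSize) this.recent.shift();
    if (!ev.isError) return "ok";

    // Count events in the window that repeat the newest failing call exactly.
    const repeats = this.recent.filter(
      (e) => e.isError && e.tool === ev.tool && e.output === ev.output
    ).length;

    // Two-step escalation ladder (thresholds are illustrative).
    if (repeats >= this.interveneAt) return "intervene";
    if (repeats >= this.warnAt) return "warn";
    return "ok";
  }
}

const driftFilter = new SlidingWindowDriftFilter(8, 2, 3);
const failingCall: ToolResultEvent = { tool: "bash", output: "tsc: error TS2304", isError: true };
const driftLevels = [
  driftFilter.observe(failingCall),
  driftFilter.observe(failingCall),
  driftFilter.observe(failingCall),
];
console.log(driftLevels.join(",")); // ok,warn,intervene
```

Because the check is a fixed function of the last few events, it returns the same verdict for the same history every time, which is the property a skill cannot guarantee.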
+## Skill Directory Structure
+```
+.pi/skills/
+├── harness-spec/
+│   ├── SKILL.md              # L1: Ambiguity detection, spec hardening, harness_ask tool usage
+│   └── reference.md          # Spec hardening patterns, ambiguity categories
+├── harness-plan/
+│   ├── SKILL.md              # L2: YAML task DAG generation, sprint contracts, plan summary
+│   └── reference.md          # Plan templates, DAG patterns, sprint contract examples
+├── harness-critic/
+│   ├── SKILL.md              # L4: Adversarial attack patterns, hard-threshold criteria, debate protocol
+│   └── reference.md          # Attack angle catalog, failure pattern taxonomy
+├── harness-observe/
+│   ├── SKILL.md              # L5: Keep Rate tracking, LLM-as-Judge, satisfaction metrics
+│   └── reference.md          # Metric definitions, sampling strategies
+├── harness-gate/
+│   ├── SKILL.md              # P20: Deterministic gate instructions (biome, tsc, fallow)
+│   └── reference.md          # Gate configuration, baseline management
+└── harness-memory/
+    ├── SKILL.md              # L6: Read-first/write-after wiki contract, hot cache rules
+    └── reference.md          # Wiki page templates, staleness rules
+```
+## File Count Comparison
+| Metric | Old (Code-First) | New (Skill-First) | Reduction |
+|--------|-----------------|-------------------|-----------|
+| TypeScript source files | 15 | 3 | -80% |
+| TypeScript lines | ~2,500 | ~600 | -76% |
+| Markdown skill files | 0 | 6 SKILL.md + 6 reference.md | New |
+| Markdown skill lines | 0 | ~1,200 (instructions) | New |
+| Compilation required | Yes (all 15 files) | Yes (3 files only) | -80% |
+| Iteration cycle | Edit TS → compile → restart | Edit MD → agent picks up next activation | Seconds vs minutes |
+## Why Skills Over Code for Harness Layers
+### 1. The Model Is Better at Evaluation Than Imperative Code
+L1 spec hardening: "Is this specification ambiguous?" — this is a natural language evaluation task. Writing regex patterns and AST analysis to detect ambiguity is fragile. Asking an LLM "does this spec have unresolved decisions?" is robust. The LLM already runs in the pipeline. Use it.
+### 2. Progressive Disclosure Prevents Context Bloat
+15 TypeScript files loaded into the agent's context as tool definitions consume tokens permanently. Skills load only when relevant. The discovery layer costs ~80 tokens per skill. All 6 harness skills together cost ~480 tokens at discovery — less than ONE loaded code module.
+### 3. User-Editable Without Compilation
+A project team can edit `.pi/skills/harness-critic/SKILL.md` to add project-specific attack patterns. No TypeScript knowledge needed. No build step. The markdown skill IS the configuration.
+### 4. Skills Compose Naturally with the Wiki
+L6 persistent memory is already wiki-based (claude-obsidian skills). Other harness layers reading/writing wiki pages is natural when they're also skills — the LLM invokes wiki-query from within a harness skill the same way it invokes any tool.
+### 5. Cross-Platform Compatibility
+The SKILL.md format is an open standard adopted by Anthropic, OpenAI, Google, GitHub, and Cursor. Harness skills work on ANY platform that supports the standard. Code modules are pi-specific.
+## When NOT to Use a Skill
+Skills are the WRONG choice when:
+- **Deterministic execution is required.** The drift monitor MUST fire on every `tool_result` event with zero exceptions. Skills are probabilistic — the model decides when to activate them.
+- **Sub-millisecond latency is required.** Pattern matching on a sliding window needs <1ms response. LLM invocation adds 200-500ms.
+- **The behavior is purely mechanical.** Loading a config file, merging with defaults, registering pi event handlers — these are wiring, not reasoning. Skills add unnecessary LLM overhead.
+- **State must persist across events.** Pi's native event bus tracks pipeline phase, turn count, and drift history across hundreds of events. Skills are stateless per invocation, so they would need to re-read state from disk each time.
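The "purely mechanical" case is easy to see in code. Below is a minimal sketch of merging a JSON override onto code defaults; the field names in `HarnessConfig` are hypothetical, since the schema of `.pi/harness/config.json` is not given here, and the function takes the file contents as a string so that it stays free of I/O.

```typescript
// Hypothetical config fields; the real `.pi/harness/config.json` schema is not given.
interface HarnessConfig {
  driftWindowSize: number;
  llmCheckEveryTurns: number;
  gateCommands: string[];
}

const DEFAULT_CONFIG: HarnessConfig = {
  driftWindowSize: 8,
  llmCheckEveryTurns: 8,
  gateCommands: ["biome", "tsc", "fallow"],
};

function isPositiveInt(v: unknown): v is number {
  return typeof v === "number" && v > 0 && Math.floor(v) === v;
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Merge a JSON override onto the defaults, ignoring malformed or mistyped values.
function mergeConfig(rawJson: string): HarnessConfig {
  let override: Record<string, unknown> = {};
  try {
    const value: unknown = JSON.parse(rawJson);
    if (typeof value === "object" && value !== null && !Array.isArray(value)) {
      override = value as Record<string, unknown>;
    }
  } catch {
    // Malformed JSON: keep the defaults.
  }
  const windowSize = override["driftWindowSize"];
  const everyTurns = override["llmCheckEveryTurns"];
  const commands = override["gateCommands"];
  return {
    driftWindowSize: isPositiveInt(windowSize) ? windowSize : DEFAULT_CONFIG.driftWindowSize,
    llmCheckEveryTurns: isPositiveInt(everyTurns) ? everyTurns : DEFAULT_CONFIG.llmCheckEveryTurns,
    gateCommands: isStringArray(commands) ? commands : DEFAULT_CONFIG.gateCommands,
  };
}
```

Nothing in this function benefits from model judgment, which is why it stays on the code side of the split.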
+## Migration Path
+From current plan (15 TS files) to skill-first (3 TS files + 6 skills):
+1. **F0 + L2.5 first** (unchanged — these ARE code): Types, config, drift monitor. These are the foundation. Event bus is handled by pi's built-in system.
+2. **Convert L1 Spec Hardening**: Extract the ambiguity detection and spec hardening logic from `l1-spec.ts` into `harness-spec/SKILL.md`. Pi's native event system fires this skill when it detects a `/harness` command.
+3. **Convert L2 Planning**: Extract DAG generation and sprint contract logic into `harness-plan/SKILL.md`.
+4. **Convert L4 Adversarial**: Already partially skill-based (critic.md agent definition). Extract attack pattern catalog into `harness-critic/SKILL.md`.
+5. **Convert P20 Gate**: Already deterministic bash commands. Extract gate instructions into `harness-gate/SKILL.md`.
+6. **Convert L5 Observability**: Extract Keep Rate tracking and LLM-as-Judge into `harness-observe/SKILL.md`.
+7. **L6 Memory**: Already wiki-based. Add `harness-memory/SKILL.md` for the read-first/write-after contract.
+8. **L7 Orchestration**: Already Archon YAML-based. No change.
+9. **L8 Wiki Query**: Already claude-obsidian skills. No change.
+> [!gap] Pi skill system integration details need verification. Can pi skills invoke other pi skills? Can pi skills write to `.pi/harness/` directories? These determine whether pi's event bus sequences skills or skills chain themselves.

package/vault/wiki/concepts/structured-compaction.md ADDED

@@ -0,0 +1,78 @@
+---
+type: concept
+title: "Structured Compaction Pipeline"
+aliases: ["compaction pipeline", "five-layer compaction"]
+created: 2026-05-01
+tags: [concept, harness-design, context-management, compaction, claude-code]
+status: developing
+related:
+  - "[[harness-implementation-plan]]"
+  - "[[drift-detection-unified]]"
+  - "[[agentic-harness]]"
+  - "[[context-anxiety]]"
+sources:
+  - "[[claude-code-architecture-vila-lab-2026]]"
+  - "[[claude-code-architecture-qubytes-2026]]"
+  - "[[claude-code-architecture-karaxai-2026]]"
+updated: 2026-05-02
+---
+# Structured Compaction Pipeline
+Claude Code's approach to context management: a five-layer compaction pipeline that uses a forked subagent to produce structured summaries tuned for software engineering tasks. Unlike simple truncation or lossy summarization, this is "structured extraction followed by selective reconstruction."
+## How It Works
+1. Context window fills to ~83.5% of capacity (e.g., ~167K / 200K tokens)
+2. A forked subagent is spawned whose sole job is to produce a structured summary
+3. The subagent receives a ~6,500 token compaction prompt tuned for SE tasks
+4. The summary selectively preserves: file paths, code snippets, error histories, active skills, plan state, tool deltas
+5. All prior messages are dropped. The summary is wrapped in `<summary>` tags and injected
+6. CLAUDE.md, tool definitions, and skills reload from disk automatically
+7. ~85% payload reduction (167K → ~25K tokens)
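The trigger and the payload arithmetic in the steps above can be checked with a short sketch. The names `shouldCompact`, `buildSummary`, and `SessionState` are hypothetical and do not reflect Claude Code's internals; only the 83.5% threshold, the `<summary>` wrapper, and the 167K to 25K figures come from the text.

```typescript
// Hypothetical shapes for illustration; not Claude Code's actual internals.
interface SessionState {
  filePaths: string[];
  errors: string[];
  planState: string;
}

const COMPACTION_THRESHOLD = 0.835; // fraction of capacity, from the text

// True once the context window has filled past the threshold.
function shouldCompact(usedTokens: number, capacityTokens: number): boolean {
  return usedTokens / capacityTokens >= COMPACTION_THRESHOLD;
}

// Build the structured summary that replaces all prior messages.
function buildSummary(state: SessionState): string {
  const lines = [
    "Files: " + state.filePaths.join(", "),
    "Errors: " + state.errors.join(" | "),
    "Plan: " + state.planState,
  ];
  return "<summary>\n" + lines.join("\n") + "\n</summary>";
}

// Fractional payload reduction between two context sizes.
function payloadReduction(beforeTokens: number, afterTokens: number): number {
  return (beforeTokens - afterTokens) / beforeTokens;
}

console.log(shouldCompact(167000, 200000));                    // true
console.log(Math.round(payloadReduction(167000, 25000) * 100)); // 85
```

The real compactor delegates the summary to a forked subagent; the string builder here only stands in for the shape of what gets injected.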
+## What Survives Compaction
+- File paths that were read or modified
+- Code snippets (trimmed, not full files)
+- Error messages and stack traces
+- Active skills and their state
+- Task plan state (TodoWrite)
+- Tool deltas (what changed, not what was the same)
+- CLAUDE.md (re-read from disk after compaction)
+- Last 5 file attachments
+## What Does Not Survive
+- Full file contents (re-read on demand)
+- Intermediate tool outputs
+- Earlier conversation turns (summarized)
+- Transient observations
+- Claude's internal reasoning chains
+## Compaction Instructions
+Users embed preservation instructions directly in CLAUDE.md. The compactor reads CLAUDE.md like any other context and honors these instructions:
+```markdown
+# Summary instructions
+When summarizing this conversation, always preserve:
+- The current task objective and acceptance criteria
+- File paths that have been read or modified
+- Test results and error messages
+- Decisions made and the reasoning behind them
+```
+## Relationship to Current Drift Monitor
+Our P3-P7 (Runtime Drift Monitor) detects stuck patterns and prunes context reactively. The compaction pipeline is proactive — it manages context before problems arise. They are complementary:
+- **Drift Monitor**: Detects failure spirals, injects corrections, forces restart (reactive)
+- **Compaction Pipeline**: Summarizes and reconstructs context at capacity thresholds (proactive)
+## Integration Opportunities
+- Replace P4 (Context pruning + correction injection) with structured compaction
+- Extend P3 (Rule-based stuck-pattern detection) as a complement to compaction
+- Add PreCompact/PostCompact hooks for archival and custom summarization
+- Use forked subagent pattern (already in P25 subagent router, but not for compaction)
+- Compaction instructions stored in wiki pages (L6), not CLAUDE.md files

package/vault/wiki/concepts/subagent-orchestration.md ADDED

@@ -0,0 +1,17 @@
+---
+type: concept
+status: stub
+created: 2026-05-02
+updated: 2026-05-02
+tags: [concept, agents, orchestration]
+---
+# Subagent Orchestration
+Pattern of spawning specialized subagents for isolated tasks with fresh context windows. Each subagent has its own context, tools, and scope. Enables parallel work and context isolation.
+## References
+- [[subagent-worktree-isolation]]
+- [[model-routing-agents]]
+- [[structured-compaction]]

package/vault/wiki/concepts/subagent-worktree-isolation.md ADDED

@@ -0,0 +1,68 @@
+---
+type: concept
+title: "Subagent Worktree Isolation"
+aliases: ["worktree isolation", "subagent sandboxing"]
+created: 2026-05-01
+tags: [concept, harness-design, subagents, isolation, claude-code]
+status: developing
+related:
+  - "[[harness-implementation-plan]]"
+  - "[[agentic-harness]]"
+  - "[[consensus-debate]]"
+sources:
+  - "[[claude-code-architecture-vila-lab-2026]]"
+  - "[[claude-code-security-architecture-penligent-2026]]"
+  - "[[claude-code-architecture-karaxai-2026]]"
+updated: 2026-05-02
+---
+# Subagent Worktree Isolation
+Claude Code's approach to subagent safety: each subagent gets a fresh context window AND optionally an isolated Git worktree. Only the final summary returns to the parent. No intermediate state leaks.
+## Two Dimensions of Isolation
+### Context Isolation
+- Fresh 200K-token context window per subagent
+- Only `subagent_type` (Explore, Plan, custom) and the prompt string are passed from parent
+- NO parent conversation history, file contents, or tool outputs carry over
+- Subagent's own file reads, tool calls, and reasoning stay isolated
+- Only the final message returns to parent as a tool result
+- Net effect: parent context grows by 1 summary, not the full subtask transcript
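A toy simulation shows the net effect described above: the subagent accumulates its own transcript, and the parent receives one message. Everything here (`runSubagent`, the fake tool steps, the message format) is a hypothetical stand-in for a real model run.

```typescript
// Hypothetical simulation; the real subagent runtime is not shown in the source.
type SubagentType = "Explore" | "Plan" | "custom";

interface SubagentResult {
  finalMessage: string;     // the only thing returned to the parent
  transcriptLength: number; // stays inside the subagent
}

// Stand-in for a model run: starts from a fresh context seeded only with the prompt.
function runSubagent(subagentType: SubagentType, prompt: string): SubagentResult {
  const privateContext: string[] = [prompt]; // no parent history is copied in
  privateContext.push("read src/a.ts", "read src/b.ts", "grep TODO"); // isolated work
  const finalMessage = "[" + subagentType + "] done: " + prompt;
  privateContext.push(finalMessage);
  return { finalMessage, transcriptLength: privateContext.length };
}

const parentMessages: string[] = ["user: refactor auth", "assistant: delegating"];
const subagentResult = runSubagent("Explore", "map auth call sites");
parentMessages.push(subagentResult.finalMessage); // parent grows by one entry only

console.log(parentMessages.length, subagentResult.transcriptLength); // 3 5
```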
+### Filesystem Isolation (Worktree)
+- `isolation: worktree` creates a temporary Git worktree
+- Subagent gets an isolated copy of the repository
+- Edits don't conflict with main agent's working directory
+- Worktree is cleaned up when subagent finishes
+- Custom `WorktreeCreate`/`WorktreeRemove` hooks support non-Git VCS
+- Blast-radius control: risky refactor runs in worktree, not on main tree
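For a Git-backed repository, the create and cleanup steps map onto the standard `git worktree add` and `git worktree remove` subcommands. The sketch below only builds the argument lists and does not execute them; `planWorktree`, the `.worktrees/` directory, and the `subagent/` branch prefix are illustrative assumptions, not Claude Code's actual layout.

```typescript
// Hypothetical helper; only the `git worktree` subcommands themselves are real.
interface WorktreePlan {
  worktreePath: string;
  branch: string;
  create: string[]; // argv for creating the worktree on a new branch
  remove: string[]; // argv for discarding it afterwards
}

function planWorktree(repoRoot: string, taskId: string): WorktreePlan {
  // Keep task ids safe for both the filesystem and Git ref names.
  const safeId = taskId.replace(/[^A-Za-z0-9_-]/g, "-");
  const worktreePath = repoRoot + "/.worktrees/" + safeId;
  const branch = "subagent/" + safeId;
  return {
    worktreePath,
    branch,
    create: ["git", "worktree", "add", "-b", branch, worktreePath],
    remove: ["git", "worktree", "remove", "--force", worktreePath],
  };
}

const riskyRefactor = planWorktree("/repo", "rename-auth-module");
console.log(riskyRefactor.create.join(" "));
console.log(riskyRefactor.remove.join(" "));
```

Passing argument arrays to a process runner, rather than concatenating a shell string, avoids quoting problems if a task id contains unexpected characters.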
+## Why This Matters
+Without isolation, subagents are just named prompts. With isolation, they become:
+1. **Context multipliers**: 10 subagents each with 200K context = 2M effective context across the session
+2. **Safety boundaries**: Read-only subagent physically cannot write files (not just "promised not to")
+3. **Parallelism enablers**: Multiple subagents editing in parallel worktrees, merges handled after completion
+4. **Blast-radius limiters**: Risky operations contained within worktree, discardable on failure
+## What Subagents CANNOT Do
+- Spawn their own subagents (one level only — prevents infinite recursion)
+- Access parent conversation history (fresh context only)
+- Access plugin subagent capabilities (hooks, MCP servers, permissionMode — security boundary)
+- Survive parent session termination (temporary by design)
+## Integration with P25 Subagent Router
+Our current P25 (Subagent Specialization Router) dispatches by task type but lacks:
+- Fresh context per subagent (currently shares parent context)
+- Worktree filesystem isolation
+- Sidechain transcripts (currently all output goes to parent)
+- Per-subagent tool restrictions (currently full tool access)
+- Custom subagent definitions in YAML (currently only hardcoded router logic)
+**Proposed**: P25 evolves to P25b — full subagent isolation with worktree support. See harness-implementation-plan for phase specification.
+## Security Model
+Plugin subagents have additional restrictions: no `hooks`, `mcpServers`, or `permissionMode` in frontmatter. This prevents plugin-supplied subagents from inheriting privilege. For full subagent capability, define in `.claude/agents/` (user-controlled), not in plugin bundles.

package/vault/wiki/concepts/superpowers-methodology.md ADDED

@@ -0,0 +1,78 @@
+---
+type: concept
+status: developing
+created: 2026-05-05
+tags:
+  - agent-skills
+  - methodology
+  - discipline
+  - tdd
+  - workflow
+related:
+  - "[[superpowers-github-repo]]"
+  - "[[superpowers-release-blog]]"
+  - "[[superpowers-termdock-analysis]]"
+  - "[[skill-first-architecture]]"
+  - "[[agent-skills-pattern]]"
+  - "[[policy-engine-pattern]]"
+---
+# Superpowers Methodology
+## Definition
+Superpowers is an agentic skills framework that gives AI coding agents a disciplined, structured software development methodology. It transforms agents from fast typists into disciplined engineering partners by enforcing process through hard gates — not suggestions, not best-practice advice, but mandatory workflows that block progress until conditions are met.
+## Core Principles
+1. **Discipline over intelligence** — A disciplined junior engineer ships more reliable code than a brilliant cowboy who skips process. AI agents are the same.
+2. **Hard gates over suggestions** — "Always write tests first" in CLAUDE.md is a suggestion that gets ignored under pressure. "NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST. Write code before the test? Delete it." is a gate that cannot be bypassed.
+3. **Composable workflow** — Each skill's output is the next skill's input. Brainstorming → spec → plan → tasks → subagent implementation → code review → merge. The compounding effect matters more than individual skill quality.
+4. **Fresh context per task** — Subagents start clean with only task description and relevant context, not full conversation history. Prevents context pollution and drift.
+5. **Progressive disclosure** — Skills load only name and description at startup (~100 tokens each). Full instructions load on demand when task matches.
+## The Brainstorm → Plan → Implement → Review Pipeline
+```
+User describes goal
+    ↓
+[brainstorming] — Ask clarifying questions, explore alternatives, present design in sections
+    ↓ (user approves design)
+[using-git-worktrees] — Create isolated branch, verify clean test baseline
+    ↓
+[writing-plans] — Break into 2-5 min tasks with exact file paths, code, verification steps
+    ↓ (user approves plan)
+[subagent-driven-development] — Dispatch fresh subagent per task
+    ↓ per task
+[test-driven-development] — RED: write failing test → GREEN: minimal code to pass → REFACTOR
+    ↓
+[requesting-code-review] — Review against plan, report by severity, critical = block
+    ↓
+[finishing-a-development-branch] — Verify tests, present merge/PR options, clean up
+```
+## Three Types of Enforcement
+| Skill Type | Enforcement | Examples |
+|-----------|-------------|----------|
+| **Rigid (Iron Laws)** | Hard gates, delete-and-restart consequences | TDD, systematic debugging |
+| **Adaptive (Structured)** | Checklist, hard gate on entry, flexible execution | Brainstorming, writing plans |
+| **Advisory** | Reports findings, human decides | Code review |
+## Why Hard Gates Work
+LLMs are trained to be helpful, which means they rush to produce output. A prompt saying "write tests first" competes with the model's default helpfulness bias. A hard gate saying "NO production code without a failing test first — delete code written before tests" creates a structural constraint the model cannot bypass without explicitly violating its instructions. The model follows because it understands principles, not because it was told to follow blindly.
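For contrast, here is what the same TDD rule looks like as a code-level gate rather than an instruction. Superpowers itself enforces the rule through skill text, so this is purely a hypothetical sketch (`TddGate` and `ChangeRequest` are invented names), and it deliberately ignores the REFACTOR step for brevity.

```typescript
// Hypothetical gate; Superpowers enforces this through skill instructions, not code.
interface ChangeRequest {
  file: string;
  isTestFile: boolean;
}

interface GateVerdict {
  allowed: boolean;
  reason: string;
}

class TddGate {
  private failingTestSeen = false;

  // Record a test run; a failing run (RED) opens the gate, a passing run closes it.
  recordTestRun(passed: boolean): void {
    this.failingTestSeen = !passed;
  }

  // Production edits are blocked until a failing test has been observed.
  check(change: ChangeRequest): GateVerdict {
    if (change.isTestFile) return { allowed: true, reason: "test files are always editable" };
    if (this.failingTestSeen) return { allowed: true, reason: "failing test exists (RED)" };
    return { allowed: false, reason: "NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST" };
  }
}

const tddGate = new TddGate();
const productionEdit: ChangeRequest = { file: "src/auth.ts", isTestFile: false };

const beforeRed = tddGate.check(productionEdit).allowed; // false
tddGate.recordTestRun(false);                             // a test now fails
const afterRed = tddGate.check(productionEdit).allowed;  // true
console.log(beforeRed, afterRed); // false true
```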
+## Relationship to Our Harness
+Superpowers and our harness share the same mission (discipline for AI agents) but operate at different levels:
+| Dimension | Superpowers | Our Harness |
+|-----------|-------------|-------------|
+| **Enforcement** | Probabilistic (model compliance with skill instructions) | Deterministic (code-level drift monitor, pre-execution gates) |
+| **Architecture** | Markdown skills only | Skills + TypeScript code (drift monitor, config, types) |
+| **Scope** | Development methodology | Full agentic pipeline (spec, plan, execute, verify, observe, memory) |
+| **Portability** | Any SKILL.md-compatible agent | Pi-specific (TypeScript extension API) |
+| **Trigger** | Agent self-selects skills by description match | Harness layers fire at deterministic pipeline stages |
+Superpowers validates our skill-first architecture choice and can be USED WITHIN our harness — as a `.pi/skills/superpowers/` skill set that the agent loads. But Superpowers cannot replace our code-level enforcement (drift monitor) because its enforcement is only as strong as the agent's compliance with the skill instructions.

package/vault/wiki/concepts/think-in-code.md ADDED

@@ -0,0 +1,73 @@
+---
+type: concept
+title: Think in Code
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - context-optimization
+  - agentic-harness
+  - paradigm
+status: developing
+related:
+  - "[[context-mode]]"
+  - "[[agentic-harness-context-enforcement]]"
+sources:
+  - "[[think-in-code-blog]]"
+  - "[[context-mode-website]]"
+---
+# Think in Code
+A paradigm for AI coding agents where the agent writes code to process data instead of reading raw data into the context window for mental processing.
+## Definition
+> When you need to analyze, count, filter, compare, or process data, write code that does the work and output only the answer. Don't read raw data into context to process mentally.
+## Origin
+Introduced by B. Mert Köseoğlu in context-mode v1.0.64 (2026). Mandatory across all 14 platform instruction files.
+## Mechanism
+1. Agent encounters a task requiring data analysis (count files, filter errors, parse JSON, compare configs)
+2. Instead of calling `Read()` on each file, agent writes a JavaScript/TypeScript script
+3. Script runs in sandbox via `ctx_execute()` MCP tool (Node.js built-ins only, no npm)
+4. Only `console.log()` output enters the conversation
+5. The result occupies up to ~200× less context than the raw reads would (claimed; see Efficiency Gains)
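A script of the kind described in steps 2 to 4 might look like the following. It is a minimal sketch: the log contents are an in-memory stand-in (a real script would read the files inside the sandbox), and `summarizeErrors` is a hypothetical name. Only the final `console.log` line would reach the conversation.

```typescript
// In-memory stand-in for files on disk; a real script would read them in the sandbox.
const logFiles: Record<string, string> = {
  "api.log": "INFO start\nERROR timeout\nERROR timeout\nINFO done",
  "worker.log": "ERROR disk full\nINFO retry\nERROR timeout",
};

// Count ERROR lines by message so that only the tally leaves the sandbox.
function summarizeErrors(files: Record<string, string>): string {
  const counts: Record<string, number> = {};
  for (const fileName of Object.keys(files)) {
    for (const line of (files[fileName] || "").split("\n")) {
      if (line.indexOf("ERROR ") === 0) {
        const message = line.slice("ERROR ".length);
        counts[message] = (counts[message] || 0) + 1;
      }
    }
  }
  return Object.keys(counts)
    .sort()
    .map((message) => message + ": " + counts[message])
    .join(", ");
}

try {
  console.log(summarizeErrors(logFiles)); // disk full: 1, timeout: 3
} catch (err) {
  console.log("analysis failed: " + String(err));
}
```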
+## Enforcing in Agentic Harness
+To enforce "Think in Code" in any agentic harness:
+### Method 1: System Prompt Injection
+Add the rule to the agent's AGENTS.md or equivalent instruction file:
+```markdown
+## Think in Code (MANDATORY)
+When you need to analyze, count, filter, compare, or process data,
+write code (JavaScript/Python) that does the work. Output only the
+answer. Do NOT read raw data into context for mental processing.
+Use built-ins only. No package installs. Always try/catch.
+```
+### Method 2: PreToolUse Hook
+Intercept `Read()`, `Bash()`, `WebFetch()` calls and check if the call looks like data analysis. Redirect to a sandbox execution tool.
+### Method 3: PostToolUse Compression
+When large output enters context, automatically summarize/gist it and store raw data in a searchable index (FTS5 or similar). Mark the raw data as reference-only.
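Method 3 can be sketched as a function that passes small outputs through and replaces large ones with a gist plus a reference key. The in-memory `rawOutputIndex`, the `raw-N` key format, and the two-line preview are assumptions for illustration; a real harness might back the index with FTS5 as suggested above.

```typescript
// Hypothetical store and gist format; a real harness might use a full-text index.
interface CompressedOutput {
  inContext: string;    // what the agent sees
  referenceId?: string; // key for retrieving the raw output later
}

const rawOutputIndex: Record<string, string> = {};
let nextReferenceNumber = 1;

function compressToolOutput(output: string, maxChars: number): CompressedOutput {
  // Small outputs enter context unchanged.
  if (output.length <= maxChars) return { inContext: output };

  // Large outputs are stored out of context and marked reference-only.
  const referenceId = "raw-" + nextReferenceNumber++;
  rawOutputIndex[referenceId] = output;

  const lines = output.split("\n");
  const gist =
    "[" + output.length + " chars in " + lines.length + " lines; first 2 shown] " +
    lines.slice(0, 2).join(" / ") +
    " (full output: " + referenceId + ")";
  return { inContext: gist, referenceId };
}
```

The agent can still ask for the raw data by its reference key, so nothing is lost; it just stops occupying the context window by default.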
+### Method 4: MCP Execution Tool
+Provide an `execute()` MCP tool that runs code in a sandbox. The agent learns to prefer this over raw reads because it's faster and cheaper.
+## Efficiency Gains (claimed)
+| Before | After | Reduction |
+|--------|-------|-----------|
+| 47 files × Read() = 700 KB | 1 ctx_execute() = 3.6 KB | 200× |
+| 20 tool calls = 600 KB | Same work, 20 KB | 30× |
+| Cloudflare 2,500+ endpoints | 2 tools, ~1,000 tokens | 60× |
+## Related Patterns
+- **Cloudflare Code Mode**: Same concept for Workers API. LLM writes TypeScript, runs in V8 isolate (Dynamic Workers).
+- **Code Interpreter**: Similar to ChatGPT's code interpreter but for local agent tool calls.
+- **Output Compression**: context-mode's companion technique — strip filler words from agent responses (65-75% output reduction).

package/vault/wiki/concepts/ts-execution-layer.md ADDED

@@ -0,0 +1,100 @@
+---
+type: concept
+title: "TypeScript Execution Layer"
+created: 2026-05-01
+updated: 2026-05-01
+tags:
+  - agent-tools
+  - typescript-execution-layer
+  - sandbox
+  - context-optimization
+  - harness
+status: developing
+related:
+  - "[[mcp-tool-routing]]"
+  - "[[think-in-code-enforcement]]"
+  - "[[agentic-harness-context-enforcement]]"
+  - "[[harness-implementation-plan]]"
+  - "[[Research: executor.sh Harness Integration]]"
+sources:
+  - "[[codeact-apple-2024]]"
+  - "[[cloudflare-codemode]]"
+  - "[[executor-rhyssullivan]]"
+  - "[[colinmcnamara-context-optimization-codemode]]"
+---
+# TypeScript Execution Layer
+Pattern for AI agent tool calling: instead of exposing dozens of tools as individual function calls in the LLM context, give the agent a **single "write TypeScript" tool** plus a **sandboxed TypeScript runtime** with a typed API surface for all tools. The LLM writes code; the runtime executes it; only results return to context.
+## The Problem: Tool Context Bloat
+Traditional MCP-based tool calling loads every tool definition into the LLM's context window:
+```
+System prompt (~500 tokens)
++ Tool definitions for 10 MCP servers (~5,000 tokens)
++ Conversation history (~2,000 tokens)
++ Tool call/response pairs (~3,000 tokens per interaction)
+= ~10,500+ tokens per turn
+```
+Each additional MCP server adds 300-800 tokens of tool definitions. Organizations with 50+ MCP servers face impossible context budgets. Agents get "dumber" as more tools are added. Teams constantly enable/disable tools to prevent overload.
+## The Solution: Code as Execution Layer
+```
+System prompt with coding instructions (~400 tokens)
++ TypeScript API type definitions (~2,000 tokens)
++ Generated code (~500 tokens)
++ Execution results only (~200 tokens)
+= ~3,100 tokens per turn
+```
+**~3-4x context reduction.** Multi-step workflows that would require 5-10 round-trips with traditional tool calling become one code generation turn.
+## Why TypeScript (Not Python, Not JSON, Not Bash)
+| Approach | Strengths | Weaknesses |
+|----------|-----------|------------|
+| **JSON tool-calling** | Simple, structured | No control flow, no composition, verbose for multi-step |
+| **Bash execution** | LLMs good at Bash, discoverable | No fine-grained auth, dangerous, platform-specific |
+| **Python (CodeAct)** | Rich libraries, interpreter errors | 60+ point gap open vs closed models, sandboxing hard |
+| **TypeScript** | Massive training data, type guardrails, Node.js ecosystem | Requires sandbox infra, JS-only |
+TypeScript advantages for agent code generation:
+1. **Rich training data**: Millions of TS/JS repos in pretraining
+2. **Type safety as guardrails**: Generated types guide correct API usage
+3. **Deterministic execution**: Code runs predictably once generated
+4. **Node.js ecosystem**: Huge library surface for data processing
+## Implementations
+| System | Sandbox | Language | Key Feature |
+|--------|---------|----------|-------------|
+| **CodeAct** (Apple 2024) | Python interpreter | Python | Foundational research, up to 20% higher success rate |
+| **Cloudflare Code Mode** (2025) | V8 Worker isolates | TypeScript | Network isolation, RPC dispatch |
+| **Executor** (RhysSullivan 2026) | Local Node.js runtime | TypeScript | Tool catalog, cross-agent sharing |
+## Harness Integration
+For the ultimate-pi harness, the TypeScript execution layer maps to a **new L3 tool phase**:
+- All L3 tools (read, bash, edit, grep, find, ck_search) exposed as typed TS functions
+- Agent gets single `write_ts` tool instead of 8-15 individual tools
+- Code runs in sandboxed Node.js VM or Deno subprocess
+- Tool calls dispatch via typed RPC back to harness
+- Permission subsystem (P35) gates all tool calls within sandbox
+- Extends P14 (Think-in-Code) from "write code for data analysis" to "write code to orchestrate tools"
+See [[harness-implementation-plan]] for the P43 phase specification.
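The integration points above can be sketched as a typed tool surface with a permission check on every dispatch. This is an assumption-laden illustration, not the P43 design: `SandboxTools`, its method signatures, and the in-memory file map are hypothetical, and a real implementation would dispatch over RPC to the harness rather than read from an object.

```typescript
// Hypothetical API surface; tool names follow the list above, signatures are assumptions.
type ToolName = "read" | "grep";

interface ToolCall {
  tool: ToolName;
  args: string[];
}

class SandboxTools {
  readonly calls: ToolCall[] = [];
  private allowed: ToolName[];
  private files: Record<string, string>;

  constructor(allowed: ToolName[], files: Record<string, string>) {
    this.allowed = allowed;
    this.files = files;
  }

  // Every tool call passes through one permission gate before it is recorded.
  private dispatch(tool: ToolName, args: string[]): void {
    if (this.allowed.indexOf(tool) === -1) throw new Error("permission denied: " + tool);
    this.calls.push({ tool, args });
  }

  read(filePath: string): string {
    this.dispatch("read", [filePath]);
    return this.files[filePath] || "";
  }

  grep(pattern: string, filePath: string): string[] {
    this.dispatch("grep", [pattern, filePath]);
    return (this.files[filePath] || "").split("\n").filter((l) => l.indexOf(pattern) !== -1);
  }
}

// What agent-generated code might look like: several tool calls, one short result.
function agentProgram(tools: SandboxTools): string {
  const sourcePaths = ["src/a.ts", "src/b.ts"];
  const todoCount = sourcePaths.reduce((n, p) => n + tools.grep("TODO", p).length, 0);
  return "TODO count: " + todoCount;
}

const demoFiles = { "src/a.ts": "// TODO: fix\nconst a = 1;", "src/b.ts": "const b = 2;\n// TODO: test" };
const fullAccess = new SandboxTools(["read", "grep"], demoFiles);
console.log(agentProgram(fullAccess)); // TODO count: 2
```

Two tool calls ran, but only one line of output would return to the model's context, and a sandbox constructed with `["read"]` alone would reject the same program at the first `grep`.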
+## Tradeoffs
+| Pro | Con |
+|-----|-----|
+| 3-4x context reduction | Requires sandbox infrastructure |
+| Multi-step in one turn | Permissioning moves to execution layer |
+| Richer control flow, error handling | LLM must generate valid TypeScript |
+| Deterministic execution | Sandbox escape is a security concern |
+| Type guardrails reduce errors | Not all models equally good at TS |
+| Tool discovery without context load | Requires type generation from tool schemas |

package/vault/wiki/concepts/typescript-strict-mode.md ADDED

@@ -0,0 +1,37 @@
+---
+type: concept
+status: developing
+tags:
+  - typescript
+  - strict-mode
+  - type-safety
+related:
+  - "[[ts-strict-mode-rishikc]]"
+  - "[[Research: TypeScript Best Practices and Codebase Structure]]"
+created: 2026-05-02
+updated: 2026-05-02
+---
+# TypeScript Strict Mode
+The `"strict": true` compiler option in `tsconfig.json` enables a suite of type-checking flags that catch entire categories of runtime errors at compile time.
+## Sub-flags Enabled
+| Flag | What it catches |
+|------|----------------|
+| `noImplicitAny` | Expressions and declarations whose type would silently be inferred as `any` |
+| `strictNullChecks` | `null` and `undefined` treated as distinct types, must be handled |
+| `strictFunctionTypes` | Enforces contravariance on function parameter types |
+| `strictPropertyInitialization` | Class properties must be initialized |
+| `noImplicitThis` | `this` expressions whose type would implicitly be `any` |
+| `strictBindCallApply` | Type-checks `.bind`, `.call`, `.apply` arguments |
+| `alwaysStrict` | Emits `"use strict"` and prevents `with` statements |
+| `useUnknownInCatchVariables` | Catch variables are `unknown`, not `any` |
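Two of these flags are easiest to understand from code that compiles cleanly under `"strict": true`. The functions below are illustrative examples, not taken from any cited source.

```typescript
interface UserProfile {
  id: number;
  nickname?: string;
}

// strictNullChecks: the optional field must be narrowed before it is used.
function displayName(user: UserProfile): string {
  return user.nickname !== undefined ? user.nickname.toUpperCase() : "user-" + user.id;
}

// useUnknownInCatchVariables: `err` is `unknown`, so narrow before reading `.message`.
function parsePort(raw: string): number | string {
  try {
    const port: unknown = JSON.parse(raw);
    if (typeof port !== "number") throw new TypeError("port must be a number");
    return port;
  } catch (err) {
    return err instanceof Error ? err.message : String(err);
  }
}

console.log(displayName({ id: 7 }));                  // user-7
console.log(displayName({ id: 1, nickname: "ada" })); // ADA
console.log(parsePort("8080"));                       // 8080
console.log(parsePort('"http"'));                     // port must be a number
```

Without `strictNullChecks`, calling `user.nickname.toUpperCase()` directly would compile and then throw at runtime for users with no nickname.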
+## Consensus
+All authoritative sources agree: enable `"strict": true` for new projects. For existing codebases, migrate incrementally — enable one flag at a time, fix errors per module.
+## Pair With
+ESLint `@typescript-eslint/recommended-type-checked` for defense-in-depth. Strict mode catches type issues; ESLint catches behavioral issues like floating promises.