npm - ultimate-pi - Versions diffs - 0.1.2 → 0.1.3 - Mend

ultimate-pi 0.1.2 → 0.1.3

Files changed (516)

package/vault/wiki/questions/Research: Automating Software Engineering - Lovable, Bolt, Emergent, Rocket.md ADDED

@@ -0,0 +1,112 @@
+---
+type: synthesis
+title: "Research: Automating Software Engineering — Practices from Lovable, Bolt, Emergent, Rocket"
+created: 2026-05-03
+updated: 2026-05-03
+tags:
+  - research
+  - ai-coding
+  - harness
+  - multi-agent
+  - platform
+  - context-engineering
+status: developing
+related:
+  - "[[Lovable (company)]]"
+  - "[[Bolt.new (StackBlitz)]]"
+  - "[[Rocket.new]]"
+  - "[[Emergent Labs]]"
+  - "[[Source: Lovable Architecture & Clone Analysis]]"
+  - "[[Source: Bolt.new Architecture & Case Study]]"
+  - "[[Source: Rocket.new — Vibe Solutioning Platform]]"
+  - "[[Source: OpenAI Harness Engineering — 0 Lines of Human Code]]"
+  - "[[Source: OpenDev — Building AI Coding Agents for the Terminal]]"
+  - "[[generator-evaluator-architecture]]"
+  - "[[context-engineering]]"
+  - "[[progressive-disclosure-agents]]"
+  - "[[multi-agent-AI-coding-architecture]]"
+  - "[[Context-Aware System Reminders]]"
+  - "[[anthropic2026-harness-design]]"
+sources:
+  - "[[Source: Lovable Architecture & Clone Analysis]]"
+  - "[[Source: Bolt.new Architecture & Case Study]]"
+  - "[[Source: Rocket.new — Vibe Solutioning Platform]]"
+  - "[[Source: OpenAI Harness Engineering — 0 Lines of Human Code]]"
+  - "[[Source: OpenDev — Building AI Coding Agents for the Terminal]]"
+  - "[[anthropic2026-harness-design]]"
+---
+# Research: Automating Software Engineering — Practices from Lovable, Bolt, Emergent, Rocket
+## Overview
+Four platforms (Lovable, Bolt.new, Emergent Labs, Rocket.new) are attacking full-stack software engineering automation from different angles. Lovable and Bolt.new focus on **prompt-to-production web apps** with browser-based execution. Rocket.new adds **pre-build strategy** ("vibe solutioning"). Emergent builds **autonomous coding agents**. Combined with deep engineering reports from OpenAI (Codex) and Anthropic (harness design), clear first-principles patterns emerge for building an AI coding harness.
+## Key Findings
+### 1. Multi-Agent Architecture Is Universal
+Every successful platform decomposes work across specialized agents. Lovable's clone architecture uses **Planner → Architect → Coder** with Pydantic-typed handoffs. Anthropic uses **Planner → Generator → Evaluator** with sprint contracts. OpenAI Codex uses **agent-to-agent review loops** (Ralph Wiggum pattern). OpenDev uses **dual-agent separation** (thinking vs execution) with subagent spawning. The pattern: **don't make one agent do everything**. (Source: [[Source: Lovable Architecture & Clone Analysis]], [[anthropic2026-harness-design]], [[Source: OpenAI Harness Engineering — 0 Lines of Human Code]], [[Source: OpenDev — Building AI Coding Agents for the Terminal]])
+### 2. Environment Control Is the Moat
+Bolt.new's key differentiator: **WebContainers give AI complete control** over filesystem, node server, package manager, terminal, and browser console. This is what turned Claude from a code suggester into an app builder. OpenAI replicated this: Codex drives apps via Chrome DevTools Protocol and has its own ephemeral observability stack per worktree. Bolt hit $4M ARR in 4 weeks after adding Claude + WebContainers. (Source: [[Source: Bolt.new Architecture & Case Study]], [[Source: OpenAI Harness Engineering — 0 Lines of Human Code]])
+### 3. Structured Outputs Prevent Chaos
+The Lovable clone's key insight: moving from text-based AI interactions to **Pydantic-validated structured outputs** transforms AI from demo to production. Each agent receives validated objects, not messy text. OpenAI enforces architectural boundaries mechanically via custom linters — "enforce invariants, not micromanage implementations." OpenDev uses schema-level tool gating: make dangerous tools invisible to the agent, not just blocked. (Source: [[Source: Lovable Architecture & Clone Analysis]], [[Source: OpenAI Harness Engineering — 0 Lines of Human Code]], [[Source: OpenDev — Building AI Coding Agents for the Terminal]])
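A minimal sketch of a validated handoff between a planner and a coder agent. The source describes Pydantic models; this version uses a standard-library dataclass with manual checks as a stand-in so it runs without third-party packages. `PlanStep`, `parse_plan`, and the field names are hypothetical and not taken from the Lovable clone.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanStep:
    """One unit of work handed from planner to coder (hypothetical shape)."""
    file_path: str
    instruction: str

    def __post_init__(self) -> None:
        # Reject empty fields at construction time instead of downstream.
        if not self.file_path.strip():
            raise ValueError("file_path must be non-empty")
        if not self.instruction.strip():
            raise ValueError("instruction must be non-empty")


def parse_plan(raw: str) -> list[PlanStep]:
    """Parse a planner's JSON output into validated steps, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, list):
        raise ValueError("plan must be a JSON array")
    steps = []
    for item in data:
        if not isinstance(item, dict) or set(item) != {"file_path", "instruction"}:
            raise ValueError(f"malformed step: {item!r}")
        if not all(isinstance(value, str) for value in item.values()):
            raise ValueError(f"step fields must be strings: {item!r}")
        steps.append(PlanStep(**item))
    return steps
```

The downstream agent only ever receives `PlanStep` objects, so a malformed model response fails at the boundary rather than partway through code generation.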
+### 4. Context Engineering Is the Central Constraint
+OpenAI's finding: **"Context is a scarce resource."** A giant AGENTS.md file crowds out the task. Instead: short AGENTS.md as table of contents pointing to a structured `docs/` directory. OpenDev implements 5-stage adaptive compaction (70%→99% thresholds), dual-memory architecture (episodic + working memory), event-driven system reminders at decision points. Anthropic found that context resets (not just compaction) are necessary when models exhibit "context anxiety." (Source: [[Source: OpenDev — Building AI Coding Agents for the Terminal]], [[Source: OpenAI Harness Engineering — 0 Lines of Human Code]], [[anthropic2026-harness-design]])
+### 5. Repository Knowledge as System of Record
+OpenAI's framing: **"What Codex can't see doesn't exist."** All knowledge must live in the repository — not in Slack threads, Google Docs, or people's heads. Design docs, execution plans, quality scores are all versioned and co-located with code. Dedicated "doc-gardening" agents scan for stale documentation. Rocket.new takes this further: **one shared context across strategy → build → competitive intelligence.** (Source: [[Source: OpenAI Harness Engineering — 0 Lines of Human Code]], [[Source: Rocket.new — Vibe Solutioning Platform]])
+### 6. "Code Generation Is a Commodity" — The Pre-Build Layer Matters
+Rocket.new's thesis: everyone can generate code now. The missing piece is **deciding what to build and tracking what happens after.** Their platform covers the full arc: market research → product strategy → app building → competitive intelligence. Raised $15M, 1.5M users across 180 countries. Pricing: $25-$350/month, with consulting-style reports at the $250 tier. (Source: [[Source: Rocket.new — Vibe Solutioning Platform]])
+### 7. Generator-Evaluator Loop (GAN-Inspired)
+Anthropic's breakthrough: separating generator from evaluator. Agents "confidently praise their own mediocre work" — but tuning a standalone evaluator to be skeptical works. Each criterion has a hard threshold — fall below any, sprint fails. The evaluator uses Playwright to actually click through the app. OpenAI does the same at scale: "Codex reviews its own changes locally, requests additional agent reviews, responds to feedback, iterates in a loop until all agent reviewers are satisfied." (Source: [[anthropic2026-harness-design]], [[Source: OpenAI Harness Engineering — 0 Lines of Human Code]])
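The hard-threshold rule described above ("fall below any, sprint fails") can be sketched as follows. The criterion names, the 0 to 1 score scale, and the function name are assumptions for illustration; the source does not specify how Anthropic's evaluator represents scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One grading criterion with a hard pass threshold (hypothetical names)."""
    name: str
    threshold: float  # minimum acceptable score, same scale as the scores


def evaluate_sprint(
    criteria: list[Criterion], scores: dict[str, float]
) -> tuple[bool, list[str]]:
    """Return (passed, failed_names). A missing score counts as a failure."""
    failed = [
        c.name
        for c in criteria
        if scores.get(c.name, float("-inf")) < c.threshold
    ]
    return (not failed, failed)


criteria = [
    Criterion("functionality", 0.8),
    Criterion("design_quality", 0.7),
]
passed, failed = evaluate_sprint(
    criteria, {"functionality": 0.9, "design_quality": 0.6}
)
print(passed, failed)  # False ['design_quality']
```

There is deliberately no averaging: a high score on one criterion cannot compensate for a low score on another.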
+### 8. Progressive Disclosure: Maps, Not Encyclopedias
+OpenAI tried the "one big AGENTS.md" approach. It failed: context scarcity, too much guidance becomes non-guidance, rots instantly, hard to verify. Instead: **short AGENTS.md (∼100 lines) as table of contents**, pointing to a structured `docs/` directory. OpenDev implements conditional prompt composition: sections load only when contextually relevant (e.g., git workflow section only in git repos). Skills use 2-phase loading: metadata index at startup, full content on-demand. (Source: [[Source: OpenAI Harness Engineering — 0 Lines of Human Code]], [[Source: OpenDev — Building AI Coding Agents for the Terminal]])
+### 9. "No Manually-Written Code" Philosophy
+OpenAI Codex built a product with **0 lines of human-written code** over 5 months, ∼1 million lines, ∼1,500 PRs, with 3-7 engineers. Humans steer, agents execute. Engineers became systems designers: building scaffolding, guardrails, and feedback loops. Codex even wrote its own AGENTS.md. This required redefining the engineer's role entirely. (Source: [[Source: OpenAI Harness Engineering — 0 Lines of Human Code]])
+### 10. Garbage Collection for AI Slop
+OpenAI's finding: agents replicate existing patterns — including bad ones. Initially spent Fridays (20% of week) cleaning "AI slop." Solution: encode "golden principles" mechanically, run recurring background cleanup agents, enforce continuously. Technical debt treated as "high-interest loan" — pay continuously in small increments. (Source: [[Source: OpenAI Harness Engineering — 0 Lines of Human Code]])
+## Key Entities
+- **[[Lovable (company)]]**: Full-stack AI dev platform (formerly GPT Engineer). SOC 2, ISO 27001. Browser-based with GitHub sync. "Orchestration layer, not just models."
+- **[[Bolt.new (StackBlitz)]]**: Browser-based AI web dev. WebContainers + Claude. 0→$4M ARR in 4 weeks. Open source. Remix + Radix UI + UnoCSS.
+- **[[Rocket.new]]**: "Vibe Solutioning" platform: strategy + building + competitive intel. $15M seed. 1.5M users. Based in Surat, India.
+- **[[Emergent Labs]]**: YC S24. "Autonomous coding agents that replace traditional software development." Building full-stack web & mobile apps from conversation.
+## Key Concepts
+- **[[generator-evaluator-architecture]]**: GAN-inspired multi-agent pattern. Generator builds, evaluator grades against explicit criteria.
+- **[[Context-Aware System Reminders]]**: Event-driven injection of behavioral guidance at decision points. Addresses attention-decay in long sessions.
+- **[[progressive-disclosure-agents]]**: Give agents maps (short AGENTS.md as ToC), not encyclopedias. Load details on-demand.
+- **[[multi-agent-AI-coding-architecture]]**: Planner → Architect → Coder pattern with structured handoffs.
+## Contradictions
+- **Anthropic says context resets are essential** for Sonnet 4.5 due to "context anxiety." **OpenAI says compaction + progressive disclosure** works for Codex. OpenDev uses **5-stage compaction, not resets.** The difference may be model-specific — Opus 4.6 eliminated the need for resets per Anthropic.
+- **Rocket.new says code generation is commoditized** — the frontier is pre-build strategy. **Lovable and Bolt.new** are still competing aggressively on code generation quality. The market hasn't fully shifted yet.
+- **OpenAI enforces architecture mechanically** (linters, structural tests). **Anthropic uses prompting + evaluator contracts**. Both work; the OpenAI approach is more robust but requires more upfront investment.
+## Open Questions
+- How does the planner-generator-evaluator loop scale to existing large codebases (not greenfield apps)? All current demos are new projects.
+- Can "sprint contracts" work for bug fixing and refactoring, not just feature building?
+- When does the evaluator become unnecessary? Anthropic says as models improve, the boundary moves. How to automatically detect when?
+- How to adapt context management strategies (compaction vs resets) per model, automatically?
+- Can Rocket.new's "vibe solutioning" pre-build layer be integrated into a coding harness to automate scope definition?
+## Sources
+- [[Source: Lovable Architecture & Clone Analysis]]: Multi-agent architecture, structured outputs, LangGraph + Groq
+- [[Source: Bolt.new Architecture & Case Study]]: WebContainers, Claude integration, Remix frontend, $4M ARR
+- [[Source: Rocket.new — Vibe Solutioning Platform]]: Strategy → Build → Intelligence, $15M seed, 1.5M users
+- [[Source: OpenAI Harness Engineering — 0 Lines of Human Code]]: Codex, architectural constraints, progressive disclosure, garbage collection
+- [[Source: OpenDev — Building AI Coding Agents for the Terminal]]: Compound AI, dual-agent, adaptive compaction, system reminders
+- [[anthropic2026-harness-design]]: GAN-inspired harness, generator-evaluator loop, sprint contracts

package/vault/wiki/questions/Research: Claude Code State-of-the-Art Harness Improvements.md ADDED

@@ -0,0 +1,209 @@
+---
+type: synthesis
+title: "Research: Claude Code State-of-the-Art Harness Improvements"
+created: 2026-05-01
+updated: 2026-05-01
+tags:
+  - research
+  - claude-code
+  - harness-design
+  - first-principles
+  - agent-architecture
+status: developing
+related:
+  - "[[harness-implementation-plan]]"
+  - "[[model-adaptive-harness]]"
+  - "[[agentic-harness]]"
+  - "[[cursor-harness-innovations]]"
+  - "[[Research: cursor.sh Harness Innovations]]"
+  - "[[Research: Google Antigravity Harness Integration]]"
+  - "[[provider-native-prompting]]"
+  - "[[harness-configuration-layers]]"
+  - "[[feedforward-feedback-harness]]"
+  - "[[self-evolving-harness]]"
+sources:
+  - "[[claude-code-architecture-vila-lab-2026]]"
+  - "[[claude-code-architecture-qubytes-2026]]"
+  - "[[claude-code-architecture-karaxai-2026]]"
+  - "[[claude-code-security-architecture-penligent-2026]]"
+---
+# Research: Claude Code State-of-the-Art Harness Improvements
+## Overview
+Claude Code (Anthropic) is the most architecturally sophisticated agentic coding system in production. Research across 4 primary sources — an academic paper (VILA-Lab, arxiv 2604.14228), a security-focused architecture deep-dive (Penligent), a systems-level walkthrough (KaraxAI), and a five-layer architecture breakdown (Qubytes) — reveals innovations that fundamentally challenge assumptions in our current harness design. Claude Code's architecture was reverse-engineered from a leaked 510K-line TypeScript codebase, revealing a system far more sophisticated than public documentation suggests.
+## Key Findings
+### Architecture Philosophy
+- **Agent Loop > Fixed Pipeline** (Source: [[claude-code-architecture-vila-lab-2026]]): Claude Code is a `while` loop surrounded by infrastructure — not a sequential pipeline. "The core agent loop — assemble context, call model, receive tool request, execute it, repeat — is conceptually simple. The real engineering genius lives in everything around that loop." Our 8-layer pipeline is sequential; Claude Code's loop is reactive.
+- **System, not chatbot** (Source: [[claude-code-security-architecture-penligent-2026]]): "Claude Code is a governed execution environment with a model in the middle. This is not just 'Claude plus bash.'"
+- **Five human values → 13 design principles** (Source: [[claude-code-architecture-vila-lab-2026]]): Human decision authority, safety/security, reliable execution, capability amplification, contextual adaptability. Each value traces through specific implementation choices.
+- **Independent validation of First Principle #1** (Source: [[claude-code-architecture-karaxai-2026]]): "The model is the commodity; the agent is the product." Directly validates our FP #1: "The harness — not the model — determines reliability at scale."
+### 1. Five-Layer Compaction Pipeline
+The most underappreciated Claude Code innovation. Not simple truncation — structured extraction followed by selective reconstruction:
+- Forked subagent produces ~6,500 token structured summary tuned specifically for software engineering tasks
+- Preserves: file paths, code snippets, error histories, active skills, plan state, tool deltas
+- Triggered at ~83.5% of 200K context window
+- Compaction instructions embedded in CLAUDE.md for domain-specific preservation
+- PreCompact/PostCompact hooks for archiving full transcripts before compression
+- ~85% payload reduction (167K → ~25K tokens)
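The figures in the list above are mutually consistent, which a short calculation confirms. Only the three constants come from the text; the `should_compact` helper is a hypothetical illustration of a threshold trigger, not Claude Code's actual logic.

```python
CONTEXT_WINDOW = 200_000   # tokens, from the text
TRIGGER_FRACTION = 0.835   # "~83.5%" trigger point
POST_COMPACTION = 25_000   # "~25K tokens" after compaction

# Token count at which compaction would fire.
trigger_tokens = round(CONTEXT_WINDOW * TRIGGER_FRACTION)

# Fraction of the payload removed by compacting from the trigger point.
reduction = 1 - POST_COMPACTION / trigger_tokens


def should_compact(used_tokens: int) -> bool:
    """Hypothetical trigger: compact once usage reaches the threshold."""
    return used_tokens >= trigger_tokens


print(trigger_tokens)      # 167000
print(f"{reduction:.1%}")  # 85.0%
```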
+**Our gap**: P3-P4 has basic context pruning (rule-based stuck-pattern detection, correction injection). Nothing like structured multi-layer compaction with forked subagent. **This is the single biggest gap in our harness design.**
+### 2. Lifecycle Hook System (Deterministic Policy)
+The most architecturally novel feature. 30+ hook events spanning full lifecycle, each with JSON input/output contracts:
+| Hook Event | When Fires | Control |
+|---|---|---|
+| `PreToolUse` | Before tool execution | Allow/deny/ask/defer, modify input |
+| `PostToolUse` | After tool succeeds | Audit, auto-format, replace output |
+| `PostToolUseFailure` | After tool fails | Inject correction context |
+| `PostToolBatch` | After parallel tool batch | Batch-level context injection |
+| `PermissionRequest` | When permission dialog appears | Programmatic allow/deny |
+| `Stop` / `SubagentStop` | When agent finishes | Prevent stopping, re-invoke |
+| `UserPromptSubmit` | Before prompt processed | Block, inject context |
+| `SessionStart` / `SessionEnd` | Session lifecycle | Load context, cleanup |
+| `PreCompact` / `PostCompact` | Before/after compaction | Archive, block |
+| `SubagentStart` / `SubagentStop` | Subagent lifecycle | Inject context, validate output |
+| `TaskCreated` / `TaskCompleted` | Task lifecycle | Enforce naming, validate completion |
+| `ConfigChange` | Config files modified | Audit, block unauthorized changes |
+| `CwdChanged` / `FileChanged` | Directory/file changes | Reactive env management |
+| `WorktreeCreate` / `WorktreeRemove` | Isolation lifecycle | Custom VCS integration |
+| `Notification` | System notifications | Forward to external services |
+Five hook types: **command** (shell script, exit codes), **HTTP** (webhook), **MCP tool** (call MCP server), **prompt** (single-turn LLM evaluation), **agent** (multi-turn subagent with tool access).
+Critical distinction: **CLAUDE.md achieves ~92% compliance. Hooks achieve 100% compliance** for conditions they match. This is the deterministic escape hatch from probabilistic prompt-based control.
+Exit code semantics: `0` = success (allow), `2` = blocking error (deny, stderr fed to Claude), other = non-blocking error (continue execution).
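The exit-code contract above can be sketched as a small interpreter on the harness side. The `Decision` and `HookResult` types and the function name are hypothetical; only the three-way mapping of exit codes comes from the text.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Decision(Enum):
    ALLOW = "allow"          # exit 0: proceed with the tool call
    DENY = "deny"            # exit 2: block, surface stderr to the model
    CONTINUE = "continue"    # anything else: non-blocking error


class HookResult(NamedTuple):
    decision: Decision
    feedback: Optional[str]  # text fed back to the model, if any


def interpret_hook_exit(exit_code: int, stderr: str = "") -> HookResult:
    """Map a command hook's exit status to a harness decision."""
    if exit_code == 0:
        return HookResult(Decision.ALLOW, None)
    if exit_code == 2:
        # Blocking error: the hook's stderr becomes feedback for the model.
        return HookResult(Decision.DENY, stderr)
    # Any other code is treated as a hook malfunction, not a policy verdict.
    return HookResult(Decision.CONTINUE, None)
```

Reserving a single code for denial means a crashing hook script (which typically exits with 1) does not accidentally block the agent.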
+**Our gap**: Our harness has extension hooks at the layer level (`extensions/harness-*.ts`). But we lack tool-level lifecycle hooks with deterministic exit-code semantics. This is a fundamental architectural gap.
+### 3. Permission System as Architectural Subsystem
+Claude Code treats permission checking as architecturally separate from tool execution — a first-class subsystem sitting between the agent loop and tool execution:
+- **7 permission modes** (six are named here): `default` (read-only), `acceptEdits` (auto-approve edits + safe fs ops), `plan` (read-only exploration + plan writing), `auto` (ML classifier reviews every action), `dontAsk` (only pre-approved tools), `bypassPermissions` (no checks, isolated environments only)
+- **ML-based auto classifier**: Separate model reviews each action in auto mode. Blocks actions that go beyond task, target untrusted infrastructure, or appear driven by prompt injection. Auto mode drops broad allow rules that would short-circuit the classifier.
+- **Rule syntax** (`allow`/`deny`/`ask`): First matching rule wins. `Tool` or `Tool(specifier)` patterns. Same rule language reused across hooks, CLI automation, and settings.
+- **Four configuration scopes**: Managed (org-controlled), User (`~/.claude/`), Project (`.claude/`, Git-shared), Local (`.claude/settings.local.json`, gitignored). Managed settings can enforce `allowManagedHooksOnly`, `allowManagedMcpServersOnly`, `allowManagedPermissionRulesOnly`.
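A minimal sketch of the rule language described above, following the text's "first matching rule wins" ordering. Treating the specifier as a shell-style glob and falling back to `ask` when nothing matches are assumptions for illustration, as are the function names and example rules.

```python
import re
from fnmatch import fnmatchcase

# Accepts "Tool" or "Tool(specifier)".
RULE_RE = re.compile(r"^(?P<tool>\w+)(?:\((?P<spec>.*)\))?$")


def rule_matches(pattern: str, tool: str, argument: str) -> bool:
    """True if a rule pattern applies to this tool call."""
    m = RULE_RE.match(pattern)
    if m is None or m.group("tool") != tool:
        return False
    spec = m.group("spec")
    # A bare "Tool" rule matches every call to that tool.
    return spec is None or fnmatchcase(argument, spec)


def decide(
    rules: list[tuple[str, str]], tool: str, argument: str, default: str = "ask"
) -> str:
    """Return the effect of the first matching (effect, pattern) rule."""
    for effect, pattern in rules:
        if rule_matches(pattern, tool, argument):
            return effect
    return default


rules = [
    ("deny", "Bash(rm *)"),
    ("allow", "Bash(npm run *)"),
    ("allow", "Read"),
]
print(decide(rules, "Bash", "rm -rf build"))  # deny
print(decide(rules, "Bash", "npm run test"))  # allow
print(decide(rules, "Write", "src/app.ts"))   # ask
```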
+**Our gap**: We have NO permission system. L7 orchestration enforces pipeline stages but does not gate individual tool calls. This is a major architectural gap for any production deployment.
+### 4. Subagent Architecture with Deep Isolation
+- **Fresh 200K context window per subagent**: Only final summary returns to parent. All intermediate tool calls, file reads, and reasoning stay isolated.
+- **Worktree isolation**: `isolation: worktree` gives subagent a temporary Git worktree — isolated filesystem copy. Enables parallel editing without conflicts. Blast-radius control.
+- **Tool allowlists/denylists per subagent**: Security-review subagent gets Read+Grep+Glob but no Edit/Write. Different subagents get different capability surfaces.
+- **No nesting**: One level of subagent spawning only — prevents infinite recursion.
+- **Sidechain transcripts**: Subagent interactions captured in separate transcripts, keeping main thread clean.
+- **Custom subagents in YAML**: Defined in `.claude/agents/` or `~/.claude/agents/`. Configurable: tools, model, permission mode, persistent memory, isolation settings.
+**Our gap**: P25 subagent router is cost-based dispatch without worktree isolation. No sidechain transcripts. No per-subagent tool restrictions. No custom subagent definitions in config files.
+### 5. CLAUDE.md Hierarchical System
+Not a single file — a layered system with additive precedence:
+```
+Global (~/.claude/CLAUDE.md) → Enterprise (managed) → Project (.claude/CLAUDE.md) → Local (.claude/CLAUDE.local.md) → Notebook
+```
+- **Conditional rules**: YAML frontmatter with `match: "*.test.ts"` or `paths: ["src/api/**"]`. Rules only load when relevant files are accessed.
+- **96% compliance** with 5 conditional rule files of 30 lines each, vs 92% for single 150-line CLAUDE.md
+- **Injected as `<system-reminder>` tags**: Wrapped in XML, re-sent every API call (not cached in system prompt). Survives compaction because re-read from disk every turn.
+- **Three memory systems**: CLAUDE.md (reliable, user-controlled), Auto-memory (lossy, 200-line limit, Claude-maintained), Session memory (lossy, conversation history gets compacted)
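Conditional rule loading, as described in the list above, amounts to filtering rule files by the paths a session has touched. In this sketch `RuleFile` and `active_rules` are hypothetical names, and `fnmatchcase` is a simplification: its `*` also crosses `/`, which real glob implementations may treat differently.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class RuleFile:
    """A rule file plus the path globs from its frontmatter (hypothetical)."""
    name: str
    body: str
    paths: list[str] = field(default_factory=list)  # empty = always loaded


def active_rules(rule_files: list[RuleFile], touched_paths: list[str]) -> list[str]:
    """Names of rule files whose globs match at least one touched path."""
    active = []
    for rule_file in rule_files:
        unconditional = not rule_file.paths
        relevant = any(
            fnmatchcase(path, pattern)
            for path in touched_paths
            for pattern in rule_file.paths
        )
        if unconditional or relevant:
            active.append(rule_file.name)
    return active


rule_files = [
    RuleFile("general", "Prefer small, focused commits."),
    RuleFile("api", "Validate every request body.", ["src/api/**"]),
    RuleFile("tests", "One assertion per test.", ["*.test.ts"]),
]
print(active_rules(rule_files, ["src/api/users.ts"]))  # ['general', 'api']
```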
+**Our gap**: Our wiki pages serve a similar role but lack the single-entry-point, additive-hierarchy, conditional-loading design. SKILL.md files cover some of this but differently (loaded on invocation, not always-present). No conditional rule matching.
+### 6. Skills with Progressive Disclosure
+- **Name + description loaded at startup** (~100 tokens per skill). Listed in `<available_skills>` block.
+- **Full SKILL.md body loaded on-demand** via Skill tool call. Only when invoked by user or Claude.
+- **Skills can restrict tools** (`allowed-tools`), include supporting files, spawn subagents.
+- **Distinct from slash commands**: Skills are model-invoked (or user `/skill-name`). Slash commands (`/clear`, `/compact`) are deterministic CLI operations.
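The two-phase loading in the list above can be sketched as a registry that exposes only names and descriptions up front and defers reading the body until a skill is invoked. `SkillEntry`, `SkillRegistry`, and the loader callback are hypothetical; only the `<available_skills>` block name comes from the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SkillEntry:
    name: str
    description: str
    load_body: Callable[[], str]  # deferred read of the full SKILL.md body


class SkillRegistry:
    def __init__(self, entries: list[SkillEntry]) -> None:
        self._entries = {entry.name: entry for entry in entries}
        self._loaded: dict[str, str] = {}

    def startup_index(self) -> str:
        """Phase 1: cheap metadata listing placed in the initial context."""
        lines = [f"- {e.name}: {e.description}" for e in self._entries.values()]
        return "<available_skills>\n" + "\n".join(lines) + "\n</available_skills>"

    def invoke(self, name: str) -> str:
        """Phase 2: load the full body on first use, then reuse it."""
        if name not in self._loaded:
            self._loaded[name] = self._entries[name].load_body()
        return self._loaded[name]
```

With this shape, the per-skill startup cost is one short line regardless of how long the skill's body is.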
+**Our gap**: Our skills load full SKILL.md content always. We partially implement progressive disclosure through the `description` field, but don't have the on-demand loading mechanism.
+### 7. Plugin Ecosystem
+- **9,000+ plugins** across registries. Official Anthropic marketplace ships built-in.
+- **Bundles**: Skills + agents + hooks + MCP servers as a single installable unit.
+- **Namespacing**: `/my-plugin:hello` prevents skill name conflicts.
+- **Agent override**: Plugin can replace main agent's system prompt, tool restrictions, model selection.
+- **Plugin subagents** cannot use `hooks`, `mcpServers`, or `permissionMode` — security boundary.
+**Our gap**: No plugin distribution layer. Our skills are project-local. Not critical for CLI harness but limits ecosystem growth.
+### 8. Agentic Search (No Embeddings)
+Claude Code deliberately rejects vector embeddings for code search:
+- **Glob → Grep → Read** hierarchy: File path pattern matching → content search via ripgrep → full file load
+- **"Agentic search generally works better"** than RAG (Boris Cherny, Claude Code creator)
+- Rationale: Code symbols are exact (`getUserById` ≠ `fetchAccountDetails`), indexes drift during active editing, embeddings leak code information as vectors, zero setup friction
+- **Explore subagent** (Haiku) for deep exploration — searches, reads, reasons, returns only summary
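The Glob → Grep → Read hierarchy above can be approximated with the standard library. This is a sketch under stated assumptions: the text says Claude Code uses ripgrep for the content search, whereas this version substitutes Python's `re` module, and all function names are hypothetical.

```python
import re
from pathlib import Path


def glob_files(root: Path, file_glob: str) -> list[Path]:
    """Step 1 (Glob): narrow candidates by path pattern only."""
    return sorted(p for p in root.rglob(file_glob) if p.is_file())


def grep_files(paths: list[Path], pattern: str) -> list[Path]:
    """Step 2 (Grep): keep files with at least one matching line."""
    regex = re.compile(pattern)
    hits = []
    for path in paths:
        with path.open(encoding="utf-8", errors="ignore") as handle:
            if any(regex.search(line) for line in handle):
                hits.append(path)
    return hits


def agentic_search(
    root: Path, file_glob: str, pattern: str, max_files: int = 3
) -> dict[str, str]:
    """Step 3 (Read): load full contents only for the surviving files."""
    hits = grep_files(glob_files(root, file_glob), pattern)[:max_files]
    return {
        path.relative_to(root).as_posix(): path.read_text(
            encoding="utf-8", errors="ignore"
        )
        for path in hits
    }
```

Each step is cheaper than the next, so full file contents enter the context only for exact-symbol hits such as `\bgetUserById\b`. No index is built, so there is nothing to drift during active editing.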
+**Our gap**: P13 (Semantic Code Search via ck MCP) is explicitly embeddings-based. This represents a fundamental design philosophy difference that we should reconsider.
+### 9. Sandboxing (OS-Level Enforcement)
+- **Seatbelt** (macOS) / **bubblewrap** (Linux/WSL2) for Bash tool
+- `autoAllowBashIfSandboxed`: auto-approve commands that can run sandboxed
+- Filesystem: `allowWrite`, `denyWrite`, `denyRead` paths
+- Network: `allowedDomains` restrictions
+- Child processes inherit sandbox boundaries
+- "Even if prompt injection manipulates Claude's behavior, the sandbox can still prevent critical file modification"
+**Our gap**: No sandbox subsystem. Relies on L7 orchestration for blast-radius control.
+### 10. Session Storage & Checkpointing
+- **Append-oriented session transcripts**: Full conversation history stored as JSONL
+- **Resume, fork, rewind**: Continue old sessions, branch into different approaches, replay edits
+- **Checkpoint system**: Snapshots file state before every edit, persists across sessions. Can restore code AND conversation. Separate from compaction.
+- **Bash-driven modifications**: Not tracked by edit-tool checkpoint rewind — explicitly documented limitation
+**Our gap**: Wiki-based state (L6) is not transactional session storage. No checkpoint/rewind capability. No session forking.
+## Key Entities
+- **[[Claude Code]]**: Anthropic's agentic coding CLI tool. 82,000+ GitHub stars, handles millions of coding sessions. Architecture reverse-engineered from 510K-line TypeScript codebase.
+- **[[VILA-Lab]]**: Academic research group (Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, Zhiqiang Shen) that published the most comprehensive Claude Code architecture analysis (arxiv 2604.14228).
+- **[[Boris Cherny]]**: Claude Code creator. Confirmed the deliberate rejection of vector embeddings for code search.
+## Key Concepts
+- **[[structured-compaction]]**: Five-layer compaction pipeline with forked subagent producing structured summaries for software engineering contexts. ~6,500 tokens, selective preservation.
+- **[[lifecycle-hooks]]**: 30+ deterministic hook events spanning full agent lifecycle. Exit-code semantics for blocking (2) vs non-blocking. Five hook types including LLM-based evaluation.
+- **[[permission-subsystem]]**: ML-classified permission checking as architecturally separate layer. 7 modes, 4 scopes, composable rule language.
+- **[[subagent-worktree-isolation]]**: Fresh context window + isolated Git worktree per subagent. Sidechain transcripts. Tool allowlists/denylists.
+- **[[additive-config-hierarchy]]**: CLAUDE.md layered system with conditional YAML frontmatter. 96% compliance from structured small files vs 92% from single large file.
+- **[[progressive-skill-disclosure]]**: Skills load name+description (100 tokens) at startup, full body on-demand via tool call.
+- **[[agentic-search-no-embeddings]]**: Glob → Grep → Read hierarchy. Deliberate rejection of vector embeddings for code search.
+- **[[sandbox-os-enforcement]]**: Seatbelt/bubblewrap OS-level boundaries beyond permission checks.
+## Contradictions
+- **Embeddings vs Agentic Search**: Our P13 invests in semantic code search via ck MCP (embeddings-based). Claude Code's creator states "agentic search generally works better." This is a design philosophy tension, not a clear-cut right/wrong. For exact symbol matching (code search), agentic search is likely superior. For conceptual queries ("find auth logic"), embeddings may help. Recommendation: Benchmark both approaches before committing to P13 implementation.
+- **Pipeline vs Loop**: Our 8-layer pipeline is sequential and mandatory. Claude Code's loop is reactive and flexible. The pipeline model guarantees quality enforcement (L1-L4) but the loop model enables more dynamic task handling. This tension should be resolved by keeping L1-L4 as quality gates but making L7 orchestration a loop-orchestrator rather than a fixed DAG.
+## Open Questions
+- Can the five-layer compaction pipeline work for our target max context, which is smaller than Claude Code's 200K? A smaller window means compaction triggers earlier, so the tradeoffs change.
+- Should we adopt tool-level hooks as a replacement for our layer-level extension hooks, or as a complement? The exit-code semantics of Claude Code hooks are elegant but our TypeScript extension hooks offer richer integration with our pipeline.
+- Is the permission subsystem necessary for a CLI-level harness? Claude Code is a consumer product with enterprise deployments. Our harness is developer-facing. The permission model may be overengineered for our use case.
+- Can we adopt agentic search (Glob→Grep→Read) without abandoning semantic search? Perhaps a hybrid: agentic search as primary, embeddings as fallback for conceptual queries.
+- How much of the plugin ecosystem should we replicate? Namespaced skill distribution is valuable. Agent override is powerful. But 9,000+ plugins is a network-effect problem — not solvable by architecture alone.
+## Sources
+- [[claude-code-architecture-vila-lab-2026]]: Liu et al., arxiv 2604.14228, April 2026. Academic paper analyzing Claude Code source code. Five human values, 13 design principles, architecture comparison with OpenClaw.
+- [[claude-code-architecture-qubytes-2026]]: Vijendra, "Inside Claude Code: The Architecture That Makes AI Actually Do the Work," April 2026. Five-layer architecture breakdown: agent loop, permissions, tools, state, compaction.
+- [[claude-code-architecture-karaxai-2026]]: KaraxAI, "How Claude Code Actually Works: A Systems-Level Deep Dive," March 2026. Full stack: CLAUDE.md, agent loop, skills, plugins, MCP, subagents, hooks, context compression.
+- [[claude-code-security-architecture-penligent-2026]]: Penligent, "Inside Claude Code: The Architecture Behind Tools, Memory, Hooks, and MCP," April 2026. Security-focused technical analysis. Permission modes, sandboxing, CVE case studies, enterprise governance.

package/vault/wiki/questions/Research: Codex State-of-the-Art Harness Improvements.md ADDED

@@ -0,0 +1,99 @@
+---
+type: synthesis
+title: "Research: Codex State-of-the-Art Harness Improvements"
+created: 2026-05-01
+updated: 2026-05-01
+tags:
+  - research
+  - codex
+  - harness
+  - openai
+  - agent-architecture
+status: developing
+related:
+  - "[[codex-harness-innovations]]"
+  - "[[codex-open-source-agent-2026]]"
+  - "[[harness-implementation-plan]]"
+  - "[[model-adaptive-harness]]"
+  - "[[agentic-harness]]"
+  - "[[cursor-harness-innovations]]"
+  - "[[antigravity-agent-first-architecture]]"
+  - "[[Research: cursor.sh Harness Innovations]]"
+  - "[[Research: Google Antigravity Harness Integration]]"
+  - "[[Research: Claude Code State-of-the-Art Harness Improvements]]"
+sources:
+  - "[[codex-open-source-agent-2026]]"
+---
+# Research: Codex State-of-the-Art Harness Improvements
+## Overview
+Codex (OpenAI) is a **fully open-source** (Apache 2.0) Rust-based coding agent with 79.2K+ GitHub stars and 11.4K forks. It is the fourth major production agent analyzed after Cursor, Antigravity, and Claude Code — and uniquely valuable because its architecture is transparent (not reverse-engineered). Codex independently validates 7 of our planned features and reveals 5 new gaps. It also introduces 3 novel architectural patterns that challenge our first principles.
+## Key Findings
+### Features Codex Independently Validates
+| Our Feature | Codex Equivalent | Source |
+|---|---|---|
+| Model-adaptive harness | Per-agent model selection (gpt-5.5/5.4/5.4-mini/spark) with `model_reasoning_effort` per agent | [[codex-open-source-agent-2026]] |
+| Skills system (F0) | agentskills.io standard, progressive disclosure, `$skill-creator`, scoped discovery | [[codex-open-source-agent-2026]] |
+| Lifecycle hooks (P33) | 6-event hooks framework (SessionStart, PreToolUse, PermissionRequest, PostToolUse, UserPromptSubmit, Stop) with exit-code semantics | [[codex-open-source-agent-2026]] |
+| Subagent specialization (P25) | Parallel subagent dispatch with per-agent model + reasoning effort selection | [[codex-open-source-agent-2026]] |
+| Pre-verification isolation (P15b) | Sandbox tiers (read-only, workspace-write) + writable roots for bounded agent work | [[codex-open-source-agent-2026]] |
+| Persistent memory (L6) | Memories system (cross-thread, automatic, background-generated) | [[codex-open-source-agent-2026]] |
+| Subagent worktree isolation (P25b) | Git worktrees for parallel branch isolation | [[codex-open-source-agent-2026]] |
+### New Gaps Identified (5)
+1. **No OS-level sandboxing**: Our P35 permission subsystem is policy-only. Codex uses Seatbelt/bubblewrap/Windows Sandbox for OS-level enforcement. This is a first-principles gap: permissions without enforcement are polite requests.
+2. **No bidirectional MCP**: Our harness is an MCP consumer only. Codex also functions as an MCP server, so other agents can use it as a tool. This enables agent-to-agent composition, which is missing from our architecture.
+3. **No implicit memory capture**: Our L6 is wiki-based (explicit, human-authored). Codex's Chronicle captures screen context automatically for situational awareness. Missing from our memory layer.
+4. **No automations (scheduled tasks)**: Codex supports scheduled recurring agent tasks. No equivalent in our plan. This is a new capability category.
+5. **No skills ecosystem tooling**: Our skills lack `$skill-creator`, `$skill-installer`, and agentskills.io standard compatibility. Tools for skill lifecycle management are missing.
+### Novel Architectural Patterns (3)
+1. **Multi-surface agent architecture**: Single agent logic runs across CLI, IDE extension, Desktop App, and Web via App Server (local HTTP/WebSocket). The agent core is surface-agnostic. This is architecturally distinct from our CLI-only model.
+2. **Rust-native implementation as a first-principles choice**: If your agent runs locally on user machines, use a systems language. Zero dependency install, direct OS sandbox API access, compile-time safety. TypeScript (our choice, Claude Code's choice) adds deployment complexity.
+3. **Sandbox as foundation, permissions as policy layer**: Codex cleanly separates technical enforcement (sandbox) from user-facing decisions (approvals). This is architecturally cleaner than our mixed approach.
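
The third pattern can be illustrated with a minimal sketch, assuming a simplified in-process model. `sandboxPermits`, `approvalPolicy`, and `decide` are hypothetical names. In Codex the first layer is enforced by the OS sandbox rather than by application code, so this only shows the ordering of the two layers, not a real enforcement mechanism.

```typescript
// Illustrative model only: real enforcement belongs in the OS sandbox.
type SandboxTier = "read-only" | "workspace-write" | "danger-full-access";
type AgentAction = { kind: "read" | "write"; path: string };
type Decision = "allow" | "ask-user" | "deny";

// Layer 1 (enforcement): is the action technically possible in this tier?
function sandboxPermits(tier: SandboxTier, action: AgentAction, writableRoots: string[]): boolean {
  if (action.kind === "read") return true;
  if (tier === "read-only") return false;
  if (tier === "danger-full-access") return true;
  // workspace-write: writes are confined to the writable roots.
  // Naive prefix check; it does not normalize ".." segments or symlinks.
  return writableRoots.some(
    (root) => action.path === root || action.path.indexOf(root + "/") === 0
  );
}

// Layer 2 (policy): a user-facing choice, consulted only for permitted actions.
function approvalPolicy(action: AgentAction, autoApproveWrites: boolean): Decision {
  if (action.kind === "read") return "allow";
  return autoApproveWrites ? "allow" : "ask-user";
}

// Policy can never widen what the sandbox denies.
function decide(
  tier: SandboxTier,
  action: AgentAction,
  writableRoots: string[],
  autoApproveWrites: boolean
): Decision {
  if (!sandboxPermits(tier, action, writableRoots)) return "deny";
  return approvalPolicy(action, autoApproveWrites);
}
```

Under this ordering, a permissive approval setting in a `read-only` tier still yields `deny`. A policy-only permission system cannot give that guarantee.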
+## Entities
+- **OpenAI (Codex team)**: 79.2K+ stars, open-source, 756 releases since launch. Primarily Rust engineers.
+## Key Concepts
+- [[codex-harness-innovations]] — 10 key innovations extracted from Codex architecture
+- **App Server Protocol**: Local HTTP/WebSocket bridge between agent core and presentation surfaces. v2 protocol with typed RPC, TypeScript codegen from Rust.
+- **Chronicle**: Screen-context capture for memory bootstrapping. Fills the gap between explicit wiki and implicit recall.
+- **MCP Server Mode**: Agent as composable tool. Enables agent-to-agent pipelines.
+- **Memories**: Automatic cross-thread persistent memory with background generation, rate-limit awareness, and secret redaction.
+- **Sandbox Tiers**: read-only → workspace-write → danger-full-access. OS-level enforcement.
+- **Worktrees**: Isolated git worktrees for parallel agent branches.
+## Contradictions
+- **Hook architecture**: Claude Code has 30+ hook events with deterministic exit codes. Codex has 6 events with concurrent execution. Both validated. Our P33 should study both before finalizing. [[codex-open-source-agent-2026]] vs [[claude-code-architecture-vila-lab-2026]].
+- **Memory philosophy**: Codex (automatic, screen-capture, background) vs our wiki (explicit, human-authored, checked-in). Both are valid. They serve different purposes. The gap is that we have nothing in the implicit/automatic category. [[codex-open-source-agent-2026]].
+- **Context management**: Codex uses subagent summaries to fight "context rot." Claude Code uses structured compaction. Cursor uses dynamic context. Antigravity uses 1M token windows. Four different approaches to the same problem. Our approach (wiki + lean-ctx + hot cache) is a fifth path. All are valid but solve different aspects.
+## Open Questions
+- Should our harness expose itself as an MCP server (like Codex's `codex mcp-server` mode)? Which pipeline stages would be useful as external tools?
+- Can we add Chronicle-style screen-context capture without the overhead of a full memory system?
+- Should we adopt the agentskills.io standard for skill format compatibility?
+- Is a Rust core (post-v1) worth the rewrite cost for zero-dependency install and OS sandbox integration?
+- Should we adopt OS-level sandboxing (bubblewrap on Linux) before P35, or build P35 first and add OS enforcement later?
+## Sources
+- [[codex-open-source-agent-2026]] — GitHub repo (openai/codex) + official docs, 2026

package/vault/wiki/questions/Research: Engineering Workflows of Legendary Programmers and AI Harness Mapping.md ADDED

@@ -0,0 +1,107 @@
+---
+type: synthesis
+title: "Research: Engineering Workflows of Legendary Programmers and AI Harness Mapping"
+created: 2026-05-03
+updated: 2026-05-03
+tags:
+  - research
+  - engineering-workflows
+  - legendary-programmers
+  - harness-design
+  - ai-coding
+status: developing
+related:
+  - "[[legendary-engineering-patterns-harness]]"
+  - "[[Linus Torvalds]]"
+  - "[[Ken Thompson]]"
+  - "[[Dennis Ritchie]]"
+  - "[[Anders Hejlsberg]]"
+  - "[[Guido van Rossum]]"
+  - "[[Bjarne Stroustrup]]"
+  - "[[fast-feedback-loops]]"
+  - "[[unix-composability]]"
+  - "[[chain-of-trust-software]]"
+  - "[[subtractive-design]]"
+  - "[[behavioral-compatibility-over-purity]]"
+  - "[[pragmatic-language-design]]"
+  - "[[harness]]"
+  - "[[harness-implementation-plan]]"
+sources:
+  - "[[linux-kernel-coding-workflow]]"
+  - "[[unix-philosophy]]"
+  - "[[birth-of-unix-kernighan-interview]]"
+  - "[[hejlsberg-7-learnings]]"
+  - "[[guido-python-design-philosophy]]"
+---
+# Research: Engineering Workflows of Legendary Programmers and AI Harness Mapping
+## Overview
+Research into the engineering practices of six legendary programmers — Linus Torvalds, Ken Thompson, Dennis Ritchie, Bjarne Stroustrup, Anders Hejlsberg, Guido van Rossum — reveals 10 cross-cutting patterns that map directly to AI coding harness design. The core finding: **the same principles that produced the world's most durable software — Linux, Unix, C, C++, Python, TypeScript — are the principles that must constrain AI-generated code.** Deterministic guardrails (type systems, linters, tests) become more important with AI, not less.
+## Key Findings
+- **Fast feedback loops are the highest-leverage practice across all six programmers.** Hejlsberg's Turbo Pascal instant compile, Torvalds' merge-window cadence, Thompson's rapid prototyping — all converge on the same insight: short cycle time changes behavior. For the harness: every AI-generated change must be testable within seconds. (Sources: [[hejlsberg-7-learnings]], [[linux-kernel-coding-workflow]])
+- **Composability over monoliths is the Unix legacy that still dominates.** The pipe (`|`) breakthrough at Bell Labs enabled an ecosystem of small, focused tools. This maps directly to agent composition: specialized sub-agents chained together rather than monolithic AI output. (Sources: [[unix-philosophy]], [[birth-of-unix-kernighan-interview]])
+- **Torvalds' chain-of-trust model is the canonical verification pipeline.** Patches flow through subsystem maintainers before reaching Linus. Each level inspects what it's specialized for. This maps to tiered harness gates: lint → type-check → test → critic agent → human review. (Source: [[linux-kernel-coding-workflow]])
+- **Type systems are the essential AI guardrail — Hejlsberg's 2026 insight.** "The most valuable tools in an AI-assisted workflow aren't the ones that generate the most code, but the ones that constrain it correctly." Hejlsberg and Stroustrup independently converge on static typing as the safety net against plausible-but-wrong AI output. This validates the harness's L3 (grounding-checkpoints) and L4 (adversarial-verification) as mandatory layers. (Sources: [[hejlsberg-7-learnings]], [[Bjarne Stroustrup]])
+- **Subtractive design is the antidote to AI bloat.** Thompson and McIlroy's "What can we throw out?" culture is the counterforce to AI's tendency to generate verbose, redundant code. The harness needs explicit "suggest deletion" modes — not just generation. (Sources: [[unix-philosophy]], [[Ken Thompson]])
+- **Behavioral compatibility is more important than architectural purity.** Hejlsberg (TypeScript extending JS), Stroustrup (C++ compatible with C), Torvalds ("don't break userspace") all choose pragmatism over clean-slate rewrites. The harness must verify that changes preserve existing behavior. (Sources: [[hejlsberg-7-learnings]], [[Bjarne Stroustrup]])
+- **Van Rossum, Torvalds, and Hejlsberg all oppose "vibe coding" for production work.** Van Rossum: "We stay in control where it comes to architecture and API design." Torvalds: vibe coding is "horrible for production." All six programmers insist on human architectural control. AI assists execution, not design judgment. (Sources: [[guido-python-design-philosophy]], [[linux-kernel-coding-workflow]], [[Anders Hejlsberg]])
+- **Van Rossum's type-hint threshold (10K lines) suggests tiered harness behavior.** Below 10K lines, dynamic checking suffices. Above, strict typing becomes essential. This maps to harness modes that adapt enforcement strictness to codebase size. (Source: [[guido-python-design-philosophy]])
+- **Thompson's productivity demonstrates that deep system understanding enables extreme leverage.** Built Unix in 3 weeks. Reverse-engineered a typesetter in hours. For the harness: semantic codebase indexing and deep context provision are not optional — they are the prerequisite for effective AI code generation. (Source: [[birth-of-unix-kernighan-interview]])
+- **The Unix Room as shared context maps to the wiki as persistent memory (L6).** Kernighan's description of shared source trees, shared filesystems, and the `who` command as community tool mirrors the harness wiki: all decisions visible, all context searchable, all history preserved. (Source: [[birth-of-unix-kernighan-interview]])
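
The chain-of-trust mapping (lint → type-check → test → critic agent → human review) can be sketched as an ordered list of gates that stops at the first failure. This is a minimal illustration, not the harness implementation: `Gate`, `runGates`, and the string-matching checks are hypothetical stand-ins for real linters, compilers, and test runners.

```typescript
// Each gate inspects only what it is specialized for, like a subsystem maintainer.
type GateResult = { ok: true } | { ok: false; reason: string };

interface Gate {
  name: string;
  check: (diff: string) => GateResult;
}

interface PipelineReport {
  passed: string[];
  failedAt?: string;
  reason?: string;
}

// Run gates in order; cheap, fast gates go first so feedback arrives in seconds.
function runGates(gates: Gate[], diff: string): PipelineReport {
  const passed: string[] = [];
  for (const gate of gates) {
    const result = gate.check(diff);
    if (!result.ok) {
      return { passed, failedAt: gate.name, reason: result.reason };
    }
    passed.push(gate.name);
  }
  return { passed };
}

// Stand-in checks for illustration; real gates would call external tools.
const exampleGates: Gate[] = [
  {
    name: "lint",
    check: (d) => (d.indexOf("\t") !== -1 ? { ok: false, reason: "tab indentation" } : { ok: true }),
  },
  {
    name: "type-check",
    check: (d) => (d.indexOf(": any") !== -1 ? { ok: false, reason: "explicit any" } : { ok: true }),
  },
  { name: "test", check: () => ({ ok: true }) },
];
```

Ordering gates from cheapest to most expensive keeps the fast-feedback property: a change that fails lint never waits on the critic agent or a human reviewer.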
+## Key Entities
+- [[Linus Torvalds]]: Linux kernel, Git, chain-of-trust development model, "don't break userspace"
+- [[Ken Thompson]]: Unix co-creator, subtractive design, extreme leverage from deep understanding
+- [[Dennis Ritchie]]: Unix co-creator, C language, K&R style, economy of design from constraints
+- [[Anders Hejlsberg]]: Fast feedback loops, behavioral compatibility, type systems as AI guardrails
+- [[Guido van Rossum]]: Pragmatism over perfection, simplicity as survival trait, human control over architecture
+- [[Bjarne Stroustrup]]: Evolutionary design, C compatibility as pragmatic choice, static typing for safety
+## Key Concepts
+- [[legendary-engineering-patterns-harness]]: 10 patterns mapped to harness layers
+- [[fast-feedback-loops]]: The highest-leverage practice across all six programmers
+- [[unix-composability]]: Pipes and small tools as agent composition model
+- [[chain-of-trust-software]]: Tiered verification as harness gate architecture
+- [[subtractive-design]]: "What can we throw out?" as AI bloat antidote
+- [[behavioral-compatibility-over-purity]]: Working within existing constraints over clean-slate
+- [[pragmatic-language-design]]: "Good enough" over perfection (van Rossum, Stroustrup)
+## Contradictions
+- **Static vs dynamic typing**: Stroustrup and Hejlsberg advocate static typing as an essential safety net. Van Rossum designed Python as dynamically typed but later backed gradual type hints for codebases above roughly 10K lines, which converges on the same conclusion at scale. No fundamental contradiction; all agree types become essential at scale.
+- **Perfection vs pragmatism**: The ABC group (van Rossum's background) strived for perfection. Van Rossum deliberately rejected this. Stroustrup similarly rejected "a much smaller and cleaner language" (that would have been a "cult language") in favor of C compatibility. Consensus: pragmatism wins, but the harness must enforce correctness where it matters (behavioral compatibility, not architectural purity).
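
The "types become essential at scale" consensus suggests a size-dependent enforcement mode. A minimal sketch, assuming the 10K-line figure is a configurable default rather than a validated constant: `enforcementFor` and `TYPE_HINT_THRESHOLD_LOC` are hypothetical names, and treating exactly 10,000 lines as strict is an arbitrary choice made here.

```typescript
// Hypothetical harness setting; the threshold still needs empirical validation.
type Strictness = "dynamic-checks" | "strict-typing";

const TYPE_HINT_THRESHOLD_LOC = 10000;

// Pick an enforcement mode from codebase size.
// Boundary behavior (>=) is an assumption of this sketch.
function enforcementFor(
  linesOfCode: number,
  threshold: number = TYPE_HINT_THRESHOLD_LOC
): Strictness {
  return linesOfCode >= threshold ? "strict-typing" : "dynamic-checks";
}
```

Keeping the threshold as a parameter leaves room for a lower default for AI-generated codebases, should the data support one.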
+## Open Questions
+- **How to implement subtractive design in an AI harness?** All six programmers emphasize removing what isn't needed. Current harness layers focus on adding correct code. A "subtraction mode" — AI-suggested deletions with safety verification — is not yet designed.
+- **Thompson-level codebase understanding for AI agents?** Thompson could hold an entire operating system in his head. Can semantic indexing + call graphs + AST queries provide equivalent understanding to an LLM? Benchmark needed.
+- **How to balance fast feedback with thorough verification?** Hejlsberg's instant feedback and Torvalds' rigorous review are in tension. Where does the harness optimize for speed vs correctness per change type?
+- **What is the harness equivalent of "don't break userspace"?** Behavioral regression testing exists but may not catch semantic drift. Formal behavioral contracts remain an open research area.
+- **10K-line threshold validation**: Van Rossum's claim that types become essential above 10K lines needs empirical validation for AI-generated codebases. Does AI output benefit from typing at smaller scales?
+## Sources
+- [[linux-kernel-coding-workflow]]: Torvalds et al., official kernel documentation, 2026
+- [[unix-philosophy]]: Wikipedia, synthesizing McIlroy/Thompson/Ritchie/Raymond, 2025
+- [[birth-of-unix-kernighan-interview]]: Brian Kernighan, CoRecursive podcast, 2020
+- [[hejlsberg-7-learnings]]: Aaron Winston, GitHub Blog, January 2026
+- [[guido-python-design-philosophy]]: Guido van Rossum, 2009 (design philosophy) + 2025 interview
+---
+*Research conducted 2026-05-03. 2 rounds, 8 searches, 11 pages scraped, 12 wiki pages created.*

package/vault/wiki/questions/Research: Fallow Codebase Intelligence Harness Integration.md ADDED

@@ -0,0 +1,72 @@
+---
+type: synthesis
+title: "Research: Fallow Codebase Intelligence Harness Integration"
+created: 2026-05-01
+updated: 2026-05-01
+tags:
+  - research
+  - harness
+  - fallow
+  - codebase-intelligence
+  - static-analysis
+  - dead-code
+  - quality-gate
+status: developing
+related:
+  - "[[fallow-rs-codebase-intelligence]]"
+  - "[[codebase-intelligence-harness-integration]]"
+  - "[[codebase-intelligence-ecosystem-comparison]]"
+  - "[[harness-implementation-plan]]"
+  - "[[harness]]"
+sources:
+  - "[[fallow-rs-codebase-intelligence]]"
+---
+# Research: Fallow Codebase Intelligence Harness Integration
+## Overview
+Fallow (fallow-rs/fallow, 1.7K stars, MIT) is a Rust-native codebase intelligence tool for TypeScript and JavaScript. It detects dead code, duplication, complexity, and architecture boundary violations — all sub-second even on 20K+ file codebases. It integrates into our harness as the Phase 16 deterministic quality gate, P15b pre-verification sandbox tool, and L5 observability substrate. No other ecosystem has a single-tool equivalent. Cross-ecosystem coverage requires combining 3-5 tools per language.
+## Key Findings
+- Fallow is the only tool we found across TS/JS, Python, Go, Rust, and Elixir that provides dead code + duplication + complexity + boundaries in one sub-second package. (Source: [[fallow-rs-codebase-intelligence]], [[codebase-intelligence-ecosystem-comparison]])
+- Fallow is purpose-built for AI agent integration: MCP server, JSON with `actions` array, `auto_fixable` flags, agent skill shipped in npm package. (Source: [[fallow-rs-codebase-intelligence]])
+- Fallow fits 7 distinct integration points in our harness: L3 tool calling, P15b pre-verify, Phase 16 gate, L5 observability, P29 error classification, L6 baselines, P42 automations. (Source: [[codebase-intelligence-harness-integration]])
+- Fallow beats knip by 2-13x speed with broader feature coverage (duplication, complexity, boundaries, runtime). Beats jscpd by 8-26x. (Source: [[fallow-rs-codebase-intelligence]])
+- For Python: Vulture + Skylos + Ruff combo provides dead code coverage. For Go: golangci-lint + deadcode + gocyclo. For Rust: clippy + cargo-udeps + rust-code-analysis. For Elixir: dialyxir + credo. All inferior to fallow's single-command coverage. (Source: [[codebase-intelligence-ecosystem-comparison]])
+- Fallow's audit mode with baselines enables incremental adoption — critical for existing codebases with legacy issues. (Source: [[fallow-rs-codebase-intelligence]])
+- Fallow's runtime intelligence (paid) provides hot/cold path evidence from V8 coverage — a Keep Rate proxy for production code survival. (Source: [[fallow-rs-codebase-intelligence]])
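
The baseline idea behind incremental adoption can be sketched as a set difference: only issues absent from the stored baseline count against a change. This is an illustration of the concept, not fallow's implementation. `BaselineIssue`, `issueKey`, and `newIssues` are hypothetical names, and the `rule`/`file`/`symbol` fields are assumptions rather than fallow's documented JSON schema.

```typescript
// Assumed issue shape for illustration; not fallow's real output format.
interface BaselineIssue {
  rule: string;
  file: string;
  symbol: string;
}

// Key on the symbol rather than a line number so unrelated edits
// elsewhere in the file do not make legacy issues look new.
function issueKey(issue: BaselineIssue): string {
  return [issue.rule, issue.file, issue.symbol].join("::");
}

// Return only the issues that are not already recorded in the baseline.
function newIssues(baseline: BaselineIssue[], current: BaselineIssue[]): BaselineIssue[] {
  const known: { [key: string]: true } = {};
  for (const b of baseline) {
    known[issueKey(b)] = true;
  }
  return current.filter((c) => known[issueKey(c)] !== true);
}
```

With this split, a legacy codebase can turn the gate on immediately: existing debt stays visible in the baseline while only regressions affect delivery.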
+## Key Entities
+- **fallow-rs/fallow**: Rust-native TS/JS codebase intelligence. 1.7K stars, MIT. By Bart Waardenburg.
+- **knip**: Legacy dead code detector for TS/JS. ~7K stars. Superseded by fallow on speed and features.
+- **Vulture**: Python dead code detector. ~3.5K stars. AST-based, partial coverage.
+- **Skylos**: Multi-language SAST (Python, TS, Go, Java, Rust). Most comprehensive Python dead code.
+- **deadcode**: Official Go unreachable-function detector. By Alan Donovan (Go team).
+- **Staticcheck**: Most comprehensive Go linter. ~7K stars.
+- **cargo-udeps**: Rust unused dependency detector. Nightly required.
+- **Dialyzer**: Erlang/Elixir BEAM static analysis. Dead code + type errors.
+- **Credo**: Elixir code analysis with teaching focus. ~4.9K stars.
+## Key Concepts
+- [[codebase-intelligence-harness-integration]]: 7-point integration map for fallow into our harness pipeline
+- [[codebase-intelligence-ecosystem-comparison]]: Cross-language gap analysis showing no ecosystem has fallow-equivalent
+- Harness P44: New phase for codebase intelligence integration. 7 sub-phases (P44a through P44g).
+## Contradictions
+- **Fallow vs knip**: knip has ~7K stars (older, more established community). Fallow has 1.7K stars but is 2-13x faster and has a broader feature set. Recommendation: fallow for harness. Stars lag features because the project is younger.
+- **Vulture vs Skylos (Python)**: Vulture is a dedicated dead code detector (3.5K stars, mature). Skylos is multi-language and adds security scanning. For harness: Skylos provides broader coverage. Vulture is simpler if only dead code detection is needed.
+## Open Questions
+- Fallow is TS/JS only. Should harness invest in per-language tool wrappers (Python, Go, Rust, Elixir), or defer multi-language support to post-v1? Current recommendation: P44 for TS/JS now. Multi-language in future F5 phase.
+- Fallow runtime intelligence requires paid license. Is the V8 coverage hot-path data worth the cost for Keep Rate tracking? Recommendation: start with free static layer. Evaluate runtime layer if Keep Rate signal is insufficient.
+- Should harness enforce fallow audit gate (fail = block delivery) or use warn-only mode initially? Recommendation: warn-only for adoption period, escalate to fail gate after baselines established.
+- Can fallow's per-issue `actions` array be mapped to automated fix application (auto-heal)? Recommendation: P44e maps `auto_fixable=true` issues to auto-heal candidates. `fallow fix --dry-run` previews before applying.
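
The last two recommendations (warn-only first, then a fail gate, with `auto_fixable` issues routed to auto-heal) can be combined in one triage step. A minimal sketch under stated assumptions: the source confirms only that issues carry an `actions` array with `auto_fixable` flags, so the `id` field, the `Triage` shape, and the rule that only issues needing manual work block delivery are hypothetical.

```typescript
type GateMode = "warn-only" | "fail";

// Assumed shape: only `actions` and `auto_fixable` come from the source notes.
interface ReportedIssue {
  id: string;
  actions: { auto_fixable: boolean }[];
}

interface Triage {
  autoHeal: string[]; // candidates for an automated fix (previewed before applying)
  manual: string[]; // issues that need a human or agent to resolve
  blockDelivery: boolean;
}

// Split issues into auto-heal candidates and manual work, then apply the gate mode.
function triage(issues: ReportedIssue[], mode: GateMode): Triage {
  const autoHeal: string[] = [];
  const manual: string[] = [];
  for (const issue of issues) {
    const fixable = issue.actions.some((a) => a.auto_fixable);
    if (fixable) {
      autoHeal.push(issue.id);
    } else {
      manual.push(issue.id);
    }
  }
  // Assumption: auto-heal candidates never block on their own.
  return { autoHeal, manual, blockDelivery: mode === "fail" && manual.length > 0 };
}
```

Switching from the adoption period to the enforced gate is then a one-value configuration change (`"warn-only"` to `"fail"`), with no change to the triage logic.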
+## Sources
+- [[fallow-rs-codebase-intelligence]]: Primary source. GitHub repo, docs, benchmarks.