npm - ultimate-pi - Versions diffs - 0.1.2 → 0.1.3 - Mend

ultimate-pi 0.1.2 → 0.1.3

Files changed (516)

package/vault/wiki/concepts/browser-harness-agent.md ADDED

@@ -0,0 +1,41 @@
+---
+type: concept
+title: "browser-harness — Self-Healing CDP Harness"
+status: developing
+created: 2026-05-02
+updated: 2026-05-02
+tags:
+  - browser-automation
+  - cdp
+  - headless-browser
+  - browser-harness
+aliases: ["browser-harness", "CDP harness"]
+related:
+  - "[[browser-subagent-visual-verification]]"
+  - "[[harness-implementation-plan]]"
+  - "[[Source: browser-harness CDP Harness]]"
+sources:
+  - "[[Source: browser-harness CDP Harness]]"
+---
+# browser-harness — Self-Healing CDP Harness
+Thin CDP harness by browser-use (9.4K GitHub stars, MIT, Python). It connects an LLM directly to Chrome over a single WebSocket, with no intermediate layer. Self-healing: the agent writes missing helper functions mid-execution.
+## Core Idea
+No Puppeteer. No Playwright. No pre-baked helpers. Just raw Chrome DevTools Protocol over a WebSocket. The agent calls `session.Page.navigate()`, `session.Input.dispatchMouseEvent()` — exactly what CDP provides, nothing hidden.
+When the agent encounters a missing interaction pattern, it writes the helper itself in `agent-workspace/agent_helpers.py`. The harness improves itself every run.
+## Architecture
+- **browser-harness** (Python, 9.4K stars): ~592 lines of core. Agent-editable workspace + domain skills.
+- **browser-harness-js** (TypeScript, 428 stars): 652 typed CDP methods. Bun-native REPL. `npx skills add` install.
+## Key Properties
+- **Minimal**: ~592 lines of Python. One WebSocket to Chrome.
+- **Self-healing**: Agent writes missing helpers mid-task.
+- **CDP-native**: 56+ domains, 652+ methods — no wrappers, no abstraction.
+- **Agent-editable**: `agent_helpers.py` and `domain-skills/` designed for agent modification.
+- **No version drift**: Auto-generated from Chrome protocol JSON.
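The raw protocol this page describes is JSON over a WebSocket: each command carries an `id`, a `method` such as `Page.navigate`, and `params`, and the browser replies with a message bearing the same `id`. The sketch below illustrates only that framing. `CdpCommandBuilder` and `matchResponse` are hypothetical names chosen for illustration and are not part of browser-harness or browser-harness-js; no socket is opened.

```typescript
// Hypothetical helper names; only the {id, method, params} wire shape comes from CDP.
interface CdpCommand {
  id: number;
  method: string;
  params: Record<string, unknown>;
}

interface CdpResponse {
  id: number;
  result?: Record<string, unknown>;
  error?: { code: number; message: string };
}

class CdpCommandBuilder {
  private nextId = 1;

  // Build one command with a fresh, monotonically increasing id.
  build(method: string, params: Record<string, unknown> = {}): CdpCommand {
    return { id: this.nextId++, method, params };
  }
}

// Pair a raw text frame with the command that caused it.
// Returns null for protocol events (no id) and for replies to other commands.
function matchResponse(command: CdpCommand, frame: string): CdpResponse | null {
  const parsed = JSON.parse(frame) as Partial<CdpResponse>;
  if (typeof parsed.id !== "number" || parsed.id !== command.id) return null;
  return parsed as CdpResponse;
}

const builder = new CdpCommandBuilder();
const navigate = builder.build("Page.navigate", { url: "https://example.com" });
// This string is what would be sent over the WebSocket.
const wire = JSON.stringify(navigate);
```

Matching on `id` is what lets a harness keep several commands in flight on the one socket while still ignoring unsolicited events.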

package/vault/wiki/concepts/browser-subagent-visual-verification.md ADDED

@@ -0,0 +1,82 @@
+---
+type: concept
+title: "Browser Subagent for Visual Verification"
+status: developing
+created: 2026-05-01
+updated: 2026-05-02
+tags:
+  - antigravity
+  - browser-automation
+  - visual-verification
+  - tools
+  - agent-browser
+aliases: ["headless browser agent", "visual verification subagent"]
+related:
+  - "[[agentic-harness]]"
+  - "[[harness-implementation-plan]]"
+  - "[[grounding-checkpoints]]"
+  - "[[agent-browser-browser-automation]]"
+sources:
+  - "[[cursor-vs-antigravity-2026]]"
+  - "[[google-antigravity-official-blog]]"
+  - "[[Source: Vercel Labs agent-browser]]"
+---
+# Browser Subagent for Visual Verification
+Antigravity's most distinctive technical capability: an agent subprocess that drives a headless Chromium browser to visually verify UI changes.
+## How It Works
+1. Agent makes a code change (e.g., CSS fix)
+2. Agent spins up local dev server
+3. Browser subagent opens headless Chrome
+4. Subagent navigates to the affected page
+5. Takes before/after screenshots
+6. Uses vision-optimized models to analyze pixel differences
+7. Verifies the fix worked visually
+8. Reports results with screenshot evidence
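Steps 5 to 7 hinge on comparing a before and an after screenshot. The page attributes that analysis to vision-optimized models; the sketch below swaps in a much simpler numeric comparison that could serve as a cheap pre-check before any model is called. It assumes both screenshots are already decoded into same-sized RGBA byte buffers (PNG decoding is out of scope), and `pixelDiffRatio` is a hypothetical name.

```typescript
// Fraction of pixels where any RGBA channel differs by more than `tolerance`.
// Assumption: both frames are raw RGBA buffers with identical dimensions.
function pixelDiffRatio(before: Uint8Array, after: Uint8Array, tolerance = 0): number {
  if (before.length !== after.length || before.length % 4 !== 0) {
    throw new Error("frames must be RGBA buffers of identical size");
  }
  const pixelCount = before.length / 4;
  if (pixelCount === 0) return 0;

  let changed = 0;
  for (let p = 0; p < pixelCount; p++) {
    const base = p * 4;
    for (let c = 0; c < 4; c++) {
      if (Math.abs(before[base + c] - after[base + c]) > tolerance) {
        changed++;
        break; // count each pixel at most once
      }
    }
  }
  return changed / pixelCount;
}

// Two-pixel frames: the second pixel's green channel moves from 0 to 10.
const beforeFrame = new Uint8Array([255, 0, 0, 255, 0, 0, 0, 255]);
const afterFrame = new Uint8Array([255, 0, 0, 255, 0, 10, 0, 255]);
const strictRatio = pixelDiffRatio(beforeFrame, afterFrame);      // 0.5
const lenientRatio = pixelDiffRatio(beforeFrame, afterFrame, 16); // 0
```

A ratio of zero would let the subagent skip the model call entirely; anything above a chosen threshold would be escalated for visual analysis.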
+## Why This Is Revolutionary
+Traditional coding agents are **blind**. They reason about code as text but cannot see what it produces. A CSS change that "looks right" to the model may look completely wrong in the browser. The browser subagent closes this loop.
+## Use Cases
+- **CSS/UI fixes**: Agent sees if padding/margins/layout actually work
+- **Visual regression testing**: Before/after screenshots as verifiable artifacts
+- **Cross-device verification**: Test at different viewport sizes
+- **Form interaction testing**: Click buttons, fill forms, verify behavior
+- **Login flow testing**: Automate auth flows end-to-end
+## Gap in Our Harness
+Our harness has **no browser control capability**. All verification is:
+- **Syntax-level** (P11 inline validation, P20 lint/format)
+- **Semantic-level** (L4 adversarial critic)
+- **Observability-level** (L5 metrics)
+None of this can verify that a UI change actually produced the correct visual result.
+## Proposed Integration: Phase P30
+Add a **Browser Subagent** to the tool registry:
+- `lib/harness-browser.ts` — agent-browser driving headless Chrome via Rust daemon
+- `extensions/harness-browser.ts` — Extension hook: after UI-related edits, optionally trigger visual verification
+- Configurable: `.pi/harness/browser.json` — enable/disable, screenshot directories, viewport configs
+The browser subagent operates as a specialized subagent (P25 router dispatches UI tasks to it). It reports results as artifacts (P31).
+> [!update] May 2026: Replaced browser-harness (9.4K stars, Python) with **Vercel Labs agent-browser** (31.4K stars, Apache 2.0, Rust-native). agent-browser provides richer AI agent integration: snapshot + refs workflow, annotated screenshots, structured diff, React introspection, Web Vitals, batch mode, and built-in skills system. See [[agent-browser-browser-automation]] and [[Source: Vercel Labs agent-browser]].
+### Why agent-browser over browser-harness
+| Feature | browser-harness | agent-browser |
+|---------|----------------|---------------|
+| **Ecosystem** | 9.4K stars, Python | 31.4K stars, Rust-native binary |
+| **Agent workflow** | Raw CDP — agent writes helpers | Snapshot + @eN refs — purpose-built |
+| **Visual diff** | None | `diff screenshot --baseline before.png` |
+| **Annotated screenshots** | None | `--annotate` with numbered labels |
+| **Skills system** | None | `skills get core`, `npx skills add` |
+| **Batch mode** | None | Multi-command single invocation |
+| **Install** | `uv add browser-harness` (Python dep) | `npm install -g agent-browser` (single binary) |

package/vault/wiki/concepts/codebase-intelligence-ecosystem-comparison.md ADDED

@@ -0,0 +1,192 @@
+---
+type: concept
+title: "Codebase Intelligence Ecosystem Comparison"
+created: 2026-05-01
+updated: 2026-05-01
+status: developing
+tags:
+  - harness
+  - codebase-intelligence
+  - static-analysis
+  - dead-code
+  - ecosystem
+  - comparison
+related:
+  - "[[fallow-rs-codebase-intelligence]]"
+  - "[[Research: Fallow Codebase Intelligence Harness Integration]]"
+  - "[[codebase-intelligence-harness-integration]]"
+  - "[[harness-implementation-plan]]"
+sources:
+  - "[[fallow-rs-codebase-intelligence]]"
+---
+# Codebase Intelligence Ecosystem Comparison
+Comparison of codebase intelligence tools across TypeScript/JavaScript, Python, Go, Rust, and Elixir ecosystems. Focus: tools that provide project-wide dead code detection, duplication analysis, complexity scoring, and architecture boundary enforcement — the capabilities a coding agent harness needs for deterministic quality gating.
+## TypeScript / JavaScript
+### Fallow (fallow-rs/fallow) — PRIMARY RECOMMENDED
+- **Coverage**: Dead code, duplication, complexity, boundaries, runtime intelligence
+- **Speed**: Sub-second (Rust-native)
+- **Agent integration**: MCP server, JSON output with actions array, agent skill
+- **Stars**: 1.7K
+- **License**: MIT
+- **Limitation**: Syntactic only (no type-level dead code). TS/JS only.
+- **Status**: Adopted for P44 in harness implementation plan.
+### knip (webpro-nl/knip)
+- **Coverage**: Dead code detection (files, exports, dependencies, types)
+- **Speed**: Slower than fallow (2-13x on benchmarks). v6 improved but still behind.
+- **Agent integration**: JSON output, no MCP server
+- **Stars**: ~7K
+- **Status**: Legacy reference. fallow beats it on speed, features, and agent integration.
+### ts-prune
+- **Coverage**: Unused exports only
+- **Speed**: Fast but narrow scope
+- **Status**: Superseded by fallow/knip for comprehensive analysis.
+### jscpd
+- **Coverage**: Duplication detection
+- **Speed**: 8-26x slower than fallow
+- **Status**: Legacy reference.
+## Python
+### Vulture (jendrikseipp/vulture)
+- **Coverage**: Dead code detection (unused functions, variables, classes, imports)
+- **Method**: AST-based static analysis
+- **Limitations**: Python's dynamic nature causes false positives. No cross-file import graph traversal. No duplication or complexity analysis.
+- **Stars**: ~3.5K
+- **Agent integration**: CLI only, JSON output available
+- **Harness relevance**: Partial P44 coverage for Python projects. Combine with other tools.
+### Skylos (duriantaco/skylos)
+- **Coverage**: Multi-language SAST (Python, TS, JS, Go, Java, PHP, Rust)
+- **Features**: Dead code, security scanning, secrets detection, AI code guardrails
+- **Method**: CI/CD PR gate, local-first
+- **Agent integration**: JSON output, VS Code extension
+- **Harness relevance**: Most comprehensive Python dead code tool. Multi-language support valuable for harness.
+### Ruff (astral-sh/ruff)
+- **Coverage**: Linting + formatting (Rust-native)
+- **Speed**: 10-100x faster than flake8
+- **Limitations**: File-local only. No cross-file dead code detection. No duplication or boundaries.
+- **Agent integration**: CLI, JSON output
+- **Harness relevance**: Inline syntax validation (P11). Complements but doesn't replace fallow-equivalent.
+### Py-spy (benfred/py-spy)
+- **Coverage**: Sampling profiler for Python
+- **Harness relevance**: Runtime intelligence equivalent of fallow runtime (hot path detection)
+### pydeps (thebjorn/pydeps)
+- **Coverage**: Module dependency graph visualization
+- **Method**: Import graph traversal
+- **Harness relevance**: Dead file detection via import graph. Graphical output, not structured JSON.
+### Coverage.py + pytest-cov
+- **Coverage**: Runtime test coverage
+- **Harness relevance**: Runtime intelligence layer (akin to fallow's V8 coverage integration)
+## Go
+### deadcode (golang.org/x/tools/cmd/deadcode)
+- **Coverage**: Unreachable function detection
+- **Method**: Whole-program call graph analysis starting from main entry points, using Rapid Type Analysis (RTA).
+- **Built by**: Alan Donovan (Go team), Dec 2023
+- **Limitations**: Functions only. No exports check, no duplication, no complexity, no boundaries.
+- **Agent integration**: CLI only
+- **Harness relevance**: Official Go dead code tool. Narrow scope.
+### Staticcheck (dominikh/go-tools)
+- **Coverage**: Bugs, performance issues, simplifications, style rules, unused code
+- **Method**: Static analysis. Most comprehensive Go linter.
+- **Agent integration**: JSON output via golangci-lint wrapper
+- **Stars**: ~7K (go-tools monorepo)
+- **Harness relevance**: Best single Go static analysis tool. Covers dead code + quality.
+### golangci-lint
+- **Coverage**: Meta-linter wrapping 50+ Go linters
+- **Includes**: staticcheck, unused, deadcode, errcheck, govet
+- **Agent integration**: JSON, SARIF, GitHub annotations
+- **Harness relevance**: One-command quality gate for Go projects.
+### unused (dominikh)
+- **Coverage**: Unused identifiers (constants, variables, functions, types, fields)
+- **Method**: Part of staticcheck suite
+- **Harness relevance**: Dead code detection for Go. Redundant if using golangci-lint with staticcheck.
+## Rust
+### cargo-udeps (est31/cargo-udeps)
+- **Coverage**: Unused dependencies in Cargo.toml
+- **Method**: Compiler-level analysis. Requires nightly.
+- **Harness relevance**: Dedicated unused dep tool. Complements built-in rustc warnings.
+### cargo-machete (bnjbvr/cargo-machete)
+- **Coverage**: Unused dependencies (fast path)
+- **Method**: Regex-based pre-check. Works on stable Rust.
+- **Harness relevance**: Faster alternative to cargo-udeps. Pair with udeps for accuracy.
+### Built-in Rust compiler warnings
+- `#[warn(dead_code)]`, `#[warn(unused_imports)]`, `cargo clippy`
+- **Coverage**: In-crate dead code. No cross-crate dependency checking.
+- **Harness relevance**: Foundation layer. `cargo clippy -- -D warnings` as CI gate.
+### cargo-deny (EmbarkStudios/cargo-deny)
+- **Coverage**: License compliance, security advisories, duplicate dependencies
+- **Harness relevance**: Dependency audit layer. Complements udeps/machete.
+### rust-code-analysis (mozilla/rust-code-analysis)
+- **Coverage**: Code complexity metrics (cyclomatic, cognitive, LOC, Halstead)
+- **Method**: Mozilla's tool. Supports Rust, C/C++, JS, Python.
+- **Harness relevance**: Complexity analysis for Rust (fallow-equivalent health scores)
+## Elixir
+### Dialyzer (via dialyxir: jeremyjh/dialyxir)
+- **Coverage**: Type errors, dead code, unreachable code, unnecessary tests
+- **Method**: Static analysis of BEAM bytecode. Requires typespecs for best results.
+- **Limitations**: Slow (bytecode analysis). Setup required. False positives on dynamic code patterns.
+- **Harness relevance**: Primary Elixir dead code detection. Dialyxir adds Elixir-friendly interface.
+### Credo (rrrene/credo)
+- **Coverage**: Code smells, style issues, refactoring opportunities, consistency
+- **Method**: AST-based static analysis. Teaching-focused.
+- **Stars**: ~4.9K
+- **Agent integration**: JSON output, check-style format
+- **Harness relevance**: Linting + code smell detection. Complements Dialyzer for quality gating.
+### Sobelow (nccgroup/sobelow)
+- **Coverage**: Security-focused static analysis for Phoenix
+- **Harness relevance**: Security layer for Elixir/Phoenix projects
+### CodeScene
+- **Coverage**: Behavioral code analysis (git history + complexity + social factors)
+- **Harness relevance**: L5 observability (hotspots, bus factor). Not Elixir-specific.
+## Gap Analysis: No Ecosystem Has a Fallow Equivalent
+| Capability | TS/JS (fallow) | Python | Go | Rust | Elixir |
+|---|---|---|---|---|---|
+| Dead code (unused files) | ✅ | Vulture (partial) | deadcode (functions only) | cargo-udeps (deps only) | Dialyzer (unreachable) |
+| Dead code (unused exports) | ✅ | ❌ No tool | ❌ No tool | ❌ No tool | ❌ No tool |
+| Duplication detection | ✅ | ❌ No tool | ❌ No tool | ❌ No tool | ❌ No tool |
+| Complexity scoring | ✅ | Radon/wily | gocyclo | rust-code-analysis | Credo (partial) |
+| Architecture boundaries | ✅ | ❌ No tool | ❌ No tool | ❌ No tool | ❌ No tool |
+| Runtime intelligence | ✅ (paid) | py-spy + coverage.py | pprof | perf/flamegraph | observer_cli |
+| MCP server for agents | ✅ | ❌ | ❌ | ❌ | ❌ |
+| Audit mode (changed files) | ✅ | ❌ | ❌ | ❌ | ❌ |
+| Single unified tool | ✅ | ❌ | ❌ | ❌ | ❌ |
+**Key finding**: Fallow is the ONLY codebase intelligence tool across all five ecosystems that provides dead code + duplication + complexity + boundaries in a single, fast, agent-integrated package. Every other ecosystem requires combining 3-5 separate tools to achieve similar coverage.
+## Harness Multi-Language Strategy
+For a multi-language harness, the approach is:
+1. **TS/JS**: `fallow` — single-command comprehensive gate
+2. **Python**: `skylos` (dead code + security) + `ruff` (lint) + `radon` (complexity) + `coverage.py` (runtime)
+3. **Go**: `golangci-lint` (all-in-one lint + dead code) + `gocyclo` (complexity) + `deadcode` (unreachable functions)
+4. **Rust**: `cargo clippy` (lint + dead code) + `cargo-udeps` (unused deps) + `rust-code-analysis` (complexity) + `cargo-deny` (audit)
+5. **Elixir**: `mix test` + `credo` (lint) + `dialyxir` (dead code + types) + `sobelow` (security)
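The per-language strategy above could be held in the harness as a plain lookup table. The sketch below is one possible shape: the tool names and roles are copied from the list, while the `Ecosystem` keys, `QUALITY_GATES`, and `toolsFor` are hypothetical names introduced here.

```typescript
type Ecosystem = "ts-js" | "python" | "go" | "rust" | "elixir";

interface GateTool {
  tool: string;
  role: string;
}

// Tool names and roles mirror the strategy list; the table structure is an assumption.
const QUALITY_GATES: Record<Ecosystem, GateTool[]> = {
  "ts-js": [{ tool: "fallow", role: "comprehensive gate" }],
  python: [
    { tool: "skylos", role: "dead code + security" },
    { tool: "ruff", role: "lint" },
    { tool: "radon", role: "complexity" },
    { tool: "coverage.py", role: "runtime" },
  ],
  go: [
    { tool: "golangci-lint", role: "lint + dead code" },
    { tool: "gocyclo", role: "complexity" },
    { tool: "deadcode", role: "unreachable functions" },
  ],
  rust: [
    { tool: "cargo clippy", role: "lint + dead code" },
    { tool: "cargo-udeps", role: "unused deps" },
    { tool: "rust-code-analysis", role: "complexity" },
    { tool: "cargo-deny", role: "audit" },
  ],
  elixir: [
    { tool: "mix test", role: "tests" },
    { tool: "credo", role: "lint" },
    { tool: "dialyxir", role: "dead code + types" },
    { tool: "sobelow", role: "security" },
  ],
};

// Collect the distinct tools needed for a repository spanning several ecosystems.
function toolsFor(ecosystems: Ecosystem[]): string[] {
  const seen = new Set<string>();
  for (const eco of ecosystems) {
    for (const gate of QUALITY_GATES[eco]) seen.add(gate.tool);
  }
  return Array.from(seen);
}

const goTools = toolsFor(["go"]);
```

The table makes the gap analysis concrete: TS/JS needs one entry, while every other ecosystem needs three or four.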

package/vault/wiki/concepts/codebase-intelligence-harness-integration.md ADDED

@@ -0,0 +1,161 @@
+---
+type: concept
+title: "Codebase Intelligence Harness Integration"
+created: 2026-05-01
+updated: 2026-05-01
+status: developing
+tags:
+  - harness
+  - codebase-intelligence
+  - fallow
+  - quality-gate
+  - static-analysis
+related:
+  - "[[fallow-rs-codebase-intelligence]]"
+  - "[[Research: Fallow Codebase Intelligence Harness Integration]]"
+  - "[[codebase-intelligence-ecosystem-comparison]]"
+  - "[[harness-implementation-plan]]"
+  - "[[adr-010]]"
+sources:
+  - "[[fallow-rs-codebase-intelligence]]"
+---
+# Codebase Intelligence Harness Integration
+How deterministic codebase intelligence tools (primary: fallow for TS/JS) integrate into the ultimate-pi 8-layer agentic harness pipeline. Fallow is NOT an AI tool — it is the codebase truth layer the agent calls for deterministic quality signals.
+## Integration Points
+### 1. L3 Execution Layer — Agent Tool Calling (P44a)
+During task execution, the agent calls fallow as an L3 tool for real-time feedback:
+```
+Agent: "I've finished editing files. Let me check quality."
+  → npx fallow --format json
+  → Parse: "2 new unused exports, 1 circular dependency introduced"
+  → Agent: Fix before presenting result
+```
+Integration:
+- MCP server bundled in fallow npm package
+- Tool registration in harness MCP registry
+- Agent skill for fallow workflow guidance (`fallow-skills`)
+- JSON output with machine-actionable `actions` array per issue
+Token cost: ~0 (deterministic CLI, no LLM tokens used).
+### 2. P15b Pre-Verification Isolation Sandbox — Fallow Audit (P44b)
+Before presenting L3 results to the agent (and before L4 adversarial verification), run fallow audit scoped to changed files:
+```
+npx fallow audit --base main --format json --changed-since main
+```
+Returns verdict: pass (exit 0), warn (exit 0 with findings), fail (exit 1).
+Integration in P15b sandbox:
+1. Checkout temp worktree (already done by P15b)
+2. Run `fallow audit --gate all --format json`
+3. If fail: surface findings to agent for fix loop
+4. If pass/warn: proceed to L4 verification
+5. Warn findings go into L4 critic context for adversarial review
+Token cost: ~0 (deterministic). LLM tokens are spent only on failure, to describe the findings to the agent (~200).
+### 3. Phase 16 Lint+Format Gate — Fallow as Gate (P44c)
+Post-L4, pre-delivery deterministic gate. Non-negotiable pass/fail:
+```
+npx fallow audit --gate all
+```
+Verdict mapping:
+- **pass**: continue to L5 observability
+- **warn**: log findings, continue (warnings don't block delivery)
+- **fail**: block delivery. Agent must fix before proceeding.
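The verdict mapping above, combined with the exit-code convention stated for the audit (pass and warn exit 0, fail exits 1), can be sketched as two small functions. `verdictFrom` and `gateAction` are hypothetical names, and treating every non-zero exit code as a failure is an assumption of this sketch; the page only specifies exit 1.

```typescript
type Verdict = "pass" | "warn" | "fail";
type GateAction = "continue" | "log-and-continue" | "block";

// Assumption: any non-zero exit code counts as fail (the page only names exit 1).
function verdictFrom(exitCode: number, findingCount: number): Verdict {
  if (exitCode !== 0) return "fail";
  return findingCount > 0 ? "warn" : "pass";
}

// Direct transcription of the verdict mapping list.
const ACTION_BY_VERDICT: Record<Verdict, GateAction> = {
  pass: "continue",
  warn: "log-and-continue",
  fail: "block",
};

function gateAction(verdict: Verdict): GateAction {
  return ACTION_BY_VERDICT[verdict];
}

const action = gateAction(verdictFrom(0, 3)); // warn, so delivery continues with a log entry
```

Keeping the verdict separate from the action means the same parsing can serve both the P15b sandbox (where fail triggers a fix loop) and this gate (where fail blocks delivery).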
+Baselines for legacy codebases:
+```
+npx fallow dead-code --save-baseline .fallow-baselines/dead-code.json
+npx fallow health --save-baseline .fallow-baselines/health.json
+npx fallow dupes --save-baseline .fallow-baselines/dupes.json
+npx fallow audit --dead-code-baseline .fallow-baselines/dead-code.json ...
+```
+Token cost: 0 (deterministic). Part of Phase 16's existing 0-token budget.
+### 4. L5 Observability — Health Trends + Keep Rate (P44d)
+Fallow's health scoring system provides quantitative substrate for Keep Rate tracking and codebase health observability:
+```
+fallow health --score              # 0-100 score with letter grade
+fallow health --trend              # Compare against saved snapshot
+fallow health --runtime-coverage ./coverage  # Hot/cold path evidence
+```
+Keep Rate proxy: `fallow health --score` snapshots stored at each delivery event. Track score over time to measure whether agent-generated code survives and maintains quality.
+Integration:
+- `lib/harness-observability.ts` adds `fallowHealthSnapshot()` method
+- Snapshot stored in L6 persistent memory per delivery
+- L5 dashboard surfaces health trend alongside Keep Rate
+Token cost: ~0-100 (metadata annotation). Already budgeted in L5's ~2,000 tokens.
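The Keep Rate proxy amounts to appending one score per delivery and comparing the latest value with the previous one. The sketch below shows that bookkeeping in memory only. The `HealthSnapshot` shape, `recordSnapshot`, and `latestDelta` are hypothetical; the page names only `fallowHealthSnapshot()` and the 0-100 score range.

```typescript
// Hypothetical snapshot shape; only the 0-100 score range comes from the page.
interface HealthSnapshot {
  deliveryId: string;
  score: number;
}

// Append a snapshot without mutating the existing history.
function recordSnapshot(history: HealthSnapshot[], next: HealthSnapshot): HealthSnapshot[] {
  if (next.score < 0 || next.score > 100) {
    throw new Error("score must be within 0-100");
  }
  return history.concat([next]);
}

// Score change between the two most recent deliveries, or null with fewer than two.
function latestDelta(history: HealthSnapshot[]): number | null {
  if (history.length < 2) return null;
  const latest = history[history.length - 1];
  const previous = history[history.length - 2];
  return latest.score - previous.score;
}

let history: HealthSnapshot[] = [];
history = recordSnapshot(history, { deliveryId: "d1", score: 82 });
history = recordSnapshot(history, { deliveryId: "d2", score: 78 });
const delta = latestDelta(history); // -4: the score regressed after the second delivery
```

A negative delta is the signal an L5 dashboard would surface; persisting the history to L6 storage is left out of this sketch.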
+### 5. P29 Per-Tool Per-Model Error Classification (P44e)
+Fallow's structured JSON output provides classification substrate:
+Each finding has:
+- `rule` (e.g., "unused-exports", "circular-dependencies")
+- `severity` (error/warn)
+- `introduced` (true/false — from audit diff)
+- `actions` array (e.g., "remove_export", "restructure_deps")
+- `auto_fixable` (boolean)
+- File location, line numbers, CODEOWNERS
+This maps to P29's error classification system:
+- `rule` → error category
+- `introduced` → distinguishes agent-introduced vs inherited
+- `severity` → classification severity level
+- `actions` → remediation taxonomy
+- `auto_fixable` → auto-heal candidate flag
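The field-to-classification mapping above is a straight projection. The sketch below writes it out under two assumptions: the finding fields are exactly those listed on this page (the actual fallow JSON schema may differ), and `ErrorClassification` is a hypothetical P29 record shape invented here.

```typescript
// Field names follow the finding attributes listed above; the real fallow schema may differ.
interface FallowFinding {
  rule: string;
  severity: "error" | "warn";
  introduced: boolean;
  actions: string[];
  auto_fixable: boolean;
}

// Hypothetical P29 record shape, invented for this sketch.
interface ErrorClassification {
  category: string;
  level: "error" | "warn";
  origin: "agent-introduced" | "inherited";
  remediation: string[];
  autoHealCandidate: boolean;
}

function classifyFinding(finding: FallowFinding): ErrorClassification {
  return {
    category: finding.rule,
    level: finding.severity,
    origin: finding.introduced ? "agent-introduced" : "inherited",
    remediation: finding.actions.slice(), // copy so later edits do not alias the input
    autoHealCandidate: finding.auto_fixable,
  };
}

const sampleFinding: FallowFinding = {
  rule: "unused-exports",
  severity: "warn",
  introduced: true,
  actions: ["remove_export"],
  auto_fixable: true,
};
const classified = classifyFinding(sampleFinding);
```

The `origin` field is the part P29 cannot get from a plain linter: it depends on the audit diff marking a finding as newly introduced.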
+### 6. L6 Persistent Memory — Baseline Storage (P44f)
+Fallow baselines stored in wiki-adjacent storage:
+```
+.fallow-baselines/
+  dead-code.json
+  health.json
+  dupes.json
+```
+These are git-committed (not in `.fallow/` cache directory). L6 persistent memory references baseline versions in health snapshots.
+### 7. P42 Scheduled Agent Automations — Periodic Health Sweeps (P44g)
+Cron-style harness-initiated fallow runs:
+- Weekly: `fallow health --trend` → surface regressions
+- Daily: `fallow dead-code --format json` → flag new dead code
+- Per-PR: `fallow audit` (already in CI)
+## What Fallow Does NOT Cover
+- **Functional correctness**: Not a test runner. Phase 16 only checks code quality, not behavior. F2 (Behaviour Harness) remains unsolved.
+- **Type-level dead code**: Fallow is syntactic only. TypeScript's `tsc --noEmit` handles type checking separately (already in P20 gate).
+- **Non-TS/JS ecosystems**: See [[codebase-intelligence-ecosystem-comparison]] for multi-language strategy.
+- **Test coverage**: Fallow health can ingest Istanbul coverage data for CRAP scoring, but does not run tests.
+## Integration Order
+1. **P44c**: Phase 16 gate integration (easiest, highest impact, 0 tokens). Add `fallow audit --gate all` to `lib/harness-polish.ts`.
+2. **P44b**: P15b pre-verification sandbox integration. Add `fallow audit --changed-since main` to sandbox script.
+3. **P44a**: L3 MCP tool registration. Add fallow to MCP tool registry.
+4. **P44d**: L5 health snapshot collection. Add to `lib/harness-observability.ts`.
+5. **P44e**: P29 error classification mapping. Add fallow rule taxonomy to `lib/harness-errors.ts`.
+6. **P44f**: L6 baseline storage. Create `.fallow-baselines/` config.
+7. **P44g**: P42 automation schedule. Add to cron-style harness jobs.

package/vault/wiki/concepts/codebase-to-context-ingestion.md ADDED

@@ -0,0 +1,46 @@
+---
+type: concept
+title: "Codebase-to-Context Ingestion"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - codebase-processing
+  - llm-context
+  - ingestion
+status: developing
+related:
+  - "[[gitingest]]"
+---
+# Codebase-to-Context Ingestion
+## Definition
+The process of converting an entire codebase (local directory or remote Git repository) into a structured plaintext format suitable for feeding into an LLM's context window.
+## Why It Matters
+AI coding agents operate on context. When an agent needs to understand an external dependency, library, or reference implementation, it must ingest that codebase efficiently. Individual file-by-file reading is slow and fragments understanding.
+## Key Properties
+- **Structured output**: Clear file boundaries, directory hierarchy preserved
+- **Filterable**: Pattern-based include/exclude, file size limits
+- **Deterministic**: Same input → same output, no LLM hallucination risk
+- **Compressible**: Output can be further compressed by context runtimes like lean-ctx
+## Tools
+| Tool | Method | LLM Required | Reads Code |
+|------|--------|-------------|------------|
+| [[gitingest]] | Clone + structure | No | Yes |
+| [[gitreverse]] | Metadata → LLM prompt | Yes | No |
+| lean-ctx (built-in) | AST-based selective reading | No | Yes (selective) |
+## Relationship to ultimate-pi Harness
+The harness currently uses its built-in lean-ctx tool for file-by-file reading. Codebase-to-context ingestion complements this by enabling bulk ingestion of entire external repositories. This is useful when:
+1. Researching how a library or tool works
+2. Understanding a reference implementation
+3. Ingesting documentation repos into wiki
+4. Cross-referencing code patterns across projects
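The key properties listed on this page (structured output, filterable, deterministic) can be illustrated with a small in-memory digest builder. This is a sketch under stated assumptions: the `=== FILE: ... ===` separator and the `buildDigest` function are invented here and do not reproduce the output format of gitingest or lean-ctx. Files are passed in as strings, so no filesystem access is involved, and the size limit counts string length rather than bytes.

```typescript
interface SourceFile {
  path: string;
  content: string;
}

interface IngestOptions {
  include?: RegExp;  // assumed to be non-global, so test() stays stateless
  exclude?: RegExp;
  maxChars?: number; // string length limit, not bytes
}

// Build a plaintext digest: a sorted path listing followed by delimited file bodies.
function buildDigest(files: SourceFile[], opts: IngestOptions = {}): string {
  const selected = files
    .filter((f) => (opts.include ? opts.include.test(f.path) : true))
    .filter((f) => !(opts.exclude && opts.exclude.test(f.path)))
    .filter((f) => opts.maxChars === undefined || f.content.length <= opts.maxChars)
    // Sorting by path makes the output independent of input order.
    .sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));

  const listing = selected.map((f) => f.path).join("\n");
  const bodies = selected
    .map((f) => `=== FILE: ${f.path} ===\n${f.content}`)
    .join("\n\n");
  return `Directory listing:\n${listing}\n\n${bodies}`;
}

const digest = buildDigest(
  [
    { path: "src/b.ts", content: "export const b = 2;" },
    { path: "src/a.ts", content: "export const a = 1;" },
    { path: "dist/bundle.js", content: "/* generated */" },
  ],
  { exclude: /^dist\// },
);
```

Because selection and ordering are pure functions of the input, two runs over the same repository state produce byte-identical context, which is the determinism property the page calls out.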

package/vault/wiki/concepts/codex-harness-innovations.md ADDED

@@ -0,0 +1,147 @@
+---
+type: concept
+title: "Codex Harness Innovations (OpenAI)"
+aliases: ["codex innovations", "openai codex agent architecture"]
+created: 2026-05-01
+updated: 2026-05-01
+tags: [concept, harness, codex, openai, research, agent-architecture]
+status: active
+related:
+  - "[[model-adaptive-harness]]"
+  - "[[harness-implementation-plan]]"
+  - "[[agentic-harness-context-enforcement]]"
+  - "[[feedforward-feedback-harness]]"
+  - "[[cursor-harness-innovations]]"
+  - "[[antigravity-agent-first-architecture]]"
+  - "[[lifecycle-hooks]]"
+  - "[[provider-native-prompting]]"
+  - "[[self-evolving-harness]]"
+sources:
+  - "[[codex-open-source-agent-2026]]"
+---
+# Codex Harness Innovations
+Codex (OpenAI, open-source, 79.2K+ GitHub stars, Apache 2.0) is a Rust-based coding agent that runs as CLI, IDE extension, Desktop App, and Web. Research across the open-source repository (`github.com/openai/codex`), official docs (`developers.openai.com/codex`), and the AGENTS.md engineering conventions reveals 10 major innovations.
+This is the fourth major production agent analyzed (after Cursor, Antigravity, Claude Code). Codex is uniquely valuable because it is **fully open-source**, so we can study its architecture directly rather than reverse-engineering.
+## 10 Key Innovations
+### 1. Rust-Native Implementation (96.3% Rust)
+**What**: Compiled binary. Zero Node.js dependency. Platform-optimized sandbox integration via OS APIs.
+**Why it matters**: Better performance, simpler deployment, lower memory. Direct access to OS sandbox APIs (Seatbelt, bubblewrap). This is a first-principles choice: if you're building a local-first agent, use a systems language.
+**Our gap**: Our harness is TypeScript (like Claude Code). A Rust core would give zero-dependency install and tighter sandbox integration. Consider as long-term architectural direction (post-v1).
+### 2. Multi-Surface Agent Architecture
+**What**: Single agent logic runs across CLI, IDE extension, Desktop App, and Web. App Server (local HTTP/WebSocket) bridges agent core to IDE extensions. App-server protocol v2 with typed RPC, TypeScript codegen from Rust structs.
+**Why it matters**: "One agent for everywhere you code." Architectural separation of agent core from presentation layer via app-server protocol.
+**Our gap**: We are CLI-only. The App Server pattern is a potential future path for IDE integration.
+### 3. Platform-Native Sandboxing (3-Tier)
+**What**: OS-level enforcement using Seatbelt (macOS), bubblewrap (Linux), Windows Sandbox. Three tiers: read-only, workspace-write, danger-full-access. Approval policies: untrusted, on-request, never. Writable roots for multi-directory work. Permission profiles with per-domain rules.
+**Why it matters**: Enforced limits, not polite requests. The sandbox is a technical boundary; approvals are a policy layer on top. This separation is architecturally clean.
+**Our gap**: We have P35 (permission subsystem from Claude Code) but no OS-level sandbox integration. For a CLI harness that runs on user machines, OS sandboxing is the correct foundation.
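To make the three tiers and writable roots concrete, the sketch below models only the policy decision "may this path be written?". It is not how Codex enforces anything: real enforcement happens in the OS sandbox (Seatbelt, bubblewrap), and `canWrite` is a hypothetical function. Paths are assumed to be absolute, already normalized, and POSIX-style; `..` segments and symlinks are not resolved.

```typescript
type SandboxMode = "read-only" | "workspace-write" | "danger-full-access";

// Policy-level check only; assumes normalized absolute POSIX paths.
function canWrite(mode: SandboxMode, targetPath: string, writableRoots: string[]): boolean {
  if (mode === "danger-full-access") return true;
  if (mode === "read-only") return false;
  // workspace-write: the target must be a writable root or live beneath one.
  return writableRoots.some((root) => {
    const prefix = root.endsWith("/") ? root : root + "/";
    return targetPath === root || targetPath.startsWith(prefix);
  });
}

const roots = ["/home/dev/project", "/tmp/scratch"];
const insideWorkspace = canWrite("workspace-write", "/home/dev/project/src/main.ts", roots);
const siblingDirectory = canWrite("workspace-write", "/home/dev/project-backup/x.ts", roots);
```

Appending the trailing slash before the prefix test is what keeps `/home/dev/project-backup` from being treated as part of `/home/dev/project`.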
+### 4. Bidirectional MCP (Client AND Server)
+**What**: Codex connects to MCP servers as a client. It also exposes itself as an MCP server (`codex mcp-server`) — other agents can use Codex as a tool.
+**Why it matters**: This is architecturally unique. No other production agent can BE an MCP tool for other agents. It enables agent-to-agent composition.
+**Our gap**: No equivalent. Our harness is an MCP consumer only. Exposing the harness as an MCP server would enable external agents to invoke harness pipeline stages. This is a design pattern worth considering for P25 (subagent specialization).
+### 5. Memories System with Chronicle
+**What**: Opt-in cross-thread persistent memory. Stored under `~/.codex/memories/`. Chronicle captures screen context to bootstrap memories. Background generation (idle-thread-based, rate-limit-aware). Secret redaction. Per-thread controls. Configurable extraction and consolidation models.
+**Why it matters**: Automated cross-session learning — not just explicit wiki pages. Chronicle is the missing piece: it captures what the USER was doing (screen context) to give the agent situational awareness.
+**Our gap**: Our L6 persistent memory is wiki-based (explicit, human-authored). Codex's approach is automatic, implicit, screen-capture-based. Different philosophy. The Chronicle approach could complement our wiki for rapid context recovery after interruptions.
+### 6. Hooks Framework (6 Events)
+**What**: JSON-configurable lifecycle hooks at 6 events: SessionStart, PreToolUse, PermissionRequest, PostToolUse, UserPromptSubmit, Stop. Exit-code semantics (0=continue, 2=block). JSON stdin/stdout contracts. Multiple matching hooks run concurrently. Regex matchers for tool-name filtering. Managed hooks via `requirements.toml`.
+**Why it matters**: Hooks are deterministic policy enforcement. This validates our P33 lifecycle hooks (from Claude Code analysis) — but Codex's hook framework is implemented differently (concurrent execution, JSON contracts, separate hooks.json file).
+**Our gap**: We planned P33 hooks from Claude Code. Codex independently validates the hook pattern but with different implementation choices. We should compare both implementations before building.
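When comparing hook implementations, it may help to have the exit-code contract written down as code. The sketch below models two pieces described above: regex matchers that filter hooks by tool name, and the 0=continue, 2=block exit-code semantics. The `HookSpec` fields are hypothetical and do not reproduce the Codex `hooks.json` schema. Two further assumptions: matchers are applied as unanchored regex searches, and exit codes other than 0 and 2 are treated as non-blocking.

```typescript
type HookEvent =
  | "SessionStart"
  | "PreToolUse"
  | "PermissionRequest"
  | "PostToolUse"
  | "UserPromptSubmit"
  | "Stop";

// Hypothetical hook record; not the actual Codex configuration schema.
interface HookSpec {
  event: HookEvent;
  matcher?: string; // regex source tested against the tool name
  command: string;
}

// Select the hooks that apply to one event and tool. A missing matcher matches every tool.
function matchingHooks(hooks: HookSpec[], event: HookEvent, toolName: string): HookSpec[] {
  return hooks.filter(
    (h) => h.event === event && (h.matcher === undefined || new RegExp(h.matcher).test(toolName)),
  );
}

// Combine exit codes from hooks that ran concurrently: any 2 blocks.
// Assumption: codes other than 0 and 2 do not block.
function resolveDecision(exitCodes: number[]): "continue" | "block" {
  return exitCodes.some((code) => code === 2) ? "block" : "continue";
}

const hooks: HookSpec[] = [
  { event: "PreToolUse", matcher: "^shell", command: "./check-shell.sh" },
  { event: "PreToolUse", command: "./audit-log.sh" },
  { event: "Stop", command: "./summarize.sh" },
];
const selected = matchingHooks(hooks, "PreToolUse", "shell_exec");
const decision = resolveDecision([0, 2]);
```

Because matching hooks run concurrently, the decision has to be an order-independent reduction over their exit codes, which is why a single blocking code wins.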
+### 7. Subagent Workflows (Parallel Dispatch)
+**What**: Explicit parallel agent dispatch with per-agent model selection. Addresses "context pollution" and "context rot." Subagents return summaries. Model selection per agent: gpt-5.5 for demanding work, gpt-5.4-mini for fast scans, gpt-5.3-codex-spark for near-instant text-only.
+**Why it matters**: Validates our P25 subagent specialization. But Codex's model-per-agent selection is more granular than our "cost router" approach — they optimize for task type, not just cost.
+**Our gap**: No equivalent to context pollution/rot terminology. Our P25 should adopt explicit context isolation with summary returns, not just cost-based routing.
+### 8. Git Worktrees
+**What**: Isolated git worktrees for parallel branch work. Multiple agents can work on different branches without conflicts.
+**Why it matters**: Directly solves the subagent isolation problem. Validates our P25b (subagent worktree isolation from Claude Code).
+**Our gap**: P25b planned from Claude Code. Codex independently validates.
+### 9. Skills System (agentskills.io Standard)
+**What**: Follows open `agentskills.io` standard. Progressive disclosure (name+description → full SKILL.md). 2% context budget cap. Built-in `$skill-creator` and `$skill-installer`. Scopes: REPO, USER, ADMIN, SYSTEM. Plugins for distribution.
+**Why it matters**: Our `.pi/skills/` system is nearly identical. Codex independently validates the skills pattern at massive scale. The standard (`agentskills.io`) means we could interoperate.
+**Our gap**: No `$skill-creator` or `$skill-installer` tools. No agentskills.io standard compliance. Our skills don't have the `agents/openai.yaml` metadata layer. These are polish gaps, not architectural gaps.
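Progressive disclosure with a context budget cap can be sketched as a simple admission loop: only skill summaries are loaded up front, and a summary is admitted only while the running total stays under 2% of the context window. How Codex actually selects or truncates skills at the cap is not described on this page, so the greedy skip-if-too-large policy below is an assumption, as are the `SkillSummary` shape and the precomputed `tokens` field.

```typescript
// Hypothetical summary record; `tokens` is assumed to be precomputed by the caller.
interface SkillSummary {
  name: string;
  description: string;
  tokens: number;
}

// Admit summaries in order while the total stays within capFraction of the window.
// Assumption: a summary that does not fit is skipped, and later smaller ones may still fit.
function discloseWithinBudget(
  skills: SkillSummary[],
  contextWindowTokens: number,
  capFraction = 0.02,
): SkillSummary[] {
  const budget = Math.floor(contextWindowTokens * capFraction);
  const admitted: SkillSummary[] = [];
  let used = 0;
  for (const skill of skills) {
    if (used + skill.tokens > budget) continue;
    admitted.push(skill);
    used += skill.tokens;
  }
  return admitted;
}

// With a 100,000-token window the budget is 2,000 tokens.
const admitted = discloseWithinBudget(
  [
    { name: "wiki-lint", description: "Lint wiki pages", tokens: 1500 },
    { name: "deep-research", description: "Multi-source research", tokens: 600 },
    { name: "commit-msg", description: "Write commit messages", tokens: 400 },
  ],
  100000,
);
```

The full SKILL.md body would be loaded only after the agent picks a skill by its summary, so the cap bounds the standing cost of having many skills installed.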
+### 10. Automations (Scheduled Agent Tasks)
+**What**: Scheduled recurring agent tasks — CI-like but agent-driven. No equivalent in Claude Code or Cursor.
+**Why it matters**: A new category of agent capability. The agent doesn't just respond to user prompts — it runs on schedules.
+**Our gap**: No equivalent in our plan. Consider as future phase: scheduled harness runs for maintenance tasks (wiki lint, dependency updates, test suite health checks).
+## What This Means From First Principles
+### FP #1 (Harness > Model): VALIDATED
+Codex has GPT-5.x models available but chose to build a 510K-line-equivalent Rust scaffold around them. The open-source nature proves: the harness is the product, the model is the infrastructure. Codex runs on GPT-5.5, GPT-5.4, GPT-5.4-mini, and GPT-5.3-codex-spark — model-adaptive by design.
+### New First Principle: Sandbox > Permissions
+Codex's architecture separates sandbox (technical boundary, OS-enforced) from approvals (policy layer, user-facing). This is cleaner than our approach of mixing permissions and enforcement. FP #12 (hooks > prompts) should be extended: **"Enforce boundaries with OS-level sandboxing. Use permissions for policy decisions."**
+### New First Principle: Agent as MCP Tool
+Codex's bidirectional MCP (it can BE a tool) suggests a new design pattern: harness pipeline stages should be exposable as MCP tools. External agents could invoke L1 spec hardening, L4 adversarial verification, or L7 orchestration as composable services.
+### New First Principle: Implicit Memory Matters
+Our wiki-based explicit memory (L6) is necessary but not sufficient. Codex's Chronicle + Memories shows that implicit, automatic, screen-capture-based memory fills a gap: the agent should remember what the user was doing without the user having to document it. Consider Chronicle-style context capture as a complement to wiki.
+## Comparison With Other Agents
+| Feature | Codex (OpenAI) | Claude Code | Cursor | Antigravity | Our Harness |
+|---|---|---|---|---|---|
+| Language | Rust | TypeScript | TypeScript | TypeScript | TypeScript |
+| Open Source | YES (Apache 2.0) | Leaked (no license) | No | No | Planned |
+| Sandbox | OS-level (Seatbelt/bubblewrap/Win) | Seatbelt/bubblewrap | Shadow Workspace | Browser sandbox | P35 planned |
+| Skills | agentskills.io standard | SKILL.md (built-in) | Rules/Skills | SKILL.md | `.pi/skills/` |
+| MCP server | Yes | No | No | Limited | No |
+| Memories | Yes (automatic) | Limited (transcript) | No | Knowledge base | Wiki (explicit) |
+| Hooks | 6 events, JSON I/O | 30+ events | Basic | Limited | P33 planned |
+| Subagents | Parallel + model-per-agent | Fork + summarize | Task-type router | Manager View | P25 planned |
+| Worktrees | Yes | Yes | No | Local envs | P25b planned |
+| Automations | Yes | No | No | No | No |
+| Multi-surface | CLI+IDE+App+Web | CLI+SDK | IDE only | IDE only | CLI only |
+## Sources
+- [[codex-open-source-agent-2026]] — GitHub repo + official docs, 2026