npm - ultimate-pi - Versions diffs - 0.1.0 → 0.1.3 - Mend

ultimate-pi 0.1.0 → 0.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (509) hide show

package/vault/wiki/questions/Research: context-mode vs lean-ctx.md ADDED Viewed

@@ -0,0 +1,72 @@
+---
+type: synthesis
+title: "Research: context-mode vs lean-ctx"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - research
+  - context-optimization
+  - agentic-harness
+  - tool-comparison
+status: developing
+related:
+  - "[[think-in-code]]"
+  - "[[context-mode]]"
+  - "[[lean-ctx]]"
+  - "[[agentic-harness-context-enforcement]]"
+sources:
+  - "[[context-mode-website]]"
+  - "[[leanctx-website]]"
+  - "[[think-in-code-blog]]"
+---# Research: context-mode vs lean-ctx
+## Overview
+Both context-mode and lean-ctx are MCP-based context optimization tools for AI coding agents. context-mode (11K+ GitHub stars, 48K npm downloads/month) intercepts tool output and sandboxes it into FTS5 with BM25 ranking. lean-ctx (924 stars, 3K crates.io downloads) compresses output intelligently via AST parsing and 90+ shell patterns. Both reduce token consumption by 60–99%. The key differentiator: context-mode mandates "Think in Code" as a paradigm shift; lean-ctx offers deeper agent governance (profiles, budgets, SLOs).
+## Key Findings
+- context-mode uses **intercept-and-sandbox** architecture: raw tool output never enters context. lean-ctx uses **compress-in-place**: output enters context but is intelligently stripped (Source: [[context-mode-website]], [[leanctx-website]])
+- context-mode enforces **"Think in Code"** as a mandatory paradigm across all platforms: agents must write code to analyze data rather than reading raw data into context (Source: [[think-in-code-blog]])
+- lean-ctx has **agent governance**: profiles, role-based budgets, token/cost SLOs, anomaly detection, 5 built-in roles (Source: [[leanctx-website]])
+- lean-ctx offers **cross-session memory + multi-agent sharing** via CCP protocol and scratchpad messaging; context-mode has 26-event session continuity but no multi-agent features (Source: [[context-mode-website]], [[leanctx-website]])
+- context-mode is TypeScript/Node.js (ELv2 license), lean-ctx is Rust/Apache 2.0 (Source: GitHub API)
+- Both support 14+ platforms including Claude Code, Cursor, Copilot, Gemini CLI, Pi (Source: [[context-mode-website]], [[leanctx-website]])
+- context-mode has stronger community validation: #1 on Hacker News (570+ points), 11,245 stars, claimed use at Microsoft/Google/Meta (Source: [[context-mode-website]])
+## Key Entities
+- [[context-mode]]: MCP plugin by B. Mert Köseoğlu. Sandboxes tool output into FTS5. ELv2 licensed. 11K stars.
+- [[lean-ctx]]: Context Runtime by yvgude. Rust binary, Apache 2.0. 924 stars. 3-layer architecture (MCP server + shell hook + protocols).
+## Key Concepts
+- [[think-in-code]]: Paradigm where AI agents write code to process data instead of reading raw data into context. Reduces context consumption 200×. Mandatory in context-mode v1.0.64+.
+- [[fts5-sandbox]]: context-mode's architecture — intercept tool calls, run in sandboxed subprocess, index output into SQLite FTS5 with BM25 ranking. Agent searches on demand.
+- [[ast-compression]]: lean-ctx's approach — use tree-sitter to parse code (18 languages), extract only signatures/types/logic, strip comments/whitespace. 60–95% reduction.
+- [[shell-pattern-compression]]: lean-ctx recognizes 90+ command patterns (git, npm, cargo, docker, kubectl) and compresses their output automatically.
+- [[context-continuity]]: Both tools preserve session state across context compaction. context-mode captures 26 event types to SessionDB. lean-ctx uses CCP (Cross-session Continuity Protocol).
+## Contradictions
+- [[context-mode-website]] claims 99.5% reduction on Playwright output (56.2KB → 299B). [[leanctx-website]] claims 60–99% with 99% on cached re-reads. Both claims are plausible but measure different things: context-mode measures sandbox avoidance (output never enters context), lean-ctx measures compression ratio (output enters but is stripped). Neither is "wrong" — they solve the problem differently.
+- [[context-mode-website]] lists 66,000+ users. NPM shows 48K downloads/month. Gap is likely cumulative installs vs monthly active. Not verified independently.
+## Open Questions
+- How does context-mode's FTS5 search quality compare to lean-ctx's semantic search (BM25 + TF-IDF)?
+- Does "Think in Code" enforcement cause the agent to make more errors when writing analysis scripts? What's the error rate?
+- For the ultimate-pi harness specifically: can both be used simultaneously? Would they conflict?
+- lean-ctx's governance features (profiles, budgets, SLOs) — how practically useful are they vs. just setting AGENTS.md rules?
+- context-mode's enterprise tier (Context as a Service, compliance reports) — is this a lock-in risk?
+> [!gap] Independent benchmarks needed. All compression claims come from the tools' own websites. No third-party comparison exists.
+> [!gap] No Reddit or community discussions found comparing context-mode vs lean-ctx directly. Tools are 1-2 months old each — comparison discourse hasn't emerged yet.
+## Sources
+- [[context-mode-website]]: context-mode.com landing page, 2026-04-30
+- [[leanctx-website]]: leanctx.com landing page, 2026-04-30
+- [[think-in-code-blog]]: "Think in Code" by B. Mert Köseoğlu, 2026
+- GitHub API: both repos, 2026-04-30

package/vault/wiki/questions/Research: cursor.sh Harness Innovations.md ADDED Viewed

@@ -0,0 +1,92 @@
+---
+type: synthesis
+title: "Research: cursor.sh Harness Innovations"
+created: 2026-05-01
+updated: 2026-05-01
+tags:
+  - research
+  - cursor
+  - agent-harness
+  - harness-design
+status: developing
+related:
+  - "[[cursor-harness-innovations]]"
+  - "[[harness-implementation-plan]]"
+  - "[[model-adaptive-harness]]"
+  - "[[agentic-harness-context-enforcement]]"
+  - "[[drift-detection-unified]]"
+  - "[[provider-native-prompting]]"
+  - "[[context-anxiety]]"
+sources:
+  - "[[cursor-shadow-workspace-2024]]"
+  - "[[cursor-agent-best-practices-2026]]"
+  - "[[cursor-harness-april-2026]]"
+  - "[[cursor-shipped-coding-agent-2026]]"
+  - "[[cursor-instant-apply-2024]]"
+  - "[[cursor-fork-29b-2025]]"
+---# Research: cursor.sh Harness Innovations
+## Overview
+Cursor (Anysphere, $29B valuation, $1B ARR) built the most successful production agent harness. Research across 7 primary sources (Cursor engineering blog, ByteByteGo, MMNTM) reveals 10 innovations and 6 first-principles lessons directly applicable to our harness implementation plan. Key finding: Cursor independently validated 5 of our planned features before we built them, and revealed 4 critical gaps we hadn't identified.
+## Key Findings
+### Validations (Cursor independently confirmed our designs)
+- **Model-adaptive harness** (Source: [[cursor-harness-april-2026]]): Cursor provisions different tool formats per model (patches for OpenAI, string replace for Anthropic). Matches our provider-native prompting redesign.
+- **Dynamic context over static context** (Source: [[cursor-harness-april-2026]]): Cursor removed pre-loaded context guardrails as models improved. Matches our wiki-query + lean-ctx approach.
+- **Context anxiety** (Source: [[cursor-harness-april-2026]]): One model started refusing work as context filled. Matches our P27 Context Anxiety Guard exactly. Independent discovery validates the concern.
+- **Continuous RL from user feedback** (Source: [[cursor-agent-best-practices-2026]]): 90-minute RL loop on accept/reject data. Matches our F1 Self-Evolving Harness concept.
+- **Edit quality is the bottleneck** (Source: [[cursor-instant-apply-2024]], [[cursor-shipped-coding-agent-2026]]): "Diff Problem" is the hardest engineering challenge. Matches our P10 fuzzy edit matching investment.
+### Critical Gaps (Things we're missing)
+- **Pre-verification isolation** (Source: [[cursor-shadow-workspace-2024]]): We validate after edit (P11). Cursor validates before showing user. Missing: isolated pre-commit validation sandbox between L3 and L4.
+- **Keep Rate metric** (Source: [[cursor-harness-april-2026]]): We have no post-hoc quality metric. Keep Rate (code persistence after 1hr/1day/1week) is the ultimate quality signal. Missing from L5.
+- **Per-tool per-model error classification** (Source: [[cursor-harness-april-2026]]): We don't classify tool errors or track per-model baselines. This blocks automated harness self-healing.
+- **Positive agent loops** (Source: [[cursor-agent-best-practices-2026]]): Our drift monitor only stops bad behavior. Cursor's hooks keep agent running until done. We need both.
+### New Patterns to Adopt
+- **Subagent specialization**: Dispatch by task type (planning/editing/debugging), not just cost. Fresh context per subagent. Evolves P25 Haiku Router.
+- **LLM-as-Judge for satisfaction**: Semantic analysis of user responses to agent output as quality signal.
+- **Search/replace tool training**: Search+replace is the hardest tool to teach. Training data needs high volume of tool-specific trajectories.
+- **Sandbox as serving infra**: Treat execution environments as core infrastructure with custom scheduling, not just containers.
+## Key Entities
+- **Cursor / Anysphere**: Company behind Cursor IDE. $29B valuation. Built by Michael Truell, Sualeh Asif, Aman Sanger, Arvid Lunnemark.
+- **Composer**: Cursor's custom MoE agentic coding model. 4x faster than similarly intelligent models.
+- **Fireworks AI**: Inference provider for Cursor's custom models, including speculative edits support.
+## Key Concepts
+- [[cursor-harness-innovations]]: Full catalog of 10 innovations with first-principles analysis
+- **Shadow Workspace**: Hidden Electron window for pre-verification with LSP feedback loop
+- **Speculative Edits**: Deterministic speculation using existing code as draft tokens. 9-13x speedup
+- **Keep Rate**: Fraction of agent code still in codebase after time intervals. Ultimate quality metric
+- **Context Anxiety**: Model behavior where filling context window triggers work refusal. Cross-model phenomenon
+## Contradictions
+- [[cursor-instant-apply-2024]] says full-file rewrites are superior to diffs. Our P10 fuzzy edit matching uses search/replace. However, Cursor's finding is about model training, not tool design. Our edit tool can accept either format. No contradiction — we should support both full-file rewrite and search/replace modes.
+- [[cursor-harness-april-2026]] says they removed guardrails as models improved. Our harness has mandatory verification (no-skip rule). These are different concerns: Cursor removed *context* guardrails (pre-loading files, limiting tool calls). We keep *quality* guardrails (verification is mandatory). Compatible: dynamic context + mandatory verification.
+## Open Questions
+- Can we implement pre-verification isolation without an IDE? Our harness runs in CLI/agent context — no Electron windows. Alternative: isolated temp directory with copy of relevant files, run compiler/linter, feed errors back. This is feasible.
+- What is the right "keep rate" time interval for our use case? Cursor users edit continuously. Our agent does discrete tasks. Maybe: "was the change reverted within the same session?"
+- How do we classify tool errors when our tools are MCP-based? We control fewer tools than Cursor. Classification would need to happen at the MCP bridge layer.
+- Is the 90-minute RL loop feasible without user-facing UI? We don't have accept/reject signals. Could use: was commit reverted? was follow-up fix needed? did tests pass on first try?
+- Should we adopt the subagent pattern for consensus debate? Currently P17-P19 uses pi-messenger transport. Subagent with fresh context might be simpler.
+## Sources
+- [[cursor-shadow-workspace-2024]]: Arvid Lunnemark, Sept 2024. Shadow workspace architecture.
+- [[cursor-agent-best-practices-2026]]: Lee Robinson, Jan 2026. Agent best practices, hooks, skills.
+- [[cursor-harness-april-2026]]: Stefan Heule & Jediah Katz, Apr 2026. Harness evolution, metrics, error classification.
+- [[cursor-shipped-coding-agent-2026]]: Lee Robinson + ByteByteGo, Jan 2026. System architecture, production challenges.
+- [[cursor-instant-apply-2024]]: Aman Sanger, May 2024. Speculative edits, fast apply model.
+- [[cursor-fork-29b-2025]]: MMNTM Research, Dec 2025. Architectural strategy, vertical agent thesis.

package/vault/wiki/questions/Research: executor.sh Harness Integration.md ADDED Viewed

@@ -0,0 +1,170 @@
+---
+type: synthesis
+title: "Research: executor.sh Harness Integration"
+created: 2026-05-01
+updated: 2026-05-01
+tags:
+  - research
+  - executor
+  - integration-layer
+  - harness
+  - tool-catalog
+  - policy-engine
+  - execution-layer
+status: developing
+related:
+  - "[[executor-rhyssullivan]]"
+  - "[[ts-execution-layer]]"
+  - "[[harness-implementation-plan]]"
+  - "[[Research: TypeScript Execution Layer for Agent Tool Calling]]"
+  - "[[Research: cursor.sh Harness Innovations]]"
+  - "[[Research: Codex State-of-the-Art Harness Improvements]]"
+sources:
+  - "[[executor-rhyssullivan]]"
+---# Research: executor.sh Harness Integration
+## Overview
+[executor.sh](https://executor.sh) is the product website for RhysSullivan/executor — an open-source (MIT, 1.3K stars) **integration layer** for AI agents. Research across 3 sources (executor.sh landing page, GitHub README, DeepWiki architecture analysis) reveals that Executor's scope is **broader than our P43 TypeScript Execution Layer** — it is a complete tool catalog + auth + policy + execution runtime. Our existing wiki classified Executor as a "TS execution layer" alongside CodeAct and Cloudflare Code Mode. This research finds that Executor belongs in a **separate category**: the agent integration/runtime layer.
+## Key Findings
+### executor.sh is NOT just a TS execution layer
+Our existing wiki treats Executor as one of three TS execution layer implementations (alongside CodeAct and Cloudflare Code Mode). Executor.sh positions it as "The missing integration layer" — a broader category. The three-era framing reveals the product thesis:
+| Era | Model | Executor's critique |
+|-----|-------|-------------------|
+| Era 1: Tool calling | Every tool schema dumped into context | Tokens wasted, poor performance |
+| Era 2: Bash | Agent calls CLI directly | Poor permission model, dangerous |
+| Era 3: Executor | Agent → executor → typed tools | Typed, sandboxed, cross-agent |
+The TypeScript execution is the **mechanism**, not the **category**. The category is "agent integration layer" — a unified catalog that normalizes diverse tool sources into one typed runtime shared across agents.
+### Five pillars absent from our P43 plan
+(Source: [[executor-rhyssullivan]])
+1. **Unified tool catalog with intent-based discovery**: Not just typed functions — a searchable, indexed catalog where agents call `tools.discover({ query: "github issues", limit: 5 })` instead of memorizing tool paths. Our P43 plans static type generation; Executor adds runtime discovery.
+2. **Shared auth across agents**: Sign in once to GitHub/Slack/Stripe via OAuth → every agent (Cursor, Claude Code, custom) shares those credentials. OS keychain storage. Our harness has P35 (Permission Subsystem) but doesn't address cross-tool auth.
+3. **Policy engine with approval workflows**: Auto-approve reads, pause on writes, wildcard rules. Human-in-the-loop for sensitive operations. Our P35 plans allow/deny/ask rules but lacks Executor's execution-pause-for-approval pattern.
+4. **Execution lifecycle with pause/resume**: Stateful executions that enter `waiting_for_interaction` state when auth or approval is needed. Resumed via `executor resume --execution-id <id>`. Our P43 doesn't address this.
+5. **Multi-source normalization**: OpenAPI, GraphQL, MCP, gRPC, custom JSON schema — all normalized into one namespace. Our P43 only targets harness-native L3 tools (read, bash, edit, grep, find).
+### Technical architecture (from DeepWiki)
+(Source: [[executor-rhyssullivan]])
+- **Bun monorepo**: `apps/executor` (CLI/Server), `apps/web` (React UI), `packages/` (core SDKs), `plugins/` (source-specific)
+- **Server**: Hono-based HTTP, `SqlControlPlaneRuntime` manages database + execution
+- **Persistence**: PGlite or local Postgres (workspace state, execution history, secrets)
+- **Sandbox**: SES (Secure EcmaScript) or Deno subprocesses
+- **MCP bridge**: `executor mcp` exposes catalog as MCP endpoint
+### Rhys Sullivan's design thesis (from Twitter/LinkedIn)
+> "LLMs are in desperate need of an execution layer made for them to run tool calls in. A year ago LLMs were making direct calls to tools, we found that it flooded their context with irrelevant information. Then we discovered with coding agents that the less tools you give them, the better they perform."
+This independently validates our First Principle #19 (Code is a better tool-calling interface than JSON) and our P43 investment. But Executor goes further: the execution layer should also handle auth, policies, and cross-agent sharing — not just sandboxed code execution.
+> "Executor is a highly extended implementation of codemode, that supports adding multiple sources rather than just 1 and a better permissions model." (Source: unrollnow.com thread)
+This clarifies the lineage: Cloudflare Code Mode → Executor. Executor extends Code Mode's single-source TS runtime into a multi-source integration layer.
+### Roadmap signals (from executor.sh)
+- **Now**: Core SDK/CLI, MCP bridge, Policy engine, Local web UI, Desktop app
+- **Soon**: Team management/SaaS, Advanced approval workflows, Org-wide source catalog
+- **Later**: Customer-managed integrations, Workflow primitives (webhooks, crons), Virtual filesystems & KV stores, npm ecosystem support
+The "Soon" tier (team/SaaS) signals that Executor is evolving from a local developer tool into a team infrastructure product. This has implications for our integration strategy — we should integrate at the local/CLI level before it moves upmarket.
+## How This Fits Into Our Harness Implementation Plan
+### Alignment with existing phases
+| Our Phase | Executor Equivalent | Verdict |
+|-----------|-------------------|--------|
+| P43 TS Execution Layer | TS runtime + typed tool API | **Validated**. Executor independently confirms the TS-over-JSON approach. |
+| P39 Harness as MCP Server | MCP bridge (`executor mcp`) | **Validated**. Same pattern. |
+| P35 Permission Subsystem | Policy engine (auto-approve/pause rules) | **Partially validated**. Executor has richer policy model (pause/resume). |
+| P14 Think-in-Code | Code-as-tool-calling paradigm | **Validated**. Executor extends this from data analysis to all tool calls. |
+### Gaps Executor reveals in our plan
+1. **No tool catalog with intent-based discovery**: P43 generates static TypeScript type definitions from tool schemas. Executor adds runtime discovery (`tools.discover({ query })`) that lets agents search tools by intent without loading all schemas into context. This is a **fundamental capability gap** — static type gen alone doesn't solve tool discovery at scale (50+ MCP sources).
+2. **No shared auth for external tools**: Our harness has no auth management layer. If an agent needs to call GitHub API, Stripe API, or Slack API, each tool call requires separate credential handling. Executor centralizes this — one OAuth flow, all agents share the token. This is a gap for any harness that runs agents in production workflows.
+3. **No execution pause/resume for human-in-the-loop**: Our P35 allows blocking tool calls, but doesn't support pausing execution for auth/approval and resuming. Executor's stateful execution lifecycle is a more sophisticated model.
+4. **No multi-source tool normalization**: Our tool registry is harness-native (lean-ctx tools, ck_search, Gitingest). We don't normalize external APIs (OpenAPI, GraphQL) into the same tool namespace. Executor does. This may be out of scope for Phase 0 but matters for production harnesses.
+### What Executor does that we should NOT adopt
+- **Web UI for tool configuration**: Our harness is CLI-only. The React web UI is unnecessary for our use case.
+- **Desktop app**: Same — CLI-only scope.
+- **Multi-source OpenAPI/GraphQL normalization**: Phase 0 scope is harness-native tools only. External API normalization is post-v1.
+- **Team/SaaS management**: Overengineered for a CLI harness. Stay local-first.
+- **Cloudflare Workers dependency**: Executor uses SES/Deno — our P43 can match this without CF dependency.
+### Build vs Integrate Decision
+Executor is MIT-licensed and can be used as a dependency. The integration path:
+**For external API integration (post-v1)**: Use Executor as a dependency.
+```bash
+npm install -g executor
+executor mcp  # expose as MCP server
+# Agent calls executor via MCP for GitHub/Slack/Stripe/etc.
+```
+**For harness-native tool optimization (P43)**: Build our own TS runtime. Our L3 tools (read, bash, edit, grep, find, ck_search, ctx_execute) need harness-specific TypeScript types and permission routing. Executor's plugin system can wrap these, but the runtime should be harness-native for tight integration with P35 (permission subsystem) and L7 (orchestration).
+**Recommended approach**:
+- P43 built custom for harness-native tools
+- Borrow Executor's patterns: catalog with `discover()`, policy engine with pause/resume, typed RPC dispatch
+- Post-v1: integrate Executor MCP server for external API access (GitHub, Slack, etc.)
+## Impact on Harness Implementation Plan
+### P43 should expand to include:
+| Sub-phase | What | Inspired by Executor |
+|-----------|------|---------------------|
+| P43a | Type generation from tool schemas (existing) | — |
+| P43b | Tool catalog with intent-based discovery (`tools.discover()`) | Executor's catalog + `tools.discover({ query, limit })` |
+| P43c | Policy-aware execution (auto-approve reads, pause on writes) | Executor's policy engine |
+| P43d | Execution lifecycle with pause/resume | Executor's `waiting_for_interaction` + `executor resume` |
+### P35 should borrow Executor's policy patterns:
+| Pattern | Description |
+|---------|-------------|
+| Auto-approve reads | Deterministic: read-only tool calls pass without LLM permission check |
+| Pause on writes | Execution enters `waiting_for_interaction` state; human resumes |
+| Wildcard rules | `github.issues.*` → auto-approve; `github.repos.delete` → pause |
+| Human-in-the-loop | Execution lifecycle supports pausing for auth/approval and resuming |
+## Contradictions
+- **Executor vs CodeAct**: Executor uses TypeScript; CodeAct uses Python. Both validate the code-as-tool-calling paradigm. Executor's TypeScript choice is better for our Node.js harness. No contradiction — language follows infrastructure.
+- **Executor vs Cloudflare Code Mode**: Executor extends Code Mode with multi-source support and richer policy engine. They're in the same lineage. Executor is the more mature implementation.
+- **Local vs SaaS**: Executor is local-first today, but roadmap shows team/SaaS in "Soon". Our harness is local-first by design. If Executor moves to SaaS, our local integration path may diverge.
+## Open Questions
+- **Can we integrate Executor's `tools.discover()` pattern without adopting its entire plugin system?** Yes — the discovery pattern is an API contract, not tied to the plugin architecture. We can implement `tools.discover({ query, limit })` over our own tool schema registry.
+- **Should P43 use Executor as a sandbox backend?** Executor's SES/Deno sandbox is production-ready. We could wrap it rather than building our own Node.js VM. But tight integration with P35 (permission subsystem) favors a custom sandbox. Needs a dedicated spike.
+- **Does Executor's pause/resume model work for CLI-only harness?** Yes — `executor resume` is CLI-native. The `waiting_for_interaction` state maps to our harness pausing for human input.
+- **Is Executor stable enough to depend on?** 1.3K stars, 1,492 commits, active development. But it's ~1 month old. Dependency risk is medium. Integration via MCP protocol (not code dependency) mitigates this.
+- **Will Executor's SaaS move break local-first integration?** Roadmap shows team/SaaS as additive, not replacement. Local-first is a core design principle. Risk is low for current scope.
+## Sources
+- [[executor-rhyssullivan]]: RhysSullivan/executor — product website (executor.sh), GitHub README, DeepWiki architecture analysis. Updated May 2026 with product positioning and architecture details.

package/vault/wiki/questions/Research: how GSD fits into our coding harness setup.md ADDED Viewed

@@ -0,0 +1,97 @@
+---
+type: synthesis
+title: "Research: how GSD fits into our coding harness setup"
+created: 2026-05-05
+updated: 2026-05-05
+tags:
+  - research
+  - gsd
+  - harness
+  - integration
+status: developing
+related:
+  - "[[gsd-get-shit-done]]"
+  - "[[harness-implementation-plan]]"
+  - "[[skill-first-architecture]]"
+  - "[[spec-hardening]]"
+  - "[[structured-planning]]"
+  - "[[grounding-checkpoints]]"
+  - "[[adversarial-verification]]"
+  - "[[persistent-memory]]"
+  - "[[agent-skills-pattern]]"
+  - "[[drift-detection-unified]]"
+  - "[[generator-evaluator-architecture]]"
+sources:
+  - "[[gsd-github-repo]]"
+  - "[[gsd-codecentric-deep-dive]]"
+  - "[[gsd-hn-discussion]]"
+  - "[[Source: How to Apply GAN Architecture to Multi-Agent Code Generation]]"
+---
+# Research: how GSD fits into our coding harness setup
+## Overview
+GSD is a downstream application-building pipeline (discuss → plan → execute → verify → ship) running inside Claude Code. Our harness is an upstream behavior-control pipeline (spec-hardening → planning → drift detection → grounding → adversarial verification → observability → memory → orchestration → query). They address fundamentally different layers of the agentic coding stack: GSD builds software, our harness controls how agents build software. They are **complementary**, not competitive.
+## Key Findings
+### 1. GSD is downstream; our harness is upstream (Source: [[gsd-github-repo]], [[gsd-codecentric-deep-dive]])
+GSD operates at the application layer — it receives a user's idea and produces working software. Our harness operates at the agent-control layer — it governs how agents reason, verify, and maintain state during any coding task. GSD could potentially run **inside** a harness-controlled pi session, benefiting from spec-hardening, drift detection, and adversarial verification around its own pipeline execution.
+### 2. GSD uses Claude Code as its runtime; we use pi (Source: [[gsd-codecentric-deep-dive]])
+GSD's entire architecture is markdown files interpreted by Claude Code's skills/agents/hooks system. No proprietary runtime. Our harness runs on pi (the coding agent platform GSD-2 is being ported to). The skill-first architecture transformation (May 2026) means both systems now share the same atomic unit: markdown skills loaded on demand.
+### 3. GSD lacks adversarial verification — our L4 fills that gap (Source: [[gsd-hn-discussion]], [[Source: How to Apply GAN Architecture to Multi-Agent Code Generation]])
+The freeCodeCamp analysis notes: "GSD relies on mechanical verification: lint, test, type-check... There is no agent reading another agent's code to assess whether it matches the spec's intent." This is exactly the gap our L4 adversarial verification addresses. A user running GSD inside our harness would get GSD's plan-checking PLUS our adversarial evaluator cross-verifying the output.
+### 4. GSD's context engineering complements our L3 grounding checkpoints (Source: [[gsd-github-repo]])
+GSD's core innovation — fresh 200K-context subagents, file-based state, XML-structured plans — aligns with our L3 approach but operates at a higher abstraction. GSD prevents context rot for application-building; our L3 prevents context rot for any agentic task. The techniques are compatible: GSD's "wave execution" parallels our subagent worktree isolation pattern.
+### 5. Both systems share skill-first architecture (Source: [[skill-first-architecture]], [[gsd-github-repo]])
+Since May 2026, our harness uses markdown skills as atomic units (3 code files + 12 skill files). GSD has always been markdown-first (59 skills + 33 subagents). Both use progressive disclosure of skills on demand. GSD's namespace routing (6 meta-skills routing to 59 concrete commands, saving ~2K tokens/turn) is a pattern worth adopting.
+### 6. GSD's state files are a narrower version of our L6 persistent memory (Source: [[gsd-github-repo]], [[persistent-memory]])
+GSD's `.planning/` directory (PROJECT.md, REQUIREMENTS.md, ROADMAP.md, STATE.md) is project-specific planning memory. Our wiki is a universal knowledge base covering harness design, coding patterns, architecture decisions, and research syntheses — feeding into every harness layer, not just project planning.
+### 7. GSD's limitations validate our harness approach (Source: [[gsd-hn-discussion]])
+Community feedback reveals GSD's weakness at scale: "agents leave orphans in large codebases," "verification uses simple lexical tools," "difficult to pivot mid-phase." These are exactly the failure modes our L1 (spec hardening prevents ambiguity), L2.5 (drift detection catches orphans), L3 (grounding checkpoints enforce correctness), and L4 (adversarial verification catches missed requirements) are designed to prevent.
+## Key Entities
+- [[gsd-get-shit-done]]: 60K-star meta-prompting/spec-driven development system for Claude Code
+- TÂCHES (@glittercowboy): GSD's creator, solo developer philosophy
+## Key Concepts
+- **Meta-prompting:** Using markdown files as structured prompts that orchestrate LLM behavior — both GSD and our skill-first harness use this
+- **Spec-driven development:** Writing detailed requirements before implementation — overlaps with our L1 spec hardening
+- **Context engineering:** Managing token budgets through fresh subagent contexts, file-based state, and XML structures — overlaps with our L3 grounding checkpoints
+- **Wave execution:** Grouping tasks by dependency for parallel/sequential execution — parallels our subagent worktree isolation
+## Contradictions
+- HN user `divx0` says GSD agents produce orphans at scale; GSD's README claims quality gates prevent scope reduction. The README claim applies to greenfield projects; the HN observation applies to large existing codebases. Both can be true: GSD works well within its design envelope (small-medium greenfield) but degrades outside it.
+- GSD's README calls itself "lightweight" while community calls it "overengineered." Resolution: lightweight compared to BMAD's 12 agents and enterprise workflows; heavy compared to native Claude Code plan mode.
+## Integration Opportunities
+### Immediate (adopt patterns)
+1. **Namespace routing:** GSD's 6 meta-skills reducing 86→6 eager-listed commands saves ~2K tokens/turn. We should adopt this for our growing skill collection.
+2. **Deterministic CLI helper:** GSD's `gsd-tools.cjs` pattern — "deterministic logic belongs in code, not prompts." Our pi extensions already follow this; formalize as a harness principle.
+3. **Wave execution tracking:** GSD's dependency-aware parallel execution with SUMMARY.md per task. Adopt the tracking pattern for our subagent dispatches.
+### Medium-term (layer GSD under harness)
+4. **Run GSD inside a harness-controlled pi session:** The harness pre-processes user intent (L1-L3), then dispatches GSD to handle the application-building pipeline (discuss→plan→execute), with the harness monitoring for drift (L2.5) and running adversarial verification (L4) on GSD's output.
+5. **Harness-as-GSD-plugin:** Package our drift detection and adversarial verification as skills that GSD's orchestrators can invoke during plan-check and verification phases.
+### Long-term (architectural convergence)
+6. **Unified skill marketplace:** Both systems use markdown skills. A shared skill format would let users mix GSD's application-building skills with our harness-control skills in one pipeline.
+## Open Questions
+- Can GSD's wave execution work with our worktree-isolated subagents, or do the isolation models conflict?
+- Would running GSD inside our harness add unacceptable latency for the "fast iteration" use case GSD's quick-mode serves?
+- GSD-2 (being built on pi.dev) may converge naturally with our harness. Should we wait for GSD-2 before attempting integration?
+- The HN discussion reveals a persistent debate: natural language specs vs. executable tests. Which direction should our L1 spec-hardening favor?
+## Sources
+- [[gsd-github-repo]]: TÂCHES, Dec 2025–May 2026, 60.1K stars
+- [[gsd-codecentric-deep-dive]]: Felix Abele, Mar 2026
+- [[gsd-hn-discussion]]: HN Community, Mar 2026, 473 points
+- [[Source: How to Apply GAN Architecture to Multi-Agent Code Generation]]: Christopher Galliart, freeCodeCamp, Mar 2026

package/vault/wiki/questions/Research: how claude-mem fits into our workflow. and whether it should replace obsidian in the codebase. no hard feelings about previous actions, rethink from first principles always.md ADDED Viewed

@@ -0,0 +1,80 @@
+---
+type: synthesis
+title: "Research: how claude-mem fits into our workflow. and whether it should replace obsidian in the codebase. no hard feelings about previous actions, rethink from first principles always"
+created: 2026-05-05
+updated: 2026-05-05
+tags:
+  - research
+  - memory
+  - claude-mem
+  - obsidian
+  - first-principles
+status: developing
+related:
+  - "[[adr-009]]"
+  - "[[persistent-memory]]"
+  - "[[memory-system-of-record-vs-ephemeral-cache]]"
+  - "[[lifecycle-hooks]]"
+  - "[[Codex Harness Innovations (OpenAI)]]"
+  - "[[Research: claude-mem over Obsidian for Harness Layer]]"
+sources:
+  - "[[adr-009]]"
+  - "[[persistent-memory]]"
+  - "[[lifecycle-hooks]]"
+  - "[[codex-harness-innovations]]"
+  - "[[Research: claude-mem over Obsidian for Harness Layer]]"
+---
+# Research: how claude-mem fits into our workflow. and whether it should replace obsidian in the codebase. no hard feelings about previous actions, rethink from first principles always
+## Overview
+First-principles test: memory system for harness must optimize for durability, auditability, deterministic enforcement, and operator trust. Current repo already anchors these in Obsidian wiki structure and ADR-backed contracts. Result: claude-mem fits as accelerator cache, not full replacement for canonical wiki memory.
+## Research Method
+- Round 1 (broad): scan existing memory architecture pages, ADRs, and harness control pages in local wiki corpus.
+- Round 2 (gap-fill): check for direct claude-mem source pages, benchmark pages, and operational evidence in current vault.
+- Round 3: skipped; no new primary sources available in current corpus.
+- Constraint note: external web fetch was blocked in this run, so all conclusions are bounded to in-vault evidence.
+## First-Principles Requirements
+1. **Canonical truth must be inspectable by humans and agents**.
+2. **Decision provenance must be stable and linkable**.
+3. **Memory policy compliance must not depend only on model obedience**.
+4. **Fast recall layer can be lossy, canonical layer cannot be lossy**.
+## Key Findings
+- **(high)** Canonical memory contract already exists and is explicit: `hot.md -> index.md -> linked pages`, with append-only operations log and ADR references (Source: [[adr-009]], [[persistent-memory]]).
+- **(high)** Harness write/read points are structurally integrated with wiki pages and event hooks (`session_start`, `session_shutdown`, decision capture), so replacement would require re-architecting multiple layers (Source: [[persistent-memory]]).
+- **(high)** Deterministic hooks outperform prompt-only policy memory. Memory hints in prompts can drift; hook gates provide hard enforcement (Source: [[lifecycle-hooks]]).
+- **(medium)** Automatic memory capture patterns are useful for continuity and speed, but they complement explicit knowledge systems instead of replacing them when audit/provenance matters (Source: [[codex-harness-innovations]]).
+- **(low)** No direct claude-mem benchmark, schema, or failure-mode source is currently filed in this vault, so "replace Obsidian now" cannot be justified with high confidence (Source: [[Research: claude-mem over Obsidian for Harness Layer]]).
+## Decision
+Do **not** replace Obsidian as memory system-of-record in this codebase now.
+## Where claude-mem Fits
+- Use claude-mem as **ephemeral recall cache** for short-horizon continuity.
+- Keep wiki as **canonical memory ledger** for decisions, ADR alignment, and contradiction handling.
+- Enforce write-back: if task changes architecture/policy, completion requires wiki update.
+## Recommended Operating Model
+1. Read order: quick cache (claude-mem) -> `[[hot]]` -> `[[index]]` -> linked canonical pages.
+2. Write order: decision-bearing output -> wiki first; cache may mirror but never override.
+3. Conflict rule: cache vs wiki disagreement -> wiki wins.
+4. Gate rule: stop-hook blocks completion if required wiki filing is missing.
+## Contradictions
+- Auto-memory value is real for speed, but speed-optimized memory and audit-optimized memory have different objective functions. Treating them as one layer causes drift.
+## Open Questions
+- Which claude-mem storage semantics (scope, retention, deletion) are acceptable for this repo?
+- Can claude-mem emit citation/provenance pointers equivalent to wiki wikilinks?
+- What latency/token savings appear in real runs with hybrid cache + wiki write-back?
+- What is the merge/conflict policy when cache proposes stale memory against newer ADRs?
+## Sources
+- [[adr-009]]: canonical decision for Layer 6 memory.
+- [[persistent-memory]]: operational read/write contract.
+- [[lifecycle-hooks]]: deterministic enforcement model.
+- [[codex-harness-innovations]]: implicit memory complement pattern.
+- [[Research: claude-mem over Obsidian for Harness Layer]]: prior internal synthesis and identified gaps.

package/vault/wiki/questions/Research: pi-vcc.md ADDED Viewed

@@ -0,0 +1,113 @@
+---
+type: synthesis
+title: "Research: pi-vcc"
+created: 2026-05-05
+updated: 2026-05-05
+tags:
+  - research
+  - pi-agent
+  - vcc
+  - compaction
+  - deterministic
+status: developing
+related:
+  - "[[vcc-conversation-compaction-for-pi]]"
+  - "[[deterministic-session-compaction]]"
+  - "[[context-folding]]"
+  - "[[pi-vcc-github-repo]]"
+  - "[[pi-mono-compaction-docs]]"
+  - "[[distill-deterministic-context-compression]]"
+  - "[[codex-dsc-rfc-8573]]"
+  - "[[pi-compaction-extensions-ecosystem]]"
+  - "[[pi-rtk-optimizer-github-repo]]"
+  - "[[pi-omni-compact-github-repo]]"
+  - "[[pi-context-prune-github-repo]]"
+  - "[[anthropic-compaction-api]]"
+  - "[[context-folding-paper]]"
+sources:
+  - "[[pi-vcc-github-repo]]"
+  - "[[pi-mono-compaction-docs]]"
+  - "[[distill-deterministic-context-compression]]"
+  - "[[codex-dsc-rfc-8573]]"
+  - "[[pi-compaction-extensions-ecosystem]]"
+  - "[[pi-rtk-optimizer-github-repo]]"
+  - "[[pi-omni-compact-github-repo]]"
+  - "[[pi-context-prune-github-repo]]"
+  - "[[anthropic-compaction-api]]"
+  - "[[context-folding-paper]]"
+---
+# Research: pi-vcc
+## Overview
+`pi-vcc` is the only fully deterministic session compaction extension for Pi. It achieves 35-99% token reduction with zero LLM calls, sub-500ms latency, and full JSONL recall. The Pi compaction ecosystem has grown from 4 to 7 extensions since initial research, with three distinct layers emerging: prevention, mid-session pruning, and boundary compaction. Meanwhile, Anthropic launched an official server-side compaction API (beta, January 2026) and academic research produced Context Folding (10x context reduction via RL-trained branch/return). pi-vcc's deterministic approach remains architecturally unique across all of these.
+## Key Findings
+1. **Pi compaction ecosystem expanded from 4 to 7 extensions**: Three new extensions emerged since April: pi-omni-compact (large-context model subprocess), pi-context-prune (tool-call batch summarization), and pi-rtk-optimizer (upstream command rewriting + output compaction). The ecosystem now operates at three distinct layers. (Source: [[pi-compaction-extensions-ecosystem]])
+2. **pi-vcc remains the only zero-LLM compaction option**: Across all 7 extensions plus Anthropic's official API, pi-vcc is still the only approach that uses zero LLM calls. Every other option — including the official Anthropic Context Compaction API — relies on LLM summarization. (Source: [[pi-vcc-github-repo]], [[anthropic-compaction-api]])
+3. **Anthropic launched official server-side compaction**: Beta since January 2026. Supports Claude Mythos Preview, Opus 4.7/4.6, Sonnet 4.6. Automatic threshold-based summarization. This validates compaction as a first-class platform concern. However, it has all three failure modes pi-vcc avoids: non-determinism, hallucination risk, API cost. (Source: [[anthropic-compaction-api]])
+4. **Context Folding achieves 10x context reduction with 32K budget**: arXiv 2510.11967 (ByteDance Seed, CMU, Stanford). 200-step agents at 62% BrowseComp-Plus and 58% SWE-Bench Verified using only 32K tokens. Fundamentally different approach: learned branch/return sub-trajectories WITHIN a single run, vs pi-vcc's boundary compaction. (Source: [[context-folding-paper]])
+5. **Tool-calling accuracy collapses ~40% past 80K tokens**: Context Folding paper quantifies the hard cliff. Past ~80K effective-context tokens, agent tool-calling accuracy drops dramatically. This is not a gradual decline — it is a cliff. Validates aggressive compaction as a correctness concern, not just a cost concern. (Source: [[context-folding-paper]])
+6. **Three-layer token management architecture emerged**: Prevention (rtk-optimizer) → mid-session pruning (context-prune) → boundary compaction (vcc/others). This maps directly to our harness's layered approach to context engineering. pi-vcc operates at layer 3; it could be complemented by layers 1 and 2. (Source: [[pi-compaction-extensions-ecosystem]])
+7. **pi-omni-compact represents the strongest competing philosophy**: Spawns separate Pi subprocess with 1M+ context model. Maximizes LLM compute for highest fidelity summaries. Exactly opposite of pi-vcc's philosophy: more compute for better quality vs zero compute for determinism. (Source: [[pi-omni-compact-github-repo]])
+8. **Recall remains the killer differentiator**: No new extension or API offers searchable access to pre-compaction history. pi-vcc's `vcc_recall` over raw JSONL with regex + ranked multi-word queries is still unique. pi-context-prune preserves originals but for tool-call batches only, not full conversation. (Source: [[pi-vcc-github-repo]], [[pi-context-prune-github-repo]])
+9. **Pi ecosystem reached 2,808+ resources**: 1,183+ extensions, 1,459 active projects. Compaction is among the highest-activity categories. pi-vcc's 75 stars and 3,299 monthly installs positions it as a mid-tier extension by adoption. (Source: [[pi-compaction-extensions-ecosystem]])
+10. **65% of enterprise AI failures attributed to context drift/memory loss**: Broader industry data validates compaction as mission-critical, not optional. Combined with the 80K token accuracy cliff, this makes the case that compaction quality directly determines agent reliability. (Source: [[context-folding-paper]])
+## Key Entities
+- [[pi-coding-agent]]: Host platform
+- `sting8k` (Do Anh): pi-vcc maintainer
+- `Siddhant-K-code`: Distill maintainer
+- `Whamp`: pi-omni-compact maintainer (competing philosophy)
+- `championswimmer`: pi-context-prune maintainer
+- `MasuRii`: pi-rtk-optimizer maintainer
+- Anthropic: Official compaction API provider
+- ByteDance Seed / CMU / Stanford: Context Folding researchers
+## Key Concepts
+- [[deterministic-session-compaction]]: The cross-tool pattern of no-LLM compaction
+- [[vcc-conversation-compaction-for-pi]]: pi-vcc's specific implementation
+- [[context-folding]]: RL-learned branch/return sub-trajectories (10x reduction)
+- [[structured-compaction]]: Claude Code's 5-layer approach (LLM-based, different paradigm)
+- [[context-engineering]]: Broader discipline encompassing all compaction approaches
+## Contradictions
+- **Codex rejected deterministic compaction but Anthropic launched LLM compaction**: OpenAI closed DSC RFC as not_planned. Anthropic launched official LLM-based compaction. Neither validates pi-vcc's deterministic approach directly, but both validate compaction as critical infrastructure.
+- **pi-omni-compact vs pi-vcc**: Diametrically opposed philosophies. omni-compact says "use MORE compute for BETTER summaries." pi-vcc says "use ZERO compute for SAFE summaries." Both have valid use cases; the winner depends on whether you optimize for fidelity or reliability.
+- **Context Folding outperforms summarization-based compaction**: If RL-trained folding beats LLM summarization, the gap between folding and deterministic extraction may be even larger — but folding requires training and is not available as a Pi extension.
+## Open Questions
+- Can pi-vcc adopt Context Folding's branch/return concept for within-run compaction (not just boundary)?
+- Is a hybrid pi-vcc + pi-rtk-optimizer stack the optimal three-layer architecture?
+- Should pi-vcc integrate with Anthropic's Compaction API as a fallback for nuanced sessions?
+- Does the 80K token accuracy cliff change the optimal compaction threshold for pi-vcc?
+- Can deterministic folding rules approximate FoldGRPO's learned behavior without RL training?
+- With 7 competing extensions, will the Pi ecosystem consolidate or further fragment?
+## Sources
+- [[pi-vcc-github-repo]]: Primary source, 75 stars, v0.3.12
+- [[pi-mono-compaction-docs]]: Pi core compaction baseline
+- [[distill-deterministic-context-compression]]: Competing deterministic approach, different layer
+- [[codex-dsc-rfc-8573]]: Codex's rejected but validating RFC
+- [[pi-compaction-extensions-ecosystem]]: Full Pi compaction extension landscape (7 extensions)
+- [[pi-rtk-optimizer-github-repo]]: Upstream token reduction (command rewriting)
+- [[pi-omni-compact-github-repo]]: Large-context model compaction (competing philosophy)
+- [[pi-context-prune-github-repo]]: Tool-call batch summarization
+- [[anthropic-compaction-api]]: Official Anthropic server-side compaction (beta Jan 2026)
+- [[context-folding-paper]]: arXiv 2510.11967, 10x context reduction via RL