npm - ultimate-pi - Versions diffs - 0.1.2 → 0.1.3 - Mend

ultimate-pi 0.1.2 → 0.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (516) hide show

package/vault/wiki/questions/research-agent-first-codebase-exploration.md ADDED Viewed

@@ -0,0 +1,199 @@
+---
+type: synthesis
+title: "Research: Agent-First Codebase Exploration Strategies"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - research
+  - agent-architecture
+  - codebase-exploration
+  - harness-design
+status: developing
+related:
+  - "[[oss-guide-codebase-exploration]]"
+  - "[[aider-repomap-tree-sitter]]"
+  - "[[swe-agent-aci]]"
+  - "[[swe-bench]]"
+  - "[[openhands-platform]]"
+  - "[[agent-codebase-interface]]"
+  - "[[progressive-disclosure-agents]]"
+  - "[[repo-map-ranking]]"
+  - "[[execution-feedback-loop]]"
+  - "[[harness]]"
+  - "[[grounding-checkpoints]]"
+  - "[[structured-planning]]"
+sources:
+  - "[[oss-guide-codebase-exploration]]"
+  - "[[aider-repomap-tree-sitter]]"
+  - "[[swe-agent-aci]]"
+  - "[[swe-bench]]"
+  - "[[openhands-platform]]"
+---# Research: Rethinking OSS Codebase Exploration for AI Coding Agents
+## Overview
+The OSS Guide by Parth Parikh et al. (2020) is a human-centric manual for exploring large codebases. Every technique assumes a human developer with visual cortex, intuition, unlimited working memory, and gradual learning curves. AI coding agents have none of these. This synthesis maps each human technique to an agent-equivalent strategy — and identifies where the agent can actually do better.
+The core finding: humans and agents need fundamentally different interfaces to the same codebase. The Agent-Codebase Interface (ACI) concept from SWE-agent formalizes this gap. Our harness must implement agent-native exploration strategies, not emulate human workflows.
+---
+## Key Findings
+### 1. "Use the project" → "Map the project" (Source: [[oss-guide-codebase-exploration]], [[aider-repomap-tree-sitter]])
+**Human strategy**: Build something with the project to learn its breadth through experiential exposure.
+**Agent equivalent**: Construct a structured repo map — AST-parsed symbol definitions with cross-references, ranked by importance within token budget. Agents learn from structured information, not experiential use. Aider's repo map approach (tree-sitter + graph ranking) is the canonical implementation.
+**Agent advantage**: Can ingest the entire symbol structure of a 100K-line codebase in ~2K tokens. A human would need days of exploration.
+**Agent disadvantage**: Cannot form intuitive mental models. Every dependency must be explicitly represented.
+### 2. "Check earliest commits" → "Check architectural spec" (Source: [[oss-guide-codebase-exploration]])
+**Human strategy**: Read initial commits to understand the project's founding goals and core architecture.
+**Agent equivalent**: Parse the current architecture, not historical commits. Agents don't benefit from evolutionary understanding — they need the current dependency graph. Architectural decision records (ADRs), SPEC.md files, and module-level docstrings are more valuable. Earliest commits may contain code that no longer exists or has been refactored beyond recognition.
+**Verdict**: Skip for agents. Read current architecture docs instead.
+### 3. "Test cases as documentation" → "Test cases as ground truth" (Source: [[oss-guide-codebase-exploration]], [[execution-feedback-loop]])
+**Human strategy**: Read tests to understand expected behavior.
+**Agent equivalent**: Execute tests as verification checkpoints in the feedback loop. Tests serve dual purpose: (a) documentation of expected behavior, (b) runtime verification of changes. Agents should extract test assertions as behavioral contracts and use failing tests as precision navigation (test output + stack trace → exact file:line of failure).
+**Agent advantage**: Can run the entire test suite, extract structured failure data, and correlate failures to specific edits — all in seconds. Humans can't match this speed.
+**Agent disadvantage**: Cannot visually "see" test patterns or infer missing test coverage without explicit tooling.
+### 4. "Git log trick (80/20 rule)" → "Graph centrality ranking" (Source: [[oss-guide-codebase-exploration]], [[repo-map-ranking]])
+**Human strategy**: `git log --name-only | sort | uniq -c | sort -rg` to find most-edited files — the 20% of files doing 80% of the work.
+**Agent equivalent**: AST-based dependency graph with PageRank-style centrality. Most-referenced symbols = most important. This is more accurate than edit frequency because: (a) some core files rarely change (stable), (b) some frequently-edited files are just config/tests. Graph centrality captures structural importance, not historical churn.
+**Implementation for harness**: tree-sitter parse → extract definitions + references → build file-to-file edge graph → rank by in-degree (reference count) → select top-k within token budget.
+### 5. "Don't try to understand everything" → "Progressive disclosure" (Source: [[oss-guide-codebase-exploration]], [[progressive-disclosure-agents]])
+**Human strategy**: Narrow scope to the relevant subsystem; treat everything else as a black box.
+**Agent equivalent**: Identical principle, but implemented through [[progressive-disclosure-agents]] — a layered information architecture (L0: project map, L1: symbol map, L2: file context, L3: deep context). Agent starts with L0 always available, queries deeper layers on demand.
+**Key difference**: Humans scope down through intuition. Agents scope down through structured queries — "show me all callers of function X" or "show me all implementations of interface Y." The tooling must support these queries efficiently.
+### 6. "Paper Cut Principle" → "Coverage-driven exploration" (Source: [[oss-guide-codebase-exploration]])
+**Human strategy**: Many small fixes across the codebase build deep understanding over time.
+**Agent equivalent**: Agents don't learn over time (stateless across sessions unless persisted). Instead, they should ingest coverage information upfront: which files touch which subsystems, what the dependency boundaries are, where the hot paths live. A single comprehensive map is better than incremental learning. For persistent agents, store codebase understanding in the [[persistent-memory]] wiki layer.
+**Recommendation**: Build the full repo map once (cached). Refresh on new sessions by diffing against cached map. Don't rely on gradual learning.
+### 7. "Reproduce the issue" → "Automated reproduction + test capture" (Source: [[oss-guide-codebase-exploration]], [[execution-feedback-loop]])
+**Human strategy**: Manually reproduce the bug to understand it, then write a failing test.
+**Agent equivalent**: Automate reproduction when possible (script the steps). Generate a minimal failing test from the issue description. The test becomes the verification checkpoint. This is a core capability already in the [[grounding-checkpoints]] layer of our harness.
+### 8. "Structured theorizing" → "Multi-hypothesis search" (Source: [[oss-guide-codebase-exploration]])
+**Human strategy**: Brainstorm potential causes, verify each.
+**Agent equivalent**: Generate multiple candidate fix locations based on error signatures, stack traces, and symbol maps. Verifying each through targeted code inspection. This is effectively what the [[adversarial-verification]] layer does — critic agents propose alternative hypotheses.
+### 9. "Feedback from mentors" → "Adversarial verification" (Source: [[oss-guide-codebase-exploration]], [[adversarial-verification]])
+**Human strategy**: Rubber duck debugging, mentor review, explain to someone else.
+**Agent equivalent**: [[adversarial-verification]] — a separate critic agent reviews the proposed change, checks for edge cases, verifies against specifications, and either approves or sends back with specific failure reasons. This replaces the human mentor role.
+### 10. "Hack it, then get it right" → "Iterative refinement with verification gates" (Source: [[oss-guide-codebase-exploration]])
+**Human strategy**: Get a proof of concept working first, then polish.
+**Agent equivalent**: This maps directly to the harness pipeline: [[structured-planning]] → [[grounding-checkpoints]] (smallest verifiable change) → execute → verify → iterate. The key difference: agents must pass verification at every checkpoint, not just at the end. Each iteration must produce a verifiable state.
+---
+## Agent-Superior Capabilities
+Areas where agents outperform humans on codebase exploration:
+| Capability | Why Agents Win |
+|------------|---------------|
+| Symbol-space ingestion | Can parse and index 100K+ symbols in seconds across entire codebase |
+| Cross-reference tracking | Tree-sitter + graph algorithms find all callers/callees instantly |
+| Multi-file correlation | Can hold 50+ files in context simultaneously (if within window) |
+| Test execution speed | Can run thousands of tests, parse results, and correlate to changes |
+| Pattern matching at scale | grep/semgrep/ast-grep across entire codebase in milliseconds |
+## Agent-Weak Areas
+Areas where agents need explicit tooling to compensate:
+| Weakness | Mitigation |
+|----------|-----------|
+| No visual pattern recognition | Need explicit AST representations, not visual layouts |
+| No intuition or mental models | Need explicit dependency graphs and data flow diagrams |
+| Fixed context window | Need ranking and progressive disclosure to stay within budget |
+| No learning across sessions | Need persistent memory ([[persistent-memory]] wiki layer) |
+| Can't "skim" or scan | Need structured search with precise queries |
+| No physical debugging | Need execution feedback loop with structured output |
+---
+## Implementation Recommendations for Our Harness
+### Must-Have (L1-L3 integration)
+1. **Repo Map Generator**: Integrate tree-sitter parsing into L3 ([[grounding-checkpoints]]) or as a pre-flight step. Produce ranked symbol map on every harness invocation.
+2. **Progressive Disclosure API**: L0 always injected into context. L1 queryable by agent. L2/L3 available on explicit request.
+3. **Execution Feedback Loop**: Every code change must trigger relevant test subset. Output must be structured (JSON error format with file:line:message).
+4. **Persistent Codebase Understanding**: Cache repo maps in [[persistent-memory]]. Diff against cache on each session to identify changes.
+### Should-Have (medium priority)
+5. **Call Graph Queries**: "Show me all callers of X" / "Show me all implementations of interface Y" — implemented via tree-sitter reference tracking.
+6. **Symbol-Level Diffs**: When a file changes, show which symbols were added/removed/modified — not just which lines changed.
+7. **Test Impact Analysis**: Determine which tests are affected by a code change. Only run those. Essential for keeping the feedback loop fast.
+### Nice-to-Have
+8. **Architecture Diagram Generation**: Auto-generate module dependency diagrams from import graphs.
+9. **Hotspot Detection**: Identify files/symbols with high churn + high reference count = high risk.
+10. **Semantic Chunking**: Split files into semantically meaningful chunks (functions, classes, logical blocks) for more efficient context loading than raw line-based chunking.
+---
+## Contradictions
+- [[oss-guide-codebase-exploration]] recommends "use the project" as the first step. [[aider-repomap-tree-sitter]] and [[swe-agent-aci]] show that structured mapping is superior for agents. Resolution: Use the project only to extract behavioral contracts (API surface, expected inputs/outputs), not for broad familiarization.
+- [[oss-guide-codebase-exploration]] recommends checking earliest commits. Our analysis finds this ineffective for agents — architectural drift makes historical commits misleading. Resolution: Parse current architecture via AST, not git history.
+## Open Questions
+- How to handle dynamic languages (Python, JS) where tree-sitter can't resolve all references statically? (e.g., `getattr`, dynamic imports)
+- What is the optimal token allocation between repo map vs. file contents vs. conversation history?
+- How to handle monorepos where a single repo map would exceed even the largest context windows?
+- Can we pre-compute and cache agent-specific codebase representations (embeddings, graphs) to avoid re-parsing on every invocation?
+- How does the agent decide when to expand context (L0→L1→L2) vs. when to work with what it has?
+## Sources
+- [[oss-guide-codebase-exploration]]: Human-centric codebase exploration guide (2020)
+- [[aider-repomap-tree-sitter]]: Tree-sitter + graph ranking for LLM code context (2023)
+- [[swe-agent-aci]]: Agent-Computer Interfaces concept and SWE-agent system (2024)
+- [[swe-bench]]: Benchmark for real-world software engineering tasks (2023)
+- [[openhands-platform]]: Open platform for AI software developer agents (2024)

package/vault/wiki/questions/research-agentic-coding-harness-latest-papers.md ADDED Viewed

@@ -0,0 +1,142 @@
+---
+type: synthesis
+title: "Research: Agentic Coding Harness — Latest Papers & Pipeline Improvements"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - research
+  - harness
+  - agentic-coding
+  - pipeline
+status: developing
+related:
+  - "[[harness-implementation-plan]]"
+  - "[[agentic-harness]]"
+  - "[[model-adaptive-harness]]"
+  - "[[consensus-debate]]"
+  - "[[adr-011]]"
+  - "[[adversarial-verification]]"
+  - "[[harness-h-formalism]]"
+  - "[[feedforward-feedback-harness]]"
+  - "[[generator-evaluator-architecture]]"
+  - "[[self-evolving-harness]]"
+  - "[[selective-debate-routing]]"
+  - "[[context-anxiety]]"
+sources:
+  - "[[meng2026-agent-harness-survey]]"
+  - "[[anthropic2026-harness-design]]"
+  - "[[bockeler2026-harness-engineering]]"
+  - "[[lou2026-autoharness]]"
+  - "[[lee2026-meta-harness]]"
+  - "[[fan2025-imad]]"
+---# Research: Agentic Coding Harness — Latest Papers & Pipeline Improvements
+## Overview
+Researched 9 key sources (2 surveys, 5 papers, 2 production engineering blogs) from 2025-2026 on agentic coding harnesses. The field has crystallized around a consensus: **the harness — not the model — is the primary determinant of agent reliability at scale.** 2026 is "the year of the harness." Our pipeline plan is well-aligned with the state of the art but has specific actionable improvements.
+## Key Findings
+### 1. The Harness Has a Formal Model (Source: [[meng2026-agent-harness-survey]])
+The survey formalizes the harness as **H = (E, T, C, S, L, V)**: Execution Loop, Tool Registry, Context Manager, State Store, Lifecycle Hooks, Evaluation Interface. No system achieves production reliability without all six. Our 8-layer pipeline maps cleanly but lacks explicit lifecycle hooks (L) and systematic evaluation trajectory tracking (V).
+**Action**: Add explicit H=(E,T,C,S,L,V) mapping to our harness documentation. Consider formal component contracts.
+### 2. Self-Evaluation Is Fundamentally Broken (Source: [[anthropic2026-harness-design]])
+When agents evaluate their own work, they "confidently praise mediocre outputs." Separating generator from evaluator is essential. The evaluator needs explicit tuning to be skeptical — out of the box, Claude "talks itself out of flagging real issues."
+**Action**: Strengthen L4 adversarial verification with explicit pass/fail grading criteria and hard thresholds, not just narrative feedback. Add sprint contracts (agree on "done" before code) at L2.
+### 3. Harness Simplification Is Ongoing Practice (Source: [[anthropic2026-harness-design]])
+"Every component in a harness encodes an assumption about what the model can't do on its own — and those assumptions are worth stress testing." As Opus 4.6 improved, Anthropic removed sprint decomposition and made the evaluator conditional. The space of harness combinations "doesn't shrink as models improve — it moves."
+**Action**: Add a "harness simplification" review gate after each model upgrade. Which components became unnecessary? Which new capabilities enable removing old scaffolding?
+### 4. Feedforward + Feedback Control Framework (Source: [[bockeler2026-harness-engineering]])
+Martin Fowler team formalizes harness controls as: Feedforward (guides: AGENTS.md, skills, LSP) and Feedback (sensors: linters, tests, AI review). Both computational (deterministic) and inferential (probabilistic). The behavioural harness — functional correctness verification — remains unsolved.
+**Action**: Audit our pipeline against the feedforward/feedback framework. Are we missing computational sensors that could run cheaply per-edit? Our Phase 12 (inline syntax) is correct but could expand to structural tests (ArchUnit equivalents).
+### 5. Harnesses Can Self-Evolve (Sources: [[lou2026-autoharness]], [[lee2026-meta-harness]])
+AutoHarness: small model synthesizes harness code from environment feedback. 78% of chess losses were illegal moves → harness eliminates all illegal moves. Meta-Harness: outer-loop optimization over harness code, surpasses hand-engineered baselines on TerminalBench-2.
+**Action**: Add Phase 17 (future): Harness Auto-Optimization. Auto-tune token budgets from execution traces. Learn model profiles instead of hand-coding them. Auto-remove unnecessary gates.
+### 6. Debate Should Be Selective, Not Always-On (Source: [[fan2025-imad]])
+iMAD: Selective debate via lightweight classifier saves 92% tokens AND improves accuracy by 13.5% vs always-debate. Debate can overturn correct answers — it's not always beneficial.
+**Action**: Modify ADR-011 consensus debate design. Add a pre-debate gating classifier. Single agent self-critiques first; only trigger debate when hesitation cues detected. This could reduce projected ~13,000 token debate overhead to ~1,000 tokens in high-confidence cases.
+### 7. Context Anxiety Is Real (Source: [[anthropic2026-harness-design]])
+Models rush to finish as context fills. Sonnet 4.5 exhibited strong context anxiety requiring full context resets. Opus 4.5+ largely fixed. GPT/strict models may still be vulnerable.
+**Action**: Add context anxiety detection to long-running harness sessions. For GPT models, consider context resets between rounds with structured handoff artifacts.
+## Key Entities
+- **[[anthropic2026-harness-design]]**: Leading production harness engineering. GAN-inspired multi-agent architecture. Published definitive harness design guide.
+- **[[meng2026-agent-harness-survey]]**: Authored the first comprehensive harness survey (110+ papers, 23 systems). Formalized H=(E,T,C,S,L,V).
+- **[[bockeler2026-harness-engineering]]**: Feedforward/feedback control framework for harness engineering.
+- **[[lee2026-meta-harness]]**: Meta-Harness: automated harness optimization via outer-loop search.
+## Key Concepts
+- [[harness-h-formalism]]: H = (E, T, C, S, L, V) formal model
+- [[feedforward-feedback-harness]]: Guides + sensors, computational + inferential
+- [[generator-evaluator-architecture]]: GAN-inspired separation with sprint contracts
+- [[self-evolving-harness]]: Auto-synthesis and meta-optimization of harness code
+- [[selective-debate-routing]]: Trigger debate only when beneficial (iMAD)
+- [[context-anxiety]]: Models rush as context fills
+## Pipeline Improvement Recommendations
+### Immediate (integrate into existing phases)
+| # | Change | Affected Phase | Source |
+|---|--------|---------------|--------|
+| 1 | Add explicit pass/fail grading criteria with hard thresholds to L4 critics | Phase 5-6 | Anthropic harness design |
+| 2 | Add sprint contract negotiation at L2 (agree on "done" before L3) | Phase 3-4 | Anthropic harness design |
+| 3 | Add pre-debate gating classifier to ADR-011 consensus debate | Phase 14-15 | iMAD |
+| 4 | Add H=(E,T,C,S,L,V) mapping to harness documentation | Phase 0 | Harness survey |
+| 5 | Audit feedforward/feedback controls: missing computational sensors? | All phases | Martin Fowler |
+### Future Phases (new)
+| # | Phase | Description | Source |
+|---|-------|-------------|--------|
+| 17 | Harness Auto-Optimization | Auto-tune token budgets, learn model profiles, auto-remove unnecessary gates | Meta-Harness, AutoHarness |
+| 18 | Behaviour Harness | Functional correctness verification — the unsolved problem | Martin Fowler |
+| 19 | Context Anxiety Guard | Detect and mitigate context anxiety in long-running sessions | Anthropic harness design |
+### Revised Token Budget (with improvements)
+| Layer | Current | After Improvements | Mechanism |
+|-------|---------|-------------------|-----------|
+| L1 Spec hardening | ~2,000 | ~2,000 | No change |
+| L2 Planning + review | ~5,000 | ~7,500 | Sprint contracts add overhead |
+| L4 Adversarial | ~4,000 | ~4,500 | Hard-threshold criteria |
+| Consensus debate | ~13,000 | ~3,000 | Selective routing (92% token savings on ~80% of tasks) |
+| **Total per subtask** | **~30,500-33,500** | **~24,000-26,000** | Net savings ~20% |
+## Contradictions
+- **Always-debate vs selective debate**: Our ADR-011 assumes debate always improves outcomes. iMAD shows debate can overturn correct answers. Resolution: adopt selective routing. (Confidence: medium — iMAD tested on QA, not code review. Transferability needs verification.)
+- **Harness complexity vs simplification**: Full harness costs 20x more (Anthropic). Meta-Harness finds simpler harnesses automatically. Resolution: complexity is justified until models improve enough to make components unnecessary. Regular simplification audits required.
+## Open Questions
+- Do iMAD's hesitation cues generalize from QA tasks to code review (our L4 debate domain)?
+- Can a single debate-gating classifier work across L1 (spec), L2 (plan), and L4 (implementation)?
+- What is the right cadence for harness simplification audits? Every model upgrade? Monthly?
+- How do we measure harness coverage/quality? (No existing methodology — identified as open problem by both surveys)
+- Should the evaluator always use a different model than the generator for genuine adversarial diversity?
+## Sources
+- [[meng2026-agent-harness-survey]]: Meng et al., April 2026. H=(E,T,C,S,L,V), 110+ papers, 23 systems.
+- [[anthropic2026-harness-design]]: Rajasekaran, March 2026. GAN-inspired generator-evaluator architecture.
+- [[bockeler2026-harness-engineering]]: Böckeler/Fowler, April 2026. Feedforward/feedback control framework.
+- [[lou2026-autoharness]]: Lou et al., February 2026. Automatic harness synthesis from environment feedback.
+- [[lee2026-meta-harness]]: Lee et al., March 2026. Outer-loop harness optimization over code.
+- [[fan2025-imad]]: Fan et al., AAAI 2026. Selective multi-agent debate with 92% token savings.

package/vault/wiki/questions/research-gitingest-gitreverse-integration.md ADDED Viewed

@@ -0,0 +1,100 @@
+---
+type: synthesis
+title: "Research: GitIngest and GitReverse Integration"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - research
+  - tool-evaluation
+  - codebase-ingestion
+status: complete
+related:
+  - "[[gitingest]]"
+  - "[[gitreverse]]"
+  - "[[codebase-to-context-ingestion]]"
+  - "[[lean-ctx]]"
+sources:
+  - "[[gitingest]]"
+  - "[[gitreverse]]"
+---# Research: GitIngest and GitReverse Integration
+## Overview
+Evaluated two codebase-to-LLM services for integration into the ultimate-pi harness: **Gitingest** (codebase → structured text) and **GitReverse** (repo → synthetic user prompt). Gitingest is a strong fit. GitReverse is not.
+## Key Findings
+- **Gitingest converts entire codebases into LLM-ready structured text** (Source: [[gitingest]]). Output includes summary, directory tree, and all file contents with clear delimiters. No LLM dependency — deterministic and fast.
+- **GitReverse generates user prompts FROM repos** using LLM inference on metadata only (Source: [[gitreverse]]). It does NOT read source code. This is useful for users who want to "reverse engineer" what prompt created a project, but it doesn't help an agent ingest codebases.
+- **The harness currently lacks bulk codebase ingestion** (Source: [[lean-ctx]] handles file-by-file reading only). Gitingest fills this gap directly.
+- **Gitingest has a Python package** (`pip install gitingest`) with a clean API: `from gitingest import ingest` returns `(summary, tree, content)`. Can be wrapped as a skill or called via bash.
+- **Gitingest has an `/llms.txt` endpoint** providing machine-readable documentation for LLM agents to self-integrate.
+## Critical Evaluation: Gitingest
+### Will it help the harness right now? **YES**
+| Need | Current State | With Gitingest |
+|------|--------------|----------------|
+| Understand external repo | File-by-file via lean-ctx | Bulk ingestion in one operation |
+| Research dependencies | Manual web fetch | Structured codebase dump |
+| Ingest docs repos into wiki | Manual per-page | Single pipeline step |
+| Cross-reference implementations | Not possible | Compare codebases side-by-side |
+### Why integrate
+1. **Fills a capability gap**: The harness has no mechanism to ingest entire external codebases as context
+2. **Low integration cost**: Python package with clean API, trivially wrappable as a skill
+3. **No LLM dependency**: Deterministic output, no cost, no latency risk
+4. **Complementary to lean-ctx**: lean-ctx for local files, Gitingest for external repos
+5. **Already optimized for LLM context**: Output format has clear delimiters and structure
+### How to integrate
+**Recommended: Skill wrapper around Python package**
+```
+Skill: /gitingest
+└── Calls: gitingest <url> -o -
+└── Options: --include/--exclude patterns, --max-size, --branch
+└── Output: pipes to agent context or files to wiki
+└── Private repos: reads GITHUB_TOKEN from .env (already loaded by dotenv-loader extension)
+```
+Integration steps:
+1. Add `gitingest` to optional dependencies in `package.json` (as a `pip` dependency note) or document as prerequisite
+2. Create skill at `.pi/skills/gitingest/SKILL.md` that wraps `gitingest` CLI or Python API
+3. Skill handles: URL validation, output formatting, wiki filing (via wiki-ingest), error cases (rate limits, private repos without token, large repos)
+4. Register in skills-lock.json
+**Alternative: Direct bash integration**
+Simpler but less polished: just document that agent can run `gitingest <url>` via bash. No skill needed. This is the MVP approach.
+## Critical Evaluation: GitReverse
+### Will it help the harness right now? **NO**
+- GitReverse generates prompts FROM repos — the harness receives prompts, it doesn't need to generate them
+- It only reads metadata, not code. The harness needs code-level understanding
+- It uses LLM inference (cost + latency) for something the harness doesn't need
+- The output is a natural language prompt, not structured code context
+> [!gap] Could GitReverse be useful for wiki content generation? If the harness needs to generate natural language descriptions of repos for wiki pages, GitReverse could help. But this is not a current need.
+## Contradictions
+None identified between sources. Both tools are complementary products from different authors targeting different use cases. GitReverse's README explicitly credits Gitingest as inspiration.
+## Open Questions
+- How does Gitingest handle repos larger than context window? Does it truncate? (Source: [[gitingest]] supports file size limits but repo-level truncation behavior unclear)
+- Can Gitingest's output be further compressed by lean-ctx's tree-sitter AST mode for additional token savings?
+- Should we also evaluate Repomix (npm alternative mentioned in Gitingest README) as a Node.js-native alternative?
+- What's the GitHub API rate limit impact of frequent Gitingest usage? The web service may have its own caching.
+## Recommendation
+**Integrate Gitingest now.** Create a `/gitingest` skill (renamed from `/ingest` to avoid clash with `wiki-ingest`). Ship as MVP via direct bash wrapping, then iterate to Python API integration if needed.
+**Skip GitReverse.** No current use case in the harness. Revisit if wiki auto-description becomes a feature requirement.

package/vault/wiki/questions/research-wozcode-token-reduction.md ADDED Viewed

@@ -0,0 +1,67 @@
+---
+type: synthesis
+title: "Research: WOZCODE Token-Reduction Architecture"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - research
+  - token-reduction
+  - agent-architecture
+  - wozcode
+status: developing
+related:
+  - "[[wozcode]]"
+  - "[[ast-truncation]]"
+  - "[[fuzzy-edit-matching]]"
+  - "[[model-routing-agents]]"
+  - "[[inline-post-edit-validation]]"
+  - "[[harness-implementation-plan]]"
+  - "[[agentic-harness]]"
+sources:
+  - "[[wozcode]]"
+---# Research: WOZCODE Token-Reduction Architecture
+## Overview
+WOZCODE is a Claude Code plugin that reduces token spend by 25–55% through three reinforcing levers: smarter code exploration (fewer input tokens), batched fuzzy edits (fewer output tokens), and an inline quality loop (fewer retries). It runs 100% locally with zero cloud code exposure. Its architecture — AST truncation, Haiku subagent routing, and post-edit syntax validation — represents a fundamentally different approach to agentic coding that our harness should adopt.
+## Key Findings
+- **AST truncation cuts input tokens by returning function signatures, not bodies** (Source: [[wozcode]]). Smarter search returns ranked snippets instead of full-file grep dumps. This is complementary to our planned [[repo-map-ranking]] but more aggressive: it stubs content at the AST level rather than just selecting files.
+- **Fuzzy edit matching eliminates retry round-trips** (Source: [[wozcode]]). Tolerates whitespace drift, indentation changes, curly vs straight quotes, em-dashes. Near-misses still land — the edit tool self-corrects formatting differences instead of failing.
+- **Post-edit syntax validation catches errors before the model retries** (Source: [[wozcode]]). TS compiler, JSON/YAML/HTML parsers, SQL linter run after every edit. Errors caught before the next turn — fewer turns = less spend.
+- **Haiku subagents handle ~40% of coding work (exploration) at ~15× cheaper than Opus** (Source: [[wozcode]]). Read-only exploration routed to Haiku automatically; frontier model reserved for code generation.
+- **SQL dialect auto-fix rewrites common mistakes before they reach the model** (Source: [[wozcode]]). Backtick identifiers, unquoted reserved aliases, `COUNT(DISTINCT a, b)`, `date_trunc("month", col)`.
+- **Real savings from live API usage fields, not theoretical baselines** (Source: [[wozcode]]). Every percentage claim is measured from Anthropic's actual API usage fields.
+## Key Entities
+- **WOZCODE (WithWoz, Inc.)**: Claude Code plugin, founded 2025-2026. Patent-pending token-reduction technology. [[wozcode]]
+- **Haiku (Anthropic)**: Cheapest Claude model used for read-only exploration subagents. ~15× cheaper than Opus.
+- **Anthropic Claude Code**: The base agentic coding tool that WOZCODE wraps.
+## Key Concepts
+- [[ast-truncation]]: Stubbing function bodies at the AST level, returning only signatures + relevant snippets
+- [[fuzzy-edit-matching]]: Diff algorithm that tolerates formatting drift to land near-miss edits
+- [[model-routing-agents]]: Dispatching subtasks to different models based on operation type (explore vs generate)
+- [[inline-post-edit-validation]]: Running compilers/linters/parsers immediately after each edit, before the model sees the result
+- [[repo-map-ranking]]: Existing concept — graph centrality for selecting important codebase symbols
+## Contradictions
+- WOZCODE claims "100% local, zero-cloud" — but their privacy page reveals they send aggregated anonymous session stats to Supabase and use Stripe for billing. The code (files, paths, prompts, API keys) never leaves the machine, but metadata does. This is reasonable for a commercial product but worth noting.
+- WOZCODE's comparison with "graph-based explorers" (SDL-MCP) is accurate for exploration-only tools but understates what SDL-MCP can do with full repository indexing. SDL-MCP does cover some editing workflows if properly configured.
+## Open Questions
+- How does WOZCODE's AST truncation handle dynamically-typed languages (Python, JavaScript) where tree-sitter can't resolve all types statically? [gap: need to test on dynamic language codebases]
+- What is the actual performance overhead of running post-edit syntax validation after every edit? WOZCODE claims savings but doesn't disclose validation latency.
+- Can Haiku subagents be applied to code review / adversarial verification (our L4), or only to exploration? The architecture suggests exploration-only but the pattern could extend.
+- How does WOZCODE handle multi-file atomic edits where fuzzy matching on one file could create inconsistencies with another? [gap: need to investigate interaction between fuzzy matching and multi-edit batching]
+- The patent-pending status means implementation details are intentionally obscured. Reverse-engineering the fuzzy diff algorithm would require access to the plugin binary.
+## Sources
+- [[wozcode]]: Primary source — wozcode.com/how-it-works, docs, security pages (2026)

package/vault/wiki/questions/resolved-context-pruning-inplace-vs-restart.md ADDED Viewed

@@ -0,0 +1,95 @@
+---
+type: resolution
+title: "Resolved: In-Place Context Pruning vs Session Restart"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - resolution
+  - context-pruning
+  - meta-agent
+  - drift-detection
+status: resolved
+resolves:
+  - "[[meta-agent-context-pruning]] Open Questions #1-4"
+  - "[[drift-detection-unified]] Open Questions #1-3"
+related:
+  - "[[meta-agent-context-pruning]]"
+  - "[[drift-detection-unified]]"
+  - "[[context-drift-in-agents]]"
+sources:
+  - "[[claude-context-editing-docs]]"
+  - "[[opencode-dcp]]"
+  - "[[openclaw-session-pruning]]"
+  - "[[ms-chat-history-management]]"
+  - "[[agent-drift-academic-paper]]"
+---# Resolved: In-Place Context Pruning vs Session Restart
+## Resolution
+**Both in-place editing and session restart exist in production. In-place editing (server-side context clearing) is the preferred pattern when the LLM provider supports it. Session restart (compaction/summarization) is the fallback for providers without in-place support.**
+## Evidence
+### In-Place Editing (Production Pattern)
+Three major implementations confirm in-place context editing as the dominant production pattern:
+1. **Claude API Context Editing** (Anthropic, 2025): Server-side strategies `clear_tool_uses_20250919` and `clear_thinking_20251015`. Content is cleared server-side before the prompt reaches Claude. The client maintains the full, unmodified conversation history. Placeholder text replaces cleared content so the model knows it was removed. (Source: [[claude-context-editing-docs]])
+2. **OpenCode DCP** (2.5k stars, April 2026): "Your session history is never modified — DCP replaces pruned content with placeholders before sending requests to your LLM." Uses compress tool, deduplication, and purge-errors strategies. Cache hit rate: ~85% with DCP vs ~90% without. (Source: [[opencode-dcp]])
+3. **OpenClaw Session Pruning**: "Pruning only targets toolResult messages. It never modifies your actual user messages or the assistant's responses." Two modes: soft-trim (keep start+end, remove middle) and hard-clear (placeholder replacement). (Source: [[openclaw-session-pruning]])
+### Session Restart (Compaction/Summarization)
+Available as fallback:
+- **Claude SDK Compaction**: When token threshold exceeded, Claude generates structured summary, entire history replaced. Summary includes: Task Overview, Current State, Important Discoveries, Next Steps, Context to Preserve. (Source: [[claude-context-editing-docs]])
+- **Microsoft Semantic Kernel**: "Summarizing Older Messages" strategy — summarizes chat history, sends system message + summary + recent messages. Supports using small models (SLM) for summarization. (Source: [[ms-chat-history-management]])
+## Specific Questions Resolved
+### Q1: Can context be pruned in-place or must it always restart?
+**In-place. Always in-place for supported providers.** Claude API, OpenCode DCP, and OpenClaw all implement in-place editing — content is cleared before sending to the LLM but client-side history is never modified. Session restart is only needed for providers that lack server-side context editing APIs.
+### Q2: Minimum context that must survive pruning?
+**Production systems keep: system message, last 3-5 assistant turns (configurable `keepLastAssistants`), all user messages, any tool results containing images, and protected tools (task, skill, write, edit by default).** Everything else is eligible for clearing. The OpenClaw default `keepLastAssistants: 3` is a reasonable starting point.
+### Q3: Does pruning break chain-of-thought coherence?
+**In-place pruning preserves more coherence than restart.** Since in-place editing only clears old tool results (not assistant reasoning), chain-of-thought is preserved. Session restart (compaction) replaces everything with a summary, which loses fine-grained reasoning. Claude's thinking block clearing strategy (`clear_thinking_20251015`) explicitly controls how many turns of thinking to keep for coherence.
+**Recommendation**: Use in-place tool result clearing for routine context management. Reserve restart/compaction for extreme cases (>100k tokens accumulated).
+### Q4: How does pruning interact with prompt caching?
+**In-place clearing invalidates cache from the clearing point forward, but subsequent requests reuse the newly cached prefix.** The trade-off is quantified: OpenCode DCP reports ~85% cache hit rate with pruning vs ~90% without — a 5% cache hit reduction for significant token savings. Claude API's `clear_at_least` parameter ensures enough tokens are cleared to make cache invalidation worthwhile.
+**For the harness**: Configure `clear_at_least` to clear minimum 5000 tokens per operation. This ensures the token savings outweigh the cache write cost.
+### Q5: Can Haiku/Flash serve as meta-agent drift detector?
+**Yes, for rule-based detection with near-zero overhead. For LLM-based semantic drift detection, Haiku/Flash adds ~200-500 tokens per check (every 10-15 steps).** See [[resolved-small-model-meta-agents]] for full resolution.
+### Q6: Does the meta-agent itself need drift monitoring? (Infinite regress)
+**No.** The meta-agent uses rule-based detection (hash comparison + counters = 0 LLM tokens). There is no agentic loop to drift. If LLM-based detection is used (every 10-15 steps), it's a single inference, not an agentic session — no regress.
+## Harness Implementation
+For the ultimate-pi harness Layer 2.5 (Runtime Drift Monitor):
+| Strategy | When to Use | Implementation |
+|----------|------------|----------------|
+| **In-place clearing** | Primary (Claude API available) | Use `clear_tool_uses_20250919` with trigger at 30k tokens, keep 5 recent tool uses |
+| **Soft-trim** | Large tool results | Trim middle of oversized results, keep start+end |
+| **Hard-clear** | Stale tool results | Replace with `[Content cleared: tool result from step N]` |
+| **Compaction/restart** | Fallback (non-Claude providers) | Generate structured summary, restart session |
+| **Rule-based detection** | Always-on | 6 pattern signatures, 0 tokens |
+## Confidence
+**High.** Three independent production systems (Anthropic Claude API, OpenCode DCP, OpenClaw) all implement the same pattern: in-place editing that never modifies client-side history. The pattern is consistent and well-documented.