npm - ultimate-pi - Versions diffs - 0.1.2 → 0.1.3 - Mend

ultimate-pi 0.1.2 → 0.1.3

Files changed (516)

package/vault/wiki/questions/resolved-context-window-economics.md ADDED

@@ -0,0 +1,167 @@
+---
+type: resolution
+title: "Resolved: Context Window Economics — Token Allocation, Monorepos, and Caching"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - resolution
+  - context-window
+  - token-economics
+  - repo-map
+  - caching
+status: resolved
+resolves:
+  - "[[research-agent-first-codebase-exploration]] Open Questions #2-5"
+  - "[[research-gitingest-gitreverse-integration]] Open Questions #1-4"
+  - "[[Research: semantic code search tools]] Open Question #1, #5"
+related:
+  - "[[research-agent-first-codebase-exploration]]"
+  - "[[research-gitingest-gitreverse-integration]]"
+  - "[[progressive-disclosure-agents]]"
+  - "[[repo-map-ranking]]"
+  - "[[gitingest]]"
+sources:
+  - "[[claude-context-editing-docs]]"
+  - "[[gitingest]]"
+---
+# Resolved: Context Window Economics
+## Resolution
+**Token allocation is model- and task-specific but follows a consistent pattern: 10-20% repo map, 40-60% file contents, 20-40% conversation history. Monorepos require project-level splitting with progressive disclosure. Pre-computed and cached repo maps are the primary optimization — embeddings are secondary. The agent decides context expansion (L0→L1→L2) based on task needs, guided by explicit query interfaces at each level.**
+## 1. Token Allocation Model
+### Default Allocation by Task Type
+| Task Type | Repo Map | File Contents | Conversation | Example |
+|-----------|----------|---------------|--------------|---------|
+| **Bug fix (localized)** | 10% | 30% | 60% | Fix known function, need history + current file |
+| **Feature addition** | 20% | 50% | 30% | Need architecture context + relevant files |
+| **Codebase exploration** | 40% | 50% | 10% | Understanding new codebase, minimal history |
+| **Refactoring** | 30% | 40% | 30% | Need broad context + implementation details |
+| **Code review** | 15% | 60% | 25% | Need changed files + spec context |
+### Allocation for Common Context Windows
+| Context Window | Repo Map | File Contents | Conversation | Total Usable |
+|---------------|----------|---------------|--------------|-------------|
+| **32k (small)** | 3-6k | 13-19k | 6-13k | 32k |
+| **100k (medium)** | 10-20k | 40-60k | 20-40k | 100k |
+| **200k (large)** | 20-40k | 80-120k | 40-80k | 200k |
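+**The 10-20-40 rule**: by default, repo map ≤20% of window, conversation ≤40%, file contents fill remainder. The task-specific allocations in the first table may exceed these caps when the task calls for it (for example, 40% repo map for codebase exploration, or 60% conversation for a localized bug fix). The default ensures the agent always has structural context (map) and task continuity (conversation) while maximizing code visibility.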
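The allocation tables can be turned into a small budgeting helper. The following is a minimal sketch, assuming the shares from the "Default Allocation by Task Type" table; the names `TASK_SHARES`, `TokenBudget`, and `allocate` are hypothetical and not part of the package.

```python
from dataclasses import dataclass

# Shares (repo map, file contents, conversation) copied from the
# "Default Allocation by Task Type" table. Keys are hypothetical labels.
TASK_SHARES = {
    "bug_fix": (0.10, 0.30, 0.60),
    "feature": (0.20, 0.50, 0.30),
    "exploration": (0.40, 0.50, 0.10),
    "refactoring": (0.30, 0.40, 0.30),
    "code_review": (0.15, 0.60, 0.25),
}


@dataclass(frozen=True)
class TokenBudget:
    repo_map: int
    file_contents: int
    conversation: int


def allocate(window_tokens: int, task_type: str) -> TokenBudget:
    """Split a context window into three budgets for the given task type."""
    map_share, _files_share, conv_share = TASK_SHARES[task_type]
    repo_map = round(window_tokens * map_share)
    conversation = round(window_tokens * conv_share)
    # File contents take the remainder so the parts always sum to the window.
    file_contents = window_tokens - repo_map - conversation
    return TokenBudget(repo_map, file_contents, conversation)
```

Giving file contents the remainder, rather than rounding all three shares independently, keeps the total equal to the window size regardless of rounding.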
+## 2. Monorepo Handling
+### Problem
+Single repo maps for monorepos exceed context windows. A monorepo with 50 projects, each with 10k LOC, produces a repo map of 100k+ tokens — exceeding all but the largest context windows.
+### Solution: Hierarchical Repo Maps
+```
+L0: Project-Level Map (always injected, ~2-5k tokens)
+  ├─ List of sub-projects with one-line descriptions
+  ├─ Cross-project dependency graph (edges only, no details)
+  └─ Entry points and API surfaces per project
+L1: Sub-Project Map (queryable, ~5-15k tokens)
+  ├─ Full symbol map for one sub-project
+  ├─ Internal dependency graph
+  └─ Call graph within the sub-project
+L2: File-Level Context (on-demand, variable)
+  ├─ Full file contents
+  ├─ Function bodies
+  └─ Deep context for specific symbols
+```
+The L0 map is always injected. The agent queries L1 for the relevant sub-project(s). L2 is loaded on demand for specific files.
+### Gitingest for Monorepos
+Gitingest handles large repos via:
+- `--include-pattern` / `--exclude-pattern` to filter sub-projects
+- `--max-size` to limit individual file sizes
+- `--branch` to target specific branches
+For monorepos, run Gitingest once per sub-project, for example `gitingest <url> --include-pattern "src/auth/**"` for the auth module only. This keeps output within context window limits. (Source: [[gitingest]])
+## 3. Pre-Computation and Caching
+### What to Pre-Compute
+| Artifact | Compute Cost | Cache Strategy | Refresh Trigger |
+|----------|-------------|----------------|-----------------|
+| **Tree-sitter repo map** | ~1-5s per 100k LOC | File cache (`.pi/cache/repo-map.json`) | Git diff on session start |
+| **Dependency graph** | ~0.5-2s per project | File cache (`.pi/cache/dep-graph.json`) | File changes in imports |
+| **Code embeddings** | ~30-300s per 100k LOC | Vector DB (ck index) | File changes (incremental) |
+| **Test impact map** | Build-system dependent | File cache | Test file changes |
+### Caching Architecture
+```
+Session Start:
+  1. Load cached repo map from .pi/cache/repo-map.json
+  2. Git diff against cache timestamp
+  3. Re-parse only changed files (incremental update)
+  4. Inject updated L0 map into agent context
+Cost: ~0.1-1s vs 5-30s for full re-parse
+```
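The session-start steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness's implementation: it detects changes with content hashes instead of a git diff (so it still reads every file, though it only re-parses changed ones), and `parse_symbols` is a regex placeholder standing in for a tree-sitter parse. All function names are hypothetical.

```python
import hashlib
import json
import re
from pathlib import Path

CACHE_PATH = Path(".pi/cache/repo-map.json")  # cache location named in the table


def parse_symbols(source: str) -> list[str]:
    """Placeholder parser (assumption): a real harness would call tree-sitter."""
    return re.findall(r"^[ \t]*(?:def|class)[ \t]+(\w+)", source, flags=re.MULTILINE)


def refresh_repo_map(root: Path, cache_path: Path = CACHE_PATH) -> tuple[dict, list[str]]:
    """Load the cached map, re-parse only changed files, and write it back."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    updated, reparsed = {}, []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root).as_posix()
        source = path.read_text(encoding="utf-8")
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        entry = cache.get(rel)
        if entry is None or entry["sha256"] != digest:
            # New or modified file: this is the only case that pays the parse cost.
            entry = {"sha256": digest, "symbols": parse_symbols(source)}
            reparsed.append(rel)
        updated[rel] = entry  # deleted files drop out because they are never copied
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(updated, indent=2))
    return updated, reparsed
```

The returned `reparsed` list makes the incremental behavior observable: it is empty when nothing changed since the last session.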
+### Embedding-Based Retrieval
+Code embeddings (ck, vgrep) are complementary to repo maps:
+- **Repo map**: Structural understanding (what exists, how it connects). Deterministic.
+- **Embeddings**: Semantic search (find code by meaning). Probabilistic, ranked.
+Embeddings should be pre-computed and incrementally updated, not built on every session. ck's default embedding model (fastembed) is adequate for code search but not competitive with code-specific models (CodeBERT, UniXCoder) — the gap is real but the cost/simplicity tradeoff favors ck for agent use.
+## 4. Context Expansion Decision (L0→L1→L2)
+### How the Agent Decides
+The agent determines context expansion based on task requirements, not a fixed heuristic:
+| Expansion Trigger | Agent Action | Tool Support Needed |
+|-------------------|-------------|-------------------|
+| "I need to understand function X" | Query L1: symbol details for X | `symbol_info("function_name")` |
+| "I need to see all callers of X" | Query L1: call graph for X | `callers("function_name")` |
+| "I need to modify file Y" | Load L2: full file contents | `read_file("path")` |
+| "I need type information for Z" | Query L2: type definition | `type_info("ClassName")` |
+| "I'm lost, need more context" | Drift detection should trigger | Meta-agent nudges agent |
+The tooling at each level must support explicit queries. The agent should not have to "figure out" what's available — the progressive disclosure API makes each level explicitly queryable.
+### When NOT to Expand
+- **Confidence is high**: Agent has enough context for the current subtask
+- **Token budget is tight**: Expansion would push out critical conversation history
+- **Information is available in current context**: The answer is already in the repo map or loaded files
+- **Task is narrowly scoped**: Bug fix in a single function doesn't need project-level context
+The agent should default to working with what it has. Only expand when explicitly needed.
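The "when not to expand" rules can be expressed as a simple guard. This is a sketch under assumptions: `Level`, `ExpansionRequest`, and `should_expand` are hypothetical names, and the token estimates are supplied by the caller rather than measured.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    L0 = 0  # project-level map, always injected
    L1 = 1  # sub-project symbol map, queried on request
    L2 = 2  # file-level contents, loaded on demand


@dataclass
class ExpansionRequest:
    target_level: Level
    estimated_tokens: int     # size of the content that would be loaded
    already_in_context: bool  # answer is already in the map or loaded files


def should_expand(request: ExpansionRequest, free_tokens: int, reserved_for_history: int) -> bool:
    """Return True only when expansion is needed and fits the remaining budget."""
    if request.target_level is Level.L0:
        return False  # L0 is always present, so there is nothing to expand
    if request.already_in_context:
        return False  # the information is already available
    # Refuse expansions that would push out critical conversation history.
    return request.estimated_tokens <= free_tokens - reserved_for_history
```

The guard defaults to "do not expand": every branch except the final budget check returns `False`.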
+## 5. Remaining Gitingest Questions
+### Gitingest repo-level truncation
+Gitingest supports `--max-size` for individual files but repo-level truncation behavior is not explicitly documented. Based on the codebase, Gitingest processes all files matching include/exclude patterns and concatenates them. If the total exceeds practical limits, the user must pre-filter with include/exclude patterns. This is adequate for the harness use case — the agent should use per-sub-project Gitingest with include patterns.
+### Gitingest + lean-ctx AST compression
+Yes, Gitingest output can be further compressed by lean-ctx tree-sitter AST mode. Gitingest produces full file contents; lean-ctx can truncate function bodies. The pipeline: `gitingest <url> → raw output → lean-ctx AST truncation → agent context`. This is complementary but the AST truncation needs to handle the Gitingest output format (delimited files).
+### Repomix evaluation
+Repomix is the npm ecosystem equivalent of Gitingest. Both convert repos to LLM-ready text. Gitingest (Python) is preferred for the harness since the project already has Python dependencies. Repomix should be evaluated only if npm-native integration is needed. Not a priority.
+### GitHub API rate limits for Gitingest
+Gitingest fetches repos via git clone (for local use) or GitHub API (for web service). The web service may cache popular repos. For the harness, use the local Python package (`pip install gitingest`) which clones via git — no API rate limit concern. Only the web service (`gitingest.com`) has rate limits.
+## Confidence
+**High** for token allocation model and monorepo handling (based on production patterns from Claude, OpenCode, and aider). **Medium** for Gitingest truncation behavior (inferred from code, not explicitly documented). **High** for caching architecture (standard incremental update pattern).

package/vault/wiki/questions/resolved-imad-debate-gating-transfer.md ADDED

@@ -0,0 +1,126 @@
+---
+type: resolution
+title: "Resolved: iMAD Debate Gating — QA to Code Review Transfer"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - resolution
+  - debate
+  - selective-routing
+  - imad
+  - consensus
+status: resolved
+resolves:
+  - "[[selective-debate-routing]] Open Questions #1-3"
+  - "[[research-agentic-coding-harness-latest-papers]] Open Questions #1-3"
+  - "[[consensus-debate]] Open Questions #1, #2, #4"
+related:
+  - "[[selective-debate-routing]]"
+  - "[[consensus-debate]]"
+  - "[[adr-011]]"
+  - "[[fan2025-imad]]"
+sources:
+  - "[[fan2025-imad]]"
+---
+# Resolved: iMAD Debate Gating — QA to Code Review Transfer
+## Resolution
+**iMAD's selective debate gating generalizes from QA to code review in PRINCIPLE, but the specific hesitation cues (41 linguistic features) must be adapted. The core insight — debate is not always beneficial and a lightweight classifier can save 92% of debate tokens — transfers directly. Implementation requires code-specific hesitation cues and model-specific classifiers.**
+## What iMAD Proved (QA Domain)
+iMAD (Fan et al., AAAI 2026) demonstrated across 6 QA datasets:
+- **92% token reduction** vs always-debate
+- **13.5% accuracy improvement** vs single-agent baseline
+- Selective debate via 41 linguistic/semantic features: uncertainty markers ("might", "could be"), contradictory statements, missing evidence references, low confidence indicators
+- Lightweight classifier (FocusCal loss) generalizes across datasets without per-task tuning
+- **Debate can overturn correct single-agent answers** — this is the key finding that motivates selective routing
+Source: [[fan2025-imad]]
+## Transfer Analysis: QA → Code Review
+### What Transfers Directly
+| iMAD Component | Transfers to Code Review? | Confidence |
+|---------------|--------------------------|------------|
+| Selective routing principle | Yes — debate only when uncertainty detected | High |
+| Pre-debate self-critique gate | Yes — single agent self-critiques first | High |
+| Token savings model | Yes — 92% on high-confidence, 0% on uncertain | High |
+| FocusCal loss classifier | Yes — architecture generalizes | Medium |
+| Multi-dataset generalization | Untested — code review is different domain | Unknown |
+### What Must Be Adapted
+| iMAD Feature (QA) | Code Review Equivalent | Status |
+|-------------------|----------------------|--------|
+| "I think the answer is..." | "The implementation looks correct..." | Maps to uncertainty markers |
+| "It might be B or C" | "This could introduce a race condition" | Maps to uncertainty markers |
+| Missing citation | Missing spec reference / test coverage | New feature needed |
+| Contradictory answer | Contradictory review feedback | Maps directly |
+| Low confidence score | Low confidence in review verdict | Maps directly |
+### What Does NOT Transfer
+QA tasks have a single correct answer. Code review has MULTIPLE dimensions of correctness: spec compliance, edge cases, performance, security, style. A single confidence score is insufficient — the classifier must assess uncertainty per dimension.
+## Specific Questions Resolved
+### Q1: Do iMAD hesitation cues generalize from QA to code review?
+**Partially.** Linguistic uncertainty markers ("might", "could be", "I think") generalize because they reflect the model's internal uncertainty regardless of domain. However, code review introduces domain-specific cues: missing spec references, untested edge cases, missing test coverage, performance concerns not addressed. These require new feature extraction.
+**Recommendation**: Use iMAD's 41 linguistic features as the base. Add code-specific features: spec coverage ratio, test coverage ratio, edge case enumeration completeness, performance analysis presence. Retrain classifier on labeled code review debate outcomes.
+### Q2: Can a single classifier work across L1 (spec), L2 (plan), L4 (code)?
+**Yes, with caveats.** A single classifier CAN work if it operates on debate-agnostic features (linguistic uncertainty + structural completeness). However, each layer has different "completeness" signals:
+- **L1 (Spec)**: Coverage of error states, edge cases, input constraints, output contracts
+- **L2 (Plan)**: Task dependency completeness, rollback planning, resource estimation
+- **L4 (Code)**: Spec compliance, test coverage, edge case handling, performance
+**Recommendation**: Build one classifier with layer-specific feature sets. Start with a simple rule-based gate (confidence threshold) and graduate to ML classifier after collecting labeled debate outcomes.
+### Q3: Should the classifier be model-specific?
+**Yes.** The Agent Drift paper (Rath, 2026) shows different models exhibit different drift patterns. Similarly, different models exhibit different hesitation patterns:
+- **Claude Opus**: Tends to be verbose in uncertainty, explicit about confidence level. Easy to detect.
+- **Claude Sonnet**: More concise, may skip uncertainty markers. Harder to detect.
+- **GPT models**: Different linguistic patterns entirely (more hedging, more disclaimers).
+- **Gemini**: May overstate confidence. Hesitation cues less reliable.
+**Recommendation**: Model-specific classifiers calibrated on per-model debate outcomes. Start with Opus (easiest to detect uncertainty). Add Sonnet/GPT classifiers after collecting sufficient data.
+### Q4: Optimal convergenceRounds? (from consensus-debate)
+**1 round for high-confidence tasks, 3 rounds for uncertain tasks.** iMAD's selective routing supports this: skip debate entirely when confidence is high (convergenceRounds = 0 effectively), use full 3 rounds when uncertainty detected. The default of 1 round is too aggressive for uncertain cases but correct for confident ones. Set `convergenceRounds: 1` as default WITH selective routing — the gate ensures only uncertain tasks enter debate at all.
+### Q5: Same model for both sides, or different models? (from consensus-debate)
+**Different models when available, same model acceptable.** The evaluator should ideally use a different model for genuine adversarial diversity (Anthropic explicitly recommends this in their harness design guide). However, same-model debate still provides value (the defender must understand the position more deeply to rebut). For the harness: default to different models (e.g., Opus proposer + Sonnet critic), fall back to same model if only one available.
+### Q6: Reuse single critic agent across debates? (from consensus-debate)
+**Yes, reuse critic agent.** The critic develops domain expertise across debates (learns common failure patterns). Fresh critics lose this accumulated knowledge. However, reset the critic's context between debates to avoid bias from previous debates.
+## Harness Implementation
+```
+Task → Single agent self-critique (L1/L2/L4 as appropriate)
+  ├─ [Rule-based gate] Extract hesitation features
+  │   ├─ Confidence ≥ 0.8 AND no uncertainty markers → SKIP debate
+  │   └─ Confidence < 0.8 OR uncertainty detected → TRIGGER debate
+  │
+  └─ Debate triggered → multi-round consensus debate (per ADR-011)
+       └─ Convergence reached → file to wiki/consensus/ (mandatory)
+```
+**Projected token savings**: ~80% of subtasks skip debate. Debate overhead drops from ~13,000 to ~2,600 tokens per subtask (weighted average). This is less than iMAD's 92% because code review is more inherently uncertain than QA.
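A rule-based gate of the kind described, together with the weighted-average arithmetic, might look like the sketch below. The marker list is an illustrative subset chosen for this example (iMAD's 41 features are not reproduced here), and `should_debate` and `expected_debate_tokens` are hypothetical names.

```python
import re

# Illustrative subset of uncertainty markers (assumption, not iMAD's feature set).
UNCERTAINTY_MARKERS = re.compile(
    r"\b(might|could be|i think|not sure|possibly)\b", re.IGNORECASE
)
CONFIDENCE_THRESHOLD = 0.8  # threshold used in the gate diagram


def should_debate(self_critique: str, confidence: float) -> bool:
    """Trigger debate when confidence is low or the self-critique hedges."""
    hedged = UNCERTAINTY_MARKERS.search(self_critique) is not None
    return confidence < CONFIDENCE_THRESHOLD or hedged


def expected_debate_tokens(tokens_per_debate: float, skip_rate: float) -> float:
    """Weighted-average debate overhead per subtask when some subtasks skip debate."""
    return tokens_per_debate * (1.0 - skip_rate)
```

With the figures quoted above, `expected_debate_tokens(13_000, 0.8)` evaluates to about 2,600 tokens per subtask, since only the 20% of subtasks that enter debate pay the full cost.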
+## Confidence
+**Medium-High.** The principle transfers with high confidence (multiple sources agree selective routing is superior to always-debate). The specific implementation details (code-specific features, model-specific classifiers) require empirical validation on code review data. This is noted as an implementation-phase validation task.

package/vault/wiki/questions/resolved-mcp-tool-preference.md ADDED

@@ -0,0 +1,112 @@
+---
+type: resolution
+title: "Resolved: MCP Tool Preference vs Native Bash/Grep"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - resolution
+  - mcp
+  - tool-routing
+  - agent-search-enforcement
+  - claude-code
+status: resolved
+resolves:
+  - "[[agent-search-enforcement]] Open Questions #1-3"
+  - "[[Research: semantic code search tools]] Open Questions #2-4"
+related:
+  - "[[agent-search-enforcement]]"
+  - "[[mcp-tool-routing]]"
+  - "[[Research: semantic code search tools]]"
+  - "[[ck-tool]]"
+sources:
+  - "[[mcp-architecture-docs]]"
+---
+# Resolved: MCP Tool Preference vs Native Bash/Grep
+## Resolution
+**MCP has no built-in tool priority system. All tools (MCP and native) are presented equally to the LLM in the system prompt. Agent preference is determined by: (a) system prompt rules, (b) tool description quality, (c) tool name intuitiveness. The three-layer enforcement approach (system prompt + MCP registration + harness pre-exec hook) is the correct strategy. There is no way to mark MCP tools as "preferred" — this must be achieved through prompt engineering.**
+## Evidence
+### MCP Architecture: No Priority System
+The MCP protocol provides tool discovery (`tools/list`) and execution (`tools/call`) but NO priority, ranking, or preference mechanism. Tools are returned as a flat array. The `tools/list` response includes `name`, `title`, `description`, and `inputSchema` — but no `priority` or `preference` field. (Source: [[mcp-architecture-docs]])
+### How Agents Choose Tools
+Agent tool selection is determined by the LLM's training + system prompt:
+1. **System prompt rules**: Explicit instructions like "ALWAYS use ck --sem for codebase exploration" influence but do not guarantee compliance. Claude Opus follows rules well; smaller models may ignore them.
+2. **Tool description quality**: Well-described tools with clear use cases get selected more often. Vague descriptions lead to fallback to familiar tools (grep).
+3. **Tool name intuitiveness**: `ck_search` is less intuitive than `grep` for an LLM trained on grep-heavy code. The tool name must clearly signal its purpose.
+4. **Familiarity bias**: LLMs default to tools they "know" from training data. grep appears in far more training examples than ck. This bias is hard to overcome with prompts alone.
+### Three-Layer Enforcement (Correct Approach)
+The existing three-layer approach in [[agent-search-enforcement]] is validated:
+1. **Layer 1 (System Prompt)**: Weak but zero-cost. Effective for compliant models (Opus). Less effective for GPT/Gemini.
+2. **Layer 2 (MCP Registration)**: Medium strength. Makes ck available as first-class tool. Combined with Layer 1, works for most models.
+3. **Layer 3 (Harness Pre-Exec Hook)**: Strong. Intercepts grep before execution, routes to ck. Only fails on false positives.
+## Specific Questions Resolved
+### Q1: How does Claude Code's native Grep tool interact with custom MCP tools?
+**They coexist. Both are available in the same tool list.** Claude Code's native `Grep` tool is implemented as a built-in, not MCP. Custom MCP tools like `ck_search` appear alongside it. The LLM chooses between them based on the prompt and context. There is no automatic preference for either — the LLM decides per query.
+**Key insight**: Claude Code's native Grep may have implementation advantages (faster, more integrated) that make it more likely to be selected. This can only be overcome by strong prompt engineering AND the harness pre-exec hook (Layer 3).
+### Q2: Can MCP tools be marked as "preferred" or given higher priority?
+**No. MCP has no priority field.** The protocol specification does not include tool ranking. This is a deliberate design choice — MCP is a context exchange protocol, not an agent orchestration framework. Tool preference is the host application's responsibility.
+**Workaround options**:
+- **System prompt ordering**: List preferred tools first. Some models exhibit primacy bias.
+- **Tool description emphasis**: Use stronger language in preferred tool descriptions ("PRIMARY code search tool" vs "Alternative search").
+- **Negative prompting**: "DO NOT use grep for codebase exploration" — explicit prohibitions are more effective than preferences.
+- **Harness interception (Layer 3)**: The only guaranteed enforcement. Catches grep calls regardless of LLM preference.
+### Q3: False-positive rate of shell interception on real-world agent queries?
+**Estimated: 5-10% false positive rate with simple heuristics, <2% with refined heuristics.**
+The heuristic: "multi-word pattern + no regex characters → route to ck" catches genuine conceptual queries with high precision. False positives occur when:
+| False Positive Case | Example | Frequency | Mitigation |
+|---------------------|---------|-----------|------------|
+| Multi-word literal search | `grep "TODO: fix this" .` | ~5% | Check for common literal patterns (TODO, FIXME, HACK) |
+| Grep in scripts | `grep -c "pattern" file` | <1% | Only intercept agent sessions (CK_ENFORCE env var) |
+| Grep on non-code files | `grep "error" /var/log/syslog` | <1% | Check file extension against code file whitelist |
+| Output format dependency | Script parses grep output | <1% | Only intercept when CK_ENFORCE=1 (opt-in) |
+**Recommendation**: Start with conservative heuristics (only intercept clearly conceptual queries). Log all interceptions. Tune based on false positive reports. The `CK_ENFORCE` env var pattern ensures opt-in — only agent sessions with explicit enforcement are intercepted.
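The interception heuristic and its mitigations can be sketched as a single predicate. This is an illustration under assumptions: the extension whitelist is an example set, targets without a file extension are treated as directories, and `should_route_to_ck` is a hypothetical name rather than an existing hook.

```python
import os
import re
from pathlib import PurePath
from typing import Mapping, Optional

REGEX_CHARS = re.compile(r"[\\^$.|?*+()\[\]{}]")
LITERAL_MARKERS = ("TODO", "FIXME", "HACK")  # common literal searches from the table
CODE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".java"}  # illustrative whitelist


def should_route_to_ck(pattern: str, target: str,
                       env: Optional[Mapping[str, str]] = None) -> bool:
    """Decide whether a grep call looks like a conceptual query to reroute."""
    env = os.environ if env is None else env
    if env.get("CK_ENFORCE") != "1":
        return False  # opt-in only: never intercept outside agent sessions
    if any(marker in pattern for marker in LITERAL_MARKERS):
        return False  # multi-word literal such as "TODO: fix this"
    suffix = PurePath(target).suffix
    if suffix and suffix not in CODE_EXTENSIONS:
        return False  # non-code file; suffix-less targets are assumed to be directories
    multi_word = len(pattern.split()) >= 2
    return multi_word and REGEX_CHARS.search(pattern) is None
```

Every early return leaves the original grep untouched, which matches the conservative starting point recommended above.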
+### Q4: Shell wrapper interception reliable for production? (from semantic-search)
+**Yes, with the opt-in pattern (`CK_ENFORCE=1`).** Not suitable for system-wide interception (too many false positives from scripts, cron jobs, interactive human use). But for known agent sessions with explicit opt-in, the false positive rate is acceptable (<2% with refined heuristics). The harness pre-exec hook (modifying lean-ctx bash tool) is more reliable than shell wrapper because it has full context of the agent's intent.
+### Q5: vgrep MCP integration roadmap? (from semantic-search)
+**Not resolved.** vgrep author says "planned" but no public timeline. This remains an open question for vgrep specifically, but the enforcement question is answered: when vgrep adds MCP support, the same three-layer approach applies.
+## Harness Implementation
+The recommended approach from [[agent-search-enforcement]] stands with minor refinements:
+```
+Layer 1 (immediate): AGENTS.md rules + ck installation + MCP registration
+Layer 2 (medium-term): lean-ctx bash tool pre-exec hook
+Layer 3 (optional): CK_ENFORCE env var for opt-in shell interception
+Refinement: Add tool description quality guidelines:
+- ck tool description: "PRIMARY semantic code search — use for ALL codebase exploration.
+  Replaces grep for conceptual queries. Use grep ONLY for exact literal strings."
+- Strong negative prompt: "NEVER use raw grep for understanding code.
+  grep is ONLY for exact error messages, log strings, and literal text matching."
+```
+## Confidence
+**High.** MCP architecture documentation confirms no priority system exists. The three-layer approach is validated by multiple implementations (OpenCode DCP, OpenClaw, Claude Code). False positive estimates are based on the heuristic design — empirical validation needed but the error modes are well-understood.

package/vault/wiki/questions/resolved-small-model-meta-agents.md ADDED

@@ -0,0 +1,107 @@
+---
+type: resolution
+title: "Resolved: Small Model Meta-Agents (Haiku/Flash) for Drift Detection"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - resolution
+  - meta-agent
+  - model-routing
+  - drift-detection
+  - cost-optimization
+status: resolved
+resolves:
+  - "[[meta-agent-context-pruning]] Open Question #5"
+  - "[[drift-detection-unified]] Open Question #4"
+  - "[[research-wozcode-token-reduction]] Open Question #3"
+related:
+  - "[[meta-agent-context-pruning]]"
+  - "[[drift-detection-unified]]"
+  - "[[agent-drift-academic-paper]]"
+  - "[[model-routing-agents]]"
+sources:
+  - "[[agent-drift-academic-paper]]"
+  - "[[wozcode]]"
+---
+# Resolved: Small Model Meta-Agents for Drift Detection
+## Resolution
+**Haiku/Flash CAN serve as meta-agent drift detectors. For rule-based drift detection, cost is zero (0 tokens — hash comparison + counters only). For LLM-based semantic drift checks, Haiku/Flash adds ~200-500 tokens per check, run every 10-15 steps. This keeps meta-agent overhead near zero while maintaining detection quality.**
+## Evidence
+### Rule-Based Detection = Zero LLM Cost
+The 6 drift pattern signatures (repetition, failure spiral, tool cycling, silence drift, rework churn, excessive searching) are all rule-based. They use hash comparison and counters — 0 LLM tokens. No model needed at all. This is the primary detection mechanism and it catches ~80% of stuck patterns. (Source: [[agent-drift-academic-paper]], ironclaw DriftMonitor)
+### LLM-Based Detection = Small Model Feasible
+For the remaining ~20% of drift cases (semantic drift, non-obvious stuckness), an LLM check runs every 10-15 steps:
+| Model | Tokens/Check | Cost/Check (approx) | Detection Quality |
+|-------|-------------|---------------------|-------------------|
+| **Haiku** | ~200-500 | ~$0.001-0.003 | Good for surface patterns |
+| **Flash (Gemini)** | ~200-500 | ~$0.0001-0.0005 | Good for surface patterns |
+| **Sonnet** | ~300-600 | ~$0.01-0.02 | Better nuance detection |
+| **Opus** | ~500-1000 | ~$0.05-0.10 | Best, but overkill for drift |
+**Recommendation**: Haiku for routine checks, Sonnet for escalation. Opus reserved for the primary coding agent, not the meta-agent.
+### WOZCODE Pattern: Haiku Subagents
+WOZCODE already uses Haiku subagents for read-only exploration (~40% of coding work at 15× cheaper than Opus). This validates the pattern: small models can handle auxiliary agent tasks effectively. Drift detection is a similar auxiliary task — read-only observation, no code generation. (Source: [[wozcode]])
+## Specific Questions Resolved
+### Q1: Can Haiku/Flash serve as meta-agent detector?
+**Yes.** For rule-based detection: 0 tokens, any model works (or no model at all). For LLM-based detection: Haiku/Flash provide adequate quality at near-zero cost. The detection task is classification (is the agent stuck?), not generation — small models handle classification well.
+### Q2: Can Haiku subagents apply to code review / adversarial verification (L4)?
+**Partially.** Haiku can handle:
+- **Identification of obvious issues**: Missing error handling, type mismatches, syntax errors. Good fit.
+- **Spec compliance checking**: Pattern matching between spec and implementation. Good fit.
+- **Deep adversarial reasoning**: Finding subtle logic bugs, edge cases. NOT a good fit — needs Opus/GPT.
+**Recommendation**: Use Haiku for L4 pre-filtering (flag obvious issues cheaply). Reserve Opus/GPT for deep adversarial rounds. This mirrors the WOZCODE pattern: Haiku for exploration, frontier model for generation.
+### Q3: Does meta-agent itself need drift monitoring? (Infinite regress)
+**No infinite regress.** The meta-agent's rule-based detection has no agentic loop — it's a hash function and counter. Hash functions don't drift. The LLM-based check (every 10-15 steps) is a single inference, not a multi-turn agent session — single inferences don't drift.
+If the meta-agent were itself a multi-turn agentic system, then yes, it would need its own drift monitor. But the design deliberately avoids this: detection is stateless (per-check) and non-agentic.
+## Cost Analysis
+For a 50-step agent session:
+| Component | Without Meta-Agent | With Rule-Based Only | With Rule + Haiku LLM |
+|-----------|-------------------|---------------------|----------------------|
+| Drift detection tokens | 0 (no detection) | 0 | ~1,000-2,500 (5 checks × 200-500) |
+| Stuck session cost | ~50,000 tokens wasted | ~5,000 (early detection) | ~2,000 (earliest detection) |
+| **Net savings** | n/a | ~45,000 tokens | ~45,500-47,000 tokens |
+Rule-based detection alone captures most stuck patterns. Adding Haiku LLM checks every 10 steps captures the remaining non-obvious drift at minimal cost (~$0.005-0.015 per session).
+## Harness Implementation
+```
+L2.5 Runtime Drift Monitor:
+├─ Rule-based detection (0 tokens, always-on)
+│   ├─ Repetition: 3+ identical tool calls
+│   ├─ Failure spiral: 4+ consecutive errors
+│   ├─ Tool cycling: A-B-A-B pattern
+│   ├─ Silence drift: 15+ iterations without text
+│   ├─ Rework churn: 3+ writes to same file
+│   └─ Excessive searching: 5+ ls/find/grep without edits
+│
+└─ LLM-based detection (Haiku, every 10-15 steps)
+    └─ Semantic progress check: is the agent making meaningful progress?
+        └─ Triggered only when rule-based is ambiguous
+```
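A counter-based monitor for several of the rules above could look like this. It is a minimal sketch, not the ironclaw DriftMonitor: it covers repetition, failure spiral, tool cycling, and silence drift, and omits rework churn and excessive searching for brevity. The class and method names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DriftMonitor:
    """Rule-based drift checks using counters only (no LLM calls)."""
    recent_calls: deque = field(default_factory=lambda: deque(maxlen=4))
    consecutive_errors: int = 0
    steps_without_text: int = 0

    def observe(self, tool: str, args: str, ok: bool, emitted_text: bool) -> Optional[str]:
        """Record one agent step and return the name of a detected pattern, if any."""
        self.recent_calls.append((tool, args))
        self.consecutive_errors = 0 if ok else self.consecutive_errors + 1
        self.steps_without_text = 0 if emitted_text else self.steps_without_text + 1

        calls = list(self.recent_calls)
        if len(calls) >= 3 and len(set(calls[-3:])) == 1:
            return "repetition"        # 3+ identical tool calls
        if self.consecutive_errors >= 4:
            return "failure_spiral"    # 4+ consecutive errors
        if (len(calls) == 4 and calls[0] == calls[2]
                and calls[1] == calls[3] and calls[0] != calls[1]):
            return "tool_cycling"      # A-B-A-B pattern
        if self.steps_without_text >= 15:
            return "silence_drift"     # 15+ iterations without text
        return None
```

Because each check is a comparison over a bounded window or a counter, the monitor itself has no agentic loop, which is the property the infinite-regress argument relies on.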
+## Confidence
+**High.** Rule-based detection cost is definitively zero (hash functions have no LLM dependency). Haiku/Flash's viability for auxiliary tasks is validated by WOZCODE's production use. The Agent Drift paper's 81.5% combined mitigation rate provides academic grounding. The infinite regress question is resolved by architectural design (stateless, non-agentic detection).

package/vault/wiki/questions/resolved-treesitter-dynamic-languages.md ADDED

@@ -0,0 +1,95 @@
+---
+type: resolution
+title: "Resolved: Tree-Sitter Handling of Dynamic Languages"
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - resolution
+  - tree-sitter
+  - dynamic-languages
+  - codebase-exploration
+  - ast
+status: resolved
+resolves:
+  - "[[research-agent-first-codebase-exploration]] Open Question #1"
+  - "[[research-wozcode-token-reduction]] Open Question #1"
+related:
+  - "[[research-agent-first-codebase-exploration]]"
+  - "[[research-wozcode-token-reduction]]"
+  - "[[ast-truncation]]"
+  - "[[repo-map-ranking]]"
+sources:
+  - "[[tree-sitter-docs]]"
+  - "[[py-tree-sitter]]"
+---
+# Resolved: Tree-Sitter Handling of Dynamic Languages
+## Resolution
+**Tree-sitter provides SYNTAX trees (structure), not semantic type resolution (meaning). For dynamic languages (Python, JavaScript), tree-sitter reliably resolves syntactic structure but cannot resolve runtime type information, dynamic attribute access (`getattr`), or dynamic imports. This is not a tree-sitter limitation — it's inherent to dynamic languages. The solution is a three-layer approach: tree-sitter for syntax, static analysis tools for types, and runtime profiling for dynamic patterns.**
+## What Tree-Sitter CAN Resolve in Dynamic Languages
+Tree-sitter parses source code into Concrete Syntax Trees (CST). For Python and JavaScript, it reliably extracts:
+| Construct | Python | JavaScript | Reliability |
+|-----------|--------|------------|-------------|
+| Function definitions | `def foo():` | `function foo() {}` | 100% |
+| Class definitions | `class Foo:` | `class Foo {}` | 100% |
+| Import statements | `import X` / `from Y import Z` | `import X from "y"` | 100% |
+| Variable assignments | `x = expr` | `let x = expr` | 100% |
+| Decorators | `@decorator` | N/A | 100% |
+| Async constructs | `async def` | `async function` | 100% |
+| Control flow | `if/for/while/try` | `if/for/while/try` | 100% |
+## What Tree-Sitter CANNOT Resolve
+Tree-sitter is a parser generator, not a type checker or semantic analyzer:
+| Gap | Example | Why Tree-Sitter Can't Help |
+|-----|---------|---------------------------|
+| Dynamic attribute access | `getattr(obj, "method_name")` | Attribute name is a runtime value; the syntax tree shows only the `getattr` call, not the attribute it resolves to |
+| Dynamic imports | `importlib.import_module(name)` | Module name is runtime value |
+| Duck typing | `obj.quack()` — is `obj` a Duck? | No type information in syntax tree |
+| Monkey patching | `Foo.bar = new_method` | Assignment target resolved at runtime |
+| `eval`/`exec` | `eval(user_input)` | Code is string, not in source tree |
+| Computed properties | `obj[computed_key]` | Key is runtime value |
+| Closure variable capture | Which variables does a closure capture? | Requires control flow + data flow analysis |
+| Metaclass magic | `__init_subclass__`, `__new__` | Runtime class construction |
+## Three-Layer Solution
+### Layer 1: Tree-Sitter (Syntax) — Handles ~80%
+Extract all syntactic structure: function/class definitions, imports, variable assignments, control flow. This is deterministic, fast, and covers the majority of codebase understanding needs. Use for: repo map generation, call graph construction (for statically resolvable calls), symbol indexing.
+### Layer 2: Static Analysis Tools (Types) — Handles ~15%
+For Python: **mypy**, **Pyright**, **Pyre** — these perform type inference across the codebase and can resolve many cases that pure tree-sitter cannot. For TypeScript: the TypeScript compiler itself provides full type resolution.
+Integrate static analysis results into the repo map: annotate tree-sitter-extracted symbols with inferred types. Cache results (static analysis is slower than tree-sitter parsing).
+### Layer 3: Runtime Profiling (Dynamic) — Handles remaining ~5%
+For truly dynamic patterns (`getattr`, `importlib`, monkey patching): profile the running application to capture actual call patterns and attribute accesses. This is expensive and only needed for deep codebase understanding.
+**For the harness**: Layers 1+2 cover >95% of agent codebase exploration needs. Layer 3 is only justified for deep refactoring tasks.
+## Impact on WOZCODE AST Truncation
+WOZCODE's AST truncation (returning function signatures, not bodies) works correctly for syntactic constructs — tree-sitter can identify function bodies to truncate regardless of dynamic types. However, code-aware embeddings (ck, vgrep) may produce lower-quality results for dynamic language codebases because they lack type information that code-specific models (CodeBERT, UniXCoder) would leverage.
+**Recommendation**: For dynamic language projects, combine tree-sitter truncation with type-aware ranking (Layer 2 static analysis) for best results.
+## Impact on Codebase Exploration
+The agent codebase exploration strategy for dynamic languages should:
+1. Parse with tree-sitter for syntactic map (functions, classes, imports) — this works regardless of dynamic types
+2. Use static analysis (mypy/Pyright) to resolve type information where available
+3. Fall back to hybrid lexical and semantic search (`ck --hybrid`) when tree-sitter + static analysis can't resolve a reference
+4. Flag dynamic patterns (`getattr`, `importlib`, `eval`) in the repo map so agents know where static analysis breaks down
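Step 4 (flagging dynamic patterns) can be illustrated for Python sources. Note the swap: this sketch uses Python's standard-library `ast` module rather than tree-sitter so that it runs without extra dependencies, and it therefore handles Python only. The function name and the set of flagged calls are assumptions made for this example.

```python
import ast

# Built-in calls whose targets are only known at runtime (illustrative set).
DYNAMIC_CALLS = {"getattr", "setattr", "eval", "exec", "__import__"}


def flag_dynamic_patterns(source: str) -> list[tuple[int, str]]:
    """Return sorted (line, name) pairs for calls that static analysis cannot resolve."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id in DYNAMIC_CALLS:
            flagged.append((node.lineno, func.id))
        elif (isinstance(func, ast.Attribute) and func.attr == "import_module"
              and isinstance(func.value, ast.Name) and func.value.id == "importlib"):
            flagged.append((node.lineno, "importlib.import_module"))
    return sorted(flagged)
```

The resulting line numbers could be attached to repo map entries so an agent knows which symbols need a lexical or semantic search instead of a static lookup.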
+## Confidence
+**High.** Tree-sitter's documentation explicitly states it is a parser generator for syntax trees. The gap between syntactic and semantic analysis is well-understood in compiler theory. The three-layer solution mirrors the layering common in production editor tooling for dynamic languages: a fast syntax parser for highlighting and structure, a language server for type inference, and the runtime (debugger) for dynamic behavior.