npm - ultimate-pi - Versions diffs - 0.1.2 → 0.1.4 - Mend

ultimate-pi 0.1.2 → 0.1.4


Files changed (516)

package/vault/wiki/concepts/consensus-debate-flow.md ADDED

@@ -0,0 +1,17 @@
+---
+type: concept
+status: stub
+created: 2026-05-02
+updated: 2026-05-02
+tags: [concept, consensus, flow]
+---
+# Consensus Debate Flow
+The flow diagram / process for multi-agent consensus debates in the harness pipeline. Defines when debate is triggered (selective routing via iMAD), how rounds proceed, and how verdicts are filed.
+## References
+- [[consensus-debate]]
+- [[selective-debate-routing]]
+- [[adr-011]]

package/vault/wiki/concepts/consensus-debate.md ADDED

@@ -0,0 +1,206 @@
+---
+type: concept
+title: "Consensus Debate"
+created: 2026-04-30
+updated: 2026-04-30
+status: active
+tags: [harness, consensus, debate, multi-agent, dialectic, protocol]
+related:
+  - "[[adr-011]]"
+  - "[[agentic-harness]]"
+  - "[[adversarial-verification]]"
+  - "[[pi-messenger-analysis]]"
+---
+# Consensus Debate
+A structured multi-agent debate protocol for the harness pipeline. Replaces single-pass review with genuine back-and-forth argument — the kind that produces the best human software decisions.
+## First Principles
+### Why arguing works for humans
+The dialectical process (thesis → antithesis → synthesis) is the engine of reasoning:
+1. **Position**: Alice proposes X
+2. **Counter**: Bob identifies flaw in X
+3. **Rebuttal**: Alice refines X → X'
+4. **Counter**: Bob finds deeper flaw in X'
+5. **Rebuttal**: Alice refines X' → X''
+6. **Convergence**: Bob cannot find further flaws. Consensus.
+Each round forces deeper reasoning. The first counter is often shallow — it's the REBUTTAL that reveals the real insight, because defending a position requires understanding it more deeply than attacking it.
+### Why this applies even more to agents
+Agents lack intuition. They cannot "sense" something is wrong. They cannot have a "gut feeling" that a design is fragile.
+Multi-round argument is a **substitute for intuition**:
+- Round 1: Surface-level objections (syntax, naming, obvious gaps)
+- Round 2: Structural objections (dependency cycles, coupling, missing edge cases)
+- Round 3: Philosophical objections (wrong abstraction, incorrect model of the problem)
+Without rounds 2-3, agents miss everything below the surface.
+### Why single-pass review is insufficient
+L4 (Adversarial Verification) currently does ONE attack pass. The critic finds what it can in one shot, and that's it. But the critic's first pass is limited by its own blind spots — it can only attack what it sees immediately. A defender's rebuttal ("no, because X") often REVEALS a deeper flaw the critic didn't consider ("wait, if X is your assumption, then what about Y?"). This dynamic cannot happen in a single pass.
+## Protocol Design
+### DebateSession
+```
+DebateSession {
+  topic: string           // What is being debated
+  scope: LayerScope       // Which harness layer invoked this
+  participants: Agent[]   // 2+ agents with defined roles
+  budget: ConsensusBudget // Termination conditions
+  rounds: Round[]         // Accumulated argument rounds
+  verdict: Verdict | null // Final outcome
+}
+```
+### ConsensusBudget
+```
+ConsensusBudget {
+  maxRounds: number       // Hard cap (default: 3-4 depending on layer)
+  maxTokensPerRound: number // Per-agent token limit per round
+  maxWallClockMs: number  // Timeout (default: 120s)
+  convergenceRounds: number // Rounds without position change to declare convergence (default: 1)
+}
+```
+### Round Structure
+Each round has two phases:
+```
+Round {
+  number: number
+  attacker: Turn          // Critic's argument
+  defender: Turn          // Proposer's rebuttal
+}
+Turn {
+  agent: string           // Agent name
+  role: "attacker" | "defender"
+  position: string        // Succinct: what this agent asserts
+  counter_to: string      // Which specific claim is being countered
+  evidence_refs: string[] // References to spec, code, wiki pages
+  confidence_change: number // Did this turn shift confidence? (-1, 0, +1)
+}
+```
+### Verdict Semantics
+| Verdict | Meaning | Harness action |
+|---------|---------|---------------|
+| `CONSENSUS_REACHED` | Both sides agree on final position | File winning consensus to wiki as permanent alignment record, then proceed to next layer |
+| `DEADLOCK` | Positions unchanged after `convergenceRounds` rounds | File both positions + deadlock analysis to wiki. Escalate to human. |
+| `BUDGET_EXHAUSTED` | Max rounds or tokens hit without convergence | File last positions + exhaustion analysis to wiki. Use last agreed position if any; otherwise escalate |
+| `TIMEOUT` | Wall-clock time exceeded | File partial transcript to wiki. Escalate |
+### Convergence Detection
+Positions are hashed deterministically. If the defender's position hash stays identical for `convergenceRounds` consecutive rounds even though the attacker presented a new counter in each of them, the position has survived and the debate has converged.
+Alternatively: if the attacker explicitly signals "no further objections" by setting `confidence_change: 0` and an empty `counter_to`, that's consensus.
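The two convergence rules above can be sketched as follows. This is a minimal illustration, not the contents of `lib/harness-debate.ts`: the Python dataclasses mirror the `Round` and `Turn` shapes, and the hash canonicalisation (lowercasing and collapsing whitespace) and the reading of "identical for `convergenceRounds` consecutive rounds" as "unchanged relative to the previous round" are assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    agent: str
    role: str                  # "attacker" or "defender"
    position: str
    counter_to: str = ""       # empty means nothing is being countered
    confidence_change: int = 0


@dataclass
class Round:
    number: int
    attacker: Turn
    defender: Turn


def position_hash(position: str) -> str:
    """Deterministic hash; case and whitespace folding is an assumption."""
    canonical = " ".join(position.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def attacker_concedes(turn: Turn) -> bool:
    """Explicit 'no further objections' signal."""
    return turn.confidence_change == 0 and turn.counter_to == ""


def has_converged(rounds: List[Round], convergence_rounds: int = 1) -> bool:
    if not rounds:
        return False
    if attacker_concedes(rounds[-1].attacker):
        return True
    # Observing N unchanged rounds requires N + 1 rounds of history.
    needed = convergence_rounds + 1
    if len(rounds) < needed:
        return False
    recent = rounds[-needed:]
    hashes = {position_hash(r.defender.position) for r in recent}
    counters = [r.attacker.counter_to for r in recent]
    # Every counter in the window must be non-empty and distinct.
    new_counters = all(counters) and len(set(counters)) == len(counters)
    return len(hashes) == 1 and new_counters
```

Note that an unchanged defender position is also the trigger for `DEADLOCK` in the verdict table, so a real implementation would need an extra signal (for example, whether the attacker still rates the counter as unresolved) to tell the two outcomes apart.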
+## Integration Points
+### L1: Spec Debate
+**Purpose**: Argue about whether a spec is sufficiently hardened.
+**Participants**: Spec proposer (defender) + Spec critic (attacker)
+**Example debate**:
+| Round | Attacker (Critic) | Defender (Proposer) |
+|-------|-------------------|---------------------|
+| 1 | "You didn't specify error behavior for network timeout" | "Added: timeout → retry 3x with exponential backoff, then fail with TIMEOUT_ERROR" |
+| 2 | "What about partial writes during retry? Is the operation idempotent?" | "Spec now requires idempotency key. Each retry reuses the same key. Server deduplicates." |
+| 3 | "No further objections. Spec is complete." | — |
+**Verdict**: CONSENSUS_REACHED. Winning consensus filed to `wiki/consensus/spec-[slug].md`. Spec proceeds to L2.
+**Budget**: 3 rounds, ~2K tokens/round.
+### L2: Plan Debate
+**Purpose**: Argue about plan structure, dependencies, and feasibility.
+**Participants**: Planner (defender) + Plan critic (attacker)
+**Example debate**:
+| Round | Attacker | Defender |
+|-------|----------|----------|
+| 1 | "Task 3 depends on Task 1 and Task 2 — but Task 2 also reads the output of Task 3. Circular dependency." | "Task 2 only reads Task 3's OUTPUT SCHEMA, not its data. Dependency is on the interface, not the implementation. Restructured: Task 2 depends on the schema definition step, not Task 3." |
+| 2 | "Task 4 (DB migration) has no rollback plan. If migration fails after Task 5 (data transform) runs, we're in an inconsistent state." | "Added Task 4b: migration verification + rollback trigger. Task 4 and 4b form an atomic pair. Task 5 only runs after 4b passes." |
+| 3 | "No further objections. Plan is executable." | — |
+**Verdict**: CONSENSUS_REACHED. Winning consensus filed to `wiki/consensus/plan-[slug].md`. Plan proceeds to L3.
+**Budget**: 3 rounds, ~3K tokens/round.
+### L4: Multi-Round Adversarial Attack
+**Purpose**: Genuine debate about implementation correctness and spec compliance.
+**Participants**: Implementer (defender) + Critic (attacker)
+This replaces the current single-pass L4 critic. Instead of one attack, the critic gets multiple rounds to find increasingly subtle flaws.
+**Budget**: 4 rounds, ~2K tokens/round. Winning consensus filed to `wiki/consensus/verify-[slug].md`.
+## Token Budget Impact
+| Activity | Current | With Consensus | Delta |
+|----------|---------|---------------|-------|
+| L1 Spec Hardening | ~2,000 | ~6,000 (3 rounds) | +4,000 |
+| L2 Planning + review | ~5,000 | ~10,000 (3 rounds) | +5,000 |
+| L4 Adversarial | ~4,000 | ~8,000 (4 rounds) | +4,000 |
+| **Total added per subtask** | — | — | **+13,000** |
+New total per subtask: **~30,500-33,500 tokens** (up from ~17,500).
+**Is this worth it?** Catching a spec flaw at L1 saves ~17,500 tokens of L2-L8 work. Catching a plan flaw at L2 saves ~15,500 tokens of L3-L8 work. The debate cost pays for itself on the first flaw caught.
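The arithmetic behind these figures can be checked with a short sketch. The numbers are copied from the table and the paragraph above; the variable names are illustrative only.

```python
# Per-subtask token estimates from the table: (current, with consensus).
BUDGET = {
    "L1 Spec Hardening": (2_000, 6_000),
    "L2 Planning + review": (5_000, 10_000),
    "L4 Adversarial": (4_000, 8_000),
}
BASELINE_TOTAL = 17_500        # current per-subtask total stated above
SAVINGS_PER_L1_CATCH = 17_500  # L2-L8 work avoided when L1 catches a flaw

# Sum of the Delta column.
added = sum(after - before for before, after in BUDGET.values())

# Lower bound of the new per-subtask total.
new_total = BASELINE_TOTAL + added

# Net saving for a single subtask in which the L1 debate catches a flaw.
net_saving_on_catch = SAVINGS_PER_L1_CATCH - added

print(added, new_total, net_saving_on_catch)
```

The sum reproduces the +13,000 delta and the 30,500 lower bound of the new total. The net saving applies to a subtask where a flaw is actually caught; subtasks with no flaw still pay the full debate overhead.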
+## Transport Layer: pi-messenger File-Based Messaging
+See [[pi-messenger-analysis]] for full analysis. Summary:
+- **Adopted**: Agent registry, per-agent inboxes, `fs.watch` delivery, JSON message format, atomic file writes, stale cleanup
+- **Stripped**: Chat UI, status bar, activity feed, emoji, crew orchestration, swarm claims
+- **Added**: Consensus protocol layer (DebateSession, ConsensusBudget, convergence detection, verdict generation)
+## Files
+- `lib/harness-messenger.ts` — pi-messenger transport integration (registry, inbox, watcher)
+- `lib/harness-debate.ts` — Consensus protocol (DebateSession, ConsensusBudget, convergence, verdict)
+- `lib/harness-schemas.ts` — Extended with debate message schemas
+- `extensions/harness-debate.ts` — Extension hooks: debate → wiki transcript
+- `extensions/harness-spec.ts` — Updated: L1 spec debate integration
+- `extensions/harness-planner.ts` — Updated: L2 plan debate integration
+- `extensions/harness-critics.ts` — Updated: L4 multi-round debate integration
+## Wiki Filing Rule (Mandatory)
+**Winning consensus from any agent debate MUST be filed in the project wiki.** This is not optional. The purpose is permanent agent alignment: future agents query the wiki before making decisions and find the resolved consensus, preventing re-litigation of settled debates.
+- **CONSENSUS_REACHED** → File final position + key rounds + evidence references to `wiki/consensus/`
+- **DEADLOCK** → File both positions + deadlock analysis (what blocked convergence)
+- **BUDGET_EXHAUSTED / TIMEOUT** → File partial transcript + exhaustion analysis
+Filing is enforced by L7 schema orchestration: no layer transition after a debate until the wiki write is confirmed. [[adr-010]] already mandates write-after for every state transition — consensus verdicts are state transitions and fall under this contract.
+## Open Questions
+- Should debates use the same model for both sides, or different models for genuine adversarial diversity?
+- What is the right default `convergenceRounds`? (1 may be too aggressive, 2 may waste tokens)
+- Should L3 (Grounding Checkpoints) also have a debate mode? (Current thinking: no — L3 is about execution fidelity, not design decisions)
+- Can we reuse a single critic agent across multiple debates, or should each debate spawn fresh critics?
+- What is the optimal wiki page structure for consensus records? (candidate: `wiki/consensus/[topic-slug].md` with frontmatter linking to the debate layer, participants, and verdict)

package/vault/wiki/concepts/content-addressed-spec-identity.md ADDED

@@ -0,0 +1,166 @@
+---
+type: concept
+title: "Content-Addressed Spec Identity"
+status: developing
+created: 2026-04-30
+updated: 2026-04-30
+tags:
+  - harness
+  - spec-storage
+  - content-addressing
+  - fingerprinting
+  - fork-safety
+  - reconciliation
+related:
+  - "[[Research: GitHub Issues as Harness Spec Storage]]"
+  - "[[fork-safe-spec-storage]]"
+  - "[[spec-hardening]]"
+sources:
+  - "[[Research: GitHub Issues as Harness Spec Storage]]"
+---
+# Content-Addressed Spec Identity
+How ultimate-pi harnesses resolve specs by content fingerprint, not issue number. Solves the fork-merge divergence problem: fork's issue #5 and upstream's issue #5 are different specs found by different hashes.
+## The Problem
+GitHub Issues are identified by repo-scoped integers. When specs live in issues:
+| Scenario | Fork Issue | Upstream Issue | Conflict? |
+|----------|-----------|----------------|-----------|
+| Normal | `forker/proj#1` = "Add OAuth" | `owner/proj#1` = "Initial setup" | No — different repos |
+| After fork merge | `forker/proj#1` = "Add OAuth" transferred to upstream | `owner/proj#2` = "Fix rate limiter" | No — transfer assigns new number |
+| Stale cache | Local cache still says `forker/proj#1` | Issue doesn't exist in fork anymore, wrong number in upstream | **YES** — harness looks up wrong repo/number |
+Issue numbers are repo-scoped, time-ordered identifiers. They are NOT stable identities across repo boundaries.
+## The Solution: Content-Hash Identity
+Every HardenedSpec carries a deterministic content fingerprint that survives repo migration.
+### Fingerprint Generation
+```
+spec_fingerprint = SHA256(
+  normalize(intent_summary) +
+  normalize(json.dumps(success_criteria, sort_keys=True)) +
+  normalize(definition_of_done)
+)
+```
+`normalize()` strips whitespace, lowercases the text, and removes punctuation. This makes the fingerprint tolerant of formatting changes while remaining sensitive to semantic changes.
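A runnable version of the fingerprint pseudocode might look like the sketch below. It assumes that "strips whitespace" means removing all whitespace and that "punctuation" means ASCII punctuation; the function names follow the pseudocode, but the actual SpecHardener code is not shown in this diff.

```python
import hashlib
import json
import string

# Translation table that deletes ASCII punctuation (an assumption about scope).
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, then remove all whitespace."""
    lowered = text.lower().translate(_STRIP_PUNCTUATION)
    return "".join(lowered.split())


def spec_fingerprint(intent_summary: str, success_criteria, definition_of_done: str) -> str:
    # sort_keys=True makes the JSON text independent of dict insertion order.
    criteria_json = json.dumps(success_criteria, sort_keys=True)
    payload = (
        normalize(intent_summary)
        + normalize(criteria_json)
        + normalize(definition_of_done)
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the three normalized fields are concatenated without a separator, text moved across a field boundary hashes the same. If that matters, joining the fields with a delimiter that `normalize()` cannot produce would be one way to avoid it.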
+### Embedding Points
+| Location | Format | Purpose |
+|----------|--------|---------|
+| Issue body (HTML comment) | `<!-- spec-fp: a1b2c3d4e5f6... -->` | Primary: searchable across repos |
+| Issue title prefix | `[spec:a1b2c3d4] Implement OAuth2 login` | Visible: human-readable first-8 prefix |
+| Local cache JSON | `"spec_fingerprint": "a1b2c3d4e5f6..."` | Fast: no API call needed for lookup |
+| Wiki page (ADR, spec page) | `spec_fingerprint: a1b2c3d4e5f6` | Cross-reference: wiki-to-issue linkage |
+### Resolution Algorithm
+When the harness needs spec `X`:
+```
+1. Check local cache: .pi/harness/specs/<id>.json — read spec_fingerprint
+2. Check cached issue URL: if github_issue_url exists and is reachable → use it
+3. If cached URL is stale (404, wrong repo, wrong content):
+   a. Search by fingerprint: gh search issues "spec-fp:a1b2c3d4" --label harness-spec
+   b. If found in current repo: update cache, use it
+   c. If found in different repo: warn, offer to transfer
+   d. If not found anywhere: spec is orphaned — recreate from local cache body
+4. Always verify: read issue body, extract fingerprint, compare with expected
+```
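Step 4 of the algorithm (extract the fingerprint from the issue body and compare it with the expected value) can be sketched as below. The regular expression assumes the HTML-comment format from the embedding table and a lowercase hexadecimal fingerprint; `extract_fingerprint` and `verify_issue` are hypothetical helper names.

```python
import re
from typing import Optional

# Matches the embedding format "<!-- spec-fp: a1b2c3... -->".
_FP_COMMENT = re.compile(r"<!--\s*spec-fp:\s*([0-9a-f]+)\s*-->")


def extract_fingerprint(issue_body: str) -> Optional[str]:
    """Return the embedded fingerprint, or None if the marker is missing."""
    match = _FP_COMMENT.search(issue_body)
    return match.group(1) if match else None


def verify_issue(issue_body: str, expected: str) -> bool:
    """True only when the body carries a marker equal to the expected hash."""
    found = extract_fingerprint(issue_body)
    return found is not None and found == expected
```

A missing marker is treated the same as a mismatch, which corresponds to the "wrong content" branch in step 3.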
+### Why This Works
+- **Content-addressed, not location-addressed**: Like Git's object model. The spec's identity is its content, not where it lives.
+- **Repo-agnostic**: The same spec hash resolves to the correct issue in any repo.
+- **Transfer-safe**: When an issue is transferred via `gh issue transfer`, only its number changes. The body (and fingerprint within it) stays the same.
+- **Deduplication**: Two issues with the same fingerprint ARE the same spec. Harness can detect and merge duplicates.
+- **Searchable**: `gh search issues "spec-fp:abc123"` works across all repos the user has access to.
+## The Transfer-on-Merge Pattern
+When code merges from fork to upstream, specs can follow via GitHub's native issue transfer API.
+### `ultimate-pi harness migrate` Command
+```
+ultimate-pi harness migrate [--dry-run]
+  ├─ Detect repo change: current repo ≠ .pi/harness/config.json cached repo
+  ├─ List fork specs: gh issue list --repo <old-repo> --label harness-spec --json number,title,body
+  ├─ For each spec:
+  │   ├─ Search upstream: gh search issues "spec-fp:<hash>" --repo <new-repo>
+  │   ├─ If found in upstream → skip (already migrated)
+  │   ├─ If not found:
+  │   │   ├─ Transfer: gh issue transfer <issue> <new-repo>
+  │   │   ├─ Relabel: gh issue edit <new-number> --add-label harness-spec,layer-N,status:*
+  │   │   └─ Note: labels don't survive transfer — must reapply
+  │   └─ Update local cache: rewrite github_issue_url → new repo
+  ├─ Update config: .pi/harness/config.json → new repo
+  └─ Report: N transferred, M already-present, K orphaned
+```
+### Transfer API Constraints
+| Constraint | Impact |
+|------------|--------|
+| Requires write access to BOTH repos | The fork's owner must have push access to upstream (true after PR merged) |
+| Labels are NOT transferred | Harness must reapply labels post-transfer |
+| Assignees, milestones ARE transferred | Good |
+| Comments ARE transferred | Execution audit trail preserved |
+| Sub-issues are NOT transferred | Must transfer children individually or recreate hierarchy |
+| Issue number changes | Upstream assigns next available number |
+### Idempotency Guarantee
+`harness migrate` is idempotent because it searches by fingerprint before transferring. If a spec already exists in upstream (from a previous migration run), it's skipped. Running migrate twice produces the same result.
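The idempotency argument can be illustrated with a pure planning function that decides, from fingerprints alone, which fork specs still need transferring. This is a sketch under assumptions: `plan_migration` and the dict keys are hypothetical, and the real command would obtain both inputs from `gh` calls.

```python
from typing import Dict, Iterable, List, Set, Tuple


def plan_migration(
    fork_specs: Iterable[Dict[str, str]],
    upstream_fingerprints: Set[str],
) -> Tuple[List[str], List[str]]:
    """Split fork issue numbers into (to_transfer, already_present)."""
    to_transfer: List[str] = []
    already_present: List[str] = []
    seen = set(upstream_fingerprints)  # copy so the caller's set is untouched
    for spec in fork_specs:
        fingerprint = spec["fingerprint"]
        if fingerprint in seen:
            already_present.append(spec["number"])
        else:
            to_transfer.append(spec["number"])
            seen.add(fingerprint)  # a later duplicate in the fork is skipped
    return to_transfer, already_present
```

Once the first run has transferred its specs, their fingerprints appear upstream, so a second run plans no transfers.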
+## Edge Cases
+### Orphaned Specs (No Fingerprint Match)
+If a spec exists in the fork but no matching fingerprint is found anywhere:
+- **Cause**: Issue body was edited and fingerprint comment removed, or spec was deleted
+- **Resolution**: Harness recreates the spec in upstream from the local cache JSON body
+- **Warning**: Comment history is lost (original was in fork issue, now inaccessible)
+### Duplicate Specs (Same Fingerprint, Different Issues)
+If two issues in the same repo have the same fingerprint:
+- **Detection**: `gh search issues "spec-fp:<hash>" --repo <repo>` returns multiple results
+- **Resolution**: Harness keeps the newer one (most recently updated), closes the older as duplicate with comment "Merged into #<newer> — same spec fingerprint"
+- **Rationale**: The newer issue likely has richer comment history
+### Cross-Fork Specs (Multiple Forks, Same Spec)
+If Forker-A and Forker-B both create "Add OAuth2" with different approaches:
+- **Their fingerprints WILL differ** — the `success_criteria` and `definition_of_done` are different
+- **No false collision**: fingerprint includes the full spec semantics, not just the title
+- **Correct behavior**: they remain separate specs in separate forks, as they should
+## Implementation Complexity
+| Component | Effort | Risk |
+|-----------|--------|------|
+| Fingerprint generation in SpecHardener | Low — add SHA256 call before saving | None |
+| Fingerprint embedding in issue body | Low — prepend HTML comment | None |
+| `gh search issues` integration | Low — single CLI call | None |
+| `harness migrate` command | Medium — transfer + relabel + cache update loop | Medium — transfer API edge cases |
+| Cache staleness detection | Low — compare cached URL to current repo | None |
+| Orphan recreation | Low — `gh issue create` from cached body | Low |
+| Duplicate detection | Low — count search results | None |
+Total: ~2-3 days of implementation. All operations use existing `gh` CLI commands or REST API.
+## Prior Art
+- **Git's content-addressed object model**: SHA-1 hashes identify objects by content, not by location. This is the same principle applied to specs.
+- **IPFS / libp2p**: Content-addressed distributed storage. Specs are "CID-addressable" in concept.
+- **Nix / Guix**: Package builds are identified by content hashes of their inputs. Same deterministic identity pattern.
+- **Docker image digests**: `image@sha256:abc...` identifies an image by its content manifest, not by its tag. Tags move; digests don't.
+None of these are "harness spec storage" prior art — this is a novel application of content addressing to agent task management.

package/vault/wiki/concepts/context-anxiety.md ADDED

@@ -0,0 +1,57 @@
+---
+type: concept
+title: "Context Anxiety"
+created: 2026-04-30
+updated: 2026-04-30
+status: seed
+tags:
+  - context
+  - harness
+  - failure-mode
+related:
+  - "[[agentic-harness]]"
+  - "[[harness-implementation-plan]]"
+sources:
+  - "[[anthropic2026-harness-design]]"
+---
+# Context Anxiety
+A failure mode where LLM agents begin wrapping up work prematurely as they approach what they believe is their context limit. Identified and named by Anthropic Engineering (Rajasekaran, 2026).
+## Behavior
+- Agent rushes to finish, skipping verification steps
+- Outputs become shorter, less thorough
+- Agent makes premature declarations of completion
+- Quality drops sharply in later parts of long tasks
+## Which Models Exhibit It
+- **Sonnet 4.5**: Strong context anxiety. Compaction alone insufficient — required context resets with structured handoffs
+- **Opus 4.5+**: Largely eliminated. One continuous session sufficient with automatic compaction
+## Mitigations
+### Context Reset (for anxious models)
+- Clear context window entirely
+- Start fresh agent session
+- Provide structured handoff artifact carrying previous agent's state + next steps
+- Cost: orchestration complexity, token overhead, latency
+### Compaction (for non-anxious models)
+- Summarize earlier parts of conversation in-place
+- Same agent continues with shortened history
+- Preserves continuity but doesn't give clean slate
+- Insufficient alone for models with strong anxiety
+## Relevance to Our Harness
+Our long-running research sessions (3-round autoresearch, multi-phase builds) are vulnerable to context anxiety. Our current mitigation is compaction (summarizing earlier rounds). For GPT/strict models, this may be insufficient — context resets may be required between rounds.
+## Detection
+Watch for:
+- Sudden acceleration in output pace toward end of long sessions
+- Skipping of verification gates that were passed earlier
+- "I'll complete this quickly" language
+- Dropping of structured output formats

package/vault/wiki/concepts/context-compression-techniques.md ADDED

@@ -0,0 +1,19 @@
+---
+type: concept
+status: stub
+created: 2026-05-02
+updated: 2026-05-02
+tags: [concept, context-management]
+---
+# Context Compression Techniques
+Methods for reducing LLM context window usage while preserving task-relevant information. Includes structured compaction, AST truncation, shell pattern compression, and context pruning.
+## References
+- [[structured-compaction]]
+- [[ast-compression]]
+- [[shell-pattern-compression]]
+- [[meta-agent-context-pruning]]
+- [[context-anxiety]]

package/vault/wiki/concepts/context-continuity.md ADDED

@@ -0,0 +1,22 @@
+---
+type: concept
+title: "context-continuity"
+created: 2026-04-30
+updated: 2026-04-30
+status: seed
+tags: [#concept, #context-optimization, #session-management]
+related:
+  - "[[context-mode]]"
+  - "[[lean-ctx]]"
+---
+# Context Continuity
+> [!stub] This is a stub page.
+The ability for AI coding agents to preserve session state across context compaction events. Both context-mode and lean-ctx implement this, but differently:
+- **context-mode**: Captures 26 event types to SessionDB for cross-compaction continuity
+- **lean-ctx**: Uses CCP (Cross-session Continuity Protocol) with scratchpad messaging for multi-agent session sharing
+Without context continuity, each context window compaction resets the agent's working memory, losing accumulated understanding of the codebase.

package/vault/wiki/concepts/context-drift-in-agents.md ADDED

@@ -0,0 +1,106 @@
+---
+aliases: ["agent drift", "behavioral drift", "context drift in agents"]
+type: concept
+title: "Context Drift in AI Agents"
+created: 2026-04-30
+status: developing
+tags:
+  - concept
+  - drift
+  - agent-reliability
+  - context-engineering
+related:
+  - "[[Research: Meta-Agent Context Drift Detection]]"
+  - "[[meta-agent-context-pruning]]"
+  - "[[agent-loop-detection-patterns]]"
+  - "[[guardian-agent-pattern]]"
+  - "[[agent-drift-academic-paper]]"
+  - "[[ironclaw-drift-monitor]]"
+  - "[[model-adaptive-harness]]"
+updated: 2026-05-02
+---
+# Context Drift in AI Agents
+The progressive degradation of agent behavior, decision quality, and task coherence over extended interactions. Not a model failure — a context management failure.
+## Two Definitions
+The term "context drift" is used in two distinct ways in the literature:
+### 1. Stale Environment Context (Infrastructure Drift)
+The agent's view of the world diverges from reality because data sources haven't caught up. The agent reads stale state, makes decisions based on it, detects mismatch, re-plans, and loops. (Source: [[vectara-guardian-agents|Tacnode]])
+**Cause**: Slow data pipelines, batch ETL, separate OLTP/OLAP stores. The stack was built for human consumers (dashboards, periodic queries), not sub-second agent freshness.
+**Fix**: Unified context lake with instant freshness guarantees. Reduce hops between event and agent visibility.
+### 2. Context Window Pollution (Interaction Drift)
+The agent's context window fills with irrelevant information from failed attempts, verbose outputs, and dead-end explorations. Signal-to-noise collapses. The agent's decisions degrade because it's reasoning over noise. (Source: [[agent-drift-academic-paper]])
+**Cause**: Multi-turn agent loops where every tool call and response accumulates in context. Failed attempts add noise. Successful attempts may be too verbose. No mechanism prunes irrelevant history.
+**Fix**: Context compaction (summarize + restart), context pruning (remove dead-ends), external memory (structured notes outside context window).
+## The Meta-Agent Problem Space
+This concept page addresses the **second** definition — interaction drift from context window pollution. The meta-agent concept targets exactly this: detecting when context pollution has reached a critical point and pruning the context to restore signal quality.
+## Drift Taxonomy
+From the academic literature (Source: [[agent-drift-academic-paper]]):
+| Type | Definition | Example |
+|------|-----------|---------|
+| **Semantic drift** | Outputs deviate from original intent while staying syntactically valid | Financial analysis agent shifts from risk-focused to opportunity-emphasizing language |
+| **Coordination drift** | Multi-agent consensus breaks down | Router develops bias toward certain sub-agents, creating bottlenecks |
+| **Behavioral drift** | Novel strategies emerge not present in initial interactions | Agent caches data in chat history instead of using designated memory tools |
+## Stuck-Pattern Signatures
+Operational patterns that indicate context drift (Source: [[ironclaw-drift-monitor]]):
+| Pattern | Signature | Threshold |
+|---------|-----------|-----------|
+| Repetition loops | Same tool + same args called repeatedly | 3+ in 10 calls |
+| Failure spirals | Consecutive tool failures | 4+ |
+| Tool cycling | A-B-A-B-A-B alternation | 6 calls |
+| Silence drift | No text response | 15+ iterations |
+| Rework churn | Same file written repeatedly | 3+ writes |
+| Excessive searching | ls/find/grep without code edits | 5+ searches |
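Three of the signatures in the table (repetition loops, failure spirals, tool cycling) can be sketched as checks over a log of tool calls. The `ToolCall` record and function names are hypothetical, and the window handling is one reasonable reading of the thresholds rather than the monitor's actual code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: str
    ok: bool = True


def repetition_loop(calls: List[ToolCall]) -> bool:
    """Same tool and args at least 3 times within the last 10 calls."""
    counts = Counter((c.tool, c.args) for c in calls[-10:])
    return any(n >= 3 for n in counts.values())


def failure_spiral(calls: List[ToolCall]) -> bool:
    """The 4 most recent calls all failed."""
    tail = calls[-4:]
    return len(tail) == 4 and all(not c.ok for c in tail)


def tool_cycling(calls: List[ToolCall]) -> bool:
    """The last 6 calls alternate strictly between two different tools."""
    tools = [c.tool for c in calls[-6:]]
    if len(tools) < 6:
        return False
    a, b = tools[0], tools[1]
    return a != b and tools == [a, b, a, b, a, b]


def detect(calls: List[ToolCall]) -> List[str]:
    """Names of all signatures that currently match, in a fixed order."""
    checks = [
        ("repetition_loop", repetition_loop),
        ("failure_spiral", failure_spiral),
        ("tool_cycling", tool_cycling),
    ]
    return [name for name, check in checks if check(calls)]
```

The remaining signatures (silence drift, rework churn, excessive searching) would need extra fields on the call record, such as whether a turn produced text or which file was written.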
+## Quantified Impact
+From 847 simulated workflows (Source: [[agent-drift-academic-paper]]):
+- Task success rate: -42% (87.3% → 50.6%)
+- Human interventions: +216% (0.31/task → 0.98/task)
+- Token usage: +52.4% (12,400 → 18,900)
+- Inter-agent conflicts: +487.5%
+- Drift emerges after a median of 73 interactions, far earlier than expected
+## Three Causal Mechanisms
+1. **Context window pollution**: Irrelevant history dilutes signal. Episodic Memory Consolidation directly addresses this.
+2. **Distributional shift**: Narrow domain language diverges from broad training distribution over time.
+3. **Autoregressive reinforcement**: Small errors compound through feedback loops — an unnecessarily verbose response sets precedent for future verbosity.
+## Mitigation Approaches
+| Approach | Mechanism | Drift Reduction | Overhead |
+|----------|-----------|----------------|----------|
+| Episodic Memory Consolidation | Summarize + compress history | 51.9% | Moderate (summarization cost) |
+| Drift-Aware Routing | Stability scores in delegation | 63.0% | Low (metric computation) |
+| Adaptive Behavioral Anchoring | Few-shot exemplars from baseline | 70.4% | Low (prompt augmentation) |
+| Context Pruning (proposed) | Remove dead-end entries | Unknown | Low (metadata operation) |
+| Combined (first three approaches) | Multi-layer defense | 81.5% | +23% compute |
+The proposed meta-agent context pruning would add a fourth approach to this arsenal — one that operates at the conversation-history level rather than the prompt or routing level.
+## See Also
+- [[meta-agent-context-pruning]] — The proposed system combining detection + pruning + restart
+- [[agent-loop-detection-patterns]] — Production-grade loop detection code
+- [[guardian-agent-pattern]] — Pre-execution safety validation
+- [[agentic-harness-context-enforcement]] — Enforcing context-efficient behavior

package/vault/wiki/concepts/context-engineering.md ADDED

@@ -0,0 +1,62 @@
+---
+type: concept
+tags:
+  - context-engineering
+  - token-management
+  - compaction
+  - memory
+related:
+  - "[[Agent Harness Architecture]]"
+  - "[[sources/opendev-arxiv-2603.05344v1]]"
+  - "[[sources/martin-fowler-harness-engineering]]"
+---
+# Context Engineering
+The practice of managing an LLM's context window as a first-class engineering concern. Context is a finite, expensive resource consumed by system prompts, tool schemas, conversation history, and tool outputs. Effective context engineering determines how long an agent can operate before context overflow degrades performance.
+## Core Principles
+1. **Entropy reduction**: Each context element should reduce uncertainty about the desired output
+2. **Minimal sufficiency**: Include only what is necessary to avoid attention dilution
+3. **Semantic continuity**: Context should evolve coherently across turns, not be reconstructed from scratch
+## Key Techniques
+### Adaptive Context Compaction (ACC)
+Five graduated stages instead of a single emergency threshold:
+- **70%**: Warning — log pressure, track trends
+- **80%**: Observation Masking — replace old tool results with reference pointers (~15 tokens each)
+- **85%**: Fast Pruning — walk backward, replace oldest results with `[pruned]` markers
+- **90%**: Aggressive Masking — shrink preservation window to most recent outputs only
+- **99%**: Full Compaction — LLM-based summarization of middle portion, keep recent verbatim
+ACC reduces peak context consumption by ~54%, often eliminating the need for emergency compaction.
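The graduated thresholds can be expressed as a small lookup that maps current usage to the most severe stage reached. This is a sketch only: the stage identifiers and the `acc_stage` function are illustrative names, and usage is assumed to be the API-reported prompt tokens divided by the context window size.

```python
# (threshold as a fraction of the context window, stage), most severe first.
ACC_STAGES = [
    (0.99, "full_compaction"),
    (0.90, "aggressive_masking"),
    (0.85, "fast_pruning"),
    (0.80, "observation_masking"),
    (0.70, "warning"),
]


def acc_stage(prompt_tokens: int, context_window: int) -> str:
    """Return the most severe stage whose threshold has been reached."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    usage = prompt_tokens / context_window
    for threshold, stage in ACC_STAGES:
        if usage >= threshold:
            return stage
    return "none"
```

Checking the thresholds from the top down means a sudden jump in usage (for example, one very large tool output) lands directly on the appropriate stage instead of stepping through the milder ones.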
+### Dual-Memory Architecture
+- **Episodic memory**: LLM summary of full history (strategic context). Regenerated from full history every 5 messages to prevent summary drift.
+- **Working memory**: Last 6 exchanges verbatim (operational detail).
+### Event-Driven System Reminders
+Short, targeted messages injected at decision points (not upfront in system prompt). Use `role: user` for maximum recency. Governed by counter budgets to prevent noise.
+### Lazy Tool Discovery
+Only discovered/explicitly invoked MCP tool schemas consume context. Baseline overhead: <5% instead of 40%.
+### Tool Result Optimization
+Per-tool-type summarization (file reads → metadata, search → match counts, directory listings → item counts). Large outputs (>8,000 chars) offloaded to scratch files with previews.
+### Prompt Caching
+Split the system prompt into stable (cacheable) and dynamic parts. The stable portion (~80-90%) is marked with `cache_control`, yielding ~88% input cost reduction on cached tokens.
+## Calibration
+Always use the API's reported `prompt_tokens` as calibration anchor, not local estimates. Providers inject invisible content (safety preambles, tool serialization) that local counting misses.
+## Relevance to Our Harness
+- No staged compaction — we rely on session restarts when context fills
+- No event-driven reminders — our system prompt decays over long sessions
+- No dual-memory — thinking contexts share full history
+- Partial tool result optimization — `lean-ctx` compresses bash output
+- Lazy discovery pattern exists — `ctx_discover_tools` for lean-ctx