npm - ultimate-pi - Versions diffs - 0.1.2 → 0.1.4 - Mend

ultimate-pi 0.1.2 → 0.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (516)

package/vault/wiki/sources/langsight-loop-detection.md ADDED

@@ -0,0 +1,80 @@
+---
+type: source
+status: ingested
+source_type: blog
+title: "How to Detect and Stop AI Agent Loops in Production"
+author: LangSight Engineering
+date_published: 2026-03-22
+url: https://langsight.dev/blog/ai-agent-loop-detection/
+confidence: high
+key_claims:
+  - "Agent loops are the most common production failure mode"
+  - "Argument hash comparison catches >90% of real loops with zero false positives at threshold 3"
+  - "Three detection approaches: argument hash, sliding window rate, LLM output similarity"
+  - "Always combine loop detection with budget guardrails"
+tags:
+  - source
+  - loop-detection
+  - production
+  - agent-reliability
+  - langsight
+related:
+  - "[[Research: Meta-Agent Context Drift Detection]]"
+  - "[[agent-loop-detection-patterns]]"
+  - "[[context-drift-in-agents]]"
+created: 2026-05-02
+updated: 2026-05-02
+---
+# LangSight Loop Detection
+## Summary
+LangSight's production guide for detecting and stopping AI agent loops — the most common failure mode in deployed agent systems. Provides three detection approaches with working code, intervention strategies, and integration patterns. Based on production experience: a single support agent burned $214 calling the same CRM tool 89 times with identical arguments.
+## What It Contributes
+Validates that loop detection is production-critical and that argument hashing is the most reliable method. Provides concrete code for the detection layer of a meta-agent system. The $214 cautionary tale demonstrates the economic case for automated intervention.
+## Three Loop Patterns
+1. **Direct repetition**: Same tool + identical arguments multiple times in a row. Most common. Caused by tool returning error/unexpected result and LLM retry logic not distinguishing transient vs. structural failure.
+2. **Ping-pong between tools**: Two tools called alternately without state change. Agent calls A → B → A → B with same arguments.
+3. **Retry-without-progress**: Tool call succeeds but response doesn't satisfy agent's internal goal. Agent keeps calling with minor argument variations.
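The ping-pong pattern in particular can be checked mechanically by looking for a two-call cycle at the tail of the call history. The following is a minimal sketch and not LangSight's code: the function name `is_ping_pong`, the `(tool_name, args_json)` tuple shape, and the `min_cycles` default are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical call record: (tool_name, canonical JSON string of the arguments).
Call = Tuple[str, str]


def is_ping_pong(history: Sequence[Call], min_cycles: int = 2) -> bool:
    """Return True if the history ends with A, B, A, B, ... for two distinct calls.

    `min_cycles` is the number of full A -> B repetitions required (an assumed default).
    """
    needed = 2 * min_cycles
    if len(history) < needed:
        return False
    tail = history[-needed:]
    a, b = tail[0], tail[1]
    if a == b:
        return False  # identical calls are direct repetition, not ping-pong
    return all(call == (a if i % 2 == 0 else b) for i, call in enumerate(tail))


calls = [
    ("get_ticket", '{"id": 7}'),
    ("get_customer", '{"id": 3}'),
    ("get_ticket", '{"id": 7}'),
    ("get_customer", '{"id": 3}'),
]
print(is_ping_pong(calls))  # True
```

Comparing full `(tool, args)` pairs rather than tool names alone keeps legitimate alternation with changing arguments from being flagged.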
+## Three Detection Approaches
+### Approach 1: Argument Hash Comparison (Recommended)
+Most reliable. Hash `(tool_name, normalized_args)` and count occurrences in session window. Threshold 3 catches >90% of real loops.
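+```python
+import hashlib
+import json
+
+def compute_call_hash(tool_name: str, args: dict) -> str:
+    # Sort keys so logically identical argument dicts produce the same hash.
+    payload = f"{tool_name}:{json.dumps(args, sort_keys=True)}"
+    return hashlib.sha256(payload.encode()).hexdigest()[:16]
+```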
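To show how the hash feeds the "count occurrences in session window" step, here is a small sketch of a per-session counter. The class name `HashLoopDetector` and its `record` method are hypothetical; only `compute_call_hash` and the threshold of 3 come from the note.

```python
import hashlib
import json
from collections import Counter


def compute_call_hash(tool_name: str, args: dict) -> str:
    # Same hashing scheme as in the note: tool name plus key-sorted JSON arguments.
    payload = f"{tool_name}:{json.dumps(args, sort_keys=True)}"
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


class HashLoopDetector:
    """Counts identical (tool, args) calls within one session (illustrative only)."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()

    def record(self, tool_name: str, args: dict) -> bool:
        """Record a call; return True once it has been seen `threshold` times."""
        key = compute_call_hash(tool_name, args)
        self.counts[key] += 1
        return self.counts[key] >= self.threshold


detector = HashLoopDetector(threshold=3)
results = [detector.record("crm_lookup", {"id": 42, "fields": ["name"]}) for _ in range(3)]
print(results)  # [False, False, True]
```

One detector instance per session (or per sub-agent) keeps counts from leaking across unrelated runs.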
+### Approach 2: Sliding Window Rate Detection
+Catches high-frequency calls regardless of argument variation. If tool called >N times in M seconds, flag it.
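A sliding-window check of this kind might look like the sketch below. The class name, the explicit `now` argument (used here so the example is deterministic), and the concrete limits are assumptions; in a live system the timestamp would more likely come from `time.monotonic()`.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowRateDetector:
    """Flags a tool called more than `max_calls` times within `window_seconds`."""

    def __init__(self, max_calls: int, window_seconds: float) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._timestamps: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, tool_name: str, now: float) -> bool:
        """Record a call at time `now`; return True if the rate limit is exceeded."""
        stamps = self._timestamps[tool_name]
        stamps.append(now)
        # Discard calls that have fallen out of the window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        return len(stamps) > self.max_calls


detector = SlidingWindowRateDetector(max_calls=3, window_seconds=10.0)
flags = [detector.record("search", t) for t in (0.0, 1.0, 2.0, 3.0, 20.0)]
print(flags)  # [False, False, False, True, False]
```

Because arguments are ignored, this catches the retry-without-progress pattern that argument hashing misses.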
+### Approach 3: LLM Output Similarity
+Semantic similarity between consecutive reasoning outputs. High similarity (>0.95 cosine) across multiple steps = reasoning in circles. Computationally expensive, usually overkill.
+## Intervention Options
+1. **Warn and continue**: Log + alert, agent keeps running. Good for early monitoring.
+2. **Terminate session**: Hard stop. Mark session `loop_detected`, return structured error. Right default for production.
+3. **Inject recovery message**: System message telling agent it's stuck. Gives chance to self-recover before termination.
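The three options can also be chained into an escalation ladder. The sketch below is one possible arrangement, not something the article prescribes: the `Intervention` enum, the function name, and the choice to escalate one step per additional repeat are assumptions.

```python
from enum import Enum


class Intervention(Enum):
    NONE = "none"
    WARN = "warn"
    INJECT = "inject_recovery_message"
    TERMINATE = "terminate"


def choose_intervention(repeat_count: int, threshold: int = 3) -> Intervention:
    """Map how often an identical call was seen to an escalating response."""
    if repeat_count < threshold:
        return Intervention.NONE
    if repeat_count == threshold:
        return Intervention.WARN       # log and alert, agent keeps running
    if repeat_count == threshold + 1:
        return Intervention.INJECT     # tell the agent it is stuck
    return Intervention.TERMINATE      # hard stop with a structured error


print([choose_intervention(n).value for n in range(2, 7)])
```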
+## Budget Guardrails
+Backstop for unknown failure patterns: max cost, max steps, max wall time, soft alert at 80%.
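As a rough illustration of these guardrails, the sketch below checks cost, step count, and wall time against fixed limits and raises a soft alert at 80% of whichever limit is closest to being hit. The `Budget` dataclass, `check_budget`, and the concrete limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_cost_usd: float
    max_steps: int
    max_wall_seconds: float
    soft_alert_fraction: float = 0.8  # the "soft alert at 80%" from the note


def check_budget(budget: Budget, cost_usd: float, steps: int, wall_seconds: float) -> str:
    """Return "ok", "soft_alert", or "exceeded" based on the most-used limit."""
    usage = max(
        cost_usd / budget.max_cost_usd,
        steps / budget.max_steps,
        wall_seconds / budget.max_wall_seconds,
    )
    if usage >= 1.0:
        return "exceeded"
    if usage >= budget.soft_alert_fraction:
        return "soft_alert"
    return "ok"


budget = Budget(max_cost_usd=5.0, max_steps=50, max_wall_seconds=600.0)
print(check_budget(budget, cost_usd=1.0, steps=10, wall_seconds=60.0))  # ok
print(check_budget(budget, cost_usd=4.2, steps=10, wall_seconds=60.0))  # soft_alert
print(check_budget(budget, cost_usd=1.0, steps=50, wall_seconds=60.0))  # exceeded
```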
+## Threshold Tuning
+- **Polling agents**: Use time-based windows, not count-based
+- **Retry-heavy workflows**: Increase threshold to 5-7
+- **Sub-agents**: Each sub-agent gets own loop detector
+- **Default**: Threshold 3 works for most agents
+## Relevance to Meta-Agent Concept
+LangSight provides the **detection layer** of the meta-agent pipeline. The argument hash approach is production-validated. Their three intervention options map to the proposed escalation model (warn → inject → terminate). What's missing: context pruning after detection. LangSight terminates or injects but doesn't remove dead-end history.

package/vault/wiki/sources/leanctx-website.md ADDED

@@ -0,0 +1,69 @@
+---
+type: source
+source_type: website
+title: leanctx.com
+author: yvgude
+date_published: 2026
+url: https://leanctx.com
+confidence: medium
+key_claims:
+  - "60–99% token reduction per file read"
+  - "46 MCP tools, 10 read modes, 90+ shell compression patterns"
+  - "Supports 24 AI tools"
+  - "Single Rust binary, zero telemetry, Apache 2.0"
+  - "Agent governance with profiles, budgets, SLOs, anomaly detection"
+created: 2026-04-30
+updated: 2026-04-30
+status: ingested
+tags: [#source/website]
+---
+# leanctx.com
+Landing page for LeanCTX — "The Context Engineering Layer for AI Coding."
+## Architecture (3 layers)
+1. **Context Server**: 49 intelligent MCP tools for file reads, shell commands, code search. Intent-aware compression with adaptive mode selection per task type.
+2. **Shell Hook**: Intercepts shell output. Recognizes 90+ command patterns (git, npm, cargo, docker, kubectl, etc). Compresses automatically.
+3. **Protocols**: CEP (Context Efficiency Protocol), CCP (Cross-session Continuity Protocol), TDD (symbol shorthand). Teaches AI to communicate leaner. 8–25% additional savings.
+## Read Modes
+- `full`: Complete content
+- `map`: Dependency graph + exports + API (~5-15% tokens)
+- `signatures`: Function/class signatures only (~10-20%)
+- `aggressive`: Syntax-stripped (~30-50%)
+- `entropy`: Shannon entropy filtered (~20-40%)
+- `diff`: Only changed lines since last read
+## Agent Governance
+- 5 built-in roles: Admin, Coder, Debugger, Reviewer, Ops
+- Token, cost, and shell budgets per agent
+- SLOs with automatic throttling
+- Anomaly detection for runaway consumption
+## Compression Results (claimed)
+- 60–95% per file read depending on mode
+- 99% on cached re-reads (13 tokens)
+- Shell builds: 847 → 42 tokens (95%)
+## Platforms
+Aider, Amazon Q, Amp, Antigravity, AWS Kiro, Claude Code, Cline, Continue, Cursor, Emacs, Gemini CLI, GitHub Copilot, JetBrains, Neovim, OpenAI Codex, OpenCode, Pi, Qwen Code, Roo Code, Sublime Text, Trae, Verdent, Windsurf, Zed
+## GitHub
+- Stars: 924 (as of 2026-04-30)
+- Forks: 109
+- Language: Rust
+- Created: 2026-03-23
+- License: Apache 2.0
+## crates.io
+- Package: lean-ctx
+- Total downloads: 3,188
+- Version: 3.4.5

package/vault/wiki/sources/lee2026-meta-harness.md ADDED

@@ -0,0 +1,59 @@
+---
+type: source
+source_type: paper
+title: "Meta-Harness: End-to-End Optimization of Model Harnesses"
+author: "Lee, Yoonho; Nair, Roshen; Zhang, Qizheng; et al."
+date_published: 2026-03-30
+url: "https://arxiv.org/abs/2603.28052"
+confidence: medium
+key_claims:
+  - "Outer-loop system searches over harness code for LLM applications"
+  - "Agentic proposer accesses source code, scores, and execution traces via filesystem"
+  - "7.7pt improvement on text classification with 4x fewer context tokens"
+  - "4.7pt improvement on IMO-level math problems across 5 held-out models"
+  - "Surpasses best hand-engineered baselines on TerminalBench-2"
+tags:
+  - harness
+  - meta-learning
+  - optimization
+  - terminal-bench
+created: 2026-04-30
+updated: 2026-04-30
+status: ingested
+---
+# Meta-Harness: End-to-End Optimization of Model Harnesses
+Lee et al., March 2026. Stanford / Together AI.
+## Core Idea
+Harnesses are still designed largely by hand. Meta-Harness is an outer-loop system that automatically searches over harness code, using an agentic proposer with access to source code, scores, and execution traces from all prior candidates.
+## Architecture
+- **Agentic Proposer**: LLM that reads existing harness code + execution traces + scores
+- **Filesystem-based memory**: All prior candidates, their code, traces, and scores available
+- **Outer-loop**: Proposer generates new harness variant → evaluate → add to candidate pool → repeat
+Key difference from AutoHarness: Meta-Harness sees ALL prior experiments, not just the last one.
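The outer loop described above can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: the `Candidate` record, `outer_loop`, and the `toy_*` stand-ins are hypothetical, the pool lives in memory rather than on a filesystem, and a deterministic function replaces the LLM proposer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    harness: str   # harness source code (here just a string)
    score: float
    trace: str     # execution trace recorded during evaluation


def outer_loop(
    seed: str,
    propose: Callable[[List[Candidate]], str],
    evaluate: Callable[[str], Candidate],
    iterations: int,
) -> Candidate:
    """Propose -> evaluate -> add to pool, with the proposer seeing the whole pool."""
    pool = [evaluate(seed)]
    for _ in range(iterations):
        new_harness = propose(pool)        # proposer reads ALL prior candidates
        pool.append(evaluate(new_harness))
    return max(pool, key=lambda c: c.score)


# Toy stand-ins: longer strings score higher, capped at 5.
def toy_evaluate(harness: str) -> Candidate:
    return Candidate(harness, float(min(len(harness), 5)), trace=f"len={len(harness)}")


def toy_propose(pool: List[Candidate]) -> str:
    best = max(pool, key=lambda c: c.score)
    return best.harness + "x"


best = outer_loop("a", toy_propose, toy_evaluate, iterations=6)
print(best.harness, best.score)  # axxxx 5.0
```

Passing the full `pool` to `propose`, rather than only the latest candidate, is the part that mirrors the note's key difference from AutoHarness.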
+## Results
+| Domain | Improvement | Notes |
+|--------|-------------|-------|
+| Text classification | +7.7 pts | 4x fewer tokens |
+| IMO math reasoning | +4.7 pts | Across 5 held-out models |
+| Agentic coding (TerminalBench-2) | Surpasses hand-engineered | — |
+## Key Insight
+> Richer access to prior experience can enable automated harness engineering.
+This directly challenges the assumption that harness design must be a human engineering practice. It suggests a future where harnesses self-optimize from execution traces.
+## Relevance to Our Harness
+Our current pipeline is manually configured. Meta-Harness suggests:
+- Adding a "harness optimizer" that runs off failure traces
+- Auto-tuning token budgets per layer based on budgeted vs. observed usage
+- Generating model-specific harness variants (our model-adaptive profiles could be learned, not hand-coded)

package/vault/wiki/sources/linux-kernel-coding-workflow.md ADDED

@@ -0,0 +1,50 @@
+---
+type: source
+source_type: official-documentation
+title: "Linux Kernel Coding Style and Development Workflow"
+author: "Linus Torvalds, Jonathan Corbet et al."
+date_published: 2026-01-20
+url: "https://github.com/torvalds/linux/blob/master/Documentation/process/coding-style.rst"
+confidence: high
+key_claims:
+  - "Functions should be short and do one thing well"
+  - "8-character tabs enforce shallow nesting; >3 levels is a design problem"
+  - "K&R brace style, functions get opening brace on next line"
+  - "Centralized error handling via goto for cleanup paths"
+  - "Reference counting mandatory for data structures visible to multiple threads"
+  - "Don't crash the kernel — WARN_ON_ONCE preferred over BUG()"
+  - "Time-based release cycle: 2-week merge window, 6-10 week rc stabilization"
+  - "Chain-of-trust maintainer hierarchy: patches flow up through subsystem trees"
+tags: [linux, coding-style, kernel, torvalds]
+---
+# Linux Kernel Coding Style and Development Workflow
+## Coding Style — Direct from Linus
+The Linux kernel coding style document enforces strict, opinionated rules:
+- **8-character tabs**: not just aesthetic — forces refactoring when nesting exceeds 3 levels
+- **K&R braces**: opening brace on same line for statements, on next line for functions. "K&R are right."
+- **Short functions**: should fit on 1-2 screenfuls (80x24). Local variables ≤ 5-10.
+- **Descriptive globals, short locals**: `count_active_users()`, not `cntusr()`; loop counter is `i`, not `loop_counter`
+- **No typedefs** for structs/pointers: `struct virtual_container *a` beats `vps_t a`
+- **Centralized exit via goto**: for cleanup in functions with multiple exit points. Label names describe what they free.
+- **Comments say WHAT, not HOW**: if function needs inline comments explaining how it works, rewrite it.
+- **Reference counting**: mandatory for any data structure accessible from another thread. "If another thread can find your data structure and you don't have a reference count, you almost certainly have a bug."
+- **Don't crash the kernel**: use `WARN_ON_ONCE()`, not `BUG()`. Whether a warning should bring the system down is the user's decision (for example via `panic_on_warn`), not the code author's.
+## Development Process
+- **Time-based releases**: new major kernel every 2-3 months
+- **2-week merge window**: all new features land here. ~1,000 patches/day.
+- **6-10 week stabilization**: only fixes after -rc1. Regressions are the primary metric.
+- **Chain of trust**: patches flow through subsystem maintainers → Linus. Only ~1.3% of patches chosen directly by Linus.
+- **linux-next**: integration tree where all pending patches are tested before merge window.
+- **Staging trees**: drivers/staging/ for code not yet meeting quality standards; includes TODO files.
+## Linus on AI-Generated Code (2026)
+- Vibe coding: "fairly positive" for learning, "horrible, horrible idea from a maintenance standpoint" for production.
+- Linux kernel policy: AI-generated code is acceptable if reviewed by a human who takes responsibility. "If the code is good, it's good. If it's hallucinatory AI slop that breaks the kernel, the human who clicked submit is responsible."
+- "Code is cheap. Show me the talk." — prioritizes demonstrated understanding over volume of output.

package/vault/wiki/sources/lou2026-autoharness.md ADDED

@@ -0,0 +1,53 @@
+---
+type: source
+source_type: paper
+title: "AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness"
+author: "Lou, Xinghua; Lázaro-Gredilla, Miguel; Dedieu, Antoine; et al."
+date_published: 2026-02-10
+url: "https://arxiv.org/abs/2603.03329"
+confidence: medium
+key_claims:
+  - "Smaller model (Gemini Flash) can automatically synthesize code harness via iterative refinement"
+  - "78% of chess losses were illegal moves — harness eliminates all illegal moves in 145 TextArena games"
+  - "Synthesized harness enables smaller model to outperform larger models (Gemini Pro, GPT-5.2)"
+  - "Code-policy (entire policy in code, no LLM at decision time) beats larger models on 16 games"
+tags:
+  - harness
+  - auto-synthesis
+  - code-generation
+  - gemini
+created: 2026-04-30
+updated: 2026-04-30
+status: ingested
+---
+# AutoHarness: Automatically Synthesizing Code Harnesses
+Lou et al., February 2026.
+## Core Idea
+LLM agents often attempt actions that are prohibited by the environment. Instead of manually writing guardrails, AutoHarness demonstrates that an LLM can automatically synthesize a code harness via iterative refinement with environment feedback.
+## Key Numbers
+- **78% of Gemini-2.5-Flash losses** in Kaggle GameArena chess attributed to illegal moves
+- After AutoHarness: **all illegal moves prevented** across 145 TextArena games
+- Synthesized harness + Flash outperforms Gemini-2.5-Pro bare
+- Code-policy (fully compiled harness, no LLM at decision time) beats GPT-5.2-High on 16/16 games
+## Mechanism
+1. LLM generates initial harness code
+2. Environment provides feedback (illegal move detection, score)
+3. LLM iteratively refines harness code
+4. Final harness: prevents all illegal actions, optimizes for reward
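The shape of this feedback loop can be illustrated with a deliberately simplified stand-in. In the paper an LLM rewrites harness code each round; in the sketch below that step is replaced by growing a blocklist, which is an assumption made only to keep the example runnable. `refine_harness`, `propose_moves`, and `is_legal` are hypothetical names.

```python
from typing import Callable, List, Set


def refine_harness(
    propose_moves: Callable[[], List[str]],
    is_legal: Callable[[str], bool],
    max_rounds: int = 5,
) -> Set[str]:
    """Grow a blocklist until no move that passes the harness is illegal."""
    blocked: Set[str] = set()  # the "harness": moves to reject before they reach the environment
    for _ in range(max_rounds):
        passed = [m for m in propose_moves() if m not in blocked]
        illegal = {m for m in passed if not is_legal(m)}  # environment feedback
        if not illegal:
            break              # harness now filters every illegal move seen
        blocked |= illegal     # refinement step (an LLM code rewrite in the paper)
    return blocked


legal_moves = {"e4", "d4", "Nf3"}
blocked = refine_harness(
    propose_moves=lambda: ["e4", "Ke9", "d4", "Qz1"],
    is_legal=lambda m: m in legal_moves,
)
print(sorted(blocked))  # ['Ke9', 'Qz1']
```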
+## Key Insight
+> Using a smaller model to synthesize a custom code harness can outperform a much larger model, while also being more cost effective.
+This is the automation of what the survey calls "harness engineering" — turning it from a human practice into an LLM-driven one. Directly relevant to [[lee2026-meta-harness]] which takes this further with outer-loop optimization.
+## Relevance to Our Harness
+Our harness is manually designed (skill files, schemas, gate logic). AutoHarness suggests that harness components could be automatically synthesized from failure traces. The token budget optimization problem (Phase 10-13) is a natural candidate for auto-synthesis.

package/vault/wiki/sources/martin-fowler-harness-engineering.md ADDED

@@ -0,0 +1,73 @@
+---
+type: source
+source_type: article
+author: "Birgitta Böckeler (Thoughtworks)"
+date_published: 2026-04-02
+url: https://martinfowler.com/articles/harness-engineering.html
+confidence: high
+tags:
+  - harness-engineering
+  - context-engineering
+  - agent-trust
+  - feedback-loops
+key_claims:
+  - "Agent = Model + Harness. A harness is everything in an AI agent except the model itself"
+  - "Feedforward guides (before action) + Feedback sensors (after action) form the steering loop"
+  - "Computational controls (deterministic, fast) vs Inferential controls (LLM-based, semantic)"
+  - "Three regulation categories: Maintainability, Architecture Fitness, Behaviour"
+  - "The human's job is to steer the agent by iterating on the harness"
+  - "Harness templates can encode topologies (CRUD service, event processor, data dashboard)"
+---
+# Harness Engineering for Coding Agent Users
+Martin Fowler blog — April 2026. By Birgitta Böckeler, Distinguished Engineer at Thoughtworks.
+## Core Mental Model
+**Agent = Model + Harness**. The harness is everything except the model: system prompts, tools, feedback loops, approval gates, context management.
+Three concentric circles:
+1. **Model** (core)
+2. **Builder harness** (coding agent's built-in infrastructure)
+3. **User harness** (what we build — guides + sensors specific to our use case)
+## Feedforward and Feedback
+| Direction | Purpose | Examples |
+|-----------|---------|----------|
+| **Feedforward (Guides)** | Steer agent *before* it acts | AGENTS.md, Skills, coding conventions, architecture docs |
+| **Feedback (Sensors)** | Observe *after* agent acts, enable self-correction | Linters, tests, review agents, type checkers |
+Two execution types:
+- **Computational**: Deterministic, fast (tests, linters, type checkers, structural analysis)
+- **Inferential**: LLM-based, semantic (AI code review, "LLM as judge")
+## The Steering Loop
+Human's role: Iterate on the harness. When issues recur, improve feedforward guides or feedback sensors to make them less probable. Agents can help build harness components (write structural tests, generate linter rules, create how-to guides).
+## Three Regulation Categories
+1. **Maintainability Harness**: Code quality, style, complexity, test coverage. Computational sensors catch structural issues reliably. LLMs partially address semantic judgment (duplicate code, brute-force fixes) but expensively.
+2. **Architecture Fitness Harness**: Performance requirements, logging standards, observability. Fitness functions as feedback sensors.
+3. **Behaviour Harness**: Functional correctness. The hardest category — still relies heavily on human review and manual testing. AI-generated tests put too much faith in AI.
+## Key Timing Principle: Keep Quality Left
+Checks distributed across the change lifecycle by cost and speed:
+- **Pre-commit**: Linters, fast tests, basic code review agent
+- **Post-integration pipeline**: Mutation testing, broad architecture review
+- **Continuous**: Dead code detection, dependency scanning, SLO monitoring
+## Harness Templates
+For enterprises with common service topologies (CRUD APIs, event processors, dashboards), harness templates bundle guides + sensors for each topology. Teams pick tech stacks partly based on available harnesses.
+## Relevance to Our Harness
+- Our `.pi/skills/` system implements feedforward guides
+- Our `wiki-lint` and `posthog-analyst` skills implement inferential feedback sensors
+- The steering loop is what we're building: improve harness as agents make mistakes
+- We need computational sensors: pre-commit hooks, structural tests, architecture fitness checks
+- Harness templates are our `lean-ctx` and `wiki` patterns — reusable across projects

package/vault/wiki/sources/mcp-architecture-docs.md ADDED

@@ -0,0 +1,13 @@
+---
+type: source
+status: stub
+created: 2026-05-02
+updated: 2026-05-02
+tags: [source, external-doc]
+---
+# MCP Architecture Docs
+Official Model Context Protocol (MCP) architecture documentation. Describes the protocol design, tool registration, and server-client model.
+Referenced in: [[resolved-mcp-tool-preference]]

package/vault/wiki/sources/meng2026-agent-harness-survey.md ADDED

@@ -0,0 +1,79 @@
+---
+type: source
+source_type: paper
+title: "Agent Harness for Large Language Model Agents: A Survey"
+author: "Meng, Qianyu; Wang, Yanan; Chen, Liyi; et al."
+date_published: 2026-04
+url: "https://github.com/Gloriaameng/Awesome-Agent-Harness"
+confidence: high
+key_claims:
+  - "Formalizes harness as six-component tuple H = (E, T, C, S, L, V)"
+  - "Surveys 110+ papers and 23 production systems"
+  - "Harness completeness matrix maps which components each system implements"
+  - "Maps 9 open technical challenges: security, evaluation, protocols, context, tools, memory, planning, multi-agent, compute economics"
+tags:
+  - harness
+  - survey
+  - agent-architecture
+  - llm-agents
+created: 2026-04-30
+updated: 2026-04-30
+status: ingested
+---
+# Agent Harness for Large Language Model Agents: A Survey
+Meng et al., April 2026. 110+ papers, 23 systems analyzed.
+## Core Contribution
+The survey formalizes the **agent execution harness** as a first-class architectural object:
+```
+H = (E, T, C, S, L, V)
+```
+| Component | Symbol | Role |
+|-----------|--------|------|
+| Execution Loop | E | Observe-think-act cycle, termination, error recovery |
+| Tool Registry | T | Typed tool catalog, routing, monitoring |
+| Context Manager | C | Context window control, compaction, retrieval |
+| State Store | S | Persistence across turns/sessions, crash recovery |
+| Lifecycle Hooks | L | Auth, logging, policy enforcement, instrumentation |
+| Evaluation Interface | V | Action trajectories, intermediate states, success signals |
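One way to make the six-component tuple concrete is a record type with a completeness check, loosely in the spirit of the survey's completeness matrix. The field names, the use of `None` for a missing component, and `missing_components` are illustrative assumptions rather than notation from the survey.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Harness:
    """H = (E, T, C, S, L, V); a field is None when a system lacks that component."""
    execution_loop: Optional[object] = None        # E
    tool_registry: Optional[object] = None         # T
    context_manager: Optional[object] = None       # C
    state_store: Optional[object] = None           # S
    lifecycle_hooks: Optional[object] = None       # L
    evaluation_interface: Optional[object] = None  # V


def missing_components(h: Harness) -> List[str]:
    """List the components a harness does not implement, in tuple order."""
    return [f.name for f in fields(h) if getattr(h, f.name) is None]


partial = Harness(execution_loop="observe-think-act", tool_registry={"search": print})
print(missing_components(partial))
```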
+## Key Empirical Evidence
+- **Pi Research**: Grok Code Fast 1 jumped 6.7% → 68.3% on SWE-bench by changing ONLY the harness edit-tool format — model unchanged
+- **OpenAI Codex**: 1M lines of code, 0 hand-written over 5 months — failure attributed to "underspecified environments"
+- **Stripe Minions**: 1,300 PRs/week, 0 human-written code — harness-first engineering
+- **METR**: Benchmark-passing PRs have 24.2pp lower human merge rate, gap widening at 9.6pp/year
+- **Vercel**: Removing 80% of tools helped more than any model upgrade
+## Key Finding
+> The agent execution harness — not the model — is the primary determinant of agent reliability at scale.
+No agent framework can achieve production reliability without implementing ALL six governance components.
+## 9 Open Technical Challenges
+1. Security & Sandboxing — agents intentionally interact with sensitive resources
+2. Evaluation & Benchmarking — benchmark validity crisis (METR gap)
+3. Protocol Standardization — MCP (2-15ms) vs A2A (50-200ms) vs ACP
+4. Runtime Context Management — 1M+ token/task budgets
+5. Tool Use & Registry — schema-based contracts insufficient alone
+6. Memory Architecture — six patterns: flat → hierarchical → episodic → semantic → procedural → graph
+7. Planning & Reasoning — interface design outweighs model capability
+8. Multi-Agent Coordination — Byzantine fault tolerance unsolved
+9. Compute Economics — 13T tokens/week, doubling every 4 weeks
+## Relevance to Our Harness
+Our 8-layer harness (L1-L8) maps to these six components:
+- L1-L4 → E (Execution Loop with verification gates)
+- Tool Schema → T (Tool Registry)
+- Wiki/Knowledge Base → C, S (Context + State)
+- Archon L7 → L (Lifecycle hooks, orchestration)
+- QA/Critics L4-L5 → V (Evaluation)
+Missing from our implementation: formal H=(E,T,C,S,L,V) specification language, cross-harness portability, harness transparency specification.

package/vault/wiki/sources/mindstudio-four-agent-types.md ADDED

@@ -0,0 +1,68 @@
+---
+type: source
+source_type: article
+author: MindStudio
+date_published: 2026-04
+url: https://www.mindstudio.ai/blog/four-types-of-ai-agents-explained/
+confidence: medium
+tags:
+  - agent-types
+  - multi-agent
+  - orchestration
+  - architecture
+key_claims:
+  - "Four distinct agent types with different architectures: Coding Harnesses, Dark Factories, Auto Research, Orchestration"
+  - "Mismatching agent type to task is a primary cause of AI system failure in production"
+  - "Architecture matters more than model choice for multi-agent systems"
+  - "Orchestration agents add overhead — start simple, add complexity only when needed"
+---
+# Four Types of AI Agents Explained
+MindStudio blog — 2026. Classifies production AI agents into four architecturally distinct types.
+## The Four Types
+### 1. Coding Harnesses
+Operate within bounded technical environments (codebases). Tight feedback loop with deterministic execution environment — write code, run tests, see results, revise. Tools: file system access, terminal execution, test runners, code search, version control. Examples: Claude Code, GitHub Copilot Workspace, Devin.
+**Use when**: Task involves writing/editing/debugging code with testable success conditions.
+**Avoid when**: Task requires cross-domain reasoning or multi-agent coordination.
+### 2. Dark Factories
+Fully automated, humanless pipelines processing work at scale. Ingest inputs → process through defined steps → produce outputs. Run unattended on schedules or events.
+**Use when**: High volume, structurally similar inputs, scheduled/event-driven processing.
+**Avoid when**: Tasks require dynamic judgment or high-variance inputs.
+### 3. Auto Research Agents
+Autonomous information gathering: decompose questions, search multiple sources, evaluate relevance, synthesize findings. Dynamic retrieval path — decides what to retrieve based on findings.
+**Use when**: Answer requires multiple sources, retrieval path uncertain upfront, synthesis needed.
+**Avoid when**: Information available in structured database, simple RAG would suffice.
+### 4. Orchestration Agents
+Coordinate other agents: decompose complex goals, assign subtasks to specialists, handle dependencies, assemble final output. Acts as project manager, not implementer.
+**Use when**: Task requires multiple distinct capabilities or parallel workstreams.
+**Avoid when**: Single agent can handle the task — orchestration adds overhead (more LLM calls, more latency, more failure points).
+## Production Pattern: Combined Types
+Real systems combine types under an orchestrator. Example competitive intelligence system:
+1. Orchestrator receives brief
+2. Dispatches auto research agent for web gathering
+3. Dispatches dark factory for high-volume review processing
+4. Dispatches coding harness for structured data analysis
+5. Orchestrator synthesizes all outputs
+## Relevance to Our Harness
+- Our `wiki-autoresearch` skill is an auto research agent
+- Our `Agent` tool (subagent spawning) maps to orchestration
+- Our core coding loop is a coding harness
+- The key insight: match architecture to task, don't over-architect

package/vault/wiki/sources/ms-chat-history-management.md ADDED

@@ -0,0 +1,13 @@
+---
+type: source
+status: stub
+created: 2026-05-02
+updated: 2026-05-02
+tags: [source, external-doc]
+---
+# MS Chat History Management
+Microsoft documentation on chat history management patterns for AI agents. Describes strategies for maintaining conversation state across context windows.
+Referenced in: [[resolved-context-pruning-inplace-vs-restart]]

package/vault/wiki/sources/openai-prompt-guidance.md ADDED

@@ -0,0 +1,104 @@
+---
+type: source
+status: ingested
+source_type: official-documentation
+title: "OpenAI Prompt Guidance (GPT-5.5 through GPT-4.1)"
+author: "OpenAI"
+date_published: 2026-04-01
+date_fetched: 2026-05-01
+url: "https://developers.openai.com/api/docs/guides/prompt-guidance"
+confidence: high
+key_claims:
+  - "GPT-5.5 works best with outcome-first prompts that define the destination, not every step"
+  - "GPT-5.4 requires explicit tool persistence, verification loops, and completion criteria"
+  - "GPT-5.3 Codex ships with a canonical Codex-Max starter prompt optimized for coding agents"
+  - "GPT-5.2+ reasoning effort is the primary tuning knob: none/low/medium/high/xhigh"
+  - "GPT-5.1 introduced apply_patch and shell tools as native API tool types"
+  - "Contradictory instructions damage reasoning models more than older models"
+  - "Structured XML specs like `<instruction_spec>` improved instruction adherence"
+tags:
+  - prompting
+  - openai
+  - gpt
+  - model-specific
+  - harness-design
+created: 2026-05-02
+updated: 2026-05-02
+---
+# OpenAI Prompt Guidance
+Official prompting guide from OpenAI covering all GPT models from GPT-5.5 down to GPT-4.1. Each model generation has a dedicated section with model-specific guidance.
+## Model-Specific Key Findings
+### GPT-5.5
+- **Outcome-first prompts**: Define the destination, let model choose path
+- **Shorter, outcome-oriented prompts**: Legacy process-heavy prompts add noise; instead, "describe what good looks like, what constraints matter, what evidence is available"
+- **Personality + collaboration style**: Separate blocks for tone and task behavior
+- **Preamble for streaming**: Short user-visible update before tool calls
+- **Explicit stopping conditions**: "After each result, ask: can I answer now?"
+- **Retrieval budgets**: Stopping rules for search depth
+- **Phase parameter**: `commentary` vs `final_answer` distinction
+### GPT-5.4
+- **Tool persistence rules**: "Keep calling tools until task complete AND verification passes"
+- **Verification loop**: Check correctness, grounding, formatting, safety before finalizing
+- **Completeness contract**: Internal checklist, track processed items, confirm coverage
+- **Dependency checks**: Don't skip prerequisites because end state seems obvious
+- **Research mode**: Plan → Retrieve → Synthesize in 3 passes
+- **Small model guidance**: gpt-5.4-mini is more literal, needs explicit execution order
+- **Reasoning effort**: Start at none, increase only if evals regress
+### GPT-5.3 Codex
+- **Canonical Codex-Max prompt**: Full starter prompt published by OpenAI
+- **apply_patch**: First-class tool with Responses API integration; 35% fewer failures than manual
+- **Shell tool**: Structured shell_command with workdir, timeout, permissions
+- **Update plan tool**: JSON-based TODO with pending/in_progress/completed states
+- **Phase parameter**: Required; dropping phase causes significant degradation
+- **Parallel tool calls**: `multi_tool_use.parallel` with batch ordering
+- **Compaction**: First-class support for multi-hour reasoning
+- **Agents.md**: Automatically merged directory-scoped instruction files
+- **Personalities**: Friendly vs Pragmatic shipped with Codex CLI
+### GPT-5.2
+- **Verbosity controls**: Output verbosity spec with sentence/bullet limits per task type
+- **Scope drift prevention**: Explicit "no extra features" rules for frontend
+- **Long-context handling**: Force summarization and re-grounding
+- **Ambiguity mitigation**: Uncertainty-and-ambiguity block for hallucination-prone queries
+- **Tool persistence**: "Prefer tools over internal knowledge whenever fresh data needed"
+- **Compaction endpoint**: `/responses/compact` for extending effective context
+- **Reasoning effort migration**: GPT-4o/4.1 → `none`; GPT-5 and GPT-5.1 → keep the same effort setting
+### GPT-5.1
+- **Agentic steerability**: Personality blocks, user update specs, solution persistence
+- **User updates (preambles)**: Frequency, verbosity, tone, content axes; "at least every 6 steps"
+- **Tool preambles**: Brief plan before tools, progress updates during
+- **Reasoning modes**: New `none` mode (no reasoning tokens at all)
+- **apply_patch tool**: Named tool type in Responses API; freeform under the hood
+- **Shell tool**: Native tool type for controlled command execution
+- **Metaprompting**: Model can debug and rewrite its own prompts
+### GPT-5
+- **Agentic eagerness**: Calibrate proactivity vs waiting for guidance
+- **Context gathering**: Batch search → minimal plan → complete task
+- **Frontend development**: Self-reflection rubrics, design system enforcement
+- **Cursor prompt tuning**: Real-world production agent findings
+- **Responses API**: Reasoning persisted between tool calls; 4.3% score improvement
+### GPT-4.1
+- **Literal instruction follower**: More literal than predecessors
+- **Persistence reminders**: "keep going until query completely resolved"
+- **Planning induction**: "plan extensively before each function call"
+- **SWE-bench prompt**: Full agent prompt behind the 55% pass rate published
+- **Diff format**: V4A diff format with context-based (not line-number) matching
+## Cross-Model Patterns
+1. **Structured XML blocks work better than markdown** for complex instruction sets
+2. **Tool definitions should use API tools field**, not manual prompt injection
+3. **Reasoning effort is the primary tuning knob** across all GPT-5+ models
+4. **Verbosity API parameter + prompt-level overrides** for output length control
+5. **Metaprompting is officially recommended** for prompt optimization
+6. **Contradictory prompts hurt reasoning models significantly**
+7. **Small models need more explicit, structured instructions**