ultimate-pi 0.1.2 → 0.1.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.agents/skills/ck-search/SKILL.md +99 -0
- package/.agents/skills/defuddle/SKILL.md +90 -0
- package/.agents/skills/find-skills/SKILL.md +142 -0
- package/.agents/skills/firecrawl/SKILL.md +150 -0
- package/.agents/skills/firecrawl/rules/install.md +82 -0
- package/.agents/skills/firecrawl/rules/security.md +26 -0
- package/.agents/skills/firecrawl-agent/SKILL.md +57 -0
- package/.agents/skills/firecrawl-build-interact/SKILL.md +67 -0
- package/.agents/skills/firecrawl-build-onboarding/SKILL.md +102 -0
- package/.agents/skills/firecrawl-build-onboarding/references/auth-flow.md +39 -0
- package/.agents/skills/firecrawl-build-onboarding/references/project-setup.md +20 -0
- package/.agents/skills/firecrawl-build-onboarding/references/sdk-installation.md +17 -0
- package/.agents/skills/firecrawl-build-scrape/SKILL.md +68 -0
- package/.agents/skills/firecrawl-build-search/SKILL.md +68 -0
- package/.agents/skills/firecrawl-crawl/SKILL.md +58 -0
- package/.agents/skills/firecrawl-download/SKILL.md +69 -0
- package/.agents/skills/firecrawl-interact/SKILL.md +83 -0
- package/.agents/skills/firecrawl-map/SKILL.md +50 -0
- package/.agents/skills/firecrawl-parse/SKILL.md +61 -0
- package/.agents/skills/firecrawl-scrape/SKILL.md +68 -0
- package/.agents/skills/firecrawl-search/SKILL.md +59 -0
- package/.agents/skills/obsidian-bases/SKILL.md +299 -0
- package/.agents/skills/obsidian-markdown/SKILL.md +237 -0
- package/.agents/skills/posthog-analyst/SKILL.md +306 -0
- package/.agents/skills/posthog-analyst/evals/evals.json +23 -0
- package/.agents/skills/wiki/SKILL.md +215 -0
- package/.agents/skills/wiki/references/css-snippets.md +122 -0
- package/.agents/skills/wiki/references/frontmatter.md +107 -0
- package/.agents/skills/wiki/references/git-setup.md +58 -0
- package/.agents/skills/wiki/references/mcp-setup.md +149 -0
- package/.agents/skills/wiki/references/modes.md +259 -0
- package/.agents/skills/wiki/references/plugins.md +96 -0
- package/.agents/skills/wiki/references/rest-api.md +124 -0
- package/.agents/skills/wiki-autoresearch/SKILL.md +211 -0
- package/.agents/skills/wiki-autoresearch/references/program.md +75 -0
- package/.agents/skills/wiki-fold/SKILL.md +204 -0
- package/.agents/skills/wiki-fold/references/fold-template.md +133 -0
- package/.agents/skills/wiki-ingest/SKILL.md +288 -0
- package/.agents/skills/wiki-lint/SKILL.md +183 -0
- package/.agents/skills/wiki-query/SKILL.md +176 -0
- package/.agents/skills/wiki-save/SKILL.md +128 -0
- package/.ckignore +41 -0
- package/.env.example +9 -0
- package/.github/workflows/lint.yml +33 -0
- package/.github/workflows/publish-github-packages.yml +35 -0
- package/.github/workflows/publish-npm.yml +1 -1
- package/.pi/SYSTEM.md +107 -40
- package/.pi/agents/pi-pi/agent-expert.md +205 -0
- package/.pi/agents/pi-pi/cli-expert.md +47 -0
- package/.pi/agents/pi-pi/config-expert.md +67 -0
- package/.pi/agents/pi-pi/ext-expert.md +53 -0
- package/.pi/agents/pi-pi/keybinding-expert.md +123 -0
- package/.pi/agents/pi-pi/pi-orchestrator.md +103 -0
- package/.pi/agents/pi-pi/prompt-expert.md +83 -0
- package/.pi/agents/pi-pi/skill-expert.md +52 -0
- package/.pi/agents/pi-pi/theme-expert.md +46 -0
- package/.pi/agents/pi-pi/tui-expert.md +100 -0
- package/.pi/agents/rethink.md +140 -0
- package/.pi/agents/wiki-ingest.md +67 -0
- package/.pi/agents/wiki-lint.md +75 -0
- package/.pi/auto-commit.json +20 -0
- package/.pi/extensions/banner.png +0 -0
- package/.pi/extensions/ck-enforce.ts +216 -0
- package/.pi/extensions/custom-footer.ts +308 -0
- package/.pi/extensions/custom-header.ts +116 -0
- package/.pi/extensions/dotenv-loader.ts +170 -0
- package/.pi/internal/cursor-sdk-transcript-parser.ts +59 -0
- package/.pi/model-router.json +95 -0
- package/.pi/npm/.gitignore +2 -0
- package/.pi/prompts/git-sync.md +124 -0
- package/.pi/prompts/harness-setup.md +509 -0
- package/.pi/prompts/save.md +16 -0
- package/.pi/prompts/wiki-autoresearch.md +19 -0
- package/.pi/prompts/wiki.md +23 -0
- package/.pi/providers/cursor-sdk-provider.test.mjs +476 -0
- package/.pi/providers/cursor-sdk-provider.ts +1085 -0
- package/.pi/settings.json +14 -4
- package/.pi/skills/agent-router/SKILL.md +174 -0
- package/.pi/sounds/alert/1-kaching-track.mp3 +0 -0
- package/.pi/sounds/error/1-ksi-wth-track.mp3 +0 -0
- package/.pi/sounds/error/2-smash-track.mp3 +0 -0
- package/.pi/sounds/error/3-buzzer-track.mp3 +0 -0
- package/.pi/sounds/notification/1-soft-notification-track.mp3 +0 -0
- package/.pi/sounds/project-sounds.json +25 -0
- package/.pi/sounds/reminder/1-soft-notification-track.mp3 +0 -0
- package/.pi/sounds/success/1-tada-track.mp3 +0 -0
- package/.pi/sounds/success/2-jobs-done-track.mp3 +0 -0
- package/.pi/sounds/success/3-yay-track.mp3 +0 -0
- package/CONTRIBUTING.md +116 -0
- package/README.md +32 -39
- package/biome.json +34 -0
- package/firecrawl/.env.template +58 -0
- package/firecrawl/README.md +49 -0
- package/firecrawl/docker-compose.yaml +201 -0
- package/firecrawl/searxng/searxng.env +3 -0
- package/firecrawl/searxng/settings.yml +85 -0
- package/lefthook.yml +8 -0
- package/package.json +55 -24
- package/vault/AGENTS.md +37 -0
- package/vault/wiki/_templates/comparison.md +39 -0
- package/vault/wiki/_templates/concept.md +40 -0
- package/vault/wiki/_templates/decision.md +21 -0
- package/vault/wiki/_templates/entity.md +32 -0
- package/vault/wiki/_templates/flow.md +14 -0
- package/vault/wiki/_templates/module.md +18 -0
- package/vault/wiki/_templates/question.md +31 -0
- package/vault/wiki/_templates/source.md +39 -0
- package/vault/wiki/concepts/AST-Aware Code Chunking.md +44 -0
- package/vault/wiki/concepts/Build-Time Prompt Compilation.md +107 -0
- package/vault/wiki/concepts/Context Engine (AI Coding).md +47 -0
- package/vault/wiki/concepts/Context-Aware System Reminders.md +61 -0
- package/vault/wiki/concepts/Contextualized Text Embedding.md +42 -0
- package/vault/wiki/concepts/Contractor vs Employee AI Model.md +55 -0
- package/vault/wiki/concepts/Dual-Model Agent Architecture.md +65 -0
- package/vault/wiki/concepts/Late Chunking vs Early Chunking.md +43 -0
- package/vault/wiki/concepts/Majority Vote Ensembling.md +68 -0
- package/vault/wiki/concepts/Meta-Harness.md +16 -0
- package/vault/wiki/concepts/Multi-Agent AI Coding Architecture.md +75 -0
- package/vault/wiki/concepts/Prompt Enhancement.md +90 -0
- package/vault/wiki/concepts/Prompt Renderer.md +89 -0
- package/vault/wiki/concepts/Semantic Codebase Indexing.md +67 -0
- package/vault/wiki/concepts/additive-config-hierarchy.md +16 -0
- package/vault/wiki/concepts/agent-artifacts-verifiable-deliverables.md +71 -0
- package/vault/wiki/concepts/agent-browser-browser-automation.md +99 -0
- package/vault/wiki/concepts/agent-codebase-interface.md +43 -0
- package/vault/wiki/concepts/agent-harness-architecture.md +67 -0
- package/vault/wiki/concepts/agent-loop-detection-patterns.md +133 -0
- package/vault/wiki/concepts/agent-search-enforcement.md +126 -0
- package/vault/wiki/concepts/agent-skills-ecosystem.md +74 -0
- package/vault/wiki/concepts/agent-skills-pattern.md +68 -0
- package/vault/wiki/concepts/agentic-harness-context-enforcement.md +91 -0
- package/vault/wiki/concepts/agentic-harness.md +34 -0
- package/vault/wiki/concepts/agentic-orchestration-pipeline.md +56 -0
- package/vault/wiki/concepts/agentic-search-no-embeddings.md +18 -0
- package/vault/wiki/concepts/anthropic-context-engineering.md +13 -0
- package/vault/wiki/concepts/antigravity-agent-first-architecture.md +61 -0
- package/vault/wiki/concepts/ast-compression.md +19 -0
- package/vault/wiki/concepts/ast-truncation.md +66 -0
- package/vault/wiki/concepts/barrel-files.md +37 -0
- package/vault/wiki/concepts/browser-harness-agent.md +41 -0
- package/vault/wiki/concepts/browser-subagent-visual-verification.md +82 -0
- package/vault/wiki/concepts/codebase-intelligence-ecosystem-comparison.md +192 -0
- package/vault/wiki/concepts/codebase-intelligence-harness-integration.md +161 -0
- package/vault/wiki/concepts/codebase-to-context-ingestion.md +46 -0
- package/vault/wiki/concepts/codex-harness-innovations.md +147 -0
- package/vault/wiki/concepts/consensus-debate-flow.md +17 -0
- package/vault/wiki/concepts/consensus-debate.md +206 -0
- package/vault/wiki/concepts/content-addressed-spec-identity.md +166 -0
- package/vault/wiki/concepts/context-anxiety.md +57 -0
- package/vault/wiki/concepts/context-compression-techniques.md +19 -0
- package/vault/wiki/concepts/context-continuity.md +22 -0
- package/vault/wiki/concepts/context-drift-in-agents.md +106 -0
- package/vault/wiki/concepts/context-engineering.md +62 -0
- package/vault/wiki/concepts/context-folding.md +67 -0
- package/vault/wiki/concepts/context-mode.md +38 -0
- package/vault/wiki/concepts/cursor-harness-innovations.md +107 -0
- package/vault/wiki/concepts/deterministic-session-compaction.md +79 -0
- package/vault/wiki/concepts/drift-detection-unified.md +296 -0
- package/vault/wiki/concepts/execution-feedback-loop.md +46 -0
- package/vault/wiki/concepts/feedforward-feedback-harness.md +60 -0
- package/vault/wiki/concepts/five-root-cause-metrics-sentrux.md +40 -0
- package/vault/wiki/concepts/fork-safe-spec-storage.md +89 -0
- package/vault/wiki/concepts/fts5-sandbox.md +19 -0
- package/vault/wiki/concepts/fuzzy-edit-matching.md +71 -0
- package/vault/wiki/concepts/gemini-cli-architecture.md +104 -0
- package/vault/wiki/concepts/generator-evaluator-architecture.md +64 -0
- package/vault/wiki/concepts/guardian-agent-pattern.md +67 -0
- package/vault/wiki/concepts/harness-configuration-layers.md +89 -0
- package/vault/wiki/concepts/harness-control-frameworks.md +155 -0
- package/vault/wiki/concepts/harness-engineering-first-principles.md +90 -0
- package/vault/wiki/concepts/harness-h-formalism.md +53 -0
- package/vault/wiki/concepts/hybrid-code-search.md +61 -0
- package/vault/wiki/concepts/inline-post-edit-validation.md +112 -0
- package/vault/wiki/concepts/legendary-engineering-patterns-harness.md +110 -0
- package/vault/wiki/concepts/lifecycle-hooks.md +94 -0
- package/vault/wiki/concepts/mcp-tool-routing.md +102 -0
- package/vault/wiki/concepts/memory-system-of-record-vs-ephemeral-cache.md +47 -0
- package/vault/wiki/concepts/meta-agent-context-pruning.md +151 -0
- package/vault/wiki/concepts/model-adaptive-harness.md +122 -0
- package/vault/wiki/concepts/model-routing-agents.md +101 -0
- package/vault/wiki/concepts/monorepo-architecture.md +45 -0
- package/vault/wiki/concepts/multi-agent-specialization.md +61 -0
- package/vault/wiki/concepts/permission-subsystem.md +16 -0
- package/vault/wiki/concepts/pi-messenger-analysis.md +243 -0
- package/vault/wiki/concepts/pi-vscode-extension-landscape.md +37 -0
- package/vault/wiki/concepts/policy-engine-pattern.md +78 -0
- package/vault/wiki/concepts/progressive-disclosure-agents.md +53 -0
- package/vault/wiki/concepts/progressive-skill-disclosure.md +17 -0
- package/vault/wiki/concepts/provider-native-prompting.md +203 -0
- package/vault/wiki/concepts/quality-signal-sentrux.md +37 -0
- package/vault/wiki/concepts/repo-map-ranking.md +42 -0
- package/vault/wiki/concepts/result-monad-error-handling.md +47 -0
- package/vault/wiki/concepts/safety-defense-in-depth.md +83 -0
- package/vault/wiki/concepts/sandbox-os-enforcement.md +18 -0
- package/vault/wiki/concepts/selective-debate-routing.md +70 -0
- package/vault/wiki/concepts/self-evolving-harness.md +60 -0
- package/vault/wiki/concepts/sentrux-mcp-integration.md +36 -0
- package/vault/wiki/concepts/sentrux-rules-engine.md +49 -0
- package/vault/wiki/concepts/shell-pattern-compression.md +24 -0
- package/vault/wiki/concepts/skill-first-architecture.md +166 -0
- package/vault/wiki/concepts/structured-compaction.md +78 -0
- package/vault/wiki/concepts/subagent-orchestration.md +17 -0
- package/vault/wiki/concepts/subagent-worktree-isolation.md +68 -0
- package/vault/wiki/concepts/superpowers-methodology.md +78 -0
- package/vault/wiki/concepts/think-in-code.md +73 -0
- package/vault/wiki/concepts/ts-execution-layer.md +100 -0
- package/vault/wiki/concepts/typescript-strict-mode.md +37 -0
- package/vault/wiki/concepts/vcc-conversation-compaction-for-pi.md +51 -0
- package/vault/wiki/concepts/verification-drift-detection.md +19 -0
- package/vault/wiki/consensus/consensus-records.md +58 -0
- package/vault/wiki/decisions/2026-04-30-pi-lean-ctx-native.md +122 -0
- package/vault/wiki/decisions/adr-008.md +40 -0
- package/vault/wiki/decisions/adr-009.md +46 -0
- package/vault/wiki/decisions/adr-010.md +55 -0
- package/vault/wiki/decisions/adr-011.md +165 -0
- package/vault/wiki/decisions/adr-012.md +102 -0
- package/vault/wiki/decisions/adr-013.md +59 -0
- package/vault/wiki/decisions/adr-014.md +73 -0
- package/vault/wiki/decisions/adr-015.md +81 -0
- package/vault/wiki/decisions/adr-016.md +91 -0
- package/vault/wiki/decisions/adr-017.md +79 -0
- package/vault/wiki/decisions/adr-018.md +100 -0
- package/vault/wiki/decisions/adr-019.md +75 -0
- package/vault/wiki/decisions/adr-020.md +106 -0
- package/vault/wiki/decisions/adr-021.md +86 -0
- package/vault/wiki/decisions/adr-022.md +113 -0
- package/vault/wiki/decisions/adr-023.md +113 -0
- package/vault/wiki/decisions/adr-024.md +73 -0
- package/vault/wiki/decisions/adr-025.md +130 -0
- package/vault/wiki/decisions/adr-026.md +56 -0
- package/vault/wiki/decisions/colocate-wiki.md +34 -0
- package/vault/wiki/entities/Anders Hejlsberg.md +29 -0
- package/vault/wiki/entities/Anthropic.md +17 -0
- package/vault/wiki/entities/Augment Code.md +49 -0
- package/vault/wiki/entities/Bjarne Stroustrup.md +26 -0
- package/vault/wiki/entities/Bolt.new (StackBlitz).md +39 -0
- package/vault/wiki/entities/Boris Cherny.md +11 -0
- package/vault/wiki/entities/Claude Code.md +19 -0
- package/vault/wiki/entities/Dennis Ritchie.md +26 -0
- package/vault/wiki/entities/Emergent Labs.md +32 -0
- package/vault/wiki/entities/Google Cloud.md +16 -0
- package/vault/wiki/entities/Guido van Rossum.md +28 -0
- package/vault/wiki/entities/Ken Thompson.md +28 -0
- package/vault/wiki/entities/Lee et al.md +16 -0
- package/vault/wiki/entities/Linus Torvalds.md +28 -0
- package/vault/wiki/entities/Lovable (company).md +40 -0
- package/vault/wiki/entities/Martin Fowler.md +16 -0
- package/vault/wiki/entities/Meng et al.md +16 -0
- package/vault/wiki/entities/OpenAI.md +16 -0
- package/vault/wiki/entities/Rocket.new.md +38 -0
- package/vault/wiki/entities/VILA-Lab.md +15 -0
- package/vault/wiki/entities/autodev-codebase.md +18 -0
- package/vault/wiki/entities/ck-tool.md +59 -0
- package/vault/wiki/entities/codesearch.md +18 -0
- package/vault/wiki/entities/disler-indydevdan.md +33 -0
- package/vault/wiki/entities/gsd-get-shit-done.md +56 -0
- package/vault/wiki/entities/javascript-runtimes.md +48 -0
- package/vault/wiki/entities/jesse-vincent.md +38 -0
- package/vault/wiki/entities/lean-ctx.md +32 -0
- package/vault/wiki/entities/opendev.md +41 -0
- package/vault/wiki/entities/ops-codegraph-tool.md +18 -0
- package/vault/wiki/entities/pi-coding-agent.md +53 -0
- package/vault/wiki/entities/sentrux.md +54 -0
- package/vault/wiki/entities/vgrep-tool.md +57 -0
- package/vault/wiki/entities/vitest.md +41 -0
- package/vault/wiki/flows/harness-wiki-pipeline.md +204 -0
- package/vault/wiki/hot.md +932 -0
- package/vault/wiki/index.md +437 -0
- package/vault/wiki/log.md +418 -0
- package/vault/wiki/meta/dashboard.md +30 -0
- package/vault/wiki/meta/lint-report-2026-04-30.md +86 -0
- package/vault/wiki/meta/lint-report-2026-05-02.md +251 -0
- package/vault/wiki/meta/overview.canvas +43 -0
- package/vault/wiki/modules/adversarial-verification.md +57 -0
- package/vault/wiki/modules/automated-observability.md +54 -0
- package/vault/wiki/modules/bench.md +20 -0
- package/vault/wiki/modules/extensions.md +23 -0
- package/vault/wiki/modules/grounding-checkpoints.md +62 -0
- package/vault/wiki/modules/harness-implementation-plan.md +345 -0
- package/vault/wiki/modules/harness-wiki-skill-mapping.md +135 -0
- package/vault/wiki/modules/harness.md +86 -0
- package/vault/wiki/modules/persistent-memory.md +85 -0
- package/vault/wiki/modules/schema-orchestration.md +68 -0
- package/vault/wiki/modules/skills.md +27 -0
- package/vault/wiki/modules/spec-hardening.md +58 -0
- package/vault/wiki/modules/structured-planning.md +53 -0
- package/vault/wiki/modules/think-in-code-enforcement.md +153 -0
- package/vault/wiki/modules/wiki-query-interface.md +64 -0
- package/vault/wiki/overview.md +51 -0
- package/vault/wiki/questions/Research-pi-vs-claude-code-agentic-orchestration-pipeline.md +87 -0
- package/vault/wiki/questions/Research-sentrux-dev.md +123 -0
- package/vault/wiki/questions/Research-superpowers-skill-for-agentic-coding-agents.md +164 -0
- package/vault/wiki/questions/Research: Augment Code Context Engine.md +244 -0
- package/vault/wiki/questions/Research: Automating Software Engineering - Lovable, Bolt, Emergent, Rocket.md +112 -0
- package/vault/wiki/questions/Research: Claude Code State-of-the-Art Harness Improvements.md +209 -0
- package/vault/wiki/questions/Research: Codex State-of-the-Art Harness Improvements.md +99 -0
- package/vault/wiki/questions/Research: Engineering Workflows of Legendary Programmers and AI Harness Mapping.md +107 -0
- package/vault/wiki/questions/Research: Fallow Codebase Intelligence Harness Integration.md +72 -0
- package/vault/wiki/questions/Research: Gemini CLI SOTA Harness Integration.md +166 -0
- package/vault/wiki/questions/Research: GitHub Issues as Harness Spec Storage.md +188 -0
- package/vault/wiki/questions/Research: Google Antigravity Harness Integration.md +120 -0
- package/vault/wiki/questions/Research: Meta-Agent Context Drift Detection.md +236 -0
- package/vault/wiki/questions/Research: Model-Adaptive Agent Harness Design.md +95 -0
- package/vault/wiki/questions/Research: Model-Specific Prompting Guides.md +165 -0
- package/vault/wiki/questions/Research: Prompt Renderer for Multi-Model Agent Harness.md +216 -0
- package/vault/wiki/questions/Research: Skill-First Harness Architecture.md +91 -0
- package/vault/wiki/questions/Research: TypeScript Best Practices and Codebase Structure.md +88 -0
- package/vault/wiki/questions/Research: TypeScript Execution Layer for Agent Tool Calling.md +81 -0
- package/vault/wiki/questions/Research: claude-mem over Obsidian for Harness Layer.md +71 -0
- package/vault/wiki/questions/Research: claude-mem over obsidian wiki as the knowledge base for our agentic harness pipeline. think from first principles. does this replace or complement our current setup? no hard feelings about previous decisions. gimme accurate points.md +80 -0
- package/vault/wiki/questions/Research: context-mode vs lean-ctx.md +72 -0
- package/vault/wiki/questions/Research: cursor.sh Harness Innovations.md +92 -0
- package/vault/wiki/questions/Research: executor.sh Harness Integration.md +170 -0
- package/vault/wiki/questions/Research: how GSD fits into our coding harness setup.md +97 -0
- package/vault/wiki/questions/Research: how claude-mem fits into our workflow. and whether it should replace obsidian in the codebase. no hard feelings about previous actions, rethink from first principles always.md +80 -0
- package/vault/wiki/questions/Research: pi-vcc.md +113 -0
- package/vault/wiki/questions/Research: semantic code search tools.md +69 -0
- package/vault/wiki/questions/Research: vcc extension for pi coding agent.md +73 -0
- package/vault/wiki/questions/how-to-enable-semantic-code-search-now.md +111 -0
- package/vault/wiki/questions/mvp-implementation-blueprint.md +552 -0
- package/vault/wiki/questions/research-agent-first-codebase-exploration.md +199 -0
- package/vault/wiki/questions/research-agentic-coding-harness-latest-papers.md +142 -0
- package/vault/wiki/questions/research-gitingest-gitreverse-integration.md +100 -0
- package/vault/wiki/questions/research-wozcode-token-reduction.md +67 -0
- package/vault/wiki/questions/resolved-context-pruning-inplace-vs-restart.md +95 -0
- package/vault/wiki/questions/resolved-context-window-economics.md +167 -0
- package/vault/wiki/questions/resolved-imad-debate-gating-transfer.md +126 -0
- package/vault/wiki/questions/resolved-mcp-tool-preference.md +112 -0
- package/vault/wiki/questions/resolved-small-model-meta-agents.md +107 -0
- package/vault/wiki/questions/resolved-treesitter-dynamic-languages.md +95 -0
- package/vault/wiki/sources/Auggie Context MCP Server.md +63 -0
- package/vault/wiki/sources/Augment Code Codacy AI Giants.md +61 -0
- package/vault/wiki/sources/Augment Code MCP SiliconAngle.md +49 -0
- package/vault/wiki/sources/Augment Code WorkOS ERC 2025.md +55 -0
- package/vault/wiki/sources/Augment Context Engine Official.md +71 -0
- package/vault/wiki/sources/Augment SWE-bench Agent GitHub.md +74 -0
- package/vault/wiki/sources/Augment SWE-bench Pro Blog.md +58 -0
- package/vault/wiki/sources/Source: AgentBus Jinja2 Prompt Pipelines.md +75 -0
- package/vault/wiki/sources/Source: Arxiv — Don't Break the Cache.md +85 -0
- package/vault/wiki/sources/Source: Augment - Harness Engineering for AI Coding Agents.md +58 -0
- package/vault/wiki/sources/Source: Blake Crosley Agent Architecture Guide.md +100 -0
- package/vault/wiki/sources/Source: Bolt.new Architecture & Case Study.md +75 -0
- package/vault/wiki/sources/Source: Build-Time Prompt Compilation Architecture.md +107 -0
- package/vault/wiki/sources/Source: Claude API Agent Skills Overview.md +70 -0
- package/vault/wiki/sources/Source: Gemini CLI Changelogs.md +88 -0
- package/vault/wiki/sources/Source: Google Blog - Gemini CLI Announcement.md +57 -0
- package/vault/wiki/sources/Source: Google Gemini CLI Architecture Docs.md +53 -0
- package/vault/wiki/sources/Source: LangChain - Anatomy of Agent Harness.md +65 -0
- package/vault/wiki/sources/Source: Lovable Architecture & Clone Analysis.md +83 -0
- package/vault/wiki/sources/Source: Martin Fowler - Harness Engineering.md +70 -0
- package/vault/wiki/sources/Source: OpenAI Harness Engineering Five Principles.md +58 -0
- package/vault/wiki/sources/Source: OpenAI Harness Engineering — 0 Lines of Human Code.md +101 -0
- package/vault/wiki/sources/Source: OpenDev — Building AI Coding Agents for the Terminal.md +100 -0
- package/vault/wiki/sources/Source: Render AI Coding Agents Benchmark 2025.md +53 -0
- package/vault/wiki/sources/Source: Rocket.new — Vibe Solutioning Platform.md +70 -0
- package/vault/wiki/sources/Source: SwirlAI Agent Skills Progressive Disclosure.md +71 -0
- package/vault/wiki/sources/Source: TianPan Prompt Caching Architecture.md +89 -0
- package/vault/wiki/sources/Source: Vercel Labs agent-browser.md +155 -0
- package/vault/wiki/sources/Source: browser-harness CDP Harness.md +126 -0
- package/vault/wiki/sources/agent-drift-academic-paper.md +79 -0
- package/vault/wiki/sources/aider-repomap-tree-sitter.md +42 -0
- package/vault/wiki/sources/anthropic-compaction-api.md +58 -0
- package/vault/wiki/sources/anthropic-effective-harnesses.md +42 -0
- package/vault/wiki/sources/anthropic-prompt-best-practices.md +100 -0
- package/vault/wiki/sources/anthropic2026-harness-design.md +63 -0
- package/vault/wiki/sources/barrel-files-tkdodo.md +38 -0
- package/vault/wiki/sources/birth-of-unix-kernighan-interview.md +57 -0
- package/vault/wiki/sources/bockeler2026-harness-engineering.md +69 -0
- package/vault/wiki/sources/cast-code-chunking-paper.md +50 -0
- package/vault/wiki/sources/ck-semantic-search.md +78 -0
- package/vault/wiki/sources/claude-code-architecture-karaxai-2026.md +71 -0
- package/vault/wiki/sources/claude-code-architecture-qubytes-2026.md +50 -0
- package/vault/wiki/sources/claude-code-architecture-vila-lab-2026.md +64 -0
- package/vault/wiki/sources/claude-code-security-architecture-penligent-2026.md +70 -0
- package/vault/wiki/sources/claude-context-editing-docs.md +13 -0
- package/vault/wiki/sources/cloudflare-codemode.md +63 -0
- package/vault/wiki/sources/code-chunk-library-supermemory.md +63 -0
- package/vault/wiki/sources/codeact-apple-2024.md +62 -0
- package/vault/wiki/sources/codex-dsc-rfc-8573.md +41 -0
- package/vault/wiki/sources/codex-open-source-agent-2026.md +110 -0
- package/vault/wiki/sources/coir-code-retrieval-benchmark.md +51 -0
- package/vault/wiki/sources/colinmcnamara-context-optimization-codemode.md +48 -0
- package/vault/wiki/sources/context-folding-paper.md +61 -0
- package/vault/wiki/sources/context-mode-website.md +63 -0
- package/vault/wiki/sources/cursor-agent-best-practices-2026.md +62 -0
- package/vault/wiki/sources/cursor-fork-29b-2025.md +50 -0
- package/vault/wiki/sources/cursor-harness-april-2026.md +76 -0
- package/vault/wiki/sources/cursor-instant-apply-2024.md +45 -0
- package/vault/wiki/sources/cursor-shadow-workspace-2024.md +52 -0
- package/vault/wiki/sources/cursor-shipped-coding-agent-2026.md +53 -0
- package/vault/wiki/sources/cursor-vs-antigravity-2026.md +51 -0
- package/vault/wiki/sources/disler-pi-vs-claude-code.md +69 -0
- package/vault/wiki/sources/distill-deterministic-context-compression.md +53 -0
- package/vault/wiki/sources/embedding-models-benchmark-supermemory-2025.md +48 -0
- package/vault/wiki/sources/executor-rhyssullivan.md +122 -0
- package/vault/wiki/sources/fallow-rs-codebase-intelligence.md +125 -0
- package/vault/wiki/sources/fan2025-imad.md +60 -0
- package/vault/wiki/sources/forgecode-gpt5-agent-improvements.md +63 -0
- package/vault/wiki/sources/gemini-3-prompting-guide.md +78 -0
- package/vault/wiki/sources/gh-cli-sub-issue-rfc.md +50 -0
- package/vault/wiki/sources/gh-sub-issue-extension.md +72 -0
- package/vault/wiki/sources/github-fork-issues-discussion.md +44 -0
- package/vault/wiki/sources/github-issue-dependencies-docs.md +49 -0
- package/vault/wiki/sources/github-sub-issues-docs.md +51 -0
- package/vault/wiki/sources/gitingest.md +91 -0
- package/vault/wiki/sources/gitreverse.md +63 -0
- package/vault/wiki/sources/google-antigravity-official-blog.md +47 -0
- package/vault/wiki/sources/google-antigravity-wikipedia.md +53 -0
- package/vault/wiki/sources/gsd-codecentric-deep-dive.md +57 -0
- package/vault/wiki/sources/gsd-github-repo.md +51 -0
- package/vault/wiki/sources/gsd-hn-discussion.md +59 -0
- package/vault/wiki/sources/guido-python-design-philosophy.md +56 -0
- package/vault/wiki/sources/hejlsberg-7-learnings.md +48 -0
- package/vault/wiki/sources/ironclaw-drift-monitor.md +80 -0
- package/vault/wiki/sources/langsight-loop-detection.md +80 -0
- package/vault/wiki/sources/leanctx-website.md +69 -0
- package/vault/wiki/sources/lee2026-meta-harness.md +59 -0
- package/vault/wiki/sources/linux-kernel-coding-workflow.md +50 -0
- package/vault/wiki/sources/lou2026-autoharness.md +53 -0
- package/vault/wiki/sources/martin-fowler-harness-engineering.md +73 -0
- package/vault/wiki/sources/mcp-architecture-docs.md +13 -0
- package/vault/wiki/sources/meng2026-agent-harness-survey.md +79 -0
- package/vault/wiki/sources/mindstudio-four-agent-types.md +68 -0
- package/vault/wiki/sources/ms-chat-history-management.md +13 -0
- package/vault/wiki/sources/openai-prompt-guidance.md +104 -0
- package/vault/wiki/sources/openclaw-session-pruning.md +13 -0
- package/vault/wiki/sources/opencode-dcp.md +13 -0
- package/vault/wiki/sources/opendev-arxiv-2603.05344v1.md +79 -0
- package/vault/wiki/sources/openhands-platform.md +39 -0
- package/vault/wiki/sources/oss-guide-codebase-exploration.md +53 -0
- package/vault/wiki/sources/pi-compaction-extensions-ecosystem.md +102 -0
- package/vault/wiki/sources/pi-context-prune-github-repo.md +38 -0
- package/vault/wiki/sources/pi-mono-compaction-docs.md +38 -0
- package/vault/wiki/sources/pi-omni-compact-github-repo.md +50 -0
- package/vault/wiki/sources/pi-rtk-optimizer-github-repo.md +45 -0
- package/vault/wiki/sources/pi-vcc-github-repo.md +69 -0
- package/vault/wiki/sources/pi-vscode-marketplace.md +41 -0
- package/vault/wiki/sources/pi-vscode-model-provider-marketplace.md +39 -0
- package/vault/wiki/sources/py-tree-sitter.md +13 -0
- package/vault/wiki/sources/sentrux-dev-landing.md +40 -0
- package/vault/wiki/sources/sentrux-docs-pro-architecture.md +75 -0
- package/vault/wiki/sources/sentrux-docs-quality-signal.md +46 -0
- package/vault/wiki/sources/sentrux-docs-root-cause-metrics.md +57 -0
- package/vault/wiki/sources/sentrux-docs-rules-engine.md +58 -0
- package/vault/wiki/sources/sentrux-github-repo.md +56 -0
- package/vault/wiki/sources/superpowers-github-repo.md +56 -0
- package/vault/wiki/sources/superpowers-release-blog.md +54 -0
- package/vault/wiki/sources/superpowers-termdock-analysis.md +45 -0
- package/vault/wiki/sources/swe-agent-aci.md +42 -0
- package/vault/wiki/sources/swe-bench.md +45 -0
- package/vault/wiki/sources/swe-pruner-context-pruning.md +13 -0
- package/vault/wiki/sources/think-in-code-blog.md +48 -0
- package/vault/wiki/sources/tree-sitter-docs.md +13 -0
- package/vault/wiki/sources/ts-best-practices-2025-devto.md +42 -0
- package/vault/wiki/sources/ts-folder-structure-mingyang.md +58 -0
- package/vault/wiki/sources/ts-monorepo-koerselman.md +44 -0
- package/vault/wiki/sources/ts-result-error-handling-kkalamarski.md +52 -0
- package/vault/wiki/sources/ts-runtimes-comparison-betterstack.md +42 -0
- package/vault/wiki/sources/ts-strict-mode-rishikc.md +43 -0
- package/vault/wiki/sources/unix-philosophy.md +48 -0
- package/vault/wiki/sources/vectara-chunking-vs-embedding-naacl2025.md +39 -0
- package/vault/wiki/sources/vectara-guardian-agents.md +79 -0
- package/vault/wiki/sources/vgrep-semantic-search.md +76 -0
- package/vault/wiki/sources/vitest-official.md +41 -0
- package/vault/wiki/sources/vscode-pi-community-extension.md +40 -0
- package/vault/wiki/sources/wozcode.md +79 -0
- package/.agents/skills/compress/SKILL.md +0 -111
- package/.agents/skills/compress/scripts/__init__.py +0 -9
- package/.agents/skills/compress/scripts/__main__.py +0 -3
- package/.agents/skills/compress/scripts/benchmark.py +0 -78
- package/.agents/skills/compress/scripts/cli.py +0 -73
- package/.agents/skills/compress/scripts/compress.py +0 -227
- package/.agents/skills/compress/scripts/detect.py +0 -121
- package/.agents/skills/compress/scripts/validate.py +0 -189
- package/.agents/skills/emil-design-eng/SKILL.md +0 -679
- package/.agents/skills/lean-ctx/SKILL.md +0 -149
- package/.agents/skills/lean-ctx/scripts/install.sh +0 -95
- package/.agents/skills/scrapling-official/LICENSE.txt +0 -28
- package/.agents/skills/scrapling-official/SKILL.md +0 -390
- package/.agents/skills/scrapling-official/examples/01_fetcher_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/02_dynamic_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/03_stealthy_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/04_spider.py +0 -58
- package/.agents/skills/scrapling-official/examples/README.md +0 -45
- package/.agents/skills/scrapling-official/references/fetching/choosing.md +0 -78
- package/.agents/skills/scrapling-official/references/fetching/dynamic.md +0 -352
- package/.agents/skills/scrapling-official/references/fetching/static.md +0 -432
- package/.agents/skills/scrapling-official/references/fetching/stealthy.md +0 -255
- package/.agents/skills/scrapling-official/references/mcp-server.md +0 -214
- package/.agents/skills/scrapling-official/references/migrating_from_beautifulsoup.md +0 -86
- package/.agents/skills/scrapling-official/references/parsing/adaptive.md +0 -212
- package/.agents/skills/scrapling-official/references/parsing/main_classes.md +0 -586
- package/.agents/skills/scrapling-official/references/parsing/selection.md +0 -494
- package/.agents/skills/scrapling-official/references/spiders/advanced.md +0 -344
- package/.agents/skills/scrapling-official/references/spiders/architecture.md +0 -94
- package/.agents/skills/scrapling-official/references/spiders/getting-started.md +0 -164
- package/.agents/skills/scrapling-official/references/spiders/proxy-blocking.md +0 -235
- package/.agents/skills/scrapling-official/references/spiders/requests-responses.md +0 -196
- package/.agents/skills/scrapling-official/references/spiders/sessions.md +0 -205
- package/PLAN.md +0 -11
- package/extensions/lean-ctx-enforce.ts +0 -166
- package/skills-lock.json +0 -35
- package/wiki/README.md +0 -19
- package/wiki/decisions/0001-establish-project-wiki-and-decision-record-format.md +0 -25
- package/wiki/decisions/0002-add-project-banner-to-readme.md +0 -26
- package/wiki/decisions/0003-remove-redundant-readme-title-heading.md +0 -26
- package/wiki/decisions/0004-publish-package-to-npm-as-ultimate-pi.md +0 -26
- package/wiki/decisions/0005-automate-npm-publish-with-github-actions.md +0 -27
- package/wiki/decisions/0006-switch-to-npm-trusted-publishing.md +0 -26
- package/wiki/decisions/0007-use-absolute-banner-url-for-npm-readme-rendering.md +0 -26
- package/wiki/decisions/0008-rename-banner-asset-for-cache-busting.md +0 -26
- package/wiki/decisions/0009-force-oidc-path-by-clearing-node-auth-token-in-publish-step.md +0 -25
- package/wiki/decisions/0010-simplify-setup-node-for-npm-trusted-publishing.md +0 -26
- package/wiki/decisions/0011-add-noop-workflow-change-to-force-fresh-publish-run.md +0 -25
- package/wiki/decisions/0012-align-workflow-runtime-with-npm-trusted-publishing-requirements.md +0 -26
- package/wiki/decisions/0013-add-package-repository-url-for-provenance-validation.md +0 -25
@@ -0,0 +1,236 @@
---
type: synthesis
title: "Research: Meta-Agent Context Drift Detection"
created: 2026-04-30
updated: 2026-04-30
tags:
  - research
  - meta-agent
  - context-drift
  - harness-design
  - agent-reliability
status: developing
related:
  - "[[context-drift-in-agents]]"
  - "[[meta-agent-context-pruning]]"
  - "[[agent-loop-detection-patterns]]"
  - "[[guardian-agent-pattern]]"
  - "[[ironclaw-drift-monitor]]"
  - "[[langsight-loop-detection]]"
  - "[[agent-drift-academic-paper]]"
  - "[[vectara-guardian-agents]]"
  - "[[model-adaptive-harness]]"
  - "[[harness-configuration-layers]]"
  - "[[agentic-harness-context-enforcement]]"
  - "[[grounding-checkpoints]]"
sources:
  - "[[ironclaw-drift-monitor]]"
  - "[[langsight-loop-detection]]"
  - "[[agent-drift-academic-paper]]"
  - "[[vectara-guardian-agents]]"
---

# Research: Meta-Agent Context Drift Detection

## Overview

A meta-agent that monitors the primary coding agent for context drift — repeated incorrect tool calls, excessive ls/find commands, tool-call loops — and intervenes by pruning irrelevant history from context. This concept exists in fragmented form across industry practice (ironclaw DriftMonitor, LangSight loop detection, Claude Code compaction) and academic research (Agent Stability Index, SWE-Pruner, GUARDIAN), but **no single system combines detection + pruning + context replacement into one pipeline**. The exact composition described here is a novel synthesis.

## Key Findings

- **Exact match exists**: nearai/ironclaw #1634 "DriftMonitor" (March 2026) implements rule-based stuck-pattern detection with system-message injection — but does NOT prune context (Source: [[ironclaw-drift-monitor]])
- **Loop detection is production-ready**: LangSight detects tool-call repetition via argument hashing, catching 90%+ of real loops with zero false positives at threshold 3 (Source: [[langsight-loop-detection]])
- **Agent drift is academically quantified**: The Agent Drift paper (arXiv 2601.04170) shows a 42% task-success reduction and a 3.2x increase in human intervention, and introduces the Agent Stability Index (ASI) across 12 dimensions (Source: [[agent-drift-academic-paper]])
- **Guardian agents are an active industry pattern**: Vectara built a platform-agnostic benchmark (~900 scenarios) validating pre-execution safety layers that check tool selection, arguments, and sequencing before execution. Overall correct rates are only 5-59% across platforms (Source: [[vectara-guardian-agents]])
- **Context pruning exists for code, not conversation**: SWE-Pruner (arXiv 2601.16746) achieves 23-54% token reduction by pruning code context, but it operates on source files, not agent conversation history (Source: [[swe-pruner-context-pruning]])
- **The novel gap**: No existing system does the full loop: detect stuck → identify dead-end context entries → prune them → restart agent with clean context. Each piece exists independently. The composition is new.

## First Principles Analysis

### The Problem

Agent starts task → makes wrong tool call → gets error → tries variant → still wrong → tries ls/find/grep repeatedly → context fills with dead ends. Signal-to-noise collapses. The agent gets more lost, not less.

This is a **positive feedback loop of context pollution**. Each failed attempt adds noise that makes the next attempt MORE likely to fail. The agent doesn't just fail — it accelerates into failure.

### The Meta-Agent Solution

A separate observer (meta-agent) that:

1. **Detects stuck patterns** — rule-based signatures of non-progress: repeated identical tool calls, tool cycling (A-B-A-B), consecutive failures, excessive file searching
2. **Identifies dead-end context entries** — which tool calls and responses constitute noise vs. signal
3. **Prunes the context** — removes dead-end entries from the conversation history
4. **Injects a correction** — "You were stuck on [pattern]. Here's what you know so far. Try a different approach."
5. **Restarts the agent** — either by editing in-place (if the API supports it) or by terminating and resuming with pruned history

### Detection Mechanism

**Rule-based (recommended)**: Zero LLM overhead. Pattern-match on tool-call sequences:

| Pattern | Threshold | Detection |
|---------|-----------|-----------|
| Repetition | 3+ identical | Hash tool+args, count in sliding window |
| Failure spiral | 4+ failures | Consecutive error count |
| Tool cycling | A-B-A-B-A-B | Sequence pattern in last 6 calls |
| Silence drift | 15+ iterations | No-text-response counter |
| Rework churn | 3+ writes | Same file written repeatedly |
| Excessive searching | 5+ ls/find | Count search-type tool calls without code edits |
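As a concrete illustration, the repetition, failure-spiral, and tool-cycling rows above can be sketched in a few lines of TypeScript. All names (`ToolCall`, `detectStuck`) are illustrative — this is a sketch of the technique, not the ironclaw or LangSight API:

```typescript
// Minimal rule-based stuck-pattern detector (illustrative sketch).
type ToolCall = { tool: string; args: unknown; failed: boolean };

// Hash stand-in: a stable string key of tool + serialized args.
function callKey(c: ToolCall): string {
  return `${c.tool}:${JSON.stringify(c.args)}`;
}

function detectStuck(
  history: ToolCall[],
  opts = { repetition: 3, failures: 4, cycleWindow: 6 },
): string | null {
  const recent = history.slice(-opts.cycleWindow);

  // Repetition: same tool+args appearing `repetition`+ times in the window.
  const counts = new Map<string, number>();
  for (const c of recent) {
    const k = callKey(c);
    counts.set(k, (counts.get(k) ?? 0) + 1);
    if ((counts.get(k) ?? 0) >= opts.repetition) return "repetition";
  }

  // Failure spiral: `failures`+ consecutive failed calls.
  let streak = 0;
  for (const c of history) {
    streak = c.failed ? streak + 1 : 0;
    if (streak >= opts.failures) return "failure_spiral";
  }

  // Tool cycling: A-B-A-B-A-B over the last six calls.
  if (recent.length === 6) {
    const [a, b] = [recent[0].tool, recent[1].tool];
    if (a !== b && recent.every((c, i) => c.tool === (i % 2 === 0 ? a : b))) {
      return "tool_cycling";
    }
  }
  return null;
}
```

Whitelisted polling tools (see Edge Cases below) would be filtered out of `history` before these checks run.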

**LLM-based (higher cost, higher precision)**: Every N steps, a separate small-model call evaluates the trajectory for meaningful progress. This can catch semantic drift that rule-based checks miss.

### Pruning Heuristic

Distinguishing "failed but informative" from "failed and useless":

| Keep | Prune |
|------|-------|
| Error led to a different approach on the next attempt | Identical call returned the same result |
| Output contained new information despite failure | Pure noise (navigation bars, boilerplate errors) |
| User explicitly asked for that action | Agent retried without user direction |
| Established a constraint used later | Agent forgot about the call entirely |

Conservative pruning: when uncertain, keep. The cost of pruning useful context is higher than that of keeping benign noise.
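The keep/prune table can be encoded as a conservative filter that only removes clear dead ends. A minimal sketch, assuming each history entry carries a call hash, a result hash, and two provenance flags — every field and helper name here is hypothetical:

```typescript
// Conservative prune selection over tool-call history entries (sketch).
type Entry = {
  id: number;
  key: string;              // hash of tool + args
  resultKey: string;        // hash of the tool result
  userRequested: boolean;   // user explicitly asked for this action
  referencedLater: boolean; // a later step used this result
};

function selectPrunable(history: Entry[]): number[] {
  const seen = new Map<string, string>(); // call key -> first result key
  const prunable: number[] = [];
  for (const e of history) {
    const first = seen.get(e.key);
    const duplicate = first !== undefined && first === e.resultKey;
    if (!duplicate) seen.set(e.key, e.resultKey);
    // Prune only clear dead ends: an identical call that returned the same
    // result, was not user-directed, and nothing later depended on it.
    if (duplicate && !e.userRequested && !e.referencedLater) prunable.push(e.id);
    // Anything ambiguous is kept — when uncertain, keep.
  }
  return prunable;
}
```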

### Feasibility

**High**. Detection is trivial (rule-based, O(1) per call). Pruning requires careful heuristics, but the worst case (keep everything) is identical to current behavior. The intervention mechanism (system-message injection) is already proven in ironclaw.

### Overhead Analysis

| Component | Cost | Notes |
|-----------|------|-------|
| Rule-based detection | ~0 tokens | Hash comparison + counters per tool call |
| LLM-based detection | ~500 tokens per check | Checking every 10 steps means 5 checks in a 50-step session = 2,500 tokens |
| Context pruning | ~0 tokens | Metadata operation, no LLM call |
| Correction injection | ~150 tokens | System message |
| Session restart | 1 API call + cache miss | One-time cost if restarting; zero if editing in-place |
| **Total overhead** | **~2,500-3,000 tokens** | vs. 20,000+ tokens wasted in a bloated failed context |

**Net savings**: 5-10x token reduction for stuck sessions. The meta-agent pays for itself after 1-2 interventions.

### Edge Cases

- **Polling agents**: Legitimate repeated calls to status endpoints. Whitelist polling tools or use time-based windows instead of count-based ones.
- **Retry-heavy workflows**: Some tools legitimately fail transiently. Increase the threshold to 5-7 for these agents.
- **Exploratory searching**: Browsing many files is sometimes correct behavior. Distinguish by whether code edits follow the searches.
- **False-positive prunes**: Removing useful context is worse than failing to prune. Use conservative defaults plus escalation levels.

### Escalation Model

1. **Soft nudge** (first detection): System message — "You've called [tool] with the same args 3 times. Summarize what you know and try a different approach."
2. **Strong nudge** (second detection): System message + context summary — "You're stuck. Here's a clean summary of what you've accomplished. Start fresh from here."
3. **Forced restart** (third detection): Terminate the session, prune the history, and restart with clean context and a correction message.
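The three levels reduce to a small ladder from detection count to intervention. A sketch with illustrative names (`Intervention`, `escalate`) — not existing harness code:

```typescript
// Escalation ladder: detection count -> intervention (sketch).
type Intervention =
  | { kind: "soft_nudge"; message: string }
  | { kind: "strong_nudge"; message: string }
  | { kind: "forced_restart" };

function escalate(detections: number, pattern: string): Intervention {
  if (detections <= 1) {
    return {
      kind: "soft_nudge",
      message: `You've hit the "${pattern}" pattern. Summarize what you know and try a different approach.`,
    };
  }
  if (detections === 2) {
    return {
      kind: "strong_nudge",
      message: "You're stuck. Here's a clean summary of what you've accomplished. Start fresh from here.",
    };
  }
  // Third detection and beyond: terminate, prune history, resume clean.
  return { kind: "forced_restart" };
}
```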

## Integration with Existing Harness Pipeline

The meta-agent concept maps to the existing harness architecture:

### New Layer: L2.5 — Runtime Drift Monitor

Sits between L2 (Structured Planning) and L3 (Grounding Checkpoints). While L3 already has "drift detection" for scope creep against the spec, it does NOT monitor tool-call quality or context pollution.

```
L2 (Plan) → L2.5 (Drift Monitor) → L3 (Execute + Grounding)
                  ↑                          ↓
                  └── injects corrections ───┘
```

**Why between L2 and L3**: The plan defines expected tool sequences. The drift monitor compares actual tool calls against the plan AND against stuck-pattern signatures. It catches both "off-plan" drift (scope creep) and "stuck-on-plan" drift (repetitive failures).

### Integration Points

| Component | Harness File | Change |
|-----------|-------------|--------|
| DriftMonitor struct | `lib/harness-drift-monitor.ts` | **New** — pattern detection, correction injection |
| DriftMonitor config | `.pi/harness/drift-monitor.json` | **New** — thresholds, escalation levels, whitelists |
| Extension hook | `extensions/harness-drift-monitor.ts` | **New** — hooks into before_llm_call / after_tool_call |
| L3 grounding | `lib/harness-executor.ts` | Add a drift_monitor field; call check() before each LLM call |
| Harness plan | `lib/harness-planner.ts` | Layer renumbering (L3→L4, L4→L5, etc.), or insert as L2.5 |
| Implementation plan | [[harness-implementation-plan]] | Add Phase 17: Runtime Drift Monitor |

### Configuration Schema

```typescript
interface DriftMonitorConfig {
  enabled: boolean;                      // default: true
  detection: {
    repetition_threshold: number;        // default: 3
    failure_spiral_threshold: number;    // default: 4
    cycle_window: number;                // default: 6
    silence_threshold: number;           // default: 15 iterations
    rework_threshold: number;            // default: 3
    excessive_search_threshold: number;  // default: 5
  };
  intervention: {
    prune_context: boolean;              // default: true (prune dead-end entries)
    inject_correction: boolean;          // default: true (system message)
    escalation: "soft" | "strong" | "forced_restart";
    max_escalations: number;             // default: 3
  };
  whitelist: {
    polling_tools: string[];             // tools allowed repeated calls
    retry_tools: string[];               // tools with legitimate retry patterns
  };
  model_profile: "auto" | "opus" | "gpt" | "gemini" | "strict";
}
```
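A loader for this config would merge a partial `.pi/harness/drift-monitor.json` over the schema defaults, so retry-heavy agents can raise only the thresholds they need (per the Edge Cases above). A minimal sketch for the detection block — the helper is an assumption, not existing harness code:

```typescript
// Merge partial detection overrides over the schema defaults (sketch).
type Detection = {
  repetition_threshold: number;
  failure_spiral_threshold: number;
  cycle_window: number;
  silence_threshold: number;
  rework_threshold: number;
  excessive_search_threshold: number;
};

// Defaults taken from the schema comments above.
const DEFAULT_DETECTION: Detection = {
  repetition_threshold: 3,
  failure_spiral_threshold: 4,
  cycle_window: 6,
  silence_threshold: 15,
  rework_threshold: 3,
  excessive_search_threshold: 5,
};

// A retry-heavy agent can pass { repetition_threshold: 5 } and keep the rest.
function withOverrides(overrides: Partial<Detection>): Detection {
  return { ...DEFAULT_DETECTION, ...overrides };
}
```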

### Model-Adaptive Behavior

Maps to L3 State Channel and L2 Gate Design from the [[harness-configuration-layers|four-layer harness]]:

| Model | Detection | Intervention |
|-------|-----------|-------------|
| Opus | LLM-based every 15 steps (trusts self-assessment) | Soft nudge → self-corrects reliably |
| GPT | Rule-based every step (needs frequent checks) | Hard escalation → auto-restart after 3 detections |
| Gemini | Rule-based every 10 steps (moderate frequency) | Soft nudge → escalate if unresponsive |
| Strict | Rule-based every step (maximum enforcement) | Hard escalation → auto-restart after 2 detections |

### Token Budget

Estimated overhead for a 50-step agent session:

| Profile | Checks | Tokens per check | Total overhead |
|---------|--------|-----------------|----------------|
| Rule-based (GPT/strict) | 50 | ~0 | 0 |
| Rule-based (Gemini) | 5 | ~0 | 0 |
| LLM-based (Opus) | 3 | ~500 | 1,500 |

All profiles: correction messages are ~150 tokens each; at most 3 interventions = 450 tokens. Pruning has zero token cost (it is a metadata operation).

## Key Entities

- **nearai/ironclaw**: Open-source agent framework with the proposed DriftMonitor (Source: [[ironclaw-drift-monitor]])
- **LangSight**: Production agent monitoring with loop detection, budget guardrails, and circuit breakers (Source: [[langsight-loop-detection]])
- **Vectara**: Guardian Agents benchmark and pre-execution safety layer (Source: [[vectara-guardian-agents]])
- **Abhishek Rath**: Author of the Agent Drift paper; introduced the Agent Stability Index (ASI) (Source: [[agent-drift-academic-paper]])
- **Anthropic Applied AI team**: Published a context engineering framework covering compaction, note-taking, and sub-agent architectures

## Key Concepts

- [[context-drift-in-agents]]: Progressive degradation of agent behavior over extended interactions
- [[meta-agent-context-pruning]]: The proposed system — detect stuck, prune history, restart
- [[agent-loop-detection-patterns]]: Three production patterns (direct repetition, ping-pong, retry-without-progress)
- [[guardian-agent-pattern]]: Pre-execution safety layers that validate agent actions before they execute

## Contradictions

- **ironclaw vs. Vectara on intervention timing**: Ironclaw's DriftMonitor injects corrections AFTER tool calls (reactive). Vectara's Guardian Agents validate BEFORE tool execution (proactive). (Source: [[ironclaw-drift-monitor]] vs [[vectara-guardian-agents]]). The meta-agent concept is reactive (post-hoc pruning), so it aligns with ironclaw's approach. Vectara's proactive approach could complement it as a first line of defense.
- **LangSight says terminate on loop detection; ironclaw says inject a correction message.** Both are valid for different risk profiles. The proposed escalation model (soft → strong → forced) synthesizes both.

## Open Questions

- Can context pruning be done in-place (API-supported message editing), or must it always be a session restart? Most APIs (Anthropic, OpenAI) support message truncation but not selective deletion from the middle of the history.
- What is the "minimum viable context" that must survive pruning? The original task, key decisions made, constraints discovered, and the last successful state.
- Does pruning break the model's chain of thought? If the model was mid-reasoning when stuck, restarting with pruned history may lose coherence. Needs testing.
- How does this interact with prompt caching? Pruning may invalidate cached prefixes, increasing short-term cost.
- Can a small/cheap model (Haiku, Flash) serve as the meta-agent detector, keeping overhead near zero?

## Sources

- [[ironclaw-drift-monitor]]: nearai/ironclaw #1634, March 2026 — Proposed DriftMonitor with 5 rule-based patterns
- [[langsight-loop-detection]]: LangSight Engineering, March 2026 — Production loop detection with argument hash comparison
- [[agent-drift-academic-paper]]: Abhishek Rath, January 2026 — Agent Stability Index (ASI) across 12 dimensions
- [[vectara-guardian-agents]]: Vectara, November 2025 — Platform-agnostic guardian agents benchmark (~900 scenarios)
- [[swe-pruner-context-pruning]]: Wang et al., January 2026 — Self-adaptive context pruning for coding agents (ACL 2026)
- [[anthropic-context-engineering]]: Anthropic Applied AI, September 2025 — Context engineering framework

@@ -0,0 +1,95 @@
---
type: synthesis
title: "Research: Model-Adaptive Agent Harness Design"
created: 2026-04-30
updated: 2026-04-30
tags:
  - research
  - agents
  - harness-design
  - model-awareness
status: complete
related:
  - "[[model-adaptive-harness]]"
  - "[[harness-configuration-layers]]"
  - "[[forgecode-gpt5-agent-improvements]]"
sources:
  - "[[forgecode-gpt5-agent-improvements]]"
---

# Research: Model-Adaptive Agent Harness Design

## Overview

Forge Code's TermBench 2.0 results reveal that agent harness reliability is not a property of the model — it is a property of how well the harness compensates for each model's specific failure modes. GPT 5.4 and Opus 4.6 reached identical 81.8% scores only after model-specific adaptation. This research documents the design principles for making the harness pipeline model-aware.

## Key Findings

- **The harness has four configurable layers** not previously recognized: Signal Design (L1), Gate Design (L2), State Channel (L3), and Completion Model (L4). Each has dimensions that vary by model (Source: [[forgecode-gpt5-agent-improvements]])
- **GPT and Opus fail differently but reach the same capability ceiling** when the harness compensates. GPT needs flat structure, constraints-first ordering, enforced gates, and in-band signals. Opus tolerates nesting, infers from metadata, and self-corrects (Source: [[forgecode-gpt5-agent-improvements]])
- **Enforced verification is the single biggest improvement.** GPT stops after plausible-but-incomplete solutions. "Please verify" does nothing. A programmatic gate — a checklist that must be passed before proceeding — catches the gaps (Source: [[forgecode-gpt5-agent-improvements]])
- **Schema/instruction shape is a reliability variable, not cosmetic.** GPT anchors on what appears first. Moving constraints before descriptive content reduces malformed behavior. Flat structures (one nesting level) reduce structural errors. Same semantics, different reliability (Source: [[forgecode-gpt5-agent-improvements]])
- **Truncation signaling must be in-band for GPT.** Metadata fields like `total_lines` are invisible to GPT's attention; body-text warnings are necessary. Opus reads metadata fine (Source: [[forgecode-gpt5-agent-improvements]])

## Key Entities

- [[forgecode-gpt5-agent-improvements|ForgeCode]]: Agent coding platform that reached #1 on TermBench 2.0 and published the model-adaptive harness findings
- Tushar Mathur: Author of the Forge Code blog post and lead on the harness adaptation work

## Key Concepts

- [[model-adaptive-harness]]: A harness that varies behavior by model profile rather than using a one-size-fits-all instruction set
- [[harness-configuration-layers]]: Four-layer framework (L1 Signal, L2 Gate, L3 Channel, L4 Completion) with configurable dimensions per model

## Design Principles for the Harness Pipeline

These findings will be applied to the harness pipeline as it is built out. The key principle: **write once for strict (GPT-safe defaults), relax for forgiving models**. Never write for forgiving models and hope strict models cope.

### Four-Layer Model (see [[harness-configuration-layers]] for the full specification)

1. **L1 Signal Design** — instruction density, ordering, emphasis, nesting depth, atomicity
2. **L2 Gate Design** — enforcement model (hard vs soft), granularity, evidence standard, retry behavior
3. **L3 State Channel** — how truncation, progress, and errors are communicated to the model
4. **L4 Completion Model** — how "done" is determined and verified

### Model-Specific Differences

| Behavior | Opus/Claude | GPT |
|---|---|---|
| Structure | Tolerates nesting, natural flow | Needs flat, constraints-first |
| Truncation | Infers from metadata | Needs body-text warning |
| Verification | Naturally double-checks | Must be ENFORCED (hard gate) |
| Completion | Self-aware of gaps | Stops after plausible-but-incomplete |
| Emphasis | Contextual cues work | Explicit markers (REQUIRED, MANDATORY) |
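These rows can be captured as a small profile record, with unknown models falling back to the GPT-safe defaults per the "write once for strict" principle. A sketch — the field names are illustrative, not an existing harness API:

```typescript
// Per-model harness profile derived from the differences table (sketch).
type HarnessProfile = {
  flatStructure: boolean;        // flat, constraints-first prompts
  inBandTruncation: boolean;     // body-text warnings instead of metadata
  enforcedVerification: boolean; // hard gate instead of trusting self-checks
  explicitEmphasis: boolean;     // REQUIRED / MANDATORY markers
};

const PROFILES: Record<string, HarnessProfile> = {
  opus: { flatStructure: false, inBandTruncation: false, enforcedVerification: false, explicitEmphasis: false },
  gpt:  { flatStructure: true,  inBandTruncation: true,  enforcedVerification: true,  explicitEmphasis: true },
};

// "Write once for strict": any unknown model gets the GPT-safe defaults.
function profileFor(model: string): HarnessProfile {
  return PROFILES[model] ?? PROFILES.gpt;
}
```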

### What Must Adapt per Model

Each pipeline phase that generates instructions for the agent should vary based on the driving model:

- Instruction formatting (density, ordering, emphasis)
- Gate enforcement (hard vs soft, checklist vs self-assessment)
- State signaling (in-band vs metadata, explicit vs implicit progress)
- Completion criteria (falsifiable checklist vs completion signal)

### What Never Adapts

Core invariants across all model profiles:

- Pipeline steps and phase ordering
- Quality standards and source attribution requirements
- Confidence labeling
- Budget constraints (max rounds, max tokens, max pages)
- Verification gates (what must be checked, even if how varies by model)

## Open Questions

- How should the model be detected at runtime? System prompt parsing? Tool-call format detection?
- Should per-step gates be added for the GPT profile, or is per-round sufficient?
- How do these findings apply across all harness phases beyond research?
- The Gemini profile needs validation against actual Gemini agent trajectories
- Should the harness maintain per-model reliability metrics to track which compensations work?

## Sources

- [[forgecode-gpt5-agent-improvements]]: Tushar Mathur, 2026-03-16. Primary source for all four fixes and the model behavioral differences

@@ -0,0 +1,165 @@
---
type: synthesis
title: "Research: Model-Specific Prompting Guides"
created: 2026-05-01
updated: 2026-05-01
tags:
  - research
  - prompting
  - model-specific
  - harness-redesign
status: developing
related:
  - "[[model-adaptive-harness]]"
  - "[[harness-configuration-layers]]"
  - "[[harness-implementation-plan]]"
  - "[[forgecode-gpt5-agent-improvements]]"
sources:
  - "[[openai-prompt-guidance]]"
  - "[[anthropic-prompt-best-practices]]"
  - "[[gemini-3-prompting-guide]]"
---

# Research: Model-Specific Prompting Guides

## Overview

Every major model provider now publishes official prompting guidance specific to their models. These guides describe HOW to prompt each model for best results — not just what the models fail at. The current harness design derives model profiles from Forge Code's empirical failure-mode observations. This research brings in the OFFICIAL provider guidance as the primary source for harness adaptations.

## Key Finding: The Harness Must Be Redesigned

The current harness writes "strict mode" (GPT-safe defaults) as canonical and relaxes for forgiving models. This is WRONG according to official guidance. Each provider specifies fundamentally DIFFERENT prompting conventions — not just different strictness levels of the same format.

### What Providers Say vs What the Harness Does

| Provider | Official Guidance | Current Harness Behavior |
|----------|------------------|------------------------|
| **OpenAI** | Outcome-first prompts, shorter, constraints-first ordering, preambles before tools; reasoning effort is the primary knob | "Strict mode" — flat structure, constraints-first, enforced hard gates, in-band signals |
| **Anthropic** | XML tags for structure, long content at top + query at bottom, role setting critical, prefer general instructions over prescriptive steps; effort parameter controls thinking | "Relaxed mode" — hierarchical instructions, soft gates, metadata-based state channels |
| **Google** | Constraints at the END (not the beginning), split-step verification, temperature at 1.0, explicit grounding statements, persona definitions critical | No Gemini-specific profile; marked "TBD" |

### Critical Contradictions

1. **Constraint ordering**: OpenAI says constraints-FIRST. Google says constraints-LAST. The harness can't satisfy both with one canonical format.

2. **Prompt density**: OpenAI (GPT-5.5+) says SHORTER prompts, outcome-first. The harness's "strict mode" generates verbose, constraint-heavy prompts — exactly what OpenAI now recommends against.

3. **Structure format**: Anthropic recommends XML tags. OpenAI uses XML-like sections but also markdown. Google uses plain-text sections. No single format works across all three.

4. **Temperature**: Google mandates 1.0. OpenAI and Anthropic don't specify. The harness needs model-specific temperature config.

5. **Verification strategy**: Google says split-step (verify first, then generate). Anthropic says self-check at the end. OpenAI (GPT-5.4+) says run a verification loop before finalizing. Different workflows.

6. **Grounding**: Google requires explicit "context is the only source of truth" statements. OpenAI uses citation rules. Anthropic uses document quote extraction. Different grounding mechanisms.

## Proposed Redesign: Provider-Native Prompt Generation

Instead of "write once, relax for forgiving models," the harness should generate **provider-native prompts** optimized for each model's official conventions.

### Design Principle (NEW)

**Generate model-specific prompts from a provider-agnostic semantic specification. Never generate a single canonical prompt and relax it.**

The harness's internal representation should be a semantic spec (what must be communicated), not a prompt string. The prompt renderer generates the actual prompt text according to the target model's provider conventions.

### Provider Profiles

#### OpenAI GPT-5.x Profile

```
STRUCTURE:      XML-like sections (<instruction_spec>)
ORDERING:       Constraints-first, then context, then task
DENSITY:        Concise, outcome-oriented. Describe the destination, not the journey.
EMPHASIS:       Explicit markers: REQUIRED, MANDATORY for true invariants
VERIFICATION:   Action safety blocks, pre-flight/post-flight
TOOLS:          apply_patch native, shell_command tool, update_plan
REASONING:      Use the reasoning_effort parameter, not prompt-level "think step by step"
TEMPERATURE:    Unspecified (default)
CONTRADICTIONS: Audit prompts for conflicting instructions — harmful to GPT-5+
```

#### Anthropic Claude 4.x Profile

```
STRUCTURE:     XML tags (<instructions>, <context>, <examples>)
ORDERING:      Long content at TOP, query at BOTTOM
DENSITY:       Prefer general instructions over prescriptive steps
EMPHASIS:      Role setting; explain the "why" behind instructions
VERIFICATION:  Self-check at the end against test criteria
TOOLS:         Explicit tool direction, default_to_action or do_not_act
THINKING:      Adaptive thinking with the effort parameter
TEMPERATURE:   Unspecified (removed from the API; use effort)
HALLUCINATION: investigate_before_answering block
PARALLEL:      Maximize parallel tool calls
```

#### Google Gemini 3 Profile

```
STRUCTURE:    Plain-text sections
ORDERING:     Context → Task → Constraints AT END
DENSITY:      Concise by default; steer for verbosity explicitly
EMPHASIS:     Persona definitions are binding
VERIFICATION: Split-step: verify capability → generate answer
TOOLS:        System instructions for steering
THINKING:     Thinking level LOW/HIGH
TEMPERATURE:  1.0 (MANDATORY — never change)
GROUNDING:    Explicit "context is the absolute limit of truth" statement
SYNTHESIS:    "Based on the entire document above..." anchor phrase
```

## What Changes in the Harness

### L1: Spec Hardening

- **Before**: Generates spec-hardening prompts in "strict mode" with flat structure
- **After**: Generates provider-native spec prompts using the appropriate format per model

### L2: Structured Planning

- **Before**: Gate enforcement varies (hard for GPT, soft for Claude)
- **After**: Gate enforcement follows provider conventions PLUS empirical failure-mode data

### L2.5: Drift Monitor

- **Before**: Detection frequency varies by model
- **After**: Detection strategy varies by model (split-step for Gemini, self-check for Claude, verification loop for GPT)

### L3: Grounding Checkpoints

- **Before**: Truncation signaling varies (in-band vs metadata)
- **After**: Grounding mechanism varies (explicit grounding statement for Gemini, citation rules for GPT, quote extraction for Claude)

### L4: Adversarial Verification

- **Before**: Completion criteria vary (falsifiable checklist vs completion signal)
- **After**: Verification workflow varies (split-step verify-then-generate for Gemini, pre-flight/post-flight for GPT, self-check for Claude)

### New: Prompt Renderer Module

A new module between the harness's semantic spec and the actual API call. It takes a provider-agnostic task specification and renders it into a provider-native prompt.

```
Semantic Spec → Prompt Renderer → Provider-Native Prompt → API Call
                      ├── openai-renderer
                      ├── anthropic-renderer
                      └── google-renderer
```
|
|
138
|
+
|
|
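The renderer boundary can be sketched in TypeScript. Everything here is a hypothetical sketch: the `SemanticSpec` interface, the `render` function, and the per-provider section phrasings are illustrative assumptions, not the package's actual API.

```typescript
// Provider-agnostic semantic spec: what the harness wants, not how to phrase it.
interface SemanticSpec {
  role: string;          // persona, e.g. "senior reviewer"
  context: string;       // grounding material
  task: string;          // the instruction itself
  constraints: string[]; // hard rules the provider profile decides where to place
}

type Provider = "openai" | "anthropic" | "google";

// Each renderer applies its provider's conventions from the profiles above.
function render(spec: SemanticSpec, provider: Provider): string {
  const constraints = spec.constraints.map((c) => `- ${c}`).join("\n");
  switch (provider) {
    case "openai":
      // Constraints first, Markdown-style sections.
      return `## Constraints\n${constraints}\n\n## Role\n${spec.role}\n\n## Context\n${spec.context}\n\n## Task\n${spec.task}`;
    case "anthropic":
      // XML-style tags (commonly recommended for Claude), role set up front.
      return `<role>${spec.role}</role>\n<context>${spec.context}</context>\n<task>${spec.task}</task>\n<constraints>\n${constraints}\n</constraints>`;
    case "google":
      // Plain text sections: context → task → constraints at the very end.
      return `${spec.role}\n\n${spec.context}\n\n${spec.task}\n\nConstraints:\n${constraints}`;
    default:
      throw new Error("unknown provider");
  }
}
```

Each branch encodes the corresponding profile: constraints lead for OpenAI, tagged sections for Anthropic, and plain text with constraints last for Gemini, so one semantic spec yields three different wire-level prompts.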
## Entities
- [[OpenAI]]: Publisher of GPT model family and official prompt guidance
- [[Anthropic]]: Publisher of Claude model family and prompt engineering best practices
- [[Google Cloud]]: Publisher of Gemini model family and Gemini 3 prompting guide

## Key Concepts
- [[provider-native-prompting]]: New concept — generate prompts optimized for each provider's conventions
- [[Prompt Renderer]]: New module — translates semantic specs to provider-native prompts
- [[model-adaptive-harness]]: Existing concept — needs significant redesign
- [[harness-configuration-layers]]: Existing concept — dimensions need provider-native mappings

## Contradictions
- **Constraint ordering**: OpenAI says first, Google says last. Cannot resolve — must generate different prompts per provider.
- **Prompt density**: OpenAI (5.5+) says shorter, harness says verbose strict mode. OpenAI's own newer guidance contradicts the harness's approach.
- **Verification workflow**: Three different verification patterns (split-step, self-check, verification loop) — all from official sources, all valid.

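Because these conflicts come from equally authoritative sources, the harness can only record them as per-provider data rather than resolve them. A minimal TypeScript sketch of that idea; the names are hypothetical, and Anthropic's constraint position is an assumption since the sources above do not state it:

```typescript
// Contradictory official guidance is stored as data, not resolved into one rule.
interface ProviderConventions {
  constraintPosition: "first" | "last"; // OpenAI says first, Google says last
  verification: "split-step" | "self-check" | "verification-loop";
}

const CONVENTIONS: Record<string, ProviderConventions> = {
  openai: { constraintPosition: "first", verification: "verification-loop" },
  // ASSUMPTION: Anthropic's constraint position is not stated in the sources.
  anthropic: { constraintPosition: "last", verification: "self-check" },
  google: { constraintPosition: "last", verification: "split-step" },
};

// Layers such as the drift monitor and adversarial verification look up
// their strategy here instead of hard-coding one workflow for all models.
function verificationStrategy(provider: string): string {
  const conv = CONVENTIONS[provider];
  if (!conv) throw new Error(`no conventions recorded for ${provider}`);
  return conv.verification;
}
```

Unknown providers throw rather than silently falling back, which connects to the open question below about models without official prompting guides.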
## Open Questions
- How to handle models that don't have official prompting guides (Mistral, DeepSeek, Llama)?
- Should the harness validate prompts against provider conventions before sending?
- How does prompt caching interact with provider-native prompt generation?
- Should the semantic spec be the same across all providers, or should it also vary?
- What happens when provider guidance changes? Automatic updates?

## Sources
- [[openai-prompt-guidance]]: OpenAI, 2026 — Comprehensive multi-model guidance
- [[anthropic-prompt-best-practices]]: Anthropic, 2026 — Claude Opus 4.7 through Haiku 4.5
- [[gemini-3-prompting-guide]]: Google Cloud, 2026-04-29 — Gemini 3 specific