@tangle-network/agent-eval 0.50.2 → 0.52.0

Files changed (15)

package/CHANGELOG.md +50 -0
package/dist/campaign/index.d.ts +1 -1
package/dist/campaign/index.js +5 -1
package/dist/campaign/index.js.map +1 -1
package/dist/{chunk-XAP6DJZE.js → chunk-YXD7GWJI.js} +35 -2
package/dist/chunk-YXD7GWJI.js.map +1 -0
package/dist/contract/index.d.ts +2 -2
package/dist/contract/index.js +1 -1
package/dist/openapi.json +1 -1
package/dist/{run-improvement-loop-BPMjNKMJ.d.ts → run-improvement-loop-Cc7oZlRP.d.ts} +48 -15
package/docs/specs/driver-honest-spec.md +251 -0
package/docs/specs/hermes-self-improvement-audit.md +93 -0
package/docs/specs/profile-versioning.md +291 -0
package/package.json +1 -1
package/dist/chunk-XAP6DJZE.js.map +0 -1

package/docs/specs/driver-honest-spec.md ADDED

@@ -0,0 +1,251 @@
+# Driver Honest Spec — what each driver IS, what each methodology actually is, where we deviate
+**Status:** Living document. Updated when we learn the truth from primary sources.
+**Date:** 2026-05-27
+This document exists because the project shipped two drivers with methodology names attached (`gepaDriver`, `skillOptDriver`) without the methodology specs being precisely encoded anywhere in the repo. That created an integrity gap. This doc closes it.
+Every claim in this doc is sourced from a primary reference (paper, code, or directly verifiable from our source). Marketing language is forbidden. If something is not implemented we say so.
+---
+## Part 1 — GEPA (the paper)
+**Source**: Agrawal et al., *"GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning"*, arXiv:2507.19457, July 2025.
+### What GEPA actually does
+Outer loop (verbatim from abstract): "samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the **Pareto frontier of its own attempts**."
+Named primitives in the paper:
+- **GEPA** (Genetic-Pareto) — the overall optimizer
+- **Pareto frontier** — non-dominated candidate set retained across iterations
+- **Prompt updates** — mutations proposed by reflection
+- **Rollouts** — trajectory samples
+### What gepaDriver in our substrate ACTUALLY does
+Source: `src/campaign/drivers/gepa.ts` (132 lines)
+- Single LLM call per `propose()` invocation
+- Input: prior generation's **single best candidate by composite score** + that candidate's top/bottom scenarios + 3 weakest dimensions (`buildEvidence`)
+- Output: N proposals, each a full document rewrite
+- Dedup by exact text equality
+### Deviations from the GEPA paper
+| GEPA paper | Our `gepaDriver` |
+|---|---|
+| **Pareto frontier** of candidates | **Single "best by composite"** — no Pareto set, no non-dominated tracking |
+| **Combine complementary lessons** from frontier | Each generation reflects on ONE prior candidate; no combination |
+| Multi-objective optimization | Single-objective (composite score) |
+| Genetic operators (mutation, crossover) | Reflection only — no crossover |
+| Sample efficiency claim (up to 35× fewer rollouts than GRPO) | Unmeasured against any baseline |
+**Honest assessment**: our `gepaDriver` is a **reflective full-rewrite driver**, not GEPA. It captures GEPA's *reflection* primitive but not its *Pareto* mechanism. The name oversells. A faithful renaming would be `reflectiveRewriteDriver`. A faithful implementation would add a Pareto candidate pool + combine step.
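To make the gap in the table concrete, here is a minimal sketch of a non-dominated filter next to a single-best-by-composite pick. The `ScoredCandidate` shape, the two-objective scores, and the mean-as-composite rule are assumptions for illustration. This is a generic Pareto filter, not the paper's exact selection procedure and not code from `gepa.ts`.

```typescript
// Hypothetical candidate shape: one score per objective, higher is better.
interface ScoredCandidate {
  id: string
  scores: number[]
}

// a dominates b when a is at least as good on every objective
// and strictly better on at least one.
function dominates(a: ScoredCandidate, b: ScoredCandidate): boolean {
  let strictlyBetter = false
  for (let i = 0; i < a.scores.length; i++) {
    if (a.scores[i] < b.scores[i]) return false
    if (a.scores[i] > b.scores[i]) strictlyBetter = true
  }
  return strictlyBetter
}

// Keep every candidate that no other candidate dominates.
function paretoFrontier(pool: ScoredCandidate[]): ScoredCandidate[] {
  return pool.filter((c) => !pool.some((other) => other !== c && dominates(other, c)))
}

// Single winner by mean score (assumed composite), for contrast.
function bestByComposite(pool: ScoredCandidate[]): ScoredCandidate {
  const composite = (c: ScoredCandidate) =>
    c.scores.reduce((sum, x) => sum + x, 0) / c.scores.length
  return pool.reduce((best, c) => (composite(c) > composite(best) ? c : best))
}

const pool: ScoredCandidate[] = [
  { id: 'a', scores: [0.9, 0.2] },
  { id: 'b', scores: [0.5, 0.8] },
  { id: 'c', scores: [0.4, 0.7] }, // dominated by b on both objectives
  { id: 'd', scores: [0.7, 0.5] },
]
const frontierIds = paretoFrontier(pool).map((c) => c.id)
const bestId = bestByComposite(pool).id
```

In this toy pool the frontier keeps three candidates while the composite pick keeps one, which is the information a combine step would need and a single-best driver discards.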
+---
+## Part 2 — SkillOpt (the paper + code)
+**Source**:
+- README: https://github.com/microsoft/SkillOpt
+- Source: `/tmp/SkillOpt/skillopt/` (cloned 2026-05-27)
+- Key files: `engine/trainer.py`, `optimizer/clip.py` (rank_and_select), `optimizer/update_modes.py`, `evaluation/gate.py`, `types.py`
+### What SkillOpt actually does
+**6-stage per-step pipeline** (verbatim from `trainer.py:516` and adjacent):
+1. **Rollout** — `adapter.rollout(train_env, current_skill, ...)` collects trajectories on a batch.
+2. **Reflect** — `adapter.reflect()` analyses trajectories and emits **structured patches** (NOT full rewrites in patch mode). Failure trials → failure patches; success trials → success patches.
+3. **Aggregate** — `merge_patches(current_skill, all_failure_patches, all_success_patches, batch_size=merge_bs)` — hierarchically merges patches across accumulated batches.
+4. **Select** — `rank_and_select(current_skill, merged_patch, max_edits=edit_budget)` — if edit pool > budget, calls an optimizer LLM to **rank edits by importance** and keep top-L. Budget is "analogous to gradient clipping" (their words).
+5. **Update** — apply patch in one of 3 modes:
+   - **`patch`** — deterministic diff apply via `apply_patch_with_report()`; ops are `append | insert_after | replace | delete`
+   - **`rewrite_from_suggestions`** — LLM regenerates full skill from suggestions
+   - **`full_rewrite_minibatch`** — reflection directly emits complete candidate skills; select picks the best
+6. **Evaluate & Gate** — runs candidate on selection set, calls `evaluate_gate(cand_hard, current_score, best_score)`. Returns `accept_new_best | accept | reject` from a **literal `cand_hard > current_score`** comparison (`evaluation/gate.py:38`). No statistical test.
+Plus epoch-level stages:
+- **Slow update** — `run_slow_update()` builds longitudinal pairs across epochs.
+- **Meta skill** — `run_meta_skill()` produces optimizer-side memory of patterns across adjacent epochs.
+### Canonical patch shape (from `types.py:22-45`)
+```python
+from dataclasses import dataclass
+from typing import Literal
+EditOp = Literal["append", "insert_after", "replace", "delete"]
+@dataclass
+class Edit:
+    op: EditOp
+    content: str
+    target: str  # for replace/delete/insert_after
+    support_count: int | None  # how many trials voted for this edit
+    source_type: Literal["failure", "success"] | None
+    merge_level: int | None
+@dataclass
+class Patch:
+    edits: list[Edit]
+    reasoning: str
+    ranking_details: dict | None
+```
+### What `skillOptDriver` v0.51.0 in our substrate ACTUALLY does
+Source: `src/campaign/drivers/skillopt.ts` (current as of 0.51.0)
+- Single LLM call per `propose()` returning N full document rewrites
+- Post-parse rejection on: (a) any H2 header dropped, (b) sentence-edit count > editBudget × 2
+- Substantively equivalent to `gepaDriver` + 2 validation constraints
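The two post-parse rejections can be sketched roughly as follows. This is not the code in `skillopt.ts`: the function names, the naive sentence splitter, and the definition of "sentence edit" as sentences added plus sentences removed are assumptions made only to illustrate the shape of the check.

```typescript
// Collect the text of every H2 heading ("## ...") in a markdown document.
function h2Headers(doc: string): string[] {
  return doc
    .split('\n')
    .filter((line) => /^##\s+/.test(line))
    .map((line) => line.replace(/^##\s+/, '').trim())
}

// Naive sentence split on terminal punctuation or newlines (an assumption).
function sentences(doc: string): string[] {
  return doc
    .split(/[.!?\n]+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
}

// Count sentences removed from `before` plus sentences added in `after`.
function sentenceEditCount(before: string, after: string): number {
  const a = sentences(before)
  const b = sentences(after)
  const removed = a.filter((s) => b.indexOf(s) === -1).length
  const added = b.filter((s) => a.indexOf(s) === -1).length
  return removed + added
}

type Verdict = { ok: true } | { ok: false; reason: string }

// Reject when any baseline H2 is missing, or edits exceed editBudget * 2.
function validateRewrite(baseline: string, candidate: string, editBudget: number): Verdict {
  const kept = h2Headers(candidate)
  const dropped = h2Headers(baseline).filter((h) => kept.indexOf(h) === -1)
  if (dropped.length > 0) {
    return { ok: false, reason: 'dropped H2: ' + dropped.join(', ') }
  }
  const edits = sentenceEditCount(baseline, candidate)
  if (edits > editBudget * 2) {
    return { ok: false, reason: 'sentence edits ' + edits + ' exceed ' + editBudget * 2 }
  }
  return { ok: true }
}
```

Note that both checks run after the full rewrite has already been generated, so they filter candidates rather than steer the edit itself.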
+### Deviations from SkillOpt
+| SkillOpt actual | Our 0.51.0 `skillOptDriver` |
+|---|---|
+| 6-stage pipeline (rollout → reflect → aggregate → select → update → gate) | Single LLM call → N rewrites |
+| **Patch-based edits** (`{op, target, content, support_count, source_type}`) | Full document rewrites only |
+| `merge_patches()` hierarchical merge across batches | No aggregation; each `propose()` is independent |
+| `rank_and_select(max_edits=edit_budget)` LLM-ranking of edits | All candidates that pass validation are returned |
+| 3 update modes (`patch`, `rewrite_from_suggestions`, `full_rewrite_minibatch`) | Only `full_rewrite_minibatch`-equivalent |
+| `evaluate_gate()` with `accept_new_best/accept/reject` codes | Substrate's outer gate decides ship/hold/inspect; driver doesn't see fine-grained accept signal |
+| Longitudinal `slow_update` across epochs | Not implemented |
+| `meta_skill` optimizer-side memory | Not implemented |
+| Selection-set cache (`sel_cache`) for repeated candidate hashes | Not implemented |
+| Edit-budget LR scheduler (constant / linear / cosine / autonomous) | Single fixed `editBudget` |
+| Mini-batch accumulation (`steps_per_epoch`, `merge_batch_size`) | Not implemented |
+| `decide_autonomous_learning_rate()` | Not implemented |
+| `longitudinal_pair_policy` (mixed / changed / unchanged) | Not implemented |
+**Honest assessment**: 13 substantive deviations. `skillOptDriver` 0.51.0 is **not** SkillOpt. It is `gepaDriver` with two post-validation constraints (section preservation, sentence-edit count). The methodology name oversells the implementation.
+### One thing where we are STRICTER than SkillOpt
+**The gate.** SkillOpt: literal `cand_hard > current_score` (`evaluation/gate.py:38`). Our substrate: paired bootstrap + 95% CI + Cohen's d + MDE + p-value (`defaultProductionGate`). When the lift CI straddles zero, our gate returns `hold` / `inspect`. SkillOpt would accept any improvement at all, even single-sample noise.
+This is real differentiation we have not been crediting ourselves for.
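The difference between the two gates can be shown with a small sketch. It is not `defaultProductionGate`: it covers only a percentile bootstrap over paired differences, leaves out Cohen's d, MDE, and the p-value, and collapses `hold` / `inspect` into a single `hold`. The seeded generator, function names, and resample count are assumptions chosen so the example is reproducible.

```typescript
// Small seeded linear congruential generator so results are reproducible.
function lcg(seed: number): () => number {
  let state = seed % 4294967296
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296
    return state / 4294967296
  }
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length
}

interface LiftCI {
  lift: number
  lo: number
  hi: number
}

// Percentile bootstrap over paired per-scenario differences (candidate - baseline).
function pairedBootstrapCI(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 42,
): LiftCI {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error('paired scores must be non-empty and the same length')
  }
  const diffs = candidate.map((c, i) => c - baseline[i])
  const rand = lcg(seed)
  const means: number[] = []
  for (let r = 0; r < resamples; r++) {
    let sum = 0
    for (let i = 0; i < diffs.length; i++) {
      sum += diffs[Math.floor(rand() * diffs.length)]
    }
    means.push(sum / diffs.length)
  }
  means.sort((x, y) => x - y)
  const lo = means[Math.floor(0.025 * resamples)]
  const hi = means[Math.min(resamples - 1, Math.floor(0.975 * resamples))]
  return { lift: mean(diffs), lo, hi }
}

// Literal comparison, as this doc describes SkillOpt's gate.
function literalGate(candidateMean: number, currentMean: number): 'accept' | 'reject' {
  return candidateMean > currentMean ? 'accept' : 'reject'
}

// CI-aware gate: ship only when the whole interval sits above zero.
function ciGate(ci: LiftCI): 'ship' | 'hold' {
  return ci.lo > 0 ? 'ship' : 'hold'
}
```

With five scenarios where the candidate wins three and loses two by the same margin, the literal gate accepts on the slightly higher mean while the interval straddles zero and the CI gate holds.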
+---
+## Part 3 — Hermes Agent's "self-improvement"
+**Source**: `/tmp/hermes-agent/` (cloned 2026-05-27)
+- `agent/curator.py` (the actual loop)
+- `agent/skill_commands.py`
+- `agent/skill_utils.py`
+### What Hermes actually does
+From `curator.py` line 1: "Curator — background skill maintenance orchestrator. The curator is an auxiliary-model task that periodically reviews agent-created skills and maintains the collection."
+Trigger: idle-driven, with default `DEFAULT_INTERVAL_HOURS = 24 * 7` (7 days). When the agent has been idle for `DEFAULT_MIN_IDLE_HOURS = 2` and the last curator run was > 7 days ago, `maybe_run_curator()` spawns a forked AIAgent.
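The trigger condition above reduces to a small predicate. The sketch below is in TypeScript for consistency with the rest of this repo, not a port of `maybe_run_curator()`. The `CuratorState` shape, the treatment of `paused`, and running when there is no previous run are assumptions; only the two constants and the idle / interval comparison come from the description above.

```typescript
const DEFAULT_INTERVAL_HOURS = 24 * 7
const DEFAULT_MIN_IDLE_HOURS = 2
const HOUR_MS = 60 * 60 * 1000

// Hypothetical state shape; the source only names last_run_at and paused.
interface CuratorState {
  lastRunAt: number | null // epoch ms, null when the curator has never run
  paused: boolean
}

function shouldRunCurator(state: CuratorState, idleSinceMs: number, nowMs: number): boolean {
  // Assumption: a paused curator never runs.
  if (state.paused) return false
  // The agent must have been idle for at least the minimum idle window.
  const idleHours = (nowMs - idleSinceMs) / HOUR_MS
  if (idleHours < DEFAULT_MIN_IDLE_HOURS) return false
  // Assumption: with no recorded run, the curator is due.
  if (state.lastRunAt === null) return true
  // The last run must be more than the interval ago.
  return (nowMs - state.lastRunAt) / HOUR_MS > DEFAULT_INTERVAL_HOURS
}
```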
+What the curator does:
+- "Auto-transition lifecycle states based on derived skill activity timestamps"
+- "Spawn a background review agent that can **pin / archive / consolidate / patch** agent-created skills via `skill_manage`"
+- "Persist curator state (last_run_at, paused, etc.) in `.curator_state`"
+Strict invariants:
+- Only touches agent-created skills
+- "Never auto-deletes — only archives"
+- Pinned skills bypass auto-transitions
+- Uses the auxiliary client (separate from main session)
+### Hermes' actual gate
+**There is none.** The curator is an LLM editor making editorial decisions. There is no:
+- Held-out validation
+- Performance comparison between old and new skill versions
+- Statistical test
+- Rejection-on-regression mechanism
+Skills are refined by an LLM looking at usage patterns; the refinement is accepted because the LLM proposed it.
+### Honest assessment
+Hermes has a **skill curation system**, not a self-improvement loop. The README's claim "the only agent with a built-in learning loop" is generous — it's a 7-day-cron LLM librarian. There's no measurable guarantee that today's curated skill collection performs better than yesterday's.
+Compare:
+| Component | Hermes | SkillOpt | Tangle |
+|---|---|---|---|
+| Validation gate | None | `>` | Paired bootstrap CI |
+| Patch-level edits | No (LLM rewrites whole skill) | Yes | No (full rewrite only) |
+| Skill ranking / selection | No | Yes | No |
+| Sample efficiency claim | None | None cited in this doc (the 35× vs GRPO figure is GEPA's, see Part 1) | None |
+| Frequency | 7-day cron | Per training step | Per `selfImprove()` call |
+Where Tangle WINS: the gate. Where SkillOpt WINS: the pipeline sophistication. Where Hermes WINS: the deployment story (multi-platform, multi-tool-backend).
+---
+## Part 4 — What we should actually do
+### Phase A — rename to honest names (0.51.1, this session)
+The current `skillOptDriver` and `gepaDriver` names overclaim. Options:
+1. **Rename both:**
+   - `gepaDriver` → `reflectiveRewriteDriver` (drops the "Pareto" implication)
+   - `skillOptDriver` → `constrainedReflectiveDriver` (drops the SkillOpt-methodology implication)
+   - Reserve `gepaDriver` + `skillOptDriver` for faithful implementations
+2. **Keep `gepaDriver` name** (it's our most-used driver; renaming is disruptive); rename `skillOptDriver`.
+3. **Keep both names; add `@experimental` + a "differs from paper" docstring section.** Cheapest. Truthful enough.
+Recommendation: **option 3 plus a frontmatter "deviations from paper" section** in each driver source file. Empirically test before renaming.
+### Phase B — build the honest empirical harness (0.51.1, this session)
+`tests/driver-empirical.bench.ts` — for each driver:
+- Same scenarios (5 synthetic + 5 real legal-agent scenarios)
+- Same judge
+- Same `baselineSurface`
+- Same `budget` (1 gen, 3 candidates, holdout 0.3)
+- Report: lift mean, lift CI95, p-value, rollouts spent, dollars spent
+Drivers in the matrix:
+- `gepaDriver` (current full-rewrite reflection)
+- `skillOptDriver` (current 0.51.0 full-rewrite + constraints)
+- Future: real `skillOptDriverV2` with patch mode
+This is the **falsifiable test** of whether our drivers' methodology claims are worth the names.
+### Phase C — implement SkillOpt patch mode for real (0.52.0)
+Build `skillOptDriverV2` with:
+1. **`Edit` type matching SkillOpt's**: `{op: 'append'|'insert_after'|'replace'|'delete', content, target?, support_count?, source_type?}`
+2. **Reflect step emits patches**, not full rewrites
+3. **`mergePatches()`** — LLM-driven hierarchical merge of failure + success patches
+4. **`rankAndSelect()`** — LLM-driven ranking when edit pool > budget
+5. **Deterministic `applyPatch()`** — string ops, no LLM
+6. **Keep our gate** (paired bootstrap CI). Don't downgrade to SkillOpt's `>` — that's our edge.
+Estimated scope: 400-600 lines + tests.
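Item 5 is the only stage that needs no LLM, so it is the easiest to pin down early. Below is a minimal sketch of a deterministic apply over the four ops. The `ApplyReport` shape, first-match targeting via `indexOf`, and skipping edits whose target is missing are assumptions; SkillOpt's `apply_patch_with_report()` may behave differently in each of those respects.

```typescript
type EditOp = 'append' | 'insert_after' | 'replace' | 'delete'

interface Edit {
  op: EditOp
  content: string
  target?: string // required for replace / delete / insert_after
}

// Hypothetical report shape: what was applied, what was skipped and why.
interface ApplyReport {
  text: string
  applied: Edit[]
  skipped: { edit: Edit; reason: string }[]
}

function applyPatch(doc: string, edits: Edit[]): ApplyReport {
  let text = doc
  const applied: Edit[] = []
  const skipped: { edit: Edit; reason: string }[] = []
  for (const edit of edits) {
    if (edit.op === 'append') {
      // Append on a new line unless the document already ends with one.
      text = text.slice(-1) === '\n' ? text + edit.content : text + '\n' + edit.content
      applied.push(edit)
      continue
    }
    const target = edit.target
    const at = target ? text.indexOf(target) : -1
    if (!target || at === -1) {
      skipped.push({ edit, reason: 'target not found' })
      continue
    }
    const end = at + target.length
    if (edit.op === 'insert_after') {
      text = text.slice(0, end) + '\n' + edit.content + text.slice(end)
    } else if (edit.op === 'replace') {
      text = text.slice(0, at) + edit.content + text.slice(end)
    } else {
      text = text.slice(0, at) + text.slice(end) // delete
    }
    applied.push(edit)
  }
  return { text, applied, skipped }
}
```

Returning the skipped edits rather than throwing keeps the apply step deterministic while still letting the caller decide whether a partially applied patch is acceptable.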
+### Phase D — implement GEPA's Pareto frontier (0.53.0)
+Build `gepaDriverV2` with:
+1. **Candidate pool** retained across generations (non-dominated)
+2. **Multi-objective evaluation** (composite + cost + length + diversity)
+3. **Combine step** — LLM combines lessons from non-dominated candidates
+4. Keep reflection.
+5. Sample-efficiency target: match the paper's claim of up to 35× fewer rollouts on a benchmark we choose.
+Estimated scope: 500-800 lines + tests.
+---
+## Source pointers (audit trail)
+- GEPA paper: https://arxiv.org/abs/2507.19457
+- SkillOpt repo: https://github.com/microsoft/SkillOpt (cloned at `/tmp/SkillOpt/` 2026-05-27)
+- Hermes repo: https://github.com/NousResearch/hermes-agent (cloned at `/tmp/hermes-agent/` 2026-05-27)
+- Our gepaDriver: `src/campaign/drivers/gepa.ts`
+- Our skillOptDriver: `src/campaign/drivers/skillopt.ts`
+- Our gate: `src/campaign/gates/default-production-gate.ts`
+- Our reflection primitive: `src/reflective-mutation.ts`
+Update this doc when:
+- We discover new behavior in any of the upstream methods (via reading their code, not their READMEs)
+- We ship a driver that closes one of the named gaps
+- We run the empirical harness and have real numbers to add

package/docs/specs/hermes-self-improvement-audit.md ADDED

@@ -0,0 +1,93 @@
+# Hermes self-improvement — corrected audit
+**Status:** Active. This corrects an earlier underestimate where I claimed Hermes only had the 7-day curator. Drew pushed back; he was right.
+**Source:** github.com/NousResearch/hermes-agent cloned 2026-05-27 at /tmp/hermes-agent.
+## The corrected picture
+Hermes has **two** self-improvement mechanisms, not one. Per their own source comments: "background self-improvement review fork" (`tools/skill_provenance.py:5`).
+### Mechanism 1 — per-turn background review (the actual learning loop I missed)
+**File:** `agent/background_review.py` (593 lines)
+**Trigger.** `spawn_background_review_thread()` runs after every turn (`AIAgent.run_conversation`). It forks a daemon thread that:
+1. Snapshots the conversation history
+2. Boots a forked `AIAgent` inheriting the parent's runtime (model, provider, base_url, credentials, cached system prompt — exact same auth for prompt-cache reuse)
+3. Feeds the fork one of three review prompts:
+   - `_MEMORY_REVIEW_PROMPT` — should we save anything about the user?
+   - `_SKILL_REVIEW_PROMPT` — should we update the skill library?
+   - `_COMBINED_REVIEW_PROMPT` — both
+4. The fork executes with a tool whitelist (memory + skill management only)
+5. Writes go straight to `~/.hermes/skills/` and the memory store
+6. Provenance tag: `_memory_write_origin = "background_review"`
+**Critical signal source.** The skill-review prompt explicitly looks for **user-feedback signal during the conversation**:
+> "User corrected your style, tone, format, legibility, or verbosity. **Frustration signals** like 'stop doing X', 'this is too verbose', 'don't format like this', 'why are you explaining', 'just give me the answer', 'you always do Y and I hate it', or an explicit 'remember this' are FIRST-CLASS skill signals, not just memory signals."
+> "Be ACTIVE — most sessions produce at least one skill update, even if small. A pass that does nothing is a missed learning opportunity, not a neutral outcome."
+This is **qualitative LLM-judges-LLM optimization driven by real user-corrective feedback**. The validation gate is the forked agent's own judgment.
+**No held-out validation.** No A/B between skill versions. No regression rejection. No statistical test. The agent decides "save this" or "don't" and writes immediately.
+### Mechanism 2 — 7-day curator (housekeeping, not learning)
+**File:** `agent/curator.py`. As I described earlier — periodic LLM editorial pass over agent-created skills, pin/archive/consolidate/patch. **Only touches skills that the per-turn loop created.** Doesn't refine via measurement; refines via LLM editorial judgment.
+### Storage
+- `~/.hermes/skills/<name>/SKILL.md` + `references/` directory per skill (their own documented invariant)
+- `~/.hermes/skills/.usage.json` — sidecar telemetry per skill (usage counts, lifecycle states `active → stale → archived → pinned`)
+- Lifecycle states drive curator decisions but never the per-turn review
+## Corrected competitive matrix
+| Component | Hermes | SkillOpt | Tangle |
+|---|---|---|---|
+| Trigger | **Per-turn fork** + 7-day curator | Per training step | Per `selfImprove()` invocation |
+| Signal source | **User corrective feedback during chat** + agent retrospection | Judge scores on held-out batches | Judge scores + held-out + multi-rater |
+| Patch granularity | Tool-call level (skill_manage create/edit/patch) | Structured `Edit` ops with `support_count` | Full document rewrite (today) |
+| Validation gate | **None** — forked agent's own judgment | Literal `cand_hard > current_score` | **Paired bootstrap + CI + Cohen's d + MDE** |
+| Rejection-on-regression | No | Yes (gate returns `reject`) | Yes (gate returns `hold` / `inspect`) |
+| Cross-batch aggregation | No | Yes (`merge_patches`) | No |
+| Edit ranking under budget | No | Yes (`rank_and_select`) | No |
+| Longitudinal memory | Usage telemetry only | Yes (`run_slow_update`, `run_meta_skill`) | No |
+| Statistical rigor | None | None | **Highest** |
+| User-feedback signal | **Yes — first-class** | No (offline only) | No (offline only) |
+## What we beat them on — what they beat us on
+**Tangle wins:** the gate. Paired bootstrap CI + Cohen's d + MDE is statistically stricter than both. We refuse to ship on noise; both Hermes and SkillOpt accept improvements that could be noise.
+**Hermes wins:** the signal. They use real user-corrective feedback ("you always do Y and I hate it") as a first-class gradient. We use judge scores; they combine the forked agent's own retrospection with user-language feedback. Their loop fires **per turn**, ours fires **per offline campaign**.
+**SkillOpt wins:** the pipeline. Structured patches, hierarchical merge, edit ranking under budget, multiple update modes, longitudinal slow-update, meta-skill memory. Our pipeline is full-rewrite-then-validate; theirs is patch-with-multi-trial-evidence.
+## The real architectural insight from this audit
+Hermes' per-turn loop is **online**. Our `selfImprove()` is **offline batch**. When Hermes runs on top of our sandbox, **the harness will mutate skills underneath us continuously**. By the time our offline eval finishes, the baseline we measured against may be 50 generations behind production.
+That's the gap task **#98 — Profile-versioning architecture** exists to close.
+## What we should actually do differently
+1. **Stop dismissing Hermes' loop.** It's real, it uses signal we don't, and it's been deployed at scale. Their methodology paper would be: "user-corrective-feedback-driven self-improvement with LLM-judges-LLM acceptance and usage-telemetry-driven housekeeping." We should treat this as a real prior, not marketing.
+2. **Add user-feedback signal as a substrate primitive.** Today our `RunRecord.outcome` carries judge scores and raw artifact data. It doesn't carry **in-conversation corrective signals** ("user said 'stop doing X' at turn 7"). If we want to fuse our statistical gate with Hermes' signal source, we need a `RunRecord.userFeedback?: UserCorrectionEvent[]` field.
+3. **Recognize the offline/online divide is structural.** Hermes is online. Our substrate is offline. The bridge is the profile-versioning architecture (task #98) — let the harness do per-turn online updates, let the substrate do batch offline eval against versioned snapshots, then merge/rebase via a real diff protocol.
+4. **Do the per-turn signal extraction NOW (cheap).** Even without versioning, we could parse traces for user-corrective markers (regex on user messages: "stop", "don't", "I hate", "always Y", "just give me", "this is too X") and emit them as a new `RunRecord` field. That captures Hermes' signal source as additive substrate evidence.
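Item 4 could start as small as the sketch below. The `UserCorrectionEvent` fields, the `Message` shape, the use of the message index as the turn number, and the exact patterns are all assumptions; the doc above fixes only the type name and the rough marker list.

```typescript
// Hypothetical event shape; only the type name appears in this audit.
interface UserCorrectionEvent {
  turn: number // index of the message in the trace
  marker: string
  text: string
}

interface Message {
  role: 'user' | 'assistant'
  content: string
}

// One case-insensitive pattern per marker; no global flag, so test() is stateless.
const CORRECTION_MARKERS: { marker: string; pattern: RegExp }[] = [
  { marker: 'stop', pattern: /\bstop\b/i },
  { marker: "don't", pattern: /\bdon'?t\b/i },
  { marker: 'i hate', pattern: /\bi hate\b/i },
  { marker: 'you always', pattern: /\byou always\b/i },
  { marker: 'just give me', pattern: /\bjust give me\b/i },
  { marker: 'this is too', pattern: /\bthis is too\b/i },
]

// Scan user messages only and emit one event per matching marker.
function extractUserCorrections(messages: Message[]): UserCorrectionEvent[] {
  const events: UserCorrectionEvent[] = []
  messages.forEach((msg, turn) => {
    if (msg.role !== 'user') return
    for (const { marker, pattern } of CORRECTION_MARKERS) {
      if (pattern.test(msg.content)) {
        events.push({ turn, marker, text: msg.content })
      }
    }
  })
  return events
}
```

Plain regexes will produce false positives (for example "stop the server" as a task instruction), so events like these are better treated as weak evidence for a later review step than as ground truth.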
+## Source pointers (audit trail)
+- `agent/background_review.py:1-30` (header docstring naming the loop)
+- `agent/background_review.py:_MEMORY_REVIEW_PROMPT`, `_SKILL_REVIEW_PROMPT`, `_COMBINED_REVIEW_PROMPT` (the actual prompts)
+- `agent/background_review.py:_run_review_in_thread` (the fork worker)
+- `agent/background_review.py:spawn_background_review_thread` (the entry)
+- `tools/skill_provenance.py:1-15` (docstring: "background self-improvement review fork" — Hermes' own term for their loop)
+- `tools/skill_usage.py:1-25` (telemetry + lifecycle)
+- `agent/curator.py` (7-day housekeeping)
+- `skills/autonomous-ai-agents/hermes-agent/SKILL.md` (45KB CLI/architecture reference)

package/docs/specs/profile-versioning.md ADDED

@@ -0,0 +1,291 @@
+# Profile versioning — closing the offline/online drift gap
+**Status:** Architecture design. Greenfield, replace existing primitives in place. No V2 suffix.
+**Owner:** spans agent-eval + agent-runtime + agent-knowledge + sandbox SDK.
+**Tracking:** task #98.
+**Date:** 2026-05-27.
+## Architecture in one diagram — symmetric fork
+Neither writer is privileged. Both branches are first-class. When they reconverge, the substrate's job is to BENCHMARK the branches and propose what to keep — not to be the authority.
+```
+            AgentProfile lineage
+              ╱           ╲
+             ╱             ╲
+       harness branch   substrate branch
+        (per-turn writes)   (selfImprove diff)
+             ╲             ╱
+              ╲           ╱
+             DIVERGENCE EVENT
+                     │
+                     ▼
+            benchmark both branches
+            against the same held-out
+                     │
+            ┌────────┼────────┐
+            ▼        ▼        ▼
+       ship-harness ship-substrate merge
+                     │
+                     ▼
+              inconclusive → expand
+              corpus / human review
+```
+The substrate becomes a peer, not an owner. The gate verdict names *which* branch won, not just "ship."
+## What we are fixing
+Two writers, same state, no coordination:
+- **Harness writer** — Hermes-style per-turn `spawn_background_review_thread`, agent-runtime's runLoop, any future in-sandbox self-modification. Online, continuous, fires every turn.
+- **Substrate writer** — `selfImprove()` running offline against a frozen snapshot, producing a winner with held-out gate confidence. Batch, fires per campaign.
+Failure modes today:
+1. **Lost update.** Substrate ships a winner. Harness's per-turn updates since baseline evaporate.
+2. **Stale eval.** Substrate's lift CI is `winner vs P₀`. Production is at `P_h`. The CI says nothing about `winner vs P_h`.
+3. **Gate becomes a lie.** `gateDecision: ship` against `P₀` looks legitimate. Consumer ships. Regresses against `P_h`. Detection fails because metrics moved too.
+## The minimum design
+Single concept, single operation, content-addressable.
+### `AgentProfile` is a versioned, content-addressable object
+```typescript
+// src/profile/types.ts
+export interface AgentProfileVersion {
+  /** Content-hash of the materialised profile state. */
+  hash: string
+  /** Parent in the lineage, null for the genesis profile. */
+  parentHash: string | null
+  /** Who wrote this version. */
+  source: 'harness' | 'substrate' | 'human'
+  /** When. */
+  timestamp: number
+  /** Human-readable label, optional. */
+  label?: string
+}
+export type ProfileDiff =
+  | { kind: 'patch'; edits: ProfileEdit[] }
+  | { kind: 'replace'; content: MutableSurface }
+export interface ProfileEdit {
+  /** Which surface inside the profile this edit targets. */
+  surface: 'systemPrompt' | 'skill' | 'tool' | 'mcp' | 'subagent' | 'modelByRole'
+  /** Surface-scoped identifier — skillName, toolName, mcpId, subagentId, role. */
+  surfaceId?: string
+  op: 'append' | 'insert_after' | 'replace' | 'delete'
+  target?: string
+  content: string
+  /** Support count from multi-trial evidence. */
+  supportCount?: number
+  /** Source classification for the merge/rank stage. */
+  sourceType?: 'failure' | 'success'
+}
+```
+That's the whole substrate type surface. Three types. No interface explosion.
+### `RunRecord` carries the version it was captured at
+Replace the existing `commitSha` / `promptHash` / `configHash` triple with a single canonical hash. Greenfield, no compat shim:
+```typescript
+// src/run-record.ts — IN-PLACE replacement
+export interface RunRecord {
+  // ... existing fields ...
+  /** Content-hash of the AgentProfileVersion that produced this run. */
+  agentProfileHash: string
+}
+```
+`commitSha`, `promptHash`, `configHash` become *inputs* to `hashProfile()`, not separate fields.
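One property `hashProfile()` needs is that equal states hash equally regardless of key order. The sketch below shows that property with a canonical serialisation step. The `ProfileState` shape is hypothetical, and FNV-1a (32-bit) is used only to keep the example dependency-free; it is not collision-resistant, and a real content hash would presumably use a cryptographic digest such as SHA-256.

```typescript
// Hypothetical materialised state; this spec does not define the real shape.
interface ProfileState {
  commitSha: string
  promptHash: string
  configHash: string
  skills: Record<string, string>
}

// Serialise with sorted object keys so equal states produce equal strings.
function canonicalJson(value: unknown): string {
  if (value === null || typeof value !== 'object') return JSON.stringify(value)
  if (Array.isArray(value)) return '[' + value.map(canonicalJson).join(',') + ']'
  const obj = value as Record<string, unknown>
  const keys = Object.keys(obj)
    .filter((k) => obj[k] !== undefined)
    .sort()
  return '{' + keys.map((k) => JSON.stringify(k) + ':' + canonicalJson(obj[k])).join(',') + '}'
}

// FNV-1a over UTF-16 code units, returned as 8 hex characters.
// Stand-in only: swap for a cryptographic hash in real use.
function fnv1a32(input: string): string {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    // Multiply by the FNV prime 16777619 using shifts, then wrap to 32 bits.
    hash = (hash + ((hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24))) >>> 0
  }
  return ('0000000' + hash.toString(16)).slice(-8)
}

function hashProfile(state: ProfileState): string {
  return fnv1a32(canonicalJson(state))
}
```

Canonicalising before hashing matters more than the choice of digest here: without it, two writers serialising the same profile in different key orders would appear to have drifted.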
+### `selfImprove()` returns a diff, and the gate becomes 4-way
+Replace the current return shape. Greenfield, in place:
+```typescript
+// src/contract/self-improve.ts — IN-PLACE replacement
+export interface SelfImproveResult {
+  /** What we measured against. */
+  baselineHash: string
+  /** What we recommend applying. */
+  diff: ProfileDiff
+  /** Hash of `applyDiff(baseline, diff)` — verifiable by consumer. */
+  winningHash: string
+  /** Statistical evidence — paired bootstrap CI vs baseline. */
+  lift: LiftInsight
+  /** Substrate verdict — see DriftGateDecision below. */
+  gateDecision: DriftGateDecision
+  insight: InsightReport
+}
+export type DriftGateDecision =
+  | { kind: 'ship-substrate'; reason: string; vs?: 'baseline' | 'harness-live' }
+  | { kind: 'ship-harness'; reason: string }
+  | { kind: 'merge'; mergedDiff: ProfileDiff; reason: string }
+  | { kind: 'inconclusive'; reason: string }
+```
+When the substrate runs WITHOUT `driftPolicy: { kind: 'benchmark-branches' }`, only `ship-substrate` / `inconclusive` (or the equivalent `hold` framing) are possible. When `benchmark-branches` is on, all four kinds may surface.
+The substrate is now explicit: *"this diff is statistically valid against `baselineHash`. Whether to apply it to your live state is your call — and we'll tell you what we found when we compared branches."*
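On the consumer side, the four-way verdict lends itself to an exhaustive switch. The sketch below repeats the union from this spec with a simplified local `ProfileDiff` stand-in; the function name, the returned strings, and treating a missing `vs` as `'baseline'` are assumptions for illustration.

```typescript
// Simplified local stand-in for the ProfileDiff type defined in this spec.
type ProfileDiff =
  | { kind: 'patch'; edits: unknown[] }
  | { kind: 'replace'; content: unknown }

type DriftGateDecision =
  | { kind: 'ship-substrate'; reason: string; vs?: 'baseline' | 'harness-live' }
  | { kind: 'ship-harness'; reason: string }
  | { kind: 'merge'; mergedDiff: ProfileDiff; reason: string }
  | { kind: 'inconclusive'; reason: string }

function describeNextStep(decision: DriftGateDecision): string {
  switch (decision.kind) {
    case 'ship-substrate':
      // Assumption: an omitted `vs` means the lift was measured against the baseline.
      return 'apply substrate diff (measured vs ' + (decision.vs ?? 'baseline') + ')'
    case 'ship-harness':
      return 'keep live harness state, discard substrate diff'
    case 'merge':
      return 'apply merged diff of kind ' + decision.mergedDiff.kind
    case 'inconclusive':
      return 'expand corpus or escalate to human review'
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a kind is added.
      const unreachable: never = decision
      return unreachable
    }
  }
}
```

The `never` branch means adding a fifth verdict kind later becomes a compile error at every call site instead of a silently unhandled case.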
+### The opt-in drift policy
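+```typescript
+// Sketch of the option's type. `DriftPolicy` and `SelfImproveOptions` are
+// illustrative names; this spec does not fix them.
+export type DriftPolicy =
+  | { kind: 'ignore' }            // default: assume single-writer
+  | { kind: 'reject-on-drift' }   // cheap safety mode
+  | { kind: 'benchmark-branches'; benchmarkBudget: { generations: number; populationSize: number } }
+export interface SelfImproveOptions {
+  // ... existing options ...
+  driftPolicy?: DriftPolicy
+}
+```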
+- **`ignore`** is the default. Same as today. Zero overhead for consumers whose sandbox harness doesn't self-modify.
+- **`reject-on-drift`** is the cheap safety mode. Substrate notices `currentHash != baselineHash` at apply time and refuses to ship. Tells the consumer "your profile drifted; re-run selfImprove against current state."
+- **`benchmark-branches`** is the full thing — only used when the harness DOES self-modify (Hermes per-turn, Claude Code with skill creation, Codex with user-prompted skill edits, agent-builder RL bridge, any future autonomous improvement loop). Costs an extra mini-campaign. Returns the 4-way `DriftGateDecision`.
+### Generalises past Hermes
+Any in-sandbox profile mutation appends to the same profile log, regardless of trigger:
+- Hermes-style autonomous (per-turn `background_review` fork)
+- Claude/Codex user-prompted ("hey, create a skill for X")
+- agent-runtime's runLoop self-modifying its prompt addendum
+- RL-style policy parameter updates
+- Manual user edits via `skill_manage` commands
+The substrate doesn't care WHY the harness wrote. It just sees: live profile is at hash X, my baseline was Y. Same merge protocol applies.
+### Conflict resolution — the four cases
+For the `benchmark-branches` policy, the substrate handles four cases:
+1. **No conflict.** Edits target different surfaces (substrate edited `systemPrompt`, harness wrote a new `skill/X.md`). Auto-merge into a combined candidate, benchmark merged vs each branch.
+2. **Orthogonal edits to the same surface.** Both touched `systemPrompt` but different H2 sections (subsumed by `GepaDriverConstraints.preserveSections`). Auto-merge by union of edits, benchmark.
+3. **Semantic duplication.** Substrate proposed a new skill `summarize-pr`; harness already created `pr-summarizer` (similar purpose, different file). Substrate runs a similarity-detection step: embed both, threshold cosine similarity, surface as a "duplicate-likely" finding. Resolution: head-to-head benchmark with both → keep the winner → archive the loser.
+4. **Direct same-region conflict.** Both edited the same paragraph. Three resolution paths the substrate offers:
+   - **Head-to-head**: run both branches, pick the winner.
+   - **LLM-mediated merge**: prompt an LLM with both candidate edits + the held-out failure trials, ask for a synthesis that addresses both. Benchmark the synthesis.
+   - **Human review**: surface the diff with `requires-resolution: true` and stop.
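The similarity-detection step in case 3 might look like the sketch below. It assumes embeddings are already computed by some model outside this snippet; the `EmbeddedSkill` and `DuplicateFinding` shapes and the 0.85 default threshold are illustrative, not values this spec commits to.

```typescript
// Hypothetical shapes: a skill name plus a precomputed embedding vector.
interface EmbeddedSkill {
  name: string
  embedding: number[]
}

interface DuplicateFinding {
  proposed: string
  existing: string
  similarity: number
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('embeddings must be non-empty and the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Surface every proposed/existing pair at or above the threshold.
function findLikelyDuplicates(
  proposed: EmbeddedSkill[],
  existing: EmbeddedSkill[],
  threshold = 0.85,
): DuplicateFinding[] {
  const findings: DuplicateFinding[] = []
  for (const p of proposed) {
    for (const e of existing) {
      const similarity = cosineSimilarity(p.embedding, e.embedding)
      if (similarity >= threshold) {
        findings.push({ proposed: p.name, existing: e.name, similarity })
      }
    }
  }
  return findings
}
```

A finding here only flags a pair as "duplicate-likely"; the head-to-head benchmark described in case 3 still decides which skill is kept.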
+### Sandbox-side merge protocol
+```typescript
+// agent-runtime exports:
+export async function getCurrentProfileVersion(): Promise<AgentProfileVersion>
+export async function applyDiff(diff: ProfileDiff): Promise<ApplyResult>
+export type ApplyResult =
+  | { ok: true; newHash: string }
+  | { ok: false; reason: 'conflict'; ancestor: string; ours: string; theirs: string }
+  | { ok: false; reason: 'stale-baseline'; expected: string; actual: string }
+```
+Sandbox keeps an append-only profile log at `~/.tangle/profile-log.jsonl`. Every harness write appends an entry. Every substrate-proposed apply appends or returns conflict.
+### The merge algorithm (3-way, surface-scoped)
+When substrate proposes `diff(baselineHash → winningHash)` but live state is at `currentHash != baselineHash`:
+1. **Walk the lineage** — find common ancestor of `baselineHash` and `currentHash`. If `baselineHash` IS an ancestor of `currentHash`, we have a clean rebase target.
+2. **Per-surface 3-way merge** — for each `ProfileEdit` in the diff:
+   - If the targeted surface (skillName, toolName, etc.) hasn't been touched in `currentHash` lineage since `baselineHash` → apply.
+   - If touched but the textual edit is on a different region → apply (no conflict).
+   - If touched on the same region → return `conflict` with ancestor/ours/theirs for the human or substrate to resolve.
+3. **Re-eval recommendation** — if non-trivial conflicts, recommend `selfImprove()` re-run against `currentHash` rather than blind merge.
+The consumer chooses: rebase + re-eval (statistically clean), force merge (skip re-eval, ship-at-own-risk), or reject (substrate's proposal is too stale).
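Steps 1 and 2 can be sketched over an in-memory lineage. This is a simplification under stated assumptions: the log is a plain record keyed by hash, the lineage is acyclic, and "same region" is approximated as "same `target` string on the same surface". `VersionNode`, `SurfaceEdit`, and the function names other than `findCommonAncestor` are hypothetical.

```typescript
interface VersionNode {
  hash: string
  parentHash: string | null
}

// Hashes from `hash` back to genesis, newest first. Assumes an acyclic log.
function ancestry(log: Record<string, VersionNode>, hash: string): string[] {
  const chain: string[] = []
  let cursor: string | null = hash
  while (cursor !== null) {
    const node: VersionNode | undefined = log[cursor]
    if (!node) throw new Error('unknown version ' + cursor)
    chain.push(node.hash)
    cursor = node.parentHash
  }
  return chain
}

// Nearest version reachable from both a and b, or null if the lineages are disjoint.
function findCommonAncestor(log: Record<string, VersionNode>, a: string, b: string): string | null {
  const seen = ancestry(log, a)
  for (const h of ancestry(log, b)) {
    if (seen.indexOf(h) !== -1) return h
  }
  return null
}

// Clean rebase target: the baseline itself is an ancestor of the live state.
function isCleanRebase(log: Record<string, VersionNode>, baseline: string, current: string): boolean {
  return findCommonAncestor(log, baseline, current) === baseline
}

// Hypothetical reduced edit: just enough fields to decide apply vs conflict.
interface SurfaceEdit {
  surface: string
  surfaceId?: string
  target?: string
}

function surfaceKey(e: SurfaceEdit): string {
  return e.surface + ':' + (e.surfaceId ?? '')
}

// Apply when the surface is untouched or touched elsewhere; conflict on the same target.
function classifyEdit(edit: SurfaceEdit, liveEditsSinceBaseline: SurfaceEdit[]): 'apply' | 'conflict' {
  const sameSurface = liveEditsSinceBaseline.filter((l) => surfaceKey(l) === surfaceKey(edit))
  if (sameSurface.length === 0) return 'apply'
  const sameRegion = sameSurface.some((l) => l.target !== undefined && l.target === edit.target)
  return sameRegion ? 'conflict' : 'apply'
}
```

String equality on `target` will miss overlapping but non-identical regions, so a real implementation would likely compare text ranges; the sketch only shows where that decision sits in the flow.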
+## How this changes the substrate flow
+```
+Today:
+  ingest_baseline_P0 → eval → winner W → consumer ships W (regardless of drift)
+Tomorrow:
+  ingest_baseline_hashed → eval → {baselineHash, diff, winningHash, lift, gate}
+                                  ↓
+  sandbox.applyDiff(diff) → ok | conflict | stale-baseline
+                          ↓
+  if stale-baseline:    substrate re-eval against currentHash
+  if conflict:          substrate proposes targeted resolution OR human reviews
+  if ok:                profile log gets a new entry, substrate notified
+```
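The three branches of the "Tomorrow" flow map directly onto the `ApplyResult` union. The sketch below repeats that type from the sandbox-side protocol and adds a hypothetical `NextStep` type and `nextStep` function. It assumes the `actual` field of a `stale-baseline` result carries the live hash to re-evaluate against, which the spec does not state explicitly.

```typescript
// Copied from the sandbox-side merge protocol above.
type ApplyResult =
  | { ok: true; newHash: string }
  | { ok: false; reason: 'conflict'; ancestor: string; ours: string; theirs: string }
  | { ok: false; reason: 'stale-baseline'; expected: string; actual: string };

// Hypothetical: what the substrate does after each outcome.
type NextStep =
  | { kind: 'notify-substrate'; newHash: string }
  | { kind: 're-eval'; against: string }
  | { kind: 'resolve-or-review'; ancestor: string };

function nextStep(result: ApplyResult): NextStep {
  // ok: the profile log already has the new entry; just report it.
  if (result.ok) return { kind: 'notify-substrate', newHash: result.newHash };
  switch (result.reason) {
    case 'stale-baseline':
      // Assumption: `actual` is the live hash to re-evaluate against.
      return { kind: 're-eval', against: result.actual };
    case 'conflict':
      // Targeted resolution or human review, starting from the common ancestor.
      return { kind: 'resolve-or-review', ancestor: result.ancestor };
  }
}
```

Because the switch is exhaustive over `reason`, adding a fourth failure mode to `ApplyResult` would make this function fail to type-check until it is handled.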
+## What changes per package
+| Package | Files | Change |
+|---|---|---|
+| **agent-eval** | `src/profile/types.ts` (new) | `AgentProfileVersion`, `ProfileDiff`, `ProfileEdit` |
+| | `src/profile/hash.ts` (new) | `hashProfile()` — content-hash of the materialised state |
+| | `src/profile/diff.ts` (new) | `diffProfiles(a, b)`, `applyDiff(profile, diff)`, `threeWayMerge(ancestor, ours, theirs)` |
+| | `src/run-record.ts` | REPLACE `commitSha`/`promptHash`/`configHash` triple with `agentProfileHash` (greenfield) |
+| | `src/contract/self-improve.ts` | REPLACE `SelfImproveResult` to return `{baselineHash, diff, winningHash, lift, gateDecision, insight}` |
+| | `src/contract/analyze-runs.ts` | Add `agentProfileLineage` section to `InsightReport` — what versions ran, drift detected |
+| **agent-runtime** | `src/profile/log.ts` (new) | Append-only `~/.tangle/profile-log.jsonl`. `appendVersion()`, `readLineage()`, `findCommonAncestor()` |
+| | `src/profile/api.ts` (new) | `getCurrentProfileVersion()`, `applyDiff()` |
+| | `src/loops/run-loop.ts` | Every harness-side write to skills/memory/prompt-addendum appends to profile log |
+| **agent-knowledge** | `src/skills/version.ts` (new) | Skills become independently versioned objects; profile references them by `skillSetHash` |
+| **sandbox** | `src/agent-profile.ts` | Expose `getCurrentProfileVersion()` over the SDK |
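As a rough illustration of what `hashProfile()` could look like: a content hash is only stable if the materialised state is serialised canonically first. The sketch below sorts object keys at every level and then applies SHA-256. The choice of SHA-256, the `canonicalize` helper, and the example profile fields are assumptions; the spec only says the hash is over the materialised state.

```typescript
import { createHash } from "node:crypto";

// Serialise with object keys sorted at every level, so key order never
// changes the hash. Array order is preserved because it is meaningful.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  // Primitives; values JSON cannot represent (e.g. undefined) collapse to null.
  return JSON.stringify(value) ?? "null";
}

// Hypothetical signature: accepts any JSON-safe profile object.
function hashProfile(profile: unknown): string {
  return createHash("sha256").update(canonicalize(profile)).digest("hex");
}
```

Two profiles that differ only in key order hash identically, while reordering a skill list produces a different hash.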
+## What the gate semantics become
+`defaultProductionGate` today: "is the candidate statistically better than the baseline?"
+`defaultProductionGate` tomorrow: same question, scoped to the baseline. The consumer (sandbox / human / hosted-tier) decides whether to apply, given the answer + the current live state.
+We do NOT downgrade our paired-bootstrap CI. That's our edge over SkillOpt and Hermes. We just stop pretending the ship verdict is a deployment decision — it's a measurement.
+## The forcing function (task C from the audit)
+Before we commit weeks to this implementation, set up the empirical case:
+1. Run Hermes on top of our sandbox.
+2. Hermes' per-turn loop mutates skills.
+3. Run `selfImprove()` against the baseline at sandbox boot.
+4. Observe `gateDecision: ship` produce a winner that, when applied to the now-drifted live state, regresses.
+5. Capture the actual lift CI gap between `winner vs baseline` and `winner vs live`.
+If that gap is small (below the minimum detectable effect, MDE), profile-versioning is over-engineering. If it's large, this work is critical. We should know the number, not the intuition.
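One way to put a number on step 5 is a paired bootstrap over tasks. This is a sketch under stated assumptions: the winner, the baseline, and the drifted live profile are all scored on the same task list, and the gap is defined as (winner minus baseline) minus (winner minus live). `bootstrapGap` and `seededRandom` are hypothetical helpers, not part of the package.

```typescript
// Small seeded PRNG (mulberry32-style) so the resampling is reproducible.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile-bootstrap 95% CI on the mean per-task gap between the two lifts.
function bootstrapGap(
  winner: number[],
  baseline: number[],
  live: number[],
  resamples = 2000,
  seed = 1,
): { gap: number; lo: number; hi: number } {
  const n = winner.length;
  if (n === 0 || baseline.length !== n || live.length !== n) {
    throw new Error("scores must be paired per task and non-empty");
  }
  // Per-task gap: lift vs baseline minus lift vs live.
  const perTask = winner.map((w, i) => (w - baseline[i]) - (w - live[i]));
  const gap = perTask.reduce((sum, x) => sum + x, 0) / n;

  const rand = seededRandom(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    // Resample tasks with replacement, keeping each task's three scores paired.
    for (let i = 0; i < n; i++) sum += perTask[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor(0.025 * resamples)];
  const hi = means[Math.min(resamples - 1, Math.floor(0.975 * resamples))];
  return { gap, lo, hi };
}
```

If the whole interval `[lo, hi]` sits inside the MDE band around zero, the drift is not measurably changing the verdict. If it sits clearly outside, the ship decision was made against a baseline that no longer describes the live state.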
+## Phasing
+### Phase 0 — forcing function (1 week)
+Hermes-on-sandbox drift experiment. Real numbers on the gap. Either proves this work is needed or kills it.
+### Phase 1 — types + hashing (3 days)
+`AgentProfileVersion`, `ProfileDiff`, `ProfileEdit`. `hashProfile()`. `diffProfiles()`. `applyDiff()`. Pure functions, fully tested, no integration yet.
+### Phase 2 — substrate-side rewire (5 days)
+Replace `RunRecord` triple with `agentProfileHash`. Replace `SelfImproveResult` shape. Update `analyzeRuns` to detect lineage drift. Update tests + all 6 consumer products.
+### Phase 3 — sandbox + runtime (1 week)
+Profile log primitive in agent-runtime. `getCurrentProfileVersion()` + `applyDiff()` API. Sandbox SDK surface. Three-way merge for surface-scoped edits.
+### Phase 4 — agent-knowledge skill versioning (3 days)
+Skills become independently versioned. `skillSetHash` referenced from profile.
+### Phase 5 — Hermes adapter (3 days)
+Bridge: Hermes' `~/.hermes/skills/` write events → our profile log via a runtime hook.
+Total: ~3 weeks of focused work. Phase 0 in this session if Drew greenlights.
+## Source pointers
+- Task: #98
+- Related audit: `docs/specs/hermes-self-improvement-audit.md`
+- Related spec: `docs/specs/driver-honest-spec.md`
+- Current pre-versioning `RunRecord`: `src/run-record.ts`
+- Current pre-versioning `SelfImproveResult`: `src/contract/self-improve.ts`
+- Current gate: `src/campaign/gates/default-production-gate.ts`

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.50.2",
+  "version": "0.52.0",
   "description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
   "homepage": "https://github.com/tangle-network/agent-eval#readme",
   "repository": {