@tangle-network/agent-eval 0.50.2 → 0.52.0

@@ -0,0 +1,251 @@
1
+ # Driver Honest Spec — what each driver IS, what each methodology actually is, where we deviate
2
+
3
+ **Status:** Living document. Updated when we learn the truth from primary sources.
4
+ **Date:** 2026-05-27
5
+
6
+ This document exists because the project shipped two drivers with methodology names attached (`gepaDriver`, `skillOptDriver`) without the methodology specs being precisely encoded anywhere in the repo. That created an integrity gap. This doc closes it.
7
+
8
+ Every claim in this doc is sourced from a primary reference (paper, code, or directly verifiable from our source). Marketing language is forbidden. If something is not implemented we say so.
9
+
10
+ ---
11
+
12
+ ## Part 1 — GEPA (the paper)
13
+
14
+ **Source**: Agrawal et al., *"GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning"*, arXiv:2507.19457, July 2025.
15
+
16
+ ### What GEPA actually does
17
+
18
+ Outer loop (verbatim from abstract): "samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the **Pareto frontier of its own attempts**."
19
+
20
+ Named primitives in the paper:
21
+ - **GEPA** (Genetic-Pareto) — the overall optimizer
22
+ - **Pareto frontier** — non-dominated candidate set retained across iterations
23
+ - **Prompt updates** — mutations proposed by reflection
24
+ - **Rollouts** — trajectory samples
25
+
26
+ ### What gepaDriver in our substrate ACTUALLY does
27
+
28
+ Source: `src/campaign/drivers/gepa.ts` (132 lines)
29
+
30
+ - Single LLM call per `propose()` invocation
31
+ - Input: prior generation's **single best candidate by composite score** + that candidate's top/bottom scenarios + 3 weakest dimensions (`buildEvidence`)
32
+ - Output: N proposals, each a full document rewrite
33
+ - Dedup by exact text equality
34
+
35
+ ### Deviations from the GEPA paper
36
+
37
+ | GEPA paper | Our `gepaDriver` |
38
+ |---|---|
39
+ | **Pareto frontier** of candidates | **Single "best by composite"** — no Pareto set, no non-dominated tracking |
40
+ | **Combine complementary lessons** from frontier | Each generation reflects on ONE prior candidate; no combination |
41
+ | Multi-objective optimization | Single-objective (composite score) |
42
+ | Genetic operators (mutation, crossover) | Reflection only — no crossover |
43
+ | Sample efficiency claim (35× fewer rollouts than GRPO) | Unmeasured against any baseline |
44
+
45
+ **Honest assessment**: our `gepaDriver` is a **reflective full-rewrite driver**, not GEPA. It captures GEPA's *reflection* primitive but not its *Pareto* mechanism. The name oversells. A faithful renaming would be `reflectiveRewriteDriver`. A faithful implementation would add a Pareto candidate pool + combine step.
46
+
47
+ ---
48
+
49
+ ## Part 2 — SkillOpt (the paper + code)
50
+
51
+ **Source**:
52
+ - README: https://github.com/microsoft/SkillOpt
53
+ - Source: `/tmp/SkillOpt/skillopt/` (cloned 2026-05-27)
54
+ - Key files: `engine/trainer.py`, `optimizer/clip.py` (rank_and_select), `optimizer/update_modes.py`, `evaluation/gate.py`, `types.py`
55
+
56
+ ### What SkillOpt actually does
57
+
58
+ **6-stage per-step pipeline** (verbatim from `trainer.py:516` and adjacent):
59
+
60
+ 1. **Rollout** — `adapter.rollout(train_env, current_skill, ...)` collects trajectories on a batch.
61
+ 2. **Reflect** — `adapter.reflect()` analyses trajectories and emits **structured patches** (NOT full rewrites in patch mode). Failure trials → failure patches; success trials → success patches.
62
+ 3. **Aggregate** — `merge_patches(current_skill, all_failure_patches, all_success_patches, batch_size=merge_bs)` — hierarchically merges patches across accumulated batches.
63
+ 4. **Select** — `rank_and_select(current_skill, merged_patch, max_edits=edit_budget)` — if edit pool > budget, calls an optimizer LLM to **rank edits by importance** and keep top-L. Budget is "analogous to gradient clipping" (their words).
64
+ 5. **Update** — apply patch in one of 3 modes:
65
+ - **`patch`** — deterministic diff apply via `apply_patch_with_report()`; ops are `append | insert_after | replace | delete`
66
+ - **`rewrite_from_suggestions`** — LLM regenerates full skill from suggestions
67
+ - **`full_rewrite_minibatch`** — reflection directly emits complete candidate skills; select picks the best
68
+ 6. **Evaluate & Gate** — runs candidate on selection set, calls `evaluate_gate(cand_hard, current_score, best_score)`. Returns `accept_new_best | accept | reject` from a **literal `cand_hard > current_score`** comparison (`evaluation/gate.py:38`). No statistical test.
69
+
70
+ Plus epoch-level stages:
71
+ - **Slow update** — `run_slow_update()` builds longitudinal pairs across epochs.
72
+ - **Meta skill** — `run_meta_skill()` produces optimizer-side memory of patterns across adjacent epochs.
73
+
74
+ ### Canonical patch shape (from `types.py:22-45`)
75
+
76
+ ```python
77
+ EditOp = Literal["append", "insert_after", "replace", "delete"]
78
+
79
+ @dataclass
80
+ class Edit:
81
+     op: EditOp
82
+     content: str
83
+     target: str  # for replace/delete/insert_after
84
+     support_count: int | None  # how many trials voted for this edit
85
+     source_type: Literal["failure", "success"] | None
86
+     merge_level: int | None
87
+
88
+ @dataclass
89
+ class Patch:
90
+     edits: list[Edit]
91
+     reasoning: str
92
+     ranking_details: dict | None
93
+ ```
94
+
95
+ ### What `skillOptDriver` v0.51.0 in our substrate ACTUALLY does
96
+
97
+ Source: `src/campaign/drivers/skillopt.ts` (current as of 0.51.0)
98
+
99
+ - Single LLM call per `propose()` returning N full document rewrites
100
+ - Post-parse rejection on: (a) any H2 header dropped, (b) sentence-edit count > editBudget × 2
101
+ - Substantively equivalent to `gepaDriver` + 2 validation constraints
102
+
103
+ ### Deviations from SkillOpt
104
+
105
+ | SkillOpt actual | Our 0.51.0 `skillOptDriver` |
106
+ |---|---|
107
+ | 6-stage pipeline (rollout → reflect → aggregate → select → update → gate) | Single LLM call → N rewrites |
108
+ | **Patch-based edits** (`{op, target, content, support_count, source_type}`) | Full document rewrites only |
109
+ | `merge_patches()` hierarchical merge across batches | No aggregation; each `propose()` is independent |
110
+ | `rank_and_select(max_edits=edit_budget)` LLM-ranking of edits | All candidates that pass validation are returned |
111
+ | 3 update modes (`patch`, `rewrite_from_suggestions`, `full_rewrite_minibatch`) | Only `full_rewrite_minibatch`-equivalent |
112
+ | `evaluate_gate()` with `accept_new_best/accept/reject` codes | Substrate's outer gate decides ship/hold/inspect; driver doesn't see fine-grained accept signal |
113
+ | Longitudinal `slow_update` across epochs | Not implemented |
114
+ | `meta_skill` optimizer-side memory | Not implemented |
115
+ | Selection-set cache (`sel_cache`) for repeated candidate hashes | Not implemented |
116
+ | Edit-budget LR scheduler (constant / linear / cosine / autonomous) | Single fixed `editBudget` |
117
+ | Mini-batch accumulation (`steps_per_epoch`, `merge_batch_size`) | Not implemented |
118
+ | `decide_autonomous_learning_rate()` | Not implemented |
119
+ | `longitudinal_pair_policy` (mixed / changed / unchanged) | Not implemented |
120
+
121
+ **Honest assessment**: 13 substantive deviations. `skillOptDriver` 0.51.0 is **not** SkillOpt. It is `gepaDriver` with two post-validation constraints (section preservation, sentence-edit count). The methodology name oversells the implementation.
122
+
123
+ ### One thing where we are STRICTER than SkillOpt
124
+
125
+ **The gate.** SkillOpt: literal `cand_hard > current_score` (`evaluation/gate.py:38`). Our substrate: paired bootstrap + 95% CI + Cohen's d + MDE + p-value (`defaultProductionGate`). When the lift CI straddles zero, our gate returns `hold` / `inspect`. SkillOpt would accept any improvement at all, even single-sample noise.
126
+
127
+ This is real differentiation we have not been crediting ourselves for.
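The contrast between the two gates can be sketched as follows. This is a minimal illustration, assuming per-scenario paired scores for baseline and candidate; it is not the code behind `defaultProductionGate`, and it omits Cohen's d, MDE, and the p-value. The names `literalGate`, `pairedBootstrapGate`, and `mulberry32` are hypothetical, and the seeded PRNG is only there to make the resampling reproducible.

```typescript
// Hypothetical sketch; not the actual `defaultProductionGate` implementation.

/** Small seeded PRNG (mulberry32) so the bootstrap is reproducible. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface GateResult {
  verdict: 'ship' | 'hold';
  meanLift: number;
  ci95: [number, number];
}

/** Literal comparison, as the SkillOpt gate is described above. */
function literalGate(candidate: number, current: number): boolean {
  return candidate > current;
}

/** Paired bootstrap over per-scenario deltas (candidate minus baseline). */
function pairedBootstrapGate(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 42,
): GateResult {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error('paired scores must be non-empty and of equal length');
  }
  const deltas = candidate.map((c, i) => c - baseline[i]);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < deltas.length; i++) {
      sum += deltas[Math.floor(rand() * deltas.length)];
    }
    means.push(sum / deltas.length);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor(0.025 * resamples)];
  const hi = means[Math.min(resamples - 1, Math.floor(0.975 * resamples))];
  const meanLift = deltas.reduce((s, d) => s + d, 0) / deltas.length;
  // Ship only when the whole 95% interval sits above zero.
  return { verdict: lo > 0 ? 'ship' : 'hold', meanLift, ci95: [lo, hi] };
}
```

A literal gate accepts `literalGate(0.51, 0.5)`; the bootstrap gate holds whenever the resampled interval for the mean lift includes zero.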
128
+
129
+ ---
130
+
131
+ ## Part 3 — Hermes Agent's "self-improvement"
132
+
133
+ **Source**: `/tmp/hermes-agent/` (cloned 2026-05-27)
134
+ - `agent/curator.py` (the actual loop)
135
+ - `agent/skill_commands.py`
136
+ - `agent/skill_utils.py`
137
+
138
+ ### What Hermes actually does
139
+
140
+ From `curator.py` line 1: "Curator — background skill maintenance orchestrator. The curator is an auxiliary-model task that periodically reviews agent-created skills and maintains the collection."
141
+
142
+ Trigger: idle-driven, with default `DEFAULT_INTERVAL_HOURS = 24 * 7` (7 days). When the agent has been idle for `DEFAULT_MIN_IDLE_HOURS = 2` and the last curator run was > 7 days ago, `maybe_run_curator()` spawns a forked AIAgent.
143
+
144
+ What the curator does:
145
+ - "Auto-transition lifecycle states based on derived skill activity timestamps"
146
+ - "Spawn a background review agent that can **pin / archive / consolidate / patch** agent-created skills via `skill_manage`"
147
+ - "Persist curator state (last_run_at, paused, etc.) in `.curator_state`"
148
+
149
+ Strict invariants:
150
+ - Only touches agent-created skills
151
+ - "Never auto-deletes — only archives"
152
+ - Pinned skills bypass auto-transitions
153
+ - Uses the auxiliary client (separate from main session)
154
+
155
+ ### Hermes' actual gate
156
+
157
+ **There is none.** The curator is an LLM editor making editorial decisions. There is no:
158
+ - Held-out validation
159
+ - Performance comparison between old and new skill versions
160
+ - Statistical test
161
+ - Rejection-on-regression mechanism
162
+
163
+ Skills are refined by an LLM looking at usage patterns; the refinement is accepted because the LLM proposed it.
164
+
165
+ ### Honest assessment
166
+
167
+ Hermes has a **skill curation system**, not a self-improvement loop. The README's claim "the only agent with a built-in learning loop" is generous — it's a 7-day-cron LLM librarian. There's no measurable guarantee that today's curated skill collection performs better than yesterday's.
168
+
169
+ Compare:
170
+ | Component | Hermes | SkillOpt | Tangle |
171
+ |---|---|---|---|
172
+ | Validation gate | None | `>` | Paired bootstrap CI |
173
+ | Patch-level edits | No (LLM rewrites whole skill) | Yes | No (full rewrite only) |
174
+ | Skill ranking / selection | No | Yes | No |
175
+ | Sample efficiency claim | None | Not recorded in this audit (the 35× vs GRPO figure is GEPA's, see Part 1) | None |
176
+ | Frequency | 7-day cron | Per training step | Per `selfImprove()` call |
177
+
178
+ Where Tangle WINS: the gate. Where SkillOpt WINS: the pipeline sophistication. Where Hermes WINS: the deployment story (multi-platform, multi-tool-backend).
179
+
180
+ ---
181
+
182
+ ## Part 4 — What we should actually do
183
+
184
+ ### Phase A — rename to honest names (0.51.1, this session)
185
+
186
+ The current `skillOptDriver` and `gepaDriver` names overclaim. Options:
187
+
188
+ 1. **Rename both:**
189
+ - `gepaDriver` → `reflectiveRewriteDriver` (drops the "Pareto" implication)
190
+ - `skillOptDriver` → `constrainedReflectiveDriver` (drops the SkillOpt-methodology implication)
191
+ - Reserve `gepaDriver` + `skillOptDriver` for faithful implementations
192
+ 2. **Keep `gepaDriver` name** (it's our most-used driver; renaming is disruptive); rename `skillOptDriver`.
193
+ 3. **Keep both names; add `@experimental` + a "differs from paper" docstring section.** Cheapest. Truthful enough.
194
+
195
+ Recommendation: **option 3 plus a frontmatter "deviations from paper" section** in each driver source file. Empirically test before renaming.
196
+
197
+ ### Phase B — build the honest empirical harness (0.51.1, this session)
198
+
199
+ `tests/driver-empirical.bench.ts` — for each driver:
200
+ - Same scenarios (5 synthetic + 5 real legal-agent scenarios)
201
+ - Same judge
202
+ - Same `baselineSurface`
203
+ - Same `budget` (1 gen, 3 candidates, holdout 0.3)
204
+ - Report: lift mean, lift CI95, p-value, rollouts spent, $$ spent
205
+
206
+ Drivers in the matrix:
207
+ - `gepaDriver` (current full-rewrite reflection)
208
+ - `skillOptDriver` (current 0.51.0 full-rewrite + constraints)
209
+ - Future: real `skillOptDriverV2` with patch mode
210
+
211
+ This is the **falsifiable test** of whether our drivers' methodology claims are worth the names.
212
+
213
+ ### Phase C — implement SkillOpt patch mode for real (0.52.0)
214
+
215
+ Build `skillOptDriverV2` with:
216
+ 1. **`Edit` type matching SkillOpt's**: `{op: 'append'|'insert_after'|'replace'|'delete', content, target?, support_count?, source_type?}`
217
+ 2. **Reflect step emits patches**, not full rewrites
218
+ 3. **`mergePatches()`** — LLM-driven hierarchical merge of failure + success patches
219
+ 4. **`rankAndSelect()`** — LLM-driven ranking when edit pool > budget
220
+ 5. **Deterministic `applyPatch()`** — string ops, no LLM
221
+ 6. **Keep our gate** (paired bootstrap CI). Don't downgrade to SkillOpt's `>` — that's our edge.
222
+
223
+ Estimated scope: 400-600 lines + tests.
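A minimal sketch of item 5, the deterministic apply step, assuming targets are matched as exact substrings and that missing or ambiguous targets are skipped and reported rather than guessed at. The `ApplyReport` shape and the skip rules are assumptions for illustration; they are not taken from SkillOpt's `apply_patch_with_report()` or from any shipped driver.

```typescript
// Hypothetical sketch of a deterministic patch applier; no LLM involved.
type EditOp = 'append' | 'insert_after' | 'replace' | 'delete';

interface Edit {
  op: EditOp;
  content: string;
  target?: string; // needed for insert_after / replace / delete
}

interface ApplyReport {
  text: string;
  applied: Edit[];
  skipped: { edit: Edit; reason: string }[];
}

function applyPatch(doc: string, edits: Edit[]): ApplyReport {
  let text = doc;
  const applied: Edit[] = [];
  const skipped: { edit: Edit; reason: string }[] = [];

  for (const edit of edits) {
    if (edit.op === 'append') {
      // Keep appended content on its own line.
      const needsNewline = text !== '' && !text.endsWith('\n');
      text = text + (needsNewline ? '\n' : '') + edit.content;
      applied.push(edit);
      continue;
    }
    const target = edit.target;
    if (!target) {
      skipped.push({ edit, reason: 'missing target' });
      continue;
    }
    const at = text.indexOf(target);
    if (at === -1) {
      skipped.push({ edit, reason: 'target not found' });
      continue;
    }
    if (text.indexOf(target, at + 1) !== -1) {
      // Assumption: refuse to pick between several matches.
      skipped.push({ edit, reason: 'ambiguous target' });
      continue;
    }
    const end = at + target.length;
    if (edit.op === 'insert_after') {
      text = text.slice(0, end) + edit.content + text.slice(end);
    } else if (edit.op === 'replace') {
      text = text.slice(0, at) + edit.content + text.slice(end);
    } else {
      text = text.slice(0, at) + text.slice(end); // delete
    }
    applied.push(edit);
  }
  return { text, applied, skipped };
}
```

Reporting skipped edits instead of throwing lets the caller decide whether a partially applied patch is still worth sending to the gate.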
224
+
225
+ ### Phase D — implement GEPA's Pareto frontier (0.53.0)
226
+
227
+ Build `gepaDriverV2` with:
228
+ 1. **Candidate pool** retained across generations (non-dominated)
229
+ 2. **Multi-objective evaluation** (composite + cost + length + diversity)
230
+ 3. **Combine step** — LLM combines lessons from non-dominated candidates
231
+ 4. Keep reflection.
232
+ 5. Sample-efficiency target: match the paper's ~35× claim on a benchmark we choose.
233
+
234
+ Estimated scope: 500-800 lines + tests.
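The non-dominated candidate pool in item 1 can be sketched as a plain dominance filter. This assumes every objective is oriented so that higher is better (cost and length would need to be negated first); `Candidate`, `dominates`, and `paretoFrontier` are hypothetical names, not part of the current driver or of the GEPA reference code.

```typescript
// Hypothetical sketch of non-dominated filtering for a candidate pool.
interface Candidate {
  id: string;
  /** One score per objective; assumption: higher is better for all of them. */
  objectives: number[];
}

/** a dominates b when a is at least as good everywhere and strictly better somewhere. */
function dominates(a: number[], b: number[]): boolean {
  let strictlyBetter = false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return false;
    if (a[i] > b[i]) strictlyBetter = true;
  }
  return strictlyBetter;
}

/** Keep every candidate that no other candidate dominates. */
function paretoFrontier(pool: Candidate[]): Candidate[] {
  return pool.filter(
    (c) => !pool.some((other) => other !== c && dominates(other.objectives, c.objectives)),
  );
}
```

Unlike "single best by composite", this keeps a candidate that is weak overall but strongest on one objective, which is what a combine step would need as input.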
235
+
236
+ ---
237
+
238
+ ## Source pointers (audit trail)
239
+
240
+ - GEPA paper: https://arxiv.org/abs/2507.19457
241
+ - SkillOpt repo: https://github.com/microsoft/SkillOpt (cloned at `/tmp/SkillOpt/` 2026-05-27)
242
+ - Hermes repo: https://github.com/NousResearch/hermes-agent (cloned at `/tmp/hermes-agent/` 2026-05-27)
243
+ - Our gepaDriver: `src/campaign/drivers/gepa.ts`
244
+ - Our skillOptDriver: `src/campaign/drivers/skillopt.ts`
245
+ - Our gate: `src/campaign/gates/default-production-gate.ts`
246
+ - Our reflection primitive: `src/reflective-mutation.ts`
247
+
248
+ Update this doc when:
249
+ - We discover new behavior in any of the upstream methods (via reading their code, not their READMEs)
250
+ - We ship a driver that closes one of the named gaps
251
+ - We run the empirical harness and have real numbers to add
@@ -0,0 +1,93 @@
1
+ # Hermes self-improvement — corrected audit
2
+
3
+ **Status:** Active. This corrects an earlier underestimate where I claimed Hermes only had the 7-day curator. Drew pushed back; he was right.
4
+ **Source:** github.com/NousResearch/hermes-agent cloned 2026-05-27 at /tmp/hermes-agent.
5
+
6
+ ## The corrected picture
7
+
8
+ Hermes has **two** self-improvement mechanisms, not one. Per their own source comments: "background self-improvement review fork" (`tools/skill_provenance.py:5`).
9
+
10
+ ### Mechanism 1 — per-turn background review (the actual learning loop I missed)
11
+
12
+ **File:** `agent/background_review.py` (593 lines)
13
+
14
+ **Trigger.** `spawn_background_review_thread()` runs after every turn (`AIAgent.run_conversation`). Forks a daemon thread that:
15
+ 1. Snapshots the conversation history
16
+ 2. Boots a forked `AIAgent` inheriting the parent's runtime (model, provider, base_url, credentials, cached system prompt — exact same auth for prompt-cache reuse)
17
+ 3. Feeds the fork one of three review prompts:
18
+ - `_MEMORY_REVIEW_PROMPT` — should we save anything about the user?
19
+ - `_SKILL_REVIEW_PROMPT` — should we update the skill library?
20
+ - `_COMBINED_REVIEW_PROMPT` — both
21
+ 4. The fork executes with a tool whitelist (memory + skill management only)
22
+ 5. Writes go straight to `~/.hermes/skills/` and the memory store
23
+ 6. Provenance tag: `_memory_write_origin = "background_review"`
24
+
25
+ **Critical signal source.** The skill-review prompt explicitly looks for **user-feedback signal during the conversation**:
26
+
27
+ > "User corrected your style, tone, format, legibility, or verbosity. **Frustration signals** like 'stop doing X', 'this is too verbose', 'don't format like this', 'why are you explaining', 'just give me the answer', 'you always do Y and I hate it', or an explicit 'remember this' are FIRST-CLASS skill signals, not just memory signals."
28
+
29
+ > "Be ACTIVE — most sessions produce at least one skill update, even if small. A pass that does nothing is a missed learning opportunity, not a neutral outcome."
30
+
31
+ This is **qualitative LLM-judges-LLM optimization driven by real user-corrective feedback**. The validation gate is the forked agent's own judgment.
32
+
33
+ **No held-out validation.** No A/B between skill versions. No regression rejection. No statistical test. The agent decides "save this" or "don't" and writes immediately.
34
+
35
+ ### Mechanism 2 — 7-day curator (housekeeping, not learning)
36
+
37
+ **File:** `agent/curator.py`. As I described earlier — periodic LLM editorial pass over agent-created skills, pin/archive/consolidate/patch. **Only touches skills that the per-turn loop created.** Doesn't refine via measurement; refines via LLM editorial judgment.
38
+
39
+ ### Storage
40
+
41
+ - `~/.hermes/skills/<name>/SKILL.md` + `references/` directory per skill (their own documented invariant)
42
+ - `~/.hermes/skills/.usage.json` — sidecar telemetry per skill (usage counts, lifecycle states `active → stale → archived → pinned`)
43
+ - Lifecycle states drive curator decisions but never the per-turn review
44
+
45
+ ## Corrected competitive matrix
46
+
47
+ | Component | Hermes | SkillOpt | Tangle |
48
+ |---|---|---|---|
49
+ | Trigger | **Per-turn fork** + 7-day curator | Per training step | Per `selfImprove()` invocation |
50
+ | Signal source | **User corrective feedback during chat** + agent retrospection | Judge scores on held-out batches | Judge scores + held-out + multi-rater |
51
+ | Patch granularity | Tool-call level (skill_manage create/edit/patch) | Structured `Edit` ops with `support_count` | Full document rewrite (today) |
52
+ | Validation gate | **None** — forked agent's own judgment | Literal `cand_hard > current_score` | **Paired bootstrap + CI + Cohen's d + MDE** |
53
+ | Rejection-on-regression | No | Yes (gate returns `reject`) | Yes (gate returns `hold` / `inspect`) |
54
+ | Cross-batch aggregation | No | Yes (`merge_patches`) | No |
55
+ | Edit ranking under budget | No | Yes (`rank_and_select`) | No |
56
+ | Longitudinal memory | Usage telemetry only | Yes (`run_slow_update`, `run_meta_skill`) | No |
57
+ | Statistical rigor | None | None | **Highest** |
58
+ | User-feedback signal | **Yes — first-class** | No (offline only) | No (offline only) |
59
+
60
+ ## What we beat them on — what they beat us on
61
+
62
+ **Tangle wins:** the gate. Paired bootstrap CI + Cohen's d + MDE is statistically stricter than both. We refuse to ship on noise; both Hermes and SkillOpt accept improvements that could be noise.
63
+
64
+ **Hermes wins:** the signal. They use real user-corrective feedback ("you always do Y and I hate it") as a first-class gradient. We use judge scores; they combine the forked agent's own retrospection with user-language feedback. Their loop fires **per turn**, ours fires **per offline campaign**.
65
+
66
+ **SkillOpt wins:** the pipeline. Structured patches, hierarchical merge, edit ranking under budget, multiple update modes, longitudinal slow-update, meta-skill memory. Our pipeline is full-rewrite-then-validate; theirs is patch-with-multi-trial-evidence.
67
+
68
+ ## The real architectural insight from this audit
69
+
70
+ Hermes' per-turn loop is **online**. Our `selfImprove()` is **offline batch**. When Hermes runs on top of our sandbox, **the harness will mutate skills underneath us continuously**. By the time our offline eval finishes, the baseline we measured against may be 50 generations behind production.
71
+
72
+ That's the gap task **#98 — Profile-versioning architecture** exists to close.
73
+
74
+ ## What we should actually do differently
75
+
76
+ 1. **Stop dismissing Hermes' loop.** It's real, it uses signal we don't, and it's been deployed at scale. Their methodology paper would be: "user-corrective-feedback-driven self-improvement with LLM-judges-LLM acceptance and usage-telemetry-driven housekeeping." We should treat this as a real prior, not marketing.
77
+
78
+ 2. **Add user-feedback signal as a substrate primitive.** Today our `RunRecord.outcome` carries judge scores and raw artifact data. It doesn't carry **in-conversation corrective signals** ("user said 'stop doing X' at turn 7"). If we want to fuse our statistical gate with Hermes' signal source, we need a `RunRecord.userFeedback?: UserCorrectionEvent[]` field.
79
+
80
+ 3. **Recognize the offline/online divide is structural.** Hermes is online. Our substrate is offline. The bridge is the profile-versioning architecture (task #98) — let the harness do per-turn online updates, let the substrate do batch offline eval against versioned snapshots, then merge/rebase via a real diff protocol.
81
+
82
+ 4. **Do the per-turn signal extraction NOW (cheap).** Even without versioning, we could parse traces for user-corrective markers (regex on user messages: "stop", "don't", "I hate", "always Y", "just give me", "this is too X") and emit them as a new `RunRecord` field. That captures Hermes' signal source as additive substrate evidence.
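Item 4 can be sketched as a small marker scan over user messages. This is a rough illustration under stated assumptions: the `TraceMessage` and `UserCorrectionEvent` shapes are hypothetical (only the `UserCorrectionEvent` name appears above), turns are counted from 1 across all messages, and the patterns simply mirror the marker list in item 4.

```typescript
// Hypothetical sketch: regex scan for user-corrective markers in a trace.
interface TraceMessage {
  role: 'user' | 'assistant';
  content: string;
}

interface UserCorrectionEvent {
  turn: number; // 1-based position in the trace (assumption)
  marker: string;
  excerpt: string;
}

const CORRECTION_MARKERS: { marker: string; pattern: RegExp }[] = [
  { marker: 'stop', pattern: /\bstop\b/i },
  { marker: "don't", pattern: /\bdon['’]?t\b/i },
  { marker: 'i hate', pattern: /\bi hate\b/i },
  { marker: 'always', pattern: /\balways\b/i },
  { marker: 'just give me', pattern: /\bjust give me\b/i },
  { marker: 'this is too', pattern: /\bthis is too\b/i },
];

function extractUserCorrections(messages: TraceMessage[]): UserCorrectionEvent[] {
  const events: UserCorrectionEvent[] = [];
  messages.forEach((message, index) => {
    if (message.role !== 'user') return; // only user text carries corrective signal
    for (const { marker, pattern } of CORRECTION_MARKERS) {
      if (pattern.test(message.content)) {
        events.push({ turn: index + 1, marker, excerpt: message.content.slice(0, 80) });
      }
    }
  });
  return events;
}
```

Plain keyword matching will produce false positives ("stop the server"), so the output is better treated as additive evidence than as a gate input.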
83
+
84
+ ## Source pointers (audit trail)
85
+
86
+ - `agent/background_review.py:1-30` (header docstring naming the loop)
87
+ - `agent/background_review.py:_MEMORY_REVIEW_PROMPT`, `_SKILL_REVIEW_PROMPT`, `_COMBINED_REVIEW_PROMPT` (the actual prompts)
88
+ - `agent/background_review.py:_run_review_in_thread` (the fork worker)
89
+ - `agent/background_review.py:spawn_background_review_thread` (the entry)
90
+ - `tools/skill_provenance.py:1-15` (docstring: "background self-improvement review fork" — Hermes' own term for their loop)
91
+ - `tools/skill_usage.py:1-25` (telemetry + lifecycle)
92
+ - `agent/curator.py` (7-day housekeeping)
93
+ - `skills/autonomous-ai-agents/hermes-agent/SKILL.md` (45KB CLI/architecture reference)
@@ -0,0 +1,291 @@
1
+ # Profile versioning — closing the offline/online drift gap
2
+
3
+ **Status:** Architecture design. Greenfield, replace existing primitives in place. No V2 suffix.
4
+ **Owner:** spans agent-eval + agent-runtime + agent-knowledge + sandbox SDK.
5
+ **Tracking:** task #98.
6
+ **Date:** 2026-05-27.
7
+
8
+ ## Architecture in one diagram — symmetric fork
9
+
10
+ Neither writer is privileged. Both branches are first-class. When they reconverge, the substrate's job is to BENCHMARK the branches and propose what to keep — not to be the authority.
11
+
12
+ ```
13
+                  AgentProfile lineage
14
+                     ╱            ╲
15
+                    ╱              ╲
16
+        harness branch          substrate branch
17
+       (per-turn writes)        (selfImprove diff)
18
+                    ╲              ╱
19
+                     ╲            ╱
20
+                    DIVERGENCE EVENT
21
+
22
+
23
+                benchmark both branches
24
+               against the same held-out
25
+
26
+                   ┌────────┼────────┐
27
+                   ▼        ▼        ▼
28
+           ship-harness ship-substrate merge
29
+
30
+
31
+                 inconclusive → expand
32
+                 corpus / human review
33
+ ```
34
+
35
+ The substrate becomes a peer, not an owner. The gate verdict names *which* branch won, not just "ship."
36
+
37
+ ## What we are fixing
38
+
39
+ Two writers, same state, no coordination:
40
+
41
+ - **Harness writer** — Hermes-style per-turn `spawn_background_review_thread`, agent-runtime's runLoop, any future in-sandbox self-modification. Online, continuous, fires every turn.
42
+ - **Substrate writer** — `selfImprove()` running offline against a frozen snapshot, producing a winner with held-out gate confidence. Batch, fires per campaign.
43
+
44
+ Failure modes today:
45
+
46
+ 1. **Lost update.** Substrate ships a winner. Harness's per-turn updates since baseline evaporate.
47
+ 2. **Stale eval.** Substrate's lift CI is `winner vs P₀`. Production is at `P_h`. The CI says nothing about `winner vs P_h`.
48
+ 3. **Gate becomes a lie.** `gateDecision: ship` against `P₀` looks legitimate. Consumer ships. Regresses against `P_h`. Detection fails because metrics moved too.
49
+
50
+ ## The minimum design
51
+
52
+ Single concept, single operation, content-addressable.
53
+
54
+ ### `AgentProfile` is a versioned, content-addressable object
55
+
56
+ ```typescript
57
+ // src/profile/types.ts
58
+
59
+ export interface AgentProfileVersion {
60
+ /** Content-hash of the materialised profile state. */
61
+ hash: string
62
+ /** Parent in the lineage, null for the genesis profile. */
63
+ parentHash: string | null
64
+ /** Who wrote this version. */
65
+ source: 'harness' | 'substrate' | 'human'
66
+ /** When. */
67
+ timestamp: number
68
+ /** Human-readable label, optional. */
69
+ label?: string
70
+ }
71
+
72
+ export type ProfileDiff =
73
+ | { kind: 'patch'; edits: ProfileEdit[] }
74
+ | { kind: 'replace'; content: MutableSurface }
75
+
76
+ export interface ProfileEdit {
77
+ /** Which surface inside the profile this edit targets. */
78
+ surface: 'systemPrompt' | 'skill' | 'tool' | 'mcp' | 'subagent' | 'modelByRole'
79
+ /** Surface-scoped identifier — skillName, toolName, mcpId, subagentId, role. */
80
+ surfaceId?: string
81
+ op: 'append' | 'insert_after' | 'replace' | 'delete'
82
+ target?: string
83
+ content: string
84
+ /** Support count from multi-trial evidence. */
85
+ supportCount?: number
86
+ /** Source classification for the merge/rank stage. */
87
+ sourceType?: 'failure' | 'success'
88
+ }
89
+ ```
90
+
91
+ That's the whole substrate type surface. Three types. No interface explosion.
92
+
93
+ ### `RunRecord` carries the version it was captured at
94
+
95
+ Replace the existing `commitSha` / `promptHash` / `configHash` triple with a single canonical hash. Greenfield, no compat shim:
96
+
97
+ ```typescript
98
+ // src/run-record.ts — IN-PLACE replacement
99
+ export interface RunRecord {
100
+ // ... existing fields ...
101
+ /** Content-hash of the AgentProfileVersion that produced this run. */
102
+ agentProfileHash: string
103
+ }
104
+ ```
105
+
106
+ `commitSha`, `promptHash`, `configHash` become *inputs* to `hashProfile()`, not separate fields.
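One way `hashProfile()` could fold the old triple into a single content hash is sketched below. The `ProfileHashInput` shape, the `surfaces` map, and the choice of SHA-256 over a key-sorted JSON serialisation are all assumptions for illustration; the design above only says that the three values become inputs.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical input shape; the real materialised profile state may differ.
interface ProfileHashInput {
  commitSha: string;
  promptHash: string;
  configHash: string;
  /** Surface contents keyed by an illustrative path, e.g. 'systemPrompt' or 'skill/foo'. */
  surfaces: Record<string, string>;
}

/** Serialise with sorted object keys so key order never changes the hash. */
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return '[' + value.map(canonicalize).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => JSON.stringify(key) + ':' + canonicalize(obj[key]))
      .join(',');
    return '{' + body + '}';
  }
  return JSON.stringify(value);
}

/** Content hash of the profile state (SHA-256, hex-encoded). */
function hashProfile(input: ProfileHashInput): string {
  return createHash('sha256').update(canonicalize(input)).digest('hex');
}
```

Canonicalising before hashing matters because two writers may serialise the same profile with different key orders, and that alone should not register as drift.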
107
+
108
+ ### `selfImprove()` returns a diff, and the gate becomes 4-way
109
+
110
+ Replace the current return shape. Greenfield, in place:
111
+
112
+ ```typescript
113
+ // src/contract/self-improve.ts — IN-PLACE replacement
114
+ export interface SelfImproveResult {
115
+ /** What we measured against. */
116
+ baselineHash: string
117
+ /** What we recommend applying. */
118
+ diff: ProfileDiff
119
+ /** Hash of `applyDiff(baseline, diff)` — verifiable by consumer. */
120
+ winningHash: string
121
+ /** Statistical evidence — paired bootstrap CI vs baseline. */
122
+ lift: LiftInsight
123
+ /** Substrate verdict — see DriftGateDecision below. */
124
+ gateDecision: DriftGateDecision
125
+ insight: InsightReport
126
+ }
127
+
128
+ export type DriftGateDecision =
129
+ | { kind: 'ship-substrate'; reason: string; vs?: 'baseline' | 'harness-live' }
130
+ | { kind: 'ship-harness'; reason: string }
131
+ | { kind: 'merge'; mergedDiff: ProfileDiff; reason: string }
132
+ | { kind: 'inconclusive'; reason: string }
133
+ ```
134
+
135
+ When the substrate runs WITHOUT `driftPolicy: benchmark-branches`, only `ship-substrate` / `inconclusive` (or the equivalent `hold` framing) are possible. When `benchmark-branches` is on, all four kinds may surface.
136
+
137
+ The substrate is now explicit: *"this diff is statistically valid against `baselineHash`. Whether to apply it to your live state is your call — and we'll tell you what we found when we compared branches."*
138
+
139
+ ### The opt-in drift policy
140
+
141
+ ```typescript
142
+ interface SelfImproveOptions { // options passed to selfImprove(); type name is illustrative
143
+   // ... existing options
144
+   driftPolicy?:
145
+     | { kind: 'ignore' } // default: assume single-writer
146
+     | { kind: 'reject-on-drift' } // cheap safety mode
147
+     | { kind: 'benchmark-branches'; benchmarkBudget: { generations: number; populationSize: number } }
148
+ }
149
+ ```
150
+
151
+ - **`ignore`** is the default. Same as today. Zero overhead for consumers whose sandbox harness doesn't self-modify.
152
+ - **`reject-on-drift`** is the cheap safety mode. Substrate notices `currentHash != baselineHash` at apply time and refuses to ship. Tells the consumer "your profile drifted; re-run selfImprove against current state."
153
+ - **`benchmark-branches`** is the full thing — only used when the harness DOES self-modify (Hermes per-turn, Claude Code with skill creation, Codex with user-prompted skill edits, agent-builder RL bridge, any future autonomous improvement loop). Costs an extra mini-campaign. Returns the 4-way `DriftGateDecision`.
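The `reject-on-drift` mode reduces to a hash comparison at apply time. A minimal sketch, assuming both hashes are already available as strings; `rejectOnDrift` and `DriftCheck` are hypothetical names, and the failure shape borrows the `stale-baseline` wording used for the sandbox-side result in this document.

```typescript
// Hypothetical sketch of the cheap safety mode: refuse to ship when the live profile moved.
type DriftCheck =
  | { ok: true }
  | { ok: false; reason: 'stale-baseline'; expected: string; actual: string };

function rejectOnDrift(baselineHash: string, currentHash: string): DriftCheck {
  if (currentHash === baselineHash) {
    return { ok: true };
  }
  // The caller is expected to re-run selfImprove against the current state.
  return { ok: false, reason: 'stale-baseline', expected: baselineHash, actual: currentHash };
}
```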
154
+
155
+ ### Generalises past Hermes
156
+
157
+ Any in-sandbox profile mutation appends to the same profile log, regardless of trigger:
158
+
159
+ - Hermes-style autonomous (per-turn `background_review` fork)
160
+ - Claude/Codex user-prompted ("hey, create a skill for X")
161
+ - agent-runtime's runLoop self-modifying its prompt addendum
162
+ - RL-style policy parameter updates
163
+ - Manual user edits via `skill_manage` commands
164
+
165
+ The substrate doesn't care WHY the harness wrote. It just sees: live profile is at hash X, my baseline was Y. Same merge protocol applies.
166
+
167
+ ### Conflict resolution — the four cases
168
+
169
+ For the `benchmark-branches` policy, the substrate handles four cases:
170
+
171
+ 1. **No conflict.** Edits target different surfaces (substrate edited `systemPrompt`, harness wrote a new `skill/X.md`). Auto-merge into a combined candidate, benchmark merged vs each branch.
172
+
173
+ 2. **Orthogonal edits to the same surface.** Both touched `systemPrompt` but different H2 sections (subsumed by `GepaDriverConstraints.preserveSections`). Auto-merge by union of edits, benchmark.
174
+
175
+ 3. **Semantic duplication.** Substrate proposed a new skill `summarize-pr`; harness already created `pr-summarizer` (similar purpose, different file). Substrate runs a similarity-detection step: embed both, threshold cosine similarity, surface as a "duplicate-likely" finding. Resolution: head-to-head benchmark with both → keep the winner → archive the loser.
176
+
177
+ 4. **Direct same-region conflict.** Both edited the same paragraph. Three resolution paths the substrate offers:
178
+ - **Head-to-head**: run both branches, pick the winner.
179
+ - **LLM-mediated merge**: prompt an LLM with both candidate edits + the held-out failure trials, ask for a synthesis that addresses both. Benchmark the synthesis.
180
+ - **Human review**: surface the diff with `requires-resolution: true` and stop.
181
+
182
+ ### Sandbox-side merge protocol
183
+
184
+ ```typescript
185
+ // agent-runtime exports:
186
+ export async function getCurrentProfileVersion(): Promise<AgentProfileVersion>
187
+ export async function applyDiff(diff: ProfileDiff): Promise<ApplyResult>
188
+
189
+ export type ApplyResult =
190
+ | { ok: true; newHash: string }
191
+ | { ok: false; reason: 'conflict'; ancestor: string; ours: string; theirs: string }
192
+ | { ok: false; reason: 'stale-baseline'; expected: string; actual: string }
193
+ ```
194
+
195
+ Sandbox keeps an append-only profile log at `~/.tangle/profile-log.jsonl`. Every harness write appends an entry. Every substrate-proposed apply appends or returns conflict.
196
+
197
+ ### The merge algorithm (3-way, surface-scoped)
198
+
199
+ When substrate proposes `diff(baselineHash → winningHash)` but live state is at `currentHash != baselineHash`:
200
+
201
+ 1. **Walk the lineage** — find common ancestor of `baselineHash` and `currentHash`. If `baselineHash` IS an ancestor of `currentHash`, we have a clean rebase target.
202
+ 2. **Per-surface 3-way merge** — for each `ProfileEdit` in the diff:
203
+ - If the targeted surface (skillName, toolName, etc.) hasn't been touched in `currentHash` lineage since `baselineHash` → apply.
204
+ - If touched but the textual edit is on a different region → apply (no conflict).
205
+ - If touched on the same region → return `conflict` with ancestor/ours/theirs for the human or substrate to resolve.
206
+ 3. **Re-eval recommendation**: if any conflict is non-trivial, recommend re-running `selfImprove()` against `currentHash` rather than merging blind.
207
+
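Step 2 above can be sketched as a pure classification function. The shapes here are assumptions: the spec does not say how a `ProfileEdit` encodes its region, so this sketch uses a half-open `[start, end)` range in baseline coordinates, and `SurfaceEdit`, `classifyEdit`, and `needsReEval` are hypothetical names. It also simplifies step 3 by treating every conflict as non-trivial.

```typescript
// Hypothetical shapes: regions are half-open [start, end) ranges in baseline coordinates.
interface Region { start: number; end: number }
interface SurfaceEdit { surface: string; region: Region }

type MergeDecision =
  | { action: "apply"; reason: "untouched" | "different-region" }
  | { action: "conflict"; overlapping: SurfaceEdit[] };

// Two half-open ranges overlap when each starts before the other ends.
function overlaps(a: Region, b: Region): boolean {
  return a.start < b.end && b.start < a.end;
}

// `liveEdits` are the edits made in the currentHash lineage since baselineHash.
function classifyEdit(proposed: SurfaceEdit, liveEdits: readonly SurfaceEdit[]): MergeDecision {
  const sameSurface = liveEdits.filter((e) => e.surface === proposed.surface);
  if (sameSurface.length === 0) {
    return { action: "apply", reason: "untouched" };
  }
  const overlapping = sameSurface.filter((e) => overlaps(e.region, proposed.region));
  if (overlapping.length === 0) {
    return { action: "apply", reason: "different-region" };
  }
  return { action: "conflict", overlapping };
}

// Simplification: any conflict triggers the re-eval recommendation.
function needsReEval(decisions: readonly MergeDecision[]): boolean {
  return decisions.some((d) => d.action === "conflict");
}
```

Keeping the classifier free of I/O makes the three branches easy to unit-test before any sandbox integration exists.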
208
+ The consumer chooses: rebase + re-eval (statistically clean), force merge (skip re-eval, ship-at-own-risk), or reject (substrate's proposal is too stale).
209
+
210
+ ## How this changes the substrate flow
211
+
212
+ ```
213
+ Today:
214
+ ingest_baseline_P0 → eval → winner W → consumer ships W (regardless of drift)
215
+
216
+ Tomorrow:
217
+ ingest_baseline_hashed → eval → {baselineHash, diff, winningHash, lift, gate}
218
+
219
+ sandbox.applyDiff(diff) → ok | conflict | stale-baseline
220
+
221
+ if stale-baseline: substrate re-eval against currentHash
222
+ if conflict: substrate proposes targeted resolution OR human reviews
223
+ if ok: profile log gets a new entry, substrate notified
224
+ ```
225
+
226
+ ## What changes per package
227
+
228
+ | Package | Files | Change |
229
+ |---|---|---|
230
+ | **agent-eval** | `src/profile/types.ts` (new) | `AgentProfileVersion`, `ProfileDiff`, `ProfileEdit` |
231
+ | | `src/profile/hash.ts` (new) | `hashProfile()` — content-hash of the materialised state |
232
+ | | `src/profile/diff.ts` (new) | `diffProfiles(a, b)`, `applyDiff(profile, diff)`, `threeWayMerge(ancestor, ours, theirs)` |
233
+ | | `src/run-record.ts` | REPLACE `commitSha`/`promptHash`/`configHash` triple with `agentProfileHash` (greenfield) |
234
+ | | `src/contract/self-improve.ts` | REPLACE `SelfImproveResult` to return `{baselineHash, diff, winningHash, lift, gateDecision, insight}` |
235
+ | | `src/contract/analyze-runs.ts` | Add `agentProfileLineage` section to `InsightReport` — what versions ran, drift detected |
236
+ | **agent-runtime** | `src/profile/log.ts` (new) | Append-only `~/.tangle/profile-log.jsonl`. `appendVersion()`, `readLineage()`, `findCommonAncestor()` |
237
+ | | `src/profile/api.ts` (new) | `getCurrentProfileVersion()`, `applyDiff()` |
238
+ | | `src/loops/run-loop.ts` | Every harness-side write to skills/memory/prompt-addendum appends to profile log |
239
+ | **agent-knowledge** | `src/skills/version.ts` (new) | Skills become independently versioned objects; profile references them by `skillSetHash` |
240
+ | **sandbox** | `src/agent-profile.ts` | Expose `getCurrentProfileVersion()` over the SDK |
241
+
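One way `hashProfile()` could produce a stable content hash is to canonicalize the profile before hashing, so that key order does not change the result. This is a sketch under assumptions: the spec does not name the hash function or the serialization, so SHA-256 over key-sorted JSON is a stand-in, and the `Json` type is a hypothetical placeholder for the materialised profile state.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for the materialised profile state.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted so that key order never affects the hash.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    // Array order is meaningful and is preserved.
    return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  }
  const keys = Object.keys(value).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${canonicalize(value[k])}`).join(",")}}`;
}

// Assumption: SHA-256 hex digest of the canonical form.
function hashProfile(profile: Json): string {
  return createHash("sha256").update(canonicalize(profile), "utf8").digest("hex");
}
```

Two profiles that differ only in property order then share a hash, which is what lets `baselineHash` and `currentHash` be compared by equality.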
242
+ ## What the gate semantics become
243
+
244
+ `defaultProductionGate` today: "is the candidate statistically better than the baseline?"
245
+
246
+ `defaultProductionGate` tomorrow: same question, scoped to the baseline. The consumer (sandbox / human / hosted-tier) decides whether to apply, given the answer + the current live state.
247
+
248
+ We do NOT downgrade our paired-bootstrap CI. That's our edge over SkillOpt and Hermes. We just stop pretending the ship verdict is a deployment decision — it's a measurement.
249
+
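For readers unfamiliar with the statistic, a paired-bootstrap confidence interval on lift can be sketched as below. This is a generic percentile bootstrap, not the gate's actual code: `pairedBootstrapCi`, its defaults, and the seeded `mulberry32` generator (used only to make the sketch reproducible) are assumptions.

```typescript
// Small seeded PRNG (mulberry32) so the sketch is reproducible; any uniform RNG would do.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface PairedCi { meanLift: number; lower: number; upper: number }

// baseline[i] and candidate[i] must be scores on the same task i.
function pairedBootstrapCi(
  baseline: readonly number[],
  candidate: readonly number[],
  resamples = 2000,
  alpha = 0.05,
  seed = 1,
): PairedCi {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error("paired samples must be non-empty and of equal length");
  }
  // Pairing: work on per-task differences, which removes between-task variance.
  const diffs = candidate.map((c, i) => c - baseline[i]);
  const n = diffs.length;
  const meanLift = diffs.reduce((s, d) => s + d, 0) / n;

  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += diffs[Math.floor(rand() * n)]; // resample differences with replacement
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  // Percentile interval over the resampled means.
  const lower = means[Math.floor((alpha / 2) * (resamples - 1))];
  const upper = means[Math.ceil((1 - alpha / 2) * (resamples - 1))];
  return { meanLift, lower, upper };
}
```

The interval describes lift relative to the baseline the scores were collected against, which is why the verdict is a measurement scoped to that baseline rather than a statement about the live state.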
250
+ ## The forcing function (task C from the audit)
251
+
252
+ Before we commit weeks to this implementation, set up the empirical case:
253
+
254
+ 1. Run Hermes on top of our sandbox.
255
+ 2. Hermes' per-turn loop mutates skills.
256
+ 3. Run `selfImprove()` against the baseline at sandbox boot.
257
+ 4. Check whether `gateDecision: ship` produces a winner that regresses when applied to the now-drifted live state.
258
+ 5. Capture the actual lift CI gap between `winner vs baseline` and `winner vs live`.
259
+
260
+ If that gap is small (below the minimum detectable effect, MDE), profile-versioning is over-engineering. If it's large, this work is critical. We should know the number, not the intuition.
261
+
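The decision rule above reduces to one subtraction and one comparison. The sketch below assumes the gap is measured as the difference in point-estimate lift; `LiftEstimate` and `judgeDrift` are hypothetical names, and any numbers passed in are inputs the experiment would have to supply.

```typescript
// Hypothetical container for a measured lift and its confidence interval.
interface LiftEstimate { lift: number; lower: number; upper: number }

type DriftVerdict = "over-engineering" | "critical";

// Gap between `winner vs baseline` and `winner vs live`, compared against the MDE.
function judgeDrift(
  vsBaseline: LiftEstimate,
  vsLive: LiftEstimate,
  mde: number,
): { gap: number; verdict: DriftVerdict; regressesOnLive: boolean } {
  const gap = vsBaseline.lift - vsLive.lift;
  return {
    gap,
    verdict: gap < mde ? "over-engineering" : "critical",
    // The winner regresses on the live state if the whole interval sits below zero.
    regressesOnLive: vsLive.upper < 0,
  };
}
```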
262
+ ## Phasing
263
+
264
+ ### Phase 0 — forcing function (1 week)
265
+ Hermes-on-sandbox drift experiment. Real numbers on the gap. Either proves this work is needed or kills it.
266
+
267
+ ### Phase 1 — types + hashing (3 days)
268
+ `AgentProfileVersion`, `ProfileDiff`, `ProfileEdit`. `hashProfile()`. `diffProfiles()`. `applyDiff()`. Pure functions, fully tested, no integration yet.
269
+
270
+ ### Phase 2 — substrate-side rewire (5 days)
271
+ Replace `RunRecord` triple with `agentProfileHash`. Replace `SelfImproveResult` shape. Update `analyzeRuns` to detect lineage drift. Update tests + all 6 consumer products.
272
+
273
+ ### Phase 3 — sandbox + runtime (1 week)
274
+ Profile log primitive in agent-runtime. `getCurrentProfileVersion()` + `applyDiff()` API. Sandbox SDK surface. Three-way merge for surface-scoped edits.
275
+
276
+ ### Phase 4 — agent-knowledge skill versioning (3 days)
277
+ Skills become independently versioned. `skillSetHash` referenced from profile.
278
+
279
+ ### Phase 5 — Hermes adapter (3 days)
280
+ Bridge: Hermes' `~/.hermes/skills/` write events → our profile log via a runtime hook.
281
+
282
+ Total: ~5 weeks of focused work if the phases run sequentially (the estimates above sum to about 24 working days). Phase 0 in this session if Drew greenlights.
283
+
284
+ ## Source pointers
285
+
286
+ - Task: #98
287
+ - Related audit: `docs/specs/hermes-self-improvement-audit.md`
288
+ - Related spec: `docs/specs/driver-honest-spec.md`
289
+ - Current pre-versioning `RunRecord`: `src/run-record.ts`
290
+ - Current pre-versioning `SelfImproveResult`: `src/contract/self-improve.ts`
291
+ - Current gate: `src/campaign/gates/default-production-gate.ts`
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@tangle-network/agent-eval",
3
- "version": "0.50.2",
3
+ "version": "0.52.0",
4
4
  "description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
5
5
  "homepage": "https://github.com/tangle-network/agent-eval#readme",
6
6
  "repository": {