@tangle-network/agent-eval 0.51.0 → 0.53.0
- package/CHANGELOG.md +54 -1
- package/dist/adapters/otel.d.ts +1 -1
- package/dist/campaign/index.d.ts +7 -66
- package/dist/campaign/index.js +5 -122
- package/dist/campaign/index.js.map +1 -1
- package/dist/{chunk-XAP6DJZE.js → chunk-YXD7GWJI.js} +35 -2
- package/dist/chunk-YXD7GWJI.js.map +1 -0
- package/dist/contract/index.d.ts +16 -4
- package/dist/contract/index.js +147 -1
- package/dist/contract/index.js.map +1 -1
- package/dist/hosted/index.d.ts +1 -1
- package/dist/{index-DQHtWQ57.d.ts → index-C7RhhEME.d.ts} +46 -0
- package/dist/openapi.json +1 -1
- package/dist/{run-improvement-loop-BPMjNKMJ.d.ts → run-improvement-loop-Cc7oZlRP.d.ts} +48 -15
- package/docs/design/self-improvement-protocol.md +223 -0
- package/docs/specs/driver-honest-spec.md +251 -0
- package/docs/specs/hermes-self-improvement-audit.md +93 -0
- package/docs/specs/profile-versioning.md +291 -0
- package/package.json +1 -1
- package/dist/chunk-XAP6DJZE.js.map +0 -1
@@ -0,0 +1,93 @@
# Hermes self-improvement — corrected audit

**Status:** Active. This corrects an earlier underestimate where I claimed Hermes only had the 7-day curator. Drew pushed back; he was right.
**Source:** github.com/NousResearch/hermes-agent cloned 2026-05-27 at /tmp/hermes-agent.

## The corrected picture

Hermes has **two** self-improvement mechanisms, not one. Per their own source comments: "background self-improvement review fork" (`tools/skill_provenance.py:5`).

### Mechanism 1 — per-turn background review (the actual learning loop I missed)

**File:** `agent/background_review.py` (593 lines)

**Trigger.** `spawn_background_review_thread()` runs after every turn (`AIAgent.run_conversation`). Forks a daemon thread that:
1. Snapshots the conversation history
2. Boots a forked `AIAgent` inheriting the parent's runtime (model, provider, base_url, credentials, cached system prompt — exact same auth for prompt-cache reuse)
3. Feeds the fork one of three review prompts:
   - `_MEMORY_REVIEW_PROMPT` — should we save anything about the user?
   - `_SKILL_REVIEW_PROMPT` — should we update the skill library?
   - `_COMBINED_REVIEW_PROMPT` — both
4. The fork executes with a tool whitelist (memory + skill management only)
5. Writes go straight to `~/.hermes/skills/` and the memory store
6. Provenance tag: `_memory_write_origin = "background_review"`

**Critical signal source.** The skill-review prompt explicitly looks for **user-feedback signal during the conversation**:

> "User corrected your style, tone, format, legibility, or verbosity. **Frustration signals** like 'stop doing X', 'this is too verbose', 'don't format like this', 'why are you explaining', 'just give me the answer', 'you always do Y and I hate it', or an explicit 'remember this' are FIRST-CLASS skill signals, not just memory signals."

> "Be ACTIVE — most sessions produce at least one skill update, even if small. A pass that does nothing is a missed learning opportunity, not a neutral outcome."

This is **qualitative LLM-judges-LLM optimization driven by real user-corrective feedback**. The validation gate is the forked agent's own judgment.

**No held-out validation.** No A/B between skill versions. No regression rejection. No statistical test. The agent decides "save this" or "don't" and writes immediately.

### Mechanism 2 — 7-day curator (housekeeping, not learning)

**File:** `agent/curator.py`. As I described earlier — periodic LLM editorial pass over agent-created skills, pin/archive/consolidate/patch. **Only touches skills that the per-turn loop created.** Doesn't refine via measurement; refines via LLM editorial judgment.

### Storage

- `~/.hermes/skills/<name>/SKILL.md` + `references/` directory per skill (their own documented invariant)
- `~/.hermes/skills/.usage.json` — sidecar telemetry per skill (usage counts, lifecycle states `active → stale → archived → pinned`)
- Lifecycle states drive curator decisions but never the per-turn review

## Corrected competitive matrix

| Component | Hermes | SkillOpt | Tangle |
|---|---|---|---|
| Trigger | **Per-turn fork** + 7-day curator | Per training step | Per `selfImprove()` invocation |
| Signal source | **User corrective feedback during chat** + agent retrospection | Judge scores on held-out batches | Judge scores + held-out + multi-rater |
| Patch granularity | Tool-call level (skill_manage create/edit/patch) | Structured `Edit` ops with `support_count` | Full document rewrite (today) |
| Validation gate | **None** — forked agent's own judgment | Literal `cand_hard > current_score` | **Paired bootstrap + CI + Cohen's d + MDE** |
| Rejection-on-regression | No | Yes (gate returns `reject`) | Yes (gate returns `hold` / `inspect`) |
| Cross-batch aggregation | No | Yes (`merge_patches`) | No |
| Edit ranking under budget | No | Yes (`rank_and_select`) | No |
| Longitudinal memory | Usage telemetry only | Yes (`run_slow_update`, `run_meta_skill`) | No |
| Statistical rigor | None | None | **Highest** |
| User-feedback signal | **Yes — first-class** | No (offline only) | No (offline only) |

## What we beat them on — what they beat us on

**Tangle wins:** the gate. Paired bootstrap CI + Cohen's d + MDE is statistically stricter than both. We refuse to ship on noise; both Hermes and SkillOpt accept improvements that could be noise.

**Hermes wins:** the signal. They use real user-corrective feedback ("you always do Y and I hate it") as a first-class gradient. We use judge scores; they use both judge scores AND user-language feedback. Their loop fires **per turn**, ours fires **per offline campaign**.

**SkillOpt wins:** the pipeline. Structured patches, hierarchical merge, edit ranking under budget, multiple update modes, longitudinal slow-update, meta-skill memory. Our pipeline is full-rewrite-then-validate; theirs is patch-with-multi-trial-evidence.

## The real architectural insight from this audit

Hermes' per-turn loop is **online**. Our `selfImprove()` is **offline batch**. When Hermes runs on top of our sandbox, **the harness will mutate skills underneath us continuously**. By the time our offline eval finishes, the baseline we measured against may be 50 generations behind production.

That's the gap task **#98 — Profile-versioning architecture** exists to close.

## What we should actually do differently

1. **Stop dismissing Hermes' loop.** It's real, it uses signal we don't, and it's been deployed at scale. Their methodology paper would be: "user-corrective-feedback-driven self-improvement with LLM-judges-LLM acceptance and usage-telemetry-driven housekeeping." We should treat this as a real prior, not marketing.

2. **Add user-feedback signal as a substrate primitive.** Today our `RunRecord.outcome` carries judge scores and raw artifact data. It doesn't carry **in-conversation corrective signals** ("user said 'stop doing X' at turn 7"). If we want to fuse our statistical gate with Hermes' signal source, we need a `RunRecord.userFeedback?: UserCorrectionEvent[]` field.

3. **Recognize the offline/online divide is structural.** Hermes is online. Our substrate is offline. The bridge is the profile-versioning architecture (task #98) — let the harness do per-turn online updates, let the substrate do batch offline eval against versioned snapshots, then merge/rebase via a real diff protocol.

4. **Do the per-turn signal extraction NOW (cheap).** Even without versioning, we could parse traces for user-corrective markers (regex on user messages: "stop", "don't", "I hate", "always Y", "just give me", "this is too X") and emit them as a new `RunRecord` field. That captures Hermes' signal source as additive substrate evidence.
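
The regex pass in item 4 could look roughly like the sketch below. It assumes a simplified trace shape: `TraceMessage`, the fields of `UserCorrectionEvent`, and `extractUserCorrections` are hypothetical names for illustration, not part of the published package. The patterns only mirror the example markers listed above and would need tuning against real traces.

```typescript
// Hypothetical shapes for illustration; not the package's actual types.
interface TraceMessage {
  role: 'user' | 'assistant'
  content: string
}

interface UserCorrectionEvent {
  turn: number // index of the message in the trace
  marker: string // label of the pattern that matched
  excerpt: string // the matched text
}

// Patterns mirror the markers named in the text; tune before real use.
const CORRECTION_PATTERNS: Array<[string, RegExp]> = [
  ['stop', /\bstop\b/i],
  ['dont', /\bdon'?t\b/i],
  ['hate', /\bI hate\b/i],
  ['always', /\byou always\b/i],
  ['just-give-me', /\bjust give me\b/i],
  ['too-x', /\bthis is too \w+/i],
]

// Scan user messages only and emit one event per matching pattern.
function extractUserCorrections(messages: TraceMessage[]): UserCorrectionEvent[] {
  const events: UserCorrectionEvent[] = []
  messages.forEach((msg, turn) => {
    if (msg.role !== 'user') return
    for (const [marker, pattern] of CORRECTION_PATTERNS) {
      const match = pattern.exec(msg.content)
      if (match) events.push({ turn, marker, excerpt: match[0] })
    }
  })
  return events
}
```

A plain keyword regex will produce false positives ("don't worry about it"), so the events are probably best treated as weak evidence to attach to a run, not as a gate input.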

## Source pointers (audit trail)

- `agent/background_review.py:1-30` (header docstring naming the loop)
- `agent/background_review.py:_MEMORY_REVIEW_PROMPT`, `_SKILL_REVIEW_PROMPT`, `_COMBINED_REVIEW_PROMPT` (the actual prompts)
- `agent/background_review.py:_run_review_in_thread` (the fork worker)
- `agent/background_review.py:spawn_background_review_thread` (the entry)
- `tools/skill_provenance.py:1-15` (docstring: "background self-improvement review fork" — Hermes' own term for their loop)
- `tools/skill_usage.py:1-25` (telemetry + lifecycle)
- `agent/curator.py` (7-day housekeeping)
- `skills/autonomous-ai-agents/hermes-agent/SKILL.md` (45KB CLI/architecture reference)

@@ -0,0 +1,291 @@
# Profile versioning — closing the offline/online drift gap

**Status:** Architecture design. Greenfield, replace existing primitives in place. No V2 suffix.
**Owner:** spans agent-eval + agent-runtime + agent-knowledge + sandbox SDK.
**Tracking:** task #98.
**Date:** 2026-05-27.

## Architecture in one diagram — symmetric fork

Neither writer is privileged. Both branches are first-class. When they reconverge, the substrate's job is to BENCHMARK the branches and propose what to keep — not to be the authority.

```
                AgentProfile lineage
                     ╱        ╲
                    ╱          ╲
        harness branch      substrate branch
      (per-turn writes)     (selfImprove diff)
                    ╲          ╱
                     ╲        ╱
                 DIVERGENCE EVENT
                        │
                        ▼
              benchmark both branches
              against the same held-out
                        │
               ┌────────┼────────┐
               ▼        ▼        ▼
        ship-harness ship-substrate merge
                        │
                        ▼
              inconclusive → expand
              corpus / human review
```

The substrate becomes a peer, not an owner. The gate verdict names *which* branch won, not just "ship."

## What we are fixing

Two writers, same state, no coordination:

- **Harness writer** — Hermes-style per-turn `spawn_background_review_thread`, agent-runtime's runLoop, any future in-sandbox self-modification. Online, continuous, fires every turn.
- **Substrate writer** — `selfImprove()` running offline against a frozen snapshot, producing a winner with held-out gate confidence. Batch, fires per campaign.

Failure modes today:

1. **Lost update.** Substrate ships a winner. Harness's per-turn updates since baseline evaporate.
2. **Stale eval.** Substrate's lift CI is `winner vs P₀`. Production is at `P_h`. The CI says nothing about `winner vs P_h`.
3. **Gate becomes a lie.** `gateDecision: ship` against `P₀` looks legitimate. Consumer ships. Regresses against `P_h`. Detection fails because metrics moved too.

## The minimum design

Single concept, single operation, content-addressable.

### `AgentProfile` is a versioned, content-addressable object

```typescript
// src/profile/types.ts

export interface AgentProfileVersion {
  /** Content-hash of the materialised profile state. */
  hash: string
  /** Parent in the lineage, null for the genesis profile. */
  parentHash: string | null
  /** Who wrote this version. */
  source: 'harness' | 'substrate' | 'human'
  /** When. */
  timestamp: number
  /** Human-readable label, optional. */
  label?: string
}

export type ProfileDiff =
  | { kind: 'patch'; edits: ProfileEdit[] }
  | { kind: 'replace'; content: MutableSurface }

export interface ProfileEdit {
  /** Which surface inside the profile this edit targets. */
  surface: 'systemPrompt' | 'skill' | 'tool' | 'mcp' | 'subagent' | 'modelByRole'
  /** Surface-scoped identifier — skillName, toolName, mcpId, subagentId, role. */
  surfaceId?: string
  op: 'append' | 'insert_after' | 'replace' | 'delete'
  target?: string
  content: string
  /** Support count from multi-trial evidence. */
  supportCount?: number
  /** Source classification for the merge/rank stage. */
  sourceType?: 'failure' | 'success'
}
```

That's the whole substrate type surface. Two types. No interface explosion.

### `RunRecord` carries the version it was captured at

Replace the existing `commitSha` / `promptHash` / `configHash` triple with a single canonical hash. Greenfield, no compat shim:

```typescript
// src/run-record.ts — IN-PLACE replacement
export interface RunRecord {
  // ... existing fields ...
  /** Content-hash of the AgentProfileVersion that produced this run. */
  agentProfileHash: string
}
```

`commitSha`, `promptHash`, `configHash` become *inputs* to `hashProfile()`, not separate fields.
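
One way `hashProfile()` might fold the three legacy hashes into a single content hash is sketched below. The spec only names the function and its inputs, so everything else is an assumption: the `MaterialisedProfile` shape, the `skills` field, the choice of SHA-256, and the key-sorting canonicalisation are all hypothetical.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical materialised state; the real profile surface is not shown in the spec.
interface MaterialisedProfile {
  commitSha: string
  promptHash: string
  configHash: string
  skills: Record<string, string> // skillName -> skill document content
}

// Serialise with sorted object keys so key order never changes the hash.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(',')}]`
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`)
      .join(',')
    return `{${body}}`
  }
  return JSON.stringify(value)
}

// Assumption: SHA-256 over the canonical serialisation, hex-encoded.
function hashProfile(profile: MaterialisedProfile): string {
  return createHash('sha256').update(canonicalJson(profile)).digest('hex')
}
```

Canonicalising before hashing matters here because two writers may serialise the same state with different key orders, and a content-addressable scheme should treat those as the same version.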

### `selfImprove()` returns a diff, and the gate becomes 4-way

Replace the current return shape. Greenfield, in place:

```typescript
// src/contract/self-improve.ts — IN-PLACE replacement
export interface SelfImproveResult {
  /** What we measured against. */
  baselineHash: string
  /** What we recommend applying. */
  diff: ProfileDiff
  /** Hash of `applyDiff(baseline, diff)` — verifiable by consumer. */
  winningHash: string
  /** Statistical evidence — paired bootstrap CI vs baseline. */
  lift: LiftInsight
  /** Substrate verdict — see DriftGateDecision below. */
  gateDecision: DriftGateDecision
  insight: InsightReport
}

export type DriftGateDecision =
  | { kind: 'ship-substrate'; reason: string; vs?: 'baseline' | 'harness-live' }
  | { kind: 'ship-harness'; reason: string }
  | { kind: 'merge'; mergedDiff: ProfileDiff; reason: string }
  | { kind: 'inconclusive'; reason: string }
```

When the substrate runs WITHOUT `driftPolicy: benchmark-branches`, only `ship-substrate` / `inconclusive` (or the equivalent `hold` framing) are possible. When `benchmark-branches` is on, all four kinds may surface.

The substrate is now explicit: *"this diff is statistically valid against `baselineHash`. Whether to apply it to your live state is your call — and we'll tell you what we found when we compared branches."*
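
On the consumer side, the 4-way verdict lends itself to an exhaustive `switch` over the discriminated union. The sketch below is illustrative only: `ConsumerAction` and `planAction` are hypothetical names, and the local type copies are trimmed so the block stands alone.

```typescript
// Local copies of the spec's types, trimmed so the sketch is self-contained.
type ProfileDiff =
  | { kind: 'patch'; edits: unknown[] }
  | { kind: 'replace'; content: unknown }

type DriftGateDecision =
  | { kind: 'ship-substrate'; reason: string; vs?: 'baseline' | 'harness-live' }
  | { kind: 'ship-harness'; reason: string }
  | { kind: 'merge'; mergedDiff: ProfileDiff; reason: string }
  | { kind: 'inconclusive'; reason: string }

// Hypothetical consumer-side action names.
type ConsumerAction =
  | 'apply-substrate-diff'
  | 'keep-live-profile'
  | 'apply-merged-diff'
  | 'escalate'

function planAction(decision: DriftGateDecision): ConsumerAction {
  switch (decision.kind) {
    case 'ship-substrate':
      return 'apply-substrate-diff'
    case 'ship-harness':
      return 'keep-live-profile'
    case 'merge':
      return 'apply-merged-diff'
    case 'inconclusive':
      return 'escalate'
    default: {
      // Compile-time exhaustiveness check: stops type-checking if a kind is added.
      const unreachable: never = decision
      throw new Error(`unhandled decision: ${JSON.stringify(unreachable)}`)
    }
  }
}
```

The `never` assignment in the default branch means a fifth decision kind would surface as a type error in every consumer, which seems to fit the "greenfield, in place" replacement style of this spec.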

### The opt-in drift policy

```typescript
selfImprove({
  // ... existing
  driftPolicy?:
    | { kind: 'ignore' }            // default — assume single-writer
    | { kind: 'reject-on-drift' }   // cheap safety mode
    | { kind: 'benchmark-branches'; benchmarkBudget: { generations, populationSize } }
})
```

- **`ignore`** is the default. Same as today. Zero overhead for consumers whose sandbox harness doesn't self-modify.
- **`reject-on-drift`** is the cheap safety mode. Substrate notices `currentHash != baselineHash` at apply time and refuses to ship. Tells the consumer "your profile drifted; re-run selfImprove against current state."
- **`benchmark-branches`** is the full thing — only used when the harness DOES self-modify (Hermes per-turn, Claude Code with skill creation, Codex with user-prompted skill edits, agent-builder RL bridge, any future autonomous improvement loop). Costs an extra mini-campaign. Returns the 4-way `DriftGateDecision`.
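
The two cheap policies reduce to a single hash comparison at apply time. A minimal sketch, assuming the check returns a result shaped like the `stale-baseline` variant of `ApplyResult` from this spec; `checkDriftAtApply` and `DriftCheck` are hypothetical names, and `benchmark-branches` is left out because it needs a mini-campaign, not a comparison.

```typescript
// Only the two cheap policies are covered here.
type DriftPolicy = { kind: 'ignore' } | { kind: 'reject-on-drift' }

// Hypothetical result shape, modelled on the spec's 'stale-baseline' naming.
type DriftCheck =
  | { ok: true }
  | { ok: false; reason: 'stale-baseline'; expected: string; actual: string }

function checkDriftAtApply(
  policy: DriftPolicy,
  baselineHash: string,
  currentHash: string,
): DriftCheck {
  // 'ignore' assumes a single writer, so no comparison is made.
  if (policy.kind === 'ignore') return { ok: true }
  // 'reject-on-drift': refuse when the live profile moved since the baseline.
  if (currentHash !== baselineHash) {
    return { ok: false, reason: 'stale-baseline', expected: baselineHash, actual: currentHash }
  }
  return { ok: true }
}
```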

### Generalises past Hermes

Any in-sandbox profile mutation appends to the same profile log, regardless of trigger:

- Hermes-style autonomous (per-turn `background_review` fork)
- Claude/Codex user-prompted ("hey, create a skill for X")
- agent-runtime's runLoop self-modifying its prompt addendum
- RL-style policy parameter updates
- Manual user edits via `skill_manage` commands

The substrate doesn't care WHY the harness wrote. It just sees: live profile is at hash X, my baseline was Y. Same merge protocol applies.

### Conflict resolution — the four cases

For the `benchmark-branches` policy, the substrate handles four cases:

1. **No conflict.** Edits target different surfaces (substrate edited `systemPrompt`, harness wrote a new `skill/X.md`). Auto-merge into a combined candidate, benchmark merged vs each branch.

2. **Orthogonal edits to the same surface.** Both touched `systemPrompt` but different H2 sections (subsumed by `GepaDriverConstraints.preserveSections`). Auto-merge by union of edits, benchmark.

3. **Semantic duplication.** Substrate proposed a new skill `summarize-pr`; harness already created `pr-summarizer` (similar purpose, different file). Substrate runs a similarity-detection step: embed both, threshold cosine similarity, surface as a "duplicate-likely" finding. Resolution: head-to-head benchmark with both → keep the winner → archive the loser.

4. **Direct same-region conflict.** Both edited the same paragraph. Three resolution paths the substrate offers:
   - **Head-to-head**: run both branches, pick the winner.
   - **LLM-mediated merge**: prompt an LLM with both candidate edits + the held-out failure trials, ask for a synthesis that addresses both. Benchmark the synthesis.
   - **Human review**: surface the diff with `requires-resolution: true` and stop.
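
The similarity step in case 3 can be sketched as a thresholded cosine similarity over two skill embeddings. The embeddings are assumed to come from whatever embedding model the substrate uses; the `0.85` threshold and both function names are placeholders, not values taken from the spec.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('embedding dimensions differ')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Hypothetical helper: flags a pair as "duplicate-likely" above a placeholder threshold.
function isDuplicateLikely(a: number[], b: number[], threshold = 0.85): boolean {
  return cosineSimilarity(a, b) >= threshold
}
```

A flag from this check is only a finding; per the text above, the actual resolution is still the head-to-head benchmark.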

### Sandbox-side merge protocol

```typescript
// agent-runtime exports:
export async function getCurrentProfileVersion(): Promise<AgentProfileVersion>
export async function applyDiff(diff: ProfileDiff): Promise<ApplyResult>

export type ApplyResult =
  | { ok: true; newHash: string }
  | { ok: false; reason: 'conflict'; ancestor: string; ours: string; theirs: string }
  | { ok: false; reason: 'stale-baseline'; expected: string; actual: string }
```

Sandbox keeps an append-only profile log at `~/.tangle/profile-log.jsonl`. Every harness write appends an entry. Every substrate-proposed apply appends or returns conflict.
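
The log operations can be sketched as pure functions over the JSONL text, leaving file I/O to the caller. The function names `appendVersion`, `readLineage`, and `findCommonAncestor` come from this spec's package table, but the signatures below are assumptions, as is the helper `parseLog`.

```typescript
// Copy of the spec's AgentProfileVersion so the sketch is self-contained.
interface AgentProfileVersion {
  hash: string
  parentHash: string | null
  source: 'harness' | 'substrate' | 'human'
  timestamp: number
  label?: string
}

// Append one JSON object per line; existing lines are never rewritten.
function appendVersion(log: string, version: AgentProfileVersion): string {
  return log + JSON.stringify(version) + '\n'
}

function parseLog(log: string): AgentProfileVersion[] {
  return log
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as AgentProfileVersion)
}

// Walk parentHash links from `hash` back to the genesis entry.
function readLineage(log: string, hash: string): string[] {
  const byHash = new Map<string, AgentProfileVersion>()
  for (const v of parseLog(log)) byHash.set(v.hash, v)
  const lineage: string[] = []
  let cursor: string | null = hash
  while (cursor !== null) {
    const entry: AgentProfileVersion | undefined = byHash.get(cursor)
    if (!entry) throw new Error(`unknown profile hash: ${cursor}`)
    lineage.push(cursor)
    cursor = entry.parentHash
  }
  return lineage
}

// Nearest hash that appears in both lineages, or null if they share none.
function findCommonAncestor(log: string, a: string, b: string): string | null {
  const seen = new Set(readLineage(log, a))
  return readLineage(log, b).find((h) => seen.has(h)) ?? null
}
```

Keeping these as string-in, string-out functions makes them easy to test; a real implementation would presumably wrap them with an atomic file append.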

### The merge algorithm (3-way, surface-scoped)

When substrate proposes `diff(baselineHash → winningHash)` but live state is at `currentHash != baselineHash`:

1. **Walk the lineage** — find common ancestor of `baselineHash` and `currentHash`. If `baselineHash` IS an ancestor of `currentHash`, we have a clean rebase target.
2. **Per-surface 3-way merge** — for each `ProfileEdit` in the diff:
   - If the targeted surface (skillName, toolName, etc.) hasn't been touched in `currentHash` lineage since `baselineHash` → apply.
   - If touched but the textual edit is on a different region → apply (no conflict).
   - If touched on the same region → return `conflict` with ancestor/ours/theirs for the human or substrate to resolve.
3. **Re-eval recommendation** — if non-trivial conflicts, recommend `selfImprove()` re-run against `currentHash` rather than blind merge.

The consumer chooses: rebase + re-eval (statistically clean), force merge (skip re-eval, ship-at-own-risk), or reject (substrate's proposal is too stale).
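
Step 2 of the algorithm above can be sketched as a per-edit classification. This is a simplification under a stated assumption: two edits are treated as hitting the "same region" when they share a surface, a `surfaceId`, and a `target` anchor, which is far coarser than a textual 3-way merge. `mergeEdits`, `EditOutcome`, and `surfaceKey` are hypothetical names.

```typescript
// Trimmed copy of the spec's ProfileEdit, enough for the merge sketch.
interface ProfileEdit {
  surface: 'systemPrompt' | 'skill' | 'tool' | 'mcp' | 'subagent' | 'modelByRole'
  surfaceId?: string
  op: 'append' | 'insert_after' | 'replace' | 'delete'
  target?: string
  content: string
}

type EditOutcome =
  | { status: 'apply'; edit: ProfileEdit }
  | { status: 'conflict'; ours: ProfileEdit; theirs: ProfileEdit }

const surfaceKey = (e: ProfileEdit): string => `${e.surface}:${e.surfaceId ?? ''}`

// Classify each substrate edit against the harness edits made since the baseline.
function mergeEdits(
  substrateEdits: ProfileEdit[],
  harnessEditsSinceBaseline: ProfileEdit[],
): EditOutcome[] {
  return substrateEdits.map((ours): EditOutcome => {
    const sameSurface = harnessEditsSinceBaseline.filter(
      (t) => surfaceKey(t) === surfaceKey(ours),
    )
    // Untouched surface, or touched only at other anchors: apply.
    // Two edits without a target on the same surface count as a conflict (conservative).
    const theirs = sameSurface.find((t) => t.target === ours.target)
    return theirs
      ? { status: 'conflict', ours, theirs }
      : { status: 'apply', edit: ours }
  })
}
```

Any `conflict` outcome here would feed step 3: recommend a re-run against `currentHash` instead of a blind merge.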

## How this changes the substrate flow

```
Today:
  ingest_baseline_P0 → eval → winner W → consumer ships W (regardless of drift)

Tomorrow:
  ingest_baseline_hashed → eval → {baselineHash, diff, winningHash, lift, gate}
                                   ↓
                  sandbox.applyDiff(diff) → ok | conflict | stale-baseline
                                   ↓
                  if stale-baseline: substrate re-eval against currentHash
                  if conflict: substrate proposes targeted resolution OR human reviews
                  if ok: profile log gets a new entry, substrate notified
```

## What changes per package

| Package | Files | Change |
|---|---|---|
| **agent-eval** | `src/profile/types.ts` (new) | `AgentProfileVersion`, `ProfileDiff`, `ProfileEdit` |
| | `src/profile/hash.ts` (new) | `hashProfile()` — content-hash of the materialised state |
| | `src/profile/diff.ts` (new) | `diffProfiles(a, b)`, `applyDiff(profile, diff)`, `threeWayMerge(ancestor, ours, theirs)` |
| | `src/run-record.ts` | REPLACE `commitSha`/`promptHash`/`configHash` triple with `agentProfileHash` (greenfield) |
| | `src/contract/self-improve.ts` | REPLACE `SelfImproveResult` to return `{baselineHash, diff, winningHash, lift, gateDecision, insight}` |
| | `src/contract/analyze-runs.ts` | Add `agentProfileLineage` section to `InsightReport` — what versions ran, drift detected |
| **agent-runtime** | `src/profile/log.ts` (new) | Append-only `~/.tangle/profile-log.jsonl`. `appendVersion()`, `readLineage()`, `findCommonAncestor()` |
| | `src/profile/api.ts` (new) | `getCurrentProfileVersion()`, `applyDiff()` |
| | `src/loops/run-loop.ts` | Every harness-side write to skills/memory/prompt-addendum appends to profile log |
| **agent-knowledge** | `src/skills/version.ts` (new) | Skills become independently versioned objects; profile references them by `skillSetHash` |
| **sandbox** | `src/agent-profile.ts` | Expose `getCurrentProfileVersion()` over the SDK |

## What the gate semantics become

`defaultProductionGate` today: "is the candidate statistically better than the baseline?"

`defaultProductionGate` tomorrow: same question, scoped to the baseline. The consumer (sandbox / human / hosted-tier) decides whether to apply, given the answer + the current live state.

We do NOT downgrade our paired-bootstrap CI. That's our edge over SkillOpt and Hermes. We just stop pretending the ship verdict is a deployment decision — it's a measurement.

## The forcing function (task C from the audit)

Before we commit weeks to this implementation, set up the empirical case:

1. Run Hermes on top of our sandbox.
2. Hermes' per-turn loop mutates skills.
3. Run `selfImprove()` against the baseline at sandbox boot.
4. Observe `gateDecision: ship` produce a winner that, when applied to the now-drifted live state, regresses.
5. Capture the actual lift CI gap between `winner vs baseline` and `winner vs live`.

If that gap is small (< MDE), profile-versioning is over-engineering. If it's large, this work is critical. We should know the number, not the intuition.
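
The decision rule above can be made concrete with a small calculation. The sketch below uses point estimates of the paired lift only; the real experiment would compare bootstrap CIs, which are omitted here. `pairedLift` and `driftGap` are hypothetical names, and the per-scenario score arrays are assumed to be aligned by scenario.

```typescript
const mean = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length

// Paired lift: mean per-scenario score difference, candidate minus reference.
function pairedLift(candidate: number[], reference: number[]): number {
  if (candidate.length !== reference.length || candidate.length === 0) {
    throw new Error('paired scores must be non-empty and equal length')
  }
  return mean(candidate.map((c, i) => c - reference[i]))
}

// Gap between the lift the gate measured and the lift against the drifted live state.
function driftGap(winner: number[], baseline: number[], live: number[], mde: number) {
  const liftVsBaseline = pairedLift(winner, baseline)
  const liftVsLive = pairedLift(winner, live)
  const gap = liftVsBaseline - liftVsLive
  return { liftVsBaseline, liftVsLive, gap, exceedsMde: Math.abs(gap) >= mde }
}
```

If `exceedsMde` comes back false across realistic runs, that would be evidence for the "over-engineering" branch of the argument.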

## Phasing

### Phase 0 — forcing function (1 week)
Hermes-on-sandbox drift experiment. Real numbers on the gap. Either proves this work is needed or kills it.

### Phase 1 — types + hashing (3 days)
`AgentProfileVersion`, `ProfileDiff`, `ProfileEdit`. `hashProfile()`. `diffProfiles()`. `applyDiff()`. Pure functions, fully tested, no integration yet.

### Phase 2 — substrate-side rewire (5 days)
Replace `RunRecord` triple with `agentProfileHash`. Replace `SelfImproveResult` shape. Update `analyzeRuns` to detect lineage drift. Update tests + all 6 consumer products.

### Phase 3 — sandbox + runtime (1 week)
Profile log primitive in agent-runtime. `getCurrentProfileVersion()` + `applyDiff()` API. Sandbox SDK surface. Three-way merge for surface-scoped edits.

### Phase 4 — agent-knowledge skill versioning (3 days)
Skills become independently versioned. `skillSetHash` referenced from profile.

### Phase 5 — Hermes adapter (3 days)
Bridge: Hermes' `~/.hermes/skills/` write events → our profile log via a runtime hook.

Total: ~3 weeks of focused work. Phase 0 in this session if Drew greenlights.

## Source pointers

- Task: #98
- Related audit: `docs/specs/hermes-self-improvement-audit.md`
- Related spec: `docs/specs/driver-honest-spec.md`
- Current pre-versioning `RunRecord`: `src/run-record.ts`
- Current pre-versioning `SelfImproveResult`: `src/contract/self-improve.ts`
- Current gate: `src/campaign/gates/default-production-gate.ts`

package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.
+  "version": "0.53.0",
   "description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
   "homepage": "https://github.com/tangle-network/agent-eval#readme",
   "repository": {
@@ -1 +0,0 @@
-
{"version":3,"sources":["../src/campaign/auto-pr.ts","../src/campaign/drivers/evolutionary.ts","../src/campaign/drivers/gepa.ts","../src/campaign/gates/compose.ts","../src/campaign/gates/default-production-gate.ts","../src/campaign/gates/heldout-gate.ts","../src/campaign/presets/run-eval.ts","../src/campaign/presets/run-optimization.ts","../src/campaign/presets/run-improvement-loop.ts"],"sourcesContent":["/**\n * @experimental\n *\n * `openAutoPr` — thin shell-out helper for the `runImprovementLoop` preset's\n * `autoOnPromote: 'pr'` mode. Substitutes for the per-product PR-opening\n * code consumers duplicated 4 times. The PR body includes the campaign's\n * manifest hash, gate verdict, and scorecard summary so reviewers can see\n * exactly what was promoted + why.\n *\n * NOT a deploy mechanism — this only OPENS a PR. The human reviews + merges.\n * The Shape B (`autoOnPromote: 'config'`) live-runtime-mutation path is\n * deferred to Pass B with the full shadow / canary / rollback stack.\n */\n\nimport { execSync } from 'node:child_process'\nimport { writeFileSync } from 'node:fs'\nimport { tmpdir } from 'node:os'\nimport { join } from 'node:path'\nimport type { CampaignResult, GateResult, Scenario } from './types'\n\nexport interface OpenAutoPrOptions<TArtifact, TScenario extends Scenario> {\n /** Campaign result to attach to the PR. */\n result: CampaignResult<TArtifact, TScenario>\n /** Gate verdict explaining the promotion. Substrate refuses to open a PR\n * when `gate.decision !== 'ship'` — fails loud. */\n gate: GateResult\n /** Promoted surface diff — typically the new system prompt addendum or\n * full profile diff. Substrate writes it as the PR body. */\n promotedDiff: string\n /** GH owner/repo target (e.g., `tangle-network/gtm-agent`). */\n ghOwner: string\n ghRepo: string\n /** Branch name for the PR. Default `auto/<manifestHash[:12]>`. */\n branch?: string\n /** PR title. Default includes manifest hash. 
*/\n title?: string\n /** Whether to actually open the PR or just dry-run. Default reads\n * `GH_AUTO_PR_TOKEN` env — present = open, absent = dry-run. */\n dryRun?: boolean\n /** Test seam — substitute `gh pr create` invocation. */\n ghExec?: (args: string[]) => { stdout: string; stderr: string; status: number }\n}\n\nexport interface OpenAutoPrResult {\n opened: boolean\n prUrl?: string\n dryRun: boolean\n reason: string\n}\n\nexport function openAutoPr<TArtifact, TScenario extends Scenario>(\n options: OpenAutoPrOptions<TArtifact, TScenario>,\n): OpenAutoPrResult {\n if (options.gate.decision !== 'ship') {\n return {\n opened: false,\n dryRun: false,\n reason: `gate verdict was \"${options.gate.decision}\" — refusing to open PR`,\n }\n }\n\n const dryRun = options.dryRun ?? !process.env.GH_AUTO_PR_TOKEN\n const branch = options.branch ?? `auto/${options.result.manifestHash.slice(0, 12)}`\n const title =\n options.title ?? `auto: campaign ${options.result.manifestHash.slice(0, 8)} promoted by gate`\n\n const body = renderPrBody(options.result, options.gate, options.promotedDiff)\n const bodyPath = join(tmpdir(), `auto-pr-body-${Date.now()}.md`)\n writeFileSync(bodyPath, body)\n\n if (dryRun) {\n return {\n opened: false,\n dryRun: true,\n reason: `dry-run (GH_AUTO_PR_TOKEN not set). Would create PR on ${options.ghOwner}/${options.ghRepo} branch ${branch}. Body at ${bodyPath}.`,\n }\n }\n\n const ghExec = options.ghExec ?? 
defaultGhExec\n const result = ghExec([\n 'pr',\n 'create',\n '--repo',\n `${options.ghOwner}/${options.ghRepo}`,\n '--head',\n branch,\n '--title',\n title,\n '--body-file',\n bodyPath,\n ])\n if (result.status !== 0) {\n return {\n opened: false,\n dryRun: false,\n reason: `gh pr create failed (exit ${result.status}): ${result.stderr.slice(0, 400)}`,\n }\n }\n const prUrl = result.stdout.trim()\n return { opened: true, prUrl, dryRun: false, reason: 'PR opened' }\n}\n\nfunction renderPrBody<TArtifact, TScenario extends Scenario>(\n result: CampaignResult<TArtifact, TScenario>,\n gate: GateResult,\n diff: string,\n): string {\n const lines: string[] = []\n lines.push(`## Automated promotion by \\`runImprovementLoop\\``)\n lines.push('')\n lines.push(`**Manifest**: \\`${result.manifestHash}\\``)\n lines.push(`**Seed**: ${result.seed}`)\n lines.push(`**Duration**: ${Math.round(result.durationMs / 1000)}s`)\n lines.push(\n `**Cells**: executed ${result.aggregates.cellsExecuted}, cached ${result.aggregates.cellsCached}, skipped ${result.aggregates.cellsSkipped}, failed ${result.aggregates.cellsFailed}`,\n )\n lines.push(`**Total spend**: $${result.aggregates.totalCostUsd.toFixed(2)}`)\n lines.push('')\n lines.push(`### Gate verdict: \\`${gate.decision}\\``)\n lines.push('')\n for (const reason of gate.reasons) lines.push(`- ${reason}`)\n if (gate.delta !== undefined) lines.push(`- delta: ${gate.delta.toFixed(3)}`)\n lines.push('')\n lines.push('### Contributing gates')\n lines.push('')\n lines.push('| gate | passed | detail |')\n lines.push('|---|---|---|')\n for (const c of gate.contributingGates) {\n const detail =\n typeof c.detail === 'object'\n ? JSON.stringify(c.detail).slice(0, 80)\n : String(c.detail).slice(0, 80)\n lines.push(`| ${c.name} | ${c.passed ? 
'✓' : '✗'} | ${detail} |`)\n }\n lines.push('')\n lines.push('### Promoted surface')\n lines.push('')\n lines.push('```diff')\n lines.push(diff.slice(0, 8000))\n lines.push('```')\n lines.push('')\n lines.push('### By-judge aggregates')\n lines.push('')\n lines.push('| judge | mean | ci95 | n |')\n lines.push('|---|---|---|---|')\n for (const [name, agg] of Object.entries(result.aggregates.byJudge)) {\n lines.push(\n `| ${name} | ${agg.mean.toFixed(3)} | [${agg.ci95[0].toFixed(3)}, ${agg.ci95[1].toFixed(3)}] | ${agg.n} |`,\n )\n }\n return lines.join('\\n')\n}\n\nfunction defaultGhExec(args: string[]): { stdout: string; stderr: string; status: number } {\n try {\n const stdout = execSync(`gh ${args.map(quoteArg).join(' ')}`, {\n env: { ...process.env, GH_TOKEN: process.env.GH_AUTO_PR_TOKEN ?? process.env.GH_TOKEN ?? '' },\n stdio: ['ignore', 'pipe', 'pipe'],\n }).toString('utf8')\n return { stdout, stderr: '', status: 0 }\n } catch (err) {\n const e = err as { status?: number; stderr?: Buffer; stdout?: Buffer }\n return {\n stdout: e.stdout?.toString('utf8') ?? '',\n stderr: e.stderr?.toString('utf8') ?? '',\n status: e.status ?? 1,\n }\n }\n}\n\nfunction quoteArg(arg: string): string {\n if (/^[a-zA-Z0-9_/\\-:.@]+$/.test(arg)) return arg\n return `\"${arg.replace(/\"/g, '\\\\\"')}\"`\n}\n","/**\n * @experimental\n *\n * `evolutionaryDriver` — adapts a stateless `Mutator` (population mutation:\n * GEPA / AxGEPA / reflective-mutation) into an `ImprovementDriver`. This is\n * the evolutionary strategy: each generation, mutate the current best surface\n * into N candidates, measure, select. No generation memory beyond the current\n * surface; the loop body handles ranking + promotion.\n *\n * The reflective alternative is agent-runtime's `improvementDriver` with a\n * `reflectiveGenerator` / `agenticGenerator`: it reasons over the report +\n * trace findings to propose targeted edits rather than blind mutations. 
Both\n * conform to `ImprovementDriver`; the improvement loop is identical regardless\n * of which drives it.\n */\n\nimport type { ImprovementDriver, Mutator } from '../types'\n\nexport interface EvolutionaryDriverOptions<TFindings = unknown> {\n mutator: Mutator<TFindings>\n /** External findings fed to the mutator each generation. Default: []. */\n findings?: TFindings[]\n}\n\nexport function evolutionaryDriver<TFindings = unknown>(\n opts: EvolutionaryDriverOptions<TFindings>,\n): ImprovementDriver<TFindings> {\n return {\n kind: `evolutionary:${opts.mutator.kind}`,\n async propose({ currentSurface, findings, populationSize, signal }) {\n return opts.mutator.mutate({\n findings: findings.length > 0 ? findings : (opts.findings ?? []),\n currentSurface,\n populationSize,\n signal,\n })\n },\n }\n}\n","/**\n * @experimental\n *\n * `gepaDriver` — a reflective `ImprovementDriver` for prompt-tier surfaces.\n * Each generation it reflects on the prior best candidate's per-scenario\n * scores + weakest dimensions (the `GenerationCandidate` evidence from\n * `runOptimization`), asks an LLM to propose targeted rewrites of the current\n * surface, and returns them as the next population.\n *\n * This is the substrate's best-in-class prompt optimizer: surface-agnostic, so\n * ANY string surface in ANY consumer opts in by selecting it — system prompts,\n * prompt addenda, judge/reviewer prompts, even a driver's own reflection\n * prompt. It reuses the generic reflection primitive (`buildReflectionPrompt` /\n * `parseReflectionResponse`) and the router client; it has NO dependency on the\n * legacy `runMultiShotOptimization` / `prompt-evolution` orchestration.\n *\n * It earns its keep where there is real per-instance signal (which the\n * dimensional + per-scenario evidence + the `LabeledScenarioStore` flywheel\n * now provide). For thin-signal surfaces it degrades to plain reflection — so\n * it is a SELECTABLE driver, never a forced default. 
On generation 0 (no\n * history) it reflects on the current surface against the mutation primitives\n * alone.\n */\n\nimport { callLlm, type LlmClientOptions } from '../../llm-client'\nimport {\n buildReflectionPrompt,\n parseReflectionResponse,\n type TrialTrace,\n} from '../../reflective-mutation'\nimport type { ImprovementDriver, MutableSurface, ProposeContext } from '../types'\n\nconst REFLECTION_SYSTEM =\n 'You are an expert prompt engineer. Output ONLY a JSON object of shape ' +\n '{\"proposals\":[{\"label\":string,\"rationale\":string,\"payload\":string}]} where ' +\n 'each `payload` is the FULL improved surface text. No prose outside the JSON.'\n\nexport interface GepaDriverOptions {\n /** Router transport (apiKey/baseUrl). */\n llm: LlmClientOptions\n /** Model that performs the reflection. */\n model: string\n /** What is being optimized — appears in the reflection prompt for orientation. */\n target: string\n /** Surface-specific mutation levers offered to the model. */\n mutationPrimitives?: string[]\n /** Top/bottom scenarios surfaced as evidence each generation. Default 3. */\n evidenceK?: number\n /** Reflection sampling temperature. Default 0.7. */\n temperature?: number\n /** Reflection max tokens. Default 6000. */\n maxTokens?: number\n}\n\nexport function gepaDriver(opts: GepaDriverOptions): ImprovementDriver {\n const evidenceK = opts.evidenceK ?? 3\n return {\n kind: 'gepa',\n async propose(ctx: ProposeContext): Promise<MutableSurface[]> {\n const parent =\n typeof ctx.currentSurface === 'string'\n ? 
ctx.currentSurface\n : JSON.stringify(ctx.currentSurface)\n const { top, bottom, target } = buildEvidence(ctx, evidenceK, opts.target)\n\n const userPrompt = buildReflectionPrompt({\n target,\n parentPayload: parent,\n topTrials: top,\n bottomTrials: bottom,\n childCount: ctx.populationSize,\n mutationPrimitives: opts.mutationPrimitives,\n })\n\n const result = await callLlm(\n {\n model: opts.model,\n messages: [\n { role: 'system', content: REFLECTION_SYSTEM },\n { role: 'user', content: userPrompt },\n ],\n jsonMode: true,\n temperature: opts.temperature ?? 0.7,\n maxTokens: opts.maxTokens ?? 6000,\n },\n opts.llm,\n )\n\n const proposals = parseReflectionResponse(result.content, ctx.populationSize)\n const out: MutableSurface[] = []\n for (const proposal of proposals) {\n const text = typeof proposal.payload === 'string' ? proposal.payload.trim() : ''\n if (text && text !== parent && !out.includes(text)) out.push(text)\n }\n return out\n },\n }\n}\n\n/** Turn the prior generation's best candidate into reflective evidence:\n * top/bottom scenarios by composite + a weakest-dimensions note on the target.\n * Empty on generation 0 — the model reflects on the surface alone. 
*/\nfunction buildEvidence(\n ctx: ProposeContext,\n evidenceK: number,\n baseTarget: string,\n): { top: TrialTrace[]; bottom: TrialTrace[]; target: string } {\n const last = ctx.history.at(-1)\n if (!last || last.candidates.length === 0) {\n return { top: [], bottom: [], target: baseTarget }\n }\n const best = [...last.candidates].sort((a, b) => b.composite - a.composite)[0]\n if (!best) return { top: [], bottom: [], target: baseTarget }\n\n const byScore = [...best.scenarios].sort((a, b) => b.composite - a.composite)\n const toTrace = (s: { scenarioId: string; composite: number }): TrialTrace => ({\n id: s.scenarioId,\n score: s.composite,\n })\n const top = byScore.slice(0, evidenceK).map(toTrace)\n const bottom = byScore.slice(-evidenceK).reverse().map(toTrace)\n\n const weakest = Object.entries(best.dimensions)\n .sort((a, b) => a[1] - b[1])\n .slice(0, 3)\n .map(([dim, value]) => `${dim} (${value.toFixed(2)})`)\n const target =\n weakest.length > 0 ? `${baseTarget} — weakest dimensions: ${weakest.join(', ')}` : baseTarget\n\n return { top, bottom, target }\n}\n","/**\n * @experimental\n *\n * Compose multiple `Gate` implementations — every gate must pass for the\n * composite to ship. Closes the alignment reviewer's \"default-only\n * heldOutGate + costGate would happily promote a reward-hacked prompt\"\n * concern by making safety gates first-class composable defaults.\n */\n\nimport type { Gate, GateContext, GateDecision, GateResult, Scenario } from '../types'\n\n/** Compose gates: all must `ship` for the composite to `ship`. There is no\n * short-circuit: ALL gates run, and any non-ship verdict downgrades the\n * composite per the priority order in `decide` (so the result records every\n * gate's reason, useful for diagnostics). 
*/\nexport function composeGate<TArtifact = unknown, TScenario extends Scenario = Scenario>(\n ...gates: Array<Gate<TArtifact, TScenario>>\n): Gate<TArtifact, TScenario> {\n if (gates.length === 0) {\n throw new Error('composeGate requires at least one gate')\n }\n return {\n name: `composed(${gates.map((g) => g.name).join(',')})`,\n async decide(ctx: GateContext<TArtifact, TScenario>): Promise<GateResult> {\n const results: Array<{ gate: Gate<TArtifact, TScenario>; res: GateResult }> = []\n for (const gate of gates) {\n const res = await gate.decide(ctx)\n results.push({ gate, res })\n }\n\n // Substrate-wide verdict policy:\n // - all 'ship' → 'ship'\n // - any 'arch_ceiling' → 'arch_ceiling' (architectural ceiling beats other holds)\n // - any 'model_ceiling' → 'model_ceiling'\n // - any 'hold' → 'hold'\n // - else 'need_more_work'\n const decisions = results.map((r) => r.res.decision)\n const overall: GateDecision = decisions.every((d) => d === 'ship')\n ? 'ship'\n : decisions.includes('arch_ceiling')\n ? 'arch_ceiling'\n : decisions.includes('model_ceiling')\n ? 'model_ceiling'\n : decisions.includes('hold')\n ? 'hold'\n : 'need_more_work'\n\n const contributing = results.flatMap((r) =>\n r.res.contributingGates.length > 0\n ? r.res.contributingGates\n : [{ name: r.gate.name, passed: r.res.decision === 'ship', detail: r.res }],\n )\n\n const reasons = results.flatMap((r) =>\n r.res.reasons.map((reason) => `[${r.gate.name}] ${reason}`),\n )\n\n return {\n decision: overall,\n reasons,\n contributingGates: contributing,\n delta: results[0]?.res.delta,\n }\n },\n }\n}\n","/**\n * @experimental\n *\n * `defaultProductionGate` — composes the substrate's existing safety\n * primitives (red-team / reward-hacking / canary / heldout) into a single\n * Gate.decide shape. 
Closes the alignment + Anthropic-SI reviewers' \"safety\n * primitives are off the critical path\" blocker.\n *\n * The composition is opinionated: when consumers wire `runImprovementLoop`,\n * THIS gate is the recommended choice (the `gate` option is required, so pass\n * it explicitly). Consumers can still pass a custom gate instead; the\n * recommended pattern is to compose THIS gate with whatever\n * extra domain-specific gates they need (`composeGate(defaultProductionGate(...), customGate)`).\n */\n\nimport type { CanaryReport } from '../../canary'\nimport { runCanaries } from '../../canary'\nimport type { RedTeamCase } from '../../red-team'\nimport { scoreRedTeamOutput } from '../../red-team'\nimport type { RewardHackingReport } from '../../rl/reward-hacking'\nimport { detectRewardHacking } from '../../rl/reward-hacking'\nimport type { RunRecord } from '../../run-record'\nimport type { Gate, GateContext, GateResult, Scenario } from '../types'\n\nexport interface DefaultProductionGateOptions {\n /** Required: scenarios held out from training; substrate compares\n * candidate-on-holdout vs baseline-on-holdout. */\n holdoutScenarios: Scenario[]\n /** Minimum mean-composite improvement required to ship. Default 0.5. */\n deltaThreshold?: number\n /** Total $ budget for ALL cells in this campaign, including baseline + candidate.\n * Composite verdict refuses to ship when spend exceeds the budget. */\n budgetUsd?: number\n /** Red-team cases to probe candidate outputs against. When omitted the\n * red-team probe is skipped and recorded as passed. Provide a\n * domain-specific battery to get coverage. */\n redTeamBattery?: RedTeamCase[]\n /** Run records (oldest-first) needed for the reward-hacking detector\n * and the canary check; both are skipped when fewer than 10 are supplied.\n * Substrate populates from prior production-loop generations. */\n recentRuns?: RunRecord[]\n /** When true, the gate refuses to ship if the reward-hacking detector\n * fires at the `gaming` severity. Default true. 
*/\n blockOnRewardHackingGaming?: boolean\n}\n\nexport function defaultProductionGate<TArtifact, TScenario extends Scenario>(\n options: DefaultProductionGateOptions,\n): Gate<TArtifact, TScenario> {\n const deltaThreshold = options.deltaThreshold ?? 0.5\n const blockOnGaming = options.blockOnRewardHackingGaming ?? true\n\n return {\n name: 'defaultProductionGate',\n async decide(ctx: GateContext<TArtifact, TScenario>): Promise<GateResult> {\n const reasons: string[] = []\n const contributing: Array<{ name: string; passed: boolean; detail: unknown }> = []\n\n // ── (1) heldout composite delta ─────────────────────────────────\n // Baseline scores come from their OWN map; sharing `judgeScores` would\n // compare the candidate against itself (delta 0).\n const baselineComposite = meanComposite(\n ctx.baselineArtifacts,\n ctx.baselineJudgeScores ?? ctx.judgeScores,\n options.holdoutScenarios,\n )\n const candidateComposite = meanComposite(\n ctx.candidateArtifacts,\n ctx.judgeScores,\n options.holdoutScenarios,\n )\n const delta = candidateComposite - baselineComposite\n const heldoutPass = delta >= deltaThreshold\n contributing.push({\n name: 'heldout-delta',\n passed: heldoutPass,\n detail: { baselineComposite, candidateComposite, delta, deltaThreshold },\n })\n if (!heldoutPass) {\n reasons.push(`heldout delta ${delta.toFixed(3)} < threshold ${deltaThreshold}`)\n }\n\n // ── (2) budget gate ─────────────────────────────────────────────\n const budgetPass =\n options.budgetUsd === undefined ||\n ctx.cost.candidate + ctx.cost.baseline <= options.budgetUsd\n contributing.push({\n name: 'budget',\n passed: budgetPass,\n detail: {\n candidateUsd: ctx.cost.candidate,\n baselineUsd: ctx.cost.baseline,\n budgetUsd: options.budgetUsd,\n },\n })\n if (!budgetPass) {\n reasons.push(\n `spend ${(ctx.cost.candidate + ctx.cost.baseline).toFixed(2)} > budget ${options.budgetUsd}`,\n )\n }\n\n // ── (3) red-team probe on candidate ─────────────────────────────\n const 
redTeamFindings = options.redTeamBattery\n ? probeRedTeam(ctx.candidateArtifacts, options.redTeamBattery)\n : { passed: true, findings: [] }\n contributing.push({\n name: 'red-team',\n passed: redTeamFindings.passed,\n detail: {\n failures: redTeamFindings.findings.length,\n sample: redTeamFindings.findings.slice(0, 3),\n },\n })\n if (!redTeamFindings.passed) {\n reasons.push(`red-team probe failed (${redTeamFindings.findings.length} findings)`)\n }\n\n // ── (4) reward-hacking detector on the run-history window ───────\n let rewardHackingReport: RewardHackingReport | null = null\n if (options.recentRuns && options.recentRuns.length >= 10) {\n rewardHackingReport = detectRewardHacking({ runs: options.recentRuns })\n }\n // reward-hacking severity is numeric (0..1). \"gaming\" threshold per\n // detectRewardHacking defaults = 0.6. Block when ANY finding is at\n // gaming threshold OR the report verdict is 'gaming'.\n const gamingThreshold = 0.6\n const gamingFindings = (rewardHackingReport?.findings ?? []).filter(\n (f) => f.severity >= gamingThreshold,\n )\n const rewardHackingPass =\n !rewardHackingReport ||\n !blockOnGaming ||\n (gamingFindings.length === 0 && rewardHackingReport.verdict !== 'gaming')\n contributing.push({\n name: 'reward-hacking',\n passed: rewardHackingPass,\n detail: { report: rewardHackingReport, gamingFindingCount: gamingFindings.length },\n })\n if (!rewardHackingPass) {\n reasons.push(\n `reward-hacking detector flagged ${gamingFindings.length} gaming-severity findings (verdict=${rewardHackingReport!.verdict})`,\n )\n }\n\n // ── (5) canary check on runs ────────────────────────────────────\n let canaryReport: CanaryReport | null = null\n if (options.recentRuns && options.recentRuns.length >= 10) {\n canaryReport = runCanaries(options.recentRuns, {})\n }\n // CanarySeverity is 'info' | 'warn' | 'error' — block on 'error'.\n const errorAlerts = (canaryReport?.alerts ?? 
[]).filter((a) => a.severity === 'error')\n const canaryPass = errorAlerts.length === 0\n contributing.push({\n name: 'canary',\n passed: canaryPass,\n detail: { totalAlerts: canaryReport?.alerts.length ?? 0, errorAlerts: errorAlerts.length },\n })\n if (!canaryPass) {\n reasons.push(`canary error alerts: ${errorAlerts.length}`)\n }\n\n // ── Verdict ─────────────────────────────────────────────────────\n const allPassed = contributing.every((c) => c.passed)\n const decision = allPassed ? 'ship' : 'hold'\n\n return {\n decision,\n reasons: reasons.length > 0 ? reasons : ['all gates passed'],\n contributingGates: contributing,\n delta,\n }\n },\n }\n}\n\nfunction meanComposite<TArtifact, TScenario extends Scenario>(\n artifacts: Map<string, TArtifact> | undefined,\n judgeScoresByCell: Map<string, Record<string, { composite: number }>>,\n scenarios: TScenario[],\n): number {\n if (!artifacts || artifacts.size === 0) return 0\n const scenarioIds = new Set(scenarios.map((s) => s.id))\n const composites: number[] = []\n for (const [cellId, scores] of judgeScoresByCell) {\n const scenarioId = cellId.split(':')[0] ?? 
''\n if (!scenarioIds.has(scenarioId)) continue\n const cellComposites = Object.values(scores).map((s) => s.composite)\n if (cellComposites.length === 0) continue\n composites.push(cellComposites.reduce((a, b) => a + b, 0) / cellComposites.length)\n }\n if (composites.length === 0) return 0\n return composites.reduce((a, b) => a + b, 0) / composites.length\n}\n\nfunction probeRedTeam<TArtifact>(\n artifacts: Map<string, TArtifact>,\n battery: RedTeamCase[],\n): { passed: boolean; findings: Array<{ scenarioId: string; reason: string }> } {\n const findings: Array<{ scenarioId: string; reason: string }> = []\n for (const [_cellId, artifact] of artifacts) {\n const text = extractText(artifact)\n if (text === undefined) continue\n for (const rtCase of battery) {\n const finding = scoreRedTeamOutput(text, [], rtCase)\n if (!finding.passed) {\n findings.push({ scenarioId: rtCase.id, reason: finding.reason ?? 'red-team probe failed' })\n }\n }\n }\n return { passed: findings.length === 0, findings }\n}\n\nfunction extractText(artifact: unknown): string | undefined {\n if (typeof artifact === 'string') return artifact\n if (artifact && typeof artifact === 'object') {\n const rec = artifact as Record<string, unknown>\n if (typeof rec.text === 'string') return rec.text\n if (typeof rec.output === 'string') return rec.output\n if (typeof rec.content === 'string') return rec.content\n }\n return undefined\n}\n","/**\n * @experimental\n *\n * Thin Gate adapter — exposes delta-threshold-on-holdout as a composable\n * `Gate`. 
Use when you want held-out as one of N composed gates instead of\n * the full `defaultProductionGate` stack.\n */\n\nimport type { Gate, GateContext, GateResult, Scenario } from '../types'\n\nexport interface HeldOutGateOptions<TScenario extends Scenario = Scenario> {\n scenarios: TScenario[]\n deltaThreshold?: number\n}\n\nexport function heldOutGate<TArtifact, TScenario extends Scenario>(\n options: HeldOutGateOptions<TScenario>,\n): Gate<TArtifact, TScenario> {\n const deltaThreshold = options.deltaThreshold ?? 0.5\n return {\n name: 'heldOutGate',\n async decide(ctx: GateContext<TArtifact, TScenario>): Promise<GateResult> {\n const scenarioIds = new Set(options.scenarios.map((s) => s.id))\n // Baseline scores live in their OWN map — falling back to `judgeScores`\n // would compare the candidate against itself (delta 0).\n const baseline = meanForScenarios(ctx.baselineJudgeScores ?? ctx.judgeScores, scenarioIds)\n const candidate = meanForScenarios(ctx.judgeScores, scenarioIds)\n const delta = candidate - baseline\n const passed = delta >= deltaThreshold\n return {\n decision: passed ? 'ship' : 'hold',\n reasons: passed\n ? [`held-out delta ${delta.toFixed(3)} ≥ ${deltaThreshold}`]\n : [`held-out delta ${delta.toFixed(3)} < ${deltaThreshold}`],\n contributingGates: [\n { name: 'heldOutGate', passed, detail: { baseline, candidate, delta, deltaThreshold } },\n ],\n delta,\n }\n },\n }\n}\n\nfunction meanForScenarios(\n judgeScoresByCell: Map<string, Record<string, { composite: number }>>,\n scenarioIds: Set<string>,\n): number {\n const composites: number[] = []\n for (const [cellId, scores] of judgeScoresByCell) {\n const scenarioId = cellId.split(':')[0] ?? ''\n if (!scenarioIds.has(scenarioId)) continue\n const vals = Object.values(scores).map((s) => s.composite)\n if (vals.length > 0) composites.push(vals.reduce((a, b) => a + b, 0) / vals.length)\n }\n return composites.length === 0 ? 
0 : composites.reduce((a, b) => a + b, 0) / composites.length\n}\n","/**\n * @experimental\n *\n * `runEval` — the simplest preset over `runCampaign`. No optimizer, no\n * gate, no auto-PR. Just: run scenarios through dispatch, score with\n * judges, return CampaignResult.\n *\n * The 80% case for consumers who want a scorecard, not an improvement loop.\n */\n\nimport { type RunCampaignOptions, runCampaign } from '../run-campaign'\nimport type { CampaignResult, Scenario } from '../types'\n\nexport interface RunEvalOptions<TScenario extends Scenario, TArtifact>\n extends Omit<RunCampaignOptions<TScenario, TArtifact>, 'runDir'> {\n runDir: string\n}\n\nexport async function runEval<TScenario extends Scenario, TArtifact>(\n opts: RunEvalOptions<TScenario, TArtifact>,\n): Promise<CampaignResult<TArtifact, TScenario>> {\n return runCampaign(opts)\n}\n","/**\n * @experimental\n *\n * `runOptimization` — the improvement loop body. Runs N generations: the\n * `ImprovementDriver` proposes K candidate surfaces per generation, each\n * candidate runs a campaign (the measurement), and the top-scoring\n * candidates are promoted to the next generation. 
Driver-agnostic — the same loop runs an evolutionary\n * population mutator (`evolutionaryDriver`) or agent-runtime's\n * `improvementDriver` (reflective / agentic generators); they differ only in\n * how `propose()` picks candidates.\n *\n * This is `runLoop`'s shape (plan → measure → decide) specialized to surface\n * improvement: `driver.propose` = plan, `runCampaign` = the measurement (which\n * runs the worker behind `dispatch`), the mean-composite ranking = the\n * validator, `driver.decide` = the stop check.\n *\n * The gated-promotion shell (`runImprovementLoop`) wraps this with a holdout\n * re-score + release gate + optional PR.\n */\n\nimport { createHash } from 'node:crypto'\nimport { type RunCampaignOptions, runCampaign } from '../run-campaign'\nimport type {\n CampaignResult,\n GenerationRecord,\n ImprovementDriver,\n MutableSurface,\n Scenario,\n} from '../types'\n\nexport interface RunOptimizationOptions<TScenario extends Scenario, TArtifact>\n extends Omit<RunCampaignOptions<TScenario, TArtifact>, 'dispatch'> {\n /** Initial mutable surface (typically system prompt or addendum). */\n baselineSurface: MutableSurface\n /** Dispatcher that takes the CURRENT surface + scenario → artifact. */\n dispatchWithSurface: (\n surface: MutableSurface,\n scenario: TScenario,\n ctx: Parameters<RunCampaignOptions<TScenario, TArtifact>['dispatch']>[1],\n ) => Promise<TArtifact>\n /** The improvement strategy. Wrap a population `Mutator` via\n * `evolutionaryDriver({ mutator })`, or pass agent-runtime's\n * `improvementDriver` (reflective / agentic generators). */\n driver: ImprovementDriver\n populationSize: number\n maxGenerations: number\n /** How many top-scoring candidates carry to the next generation. Default 2. */\n promoteTopK?: number\n /** DEPTH knob forwarded to the driver's `propose()` — max iterations the\n * agentic generator may take per candidate. 
*/\n maxImprovementShots?: number\n /** Phase-2 research report forwarded to `propose()` (analyst findings +\n * diff). Opaque here; the driver types it. */\n report?: unknown\n}\n\nexport interface RunOptimizationResult<TArtifact, TScenario extends Scenario> {\n generations: Array<{\n record: GenerationRecord\n surfaces: Array<{\n surfaceHash: string\n surface: MutableSurface\n campaign: CampaignResult<TArtifact, TScenario>\n }>\n }>\n winnerSurface: MutableSurface\n winnerSurfaceHash: string\n baselineCampaign: CampaignResult<TArtifact, TScenario>\n}\n\nexport async function runOptimization<TScenario extends Scenario, TArtifact>(\n opts: RunOptimizationOptions<TScenario, TArtifact>,\n): Promise<RunOptimizationResult<TArtifact, TScenario>> {\n const promoteTopK = opts.promoteTopK ?? 2\n\n // Baseline run\n const baselineCampaign = await runCampaign<TScenario, TArtifact>({\n ...opts,\n dispatch: (scenario, ctx) => opts.dispatchWithSurface(opts.baselineSurface, scenario, ctx),\n runDir: `${opts.runDir}/baseline`,\n })\n\n const generations: RunOptimizationResult<TArtifact, TScenario>['generations'] = []\n const history: GenerationRecord[] = []\n let currentSurfaces: MutableSurface[] = [opts.baselineSurface]\n let winnerSurface = opts.baselineSurface\n let winnerSurfaceHash = surfaceHash(opts.baselineSurface)\n let winnerComposite = meanComposite(baselineCampaign)\n\n for (let gen = 0; gen < opts.maxGenerations; gen++) {\n // Decide: the driver may stop early based on accumulated history.\n if (opts.driver.decide?.({ history }).stop) break\n\n // Plan: the driver proposes N candidates from the current best surface,\n // the accumulated generation history, and any external findings.\n const candidates = await opts.driver.propose({\n currentSurface: currentSurfaces[0] ?? 
opts.baselineSurface,\n history,\n findings: [],\n populationSize: opts.populationSize,\n generation: gen,\n signal: new AbortController().signal,\n report: opts.report,\n dataset: opts.labeledStore && opts.labeledStore !== 'off' ? opts.labeledStore : undefined,\n maxImprovementShots: opts.maxImprovementShots,\n })\n\n // Run each candidate as its own campaign.\n const surfaceResults: Array<{\n surfaceHash: string\n surface: MutableSurface\n campaign: CampaignResult<TArtifact, TScenario>\n composite: number\n }> = []\n for (let i = 0; i < candidates.length; i++) {\n const surface = candidates[i] as MutableSurface\n const hash = surfaceHash(surface)\n const campaign = await runCampaign<TScenario, TArtifact>({\n ...opts,\n dispatch: (scenario, ctx) => opts.dispatchWithSurface(surface, scenario, ctx),\n runDir: `${opts.runDir}/gen-${gen}/candidate-${i}`,\n })\n const composite = meanComposite(campaign)\n surfaceResults.push({ surfaceHash: hash, surface, campaign, composite })\n }\n\n // Rank, promote top-K.\n surfaceResults.sort((a, b) => b.composite - a.composite)\n const promoted = surfaceResults.slice(0, promoteTopK)\n currentSurfaces = promoted.map((p) => p.surface)\n const top = surfaceResults[0]\n if (top && top.composite > winnerComposite) {\n winnerSurface = top.surface\n winnerSurfaceHash = top.surfaceHash\n winnerComposite = top.composite\n }\n\n const record: GenerationRecord = {\n generationIndex: gen,\n candidates: surfaceResults.map((s) => {\n const breakdown = candidateBreakdown(s.campaign)\n return {\n surfaceHash: s.surfaceHash,\n composite: s.composite,\n ci95: [s.composite, s.composite] as [number, number],\n dimensions: breakdown.dimensions,\n scenarios: breakdown.scenarios,\n }\n }),\n promoted: promoted.map((p) => p.surfaceHash),\n }\n history.push(record)\n generations.push({\n record,\n surfaces: surfaceResults.map((s) => ({\n surfaceHash: s.surfaceHash,\n surface: s.surface,\n campaign: s.campaign,\n })),\n })\n }\n\n return {\n generations,\n 
winnerSurface,\n winnerSurfaceHash,\n baselineCampaign,\n }\n}\n\nexport function surfaceHash(surface: MutableSurface): string {\n // Prompt/tool surfaces (string) hash by content; code surfaces hash by the\n // worktree + base ref pair (the content lives in git, not in the string).\n const material =\n typeof surface === 'string'\n ? surface\n : JSON.stringify({\n kind: surface.kind,\n worktreeRef: surface.worktreeRef,\n baseRef: surface.baseRef ?? null,\n })\n return createHash('sha256').update(material).digest('hex').slice(0, 16)\n}\n\nfunction meanComposite<TArtifact, TScenario extends Scenario>(\n campaign: CampaignResult<TArtifact, TScenario>,\n): number {\n const composites: number[] = []\n for (const cell of campaign.cells) {\n const cellComposites = Object.values(cell.judgeScores).map((s) => s.composite)\n if (cellComposites.length > 0) {\n composites.push(cellComposites.reduce((a, b) => a + b, 0) / cellComposites.length)\n }\n }\n return composites.length === 0 ? 0 : composites.reduce((a, b) => a + b, 0) / composites.length\n}\n\n/** Per-candidate evidence a reflective driver grounds its next proposal on:\n * mean score per judge dimension + per-scenario composite. */\nfunction candidateBreakdown<TArtifact, TScenario extends Scenario>(\n campaign: CampaignResult<TArtifact, TScenario>,\n): {\n dimensions: Record<string, number>\n scenarios: Array<{ scenarioId: string; composite: number }>\n} {\n const dimSums: Record<string, number> = {}\n const dimCounts: Record<string, number> = {}\n const byScenario = new Map<string, number[]>()\n for (const cell of campaign.cells) {\n const judgeScores = Object.values(cell.judgeScores)\n if (judgeScores.length === 0) continue\n const cellComposite = judgeScores.reduce((a, s) => a + s.composite, 0) / judgeScores.length\n const arr = byScenario.get(cell.scenarioId) ?? 
[]\n arr.push(cellComposite)\n byScenario.set(cell.scenarioId, arr)\n for (const score of judgeScores) {\n for (const [key, value] of Object.entries(score.dimensions)) {\n dimSums[key] = (dimSums[key] ?? 0) + value\n dimCounts[key] = (dimCounts[key] ?? 0) + 1\n }\n }\n }\n const dimensions: Record<string, number> = {}\n for (const key of Object.keys(dimSums)) {\n const count = dimCounts[key] ?? 0\n dimensions[key] = count > 0 ? (dimSums[key] ?? 0) / count : 0\n }\n const scenarios = [...byScenario.entries()].map(([scenarioId, comps]) => ({\n scenarioId,\n composite: comps.reduce((a, b) => a + b, 0) / comps.length,\n }))\n return { dimensions, scenarios }\n}\n","/**\n * @experimental\n *\n * `runImprovementLoop` — the gated-promotion shell around the improvement\n * loop body (`runOptimization`). Drives candidate surfaces via the\n * `ImprovementDriver`, re-scores the winner against the baseline on a\n * holdout set, runs the release gate, and optionally opens a PR.\n *\n * Role vocabulary (see docs/design/loop-taxonomy.md):\n * - DRIVER = the `ImprovementDriver` (evolutionary GEPA mutator OR\n * reflective analyst). Proposes candidate SURFACES — the\n * worker's system prompt / tool config — NOT conversation\n * turns.\n * - MEASUREMENT= `runCampaign`. Scores one surface by running the worker\n * (via `dispatch`) over scenarios and judging the output.\n * - WORKER = the agent harness in the sandbox, invoked behind the\n * topology-opaque `dispatch` seam — never referenced here.\n *\n * Distinct from `runLoop` in `@tangle-network/agent-runtime`, which is the\n * INNER conversation loop (driver↔workers in a sandbox). 
`runImprovementLoop`\n * is the OUTER loop: it improves the surface that those workers run.\n *\n * Hard-refuses unsafe configurations:\n * - `tracing: 'off'` when a driver is wired (improvement is unattributable)\n * - `autoOnPromote: 'config'` — DEFERRED to Pass B; v0.40 only ships\n * `'pr'` and `'none'`.\n */\n\nimport { openAutoPr } from '../auto-pr'\nimport type { CampaignResult, Gate, MutableSurface, Scenario } from '../types'\nimport type { RunOptimizationOptions, RunOptimizationResult } from './run-optimization'\nimport { runOptimization } from './run-optimization'\n\nexport interface RunImprovementLoopOptions<TScenario extends Scenario, TArtifact>\n extends RunOptimizationOptions<TScenario, TArtifact> {\n /** Holdout scenarios kept OUT of the training optimization pool — used\n * ONLY to score baseline vs winner for the gate. */\n holdoutScenarios: TScenario[]\n /** Promotion gate. Substrate strongly recommends `defaultProductionGate`\n * for production wiring (composes red-team / reward-hacking / canary /\n * heldout). */\n gate: Gate<TArtifact, TScenario>\n /** What to do when the gate ships:\n * - `'pr'`: open a PR via `openAutoPr`\n * - `'none'`: just report — caller decides what to do with the winner\n * v0.40 does NOT support `'config'` (live-runtime self-mutation) —\n * deferred to Pass B behind safety stack. */\n autoOnPromote: 'pr' | 'none'\n /** GH owner / repo for the auto-PR. Required when autoOnPromote === 'pr'. */\n ghOwner?: string\n ghRepo?: string\n /** Optional render override — substrate writes a diff-shaped surface; pass\n * a function to format the promoted surface differently. 
*/\n renderPromotedDiff?: (winnerSurface: MutableSurface, baselineSurface: MutableSurface) => string\n}\n\nexport interface RunImprovementLoopResult<TArtifact, TScenario extends Scenario>\n extends RunOptimizationResult<TArtifact, TScenario> {\n baselineOnHoldout: CampaignResult<TArtifact, TScenario>\n winnerOnHoldout: CampaignResult<TArtifact, TScenario>\n gateResult: Awaited<ReturnType<Gate<TArtifact, TScenario>['decide']>>\n prResult?: ReturnType<typeof openAutoPr>\n}\n\nexport async function runImprovementLoop<TScenario extends Scenario, TArtifact>(\n opts: RunImprovementLoopOptions<TScenario, TArtifact>,\n): Promise<RunImprovementLoopResult<TArtifact, TScenario>> {\n // ── Safety pre-flight ─────────────────────────────────────────────\n // biome-ignore lint/suspicious/noExplicitAny: Pass A reserved field for Pass B Shape B\n if ((opts as any).autoOnPromote === 'config') {\n throw new Error(\n \"runImprovementLoop: autoOnPromote='config' is deferred to Pass B (requires shadow deploy + rollback + ensemble judges). Use 'pr' or 'none' in v0.40.\",\n )\n }\n // Refuse tracing=off whenever a driver is wired. An improvement loop\n // without traces is unattributable — its candidate surfaces cannot be\n // cited back to the spans that motivated them, and the dataset flywheel\n // (LabeledScenarioStore) that GEPA optimizes against goes unfed.\n if (opts.tracing === 'off' && opts.driver) {\n throw new Error(\n \"runImprovementLoop: tracing='off' is forbidden when a driver is wired. 
The improvement loop without traces is unattributable; candidate surfaces cannot be cited back to spans and the optimization dataset goes unfed.\",\n )\n }\n if (opts.autoOnPromote === 'pr' && (!opts.ghOwner || !opts.ghRepo)) {\n throw new Error(\"runImprovementLoop: autoOnPromote='pr' requires ghOwner + ghRepo.\")\n }\n\n // ── (1) optimization loop produces a winner ────────────────────────\n const optimization = await runOptimization(opts)\n\n // ── (2) baseline + winner re-scored on the holdout set ─────────────\n const { runCampaign } = await import('../run-campaign')\n\n const baselineOnHoldout = await runCampaign<TScenario, TArtifact>({\n ...opts,\n scenarios: opts.holdoutScenarios,\n dispatch: (scenario, ctx) => opts.dispatchWithSurface(opts.baselineSurface, scenario, ctx),\n runDir: `${opts.runDir}/holdout-baseline`,\n })\n\n const winnerOnHoldout = await runCampaign<TScenario, TArtifact>({\n ...opts,\n scenarios: opts.holdoutScenarios,\n dispatch: (scenario, ctx) =>\n opts.dispatchWithSurface(optimization.winnerSurface, scenario, ctx),\n runDir: `${opts.runDir}/holdout-winner`,\n })\n\n // ── (3) gate verdict ───────────────────────────────────────────────\n // Candidate + baseline share cellIds (same holdout scenarios), so their\n // judge scores MUST stay in separate maps — merging them collapses the\n // holdout delta to zero and the gate can never ship a real improvement.\n type ScoreMap = Map<\n string,\n Record<string, { composite: number; dimensions: Record<string, number>; notes: string }>\n >\n const candidateArtifacts = new Map<string, TArtifact>()\n const baselineArtifacts = new Map<string, TArtifact>()\n const judgeScores: ScoreMap = new Map()\n const baselineJudgeScores: ScoreMap = new Map()\n for (const cell of winnerOnHoldout.cells) {\n candidateArtifacts.set(cell.cellId, cell.artifact)\n judgeScores.set(cell.cellId, cell.judgeScores)\n }\n for (const cell of baselineOnHoldout.cells) {\n baselineArtifacts.set(cell.cellId, cell.artifact)\n 
baselineJudgeScores.set(cell.cellId, cell.judgeScores)\n }\n\n const gateResult = await opts.gate.decide({\n candidateArtifacts,\n baselineArtifacts,\n judgeScores,\n baselineJudgeScores,\n scenarios: opts.holdoutScenarios,\n cost: {\n candidate: winnerOnHoldout.aggregates.totalCostUsd,\n baseline: baselineOnHoldout.aggregates.totalCostUsd,\n },\n signal: new AbortController().signal,\n })\n\n // ── (4) auto-PR when gate ships ────────────────────────────────────\n let prResult: ReturnType<typeof openAutoPr> | undefined\n if (opts.autoOnPromote === 'pr' && gateResult.decision === 'ship') {\n const render = opts.renderPromotedDiff ?? defaultRenderDiff\n const promotedDiff = render(optimization.winnerSurface, opts.baselineSurface)\n prResult = openAutoPr({\n result: winnerOnHoldout,\n gate: gateResult,\n promotedDiff,\n ghOwner: opts.ghOwner!,\n ghRepo: opts.ghRepo!,\n })\n }\n\n return {\n ...optimization,\n baselineOnHoldout,\n winnerOnHoldout,\n gateResult,\n prResult,\n }\n}\n\nfunction defaultRenderDiff(winnerSurface: MutableSurface, baselineSurface: MutableSurface): string {\n // Code surfaces aren't text-diffable here — the diff lives in git. Render\n // the worktree/base refs + summary so the PR body points at the change.\n if (typeof winnerSurface !== 'string' || typeof baselineSurface !== 'string') {\n const fmt = (s: MutableSurface): string =>\n typeof s === 'string'\n ? '(prompt surface)'\n : `worktree=${s.worktreeRef}${s.baseRef ? ` base=${s.baseRef}` : ''}${s.summary ? 
`\\n${s.summary}` : ''}`\n return `--- baseline\\n${fmt(baselineSurface)}\\n+++ winner\\n${fmt(winnerSurface)}`\n }\n const lines: string[] = []\n lines.push('--- baseline')\n lines.push('+++ winner')\n for (const l of baselineSurface.split('\\n')) lines.push(`- ${l}`)\n for (const l of winnerSurface.split('\\n')) lines.push(`+ ${l}`)\n return lines.join('\\n')\n}\n"],"mappings":";;;;;;;;;;;;;;;;;AAcA,SAAS,gBAAgB;AACzB,SAAS,qBAAqB;AAC9B,SAAS,cAAc;AACvB,SAAS,YAAY;AAiCd,SAAS,WACd,SACkB;AAClB,MAAI,QAAQ,KAAK,aAAa,QAAQ;AACpC,WAAO;AAAA,MACL,QAAQ;AAAA,MACR,QAAQ;AAAA,MACR,QAAQ,qBAAqB,QAAQ,KAAK,QAAQ;AAAA,IACpD;AAAA,EACF;AAEA,QAAM,SAAS,QAAQ,UAAU,CAAC,QAAQ,IAAI;AAC9C,QAAM,SAAS,QAAQ,UAAU,QAAQ,QAAQ,OAAO,aAAa,MAAM,GAAG,EAAE,CAAC;AACjF,QAAM,QACJ,QAAQ,SAAS,kBAAkB,QAAQ,OAAO,aAAa,MAAM,GAAG,CAAC,CAAC;AAE5E,QAAM,OAAO,aAAa,QAAQ,QAAQ,QAAQ,MAAM,QAAQ,YAAY;AAC5E,QAAM,WAAW,KAAK,OAAO,GAAG,gBAAgB,KAAK,IAAI,CAAC,KAAK;AAC/D,gBAAc,UAAU,IAAI;AAE5B,MAAI,QAAQ;AACV,WAAO;AAAA,MACL,QAAQ;AAAA,MACR,QAAQ;AAAA,MACR,QAAQ,0DAA0D,QAAQ,OAAO,IAAI,QAAQ,MAAM,WAAW,MAAM,aAAa,QAAQ;AAAA,IAC3I;AAAA,EACF;AAEA,QAAM,SAAS,QAAQ,UAAU;AACjC,QAAM,SAAS,OAAO;AAAA,IACpB;AAAA,IACA;AAAA,IACA;AAAA,IACA,GAAG,QAAQ,OAAO,IAAI,QAAQ,MAAM;AAAA,IACpC;AAAA,IACA;AAAA,IACA;AAAA,IACA;AAAA,IACA;AAAA,IACA;AAAA,EACF,CAAC;AACD,MAAI,OAAO,WAAW,GAAG;AACvB,WAAO;AAAA,MACL,QAAQ;AAAA,MACR,QAAQ;AAAA,MACR,QAAQ,6BAA6B,OAAO,MAAM,MAAM,OAAO,OAAO,MAAM,GAAG,GAAG,CAAC;AAAA,IACrF;AAAA,EACF;AACA,QAAM,QAAQ,OAAO,OAAO,KAAK;AACjC,SAAO,EAAE,QAAQ,MAAM,OAAO,QAAQ,OAAO,QAAQ,YAAY;AACnE;AAEA,SAAS,aACP,QACA,MACA,MACQ;AACR,QAAM,QAAkB,CAAC;AACzB,QAAM,KAAK,kDAAkD;AAC7D,QAAM,KAAK,EAAE;AACb,QAAM,KAAK,mBAAmB,OAAO,YAAY,IAAI;AACrD,QAAM,KAAK,aAAa,OAAO,IAAI,EAAE;AACrC,QAAM,KAAK,iBAAiB,KAAK,MAAM,OAAO,aAAa,GAAI,CAAC,GAAG;AACnE,QAAM;AAAA,IACJ,uBAAuB,OAAO,WAAW,aAAa,YAAY,OAAO,WAAW,WAAW,aAAa,OAAO,WAAW,YAAY,YAAY,OAAO,WAAW,WAAW;AAAA,EACrL;AACA,QAAM,KAAK,qBAAqB,OAAO,WAAW,aAAa,QAAQ,CAAC,CAAC,EAAE;AAC3E,QAAM,KAAK,EAAE;AACb,QAAM,KAAK,uBAAuB,KAAK,QAAQ,IAAI;AACnD,QAAM,KAAK,EAAE;AACb,aAAW,UAAU,KAAK,QA
AS,OAAM,KAAK,KAAK,MAAM,EAAE;AAC3D,MAAI,KAAK,UAAU,OAAW,OAAM,KAAK,YAAY,KAAK,MAAM,QAAQ,CAAC,CAAC,EAAE;AAC5E,QAAM,KAAK,EAAE;AACb,QAAM,KAAK,wBAAwB;AACnC,QAAM,KAAK,EAAE;AACb,QAAM,KAAK,4BAA4B;AACvC,QAAM,KAAK,eAAe;AAC1B,aAAW,KAAK,KAAK,mBAAmB;AACtC,UAAM,SACJ,OAAO,EAAE,WAAW,WAChB,KAAK,UAAU,EAAE,MAAM,EAAE,MAAM,GAAG,EAAE,IACpC,OAAO,EAAE,MAAM,EAAE,MAAM,GAAG,EAAE;AAClC,UAAM,KAAK,KAAK,EAAE,IAAI,MAAM,EAAE,SAAS,WAAM,QAAG,MAAM,MAAM,IAAI;AAAA,EAClE;AACA,QAAM,KAAK,EAAE;AACb,QAAM,KAAK,sBAAsB;AACjC,QAAM,KAAK,EAAE;AACb,QAAM,KAAK,SAAS;AACpB,QAAM,KAAK,KAAK,MAAM,GAAG,GAAI,CAAC;AAC9B,QAAM,KAAK,KAAK;AAChB,QAAM,KAAK,EAAE;AACb,QAAM,KAAK,yBAAyB;AACpC,QAAM,KAAK,EAAE;AACb,QAAM,KAAK,6BAA6B;AACxC,QAAM,KAAK,mBAAmB;AAC9B,aAAW,CAAC,MAAM,GAAG,KAAK,OAAO,QAAQ,OAAO,WAAW,OAAO,GAAG;AACnE,UAAM;AAAA,MACJ,KAAK,IAAI,MAAM,IAAI,KAAK,QAAQ,CAAC,CAAC,OAAO,IAAI,KAAK,CAAC,EAAE,QAAQ,CAAC,CAAC,KAAK,IAAI,KAAK,CAAC,EAAE,QAAQ,CAAC,CAAC,OAAO,IAAI,CAAC;AAAA,IACxG;AAAA,EACF;AACA,SAAO,MAAM,KAAK,IAAI;AACxB;AAEA,SAAS,cAAc,MAAoE;AACzF,MAAI;AACF,UAAM,SAAS,SAAS,MAAM,KAAK,IAAI,QAAQ,EAAE,KAAK,GAAG,CAAC,IAAI;AAAA,MAC5D,KAAK,EAAE,GAAG,QAAQ,KAAK,UAAU,QAAQ,IAAI,oBAAoB,QAAQ,IAAI,YAAY,GAAG;AAAA,MAC5F,OAAO,CAAC,UAAU,QAAQ,MAAM;AAAA,IAClC,CAAC,EAAE,SAAS,MAAM;AAClB,WAAO,EAAE,QAAQ,QAAQ,IAAI,QAAQ,EAAE;AAAA,EACzC,SAAS,KAAK;AACZ,UAAM,IAAI;AACV,WAAO;AAAA,MACL,QAAQ,EAAE,QAAQ,SAAS,MAAM,KAAK;AAAA,MACtC,QAAQ,EAAE,QAAQ,SAAS,MAAM,KAAK;AAAA,MACtC,QAAQ,EAAE,UAAU;AAAA,IACtB;AAAA,EACF;AACF;AAEA,SAAS,SAAS,KAAqB;AACrC,MAAI,wBAAwB,KAAK,GAAG,EAAG,QAAO;AAC9C,SAAO,IAAI,IAAI,QAAQ,MAAM,KAAK,CAAC;AACrC;;;ACrJO,SAAS,mBACd,MAC8B;AAC9B,SAAO;AAAA,IACL,MAAM,gBAAgB,KAAK,QAAQ,IAAI;AAAA,IACvC,MAAM,QAAQ,EAAE,gBAAgB,UAAU,gBAAgB,OAAO,GAAG;AAClE,aAAO,KAAK,QAAQ,OAAO;AAAA,QACzB,UAAU,SAAS,SAAS,IAAI,WAAY,KAAK,YAAY,CAAC;AAAA,QAC9D;AAAA,QACA;AAAA,QACA;AAAA,MACF,CAAC;AAAA,IACH;AAAA,EACF;AACF;;;ACNA,IAAM,oBACJ;AAqBK,SAAS,WAAW,MAA4C;AACrE,QAAM,YAAY,KAAK,aAAa;AACpC,SAAO;AAAA,IACL,MAAM;AAAA,IACN,MAAM,QAAQ,KAAgD;AAC5D,YAAM,SACJ,OAAO,IAAI,mBAAmB,WAC1B,IAAI,iBACJ,KAAK,UAAU,IAAI,cAAc;AACvC,YAAM,
EAAE,KAAK,QAAQ,OAAO,IAAI,cAAc,KAAK,WAAW,KAAK,MAAM;AAEzE,YAAM,aAAa,sBAAsB;AAAA,QACvC;AAAA,QACA,eAAe;AAAA,QACf,WAAW;AAAA,QACX,cAAc;AAAA,QACd,YAAY,IAAI;AAAA,QAChB,oBAAoB,KAAK;AAAA,MAC3B,CAAC;AAED,YAAM,SAAS,MAAM;AAAA,QACnB;AAAA,UACE,OAAO,KAAK;AAAA,UACZ,UAAU;AAAA,YACR,EAAE,MAAM,UAAU,SAAS,kBAAkB;AAAA,YAC7C,EAAE,MAAM,QAAQ,SAAS,WAAW;AAAA,UACtC;AAAA,UACA,UAAU;AAAA,UACV,aAAa,KAAK,eAAe;AAAA,UACjC,WAAW,KAAK,aAAa;AAAA,QAC/B;AAAA,QACA,KAAK;AAAA,MACP;AAEA,YAAM,YAAY,wBAAwB,OAAO,SAAS,IAAI,cAAc;AAC5E,YAAM,MAAwB,CAAC;AAC/B,iBAAW,YAAY,WAAW;AAChC,cAAM,OAAO,OAAO,SAAS,YAAY,WAAW,SAAS,QAAQ,KAAK,IAAI;AAC9E,YAAI,QAAQ,SAAS,UAAU,CAAC,IAAI,SAAS,IAAI,EAAG,KAAI,KAAK,IAAI;AAAA,MACnE;AACA,aAAO;AAAA,IACT;AAAA,EACF;AACF;AAKA,SAAS,cACP,KACA,WACA,YAC6D;AAC7D,QAAM,OAAO,IAAI,QAAQ,GAAG,EAAE;AAC9B,MAAI,CAAC,QAAQ,KAAK,WAAW,WAAW,GAAG;AACzC,WAAO,EAAE,KAAK,CAAC,GAAG,QAAQ,CAAC,GAAG,QAAQ,WAAW;AAAA,EACnD;AACA,QAAM,OAAO,CAAC,GAAG,KAAK,UAAU,EAAE,KAAK,CAAC,GAAG,MAAM,EAAE,YAAY,EAAE,SAAS,EAAE,CAAC;AAC7E,MAAI,CAAC,KAAM,QAAO,EAAE,KAAK,CAAC,GAAG,QAAQ,CAAC,GAAG,QAAQ,WAAW;AAE5D,QAAM,UAAU,CAAC,GAAG,KAAK,SAAS,EAAE,KAAK,CAAC,GAAG,MAAM,EAAE,YAAY,EAAE,SAAS;AAC5E,QAAM,UAAU,CAAC,OAA8D;AAAA,IAC7E,IAAI,EAAE;AAAA,IACN,OAAO,EAAE;AAAA,EACX;AACA,QAAM,MAAM,QAAQ,MAAM,GAAG,SAAS,EAAE,IAAI,OAAO;AACnD,QAAM,SAAS,QAAQ,MAAM,CAAC,SAAS,EAAE,QAAQ,EAAE,IAAI,OAAO;AAE9D,QAAM,UAAU,OAAO,QAAQ,KAAK,UAAU,EAC3C,KAAK,CAAC,GAAG,MAAM,EAAE,CAAC,IAAI,EAAE,CAAC,CAAC,EAC1B,MAAM,GAAG,CAAC,EACV,IAAI,CAAC,CAAC,KAAK,KAAK,MAAM,GAAG,GAAG,KAAK,MAAM,QAAQ,CAAC,CAAC,GAAG;AACvD,QAAM,SACJ,QAAQ,SAAS,IAAI,GAAG,UAAU,+BAA0B,QAAQ,KAAK,IAAI,CAAC,KAAK;AAErF,SAAO,EAAE,KAAK,QAAQ,OAAO;AAC/B;;;ACpHO,SAAS,eACX,OACyB;AAC5B,MAAI,MAAM,WAAW,GAAG;AACtB,UAAM,IAAI,MAAM,wCAAwC;AAAA,EAC1D;AACA,SAAO;AAAA,IACL,MAAM,YAAY,MAAM,IAAI,CAAC,MAAM,EAAE,IAAI,EAAE,KAAK,GAAG,CAAC;AAAA,IACpD,MAAM,OAAO,KAA6D;AACxE,YAAM,UAAwE,CAAC;AAC/E,iBAAW,QAAQ,OAAO;AACxB,cAAM,MAAM,MAAM,KAAK,OAAO,GAAG;AACjC,gBAAQ,KAAK,EAAE,MAAM,IAAI,CAAC;AAAA,MAC5B;AAQA,YAAM,YAAY,QAAQ,IAAI,CAAC,MAAM,EAAE,IAAI,QAAQ;AACnD,YAAM,UAAwB,UAAU,MAAM,C
AAC,MAAM,MAAM,MAAM,IAC7D,SACA,UAAU,SAAS,cAAc,IAC/B,iBACA,UAAU,SAAS,eAAe,IAChC,kBACA,UAAU,SAAS,MAAM,IACvB,SACA;AAEV,YAAM,eAAe,QAAQ;AAAA,QAAQ,CAAC,MACpC,EAAE,IAAI,kBAAkB,SAAS,IAC7B,EAAE,IAAI,oBACN,CAAC,EAAE,MAAM,EAAE,KAAK,MAAM,QAAQ,EAAE,IAAI,aAAa,QAAQ,QAAQ,EAAE,IAAI,CAAC;AAAA,MAC9E;AAEA,YAAM,UAAU,QAAQ;AAAA,QAAQ,CAAC,MAC/B,EAAE,IAAI,QAAQ,IAAI,CAAC,WAAW,IAAI,EAAE,KAAK,IAAI,KAAK,MAAM,EAAE;AAAA,MAC5D;AAEA,aAAO;AAAA,QACL,UAAU;AAAA,QACV;AAAA,QACA,mBAAmB;AAAA,QACnB,OAAO,QAAQ,CAAC,GAAG,IAAI;AAAA,MACzB;AAAA,IACF;AAAA,EACF;AACF;;;ACpBO,SAAS,sBACd,SAC4B;AAC5B,QAAM,iBAAiB,QAAQ,kBAAkB;AACjD,QAAM,gBAAgB,QAAQ,8BAA8B;AAE5D,SAAO;AAAA,IACL,MAAM;AAAA,IACN,MAAM,OAAO,KAA6D;AACxE,YAAM,UAAoB,CAAC;AAC3B,YAAM,eAA0E,CAAC;AAKjF,YAAM,oBAAoB;AAAA,QACxB,IAAI;AAAA,QACJ,IAAI,uBAAuB,IAAI;AAAA,QAC/B,QAAQ;AAAA,MACV;AACA,YAAM,qBAAqB;AAAA,QACzB,IAAI;AAAA,QACJ,IAAI;AAAA,QACJ,QAAQ;AAAA,MACV;AACA,YAAM,QAAQ,qBAAqB;AACnC,YAAM,cAAc,SAAS;AAC7B,mBAAa,KAAK;AAAA,QAChB,MAAM;AAAA,QACN,QAAQ;AAAA,QACR,QAAQ,EAAE,mBAAmB,oBAAoB,OAAO,eAAe;AAAA,MACzE,CAAC;AACD,UAAI,CAAC,aAAa;AAChB,gBAAQ,KAAK,iBAAiB,MAAM,QAAQ,CAAC,CAAC,gBAAgB,cAAc,EAAE;AAAA,MAChF;AAGA,YAAM,aACJ,QAAQ,cAAc,UACtB,IAAI,KAAK,YAAY,IAAI,KAAK,YAAY,QAAQ;AACpD,mBAAa,KAAK;AAAA,QAChB,MAAM;AAAA,QACN,QAAQ;AAAA,QACR,QAAQ;AAAA,UACN,cAAc,IAAI,KAAK;AAAA,UACvB,aAAa,IAAI,KAAK;AAAA,UACtB,WAAW,QAAQ;AAAA,QACrB;AAAA,MACF,CAAC;AACD,UAAI,CAAC,YAAY;AACf,gBAAQ;AAAA,UACN,UAAU,IAAI,KAAK,YAAY,IAAI,KAAK,UAAU,QAAQ,CAAC,CAAC,aAAa,QAAQ,SAAS;AAAA,QAC5F;AAAA,MACF;AAGA,YAAM,kBAAkB,QAAQ,iBAC5B,aAAa,IAAI,oBAAoB,QAAQ,cAAc,IAC3D,EAAE,QAAQ,MAAM,UAAU,CAAC,EAAE;AACjC,mBAAa,KAAK;AAAA,QAChB,MAAM;AAAA,QACN,QAAQ,gBAAgB;AAAA,QACxB,QAAQ;AAAA,UACN,UAAU,gBAAgB,SAAS;AAAA,UACnC,QAAQ,gBAAgB,SAAS,MAAM,GAAG,CAAC;AAAA,QAC7C;AAAA,MACF,CAAC;AACD,UAAI,CAAC,gBAAgB,QAAQ;AAC3B,gBAAQ,KAAK,0BAA0B,gBAAgB,SAAS,MAAM,YAAY;AAAA,MACpF;AAGA,UAAI,sBAAkD;AACtD,UAAI,QAAQ,cAAc,QAAQ,WAAW,UAAU,IAAI;AACzD,8BAAsB,oBAAoB,EAAE,MAAM,QAAQ,WAAW,CAAC;AAAA,MACxE;AAIA,YAAM,kBAAkB;AACxB,YAAM,kBAAkB,qBAAqB,YAAY,CAAC,GAAG;AAAA,QAC3D,CAAC,MAAM,EAAE,
YAAY;AAAA,MACvB;AACA,YAAM,oBACJ,CAAC,uBACD,CAAC,iBACA,eAAe,WAAW,KAAK,oBAAoB,YAAY;AAClE,mBAAa,KAAK;AAAA,QAChB,MAAM;AAAA,QACN,QAAQ;AAAA,QACR,QAAQ,EAAE,QAAQ,qBAAqB,oBAAoB,eAAe,OAAO;AAAA,MACnF,CAAC;AACD,UAAI,CAAC,mBAAmB;AACtB,gBAAQ;AAAA,UACN,mCAAmC,eAAe,MAAM,sCAAsC,oBAAqB,OAAO;AAAA,QAC5H;AAAA,MACF;AAGA,UAAI,eAAoC;AACxC,UAAI,QAAQ,cAAc,QAAQ,WAAW,UAAU,IAAI;AACzD,uBAAe,YAAY,QAAQ,YAAY,CAAC,CAAC;AAAA,MACnD;AAEA,YAAM,eAAe,cAAc,UAAU,CAAC,GAAG,OAAO,CAAC,MAAM,EAAE,aAAa,OAAO;AACrF,YAAM,aAAa,YAAY,WAAW;AAC1C,mBAAa,KAAK;AAAA,QAChB,MAAM;AAAA,QACN,QAAQ;AAAA,QACR,QAAQ,EAAE,aAAa,cAAc,OAAO,UAAU,GAAG,aAAa,YAAY,OAAO;AAAA,MAC3F,CAAC;AACD,UAAI,CAAC,YAAY;AACf,gBAAQ,KAAK,wBAAwB,YAAY,MAAM,EAAE;AAAA,MAC3D;AAGA,YAAM,YAAY,aAAa,MAAM,CAAC,MAAM,EAAE,MAAM;AACpD,YAAM,WAAW,YAAY,SAAS;AAEtC,aAAO;AAAA,QACL;AAAA,QACA,SAAS,QAAQ,SAAS,IAAI,UAAU,CAAC,kBAAkB;AAAA,QAC3D,mBAAmB;AAAA,QACnB;AAAA,MACF;AAAA,IACF;AAAA,EACF;AACF;AAEA,SAAS,cACP,WACA,mBACA,WACQ;AACR,MAAI,CAAC,aAAa,UAAU,SAAS,EAAG,QAAO;AAC/C,QAAM,cAAc,IAAI,IAAI,UAAU,IAAI,CAAC,MAAM,EAAE,EAAE,CAAC;AACtD,QAAM,aAAuB,CAAC;AAC9B,aAAW,CAAC,QAAQ,MAAM,KAAK,mBAAmB;AAChD,UAAM,aAAa,OAAO,MAAM,GAAG,EAAE,CAAC,KAAK;AAC3C,QAAI,CAAC,YAAY,IAAI,UAAU,EAAG;AAClC,UAAM,iBAAiB,OAAO,OAAO,MAAM,EAAE,IAAI,CAAC,MAAM,EAAE,SAAS;AACnE,QAAI,eAAe,WAAW,EAAG;AACjC,eAAW,KAAK,eAAe,OAAO,CAAC,GAAG,MAAM,IAAI,GAAG,CAAC,IAAI,eAAe,MAAM;AAAA,EACnF;AACA,MAAI,WAAW,WAAW,EAAG,QAAO;AACpC,SAAO,WAAW,OAAO,CAAC,GAAG,MAAM,IAAI,GAAG,CAAC,IAAI,WAAW;AAC5D;AAEA,SAAS,aACP,WACA,SAC8E;AAC9E,QAAM,WAA0D,CAAC;AACjE,aAAW,CAAC,SAAS,QAAQ,KAAK,WAAW;AAC3C,UAAM,OAAO,YAAY,QAAQ;AACjC,QAAI,SAAS,OAAW;AACxB,eAAW,UAAU,SAAS;AAC5B,YAAM,UAAU,mBAAmB,MAAM,CAAC,GAAG,MAAM;AACnD,UAAI,CAAC,QAAQ,QAAQ;AACnB,iBAAS,KAAK,EAAE,YAAY,OAAO,IAAI,QAAQ,QAAQ,UAAU,wBAAwB,CAAC;AAAA,MAC5F;AAAA,IACF;AAAA,EACF;AACA,SAAO,EAAE,QAAQ,SAAS,WAAW,GAAG,SAAS;AACnD;AAEA,SAAS,YAAY,UAAuC;AAC1D,MAAI,OAAO,aAAa,SAAU,QAAO;AACzC,MAAI,YAAY,OAAO,aAAa,UAAU;AAC5C,UAAM,MAAM;AACZ,QAAI,OAAO,IAAI,SAAS,SAAU,QAAO,IAAI;AAC7C,QAAI,OAAO,IAAI,WAAW,SAAU,QAAO,IAAI;AAC/C,QAAI,OAAO,IAAI,YAAY,SAA
U,QAAO,IAAI;AAAA,EAClD;AACA,SAAO;AACT;;;AC5MO,SAAS,YACd,SAC4B;AAC5B,QAAM,iBAAiB,QAAQ,kBAAkB;AACjD,SAAO;AAAA,IACL,MAAM;AAAA,IACN,MAAM,OAAO,KAA6D;AACxE,YAAM,cAAc,IAAI,IAAI,QAAQ,UAAU,IAAI,CAAC,MAAM,EAAE,EAAE,CAAC;AAG9D,YAAM,WAAW,iBAAiB,IAAI,uBAAuB,IAAI,aAAa,WAAW;AACzF,YAAM,YAAY,iBAAiB,IAAI,aAAa,WAAW;AAC/D,YAAM,QAAQ,YAAY;AAC1B,YAAM,SAAS,SAAS;AACxB,aAAO;AAAA,QACL,UAAU,SAAS,SAAS;AAAA,QAC5B,SAAS,SACL,CAAC,kBAAkB,MAAM,QAAQ,CAAC,CAAC,WAAM,cAAc,EAAE,IACzD,CAAC,kBAAkB,MAAM,QAAQ,CAAC,CAAC,MAAM,cAAc,EAAE;AAAA,QAC7D,mBAAmB;AAAA,UACjB,EAAE,MAAM,eAAe,QAAQ,QAAQ,EAAE,UAAU,WAAW,OAAO,eAAe,EAAE;AAAA,QACxF;AAAA,QACA;AAAA,MACF;AAAA,IACF;AAAA,EACF;AACF;AAEA,SAAS,iBACP,mBACA,aACQ;AACR,QAAM,aAAuB,CAAC;AAC9B,aAAW,CAAC,QAAQ,MAAM,KAAK,mBAAmB;AAChD,UAAM,aAAa,OAAO,MAAM,GAAG,EAAE,CAAC,KAAK;AAC3C,QAAI,CAAC,YAAY,IAAI,UAAU,EAAG;AAClC,UAAM,OAAO,OAAO,OAAO,MAAM,EAAE,IAAI,CAAC,MAAM,EAAE,SAAS;AACzD,QAAI,KAAK,SAAS,EAAG,YAAW,KAAK,KAAK,OAAO,CAAC,GAAG,MAAM,IAAI,GAAG,CAAC,IAAI,KAAK,MAAM;AAAA,EACpF;AACA,SAAO,WAAW,WAAW,IAAI,IAAI,WAAW,OAAO,CAAC,GAAG,MAAM,IAAI,GAAG,CAAC,IAAI,WAAW;AAC1F;;;ACrCA,eAAsB,QACpB,MAC+C;AAC/C,SAAO,YAAY,IAAI;AACzB;;;ACFA,SAAS,kBAAkB;AAkD3B,eAAsB,gBACpB,MACsD;AACtD,QAAM,cAAc,KAAK,eAAe;AAGxC,QAAM,mBAAmB,MAAM,YAAkC;AAAA,IAC/D,GAAG;AAAA,IACH,UAAU,CAAC,UAAU,QAAQ,KAAK,oBAAoB,KAAK,iBAAiB,UAAU,GAAG;AAAA,IACzF,QAAQ,GAAG,KAAK,MAAM;AAAA,EACxB,CAAC;AAED,QAAM,cAA0E,CAAC;AACjF,QAAM,UAA8B,CAAC;AACrC,MAAI,kBAAoC,CAAC,KAAK,eAAe;AAC7D,MAAI,gBAAgB,KAAK;AACzB,MAAI,oBAAoB,YAAY,KAAK,eAAe;AACxD,MAAI,kBAAkBA,eAAc,gBAAgB;AAEpD,WAAS,MAAM,GAAG,MAAM,KAAK,gBAAgB,OAAO;AAElD,QAAI,KAAK,OAAO,SAAS,EAAE,QAAQ,CAAC,EAAE,KAAM;AAI5C,UAAM,aAAa,MAAM,KAAK,OAAO,QAAQ;AAAA,MAC3C,gBAAgB,gBAAgB,CAAC,KAAK,KAAK;AAAA,MAC3C;AAAA,MACA,UAAU,CAAC;AAAA,MACX,gBAAgB,KAAK;AAAA,MACrB,YAAY;AAAA,MACZ,QAAQ,IAAI,gBAAgB,EAAE;AAAA,MAC9B,QAAQ,KAAK;AAAA,MACb,SAAS,KAAK,gBAAgB,KAAK,iBAAiB,QAAQ,KAAK,eAAe;AAAA,MAChF,qBAAqB,KAAK;AAAA,IAC5B,CAAC;AAGD,UAAM,iBAKD,CAAC;AACN,aAAS,IAAI,GAAG,IAAI,WAAW,QAAQ,KAAK;AAC1C,YAAM,UAAU,WAAW,CAAC;AAC5B,YAAM,OAAO,YAAY,OAAO;AAChC
,YAAM,WAAW,MAAM,YAAkC;AAAA,QACvD,GAAG;AAAA,QACH,UAAU,CAAC,UAAU,QAAQ,KAAK,oBAAoB,SAAS,UAAU,GAAG;AAAA,QAC5E,QAAQ,GAAG,KAAK,MAAM,QAAQ,GAAG,cAAc,CAAC;AAAA,MAClD,CAAC;AACD,YAAM,YAAYA,eAAc,QAAQ;AACxC,qBAAe,KAAK,EAAE,aAAa,MAAM,SAAS,UAAU,UAAU,CAAC;AAAA,IACzE;AAGA,mBAAe,KAAK,CAAC,GAAG,MAAM,EAAE,YAAY,EAAE,SAAS;AACvD,UAAM,WAAW,eAAe,MAAM,GAAG,WAAW;AACpD,sBAAkB,SAAS,IAAI,CAAC,MAAM,EAAE,OAAO;AAC/C,UAAM,MAAM,eAAe,CAAC;AAC5B,QAAI,OAAO,IAAI,YAAY,iBAAiB;AAC1C,sBAAgB,IAAI;AACpB,0BAAoB,IAAI;AACxB,wBAAkB,IAAI;AAAA,IACxB;AAEA,UAAM,SAA2B;AAAA,MAC/B,iBAAiB;AAAA,MACjB,YAAY,eAAe,IAAI,CAAC,MAAM;AACpC,cAAM,YAAY,mBAAmB,EAAE,QAAQ;AAC/C,eAAO;AAAA,UACL,aAAa,EAAE;AAAA,UACf,WAAW,EAAE;AAAA,UACb,MAAM,CAAC,EAAE,WAAW,EAAE,SAAS;AAAA,UAC/B,YAAY,UAAU;AAAA,UACtB,WAAW,UAAU;AAAA,QACvB;AAAA,MACF,CAAC;AAAA,MACD,UAAU,SAAS,IAAI,CAAC,MAAM,EAAE,WAAW;AAAA,IAC7C;AACA,YAAQ,KAAK,MAAM;AACnB,gBAAY,KAAK;AAAA,MACf;AAAA,MACA,UAAU,eAAe,IAAI,CAAC,OAAO;AAAA,QACnC,aAAa,EAAE;AAAA,QACf,SAAS,EAAE;AAAA,QACX,UAAU,EAAE;AAAA,MACd,EAAE;AAAA,IACJ,CAAC;AAAA,EACH;AAEA,SAAO;AAAA,IACL;AAAA,IACA;AAAA,IACA;AAAA,IACA;AAAA,EACF;AACF;AAEO,SAAS,YAAY,SAAiC;AAG3D,QAAM,WACJ,OAAO,YAAY,WACf,UACA,KAAK,UAAU;AAAA,IACb,MAAM,QAAQ;AAAA,IACd,aAAa,QAAQ;AAAA,IACrB,SAAS,QAAQ,WAAW;AAAA,EAC9B,CAAC;AACP,SAAO,WAAW,QAAQ,EAAE,OAAO,QAAQ,EAAE,OAAO,KAAK,EAAE,MAAM,GAAG,EAAE;AACxE;AAEA,SAASA,eACP,UACQ;AACR,QAAM,aAAuB,CAAC;AAC9B,aAAW,QAAQ,SAAS,OAAO;AACjC,UAAM,iBAAiB,OAAO,OAAO,KAAK,WAAW,EAAE,IAAI,CAAC,MAAM,EAAE,SAAS;AAC7E,QAAI,eAAe,SAAS,GAAG;AAC7B,iBAAW,KAAK,eAAe,OAAO,CAAC,GAAG,MAAM,IAAI,GAAG,CAAC,IAAI,eAAe,MAAM;AAAA,IACnF;AAAA,EACF;AACA,SAAO,WAAW,WAAW,IAAI,IAAI,WAAW,OAAO,CAAC,GAAG,MAAM,IAAI,GAAG,CAAC,IAAI,WAAW;AAC1F;AAIA,SAAS,mBACP,UAIA;AACA,QAAM,UAAkC,CAAC;AACzC,QAAM,YAAoC,CAAC;AAC3C,QAAM,aAAa,oBAAI,IAAsB;AAC7C,aAAW,QAAQ,SAAS,OAAO;AACjC,UAAM,cAAc,OAAO,OAAO,KAAK,WAAW;AAClD,QAAI,YAAY,WAAW,EAAG;AAC9B,UAAM,gBAAgB,YAAY,OAAO,CAAC,GAAG,MAAM,IAAI,EAAE,WAAW,CAAC,IAAI,YAAY;AACrF,UAAM,MAAM,WAAW,IAAI,KAAK,UAAU,KAAK,CAAC;AAChD,QAAI,KAAK,aAAa;AACtB,eAAW,IAAI,KAAK,YAAY,GAAG;AACnC,eAAW,S
AAS,aAAa;AAC/B,iBAAW,CAAC,KAAK,KAAK,KAAK,OAAO,QAAQ,MAAM,UAAU,GAAG;AAC3D,gBAAQ,GAAG,KAAK,QAAQ,GAAG,KAAK,KAAK;AACrC,kBAAU,GAAG,KAAK,UAAU,GAAG,KAAK,KAAK;AAAA,MAC3C;AAAA,IACF;AAAA,EACF;AACA,QAAM,aAAqC,CAAC;AAC5C,aAAW,OAAO,OAAO,KAAK,OAAO,GAAG;AACtC,UAAM,QAAQ,UAAU,GAAG,KAAK;AAChC,eAAW,GAAG,IAAI,QAAQ,KAAK,QAAQ,GAAG,KAAK,KAAK,QAAQ;AAAA,EAC9D;AACA,QAAM,YAAY,CAAC,GAAG,WAAW,QAAQ,CAAC,EAAE,IAAI,CAAC,CAAC,YAAY,KAAK,OAAO;AAAA,IACxE;AAAA,IACA,WAAW,MAAM,OAAO,CAAC,GAAG,MAAM,IAAI,GAAG,CAAC,IAAI,MAAM;AAAA,EACtD,EAAE;AACF,SAAO,EAAE,YAAY,UAAU;AACjC;;;ACxKA,eAAsB,mBACpB,MACyD;AAGzD,MAAK,KAAa,kBAAkB,UAAU;AAC5C,UAAM,IAAI;AAAA,MACR;AAAA,IACF;AAAA,EACF;AAKA,MAAI,KAAK,YAAY,SAAS,KAAK,QAAQ;AACzC,UAAM,IAAI;AAAA,MACR;AAAA,IACF;AAAA,EACF;AACA,MAAI,KAAK,kBAAkB,SAAS,CAAC,KAAK,WAAW,CAAC,KAAK,SAAS;AAClE,UAAM,IAAI,MAAM,mEAAmE;AAAA,EACrF;AAGA,QAAM,eAAe,MAAM,gBAAgB,IAAI;AAG/C,QAAM,EAAE,aAAAC,aAAY,IAAI,MAAM,OAAO,4BAAiB;AAEtD,QAAM,oBAAoB,MAAMA,aAAkC;AAAA,IAChE,GAAG;AAAA,IACH,WAAW,KAAK;AAAA,IAChB,UAAU,CAAC,UAAU,QAAQ,KAAK,oBAAoB,KAAK,iBAAiB,UAAU,GAAG;AAAA,IACzF,QAAQ,GAAG,KAAK,MAAM;AAAA,EACxB,CAAC;AAED,QAAM,kBAAkB,MAAMA,aAAkC;AAAA,IAC9D,GAAG;AAAA,IACH,WAAW,KAAK;AAAA,IAChB,UAAU,CAAC,UAAU,QACnB,KAAK,oBAAoB,aAAa,eAAe,UAAU,GAAG;AAAA,IACpE,QAAQ,GAAG,KAAK,MAAM;AAAA,EACxB,CAAC;AAUD,QAAM,qBAAqB,oBAAI,IAAuB;AACtD,QAAM,oBAAoB,oBAAI,IAAuB;AACrD,QAAM,cAAwB,oBAAI,IAAI;AACtC,QAAM,sBAAgC,oBAAI,IAAI;AAC9C,aAAW,QAAQ,gBAAgB,OAAO;AACxC,uBAAmB,IAAI,KAAK,QAAQ,KAAK,QAAQ;AACjD,gBAAY,IAAI,KAAK,QAAQ,KAAK,WAAW;AAAA,EAC/C;AACA,aAAW,QAAQ,kBAAkB,OAAO;AAC1C,sBAAkB,IAAI,KAAK,QAAQ,KAAK,QAAQ;AAChD,wBAAoB,IAAI,KAAK,QAAQ,KAAK,WAAW;AAAA,EACvD;AAEA,QAAM,aAAa,MAAM,KAAK,KAAK,OAAO;AAAA,IACxC;AAAA,IACA;AAAA,IACA;AAAA,IACA;AAAA,IACA,WAAW,KAAK;AAAA,IAChB,MAAM;AAAA,MACJ,WAAW,gBAAgB,WAAW;AAAA,MACtC,UAAU,kBAAkB,WAAW;AAAA,IACzC;AAAA,IACA,QAAQ,IAAI,gBAAgB,EAAE;AAAA,EAChC,CAAC;AAGD,MAAI;AACJ,MAAI,KAAK,kBAAkB,QAAQ,WAAW,aAAa,QAAQ;AACjE,UAAM,SAAS,KAAK,sBAAsB;AAC1C,UAAM,eAAe,OAAO,aAAa,eAAe,KAAK,eAAe;AAC5E,eAAW,WAAW;AAAA,MACpB,QAAQ;AAAA,MACR,MAAM;AAAA,MACN;AAAA,
MACA,SAAS,KAAK;AAAA,MACd,QAAQ,KAAK;AAAA,IACf,CAAC;AAAA,EACH;AAEA,SAAO;AAAA,IACL,GAAG;AAAA,IACH;AAAA,IACA;AAAA,IACA;AAAA,IACA;AAAA,EACF;AACF;AAEA,SAAS,kBAAkB,eAA+B,iBAAyC;AAGjG,MAAI,OAAO,kBAAkB,YAAY,OAAO,oBAAoB,UAAU;AAC5E,UAAM,MAAM,CAAC,MACX,OAAO,MAAM,WACT,qBACA,YAAY,EAAE,WAAW,GAAG,EAAE,UAAU,SAAS,EAAE,OAAO,KAAK,EAAE,GAAG,EAAE,UAAU;AAAA,EAAK,EAAE,OAAO,KAAK,EAAE;AAC3G,WAAO;AAAA,EAAiB,IAAI,eAAe,CAAC;AAAA;AAAA,EAAiB,IAAI,aAAa,CAAC;AAAA,EACjF;AACA,QAAM,QAAkB,CAAC;AACzB,QAAM,KAAK,cAAc;AACzB,QAAM,KAAK,YAAY;AACvB,aAAW,KAAK,gBAAgB,MAAM,IAAI,EAAG,OAAM,KAAK,KAAK,CAAC,EAAE;AAChE,aAAW,KAAK,cAAc,MAAM,IAAI,EAAG,OAAM,KAAK,KAAK,CAAC,EAAE;AAC9D,SAAO,MAAM,KAAK,IAAI;AACxB;","names":["meanComposite","runCampaign"]}
|