@tangle-network/agent-eval 0.50.0 → 0.50.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/docs/concepts.md CHANGED
@@ -9,6 +9,26 @@ connected, or the answer lacks required sources. The package gives products a
9
9
  shared way to record runs, check outcomes, classify failures, compare variants,
10
10
  and make release decisions.
11
11
 
12
+ ## The three top-level functions
13
+
14
+ Everything funnels through `/contract`. Three entries, one shape coming back:
15
+
16
+ | Function | When to call it | What you give it | What you get back |
17
+ |---|---|---|---|
18
+ | **`selfImprove()`** | You have a closed loop — scenarios, judge, agent in hand, and you want the substrate to propose better candidates + gate them. | scenarios, agent, judge, baseline surface | `SelfImproveResult.insight: InsightReport` + ship/hold verdict + winner surface |
19
+ | **`analyzeRuns()`** | You have observed runs (production traces, an approve/reject corpus, a CSV gold set) and want the same rigor packet without invoking an agent. | `RunRecord[]` + optional flags | `InsightReport` |
20
+ | **Intake adapters** (`fromFeedbackTable`, `fromOtelSpans`) | Your data isn't already in `RunRecord` shape — it's in Obsidian, Sheets, an OTel collector, etc. | source-specific input | `RunRecord[]` ready to pipe into `analyzeRuns()` |
21
+
22
+ The three customer maturity stages — logs only → ratings → closed loop — map exactly to the three functions. See [`customer-journeys.md`](./customer-journeys.md) for the runnable walkthroughs.
23
+
24
+ The shape of the answer — `InsightReport` — is identical across all three paths. Distributional summary, paired-bootstrap lift CI, judge stats, inter-rater agreement, cost-quality Pareto, failure clusters, contamination check, outcome correlation, release axes, and a ranked recommendations array. Walked through section-by-section in [`insight-report.md`](./insight-report.md).
25
+
26
+ ## The layering rule
27
+
28
+ `agent-eval` is the **substrate** at the bottom of the Tangle agent stack. `agent-runtime` and `agent-knowledge` depend on it; `agent-eval` MUST NOT import from either. Primitives that "feel like" they belong in a consumer but are actually substrate-shaped (validator verdicts, run records, scenarios, judge scores) live here. Primitives that genuinely require a running agent loop (`ValidationCtx` with iteration + signal + traceEmitter, sandbox `AgentRunSpec`) stay in `agent-runtime`.
29
+
30
+ The test: *does this concept make sense WITHOUT a running agent loop?* If yes, it's substrate. If no, it's runtime. The full rule is in [`/CLAUDE.md`](../CLAUDE.md#repo-layering--this-package-is-the-substrate).
31
+
12
32
  ## Main Objects
13
33
 
14
34
  | Thing | What it is | One-line example |
@@ -0,0 +1,208 @@
1
+ # Customer journeys
2
+
3
+ Three end-to-end journeys covering the surface of `@tangle-network/agent-eval`. Each one is a runnable example under `examples/` — clone the repo and `pnpm tsx examples/<journey>/index.ts` to see the actual output.
4
+
5
+ The three journeys map to three customer-maturity stages:
6
+
7
+ 1. **Logs but no eval discipline** → [Production traces journey](#1-production-traces-journey-customer-otel-traces)
8
+ 2. **Ratings but no closed loop** → [Feedback corpus journey](#2-feedback-corpus-journey-customer-feedback-loop)
9
+ 3. **Scenarios, judge, agent — full closed loop** → [Closed-loop journey](#3-closed-loop-journey-selfimprove-quickstart)
10
+
11
+ Each section: what the customer has, what they want, the code, what the report looks like.
12
+
13
+ ---
14
+
15
+ ## 1. Production traces journey — `customer-otel-traces`
16
+
17
+ **The customer:** an agentic GTM-as-a-service company. Multiple agent steps in prod (social media posting, image generation, translation). OTel observability piped to their collector. Doesn't run formal evals. CTO hand-rolled their tracing.
18
+
19
+ **The frustration:** "Which step is unreliable? What's our cost-quality profile? Where do we fix next?" They have the data; they don't have the answer.
20
+
21
+ **What they need from agent-eval:** day-1 analysis of their existing logs. No scenarios, no judges, no closed loop. Just turn the trace stream into a decision packet.
22
+
23
+ ### The code
24
+
25
+ ```ts
26
+ import { analyzeRuns, fromOtelSpans } from '@tangle-network/agent-eval/contract'
27
+
28
+ const runs = fromOtelSpans({ spans: yourOtelStream })
29
+ const report = await analyzeRuns({ runs })
30
+
31
+ // report.failureClusters → root causes
32
+ // report.costQuality.pareto → cost-vs-quality scatter
33
+ // report.composite → distribution
34
+ // report.recommendations → top-3 actions
35
+ ```
36
+
37
+ ### What the report shows
38
+
39
+ ```
40
+ Runs analyzed: 40
41
+ Composite mean: 0.721 (p50: 0.717, p95: 0.925, stddev: 0.210)
42
+ Cost mean: $0.103 (p95: $0.131)
43
+
44
+ ── Failures ──
45
+ 6 runs with status=ERROR or failureMode set:
46
+ tool.search (3x)
47
+ agent.turn (3x)
48
+
49
+ ── Cost-quality Pareto ──
50
+ 1 candidate(s) plotted; 1 on the frontier
51
+ otel-default: cost=$0.103 quality=0.721 (frontier)
52
+
53
+ ── Recommendations ──
54
+ [medium] expand-corpus — Mean composite 0.721 has room
55
+ ```
56
+
57
+ ### Next steps for this customer
58
+
59
+ 1. Wire an `AnalystRegistry` to cluster the 6 failures by root cause via LLM analysis.
60
+ 2. Add `outcomeSignal` once they have downstream conversion / engagement / post-engagement data, and the report fits a reward model showing whether their score predicts the customer outcome.
61
+ 3. Once they identify a step worth optimizing (translation, say), graduate to journey #3 — wrap that step in a `Dispatch` and call `selfImprove()`.
62
+
63
+ **Runnable:** [`examples/customer-otel-traces/`](../examples/customer-otel-traces/)
64
+
65
+ ---
66
+
67
+ ## 2. Feedback corpus journey — `customer-feedback-loop`
68
+
69
+ **The customer:** a research-validation team. A GitHub Action fires `claude -p` against the next claim, writes the research output to Obsidian. Three reviewers (Alice, Bob, Carol) tag results `#approved` or `#rejected`. Outputs feed a knowledge base. Knowledge feeds content. Content feeds engagement. The founder wants more engagement faster.
70
+
71
+ **The frustration:** "We disagree on what's good. We don't know if our 'good' actually drives engagement. Reviewing every claim is slow."
72
+
73
+ **What they need from agent-eval:** turn the approve/reject corpus into actionable signal:
74
+ - Where do reviewers disagree? (triage list)
75
+ - Can we synthesize each reviewer's taste into an LLM judge? (auto-grade)
76
+ - Does the taste actually predict downstream engagement? (close the loop)
77
+
78
+ ### The code
79
+
80
+ ```ts
81
+ import { analyzeRuns, fromFeedbackTable } from '@tangle-network/agent-eval/contract'
82
+
83
+ // 1. Parse Obsidian #approved / #rejected tags into a flat table:
84
+ const ratings = parseObsidianVault('./research-vault')
85
+ // [{ runId: 'claim-1', rater: 'alice', rating: true }, ...]
86
+
87
+ // 2. Pipe through the adapter:
88
+ const { runs, raterScores } = fromFeedbackTable({ ratings })
89
+
90
+ // 3. Analyze:
91
+ const report = await analyzeRuns({
92
+ runs,
93
+ raterScores,
94
+ // Optional: close the loop with engagement data once you have it.
95
+ outcomeSignal: { metric: 'engagement_rate', valueByRunId: enrichedFromProd },
96
+ })
97
+
98
+ // report.interRater.disagreementCases → top 20 claims worth a meeting
99
+ // report.outcomeCorrelation → does team taste predict engagement?
100
+ // report.recommendations → action list
101
+ ```
102
+
103
+ ### What the report shows
104
+
105
+ ```
106
+ Runs analyzed: 30
107
+ Composite mean: 0.756 (approve rate ~76%)
108
+
109
+ ── Inter-rater agreement ──
110
+ Raters: 3 (alice, bob, carol)
111
+ Jointly rated runs: 30
112
+ Pairwise pearson κ:
113
+ alice::bob 0.53
114
+ alice::carol 0.55
115
+ bob::carol 0.21
116
+ Mean κ: 0.43
117
+
118
+ ── Top 5 disagreement cases ──
119
+ claim-1 range=1.00 ratings: alice=0, bob=0, carol=1
120
+ claim-7 range=1.00 ratings: alice=0, bob=1, carol=0
121
+ ...
122
+
123
+ ── Recommendations ──
124
+ [high] recalibrate — Inter-rater agreement κ=0.43 is below 0.5
125
+ Raters disagree on what 'good' looks like. Refine the rubric or triage the disagreement cases.
126
+ ```
127
+
128
+ ### Next steps for this customer
129
+
130
+ 1. **Triage meeting on the disagreement cases.** Mean κ=0.43 means the rubric is ambiguous; clarify it on the cases that split.
131
+ 2. **Calibrate one LLM judge per reviewer.** Each reviewer's history is the gold signal — substrate primitive `calibrateJudge` against `raterScores` filtered to that reviewer.
132
+ 3. **Add engagement as `outcomeSignal`** once the content downstream is instrumented. The `outcomeCorrelation` section tells the team whether their taste predicts the founder's token-max goal — and if not, the linear reward model says how to retarget.
133
+ 4. **Graduate to journey #3** — wrap the research-generation Claude-P call as a `Dispatch`, use the calibrated judges, run `selfImprove()` nightly. Open a PR against the GitHub Action when the holdout approval rate beats baseline.
134
+
135
+ **Runnable:** [`examples/customer-feedback-loop/`](../examples/customer-feedback-loop/)
136
+
137
+ ---
138
+
139
+ ## 3. Closed-loop journey — `selfimprove-quickstart`
140
+
141
+ **The customer:** a team with a scenario corpus, a judge, and an agent. Wants to improve the prompt under statistical confidence — propose better candidates, gate on holdout lift, ship the winner.
142
+
143
+ **The frustration:** "We can run an A/B by hand but we don't know if the improvement is real. We don't have time to run paired bootstrap by hand. We want a function that decides."
144
+
145
+ **What they need from agent-eval:** the closed loop in one function — propose, score, gate, ship — with the full rigor packet on the way out.
146
+
147
+ ### The code
148
+
149
+ ```ts
150
+ import { selfImprove } from '@tangle-network/agent-eval/contract'
151
+
152
+ const result = await selfImprove({
153
+ scenarios,
154
+ agent: async (surface, scenario) =>
155
+ await myAgent.run({ systemPrompt: (surface as { systemPrompt: string }).systemPrompt, scenario }),
156
+ judge: {
157
+ name: 'rubric',
158
+ dimensions: [{ key: 'clarity', weight: 1 }, { key: 'concision', weight: 1 }],
159
+ score: async ({ artifact }) => myJudgeFn(artifact),
160
+ },
161
+ baselineSurface: { kind: 'prompt', systemPrompt: 'You write marketing copy...' },
162
+ budget: { generations: 3, populationSize: 2 },
163
+ })
164
+
165
+ result.gateDecision // 'ship' | 'hold' | ...
166
+ result.insight // full decision packet
167
+ ```
168
+
169
+ ### What the report shows
170
+
171
+ ```
172
+ ═══ selfImprove() decision packet ═══
173
+
174
+ Gate decision: ship
175
+ Raw lift: +0.194
176
+
177
+ ── Statistical lift (paired bootstrap) ──
178
+ delta: +0.254
179
+ CI95: [0.254, 0.254]
180
+ pValue: 1.0000
181
+ Cohen's d: 0.00
182
+ MDE @ 80% power: 2.802
183
+ required n at observed effect: 244
184
+
185
+ ── Recommendations ──
186
+ [critical] ship — Ship — lift 0.254 (95% CI 0.254..0.254)
187
+ ```
188
+
189
+ ### Next steps for this customer
190
+
191
+ 1. **Ship the winner.** Either accept `result.winner.surface` programmatically and roll it out, or pass `autoOnPromote: 'pr'` + a GitHub repo to have selfImprove open a PR for you.
192
+ 2. **Wire `hostedTenant`** to ship the decision packet to a dashboard (the hosted Intelligence orchestrator, or your own implementation of the wire spec).
193
+ 3. **Add `canaryScenarios`** to guard against the holdout leaking into the candidate prompt.
194
+ 4. **Add `outcomeSignal`** in `analyzeRuns()` for any post-deploy reruns to verify the predicted lift actually shows up in real outcomes.
195
+
196
+ **Runnable:** [`examples/selfimprove-quickstart/`](../examples/selfimprove-quickstart/)
197
+
198
+ ---
199
+
200
+ ## How the three journeys compose
201
+
202
+ Journey #1 + #2 + #3 are **maturity stages**, not exclusive products. A team typically:
203
+
204
+ 1. Starts with **#1** (analyze production logs) to find what's broken.
205
+ 2. Adds **#2** (feedback corpus) once they have a sense of where to improve, to calibrate what "good" means.
206
+ 3. Graduates to **#3** (closed loop) once they have scenarios + judges, to automate the improvement.
207
+
208
+ Same substrate, same `InsightReport` shape, no rip-and-replace between stages. The data you collect in #1 informs the scenarios you derive in #2 which feed the loop in #3.
@@ -0,0 +1,337 @@
1
+ # `InsightReport` — the decision packet
2
+
3
+ The single shape every analysis call returns. `selfImprove()` embeds it in `SelfImproveResult.insight`; `analyzeRuns()` returns it directly. The hosted-tier wire format carries it on `EvalRunEvent.insightReport?`.
4
+
5
+ Every section is **opt-in based on what your data supports** — the function never invents signal. If your runs don't carry judge scores, `judges` is empty. If there's no baseline/candidate split, `lift` is undefined. The shape is consistent; population is honest.
6
+
7
+ This page walks every section with a real (synthetic) example and explains how to act on it.
8
+
9
+ ---
10
+
11
+ ## At a glance
12
+
13
+ ```ts
14
+ interface InsightReport {
15
+ n: number // runs analyzed
16
+ composite: ScalarDistribution // always
17
+ perDimension: Record<string, ScalarDistribution> // when judgeScores carry dimensions
18
+ costQuality: { cost: ScalarDistribution; pareto: ParetoFigureSpec } // always
19
+ judges: Record<string, JudgeInsight> // when runs carry judge scores
20
+ interRater?: InterRaterInsight // when raterScores supplied
21
+ lift?: LiftInsight // when baseline + candidate present
22
+ failureClusters?: FailureClusterInsight // when AnalystRegistry wired
23
+ contamination?: ContaminationInsight // when canaryScenarios supplied
24
+ outcomeCorrelation?: OutcomeCorrelationInsight // when outcomeSignal supplied
25
+ release: ReleaseSummary // always
26
+ recommendations: Recommendation[] // always — read this FIRST
27
+ }
28
+ ```
29
+
30
+ ---
31
+
32
+ ## `n` + `composite` + `perDimension` — distributional summary
33
+
34
+ Always present. The basic "where are my numbers" view.
35
+
36
+ ```jsonc
37
+ {
38
+ "n": 30,
39
+ "composite": {
40
+ "n": 30,
41
+ "mean": 0.683, "p50": 0.667, "p95": 1.000, "stddev": 0.231,
42
+ "min": 0.0, "max": 1.0,
43
+ "histogram": [
44
+ { "lo": 0.0, "hi": 0.083, "count": 5 },
45
+ { "lo": 0.083, "hi": 0.167, "count": 0 },
46
+ // ...12 bins by default
47
+ ]
48
+ },
49
+ "perDimension": {
50
+ "clarity": { "mean": 0.72, "p50": 0.75, "p95": 0.95, "stddev": 0.18, /* ... */ },
51
+ "concision": { "mean": 0.65, "p50": 0.68, "p95": 0.88, "stddev": 0.21, /* ... */ }
52
+ }
53
+ }
54
+ ```
55
+
56
+ **Read first:** the `composite.mean`. If it's < 0.5, your agent has a ceiling problem, not a tuning problem.
57
+
58
+ **Read next:** `perDimension`. If `clarity` is high but `concision` is low, your prompts get the right ideas in too many words — different fix than "wrong ideas."
59
+
60
+ **Use the histogram for:** finding bimodal failure modes. A bin with `count > 0` near zero and another > 0 near 1 means your agent has two distinct behaviors, not one noisy one.
61
+
62
+ ---
63
+
64
+ ## `costQuality` — cost-vs-quality Pareto
65
+
66
+ Always present. `cost.histogram` is the per-run cost distribution; `pareto` is the substrate's `ParetoFigureSpec`.
67
+
68
+ ```jsonc
69
+ {
70
+ "costQuality": {
71
+ "cost": {
72
+ "mean": 0.024, "p95": 0.041,
73
+ "histogram": [/* */]
74
+ },
75
+ "pareto": {
76
+ "kind": "pareto-cost-quality",
77
+ "split": "holdout",
78
+ "axes": { "x": "costUsd", "y": "score" },
79
+ "points": [
80
+ { "candidateId": "baseline", "cost": 0.018, "quality": 0.58, "n": 20, "onFrontier": true },
81
+ { "candidateId": "winner", "cost": 0.027, "quality": 0.65, "n": 20, "onFrontier": true }
82
+ ]
83
+ }
84
+ }
85
+ }
86
+ ```
87
+
88
+ **Use this when:** comparing prompts, models, or candidate surfaces. The Pareto frontier is your menu of "best you can do at each cost level."
89
+
90
+ **Render with:** any chart library — `points` is plain JSON. Hosted-tier dashboards render this as a scatter with the frontier highlighted.
91
+
92
+ ---
93
+
94
+ ## `judges` — per-judge mean
95
+
96
+ Populated when run records carry `outcome.judgeScores`.
97
+
98
+ ```jsonc
99
+ {
100
+ "judges": {
101
+ "domain-expert": { "n": 30, "meanScore": 0.71 },
102
+ "helpfulness-llm": { "n": 30, "meanScore": 0.62 }
103
+ }
104
+ }
105
+ ```
106
+
107
+ The substrate's full judge-calibration suite (positional bias, self-preference, verbosity bias) lives in `/reporting` and operates on **paired-by-condition** inputs that `analyzeRuns` doesn't synthesize from raw `RunRecord[]`. Wire them yourself when you have the paired data; the report's `judges` map is the corpus-level slice.
108
+
109
+ **Use this when:** comparing multiple judges over the same corpus. A big gap between two judges' means is the first signal that one of them is mis-calibrated.
110
+
111
+ ---
112
+
113
+ ## `interRater` — multi-rater agreement + disagreement triage
114
+
115
+ Populated when `analyzeRuns({ raterScores })` is supplied — typically via `fromFeedbackTable()`.
116
+
117
+ ```jsonc
118
+ {
119
+ "interRater": {
120
+ "raters": 3,
121
+ "jointlyRated": 30,
122
+ "kappa": 0.71,
123
+ "perPair": {
124
+ "alice::bob": 0.78,
125
+ "alice::carol": 0.65,
126
+ "bob::carol": 0.69
127
+ },
128
+ "disagreementCases": [
129
+ { "runId": "claim-7", "range": 1.00,
130
+ "ratings": [{"rater":"alice","score":1},{"rater":"bob","score":1},{"rater":"carol","score":0}] },
131
+ { "runId": "claim-13", "range": 1.00,
132
+ "ratings": [{"rater":"alice","score":0},{"rater":"bob","score":0},{"rater":"carol","score":1}] }
133
+ // ...top 20 by range
134
+ ]
135
+ }
136
+ }
137
+ ```
138
+
139
+ **Read first:** the mean `kappa`. < 0.5 means raters disagree on what "good" looks like — surface the disagreement cases at the next review meeting.
140
+
141
+ **Use this when:** building per-rater LLM judges. Each rater's individual scores are the gold signal you calibrate against. Once a calibrated LLM matches the human ≥85%, you can auto-grade and escalate only the disagreement cases.
142
+
143
+ ---
144
+
145
+ ## `lift` — paired-bootstrap statistical lift
146
+
147
+ Populated when baseline + candidate candidates are present (auto-detected from two distinct `candidateId`s, or explicit via `baselineCandidateId` + `candidateCandidateId`).
148
+
149
+ ```jsonc
150
+ {
151
+ "lift": {
152
+ "baselineMean": 0.58,
153
+ "candidateMean": 0.65,
154
+ "delta": 0.07,
155
+ "ci95": [0.04, 0.10], // bootstrap CI on the delta
156
+ "pValue": 0.0008, // paired t-test
157
+ "n": 40, // paired observations
158
+ "cohensD": 0.41,
159
+ "mde": 0.06, // min detectable effect at current n, 80% power
160
+ "requiredN": 38 // n needed for observed delta at 80% power
161
+ }
162
+ }
163
+ ```
164
+
165
+ **Decision rule:**
166
+ - `ci95[0] > threshold` → **SHIP.** Lower bound above your delta threshold means the lift is real at 95% confidence.
167
+ - `ci95[0] ≤ threshold < ci95[1]` → **INCONCLUSIVE.** Expand the corpus or wait for more data.
168
+ - `ci95[1] ≤ threshold` → **HOLD.** No evidence the candidate is better.
169
+
170
+ The `recommendations` array surfaces exactly this decision (`kind: 'ship' | 'hold' | 'expand-corpus'`) — that's what consumers should read.
171
+
172
+ **Why bootstrap, not t-test alone:** paired bootstrap is distribution-free. Your judge scores are bounded in [0,1] and almost never normal; the bootstrap CI is the honest one.
173
+
174
+ ---
175
+
176
+ ## `failureClusters` — grouped failure modes
177
+
178
+ Populated when an `AnalystRegistry` is passed via `analyzeRuns({ analyst })`. The substrate runs each failed run through the registered analysts and groups findings by `analyst_id` / `area`.
179
+
180
+ ```jsonc
181
+ {
182
+ "failureClusters": {
183
+ "totalFailures": 11,
184
+ "clusters": [
185
+ { "id": "off-topic-drift", "name": "off-topic-drift",
186
+ "share": 0.45, "exemplars": ["run-12", "run-19", "run-33"] },
187
+ { "id": "over-confidence", "name": "over-confidence",
188
+ "share": 0.27, "exemplars": ["run-3", "run-21"] },
189
+ { "id": "format-mismatch", "name": "format-mismatch",
190
+ "share": 0.18, "exemplars": ["run-41", "run-44"] }
191
+ ]
192
+ }
193
+ }
194
+ ```
195
+
196
+ **Read first:** the top cluster's `share`. If one cluster is > 40% of failures, fix that pattern before doing anything else.
197
+
198
+ **Use this when:** triaging a regression. Failure clusters tell you "fix this kind of thing first."
199
+
200
+ **To wire it:** register analysts in `AnalystRegistry`. See `src/analyst/registry.ts` and `src/analyst/kinds.ts` for the four built-in kinds (`failure-mode`, `improvement`, `knowledge-gap`, `knowledge-poisoning`).
201
+
202
+ ---
203
+
204
+ ## `contamination` — canary check
205
+
206
+ Populated when canary scenarios are passed via `analyzeRuns({ canaryScenarios })`. Each canary carries a sentinel string the agent should never emit; the report counts leaks.
207
+
208
+ ```jsonc
209
+ {
210
+ "contamination": {
211
+ "leaks": 0,
212
+ "holdoutAuditPassed": true,
213
+ "details": []
214
+ }
215
+ }
216
+ ```
217
+
218
+ When `leaks > 0`:
219
+
220
+ ```jsonc
221
+ {
222
+ "contamination": {
223
+ "leaks": 2,
224
+ "holdoutAuditPassed": false,
225
+ "details": [
226
+ { "runId": "run-12", "canary": "xyz-secret-canary-123", "matched": "...the secret xyz-secret-canary-123 says..." }
227
+ ]
228
+ }
229
+ }
230
+ ```
231
+
232
+ **When this fails:** your holdout corpus has leaked into training context. The `lift` number is **unreliable**. Investigate before shipping anything.
233
+
234
+ ---
235
+
236
+ ## `outcomeCorrelation` — closing the loop on real outcomes
237
+
238
+ Populated when `outcomeSignal: { metric, valueByRunId }` is supplied.
239
+
240
+ ```jsonc
241
+ {
242
+ "outcomeCorrelation": {
243
+ "metric": "engagement_rate",
244
+ "n": 80,
245
+ "pearson": 0.72, // linear correlation
246
+ "spearman": 0.69, // rank correlation (robust to monotonic nonlinearity)
247
+ "rewardModel": {
248
+ "intercept": 0.04,
249
+ "slope": 1.93,
250
+ "r2": 0.52 // share of outcome variance the judge explains
251
+ }
252
+ }
253
+ }
254
+ ```
255
+
256
+ This is the layer that says **"does my judge's taste actually predict the metric the business cares about?"**
257
+
258
+ **Read first:** `spearman`. If it's < 0.3 in absolute value, your judges are scoring something different from what wins downstream. Refit the judges (use the customer's downstream signal as gold) or change the rubric.
259
+
260
+ **The reward model** is the simple linear `y = intercept + slope * composite`. Use it to:
261
+ - Predict the engagement of a new run from its composite score alone.
262
+ - Set a `composite` threshold for "must beat X to ship" based on the engagement equivalent.
263
+
264
+ ---
265
+
266
+ ## `release` — pass/warn/fail axes
267
+
268
+ Always present. Roll-up across three axes — quality lift, contamination, composite distribution.
269
+
270
+ ```jsonc
271
+ {
272
+ "release": {
273
+ "status": "pass",
274
+ "axes": [
275
+ { "name": "quality-lift", "status": "pass",
276
+ "detail": "delta=0.070, CI95=[0.040, 0.100], n=40" },
277
+ { "name": "contamination", "status": "pass",
278
+ "detail": "0 canary leak(s)" },
279
+ { "name": "composite-distribution", "status": "pass",
280
+ "detail": "mean=0.683, p50=0.667, p95=1.000 over n=30" }
281
+ ],
282
+ "issues": []
283
+ }
284
+ }
285
+ ```
286
+
287
+ Overall `status` is `fail` if any axis fails; `warn` if any warn; `pass` otherwise.
288
+
289
+ **Use this when:** wiring agent-eval into CI. A `status === 'pass'` from `analyzeRuns` on the candidate vs baseline is your green-light gate.
290
+
291
+ ---
292
+
293
+ ## `recommendations` — the actionable layer
294
+
295
+ Always present. Read this first.
296
+
297
+ ```jsonc
298
+ {
299
+ "recommendations": [
300
+ { "priority": "critical", "kind": "ship",
301
+ "title": "Ship — lift 0.070 (95% CI 0.040..0.100)",
302
+ "detail": "Holdout lift exceeds threshold 0.02 with 95% bootstrap confidence (n=40, p=0.0008, d=0.41).",
303
+ "evidencePath": "lift" },
304
+ { "priority": "high", "kind": "investigate",
305
+ "title": "Top failure cluster: off-topic-drift (45% of failures)",
306
+ "detail": "11 runs failed. The largest cluster groups 3 exemplars under 'off-topic-drift'.",
307
+ "evidencePath": "failureClusters.clusters[0]" }
308
+ ]
309
+ }
310
+ ```
311
+
312
+ | `kind` | When emitted |
313
+ |---|---|
314
+ | `ship` | lift CI lower bound > threshold |
315
+ | `hold` | lift CI upper bound ≤ threshold |
316
+ | `expand-corpus` | lift CI straddles threshold — more data needed |
317
+ | `fix` | canary contamination detected |
318
+ | `recalibrate` | inter-rater κ < 0.5, OR outcome correlation < 0.3 |
319
+ | `investigate` | top failure cluster > some-share |
320
+
321
+ `evidencePath` points back into the report (`"lift"`, `"contamination"`, `"failureClusters.clusters[0]"`) so a UI can deep-link from each recommendation to its evidence.
322
+
323
+ ---
324
+
325
+ ## How `analyzeRuns` populates each section
326
+
327
+ | Section | Required input |
328
+ |---|---|
329
+ | `composite`, `perDimension`, `costQuality`, `release`, `recommendations` | `runs` |
330
+ | `judges` | `runs` with `outcome.judgeScores` |
331
+ | `interRater` | `raterScores` (≥ 2 raters jointly rated some runs) |
332
+ | `lift` | two distinct `candidateId`s in `runs` (or explicit baseline/candidate ids) |
333
+ | `failureClusters` | `analyst` registry passed in |
334
+ | `contamination` | `canaryScenarios` passed in |
335
+ | `outcomeCorrelation` | `outcomeSignal` passed in |
336
+
337
+ All sections beyond the always-present ones are `T | undefined`, never empty objects. If a section is missing, your inputs didn't support it — the report is honest about that.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@tangle-network/agent-eval",
3
- "version": "0.50.0",
3
+ "version": "0.50.2",
4
4
  "description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
5
5
  "homepage": "https://github.com/tangle-network/agent-eval#readme",
6
6
  "repository": {