@tangle-network/agent-eval 0.50.0 → 0.50.1
- package/CHANGELOG.md +135 -0
- package/README.md +235 -331
- package/dist/openapi.json +1 -1
- package/docs/concepts.md +20 -0
- package/docs/customer-journeys.md +208 -0
- package/docs/insight-report.md +337 -0
- package/package.json +1 -1
package/docs/concepts.md
CHANGED
|
@@ -9,6 +9,26 @@ connected, or the answer lacks required sources. The package gives products a
|
|
|
9
9
|
shared way to record runs, check outcomes, classify failures, compare variants,
|
|
10
10
|
and make release decisions.
|
|
11
11
|
|
|
12
|
+
## The three top-level functions
|
|
13
|
+
|
|
14
|
+
Everything funnels through `/contract`. Three entries, one shape coming back:
|
|
15
|
+
|
|
16
|
+
| Function | When to call it | What you give it | What you get back |
|---|---|---|---|
| **`selfImprove()`** | You have a closed loop (scenarios, judge, agent in hand) and you want the substrate to propose better candidates + gate them. | scenarios, agent, judge, baseline surface | `SelfImproveResult.insight: InsightReport` + ship/hold verdict + winner surface |
| **`analyzeRuns()`** | You have observed runs (production traces, an approve/reject corpus, a CSV gold set) and want the same rigor packet without invoking an agent. | `RunRecord[]` + optional flags | `InsightReport` |
| **Intake adapters** (`fromFeedbackTable`, `fromOtelSpans`) | Your data isn't already in `RunRecord` shape: it's in Obsidian, Sheets, an OTel collector, etc. | source-specific input | `RunRecord[]` ready to pipe into `analyzeRuns()` |
|
|
21
|
+
|
|
22
|
+
The three customer maturity stages (logs only → ratings → closed loop) map onto these entries: `fromOtelSpans` feeding `analyzeRuns()`, `fromFeedbackTable` feeding `analyzeRuns()`, and `selfImprove()`. See [`customer-journeys.md`](./customer-journeys.md) for the runnable walkthroughs.
|
|
23
|
+
|
|
24
|
+
The shape of the answer, `InsightReport`, is identical across all three paths. It carries a distributional summary, paired-bootstrap lift CI, judge stats, inter-rater agreement, cost-quality Pareto, failure clusters, contamination check, outcome correlation, release axes, and a ranked recommendations array. Each section is walked through in [`insight-report.md`](./insight-report.md).
|
|
25
|
+
|
|
26
|
+
## The layering rule
|
|
27
|
+
|
|
28
|
+
`agent-eval` is the **substrate** at the bottom of the Tangle agent stack. `agent-runtime` and `agent-knowledge` depend on it; `agent-eval` MUST NOT import from either. Primitives that "feel like" they belong in a consumer but are actually substrate-shaped (validator verdicts, run records, scenarios, judge scores) live here. Primitives that genuinely require a running agent loop (`ValidationCtx` with iteration + signal + traceEmitter, sandbox `AgentRunSpec`) stay in `agent-runtime`.
|
|
29
|
+
|
|
30
|
+
The test: *does this concept make sense WITHOUT a running agent loop?* If yes, it's substrate. If no, it's runtime. The full rule is in [`/CLAUDE.md`](../CLAUDE.md#repo-layering--this-package-is-the-substrate).
|
|
31
|
+
|
|
12
32
|
## Main Objects
|
|
13
33
|
|
|
14
34
|
| Thing | What it is | One-line example |
|
|
@@ -0,0 +1,208 @@
|
|
|
1
|
+
# Customer journeys
|
|
2
|
+
|
|
3
|
+
Three end-to-end journeys covering the surface of `@tangle-network/agent-eval`. Each one is a runnable example under `examples/`: clone the repo and run `pnpm tsx examples/<journey>/index.ts` to see the actual output.
|
|
4
|
+
|
|
5
|
+
The three journeys map to three customer-maturity stages:
|
|
6
|
+
|
|
7
|
+
1. **Logs but no eval discipline** → [Production traces journey](#1-production-traces-journey-customer-otel-traces)
|
|
8
|
+
2. **Ratings but no closed loop** → [Feedback corpus journey](#2-feedback-corpus-journey-customer-feedback-loop)
|
|
9
|
+
3. **Scenarios, judge, agent — full closed loop** → [Closed-loop journey](#3-closed-loop-journey-selfimprove-quickstart)
|
|
10
|
+
|
|
11
|
+
Each section: what the customer has, what they want, the code, what the report looks like.
|
|
12
|
+
|
|
13
|
+
---
|
|
14
|
+
|
|
15
|
+
## 1. Production traces journey — `customer-otel-traces`
|
|
16
|
+
|
|
17
|
+
**The customer:** an agentic GTM-as-a-service company. Multiple agent steps in prod (social media posting, image generation, translation). OTel observability piped to their collector. Doesn't run formal evals. CTO hand-rolled their tracing.
|
|
18
|
+
|
|
19
|
+
**The frustration:** "Which step is unreliable? What's our cost-quality profile? Where do we fix next?" They have the data; they don't have the answer.
|
|
20
|
+
|
|
21
|
+
**What they need from agent-eval:** day-1 analysis of their existing logs. No scenarios, no judges, no closed loop. Just turn the trace stream into a decision packet.
|
|
22
|
+
|
|
23
|
+
### The code
|
|
24
|
+
|
|
25
|
+
```ts
import { analyzeRuns, fromOtelSpans } from '@tangle-network/agent-eval/contract'

// `yourOtelStream` is a placeholder for the spans exported by your collector.
const runs = fromOtelSpans({ spans: yourOtelStream })
const report = await analyzeRuns({ runs })

// report.failureClusters    → root causes (populated once an `AnalystRegistry` is wired)
// report.costQuality.pareto → cost-vs-quality scatter
// report.composite          → distribution
// report.recommendations    → top-3 actions
```
|
|
36
|
+
|
|
37
|
+
### What the report shows
|
|
38
|
+
|
|
39
|
+
```
Runs analyzed: 40
Composite mean: 0.721  (p50: 0.717, p95: 0.925, stddev: 0.210)
Cost mean: $0.103 (p95: $0.131)

── Failures ──
6 runs with status=ERROR or failureMode set:
  tool.search (3x)
  agent.turn (3x)

── Cost-quality Pareto ──
1 candidate(s) plotted; 1 on the frontier
  otel-default: cost=$0.103 quality=0.721 (frontier)

── Recommendations ──
[medium] expand-corpus — Mean composite 0.721 has room
```
|
|
56
|
+
|
|
57
|
+
### Next steps for this customer
|
|
58
|
+
|
|
59
|
+
1. Wire an `AnalystRegistry` to cluster the 6 failures by root cause via LLM analysis.
|
|
60
|
+
2. Add `outcomeSignal` once they have downstream conversion / engagement / post-engagement data, and the report fits a reward model showing whether their score predicts the customer outcome.
|
|
61
|
+
3. Once they identify a step worth optimizing (translation, say), graduate to journey #3 — wrap that step in a `Dispatch` and call `selfImprove()`.
|
|
62
|
+
|
|
63
|
+
**Runnable:** [`examples/customer-otel-traces/`](../examples/customer-otel-traces/)
|
|
64
|
+
|
|
65
|
+
---
|
|
66
|
+
|
|
67
|
+
## 2. Feedback corpus journey — `customer-feedback-loop`
|
|
68
|
+
|
|
69
|
+
**The customer:** a research-validation team. A GitHub Action fires `claude -p` against the next claim, writes the research output to Obsidian. Three reviewers (Alice, Bob, Carol) tag results `#approved` or `#rejected`. Outputs feed a knowledge base. Knowledge feeds content. Content feeds engagement. The founder wants more engagement faster.
|
|
70
|
+
|
|
71
|
+
**The frustration:** "We disagree on what's good. We don't know if our 'good' actually drives engagement. Reviewing every claim is slow."
|
|
72
|
+
|
|
73
|
+
**What they need from agent-eval:** turn the approve/reject corpus into actionable signal:
|
|
74
|
+
- Where do reviewers disagree? (triage list)
|
|
75
|
+
- Can we synthesize each reviewer's taste into an LLM judge? (auto-grade)
|
|
76
|
+
- Does the taste actually predict downstream engagement? (close the loop)
|
|
77
|
+
|
|
78
|
+
### The code
|
|
79
|
+
|
|
80
|
+
```ts
import { analyzeRuns, fromFeedbackTable } from '@tangle-network/agent-eval/contract'

// 1. Parse Obsidian #approved / #rejected tags into a flat table.
//    `parseObsidianVault` is a helper you supply; it is not imported from the package here.
const ratings = parseObsidianVault('./research-vault')
// [{ runId: 'claim-1', rater: 'alice', rating: true }, ...]

// 2. Pipe through the adapter:
const { runs, raterScores } = fromFeedbackTable({ ratings })

// 3. Analyze:
const report = await analyzeRuns({
  runs,
  raterScores,
  // Optional: close the loop with engagement data once you have it.
  // `enrichedFromProd` is a placeholder for your per-run outcome values.
  outcomeSignal: { metric: 'engagement_rate', valueByRunId: enrichedFromProd },
})

// report.interRater.disagreementCases → top 20 claims worth a meeting
// report.outcomeCorrelation           → does team taste predict engagement?
// report.recommendations              → action list
```
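
The adapter input in step 1 is a flat list of `{ runId, rater, rating }` rows. Below is a minimal sketch of producing those rows from one note's text. It assumes, hypothetically, that each reviewer leaves a line such as `alice: #approved`; that line format and the name `parseNoteRatings` are illustrative and not part of the package.

```typescript
// Row shape taken from the comment in the journey code above.
interface RatingRow {
  runId: string
  rater: string
  rating: boolean
}

// Hypothetical helper: assumes one "<name>: #approved" or "<name>: #rejected" line per reviewer.
function parseNoteRatings(runId: string, noteText: string): RatingRow[] {
  const rows: RatingRow[] = []
  const pattern = /^\s*([A-Za-z][\w-]*)\s*:\s*#(approved|rejected)\b/gm
  let match: RegExpExecArray | null
  while ((match = pattern.exec(noteText)) !== null) {
    rows.push({ runId, rater: match[1].toLowerCase(), rating: match[2] === 'approved' })
  }
  return rows
}

// Lines without the "<name>: #tag" shape are ignored.
const exampleNote = ['# Claim 1', 'Alice: #approved', 'bob: #rejected', 'Some free text #approved'].join('\n')
const exampleRows = parseNoteRatings('claim-1', exampleNote)
```

Rows from every note in the vault can be concatenated and handed to `fromFeedbackTable({ ratings })` as in the journey code.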
|
|
102
|
+
|
|
103
|
+
### What the report shows
|
|
104
|
+
|
|
105
|
+
```
Runs analyzed: 30
Composite mean: 0.756  (approve rate ~76%)

── Inter-rater agreement ──
Raters: 3 (alice, bob, carol)
Jointly rated runs: 30
Pairwise pearson κ:
  alice::bob    0.53
  alice::carol  0.55
  bob::carol    0.21
Mean κ: 0.43

── Top 5 disagreement cases ──
  claim-1  range=1.00  ratings: alice=0, bob=0, carol=1
  claim-7  range=1.00  ratings: alice=0, bob=1, carol=0
  ...

── Recommendations ──
[high] recalibrate — Inter-rater agreement κ=0.43 is below 0.5
  Raters disagree on what 'good' looks like. Refine the rubric or triage the disagreement cases.
```
|
|
127
|
+
|
|
128
|
+
### Next steps for this customer
|
|
129
|
+
|
|
130
|
+
1. **Triage meeting on the disagreement cases.** Mean κ=0.43 means the rubric is ambiguous; clarify it on the cases that split.
|
|
131
|
+
2. **Calibrate one LLM judge per reviewer.** Each reviewer's history is the gold signal — substrate primitive `calibrateJudge` against `raterScores` filtered to that reviewer.
|
|
132
|
+
3. **Add engagement as `outcomeSignal`** once the content downstream is instrumented. The `outcomeCorrelation` section tells the team whether their taste predicts the founder's engagement goal; if not, the linear reward model says how to retarget.
|
|
133
|
+
4. **Graduate to journey #3** — wrap the research-generation Claude-P call as a `Dispatch`, use the calibrated judges, run `selfImprove()` nightly. Open a PR against the GitHub Action when the holdout approval rate beats baseline.
|
|
134
|
+
|
|
135
|
+
**Runnable:** [`examples/customer-feedback-loop/`](../examples/customer-feedback-loop/)
|
|
136
|
+
|
|
137
|
+
---
|
|
138
|
+
|
|
139
|
+
## 3. Closed-loop journey — `selfimprove-quickstart`
|
|
140
|
+
|
|
141
|
+
**The customer:** a team with a scenario corpus, a judge, and an agent. Wants to improve the prompt under statistical confidence — propose better candidates, gate on holdout lift, ship the winner.
|
|
142
|
+
|
|
143
|
+
**The frustration:** "We can run an A/B by hand but we don't know if the improvement is real. We don't have time to run paired bootstrap by hand. We want a function that decides."
|
|
144
|
+
|
|
145
|
+
**What they need from agent-eval:** the closed loop in one function — propose, score, gate, ship — with the full rigor packet on the way out.
|
|
146
|
+
|
|
147
|
+
### The code
|
|
148
|
+
|
|
149
|
+
```ts
import { selfImprove } from '@tangle-network/agent-eval/contract'

// `scenarios`, `myAgent`, and `myJudgeFn` are placeholders for your own corpus, agent, and judge.
const result = await selfImprove({
  scenarios,
  agent: async (surface, scenario) =>
    await myAgent.run({ systemPrompt: (surface as { systemPrompt: string }).systemPrompt, scenario }),
  judge: {
    name: 'rubric',
    dimensions: [{ key: 'clarity', weight: 1 }, { key: 'concision', weight: 1 }],
    score: async ({ artifact }) => myJudgeFn(artifact),
  },
  baselineSurface: { kind: 'prompt', systemPrompt: 'You write marketing copy...' },
  budget: { generations: 3, populationSize: 2 },
})

result.gateDecision   // 'ship' | 'hold' | ...
result.insight        // full decision packet
```
|
|
168
|
+
|
|
169
|
+
### What the report shows
|
|
170
|
+
|
|
171
|
+
```
═══ selfImprove() decision packet ═══

Gate decision: ship
Raw lift:      +0.194

── Statistical lift (paired bootstrap) ──
  delta:    +0.254
  CI95:     [0.254, 0.254]
  pValue:   1.0000
  Cohen's d: 0.00
  MDE @ 80% power: 2.802
  required n at observed effect: 244

── Recommendations ──
[critical] ship — Ship — lift 0.254 (95% CI 0.254..0.254)
```
|
|
188
|
+
|
|
189
|
+
### Next steps for this customer
|
|
190
|
+
|
|
191
|
+
1. **Ship the winner.** Either accept `result.winner.surface` programmatically and roll it out, or pass `autoOnPromote: 'pr'` + a GitHub repo to have selfImprove open a PR for you.
|
|
192
|
+
2. **Wire `hostedTenant`** to ship the decision packet to a dashboard (the hosted Intelligence orchestrator, or your own implementation of the wire spec).
|
|
193
|
+
3. **Add `canaryScenarios`** to guard against the holdout leaking into the candidate prompt.
|
|
194
|
+
4. **Add `outcomeSignal`** in `analyzeRuns()` for any post-deploy reruns to verify the predicted lift actually shows up in real outcomes.
|
|
195
|
+
|
|
196
|
+
**Runnable:** [`examples/selfimprove-quickstart/`](../examples/selfimprove-quickstart/)
|
|
197
|
+
|
|
198
|
+
---
|
|
199
|
+
|
|
200
|
+
## How the three journeys compose
|
|
201
|
+
|
|
202
|
+
Journey #1 + #2 + #3 are **maturity stages**, not exclusive products. A team typically:
|
|
203
|
+
|
|
204
|
+
1. Starts with **#1** (analyze production logs) to find what's broken.
|
|
205
|
+
2. Adds **#2** (feedback corpus) once they have a sense of where to improve, to calibrate what "good" means.
|
|
206
|
+
3. Graduates to **#3** (closed loop) once they have scenarios + judges, to automate the improvement.
|
|
207
|
+
|
|
208
|
+
Same substrate, same `InsightReport` shape, no rip-and-replace between stages. The data you collect in #1 informs the scenarios you derive in #2 which feed the loop in #3.
|
|
@@ -0,0 +1,337 @@
|
|
|
1
|
+
# `InsightReport` — the decision packet
|
|
2
|
+
|
|
3
|
+
The single shape every analysis call returns. `selfImprove()` embeds it in `SelfImproveResult.insight`; `analyzeRuns()` returns it directly. The hosted-tier wire format carries it on `EvalRunEvent.insightReport?`.
|
|
4
|
+
|
|
5
|
+
Every section is **opt-in based on what your data supports** — the function never invents signal. If your runs don't carry judge scores, `judges` is empty. If there's no baseline/candidate split, `lift` is undefined. The shape is consistent; population is honest.
|
|
6
|
+
|
|
7
|
+
This page walks through every section with a realistic (synthetic) example and explains how to act on it.
|
|
8
|
+
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
## At a glance
|
|
12
|
+
|
|
13
|
+
```ts
interface InsightReport {
  n: number                                                            // runs analyzed
  composite: ScalarDistribution                                        // always
  perDimension: Record<string, ScalarDistribution>                     // when judgeScores carry dimensions
  costQuality: { cost: ScalarDistribution; pareto: ParetoFigureSpec }  // always
  judges: Record<string, JudgeInsight>                                 // when runs carry judge scores
  interRater?: InterRaterInsight                                       // when raterScores supplied
  lift?: LiftInsight                                                   // when baseline + candidate present
  failureClusters?: FailureClusterInsight                              // when AnalystRegistry wired
  contamination?: ContaminationInsight                                 // when canaryScenarios supplied
  outcomeCorrelation?: OutcomeCorrelationInsight                       // when outcomeSignal supplied
  release: ReleaseSummary                                              // always
  recommendations: Recommendation[]                                    // always; read this FIRST
}
```
|
|
29
|
+
|
|
30
|
+
---
|
|
31
|
+
|
|
32
|
+
## `n` + `composite` + `perDimension` — distributional summary
|
|
33
|
+
|
|
34
|
+
Always present. The basic "where are my numbers" view.
|
|
35
|
+
|
|
36
|
+
```jsonc
{
  "n": 30,
  "composite": {
    "n": 30,
    "mean": 0.683, "p50": 0.667, "p95": 1.000, "stddev": 0.231,
    "min": 0.0, "max": 1.0,
    "histogram": [
      { "lo": 0.0,   "hi": 0.083, "count": 5 },
      { "lo": 0.083, "hi": 0.167, "count": 0 },
      // ...12 bins by default
    ]
  },
  "perDimension": {
    "clarity":   { "mean": 0.72, "p50": 0.75, "p95": 0.95, "stddev": 0.18, /* ... */ },
    "concision": { "mean": 0.65, "p50": 0.68, "p95": 0.88, "stddev": 0.21, /* ... */ }
  }
}
```
|
|
55
|
+
|
|
56
|
+
**Read first:** the `composite.mean`. If it's < 0.5, your agent has a ceiling problem, not a tuning problem.
|
|
57
|
+
|
|
58
|
+
**Read next:** `perDimension`. If `clarity` is high but `concision` is low, your prompts get the right ideas in too many words — different fix than "wrong ideas."
|
|
59
|
+
|
|
60
|
+
**Use the histogram for:** finding bimodal failure modes. Populated bins near 0 and near 1 with little mass in between mean your agent has two distinct behaviors, not one noisy one.
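
One way to make that histogram check mechanical is a small heuristic over the `{ lo, hi, count }` bins shown above. This is a sketch under stated assumptions: the function names, the 20% edge width, and the 15% tail share are illustrative choices, not package defaults.

```typescript
interface HistogramBin {
  lo: number
  hi: number
  count: number
}

// Share of runs whose bin midpoint falls in the bottom / top `edge` fraction of the score range.
function tailShares(bins: HistogramBin[], edge = 0.2): { low: number; high: number; middle: number } {
  const total = bins.reduce((sum, b) => sum + b.count, 0)
  if (total === 0) return { low: 0, high: 0, middle: 0 }
  const min = Math.min(...bins.map((b) => b.lo))
  const max = Math.max(...bins.map((b) => b.hi))
  const span = max - min
  let low = 0
  let high = 0
  for (const b of bins) {
    const mid = (b.lo + b.hi) / 2
    if (mid <= min + edge * span) low += b.count
    else if (mid >= max - edge * span) high += b.count
  }
  return { low: low / total, high: high / total, middle: (total - low - high) / total }
}

// Heuristic: both tails are populated and together outweigh the middle.
function looksBimodal(bins: HistogramBin[], minTailShare = 0.15): boolean {
  const { low, high, middle } = tailShares(bins)
  return low >= minTailShare && high >= minTailShare && middle < low + high
}

// Helper to build 12 equal-width bins over [0, 1] from a list of counts.
function binsFromCounts(counts: number[]): HistogramBin[] {
  return counts.map((count, i) => ({ lo: i / counts.length, hi: (i + 1) / counts.length, count }))
}
```

A flagged histogram is a prompt to look at the low-scoring runs as their own group rather than averaging them into the mean.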
|
|
61
|
+
|
|
62
|
+
---
|
|
63
|
+
|
|
64
|
+
## `costQuality` — cost-vs-quality Pareto
|
|
65
|
+
|
|
66
|
+
Always present. `cost.histogram` is the per-run cost distribution; `pareto` is the substrate's `ParetoFigureSpec`.
|
|
67
|
+
|
|
68
|
+
```jsonc
{
  "costQuality": {
    "cost": {
      "mean": 0.024, "p95": 0.041,
      "histogram": [/* */]
    },
    "pareto": {
      "kind": "pareto-cost-quality",
      "split": "holdout",
      "axes": { "x": "costUsd", "y": "score" },
      "points": [
        { "candidateId": "baseline", "cost": 0.018, "quality": 0.58, "n": 20, "onFrontier": true },
        { "candidateId": "winner",   "cost": 0.027, "quality": 0.65, "n": 20, "onFrontier": true }
      ]
    }
  }
}
```
|
|
87
|
+
|
|
88
|
+
**Use this when:** comparing prompts, models, or candidate surfaces. The Pareto frontier is your menu of "best you can do at each cost level."
|
|
89
|
+
|
|
90
|
+
**Render with:** any chart library — `points` is plain JSON. Hosted-tier dashboards render this as a scatter with the frontier highlighted.
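
For readers who want to see what "on the frontier" means for `points`, here is a minimal sketch of the standard dominance test (lower cost is better, higher quality is better). The `markFrontier` name is an assumption for illustration; the report already carries `onFrontier`, so this only shows the idea.

```typescript
interface ParetoPoint {
  candidateId: string
  cost: number
  quality: number
}

// A point is dominated when another point is at least as cheap and at least as good,
// and strictly better on at least one of the two axes.
function markFrontier(points: ParetoPoint[]): Array<ParetoPoint & { onFrontier: boolean }> {
  return points.map((p) => ({
    ...p,
    onFrontier: !points.some(
      (q) => q !== p && q.cost <= p.cost && q.quality >= p.quality && (q.cost < p.cost || q.quality > p.quality),
    ),
  }))
}

// The first two points mirror the example above; "pricey" is a made-up dominated candidate.
const marked = markFrontier([
  { candidateId: 'baseline', cost: 0.018, quality: 0.58 },
  { candidateId: 'winner', cost: 0.027, quality: 0.65 },
  { candidateId: 'pricey', cost: 0.03, quality: 0.6 },
])
```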
|
|
91
|
+
|
|
92
|
+
---
|
|
93
|
+
|
|
94
|
+
## `judges` — per-judge mean
|
|
95
|
+
|
|
96
|
+
Populated when run records carry `outcome.judgeScores`.
|
|
97
|
+
|
|
98
|
+
```jsonc
{
  "judges": {
    "domain-expert":   { "n": 30, "meanScore": 0.71 },
    "helpfulness-llm": { "n": 30, "meanScore": 0.62 }
  }
}
```
|
|
106
|
+
|
|
107
|
+
The substrate's full judge-calibration suite (positional bias, self-preference, verbosity bias) lives in `/reporting` and operates on **paired-by-condition** inputs that `analyzeRuns` doesn't synthesize from raw `RunRecord[]`. Wire them yourself when you have the paired data; the report's `judges` map is the corpus-level slice.
|
|
108
|
+
|
|
109
|
+
**Use this when:** comparing multiple judges over the same corpus. A big gap between two judges' means is the first signal that one of them is mis-calibrated.
|
|
110
|
+
|
|
111
|
+
---
|
|
112
|
+
|
|
113
|
+
## `interRater` — multi-rater agreement + disagreement triage
|
|
114
|
+
|
|
115
|
+
Populated when `analyzeRuns({ raterScores })` is supplied — typically via `fromFeedbackTable()`.
|
|
116
|
+
|
|
117
|
+
```jsonc
{
  "interRater": {
    "raters": 3,
    "jointlyRated": 30,
    "kappa": 0.71,
    "perPair": {
      "alice::bob": 0.78,
      "alice::carol": 0.65,
      "bob::carol": 0.69
    },
    "disagreementCases": [
      { "runId": "claim-7", "range": 1.00,
        "ratings": [{"rater":"alice","score":1},{"rater":"bob","score":1},{"rater":"carol","score":0}] },
      { "runId": "claim-13", "range": 1.00,
        "ratings": [{"rater":"alice","score":0},{"rater":"bob","score":0},{"rater":"carol","score":1}] }
      // ...top 20 by range
    ]
  }
}
```
|
|
138
|
+
|
|
139
|
+
**Read first:** the mean `kappa`. < 0.5 means raters disagree on what "good" looks like — surface the disagreement cases at the next review meeting.
|
|
140
|
+
|
|
141
|
+
**Use this when:** building per-rater LLM judges. Each rater's individual scores are the gold signal you calibrate against. Once a calibrated LLM matches the human ≥85%, you can auto-grade and escalate only the disagreement cases.
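
As a worked illustration of pairwise agreement on binary approve/reject labels, the sketch below computes textbook Cohen's κ and the raw match rate. Which statistic the package computes internally is not stated on this page, so treat this as a reference calculation; `cohensKappa` and `agreementRate` are illustrative names.

```typescript
// Fraction of items on which two raters gave the same label.
function agreementRate(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) throw new Error('need equal-length, non-empty ratings')
  let agree = 0
  for (let i = 0; i < a.length; i++) if (a[i] === b[i]) agree++
  return agree / a.length
}

// Cohen's kappa for binary labels (1 = approve, 0 = reject):
// kappa = (observed agreement - chance agreement) / (1 - chance agreement).
function cohensKappa(a: number[], b: number[]): number {
  const po = agreementRate(a, b)
  const n = a.length
  const aYes = a.filter((x) => x === 1).length / n
  const bYes = b.filter((x) => x === 1).length / n
  const pe = aYes * bYes + (1 - aYes) * (1 - bYes)
  return pe === 1 ? 1 : (po - pe) / (1 - pe)
}

// Made-up ratings for ten claims.
const aliceLabels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
const bobLabels = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
const kappaAliceBob = cohensKappa(aliceLabels, bobLabels)
const matchAliceBob = agreementRate(aliceLabels, bobLabels)
```

The same `agreementRate` can compare a calibrated LLM judge against one reviewer's labels when checking the ≥85% match bar mentioned above.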
|
|
142
|
+
|
|
143
|
+
---
|
|
144
|
+
|
|
145
|
+
## `lift` — paired-bootstrap statistical lift
|
|
146
|
+
|
|
147
|
+
Populated when both baseline and candidate runs are present (auto-detected from two distinct `candidateId`s, or set explicitly via `baselineCandidateId` + `candidateCandidateId`).
|
|
148
|
+
|
|
149
|
+
```jsonc
{
  "lift": {
    "baselineMean": 0.58,
    "candidateMean": 0.65,
    "delta": 0.07,
    "ci95": [0.04, 0.10],     // bootstrap CI on the delta
    "pValue": 0.0008,         // paired t-test
    "n": 40,                  // paired observations
    "cohensD": 0.41,
    "mde": 0.06,              // min detectable effect at current n, 80% power
    "requiredN": 38           // n needed for observed delta at 80% power
  }
}
```
|
|
164
|
+
|
|
165
|
+
**Decision rule:**
|
|
166
|
+
- `ci95[0] > threshold` → **SHIP.** Lower bound above your delta threshold means the lift is real at 95% confidence.
|
|
167
|
+
- `ci95[0] ≤ threshold < ci95[1]` → **INCONCLUSIVE.** Expand the corpus or wait for more data.
|
|
168
|
+
- `ci95[1] ≤ threshold` → **HOLD.** No evidence the candidate is better.
|
|
169
|
+
|
|
170
|
+
The `recommendations` array surfaces exactly this decision (`kind: 'ship' | 'hold' | 'expand-corpus'`) — that's what consumers should read.
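
The three-way rule reads directly as code. This is a sketch of the rule as written in the bullets, not the package's implementation; `decideFromCi` is an illustrative name, and the 0.02 threshold in the usage lines is borrowed from the recommendation example later on this page.

```typescript
type LiftDecision = 'ship' | 'hold' | 'expand-corpus'

// ci95 is [lower, upper] for the lift delta; threshold is the minimum lift worth shipping.
function decideFromCi(ci95: [number, number], threshold: number): LiftDecision {
  const [lower, upper] = ci95
  if (lower > threshold) return 'ship' // whole interval clears the bar
  if (upper <= threshold) return 'hold' // whole interval is at or below the bar
  return 'expand-corpus' // interval straddles the bar: inconclusive
}

const shipCase = decideFromCi([0.04, 0.1], 0.02)
const straddleCase = decideFromCi([-0.01, 0.05], 0.02)
const holdCase = decideFromCi([-0.05, 0.01], 0.02)
```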
|
|
171
|
+
|
|
172
|
+
**Why bootstrap, not t-test alone:** paired bootstrap is distribution-free. Your judge scores are bounded in [0,1] and almost never normal; the bootstrap CI is the honest one.
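
To make the paired-bootstrap idea concrete, here is a minimal percentile-bootstrap sketch over per-scenario score differences. The resample count, the seeded generator, and the function names are assumptions for illustration; the package's own resampling scheme and defaults are not described on this page.

```typescript
// Small seeded PRNG (mulberry32) so the sketch gives repeatable results.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// baseline[i] and candidate[i] must be scores for the same scenario (that is the "paired" part).
function pairedBootstrapCi(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 1,
): { delta: number; ci95: [number, number] } {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error('need equal-length, non-empty paired scores')
  }
  const diffs = candidate.map((c, i) => c - baseline[i])
  const n = diffs.length
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let r = 0; r < resamples; r++) {
    // Resample the paired differences with replacement and record the mean.
    let sum = 0
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)]
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)
  const delta = diffs.reduce((s, d) => s + d, 0) / n
  const lower = means[Math.floor(0.025 * (resamples - 1))]
  const upper = means[Math.ceil(0.975 * (resamples - 1))]
  return { delta, ci95: [lower, upper] }
}
```

No normality assumption enters anywhere: the interval comes only from the observed differences, which is the property the paragraph above relies on.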
|
|
173
|
+
|
|
174
|
+
---
|
|
175
|
+
|
|
176
|
+
## `failureClusters` — grouped failure modes
|
|
177
|
+
|
|
178
|
+
Populated when an `AnalystRegistry` is passed via `analyzeRuns({ analyst })`. The substrate runs each failed run through the registered analysts and groups findings by `analyst_id` / `area`.
|
|
179
|
+
|
|
180
|
+
```jsonc
{
  "failureClusters": {
    "totalFailures": 11,
    "clusters": [
      { "id": "off-topic-drift", "name": "off-topic-drift",
        "share": 0.45, "exemplars": ["run-12", "run-19", "run-33"] },
      { "id": "over-confidence", "name": "over-confidence",
        "share": 0.27, "exemplars": ["run-3", "run-21"] },
      { "id": "format-mismatch", "name": "format-mismatch",
        "share": 0.18, "exemplars": ["run-41", "run-44"] }
    ]
  }
}
```
|
|
195
|
+
|
|
196
|
+
**Read first:** the top cluster's `share`. If one cluster is > 40% of failures, fix that pattern before doing anything else.
|
|
197
|
+
|
|
198
|
+
**Use this when:** triaging a regression. Failure clusters tell you "fix this kind of thing first."
|
|
199
|
+
|
|
200
|
+
**To wire it:** register analysts in `AnalystRegistry`. See `src/analyst/registry.ts` and `src/analyst/kinds.ts` for the four built-in kinds (`failure-mode`, `improvement`, `knowledge-gap`, `knowledge-poisoning`).
|
|
201
|
+
|
|
202
|
+
---
|
|
203
|
+
|
|
204
|
+
## `contamination` — canary check
|
|
205
|
+
|
|
206
|
+
Populated when canary scenarios are passed via `analyzeRuns({ canaryScenarios })`. Each canary carries a sentinel string the agent should never emit; the report counts leaks.
|
|
207
|
+
|
|
208
|
+
```jsonc
{
  "contamination": {
    "leaks": 0,
    "holdoutAuditPassed": true,
    "details": []
  }
}
```
|
|
217
|
+
|
|
218
|
+
When `leaks > 0`:
|
|
219
|
+
|
|
220
|
+
```jsonc
{
  "contamination": {
    "leaks": 2,
    "holdoutAuditPassed": false,
    "details": [
      { "runId": "run-12", "canary": "xyz-secret-canary-123", "matched": "...the secret xyz-secret-canary-123 says..." }
    ]
  }
}
```
|
|
231
|
+
|
|
232
|
+
**When this fails:** your holdout corpus has leaked into training context. The `lift` number is **unreliable**. Investigate before shipping anything.
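
The canary idea can be sketched as a plain substring scan over run outputs. The output fields mirror the JSON above; the input shape (`runId` plus output `text`), the context width, and the name `findCanaryLeaks` are assumptions made for this illustration.

```typescript
interface CanaryLeak {
  runId: string
  canary: string
  matched: string
}

interface ContaminationSummary {
  leaks: number
  holdoutAuditPassed: boolean
  details: CanaryLeak[]
}

// Scans each output for each sentinel string and keeps a short excerpt around every hit.
function findCanaryLeaks(
  outputs: Array<{ runId: string; text: string }>,
  canaries: string[],
  context = 12,
): ContaminationSummary {
  const details: CanaryLeak[] = []
  for (const { runId, text } of outputs) {
    for (const canary of canaries) {
      const at = text.indexOf(canary)
      if (at === -1) continue
      const start = Math.max(0, at - context)
      const end = Math.min(text.length, at + canary.length + context)
      details.push({ runId, canary, matched: text.slice(start, end) })
    }
  }
  return { leaks: details.length, holdoutAuditPassed: details.length === 0, details }
}

const leakCheck = findCanaryLeaks(
  [
    { runId: 'run-12', text: 'the secret xyz-secret-canary-123 says hi' },
    { runId: 'run-13', text: 'a clean answer' },
  ],
  ['xyz-secret-canary-123'],
)
```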
|
|
233
|
+
|
|
234
|
+
---
|
|
235
|
+
|
|
236
|
+
## `outcomeCorrelation` — closing the loop on real outcomes
|
|
237
|
+
|
|
238
|
+
Populated when `outcomeSignal: { metric, valueByRunId }` is supplied.
|
|
239
|
+
|
|
240
|
+
```jsonc
{
  "outcomeCorrelation": {
    "metric": "engagement_rate",
    "n": 80,
    "pearson": 0.72,          // linear correlation
    "spearman": 0.69,         // rank correlation (robust to monotonic nonlinearity)
    "rewardModel": {
      "intercept": 0.04,
      "slope": 1.93,
      "r2": 0.52              // share of outcome variance the judge explains
    }
  }
}
```
|
|
255
|
+
|
|
256
|
+
This is the layer that says **"does my judge's taste actually predict the metric the business cares about?"**
|
|
257
|
+
|
|
258
|
+
**Read first:** `spearman`. If it's < 0.3 in absolute value, your judges are scoring something different from what wins downstream. Refit the judges (use the customer's downstream signal as gold) or change the rubric.
|
|
259
|
+
|
|
260
|
+
**The reward model** is the simple linear `y = intercept + slope * composite`. Use it to:
|
|
261
|
+
- Predict the engagement of a new run from its composite score alone.
|
|
262
|
+
- Set a `composite` threshold for "must beat X to ship" based on the engagement equivalent.
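
Both uses follow from the line `y = intercept + slope * composite`. The sketch below fits that line by ordinary least squares and inverts it to turn an outcome target into a composite threshold. The function names are illustrative, and whether the package fits the model exactly this way is an assumption.

```typescript
interface RewardModel {
  intercept: number
  slope: number
  r2: number
}

// Ordinary least squares fit of outcome on composite score.
function fitRewardModel(composite: number[], outcome: number[]): RewardModel {
  const n = composite.length
  if (n < 2 || n !== outcome.length) throw new Error('need at least two paired observations')
  const meanX = composite.reduce((s, x) => s + x, 0) / n
  const meanY = outcome.reduce((s, y) => s + y, 0) / n
  let sxx = 0
  let sxy = 0
  let syy = 0
  for (let i = 0; i < n; i++) {
    const dx = composite[i] - meanX
    const dy = outcome[i] - meanY
    sxx += dx * dx
    sxy += dx * dy
    syy += dy * dy
  }
  if (sxx === 0) throw new Error('composite scores are constant; slope is undefined')
  const slope = sxy / sxx
  // With a constant outcome there is no variance to explain, so report 0.
  const r2 = syy === 0 ? 0 : (sxy * sxy) / (sxx * syy)
  return { intercept: meanY - slope * meanX, slope, r2 }
}

// Use 1: predict the outcome metric from a composite score.
function predictOutcome(model: RewardModel, composite: number): number {
  return model.intercept + model.slope * composite
}

// Use 2: composite score needed to reach a target outcome (requires a non-zero slope).
function compositeForOutcome(model: RewardModel, target: number): number {
  if (model.slope === 0) throw new Error('flat model cannot be inverted')
  return (target - model.intercept) / model.slope
}
```

With a low `r2`, both helpers still return numbers, so check the correlation section before trusting a threshold derived this way.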
|
|
263
|
+
|
|
264
|
+
---
|
|
265
|
+
|
|
266
|
+
## `release` — pass/warn/fail axes
|
|
267
|
+
|
|
268
|
+
Always present. Roll-up across three axes — quality lift, contamination, composite distribution.
|
|
269
|
+
|
|
270
|
+
```jsonc
{
  "release": {
    "status": "pass",
    "axes": [
      { "name": "quality-lift", "status": "pass",
        "detail": "delta=0.070, CI95=[0.040, 0.100], n=40" },
      { "name": "contamination", "status": "pass",
        "detail": "0 canary leak(s)" },
      { "name": "composite-distribution", "status": "pass",
        "detail": "mean=0.683, p50=0.667, p95=1.000 over n=30" }
    ],
    "issues": []
  }
}
```
|
|
286
|
+
|
|
287
|
+
Overall `status` is `fail` if any axis fails; `warn` if any warn; `pass` otherwise.
|
|
288
|
+
|
|
289
|
+
**Use this when:** wiring agent-eval into CI. A `status === 'pass'` from `analyzeRuns` on the candidate vs baseline is your green-light gate.
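
The roll-up rule and a CI gate built on it can be sketched in a few lines. The axis shape follows the JSON above; `rollUp`, `exitCodeFor`, and the policy of treating `warn` as a non-zero exit are assumptions you may want to change for your pipeline.

```typescript
type AxisStatus = 'pass' | 'warn' | 'fail'

interface ReleaseAxis {
  name: string
  status: AxisStatus
  detail: string
}

// fail beats warn, warn beats pass.
function rollUp(axes: ReleaseAxis[]): AxisStatus {
  if (axes.some((a) => a.status === 'fail')) return 'fail'
  if (axes.some((a) => a.status === 'warn')) return 'warn'
  return 'pass'
}

// Hypothetical CI policy: only a clean pass lets the job succeed.
function exitCodeFor(status: AxisStatus): number {
  return status === 'pass' ? 0 : 1
}

const exampleAxes: ReleaseAxis[] = [
  { name: 'quality-lift', status: 'pass', detail: 'delta=0.070, CI95=[0.040, 0.100], n=40' },
  { name: 'contamination', status: 'warn', detail: 'illustrative warning' },
  { name: 'composite-distribution', status: 'pass', detail: 'mean=0.683' },
]
const overall = rollUp(exampleAxes)
```

In a CI script the returned code would typically be assigned to the process exit code after logging each axis `detail`.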
|
|
290
|
+
|
|
291
|
+
---
|
|
292
|
+
|
|
293
|
+
## `recommendations` — the actionable layer
|
|
294
|
+
|
|
295
|
+
Always present. Read this first.
|
|
296
|
+
|
|
297
|
+
```jsonc
{
  "recommendations": [
    { "priority": "critical", "kind": "ship",
      "title": "Ship — lift 0.070 (95% CI 0.040..0.100)",
      "detail": "Holdout lift exceeds threshold 0.02 with 95% bootstrap confidence (n=40, p=0.0008, d=0.41).",
      "evidencePath": "lift" },
    { "priority": "high", "kind": "investigate",
      "title": "Top failure cluster: off-topic-drift (45% of failures)",
      "detail": "11 runs failed. The largest cluster groups 3 exemplars under 'off-topic-drift'.",
      "evidencePath": "failureClusters.clusters[0]" }
  ]
}
```
|
|
311
|
+
|
|
312
|
+
| `kind` | When emitted |
|---|---|
| `ship` | lift CI lower bound > threshold |
| `hold` | lift CI upper bound ≤ threshold |
| `expand-corpus` | lift CI straddles threshold; more data needed |
| `fix` | canary contamination detected |
| `recalibrate` | inter-rater κ < 0.5, OR outcome correlation < 0.3 |
| `investigate` | top failure cluster > some-share |
|
|
320
|
+
|
|
321
|
+
`evidencePath` points back into the report (`"lift"`, `"contamination"`, `"failureClusters.clusters[0]"`) so a UI can deep-link from each recommendation to its evidence.
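
A UI can resolve such a path with a small dotted-path walker. This is a sketch assuming paths only use `.` and `[index]` segments like the three examples above; `resolveEvidence` is an illustrative name, not a package export.

```typescript
// Walks a path such as "failureClusters.clusters[0]" through a plain object.
// Returns undefined when any segment is missing.
function resolveEvidence(report: unknown, evidencePath: string): unknown {
  const tokens = evidencePath.match(/[^.[\]]+/g) ?? []
  let current: unknown = report
  for (const token of tokens) {
    if (current === null || typeof current !== 'object') return undefined
    current = (current as Record<string, unknown>)[token]
  }
  return current
}

// Trimmed-down report with only the fields needed for the example.
const sampleReport = {
  lift: { delta: 0.07 },
  failureClusters: { clusters: [{ id: 'off-topic-drift', share: 0.45 }] },
}
const topCluster = resolveEvidence(sampleReport, 'failureClusters.clusters[0]')
const missingSection = resolveEvidence(sampleReport, 'contamination')
```

A missing section resolves to `undefined`, which matches how the report omits sections the inputs did not support.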
|
|
322
|
+
|
|
323
|
+
---
|
|
324
|
+
|
|
325
|
+
## How `analyzeRuns` populates each section
|
|
326
|
+
|
|
327
|
+
| Section | Required input |
|---|---|
| `composite`, `perDimension`, `costQuality`, `release`, `recommendations` | `runs` |
| `judges` | `runs` with `outcome.judgeScores` |
| `interRater` | `raterScores` (≥ 2 raters jointly rated some runs) |
| `lift` | two distinct `candidateId`s in `runs` (or explicit baseline/candidate ids) |
| `failureClusters` | `analyst` registry passed in |
| `contamination` | `canaryScenarios` passed in |
| `outcomeCorrelation` | `outcomeSignal` passed in |
|
|
336
|
+
|
|
337
|
+
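The optional sections (`interRater`, `lift`, `failureClusters`, `contamination`, `outcomeCorrelation`) are `T | undefined`, never empty objects; `judges` and `perDimension` are always present but can be empty maps. If a section is missing, your inputs didn't support it, and the report is honest about that.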
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@tangle-network/agent-eval",
|
|
3
|
-
"version": "0.50.
|
|
3
|
+
"version": "0.50.1",
|
|
4
4
|
"description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
|
|
5
5
|
"homepage": "https://github.com/tangle-network/agent-eval#readme",
|
|
6
6
|
"repository": {
|