@tangle-network/agent-eval 0.49.0 → 0.50.1
- package/CHANGELOG.md +135 -0
- package/README.md +235 -331
- package/dist/adapters/http.d.ts +1 -1
- package/dist/adapters/langchain.d.ts +1 -1
- package/dist/adapters/otel.d.ts +8 -2
- package/dist/campaign/index.d.ts +3 -3
- package/dist/{chunk-PD3MH6WU.js → chunk-5KSDYBYH.js} +2 -2
- package/dist/{chunk-MNL6LXGQ.js → chunk-EGIPWXHL.js} +2 -98
- package/dist/chunk-EGIPWXHL.js.map +1 -0
- package/dist/{chunk-OYI6RZJK.js → chunk-FQK2CCIM.js} +1 -1
- package/dist/chunk-FQK2CCIM.js.map +1 -0
- package/dist/chunk-MAZ26DC7.js +99 -0
- package/dist/chunk-MAZ26DC7.js.map +1 -0
- package/dist/chunk-SHTXZ4O2.js +113 -0
- package/dist/chunk-SHTXZ4O2.js.map +1 -0
- package/dist/{chunk-KQ26DYTQ.js → chunk-UBQGWD3O.js} +2 -2
- package/dist/contract/index.d.ts +206 -9
- package/dist/contract/index.js +751 -3
- package/dist/contract/index.js.map +1 -1
- package/dist/governance/index.d.ts +1 -1
- package/dist/hosted/index.d.ts +8 -192
- package/dist/hosted/index.js +1 -1
- package/dist/index-BRxz6qov.d.ts +409 -0
- package/dist/index.d.ts +18 -462
- package/dist/index.js +14 -106
- package/dist/index.js.map +1 -1
- package/dist/meta-eval/index.d.ts +3 -3
- package/dist/openapi.json +1 -1
- package/dist/{outcome-store-BxJ3DQKJ.d.ts → outcome-store-D6KWmYvj.d.ts} +1 -1
- package/dist/registry-8KAs18kY.d.ts +457 -0
- package/dist/{release-report-DBB8lB1P.d.ts → release-report-DSu0DWy8.d.ts} +3 -296
- package/dist/reporting.d.ts +6 -4
- package/dist/reporting.js +6 -4
- package/dist/{researcher-CHMO56K0.d.ts → researcher-LZD0qHEa.d.ts} +1 -1
- package/dist/rl.d.ts +9 -8
- package/dist/rl.js +3 -2
- package/dist/rl.js.map +1 -1
- package/dist/{rubric-predictive-validity-CJ08tGwq.d.ts → rubric-predictive-validity-ByZEC3BX.d.ts} +1 -1
- package/dist/{run-improvement-loop-B-L8GgpW.d.ts → run-improvement-loop-BPMjNKMJ.d.ts} +2 -2
- package/dist/sequential-5iSVfzl2.d.ts +139 -0
- package/dist/store-CJbzDxZ2.d.ts +220 -0
- package/dist/{sequential-CbFH___X.d.ts → summary-report-B7gNRX-r.d.ts} +1 -139
- package/dist/traces.d.ts +3 -220
- package/dist/{types-8u72Gc76.d.ts → types-Dbj5gu8n.d.ts} +1 -1
- package/dist/types-DhqpAi_z.d.ts +296 -0
- package/docs/concepts.md +20 -0
- package/docs/customer-journeys.md +208 -0
- package/docs/insight-report.md +337 -0
- package/package.json +1 -1
- package/dist/chunk-MNL6LXGQ.js.map +0 -1
- package/dist/chunk-OYI6RZJK.js.map +0 -1
- /package/dist/{chunk-PD3MH6WU.js.map → chunk-5KSDYBYH.js.map} +0 -0
- /package/dist/{chunk-KQ26DYTQ.js.map → chunk-UBQGWD3O.js.map} +0 -0
package/README.md
CHANGED
|
@@ -1,400 +1,304 @@
|
|
|
1
|
-
#
|
|
2
|
-
|
|
3
|
-
**Substrate for self-improving agents.** Trace what runs, verify the result,
|
|
4
|
-
turn outcomes into preferences and rewards, mutate prompts and policies under
|
|
5
|
-
anytime-valid evidence, and ship only when the improvement is decisive.
|
|
6
|
-
|
|
7
|
-
```txt
|
|
8
|
-
real product task
|
|
9
|
-
-> observe / act (your runtime)
|
|
10
|
-
-> trace + verifier pipeline (capture integrity)
|
|
11
|
-
-> RunRecord (canonical eval artifact)
|
|
12
|
-
-> judge calibration · paired stats · sequential α
|
|
13
|
-
-> preferences · verifiable rewards · process rewards
|
|
14
|
-
-> GEPA / reflective mutation · auto-research · active curriculum
|
|
15
|
-
-> release gate · replay · contamination probe · tournament rating
|
|
16
|
-
-> next iteration
|
|
17
|
-
```
|
|
1
|
+
# `@tangle-network/agent-eval`
|
|
18
2
|
|
|
19
|
-
|
|
20
|
-
routing, browser drivers, sandbox policy, or deployment. Products own those.
|
|
21
|
-
This package owns the loop that closes evaluation → preference → mutation →
|
|
22
|
-
redeploy, with capture integrity and statistically rigorous evidence at every
|
|
23
|
-
step.
|
|
3
|
+
**Ship better agent prompts with statistical confidence.** One function call returns a decision packet: lift CI, judge calibration, contamination check, failure clusters, cost-quality Pareto, and a ranked action list. Same shape whether you've got a closed improvement loop or just production logs.
|
|
24
4
|
|
|
25
|
-
|
|
26
|
-
|
|
5
|
+
[](https://www.npmjs.com/package/@tangle-network/agent-eval)
|
|
6
|
+
[](https://pypi.org/project/agent-eval-rpc/)
|
|
7
|
+
[](https://github.com/tangle-network/agent-eval/actions/workflows/ci.yml)
|
|
8
|
+
[](./LICENSE)
|
|
27
9
|
|
|
28
|
-
|
|
10
|
+
> TypeScript is first-class; the Python client (`agent-eval-rpc`) speaks the same wire protocol. MIT-licensed, self-hostable, and hosted-tier-friendly, with no SaaS dependency.
|
|
29
11
|
|
|
30
|
-
|
|
31
|
-
pnpm add @tangle-network/agent-eval
|
|
32
|
-
# or, from Python:
|
|
33
|
-
pip install agent-eval-rpc
|
|
34
|
-
```
|
|
12
|
+
---
|
|
35
13
|
|
|
36
|
-
##
|
|
14
|
+
## Table of contents
|
|
37
15
|
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
16
|
+
- [What you get back](#what-you-get-back-the-decision-packet)
|
|
17
|
+
- [Quick start](#quick-start)
|
|
18
|
+
- [Closed loop — `selfImprove()`](#closed-loop--selfimprove)
|
|
19
|
+
- [Observed runs — `analyzeRuns()`](#observed-runs--analyzeruns)
|
|
20
|
+
- [Existing data — intake adapters](#existing-data--intake-adapters)
|
|
21
|
+
- [How it compares](#how-it-compares)
|
|
22
|
+
- [Customer journeys](#customer-journeys)
|
|
23
|
+
- [Subpath entry points](#subpath-entry-points)
|
|
24
|
+
- [Concepts + design](#concepts--design)
|
|
25
|
+
- [Hosted tier](#hosted-tier)
|
|
26
|
+
- [Install + run](#install--run)
|
|
27
|
+
- [Stability + versioning](#stability--versioning)
|
|
28
|
+
- [License](#license)
|
|
43
29
|
|
|
44
|
-
|
|
45
|
-
intent: task.prompt,
|
|
46
|
-
budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
|
|
30
|
+
---
|
|
47
31
|
|
|
48
|
-
|
|
49
|
-
return product.readState(task.id)
|
|
50
|
-
},
|
|
32
|
+
## What you get back: the decision packet
|
|
51
33
|
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
34
|
+
Whether you call `selfImprove()` (closed loop) or `analyzeRuns()` (observed runs), the report has the same shape. Here's a real one, abridged:
|
|
35
|
+
|
|
36
|
+
```jsonc
|
|
37
|
+
{
|
|
38
|
+
"n": 80, // runs analyzed
|
|
39
|
+
"composite": { // distributional summary
|
|
40
|
+
"mean": 0.62, "p50": 0.65, "p95": 0.88, "stddev": 0.17,
|
|
41
|
+
"histogram": [/* 12 bins */]
|
|
42
|
+
},
|
|
43
|
+
"lift": { // paired bootstrap
|
|
44
|
+
"baselineMean": 0.58, "candidateMean": 0.65,
|
|
45
|
+
"delta": 0.07,
|
|
46
|
+
"ci95": [0.04, 0.10], // 95% CI on the delta
|
|
47
|
+
"pValue": 0.0008, // paired-t
|
|
48
|
+
"cohensD": 0.41,
|
|
49
|
+
"n": 40,
|
|
50
|
+
"mde": 0.06, // min detectable effect at 80% power
|
|
51
|
+
"requiredN": 38 // n needed to detect observed delta
|
|
52
|
+
},
|
|
53
|
+
"judges": { // per-judge calibration
|
|
54
|
+
"domain-expert": { "n": 80, "meanScore": 0.64 },
|
|
55
|
+
"helpfulness-llm": { "n": 80, "meanScore": 0.61 }
|
|
56
|
+
},
|
|
57
|
+
"interRater": { // multi-rater agreement
|
|
58
|
+
"raters": 3, "jointlyRated": 80, "kappa": 0.71,
|
|
59
|
+
"disagreementCases": [/* top 20 ranked by spread */]
|
|
60
|
+
},
|
|
61
|
+
"costQuality": { // cost-vs-quality
|
|
62
|
+
"cost": { "mean": 0.024, "p95": 0.041, /* ... */ },
|
|
63
|
+
"pareto": { /* ParetoFigureSpec the dashboard renders */ }
|
|
64
|
+
},
|
|
65
|
+
"failureClusters": { // when an AnalystRegistry is wired
|
|
66
|
+
"totalFailures": 11,
|
|
67
|
+
"clusters": [
|
|
68
|
+
{ "name": "off-topic-drift", "share": 0.45, "exemplars": ["run-12", "run-19"] },
|
|
69
|
+
{ "name": "over-confidence", "share": 0.27, "exemplars": ["run-3"] },
|
|
70
|
+
{ "name": "format-mismatch", "share": 0.18, "exemplars": ["run-41"] }
|
|
65
71
|
]
|
|
66
72
|
},
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
}
|
|
73
|
-
return {
|
|
74
|
-
type: 'continue',
|
|
75
|
-
action: { type: 'repair', failed: failed.map((e) => e.id) },
|
|
76
|
-
reason: 'repair failed gates',
|
|
77
|
-
}
|
|
73
|
+
"contamination": { "leaks": 0, "holdoutAuditPassed": true },
|
|
74
|
+
"outcomeCorrelation": { // when downstream metric supplied
|
|
75
|
+
"metric": "engagement_rate", "n": 80,
|
|
76
|
+
"pearson": 0.72, "spearman": 0.69,
|
|
77
|
+
"rewardModel": { "intercept": 0.04, "slope": 1.93, "r2": 0.52 }
|
|
78
78
|
},
|
|
79
|
-
|
|
80
|
-
|
|
81
|
-
|
|
79
|
+
"release": {
|
|
80
|
+
"status": "pass",
|
|
81
|
+
"axes": [
|
|
82
|
+
{ "name": "quality-lift", "status": "pass" },
|
|
83
|
+
{ "name": "contamination", "status": "pass" },
|
|
84
|
+
{ "name": "composite-distribution","status": "pass" }
|
|
85
|
+
]
|
|
82
86
|
},
|
|
83
|
-
|
|
84
|
-
|
|
85
|
-
|
|
87
|
+
"recommendations": [
|
|
88
|
+
{ "priority": "critical", "kind": "ship",
|
|
89
|
+
"title": "Ship — lift 0.070 (95% CI 0.040..0.100)",
|
|
90
|
+
"detail": "Holdout lift exceeds threshold 0.02 with 95% bootstrap confidence (n=40, p=0.0008, d=0.41)." },
|
|
91
|
+
{ "priority": "high", "kind": "investigate",
|
|
92
|
+
"title": "Top failure cluster: off-topic-drift (45% of failures)",
|
|
93
|
+
"detail": "11 runs failed. Drill into exemplars run-12 / run-19 to identify the pattern." }
|
|
94
|
+
]
|
|
95
|
+
}
|
|
86
96
|
```
|
|
87
97
|
|
|
88
|
-
|
|
89
|
-
dependencies behind `observe()` and `act()`, never the eval contract.
|
|
98
|
+
The `recommendations` array is the human-readable layer; everything above it is the evidence. Read the recommendations and act on them; the numbers are the proof.
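
To make the `lift` block concrete, here is a minimal sketch of a paired bootstrap on per-scenario score deltas, which is one way a 95% CI like `ci95` can be obtained. It is an illustration under assumptions, not the package's implementation: `pairedBootstrapCi`, `mulberry32`, and the sample scores are hypothetical names and data introduced for this example.

```ts
// Hypothetical helper: small seeded PRNG so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface LiftSummary {
  delta: number // mean of (candidate - baseline) over paired scenarios
  ci95: [number, number] // percentile bootstrap interval on the mean delta
  n: number // number of paired scenarios
}

// Hypothetical function: resample the paired differences with replacement
// and take the 2.5th / 97.5th percentiles of the resampled means.
function pairedBootstrapCi(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 42,
): LiftSummary {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error('baseline and candidate must be non-empty and paired')
  }
  const diffs = baseline.map((b, i) => candidate[i] - b)
  const n = diffs.length
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let r = 0; r < resamples; r++) {
    let sum = 0
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)]
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)
  const lo = means[Math.floor(0.025 * (resamples - 1))]
  const hi = means[Math.ceil(0.975 * (resamples - 1))]
  const delta = diffs.reduce((s, d) => s + d, 0) / n
  return { delta, ci95: [lo, hi], n }
}

// Made-up scores for eight scenarios, each run under both prompts.
const baseline = [0.52, 0.61, 0.58, 0.49, 0.66, 0.55, 0.6, 0.57]
const candidate = [0.6, 0.66, 0.63, 0.58, 0.7, 0.61, 0.69, 0.62]
const lift = pairedBootstrapCi(baseline, candidate)
console.log(lift)
```

Pairing matters here: resampling the per-scenario differences, rather than the two arms independently, removes between-scenario variance from the interval.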
|
|
90
99
|
|
|
91
|
-
|
|
100
|
+
---
|
|
92
101
|
|
|
93
|
-
|
|
94
|
-
becomes today's incident. The production agents that win are the ones that
|
|
95
|
-
**continuously re-train against live failure modes**.
|
|
102
|
+
## Quick start
|
|
96
103
|
|
|
97
|
-
|
|
98
|
-
substrate into a self-improvement cron:
|
|
104
|
+
### Closed loop — `selfImprove()`
|
|
99
105
|
|
|
100
|
-
|
|
101
|
-
import {
|
|
102
|
-
runProductionLoop,
|
|
103
|
-
httpGithubClient,
|
|
104
|
-
FileSystemFeedbackTrajectoryStore,
|
|
105
|
-
} from '@tangle-network/agent-eval'
|
|
106
|
-
import { FileSystemTraceStore } from '@tangle-network/agent-eval/traces'
|
|
107
|
-
|
|
108
|
-
const result = await runProductionLoop({
|
|
109
|
-
runId: `weekly-${new Date().toISOString().slice(0, 10)}`,
|
|
110
|
-
target: 'tax-agent',
|
|
111
|
-
|
|
112
|
-
// 1. Where production traces + feedback land. Wire the HTTP ingestion
|
|
113
|
-
// endpoints (POST /v1/traces/ingest, POST /v1/feedback) from your
|
|
114
|
-
// runtime; the same store reads them here.
|
|
115
|
-
traceStore: new FileSystemTraceStore({ dir: 'data/prod-traces' }),
|
|
116
|
-
feedbackStore: new FileSystemFeedbackTrajectoryStore({ dir: 'data/prod-feedback' }),
|
|
117
|
-
|
|
118
|
-
// 2. Cluster threshold: act on failure groups ≥ 20 runs or ≥ 5% of corpus.
|
|
119
|
-
cluster: { minClusterSize: 20, minSeverityRatio: 0.05, maxClustersPerCycle: 1 },
|
|
120
|
-
|
|
121
|
-
// 3. Evolve: seed = current prompt, gate against holdout scenarios.
|
|
122
|
-
evolve: {
|
|
123
|
-
baselinePrompt: currentSystemPrompt,
|
|
124
|
-
holdoutScenarios: productionShapeScenarios,
|
|
125
|
-
runner, // your agent driver
|
|
126
|
-
scorer, // calibrated judge or rubric
|
|
127
|
-
mutator, // GEPA-style or addendum-style mutator
|
|
128
|
-
gate: {
|
|
129
|
-
baselineKey: 'baseline',
|
|
130
|
-
minProductiveRuns: 5,
|
|
131
|
-
pairedDeltaThreshold: 0.03, // require Nσ improvement on holdout
|
|
132
|
-
overfitGapThreshold: 0.10,
|
|
133
|
-
},
|
|
134
|
-
},
|
|
135
|
-
|
|
136
|
-
// 4. Ship: when the gate passes, open a PR with the new prompt.
|
|
137
|
-
ship: {
|
|
138
|
-
client: httpGithubClient({ token: process.env.GITHUB_TOKEN! }),
|
|
139
|
-
repo: { owner: 'tangle-network', name: 'tax-agent' },
|
|
140
|
-
branchPrefix: 'eval/auto-improve',
|
|
141
|
-
promptFilePath: 'prompts/tax-agent-system.txt',
|
|
142
|
-
reviewers: ['drew'],
|
|
143
|
-
},
|
|
106
|
+
You have scenarios, a dispatch function, and judges, and you want the loop to propose better prompts and tell you which one to ship.
|
|
144
107
|
|
|
145
|
-
|
|
108
|
+
```ts
|
|
109
|
+
import { selfImprove } from '@tangle-network/agent-eval/contract'
|
|
110
|
+
|
|
111
|
+
const result = await selfImprove({
|
|
112
|
+
scenarios, // your scenario corpus
|
|
113
|
+
dispatch: async ({ scenario }) => // your agent — anything that returns an artifact
|
|
114
|
+
await myAgent.run(scenario),
|
|
115
|
+
judges: [myJudge], // any JudgeConfig — LLM, rule, ensemble
|
|
116
|
+
baselineSurface: { systemPrompt: currentPrompt },
|
|
146
117
|
})
|
|
147
118
|
|
|
148
|
-
|
|
149
|
-
|
|
119
|
+
result.gateDecision // 'ship' | 'hold' | 'need_more_work' | ...
|
|
120
|
+
result.lift // raw delta on holdout
|
|
121
|
+
result.insight // the full decision packet above
|
|
150
122
|
```
|
|
151
123
|
|
|
152
|
-
|
|
153
|
-
GitHub Actions. It is **idempotent + replayable**: same `runId` → same plan.
|
|
154
|
-
Gate failures are fail-closed — a candidate that beats baseline on search but
|
|
155
|
-
overfits on holdout never lands.
|
|
124
|
+
### Observed runs — `analyzeRuns()`
|
|
156
125
|
|
|
157
|
-
|
|
158
|
-
[`examples/production-loop`](./examples/production-loop/README.md).
|
|
126
|
+
You don't have a closed loop yet — you have observed runs (production traces, an approve/reject corpus, a CSV gold set). Same report shape, no agent invocation.
|
|
159
127
|
|
|
160
|
-
|
|
128
|
+
```ts
|
|
129
|
+
import { analyzeRuns } from '@tangle-network/agent-eval/contract'
|
|
130
|
+
|
|
131
|
+
const report = await analyzeRuns({
|
|
132
|
+
runs, // RunRecord[]
|
|
133
|
+
outcomeSignal: { // optional — closes the loop on real outcomes
|
|
134
|
+
metric: 'engagement_rate',
|
|
135
|
+
valueByRunId: enrichedFromProd,
|
|
136
|
+
},
|
|
137
|
+
canaryScenarios, // optional — contamination probe
|
|
138
|
+
analyst: myAnalystRegistry, // optional — AI-powered failure clustering
|
|
139
|
+
})
|
|
161
140
|
|
|
162
|
-
|
|
163
|
-
|
|
164
|
-
|
|
141
|
+
report.recommendations // ranked actions
|
|
142
|
+
report.failureClusters // grouped failure modes
|
|
143
|
+
report.outcomeCorrelation // judge↔outcome correlation + linear reward model
|
|
144
|
+
```
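
As a rough illustration of what `outcomeCorrelation` reports, the sketch below computes a Pearson correlation between judge scores and a downstream metric and fits a one-variable linear model. It assumes ordinary least squares of the outcome on the judge score; the source does not confirm that this is how the package fits its reward model, and `fitRewardModel` and the sample data are hypothetical.

```ts
interface OutcomeFit {
  pearson: number // linear correlation between judge score and outcome
  intercept: number // OLS intercept
  slope: number // OLS slope: change in outcome per unit of judge score
  r2: number // for a single predictor, r2 equals pearson squared
}

// Hypothetical function: simple linear regression of outcome on judge score.
function fitRewardModel(judgeScores: number[], outcomes: number[]): OutcomeFit {
  const n = judgeScores.length
  if (n !== outcomes.length || n < 2) {
    throw new Error('need two equal-length samples with n >= 2')
  }
  const mx = judgeScores.reduce((s, x) => s + x, 0) / n
  const my = outcomes.reduce((s, y) => s + y, 0) / n
  let sxy = 0
  let sxx = 0
  let syy = 0
  for (let i = 0; i < n; i++) {
    const dx = judgeScores[i] - mx
    const dy = outcomes[i] - my
    sxy += dx * dy
    sxx += dx * dx
    syy += dy * dy
  }
  if (sxx === 0 || syy === 0) {
    throw new Error('correlation is undefined when either sample is constant')
  }
  const pearson = sxy / Math.sqrt(sxx * syy)
  const slope = sxy / sxx
  const intercept = my - slope * mx
  return { pearson, intercept, slope, r2: pearson * pearson }
}

// Made-up data: five runs with a judge score and an observed engagement rate.
const judgeScores = [0.35, 0.5, 0.62, 0.71, 0.8]
const engagement = [0.02, 0.03, 0.05, 0.05, 0.07]
console.log(fitRewardModel(judgeScores, engagement))
```

A high correlation suggests the judge tracks the outcome you care about; a low one is a hint to recalibrate the judge before trusting its lift numbers.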
|
|
145
|
+
|
|
146
|
+
### Existing data — intake adapters
|
|
147
|
+
|
|
148
|
+
You have data already. Don't reshape it — pipe it through an adapter.
|
|
165
149
|
|
|
166
150
|
```ts
|
|
167
|
-
import { runEvalCampaign } from '@tangle-network/agent-eval'
|
|
168
151
|
import {
|
|
169
|
-
|
|
170
|
-
|
|
171
|
-
|
|
172
|
-
|
|
173
|
-
|
|
174
|
-
|
|
175
|
-
|
|
176
|
-
//
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
const
|
|
182
|
-
|
|
183
|
-
|
|
184
|
-
// 3. Estimate a candidate policy's value without re-running.
|
|
185
|
-
const ope = offPolicyEstimateAll(campaign.runs, candidatePolicy) // IPS + SNIPS + DR
|
|
186
|
-
|
|
187
|
-
// 4. Or close the loop end-to-end: score → reflect → mutate → re-run.
|
|
188
|
-
const next = await analyzeOptimizationResult(campaign, { researcher })
|
|
152
|
+
fromFeedbackTable,
|
|
153
|
+
fromOtelSpans,
|
|
154
|
+
analyzeRuns,
|
|
155
|
+
} from '@tangle-network/agent-eval/contract'
|
|
156
|
+
|
|
157
|
+
// Multi-rater approve/reject (Obsidian tags, Sheets, CSV, Postgres).
|
|
158
|
+
const { runs, raterScores } = fromFeedbackTable({
|
|
159
|
+
ratings: parseYourFeedbackTable(), // Array<{ runId, rater, rating }>
|
|
160
|
+
})
|
|
161
|
+
await analyzeRuns({ runs, raterScores })
|
|
162
|
+
|
|
163
|
+
// Production OTel traces — group by tangle.runId or traceId.
|
|
164
|
+
const runs2 = fromOtelSpans({ spans: yourOtelStream })
|
|
165
|
+
await analyzeRuns({ runs: runs2 })
|
|
189
166
|
```
|
|
190
167
|
|
|
191
|
-
|
|
192
|
-
| --- | --- | --- |
|
|
193
|
-
| Eval matrix with integrity | `runEvalCampaign` | `/` |
|
|
194
|
-
| Deterministic re-judge / audit | `ReplayCache`, `createReplayFetch` | `/` |
|
|
195
|
-
| Anytime-valid α across rolling looks | `pairedEvalueSequence` | `/reporting` |
|
|
196
|
-
| Judge quality vs gold | `calibrateJudge` (κ, Pearson, MAE, bias probes) | `/` |
|
|
197
|
-
| Continuous inter-rater agreement | `calibrateJudgeContinuous`, `continuousAgreement` (κ_w, ICC(2,1), bootstrap CIs) | `/` |
|
|
198
|
-
| (chosen, rejected) for DPO/KTO/PPO | `extractPreferences` | `/rl` |
|
|
199
|
-
| Verifiable reward signal | `extractVerifiableReward` | `/rl` |
|
|
200
|
-
| Step-level / PRM training data | `extractStepRewards`, `prmTrainingPairs` | `/rl` |
|
|
201
|
-
| Estimate policy value off-policy | `offPolicyEstimateAll` (IPS + SNIPS + DR) | `/rl` |
|
|
202
|
-
| GEPA / reflective prompt mutation | `buildReflectionPrompt`, `parseReflectionResponse`, Ax-GEPA `SteeringOptimizer` | `/` `/optimization` |
|
|
203
|
-
| Auto-research (read runs → propose) | `analyzeOptimizationResult`, `PredictiveValidityResearcher` | `/rl` |
|
|
204
|
-
| Active curriculum (variance / Thompson) | `allocateCurriculum` | `/rl` |
|
|
205
|
-
| Tournament ratings (Bradley-Terry + Elo) | `fitBradleyTerry`, `applyEloUpdate` | `/rl` |
|
|
206
|
-
| Adversarial scenario search | `adversarialScenarioSearch` | `/rl` |
|
|
207
|
-
| Contamination probe (held-out perturb) | `runContaminationProbe` | `/rl` |
|
|
208
|
-
| Reward hacking signatures | `detectRewardHacking` | `/rl` |
|
|
209
|
-
| Compute curves (best-of-N, self-consist, Pareto) | `runComputeCurve`, `bestOfN`, `selfConsistency`, `paretoFrontier` | `/rl` |
|
|
210
|
-
| Knowledge gap separated from reasoning gap | `scoreKnowledgeReadiness` | `/` |
|
|
211
|
-
| Release gate (paired evidence + holdouts) | `evaluateReleaseConfidence`, `HeldOutGate` | `/reporting` |
|
|
212
|
-
| Launch report (decision-grade) | `renderReleaseReport`, `researchReport` | `/reporting` |
|
|
213
|
-
|
|
214
|
-
## Import Paths
|
|
215
|
-
|
|
216
|
-
| Subpath | Use for |
|
|
217
|
-
| --- | --- |
|
|
218
|
-
| `@tangle-network/agent-eval/contract` | **LAND-tier surface** — `selfImprove`, `runCampaign`, `runImprovementLoop`, `runEval`, `Dispatch`, `Mutator`, `Gate`, `defaultProductionGate`, `gepaDriver`, `diffRuns`, storage backends. New code starts here. |
|
|
219
|
-
| `@tangle-network/agent-eval/hosted` | **EXPAND-tier surface** — `createHostedClient`, wire-format types, `HOSTED_WIRE_VERSION`. Ships eval-run events + trace spans to any orchestrator that speaks the spec. |
|
|
220
|
-
| `@tangle-network/agent-eval/adapters/otel` | OTel→hosted bridge — `createOtelBridge` forwards OTel-shape spans (TraceAI, OpenLLMetry, OTel SDK) into the hosted-tier ingest. |
|
|
221
|
-
| `@tangle-network/agent-eval/adapters/langchain` | LangChain executor adapter — wrap a LangChain runnable as a `Dispatch`. |
|
|
222
|
-
| `@tangle-network/agent-eval/adapters/http` | Distributed driver — `httpDispatch` + `runDispatchServer` for cross-machine campaigns. |
|
|
223
|
-
| `@tangle-network/agent-eval/campaign` | Lower-level campaign primitives — `runCampaign`, driver implementations, storage. |
|
|
224
|
-
| `@tangle-network/agent-eval/multishot` | Multi-shot optimization primitives. |
|
|
225
|
-
| `@tangle-network/agent-eval/control` | `observe → validate → decide → act`, action policy, propose/review loops |
|
|
226
|
-
| `@tangle-network/agent-eval/traces` | trace stores, emitters, TraceAnalyst, replay |
|
|
227
|
-
| `@tangle-network/agent-eval/optimization` | feedback trajectories, multi-shot, prompt evolution, GEPA, EvalCampaign |
|
|
228
|
-
| `@tangle-network/agent-eval/reporting` | release confidence, paired stats, sequential e-values, launch reports |
|
|
229
|
-
| `@tangle-network/agent-eval/rl` | adapters, verifiable rewards, preferences, OPE, PRM, contamination, tournaments, adversarial, compute curves, auto-research |
|
|
230
|
-
| `@tangle-network/agent-eval/wire` | HTTP/RPC server + schemas (same protocol the Python client speaks) |
|
|
231
|
-
| `@tangle-network/agent-eval/benchmarks` | benchmark adapter contracts and reference wrappers |
|
|
232
|
-
| `@tangle-network/agent-eval/matrix` | N-axis cartesian runner over substrate types — see [`src/matrix/`](./src/matrix/) |
|
|
233
|
-
|
|
234
|
-
The root export remains available for convenience; new code should prefer
|
|
235
|
-
focused subpaths. Anything under `/rl`, `/pipelines`, `/meta-eval`, `/prm`,
|
|
236
|
-
or `/builder-eval` is only reachable via its subpath.
|
|
237
|
-
|
|
238
|
-
## API stability
|
|
239
|
-
|
|
240
|
-
Public exports are tagged with JSDoc stability markers so consumers can see
|
|
241
|
-
status at the call site (IDE hover, language server, declaration files).
|
|
168
|
+
Both intake adapters preserve every signal in the source — multi-rater scores stay rater-keyed so the report can compute inter-rater agreement and surface the disagreement triage list.
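
To show why rater-keyed scores matter, the sketch below computes an agreement statistic directly from rows shaped like the `{ runId, rater, rating }` input above. It uses Fleiss' kappa over runs rated by every rater; that choice, the `'approve' | 'reject'` label set, and the name `fleissKappa` are assumptions for illustration, since the source does not say which kappa variant the report uses.

```ts
type Rating = 'approve' | 'reject' // assumed label set

interface RatingRow {
  runId: string
  rater: string
  rating: Rating
}

interface Agreement {
  kappa: number
  raters: number
  jointlyRated: number // runs that every rater scored
}

// Hypothetical function: Fleiss' kappa for two categories.
function fleissKappa(rows: RatingRow[]): Agreement {
  const raterIds = new Set<string>()
  const byRun = new Map<string, Map<string, Rating>>()
  for (const row of rows) {
    raterIds.add(row.rater)
    const perRater = byRun.get(row.runId) ?? new Map<string, Rating>()
    perRater.set(row.rater, row.rating)
    byRun.set(row.runId, perRater)
  }
  const n = raterIds.size
  if (n < 2) throw new Error('need at least two raters')

  // Keep only runs that all raters scored, so every item has n ratings.
  const joint = Array.from(byRun.values()).filter((m) => m.size === n)
  const N = joint.length
  if (N === 0) throw new Error('no jointly rated runs')

  let approveTotal = 0
  let agreementSum = 0
  for (const perRater of joint) {
    let approve = 0
    perRater.forEach((rating) => {
      if (rating === 'approve') approve++
    })
    const reject = n - approve
    approveTotal += approve
    // Per-item agreement: share of rater pairs that agree on this run.
    agreementSum += (approve * approve + reject * reject - n) / (n * (n - 1))
  }
  const observed = agreementSum / N
  const pApprove = approveTotal / (N * n)
  const expected = pApprove * pApprove + (1 - pApprove) * (1 - pApprove)
  if (expected === 1) throw new Error('kappa is undefined when all ratings share one category')
  return { kappa: (observed - expected) / (1 - expected), raters: n, jointlyRated: N }
}

// Made-up ratings: three raters, four jointly rated runs, one partial run.
const rows: RatingRow[] = [
  { runId: 'run-1', rater: 'a', rating: 'approve' },
  { runId: 'run-1', rater: 'b', rating: 'approve' },
  { runId: 'run-1', rater: 'c', rating: 'approve' },
  { runId: 'run-2', rater: 'a', rating: 'approve' },
  { runId: 'run-2', rater: 'b', rating: 'approve' },
  { runId: 'run-2', rater: 'c', rating: 'reject' },
  { runId: 'run-3', rater: 'a', rating: 'reject' },
  { runId: 'run-3', rater: 'b', rating: 'reject' },
  { runId: 'run-3', rater: 'c', rating: 'reject' },
  { runId: 'run-4', rater: 'a', rating: 'reject' },
  { runId: 'run-4', rater: 'b', rating: 'reject' },
  { runId: 'run-4', rater: 'c', rating: 'approve' },
  { runId: 'run-5', rater: 'a', rating: 'approve' }, // excluded: only one rater
]
const agreement = fleissKappa(rows)
console.log(agreement)
```

If scores were averaged across raters at intake, the per-run split votes (here `run-2` and `run-4`) would be lost, and with them the disagreement triage list.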
|
|
242
169
|
|
|
243
|
-
|
|
244
|
-
| --- | --- |
|
|
245
|
-
| `@stable` | API frozen at this major. Breaking changes require a major bump. |
|
|
246
|
-
| `@experimental` | Interface may evolve before becoming `@stable`. Pin the patch version if you depend on it. |
|
|
247
|
-
| `@internal` | Not part of the public contract. Use the documented subpath instead. |
|
|
170
|
+
---
|
|
248
171
|
|
|
249
|
-
|
|
250
|
-
[`src/rl/index.ts`](./src/rl/index.ts) for the current stable/experimental
|
|
251
|
-
breakdown.
|
|
172
|
+
## How it compares
|
|
252
173
|
|
|
253
|
-
|
|
174
|
+
| | LangSmith | Braintrust | Phoenix | **agent-eval** |
|
|
175
|
+
|---|:---:|:---:|:---:|:---:|
|
|
176
|
+
| Closed-loop self-improvement | ✱ human-in-loop | ✱ experiment-driven | — | ✓ autonomous + gated |
|
|
177
|
+
| Statistical lift CI (paired bootstrap) | — | partial | — | ✓ |
|
|
178
|
+
| Judge calibration + bias detection | — | — | — | ✓ |
|
|
179
|
+
| Inter-rater agreement + disagreement triage | — | — | — | ✓ |
|
|
180
|
+
| Contamination / canary check | — | — | — | ✓ |
|
|
181
|
+
| AI-driven failure clustering | partial | — | partial | ✓ |
|
|
182
|
+
| Cost-quality Pareto | — | — | — | ✓ |
|
|
183
|
+
| Multi-language clients (TS + Python) | TS + Py | TS + Py | TS + Py | ✓ TS + Py |
|
|
184
|
+
| Self-hostable / no-SaaS option | — | — | OSS | ✓ MIT, OSS |
|
|
185
|
+
| Substrate vs SaaS shape | SaaS | SaaS | OSS server | **library** |
|
|
186
|
+
| Hosted tier (optional) | required | required | optional | optional |
|
|
254
187
|
|
|
255
|
-
|
|
256
|
-
code: (1) raw HTTP capture alongside the structured spans so a reviewer can
|
|
257
|
-
verify which route answered, (2) a preflight assertion that the configured
|
|
258
|
-
client points at the intended provider, (3) a run-end assertion that the
|
|
259
|
-
expected events were actually written, and (4) auto-execution of the trace
|
|
260
|
-
analyst as part of the run lifecycle.
|
|
188
|
+
Position: agent-eval is the **substrate** (one library, decision-grade output); the others are SaaS *around* the substrate. If you want a closed loop that ships your prompt under statistical confidence, you call agent-eval. If you want a dashboard rendered from your data, you pipe agent-eval into the hosted tier or your own renderer.
|
|
261
189
|
|
|
262
|
-
|
|
263
|
-
import {
|
|
264
|
-
TraceEmitter, FileSystemRawProviderSink, callLlm, assertLlmRoute,
|
|
265
|
-
assertRunCaptured, throwIfRunIncomplete,
|
|
266
|
-
} from '@tangle-network/agent-eval'
|
|
267
|
-
import { traceAnalystOnRunComplete } from '@tangle-network/agent-eval/traces'
|
|
190
|
+
---
|
|
268
191
|
|
|
269
|
-
|
|
270
|
-
assertLlmRoute(llmOpts, { requireExplicitBaseUrl: true, allowedBaseUrls, requireAuth: true })
|
|
192
|
+
## Customer journeys
|
|
271
193
|
|
|
272
|
-
|
|
273
|
-
onRunComplete: [traceAnalystOnRunComplete({ analyze: analystOpts, save })],
|
|
274
|
-
})
|
|
275
|
-
await emitter.startRun(/* ... */)
|
|
276
|
-
// LLM calls flow through callLlm with `{ rawSink: sink, traceContext: { runId, spanId } }`.
|
|
277
|
-
await emitter.endRun({ pass, score })
|
|
194
|
+
Three runnable examples — each is self-contained, each shows the actual output.
|
|
278
195
|
|
|
279
|
-
|
|
280
|
-
|
|
281
|
-
|
|
282
|
-
|
|
196
|
+
| Journey | Example | Who it's for |
|
|
197
|
+
|---|---|---|
|
|
198
|
+
| **Closed loop** — improve a prompt under statistical confidence | [`examples/selfimprove-quickstart/`](./examples/selfimprove-quickstart/) | Teams with scenarios + judges + agent in hand |
|
|
199
|
+
| **Multi-rater feedback corpus** — turn Obsidian/Sheets/CSV ratings into actionable insights | [`examples/customer-feedback-loop/`](./examples/customer-feedback-loop/) | Teams reviewing AI outputs by hand who want to compress that taste into per-member LLM judges + close the loop |
|
|
200
|
+
| **Production OTel traces** — analyze logs you already have, no closed loop required | [`examples/customer-otel-traces/`](./examples/customer-otel-traces/) | Teams running agents in prod with observability, no eval discipline yet |
|
|
283
201
|
|
|
284
|
-
|
|
285
|
-
[`SKILL.md` § Capture integrity](./.claude/skills/agent-eval/SKILL.md#capture-integrity-required-for-launch-grade-adoption).
|
|
286
|
-
|
|
287
|
-
## Examples
|
|
288
|
-
|
|
289
|
-
Each example has its own README with what it demonstrates, expected output,
|
|
290
|
-
and runtime. See [`examples/`](./examples/).
|
|
291
|
-
|
|
292
|
-
- [`examples/multi-shot-optimization`](./examples/multi-shot-optimization/README.md):
|
|
293
|
-
optimize full trajectories with held-out promotion.
|
|
294
|
-
- [`examples/same-sandbox-harness`](./examples/same-sandbox-harness/README.md):
|
|
295
|
-
run setup/build/test and evidence checks in one workspace.
|
|
296
|
-
- [`examples/benchmarks`](./examples/benchmarks/README.md):
|
|
297
|
-
benchmark adapter shape and reference wrappers.
|
|
298
|
-
- [`examples/auto-research-with-agent-builder`](./examples/auto-research-with-agent-builder/README.md):
|
|
299
|
-
closed loop — score, reflect, mutate, re-score, repeat.
|
|
300
|
-
- [`examples/fine-tune-with-prime-rl`](./examples/fine-tune-with-prime-rl/README.md):
|
|
301
|
-
RunRecord → preferences → trainer (prime-rl) → next campaign.
|
|
302
|
-
- [`examples/production-loop`](./examples/production-loop/README.md):
|
|
303
|
-
ingest prod traces + feedback, cluster failures, evolve, gate, open a PR.
|
|
304
|
-
|
|
305
|
-
## Matrix
|
|
306
|
-
|
|
307
|
-
`@tangle-network/agent-eval/matrix` is an N-axis cartesian runner over the
|
|
308
|
-
substrate types you already use — `AgentProfile` from
|
|
309
|
-
`@tangle-network/sandbox`, `Driver` / `Validator` from
|
|
310
|
-
`@tangle-network/agent-runtime`, rubric records, anything. It does not wrap
|
|
311
|
-
substrate types; the caller passes them in axis values, the runner iterates
|
|
312
|
-
the cartesian, and the aggregator returns per-axis pass / score / cost /
|
|
313
|
-
duration summaries.
|
|
202
|
+
Each example consists of a `README.md` and a single `index.ts` runnable via `pnpm tsx`; it prints the resulting `InsightReport` to stdout.
|
|
314
203
|
|
|
315
|
-
|
|
316
|
-
import { runAgentMatrix } from '@tangle-network/agent-eval/matrix'
|
|
317
|
-
|
|
318
|
-
const result = await runAgentMatrix({
|
|
319
|
-
axes: [
|
|
320
|
-
{ name: 'scenario', values: scenarios.map((s) => ({ id: s.id, value: s })) },
|
|
321
|
-
{ name: 'profile', values: profiles.map((p) => ({ id: p.name, value: p })) },
|
|
322
|
-
{ name: 'thinking', values: [
|
|
323
|
-
{ id: 'low', value: 'low' }, { id: 'high', value: 'high' },
|
|
324
|
-
] },
|
|
325
|
-
],
|
|
326
|
-
reps: 3,
|
|
327
|
-
maxConcurrency: 4,
|
|
328
|
-
costCeiling: 5.0,
|
|
329
|
-
filter: (cell) => !(cell.axes.scenario.value.hard === 5 && cell.axes.thinking.id === 'low'),
|
|
330
|
-
runCell: async (cell) => runScenario(cell.axes.scenario.value, cell.axes.profile.value),
|
|
331
|
-
})
|
|
204
|
+
---
|
|
332
205
|
|
|
333
|
-
|
|
334
|
-
```
|
|
206
|
+
## Subpath entry points
|
|
335
207
|
|
|
336
|
-
|
|
208
|
+
| Subpath | What it gives you |
|
|
209
|
+
|---|---|
|
|
210
|
+
| `@tangle-network/agent-eval/contract` | **The headline surface.** `selfImprove`, `analyzeRuns`, `runImprovementLoop`, `runCampaign`, `runEval`, `diffRuns`, intake adapters (`fromFeedbackTable`, `fromOtelSpans`), drivers (`gepaDriver`, `evolutionaryDriver`), gates (`defaultProductionGate`, `heldOutGate`, `composeGate`), storage. **New code starts here.** |
|
|
211
|
+
| `@tangle-network/agent-eval/hosted` | Hosted-tier wire-format types + `createHostedClient` to ship eval-run events + trace spans to any orchestrator speaking the spec |
|
|
212
|
+
| `@tangle-network/agent-eval/adapters/otel` | `createOtelBridge` — forwards OpenTelemetry-shape spans into the hosted-tier ingest |
|
|
213
|
+
| `@tangle-network/agent-eval/adapters/langchain` | LangChain runnable → `Dispatch` adapter |
|
|
214
|
+
| `@tangle-network/agent-eval/adapters/http` | `httpDispatch` + `runDispatchServer` for distributed campaigns across machines |
|
|
215
|
+
| `@tangle-network/agent-eval/campaign` | Lower-level campaign primitives (storage, drivers, types) |
|
|
216
|
+
| `@tangle-network/agent-eval/multishot` | N-shot persona × shot matrix runner |
|
|
217
|
+
| `@tangle-network/agent-eval/control` | Agent control loop primitives (`runAgentControlLoop`, action policy, propose/review) |
|
|
218
|
+
| `@tangle-network/agent-eval/traces` | Trace stores, emitters, OTLP-JSONL replay |
|
|
219
|
+
| `@tangle-network/agent-eval/reporting` | Release confidence, paired stats, sequential e-values, launch reports |
|
|
220
|
+
| `@tangle-network/agent-eval/rl` | RL bridge — verifiable rewards, preferences, OPE, PRM, tournaments, contamination, compute curves, auto-research |
|
|
221
|
+
| `@tangle-network/agent-eval/matrix` | N-axis cartesian over substrate types |
|
|
222
|
+
| `@tangle-network/agent-eval/wire` | HTTP/RPC server + Zod schemas (same protocol the Python client speaks) |
|
|
223
|
+
| `@tangle-network/agent-eval/benchmarks` | Benchmark adapter contracts and reference wrappers |
|
|
337
224
|
|
|
338
|
-
|
|
225
|
+
The root export remains available for backward compatibility; new code should prefer focused subpaths. Anything under `/rl`, `/pipelines`, `/meta-eval`, `/prm`, or `/builder-eval` is **only** reachable via its subpath.
 
-
+---
 
-
-2. [Product Eval Adoption](./docs/product-eval-adoption.md)
-3. [Control Runtime](./docs/control-runtime.md)
-4. [Feedback Trajectories](./docs/feedback-trajectories.md)
-5. [Multi-Shot Optimization](./docs/multi-shot-optimization.md)
-6. [Trace Analysis](./docs/trace-analysis.md)
-7. [Knowledge Readiness](./docs/knowledge-readiness.md)
-8. [Integration Launch Gates](./docs/integration-launch-gates.md)
-9. [Wire Protocol](./docs/wire-protocol.md) — required for non-TypeScript consumers
+## Concepts + design
 
-
+- [`docs/concepts.md`](./docs/concepts.md) — five types, three top-level functions, the layering rule, the wire protocol contract
+- [`docs/insight-report.md`](./docs/insight-report.md) — annotated walkthrough of every section of the decision packet
+- [`docs/customer-journeys.md`](./docs/customer-journeys.md) — three end-to-end journeys with code + expected output
+- [`docs/adapters-observability.md`](./docs/adapters-observability.md) — composing agent-eval with LangSmith, Langfuse, Phoenix, OpenLLMetry, TraceAI
+- [`docs/wire-protocol.md`](./docs/wire-protocol.md) — the HTTP/RPC contract Python (and any future language) speaks
+- [`docs/hosted-ingest-spec.md`](./docs/hosted-ingest-spec.md) — the hosted-tier wire format, frozen at `2026-05-26.v1`
+- [`docs/design/`](./docs/design/) — RFCs + architectural notes
 
-
-
-
+The `.claude/skills/agent-eval/SKILL.md` skill ships embedded directives so LLM agents writing integration code don't reintroduce historical bug classes.
+
+---
+
+## Hosted tier
+
+Wire your loop to a hosted orchestrator (ours, or your own implementation of the spec) with one config:
+
+```ts
+await selfImprove({
+  scenarios, dispatch, judges, baselineSurface,
+  hostedTenant: {
+    endpoint: 'https://intelligence.tangle.tools',
+    apiKey: process.env.TANGLE_API_KEY!,
+    tenantId: 'your-tenant',
+  },
+})
 ```
 
-
+The substrate runs the loop in your process. Only the eval-run events + (optional) trace spans go to the orchestrator. Your scenarios, your judges, your raw data — never sent. Spec at [`docs/hosted-ingest-spec.md`](./docs/hosted-ingest-spec.md); reference receiver at [`examples/hosted-ingest-server/`](./examples/hosted-ingest-server/).
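The data boundary described in the added README line above can be sketched as an explicit allowlist applied before anything leaves the process. This is an assumption-laden illustration: `LocalRunRecord`, `EvalRunEvent`, `toEvalRunEvent`, and every field name are hypothetical, and the actual wire format is whatever `docs/hosted-ingest-spec.md` defines.

```typescript
// All names here are hypothetical; the real event shape is defined by the hosted-ingest spec.
interface LocalRunRecord {
  runId: string;
  scenarioId: string;
  scenarioPrompt: string; // raw scenario text, meant to stay local
  rawOutput: string;      // raw agent output, meant to stay local
  score: number;
  durationMs: number;
}

interface EvalRunEvent {
  type: 'eval-run';
  runId: string;
  scenarioId: string;
  score: number;
  durationMs: number;
}

// Build the outbound event by copying an explicit allowlist of fields,
// so fields added to the local record later are not sent by accident.
function toEvalRunEvent(record: LocalRunRecord): EvalRunEvent {
  return {
    type: 'eval-run',
    runId: record.runId,
    scenarioId: record.scenarioId,
    score: record.score,
    durationMs: record.durationMs,
  };
}

const outbound = toEvalRunEvent({
  runId: 'run-1',
  scenarioId: 'scenario-a',
  scenarioPrompt: 'confidential prompt text',
  rawOutput: 'confidential agent output',
  score: 0.8,
  durationMs: 1200,
});

// Only the allowlisted fields appear in the serialized payload.
console.log(JSON.stringify(outbound));
```

Copying fields one by one, rather than spreading the record and deleting sensitive keys, is the design choice that keeps the default safe: anything not named is dropped.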
+
+---
+
+## Install + run
 
 ```sh
+pnpm add @tangle-network/agent-eval
+# or, from Python:
 pip install agent-eval-rpc
 ```
 
-
-from agent_eval_rpc import Client
-client = Client() # auto-detects HTTP server, falls back to subprocess
-score = await client.judge(content=output, rubric_name="anti-slop")
-```
+Run an example:
 
-
-
-
+```sh
+pnpm tsx examples/selfimprove-quickstart/index.ts
+pnpm tsx examples/customer-feedback-loop/index.ts
+pnpm tsx examples/customer-otel-traces/index.ts
+```
 
-
+Run the test suite:
 
 ```sh
 pnpm install
-pnpm
+pnpm build
 pnpm test
-pnpm lint # biome
-pnpm build # tsup + openapi.json
 ```
 
-
+---
+
+## Stability + versioning
+
+Public exports carry JSDoc stability markers visible in IDE hover + `.d.ts`:
+
+| Tag | Meaning |
+|---|---|
+| `@stable` | API frozen at this major. Breaking changes require a major bump. |
+| `@experimental` | Interface may evolve before becoming `@stable`. Pin the patch version if you depend on it. |
+| `@internal` | Not part of the public contract. Use the documented subpath instead. |
 
-
-production session/runtime layer.
-- [`@tangle-network/agent-knowledge`](https://www.npmjs.com/package/@tangle-network/agent-knowledge):
-source-grounded knowledge bases and readiness.
-- [`@tangle-network/agent-integrations`](https://www.npmjs.com/package/@tangle-network/agent-integrations):
-connection, grant, capability, and integration invocation contracts.
+[`CHANGELOG.md`](./CHANGELOG.md) tracks every release with what's new / additive / breaking.
 
-
-it knows; `agent-integrations` is what it can do; `agent-eval` is how it gets
-better.
+---
 
 ## License
 
-MIT
+MIT. See [`LICENSE`](./LICENSE).
package/dist/adapters/http.d.ts
CHANGED
@@ -1,4 +1,4 @@
-import { S as Scenario,
+import { S as Scenario, D as DispatchFn, b as DispatchContext } from '../types-Dbj5gu8n.js';
 
 /**
  * # `@tangle-network/agent-eval/adapters/http` — distributed Dispatch over HTTP.
package/dist/adapters/langchain.d.ts
CHANGED
@@ -1,4 +1,4 @@
-import { S as Scenario,
+import { S as Scenario, J as JudgeScore, D as DispatchFn, a as JudgeConfig } from '../types-Dbj5gu8n.js';
 
 /**
  * # `@tangle-network/agent-eval/adapters/langchain` — wrap any LangChain
|