@tangle-network/agent-eval 0.50.0 → 0.50.1

package/README.md CHANGED
@@ -1,400 +1,304 @@
1
- # @tangle-network/agent-eval
2
-
3
- **Substrate for self-improving agents.** Trace what runs, verify the result,
4
- turn outcomes into preferences and rewards, mutate prompts and policies under
5
- anytime-valid evidence, and ship only when the improvement is decisive.
6
-
7
- ```txt
8
- real product task
9
- -> observe / act (your runtime)
10
- -> trace + verifier pipeline (capture integrity)
11
- -> RunRecord (canonical eval artifact)
12
- -> judge calibration · paired stats · sequential α
13
- -> preferences · verifiable rewards · process rewards
14
- -> GEPA / reflective mutation · auto-research · active curriculum
15
- -> release gate · replay · contamination probe · tournament rating
16
- -> next iteration
17
- ```
1
+ # `@tangle-network/agent-eval`
18
2
 
19
- `agent-eval` does **not** own product state, credentials, UI, storage, model
20
- routing, browser drivers, sandbox policy, or deployment. Products own those.
21
- This package owns the loop that closes evaluation → preference → mutation →
22
- redeploy, with capture integrity and statistically rigorous evidence at every
23
- step.
3
+ **Ship better agent prompts with statistical confidence.** One function call returns a decision packet: lift CI, judge calibration, contamination check, failure clusters, cost-quality Pareto, and a ranked action list. Same shape whether you've got a closed improvement loop or just production logs.
24
4
 
25
- It ships as a TypeScript library (npm) with a generated Python client (PyPI),
26
- both speaking the same wire protocol. MIT, self-hostable, no SaaS dependency.
5
+ [![npm](https://img.shields.io/npm/v/@tangle-network/agent-eval.svg)](https://www.npmjs.com/package/@tangle-network/agent-eval)
6
+ [![pypi](https://img.shields.io/pypi/v/agent-eval-rpc.svg)](https://pypi.org/project/agent-eval-rpc/)
7
+ [![tests](https://github.com/tangle-network/agent-eval/actions/workflows/ci.yml/badge.svg)](https://github.com/tangle-network/agent-eval/actions/workflows/ci.yml)
8
+ [![license: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)
27
9
 
28
- ## Install
10
+ > TypeScript is first-class; the Python client (`agent-eval-rpc`) speaks the same wire protocol. Hosted-tier-friendly, MIT-licensed, self-hostable, with no SaaS dependency.
29
11
 
30
- ```sh
31
- pnpm add @tangle-network/agent-eval
32
- # or, from Python:
33
- pip install agent-eval-rpc
34
- ```
12
+ ---
35
13
 
36
- ## Quick Start — the control loop
14
+ ## Table of contents
37
15
 
38
- ```ts
39
- import {
40
- objectiveEval,
41
- runAgentControlLoop,
42
- } from '@tangle-network/agent-eval/control'
16
+ - [What you get back](#what-you-get-back-the-decision-packet)
17
+ - [Quick start](#quick-start)
18
+ - [Closed loop — `selfImprove()`](#closed-loop--selfimprove)
19
+ - [Observed runs — `analyzeRuns()`](#observed-runs--analyzeruns)
20
+ - [Existing data — intake adapters](#existing-data--intake-adapters)
21
+ - [How it compares](#how-it-compares)
22
+ - [Customer journeys](#customer-journeys)
23
+ - [Subpath entry points](#subpath-entry-points)
24
+ - [Concepts + design](#concepts--design)
25
+ - [Hosted tier](#hosted-tier)
26
+ - [Install + run](#install--run)
27
+ - [Stability + versioning](#stability--versioning)
28
+ - [License](#license)
43
29
 
44
- const result = await runAgentControlLoop({
45
- intent: task.prompt,
46
- budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
30
+ ---
47
31
 
48
- observe() {
49
- return product.readState(task.id)
50
- },
32
+ ## What you get back: the decision packet
51
33
 
52
- validate({ state }) {
53
- return [
54
- objectiveEval({
55
- id: 'build-passes',
56
- passed: state.build.exitCode === 0,
57
- severity: 'critical',
58
- metadata: state.build,
59
- }),
60
- objectiveEval({
61
- id: 'preview-serves',
62
- passed: state.preview.httpStatus === 200,
63
- severity: 'critical',
64
- }),
34
+ Whether you call `selfImprove()` (closed loop) or `analyzeRuns()` (observed runs), the report has the same shape. Here's a real one, abridged:
35
+
36
+ ```jsonc
37
+ {
38
+ "n": 80, // runs analyzed
39
+ "composite": { // distributional summary
40
+ "mean": 0.62, "p50": 0.65, "p95": 0.88, "stddev": 0.17,
41
+ "histogram": [/* 12 bins */]
42
+ },
43
+ "lift": { // paired bootstrap
44
+ "baselineMean": 0.58, "candidateMean": 0.65,
45
+ "delta": 0.07,
46
+ "ci95": [0.04, 0.10], // 95% CI on the delta
47
+ "pValue": 0.0008, // paired-t
48
+ "cohensD": 0.41,
49
+ "n": 40,
50
+ "mde": 0.06, // min detectable effect at 80% power
51
+ "requiredN": 38 // n needed to detect observed delta
52
+ },
53
+ "judges": { // per-judge calibration
54
+ "domain-expert": { "n": 80, "meanScore": 0.64 },
55
+ "helpfulness-llm": { "n": 80, "meanScore": 0.61 }
56
+ },
57
+ "interRater": { // multi-rater agreement
58
+ "raters": 3, "jointlyRated": 80, "kappa": 0.71,
59
+ "disagreementCases": [/* top 20 ranked by spread */]
60
+ },
61
+ "costQuality": { // cost-vs-quality
62
+ "cost": { "mean": 0.024, "p95": 0.041, /* ... */ },
63
+ "pareto": { /* ParetoFigureSpec the dashboard renders */ }
64
+ },
65
+ "failureClusters": { // when an AnalystRegistry is wired
66
+ "totalFailures": 11,
67
+ "clusters": [
68
+ { "name": "off-topic-drift", "share": 0.45, "exemplars": ["run-12", "run-19"] },
69
+ { "name": "over-confidence", "share": 0.27, "exemplars": ["run-3"] },
70
+ { "name": "format-mismatch", "share": 0.18, "exemplars": ["run-41"] }
65
71
  ]
66
72
  },
67
-
68
- decide({ evals }) {
69
- const failed = evals.filter((e) => !e.passed)
70
- if (failed.length === 0) {
71
- return { type: 'stop', pass: true, reason: 'all gates passed' }
72
- }
73
- return {
74
- type: 'continue',
75
- action: { type: 'repair', failed: failed.map((e) => e.id) },
76
- reason: 'repair failed gates',
77
- }
73
+ "contamination": { "leaks": 0, "holdoutAuditPassed": true },
74
+ "outcomeCorrelation": { // when downstream metric supplied
75
+ "metric": "engagement_rate", "n": 80,
76
+ "pearson": 0.72, "spearman": 0.69,
77
+ "rewardModel": { "intercept": 0.04, "slope": 1.93, "r2": 0.52 }
78
78
  },
79
-
80
- act(action) {
81
- return product.runAgentStep(task.id, action)
79
+ "release": {
80
+ "status": "pass",
81
+ "axes": [
82
+ { "name": "quality-lift", "status": "pass" },
83
+ { "name": "contamination", "status": "pass" },
84
+ { "name": "composite-distribution","status": "pass" }
85
+ ]
82
86
  },
83
- })
84
-
85
- await product.storeEvalResult(task.id, result)
87
+ "recommendations": [
88
+ { "priority": "critical", "kind": "ship",
89
+ "title": "Ship — lift 0.070 (95% CI 0.040..0.100)",
90
+ "detail": "Holdout lift exceeds threshold 0.02 with 95% bootstrap confidence (n=40, p=0.0008, d=0.41)." },
91
+ { "priority": "high", "kind": "investigate",
92
+ "title": "Top failure cluster: off-topic-drift (45% of failures)",
93
+ "detail": "11 runs failed. Drill into exemplars run-12 / run-19 to identify the pattern." }
94
+ ]
95
+ }
86
96
  ```
87
97
 
88
- Same loop shape in production, replay, benchmark, and optimization. Swap the
89
- dependencies behind `observe()` and `act()`, never the eval contract.
98
+ The `recommendations` array is the human-readable layer; everything above it is the evidence. Read the recommendations and act on them; the numbers are the proof.
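
As an illustration of reading the packet, the sketch below pulls the release verdict and the top-priority recommendation out of a trimmed copy of the report above. The `PacketSlice` type and the `summarize` helper are hypothetical names introduced here, covering only fields visible in the sample JSON; they are not the package's exported types, and the first recommendation title is abridged.

```ts
// Hypothetical, trimmed-down types: only fields visible in the sample packet.
interface Recommendation { priority: string; kind: string; title: string; detail: string }
interface ReleaseAxis { name: string; status: string }
interface PacketSlice {
  release: { status: string; axes: ReleaseAxis[] }
  recommendations: Recommendation[]
}

// Hypothetical helper: collapse the packet into a ship/no-ship summary.
function summarize(packet: PacketSlice): {
  shippable: boolean
  failingAxes: string[]
  topAction: string | undefined
} {
  // Any axis that is not 'pass' blocks the release in this sketch.
  const failingAxes = packet.release.axes
    .filter((axis) => axis.status !== 'pass')
    .map((axis) => axis.name)
  const shippable = packet.release.status === 'pass' && failingAxes.length === 0
  // Prefer a 'critical' recommendation; otherwise fall back to the first entry.
  const top =
    packet.recommendations.find((rec) => rec.priority === 'critical') ??
    packet.recommendations[0]
  return { shippable, failingAxes, topAction: top?.title }
}

// Values copied (abridged) from the sample packet.
const samplePacket: PacketSlice = {
  release: {
    status: 'pass',
    axes: [
      { name: 'quality-lift', status: 'pass' },
      { name: 'contamination', status: 'pass' },
      { name: 'composite-distribution', status: 'pass' },
    ],
  },
  recommendations: [
    { priority: 'critical', kind: 'ship', title: 'Ship (lift 0.070, 95% CI 0.040..0.100)', detail: 'abridged' },
    { priority: 'high', kind: 'investigate', title: 'Top failure cluster: off-topic-drift (45% of failures)', detail: 'abridged' },
  ],
}

const packetSummary = summarize(samplePacket)
console.log(packetSummary)
```

Treating every non-`pass` axis as blocking is a design choice of this sketch; a real gate may weigh axes differently.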
90
99
 
91
- ## Production loop — close the eval → prod → eval cycle
100
+ ---
92
101
 
93
- Static prompts decay. Yesterday's FTC rule flips today; yesterday's tool quirk
94
- becomes today's incident. The production agents that win are the ones that
95
- **continuously re-train against live failure modes**.
102
+ ## Quick start
96
103
 
97
- `runProductionLoop` is the orchestration layer that wires the existing eval
98
- substrate into a self-improvement cron:
104
+ ### Closed loop — `selfImprove()`
99
105
 
100
- ```ts
101
- import {
102
- runProductionLoop,
103
- httpGithubClient,
104
- FileSystemFeedbackTrajectoryStore,
105
- } from '@tangle-network/agent-eval'
106
- import { FileSystemTraceStore } from '@tangle-network/agent-eval/traces'
107
-
108
- const result = await runProductionLoop({
109
- runId: `weekly-${new Date().toISOString().slice(0, 10)}`,
110
- target: 'tax-agent',
111
-
112
- // 1. Where production traces + feedback land. Wire the HTTP ingestion
113
- // endpoints (POST /v1/traces/ingest, POST /v1/feedback) from your
114
- // runtime; the same store reads them here.
115
- traceStore: new FileSystemTraceStore({ dir: 'data/prod-traces' }),
116
- feedbackStore: new FileSystemFeedbackTrajectoryStore({ dir: 'data/prod-feedback' }),
117
-
118
- // 2. Cluster threshold: act on failure groups ≥ 20 runs or ≥ 5% of corpus.
119
- cluster: { minClusterSize: 20, minSeverityRatio: 0.05, maxClustersPerCycle: 1 },
120
-
121
- // 3. Evolve: seed = current prompt, gate against holdout scenarios.
122
- evolve: {
123
- baselinePrompt: currentSystemPrompt,
124
- holdoutScenarios: productionShapeScenarios,
125
- runner, // your agent driver
126
- scorer, // calibrated judge or rubric
127
- mutator, // GEPA-style or addendum-style mutator
128
- gate: {
129
- baselineKey: 'baseline',
130
- minProductiveRuns: 5,
131
- pairedDeltaThreshold: 0.03, // require Nσ improvement on holdout
132
- overfitGapThreshold: 0.10,
133
- },
134
- },
135
-
136
- // 4. Ship: when the gate passes, open a PR with the new prompt.
137
- ship: {
138
- client: httpGithubClient({ token: process.env.GITHUB_TOKEN! }),
139
- repo: { owner: 'tangle-network', name: 'tax-agent' },
140
- branchPrefix: 'eval/auto-improve',
141
- promptFilePath: 'prompts/tax-agent-system.txt',
142
- reviewers: ['drew'],
143
- },
106
+ You have scenarios, a dispatch, and judges, and you want the loop to propose better prompts and tell you which one to ship.
144
107
 
145
- cron: { cadence: 'weekly' }, // surface-only; consumer schedules
108
+ ```ts
109
+ import { selfImprove } from '@tangle-network/agent-eval/contract'
110
+
111
+ const result = await selfImprove({
112
+ scenarios, // your scenario corpus
113
+ dispatch: async ({ scenario }) => // your agent — anything that returns an artifact
114
+ await myAgent.run(scenario),
115
+ judges: [myJudge], // any JudgeConfig — LLM, rule, ensemble
116
+ baselineSurface: { systemPrompt: currentPrompt },
146
117
  })
147
118
 
148
- console.log(result.decision) // 'pr_opened' | 'gate_failed' | 'no_actionable_failures' | ...
149
- console.log(result.pullRequest?.prUrl) // populated when a PR was opened
119
+ result.gateDecision // 'ship' | 'hold' | 'need_more_work' | ...
120
+ result.lift // raw delta on holdout
121
+ result.insight // the full decision packet above
150
122
  ```
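
The decision packet labels its `lift` block as a paired bootstrap. The following is a minimal sketch of that technique on per-scenario score pairs, not the library's implementation: `pairedBootstrapCi`, the seeded `mulberry32` generator, the resample count, and the sample scores are all assumptions made for illustration.

```ts
// Small seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Percentile bootstrap over paired differences (candidate - baseline).
// Hypothetical helper; names and defaults are illustrative only.
function pairedBootstrapCi(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 1,
): { delta: number; ci95: [number, number] } {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error('baseline and candidate must be non-empty and the same length')
  }
  const diffs = candidate.map((score, i) => score - baseline[i])
  const n = diffs.length
  const delta = diffs.reduce((sum, d) => sum + d, 0) / n
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let b = 0; b < resamples; b++) {
    // Resample the *pairs* with replacement by resampling their differences.
    let sum = 0
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)]
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)
  const lo = means[Math.floor(0.025 * (resamples - 1))]
  const hi = means[Math.ceil(0.975 * (resamples - 1))]
  return { delta, ci95: [lo, hi] }
}

// Made-up scores for eight scenarios, each run under both prompts.
const baselineScores = [0.5, 0.62, 0.55, 0.6, 0.58, 0.66, 0.52, 0.61]
const candidateScores = [0.57, 0.67, 0.64, 0.66, 0.63, 0.75, 0.58, 0.68]
const liftEstimate = pairedBootstrapCi(baselineScores, candidateScores)
console.log(liftEstimate.delta.toFixed(4), liftEstimate.ci95)
```

Resampling the differences rather than the two arms independently is what makes the interval "paired": scenario difficulty cancels out inside each pair.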
151
123
 
152
- The primitive runs **one cycle**. Schedule it with `workflow_dispatch` + cron in
153
- GitHub Actions. It is **idempotent + replayable**: same `runId` → same plan.
154
- Gate failures are fail-closed — a candidate that beats baseline on search but
155
- overfits on holdout never lands.
124
+ ### Observed runs — `analyzeRuns()`
156
125
 
157
- Full runnable demo (synthetic traces, no credentials) in
158
- [`examples/production-loop`](./examples/production-loop/README.md).
126
+ You don't have a closed loop yet — you have observed runs (production traces, an approve/reject corpus, a CSV gold set). Same report shape, no agent invocation.
159
127
 
160
- ## Self-improvement loop
128
+ ```ts
129
+ import { analyzeRuns } from '@tangle-network/agent-eval/contract'
130
+
131
+ const report = await analyzeRuns({
132
+ runs, // RunRecord[]
133
+ outcomeSignal: { // optional — closes the loop on real outcomes
134
+ metric: 'engagement_rate',
135
+ valueByRunId: enrichedFromProd,
136
+ },
137
+ canaryScenarios, // optional — contamination probe
138
+ analyst: myAnalystRegistry, // optional — AI-powered failure clustering
139
+ })
161
140
 
162
- Eval doesn't end at "pass/fail." Outcomes become training signal, mutation
163
- proposals, and curriculum updates — all from the same `RunRecord` produced by
164
- the control loop.
141
+ report.recommendations // ranked actions
142
+ report.failureClusters // grouped failure modes
143
+ report.outcomeCorrelation // judge↔outcome correlation + linear reward model
144
+ ```
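
To make the `outcomeCorrelation` fields concrete, here is a hedged sketch of a Pearson correlation plus a one-variable least-squares fit between judge scores and a downstream metric. `linearRewardModel` is a hypothetical helper, the data is invented, and regressing the outcome on the judge score (rather than the reverse) is an assumption about the direction of the report's `rewardModel`.

```ts
// Arithmetic mean of a non-empty array.
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length
}

// Hypothetical helper: Pearson r and ordinary least squares for
// outcome ≈ intercept + slope * judgeScore.
function linearRewardModel(
  judge: number[],
  outcome: number[],
): { pearson: number; slope: number; intercept: number; r2: number } {
  if (judge.length !== outcome.length || judge.length < 2) {
    throw new Error('need at least two paired observations')
  }
  const mx = mean(judge)
  const my = mean(outcome)
  let sxx = 0
  let syy = 0
  let sxy = 0
  for (let i = 0; i < judge.length; i++) {
    const dx = judge[i] - mx
    const dy = outcome[i] - my
    sxx += dx * dx
    syy += dy * dy
    sxy += dx * dy
  }
  if (sxx === 0 || syy === 0) throw new Error('zero variance in one of the series')
  const pearson = sxy / Math.sqrt(sxx * syy)
  const slope = sxy / sxx
  const intercept = my - slope * mx
  // With a single predictor, R^2 equals the squared Pearson correlation.
  return { pearson, slope, intercept, r2: pearson * pearson }
}

// Invented, perfectly linear data: outcome = 0.1 + 2 * judge.
const judgeScores = [0.2, 0.4, 0.6, 0.8]
const outcomes = [0.5, 0.9, 1.3, 1.7]
const rewardFit = linearRewardModel(judgeScores, outcomes)
console.log(rewardFit)
```

The same identity explains the sample packet: a Pearson correlation of 0.72 squares to roughly 0.52, the reported `r2`.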
145
+
146
+ ### Existing data — intake adapters
147
+
148
+ You have data already. Don't reshape it — pipe it through an adapter.
165
149
 
166
150
  ```ts
167
- import { runEvalCampaign } from '@tangle-network/agent-eval'
168
151
  import {
169
- extractPreferences,
170
- extractVerifiableReward,
171
- filterDeterministicallyRewarded,
172
- offPolicyEstimateAll,
173
- analyzeOptimizationResult,
174
- } from '@tangle-network/agent-eval/rl'
175
-
176
- // 1. Run a matrix of variants × scenarios with capture integrity by construction.
177
- const campaign = await runEvalCampaign({ variants, scenarios, run })
178
-
179
- // 2. Convert outcomes into RL signal.
180
- const rewards = extractVerifiableReward(campaign.runs) // compile/test/schema
181
- const prefs = extractPreferences(campaign.runs) // (chosen, rejected) triples
182
- const clean = filterDeterministicallyRewarded(rewards) // judge-noise free
183
-
184
- // 3. Estimate a candidate policy's value without re-running.
185
- const ope = offPolicyEstimateAll(campaign.runs, candidatePolicy) // IPS + SNIPS + DR
186
-
187
- // 4. Or close the loop end-to-end: score → reflect → mutate → re-run.
188
- const next = await analyzeOptimizationResult(campaign, { researcher })
152
+ fromFeedbackTable,
153
+ fromOtelSpans,
154
+ analyzeRuns,
155
+ } from '@tangle-network/agent-eval/contract'
156
+
157
+ // Multi-rater approve/reject (Obsidian tags, Sheets, CSV, Postgres).
158
+ const { runs, raterScores } = fromFeedbackTable({
159
+ ratings: parseYourFeedbackTable(), // Array<{ runId, rater, rating }>
160
+ })
161
+ await analyzeRuns({ runs, raterScores })
162
+
163
+ // Production OTel traces — group by tangle.runId or traceId.
164
+ const runs2 = fromOtelSpans({ spans: yourOtelStream })
165
+ await analyzeRuns({ runs: runs2 })
189
166
  ```
190
167
 
191
- | Step | Primitive | Subpath |
192
- | --- | --- | --- |
193
- | Eval matrix with integrity | `runEvalCampaign` | `/` |
194
- | Deterministic re-judge / audit | `ReplayCache`, `createReplayFetch` | `/` |
195
- | Anytime-valid α across rolling looks | `pairedEvalueSequence` | `/reporting` |
196
- | Judge quality vs gold | `calibrateJudge` (κ, Pearson, MAE, bias probes) | `/` |
197
- | Continuous inter-rater agreement | `calibrateJudgeContinuous`, `continuousAgreement` (κ_w, ICC(2,1), bootstrap CIs) | `/` |
198
- | (chosen, rejected) for DPO/KTO/PPO | `extractPreferences` | `/rl` |
199
- | Verifiable reward signal | `extractVerifiableReward` | `/rl` |
200
- | Step-level / PRM training data | `extractStepRewards`, `prmTrainingPairs` | `/rl` |
201
- | Estimate policy value off-policy | `offPolicyEstimateAll` (IPS + SNIPS + DR) | `/rl` |
202
- | GEPA / reflective prompt mutation | `buildReflectionPrompt`, `parseReflectionResponse`, Ax-GEPA `SteeringOptimizer` | `/` `/optimization` |
203
- | Auto-research (read runs → propose) | `analyzeOptimizationResult`, `PredictiveValidityResearcher` | `/rl` |
204
- | Active curriculum (variance / Thompson) | `allocateCurriculum` | `/rl` |
205
- | Tournament ratings (Bradley-Terry + Elo) | `fitBradleyTerry`, `applyEloUpdate` | `/rl` |
206
- | Adversarial scenario search | `adversarialScenarioSearch` | `/rl` |
207
- | Contamination probe (held-out perturb) | `runContaminationProbe` | `/rl` |
208
- | Reward hacking signatures | `detectRewardHacking` | `/rl` |
209
- | Compute curves (best-of-N, self-consist, Pareto) | `runComputeCurve`, `bestOfN`, `selfConsistency`, `paretoFrontier` | `/rl` |
210
- | Knowledge gap separated from reasoning gap | `scoreKnowledgeReadiness` | `/` |
211
- | Release gate (paired evidence + holdouts) | `evaluateReleaseConfidence`, `HeldOutGate` | `/reporting` |
212
- | Launch report (decision-grade) | `renderReleaseReport`, `researchReport` | `/reporting` |
213
-
214
- ## Import Paths
215
-
216
- | Subpath | Use for |
217
- | --- | --- |
218
- | `@tangle-network/agent-eval/contract` | **LAND-tier surface** — `selfImprove`, `runCampaign`, `runImprovementLoop`, `runEval`, `Dispatch`, `Mutator`, `Gate`, `defaultProductionGate`, `gepaDriver`, `diffRuns`, storage backends. New code starts here. |
219
- | `@tangle-network/agent-eval/hosted` | **EXPAND-tier surface** — `createHostedClient`, wire-format types, `HOSTED_WIRE_VERSION`. Ships eval-run events + trace spans to any orchestrator that speaks the spec. |
220
- | `@tangle-network/agent-eval/adapters/otel` | OTel→hosted bridge — `createOtelBridge` forwards OTel-shape spans (TraceAI, OpenLLMetry, OTel SDK) into the hosted-tier ingest. |
221
- | `@tangle-network/agent-eval/adapters/langchain` | LangChain executor adapter — wrap a LangChain runnable as a `Dispatch`. |
222
- | `@tangle-network/agent-eval/adapters/http` | Distributed driver — `httpDispatch` + `runDispatchServer` for cross-machine campaigns. |
223
- | `@tangle-network/agent-eval/campaign` | Lower-level campaign primitives — `runCampaign`, driver implementations, storage. |
224
- | `@tangle-network/agent-eval/multishot` | Multi-shot optimization primitives. |
225
- | `@tangle-network/agent-eval/control` | `observe → validate → decide → act`, action policy, propose/review loops |
226
- | `@tangle-network/agent-eval/traces` | trace stores, emitters, TraceAnalyst, replay |
227
- | `@tangle-network/agent-eval/optimization` | feedback trajectories, multi-shot, prompt evolution, GEPA, EvalCampaign |
228
- | `@tangle-network/agent-eval/reporting` | release confidence, paired stats, sequential e-values, launch reports |
229
- | `@tangle-network/agent-eval/rl` | adapters, verifiable rewards, preferences, OPE, PRM, contamination, tournaments, adversarial, compute curves, auto-research |
230
- | `@tangle-network/agent-eval/wire` | HTTP/RPC server + schemas (same protocol the Python client speaks) |
231
- | `@tangle-network/agent-eval/benchmarks` | benchmark adapter contracts and reference wrappers |
232
- | `@tangle-network/agent-eval/matrix` | N-axis cartesian runner over substrate types — see [`src/matrix/`](./src/matrix/) |
233
-
234
- The root export remains available for convenience; new code should prefer
235
- focused subpaths. Anything under `/rl`, `/pipelines`, `/meta-eval`, `/prm`,
236
- or `/builder-eval` is only reachable via its subpath.
237
-
238
- ## API stability
239
-
240
- Public exports are tagged with JSDoc stability markers so consumers can see
241
- status at the call site (IDE hover, language server, declaration files).
168
+ Both intake adapters preserve every signal in the source — multi-rater scores stay rater-keyed so the report can compute inter-rater agreement and surface the disagreement triage list.
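
One way to picture why rater-keyed scores matter: with each rater's score kept separate, runs can be ranked by how much the raters disagree. The sketch below assumes numeric ratings and uses max-minus-min spread as the disagreement measure; `rankBySpread` and the `Rating` shape are illustrative, not the package's API.

```ts
// Assumed row shape, following the `{ runId, rater, rating }` comment above,
// with `rating` taken to be numeric for this sketch.
interface Rating { runId: string; rater: string; rating: number }
interface Disagreement { runId: string; raters: number; spread: number }

// Hypothetical helper: group ratings by run, then rank runs by score spread.
function rankBySpread(ratings: Rating[]): Disagreement[] {
  const byRun = new Map<string, number[]>()
  for (const row of ratings) {
    const scores = byRun.get(row.runId) ?? []
    scores.push(row.rating)
    byRun.set(row.runId, scores)
  }
  const ranked: Disagreement[] = []
  byRun.forEach((scores, runId) => {
    ranked.push({
      runId,
      raters: scores.length,
      spread: Math.max(...scores) - Math.min(...scores),
    })
  })
  // Largest spread first: these are the runs most worth a second look.
  return ranked.sort((a, b) => b.spread - a.spread)
}

// Invented ratings from three raters.
const sampleRatings: Rating[] = [
  { runId: 'run-1', rater: 'a', rating: 1 },
  { runId: 'run-1', rater: 'b', rating: 1 },
  { runId: 'run-1', rater: 'c', rating: 1 },
  { runId: 'run-2', rater: 'a', rating: 1 },
  { runId: 'run-2', rater: 'b', rating: 0 },
  { runId: 'run-2', rater: 'c', rating: 1 },
  { runId: 'run-3', rater: 'a', rating: 0.5 },
  { runId: 'run-3', rater: 'b', rating: 0.75 },
]
const triage = rankBySpread(sampleRatings)
console.log(triage.map((d) => `${d.runId}: ${d.spread}`))
```

If the ratings were averaged per run before analysis, `run-2` would look like an ordinary mid-scoring run and the disagreement would be lost.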
242
169
 
243
- | Tag | Meaning |
244
- | --- | --- |
245
- | `@stable` | API frozen at this major. Breaking changes require a major bump. |
246
- | `@experimental` | Interface may evolve before becoming `@stable`. Pin the patch version if you depend on it. |
247
- | `@internal` | Not part of the public contract. Use the documented subpath instead. |
170
+ ---
248
171
 
249
- The `/rl` subpath is the most active surface. See
250
- [`src/rl/index.ts`](./src/rl/index.ts) for the current stable/experimental
251
- breakdown.
172
+ ## How it compares
252
173
 
253
- ## Capture integrity
174
+ | | LangSmith | Braintrust | Phoenix | **agent-eval** |
175
+ |---|:---:|:---:|:---:|:---:|
176
+ | Closed-loop self-improvement | ✱ human-in-loop | ✱ experiment-driven | — | ✓ autonomous + gated |
177
+ | Statistical lift CI (paired bootstrap) | — | partial | — | ✓ |
178
+ | Judge calibration + bias detection | — | — | — | ✓ |
179
+ | Inter-rater agreement + disagreement triage | — | — | — | ✓ |
180
+ | Contamination / canary check | — | — | — | ✓ |
181
+ | AI-driven failure clustering | partial | — | partial | ✓ |
182
+ | Cost-quality Pareto | — | — | — | ✓ |
183
+ | Multi-language clients (TS + Python) | TS + Py | TS + Py | TS + Py | ✓ TS + Py |
184
+ | Self-hostable / no-SaaS option | — | — | OSS | ✓ MIT, OSS |
185
+ | Substrate vs SaaS shape | SaaS | SaaS | OSS server | **library** |
186
+ | Hosted tier (optional) | required | required | optional | optional |
254
187
 
255
- Launch-grade benchmark runs need four things that are easy to forget in glue
256
- code: (1) raw HTTP capture alongside the structured spans so a reviewer can
257
- verify which route answered, (2) a preflight assertion that the configured
258
- client points at the intended provider, (3) a run-end assertion that the
259
- expected events were actually written, and (4) auto-execution of the trace
260
- analyst as part of the run lifecycle.
188
+ Position: agent-eval is the **substrate** (one library, decision-grade output); the others are SaaS *around* the substrate. If you want a closed loop that ships your prompt under statistical confidence, you call agent-eval. If you want a dashboard rendered from your data, you pipe agent-eval into the hosted tier or your own renderer.
261
189
 
262
- ```ts
263
- import {
264
- TraceEmitter, FileSystemRawProviderSink, callLlm, assertLlmRoute,
265
- assertRunCaptured, throwIfRunIncomplete,
266
- } from '@tangle-network/agent-eval'
267
- import { traceAnalystOnRunComplete } from '@tangle-network/agent-eval/traces'
190
+ ---
268
191
 
269
- const sink = new FileSystemRawProviderSink({ dir: `${workDir}/raw-events` })
270
- assertLlmRoute(llmOpts, { requireExplicitBaseUrl: true, allowedBaseUrls, requireAuth: true })
192
+ ## Customer journeys
271
193
 
272
- const emitter = new TraceEmitter(store, {
273
- onRunComplete: [traceAnalystOnRunComplete({ analyze: analystOpts, save })],
274
- })
275
- await emitter.startRun(/* ... */)
276
- // LLM calls flow through callLlm with `{ rawSink: sink, traceContext: { runId, spanId } }`.
277
- await emitter.endRun({ pass, score })
194
+ Three runnable examples: each is self-contained, and each shows the actual output.
278
195
 
279
- throwIfRunIncomplete(await assertRunCaptured(store, emitter.runId, {
280
- llmSpansMin: 1, rawSink: sink, requireRawCoverageOfLlmSpans: true, requireOutcome: true,
281
- }))
282
- ```
196
+ | Journey | Example | Who it's for |
197
+ |---|---|---|
198
+ | **Closed loop** — improve a prompt under statistical confidence | [`examples/selfimprove-quickstart/`](./examples/selfimprove-quickstart/) | Teams with scenarios + judges + agent in hand |
199
+ | **Multi-rater feedback corpus** — turn Obsidian/Sheets/CSV ratings into actionable insights | [`examples/customer-feedback-loop/`](./examples/customer-feedback-loop/) | Teams reviewing AI outputs by hand who want to compress that taste into per-member LLM judges + close the loop |
200
+ | **Production OTel traces** — analyze logs you already have, no closed loop required | [`examples/customer-otel-traces/`](./examples/customer-otel-traces/) | Teams running agents in prod with observability, no eval discipline yet |
283
201
 
284
- Directives, rationale, and shipped-bug context are in
285
- [`SKILL.md` § Capture integrity](./.claude/skills/agent-eval/SKILL.md#capture-integrity-required-for-launch-grade-adoption).
286
-
287
- ## Examples
288
-
289
- Each example has its own README with what it demonstrates, expected output,
290
- and runtime. See [`examples/`](./examples/).
291
-
292
- - [`examples/multi-shot-optimization`](./examples/multi-shot-optimization/README.md):
293
- optimize full trajectories with held-out promotion.
294
- - [`examples/same-sandbox-harness`](./examples/same-sandbox-harness/README.md):
295
- run setup/build/test and evidence checks in one workspace.
296
- - [`examples/benchmarks`](./examples/benchmarks/README.md):
297
- benchmark adapter shape and reference wrappers.
298
- - [`examples/auto-research-with-agent-builder`](./examples/auto-research-with-agent-builder/README.md):
299
- closed loop — score, reflect, mutate, re-score, repeat.
300
- - [`examples/fine-tune-with-prime-rl`](./examples/fine-tune-with-prime-rl/README.md):
301
- RunRecord → preferences → trainer (prime-rl) → next campaign.
302
- - [`examples/production-loop`](./examples/production-loop/README.md):
303
- ingest prod traces + feedback, cluster failures, evolve, gate, open a PR.
304
-
305
- ## Matrix
306
-
307
- `@tangle-network/agent-eval/matrix` is an N-axis cartesian runner over the
308
- substrate types you already use — `AgentProfile` from
309
- `@tangle-network/sandbox`, `Driver` / `Validator` from
310
- `@tangle-network/agent-runtime`, rubric records, anything. It does not wrap
311
- substrate types; the caller passes them in axis values, the runner iterates
312
- the cartesian, and the aggregator returns per-axis pass / score / cost /
313
- duration summaries.
202
+ Each example consists of a `README.md` and a single `index.ts` runnable via `pnpm tsx`; it prints the resulting `InsightReport` to stdout.
314
203
 
315
- ```ts
316
- import { runAgentMatrix } from '@tangle-network/agent-eval/matrix'
317
-
318
- const result = await runAgentMatrix({
319
- axes: [
320
- { name: 'scenario', values: scenarios.map((s) => ({ id: s.id, value: s })) },
321
- { name: 'profile', values: profiles.map((p) => ({ id: p.name, value: p })) },
322
- { name: 'thinking', values: [
323
- { id: 'low', value: 'low' }, { id: 'high', value: 'high' },
324
- ] },
325
- ],
326
- reps: 3,
327
- maxConcurrency: 4,
328
- costCeiling: 5.0,
329
- filter: (cell) => !(cell.axes.scenario.value.hard === 5 && cell.axes.thinking.id === 'low'),
330
- runCell: async (cell) => runScenario(cell.axes.scenario.value, cell.axes.profile.value),
331
- })
204
+ ---
332
205
 
333
- console.log(result.byAxis.profile) // per-profile passRate / meanScore / p90 / cost
334
- ```
206
+ ## Subpath entry points
335
207
 
336
- See [`src/matrix/`](./src/matrix/) for the full surface.
208
+ | Subpath | What it gives you |
209
+ |---|---|
210
+ | `@tangle-network/agent-eval/contract` | **The headline surface.** `selfImprove`, `analyzeRuns`, `runImprovementLoop`, `runCampaign`, `runEval`, `diffRuns`, intake adapters (`fromFeedbackTable`, `fromOtelSpans`), drivers (`gepaDriver`, `evolutionaryDriver`), gates (`defaultProductionGate`, `heldOutGate`, `composeGate`), storage. **New code starts here.** |
211
+ | `@tangle-network/agent-eval/hosted` | Hosted-tier wire-format types + `createHostedClient` to ship eval-run events + trace spans to any orchestrator speaking the spec |
212
+ | `@tangle-network/agent-eval/adapters/otel` | `createOtelBridge` — forwards OpenTelemetry-shape spans into the hosted-tier ingest |
213
+ | `@tangle-network/agent-eval/adapters/langchain` | LangChain runnable → `Dispatch` adapter |
214
+ | `@tangle-network/agent-eval/adapters/http` | `httpDispatch` + `runDispatchServer` for distributed campaigns across machines |
215
+ | `@tangle-network/agent-eval/campaign` | Lower-level campaign primitives (storage, drivers, types) |
216
+ | `@tangle-network/agent-eval/multishot` | N-shot persona × shot matrix runner |
217
+ | `@tangle-network/agent-eval/control` | Agent control loop primitives (`runAgentControlLoop`, action policy, propose/review) |
218
+ | `@tangle-network/agent-eval/traces` | Trace stores, emitters, OTLP-JSONL replay |
219
+ | `@tangle-network/agent-eval/reporting` | Release confidence, paired stats, sequential e-values, launch reports |
220
+ | `@tangle-network/agent-eval/rl` | RL bridge — verifiable rewards, preferences, OPE, PRM, tournaments, contamination, compute curves, auto-research |
221
+ | `@tangle-network/agent-eval/matrix` | N-axis cartesian over substrate types |
222
+ | `@tangle-network/agent-eval/wire` | HTTP/RPC server + Zod schemas (same protocol the Python client speaks) |
223
+ | `@tangle-network/agent-eval/benchmarks` | Benchmark adapter contracts and reference wrappers |
337
224
 
338
- ## Docs
225
+ The root export remains available for backward compatibility; new code should prefer focused subpaths. Anything under `/rl`, `/pipelines`, `/meta-eval`, `/prm`, or `/builder-eval` is **only** reachable via its subpath.
339
226
 
340
- Read in this order:
227
+ ---
341
228
 
342
- 1. [Concepts](./docs/concepts.md) mental model, 5 min
343
- 2. [Product Eval Adoption](./docs/product-eval-adoption.md)
344
- 3. [Control Runtime](./docs/control-runtime.md)
345
- 4. [Feedback Trajectories](./docs/feedback-trajectories.md)
346
- 5. [Multi-Shot Optimization](./docs/multi-shot-optimization.md)
347
- 6. [Trace Analysis](./docs/trace-analysis.md)
348
- 7. [Knowledge Readiness](./docs/knowledge-readiness.md)
349
- 8. [Integration Launch Gates](./docs/integration-launch-gates.md)
350
- 9. [Wire Protocol](./docs/wire-protocol.md) — required for non-TypeScript consumers
229
+ ## Concepts + design
351
230
 
352
- ## CLI / Wire Protocol
231
+ - [`docs/concepts.md`](./docs/concepts.md) — five types, three top-level functions, the layering rule, the wire protocol contract
232
+ - [`docs/insight-report.md`](./docs/insight-report.md) — annotated walkthrough of every section of the decision packet
233
+ - [`docs/customer-journeys.md`](./docs/customer-journeys.md) — three end-to-end journeys with code + expected output
234
+ - [`docs/adapters-observability.md`](./docs/adapters-observability.md) — composing agent-eval with LangSmith, Langfuse, Phoenix, OpenLLMetry, TraceAI
235
+ - [`docs/wire-protocol.md`](./docs/wire-protocol.md) — the HTTP/RPC contract Python (and any future language) speaks
236
+ - [`docs/hosted-ingest-spec.md`](./docs/hosted-ingest-spec.md) — the hosted-tier wire format, frozen at `2026-05-26.v1`
237
+ - [`docs/design/`](./docs/design/) — RFCs + architectural notes
353
238
 
354
- ```sh
355
- npm i -g @tangle-network/agent-eval
356
- agent-eval serve --port 5005
239
+ The `.claude/skills/agent-eval/SKILL.md` skill ships embedded directives so LLM agents writing integration code don't reintroduce historical bug classes.
240
+
241
+ ---
242
+
243
+ ## Hosted tier
244
+
245
+ Wire your loop to a hosted orchestrator (ours, or your own implementation of the spec) with one config:
246
+
247
+ ```ts
248
+ await selfImprove({
249
+ scenarios, dispatch, judges, baselineSurface,
250
+ hostedTenant: {
251
+ endpoint: 'https://intelligence.tangle.tools',
252
+ apiKey: process.env.TANGLE_API_KEY!,
253
+ tenantId: 'your-tenant',
254
+ },
255
+ })
357
256
  ```
358
257
 
359
- Python:
258
+ The substrate runs the loop in your process. Only the eval-run events + (optional) trace spans go to the orchestrator. Your scenarios, your judges, your raw data — never sent. Spec at [`docs/hosted-ingest-spec.md`](./docs/hosted-ingest-spec.md); reference receiver at [`examples/hosted-ingest-server/`](./examples/hosted-ingest-server/).
259
+
260
+ ---
261
+
262
+ ## Install + run
360
263
 
361
264
  ```sh
265
+ pnpm add @tangle-network/agent-eval
266
+ # or, from Python:
362
267
  pip install agent-eval-rpc
363
268
  ```
364
269
 
365
- ```py
366
- from agent_eval_rpc import Client
367
- client = Client() # auto-detects HTTP server, falls back to subprocess
368
- score = await client.judge(content=output, rubric_name="anti-slop")
369
- ```
270
+ Run an example:
370
271
 
371
- TypeScript is the source of truth. Python is a thin transport client over the
372
- generated OpenAPI schema. Schema drift is enforced impossible at release time
373
- (version-locked CI).
272
+ ```sh
273
+ pnpm tsx examples/selfimprove-quickstart/index.ts
274
+ pnpm tsx examples/customer-feedback-loop/index.ts
275
+ pnpm tsx examples/customer-otel-traces/index.ts
276
+ ```
374
277
 
375
- ## Development
278
+ Run the test suite:
376
279
 
377
280
  ```sh
378
281
  pnpm install
379
- pnpm typecheck
282
+ pnpm build
380
283
  pnpm test
381
- pnpm lint # biome
382
- pnpm build # tsup + openapi.json
383
284
  ```
384
285
 
385
- ## Related Packages
286
+ ---
287
+
288
+ ## Stability + versioning
289
+
290
+ Public exports carry JSDoc stability markers visible in IDE hover + `.d.ts`:
291
+
292
+ | Tag | Meaning |
293
+ |---|---|
294
+ | `@stable` | API frozen at this major. Breaking changes require a major bump. |
295
+ | `@experimental` | Interface may evolve before becoming `@stable`. Pin the patch version if you depend on it. |
296
+ | `@internal` | Not part of the public contract. Use the documented subpath instead. |
386
297
 
387
- - [`@tangle-network/agent-runtime`](https://www.npmjs.com/package/@tangle-network/agent-runtime):
388
- production session/runtime layer.
389
- - [`@tangle-network/agent-knowledge`](https://www.npmjs.com/package/@tangle-network/agent-knowledge):
390
- source-grounded knowledge bases and readiness.
391
- - [`@tangle-network/agent-integrations`](https://www.npmjs.com/package/@tangle-network/agent-integrations):
392
- connection, grant, capability, and integration invocation contracts.
298
+ [`CHANGELOG.md`](./CHANGELOG.md) tracks every release with what's new / additive / breaking.
393
299
 
394
- Together: `agent-runtime` is where the agent runs; `agent-knowledge` is what
395
- it knows; `agent-integrations` is what it can do; `agent-eval` is how it gets
396
- better.
300
+ ---
397
301
 
398
302
  ## License
399
303
 
400
- MIT
304
+ MIT. See [`LICENSE`](./LICENSE).
package/dist/openapi.json CHANGED
@@ -2,7 +2,7 @@
2
2
  "openapi": "3.1.0",
3
3
  "info": {
4
4
  "title": "@tangle-network/agent-eval — wire protocol",
5
- "version": "0.50.0",
5
+ "version": "0.50.1",
6
6
  "description": "HTTP and stdio RPC interface to agent-eval. The TypeScript runtime is the source of truth; this spec is the contract that cross-language clients (Python, Rust, Go) generate from.\n\nWire-protocol version: 1.0.0. Bumps on breaking changes to request/response schemas.",
7
7
  "contact": {
8
8
  "name": "Tangle Network",