@tangle-network/agent-eval 0.20.11 → 0.21.0

Files changed (59)
  1. package/CHANGELOG.md +76 -0
  2. package/README.md +137 -170
  3. package/dist/benchmarks/index.d.ts +2 -1
  4. package/dist/{chunk-JAOLXRIA.js → chunk-3GN6U53I.js} +205 -4
  5. package/dist/chunk-3GN6U53I.js.map +1 -0
  6. package/dist/chunk-3IX6QTB7.js +1349 -0
  7. package/dist/chunk-3IX6QTB7.js.map +1 -0
  8. package/dist/chunk-5IIQKMD5.js +236 -0
  9. package/dist/chunk-5IIQKMD5.js.map +1 -0
  10. package/dist/chunk-ARZ6BEV6.js +1310 -0
  11. package/dist/chunk-ARZ6BEV6.js.map +1 -0
  12. package/dist/chunk-HRZELXCR.js +1354 -0
  13. package/dist/chunk-HRZELXCR.js.map +1 -0
  14. package/dist/chunk-KRR4VMH7.js +423 -0
  15. package/dist/chunk-KRR4VMH7.js.map +1 -0
  16. package/dist/chunk-SNUHRBDL.js +154 -0
  17. package/dist/chunk-SNUHRBDL.js.map +1 -0
  18. package/dist/chunk-WOK2RTWG.js +1920 -0
  19. package/dist/chunk-WOK2RTWG.js.map +1 -0
  20. package/dist/{chunk-LSR4IAYN.js → chunk-WOPGKVN4.js} +2 -2
  21. package/dist/chunk-YUFXO3TU.js +148 -0
  22. package/dist/chunk-YUFXO3TU.js.map +1 -0
  23. package/dist/cli.js +3 -2
  24. package/dist/cli.js.map +1 -1
  25. package/dist/control-cxwMOAsy.d.ts +259 -0
  26. package/dist/control.d.ts +6 -0
  27. package/dist/control.js +30 -0
  28. package/dist/control.js.map +1 -0
  29. package/dist/dataset-B9qvlm_o.d.ts +112 -0
  30. package/dist/emitter-B2XqDKFU.d.ts +121 -0
  31. package/dist/feedback-trajectory-CB0A32o3.d.ts +346 -0
  32. package/dist/{index-1PZOtZFr.d.ts → index-c5saLbKD.d.ts} +2 -133
  33. package/dist/index.d.ts +178 -2945
  34. package/dist/index.js +1066 -6185
  35. package/dist/index.js.map +1 -1
  36. package/dist/multi-shot-optimization-Bvtz294B.d.ts +598 -0
  37. package/dist/openapi.json +1 -1
  38. package/dist/optimization.d.ts +146 -0
  39. package/dist/optimization.js +60 -0
  40. package/dist/optimization.js.map +1 -0
  41. package/dist/reporting-Da2ihlcM.d.ts +672 -0
  42. package/dist/reporting.d.ts +5 -0
  43. package/dist/reporting.js +36 -0
  44. package/dist/reporting.js.map +1 -0
  45. package/dist/run-record-CX_jcAyr.d.ts +134 -0
  46. package/dist/store-u47QaJ9G.d.ts +297 -0
  47. package/dist/traces.d.ts +914 -0
  48. package/dist/traces.js +120 -0
  49. package/dist/traces.js.map +1 -0
  50. package/dist/wire/index.js +3 -2
  51. package/docs/concepts.md +16 -11
  52. package/docs/feature-guide.md +10 -17
  53. package/docs/integration-launch-gates.md +77 -0
  54. package/docs/product-eval-adoption.md +27 -0
  55. package/docs/research-report-methodology.md +155 -0
  56. package/docs/trace-analysis.md +75 -0
  57. package/package.json +30 -12
  58. package/dist/chunk-JAOLXRIA.js.map +0 -1
  59. package/dist/{chunk-LSR4IAYN.js.map → chunk-WOPGKVN4.js.map} +0 -0
package/CHANGELOG.md CHANGED
@@ -1,5 +1,81 @@
  # Changelog
 
+ ## 0.21.0 — capture integrity + launch-grade reporting
+
+ This release closes the layer-1 gap a downstream consumer surfaced: better
+ post-run statistics don't help if the underlying data wasn't captured. 0.21
+ adds first-class raw provider-event capture, a fail-loud route guard, a
+ run-completion integrity check, and run-complete hooks (with a trace-analyst
+ auto-execution helper) so a direct matrix run produces complete forensics
+ without out-of-band glue.
+
+ ### Added
+
+ - **`RawProviderSink` (capture).** First-class persistence for HTTP-level
+ provider request / response / error payloads alongside the structured
+ `LlmSpan`. `InMemoryRawProviderSink`, `FileSystemRawProviderSink` (NDJSON,
+ rolls at 32 MiB), and `NoopRawProviderSink` ship in core. Default redactor
+ strips `Authorization` / `X-Api-Key` / `Cookie` headers and credential-shaped
+ body fields (`apiKey`, `bearer`, `password`, `secret`, `token`); redacted
+ paths are recorded on `event.redactedFields` so a reviewer can see what was
+ stripped without exposing values. Wired into `callLlm` via
+ `LlmClientOptions.rawSink` — every retry attempt produces a `request` and
+ either a `response` or `error` event with the attempt index attached.
+ - **`assertLlmRoute` (route guard).** Pure function that throws
+ `LlmRouteAssertionError` when the configured client doesn't match the
+ caller's route requirements: `requireExplicitBaseUrl`, `allowedBaseUrls`,
+ `blockedBaseUrls`, `requireAuth`, `expectedProvider`. Designed for the
+ matrix-runner preflight — fail loud at the boundary instead of silently
+ falling back to the public/free-tier router.
+ - **`assertRunCaptured` (integrity check).** Read-only check on
+ `(store, runId, expectations)` that returns a structured
+ `RunIntegrityReport` with issue codes (`missing_llm_spans`,
+ `missing_raw_events`, `orphan_llm_span`, `no_raw_sink`, `missing_outcome`,
+ …). Pair with the new `requireRawCoverageOfLlmSpans` to assert every
+ `LlmSpan` has a matching raw `request` event. Use directly or via
+ `throwIfRunIncomplete` for strict mode.
+ - **`onRunComplete` hooks on `TraceEmitter`.** New
+ `TraceEmitterOptions.onRunComplete` array fires after `endRun` / `abortRun`
+ with full run context (run id, outcome, status, store, emitter). Errors are
+ swallowed and recorded as `log` events by default; opt into propagation via
+ `hookErrors: 'throw'`. `addRunCompleteHook` attaches hooks after construction.
+ - **`traceAnalystOnRunComplete` factory.** Drop-in run-complete hook that
+ runs `analyzeTraces` after each run and persists the result. Resolves the
+ "trace analyst never ran on this matrix sweep" complaint by making
+ auto-execution declarative.
+ - **`researchReport`** — executive research-report layer for coding-vertical
+ benchmark runs (originally landed in #34, elevated in #35). Composes
+ `summaryTable`, `paretoChart`, `gainHistogram`, held-out gate decisions,
+ and optional `failureClusterView` output into one structured artifact:
+ promote / hold / equivalent / reject / needs-more-data guidance with
+ rationale, risks, next actions, markdown, HTML, and JSON chart specs.
+ - Decisions are made on paired evidence — never on marginal means alone.
+ - ROPE (Region of Practical Equivalence) supported via the `rope` option.
+ - Bayesian-bootstrap-style `Pr(Δ>0)` and `Pr(Δ∈ROPE)` summaries (Rubin 1981).
+ - Per-candidate minimum detectable paired effect via `pairedMde`.
+ - SHA-256 `runFingerprint` and optional `preregistrationHash` linking a
+ signed `HypothesisManifest`.
+ - Embedded methodology + `docs/research-report-methodology.md` companion.
+ - **`pairedMde`** in `power-analysis`: closed-form minimum detectable paired
+ effect (inverse to the paired-t / sign-rank power formula).
+
+ ### Changed
+
+ - `researchReport` is async (uses Web Crypto via `hashJson` for the run
+ fingerprint).
+ - Default `researchReport.minPairs` is 20 (soft floor); hard floor of 6 is
+ enforced regardless via `RESEARCH_REPORT_HARD_PAIR_FLOOR`.
+
+ ### Wire-protocol consumers
+
+ No wire-protocol changes. The new capture / integrity / hook primitives are
+ TypeScript-only; cross-language consumers continue to use the existing RPC
+ surface.
+
+ ### Python client
+
+ Locked at `tangle-agent-eval==0.21.0` to match the npm package.
+
  ## 0.20.10 — hardening audit follow-up
 
  ### Fixed
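The default redaction behaviour described for `RawProviderSink` in the 0.21.0 changelog entry above can be pictured with a small, self-contained sketch. This is not the package's implementation: `redactEvent`, `redactBody`, the `RawEvent` shape, the `[REDACTED]` placeholder, and the case-insensitive exact-match key comparison are all assumptions made for illustration. The changelog only states which headers and field names are stripped and that the redacted paths are recorded on `event.redactedFields`.

```typescript
// Hypothetical sketch of a redactor in the spirit of the changelog entry.
// None of these names come from @tangle-network/agent-eval.
type RawEvent = {
  headers: Record<string, string>
  body: unknown
}

type RedactedEvent = RawEvent & { redactedFields: string[] }

// Header and body-field names listed in the changelog, lower-cased for matching.
const SENSITIVE_HEADERS = new Set(['authorization', 'x-api-key', 'cookie'])
const SENSITIVE_KEYS = new Set(['apikey', 'bearer', 'password', 'secret', 'token'])
const REDACTED = '[REDACTED]' // assumption: values are masked rather than dropped

// Walk the body recursively, masking sensitive keys and recording their paths.
function redactBody(value: unknown, path: string, redacted: string[]): unknown {
  if (Array.isArray(value)) {
    return value.map((item, i) => redactBody(item, `${path}[${i}]`, redacted))
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {}
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      const childPath = `${path}.${key}`
      if (SENSITIVE_KEYS.has(key.toLowerCase())) {
        redacted.push(childPath)
        out[key] = REDACTED
      } else {
        out[key] = redactBody(child, childPath, redacted)
      }
    }
    return out
  }
  return value
}

// Mask sensitive headers first, then the body; return a copy plus the path list.
function redactEvent(event: RawEvent): RedactedEvent {
  const redactedFields: string[] = []
  const headers: Record<string, string> = {}
  for (const [name, value] of Object.entries(event.headers)) {
    if (SENSITIVE_HEADERS.has(name.toLowerCase())) {
      redactedFields.push(`headers.${name}`)
      headers[name] = REDACTED
    } else {
      headers[name] = value
    }
  }
  const body = redactBody(event.body, 'body', redactedFields)
  return { headers, body, redactedFields }
}

const example = redactEvent({
  headers: { Authorization: 'Bearer sk-live-123', 'Content-Type': 'application/json' },
  body: { model: 'm', auth: { apiKey: 'abc' }, messages: [{ role: 'user', content: 'hi' }] },
})
console.log(example.redactedFields)
// logs [ 'headers.Authorization', 'body.auth.apiKey' ]
```

Recording the path rather than the value is what lets a reviewer confirm that a credential was present on a request without the capture file ever containing it.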
package/README.md CHANGED
@@ -1,65 +1,24 @@
  # @tangle-network/agent-eval
 
- Evaluation infrastructure for agent systems.
-
- `agent-eval` gives agent products a reusable way to record what happened,
- verify outcomes, classify failures, compare variants, optimize prompts or
- policies, and make release decisions from evidence instead of anecdotes.
-
- It does not own your product state, credentials, UI, or model routing. Product
- teams keep those boundaries; this package standardizes how runs are recorded,
- checked, compared, and promoted.
-
- ## Contents
-
- - [When To Use It](#when-to-use-it)
- - [Architecture](#architecture)
- - [Install](#install)
- - [Quick Start](#quick-start)
- - [Core Primitives](#core-primitives)
- - [Adoption Path](#adoption-path)
- - [Examples](#examples)
- - [Documentation](#documentation)
- - [Development](#development)
- - [Related Packages](#related-packages)
-
- ## When To Use It
-
- Use `agent-eval` when you need one or more of these:
-
- - A reproducible eval harness for coding agents, builder agents, or multi-tool
- workflows.
- - Structured traces for agent runs: spans, artifacts, events, budgets, tool
- calls, retrieval, judge output, and sandbox execution.
- - Deterministic gates around build/test/deploy checks.
- - LLM-as-judge or deterministic judge fleets with calibration and canaries.
- - Dataset splits, holdouts, paired statistics, and release confidence gates.
- - Failure taxonomy that distinguishes prompt, tool, sandbox, retrieval,
- evaluator, and knowledge-readiness failures.
- - Optimization loops over prompts, steering, code mutations, or full multi-shot
- trajectories.
- - Report data for internal launch reviews, CI gates, and research analysis.
-
- ## Architecture
+ Evaluation infrastructure for agent products.
+
+ Use it to wrap the real workflow your users run, record what happened, verify
+ the result, turn feedback into replay data, compare variants, and ship only
+ when the evidence improves.
 
  ```txt
- agent/product run
- -> TraceEmitter / TraceStore
- -> SandboxHarness / MultiLayerVerifier / JudgeRunner
- -> failure taxonomy + metrics
- -> paired stats + held-out gates
- -> optimization + release confidence + reports
+ product task
+ -> observe state
+ -> validate with deterministic gates first
+ -> act through the real product adapter
+ -> trace + feedback trajectory
+ -> replay / optimize / release gate
  ```
 
- Package responsibilities:
-
- - `agent-eval`: run evidence, eval contracts, verification, statistics,
- optimization, reporting.
- - Product app: domain state, tools, credentials, UI, storage, deployment, model
- gateway.
- - `@tangle-network/agent-runtime`: production agent-loop/session runtime.
- - `@tangle-network/agent-knowledge`: evidence stores, claim/page synthesis,
- retrieval, knowledge readiness implementation.
+ `agent-eval` does not own product state, credentials, UI, storage, model
+ routing, browser drivers, sandbox policy, or deployment. Products own those.
+ This package owns eval contracts, loop mechanics, traces, statistics,
+ optimization inputs, and release evidence.
 
  ## Install
 
@@ -67,41 +26,23 @@ Package responsibilities:
  pnpm add @tangle-network/agent-eval
  ```
 
- Wire protocol / CLI:
-
- ```sh
- npm i -g @tangle-network/agent-eval
- agent-eval serve --port 5005
- ```
-
- Python client source lives in `clients/python`. Until the PyPI package is
- published, install it from the repo:
-
- ```sh
- cd clients/python
- pip install -e .
- ```
-
  ## Quick Start
 
- Wrap the real product loop first. Do not build a toy eval path that users never
- exercise.
-
  ```ts
  import {
    objectiveEval,
    runAgentControlLoop,
- } from '@tangle-network/agent-eval'
+ } from '@tangle-network/agent-eval/control'
 
  const result = await runAgentControlLoop({
    intent: task.prompt,
    budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
 
-   async observe() {
-     return productAdapter.readState(task.id)
+   observe() {
+     return product.readState(task.id)
    },
 
-   async validate({ state }) {
+   validate({ state }) {
      return [
        objectiveEval({
          id: 'build-passes',
@@ -117,128 +58,154 @@ const result = await runAgentControlLoop({
      ]
    },
 
-   async decide({ evals }) {
-     return evals.every((evalResult) => evalResult.passed)
-       ? { type: 'stop', reason: 'all critical checks passed' }
-       : { type: 'continue', action: { type: 'repair' }, reason: 'checks failed' }
+   decide({ evals }) {
+     const failed = evals.filter((e) => !e.passed)
+     if (failed.length === 0) {
+       return { type: 'stop', pass: true, reason: 'all gates passed' }
+     }
+     return {
+       type: 'continue',
+       action: { type: 'repair', failed: failed.map((e) => e.id) },
+       reason: 'repair failed gates',
+     }
    },
 
-   async act(action) {
-     return productAdapter.runAgentStep(task.id, action)
+   act(action) {
+     return product.runAgentStep(task.id, action)
    },
  })
 
- await productAdapter.storeControlResult(task.id, result)
+ await product.storeEvalResult(task.id, result)
  ```
 
- Once this loop represents production behavior, convert completed runs into
- feedback trajectories, split them into train/dev/test/holdout sets, and run
- multi-shot optimization against the same adapter.
-
- ## Core Primitives
-
- | Primitive | Purpose |
- |---|---|
- | `TraceEmitter`, `TraceStore` | Append-only run/span/event/artifact/budget records. |
- | `SandboxHarness` | Build/test/runtime checks with captured stdout, stderr, exit codes, wall time, and parsed test counts. |
- | `MultiLayerVerifier` | Ordered verification stages with dependencies, skip-on-fail, findings, scores, and time caps. |
- | `JudgeRunner` | Parallel deterministic or LLM-backed judges over the same artifact/run. |
- | `runAgentControlLoop` | Observe/validate/decide/act loop with budgets, stop policies, and structured eval results. |
- | `Dataset`, `RunRecord`, `HeldOutGate` | Versioned corpora, reproducible run metadata, and held-out promotion decisions. |
- | `pairedBootstrap`, `pairedWilcoxon`, `bhAdjust` | Paired experiment statistics and multiple-comparison correction. |
- | `classifyFailure` | Rule-based failure classification for agent, tool, sandbox, retrieval, and knowledge failures. |
- | `runMultiShotOptimization` | Optimization over full agent trajectories with actionable side information. |
- | `runPromptEvolution` | Prompt/steering/code evolution over scenario scores. |
- | `evaluateReleaseConfidence` | Release scorecard across evidence volume, pass rate, score, overfit, cost, latency, and gates. |
- | `summaryTable`, `paretoChart`, `gainHistogram` | Report-ready structured outputs. |
- | `KnowledgeRequirement`, `KnowledgeBundle` | Shared contracts for knowledge readiness. |
-
- `NoopResearcher` is a fail-loud sentinel for wiring tests. Production systems
- should implement `Researcher` directly or use `CallbackResearcher`.
-
- ## Adoption Path
-
- 1. Choose one real workflow: code generation, browser task, research task,
- workflow builder, voice interaction, or domain agent task.
- 2. Write a product adapter that can observe state and execute one agent step.
- 3. Add deterministic validators first: build, test, serve, schema, policy,
- permission, retrieval, and deployment checks.
- 4. Add LLM judges only for subjective quality that deterministic checks cannot
- measure.
- 5. Emit traces and convert successful and failed attempts into
- `FeedbackTrajectory` records.
- 6. Build train/dev/test/holdout scenarios from those trajectories.
- 7. Run `runMultiShotOptimization()` or prompt/code evolution on train/dev.
- 8. Promote only when test/holdout gates and real product telemetry improve.
-
- For a complete product integration guide, see
- [Product Eval Adoption](./docs/product-eval-adoption.md).
+ That loop should be the same shape in production, replay, benchmark, and
+ optimization. Swap dependencies behind `observe()` and `act()`, not the eval
+ contract itself.
 
- ## Examples
+ ## Import Paths
 
- Runnable examples live in the repository's
- [`examples/`](https://github.com/tangle-network/agent-eval/tree/main/examples)
- directory. They are not part of the published npm package.
+ The root export remains available, but new code should prefer focused subpaths:
 
- - [`examples/same-sandbox-harness`](https://github.com/tangle-network/agent-eval/tree/main/examples/same-sandbox-harness) - run
- multiple eval passes against the same workspace.
- - [`examples/multi-shot-optimization`](https://github.com/tangle-network/agent-eval/tree/main/examples/multi-shot-optimization) -
- optimize full agent trajectories with held-out promotion.
- - [`examples/benchmarks`](https://github.com/tangle-network/agent-eval/tree/main/examples/benchmarks) - benchmark adapter shape and
- reference benchmark wrappers.
+ ```ts
+ import { runAgentControlLoop } from '@tangle-network/agent-eval/control'
+ import { runMultiShotOptimization } from '@tangle-network/agent-eval/optimization'
+ import { TraceEmitter } from '@tangle-network/agent-eval/traces'
+ import { renderReleaseReport } from '@tangle-network/agent-eval/reporting'
+ ```
 
- The examples are intentionally kept outside the README so they can be expanded,
- tested, and copied without turning this page into a tutorial.
+ | Subpath | Use for |
+ | --- | --- |
+ | `@tangle-network/agent-eval/control` | `observe -> validate -> decide -> act`, action policy, propose/review loops |
+ | `@tangle-network/agent-eval/traces` | trace stores, emitters, TraceAnalyst |
+ | `@tangle-network/agent-eval/optimization` | feedback trajectories, multi-shot optimization, prompt evolution |
+ | `@tangle-network/agent-eval/reporting` | release confidence, paired stats, report/table/chart specs |
+ | `@tangle-network/agent-eval/wire` | HTTP/RPC judge server and schemas |
+ | `@tangle-network/agent-eval/benchmarks` | benchmark adapter contracts and reference wrappers |
+
+ ## Core Pieces
+
+ | Need | Use |
+ | --- | --- |
+ | Keep an agent working until objective state passes | `runAgentControlLoop` |
+ | Turn user/reviewer feedback into replay data | `FeedbackTrajectory` |
+ | Compare prompt/tool/retrieval policies over full trajectories | `runMultiShotOptimization` |
+ | Gate releases with paired evidence and holdouts | `evaluateReleaseConfidence`, `HeldOutGate` |
+ | Explain regressions across trace corpora | `TraceAnalyst` / `analyzeTraces` |
+ | Report a launch decision | `renderReleaseReport`, `researchReport`, `summaryTable`, `paretoChart`, `gainHistogram` |
+ | Capture every provider HTTP request / response for forensics | `RawProviderSink`, `LlmClientOptions.rawSink` |
+ | Fail loud if an eval would silently use the wrong route | `assertLlmRoute` |
+ | Assert at run-end that the artifact is complete | `assertRunCaptured`, `throwIfRunIncomplete` |
+ | Auto-execute the trace analyst on every run | `traceAnalystOnRunComplete` + `TraceEmitterOptions.onRunComplete` |
+ | Model missing context separately from bad reasoning | `KnowledgeRequirement`, `KnowledgeBundle` |
+
+ ### Capture integrity (0.21+)
+
+ Launch-grade benchmark runs need four things that are easy to forget in glue
+ code: (1) raw HTTP capture alongside the structured spans so a reviewer can
+ verify which route answered, (2) a preflight assertion that the configured
+ client points at the intended provider, (3) a run-end assertion that the
+ expected events were actually written, and (4) auto-execution of the trace
+ analyst as part of the run lifecycle. The wiring fits in a few lines:
 
- ## Documentation
+ ```ts
+ import {
+   TraceEmitter, FileSystemRawProviderSink, callLlm, assertLlmRoute,
+   assertRunCaptured, throwIfRunIncomplete,
+ } from '@tangle-network/agent-eval'
+ import { traceAnalystOnRunComplete } from '@tangle-network/agent-eval/traces'
 
- - [Concepts](./docs/concepts.md)
- - [Feature Guide](./docs/feature-guide.md)
- - [Product Eval Adoption](./docs/product-eval-adoption.md)
- - [Control Runtime](./docs/control-runtime.md)
- - [Knowledge Readiness](./docs/knowledge-readiness.md)
- - [Multi-Shot Optimization](./docs/multi-shot-optimization.md)
- - [Feedback Trajectories](./docs/feedback-trajectories.md)
- - [Wire Protocol](./docs/wire-protocol.md)
+ const sink = new FileSystemRawProviderSink({ dir: `${workDir}/raw-events` })
+ assertLlmRoute(llmOpts, { requireExplicitBaseUrl: true, allowedBaseUrls, requireAuth: true })
 
- ## Development
+ const emitter = new TraceEmitter(store, {
+   onRunComplete: [traceAnalystOnRunComplete({ analyze: analystOpts, save })],
+ })
+ await emitter.startRun(/* ... */)
+ // LLM calls flow through callLlm with `{ rawSink: sink, traceContext: { runId, spanId } }`.
+ await emitter.endRun({ pass, score })
 
- ```sh
- pnpm install
- pnpm typecheck
- pnpm test
- pnpm build
- pnpm openapi
+ throwIfRunIncomplete(await assertRunCaptured(store, emitter.runId, {
+   llmSpansMin: 1, rawSink: sink, requireRawCoverageOfLlmSpans: true, requireOutcome: true,
+ }))
  ```
 
- Run the local server:
+ Directives, rationale, and shipped-bug context are in
+ [`SKILL.md` § Capture integrity](./.claude/skills/agent-eval/SKILL.md#capture-integrity-required-for-launch-grade-adoption).
+
+ ## Examples
+
+ Runnable examples live in
+ [`examples/`](https://github.com/tangle-network/agent-eval/tree/main/examples).
+
+ - [`examples/multi-shot-optimization`](https://github.com/tangle-network/agent-eval/tree/main/examples/multi-shot-optimization):
+ optimize full trajectories with held-out promotion.
+ - [`examples/same-sandbox-harness`](https://github.com/tangle-network/agent-eval/tree/main/examples/same-sandbox-harness):
+ run setup/build/test and evidence checks in one workspace.
+ - [`examples/benchmarks`](https://github.com/tangle-network/agent-eval/tree/main/examples/benchmarks):
+ benchmark adapter shape and reference wrappers.
+
+ ## Docs
+
+ Read in this order:
+
+ 1. [Product Eval Adoption](./docs/product-eval-adoption.md)
+ 2. [Control Runtime](./docs/control-runtime.md)
+ 3. [Feedback Trajectories](./docs/feedback-trajectories.md)
+ 4. [Multi-Shot Optimization](./docs/multi-shot-optimization.md)
+ 5. [Trace Analysis](./docs/trace-analysis.md)
+ 6. [Knowledge Readiness](./docs/knowledge-readiness.md)
+ 7. [Integration Launch Gates](./docs/integration-launch-gates.md)
+ 8. [Wire Protocol](./docs/wire-protocol.md)
+
+ ## CLI / Wire Protocol
 
  ```sh
- pnpm build
- node dist/cli.js serve --port 5005
+ npm i -g @tangle-network/agent-eval
+ agent-eval serve --port 5005
  ```
 
- Python client tests:
+ The Python client lives in `clients/python`:
 
  ```sh
- pnpm build
  cd clients/python
- pip install -e ".[dev]"
- pytest
+ pip install -e .
  ```
 
- ## Release
+ ## Development
 
- `@tangle-network/agent-eval` publishes to npm. The Python client lives under
- `clients/python` and is versioned from this repository.
+ ```sh
+ pnpm install
+ pnpm typecheck
+ pnpm test
+ pnpm build
+ pnpm openapi
+ ```
 
  ## Related Packages
 
- - [`@tangle-network/agent-runtime`](https://github.com/tangle-network/agent-runtime)
- - [`@tangle-network/agent-knowledge`](https://github.com/tangle-network/agent-knowledge)
- - [`@tangle-network/agent-integrations`](https://github.com/tangle-network/agent-integrations)
- - [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway)
- - [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud)
+ - `@tangle-network/agent-runtime`: production session/runtime layer.
+ - `@tangle-network/agent-knowledge`: source-grounded knowledge bases and readiness.
+ - `@tangle-network/agent-integrations`: connection, grant, capability, and integration invocation contracts.
 
  ## License
 
package/dist/benchmarks/index.d.ts CHANGED
@@ -1 +1,2 @@
- export { B as BENCHMARK_SPLIT_SEED, b as BenchmarkAdapter, c as BenchmarkDatasetItem, d as BenchmarkEvaluation, i as deterministicSplit, l as routing } from '../index-1PZOtZFr.js';
+ export { B as BENCHMARK_SPLIT_SEED, a as BenchmarkAdapter, b as BenchmarkDatasetItem, c as BenchmarkEvaluation, d as deterministicSplit, e as routing } from '../index-c5saLbKD.js';
+ import '../run-record-CX_jcAyr.js';
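
For readers skimming this diff, the `observe -> validate -> decide -> act` shape that the updated README quick start passes to `runAgentControlLoop` can be approximated with the hypothetical loop below. `runLoopSketch`, `LoopOptions`, and the returned `{ pass, steps, reason }` object are invented for this sketch, and only a step budget is modelled. The real function's options (such as `maxWallMs` and `maxCostUsd`) and its result type should be taken from the package's own typings.

```typescript
// Hypothetical, simplified control loop; not the package's implementation.
type EvalResult = { id: string; passed: boolean }

type Decision<A> =
  | { type: 'stop'; pass: boolean; reason: string }
  | { type: 'continue'; action: A; reason: string }

interface LoopOptions<S, A> {
  maxSteps: number
  observe(): Promise<S> | S
  validate(ctx: { state: S }): Promise<EvalResult[]> | EvalResult[]
  decide(ctx: { evals: EvalResult[] }): Promise<Decision<A>> | Decision<A>
  act(action: A): Promise<void> | void
}

// Each iteration reads state, runs the gates, and either stops or acts once.
async function runLoopSketch<S, A>(opts: LoopOptions<S, A>) {
  for (let step = 0; step < opts.maxSteps; step++) {
    const state = await opts.observe()
    const evals = await opts.validate({ state })
    const decision = await opts.decide({ evals })
    if (decision.type === 'stop') {
      return { pass: decision.pass, steps: step, reason: decision.reason }
    }
    await opts.act(decision.action)
  }
  // The step budget ran out before any decision asked to stop.
  return { pass: false, steps: opts.maxSteps, reason: 'step budget exhausted' }
}

// Toy "product": the build gate passes once two repairs have been applied.
let repairs = 0
const resultPromise = runLoopSketch<{ repairs: number }, { type: 'repair' }>({
  maxSteps: 8,
  observe: () => ({ repairs }),
  validate: ({ state }) => [{ id: 'build-passes', passed: state.repairs >= 2 }],
  decide: ({ evals }) =>
    evals.every((e) => e.passed)
      ? { type: 'stop', pass: true, reason: 'all gates passed' }
      : { type: 'continue', action: { type: 'repair' }, reason: 'repair failed gates' },
  act: () => {
    repairs += 1
  },
})

resultPromise.then((result) => console.log(result))
// logs { pass: true, steps: 2, reason: 'all gates passed' }
```

Because the loop only touches the product through `observe` and `act`, the same `validate` and `decide` callbacks can be reused against a replay or benchmark adapter, which is the point the README makes about keeping the eval contract fixed.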