@tangle-network/agent-eval 0.20.11 → 0.21.0
- package/CHANGELOG.md +76 -0
- package/README.md +137 -170
- package/dist/benchmarks/index.d.ts +2 -1
- package/dist/{chunk-JAOLXRIA.js → chunk-3GN6U53I.js} +205 -4
- package/dist/chunk-3GN6U53I.js.map +1 -0
- package/dist/chunk-3IX6QTB7.js +1349 -0
- package/dist/chunk-3IX6QTB7.js.map +1 -0
- package/dist/chunk-5IIQKMD5.js +236 -0
- package/dist/chunk-5IIQKMD5.js.map +1 -0
- package/dist/chunk-ARZ6BEV6.js +1310 -0
- package/dist/chunk-ARZ6BEV6.js.map +1 -0
- package/dist/chunk-HRZELXCR.js +1354 -0
- package/dist/chunk-HRZELXCR.js.map +1 -0
- package/dist/chunk-KRR4VMH7.js +423 -0
- package/dist/chunk-KRR4VMH7.js.map +1 -0
- package/dist/chunk-SNUHRBDL.js +154 -0
- package/dist/chunk-SNUHRBDL.js.map +1 -0
- package/dist/chunk-WOK2RTWG.js +1920 -0
- package/dist/chunk-WOK2RTWG.js.map +1 -0
- package/dist/{chunk-LSR4IAYN.js → chunk-WOPGKVN4.js} +2 -2
- package/dist/chunk-YUFXO3TU.js +148 -0
- package/dist/chunk-YUFXO3TU.js.map +1 -0
- package/dist/cli.js +3 -2
- package/dist/cli.js.map +1 -1
- package/dist/control-cxwMOAsy.d.ts +259 -0
- package/dist/control.d.ts +6 -0
- package/dist/control.js +30 -0
- package/dist/control.js.map +1 -0
- package/dist/dataset-B9qvlm_o.d.ts +112 -0
- package/dist/emitter-B2XqDKFU.d.ts +121 -0
- package/dist/feedback-trajectory-CB0A32o3.d.ts +346 -0
- package/dist/{index-1PZOtZFr.d.ts → index-c5saLbKD.d.ts} +2 -133
- package/dist/index.d.ts +178 -2945
- package/dist/index.js +1066 -6185
- package/dist/index.js.map +1 -1
- package/dist/multi-shot-optimization-Bvtz294B.d.ts +598 -0
- package/dist/openapi.json +1 -1
- package/dist/optimization.d.ts +146 -0
- package/dist/optimization.js +60 -0
- package/dist/optimization.js.map +1 -0
- package/dist/reporting-Da2ihlcM.d.ts +672 -0
- package/dist/reporting.d.ts +5 -0
- package/dist/reporting.js +36 -0
- package/dist/reporting.js.map +1 -0
- package/dist/run-record-CX_jcAyr.d.ts +134 -0
- package/dist/store-u47QaJ9G.d.ts +297 -0
- package/dist/traces.d.ts +914 -0
- package/dist/traces.js +120 -0
- package/dist/traces.js.map +1 -0
- package/dist/wire/index.js +3 -2
- package/docs/concepts.md +16 -11
- package/docs/feature-guide.md +10 -17
- package/docs/integration-launch-gates.md +77 -0
- package/docs/product-eval-adoption.md +27 -0
- package/docs/research-report-methodology.md +155 -0
- package/docs/trace-analysis.md +75 -0
- package/package.json +30 -12
- package/dist/chunk-JAOLXRIA.js.map +0 -1
- package/dist/{chunk-LSR4IAYN.js.map → chunk-WOPGKVN4.js.map} +0 -0
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,81 @@
 # Changelog
 
+## 0.21.0 — capture integrity + launch-grade reporting
+
+This release closes the layer-1 gap a downstream consumer surfaced: better
+post-run statistics don't help if the underlying data wasn't captured. 0.21
+adds first-class raw provider-event capture, a fail-loud route guard, a
+run-completion integrity check, and run-complete hooks (with a trace-analyst
+auto-execution helper) so a direct matrix run produces complete forensics
+without out-of-band glue.
+
+### Added
+
+- **`RawProviderSink` (capture).** First-class persistence for HTTP-level
+  provider request / response / error payloads alongside the structured
+  `LlmSpan`. `InMemoryRawProviderSink`, `FileSystemRawProviderSink` (NDJSON,
+  rolls at 32 MiB), and `NoopRawProviderSink` ship in core. Default redactor
+  strips `Authorization` / `X-Api-Key` / `Cookie` headers and credential-shaped
+  body fields (`apiKey`, `bearer`, `password`, `secret`, `token`); redacted
+  paths are recorded on `event.redactedFields` so a reviewer can see what was
+  stripped without exposing values. Wired into `callLlm` via
+  `LlmClientOptions.rawSink` — every retry attempt produces a `request` and
+  either a `response` or `error` event with the attempt index attached.
+- **`assertLlmRoute` (route guard).** Pure function that throws
+  `LlmRouteAssertionError` when the configured client doesn't match the
+  caller's route requirements: `requireExplicitBaseUrl`, `allowedBaseUrls`,
+  `blockedBaseUrls`, `requireAuth`, `expectedProvider`. Designed for the
+  matrix-runner preflight — fail loud at the boundary instead of silently
+  falling back to the public/free-tier router.
+- **`assertRunCaptured` (integrity check).** Read-only check on
+  `(store, runId, expectations)` that returns a structured
+  `RunIntegrityReport` with issue codes (`missing_llm_spans`,
+  `missing_raw_events`, `orphan_llm_span`, `no_raw_sink`, `missing_outcome`,
+  …). Pair with the new `requireRawCoverageOfLlmSpans` to assert every
+  `LlmSpan` has a matching raw `request` event. Use directly or via
+  `throwIfRunIncomplete` for strict mode.
+- **`onRunComplete` hooks on `TraceEmitter`.** New
+  `TraceEmitterOptions.onRunComplete` array fires after `endRun` / `abortRun`
+  with full run context (run id, outcome, status, store, emitter). Errors are
+  swallowed and recorded as `log` events by default; opt into propagation via
+  `hookErrors: 'throw'`. `addRunCompleteHook` attaches hooks after construction.
+- **`traceAnalystOnRunComplete` factory.** Drop-in run-complete hook that
+  runs `analyzeTraces` after each run and persists the result. Resolves the
+  "trace analyst never ran on this matrix sweep" complaint by making
+  auto-execution declarative.
+- **`researchReport`** — executive research-report layer for coding-vertical
+  benchmark runs (originally landed in #34, elevated in #35). Composes
+  `summaryTable`, `paretoChart`, `gainHistogram`, held-out gate decisions,
+  and optional `failureClusterView` output into one structured artifact:
+  promote / hold / equivalent / reject / needs-more-data guidance with
+  rationale, risks, next actions, markdown, HTML, and JSON chart specs.
+  - Decisions are made on paired evidence — never on marginal means alone.
+  - ROPE (Region of Practical Equivalence) supported via the `rope` option.
+  - Bayesian-bootstrap-style `Pr(Δ>0)` and `Pr(Δ∈ROPE)` summaries (Rubin 1981).
+  - Per-candidate minimum detectable paired effect via `pairedMde`.
+  - SHA-256 `runFingerprint` and optional `preregistrationHash` linking a
+    signed `HypothesisManifest`.
+  - Embedded methodology + `docs/research-report-methodology.md` companion.
+- **`pairedMde`** in `power-analysis`: closed-form minimum detectable paired
+  effect (inverse to the paired-t / sign-rank power formula).
+
+### Changed
+
+- `researchReport` is async (uses Web Crypto via `hashJson` for the run
+  fingerprint).
+- Default `researchReport.minPairs` is 20 (soft floor); hard floor of 6 is
+  enforced regardless via `RESEARCH_REPORT_HARD_PAIR_FLOOR`.
+
+### Wire-protocol consumers
+
+No wire-protocol changes. The new capture / integrity / hook primitives are
+TypeScript-only; cross-language consumers continue to use the existing RPC
+surface.
+
+### Python client
+
+Locked at `tangle-agent-eval==0.21.0` to match the npm package.
+
 ## 0.20.10 — hardening audit follow-up
 
 ### Fixed
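Reviewer note: the 0.21.0 entry above describes a default redactor that strips credential headers and credential-shaped body fields while recording which paths were stripped. The sketch below illustrates that idea only and is not the package's implementation. The function names `redactHeaders` and `redactBody`, the `Json` type, the `[REDACTED]` placeholder, and the dotted path format are all assumptions made for illustration; only the header names and body field names are taken from the changelog.

```typescript
// Hypothetical sketch of path-recording redaction; not the package's real code.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json }

// Names taken from the changelog, compared case-insensitively (an assumption).
const SENSITIVE_HEADERS = new Set(['authorization', 'x-api-key', 'cookie'])
const SENSITIVE_BODY_KEYS = new Set(['apikey', 'bearer', 'password', 'secret', 'token'])

function redactHeaders(headers: Record<string, string>) {
  const out: Record<string, string> = {}
  const redactedFields: string[] = []
  for (const [name, value] of Object.entries(headers)) {
    if (SENSITIVE_HEADERS.has(name.toLowerCase())) {
      out[name] = '[REDACTED]'
      redactedFields.push(`headers.${name}`) // record the path, never the value
    } else {
      out[name] = value
    }
  }
  return { headers: out, redactedFields }
}

function redactBody(
  value: Json,
  path = 'body',
  redactedFields: string[] = [],
): { value: Json; redactedFields: string[] } {
  if (Array.isArray(value)) {
    // Walk arrays element by element so nested objects are also checked.
    const items = value.map((item, i) => redactBody(item, `${path}[${i}]`, redactedFields).value)
    return { value: items, redactedFields }
  }
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {}
    for (const [key, child] of Object.entries(value)) {
      const childPath = `${path}.${key}`
      if (SENSITIVE_BODY_KEYS.has(key.toLowerCase())) {
        out[key] = '[REDACTED]'
        redactedFields.push(childPath)
      } else {
        out[key] = redactBody(child, childPath, redactedFields).value
      }
    }
    return { value: out, redactedFields }
  }
  return { value, redactedFields } // primitives pass through unchanged
}

const demo = redactBody({ model: 'example-model', apiKey: 'sk-demo' })
console.log(demo.redactedFields) // [ 'body.apiKey' ]
```

Recording paths rather than values is what lets a reviewer confirm that something was stripped without the sink ever persisting the secret itself.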
package/README.md
CHANGED
@@ -1,65 +1,24 @@
 # @tangle-network/agent-eval
 
-Evaluation infrastructure for agent
-
-
-
-
-
-It does not own your product state, credentials, UI, or model routing. Product
-teams keep those boundaries; this package standardizes how runs are recorded,
-checked, compared, and promoted.
-
-## Contents
-
-- [When To Use It](#when-to-use-it)
-- [Architecture](#architecture)
-- [Install](#install)
-- [Quick Start](#quick-start)
-- [Core Primitives](#core-primitives)
-- [Adoption Path](#adoption-path)
-- [Examples](#examples)
-- [Documentation](#documentation)
-- [Development](#development)
-- [Related Packages](#related-packages)
-
-## When To Use It
-
-Use `agent-eval` when you need one or more of these:
-
-- A reproducible eval harness for coding agents, builder agents, or multi-tool
-  workflows.
-- Structured traces for agent runs: spans, artifacts, events, budgets, tool
-  calls, retrieval, judge output, and sandbox execution.
-- Deterministic gates around build/test/deploy checks.
-- LLM-as-judge or deterministic judge fleets with calibration and canaries.
-- Dataset splits, holdouts, paired statistics, and release confidence gates.
-- Failure taxonomy that distinguishes prompt, tool, sandbox, retrieval,
-  evaluator, and knowledge-readiness failures.
-- Optimization loops over prompts, steering, code mutations, or full multi-shot
-  trajectories.
-- Report data for internal launch reviews, CI gates, and research analysis.
-
-## Architecture
+Evaluation infrastructure for agent products.
+
+Use it to wrap the real workflow your users run, record what happened, verify
+the result, turn feedback into replay data, compare variants, and ship only
+when the evidence improves.
 
 ```txt
-
-  ->
-  ->
-  ->
-  ->
-  ->
+product task
+  -> observe state
+  -> validate with deterministic gates first
+  -> act through the real product adapter
+  -> trace + feedback trajectory
+  -> replay / optimize / release gate
 ```
 
-
-
-
-
-- Product app: domain state, tools, credentials, UI, storage, deployment, model
-  gateway.
-- `@tangle-network/agent-runtime`: production agent-loop/session runtime.
-- `@tangle-network/agent-knowledge`: evidence stores, claim/page synthesis,
-  retrieval, knowledge readiness implementation.
+`agent-eval` does not own product state, credentials, UI, storage, model
+routing, browser drivers, sandbox policy, or deployment. Products own those.
+This package owns eval contracts, loop mechanics, traces, statistics,
+optimization inputs, and release evidence.
 
 ## Install
 
@@ -67,41 +26,23 @@ Package responsibilities:
 pnpm add @tangle-network/agent-eval
 ```
 
-Wire protocol / CLI:
-
-```sh
-npm i -g @tangle-network/agent-eval
-agent-eval serve --port 5005
-```
-
-Python client source lives in `clients/python`. Until the PyPI package is
-published, install it from the repo:
-
-```sh
-cd clients/python
-pip install -e .
-```
-
 ## Quick Start
 
-Wrap the real product loop first. Do not build a toy eval path that users never
-exercise.
-
 ```ts
 import {
   objectiveEval,
   runAgentControlLoop,
-} from '@tangle-network/agent-eval'
+} from '@tangle-network/agent-eval/control'
 
 const result = await runAgentControlLoop({
   intent: task.prompt,
   budget: { maxSteps: 8, maxWallMs: 180_000, maxCostUsd: 2 },
 
-
-    return
+  observe() {
+    return product.readState(task.id)
   },
 
-
+  validate({ state }) {
     return [
       objectiveEval({
         id: 'build-passes',
@@ -117,128 +58,154 @@ const result = await runAgentControlLoop({
     ]
   },
 
-
-
-
-
+  decide({ evals }) {
+    const failed = evals.filter((e) => !e.passed)
+    if (failed.length === 0) {
+      return { type: 'stop', pass: true, reason: 'all gates passed' }
+    }
+    return {
+      type: 'continue',
+      action: { type: 'repair', failed: failed.map((e) => e.id) },
+      reason: 'repair failed gates',
+    }
   },
 
-
-    return
+  act(action) {
+    return product.runAgentStep(task.id, action)
   },
 })
 
-await
+await product.storeEvalResult(task.id, result)
 ```
 
-
-
-
-
-## Core Primitives
-
-| Primitive | Purpose |
-|---|---|
-| `TraceEmitter`, `TraceStore` | Append-only run/span/event/artifact/budget records. |
-| `SandboxHarness` | Build/test/runtime checks with captured stdout, stderr, exit codes, wall time, and parsed test counts. |
-| `MultiLayerVerifier` | Ordered verification stages with dependencies, skip-on-fail, findings, scores, and time caps. |
-| `JudgeRunner` | Parallel deterministic or LLM-backed judges over the same artifact/run. |
-| `runAgentControlLoop` | Observe/validate/decide/act loop with budgets, stop policies, and structured eval results. |
-| `Dataset`, `RunRecord`, `HeldOutGate` | Versioned corpora, reproducible run metadata, and held-out promotion decisions. |
-| `pairedBootstrap`, `pairedWilcoxon`, `bhAdjust` | Paired experiment statistics and multiple-comparison correction. |
-| `classifyFailure` | Rule-based failure classification for agent, tool, sandbox, retrieval, and knowledge failures. |
-| `runMultiShotOptimization` | Optimization over full agent trajectories with actionable side information. |
-| `runPromptEvolution` | Prompt/steering/code evolution over scenario scores. |
-| `evaluateReleaseConfidence` | Release scorecard across evidence volume, pass rate, score, overfit, cost, latency, and gates. |
-| `summaryTable`, `paretoChart`, `gainHistogram` | Report-ready structured outputs. |
-| `KnowledgeRequirement`, `KnowledgeBundle` | Shared contracts for knowledge readiness. |
-
-`NoopResearcher` is a fail-loud sentinel for wiring tests. Production systems
-should implement `Researcher` directly or use `CallbackResearcher`.
-
-## Adoption Path
-
-1. Choose one real workflow: code generation, browser task, research task,
-   workflow builder, voice interaction, or domain agent task.
-2. Write a product adapter that can observe state and execute one agent step.
-3. Add deterministic validators first: build, test, serve, schema, policy,
-   permission, retrieval, and deployment checks.
-4. Add LLM judges only for subjective quality that deterministic checks cannot
-   measure.
-5. Emit traces and convert successful and failed attempts into
-   `FeedbackTrajectory` records.
-6. Build train/dev/test/holdout scenarios from those trajectories.
-7. Run `runMultiShotOptimization()` or prompt/code evolution on train/dev.
-8. Promote only when test/holdout gates and real product telemetry improve.
-
-For a complete product integration guide, see
-[Product Eval Adoption](./docs/product-eval-adoption.md).
+That loop should be the same shape in production, replay, benchmark, and
+optimization. Swap dependencies behind `observe()` and `act()`, not the eval
+contract itself.
 
-##
+## Import Paths
 
-
-[`examples/`](https://github.com/tangle-network/agent-eval/tree/main/examples)
-directory. They are not part of the published npm package.
+The root export remains available, but new code should prefer focused subpaths:
 
-
-
-
-
-
-
+```ts
+import { runAgentControlLoop } from '@tangle-network/agent-eval/control'
+import { runMultiShotOptimization } from '@tangle-network/agent-eval/optimization'
+import { TraceEmitter } from '@tangle-network/agent-eval/traces'
+import { renderReleaseReport } from '@tangle-network/agent-eval/reporting'
+```
 
-
-
+| Subpath | Use for |
+| --- | --- |
+| `@tangle-network/agent-eval/control` | `observe -> validate -> decide -> act`, action policy, propose/review loops |
+| `@tangle-network/agent-eval/traces` | trace stores, emitters, TraceAnalyst |
+| `@tangle-network/agent-eval/optimization` | feedback trajectories, multi-shot optimization, prompt evolution |
+| `@tangle-network/agent-eval/reporting` | release confidence, paired stats, report/table/chart specs |
+| `@tangle-network/agent-eval/wire` | HTTP/RPC judge server and schemas |
+| `@tangle-network/agent-eval/benchmarks` | benchmark adapter contracts and reference wrappers |
+
+## Core Pieces
+
+| Need | Use |
+| --- | --- |
+| Keep an agent working until objective state passes | `runAgentControlLoop` |
+| Turn user/reviewer feedback into replay data | `FeedbackTrajectory` |
+| Compare prompt/tool/retrieval policies over full trajectories | `runMultiShotOptimization` |
+| Gate releases with paired evidence and holdouts | `evaluateReleaseConfidence`, `HeldOutGate` |
+| Explain regressions across trace corpora | `TraceAnalyst` / `analyzeTraces` |
+| Report a launch decision | `renderReleaseReport`, `researchReport`, `summaryTable`, `paretoChart`, `gainHistogram` |
+| Capture every provider HTTP request / response for forensics | `RawProviderSink`, `LlmClientOptions.rawSink` |
+| Fail loud if an eval would silently use the wrong route | `assertLlmRoute` |
+| Assert at run-end that the artifact is complete | `assertRunCaptured`, `throwIfRunIncomplete` |
+| Auto-execute the trace analyst on every run | `traceAnalystOnRunComplete` + `TraceEmitterOptions.onRunComplete` |
+| Model missing context separately from bad reasoning | `KnowledgeRequirement`, `KnowledgeBundle` |
+
+### Capture integrity (0.21+)
+
+Launch-grade benchmark runs need four things that are easy to forget in glue
+code: (1) raw HTTP capture alongside the structured spans so a reviewer can
+verify which route answered, (2) a preflight assertion that the configured
+client points at the intended provider, (3) a run-end assertion that the
+expected events were actually written, and (4) auto-execution of the trace
+analyst as part of the run lifecycle. The wiring fits in a few lines:
 
-
+```ts
+import {
+  TraceEmitter, FileSystemRawProviderSink, callLlm, assertLlmRoute,
+  assertRunCaptured, throwIfRunIncomplete,
+} from '@tangle-network/agent-eval'
+import { traceAnalystOnRunComplete } from '@tangle-network/agent-eval/traces'
 
-
-
-- [Product Eval Adoption](./docs/product-eval-adoption.md)
-- [Control Runtime](./docs/control-runtime.md)
-- [Knowledge Readiness](./docs/knowledge-readiness.md)
-- [Multi-Shot Optimization](./docs/multi-shot-optimization.md)
-- [Feedback Trajectories](./docs/feedback-trajectories.md)
-- [Wire Protocol](./docs/wire-protocol.md)
+const sink = new FileSystemRawProviderSink({ dir: `${workDir}/raw-events` })
+assertLlmRoute(llmOpts, { requireExplicitBaseUrl: true, allowedBaseUrls, requireAuth: true })
 
-
+const emitter = new TraceEmitter(store, {
+  onRunComplete: [traceAnalystOnRunComplete({ analyze: analystOpts, save })],
+})
+await emitter.startRun(/* ... */)
+// LLM calls flow through callLlm with `{ rawSink: sink, traceContext: { runId, spanId } }`.
+await emitter.endRun({ pass, score })
 
-
-
-
-pnpm test
-pnpm build
-pnpm openapi
+throwIfRunIncomplete(await assertRunCaptured(store, emitter.runId, {
+  llmSpansMin: 1, rawSink: sink, requireRawCoverageOfLlmSpans: true, requireOutcome: true,
+}))
 ```
 
-
+Directives, rationale, and shipped-bug context are in
+[`SKILL.md` § Capture integrity](./.claude/skills/agent-eval/SKILL.md#capture-integrity-required-for-launch-grade-adoption).
+
+## Examples
+
+Runnable examples live in
+[`examples/`](https://github.com/tangle-network/agent-eval/tree/main/examples).
+
+- [`examples/multi-shot-optimization`](https://github.com/tangle-network/agent-eval/tree/main/examples/multi-shot-optimization):
+  optimize full trajectories with held-out promotion.
+- [`examples/same-sandbox-harness`](https://github.com/tangle-network/agent-eval/tree/main/examples/same-sandbox-harness):
+  run setup/build/test and evidence checks in one workspace.
+- [`examples/benchmarks`](https://github.com/tangle-network/agent-eval/tree/main/examples/benchmarks):
+  benchmark adapter shape and reference wrappers.
+
+## Docs
+
+Read in this order:
+
+1. [Product Eval Adoption](./docs/product-eval-adoption.md)
+2. [Control Runtime](./docs/control-runtime.md)
+3. [Feedback Trajectories](./docs/feedback-trajectories.md)
+4. [Multi-Shot Optimization](./docs/multi-shot-optimization.md)
+5. [Trace Analysis](./docs/trace-analysis.md)
+6. [Knowledge Readiness](./docs/knowledge-readiness.md)
+7. [Integration Launch Gates](./docs/integration-launch-gates.md)
+8. [Wire Protocol](./docs/wire-protocol.md)
+
+## CLI / Wire Protocol
 
 ```sh
-
-
+npm i -g @tangle-network/agent-eval
+agent-eval serve --port 5005
 ```
 
-Python client
+The Python client lives in `clients/python`:
 
 ```sh
-pnpm build
 cd clients/python
-pip install -e
-pytest
+pip install -e .
 ```
 
-##
+## Development
 
-
-
+```sh
+pnpm install
+pnpm typecheck
+pnpm test
+pnpm build
+pnpm openapi
+```
 
 ## Related Packages
 
--
--
--
-- [`@tangle-network/agent-gateway`](https://github.com/tangle-network/agent-gateway)
-- [`@tangle-network/tcloud`](https://github.com/tangle-network/tcloud)
+- `@tangle-network/agent-runtime`: production session/runtime layer.
+- `@tangle-network/agent-knowledge`: source-grounded knowledge bases and readiness.
+- `@tangle-network/agent-integrations`: connection, grant, capability, and integration invocation contracts.
 
 ## License
 
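Reviewer note: the README above routes release gating through paired evidence, and the changelog names Bayesian-bootstrap-style `Pr(Δ>0)` and `Pr(Δ∈ROPE)` summaries. The sketch below shows one common way to compute such summaries from per-pair score differences, using Dirichlet(1, ..., 1) weights as in Rubin's Bayesian bootstrap. It is an illustration under assumptions, not the package's code: the function name `bayesianBootstrapPaired`, its parameters, the default draw count, and the seeded `mulberry32` generator are all hypothetical choices.

```typescript
// Hypothetical sketch of a Bayesian-bootstrap paired summary; not the package's API.

// Small seeded PRNG (mulberry32) so repeated runs give the same answer.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) | 0
    let t = Math.imul(a ^ (a >>> 15), a | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface PairedSummary {
  prPositive: number // share of draws where the weighted mean difference is > 0
  prInRope: number // share of draws where it falls inside [rope[0], rope[1]]
}

function bayesianBootstrapPaired(
  deltas: number[], // candidate score minus baseline score, one entry per pair
  rope: [number, number],
  draws = 2000,
  seed = 1,
): PairedSummary {
  if (deltas.length === 0) throw new Error('need at least one paired difference')
  const rand = mulberry32(seed)
  let positive = 0
  let inRope = 0
  for (let d = 0; d < draws; d++) {
    // Dirichlet(1, ..., 1) weights: normalize independent Exp(1) draws.
    let weightSum = 0
    let weightedDelta = 0
    for (const delta of deltas) {
      const u = rand() || Number.MIN_VALUE // avoid log(0)
      const w = -Math.log(u)
      weightSum += w
      weightedDelta += w * delta
    }
    const mean = weightedDelta / weightSum
    if (mean > 0) positive++
    if (mean >= rope[0] && mean <= rope[1]) inRope++
  }
  return { prPositive: positive / draws, prInRope: inRope / draws }
}

// Usage: five pairs with mixed signs and a +/- 0.02 region of practical equivalence.
const summary = bayesianBootstrapPaired([0.04, -0.01, 0.03, 0.05, 0.0], [-0.02, 0.02])
console.log(summary)
```

Working on the per-pair differences, rather than on each arm's mean separately, is what makes this a paired analysis: between-task variance cancels out before any resampling happens.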
package/dist/benchmarks/index.d.ts
CHANGED
@@ -1 +1,2 @@
-export { B as BENCHMARK_SPLIT_SEED,
+export { B as BENCHMARK_SPLIT_SEED, a as BenchmarkAdapter, b as BenchmarkDatasetItem, c as BenchmarkEvaluation, d as deterministicSplit, e as routing } from '../index-c5saLbKD.js';
+import '../run-record-CX_jcAyr.js';
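Reviewer note: this hunk only shows that `deterministicSplit` and `BENCHMARK_SPLIT_SEED` are re-exported; their signatures and behavior are not visible in the diff. For readers unfamiliar with the general technique the names suggest, the sketch below shows a seeded, hash-based split in which the same item id always lands in the same partition. Everything in it is an assumption: the function name `sketchSplit`, the FNV-1a hash (applied to UTF-16 code units for brevity), the partition names, and the default fractions are hypothetical and may not match the package.

```typescript
// Hypothetical sketch of a seeded deterministic split; not the package's deterministicSplit.

// FNV-1a style 32-bit hash over UTF-16 code units.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return hash >>> 0 // force an unsigned 32-bit result
}

type Split = 'train' | 'dev' | 'test'

function sketchSplit(
  itemId: string,
  seed: string,
  fractions: { train: number; dev: number } = { train: 0.6, dev: 0.2 },
): Split {
  // Map the hash to [0, 1) and cut it at the cumulative fractions.
  const bucket = fnv1a(`${seed}:${itemId}`) / 0x100000000
  if (bucket < fractions.train) return 'train'
  if (bucket < fractions.train + fractions.dev) return 'dev'
  return 'test'
}

// Usage: assignment depends only on the id and the seed, not on dataset order.
for (const id of ['task-001', 'task-002', 'task-003']) {
  console.log(id, sketchSplit(id, 'example-seed'))
}
```

Because assignment depends only on the id and the seed, adding or reordering items never moves an existing item between partitions, which keeps held-out sets stable across runs.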