@ls-stack/agent-eval 0.60.4 → 0.61.1
- package/dist/{app-gg10KvzS.mjs → app-Dm_9ZTVa.mjs} +4 -4
- package/dist/apps/web/dist/assets/index-CM_zUhl_.css +1 -0
- package/dist/apps/web/dist/assets/{index-CM6MDNqo.js → index-CwSehYad.js} +76 -76
- package/dist/apps/web/dist/index.html +2 -2
- package/dist/bin.mjs +1 -1
- package/dist/caseChild.mjs +1 -1
- package/dist/{cli-OLZIjQpx.mjs → cli-CPBIcMP-.mjs} +4 -4
- package/dist/index.d.mts +61 -52
- package/dist/index.mjs +3 -3
- package/dist/runChild.mjs +2 -2
- package/dist/{runExecution-Bu9yfdUS.mjs → runExecution-D-CnSRYy.mjs} +17 -1
- package/dist/{runOrchestration-mpgZmEZ6.mjs → runOrchestration-Basvyp4u.mjs} +1 -1
- package/dist/{runner-C4Y0lWb1.mjs → runner-B6UT1K7L.mjs} +1 -1
- package/dist/{runner-SxtKn-Xh.mjs → runner-DwNb5TCb.mjs} +2 -2
- package/dist/{src-Cy3OxoZW.mjs → src-SixIk0b7.mjs} +2 -2
- package/package.json +3 -3
- package/skills/agent-eval/SKILL.md +76 -432
- package/dist/apps/web/dist/assets/index-CqWfzcFb.css +0 -1
@@ -5,80 +5,27 @@ description: Create, run, and maintain TypeScript evals with @ls-stack/agent-eva

 # Agent Eval

-Local-first eval runner for LLM and agent systems. Evals are strict TypeScript
-
-
-
-
-
-This skill covers the mental model and conventions. For exhaustive field lists
-(config options, eval shape, column formats, score/chart/stats options, trace
-display rules), read the TypeScript declarations shipped with the package:
-
-- `AgentEvalsConfig`, `EvalDefinition`, `EvalCase`, `EvalOutputs`,
-`EvalColumnOverride`, `EvalDeriveConfig`, `EvalScoreDef`,
-`EvalManualScoreDef`, `EvalTraceTree`, and `TraceSpanInfo` are exported from
-`@ls-stack/agent-eval`.
-- Import Zod directly from `zod` when authoring `outputsSchema` or
-`manualInput.schema`; `@ls-stack/agent-eval` does not re-export Zod.
+Local-first eval runner for LLM and agent systems. Evals are strict TypeScript modules named `*.eval.ts`, discovered from `agent-evals.config.ts`, and executed through the CLI (`agent-evals run`) or local app (`agent-evals app`). Runs persist to `.agent-evals/` so results, traces, and caches survive across processes.
+
+This skill covers the mental model and conventions. For exhaustive field lists (config options, eval shape, column formats, score/chart/stats options, trace display rules), read the TypeScript declarations shipped with the package:
+
+- `AgentEvalsConfig`, `EvalDefinition`, `EvalCase`, `EvalOutputs`, `EvalColumnOverride`, `EvalDeriveConfig`, `EvalScoreDef`, `EvalManualScoreDef`, `EvalTraceTree`, and `TraceSpanInfo` are exported from `@ls-stack/agent-eval`.
+- Import Zod directly from `zod` when authoring `outputsSchema` or `manualInput.schema`; `@ls-stack/agent-eval` does not re-export Zod.
 - `.d.ts` files land in `node_modules/@ls-stack/agent-eval/dist/`.
-- CLI surface: `agent-evals --help` and `agent-evals <command> --help`.
-
--
-
--
-
-
-
-- `agent-evals run --temporary` persists a run like normal history, but deletes
-it before the next run starts. Temporary runs appear in `show-runs` while
-present; normal runs are never deleted by temporary-run cleanup. In the app,
-the run drawer can promote a temporary run to durable history.
-- `agent-evals app` watches `agent-evals.config.ts` and the workspace `.env`
-and reloads them in place when the runner is idle. If config or `.env`
-changes during an active run, the reload applies after the current run
-reaches a terminal state.
-- App-triggered runs log the queued target evals, resolved case concurrency,
-each case start for evals that are actually running, and the terminal run
-summary in the server terminal.
-
-Assume that enumerated tables in this document may lag behind the types —
-treat the types as source of truth when they disagree.
+- CLI surface: `agent-evals --help` and `agent-evals <command> --help`. Unknown help targets exit non-zero instead of falling back to global help.
+- The CLI automatically loads `.env` from the current workspace. Shell-provided environment variables win; pass `--no-env` to disable `.env` loading once.
+- Unfiltered `agent-evals run` is disabled by default; use `--eval` or `--case` for targeted CLI runs, or `--tags-filter <expr>` to run cases matching tags. Set `allowCliRunAll: true` in `agent-evals.config.ts` to opt into run-all CLI behavior.
+- `agent-evals run --temporary` persists a run like normal history, but deletes it before the next run starts. Temporary runs appear in `show-runs` while present; normal runs are never deleted by temporary-run cleanup. In the app, the run drawer can promote a temporary run to durable history.
+- `agent-evals app` watches `agent-evals.config.ts` and the workspace `.env` and reloads them in place when the runner is idle. If config or `.env` changes during an active run, the reload applies after the current run reaches a terminal state.
+- App-triggered runs log the queued target evals, resolved case concurrency, each case start for evals that are actually running, and the terminal run summary in the server terminal.
+
+Assume that enumerated tables in this document may lag behind the types — treat the types as source of truth when they disagree.
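
The `.env` precedence rule in the added CLI notes above (shell-provided variables win over `.env` values, and `--no-env` skips the file) can be illustrated with a small standalone sketch. `mergeEnv`, `Env`, and the parameter names are hypothetical and exist only for this illustration; this is not the package's implementation.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: combine `.env` file values with the shell environment.
// Assumption: a key already defined in the shell is never overwritten.
function mergeEnv(shellEnv: Env, dotEnvValues: Env, loadDotEnv = true): Env {
  // Mirrors passing `--no-env`: the file is ignored entirely.
  if (!loadDotEnv) return { ...shellEnv };

  const merged: Env = { ...shellEnv };
  for (const [key, value] of Object.entries(dotEnvValues)) {
    // Only fill keys the shell did not provide.
    if (merged[key] === undefined) merged[key] = value;
  }
  return merged;
}

const shell: Env = { API_KEY: 'from-shell' };
const dotEnv: Env = { API_KEY: 'from-dotenv', MODEL: 'small' };

const merged = mergeEnv(shell, dotEnv); // API_KEY stays 'from-shell', MODEL is filled in
const skipped = mergeEnv(shell, dotEnv, false); // MODEL stays unset
```

The sketch only shows the precedence order; how the real CLI parses the file or reloads it in app mode is not covered here.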

 ## Where tracing lives

-**Tracing belongs in the product source code, not in the eval file.** The eval
-
-inside the
-invoke.
-
-`evalTracer`, `evalSpan`, output helpers, `evalLog`, `evalAssert`, and
-`evalExpect` are ambient no-ops when called outside an eval case scope, so
-leaving them in
-production paths is safe — they only record anything when the product code runs
-inside an eval's `execute`. Use `isInEvalScope()` to branch on eval-only behavior in shared code
-(e.g. skip a real network side effect): it returns `null` outside eval-owned
-work and returns `'env'`, `'cases'`, `'eval'`, `'derive'`, `'outputsSchema'`, or
-`'scorer'` during runner phases. Top-level modules imported while a run is being
-prepared see `'env'`; code called from `execute` sees `'eval'`. Use
-`getEvalCaseInput()` to read the current case input, or
-`getEvalCaseInput('customer.tier')` for nested dot-path access; outside a case
-scope it returns `undefined`. Use `nextEvalId()` inside eval-scoped code when a
-stable generated id is needed; it includes the eval file, eval id, case id, and
-a per-case sequence number, and throws outside an eval case scope.
-Use `evalLog(level, ...args)` for intentional per-case logs. The runner also
-captures `console.log`, `console.info`, `console.warn`, and `console.error`
-during case-owned phases by default; log arguments are stored as JSON-safe
-values. Logs inside cached operations are not replayed from cache hits.
-Use eval tags to target related coverage without naming every case:
-`AgentEvalsConfig.tags` applies workspace-wide tags, `defineEval({ tags })`
-adds eval tags, `case.tags` adds case-only tags, and `removeTags` disables a
-configured global tag for one eval. CLI filters support Vitest-style tag
-expressions such as `agent-evals run --tags-filter "refunds && !slow"`.
-Inside eval-scoped code, use `matchesEvalTags('tag')` or
-`matchesEvalTags({ all, any, not })`; it uses typed exact tag names and returns
-`false` outside a case scope. Projects can narrow tag names with a `.d.ts`
-module augmentation:
+**Tracing belongs in the product source code, not in the eval file.** The eval file wires up cases and scoring; the real `evalTracer.span(...)` calls sit inside the workflow, agent, or tool functions that both production and evals invoke.
+
+`evalTracer`, `evalSpan`, output helpers, `evalLog`, `evalAssert`, and `evalExpect` are ambient no-ops when called outside an eval case scope, so leaving them in production paths is safe — they only record anything when the product code runs inside an eval's `execute`. Use `isInEvalScope()` to branch on eval-only behavior in shared code (e.g. skip a real network side effect): it returns `null` outside eval-owned work and returns `'env'`, `'cases'`, `'eval'`, `'derive'`, `'outputsSchema'`, or `'scorer'` during runner phases. Top-level modules imported while a run is being prepared see `'env'`; code called from `execute` sees `'eval'`. Use `getEvalCaseInput()` to read the current case input, or `getEvalCaseInput('customer.tier')` for nested dot-path access; outside a case scope it returns `undefined`. Use `nextEvalId()` inside eval-scoped code when a stable generated id is needed; it includes the eval file, eval id, case id, and a per-case sequence number, and throws outside an eval case scope. Use `evalLog(level, ...args)` for intentional per-case logs. The runner also captures `console.log`, `console.info`, `console.warn`, and `console.error` during case-owned phases by default; log arguments are stored as JSON-safe values. Logs inside cached operations are not replayed from cache hits. Use eval tags to target related coverage without naming every case: `AgentEvalsConfig.tags` applies workspace-wide tags, `defineEval({ tags })` adds eval tags, `case.tags` adds case-only tags, and `removeTags` disables a configured global tag for one eval. CLI filters support Vitest-style tag expressions such as `agent-evals run --tags-filter "refunds && !slow"`. Inside eval-scoped code, use `matchesEvalTags('tag')` or `matchesEvalTags({ all, any, not })`; it uses typed exact tag names and returns `false` outside a case scope. Projects can narrow tag names with a `.d.ts` module augmentation:

 ```ts
 import '@ls-stack/agent-eval';
@@ -162,55 +109,17 @@ export async function runRefundWorkflow(input: RefundInput) {
 }
 ```

-Span `kind` values are open-ended strings. Use familiar kinds such as
-
-
-
-automatically
-
-
-
-`
-
-
-`evalTracer.span(...)`, such as optional model/tool failures that fall back and
-continue. You can pass one error, multiple error arguments, or an array. The
-span is still marked `error`. Pass `'warning'` or `{ level: 'warning' }` as the
-final argument for diagnostics that should not change an otherwise successful
-span's status.
-
-If a span callback throws, the SDK automatically marks that span as `error`,
-stores the thrown error on it, and rethrows so the case errors. Use that for
-terminal failures; use `captureEvalSpanError(...)` for recoverable failures that
-continue through fallback logic.
-
-Fire-and-forget spans started during `execute` are awaited before outputs,
-`deriveFromTracing`, scores, and trace data are finalized, so `void
-evalTracer.span(...)` is safe when the span result is not needed. Register
-non-span promises with `startEvalBackgroundJob(promise)`. The runner only waits
-for settlement; promise and span errors keep their normal behavior. Use
-`waitForBackgroundJob: false` on a span, or `waitForBackgroundJobs: false` on an
-eval definition, when background work should not delay finalization.
-
-Eval Date APIs use a shifted wall clock by default: `new Date()` and
-`Date.now()` start at `2026-04-10T00:00:00.000Z` during case generation,
-execution, tracing, derived outputs, and scorers, then continue advancing with
-real elapsed time. Set `startTime` on a specific `defineEval(...)` to use
-another initial clock value, or set `startTime: 'now'` for that eval to use the
-real current clock. Timers are not faked, so async waits still run normally.
-Set `freezeTime: true` to keep Date APIs frozen until they are moved manually.
-Use `evalTime.startTime` to read the captured wall-clock start as a Dayjs
-object, and `evalTime.dayjs(...)` to create other Dayjs date objects. Use
-`evalTime.advance(amount, unit)` inside an eval to move the shifted clock
-forward with Dayjs `add(...)` units. It throws for evals with
-`startTime: 'now'`, unless `freezeTime: true` is also set.
-
-For libraries or observability exporters that already emit span lifecycle
-events, use `evalTracer.startSpan(...)`, `evalTracer.updateSpan(...)`,
-`evalTracer.endSpan(...)`, or `evalTracer.recordSpan(...)` to translate those
-events into the eval trace tree without wrapping the upstream work in a
-callback. Pass the upstream span id and parent id when available so saved trace
-JSON and `deriveFromTracing` use the recorded hierarchy.
+Span `kind` values are open-ended strings. Use familiar kinds such as `agent`, `tool`, `llm`, `api`, `retrieval`, `scorer`, or `checkpoint` when they fit, and preserve external tracer kinds such as `mastra.workflow.step` when they are more specific. Only the `input` and `output` span attributes are promoted automatically in the trace tree; use `traceDisplay` for other span attributes such as `model` or `usage`. Eval-level LLM usage outputs, columns, stats, and charts are derived from matching LLM spans by default. Prefer `llmCalls.pricing` for LLM-call cost display; built-in costs ignore span `costUsd` attributes.
+
+Use `captureEvalSpanError(error)` for recoverable errors on the active `evalTracer.span(...)`, such as optional model/tool failures that fall back and continue. You can pass one error, multiple error arguments, or an array. The span is still marked `error`. Pass `'warning'` or `{ level: 'warning' }` as the final argument for diagnostics that should not change an otherwise successful span's status.
+
+If a span callback throws, the SDK automatically marks that span as `error`, stores the thrown error on it, and rethrows so the case errors. Use that for terminal failures; use `captureEvalSpanError(...)` for recoverable failures that continue through fallback logic.
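
The difference between a recoverable captured error and a terminal thrown error, as described in the two paragraphs above, can be shown with a self-contained stand-in. `runSpan`, `SpanRecord`, and the `capture` callback are hypothetical names invented for this sketch; they imitate the described behavior and are not the SDK's API or implementation. The `'warning'` level is deliberately left out.

```typescript
type SpanStatus = 'ok' | 'error';

interface SpanRecord {
  name: string;
  status: SpanStatus;
  errors: unknown[];
}

// Stand-in for a span wrapper. `capture` plays the role that
// `captureEvalSpanError(...)` plays on the active span in the text above.
function runSpan<T>(
  records: SpanRecord[],
  name: string,
  fn: (capture: (...errors: unknown[]) => void) => T,
): T {
  const record: SpanRecord = { name, status: 'ok', errors: [] };
  records.push(record);
  const capture = (...errors: unknown[]): void => {
    record.status = 'error';
    record.errors.push(...errors);
  };
  try {
    return fn(capture);
  } catch (error) {
    // Terminal failure: mark the span, store the error, and rethrow.
    capture(error);
    throw error;
  }
}

const records: SpanRecord[] = [];

// Recoverable failure: the span is marked `error`, but work continues.
const answer = runSpan(records, 'lookup', (capture) => {
  try {
    throw new Error('primary model unavailable');
  } catch (error) {
    capture(error);
    return 'fallback answer';
  }
});

// Terminal failure: the error propagates to the caller.
let rethrown = false;
try {
  runSpan(records, 'charge', () => {
    throw new Error('payment provider down');
  });
} catch {
  rethrown = true;
}
```

In both cases the span ends up marked `error`; the only difference the sketch models is whether the caller keeps running.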
+
+Fire-and-forget spans started during `execute` are awaited before outputs, `deriveFromTracing`, scores, and trace data are finalized, so `void evalTracer.span(...)` is safe when the span result is not needed. Register non-span promises with `startEvalBackgroundJob(promise)`. The runner only waits for settlement; promise and span errors keep their normal behavior. Use `waitForBackgroundJob: false` on a span, or `waitForBackgroundJobs: false` on an eval definition, when background work should not delay finalization.
+
+Eval Date APIs use a shifted wall clock by default: `new Date()` and `Date.now()` start at `2026-04-10T00:00:00.000Z` during case generation, execution, tracing, derived outputs, and scorers, then continue advancing with real elapsed time. Set `startTime` on a specific `defineEval(...)` to use another initial clock value, or set `startTime: 'now'` for that eval to use the real current clock. Timers are not faked, so async waits still run normally. Set `freezeTime: true` to keep Date APIs frozen until they are moved manually. Use `evalTime.startTime` to read the captured wall-clock start as a Dayjs object, and `evalTime.dayjs(...)` to create other Dayjs date objects. Use `evalTime.advance(amount, unit)` inside an eval to move the shifted clock forward with Dayjs `add(...)` units. It throws for evals with `startTime: 'now'`, unless `freezeTime: true` is also set.
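
The shifted-clock behavior described above (a fixed start instant that keeps advancing with real elapsed time, an optional freeze, and manual advancement) can be sketched independently of the package. `createShiftedClock` and its options are hypothetical names; the sketch takes milliseconds rather than Dayjs units, injects the real clock so the example is deterministic, and does not model the `startTime: 'now'` restriction.

```typescript
// Default start instant quoted in the text above.
const DEFAULT_START_MS = Date.parse('2026-04-10T00:00:00.000Z');

interface ShiftedClock {
  now(): number;
  advance(ms: number): void;
}

// Hypothetical sketch: `realNow` stands in for the real wall clock.
function createShiftedClock(
  realNow: () => number,
  options: { startMs?: number; freeze?: boolean } = {},
): ShiftedClock {
  const startMs = options.startMs ?? DEFAULT_START_MS;
  const realStart = realNow();
  let manualOffsetMs = 0;
  return {
    now() {
      // A frozen clock ignores real elapsed time and only moves manually.
      const elapsed = options.freeze ? 0 : realNow() - realStart;
      return startMs + elapsed + manualOffsetMs;
    },
    advance(ms) {
      manualOffsetMs += ms;
    },
  };
}

let fakeReal = 1_000;
const clock = createShiftedClock(() => fakeReal);
const atStart = new Date(clock.now()).toISOString();

fakeReal += 5_000; // five real seconds pass
const afterFiveSeconds = new Date(clock.now()).toISOString();

clock.advance(60_000); // move the shifted clock one minute forward
const afterAdvance = new Date(clock.now()).toISOString();

const frozen = createShiftedClock(() => fakeReal, { freeze: true });
const frozenStart = frozen.now();
fakeReal += 10_000;
const frozenLater = frozen.now(); // unchanged until advanced manually
```

Because only the clock reading is shifted, anything driven by real timers still waits for real time, which matches the note above that timers are not faked.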
+
+For libraries or observability exporters that already emit span lifecycle events, use `evalTracer.startSpan(...)`, `evalTracer.updateSpan(...)`, `evalTracer.endSpan(...)`, or `evalTracer.recordSpan(...)` to translate those events into the eval trace tree without wrapping the upstream work in a callback. Pass the upstream span id and parent id when available so saved trace JSON and `deriveFromTracing` use the recorded hierarchy.

 ### Eval file (thin)

@@ -249,20 +158,13 @@ defineEval<RefundInput, RefundOutputs>({
 });
 ```

-`execute` usually just calls the product code. Push any placeholder
-`evalTracer.span(...)` wrappers out of the eval and into the product module
-they describe so production runs get the same trajectory. Only keep tracing
-inside `execute` when the behavior being measured is eval-specific (e.g. a
-judge-only sub-step with no production analogue).
+`execute` usually just calls the product code. Push any placeholder `evalTracer.span(...)` wrappers out of the eval and into the product module they describe so production runs get the same trajectory. Only keep tracing inside `execute` when the behavior being measured is eval-specific (e.g. a judge-only sub-step with no production analogue).

-Case `id` values anchor historical runs, caches, and manual scores — keep them
-stable. See `EvalDefinition` / `EvalCase` in the types for every supported
-field.
+Case `id` values anchor historical runs, caches, and manual scores — keep them stable. See `EvalDefinition` / `EvalCase` in the types for every supported field.

 ### Manual input

-Use `manualInput` instead of `cases` when each run should pause for the user
-to type values:
+Use `manualInput` instead of `cases` when each run should pause for the user to type values:

 ```ts
 const inputSchema = z.object({
@@ -286,203 +188,39 @@ defineEval<z.infer<typeof inputSchema>>({
 });
 ```

-`manualInput` configures the local app form descriptor derived from the schema
-
-
-single targeted eval or `--input-file <path>` mapping eval keys/ids to inputs.
-Each run produces one synthetic case `<evalId>-manual` with the validated
-submission; mixing `manualInput` with `cases` is rejected at discovery time.
-
-For file or image fields, set `{ asFile: true, accept?, maxSizeBytes? }` and
-type the field with `manualInputFileValueSchema`. The runtime value carries
-`{ name, mimeType, sizeBytes, sha256, path }`, where `path` is a
-workspace-relative run artifact. Use `readManualInputFile(value)` when bytes,
-`Blob`, `File`, text, or parsed JSON are needed. In CLI runs, provide path
-objects such as
-`{ "image": { "path": "./screenshot.png" } }`; the CLI stages the file before
-starting the run.
+`manualInput` configures the local app form descriptor derived from the schema (`z.string` -> text, `z.enum` -> select, `z.boolean` -> checkbox, etc.; nested shapes fall back to JSON input). The CLI accepts `--input '<json>'` for a single targeted eval or `--input-file <path>` mapping eval keys/ids to inputs. Each run produces one synthetic case `<evalId>-manual` with the validated submission; mixing `manualInput` with `cases` is rejected at discovery time.
+
+For file or image fields, set `{ asFile: true, accept?, maxSizeBytes? }` and type the field with `manualInputFileValueSchema`. The runtime value carries `{ name, mimeType, sizeBytes, sha256, path }`, where `path` is a workspace-relative run artifact. Use `readManualInputFile(value)` when bytes, `Blob`, `File`, text, or parsed JSON are needed. In CLI runs, provide path objects such as `{ "image": { "path": "./screenshot.png" } }`; the CLI stages the file before starting the run.

 ## Scoring

-Every score returns a normalized `0..1` value. Pass/fail is per-score: a case
-fails if any score with `passThreshold` falls below it, if an assertion fails,
-or if the case errors. Scores without `passThreshold` are informational.
+Every score returns a normalized `0..1` value. Pass/fail is per-score: a case fails if any score with `passThreshold` falls below it, if an assertion fails, or if the case errors. Scores without `passThreshold` are informational.
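
The pass/fail rule stated above can be written out as a small predicate. `casePasses`, `ScoreResult`, and `CaseResult` are hypothetical shapes chosen for this sketch rather than types exported by the package, and pending manual scores (the `unscored` state) are not modeled.

```typescript
interface ScoreResult {
  value: number; // normalized 0..1
  passThreshold?: number; // absent means the score is informational
}

interface CaseResult {
  scores: Record<string, ScoreResult>;
  assertionFailed: boolean;
  errored: boolean;
}

// A case fails if it errored, if an assertion failed, or if any score
// with a threshold falls below that threshold.
function casePasses(result: CaseResult): boolean {
  if (result.errored || result.assertionFailed) return false;
  return Object.values(result.scores).every(
    (score) =>
      score.passThreshold === undefined || score.value >= score.passThreshold,
  );
}

// The low `tone` score has no threshold, so it does not fail the case.
const passing = casePasses({
  scores: { accuracy: { value: 0.9, passThreshold: 0.8 }, tone: { value: 0.2 } },
  assertionFailed: false,
  errored: false,
});

// `accuracy` falls below its threshold, so the case fails.
const failing = casePasses({
  scores: { accuracy: { value: 0.7, passThreshold: 0.8 } },
  assertionFailed: false,
  errored: false,
});
```

Treating a score exactly equal to its threshold as passing follows the "falls below" wording; the types remain the source of truth if the runner behaves differently.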

-Score functions run in their own trace scope, separate from the execution
-trace, so LLM-as-judge scorers can use `evalTracer.span(...)` and cached spans
-without polluting the agent trajectory. Outputs set inside a scorer stay
-private to that score. Spanless `evalTracer.cache(...)` calls made directly
-inside a scorer are stored on that score trace's `cacheRefs` payload.
+Score functions run in their own trace scope, separate from the execution trace, so LLM-as-judge scorers can use `evalTracer.span(...)` and cached spans without polluting the agent trajectory. Outputs set inside a scorer stay private to that score. Spanless `evalTracer.cache(...)` calls made directly inside a scorer are stored on that score trace's `cacheRefs` payload.

-`manualScores` declares score columns that reviewers fill in after a run.
-Pending values keep the eval in an `unscored` state instead of failing.
+`manualScores` declares score columns that reviewers fill in after a run. Pending values keep the eval in an `unscored` state instead of failing.

-See `EvalScoreDef` / `EvalManualScoreDef` in the types for the full shape
-(format, threshold, column overrides).
+See `EvalScoreDef` / `EvalManualScoreDef` in the types for the full shape (format, threshold, column overrides).

 ## Outputs, columns, trace display

-- `setEvalOutput(key, value)` writes reviewable data for the case. Values are
-
-
-
-
-
-
-
-
-
-
-
-
-typed from the eval's outputs generic. Keep `setEvalOutput` for shared
-workflow code that does not receive the execute context.
-- Use `incrementEvalOutput(key, delta)` for numeric totals,
-`appendToEvalOutput(key, value)` for arrays that preserve existing scalar
-values, and `mergeEvalOutput(key, patch)` for shallow object updates.
-`evalSpan` has matching `incrementAttribute`, `appendToAttribute`, and
-`mergeAttribute` helpers for span attributes.
-- `outputsSchema` validates final outputs after `execute` and
-`deriveFromTracing`, before computed scores. For Zod object schemas, only
-declared keys are passed to the schema; parsed fields merge back into the raw
-output map, so defaults/transforms apply to configured fields and
-unconfigured outputs stay visible as before. Validation failures fail the case
-and skip computed scores. When you pass a narrowed outputs type as the second
-`defineEval` generic, `outputsSchema` is required.
-- `columns` overrides the display for output and score keys (label, format,
-alignment, visibility). The set of supported formats is declared by the
-`ColumnFormat` union and `EvalColumnOverride` in the types. Global
-`columns` in `agent-evals.config.ts` apply to every eval; eval-level
-`columns` override matching global keys. Use `hideIfNoValue: true` to hide a
-column when every row is missing the value, `null`, or an empty string; `0`
-and `false` still count as values. Use `format: 'image'`, `'html'`, `'pdf'`,
-`'audio'`, `'video'`, or `'file'` for `Blob`/`File` outputs or `repoFile(...)`
-references that should render as reviewable artifacts. Persisted `Blob`/`File`
-artifacts include byte sizes in their run artifact refs; pass the optional
-`repoFile(..., ..., sizeBytes)` hint when a repository file card should show
-a size.
-- `deriveFromTracing` can be authored globally in `agent-evals.config.ts` or
-locally on one eval. Prefer the keyed map form for shared metrics:
-`deriveFromTracing: { toolCalls: ({ trace }) => trace.findSpansByKind('tool').length }`.
-The older object-returning function form remains supported. Global
-derivations run first; runtime outputs are never overwritten, and eval-level
-derivations only fill keys still missing after global derivations. In keyed
-form, return `undefined` to omit one output for that case. Do not call
-`evalAssert(...)` or `evalExpect(...)` from `deriveFromTracing`; use
-`tracingAssertions` for trace-derived pass/fail checks.
-- `tracingAssertions` is a single function that can be authored globally or
-locally on one eval when a finished-trace invariant should pass or fail the
-case without creating a fake score column. It receives the same
-`{ trace, input, case }` context as `deriveFromTracing`; call
-`evalAssert(...)` or `evalExpect(...)` inside it.
-Useful trace helpers include `trace.findSpan(name)`, `trace.findSpans(name)`,
-`trace.hasSpan(name)`, `trace.findSpansByKind(kind)`,
-`trace.findToolCallSpans()`, `trace.listToolCallSpanNames()`,
-`trace.hasToolCallSpan(name)`,
-`trace.getToolCallSpans(name)`,
-`trace.getToolCallSpanCount(toolName)`,
-`trace.hasToolCallSpanCount(toolName, expectedCalls)`,
-`trace.listSpanNames(kind?)`, `trace.listSpanNamesDfs(kind?)`, and
-`trace.flattenDfs()`.
-The tool-call helpers include both `kind: 'tool'` spans and imported
-execution spans recorded as `kind: 'tool_call'`. Tool-name checks and counts
-match the span `name` as well as GenAI/Mastra identity attributes such as
-`genAI["gen_ai.tool.name"]` and `mastra.entityName`; list helpers prefer
-those tool identity attributes when present. `getToolCallSpans(name)`
-returns one normalized object per matching call, including parsed
-`arguments`, parsed `result`, `description`, `toolType`, `attributes`, and
-the original `span`.
-- `traceDisplay` promotes selected span attributes into the trace tree and
-detail pane; it supports aggregation across subtrees (`scope`, `mode`) and
-user-defined `transform(...)` for derived views (e.g. currency conversion).
-See the `TraceDisplayInputConfig` type.
-- `llmCalls` (in `agent-evals.config.ts`) configures how LLM-call spans are
-summarized for review. Defaults to `kind: 'llm'` spans with `model`,
-`usage.*`, `latencyMs`, `input`, `output`, etc. read from conventional
-attribute paths. The default `steps` path reads an array from
-`span.attributes.steps`; if it is missing, direct child `model_step` spans are
-shown as that call's steps. Tool calls are aggregated from the configured
-`toolCalls` path plus step-level `toolCalls` on authored step arrays or
-direct `model_step` child spans, including Mastra's serialized
-`mastra.model_step.output` format, and child `tool_call` execution spans
-under each model step. `latencyMs` is time to first token; duration, total
-tokens, output tokens/sec, and USD costs are derived. Override `kinds` to
-broaden the filter,
-override `attributes.<field>` for non-default primitive span shapes, configure
-model-keyed `pricing` to derive USD costs from token counts, with nested
-`providers` entries for provider-specific rates, add `costCurrencies` to show
-converted cost columns in the expanded breakdown table only, add
-`derivedAttributes` to persist computed values back onto matching LLM spans
-before trace consumers run, and add entries to `metrics` to surface arbitrary user metrics
-(`format: 'string' | 'number' | 'duration' | 'json' | 'boolean'`,
-`placements: ['header' | 'body']`). `derivedAttributes` can be a keyed map
-for one-off fields or one callback that returns multiple path/value pairs.
-Derived keys are dot-paths under `span.attributes`; return `undefined` to
-skip one span or one returned key.
-- Default usage config derives missing eval outputs from matching LLM/API spans
-before `outputsSchema` and scores run: `apiCalls`, `costUsd`, `llmTurns`,
-`inputTokens`, `outputTokens`, `totalTokens`, `cachedInputTokens`,
-`cacheCreationInputTokens`, `reasoningTokens`, and `llmDurationMs`. Authored
-outputs and column overrides win. Default usage columns, stats, and charts
-use `hideIfNoValue: true`. Default LLM usage charts configure cost, input
-tokens, and output tokens separately and use `dedupeConsecutiveValues: true`
-to skip repeated adjacent chart values. `totalTokens` is input + output only;
-cache read/write tokens stay separate and affect `costUsd` at their own
-rates. `llmTurns` is the maximum per-call turn count in the case run, using
-configured steps when available and otherwise one turn per matched LLM call
-span.
-Derived base input cost uses `inputTokens - cachedInputTokens -
-cacheCreationInputTokens` so cache details are not double-counted.
-`cacheCreationInputTokens` is the total cache-write count; optional
-`cacheCreationInput1hTokens` only splits that total for 1-hour write pricing
-via `cacheCreationInput1hUsdPerMillion`. `llmDurationMs` sums elapsed matched
-LLM span durations; it is not time-to-first-token latency.
-Remove defaults globally or per eval with `removeDefaultConfig: true` or a
-key list such as
-`removeDefaultConfig: ['apiCalls', 'reasoningTokens']`.
-- `apiCalls` (in `agent-evals.config.ts`) configures how API-call spans are
-summarized for review. Defaults to `kind: 'api'`, `'http'`, `'http.client'`,
-and `'fetch'` spans with `method`, `url`, `statusCode`, `request`,
-`response`, `requestBody`, `responseBody`, `headers`, `durationMs`, and
-`error` read from conventional attribute paths. Override `kinds` or
-`attributes.<field>` for external tracers, add `derivedAttributes` as a
-keyed map or object-returning callback for computed persisted API span
-attributes, and add `metrics` with the same formats and placements as
-LLM-call metrics.
-- `runLogs` (in `agent-evals.config.ts`) controls case log capture. Use
-`runLogs: { captureConsole: false }` to keep console output in the terminal
-without persisting console calls to case details. Manual `evalLog(...)` calls
-are still captured. Captured log locations store the selected user-facing
-source frame and the full JavaScript stack so agents can inspect additional
-frames in persisted artifacts when diagnosing where a log came from.
|
|
458
|
-
|
|
459
|
-
Stats rows and history charts can be authored via `stats` / `charts` on the eval
|
|
460
|
-
definition. Global `stats` in `agent-evals.config.ts` combine with eval-level
|
|
461
|
-
stats. Native stat kinds include `cases`, `passRate`, `duration`, and
|
|
462
|
-
`cacheHits`; `cacheHits` shows Agent Eval operation-level cache hits over total
|
|
463
|
-
cache operations (`hits/total`) from spans and `evalTracer.cache(...)` refs, not
|
|
464
|
-
LLM provider prompt-cache read tokens such as `cachedInputTokens`. Cache-hit
|
|
465
|
-
stats use a separate aggregate control and default to `sum`; `avg` is average
|
|
466
|
-
per-case hit rate, and min/max/best/worst select cases by hit rate. `duration`
|
|
467
|
-
aggregates per-case durations using the same modes as column stats. Usage stats
|
|
468
|
-
and LLM usage charts are added by default unless removed with
|
|
469
|
-
`removeDefaultConfig`. Column stats can override `format` and `numberFormat`,
|
|
470
|
-
otherwise they inherit from the matching column. Duration and column stat
|
|
471
|
-
aggregates support `avg`, `min`, `max`, `sum`, `best` (highest finite value),
|
|
472
|
-
and `worst` (lowest finite value). Use `defaultStatAggregate` in
|
|
473
|
-
`agent-evals.config.ts` to set the workspace-wide initial duration/column stat
|
|
474
|
-
mode, or on an eval definition to override it for that eval. Number formats use
|
|
475
|
-
`maxDecimalPlaces` to cap decimals and `minDecimalPlaces` to pad trailing
|
|
476
|
-
zeroes. Without `maxDecimalPlaces`, the default cap is 3 decimal places. Stats
|
|
477
|
-
and charts support `hideIfNoValue: true`. Charts support
|
|
478
|
-
`dedupeConsecutiveValues: true` to omit consecutive points whose plotted metrics
|
|
479
|
-
and tooltip extras match the previous kept point.
|
|
480
|
-
Their shapes live in the types; no need to memorize the option set.
|
|
207
|
+
- `setEvalOutput(key, value)` writes reviewable data for the case. Values are stored as received: primitives, objects/arrays, explicit file refs, and native `Blob`/`File` values. `columns.format` only controls visualization. Inside `execute`, `setOutput(key, value, formatOrOverride)` can attach a display hint directly to a runtime output, e.g. `'markdown'` or `{ label: 'Receipt', format: 'image', hideInTable: true }`. Authored global/eval `columns` for the same key take precedence over that runtime hint. Non-JSON runtime values such as `Date`, `Map`, `Set`, `BigInt`, typed arrays, and class instances use the tagged value serializer instead of a string fallback. Native `Blob`/`File` values are copied to run artifacts because saved run files are JSON. Inside `execute`, prefer the context `setOutput(key, value)` helper when writing schema-backed outputs; it is typed from the eval's outputs generic. Keep `setEvalOutput` for shared workflow code that does not receive the execute context.
|
|
208
|
+
- Use `incrementEvalOutput(key, delta)` for numeric totals, `appendToEvalOutput(key, value)` for arrays that preserve existing scalar values, and `mergeEvalOutput(key, patch)` for shallow object updates. `evalSpan` has matching `incrementAttribute`, `appendToAttribute`, and `mergeAttribute` helpers for span attributes.
|
|
209
|
+
- `outputsSchema` validates final outputs after `execute` and `deriveFromTracing`, before computed scores. For Zod object schemas, only declared keys are passed to the schema; parsed fields merge back into the raw output map, so defaults/transforms apply to configured fields and unconfigured outputs stay visible as before. Validation failures fail the case and skip computed scores. When you pass a narrowed outputs type as the second `defineEval` generic, `outputsSchema` is required.
|
|
210
|
+
- `columns` overrides the display for output and score keys (label, format, alignment, visibility). The set of supported formats is declared by the `ColumnFormat` union and `EvalColumnOverride` in the types. Global `columns` in `agent-evals.config.ts` apply to every eval; eval-level `columns` override matching global keys. Use `hideIfNoValue: true` to hide a column when every row is missing the value, `null`, or an empty string; `0` and `false` still count as values. Use `format: 'image'`, `'html'`, `'pdf'`, `'audio'`, `'video'`, or `'file'` for `Blob`/`File` outputs or `repoFile(...)` references that should render as reviewable artifacts. Persisted `Blob`/`File` artifacts include byte sizes in their run artifact refs; pass the optional `repoFile(..., ..., sizeBytes)` hint when a repository file card should show a size.
|
|
211
|
+
- `deriveFromTracing` can be authored globally in `agent-evals.config.ts` or locally on one eval. Prefer the keyed map form for shared metrics: `deriveFromTracing: { toolCalls: ({ trace }) => trace.findSpansByKind('tool').length }`. The older object-returning function form remains supported. Global derivations run first; runtime outputs are never overwritten, and eval-level derivations only fill keys still missing after global derivations. In keyed form, return `undefined` to omit one output for that case. Do not call `evalAssert(...)` or `evalExpect(...)` from `deriveFromTracing`; use `tracingAssertions` for trace-derived pass/fail checks.
|
|
212
|
+
- `tracingAssertions` is a single function that can be authored globally or locally on one eval when a finished-trace invariant should pass or fail the case without creating a fake score column. It receives the same `{ trace, input, case }` context as `deriveFromTracing`; call `evalAssert(...)` or `evalExpect(...)` inside it. Useful trace helpers include `trace.findSpan(name)`, `trace.findSpans(name)`, `trace.hasSpan(name)`, `trace.findSpansByKind(kind)`, `trace.findToolCallSpans()`, `trace.listToolCallSpanNames()`, `trace.hasToolCallSpan(name)`, `trace.getToolCallSpans(name)`, `trace.getToolCallSpanCount(toolName)`, `trace.hasToolCallSpanCount(toolName, expectedCalls)`, `trace.listSpanNames(kind?)`, `trace.listSpanNamesDfs(kind?)`, and `trace.flattenDfs()`. The tool-call helpers include both `kind: 'tool'` spans and imported execution spans recorded as `kind: 'tool_call'`. Tool-name checks and counts match the span `name` as well as GenAI/Mastra identity attributes such as `genAI["gen_ai.tool.name"]` and `mastra.entityName`; list helpers prefer those tool identity attributes when present. `getToolCallSpans(name)` returns one normalized object per matching call, including parsed `arguments`, parsed `result`, `description`, `toolType`, `attributes`, and the original `span`.
|
|
213
|
+
- `traceDisplay` promotes selected span attributes into the trace tree and detail pane; it supports aggregation across subtrees (`scope`, `mode`) and user-defined `transform(...)` for derived views (e.g. currency conversion). See the `TraceDisplayInputConfig` type.
|
|
214
|
+
- `llmCalls` (in `agent-evals.config.ts`) configures how LLM-call spans are summarized for review. Defaults to `kind: 'llm'` spans with `model`, `usage.*`, `latencyMs`, `input`, `output`, etc. read from conventional attribute paths. The default `steps` path reads an array from `span.attributes.steps`; if it is missing, direct child `model_step` spans are shown as that call's steps. Tool calls are aggregated from the configured `toolCalls` path plus step-level `toolCalls` on authored step arrays or direct `model_step` child spans, including Mastra's serialized `mastra.model_step.output` format, and child `tool_call` execution spans under each model step. `latencyMs` is time to first token; duration, total tokens, output tokens/sec, and USD costs are derived. Override `kinds` to broaden the filter, override `attributes.<field>` for non-default primitive span shapes, configure model-keyed `pricing` to derive USD costs from token counts, with nested `providers` entries for provider-specific rates, add `costCurrencies` to show converted cost columns in the expanded breakdown table only, add `derivedAttributes` to persist computed values back onto matching LLM spans before trace consumers run, and add entries to `metrics` to surface arbitrary user metrics (`format: 'string' | 'number' | 'duration' | 'json' | 'boolean'`, `placements: ['header' | 'body']`). `derivedAttributes` can be a keyed map for one-off fields or one callback that returns multiple path/value pairs. Derived keys are dot-paths under `span.attributes`; return `undefined` to skip one span or one returned key.
|
|
215
|
+
- Default usage config derives missing eval outputs from matching LLM/API spans before `outputsSchema` and scores run: `apiCalls`, `costUsd`, `llmTurns`, `inputTokens`, `outputTokens`, `totalTokens`, `cachedInputTokens`, `cacheCreationInputTokens`, `reasoningTokens`, and `llmDurationMs`. Authored outputs and column overrides win. Default usage columns, stats, and charts use `hideIfNoValue: true`. Default LLM usage charts configure cost, input tokens, and output tokens separately and use `dedupeConsecutiveValues: true` to skip repeated adjacent chart values. `totalTokens` is input + output only; cache read/write tokens stay separate and affect `costUsd` at their own rates. `llmTurns` is the maximum per-call turn count in the case run, using configured steps when available and otherwise one turn per matched LLM call span. Derived base input cost uses `inputTokens - cachedInputTokens - cacheCreationInputTokens` so cache details are not double-counted. `cacheCreationInputTokens` is the total cache-write count; optional `cacheCreationInput1hTokens` only splits that total for 1-hour write pricing via `cacheCreationInput1hUsdPerMillion`. `llmDurationMs` sums elapsed matched LLM span durations; it is not time-to-first-token latency. Remove defaults globally or per eval with `removeDefaultConfig: true` or a key list such as `removeDefaultConfig: ['apiCalls', 'reasoningTokens']`.
|
|
216
|
+
- `apiCalls` (in `agent-evals.config.ts`) configures how API-call spans are summarized for review. Defaults to `kind: 'api'`, `'http'`, `'http.client'`, and `'fetch'` spans with `method`, `url`, `statusCode`, `request`, `routeAlias`, `response`, `requestBody`, `responseBody`, `headers`, `durationMs`, and `error` read from conventional attribute paths. Override `kinds` or `attributes.<field>` for external tracers. Set a per-span `routeAlias` attribute such as `/v3/tabs/:id` to group dynamic URL paths in API-call route labels and endpoint charts while preserving original URLs in row details. Add `derivedAttributes` as a keyed map or object-returning callback for computed persisted API span attributes, and add `metrics` with the same formats and placements as LLM-call metrics.
|
|
217
|
+
- `runLogs` (in `agent-evals.config.ts`) controls case log capture. Use `runLogs: { captureConsole: false }` to keep console output in the terminal without persisting console calls to case details. Manual `evalLog(...)` calls are still captured. Captured log locations store the selected user-facing source frame and the full JavaScript stack so agents can inspect additional frames in persisted artifacts when diagnosing where a log came from.
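
The token arithmetic in the default usage bullet above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the `UsageSketch` and `RatesSketch` shapes, the `deriveUsageSketch` name, every rate field other than `cacheCreationInput1hUsdPerMillion`, the fallback of the 1-hour rate to the standard write rate, and the clamp to zero are all hypothetical. Reasoning tokens are left out because the text does not say how they are priced.

```ts
// Hypothetical shapes for illustration only. The token field names and
// `cacheCreationInput1hUsdPerMillion` come from the text; the rest are assumed.
type UsageSketch = {
  inputTokens: number;
  outputTokens: number;
  cachedInputTokens?: number;
  cacheCreationInputTokens?: number; // total cache-write count
  cacheCreationInput1hTokens?: number; // optional split of that total
};

type RatesSketch = {
  inputUsdPerMillion: number; // assumed name
  outputUsdPerMillion: number; // assumed name
  cachedInputUsdPerMillion: number; // assumed name
  cacheCreationInputUsdPerMillion: number; // assumed name
  cacheCreationInput1hUsdPerMillion?: number;
};

function deriveUsageSketch(usage: UsageSketch, rates: RatesSketch) {
  const cachedReads = usage.cachedInputTokens ?? 0;
  const cacheWrites = usage.cacheCreationInputTokens ?? 0;

  // The 1-hour count only splits the cache-write total; it never adds to it.
  const writes1h = Math.min(usage.cacheCreationInput1hTokens ?? 0, cacheWrites);
  const writesStandard = cacheWrites - writes1h;

  // Base input excludes cache reads and writes so they are not double-counted.
  // Clamping at zero is an assumption of this sketch.
  const baseInputTokens = Math.max(0, usage.inputTokens - cachedReads - cacheWrites);

  // Assumption: fall back to the standard write rate when no 1-hour rate is set.
  const rate1h =
    rates.cacheCreationInput1hUsdPerMillion ?? rates.cacheCreationInputUsdPerMillion;

  const costUsd =
    (baseInputTokens * rates.inputUsdPerMillion +
      cachedReads * rates.cachedInputUsdPerMillion +
      writesStandard * rates.cacheCreationInputUsdPerMillion +
      writes1h * rate1h +
      usage.outputTokens * rates.outputUsdPerMillion) /
    1_000_000;

  return {
    // Per the text, totalTokens is input + output only.
    totalTokens: usage.inputTokens + usage.outputTokens,
    baseInputTokens,
    costUsd,
  };
}

// Example call with made-up numbers and made-up rates.
const usageSketchResult = deriveUsageSketch(
  {
    inputTokens: 1000,
    outputTokens: 200,
    cachedInputTokens: 300,
    cacheCreationInputTokens: 100,
    cacheCreationInput1hTokens: 40,
  },
  {
    inputUsdPerMillion: 3,
    outputUsdPerMillion: 15,
    cachedInputUsdPerMillion: 0.5,
    cacheCreationInputUsdPerMillion: 4,
    cacheCreationInput1hUsdPerMillion: 6,
  },
);
console.log(usageSketchResult);
```

Keeping each cache bucket as its own term makes it easy to check that every input token is priced exactly once.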
|
|
218
|
+
|
|
219
|
+
Stats rows and history charts can be authored via `stats` / `charts` on the eval definition. Global `stats` in `agent-evals.config.ts` combine with eval-level stats. Native stat kinds include `cases`, `passRate`, `duration`, and `cacheHits`; `cacheHits` shows Agent Eval operation-level cache hits over total cache operations (`hits/total`) from spans and `evalTracer.cache(...)` refs, not LLM provider prompt-cache read tokens such as `cachedInputTokens`. Cache-hit stats use a separate aggregate control and default to `sum`; `avg` is average per-case hit rate, and min/max/best/worst select cases by hit rate. `duration` aggregates per-case durations using the same modes as column stats. Usage stats and LLM usage charts are added by default unless removed with `removeDefaultConfig`. Column stats can override `format` and `numberFormat`, otherwise they inherit from the matching column. Duration and column stat aggregates support `avg`, `min`, `max`, `sum`, `best` (highest finite value), and `worst` (lowest finite value). Use `defaultStatAggregate` in `agent-evals.config.ts` to set the workspace-wide initial duration/column stat mode, or on an eval definition to override it for that eval. Number formats use `maxDecimalPlaces` to cap decimals and `minDecimalPlaces` to pad trailing zeroes. Without `maxDecimalPlaces`, the default cap is 3 decimal places. Stats and charts support `hideIfNoValue: true`. Charts support `dedupeConsecutiveValues: true` to omit consecutive points whose plotted metrics and tooltip extras match the previous kept point. Their shapes live in the types; no need to memorize the option set.
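
As a rough illustration of the aggregate modes, number formatting, and chart deduplication described above, the sketch below reimplements them as standalone functions. It is not the package's code: `aggregateStat`, `formatNumberSketch`, and `dedupeConsecutive` are hypothetical names, and using `toLocaleString` for rounding is an assumption. In this numeric-only sketch `best` coincides with `max` and `worst` with `min`, following the "highest/lowest finite value" wording.

```ts
type StatAggregate = 'avg' | 'min' | 'max' | 'sum' | 'best' | 'worst';

// Aggregate per-case values, ignoring missing and non-finite entries.
function aggregateStat(
  values: ReadonlyArray<number | null | undefined>,
  mode: StatAggregate,
): number | undefined {
  const finite = values.filter(
    (v): v is number => typeof v === 'number' && Number.isFinite(v),
  );
  if (finite.length === 0) return undefined; // nothing to show
  const sum = finite.reduce((acc, v) => acc + v, 0);
  switch (mode) {
    case 'sum':
      return sum;
    case 'avg':
      return sum / finite.length;
    case 'min':
    case 'worst': // lowest finite value
      return Math.min(...finite);
    case 'max':
    case 'best': // highest finite value
      return Math.max(...finite);
  }
}

// Cap decimals with maxDecimalPlaces (default cap of 3, per the text) and
// pad trailing zeroes with minDecimalPlaces.
function formatNumberSketch(
  value: number,
  opts: { maxDecimalPlaces?: number; minDecimalPlaces?: number } = {},
): string {
  const min = opts.minDecimalPlaces ?? 0;
  const max = Math.max(opts.maxDecimalPlaces ?? 3, min); // keep min <= max
  return value.toLocaleString('en-US', {
    minimumFractionDigits: min,
    maximumFractionDigits: max,
    useGrouping: false,
  });
}

// Drop points that match the previous *kept* point, as
// `dedupeConsecutiveValues: true` is described to do for charts.
function dedupeConsecutive<T>(
  points: readonly T[],
  sameValues: (a: T, b: T) => boolean,
): T[] {
  const kept: T[] = [];
  for (const point of points) {
    if (kept.length === 0 || !sameValues(kept[kept.length - 1] as T, point)) {
      kept.push(point);
    }
  }
  return kept;
}

console.log(aggregateStat([3, null, 1, Number.NaN, 2], 'avg'));
console.log(formatNumberSketch(2, { minDecimalPlaces: 2 }));
console.log(dedupeConsecutive([1, 1, 2, 2, 1], (a, b) => a === b));
```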
|
|
481
220
|
|
|
482
221
|
## Cached operations
|
|
483
222
|
|
|
484
|
-
Wrap a costly pure span in `cache: { namespace, key }` so later runs replay its
|
|
485
|
-
recorded effects without re-executing:
|
|
223
|
+
Wrap a costly pure span in `cache: { namespace, key }` so later runs replay its recorded effects without re-executing:
|
|
486
224
|
|
|
487
225
|
```ts
|
|
488
226
|
await evalTracer.span(
|
|
@@ -507,8 +245,7 @@ await evalTracer.span(
|
|
|
507
245
|
);
|
|
508
246
|
```
|
|
509
247
|
|
|
510
|
-
Use `evalTracer.cache(...)` for pure values that should not create their own
|
|
511
|
-
trace span:
|
|
248
|
+
Use `evalTracer.cache(...)` for pure values that should not create their own trace span:
|
|
512
249
|
|
|
513
250
|
```ts
|
|
514
251
|
const context = await evalTracer.cache(
|
|
@@ -524,94 +261,25 @@ const context = await evalTracer.cache(
|
|
|
524
261
|
|
|
525
262
|
Mental model:
|
|
526
263
|
|
|
527
|
-
- Only SDK-mediated effects replay on a hit: sub-spans, checkpoints,
|
|
528
|
-
|
|
529
|
-
|
|
530
|
-
|
|
531
|
-
- `
|
|
532
|
-
|
|
533
|
-
|
|
534
|
-
|
|
535
|
-
|
|
536
|
-
|
|
537
|
-
-
|
|
538
|
-
|
|
539
|
-
in cache-key hashing.
|
|
540
|
-
- Cached spans require an explicit `cache.namespace`. Value caches can also set
|
|
541
|
-
an explicit `namespace`; prefer doing that when the cache is part of a
|
|
542
|
-
documented workflow. Matching namespaces share entries across operations/evals
|
|
543
|
-
that use the same authored key.
|
|
544
|
-
- Per eval, `cache: { read?: boolean; store?: boolean }` controls whether
|
|
545
|
-
authored cached operations may read or persist entries. Both default to
|
|
546
|
-
`true`. Use `read: false` to always execute instead of replaying hits, and
|
|
547
|
-
`store: false` to allow reads while preventing misses/refreshes from writing
|
|
548
|
-
cache or raw-key debug files. Run-level bypass/refresh controls still take
|
|
549
|
-
precedence.
|
|
550
|
-
- Authored eval ids are unique within one eval file. The exact eval identity is
|
|
551
|
-
the workspace-relative file path plus eval id, so the same id can be reused in
|
|
552
|
-
different files. Case ids must be unique within one eval; duplicate case ids
|
|
553
|
-
are reported as run errors.
|
|
554
|
-
- Cache keys should be deterministic primitives, arrays, and plain objects.
|
|
555
|
-
`Buffer`, `ArrayBuffer`, and typed arrays hash by bytes. Native `Blob`/`File`
|
|
556
|
-
keys use stable metadata by default (`type`, `size`, plus
|
|
557
|
-
`name`/`lastModified` for `File`) and do not read file bytes. Add
|
|
558
|
-
`serializeFileBytes: true` to a cached span or `evalTracer.cache(...)` call
|
|
559
|
-
when byte-level cache invalidation is required.
|
|
560
|
-
- Cache entries are stored as one Brotli-compressed JSON file per key under
|
|
561
|
-
`.agent-evals/cache/<sanitizedNamespace>/<keyHash>.json.br`, with a small
|
|
562
|
-
namespace index sidecar at
|
|
563
|
-
`.agent-evals/cache/<sanitizedNamespace>/.index-<namespaceHash>.json`.
|
|
564
|
-
Listing and retention use the index without opening cached payloads. Index
|
|
565
|
-
rows intentionally stay minimal: stored time, last access time, and external
|
|
566
|
-
JSON blob refs. Each namespace is capped at 100 entries by default. The runner
|
|
567
|
-
prunes least recently accessed indexed entries after a run finishes and the
|
|
568
|
-
runner stays idle for `cache.pruneIdleDelayMs ?? 5000` milliseconds. Configure
|
|
569
|
-
`cache.maxEntries` as a number for the default cap, or as
|
|
570
|
-
`{ default, namespaces }` for exact namespace-specific caps.
|
|
571
|
-
Writes initialize the row's last access time to the stored time; later cache
|
|
572
|
-
hits refresh that timestamp at the configured access-time update interval.
|
|
573
|
-
- Unindexed legacy cache files are ignored by normal lookup/listing. Use
|
|
574
|
-
`agent-evals cache repair` to remove unindexed cache files, stale index rows,
|
|
575
|
-
debug sidecars, and unreferenced blob files.
|
|
576
|
-
- Nested cached JSON values at or above roughly 10K JSON characters are stored
|
|
577
|
-
as content-addressed Brotli blobs under `.agent-evals/cache/cache-blobs/` and
|
|
578
|
-
referenced from cache JSON by sha256. Identical large payloads share the same
|
|
579
|
-
blob.
|
|
580
|
-
- Authored raw cache keys are stored for debugging under
|
|
581
|
-
`.agent-evals/cache-debug/<sanitizedNamespace>/<keyHash>.json`. This folder
|
|
582
|
-
may include prompts, user inputs, full serialized cache payloads, or other
|
|
583
|
-
sensitive data, should be gitignored, and is not needed for cache reuse.
|
|
584
|
-
- Cached payloads use JSON-safe tagged serialization, so return values and
|
|
585
|
-
recorded SDK effects preserve richer built-ins such as `Date`, `Map`, `Set`,
|
|
586
|
-
typed arrays, `URL`, `Headers`, `Blob`, and `File` on hits. Undefined values
|
|
587
|
-
are omitted by default instead of being written to cache files; direct
|
|
588
|
-
serializer callers can pass
|
|
589
|
-
`{ preserveUndefined: true }` when explicit undefined wrappers are needed.
|
|
590
|
-
Cache keys still use the deterministic key-hashing rules above.
|
|
264
|
+
- Only SDK-mediated effects replay on a hit: sub-spans, checkpoints, output helper calls, span attributes. External side effects (HTTP, DB writes, file I/O) **do not** replay — cache only pure functions of the key.
|
|
265
|
+
- `evalTracer.cache(...)` does not create a span. When it runs inside an active span, that span gets a `cache.refs` entry with the value cache name, key, namespace, and hit/miss status. When called directly from the case body (no surrounding span), the ref is recorded on the case detail's `cacheRefs` array. When called directly from a scorer, the ref is recorded on that scoring trace's `cacheRefs` array.
|
|
266
|
+
- Cache identity is the namespace plus the authored key. Source-file fingerprints are tracked for run freshness separately, but do not participate in cache-key hashing.
|
|
267
|
+
- Cached spans require an explicit `cache.namespace`. Value caches can also set an explicit `namespace`; prefer doing that when the cache is part of a documented workflow. Matching namespaces share entries across operations/evals that use the same authored key.
|
|
268
|
+
- Per eval, `cache: { read?: boolean; store?: boolean }` controls whether authored cached operations may read or persist entries. Both default to `true`. Use `read: false` to always execute instead of replaying hits, and `store: false` to allow reads while preventing misses/refreshes from writing cache or raw-key debug files. Run-level bypass/refresh controls still take precedence.
|
|
269
|
+
- Authored eval ids are unique within one eval file. The exact eval identity is the workspace-relative file path plus eval id, so the same id can be reused in different files. Case ids must be unique within one eval; duplicate case ids are reported as run errors.
|
|
270
|
+
- Cache keys should be deterministic primitives, arrays, and plain objects. `Buffer`, `ArrayBuffer`, and typed arrays hash by bytes. Native `Blob`/`File` keys use stable metadata by default (`type`, `size`, plus `name`/`lastModified` for `File`) and do not read file bytes. Add `serializeFileBytes: true` to a cached span or `evalTracer.cache(...)` call when byte-level cache invalidation is required.
|
|
271
|
+
- Cache entries are stored as one Brotli-compressed JSON file per key under `.agent-evals/cache/<sanitizedNamespace>/<keyHash>.json.br`, with a small namespace index sidecar at `.agent-evals/cache/<sanitizedNamespace>/.index-<namespaceHash>.json`. Listing and retention use the index without opening cached payloads. Index rows intentionally stay minimal: stored time, last access time, and external JSON blob refs. Each namespace is capped at 100 entries by default. The runner prunes least recently accessed indexed entries after a run finishes and the runner stays idle for `cache.pruneIdleDelayMs ?? 5000` milliseconds. Configure `cache.maxEntries` as a number for the default cap, or as `{ default, namespaces }` for exact namespace-specific caps. Writes initialize the row's last access time to the stored time; later cache hits refresh that timestamp at the configured access-time update interval.
|
|
272
|
+
- Unindexed legacy cache files are ignored by normal lookup/listing. Use `agent-evals cache repair` to remove unindexed cache files, stale index rows, debug sidecars, and unreferenced blob files.
|
|
273
|
+
- Nested cached JSON values at or above roughly 10K JSON characters are stored as content-addressed Brotli blobs under `.agent-evals/cache/cache-blobs/` and referenced from cache JSON by sha256. Identical large payloads share the same blob.
|
|
274
|
+
- Authored raw cache keys are stored for debugging under `.agent-evals/cache-debug/<sanitizedNamespace>/<keyHash>.json`. This folder may include prompts, user inputs, full serialized cache payloads, or other sensitive data, should be gitignored, and is not needed for cache reuse.
|
|
275
|
+
- Cached payloads use JSON-safe tagged serialization, so return values and recorded SDK effects preserve richer built-ins such as `Date`, `Map`, `Set`, typed arrays, `URL`, `Headers`, `Blob`, and `File` on hits. Undefined values are omitted by default instead of being written to cache files; direct serializer callers can pass `{ preserveUndefined: true }` when explicit undefined wrappers are needed. Cache keys still use the deterministic key-hashing rules above.
|
|
591
276
|
- Cache mode per run is controlled by CLI flags (see `agent-evals run --help`).
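
The retention rule in the list above (a per-namespace cap, pruning the least recently accessed index rows) could look roughly like the sketch below. It is an assumption-laden illustration, not the runner's code: the `IndexRowSketch` fields, both function names, and the optionality of `default` and `namespaces` are hypothetical. Only the default cap of 100 and the `number | { default, namespaces }` shape of `cache.maxEntries` come from the text.

```ts
// Hypothetical index row; the text only says rows hold the stored time,
// the last access time, and external JSON blob refs.
type IndexRowSketch = {
  keyHash: string;
  storedAt: number;
  lastAccessedAt: number;
};

// Shape of `cache.maxEntries` as described: a number, or default + per-namespace caps.
type MaxEntriesSketch =
  | number
  | { default?: number; namespaces?: Record<string, number> };

const DEFAULT_CAP = 100; // "capped at 100 entries by default"

function resolveCap(
  maxEntries: MaxEntriesSketch | undefined,
  namespace: string,
): number {
  if (typeof maxEntries === 'number') return maxEntries;
  // An exact namespace match wins, then the configured default, then 100.
  return maxEntries?.namespaces?.[namespace] ?? maxEntries?.default ?? DEFAULT_CAP;
}

// Return the rows that exceed the cap, least recently accessed first.
function selectRowsToPrune(
  rows: readonly IndexRowSketch[],
  cap: number,
): IndexRowSketch[] {
  if (rows.length <= cap) return [];
  return [...rows]
    .sort((a, b) => a.lastAccessedAt - b.lastAccessedAt)
    .slice(0, rows.length - cap);
}

const rowsSketch: IndexRowSketch[] = [
  { keyHash: 'a', storedAt: 1, lastAccessedAt: 5 },
  { keyHash: 'b', storedAt: 1, lastAccessedAt: 1 },
  { keyHash: 'c', storedAt: 2, lastAccessedAt: 3 },
];
const capSketch = resolveCap({ default: 10, namespaces: { summaries: 2 } }, 'summaries');
console.log(selectRowsToPrune(rowsSketch, capSketch).map((row) => row.keyHash));
```

Because a write initializes the last access time to the stored time, an entry that is never hit is ordered for pruning by when it was stored.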
|
|
592
277
|
|
|
593
278
|
## Artifacts
|
|
594
279
|
|
|
595
|
-
Run output lives under `.agent-evals/runs/<run-id>/`. Cache payloads live under
|
|
596
|
-
|
|
597
|
-
|
|
598
|
-
authoring evals; configure cache namespaces manually in eval code, then use
|
|
599
|
-
`agent-evals cache list` to inspect persisted namespace/key entries or
|
|
600
|
-
`agent-evals cache repair` to clean orphaned cache artifacts. Files in a run
|
|
601
|
-
directory include run metadata,
|
|
602
|
-
a run summary, per-case results, and per-case trace JSON. Inspect run files when
|
|
603
|
-
debugging persisted output, costs, columns, traces, or failures; inspect cache
|
|
604
|
-
entries when debugging replayed span/value-cache results.
|
|
605
|
-
Targeted evals in `run.json` are recorded by exact `evalKeys`
|
|
606
|
-
(`filePath + evalId`) rather than authored eval ids, so duplicate eval ids stay
|
|
607
|
-
unambiguous in saved history.
|
|
608
|
-
Temporary runs use the same directory layout, but are removed before the next
|
|
609
|
-
run of any kind starts.
|
|
610
|
-
When a saved case needs to be handed to another agent, the app can copy the
|
|
611
|
-
saved case detail path or the saved run folder path directly.
|
|
612
|
-
|
|
613
|
-
Use `agent-evals show-runs` when you need stable file
|
|
614
|
-
paths before reading saved output:
|
|
280
|
+
Run output lives under `.agent-evals/runs/<run-id>/`. Cache payloads live under `.agent-evals/cache/<sanitizedNamespace>/<keyHash>.json.br` with namespace index sidecars next to them. Do not rely on a specific cache filename when authoring evals; configure cache namespaces manually in eval code, then use `agent-evals cache list` to inspect persisted namespace/key entries or `agent-evals cache repair` to clean orphaned cache artifacts. Files in a run directory include run metadata, a run summary, per-case results, and per-case trace JSON. Inspect run files when debugging persisted output, costs, columns, traces, or failures; inspect cache entries when debugging replayed span/value-cache results. Targeted evals in `run.json` are recorded by exact `evalKeys` (`filePath + evalId`) rather than authored eval ids, so duplicate eval ids stay unambiguous in saved history. Temporary runs use the same directory layout, but are removed before the next run of any kind starts. When a saved case needs to be handed to another agent, the app can copy the saved case detail path or the saved run folder path directly.
|
|
281
|
+
|
|
282
|
+
Use `agent-evals show-runs` when you need stable file paths before reading saved output:
|
|
615
283
|
|
|
616
284
|
```sh
|
|
617
285
|
agent-evals show-runs
|
|
@@ -622,21 +290,11 @@ jq . .agent-evals/runs/<run-id>/case-details/<case-id>.json
|
|
|
622
290
|
jq . .agent-evals/runs/<run-id>/traces/<case-id>.json
|
|
623
291
|
```
|
|
624
292
|
|
|
625
|
-
Run ids can be full timestamp ids, short ids such as `r0` from
|
|
626
|
-
`agent-evals show-runs`, or `latest`. `show-runs` is only an artifact index;
|
|
627
|
-
the files themselves remain the source of truth for detailed results and
|
|
628
|
-
traces.
|
|
293
|
+
Run ids can be full timestamp ids, short ids such as `r0` from `agent-evals show-runs`, or `latest`. `show-runs` is only an artifact index; the files themselves remain the source of truth for detailed results and traces.
|
|
629
294
|
|
|
630
295
|
## Module mocking
|
|
631
296
|
|
|
632
|
-
For true module replacement inside an eval, register `mock.module(...)` from
|
|
633
|
-
`node:test` before dynamically importing the module graph. Agent Evals enables
|
|
634
|
-
Node's `--experimental-test-module-mocks` flag automatically for CLI and app
|
|
635
|
-
runs. Use dynamic
|
|
636
|
-
`import(...)` inside `execute` — static imports happen too early.
|
|
637
|
-
Each case/trial reloads the eval module graph in its own isolation scope, so
|
|
638
|
-
module-level mock state in workspace files and ESM dependencies does not leak
|
|
639
|
-
between concurrent cases.
|
|
297
|
+
For true module replacement inside an eval, register `mock.module(...)` from `node:test` before dynamically importing the module graph. Agent Evals enables Node's `--experimental-test-module-mocks` flag automatically for CLI and app runs. Use dynamic `import(...)` inside `execute` — static imports happen too early. Each case/trial reloads the eval module graph in its own isolation scope, so module-level mock state in workspace files and ESM dependencies does not leak between concurrent cases.
|
|
640
298
|
|
|
641
299
|
```ts
|
|
642
300
|
import { mock } from 'node:test';
|
|
@@ -660,25 +318,11 @@ defineEval({
|
|
|
660
318
|
|
|
661
319
|
When adding or changing evals:
|
|
662
320
|
|
|
663
|
-
1. Put the tracing + ambient SDK calls in the product code that runs in both
|
|
664
|
-
production and evals. Keep eval files thin.
|
|
321
|
+
1. Put the tracing + ambient SDK calls in the product code that runs in both production and evals. Keep eval files thin.
|
|
665
322
|
2. Use realistic cases drawn from real product flows; avoid placeholder inputs.
|
|
666
|
-
3. `evalAssert` for hard invariants and truthy type narrowing. It records
|
|
667
|
-
|
|
668
|
-
in `assertionFailures` and fail the case. Use `evalExpect` for non-trivial
|
|
669
|
-
comparisons, `tracingAssertions` for invariants derived from the finished
|
|
670
|
-
trace, `scores` for graded signals, and `passThreshold` only on scores that
|
|
671
|
-
should gate pass/fail.
|
|
672
|
-
4. Surface reviewable values through execute-context `setOutput` or ambient
|
|
673
|
-
`setEvalOutput` in shared workflow code, and shape them with `columns`
|
|
674
|
-
formats from the `ColumnFormat` type.
|
|
323
|
+
3. `evalAssert` for hard invariants and truthy type narrowing. It records pass/fail entries in case-detail `assertions`; failed entries are also kept in `assertionFailures` and fail the case. Use `evalExpect` for non-trivial comparisons, `tracingAssertions` for invariants derived from the finished trace, `scores` for graded signals, and `passThreshold` only on scores that should gate pass/fail.
|
|
324
|
+
4. Surface reviewable values through execute-context `setOutput` or ambient `setEvalOutput` in shared workflow code, and shape them with `columns` formats from the `ColumnFormat` type.
|
|
675
325
|
5. Promote high-signal span attributes with `traceDisplay`.
|
|
676
|
-
6. Cache costly pure spans with `cache: { namespace, key }` and pure spanless
|
|
677
|
-
|
|
678
|
-
|
|
679
|
-
7. Sanity-check after changes: `agent-evals list`, then
|
|
680
|
-
`agent-evals run --eval <id>`; use `--file <path|glob>` to target one file
|
|
681
|
-
when multiple files use the same eval id.
|
|
682
|
-
8. Locate saved artifacts with `agent-evals show-runs latest --json`, then read
|
|
683
|
-
the relevant `summary.json`, `cases.jsonl`, `case-details/<case-id>.json`,
|
|
684
|
-
or `traces/<case-id>.json` file directly.
|
|
326
|
+
6. Cache costly pure spans with `cache: { namespace, key }` and pure spanless values with `evalTracer.cache(...)`; never cache operations whose external side effects you depend on.
|
|
327
|
+
7. Sanity-check after changes: `agent-evals list`, then `agent-evals run --eval <id>`; use `--file <path|glob>` to target one file when multiple files use the same eval id.
|
|
328
|
+
8. Locate saved artifacts with `agent-evals show-runs latest --json`, then read the relevant `summary.json`, `cases.jsonl`, `case-details/<case-id>.json`, or `traces/<case-id>.json` file directly.
|