@ls-stack/agent-eval 0.60.4 → 0.61.0

@@ -5,80 +5,27 @@ description: Create, run, and maintain TypeScript evals with @ls-stack/agent-eva
 
  # Agent Eval
 
- Local-first eval runner for LLM and agent systems. Evals are strict TypeScript
- modules named `*.eval.ts`, discovered from `agent-evals.config.ts`, and
- executed through the CLI (`agent-evals run`) or local app (`agent-evals app`).
- Runs persist to `.agent-evals/` so results, traces, and caches survive across
- processes.
-
- This skill covers the mental model and conventions. For exhaustive field lists
- (config options, eval shape, column formats, score/chart/stats options, trace
- display rules), read the TypeScript declarations shipped with the package:
-
- - `AgentEvalsConfig`, `EvalDefinition`, `EvalCase`, `EvalOutputs`,
- `EvalColumnOverride`, `EvalDeriveConfig`, `EvalScoreDef`,
- `EvalManualScoreDef`, `EvalTraceTree`, and `TraceSpanInfo` are exported from
- `@ls-stack/agent-eval`.
- - Import Zod directly from `zod` when authoring `outputsSchema` or
- `manualInput.schema`; `@ls-stack/agent-eval` does not re-export Zod.
+ Local-first eval runner for LLM and agent systems. Evals are strict TypeScript modules named `*.eval.ts`, discovered from `agent-evals.config.ts`, and executed through the CLI (`agent-evals run`) or local app (`agent-evals app`). Runs persist to `.agent-evals/` so results, traces, and caches survive across processes.
+
+ This skill covers the mental model and conventions. For exhaustive field lists (config options, eval shape, column formats, score/chart/stats options, trace display rules), read the TypeScript declarations shipped with the package:
+
+ - `AgentEvalsConfig`, `EvalDefinition`, `EvalCase`, `EvalOutputs`, `EvalColumnOverride`, `EvalDeriveConfig`, `EvalScoreDef`, `EvalManualScoreDef`, `EvalTraceTree`, and `TraceSpanInfo` are exported from `@ls-stack/agent-eval`.
+ - Import Zod directly from `zod` when authoring `outputsSchema` or `manualInput.schema`; `@ls-stack/agent-eval` does not re-export Zod.
  - `.d.ts` files land in `node_modules/@ls-stack/agent-eval/dist/`.
- - CLI surface: `agent-evals --help` and `agent-evals <command> --help`.
- Unknown help targets exit non-zero instead of falling back to global help.
- - The CLI automatically loads `.env` from the current workspace. Shell-provided
- environment variables win; pass `--no-env` to disable `.env` loading once.
- - Unfiltered `agent-evals run` is disabled by default; use `--eval` or `--case`
- for targeted CLI runs, or `--tags-filter <expr>` to run cases matching tags.
- Set `allowCliRunAll: true` in
- `agent-evals.config.ts` to opt into run-all CLI behavior.
- - `agent-evals run --temporary` persists a run like normal history, but deletes
- it before the next run starts. Temporary runs appear in `show-runs` while
- present; normal runs are never deleted by temporary-run cleanup. In the app,
- the run drawer can promote a temporary run to durable history.
- - `agent-evals app` watches `agent-evals.config.ts` and the workspace `.env`
- and reloads them in place when the runner is idle. If config or `.env`
- changes during an active run, the reload applies after the current run
- reaches a terminal state.
- - App-triggered runs log the queued target evals, resolved case concurrency,
- each case start for evals that are actually running, and the terminal run
- summary in the server terminal.
-
- Assume that enumerated tables in this document may lag behind the types —
- treat the types as source of truth when they disagree.
+ - CLI surface: `agent-evals --help` and `agent-evals <command> --help`. Unknown help targets exit non-zero instead of falling back to global help.
+ - The CLI automatically loads `.env` from the current workspace. Shell-provided environment variables win; pass `--no-env` to disable `.env` loading once.
+ - Unfiltered `agent-evals run` is disabled by default; use `--eval` or `--case` for targeted CLI runs, or `--tags-filter <expr>` to run cases matching tags. Set `allowCliRunAll: true` in `agent-evals.config.ts` to opt into run-all CLI behavior.
+ - `agent-evals run --temporary` persists a run like normal history, but deletes it before the next run starts. Temporary runs appear in `show-runs` while present; normal runs are never deleted by temporary-run cleanup. In the app, the run drawer can promote a temporary run to durable history.
+ - `agent-evals app` watches `agent-evals.config.ts` and the workspace `.env` and reloads them in place when the runner is idle. If config or `.env` changes during an active run, the reload applies after the current run reaches a terminal state.
+ - App-triggered runs log the queued target evals, resolved case concurrency, each case start for evals that are actually running, and the terminal run summary in the server terminal.
+
+ Assume that enumerated tables in this document may lag behind the types — treat the types as source of truth when they disagree.
 
  ## Where tracing lives
 
- **Tracing belongs in the product source code, not in the eval file.** The eval
- file wires up cases and scoring; the real `evalTracer.span(...)` calls sit
- inside the workflow, agent, or tool functions that both production and evals
- invoke.
-
- `evalTracer`, `evalSpan`, output helpers, `evalLog`, `evalAssert`, and
- `evalExpect` are ambient no-ops when called outside an eval case scope, so
- leaving them in
- production paths is safe — they only record anything when the product code runs
- inside an eval's `execute`. Use `isInEvalScope()` to branch on eval-only behavior in shared code
- (e.g. skip a real network side effect): it returns `null` outside eval-owned
- work and returns `'env'`, `'cases'`, `'eval'`, `'derive'`, `'outputsSchema'`, or
- `'scorer'` during runner phases. Top-level modules imported while a run is being
- prepared see `'env'`; code called from `execute` sees `'eval'`. Use
- `getEvalCaseInput()` to read the current case input, or
- `getEvalCaseInput('customer.tier')` for nested dot-path access; outside a case
- scope it returns `undefined`. Use `nextEvalId()` inside eval-scoped code when a
- stable generated id is needed; it includes the eval file, eval id, case id, and
- a per-case sequence number, and throws outside an eval case scope.
- Use `evalLog(level, ...args)` for intentional per-case logs. The runner also
- captures `console.log`, `console.info`, `console.warn`, and `console.error`
- during case-owned phases by default; log arguments are stored as JSON-safe
- values. Logs inside cached operations are not replayed from cache hits.
- Use eval tags to target related coverage without naming every case:
- `AgentEvalsConfig.tags` applies workspace-wide tags, `defineEval({ tags })`
- adds eval tags, `case.tags` adds case-only tags, and `removeTags` disables a
- configured global tag for one eval. CLI filters support Vitest-style tag
- expressions such as `agent-evals run --tags-filter "refunds && !slow"`.
- Inside eval-scoped code, use `matchesEvalTags('tag')` or
- `matchesEvalTags({ all, any, not })`; it uses typed exact tag names and returns
- `false` outside a case scope. Projects can narrow tag names with a `.d.ts`
- module augmentation:
+ **Tracing belongs in the product source code, not in the eval file.** The eval file wires up cases and scoring; the real `evalTracer.span(...)` calls sit inside the workflow, agent, or tool functions that both production and evals invoke.
+
+ `evalTracer`, `evalSpan`, output helpers, `evalLog`, `evalAssert`, and `evalExpect` are ambient no-ops when called outside an eval case scope, so leaving them in production paths is safe — they only record anything when the product code runs inside an eval's `execute`. Use `isInEvalScope()` to branch on eval-only behavior in shared code (e.g. skip a real network side effect): it returns `null` outside eval-owned work and returns `'env'`, `'cases'`, `'eval'`, `'derive'`, `'outputsSchema'`, or `'scorer'` during runner phases. Top-level modules imported while a run is being prepared see `'env'`; code called from `execute` sees `'eval'`. Use `getEvalCaseInput()` to read the current case input, or `getEvalCaseInput('customer.tier')` for nested dot-path access; outside a case scope it returns `undefined`. Use `nextEvalId()` inside eval-scoped code when a stable generated id is needed; it includes the eval file, eval id, case id, and a per-case sequence number, and throws outside an eval case scope. Use `evalLog(level, ...args)` for intentional per-case logs. The runner also captures `console.log`, `console.info`, `console.warn`, and `console.error` during case-owned phases by default; log arguments are stored as JSON-safe values. Logs inside cached operations are not replayed from cache hits. Use eval tags to target related coverage without naming every case: `AgentEvalsConfig.tags` applies workspace-wide tags, `defineEval({ tags })` adds eval tags, `case.tags` adds case-only tags, and `removeTags` disables a configured global tag for one eval. CLI filters support Vitest-style tag expressions such as `agent-evals run --tags-filter "refunds && !slow"`. Inside eval-scoped code, use `matchesEvalTags('tag')` or `matchesEvalTags({ all, any, not })`; it uses typed exact tag names and returns `false` outside a case scope. Projects can narrow tag names with a `.d.ts` module augmentation:
 
  ```ts
  import '@ls-stack/agent-eval';
@@ -162,55 +109,17 @@ export async function runRefundWorkflow(input: RefundInput) {
  }
  ```
 
- Span `kind` values are open-ended strings. Use familiar kinds such as
- `agent`, `tool`, `llm`, `api`, `retrieval`, `scorer`, or `checkpoint` when they
- fit, and preserve external tracer kinds such as `mastra.workflow.step` when they
- are more specific. Only the `input` and `output` span attributes are promoted
- automatically in the trace tree; use `traceDisplay` for other span attributes
- such as `model` or `usage`. Eval-level LLM usage outputs, columns, stats, and
- charts are derived from matching LLM spans by default. Prefer
- `llmCalls.pricing` for LLM-call cost display; built-in costs ignore span
- `costUsd` attributes.
-
- Use `captureEvalSpanError(error)` for recoverable errors on the active
- `evalTracer.span(...)`, such as optional model/tool failures that fall back and
- continue. You can pass one error, multiple error arguments, or an array. The
- span is still marked `error`. Pass `'warning'` or `{ level: 'warning' }` as the
- final argument for diagnostics that should not change an otherwise successful
- span's status.
-
- If a span callback throws, the SDK automatically marks that span as `error`,
- stores the thrown error on it, and rethrows so the case errors. Use that for
- terminal failures; use `captureEvalSpanError(...)` for recoverable failures that
- continue through fallback logic.
-
- Fire-and-forget spans started during `execute` are awaited before outputs,
- `deriveFromTracing`, scores, and trace data are finalized, so `void
- evalTracer.span(...)` is safe when the span result is not needed. Register
- non-span promises with `startEvalBackgroundJob(promise)`. The runner only waits
- for settlement; promise and span errors keep their normal behavior. Use
- `waitForBackgroundJob: false` on a span, or `waitForBackgroundJobs: false` on an
- eval definition, when background work should not delay finalization.
-
- Eval Date APIs use a shifted wall clock by default: `new Date()` and
- `Date.now()` start at `2026-04-10T00:00:00.000Z` during case generation,
- execution, tracing, derived outputs, and scorers, then continue advancing with
- real elapsed time. Set `startTime` on a specific `defineEval(...)` to use
- another initial clock value, or set `startTime: 'now'` for that eval to use the
- real current clock. Timers are not faked, so async waits still run normally.
- Set `freezeTime: true` to keep Date APIs frozen until they are moved manually.
- Use `evalTime.startTime` to read the captured wall-clock start as a Dayjs
- object, and `evalTime.dayjs(...)` to create other Dayjs date objects. Use
- `evalTime.advance(amount, unit)` inside an eval to move the shifted clock
- forward with Dayjs `add(...)` units. It throws for evals with
- `startTime: 'now'`, unless `freezeTime: true` is also set.
-
- For libraries or observability exporters that already emit span lifecycle
- events, use `evalTracer.startSpan(...)`, `evalTracer.updateSpan(...)`,
- `evalTracer.endSpan(...)`, or `evalTracer.recordSpan(...)` to translate those
- events into the eval trace tree without wrapping the upstream work in a
- callback. Pass the upstream span id and parent id when available so saved trace
- JSON and `deriveFromTracing` use the recorded hierarchy.
+ Span `kind` values are open-ended strings. Use familiar kinds such as `agent`, `tool`, `llm`, `api`, `retrieval`, `scorer`, or `checkpoint` when they fit, and preserve external tracer kinds such as `mastra.workflow.step` when they are more specific. Only the `input` and `output` span attributes are promoted automatically in the trace tree; use `traceDisplay` for other span attributes such as `model` or `usage`. Eval-level LLM usage outputs, columns, stats, and charts are derived from matching LLM spans by default. Prefer `llmCalls.pricing` for LLM-call cost display; built-in costs ignore span `costUsd` attributes.
+
+ Use `captureEvalSpanError(error)` for recoverable errors on the active `evalTracer.span(...)`, such as optional model/tool failures that fall back and continue. You can pass one error, multiple error arguments, or an array. The span is still marked `error`. Pass `'warning'` or `{ level: 'warning' }` as the final argument for diagnostics that should not change an otherwise successful span's status.
+
+ If a span callback throws, the SDK automatically marks that span as `error`, stores the thrown error on it, and rethrows so the case errors. Use that for terminal failures; use `captureEvalSpanError(...)` for recoverable failures that continue through fallback logic.
+
+ Fire-and-forget spans started during `execute` are awaited before outputs, `deriveFromTracing`, scores, and trace data are finalized, so `void evalTracer.span(...)` is safe when the span result is not needed. Register non-span promises with `startEvalBackgroundJob(promise)`. The runner only waits for settlement; promise and span errors keep their normal behavior. Use `waitForBackgroundJob: false` on a span, or `waitForBackgroundJobs: false` on an eval definition, when background work should not delay finalization.
+
+ Eval Date APIs use a shifted wall clock by default: `new Date()` and `Date.now()` start at `2026-04-10T00:00:00.000Z` during case generation, execution, tracing, derived outputs, and scorers, then continue advancing with real elapsed time. Set `startTime` on a specific `defineEval(...)` to use another initial clock value, or set `startTime: 'now'` for that eval to use the real current clock. Timers are not faked, so async waits still run normally. Set `freezeTime: true` to keep Date APIs frozen until they are moved manually. Use `evalTime.startTime` to read the captured wall-clock start as a Dayjs object, and `evalTime.dayjs(...)` to create other Dayjs date objects. Use `evalTime.advance(amount, unit)` inside an eval to move the shifted clock forward with Dayjs `add(...)` units. It throws for evals with `startTime: 'now'`, unless `freezeTime: true` is also set.
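Reviewer note, not part of the package diff: the shifted-clock behavior described in the paragraph above can be pictured with a minimal sketch. Everything here is hypothetical (`createShiftedClock`, `ShiftedClock`, the injectable `realNow` parameter, and millisecond-based `advance`); it does not show how `@ls-stack/agent-eval` implements `evalTime`, only the idea of a fixed start that keeps advancing with real elapsed time unless frozen.

```typescript
// Hypothetical illustration only; none of these names come from @ls-stack/agent-eval.
interface ShiftedClock {
  /** Current shifted time in epoch milliseconds. */
  now(): number;
  /** Move the shifted clock forward manually by the given milliseconds. */
  advance(ms: number): void;
}

function createShiftedClock(
  startIso: string,
  realNow: () => number = Date.now,
  freeze = false,
): ShiftedClock {
  const start = Date.parse(startIso); // configured initial clock value
  const realStart = realNow(); // real time captured when the clock was created
  let manualOffsetMs = 0; // accumulated manual advances

  return {
    // A frozen clock ignores real elapsed time and only moves via advance().
    now: () => start + manualOffsetMs + (freeze ? 0 : realNow() - realStart),
    advance: (ms) => {
      manualOffsetMs += ms;
    },
  };
}

// Usage: a clock that starts at the default instant named in the text above.
const demoClock = createShiftedClock('2026-04-10T00:00:00.000Z');
demoClock.advance(60_000); // one minute later on the shifted clock
console.log(new Date(demoClock.now()).toISOString());
```

Injecting `realNow` keeps the sketch testable without touching the global `Date`; the real package is described as shifting `new Date()` and `Date.now()` themselves.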
+
+ For libraries or observability exporters that already emit span lifecycle events, use `evalTracer.startSpan(...)`, `evalTracer.updateSpan(...)`, `evalTracer.endSpan(...)`, or `evalTracer.recordSpan(...)` to translate those events into the eval trace tree without wrapping the upstream work in a callback. Pass the upstream span id and parent id when available so saved trace JSON and `deriveFromTracing` use the recorded hierarchy.
 
  ### Eval file (thin)
 
@@ -249,20 +158,13 @@ defineEval<RefundInput, RefundOutputs>({
  });
  ```
 
- `execute` usually just calls the product code. Push any placeholder
- `evalTracer.span(...)` wrappers out of the eval and into the product module
- they describe so production runs get the same trajectory. Only keep tracing
- inside `execute` when the behavior being measured is eval-specific (e.g. a
- judge-only sub-step with no production analogue).
+ `execute` usually just calls the product code. Push any placeholder `evalTracer.span(...)` wrappers out of the eval and into the product module they describe so production runs get the same trajectory. Only keep tracing inside `execute` when the behavior being measured is eval-specific (e.g. a judge-only sub-step with no production analogue).
 
- Case `id` values anchor historical runs, caches, and manual scores — keep them
- stable. See `EvalDefinition` / `EvalCase` in the types for every supported
- field.
+ Case `id` values anchor historical runs, caches, and manual scores — keep them stable. See `EvalDefinition` / `EvalCase` in the types for every supported field.
 
  ### Manual input
 
- Use `manualInput` instead of `cases` when each run should pause for the user
- to type values:
+ Use `manualInput` instead of `cases` when each run should pause for the user to type values:
 
  ```ts
  const inputSchema = z.object({
@@ -286,203 +188,39 @@ defineEval<z.infer<typeof inputSchema>>({
  });
  ```
 
- `manualInput` configures the local app form descriptor derived from the schema
- (`z.string` -> text, `z.enum` -> select, `z.boolean` -> checkbox, etc.; nested
- shapes fall back to JSON input). The CLI accepts `--input '<json>'` for a
- single targeted eval or `--input-file <path>` mapping eval keys/ids to inputs.
- Each run produces one synthetic case `<evalId>-manual` with the validated
- submission; mixing `manualInput` with `cases` is rejected at discovery time.
-
- For file or image fields, set `{ asFile: true, accept?, maxSizeBytes? }` and
- type the field with `manualInputFileValueSchema`. The runtime value carries
- `{ name, mimeType, sizeBytes, sha256, path }`, where `path` is a
- workspace-relative run artifact. Use `readManualInputFile(value)` when bytes,
- `Blob`, `File`, text, or parsed JSON are needed. In CLI runs, provide path
- objects such as
- `{ "image": { "path": "./screenshot.png" } }`; the CLI stages the file before
- starting the run.
+ `manualInput` configures the local app form descriptor derived from the schema (`z.string` -> text, `z.enum` -> select, `z.boolean` -> checkbox, etc.; nested shapes fall back to JSON input). The CLI accepts `--input '<json>'` for a single targeted eval or `--input-file <path>` mapping eval keys/ids to inputs. Each run produces one synthetic case `<evalId>-manual` with the validated submission; mixing `manualInput` with `cases` is rejected at discovery time.
+
+ For file or image fields, set `{ asFile: true, accept?, maxSizeBytes? }` and type the field with `manualInputFileValueSchema`. The runtime value carries `{ name, mimeType, sizeBytes, sha256, path }`, where `path` is a workspace-relative run artifact. Use `readManualInputFile(value)` when bytes, `Blob`, `File`, text, or parsed JSON are needed. In CLI runs, provide path objects such as `{ "image": { "path": "./screenshot.png" } }`; the CLI stages the file before starting the run.
 
  ## Scoring
 
- Every score returns a normalized `0..1` value. Pass/fail is per-score: a case
- fails if any score with `passThreshold` falls below it, if an assertion fails,
- or if the case errors. Scores without `passThreshold` are informational.
+ Every score returns a normalized `0..1` value. Pass/fail is per-score: a case fails if any score with `passThreshold` falls below it, if an assertion fails, or if the case errors. Scores without `passThreshold` are informational.
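Reviewer note, not part of the package diff: the pass/fail rule stated above can be sketched as a small predicate. The `ScoreResult` and `CaseResult` shapes and the `casePasses` function are hypothetical and exist only for illustration; the package's real shapes are the ones declared by `EvalScoreDef` and related types.

```typescript
// Hypothetical shapes for illustration; not the package's actual types.
interface ScoreResult {
  /** Normalized score in the range 0..1. */
  value: number;
  /** When omitted, the score is informational and cannot fail the case. */
  passThreshold?: number;
}

interface CaseResult {
  scores: Record<string, ScoreResult>;
  assertionFailed: boolean;
  errored: boolean;
}

// A case fails if it errored, if an assertion failed, or if any thresholded
// score falls below its threshold. A score equal to its threshold passes.
function casePasses(result: CaseResult): boolean {
  if (result.errored || result.assertionFailed) return false;
  return Object.values(result.scores).every(
    (score) => score.passThreshold === undefined || score.value >= score.passThreshold,
  );
}

// Usage: the low "tone" score has no threshold, so it does not fail the case.
const demoResult: CaseResult = {
  scores: {
    accuracy: { value: 0.9, passThreshold: 0.8 },
    tone: { value: 0.2 },
  },
  assertionFailed: false,
  errored: false,
};
console.log(casePasses(demoResult));
```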
 
- Score functions run in their own trace scope, separate from the execution
- trace, so LLM-as-judge scorers can use `evalTracer.span(...)` and cached spans
- without polluting the agent trajectory. Outputs set inside a scorer stay
- private to that score. Spanless `evalTracer.cache(...)` calls made directly
- inside a scorer are stored on that score trace's `cacheRefs` payload.
+ Score functions run in their own trace scope, separate from the execution trace, so LLM-as-judge scorers can use `evalTracer.span(...)` and cached spans without polluting the agent trajectory. Outputs set inside a scorer stay private to that score. Spanless `evalTracer.cache(...)` calls made directly inside a scorer are stored on that score trace's `cacheRefs` payload.
 
- `manualScores` declares score columns that reviewers fill in after a run.
- Pending values keep the eval in an `unscored` state instead of failing.
+ `manualScores` declares score columns that reviewers fill in after a run. Pending values keep the eval in an `unscored` state instead of failing.
 
- See `EvalScoreDef` / `EvalManualScoreDef` in the types for the full shape
- (format, threshold, column overrides).
+ See `EvalScoreDef` / `EvalManualScoreDef` in the types for the full shape (format, threshold, column overrides).
 
  ## Outputs, columns, trace display
 
- - `setEvalOutput(key, value)` writes reviewable data for the case. Values are
- stored as received: primitives, objects/arrays, explicit file refs, and
- native `Blob`/`File` values. `columns.format` only controls visualization.
- Inside `execute`, `setOutput(key, value, formatOrOverride)` can attach a
- display hint directly to a runtime output, e.g. `'markdown'` or
- `{ label: 'Receipt', format: 'image', hideInTable: true }`. Authored
- global/eval `columns` for the same key take precedence over that runtime
- hint.
- Non-JSON runtime values such as `Date`, `Map`, `Set`, `BigInt`, typed arrays,
- and class instances use the tagged value serializer instead of a string
- fallback. Native `Blob`/`File` values are copied to run artifacts because
- saved run files are JSON. Inside `execute`, prefer the context
- `setOutput(key, value)` helper when writing schema-backed outputs; it is
- typed from the eval's outputs generic. Keep `setEvalOutput` for shared
- workflow code that does not receive the execute context.
- - Use `incrementEvalOutput(key, delta)` for numeric totals,
- `appendToEvalOutput(key, value)` for arrays that preserve existing scalar
- values, and `mergeEvalOutput(key, patch)` for shallow object updates.
- `evalSpan` has matching `incrementAttribute`, `appendToAttribute`, and
- `mergeAttribute` helpers for span attributes.
- - `outputsSchema` validates final outputs after `execute` and
- `deriveFromTracing`, before computed scores. For Zod object schemas, only
- declared keys are passed to the schema; parsed fields merge back into the raw
- output map, so defaults/transforms apply to configured fields and
- unconfigured outputs stay visible as before. Validation failures fail the case
- and skip computed scores. When you pass a narrowed outputs type as the second
- `defineEval` generic, `outputsSchema` is required.
- - `columns` overrides the display for output and score keys (label, format,
- alignment, visibility). The set of supported formats is declared by the
- `ColumnFormat` union and `EvalColumnOverride` in the types. Global
- `columns` in `agent-evals.config.ts` apply to every eval; eval-level
- `columns` override matching global keys. Use `hideIfNoValue: true` to hide a
- column when every row is missing the value, `null`, or an empty string; `0`
- and `false` still count as values. Use `format: 'image'`, `'html'`, `'pdf'`,
- `'audio'`, `'video'`, or `'file'` for `Blob`/`File` outputs or `repoFile(...)`
- references that should render as reviewable artifacts. Persisted `Blob`/`File`
- artifacts include byte sizes in their run artifact refs; pass the optional
- `repoFile(..., ..., sizeBytes)` hint when a repository file card should show
- a size.
- - `deriveFromTracing` can be authored globally in `agent-evals.config.ts` or
- locally on one eval. Prefer the keyed map form for shared metrics:
- `deriveFromTracing: { toolCalls: ({ trace }) => trace.findSpansByKind('tool').length }`.
- The older object-returning function form remains supported. Global
- derivations run first; runtime outputs are never overwritten, and eval-level
- derivations only fill keys still missing after global derivations. In keyed
- form, return `undefined` to omit one output for that case. Do not call
- `evalAssert(...)` or `evalExpect(...)` from `deriveFromTracing`; use
- `tracingAssertions` for trace-derived pass/fail checks.
- - `tracingAssertions` is a single function that can be authored globally or
- locally on one eval when a finished-trace invariant should pass or fail the
- case without creating a fake score column. It receives the same
- `{ trace, input, case }` context as `deriveFromTracing`; call
- `evalAssert(...)` or `evalExpect(...)` inside it.
- Useful trace helpers include `trace.findSpan(name)`, `trace.findSpans(name)`,
- `trace.hasSpan(name)`, `trace.findSpansByKind(kind)`,
- `trace.findToolCallSpans()`, `trace.listToolCallSpanNames()`,
- `trace.hasToolCallSpan(name)`,
- `trace.getToolCallSpans(name)`,
- `trace.getToolCallSpanCount(toolName)`,
- `trace.hasToolCallSpanCount(toolName, expectedCalls)`,
- `trace.listSpanNames(kind?)`, `trace.listSpanNamesDfs(kind?)`, and
- `trace.flattenDfs()`.
- The tool-call helpers include both `kind: 'tool'` spans and imported
- execution spans recorded as `kind: 'tool_call'`. Tool-name checks and counts
- match the span `name` as well as GenAI/Mastra identity attributes such as
- `genAI["gen_ai.tool.name"]` and `mastra.entityName`; list helpers prefer
- those tool identity attributes when present. `getToolCallSpans(name)`
- returns one normalized object per matching call, including parsed
- `arguments`, parsed `result`, `description`, `toolType`, `attributes`, and
- the original `span`.
- - `traceDisplay` promotes selected span attributes into the trace tree and
- detail pane; it supports aggregation across subtrees (`scope`, `mode`) and
- user-defined `transform(...)` for derived views (e.g. currency conversion).
- See the `TraceDisplayInputConfig` type.
- - `llmCalls` (in `agent-evals.config.ts`) configures how LLM-call spans are
- summarized for review. Defaults to `kind: 'llm'` spans with `model`,
- `usage.*`, `latencyMs`, `input`, `output`, etc. read from conventional
- attribute paths. The default `steps` path reads an array from
- `span.attributes.steps`; if it is missing, direct child `model_step` spans are
- shown as that call's steps. Tool calls are aggregated from the configured
- `toolCalls` path plus step-level `toolCalls` on authored step arrays or
- direct `model_step` child spans, including Mastra's serialized
- `mastra.model_step.output` format, and child `tool_call` execution spans
- under each model step. `latencyMs` is time to first token; duration, total
- tokens, output tokens/sec, and USD costs are derived. Override `kinds` to
- broaden the filter,
- override `attributes.<field>` for non-default primitive span shapes, configure
- model-keyed `pricing` to derive USD costs from token counts, with nested
- `providers` entries for provider-specific rates, add `costCurrencies` to show
- converted cost columns in the expanded breakdown table only, add
- `derivedAttributes` to persist computed values back onto matching LLM spans
- before trace consumers run, and add entries to `metrics` to surface arbitrary user metrics
- (`format: 'string' | 'number' | 'duration' | 'json' | 'boolean'`,
- `placements: ['header' | 'body']`). `derivedAttributes` can be a keyed map
- for one-off fields or one callback that returns multiple path/value pairs.
- Derived keys are dot-paths under `span.attributes`; return `undefined` to
- skip one span or one returned key.
- - Default usage config derives missing eval outputs from matching LLM/API spans
- before `outputsSchema` and scores run: `apiCalls`, `costUsd`, `llmTurns`,
- `inputTokens`, `outputTokens`, `totalTokens`, `cachedInputTokens`,
- `cacheCreationInputTokens`, `reasoningTokens`, and `llmDurationMs`. Authored
- outputs and column overrides win. Default usage columns, stats, and charts
- use `hideIfNoValue: true`. Default LLM usage charts configure cost, input
- tokens, and output tokens separately and use `dedupeConsecutiveValues: true`
- to skip repeated adjacent chart values. `totalTokens` is input + output only;
- cache read/write tokens stay separate and affect `costUsd` at their own
- rates. `llmTurns` is the maximum per-call turn count in the case run, using
- configured steps when available and otherwise one turn per matched LLM call
- span.
- Derived base input cost uses `inputTokens - cachedInputTokens -
- cacheCreationInputTokens` so cache details are not double-counted.
- `cacheCreationInputTokens` is the total cache-write count; optional
- `cacheCreationInput1hTokens` only splits that total for 1-hour write pricing
- via `cacheCreationInput1hUsdPerMillion`. `llmDurationMs` sums elapsed matched
- LLM span durations; it is not time-to-first-token latency.
- Remove defaults globally or per eval with `removeDefaultConfig: true` or a
- key list such as
- `removeDefaultConfig: ['apiCalls', 'reasoningTokens']`.
- - `apiCalls` (in `agent-evals.config.ts`) configures how API-call spans are
- summarized for review. Defaults to `kind: 'api'`, `'http'`, `'http.client'`,
- and `'fetch'` spans with `method`, `url`, `statusCode`, `request`,
- `response`, `requestBody`, `responseBody`, `headers`, `durationMs`, and
- `error` read from conventional attribute paths. Override `kinds` or
- `attributes.<field>` for external tracers, add `derivedAttributes` as a
- keyed map or object-returning callback for computed persisted API span
- attributes, and add `metrics` with the same formats and placements as
- LLM-call metrics.
- - `runLogs` (in `agent-evals.config.ts`) controls case log capture. Use
- `runLogs: { captureConsole: false }` to keep console output in the terminal
- without persisting console calls to case details. Manual `evalLog(...)` calls
- are still captured. Captured log locations store the selected user-facing
- source frame and the full JavaScript stack so agents can inspect additional
- frames in persisted artifacts when diagnosing where a log came from.
458
-
459
- Stats rows and history charts can be authored via `stats` / `charts` on the eval
460
- definition. Global `stats` in `agent-evals.config.ts` combine with eval-level
461
- stats. Native stat kinds include `cases`, `passRate`, `duration`, and
462
- `cacheHits`; `cacheHits` shows Agent Eval operation-level cache hits over total
463
- cache operations (`hits/total`) from spans and `evalTracer.cache(...)` refs, not
464
- LLM provider prompt-cache read tokens such as `cachedInputTokens`. Cache-hit
465
- stats use a separate aggregate control and default to `sum`; `avg` is average
466
- per-case hit rate, and min/max/best/worst select cases by hit rate. `duration`
467
- aggregates per-case durations using the same modes as column stats. Usage stats
468
- and LLM usage charts are added by default unless removed with
469
- `removeDefaultConfig`. Column stats can override `format` and `numberFormat`,
470
- otherwise they inherit from the matching column. Duration and column stat
471
- aggregates support `avg`, `min`, `max`, `sum`, `best` (highest finite value),
472
- and `worst` (lowest finite value). Use `defaultStatAggregate` in
473
- `agent-evals.config.ts` to set the workspace-wide initial duration/column stat
474
- mode, or on an eval definition to override it for that eval. Number formats use
475
- `maxDecimalPlaces` to cap decimals and `minDecimalPlaces` to pad trailing
476
- zeroes. Without `maxDecimalPlaces`, the default cap is 3 decimal places. Stats
477
- and charts support `hideIfNoValue: true`. Charts support
478
- `dedupeConsecutiveValues: true` to omit consecutive points whose plotted metrics
479
- and tooltip extras match the previous kept point.
480
- Their shapes live in the types; no need to memorize the option set.
207
+ - `setEvalOutput(key, value)` writes reviewable data for the case. Values are stored as received: primitives, objects/arrays, explicit file refs, and native `Blob`/`File` values. `columns.format` only controls visualization. Inside `execute`, `setOutput(key, value, formatOrOverride)` can attach a display hint directly to a runtime output, e.g. `'markdown'` or `{ label: 'Receipt', format: 'image', hideInTable: true }`. Authored global/eval `columns` for the same key take precedence over that runtime hint. Non-JSON runtime values such as `Date`, `Map`, `Set`, `BigInt`, typed arrays, and class instances use the tagged value serializer instead of a string fallback. Native `Blob`/`File` values are copied to run artifacts because saved run files are JSON. Inside `execute`, prefer the context `setOutput(key, value)` helper when writing schema-backed outputs; it is typed from the eval's outputs generic. Keep `setEvalOutput` for shared workflow code that does not receive the execute context.
208
+ - Use `incrementEvalOutput(key, delta)` for numeric totals, `appendToEvalOutput(key, value)` for arrays that preserve existing scalar values, and `mergeEvalOutput(key, patch)` for shallow object updates. `evalSpan` has matching `incrementAttribute`, `appendToAttribute`, and `mergeAttribute` helpers for span attributes.
209
+ - `outputsSchema` validates final outputs after `execute` and `deriveFromTracing`, before computed scores. For Zod object schemas, only declared keys are passed to the schema; parsed fields merge back into the raw output map, so defaults/transforms apply to configured fields and unconfigured outputs stay visible as before. Validation failures fail the case and skip computed scores. When you pass a narrowed outputs type as the second `defineEval` generic, `outputsSchema` is required.
210
+ - `columns` overrides the display for output and score keys (label, format, alignment, visibility). The set of supported formats is declared by the `ColumnFormat` union and `EvalColumnOverride` in the types. Global `columns` in `agent-evals.config.ts` apply to every eval; eval-level `columns` override matching global keys. Use `hideIfNoValue: true` to hide a column when every row is missing the value, `null`, or an empty string; `0` and `false` still count as values. Use `format: 'image'`, `'html'`, `'pdf'`, `'audio'`, `'video'`, or `'file'` for `Blob`/`File` outputs or `repoFile(...)` references that should render as reviewable artifacts. Persisted `Blob`/`File` artifacts include byte sizes in their run artifact refs; pass the optional `repoFile(..., ..., sizeBytes)` hint when a repository file card should show a size.
211
+ - `deriveFromTracing` can be authored globally in `agent-evals.config.ts` or locally on one eval. Prefer the keyed map form for shared metrics: `deriveFromTracing: { toolCalls: ({ trace }) => trace.findSpansByKind('tool').length }`. The older object-returning function form remains supported. Global derivations run first; runtime outputs are never overwritten, and eval-level derivations only fill keys still missing after global derivations. In keyed form, return `undefined` to omit one output for that case. Do not call `evalAssert(...)` or `evalExpect(...)` from `deriveFromTracing`; use `tracingAssertions` for trace-derived pass/fail checks.
212
+ - `tracingAssertions` is a single function that can be authored globally or locally on one eval when a finished-trace invariant should pass or fail the case without creating a fake score column. It receives the same `{ trace, input, case }` context as `deriveFromTracing`; call `evalAssert(...)` or `evalExpect(...)` inside it. Useful trace helpers include `trace.findSpan(name)`, `trace.findSpans(name)`, `trace.hasSpan(name)`, `trace.findSpansByKind(kind)`, `trace.findToolCallSpans()`, `trace.listToolCallSpanNames()`, `trace.hasToolCallSpan(name)`, `trace.getToolCallSpans(name)`, `trace.getToolCallSpanCount(toolName)`, `trace.hasToolCallSpanCount(toolName, expectedCalls)`, `trace.listSpanNames(kind?)`, `trace.listSpanNamesDfs(kind?)`, and `trace.flattenDfs()`. The tool-call helpers include both `kind: 'tool'` spans and imported execution spans recorded as `kind: 'tool_call'`. Tool-name checks and counts match the span `name` as well as GenAI/Mastra identity attributes such as `genAI["gen_ai.tool.name"]` and `mastra.entityName`; list helpers prefer those tool identity attributes when present. `getToolCallSpans(name)` returns one normalized object per matching call, including parsed `arguments`, parsed `result`, `description`, `toolType`, `attributes`, and the original `span`.
213
+ - `traceDisplay` promotes selected span attributes into the trace tree and detail pane; it supports aggregation across subtrees (`scope`, `mode`) and user-defined `transform(...)` for derived views (e.g. currency conversion). See the `TraceDisplayInputConfig` type.
214
+ - `llmCalls` (in `agent-evals.config.ts`) configures how LLM-call spans are summarized for review. It defaults to `kind: 'llm'` spans, with `model`, `usage.*`, `latencyMs`, `input`, `output`, and similar fields read from conventional attribute paths. The default `steps` path reads an array from `span.attributes.steps`; if it is missing, direct child `model_step` spans are shown as that call's steps. Tool calls are aggregated from the configured `toolCalls` path plus step-level `toolCalls` on authored step arrays or direct `model_step` child spans, including Mastra's serialized `mastra.model_step.output` format, and child `tool_call` execution spans under each model step. `latencyMs` is time to first token; duration, total tokens, output tokens/sec, and USD costs are derived. Override `kinds` to broaden the filter, and override `attributes.<field>` for non-default primitive span shapes. Configure model-keyed `pricing` to derive USD costs from token counts, with nested `providers` entries for provider-specific rates. Add `costCurrencies` to show converted cost columns in the expanded breakdown table only. Add `derivedAttributes` to persist computed values back onto matching LLM spans before trace consumers run. Add entries to `metrics` to surface arbitrary user metrics (`format: 'string' | 'number' | 'duration' | 'json' | 'boolean'`, `placements: ['header' | 'body']`). `derivedAttributes` can be a keyed map for one-off fields or one callback that returns multiple path/value pairs. Derived keys are dot-paths under `span.attributes`; return `undefined` to skip one span or one returned key.
215
+ - Default usage config derives missing eval outputs from matching LLM/API spans before `outputsSchema` and scores run: `apiCalls`, `costUsd`, `llmTurns`, `inputTokens`, `outputTokens`, `totalTokens`, `cachedInputTokens`, `cacheCreationInputTokens`, `reasoningTokens`, and `llmDurationMs`. Authored outputs and column overrides win. Default usage columns, stats, and charts use `hideIfNoValue: true`. Default LLM usage charts configure cost, input tokens, and output tokens separately and use `dedupeConsecutiveValues: true` to skip repeated adjacent chart values. `totalTokens` is input + output only; cache read/write tokens stay separate and affect `costUsd` at their own rates. `llmTurns` is the maximum per-call turn count in the case run, using configured steps when available and otherwise one turn per matched LLM call span. Derived base input cost uses `inputTokens - cachedInputTokens - cacheCreationInputTokens` so cache details are not double-counted. `cacheCreationInputTokens` is the total cache-write count; optional `cacheCreationInput1hTokens` only splits that total for 1-hour write pricing via `cacheCreationInput1hUsdPerMillion`. `llmDurationMs` sums elapsed matched LLM span durations; it is not time-to-first-token latency. Remove defaults globally or per eval with `removeDefaultConfig: true` or a key list such as `removeDefaultConfig: ['apiCalls', 'reasoningTokens']`.
216
+ - `apiCalls` (in `agent-evals.config.ts`) configures how API-call spans are summarized for review. Defaults to `kind: 'api'`, `'http'`, `'http.client'`, and `'fetch'` spans with `method`, `url`, `statusCode`, `request`, `routeAlias`, `response`, `requestBody`, `responseBody`, `headers`, `durationMs`, and `error` read from conventional attribute paths. Override `kinds` or `attributes.<field>` for external tracers. Set a per-span `routeAlias` attribute such as `/v3/tabs/:id` to group dynamic URL paths in API-call route labels and endpoint charts while preserving original URLs in row details. Add `derivedAttributes` as a keyed map or object-returning callback for computed persisted API span attributes, and add `metrics` with the same formats and placements as LLM-call metrics.
217
+ - `runLogs` (in `agent-evals.config.ts`) controls case log capture. Use `runLogs: { captureConsole: false }` to keep console output in the terminal without persisting console calls to case details. Manual `evalLog(...)` calls are still captured. Captured log locations store the selected user-facing source frame and the full JavaScript stack so agents can inspect additional frames in persisted artifacts when diagnosing where a log came from.
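The `deriveFromTracing` precedence rules in the list above (runtime outputs are never overwritten, global derivations run first, eval-level derivations only fill keys that are still missing, and `undefined` omits a key) can be sketched as a small merge function. This is an illustrative model only: `applyDerivations`, `OutputsSketch`, and `KeyedDeriveSketch` are hypothetical names, not exports of `@ls-stack/agent-eval`, and the real runner may differ in detail.

```ts
// Hypothetical types for illustration; not the package's exported types.
type OutputsSketch = Record<string, unknown>;
type KeyedDeriveSketch<Ctx> = Record<string, (ctx: Ctx) => unknown>;

// Merge derived outputs into runtime outputs using the documented precedence:
// runtime outputs win, then global derivations, then eval-level derivations.
function applyDerivations<Ctx>(
  runtime: OutputsSketch,
  ctx: Ctx,
  globalDerive: KeyedDeriveSketch<Ctx>,
  evalDerive: KeyedDeriveSketch<Ctx>,
): OutputsSketch {
  const merged: OutputsSketch = { ...runtime };
  for (const layer of [globalDerive, evalDerive]) {
    for (const [key, derive] of Object.entries(layer)) {
      // A key that already has a value is never overwritten by a later layer.
      if (Object.prototype.hasOwnProperty.call(merged, key)) continue;
      const value = derive(ctx);
      // Returning undefined omits this output for the case.
      if (value !== undefined) merged[key] = value;
    }
  }
  return merged;
}
```

Under this model, an eval-level derivation for a key that the global config already produced is simply skipped, which is why shared metrics are best authored once in the global keyed map.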
218
+
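The default usage arithmetic described in the list above can be illustrated with a short sketch. It assumes `inputTokens` includes cache-read and cache-write tokens (as the subtraction in the text implies). All rate field names except `cacheCreationInput1hUsdPerMillion` are hypothetical, and so are `UsageSketch`, `RatesSketch`, and `deriveUsage`. This is not the package's implementation.

```ts
// Hypothetical shapes for illustration; not the package's exported types.
interface UsageSketch {
  inputTokens: number;
  outputTokens: number;
  cachedInputTokens?: number;
  cacheCreationInputTokens?: number; // total cache-write count
  cacheCreationInput1hTokens?: number; // optional split of that total
}

interface RatesSketch {
  inputUsdPerMillion: number; // assumed name
  outputUsdPerMillion: number; // assumed name
  cachedInputUsdPerMillion: number; // assumed name
  cacheCreationInputUsdPerMillion: number; // assumed name
  cacheCreationInput1hUsdPerMillion?: number; // named in the text above
}

function deriveUsage(usage: UsageSketch, rates: RatesSketch) {
  const cached = usage.cachedInputTokens ?? 0;
  const cacheWrites = usage.cacheCreationInputTokens ?? 0;
  // The 1-hour count only splits the cache-write total; it never adds to it.
  const writes1h =
    rates.cacheCreationInput1hUsdPerMillion === undefined
      ? 0
      : Math.min(usage.cacheCreationInput1hTokens ?? 0, cacheWrites);
  const writesDefault = cacheWrites - writes1h;
  // Base input excludes cache reads and writes so they are not double-counted.
  const baseInput = Math.max(0, usage.inputTokens - cached - cacheWrites);
  const usd = (tokens: number, perMillion: number) =>
    (tokens * perMillion) / 1_000_000;
  const costUsd =
    usd(baseInput, rates.inputUsdPerMillion) +
    usd(cached, rates.cachedInputUsdPerMillion) +
    usd(writesDefault, rates.cacheCreationInputUsdPerMillion) +
    usd(writes1h, rates.cacheCreationInput1hUsdPerMillion ?? 0) +
    usd(usage.outputTokens, rates.outputUsdPerMillion);
  // totalTokens is input + output only.
  return { totalTokens: usage.inputTokens + usage.outputTokens, costUsd };
}
```

The point of the subtraction is that each token is billed at exactly one rate: cache reads and writes are priced at their own rates, and only the remainder is priced as plain input.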
219
+ Stats rows and history charts can be authored via `stats` / `charts` on the eval definition. Global `stats` in `agent-evals.config.ts` combine with eval-level stats. Native stat kinds include `cases`, `passRate`, `duration`, and `cacheHits`; `cacheHits` shows Agent Eval operation-level cache hits over total cache operations (`hits/total`) from spans and `evalTracer.cache(...)` refs, not LLM provider prompt-cache read tokens such as `cachedInputTokens`. Cache-hit stats use a separate aggregate control and default to `sum`; `avg` is average per-case hit rate, and min/max/best/worst select cases by hit rate. `duration` aggregates per-case durations using the same modes as column stats. Usage stats and LLM usage charts are added by default unless removed with `removeDefaultConfig`. Column stats can override `format` and `numberFormat`, otherwise they inherit from the matching column. Duration and column stat aggregates support `avg`, `min`, `max`, `sum`, `best` (highest finite value), and `worst` (lowest finite value). Use `defaultStatAggregate` in `agent-evals.config.ts` to set the workspace-wide initial duration/column stat mode, or on an eval definition to override it for that eval. Number formats use `maxDecimalPlaces` to cap decimals and `minDecimalPlaces` to pad trailing zeroes. Without `maxDecimalPlaces`, the default cap is 3 decimal places. Stats and charts support `hideIfNoValue: true`. Charts support `dedupeConsecutiveValues: true` to omit consecutive points whose plotted metrics and tooltip extras match the previous kept point. Their shapes live in the types; no need to memorize the option set.
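As a rough illustration of the aggregate modes and number-format caps described above, the sketch below models `avg`, `min`, `max`, `sum`, `best`, and `worst` over finite values, and a formatter with a default cap of 3 decimal places. `aggregateSketch` and `formatNumberSketch` are hypothetical helpers. Ignoring non-finite values in every mode and clamping `minDecimalPlaces` to the cap are assumptions rather than documented behavior.

```ts
type StatAggregateSketch = 'avg' | 'min' | 'max' | 'sum' | 'best' | 'worst';

// Aggregate per-case values; returns undefined when no finite value exists.
function aggregateSketch(
  values: number[],
  mode: StatAggregateSketch,
): number | undefined {
  const finite = values.filter((value) => Number.isFinite(value));
  if (finite.length === 0) return undefined;
  const sum = finite.reduce((total, value) => total + value, 0);
  switch (mode) {
    case 'sum':
      return sum;
    case 'avg':
      return sum / finite.length;
    case 'max':
    case 'best': // best = highest finite value
      return Math.max(...finite);
    case 'min':
    case 'worst': // worst = lowest finite value
      return Math.min(...finite);
  }
}

interface NumberFormatSketch {
  maxDecimalPlaces?: number;
  minDecimalPlaces?: number;
}

// Cap decimals with maxDecimalPlaces (default 3) and pad trailing zeroes
// with minDecimalPlaces.
function formatNumberSketch(
  value: number,
  format: NumberFormatSketch = {},
): string {
  const max = format.maxDecimalPlaces ?? 3;
  const min = Math.min(format.minDecimalPlaces ?? 0, max);
  return new Intl.NumberFormat('en-US', {
    minimumFractionDigits: min,
    maximumFractionDigits: max,
    useGrouping: false,
  }).format(value);
}
```

For example, `formatNumberSketch(1.5, { minDecimalPlaces: 2 })` pads to two decimals, while a value with many decimals is capped at three unless `maxDecimalPlaces` says otherwise.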
481
220
 
482
221
  ## Cached operations
483
222
 
484
- Wrap a costly pure span in `cache: { namespace, key }` so later runs replay its
485
- recorded effects without re-executing:
223
+ Wrap a costly pure span in `cache: { namespace, key }` so later runs replay its recorded effects without re-executing:
486
224
 
487
225
  ```ts
488
226
  await evalTracer.span(
@@ -507,8 +245,7 @@ await evalTracer.span(
507
245
  );
508
246
  ```
509
247
 
510
- Use `evalTracer.cache(...)` for pure values that should not create their own
511
- trace span:
248
+ Use `evalTracer.cache(...)` for pure values that should not create their own trace span:
512
249
 
513
250
  ```ts
514
251
  const context = await evalTracer.cache(
@@ -524,94 +261,25 @@ const context = await evalTracer.cache(
524
261
 
525
262
  Mental model:
526
263
 
527
- - Only SDK-mediated effects replay on a hit: sub-spans, checkpoints,
528
- output helper calls, span attributes. External side
529
- effects (HTTP, DB writes, file I/O) **do not** replay, so cache only pure
530
- functions of the key.
531
- - `evalTracer.cache(...)` does not create a span. When it runs inside an active
532
- span, that span gets a `cache.refs` entry with the value cache name, key,
533
- namespace, and hit/miss status. When called directly from the case body
534
- (no surrounding span), the ref is recorded on the case detail's `cacheRefs`
535
- array. When called directly from a scorer, the ref is recorded on that
536
- scoring trace's `cacheRefs` array.
537
- - Cache identity is the namespace plus the authored key. Source-file
538
- fingerprints are tracked for run freshness separately, but do not participate
539
- in cache-key hashing.
540
- - Cached spans require an explicit `cache.namespace`. Value caches can also set
541
- an explicit `namespace`; prefer doing that when the cache is part of a
542
- documented workflow. Matching namespaces share entries across operations/evals
543
- that use the same authored key.
544
- - Per eval, `cache: { read?: boolean; store?: boolean }` controls whether
545
- authored cached operations may read or persist entries. Both default to
546
- `true`. Use `read: false` to always execute instead of replaying hits, and
547
- `store: false` to allow reads while preventing misses/refreshes from writing
548
- cache or raw-key debug files. Run-level bypass/refresh controls still take
549
- precedence.
550
- - Authored eval ids are unique within one eval file. The exact eval identity is
551
- the workspace-relative file path plus eval id, so the same id can be reused in
552
- different files. Case ids must be unique within one eval; duplicate case ids
553
- are reported as run errors.
554
- - Cache keys should be deterministic primitives, arrays, and plain objects.
555
- `Buffer`, `ArrayBuffer`, and typed arrays hash by bytes. Native `Blob`/`File`
556
- keys use stable metadata by default (`type`, `size`, plus
557
- `name`/`lastModified` for `File`) and do not read file bytes. Add
558
- `serializeFileBytes: true` to a cached span or `evalTracer.cache(...)` call
559
- when byte-level cache invalidation is required.
560
- - Cache entries are stored as one Brotli-compressed JSON file per key under
561
- `.agent-evals/cache/<sanitizedNamespace>/<keyHash>.json.br`, with a small
562
- namespace index sidecar at
563
- `.agent-evals/cache/<sanitizedNamespace>/.index-<namespaceHash>.json`.
564
- Listing and retention use the index without opening cached payloads. Index
565
- rows intentionally stay minimal: stored time, last access time, and external
566
- JSON blob refs. Each namespace is capped at 100 entries by default. The runner
567
- prunes least recently accessed indexed entries after a run finishes and the
568
- runner stays idle for `cache.pruneIdleDelayMs ?? 5000` milliseconds. Configure
569
- `cache.maxEntries` as a number for the default cap, or as
570
- `{ default, namespaces }` for exact namespace-specific caps.
571
- Writes initialize the row's last access time to the stored time; later cache
572
- hits refresh that timestamp at the configured access-time update interval.
573
- - Unindexed legacy cache files are ignored by normal lookup/listing. Use
574
- `agent-evals cache repair` to remove unindexed cache files, stale index rows,
575
- debug sidecars, and unreferenced blob files.
576
- - Nested cached JSON values at or above roughly 10K JSON characters are stored
577
- as content-addressed Brotli blobs under `.agent-evals/cache/cache-blobs/` and
578
- referenced from cache JSON by sha256. Identical large payloads share the same
579
- blob.
580
- - Authored raw cache keys are stored for debugging under
581
- `.agent-evals/cache-debug/<sanitizedNamespace>/<keyHash>.json`. This folder
582
- may include prompts, user inputs, full serialized cache payloads, or other
583
- sensitive data, should be gitignored, and is not needed for cache reuse.
584
- - Cached payloads use JSON-safe tagged serialization, so return values and
585
- recorded SDK effects preserve richer built-ins such as `Date`, `Map`, `Set`,
586
- typed arrays, `URL`, `Headers`, `Blob`, and `File` on hits. Undefined values
587
- are omitted by default instead of being written to cache files; direct
588
- serializer callers can pass
589
- `{ preserveUndefined: true }` when explicit undefined wrappers are needed.
590
- Cache keys still use the deterministic key-hashing rules above.
264
+ - Only SDK-mediated effects replay on a hit: sub-spans, checkpoints, output helper calls, and span attributes. External side effects (HTTP, DB writes, file I/O) **do not** replay, so cache only pure functions of the key.
265
+ - `evalTracer.cache(...)` does not create a span. When it runs inside an active span, that span gets a `cache.refs` entry with the value cache name, key, namespace, and hit/miss status. When called directly from the case body (no surrounding span), the ref is recorded on the case detail's `cacheRefs` array. When called directly from a scorer, the ref is recorded on that scoring trace's `cacheRefs` array.
266
+ - Cache identity is the namespace plus the authored key. Source-file fingerprints are tracked for run freshness separately, but do not participate in cache-key hashing.
267
+ - Cached spans require an explicit `cache.namespace`. Value caches can also set an explicit `namespace`; prefer doing that when the cache is part of a documented workflow. Matching namespaces share entries across operations/evals that use the same authored key.
268
+ - Per eval, `cache: { read?: boolean; store?: boolean }` controls whether authored cached operations may read or persist entries. Both default to `true`. Use `read: false` to always execute instead of replaying hits, and `store: false` to allow reads while preventing misses/refreshes from writing cache or raw-key debug files. Run-level bypass/refresh controls still take precedence.
269
+ - Authored eval ids are unique within one eval file. The exact eval identity is the workspace-relative file path plus eval id, so the same id can be reused in different files. Case ids must be unique within one eval; duplicate case ids are reported as run errors.
270
+ - Cache keys should be deterministic primitives, arrays, and plain objects. `Buffer`, `ArrayBuffer`, and typed arrays hash by bytes. Native `Blob`/`File` keys use stable metadata by default (`type`, `size`, plus `name`/`lastModified` for `File`) and do not read file bytes. Add `serializeFileBytes: true` to a cached span or `evalTracer.cache(...)` call when byte-level cache invalidation is required.
271
+ - Cache entries are stored as one Brotli-compressed JSON file per key under `.agent-evals/cache/<sanitizedNamespace>/<keyHash>.json.br`, with a small namespace index sidecar at `.agent-evals/cache/<sanitizedNamespace>/.index-<namespaceHash>.json`. Listing and retention use the index without opening cached payloads. Index rows intentionally stay minimal: stored time, last access time, and external JSON blob refs. Each namespace is capped at 100 entries by default. The runner prunes least recently accessed indexed entries after a run finishes and the runner stays idle for `cache.pruneIdleDelayMs ?? 5000` milliseconds. Configure `cache.maxEntries` as a number for the default cap, or as `{ default, namespaces }` for exact namespace-specific caps. Writes initialize the row's last access time to the stored time; later cache hits refresh that timestamp at the configured access-time update interval.
272
+ - Unindexed legacy cache files are ignored by normal lookup/listing. Use `agent-evals cache repair` to remove unindexed cache files, stale index rows, debug sidecars, and unreferenced blob files.
273
+ - Nested cached JSON values at or above roughly 10K JSON characters are stored as content-addressed Brotli blobs under `.agent-evals/cache/cache-blobs/` and referenced from cache JSON by sha256. Identical large payloads share the same blob.
274
+ - Authored raw cache keys are stored for debugging under `.agent-evals/cache-debug/<sanitizedNamespace>/<keyHash>.json`. This folder may include prompts, user inputs, full serialized cache payloads, or other sensitive data, should be gitignored, and is not needed for cache reuse.
275
+ - Cached payloads use JSON-safe tagged serialization, so return values and recorded SDK effects preserve richer built-ins such as `Date`, `Map`, `Set`, typed arrays, `URL`, `Headers`, `Blob`, and `File` on hits. Undefined values are omitted by default instead of being written to cache files; direct serializer callers can pass `{ preserveUndefined: true }` when explicit undefined wrappers are needed. Cache keys still use the deterministic key-hashing rules above.
591
276
  - Cache mode per run is controlled by CLI flags (see `agent-evals run --help`).
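The retention rule in the mental model above (a per-namespace cap of 100 entries by default, with the least recently accessed entries pruned first) can be sketched as two pure helpers. `IndexRowSketch`, `resolveCapSketch`, and `selectPrunableSketch` are hypothetical names for illustration. The real index format and pruning code live in the package and may differ.

```ts
// Hypothetical index row; the text only says rows hold stored time,
// last access time, and external blob refs.
interface IndexRowSketch {
  keyHash: string;
  storedAt: number;
  lastAccessAt: number;
}

// Mirrors the documented `cache.maxEntries` shapes: a number, or
// `{ default, namespaces }` for exact namespace-specific caps.
type MaxEntriesSketch =
  | number
  | { default: number; namespaces?: Record<string, number> };

function resolveCapSketch(
  namespace: string,
  maxEntries: MaxEntriesSketch = 100,
): number {
  if (typeof maxEntries === 'number') return maxEntries;
  return maxEntries.namespaces?.[namespace] ?? maxEntries.default;
}

// Return the key hashes to prune: the least recently accessed rows
// beyond the cap, oldest access first.
function selectPrunableSketch(rows: IndexRowSketch[], cap: number): string[] {
  const overflow = rows.length - cap;
  if (overflow <= 0) return [];
  return [...rows]
    .sort((a, b) => a.lastAccessAt - b.lastAccessAt)
    .slice(0, overflow)
    .map((row) => row.keyHash);
}
```

Because the selection only needs timestamps from the index rows, it fits the documented design in which listing and retention never open the cached payloads themselves.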
592
277
 
593
278
  ## Artifacts
594
279
 
595
- Run output lives under `.agent-evals/runs/<run-id>/`. Cache payloads live under
596
- `.agent-evals/cache/<sanitizedNamespace>/<keyHash>.json.br` with namespace
597
- index sidecars next to them. Do not rely on a specific cache filename when
598
- authoring evals; configure cache namespaces manually in eval code, then use
599
- `agent-evals cache list` to inspect persisted namespace/key entries or
600
- `agent-evals cache repair` to clean orphaned cache artifacts. Files in a run
601
- directory include run metadata,
602
- a run summary, per-case results, and per-case trace JSON. Inspect run files when
603
- debugging persisted output, costs, columns, traces, or failures; inspect cache
604
- entries when debugging replayed span/value-cache results.
605
- Targeted evals in `run.json` are recorded by exact `evalKeys`
606
- (`filePath + evalId`) rather than authored eval ids, so duplicate eval ids stay
607
- unambiguous in saved history.
608
- Temporary runs use the same directory layout, but are removed before the next
609
- run of any kind starts.
610
- When a saved case needs to be handed to another agent, the app can copy the
611
- saved case detail path or the saved run folder path directly.
612
-
613
- Use `agent-evals show-runs` when you need stable file
614
- paths before reading saved output:
280
+ Run output lives under `.agent-evals/runs/<run-id>/`. Cache payloads live under `.agent-evals/cache/<sanitizedNamespace>/<keyHash>.json.br` with namespace index sidecars next to them. Do not rely on a specific cache filename when authoring evals; configure cache namespaces manually in eval code, then use `agent-evals cache list` to inspect persisted namespace/key entries or `agent-evals cache repair` to clean orphaned cache artifacts. Files in a run directory include run metadata, a run summary, per-case results, and per-case trace JSON. Inspect run files when debugging persisted output, costs, columns, traces, or failures; inspect cache entries when debugging replayed span/value-cache results. Targeted evals in `run.json` are recorded by exact `evalKeys` (`filePath + evalId`) rather than authored eval ids, so duplicate eval ids stay unambiguous in saved history. Temporary runs use the same directory layout, but are removed before the next run of any kind starts. When a saved case needs to be handed to another agent, the app can copy the saved case detail path or the saved run folder path directly.
281
+
282
+ Use `agent-evals show-runs` when you need stable file paths before reading saved output:
615
283
 
616
284
  ```sh
617
285
  agent-evals show-runs
@@ -622,21 +290,11 @@ jq . .agent-evals/runs/<run-id>/case-details/<case-id>.json
622
290
  jq . .agent-evals/runs/<run-id>/traces/<case-id>.json
623
291
  ```
624
292
 
625
- Run ids can be full timestamp ids, short ids such as `r0` from
626
- `agent-evals show-runs`, or `latest`. `show-runs` is only an artifact index;
627
- the files themselves remain the source of truth for detailed results and
628
- traces.
293
+ Run ids can be full timestamp ids, short ids such as `r0` from `agent-evals show-runs`, or `latest`. `show-runs` is only an artifact index; the files themselves remain the source of truth for detailed results and traces.
629
294
 
630
295
  ## Module mocking
631
296
 
632
- For true module replacement inside an eval, register `mock.module(...)` from
633
- `node:test` before dynamically importing the module graph. Agent Eval enables
634
- Node's `--experimental-test-module-mocks` flag automatically for CLI and app
635
- runs. Use dynamic
636
- `import(...)` inside `execute`, because static imports happen too early.
637
- Each case/trial reloads the eval module graph in its own isolation scope, so
638
- module-level mock state in workspace files and ESM dependencies does not leak
639
- between concurrent cases.
297
+ For true module replacement inside an eval, register `mock.module(...)` from `node:test` before dynamically importing the module graph. Agent Eval enables Node's `--experimental-test-module-mocks` flag automatically for CLI and app runs. Use dynamic `import(...)` inside `execute`, because static imports happen too early. Each case/trial reloads the eval module graph in its own isolation scope, so module-level mock state in workspace files and ESM dependencies does not leak between concurrent cases.
640
298
 
641
299
  ```ts
642
300
  import { mock } from 'node:test';
@@ -660,25 +318,11 @@ defineEval({
660
318
 
661
319
  When adding or changing evals:
662
320
 
663
- 1. Put the tracing + ambient SDK calls in the product code that runs in both
664
- production and evals. Keep eval files thin.
321
+ 1. Put the tracing + ambient SDK calls in the product code that runs in both production and evals. Keep eval files thin.
665
322
  2. Use realistic cases drawn from real product flows; avoid placeholder inputs.
666
- 3. Use `evalAssert` for hard invariants and truthy type narrowing. It records
667
- pass/fail entries in case-detail `assertions`; failed entries are also kept
668
- in `assertionFailures` and fail the case. Use `evalExpect` for non-trivial
669
- comparisons, `tracingAssertions` for invariants derived from the finished
670
- trace, `scores` for graded signals, and `passThreshold` only on scores that
671
- should gate pass/fail.
672
- 4. Surface reviewable values through execute-context `setOutput` or ambient
673
- `setEvalOutput` in shared workflow code, and shape them with `columns`
674
- formats from the `ColumnFormat` type.
323
+ 3. Use `evalAssert` for hard invariants and truthy type narrowing. It records pass/fail entries in case-detail `assertions`; failed entries are also kept in `assertionFailures` and fail the case. Use `evalExpect` for non-trivial comparisons, `tracingAssertions` for invariants derived from the finished trace, `scores` for graded signals, and `passThreshold` only on scores that should gate pass/fail.
324
+ 4. Surface reviewable values through execute-context `setOutput` or ambient `setEvalOutput` in shared workflow code, and shape them with `columns` formats from the `ColumnFormat` type.
675
325
  5. Promote high-signal span attributes with `traceDisplay`.
676
- 6. Cache costly pure spans with `cache: { namespace, key }` and pure spanless
677
- values with `evalTracer.cache(...)`; never cache operations whose external
678
- side effects you depend on.
679
- 7. Sanity-check after changes: `agent-evals list`, then
680
- `agent-evals run --eval <id>`; use `--file <path|glob>` to target one file
681
- when multiple files use the same eval id.
682
- 8. Locate saved artifacts with `agent-evals show-runs latest --json`, then read
683
- the relevant `summary.json`, `cases.jsonl`, `case-details/<case-id>.json`,
684
- or `traces/<case-id>.json` file directly.
326
+ 6. Cache costly pure spans with `cache: { namespace, key }` and pure spanless values with `evalTracer.cache(...)`; never cache operations whose external side effects you depend on.
327
+ 7. Sanity-check after changes: `agent-evals list`, then `agent-evals run --eval <id>`; use `--file <path|glob>` to target one file when multiple files use the same eval id.
328
+ 8. Locate saved artifacts with `agent-evals show-runs latest --json`, then read the relevant `summary.json`, `cases.jsonl`, `case-details/<case-id>.json`, or `traces/<case-id>.json` file directly.