@ls-stack/agent-eval 0.27.1 → 0.28.0

@@ -1,2 +1,2 @@
- import { n as initRunner, t as getRunnerInstance } from "./runner-zqKwTlNj.mjs";
+ import { n as initRunner, t as getRunnerInstance } from "./runner-DbVB66h9.mjs";
  export { getRunnerInstance, initRunner };
@@ -1,5 +1,5 @@
- import { n as createRunner } from "./cli-Clf8xUFa.mjs";
- import "./src-BBwT7_cy.mjs";
+ import { n as createRunner } from "./cli-BQwRbqsL.mjs";
+ import "./src-CuirVcPY.mjs";
  //#region ../../apps/server/src/runner.ts
  let runnerInstance = null;
  function getRunnerInstance() {
@@ -0,0 +1,3 @@
+ import "./runOrchestration-ClWYWPen.mjs";
+ import "./cli-BQwRbqsL.mjs";
+ export {};
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@ls-stack/agent-eval",
- "version": "0.27.1",
+ "version": "0.28.0",
  "type": "module",
  "bin": {
  "agent-evals": "./dist/bin.mjs"
@@ -33,7 +33,9 @@
  ]
  },
  "dts": {
- "eager": true
+ "eager": true,
+ "entry": "src/index.ts",
+ "tsconfig": "tsconfig.build.json"
  },
  "entry": [
  "src/index.ts",
@@ -59,8 +61,8 @@
  "@types/node": "^24.7.2",
  "typescript": "^5.9.2",
  "@agent-evals/runner": "0.0.1",
- "@agent-evals/sdk": "0.0.1",
- "@agent-evals/shared": "0.0.1"
+ "@agent-evals/shared": "0.0.1",
+ "@agent-evals/sdk": "0.0.1"
  },
  "scripts": {
  "build": "pnpm --filter @agent-evals/web build && tsdown",
@@ -16,9 +16,9 @@ This skill covers the mental model and conventions. For exhaustive field lists
  display rules), read the TypeScript declarations shipped with the package:
 
  - `AgentEvalsConfig`, `EvalDefinition`, `EvalCase`, `EvalOutputs`,
- `EvalColumnOverride`, `EvalScoreDef`, `EvalManualScoreDef`,
- `EvalTraceTree`, `TraceSpanInfo`, and `z` are exported from
- `@ls-stack/agent-eval`.
+ `EvalColumnOverride`, `EvalDeriveConfig`, `EvalScoreDef`,
+ `EvalManualScoreDef`, `EvalTraceTree`, `TraceSpanInfo`, and `z` are exported
+ from `@ls-stack/agent-eval`.
  - `.d.ts` files land in `node_modules/@ls-stack/agent-eval/dist/`.
  - CLI surface: `agent-evals --help` and `agent-evals <command> --help`.
  Unknown help targets exit non-zero instead of falling back to global help.
@@ -269,7 +269,19 @@ See `EvalScoreDef` / `EvalManualScoreDef` in the types for the full shape
  `defineEval` generic, `outputsSchema` is required.
  - `columns` overrides the display for output and score keys (label, format,
  alignment, visibility). The set of supported formats is declared by the
- `ColumnFormat` union and `EvalColumnOverride` in the types.
+ `ColumnFormat` union and `EvalColumnOverride` in the types. Global
+ `columns` in `agent-evals.config.ts` apply to every eval; eval-level
+ `columns` override matching global keys. Use `hideIfNoValue: true` to hide a
+ column from the runs table when every rendered row is missing the value,
+ `null`, or an empty string; `0` and `false` still count as values, and the
+ value remains available in case details and raw output data.
+ - `deriveFromTracing` can be authored globally in `agent-evals.config.ts` or
+ locally on one eval. Prefer the keyed map form for shared metrics:
+ `deriveFromTracing: { toolCalls: ({ trace }) => trace.findSpansByKind('tool').length }`.
+ The older object-returning function form remains supported. Global
+ derivations run first; runtime outputs are never overwritten, and eval-level
+ derivations only fill keys still missing after global derivations. In keyed
+ form, return `undefined` to omit one output for that case.
  - `traceDisplay` promotes selected span attributes into the trace tree and
  detail pane; it supports aggregation across subtrees (`scope`, `mode`) and
  user-defined `transform(...)` for derived views (e.g. currency conversion).
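
Reviewer note on the hunk above: the `deriveFromTracing` precedence it describes (runtime outputs win, global derivations run first, eval-level derivations only fill what is still missing) can be sketched roughly as follows. This is a minimal model, not the package's code: `applyDerivations`, `KeyedDerivations`, and the `ctx` argument are hypothetical names, and treating an `undefined` value as "missing" is an assumption.

```typescript
// Hypothetical model of the merge order described in the diff; not the package's API.
type Outputs = Record<string, unknown>;
type KeyedDerivations<Ctx> = Record<string, (ctx: Ctx) => unknown>;

function applyDerivations<Ctx>(
  runtimeOutputs: Outputs,
  ctx: Ctx,
  globalDerive: KeyedDerivations<Ctx>,
  evalDerive: KeyedDerivations<Ctx>,
): Outputs {
  // Start from what the eval produced at runtime; these values are never replaced.
  const result: Outputs = { ...runtimeOutputs };

  // Global derivations run first, eval-level derivations second.
  for (const layer of [globalDerive, evalDerive]) {
    for (const [key, derive] of Object.entries(layer)) {
      // Assumption: a key counts as "missing" when its value is undefined.
      if (result[key] !== undefined) continue;
      const value = derive(ctx);
      // In keyed form, returning undefined omits the output for this case.
      if (value !== undefined) result[key] = value;
    }
  }
  return result;
}
```

Under this model, an eval-level derivation for a key that a global derivation already filled is skipped without being called, which matches the "only fill keys still missing" wording.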
@@ -280,18 +292,24 @@ See `EvalScoreDef` / `EvalManualScoreDef` in the types for the full shape
  attribute paths. `latencyMs` is time to first token; duration, total tokens,
  tokens/sec, and USD costs are derived. Override `kinds` to broaden the filter,
  override `attributes.<field>` for non-default primitive span shapes, configure
- `pricing` to derive USD costs from token counts by model/provider, add
- `derivedAttributes` to persist computed values back onto matching LLM spans
- before trace consumers run, and add entries to `metrics` to surface arbitrary
- user metrics (`format: 'string' | 'number' | 'duration' | 'json' |
- 'boolean'`, `placements: ['header' | 'body']`). `derivedAttributes` keys are
- dot-paths under `span.attributes`; return `undefined` to skip one span.
+ model-keyed `pricing` to derive USD costs from token counts, with nested
+ `providers` entries for provider-specific rates, add `derivedAttributes` to
+ persist computed values back onto matching LLM spans before trace consumers
+ run, and add entries to `metrics` to surface arbitrary user metrics
+ (`format: 'string' | 'number' | 'duration' | 'json' | 'boolean'`,
+ `placements: ['header' | 'body']`). `derivedAttributes` keys are dot-paths
+ under `span.attributes`; return `undefined` to skip one span. For saved runs,
+ the case drawer more menu can recalculate configured LLM/API derived
+ attributes for one case and persist the updated trace artifacts without
+ re-running the eval.
  - Default usage config derives missing eval outputs from matching LLM/API spans
  before `outputsSchema` and scores run: `apiCalls`, `costUsd`, `llmTurns`,
  `inputTokens`, `outputTokens`, `totalTokens`, `cachedInputTokens`,
  `cacheCreationInputTokens`, `reasoningTokens`, and `llmDurationMs`. Authored
- outputs and column overrides win. `totalTokens` is input + output only; cache
- read/write tokens stay separate and affect `costUsd` at their own rates.
+ outputs and column overrides win. Default usage columns, stats, and charts
+ use `hideIfNoValue: true`, so the UI hides them until matching LLM/API span
+ data exists. `totalTokens` is input + output only; cache read/write tokens
+ stay separate and affect `costUsd` at their own rates.
  Derived base input cost uses `inputTokens - cachedInputTokens -
  cacheCreationInputTokens` so cache details are not double-counted.
  `cacheCreationInputTokens` is the total cache-write count; optional
@@ -315,12 +333,15 @@ cacheCreationInputTokens` so cache details are not double-counted.
  are still captured.
 
  Stats rows and history charts on the eval card can be authored via `stats` /
- `charts` on the eval definition. Usage stats and LLM usage charts are added by
+ `charts` on the eval definition. Global `stats` in `agent-evals.config.ts`
+ render before eval-level stats. Usage stats and LLM usage charts are added by
  default unless removed with `removeDefaultConfig`. Column stats can override
  `format` and `numberFormat`, otherwise they inherit from the matching column.
  Number formats use `maxDecimalPlaces` to cap decimals and `minDecimalPlaces`
  to pad trailing zeroes. Without `maxDecimalPlaces`, they render up to 3 decimal
- places.
+ places. Stats and charts support `hideIfNoValue: true`; stats hide when they
+ would otherwise render an empty value, and charts hide when no plotted metric or
+ tooltip extra has a numeric value in the rendered history window.
  Their shapes live in the types; no need to memorize the option set.
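
Reviewer note: two of the display rules in this diff are easy to misread, so here is a small sketch of each. `hasValue`, `shouldHideColumn`, `formatNumber`, and `NumberFormatSketch` are hypothetical helpers written for illustration. The use of `Intl.NumberFormat` with the `en-US` locale and no digit grouping is an assumption, as is hiding a column when there are no rows at all.

```typescript
// Hypothetical helpers modelling the documented rules; not the package's code.

// "Missing, null, or an empty string" means no value; 0 and false still count.
function hasValue(value: unknown): boolean {
  return value !== undefined && value !== null && value !== "";
}

// A column with hideIfNoValue: true is hidden when no rendered row has a value.
function shouldHideColumn(rows: Array<Record<string, unknown>>, key: string): boolean {
  return rows.every((row) => !hasValue(row[key]));
}

interface NumberFormatSketch {
  maxDecimalPlaces?: number;
  minDecimalPlaces?: number;
}

// maxDecimalPlaces caps decimals (3 when unset); minDecimalPlaces pads zeroes.
function formatNumber(value: number, format: NumberFormatSketch = {}): string {
  const max = format.maxDecimalPlaces ?? 3;
  const min = Math.min(format.minDecimalPlaces ?? 0, max);
  return new Intl.NumberFormat("en-US", {
    minimumFractionDigits: min,
    maximumFractionDigits: max,
    useGrouping: false,
  }).format(value);
}
```

For example, a `costUsd` column whose rows are all `undefined` would be hidden under this model, while a column whose only value is `0` would stay visible.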
 
  ## Cached operations
@@ -378,12 +399,18 @@ Mental model:
  (no surrounding span), the ref is recorded on the case detail's `cacheRefs`
  array.
  - Cache identity is the namespace plus the authored key. Source-file
- fingerprints are stored as metadata for inspection, but do not participate in
- cache-key hashing.
+ fingerprints are tracked for run freshness separately, but do not participate
+ in cache-key hashing.
  - Cached spans require an explicit `cache.namespace`; value caches default to
  `${evalId}__${name}` and can be overridden with `namespace`. Matching
  namespaces share entries across operations/evals that use the same authored
  key.
+ - Per eval, `cache: { read?: boolean; store?: boolean }` controls whether
+ authored cached operations may read or persist entries. Both default to
+ `true`. Use `read: false` to always execute instead of replaying hits, and
+ `store: false` to allow reads while preventing misses/refreshes from writing
+ cache or raw-key debug files. Run-level bypass/refresh controls still take
+ precedence.
  - Authored eval ids are unique within one eval file. The exact eval identity is
  the workspace-relative file path plus eval id, so the same id can be reused in
  different files. Case ids must be unique within one eval; duplicate case ids
@@ -403,10 +430,15 @@ Mental model:
  user inputs, or other sensitive data, should be gitignored, and is not needed
  for cache reuse. The UI Cache tab shows the raw key when it is available and
  can be filtered to hits or new entries added by cache misses/refreshes.
- - Cached payloads use advance serialization/deserialization with the Web API plugin set, so return values and
- recorded SDK effects preserve richer built-ins such as `Date`, `Map`, `Set`,
- typed arrays, `URL`, `Headers`, `Blob`, and `File` on hits. Cache keys still
- use the deterministic key-hashing rules above.
+ Misses/refreshes with `cache.store: false` are shown as non-stored activity
+ without fetch/delete controls.
+ - Cached payloads use advanced serialization/deserialization with the Web API
+ plugin set, so return values and recorded SDK effects preserve richer
+ built-ins such as `Date`, `Map`, `Set`, typed arrays, `URL`, `Headers`,
+ `Blob`, and `File` on hits. Undefined values are omitted by default instead
+ of being written to cache files; direct serializer callers can pass
+ `{ preserveUndefined: true }` when explicit undefined wrappers are needed.
+ Cache keys still use the deterministic key-hashing rules above.
  - Cache mode per run is controlled by CLI flags (see `agent-evals run --help`).
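
Reviewer note: the per-eval `cache: { read?, store? }` flags introduced in this release can be modelled with a few lines. This is a sketch only: `runCachedOperation`, `EvalCacheFlags`, and the in-memory `Map` are hypothetical stand-ins, and run-level bypass/refresh controls (which the text says take precedence) are deliberately not modelled.

```typescript
// Hypothetical model of the per-eval cache flags; not the package's implementation.
interface EvalCacheFlags {
  read?: boolean;
  store?: boolean;
}

interface CachedResult<T> {
  value: T;
  hit: boolean;
  stored: boolean;
}

function runCachedOperation<T>(
  entries: Map<string, T>,
  key: string,
  flags: EvalCacheFlags,
  execute: () => T,
): CachedResult<T> {
  // Both flags default to true, as the documentation states.
  const read = flags.read ?? true;
  const store = flags.store ?? true;

  // read: false skips replay and always executes.
  if (read && entries.has(key)) {
    return { value: entries.get(key) as T, hit: true, stored: false };
  }

  const value = execute();
  // store: false still allows reads above but never writes on a miss.
  // Assumption: with read: false and store: true, the fresh value replaces any old entry.
  if (store) entries.set(key, value);
  return { value, hit: false, stored: store };
}
```

In this model, `{ store: false }` on a miss returns the fresh value without touching the entries map, which corresponds to the "non-stored activity" the Cache tab is described as showing.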
 
  ## Artifacts