npm - @ls-stack/agent-eval - Versions diffs - 0.20.0 → 0.22.0 - Mend

@ls-stack/agent-eval 0.20.0 → 0.22.0

Files changed (17)

package/dist/{app-DsiLU65H.mjs → app-moDHbg1O.mjs} +3 -3
package/dist/apps/web/dist/assets/index-AUDD3rNB.js +118 -0
package/dist/apps/web/dist/assets/{index-CvR6QCLa.css → index-r0dVFK0B.css} +1 -1
package/dist/apps/web/dist/index.html +2 -2
package/dist/bin.mjs +1 -1
package/dist/{cli-weogme5U.mjs → cli-C0EtHhEO.mjs} +3 -3
package/dist/index.d.mts +56 -61
package/dist/index.mjs +3 -3
package/dist/runChild.mjs +1 -1
package/dist/{runOrchestration-Cv1kiOAG.mjs → runOrchestration-D1edUDhp.mjs} +155 -140
package/dist/{runner-DzrMtgBu.mjs → runner-C9nP2VKL.mjs} +2 -2
package/dist/{runner-B25oRQxX.mjs → runner-CyRhIzci.mjs} +1 -1
package/dist/src-D-HuV8I-.mjs +3 -0
package/package.json +1 -1
package/skills/agent-eval/SKILL.md +30 -20
package/dist/apps/web/dist/assets/index-Cba4MFa0.js +0 -118
package/dist/src-B879LZfo.mjs +0 -3

package/skills/agent-eval/SKILL.md CHANGED

@@ -92,20 +92,16 @@ export async function runRefundWorkflow(input: RefundInput) {
         async () => {
           let text: string;
           let usage: { inputTokens: number; outputTokens: number };
-          let costUsd: number;
           try {
-            ({ text, usage, costUsd } = await llm.complete(input.message));
+            ({ text, usage } = await llm.complete(input.message));
           } catch (error) {
             captureEvalSpanError(error);
-            ({ text, usage, costUsd } = await llm.completeWithFallback(
-              input.message,
-            ));
+            ({ text, usage } = await llm.completeWithFallback(input.message));
           }
           evalSpan.setAttributes({
             model: 'gpt-4o-mini',
             provider: 'openai',
             usage,
-            costUsd,
           });
           const expectedLocale = getEvalCaseInput('locale');
           if (typeof expectedLocale === 'string') {
@@ -137,8 +133,8 @@ are more specific. Only the `input` and `output` span attributes are promoted
 automatically in the trace tree; use `traceDisplay` for other span attributes
 such as `model` or `usage`. Eval-level LLM usage outputs, columns, stats, and
 charts are derived from matching LLM spans by default. Prefer
-`llmCalls.pricing` for LLM-call cost display instead of writing `costUsd` on
-each span.
+`llmCalls.pricing` for LLM-call cost display; built-in costs ignore span
+`costUsd` attributes.
 Use `captureEvalSpanError(error)` for recoverable errors on the active
 `evalTracer.span(...)`, such as optional model/tool failures that fall back and
@@ -261,18 +257,28 @@ See `EvalScoreDef` / `EvalManualScoreDef` in the types for the full shape
   See the `TraceDisplayInputConfig` type.
 - `llmCalls` (in `agent-evals.config.ts`) configures how LLM-call spans are
   summarized for review. Defaults to `kind: 'llm'` spans with `model`,
-  `usage.*`, `tokensPerSecond`, `input`, `output`, etc. read from conventional
-  attribute paths. Override `kinds` to broaden the filter, override
-  `attributes.<field>` for non-default span shapes, configure `pricing` to
-  derive USD costs from token counts by model/provider, and add entries to
-  `metrics` to surface arbitrary user metrics (`format: 'string' | 'number' |
-'duration' | 'json' | 'boolean'`, `placements: ['header' | 'body']`).
+  `usage.*`, `latencyMs`, `input`, `output`, etc. read from conventional
+  attribute paths. `latencyMs` is time to first token; duration, total tokens,
+  tokens/sec, and USD costs are derived. Override `kinds` to broaden the filter,
+  override `attributes.<field>` for non-default primitive span shapes, configure
+  `pricing` to derive USD costs from token counts by model/provider, and add
+  entries to `metrics` to surface arbitrary user metrics (`format: 'string' |
+'number' | 'duration' | 'json' | 'boolean'`, `placements: ['header' |
+'body']`).
 - Default usage config derives missing eval outputs from matching LLM/API spans
   before `outputsSchema` and scores run: `apiCalls`, `costUsd`, `llmTurns`,
   `inputTokens`, `outputTokens`, `totalTokens`, `cachedInputTokens`,
-  `cacheCreationInputTokens`, `reasoningTokens`, and `llmLatencyMs`. Authored
-  outputs and column overrides win. Remove defaults globally or per eval with
-  `removeDefaultConfig: true` or a key list such as
+  `cacheCreationInputTokens`, `reasoningTokens`, and `llmDurationMs`. Authored
+  outputs and column overrides win. `totalTokens` is input + output only; cache
+  read/write tokens stay separate and affect `costUsd` at their own rates.
+  Derived base input cost uses `inputTokens - cachedInputTokens -
+cacheCreationInputTokens` so cache details are not double-counted.
+  `cacheCreationInputTokens` is the total cache-write count; optional
+  `cacheCreationInput1hTokens` only splits that total for 1-hour write pricing
+  via `cacheCreationInput1hUsdPerMillion`. `llmDurationMs` sums elapsed matched
+  LLM span durations; it is not time-to-first-token latency.
+  Remove defaults globally or per eval with `removeDefaultConfig: true` or a
+  key list such as
   `removeDefaultConfig: ['apiCalls', 'reasoningTokens']`.
 - `apiCalls` (in `agent-evals.config.ts`) configures how API-call spans are
   summarized for review. Defaults to `kind: 'api'`, `'http'`, `'http.client'`,
@@ -287,9 +293,13 @@ See `EvalScoreDef` / `EvalManualScoreDef` in the types for the full shape
   are still captured.
 Stats rows and history charts on the eval card can be authored via `stats` /
-`charts` on the eval definition. Usage stats/charts are added by default
-unless removed with `removeDefaultConfig`. Their shapes live in the types; no
-need to memorize the option set.
+`charts` on the eval definition. Usage stats and LLM usage charts are added by
+default unless removed with `removeDefaultConfig`. Column stats can override
+`format` and `numberFormat`, otherwise they inherit from the matching column.
+Number formats use `maxDecimalPlaces` to cap decimals and `minDecimalPlaces`
+to pad trailing zeroes. Without `maxDecimalPlaces`, they render up to 3 decimal
+places.
+Their shapes live in the types; no need to memorize the option set.
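Reviewer note: the `maxDecimalPlaces` / `minDecimalPlaces` rules added in this hunk can be illustrated with a minimal sketch. Only the two option names and the "up to 3 decimal places" default come from the SKILL.md text; the `formatNumber` helper, the `NumberFormatSketch` type, and the mapping onto `Intl.NumberFormat` are assumptions made for illustration, not the package's implementation.

```typescript
// Hypothetical option shape; only the two field names appear in SKILL.md.
interface NumberFormatSketch {
  maxDecimalPlaces?: number;
  minDecimalPlaces?: number;
}

// Illustrative helper (not part of @ls-stack/agent-eval).
function formatNumber(value: number, opts: NumberFormatSketch = {}): string {
  // minDecimalPlaces pads trailing zeroes; default is no padding.
  const min = opts.minDecimalPlaces ?? 0;
  // Without maxDecimalPlaces, render up to 3 decimal places.
  // Assumption: if min exceeds max, max is raised to min so Intl does not throw.
  const max = Math.max(min, opts.maxDecimalPlaces ?? 3);
  return new Intl.NumberFormat('en-US', {
    minimumFractionDigits: min,
    maximumFractionDigits: max,
    useGrouping: false,
  }).format(value);
}

console.log(formatNumber(1.23456)); // capped at 3 decimals
console.log(formatNumber(2, { minDecimalPlaces: 2 })); // padded with zeroes
console.log(formatNumber(3.14159, { maxDecimalPlaces: 1 })); // capped at 1 decimal
```

How the package rounds or groups digits is not stated in the diff, so the rounding behavior here is simply whatever `Intl.NumberFormat` does by default.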
 ## Cached operations
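
Reviewer note: the token and cost rules added to the default usage config in the diff above (`totalTokens` is input + output only, base input cost excludes cache read/write tokens, and `cacheCreationInput1hTokens` only splits the cache-write total) can be sketched as plain arithmetic. The token field names and `cacheCreationInput1hUsdPerMillion` come from the SKILL.md text; every other pricing field name, the `deriveUsage` helper, the clamp to zero, and the fallback of the 1-hour rate to the regular write rate are assumptions for illustration, not the package's actual API.

```typescript
// Token fields as named in SKILL.md.
interface UsageSketch {
  inputTokens: number;
  outputTokens: number;
  cachedInputTokens?: number;
  cacheCreationInputTokens?: number; // total cache-write count
  cacheCreationInput1hTokens?: number; // subset of the total above
}

// Hypothetical pricing shape; only cacheCreationInput1hUsdPerMillion is named in the source.
interface PricingSketch {
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
  cachedInputUsdPerMillion: number;
  cacheCreationInputUsdPerMillion: number;
  cacheCreationInput1hUsdPerMillion?: number;
}

// Illustrative helper (not part of @ls-stack/agent-eval).
function deriveUsage(usage: UsageSketch, pricing: PricingSketch) {
  const cached = usage.cachedInputTokens ?? 0;
  const writes = usage.cacheCreationInputTokens ?? 0;
  // The 1h count only splits the write total; it is never added on top of it.
  const writes1h = Math.min(usage.cacheCreationInput1hTokens ?? 0, writes);
  const writesDefault = writes - writes1h;
  // Base input excludes cache reads and writes so they are not double-counted.
  // Assumption: clamp at zero if a provider reports inconsistent counts.
  const baseInputTokens = Math.max(0, usage.inputTokens - cached - writes);
  // Assumption: fall back to the regular write rate when no 1h rate is set.
  const rate1h =
    pricing.cacheCreationInput1hUsdPerMillion ??
    pricing.cacheCreationInputUsdPerMillion;
  const costUsd =
    (baseInputTokens * pricing.inputUsdPerMillion +
      cached * pricing.cachedInputUsdPerMillion +
      writesDefault * pricing.cacheCreationInputUsdPerMillion +
      writes1h * rate1h +
      usage.outputTokens * pricing.outputUsdPerMillion) /
    1_000_000;
  // totalTokens is input + output only; cache tokens stay separate.
  const totalTokens = usage.inputTokens + usage.outputTokens;
  return { baseInputTokens, totalTokens, costUsd };
}

const derived = deriveUsage(
  {
    inputTokens: 1000,
    outputTokens: 200,
    cachedInputTokens: 300,
    cacheCreationInputTokens: 200,
    cacheCreationInput1hTokens: 50,
  },
  {
    inputUsdPerMillion: 2,
    outputUsdPerMillion: 8,
    cachedInputUsdPerMillion: 0.5,
    cacheCreationInputUsdPerMillion: 2.5,
    cacheCreationInput1hUsdPerMillion: 4,
  },
);
console.log(derived);
```

With the made-up rates above, 500 base input tokens, 300 cache reads, 150 regular writes, 50 one-hour writes, and 200 output tokens are each priced at their own rate and summed.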