@tangle-network/agent-eval 0.71.0 → 0.72.0

package/CHANGELOG.md CHANGED
@@ -4,6 +4,30 @@ All notable changes to `@tangle-network/agent-eval` and its sibling `agent-eval-
 
  ---
 
+ ## [0.72.0] — 2026-05-31 — cost axis prices unpriced-at-source models (every run carries a real, labeled cost)
+
+ A live tax-agent full-loop run (real sandbox, `deepseek-v4-pro`, real tokens) exposed the second root of the cost-ledger split: the sandbox reported `totalCostUsd: 0` despite `17537` input / `622` output tokens — not a stub, not a mis-wired ledger, but a model the **source** can't rate. The cost / Pareto / `tokens_per_dollar` axes blanked even though the substrate's pricing table prices `deepseek` correctly; the table was simply never consulted on the matrix cost projection. A $0 cost on a run that burned real tokens reads as "free," which is the more misleading state.
+
+ ### Fixed
+
+ - **`runProfileMatrix` prices measured tokens when the source reports $0.** Cost precedence is now explicit: **source-billed > token-estimated > none**. When `cell.costUsd === 0` and real output tokens flowed and the model is priced (`isModelPriced`), `buildRunRecord` sets the cost from `estimateCost(in, out, model)` (real published rate × real tokens) and stamps `raw.cost_estimated = 1`. A billed cost is never overridden; a model the table also can't rate stays $0 (no fabrication). The estimate flows into `record.costUsd`, so `byProfile.totalCostUsd`, `integrity.totalCostUsd`, and `tokens_per_dollar` / `cost_per_quality` all populate.
+ - **Every cost surface in the matrix result agrees.** The embedded `campaigns[id].aggregates.totalCostUsd` is reconciled to the priced total instead of runCampaign's raw `ctx.cost` ledger (which only sees the source's $0). No more two-`totalCostUsd`-that-disagree in one result.
+ - **Honest integrity diagnosis.** `summarizeBackendIntegrity`'s uncosted-records message now names **both** roots — mis-wired ledger OR unpriced-at-source model — and points at `estimateCost` for the latter, instead of asserting the ledger is broken.
+
+ Live proof: the same tax case that recorded `$0` now records **`$0.0059453`** (`17537 × 0.0003/1k + 622 × 0.0011/1k`, exact), `cost_estimated: 1`, `uncostedRecords: 0`, verdict `real`. Generalizes to every consumer of `runProfileMatrix`. New regression tests: priced-when-source-zero, billed-takes-precedence, truly-unpriced-stays-$0, campaign-aggregate-reconciled. Full suite (1663) green.
+
+ ## [0.71.0] — 2026-05-31 — corpus-by-default + multi-dimensional capture (datasets as eval exhaust)
+
+ Every matrix run now emits a multi-dimensional, dataset-able record with no side-channel — the groundwork for "datasets gathered for free by running evals."
+
+ ### Added
+
+ - **Multi-dim guardrail projection in `buildRunRecord`.** Each `RunRecord.outcome.raw` carries `cost_usd`, `tokens_input` / `tokens_output` (+ `tokens_cached` when present), `latency_ms`, and the guarded ratios `tokens_per_dollar` / `cost_per_quality`. RAW-ONLY — the composite stays the judge objective (anti-Goodhart); these are tracked + dashboarded + carried into datasets, never optimized.
+ - **Corpus-by-default via `corpusText`.** An optional `corpusText(artifact, scenario) => {prompt, completion}` stamps the trajectory text onto each record (the `CorpusRecord` shape), so a run is dataset-able with no side-channel. Fail-soft: a throwing extractor omits the text and keeps the graded record.
+ - **`appendToCorpus` / `readCorpus` / `buildDatasetFromCorpus`** (`src/rl/corpus.ts`) — append-only JSONL corpus (deduped by `runId`), with score/split filtering into a train/holdout dataset.
+
+ `buildRunRecord` is generic over `<TScenario, TArtifact>`; a `scenarioById` map threads each scenario into the projection.
+
  ## [0.70.0] — 2026-05-31 — error-grounded reflection (the driver targets real failures, not blind rewrites)
 
  Adversarial verification on TWO domains (legal + tax, two worker models) found the same root cause: the gepaDriver's candidates **regressed** the baseline, so the gate correctly held — but nothing improved. The driver was reflecting on per-scenario *scores* only; the judge's `notes` (the "why it failed") were computed but **dropped** before the reflection. So it proposed generic rewrites a capable model already knows, which distract rather than help.
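
The cost precedence described in the 0.72.0 entry (source-billed > token-estimated > none) can be sketched as follows. This is a minimal illustration, not the package's code: `RATES_PER_1K`, `isPricedSketch`, `estimateCostSketch`, and `priceCell` are hypothetical stand-ins, and the per-1k rates are simply the ones implied by the changelog's own arithmetic rather than a real pricing table.

```javascript
// Hypothetical per-1k-token rates in USD. The deepseek-v4-pro figures are the
// ones implied by the changelog's arithmetic, not a real pricing table.
const RATES_PER_1K = {
  "deepseek-v4-pro": { input: 0.0003, output: 0.0011 },
};

// Stand-in for isModelPriced: true when the table can rate the model.
function isPricedSketch(model) {
  return Object.prototype.hasOwnProperty.call(RATES_PER_1K, model);
}

// Stand-in for estimateCost: published rate times measured tokens.
function estimateCostSketch(inputTokens, outputTokens, model) {
  const rate = RATES_PER_1K[model];
  if (!rate) return 0;
  return (inputTokens * rate.input + outputTokens * rate.output) / 1000;
}

// Precedence: source-billed > token-estimated > none.
function priceCell(cell, model) {
  let costUsd = cell.costUsd;
  let costEstimated = false;
  if (costUsd === 0 && cell.tokenUsage.output > 0 && isPricedSketch(model)) {
    costUsd = estimateCostSketch(cell.tokenUsage.input, cell.tokenUsage.output, model);
    costEstimated = costUsd > 0;
  }
  return { costUsd, costEstimated: costEstimated ? 1 : 0 };
}

const tokenUsage = { input: 17537, output: 622 };
const estimated = priceCell({ costUsd: 0, tokenUsage }, "deepseek-v4-pro");
const billed = priceCell({ costUsd: 0.02, tokenUsage }, "deepseek-v4-pro");
const unpriced = priceCell({ costUsd: 0, tokenUsage }, "unknown-model");

console.log(estimated.costUsd.toFixed(7), estimated.costEstimated); // 0.0059453 1
console.log(billed);   // billed cost is kept, costEstimated stays 0
console.log(unpriced); // no rate available, cost stays 0
```

The three calls mirror three of the regression cases named in the changelog: priced-when-source-zero, billed-takes-precedence, and truly-unpriced-stays-$0.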
@@ -7,10 +7,12 @@ import {
  heldoutSignificance,
  pairHoldout,
  runEval
- } from "../chunk-6QZUCFKM.js";
+ } from "../chunk-UD6EF73X.js";
  import {
- agentProfileHash
- } from "../chunk-PQV2TKC3.js";
+ agentProfileHash,
+ estimateCost,
+ isModelPriced
+ } from "../chunk-SL55X4VN.js";
  import {
  buildLoopProvenanceRecord,
  campaignBreakdown,
@@ -31,14 +33,14 @@ import {
  runOptimization,
  surfaceContentHash,
  surfaceHash
- } from "../chunk-VMAYE3LM.js";
+ } from "../chunk-4QJN7RDX.js";
  import {
  assertRealBackend,
  fsCampaignStorage,
  inMemoryCampaignStorage,
  runCampaign,
  summarizeBackendIntegrity
- } from "../chunk-6XQIEUQ2.js";
+ } from "../chunk-ZPSKPT3V.js";
  import "../chunk-YV7J7X5N.js";
  import {
  validateRunRecord
@@ -873,15 +875,22 @@ function buildRunRecord(args) {
  }
  const perDimMean = {};
  for (const [dim, values] of Object.entries(dimAccum)) perDimMean[dim] = mean2(values);
- raw.cost_usd = cell.costUsd;
+ let costUsd = cell.costUsd;
+ let costEstimated = false;
+ if (costUsd === 0 && cell.tokenUsage.output > 0 && isModelPriced(profile.model)) {
+ costUsd = estimateCost(cell.tokenUsage.input, cell.tokenUsage.output, profile.model);
+ costEstimated = costUsd > 0;
+ }
+ raw.cost_usd = costUsd;
+ raw.cost_estimated = costEstimated ? 1 : 0;
  raw.tokens_input = cell.tokenUsage.input;
  raw.tokens_output = cell.tokenUsage.output;
  if (typeof cell.tokenUsage.cached === "number") raw.tokens_cached = cell.tokenUsage.cached;
  raw.latency_ms = cell.durationMs;
- if (cell.costUsd > 0) {
- raw.tokens_per_dollar = (cell.tokenUsage.input + cell.tokenUsage.output) / cell.costUsd;
+ if (costUsd > 0) {
+ raw.tokens_per_dollar = (cell.tokenUsage.input + cell.tokenUsage.output) / costUsd;
  }
- if (composite > 0.01) raw.cost_per_quality = cell.costUsd / composite;
+ if (composite > 0.01) raw.cost_per_quality = costUsd / composite;
  const outcome = splitTag === "holdout" ? { holdoutScore: composite, raw } : { searchScore: composite, raw };
  if (Object.keys(perJudge).length > 0) {
  outcome.judgeScores = {
@@ -901,7 +910,7 @@ function buildRunRecord(args) {
  configHash,
  commitSha,
  wallMs: cell.durationMs,
- costUsd: cell.costUsd,
+ costUsd,
  tokenUsage: cell.tokenUsage,
  outcome,
  splitTag,
@@ -982,7 +991,6 @@ async function runProfileMatrix(opts) {
  now: opts.now,
  runDir: join2(opts.runDir, sanitize(profile.id))
  });
- campaigns[profile.id] = campaign;
  const profileRecords = [];
  for (const cell of campaign.cells) {
  const record = buildRunRecord({
@@ -1001,13 +1009,18 @@ async function runProfileMatrix(opts) {
  profileRecords.push(record);
  records.push(record);
  }
+ const pricedTotalCostUsd = profileRecords.reduce((a, r) => a + r.costUsd, 0);
+ campaigns[profile.id] = {
+ ...campaign,
+ aggregates: { ...campaign.aggregates, totalCostUsd: pricedTotalCostUsd }
+ };
  byProfile[profile.id] = {
  profileId: profile.id,
  profileHash,
  model: profile.model,
  records: profileRecords.length,
  meanComposite: mean2(profileRecords.map(compositeOf)),
- totalCostUsd: profileRecords.reduce((a, r) => a + r.costUsd, 0),
+ totalCostUsd: pricedTotalCostUsd,
  integrity: summarizeBackendIntegrity(profileRecords)
  };
  }
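
The last hunk makes the embedded campaign aggregate and the per-profile total read from the same priced sum. A minimal sketch of that reconciliation in isolation, assuming a hypothetical helper `reconcileCampaignCost` and made-up record data (neither is part of the package):

```javascript
// Hypothetical helper: rebuild the campaign aggregate from the priced records
// instead of trusting the raw ledger total. Returns a copy; the input object
// is not mutated.
function reconcileCampaignCost(campaign, records) {
  const pricedTotalCostUsd = records.reduce((sum, r) => sum + r.costUsd, 0);
  return {
    ...campaign,
    aggregates: { ...campaign.aggregates, totalCostUsd: pricedTotalCostUsd },
  };
}

// Made-up data: the raw ledger saw $0, while the records carry priced costs.
const rawCampaign = { id: "tax", aggregates: { totalCostUsd: 0, cells: 2 } };
const pricedRecords = [{ costUsd: 0.25 }, { costUsd: 0.5 }];

const reconciled = reconcileCampaignCost(rawCampaign, pricedRecords);
const byProfileTotal = pricedRecords.reduce((sum, r) => sum + r.costUsd, 0);

console.log(reconciled.aggregates.totalCostUsd === byProfileTotal); // true
console.log(rawCampaign.aggregates.totalCostUsd); // 0, input left untouched
```

Spreading `campaign.aggregates` before overriding `totalCostUsd` keeps every other aggregate field as the campaign reported it, so only the cost surface changes.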