@tangle-network/agent-eval 0.71.0 → 0.72.0
- package/CHANGELOG.md +24 -0
- package/dist/campaign/index.js +25 -12
- package/dist/campaign/index.js.map +1 -1
- package/dist/{chunk-VMAYE3LM.js → chunk-4QJN7RDX.js} +3 -3
- package/dist/chunk-SL55X4VN.js +186 -0
- package/dist/chunk-SL55X4VN.js.map +1 -0
- package/dist/{chunk-6QZUCFKM.js → chunk-UD6EF73X.js} +3 -3
- package/dist/{chunk-6XQIEUQ2.js → chunk-ZPSKPT3V.js} +5 -3
- package/dist/{chunk-6XQIEUQ2.js.map → chunk-ZPSKPT3V.js.map} +1 -1
- package/dist/contract/index.js +3 -3
- package/dist/index.js +11 -156
- package/dist/index.js.map +1 -1
- package/dist/openapi.json +1 -1
- package/dist/{run-campaign-BVY3RGAZ.js → run-campaign-OVEZF24D.js} +2 -2
- package/package.json +1 -1
- package/dist/chunk-PQV2TKC3.js +0 -27
- package/dist/chunk-PQV2TKC3.js.map +0 -1
- package/dist/{chunk-VMAYE3LM.js.map → chunk-4QJN7RDX.js.map} +0 -0
- package/dist/{chunk-6QZUCFKM.js.map → chunk-UD6EF73X.js.map} +0 -0
- package/dist/{run-campaign-BVY3RGAZ.js.map → run-campaign-OVEZF24D.js.map} +0 -0
package/CHANGELOG.md
CHANGED
@@ -4,6 +4,30 @@ All notable changes to `@tangle-network/agent-eval` and its sibling `agent-eval-
 
 ---
 
+## [0.72.0] — 2026-05-31 — cost axis prices unpriced-at-source models (every run carries a real, labeled cost)
+
+A live tax-agent full-loop run (real sandbox, `deepseek-v4-pro`, real tokens) exposed the second root of the cost-ledger split: the sandbox reported `totalCostUsd: 0` despite `17537` input / `622` output tokens — not a stub, not a mis-wired ledger, but a model the **source** can't rate. The cost / Pareto / `tokens_per_dollar` axes blanked even though the substrate's pricing table prices `deepseek` correctly; the table was simply never consulted on the matrix cost projection. A $0 cost on a run that burned real tokens reads as "free," which is the more misleading state.
+
+### Fixed
+
+- **`runProfileMatrix` prices measured tokens when the source reports $0.** Cost precedence is now explicit: **source-billed > token-estimated > none**. When `cell.costUsd === 0` and real output tokens flowed and the model is priced (`isModelPriced`), `buildRunRecord` sets the cost from `estimateCost(in, out, model)` (real published rate × real tokens) and stamps `raw.cost_estimated = 1`. A billed cost is never overridden; a model the table also can't rate stays $0 (no fabrication). The estimate flows into `record.costUsd`, so `byProfile.totalCostUsd`, `integrity.totalCostUsd`, and `tokens_per_dollar` / `cost_per_quality` all populate.
+- **Every cost surface in the matrix result agrees.** The embedded `campaigns[id].aggregates.totalCostUsd` is reconciled to the priced total instead of runCampaign's raw `ctx.cost` ledger (which only sees the source's $0). No more two-`totalCostUsd`-that-disagree in one result.
+- **Honest integrity diagnosis.** `summarizeBackendIntegrity`'s uncosted-records message now names **both** roots — mis-wired ledger OR unpriced-at-source model — and points at `estimateCost` for the latter, instead of asserting the ledger is broken.
+
+Live proof: the same tax case that recorded `$0` now records **`$0.0059453`** (`17537 × 0.0003/1k + 622 × 0.0011/1k`, exact), `cost_estimated: 1`, `uncostedRecords: 0`, verdict `real`. Generalizes to every consumer of `runProfileMatrix`. New regression tests: priced-when-source-zero, billed-takes-precedence, truly-unpriced-stays-$0, campaign-aggregate-reconciled. Full suite (1663) green.
+
+## [0.71.0] — 2026-05-31 — corpus-by-default + multi-dimensional capture (datasets as eval exhaust)
+
+Every matrix run now emits a multi-dimensional, dataset-able record with no side-channel — the groundwork for "datasets gathered for free by running evals."
+
+### Added
+
+- **Multi-dim guardrail projection in `buildRunRecord`.** Each `RunRecord.outcome.raw` carries `cost_usd`, `tokens_input` / `tokens_output` (+ `tokens_cached` when present), `latency_ms`, and the guarded ratios `tokens_per_dollar` / `cost_per_quality`. RAW-ONLY — the composite stays the judge objective (anti-Goodhart); these are tracked + dashboarded + carried into datasets, never optimized.
+- **Corpus-by-default via `corpusText`.** An optional `corpusText(artifact, scenario) => {prompt, completion}` stamps the trajectory text onto each record (the `CorpusRecord` shape), so a run is dataset-able with no side-channel. Fail-soft: a throwing extractor omits the text and keeps the graded record.
+- **`appendToCorpus` / `readCorpus` / `buildDatasetFromCorpus`** (`src/rl/corpus.ts`) — append-only JSONL corpus (deduped by `runId`), with score/split filtering into a train/holdout dataset.
+
+`buildRunRecord` is generic over `<TScenario, TArtifact>`; a `scenarioById` map threads each scenario into the projection.
+
 ## [0.70.0] — 2026-05-31 — error-grounded reflection (the driver targets real failures, not blind rewrites)
 
 Adversarial verification on TWO domains (legal + tax, two worker models) found the same root cause: the gepaDriver's candidates **regressed** the baseline, so the gate correctly held — but nothing improved. The driver was reflecting on per-scenario *scores* only; the judge's `notes` (the "why it failed") were computed but **dropped** before the reflection. So it proposed generic rewrites a capable model already knows, which distract rather than help.
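The cost precedence described in the 0.72.0 entry (source-billed > token-estimated > none) can be sketched as below. This is a minimal illustration under stated assumptions, not the package's implementation: `RATES`, `isModelPricedSketch`, `estimateCostSketch`, and `resolveCost` are hypothetical stand-ins for the real `isModelPriced` / `estimateCost`, and the only rates used are the per-1k figures quoted in the changelog's live proof.

```typescript
// Hypothetical pricing table (USD per 1k tokens). Only these two rates come
// from the changelog's live proof; the table shape itself is an assumption.
type Rate = { inputPer1k: number; outputPer1k: number };
const RATES: Record<string, Rate> = {
  "deepseek-v4-pro": { inputPer1k: 0.0003, outputPer1k: 0.0011 },
};

// Stand-in for the package's isModelPriced: true when the table can rate the model.
function isModelPricedSketch(model: string): boolean {
  return Object.prototype.hasOwnProperty.call(RATES, model);
}

// Stand-in for the package's estimateCost(in, out, model): rate × tokens.
function estimateCostSketch(input: number, output: number, model: string): number {
  const rate = RATES[model];
  if (!rate) return 0;
  return (input * rate.inputPer1k + output * rate.outputPer1k) / 1000;
}

type CostSource = "source-billed" | "token-estimated" | "none";

// Precedence: a non-zero billed cost always wins; otherwise estimate from
// measured tokens if the model is priced; otherwise stay at $0.
function resolveCost(
  billedUsd: number,
  tokens: { input: number; output: number },
  model: string,
): { costUsd: number; source: CostSource } {
  if (billedUsd !== 0) return { costUsd: billedUsd, source: "source-billed" };
  if (tokens.output > 0 && isModelPricedSketch(model)) {
    const estimated = estimateCostSketch(tokens.input, tokens.output, model);
    if (estimated > 0) return { costUsd: estimated, source: "token-estimated" };
  }
  return { costUsd: 0, source: "none" };
}

// The live-proof case: 17537 input / 622 output tokens, source reported $0.
const liveProof = resolveCost(0, { input: 17537, output: 622 }, "deepseek-v4-pro");
console.log(liveProof.source, liveProof.costUsd.toFixed(7)); // prints "token-estimated 0.0059453"
```

Returning the source label alongside the amount mirrors the role of `raw.cost_estimated` in the changelog: a consumer can tell an estimated figure from a billed one without re-deriving it.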
package/dist/campaign/index.js
CHANGED
@@ -7,10 +7,12 @@ import {
   heldoutSignificance,
   pairHoldout,
   runEval
-} from "../chunk-
+} from "../chunk-UD6EF73X.js";
 import {
-  agentProfileHash
-
+  agentProfileHash,
+  estimateCost,
+  isModelPriced
+} from "../chunk-SL55X4VN.js";
 import {
   buildLoopProvenanceRecord,
   campaignBreakdown,
@@ -31,14 +33,14 @@ import {
   runOptimization,
   surfaceContentHash,
   surfaceHash
-} from "../chunk-
+} from "../chunk-4QJN7RDX.js";
 import {
   assertRealBackend,
   fsCampaignStorage,
   inMemoryCampaignStorage,
   runCampaign,
   summarizeBackendIntegrity
-} from "../chunk-
+} from "../chunk-ZPSKPT3V.js";
 import "../chunk-YV7J7X5N.js";
 import {
   validateRunRecord
@@ -873,15 +875,22 @@ function buildRunRecord(args) {
   }
   const perDimMean = {};
   for (const [dim, values] of Object.entries(dimAccum)) perDimMean[dim] = mean2(values);
-
+  let costUsd = cell.costUsd;
+  let costEstimated = false;
+  if (costUsd === 0 && cell.tokenUsage.output > 0 && isModelPriced(profile.model)) {
+    costUsd = estimateCost(cell.tokenUsage.input, cell.tokenUsage.output, profile.model);
+    costEstimated = costUsd > 0;
+  }
+  raw.cost_usd = costUsd;
+  raw.cost_estimated = costEstimated ? 1 : 0;
   raw.tokens_input = cell.tokenUsage.input;
   raw.tokens_output = cell.tokenUsage.output;
   if (typeof cell.tokenUsage.cached === "number") raw.tokens_cached = cell.tokenUsage.cached;
   raw.latency_ms = cell.durationMs;
-  if (
-    raw.tokens_per_dollar = (cell.tokenUsage.input + cell.tokenUsage.output) /
+  if (costUsd > 0) {
+    raw.tokens_per_dollar = (cell.tokenUsage.input + cell.tokenUsage.output) / costUsd;
   }
-  if (composite > 0.01) raw.cost_per_quality =
+  if (composite > 0.01) raw.cost_per_quality = costUsd / composite;
   const outcome = splitTag === "holdout" ? { holdoutScore: composite, raw } : { searchScore: composite, raw };
   if (Object.keys(perJudge).length > 0) {
     outcome.judgeScores = {
@@ -901,7 +910,7 @@ function buildRunRecord(args) {
     configHash,
     commitSha,
     wallMs: cell.durationMs,
-    costUsd
+    costUsd,
     tokenUsage: cell.tokenUsage,
     outcome,
     splitTag,
@@ -982,7 +991,6 @@ async function runProfileMatrix(opts) {
       now: opts.now,
       runDir: join2(opts.runDir, sanitize(profile.id))
     });
-    campaigns[profile.id] = campaign;
     const profileRecords = [];
     for (const cell of campaign.cells) {
       const record = buildRunRecord({
@@ -1001,13 +1009,18 @@ async function runProfileMatrix(opts) {
       profileRecords.push(record);
       records.push(record);
     }
+    const pricedTotalCostUsd = profileRecords.reduce((a, r) => a + r.costUsd, 0);
+    campaigns[profile.id] = {
+      ...campaign,
+      aggregates: { ...campaign.aggregates, totalCostUsd: pricedTotalCostUsd }
+    };
     byProfile[profile.id] = {
       profileId: profile.id,
       profileHash,
       model: profile.model,
       records: profileRecords.length,
       meanComposite: mean2(profileRecords.map(compositeOf)),
-      totalCostUsd:
+      totalCostUsd: pricedTotalCostUsd,
       integrity: summarizeBackendIntegrity(profileRecords)
     };
   }
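The aggregate reconciliation in the last hunk can be sketched on its own as follows. The `CampaignSketch` and `RecordSketch` shapes and the `reconcileCampaignCost` helper are hypothetical simplifications introduced for illustration; the real campaign object carries more fields than shown here.

```typescript
// Hypothetical minimal shapes; the real campaign and record types are richer.
interface RecordSketch {
  costUsd: number;
}
interface CampaignSketch {
  id: string;
  aggregates: { totalCostUsd: number; meanScore: number };
}

// Returns a copy of the campaign whose aggregate cost is the sum of the
// per-record (possibly token-estimated) costs. The input is not mutated.
function reconcileCampaignCost(
  campaign: CampaignSketch,
  records: readonly RecordSketch[],
): CampaignSketch {
  const pricedTotalCostUsd = records.reduce((sum, r) => sum + r.costUsd, 0);
  return {
    ...campaign,
    aggregates: { ...campaign.aggregates, totalCostUsd: pricedTotalCostUsd },
  };
}

// The raw ledger saw $0, but the priced records sum to $0.75.
const rawCampaign: CampaignSketch = {
  id: "profile-a",
  aggregates: { totalCostUsd: 0, meanScore: 0.8 },
};
const reconciled = reconcileCampaignCost(rawCampaign, [{ costUsd: 0.25 }, { costUsd: 0.5 }]);
console.log(reconciled.aggregates.totalCostUsd); // prints 0.75
```

Spreading into a new object, rather than assigning to `campaign.aggregates.totalCostUsd`, leaves the object returned by the campaign run untouched, which matches the shape of the added lines in the diff.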