@tangle-network/agent-eval 0.67.0 → 0.69.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +28 -0
- package/dist/campaign/index.js +9 -9
- package/dist/{chunk-MZ2IYGGN.js → chunk-E24XD7A2.js} +4 -278
- package/dist/chunk-E24XD7A2.js.map +1 -0
- package/dist/{chunk-NV2PF37Q.js → chunk-JFGZPUMU.js} +277 -3
- package/dist/chunk-JFGZPUMU.js.map +1 -0
- package/dist/contract/index.js +6 -6
- package/dist/index.d.ts +171 -5
- package/dist/index.js +147 -2
- package/dist/index.js.map +1 -1
- package/dist/openapi.json +1 -1
- package/package.json +1 -1
- package/dist/chunk-MZ2IYGGN.js.map +0 -1
- package/dist/chunk-NV2PF37Q.js.map +0 -1
package/CHANGELOG.md
CHANGED
|
@@ -4,6 +4,34 @@ All notable changes to `@tangle-network/agent-eval` and its sibling `agent-eval-
|
|
|
4
4
|
|
|
5
5
|
---
|
|
6
6
|
|
|
7
|
+
## [0.69.0] — 2026-05-30 — strong generic baseline roles (engineer / researcher / generalist)
|
|
8
|
+
|
|
9
|
+
The structured profile (0.68.0) had a hollow top zone — `baselineProfile` took an arbitrary `role` string. Products are file-producing, tool-using agents living in a sandbox, but nothing gave them a strong operator foundation. This adds three generically-useful, verification-first baseline roles distilled from agent-runtime's `coderProfile` doctrine.
|
|
10
|
+
|
|
11
|
+
### Added (`profile.*`)
|
|
12
|
+
|
|
13
|
+
- **`engineerRole`** — a senior principal / 10x-IC sandbox operator: produce the real artifact then verify it; smallest correct change; **run the checks and fix the root cause — never weaken a test or hide an error**; inspect external-boundary outcomes; "done" = produced AND verified.
|
|
14
|
+
- **`researcherRole`** — read the real sources, cite every material claim, mark inference vs. verified, never fabricate a source/quote/number.
|
|
15
|
+
- **`generalistRole`** — strong default: do over describe, ground claims, verify before done, ask only on genuinely user-owned choices.
|
|
16
|
+
- `BASELINE_ROLES` (keyed `engineer|researcher|generalist`) + `baselineProfileFromRole(role, overrides?)` — pick a foundation, override the environment to describe THIS product's sandbox, then layer domain via `prodProfile`.
|
|
17
|
+
|
|
18
|
+
**Layering discipline:** these are domain-AGNOSTIC and verification-first. Domain strength (legal M&A persona, tax-calc rigor) stays in the **product repo** and composes on top via `domain[]`; it is lifted into the substrate only once ≥2 products genuinely reuse it. 3 new tests assert the roles are distinct, verification-first, and carry no product-domain words. Full suite (1642) green.
|
|
19
|
+
|
|
20
|
+
## [0.68.0] — 2026-05-30 — structured AgentProfile (the self-improvement surface stops being an opaque blob)
|
|
21
|
+
|
|
22
|
+
The optimizable surface was an opaque string addendum, so the loop could only mutate (and the dashboard only diff) an unstructured blob — you couldn't see *what kind* of improvement a candidate made. This adds a **sectioned `AgentProfile`** primitive (mirrored on Harvey LAB's system-prompt structure) so the surface has named, separately-addressable zones the loop targets one at a time.
|
|
23
|
+
|
|
24
|
+
### Added
|
|
25
|
+
|
|
26
|
+
- **`profile` namespace** (`import { profile } from '@tangle-network/agent-eval'`):
|
|
27
|
+
- `AgentProfile { role, environment, toolConventions, skills: ProfileSkill[], domain: AgentProfileSection[] }` — the structured surface. `environment` is a first-class section (the sandbox contract: workspace root, read-only documents, output dir, skills dir), matching how an agentic harness actually addresses its sandbox.
|
|
28
|
+
- `renderProfile(p)` emits the system prompt in fixed order: role → `## Environment` → `## Tool conventions` → `## Skills` → `## Domain guidance`.
|
|
29
|
+
- `baselineProfile` / `prodProfile(baseline, shipped)` — baseline = empty domain + stock skills; prod = baseline + gate-certified domain sections.
|
|
30
|
+
- `applyDomainPatch(p, sectionId, body)` — **section-scoped** edit so the improvement loop optimizes ONE evolvable section, not the whole blob; `profileToSurface(p)` bridges to the existing string `MutableSurface`.
|
|
31
|
+
- Namespaced as `profile.*` to avoid clashing with the benchmark-cell `AgentProfile` already exported from `./agent-profile`.
|
|
32
|
+
|
|
33
|
+
Additive — does not touch `runImprovementLoop` or the string surface. 15 tests (zone order; only evolvable sections change hash under `applyDomainPatch`; baseline vs prod differ only in domain/skills; Environment present + non-empty). Full suite (1639) green. First consumers: the TaxCalcBench + Harvey LAB benchmark adapters (tax-agent / legal-agent) that score our agent's profile against public leaderboards.
|
|
34
|
+
|
|
7
35
|
## [0.67.0] — 2026-05-30 — the promotion gate is statistically trustworthy (no more shipping noise)
|
|
8
36
|
|
|
9
37
|
An adversarial review of a real "ship +4.0 lift" decision found it was a **triple false positive**: the driver's candidate lost on train, so the winner was the baseline (empty diff); the loop re-scored the baseline against ITSELF on the holdout and read run-to-run model noise (91 vs 95) as a "+4 lift"; and a point-estimate gate (`delta >= 0.03` on a 0-100 scale, `reps:1`) shipped it — while the reward-hacking gate was blind to a −30 regression on a safety dimension hiding under the +4 net. The promotion gate could not tell a real improvement from noise or from a Goodhart trade.
|
package/dist/campaign/index.js
CHANGED
|
@@ -1,37 +1,37 @@
|
|
|
1
1
|
import {
|
|
2
|
-
buildLoopProvenanceRecord,
|
|
3
2
|
composeGate,
|
|
4
3
|
defaultProductionGate,
|
|
5
4
|
detectScale,
|
|
6
5
|
dimensionRegressions,
|
|
7
|
-
emitLoopProvenance,
|
|
8
6
|
evolutionaryDriver,
|
|
9
7
|
heldoutSignificance,
|
|
10
|
-
loopProvenanceSpans,
|
|
11
8
|
pairHoldout,
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
runEval,
|
|
15
|
-
surfaceContentHash
|
|
16
|
-
} from "../chunk-MZ2IYGGN.js";
|
|
9
|
+
runEval
|
|
10
|
+
} from "../chunk-E24XD7A2.js";
|
|
17
11
|
import {
|
|
18
12
|
agentProfileHash
|
|
19
13
|
} from "../chunk-PQV2TKC3.js";
|
|
20
14
|
import {
|
|
15
|
+
buildLoopProvenanceRecord,
|
|
21
16
|
campaignBreakdown,
|
|
22
17
|
campaignMeanComposite,
|
|
23
18
|
countSentenceEdits,
|
|
24
19
|
defaultRenderDiff,
|
|
20
|
+
emitLoopProvenance,
|
|
25
21
|
extractH2Sections,
|
|
26
22
|
gepaDriver,
|
|
27
23
|
heldOutGate,
|
|
28
24
|
isProposedCandidate,
|
|
29
25
|
labelTrustRank,
|
|
26
|
+
loopProvenanceSpans,
|
|
30
27
|
openAutoPr,
|
|
28
|
+
provenanceRecordPath,
|
|
29
|
+
provenanceSpansPath,
|
|
31
30
|
runImprovementLoop,
|
|
32
31
|
runOptimization,
|
|
32
|
+
surfaceContentHash,
|
|
33
33
|
surfaceHash
|
|
34
|
-
} from "../chunk-
|
|
34
|
+
} from "../chunk-JFGZPUMU.js";
|
|
35
35
|
import {
|
|
36
36
|
assertRealBackend,
|
|
37
37
|
fsCampaignStorage,
|
|
@@ -1,10 +1,9 @@
|
|
|
1
1
|
import {
|
|
2
2
|
runCanaries,
|
|
3
3
|
scoreRedTeamOutput
|
|
4
|
-
} from "./chunk-
|
|
4
|
+
} from "./chunk-JFGZPUMU.js";
|
|
5
5
|
import {
|
|
6
|
-
runCampaign
|
|
7
|
-
summarizeBackendIntegrity
|
|
6
|
+
runCampaign
|
|
8
7
|
} from "./chunk-6XQIEUQ2.js";
|
|
9
8
|
import {
|
|
10
9
|
detectRewardHacking
|
|
@@ -306,273 +305,6 @@ async function runEval(opts) {
|
|
|
306
305
|
return runCampaign(opts);
|
|
307
306
|
}
|
|
308
307
|
|
|
309
|
-
// src/campaign/provenance.ts
|
|
310
|
-
import { createHash } from "crypto";
|
|
311
|
-
import { join } from "path";
|
|
312
|
-
function surfaceContentHash(surface) {
|
|
313
|
-
const material = typeof surface === "string" ? surface : JSON.stringify({
|
|
314
|
-
kind: surface.kind,
|
|
315
|
-
worktreeRef: surface.worktreeRef,
|
|
316
|
-
baseRef: surface.baseRef ?? null
|
|
317
|
-
});
|
|
318
|
-
return `sha256:${createHash("sha256").update(material).digest("hex")}`;
|
|
319
|
-
}
|
|
320
|
-
function meanHoldoutComposite(campaign) {
|
|
321
|
-
const xs = [];
|
|
322
|
-
for (const cell of campaign.cells) {
|
|
323
|
-
if (cell.error) continue;
|
|
324
|
-
const cs = Object.values(cell.judgeScores).map((s) => s.composite);
|
|
325
|
-
if (cs.length) xs.push(cs.reduce((a, b) => a + b, 0) / cs.length);
|
|
326
|
-
}
|
|
327
|
-
return xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;
|
|
328
|
-
}
|
|
329
|
-
function buildLoopProvenanceRecord(args) {
|
|
330
|
-
const integrity = summarizeBackendIntegrity(args.workerRecords);
|
|
331
|
-
const models = [...new Set(args.workerRecords.map((r) => r.model))].sort();
|
|
332
|
-
const candidates = [];
|
|
333
|
-
for (const gen of args.generations) {
|
|
334
|
-
const promotedSet = new Set(gen.promoted);
|
|
335
|
-
const surfaceByHash = new Map(gen.surfaces.map((s) => [s.surfaceHash, s.surface]));
|
|
336
|
-
for (const c of gen.candidates) {
|
|
337
|
-
const surface = surfaceByHash.get(c.surfaceHash);
|
|
338
|
-
const entry = {
|
|
339
|
-
generation: gen.generationIndex,
|
|
340
|
-
surfaceHash: c.surfaceHash,
|
|
341
|
-
contentHash: surface !== void 0 ? surfaceContentHash(surface) : `sha256:${c.surfaceHash}`,
|
|
342
|
-
composite: c.composite,
|
|
343
|
-
promoted: promotedSet.has(c.surfaceHash)
|
|
344
|
-
};
|
|
345
|
-
if (c.label) entry.label = c.label;
|
|
346
|
-
if (c.rationale) entry.rationale = c.rationale;
|
|
347
|
-
candidates.push(entry);
|
|
348
|
-
}
|
|
349
|
-
}
|
|
350
|
-
const baselineHoldoutComposite = meanHoldoutComposite(args.baselineOnHoldout);
|
|
351
|
-
const winnerHoldoutComposite = meanHoldoutComposite(args.winnerOnHoldout);
|
|
352
|
-
const record = {
|
|
353
|
-
schema: "tangle.loop-provenance.v1",
|
|
354
|
-
runId: args.runId,
|
|
355
|
-
runDir: args.runDir,
|
|
356
|
-
timestamp: args.timestamp,
|
|
357
|
-
baselineContentHash: surfaceContentHash(args.baselineSurface),
|
|
358
|
-
winnerContentHash: surfaceContentHash(args.winnerSurface),
|
|
359
|
-
diff: args.diff,
|
|
360
|
-
candidates,
|
|
361
|
-
gate: {
|
|
362
|
-
decision: args.gate.decision,
|
|
363
|
-
reasons: args.gate.reasons,
|
|
364
|
-
delta: args.gate.delta,
|
|
365
|
-
contributingGates: args.gate.contributingGates.map((g) => ({
|
|
366
|
-
name: g.name,
|
|
367
|
-
passed: g.passed
|
|
368
|
-
}))
|
|
369
|
-
},
|
|
370
|
-
baselineHoldoutComposite,
|
|
371
|
-
winnerHoldoutComposite,
|
|
372
|
-
heldOutLift: winnerHoldoutComposite - baselineHoldoutComposite,
|
|
373
|
-
backend: {
|
|
374
|
-
verdict: integrity.verdict,
|
|
375
|
-
workerCallCount: integrity.totalRecords,
|
|
376
|
-
models,
|
|
377
|
-
totalInputTokens: integrity.totalInputTokens,
|
|
378
|
-
totalOutputTokens: integrity.totalOutputTokens,
|
|
379
|
-
totalCostUsd: integrity.totalCostUsd
|
|
380
|
-
},
|
|
381
|
-
totalCostUsd: args.totalCostUsd,
|
|
382
|
-
totalDurationMs: args.totalDurationMs
|
|
383
|
-
};
|
|
384
|
-
if (args.winnerLabel) record.winnerLabel = args.winnerLabel;
|
|
385
|
-
if (args.winnerRationale) record.winnerRationale = args.winnerRationale;
|
|
386
|
-
return record;
|
|
387
|
-
}
|
|
388
|
-
var DECISION_OK = ["ship"];
|
|
389
|
-
function hashId(parts) {
|
|
390
|
-
return createHash("sha256").update(parts.join(":")).digest("hex");
|
|
391
|
-
}
|
|
392
|
-
function gateStatus(decision) {
|
|
393
|
-
return DECISION_OK.includes(decision) ? { code: "OK" } : { code: "ERROR", message: `gate decision: ${decision}` };
|
|
394
|
-
}
|
|
395
|
-
function loopProvenanceSpans(record, opts = {}) {
|
|
396
|
-
const traceId = hashId(["trace", record.runId]).slice(0, 32);
|
|
397
|
-
const baseNano = (opts.baseTimeMs ?? (Date.parse(record.timestamp) || Date.now())) * 1e6;
|
|
398
|
-
const endNano = baseNano + Math.max(1, record.totalDurationMs) * 1e6;
|
|
399
|
-
const spans = [];
|
|
400
|
-
const rootSpanId = hashId(["root", record.runId]).slice(0, 16);
|
|
401
|
-
spans.push({
|
|
402
|
-
traceId,
|
|
403
|
-
spanId: rootSpanId,
|
|
404
|
-
name: "improvement-loop",
|
|
405
|
-
startTimeUnixNano: baseNano,
|
|
406
|
-
endTimeUnixNano: endNano,
|
|
407
|
-
attributes: {
|
|
408
|
-
"tangle.runId": record.runId,
|
|
409
|
-
"tangle.runDir": record.runDir,
|
|
410
|
-
"tangle.baselineContentHash": record.baselineContentHash,
|
|
411
|
-
"tangle.winnerContentHash": record.winnerContentHash,
|
|
412
|
-
"tangle.heldOutLift": record.heldOutLift,
|
|
413
|
-
"tangle.gateDecision": record.gate.decision,
|
|
414
|
-
"tangle.backendVerdict": record.backend.verdict,
|
|
415
|
-
"tangle.workerCallCount": record.backend.workerCallCount,
|
|
416
|
-
"tangle.totalCostUsd": record.totalCostUsd
|
|
417
|
-
},
|
|
418
|
-
status: gateStatus(record.gate.decision),
|
|
419
|
-
"tangle.runId": record.runId
|
|
420
|
-
});
|
|
421
|
-
const byGen = /* @__PURE__ */ new Map();
|
|
422
|
-
for (const c of record.candidates) {
|
|
423
|
-
const arr = byGen.get(c.generation) ?? [];
|
|
424
|
-
arr.push(c);
|
|
425
|
-
byGen.set(c.generation, arr);
|
|
426
|
-
}
|
|
427
|
-
for (const [generation, cands] of [...byGen.entries()].sort((a, b) => a[0] - b[0])) {
|
|
428
|
-
const genSpanId = hashId(["gen", record.runId, String(generation)]).slice(0, 16);
|
|
429
|
-
const bestComposite = cands.reduce((m, c) => Math.max(m, c.composite), 0);
|
|
430
|
-
spans.push({
|
|
431
|
-
traceId,
|
|
432
|
-
spanId: genSpanId,
|
|
433
|
-
parentSpanId: rootSpanId,
|
|
434
|
-
name: `generation-${generation}`,
|
|
435
|
-
startTimeUnixNano: baseNano,
|
|
436
|
-
endTimeUnixNano: endNano,
|
|
437
|
-
attributes: {
|
|
438
|
-
"tangle.runId": record.runId,
|
|
439
|
-
"tangle.generation": generation,
|
|
440
|
-
"tangle.populationSize": cands.length,
|
|
441
|
-
"tangle.bestComposite": bestComposite
|
|
442
|
-
},
|
|
443
|
-
"tangle.runId": record.runId,
|
|
444
|
-
"tangle.generation": generation
|
|
445
|
-
});
|
|
446
|
-
for (let i = 0; i < cands.length; i++) {
|
|
447
|
-
const c = cands[i];
|
|
448
|
-
const candSpanId = hashId(["cand", record.runId, String(generation), c.surfaceHash]).slice(
|
|
449
|
-
0,
|
|
450
|
-
16
|
|
451
|
-
);
|
|
452
|
-
const attributes = {
|
|
453
|
-
"tangle.runId": record.runId,
|
|
454
|
-
"tangle.generation": generation,
|
|
455
|
-
"tangle.surfaceHash": c.surfaceHash,
|
|
456
|
-
"tangle.contentHash": c.contentHash,
|
|
457
|
-
"tangle.composite": c.composite,
|
|
458
|
-
"tangle.promoted": c.promoted
|
|
459
|
-
};
|
|
460
|
-
if (c.label) attributes["tangle.candidateLabel"] = c.label;
|
|
461
|
-
if (c.rationale) attributes["tangle.candidateRationale"] = c.rationale;
|
|
462
|
-
spans.push({
|
|
463
|
-
traceId,
|
|
464
|
-
spanId: candSpanId,
|
|
465
|
-
parentSpanId: genSpanId,
|
|
466
|
-
name: `candidate-${c.surfaceHash}`,
|
|
467
|
-
startTimeUnixNano: baseNano,
|
|
468
|
-
endTimeUnixNano: endNano,
|
|
469
|
-
attributes,
|
|
470
|
-
"tangle.runId": record.runId,
|
|
471
|
-
"tangle.generation": generation
|
|
472
|
-
});
|
|
473
|
-
}
|
|
474
|
-
}
|
|
475
|
-
const gateSpanId = hashId(["gate", record.runId]).slice(0, 16);
|
|
476
|
-
spans.push({
|
|
477
|
-
traceId,
|
|
478
|
-
spanId: gateSpanId,
|
|
479
|
-
parentSpanId: rootSpanId,
|
|
480
|
-
name: "gate-decision",
|
|
481
|
-
startTimeUnixNano: endNano,
|
|
482
|
-
endTimeUnixNano: endNano,
|
|
483
|
-
attributes: {
|
|
484
|
-
"tangle.runId": record.runId,
|
|
485
|
-
"tangle.gateDecision": record.gate.decision,
|
|
486
|
-
"tangle.gateDelta": record.gate.delta ?? record.heldOutLift,
|
|
487
|
-
"tangle.gateReasons": JSON.stringify(record.gate.reasons),
|
|
488
|
-
"tangle.heldOutLift": record.heldOutLift,
|
|
489
|
-
"tangle.baselineHoldoutComposite": record.baselineHoldoutComposite,
|
|
490
|
-
"tangle.winnerHoldoutComposite": record.winnerHoldoutComposite
|
|
491
|
-
},
|
|
492
|
-
status: gateStatus(record.gate.decision),
|
|
493
|
-
"tangle.runId": record.runId
|
|
494
|
-
});
|
|
495
|
-
return spans;
|
|
496
|
-
}
|
|
497
|
-
function provenanceRecordPath(runDir) {
|
|
498
|
-
return join(runDir, "loop-provenance.json");
|
|
499
|
-
}
|
|
500
|
-
function provenanceSpansPath(runDir) {
|
|
501
|
-
return join(runDir, "loop-provenance-spans.jsonl");
|
|
502
|
-
}
|
|
503
|
-
function snapshotFromHoldout(index, surfaceHash, surface, campaign) {
|
|
504
|
-
const cells = campaign.cells.map((cell) => {
|
|
505
|
-
const judgeScores = Object.values(cell.judgeScores);
|
|
506
|
-
const composite = judgeScores.length === 0 ? 0 : judgeScores.reduce((s, j) => s + j.composite, 0) / judgeScores.length;
|
|
507
|
-
const score = {
|
|
508
|
-
scenarioId: cell.scenarioId,
|
|
509
|
-
rep: cell.rep,
|
|
510
|
-
compositeMean: composite,
|
|
511
|
-
dimensions: Object.fromEntries(
|
|
512
|
-
Object.entries(cell.judgeScores).map(([name, s]) => [name, s.dimensions])
|
|
513
|
-
)
|
|
514
|
-
};
|
|
515
|
-
if (cell.error) score.errorMessage = cell.error;
|
|
516
|
-
return score;
|
|
517
|
-
});
|
|
518
|
-
const compositeMean = cells.length === 0 ? 0 : cells.reduce((s, c) => s + c.compositeMean, 0) / cells.length;
|
|
519
|
-
return {
|
|
520
|
-
index,
|
|
521
|
-
surfaceHash,
|
|
522
|
-
surface,
|
|
523
|
-
cells,
|
|
524
|
-
compositeMean,
|
|
525
|
-
costUsd: campaign.aggregates.totalCostUsd,
|
|
526
|
-
durationMs: campaign.durationMs
|
|
527
|
-
};
|
|
528
|
-
}
|
|
529
|
-
function buildEvalRunEvent(args, record) {
|
|
530
|
-
return {
|
|
531
|
-
runId: args.runId,
|
|
532
|
-
runDir: args.runDir,
|
|
533
|
-
timestamp: args.timestamp,
|
|
534
|
-
status: "finished",
|
|
535
|
-
labels: {},
|
|
536
|
-
baseline: snapshotFromHoldout(
|
|
537
|
-
0,
|
|
538
|
-
record.baselineContentHash,
|
|
539
|
-
args.baselineSurface,
|
|
540
|
-
args.baselineOnHoldout
|
|
541
|
-
),
|
|
542
|
-
generations: [
|
|
543
|
-
snapshotFromHoldout(1, record.winnerContentHash, args.winnerSurface, args.winnerOnHoldout)
|
|
544
|
-
],
|
|
545
|
-
gateDecision: args.gate.decision,
|
|
546
|
-
holdoutLift: record.heldOutLift,
|
|
547
|
-
totalCostUsd: args.totalCostUsd,
|
|
548
|
-
totalDurationMs: args.totalDurationMs
|
|
549
|
-
};
|
|
550
|
-
}
|
|
551
|
-
async function emitLoopProvenance(args) {
|
|
552
|
-
const record = buildLoopProvenanceRecord(args);
|
|
553
|
-
const spans = loopProvenanceSpans(record);
|
|
554
|
-
args.storage.ensureDir(args.runDir);
|
|
555
|
-
const recordPath = provenanceRecordPath(args.runDir);
|
|
556
|
-
const spansPath = provenanceSpansPath(args.runDir);
|
|
557
|
-
args.storage.write(recordPath, JSON.stringify(record, null, 2));
|
|
558
|
-
args.storage.write(spansPath, spans.map((s) => JSON.stringify(s)).join("\n"));
|
|
559
|
-
if (args.hostedClient) {
|
|
560
|
-
try {
|
|
561
|
-
await args.hostedClient.ingestEvalRun(buildEvalRunEvent(args, record));
|
|
562
|
-
} catch (err) {
|
|
563
|
-
const msg = err instanceof Error ? err.message : String(err);
|
|
564
|
-
console.warn(`[agent-eval] hosted eval-run ingest failed (continuing): ${msg}`);
|
|
565
|
-
}
|
|
566
|
-
try {
|
|
567
|
-
await args.hostedClient.ingestTraces(spans);
|
|
568
|
-
} catch (err) {
|
|
569
|
-
const msg = err instanceof Error ? err.message : String(err);
|
|
570
|
-
console.warn(`[agent-eval] provenance span ingest failed (continuing): ${msg}`);
|
|
571
|
-
}
|
|
572
|
-
}
|
|
573
|
-
return { record, spans, recordPath, spansPath };
|
|
574
|
-
}
|
|
575
|
-
|
|
576
308
|
export {
|
|
577
309
|
evolutionaryDriver,
|
|
578
310
|
composeGate,
|
|
@@ -581,12 +313,6 @@ export {
|
|
|
581
313
|
detectScale,
|
|
582
314
|
dimensionRegressions,
|
|
583
315
|
defaultProductionGate,
|
|
584
|
-
runEval
|
|
585
|
-
surfaceContentHash,
|
|
586
|
-
buildLoopProvenanceRecord,
|
|
587
|
-
loopProvenanceSpans,
|
|
588
|
-
provenanceRecordPath,
|
|
589
|
-
provenanceSpansPath,
|
|
590
|
-
emitLoopProvenance
|
|
316
|
+
runEval
|
|
591
317
|
};
|
|
592
|
-
//# sourceMappingURL=chunk-
|
|
318
|
+
//# sourceMappingURL=chunk-E24XD7A2.js.map
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
{"version":3,"sources":["../src/campaign/drivers/evolutionary.ts","../src/campaign/gates/compose.ts","../src/campaign/gates/statistical-heldout.ts","../src/campaign/gates/default-production-gate.ts","../src/campaign/presets/run-eval.ts"],"sourcesContent":["/**\n * @experimental\n *\n * `evolutionaryDriver` — adapts a stateless `Mutator` (population mutation:\n * GEPA / AxGEPA / reflective-mutation) into an `ImprovementDriver`. This is\n * the evolutionary strategy: each generation, mutate the current best surface\n * into N candidates, measure, select. No generation memory beyond the current\n * surface; the loop body handles ranking + promotion.\n *\n * The reflective alternative is agent-runtime's `improvementDriver` with a\n * `reflectiveGenerator` / `agenticGenerator`: it reasons over the report +\n * trace findings to propose targeted edits rather than blind mutations. Both\n * conform to `ImprovementDriver`; the improvement loop is identical regardless\n * of which drives it.\n */\n\nimport type { ImprovementDriver, Mutator } from '../types'\n\nexport interface EvolutionaryDriverOptions<TFindings = unknown> {\n mutator: Mutator<TFindings>\n /** External findings fed to the mutator each generation. Default: []. */\n findings?: TFindings[]\n}\n\nexport function evolutionaryDriver<TFindings = unknown>(\n opts: EvolutionaryDriverOptions<TFindings>,\n): ImprovementDriver<TFindings> {\n return {\n kind: `evolutionary:${opts.mutator.kind}`,\n async propose({ currentSurface, findings, populationSize, signal }) {\n return opts.mutator.mutate({\n findings: findings.length > 0 ? findings : (opts.findings ?? []),\n currentSurface,\n populationSize,\n signal,\n })\n },\n }\n}\n","/**\n * @experimental\n *\n * Compose multiple `Gate` implementations — every gate must pass for the\n * composite to ship. Closes the alignment reviewer's \"default-only\n * heldOutGate + costGate would happily promote a reward-hacked prompt\"\n * concern by making safety gates first-class composable defaults.\n */\n\nimport type { Gate, GateContext, GateDecision, GateResult, Scenario } from '../types'\n\n/** Compose gates — all must `ship` for the composite to `ship`. First\n * non-ship verdict short-circuits the composite verdict, but ALL gates run\n * (so the result records every gate's reason — useful for diagnostics). */\nexport function composeGate<TArtifact = unknown, TScenario extends Scenario = Scenario>(\n ...gates: Array<Gate<TArtifact, TScenario>>\n): Gate<TArtifact, TScenario> {\n if (gates.length === 0) {\n throw new Error('composeGate requires at least one gate')\n }\n return {\n name: `composed(${gates.map((g) => g.name).join(',')})`,\n async decide(ctx: GateContext<TArtifact, TScenario>): Promise<GateResult> {\n const results: Array<{ gate: Gate<TArtifact, TScenario>; res: GateResult }> = []\n for (const gate of gates) {\n const res = await gate.decide(ctx)\n results.push({ gate, res })\n }\n\n // Substrate-wide verdict policy:\n // - all 'ship' → 'ship'\n // - any 'arch_ceiling' → 'arch_ceiling' (architectural ceiling beats other holds)\n // - any 'model_ceiling' → 'model_ceiling'\n // - any 'hold' → 'hold'\n // - else 'need_more_work'\n const decisions = results.map((r) => r.res.decision)\n const overall: GateDecision = decisions.every((d) => d === 'ship')\n ? 'ship'\n : decisions.includes('arch_ceiling')\n ? 'arch_ceiling'\n : decisions.includes('model_ceiling')\n ? 'model_ceiling'\n : decisions.includes('hold')\n ? 'hold'\n : 'need_more_work'\n\n const contributing = results.flatMap((r) =>\n r.res.contributingGates.length > 0\n ? r.res.contributingGates\n : [{ name: r.gate.name, passed: r.res.decision === 'ship', detail: r.res }],\n )\n\n const reasons = results.flatMap((r) =>\n r.res.reasons.map((reason) => `[${r.gate.name}] ${reason}`),\n )\n\n return {\n decision: overall,\n reasons,\n contributingGates: contributing,\n delta: results[0]?.res.delta,\n }\n },\n }\n}\n","/**\n * @experimental\n *\n * Statistical held-out promotion machinery — the trustworthy core the\n * point-estimate `heldout-delta` gate lacked.\n *\n * The shipped false positive it prevents: a winner re-scored against the\n * baseline on the holdout read run-to-run model NOISE (e.g. 91 vs 95) as a\n * \"+4 lift\" and shipped, because the gate compared point estimates with no\n * confidence interval. Here we pair candidate vs baseline holdout observations\n * and bootstrap a CI on the paired delta — a candidate ships only when the CI\n * lower bound clears the effect-size threshold (the gain is real at the\n * confidence level, not noise), and is blocked when a critical dimension\n * (e.g. `hallucination_free` for a legal agent) significantly regresses even if\n * the net composite rose (anti-Goodhart).\n *\n * Two traps this module is built around (both produce a NEW false positive if\n * gotten wrong):\n * 1. PAIRING GRANULARITY — pairs by FULL `cellId` (`scenario:rep`), never by\n * `scenarioId` (which averages reps away and destroys the within-pair\n * variance reduction that makes a paired bootstrap tighter than unpaired).\n * One paired observation per cell ⇒ reps multiply n.\n * 2. SCALE — a judge may emit composites/dimensions on [0,1] or 0-100. The\n * threshold + tolerance are interpreted in the judge's NATIVE scale; the\n * per-dimension tolerance auto-scales off the observed baseline magnitudes\n * so `-0.10` on [0,1] doesn't silently become a no-op on a 0-100 dimension.\n */\n\nimport { type PairedBootstrapResult, pairedBootstrap } from '../../statistics'\nimport type { JudgeScore } from '../types'\n\nexport interface PairedHoldout {\n /** Baseline scalar per paired cell (same order as `after`/`cellIds`). */\n before: number[]\n /** Candidate scalar per paired cell. */\n after: number[]\n /** The full cellIds (`scenario:rep`) that paired, in order. */\n cellIds: string[]\n}\n\n/**\n * Pair candidate vs baseline holdout observations by FULL cellId. `select`\n * pulls the scalar from a cell's judge reports (composite, or a named\n * dimension); a cell contributes the mean of `select` across its judges. Cells\n * whose scenario is not in `scenarioIds`, or where `select` is undefined for\n * every judge on either side, are skipped on BOTH sides so the arrays stay\n * paired. Throws when the two maps disagree on which holdout cells exist — a\n * load-bearing invariant: the baseline + winner holdout campaigns run the same\n * scenarios with the same seed base, so their cellIds MUST align; a mismatch\n * means a silent pairing bug, not a soft fallback.\n */\nexport function pairHoldout(\n candidate: Map<string, Record<string, JudgeScore>>,\n baseline: Map<string, Record<string, JudgeScore>>,\n scenarioIds: Set<string>,\n select: (s: JudgeScore) => number | undefined,\n): PairedHoldout {\n const cellValue = (\n byCell: Map<string, Record<string, JudgeScore>>,\n cellId: string,\n ): number | undefined => {\n const scores = byCell.get(cellId)\n if (!scores) return undefined\n const vals: number[] = []\n for (const s of Object.values(scores)) {\n const v = select(s)\n if (typeof v === 'number' && Number.isFinite(v)) vals.push(v)\n }\n if (vals.length === 0) return undefined\n return vals.reduce((a, b) => a + b, 0) / vals.length\n }\n\n const inScope = (cellId: string) => scenarioIds.has(cellId.split(':')[0] ?? '')\n const candCells = [...candidate.keys()].filter(inScope).sort()\n const baseCells = [...baseline.keys()].filter(inScope).sort()\n // Alignment invariant — the holdout campaigns share scenarios + seed, so the\n // cell sets must be identical. Differ ⇒ a real pairing bug; fail loud.\n if (candCells.length !== baseCells.length || candCells.some((c, i) => c !== baseCells[i])) {\n throw new Error(\n `pairHoldout: candidate/baseline holdout cells do not align — ` +\n `candidate=[${candCells.join(',')}] baseline=[${baseCells.join(',')}]. ` +\n `Both holdout campaigns must run the same scenarios with the same seed base.`,\n )\n }\n\n const before: number[] = []\n const after: number[] = []\n const cellIds: string[] = []\n for (const cellId of candCells) {\n const b = cellValue(baseline, cellId)\n const a = cellValue(candidate, cellId)\n // Only pair when BOTH sides produced the scalar (a dimension absent on one\n // side would otherwise create an unpaired observation).\n if (b === undefined || a === undefined) continue\n before.push(b)\n after.push(a)\n cellIds.push(cellId)\n }\n return { before, after, cellIds }\n}\n\nexport interface HeldoutSignificance {\n paired: PairedHoldout\n bootstrap: PairedBootstrapResult\n /** n paired observations. */\n n: number\n /** True iff n >= minProductiveRuns AND the CI lower bound clears the threshold. */\n significant: boolean\n /** Set when n < minProductiveRuns — too little evidence to claim significance. */\n fewRuns: boolean\n}\n\nexport interface HeldoutSignificanceOptions {\n deltaThreshold?: number\n minProductiveRuns?: number\n confidence?: number\n resamples?: number\n /** Fixed by default for a deterministic, reproducible gate verdict. */\n seed?: number\n statistic?: 'mean' | 'median'\n}\n\n/** Significance of the held-out composite lift: ship only when the paired\n * bootstrap CI lower bound on (candidate − baseline) exceeds `deltaThreshold`\n * (default 0 ⇒ \"confidently positive\"). Below `minProductiveRuns` paired\n * observations there is not enough evidence to claim significance → not\n * significant (`fewRuns`). Interpret `deltaThreshold` in the judge's native\n * composite scale. */\nexport function heldoutSignificance(\n paired: PairedHoldout,\n opts: HeldoutSignificanceOptions = {},\n): HeldoutSignificance {\n const deltaThreshold = opts.deltaThreshold ?? 0\n const minProductiveRuns = opts.minProductiveRuns ?? 3\n const bootstrap = pairedBootstrap(paired.before, paired.after, {\n confidence: opts.confidence ?? 0.95,\n resamples: opts.resamples ?? 2000,\n statistic: opts.statistic ?? 'median',\n seed: opts.seed ?? 1337,\n })\n const n = paired.before.length\n const fewRuns = n < minProductiveRuns\n const significant = !fewRuns && bootstrap.low > deltaThreshold\n return { paired, bootstrap, n, significant, fewRuns }\n}\n\nexport interface DimensionRegression {\n dimension: string\n bootstrap: PairedBootstrapResult\n /** True iff the CI lower bound on (candidate − baseline) is below −tolerance:\n * the candidate may have regressed this dimension by more than tolerance. */\n regressed: boolean\n tolerance: number\n n: number\n}\n\n/** Detect the native scale of a set of scores: 0-100 when any magnitude clears\n * 1.5, else [0,1]. Used to auto-scale the regression tolerance so a default\n * expressed for [0,1] is not silently a no-op on a 0-100 dimension. */\nexport function detectScale(values: number[]): 1 | 100 {\n return values.some((v) => Math.abs(v) > 1.5) ? 100 : 1\n}\n\n/** Per-critical-dimension regression guard. For each dimension, pair the\n * candidate vs baseline values by full cellId and bootstrap the paired delta;\n * a dimension is \"regressed\" when the CI lower bound < −tolerance (conservative\n * — blocks if the credible worst case exceeds tolerance, which is the right\n * posture for safety dimensions like `hallucination_free`). When `tolerance`\n * is omitted it auto-scales: 0.05 on [0,1], 5 on 0-100. */\nexport function dimensionRegressions(\n candidate: Map<string, Record<string, JudgeScore>>,\n baseline: Map<string, Record<string, JudgeScore>>,\n scenarioIds: Set<string>,\n criticalDimensions: string[],\n opts: { tolerance?: number; confidence?: number; resamples?: number; seed?: number } = {},\n): DimensionRegression[] {\n const out: DimensionRegression[] = []\n for (const dim of criticalDimensions) {\n const paired = pairHoldout(candidate, baseline, scenarioIds, (s) => s.dimensions[dim])\n if (paired.before.length === 0) continue // dimension not scored on this judge\n const tolerance = opts.tolerance ?? 0.05 * detectScale([...paired.before, ...paired.after])\n const bootstrap = pairedBootstrap(paired.before, paired.after, {\n confidence: opts.confidence ?? 0.95,\n resamples: opts.resamples ?? 2000,\n statistic: 'median',\n seed: opts.seed ?? 1337,\n })\n out.push({\n dimension: dim,\n bootstrap,\n regressed: bootstrap.low < -tolerance,\n tolerance,\n n: paired.before.length,\n })\n }\n return out\n}\n","/**\n * @experimental\n *\n * `defaultProductionGate` — composes the substrate's existing safety\n * primitives (red-team / reward-hacking / canary / heldout) into a single\n * Gate.decide shape. Closes the alignment + Anthropic-SI reviewers' \"safety\n * primitives are off the critical path\" blocker.\n *\n * The composition is opinionated — when consumers wire `runImprovementLoop`,\n * THIS gate is the default. Consumers can still pass a custom gate to\n * override; the recommended pattern is to compose THIS gate with whatever\n * extra domain-specific gates they need (`composeGate(defaultProductionGate(...), customGate)`).\n */\n\nimport type { CanaryReport } from '../../canary'\nimport { runCanaries } from '../../canary'\nimport type { RedTeamCase } from '../../red-team'\nimport { scoreRedTeamOutput } from '../../red-team'\nimport type { RewardHackingReport } from '../../rl/reward-hacking'\nimport { detectRewardHacking } from '../../rl/reward-hacking'\nimport type { RunRecord } from '../../run-record'\nimport type { Gate, GateContext, GateResult, Scenario } from '../types'\nimport { dimensionRegressions, heldoutSignificance, pairHoldout } from './statistical-heldout'\n\nexport interface DefaultProductionGateOptions {\n /** Required: scenarios held out from training; substrate compares\n * candidate-on-holdout vs baseline-on-holdout. */\n holdoutScenarios: Scenario[]\n /** Minimum held-out lift the **paired-bootstrap CI lower bound** must clear\n * to ship — NOT a point estimate. Default 0 ⇒ \"confidently positive at the\n * confidence level\". Interpreted in the judge's native composite scale (set\n * e.g. 2 for a 0-100 rubric to require a ≥2-point significant gain). */\n deltaThreshold?: number\n /** Confidence level for the held-out + dimension bootstraps. Default 0.95. */\n confidence?: number\n /** Bootstrap resamples. Default 2000. */\n bootstrapResamples?: number\n /** Fixed bootstrap seed for a deterministic verdict. Default 1337. */\n bootstrapSeed?: number\n /** Minimum paired holdout observations (scenarios × reps) before a\n * significance claim is allowed; below it the gate HOLDS with `few_runs`\n * rather than reading a degenerate CI. Default 3. */\n minProductiveRuns?: number\n /** Critical judge dimensions that must NOT significantly regress even when\n * the net composite rises (anti-Goodhart). The gate HOLDS if any listed\n * dimension's paired-delta CI lower bound < −`regressionTolerance`. E.g.\n * `['hallucination_free']` for a legal agent. */\n criticalDimensions?: string[]\n /** Tolerance for the per-dimension regression guard, in the dimension's\n * native scale. When omitted it auto-scales off observed magnitudes:\n * 0.05 on [0,1], 5 on 0-100. */\n regressionTolerance?: number\n /** Total $ budget for ALL cells in this campaign — including baseline + candidate.\n * Composite verdict refuses to ship when spend exceeded budget. */\n budgetUsd?: number\n /** Red-team cases to probe candidate outputs against. When omitted the\n * substrate uses `DEFAULT_RED_TEAM_CORPUS`. Provide a domain-specific\n * battery for tighter coverage. */\n redTeamBattery?: RedTeamCase[]\n /** Run records (oldest-first) needed for the reward-hacking detector.\n * Substrate populates from prior production-loop generations. */\n recentRuns?: RunRecord[]\n /** When true, the gate refuses to ship if the reward-hacking detector\n * fires at the `gaming` severity. Default true. */\n blockOnRewardHackingGaming?: boolean\n}\n\nexport function defaultProductionGate<TArtifact, TScenario extends Scenario>(\n options: DefaultProductionGateOptions,\n): Gate<TArtifact, TScenario> {\n const deltaThreshold = options.deltaThreshold ?? 0\n const confidence = options.confidence ?? 0.95\n const resamples = options.bootstrapResamples ?? 2000\n const seed = options.bootstrapSeed ?? 1337\n const minProductiveRuns = options.minProductiveRuns ?? 3\n const blockOnGaming = options.blockOnRewardHackingGaming ?? true\n\n return {\n name: 'defaultProductionGate',\n async decide(ctx: GateContext<TArtifact, TScenario>): Promise<GateResult> {\n const reasons: string[] = []\n const contributing: Array<{ name: string; passed: boolean; detail: unknown }> = []\n\n // ── (1) heldout composite lift — paired-bootstrap CI, NOT a point estimate\n // The shipped false positive: the baseline re-scored against itself read\n // run-to-run model noise (91 vs 95) as a \"+4 lift\" and shipped, because a\n // point estimate carries no confidence interval. Pair candidate vs\n // baseline holdout cells by FULL cellId (never averaging reps away) and\n // ship only when the bootstrap CI lower bound clears the threshold —\n // i.e. the gain is real at the confidence level, not noise.\n const scenarioIds = new Set(options.holdoutScenarios.map((s) => s.id))\n const sig = heldoutSignificance(\n pairHoldout(\n ctx.judgeScores,\n ctx.baselineJudgeScores ?? ctx.judgeScores,\n scenarioIds,\n (s) => s.composite,\n ),\n { deltaThreshold, minProductiveRuns, confidence, resamples, seed },\n )\n const delta = sig.bootstrap.median\n const heldoutPass = sig.significant\n contributing.push({\n name: 'heldout-significance',\n passed: heldoutPass,\n detail: {\n n: sig.n,\n deltaMedian: sig.bootstrap.median,\n ciLow: sig.bootstrap.low,\n ciHigh: sig.bootstrap.high,\n confidence: sig.bootstrap.confidence,\n deltaThreshold,\n fewRuns: sig.fewRuns,\n },\n })\n if (!heldoutPass) {\n reasons.push(\n sig.fewRuns\n ? `held-out: only ${sig.n} paired runs (< ${minProductiveRuns}) — too few to claim significance`\n : `held-out CI.low ${sig.bootstrap.low.toFixed(3)} ≤ threshold ${deltaThreshold} (median ${sig.bootstrap.median.toFixed(3)}, ${(sig.bootstrap.confidence * 100).toFixed(0)}% CI [${sig.bootstrap.low.toFixed(3)}, ${sig.bootstrap.high.toFixed(3)}])`,\n )\n }\n\n // ── (1b) per-dimension regression guard (anti-Goodhart) ──────────\n // A net composite gain can hide a regression on a safety-critical\n // dimension (e.g. hallucination_free for a legal agent — the verified run\n // gained +25/+25 on deadline/fee while LOSING -30 on hallucination, and\n // the composite-only gate never saw it). Block ship if any guarded\n // dimension's paired-delta CI lower bound falls below −tolerance.\n const dimRegs = options.criticalDimensions?.length\n ? dimensionRegressions(\n ctx.judgeScores,\n ctx.baselineJudgeScores ?? ctx.judgeScores,\n scenarioIds,\n options.criticalDimensions,\n { tolerance: options.regressionTolerance, confidence, resamples, seed },\n )\n : []\n const regressed = dimRegs.filter((d) => d.regressed)\n const dimPass = regressed.length === 0\n contributing.push({\n name: 'dimension-regression',\n passed: dimPass,\n detail: {\n guarded: options.criticalDimensions ?? [],\n regressions: dimRegs.map((d) => ({\n dimension: d.dimension,\n ciLow: d.bootstrap.low,\n median: d.bootstrap.median,\n tolerance: d.tolerance,\n n: d.n,\n regressed: d.regressed,\n })),\n },\n })\n if (!dimPass) {\n reasons.push(\n `critical dimension(s) regressed: ${regressed.map((d) => `${d.dimension} CI.low ${d.bootstrap.low.toFixed(3)} < -${d.tolerance}`).join('; ')}`,\n )\n }\n\n // ── (2) budget gate ─────────────────────────────────────────────\n const budgetPass =\n options.budgetUsd === undefined ||\n ctx.cost.candidate + ctx.cost.baseline <= options.budgetUsd\n contributing.push({\n name: 'budget',\n passed: budgetPass,\n detail: {\n candidateUsd: ctx.cost.candidate,\n baselineUsd: ctx.cost.baseline,\n budgetUsd: options.budgetUsd,\n },\n })\n if (!budgetPass) {\n reasons.push(\n `spend ${(ctx.cost.candidate + ctx.cost.baseline).toFixed(2)} > budget ${options.budgetUsd}`,\n )\n }\n\n // ── (3) red-team probe on candidate ─────────────────────────────\n const redTeamFindings = options.redTeamBattery\n ? probeRedTeam(ctx.candidateArtifacts, options.redTeamBattery)\n : { passed: true, findings: [] }\n contributing.push({\n name: 'red-team',\n passed: redTeamFindings.passed,\n detail: {\n failures: redTeamFindings.findings.length,\n sample: redTeamFindings.findings.slice(0, 3),\n },\n })\n if (!redTeamFindings.passed) {\n reasons.push(`red-team probe failed (${redTeamFindings.findings.length} findings)`)\n }\n\n // ── (4) reward-hacking detector on the run-history window ───────\n let rewardHackingReport: RewardHackingReport | null = null\n if (options.recentRuns && options.recentRuns.length >= 10) {\n rewardHackingReport = detectRewardHacking({ runs: options.recentRuns })\n }\n // reward-hacking severity is numeric (0..1). \"gaming\" threshold per\n // detectRewardHacking defaults = 0.6. Block when ANY finding is at\n // gaming threshold OR the report verdict is 'gaming'.\n const gamingThreshold = 0.6\n const gamingFindings = (rewardHackingReport?.findings ?? []).filter(\n (f) => f.severity >= gamingThreshold,\n )\n const rewardHackingPass =\n !rewardHackingReport ||\n !blockOnGaming ||\n (gamingFindings.length === 0 && rewardHackingReport.verdict !== 'gaming')\n contributing.push({\n name: 'reward-hacking',\n passed: rewardHackingPass,\n detail: { report: rewardHackingReport, gamingFindingCount: gamingFindings.length },\n })\n if (!rewardHackingPass) {\n reasons.push(\n `reward-hacking detector flagged ${gamingFindings.length} gaming-severity findings (verdict=${rewardHackingReport!.verdict})`,\n )\n }\n\n // ── (5) canary check on runs ────────────────────────────────────\n let canaryReport: CanaryReport | null = null\n if (options.recentRuns && options.recentRuns.length >= 10) {\n canaryReport = runCanaries(options.recentRuns, {})\n }\n // CanarySeverity is 'info' | 'warn' | 'error' — block on 'error'.\n const errorAlerts = (canaryReport?.alerts ?? []).filter((a) => a.severity === 'error')\n const canaryPass = errorAlerts.length === 0\n contributing.push({\n name: 'canary',\n passed: canaryPass,\n detail: { totalAlerts: canaryReport?.alerts.length ?? 0, errorAlerts: errorAlerts.length },\n })\n if (!canaryPass) {\n reasons.push(`canary error alerts: ${errorAlerts.length}`)\n }\n\n // ── Verdict ─────────────────────────────────────────────────────\n const allPassed = contributing.every((c) => c.passed)\n const decision = allPassed ? 'ship' : 'hold'\n\n return {\n decision,\n reasons: reasons.length > 0 ? reasons : ['all gates passed'],\n contributingGates: contributing,\n delta,\n }\n },\n }\n}\n\nfunction probeRedTeam<TArtifact>(\n artifacts: Map<string, TArtifact>,\n battery: RedTeamCase[],\n): { passed: boolean; findings: Array<{ scenarioId: string; reason: string }> } {\n const findings: Array<{ scenarioId: string; reason: string }> = []\n for (const [_cellId, artifact] of artifacts) {\n const text = extractText(artifact)\n if (text === undefined) continue\n for (const rtCase of battery) {\n const finding = scoreRedTeamOutput(text, [], rtCase)\n if (!finding.passed) {\n findings.push({ scenarioId: rtCase.id, reason: finding.reason ?? 'red-team probe failed' })\n }\n }\n }\n return { passed: findings.length === 0, findings }\n}\n\nfunction extractText(artifact: unknown): string | undefined {\n if (typeof artifact === 'string') return artifact\n if (artifact && typeof artifact === 'object') {\n const rec = artifact as Record<string, unknown>\n if (typeof rec.text === 'string') return rec.text\n if (typeof rec.output === 'string') return rec.output\n if (typeof rec.content === 'string') return rec.content\n }\n return undefined\n}\n","/**\n * @experimental\n *\n * `runEval` — the simplest preset over `runCampaign`. No optimizer, no\n * gate, no auto-PR. Just: run scenarios through dispatch, score with\n * judges, return CampaignResult.\n *\n * The 80% case for consumers who want a scorecard, not an improvement loop.\n */\n\nimport { type RunCampaignOptions, runCampaign } from '../run-campaign'\nimport type { CampaignResult, Scenario } from '../types'\n\nexport interface RunEvalOptions<TScenario extends Scenario, TArtifact>\n extends Omit<RunCampaignOptions<TScenario, TArtifact>, 'runDir'> {\n runDir: string\n}\n\nexport async function runEval<TScenario extends Scenario, TArtifact>(\n opts: RunEvalOptions<TScenario, TArtifact>,\n): Promise<CampaignResult<TArtifact, TScenario>> {\n return runCampaign(opts)\n}\n"],"mappings":";;;;;;;;;;;;;;;AAwBO,SAAS,mBACd,MAC8B;AAC9B,SAAO;AAAA,IACL,MAAM,gBAAgB,KAAK,QAAQ,IAAI;AAAA,IACvC,MAAM,QAAQ,EAAE,gBAAgB,UAAU,gBAAgB,OAAO,GAAG;AAClE,aAAO,KAAK,QAAQ,OAAO;AAAA,QACzB,UAAU,SAAS,SAAS,IAAI,WAAY,KAAK,YAAY,CAAC;AAAA,QAC9D;AAAA,QACA;AAAA,QACA;AAAA,MACF,CAAC;AAAA,IACH;AAAA,EACF;AACF;;;ACxBO,SAAS,eACX,OACyB;AAC5B,MAAI,MAAM,WAAW,GAAG;AACtB,UAAM,IAAI,MAAM,wCAAwC;AAAA,EAC1D;AACA,SAAO;AAAA,IACL,MAAM,YAAY,MAAM,IAAI,CAAC,MAAM,EAAE,IAAI,EAAE,KAAK,GAAG,CAAC;AAAA,IACpD,MAAM,OAAO,KAA6D;AACxE,YAAM,UAAwE,CAAC;AAC/E,iBAAW,QAAQ,OAAO;AACxB,cAAM,MAAM,MAAM,KAAK,OAAO,GAAG;AACjC,gBAAQ,KAAK,EAAE,MAAM,IAAI,CAAC;AAAA,MAC5B;AAQA,YAAM,YAAY,QAAQ,IAAI,CAAC,MAAM,EAAE,IAAI,QAAQ;AACnD,YAAM,UAAwB,UAAU,MAAM,CAAC,MAAM,MAAM,MAAM,IAC7D,SACA,UAAU,SAAS,cAAc,IAC/B,iBACA,UAAU,SAAS,eAAe,IAChC,kBACA,UAAU,SAAS,MAAM,IACvB,SACA;AAEV,YAAM,eAAe,QAAQ;AAAA,QAAQ,CAAC,MACpC,EAAE,IAAI,kBAAkB,SAAS,IAC7B,EAAE,IAAI,oBACN,CAAC,EAAE,MAAM,EAAE,KAAK,MAAM,QAAQ,EAAE,IAAI,aAAa,QAAQ,QAAQ,EAAE,IAAI,CAAC;AAAA,MAC9E;AAEA,YAAM,UAAU,QAAQ;AAAA,QAAQ,CAAC,MAC/B,EAAE,IAAI,QAAQ,IAAI,CAAC,WAAW,IAAI,EAAE,KAAK,IAAI,KAAK,MAAM,EAAE;AAAA,MAC5D;AAEA,aAAO;AAAA,QACL,UAAU;AAAA,QACV;AAAA,QACA,mBAAmB;AAAA,QACnB,OAAO,QAAQ,CAAC,GAAG,IAAI;AAAA,MACzB;AAAA,IACF;AAAA,EACF;AACF;;;ACbO,SAAS,YACd,WACA,UACA,aACA,QACe;AACf,QAAM,YAAY,CAChB,QACA,WACuB;AACvB,UAAM,SAAS,OAAO,IAAI,MAAM;AAChC,QAAI,CAAC,OAAQ,QAAO;AACpB,UAAM,OAAiB,CAAC;AACxB,eAAW,KAAK,OAAO,OAAO,MAAM,GAAG;AACrC,YAAM,IAAI,OAAO,CAAC;AAClB,UAAI,OAAO,MAAM,YAAY,OAAO,SAAS,CAAC,EAAG,MAAK,KAAK,CAAC;AAAA,IAC9D;AACA,QAAI,KAAK,WAAW,EAAG,QAAO;AAC9B,WAAO,KAAK,OAAO,CAAC,GAAG,MAAM,IAAI,GAAG,CAAC,IAAI,KAAK;AAAA,EAChD;AAEA,QAAM,UAAU,CAAC,WAAmB,YAAY,IAAI,OAAO,MAAM,GAAG,EAAE,CAAC,KAAK,EAAE;AAC9E,QAAM,YAAY,CAAC,GAAG,UAAU,KAAK,CAAC,EAAE,OAAO,OAAO,EAAE,KAAK;AAC7D,QAAM,YAAY,CAAC,GAAG,SAAS,KAAK,CAAC,EAAE,OAAO,OAAO,EAAE,KAAK;AAG5D,MAAI,UAAU,WAAW,UAAU,UAAU,UAAU,KAAK,CAAC,GAAG,MAAM,MAAM,UAAU,CAAC,CAAC,GAAG;AACzF,UAAM,IAAI;AAAA,MACR,gFACgB,UAAU,KAAK,GAAG,CAAC,eAAe,UAAU,KAAK,GAAG,CAAC;AAAA,IAEvE;AAAA,EACF;AAEA,QAAM,SAAmB,CAAC;AAC1B,QAAM,QAAkB,CAAC;AACzB,QAAM,UAAoB,CAAC;AAC3B,aAAW,UAAU,WAAW;AAC9B,UAAM,IAAI,UAAU,UAAU,MAAM;AACpC,UAAM,IAAI,UAAU,WAAW,MAAM;AAGrC,QAAI,MAAM,UAAa,MAAM,OAAW;AACxC,WAAO,KAAK,CAAC;AACb,UAAM,KAAK,CAAC;AACZ,YAAQ,KAAK,MAAM;AAAA,EACrB;AACA,SAAO,EAAE,QAAQ,OAAO,QAAQ;AAClC;AA6BO,SAAS,oBACd,QACA,OAAmC,CAAC,GACf;AACrB,QAAM,iBAAiB,KAAK,kBAAkB;AAC9C,QAAM,oBAAoB,KAAK,qBAAqB;AACpD,QAAM,YAAY,gBAAgB,OAAO,QAAQ,OAAO,OAAO;AAAA,IAC7D,YAAY,KAAK,cAAc;AAAA,IAC/B,WAAW,KAAK,aAAa;AAAA,IAC7B,WAAW,KAAK,aAAa;AAAA,IAC7B,MAAM,KAAK,QAAQ;AAAA,EACrB,CAAC;AACD,QAAM,IAAI,OAAO,OAAO;AACxB,QAAM,UAAU,IAAI;AACpB,QAAM,cAAc,CAAC,WAAW,UAAU,MAAM;AAChD,SAAO,EAAE,QAAQ,WAAW,GAAG,aAAa,QAAQ;AACtD;AAeO,SAAS,YAAY,QAA2B;AACrD,SAAO,OAAO,KAAK,CAAC,MAAM,KAAK,IAAI,CAAC,IAAI,GAAG,IAAI,MAAM;AACvD;AAQO,SAAS,qBACd,WACA,UACA,aACA,oBACA,OAAuF,CAAC,GACjE;AACvB,QAAM,MAA6B,CAAC;AACpC,aAAW,OAAO,oBAAoB;AACpC,UAAM,SAAS,YAAY,WAAW,UAAU,aAAa,CAAC,MAAM,EAAE,WAAW,GAAG,CAAC;AACrF,QAAI,OAAO,OAAO,WAAW,EAAG;AAChC,UAAM,YAAY,KAAK,aAAa,OAAO,YAAY,CAAC,GAAG,OAAO,QAAQ,GAAG,OAAO,KAAK,CAAC;AAC1F,UAAM,YAAY,gBAAgB,OAAO,QAAQ,OAAO,OAAO;AAAA,MAC7D,YAAY,KAAK,cAAc;AAAA,MAC/B,WAAW,KAAK,aAAa;AAAA,MAC7B,WAAW;AAAA,MACX,MAAM,KAAK,QAAQ;AAAA,IACrB,CAAC;AACD,QAAI,KAAK;AAAA,MACP,WAAW;AAAA,MACX;AAAA,MACA,WAAW,UAAU,MAAM,CAAC;AAAA,MAC5B;AAAA,MACA,GAAG,OAAO,OAAO;AAAA,IACnB,CAAC;AAAA,EACH;AACA,SAAO;AACT;;;ACjIO,SAAS,sBACd,SAC4B;AAC5B,QAAM,iBAAiB,QAAQ,kBAAkB;AACjD,QAAM,aAAa,QAAQ,cAAc;AACzC,QAAM,YAAY,QAAQ,sBAAsB;AAChD,QAAM,OAAO,QAAQ,iBAAiB;AACtC,QAAM,oBAAoB,QAAQ,qBAAqB;AACvD,QAAM,gBAAgB,QAAQ,8BAA8B;AAE5D,SAAO;AAAA,IACL,MAAM;AAAA,IACN,MAAM,OAAO,KAA6D;AACxE,YAAM,UAAoB,CAAC;AAC3B,YAAM,eAA0E,CAAC;AASjF,YAAM,cAAc,IAAI,IAAI,QAAQ,iBAAiB,IAAI,CAAC,MAAM,EAAE,EAAE,CAAC;AACrE,YAAM,MAAM;AAAA,QACV;AAAA,UACE,IAAI;AAAA,UACJ,IAAI,uBAAuB,IAAI;AAAA,UAC/B;AAAA,UACA,CAAC,MAAM,EAAE;AAAA,QACX;AAAA,QACA,EAAE,gBAAgB,mBAAmB,YAAY,WAAW,KAAK;AAAA,MACnE;AACA,YAAM,QAAQ,IAAI,UAAU;AAC5B,YAAM,cAAc,IAAI;AACxB,mBAAa,KAAK;AAAA,QAChB,MAAM;AAAA,QACN,QAAQ;AAAA,QACR,QAAQ;AAAA,UACN,GAAG,IAAI;AAAA,UACP,aAAa,IAAI,UAAU;AAAA,UAC3B,OAAO,IAAI,UAAU;AAAA,UACrB,QAAQ,IAAI,UAAU;AAAA,UACtB,YAAY,IAAI,UAAU;AAAA,UAC1B;AAAA,UACA,SAAS,IAAI;AAAA,QACf;AAAA,MACF,CAAC;AACD,UAAI,CAAC,aAAa;AAChB,gBAAQ;AAAA,UACN,IAAI,UACA,kBAAkB,IAAI,CAAC,mBAAmB,iBAAiB,2CAC3D,mBAAmB,IAAI,UAAU,IAAI,QAAQ,CAAC,CAAC,qBAAgB,cAAc,YAAY,IAAI,UAAU,OAAO,QAAQ,CAAC,CAAC,MAAM,IAAI,UAAU,aAAa,KAAK,QAAQ,CAAC,CAAC,SAAS,IAAI,UAAU,IAAI,QAAQ,CAAC,CAAC,KAAK,IAAI,UAAU,KAAK,QAAQ,CAAC,CAAC;AAAA,QACrP;AAAA,MACF;AAQA,YAAM,UAAU,QAAQ,oBAAoB,SACxC;AAAA,QACE,IAAI;AAAA,QACJ,IAAI,uBAAuB,IAAI;AAAA,QAC/B;AAAA,QACA,QAAQ;AAAA,QACR,EAAE,WAAW,QAAQ,qBAAqB,YAAY,WAAW,KAAK;AAAA,MACxE,IACA,CAAC;AACL,YAAM,YAAY,QAAQ,OAAO,CAAC,MAAM,EAAE,SAAS;AACnD,YAAM,UAAU,UAAU,WAAW;AACrC,mBAAa,KAAK;AAAA,QAChB,MAAM;AAAA,QACN,QAAQ;AAAA,QACR,QAAQ;AAAA,UACN,SAAS,QAAQ,sBAAsB,CAAC;AAAA,UACxC,aAAa,QAAQ,IAAI,CAAC,OAAO;AAAA,YAC/B,WAAW,EAAE;AAAA,YACb,OAAO,EAAE,UAAU;AAAA,YACnB,QAAQ,EAAE,UAAU;AAAA,YACpB,WAAW,EAAE;AAAA,YACb,GAAG,EAAE;AAAA,YACL,WAAW,EAAE;AAAA,UACf,EAAE;AAAA,QACJ;AAAA,MACF,CAAC;AACD,UAAI,CAAC,SAAS;AACZ,gBAAQ;AAAA,UACN,oCAAoC,UAAU,IAAI,CAAC,MAAM,GAAG,EAAE,SAAS,WAAW,EAAE,UAAU,IAAI,QAAQ,CAAC,CAAC,OAAO,EAAE,SAAS,EAAE,EAAE,KAAK,IAAI,CAAC;AAAA,QAC9I;AAAA,MACF;AAGA,YAAM,aACJ,QAAQ,cAAc,UACtB,IAAI,KAAK,YAAY,IAAI,KAAK,YAAY,QAAQ;AACpD,mBAAa,KAAK;AAAA,QAChB,MAAM;AAAA,QACN,QAAQ;AAAA,QACR,QAAQ;AAAA,UACN,cAAc,IAAI,KAAK;AAAA,UACvB,aAAa,IAAI,KAAK;AAAA,UACtB,WAAW,QAAQ;AAAA,QACrB;AAAA,MACF,CAAC;AACD,UAAI,CAAC,YAAY;AACf,gBAAQ;AAAA,UACN,UAAU,IAAI,KAAK,YAAY,IAAI,KAAK,UAAU,QAAQ,CAAC,CAAC,aAAa,QAAQ,SAAS;AAAA,QAC5F;AAAA,MACF;AAGA,YAAM,kBAAkB,QAAQ,iBAC5B,aAAa,IAAI,oBAAoB,QAAQ,cAAc,IAC3D,EAAE,QAAQ,MAAM,UAAU,CAAC,EAAE;AACjC,mBAAa,KAAK;AAAA,QAChB,MAAM;AAAA,QACN,QAAQ,gBAAgB;AAAA,QACxB,QAAQ;AAAA,UACN,UAAU,gBAAgB,SAAS;AAAA,UACnC,QAAQ,gBAAgB,SAAS,MAAM,GAAG,CAAC;AAAA,QAC7C;AAAA,MACF,CAAC;AACD,UAAI,CAAC,gBAAgB,QAAQ;AAC3B,gBAAQ,KAAK,0BAA0B,gBAAgB,SAAS,MAAM,YAAY;AAAA,MACpF;AAGA,UAAI,sBAAkD;AACtD,UAAI,QAAQ,cAAc,QAAQ,WAAW,UAAU,IAAI;AACzD,8BAAsB,oBAAoB,EAAE,MAAM,QAAQ,WAAW,CAAC;AAAA,MACxE;AAIA,YAAM,kBAAkB;AACxB,YAAM,kBAAkB,qBAAqB,YAAY,CAAC,GAAG;AAAA,QAC3D,CAAC,MAAM,EAAE,YAAY;AAAA,MACvB;AACA,YAAM,oBACJ,CAAC,uBACD,CAAC,iBACA,eAAe,WAAW,KAAK,oBAAoB,YAAY;AAClE,mBAAa,KAAK;AAAA,QAChB,MAAM;AAAA,QACN,QAAQ;AAAA,QACR,QAAQ,EAAE,QAAQ,qBAAqB,oBAAoB,eAAe,OAAO;AAAA,MACnF,CAAC;AACD,UAAI,CAAC,mBAAmB;AACtB,gBAAQ;AAAA,UACN,mCAAmC,eAAe,MAAM,sCAAsC,oBAAqB,OAAO;AAAA,QAC5H;AAAA,MACF;AAGA,UAAI,eAAoC;AACxC,UAAI,QAAQ,cAAc,QAAQ,WAAW,UAAU,IAAI;AACzD,uBAAe,YAAY,QAAQ,YAAY,CAAC,CAAC;AAAA,MACnD;AAEA,YAAM,eAAe,cAAc,UAAU,CAAC,GAAG,OAAO,CAAC,MAAM,EAAE,aAAa,OAAO;AACrF,YAAM,aAAa,YAAY,WAAW;AAC1C,mBAAa,KAAK;AAAA,QAChB,MAAM;AAAA,QACN,QAAQ;AAAA,QACR,QAAQ,EAAE,aAAa,cAAc,OAAO,UAAU,GAAG,aAAa,YAAY,OAAO;AAAA,MAC3F,CAAC;AACD,UAAI,CAAC,YAAY;AACf,gBAAQ,KAAK,wBAAwB,YAAY,MAAM,EAAE;AAAA,MAC3D;AAGA,YAAM,YAAY,aAAa,MAAM,CAAC,MAAM,EAAE,MAAM;AACpD,YAAM,WAAW,YAAY,SAAS;AAEtC,aAAO;AAAA,QACL;AAAA,QACA,SAAS,QAAQ,SAAS,IAAI,UAAU,CAAC,kBAAkB;AAAA,QAC3D,mBAAmB;AAAA,QACnB;AAAA,MACF;AAAA,IACF;AAAA,EACF;AACF;AAEA,SAAS,aACP,WACA,SAC8E;AAC9E,QAAM,WAA0D,CAAC;AACjE,aAAW,CAAC,SAAS,QAAQ,KAAK,WAAW;AAC3C,UAAM,OAAO,YAAY,QAAQ;AACjC,QAAI,SAAS,OAAW;AACxB,eAAW,UAAU,SAAS;AAC5B,YAAM,UAAU,mBAAmB,MAAM,CAAC,GAAG,MAAM;AACnD,UAAI,CAAC,QAAQ,QAAQ;AACnB,iBAAS,KAAK,EAAE,YAAY,OAAO,IAAI,QAAQ,QAAQ,UAAU,wBAAwB,CAAC;AAAA,MAC5F;AAAA,IACF;AAAA,EACF;AACA,SAAO,EAAE,QAAQ,SAAS,WAAW,GAAG,SAAS;AACnD;AAEA,SAAS,YAAY,UAAuC;AAC1D,MAAI,OAAO,aAAa,SAAU,QAAO;AACzC,MAAI,YAAY,OAAO,aAAa,UAAU;AAC5C,UAAM,MAAM;AACZ,QAAI,OAAO,IAAI,SAAS,SAAU,QAAO,IAAI;AAC7C,QAAI,OAAO,IAAI,WAAW,SAAU,QAAO,IAAI;AAC/C,QAAI,OAAO,IAAI,YAAY,SAAU,QAAO,IAAI;AAAA,EAClD;AACA,SAAO;AACT;;;ACvQA,eAAsB,QACpB,MAC+C;AAC/C,SAAO,YAAY,IAAI;AACzB;","names":[]}
|