@tangle-network/agent-eval 0.66.0 → 0.68.0

package/CHANGELOG.md CHANGED
@@ -4,6 +4,34 @@ All notable changes to `@tangle-network/agent-eval` and its sibling `agent-eval-
 
  ---
 
+ ## [0.68.0] — 2026-05-30 — structured AgentProfile (the self-improvement surface stops being an opaque blob)
+
+ The optimizable surface was an opaque string addendum, so the loop could only mutate (and the dashboard only diff) an unstructured blob — you couldn't see *what kind* of improvement a candidate made. This adds a **sectioned `AgentProfile`** primitive (mirrored on Harvey LAB's system-prompt structure) so the surface has named, separately-addressable zones the loop targets one at a time.
+
+ ### Added
+
+ - **`profile` namespace** (`import { profile } from '@tangle-network/agent-eval'`):
+   - `AgentProfile { role, environment, toolConventions, skills: ProfileSkill[], domain: AgentProfileSection[] }` — the structured surface. `environment` is a first-class section (the sandbox contract: workspace root, read-only documents, output dir, skills dir), matching how an agentic harness actually addresses its sandbox.
+   - `renderProfile(p)` emits the system prompt in fixed order: role → `## Environment` → `## Tool conventions` → `## Skills` → `## Domain guidance`.
+   - `baselineProfile` / `prodProfile(baseline, shipped)` — baseline = empty domain + stock skills; prod = baseline + gate-certified domain sections.
+   - `applyDomainPatch(p, sectionId, body)` — **section-scoped** edit so the improvement loop optimizes ONE evolvable section, not the whole blob; `profileToSurface(p)` bridges to the existing string `MutableSurface`.
+   - Namespaced as `profile.*` to avoid clashing with the benchmark-cell `AgentProfile` already exported from `./agent-profile`.
+
+ Additive — does not touch `runImprovementLoop` or the string surface. 15 tests (zone order; only evolvable sections change hash under `applyDomainPatch`; baseline vs prod differ only in domain/skills; Environment present + non-empty). Full suite (1639) green. First consumers: the TaxCalcBench + Harvey LAB benchmark adapters (tax-agent / legal-agent) that score our agent's profile against public leaderboards.
+
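To make the 0.68.0 entry concrete, here is a minimal sketch of a sectioned profile, its fixed-order rendering, and a section-scoped patch. Only the field names and the zone order come from the changelog. The field types, the `SketchProfile` / `renderSketch` / `patchDomainSketch` names, and the per-section heading format are assumptions for illustration, not the package's `profile.*` API.

```typescript
// Hypothetical shapes: field names follow the changelog entry, field types are assumed.
interface SketchSection { id: string; title: string; body: string }
interface SketchProfile {
  role: string;
  environment: string;
  toolConventions: string;
  skills: { name: string; description: string }[];
  domain: SketchSection[];
}

// Render in the fixed order the changelog lists:
// role, Environment, Tool conventions, Skills, Domain guidance.
function renderSketch(p: SketchProfile): string {
  const skills = p.skills.map((s) => `- ${s.name}: ${s.description}`).join("\n");
  const domain = p.domain.map((d) => `### ${d.title}\n${d.body}`).join("\n\n");
  return [
    p.role,
    `## Environment\n${p.environment}`,
    `## Tool conventions\n${p.toolConventions}`,
    `## Skills\n${skills}`,
    `## Domain guidance\n${domain}`,
  ].join("\n\n");
}

// Section-scoped edit: returns a copy in which exactly one domain section changed.
function patchDomainSketch(p: SketchProfile, sectionId: string, body: string): SketchProfile {
  if (!p.domain.some((d) => d.id === sectionId)) {
    throw new Error(`unknown domain section: ${sectionId}`);
  }
  return { ...p, domain: p.domain.map((d) => (d.id === sectionId ? { ...d, body } : d)) };
}
```

Because the patch touches one section and leaves the other zones untouched, a diff between two candidates can be attributed to a named section rather than to an unstructured string.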
+ ## [0.67.0] — 2026-05-30 — the promotion gate is statistically trustworthy (no more shipping noise)
+
+ An adversarial review of a real "ship +4.0 lift" decision found it was a **triple false positive**: the driver's candidate lost on train, so the winner was the baseline (empty diff); the loop re-scored the baseline against ITSELF on the holdout and read run-to-run model noise (91 vs 95) as a "+4 lift"; and a point-estimate gate (`delta >= 0.03` on a 0-100 scale, `reps:1`) shipped it — while the reward-hacking gate was blind to a −30 regression on a safety dimension hiding under the +4 net. The promotion gate could not tell a real improvement from noise or from a Goodhart trade.
+
+ ### Fixed / Added
+
+ - **No-op guard** (`runImprovementLoop`) — when the winner is byte-identical to the baseline (no candidate beat the training baseline, empty diff), the loop now forces `hold` and skips the meaningless baseline-vs-itself holdout pass, instead of shipping the noise delta.
+ - **Statistical held-out gate** — `defaultProductionGate`'s held-out check is now a **paired bootstrap CI**, not a point estimate. It pairs candidate vs baseline holdout cells by **full `cellId` (`scenario:rep`)** — never averaging reps away — and ships only when the CI lower bound clears `deltaThreshold` (default 0 ⇒ confidently positive). Below `minProductiveRuns` (default 3) paired observations it HOLDS with `few_runs` rather than reading a degenerate interval. (New module `src/campaign/gates/statistical-heldout.ts`; reuses `pairedBootstrap` from `src/statistics.ts`.)
+ - **Per-dimension regression guard (anti-Goodhart)** — `criticalDimensions` + `regressionTolerance` on `DefaultProductionGateOptions`. The gate HOLDS if any guarded dimension's paired-delta CI lower bound falls below −tolerance, even when the net composite rose. Tolerance auto-scales (0.05 on [0,1], 5 on 0-100) so a default expressed for one scale isn't a silent no-op on the other.
+ - **Exports** `pairHoldout`, `heldoutSignificance`, `dimensionRegressions`, `detectScale` from `/campaign`.
+
+ This collapses the duplicated gate tech-debt (a rigorous `src/held-out-gate.ts` existed but the loop wired the weak adapter) onto the shared `pairedBootstrap` statistics. 12 new regression tests, including the exact noisy-same-mean false positive and the composite-up/dimension-down Goodhart trade. Full suite (1624) green. The remaining path to a *proven* self-improvement (headroom corpus + Goodhart-resistant measurement, driver effectiveness, inter-cycle compounding) is tracked separately.
+
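The three guards in the 0.67.0 entry can be sketched as a single decision function. This is an illustrative reimplementation under stated assumptions, not the package's code. It uses a percentile bootstrap on the mean paired delta with a small seeded PRNG. Reason strings other than `few_runs`, the order of the checks, and the names `decide`, `lowerBound`, and `PairedScores` are all hypothetical.

```typescript
// Sketch only: the real gate lives in the package; everything here is assumed.
type Verdict = { decision: "ship" | "hold"; reason: string };
interface PairedScores { before: number[]; after: number[] }

// Small seeded PRNG (mulberry32) so the verdict is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile-bootstrap lower bound on the mean paired delta (after - before).
function lowerBound(p: PairedScores, resamples = 2000, confidence = 0.95, seed = 42): number {
  if (p.before.length !== p.after.length || p.before.length === 0) {
    throw new Error("paired arrays must be non-empty and equal length");
  }
  const deltas = p.after.map((a, i) => a - (p.before[i] as number));
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < deltas.length; i++) {
      sum += deltas[Math.floor(rand() * deltas.length)] as number;
    }
    means.push(sum / deltas.length);
  }
  means.sort((x, y) => x - y);
  return means[Math.floor(((1 - confidence) / 2) * resamples)] as number;
}

function decide(opts: {
  winner: string;
  baseline: string;
  composite: PairedScores;
  critical?: { dimension: string; scores: PairedScores; tolerance: number }[];
  deltaThreshold?: number;
  minProductiveRuns?: number;
}): Verdict {
  // 1. No-op guard: a winner identical to the baseline has nothing to ship.
  if (opts.winner === opts.baseline) return { decision: "hold", reason: "no_op" };
  // 2. Too few paired observations to trust any interval.
  if (opts.composite.before.length < (opts.minProductiveRuns ?? 3)) {
    return { decision: "hold", reason: "few_runs" };
  }
  // 3. Anti-Goodhart: hold when a guarded dimension may have regressed.
  for (const c of opts.critical ?? []) {
    if (lowerBound(c.scores) < -c.tolerance) {
      return { decision: "hold", reason: `regressed:${c.dimension}` };
    }
  }
  // 4. Ship only when the composite CI lower bound clears the threshold.
  return lowerBound(opts.composite) > (opts.deltaThreshold ?? 0)
    ? { decision: "ship", reason: "significant" }
    : { decision: "hold", reason: "not_significant" };
}
```

With same-mean noisy scores the interval straddles zero and the sketch holds; with a consistent per-cell lift every resampled mean is positive and it ships, unless a guarded dimension's interval drops below its tolerance.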
  ## [0.66.0] — 2026-05-30 — the improvement loop can no longer hang silently or ingest to the wrong URL
 
  ### Fixed
@@ -1,9 +1,10 @@
  import { a as RunCampaignOptions, C as CampaignStorage } from '../run-improvement-loop-BKpM5T4t.js';
  export { d as GepaDriverConstraints, G as GepaDriverOptions, O as OpenAutoPrOptions, e as OpenAutoPrResult, b as RunImprovementLoopOptions, R as RunImprovementLoopResult, h as RunOptimizationOptions, j as RunOptimizationResult, k as countSentenceEdits, l as defaultRenderDiff, m as extractH2Sections, f as fsCampaignStorage, g as gepaDriver, i as inMemoryCampaignStorage, o as openAutoPr, r as runCampaign, c as runImprovementLoop, n as runOptimization, s as surfaceHash } from '../run-improvement-loop-BKpM5T4t.js';
- export { B as BuildLoopProvenanceArgs, D as DefaultProductionGateOptions, a as EmitLoopProvenanceArgs, b as EmitLoopProvenanceResult, E as EvolutionaryDriverOptions, H as HeldOutGateOptions, f as LoopProvenanceBackend, g as LoopProvenanceCandidate, L as LoopProvenanceRecord, R as RunEvalOptions, i as buildLoopProvenanceRecord, c as composeGate, d as defaultProductionGate, j as emitLoopProvenance, e as evolutionaryDriver, h as heldOutGate, l as loopProvenanceSpans, p as provenanceRecordPath, k as provenanceSpansPath, r as runEval, s as surfaceContentHash } from '../provenance-BZUFC1_D.js';
+ export { B as BuildLoopProvenanceArgs, D as DefaultProductionGateOptions, a as EmitLoopProvenanceArgs, b as EmitLoopProvenanceResult, E as EvolutionaryDriverOptions, H as HeldOutGateOptions, f as LoopProvenanceBackend, g as LoopProvenanceCandidate, L as LoopProvenanceRecord, R as RunEvalOptions, i as buildLoopProvenanceRecord, c as composeGate, d as defaultProductionGate, j as emitLoopProvenance, e as evolutionaryDriver, h as heldOutGate, l as loopProvenanceSpans, p as provenanceRecordPath, k as provenanceSpansPath, r as runEval, s as surfaceContentHash } from '../provenance-CChUqexv.js';
  import { L as LlmClientOptions } from '../llm-client-DbjLfz-K.js';
- import { I as ImprovementDriver, L as LabeledScenarioStore, q as LabeledScenarioWrite, r as LabeledScenarioSampleArgs, s as LabeledScenarioRecord, t as LabelTrust, S as Scenario, M as MutableSurface, b as DispatchContext, a as JudgeConfig, u as LabeledScenarioSource, f as CampaignResult, h as CodeSurface } from '../types-c2R2kfmv.js';
- export { C as CampaignAggregates, c as CampaignArtifactWriter, d as CampaignCellResult, e as CampaignCostMeter, v as CampaignTokenUsage, g as CampaignTraceWriter, D as DispatchFn, G as Gate, i as GateContext, j as GateDecision, k as GateResult, l as GenerationCandidate, m as GenerationRecord, w as JudgeAggregate, n as JudgeDimension, J as JudgeScore, o as Mutator, O as OptimizerConfig, P as ParetoParent, x as ProposeContext, y as ProposedCandidate, R as RedactionStatus, z as ScenarioAggregate, p as SessionScript, T as TraceSpan, A as isProposedCandidate, B as labelTrustRank } from '../types-c2R2kfmv.js';
+ import { I as ImprovementDriver, J as JudgeScore, L as LabeledScenarioStore, q as LabeledScenarioWrite, r as LabeledScenarioSampleArgs, s as LabeledScenarioRecord, t as LabelTrust, S as Scenario, M as MutableSurface, b as DispatchContext, a as JudgeConfig, u as LabeledScenarioSource, f as CampaignResult, h as CodeSurface } from '../types-c2R2kfmv.js';
+ export { C as CampaignAggregates, c as CampaignArtifactWriter, d as CampaignCellResult, e as CampaignCostMeter, v as CampaignTokenUsage, g as CampaignTraceWriter, D as DispatchFn, G as Gate, i as GateContext, j as GateDecision, k as GateResult, l as GenerationCandidate, m as GenerationRecord, w as JudgeAggregate, n as JudgeDimension, o as Mutator, O as OptimizerConfig, P as ParetoParent, x as ProposeContext, y as ProposedCandidate, R as RedactionStatus, z as ScenarioAggregate, p as SessionScript, T as TraceSpan, A as isProposedCandidate, B as labelTrustRank } from '../types-c2R2kfmv.js';
+ import { a as PairedBootstrapResult } from '../statistics-B7yCbi9i.js';
  import { A as AgentProfile, B as BackendIntegrityReport } from '../agent-profile-DzcPHR1Z.js';
  import { A as AgentEvalError } from '../errors-Dwqw-T_m.js';
  import { b as RunSplitTag, R as RunRecord } from '../run-record-BgTFzO2r.js';
@@ -16,6 +17,8 @@ import '../summary-report-ByiOUrHj.js';
  import '../failure-cluster-CL7IVgkJ.js';
  import '../judge-calibration-DilmB3Ml.js';
  import '../raw-provider-sink-C46HDghv.js';
+ import '../types-Croy5h7V.js';
+ import '@tangle-network/tcloud';
 
  /**
   * @experimental
@@ -164,6 +167,106 @@ declare class SkillPatchParseError extends Error {
  }
  declare function parseSkillPatchResponse(raw: string, maxPatches: number, editBudget: number): SkillPatch[];
 
+ /**
+  * @experimental
+  *
+  * Statistical held-out promotion machinery — the trustworthy core the
+  * point-estimate `heldout-delta` gate lacked.
+  *
+  * The shipped false positive it prevents: a winner re-scored against the
+  * baseline on the holdout read run-to-run model NOISE (e.g. 91 vs 95) as a
+  * "+4 lift" and shipped, because the gate compared point estimates with no
+  * confidence interval. Here we pair candidate vs baseline holdout observations
+  * and bootstrap a CI on the paired delta — a candidate ships only when the CI
+  * lower bound clears the effect-size threshold (the gain is real at the
+  * confidence level, not noise), and is blocked when a critical dimension
+  * (e.g. `hallucination_free` for a legal agent) significantly regresses even if
+  * the net composite rose (anti-Goodhart).
+  *
+  * Two traps this module is built around (both produce a NEW false positive if
+  * gotten wrong):
+  * 1. PAIRING GRANULARITY — pairs by FULL `cellId` (`scenario:rep`), never by
+  * `scenarioId` (which averages reps away and destroys the within-pair
+  * variance reduction that makes a paired bootstrap tighter than unpaired).
+  * One paired observation per cell ⇒ reps multiply n.
+  * 2. SCALE — a judge may emit composites/dimensions on [0,1] or 0-100. The
+  * threshold + tolerance are interpreted in the judge's NATIVE scale; the
+  * per-dimension tolerance auto-scales off the observed baseline magnitudes
+  * so `-0.10` on [0,1] doesn't silently become a no-op on a 0-100 dimension.
+  */
+
+ interface PairedHoldout {
+     /** Baseline scalar per paired cell (same order as `after`/`cellIds`). */
+     before: number[];
+     /** Candidate scalar per paired cell. */
+     after: number[];
+     /** The full cellIds (`scenario:rep`) that paired, in order. */
+     cellIds: string[];
+ }
+ /**
+  * Pair candidate vs baseline holdout observations by FULL cellId. `select`
+  * pulls the scalar from a cell's judge reports (composite, or a named
+  * dimension); a cell contributes the mean of `select` across its judges. Cells
+  * whose scenario is not in `scenarioIds`, or where `select` is undefined for
+  * every judge on either side, are skipped on BOTH sides so the arrays stay
+  * paired. Throws when the two maps disagree on which holdout cells exist — a
+  * load-bearing invariant: the baseline + winner holdout campaigns run the same
+  * scenarios with the same seed base, so their cellIds MUST align; a mismatch
+  * means a silent pairing bug, not a soft fallback.
+  */
+ declare function pairHoldout(candidate: Map<string, Record<string, JudgeScore>>, baseline: Map<string, Record<string, JudgeScore>>, scenarioIds: Set<string>, select: (s: JudgeScore) => number | undefined): PairedHoldout;
+ interface HeldoutSignificance {
+     paired: PairedHoldout;
+     bootstrap: PairedBootstrapResult;
+     /** n paired observations. */
+     n: number;
+     /** True iff n >= minProductiveRuns AND the CI lower bound clears the threshold. */
+     significant: boolean;
+     /** Set when n < minProductiveRuns — too little evidence to claim significance. */
+     fewRuns: boolean;
+ }
+ interface HeldoutSignificanceOptions {
+     deltaThreshold?: number;
+     minProductiveRuns?: number;
+     confidence?: number;
+     resamples?: number;
+     /** Fixed by default for a deterministic, reproducible gate verdict. */
+     seed?: number;
+     statistic?: 'mean' | 'median';
+ }
+ /** Significance of the held-out composite lift: ship only when the paired
+  * bootstrap CI lower bound on (candidate − baseline) exceeds `deltaThreshold`
+  * (default 0 ⇒ "confidently positive"). Below `minProductiveRuns` paired
+  * observations there is not enough evidence to claim significance → not
+  * significant (`fewRuns`). Interpret `deltaThreshold` in the judge's native
+  * composite scale. */
+ declare function heldoutSignificance(paired: PairedHoldout, opts?: HeldoutSignificanceOptions): HeldoutSignificance;
+ interface DimensionRegression {
+     dimension: string;
+     bootstrap: PairedBootstrapResult;
+     /** True iff the CI lower bound on (candidate − baseline) is below −tolerance:
+      * the candidate may have regressed this dimension by more than tolerance. */
+     regressed: boolean;
+     tolerance: number;
+     n: number;
+ }
+ /** Detect the native scale of a set of scores: 0-100 when any magnitude clears
+  * 1.5, else [0,1]. Used to auto-scale the regression tolerance so a default
+  * expressed for [0,1] is not silently a no-op on a 0-100 dimension. */
+ declare function detectScale(values: number[]): 1 | 100;
+ /** Per-critical-dimension regression guard. For each dimension, pair the
+  * candidate vs baseline values by full cellId and bootstrap the paired delta;
+  * a dimension is "regressed" when the CI lower bound < −tolerance (conservative
+  * — blocks if the credible worst case exceeds tolerance, which is the right
+  * posture for safety dimensions like `hallucination_free`). When `tolerance`
+  * is omitted it auto-scales: 0.05 on [0,1], 5 on 0-100. */
+ declare function dimensionRegressions(candidate: Map<string, Record<string, JudgeScore>>, baseline: Map<string, Record<string, JudgeScore>>, scenarioIds: Set<string>, criticalDimensions: string[], opts?: {
+     tolerance?: number;
+     confidence?: number;
+     resamples?: number;
+     seed?: number;
+ }): DimensionRegression[];
+
  /**
   * @experimental
   *
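The pairing and scale rules documented in the hunk above can be illustrated with a simplified sketch. It assumes one scalar per cell instead of the judge-score maps and `select` callback that the real `pairHoldout` takes, and it reads "clears 1.5" as a strict `>`. The names `pairByCellId`, `detectScaleSketch`, and `autoTolerance` are hypothetical.

```typescript
interface PairedSketch { before: number[]; after: number[]; cellIds: string[] }

// Pair by full cellId ("scenario:rep"); reps stay separate observations.
// Throws on any mismatch rather than silently dropping unpaired cells.
function pairByCellId(candidate: Map<string, number>, baseline: Map<string, number>): PairedSketch {
  if (candidate.size !== baseline.size) {
    throw new Error("holdout cellIds do not align");
  }
  const cellIds = Array.from(baseline.keys()).sort();
  const before: number[] = [];
  const after: number[] = [];
  for (const id of cellIds) {
    const b = baseline.get(id);
    const c = candidate.get(id);
    if (b === undefined || c === undefined) {
      throw new Error(`unpaired holdout cell: ${id}`);
    }
    before.push(b);
    after.push(c);
  }
  return { before, after, cellIds };
}

// 0-100 when any magnitude exceeds 1.5, otherwise [0,1].
function detectScaleSketch(values: number[]): 1 | 100 {
  return values.some((v) => Math.abs(v) > 1.5) ? 100 : 1;
}

// Default regression tolerance in the scores' native scale: 0.05 or 5.
function autoTolerance(baselineValues: number[]): number {
  return detectScaleSketch(baselineValues) === 100 ? 5 : 0.05;
}
```

Keying on the full cellId means two reps of the same scenario yield two paired deltas, so adding reps increases n instead of being averaged into one observation.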
@@ -648,4 +751,4 @@ declare function gitWorktreeAdapter(opts: GitWorktreeAdapterOptions): WorktreeAd
   * as a ref under the adapter's worktree dir. */
  declare function resolveWorktreePath(surface: CodeSurface, worktreeDir?: string): string;
 
- export { type AcceptedEdit, type ApplySkillPatchResult, type CampaignBreakdown, CampaignResult, CampaignStorage, CodeSurface, type CompareDriversOptions, DispatchContext, type DriverComparison, type DriverEntry, type DriverPairwise, type DriverScore, FsLabeledScenarioStore, type FsLabeledScenarioStoreOptions, type GitWorktreeAdapterOptions, ImprovementDriver, JudgeConfig, LabelTrust, LabeledScenarioRecord, LabeledScenarioSampleArgs, LabeledScenarioSource, LabeledScenarioStore, LabeledScenarioStoreError, LabeledScenarioWrite, MutableSurface, type OptimizerEntryConfig, type ProfileDispatchFn, ProfileMatrixError, type ProfileSummary, type ProposePatchesArgs, type RejectedEdit, RunCampaignOptions, type RunProfileMatrixOptions, type RunProfileMatrixResult, type RunSkillOptOptions, type RunSkillOptResult, Scenario, type ScenarioRollup, type SkillOptDriver, type SkillOptDriverOptions, type SkillOptEpochRecord, type SkillOptEvidence, type SkillPatch, type SkillPatchOp, SkillPatchParseError, type SkillPatchRejection, type Worktree, type WorktreeAdapter, WorktreeAdapterError, applySkillPatch, campaignBreakdown, campaignMeanComposite, compareDrivers, gepaParetoEntry, gepaReflectionEntry, gitWorktreeAdapter, parseSkillPatchResponse, patchEditCount, resolveWorktreePath, runProfileMatrix, runSkillOpt, skillOptDriver, skillOptEntry };
+ export { type AcceptedEdit, type ApplySkillPatchResult, type CampaignBreakdown, CampaignResult, CampaignStorage, CodeSurface, type CompareDriversOptions, type DimensionRegression, DispatchContext, type DriverComparison, type DriverEntry, type DriverPairwise, type DriverScore, FsLabeledScenarioStore, type FsLabeledScenarioStoreOptions, type GitWorktreeAdapterOptions, type HeldoutSignificance, type HeldoutSignificanceOptions, ImprovementDriver, JudgeConfig, JudgeScore, LabelTrust, LabeledScenarioRecord, LabeledScenarioSampleArgs, LabeledScenarioSource, LabeledScenarioStore, LabeledScenarioStoreError, LabeledScenarioWrite, MutableSurface, type OptimizerEntryConfig, type PairedHoldout, type ProfileDispatchFn, ProfileMatrixError, type ProfileSummary, type ProposePatchesArgs, type RejectedEdit, RunCampaignOptions, type RunProfileMatrixOptions, type RunProfileMatrixResult, type RunSkillOptOptions, type RunSkillOptResult, Scenario, type ScenarioRollup, type SkillOptDriver, type SkillOptDriverOptions, type SkillOptEpochRecord, type SkillOptEvidence, type SkillPatch, type SkillPatchOp, SkillPatchParseError, type SkillPatchRejection, type Worktree, type WorktreeAdapter, WorktreeAdapterError, applySkillPatch, campaignBreakdown, campaignMeanComposite, compareDrivers, detectScale, dimensionRegressions, gepaParetoEntry, gepaReflectionEntry, gitWorktreeAdapter, heldoutSignificance, pairHoldout, parseSkillPatchResponse, patchEditCount, resolveWorktreePath, runProfileMatrix, runSkillOpt, skillOptDriver, skillOptEntry };
@@ -1,33 +1,37 @@
  import {
-   buildLoopProvenanceRecord,
    composeGate,
    defaultProductionGate,
-   emitLoopProvenance,
+   detectScale,
+   dimensionRegressions,
    evolutionaryDriver,
-   loopProvenanceSpans,
-   provenanceRecordPath,
-   provenanceSpansPath,
-   runEval,
-   surfaceContentHash
- } from "../chunk-RDK3P4JE.js";
+   heldoutSignificance,
+   pairHoldout,
+   runEval
+ } from "../chunk-E24XD7A2.js";
  import {
    agentProfileHash
  } from "../chunk-PQV2TKC3.js";
  import {
+   buildLoopProvenanceRecord,
    campaignBreakdown,
    campaignMeanComposite,
    countSentenceEdits,
    defaultRenderDiff,
+   emitLoopProvenance,
    extractH2Sections,
    gepaDriver,
    heldOutGate,
    isProposedCandidate,
    labelTrustRank,
+   loopProvenanceSpans,
    openAutoPr,
+   provenanceRecordPath,
+   provenanceSpansPath,
    runImprovementLoop,
    runOptimization,
+   surfaceContentHash,
    surfaceHash
- } from "../chunk-Q56RRLEC.js";
+ } from "../chunk-JFGZPUMU.js";
  import {
    assertRealBackend,
    fsCampaignStorage,
@@ -1091,6 +1095,8 @@ export {
    countSentenceEdits,
    defaultProductionGate,
    defaultRenderDiff,
+   detectScale,
+   dimensionRegressions,
    emitLoopProvenance,
    evolutionaryDriver,
    extractH2Sections,
@@ -1100,11 +1106,13 @@ export {
    gepaReflectionEntry,
    gitWorktreeAdapter,
    heldOutGate,
+   heldoutSignificance,
    inMemoryCampaignStorage,
    isProposedCandidate,
    labelTrustRank,
    loopProvenanceSpans,
    openAutoPr,
+   pairHoldout,
    parseSkillPatchResponse,
    patchEditCount,
    provenanceRecordPath,