slice-tournament-zoo 0.7.2 → 0.9.5

package/README.md CHANGED
@@ -261,6 +261,36 @@ You, the session, become the orchestrator. The command:
261
261
 
262
262
  Every exact decision is made by the CLI, never by the agent's own arithmetic.
263
263
 
264
+ ### Evolve the harness itself (0.9.0, opt-in)
265
+
266
+ STZ can improve **its own harness**, not just the code it produces. The per-slice
267
+ tournament stays exactly as above; a separate, default-off meta-loop evolves the
268
+ harness *genome* (test-author heuristics, specimen strategies, judge rubric,
269
+ selection weights, fan-out, the suite battery) against **held-out, recall-free**
270
+ pilot fitness — a DGM/HarnessX-style archive selected by GRPO advantage with a
271
+ six-gate promotion guard (0.9.5 adds calibrated-verifier gating: a selection
272
+ judge must pass a blind target-task accuracy battery before it may steer a
273
+ promotion, fail-closed).
274
+
275
+ ```text
276
+ /stz:inject slice-01 # adversarially harden the sealed suite (find blind spots)
277
+ /stz:evolve # run the bounded harness-evolution meta-loop (needs harness.enabled)
278
+ ```
279
+
280
+ The flagship is **automated suite sharpening**: a blind-spot bug-class the judge
281
+ finds past a green suite (e.g. the `5abc` malformed-token trap) is mined *once*
282
+ into the test-author's repertoire + the mutation battery, so every future suite is
283
+ born sharper at ~0 marginal cost — instead of re-deriving it per slice. This is
284
+ the empirically-grounded relocation of the shelved 0.8.0 per-slice convergence
285
+ loop (ruled out budget-matched and recall-free; see `docs/ROADMAP.md` and
286
+ `experiments/swebench-pilot/PILOT-RESULTS-{BLIND,JUDGE}.md`). Bridge primitives:
287
+ `inject`, `harness-mine`, `harness-promote-mutator`, `harness-spawn`,
288
+ `harness-fitness`, `harness-select`, `harness-promote`, `harness-status`,
289
+ `judge-stress`, `judge-calibration`. A 0.9.5 authoring gene
290
+ (`waf-playbook-autogen-v0`) lets the test author bake AWS Well-Architected
291
+ playbook edge-cases for contracted behaviour (one-time, never a reward). Every
292
+ kill-switch halts and surfaces; nothing auto-rewrites its own guard.
293
+
264
294
  ## Example commands and workflows
265
295
 
266
296
  ### A whole project (the full pipeline)
@@ -405,3 +435,10 @@ For contributors and anyone going past day-to-day operation:
405
435
  ## License
406
436
 
407
437
  [Apache-2.0](https://github.com/dr-robert-li/slice-tournament-zoo/blob/main/LICENSE).
438
+
439
+ ## Research
440
+
441
+ The full account of what STZ is, the experiments under `experiments/`, the outcomes, and
442
+ the open questions is in **[docs/PAPER.md](docs/PAPER.md)** ("When does a self-improving
443
+ coding harness actually improve competency? A negative result, earned"). The first-person
444
+ build log is in [docs/JOURNAL.md](docs/JOURNAL.md).
@@ -0,0 +1,42 @@
1
+ ---
2
+ name: stz-harness-critic
3
+ description: HarnessX-style Critic for the STZ harness-evolution meta-loop (0.9.0). Validates a candidate harness variant on the HELD-OUT pilot fitness before promotion. Reads the truth suites; blind to which variant authored which output (no genome-authorship bias).
4
+ tools: Read, Bash, Grep, Glob
5
+ model: inherit
6
+ ---
7
+
8
+ You are the **Critic** in the STZ harness-evolution meta-loop (the C in HarnessX's
9
+ Digester→Planner→Evolver→Critic). The Evolver proposed a harness **variant** (one
10
+ gene changed: a test-author heuristic, a specimen strategy, a judge rubric, a
11
+ selection-weight tuple, fan-out, or a battery mutator). Your job is to decide
12
+ whether it genuinely improves the harness — on **held-out, recall-free** fitness,
13
+ not on the training traces.
14
+
15
+ ## Inputs
16
+ - The variant's **per-substrate truth scores** on the recall-free pilots
17
+ (`experiments/{cron,hexcolor,ipv4}-pilot/truth-suite/`), already computed by
18
+ running the variant's tournament on each pilot.
19
+ - The current **incumbent** archive entry (`bridge harness-status`).
20
+
21
+ ## What you check (and how to stay honest)
22
+ 1. **Beats the incumbent at equal-or-lower budget.** A variant that wins only by
23
+ spending more tokens is rejected (the JUDGE pilot's "B overspent and only tied"
24
+ is the cautionary baseline). Use the budget-matched comparison.
25
+ 2. **No regression on any substrate** the incumbent already passed. A variant that
26
+ trades a cron win for a hexcolor loss is not an improvement.
27
+ 3. **Convention axes discounted.** Spec-silent / recall axes (`7`=Sunday,
28
+ leading-zero, whitespace) are reported separately, never folded into the
29
+ primary fitness — they are the contamination the synthetic substrate exists to
30
+ exclude.
31
+ 4. **Symmetric error.** "No variant beats the incumbent → keep the incumbent" is a
32
+ SUCCESS outcome, not a failure. Do not manufacture a winner.
33
+
34
+ ## What you must NOT do
35
+ - Do NOT read which genome authored which output before scoring (authorship bias).
36
+ - Do NOT auto-rewrite anything. You emit a verdict; the bridge `harness-promote`
37
+ six-gate runs the actual promotion (and it also checks hack-clean on the
38
+ variant's own outputs, seal integrity, interface parity, and — 0.9.5 — that the
39
+ selection judge is target-task calibrated, else it fails closed).
40
+
41
+ Return: a per-substrate comparison table, the budget note, and a PROMOTE /
42
+ HOLD verdict with the deciding reason. The decision is earned, not asserted.
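The first two checks above (beat the incumbent at equal-or-lower budget, regress on no substrate) can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the Critic's actual procedure: the `Scorecard` shape, the `tokens` budget field, and the `verdict` helper are hypothetical names introduced only for this example.

```typescript
// Hypothetical shapes: the real archive entry and budget accounting are not shown in this diff.
type Scorecard = {
  perSubstrate: Record<string, number>; // e.g. { cron: 0.9, hexcolor: 1, ipv4: 0.8 }
  tokens: number;                       // assumed budget measure
};

type Verdict = { decision: "PROMOTE" | "HOLD"; reason: string };

function verdict(incumbent: Scorecard, variant: Scorecard): Verdict {
  // Check 1: a win bought with a larger budget does not count.
  if (variant.tokens > incumbent.tokens) {
    return { decision: "HOLD", reason: "variant overspent the incumbent's budget" };
  }
  // Check 2: no regression on any substrate the incumbent was scored on.
  for (const [name, base] of Object.entries(incumbent.perSubstrate)) {
    const got = variant.perSubstrate[name] ?? 0;
    if (got < base) return { decision: "HOLD", reason: `regression on ${name}` };
  }
  // Check 1 again: at least one substrate must strictly improve.
  const improved = Object.entries(incumbent.perSubstrate).some(
    ([name, base]) => (variant.perSubstrate[name] ?? 0) > base,
  );
  // Check 4: a tie keeps the incumbent, and that is a valid outcome.
  return improved
    ? { decision: "PROMOTE", reason: "improves without regression at matched budget" }
    : { decision: "HOLD", reason: "no improvement over the incumbent; keep the incumbent" };
}
```

A tie returns HOLD on purpose, matching the "do not manufacture a winner" rule.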
@@ -0,0 +1,41 @@
1
+ ---
2
+ name: stz-injector
3
+ description: Adversarial bug-injector for STZ suite hardening (0.9.0, SSR-style). Perturbs a WINNING specimen into plausible variants it believes still satisfy the contract, to surface blind spots the sealed suite cannot see. Blind to the truth oracle and the sealed suite source.
4
+ tools: Read, Write, Bash, Grep, Glob
5
+ model: inherit
6
+ ---
7
+
8
+ You are the **bug-injector** in an STZ suite-hardening round. Your adversary is the
9
+ **sealed test suite**, not the contract. Your job: make the suite's blind spots
10
+ visible so the test-author can close them.
11
+
12
+ ## What you may read
13
+ - The slice **contract** (`.stz/40-slices/<id>/manifest.json` + `plan.md`).
14
+ - ONE **winning specimen's source** (the tournament winner's `index.*`).
15
+
16
+ ## What you must NOT read (the blindness contract)
17
+ - The sealed suite source (`.stz/30-tests/held-out/`), its reference, or any
18
+ truth/oracle file. You are blind to the grader. (A silent read defeats the
19
+ whole experiment — every finding in `experiments/*/FINDINGS.md` is recall-free
20
+ precisely because this held.)
21
+
22
+ ## What you produce
23
+ Plausible **mutant variants** of the winner that you BELIEVE a reviewer would
24
+ still accept as contract-satisfying, but that perturb behaviour — drop a
25
+ validation branch, loosen a boundary, accept a malformed token. Write each as a
26
+ candidate mutator spec `{name, find, replace}` (a regex substitution over the
27
+ winner's source) so the bridge can apply it deterministically.
28
+
29
+ The harness runs your candidates through `bridge inject` / `harness-mine`:
30
+ - a mutant the sealed suite **still passes** is a real blind spot (survives);
31
+ - a mutant the suite **kills** is already covered — discard it.
32
+
33
+ ## The hard rule you must respect
34
+ A surviving mutant is only a real defect if it violates a **named contract
35
+ clause**. You do not decide that — the cross-reference adjudicator does. And you
36
+ must **never** propose keying a test to your mutant's exact bytes; the test-author
37
+ writes a GENERAL property over the violated clause's input class (train-on-test is
38
+ forbidden — see `experiments/swebench-pilot/PILOT-RESULTS-JUDGE.md`).
39
+
40
+ Return the candidate mutator specs and a one-line rationale per spec naming the
41
+ contract clause you think each violates. Nothing is sealed by you.
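A mutator spec of the `{name, find, replace}` shape described above can be applied as a plain regex substitution. The sketch below is illustrative only: `applyMutator` is a hypothetical helper, the optional `flags` field mirrors what `harness-mine` reads in `bridge.ts`, and the sample spec and winner source are invented for the example.

```typescript
// Fields follow the {name, find, replace} shape; `flags` is optional.
interface MutatorSpec {
  name: string;
  find: string;    // regex source
  replace: string; // replacement text
  flags?: string;
}

// Returns the mutated source, or null when the pattern does not occur
// (a mutator that matches nothing produces no mutant).
function applyMutator(source: string, spec: MutatorSpec): string | null {
  const flags = spec.flags ?? "";
  // Build a fresh RegExp for test and for replace so a "g" flag's lastIndex cannot leak.
  return new RegExp(spec.find, flags).test(source)
    ? source.replace(new RegExp(spec.find, flags), spec.replace)
    : null;
}

// Invented example: loosen an upper bound so one extra value is accepted.
const loosenBoundary: MutatorSpec = { name: "loosen-upper-bound", find: "<= 255", replace: "<= 256" };
const winner = "const ok = (n: number) => n >= 0 && n <= 255;";
const mutant = applyMutator(winner, loosenBoundary);
```

Keeping the spec as data rather than code is what lets the bridge apply it deterministically, with no agent involved.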
@@ -92,6 +92,43 @@ do not invent requirements the implementers were never given. That produces the
92
92
  mirror failure (failing correct code on an unstated rule), the same class the
93
93
  invariant rules above guard against.
94
94
 
95
+ ## Heuristic gene: `heuristicId` routing (the G1 gene)
96
+
97
+ The slice's harness genome carries a `heuristicId` (passed to you by the
98
+ orchestrator). It selects which negative-case repertoire you draw on. It only
99
+ changes *which edge cases you reach for* — never the contract you test:
100
+
101
+ - **`baseline-v0` / `explicit-examples-v0`** — hand-written example cases over the
102
+ contract clauses (the default).
103
+ - **`property-fuzz-v1`** — prefer property-based generators over the negative
104
+ space (the approach the section above already recommends).
105
+ - **`waf-playbook-autogen-v0`** — additionally consult the **AWS Well-Architected
106
+ playbook bank** (the AWS Well-Architected Agentic AI Lens + the
107
+ `aws-samples/well-architected-skills-and-steering` skills, carried as steering
108
+ text in `.stz/20-standards/`) to sharpen negative/edge cases for the
109
+ reliability-, observability-, and guardrail-shaped behaviours **the contract
110
+ already specifies** — e.g. a contracted retry/back-off clause gets a case
111
+ asserting it actually retries and eventually gives up; a contracted
112
+ idempotency/least-privilege/timeout clause gets a discriminating negative.
113
+
114
+ ### The Goodhart guard for `waf-playbook-autogen-v0` (load-bearing — do not relax)
115
+
116
+ This is **one-time amortized authoring**, not a score to optimise. Two hard rules,
117
+ both required (the survey `experiments/META-RSI-SURVEY.md` §II.3 earned why):
118
+
119
+ 1. **WAF practices only sharpen cases for behaviour the contract already
120
+ specifies. They never add a WAF requirement the contract is silent on.** A
121
+ WAF-flavoured test for an unstated requirement is the exact "stay within the
122
+ contract" violation above, *and* it would smuggle WAF-conformance into the
123
+ sealed suite — which then *is* the fitness signal, making conformance a reward
124
+ by the back door. If the contract does not mention the pillar behaviour, do not
125
+ test it.
126
+ 2. **No WAF-conformance score is ever computed as fitness.** The selection
127
+ `weights` tuple stays `{pass, coverage, kill, codeHealth, clean}`; promotion
128
+ stays on held-out *functional* fitness only. An LLM-judged "how Well-Architected
129
+ does this look" score is appearance-adjacent and must never enter selection
130
+ (that is the conformance-judge failure mode the survey rules out).
131
+
95
132
  ## Reference implementation (proves the suite is satisfiable)
96
133
 
97
134
  Also write a **minimal, correct reference implementation** of the contract into
package/package.json CHANGED
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "name": "slice-tournament-zoo",
3
- "version": "0.7.2",
4
- "description": "STZ: a contract-bounded slice pipeline that implements each slice adversarially via an N-specimen tournament with frozen sealed tests, GRPO-style selection, layered anti-reward-hacking, and a replayable markdown audit trail.",
3
+ "version": "0.9.5",
4
+ "description": "STZ: a contract-bounded slice pipeline that implements each slice adversarially via an N-specimen tournament with frozen sealed tests, GRPO-style selection, layered anti-reward-hacking, a replayable markdown audit trail, and (0.9.0) a bounded harness-level recursive-self-improvement meta-loop that evolves the harness against held-out pilot fitness.",
5
5
  "license": "Apache-2.0",
6
6
  "homepage": "https://github.com/dr-robert-li/slice-tournament-zoo#readme",
7
7
  "repository": {
package/src/bridge.ts CHANGED
@@ -56,14 +56,34 @@ import {
56
56
  runConfigExists,
57
57
  defaultRunConfig,
58
58
  } from "./project.js";
59
- import { detectHacks } from "./hack-detector.js";
59
+ import { detectHacks, suspicionScore } from "./hack-detector.js";
60
60
  import { STZ_VERSION, SCHEMA_VERSION, PACKAGE_NAME } from "./version.js";
61
61
  import { onNoPassers, type EscalationState } from "./escalation.js";
62
62
  import { evalGate, select, pairings } from "./selection.js";
63
63
  import { diffSpecs, renderSpecDiff, isFaithful, unmatchedIntentIds, mismatchedAsBuiltIds, type Spec } from "./specdiff.js";
64
64
  import { seal, verifySeal, amendSeal, heldOutFiles } from "./seal.js";
65
65
  import { renderPressureLog, refinementContext, type CulledSpecimen } from "./pressure.js";
66
- import { fullEval, crossReference } from "./eval-runner.js";
66
+ import { fullEval, crossReference, injectMutants, loadBattery, type MutatorSpec } from "./eval-runner.js";
67
+ import { groupRelativeAdvantage } from "./grpo.js";
68
+ import { checkDiversity, frontierWeights, weightedFitness } from "./diversity.js";
69
+ import { checkParity } from "./harness-hash.js";
70
+ import {
71
+ readArchive,
72
+ appendArchiveEntry,
73
+ bumpChildCount,
74
+ incumbent,
75
+ sampleParents,
76
+ makeArchiveEntry,
77
+ promotionGate,
78
+ batteryDir,
79
+ readReliabilityProfile,
80
+ mergeReliabilityEntry,
81
+ defaultGenome,
82
+ type MetaState,
83
+ } from "./harness.js";
84
+ import { initialInject, onInjectRound, summarizeSurvivors } from "./injector.js";
85
+ import { consistencyScore, bucketOf, calibrationGate } from "./judge-reliability.js";
86
+ import type { ArchiveEntry, HarnessGenome } from "./types.js";
67
87
  import {
68
88
  loadCompat,
69
89
  saveCompat,
@@ -193,12 +213,15 @@ function commitEval(
193
213
  root: string,
194
214
  slice: string,
195
215
  specimen: string,
196
- metrics: { testPassRate: number; coverage: number; mutationScore: number },
216
+ metrics: { testPassRate: number; coverage: number; mutationScore: number; codeHealth?: number },
197
217
  fixtureNames: string[],
198
218
  extra: Record<string, unknown> = {},
199
219
  ): void {
200
220
  const files = readSpecimenFiles(root, slice, specimen);
201
221
  const hackFindings = detectHacks(specimen, files, { fixtureNames });
222
+ // 0.9.0: graded soft-suspicion (a hard-passer can still carry it) + code-health
223
+ // feed the multi-objective reward. codeHealth absent ⇒ neutral best (1).
224
+ const suspicion = suspicionScore(files, { fixtureNames });
202
225
  const result: EvalResult = {
203
226
  specimen,
204
227
  passedGate: metrics.testPassRate >= 1 && hackFindings.length === 0,
@@ -206,6 +229,8 @@ function commitEval(
206
229
  coverage: metrics.coverage,
207
230
  mutationScore: metrics.mutationScore,
208
231
  hackFindings,
232
+ ...(metrics.codeHealth !== undefined ? { codeHealth: metrics.codeHealth } : {}),
233
+ suspicion,
209
234
  };
210
235
  const out = evalResultPath(root, slice, specimen);
211
236
  mkdirSync(join(out, ".."), { recursive: true });
@@ -227,14 +252,16 @@ function recordEval(args: Record<string, string>): void {
227
252
  */
228
253
  function evalCmd(args: Record<string, string>): void {
229
254
  const { root, slice, specimen } = args as { root: string; slice: string; specimen: string };
230
- const e = fullEval(args.sealed!, args.impl!);
255
+ // Promoted bug-class mutators under 60-harness/battery participate in mutation
256
+ // scoring when present (the sharpened battery), so a hardened suite is rewarded.
257
+ const e = fullEval(args.sealed!, args.impl!, existsSync(batteryDir(root)) ? batteryDir(root) : undefined);
231
258
  commitEval(
232
259
  root,
233
260
  slice,
234
261
  specimen,
235
- { testPassRate: e.testPassRate, coverage: e.coverage, mutationScore: e.mutationScore },
262
+ { testPassRate: e.testPassRate, coverage: e.coverage, mutationScore: e.mutationScore, codeHealth: e.codeHealth },
236
263
  args.fixtures ? args.fixtures.split(",") : [],
237
- { measured: { passed: e.passed, total: e.total, mutants: e.mutants, survivors: e.survivors } },
264
+ { measured: { passed: e.passed, total: e.total, mutants: e.mutants, survivors: e.survivors, codeHealth: e.codeHealth } },
238
265
  );
239
266
  }
240
267
 
@@ -1039,6 +1066,327 @@ async function mergeValidate(args: Record<string, string>): Promise<void> {
1039
1066
  print(verdict);
1040
1067
  }
1041
1068
 
1069
+ // ════════════════════════════════════════════════════════════════════════════
1070
+ // 0.9.0 — Harness-level recursive self-improvement (meta-loop) bridge commands.
1071
+ // The bridge owns ALL compute (N6): agents feed numbers in, never do arithmetic.
1072
+ // ════════════════════════════════════════════════════════════════════════════
1073
+
1074
+ /** Read JSON from a file path OR an inline JSON string arg. */
1075
+ function readJSONArg<T>(v: string | undefined): T | null {
1076
+ if (!v || v === "true") return null;
1077
+ if (existsSync(v)) return readJSON<T>(v);
1078
+ try {
1079
+ return JSON.parse(v) as T;
1080
+ } catch {
1081
+ return null;
1082
+ }
1083
+ }
1084
+
1085
+ /**
1086
+ * inject: adversarial suite hardening (SSR-style). Run the mutation battery
1087
+ * (built-ins ∪ promoted) against a winning impl; mutants the SEALED suite still
1088
+ * passes are candidate blind spots. Reports survivors + the bounded-FSM next
1089
+ * action. Promotion of a survivor into a sealed test is a SEPARATE, gated step
1090
+ * (adjudicate clause → general PBT case → seal-amend → reference re-verify) —
1091
+ * this command only DISCOVERS, it never amends.
1092
+ */
1093
+ function injectCmd(args: Record<string, string>): void {
1094
+ const sealed = args.sealed;
1095
+ const impl = args.impl;
1096
+ if (!sealed || !impl) {
1097
+ process.stderr.write("inject requires --sealed <suite> and --impl <winning-specimen>.\n");
1098
+ process.exitCode = 1;
1099
+ return;
1100
+ }
1101
+ const battery = loadBattery(args.root ? batteryDir(args.root) : args.battery);
1102
+ const survivors = injectMutants(sealed, impl, battery);
1103
+ const { action } = onInjectRound(initialInject(), { survivors: survivors.length, promoted: 0 });
1104
+ print({
1105
+ batterySize: battery.length,
1106
+ survivors: summarizeSurvivors(survivors),
1107
+ blindSpotFound: survivors.length > 0,
1108
+ nextAction: action,
1109
+ note:
1110
+ survivors.length > 0
1111
+ ? "Blind spot(s) found. Adjudicate each against a NAMED contract clause; only a clause-violating survivor becomes a GENERAL (not mutant-keyed) sealed test via seal-amend + reference re-verify."
1112
+ : "No survivor — the sealed suite caught every injected variant.",
1113
+ });
1114
+ }
1115
+
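The survivor rule that `inject` reports on (a mutant the sealed suite still passes is a candidate blind spot) can be sketched without the real runner. Everything here is an assumption made for illustration: the `Suite` and `Mutant` shapes, `findSurvivors`, and the toy all-digits contract are invented, and the real `injectMutants` in `eval-runner.ts` is not shown in this diff.

```typescript
type Impl = (token: string) => boolean;
type Mutant = { name: string; impl: Impl };
type Suite = Array<{ input: string; expected: boolean }>;

const passes = (suite: Suite, impl: Impl): boolean =>
  suite.every((c) => impl(c.input) === c.expected);

// A mutant the suite still passes "survives": the suite cannot tell it from the winner.
function findSurvivors(suite: Suite, mutants: Mutant[]): string[] {
  return mutants.filter((m) => passes(suite, m.impl)).map((m) => m.name);
}

// Invented contract for illustration: accept only all-digit tokens.
const suite: Suite = [
  { input: "123", expected: true },
  { input: "abc", expected: false },
];
const mutants: Mutant[] = [
  // Accepts any token that merely starts with a digit, so "5abc" slips through.
  { name: "prefix-only-check", impl: (t) => /^\d/.test(t) },
  // Rejects everything; the suite's positive case kills it.
  { name: "always-false", impl: () => false },
];
const survivors = findSurvivors(suite, mutants);

// Adding a general malformed-token case closes the blind spot.
const sharpened: Suite = [...suite, { input: "5abc", expected: false }];
const survivorsAfter = findSurvivors(sharpened, mutants);
```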
1116
+ /**
1117
+ * harness-mine: the test-author skill-mining verifier (promotion gate half i).
1118
+ * Given a candidate bug-class mutator spec, does it SURVIVE the given sealed
1119
+ * suite (a genuine, currently-uncaught blind spot)? A mutator the incumbent
1120
+ * suite already kills is a no-op and rejected. The complementary half (ii) — the
1121
+ * sharpened suite KILLS it — is a second call against the amended suite expecting
1122
+ * `survives:false`.
1123
+ */
1124
+ function harnessMine(args: Record<string, string>): void {
1125
+ const sealed = args.sealed;
1126
+ const impl = args.impl;
1127
+ // Accept either a single MutatorSpec or a battery-style array (take the first).
1128
+ const raw = readJSONArg<MutatorSpec | MutatorSpec[]>(args.mutator ?? args["mutator-spec"]);
1129
+ const spec = Array.isArray(raw) ? raw[0] : raw;
1130
+ if (!sealed || !impl || !spec?.name || !spec.find) {
1131
+ process.stderr.write("harness-mine requires --sealed, --impl, and --mutator <spec.json|inline> ({name,find,replace}).\n");
1132
+ process.exitCode = 1;
1133
+ return;
1134
+ }
1135
+ const survivors = injectMutants(sealed, impl, [
1136
+ { name: spec.name, apply: (s) => (new RegExp(spec.find, spec.flags ?? "")).test(s) ? s.replace(new RegExp(spec.find, spec.flags ?? ""), spec.replace) : null },
1137
+ ]);
1138
+ const survives = survivors.length > 0;
1139
+ print({
1140
+ mutator: spec.name,
1141
+ survives,
1142
+ verdict: survives
1143
+ ? "SURVIVES — a genuine, currently-uncaught blind spot (promotion gate half i ✓). Author the general heuristic, then re-run against the sharpened suite expecting survives:false."
1144
+ : "killed — the suite already catches this class; not a blind spot. Rejected as a no-op.",
1145
+ });
1146
+ }
1147
+
1148
+ /** harness-promote-mutator: append a TWICE-verified mutator spec to the battery. */
1149
+ async function harnessPromoteMutator(args: Record<string, string>): Promise<void> {
1150
+ const root = args.root;
1151
+ const rawSpec = readJSONArg<MutatorSpec | MutatorSpec[]>(args.spec);
1152
+ const spec = Array.isArray(rawSpec) ? rawSpec[0] : rawSpec;
1153
+ if (!root || !spec?.name) {
1154
+ process.stderr.write("harness-promote-mutator requires --root and --spec <mutator.json> with a name.\n");
1155
+ process.exitCode = 1;
1156
+ return;
1157
+ }
1158
+ const dir = batteryDir(root);
1159
+ mkdirSync(dir, { recursive: true });
1160
+ const file = join(dir, `${spec.name.replace(/[^A-Za-z0-9_-]/g, "_")}.json`);
1161
+ await writeFile(file, JSON.stringify([spec], null, 2) + "\n", "utf8");
1162
+ print({ promoted: spec.name, battery: file, batterySize: loadBattery(dir).length });
1163
+ }
1164
+
1165
+ /**
1166
+ * harness-spawn: deterministically sample K parents from the archive (DGM rule
1167
+ * P ∝ fitness/(1+childCount)) and emit their genomes as mutation seeds. An empty
1168
+ * archive yields the default/incumbent genome as the sole seed.
1169
+ */
1170
+ function harnessSpawn(args: Record<string, string>): void {
1171
+ const root = args.root!;
1172
+ const k = Math.max(1, Math.round(Number(args.k ?? "4")));
1173
+ const archive = readArchive(root);
1174
+ const parents = archive.length === 0
1175
+ ? [{ variantId: "seed", genome: defaultGenome() }]
1176
+ : sampleParents(archive, k).map((p) => ({ variantId: p.variantId, genome: p.genome }));
1177
+ print({ count: parents.length, parents, note: "Mutate ONE gene per child (HarnessX substitution); realize via the agent layer, then score with harness-fitness." });
1178
+ }
1179
+
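The parent-selection rule named in the comment, P ∝ fitness/(1+childCount), can be made concrete as a normalised weight per archive entry. This is a sketch of the weighting only, assuming non-negative fitness. `parentProbabilities` and the trimmed `Entry` shape are hypothetical, and how `sampleParents` turns these weights into a deterministic draw is not shown in this diff.

```typescript
type Entry = { variantId: string; fitness: number; childCount: number };

// Selection weight per the rule P ∝ fitness / (1 + childCount), normalised to sum to 1.
function parentProbabilities(archive: Entry[]): Map<string, number> {
  const raw = archive.map((e) => Math.max(0, e.fitness) / (1 + e.childCount));
  const total = raw.reduce((a, b) => a + b, 0);
  return new Map(
    archive.map((e, i): [string, number] => [
      e.variantId,
      // Fall back to uniform weights when every entry has zero fitness.
      total > 0 ? (raw[i] ?? 0) / total : 1 / archive.length,
    ]),
  );
}

// A fit but heavily explored parent loses ground to a less explored one.
const probs = parentProbabilities([
  { variantId: "A", fitness: 0.8, childCount: 3 }, // raw weight 0.2
  { variantId: "B", fitness: 0.6, childCount: 0 }, // raw weight 0.6
]);
```

The `1 + childCount` divisor is what keeps the archive exploring: each child spawned from a parent lowers that parent's chance of being picked again.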
1180
+ /**
1181
+ * harness-fitness: compute a variant's held-out fitness from per-substrate truth
1182
+ * scores (the agent layer ran the variant's tournament on each recall-free pilot
1183
+ * and passes the numbers in), AceGRPO-weighted toward the learnable frontier
1184
+ * (substrates where the incumbent is mid-band), then append a content-addressed
1185
+ * ArchiveEntry. The bridge owns the math; agents never compute it.
1186
+ */
1187
+ async function harnessFitness(args: Record<string, string>): Promise<void> {
1188
+ const root = args.root!;
1189
+ const genome = readJSONArg<HarnessGenome>(args.genome);
1190
+ const scores = readJSONArg<Record<string, number>>(args.scores);
1191
+ if (!genome || !scores) {
1192
+ process.stderr.write("harness-fitness requires --root, --genome <genome.json>, --scores <{substrate:score}>.\n");
1193
+ process.exitCode = 1;
1194
+ return;
1195
+ }
1196
+ const substrates = Object.keys(scores).sort();
1197
+ const inc = incumbent(root);
1198
+ const incPer = substrates.map((s) => inc?.perSubstrate[s] ?? 0.5);
1199
+ const weights = frontierWeights(incPer);
1200
+ const fitness = weightedFitness(substrates.map((s) => scores[s]!), weights);
1201
+ const entry = makeArchiveEntry({
1202
+ genome,
1203
+ parent: args.parent && args.parent !== "true" ? args.parent : inc?.variantId ?? null,
1204
+ fitness,
1205
+ perSubstrate: scores,
1206
+ advantage: 0, // filled by harness-select within its generation
1207
+ gates: { hackClean: false, sealOk: false, interfaceParity: false, diversityOk: false, beatsIncumbent: false, rubricCalibrated: false },
1208
+ });
1209
+ appendArchiveEntry(root, entry);
1210
+ if (entry.parent) bumpChildCount(root, entry.parent);
1211
+ print({ variantId: entry.variantId, fitness, weights, perSubstrate: scores, incumbentFitness: inc?.fitness ?? null });
1212
+ }
1213
+
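One plausible reading of "weighted toward the learnable frontier (substrates where the incumbent is mid-band)" is a weight that peaks where the incumbent scores 0.5 and vanishes where it is saturated. The sketch below assumes a `p * (1 - p)` weighting and a weighted mean. Both functions are hypothetical stand-ins, since the real `frontierWeights` and `weightedFitness` in `diversity.ts` are not shown in this diff.

```typescript
// Assumed frontier weighting: p * (1 - p) peaks at p = 0.5 and is zero at 0 and 1.
function frontierWeightsSketch(incumbentScores: number[]): number[] {
  const raw = incumbentScores.map((p) => p * (1 - p));
  const total = raw.reduce((a, b) => a + b, 0);
  // If every substrate is saturated (all 0 or 1), fall back to uniform weights.
  return total > 0 ? raw.map((w) => w / total) : raw.map(() => 1 / raw.length);
}

// Weighted mean of the variant's scores; weights are expected to sum to 1.
function weightedFitnessSketch(scores: number[], weights: number[]): number {
  return scores.reduce((acc, s, i) => acc + s * (weights[i] ?? 0), 0);
}

// The incumbent is mid-band on the first substrate and saturated on the second,
// so only the first substrate contributes to the variant's fitness.
const w = frontierWeightsSketch([0.5, 1]);
const fit = weightedFitnessSketch([0.6, 0.2], w);
```

Under this reading, the `?? 0.5` default for a substrate the incumbent has never scored gives that substrate the maximum weight.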
1214
+ /**
1215
+ * harness-select: GRPO group-relative advantage over a generation of variants
1216
+ * (the harness altitude), with the variance-collapse guard. Returns the
1217
+ * max-advantage winner and whether the generation carried enough spread to rank.
1218
+ */
1219
+ function harnessSelect(args: Record<string, string>): void {
1220
+ const variants = readJSONArg<{ variantId: string; fitness: number }[]>(args.variants);
1221
+ if (!variants || variants.length === 0) {
1222
+ process.stderr.write("harness-select requires --variants <[{variantId,fitness}]>.\n");
1223
+ process.exitCode = 1;
1224
+ return;
1225
+ }
1226
+ const floor = Number(args.floor ?? "0.02");
1227
+ const diversity = checkDiversity(variants.map((v) => v.fitness), floor);
1228
+ const advantages = groupRelativeAdvantage(variants.map((v) => ({ specimen: v.variantId, reward: v.fitness })));
1229
+ const ranked = [...advantages].sort((a, b) => b.advantage - a.advantage);
1230
+ print({
1231
+ diversity,
1232
+ winner: diversity.ok ? ranked[0]?.specimen ?? null : null,
1233
+ advantages: ranked,
1234
+ note: diversity.ok
1235
+ ? "Generation has spread; winner is the max-advantage variant."
1236
+ : "VARIANCE COLLAPSE — σ below floor. Do NOT promote; re-sample with forced gene diversity (RC-GRPO).",
1237
+ });
1238
+ }
1239
+
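The selection step combines a group-relative advantage with a spread check. A minimal sketch, assuming the common GRPO normalisation A_i = (r_i − mean) / (std + ε) and a population standard deviation for the floor test: `advantagesSketch` and `hasSpread` are hypothetical names, and the real `groupRelativeAdvantage` and `checkDiversity` may normalise differently. Only the `0.02` default floor is taken from the source.

```typescript
type Scored = { specimen: string; reward: number };

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;
const std = (xs: number[]): number => {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
};

// Group-relative advantage: each reward measured against its own generation.
function advantagesSketch(group: Scored[], eps = 1e-8): Array<Scored & { advantage: number }> {
  const rewards = group.map((g) => g.reward);
  const m = mean(rewards);
  const s = std(rewards);
  return group.map((g) => ({ ...g, advantage: (g.reward - m) / (s + eps) }));
}

// Variance-collapse guard: refuse to rank when the spread is below the floor.
function hasSpread(rewards: number[], floor = 0.02): boolean {
  return std(rewards) >= floor;
}

const adv = advantagesSketch([
  { specimen: "v1", reward: 0.5 },
  { specimen: "v2", reward: 0.7 },
]);
```

Without the guard, a generation of near-identical fitness values would still produce a "winner", because dividing by a tiny standard deviation inflates noise into large advantages.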
1240
+ /**
1241
+ * harness-promote: the five-gate promotion decision (DGM hack-resistance). A
1242
+ * variant becomes incumbent only if it beats the incumbent on held-out fitness
1243
+ * AND is hack-clean on its OWN outputs AND preserved sealing integrity AND
1244
+ * interface parity AND came from a diverse generation.
1245
+ */
1246
+ function harnessPromote(args: Record<string, string>): void {
1247
+ const root = args.root!;
1248
+ const variantId = args.variant;
1249
+ const archive = readArchive(root);
1250
+ const variant = archive.find((e) => e.variantId === variantId);
1251
+ if (!variant) {
1252
+ process.stderr.write(`harness-promote: variant ${variantId} not in archive.\n`);
1253
+ process.exitCode = 1;
1254
+ return;
1255
+ }
1256
+ const bool = (k: string): boolean => (args[k] ?? "").toLowerCase() === "true";
1257
+ // "Beats incumbent" must compare against the prior incumbent, NOT this variant
1258
+ // itself (it may already be the max-fitness archived entry). Prefer an explicit
1259
+ // baseline fitness from the caller; else the best fitness among OTHER entries.
1260
+ const others = archive.filter((e) => e.variantId !== variantId);
1261
+ const baseline =
1262
+ args["baseline-fitness"] !== undefined && args["baseline-fitness"] !== "true"
1263
+ ? Number(args["baseline-fitness"])
1264
+ : others.length
1265
+ ? Math.max(...others.map((e) => e.fitness))
1266
+ : -Infinity;
1267
+ const beatsIncumbent = variant.fitness > baseline;
1268
+ // Interface parity: the variant must not change the bridge command surface.
1269
+ const incumbentCommands = BRIDGE_COMMANDS;
1270
+ const variantCommands = readJSONArg<string[]>(args["variant-commands"]) ?? BRIDGE_COMMANDS;
1271
+ const parity = checkParity(incumbentCommands, variantCommands);
1272
+ // Calibrated-verifier gate (0.9.5, fail-closed): the judge that produced this
1273
+ // variant's selection signal must be target-task calibrated before it may steer
1274
+ // promotion (2606.14629 — an uncalibrated verifier silently regresses). A
1275
+ // missing --slice-type, or a slice-type whose blind-accuracy battery has not
1276
+ // run, reads as uncalibrated.
1277
+ const sliceType = args["slice-type"];
1278
+ const profile = readReliabilityProfile(root);
1279
+ const calib = sliceType
1280
+ ? calibrationGate(profile, sliceType)
1281
+ : { calibrated: false, reason: "no --slice-type — judge calibration unknown (fail-closed)" };
1282
+ const inputs = {
1283
+ beatsIncumbent,
1284
+ hackClean: bool("hack-clean"),
1285
+ sealOk: bool("seal-ok"),
1286
+ interfaceParity: parity.ok,
1287
+ diversityOk: bool("diversity-ok"),
1288
+ rubricCalibrated: calib.calibrated,
1289
+ };
1290
+ const verdict = promotionGate(inputs);
1291
+ // Record the gate snapshot on the entry (audit), append-rewrite is fine: the
1292
+ // archive is the durable record and this is the gate result for THIS variant.
1293
+ variant.gates = { ...inputs };
1294
+ writeFileSync(join(stzPath(root, "60-harness"), "MANIFEST.json"), JSON.stringify(archive, null, 2) + "\n", "utf8");
1295
+ print({ variantId, inputs, ...verdict, parity, calibration: calib, baselineFitness: baseline === -Infinity ? null : baseline, variantFitness: variant.fitness });
1296
+ }
1297
+
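The fail-closed behaviour of the promotion gate can be sketched as a plain conjunction over the six inputs assembled above. This is an assumption about `promotionGate`, whose body and return shape are not shown in this diff. `promotionGateSketch`, its `{ promote, failed }` result, and the `flag` helper are hypothetical.

```typescript
type GateInputs = {
  beatsIncumbent: boolean;
  hackClean: boolean;
  sealOk: boolean;
  interfaceParity: boolean;
  diversityOk: boolean;
  rubricCalibrated: boolean;
};

// Assumed gate logic: promote only when every gate holds, and name the ones that failed.
function promotionGateSketch(inputs: GateInputs): { promote: boolean; failed: string[] } {
  const failed = Object.entries(inputs)
    .filter(([, ok]) => !ok)
    .map(([name]) => name);
  return { promote: failed.length === 0, failed };
}

// Fail-closed flag parsing: an absent or non-"true" flag reads as false.
const flag = (v: string | undefined): boolean => (v ?? "").toLowerCase() === "true";

const result = promotionGateSketch({
  beatsIncumbent: true,
  hackClean: flag("true"),
  sealOk: flag("true"),
  interfaceParity: true,
  diversityOk: flag("TRUE"),
  rubricCalibrated: flag(undefined), // no calibration evidence supplied
});
```

Here a variant that clears five gates is still held, because the missing calibration evidence counts as a failure rather than as a pass.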
1298
+ /** harness-status: archive summary, incumbent, and meta-loop view. */
1299
+ function harnessStatus(args: Record<string, string>): void {
1300
+ const root = args.root!;
1301
+ const archive = readArchive(root);
1302
+ const inc = incumbent(root);
1303
+ const meta: Pick<MetaState, "generation"> = { generation: archive.length };
1304
+ print({
1305
+ archiveSize: archive.length,
1306
+ incumbent: inc ? { variantId: inc.variantId, fitness: inc.fitness, perSubstrate: inc.perSubstrate } : null,
1307
+ battery: loadBattery(batteryDir(root)).map((m) => m.name),
1308
+ variants: archive.map((e) => ({ variantId: e.variantId, parent: e.parent, fitness: e.fitness, childCount: e.childCount, promoted: e.gates.beatsIncumbent })),
1309
+ meta,
1310
+ });
1311
+ }
1312
+
1313
+ /**
1314
+ * judge-stress: consistency CI check (no labels). Given pairwise judgments re-run
1315
+ * under an order/verbosity swap, score the fraction whose winner is invariant —
1316
+ * a reliability signal grounded in the real cron order-effect. Writes a
1317
+ * per-slice-type profile under 90-audit/judge-reliability.md. NEVER aggregates
1318
+ * multiple judges (naive ensembles amplify bias — arXiv:2505.19477).
1319
+ */
1320
+ async function judgeStress(args: Record<string, string>): Promise<void> {
1321
+ const pairs = readJSONArg<{ original: string; perturbed: string }[]>(args.pairs);
1322
+ const sliceType = args["slice-type"] ?? "unknown";
1323
+ if (!pairs) {
1324
+ process.stderr.write("judge-stress requires --pairs <[{original,perturbed}]> and optional --slice-type.\n");
1325
+ process.exitCode = 1;
1326
+ return;
1327
+ }
1328
+ const result = consistencyScore(pairs);
1329
+ const bucket = bucketOf(result.score);
1330
+ if (args.root) {
1331
+ // Persist the machine-readable profile the promotion gate consumes. Merge so
1332
+ // a blind-accuracy bucket already written by judge-calibration is preserved
1333
+ // (the two commands own different fields and may run in either order).
1334
+ mergeReliabilityEntry(args.root, { sliceType, consistency: result.score, n: result.total });
1335
+ await writeDoc(args.root, join("90-audit", "judge-reliability.md"), {
1336
+ frontmatter: { summary: `Judge consistency for ${sliceType}: ${(result.score * 100).toFixed(0)}% invariant under perturbation (n=${result.total}, ${bucket}).` },
1337
+ body:
1338
+ `# Judge reliability profile\n\n` +
1339
+ `Single robust judge, stress-tested for consistency (NO naive ensembling — more judges amplify bias).\n\n` +
1340
+ `- **slice-type:** ${sliceType}\n- **consistency (order/verbosity invariance):** ${result.invariant}/${result.total} = ${result.score.toFixed(3)} (${bucket})\n` +
1341
+ `- **blind-battery accuracy:** pending (must be authored blind to judge rationales — a self-built battery is circular)\n\n` +
1342
+ `Below ${0.7} ⇒ down-weight the judge for this slice-type and lean on the sealed/truth divergence backstop.\n`,
1343
+ ...({} as Record<string, never>),
1344
+ });
1345
+ }
1346
+ print({ sliceType, ...result, bucket });
1347
+ }
1348
+
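The consistency check reduces to counting judgments whose winner is unchanged after the order or verbosity swap. A minimal sketch follows. The `invariant`, `total`, and `score` field names come from how `judgeStress` reads the result, while `consistencySketch`, `bucketSketch`, and the two bucket labels are hypothetical. The source fixes only the 0.7 down-weight threshold, and the real `bucketOf` may use more bands.

```typescript
type Pair = { original: string; perturbed: string };

// Fraction of pairwise judgments whose winner survives the perturbation.
function consistencySketch(pairs: Pair[]): { invariant: number; total: number; score: number } {
  const invariant = pairs.filter((p) => p.original === p.perturbed).length;
  const total = pairs.length;
  return { invariant, total, score: total === 0 ? 0 : invariant / total };
}

// Hypothetical two-band bucketing around the 0.7 threshold mentioned in the profile text.
function bucketSketch(score: number): "reliable" | "unreliable" {
  return score >= 0.7 ? "reliable" : "unreliable";
}

// Three of four judgments keep the same winner after the swap.
const stress = consistencySketch([
  { original: "A", perturbed: "A" },
  { original: "B", perturbed: "B" },
  { original: "A", perturbed: "B" }, // winner flipped under perturbation
  { original: "B", perturbed: "B" },
]);
const stressBucket = bucketSketch(stress.score);
```

No ground-truth labels are needed for this check, which is why it can run in CI. Accuracy against labels is the separate job of `judge-calibration`.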
1349
+ /**
1350
+ * judge-calibration (0.9.5): measure the judge's TARGET-TASK accuracy on a blind,
1351
+ * pre-registered ground-truth battery and persist the bucket. This is the
1352
+ * calibration 2606.14629 requires BEFORE a verifier may steer promotion: a judge
1353
+ * that is above-threshold on one slice-type can be sub-threshold on another, and
1354
+ * a confident-but-wrong verifier regresses worse than a random one. The agent
1355
+ * layer runs the judge on the blind battery and passes its picks (`--verdicts`)
1356
+ * alongside the ground-truth labels (`--labels`); the bridge owns the arithmetic
1357
+ * (no model call — N6). Writes `blindAccuracyBucket` into the same per-slice-type
1358
+ * profile entry judge-stress fills, merge-preserving its consistency field.
1359
+ */
1360
+ function judgeCalibration(args: Record<string, string>): void {
1361
+ const root = args.root;
1362
+ const sliceType = args["slice-type"];
1363
+ const verdicts = readJSONArg<string[]>(args.verdicts);
1364
+ const labels = readJSONArg<string[]>(args.labels);
1365
+ if (!root || !sliceType || !verdicts || !labels || verdicts.length !== labels.length || verdicts.length === 0) {
1366
+ process.stderr.write(
1367
+ "judge-calibration requires --root, --slice-type, --verdicts <[picked]>, --labels <[groundTruth]> (equal, non-empty arrays).\n",
1368
+ );
1369
+ process.exitCode = 1;
1370
+ return;
1371
+ }
1372
+ const correct = verdicts.filter((v, i) => v === labels[i]).length;
1373
+ const accuracy = correct / verdicts.length;
1374
+ const bucket = bucketOf(accuracy);
1375
+ mergeReliabilityEntry(root, { sliceType, blindAccuracyBucket: bucket, n: verdicts.length });
1376
+ print({ sliceType, accuracy, bucket, correct, n: verdicts.length });
1377
+ }
1378
+
1379
+ /** The pinned bridge command surface — the interface a variant must preserve. */
1380
+ const BRIDGE_COMMANDS = [
1381
+ "version", "begin", "record-eval", "eval", "gate", "escalate", "record-votes", "select", "finalize",
1382
+ "project-init", "project-phase", "project-write-intent", "project-record-area", "project-set-config",
1383
+ "project-dark-factory", "project-config", "slice-add", "project-seed-slices", "project-status", "summary",
1384
+ "seal", "seal-verify", "seal-crosscheck", "seal-amend", "merge-validate", "merge-compat-propose",
1385
+ "merge-compat-approve", "merge-compat-retire", "merge-compat-list",
1386
+ "inject", "harness-mine", "harness-promote-mutator", "harness-spawn", "harness-fitness", "harness-select",
1387
+ "harness-promote", "harness-status", "judge-stress", "judge-calibration",
1388
+ ];
1389
+
1042
1390
  export async function runBridge(argv: string[]): Promise<void> {
1043
1391
  const [sub, ...rest] = argv;
1044
1392
  const args = parseArgs(rest);
@@ -1072,6 +1420,17 @@ export async function runBridge(argv: string[]): Promise<void> {
1072
1420
  case "merge-compat-approve": await mergeCompatApprove(args); break;
1073
1421
  case "merge-compat-retire": await mergeCompatRetire(args); break;
1074
1422
  case "merge-compat-list": mergeCompatList(args); break;
1423
+ // ── 0.9.0 harness-level RSI meta-loop ──────────────────────────────────
1424
+ case "inject": injectCmd(args); break;
1425
+ case "harness-mine": harnessMine(args); break;
1426
+ case "harness-promote-mutator": await harnessPromoteMutator(args); break;
1427
+ case "harness-spawn": harnessSpawn(args); break;
1428
+ case "harness-fitness": await harnessFitness(args); break;
1429
+ case "harness-select": harnessSelect(args); break;
1430
+ case "harness-promote": harnessPromote(args); break;
1431
+ case "harness-status": harnessStatus(args); break;
1432
+ case "judge-stress": await judgeStress(args); break;
1433
+ case "judge-calibration": judgeCalibration(args); break;
1075
1434
  default:
1076
1435
  process.stderr.write(`unknown bridge subcommand: ${sub}\n`);
1077
1436
  process.exitCode = 1;