slice-tournament-zoo 0.7.3 → 0.9.5
- package/README.md +37 -0
- package/agents/stz-harness-critic.md +42 -0
- package/agents/stz-injector.md +41 -0
- package/agents/stz-test-author.md +37 -0
- package/package.json +2 -2
- package/src/bridge.ts +365 -6
- package/src/diversity.ts +57 -0
- package/src/eval-runner.ts +146 -7
- package/src/hack-detector.ts +71 -0
- package/src/harness-hash.ts +60 -0
- package/src/harness.ts +316 -0
- package/src/injector.ts +79 -0
- package/src/judge-reliability.ts +135 -0
- package/src/project.ts +54 -0
- package/src/selection.ts +16 -3
- package/src/taxonomy.ts +3 -0
- package/src/types.ts +87 -0
- package/src/version.ts +1 -1
package/README.md
CHANGED
@@ -261,6 +261,36 @@ You, the session, become the orchestrator. The command:
 
 Every exact decision is made by the CLI, never by the agent's own arithmetic.
 
+### Evolve the harness itself (0.9.0, opt-in)
+
+STZ can improve **its own harness**, not just the code it produces. The per-slice
+tournament stays exactly as above; a separate, default-off meta-loop evolves the
+harness *genome* (test-author heuristics, specimen strategies, judge rubric,
+selection weights, fan-out, the suite battery) against **held-out, recall-free**
+pilot fitness — a DGM/HarnessX-style archive selected by GRPO advantage with a
+six-gate promotion guard (0.9.5 adds calibrated-verifier gating: a selection
+judge must pass a blind target-task accuracy battery before it may steer a
+promotion, fail-closed).
+
+```text
+/stz:inject slice-01 # adversarially harden the sealed suite (find blind spots)
+/stz:evolve # run the bounded harness-evolution meta-loop (needs harness.enabled)
+```
+
+The flagship is **automated suite sharpening**: a blind-spot bug-class the judge
+finds past a green suite (e.g. the `5abc` malformed-token trap) is mined *once*
+into the test-author's repertoire + the mutation battery, so every future suite is
+born sharper at ~0 marginal cost — instead of re-deriving it per slice. This is
+the empirically-grounded relocation of the shelved 0.8.0 per-slice convergence
+loop (ruled out budget-matched and recall-free; see `docs/ROADMAP.md` and
+`experiments/swebench-pilot/PILOT-RESULTS-{BLIND,JUDGE}.md`). Bridge primitives:
+`inject`, `harness-mine`, `harness-promote-mutator`, `harness-spawn`,
+`harness-fitness`, `harness-select`, `harness-promote`, `harness-status`,
+`judge-stress`, `judge-calibration`. A 0.9.5 authoring gene
+(`waf-playbook-autogen-v0`) lets the test author bake AWS Well-Architected
+playbook edge-cases for contracted behaviour (one-time, never a reward). Every
+kill-switch halts and surfaces; nothing auto-rewrites its own guard.
+
 ## Example commands and workflows
 
 ### A whole project (the full pipeline)
@@ -405,3 +435,10 @@ For contributors and anyone going past day-to-day operation:
 ## License
 
 [Apache-2.0](https://github.com/dr-robert-li/slice-tournament-zoo/blob/main/LICENSE).
+
+## Research
+
+The full account of what STZ is, the experiments under `experiments/`, the outcomes, and
+the open questions is in **[docs/PAPER.md](docs/PAPER.md)** ("When does a self-improving
+coding harness actually improve competency? A negative result, earned"). The first-person
+build log is in [docs/JOURNAL.md](docs/JOURNAL.md).
package/agents/stz-harness-critic.md
@@ -0,0 +1,42 @@
+---
+name: stz-harness-critic
+description: HarnessX-style Critic for the STZ harness-evolution meta-loop (0.9.0). Validates a candidate harness variant on the HELD-OUT pilot fitness before promotion. Reads the truth suites; blind to which variant authored which output (no genome-authorship bias).
+tools: Read, Bash, Grep, Glob
+model: inherit
+---
+
+You are the **Critic** in the STZ harness-evolution meta-loop (the C in HarnessX's
+Digester→Planner→Evolver→Critic). The Evolver proposed a harness **variant** (one
+gene changed: a test-author heuristic, a specimen strategy, a judge rubric, a
+selection-weight tuple, fan-out, or a battery mutator). Your job is to decide
+whether it genuinely improves the harness — on **held-out, recall-free** fitness,
+not on the training traces.
+
+## Inputs
+- The variant's **per-substrate truth scores** on the recall-free pilots
+  (`experiments/{cron,hexcolor,ipv4}-pilot/truth-suite/`), already computed by
+  running the variant's tournament on each pilot.
+- The current **incumbent** archive entry (`bridge harness-status`).
+
+## What you check (and how to stay honest)
+1. **Beats the incumbent at equal-or-lower budget.** A variant that wins only by
+   spending more tokens is rejected (the JUDGE pilot's "B overspent and only tied"
+   is the cautionary baseline). Use the budget-matched comparison.
+2. **No regression on any substrate** the incumbent already passed. A variant that
+   trades a cron win for a hexcolor loss is not an improvement.
+3. **Convention axes discounted.** Spec-silent / recall axes (`7`=Sunday,
+   leading-zero, whitespace) are reported separately, never folded into the
+   primary fitness — they are the contamination the synthetic substrate exists to
+   exclude.
+4. **Symmetric error.** "No variant beats the incumbent → keep the incumbent" is a
+   SUCCESS outcome, not a failure. Do not manufacture a winner.
+
+## What you must NOT do
+- Do NOT read which genome authored which output before scoring (authorship bias).
+- Do NOT auto-rewrite anything. You emit a verdict; the bridge `harness-promote`
+  six-gate runs the actual promotion (and it also checks hack-clean on the
+  variant's own outputs, seal integrity, interface parity, and — 0.9.5 — that the
+  selection judge is target-task calibrated, else it fails closed).
+
+Return: a per-substrate comparison table, the budget note, and a PROMOTE /
+HOLD verdict with the deciding reason. The decision is earned, not asserted.
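Reviewer note on the Critic checks above: checks 1, 2 and 4 are mechanical enough to sketch. The following is a minimal illustration under stated assumptions, not the package's implementation. The `VariantReport` shape, the `tokensSpent` budget measure, and the `criticVerdict` name are hypothetical, and "no regression" is read here as "no per-substrate score lower than the incumbent's".

```typescript
// Hypothetical shapes for illustration only; none of these names come from the package.
interface VariantReport {
  perSubstrate: Record<string, number>; // truth score per pilot, e.g. cron, hexcolor, ipv4
  tokensSpent: number;                  // assumed budget measure
}

type Verdict = { verdict: "PROMOTE" | "HOLD"; reason: string };

function criticVerdict(incumbent: VariantReport, variant: VariantReport): Verdict {
  // Check 1: budget-matched. A win bought with extra spend is rejected.
  if (variant.tokensSpent > incumbent.tokensSpent) {
    return { verdict: "HOLD", reason: "variant spent more than the incumbent" };
  }
  // Check 2: no regression on any substrate the incumbent was scored on.
  let strictlyBetter = false;
  for (const [substrate, incScore] of Object.entries(incumbent.perSubstrate)) {
    const varScore = variant.perSubstrate[substrate] ?? 0;
    if (varScore < incScore) {
      return { verdict: "HOLD", reason: `regression on ${substrate}` };
    }
    if (varScore > incScore) strictlyBetter = true;
  }
  // Check 4: keeping the incumbent is a valid outcome, so never force a winner.
  return strictlyBetter
    ? { verdict: "PROMOTE", reason: "beats incumbent at equal-or-lower budget, no regression" }
    : { verdict: "HOLD", reason: "no substrate improved; keep the incumbent" };
}
```

In STZ itself the exact decision belongs to the bridge (`harness-promote`); a sketch like this only shows the shape of the comparison the Critic is asked to report.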
package/agents/stz-injector.md
@@ -0,0 +1,41 @@
+---
+name: stz-injector
+description: Adversarial bug-injector for STZ suite hardening (0.9.0, SSR-style). Perturbs a WINNING specimen into plausible variants it believes still satisfy the contract, to surface blind spots the sealed suite cannot see. Blind to the truth oracle and the sealed suite source.
+tools: Read, Write, Bash, Grep, Glob
+model: inherit
+---
+
+You are the **bug-injector** in an STZ suite-hardening round. Your adversary is the
+**sealed test suite**, not the contract. Your job: make the suite's blind spots
+visible so the test-author can close them.
+
+## What you may read
+- The slice **contract** (`.stz/40-slices/<id>/manifest.json` + `plan.md`).
+- ONE **winning specimen's source** (the tournament winner's `index.*`).
+
+## What you must NOT read (the blindness contract)
+- The sealed suite source (`.stz/30-tests/held-out/`), its reference, or any
+  truth/oracle file. You are blind to the grader. (A silent read defeats the
+  whole experiment — every finding in `experiments/*/FINDINGS.md` is recall-free
+  precisely because this held.)
+
+## What you produce
+Plausible **mutant variants** of the winner that you BELIEVE a reviewer would
+still accept as contract-satisfying, but that perturb behaviour — drop a
+validation branch, loosen a boundary, accept a malformed token. Write each as a
+candidate mutator spec `{name, find, replace}` (a regex substitution over the
+winner's source) so the bridge can apply it deterministically.
+
+The harness runs your candidates through `bridge inject` / `harness-mine`:
+- a mutant the sealed suite **still passes** is a real blind spot (survives);
+- a mutant the suite **kills** is already covered — discard it.
+
+## The hard rule you must respect
+A surviving mutant is only a real defect if it violates a **named contract
+clause**. You do not decide that — the cross-reference adjudicator does. And you
+must **never** propose keying a test to your mutant's exact bytes; the test-author
+writes a GENERAL property over the violated clause's input class (train-on-test is
+forbidden — see `experiments/swebench-pilot/PILOT-RESULTS-JUDGE.md`).
+
+Return the candidate mutator specs and a one-line rationale per spec naming the
+contract clause you think each violates. Nothing is sealed by you.
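Reviewer note on the mutator spec above: a `{name, find, replace}` spec is a regex substitution over the winner's source, which can be sketched as follows. This is an illustration under assumptions. `applyMutator` is a hypothetical helper name, the optional `flags` field is an assumption about the spec shape, and the sample specimen is invented for the example.

```typescript
// Field names follow the `{name, find, replace}` spec described above;
// `flags` is assumed optional. `applyMutator` is a hypothetical helper name.
interface MutatorSpec {
  name: string;
  find: string;    // regular-expression source
  replace: string; // replacement text
  flags?: string;
}

function applyMutator(source: string, spec: MutatorSpec): string | null {
  const pattern = new RegExp(spec.find, spec.flags ?? "");
  // No match means the mutator does not apply to this specimen.
  if (!pattern.test(source)) return null;
  // With a "g" or "y" flag, test() advances lastIndex, so reset it before replacing.
  pattern.lastIndex = 0;
  return source.replace(pattern, spec.replace);
}

// Invented example: loosen an upper boundary in a winning specimen.
const winner = "export const isOctet = (n: number) => n >= 0 && n <= 255;";
const mutant = applyMutator(winner, {
  name: "loosen-upper-bound",
  find: "<= 255",
  replace: "<= 256",
});
console.log(mutant);
```

Returning `null` when the pattern does not match keeps a non-applicable mutator from being counted as a survivor.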
package/agents/stz-test-author.md
CHANGED
@@ -92,6 +92,43 @@ do not invent requirements the implementers were never given. That produces the
 mirror failure (failing correct code on an unstated rule), the same class the
 invariant rules above guard against.
 
+## Heuristic gene: `heuristicId` routing (the G1 gene)
+
+The slice's harness genome carries a `heuristicId` (passed to you by the
+orchestrator). It selects which negative-case repertoire you draw on. It only
+changes *which edge cases you reach for* — never the contract you test:
+
+- **`baseline-v0` / `explicit-examples-v0`** — hand-written example cases over the
+  contract clauses (the default).
+- **`property-fuzz-v1`** — prefer property-based generators over the negative
+  space (the approach the section above already recommends).
+- **`waf-playbook-autogen-v0`** — additionally consult the **AWS Well-Architected
+  playbook bank** (the AWS Well-Architected Agentic AI Lens + the
+  `aws-samples/well-architected-skills-and-steering` skills, carried as steering
+  text in `.stz/20-standards/`) to sharpen negative/edge cases for the
+  reliability-, observability-, and guardrail-shaped behaviours **the contract
+  already specifies** — e.g. a contracted retry/back-off clause gets a case
+  asserting it actually retries and eventually gives up; a contracted
+  idempotency/least-privilege/timeout clause gets a discriminating negative.
+
+### The Goodhart guard for `waf-playbook-autogen-v0` (load-bearing — do not relax)
+
+This is **one-time amortized authoring**, not a score to optimise. Two hard rules,
+both required (the survey `experiments/META-RSI-SURVEY.md` §II.3 earned why):
+
+1. **WAF practices only sharpen cases for behaviour the contract already
+   specifies. They never add a WAF requirement the contract is silent on.** A
+   WAF-flavoured test for an unstated requirement is the exact "stay within the
+   contract" violation above, *and* it would smuggle WAF-conformance into the
+   sealed suite — which then *is* the fitness signal, making conformance a reward
+   by the back door. If the contract does not mention the pillar behaviour, do not
+   test it.
+2. **No WAF-conformance score is ever computed as fitness.** The selection
+   `weights` tuple stays `{pass, coverage, kill, codeHealth, clean}`; promotion
+   stays on held-out *functional* fitness only. An LLM-judged "how Well-Architected
+   does this look" score is appearance-adjacent and must never enter selection
+   (that is the conformance-judge failure mode the survey rules out).
+
 ## Reference implementation (proves the suite is satisfiable)
 
 Also write a **minimal, correct reference implementation** of the contract into
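Reviewer note on the `waf-playbook-autogen-v0` example above: the "actually retries and eventually gives up" case for a contracted retry clause might look like the sketch below. Everything here is hypothetical. `withRetry` stands in for whatever retry behaviour a slice contract specifies, it is synchronous to keep the example short, and back-off delays are omitted.

```typescript
// Hypothetical unit under test: calls `op` up to `maxAttempts` times, then rethrows.
function withRetry<T>(op: () => T, maxAttempts: number): T {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return op();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}

// Case 1: a transient failure is retried until it succeeds.
function retriesTransientFailure(): boolean {
  let calls = 0;
  const result = withRetry<string>(() => {
    calls++;
    if (calls < 3) throw new Error("transient");
    return "ok";
  }, 5);
  return result === "ok" && calls === 3;
}

// Case 2: a permanent failure stops after exactly maxAttempts calls.
function givesUpEventually(): boolean {
  let calls = 0;
  try {
    withRetry<string>(() => {
      calls++;
      throw new Error("permanent");
    }, 4);
    return false; // the call must not succeed
  } catch {
    return calls === 4;
  }
}

console.log(retriesTransientFailure(), givesUpEventually());
```

Both cases assert only behaviour a retry clause would already state, which is the constraint rule 1 of the Goodhart guard imposes.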
package/package.json
CHANGED
@@ -1,7 +1,7 @@
 {
   "name": "slice-tournament-zoo",
-  "version": "0.
-  "description": "STZ: a contract-bounded slice pipeline that implements each slice adversarially via an N-specimen tournament with frozen sealed tests, GRPO-style selection, layered anti-reward-hacking,
+  "version": "0.9.5",
+  "description": "STZ: a contract-bounded slice pipeline that implements each slice adversarially via an N-specimen tournament with frozen sealed tests, GRPO-style selection, layered anti-reward-hacking, a replayable markdown audit trail, and (0.9.0) a bounded harness-level recursive-self-improvement meta-loop that evolves the harness against held-out pilot fitness.",
   "license": "Apache-2.0",
   "homepage": "https://github.com/dr-robert-li/slice-tournament-zoo#readme",
   "repository": {
package/src/bridge.ts
CHANGED
@@ -56,14 +56,34 @@ import {
   runConfigExists,
   defaultRunConfig,
 } from "./project.js";
-import { detectHacks } from "./hack-detector.js";
+import { detectHacks, suspicionScore } from "./hack-detector.js";
 import { STZ_VERSION, SCHEMA_VERSION, PACKAGE_NAME } from "./version.js";
 import { onNoPassers, type EscalationState } from "./escalation.js";
 import { evalGate, select, pairings } from "./selection.js";
 import { diffSpecs, renderSpecDiff, isFaithful, unmatchedIntentIds, mismatchedAsBuiltIds, type Spec } from "./specdiff.js";
 import { seal, verifySeal, amendSeal, heldOutFiles } from "./seal.js";
 import { renderPressureLog, refinementContext, type CulledSpecimen } from "./pressure.js";
-import { fullEval, crossReference } from "./eval-runner.js";
+import { fullEval, crossReference, injectMutants, loadBattery, type MutatorSpec } from "./eval-runner.js";
+import { groupRelativeAdvantage } from "./grpo.js";
+import { checkDiversity, frontierWeights, weightedFitness } from "./diversity.js";
+import { checkParity } from "./harness-hash.js";
+import {
+  readArchive,
+  appendArchiveEntry,
+  bumpChildCount,
+  incumbent,
+  sampleParents,
+  makeArchiveEntry,
+  promotionGate,
+  batteryDir,
+  readReliabilityProfile,
+  mergeReliabilityEntry,
+  defaultGenome,
+  type MetaState,
+} from "./harness.js";
+import { initialInject, onInjectRound, summarizeSurvivors } from "./injector.js";
+import { consistencyScore, bucketOf, calibrationGate } from "./judge-reliability.js";
+import type { ArchiveEntry, HarnessGenome } from "./types.js";
 import {
   loadCompat,
   saveCompat,
@@ -193,12 +213,15 @@ function commitEval(
   root: string,
   slice: string,
   specimen: string,
-  metrics: { testPassRate: number; coverage: number; mutationScore: number },
+  metrics: { testPassRate: number; coverage: number; mutationScore: number; codeHealth?: number },
   fixtureNames: string[],
   extra: Record<string, unknown> = {},
 ): void {
   const files = readSpecimenFiles(root, slice, specimen);
   const hackFindings = detectHacks(specimen, files, { fixtureNames });
+  // 0.9.0: graded soft-suspicion (a hard-passer can still carry it) + code-health
+  // feed the multi-objective reward. codeHealth absent ⇒ neutral best (1).
+  const suspicion = suspicionScore(files, { fixtureNames });
   const result: EvalResult = {
     specimen,
     passedGate: metrics.testPassRate >= 1 && hackFindings.length === 0,
@@ -206,6 +229,8 @@ function commitEval(
     coverage: metrics.coverage,
     mutationScore: metrics.mutationScore,
     hackFindings,
+    ...(metrics.codeHealth !== undefined ? { codeHealth: metrics.codeHealth } : {}),
+    suspicion,
   };
   const out = evalResultPath(root, slice, specimen);
   mkdirSync(join(out, ".."), { recursive: true });
@@ -227,14 +252,16 @@ function recordEval(args: Record<string, string>): void {
  */
 function evalCmd(args: Record<string, string>): void {
   const { root, slice, specimen } = args as { root: string; slice: string; specimen: string };
-
+  // Promoted bug-class mutators under 60-harness/battery participate in mutation
+  // scoring when present (the sharpened battery), so a hardened suite is rewarded.
+  const e = fullEval(args.sealed!, args.impl!, existsSync(batteryDir(root)) ? batteryDir(root) : undefined);
   commitEval(
     root,
     slice,
     specimen,
-    { testPassRate: e.testPassRate, coverage: e.coverage, mutationScore: e.mutationScore },
+    { testPassRate: e.testPassRate, coverage: e.coverage, mutationScore: e.mutationScore, codeHealth: e.codeHealth },
     args.fixtures ? args.fixtures.split(",") : [],
-    { measured: { passed: e.passed, total: e.total, mutants: e.mutants, survivors: e.survivors } },
+    { measured: { passed: e.passed, total: e.total, mutants: e.mutants, survivors: e.survivors, codeHealth: e.codeHealth } },
   );
 }
 
@@ -1039,6 +1066,327 @@ async function mergeValidate(args: Record<string, string>): Promise<void> {
   print(verdict);
 }
 
+// ════════════════════════════════════════════════════════════════════════════
+// 0.9.0 — Harness-level recursive self-improvement (meta-loop) bridge commands.
+// The bridge owns ALL compute (N6): agents feed numbers in, never do arithmetic.
+// ════════════════════════════════════════════════════════════════════════════
+
+/** Read JSON from a file path OR an inline JSON string arg. */
+function readJSONArg<T>(v: string | undefined): T | null {
+  if (!v || v === "true") return null;
+  if (existsSync(v)) return readJSON<T>(v);
+  try {
+    return JSON.parse(v) as T;
+  } catch {
+    return null;
+  }
+}
+
+/**
+ * inject: adversarial suite hardening (SSR-style). Run the mutation battery
+ * (built-ins ∪ promoted) against a winning impl; mutants the SEALED suite still
+ * passes are candidate blind spots. Reports survivors + the bounded-FSM next
+ * action. Promotion of a survivor into a sealed test is a SEPARATE, gated step
+ * (adjudicate clause → general PBT case → seal-amend → reference re-verify) —
+ * this command only DISCOVERS, it never amends.
+ */
+function injectCmd(args: Record<string, string>): void {
+  const sealed = args.sealed;
+  const impl = args.impl;
+  if (!sealed || !impl) {
+    process.stderr.write("inject requires --sealed <suite> and --impl <winning-specimen>.\n");
+    process.exitCode = 1;
+    return;
+  }
+  const battery = loadBattery(args.root ? batteryDir(args.root) : args.battery);
+  const survivors = injectMutants(sealed, impl, battery);
+  const { action } = onInjectRound(initialInject(), { survivors: survivors.length, promoted: 0 });
+  print({
+    batterySize: battery.length,
+    survivors: summarizeSurvivors(survivors),
+    blindSpotFound: survivors.length > 0,
+    nextAction: action,
+    note:
+      survivors.length > 0
+        ? "Blind spot(s) found. Adjudicate each against a NAMED contract clause; only a clause-violating survivor becomes a GENERAL (not mutant-keyed) sealed test via seal-amend + reference re-verify."
+        : "No survivor — the sealed suite caught every injected variant.",
+  });
+}
+
+/**
+ * harness-mine: the test-author skill-mining verifier (promotion gate half i).
+ * Given a candidate bug-class mutator spec, does it SURVIVE the given sealed
+ * suite (a genuine, currently-uncaught blind spot)? A mutator the incumbent
+ * suite already kills is a no-op and rejected. The complementary half (ii) — the
+ * sharpened suite KILLS it — is a second call against the amended suite expecting
+ * `survives:false`.
+ */
+function harnessMine(args: Record<string, string>): void {
+  const sealed = args.sealed;
+  const impl = args.impl;
+  // Accept either a single MutatorSpec or a battery-style array (take the first).
+  const raw = readJSONArg<MutatorSpec | MutatorSpec[]>(args.mutator ?? args["mutator-spec"]);
+  const spec = Array.isArray(raw) ? raw[0] : raw;
+  if (!sealed || !impl || !spec?.name || !spec.find) {
+    process.stderr.write("harness-mine requires --sealed, --impl, and --mutator <spec.json|inline> ({name,find,replace}).\n");
+    process.exitCode = 1;
+    return;
+  }
+  const survivors = injectMutants(sealed, impl, [
+    { name: spec.name, apply: (s) => (new RegExp(spec.find, spec.flags ?? "")).test(s) ? s.replace(new RegExp(spec.find, spec.flags ?? ""), spec.replace) : null },
+  ]);
+  const survives = survivors.length > 0;
+  print({
+    mutator: spec.name,
+    survives,
+    verdict: survives
+      ? "SURVIVES — a genuine, currently-uncaught blind spot (promotion gate half i ✓). Author the general heuristic, then re-run against the sharpened suite expecting survives:false."
+      : "killed — the suite already catches this class; not a blind spot. Rejected as a no-op.",
+  });
+}
+
+/** harness-promote-mutator: append a TWICE-verified mutator spec to the battery. */
+async function harnessPromoteMutator(args: Record<string, string>): Promise<void> {
+  const root = args.root;
+  const rawSpec = readJSONArg<MutatorSpec | MutatorSpec[]>(args.spec);
+  const spec = Array.isArray(rawSpec) ? rawSpec[0] : rawSpec;
+  if (!root || !spec?.name) {
+    process.stderr.write("harness-promote-mutator requires --root and --spec <mutator.json> with a name.\n");
+    process.exitCode = 1;
+    return;
+  }
+  const dir = batteryDir(root);
+  mkdirSync(dir, { recursive: true });
+  const file = join(dir, `${spec.name.replace(/[^A-Za-z0-9_-]/g, "_")}.json`);
+  await writeFile(file, JSON.stringify([spec], null, 2) + "\n", "utf8");
+  print({ promoted: spec.name, battery: file, batterySize: loadBattery(dir).length });
+}
+
+/**
+ * harness-spawn: deterministically sample K parents from the archive (DGM rule
+ * P ∝ fitness/(1+childCount)) and emit their genomes as mutation seeds. An empty
+ * archive yields the default/incumbent genome as the sole seed.
+ */
+function harnessSpawn(args: Record<string, string>): void {
+  const root = args.root!;
+  const k = Math.max(1, Math.round(Number(args.k ?? "4")));
+  const archive = readArchive(root);
+  const parents = archive.length === 0
+    ? [{ variantId: "seed", genome: defaultGenome() }]
+    : sampleParents(archive, k).map((p) => ({ variantId: p.variantId, genome: p.genome }));
+  print({ count: parents.length, parents, note: "Mutate ONE gene per child (HarnessX substitution); realize via the agent layer, then score with harness-fitness." });
+}
+
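Reviewer note on `harness-spawn`: the body of `sampleParents` is not part of this section, so the following is only a sketch of one way to realise the stated rule P ∝ fitness/(1+childCount) deterministically. The seeded mulberry32-style generator, sampling with replacement, the uniform fallback, and the `ArchiveLike` and `sampleParentsSketch` names are all assumptions.

```typescript
// Hypothetical minimal archive entry; the package's ArchiveEntry carries more fields.
interface ArchiveLike {
  variantId: string;
  fitness: number;
  childCount: number;
}

// Small seeded PRNG (mulberry32-style) so that the same seed gives the same picks.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function sampleParentsSketch(archive: ArchiveLike[], k: number, seed = 1): ArchiveLike[] {
  if (archive.length === 0) return [];
  // DGM-style weight: fitter entries are favoured, heavily used parents are damped.
  const weights = archive.map((e) => Math.max(e.fitness, 0) / (1 + e.childCount));
  const total = weights.reduce((sum, w) => sum + w, 0);
  const rand = seededRandom(seed);
  const picks: ArchiveLike[] = [];
  for (let i = 0; i < k; i++) {
    // If every weight is zero, fall back to a uniform choice.
    let r = rand() * (total > 0 ? total : archive.length);
    let chosen = archive[archive.length - 1]!;
    for (let j = 0; j < archive.length; j++) {
      r -= total > 0 ? weights[j]! : 1;
      if (r < 0) {
        chosen = archive[j]!;
        break;
      }
    }
    picks.push(chosen);
  }
  return picks;
}
```

Dividing by `1 + childCount` keeps a single strong ancestor from monopolising every generation, which is presumably why the archive also tracks child counts via `bumpChildCount`.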
+/**
+ * harness-fitness: compute a variant's held-out fitness from per-substrate truth
+ * scores (the agent layer ran the variant's tournament on each recall-free pilot
+ * and passes the numbers in), AceGRPO-weighted toward the learnable frontier
+ * (substrates where the incumbent is mid-band), then append a content-addressed
+ * ArchiveEntry. The bridge owns the math; agents never compute it.
+ */
+async function harnessFitness(args: Record<string, string>): Promise<void> {
+  const root = args.root!;
+  const genome = readJSONArg<HarnessGenome>(args.genome);
+  const scores = readJSONArg<Record<string, number>>(args.scores);
+  if (!genome || !scores) {
+    process.stderr.write("harness-fitness requires --root, --genome <genome.json>, --scores <{substrate:score}>.\n");
+    process.exitCode = 1;
+    return;
+  }
+  const substrates = Object.keys(scores).sort();
+  const inc = incumbent(root);
+  const incPer = substrates.map((s) => inc?.perSubstrate[s] ?? 0.5);
+  const weights = frontierWeights(incPer);
+  const fitness = weightedFitness(substrates.map((s) => scores[s]!), weights);
+  const entry = makeArchiveEntry({
+    genome,
+    parent: args.parent && args.parent !== "true" ? args.parent : inc?.variantId ?? null,
+    fitness,
+    perSubstrate: scores,
+    advantage: 0, // filled by harness-select within its generation
+    gates: { hackClean: false, sealOk: false, interfaceParity: false, diversityOk: false, beatsIncumbent: false, rubricCalibrated: false },
+  });
+  appendArchiveEntry(root, entry);
+  if (entry.parent) bumpChildCount(root, entry.parent);
+  print({ variantId: entry.variantId, fitness, weights, perSubstrate: scores, incumbentFitness: inc?.fitness ?? null });
+}
+
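Reviewer note on `harness-fitness`: `frontierWeights` and `weightedFitness` live in `diversity.ts` and their bodies are not shown in this section. The sketch below is one plausible reading of "weighted toward the learnable frontier", assuming a weight proportional to p(1 − p) plus a small floor, where p is the incumbent's score on that substrate. The formula, the `epsilon` floor, and both function names are assumptions, not the package's code.

```typescript
// Assumed weighting: p * (1 - p) peaks at p = 0.5 (mid-band) and vanishes at 0 and 1.
// `epsilon` keeps saturated substrates from being ignored entirely.
function frontierWeightsSketch(incumbentScores: number[], epsilon = 0.05): number[] {
  const raw = incumbentScores.map((p) => {
    const clamped = Math.min(1, Math.max(0, p));
    return clamped * (1 - clamped) + epsilon;
  });
  const total = raw.reduce((sum, w) => sum + w, 0);
  return raw.map((w) => w / total); // normalise so the weights sum to 1
}

// Weighted mean of the variant's per-substrate scores under those weights.
function weightedFitnessSketch(scores: number[], weights: number[]): number {
  return scores.reduce((sum, s, i) => sum + s * (weights[i] ?? 0), 0);
}

// Incumbent is saturated on the first substrate, mid-band on the second, weak on the third.
const sketchWeights = frontierWeightsSketch([1.0, 0.5, 0.1]);
console.log(sketchWeights, weightedFitnessSketch([1.0, 0.8, 0.3], sketchWeights));
```

Under this reading, the `?? 0.5` default in `harnessFitness` gives a substrate with no incumbent score the maximum frontier weight, so new substrates count fully until an incumbent has been measured on them.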
+/**
+ * harness-select: GRPO group-relative advantage over a generation of variants
+ * (the harness altitude), with the variance-collapse guard. Returns the
+ * max-advantage winner and whether the generation carried enough spread to rank.
+ */
+function harnessSelect(args: Record<string, string>): void {
+  const variants = readJSONArg<{ variantId: string; fitness: number }[]>(args.variants);
+  if (!variants || variants.length === 0) {
+    process.stderr.write("harness-select requires --variants <[{variantId,fitness}]>.\n");
+    process.exitCode = 1;
+    return;
+  }
+  const floor = Number(args.floor ?? "0.02");
+  const diversity = checkDiversity(variants.map((v) => v.fitness), floor);
+  const advantages = groupRelativeAdvantage(variants.map((v) => ({ specimen: v.variantId, reward: v.fitness })));
+  const ranked = [...advantages].sort((a, b) => b.advantage - a.advantage);
+  print({
+    diversity,
+    winner: diversity.ok ? ranked[0]?.specimen ?? null : null,
+    advantages: ranked,
+    note: diversity.ok
+      ? "Generation has spread; winner is the max-advantage variant."
+      : "VARIANCE COLLAPSE — σ below floor. Do NOT promote; re-sample with forced gene diversity (RC-GRPO).",
+  });
+}
+
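Reviewer note on `harness-select`: `groupRelativeAdvantage` and `checkDiversity` are imported from `grpo.ts` and `diversity.ts`, which this section does not show. The sketch below assumes the common GRPO form, advantage = (reward − group mean) / group σ, and a guard that refuses to rank when σ falls below the floor. Function names, the population (not sample) standard deviation, and the return shape are assumptions.

```typescript
interface Scored {
  id: string;
  reward: number;
}

// Population mean and standard deviation of a non-empty group of rewards.
function groupStats(rewards: number[]): { mean: number; sigma: number } {
  const mean = rewards.reduce((sum, r) => sum + r, 0) / rewards.length;
  const variance = rewards.reduce((sum, r) => sum + (r - mean) * (r - mean), 0) / rewards.length;
  return { mean, sigma: Math.sqrt(variance) };
}

function selectSketch(group: Scored[], floor = 0.02) {
  if (group.length === 0) {
    return { ok: false, sigma: 0, winner: null as string | null, advantages: [] as { id: string; advantage: number }[] };
  }
  const { mean, sigma } = groupStats(group.map((g) => g.reward));
  // Variance-collapse guard: with too little spread the ranking is mostly noise.
  const ok = sigma >= floor;
  const advantages = group
    .map((g) => ({ id: g.id, advantage: sigma > 0 ? (g.reward - mean) / sigma : 0 }))
    .sort((a, b) => b.advantage - a.advantage);
  const winner: string | null = ok ? (advantages[0]?.id ?? null) : null;
  return { ok, sigma, winner, advantages };
}

console.log(selectSketch([
  { id: "v1", reward: 0.5 },
  { id: "v2", reward: 0.7 },
  { id: "v3", reward: 0.6 },
]).winner);
```

Returning no winner on collapse mirrors the command's note above: a generation without spread is re-sampled rather than promoted.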
+/**
+ * harness-promote: the six-gate promotion decision (DGM hack-resistance). A
+ * variant becomes incumbent only if it beats the incumbent on held-out fitness
+ * AND is hack-clean on its OWN outputs AND preserved sealing integrity AND
+ * interface parity AND came from a diverse generation AND has a calibrated judge.
+ */
1246
|
+
function harnessPromote(args: Record<string, string>): void {
|
|
1247
|
+
const root = args.root!;
|
|
1248
|
+
const variantId = args.variant;
|
|
1249
|
+
const archive = readArchive(root);
|
|
1250
|
+
const variant = archive.find((e) => e.variantId === variantId);
|
|
1251
|
+
if (!variant) {
|
|
1252
|
+
process.stderr.write(`harness-promote: variant ${variantId} not in archive.\n`);
|
|
1253
|
+
process.exitCode = 1;
|
|
1254
|
+
return;
|
|
1255
|
+
}
|
|
1256
|
+
const bool = (k: string): boolean => args[k] === "true" || args[k] === undefined ? args[k] === "true" : String(args[k]).toLowerCase() === "true";
|
|
1257
|
+
// "Beats incumbent" must compare against the prior incumbent, NOT this variant
|
|
1258
|
+
// itself (it may already be the max-fitness archived entry). Prefer an explicit
|
|
1259
|
+
// baseline fitness from the caller; else the best fitness among OTHER entries.
|
|
1260
|
+
const others = archive.filter((e) => e.variantId !== variantId);
|
|
1261
|
+
const baseline =
|
|
1262
|
+
args["baseline-fitness"] !== undefined && args["baseline-fitness"] !== "true"
|
|
1263
|
+
? Number(args["baseline-fitness"])
|
|
1264
|
+
: others.length
|
|
1265
|
+
? Math.max(...others.map((e) => e.fitness))
|
|
1266
|
+
: -Infinity;
|
|
1267
|
+
const beatsIncumbent = variant.fitness > baseline;
|
|
1268
|
+
// Interface parity: the variant must not change the bridge command surface.
|
|
1269
|
+
const incumbentCommands = BRIDGE_COMMANDS;
|
|
1270
|
+
const variantCommands = readJSONArg<string[]>(args["variant-commands"]) ?? BRIDGE_COMMANDS;
|
|
1271
|
+
const parity = checkParity(incumbentCommands, variantCommands);
|
|
1272
|
+
// Calibrated-verifier gate (0.9.5, fail-closed): the judge that produced this
|
|
1273
|
+
// variant's selection signal must be target-task calibrated before it may steer
|
|
1274
|
+
// promotion (2606.14629 — an uncalibrated verifier silently regresses). A
|
|
1275
|
+
// missing --slice-type, or a slice-type whose blind-accuracy battery has not
|
|
1276
|
+
// run, reads as uncalibrated.
|
|
1277
|
+
const sliceType = args["slice-type"];
|
|
1278
|
+
const profile = readReliabilityProfile(root);
|
|
1279
|
+
const calib = sliceType
|
|
1280
|
+
? calibrationGate(profile, sliceType)
|
|
1281
|
+
: { calibrated: false, reason: "no --slice-type — judge calibration unknown (fail-closed)" };
|
|
1282
|
+
const inputs = {
|
|
1283
|
+
beatsIncumbent,
|
|
1284
|
+
hackClean: bool("hack-clean"),
|
|
1285
|
+
sealOk: bool("seal-ok"),
|
|
1286
|
+
interfaceParity: parity.ok,
|
|
1287
|
+
diversityOk: bool("diversity-ok"),
|
|
1288
|
+
rubricCalibrated: calib.calibrated,
|
|
1289
|
+
};
|
|
1290
|
+
const verdict = promotionGate(inputs);
|
|
1291
|
+
// Record the gate snapshot on the entry (audit), append-rewrite is fine: the
|
|
1292
|
+
// archive is the durable record and this is the gate result for THIS variant.
|
|
1293
|
+
variant.gates = { ...inputs };
|
|
1294
|
+
writeFileSync(join(stzPath(root, "60-harness"), "MANIFEST.json"), JSON.stringify(archive, null, 2) + "\n", "utf8");
|
|
1295
|
+
print({ variantId, inputs, ...verdict, parity, calibration: calib, baselineFitness: baseline === -Infinity ? null : baseline, variantFitness: variant.fitness });
|
|
1296
|
+
}
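The hunk above collects six boolean gates into `inputs` and hands them to `promotionGate`, and it derives `interfaceParity` from `checkParity`. Neither helper's body appears in this part of the diff, so the following is only a minimal sketch of how a fail-closed, all-gates-must-pass guard and a command-surface parity check could look. The names `promotionGateSketch`, `checkParitySketch`, and the `promote`, `failed`, `missing`, and `added` fields are hypothetical, and the "every gate must be true" rule is an assumption rather than the package's confirmed logic.

```typescript
// Hypothetical sketch; the package's real promotionGate/checkParity may differ.
interface GateInputs {
  beatsIncumbent: boolean;
  hackClean: boolean;
  sealOk: boolean;
  interfaceParity: boolean;
  diversityOk: boolean;
  rubricCalibrated: boolean;
}

interface GateVerdict {
  promote: boolean;              // assumed field name
  failed: (keyof GateInputs)[];  // assumed field name: gates that blocked promotion
}

// Assumption: promotion needs every gate to pass (fail-closed on any false).
function promotionGateSketch(inputs: GateInputs): GateVerdict {
  const failed = (Object.keys(inputs) as (keyof GateInputs)[]).filter((k) => !inputs[k]);
  return { promote: failed.length === 0, failed };
}

// Assumption: parity means the variant exposes exactly the incumbent's command names.
function checkParitySketch(
  incumbent: string[],
  variant: string[],
): { ok: boolean; missing: string[]; added: string[] } {
  const inc = new Set(incumbent);
  const va = new Set(variant);
  const missing = incumbent.filter((c) => !va.has(c));
  const added = variant.filter((c) => !inc.has(c));
  return { ok: missing.length === 0 && added.length === 0, missing, added };
}

// Usage: an uncalibrated judge blocks an otherwise clean variant.
const sketchParity = checkParitySketch(["eval", "gate"], ["eval", "gate"]);
const sketchVerdict = promotionGateSketch({
  beatsIncumbent: true,
  hackClean: true,
  sealOk: true,
  interfaceParity: sketchParity.ok,
  diversityOk: true,
  rubricCalibrated: false,
});
```

Under these assumptions `sketchVerdict.promote` is `false` and `sketchVerdict.failed` lists only `rubricCalibrated`, which mirrors the fail-closed behaviour the comments describe for a missing `--slice-type`.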
|
|
1297
|
+
|
|
1298
|
+
/** harness-status: archive summary, incumbent, and meta-loop view. */
|
|
1299
|
+
function harnessStatus(args: Record<string, string>): void {
|
|
1300
|
+
const root = args.root!;
|
|
1301
|
+
const archive = readArchive(root);
|
|
1302
|
+
const inc = incumbent(root);
|
|
1303
|
+
const meta: Pick<MetaState, "generation"> = { generation: archive.length };
|
|
1304
|
+
print({
|
|
1305
|
+
archiveSize: archive.length,
|
|
1306
|
+
incumbent: inc ? { variantId: inc.variantId, fitness: inc.fitness, perSubstrate: inc.perSubstrate } : null,
|
|
1307
|
+
battery: loadBattery(batteryDir(root)).map((m) => m.name),
|
|
1308
|
+
variants: archive.map((e) => ({ variantId: e.variantId, parent: e.parent, fitness: e.fitness, childCount: e.childCount, promoted: e.gates.beatsIncumbent })),
|
|
1309
|
+
meta,
|
|
1310
|
+
});
|
|
1311
|
+
}
|
|
1312
|
+
|
|
1313
|
+
/**
|
|
1314
|
+
* judge-stress: consistency CI check (no labels). Given pairwise judgments re-run
|
|
1315
|
+
* under an order/verbosity swap, score the fraction whose winner is invariant —
|
|
1316
|
+
* a reliability signal grounded in the real cron order-effect. Writes a
|
|
1317
|
+
* per-slice-type profile under 90-audit/judge-reliability.md. NEVER aggregates
|
|
1318
|
+
* multiple judges (naive ensembles amplify bias — arXiv:2505.19477).
|
|
1319
|
+
*/
|
|
1320
|
+
async function judgeStress(args: Record<string, string>): Promise<void> {
|
|
1321
|
+
const pairs = readJSONArg<{ original: string; perturbed: string }[]>(args.pairs);
|
|
1322
|
+
const sliceType = args["slice-type"] ?? "unknown";
|
|
1323
|
+
if (!pairs) {
|
|
1324
|
+
process.stderr.write("judge-stress requires --pairs <[{original,perturbed}]> and optional --slice-type.\n");
|
|
1325
|
+
process.exitCode = 1;
|
|
1326
|
+
return;
|
|
1327
|
+
}
|
|
1328
|
+
const result = consistencyScore(pairs);
|
|
1329
|
+
const bucket = bucketOf(result.score);
|
|
1330
|
+
if (args.root) {
|
|
1331
|
+
// Persist the machine-readable profile the promotion gate consumes. Merge so
|
|
1332
|
+
// a blind-accuracy bucket already written by judge-calibration is preserved
|
|
1333
|
+
// (the two commands own different fields and may run in either order).
|
|
1334
|
+
mergeReliabilityEntry(args.root, { sliceType, consistency: result.score, n: result.total });
|
|
1335
|
+
await writeDoc(args.root, join("90-audit", "judge-reliability.md"), {
|
|
1336
|
+
frontmatter: { summary: `Judge consistency for ${sliceType}: ${(result.score * 100).toFixed(0)}% invariant under perturbation (n=${result.total}, ${bucket}).` },
|
|
1337
|
+
body:
|
|
1338
|
+
`# Judge reliability profile\n\n` +
|
|
1339
|
+
`Single robust judge, stress-tested for consistency (NO naive ensembling — more judges amplify bias).\n\n` +
|
|
1340
|
+
`- **slice-type:** ${sliceType}\n- **consistency (order/verbosity invariance):** ${result.invariant}/${result.total} = ${result.score.toFixed(3)} (${bucket})\n` +
|
|
1341
|
+
`- **blind-battery accuracy:** pending (must be authored blind to judge rationales — a self-built battery is circular)\n\n` +
|
|
1342
|
+
`Below ${0.7} ⇒ down-weight the judge for this slice-type and lean on the sealed/truth divergence backstop.\n`,
|
|
1343
|
+
...({} as Record<string, never>),
|
|
1344
|
+
});
|
|
1345
|
+
}
|
|
1346
|
+
print({ sliceType, ...result, bucket });
|
|
1347
|
+
}
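`judgeStress` relies on `consistencyScore`, whose body is not part of this hunk. Based only on the doc comment (the fraction of pairs whose winner is invariant under an order/verbosity swap) and on the fields the caller reads (`invariant`, `total`, `score`), a minimal sketch might look like the following. The name `consistencyScoreSketch` and the choice to return a score of 0 for an empty input are assumptions.

```typescript
// Hypothetical sketch of a label-free consistency score.
interface JudgedPair {
  original: string;  // winner picked on the original presentation
  perturbed: string; // winner picked after the order/verbosity swap
}

interface ConsistencyResult {
  invariant: number; // pairs whose winner did not change
  total: number;     // pairs examined
  score: number;     // invariant / total
}

function consistencyScoreSketch(pairs: JudgedPair[]): ConsistencyResult {
  const total = pairs.length;
  const invariant = pairs.filter((p) => p.original === p.perturbed).length;
  // Assumption: an empty input scores 0 rather than NaN.
  return { invariant, total, score: total === 0 ? 0 : invariant / total };
}

// Usage: the judge flips its winner on one of four pairs.
const stressResult = consistencyScoreSketch([
  { original: "A", perturbed: "A" },
  { original: "B", perturbed: "B" },
  { original: "A", perturbed: "B" },
  { original: "B", perturbed: "B" },
]);
```

Here `stressResult.score` is 0.75, which sits above the 0.7 cut-off quoted in the generated report body.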
|
|
1348
|
+
|
|
1349
|
+
/**
|
|
1350
|
+
* judge-calibration (0.9.5): measure the judge's TARGET-TASK accuracy on a blind,
|
|
1351
|
+
* pre-registered ground-truth battery and persist the bucket. This is the
|
|
1352
|
+
* calibration 2606.14629 requires BEFORE a verifier may steer promotion: a judge
|
|
1353
|
+
* that is above-threshold on one slice-type can be sub-threshold on another, and
|
|
1354
|
+
* a confident-but-wrong verifier regresses worse than a random one. The agent
|
|
1355
|
+
* layer runs the judge on the blind battery and passes its picks (`--verdicts`)
|
|
1356
|
+
* alongside the ground-truth labels (`--labels`); the bridge owns the arithmetic
|
|
1357
|
+
* (no model call — N6). Writes `blindAccuracyBucket` into the same per-slice-type
|
|
1358
|
+
* profile entry judge-stress fills, merge-preserving its consistency field.
|
|
1359
|
+
*/
|
|
1360
|
+
function judgeCalibration(args: Record<string, string>): void {
|
|
1361
|
+
const root = args.root;
|
|
1362
|
+
const sliceType = args["slice-type"];
|
|
1363
|
+
const verdicts = readJSONArg<string[]>(args.verdicts);
|
|
1364
|
+
const labels = readJSONArg<string[]>(args.labels);
|
|
1365
|
+
if (!root || !sliceType || !verdicts || !labels || verdicts.length !== labels.length || verdicts.length === 0) {
|
|
1366
|
+
process.stderr.write(
|
|
1367
|
+
"judge-calibration requires --root, --slice-type, --verdicts <[picked]>, --labels <[groundTruth]> (equal, non-empty arrays).\n",
|
|
1368
|
+
);
|
|
1369
|
+
process.exitCode = 1;
|
|
1370
|
+
return;
|
|
1371
|
+
}
|
|
1372
|
+
const correct = verdicts.filter((v, i) => v === labels[i]).length;
|
|
1373
|
+
const accuracy = correct / verdicts.length;
|
|
1374
|
+
const bucket = bucketOf(accuracy);
|
|
1375
|
+
mergeReliabilityEntry(root, { sliceType, blindAccuracyBucket: bucket, n: verdicts.length });
|
|
1376
|
+
print({ sliceType, accuracy, bucket, correct, n: verdicts.length });
|
|
1377
|
+
}
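`judgeCalibration` computes plain accuracy over the blind battery and then merges the resulting bucket into the profile entry that `judge-stress` also writes. The merge helper is not shown in this hunk, so the following is a sketch of the arithmetic plus one way a field-preserving merge could behave. `blindAccuracySketch`, `mergeEntrySketch`, the in-memory array standing in for the on-disk profile, and the bucket label `"high"` are all hypothetical.

```typescript
// Hypothetical sketch; the real mergeReliabilityEntry persists to disk and may differ.
interface ReliabilityEntry {
  sliceType: string;
  consistency?: number;          // owned by judge-stress
  blindAccuracyBucket?: string;  // owned by judge-calibration
  n?: number;
}

// Fraction of judge picks that match the pre-registered ground-truth labels.
function blindAccuracySketch(verdicts: string[], labels: string[]): number {
  if (verdicts.length === 0 || verdicts.length !== labels.length) {
    throw new Error("verdicts and labels must be equal-length, non-empty arrays");
  }
  const correct = verdicts.filter((v, i) => v === labels[i]).length;
  return correct / verdicts.length;
}

// Merge a partial entry by slice-type, keeping fields the patch does not mention.
function mergeEntrySketch(profile: ReliabilityEntry[], patch: ReliabilityEntry): ReliabilityEntry[] {
  const exists = profile.some((e) => e.sliceType === patch.sliceType);
  if (!exists) return [...profile, { ...patch }];
  return profile.map((e) => (e.sliceType === patch.sliceType ? { ...e, ...patch } : e));
}

// Usage: judge-stress writes first, judge-calibration second; both fields survive.
const calibAccuracy = blindAccuracySketch(["a", "b", "a", "b"], ["a", "b", "b", "b"]);
let sketchProfile: ReliabilityEntry[] = [];
sketchProfile = mergeEntrySketch(sketchProfile, { sliceType: "parser", consistency: 0.9, n: 20 });
sketchProfile = mergeEntrySketch(sketchProfile, { sliceType: "parser", blindAccuracyBucket: "high", n: 4 });
```

In this sketch the shared `n` field is overwritten by whichever command ran last. Both commands in the hunk pass an `n`, so whether the real helper keeps separate sample counts is worth checking in the package source.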
|
|
1378
|
+
|
|
1379
|
+
/** The pinned bridge command surface — the interface a variant must preserve. */
|
|
1380
|
+
const BRIDGE_COMMANDS = [
|
|
1381
|
+
"version", "begin", "record-eval", "eval", "gate", "escalate", "record-votes", "select", "finalize",
|
|
1382
|
+
"project-init", "project-phase", "project-write-intent", "project-record-area", "project-set-config",
|
|
1383
|
+
"project-dark-factory", "project-config", "slice-add", "project-seed-slices", "project-status", "summary",
|
|
1384
|
+
"seal", "seal-verify", "seal-crosscheck", "seal-amend", "merge-validate", "merge-compat-propose",
|
|
1385
|
+
"merge-compat-approve", "merge-compat-retire", "merge-compat-list",
|
|
1386
|
+
"inject", "harness-mine", "harness-promote-mutator", "harness-spawn", "harness-fitness", "harness-select",
|
|
1387
|
+
"harness-promote", "harness-status", "judge-stress", "judge-calibration",
|
|
1388
|
+
];
|
|
1389
|
+
|
|
1042
1390
|
export async function runBridge(argv: string[]): Promise<void> {
|
|
1043
1391
|
const [sub, ...rest] = argv;
|
|
1044
1392
|
const args = parseArgs(rest);
|
|
@@ -1072,6 +1420,17 @@ export async function runBridge(argv: string[]): Promise<void> {
|
|
|
1072
1420
|
case "merge-compat-approve": await mergeCompatApprove(args); break;
|
|
1073
1421
|
case "merge-compat-retire": await mergeCompatRetire(args); break;
|
|
1074
1422
|
case "merge-compat-list": mergeCompatList(args); break;
|
|
1423
|
+
// ── 0.9.0 harness-level RSI meta-loop ──────────────────────────────────
|
|
1424
|
+
case "inject": injectCmd(args); break;
|
|
1425
|
+
case "harness-mine": harnessMine(args); break;
|
|
1426
|
+
case "harness-promote-mutator": await harnessPromoteMutator(args); break;
|
|
1427
|
+
case "harness-spawn": harnessSpawn(args); break;
|
|
1428
|
+
case "harness-fitness": await harnessFitness(args); break;
|
|
1429
|
+
case "harness-select": harnessSelect(args); break;
|
|
1430
|
+
case "harness-promote": harnessPromote(args); break;
|
|
1431
|
+
case "harness-status": harnessStatus(args); break;
|
|
1432
|
+
case "judge-stress": await judgeStress(args); break;
|
|
1433
|
+
case "judge-calibration": judgeCalibration(args); break;
|
|
1075
1434
|
default:
|
|
1076
1435
|
process.stderr.write(`unknown bridge subcommand: ${sub}\n`);
|
|
1077
1436
|
process.exitCode = 1;
|