voidforge-build 23.19.0 → 23.21.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/.claude/agents/celebrimbor-forge-artist.md +1 -0
- package/dist/.claude/agents/ducem-token-economics.md +1 -0
- package/dist/.claude/agents/galadriel-frontend.md +1 -0
- package/dist/.claude/agents/romanoff-integrations.md +4 -0
- package/dist/.claude/agents/silver-surfer-herald.md +19 -4
- package/dist/.claude/commands/architect.md +4 -3
- package/dist/.claude/commands/assemble.md +12 -0
- package/dist/.claude/commands/assess.md +1 -0
- package/dist/.claude/commands/build.md +8 -0
- package/dist/.claude/commands/contextmeter.md +56 -0
- package/dist/.claude/commands/debrief.md +10 -0
- package/dist/.claude/commands/engage.md +5 -0
- package/dist/.claude/commands/git.md +19 -3
- package/dist/.claude/commands/imagine.md +1 -1
- package/dist/.claude/commands/seal.md +81 -0
- package/dist/.claude/commands/ux.md +13 -0
- package/dist/.claude/workflows/gauntlet.workflow.js +13 -1
- package/dist/CHANGELOG.md +63 -0
- package/dist/CLAUDE.md +10 -1
- package/dist/HOLOCRON.md +16 -2
- package/dist/VERSION.md +3 -1
- package/dist/docs/methods/AI_INTELLIGENCE.md +3 -0
- package/dist/docs/methods/ASSEMBLER.md +12 -0
- package/dist/docs/methods/BUILD_PROTOCOL.md +15 -0
- package/dist/docs/methods/CAMPAIGN.md +11 -0
- package/dist/docs/methods/DEVOPS_ENGINEER.md +66 -0
- package/dist/docs/methods/FIELD_MEDIC.md +1 -0
- package/dist/docs/methods/FORGE_ARTIST.md +3 -4
- package/dist/docs/methods/GAUNTLET.md +6 -0
- package/dist/docs/methods/MUSTER.md +2 -0
- package/dist/docs/methods/PRODUCT_DESIGN_FRONTEND.md +18 -0
- package/dist/docs/methods/QA_ENGINEER.md +21 -1
- package/dist/docs/methods/RELEASE_MANAGER.md +38 -0
- package/dist/docs/methods/SECURITY_AUDITOR.md +11 -1
- package/dist/docs/methods/SUB_AGENTS.md +33 -0
- package/dist/docs/methods/SYSTEMS_ARCHITECT.md +15 -0
- package/dist/docs/methods/TESTING.md +2 -0
- package/dist/docs/methods/TROUBLESHOOTING.md +2 -2
- package/dist/docs/methods/WORKFLOWS.md +14 -0
- package/dist/docs/patterns/ai-prompt-safety.ts +85 -0
- package/dist/docs/patterns/data-pipeline.ts +59 -1
- package/dist/docs/patterns/egress-sandbox.sh +43 -0
- package/dist/docs/patterns/exclusion-set-invariant.md +62 -0
- package/dist/docs/patterns/multi-tenant-property-test.ts +64 -0
- package/dist/docs/patterns/nginx-vhost.conf +156 -0
- package/dist/docs/patterns/oauth-token-lifecycle.ts +21 -0
- package/dist/docs/patterns/post-deploy-probe.sh +115 -0
- package/dist/docs/patterns/rls-test-fixture.py +140 -0
- package/dist/docs/patterns/structural-sql-sentinel.py +134 -0
- package/dist/scripts/statusline/README.md +38 -0
- package/dist/scripts/statusline/context-awareness-hook.sh +53 -0
- package/dist/scripts/statusline/settings-snippet.json +17 -0
- package/dist/scripts/statusline/voidforge-statusline.sh +91 -0
- package/dist/scripts/voidforge.js +69 -6
- package/dist/wizard/lib/claude-md-strategy.d.ts +87 -0
- package/dist/wizard/lib/claude-md-strategy.js +198 -0
- package/dist/wizard/lib/marker.d.ts +48 -1
- package/dist/wizard/lib/marker.js +58 -2
- package/dist/wizard/lib/patterns/oauth-token-lifecycle.d.ts +14 -0
- package/dist/wizard/lib/patterns/oauth-token-lifecycle.js +21 -0
- package/dist/wizard/lib/project-init.js +59 -0
- package/dist/wizard/lib/updater.d.ts +19 -0
- package/dist/wizard/lib/updater.js +84 -33
- package/package.json +2 -2
package/dist/CLAUDE.md
CHANGED
|
@@ -33,6 +33,10 @@ ADR-051 enforces this gate at the hook level (PreToolUse). The prose below is th
|
|
|
33
33
|
|
|
34
34
|
**Hook enforcement (ADR-051 Phase 5b — live as of v23.8.14; state relocated per ADR-060 in v23.8.18).** A `PreToolUse` hook on the **Agent and Workflow tools** (`scripts/surfer-gate/check.sh`; Workflow added per ADR-064) blocks any sub-agent or workflow launch that isn't the Silver Surfer itself, unless a roster has been recorded for this session or a bypass flag is set. State lives at `$XDG_RUNTIME_DIR/voidforge-gate/` (Linux) or `$HOME/.voidforge/gate/` (macOS fallback) — per-user, `0700`. This is the permanent enforcement mechanism. The prose above is a human-readable backup. **Workflow launches are gated identically (ADR-064):** a workflow run requires a recorded roster or a `--light`/`--solo` bypass — workflow-spawned sub-agents are invisible to the per-Agent hook, so gating the launch is what closes that bypass. Build/apply/research workflows that aren't review rosters should set a bypass.
|
|
35
35
|
|
|
36
|
+
**Non-review commands with a fixed roster take the bypass, NOT a Surfer muster (#366 F4).** A command like `/debrief` is NOT in the gated-commands list above — but the hook blocks *every* non-Surfer Agent launch regardless of the list, so its command-prescribed sub-agents (Ezri/O'Brien/Nog/Jake) get blocked too. The fix: any fixed-roster, non-review pipeline runs `[ -x scripts/surfer-gate/bypass.sh ] && bash scripts/surfer-gate/bypass.sh --light || true` BEFORE launching its sub-agents. Its roster is command-prescribed, not cherry-picked, so the gate's anti-cherry-pick purpose doesn't apply — the bypass is correct, not a workaround. (The gated list governs *which commands must muster the Surfer*; it does not exempt unlisted commands from the hook.)
|
|
37
|
+
|
|
38
|
+
**Stale session pointer — auto-repaired (#366 F4, fixed in #384 RC-3).** The repo's session pointer can point at a *dead* session (a prior `/clear`ed or crashed session whose dir still exists). Historically `bypass.sh` then wrote the flag to that dead session's dir — the WRONG one — and the live session's launch still blocked until you re-ran the bypass. **`bypass.sh` now self-repairs:** it reads the live session id from `CLAUDE_CODE_SESSION_ID` (the same id Claude Code passes the `PreToolUse` hook as `session_id` — it equals the live transcript's basename), and when that disagrees with the pointer it repoints the pointer to the live session and writes the flag there. A single `bash scripts/surfer-gate/bypass.sh --light` now lands correctly on the first try; no re-run needed. **Legacy fallback:** on older Claude Code builds that don't export `CLAUDE_CODE_SESSION_ID`, `LIVE_SID` is empty and the prior behavior remains — if the first launch still blocks, re-run the same `bypass.sh --light` line once (the first blocked `check.sh` fire repoints the pointer, so the second write lands correctly).
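The self-repair rule described above can be sketched as a small decision function. This is a minimal Python sketch under stated assumptions, not the shipped `bypass.sh` (which is a shell script): `resolve_gate_session` and its return shape are hypothetical names used for illustration. Only the `CLAUDE_CODE_SESSION_ID` variable, the repoint-on-mismatch behavior, and the legacy fallback come from the text.

```python
from __future__ import annotations

import os


def resolve_gate_session(pointer_sid: str, live_sid: str) -> tuple[str, bool]:
    """Pick the session whose directory should receive the bypass flag.

    Returns (session_id, repoint), where `repoint` says whether the repo's
    session pointer should be rewritten to the live session.
    Hypothetical helper: it mirrors the prose, not the real script.
    """
    if not live_sid:
        # Legacy fallback: older builds do not export the live id,
        # so the pointer is used as-is (it may be stale).
        return pointer_sid, False
    if live_sid != pointer_sid:
        # The pointer references a dead session: repoint and write there.
        return live_sid, True
    return pointer_sid, False


if __name__ == "__main__":
    live = os.environ.get("CLAUDE_CODE_SESSION_ID", "")
    print(resolve_gate_session("old-session", live))
```

The empty-`live_sid` branch is what makes the documented re-run necessary on older builds: with no live id to compare against, the first write can still land in the dead session's directory.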
|
|
39
|
+
|
|
36
40
|
**Orchestrator contract** (you run these Bash commands at the right moments — wrap each in an existence guard so projects on older methodology versions don't error):
|
|
37
41
|
|
|
38
42
|
1. After the Silver Surfer sub-agent returns its roster, and before launching any other Agent: `[ -x scripts/surfer-gate/record-roster.sh ] && bash scripts/surfer-gate/record-roster.sh || true` (optionally pass the roster JSON as the first argument for audit). The existence guard is a defensive no-op for projects that predate v23.10.0 — when the gate started shipping via the npm methodology package per #317.
|
|
@@ -130,6 +134,9 @@ Reference implementations in `/docs/patterns/`. Match these shapes when writing.
|
|
|
130
134
|
- `nginx-vhost.conf` — Cloudflare-Flexible-safe vhost template: security headers, ACME http-01 passthrough, no redirect loop behind CF's flexible SSL (field report #351, #344)
|
|
131
135
|
- `error-message-categorization.tsx` — Categorize errors at the UI boundary (network / auth / validation / server / unknown) before choosing copy, so users see actionable messages not raw internals (field report #351, #343)
|
|
132
136
|
- `codemod-hygiene.md` — after a jscodeshift/recast codemod, strip incidental reformatting so the diff shows only the semantic change (field report #357)
|
|
137
|
+
- `post-deploy-probe.sh` — deploy probe that asserts response content + Content-Type, not HTTP status only, so an SPA catch-all serving index.html for every path can't false-pass into a rollback (field report #371)
|
|
138
|
+
- `exclusion-set-invariant.md` — superset invariant for multi-mechanism exclusion sets: one canonical secret/PII set with `.gitignore` / rsync / scanner derived from it (or a CI assertion) so the three never drift (field report #377)
|
|
139
|
+
- `egress-sandbox.sh` — egress-confined workload (`systemd-run` `IPAddress*` cgroup filter) that drops to the invoking uid/gid so artifacts stay user-owned, not root-owned, while network confinement is preserved (field report #382)
|
|
133
140
|
|
|
134
141
|
## Slash Commands
|
|
135
142
|
|
|
@@ -166,6 +173,8 @@ Reference implementations in `/docs/patterns/`. Match these shapes when writing.
|
|
|
166
173
|
| `/portfolio` | Steris's cross-project financials — aggregated spend/revenue, portfolio optimization | Full |
|
|
167
174
|
| `/ai` | Seldon's AI Intelligence Audit — model selection, prompts, tool-use, orchestration, safety, evals | All |
|
|
168
175
|
| `/vault` | Seldon's Time Vault — distill session intelligence into portable briefing for session handoff | All |
|
|
176
|
+
| `/seal` | Session closeout ritual — orchestrates `/git` commit → `/git` push → `/debrief --submit` → `/vault --seal`, then always prints the copy-paste next-session handoff prompt | All |
|
|
177
|
+
| `/contextmeter` | Ducem Barr's context budget meter — installs a context-usage status line (colored meter) + a `UserPromptSubmit` hook that warns Claude itself as the window fills (named to avoid the native `/statusline`/`/context` collision) | All |
|
|
169
178
|
|
|
170
179
|
**Tier key:** `All` = works everywhere. `Full` = requires the wizard server (`packages/voidforge/wizard/server.ts`). Full-tier commands offer to install the wizard if not present.
|
|
171
180
|
|
|
@@ -254,7 +263,7 @@ See `/docs/methods/MUSTER.md` for the full Muster Protocol.
|
|
|
254
263
|
| **Learnings** | `/docs/LEARNINGS.md` | Project-scoped operational knowledge — read at session start if exists |
|
|
255
264
|
| **The Muster** | `/docs/methods/MUSTER.md` | When using `--muster` flag on any command |
|
|
256
265
|
| **Time Vault** | `/docs/methods/TIME_VAULT.md` | Seldon — when preserving session intelligence for transfer |
|
|
257
|
-
| **Patterns** | `/docs/patterns/` | When writing code (
|
|
266
|
+
| **Patterns** | `/docs/patterns/` | When writing code (56 reference implementations) |
|
|
258
267
|
| **Lessons** | `/docs/LESSONS.md` | Cross-project learnings |
|
|
259
268
|
| **Workflows** | `/docs/methods/WORKFLOWS.md` | Dynamic Workflow authoring standard (ADR-067) — when to use, API, gotchas, the ADR-064 gate-launch sequence |
|
|
260
269
|
| **Native Capabilities** | `/docs/NATIVE_CAPABILITIES.md` | Command × native-skill collision tracker (ADR-066) — re-audit each release |
|
package/dist/HOLOCRON.md
CHANGED
|
@@ -96,7 +96,7 @@ Or manually: copy CLAUDE.md, .claude/, and docs/ from the npm package into your
|
|
|
96
96
|
|
|
97
97
|
Every tier includes:
|
|
98
98
|
- **CLAUDE.md** — Root context loaded at every session start
|
|
99
|
-
- **
|
|
99
|
+
- **33 slash commands** (31 primary + 2 permanent aliases: `/review` → `/engage`, `/security` → `/sentinel`) — `/prd`, `/blueprint`, `/build`, `/qa`, `/test`, `/sentinel`, `/ux`, `/engage`, `/deploy`, `/devops`, `/architect`, `/assess`, `/git`, `/void`, `/vault`, `/seal`, `/contextmeter`, `/thumper`, `/assemble`, `/gauntlet`, `/campaign`, `/imagine`, `/debrief`, `/audit-docs`, `/dangerroom`, `/cultivation`, `/grow`, `/current`, `/treasury`, `/portfolio`, `/ai`. Run `ls .claude/commands/*.md | wc -l` for the live file count.
|
|
100
100
|
- **13-phase build protocol** — PRD to production with verification gates
|
|
101
101
|
- **18 specialist agent protocols** — Each lead has behavioral directives and a sub-agent roster
|
|
102
102
|
- **Named characters** — From Tolkien, Marvel, DC, Star Wars, Star Trek, Dune, Anime, Cosmere, and Foundation — each materialized as a subagent definition in `.claude/agents/`
|
|
@@ -620,9 +620,23 @@ Seldon distills session intelligence into a portable briefing. The Time Vault pr
|
|
|
620
620
|
|
|
621
621
|
Flags: `--seal` (auto-confirm), `--open` (read most recent vault), `--list` (list all vaults), `--for <target>` (tailor for `campaign`, `colleague`, or `trigger`).
|
|
622
622
|
|
|
623
|
+
#### `/seal` — Session Closeout Ritual
|
|
624
|
+
**When:** You're done for the session and want to ship, report, preserve, and hand off in one move.
|
|
625
|
+
|
|
626
|
+
`/seal` is a thin conductor — it runs no new persona, it sequences three you already know: Coulson ships the release (`/git` commit + push), Bashir files the field report upstream (`/debrief --submit`), and Seldon seals the vault (`/vault --seal`). It always ends by printing the copy-paste pickup prompt that boots the next session. The order is deliberate: commit and push first so the report and vault describe the final state; debrief before vault so the vault folds in the debrief's learnings. It short-circuits safely — a failing test suite halts the pipeline before push (you don't ship a broken build or file a success report on it) but still seals a vault recording the blocked state, so the next session resumes on the right foot.
|
|
627
|
+
|
|
628
|
+
Flags: `--dry-run` (preview every stage), `--yes` (no confirmation pause), `--no-push`, `--no-submit`, `--no-debrief`, and `/git` pass-throughs (`--major`/`--minor`/`--patch`/`--no-tag`).
|
|
629
|
+
|
|
630
|
+
#### `/contextmeter` — Context Budget Meter
|
|
631
|
+
**When:** It's installed **by default** on `init` (warn 80% / crit 92%) — run this command only to retune thresholds, re-install on an older project, or uninstall.
|
|
632
|
+
|
|
633
|
+
Installs two small scripts: a **status line** that renders a colored context meter (`⟦████████░░⟧ 78% ctx · 44k left`, green → yellow → red as it fills) for you, and a **`UserPromptSubmit` hook** that injects "you have ~X% context left, checkpoint soon" into Claude's own context once usage crosses the threshold — so the model can `/vault` or `/seal` before compaction instead of being caught out. The meter's colors and the hook's warnings share the same thresholds (yellow/warn 80%, red/critical 92%). The status line reads Claude Code's native `context_window` field (with a transcript fallback); the hook derives usage from the transcript. Named `/contextmeter` rather than `/statusline` because the native `/statusline` and `/context` commands always shadow a same-named project command (`docs/NATIVE_CAPABILITIES.md`). Requires `jq`.
|
|
634
|
+
|
|
635
|
+
Flags: `--warn-pct N` / `--crit-pct N` (hook thresholds), `--window N` (fallback denominator), `--status`, `--uninstall`, `--dry-run`.
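The meter arithmetic and the shared thresholds can be illustrated with a short sketch. This is Python for illustration only; the shipped scripts are shell and use `jq`. The function name `render_meter`, the 10-cell bar, the rounding rule, the inclusive threshold comparisons, and the 200,000-token window in the usage note are all assumptions, chosen so that the result matches the sample meter quoted above.

```python
from __future__ import annotations


def render_meter(used_tokens: int, window: int, warn_pct: int = 80,
                 crit_pct: int = 92, cells: int = 10) -> tuple[str, str]:
    """Return (meter text, color level) for a context-usage reading.

    Hypothetical helper: thresholds default to the documented warn 80 /
    crit 92; treating them as inclusive is an assumption.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    pct = min(100, used_tokens * 100 // window)
    # Round to the nearest cell using integer arithmetic.
    filled = (pct * cells + 50) // 100
    left_k = max(0, window - used_tokens) // 1000
    bar = "█" * filled + "░" * (cells - filled)
    if pct >= crit_pct:
        level = "red"
    elif pct >= warn_pct:
        level = "yellow"
    else:
        level = "green"
    return f"⟦{bar}⟧ {pct}% ctx · {left_k}k left", level
```

With an assumed 200,000-token window, `render_meter(156_000, 200_000)` yields the sample string from the text at the green level; the same `warn_pct` and `crit_pct` values would drive the hook's warnings, which is why the two stay in step.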
|
|
636
|
+
|
|
623
637
|
### Flag System
|
|
624
638
|
|
|
625
|
-
VoidForge flags are standardized across all
|
|
639
|
+
VoidForge flags are standardized across all 33 commands. Same flag = same meaning everywhere.
|
|
626
640
|
|
|
627
641
|
**Tier 1 — Universal:** `--resume` (resume from state), `--plan` (plan without executing), `--fast` (reduced review passes), `--dry-run` (preview without doing), `--status` (show state), `--blitz` (autonomous, no pauses)
|
|
628
642
|
|
package/dist/VERSION.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# Version
|
|
2
2
|
|
|
3
|
-
**Current:** 23.
|
|
3
|
+
**Current:** 23.21.0
|
|
4
4
|
|
|
5
5
|
## Versioning Scheme
|
|
6
6
|
|
|
@@ -14,6 +14,8 @@ This project uses [Semantic Versioning](https://semver.org/):
|
|
|
14
14
|
|
|
15
15
|
| Version | Date | Summary |
|
|
16
16
|
|---------|------|---------|
|
|
17
|
+
| 23.21.0 | 2026-06-24 | **Triaged field reports #382 / #383 / #384 → `/seal`-hardening + DevOps/QA/orchestration fixes + a pattern-distribution gap.** **#384:** release Step 0 unrelated/pre-existing-change detection (`/git`, `/seal`, `RELEASE_MANAGER.md`) — split session-authored vs out-of-scope changes, dependency-manifest scrutiny, never `git add -A` (the `vercel` near-miss, mechanized); creation-time native-collision gate (`BUILD_PROTOCOL.md`, `NATIVE_CAPABILITIES.md`) — check a new command's name + add its row at creation, not release re-audit; `bypass.sh` stale-pointer self-repair via `CLAUDE_CODE_SESSION_ID` (repoints a dead-session pointer on the first try, no re-run; +4 tests, gate 27→31). **#382:** DevOps ACL-traverse enumeration + post-lock prod-FE 200 check; new `egress-sandbox.sh` pattern (`systemd-run --uid/--gid`, uid-independent egress); SUB_AGENTS spend ceiling reserves in-flight child budget; QA coverage-fidelity honesty + per-lane ledger; headless-OAuth bootstrap note. **Distribution:** `prepack.sh`/`copy-assets.sh` now ship every `docs/patterns/` file regardless of extension (`.sh`/`.py`/`.conf` were silently dropped — LRN-11 gap). #383 closed as already-shipped. Build clean, suite 1420. Dep `^23.20.0` → `^23.21.0`. |
|
|
18
|
+
| 23.20.0 | 2026-06-23 | **Triaged 12 upstream field reports (#364–#378) → methodology hardening, + `/seal` and `/contextmeter`.** Applied every accepted fix across ~41 method docs / agents / patterns / commands (throughput/scale gates, ADR concurrency verification, deny-list discipline, runtime-path tracers, render-gate coverage, OAuth/external-claim verification, HTTP two-principal isolation, dall-e→gpt-image-1 currency, `/debrief` gate-gap docs) + 2 new patterns (`post-deploy-probe.sh`, `exclusion-set-invariant.md`); implemented the two wizard reports — non-destructive CLAUDE.md `update` merge (#368) + legacy-marker detection (#369). **New `/seal`** — session closeout (git → debrief → vault → handoff). **New `/contextmeter`** — context-budget meter + `UserPromptSubmit` awareness hook, default-on (warn 80% / crit 92%), `scripts/statusline/` wired through all four distribution paths + npm `files`. Build clean, suite 1392→1420. Dep `^23.19.0` → `^23.20.0`. |
|
|
17
19
|
| 23.19.0 | 2026-06-13 | **Gauntlet acceptance test → 14 fixes (the ADR-067 re-platform, validated by running it on itself).** Ran the new `gauntlet.workflow.js` live on the v23.13–v23.18 platform code (10-agent Surfer roster → 347 agents → 99 distinct claims → 66 confirmed + 24 crossfire, 0 Critical) and fixed the 3-lens-confirmed findings. **Gate (security):** `_paths.sh` reap was missing `-mindepth 1` → could `rm -rf` the entire `sessions/` tree (every live roster/bypass); the reaper now refreshes `$SESSION_DIR` mtime on activity + threshold raised above the TTL, closing the documented reap-vs-fresh-roster/bypass race; `shasum`→`sha256sum` fallback (gate silently broke on Alpine); `bypass.sh` run before the first hook fire now records a repo-scoped *pending* bypass that `check.sh` promotes (was a silent no-op). **Workflows:** strike no longer re-runs the same ≤5-agent roster twice; crossfire `survives:true+REFUTED` verdicts no longer vanish into no bucket (logged in `crossfireRefutedLog`); dedup keeps the **highest** severity + `raisedBy` (was first-write-wins); guarded `JSON.parse(args)`; undefined-domain prompt guard. **Distribution:** `npx voidforge-build init` now copies `.claude/workflows/` + `AGENT_CLASSIFICATION.md`; `update` now propagates `.claude/workflows` + `scripts/surfer-gate` (both were stranded). **Validation:** new `scripts/validate-workflows.sh` (wraps the runtime shape, then `node --check`) wired into `pretest` — corrects the false "scripts pass `node --check`" claim and gates syntax errors from shipping. **Docs:** `WORKFLOWS.md` example `agentType: a.id`→`a.name` + new gotchas; stale `/tmp/voidforge-*` paths fixed in gate README + CLAUDE.md (ADR-060). **CI:** `recover-partial` derives the version from `package.json` not `github.ref_name` (broke on dispatch); Playwright cache key off the committed manifests not the regenerated lockfile. Gate suite 23→27, full suite 1390→1392. Deferred (field-report candidates): concurrent same-repo pointer collision, `workflow_dispatch` branch guard. Dep `^23.18.0` → `^23.19.0`. |
|
|
18
20
|
| 23.18.0 | 2026-06-13 | **Workflow re-platform of `/gauntlet` + `/assemble` (ADR-067)** — the opportunity ADR-064 unblocked. New `.claude/workflows/gauntlet.workflow.js` (discovery → JS dedupe → 3-lens adversarial REFUTE → crossfire → council, schema-validated) and `assemble-review.workflow.js` (engage+sentinel over a mission diff; build/arch/devops stay prose). New **`docs/methods/WORKFLOWS.md`** authoring standard (API, the #348/#363 gotchas, 16/1000 caps, and the ADR-064 gate-launch sequence: Surfer→record-roster→Workflow). `gauntlet.md`/`assemble.md` gain workflow-execution sections; personas + fix-application + Debate Protocol stay prose. **Distribution gate (Phase 12.75):** `.claude/workflows/` is a new shared category — added to `prepack.sh` (npm) + `copy-assets.sh` (init) so the scripts ship to consumers. Both scripts `node --check`-validated (ESM async-wrapped); the live end-to-end gauntlet run is the acceptance test. Dep `^23.17.0` → `^23.18.0`. |
|
|
19
21
|
| 23.17.0 | 2026-06-13 | **Effort-tiering fleet edit (ADR-054) — verified + applied.** Verified against the official Claude Code sub-agents docs that `effort` is a supported sub-agent frontmatter field (values `low`/`medium`/`high`/`xhigh`/`max`; "available levels depend on the model"). Applied across all 264 agent definitions: **20 leads (`model: inherit`) → `effort: xhigh`, 201 Sonnet specialists → `effort: medium`, 43 Haiku scouts → omitted** (Haiku doesn't support the parameter). Per-agent reasoning-spend lever, independent of model tier — the largest cost lever in the fleet (200 specialists no longer pay lead-level reasoning for read-and-report review). Frontmatter-only, idempotent insert after the `model:` line; `validate-agent-refs` + full suite (1390/1390) green; integrity preserved. Closes the M2 deferral from v23.16.0. Updated ADR-054 (status→fleet-applied), SUB_AGENTS.md, COMPATIBILITY.md. Dep `^23.16.0` → `^23.17.0`. (Aside: confirmed the v23.16.0 gate fix live — a Workflow launch was correctly BLOCKED this session until a `--light` bypass was set; noted a reap-vs-fresh-bypass timing race for a future field report.) |
|
package/dist/docs/methods/AI_INTELLIGENCE.md
CHANGED
|
@@ -143,6 +143,7 @@ Run sequentially — each builds on findings from parallel phase:
|
|
|
143
143
|
- Are system prompts deduplicated across requests?
|
|
144
144
|
- Is streaming used where appropriate? (Time to first token)
|
|
145
145
|
- Estimated monthly cost at projected volume?
|
|
146
|
+
- **Are hardcoded per-token cost constants verified against CURRENT provider pricing?** Per-token LLM rates are a STALENESS LIABILITY: models get retired and repriced, so a constant that was right at build time silently rots. Whenever touching cost-tracking or cost-cap code, verify every per-token rate against the provider's live pricing — do not trust the value in the repo, the PRD, or a prior vault. A stale rate mis-records COGS and mis-sets margin guards: field report #364 found Opus hardcoded at $15/$75 per 1M tokens against an actual current $5/$25 — a 3× over-statement that inflated every recorded generation cost and set AI-cost caps *above* subscription revenue (a live margin leak). (Field report #364)
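The 3× over-statement in the bullet above follows directly from the rate ratio, as a small worked calculation shows. This is a minimal sketch: `generation_cost` and the token counts are hypothetical, and the two rate pairs are the figures quoted from field report #364, not a statement of any provider's current pricing.

```python
from decimal import Decimal


def generation_cost(input_tokens: int, output_tokens: int,
                    in_rate_per_m: str, out_rate_per_m: str) -> Decimal:
    """Cost in dollars, given per-1M-token rates passed as decimal strings."""
    total = (Decimal(input_tokens) * Decimal(in_rate_per_m)
             + Decimal(output_tokens) * Decimal(out_rate_per_m))
    return total / Decimal(1_000_000)


# Hypothetical request size; the rates are the pairs quoted in the report.
stale = generation_cost(10_000, 2_000, "15", "75")   # hardcoded constants
actual = generation_cost(10_000, 2_000, "5", "25")   # rates the report found live
overstatement = stale / actual
```

Because both the input and output rates are off by the same factor, `overstatement` is 3 for any token mix, so every recorded cost and every cap derived from it inherits the same error.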
|
|
146
147
|
|
|
147
148
|
**Bayta Darell (Evaluation):** Quality measurement.
|
|
148
149
|
- Does an eval exist for each AI component?
|
|
@@ -258,6 +259,8 @@ If issues found, return to Phase 3. Maximum 2 iterations.
|
|
|
258
259
|
- [ ] Human review process for edge cases
|
|
259
260
|
- [ ] LIVE eval layer runs against the real model and passes before launch (sandbox layer alone cannot catch model-output-shape bugs) (field report #352, #4)
|
|
260
261
|
- [ ] Model output normalized null-to-undefined before Zod `.optional()` validation (field report #352, #4)
|
|
262
|
+
- [ ] Safety-eval leak-detector is INDEPENDENT of the production deny-list/filter — it does not re-import the filter's own banned-terms or regex. Reusing the filter to test the filter is tautological: every term the filter misses, the eval also misses, so the eval reports PASS on a real leak. Build the leak-detector from a separate oracle (hand-curated banned-phrase set, second model / LLM-judge, or human labels). (Field report #378)
|
|
263
|
+
- [ ] Safety eval includes adversarial cases for all three deny-list false-fire classes: NEGATION ("no accreditation evidence" must PASS — it's the safe answer), PROPER-NOUN (a contact at "Visa" / a fund named "Trust Fund" must not flag on the substring), and HOMOGLYPH / zero-width evasion ("аccredited" with a Cyrillic 'а', or a zero-width-split token, must still be CAUGHT after NFKC normalization). (Field report #378)
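The three false-fire classes can be turned into a small independent oracle, sketched below. Note that NFKC on its own does not rewrite a Cyrillic 'а' to a Latin 'a' and does not remove zero-width characters, so the sketch adds an explicit confusables map and a zero-width strip on top of it. Everything here is an assumption for illustration: the banned term, the negation pattern, the proper-noun list, and the tiny confusables map are hypothetical and hand-picked, and a real oracle would need far broader coverage.

```python
import re
import unicodedata

# Hypothetical, hand-picked values; deliberately NOT imported from any
# production filter, so the oracle stays independent of it.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
CONFUSABLES = str.maketrans(
    {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0441": "c"}
)
PROPER_NOUNS = ("accredited capital llc",)
TERM = re.compile(r"\baccredit\w*")
NEGATED = re.compile(r"\b(?:no|not|never)\s+(?:\w+\s+){0,2}accredit\w*")


def fold(text: str) -> str:
    """NFKC + casefold, then drop zero-width chars and map homoglyphs."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return text.translate(ZERO_WIDTH).translate(CONFUSABLES)


def leaks(text: str) -> bool:
    """True if the banned term survives proper-noun and negation masking."""
    folded = fold(text)
    for noun in PROPER_NOUNS:
        folded = folded.replace(noun, " ")
    folded = NEGATED.sub(" ", folded)
    return bool(TERM.search(folded))


ADVERSARIAL_CASES = [
    ("No accreditation evidence was provided.", False),       # negation
    ("Our contact works at Accredited Capital LLC.", False),  # proper noun
    ("The client is \u0430ccredited.", True),                 # Cyrillic homoglyph
    ("The client is accre\u200bdited.", True),                # zero-width split
]
```

Running each `ADVERSARIAL_CASES` entry through `leaks` gives one pass/fail row per class, and the production filter can then be scored against the same table without sharing any of its terms or regexes.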
|
|
261
264
|
|
|
262
265
|
### AI Gate Bootstrapping (Cold-Start Problem)
|
|
263
266
|
AI-gated approval systems have a cold-start problem: no historical outcomes -> gate rejects all requests -> no operations -> no outcomes. During the first N decisions (configurable, default 20), the gate should approve at reduced size (0.5-0.7x normal) to build a track record. The gate should never reject solely because "no historical data exists." Include explicit prompt guidance: "Lack of history is not a reason to reject — approve at reduced size to build the track record." (Field report #152)
|
package/dist/docs/methods/ASSEMBLER.md
CHANGED
|
@@ -127,6 +127,18 @@ Verify no circular calls between store actions and API methods. Specifically che
|
|
|
127
127
|
|
|
128
128
|
When a feature is added to one surface (API, dashboard, CLI, marketing site), verify all other surfaces displaying the same entities are updated. A new field added to the API response but missing from the dashboard table, or a new tier added to the pricing page but missing from the settings panel, creates an inconsistent product. After each pipeline phase that adds or modifies a feature, grep for the entity name across all surfaces: API routes, React/Vue components, CLI output formatters, marketing page copy, email templates, admin panels. (Triage fix from field report batch #149-#153.)
|
|
129
129
|
|
|
130
|
+
### Render-Gate Regression Coverage (Phase 2.5 smoke + Phase 6 /ux)
|
|
131
|
+
|
|
132
|
+
A green build and a green unit suite do NOT catch render-gate regressions — a removed or renamed prop can silently kill a feature while every automated gate stays green. Example: a component still gates its render on `!token`; the `token` prop is removed (now always `null`); the headline panel becomes invisible to every signed-in user — and `build` plus 97 unit tests all pass, because the compiler and the unit suite never render the gated surface. Only a browser does. (Field report #375.)
|
|
133
|
+
|
|
134
|
+
So when a pipeline change removes or renames a **prop or a shared contract**, the Phase 2.5 smoke and the Phase 6 `/ux` browser/e2e pass must:
|
|
135
|
+
|
|
136
|
+
1. Cover **EVERY surface that consumes the changed prop/contract — not a sampled page.** Grep for the symbol; the consuming-surface list is the screenshot list, not a subset of it.
|
|
137
|
+
2. Explicitly **re-check the render *gates* that key off the changed prop** — for each gate (`!token`, `prop && <Panel/>`, `if (!x) return null`), confirm the gated surface still renders after the change.
|
|
138
|
+
3. Verify each changed component in **BOTH signed-in and signed-out states.**
|
|
139
|
+
|
|
140
|
+
An e2e that exercises a *different* surface than the one that changed does not satisfy the screenshot mandate — it is a coverage gap that ships a dead feature. (Field report #375: removing a browser token prop left `TelegramConnect` gated on a now-always-`null` value; the headline connect panel went invisible to every signed-in user, passed a green build + 97 unit tests, and was caught only by the review roster because the e2e exercised the Account dialog, not the changed surface.)
|
|
141
|
+
|
|
130
142
|
### Phase 13.5 — Doc-Currency Refresh (pre-SEAL)
|
|
131
143
|
|
|
132
144
|
After the Council signs off, but BEFORE Fury seals the run and makes the Deploy Offer, sweep the project's source-of-truth docs for drift introduced over the course of the pipeline. A full `/assemble` touches architecture, features, version, and build state — by the time the Council finishes, the docs that describe the project frequently no longer match it. This mirrors the Doc-Currency Refresh mission in `CAMPAIGN.md`: same checklist, applied once at the end of the pipeline instead of once per mission. (Field report #342 F-1: `/assemble` shipped a Council-clean build whose `CLAUDE.md` Project block and `PROJECT_VERSION` line still described the pre-build scaffold.)
|
package/dist/docs/methods/BUILD_PROTOCOL.md
CHANGED
|
@@ -258,6 +258,9 @@ Grep for deferred wiring comments: `Set after`, `Wire after`, `None #`, `TODO: w
|
|
|
258
258
|
7. Log each batch to `/logs/phase-05-features.md`
|
|
259
259
|
|
|
260
260
|
**Phase 6 — Integrations.**
|
|
261
|
+
|
|
262
|
+
**MANDATORY PRE-INTEGRATION WEB-VERIFICATION (Romanoff owns; field report #364 Finding A).** BEFORE writing any external-API integration code, web-verify the provider's live docs (WebFetch) for: the CURRENT API version, deprecation/sunset notices on the endpoints you plan to call, and the auth requirements (OAuth scopes, token type, developer-token gating). Treat the PRD/vault/plan's API version, endpoint names, and auth specifics as STALE until confirmed — they rot fast across a multi-day campaign on a fast-moving platform (Google/Meta/Stripe ad & billing APIs deprecate aggressively). The post-build live smoke test below is NOT a substitute: it verifies the call you wrote works, not that you wrote against the right (non-deprecated) API. Real case (Kongo M8): a vault plan said "v17→v21, use `uploadClickConversions`" — live docs showed v24 was current, that path was blocked for the project's token 3 days out, and the correct route was a different API with a different scope and no developer token. Building blind would have wasted the whole mission and missed a hard external deadline. Log the verified version/deprecation/auth findings to the phase log before coding.
|
|
263
|
+
|
|
261
264
|
1. Each integration: client wrapper, env vars, test mode, error handling, retry logic
|
|
262
265
|
2. For async work, follow `/docs/patterns/job-queue.ts`
|
|
263
266
|
3. Kenobi reviews each integration
|
|
@@ -357,6 +360,10 @@ If this build introduces a new shared file category (e.g., `.claude/agents/`, a
|
|
|
357
360
|
6. `void.md` — listed in user-facing sync checklist
|
|
358
361
|
Missing even one path means some users silently miss the feature. This gate is mandatory after any structural addition to the methodology. (Field report #297: .claude/agents/ was added to packaging but missed in 3 of 6 delivery paths.)
|
|
359
362
|
|
|
363
|
+
**Four-path distribution discipline (LRN-11, field report #366 F2).** Of the paths above, FOUR are the actual file-copy routes a new category must land in or it strands some installs — verify each by name: (1) `packages/methodology/scripts/prepack.sh` (the npm methodology package), (2) `packages/voidforge/scripts/copy-assets.sh` (the CLI `dist/`), (3) `packages/voidforge/wizard/lib/project-init.ts` `copyMethodology()` (`npx voidforge-build init`), (4) `packages/voidforge/wizard/lib/updater.ts` diff `dirs` (`npx voidforge-build update`). v23.18.0 added `.claude/workflows/` to (1) and (2) only, stranding `init` and `update` — three of that release's bugs trace to the omission. Grep all four for an existing sibling category (`.claude/agents`, `scripts/thumper`) and mirror it.
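The "grep all four" step can be mechanized with a short check. This is a minimal sketch assuming a plain substring match is enough to detect that a category is wired in; the function name `missing_distribution_paths` is hypothetical, while the four file paths are the ones named in the paragraph above.

```python
from pathlib import Path

# The four file-copy routes listed in the text.
DISTRIBUTION_PATHS = (
    "packages/methodology/scripts/prepack.sh",
    "packages/voidforge/scripts/copy-assets.sh",
    "packages/voidforge/wizard/lib/project-init.ts",
    "packages/voidforge/wizard/lib/updater.ts",
)


def missing_distribution_paths(repo_root: str, category: str) -> list[str]:
    """Return the distribution files that never mention `category`.

    Assumption: a substring match is a good enough signal; a file that
    is absent altogether is also reported as missing.
    """
    missing = []
    for rel in DISTRIBUTION_PATHS:
        path = Path(repo_root) / rel
        if not path.is_file():
            missing.append(rel)
            continue
        if category not in path.read_text(encoding="utf-8", errors="replace"):
            missing.append(rel)
    return missing
```

Calling `missing_distribution_paths(".", ".claude/workflows")` from the repository root would list any of the four routes that omit the category; an empty list is the condition a CI assertion could enforce.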
|
|
364
|
+
|
|
365
|
+
**Ship-with-validator rule (field report #366 F2).** A new shipped artifact TYPE must ship, in the same release, with a CI/pretest validator for it — e.g. a new `.claude/workflows/*.js` category ships with `scripts/validate-workflows.sh` wired into `pretest`, plus a regression test that init/update actually copy it. And **a release must never claim a validation it did not run.** v23.18.0 claimed "both scripts pass `node --check`" when their top-level `return`/`await` make a bare `node --check` fail — the claim was false and ran nothing. State the exact command the validator runs; if it cannot run in CI, say so — never assert a green you didn't observe.
|
|
366
|
+
|
|
360
367
|
**Phase 13 — Launch Checklist.**
|
|
361
368
|
All flows in production. SSL. Email. Payments. Analytics. Monitoring. Backups. Security headers. Legal. Performance. Mobile. Accessibility. Tests passing. **Build-time env var verification:** For every new `NEXT_PUBLIC_*` / `VITE_*` / `REACT_APP_*` reference introduced during this build, verify the variable exists in `.env` or the deploy environment. Missing build-time vars cause features to silently disappear without errors. (Field report #104) Log final status to `/logs/phase-13-launch.md`.
|
|
362
369
|
|
|
@@ -443,6 +450,14 @@ Examples of batches that are too big:
|
|
|
443
450
|
|
|
444
451
|
---
|
|
445
452
|
|
|
453
|
+
## Authoring a New Slash Command — Creation-Time Native-Collision Gate (field report #384 RC-2)
|
|
454
|
+
|
|
455
|
+
When you create a new VoidForge slash command (a new `.claude/commands/*.md`), the **native-collision check happens at creation time, not at release re-audit.** ADR-066's `docs/NATIVE_CAPABILITIES.md` re-audit is a release-time backstop; relying on it alone let `/contextmeter` get built as `/statusline` and then renamed mid-build — after docs and scripts already referenced the dead name. Shift the check left:
|
|
456
|
+
|
|
457
|
+
1. **Before writing the file, check the proposed name against the native command/skill set.** Claude Code ships native slash commands and skills (e.g. `/init`, `/review`, `/security-review`, `/code-review`, `/test`, `/commit`, `/statusline`, `/context`, `/deep-research`, plus built-ins). On surfaces with project-local resolution a same-named `.claude/commands/*.md` wins, but on surfaces without it (claude.ai web, some IDE extensions) a colliding **native** capability shadows ours — running ungated and without VoidForge semantics. Consult `docs/NATIVE_CAPABILITIES.md` for the currently-tracked native set.
|
|
458
|
+
2. **If the name collides, resolve it before the name propagates.** Either rename to a non-colliding name (the `/statusline`→`/contextmeter`, `/review`→`/engage`, `/security`→`/sentinel` precedent) or, if coexistence is deliberate, record a `coexist + document` disposition. Decide *before* writing the file, so no doc/script ever references a name you'll have to retract.
|
|
459
|
+
3. **Add the `NATIVE_CAPABILITIES.md` row as part of creating the command** — same commit, not a later audit. The ADR-066 coverage rule (every `.claude/commands/*.md` has a row) is then satisfied at creation, and the release-time re-audit becomes a confirmation rather than a discovery.
|
|
460
|
+
|
|
446
461
|
## Principles
|
|
447
462
|
|
|
448
463
|
1. PRD is source of truth. Agents don't override product decisions. If the PRD is ambiguous, flag it and present options — don't decide product direction.
|
|
@@ -304,6 +304,8 @@ User confirms, redirects, or overrides. On confirm → Step 4.
|
|
|
304
304
|
|
|
305
305
|
**Post-infrastructure enforcement gate:** For infrastructure campaigns (deploy targets, CI/CD, monitoring, staging environments): after the infrastructure is provisioned, run `/architect --plan` to verify workflow enforcement gates exist — not just infrastructure existence. Infrastructure without process gates is incomplete.
|
|
306
306
|
|
|
307
|
+
**Dark-flag activation is gated on the comprehensive REVIEW, not the deploy (field report #373 #1).** When a mission builds a feature behind a dark flag — deploy disabled, then enable via a flag flip (often paired with a contract/data migration) — the gate on the **flag flip** is the comprehensive adversarial review of the dark code, NOT the deploy of the dark code. Sequence the mission as **build dark → review the dark code (Gauntlet checkpoint or full `/assemble` review round) → flip the flag only after the review's Critical/High findings are closed.** Never flip-then-review: a feature activated before its review ships review-findable bugs to live users, and a subsequent Gauntlet finds them already in production. The activation review MUST exercise the **partial/edge interaction states the feature introduces** (partial confirm, reject-all, edit-then-act, account-switch mid-action) — a green happy-path smoke is necessary, not sufficient, because the new bugs live in the partial/edge paths. (Field report #373: a mission built an ADR dark, flipped the flag + ran a contract migration, and ONLY THEN ran the Gauntlet — which found 6 live blockers; a follow-on `/assemble` found 2 more, also live. 8 review-findable bugs shipped to prod purely from the dark→activate→review ordering.)
|
|
308
|
+
|
|
307
309
|
**Silver Surfer gate fires at the REVIEW phase, not the solo build.** Within a mission, the gate (ADR-051 PreToolUse hook on the Agent tool) engages when Fury deploys the review/audit roster as sub-agents — NOT during the orchestrator's solo build of the mission's code. Solo-build-before-review is intentional, not a skipped gate: parallel agents editing the same tightly-coupled engine files (game loop, state machine, shared service) would clobber each other's edits and produce merge garbage. So the orchestrator builds the changeset solo, THEN the Surfer-gated review roster reads it. If you find yourself mid-build asking "did a gate get skipped?", the answer is no — the gate has not fired yet because the review phase has not started. (Field report #348 #3: mid-build confusion over an un-fired gate that fires correctly at the review phase.)
|
|
308
310
|
|
|
309
311
|
### Pre-Prod Verification: when there is no staging
|
|
@@ -459,6 +461,15 @@ Field report #322 (barrierwatch): `statistical-gate.ts` grew 425 → 775 LOC acr
|
|
|
459
461
|
|
|
460
462
|
When a mission duplicates or extends an existing code path (adding a version-aware path alongside a legacy path, adding a new endpoint that mirrors an existing one), verify that security patterns (locking, rate limiting, validation, sanitization) from the original path are replicated in the new path. Grep for the original pattern and confirm it exists in the new code. (Field report #38: optimistic locking in legacy chat edit was not replicated to the version-aware path.)
|
|
461
463
|
|
|
464
|
+
### FR-A5 Isolation Gate — HTTP-level two-principal test + planted-uid red-check
|
|
465
|
+
|
|
466
|
+
When a mission ships or touches user-scoped / multi-tenant data (the FR-A5 "two-user isolation" class — owner data must not leak across principals), the isolation test that gates the mission MUST drive the **real request entry point with two distinct credentials** — not a repository-layer test only. A test that calls the repository/store directly and asserts on its `WHERE user_id = ?` clause goes green even when the HTTP handler **hardcodes `uid`** — because the repo is never the thing that resolves the principal in production; the auth→uid wiring at the handler is. The definition-of-done for an FR-A5 isolation gate is two-part and both parts are mandatory:
|
|
467
|
+
|
|
468
|
+
1. **HTTP-level two-principal assertion.** The test issues requests through the actual request entrypoint (route/handler/middleware stack the production server mounts) with **two distinct credentials** — e.g. owner-token vs. a second session-cookie/second-user-token — and asserts principal A cannot read or mutate principal B's resources (404/403, empty result, or write-rejected, per the project's IDOR contract). Driving the repo's WHERE clause is NOT sufficient; the test must cross the auth→uid resolution seam.
|
|
469
|
+
2. **Planted-uid red-check (explicit).** The gate must include a planted-bug assertion: **hardcoding `uid = <owner>` in the handler MUST turn the test RED.** If pinning the handler's principal to a constant still passes every isolation test, the test is asserting at the wrong layer — it never exercised the auth→uid wiring and provides false confidence. State this red-check explicitly in the mission's acceptance criteria, and verify it by actually planting the constant once and confirming the failure.
|
|
470
|
+
|
|
471
|
+
(Field report #371 #1: an FR-A5 two-user isolation test written against repositories went green, yet hardcoding `uid = 1` in the HTTP handler passed ALL 48 tests — the gap survived until an HTTP-level two-principal test driving the real handler was added at a checkpoint. The repository-layer test never exercised the auth→uid wiring that the bug lived in. See the handler-entry two-principal variant in `docs/patterns/multi-tenant-property-test.*`.)
|
|
472
|
+
|
|
462
473
|
### Minimum Review Guarantee
|
|
463
474
|
|
|
464
475
|
Even in `--fast` mode, each mission gets at least **1 review round** (not 3, but never 0). A single review catches ~80% of issues for 33% of the review cost. Zero reviews in blitz caused 7 Critical+High issues to accumulate undetected across 4 missions — all caught by the Victory Gauntlet but at much higher fix cost. (Field report #28)
|
|
@@ -45,6 +45,8 @@ When the application spawns child processes (workers, background jobs, PTY sessi
|
|
|
45
45
|
|
|
46
46
|
(Field report #57: shell profiles re-injected environment variables that were explicitly filtered from the PTY environment.)
|
|
47
47
|
|
|
48
|
+
**Egress sandbox: drop to the invoking uid, don't run as root.** When you confine a workload's outbound network with `sudo systemd-run -p IPAddressDeny=…`, pass `--uid`/`--gid` to run it as the invoking user. `IPAddress*` filtering is a cgroup property and is uid-independent, so confinement is fully preserved while artifacts stay user-owned — running as root litters root-owned state that breaks a sibling tool run later as the normal user. See `docs/patterns/egress-sandbox.sh` (field report #382 RC-2).
|
|
49
|
+
|
|
48
50
|
See NAMING_REGISTRY.md for 70+ additional characters.
|
|
49
51
|
|
|
50
52
|
## Goal
|
|
@@ -328,6 +330,16 @@ If health check fails after deploy:
|
|
|
328
330
|
|
|
329
331
|
(Field report #97: 3 campaigns of Dialog Travel code never reached production because no deploy step existed.)
|
|
330
332
|
|
|
333
|
+
### Arming / go-no-go gates must run the REAL production launch path
|
|
334
|
+
|
|
335
|
+
A go/no-go gate — a "tracer bullet" that authorizes arming an autonomous component, or a deploy gate that authorizes cutover — is only meaningful if it crosses the **same production seam** the live system uses. A tracer that runs through a dev shortcut and reports CLEAN gives **false confidence**: it proves a path that production never takes.
|
|
336
|
+
|
|
337
|
+
The recurring failure: the gate executes via a dev/current-user code path — an `unsafe_run_as_current_user`-style flag, a `--dev`/`--local` launcher, the test harness's in-process invocation — and never crosses the privileged hand-off (the `sudo`/`setuid` drop, the systemd `User=`/`ExecStart`, the scheduler's spawn) that the production scheduler actually uses. It passes CLEAN while production is broken, because the broken seam was the one the tracer skipped.
|
|
338
|
+
|
|
339
|
+
**Rule:** the gate that authorizes arming or deploy must exercise the **real entrypoint, the real OS user, the real privilege drop, and the real process model** — the production seam, end to end — not a dev bypass. Same systemd unit (or the same scheduler/launcher), same environment construction, same user/privilege transition, same hand-off. If the production path drops privileges and re-execs under a service account, the tracer must too; a tracer that stays in the current user's shell is testing a different program.
|
|
340
|
+
|
|
341
|
+
**Checklist item (add to the deploy/arming go-no-go):** *"Tracer exercised the production seam — real entrypoint, real user, real privilege drop, real hand-off — not a dev/`--unsafe-current-user` shortcut."* If you cannot answer yes, the gate has not run; a CLEAN from a dev path is a false CLEAN. (Field report #377: an arming tracer ran via a current-user dev path, reported CLEAN, and gated the arming — the system's first real scheduled run then failed because the production privileged hand-off it had skipped was broken.)
|
|
342
|
+
|
|
331
343
|
## Load Testing (Pre-Launch)
|
|
332
344
|
|
|
333
345
|
**When to load test:**
|
|
@@ -396,6 +408,35 @@ When staging and production coexist on the same server, enforce full isolation:
|
|
|
396
408
|
|
|
397
409
|
Convention isn't enough — enforcement is. The pre-push hook is the single most effective protection. (Field report #241: 68-hour production outage from shared infrastructure.)
|
|
398
410
|
|
|
411
|
+
### Locking a shared parent dir: enumerate every traversing service account (field report #382, HIGH)
|
|
412
|
+
|
|
413
|
+
A QA/security isolation step that revokes world-traverse on a home or parent directory can silently break any *other* service whose path runs through it — a cross-phase landmine that surfaces hours later. Real incident: `/home/ubuntu` was set to `0750` plus a traverse ACL granting only the QA containment users (`us-qa`/`us-healer`); nginx (`www-data`) serves the production SPA from a directory *under* `/home/ubuntu`, lost traverse, and 500'd the moment its workers recycled (logrotate).
|
|
414
|
+
|
|
415
|
+
**Rule — when you revoke world-traverse (`o-x`, `0750`, a restrictive ACL) on any home or parent directory, enumerate EVERY service account that traverses it and grant each an explicit traverse ACL.** The web server (`www-data`/`nginx`), the app/runtime user, the process manager, backup/cron users — any uid whose working path descends through the locked dir needs `setfacl -m u:<svc>:--x <dir>` (execute/traverse only, NOT read). Then **verify with a live request, not the ACL listing**: after locking the dir, `curl` the production front-end (and any other co-located service) and assert `200`. `getfacl` proves the ACL is *set*; only a live request proves nothing downstream *broke*. Add to the containment runbook: "after locking a home/parent dir → curl prod FE → assert 200."
|
|
416
|
+
|
|
417
|
+
### Promote gate must verify the staging server's DEPLOYED COMMIT == branch HEAD
|
|
418
|
+
|
|
419
|
+
A staging-first promote gate that checks only "staging branch is ahead of main" + "staging health endpoint returns 200" + "version was bumped" is **structurally blind to the one thing staging-first exists to guarantee**: that the code being promoted actually *ran on staging*. Branch-ahead proves the commit was *pushed*; health-200 proves *some* build is up. Neither proves the staging **server** is running the commit being promoted — a push-to-branch without a redeploy leaves the server lagging the branch, and "ahead + 200" both still pass. Promote at that point and you ship commits to prod that **never executed on staging** — the exact failure staging-first is built to prevent (and the same shape as the "deployed but never reloaded" stale-build outage elsewhere in this doc).
|
|
420
|
+
|
|
421
|
+
**The gate (two required parts):**
|
|
422
|
+
|
|
423
|
+
1. **Expose the running build's commit on the health/status endpoint.** A health body of `{status, checks, responseMs}` with no commit/version field gives nothing machine-checkable to promote against. Add the deployed git SHA (or version) to the payload — this is the same build-fingerprint discipline as §Build Staleness Detection, applied to the promote decision:
|
|
424
|
+
```json
|
|
425
|
+
{ "status": "ok", "commit": "abc1234", "version": "v5.5.1", "checks": { ... } }
|
|
426
|
+
```
|
|
427
|
+
2. **`promote.sh` compares the staging server's reported commit against the branch HEAD being promoted, and BLOCKS on mismatch.** Health-200 + branch-ahead is necessary, not sufficient — the *server* must be running the code being promoted.
|
|
428
|
+
```bash
|
|
429
|
+
STAGING_COMMIT="$(curl -s "https://$STAGING_HOST/api/health" | jq -r '.commit')"
|
|
430
|
+
BRANCH_HEAD="$(git rev-parse --short "$PROMOTE_BRANCH")"
|
|
431
|
+
if [ "$STAGING_COMMIT" != "$BRANCH_HEAD" ]; then
|
|
432
|
+
  echo "PROMOTE BLOCKED: staging server runs $STAGING_COMMIT but you are promoting $BRANCH_HEAD."
|
|
433
|
+
  echo "Redeploy staging to $BRANCH_HEAD and re-run the health check before promoting."
|
|
434
|
+
  exit 1
|
|
435
|
+
fi
|
|
436
|
+
```
|
|
437
|
+
|
|
438
|
+
A health-only promote gate promotes stale builds. The commit-equality check is the assertion that the thing you tested is the thing you ship. (Field report #364: a session pushed two missions to the staging *branch* without redeploying the staging *server*; "branch ahead + HTTP 200" both passed while the server lagged the branch by a full version — promoting would have shipped prod commits that never ran on staging, caught only by an operator's instinct to inspect server state.)
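One caveat with a strict string comparison: `git rev-parse --short` chooses an abbreviation length that can grow with repository size or a `core.abbrev` setting, so it may not match the length of the SHA the health endpoint reports. A hedged sketch of a length-tolerant check follows, assuming the endpoint reports a full or abbreviated lowercase SHA of at least seven hex characters. `commit_matches` is an illustrative name, not part of `promote.sh`.

```bash
# Hypothetical helper: succeed only when the reported commit is a prefix of
# (or identical to) the full branch HEAD SHA.
# Usage: commit_matches <reported-commit> <full-head-sha>
commit_matches() {
  reported="$1"
  full="$2"
  # Reject empty values, jq's "null" for a missing field, and anything non-hex.
  case "$reported" in
    ''|null|*[!0-9a-f]*) return 1 ;;
  esac
  # Reject abbreviations too short to identify a commit reliably.
  [ "${#reported}" -ge 7 ] || return 1
  case "$full" in
    "$reported"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

In a promote script this could be called as `commit_matches "$STAGING_COMMIT" "$(git rev-parse "$PROMOTE_BRANCH")"`, blocking on a non-zero return. A missing or malformed `commit` field blocks the promote rather than passing it.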
|
|
439
|
+
|
|
399
440
|
### Renaming a Linked Worktree Directory Breaks Git Silently
|
|
400
441
|
|
|
401
442
|
A linked git worktree (staging worktree, release worktree) keeps **two** pointer files that must agree on the directory's path. Renaming the worktree directory with a plain `mv` orphans both, and git gives you no warning (field report #343 F2):
|
|
@@ -445,6 +486,8 @@ Add project-specific exclusions for any directory that receives runtime-generate
|
|
|
445
486
|
|
|
446
487
|
**Live-fire verification per credential (field report #360).** After wiring ANY external credential — analytics, error tracking, ad platform, payment, LLM provider, anything with an API key/secret/token — exercise it against the provider's LIVE API and confirm acceptance before marking the integration done. Env-var-set is NOT done; a structurally-valid value (correct prefix/length) can still be dead. Send the smallest real authenticated request the provider supports (a no-op read, a token introspection, a `whoami`/`accounts:list`, a single test event) and assert a success status, not just a non-error transport. This single live call also surfaces latent integration bugs the stored value can't reveal: a hardcoded/sunsetting API version now returning 404 (pin a current version + add a health check), a missing required header (e.g. `login-customer-id` for a manager→client account), or wrong scopes. Evidence: a Google Ads credential that looked structurally valid was dead (`invalid_client`) and a v17 pin had been retired (404, current v21) — eyeballing would have shipped a silently-broken integration.
|
|
447
488
|
|
|
489
|
+
**Interactive OAuth consent on a headless server (field report #382).** Bootstrapping an OAuth credential that uses a loopback redirect (Google's installed-app flow, many provider CLIs) assumes a browser on the *same host* as the consent step — it opens `http://127.0.0.1:<port>/` and waits. On a headless box (deploy server, CI runner) there is no local browser, so the flow hangs. Two fallbacks — document both in any OAuth-bootstrap tooling guidance: (1) **SSH local port-forward** — `ssh -L <port>:127.0.0.1:<port> user@server`, run the consent in your *local* browser, and the loopback redirect resolves back through the tunnel to the server's listener; (2) **paste-the-code / out-of-band** — use the flow's manual-code variant (e.g. `--no-launch-browser`), open the printed auth URL on any machine, and paste the returned code into the headless prompt. Use port-forward when the tool only does loopback; paste-code when it offers an OOB option.
|
|
490
|
+
|
|
448
491
|
**Post-deploy OAuth sign-in failures: discriminate IdP-side from regression before rolling back.** When the first real sign-in after a deploy fails, do NOT reflexively roll back — first locate WHERE it failed. If the error page lives on the IdP's own domain (e.g. `accounts.google.com/info/unknownerror`, with a `rapt` re-auth token) and occurs BEFORE your `/callback` is hit, the failure is on the identity provider, not your migration — typically a stuck re-auth session, not a regression. Confirm your authorize request was well-formed (client_id, redirect_uri, scope, state) and then retry in a fresh/incognito session; an incognito success proves the deploy is fine and the IdP session was transient. Only an error AT your callback (state mismatch, token-exchange 4xx, cookie not set) implicates your code. A reflexive rollback on an IdP-side error falsely blames the migration and fixes nothing. (Field report #357 #3.)
|
|
449
492
|
|
|
450
493
|
**Post-deploy asset verification:** After deploying, verify specifically the files that *changed* in this deploy — not pre-existing assets. Check: (a) correct content-type header (text/html on a static asset means the file is missing from the deployment), (b) correct content-length (not the index.html fallback size), (c) deployment list shows the correct environment. Do NOT verify only pre-existing assets — they prove the host is up, not that the deploy succeeded. (Field report #114)
|
|
@@ -533,6 +576,29 @@ ReadWritePaths=/var/lib/myapp /var/log/myapp
|
|
|
533
576
|
```
|
|
534
577
|
Note: ahead-of-time-compiled binaries (Go, Rust, statically compiled C/C++) have no JIT and **can** keep `MemoryDenyWriteExecute=true` — the restriction is specific to JIT runtimes (Node/V8, the JVM, PyPy, .NET with JIT). When a unit template is shared across services, gate MDWE on the runtime, not on the unit boilerplate.
|
|
535
578
|
|
|
579
|
+
### Live contrastive smoke gate (systemd / shell / sudo / sandbox wiring)
|
|
580
|
+
|
|
581
|
+
A unit file that *parses* is not a unit that *runs*. systemd sandbox flags (`ProtectHome`, `ReadWritePaths`, `MemoryDenyWriteExecute`, `User=`, `NoNewPrivileges`), shell `set -o pipefail` interactions, `sudo`/`setuid` drops, and `exec`-replaced traps all pass every code-read and unit-lint yet fail (or silently no-op) the moment the service runs under the real hardened runtime. `systemd-analyze verify <unit>` checks syntax; it does NOT prove the service does useful work inside its own sandbox.
|
|
582
|
+
|
|
583
|
+
For any systemd/shell/sudo/sandbox wiring mission, the review gate must run a **live contrastive smoke** — prove the failure mode AND the fix *at runtime*, contrasting pass-vs-fail, not merely that the unit file is defined:
|
|
584
|
+
|
|
585
|
+
1. **Reproduce the failure live.** Run the operation under the OLD/un-fixed sandbox config and show it actually blocks — e.g. a `systemd-run` (or `systemd-run --user`) transient unit carrying the restrictive flags demonstrates the write is denied / the process SIGTRAPs / the cadence run no-ops.
|
|
586
|
+
2. **Prove the fix live.** Run the SAME operation under the NEW config and show it now succeeds.
|
|
587
|
+
3. **Assert the contrast.** The gate passes only when fail→old and pass→new are both demonstrated. A unit that merely *exists* with the right flags is not proof the service runs under them.
|
|
588
|
+
|
|
589
|
+
```bash
|
|
590
|
+
# Contrastive smoke: does the service actually run under its real hardened runtime?
|
|
591
|
+
# OLD config must BLOCK; NEW config must ALLOW. Run both; assert the contrast.
|
|
592
|
+
systemd-run --pty --property=ProtectHome=read-only \
|
|
593
|
+
  --property=ReadWritePaths=/var/log/myapp \
|
|
594
|
+
  /usr/bin/env sh -c 'echo probe > /home/svc/repo/run.log'   # expect: FAILS (read-only)
|
|
595
|
+
systemd-run --pty --property=ProtectHome=read-only \
|
|
596
|
+
  --property="ReadWritePaths=/var/log/myapp /home/svc/repo" \
|
|
597
|
+
  /usr/bin/env sh -c 'echo probe > /home/svc/repo/run.log'   # expect: SUCCEEDS (path writable)
|
|
598
|
+
```
|
|
599
|
+
|
|
600
|
+
This catches the class of defect that unit-green code hides: a `ProtectHome=read-only` + enumerated `ReadWritePaths` that makes a repo-root log read-only and **silently no-ops every armed cadence run**; an `OnFailure=` alert unit that `203/EXEC`-fails because its script lacks the executable bit; a secret still readable in `/proc/<pid>/environ` after an in-process `unsetenv`. None are reachable by code-reading or unit tests — only by running the service under its real runtime and watching what it does. (Field report #365: an M13 systemd sandbox passed every unit test but would have silently no-op'd every armed cadence run; reachable only by live proof, not code-reading.)
|
|
601
|
+
|
|
536
602
|
## Config Foot-Guns (deploy/runtime)
|
|
537
603
|
|
|
538
604
|
Four recurring config traps that pass every syntax check yet break at runtime (field report #352 #5):
|
|
@@ -50,6 +50,7 @@ Transform session failures into structured, actionable field reports that improv
|
|
|
50
50
|
4. **Categorize root causes.** Every failure is one of: methodology gap, tooling limitation, communication failure, scope issue, framework-specific bug, or external dependency.
|
|
51
51
|
5. **Severity matters.** Distinguish between "this affects all users" (methodology flaw) and "this was specific to my project" (edge case).
|
|
52
52
|
6. **Be actionable.** Every finding should specify: which file should change, what should be added/modified, and which agent is responsible.
|
|
53
|
+
7. **Take the gate bypass before dispatching the DS9 crew (#366 F4).** Bashir's Ezri/O'Brien/Nog/Jake are a fixed, command-prescribed roster — not a cherry-picked review — so `/debrief` is NOT in the Silver Surfer gated-commands list. But the gate's `PreToolUse` hook blocks *every* non-Surfer Agent launch regardless of that list, so the crew would be blocked. BEFORE launching the first sub-agent, run `[ -x scripts/surfer-gate/bypass.sh ] && bash scripts/surfer-gate/bypass.sh --light || true`. The bypass is correct here, not a workaround: a fixed roster can't be cherry-picked, so the gate's anti-cherry-pick purpose doesn't apply. **Stale-pointer self-repair (#384 RC-3):** `bypass.sh` now detects a stale pointer itself — it reads the live session id from `CLAUDE_CODE_SESSION_ID` and repoints a pointer left by a `/clear`ed or crashed session to the live session, so the bypass lands in the right dir on the first try. On older Claude Code builds without that env var the legacy behavior applies: if the first sub-agent launch still blocks, re-run the same `bypass.sh --light` line once (the first blocked `check.sh` fire repoints to the live session, so the second write lands correctly).
|
|
53
54
|
|
|
54
55
|
## Root Cause Categories
|
|
55
56
|
|
|
@@ -90,10 +90,9 @@ Stored at `public/images/manifest.json`:
|
|
|
90
90
|
|
|
91
91
|
Default: OpenAI (gpt-image-1). Provider-abstracted for future extensibility.
|
|
92
92
|
|
|
93
|
-
| Provider | Model | Per Image
|
|
94
|
-
|
|
95
|
-
| OpenAI | gpt-image-1 | ~$0.04 | Default. Best quality/cost ratio. |
|
|
96
|
-
| OpenAI | DALL-E 3 HD | ~$0.08 | Higher detail, double cost. |
|
|
93
|
+
| Provider | Model | Per Image | Notes |
|
|
94
|
+
|----------|-------|-----------|-------|
|
|
95
|
+
| OpenAI | gpt-image-1 | ~$0.04 | Default. Best quality/cost ratio. For higher detail, raise `quality` to `high`. |
|
|
97
96
|
|
|
98
97
|
## Deliverables
|
|
99
98
|
|
|
@@ -112,6 +112,8 @@ Why default-to-refuted: across instrumented Gauntlets, **~38% of first-pass Crit
|
|
|
112
112
|
|
|
113
113
|
> **Cross-system checkpoint is non-optional (field report #350 #4):** in a multi-mission Gauntlet, the cross-system checkpoint caught a **fix-induced Critical that a per-mission review's own fix had created** — the per-mission review verified its fix in isolation and passed it; only the whole-system pass saw the new failure mode the fix introduced. This is direct evidence that verifying a fix against the single mission that motivated it is insufficient. The Gauntlet-level refute-the-fix checkpoint stays in the protocol regardless of how green the per-mission reviews were.
|
|
114
114
|
|
|
115
|
+
**A fix-batch verdict must PROVE its premise — live runtime assertion for runtime-dependent fixes (field report #377 #2, #4):** Source-read + unit-test-green is NECESSARY but NOT SUFFICIENT to accept a fix whose correctness depends on **runtime state**. For any fix touching file permissions, environment variables, process lifecycle (`exec`/`trap`/signal handling), systemd sandbox (`ProtectHome`, `ReadWritePaths`, `OnFailure=`), privilege drop, or any kernel-observable state, the fix-batch gate requires a **live runtime assertion** that the fix actually takes effect at runtime — not merely that the source looks correct. The acceptance check is empirical: run the real unit/script/process and assert the post-state. Examples that "looked right" in source and passed unit tests yet failed at runtime: an `OnFailure=` alert unit that `203/EXEC`-failed because its script lacked the executable bit (assert: `systemd-run` the unit and confirm it starts); a secret still present in `/proc/<pid>/environ` after an in-process `unsetenv`/pop because the kernel exposes the exec-time env block (assert: read `/proc/<pid>/environ` of the live process); a cleanup step skipped because `exec` replaced the shell before the `trap` could fire (assert: run the script and confirm the cleanup artifact). **A verdict must prove its premise before the fix is accepted.** A severity rating or an accept/reject decision that rests on an unverified factual premise — "the secret is reachable," "the bit is set," "the trap fires" — is INADMISSIBLE until the agent ships the command that proves the premise and the command is run (e.g., prove reachability by running the env-builder before rating a secret CRITICAL). When the runtime cannot be exercised, the fix is accepted only as a code-read CONFIRM carrying the unreproduced-finding severity discount — never present a source-read as proof the runtime behavior holds. 
(Field report #377: three adversarial verification rounds each caught a fix that passed source-read + unit tests but did not take effect at runtime, plus a CRITICAL rated on a premise — secret reachability — that was never checked and proved false.)
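The `/proc/<pid>/environ` assertion named above can be sketched as follows. This assumes Linux, permission to read the target process's `environ` file (same uid or root), and a variable name free of regex metacharacters. `proc_env_has` is an illustrative name.

```bash
# Hypothetical helper: check whether NAME is present in the exec-time
# environment block the kernel exposes for a live process.
# Returns 0 if present, 1 if absent, 2 if the environ file cannot be read.
# Usage: proc_env_has <pid> <NAME>
proc_env_has() {
  pid="$1"
  name="$2"
  # Distinguish "cannot read" from "absent" so a permission error is not a false pass.
  [ -r "/proc/$pid/environ" ] || return 2
  # Entries are NUL-separated; convert to newlines before matching "NAME=".
  tr '\0' '\n' < "/proc/$pid/environ" | grep -q "^${name}="
}
```

A fix-batch gate could then assert on the live process, for example `if proc_env_has "$pid" API_SECRET; then echo "secret still readable"; exit 1; fi`, where `API_SECRET` stands in for whatever variable the fix was meant to remove.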
|
|
116
|
+
|
|
115
117
|
**Round 5 — The Council (convergence):**
|
|
116
118
|
- Spock (Star Trek) — code quality after fixes
|
|
117
119
|
- Ahsoka (Star Wars) — access control integrity
|
|
@@ -125,6 +127,10 @@ Troi also performs a **Marketing Copy Drift Check**: compare marketing page clai
|
|
|
125
127
|
|
|
126
128
|
**Composition/wiring lens (Victory / multi-mission Gauntlet) (field report #358 #1):** Per-mission reviews are structurally blind to cross-mission composition — they only see one mission's changeset. A defect that is a property of the *assembled entry paths across all missions* (which code path is actually invoked at each armed/public entry point, what each entry *passes* vs. what the library *accepts*, and whether a security-critical default — `run_as`, eval tier, isolation flag — is set on the entry path or only deep in a module) is invisible to every per-mission review yet ships. Therefore the final holistic Gauntlet MUST dedicate at least one agent to a wiring/composition pass that: (1) enumerates every entry point (CLI, daemon, public route, scheduled job) that invokes the assembled system; (2) for each, traces what arguments/config it actually passes and reconciles them against what the library/eval gate accepts and what the safe default requires; (3) flags any entry path that omits a containment boundary the library threads internally but the entry never sets, or that injects a weaker gate (T1-only) than the full regression+isolation gate the system defines. This pass is non-negotiable and is not satisfied by green per-mission reviews. Field report #358: 12 passing reviews (10 per-mission + 2 every-4 checkpoints) all missed (a) every armed run executing as the privileged `run_as` user because it was set on the eval module but never on any entry path, and (b) both armed entry points injecting a T1-only eval that bypassed the T2/T3 gate — both caught only by the final Victory Gauntlet.
|
|
127
129
|
|
|
130
|
+
**Declared-vs-implemented reconciliation lens (Victory / multi-mission Gauntlet) — MANDATORY (field report #365 #4):** Phantom coverage — a flow/entry-point/capability **declared** in one mission, **implemented** (or not) in a second, and **counted as covered** by a measurement tool in a third — is structurally invisible to every per-mission review because the declaration, the implementation, and the measurement each live in a different mission's changeset. The final holistic Gauntlet MUST dedicate at least one agent to a reconciliation pass that, for the whole assembled system: (1) **enumerates everything DECLARED** — every flow registered in a registry/manifest/config, every route declared, every capability advertised in a coverage/health/status surface; (2) **enumerates everything actually IMPLEMENTED AND WIRED** — for each declared item, confirm a real handler/runner exists AND is reachable from a live entry point (not a stub, not an unimported function, not a `pass`); (3) **reconciles the two counts and flags every gap** — any declared item with no live implementation is phantom coverage; any coverage/health number that counts declared-but-unimplemented items is a false-confidence metric and is itself a finding. The reconciliation is a count, not a vibe: "registry declares N flows; runner implements M; coverage tool reports K covered — reconcile N vs M vs K and name every item in the difference." This pass is non-negotiable for any multi-mission build and is not satisfied by green per-mission reviews. (Field report #365: a flow registry (mission A) declared 14 "scripted" flows; the runner (mission B) implemented only one; the coverage tool (mission C) counted all 14 as covered — the gap was invisible to every per-mission review and caught only by the whole-system Victory Gauntlet.)
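The N vs M part of that reconciliation can be sketched as a set difference. This assumes the declared and implemented item names have already been extracted into two plain-text files, one name per line. The extraction is project-specific and not shown, and `phantom_items` is an illustrative name.

```bash
# Hypothetical helper: print every declared item that has no implementation.
# Usage: phantom_items <declared-file> <implemented-file>
phantom_items() {
  declared_sorted="$(mktemp)"
  implemented_sorted="$(mktemp)"
  # comm needs sorted input; -u also collapses duplicate names.
  sort -u "$1" > "$declared_sorted"
  sort -u "$2" > "$implemented_sorted"
  # comm -23 keeps lines unique to the first file: declared but not implemented.
  comm -23 "$declared_sorted" "$implemented_sorted"
  rm -f "$declared_sorted" "$implemented_sorted"
}
```

Piping the output through `wc -l` gives the size of the gap, and the printed names are the items to list in the finding. Running the same difference with the coverage tool's list as the first argument reconciles K against M.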
|
|
131
|
+
|
|
132
|
+
**Dark-flag activation is gated on REVIEW, not deploy (field report #373 #1):** For a feature built behind a dark flag (deployed disabled, then enabled via a flag flip + any paired contract/data migration), the comprehensive adversarial review of the dark code is the gate on the **flag flip** — NOT on the deploy of the dark code. Reviewing AFTER activation ships review-findable bugs to live users. The required ordering is **build dark → Gauntlet the dark code → flip the flag only once the review's Critical/High are closed** — never flip-then-review. The Gauntlet (or `/assemble` for a follow-on increment) that clears a dark feature for activation MUST exercise the **partial/edge interaction states the feature introduces** (partial confirm, reject-all, edit-then-act, account-switch mid-action), not only the end-to-end happy path — a green happy-path smoke is necessary, not sufficient, because the new bugs live in the partial/edge interactions. (Field report #373: an ADR was built dark, the flag flipped + a contract migration applied, and ONLY THEN the 17-agent Gauntlet ran — it found 6 confirmed blockers, including a premature batch-completion that stranded un-confirmed siblings, already LIVE; a follow-on `/assemble` found 2 more HIGH, also already live. The dark→activate→review ordering shipped 8 review-findable bugs to prod.)
|
|
133
|
+
|
|
128
134
|
**Conditional verdict — ship-vs-enable separation (field report #358 #4):** When the Council's verdict is conditional — "safe to ship in state X but not state Y" (most commonly: safe to ship the feature GATED OFF, but NOT safe to arm/enable it) — the Council MUST NOT sign off on a bare "ship." It requires, before sign-off: (1) an **ADR that explicitly separates the two states** — what is true in the shipped-but-gated state, what must additionally hold before the enabled/armed state is safe (the open P0/P1 prerequisites), and which Gauntlet findings gate the transition; and (2) a **prerequisites runbook** enumerating the concrete, verifiable steps to move from shipped to enabled (containment boundary set on every entry path, full eval gate wired, credentials provisioned, etc.). Without this artifact, "shipped" silently reads as "fully enabled" to the next operator and a latent privileged-execution or gate-bypass gap goes live. The shipped state is only signed off once the ADR + runbook exist; the enabled state is signed off only once the runbook's prerequisites are independently verified. (Union Station's campaign wrote ADR-222 to capture exactly this separation.)
|
|
129
135
|
|
|
130
136
|
**Pattern auth completeness check (Kenobi, during Rounds 2-3):** When a pattern file defines an authentication flow, verify the auth checks perform actual value verification (compare against expected, call verify functions) — not just presence checks (`!!header`, `Boolean()`). Flag `!!` or truthiness checks on auth-related headers as suspicious. (Field report #109: daemon socket auth used `!!vaultHeader` which passed for any non-empty string.)
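The difference between a presence check and a value check can be sketched as follows. The field report concerns a JavaScript `!!vaultHeader` check; this Python version is only an analogue, and the function names and the idea of a single expected token are assumptions for illustration.

```python
import hmac
from typing import Optional


def presence_only_check(header_value: Optional[str]) -> bool:
    # Anti-pattern: the Python equivalent of `!!vaultHeader`.
    # Any non-empty string passes, so this verifies nothing.
    return bool(header_value)


def verified_check(header_value: Optional[str], expected_token: str) -> bool:
    # Value verification: compare against the expected secret.
    # An empty header or an empty expected token is always a rejection.
    if not header_value or not expected_token:
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(header_value.encode(), expected_token.encode())
```

A reviewer looking for the suspicious shape would flag `presence_only_check`: it returns `True` for `"garbage"`, while `verified_check` accepts only the exact expected value.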
|
|
@@ -39,6 +39,8 @@ For each of the 9 universes, evaluate which agents have relevant expertise for *
|
|
|
39
39
|
3. Does this agent bring a unique perspective no other included agent covers? → Include
|
|
40
40
|
4. Would this agent's findings be a subset of another agent's? → Exclude (dedup)
|
|
41
41
|
|
|
42
|
+
**The orchestrator owns this dedup and the dispatch decision.** Whether the roll comes from the Muster evaluation above or from the Silver Surfer's pre-scan, the candidate list is *advice* — the orchestrator collapses same-domain agents auditing the same artifact into one agent per distinct lens before launching, and decides which survive. A roster that returns ~5 data agents and ~6 security agents re-reading one artifact is bloat, not coverage; it wastes tokens on launch and again on re-deduping near-identical findings. The Herald advises; it never commands "launch all / do not analyze yourself." See SUB_AGENTS.md "The Orchestrator Owns Roster Dedup + Dispatch" for the full rule. (Field report #378 RC-3.)
|
|
43
|
+
|
|
42
44
|
**Universe leads (always evaluated, included if relevant):**
|
|
43
45
|
|
|
44
46
|
| Universe | Lead | Domain | Include When |
|
|
@@ -92,6 +92,8 @@ Trace the primary user flow step by step. This is a narrative walkthrough, not a
|
|
|
92
92
|
2. **MANDATORY: Screenshot every page.** Save screenshots to temp directory. The agent MUST read each screenshot via the Read tool and visually analyze it for: layout integrity, content completeness, visual hierarchy, spacing consistency, state correctness. This is how Galadriel "sees" the product — without screenshots, the review is code-reading, not visual review. Take at desktop viewport (1440x900) for primary analysis.
|
|
93
93
|
|
|
94
94
|
**Atomic-visual carve-out:** For an atomic visual change — a single component, one icon, a loader, one state — a component-level **render-harness** screenshot (the component mounted in isolation, captured, and Read) satisfies the "verify visually" rule. It is a faster, equally-valid proof than standing up the full authed app, and avoids the auth + DB + server setup the full-page pass requires. Use it only for genuinely isolated visual artifacts; anything touching layout, navigation, or cross-component flow still gets the full-page screenshot pass. (Field report #362.)
|
|
95
|
+
|
|
96
|
+
**Render-gate regression coverage:** A green build and a green unit suite do NOT catch render-gate regressions — a removed or renamed prop can silently kill a feature (a component still gating its render on a prop that is now always `null`) while every automated gate stays green. So when the change under review touched a prop or a shared contract, the walkthrough must cover **EVERY surface that consumes the changed prop/contract — not a sampled page** — and must explicitly **re-check the render *gates* that key off the changed prop** (the panel that gated on it: does it still render?). Verify each changed component in BOTH signed-in and signed-out states. A "screenshot every page" pass satisfied by an e2e that exercises a *different* surface than the one that changed is not coverage — it is a miss waiting to ship a dead feature. (Field report #375.)
|
|
95
97
|
3. **Behavioral verification:** Click every button, link, tab on primary routes. After each click, verify something visible changed (DOM mutation, navigation, modal). Flag non-responsive interactive elements.
|
|
96
98
|
4. **Form interaction:** Fill every form. Verify: focus rings visible on Tab, validation triggers on blur/submit, error messages appear next to correct fields, success state shows after valid submission.
|
|
97
99
|
5. **Keyboard walkthrough:** Tab through each page. Verify: focus order matches visual order, no focus traps except intentional modals, Escape closes overlays.
|
|
@@ -217,6 +219,22 @@ Screen all copy and visuals against the tells that mark generated work as genera
|
|
|
217
219
|
|
|
218
220
|
A surface that trips three or more of these tells is presumed AI-slop and goes back for de-AI revision, anchored against the Step 1.8 reference dossier.
|
|
219
221
|
|
|
222
|
+
### The Originality Gate — justify-or-reject the homogenized defaults
|
|
223
|
+
|
|
224
|
+
(Field reports #376, #1.)
|
|
225
|
+
|
|
226
|
+
The de-AI checklist above flags tells *after* a surface exists. The Originality Gate runs *before* any visual direction is emitted and is stricter: it names the specific homogenized defaults the model reaches for by reflex, and forces an explicit verdict on each. For EACH item below, record one of two verdicts — **REJECTED** (not used in this direction) or **JUSTIFIED** (deliberately kept, with the reason anchored to a concrete, named artifact in the Step 1.8 reference dossier). The bar is asymmetric on purpose: rejection is free, justification must cite the dossier. "It looked fine," "it's a clean default," or "it's what the framework ships" are not justifications — only a named dossier reference is.
|
|
227
|
+
|
|
228
|
+
The named defaults to adjudicate:
|
|
229
|
+
|
|
230
|
+
- **blue-600 hero** (or the framework's default-primary accent) — the reflexive Tailwind/SaaS blue.
|
|
231
|
+
- **purple→cyan / violet→teal gradient headings** — the `bg-clip-text` rainbow headline.
|
|
232
|
+
- **the shadcn default hero** — centered headline + sub + two buttons + faint grid/radial, untouched.
|
|
233
|
+
- **floating orbs / particles / aurora blobs** — decorative background motion that carries no meaning.
|
|
234
|
+
- **the default Inter / Playfair pairing** — the reflexive "modern sans + elegant serif" combo.
|
|
235
|
+
|
|
236
|
+
The direction passes the gate only when every item is explicitly REJECTED or JUSTIFIED against the dossier. The default posture is **distinctive and ownable, not "current SaaS standard."** If three or more items land on JUSTIFIED rather than REJECTED, treat that as evidence the direction has converged on the statistical mean and send it back to Step 1.8 reference grounding before it goes any further. Originality is a gate the work must pass, not a hope — the "everything on the internet looks AI-generated now" failure mode is produced precisely by methodologies that *default* to these picks and never force the verdict.
|
|
237
|
+
|
|
220
238
|
## Step 2 — UX/UI Attack Plan
|
|
221
239
|
|
|
222
240
|
**Elrond:** IA, navigation, task flows, friction.
|
|
@@ -158,7 +158,7 @@ When a system has dynamic optimization (auto-tuning, parameter sweeps, adaptive
|
|
|
158
158
|
|
|
159
159
|
**Copy Accuracy Pass:** Grep for numeric claims in rendered content (e.g., "10 lead agents", "12 commands", "53 pages"). Cross-reference against actual data counts. Any mismatch is a bug — inaccurate numbers undermine credibility. This is automatable and should run on every QA pass.
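One way the Copy Accuracy Pass might be automated is a regex scan over rendered text, compared against counts taken from the real data. This is a sketch under assumptions: the function name `find_mismatches` is hypothetical, and the nouns to scan for are supplied by the caller rather than discovered.

```python
import re


def find_mismatches(rendered_text: str, actual_counts: dict) -> list:
    """Return (noun, claimed, actual) for every numeric claim that is wrong.

    `actual_counts` maps a noun phrase as it appears in copy
    (for example "commands") to the true count from the data source.
    """
    if not actual_counts:
        return []
    # Longest phrases first so "lead agents" wins over a shorter overlap.
    nouns = sorted(actual_counts, key=len, reverse=True)
    pattern = re.compile(
        r"\b(\d+)\s+(" + "|".join(re.escape(n) for n in nouns) + r")\b"
    )
    mismatches = []
    for match in pattern.finditer(rendered_text):
        claimed, noun = int(match.group(1)), match.group(2)
        if claimed != actual_counts[noun]:
            mismatches.append((noun, claimed, actual_counts[noun]))
    return mismatches
```

Any non-empty result is a bug by the rule above; an empty list means every scanned claim matched its source count.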
|
|
160
160
|
|
|
161
|
-
**Image Size Audit:** For projects with static images (especially `/imagine` output), check every image in `public/` or `static/`: flag any image > 200KB, flag any image >4x its display dimensions (a 1024px source rendered at 40px is a 97% bandwidth waste). Total asset directory should be < 10MB for marketing sites, < 50MB for apps. If `/imagine` was used, verify Gimli's optimization step (Step 5.5) produced WebP files at 2x display dimensions, not raw 1024px
|
|
161
|
+
**Image Size Audit:** For projects with static images (especially `/imagine` output), check every image in `public/` or `static/`: flag any image > 200KB, flag any image >4x its display dimensions (a 1024px source rendered at 40px is a 97% bandwidth waste). Total asset directory should be < 10MB for marketing sites, < 50MB for apps. If `/imagine` was used, verify Gimli's optimization step (Step 5.5) produced WebP files at 2x display dimensions, not raw 1024px gpt-image-1 PNGs.
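The two per-image thresholds above can be sketched as a small audit function. Everything here is illustrative: the `ImageInfo` record is hypothetical, dimensions are passed in rather than read from files, and treating "200KB" as 200 × 1024 bytes is an assumption.

```python
from dataclasses import dataclass

MAX_BYTES = 200 * 1024  # assumption: "200KB" read as 200 KiB
MAX_OVERSIZE_FACTOR = 4  # source wider than 4x its display width is flagged


@dataclass(frozen=True)
class ImageInfo:
    path: str
    size_bytes: int
    source_width: int   # pixel width of the file on disk
    display_width: int  # CSS pixel width it is rendered at


def audit_images(images: list) -> list:
    """Return (path, reason) for every threshold an image trips."""
    findings = []
    for image in images:
        if image.size_bytes > MAX_BYTES:
            findings.append((image.path, "larger than 200KB"))
        if image.source_width > MAX_OVERSIZE_FACTOR * image.display_width:
            findings.append((image.path, "more than 4x display width"))
    return findings
```

A real audit would read the byte size and pixel dimensions from disk and the display width from the rendered page; this sketch only shows the comparison logic.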
|
|
162
162
|
|
|
163
163
|
### Install/CTA Command Verification
|
|
164
164
|
Verify all install/CTA terminal commands shown on the site actually work in a clean environment. Copy each command from the rendered page, run it in a fresh shell (no project-specific PATH, no aliases), and verify the expected outcome. Marketing pages with broken install commands are worse than no install commands. (Triage fix from field report batch #149-#153.)
|
|
@@ -265,6 +265,12 @@ Flag as **High severity**. In financial systems (trading, payments, billing), fl
|
|
|
265
265
|
|
|
266
266
|
For any feature where the system consumes the output of an LLM or an external tool and then ACTS on it (applies an LLM-generated diff/edit, parses a model-authored JSON plan, executes a tool-returned command, validates a third-party payload), hand-authored fixtures are insufficient — they exercise only the shapes you imagined, which are exactly the shapes that already work. Mandate a **real-output self-test on seeded mutants**: seed a known defect (a real mutant), run the system end-to-end against the REAL external output (real LLM call, real tool response), and assert two properties — **does-it-fix** (the system resolves the seeded mutant) and **does-no-harm** (it does not corrupt unrelated state or pass when it should fail). **Heuristic: if every test of an integration boundary uses a fixture you authored, you have not tested the boundary — you have tested your own imagination of it.** Field report #358: M5–M9 unit tests fed the apply path hand-authored unified diffs that always `git apply`-ed cleanly; the first real-LLM self-test immediately surfaced that real Sonnet diffs do NOT apply (miscounted `@@` hunk headers, missing trailing newline → 'corrupt patch'). The fix was architectural (return exact `{old,new}` edits, generate the diff with `difflib`). Without a real-output self-test, this ships broken. Budget for flakiness: real-LLM tests hit rate limits — wrap each call in a bounded retry loop.
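The architectural fix named in the field report (take exact `{old,new}` edits from the model and generate the diff locally with `difflib`) might look roughly like this. It is a sketch, not the project's code: `diff_from_edit` is a hypothetical name, the exactly-once match rule is an added assumption, and the input is assumed to end with a trailing newline because `difflib.unified_diff` does not emit a "no newline at end of file" marker.

```python
import difflib


def diff_from_edit(path: str, original: str, old: str, new: str) -> str:
    """Apply one exact {old, new} edit and return a unified diff of the result.

    The hunk headers are computed by difflib from the real before/after
    text, so they cannot be miscounted the way model-authored headers can.
    """
    # Assumption for this sketch: refuse ambiguous or missing targets.
    if original.count(old) != 1:
        raise ValueError("edit target must match exactly once")
    updated = original.replace(old, new, 1)
    lines = difflib.unified_diff(
        original.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)
```

The design point is that the model supplies only content, never diff syntax, so the "corrupt patch" class of failure is removed rather than retried.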
|
|
267
267
|
|
|
268
|
+
### Throughput / Scale Gate — Per-Row Network Stages (field report #378)
|
|
269
|
+
|
|
270
|
+
For any batch or pipeline project, every stage that makes a per-row network call (LLM classify, enrichment API, per-record HTTP/DB round-trip) over a large input set is a **scale-gated stage**, and the QA pass must include a throughput test, not just a correctness test. A green suite on a 5-row fixture certifies nothing about an `O(rows × latency)` sequential loop — small fixtures pass instantly whether the stage is concurrent or serial, so they structurally cannot expose a sequential implementation where an ADR specified concurrency (worker pool / bounded fan-out).
|
|
271
|
+
|
|
272
|
+
**Required check:** Run the stage against N well above ~500 rows and assert that concurrency is actually *wired*, not merely specified — e.g. measure wall-clock against the per-call latency budget (a serial loop's runtime ≈ `rows × latency` and blows the budget), or assert in-flight call count reaches the configured pool bound. When an ADR claims a "bounded worker pool" or "parallel fan-out" for a stage, a sequential `for`-loop that satisfies it is a **silent regression** — flag as **High** (Critical for cost/SLA-bound stages). Trace the ADR's concurrency claim to the implementation and prove the pool exists; do not take the ADR's word for it. (Field report #378: an ADR specified a bounded worker pool for I/O-bound stages, but both the LLM-classify and Hunter-enrich stages shipped as sequential loops — tests green on small fixtures. At production scale, ~4k rows ran ~4 hours and a ~10k enrich stalled the run. Two separate discoveries, both invisible to the correctness gate.)
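The "assert in-flight call count reaches the configured pool bound" option can be sketched with a small counter wrapped around a stand-in network call. All names here are hypothetical, the latency is simulated with `time.sleep`, and the row count is kept tiny because the in-flight assertion, unlike the wall-clock one, does not need a large N to discriminate.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InFlightCounter:
    """Context manager that records the peak number of concurrent calls."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current = 0
        self.peak = 0

    def __enter__(self) -> "InFlightCounter":
        with self._lock:
            self._current += 1
            self.peak = max(self.peak, self._current)
        return self

    def __exit__(self, *exc_info) -> bool:
        with self._lock:
            self._current -= 1
        return False


def fake_network_call(row: int, counter: InFlightCounter, latency: float = 0.02) -> int:
    # Stand-in for an LLM classify or enrichment request.
    with counter:
        time.sleep(latency)
        return row * 2


def run_serial(rows, counter: InFlightCounter) -> list:
    # The silent regression: one call at a time, peak in-flight stays at 1.
    return [fake_network_call(row, counter) for row in rows]


def run_pooled(rows, counter: InFlightCounter, pool_size: int) -> list:
    # The wired implementation: a bounded worker pool.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(lambda row: fake_network_call(row, counter), rows))
```

A throughput test would then assert that the recorded peak equals the configured pool bound. A sequential loop leaves the peak at 1 no matter what the ADR says, which is exactly the gap a small correctness fixture cannot see.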
|
|
273
|
+
|
|
268
274
|
### Failure Attribution (multi-file test runs)
|
|
269
275
|
|
|
270
276
|
A test failure observed during a multi-file suite run is **NOT attributed to your change** until BOTH of these hold:
|
|
@@ -286,6 +292,19 @@ For every gate, threshold, or invariant a mission introduces (auth allowlist, ev
|
|
|
286
292
|
|
|
287
293
|
A gate with no test that fails on its inversion is a **vacuous invariant**: it looks like protection but enforces nothing, because nothing observes whether it holds. Recurring vacuous-invariant anti-patterns (these surfaced **4x in a single session**): an eval scorer that always passes regardless of output; an auth allowlist with an inverted `!`-check that admits everyone; an off-by-one cap boundary that never actually caps; a truthy boot-guard that is always truthy and so never guards. Treat any newly-introduced gate as guilty until a failing-on-inversion test proves it innocent. (Field report #352 #1)
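A failing-on-inversion test can be reduced to one property: the gate must accept a known-good input and reject a known-bad one. The sketch below uses hypothetical gates (`allowlist_gate`, `inverted_allowlist_gate`, `vacuous_scorer`) to show how that single check exposes two of the anti-patterns listed above.

```python
from typing import Any, Callable

ALLOWLIST = {"alice", "bob"}  # illustrative data only


def allowlist_gate(user: str) -> bool:
    return user in ALLOWLIST


def inverted_allowlist_gate(user: str) -> bool:
    # Anti-pattern: the membership test is inverted, so outsiders pass.
    return user not in ALLOWLIST


def vacuous_scorer(output: str) -> bool:
    # Anti-pattern: passes regardless of output.
    return True


def gate_discriminates(gate: Callable[[Any], bool], should_pass: Any, should_fail: Any) -> bool:
    """True only if the gate accepts the good input AND rejects the bad one."""
    return gate(should_pass) is True and gate(should_fail) is False
```

Only `allowlist_gate` survives `gate_discriminates(gate, "alice", "mallory")`; the inverted gate and the always-passing scorer both fail it, which is the proof of guilt the rule asks for.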
|
|
288
294
|
|
|
295
|
+
### Drift-Guard Discipline — Shared Check + Proven CI Wiring (field report #365)
|
|
296
|
+
|
|
297
|
+
A shipped drift-guard (coverage gate, schema-parity `--check` CLI, lint sentinel, any "this can't regress" enforcer) is only real if two things hold, and the review MUST confirm BOTH:
|
|
298
|
+
|
|
299
|
+
1. **One check function, shared between the CLI and the tests.** The guard's enforcement logic and its test suite must call the *same* function — one source of truth. When the `--check` CLI and the pytest/vitest suite each re-implement the invariant, they drift: the CLI silently enforces *weaker* invariants than the tests assert, and the guard passes while guarding nothing. Verify the CLI and the tests import the same predicate, not two copies that agree today.
|
|
300
|
+
2. **Proven wired into CI — not merely defined.** A guard that runs nowhere is decorative. The review must locate the actual CI job that invokes the guard (grep the workflow YAML / CI config for the guard's command or test file) and confirm it runs on the gating event (PR / pre-merge), not just that the test file exists on disk. "The test is written" is necessary-not-sufficient; "the test runs in CI on every change" is the bar.
|
|
301
|
+
|
|
302
|
+
A guard failing either condition is **High** — it manufactures false confidence in exactly the regressions it claims to prevent. (Field report #365: a coverage drift-guard shipped a `--check` CLI that enforced weaker invariants than its own pytest suite — silently passing the three likeliest regressions — AND the tests were never wired into CI. The guard looked green while guarding nothing.)
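Both conditions can be sketched in a few lines. The names are hypothetical (`check_coverage`, `run_check_cli`, `guard_is_wired`), and the CI check is a plain substring search over the workflow text, a crude stand-in for actually locating the job and its trigger.

```python
def check_coverage(declared: set, covered: set) -> list:
    """The single predicate. Both the CLI and the tests call this function."""
    return sorted(f"declared but not covered: {name}" for name in declared - covered)


def run_check_cli(declared: set, covered: set) -> int:
    """Body of a `--check` command: prints violations, returns an exit code."""
    violations = check_coverage(declared, covered)
    for line in violations:
        print(line)
    return 1 if violations else 0


def test_guard_rejects_uncovered_flow() -> None:
    # The test imports the same predicate instead of re-implementing it.
    assert check_coverage({"login", "checkout"}, {"login"}) == [
        "declared but not covered: checkout"
    ]


def guard_is_wired(ci_config_text: str, guard_command: str) -> bool:
    """Condition 2, roughly: the guard's command appears in the CI config."""
    return guard_command in ci_config_text
```

Because `run_check_cli` and the test share `check_coverage`, the CLI cannot drift to a weaker invariant than the suite asserts. The wiring check still needs a human to confirm the matching job runs on the gating event.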
|
|
303
|
+
|
|
304
|
+
### Coverage Honesty — Count at the Fidelity Actually Exercised (field report #382)
|
|
305
|
+
|
|
306
|
+
A coverage SSOT (the canonical "what's tested" ledger) must record each case at the fidelity it was *actually* exercised — never reclassify it covered on partial evidence. Real incident: a new real-account smoke lane proved the *backend* integration (token → authenticated read) but not the in-browser OAuth round-trip; the temptation was to mark the front-end `external_cases` "covered." It wasn't — only the backend path was. **Rules:** (1) a case is covered only at the layer a test actually drives — a backend token test does not cover the browser consent flow; an API test does not cover the UI that calls it; (2) record proof **per lane** in a ledger (which lane exercised the case, at what fidelity, with what evidence — a log line, a screenshot, a status assertion) so a coverage claim is auditable, not asserted; (3) never promote a case to "covered" in the SSOT on a different-fidelity proxy. Counting partial evidence as full coverage manufactures the same false confidence as a vacuous gate — it just hides in the ledger instead of the code. (Reinforces the verification-discipline theme of #377.)
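A per-lane ledger of the kind rule (2) describes might be modeled like this. The `Proof` record, its field names, and the layer labels are assumptions for illustration, not the project's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proof:
    case: str      # the case being claimed
    lane: str      # which test lane exercised it
    layer: str     # fidelity actually driven, e.g. "backend" or "browser"
    evidence: str  # log line, screenshot path, or status assertion


def covered_at(ledger: list, case: str, layer: str) -> bool:
    """A case counts as covered only at a layer that has recorded evidence."""
    return any(
        proof.case == case and proof.layer == layer and bool(proof.evidence)
        for proof in ledger
    )
```

With a single backend proof recorded for an OAuth case, `covered_at(ledger, case, "backend")` holds while `covered_at(ledger, case, "browser")` does not, so the ledger cannot be read as covering the browser consent flow.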
|
|
307
|
+
|
|
289
308
|
### Safety-Critical Return Value Verification
|
|
290
309
|
|
|
291
310
|
For systems with safety-critical operations (stop-loss placement, circuit breakers, rollback triggers, payment captures, credential revocations): verify the return value of the safety operation BEFORE transitioning state. The pattern: `call safety operation → check return → only then transition`.
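The call, check, then transition pattern can be sketched as below. The state names, the `arm_position` function, and the shape of the result (a dict with `ok` and `order_id`) are all hypothetical; a real broker or payment client will return something different.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    PENDING = "pending"
    ARMED = "armed"
    UNPROTECTED = "unprotected"


def arm_position(position: dict, place_stop_loss: Callable[[dict], dict]) -> bool:
    """Transition to ARMED only after the safety operation confirms success."""
    result = place_stop_loss(position)       # 1. call the safety operation
    confirmed = bool(result.get("ok")) and bool(result.get("order_id"))
    if not confirmed:                         # 2. check the return value
        position["state"] = State.UNPROTECTED
        return False
    position["state"] = State.ARMED           # 3. only then transition
    return True
```

The failure branch records an explicit unprotected state instead of leaving the position looking armed, so downstream code can react to the missing protection.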
|
|
@@ -318,6 +337,7 @@ This is a HARD GATE, not a suggestion. Actually execute runtime tests:
|
|
|
318
337
|
- If yes → infinite render loop. Must fix before proceeding.
|
|
319
338
|
- Check for `.focus()` calls in effects — do they need ref guards?
|
|
320
339
|
5. **Verify primary user flow** — trace from user action → handler → store → render → what the user sees
|
|
340
|
+
5a. **Verify partial and edge states, not just the happy path (field report #373)** — for any new multi-step interaction (multi-confirm list, batch action, wizard, inline-edit-then-act), a green happy-path smoke is necessary-not-sufficient. The bugs live in the states the feature *introduces*: partial confirm (confirm some, leave siblings un-confirmed), reject-all, edit-then-add, edit-vs-confirm race. Explicitly exercise each partial/edge transition and assert state stays consistent. This applies with special force to **dark-flag features**: the comprehensive adversarial review must gate the **activation (flag flip)**, not the deploy — reviewing dark code only *after* it is activated ships review-findable bugs to prod. (Field report #373: a dark→activate→review ordering plus a happy-path-only smoke shipped 6+ confirmed blockers live — a premature batch-completion stranded un-confirmed siblings on a partial confirm; a later inline-edit follow-up shipped an edit-vs-confirm race, also already live. Both were exactly the partial/edge states the happy-path smoke never touched.)
|
|
321
341
|
6. **Data-UI enum consistency** — for every UI filter, dropdown, category selector, or status badge: extract the set of values used in the UI and compare against the canonical source (Prisma enum, DB CHECK constraint, TypeScript union, Python Enum). Flag mismatches. A single-character difference (e.g., `SHOPPING` in UI vs `SHOP` in enum) causes silent total failure — zero results, zero errors, zero log entries. This check must compare string values, not just count them. Also verify that new enum values added to the schema have corresponding UI representations. (Field report #263: category filter used `SHOPPING` but Prisma enum was `SHOP` — filter showed zero results for ~5 days with no errors.)
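The enum consistency check in item 6 comes down to comparing two sets of strings in both directions. A minimal sketch, assuming a Python `Enum` as the canonical source and a hypothetical `enum_drift` helper; the `Category` members mirror the field report's `SHOP` example and are otherwise invented.

```python
from enum import Enum


class Category(Enum):
    SHOP = "SHOP"
    FOOD = "FOOD"


def enum_drift(ui_values: set, enum_cls) -> dict:
    """Compare string values, not counts, in both directions."""
    canonical = {member.value for member in enum_cls}
    return {
        "unknown_in_ui": ui_values - canonical,    # UI sends values the schema rejects
        "missing_from_ui": canonical - ui_values,  # schema values with no UI representation
    }
```

Note that a count comparison would pass here (two UI values, two enum members) even though `SHOPPING` never matches `SHOP`, which is why the check must compare the strings themselves.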
|
|
322
342
|
|
|
323
343
|
If the server cannot be started (methodology-only project, missing dependencies), document why and skip with a note.
|
|
@@ -43,6 +43,17 @@ Clean, consistent, well-documented releases. Every version bump tells a story. E
|
|
|
43
43
|
4. **Never auto-push.** Push only when the user explicitly requests it.
|
|
44
44
|
5. **Present before executing.** Show the changelog entry, version bump, and commit message for user approval before committing.
|
|
45
45
|
6. **Breaking changes get called out.** If MAJOR, explain what breaks and why.
|
|
46
|
+
7. **Never `git add -A` a release.** Run the Step-0 unrelated-change split first (see "Unrelated-Change Detection" below) and stage by path.
|
|
47
|
+
|
|
48
|
+
## Unrelated-Change Detection — Step 0 (field report #384 RC-1)
|
|
49
|
+
|
|
50
|
+
A release must ship only what the session authored. The working tree can also carry **pre-existing or out-of-scope changes** — a stray dependency, a leftover edit, an untracked scratch/probe file — and a naive `git add -A` bundles them into the release. The v23.20.0 near-miss: a `vercel` dependency (added to root `dependencies` by a stray `npm install`, +~5,900 `package-lock.json` lines) sat in the tree unrelated to the release; it was caught only because the lead manually diffed `package.json`. Encode that vigilance instead of relying on it.
|
|
51
|
+
|
|
52
|
+
Before staging (in `/git` Step 0 and `/seal` Step 0):
|
|
53
|
+
|
|
54
|
+
1. **Dependency manifests get special scrutiny.** If any manifest/lockfile is in the diff — `package.json`, `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `requirements.txt`, `pyproject.toml`, `Cargo.toml`/`Cargo.lock`, `go.mod`/`go.sum`, `Gemfile`/`Gemfile.lock` — read the **dependency-level** diff (`git diff -- <manifest>`), not just the filename. Any dependency added/changed/removed that the session did not deliberately introduce is flagged for an explicit include/exclude decision. This enforces "no new dependencies without justification" (CLAUDE.md Coding Standards) at release time.
|
|
55
|
+
2. **Scope diff.** Compare the full changed-file list against what the session actually touched. Surface anything you did not author this session for an explicit keep/drop decision.
|
|
56
|
+
3. **Present the split** — *session-authored (stage these)* vs *pre-existing or out-of-scope (decide)* — and stage by path from the session-authored side only. `git add -A` / `git add .` re-admits exactly the changes this split exists to exclude.
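The dependency-level read in step 1 can be sketched as a comparison of two `package.json` snapshots, for example the committed version and the working-tree version. The function name `dependency_changes` is hypothetical, only two manifest sections are inspected, and the version strings in any example are placeholders.

```python
import json


def dependency_changes(before_json: str, after_json: str,
                       sections=("dependencies", "devDependencies")) -> list:
    """Return (section, name, old_version, new_version) for every changed dependency.

    An added dependency has old_version None; a removed one has new_version None.
    """
    before, after = json.loads(before_json), json.loads(after_json)
    changes = []
    for section in sections:
        old, new = before.get(section, {}), after.get(section, {})
        for name in sorted(set(old) | set(new)):
            if old.get(name) != new.get(name):
                changes.append((section, name, old.get(name), new.get(name)))
    return changes
```

Each returned tuple is a candidate for the explicit include/exclude decision: a stray addition like the `vercel` entry from the near-miss would surface as one row here rather than hiding inside thousands of lockfile lines.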
|
|
46
57
|
|
|
47
58
|
## Semver Rules
|
|
48
59
|
|
|
@@ -216,6 +227,33 @@ For each script discovered, document its purpose + waiver convention in the proj
|
|
|
216
227
|
|
|
217
228
|
**Methodology vs project tooling:** the SCRIPTS are project-specific; the DISCIPLINE (run all gates before push) is methodology. The orchestrator does not need to know what each script does — only that it exists and must pass.
|
|
218
229
|
|
|
230
|
+
## Removal Sweep
|
|
231
|
+
|
|
232
|
+
When a release deletes a symbol, export, prop, env var, command, or any other named artifact, the deletion is only half done until its *name* is gone everywhere too. A green build and a green test suite confirm the **code** compiles without it — they say nothing about the comments that still describe it, the README sentence that still tells users to set it, or the doc that still links to it. That prose drift survives every automated gate and ships as a silent lie.
|
|
233
|
+
|
|
234
|
+
**Rule:** Before commit, for every symbol/export/prop/env-var/command removed in this release, Coulson greps for its name across the **whole tree** — code AND comments AND user-facing copy (READMEs, docs, CLAUDE.md, command files, UI strings, help text) — not just source. Any surviving reference is either updated to match the new reality or itself removed, before the commit lands.
|
|
235
|
+
|
|
236
|
+
**Sweep shape (run once per removed name):**
|
|
237
|
+
|
|
238
|
+
```bash
|
|
239
|
+
# NAME = the deleted symbol/export/prop/env-var/command
|
|
240
|
+
git grep -nI -- "$NAME" -- ':!CHANGELOG.md' ':!PROJECT_VERSION.md' ':!VERSION.md'
|
|
241
|
+
# Every hit that is not the intentional "Removed" changelog line must be resolved.
|
|
242
|
+
```
|
|
243
|
+
|
|
244
|
+
**Why both, not just code.** Field report #375 (PerpWatch): retiring the shared `MONITOR_TOKEN` auth path left stale `MONITOR_TOKEN` references across ~8 comment sites **plus** user-facing copy ("set a monitor token") after the symbol was deleted — the build and 97 unit tests were green throughout, because none of the stale references were *code*. Root cause: no sweep step pairing a symbol removal with its prose. Pair the deletion with the grep, every time.
|
|
245
|
+
|
|
246
|
+
## Ship-and-Validate: New Artifact Type Needs a Validator the Same Release
|
|
247
|
+
|
|
248
|
+
A release that introduces a **new shipped artifact type** (a new file category copied into the distributed packages — e.g. `.claude/workflows/*.workflow.js`, a new agent format, a new pattern extension) MUST ship a matching pretest/CI validator **in the same release**. The validator runs in `pretest`/CI so the new category is checked on every build, not just by hand once. A release MUST NOT claim a validation it does not actually run — a CHANGELOG line, VERSION.md note, or release summary asserting "validated" / "passes `node --check`" / "schema-checked" is a hard defect unless a wired-in check actually produced that result this release.
|
|
249
|
+
|
|
250
|
+
**Coulson rejects a release when:**
|
|
251
|
+
|
|
252
|
+
1. The diff adds a new shipped file category but adds **no** validator that exercises it (no `scripts/validate-*.sh`, no CI step, nothing in `pretest`).
|
|
253
|
+
2. The CHANGELOG / VERSION.md / release notes assert a validation that no command in the release actually ran — verify the claim by running the asserted check before accepting the wording. An unrun claim gets the wording struck or the check wired in; never shipped as-is.
|
|
254
|
+
|
|
255
|
+
**Why.** Field report #366 (v23.18.0): the release added `.claude/workflows/*.workflow.js`, claimed "both scripts pass `node --check`" (FALSE — their top-level `return`/`await` make a bare `node --check` fail), and added **no** pretest validator. Three of the next release's fourteen bugs traced to that one omission. This is the recurring "referenced-but-doesn't-ship" / "gate that doesn't gate" class (#297, #352): the fix that closes it is a real validator wired into `pretest` plus an honest claim. (The companion distribution-paths checklist — wiring a new category into ALL of `prepack.sh`, `copy-assets.sh`, `project-init.ts`, and `updater.ts` — lives in BUILD_PROTOCOL.md Phase 12.75.)
|
|
256
|
+
|
|
219
257
|
## Post-Amend SHA Pin
|
|
220
258
|
|
|
221
259
|
`git commit --amend` rewrites the SHA but `logs/campaign-state.md` rows still reference the pre-amend SHA. Across a long campaign, these dangling references accumulate and break post-hoc audits (`git log --grep` against the recorded SHA returns nothing).
|