voidforge-build 23.10.0 → 23.11.0

Files changed (65)
  1. package/dist/.claude/agents/bashir-field-medic.md +1 -0
  2. package/dist/.claude/agents/coulson-release.md +3 -0
  3. package/dist/.claude/agents/irulan-historian.md +3 -0
  4. package/dist/.claude/agents/loki-chaos.md +1 -0
  5. package/dist/.claude/agents/picard-architecture.md +3 -0
  6. package/dist/.claude/agents/silver-surfer-herald.md +3 -0
  7. package/dist/.claude/agents/sisko-campaign.md +3 -0
  8. package/dist/.claude/commands/architect.md +38 -0
  9. package/dist/.claude/commands/campaign.md +2 -0
  10. package/dist/.claude/commands/gauntlet.md +11 -0
  11. package/dist/.claude/commands/git.md +13 -3
  12. package/dist/CHANGELOG.md +63 -0
  13. package/dist/CLAUDE.md +13 -4
  14. package/dist/VERSION.md +2 -1
  15. package/dist/docs/methods/AI_INTELLIGENCE.md +15 -0
  16. package/dist/docs/methods/BACKEND_ENGINEER.md +48 -0
  17. package/dist/docs/methods/CAMPAIGN.md +196 -1
  18. package/dist/docs/methods/DEVOPS_ENGINEER.md +16 -0
  19. package/dist/docs/methods/FORGE_KEEPER.md +18 -0
  20. package/dist/docs/methods/GAUNTLET.md +2 -0
  21. package/dist/docs/methods/QA_ENGINEER.md +46 -0
  22. package/dist/docs/methods/RELEASE_MANAGER.md +59 -0
  23. package/dist/docs/methods/SECURITY_AUDITOR.md +53 -0
  24. package/dist/docs/methods/SUB_AGENTS.md +90 -0
  25. package/dist/docs/methods/SYSTEMS_ARCHITECT.md +42 -2
  26. package/dist/docs/methods/TESTING.md +17 -0
  27. package/dist/docs/methods/TIME_VAULT.md +17 -0
  28. package/dist/docs/patterns/adr-verification-gate.md +80 -0
  29. package/dist/docs/patterns/ai-eval.ts +87 -0
  30. package/dist/docs/patterns/ai-prompt-safety.ts +242 -0
  31. package/dist/docs/patterns/audit-log.ts +132 -0
  32. package/dist/docs/patterns/llm-state-dedup.ts +246 -0
  33. package/dist/docs/patterns/middleware.ts +83 -0
  34. package/dist/docs/patterns/multi-tenant-pool-bypass.ts +134 -0
  35. package/dist/docs/patterns/multi-tenant-property-test.ts +127 -0
  36. package/dist/docs/patterns/refactor-extraction.md +96 -0
  37. package/dist/scripts/voidforge.js +0 -0
  38. package/dist/wizard/lib/anomaly-detection.d.ts +59 -0
  39. package/dist/wizard/lib/anomaly-detection.js +122 -0
  40. package/dist/wizard/lib/asset-scanner.d.ts +23 -0
  41. package/dist/wizard/lib/asset-scanner.js +107 -0
  42. package/dist/wizard/lib/build-analytics.d.ts +39 -0
  43. package/dist/wizard/lib/build-analytics.js +91 -0
  44. package/dist/wizard/lib/codegen/erd-gen.d.ts +16 -0
  45. package/dist/wizard/lib/codegen/erd-gen.js +98 -0
  46. package/dist/wizard/lib/codegen/openapi-gen.d.ts +15 -0
  47. package/dist/wizard/lib/codegen/openapi-gen.js +79 -0
  48. package/dist/wizard/lib/codegen/prisma-types.d.ts +15 -0
  49. package/dist/wizard/lib/codegen/prisma-types.js +44 -0
  50. package/dist/wizard/lib/codegen/seed-gen.d.ts +16 -0
  51. package/dist/wizard/lib/codegen/seed-gen.js +128 -0
  52. package/dist/wizard/lib/correlation-engine.d.ts +59 -0
  53. package/dist/wizard/lib/correlation-engine.js +152 -0
  54. package/dist/wizard/lib/desktop-notify.d.ts +27 -0
  55. package/dist/wizard/lib/desktop-notify.js +98 -0
  56. package/dist/wizard/lib/image-gen.d.ts +56 -0
  57. package/dist/wizard/lib/image-gen.js +159 -0
  58. package/dist/wizard/lib/natural-language-deploy.d.ts +30 -0
  59. package/dist/wizard/lib/natural-language-deploy.js +186 -0
  60. package/dist/wizard/lib/project-init.js +57 -0
  61. package/dist/wizard/lib/route-optimizer.d.ts +28 -0
  62. package/dist/wizard/lib/route-optimizer.js +93 -0
  63. package/dist/wizard/lib/service-install.d.ts +18 -0
  64. package/dist/wizard/lib/service-install.js +182 -0
  65. package/package.json +1 -1
@@ -51,6 +51,39 @@ Autonomous campaign execution: read the PRD, figure out what's next, build it, v
51
51
  12. **Log deviations.** When the build deviates from PRD architecture, update the PRD or log it in campaign-state.md. Never leave a silent contradiction.
52
52
  13. **Operational verification after deploy.** After deploying to a live environment, wait for 1 full operational cycle (1 trade cycle, 1 cron job, 1 polling interval) and check logs for errors, halts, and successful operations before marking the mission complete. "It deployed" ≠ "it works." (Field report #152)
53
53
 
54
+ ### Audit-first missions for callsite-counted ADRs
55
+
56
+ When `/architect` produces an ADR whose effort estimate scales with callsite count (refactor sweeps, ContextVar migrations, security boundary tightening, schema-column adds across N tables), require a paired **audit mission** BEFORE plan finalization. The audit produces the actual count; plan estimates use that count, not the architect's grep-by-eye guess.
57
+
58
+ Examples that triggered this rule:
59
+ - ADR-138 (Union Station, #316 §3): estimated "12+ unscoped tenant-pool callsites." Audit mission M-04.5 found ~900 across ~95 files — 75x off. The original "manual per-callsite refactor" plan was infeasible at that scale; audit forced the architectural rewrite (pool callback + ContextVar middleware).
60
+ - Any ADR claiming "tighten boundary at every X" — if X is a code pattern, count it before planning.
61
+
62
+ **Audit mission shape:**
63
+ 1. Picard or Spock writes the audit query (grep recipe, AST query, or schema introspection)
64
+ 2. The audit runs against the live codebase
65
+ 3. The result is committed as `docs/audits/<topic>-<date>.md` with the count + per-file breakdown
66
+ 4. THEN the architect re-estimates effort and sequences subsequent missions
67
+
68
+ Skip this step only when the ADR's scope is bounded by entity (one file, one table, one route group) — bounded ADRs don't need an audit because the count is visible in the scope itself.
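
A minimal sketch of what an audit query could look like, assuming a line-oriented regex is an adequate proxy for "callsite" (an AST query would be stricter). The function names, the `.py` suffix filter, and the `pool.acquire(` pattern in the usage note are illustrative assumptions, not part of the methodology.

```python
import re
from collections import Counter
from pathlib import Path


def audit_callsites(root: str, pattern: str, suffixes=(".py",)) -> Counter:
    """Count lines matching `pattern` in each file under `root`.

    Returns a Counter keyed by path relative to `root`; files with zero
    hits are omitted so the breakdown only lists affected files.
    """
    regex = re.compile(pattern)
    counts: Counter = Counter()
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = sum(1 for line in text.splitlines() if regex.search(line))
        if hits:
            counts[path.relative_to(root).as_posix()] = hits
    return counts


def render_audit(topic: str, pattern: str, counts: Counter) -> str:
    """Render the total plus a per-file breakdown as markdown text."""
    lines = [
        f"# Audit: {topic}",
        "",
        f"Pattern: `{pattern}`",
        f"Total: {sum(counts.values())} callsites across {len(counts)} files",
        "",
    ]
    lines += [f"- {path}: {n}" for path, n in counts.most_common()]
    return "\n".join(lines) + "\n"
```

Calling `render_audit("tenant-pool", r"pool\.acquire\(", audit_callsites(".", r"pool\.acquire\("))` would produce text suitable for a `docs/audits/<topic>-<date>.md` file; the pattern here is hypothetical and would come from the ADR under review.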
69
+
70
+ ### Closeout grep pinning
71
+
72
+ When a `/campaign` closeout report cites a followup count or backlog size (e.g., "F-V710-ORG1-DEFAULTS — ~12 sites remaining" or "~21 cumulative followups"), the followup definition MUST embed the literal grep pattern + observed `n=N` at closeout HEAD. The next campaign's `/architect --plan` re-runs the same grep before accepting the count.
73
+
74
+ **Required closeout shape:**
75
+
76
+ ```
77
+ ### F-<NAME>
78
+ Scope: verified at <SHA> with `grep -rcE '<pattern>' <paths> | awk -F: '{s+=$NF} END {print s+0}'` → n=<COUNT>
79
+ Severity: <level>
80
+ Status: open
81
+ ```
82
+
83
+ Field report #329 documents the cost of skipping this: v7.10 closeout cited "~12 sites" for `org_id=1` defaults without a verification grep. v7.11 plan-mode re-ran the grep and found 65 sites — 5× the estimate. The campaign plan had to restructure into a parallel sub-campaign (M-59-SWEEP) instead of a serial mission slate. The verification grep is one shell command. Skipping it cascades into wrong plans.
84
+
85
+ This rule reciprocates the ADR-side "Scope-confidence interval" requirement in `SYSTEMS_ARCHITECT.md` — the closeout writes the grep, the next plan re-runs it.
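
The re-run step can be sketched as follows. This is an illustration under stated assumptions: `PinnedFollowup`, `count_matching_lines`, and `verify_pin` are hypothetical names, file contents are passed in as a dict for simplicity, and Python's `re` stands in for `grep -E` (the two dialects differ in places, so a real check would re-run the literal pinned command). Like `grep -c`, it counts matching lines rather than individual matches.

```python
import re
from dataclasses import dataclass


@dataclass
class PinnedFollowup:
    """A followup entry as pinned at closeout (hypothetical structure)."""
    name: str
    pattern: str
    recorded_n: int


def count_matching_lines(pattern: str, sources: dict) -> int:
    """Count lines that match `pattern` across all sources (path -> text)."""
    regex = re.compile(pattern)
    return sum(
        1
        for text in sources.values()
        for line in text.splitlines()
        if regex.search(line)
    )


def verify_pin(followup: PinnedFollowup, sources: dict) -> tuple:
    """Return (still_accurate, observed_count) for a pinned followup."""
    observed = count_matching_lines(followup.pattern, sources)
    return observed == followup.recorded_n, observed
```

A plan step would refuse to accept `recorded_n` when the first element of the result is `False`, and would use the observed count instead.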
86
+
54
87
  ### TECH_DEBT SLA enforcement
55
88
 
56
89
  `/campaign` and `/assemble` audit `TECH_DEBT.md` before every mission selection. Critical + Immediate + LowEffort items overdue by 48h BLOCK campaign advancement. Critical + Immediate + HighEffort items overdue by 72h without owner + deadline BLOCK. High + Immediate items overdue by 7 days WARN.
@@ -169,7 +202,8 @@ Dax reads the Prophets' plan:
169
202
  7. Diff: PRD requirements vs. implemented features (structural AND semantic — not just "does the route exist?" but "does the component render what the PRD describes?")
170
203
  8. Produce: **The Prophecy Board** — ordered list of missions with scope, plus a separate list of BLOCKED items (assets, credentials, user decisions)
171
204
  8a. **Cross-mission data handoff check (Odo):** For any system that forms a closed loop (e.g., generate → track → analyze → feed back), identify every data handoff point between missions. Each handoff must be explicitly scoped in at least one mission: "Mission N produces X, Mission M consumes X via [mechanism]." If the loop spans 3+ missions, draw the handoff map. Unscoped handoffs become no-ops — the code on each side compiles and tests independently, but the data never flows between them. (Field report #265: seedPush extracted winning variant data but discarded it — the feedback loop was documented but not wired because the two ends were in separate missions with no explicit handoff.)
172
- 9. **Acceptance criteria gate:** Every mission on the Prophecy Board MUST have at least one acceptance criterion before Dax finalizes the board. Acceptance criteria are concrete, verifiable conditions "endpoint returns 200 with correct schema," "UI renders empty/loading/error/success states," "test covers the happy path." Missions without acceptance criteria are stubs that escape quality gates later. If a mission's scope is too vague to produce criteria, it's too vague to build split or clarify first. This applies to `--plan` mode too, not just build mode. (Field report #129: Phases 3-6 written as stubs without criteria, caught late by blitz compliance check.)
205
+ 9. **Cluster-mission recognition:** Before finalizing the board, Dax asks: "Are any of these missions cluster-natured?" A cluster-mission is a single-line entry that actually spans 4+ ADR sections, 4+ sub-components, or 4+ migration steps. Examples: M-51 cluster (per-org MCP topology) genuinely required 4 sub-missions per ADR-107 §c-§f; M-44 series required 5 sub-missions per ADR-117. Pretending a cluster is one mission produces 2-3× planning underestimates and forces mid-campaign restructuring. If a mission has 4+ named deliverables in different files/modules, split into sub-missions (M-51a/b/c/d) at plan time, not at execution time. (Field report #326: Sisko's original v7.10 slate was 9 missions; reality was 21 because cluster recognition was deferred.)
206
+ 10. **Acceptance criteria gate:** Every mission on the Prophecy Board MUST have at least one acceptance criterion before Dax finalizes the board. Acceptance criteria are concrete, verifiable conditions — "endpoint returns 200 with correct schema," "UI renders empty/loading/error/success states," "test covers the happy path." Missions without acceptance criteria are stubs that escape quality gates later. If a mission's scope is too vague to produce criteria, it's too vague to build — split or clarify first. This applies to `--plan` mode too, not just build mode. (Field report #129: Phases 3-6 written as stubs without criteria, caught late by blitz compliance check.)
173
207
 
174
208
  ### Deep Codebase Scan for PRD Diff
175
209
 
@@ -279,6 +313,120 @@ These issues are invisible to standard code review but Critical when found by th
279
313
 
280
314
  After each mission's 1-round review, check: "Did this mission modify any file that was also modified by a prior mission in this campaign?" If so, verify that the prior mission's patterns (error handling, locking, validation) are preserved in the new changes. This is a 30-second scan per shared file — run `git log --name-only` to identify cross-mission file overlap. Cross-cutting bugs that span files modified in different missions are invisible to single-mission review. (Field report #38: 2 Critical findings — chat stream timeout and optimistic locking omission — both involved files modified across multiple missions.)
281
315
 
316
+ ### Caller-Graph Audit (silent-default abstractions)
317
+
318
+ When a mission closes a class of bug rooted in a silent default value (`org_id: int = 1`, `tenant_id = None`, `user_id = SYSTEM`, `region = "us-east-1"`), the mission brief MUST enumerate every caller-graph site of every function whose default was wrong — not just the primary site that surfaced the bug.
319
+
320
+ **Detection pattern (F6-class abstractions):**
321
+
322
+ A "silent default" is a parameter whose default value short-circuits a multi-tenant / authorization / region-scoping invariant when the caller forgets to pass it. The fix is two-step:
323
+
324
+ 1. Remove the default OR change it to a sentinel that fails-loud (`None` with assertion, `NotProvided` enum value)
325
+ 2. Update every caller — the mission's grep MUST find them all
326
+
327
+ Without step 2, step 1 alone is unsafe. Callers that omitted the parameter were relying on the wrong default, so making the function fail loud breaks them until they are updated. Worse, callers that PASSED `org_id=1` explicitly (the wrong default) are untouched by step 1 and keep leaking across tenants silently. The cleanup density of these bugs is high: one explicit defect surfaces N adjacent latents of the same family.
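
One way step 1 could look in Python is sketched below, assuming the sentinel variant rather than outright removal of the default. `_NotProvided`, `NOT_PROVIDED`, and `fetch_widgets` are hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Union


class _NotProvided(Enum):
    """Single-member enum used as a sentinel distinct from None and from any int."""
    TOKEN = "not-provided"


NOT_PROVIDED = _NotProvided.TOKEN


def fetch_widgets(org_id: Union[int, _NotProvided] = NOT_PROVIDED) -> str:
    """Hypothetical tenant-scoped function with a fail-loud default."""
    if org_id is NOT_PROVIDED:
        # Fail loud: a forgotten argument must never resolve to a real tenant.
        raise TypeError("org_id is required; refusing to fall back to a default tenant")
    return f"widgets for org {org_id}"
```

The sentinel keeps the signature source-compatible while turning a forgotten argument into an immediate error. It does nothing about callers that pass `org_id=1` explicitly, which is why the caller enumeration in step 2 is still needed.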
328
+
329
+ **Required mission shape:**
330
+
331
+ ```
332
+ M-XX — F-V710-ORG1-DEFAULTS cleanup, batch N
333
+ Scope: <module/* glob>
334
+ Pre-mission audit:
335
+ grep -rnE 'org_id\s*:\s*int\s*=\s*1' <glob> → N callsites
336
+ For each callsite, classify:
337
+ - Defensible (cross-tenant by design, documented)
338
+ - Retrofit residue (CRITICAL, fix)
339
+ - Caller of retrofit residue (verify it passes a real org_id)
340
+ Fix shape:
341
+ 1. Remove the default (or replace with fail-loud sentinel)
342
+ 2. Update every caller-graph site enumerated above
343
+ 3. Add a property test or AST lint to prevent re-introduction
344
+ ```
345
+
346
+ Field report #326 (Union Station v7.10 M-55): the mission brief named 9 sigs in `widget_pipeline.py`. Reality required 11 sigs (9 + 2 in `registry.py`) PLUS 2 sigs in `providers.py`. The bonus F-K-M55-1 HIGH bug — cross-tenant providers leak via `SlackCacheContext.inject` and `IdeasDbContext.enrich` defaults — was caught by Kenobi during M-55 review and fixed same-commit. Without the caller-graph enumeration upfront, it would have shipped.
347
+
348
+ This is the "cleanup-density pattern" (F-V710-CLEANUP-DENSITY in the source field report). Budget for it: when a mission closes one defect in this class, expect 0.5-2 bonus defects of the same family in the same commit.
349
+
350
+ ### V710 Acceptance Template Inheritance Counter
351
+
352
+ For multi-mission campaigns shipping a class of fix (multi-tenant retrofit, dialect migration, auth tightening), establish a project-scoped **acceptance template** — a 4-8 item matrix that every mission in the class must satisfy. Track inheritance with a monotonic counter; relax to spot-check or retire only via explicit Picard-countersigned amendment at version-rollover gates.
353
+
354
+ **Template anatomy (per field report #326, v7.10):**
355
+
356
+ The V710 template (Batman, M-50b R0) was 6 items:
357
+
358
+ 1. NO POLICY-AS-PURE-FUNCTION (policies must consume DB state, not compile-time constants)
359
+ 2. NO LOG-OUTPUT-AS-ASSERTION (test assertions must check returned values, not log lines)
360
+ 3. NO SINGLE-HAPPY-PATH (every test covers happy + at least one negative + boundary)
361
+ 4. NO `?→$N` MONKEYPATCH SHIMS (use proper fixtures, not parameter rewrites at test time)
362
+ 5. ASYNC PATHS GET ASYNC TESTS (await paths tested under `asyncio.run`, not synchronously)
363
+ 6. SHAPE OF TRUTH PINS (test asserts the literal SQL/JSON shape returned, not a derived count)
364
+
365
+ Each mission's test files were reviewed against the matrix at acceptance. Counter advanced 1/1, 2/2, ... clean inheritance through 13/13 by Phase C close, 20/20 by v7.10 closeout. Zero waivers.
366
+
367
+ **Why this works:**
368
+
369
+ A class of bug recurs because the test shape that would catch it is non-obvious. Document the test shape ONCE as an acceptance template; every mission in the class inherits the discipline. The counter ratchets forward without becoming bureaucratic because it's tied to a real bug class, not made-up process.
370
+
371
+ **When to relax:**
372
+
373
+ At version-rollover gates (v7.X → v7.X+1, phase boundaries), Picard countersigns ONE of:
374
+
375
+ - **(a) re-affirm** — keep mandatory for next phase (default if unrecorded)
376
+ - **(b) relax-to-spot-check** — sample 1 in N missions
377
+ - **(c) retire** — the bug class is closed; future tests don't need the matrix
378
+
379
+ Document the decision in a `logs/campaign-decisions-{version}.md` entry. Defaulting to re-affirm preserves discipline; relax/retire requires evidence (the bug class hasn't recurred for N missions).
380
+
381
+ ### Operator Decision Documents
382
+
383
+ For campaigns where the operator delegates architectural calls mid-campaign (split mission X into A+B, choose recomputation over bit-cast, accept ADR amendment scope, etc.), record each call in `logs/campaign-decisions-{version}.md` with a stable ID, the call, and the rationale.
384
+
385
+ **Why this is a first-class artifact:**
386
+
387
+ Mission briefs encode WHAT to build. Operator decisions encode WHAT THE OPERATOR CHOSE among multiple valid paths. Without a decision log, agents reading the campaign-state later cannot distinguish operator intent from happenstance — they're equally likely to rationalize the wrong choice as "load-bearing" or "incidental."
388
+
389
+ **Shape:**
390
+
391
+ ```
392
+ # Campaign Decisions — v7.10
393
+
394
+ ## D-1: M-51 split scope (2026-05-08)
395
+ Question: split per-org MCP topology into 1, 2, or 4 sub-missions?
396
+ Operator chose: 4 (M-51b/c/d/e per ADR-107 §c-§f)
397
+ Rationale: cluster-mission recognition; each sub-section was a 1-2 day mission.
398
+
399
+ ## D-2: M-52 recomputation vs bit-cast (2026-05-09)
400
+ Question: V092 migrates 13.6K embedding rows. Use bit-cast or recompute from source?
401
+ Operator chose: recompute from source (NULL → repopulate via embed_text())
402
+ Rationale: ADR-110's bit-cast scheme was wrong as written (Spock + Loki LK-2.2 caught
403
+ that "bit-identical reinterpret cast" was actually a value cast that would corrupt
404
+ every row). Recompute is slow (~3 min for 13.6K rows) but correct.
405
+ ...
406
+ ```
407
+
408
+ Field report #326 (v7.10 M-57): a literal SQL `false→true` flip referenced in D-6 would have BROKEN the contract because ADR-138 §Addendum supersedes ADR-139. Investigation surfaced the supersession; the mission honored architectural reality instead of mechanical application. The decision log captured INTENT; the mission delivered CORRECTNESS. Both are required.
409
+
410
+ ### LOC Growth Tracker (per-mission)
411
+
412
+ After each mission's 1-round review, run a LOC sweep against files modified in the mission. Any file that crossed the 300-LOC threshold (or grew >100 LOC in a single mission) is flagged for split-or-justify review.
413
+
414
+ **Detection (run post-build, before commit):**
415
+
416
+ ```bash
417
+ # Files this mission touched
418
+ git diff --name-only HEAD | while IFS= read -r f; do
419
+ if [ -f "$f" ]; then
420
+ lines=$(wc -l < "$f")
421
+ [ "$lines" -gt 300 ] && echo "LOC: $f -> $lines (over 300)"
422
+ fi
423
+ done
424
+ ```
425
+
426
+ When the tracker fires: Boromir or Stark reviews whether the growth is justified or if a split is overdue (per ADR-066 or project's equivalent decomposition boundary). The check costs <5 seconds; the alternative is the Gauntlet catching it 4 missions later, when the file has compounded with two other missions' changes and the safe split is harder.
427
+
428
+ Field report #322 (barrierwatch): `statistical-gate.ts` grew 425 → 775 LOC across M5 + M6 + Fix Batch additions. Each per-mission review was clean in isolation — only Gauntlet Round 3 caught the cumulative violation. A per-mission LOC tracker would have surfaced it at +100, not +350.
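
The shell detection above checks only the 300-LOC threshold. The sketch below also covers the ">100 LOC in a single mission" rule, assuming the input is the output of `git diff --numstat HEAD` plus current line counts gathered separately. `parse_numstat` and `flag_loc` are hypothetical helper names, and renamed paths are not specially handled.

```python
def parse_numstat(numstat: str) -> dict:
    """Map path -> net line growth from `git diff --numstat` output.

    Each row is "<added>\t<deleted>\t<path>"; binary files report "-"
    for both counts and are skipped.
    """
    growth = {}
    for row in numstat.splitlines():
        parts = row.split("\t")
        if len(parts) != 3 or parts[0] == "-":
            continue
        added, deleted, path = parts
        growth[path] = int(added) - int(deleted)
    return growth


def flag_loc(growth: dict, line_counts: dict, threshold: int = 300, max_growth: int = 100) -> list:
    """Flag files over the LOC threshold or over the per-mission growth limit."""
    flags = []
    for path, delta in sorted(growth.items()):
        total = line_counts.get(path)
        if total is None:  # file was deleted in this mission
            continue
        if total > threshold:
            flags.append(f"LOC: {path} -> {total} (over {threshold})")
        if delta > max_growth:
            flags.append(f"LOC: {path} grew +{delta} (over +{max_growth})")
    return flags
```

Keeping the parsing and the flagging as pure functions makes the rule easy to unit-test; a thin wrapper would supply the real `git` output and file line counts.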
429
+
282
430
  ### Pattern Replication Check
283
431
 
284
432
  When a mission duplicates or extends an existing code path (adding a version-aware path alongside a legacy path, adding a new endpoint that mirrors an existing one), verify that security patterns (locking, rate limiting, validation, sanitization) from the original path are replicated in the new path. Grep for the original pattern and confirm it exists in the new code. (Field report #38: optimistic locking in legacy chat edit was not replicated to the version-aware path.)
@@ -369,6 +517,53 @@ If you believe context justifies reducing quality:
369
517
 
370
518
  The Gauntlet is never reduced. Checkpoints are never lightweight. Debriefs are never skipped. Run `/context` or run the full protocol.
371
519
 
520
+ ### Pause-Bias Anti-Pattern (autonomous mode)
521
+
522
+ When a mission completes in autonomous mode (`--blitz`, `--autonomous`, or default ADR-043 autonomous-by-default), the orchestrator's next action is to mark the next mission `in_progress` and START. Do NOT present "milestone summaries" framed as decision points. Do NOT ask "continue with M-X or pause?" Do NOT rationalize a pause as "strategic checkpoint" or "context budget management."
523
+
524
+ **The distinction is structural, not stylistic:**
525
+
526
+ - ✅ **Status update** (valid): "M-7.4 shipped at `6d7f5b3`. Starting M-7.5."
527
+ - ❌ **Decision-frame question** (anti-pattern): "M-7.4 complete. Continue with M-7.5 or pause for review?"
528
+ - ❌ **Rationalized pause** (anti-pattern): "Given the context usage, recommend resuming M-7.5 in a fresh session."
529
+
530
+ Status updates are FINE — they tell the operator what happened. Questions are NOT — they shift control back to the operator, which defeats the autonomous-by-default contract.
531
+
532
+ **The only valid pause triggers in autonomous mode** remain:
533
+ 1. `/context` shows actual usage above 85% (cite the number)
534
+ 2. A BLOCKED item requires user input (e.g., missing credentials, design decision)
535
+ 3. A Critical finding from `/assemble` that can't be auto-fixed (per `--autonomous` git-tag rollback)
536
+ 4. The user interrupts
537
+
538
+ Field report #323 (barrierwatch Phase 2): mid-campaign, after 4 missions shipped, the orchestrator presented a "checkpoint summary" with `"Continue with M-7.5 or pause?"` The operator responded sharply that pause-bias was a recurring pattern and to stop asking questions. The rationalizations ("context budget," "strategic checkpoint") had been rejected before via project memory; the orchestrator re-introduced them anyway.
539
+
540
+ **Why this happens:** the pause-frame feels like good operational hygiene. It is not. ADR-043 made autonomous the default specifically to eliminate this friction. Status updates between missions preserve transparency without re-introducing decision points. Trust the operator to interrupt if they need to.
541
+
542
+ ### ROADMAP Path Disambiguation
543
+
544
+ If both `ROADMAP.md` (root) and `docs/ROADMAP.md` exist, the **root** file is canonical for active campaign state. `docs/ROADMAP.md` is typically historical or aspirational — do not mutate it during `/campaign` or `/architect --plan` unless explicitly scoped.
545
+
546
+ Sisko + Picard verify which file holds the active campaign section at Step 0/Step 1 entry. The verification is a single `head -20 ROADMAP.md docs/ROADMAP.md` to inspect both — disambiguate before reading, not after editing the wrong one.
547
+
548
+ Field report #323: Victory Gauntlet reported "no Active Campaign section in `docs/ROADMAP.md`" — false alarm because the canonical `ROADMAP.md` was at repo root (2761 LOC). Reading the wrong file produces wrong findings.
549
+
550
+ ### Pre-Split Blocker Phase (ADR-066-style file splits)
551
+
552
+ When a campaign includes file splits of signing-critical, replay-critical, or load-bearing modules (per ADR-066 or project equivalent), the first split commit MUST be preceded by a "Phase B" of pre-split blockers. Without them, splits ship correctness regressions that surface only in production.
553
+
554
+ **The four pre-split blockers** (all must pass tsc + lint + existing test suite BEFORE any split commit):
555
+
556
+ 1. **Signing/serialization golden-vector test** for every signing path the splits will touch (EIP-712, action hashes, HMAC, JWT). Pinned hex inputs → pinned hex output.
557
+ 2. **Byte-identical replay-equivalence harness** using a frozen test fixture. Pattern: copy live DB to `runs/fixtures/`, capture canonical stdout via the deterministic strategy/runner, commit baseline + sha256 lock, write a vitest/pytest `skipIf-fixture` test that diffs new output against baseline.
558
+ 3. **ADR amendment** declaring the canonical home for any utility being extracted (rate-limiter, types, error-envelope). Preempts ad-hoc decisions during the splits.
559
+ 4. **Load-bearing function ships first** if the campaign also implements a critical decision function (e.g., `applyReengagementGate`) — ship the function before the split that would otherwise re-route its wiring.
560
+
561
+ Each blocker is its own mission, gated on tsc + lint + tests green. THEN the split missions land.
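
As an illustration of blocker 1, here is a minimal golden-vector test for an HMAC-SHA256 path. It is a sketch, not the reference implementation: `sign` is a hypothetical stand-in for the project's signing function, and the pinned values are test case 2 from RFC 4231 rather than project data.

```python
import hashlib
import hmac

# Pinned inputs and output (RFC 4231, test case 2, HMAC-SHA-256).
GOLDEN_KEY = b"Jefe"
GOLDEN_MESSAGE = b"what do ya want for nothing?"
GOLDEN_DIGEST = "5bdcc146bf60754e6a042426089575c75a003f089d2739839dec58b964ec3843"


def sign(key: bytes, message: bytes) -> str:
    """Hypothetical signing path under test: hex-encoded HMAC-SHA256."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def test_sign_matches_golden_vector() -> None:
    """Fails if a refactor changes the bytes that get signed or the digest."""
    assert hmac.compare_digest(sign(GOLDEN_KEY, GOLDEN_MESSAGE), GOLDEN_DIGEST)
```

A real suite would pin one vector per signing path the splits touch, captured from the pre-split code, so that any change in serialization order or encoding shows up as a digest mismatch.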
562
+
563
+ Reference implementation from field report #323 (barrierwatch v0.5.0): `src/research/lib/rolling-r1-verdict.ts` (Zod-schema parser for cron output contracts), `src/__tests__/replay-equivalence.test.ts` + `scripts/replay-equivalence.sh` + `tests/fixtures/replay-r1-baseline.txt` (byte-identical regression harness), `state/db-checksum-baseline.txt` (frozen fixture pin). Three splits (`hl-exchange-client.ts`, `pm-clob-client.ts`, `main.ts`) all preserved replay-equivalence BYTE-IDENTICAL.
564
+
565
+ The discipline marks the difference between "splits worked" and "splits worked safely."
566
+
372
567
  ### Step 5 — Debrief and Commit
373
568
 
374
569
  1. **Security gate (before commit):** Check if this mission added new TypeScript/JavaScript files that handle network I/O (HTTP endpoints, WebSocket handlers), user input (form parsing, body parsing), or credential storage (vault writes, env file generation). If yes, flag: **"This mission added network-facing code. Run `/sentinel` before committing."** Even in `--fast` mode, security is non-negotiable for new attack surface. This prevents shipping Critical vulnerabilities that only get caught in a post-hoc hardening pass.
@@ -142,6 +142,22 @@ Known issues when deploying Tailwind v4 to Vercel or similar build platforms:
142
142
 
143
143
  Never combine methodology syncs (`/void`) with unrelated debugging in the same session. If a sync introduces a problem, the debug commits interleave with sync commits, making it impossible to identify which change broke what. Rule: sync first, verify, THEN debug separately. If needed, hard-reset to the pre-sync state and reapply incrementally. (Field report #29: 6 retcon commits interleaved with 20 CSS-fix commits.)
144
144
 
145
+ ### Production Runtime Topology Authoritative-Source
146
+
147
+ Production runtime should run under a **single supervisor** — typically systemd, sometimes PM2 or Docker — and the active topology must be discoverable from one source. Temporary workarounds drift the topology silently:
148
+
149
+ - A `nohup`/`tmux`/manual `&` launch outlives its purpose; the systemd unit drifts from reality.
150
+ - `ExecStart` paths ossify against an old binary location (`~/.local/bin/uvicorn` vs `.venv/bin/uvicorn`).
151
+ - `StartLimitBurst` exhausts; the unit shows `failed` while a manual process serves traffic.
152
+
153
+ When a temporary workaround is acceptable, document it in `OPERATIONS.md` §Runtime Topology (or equivalent) as the canonical runtime, then either fix the systemd unit OR set a calendar reminder to revisit it. Field report #319 §7: Union Station served via nohup-launched uvicorn from 2026-03-27 onward — the systemd unit was `enabled` but `failed`. M-05 cutover required killing the nohup process (brief outage), fixing `ExecStart`, `systemctl reset-failed`, `daemon-reload`, `restart`. None of that should have been in the cutover contract.
154
+
155
+ **Pre-deploy check (mandatory):**
156
+
157
+ 1. `systemctl status <unit>` (or `pm2 list`) — what does the supervisor think is running?
158
+ 2. `ps -ef | grep <binary>` — what's actually running?
159
+ 3. Reconcile. If they disagree, fix BEFORE the deploy starts.
160
+
145
161
  ### Process Manager Discipline
146
162
 
147
163
  If a process manager (PM2, systemd, Docker, supervisord) owns the application port, NEVER kill the port directly (`fuser -k`, `kill`, `lsof -ti | xargs kill`). Always reload through the process manager: `pm2 reload`, `systemctl restart`, `docker compose restart`. Killing the port causes the process manager to auto-restart the old build, creating a race condition with any manual start attempt — the user sees stale code while the fix is already built. (Field report #123: 30+ minutes of stale code serving in production because `fuser -k 5005/tcp` raced with PM2's auto-restart.)
@@ -118,6 +118,24 @@ Fetch the latest from the source:
118
118
  - **Unchanged** — identical
119
119
  - **Locally modified** — local version differs from BOTH the old upstream and new upstream (user made custom changes)
120
120
 
121
+ ### Step 1.4 — Distribution-vs-Source Drift Check (Goldberry)
122
+
123
+ When syncing methodology, verify that every artifact mentioned in the published `CLAUDE.md` prose is actually present after sync. CLAUDE.md cites scripts and paths as if they exist; if the npm package's `files` array or `prepack.sh` doesn't ship them, downstream consumers get prose that points at nothing.
124
+
125
+ **Procedure:**
126
+ 1. Grep the synced `CLAUDE.md` for path-shaped strings: `scripts/`, `.claude/settings`, `docs/adrs/`, `bash scripts/`.
127
+ 2. For each path, run `[ -e <path> ] && echo present || echo MISSING`.
128
+ 3. If any cited path is missing, this is a **distribution gap** — flag to the user with the specific missing entries.
129
+ 4. Surface the gap as a manifest line in Step 2:
130
+ ```
131
+ Distribution gap detected:
132
+ - scripts/surfer-gate/check.sh — cited in CLAUDE.md but not shipped
133
+ - scripts/surfer-gate/record-roster.sh — cited in CLAUDE.md but not shipped
134
+ Action: pull from tmcleod3/voidforge:<paths> and re-run sync, OR upgrade to vX.Y.Z+ where the gap is closed.
135
+ ```
136
+
137
+ This catches future ADR-051-shaped drift: a permanent enforcement mechanism that the methodology documents as live but never actually ships. Field report #317 documents this exact failure for the Silver Surfer Gate scripts pre-v23.10.0 — the gap was known and recorded in CHANGELOG.md for at least one published version before being closed.
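
Procedure steps 1 to 3 could be automated roughly as below. This is a sketch under assumptions: the `PATH_RE` regex is one guess at what "path-shaped" means (it covers `scripts/`, `docs/adrs/`, and `.claude/` prefixes), and `cited_paths` and `missing_paths` are hypothetical names.

```python
import re
from pathlib import Path

# Matches paths that start with one of the cited prefixes and do not end in
# trailing punctuation such as a sentence-final period.
PATH_RE = re.compile(r"(?<![\w./-])((?:scripts|docs/adrs|\.claude)/[\w./-]*[\w-])")


def cited_paths(markdown: str) -> list:
    """Return the sorted, de-duplicated path-shaped strings cited in the text."""
    return sorted(set(PATH_RE.findall(markdown)))


def missing_paths(markdown: str, root: str) -> list:
    """Return cited paths that do not exist under `root` (the distribution gap)."""
    return [p for p in cited_paths(markdown) if not (Path(root) / p).exists()]
```

A non-empty result from `missing_paths` corresponds to the "Distribution gap detected" manifest lines shown above.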
138
+
121
139
  ### Step 1.5 — Spring Cleaning (Treebeard)
122
140
 
123
141
  When upgrading across versions, check the **Migration Registry** for one-time cleanup actions that apply to the version range being crossed. Migrations run only once: they clean up artifacts from older VoidForge versions that should never have been in the npm package.
@@ -162,6 +162,8 @@ Fix batches happen between rounds:
162
162
 
163
163
  **Pass 2 false-positive severity:** When Pass 2 identifies a potential false-positive in a security pattern added during Pass 1, classify as **Must Fix**, not Medium. A false positive in a security scanner is functionally a regression — it degrades working features. Do not defer with "monitor in production" unless a monitoring mechanism actually exists and is configured. (Field report #121)
164
164
 
165
+ **Production-parity exit criterion:** Before any Gauntlet round can be marked PASS, verify that the test execution backend matches the project's declared production backend. If `PROJECT_VERSION.md` (or equivalent) declares PostgreSQL but `tests/conftest.py` autouse fixture pins SQLite (or vice versa), the Gauntlet **FAILS** regardless of green test counts. Tests pinned to the wrong backend silently mask the integrations that actually run in prod (RLS, asyncpg pools, advisory locks, LISTEN/NOTIFY, FOR UPDATE SKIP LOCKED, transaction semantics). Field report #315 M3: this slipped past 4 dual-backend Gauntlets on Union Station between v6.2.1 cutover and v7.6 — every Gauntlet was structurally blind to the runtime risk it was supposed to be reviewing. Concrete check at end of each round: `grep -nE "_backend\s*=\s*['\"]" tests/conftest.py` and reconcile against `cat PROJECT_VERSION.md | grep -i 'database\|backend'`. Mismatch = FAIL the round.
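
The concrete check could be scripted along these lines. The sketch assumes the test bootstrap pins its backend with a literal `_backend = "..."` assignment, as in the grep above; the alias table and the fail-closed handling of "no pin found" are assumptions of this example, not rules from the methodology.

```python
import re

# Same shape as the grep in the exit criterion: _backend = '<name>' or "<name>".
BACKEND_RE = re.compile(r"""_backend\s*=\s*['"]([^'"]+)['"]""")

# Normalize common spellings so "postgres" and "PostgreSQL" compare equal.
ALIASES = {
    "postgres": "postgresql",
    "postgresql": "postgresql",
    "pg": "postgresql",
    "sqlite": "sqlite",
    "sqlite3": "sqlite",
}


def _normalize(name: str) -> str:
    lowered = name.strip().lower()
    return ALIASES.get(lowered, lowered)


def pinned_backends(conftest_source: str) -> set:
    """Return the set of backends pinned in the test bootstrap source."""
    return {_normalize(m) for m in BACKEND_RE.findall(conftest_source)}


def parity_ok(conftest_source: str, declared_backend: str) -> bool:
    """True only if every pin matches the declared production backend.

    An empty pin set returns False: parity cannot be confirmed, so the
    sketch fails closed.
    """
    return pinned_backends(conftest_source) == {_normalize(declared_backend)}
```

A round would be marked FAIL whenever `parity_ok` returns `False` for the contents of `tests/conftest.py` and the backend declared in `PROJECT_VERSION.md`.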
166
+
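A minimal sketch of the parity check as a script, assuming the test bootstrap pins its backend with a line shaped like `_backend = "sqlite"`. The function names `pinned_backend` and `check_backend_parity` are hypothetical, and the declared backend is passed in by hand rather than parsed from `PROJECT_VERSION.md`, whose layout varies per project.

```bash
pinned_backend() {
  # Print the first backend name pinned in the given test bootstrap, lowercased.
  # Assumes a pin shaped like: _backend = "sqlite"
  sed -nE "s/^[[:space:]]*_backend[[:space:]]*=[[:space:]]*['\"]([A-Za-z0-9_]+)['\"].*/\1/p" "$1" \
    | head -n 1 | tr '[:upper:]' '[:lower:]'
}

check_backend_parity() {
  # $1 = test bootstrap path, $2 = declared production backend (e.g. "postgresql")
  local pinned declared
  pinned="$(pinned_backend "$1")"
  declared="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  if [ -z "$pinned" ]; then
    echo "PASS: no backend pin found in $1"
    return 0
  fi
  if [ "$pinned" = "$declared" ]; then
    echo "PASS: tests pinned to $pinned"
    return 0
  fi
  echo "FAIL: tests pinned to $pinned, production declares $declared"
  return 1
}
```

The comparison is an exact string match after lowercasing, so aliases such as `postgres` and `postgresql` would need to be normalized by the caller.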
165
167
  ## Finding Format
166
168
 
167
169
  Every finding, from every agent, in every round, uses this format:
@@ -118,6 +118,22 @@ When the project targets mobile platforms, add these to the attack plan:
118
118
  - **App lifecycle:** Background → foreground. Verify state restored (form input, scroll position, auth token). Test after 30min background.
119
119
  - **Platform differences:** Test on both iOS and Android if cross-platform. Verify platform-specific components render correctly.
120
120
 
121
+ ### Multi-Tenant Retrofit Smell (regression checklist)
122
+
123
+ For any project with `org_id`, `tenant_id`, or `workspace_id` columns, run this grep before declaring a QA pass green:
124
+
125
+ ```bash
126
+ grep -rnE "(\bor\s+1\b|org_id\s*:\s*int\s*=\s*1|org_id\s*:\s*int\s*\|\s*None|org_id\s*=\s*None|tenant_id\s*=\s*None|workspace_id\s*=\s*None)" \
127
+ --include="*.py" --include="*.ts" --include="*.tsx" \
128
+ --exclude-dir=node_modules --exclude-dir=.venv --exclude-dir=tests .
129
+ ```
130
+
131
+ Each hit is either intentionally cross-tenant (system endpoints, admin tools — must have an authorization comment naming the policy) or retrofit residue (a fallback predating the multi-tenant migration — CRITICAL, fix before sign-off). See `SECURITY_AUDITOR.md` for the full smell discussion. The pattern recurred across 6 Union Station campaigns despite previous "we already fixed that" claims (field report #315 M2). Trust the grep, not memory.
132
+
133
+ ### Production-Backend Parity Check
134
+
135
+ Before declaring any QA pass green, verify that the test execution backend matches the production backend declared in `PROJECT_VERSION.md` / `CLAUDE.md` Stack section. Concrete check: read `tests/conftest.py` (Python) or equivalent test bootstrap; if it pins a non-prod backend (e.g., `_backend = "sqlite"` while prod is PostgreSQL since version X), this is a CRITICAL finding and the QA pass FAILS regardless of green test counts. Tests pinned to the wrong backend exercise none of the production-relevant integrations (RLS, asyncpg, advisory locks, LISTEN/NOTIFY, FOR UPDATE SKIP LOCKED) and silently mask production behavior. Field report #315 M3: this slipped past 4 dual-backend Gauntlets on Union Station before being caught at /assess. See `GAUNTLET.md` for the Gauntlet-side exit criterion.
136
+
121
137
  ### API Boundary Type Verification
122
138
 
123
139
  When the backend (Python, Go, Rust) and frontend (JavaScript) use different type systems, verify that types survive the API boundary correctly. Common gotcha: Python `bool` (`True`/`False`) becomes JSON `true`/`false` — but Python's string representation `"True"` is truthy in JS while `"False"` is also truthy. Check: Does the frontend compare API boolean values with `===` (strict) or `==` (loose)? Does the backend serialize booleans as JSON booleans or as strings? This catches "it works in Python tests but breaks in the browser" bugs. (Field report #66)
@@ -202,6 +218,36 @@ For services that maintain runtime state (caches, connection pools, scheduled jo
202
218
  Grep for `strftime`, `format(`, `toISOString`, `new Date().to` calls and verify they use the project's canonical timestamp format (typically `%Y-%m-%dT%H:%M:%SZ` or ISO 8601). Flag any non-canonical format strings. Non-canonical timestamps cause: cache TTL bugs (string comparison fails), sorting issues, and cross-system timestamp mismatches.
203
219
  (Field report #21: cache used `%Y-%m-%d %H:%M:%S` while all other code used `%Y-%m-%dT%H:%M:%SZ` — cache effectively never expired.)
204
220
 
221
+ ### Strict-Mode Audit Classification
222
+
223
+ When the codebase ships under any strict-mode setting (bash `set -euo pipefail`, TypeScript `strict: true`, Python `-W error`, Rust `#![deny(warnings)]`, Ruby `frozen_string_literal: true`), no QA finding involving language syntax, undefined-variable references, arithmetic expansion, type coercion, or null/undefined handling may be classified as **WARN/cosmetic** without behavioral evidence.
224
+
225
+ **Behavioral evidence requires ONE of:**
226
+
227
+ 1. **Unreachable-by-gate proof** — the code path is provably unreachable under any input, cited by the specific gate that excludes it (e.g., "this branch is guarded by `if (process.env.NODE_ENV !== 'production')` and the audit fixture pins production").
228
+ 2. **Real-path test under the same strict-mode flags as production** — the reviewer ran the non-dry-run code path with the production strict-mode settings and observed no failure. Static reading alone is not sufficient.
229
+
230
+ The audit's strict-mode MUST match the script's strict-mode. A reviewer running tests with `set +u` (not strict) cannot classify a `set -u` script's undefined-variable risk as cosmetic: production promotes to fatal a risk that the audit never exercised.
231
+
232
+ Field report #330 (threadplex-ops): a sub-agent reviewer flagged `$(( rc_failures(rc) ))` as "cosmetic WARN — always returns 0." The function used C-style call syntax inside arithmetic expansion, which bash doesn't support — under `set -u` it parsed as undefined-variable reference and aborted immediately. The dry-run path didn't execute the branch (only reached on real label-PUT failures), so the bug was invisible during review. First real `--once` invocation tagged 100 items, then crashed mid-batch.
233
+
234
+ The audit-classification was wrong, and a wrong audit classification routes around the fix. The orchestrator MUST NOT unblock a fix-batch on a WARN/cosmetic classification that lacks one of the two evidence requirements above.
235
+
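The second evidence requirement can be sketched as a small harness that runs a snippet in a child `bash` under a chosen set of flags and classifies it by exit status. The names `run_with_flags` and `classify`, the labels they print, and the variable `rc_undefined_demo` are all hypothetical; the snippet stands in for the real (non-dry-run) code path.

```bash
run_with_flags() {
  # $1 = bash option flags (may be empty), $2 = snippet to execute.
  # Prints the child shell's exit status.
  local flags="$1" snippet="$2" status=0
  # $flags is intentionally unquoted so "-euo pipefail" splits into separate words.
  bash $flags -c "$snippet" >/dev/null 2>&1 || status=$?
  echo "$status"
}

classify() {
  # $1 = flags the audit ran under, $2 = snippet exercising the real code path.
  if [ "$(run_with_flags "$1" "$2")" -eq 0 ]; then
    echo "WARN-eligible"
  else
    echo "BLOCKING"
  fi
}
```

For example, `classify "" 'echo "n=$rc_undefined_demo"'` reports `WARN-eligible`, while `classify "-euo pipefail" 'echo "n=$rc_undefined_demo"'` reports `BLOCKING`: the same line is harmless without `set -u` and fatal with it, which is why the audit flags have to match production.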
236
+ ### Telegram-Bot Group-Chat Suffix Test (when project is a chat bot)
237
+
238
+ Telegram (and similar chat platforms) append the bot's username to commands in group chats: `/system` becomes `/system@MyBotName`. A bare-anchor regex like `^/system($|[[:space:]])` rejects the group-chat form, silently breaking the bot for any group-deployed user.
239
+
240
+ **Required test for every chat-bot command parser:**
241
+
242
+ 1. `/cmd` — direct chat (private message form)
243
+ 2. `/cmd arg1 arg2` — direct chat with arguments
244
+ 3. `/cmd@BotName` — group chat (bare command)
245
+ 4. `/cmd@BotName arg1 arg2` — group chat with arguments
246
+
247
+ The parser must accept both forms. Reference normalizer: `sed -E 's#^(/[a-zA-Z_]+)@[a-zA-Z0-9_]+($|[[:space:]])#\1\2#'` (note the `#` delimiter to avoid clash with the regex's alternation operator).
248
+
249
+ Field report #325 (threadplex-ops): five fix batches missed this until Round 3 V-02 caught it — the bot rejected `/system@MyBot` in group chats. Niche but real, and zero-cost to add to the QA checklist for any chat-bot product.
250
+
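A short sketch of the four required cases, built on the reference normalizer quoted above. The function names `normalize_command` and `is_system_command` are hypothetical, and `/system` is only the example command from the text.

```bash
normalize_command() {
  # Strip an optional @BotName suffix from the leading command token.
  printf '%s\n' "$1" | sed -E 's#^(/[a-zA-Z_]+)@[a-zA-Z0-9_]+($|[[:space:]])#\1\2#'
}

is_system_command() {
  # Normalize first, then apply the bare-anchor match from the text.
  normalize_command "$1" | grep -E '^/system($|[[:space:]])' >/dev/null
}

# The four forms every command parser should accept.
for msg in '/system' '/system arg1 arg2' '/system@MyBotName' '/system@MyBotName arg1 arg2'; do
  if is_system_command "$msg"; then
    echo "accepted: $msg"
  else
    echo "REJECTED: $msg"
  fi
done
```

Normalizing before matching keeps each command's own regex simple, so the group-chat form is handled in one place instead of in every handler.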
205
251
  ### Stub Detection (Oracle, Round 2)
206
252
 
207
253
  Oracle scans for methods that return success without side effects — the most dangerous form of incomplete code. A method that raises `NotImplementedError` fails loudly and safely. A method that returns `True` without acting is a time bomb.
@@ -136,6 +136,65 @@ When the user passes `--deploy` to `/git`, run `/deploy` automatically after the
136
136
 
137
137
  This enables one-command commit-and-deploy for ad-hoc changes outside of campaigns.
138
138
 
139
+ ## Per-Commit CHANGELOG Discipline
140
+
141
+ CHANGELOG drift accumulates silently when entries are deferred to session boundaries. By the time someone notices, the test count trajectory is wrong and the per-mission delta is unrecoverable from the diff alone.
142
+
143
+ **Rule:** Commits that touch `src/**`, `docs/adrs/**`, or load-bearing method docs (`docs/methods/*.md`) MUST include a `CHANGELOG.md` entry as part of the staged paths. Coulson rejects commits matching those globs that omit `CHANGELOG.md`.
144
+
145
+ **Exceptions** (no CHANGELOG entry needed):
146
+ - Pure refactor / move with no behavior change (label the commit `chore:`)
147
+ - Test-only changes that don't add a new test pattern
148
+ - Documentation typo fixes
149
+ - Files explicitly listed under `.changelog-exempt` if present
150
+
151
+ **Enforcement check (Coulson runs before commit):**
152
+
153
+ ```bash
154
+ if git diff --cached --name-only | grep -qE '^(src/|docs/adrs/|docs/methods/.*\.md$)'; then
155
+ git diff --cached --name-only | grep -q '^CHANGELOG\.md$' || {
156
+ echo "Commit touches src/adrs/methods but omits CHANGELOG.md"; exit 1
157
+ }
158
+ fi
159
+ ```
160
+
161
+ Field report #322 (barrierwatch): test count trajectory showed 1207 when reality was 1209+ after Fix Batch 1; CHANGELOG drift caught only by Round 3 Nightwing. Without that agent, the release would have shipped with a stale CHANGELOG.
162
+
163
+ ## Pre-Push Lint Sweep
164
+
165
+ Project-specific lint gates (`scripts/check-*.sh`, `scripts/lint_*.py`, `bin/preflight`, etc.) are easy to forget without a checklist — and the cost is a hotfix loop where the first push fails CI on a contract gate that local development never exercised.
166
+
167
+ **Rule:** Before `git push`, Coulson runs every executable under `scripts/check-*` (or framework equivalent — `scripts/lint_*`, `bin/preflight`, `make preflight`). If any returns non-zero, push is blocked until the finding is resolved (fix the code OR add an explicit `# <gate>-allowed` waiver with rationale).
168
+
169
+ **Discovery shape:**
170
+
171
+ ```bash
172
+ find scripts/ -maxdepth 2 -type f \( -name 'check-*' -o -name 'lint_*' \) -executable 2>/dev/null
173
+ ```
174
+
175
+ For each script discovered, document its purpose + waiver convention in the project README or `docs/CONTRIBUTING.md`. Field report #324 (Union Station v7.8) documents 3 separate hotfix loops in a single session where the waiver convention (`# system-org-allowed` for source code, double-backticks for prose) existed but was not surfaced in any reviewer-readable checklist.
176
+
177
+ **Methodology vs project tooling:** the SCRIPTS are project-specific; the DISCIPLINE (run all gates before push) is methodology. The orchestrator does not need to know what each script does — only that it exists and must pass.
178
+
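One way the sweep could be scripted, as a sketch: discover the gate scripts, run each one, and report failure if any gate fails. The function name `run_gates` is hypothetical, and `-perm -u+x` is used here as a portable stand-in for the GNU-only `-executable` test in the discovery command above.

```bash
run_gates() {
  # $1 = directory holding gate scripts (defaults to scripts/).
  # Returns 0 only if every discovered gate exits 0.
  local dir="${1:-scripts}" failed=0 gate gates
  gates="$(find "$dir" -maxdepth 2 -type f \( -name 'check-*' -o -name 'lint_*' \) -perm -u+x 2>/dev/null | sort)"
  while IFS= read -r gate; do
    [ -n "$gate" ] || continue
    # Run every gate even after a failure so all findings surface in one pass.
    if "$gate" </dev/null; then
      echo "ok: $gate"
    else
      echo "BLOCKED by: $gate"
      failed=1
    fi
  done <<EOF
$gates
EOF
  return "$failed"
}
```

A pre-push hook could then call `run_gates scripts || exit 1`. Waiver handling is left to the individual gate scripts, since the convention is project-specific.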
179
+ ## Post-Amend SHA Pin
180
+
181
+ `git commit --amend` rewrites the SHA but `logs/campaign-state.md` rows still reference the pre-amend SHA. Across a long campaign, these dangling references accumulate and break post-hoc audits (`git log --grep` against the recorded SHA returns nothing).
182
+
183
+ **Rule:** After any `git commit --amend`, Coulson scans `logs/campaign-state.md` (and `logs/build-state.md`, `logs/gauntlet-state.md` if present) for SHA placeholders that may now be stale.
184
+
185
+ **Detection pattern:**
186
+
187
+ ```bash
188
+ # Find recorded SHAs that no longer exist in git
189
+ grep -oE '\b[a-f0-9]{7,40}\b' logs/campaign-state.md 2>/dev/null | sort -u | while read -r sha; do
190
+ git cat-file -e "$sha^{commit}" 2>/dev/null || echo "STALE: $sha in campaign-state.md"
191
+ done
192
+ ```
193
+
194
+ **Resolution:** Replace the stale SHA with the post-amend SHA. Land both the amend and the state-file pin in one logical operation (squash if not yet pushed; new commit if already on remote).
195
+
196
+ Field report #327 (Union Station v7.10 Phase C): every mission shipped as a `<mission> + <followup pin SHA>` pair because amends were routine and the state file always lagged by one SHA. The discipline is workable in practice, but it's a known foot-gun. Surface it explicitly so future operators don't rediscover it.
197
+
139
198
  ## Post-Push Deploy Check
140
199
 
141
200
  After pushing to remote, if the project runs on a persistent server (PM2, systemd, Docker):
@@ -106,6 +106,59 @@ These require full codebase context — run sequentially:
106
106
  - **JS execution:** `eval()`, `Function()`, `setTimeout`/`setInterval` with string arguments
107
107
  (Field report #38: sanitizer missed `object`, `embed`, `applet`, `base`, `meta[http-equiv]` — 5 potential XSS vectors.)
108
108
 
109
+ ### Sanitizer Bypass-Class Checklist
110
+
111
+ When auditing any prompt-injection sanitizer, command-injection filter, or content sanitizer that operates on adversary-controlled text, verify coverage against the canonical bypass classes. Sanitizers built incrementally (adding patterns as discovered) inevitably miss entries: each fix-batch produces a narrower bypass that the next round catches. Checking every class upfront compresses 3 fix batches into 1.
112
+
113
+ **Required coverage for every text-input sanitizer:**
114
+
115
+ 1. **Case-fold variants** — `APPROVED ACTION`, `approved action`, `Approved Action`, `aPPROVED aCTION`. The sanitizer MUST be case-insensitive (regex `i` flag, ICU case-fold, or explicit `.lower()` pre-check). Test with mixed-case input.
116
+ 2. **Unicode lookalikes & em-dash variants** — em-dash (`—`), en-dash (`–`), figure-dash (`‒`), minus sign (`−`), full-width hyphen (`－`), Cyrillic `а`/`е`/`о` substituted for Latin `a`/`e`/`o`. Normalize to NFKC before matching, OR explicitly enumerate the lookalike set.
117
+ 3. **Newline-split variants** — `sed` is line-oriented by default; a marker split across `\n` defeats line-level regex. Use `sed -zE` (whole-buffer), Perl `-0777`, or Python `re.DOTALL`/`re.MULTILINE` depending on language. Test with `\r\n`, `\n`, and other Unicode line breaks such as U+2028 and U+2029.
118
+ 4. **Character-class glob variants** — patterns like `AUTHORIT[Yy]` or `appr[o0]ved` exploit blocklist regexes that miss numeric/alpha substitutions. The sanitizer should normalize obfuscation classes (l33t-speak, `0`/`o`, `1`/`l`, `$`/`s`) OR reject any non-ASCII letter in security-relevant context.
119
+ 5. **Encoding variants** — base64, URL-encoded, HTML-entity, JS-escape (`\x41`, `\u0041`), hex-escape, double-encoded. The sanitizer must decode BEFORE matching, not after.
120
+ 6. **Length-boundary variants** — payloads at exactly the truncation boundary, payloads with leading/trailing whitespace that strips to a malicious core, payloads that exceed max-length and trigger truncation that creates a different malicious string.
121
+ 7. **Novel-marker variants** — the sanitizer that catches `[APPROVED]` should catch `「APPROVED」`, `«APPROVED»`, `\\xe2\\x80\\xbaAPPROVED\\xe2\\x80\\xba`. Test with at least 3 unusual delimiter pairs.
122
+
123
+ Field report #325 (threadplex-ops Victory Gauntlet): each fix batch on the prompt-injection sanitizer introduced a narrower bypass that the next round caught. Fix Batch 1 added noun-whitelist `sed`; Round 3 found case-fold + em-dash + novel marker bypasses. Fix Batch 3 added shape-blacklist `sed -E i`; Round 4 found newline-split bypass (sed line-oriented). Fix Batch 4 used `sed -zE` (whole-buffer). The checklist above would have collapsed those three iterations into one — the bypass classes are knowable upfront, not discoverable per-round.
124
+
125
+ **Audit step:** for every sanitizer the codebase ships, verify the test suite covers all 7 classes above with at least 2 samples each. Missing classes are pre-flagged findings (HIGH severity for security-relevant sanitizers, MEDIUM otherwise).
126
+
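As a partial illustration, the sketch below covers only classes 1 and 3 (case-fold and newline-split) by folding case and collapsing whitespace across the whole buffer before matching. The function name `contains_marker` and the marker string are hypothetical. Unicode lookalikes, encodings, and the remaining classes are out of scope here, since NFKC normalization and decoding need more than plain shell tools.

```bash
contains_marker() {
  # Fold ASCII case, then collapse every whitespace run (including \r and \n)
  # into one space so the match sees the whole buffer, not individual lines.
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -s '[:space:]' ' ' \
    | grep 'approved action' >/dev/null
}

# Two samples per covered class, as the audit step requires.
contains_marker 'APPROVED ACTION' && echo "caught: upper case"
contains_marker 'aPPROVED aCTION' && echo "caught: mixed case"
contains_marker "$(printf 'approved\naction')" && echo "caught: LF split"
contains_marker "$(printf 'Approved\r\nAction')" && echo "caught: CRLF split"
```

Matching on a normalized copy while leaving the original text untouched keeps the detector independent of whatever rewriting the sanitizer does afterwards.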
127
+ ### Multi-Tenant Retrofit Smell (`or 1` / `org_id=None`)
128
+
129
+ A recurring data-leak class across multi-tenant retrofit campaigns. When a project adds `org_id` columns and composite PKs but leaves the `else` branch / `or 1` fallback alive, queries silently leak across tenants when authentication is missing or partial. Field report #315 M2 documents this recurring across 6 Union Station campaigns (v3.0 → v3.6.1 → v7.0 → v7.0.1 → v7.4 → v7.6).
130
+
131
+ **Mandatory grep pass on every multi-tenant codebase:**
132
+
133
+ ```bash
134
+ # Catches all variants
135
+ grep -rnE "(\bor\s+1\b|org_id\s*:\s*int\s*=\s*1|org_id\s*:\s*int\s*\|\s*None|org_id\s*=\s*None|tenant_id\s*=\s*None|workspace_id\s*=\s*None)" \
136
+ --include="*.py" --include="*.ts" --include="*.tsx" \
137
+ --exclude-dir=node_modules --exclude-dir=.venv .
138
+ ```
139
+
140
+ Each hit must be classified:
141
+ - **Defensible** — a system endpoint that explicitly serves cross-tenant data (admin tools, reporting), with documented authorization checks. Annotate with a comment naming the policy.
142
+ - **Retrofit residue** — a fallback that predates the multi-tenant migration. **CRITICAL** finding; rewrite to fail-fast.
143
+ - **Test-only** — fixture default. Acceptable in `tests/`, **never** in production code.
144
+
145
+ This grep is part of every `/sentinel` run on projects with `org_id` columns. Also runs in `/qa` regression checklists (see QA_ENGINEER.md). Do not skip it for "we already fixed that" — the pattern recurs.
146
+
147
+ ### IDOR Matrix for Parametric-Path Routers
148
+
149
+ Mandatory when a router has parametric paths (`/X/{id}`) AND additional fixed-suffix paths under the same entity prefix (`/X/batch-update`, `/X/merge`, `/X/export`). FastAPI dispatches first-matching-route — `/X/{person_id}` is more general than `/X/batch-update` and shadows the fixed suffix when registered first. The fixed-suffix endpoint then becomes silently unreachable, returning 422 (path-arg parse failure) instead of running.
150
+
151
+ Field report #320 §2 documents M-10 commit 5: `PATCH /people/batch-update` had been **unreachable in production for an unknown duration** because `/people/{person_id}` shadowed it. Surfaced only when Strange's IDOR matrix test attempted cross-org denial on `batch-update` and got 422 instead of 403.
152
+
153
+ **Matrix shape (one row per fixed-suffix endpoint × one column per access pattern):**
154
+
155
+ | | Same-org user | Cross-org user | No auth |
156
+ |---|---|---|---|
157
+ | `PATCH /X/batch-update` | 200 + scoped result | 403 (or 404 per ADR) | 401 |
158
+ | `POST /X/merge` | 200 | 403 | 401 |
159
+
160
+ **Fix when the matrix surfaces a route shadow:** add path-converter type hints (`{person_id:int}`, `{company_id:int}`) so the parametric route is restricted to its actual type. Do not reorder routes — type-converted paths are unambiguous; reordering is fragile. Then re-run the matrix to confirm fixed-suffix routes reach their handlers.
161
+
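The shadowing mechanism can be shown with a toy first-match dispatcher. This is not FastAPI's router, only a model of first-match ordering: `first_match` is a hypothetical helper, `[^/]+` stands in for an untyped `{person_id}` segment, and `[0-9]+` stands in for the `{person_id:int}` converter.

```bash
first_match() {
  # $1 = request path; remaining args = route regexes in registration order.
  # Prints the first route whose pattern matches the whole path.
  local path="$1" route
  shift
  for route in "$@"; do
    if printf '%s\n' "$path" | grep -E "^${route}\$" >/dev/null; then
      echo "$route"
      return 0
    fi
  done
  return 1
}

# Untyped parameter registered first: it swallows the fixed-suffix path.
first_match /people/batch-update '/people/[^/]+' '/people/batch-update'

# Integer-restricted parameter: the fixed-suffix route is reachable again.
first_match /people/batch-update '/people/[0-9]+' '/people/batch-update'
```

In the first call the parametric pattern wins even though the path is not a valid id, which is the situation that surfaces as a 422 in the field report. In the second call the typed pattern cannot match `batch-update`, so registration order stops mattering.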
109
162
  ### Proxy Route SSRF
110
163
 
111
164
  For any route that proxies requests to external APIs (image proxies, API gateways, CDN wrappers):
@@ -133,11 +133,81 @@ AGENT: [Name]
133
133
  STATUS: Done / Blocked / Needs Review
134
134
  CHANGES: [Files modified, one-line each]
135
135
  DECISIONS: [Non-obvious choices with rationale]
136
+ DEVIATIONS FROM CONTRACT: [see below — required, "None" is acceptable]
136
137
  ASSUMPTIONS: [Needs confirmation]
137
138
  RISKS: [Side effects]
138
139
  REGRESSION: [How to verify]
139
140
  ```
140
141
 
142
+ ### Deviations from Contract (required section)
143
+
144
+ For every item in the dispatch brief that the agent chose to handle differently from the literal contract — defensible improvements, scope adjustments, deferred work — flag it explicitly:
145
+
146
+ ```
147
+ - Brief said: "<exact wording>"
148
+ You did: <what you actually shipped>
149
+ Why: <rationale>
150
+ Risk: <production-side implication, or "None" if internal-only>
151
+ Reviewer signoff needed: <Y/N — if Y, name the reviewer>
152
+ ```
153
+
154
+ An empty section ("No deviations") is acceptable and explicit. **Hidden deviations risk emerging as production bugs** — Stark's M-05-prep-2 silent fallback (`_get_db_admin()` retained tenant-pool fallback for "dev/test backward compat" instead of failing-fast as Picard's contract specified) was sound but not flagged in the build report headline. It took a Loki chaos pass to catch the production-side implication. (Field report #318 §4.) Across that single session, 6 separate agents had silent deviations from their dispatch briefs.
155
+
156
+ The orchestrator reviews this section at the same priority as STATUS. A deviation that risks production behavior triggers a reviewer dispatch (Loki, Riker, or the original contract author).
157
+
158
+ ### Sub-Agent Review Contract (WARN/cosmetic evidence requirement)
159
+
160
+ A sub-agent reviewer may classify a finding as **WARN/cosmetic** (deferrable, non-blocking) only if at least ONE of the following holds:
161
+
162
+ 1. The code path is **provably unreachable** with a citation of the specific gate that excludes it (e.g., `if (DEV_ONLY)` guard pinned by audit fixture).
163
+ 2. The reviewer **ran the real (non-dry-run) code path under the same strict-mode flags as production** and observed no failure.
164
+
165
+ Static reading alone is NOT sufficient evidence for a WARN/cosmetic downgrade when the codebase ships under `set -euo pipefail`, TypeScript strict, Python `-W error`, or any equivalent strict-mode setting. The orchestrator MUST NOT unblock a fix-batch on a WARN/cosmetic classification that lacks one of the two above.
166
+
167
+ Field report #330: a Kim-class reviewer flagged a bash syntax oddity as "cosmetic — always returns 0." The reasoning was correct only if the code path didn't crash under strict-mode flags — which it did. The audit's strict-mode must match the script's strict-mode. See `QA_ENGINEER.md` "Strict-Mode Audit Classification" for the language-level rule.
168
+
169
+ **The contract applies recursively** — a sub-agent reviewing another sub-agent's classification inherits this requirement. WARN/cosmetic that survives a chain of reviews still requires evidence at the root of the chain.
170
+
171
+ ### Agent Capability Matrix (tool surface verification)
172
+
173
+ Before briefing an agent for a task, the orchestrator confirms the agent has the tools required for that task. The `tools:` field in each `.claude/agents/<id>.md` frontmatter is the source of truth.
174
+
175
+ **Quick decision tree:**
176
+
177
+ | Task type | Required tools | Common mismatch |
178
+ |---|---|---|
179
+ | Write files (audit reports, ADRs, code) | `Write` + `Edit` | Read-only agents (e.g., scout-tier) return audit text instead of files |
180
+ | Modify existing files | `Edit` | Read-only agents propose diffs instead of applying them |
181
+ | Run scripts / git ops | `Bash` | Some review-tier agents lack Bash and can't verify their own findings |
182
+ | Pattern search / discovery | `Grep` + `Glob` | All agents have these (scout floor) |
183
+ | Read agent definitions | `Read` | Universal |
184
+
185
+ **Pre-deployment check:** if the dispatch brief asks the agent to "write," "update," "modify," or "fix" any file, verify the agent definition includes `Write` and/or `Edit` in `tools:`. If not, EITHER:
186
+
187
+ 1. Add the tool to the agent definition (preferred when the agent SHOULD be authoring in their domain — e.g., Irulan should write ADR audits as files), OR
188
+ 2. Delegate the actual write to an orchestrator-tier action (the agent produces structured audit output; the orchestrator writes the file).
189
+
190
+ Field report #322 (barrierwatch M1): Irulan was asked to write `docs/adrs/INDEX.md` and update `CHANGELOG.md`. Her tools were `Read, Grep, Glob` — she returned a comprehensive audit text instead of files. The orchestrator manually transcribed her audit into the files. Cost: a redirect that should have been a tool-list fix.
191
+
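The pre-deployment check could be automated along these lines. This is a sketch that assumes the frontmatter carries a single-line entry shaped like `tools: Read, Grep, Glob`; the function name `agent_has_tool` is hypothetical, and other frontmatter layouts (for example a YAML list) would need a different parser.

```bash
agent_has_tool() {
  # $1 = agent definition file, $2 = tool name (e.g. Write).
  # Splits the first `tools:` line on commas and looks for an exact match.
  grep -E '^tools:' "$1" | head -n 1 \
    | tr ',' '\n' \
    | sed -E 's/^tools://; s/^[[:space:]]+//; s/[[:space:]]+$//' \
    | grep -x "$2" >/dev/null
}

can_author_files() {
  # A brief that says "write", "update", "modify", or "fix" needs Write or Edit.
  agent_has_tool "$1" Write || agent_has_tool "$1" Edit
}
```

An orchestrator might run `can_author_files .claude/agents/<id>.md` before dispatch and fall back to option 2 (orchestrator writes the file) when it returns non-zero.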
192
+ ### Build-Agent Pytest Sequencing
193
+
194
+ Build agents that need to verify their work with pytest should:
195
+
196
+ 1. Run **targeted pytest** on touched files only as the agent's internal verification (fast, fits in the agent response window — typically 1-3 min).
197
+ 2. **Commit + report BEFORE** running the full-suite pytest. The orchestrator runs the full suite as the gate — that's not the agent's job.
198
+ 3. Do NOT run the full CI-equivalent suite as the agent's final action. Long-running suites (12-15 min) routinely exceed the agent response window, truncate the report mid-output, and force the orchestrator to reconstruct state from `git log` rather than read the report.
199
+
200
+ Field report #320 §4: 4 of Strange's M-10 commits had truncated reports because internal pytest was still running when the response window closed. Targeted pytest (`pytest -q tests/path/to/touched_module.py`) is the right shape for the agent; full-suite is the orchestrator's gate.
201
+
202
+ ### Long-Running Shell Commands Inside Agent Dispatches
203
+
204
+ When a sub-agent needs to run a shell command that takes longer than ~3 minutes (long pytest, full build, multi-region deploy probe, container migration), the dispatch prompt must specify one of two patterns:
205
+
206
+ 1. **Background + poll** — agent runs the command with `run_in_background: true`, then polls for completion at fixed intervals. The agent's final response includes the polled outcome.
207
+ 2. **Reduce scope** — the agent runs a focused subset that completes inside the response-stream window. The orchestrator runs the full version separately.
208
+
209
+ Naked long-running commands inside an agent dispatch will truncate the agent's report mid-execution; the orchestrator then has to recover state from disk and re-write the report retrospectively. Field report #317 logged 4 such truncations in a single Union Station session.
210
+
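The background + poll pattern, sketched in plain shell rather than with the agent tool's `run_in_background` option. The function name `run_with_poll` and the status-file approach are assumptions made for illustration: the command records its own exit status in a temp file, and the caller polls for it at a fixed interval with an upper bound.

```bash
run_with_poll() {
  # $1 = seconds between polls, $2 = maximum number of polls, rest = command to run.
  local interval="$1" max_polls="$2" polls=0 status_file status
  shift 2
  status_file="$(mktemp)"
  # Background the command; it writes its exit status when it finishes.
  ( "$@" >/dev/null 2>&1 && echo 0 >"$status_file" || echo "$?" >"$status_file" ) &
  # Poll until the status file has content or the poll budget runs out.
  while [ ! -s "$status_file" ]; do
    polls=$((polls + 1))
    if [ "$polls" -gt "$max_polls" ]; then
      echo "TIMEOUT: still running after $max_polls polls"
      return 124
    fi
    sleep "$interval"
  done
  status="$(cat "$status_file")"
  rm -f "$status_file"
  echo "DONE: exit status $status"
  return "$status"
}
```

On timeout the sketch reports and returns without killing the background job or cleaning up the status file, so the report still gets written. A fuller version would decide explicitly whether to leave the job running.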
141
211
  ## Agent Debate Protocol
142
212
 
143
213
  When two agents disagree on a finding, run a structured debate instead of listing both opinions:
@@ -327,6 +397,26 @@ CONSTRAINTS: [list]
327
397
  | Architecture / Council | Position Statement: assessment, concerns, sign-off |
328
398
  | Build agents | Build Report: files created/modified, tests added, decisions made |
329
399
 
400
+ ### Intentionally Overlapping Mandates (high-signal convergence)
401
+
402
+ When dispatching parallel reviewers, **deliberately give 3+ agents the same diff with different lenses**. This is not duplication — it is intentional convergence.
403
+
404
+ - Findings flagged by 1 agent = standard signal, route to triage
405
+ - Findings flagged by 2+ agents from different universes = high-confidence signal, prioritize
406
+ - Findings flagged by 3+ agents = critical convergence, fix in same batch
407
+
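The three tiers above can be computed mechanically from a flat list of findings. The sketch below assumes one `<agent> <finding-id>` pair per line on stdin; the function name `triage_findings` and the level labels are hypothetical. It counts distinct agents only and does not check that they come from different universes.

```bash
triage_findings() {
  # stdin: one "<agent> <finding-id>" pair per line.
  # stdout: "<finding-id> <agent-count> <level>", sorted by finding id.
  sort -u | awk '
    { count[$2]++ }
    END {
      for (id in count) {
        level = "standard"
        if (count[id] >= 3) level = "critical-convergence"
        else if (count[id] >= 2) level = "high-confidence"
        print id, count[id], level
      }
    }' | sort
}

printf '%s\n' \
  'Discovery HIGH-1' 'Stark HIGH-1' 'Kenobi HIGH-1' \
  'Stark MED-1' 'Kenobi MED-1' \
  'Kenobi LOW-1' | triage_findings
```

The `sort -u` step deduplicates repeated reports from the same agent, so a finding only moves up a tier when a different reviewer independently flags it.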
408
+ Field report #324 (Union Station v7.8 R2): three agents (Discovery + Stark + Kenobi) ran in parallel against the same diff. HIGH-1 was caught by all three; two MED findings by 2 of 3. A single-agent review would have missed ~25% of findings empirically. The "wasted" agent budget is the price of multi-lens coverage.
409
+
410
+ **When to use overlap:**
411
+ - Methodology ADRs (statistical, security, financial) — code-vs-ADR + spec-adversary + Riker trade-offs (3 lenses, same diff)
412
+ - Multi-tenant boundary changes — Stark (impl) + Kenobi (auth) + Ahsoka (IDOR) + Spock (schema), 4 lenses on the same code
413
+ - Cross-module diffs after refactor sweeps — Cyborg (integration) + Strange (services) + Banner (queries)
414
+
415
+ **When NOT to use overlap:**
416
+ - Trivial single-file changes (<50 lines, no cross-module impact)
417
+ - Pure formatting/lint sweeps
418
+ - Doc-only edits where finding density approaches zero
419
+
330
420
  ### Concurrency Rules (ADR-059)
331
421
 
332
422
  - **Fan out the full roster in parallel for read-only analysis.** Opus 4.7's 1M context window handles 20+ concurrent findings tables without thrashing. Field report #270 confirmed 15+ parallel agents at 15-25% context usage.