voidforge-build 23.19.0 → 23.21.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (64) hide show
  1. package/dist/.claude/agents/celebrimbor-forge-artist.md +1 -0
  2. package/dist/.claude/agents/ducem-token-economics.md +1 -0
  3. package/dist/.claude/agents/galadriel-frontend.md +1 -0
  4. package/dist/.claude/agents/romanoff-integrations.md +4 -0
  5. package/dist/.claude/agents/silver-surfer-herald.md +19 -4
  6. package/dist/.claude/commands/architect.md +4 -3
  7. package/dist/.claude/commands/assemble.md +12 -0
  8. package/dist/.claude/commands/assess.md +1 -0
  9. package/dist/.claude/commands/build.md +8 -0
  10. package/dist/.claude/commands/contextmeter.md +56 -0
  11. package/dist/.claude/commands/debrief.md +10 -0
  12. package/dist/.claude/commands/engage.md +5 -0
  13. package/dist/.claude/commands/git.md +19 -3
  14. package/dist/.claude/commands/imagine.md +1 -1
  15. package/dist/.claude/commands/seal.md +81 -0
  16. package/dist/.claude/commands/ux.md +13 -0
  17. package/dist/.claude/workflows/gauntlet.workflow.js +13 -1
  18. package/dist/CHANGELOG.md +63 -0
  19. package/dist/CLAUDE.md +10 -1
  20. package/dist/HOLOCRON.md +16 -2
  21. package/dist/VERSION.md +3 -1
  22. package/dist/docs/methods/AI_INTELLIGENCE.md +3 -0
  23. package/dist/docs/methods/ASSEMBLER.md +12 -0
  24. package/dist/docs/methods/BUILD_PROTOCOL.md +15 -0
  25. package/dist/docs/methods/CAMPAIGN.md +11 -0
  26. package/dist/docs/methods/DEVOPS_ENGINEER.md +66 -0
  27. package/dist/docs/methods/FIELD_MEDIC.md +1 -0
  28. package/dist/docs/methods/FORGE_ARTIST.md +3 -4
  29. package/dist/docs/methods/GAUNTLET.md +6 -0
  30. package/dist/docs/methods/MUSTER.md +2 -0
  31. package/dist/docs/methods/PRODUCT_DESIGN_FRONTEND.md +18 -0
  32. package/dist/docs/methods/QA_ENGINEER.md +21 -1
  33. package/dist/docs/methods/RELEASE_MANAGER.md +38 -0
  34. package/dist/docs/methods/SECURITY_AUDITOR.md +11 -1
  35. package/dist/docs/methods/SUB_AGENTS.md +33 -0
  36. package/dist/docs/methods/SYSTEMS_ARCHITECT.md +15 -0
  37. package/dist/docs/methods/TESTING.md +2 -0
  38. package/dist/docs/methods/TROUBLESHOOTING.md +2 -2
  39. package/dist/docs/methods/WORKFLOWS.md +14 -0
  40. package/dist/docs/patterns/ai-prompt-safety.ts +85 -0
  41. package/dist/docs/patterns/data-pipeline.ts +59 -1
  42. package/dist/docs/patterns/egress-sandbox.sh +43 -0
  43. package/dist/docs/patterns/exclusion-set-invariant.md +62 -0
  44. package/dist/docs/patterns/multi-tenant-property-test.ts +64 -0
  45. package/dist/docs/patterns/nginx-vhost.conf +156 -0
  46. package/dist/docs/patterns/oauth-token-lifecycle.ts +21 -0
  47. package/dist/docs/patterns/post-deploy-probe.sh +115 -0
  48. package/dist/docs/patterns/rls-test-fixture.py +140 -0
  49. package/dist/docs/patterns/structural-sql-sentinel.py +134 -0
  50. package/dist/scripts/statusline/README.md +38 -0
  51. package/dist/scripts/statusline/context-awareness-hook.sh +53 -0
  52. package/dist/scripts/statusline/settings-snippet.json +17 -0
  53. package/dist/scripts/statusline/voidforge-statusline.sh +91 -0
  54. package/dist/scripts/voidforge.js +69 -6
  55. package/dist/wizard/lib/claude-md-strategy.d.ts +87 -0
  56. package/dist/wizard/lib/claude-md-strategy.js +198 -0
  57. package/dist/wizard/lib/marker.d.ts +48 -1
  58. package/dist/wizard/lib/marker.js +58 -2
  59. package/dist/wizard/lib/patterns/oauth-token-lifecycle.d.ts +14 -0
  60. package/dist/wizard/lib/patterns/oauth-token-lifecycle.js +21 -0
  61. package/dist/wizard/lib/project-init.js +59 -0
  62. package/dist/wizard/lib/updater.d.ts +19 -0
  63. package/dist/wizard/lib/updater.js +84 -33
  64. package/package.json +2 -2
@@ -72,7 +72,11 @@ OWASP Top 10 evaluation. Find misconfigurations, missing protections, insecure d
72
72
 
73
73
  These are independent, read-only scans. Run in parallel using the Agent tool:
74
74
 
75
- **Leia — Secrets:** No secrets in source code. No secrets in git history. .env in .gitignore. Different secrets dev/prod. Rotation plan documented. **Fail-closed verification:** When a new feature depends on a security primitive (encrypt, hash, sign, verify), check the primitive's failure mode. If it fails open (returns data instead of raising on misconfiguration), flag as Critical. Security functions must raise on misconfiguration, never silently degrade. (Field report #99: encrypt() silently returned plaintext when ENCRYPTION_KEY was unset — OAuth tokens stored unencrypted for an entire campaign.)
75
+ **Leia — Secrets:** No secrets in source code. No secrets in git history. .env in .gitignore. Different secrets dev/prod. Rotation plan documented.
76
+
77
+ **PII export-format `.gitignore` (data projects):** For any project that ingests or exports personal data, the default `.gitignore` recommendation must cover common PII / data-export formats up front — not just `.env`. A raw export dropped in the repo root with no `.gitignore` is one `git add -A` away from permanent third-party-PII exposure in history. Recommend at minimum: `*.abbu *.abcddb* *.vcf *.zip *.docx /input/ /output/ /data/ *.db .env`. The `.abbu`/`.abcddb` entries cover Apple Contacts bundles (SQLite stores); `.vcf` covers vCard dumps; `/input/`, `/output/`, `/data/` cover the conventional ingest/emit/working directories where raw exports land. (Field report #378: a raw PII export sat in the repo root with no `.gitignore` — a near-miss caught only by pre-build assessment, and a second near-miss at ingest when an `.abbu` bundle arrived mid-session uncovered by the default ignore set.)
78
+
79
+ **Fail-closed verification:** When a new feature depends on a security primitive (encrypt, hash, sign, verify), check the primitive's failure mode. If it fails open (returns data instead of raising on misconfiguration), flag as Critical. Security functions must raise on misconfiguration, never silently degrade. (Field report #99: encrypt() silently returned plaintext when ENCRYPTION_KEY was unset — OAuth tokens stored unencrypted for an entire campaign.)
76
80
 
77
81
  **Credential fallback check:** After fixing a hardcoded credential, grep for fallback patterns: `?? 'defaultValue'`, `|| 'hardcoded'`. An environment variable with a hardcoded fallback is an incomplete fix — the fallback becomes the live credential when the env var is missing.
78
82
 
@@ -173,6 +177,12 @@ Pattern: `/api/photos/[...name]` that joins path segments into a Google API URL
173
177
 
174
178
  **Security principle:** For security boundaries (tool access, URL allowlists, IP ranges, credential scopes), **always prefer whitelist (default-deny) over blocklist (default-allow)**. New entries should be blocked by default until explicitly allowed. Blocklists inevitably miss entries.
175
179
 
180
+ ### Denylist = Tripwire, Boundary = Authoritative Control
181
+
182
+ A denylist over an open input space (a regex blocklist guarding an LLM-proposed diff, a forbidden-term filter, a pattern matcher over adversary-controlled text) is a **tripwire — defense-in-depth, not the security boundary**. It will have bypasses; that is its nature, and finding them does not by itself constitute a breach. The actual guarantee comes from the **authoritative control** behind it — environment sanitization, OS-user isolation, an allowlist-built sandbox, a server-side authorization check. When auditing one of these, Kenobi must: (a) identify the authoritative boundary, (b) test-lock *it*, (c) treat the pattern-denylist as defense-in-depth, and (d) **NOT escalate denylist gaps to CRITICAL without first proving the authoritative boundary is REACHABLE.**
183
+
184
+ **Reachability is empirical, not assumed.** A severity rating that rests on a factual premise ("a secret is reachable past this filter," "this bypass lands in a privileged context") must ship the command that proves the premise — run the env-builder and show the secret is present, exploit the bypass and show it crosses the boundary. If the env is built from an allowlist with secrets stripped, an 18-bypass denylist over that input is a tripwire with nothing behind it to trip into — the gaps are real but the severity is not CRITICAL. Whitelist > blocklist (above) remains the standing preference; this principle governs how you *score* a blocklist gap: severity rests on **proven reachability of the authoritative boundary**, never on the count of denylist bypasses alone. (Field report #377: a regex denylist guarding an LLM-proposed code diff had 18 confirmed bypasses and was escalated to CRITICAL on the unverified premise that a secret was reachable in the sandboxed eval env — but the env was built from a secrets-stripped allowlist, provable by running the env-builder. The denylist was a tripwire; environment-sanitization + OS-user isolation were the boundary.)
185
+
176
186
  ### Encryption Egress Audit
177
187
 
178
188
  When a field is encrypted (at rest or in transit), grep ALL usages of the original plaintext variable in the same function and across the codebase. Encryption applied to one egress point (e.g., database write) does not protect other egress points that use the same variable:
@@ -222,6 +222,17 @@ When a sub-agent needs to run a shell command that takes longer than ~3 minutes
222
222
 
223
223
  Naked long-running commands inside an agent dispatch will truncate the agent's report mid-execution; the orchestrator then has to recover state from disk and re-write the report retrospectively. Field report #317 logged 4 such truncations in a single Union Station session.
224
224
 
225
+ ### Repro Scratch Goes to mktemp, NEVER the Repo Tree
226
+
227
+ Any agent that reproduces a finding via shell — probe scripts, planted-bug fixtures, atomic-write `.tmp` files, race-repro harnesses — MUST write its scratch to an isolated temp path (`$(mktemp -d)` for a directory, `$(mktemp)` for a single file), NEVER into the working tree. The dispatch brief for any Bash-enabled repro/adversarial agent must state this constraint explicitly. (Field report #366 F5.)
228
+
229
+ ```bash
230
+ scratch="$(mktemp -d)"; trap 'rm -rf "$scratch"' EXIT # isolated, auto-cleaned
231
+ # ... write probe scripts, .tmp files, fixtures under "$scratch" ...
232
+ ```
233
+
234
+ The failure mode this prevents: the gauntlet's adversarial agents reproduced gate races by writing `.gate-repro-scratch/` and `scripts/surfer-gate/.*-probe.sh` plus orphaned atomic-write `.tmp` files **into the repo** — on two separate runs — and they were nearly committed via `git add -A`. A temp dir is invisible to `git`, cleans itself on exit, and cannot litter the tree or dirty the diff the review is about to assess. As a belt-and-suspenders backstop, projects should `.gitignore` a designated scratch path, but the temp-dir rule is the primary mechanism — scratch that never enters the tree needs no ignoring. (The WORKFLOWS.md side of this rule covers workflow-spawned agents; this subsection covers Agent-tool dispatches.)
235
+
225
236
  ## Agent Debate Protocol
226
237
 
227
238
  When two agents disagree on a finding, run a structured debate instead of listing both opinions:
@@ -278,6 +289,8 @@ Leads inherit the main session's model (Opus). Specialists run on Sonnet for cos
278
289
 
279
290
  **Effort tiering (per-agent spend lever).** Claude Code exposes an `effort:` level (`low`/`medium`/`high`/`xhigh`/`max`) that controls reasoning depth *independently* of the model tier. Apply by role: **Leads → `xhigh`** (the recommended start for agentic work on Opus 4.8); **Specialists → `medium`** (read-and-report review rarely needs full `high` spend across ~200 agents); **Scouts → OMIT** — **Haiku 4.5 does not support the effort parameter and errors if it is passed.** Haiku also has a **200K context ceiling (not 1M)**: the Surfer pre-scan and scout prompts must fit within it — read agent frontmatter (name/description/tags), not full bodies, on large rosters. **Verified + applied 2026-06-13:** the official sub-agents docs confirm `effort` is a supported frontmatter field; the fleet edit is live — all 20 leads carry `effort: xhigh`, all 201 Sonnet specialists `effort: medium`, the 43 Haiku scouts omit it (ADR-054). New agents should follow the same tiering.
280
291
 
292
+ **Global spend ceiling must reserve in-flight budget (field report #382).** When a multi-child orchestration enforces a global cost ceiling, the launch gate must NOT admit the next child on *cumulative-spent-so-far < ceiling* — that ignores the children already running, so the ceiling is breached by whatever the in-flight children go on to spend (real incident: an $80 cap spent $83.72). Before launching child N, reserve the **max-possible in-flight spend**: admit only if `ceiling − spent_so_far − Σ(per-child ceiling of running children) ≥ next child's per-child ceiling`. That bounds the worst case under the cap. If per-child spend can't be bounded, the total can't be either — then document the overshoot bound explicitly (`ceiling + (concurrency − 1) × max-per-child`) rather than calling the cap hard. In a Dynamic Workflow the `budget` API (WORKFLOWS.md) is the natural enforcement point: gate `agent()` launches on `budget.remaining()` minus reserved in-flight, not on raw `spent()`.
293
+
281
294
  ### Tool Restrictions
282
295
 
283
296
  | Profile | Tools | Agents |
@@ -361,6 +374,17 @@ Motivating incidents:
361
374
 
362
375
  Both would have been caught by an adversarial pass that asked "what new failure mode does THIS fix create?" rather than only "is the old finding gone?" When a fix introduces a sentinel/lock/retry-state, the verify dispatch brief MUST name the wedge/loop/orphan/double-send checklist explicitly and require the agent to trace the liveness path.
363
376
 
377
+ #### Confirm the empirical premise of a severity rating before acting on it
378
+
379
+ A severity is only as real as the factual claim it rests on. When a verdict rates a finding CRITICAL/High **because of an asserted fact** — "the secret is reachable from the sandboxed eval", "this input flows unsanitized into the sink", "the denylist is the boundary so its bypasses are exploitable" — that premise is a hypothesis until someone **runs the command that proves it**. A severity built on an unproven premise is not actionable, however confident the rating sounds. (Field report #377 #4.)
380
+
381
+ The discipline has two layers, and both are mandatory for any CRITICAL whose severity depends on a factual claim:
382
+
383
+ 1. **The verifying agent ships the command that proves the premise.** A verdict that rests on a factual premise must include the empirical check that confirms it — the actual `cat /proc/<pid>/environ`, the env-builder run that shows the secret is (or is not) present, the request that reaches (or doesn't reach) the sink. "I read the code and it looks reachable" is the *finding*, not the *proof*. Reachability is a 3-Lens stage (above) for exactly this reason; a CRITICAL skips no lens.
384
+ 2. **The orchestrator re-checks the premise of any CRITICAL before acting on it.** Before a CRITICAL enters the fix batch or blocks a deploy, the orchestrator re-runs (or has a skeptic re-run) the premise-proving command itself — it does not take the rating on faith. In the motivating incident, a CRITICAL assumed a secret was reachable in the eval sandbox; the sandbox builds its env from an allowlist with secrets stripped (provable in one command by running the env-builder), so the premise was false and the CRITICAL evaporated. Re-checking the premise *killed a false CRITICAL* — the same payoff as the refute lens, applied to the factual claim under the severity rather than to the finding itself.
385
+
386
+ This pairs with "denylist = tripwire, not boundary" (SECURITY_AUDITOR.md): do not escalate a pattern-denylist's bypasses to CRITICAL without first proving the authoritative boundary is actually reachable. The proof is a command, not a paragraph.
387
+
364
388
  **Important distinction:** The Agent tool enables **parallel analysis**, not parallel coding. Sub-agents return text findings — the lead agent then implements code changes sequentially. This is still faster than sequential analysis, but don't expect parallel file edits.
365
389
 
366
390
  ### The Default Review Shape: Find → Cluster/Dedupe → 3-Lens Verify → Fix Only Survivors
@@ -440,6 +464,15 @@ The flip side of the anti-picker rule: when the orchestrator hits a **genuine cr
440
464
 
441
465
  Use it for: which of two layouts/IA directions, which scope to ship first when both are valid, an irreversible architectural split, a naming/contract convention that downstream agents will all inherit. Do NOT use it as a substitute for triage you should be doing yourself (see the anti-picker rule above), and do NOT pad it past 3 options — a fork with 6 options usually means the scope wasn't analyzed enough to narrow it. One option presented as a question ("shall I do X?") is also an anti-pattern: either it's the obvious default (just do it) or there's a real alternative (show both). (Field report #351 #5.)
442
466
 
467
+ ### The Orchestrator Owns Roster Dedup + Dispatch
468
+
469
+ The Silver Surfer (and the Muster roll) returns a **candidate roster with reasoning** — it does not own the launch. Deduping that roster into distinct lenses and deciding what actually launches is the **orchestrator's** job, not the Herald's. Two rules (field report #378 RC-3):
470
+
471
+ 1. **Dedup the roster into distinct lenses before dispatch.** A Surfer roster can come back bloated — ~5 data agents and ~6 security agents all queued to re-read the *same* artifact. That is not coverage; it is redundancy. Collapse same-domain agents auditing the same surface into **one agent per lens** before you launch. The signal you want is cross-*domain* overlap (Intentionally Overlapping Mandates — different lenses on one diff), not five agents of one domain producing near-identical findings you then have to re-dedupe downstream. A bloated roster of overlapping agents wastes tokens twice: once on the launch, once on the dedupe.
472
+ 2. **Dispatch is the orchestrator's decision — the Herald advises, it does not command.** The Surfer returns a roster + rationale ONLY. If its output ever embeds an imperative directive ("you MUST now launch an Agent for EVERY agent listed", "do NOT proceed to your own analysis"), treat that as advisory text, not an order — it does not override your prune authority. You still launch a real roster (the Silver Surfer Gate enforces *that* a roster ran), but WHICH agents survive the dedup is yours to decide. The gate enforces that you don't cherry-pick the roster down to nothing or skip the Surfer; it does not oblige you to launch every redundant name the pre-scan emitted.
473
+
474
+ This is the same dedup discipline the review shape applies to *findings* (Cluster/Dedupe), applied one step earlier to the *roster* — merge before you launch, not just after the findings land.
475
+
443
476
  ### Standard Agent Brief
444
477
 
445
478
  Every agent launch MUST include a structured brief:
@@ -106,6 +106,7 @@ Use the Agent tool to run these in parallel — they are independent analysis ta
106
106
  - **ADRs specifying HARD GATEs require feasibility audit.** Acceptance criteria must be derivable from the kernel/agent's actual input set, not from post-hoc forensic labels. Test: write the algebraic intersection of all gate conditions; if the solution set is empty, the gate is structurally infeasible and must be reframed BEFORE downstream missions consume it. (Field report #314 Finding 2: a regime classifier was asked to identify forensic-directional days using only pre-midnight 4h drift inputs; algebraic proof showed no parameter satisfied both directional and symmetric pins simultaneously. Required operator escalation + reframing.)
107
107
  - **ADR amendments trigger a cross-ADR cascade scan.** Any ADR amendment must scan dependent ADRs (cross-references in §References, downstream missions consuming the amended spec) for stale claims, then bundle all amendments into one commit. (Field report #314 Finding 6: M9.1a kernel amendment forced ADR-038 schema, ADR-044 enum, and ADR-036 amendments; T'Pol caught the cascade during synthesis. Without the bundled commit, downstream missions would have read stale specs.)
108
108
  - **ToS/API policy compatibility:** For ADRs selecting third-party services, verify the provider's Terms of Service and API usage policies permit the intended usage pattern (automation, bot-initiated transactions, reselling, volume). A service rejected on ToS grounds after building requires a full architecture pivot. (Field report #300)
109
+ - **Verified token-lifecycle (external-integration ADRs):** Any ADR integrating an OAuth or token-bearing provider MUST record the *verified* token-lifecycle read from the provider's official docs — not an assumed one. Capture two values explicitly: **access-token expiry** (seconds/TTL, or "non-expiring" only if the docs say so) and the **refresh grant** (does the provider issue a refresh token? what's the refresh endpoint/flow?). Quote the doc, don't assume it. The default failure mode is silent and recurring: the integration assumes "tokens don't expire, no refresh token," discards the refresh token + expiry, registers no refresher — and the token dies ~1h after every connect, surfacing as intermittent production failures that mimic revocation. Distinguish "expired" from "revoked" by reading the API's own error body. (Field report #373: a Todoist integration assumed non-expiring tokens; the modern API expires access tokens ~1h and issues a refresh token — caused multi-session production token-deaths. See `/docs/patterns/oauth-token-lifecycle.ts`.)
109
110
  - **Riker reviews:** "Number One, does this hold up?" Riker challenges each ADR's trade-offs — are the alternatives truly worse? Are the consequences acceptable? Did we consider the second-order effects? **Riker also verifies the implementation scope is honest** — if an ADR says "fully implemented" but the code throws `'Implement...'`, that's a finding. **Riker also asks "Can this gate FAIL under the proposed fixture?"** If algebraically it cannot, the gate proves only that the refactor preserved arithmetic, not that the fix is correct. Riker's review prevents architectural decisions made in a vacuum.
110
111
  - **Spec adversary pass (BEFORE implementation):** Riker reviews trade-offs; an adversarial agent (Feyd-Rautha, Maul, or Loki, chosen by domain) attacks the SPECIFICATION itself for category errors and missing constraints. **This pass runs before Stark implements.** The question Riker asks is "does this hold up?" The question the adversary asks is different: "is the spec asking the right question? Does the algebraic intersection of all constraints contain the desired solution? What's the failure mode the spec didn't name?" Field report #322 documents the cost: ADR-069 (FWER family scoping) said "filter family by p-value alone"; four agents (T'Pol, Picard, Stark, Batman) reviewed code-vs-ADR and all signed off. The bug was in the spec — the family should have been scoped to runs that passed the per-run gate. Surfaced only when M6's smoke run produced a false positive in production. A spec-adversary pass — asking "is the family definition itself correct?" before implementation — would have caught it. The rule: code-vs-ADR review confirms fidelity; spec-adversary review confirms correctness. Both are required for non-trivial methodology ADRs (statistical, security, financial, identity).
111
112
 
@@ -120,6 +121,20 @@ Point estimates without verification or uncertainty are a methodology bug. Field
120
121
 
121
122
  **Closeout reciprocity:** when a `/campaign` closeout report cites a followup count that will be consumed by the next plan, the followup definition MUST embed the same grep pattern. The next campaign's `/architect --plan` re-runs the grep before accepting the count. See `CAMPAIGN.md` "Closeout grep pinning."
122
123
 
124
+ ### Concurrency-claim verification gate
125
+
126
+ When an ADR claims concurrency, parallelism, a **bounded worker pool**, async fan-out, or batched parallel I/O for any stage, the ADR's Verification Gate MUST include a check that the *implementation* honors the claim — not just that the claim was written. A sequential `for`-loop satisfying a "use a bounded worker pool" ADR is a silent regression: small-fixture tests stay green (5 rows × low latency looks fast), so it ships, and the O(rows × latency) cost only detonates at production scale.
127
+
128
+ This is distinct from the Fixture Bindability proof above (which proves a *correctness* gate can fail). Here the failure is *throughput*: the code is functionally correct but architecturally sequential.
129
+
130
+ **Gate construction for any stage doing per-row network calls (LLM, enrichment, third-party API) over N > ~500 rows:**
131
+
132
+ 1. **Assert the pool is wired, not just specified.** The gate test must prove bounded concurrency is *in the call path* — e.g. inject a counter/semaphore probe and assert peak in-flight requests `> 1` (and `<= the configured bound`). A test that only checks the output is correct cannot distinguish a worker pool from a sequential loop.
133
+ 2. **Reject "green on a 5-row fixture" as certification.** A correctness fixture of a handful of rows does not certify concurrency. Either run the gate against a fixture large enough that a sequential implementation would observably exceed a wall-clock/round-count budget, or use the in-flight probe from (1) — but do not let a tiny passing fixture stand in for a throughput claim.
134
+ 3. **One gate per concurrent stage.** If the ADR claims concurrency for multiple stages, each stage needs its own wired-concurrency check. Verifying one and assuming the rest is exactly the failure below.
135
+
136
+ Field report #378 (InvestorGraph): an ADR specified "a bounded worker pool for I/O-bound stages," but BOTH the LLM-classify and Hunter-enrich stages shipped as sequential loops. Tests passed on small fixtures, so it shipped twice — ~4h wall-clock over ~4k rows for one stage, a stalled run over ~10k for the other, each caught only by watching a live run. The build had a correctness gate but no throughput/scale gate, and nothing verified the ADR's concurrency claim against the code. (Coordinate with the throughput/scale gate in `QA_ENGINEER.md` / `TESTING.md` — the architect writes the claim and its verification gate; QA enforces the scale test.)
137
+
123
138
  ### Service-extraction test-patch checklist
124
139
 
125
140
  When a mission moves a symbol out of one module into another (PIC-002-style service extraction, refactor-into-helper, rename-with-relocation), the same commit MUST update every test that patches the symbol by old path. Imports bind at module load — `patch("app.routers.X.foo")` silently no-ops if `foo` now lives in `app.services.X.service`, and the test passes against unmocked production code.
@@ -276,6 +276,8 @@ See `/docs/patterns/e2e-test.ts` for the complete reference implementation:
276
276
 
277
277
  **Author-fixture-only boundaries (LLM / external output):** If every test of an integration boundary feeds it a fixture you authored, you have not tested the boundary. Hand-authored inputs exercise only the shapes you imagined — and those already work. For any path that consumes LLM or external-tool output and acts on it (applies a model-generated diff, parses a model JSON plan, executes a tool-returned command), add at least one **real-output self-test on a seeded mutant** asserting does-it-fix and does-no-harm. (Field report #358: hand-authored diffs always git-applied; real Sonnet diffs did not — corrupt-patch bug invisible to every fixture test.) This complements, not contradicts, the existing "mock it, don't call it" rule below: that rule governs cheap deterministic dependencies; the seeded-mutant self-test governs the act-on-output integration boundary specifically.
278
278
 
279
+ **Small-fixture tests don't certify throughput:** A stage that makes a per-row network call (LLM classify, enrichment API, per-record HTTP/DB round-trip) passes a 5-row fixture in either implementation — concurrent *or* a sequential `for`-loop. Small fixtures structurally cannot expose an `O(rows × latency)` serial loop where an ADR specified a bounded worker pool / parallel fan-out. For any batch/pipeline project, add a **scale test** at N well above ~500 rows that asserts concurrency is *wired*, not just specified — measure wall-clock against the per-call latency budget (a serial loop blows it) or assert in-flight calls reach the pool bound. A green correctness suite is not a throughput certificate. See QA_ENGINEER.md "Throughput / Scale Gate" for the full gate. (Field report #378: an ADR's bounded worker pool shipped as two sequential loops — tests green on small fixtures, ~4k rows ran ~4 hours in production.)
280
+
279
281
  **No source-code string assertions:** Never assert on status code strings or error class names found in source code (`'403' in source`, `'HTTPException' in source`). These break on any refactor that changes error handling mechanics (e.g., `HTTPException(403)` → `Errors.forbidden()`). Test the actual HTTP response status and body instead. (Field report #227)
280
282
 
281
283
  **Error format migration checklist:** Before committing any change to error response shape (e.g., `{"detail": ...}` → `{"error": {"code", "message"}}`), grep test files for the old shape. Tests asserting `response["detail"]` will silently pass if the test never reaches the assertion (wrong status code) or will fail confusingly. Fix all test assertions to match the new shape in the same commit. (Field report #227)
@@ -257,12 +257,12 @@ After resolving any significant failure:
257
257
 
258
258
  Before clearing, deleting, or modifying database fields to "fix" missing files or broken state:
259
259
  1. **Can data be restored from backup?** Check `~/.voidforge/backups/`, `pg_dump` snapshots, platform export tools.
260
- 2. **Can files be re-downloaded or re-generated without cost?** Check if the source is a free API or a paid service (DALL-E, CDN, etc.).
260
+ 2. **Can files be re-downloaded or re-generated without cost?** Check if the source is a free API or a paid service (image generation (gpt-image-1), CDN, etc.).
261
261
  3. **Is the DB change reversible?** Clearing a field is often irreversible — the original value is gone.
262
262
  4. **What is the regeneration cost?** Count: API calls × price per call. Time to regenerate.
263
263
  5. **NEVER clear a DB field to work around a missing file.** Restore the file first, or confirm the regeneration cost is acceptable BEFORE deleting the reference.
264
264
 
265
- (Field report #103: 251 avatarUrl fields cleared to "fix" missing files, triggering ~$10 in DALL-E regeneration + 50 minutes downtime. The files existed on the VPS — they were deleted by `rsync --delete`, not lost. Restoring from backup would have been free.)
265
+ (Field report #103: 251 avatarUrl fields cleared to "fix" missing files, triggering ~$10 in image regeneration (gpt-image-1) + 50 minutes downtime. The files existed on the VPS — they were deleted by `rsync --delete`, not lost. Restoring from backup would have been free.)
266
266
 
267
267
  ---
268
268
 
@@ -53,6 +53,7 @@ return { confirmed: claims.filter((c,i) => verdicts[i]?.survives) }
53
53
  5. **Cost lever:** route cheap stages with `agent(p, {model:'haiku'})` (scout pre-scans) and reserve the default model for synthesis — the way the Surfer already runs on Haiku.
54
54
  6. **`agentType` resolves by the agent's `name:` display field, NOT the filename** (e.g. `'Picard'`, not `'picard-architecture'`). A filename-style `agentType` fails to resolve and the `agent()` call returns `null` (silently filtered by `.filter(Boolean)`), so the agent simply never runs. If a roster carries both, pass `a.name`. Same rule as the Agent tool's `subagent_type`.
55
55
  7. **Validate before shipping:** a workflow script's top-level `await`/`return` make a bare `node --check` fail ("Illegal return statement") — that is expected (the runtime wraps the body in an async fn). Use `npm run validate:workflows` (wired into `pretest`), which reproduces the wrapper before checking, so a real syntax error is caught in CI rather than shipping to npm.
56
+ 8. **Repro scratch goes to `mktemp`, never the repo tree** (#366 F5). A workflow's adversarial/repro agents that reproduce a finding via shell (probe scripts, atomic-write `.tmp` files, fixture dirs) MUST write to `$(mktemp -d)` (or `$(mktemp)` for a single file) — isolated, auto-cleaned, invisible to `git add -A`. Never write probe scripts or scratch into the working tree: the gauntlet's gate-race repro littered `.gate-repro-scratch/` and `scripts/surfer-gate/.*-probe.sh` into the repo on two separate runs and was nearly committed. The agent prompt that asks for a shell repro must say *where* to write it. Projects may also `.gitignore` a designated scratch path as a backstop, but the primary rule is `mktemp`. (Same rule for raw Agent dispatch — see `SUB_AGENTS.md`.)
56
57
 
57
58
  ## Gate interop (ADR-064) — REQUIRED
58
59
 
@@ -71,6 +72,19 @@ The 264 personas, the Agent Debate Protocol, severity re-rating from votes, the
71
72
 
72
73
  Every Workflow run persists its script + a journal. To resume after an edit/kill: `Workflow({scriptPath, resumeFromRunId})` — unchanged `agent()` calls return cached results; the first edited call and everything after re-runs.
73
74
 
75
+ ## Recovery — after `/clear` or a crash (#366 F1)
76
+
77
+ A background workflow survives **neither** `/clear` **nor** a host crash. Both leave the launching task's output empty (0-byte) or partial — the run did not finish synthesizing, even though the journal on disk may hold dozens of completed `agent()` results. The reflex is to re-run from scratch; for a 60–80-agent gauntlet that throws away ~80 minutes and the token cost of every cached agent. **Resume FIRST.**
78
+
79
+ **Recovery procedure:**
80
+
81
+ 1. **Record the `runId` at launch.** `/gauntlet` and `/assemble` write the workflow `runId` to their state file (and the vault) the moment they invoke the Workflow tool, so a fresh post-`/clear` session can find it. If you don't have it, the runtime can list recent runs for the script.
82
+ 2. **On an empty or partial task-output, resume — don't restart.** `Workflow({ scriptPath, resumeFromRunId })` replays the journal: every unchanged `agent()` call returns its cached result instantly, and execution continues from the first incomplete call through the final synthesis. You pay only for what didn't finish.
83
+ 3. **Empty-output handling is not "the run failed."** A 0-byte output means the *lead's task* was interrupted, not that the agents didn't run. Check the journal/`runId` before concluding the work was lost.
84
+ 4. **What survives:** the script source and the per-call result journal (so cached `agent()` results survive). **What does NOT survive:** in-flight agents at crash time (re-run on resume), and any repro scratch the agents wrote (gone with `mktemp`, as it should be — Gotcha 8). If you *edited* the script after the crash, resume re-runs from the first changed call forward; an unchanged script resumes cleanly.
85
+
86
+ Re-running from scratch is correct only when no `runId` is recoverable. Treat blind restart as the fallback, not the default.
87
+
74
88
  ## Related
75
89
 
76
90
  - `SUB_AGENTS.md` — dispatch discipline, model/effort tiering, the find→verify review shape, fan-out residual sweeps.
@@ -288,10 +288,95 @@ const conferenceUrlField: UntrustedExtractionField = {
288
288
  * surface the raw value on the review surface for operator edit.
289
289
  */
290
290
 
291
+ // --- Deny-list discipline (forbidden-inference / forbidden-token filters) ---
292
+
293
+ /**
294
+ * Pattern for a deny-list that strips or rejects forbidden content an LLM might
295
+ * emit — e.g. a compliance filter that must NOT let the model infer or assert a
296
+ * subject's wealth, accreditation, or citizenship. A naive "does the output
297
+ * contain any banned token?" substring/regex filter false-fires three ways and
298
+ * is silently un-testable a fourth. Field report #378 (InvestorGraph) hit all
299
+ * four on a compliance-critical forbidden-inference filter:
300
+ *
301
+ * 1. NEGATION / DISCLAIMER false-positive
302
+ * The model correctly writing "*no* accreditation evidence" or "citizenship
303
+ * unknown" is the SAFE answer — yet a bare token match strips it and
304
+ * penalizes the model for being careful. The filter must scope matches to
305
+ * POSITIVE assertions: if a negation/disclaimer cue sits adjacent to the
306
+ * banned token, the mention is not a leak.
307
+ *
308
+ * 2. PROPER-NOUN false-positive
309
+ * A contact employed at "Visa", a fund literally named "Trust Fund", a
310
+ * company "BIG RICH LLC", a "...High Net Worth Community" group — the banned
311
+ * substring appears inside a legitimate entity name the model is allowed to
312
+ * report. An allowlist of known proper nouns (and the entity's own
313
+ * attribute values — employer, company, group names) must suppress the match.
314
+ *
315
+ * 3. HOMOGLYPH / ZERO-WIDTH evasion (false-NEGATIVE — the dangerous direction)
316
+ * An adversary (or a quirk of upstream data) writes "аccredited" with a
317
+ * Cyrillic 'а', or splits the token with a zero-width joiner, and the banned
318
+ * term sails through. NFKC-normalize and strip zero-width / combining marks
319
+ * BEFORE matching so visually-identical variants collapse to the canonical
320
+ * form the deny-list is written against.
321
+ *
322
+ * 4. TAUTOLOGICAL EVAL (the un-testable trap)
323
+ * The safety EVAL's leak-detector must be INDEPENDENT of the production
324
+ * filter. If the eval re-imports the same deny-list / regex the filter uses,
325
+ * it is structurally incapable of catching the filter's gaps — every term
326
+ * the filter misses, the eval also misses, so the eval reports PASS on a
327
+ * real leak. Testing a filter with itself is vacuous. The leak-detector
328
+ * must be built from an independent oracle (a hand-curated banned-phrase
329
+ * set, a second model, an LLM-judge, or human labels).
330
+ */
331
+ export interface DenyListPolicy {
332
+ forbiddenTerms: string[] // canonical, post-NFKC banned tokens/phrases
333
+ normalizeBeforeMatch: 'nfkc-strip-zerowidth' // ALWAYS normalize first (guard #3)
334
+ negationGuard: { // guard #1 — a nearby negation/disclaimer un-flags the match
335
+ enabled: true
336
+ cues: string[] // e.g. ['no', 'not', 'unknown', 'unverified', 'absent', 'lacks']
337
+ windowTokens: number // how many tokens of adjacency count as "negating" the term
338
+ }
339
+ properNounAllowlist: string[] // guard #2 — names containing a banned substring that are OK
340
+ allowEntityAttributeValues: boolean // guard #2 — also exempt the entity's own employer/company/group fields
341
+ evalLeakDetector: 'independent' // guard #4 — MUST NOT reuse this policy's forbiddenTerms
342
+ }
343
+
344
+ const accreditationDenyList: DenyListPolicy = {
345
+ forbiddenTerms: ['accredited', 'net worth', 'high net worth', 'citizenship', 'wealthy'],
346
+ normalizeBeforeMatch: 'nfkc-strip-zerowidth',
347
+ negationGuard: {
348
+ enabled: true,
349
+ cues: ['no', 'not', 'unknown', 'unverified', 'absent', 'lacks', 'without', 'cannot confirm'],
350
+ windowTokens: 4,
351
+ },
352
+ properNounAllowlist: ['Visa', 'Trust Fund', 'BIG RICH LLC', 'High Net Worth Community'],
353
+ allowEntityAttributeValues: true,
354
+ evalLeakDetector: 'independent',
355
+ }
356
+
357
+ /* ANTI-PATTERN 5: bare substring/regex deny-list with a self-referential eval
358
+ *
359
+ * 'We strip any output line containing a banned term, and our safety eval
360
+ * greps the output for the same banned terms — 11/11 pass, ship it.'
361
+ *
362
+ * No. Four failures, three loud and one silent:
363
+ * - "no accreditation evidence" (the SAFE answer) is stripped + penalized.
364
+ * - A contact at "Visa" / a "Trust Fund" is flagged on a proper noun.
365
+ * - "аccredited" (Cyrillic а) or a zero-width-split token slips through.
366
+ * - The eval reuses the filter's deny-list, so it CANNOT fail on a leak the
367
+ * filter misses — 11/11 is a tautology, not evidence of safety.
368
+ *
369
+ * Fix: NFKC-normalize + strip zero-width BEFORE matching (defeats evasion);
370
+ * scope matches to positive assertions via a negation-adjacency guard; suppress
371
+ * proper-noun / entity-attribute matches via an allowlist; and build the eval's
372
+ * leak-detector from an INDEPENDENT oracle so it can actually fail.
373
+ */
374
+
291
375
  export {
292
376
  authorityInstruction,
293
377
  denyListEnforcement,
294
378
  fsPermsEnforcement,
295
379
  threadplexAgentStack,
296
380
  conferenceUrlField,
381
+ accreditationDenyList,
297
382
  }
@@ -10,6 +10,11 @@
10
10
  * - Batch vs streaming mode toggle — same stages, different execution
11
11
  * - Error handling: skip-and-log vs fail-fast configurable per pipeline
12
12
  * - Progress reporting callback for observability
13
+ * - Source-format discovery BEFORE assuming CSV — the first stage detects the
14
+ * real input format and dispatches to a SourceAdapter. Never hardcode
15
+ * `read_csv`. A "giant contact dump" is frequently NOT a CSV (field report
16
+ * #378: a 4k-row export arrived as an Apple Contacts `.abbu` SQLite bundle).
17
+ * See the SourceAdapter section in Framework Adaptations below.
13
18
  *
14
19
  * Agents: Stark (backend), Banner (data), L (monitoring)
15
20
  *
@@ -250,6 +255,56 @@ export {
250
255
  checkNullRate, checkRange, computeDedupHash,
251
256
  };
252
257
 
258
+ // ── Source Adapter (format discovery — field report #378) ──────────────
259
+ //
260
+ // The PRD says "CSV" but the real authorized source is often something else.
261
+ // A pipeline's FIRST stage must DISCOVER the format and dispatch to an adapter,
262
+ // never assume CSV. Each adapter normalizes its source into the same record
263
+ // shape the rest of the pipeline consumes (e.g. a flat contact row). Adding a
264
+ // source = adding an adapter, not editing every downstream stage.
265
+ //
266
+ // type SourceFormat = 'csv' | 'vcard' | 'sqlite-contacts' | 'json';
267
+ //
268
+ // /** Sniff the format from extension + magic bytes — do NOT trust the name alone. */
269
+ // function detectSourceFormat(path: string, head: Buffer): SourceFormat {
270
+ // const ext = path.toLowerCase();
271
+ // if (ext.endsWith('.vcf')) return 'vcard'; // vCard text
272
+ // if (ext.endsWith('.abbu') || ext.endsWith('.abcddb')) return 'sqlite-contacts'; // Apple Contacts store
273
+ // if (head.subarray(0, 16).toString() === 'SQLite format 3') return 'sqlite-contacts';
274
+ // if (ext.endsWith('.json')) return 'json';
275
+ // if (head[0] === 0x42 && head[1] === 0x45 && head[2] === 0x47) return 'vcard'; // "BEG" of BEGIN:VCARD
276
+ // return 'csv';
277
+ // }
278
+ //
279
+ // interface SourceAdapter { read(path: string): Promise<Record<string, unknown>[]>; }
280
+ //
281
+ // // --- vCard (.vcf) ------------------------------------------------------
282
+ // // STUB: parse with a vCard lib (e.g. `vcf`/`ical.js`); map FN/EMAIL/TEL/ORG
283
+ // // to the canonical contact record. A single .vcf can hold many VCARD blocks.
284
+ // const vcardAdapter: SourceAdapter = {
285
+ // async read(_path) { throw new Error('Implement: split on BEGIN:VCARD, map FN/EMAIL/TEL/ORG'); },
286
+ // };
287
+ //
288
+ // // --- SQLite contact stores (.abbu bundle / .abcddb) -------------------
289
+ // // STUB: an Apple Contacts `.abbu` is a BUNDLE containing an `.abcddb` SQLite
290
+ // // file; open read-only and SELECT from ZABCDRECORD/ZABCDEMAILADDRESS etc.
291
+ // // (schema varies by macOS version — probe table names, don't hardcode).
292
+ // const sqliteContactsAdapter: SourceAdapter = {
293
+ // async read(_path) { throw new Error('Implement: open .abcddb read-only, join ZABCDRECORD + email/phone tables'); },
294
+ // };
295
+ //
296
+ // // --- JSON export -------------------------------------------------------
297
+ // // STUB: many providers export a JSON array (or NDJSON); validate with Zod
298
+ // // before mapping — exported JSON is untyped and frequently partial.
299
+ // const jsonAdapter: SourceAdapter = {
300
+ // async read(_path) { throw new Error('Implement: parse + Zod-validate, map to canonical record'); },
301
+ // };
302
+ //
303
+ // // SECURITY: every one of these formats is a PII export. The default
304
+ // // .gitignore must cover them up front (*.vcf *.abbu *.abcddb* *.json input
305
+ // // dumps) — field report #378 logged TWO near-misses where a non-CSV source
306
+ // // dump sat un-ignored in the repo root.
307
+ //
253
308
  // ── Framework Adaptations ───────────────────────────────
254
309
  //
255
310
  // === Python (pandas/polars) ===
@@ -262,7 +317,10 @@ export {
262
317
  // raise FileNotFoundError(path)
263
318
  //
264
319
  // def transform(self, path: str) -> pl.DataFrame:
265
- // return pl.read_csv(path)
320
+ // # Discover the format first — do NOT assume CSV (field report #378).
321
+ // fmt = detect_source_format(path) # 'csv'|'vcard'|'sqlite-contacts'|'json'
322
+ // return SOURCE_ADAPTERS[fmt](path) # each adapter -> canonical DataFrame
323
+ // # e.g. sqlite-contacts: sqlite3.connect(f"file:{abcddb}?mode=ro", uri=True)
266
324
  //
267
325
  // class CleanStage:
268
326
  // def validate(self, df: pl.DataFrame) -> None:
@@ -0,0 +1,43 @@
1
+ #!/usr/bin/env bash
2
+ # egress-sandbox.sh — Pattern: run a network-egress-confined workload WITHOUT
3
+ # making its artifacts root-owned (field report #382 RC-2).
4
+ #
5
+ # PROBLEM. A common "egress sandbox" wraps a workload in `sudo systemd-run` with
6
+ # IPAddressAllow/IPAddressDeny to confine outbound network. Done naively it runs
7
+ # the workload as ROOT (sudo's default), so every file the workload writes —
8
+ # caches, state, lock files, output — is root-owned. A sibling/same-purpose tool
9
+ # run later as the normal user then can't read or overwrite that state and
10
+ # breaks. The egress confinement never REQUIRED root: IPAddress* filtering is a
11
+ # cgroup property (systemd's BPF egress filter) and is uid-independent. Drop the
12
+ # workload back to the invoking user and confinement is fully preserved while
13
+ # artifacts stay user-owned.
14
+ #
15
+ # ── WRONG: runs as root, litters root-owned artifacts ────────────────────────
16
+ # sudo systemd-run --pipe --wait \
17
+ # -p IPAddressDeny=any -p IPAddressAllow=10.0.0.0/8 \
18
+ # my-workload --out ./state # ./state is now root-owned
19
+ #
20
+ # ── RIGHT: same egress confinement, artifacts owned by the invoking user ──────
21
+ INVOKING_UID="$(id -u)"
22
+ INVOKING_GID="$(id -g)"
23
+
24
+ sudo systemd-run --pipe --wait \
25
+ --uid="$INVOKING_UID" --gid="$INVOKING_GID" \
26
+ -p IPAddressDeny=any \
27
+ -p IPAddressAllow=localhost \
28
+ -p IPAddressAllow=10.0.0.0/8 \
29
+ my-workload --out ./state # ./state owned by the invoking user
30
+ #
31
+ # WHY IT'S SAFE. IPAddressAllow/IPAddressDeny are enforced by the transient
32
+ # unit's cgroup, which applies regardless of the process uid. --uid/--gid only
33
+ # change the credential the workload runs under — they do not relax the network
34
+ # policy. You get confinement AND user-owned artifacts.
35
+ #
36
+ # VERIFY BOTH HALVES (don't assume — one assertion per property):
37
+ # 1. Confinement: from inside the workload, a connection to a DENIED address
38
+ # must fail — curl --max-time 3 https://example.org times out/refused.
39
+ # 2. Ownership: stat -c '%U' ./state returns the invoking user, not root.
40
+ #
41
+ # NOTE. IPAddress* allow/deny lists need systemd ≥ 235 with cgroup v2 + BPF; on
42
+ # hosts without it, fall back to a network namespace (`ip netns`) or a per-unit
43
+ # firewall, but keep the same --uid/--gid drop so artifacts stay user-owned.
@@ -0,0 +1,62 @@
1
+ # Pattern: Exclusion-Set Superset Invariant
2
+
3
+ **When to use:** Any project where MORE THAN ONE mechanism independently enumerates "secret / PII / excluded" files — typically `.gitignore`, an `rsync --exclude` (or `tar --exclude`) deploy list, and a secret-scanner config (gitleaks/trufflehog/detect-secrets). Containment-heavy projects (autonomous agents, deploy pipelines that ship a working tree to a host) are the high-risk case.
4
+
5
+ **Source:** Field report #377 §5 (live secret exposure traced to three exclusion mechanisms drifting apart).
6
+
7
+ ## The Failure Mode
8
+
9
+ Each mechanism enumerates "the secret files" by its OWN rules, authored at a different time by a different concern:
10
+
11
+ - `.gitignore` keeps secrets OUT OF GIT.
12
+ - `rsync --exclude` (deploy) keeps secrets OFF THE TARGET HOST.
13
+ - the secret-scanner keeps secrets OUT OF COMMITS / CI.
14
+
15
+ Because the three lists are written and maintained separately, they drift. A file the `.gitignore` covers shipped through `rsync` world-readable, and the scanner's name patterns never matched it — so a secret excluded from git was deployed to the host and went undetected. Three "secured" mechanisms, zero of them caught the leak, because none of them agreed on the set.
16
+
17
+ The trap: each list looks complete in isolation. The bug is in the DELTA between them, which no single mechanism can see.
18
+
19
+ ## The Pattern — One Canonical Set, the Others are Supersets
20
+
21
+ Define ONE canonical secret/PII exclusion set. Every other mechanism's exclusion set must be a SUPERSET of it (it may exclude more — never less). Then assert the invariant in CI so it cannot silently drift.
22
+
23
+ 1. **Canonical source.** Pick one list as canonical (usually `.gitignore`'s secret section, or a dedicated `secrets.exclude` manifest). This is the minimum set every mechanism must cover.
24
+
25
+ 2. **Derive, don't duplicate, where possible.** Generate the `rsync --exclude-from=` file and the scanner's path patterns FROM the canonical set at build time. Derivation makes drift structurally impossible; if a mechanism's format can't be derived, fall to the assertion below.
26
+
27
+ 3. **Assert the superset invariant.** A CI/provisioning check that fails closed:
28
+
29
+ ```bash
30
+ # exclusion-set-invariant check — every mechanism must cover the canonical set.
31
+ # Canonical set = the secret/PII globs that MUST be excluded everywhere.
32
+ canonical=$(sort -u docs/security/secrets.exclude) # one file, one canonical truth
33
+
34
+ # Each mechanism exposes its excluded globs (normalize to one-glob-per-line).
35
+ gitignore=$(git_secret_globs) # secret section of .gitignore
36
+ rsync_excl=$(cat deploy/rsync.exclude)
37
+ scanner=$(scanner_path_globs) # gitleaks/trufflehog allow/deny paths
38
+
39
+ fail=0
40
+ for mech in "gitignore:$gitignore" "rsync:$rsync_excl" "scanner:$scanner"; do
41
+ name="${mech%%:*}"; have="${mech#*:}"
42
+ # Anything in canonical NOT covered by this mechanism = drift = fail.
43
+ missing=$(comm -23 <(printf '%s\n' "$canonical" | sort -u) \
44
+ <(printf '%s\n' "$have" | sort -u))
45
+ if [[ -n "$missing" ]]; then
46
+ echo "EXCLUSION DRIFT: '$name' is missing canonical entries:" >&2
47
+ echo "$missing" >&2
48
+ fail=1
49
+ fi
50
+ done
51
+ exit "$fail"
52
+ ```
53
+
54
+ 4. **Wire it into the gates.** Run the check in CI AND as a deploy/arming pre-flight (per the field report it was a deploy-time exposure). A new secret pattern added to the canonical set then forces every mechanism to cover it, or the build/deploy fails.
55
+
56
+ ## The Invariant, Stated
57
+
58
+ > `canonical ⊆ gitignore` AND `canonical ⊆ rsync_exclude` AND `canonical ⊆ scanner` — at all times, enforced by an assertion. Supersets are fine; subsets are drift.
59
+
60
+ ## The Trade-off
61
+
62
+ Derivation (step 2) is strictly better than assertion (step 3) — it removes the possibility of drift instead of detecting it — but not every tool accepts a generated exclude format, and some teams want each mechanism's list hand-tunable for its own extra concerns (rsync excluding build artifacts; the scanner allow-listing test fixtures). The superset invariant is the floor that permits those per-mechanism extras while forbidding any mechanism from covering LESS than the canonical secret set. Use derivation where the format allows; fall back to the asserted invariant everywhere else. (Field report #377 §5.)
@@ -34,6 +34,20 @@ declare const harness: {
34
34
  listAllReadEndpoints(): string[];
35
35
  listAllWriteEndpoints(): string[];
36
36
  resetDb(): Promise<void>;
37
+
38
+ // ── Handler-entry (HTTP-level) harness — field report #371 ──────────────
39
+ // Drives the REAL request entrypoint with a concrete credential, so the
40
+ // auth→uid wiring is exercised (not just the repository's WHERE org_id).
41
+ // `principal` is whatever the entrypoint actually authenticates with: a
42
+ // bearer token, a session cookie, an API key header — give two DISTINCT ones.
43
+ httpRequest(
44
+ principal: { headers: Record<string, string> },
45
+ method: 'GET' | 'POST' | 'PUT' | 'DELETE',
46
+ path: string,
47
+ body?: unknown,
48
+ ): Promise<{ status: number; json: unknown }>;
49
+ // Two distinct, real principals for the SAME logical resource owner vs other.
50
+ principalForOrg(org: { apiKey: string; userId: string }): { headers: Record<string, string> };
37
51
  };
38
52
 
39
53
  // ── The Property ─────────────────────────────────────────────────────────
@@ -85,6 +99,43 @@ describe('multi-tenant isolation property', () => {
85
99
  const rowsB = await harness.readAsOrg(orgB, '/api/people');
86
100
  expect(rowsB.find((r) => r.org_id === orgA.id)).toBeUndefined();
87
101
  });
102
+
103
+ // ── Handler-entry two-principal variant (field report #371) ──────────────
104
+ // The repository-layer property above can pass while a handler that hardcodes
105
+ // `uid = 1` leaks across tenants — the repo test never crosses the auth→uid
106
+ // seam. This variant drives the REAL HTTP entrypoint with TWO DISTINCT
107
+ // credentials and asserts isolation through the request path. It is the test
108
+ // that the planted-bug check below must turn red.
109
+ test('two distinct principals through the real handler do not cross tenants', async () => {
110
+ const orgA = await harness.createOrg();
111
+ const orgB = await harness.createOrg();
112
+ const pA = harness.principalForOrg(orgA);
113
+ const pB = harness.principalForOrg(orgB);
114
+
115
+ // A writes through the real entrypoint with A's own credential.
116
+ const created = await harness.httpRequest(pA, 'POST', '/api/people', { name: 'A-secret' });
117
+ expect(created.status).toBeLessThan(300);
118
+ const writtenId = (created.json as { id: string }).id;
119
+
120
+ // B reads every list endpoint through the real entrypoint with B's credential.
121
+ for (const readEndpoint of harness.listAllReadEndpoints()) {
122
+ const res = await harness.httpRequest(pB, 'GET', readEndpoint);
123
+ const rows = Array.isArray(res.json) ? (res.json as Array<{ id?: string }>) : [];
124
+ expect(rows.find((r) => r.id === writtenId)).toBeUndefined();
125
+ }
126
+
127
+ // Cross-principal direct fetch: B asking for A's row by id must 404, not 403
128
+ // (404 avoids leaking existence — see CLAUDE.md "Return 404, not 403").
129
+ const direct = await harness.httpRequest(pB, 'GET', `/api/people/${writtenId}`);
130
+ expect(direct.status).toBe(404);
131
+ });
132
+
133
+ // PLANTED-BUG RED-CHECK (field report #371): hardcoding `uid = <owner>` in the
134
+ // handler MUST turn the two-principal test above RED. If you can introduce
135
+ // that bug and the suite stays green, your isolation test is not crossing the
136
+ // auth→uid seam — it is asserting at the repository layer only. Run this once
137
+ // as a mutation check: patch the handler to ignore the authenticated principal
138
+ // and pin uid to org A's id; the test above must fail. Revert after proving it.
88
139
  });
89
140
 
90
141
  function randomPayload(): fc.Arbitrary<unknown> {
@@ -112,8 +163,21 @@ function randomPayload(): fc.Arbitrary<unknown> {
112
163
  // assert not any(r['id'] == written['id'] for r in rows_b), \
113
164
  // f"LEAK: {write_endpoint} -> {read_endpoint}"
114
165
  //
166
+ // # Handler-entry two-principal variant (field report #371) — drive the real
167
+ // # entrypoint (FastAPI TestClient / Django test Client) with two distinct
168
+ // # credentials, NOT the repository:
169
+ // # ra = client.post('/api/people', json={'name': 'A'}, headers=princ_a)
170
+ // # rb = client.get(f"/api/people/{ra.json()['id']}", headers=princ_b)
171
+ // # assert rb.status_code == 404 # not 403 — don't leak existence
172
+ // # Mutation check: pin uid=<owner> in the handler; this MUST go red.
173
+ //
115
174
  // ── Anti-patterns ────────────────────────────────────────────────────────
116
175
  //
176
+ // 0. Asserting isolation only at the repository layer. A handler that
177
+ // hardcodes uid=1 passes every repo-level test while leaking across
178
+ // tenants. The isolation test MUST drive the real request entrypoint with
179
+ // two distinct principals (field report #371). Prove it with the planted
180
+ // uid red-check.
117
181
  // 1. Testing isolation only on known endpoints. The bug is in the endpoint
118
182
  // you forgot. Property tests enumerate the full surface.
119
183
  // 2. Using SUPERUSER fixtures. They silently bypass FORCE RLS at the engine