npm - @therocketcode/gsd-core - Versions diffs - 1.7.4 → 1.8.0 - Mend

@therocketcode/gsd-core 1.7.4 → 1.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (35)

package/.claude-plugin/plugin.json +1 -1
package/agents/gsd-plan-checker.md +2 -2
package/commands/gsd/cicd-strategy.md +67 -0
package/commands/gsd/discover-product.md +2 -2
package/commands/gsd/infrastructure-strategy.md +65 -0
package/gemini-extension.json +1 -1
package/gsd-core/references/ai-test-quality.md +85 -0
package/gsd-core/references/architecture-decision.md +10 -7
package/gsd-core/references/cicd-strategy.md +115 -0
package/gsd-core/references/contract-testing.md +9 -1
package/gsd-core/references/data-environments.md +89 -0
package/gsd-core/references/domain-modeling.md +14 -2
package/gsd-core/references/e2e-tiering.md +2 -2
package/gsd-core/references/infrastructure-strategy.md +91 -0
package/gsd-core/references/product-discovery.md +7 -7
package/gsd-core/references/test-doubles.md +88 -0
package/gsd-core/references/test-strategy.md +6 -5
package/gsd-core/templates/adr.md +21 -1
package/gsd-core/templates/cicd-strategy.md +72 -0
package/gsd-core/templates/domain-model.md +4 -2
package/gsd-core/templates/infra-strategy.md +77 -0
package/gsd-core/templates/product-brief.md +10 -8
package/gsd-core/templates/test-strategy.md +8 -0
package/gsd-core/workflows/add-tests.md +8 -3
package/gsd-core/workflows/cicd-strategy.md +152 -0
package/gsd-core/workflows/discover-product.md +13 -9
package/gsd-core/workflows/discuss-phase.md +1 -1
package/gsd-core/workflows/help/modes/full.md +2 -0
package/gsd-core/workflows/infrastructure-strategy.md +142 -0
package/gsd-core/workflows/model-domain.md +13 -13
package/gsd-core/workflows/plan-phase.md +2 -2
package/gsd-core/workflows/recommend-architecture.md +22 -8
package/gsd-core/workflows/testing-strategy.md +6 -4
package/package.json +1 -1
package/scripts/changeset/cli.cjs +2 -2

package/gsd-core/references/e2e-tiering.md CHANGED

@@ -4,13 +4,13 @@ How-to reference for structuring end-to-end tests so CI stays fast and the suite
 ## The tiers
-- **Persistent (in CI, every PR):** a small **smoke / critical-user-journey** suite — auth, payment, core nav. <5 min feedback. An "ultra-priority" set checking only vital flows validates build stability.
+- **Persistent (in CI, every PR):** a small **smoke / critical-user-journey** suite of **3–7 journeys** — auth, payment, core nav. <5 min feedback. An "ultra-priority" set checking only vital flows validates build stability.
 - **Deeper regression:** business-critical flows on staging / release candidates / scheduled runs — *not* every PR.
 - **Transient:** throwaway e2e to validate a freshly-built flow during the dev loop; **not** kept in CI; demote to integration once the behavior is covered more cheaply.
 ## Keep e2e lean
-- E2E is the most expensive layer to run and maintain. A portfolio of **50–200 well-chosen** e2e tests covering critical workflows beats 1,000 redundant ones.
+- E2E is the most expensive layer to run and maintain. A portfolio of **50–200 well-chosen** e2e tests covering critical workflows beats 1,000 redundant ones. That ≈50–200 figure caps the **total** e2e portfolio across all tiers (PR smoke + staging regression + release/scheduled); it is never the size of the PR gate, which stays at 3–7 journeys.
 - Push edge cases **down** to unit/integration (test each behavior once, at the cheapest level — see `test-strategy.md`). E2E covers true end-to-end critical journeys only.
 - The **ice-cream cone** (mostly e2e) is the anti-pattern.
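
The two limits in this hunk (3–7 journeys on the PR gate, roughly 200 e2e tests in total across tiers) could be checked mechanically. The sketch below is only an illustration: `E2ePortfolio`, `checkE2ePortfolio`, and the constant names are hypothetical and not part of the package.

```typescript
// Hypothetical shape: these names do not exist in gsd-core.
interface E2ePortfolio {
  prSmokeJourneys: number; // journeys that run on every PR
  totalE2eTests: number;   // PR smoke + staging regression + release/scheduled
}

const PR_GATE_MIN = 3;
const PR_GATE_MAX = 7;
const TOTAL_CAP = 200; // upper end of the "50-200 well-chosen" range

// Returns one human-readable message per violated limit; empty when within limits.
function checkE2ePortfolio(p: E2ePortfolio): string[] {
  const problems: string[] = [];
  if (p.prSmokeJourneys < PR_GATE_MIN || p.prSmokeJourneys > PR_GATE_MAX) {
    problems.push(
      `PR gate has ${p.prSmokeJourneys} journeys; expected ${PR_GATE_MIN}-${PR_GATE_MAX}`
    );
  }
  if (p.totalE2eTests > TOTAL_CAP) {
    problems.push(
      `total e2e portfolio is ${p.totalE2eTests} tests; cap is about ${TOTAL_CAP}`
    );
  }
  return problems;
}
```

Keeping the two numbers as separate fields mirrors the distinction the added sentence draws: the total cap never substitutes for the PR-gate size.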

package/gsd-core/references/infrastructure-strategy.md ADDED

@@ -0,0 +1,91 @@
+# Infrastructure Strategy — Compute Ladder, Quantified
+Reference for `/gsd:infrastructure-strategy`. Recommends infrastructure that **matches actual traffic shape, team size, and spend** — explicitly avoiding both the day-one Kubernetes cluster and the hand-rolled VM. Consumes the architecture ADR (topology) and PRODUCT-BRIEF (scale expectations). Recommends; the user decides.
+## Core principle: serverless containers are the default rung
+The empirically safe prior, not a bias: real Kubernetes clusters average **8–13% CPU utilization** (CAST AI 2,000-org telemetry, three consecutive years; ~70% of requested CPU/memory never used). The cost crossovers where dedicated compute wins sit at 40–80% sustained utilization — **4–10× above what the median real cluster achieves**. Most teams pay dedicated-cluster prices for serverless-shaped traffic. Start at the serverless-container rung; move up only on a trigger.
+The three inputs every crossover keys off: **traffic shape** (idle %, burst ratio), **team size**, **monthly compute spend**. Get these before recommending anything.
+## The compute decision ladder
+| Rung | Wins when | Move up when |
+|------|-----------|--------------|
+| **0 — Static / edge** (GCS+CDN, S3+CloudFront, Cloudflare Pages/Workers, Vercel) | content pre-renderable; logic fits edge functions (auth redirects, headers, A/B); pennies, scale-to-zero inherent | you need persistent server-side state, a real DB with private access, requests beyond edge runtime limits (CPU ms, no TCP), or a backend language the edge doesn't run |
+| **1 — FaaS** (Lambda, Cloud Run functions, Azure Functions) | event-driven glue, webhooks, queue consumers, cron, spiky low-volume APIs | executions **>15 min** (Lambda hard limit); WebSockets/streaming/gRPC; in-memory caches / **connection pools** matter; you want one instance serving many concurrent requests; or volume crosses **~15M invocations/month** at sub-second durations — above that Fargate-class compute runs 40–70% cheaper |
+| **2 — Serverless containers** (Cloud Run, ECS+Fargate, Container Apps) ← **DEFAULT** | stateless HTTP/gRPC APIs, web backends, workers — i.e., most apps. Any OCI image, no runtime ceiling, per-instance concurrency (Cloud Run default 80) amortizes cost across requests. Lock-in low: the artifact is a standard container | sustained busy **>40–50% with commitments** (roughly >$50–100k/yr compute) AND team ≥4 engineers or a platform owner; OR a capability trigger — GPU scheduling, sidecars/daemonsets/mesh, multi-tenant isolation, stateful sets, custom networking/CNI, operators (Kafka, Postgres) |
+| **3 — Managed Kubernetes** (GKE / EKS / AKS) | many services with complex topology (~50+ services / platform-team scale), the capability triggers above, or sustained utilization where commitments pay. Control plane ~$0.10/hr (~$73/mo); the real cost is ops (~2.5 platform engineers per 100 devs) | **down** if utilization is <~20% sustained — you've built an idle cluster, the median case. Sideways to VMs only for the rung-4 exception list |
+| **4 — Raw VMs** (GCE / EC2 / Azure VMs) | the **exception list only** — see below | **down** when you find yourself reimplementing deploys, health checks, rollouts, and autoscaling by hand for a stateless web app — that's the platform's job |
+**The VM exception list** (the only reasons rung 4 is a destination): BYOL / dedicated-host licensing (Windows, Oracle); special hardware (specific GPU SKUs, local NVMe, huge memory) or fine-grained GPU control; kernel/OS access; self-managed stateful systems (DBs, brokers); max-utilization steady fleets on 3-yr RIs/CUDs (−55% to −66% off, beats everything *if* you actually sustain the load). Plus the honest small case: "a couple of boring VMs" for a tiny fixed-traffic product is legitimate simplicity — what's rarely right is VM+ASG+LB templates as the default *growth* architecture.
+## The quantified crossovers (why the default is the default)
+- **AWS's own Fargate study** (AWS Containers Blog): at ~6% CPU reservation, **Fargate is 87% cheaper** than the EC2 equivalent; at 100% reservation EC2 is only **>20% cheaper**. Derived from current prices (Fargate $0.04048/vCPU-hr + $0.004445/GB-hr vs m5.large): on-demand EC2 wins only above **~70–80% sustained packing**; with a 3-yr Savings Plan the crossover drops to **~40–50% sustained**. FinOps rule of thumb: Fargate carries a 20–30% premium over *well-managed* EC2 — and almost nobody manages EC2 well (see CAST AI).
+- **Cloud Run vs a cluster:** a 10M-req/mo, 50 ms service ≈ **$30–40/mo on Cloud Run vs ~$210/mo** on a minimal always-on cluster (control plane + node floor before serving a request).
+- **The labor term dominates:** practitioner FinOps analysis puts self-managed-K8s payback at **~$2.5M/yr compute spend** once engineer cost is in, and flatly: **fewer than 4 engineers → nobody should be running Kubernetes**. Treat the dollar figure as order-of-magnitude; the team-size floor as hard.
+- **GKE Autopilot is the escape hatch** when someone genuinely needs the K8s API early: pod-level billing, no node ops — the sane intermediate rung before GKE Standard/EKS.
+- **PaaS nostalgia costs ~+50%:** App Engine-style PaaS runs about 50% more than Cloud Run for less flexibility; Google's own migration center steers new users to Cloud Run.
+## Per-cloud asymmetries (encode these — they change recommendations)
+- **Fargate does NOT scale to zero** (ECS services hold desired count; App Runner pauses to a low idle charge) and **has no free tier**.
+- **Cloud Run and Container Apps share the same free grant** (180k vCPU-s + 360k GiB-s + 2M req/mo) **plus scale-to-zero** → a genuinely **$0/mo idle dev/staging environment on GCP or Azure**. On AWS the equivalent environment is not free: use Lambda for dev or accept the idle cost.
+- Cold-start-sensitive + spiky → set **min-instances=1 and price it** (~$50–65/mo for 1 always-on vCPU on Cloud Run) before reaching for a cluster.
+- AWS splits queueing across three services (SQS/SNS/EventBridge) where GCP has one; Azure's standalone-cron story is the weakest; GCP's secret-manager DX is the most container-native (direct Cloud Run mounting).
+## Per-cloud equivalences
+| Capability | GCP | AWS | Azure |
+|---|---|---|---|
+| Serverless containers | Cloud Run | ECS on Fargate (App Runner for PaaS-style) | Container Apps |
+| Managed Kubernetes | GKE (Standard / Autopilot) | EKS | AKS |
+| Managed Postgres | Cloud SQL (AlloyDB high-end) | RDS (Aurora high-end) | Database for PostgreSQL Flexible Server |
+| Secrets | Secret Manager | Secrets Manager (SSM Parameter Store = cheap tier) | Key Vault |
+| Queue / pub-sub | Pub/Sub (+ Cloud Tasks) | SQS + SNS + EventBridge | Service Bus + Storage Queues + Event Grid |
+| Object storage | Cloud Storage | S3 | Blob Storage |
+| Scheduler / cron | Cloud Scheduler | EventBridge Scheduler | Container Apps Jobs / Functions timer |
+| Monitoring / logging | Cloud Monitoring + Logging | CloudWatch (+X-Ray) | Azure Monitor + App Insights |
+The ladder is cloud-portable — pick the cloud by team familiarity and existing constraints, then read the column.
+## The observability floor (day one, ~$0–26/mo)
+1. **Structured JSON logs to stdout** — the platform log service ingests automatically. No log vendor yet.
+2. **Error tracking** (Sentry free tier or GlitchTip): grouped, release-tagged, alerting included.
+3. **Uptime check** on public endpoints — external, not just in-cloud (Cloud Monitoring uptime checks / UptimeRobot free tier).
+4. **One golden-signals dashboard** from metrics the platform already emits: request rate, error rate, p50/p95/p99 latency, instance count/CPU. No agents needed on the default rung.
+5. **3–5 alerts max:** error-rate spike, p95 over threshold, uptime fail, queue depth if async, and — non-negotiable — a **billing budget alert** (the most important alert a small team sets).
+**Defer until >3 services in a request path** (or the first "which service caused this?" incident): distributed tracing / OpenTelemetry wiring, formal SLO/error-budget machinery, self-hosted Prometheus+Grafana+Loki stack (becomes its own pet), paid log aggregation, on-call tooling for a 3-person team.
+## When you actually need…
+| Component | Default rung gives you | Add it when |
+|---|---|---|
+| **Load balancer** | Cloud Run domain mapping + managed TLS; ACA ingress; ALB already present with ECS | multi-region anycast, WAF/Cloud Armor, CDN in front, own TLS certs, path-routing across services, private exposure to a VPC |
+| **VPC / private networking** | not required to ship — Cloud Run/ACA run outside your VPC by default | the **first private resource**: Cloud SQL/RDS private IP, Redis, internal-only services. Then use Direct VPC egress / a connector — and immediately audit the NAT rules (see anti-patterns) |
+| **Multi-region** | one region + CDN serves most products | hard data-residency law; contractual RTO/RPO a region outage would break; a global latency-sensitive product (150 ms→50 ms class). Cheapest first step: **multi-region data backups**, not multi-region compute. Anti-pattern: 2× full capacity 24/7 "just in case" |
+| **Autoscaling policies** | the platform IS the autoscaler | only tune three knobs: **max-instances (the cost cap — set it day one)**, min-instances (cold-start SLA), concurrency (CPU- vs IO-bound). Writing HPA/ASG policies is a rung-3/4 activity |
+## The IaC floor
+No IaC (unreproducible click-ops) and full Terraform ceremony for one service (modules-of-modules, Terragrunt, three repos) **both lose**. The floor: everything creatable from the repo, via either (a) **one small Terraform/OpenTofu root module** — provider, service, DB, secrets, domain, ~100–200 lines, remote state in a bucket — or (b) **honest scripted CLI deploys** (`gcloud run deploy` / `az containerapp up` + checked-in YAML). Both are acceptable; **Terraform earns its keep at the second environment or second service**. No premature module abstraction. CI deploys from main; secrets live in the cloud secret manager, never in tfvars.
+## Anti-patterns
+- **Day-one Kubernetes for a 3-person team.** <4 engineers → don't; case studies of $6,000/mo → $89/mo after leaving K8s, plus months of lost engineering time. Even among enterprise platform teams, ~93% report persistent K8s pain.
+- **The idle cluster.** Median 8–13% CPU, ~70% of requested CPU/memory never used (CAST AI). If you have a cluster, measure packing, or you're funding this anti-pattern.
+- **NAT-gateway bill surprises.** $0.045/GB processed **including traffic to S3/ECR that would otherwise be free**; one misconfigured route = +$1,000/day (Geocodio). Fix: gateway VPC endpoints for S3/DynamoDB (free), interface endpoints for ECR; on GCP, Direct VPC egress + Private Google Access. Rule: the moment you create private subnets, audit what routes through NAT.
+- **Never-rightsized managed Postgres.** 61% of developers never rightsize; RDS at 5% CPU is the norm; rightsizing recovers 20–40%. Floor: start one size smaller than you think (db.t4g / e2-small class), watch 60–90 days, scale on evidence; storage autoscaling on; multi-AZ only when users would notice.
+- **PaaS-by-default** (App Engine-style): ~+50% cost for less flexibility.
+## The meta-tell (use this to settle every rung)
+If you cannot point to a **current, concrete requirement** — a real >15-min job, a real GPU/sidecar/operator need, real measured sustained utilization above the crossover, a real BYOL contract — that justifies a rung **above serverless containers**, you are over-engineering: drop to the default. If such a requirement exists and you parked it on the default rung anyway (a self-managed Kafka on Cloud Run, a 16-hour batch job on Lambda), you are under-engineering the platform choice: move that one component up — **and only that component**.
+## Consumes / produces
+- **Consumes** `PRODUCT-BRIEF.md` (scale expectations, traffic shape), the architecture ADR (`.planning/adr/` — deployment topology: how many deployables), and `TEST-STRATEGY.md` (CI needs: test containers, e2e environments). All optional — ask directly when absent.
+- **Produces** `.planning/INFRA-STRATEGY.md` — cloud, compute rung per component with promotion triggers, data layer per environment, observability floor, IaC floor, cost guardrails. Feeds `/gsd:cicd-strategy` (deploy targets, environments) and `plan-phase`.
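
The promotion triggers above the default rung, together with the meta-tell, can be read as a small decision function. The following is a minimal sketch under stated assumptions: `ComponentFacts`, `recommendRung`, and every field name are hypothetical, and the 0.4 threshold simply takes the lower end of the "40–50% with commitments" range quoted in the ladder table.

```typescript
// Sketch only: field names and the threshold are illustrative, not a gsd-core API.
type Rung = 2 | 3 | 4;

interface ComponentFacts {
  vmException: boolean;         // BYOL, special hardware, kernel/OS access, self-managed stateful system
  capabilityTrigger: boolean;   // GPU scheduling, sidecars/mesh, stateful sets, operators, custom CNI
  sustainedUtilization: number; // measured, 0..1
  hasCommitments: boolean;      // Savings Plans / RIs / CUDs in place
  teamSize: number;             // engineers
  hasPlatformOwner: boolean;
}

function recommendRung(c: ComponentFacts): { rung: Rung; reason: string } {
  if (c.vmException) {
    return { rung: 4, reason: "on the VM exception list" };
  }
  if (c.capabilityTrigger) {
    return { rung: 3, reason: "capability trigger" };
  }
  // Cost trigger: sustained load above the crossover AND someone to run the cluster.
  const staffed = c.teamSize >= 4 || c.hasPlatformOwner;
  if (c.hasCommitments && c.sustainedUtilization > 0.4 && staffed) {
    return { rung: 3, reason: "sustained utilization above the crossover" };
  }
  // Meta-tell: no current, concrete requirement means the default rung.
  return { rung: 2, reason: "no current, concrete requirement above the default" };
}
```

The function is meant to be called once per component, in line with the "move that one component up, and only that component" rule. Rungs 0 and 1 are left out to keep the sketch short.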

package/gsd-core/references/product-discovery.md CHANGED

@@ -5,12 +5,12 @@ Reference for `/gsd:discover-product`. An **optional** front-of-funnel step that
 ## When to run / when to skip
 - **Run** when value is **uncertain**: a new market/segment, no past-behavior evidence, a stakeholder asserting demand from a hypothetical, or a large/irreversible bet.
-- **Skip** (or jump straight to lightweight RICE prioritization) when a client/customer has explicit, evidenced requirements — then the open question is *sequence and cost*, not *whether*.
+- **Skip** (or jump straight to lightweight RICE prioritization) when a client/customer has explicit, evidenced requirements — money moved or real usage, *covering the candidate list*; LOIs alone don't qualify — then the open question is *sequence and cost*, not *whether*.
 ## Core principles
 - **Outcomes over outputs** (Perri): define the customer behavior/metric to change, not the feature to ship.
-- **Demand over interest** (YC): behavior, money, and "panic when it breaks" count; waitlists and "that's interesting" don't. Ask about the **past**, never hypotheticals.
+- **Demand over interest** (YC): **strong** = money moved / real usage / panic-when-it-breaks; **medium** = signed non-binding commitments (LOIs, unpaid pilots) — real but not yet demand, convert to strong or treat as open; **weak** = interest (waitlists, "that's interesting") — doesn't count. Ask about the **past**, never hypotheticals.
 - **Find the narrowest wedge:** the smallest version someone would pay for this week — the hair-on-fire segment.
 - **Frame the vision as an opportunity/outcome** (it must admit >1 solution) so it informs but doesn't over-constrain architecture.
 - **Cover the four risks** (Cagan): value, usability, feasibility, viability.
@@ -21,15 +21,15 @@ Reference for `/gsd:discover-product`. An **optional** front-of-funnel step that
 The first answer is polished — push 2–3 times with concrete specifics, not soft exploration. "Name the *actual* human, the *actual* consequence." Reflect the answer back; confirm before moving on. One thread at a time.
-## Distilled question set (ordered; skip any block already evidenced)
+## Distilled question set (ordered; skip a block only when its named outputs already exist at strong evidence — reflect the skipped conclusion back)
-0. **Frame:** what customer behavior/metric do we want to change (not a feature)? If we skipped discovery entirely, what assumption would we be betting the whole build on?
+0. **Frame:** what customer behavior/metric do we want to change (not a feature)? (The betting-the-build assumption is enumerated in block 4 — don't ask it twice.)
 1. **Job & user:** who *specifically* — and for whom is the problem most acute, frequent, expensive, unavoidable? State the job solution-free. Job story: *"When [situation], I want to [motivation], so I can [outcome]."* Capture 2–3 **measurable desired-outcome statements** for the job (direction + metric + object, e.g. "reduce the time to find an open class slot") — these are what "better" is measured against. Note if the job-population is heterogeneous (different segments → different outcomes; don't average them away).
-2. **Demand vs interest:** "Tell me about the *last time* you hit this." "What are you doing about it *today*?" "What does that workaround cost (time/money)?" "What *real* evidence exists — pre-pay, LOI, pilot, converted signups?" (Never "would you use X?")
+2. **Demand vs interest:** "Tell me about the *last time* you hit this." "What are you doing about it *today*?" "What does that workaround cost (time/money)?" "What do they use today instead — incl. spreadsheets or nothing — and why hasn't it won?" "What *real* evidence exists — pre-pay, pilot in use, converted signups, signed LOIs?" Mark strength: strong (money/usage) / medium (LOIs, unpaid pilots) / weak (interest). (Never "would you use X?")
 3. **Wedge & under-served outcome:** which single opportunity, solved, most moves the outcome? The narrowest version that fully solves it for one user? Can we imagine >1 solution? (If no — we smuggled in a solution; re-frame.)
-4. **Four risks** (only the unvalidated ones): **value** (evidence they'll choose this over the status quo), **usability** (where they'll get stuck), **feasibility** (riskiest technical unknown), **viability** (pricing/legal/sales/brand). First **enumerate the leap-of-faith assumptions** behind the chosen wedge (what must be true for it to work); then run the *cheapest test on the riskiest* one — not just a single test on the least-validated risk.
+4. **Four risks** (only the unvalidated ones): **value** (evidence they'll choose this over the status quo), **usability** (where they'll get stuck), **feasibility** (riskiest technical unknown), **viability** (pricing/legal/sales/brand). First **enumerate the leap-of-faith assumptions** behind the chosen wedge (what must be true for it to work); then **specify** the cheapest test for the *riskiest* one, not just a single test on the least-validated risk: pass/fail threshold, kill criterion, owner, by-when. Tests execute *after* the session, before building. Value is never "validated" on founder testimony alone; it needs customer-sourced evidence (a named customer's behavior, quote, or money).
 5. **Scope & prioritization:** end-to-end journey → the thin first slice (walking skeleton). RICE on the candidate list — Reach × Impact × Confidence ÷ Effort; table-stakes/dependencies legitimately override the score.
-6. **Success:** how will we know it worked (the outcome metric, by when)? What would make ≥40% of target users "very disappointed" to lose it? (Sean Ellis PMF proxy — necessary, not sufficient; survey only users who used the core.)
+6. **Success:** how will we know it worked (the outcome metric, by when)? Pre-register the Sean Ellis PMF criterion — the ≥40%-"very disappointed" survey to run once ≥N pilots have used the core (necessary, not sufficient; survey only users who used the core) — a planned measurement, never a founder prediction.
 ## Feature prioritization: rigor vs lightweight
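
The RICE formula in block 5 (Reach × Impact × Confidence ÷ Effort, with table-stakes allowed to override the score) might be sketched as below. The `Candidate` shape, the `tableStakes` flag, and both function names are assumptions made for illustration; the reference does not prescribe units or a data model.

```typescript
// Hypothetical data model for a RICE candidate list.
interface Candidate {
  name: string;
  reach: number;      // e.g. users affected per period
  impact: number;     // relative impact multiplier
  confidence: number; // 0..1
  effort: number;     // e.g. person-weeks; must be > 0
  tableStakes?: boolean; // dependency / table-stakes item that overrides the score
}

// Reach x Impact x Confidence / Effort
function riceScore(c: Candidate): number {
  return (c.reach * c.impact * c.confidence) / c.effort;
}

// Table-stakes items sort first; everything else is ordered by descending RICE score.
function prioritize(candidates: Candidate[]): Candidate[] {
  return [...candidates].sort((a, b) => {
    const aStakes = a.tableStakes === true;
    const bStakes = b.tableStakes === true;
    if (aStakes !== bStakes) return aStakes ? -1 : 1;
    return riceScore(b) - riceScore(a);
  });
}
```

Sorting table-stakes items ahead of scored ones is one possible reading of "legitimately override the score"; a team could just as well flag them and decide by hand.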

package/gsd-core/references/test-doubles.md ADDED

@@ -0,0 +1,88 @@
+# Test Doubles — Which Kind, Where, and What to Assert
+How-to reference for choosing the right *kind* of test double — dummy, stub, spy, mock, fake — and the narrow set of seams where any double is allowed at all. Read when a test needs a stand-in for a collaborator. Pairs with `test-strategy.md` (its "mock ONLY at architectural boundaries" rule answers *where*; this file answers *what kind* and *what to assert*) and `contract-testing.md`.
+## The taxonomy (Meszaros / Fowler)
+| Kind | What it is | Reach for it when | What you may assert |
+|---|---|---|---|
+| **Dummy** | A placeholder passed only to satisfy a signature; never exercised | a constructor demands a logger/metrics object the tested path never touches | nothing — if you're asserting on it, it isn't a dummy |
+| **Stub** | Canned answers to **queries**; no logic, no expectations | the test needs a specific input state ("the gateway returns 503", "the user lookup yields an admin") | **never assert ON a stub** — it feeds the test; it is not the subject |
+| **Spy** | Records the calls it receives (often a stub that also records) | verifying an outbound **notification** happened: email sent, event published, webhook fired | the recorded outbound call + its key payload fields, once |
+| **Mock** | Pre-programmed with expectations; fails the test itself when they're violated | same role as a spy, framework-flavored — keep rare; a spy + explicit assert reads better | outbound **commands** only |
+| **Fake** | A real, working, lightweight implementation — in-memory repo, local queue, fake clock | standing in for a **port** in sociable core tests (the default — see below) | normal state/outcome assertions; the fake *behaves*, so test it like the real thing |
+Two camps, one rule — Khorikov's CQS discipline:
+- **Queries** (return data, no external side effect) → **stub** them, and assert on the SUT's *output or state*, never on whether/how the stub was called.
+- **Commands** (side effects on the outside world) → **spy/mock** them; "the email was sent with this recipient" *is* the observable behavior, so interaction-verifying it is correct.
+Interaction-verifying a query is the classic brittle-test generator: the test re-states the implementation's call sequence and breaks on every refactor while the behavior is unchanged. State/outcome verification survives refactoring; that is the whole point of behavior-over-implementation.
+## Fake-at-ports — the default for a hexagonal/ports core
+"Unit-test the core" does not mean per-test `jest.mock` walls. For sociable tests of an application core, prefer **one in-memory fake per driven port**:
+1. The fake lives in test support and is reused by every test — not redefined inline per file.
+2. It implements the port *honestly* (stores and retrieves, rejects duplicates, honors ordering) so tests assert **outcomes** ("the vehicle is now retrievable"), not interactions ("`save` was called").
+3. **Verify the fake against the real adapter's contract:** write one shared contract suite and run it against *both* the fake and the real (Testcontainers-backed) adapter. A fake nobody verifies drifts from production behavior, and every core test quietly tests fiction.
+```ts
+// ports/vehicle-repo.ts — the port the core depends on
+export interface VehicleRepo {
+  save(v: Vehicle): Promise<void>;
+  findById(id: string): Promise<Vehicle | null>;
+}
+// test/support/fake-vehicle-repo.ts — honest in-memory fake
+export class FakeVehicleRepo implements VehicleRepo {
+  private rows = new Map<string, Vehicle>();
+  async save(v: Vehicle) { this.rows.set(v.id, structuredClone(v)); }
+  async findById(id: string) { return this.rows.get(id) ?? null; }
+}
+// test/support/vehicle-repo.contract.ts — ONE suite, run against BOTH impls
+export function vehicleRepoContract(make: () => Promise<VehicleRepo>) {
+  it('round-trips a saved vehicle', async () => {
+    const repo = await make();
+    await repo.save(vehicle({ id: 'v1', plate: 'ABC-123' }));
+    expect(await repo.findById('v1')).toMatchObject({ plate: 'ABC-123' });
+  });
+  it('returns null for an unknown id', async () => {
+    expect(await (await make()).findById('nope')).toBeNull();
+  });
+}
+// fake-vehicle-repo.test.ts:       vehicleRepoContract(async () => new FakeVehicleRepo());
+// pg-vehicle-repo.medium.test.ts:  vehicleRepoContract(async () => pgRepo(await getDb()));
+```
+Core tests then take the fake and assert state — `expect(await repo.findById('v1'))…` — and you can refactor the core's internals freely; nothing in the tests names them. Pure domain functions need no doubles at all; the fake enters only where the application core meets its ports.
+## The mockable-seam allow-list
+Doubles are legal ONLY at **unmanaged out-of-process dependencies** — systems another party owns or observes: 3rd-party APIs, payment gateways, email/SMS providers, a partner's message bus. Everything else:
+- **Your own DB / cache / filesystem (managed dependencies):** run them **real** (`test-containers.md`, `db-test-isolation.md`) — never mocked, including "for speed" (txn-rollback against a singleton container is in-memory fast).
+- **In-process collaborators:** never doubled — sociable tests use the real object.
+- **Ports of your core:** in-memory **fakes**, contract-verified as above.
+TEST-STRATEGY.md's per-subdomain table doubles as the project's seam **allow-list**: if a seam isn't declared there, a double on it is a review-blocking smell, not a style choice.
+## Choosing in five seconds
+1. Collaborator is in-process? → no double; sociable test.
+2. It's your DB/cache/fs? → real instance via Testcontainers.
+3. It's a port of the core? → in-memory **fake**, contract-verified.
+4. The test needs a canned *answer* from an unmanaged dependency? → **stub**; assert on the SUT, not the stub.
+5. The behavior under test *is* an outbound command/notification to an unmanaged dependency? → **spy** (or mock); assert the call once, with its discriminating payload fields.
+## Anti-patterns
+- `jest.mock('../service')` on an in-process collaborator — solitary/mockist tests; brittle, refactor-hostile.
+- Asserting a stub was called / called-with — interaction-verifying a query restates the implementation.
+- A fake that is never contract-verified against the real adapter (the fake drifts; tests stay green; prod breaks).
+- **Testing the mock:** configuring a double, then asserting the double's own canned behavior back at it.
+- Mocking the DB or repositories "so tests are fast" — see `db-test-isolation.md` for the fast *real* path.
+- Mixing kinds blindly because the framework calls everything `mock` — name the role (stub? spy? fake?) before writing it.
+*Sources: Meszaros "xUnit Test Patterns" (the taxonomy); Fowler "Mocks Aren't Stubs" / TestDouble bliki; Khorikov "Unit Testing Principles, Practices, and Patterns" (CQS rule, managed vs unmanaged dependencies); "Software Engineering at Google" ch. 13 (prefer real implementations and fakes over mocking).*
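
The CQS rule in this file (stub queries and assert on the SUT's outcome; spy on commands and assert the outbound call once) can be shown without any mocking framework. This is a minimal sketch: `UserDirectory`, `Mailer`, `grantAccess`, and `MailerSpy` are hypothetical names, and the interfaces are synchronous only to keep the example short.

```typescript
// Hypothetical unmanaged dependencies: one query, one command.
interface UserDirectory {
  isAdmin(userId: string): boolean; // query: returns data, no side effect
}
interface Mailer {
  send(to: string, subject: string): void; // command: side effect on the outside world
}

// System under test.
function grantAccess(
  userId: string,
  email: string,
  users: UserDirectory,
  mailer: Mailer
): "granted" | "denied" {
  if (!users.isAdmin(userId)) return "denied";
  mailer.send(email, "Access granted");
  return "granted";
}

// Stub: canned answer to a query. It feeds the test and is never asserted on.
const adminStub: UserDirectory = { isAdmin: () => true };

// Spy: records outbound commands so the test can verify the notification.
class MailerSpy implements Mailer {
  readonly sent: Array<{ to: string; subject: string }> = [];
  send(to: string, subject: string): void {
    this.sent.push({ to, subject });
  }
}

// Usage: assert the SUT's outcome and the recorded command, not the stub.
const mailer = new MailerSpy();
const outcome = grantAccess("u1", "admin@example.com", adminStub, mailer);
console.log(outcome, mailer.sent.length);
```

A test built this way would check `outcome` and the single entry in `mailer.sent`, and would never check whether or how `adminStub.isAdmin` was called.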

package/gsd-core/references/test-strategy.md CHANGED

@@ -1,6 +1,6 @@
 # Testing Strategy — Shape Follows Architecture
-Reference for `/gsd:testing-strategy`. Decides WHAT to test, at WHICH level, and HOW MUCH — matched to the architecture (from the ADR / SKELETON). **Extends, not replaces,** the project's existing rigor in `TESTING-STANDARDS.md` (real-code-only, no-vacuous-assertions, typed surface, `fast-check` property tests, Stryker mutation ≥80%). Recommends; the user decides.
+Reference for `/gsd:testing-strategy`. Decides WHAT to test, at WHICH level, and HOW MUCH — matched to the architecture (from the ADR / SKELETON). Where `TESTING-STANDARDS.md` exists, **extends — never replaces —** its rigor; where it's absent (greenfield), the strategy **defines** the baseline rules itself (real-code-only, no vacuous assertions, typed surface, `fast-check` property tests, Stryker mutation ≥80% on critical modules) and creates the file from them. Recommends; the user decides.
 ## Core principles (the consensus)
@@ -17,7 +17,7 @@ Read the architecture decision (`.planning/adr/*.md` or SKELETON). Per subdomain
 |---|---|---|
 | **Domain Model / rich core** | more **small (unit)** tests of the domain logic, through its public API | lots of pure logic, few dependencies — cheap and high-value to unit-test |
 | **Transaction Script / CRUD-over-DB** | more **medium (integration)** tests against a real DB | thin logic, DB-bound — little pure logic to isolate; confidence lives at the DB boundary |
-| **Hexagonal core** | unit-test the domain in isolation (no mocks — it's pure); integration-test the **adapters** against real systems | the architecture literally separates the two |
+| **Hexagonal core** | pure domain functions need no doubles; the application core is tested with **in-memory fakes at its ports** (contract-verified against the real adapters — see `test-doubles.md`); the **adapters** are integration-tested against real systems | the architecture literally separates the two |
 | **Many integrations / external APIs** | medium integration tests at the ports; **contract tests** where a 3rd-party can't be seeded | confidence is in the integration, not mock existence |
 | **Bought / off-the-shelf (Generic)** | thin integration smoke at your adapter seam only | don't test the vendor's internals — test your seam |
@@ -39,7 +39,7 @@ The quality benefit is real, but the evidence favors **small, uniform increments
 ## Persistent vs transient E2E
-- **Persistent:** a small **smoke / critical-user-journey** suite (auth, payment, core nav) in CI on every PR (<5 min). Keep it lean (≈50–200 well-chosen e2e tests) — e2e is slow/flaky; push coverage down to integration.
+- **Persistent:** a small **smoke / critical-user-journey** suite (auth, payment, core nav) of **3–7 journeys** in CI on every PR (<5 min). ≈50–200 well-chosen e2e tests is the cap on the **total** e2e portfolio across all tiers (PR smoke + staging regression + release/scheduled) — never the size of the PR gate, which stays at 3–7 journeys. E2e is slow/flaky; push coverage down to integration.
 - **Transient:** throwaway e2e to validate a freshly-built flow during the dev loop; **not** kept in CI. Once the behavior is covered more cheaply (integration), delete or demote it.
 ## Output (`TEST-STRATEGY.md`)
@@ -52,13 +52,14 @@ The quality benefit is real, but the evidence favors **small, uniform increments
 Feeds `add-tests`, `execute-phase`, and `plan-phase`.
-## Extends existing rigor (keep it all)
+## Extends existing rigor — or instates it
-`TESTING-STANDARDS.md` already enforces real-code-only, no-vacuous-assertions, the typed-surface mandate, `fast-check` property tests for bijective/invariant logic, and Stryker mutation ≥80% on critical modules. This skill adds the **strategic layer** on top — the shape, the what/what-not, the level-per-subdomain. Do not weaken any existing standard.
+Where `TESTING-STANDARDS.md` is present, it enforces baseline rigor — real-code-only, no-vacuous-assertions, the typed-surface mandate, `fast-check` property tests for bijective/invariant logic, Stryker mutation ≥80% on critical modules — keep all of it; do not weaken any existing standard. Where the file is **absent** (greenfield), those same rules are the defaults to instate: the strategy records them as the project's baseline standards (in TEST-STRATEGY.md's Notes) and offers to generate `TESTING-STANDARDS.md` from them. Either way, this skill adds the **strategic layer** on top — the shape, the what/what-not, the level-per-subdomain.
 ## Test-infrastructure how-to references (read when writing the tests)
 When the strategy calls for real-dependency integration tests, auth, or e2e, load the focused how-to reference:
+- `@~/.claude/gsd-core/references/test-doubles.md` — dummy/stub/spy/mock/fake taxonomy; fake-at-ports; never assert on stubs; the mockable-seam allow-list.
 - `@~/.claude/gsd-core/references/test-containers.md` — real DBs/services via Testcontainers (singleton pattern, pinned tags, CI/Ryuk).
 - `@~/.claude/gsd-core/references/db-test-isolation.md` — parallel-safe DB isolation (template DB, db/schema-per-worker, txn rollback).
 - `@~/.claude/gsd-core/references/auth-in-tests.md` — authenticate-once/storageState, token minting, multi-role, JWT vs cookie, one-account-per-worker.
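
The revised **Hexagonal core** row above (in-memory fakes at the ports, contract-verified against the real adapters) might look roughly like the following. This is a minimal sketch, not code from the package: `User`, `UserRepo`, `InMemoryUserRepo`, `registerUser`, and `userRepoContract` are all hypothetical names, and the port is synchronous only to keep the example short.

```typescript
// Hypothetical names throughout; nothing here is taken from gsd-core.
type User = { id: string; email: string };

// The port the application core depends on.
// Synchronous for brevity; a real port would likely return Promises.
interface UserRepo {
  save(user: User): void;
  findByEmail(email: string): User | undefined;
}

// In-memory fake: a working implementation of the port, not a stub or mock.
class InMemoryUserRepo implements UserRepo {
  private readonly byEmail = new Map<string, User>();

  save(user: User): void {
    this.byEmail.set(user.email, { ...user });
  }

  findByEmail(email: string): User | undefined {
    const user = this.byEmail.get(email);
    return user ? { ...user } : undefined;
  }
}

// Application-core use case: knows only the port, never the adapter.
function registerUser(repo: UserRepo, user: User): "created" | "duplicate" {
  if (repo.findByEmail(user.email)) return "duplicate";
  repo.save(user);
  return "created";
}

// Contract checks: the same function is meant to run against the fake
// and against the real adapter, so the fake cannot drift unnoticed.
function userRepoContract(makeRepo: () => UserRepo): string[] {
  const failures: string[] = [];
  const repo = makeRepo();
  if (repo.findByEmail("a@example.com") !== undefined) {
    failures.push("empty repo returned a user");
  }
  repo.save({ id: "1", email: "a@example.com" });
  if (repo.findByEmail("a@example.com")?.id !== "1") {
    failures.push("saved user was not found by email");
  }
  return failures;
}
```

In this sketch the use-case tests would pass `new InMemoryUserRepo()` to `registerUser`, while an integration suite would call `userRepoContract` with a factory for the real adapter, so both implementations answer to the same checks.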

package/gsd-core/templates/adr.md CHANGED

@@ -28,7 +28,27 @@
 - CD / monitoring / DevOps maturity in place: [yes/no]
 - Bounded contexts well-understood already: [yes/no]
-[If any "no" → Modular Monolith. If a specific component is extracted, record the Hard-Parts disintegrators that justified it.]
+[If any "no" → Modular Monolith. If a specific component is extracted, record the Hard-Parts disintegrators that justified it — extraction *now* requires current (not projected) pressure plus the CD/ops gate; otherwise it's a promotion trigger below.]
+**Tenancy (required when serving multiple customer orgs):** [single-tenant / shared schema + tenant-scoped RLS (default) / schema-per-tenant / DB-per-tenant] — [the contractual/regulatory isolation mandate that climbs above the default, or "none — default holds"]
+### Module map
+[Derived from DOMAIN-MODEL: modules = bounded contexts when mapped, else subdomain groupings. Polysemes resolved: each flagged term → one owning module per meaning, or "none flagged".]
+| Module | Owns (responsibility + schema) | Talks to | Via |
+|--------|--------------------------------|----------|-----|
+| [module] | [responsibility; its schema/tables] | [modules / 3rd-party behind ACL] | [sync call / event — mechanism: in-process / job queue / outbox] |
+[Pipeline-shaped modules: note the data shape (buffer/queue, backpressure, retention). Scheduled/recurring work: where it runs (cron / job queue / recompute-on-read).]
+### Promotion triggers
+The concrete, observable signals that would change a decision above — re-check at `/gsd:new-milestone`.
+| Component / axis | Observable condition | Response when it fires |
+|------------------|----------------------|------------------------|
+| [component, rung, or Axis B] | [e.g., sustained ingest > N/s today; a second team forms; an isolation mandate lands in a contract] | [e.g., extract via Strangler Fig + ACL; raise the rung; climb to schema-per-tenant] |
 ### Baseline note

package/gsd-core/templates/cicd-strategy.md ADDED

@@ -0,0 +1,72 @@
+# CI/CD Strategy — [PROJECT_TITLE]
+**Created:** [DATE] via `/gsd:cicd-strategy`
+**Basis:** `TEST-STRATEGY.md` (tiers + smoke list) · `INFRA-STRATEGY.md` / ADR (target cloud).
+## CI platform
+- **Chosen:** [GitHub Actions]
+- **Why:** [default — repo on GitHub, ecosystem + OIDC into the target cloud / OR the deliberate exception: VPC-isolated/regulated builds → Cloud Build/CodeBuild, or cheap compute behind GHA]
+- **Runners:** [hosted, until bill > free tier + low-hundreds $/mo → managed third-party before DIY; never self-hosted on public repos]
+## Auth (CI → cloud)
+- **Method:** [OIDC / Workload Identity Federation — zero long-lived cloud keys]
+- **`sub` condition (MANDATORY):** [pinned to `repo:ORG/REPO` + `environment:prod` / branch — never wildcard]
+- **Fallback (only where federation impossible):** [target + short-lived scoped secret, rotation cadence]
+## Secrets
+| Secret | Lives where | Injected how |
+|--------|-------------|--------------|
+| Cloud deploy creds | nowhere — OIDC mints per job | short-lived token |
+| [CI-scoped token, e.g. SaaS API] | CI platform secrets (only because OIDC unavailable) | env at job start |
+| [app secrets — DB url, API keys] | cloud secret manager | runtime injection — never in images, never committed `.env` |
+## Pipeline map (tiers → stages)
+| Stage | What runs | Time budget |
+|-------|-----------|-------------|
+| PR gate | lint + types + small (unit) + fast medium + smoke e2e: [N flows from TEST-STRATEGY] | **≤10 min** |
+| Merge to main | full medium + e2e subset vs [preview/ephemeral env] | [~X min] |
+| Nightly / pre-release | full e2e portfolio + long suites + mutation run (Stryker on [targets]) | unbounded |
+- **Merge queue:** [off — trigger: ~tens of merges/day / stale-base failures routine]
+## Flaky-test policy
+- PR-gate tests hold <1% flake rate; flakes **quarantined from the gate, kept running post-merge**, fix SLA: [N days].
+- Differentiated retries for diagnosis only. **No blanket retry-until-green.**
+## Deployment ladder
+- **Rung:** [solo/low blast radius — trunk-based + PR previews + one-command rollback; NO staging]
+- **Blast-radius additions:** [feature flags internal-first / revertable expand-contract schema changes / blue-keep-alive window — or n/a]
+- **Promotion triggers:** [staging-thin: risky migration to rehearse · canary analysis: ~a dozen trustworthy SLIs + traffic signal on 1–5% slice · merge queue: see above]
+- **Invariants:** build once, promote the same digest-pinned artifact; one-command rollback; config attaches at release.
+## Supply-chain checklist (the free six)
+- [ ] SHA-pin all actions + Dependabot updating pins
+- [ ] Committed lockfile + `npm ci`
+- [ ] Top-level read-only `permissions:` / read-only `GITHUB_TOKEN` default
+- [ ] OIDC — zero long-lived cloud keys in CI
+- [ ] Push protection + secret scanning; no `.env` in repo
+- [ ] Branch ruleset on main: PR + status checks, no force-push
+Never: `pull_request_target` + untrusted checkout · self-hosted runners on public repos.
+## Anti-patterns acknowledged
+- [long-lived keys in CI · secrets in images/.env · per-env artifacts · heavy e2e PR gate · retry-until-green · tag-pinned actions · wildcard OIDC `sub` · staging-as-safety-strategy — see reference]
+## Deferred (with triggers)
+- [SLSA L3 — when artifacts cross trust boundaries · cosign — when there's a verifier · SBOM program — when a customer/regulator asks · canary analysis — SLIs + traffic · staging — migration rehearsal]
+## Handoff notes for plan-phase
+- [CI workflow files to create, the OIDC role/WIF pool to provision, secret-manager entries, preview-env wiring, which phase owns each]
+---
+*CI/CD strategy. Consumed by `/gsd:plan-phase` (CI/deploy phases plan against it).*
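
The template's mandatory `sub` condition (pinned to repo plus environment or branch, never wildcard) could be linted with a check along these lines. This is a sketch under assumptions: it presumes GitHub Actions as the CI platform (the template's default) and its `repo:ORG/REPO:environment:NAME` / `repo:ORG/REPO:ref:refs/heads/BRANCH` subject format, and `isPinnedSubCondition` is a hypothetical helper, not part of the package.

```typescript
// Hypothetical helper: returns true only for a fully pinned OIDC subject
// condition in the GitHub Actions `sub` claim format.
function isPinnedSubCondition(condition: string): boolean {
  // Any wildcard widens the trust policy beyond one repo + env/branch.
  if (condition.includes("*")) return false;

  // repo:ORG/REPO followed by either an environment or a branch ref.
  const pinned =
    /^repo:[\w.-]+\/[\w.-]+:(environment:[\w.-]+|ref:refs\/heads\/[\w.\/-]+)$/;
  return pinned.test(condition);
}

// Example conditions (illustrative org/repo names).
const examples: Array<[string, boolean]> = [
  ["repo:acme/app:environment:prod", isPinnedSubCondition("repo:acme/app:environment:prod")],
  ["repo:acme/app:ref:refs/heads/main", isPinnedSubCondition("repo:acme/app:ref:refs/heads/main")],
  ["repo:acme/*", isPinnedSubCondition("repo:acme/*")],
  ["repo:acme/app:*", isPinnedSubCondition("repo:acme/app:*")],
];
```

A check like this would sit in review tooling for the cloud-side trust policy. The cloud provider still enforces the condition itself, so the lint only catches a wildcard before it is deployed.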

package/gsd-core/templates/domain-model.md CHANGED

@@ -19,7 +19,9 @@ Strategic classification — drives where to invest and (downstream) the archite
 | Subdomain | Type | Description | Rationale (why this type) | Complexity |
 |-----------|------|-------------|---------------------------|------------|
-| [Name] | Core / Supporting / Generic | [What it does] | [Why core / supporting / generic] | low / medium / high |
+| [Name] | Core / Supporting / Generic | [What it does] | [Why this type + fired complexity signals] | low / medium / high (derived) |
+> Complexity is derived, not asked — rated against the 5-signal rubric (invariants, lifecycle/state, derivation/tradeoffs, temporal logic, policy variance); fired signals go in the rationale. Core+low is a contradiction.
 **Core** — differentiating *and* complex; build in-house, invest:
 - [Subdomain] — [why it's the competitive edge]
@@ -34,7 +36,7 @@ Strategic classification — drives where to invest and (downstream) the archite
 ## Bounded Contexts
-[Filled when event storming was run; otherwise: "Deferred — single context assumed; planning will refine if boundaries emerge."]
+[Filled when event storming was run; otherwise: "Deferred — single context assumed; planning will refine if boundaries emerge." Exception: a flagged polyseme or third-party/legacy upstream gets its candidate contexts and seam (default ACL) recorded here, marked "candidate — refine in planning" — never "single context assumed" beside a flagged boundary.]
 | Context | Owns (responsibility) | Key domain events | Talks to | Language boundary |
 |---------|----------------------|-------------------|----------|-------------------|

package/gsd-core/templates/infra-strategy.md ADDED

@@ -0,0 +1,77 @@
+# Infrastructure Strategy — [PROJECT_TITLE]
+**Created:** [DATE] via `/gsd:infrastructure-strategy`
+**Basis:** architecture decision [ADR-NNNN] · PRODUCT-BRIEF scale expectations · TEST-STRATEGY CI needs.
+## Cloud
+**[GCP | AWS | Azure]** — [why: existing constraint / team familiarity / $0-idle dev story]. Inputs: traffic shape [idle/bursty/steady], team size [N], expected compute spend [~$N/mo].
+## Compute (rung per component — default is serverless containers)
+Every rung above the default must name the **current, concrete trigger** that justifies it.
+| Component | Rung | Service | Trigger justifying anything above the default | Promotion trigger (→ next rung) |
+|-----------|------|---------|-----------------------------------------------|---------------------------------|
+| [api] | Serverless containers (default) | [Cloud Run / ECS+Fargate / Container Apps] | — | [sustained >40–50% busy w/ commitments AND ≥4 eng, or GPU/sidecar/operator need → K8s (Autopilot first)] |
+| [webhooks/cron] | FaaS | [Lambda / CR functions / Functions] | [event-glue, spiky low volume] | [>15-min runs, WebSockets, conn pools, ~>15M invocations/mo → containers] |
+| [frontend] | Static/edge | [CDN/Pages] | [pre-renderable] | [server-side state/runtime needed → up] |
+**Knobs (day one):** max-instances cap = [N] per service (the cost ceiling); min-instances = [0 | 1 (~$50–65/mo, only if cold-start SLA demands it)]; concurrency = [default 80 / tuned].
+## Data layer (per environment)
+Detail and pooling rules: see `references/data-environments.md`. **Pooling is mandatory** wherever serverless compute talks to Postgres.
+| Env | Database | Size / tier | Notes |
+|-----|----------|-------------|-------|
+| dev | [serverless PG / scale-to-zero] | [free tier] | [$0-idle where the cloud allows] |
+| preview | [branch / per-PR] | — | [only if platform makes it cheap] |
+| staging | [managed PG] | [one notch below prod] | prod-shaped |
+| prod | [Cloud SQL / RDS / PG Flexible] | [start one size smaller than instinct] | storage autoscaling on; multi-AZ [yes/no + why] |
+**Crossover-watch metric:** [the number that flips provisioned ↔ serverless PG — e.g. sustained connections / compute-hours/mo] — review at [60–90 days].
+## Environments
+[dev → staging → prod] · dev is [scale-to-zero $0-idle | low-cost idle (AWS)] · previews: [per-PR revisions | none yet].
+## Secrets
+[Secret Manager / Secrets Manager / Key Vault] — injected at deploy, never in tfvars or repo. (See `references/data-environments.md` for per-env secret handling.)
+## Observability floor (day one)
+- [ ] Structured JSON logs to stdout → platform log service
+- [ ] Error tracking: [Sentry free tier / GlitchTip]
+- [ ] External uptime check on [endpoint]
+- [ ] Golden-signals dashboard (rate, errors, p50/p95/p99, instances)
+- [ ] Alerts (3–5 max): error-rate spike · p95 > [N ms] · uptime fail · **billing alert** · [queue depth]
+**Deferred until >3 services in a request path:** distributed tracing/OTel, SLO/error-budget machinery, self-hosted metrics stack, paid log aggregation.
+## IaC
+[One Terraform/OpenTofu root module (~100–200 lines, remote state in [bucket]) | scripted CLI deploys (`gcloud run deploy` / `az containerapp up`) + checked-in config]. Terraform earns its keep at the **second environment or second service** — [current status]. CI deploys from main.
+## Cost guardrails
+- Billing budget alert at **[$N/mo]** (warn at [50/90/100]%)
+- max-instances caps set on every service (above)
+- [NAT audit rule: the moment private subnets exist, audit what routes through NAT — gateway endpoints for S3/DynamoDB]
+- DB rightsizing review at [60–90 days]
+## NOT decided / deferred
+- [Multi-region — trigger: data-residency law / contractual RTO-RPO / global latency need. First step would be multi-region backups, not compute]
+- [Load balancer / WAF — trigger: path-routing across services, WAF need, private exposure]
+- [VPC — trigger: first private resource (DB private IP, Redis)]
+- [Kubernetes — trigger recorded in the compute table]
+## Handoff notes
+- **For `/gsd:cicd-strategy`:** deploy targets = [services above]; environments = [map above]; registry = [Artifact Registry / ECR / ACR]; IaC entry point = [path]; secrets source = [manager above].
+- **For `plan-phase`:** infra setup tasks = [IaC floor, billing alert, uptime check, Sentry wiring]; anything in NOT-decided needs no work until its trigger fires.
+---
+*Infrastructure strategy. Consumed by `/gsd:cicd-strategy` and `/gsd:plan-phase`.*
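
The compute table's promotion triggers are meant to be concrete and observable. As one illustration, the FaaS row's triggers (>15-min runs, WebSockets, connection pools, roughly >15M invocations/mo) could be evaluated as below. This is a sketch only: the `FaasUsage` shape and `firedFaasTriggers` are hypothetical, and the thresholds are simply the template's placeholder numbers.

```typescript
// Hypothetical usage snapshot for one FaaS component.
interface FaasUsage {
  maxRunMinutes: number;
  needsWebSockets: boolean;
  needsConnectionPool: boolean;
  invocationsPerMonth: number;
}

// Returns the triggers that have fired; an empty list means the
// component stays on its current rung.
function firedFaasTriggers(usage: FaasUsage): string[] {
  const fired: string[] = [];
  if (usage.maxRunMinutes > 15) fired.push("runs longer than 15 minutes");
  if (usage.needsWebSockets) fired.push("WebSockets required");
  if (usage.needsConnectionPool) fired.push("connection pooling required");
  if (usage.invocationsPerMonth > 15_000_000) {
    fired.push("more than ~15M invocations per month");
  }
  return fired;
}

// Illustrative inputs: a quiet webhook handler and a busy one.
const quiet = firedFaasTriggers({
  maxRunMinutes: 2,
  needsWebSockets: false,
  needsConnectionPool: false,
  invocationsPerMonth: 300_000,
});
const busy = firedFaasTriggers({
  maxRunMinutes: 20,
  needsWebSockets: false,
  needsConnectionPool: true,
  invocationsPerMonth: 18_000_000,
});
```

Here `quiet` comes back empty, so nothing moves, while `busy` lists three fired triggers that would justify promoting the component to containers.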

package/gsd-core/templates/product-brief.md CHANGED

@@ -25,10 +25,12 @@ State 2–3 as *direction + metric + object* so "is it working?" is answerable.
 | Signal | Evidence | Strength |
 |--------|----------|----------|
-| Current workaround | [what they do today + its cost in time/money] | strong / weak |
-| Behavioral / money | [pre-pay, LOI, pilot, converted signups, "calls when it breaks"] | strong / weak |
+| Workaround / alternatives | [what they use today instead — incl. spreadsheets/nothing — its cost, why it hasn't won] | strong / medium / weak |
+| Money / usage | [pre-pay, paying client, real usage, "calls when it breaks"] | strong |
+| Signed commitments | [non-binding LOIs, unpaid pilots] | medium — not yet demand |
+| Interest | [waitlists, likes, "sounds useful"] | weak |
-[If evidence is all "interest" (waitlists, "sounds useful"), flag it — demand is unproven.]
+[All-interest = demand unproven. Medium at best → convert to strong or treat demand as open.]
 ## The wedge
@@ -38,7 +40,7 @@ State 2–3 as *direction + metric + object* so "is it working?" is answerable.
 | Risk | Status | Evidence / cheapest test before building |
 |------|--------|------------------------------------------|
-| Value (will they choose it over the status quo?) | validated / open | |
+| Value (will they choose it over the status quo?) | validated / open | [validated needs customer-sourced evidence (named behavior/quote/money) — never founder testimony alone] |
 | Usability (can they figure it out?) | validated / open | |
 | Feasibility (can we build it — riskiest unknown?) | validated / open | |
 | Viability (pricing/legal/sales/brand?) | validated / open | |
@@ -52,7 +54,7 @@ State 2–3 as *direction + metric + object* so "is it working?" is answerable.
 ## Success
 - **Outcome metric + by when:** [the behavior/metric change and the date]
-- **PMF check:** [what would make ≥40% of core users "very disappointed" to lose this]
+- **PMF check (pre-registered):** [Sean Ellis criterion (≥40% "very disappointed") to survey once ≥N pilots used the core — planned measurement, not founder prediction]
 ## Handoff notes
@@ -62,9 +64,9 @@ State 2–3 as *direction + metric + object* so "is it working?" is answerable.
 The brief is a hypothesis. List the assumptions that must hold for the wedge to work, riskiest first, each with the cheapest next test — revisit after a handful of customer conversations (continuous discovery, not a gate).
-| Leap-of-faith assumption | Riskiest? | Cheapest next test |
-|---|---|---|
-| [what must be true] | [yes / no] | [the test] |
+| Leap-of-faith assumption | Riskiest? | Cheapest next test (threshold · owner · by-when) | Kill criterion |
+|---|---|---|---|
+| [what must be true] | [yes / no] | [test + threshold, owner, by-when — runs before building] | [what result kills the wedge] |
 ---
 *Product brief. Next: `/gsd:new-project` (if not done) → `/gsd:model-domain` → `/gsd:recommend-architecture`.*

package/gsd-core/templates/test-strategy.md CHANGED

@@ -37,6 +37,14 @@ The shape is an *output* of the architecture, not a chosen target. Sociable test
 - Coverage = **floor**, not a target.
 - Mutation testing (Stryker) on: [critical modules — e.g. the pricing engine, money math].
+## CI execution map (feeds `/gsd:cicd-strategy`)
+| Pipeline stage | Runs | Budget |
+|---|---|---|
+| PR gate (blocking) | small + fast medium + 3–7 e2e smoke [+ changed-files mutation if fast] | ≤10 min |
+| Merge to main | full medium + e2e portfolio subset | — |
+| Nightly / scheduled | full e2e portfolio + full mutation run [+ vendor sandbox smoke, non-blocking] | — |
 ## TDD stance
 - Behavior-level tests, **small uniform increments**, regression floor, real RED step.

package/gsd-core/workflows/add-tests.md CHANGED

@@ -6,6 +6,8 @@ Users currently hand-craft `/gsd:quick` prompts for test generation after each p
 <required_reading>
 Read all files referenced by the invoking prompt's execution_context before starting.
+@~/.claude/gsd-core/references/ai-test-quality.md — the quality contract for AI-written tests (behavior inventory, forbidden patterns, assertion rules, falsifiability gate). Binding for every test generated here.
 </required_reading>
 <process>
@@ -176,8 +178,8 @@ For each approved file, create a detailed test plan.
 **For TDD files**, plan tests following RED-GREEN-REFACTOR:
 1. Identify testable functions/methods in the file
-2. For each function: list input scenarios, expected outputs, edge cases
-3. Note: since code already exists, tests may pass immediately — that's OK, but verify they test the RIGHT behavior
+2. For each function, build the **behavior inventory** (`ai-test-quality.md` §A): happy paths, boundary values, every error/rejection path, illegal states/transitions. Tests map 1:1 to inventory rows; rework any plan that is ≥80% happy-path rows before presenting it
+3. Note: since code already exists, tests may pass on first run — a first-run-green test is **unverified** until the falsifiability gate (`ai-test-quality.md` §D, applied at generation) proves it can fail
 **For E2E files**, plan tests following RED-GREEN gates:
 1. Identify user scenarios from CONTEXT.md/VERIFICATION.md
@@ -226,13 +228,15 @@ For each approved TDD test:
    // Assert — verify the output matches expectations
    ```
+   Before running, sweep for forbidden patterns (`ai-test-quality.md` §B/§C): no vacuous sole assertions (`toBeDefined`/`toBeTruthy`/`expect(true)`), no `jest.mock`/`vi.mock` off the strategy's seam allow-list, no `toHaveBeenCalled*` on queries, no `.skip`/`sleep(`; every test has ≥1 specific-value assertion.
 3. **Run the test**:
    ```bash
    {discovered test command}
    ```
 4. **Evaluate result:**
-   - **Test passes**: Good — the implementation satisfies the test. Verify the test checks meaningful behavior (not just that it compiles).
+   - **Test passes**: not done yet — apply the **falsifiability gate** (`ai-test-quality.md` §D): temporarily mutate the SUT (flip a branch condition, drop the write, return a constant), re-run and watch the test fail, then revert the mutation and watch it pass. A test that cannot be made to fail by breaking the implementation is testing nothing — rewrite or delete it. Where the strategy sets a mutation floor, run Stryker incrementally on the changed files and kill (or explicitly waive) surviving mutants (§E).
    - **Test fails with assertion error**: This may be a genuine bug discovered by the test. Flag it:
      ```
      ⚠️ Potential bug found: {test name}
@@ -350,6 +354,7 @@ Present next steps:
 - [ ] TDD tests generated with arrange/act/assert structure
 - [ ] E2E tests generated targeting user scenarios
 - [ ] All tests executed — no untested tests marked as passing
+- [ ] Falsifiability gate applied — every first-run-green test shown able to fail (mutate → red → revert → green)
 - [ ] Bugs discovered by tests flagged (not fixed)
 - [ ] Test files committed with proper message
 - [ ] Coverage gaps documented
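
The falsifiability gate added in this workflow (mutate, watch the test go red, revert, watch it go green) can be modelled in a few lines. This is a sketch under assumptions: the discount rule, `applyDiscount`, `mutatedDiscount`, and `passesFalsifiabilityGate` are hypothetical, and a separate mutant function stands in for the temporary in-place edit of the SUT that the workflow describes.

```typescript
type Discount = (total: number) => number;

// SUT (hypothetical rule): 10% off orders of 100 or more.
const applyDiscount: Discount = (total) =>
  total >= 100 ? total - total / 10 : total;

// Mutant: the branch condition is flipped, as the gate suggests.
const mutatedDiscount: Discount = (total) =>
  total < 100 ? total - total / 10 : total;

// A test is modelled as a function that returns true when it passes.
type Test = (sut: Discount) => boolean;

// Specific-value assertions on both sides of the boundary.
const specificTest: Test = (sut) => sut(100) === 90 && sut(99) === 99;

// Vacuous assertion: only checks that something came back.
const vacuousTest: Test = (sut) => sut(100) !== undefined;

// Gate: green on the real SUT, red on the mutant, green again once
// the mutation is reverted.
function passesFalsifiabilityGate(
  test: Test,
  real: Discount,
  mutant: Discount,
): boolean {
  const greenBefore = test(real);
  const redOnMutant = !test(mutant);
  const greenAfterRevert = test(real);
  return greenBefore && redOnMutant && greenAfterRevert;
}

const specificSurvives = passesFalsifiabilityGate(specificTest, applyDiscount, mutatedDiscount);
const vacuousSurvives = passesFalsifiabilityGate(vacuousTest, applyDiscount, mutatedDiscount);
```

The specific-value test passes the gate because the flipped branch changes `sut(100)` from 90 to 100. The vacuous test stays green against the mutant, which is the signal to rewrite or delete it.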