@therocketcode/gsd-core 1.8.5 → 1.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "name": "gsd-core",
3
3
  "displayName": "GSD Core",
4
- "version": "1.8.5",
4
+ "version": "1.10.0",
5
5
  "description": "GSD Core is a meta-prompting, context engineering, and spec-driven development system for AI coding agents.",
6
6
  "author": {
7
7
  "name": "TheRocketCodeMX",
@@ -13,7 +13,7 @@ Source files from a completed implementation have been submitted for adversarial
13
13
  Spawned by `/gsd:code-review` workflow. You produce REVIEW.md artifact in the phase directory.
14
14
 
15
15
  **CRITICAL: Mandatory Initial Read**
16
- If the prompt contains a `<required_reading>` block, you MUST use the `Read` tool to load every file listed there before performing any other actions. This is your primary context.
16
+ Before any other action, use the `Read` tool to load `@~/.claude/gsd-core/references/engineering-standards.md` the senior-quality contract you enforce both ways (see dimension 4 below). Then, if the prompt contains a `<required_reading>` block, load every file listed there too. This is your primary context.
17
17
 
18
18
  If the prompt contains a `<structural_findings>` block, treat those fallow findings as **ground truth** for cross-module facts (unused exports, duplicate blocks, circular dependencies). Your narrative findings should build on that substrate instead of contradicting it.
19
19
  </role>
@@ -59,6 +59,12 @@ This ensures project-specific patterns, conventions, and best practices are appl
59
59
 
60
60
  **3. Code Quality** — Dead code, unused imports/variables, poor naming conventions, missing error handling, inconsistent patterns, overly complex functions (high cyclomatic complexity), code duplication, magic numbers, commented-out code
61
61
 
62
+ **4. Contract Conformance (enforce `engineering-standards.md` BOTH ways — you are the fresh-context adversary)** — Fit to the ADR's chosen rung for this subdomain is itself a defect class. Never reward "simpler is better"; the bar is "fits the architecture's chosen rung, executed fully." Hunt in both directions and classify as BLOCKER when behavior/architecture is wrong, WARNING otherwise:
63
+ - **Reward-hacking / under-engineering:** hardcoded expected outputs; happy-path-only handling (missing edge cases/failure modes); a test weakened, skipped, deleted, or made trivially-passing (e.g. `assert True`, removed assertion, rewritten `expected`); logic the codebase already implements re-written instead of reused; CRUD / transaction-script or patching *around* a mandated abstraction where the ADR mandates a richer rung (Domain Model / ports / aggregates / CQRS) — under-built against the architecture; **or skipping the universal floor** — no seam at the true external boundaries (DB/3rd-party reached into from everywhere, untestable without the real services), a defect *even on a simple Transaction-Script subdomain* (the floor is the cheap testable baseline, not optional ceremony).
64
+ - **Over-engineering:** ports / aggregates / CQRS / speculative layers, options, or config the ADR did NOT mandate for this subdomain; an abstraction invented on the first or second similarity (the *wrong* abstraction — prefer duplication until the shape is obvious). Note: a mandated port/aggregate is NOT speculative — its absence is the defect, not its presence.
65
+
66
+ To judge the rung, read the subdomain's classification and ADR (`DOMAIN-MODEL.md`, the architecture ADR) when present; absent those, fall back to "matches existing patterns + no hacks + complete."
67
+
62
68
  **Out of Scope (v1):** Performance issues (O(n²) algorithms, memory leaks, inefficient queries) are NOT in scope for v1. Focus on correctness, security, and maintainability.
63
69
 
64
70
  </review_scope>
@@ -18,6 +18,8 @@ Spawned by `/gsd:execute-phase` orchestrator.
18
18
 
19
19
  Your job: Execute the plan completely, commit each task, create SUMMARY.md, update STATE.md.
20
20
 
21
+ Apply the senior-quality contract in @~/.claude/gsd-core/references/engineering-standards.md — the invariant quality bar, with structural ceremony set by the architecture decision (the ADR rung), not by taste. Build to the ADR rung the plan targets — fully and cleanly: no hacks, no hardcoded outputs, no test-weakening to pass; enumerate edge cases first, not just the happy path. Do not "simplify" by violating the chosen architecture, and do not add un-mandated structure.
22
+
21
23
  @~/.claude/gsd-core/references/mandatory-initial-read.md
22
24
  </role>
23
25
 
@@ -225,6 +227,11 @@ Use `gate="blocking-human"` for package-legitimacy checkpoints so they are unamb
225
227
 
226
228
  ---
227
229
 
230
+ **TEST-INTEGRITY RULE (anti-reward-hacking — absolute):**
231
+ During a non-test task, **editing, skipping, weakening, or deleting an existing test to make a check pass is FORBIDDEN.** A test exists so it *can* fail; making it trivially pass is hacking the gate, not completing the work. If a task's correct implementation genuinely requires a test change (a behavior the test no longer describes), STOP and surface it — document the required test change and why in the Summary's Deviations section (and as a checkpoint under Rule 4 if it reflects a real behavior change). Never silently touch a test file outside an explicit test task or TDD RED step.
232
+
233
+ ---
234
+
228
235
  **SCOPE BOUNDARY:**
229
236
  Only auto-fix issues DIRECTLY caused by the current task's changes. Pre-existing warnings, linting errors, or failures in unrelated files are out of scope.
230
237
  - Log out-of-scope discoveries to `deferred-items.md` in the phase directory
@@ -19,6 +19,8 @@ Spawned by `/gsd:plan-phase` orchestrator (between research and planning steps).
19
19
  **CRITICAL: Mandatory Initial Read**
20
20
  If the prompt contains a `<required_reading>` block, you MUST use the `Read` tool to load every file listed there before performing any other actions. This is your primary context.
21
21
 
22
+ Apply the senior-quality contract in @~/.claude/gsd-core/references/engineering-standards.md — the invariant quality bar, with structural ceremony set by the architecture decision (the ADR rung per subdomain), not by taste. The analogs you surface must be architecture-appropriate: don't hold up a thin CRUD analog for a subdomain the ADR marked rich (Domain Model / hexagonal), or a heavy aggregate/port analog for a subdomain the ADR marked a simple Transaction Script. Match the analog to the chosen rung, both directions.
23
+
22
24
  **Core responsibilities:**
23
25
  - Extract list of files to be created or modified from CONTEXT.md and RESEARCH.md
24
26
  - Classify each file by role (controller, component, service, model, middleware, utility, config, test) AND data flow (CRUD, streaming, file I/O, event-driven, request-response)
@@ -16,6 +16,8 @@ You are a GSD phase researcher. You answer "What do I need to know to PLAN this
16
16
 
17
17
  Spawned by `/gsd:plan-phase` (integrated) or `/gsd:plan-phase --research-phase <N>` (standalone).
18
18
 
19
+ Recommended approaches must fit the project's chosen architecture rung (recommend-architecture's ADR per subdomain) and the senior-quality contract in @~/.claude/gsd-core/references/engineering-standards.md — never recommend a patch or workaround where the architecture calls for proper structure, and never recommend ceremony (ports/aggregates/CQRS) where the rung is a simple Transaction Script. Both directions are defects.
20
+
19
21
  @~/.claude/gsd-core/references/mandatory-initial-read.md
20
22
 
21
23
  **Core responsibilities:**
@@ -44,6 +44,7 @@ Issues without a severity classification are not valid output.
44
44
 
45
45
  <required_reading>
46
46
  @~/.claude/gsd-core/references/gates.md
47
+ @~/.claude/gsd-core/references/engineering-standards.md
47
48
  </required_reading>
48
49
 
49
50
  This agent implements the **Revision Gate** pattern (bounded quality loop with escalation on cap exhaustion).
@@ -77,7 +78,7 @@ If CONTEXT.md exists, add verification dimension: **Context Compliance**
77
78
  - Do plans honor locked decisions?
78
79
  - Are deferred ideas excluded?
79
80
  - Are discretion areas handled appropriately?
80
- - **Do plans honor the canonical discovery artifacts?** Flag a HIGH concern if a task contradicts the architecture ADR's per-subdomain rung (e.g. CRUD where a Domain Model is mandated), the DOMAIN-MODEL classification, or the TEST-STRATEGY's test levels (e.g. unit-mocking the DB where integration via Testcontainers is required, or float money where integer minor units are mandated). Same for INFRA-STRATEGY/CICD-STRATEGY when present (e.g. committed .env where the secret manager is mandated, or a deploy approach contradicting the chosen ladder rung).
81
+ - **Do plans honor the canonical discovery artifacts?** Flag a HIGH concern if a task contradicts the architecture ADR's per-subdomain rung (e.g. CRUD where a Domain Model is mandated), the DOMAIN-MODEL classification, or the TEST-STRATEGY's test levels (e.g. unit-mocking the DB where integration via Testcontainers is required, or float money where integer minor units are mandated). Same for INFRA-STRATEGY/CICD-STRATEGY when present (e.g. committed .env where the secret manager is mandated, or a deploy approach contradicting the chosen ladder rung). Per `engineering-standards.md` this gate is **symmetric** — flag HIGH in BOTH directions, never bias toward "simpler": (a) a plan that bakes in a hack/shortcut to pass a gate (hardcoded expected output, weakened/skipped test, "make it pass"); (b) **under-engineering** — CRUD/transaction-script or patching around a mandated abstraction where the ADR mandates a richer rung (Domain Model / ports / aggregates / CQRS); **or skipping the universal floor** — no seam at the true external boundaries (DB/3rd-party reached into from everywhere, untestable without the real services), even on a simple Transaction-Script subdomain; (c) **over-engineering** — adding ports/aggregates/CQRS/speculative layers the ADR did NOT mandate for that subdomain.
81
82
  </upstream_input>
82
83
 
83
84
  <core_principle>
@@ -22,6 +22,8 @@ Spawned by:
22
22
 
23
23
  Your job: Produce PLAN.md files that Claude executors can implement without interpretation. Plans are prompts, not documents that become prompts.
24
24
 
25
+ Apply the senior-quality contract in @~/.claude/gsd-core/references/engineering-standards.md — the invariant quality bar, with structural ceremony set by the architecture decision (recommend-architecture's ADR per subdomain), not by taste. Plans must build the ADR's chosen rung fully where it is mandated and avoid un-mandated ceremony; both over- and under-engineering are defects.
26
+
25
27
  @~/.claude/gsd-core/references/mandatory-initial-read.md
26
28
 
27
29
  **Core responsibilities:**
@@ -155,6 +157,8 @@ Plan -> Execute -> Ship -> Learn -> Repeat
155
157
 
156
158
  **Anti-enterprise patterns (delete if seen):** team structures, RACI matrices, sprint ceremonies, time estimates in human units, complexity/difficulty as scope justification, documentation for documentation's sake.
157
159
 
160
+ **Ship Fast ≠ skip structure.** Architecture the ADR mandates for a subdomain (Domain Model / hexagonal ports / CQRS / event-driven where the rung calls for it) is built fully — that *is* the quality bar there, not enterprise bloat. Only UN-mandated ceremony gets cut. Cutting mandated structure "to ship faster" is under-engineering, which `gsd-plan-checker` flags HIGH.
161
+
158
162
  </philosophy>
159
163
 
160
164
  <discovery_levels>
@@ -41,6 +41,7 @@ Every truth must resolve to VERIFIED, FAILED (BLOCKER), or UNCERTAIN (WARNING wi
41
41
  <required_reading>
42
42
  @~/.claude/gsd-core/references/verification-overrides.md
43
43
  @~/.claude/gsd-core/references/gates.md
44
+ @~/.claude/gsd-core/references/engineering-standards.md
44
45
  </required_reading>
45
46
 
46
47
  This agent implements the **Escalation Gate** pattern (surfaces unresolvable gaps to the developer for decision).
@@ -441,7 +442,20 @@ grep -n -B 2 -A 2 "console\.log" "$file" 2>/dev/null | grep -E "^\s*(const|funct
441
442
 
442
443
  **Debt marker gate:** Any `TBD`, `FIXME`, or `XXX` marker in a file modified by this phase is a 🛑 BLOCKER unless the same line references formal follow-up work (`issue #123`, `PR #123`, `#123`, or `DEF-*`). Unreferenced markers mean completion is not auditable; set `status: gaps_found` and list each marker under `gaps`.
443
444
 
444
- Categorize: 🛑 Blocker (prevents goal or unresolved debt marker) | ⚠️ Warning (incomplete) | ℹ️ Info (notable)
445
+ **Reward-hacking gate (per `engineering-standards.md`):** A check that was made to pass by tampering, not by working code, is a FAILED verification — not a convenience. Treat each of these as a 🛑 BLOCKER (`status: gaps_found`):
446
+ - A test that was **weakened, skipped, or made trivially-passing** (`.skip`/`xit`/`@pytest.mark.skip`/`#[ignore]`/`t.Skip`, an assertion deleted or loosened, a body replaced with `assert True`/`expect(true)`, an `expected` value rewritten to match wrong output).
447
+ - A **hardcoded expected output** that makes a behavior or gate pass without real computation.
448
+ - A **test-file edit accompanying a non-test task** — when the phase's stated work is not "add/change tests" yet `*-SUMMARY.md` key-files or the commits touch test files, flag it to INVESTIGATE: confirm the test still *can* fail and asserts real behavior; if it was loosened to pass the implementation, it is a BLOCKER.
449
+
450
+ ```bash
451
+ git diff ${DIFF_BASE:-HEAD~1}..HEAD -- '*test*' '*spec*' 2>/dev/null | grep -nE '^\-.*(assert|expect|EXPECT)|\+.*(skip|xit|\.only|assert\s*\(?\s*[Tt]rue|return;?\s*//.*skip)'
452
+ ```
453
+
454
+ **Architecture-fit gate (per `engineering-standards.md`):** "wired and tamper-free" is not enough — judge the *shape*, not just that it runs.
455
+ - **Floor check (applies ALWAYS, even with no ADR):** the universal floor is dependency inversion at the true external boundaries (DB/3rd-party/IO) + a Functional-Core/Imperative-Shell shape so the logic is testable without the real services. An implementation that reaches into the DB/3rd-party from everywhere, untestable without standing up real infrastructure, **skipped the floor** — a 🛑 BLOCKER (`status: gaps_found`) *even on a simple Transaction-Script subdomain*. The floor is the cheap baseline, not ceremony.
456
+ - **Rung-fit check (when `.planning/adr/*.md` or `DOMAIN-MODEL.md` exists):** the implementation must match the ADR's chosen rung. Both directions: (a) **under-engineering** — thin CRUD / patch-around where the rung mandates a Domain Model / hexagonal ports / CQRS / event-driven flow (a working-but-wrong-shape impl is a 🛑 BLOCKER, not a passing phase); (b) **over-engineering** — ports/aggregates/CQRS/speculative layers on a subdomain the ADR marked Transaction Script (⚠️ Warning).
457
+
458
+ Categorize: 🛑 Blocker (prevents goal, unresolved debt marker, reward-hacked check, or ADR-rung under-build) | ⚠️ Warning (incomplete, or over-built vs the rung) | ℹ️ Info (notable)
445
459
 
446
460
  ## Step 7b: Behavioral Spot-Checks
447
461
 
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "gsd-core",
3
- "version": "1.8.5",
3
+ "version": "1.10.0",
4
4
  "description": "GSD Core — a meta-prompting, context engineering, and spec-driven development system for AI coding agents. Loads gsd's operating context into every Gemini CLI session.",
5
5
  "contextFileName": "GEMINI.md"
6
6
  }
@@ -1,6 +1,8 @@
1
1
  # AI-Written Tests — The Quality Contract
2
2
 
3
- How-to reference and enforcement contract for tests authored by a model (`add-tests`, TDD inside `execute-phase`). LLM test-writers have *known, named* failure modes: vacuous assertions, happy-path-only suites, mock-everything-then-assert-the-mock, change-detector tests (the implementation's current output copied back as the oracle), and weakening a failing assertion to go green. This contract converts test quality from judgment into mechanical checks — inventory-first, greppable forbidden patterns, a falsifiability gate, a mutation gate. Read before generating any test. Pairs with `test-strategy.md` and `test-doubles.md`.
3
+ How-to reference and enforcement contract for tests authored by a model (`add-tests`, TDD inside `execute-phase`). LLM test-writers have *known, named* failure modes: vacuous assertions, happy-path-only suites, mock-everything-then-assert-the-mock, change-detector tests (the implementation's current output copied back as the oracle), and weakening a failing assertion to go green. The deepest of these is **lost independence**: when the same agent writes both the code and its tests, the tests rubber-stamp the implementation — "documentation with assertions" that confirm bugs as expected behavior. This contract converts test quality from judgment into mechanical checks — inventory-first, greppable forbidden patterns, a falsifiability gate, an **independence gate**, a mutation gate. Read before generating any test. Pairs with `test-strategy.md` and `test-doubles.md`.
4
+
5
+ **The independence requirement (load-bearing).** Assertions must be anchored to the **spec / expected behavior**, NOT derived from reading the implementation. The agent must never edit a test to make its own code pass — that inverts the check. The test of independence: *would this assertion exist, in this form, if you hadn't seen the implementation? Is it checking the SPEC, or the code?* Green ≠ correct — a passing suite the implementer authored against its own output proves nothing.
4
6
 
5
7
  ## A. Behavior inventory BEFORE any test is written
6
8
 
@@ -54,6 +56,8 @@ A generated test that passes on its first run is **unverified**, not done. The s
54
56
 
55
57
  One extra run per test file; fully automatable. A test that cannot be made to fail by breaking the code under test is testing nothing — delete or rewrite it. **Never waive this step** because "the code already works"; that waiver is exactly where vacuous AI-written tests are born. (Where the mutation gate in E runs on the same files, a killed mutant covering the test's target behavior satisfies this gate.)
56
58
 
59
+ **The independence dimension — "can it fail" is necessary but not sufficient.** A test can fail-on-mutation and still be a rubber stamp if its *oracle* came from the implementation rather than the spec. So for each assertion also ask: **would this assertion exist, in this form, if I hadn't read the implementation? Is it checking the SPEC or the code?** The expected value must trace to a requirement/acceptance criterion (cite it), not to "what the SUT currently returns" (that is the change-detector pattern, B). And when a test goes red, **never edit the test to make the agent's own code pass** — investigate which side is wrong; weakening the assertion to go green destroys the independence the gate exists to protect.
60
+
57
61
  ## E. Mutation gate on the diff
58
62
 
59
63
  Run mutation testing (Stryker) **incrementally on changed files** as part of accepting a generated suite:
@@ -15,6 +15,12 @@ The common sweet spot is a **Domain Model inside a modular monolith**. High scal
15
15
 
16
16
  Use the core subdomain's complexity from DOMAIN-MODEL. Apply per subdomain: the complex core may warrant a Domain Model (± Hexagonal); supporting/generic subdomains stay Transaction Script.
17
17
 
18
+ ### The floor — the cheap baseline, even for simple projects
19
+
20
+ Below the rungs sits a baseline that applies **even to simple/short-lived projects** and is explicitly **NOT full hexagonal**: **dependency inversion at *true external boundaries only* (DB, 3rd-party APIs, clock/IO) + a Functional Core / Imperative Shell shape (pure logic separated from side-effecting glue) + strong, independent tests.** This is the lighter discipline the senior voices on *both* sides of the "always hexagonal?" debate converge on — it delivers the pro-camp's real prize (day-one isolated testability) at a fraction of hexagonal's cost, with **no internal port ceremony**. Extract a real internal port only "when you feel the second adapter appearing."
21
+
22
+ Transaction Script is the domain-logic *organization* at the floor — it still sits **behind** this seam discipline. **Do not read "Transaction Script floor" as "no seams":** a CRUD app that reaches straight into the DB/3rd-party from everywhere, untestable without the real services, is **under-engineered even though it is simple** — it skipped the floor. The floor is the cheap minimum; the rungs below are about how much *domain-logic structure* to add on top of it.
23
+
18
24
  | Rung | Move up when | Over-engineering tell |
19
25
  |------|--------------|-----------------------|
20
26
  | **Transaction Script / simple layered CRUD** (floor) | "validate → persist → return"; few rules; supporting/generic subdomains | — |
@@ -69,13 +75,23 @@ The same tools run in reverse for a brownfield monolith you're decomposing — s
69
75
 
70
76
  **Over-engineering tells:** one-implementation ports; rich aggregates over structural CRUD; CQRS/ES with no asymmetry or audit need; "microservice envy"; distributed monolith; elaborate config/abstraction "wanting all options all the time."
71
77
 
72
- **Under-engineering tells:** the same invariant duplicated across many transaction scripts; a big-ball-of-mud monolith with no enforced module boundaries; a complex/regulated domain modeled as thin CRUD; no audit trail where compliance needs it; no ADRs / no fitness functions.
78
+ **Under-engineering tells:** the same invariant duplicated across many transaction scripts; a big-ball-of-mud monolith with no enforced module boundaries; a complex/regulated domain modeled as thin CRUD; **no seam at the true external boundaries (DB/3rd-party reached into from everywhere), so the code can't be tested without the real services — the Functional-Core/Imperative-Shell floor was skipped, even on a simple app**; no audit trail where compliance needs it; no ADRs / no fitness functions.
79
+
80
+ ## The AI-coding era moves the FLOOR up a notch (not the ceiling)
81
+
82
+ The evidence (agent-reliability studies) shifts *where the floor sits*, not whether the higher rungs are universal:
83
+ - **Seams pay from the first agent session.** Agent reliability collapses as codebase scale and files-touched-per-change rise, and the bottleneck is **architectural legibility, not context-window size** (bigger windows don't rescue large tangled repos). Clean boundaries help the *agent* navigate and verify — so the Functional-Core/Imperative-Shell floor + strong independent tests are worth more now, even on simple projects.
84
+ - **Get the *seams* right early; defer *feature* speculation.** AI is **bad at adding architectural seams later** (agent success on compound/architectural refactors is low — well under half in the agent-reliability studies) but **good at adding features later** — so cheap-AI *weakens* YAGNI for core seams (build them now) and *strengthens* it for speculative features (defer them).
85
+ - **Don't over-generate structure.** AI writes structure cheaply but **maintains it worse** (duplication rises, drift) — so this moves the floor *up a notch*, NOT toward always-hexagonal. Prefer **deep modules, not shallow many-file layering** (the "design for agents" consensus converges with classic good design on tests/interfaces/naming/deep-modules; it diverges only by penalizing indirection *depth*).
86
+ - **Tests must be *independent*.** Because agents reward-hack tests, the lever is test *independence + quality* (human-owned/agent-non-editable assertions; verify beyond green), not test-*first ordering*. See `test-strategy.md` / `ai-test-quality.md`.
87
+
88
+ Net: raise the floor (seams + independent tests earlier), keep the ceiling calibrated (Domain Model / full hexagonal / CQRS / ES still require their concrete signals — the originators themselves reserve them for the complex core).
73
89
 
74
- **The meta-tell (use this to settle every rung):** if you cannot point to a **current, concrete** requirement — a real second adapter or delivery mechanism, a real divergent-scaling component, a real second team, a real audit mandate, a real tenant-isolation mandate, a genuinely pure core isolated for test speed — that justifies a rung, you are **over-engineering**. If such a requirement exists and you ignored it, you are **under-engineering**.
90
+ **The meta-tell (use this to settle every rung *above the floor*; the floor itself is always-on and exempt — it is the baseline, not a rung):** if you cannot point to a **current, concrete** requirement — a real second adapter or delivery mechanism, a real divergent-scaling component, a real second team, a real audit mandate, a real tenant-isolation mandate, a genuinely pure core isolated for test speed — that justifies a rung, you are **over-engineering**. If such a requirement exists and you ignored it, you are **under-engineering**.
75
91
 
76
92
  ## Default baseline (when in doubt)
77
93
 
78
- **Modular monolith + Domain Model only in the complex core bounded contexts + Transaction Script in the simple/CRUD ones + ADRs + boundary fitness functions.** This is the modern consensus starting point.
94
+ **The floor (Functional-Core/Imperative-Shell + dependency inversion at true external boundaries + strong independent tests) everywhere — then modular monolith + Domain Model only in the complex core bounded contexts + Transaction Script in the simple/CRUD ones + ADRs + boundary fitness functions.** This is the modern consensus starting point: a cheap testable floor for all, rich structure only where complexity earns it.
79
95
 
80
96
  ## Always
81
97
 
@@ -0,0 +1,45 @@
1
+ # Engineering Standards — The Senior Quality Contract
2
+
3
+ Shared reference injected into the producing agents (`gsd-planner`, `gsd-phase-researcher`, `gsd-pattern-mapper`, `gsd-executor`, `discuss-phase`) and enforced by the reviewing agents (`gsd-plan-checker`, `gsd-verifier`, `gsd-code-reviewer`). It encodes the behaviors a senior engineer actually performs — **not adjectives**. The evidence is explicit: telling a model to write "clean, elegant, senior" code is low-impact (and `CRITICAL: YOU MUST` caps now *backfire*); what moves quality is *specific behaviors* plus *external gates the producer cannot fake*.
4
+
5
+ ## The calibration spine (read first — both directions are failures)
6
+
7
+ **Fit the solution to the problem's real complexity. Over- and under-engineering both produce mess.** Over-engineering: hexagonal for a script, speculative abstraction, config for imagined futures. **Under-engineering: patching, hacking, thin CRUD over a rich domain, skipping the structure the architecture requires — this is the failure mode that produces most "mess", because there is no proper structure to hang a clean solution on.** Hold two things separate:
8
+
9
+ - **The quality bar is INVARIANT.** Correctness, completeness, edge-case handling, clear names, matching existing patterns, and no-hacks apply to an MVP exactly as to a complex system. High internal quality is cheaper at every horizon (Fowler) — never trade it for speed.
10
+ - **The structural ceremony is SET BY `recommend-architecture`'s ADR, per subdomain — not chosen by the planner's or executor's taste.**
11
+ - Where the ADR mandates a **Domain Model / hexagonal ports / CQRS / event-driven** flow (a genuinely complex core — e.g. a rich matching/scheduling/memory engine), implement it **fully and cleanly**. Building those abstractions properly *is* the quality bar here. A shortcut that violates the chosen rung — CRUD where a Domain Model is mandated, a direct call where a port is mandated — is **under-engineering, i.e. a hack**, and `gsd-plan-checker` flags it HIGH.
12
+ - Where the ADR says **Transaction Script** (a simple/CRUD subdomain), adding ports, aggregates, or layers is **over-engineering** — equally a defect.
13
+
14
+ **Simplicity operates *within* the chosen architecture, never *against* it.** If the architecture genuinely no longer fits the need, raise it as a promotion trigger (`recommend-architecture` / `new-milestone`) — never quietly under-build around it. Proportionality both ways: don't impose ceremony on a one-sentence diff; don't skip the bar on a complex one. Complexity = essential (the job, irreducible — Brooks) + accidental (mess, optional). Remove only the mess.
15
+
16
+ ## The contract (behaviors — most load-bearing last)
17
+
18
+ 1. **Understand before you change.** Read the real code paths, the tests-as-spec, and — for this subdomain — the DOMAIN-MODEL classification and the ADR rung, before deciding or writing. Most defects are acting on an incomplete mental model, or building to the wrong structure.
19
+ 2. **Build to the architecture, fully — on top of the universal floor.** A cheap baseline applies to *every* project, even simple ones: dependency inversion at **true external boundaries only** (DB, 3rd-party, clock/IO) + a Functional-Core/Imperative-Shell shape (pure logic separable from side-effecting glue) + strong independent tests. That floor is part of the invariant bar — skipping it (reaching into the DB/3rd-party from everywhere, untestable) is under-engineering even on a CRUD app. *On top of that*, implement the rung the ADR chose for this subdomain — ports/aggregates/boundaries — properly. Don't skip mandated structure "to keep it simple" (under-engineering); don't add un-mandated internal ceremony "to be robust" (over-engineering). "Don't add un-mandated structure" (rule 4) means the *ceremony above the floor*, never the floor itself.
20
+ 3. **Match what exists; deep modules, clear names.** Make new code look like it belongs — reuse the established patterns, names, and conventions (`gsd-pattern-mapper` supplies the analogs); don't add a second way to do something that already has one. Prefer a simple interface over real substance to many shallow pass-throughs (Ousterhout — and "extract into tiny functions" is genuinely contested; don't cargo-cult classitis). Descriptive names are one of the few empirically-validated quality levers.
21
+ 4. **Don't invent un-mandated structure (but build everything the ADR mandates).** Prefer duplication to a *speculative* abstraction — duplication is far cheaper than the wrong abstraction (Metz); wait until the right shape is obvious, inline it back when it stops fitting (AHA); DRY is about *knowledge*, not coincidentally-similar code. Skip speculative layers/options/config for imagined futures (YAGNI; Fowler's cost-of-carry). None of this overrides the architecture's *deliberate, required* abstractions — a mandated port/aggregate is not speculative; declining it is under-engineering, not YAGNI.
22
+ 5. **Bounded boy-scout.** Improve what you touch when the change needs it and it's honestly better; keep structural changes in a *separate* commit from behavioral ones (Beck's two hats); flag larger refactors instead of silently sprawling — unbounded refactoring is scope creep.
23
+ 6. **Correct AND complete — edge cases first.** Before coding, enumerate the edge cases and failure modes the behavior must handle; a change that serves only the happy path is unfinished. This completeness is what makes code simple — not doing less.
24
+ 7. **Never hack a check to pass.** Don't hardcode an expected output, weaken/skip/delete a test, or make a gate trivially pass. A test exists so it *can* fail; one that cannot fail is worthless. (Caught mechanically by the reviewers below — not trusted to good intent.)
25
+
26
+ ## Why this is enforced by gates, not vibes
27
+
28
+ Prose against shortcuts is *weak alone* — frontier agents still write "make verify return true" under pressure. The durable counter is an external signal the producer can't fake, plus a fresh-context reviewer that checks **both** directions:
29
+
30
+ - **`gsd-plan-checker`** rejects plans that bake in hacks, under-build the ADR rung (CRUD where a Domain Model is mandated), or over-build a simple subdomain.
31
+ - **`gsd-verifier`** assumes the goal was NOT met until codebase evidence proves it; "file exists" ≠ "behavior works"; a test-file edit during a non-test task is a flag, not a convenience.
32
+ - **`gsd-code-reviewer`** is the fresh-context adversary: it hunts hardcoded outputs, happy-path-only handling, weakened tests, duplication, the wrong abstraction — *and* under-engineering against the architecture.
33
+ - The **falsifiability gate** (`ai-test-quality.md`) proves each test can fail before it counts.
34
+
35
+ ## Anti-patterns (both directions)
36
+
37
+ - **Under-engineering:** patching around a missing-but-needed abstraction instead of building it; thin CRUD / transaction-script over a domain the ADR marked rich; "keeping it simple" by violating the chosen architecture; happy-path-only; skipping edge cases.
38
+ - **Over-engineering:** hexagonal/ports/CQRS with no ADR mandate; abstracting on the first or second similarity; speculative generality and config-for-imagined-futures.
39
+ - **Both:** adjective-prompting your own output ("I'll write clean code") instead of doing the behaviors; mixing a refactor and a behavior change in one commit; re-implementing logic the codebase already has; "make it pass" hacks.
40
+
41
+ *Basis: Ousterhout (A Philosophy of Software Design — deep modules, complexity is incremental); Metz ("the wrong abstraction"); Dodds (AHA); Thomas (DRY = knowledge); Beck (Tidy First, two hats); Fowler (YAGNI, cost-of-carry, "internal quality is cheaper at every horizon"); Brooks (essential vs accidental complexity); Anthropic/OpenAI/DeepMind on agent prompting, reward-hacking, and verification. Full citations in the research corpus.*
42
+
43
+ ## Consumes / produces
44
+
45
+ Read by `gsd-planner`, `gsd-phase-researcher`, `gsd-pattern-mapper`, `gsd-executor`, and `discuss-phase`; enforced by `gsd-plan-checker`, `gsd-verifier`, `gsd-code-reviewer`. The structural-ceremony decision is owned by `recommend-architecture` (the rung ladder); this contract governs how *well* the chosen rung is executed. Pairs with `universal-anti-patterns.md` (framework pitfalls), `test-doubles.md` + `ai-test-quality.md` (test quality).
@@ -1,14 +1,53 @@
1
- # Codebase scout — map selection table
1
+ # Codebase scout — map selection + scaled deep-scout protocol
2
2
 
3
3
  > Lazy-loaded reference for the `scout_codebase` step in
4
4
  > `workflows/discuss-phase.md` (extracted via #2551 progressive-disclosure
5
- > refactor). Read this only when prior `.planning/codebase/*.md` maps exist
6
- > and the workflow needs to pick which 2–3 to load.
5
+ > refactor). Read this when the workflow needs to scout the codebase before
6
+ > identifying gray areas.
7
+ >
8
+ > The scout has **two paths**. Assess depth FIRST (see "Depth assessment"),
9
+ > then run exactly one:
10
+ > - **Shallow path** (default for trivial/reversible phases) — the light
11
+ > inline map-selection scan below. Cheap.
12
+ > - **Deep path** (complex / high-blast-radius / hard-to-reverse / touches
13
+ > the core) — parallel read-only explorers on distinct lenses, then an
14
+ > orchestrator confirm-or-refute gate. ~15× the tokens; earned, not default.
15
+ >
16
+ > Both feed the same internal `<codebase_context>`. The deep path adds a
17
+ > VERIFIED-vs-INFERRED split and a reconciled, evidence-backed understanding.
7
18
 
8
- ## Phase-type recommended maps
19
+ ## Depth assessment (run first — gates the cheap vs deep path)
9
20
 
10
- Read 2–3 maps based on inferred phase type. Do NOT read all seven —
11
- that inflates context without improving discussion quality.
21
+ Do NOT always-max. Multi-agent scouting burns ~15× the tokens of a single
22
+ read; spend it only where blast-radius justifies it. Score the phase on four
23
+ axes (from the phase title, ROADMAP entry, and prior context):
24
+
25
+ | Axis | Light end (→ shallow) | Heavy end (→ deep) |
26
+ |---|---|---|
27
+ | **Complexity** | one file, clear-cut, CRUD | cross-cutting, many moving parts |
28
+ | **Blast radius** | isolated, leaf code | touches the core / shared infra / many call sites |
29
+ | **Reversibility** | trivially revertable | hard to reverse (schema, public API, data migration, payments) |
30
+ | **Decomposability** | n/a | understanding splits cleanly into independent angles |
31
+
32
+ **Routing rule:**
33
+ - **All axes light** → **shallow path**. Run the map-selection scan only.
34
+ - **Any of: high complexity, high blast-radius, low reversibility, or "touches
35
+ the core"** → **deep path** (parallel lenses + confirm-or-refute gate).
36
+ - **Tightly-coupled / sequential understanding** (the angles can't be explored
37
+ independently) → prefer **one deep continuous-context pass** over many
38
+ parallel explorers (Cognition: parallel workers drift on conflicting
39
+ assumptions). Use the deep path's verification gate, but a single explorer.
40
+
41
+ **Overrides:** an explicit `--deep` forces the deep path; `--shallow` forces
42
+ the light path. With no flag, the routing rule above decides. Log the resolved
43
+ path and the axis that triggered it (e.g. `[scout] deep — schema change,
44
+ low reversibility`).
45
+
46
+ ## Shallow path — phase-type → recommended maps
47
+
48
+ The cheap, single-pass scan. Read 2–3 maps based on inferred phase type. Do
49
+ NOT read all seven — that inflates context without improving discussion
50
+ quality.
12
51
 
13
52
  | Phase type (infer from title + ROADMAP entry) | Read these maps |
14
53
  |---|---|
@@ -49,3 +88,96 @@ From the scan, identify:
49
88
 
50
89
  Used in `analyze_phase` and `present_gray_areas`. NOT written to a file —
51
90
  session-only.
91
+
92
+ On the **deep path**, this same `<codebase_context>` additionally carries a
93
+ `VERIFIED facts` / `INFERRED facts` split and the explicit open questions left
94
+ at the stop (see below).
95
+
96
+ ---
97
+
98
+ ## Deep path — parallel read-only explorers on distinct lenses
99
+
100
+ Run this only when the depth assessment routed here. Reuses the exact
101
+ parallel-dispatch machinery the advisor mode (`modes/advisor.md`
102
+ `advisor_research`) and `discuss-phase-assumptions.md` already use: spawn
103
+ `Agent()` calls simultaneously, then synthesize. Do NOT invent new machinery.
104
+
105
+ **Read-only is a hard constraint.** Explorers answer questions; they never
106
+ write code (Cognition: parallel writers amplify conflicting decisions; parallel
107
+ read-only scouts are safe). They glob/grep/read/inspect git history only.
108
+
109
+ ### Lens menu — pick 2–4 relevant to the phase type
110
+
111
+ Spawn one explorer per lens, each owning a DISTINCT angle (not N identical
112
+ agents — that just duplicates work). Give each Anthropic's 4-part contract:
113
+ objective, output format, tools/sources, and boundaries ("don't cover X —
114
+ that's another explorer's job").
115
+
116
+ | Lens | Owns | Pick when |
117
+ |---|---|---|
118
+ | **by-layer / structure & dependency** | architecture, modules, who-calls-whom, the real dependency graph | almost always (the spine) |
119
+ | **by-data-flow** | how data enters / transforms / exits; invariants; persistence | data models, schema, state, pipelines |
120
+ | **by-control-flow / call-path** | real execution paths, entry points (traced, not assumed) | behavior changes, request/handler flows |
121
+ | **by-goals/intent + tests-as-spec** | what it's supposed to do; tests read as the executable spec; existing conventions to imitate | any phase modifying established behavior |
122
+ | **by-failure-mode** | error handling, edge cases, where it breaks, retry/timeout paths | high-blast-radius, low-reversibility, infra |
123
+
124
+ Default selection: **by-layer + by-goals/tests** for most phases; add
125
+ **by-data-flow** for data/schema work, **by-control-flow** for behavior work,
126
+ **by-failure-mode** for high-blast-radius/infra.
127
+
128
+ ### Explorer contract (required of every lens)
129
+
130
+ Each explorer MUST return:
131
+ - **Load-bearing claims with `file:line` citations** — and, for counts or
132
+ exhaustiveness claims, the command + its raw output. No bare assertions.
133
+ - **An honest coverage statement** — "searched 2 of 5 dirs", not
134
+ "searched all locations". Partial is fine; lying-by-omission is not.
135
+ - **A VERIFIED vs INFERRED split** — facts read from tool output vs
136
+ conclusions drawn from them.
137
+
138
+ > **ORCHESTRATOR RULE — CODEX RUNTIME**: After calling all Agent() calls above
139
+ > to spawn explorers, do NOT independently scout or analyze the codebase while
140
+ > the subagents are active. Wait for all explorers to return before
141
+ > synthesizing. This prevents duplicate work and wasted context.
142
+
143
+ ## Orchestrator confirm-or-refute gate (the load-bearing part)
144
+
145
+ ~20–30% of subagent reports contain at least one claim that doesn't match the
146
+ underlying tool output, and self-verification is unreliable. So the
147
+ orchestrator does NOT trust the explorers' prose. After all explorers return,
148
+ before adopting anything into `<codebase_context>`:
149
+
150
+ 1. **Spot-check the highest-risk / load-bearing claims against RAW TOOL
151
+ OUTPUT** — re-read the actual file at the cited `file:line`, re-run the
152
+ grep/count yourself. Check claims against the tool output, never against the
153
+ agent's summary. Prioritize: inflated/exhaustiveness claims ("found 15
154
+ files", "searched all", "applied successfully"), and any claim a gray area
155
+ or plan would hang on. You need not re-verify everything — re-derive the
156
+ pivotal facts.
157
+ 2. **Require citations for load-bearing claims.** A load-bearing claim with no
158
+ `file:line` (or no command+output) is treated as INFERRED, not VERIFIED,
159
+ until you ground it yourself.
160
+ 3. **Reconcile contradictions — mandatory, never silently pick.** When two
161
+ explorers disagree (e.g. "3 call sites" vs "5 call sites"), surface the
162
+ contradiction and resolve it against ground truth (re-grep), don't merge or
163
+ pick one quietly.
164
+ 4. **Separate VERIFIED from INFERRED** in the resulting `<codebase_context>`.
165
+ Carry both forward labelled — never collapse inference into fact.
166
+ 5. **Never summarize a summary.** On any disputed or load-bearing point,
167
+ re-ground at the source rather than paraphrasing an explorer's paraphrase.
168
+
169
+ ## Sufficiency stop (avoid analysis paralysis)
170
+
171
+ State the stop criterion explicitly. Stop scouting and proceed to
172
+ `analyze_phase` when **all three** hold:
173
+ - **Confidence threshold met** — the chosen lenses are covered and the real
174
+ call-path / data-flow for the change site was traced, not assumed. Satisfice
175
+ (~good enough to decide), don't optimize.
176
+ - **Budget cap not exceeded** — within the tool-call/explorer budget for this
177
+ task class (cheap fact-finding gets few; complex/high-blast gets more).
178
+ - **Saturation** — the last round surfaced **no new load-bearing facts AND all
179
+ contradictions are resolved.**
180
+
181
+ Record any remaining open questions explicitly (bounded, named) and carry them
182
+ into `<codebase_context>` — don't keep scouting to chase curiosity. Only once
183
+ the gate passes is the understanding "sufficient AND verified".
@@ -35,7 +35,13 @@ Pure, dependency-light, logic-dense code: **money/currency (integer minor units
35
35
 
36
36
  ## TDD — honestly
37
37
 
38
- The quality benefit is real, but the evidence favors **small, uniform increments** over *test-first specifically* (controlled studies found test-first vs test-after didn't differ; granularity did). So **mandate: behavior-level tests + small increments + a regression floor.** Keep the **RED step** (watch a test fail) so tests actually test something. Test-first vs test-after is a **knob** (`workflow.tdd_mode`), not dogma.
38
+ The defensible mandate is **always strong, INDEPENDENT tests** not "always TDD, even for simple." The evidence is clear that what drives the quality benefit is **granularity + test existence + independence**, not test-first *ordering* (controlled studies found test-first vs test-after didn't differ; granularity did; the design-improvement claim for ordering is contested, not established). So **mandate: behavior-level tests + small increments + a regression floor.** Keep the **RED step** (watch a test fail) so tests actually test something.
39
+
40
+ **Default to test-first for AI-WRITTEN code — but on anti-gaming/independence grounds, not "test-first improves design."** When an agent authors the implementation, writing the spec/test first stops it from shaping tests to its own output: agents reward-hack tests, and a large share of "passing" agent patches diverge from ground truth (green ≠ correct). The lever is an *independent*, human-anchored check the agent did not write to its own convenience (see `ai-test-quality.md`) — not the ordering ritual. Test-first vs test-after stays a **knob** (`workflow.tdd_mode`); the independence requirement does not.
41
+
42
+ **Skip TDD for** UI/visual components, prototypes/spikes, glue/wiring, and trivial CRUD — correctness there is visual, throwaway, or too thin to assert cheaply (both pro- and anti-TDD camps agree). Still cover them with the cheapest check that fits (visual/snapshot, integration smoke).
43
+
44
+ **Green ≠ correct — verify beyond green.** A passing suite is a *lower bound* on correctness, not proof. For critical paths pair it with mutation/property/differential testing + human UAT.
39
45
 
40
46
  ## Persistent vs transient E2E
41
47
 
@@ -8,6 +8,7 @@ You are a thinking partner, not an interviewer. The user is the visionary — yo
8
8
  @~/.claude/gsd-core/references/domain-probes.md
9
9
  @~/.claude/gsd-core/references/gate-prompts.md
10
10
  @~/.claude/gsd-core/references/universal-anti-patterns.md
11
+ @~/.claude/gsd-core/references/engineering-standards.md
11
12
  </required_reading>
12
13
 
13
14
  <progressive_disclosure>
@@ -278,16 +279,19 @@ Parse JSON for: `todo_count`, `matches[]` (each with `file`, `title`, `area`, `s
278
279
  </step>
279
280
 
280
281
  <step name="scout_codebase">
281
- Lightweight scan of existing code to inform gray area identification (~10% context).
282
+ Scan existing code to inform gray area identification depth SCALED to the phase, not always-max.
282
283
 
283
- Read `@~/.claude/gsd-core/references/scout-codebase.md` — it contains the phase-type→map selection table, single-read rule, no-maps fallback, and `<codebase_context>` output schema. Then execute:
284
- 1. `ls .planning/codebase/*.md` to find existing maps
285
- 2. Select 2–3 maps via the reference's table; or grep fallback if none exist
286
- 3. Build internal `<codebase_context>` per the reference's output schema
284
+ Read `@~/.claude/gsd-core/references/scout-codebase.md` — it contains the depth assessment, the shallow map-selection scan, the deep-path lens menu, the orchestrator confirm-or-refute gate, and the sufficiency stop. Then:
285
+
286
+ 1. **Assess depth first** (reference's "Depth assessment"). Score the phase on complexity × blast-radius × reversibility × decomposability. `--deep`/`--shallow` in $ARGUMENTS override; otherwise auto-route. Log the resolved path and the trigger.
287
+
288
+ 2. **Shallow path** (default — trivial/reversible phases, ~10% context): `ls .planning/codebase/*.md`, select 2–3 maps via the reference's table (or grep fallback if none exist), build internal `<codebase_context>`.
289
+
290
+ 3. **Deep path** (complex / high-blast-radius / hard-to-reverse / touches-the-core): spawn parallel READ-ONLY explorers on 2–4 distinct lenses, reusing the advisor-mode/`discuss-phase-assumptions` parallel `Agent()` dispatch + ORCHESTRATOR RULE (do not invent new machinery). Each returns `file:line`-cited claims, honest coverage, and a VERIFIED-vs-INFERRED split. Then run the reference's **confirm-or-refute gate** (spot-check highest-risk claims against RAW TOOL OUTPUT not prose; reconcile contradictions against ground truth; keep VERIFIED separate from INFERRED; never summarize-a-summary) and stop per its sufficiency criterion. Build `<codebase_context>` from the reconciled, evidence-backed understanding.
287
291
  </step>
288
292
 
289
293
  <step name="analyze_phase">
290
- Analyze the phase to identify gray areas. Use both `prior_decisions` and `codebase_context` to ground the analysis.
294
+ Analyze the phase to identify gray areas. Use both `prior_decisions` and `codebase_context` to ground the analysis. When `scout_codebase` ran the deep path, ground gray areas in its **VERIFIED** facts; treat **INFERRED** facts and recorded open questions as things to confirm with the user, not as settled premises.
291
295
 
292
296
  1. **Domain boundary** — What capability is this phase delivering? State it clearly.
293
297
 
@@ -649,6 +649,7 @@ increases monotonically across waves. `{status}` is `complete` (success),
649
649
  ` : ''}
650
650
  - ${PROJECT_ROOT}/CLAUDE.md (Project instructions, if exists — follow project-specific guidelines and coding conventions)
651
651
  - ${PROJECT_ROOT}/.claude/skills/ or ${PROJECT_ROOT}/.agents/skills/ (Project skills, if either exists — list skills, read SKILL.md for each, follow relevant rules during implementation)
652
+ - ${PROJECT_ROOT}/.planning/adr/*.md, ${PROJECT_ROOT}/.planning/DOMAIN-MODEL.md, ${PROJECT_ROOT}/.planning/TEST-STRATEGY.md (if they exist — the chosen architecture rung and test levels; any autonomous deviation must honor them: build to the rung fully, neither under- nor over-engineering it, per @~/.claude/gsd-core/references/engineering-standards.md)
652
653
  </files_to_read>
653
654
 
654
655
  ${AGENT_SKILLS}
@@ -124,6 +124,8 @@ INFRA-STRATEGY.md written — infrastructure matched to traffic shape and team s
124
124
  Next: /gsd:cicd-strategy (pipelines + deploy targets will follow this strategy)
125
125
  ```
126
126
 
127
+ **Roadmap reconciliation:** ROADMAP.md predates this strategy. Scan it against the compute rungs and data layer — if a phase assumes compute, a data store, or an environment this strategy invalidates or reshapes (e.g. a phase planning a self-managed cluster now dropped to serverless containers, or one missing the IaC/observability floor this strategy mandates), SAY SO explicitly and offer `/gsd:phase --edit` (or a roadmap refresh — the roadmapper re-reads discovery artifacts). Never leave a known strategy↔roadmap contradiction unspoken.
128
+
127
129
  </process>
128
130
 
129
131
  <critical_rules>
@@ -10,6 +10,7 @@ Read all files referenced by the invoking prompt's execution_context before star
10
10
  @~/.claude/gsd-core/references/gate-prompts.md
11
11
  @~/.claude/gsd-core/references/agent-contracts.md
12
12
  @~/.claude/gsd-core/references/gates.md
13
+ @~/.claude/gsd-core/references/engineering-standards.md
13
14
  </required_reading>
14
15
 
15
16
  <available_agent_types>
@@ -46,7 +46,9 @@ From DOMAIN-MODEL.md (or the answers): extract each subdomain's type + complexit
46
46
 
47
47
  ## Step 3: Axis A — domain-logic organization (per subdomain)
48
48
 
49
- Decide a rung **per subdomain** (not one rung for the whole app). Supporting/generic subdomains almost always stay at Transaction Script; the complex core may climb.
49
+ **First, the floor applies to ALL subdomains, even simple ones (state it in the ADR):** dependency inversion at *true external boundaries only* (DB, 3rd-party, clock/IO) + a Functional-Core/Imperative-Shell shape + strong independent tests. This is NOT hexagonal — no internal port ceremony — it's the cheap testability baseline both camps' senior voices converge on, and the AI era makes it pay from the first agent session (see the reference's floor + AI-era sections). "Transaction Script" is the domain-logic organization *on top of* this floor, never an excuse to skip it (a CRUD app that reaches into the DB from everywhere is under-engineered).
50
+
51
+ Then decide a rung **per subdomain** (not one rung for the whole app). Supporting/generic subdomains almost always stay at Transaction Script *behind the floor seams*; the complex core may climb.
50
52
 
51
53
  For the **core** subdomain, use `AskUserQuestion` (header "Core logic"):
52
54
  - question: "For the core (*[core name]*): are operations mostly validate→persist→return, or are there rich invariants / state machines / tangled business rules?"
@@ -105,6 +105,8 @@ TEST-STRATEGY.md written — test shape set to follow the architecture.
105
105
  Next: /gsd:plan-phase (plans + /gsd:add-tests will follow this strategy)
106
106
  ```
107
107
 
108
+ **Roadmap reconciliation:** ROADMAP.md predates this strategy. Scan it against the test shape — if a phase relies on test infrastructure or a test level this strategy invalidates or reshapes (e.g. a phase planning e2e for what's now demoted to integration, or one missing the test infra a gnarly-bit tier needs), SAY SO explicitly and offer `/gsd:phase --edit` (or a roadmap refresh — the roadmapper re-reads discovery artifacts). Never leave a known strategy↔roadmap contradiction unspoken.
109
+
108
110
  </process>
109
111
 
110
112
  <critical_rules>
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@therocketcode/gsd-core",
3
- "version": "1.8.5",
3
+ "version": "1.10.0",
4
4
  "description": "GSD Core is a meta-prompting, context engineering, and spec-driven development system for AI coding agents.",
5
5
  "bin": {
6
6
  "gsd-core": "bin/install.js",