@therocketcode/gsd-core 1.9.0 → 1.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,7 +1,7 @@
1
1
  {
2
2
  "name": "gsd-core",
3
3
  "displayName": "GSD Core",
4
- "version": "1.9.0",
4
+ "version": "1.11.0",
5
5
  "description": "GSD Core is a meta-prompting, context engineering, and spec-driven development system for AI coding agents.",
6
6
  "author": {
7
7
  "name": "TheRocketCodeMX",
@@ -60,7 +60,7 @@ This ensures project-specific patterns, conventions, and best practices are appl
60
60
  **3. Code Quality** — Dead code, unused imports/variables, poor naming conventions, missing error handling, inconsistent patterns, overly complex functions (high cyclomatic complexity), code duplication, magic numbers, commented-out code
61
61
 
62
62
  **4. Contract Conformance (enforce `engineering-standards.md` BOTH ways — you are the fresh-context adversary)** — Fit to the ADR's chosen rung for this subdomain is itself a defect class. Never reward "simpler is better"; the bar is "fits the architecture's chosen rung, executed fully." Hunt in both directions and classify as BLOCKER when behavior/architecture is wrong, WARNING otherwise:
63
- - **Reward-hacking / under-engineering:** hardcoded expected outputs; happy-path-only handling (missing edge cases/failure modes); a test weakened, skipped, deleted, or made trivially-passing (e.g. `assert True`, removed assertion, rewritten `expected`); logic the codebase already implements re-written instead of reused; CRUD / transaction-script or patching *around* a mandated abstraction where the ADR mandates a richer rung (Domain Model / ports / aggregates / CQRS) — under-built against the architecture.
63
+ - **Reward-hacking / under-engineering:** hardcoded expected outputs; happy-path-only handling (missing edge cases/failure modes); a test weakened, skipped, deleted, or made trivially-passing (e.g. `assert True`, removed assertion, rewritten `expected`); logic the codebase already implements re-written instead of reused; CRUD / transaction-script or patching *around* a mandated abstraction where the ADR mandates a richer rung (Domain Model / ports / aggregates / CQRS) — under-built against the architecture; **or skipping the universal floor** — no seam at the true external boundaries (DB/3rd-party reached into from everywhere, untestable without the real services), a defect *even on a simple Transaction-Script subdomain* (the floor is the cheap testable baseline, not optional ceremony).
64
64
  - **Over-engineering:** ports / aggregates / CQRS / speculative layers, options, or config the ADR did NOT mandate for this subdomain; an abstraction invented on the first or second similarity (the *wrong* abstraction — prefer duplication until the shape is obvious). Note: a mandated port/aggregate is NOT speculative — its absence is the defect, not its presence.
65
65
 
66
66
  To judge the rung, read the subdomain's classification and ADR (`DOMAIN-MODEL.md`, the architecture ADR) when present; absent those, fall back to "matches existing patterns + no hacks + complete."
@@ -78,7 +78,7 @@ If CONTEXT.md exists, add verification dimension: **Context Compliance**
78
78
  - Do plans honor locked decisions?
79
79
  - Are deferred ideas excluded?
80
80
  - Are discretion areas handled appropriately?
81
- - **Do plans honor the canonical discovery artifacts?** Flag a HIGH concern if a task contradicts the architecture ADR's per-subdomain rung (e.g. CRUD where a Domain Model is mandated), the DOMAIN-MODEL classification, or the TEST-STRATEGY's test levels (e.g. unit-mocking the DB where integration via Testcontainers is required, or float money where integer minor units are mandated). Same for INFRA-STRATEGY/CICD-STRATEGY when present (e.g. committed .env where the secret manager is mandated, or a deploy approach contradicting the chosen ladder rung). Per `engineering-standards.md` this gate is **symmetric** — flag HIGH in BOTH directions, never bias toward "simpler": (a) a plan that bakes in a hack/shortcut to pass a gate (hardcoded expected output, weakened/skipped test, "make it pass"); (b) **under-engineering** — CRUD/transaction-script, or patching around a mandated abstraction, where the ADR mandates a richer rung (Domain Model / ports / aggregates / CQRS); (c) **over-engineering** — adding ports/aggregates/CQRS/speculative layers the ADR did NOT mandate for that subdomain.
81
+ - **Do plans honor the canonical discovery artifacts?** Flag a HIGH concern if a task contradicts the architecture ADR's per-subdomain rung (e.g. CRUD where a Domain Model is mandated), the DOMAIN-MODEL classification, or the TEST-STRATEGY's test levels (e.g. unit-mocking the DB where integration via Testcontainers is required, or float money where integer minor units are mandated). Same for INFRA-STRATEGY/CICD-STRATEGY when present (e.g. committed .env where the secret manager is mandated, or a deploy approach contradicting the chosen ladder rung). Per `engineering-standards.md` this gate is **symmetric** — flag HIGH in BOTH directions, never bias toward "simpler": (a) a plan that bakes in a hack/shortcut to pass a gate (hardcoded expected output, weakened/skipped test, "make it pass"); (b) **under-engineering** — CRUD/transaction-script or patching around a mandated abstraction where the ADR mandates a richer rung (Domain Model / ports / aggregates / CQRS); **or skipping the universal floor** — no seam at the true external boundaries (DB/3rd-party reached into from everywhere, untestable without the real services), even on a simple Transaction-Script subdomain; (c) **over-engineering** — adding ports/aggregates/CQRS/speculative layers the ADR did NOT mandate for that subdomain.
82
82
  </upstream_input>
83
83
 
84
84
  <core_principle>
@@ -451,7 +451,9 @@ grep -n -B 2 -A 2 "console\.log" "$file" 2>/dev/null | grep -E "^\s*(const|funct
451
451
  git diff ${DIFF_BASE:-HEAD~1}..HEAD -- '*test*' '*spec*' 2>/dev/null | grep -nE '^\-.*(assert|expect|EXPECT)|\+.*(skip|xit|\.only|assert\s*\(?\s*[Tt]rue|return;?\s*//.*skip)'
452
452
  ```
453
453
 
454
- **Architecture-fit gate (per `engineering-standards.md`; only when `.planning/adr/*.md` or `DOMAIN-MODEL.md` exists):** "wired and tamper-free" is not enough — the implementation must match the ADR's chosen rung for the subdomain it touches. Read the latest ADR + DOMAIN-MODEL and check **both directions**: (a) **under-engineering** — thin CRUD / transaction-script / a patch-around where the rung mandates a Domain Model / hexagonal ports / CQRS / event-driven flow (a working-but-wrong-shape implementation is a 🛑 BLOCKER, `status: gaps_found` — it's the under-build the contract forbids, not a passing phase); (b) **over-engineering** — ports/aggregates/CQRS/speculative layers on a subdomain the ADR marked Transaction Script (⚠️ Warning). Judge the shape, not just that it runs. If no ADR/DOMAIN-MODEL exists, skip this gate.
454
+ **Architecture-fit gate (per `engineering-standards.md`):** "wired and tamper-free" is not enough — judge the *shape*, not just that it runs.
455
+ - **Floor check (applies ALWAYS, even with no ADR):** the universal floor is dependency inversion at the true external boundaries (DB/3rd-party/IO) + a Functional-Core/Imperative-Shell shape so the logic is testable without the real services. An implementation that reaches into the DB/3rd-party from everywhere, untestable without standing up real infrastructure, **skipped the floor** — a 🛑 BLOCKER (`status: gaps_found`) *even on a simple Transaction-Script subdomain*. The floor is the cheap baseline, not ceremony.
456
+ - **Rung-fit check (when `.planning/adr/*.md` or `DOMAIN-MODEL.md` exists):** the implementation must match the ADR's chosen rung. Both directions: (a) **under-engineering** — thin CRUD / patch-around where the rung mandates a Domain Model / hexagonal ports / CQRS / event-driven flow (a working-but-wrong-shape impl is a 🛑 BLOCKER, not a passing phase); (b) **over-engineering** — ports/aggregates/CQRS/speculative layers on a subdomain the ADR marked Transaction Script (⚠️ Warning).
455
457
 
456
458
  Categorize: 🛑 Blocker (prevents goal, unresolved debt marker, reward-hacked check, or ADR-rung under-build) | ⚠️ Warning (incomplete, or over-built vs the rung) | ℹ️ Info (notable)
457
459
 
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "gsd-core",
3
- "version": "1.9.0",
3
+ "version": "1.11.0",
4
4
  "description": "GSD Core — a meta-prompting, context engineering, and spec-driven development system for AI coding agents. Loads gsd's operating context into every Gemini CLI session.",
5
5
  "contextFileName": "GEMINI.md"
6
6
  }
@@ -1,6 +1,8 @@
1
1
  # AI-Written Tests — The Quality Contract
2
2
 
3
- How-to reference and enforcement contract for tests authored by a model (`add-tests`, TDD inside `execute-phase`). LLM test-writers have *known, named* failure modes: vacuous assertions, happy-path-only suites, mock-everything-then-assert-the-mock, change-detector tests (the implementation's current output copied back as the oracle), and weakening a failing assertion to go green. This contract converts test quality from judgment into mechanical checks — inventory-first, greppable forbidden patterns, a falsifiability gate, a mutation gate. Read before generating any test. Pairs with `test-strategy.md` and `test-doubles.md`.
3
+ How-to reference and enforcement contract for tests authored by a model (`add-tests`, TDD inside `execute-phase`). LLM test-writers have *known, named* failure modes: vacuous assertions, happy-path-only suites, mock-everything-then-assert-the-mock, change-detector tests (the implementation's current output copied back as the oracle), and weakening a failing assertion to go green. The deepest of these is **lost independence**: when the same agent writes both the code and its tests, the tests rubber-stamp the implementation — "documentation with assertions" that confirm bugs as expected behavior. This contract converts test quality from judgment into mechanical checks — inventory-first, greppable forbidden patterns, a falsifiability gate, an **independence gate**, a mutation gate. Read before generating any test. Pairs with `test-strategy.md` and `test-doubles.md`.
4
+
5
+ **The independence requirement (load-bearing).** Assertions must be anchored to the **spec / expected behavior**, NOT derived from reading the implementation. The agent must never edit a test to make its own code pass — that inverts the check. The test of independence: *would this assertion exist, in this form, if you hadn't seen the implementation? Is it checking the SPEC, or the code?* Green ≠ correct — a passing suite the implementer authored against its own output proves nothing.
4
6
 
5
7
  ## A. Behavior inventory BEFORE any test is written
6
8
 
@@ -54,6 +56,8 @@ A generated test that passes on its first run is **unverified**, not done. The s
54
56
 
55
57
  One extra run per test file; fully automatable. A test that cannot be made to fail by breaking the code under test is testing nothing — delete or rewrite it. **Never waive this step** because "the code already works"; that waiver is exactly where vacuous AI-written tests are born. (Where the mutation gate in E runs on the same files, a killed mutant covering the test's target behavior satisfies this gate.)
56
58
 
59
+ **The independence dimension — "can it fail" is necessary but not sufficient.** A test can fail-on-mutation and still be a rubber stamp if its *oracle* came from the implementation rather than the spec. So for each assertion also ask: **would this assertion exist, in this form, if I hadn't read the implementation? Is it checking the SPEC or the code?** The expected value must trace to a requirement/acceptance criterion (cite it), not to "what the SUT currently returns" (that is the change-detector pattern, B). And when a test goes red, **never edit the test to make the agent's own code pass** — investigate which side is wrong; weakening the assertion to go green destroys the independence the gate exists to protect.
60
+
57
61
  ## E. Mutation gate on the diff
58
62
 
59
63
  Run mutation testing (Stryker) **incrementally on changed files** as part of accepting a generated suite:
@@ -15,6 +15,12 @@ The common sweet spot is a **Domain Model inside a modular monolith**. High scal
15
15
 
16
16
  Use the core subdomain's complexity from DOMAIN-MODEL. Apply per subdomain: the complex core may warrant a Domain Model (± Hexagonal); supporting/generic subdomains stay Transaction Script.
17
17
 
18
+ ### The floor — the cheap baseline, even for simple projects
19
+
20
+ Below the rungs sits a baseline that applies **even to simple/short-lived projects** and is explicitly **NOT full hexagonal**: **dependency inversion at *true external boundaries only* (DB, 3rd-party APIs, clock/IO) + a Functional Core / Imperative Shell shape (pure logic separated from side-effecting glue) + strong, independent tests.** This is the lighter discipline the senior voices on *both* sides of the "always hexagonal?" debate converge on — it delivers the pro-camp's real prize (day-one isolated testability) at a fraction of hexagonal's cost, with **no internal port ceremony**. Extract a real internal port only "when you feel the second adapter appearing."
21
+
22
+ Transaction Script is the domain-logic *organization* at the floor — it still sits **behind** this seam discipline. **Do not read "Transaction Script floor" as "no seams":** a CRUD app that reaches straight into the DB/3rd-party from everywhere, untestable without the real services, is **under-engineered even though it is simple** — it skipped the floor. The floor is the cheap minimum; the rungs below are about how much *domain-logic structure* to add on top of it.
23
+
18
24
  | Rung | Move up when | Over-engineering tell |
19
25
  |------|--------------|-----------------------|
20
26
  | **Transaction Script / simple layered CRUD** (floor) | "validate → persist → return"; few rules; supporting/generic subdomains | — |
@@ -69,13 +75,23 @@ The same tools run in reverse for a brownfield monolith you're decomposing — s
69
75
 
70
76
  **Over-engineering tells:** one-implementation ports; rich aggregates over structural CRUD; CQRS/ES with no asymmetry or audit need; "microservice envy"; distributed monolith; elaborate config/abstraction "wanting all options all the time."
71
77
 
72
- **Under-engineering tells:** the same invariant duplicated across many transaction scripts; a big-ball-of-mud monolith with no enforced module boundaries; a complex/regulated domain modeled as thin CRUD; no audit trail where compliance needs it; no ADRs / no fitness functions.
78
+ **Under-engineering tells:** the same invariant duplicated across many transaction scripts; a big-ball-of-mud monolith with no enforced module boundaries; a complex/regulated domain modeled as thin CRUD; **no seam at the true external boundaries (DB/3rd-party reached into from everywhere), so the code can't be tested without the real services — the Functional-Core/Imperative-Shell floor was skipped, even on a simple app**; no audit trail where compliance needs it; no ADRs / no fitness functions.
79
+
80
+ ## The AI-coding era moves the FLOOR up a notch (not the ceiling)
81
+
82
+ The evidence (agent-reliability studies) shifts *where the floor sits*, not whether the higher rungs are universal:
83
+ - **Seams pay from the first agent session.** Agent reliability collapses as codebase scale and files-touched-per-change rise, and the bottleneck is **architectural legibility, not context-window size** (bigger windows don't rescue large tangled repos). Clean boundaries help the *agent* navigate and verify — so the Functional-Core/Imperative-Shell floor + strong independent tests are worth more now, even on simple projects.
84
+ - **Get the *seams* right early; defer *feature* speculation.** AI is **bad at adding architectural seams later** (agent success on compound/architectural refactors is low — well under half in the agent-reliability studies) but **good at adding features later** — so cheap-AI *weakens* YAGNI for core seams (build them now) and *strengthens* it for speculative features (defer them).
85
+ - **Don't over-generate structure.** AI writes structure cheaply but **maintains it worse** (duplication rises, drift) — so this moves the floor *up a notch*, NOT toward always-hexagonal. Prefer **deep modules, not shallow many-file layering** (the "design for agents" consensus converges with classic good design on tests/interfaces/naming/deep-modules; it diverges only by penalizing indirection *depth*).
86
+ - **Tests must be *independent*.** Because agents reward-hack tests, the lever is test *independence + quality* (human-owned/agent-non-editable assertions; verify beyond green), not test-*first ordering*. See `test-strategy.md` / `ai-test-quality.md`.
87
+
88
+ Net: raise the floor (seams + independent tests earlier), keep the ceiling calibrated (Domain Model / full hexagonal / CQRS / ES still require their concrete signals — the originators themselves reserve them for the complex core).
73
89
 
74
- **The meta-tell (use this to settle every rung):** if you cannot point to a **current, concrete** requirement — a real second adapter or delivery mechanism, a real divergent-scaling component, a real second team, a real audit mandate, a real tenant-isolation mandate, a genuinely pure core isolated for test speed — that justifies a rung, you are **over-engineering**. If such a requirement exists and you ignored it, you are **under-engineering**.
90
+ **The meta-tell (use this to settle every rung *above the floor*; the floor itself is always-on and exempt — it is the baseline, not a rung):** if you cannot point to a **current, concrete** requirement — a real second adapter or delivery mechanism, a real divergent-scaling component, a real second team, a real audit mandate, a real tenant-isolation mandate, a genuinely pure core isolated for test speed — that justifies a rung, you are **over-engineering**. If such a requirement exists and you ignored it, you are **under-engineering**.
75
91
 
76
92
  ## Default baseline (when in doubt)
77
93
 
78
- **Modular monolith + Domain Model only in the complex core bounded contexts + Transaction Script in the simple/CRUD ones + ADRs + boundary fitness functions.** This is the modern consensus starting point.
94
+ **The floor (Functional-Core/Imperative-Shell + dependency inversion at true external boundaries + strong independent tests) everywhere — then modular monolith + Domain Model only in the complex core bounded contexts + Transaction Script in the simple/CRUD ones + ADRs + boundary fitness functions.** This is the modern consensus starting point: a cheap testable floor for all, rich structure only where complexity earns it.
79
95
 
80
96
  ## Always
81
97
 
@@ -0,0 +1,63 @@
1
+ # Brownfield Adaptation — Working an Existing Codebase
2
+
3
+ How GSD adapts when there is **already a codebase** (or existing infra/pipeline/tests), not a clean greenfield. The strategy chain and scout were greenfield-first by default; this reference is the brownfield half. Pairs with `engineering-standards.md` (the invariant bar), `recommend-architecture.md` (the evolution/strangler content), and the `map-codebase` outputs (`.planning/codebase/*.md`) it consumes.
4
+
5
+ ## The principle
6
+
7
+ **Assess what exists, then recommend an *evolution path* — never "design the ideal from scratch" and impose it.** Greenfield defines; brownfield reconciles. The strategy skills, in brownfield mode, read the existing reality first (the `map-codebase` maps + the code) and produce a *current-state → target → migration* recommendation, not a blank-slate decision.
8
+
9
+ ## Follow vs. improve vs. refactor (the decision matrix)
10
+
11
+ Three options, one matrix keyed by: **blast radius · reversibility · test-coverage-present · churn × centrality · am-I-already-editing-this.**
12
+
13
+ - **IMPROVE incrementally — the default.** Bring the code you touch up to the standard, scoped to the change. Most work lives here.
14
+ - **REFACTOR — reserved.** Only for **hotspots** (high churn × complexity, where the entrenchment tax compounds), and **only after characterization tests pin current behavior**. Gate it behind those tests; never refactor blind.
15
+ - **FOLLOW the existing pattern, even if poor — a deliberate, time-boxed tactic, never a silent default.** Choose it when blast radius is large, there is no coverage, the code is cold (rarely changes), or no target standard is agreed. Stay **locally consistent** — matching the surrounding code beats a lone island of "correct" — but do **not** propagate the bad pattern into *new* modules.
16
+
17
+ **The consistency rule:** make the target standard explicit *first*; **all new code follows the target**; only **old** patterns you actually touch get refactored; prefer **codemods** to convert at scale. "Be consistent with bad code" is *local* advice for the change region — not a mandate to spread it.
18
+
19
+ ## The safe-change sequence (Feathers)
20
+
21
+ Before changing legacy code: **change point → find a seam → break the dependency at the seam → write characterization tests (pin the *current* behavior, bugs included) → then change.** Where full coverage is too costly, keep new code clean with **Sprout** (write the new logic in a fresh tested unit, call it from the legacy code) or **Wrap** (wrap the legacy call to add behavior around it).
22
+
23
+ **Where to start testing:** the highest **churn × risk** hotspots from git history — you need a *local* safety net around the change region, not global coverage. Do not block work on a coverage percentage.
24
+
25
+ ## Architecture & domain on legacy
26
+
27
+ - **Do not "DDD the monolith"** (Evans: it almost always disappoints). Carve a **bubble context** around the core subdomain, defend it with an **Anti-Corruption Layer**, and grow it via **Strangler Fig** — new behavior to the clean core, old paths kept serving until strangled.
28
+ - **Incremental always beats big-bang rewrite** (rewrites lose near-universally). The real axis is big-bang vs incremental, not rewrite vs not. Even the genuine rewrite case (a whole technology generation is obsolete) is done as **incremental subsystem replacement**, never a stop-the-world rewrite.
29
+ - **Reverse-engineer, don't re-invent:** in brownfield, `model-domain` extracts the ubiquitous language and subdomains *from the existing code + docs* (names, modules, call paths) and reconciles them with the user's vision, flagging mismatches — it does not interview from a blank slate.
30
+
31
+ ## Keeping improvement bounded
32
+
33
+ - **Two hats (Beck):** never mix a structural change and a behavioral change in one diff — separate commits.
34
+ - **Preparatory refactoring:** refactor only enough to make the change easy ("make the change easy, then make the easy change"), not the whole neighborhood.
35
+ - **Note bigger problems and return** — record the larger refactor as a candidate (a roadmap/backlog item), don't chase it now. Unbounded improvement is scope creep.
36
+
37
+ ## The decision card (how to surface it — never impose)
38
+
39
+ When the assessment finds a gap between current code and the standard/strategy, present a **decision card** and let the human choose:
40
+
41
+ ```
42
+ [area] current: <what the code does today>
43
+ target: <the standard / strategy recommendation>
44
+ gap cost: <blast radius · churn/centrality · reversibility>
45
+ options: ▸ Follow (match existing, time-boxed) ▸ Improve (default, scoped to what we touch) ▸ Refactor (gated on characterization tests)
46
+ ```
47
+
48
+ **Default-select Improve.** Gate Refactor behind characterization tests. Record the choice. The framework recommends and surfaces cost; the user owns the appetite for change.
49
+
50
+ ## Anti-patterns
51
+
52
+ - Imposing a from-scratch ideal architecture/test/infra strategy on an existing system (ignoring the grain).
53
+ - Refactoring a hotspot with no characterization tests (changing behavior blind).
54
+ - Big-bang rewrite; "we'll clean it all up" sweeping refactors mixed with feature work.
55
+ - Propagating an existing bad pattern into new modules in the name of "consistency."
56
+ - DDD-ing the whole legacy monolith instead of carving a bubble context.
57
+ - Blocking work on a global coverage target instead of a local safety net.
58
+
59
+ *Basis: Feathers (Working Effectively with Legacy Code — seams, characterization tests, Sprout/Wrap); Fowler (Strangler Fig, preparatory refactoring, "make the change easy"); Beck (two hats, Tidy First); Evans (bubble context, ACL; "don't DDD the monolith"); Tornhill (churn×complexity hotspots); the big-bang-rewrite-loses consensus. Full citations in the research corpus.*
60
+
61
+ ## Consumes / produces
62
+
63
+ Consumed by the strategy chain (`model-domain`, `recommend-architecture`, `testing-strategy`, `infrastructure-strategy`, `cicd-strategy`) and the executor **when existing code is present**; reads `map-codebase`'s `.planning/codebase/*.md`. Pairs with `engineering-standards.md` (rule 3, "match what exists") and `recommend-architecture.md`'s evolution/strangler section.
@@ -16,8 +16,8 @@ Shared reference injected into the producing agents (`gsd-planner`, `gsd-phase-r
16
16
  ## The contract (behaviors — most load-bearing last)
17
17
 
18
18
  1. **Understand before you change.** Read the real code paths, the tests-as-spec, and — for this subdomain — the DOMAIN-MODEL classification and the ADR rung, before deciding or writing. Most defects are acting on an incomplete mental model, or building to the wrong structure.
19
- 2. **Build to the architecture, fully.** Implement the rung the ADR chose for this subdomain — its ports, aggregates, boundaries — properly and completely. Don't skip mandated structure "to keep it simple" (under-engineering); don't add un-mandated structure "to be robust" (over-engineering).
20
- 3. **Match what exists; deep modules, clear names.** Make new code look like it belongs — reuse the established patterns, names, and conventions (`gsd-pattern-mapper` supplies the analogs); don't add a second way to do something that already has one. Prefer a simple interface over real substance to many shallow pass-throughs (Ousterhout — and "extract into tiny functions" is genuinely contested; don't cargo-cult classitis). Descriptive names are one of the few empirically-validated quality levers.
19
+ 2. **Build to the architecture, fully — on top of the universal floor.** A cheap baseline applies to *every* project, even simple ones: dependency inversion at **true external boundaries only** (DB, 3rd-party, clock/IO) + a Functional-Core/Imperative-Shell shape (pure logic separable from side-effecting glue) + strong independent tests. That floor is part of the invariant bar — skipping it (reaching into the DB/3rd-party from everywhere, untestable) is under-engineering even on a CRUD app. *On top of that*, implement the rung the ADR chose for this subdomain — ports/aggregates/boundaries — properly. Don't skip mandated structure "to keep it simple" (under-engineering); don't add un-mandated internal ceremony "to be robust" (over-engineering). "Don't add un-mandated structure" (rule 4) means the *ceremony above the floor*, never the floor itself.
20
+ 3. **Match what exists; deep modules, clear names.** Make new code look like it belongs — reuse the established patterns, names, and conventions (`gsd-pattern-mapper` supplies the analogs); don't add a second way to do something that already has one. On EXISTING code, "match what exists" is governed by the follow/improve/refactor matrix in `@~/.claude/gsd-core/references/brownfield-adaptation.md`: match the local grain by default, improve what you touch, and refactor only hotspots under characterization tests — never propagate a bad existing pattern into NEW code in the name of consistency. Prefer a simple interface over real substance to many shallow pass-throughs (Ousterhout — and "extract into tiny functions" is genuinely contested; don't cargo-cult classitis). Descriptive names are one of the few empirically-validated quality levers.
21
21
  4. **Don't invent un-mandated structure (but build everything the ADR mandates).** Prefer duplication to a *speculative* abstraction — duplication is far cheaper than the wrong abstraction (Metz); wait until the right shape is obvious, inline it back when it stops fitting (AHA); DRY is about *knowledge*, not coincidentally-similar code. Skip speculative layers/options/config for imagined futures (YAGNI; Fowler's cost-of-carry). None of this overrides the architecture's *deliberate, required* abstractions — a mandated port/aggregate is not speculative; declining it is under-engineering, not YAGNI.
22
22
  5. **Bounded boy-scout.** Improve what you touch when the change needs it and it's honestly better; keep structural changes in a *separate* commit from behavioral ones (Beck's two hats); flag larger refactors instead of silently sprawling — unbounded refactoring is scope creep.
23
23
  6. **Correct AND complete — edge cases first.** Before coding, enumerate the edge cases and failure modes the behavior must handle; a change that serves only the happy path is unfinished. This completeness is what makes code simple — not doing less.
@@ -1,183 +1,117 @@
1
- # Codebase scout — map selection + scaled deep-scout protocol
1
+ # Codebase scout — mandatory phase exploration via parallel agents
2
2
 
3
3
  > Lazy-loaded reference for the `scout_codebase` step in
4
- > `workflows/discuss-phase.md` (extracted via #2551 progressive-disclosure
5
- > refactor). Read this when the workflow needs to scout the codebase before
6
- > identifying gray areas.
7
- >
8
- > The scout has **two paths**. Assess depth FIRST (see "Depth assessment"),
9
- > then run exactly one:
10
- > - **Shallow path** (default for trivial/reversible phases) — the light
11
- > inline map-selection scan below. Cheap.
12
- > - **Deep path** (complex / high-blast-radius / hard-to-reverse / touches
13
- > the core) — parallel read-only explorers on distinct lenses, then an
14
- > orchestrator confirm-or-refute gate. ~15× the tokens; earned, not default.
15
- >
16
- > Both feed the same internal `<codebase_context>`. The deep path adds a
17
- > VERIFIED-vs-INFERRED split and a reconciled, evidence-backed understanding.
18
-
19
- ## Depth assessment (run first — gates the cheap vs deep path)
20
-
21
- Do NOT always-max. Multi-agent scouting burns ~15× the tokens of a single
22
- read; spend it only where blast-radius justifies it. Score the phase on four
23
- axes (from the phase title, ROADMAP entry, and prior context):
24
-
25
- | Axis | Light end (→ shallow) | Heavy end (→ deep) |
26
- |---|---|---|
27
- | **Complexity** | one file, clear-cut, CRUD | cross-cutting, many moving parts |
28
- | **Blast radius** | isolated, leaf code | touches the core / shared infra / many call sites |
29
- | **Reversibility** | trivially revertable | hard to reverse (schema, public API, data migration, payments) |
30
- | **Decomposability** | n/a | understanding splits cleanly into independent angles |
31
-
32
- **Routing rule:**
33
- - **All axes light** → **shallow path**. Run the map-selection scan only.
34
- - **Any of: high complexity, high blast-radius, low reversibility, or "touches
35
- the core"** → **deep path** (parallel lenses + confirm-or-refute gate).
36
- - **Tightly-coupled / sequential understanding** (the angles can't be explored
37
- independently) → prefer **one deep continuous-context pass** over many
38
- parallel explorers (Cognition: parallel workers drift on conflicting
39
- assumptions). Use the deep path's verification gate, but a single explorer.
40
-
41
- **Overrides:** an explicit `--deep` forces the deep path; `--shallow` forces
42
- the light path. With no flag, the routing rule above decides. Log the resolved
43
- path and the axis that triggered it (e.g. `[scout] deep — schema change,
44
- low reversibility`).
45
-
46
- ## Shallow path — phase-type → recommended maps
47
-
48
- The cheap, single-pass scan. Read 2–3 maps based on inferred phase type. Do
49
- NOT read all seven — that inflates context without improving discussion
50
- quality.
4
+ > `workflows/discuss-phase.md`. Read this before scouting a phase.
51
5
 
52
- | Phase type (infer from title + ROADMAP entry) | Read these maps |
53
- |---|---|
54
- | UI / frontend / styling / design | CONVENTIONS.md, STRUCTURE.md, STACK.md |
55
- | Backend / API / service / data model | STACK.md, ARCHITECTURE.md, INTEGRATIONS.md |
56
- | Integration / third-party / provider | STACK.md, INTEGRATIONS.md, ARCHITECTURE.md |
57
- | Infrastructure / DevOps / CI / deploy | STACK.md, ARCHITECTURE.md, INTEGRATIONS.md |
58
- | Testing / QA / coverage | TESTING.md, CONVENTIONS.md, STRUCTURE.md |
59
- | Documentation / content | CONVENTIONS.md, STRUCTURE.md |
60
- | Mixed / unclear | STACK.md, ARCHITECTURE.md, CONVENTIONS.md |
6
+ ## Exploring the phase with parallel agents is MANDATORY always
61
7
 
62
- Read CONCERNS.md only if the phase explicitly addresses known concerns or
63
- security issues.
8
+ **If you are in discuss-phase, you WILL spawn dedicated parallel read-only explorer agents to explore this phase. This is not optional, not gated on how much you already know, and not something you can satisfy by reading a few files in your own context.** What scales is the *number* of explorers and *which* lenses — never *whether* you explore.
64
9
 
65
- ## Single-read rule
10
+ (Proportionality lives at the command boundary, not here: genuinely trivial work routes through `/gsd:fast` or `/gsd:quick`. The moment you are running discuss-phase, the phase is substantive enough to plan — so it is substantive enough to explore. There is no "too small to scout" case inside discuss-phase.)
66
11
 
67
- Read each map file in a **single** Read call. Do not read the same file at
68
- two different offsets — split reads break prompt-cache reuse and cost more
69
- than a single full read.
12
+ ### Why so you respect it instead of routing around it
70
13
 
71
- ## No-maps fallback
14
+ - **You are exploring THIS phase specifically — not reusing the global context you think you have.** Whatever you already know is about the project in general; the phase needs targeted grounding on the exact code paths, data flows, invariants, and failure modes it will touch. General context is not phase context.
15
+ - **Exploration COMPLEMENTS and STRESS-TESTS what you know — it does not repeat it.** Its job is to surface what you *don't* know and to verify what you *think* you know against ground truth. The points you're most confident about are the ones most worth checking — confidence is where stale assumptions hide.
16
+ - **The agent that "already has the context" is the highest-risk one.** ~20–30% of even careful single-context conclusions don't survive a check against raw tool output (stale memory, an assumed call path, a pattern that changed). Skipping exploration because you feel informed is exactly how a wrong premise gets locked into CONTEXT.md and propagated through plan → execute.
17
+ - **This is the highest-leverage moment in the whole workflow.** Decisions locked here become canonical references the planner and executor must follow. A cheap, thorough exploration now prevents expensive rework later. Thoroughness beats tokens here, always.
18
+ - **Parallel agents beat one big pass** because independent lenses see what a single sweep misses, and isolating each lens in its own context keeps each one sharp.
72
19
 
73
- If `.planning/codebase/*.md` does not exist:
74
- 1. Extract key terms from the phase goal (e.g., "feed" → "post", "card",
75
- "list"; "auth" → "login", "session", "token")
76
- 2. `grep -rlE "{term1}|{term2}" src/ app/ --include="*.ts" ...` (use `-E`
77
- for extended regex so the `|` alternation works on both GNU grep and BSD
78
- grep / macOS), and `ls` the conventional component/hook/util dirs
79
- 3. Read the 3–5 most relevant files
20
+ ### Rationalization-killers every one of these is INVALID
80
21
 
81
- ## Output (internal `<codebase_context>`)
22
+ If you catch yourself thinking any of these, STOP — you are about to under-explore. Spawn the explorers anyway:
23
+
24
+ | The rationalization | Why it's invalid |
25
+ |---|---|
26
+ | "I already have the context from committed docs / the maps." | Docs and maps are *grounding input you feed the explorers* — never a substitute for exploring the phase. They're also potentially stale; the explorers verify them. |
27
+ | "It's mostly already mapped." | "Mostly" and "this phase, verified, right now" are different. Explore the delta and re-verify the rest. |
28
+ | "I know this codebase well." | Then exploration is cheap and confirms it. If it doesn't confirm it, you needed it. |
29
+ | "The angles are tightly coupled, so I'll just do it inline myself." | Tight coupling means *fewer explorers or one dedicated explorer* — never inline-in-your-own-context. A dedicated exploration pass is still mandatory. |
30
+ | "It's a small/clear phase." | Then it routed to /gsd:fast or /gsd:quick. You're in discuss-phase, so it isn't. |
31
+ | "I'll save tokens." | Re-work from a wrong premise costs far more than the exploration. Thoroughness over tokens — the user has said this repeatedly. |
32
+
33
+ **Violating the letter of this (a couple of inline greps labelled "deep scout") is violating the spirit.** The deliverable is dedicated parallel explorers + the confirm-or-refute gate, every time.
82
34
 
83
- From the scan, identify:
84
- - **Reusable assets** — components, hooks, utilities usable in this phase
85
- - **Established patterns** — state management, styling, data fetching
86
- - **Integration points** — routes, nav, providers where new code connects
87
- - **Creative options** — approaches the architecture enables or constrains
35
+ ## Mode (detect FIRST — *what* to scout; never *whether*)
88
36
 
89
- Used in `analyze_phase` and `present_gray_areas`. NOT written to a file
90
- session-only.
37
+ Detect per change-site, not per repo (a brownfield repo can have a brand-new module; a greenfield repo can carry rich docs). Mode picks the lenses; it never licenses skipping exploration.
91
38
 
92
- On the **deep path**, this same `<codebase_context>` additionally carries a
93
- `VERIFIED facts` / `INFERRED facts` split and the explicit open questions left
94
- at the stop (see below).
39
+ - **Brownfield** real source exists for this area **code-comprehension** lenses (below). Depth-bounded: localize to spans + blast radius.
40
+ - **Greenfield** — little/no code yet (new project / foundational phase) → **external pre-research** lenses: verify *current* library/API versions & signatures against the web (training memory is stale — ~1 in 5 generated packages is hallucinated; post-cutoff APIs run <50% without docs), find reference implementations, surface pitfalls, reconcile locked decisions against ground truth. Breadth-bounded: ~5 good sources optimal. **Never skip because "there's no code to read" — that conflation is the classic foundational-rework miss.**
41
+ - **Greenfield-with-docs** — thin code but real specs/research exist → honor the docs the way brownfield honors tests; reconcile against them.
95
42
 
96
- ---
43
+ Mixed phases run both disciplines.
97
44
 
98
- ## Deep path parallel read-only explorers on distinct lenses
45
+ ## Breadth assessment (how MANY explorers not whether)
99
46
 
100
- Run this only when the depth assessment routed here. Reuses the exact
101
- parallel-dispatch machinery the advisor mode (`modes/advisor.md`
102
- `advisor_research`) and `discuss-phase-assumptions.md` already use: spawn
103
- `Agent()` calls simultaneously, then synthesize. Do NOT invent new machinery.
47
+ Always spawn explorers; scale the count to the phase:
48
+ - **Focused phase** (one subsystem, clear boundary) → 2 lenses.
49
+ - **Sprawling / high-blast-radius / hard-to-reverse / touches-the-core** → 3–4+ lenses.
50
+ - **Tightly-coupled / sequential** (angles can't be explored independently — Cognition: parallel workers drift on conflicting assumptions) **one dedicated deep explorer** with continuous context, still read-only, still through the confirm-or-refute gate. This is "fewer explorers," NOT "do it inline."
104
51
 
105
- **Read-only is a hard constraint.** Explorers answer questions; they never
106
- write code (Cognition: parallel writers amplify conflicting decisions; parallel
107
- read-only scouts are safe). They glob/grep/read/inspect git history only.
52
+ `--deep` forces the maximal fan-out; `--shallow` reduces to the 2-lens minimum (it does NOT mean "no agents"). Log the resolved breadth + the trigger (e.g. `[scout] 4 lenses — touches the core, low reversibility`).
108
53
 
109
- ### Lens menu — pick 2–4 relevant to the phase type
54
+ ## Lens menus — pick the relevant ones (each explorer owns a DISTINCT angle)
110
55
 
111
- Spawn one explorer per lens, each owning a DISTINCT angle (not N identical
112
- agents — that just duplicates work). Give each Anthropic's 4-part contract:
113
- objective, output format, tools/sources, and boundaries ("don't cover X —
114
- that's another explorer's job").
56
+ Give each explorer Anthropic's 4-part contract: objective, output format, tools/sources, boundaries ("don't cover X — another explorer owns it"). Feed each the grounding input (existing maps/docs) as starting context, with instructions to *verify and go deeper*, not restate.
57
+
58
+ ### Brownfield lenses (existing code)
115
59
 
116
60
  | Lens | Owns | Pick when |
117
61
  |---|---|---|
118
62
  | **by-layer / structure & dependency** | architecture, modules, who-calls-whom, the real dependency graph | almost always (the spine) |
119
- | **by-data-flow** | how data enters / transforms / exits; invariants; persistence | data models, schema, state, pipelines |
63
+ | **by-data-flow** | how data enters/transforms/exits; invariants; persistence | data models, schema, state, pipelines |
120
64
  | **by-control-flow / call-path** | real execution paths, entry points (traced, not assumed) | behavior changes, request/handler flows |
121
- | **by-goals/intent + tests-as-spec** | what it's supposed to do; tests read as the executable spec; existing conventions to imitate | any phase modifying established behavior |
65
+ | **by-goals/intent + tests-as-spec** | what it should do; tests as the executable spec; conventions to imitate | any phase modifying established behavior |
122
66
  | **by-failure-mode** | error handling, edge cases, where it breaks, retry/timeout paths | high-blast-radius, low-reversibility, infra |
123
67
 
124
- Default selection: **by-layer + by-goals/tests** for most phases; add
125
- **by-data-flow** for data/schema work, **by-control-flow** for behavior work,
126
- **by-failure-mode** for high-blast-radius/infra.
68
+ ### Greenfield lenses (external pre-research area has no code yet)
69
+
70
+ | Lens | Owns | Pick when |
71
+ |---|---|---|
72
+ | **by-ecosystem / version ground-truth** | current lib/framework versions, signatures, breaking changes vs stale memory; deps exist & maintained | always — the greenfield spine |
73
+ | **by-reference-implementation** | how comparable real systems structure this — layout, seams, gotchas | foundational scaffolding; unfamiliar domain |
74
+ | **by-canonical-doc reconciliation** | the project's locked decisions (spec/ADR/STACK) checked against current ground truth | a strategy chain produced artifacts |
75
+ | **by-pitfalls / failure-modes** | footguns, deprecations, security advisories for the chosen stack | high-blast-radius foundational choices |
76
+
77
+ ### Grounding input (feed it; don't let it replace exploration)
78
+
79
+ Maps in `.planning/codebase/*.md` (if present) and design docs are *inputs* to the explorers. If no maps exist for a brownfield area, the explorers build understanding directly (glob/grep/read/git-history). Reading a map is never the exploration — it's the starting point the explorers verify and extend.
80
+
81
+ When maps exist, select which to feed by phase type (don't dump all seven):
82
+
83
+ | Phase type (infer from title + ROADMAP entry) | Read these maps |
84
+ |---|---|
85
+ | UI / frontend / styling / design | CONVENTIONS.md, STRUCTURE.md, STACK.md |
86
+ | Backend / API / service / data model | STACK.md, ARCHITECTURE.md, INTEGRATIONS.md |
87
+ | Integration / third-party / provider | STACK.md, INTEGRATIONS.md, ARCHITECTURE.md |
88
+ | Infrastructure / DevOps / CI / deploy | STACK.md, ARCHITECTURE.md, INTEGRATIONS.md |
89
+ | Testing / QA / coverage | TESTING.md, CONVENTIONS.md, STRUCTURE.md |
90
+ | Documentation / content | CONVENTIONS.md, STRUCTURE.md |
91
+ | Mixed / unclear | STACK.md, ARCHITECTURE.md, CONVENTIONS.md |
92
+
93
+ Read each map in a **single** Read call — never split reads of the same file at two offsets (it breaks prompt-cache reuse and costs more than one full read).
127
94
 
128
95
  ### Explorer contract (required of every lens)
129
96
 
130
- Each explorer MUST return:
131
- - **Load-bearing claims with `file:line` citations** — and, for counts or
132
- exhaustiveness claims, the command + its raw output. No bare assertions.
133
- - **An honest coverage statement** — "searched 2 of 5 dirs", not
134
- "searched all locations". Partial is fine; lying-by-omission is not.
135
- - **A VERIFIED vs INFERRED split** — facts read from tool output vs
136
- conclusions drawn from them.
97
+ Each explorer MUST return: **load-bearing claims with `file:line` (or `url`+access-date) citations** — counts/exhaustiveness claims include the command + raw output; **an honest coverage statement** ("searched 2 of 5 dirs", not "searched all"); **a VERIFIED vs INFERRED split**.
137
98
 
138
- > **ORCHESTRATOR RULE — CODEX RUNTIME**: After calling all Agent() calls above
139
- > to spawn explorers, do NOT independently scout or analyze the codebase while
140
- > the subagents are active. Wait for all explorers to return before
141
- > synthesizing. This prevents duplicate work and wasted context.
99
+ > **ORCHESTRATOR RULE — CODEX RUNTIME**: After spawning the Agent() explorers, do NOT independently scout/analyze while they run. Wait for all to return, then synthesize. (Reuses the exact parallel-dispatch machinery in `modes/advisor.md` and `discuss-phase-assumptions.md` — do not invent new machinery.)
142
100
 
143
101
  ## Orchestrator confirm-or-refute gate (the load-bearing part)
144
102
 
145
- ~20–30% of subagent reports contain at least one claim that doesn't match the
146
- underlying tool output, and self-verification is unreliable. So the
147
- orchestrator does NOT trust the explorers' prose. After all explorers return,
148
- before adopting anything into `<codebase_context>`:
149
-
150
- 1. **Spot-check the highest-risk / load-bearing claims against RAW TOOL
151
- OUTPUT** re-read the actual file at the cited `file:line`, re-run the
152
- grep/count yourself. Check claims against the tool output, never against the
153
- agent's summary. Prioritize: inflated/exhaustiveness claims ("found 15
154
- files", "searched all", "applied successfully"), and any claim a gray area
155
- or plan would hang on. You need not re-verify everything — re-derive the
156
- pivotal facts.
157
- 2. **Require citations for load-bearing claims.** A load-bearing claim with no
158
- `file:line` (or no command+output) is treated as INFERRED, not VERIFIED,
159
- until you ground it yourself.
160
- 3. **Reconcile contradictions — mandatory, never silently pick.** When two
161
- explorers disagree (e.g. "3 call sites" vs "5 call sites"), surface the
162
- contradiction and resolve it against ground truth (re-grep), don't merge or
163
- pick one quietly.
164
- 4. **Separate VERIFIED from INFERRED** in the resulting `<codebase_context>`.
165
- Carry both forward labelled — never collapse inference into fact.
166
- 5. **Never summarize a summary.** On any disputed or load-bearing point,
167
- re-ground at the source rather than paraphrasing an explorer's paraphrase.
103
+ Self-verification is unreliable and ~20–30% of subagent reports contain a claim that doesn't match tool output — so do NOT trust explorer prose. Before adopting anything into `<codebase_context>`:
104
+
105
+ 1. **Spot-check the highest-risk / load-bearing claims against RAW TOOL OUTPUT** — re-read the cited `file:line`, re-run the grep/count yourself. Check the tool output, never the summary. Prioritize inflated/exhaustiveness claims and anything a gray area or plan will hang on. Re-derive the pivotal facts; you needn't re-verify everything.
106
+ 2. **Require citations for load-bearing claims** — an uncited one is INFERRED, not VERIFIED, until you ground it.
107
+ 3. **Reconcile contradictions — mandatory, never silently pick.** Two explorers disagree → resolve against ground truth (re-grep), don't merge or pick quietly.
108
+ 4. **Separate VERIFIED from INFERRED** in `<codebase_context>` carry both labelled; never collapse inference into fact.
109
+ 5. **Never summarize a summary** on any disputed/load-bearing point, re-ground at the source.
168
110
 
169
111
  ## Sufficiency stop (avoid analysis paralysis)
170
112
 
171
- State the stop criterion explicitly. Stop scouting and proceed to
172
- `analyze_phase` when **all three** hold:
173
- - **Confidence threshold met** — the chosen lenses are covered and the real
174
- call-path / data-flow for the change site was traced, not assumed. Satisfice
175
- (~good enough to decide), don't optimize.
176
- - **Budget cap not exceeded** — within the tool-call/explorer budget for this
177
- task class (cheap fact-finding gets few; complex/high-blast gets more).
178
- - **Saturation** — the last round surfaced **no new load-bearing facts AND all
179
- contradictions are resolved.**
180
-
181
- Record any remaining open questions explicitly (bounded, named) and carry them
182
- into `<codebase_context>` — don't keep scouting to chase curiosity. Only once
183
- the gate passes is the understanding "sufficient AND verified".
113
+ Stop and proceed to `analyze_phase` when **all three** hold: **confidence threshold met** (lenses covered; the real call-path/data-flow for the change site traced, not assumed; satisfice, don't optimize); **budget cap not exceeded** (scaled to the phase class); **saturation** (last round surfaced no new load-bearing facts AND all contradictions resolved). Record remaining open questions explicitly (bounded, named) and carry them into `<codebase_context>`. Only then is the understanding "sufficient AND verified".
114
+
115
+ ## Output (internal `<codebase_context>`)
116
+
117
+ Reusable assets · established patterns · integration points · creative options the architecture enables/constrains — plus the **VERIFIED / INFERRED** split and the named open questions. Used in `analyze_phase` and `present_gray_areas`; session-only, not written to a file.
@@ -35,7 +35,13 @@ Pure, dependency-light, logic-dense code: **money/currency (integer minor units
35
35
 
36
36
  ## TDD — honestly
37
37
 
38
- The quality benefit is real, but the evidence favors **small, uniform increments** over *test-first specifically* (controlled studies found test-first vs test-after didn't differ; granularity did). So **mandate: behavior-level tests + small increments + a regression floor.** Keep the **RED step** (watch a test fail) so tests actually test something. Test-first vs test-after is a **knob** (`workflow.tdd_mode`), not dogma.
38
+ The defensible mandate is **always strong, INDEPENDENT tests** not "always TDD, even for simple." The evidence is clear that what drives the quality benefit is **granularity + test existence + independence**, not test-first *ordering* (controlled studies found test-first vs test-after didn't differ; granularity did; the design-improvement claim for ordering is contested, not established). So **mandate: behavior-level tests + small increments + a regression floor.** Keep the **RED step** (watch a test fail) so tests actually test something.
39
+
40
+ **Default to test-first for AI-WRITTEN code — but on anti-gaming/independence grounds, not "test-first improves design."** When an agent authors the implementation, writing the spec/test first stops it from shaping tests to its own output: agents reward-hack tests, and a large share of "passing" agent patches diverge from ground truth (green ≠ correct). The lever is an *independent*, human-anchored check the agent did not write to its own convenience (see `ai-test-quality.md`) — not the ordering ritual. Test-first vs test-after stays a **knob** (`workflow.tdd_mode`); the independence requirement does not.
41
+
42
+ **Skip TDD for** UI/visual components, prototypes/spikes, glue/wiring, and trivial CRUD — correctness there is visual, throwaway, or too thin to assert cheaply (both pro- and anti-TDD camps agree). Still cover them with the cheapest check that fits (visual/snapshot, integration smoke).
43
+
44
+ **Green ≠ correct — verify beyond green.** A passing suite is a *lower bound* on correctness, not proof. For critical paths pair it with mutation/property/differential testing + human UAT.
39
45
 
40
46
  ## Persistent vs transient E2E
41
47
 
@@ -4,6 +4,7 @@ Recommend a CI/CD strategy matched to the test strategy, the target infrastructu
4
4
 
5
5
  <required_reading>
6
6
  @~/.claude/gsd-core/references/cicd-strategy.md
7
+ @~/.claude/gsd-core/references/brownfield-adaptation.md
7
8
  @~/.claude/gsd-core/templates/cicd-strategy.md
8
9
  </required_reading>
9
10
 
@@ -36,8 +37,12 @@ cat .planning/TEST-STRATEGY.md 2>/dev/null || true
36
37
  cat .planning/INFRA-STRATEGY.md 2>/dev/null || true
37
38
  cat .planning/adr/*.md 2>/dev/null || true
38
39
  cat .planning/PROJECT.md 2>/dev/null || true
40
+ cat .planning/codebase/STACK.md 2>/dev/null || true
41
+ ls .planning/codebase/STACK.md >/dev/null 2>&1 && echo "BROWNFIELD" || echo "GREENFIELD"
39
42
  ```
40
43
 
44
+ **Brownfield mode (existing pipeline).** If `BROWNFIELD` (a `map-codebase` run produced `.planning/codebase/STACK.md`, or CI config already exists), **read `@~/.claude/gsd-core/references/brownfield-adaptation.md` now** and recommend a **transition plan**, not a rip-and-replace: **audit the existing pipeline** (from STACK.md + the CI config) against the tier→stage map (Step 5), the **≤10-min PR budget**, and the **pinned-`sub` OIDC / secrets** standard (Step 4). Recommend incremental fixes via **decision cards** (current → target → gap cost → Follow / Improve / Refactor), e.g. "15-min gate → split to PR-smoke + nightly," "long-lived service-account key → OIDC federation," "moved-tag actions → SHA-pin." **Default-select Improve**; sequence the changes (don't enable everything at once). Greenfield (no map, no pipeline) keeps the from-scratch pipeline design below as the default.
45
+
41
46
  **Read `@~/.claude/gsd-core/references/cicd-strategy.md` now** — it defines the GHA-default platform decision, OIDC-with-pinned-`sub`, the secrets split, the tier→stage mapping with the ≤10-min PR budget, the flaky canon, the merge-queue trigger, the deployment ladder, the free-six supply-chain table stakes, and the anti-pattern table.
42
47
 
43
48
  **Grounding maturity governs elicitation depth.** When upstream artifacts (spec, ADR, strategies, research) already answer a step, draft-from-docs and present for confirmation — cite the source, don't re-interview. Reserve questions for genuine decision points and contradictions. Honor a posture stated in `$ARGUMENTS` without re-asking.
@@ -279,15 +279,13 @@ Parse JSON for: `todo_count`, `matches[]` (each with `file`, `title`, `area`, `s
279
279
  </step>
280
280
 
281
281
  <step name="scout_codebase">
282
- Scan existing code to inform gray area identificationdepth SCALED to the phase, not always-max.
282
+ Explore THIS phase before identifying gray areas, by spawning **dedicated parallel read-only explorer agents**. **This is MANDATORY and is NOT gated on how much context you already have** you are exploring this phase specifically (to complement and verify what you think you know), not reusing global context. Reading maps/docs in your own context is NOT a substitute. (Trivial work routes to /gsd:fast or /gsd:quick; if you're in discuss-phase, the phase warrants exploration.)
283
283
 
284
- Read `@~/.claude/gsd-core/references/scout-codebase.md` — it contains the depth assessment, the shallow map-selection scan, the deep-path lens menu, the orchestrator confirm-or-refute gate, and the sufficiency stop. Then:
284
+ **Read `@~/.claude/gsd-core/references/scout-codebase.md` now** why-it's-mandatory, rationalization-killers, mode detection, lens menus, breadth scaling, confirm-or-refute gate, sufficiency stop. Then:
285
285
 
286
- 1. **Assess depth first** (reference's "Depth assessment"). Score the phase on complexity × blast-radius × reversibility × decomposability. `--deep`/`--shallow` in $ARGUMENTS override; otherwise auto-route. Log the resolved path and the trigger.
287
-
288
- 2. **Shallow path** (default trivial/reversible phases, ~10% context): `ls .planning/codebase/*.md`, select 2–3 maps via the reference's table (or grep fallback if none exist), build internal `<codebase_context>`.
289
-
290
- 3. **Deep path** (complex / high-blast-radius / hard-to-reverse / touches-the-core): spawn parallel READ-ONLY explorers on 2–4 distinct lenses, reusing the advisor-mode/`discuss-phase-assumptions` parallel `Agent()` dispatch + ORCHESTRATOR RULE (do not invent new machinery). Each returns `file:line`-cited claims, honest coverage, and a VERIFIED-vs-INFERRED split. Then run the reference's **confirm-or-refute gate** (spot-check highest-risk claims against RAW TOOL OUTPUT not prose; reconcile contradictions against ground truth; keep VERIFIED separate from INFERRED; never summarize-a-summary) and stop per its sufficiency criterion. Build `<codebase_context>` from the reconciled, evidence-backed understanding.
286
+ 1. **Detect mode** (brownfield / greenfield / greenfield-with-docs) picks *which lenses*, never *whether*.
287
+ 2. **Spawn parallel read-only explorers** — count scales with breadth (2 focused → 4+ sprawling; `--shallow` = 2-lens floor, never zero; tightly-coupled = one dedicated explorer, never inline). Reuse the advisor / `discuss-phase-assumptions` `Agent()` dispatch + ORCHESTRATOR RULE; don't invent machinery. Existing maps/docs are grounding INPUT to verify, not the answer. Each returns cited (`file:line`/`url`) claims, honest coverage, VERIFIED-vs-INFERRED.
288
+ 3. **Confirm-or-refute gate** — spot-check highest-risk claims against RAW TOOL OUTPUT not prose; reconcile contradictions against ground truth; keep VERIFIED vs INFERRED; never summarize-a-summary. Stop per the sufficiency criterion; build `<codebase_context>` from the reconciled evidence.
291
289
  </step>
292
290
 
293
291
  <step name="analyze_phase">
@@ -649,6 +649,7 @@ increases monotonically across waves. `{status}` is `complete` (success),
649
649
  ` : ''}
650
650
  - ${PROJECT_ROOT}/CLAUDE.md (Project instructions, if exists — follow project-specific guidelines and coding conventions)
651
651
  - ${PROJECT_ROOT}/.claude/skills/ or ${PROJECT_ROOT}/.agents/skills/ (Project skills, if either exists — list skills, read SKILL.md for each, follow relevant rules during implementation)
652
+ - ${PROJECT_ROOT}/.planning/adr/*.md, ${PROJECT_ROOT}/.planning/DOMAIN-MODEL.md, ${PROJECT_ROOT}/.planning/TEST-STRATEGY.md (if they exist — the chosen architecture rung and test levels; any autonomous deviation must honor them: build to the rung fully, neither under- nor over-engineering it, per @~/.claude/gsd-core/references/engineering-standards.md)
652
653
  </files_to_read>
653
654
 
654
655
  ${AGENT_SKILLS}
@@ -5,6 +5,7 @@ Recommend an infrastructure strategy matched to the project's actual traffic sha
5
5
  <required_reading>
6
6
  @~/.claude/gsd-core/references/infrastructure-strategy.md
7
7
  @~/.claude/gsd-core/references/data-environments.md
8
+ @~/.claude/gsd-core/references/brownfield-adaptation.md
8
9
  @~/.claude/gsd-core/templates/infra-strategy.md
9
10
  </required_reading>
10
11
 
@@ -37,8 +38,12 @@ cat .planning/PRODUCT-BRIEF.md 2>/dev/null || true
37
38
  cat .planning/REQUIREMENTS.md 2>/dev/null || true
38
39
  cat .planning/adr/*.md 2>/dev/null || true
39
40
  cat .planning/TEST-STRATEGY.md 2>/dev/null || true
41
+ cat .planning/codebase/STACK.md 2>/dev/null || true
42
+ ls .planning/codebase/STACK.md >/dev/null 2>&1 && echo "BROWNFIELD" || echo "GREENFIELD"
40
43
  ```
41
44
 
45
+ **Brownfield mode (existing infra).** If `BROWNFIELD` (a `map-codebase` run produced `.planning/codebase/STACK.md`, or infra/config already exists), **read `@~/.claude/gsd-core/references/brownfield-adaptation.md` now** and recommend right-size-or-migrate, not a from-scratch build: assess the existing infra (from STACK.md + config — current compute rung, data layer, IaC, CI) **against the ladder**, e.g. "currently self-managed K8s at low utilization → the CAST-AI data says serverless-containers is cheaper → here's the decision card." Recommend evolution via **strangler** — route *new* workloads to the target rung and migrate existing ones incrementally — **never a big-bang re-platform**. Surface each gap as a **decision card** (current → target → gap cost: **cost · blast radius · reversibility** → Follow / Improve / Refactor-migrate) and **default-select Improve**. Walk Steps 4–7 below as the current-state→target reconciliation rather than a blank-slate walk. Greenfield (no map, no infra) keeps the from-scratch ladder walk as the default.
46
+
42
47
  **Read `@~/.claude/gsd-core/references/infrastructure-strategy.md` now** — it defines the compute ladder with quantified move-up triggers, the crossover numbers (Fargate-vs-EC2, the CAST AI utilization data, the <4-engineers floor), the per-cloud asymmetries and equivalences table, the observability floor, the when-you-actually-need triggers, the IaC floor, the anti-patterns, and the meta-tell.
43
48
 
44
49
  **Grounding maturity governs elicitation depth.** When upstream artifacts (spec, ADR, strategies, research) already answer a step, draft-from-docs and present for confirmation — cite the source, don't re-interview. Reserve questions for genuine decision points and contradictions. Honor a posture stated in `$ARGUMENTS` without re-asking.
@@ -124,6 +129,8 @@ INFRA-STRATEGY.md written — infrastructure matched to traffic shape and team s
124
129
  Next: /gsd:cicd-strategy (pipelines + deploy targets will follow this strategy)
125
130
  ```
126
131
 
132
+ **Roadmap reconciliation:** ROADMAP.md predates this strategy. Scan it against the compute rungs and data layer — if a phase assumes compute, a data store, or an environment this strategy invalidates or reshapes (e.g. a phase planning a self-managed cluster now dropped to serverless containers, or one missing the IaC/observability floor this strategy mandates), SAY SO explicitly and offer `/gsd:phase --edit` (or a roadmap refresh — the roadmapper re-reads discovery artifacts). Never leave a known strategy↔roadmap contradiction unspoken.
133
+
127
134
  </process>
128
135
 
129
136
  <critical_rules>
@@ -4,6 +4,7 @@ Establish strategic Domain-Driven Design foundations for a greenfield project: a
4
4
 
5
5
  <required_reading>
6
6
  @~/.claude/gsd-core/references/domain-modeling.md
7
+ @~/.claude/gsd-core/references/brownfield-adaptation.md
7
8
  @~/.claude/gsd-core/templates/domain-model.md
8
9
  </required_reading>
9
10
 
@@ -45,6 +46,20 @@ From the project docs, build an internal draft:
45
46
  - Which nouns/verbs recur (candidate ubiquitous-language terms)?
46
47
  - Are there distinct business areas or user roles (candidate subdomains)?
47
48
 
49
+ **Brownfield mode (existing code).** Greenfield (define from docs/vision) is the default below. But when `.planning/codebase/*.md` exists OR real source is present, **reverse-engineer the domain from the code first, then reconcile** — don't model from a blank slate (see `@~/.claude/gsd-core/references/brownfield-adaptation.md`, "Reverse-engineer, don't re-invent"):
50
+
51
+ ```bash
52
+ ls .planning/codebase/ARCHITECTURE.md .planning/codebase/STRUCTURE.md >/dev/null 2>&1 && echo "HAS_MAPS" || echo "NO_MAPS"
53
+ ```
54
+
55
+ If maps exist, `cat .planning/codebase/ARCHITECTURE.md .planning/codebase/STRUCTURE.md` (and STACK.md if useful). If real source exists but no maps, suggest `/gsd:map-codebase` first, or skim the source tree directly. Then, *before* drafting from vision:
56
+ - **Extract ubiquitous-language terms** from module/type/table names and call paths (the words the code already speaks) — these seed Step 3's draft.
57
+ - **Derive candidate subdomains** from the structure (modules, packages, service boundaries in ARCHITECTURE.md/STRUCTURE.md) — these seed Step 4's area list.
58
+ - **Reconcile against the user's vision** and flag mismatches, mirroring discover-product's gap map — here **KEEP** (code and vision agree), **REFINE** (present but vague, partial, or drifting from vision), **CONTESTED** (code and vision disagree, or vision wants something the code lacks). Present the reconciliation in one message; carry CONTESTED items into the Step 6 notes.
59
+ - **Don't DDD the monolith.** When the existing system is a tangle, do NOT model the whole legacy estate as clean subdomains. Carve a **bubble context** around the core subdomain and mark an **ACL at its edge** (record it as a candidate boundary in Step 5/6) — model the core cleanly, wrap the mess. Recording only; the translator is tactical.
60
+
61
+ Greenfield behavior (draft from docs/vision) remains the default when no code exists.
62
+
48
63
  **Grounding maturity governs elicitation depth.** When the upstream docs are mature (a design spec, research corpus, or detailed brief already forged the vocabulary and areas), default every step to **draft-from-docs + confirm**: present complete drafts for correction, state which docs grounded them, and reserve actual questions for *genuine decision points* — contested classifications, the one-core call, complexity contradictions. The 2–3 probing rounds below are for thin grounding (a bare PROJECT.md), not a re-interview of what the docs already answer. Honor a posture stated in `$ARGUMENTS` without re-asking.
49
64
 
50
65
  **Check for an existing model:**
@@ -10,6 +10,7 @@ Read all files referenced by the invoking prompt's execution_context before star
10
10
  @~/.claude/gsd-core/references/gate-prompts.md
11
11
  @~/.claude/gsd-core/references/agent-contracts.md
12
12
  @~/.claude/gsd-core/references/gates.md
13
+ @~/.claude/gsd-core/references/engineering-standards.md
13
14
  </required_reading>
14
15
 
15
16
  <available_agent_types>
@@ -4,6 +4,7 @@ Recommend an application architecture matched to the project's actual complexity
4
4
 
5
5
  <required_reading>
6
6
  @~/.claude/gsd-core/references/architecture-decision.md
7
+ @~/.claude/gsd-core/references/brownfield-adaptation.md
7
8
  @~/.claude/gsd-core/templates/adr.md
8
9
  </required_reading>
9
10
 
@@ -40,13 +41,29 @@ cat .planning/DOMAIN-MODEL.md 2>/dev/null || true
40
41
 
41
42
  **Grounding maturity governs elicitation depth.** When upstream artifacts (DOMAIN-MODEL, a design spec, research) already answer a question below, draft-from-docs and present for confirmation — cite the source, don't re-interview. Reserve `AskUserQuestion` for genuine decision points: contested rungs, gate answers the docs don't record, and contradictions (the reconcile rule still ALWAYS runs). Honor a posture stated in `$ARGUMENTS` without re-asking.
42
43
 
44
+ **Brownfield mode (existing architecture).** Greenfield (recommend the target from scratch) is the default. But when existing code or `map-codebase` maps are present, **assess the current topology first and recommend an evolution path** — never impose a from-scratch ideal on the running system (see `@~/.claude/gsd-core/references/brownfield-adaptation.md`):
45
+
46
+ ```bash
47
+ ls .planning/codebase/ARCHITECTURE.md >/dev/null 2>&1 && echo "HAS_MAPS" || echo "NO_MAPS"
48
+ ```
49
+
50
+ If maps exist, `cat .planning/codebase/ARCHITECTURE.md` (consume STRUCTURE.md/CONCERNS.md as needed); if real source exists but no maps, suggest `/gsd:map-codebase` first. Then run Steps 3–5 in **assess-then-evolve** form:
51
+ - **Score the current topology against the meta-tell** (Step 5) — where is the existing system already over- or under-engineered relative to the floor and the per-subdomain rungs?
52
+ - **The target is unchanged — the floor (functional-core/imperative-shell + tests) and the rung ladder still apply.** Brownfield changes only *how you get there*: incrementally, not in one cut.
53
+ - **Recommend via the decision card**, not a verdict: per affected area record `current state · target · gap cost (blast radius · churn/centrality · reversibility) · Follow | Improve | Refactor`. **Default-select Improve**; **gate Refactor behind characterization tests**; Follow is a deliberate, time-boxed choice for cold/uncovered code. Record the card in the ADR's Consequences/promotion-triggers, surfacing cost — the user owns the appetite.
54
+ - **The evolution path uses Strangler Fig + ACL + bubble context** — exactly the *Evolving the topology* mechanics this skill already documents in Step 4. In brownfield, that section is the recommended path (new behavior to the clean core, old paths strangled), not just a future-split footnote. Don't duplicate it here — reference it.
55
+
56
+ Greenfield behavior remains the default when no existing code is present.
57
+
43
58
  **If `NO_DOMAIN_MODEL`:** tell the user "No domain model found — I'll ask the complexity questions directly. (Consider `/gsd:model-domain` first for a sharper result.)" Then gather, per major area: is it core/supporting/generic, and how complex (rich rules vs CRUD)?
44
59
 
45
60
  From DOMAIN-MODEL.md (or the answers): extract each subdomain's type + complexity, **plus** the bounded contexts (owns / talks-to), the context-map relationships, and any flagged polysemes — Step 4.5 consumes them. The **core** subdomain's complexity is the primary driver of Axis A.
46
61
 
47
62
  ## Step 3: Axis A — domain-logic organization (per subdomain)
48
63
 
49
- Decide a rung **per subdomain** (not one rung for the whole app). Supporting/generic subdomains almost always stay at Transaction Script; the complex core may climb.
64
+ **First, the floor applies to ALL subdomains, even simple ones (state it in the ADR):** dependency inversion at *true external boundaries only* (DB, 3rd-party, clock/IO) + a Functional-Core/Imperative-Shell shape + strong independent tests. This is NOT hexagonal — no internal port ceremony — it's the cheap testability baseline both camps' senior voices converge on, and the AI era makes it pay from the first agent session (see the reference's floor + AI-era sections). "Transaction Script" is the domain-logic organization *on top of* this floor, never an excuse to skip it (a CRUD app that reaches into the DB from everywhere is under-engineered).
65
+
66
+ Then decide a rung **per subdomain** (not one rung for the whole app). Supporting/generic subdomains almost always stay at Transaction Script *behind the floor seams*; the complex core may climb.
50
67
 
51
68
  For the **core** subdomain, use `AskUserQuestion` (header "Core logic"):
52
69
  - question: "For the core (*[core name]*): are operations mostly validate→persist→return, or are there rich invariants / state machines / tangled business rules?"
@@ -4,6 +4,7 @@ Recommend a test strategy matched to the architecture: WHAT to test, at WHICH le
4
4
 
5
5
  <required_reading>
6
6
  @~/.claude/gsd-core/references/test-strategy.md
7
+ @~/.claude/gsd-core/references/brownfield-adaptation.md
7
8
  @~/.claude/gsd-core/templates/test-strategy.md
8
9
  </required_reading>
9
10
 
@@ -37,10 +38,14 @@ cat .planning/REQUIREMENTS.md 2>/dev/null || true
37
38
  cat .planning/DOMAIN-MODEL.md 2>/dev/null || true
38
39
  cat .planning/adr/*.md 2>/dev/null || true
39
40
  cat TESTING-STANDARDS.md 2>/dev/null || true
41
+ cat .planning/codebase/TESTING.md 2>/dev/null || true
42
+ ls .planning/codebase/TESTING.md >/dev/null 2>&1 && echo "BROWNFIELD" || echo "GREENFIELD"
40
43
  ```
41
44
 
42
45
  **Read `@~/.claude/gsd-core/references/test-strategy.md` now** — it defines behavior-over-implementation, sociable-by-default, test-once-at-cheapest-level, shape-follows-architecture (size axis), the gnarly-bits list, persistent-vs-transient e2e, and coverage-as-floor + mutation.
43
46
 
47
+ **Brownfield mode (existing test suite / untested legacy).** If `BROWNFIELD` (a `map-codebase` run produced `.planning/codebase/TESTING.md`, or existing code is present), **read `@~/.claude/gsd-core/references/brownfield-adaptation.md` now** and recommend the test strategy as an *evolution*, not a from-scratch shape: assess the current tests from TESTING.md (framework, structure, coverage) and frame Steps 3–6 as additions on top of what exists, keeping the existing framework/conventions unless there's a decision-card reason to change. Key rule: for **untested legacy you intend to change, write characterization tests first** (pin the *current* behavior, bugs included) — starting at the **highest churn × risk hotspots** (git history), as a *local* safety net around the change region; never block on a global coverage %. Surface each gap (missing level, weak assertions, money-as-float, etc.) as a **decision card** (current → target → gap cost → Follow / Improve / Refactor) and **default-select Improve**; gate any Refactor behind those characterization tests. Greenfield (no map, no code) keeps the from-scratch derivation below as the default.
48
+
44
49
  **Grounding maturity governs elicitation depth.** When upstream artifacts (spec, ADR, strategies, research) already answer a step, draft-from-docs and present for confirmation — cite the source, don't re-interview. Reserve questions for genuine decision points and contradictions. Honor a posture stated in `$ARGUMENTS` without re-asking.
45
50
 
46
51
 
@@ -105,6 +110,8 @@ TEST-STRATEGY.md written — test shape set to follow the architecture.
105
110
  Next: /gsd:plan-phase (plans + /gsd:add-tests will follow this strategy)
106
111
  ```
107
112
 
113
+ **Roadmap reconciliation:** ROADMAP.md predates this strategy. Scan it against the test shape — if a phase relies on test infrastructure or a test level this strategy invalidates or reshapes (e.g. a phase planning e2e for what's now demoted to integration, or one missing the test infra a gnarly-bit tier needs), SAY SO explicitly and offer `/gsd:phase --edit` (or a roadmap refresh — the roadmapper re-reads discovery artifacts). Never leave a known strategy↔roadmap contradiction unspoken.
114
+
108
115
  </process>
109
116
 
110
117
  <critical_rules>
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@therocketcode/gsd-core",
3
- "version": "1.9.0",
3
+ "version": "1.11.0",
4
4
  "description": "GSD Core is a meta-prompting, context engineering, and spec-driven development system for AI coding agents.",
5
5
  "bin": {
6
6
  "gsd-core": "bin/install.js",