specpipe 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (60)
  1. package/README.md +1319 -0
  2. package/bin/devkit.js +3 -0
  3. package/package.json +61 -0
  4. package/src/cli.js +76 -0
  5. package/src/commands/check.js +33 -0
  6. package/src/commands/diff.js +84 -0
  7. package/src/commands/init-adopt.js +54 -0
  8. package/src/commands/init-agents.js +118 -0
  9. package/src/commands/init-global.js +102 -0
  10. package/src/commands/init.js +311 -0
  11. package/src/commands/list.js +54 -0
  12. package/src/commands/remove.js +133 -0
  13. package/src/commands/upgrade.js +215 -0
  14. package/src/lib/agent-guards.js +100 -0
  15. package/src/lib/agent-install.js +161 -0
  16. package/src/lib/agents.js +280 -0
  17. package/src/lib/claude-global.js +183 -0
  18. package/src/lib/detector.js +93 -0
  19. package/src/lib/hasher.js +21 -0
  20. package/src/lib/installer.js +213 -0
  21. package/src/lib/logger.js +16 -0
  22. package/src/lib/manifest.js +102 -0
  23. package/src/lib/reconcile.js +56 -0
  24. package/templates/.claude/CLAUDE.md +79 -0
  25. package/templates/.claude/hooks/comment-guard.js +126 -0
  26. package/templates/.claude/hooks/file-guard.js +216 -0
  27. package/templates/.claude/hooks/glob-guard.js +104 -0
  28. package/templates/.claude/hooks/path-guard.sh +118 -0
  29. package/templates/.claude/hooks/self-review.sh +27 -0
  30. package/templates/.claude/hooks/sensitive-guard.sh +227 -0
  31. package/templates/.claude/settings.json +68 -0
  32. package/templates/docs/WORKFLOW.md +325 -0
  33. package/templates/docs/specs/.gitkeep +0 -0
  34. package/templates/hooks/specpipe-read-guard.sh +42 -0
  35. package/templates/hooks/specpipe-shell-guard.sh +65 -0
  36. package/templates/rules/specpipe-guards.md +40 -0
  37. package/templates/scripts/test-hooks.sh +66 -0
  38. package/templates/skills/sp-build/SKILL.md +776 -0
  39. package/templates/skills/sp-challenge/SKILL.md +255 -0
  40. package/templates/skills/sp-commit/SKILL.md +174 -0
  41. package/templates/skills/sp-explore/SKILL.md +730 -0
  42. package/templates/skills/sp-fix/SKILL.md +266 -0
  43. package/templates/skills/sp-humanize/SKILL.md +212 -0
  44. package/templates/skills/sp-investigate/SKILL.md +648 -0
  45. package/templates/skills/sp-md-render/SKILL.md +200 -0
  46. package/templates/skills/sp-md-render/components.md +415 -0
  47. package/templates/skills/sp-md-render/template.html +283 -0
  48. package/templates/skills/sp-plan/SKILL.md +947 -0
  49. package/templates/skills/sp-review/SKILL.md +268 -0
  50. package/templates/skills/sp-scaffold/SKILL.md +237 -0
  51. package/templates/skills/sp-scaffold/references/ARCHITECTURE.md.tmpl +228 -0
  52. package/templates/skills/sp-scaffold/references/DESIGN.md.tmpl +113 -0
  53. package/templates/skills/sp-scaffold/references/adr/NNNN-template.md +92 -0
  54. package/templates/skills/sp-scaffold/references/stack-profiles/react.md +36 -0
  55. package/templates/skills/sp-spec-render/SKILL.md +254 -0
  56. package/templates/skills/sp-spec-render/components.md +418 -0
  57. package/templates/skills/sp-spec-render/examples/user-auth.html +749 -0
  58. package/templates/skills/sp-spec-render/examples/user-auth.md +114 -0
  59. package/templates/skills/sp-spec-render/template.html +222 -0
  60. package/templates/skills/sp-voices/SKILL.md +1184 -0
@@ -0,0 +1,776 @@
1
+ ---
2
+ description: |
3
+ TDD delivery loop — write failing tests from spec, implement story by story,
4
+ drive to GREEN. One story → red → green → next story. For a multi-story spec,
5
+ auto-mode orchestrates the whole spec to done by dispatching one subagent per
6
+ story (keeps context lean, minimal human-in-loop), stopping only on blockers,
7
+ spec drift, or checkpoint stories.
8
+ Use when asked to "build this", "implement the spec", "code the feature",
9
+ "triển khai", "làm tính năng", "code theo spec", "TDD this", "build hết spec",
10
+ "build all stories", "implement the whole spec", or "build tự động".
11
+ Proactively invoke this skill (do NOT write code directly) when the user has
12
+ a spec ready in docs/specs/ and wants it implemented, or asks to start coding
13
+ a planned feature.
14
+ Requires a spec from /sp-plan or equivalent — if no spec exists, run /sp-plan
15
+ first instead of jumping into code.
16
+ allowed-tools: Read, Write, Edit, Bash, Glob, Grep, AskUserQuestion, Agent, mcp__graphatlas__*
17
+ ---
18
+ TDD delivery loop: write failing tests from the spec's acceptance scenarios (AS), implement story by story, drive to GREEN.
19
+
20
+ This skill has two execution paths, both built on the same **Execution Procedure (Phase 0a–Phase 5)** below. **Inline** runs that procedure directly in the current context (classic behaviour — fine for one story or a small spec, but context grows with every story). **Auto-Mode** turns the current context into an orchestrator that drives a multi-story spec to completion by dispatching a fresh subagent per story — each subagent runs the same procedure scoped to its one story, so the controller's context stays lean. Run Mode Detection first.
21
+
22
+ ---
23
+
24
+ ## Mode Detection (run first)
25
+
26
+ 1. Resolve the target spec at `docs/specs/<feature>/<feature>.md` (from `$ARGUMENTS` or the changed feature). Count the **in-scope** stories: those in `## Stories`, minus any already `done` in `.build-progress`, intersected with any `$ARGUMENTS` story filter. (So a resume with one story left counts as 1.)
27
+ 2. Decide:
28
+ - **No spec / no `## Stories` section** (e.g. ad-hoc build, bug-fix-style work, `$ARGUMENTS` is a bare file) → **Inline.** Run the Execution Procedure directly. No mode question.
29
+ - **1 story in scope** (single-story spec, or `$ARGUMENTS` scopes to one `S-NNN`) → **Inline.** Run the procedure for that one story. No subagents (inline threshold = 1).
30
+ - **≥2 stories in scope** → ask the user **once** via AskUserQuestion:
31
+
32
+ ```json
33
+ {
34
+ "questions": [{
35
+ "question": "This spec has <N> stories. Run auto-mode (I orchestrate: one subagent per story, gate each, stop only on blockers / spec-drift / checkpoint stories), or inline (build all <N> in this context, classic, you watch each step)?",
36
+ "header": "Build mode",
37
+ "multiSelect": false,
38
+ "options": [
39
+ {"label": "Auto — build all <N> to done", "description": "Orchestrate with subagents; lean context; stop only on BLOCKED, spec signal S1/S2, or a checkpoint story"},
40
+ {"label": "Inline — build all in this context", "description": "Classic single-context loop over every story; full visibility, but context grows per story"}
41
+ ]
42
+ }]
43
+ }
44
+ ```
45
+
46
+ - **Auto** → go to **Auto-Mode (Orchestrator)** below.
47
+ - **Inline** → run the Execution Procedure, looping over the in-scope stories (resume from first `pending` in `.build-progress`).
48
+
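The Mode Detection rules above can be sketched as a small decision function. This is a minimal illustration, not part of the package: `in_scope_stories` and `choose_mode` are hypothetical names, the progress map is assumed to be `{story_id: state}`, and the case where every story is already `done` is treated as inline here, which the skill text does not spell out.

```python
from typing import Dict, Iterable, List, Optional, Set


def in_scope_stories(spec_stories: Iterable[str],
                     progress: Dict[str, str],
                     story_filter: Optional[Set[str]] = None) -> List[str]:
    """Stories from the spec, minus those already done, intersected with any filter."""
    return [
        story for story in spec_stories
        if progress.get(story) != "done"
        and (story_filter is None or story in story_filter)
    ]


def choose_mode(spec_stories: Iterable[str],
                progress: Dict[str, str],
                story_filter: Optional[Set[str]] = None) -> str:
    """Return 'inline' (no question) or 'ask' (ask the user once: auto vs inline)."""
    count = len(in_scope_stories(spec_stories, progress, story_filter))
    # Zero or one story in scope runs inline; two or more triggers the question.
    # Treating "all stories done" (count == 0) as inline is an assumption.
    return "inline" if count <= 1 else "ask"
```

A resume with one story left counts as one story in scope, so `choose_mode(["S-001", "S-002"], {"S-001": "done"})` takes the inline path.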
49
+ ---
50
+
51
+ ## Auto-Mode (Orchestrator)
52
+
53
+ The current context is the **controller**. It does not implement — it orders, dispatches, gates, and routes. Keep controller context lean: read status and checklist ticks, never full diffs or full subagent bodies.
54
+
55
+ ### A1 — Plan the run (read spec ONCE)
56
+
57
+ **0. Foundation Gate (before any dispatch).** Run the Phase 0b Foundation Gate once: if there is no runnable harness (no project markers / runner not invocable), STOP and report BLOCKED pointing to `/sp-scaffold` — never dispatch stories onto an empty repo. Then:
58
+
59
+ 1. Read the spec once. Extract for every story: ID, priority, full text + its AS, and the `**Execution:**` block (`depends_on`, `parallel_safe`, `files`, `autonomous`, `verify`). Missing block → treat as `depends_on: none, parallel_safe: false, autonomous: true, files: unknown`.
60
+ 2. Read `.build-progress` if present. In Auto-Mode it uses **three** states: `done`, `building`, `pending`. The controller flips a story to `building` right before dispatch (A2) and to `done` only after its gates clear (A4/A4b) — so a story interrupted mid-build is left as `building`, never silently `done`. **On resume:**
61
+ - `building` story → it was in flight when interrupted. Check `git log --all --grep "S-NNN"`, which scans the FULL message including the `Story:` footer (A2 mandates it there). Do NOT use `git log --oneline | grep`: that only sees the subject and will miss the footer, giving a false "no commit" result that triggers a re-dispatch and overwrites finished work. Commit exists → the subagent finished but wasn't gated → gate it (A4) instead of rebuilding. No commit → re-dispatch it.
62
+ - `pending` → untouched, build normally. `done` → skip.
63
+ - **Stale worktrees from an interrupted prior run:** run `git worktree list`. For any leftover agent worktree (path under `.claude/worktrees/` or branch `worktree-agent-*`): if its story is `done` and its branch is merged → clean it up (`git worktree remove` + `git branch -d` + `git worktree prune`). If it is **locked** or its work is unmerged → do NOT force-remove; report it ("leftover worktree <path> from an interrupted run — remove with `git worktree remove -f -f` only if you're sure its agent is dead") and leave it. Never blindly `-f -f` on resume.
64
+ 3. **Derive the full checklist yourself now.** Dispatched subagents are barred from Phase 0.6, so the controller owns derivation: run Phase 0.6 over the whole spec, give each line an `owner: S-NNN`, and write `.build-checklist`. (On resume, re-derive and diff per Phase 0.6's "checklist already exists" rules.) This must exist before the first dispatch, because A2 pastes each story's `owner` lines into its subagent prompt.
65
+ 4. **Validate the dependency graph FIRST.** Every `depends_on` ID must resolve to a story in this spec, and the graph must be a DAG. If any `depends_on` points to a missing/removed ID, or there is a cycle (or at any later point pending stories remain but none is ready) → STOP with `BLOCKED: dependency cycle or dangling ref between <stories>` and tell the user to fix it via `/sp-plan`. Do not loop.
66
+ 5. **Compute waves** from priority + `depends_on`:
67
+ - A story is ready when all its `depends_on` are `done`.
68
+ - Within a ready set, order by priority (P0 → P1 → P2). **Process the ready set in priority order: run sequential-eligible stories first (one at a time), then dispatch the parallel group.** Never run an inline sequential build at the same time as a worktree wave — finish the sequential story, then start the group.
69
+ - A set of ready stories runs **in parallel only if** every one is `parallel_safe: true` AND none is `autonomous: checkpoint` (checkpoint stories always run sequentially so the A5 pause fires) AND their `files` are concrete (not `unknown`, not a bare directory hint) AND pairwise disjoint. Any overlap, any `unknown`, any directory-level hint, any dependency, any doubt → that story drops to sequential. A wave can mix: dispatch the parallel-eligible group (in batches, see the concurrency cap below), build the rest one at a time.
70
+ - **Two implementer subagents must never share one working tree** — git index, `HEAD`, and the build dir are shared even when file *contents* are disjoint. Parallel dispatch therefore uses worktree isolation; see A2 (dispatch) + A4b (integration) for the exact procedure. If `isolation="worktree"` is unavailable in this environment, fall back to sequential for that wave.
71
+ - **Concurrency cap — never more than 3 implementer subagents at once.** If a parallel-eligible group has >3 stories, dispatch in batches of ≤3: a batch's branches integrate (A4b) before the next batch starts. More than ~3 burns context/tokens in parallel and risks rate limits, while the *sequential* A4b integrate is the real bottleneck — extra concurrency buys little. The cap drops with the context tier (A6): **3** at PEAK, **2** at GOOD, **1 (sequential)** at DEGRADING or worse. It is a ceiling, not a target — 2 disjoint stories run as 2, never padded to 3.
72
+
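Steps 4 and 5 of A1 (validate the graph, then pick the ready set in priority order) can be sketched as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, `depends_on` is assumed to be a `{story_id: [dependency_ids]}` map covering every story in the spec, and a story with no recorded priority is assumed to sort as `P2`.

```python
from typing import Dict, List


def validate_graph(depends_on: Dict[str, List[str]]) -> None:
    """Raise ValueError on a dangling reference or a dependency cycle."""
    for story, deps in depends_on.items():
        for dep in deps:
            if dep not in depends_on:
                raise ValueError(f"BLOCKED: dangling ref {story} -> {dep}")
    # Peel off stories whose dependencies are all resolved (Kahn-style).
    remaining = {story: set(deps) for story, deps in depends_on.items()}
    while remaining:
        free = [story for story, deps in remaining.items() if not deps]
        if not free:
            # Pending stories remain but none is ready: a cycle.
            raise ValueError(f"BLOCKED: dependency cycle between {sorted(remaining)}")
        for story in free:
            del remaining[story]
        for deps in remaining.values():
            deps.difference_update(free)


def ready_stories(depends_on: Dict[str, List[str]],
                  priority: Dict[str, str],
                  state: Dict[str, str]) -> List[str]:
    """Pending stories whose dependencies are all done, ordered P0 -> P1 -> P2."""
    ready = [
        story for story, deps in depends_on.items()
        if state.get(story, "pending") == "pending"
        and all(state.get(dep) == "done" for dep in deps)
    ]
    # "P0" < "P1" < "P2" sorts correctly as plain strings; ties break on story ID.
    return sorted(ready, key=lambda story: (priority.get(story, "P2"), story))
```

Running the validation before any wave is computed matches the "Do not loop" instruction: a broken graph is reported once instead of being rediscovered on every pass.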
73
+ ### A2 — Per-story dispatch (point, don't paste the procedure)
74
+
75
+ Before dispatching, check the story's `autonomous` field: if it is `checkpoint`, pause and let the human inspect before proceeding (A5) — do not dispatch it silently. Then mark the story (or each member of a parallel group) `building` in `.build-progress` so an interruption is recoverable (A1).
76
+
77
+ **Dispatch shape:**
78
+ - **Sequential story** → dispatch one **general-purpose subagent** via the Agent tool, working in the current tree.
79
+ - **Parallel-eligible group** (from A1.5) → first capture the wave base: `EXPECTED_BASE=$(git rev-parse HEAD)`. Then dispatch each member subagent with `isolation="worktree"` so each gets its own worktree + branch off HEAD — **at most the A1.5 concurrency cap (≤3) at a time**; if the group is larger, do it in batches, integrating each batch (A4b) before dispatching the next. They run concurrently within a batch; the controller integrates their branches afterward (A4b).
80
+
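The parallel-eligibility rule from A1 and the batching rule in the dispatch shape above can be sketched like this. Several details are assumptions made for illustration: the names `split_parallel` and `batches` are hypothetical, story metadata is assumed to be a dict per story, a directory-level hint is assumed to be a path ending in `/`, and on a file overlap the later story in priority order is the one that drops to sequential.

```python
from typing import Dict, List, Tuple


def split_parallel(ready: List[str],
                   meta: Dict[str, dict]) -> Tuple[List[str], List[str]]:
    """Split a priority-ordered ready set into (sequential, parallel) stories."""
    sequential: List[str] = []
    parallel: List[str] = []
    claimed: set = set()  # files already owned by an accepted parallel story
    for story in ready:
        info = meta.get(story, {})
        files = info.get("files", "unknown")
        concrete = (
            isinstance(files, list)
            and len(files) > 0
            and not any(path.endswith("/") for path in files)  # assumed hint marker
        )
        eligible = (
            info.get("parallel_safe") is True
            and info.get("autonomous") != "checkpoint"
            and concrete
            and claimed.isdisjoint(files)
        )
        if eligible:
            parallel.append(story)
            claimed.update(files)
        else:
            sequential.append(story)
    if len(parallel) < 2:
        # A "group" of one gains nothing from a worktree; run everything in order.
        return list(ready), []
    return sequential, parallel


def batches(group: List[str], cap: int = 3) -> List[List[str]]:
    """Chunk a parallel group into batches no larger than the concurrency cap."""
    cap = max(1, min(cap, 3))  # the cap is a ceiling of 3 and never below 1
    return [group[i:i + cap] for i in range(0, len(group), cap)]
```

Each batch returned by `batches` would be integrated (A4b) before the next one is dispatched; the cap is a ceiling, so a two-story group yields a single batch of two.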
81
+ The prompt MUST contain:
82
+
83
+ 1. **Point** the subagent to the procedure: *"Read `.claude/skills/sp-build/SKILL.md` and follow the **Execution Procedure (Phase 0a–Phase 5)** for EXACTLY the one story below. Do NOT invoke the sp-build skill (no recursion), do NOT re-enter Mode Detection, do NOT read or build any other story."*
84
+ 2. **The dispatched-subagent contract** (paste verbatim — this is what keeps the controller the single owner of cross-story state):
85
+ - Build only your assigned story; the Phase 2 loop runs exactly once.
86
+ - Name every test with the `AS-NNN` it covers (`AS-NNN: <scenario>`), one test node per primary AS — the controller's Spec Coverage Gate (Phase 3.5) counts coverage by that ID, so an untagged test is invisible to it.
87
+ - Do NOT write `.build-progress` or `.build-checklist` — the controller owns them. Report your checklist ticks in the contract instead.
88
+ - Do NOT run Phase 3 (full-suite), Phase 4.5 (cross-story checklist review), or Phase 5 (summary/cleanup) — those are the controller's job. Run only your story's filtered tests.
89
+ - Do NOT surface a spec signal to the user or edit the spec — return it in the `Spec signal` field.
90
+ - Commit your own work as ONE commit, conventional format, with a `Story: S-NNN` footer line (sp-commit's story-link convention):
91
+ ```
92
+ feat(scope): <short desc>
93
+
94
+ Story: S-NNN
95
+ ```
96
+ The footer is how the controller finds your work on resume (`git log --grep`) — mandatory, not cosmetic.
97
+ - **If you are running in a git worktree** (parallel dispatch): before any edit, assert `git symbolic-ref HEAD` is your own per-agent branch, NOT a protected ref (`main`/`master`/`develop`/`release/*`) — if it is protected, STOP and report BLOCKED, never `git update-ref` your way onto it. Then confirm your branch was cut from `EXPECTED_BASE` (passed in your prompt); if `git merge-base HEAD <EXPECTED_BASE>` differs from `EXPECTED_BASE`, `git reset --hard <EXPECTED_BASE>` before starting. Stay inside your worktree; touch only your `files`.
98
+ 3. **Paste** the story's full text + its AS + the `files` hints + relevant Constraints + **the `.build-checklist` lines whose `owner` is this story** (so the subagent reports ticks against the controller's real line IDs, not IDs it invented) + **`EXPECTED_BASE` if this is a worktree (parallel) dispatch**. The subagent should not have to re-derive scope.
99
+ 4. Demand the **report contract** in A3.
100
+
101
+ Do not paste the whole procedure into the prompt — the subagent reads it from disk.
102
+
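The reason the `Story:` footer matters on resume can be shown with a few lines of string handling. This sketch only models the commit message text; the helper names are hypothetical and nothing here calls git.

```python
from typing import Optional


def commit_message(kind: str, scope: str, desc: str, story_id: str) -> str:
    """Build a conventional commit message with the mandatory Story footer."""
    return f"{kind}({scope}): {desc}\n\nStory: {story_id}\n"


def subject_mentions(message: str, story_id: str) -> bool:
    """What a subject-only scan can see: just the first line of the message."""
    lines = message.splitlines()
    return bool(lines) and story_id in lines[0]


def footer_story(message: str) -> Optional[str]:
    """Return the story ID from a 'Story: S-NNN' footer line, or None if absent."""
    for line in reversed(message.splitlines()):
        if line.startswith("Story: "):
            return line[len("Story: "):].strip()
    return None
```

For a message built by `commit_message("feat", "auth", "add login", "S-003")`, the subject-only check finds nothing while `footer_story` recovers `S-003`, which is the gap between a subject-line scan and a full-message search.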
103
+ ### A3 — Report contract (what the subagent returns)
104
+
105
+ ```
106
+ Status: DONE | DONE_WITH_CONCERNS | BLOCKED | NEEDS_CONTEXT
107
+ Story: S-NNN
108
+ Files changed: [...]
109
+ Tests added: [exact test names]
110
+ Checklist: [lines ticked]
111
+ Edge compliance: [the 8-row table for this story — each ✓ or N/A+reason] (depth forcing-function; the controller aggregates these into Phase 5)
112
+ Spec signal: none | S1 <gap> | S2 <conflict> | S3 <added guard>
113
+ ```
114
+
115
+ Controller reads this report only — not the diff. (Spec signal definitions = Phase 5 "Spec Update Signal".)
116
+
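A controller that reads only the report needs to reject malformed reports early. The following is a minimal parsing sketch, assuming each field of the contract sits on a single `Key: value` line (a multi-line edge-compliance table would need more work); `parse_report` and the extra `signal_kind` key are hypothetical.

```python
from typing import Dict

VALID_STATUS = {"DONE", "DONE_WITH_CONCERNS", "BLOCKED", "NEEDS_CONTEXT"}
VALID_SIGNALS = {"none", "S1", "S2", "S3"}
REQUIRED = ("Status", "Story", "Files changed", "Tests added",
            "Checklist", "Edge compliance", "Spec signal")


def parse_report(text: str) -> Dict[str, str]:
    """Parse a subagent report into a dict, raising ValueError if it is malformed."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons themselves.
        key, sep, value = line.partition(":")
        if sep and key.strip() in REQUIRED:
            fields[key.strip()] = value.strip()
    missing = [key for key in REQUIRED if key not in fields]
    if missing:
        raise ValueError(f"report missing fields: {missing}")
    if fields["Status"] not in VALID_STATUS:
        raise ValueError(f"unknown status: {fields['Status']!r}")
    words = fields["Spec signal"].split()
    signal = words[0] if words else ""
    if signal not in VALID_SIGNALS:
        raise ValueError(f"unknown spec signal: {fields['Spec signal']!r}")
    fields["signal_kind"] = signal  # hypothetical convenience key for routing
    return fields
```

Failing loudly on a missing field fits the A6 guidance to distrust vague or incomplete subagent output rather than fill in the blanks.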
117
+ ### A4 — Two gates per story (both must pass to continue)
118
+
119
+ **Gate 1 — Verification.** The controller runs the cheap checks directly: the story's filtered tests + its `verify` command if present (these emit pass/fail, not a diff to read). For the review, **dispatch a reviewer subagent** (sp-review-style, spec-compliance scoped to the story's commit) that reads the diff and returns a one-line verdict — so the controller never loads the diff itself. Any check fails → re-dispatch the implementer subagent **once** with the specific failure. (The subagent already spends its own internal 3-attempt budget per Phase 4 — the controller does not add 3 more on top; that would blow past the project's "max 3 fix loops" rule. One corrective re-dispatch, then stop.) Still failing → BLOCKED.
120
+
121
+ **Gate 2 — Spec signal** (runs in parallel with Gate 1, enforces the project rule "only /sp-plan touches specs"):
122
+ - `S1` (behaviour/edge case with no AS) or `S2` (code must contradict an AS) → **STOP the run.** Surface the signal and the exact command: `⚠️ Spec drift — run /sp-plan docs/specs/<feature>/<feature>.md '<change>'` then resume `/sp-build`. Do not auto-edit the spec, do not skip the story. **STOP = stop dispatching NEW stories; let any in-flight sibling subagents finish their current story (do not kill them mid-write).** Keep already-committed/merged work — S1/S2 mean the spec is stale, not that the code is wrong; leave those stories `done`, leave their commits, and just don't advance. Resume after `/sp-plan`.
123
+ - `S3` (added guard/constraint not in spec) → record it for the final report, continue.
124
+
125
+ Only when both gates clear: tick `.build-checklist`, mark the story `done` in `.build-progress`, move on.
126
+
127
+ For a **sequential** story the subagent committed in the current tree, so gate it directly here. For a **parallel wave**, integrate first (A4b), then gate each story on the integrated tree.
128
+
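The two gates combine into a small routing decision. The sketch below is one possible reading, with hypothetical action names; it assumes an `S1`/`S2` signal takes precedence over a verification failure, which the text implies by stopping the run but does not state as an ordering.

```python
def route(verification_passed: bool, signal: str, already_retried: bool) -> str:
    """Return the controller's next action for one story after both gates ran."""
    if signal in ("S1", "S2"):
        # Spec drift: stop dispatching new stories, keep committed work.
        return "stop_spec_drift"
    if not verification_passed:
        # One corrective re-dispatch, then BLOCKED.
        return "blocked" if already_retried else "redispatch_once"
    if signal == "S3":
        # Added guard: record it for the final report, then continue.
        return "mark_done_record_s3"
    return "mark_done"
```

Only the two `mark_done` outcomes correspond to both gates clearing, which is when the checklist is ticked and `.build-progress` is updated.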
129
+ ### A4b — Integrate a parallel wave (parallel dispatch only)
130
+
131
+ After every member of a parallel group returns `DONE` (handle any non-DONE per A5 first), merge their worktree branches into the working branch — the controller does this, sequentially, deterministic order (ascending `S-NNN`):
132
+
133
+ 1. For each branch: `git merge --no-ff <branch>`. Files were declared pairwise-disjoint, so a clean merge is expected.
134
+ 2. **On conflict** → the story's `files`/`parallel_safe` was wrong (it overlapped a sibling — often a shared router/index/schema). Do NOT resolve by force: `git merge --abort`, then **rebuild that story sequentially** on top of the already-integrated base (re-dispatch it as a sequential story in the current tree). Record a warning and a spec-fix hint: `⚠️ parallel_safe wrong for <S-NNN> — run /sp-plan to set parallel_safe: false / fix files`. This is the self-correcting safety net for an over-optimistic `parallel_safe`.
135
+ 3. After all branches are integrated, run the A4 gates for each story **on the integrated tree** (not on the isolated branch) so cross-story breakage surfaces. If a gate fails here, re-dispatch that implementer **sequentially in the integrated tree** (its worktree may already be gone) under the same one-retry-then-BLOCKED rule. Then mark each `done`.
136
+ 4. Tear down each worktree **explicitly** — a worktree the subagent committed to is *changed*, and `isolation="worktree"` only auto-cleans *unchanged* ones, so cleanup is yours. Per member, after its branch is merged: `git worktree remove <path>` → `git branch -d <branch>` (plain `-d`, which only succeeds once merged — a failure means it wasn't integrated, so stop and investigate, do NOT `-D`) → finally `git worktree prune`. **Do not `git worktree remove -f -f` a *locked* worktree** — a lock (`claude agent … pid`) means the agent is still live or the run was interrupted; force-removing it can destroy in-flight work. If a worktree is still locked when you reach teardown, the subagent has not actually returned — wait for it (or treat as BLOCKED), don't force.
137
+
138
+ ### A5 — Stop conditions (otherwise: do NOT pause)
139
+
140
+ Run continuously. Do **not** ask "continue?" or print progress summaries between stories. Stop only when:
141
+ - a story is `autonomous: checkpoint` → pause via `AskUserQuestion` **twice**: (1) BEFORE dispatch — "S-NNN is a checkpoint (sensitive: <why>). Build it now / skip for now / stop?" and do not dispatch until the user approves; (2) AFTER its gates pass, before marking `done` — show what changed and ask "looks right / needs changes?". A bare text note is not a pause — you must actually stop and wait for the answer,
142
+ - a subagent returns `NEEDS_CONTEXT` → surface its question to the human, wait for the answer, then re-dispatch the SAME story with that answer included. Do NOT mark it `done` and do NOT skip it,
143
+ - `BLOCKED` you cannot resolve (subagent's 3 internal attempts + your 1 re-dispatch exhausted),
144
+ - a dependency cycle / dangling ref / no-ready-but-pending state (per A1) → `BLOCKED`,
145
+ - spec signal `S1`/`S2`,
146
+ - all stories `done`.
147
+
148
+ ### A6 — Context budget (controller self-monitoring)
149
+
150
+ - Read frontmatter/status/checklist, never full SUMMARY/diff bodies, unless on a ≥500k-token model and a decision needs it.
151
+ - Track usage tiers: <50% normal; 50–70% economize, frontmatter-only, warn the user "context getting heavy — consider checkpointing"; 70%+ checkpoint progress to `.build-progress` immediately and stop.
152
+ - Watch for degraded subagent output (vagueness like "appropriate handling", reported items fewer than the story's AS) → re-verify against the checklist, don't trust the report.
153
+
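The usage tiers in A6 map directly onto a threshold function. A minimal sketch, assuming usage is available as a fraction between 0 and 1 (how the controller measures it is not specified here) and using hypothetical tier labels:

```python
def context_tier(used_fraction: float) -> str:
    """Map context usage (0.0 to 1.0) to the A6 action tier."""
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0 and 1")
    if used_fraction < 0.5:
        return "normal"
    if used_fraction < 0.7:
        # Frontmatter-only reads, and warn the user that context is getting heavy.
        return "economize"
    # Checkpoint progress to .build-progress immediately and stop.
    return "checkpoint_and_stop"
```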
154
+ ### A7 — Finish
155
+
156
+ When all stories `done`: run the full suite once, **then run the Spec Coverage Gate (Phase 3.5) over the whole spec** — any uncovered AS/C → not DONE, reopen the owning story. Then the Phase 5 summary (aggregate: stories built, coverage-gate result, open gaps, S3 signals, deferred items). Delete `.build-progress`/`.build-checklist` only if the gate passed AND checklist is 100% `[x]`/`[N/A]`.
157
+
158
+ ---
159
+
160
+ ## Execution Procedure (Phase 0a–Phase 5)
161
+
162
+ This procedure builds the story (or stories) in scope.
163
+
164
+ - **Inline run** (main context): loops over every in-scope story; owns `.build-progress`/`.build-checklist`; runs Phase 3 (full suite) and Phase 5 (summary/cleanup) normally.
165
+ - **Dispatched Auto-Mode subagent**: builds EXACTLY its one assigned story under the dispatched-subagent contract (see A2) — Phase 2 loop runs once; does NOT derive or write `.build-progress`/`.build-checklist` (the controller owns them and pastes this story's checklist lines into the prompt); skips Phase 3 and Phase 5; returns spec signals in its report instead of surfacing them. Wherever a step below says "derive/store `.build-checklist` (Phase 0.6)", "move to the next story", "mark done in `.build-progress`", "tick `.build-checklist`", "run full suite", or "Phase 5 cleanup" — a dispatched subagent skips it and instead works against the pasted checklist lines, reporting its ticks back to the controller.
166
+
167
+ ## Phase 0a — Graphatlas probe (run once)
168
+
169
+ Before Phase 0:
170
+
171
+ 1. Call `mcp__graphatlas__ga_architecture` with `max_modules: 1`.
172
+ 2. Interpret:
173
+ - Returns `modules` → **GA available.** Use `ga_*` for locate / call-graph / impact below. Grep is fallback.
174
+ - Error `STALE_INDEX` → call `mcp__graphatlas__ga_reindex` (mode `"full"`), retry once, then treat as available.
175
+ - Tool not found / connection error / any other failure → **GA unavailable.** Use grep/glob throughout. Do not re-probe.
176
+ 3. After edits, the graph goes stale. Don't reindex on a schedule — instead, when a later `ga_*` call returns `STALE_INDEX`, call `mcp__graphatlas__ga_reindex` (mode `"full"`) once then retry. On large repos a full reindex is not cheap, so reindex on demand, not per story.
177
+
178
+ ---
179
+
180
+ ## Phase 0: Build Context
181
+
182
+ 1. **Find what changed:**
183
+ ```
184
+ BASE=$(git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's|refs/remotes/origin/||'); BASE="${BASE:-main}"
185
+ git diff --name-only "$BASE"...HEAD
186
+ ```
187
+ If `$ARGUMENTS` provided → scope to that file or feature only.
188
+ This scan exists for **regression detection** on code the branch already changed — a fresh build from spec legitimately has no diff yet, which is fine: proceed to step 2. Only stop with "Nothing to build — specify a spec, feature, or file" when there is no spec/`## Stories`, no `$ARGUMENTS` scope, AND no diff.
189
+
190
+ **Regression auto-detect:** List lines removed or modified from existing code (not pure additions):
191
+ ```
192
+ git diff "$BASE"...HEAD -- <src> | grep -E "^-[^-]" | head -50
193
+ ```
194
+ For each modified function identified, evaluate whether behavior changed. Classify each change:
195
+ - **Behavior changed** → regression test REQUIRED covering the old behavior path (see REGRESSION RULE in Phase 1.5).
196
+ - **Pure refactor** (rename, format, extract helper, comment, type-only) → no new test required; add 1-line note in summary `REFACTOR_ONLY: <file:line> — <why no behavior change>`.
197
+
198
+ Do not skip this classification silently. If unsure whether a change is behavior-changing, treat it as behavior-changing.
199
+
200
+ 2. **Read the spec** at `docs/specs/<feature>/<feature>.md` — the `## Stories` section with acceptance scenarios is your roadmap. The `## Overview` and `## Constraints` sections tell you the INTENT behind the code.
201
+
202
+ 3. **Check build progress:** Look for `docs/specs/<feature>/.build-progress`.
203
+ - If found → read it, find the first line marked `pending` → resume from that story.
204
+ Log: "Resuming from S-00X (previous session progress found)."
205
+ - If not found → start from S-001 as normal.
206
+
207
+ File format:
208
+ ```
209
+ S-001 done
210
+ S-002 done
211
+ S-003 pending
212
+ ```
213
+
214
+ 4. **Locate related code.** **If GA available (per Phase 0a):** `ga_symbols` on the main function/type names from the spec → definitions; `ga_callers`/`ga_callees` → dependency chain; `ga_impact(symbol=...)` → blast radius + affected tests; `ga_architecture` → confirm module/layer (auth, payment, core); `ga_file_summary` before reading a file in full. **If GA unavailable:** grep for the main function/type names in the changed files.
215
+
216
+ 5. **Read existing tests** for the changed files — find patterns, fixtures, naming conventions. Don't duplicate.
217
+
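The `.build-progress` format from step 3 is simple enough to parse in a few lines. A sketch with hypothetical helper names, assuming one `<story-id> <state>` pair per line and ignoring anything else:

```python
from typing import Dict, Optional


def parse_progress(text: str) -> Dict[str, str]:
    """Parse .build-progress content into {story_id: state}, keeping file order."""
    progress: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            progress[parts[0]] = parts[1]
    return progress


def first_pending(progress: Dict[str, str]) -> Optional[str]:
    """Return the first story marked pending, or None if nothing is pending."""
    for story, state in progress.items():
        if state == "pending":
            return story
    return None
```

With the example file shown in step 3, `first_pending` returns `S-003`, the story an inline run would resume from.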
218
+ ---
219
+
220
+ ## Phase 0b: Foundation Gate (a runnable harness must exist)
221
+
222
+ Runs once, after Phase 0, before any test work (Phase 0.5 onward). This is a **precondition, not a build step**: it refuses to enter the TDD loop on a repo with no runnable harness, where Phase 2's RED would error — or worse, report a **false green** (a filtered run matching 0 tests exits 0 on many frameworks; see Test Command §"Filter pattern verification"). sp-build does **NOT** scaffold; that is `/sp-scaffold`'s job. This gate only checks, and points there.
223
+
224
+ **Check (cheap — NOT the full suite, that's Phase 3):**
225
+ 1. **`TEST_CMD` resolves** from a project marker (per the Test Command section). No marker at all → fail.
226
+ 2. **The runner is invocable** — its list/collect command runs without "command not found" / "no config" (reuse the per-framework list commands from Test Command §"Filter pattern verification"). A runner that can't even list tests is not a harness.
227
+ 3. **Dependencies are present / installable** — the project's deps are installed, or `install` succeeds.
228
+
229
+ A fresh build from a spec legitimately has no diff yet — that's fine (Phase 0 already handles it). This gate is about the **harness**, not about whether code changed.
230
+
231
+ **Outcomes:**
232
+ - **Pass** (markers present, runner invocable) → proceed to Phase 0.5. For an existing project this is a fast no-op — nothing changes (backward compatible).
233
+ - **Fail** (no markers / runner not invocable / empty repo) → **STOP, report BLOCKED:**
234
+ > "No runnable harness — the TDD loop can't start. Run `/sp-scaffold` first to stand up a runnable skeleton (brand-new project: `/sp-explore` greenfield → `/sp-scaffold`), then re-run `/sp-build`."
235
+
236
+ Do NOT scaffold here, and do NOT proceed into Phase 2 — proceeding would produce false greens on an empty repo, which is exactly the failure this gate exists to stop.
237
+
238
+ **Recorded `TEST_CMD`:** if `/sp-scaffold` ran, it recorded the resolved `TEST_CMD` + run command (handoff / ARCHITECTURE §13). Use that as the expected marker — a mismatch (scaffold said `vitest`, the repo has none) is a fail, not something to paper over.
239
+
240
+ **Auto-Mode:** the controller runs this gate ONCE in A1 (before the first dispatch); dispatched per-story subagents assume the harness exists and skip it.
241
+
242
+ ---
243
+
244
+ ## Phase 0.5: Implementation Risk Check
245
+
246
+ Run after Phase 0b. Takes about 2 minutes. Checks only what is visible at implementation time
247
+ (sp-challenge already reviewed the spec adversarially — this catches code-level issues only).
248
+
249
+ - **N+1:** For each story involving a list/loop — will implementation query DB inside the loop? Flag before writing the test.
250
+ - **DRY:** Grep for similar logic in existing code. If found, reuse — don't duplicate.
251
+ - **Error paths:** For each story — what can go wrong? (null, empty, network fail, invalid input) Note these upfront so they land in the Coverage Map, not as afterthoughts.
252
+ - **Pattern:** What's the existing pattern for this type of operation in the codebase? Follow it unless there's a reason not to.
253
+ - **UI Notes + UI Inventory (FE stories only):** Read TWO sections before writing the failing test:
254
+ 1. `## What Already Exists § UI Inventory` — list of existing components with paths the spec marked reusable for this feature. **Skim this FIRST** to find reuse candidates before assuming a Component Tree entry needs new scaffolding. Each row in UI Inventory carries a `file:path` you can open to check the actual API.
255
+ 2. `## UI Notes` Component Tree — components this story must produce. Use it to shape layout / section order / hierarchy.
256
+ **Precedence: AS / Constraints > Prototype URL > UI Notes.** If you detect a contradiction (e.g. AS says "notify user" but the Component Tree has no notification surface, OR Component Tree shows a button the AS never references), STOP and raise a Spec Signal — do NOT build to the UI Notes and ignore the AS. UI Notes is structural reference; the AS is the contract.
257
+
258
+ Output: 2-3 line summary. Feeds into Phase 1.5 Coverage Map.
259
+
260
+ ---
261
+
262
+ ## Phase 0.6: Spec Checklist
263
+
264
+ Derive a checklist from the spec — each "promise" in this build's scope becomes one line. The checklist mirrors the spec; it does not invent new requirements.
265
+
266
+ **Sources (all in `docs/specs/<feature>/<feature>.md`) — anchor on IDENTITY, not nouns:**
267
+ - **Each `AS-NNN` → at least one line carrying that ID** (`AS-NNN`, or `AS-NNN.Tk` when one AS needs several assertions). This is the primary anchor: the checklist is keyed on the spec's case IDs, not on text it happens to mention. A Then with several fields/effects becomes several `AS-NNN.Tk` lines — but they all carry the same AS-NNN, so the AS is never lost.
268
+ - Each Constraint → one `C-NNN` line.
269
+ - Each open `GAP-NNN` (status not `resolved`) → one `[ ]` line tagged `GAP` (so a parked gap is visible, not silently dropped — see Spec Coverage Gate).
270
+ - Each Not-in-Scope row → one `[N/A]` line (prevents accidental ticking).
271
+
272
+ **Completeness invariant (checked, not hoped):** every `AS-NNN` and `C-NNN` in the spec's `## Stories`/Constraints MUST appear on ≥1 checklist line. An AS with no line = the checklist is wrong (re-derive), not the spec. Deriving from Then-nouns alone silently drops AS whose Then is verb-shaped ("retries", "must not send") or whose nouns collide with another AS — anchoring on the ID closes that.
273
+
274
+ **Granularity rule (so two devs produce the same checklist):**
275
+ - 1 line per **observable output field** (appears in Then result, independently assertable)
276
+ - 1 line per **side effect** (write to DB, emit event, external call)
277
+ - 1 line per **error path** declared in a Then clause
278
+ - Do NOT split adjectives (sorted/deduped/trimmed) into separate lines — roll them into the field line
279
+
280
+ Example: Then "returns sorted list of {file, confidence, edges}" → 3 lines (one per field), not 4.
281
+
282
+ **Stored at:** `docs/specs/<feature>/.build-checklist` (alongside `.build-progress`)
283
+
284
+ **Format** (owner column resolves multi-story AS):
285
+ ```
286
+ [ ] AS-012.T1 — affected_tests includes convention-matched files | owner: S-003
287
+ [ ] AS-012.T2 — affected_tests includes TESTED_BY edges | owner: S-003
288
+ [ ] AS-012.T3 — output sorted by confidence | owner: S-004
289
+ [ ] C-003 — query completes under 50ms | owner: S-005
290
+ [N/A] AS-015 — out of scope (M3) | owner: —
291
+ ```
292
+
293
+ Owner = the story in this build planned to cover that line. If an AS spans multiple stories, each line gets its own owner. Use `owner: ?` when unknown upfront, resolve when reaching that story.
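The completeness invariant can be checked with a plain set-difference over IDs. The following is a minimal sketch, assuming the checklist keeps each `AS-NNN`/`C-NNN` ID as plain text on its line; the function name `missing_from_checklist` is hypothetical and not part of specpipe.

```shell
# Hypothetical helper (not shipped with specpipe).
# $1 = spec file, $2 = .build-checklist
# Prints every AS-NNN / C-NNN that the spec mentions but no checklist line carries.
missing_from_checklist() {
  local spec_ids list_ids
  spec_ids=$(mktemp) && list_ids=$(mktemp) || return 2
  # -o prints only the ID; -w rejects IDs embedded in longer words,
  # e.g. the "C-123" inside a tracker key such as "ABC-123".
  grep -owE '(AS|C)-[0-9]+' "$1" | sort -u > "$spec_ids"
  grep -owE '(AS|C)-[0-9]+' "$2" | sort -u > "$list_ids"
  comm -23 "$spec_ids" "$list_ids"   # lines that appear only in the spec
  rm -f "$spec_ids" "$list_ids"
}
```

Empty output means the invariant holds. Any printed ID means the checklist should be re-derived, not the spec edited.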
294
+
295
+ **Three checkbox states:**
296
+ - `[x]` — done: there is a test assertion AND production code emitting the behavior
297
+ - `[~]` — partial: carve-out with a concrete destination (story ID that exists in the plan, OR Known-Gap row in the spec). References like "future work", "later", "TODO.md", "Phase X (does not exist)" are NOT accepted
298
+ - `[ ]` — untouched: will be covered by a later story in this build, or out-of-scope already declared
299
+
300
+ **If checklist already exists** (resume build):
301
+ - Re-derive from the current spec. Diff against the old checklist.
302
+ - New line in spec, missing from checklist → append `[ ]`
303
+ - Line in checklist, no longer in spec → mark `[STALE]` (do not delete — keep audit trail)
304
+ - Line present in both BUT Then clause text has changed → reset to `[ ]` with note `RESET: spec text changed <date>`, re-verify. The old `[x]` may be stale — the previous assertion may no longer match.
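The first two resume rules (new line, stale line) reduce to comparing line keys. The sketch below assumes every checklist line starts with a checkbox followed by its key (`AS-NNN`, `AS-NNN.Tk`, `C-NNN`); `checklist_ids` and `diff_checklists` are hypothetical names. It does not detect the third case, a changed Then clause, which still needs a text comparison.

```shell
# Hypothetical helpers (not shipped with specpipe).
# Extract the key of each checklist line, e.g. "AS-012.T1" or "C-003".
checklist_ids() {
  sed -nE 's/^\[[^]]*\] +([A-Z]+-[0-9]+(\.T[0-9]+)?).*/\1/p' "$1" | sort -u
}

# $1 = existing .build-checklist, $2 = checklist freshly re-derived from the spec.
# Prints "NEW <id>" for lines to append as [ ] and "STALE <id>" for lines to mark [STALE].
diff_checklists() {
  local old new
  old=$(mktemp) && new=$(mktemp) || return 2
  checklist_ids "$1" > "$old"
  checklist_ids "$2" > "$new"
  comm -13 "$old" "$new" | sed 's/^/NEW /'     # only in the re-derived list
  comm -23 "$old" "$new" | sed 's/^/STALE /'   # only in the old list
  rm -f "$old" "$new"
}
```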
305
+
306
+ ---
307
+
308
+ ## Phase 1: Decide What to Test
309
+
310
+ Test behavior, not implementation. If the internals change but behavior stays the same, tests should still pass.
311
+
312
+ **What NOT to test:**
313
+ - Private/internal methods (test through public API)
314
+ - Framework behavior (test YOUR handler, not that Express routes work)
315
+ - Trivial getters/setters (unless they have validation)
316
+ - Implementation details (HOW it works — test WHAT it does)
317
+
318
+ **Edge cases to consider per story** (these are the rows of the Edge Case Compliance Table below — the depth check, separate from AS-ID coverage):
319
+ - Null/undefined · empty · invalid type · boundary (min/max) · error path · race/concurrency · large data · special chars (unicode/SQL).
320
+
321
+ For each, add the assertion **inside the owning AS's test** when that AS's behaviour reaches it. This is test DEPTH, not coverage — coverage (every AS has a test) is the Spec Coverage Gate's job; this forces you not to skip edge thinking within a test. A genuinely-missed edge the spec never captured = a **spec signal (S1)**, not a quiet tick.
322
+
323
+ **Quality check for each test:**
324
+ - Does it test one concept? If it fails, do you know exactly what broke?
325
+ - Is it independent? No test depends on another running first.
326
+ - Is it deterministic? No random, no time-dependent, no external service calls.
327
+ - **Name embeds the AS-NNN it covers**, then the scenario: `AS-007: returns 403 when role missing`. The ID makes coverage machine-checkable (Spec Coverage Gate); the description keeps it readable. One test node covers exactly one primary AS — shared setup is fine, a shared assertion standing in for several AS is not (it masks under-coverage).
328
+
329
+ **Completeness Principle:**
330
+
331
+ AI writes tests significantly faster than humans. When deciding test scope:
332
+
333
+ | Task type | Human | CC | Compression |
334
+ |-----------|-------|----|-------------|
335
+ | Boilerplate tests | 2 days | 15 min | ~100x |
336
+ | Edge case + error paths | 1 day | 15 min | ~50x |
337
+ | Feature | 1 week | 30 min | ~30x |
338
+ | Bug fix | 4 hours | 15 min | ~20x |
339
+
340
+ Rule: Default to writing the complete test set. Use `AskUserQuestion` only when the gap genuinely affects a design choice (not effort). Do NOT use self-estimated effort as a justification to skip: LLMs under-estimate effort when motivated to move on.
341
+
342
+ **Edge Case Compliance Table (per story) — a THOROUGHNESS forcing-function, NOT a coverage claim.**
343
+
344
+ This is orthogonal to the Spec Coverage Gate (Phase 3.5), not a rival: the gate guarantees every AS has a test (breadth, counted on AS-IDs); this table forces each story's tests to consider edge DEPTH — because agents reliably skip edge cases when left to their own judgement, and the gate cannot see that (a test named `AS-005` that only checks the happy path still passes the gate). An `N/A` here means "considered, doesn't apply" — it is never a coverage gap; coverage is the gate's job alone.
345
+
346
+ Fill this table in the Phase 5 summary for each story. Every row is `✓` (an assertion exists, inside the owning AS's test) or `N/A + 1-line reason`. Blank rows are not allowed.
347
+
348
+ | Edge case | Status | Test name / Reason if N/A |
349
+ |-----------|--------|---------------------------|
350
+ | Null/Undefined input | | |
351
+ | Empty array/string | | |
352
+ | Invalid types | | |
353
+ | Boundary values | | |
354
+ | Error paths | | |
355
+ | Race conditions | | |
356
+ | Large data | | |
357
+ | Special characters | | |
358
+
359
+ `N/A` is valid only with a reason (e.g. "N/A — function takes an enum, invalid type impossible at the type layer"). The context-dependent rows (race, large data, special characters) will often be `N/A` for a given story — that is expected and honest, not gaming. The cheap-and-usually-relevant rows (null, empty, boundary, error) should rarely be `N/A`. If filling a row would mean a junk test for behaviour the AS can't reach, mark `N/A + reason` — don't manufacture the test; and if you find a real edge the spec never captured, raise it as a **spec signal (S1)**, don't just tick it here.
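The "no blank rows, no bare `N/A`" rule is easy to lint. This is a sketch only, assuming the table is written exactly in the three-column layout shown above with `N/A` in the Status column and the reason in the third column; `check_edge_table` is a hypothetical name.

```shell
# Hypothetical helper (not shipped with specpipe).
# Reads one filled-in Edge Case Compliance Table on stdin.
# Prints each offending row; no output means the table passes.
check_edge_table() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    /^\|[-|: ]+$/ { next }   # separator row
    NF < 4        { next }   # not a table row
    {
      name = trim($2); status = trim($3); reason = trim($4)
      if (name == "Edge case") next   # header row
      if (status == "")
        print "BLANK: " name
      else if (status == "N/A" && reason == "")
        print "N/A without reason: " name
    }'
}
```

The check only proves the table is filled in. Whether a stated reason is honest remains a review question.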
360
+
361
+ **Engineering instincts — apply when deciding test scope:**
362
+ - **Systems over heroes:** Design tests for a tired dev at 3am, not your best engineer. If a test requires knowing internals to understand, it will fail the wrong person at the worst time.
363
+ - **Blast radius instinct:** For each Coverage Map GAP — if this path breaks in prod, how many users/systems are affected? High blast radius → mandatory test, no deferral.
364
+ - **Make the change easy, then make the easy change:** If writing a test is hard, the production code is tangled. Refactor structure first (separate commit), then add the test.
365
+ - **Reversibility preference:** When two approaches have equal coverage, pick the one easier to delete when behavior changes. Brittle tests are technical debt disguised as coverage.
366
+
367
+ ---
368
+
369
+ ## Phase 1.5: Coverage Map
370
+
371
+ Before writing tests, trace all paths and draw a diagram to see gaps upfront — not after.
372
+
373
+ **Step 1 — Trace code paths:** For each changed function/component, follow data through every branch: if/else, switch, guard clause, early return, try/catch, error boundary. Trace into helper functions if they have untested branches.
374
+
375
+ **Step 2 — Trace user flows:** For multi-step features, trace the user journey. Edge cases: double-click/rapid resubmit, navigate away mid-op, submit stale data (session expired), slow connection, concurrent actions (2 tabs open).
376
+
377
+ **Step 3 — Draw the diagram:**
378
+
379
+ ```
380
+ CODE PATH COVERAGE
381
+ ===========================
382
+ [+] src/services/example.ts
383
+
384
+ ├── processX()
385
+ │ ├── [★★★ TESTED] Happy path + error — example.test.ts:42
386
+ │ ├── [GAP] Network timeout — NO TEST
387
+ │ └── [GAP] Invalid input — NO TEST
388
+
389
+ └── helperY()
390
+ ├── [★★ TESTED] Normal case — example.test.ts:89
391
+ └── [★ TESTED] Smoke check only — example.test.ts:101
392
+
393
+ USER FLOW COVERAGE
394
+ ===========================
395
+ [+] Checkout flow
396
+
397
+ ├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15
398
+ ├── [GAP] [→E2E] Double-click submit — needs E2E, not unit
399
+ ├── [GAP] Navigate away mid-op — unit sufficient
400
+ └── [GAP] [→EVAL] Prompt template change — needs eval
401
+
402
+ ─────────────────────────────────
403
+ COVERAGE: 4/9 paths tested (44%)
403
+ Code paths: 3/5 (60%)
404
+ User flows: 1/4 (25%)
405
+ QUALITY: ★★★: 2 ★★: 1 ★: 1
406
+ GAPS: 5 paths need tests (1 needs E2E, 1 needs eval)
408
+ ─────────────────────────────────
409
+ ```
410
+
411
+ **Legend:**
412
+ - `[★★★ TESTED]` = test covers edge cases AND error paths; include `file:line`
413
+ - `[★★ TESTED]` = test covers happy path only; include `file:line`
414
+ - `[★ TESTED]` = smoke test / trivial assertion; include `file:line`
415
+ - `[GAP]` = no test — **MUST write in Phase 2**
416
+ - `[GAP] [→E2E]` = needs E2E test: flow spans 3+ components, auth/payment/data-destruction
417
+ - `[→MANUAL]` = Non-testable layer (view, template, styling). Note the visual check needed (e.g., "confirm error banner appears on invalid input"). Always test the logic backing it.
418
+ - `[GAP] [→EVAL]` = needs eval: prompt template or LLM output changed. When flagged: define capability + regression evals before implementing, run baseline and capture failure signatures, implement minimum passing change, re-run and report pass@1 and pass@3. Release-critical paths should target pass@3 stability before merge.
419
+
420
+ **E2E Decision Matrix:**
421
+
422
+ | Use E2E `[→E2E]` when | Use unit test when |
423
+ |---|---|
424
+ | Flow spans 3+ components/services | Pure function, clear inputs/outputs |
425
+ | Mocking hides real failures (API→queue→worker→DB) | Internal helper, no side effects |
426
+ | Auth / payment / data destruction | Single-function edge case (null, empty) |
427
+
428
+ **E2E is AUTHORED, not just flagged.** A `[→E2E]` path is NOT a TODO you list and move on. For **critical flows — auth, payment, data-destruction, and each story's primary cross-component happy path** — you WRITE the E2E test and drive it RED→GREEN on `E2E_CMD` (see Test Command), exactly like a unit test. Modern E2E is cheap and scriptable (Playwright/Cypress/Detox) — "it spans components" is a reason to write the E2E, not skip it. Only a **non-critical** `[→E2E]` flow may be DEFERRED to Phase 5 "E2E" with a one-line reason (budget/env). A **critical** `[→E2E]` left unwritten makes the build **DONE_WITH_CONCERNS**, never DONE. (Mocking a critical cross-component flow at unit level is the vacuous-test trap — the decision matrix routed it to E2E precisely because mocks hide its failures. The same applies to a Linked-Field seam: real E2E/integration, never mocked.)
429
+
430
+ **Testability Classification — classify by what the code does, not what framework it uses:**
431
+
432
+ | Code category | Examples | Strategy | Tag |
433
+ |---|---|---|---|
434
+ | Logic | Service, ViewModel, Presenter, Utils, Parser, Validator | Unit test directly — inputs, outputs, state transitions | (default) |
435
+ | View / Template | UI render, layout, data binding, template markup | Extract logic to testable layer; mark view code `[→MANUAL]` | `[→MANUAL]` |
436
+ | Pure presentation | Styling, spacing, animation, theming | Visual verification only | `[→MANUAL]` |
437
+ | Glue / Wiring | Dependency injection, route registration, config binding | Test through integration or E2E | `[→E2E]` or skip |
438
+
439
+ Rule: If a view/template contains conditional logic (if/else, loops with filtering, computed display values) — extract that logic into the testable layer (ViewModel, Presenter, helper) and unit test there. The view becomes a thin binding with no logic to test.
440
+
441
+ **Diagram is mandatory.** Even when all paths are covered, you must still produce the diagram with `[★★★ TESTED]` / `[★★ TESTED]` / `[★ TESTED]` entries including `file:line` references for each. Do not replace it with "All paths covered ✓". The diagram is the evidence — a one-line claim is not.
442
+
443
+ If every path is already covered, the diagram will have zero `[GAP]` rows — that is fine. Write it anyway and proceed to Phase 2.
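Because the footer totals are typed by hand, they can drift from the rows. One way to avoid that is to recompute them from the tags. The sketch below assumes it receives only the diagram rows (not the legend, which repeats the tags) and that `[→MANUAL]` rows are not counted; `coverage_tally` is a hypothetical name.

```shell
# Hypothetical helper (not shipped with specpipe).
# Reads Coverage Map rows on stdin and prints recomputed footer lines.
coverage_tally() {
  awk '
    /TESTED\]/ { tested++ }   # any star rating counts as tested
    /\[GAP\]/  { gaps++ }
    /→E2E\]/   { e2e++ }
    /→EVAL\]/  { evals++ }
    END {
      total = tested + gaps
      pct = total ? int(100 * tested / total + 0.5) : 0
      printf "COVERAGE: %d/%d paths tested (%d%%)\n", tested, total, pct
      printf "GAPS: %d paths need tests (%d E2E, %d eval)\n", gaps, e2e, evals
    }'
}
```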
444
+
445
+ **REGRESSION RULE:** If the diff changes existing behavior AND no test covers that path → a regression test is a **CRITICAL requirement. No asking. No skipping.**
446
+
447
+ **LINKED-FIELD SEAM RULE:** If the spec has a `## Linked Fields` block (it is one side of a producer/consumer split), each linked field's **seam AS is a real-integration test** — run it against the ACTUAL producer (build the producer side first; the consumer spec's seam tests run against it), **never a mocked consumer**. A mocked seam is a vacuous test: the mismatch it exists to catch — field on the wrong surface (list vs single-get) or wrong lifecycle (transient-in-response vs persisted+served) — is exactly what the mock hides. Do NOT mark a consumer story `done` on a mocked seam. **Auto-Mode:** a single-side subagent cannot see the seam — the controller runs the cross-spec seam tests after both sibling specs are built (A4b/A7).
448
+
449
+ ---
450
+
451
+ ## Test Command
452
+
453
+ Resolve once before running tests. Auto-detect from project markers:
454
+
455
+ | Marker | Run all | Run filtered |
456
+ |--------|---------|-------------|
457
+ | vitest config / vitest in package.json | `npx vitest run` | `npx vitest run -t "<pattern>"` |
458
+ | jest config / jest in package.json | `npx jest --no-cache` | `npx jest --no-cache -t "<pattern>"` |
459
+ | pyproject.toml / pytest.ini | `python3 -m pytest -x` | `python3 -m pytest -x -k "<pattern>"` |
460
+ | Cargo.toml | `cargo test` | `cargo test "<pattern>"` |
461
+ | go.mod | `go test ./...` | `go test ./... -run "<pattern>"` |
462
+ | build.gradle | `./gradlew test` | `./gradlew test --tests "<pattern>"` |
463
+ | *.sln | `dotnet test` | `dotnet test --filter "<pattern>"` |
464
+ | Package.swift | `swift test` | `swift test --filter "<pattern>"` |
465
+ | Gemfile | `bundle exec rspec` | `bundle exec rspec -e "<pattern>"` |
466
+
467
+ All test commands below use `TEST_CMD` to mean the resolved command. For filtered runs, use the framework's native filter flag from the table above.
468
+
469
+ **E2E runner (separate from the unit/integration `TEST_CMD`).** A flow tagged `[→E2E]` runs on the project's E2E harness, NOT the unit runner — resolve it from project markers and call it `E2E_CMD`: Playwright (`playwright.config.*` → `npx playwright test`), Cypress (`cypress.config.*` → `npx cypress run`), Detox / Maestro (mobile), or the e2e package's own script (`pnpm --filter e2e test`). E2E lives in its own package/dir with its own config (per the test-layout rule). If a critical flow needs E2E and no E2E harness exists, that's a scaffold gap → report it (`/sp-scaffold` sets up the e2e package); do NOT silently downgrade a payment/auth/data-destruction flow to a mocked unit test.
470
+
471
+ **Filter pattern verification (MANDATORY before trusting a filtered run):**
472
+
473
+ A filtered run that matches zero tests will exit 0 on many frameworks — that is a false green. Before trusting any `TEST_CMD --filter "<pattern>"` result, confirm the pattern matched ≥1 test case:
474
+
475
+ - vitest: `npx vitest list -t "<pattern>"` → output must list ≥1 test
476
+ - jest: add `--passWithNoTests=false` → exits 1 if no tests match
477
+ - pytest: `-k "<pattern>" --collect-only -q` → output must list ≥1 test
478
+ - cargo: `cargo test "<pattern>" -- --list` → output must list ≥1 test
479
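+ - go: `go test -list "<pattern>" ./...` → output must list ≥1 test (`-list` filters by its own regexp and does not consult `-run`, so `-list ".*"` would list every test regardless of the pattern; only top-level tests are shown)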
480
+ - gradle / dotnet / swift / rspec / other: if no equivalent listing command is known, fall back to verifying by `grep -r "<test name>" <test-dir>` — the test string must exist in source. Log `FILTER_VERIFY: fallback-grep` in the summary.
481
+
482
+ If the verification shows 0 matches → the test you just wrote did not register (wrong name, wrong file location, framework did not pick it up). Fix before proceeding. Do NOT interpret 0-match as "PASSES". **Max 3 retry attempts** on filter-match failure; if still 0 after 3, stop and report BLOCKED (test infrastructure issue, not a TDD issue).
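The zero-match guard can be made mechanical by piping the framework's listing output through a small check. This is a sketch under the assumption that the pattern is a plain substring of the test name (it does not understand pytest `-k` expressions or regexes); `assert_filter_matched` is a hypothetical name.

```shell
# Hypothetical helper (not shipped with specpipe).
# Pipe a framework's test-listing output in on stdin; $1 = the filter pattern.
# Succeeds only when at least one listed line contains the pattern.
assert_filter_matched() {
  local pattern count
  pattern=$1
  # grep -c prints 0 and exits 1 when nothing matches, hence "|| true".
  count=$(grep -cF -- "$pattern" || true)
  if [ "$count" -eq 0 ]; then
    echo "FILTER_VERIFY: 0 tests matched '$pattern'" >&2
    return 1
  fi
  echo "FILTER_VERIFY: $count line(s) matched '$pattern'"
}
```

For example, `cargo test "as_012" -- --list | assert_filter_matched "as_012"` fails fast when the new test did not register, instead of reporting a false green.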
483
+
484
+ ---
485
+
486
+ ## Phase 2: Story Loop (RED → GREEN → REFACTOR)
487
+
488
+ Work through stories one at a time from the spec's `## Stories` section.
489
+ Follow the project's existing test patterns.
490
+
491
+ **For each story:**
492
+
493
+ ### Step 1 — RED: Write test, verify it fails
494
+
495
+ Write tests for the story's acceptance scenarios.
496
+
497
+ First, verify the filter pattern matches the new test (see "Filter pattern verification" in the Test Command section). Then run:
498
+ ```
499
+ TEST_CMD --filter "<story test name>"
500
+ ```
501
+
502
+ **Capture and paste the raw failure output** (stack trace / assertion diff / first 20 lines) into your notes — this is the evidence for the `RED → GREEN` claim in Phase 5. A summary like "3 fails" without the raw text is not sufficient evidence.
503
+
504
+ - **FAILS** → correct. The test describes behavior that doesn't exist yet. Continue to Step 2.
505
+ - **PASSES** → the behavior already exists. Either the test is wrong (assertions too weak) or the code already handles this case. Investigate before continuing. If already covered, mark story `done` and move to the next story.
506
+ - **0 TESTS MATCHED** → filter pattern did not register. Fix test name / file location. Do NOT proceed.
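The three RED outcomes can be told apart from the match count and the exit status of the filtered run. The wrapper below is a sketch: `red_step` is a hypothetical name, the match count is assumed to come from the filter verification step, and a non-zero exit is treated as a failing test even though it could also be a compile error, so the captured output still has to be read.

```shell
# Hypothetical wrapper (not shipped with specpipe).
# $1 = number of tests the filter matched, $2 = log file for raw output,
# remaining arguments = the resolved, filtered test command.
red_step() {
  local matched log
  matched=$1; log=$2; shift 2
  if [ "$matched" -eq 0 ]; then
    echo "0 TESTS MATCHED: fix the test name or file location"
    return 2
  fi
  if "$@" > "$log" 2>&1; then
    echo "PASSES: investigate before continuing"
    return 1
  fi
  echo "FAILS: RED confirmed, raw evidence follows"
  head -n 20 "$log"   # the first 20 lines are the evidence for Phase 5
}
```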
507
+
508
+ ### Step 2 — GREEN: Implement minimal production code
509
+
510
+ Write the production code that makes **ALL** the story's failing tests pass — every AS test and every edge assertion you wrote in RED. **"Minimal" = the YAGNI guard** (no speculative code, no gold-plating, no handling for a case that no test exercises) — it is **NOT "the least to pass one test"**. Completeness comes from the **test set being complete** (Phase 1 + the Coverage Map already wrote a test for every case); the code then rises to satisfy all of them. If you find yourself wanting to write code for a case that has *no test*, STOP — that is a missing test: go back to RED (or it's a Phase-1 / Coverage-Map gap, or an S1 spec gap). Never silently over-build past the tests, and never under-build below them.
511
+
512
+ > **TDD GREEN vs "NEVER fix production code" (Phase 4) — disambiguation:**
513
+ > - Writing NEW production code to satisfy a NEW failing TDD test: **REQUIRED** (this is GREEN).
514
+ > - Modifying EXISTING production code because an EXISTING test started failing in Phase 3/4: **requires AskUserQuestion first** (this is the "NEVER" rule).
515
+ > The difference is: TDD writes code toward a test written moments ago. Fix Loop touches code that was already green. Don't confuse the two.
516
+
517
+ Run (filtered):
518
+ ```
519
+ TEST_CMD --filter "<story test name>"
520
+ ```
521
+
522
+ - **PASSES** → continue to Step 3.
523
+ - **FAILS** → fix production code (not the test). Max 3 attempts, then stop and report per Phase 4.
524
+
525
+ ### Step 3 — REFACTOR (optional)
526
+
527
+ If the implementation introduced duplication, unclear naming, or violated existing patterns — refactor now while tests are green. Run tests after refactoring to confirm nothing broke.
528
+
529
+ ### Step 4 — Update progress
530
+
531
+ Mark the story `done` in `.build-progress`:
532
+ ```bash
533
+ # Example after S-002 passes:
534
+ # S-001 done
535
+ # S-002 done
536
+ # S-003 pending
537
+ ```
538
+ Write the full file each time (overwrite, not append) to keep state clean.
539
+
540
+ **Test count assertion (MANDATORY):** Confirm tests were actually added by diff-counting:
541
+ ```
542
+ git diff --stat <test-dir>
543
+ ```
544
+ Record for the Phase 5 summary: `S-00X added N tests: <list exact test names>`. Test names must be grep-able in the test file.
545
+
546
+ - **N ≥ 1** → normal case. Story is `done`.
547
+ - **N = 0** → only acceptable if the story is a pure refactor AND existing tests already cover the changed path. Record: `S-00X added 0 tests: REFACTOR_ONLY — covered by existing <file:test name>`. Otherwise, story is NOT `done` — add tests first.
548
+
549
+ **Checklist update (MANDATORY):** Open `.build-checklist` and tick the lines this story covers:
550
+
551
+ ```
552
+ [x] AS-012.T1 — covered by affected_tests_test.rs:test_convention_match
553
+ [~] AS-012.T2 — PARTIAL: query wired, emit deferred → M3 S-008
554
+ ```
555
+ For `[x]`, record `file:test-name`. For `[~]`, record the destination.
556
+
557
+ **Carve-out scan on the story diff:**
558
+ ```
559
+ git diff -- <story-files> | grep -E '^\+[^+]' | grep -nE "TODO|FIXME|XXX|HACK"
560
+ ```
561
+ Each match not already in the checklist → add a new `[~]` line with destination. Matches without a concrete destination → the story is NOT `done`; either (a) create a new story in the plan, or (b) add a Known-Gap row to the spec, before closing.
562
+
563
+ **Concrete destination** = one of these grep-able sources (priority order):
564
+ 1. Story ID in `docs/specs/<feature>/<feature>.md` (section `## Stories` — grep `S-NNN` or `M<X> S-NNN`)
565
+ 2. Row in `<feature>.md` Known-Gaps / Not-in-Scope section
566
+ 3. Issue tracker ID if the project declares one (GitHub `#NNN`, JIRA `ABC-NNN`) — verify with `gh issue view` or URL regex; no online check required if the author confirms
567
+ 4. External plan file if the project declares `plan_file: <path>` in CLAUDE.md
568
+
569
+ Not accepted: TODO.md, free-form code comments, "future work", "later", "Phase X" without a corresponding row.
570
+
571
+ **If the project does not use a formal spec/plan** (single-story bug fix, no `/sp-plan`): skip the destination rule for this build and apply a lighter rule instead: each TODO in the diff must have a 1-line justification in the summary. Log this in the Phase 5 output as `CARVE_OUT_RELAXED: no spec context`.
572
+
573
+ **Reverse-map check (catch orphans — code exists, spec does not):**
574
+
575
+ For each "artifact" newly appearing in the story diff (not only TODOs):
576
+ - New file under `src/` production (not tests)
577
+ - New publicly exported function/class/type
578
+ - New DDL/schema object (table/index/enum) — detect in a language-agnostic way by grepping the declarative keywords this project uses
579
+ - New public API endpoint / CLI command
580
+ - New config key / feature flag
581
+
582
+ For each artifact → ask: "which checklist line (AS/Constraint) requires this artifact to exist?"
583
+
584
+ - Maps to ≥1 checklist line → OK
585
+ - No mapping → FLAG ORPHAN. Three ways to handle:
586
+ (a) Artifact is genuinely required → add a checklist line sourced from the AS/Constraint that requires it. If no AS requires it → the spec is missing coverage; add a Known-Gap or run `/sp-plan` to add an AS
587
+ (b) Artifact is infrastructure for a later story → convert to `[~] <artifact> — deferred use → <future story/gap>`
588
+ (c) Orphan (legacy/experiment) → remove or justify in the spec
589
+
590
+ This rule is **language-agnostic**: the dev decides what counts as an "artifact" based on the diff. The skill does not grep DDL or parse ASTs. It only requires "everything new has a documented reason".
591
+
592
+ **Ordering gate:** tick the checklist BEFORE marking the story `done` in `.build-progress`. If a checklist line with `owner: <this-story>` is not yet ticked or converted to `[~]`/`[N/A]` → story is NOT `done`. One-way sync: progress = f(checklist).
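The ordering gate can be expressed as a query over the checklist: a story may be marked `done` only when none of the lines it owns is still `[ ]`. A minimal sketch, assuming untouched lines keep the `| owner: S-NNN` column from the Phase 0.6 format; `story_open_lines` is a hypothetical name.

```shell
# Hypothetical helper (not shipped with specpipe).
# $1 = .build-checklist, $2 = story ID such as S-003.
# Prints the untouched lines owned by the story; exit status 1 means NOT done.
story_open_lines() {
  local open
  # "( |$)" keeps S-003 from also matching S-0031.
  open=$(grep -E "owner: $2( |\$)" "$1" | grep -E '^\[ \]' || true)
  [ -z "$open" ] && return 0
  printf '%s\n' "$open"
  return 1
}
```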
593
+
594
+ **Then proceed to the next story.**
595
+
596
+ ---
597
+
598
+ **Before moving to Phase 3, verify** (inline run only — a dispatched Auto-Mode subagent stops after its one story and reports back; the controller runs Phase 3):
599
+ - [ ] All public functions have unit tests
600
+ - [ ] All API endpoints have integration tests
601
+ - [ ] Edge cases covered (null, empty, invalid, boundary)
602
+ - [ ] Error paths tested (not just happy path)
603
+ - [ ] Tests are independent (no shared state)
604
+ - [ ] Assertions are specific and meaningful
605
+
606
+ ---
607
+
608
+ ## Phase 3: Build and Run
609
+
610
+ This runs the full test suite after all stories are complete. Individual story tests were already verified in Phase 2.
611
+
612
+ Compile/typecheck first (tsc --noEmit, cargo check, go vet, swift build, etc.).
613
+
614
+ Then run all tests:
615
+ ```
616
+ TEST_CMD
617
+ ```
618
+
619
+ ---
620
+
621
+ ## Phase 3.5: Spec Coverage Gate (deterministic — the actual guarantee)
622
+
623
+ Everything else (checklist ticks, per-story counts, reviewer grep) is LLM judgement and can drift. This gate is the one place coverage is **counted by a command, not hoped for** — it makes "every spec case has ≥1 test" an invariant: the build does not pass while any AS/C is uncovered.
624
+
625
+ It works because **tests embed their `AS-NNN`/`C-NNN` in the test name** (Phase 1 quality check). The gate is a set-difference: spec IDs minus IDs found in the test files.
626
+
627
+ ```bash
628
+ SPEC=docs/specs/<feature>/<feature>.md
629
+ TESTDIR=<test dir> # e.g. tests/ or src/ (resolve like TEST_CMD)
630
+
631
+ # Obligations from the spec: every AS-NNN and C-NNN under ## Stories / Constraints
632
+ # Use -w (word match), NOT \b — \b is a GNU-ism; on BSD/macOS grep it produces
633
+ # phantom short IDs (e.g. "AS-01" out of "AS-010"). -owE is portable and exact.
634
+ grep -owE '(AS|C)-[0-9]+' "$SPEC" | sort -u > /tmp/spec-ids.txt
635
+ # Covered: IDs that actually appear in a test file (tests embed the ID in their name)
636
+ # -h is REQUIRED with -r: without it grep prefixes each match with "file:" and the
637
+ # set-difference never matches (gate would falsely report everything uncovered).
638
+ grep -rowhE '(AS|C)-[0-9]+' "$TESTDIR" | sort -u > /tmp/covered-ids.txt
639
+ # Uncovered = in spec, not in any test
640
+ comm -23 /tmp/spec-ids.txt /tmp/covered-ids.txt
641
+ ```
642
+
643
+ - **Any line printed → BLOCKED.** Those AS/C have no test carrying their ID. List them; do not mark the build DONE. (Single-story / no-formal-spec builds: skip — log `COVERAGE_GATE: no spec`.)
644
+ - **Open gaps:** run `grep -E 'GAP-[0-9]+' "$SPEC"` and list every gap whose status is not `resolved` in the summary as "unresolved gaps (not blocking, but visible)". A gap is never silently dropped: it is either resolved into an AS (via `/sp-plan`, then it gets counted) or shown as open.
645
+
646
+ **What this gate does and does NOT prove.** It proves identity presence (necessary): no AS is missing a test. It does NOT alone prove the test is meaningful — that's the per-AS rules: one test node per primary AS, with a real assertion (Phase 1), and the strong form, **falsifiability** — negating an AS's Then should turn its test red. Identity gate = cheap and deterministic (run it every build); falsifiability via mutation testing = the north-star, optional/advanced. The gate stops silent *absence*; the assertion rules stop silent *emptiness*.
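For scripted use, the same set-difference can be wrapped so that it returns a non-zero status when anything is uncovered and does not rely on fixed `/tmp` file names. This is a sketch, not the shipped gate; `coverage_gate` is a hypothetical name and the output wording is an assumption.

```shell
# Hypothetical wrapper (not shipped with specpipe).
# $1 = spec file, $2 = test directory.
# Exit status 0 = PASS, 1 = BLOCKED (uncovered IDs are printed).
coverage_gate() {
  local want have missing
  want=$(mktemp) && have=$(mktemp) || return 2
  grep -owE '(AS|C)-[0-9]+' "$1" | sort -u > "$want"
  # -h drops the "file:" prefix that -r would otherwise add to each match.
  grep -rowhE '(AS|C)-[0-9]+' "$2" | sort -u > "$have"
  missing=$(comm -23 "$want" "$have")
  rm -f "$want" "$have"
  if [ -n "$missing" ]; then
    echo "COVERAGE_GATE: BLOCKED, uncovered:"
    printf '%s\n' "$missing"
    return 1
  fi
  echo "COVERAGE_GATE: PASS"
}
```

As with the original commands, a PASS proves only that every ID appears in some test file, not that the test asserts anything meaningful.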
647
+
648
+ **Auto-mode:** the controller runs this gate at A7 (finish) over the whole spec, and may run the per-story slice after each story's gates (A4). Dispatched subagents are told (A2 contract) to embed `AS-NNN` in every test name so the gate can see their work.
649
+
650
+ ---
651
+
652
+ ## Phase 4: Fix Loop
653
+
654
+ If tests fail:
655
+ 1. Read error output. Is the test wrong or the production code wrong?
656
+ 2. If production code seems wrong → use `AskUserQuestion`:
657
+
658
+ ```json
659
+ {
660
+ "questions": [
661
+ {
662
+ "question": "Test expects <X> but code does <Y>. Which is correct?",
663
+ "header": "Test vs Code Mismatch",
664
+ "multiSelect": false,
665
+ "options": [
666
+ {"label": "Fix production code — the test is correct (human: ~30m / CC: ~10m) | Completeness: 10/10"},
667
+ {"label": "Adjust the test — the code behavior is intentional (human: ~10m / CC: ~5m) | Completeness: 7/10"}
668
+ ]
669
+ }
670
+ ]
671
+ }
672
+ ```
673
+ 3. Fix test code only, unless the user chose "Fix production code" in step 2. Re-run. Max 3 attempts, then stop and report.
674
+
675
+ **NEVER (applies to Fix Loop — existing tests that regressed; does NOT apply to TDD GREEN in Phase 2):**
676
+ - Fix existing production code without asking
677
+ - Delete or weaken existing tests
678
+ - Add `skip`/`xit`/`@disabled` to hide failures
679
+ - Use mocks solely to avoid a real failure
680
+
681
+ ---
682
+
683
+ ## Phase 4.5: Pre-Summary Review
684
+
685
+ Walk `.build-checklist` before writing the summary. This is in-place verification — it prevents the user from having to re-run the skill just to audit.
686
+
687
+ **For each line not marked `[x]`:**
688
+
689
+ 1. **`[~]` partial:** verify the destination still exists.
690
+ - Story ID → `grep "<story-id>"` in the plan/spec → must match
691
+ - Known-Gap row → grep in `<feature>.md` → must match
692
+ - No match → FAIL: destination has vanished (moved/deleted), must re-bind before closing the build.
693
+
694
+ 2. **`[ ]` untouched but this build was supposed to cover it** (lines with `owner: S-NNN` belonging to closed stories):
695
+ - This is NOT vague "self-investigation". Concrete evidence is required:
696
+ - Grep the owner story's test file → any assertion matching the Then clause?
697
+ - Grep the owner story's production diff → any code emitting this output?
698
+ - Both absent → the owner story shipped incomplete. **Reopen the story** (revert to `pending` in `.build-progress`), add test+code, OR convert to `[~]` with a concrete destination.
699
+ - A dev may NOT convert `[ ]` → `[~] scope drift` without commit SHA / diff evidence showing the requirement changed mid-build. "scope drift" without evidence = miss.
700
+
701
+ 3. **`[N/A]`** needs no action (declared out-of-scope upfront).
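The destination check for `[~]` lines can be sketched as follows. It assumes destinations are written after a `→` as `S-NNN` or `GAP-NNN` tokens, as in the Phase 2 example; Known-Gap rows without such an ID would need a different match. `vanished_destinations` is a hypothetical name.

```shell
# Hypothetical helper (not shipped with specpipe).
# $1 = .build-checklist, $2 = spec/plan file.
# Prints one FAIL line per [~] destination that no longer appears in the spec.
vanished_destinations() {
  grep -E '^\[~\]' "$1" |
    sed -e 's/ *| *owner:.*$//' -e 's/^.*→//' |   # drop owner column, keep text after the arrow
    grep -owE '(S|GAP)-[0-9]+' | sort -u |
    while read -r id; do
      grep -qwF -- "$id" "$2" || echo "FAIL: destination $id not found in spec"
    done
}
```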
702
+
703
+ **The output of this phase flows straight into Phase 5 Summary** (see format below).
704
+
705
+ ---
706
+
707
+ ## Phase 5: Summary
708
+
709
+ Start with one of:
710
+ - **DONE** — All stories green, implementation risks addressed, no signal needed, **AND checklist is 100% `[x]` or `[N/A]`**.
711
+ - **DONE_WITH_CONCERNS** — Green but: [P2 risks from Phase 0.5 / coverage gaps / spec signal / **any `[~]` carve-outs in checklist**]
712
+ - **BLOCKED** — Cannot proceed: [what's blocking, what was tried, 3-attempt limit hit]
713
+ - **NEEDS_CONTEXT** — Missing info to continue: [what's needed and why]
714
+
715
+ ```
716
+ Tests: X added, Y modified, Z unchanged
717
+ Result: All passing ✓ / N failing ✗
718
+ Coverage: [critical uncovered paths if any]
719
+ Files changed: [production files touched]
720
+ Files tested: [test files touched]
721
+ Stories: [AS-001 ✓, AS-002 ✓, AS-005 new]
722
+ TDD evidence: [S-001: RED (paste 1st failing assertion raw) → GREEN ✓ | tests added: <names>, S-002: RED (raw output) → GREEN ✓ | tests added: <names>]
723
+ Checklist: X/Y [x], A/Y [~] (destinations: <story-id list or Known-Gap refs>), B/Y [ ] (reasons), C/Y [N/A]
724
+ Coverage gate (Phase 3.5): PASS — all AS/C carry a test | BLOCKED — uncovered: <AS/C ids> (breadth)
725
+ Edge Case Compliance: [per-story table — every row ✓ or N/A+reason] (depth)
726
+ Open gaps: [GAP-NNN not yet resolved, or "none"]
+ E2E: [authored + green: <test names> | deferred non-critical (with reason): <flows> | none]. A critical [→E2E] left unwritten → status is DONE_WITH_CONCERNS, not DONE.
+ Eval needed: [→EVAL gaps from Coverage Map, or "none"]
+ Manual needed: [→MANUAL gaps from Coverage Map, or "none"]
+ ```
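
The `Checklist:` line in the summary is a tally of markers, so it can be derived mechanically. The sketch below assumes a hypothetical checklist layout in which each item is a list line starting with `- [x]`, `- [~]`, `- [ ]`, or `- [N/A]`; the real `.build-checklist` format may differ. It also leaves out the parenthesised destinations and reasons, which need per-item data.

```javascript
// Maps each marker to a counter name (names are illustrative).
const MARKERS = {
  "[x]": "done",
  "[~]": "deferred",
  "[ ]": "open",
  "[N/A]": "notApplicable",
};

// Counts checklist markers; lines that are not checklist items are skipped.
function countChecklist(text) {
  const counts = { done: 0, deferred: 0, open: 0, notApplicable: 0, total: 0 };
  for (const line of text.split("\n")) {
    const match = line.match(/^\s*-\s*(\[(?:x|~| |N\/A)\])/);
    if (!match) continue;
    counts[MARKERS[match[1]]] += 1;
    counts.total += 1;
  }
  return counts;
}

// Renders the counts in the X/Y shape used by the summary block.
function checklistLine(c) {
  return (
    `Checklist: ${c.done}/${c.total} [x], ${c.deferred}/${c.total} [~], ` +
    `${c.open}/${c.total} [ ], ${c.notApplicable}/${c.total} [N/A]`
  );
}

const exampleChecklist = [
  "- [x] AS-001 happy path",
  "- [~] AS-002 deferred to S-004",
  "- [N/A] AS-003 out of scope",
].join("\n");
console.log(checklistLine(countChecklist(exampleChecklist)));
// prints Checklist: 1/3 [x], 1/3 [~], 0/3 [ ], 1/3 [N/A]
```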
+
+ **Progress file cleanup:**
+ - All stories done AND checklist is 100% `[x]`/`[N/A]` → delete `docs/specs/<feature>/.build-progress` and `.build-checklist`
+ - Stories remaining OR any `[~]` carve-outs → leave both files. Log: "Progress + checklist saved — resume with `/sp-build`"
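
The cleanup rule can be sketched as a pure decision function that returns which files to remove and which to keep, leaving the actual deletion to the caller. The function name and input shape are hypothetical. It assumes `.build-checklist` sits next to `.build-progress` in the feature directory, and it keeps both files whenever the checklist is not fully `[x]`/`[N/A]`, which also covers leftover `[ ]` items that the two bullets above do not mention explicitly.

```javascript
// Hypothetical helper: decides the fate of the two progress files.
// checklist = { done, notApplicable, total } counts of [x] and [N/A] items.
function cleanupPlan(feature, { storiesRemaining, checklist }) {
  const files = [
    `docs/specs/${feature}/.build-progress`,
    `docs/specs/${feature}/.build-checklist`, // location is an assumption
  ];
  const fullyResolved =
    checklist.done + checklist.notApplicable === checklist.total;
  if (storiesRemaining === 0 && fullyResolved) {
    return { remove: files, keep: [] };
  }
  // Work remains or carve-outs exist: keep both so /sp-build can resume.
  return { remove: [], keep: files };
}

const plan = cleanupPlan("login", {
  storiesRemaining: 0,
  checklist: { done: 3, notApplicable: 1, total: 4 },
});
console.log(plan.remove.length); // prints 2
```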
+
+ ### Spec Update Signal
+
+ **Relationship with Phase 0.6 Checklist:**
+ - Checklist is an **evidence artifact** (what got done / deferred).
+ - S1/S2/S3 are **action signals** (user must run `/sp-plan`).
+ - Both fire when their conditions are met — they do not suppress each other.
+
+ Mapping:
+ - Checklist `[~]` with destination = new Known-Gap row → also fires **S3** (new constraint not yet documented)
+ - Checklist `[ ]` on a closed owner story, plus any `[STALE]` lines = code drift from spec → fires **S2**
+ - A new test with no matching AS in the checklist (caught by reverse-map in Phase 2 Step 4) → fires **S1**
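
The three mapping rules above translate directly into code. In this sketch the item fields (`marker`, `destination`, `ownerStoryClosed`, `stale`) and the `"known-gap"` destination value are hypothetical names chosen for illustration, not part of any specpipe data format.

```javascript
// Hypothetical item shape: { marker, destination, ownerStoryClosed, stale }.
// testsWithoutScenario: names of new tests that map to no AS in the checklist.
function signalsFromChecklist(items, testsWithoutScenario) {
  const fired = new Set();
  for (const item of items) {
    // [~] deferred into a new Known-Gap row: undocumented constraint.
    if (item.marker === "[~]" && item.destination === "known-gap") fired.add("S3");
    // [ ] on a closed owner story, or a [STALE] line: code drifted from spec.
    if (item.marker === "[ ]" && item.ownerStoryClosed) fired.add("S2");
    if (item.stale) fired.add("S2");
  }
  // A test with no matching acceptance scenario.
  if (testsWithoutScenario.length > 0) fired.add("S1");
  return [...fired].sort();
}

console.log(
  signalsFromChecklist(
    [
      { marker: "[~]", destination: "known-gap" },
      { marker: "[ ]", ownerStoryClosed: true },
    ],
    []
  )
); // prints [ 'S2', 'S3' ]
```

The checklist counts and the fired signals are returned separately on purpose, since the summary must report both without merging them.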
+
+ The summary must show both the checklist stats AND the signal block — do not merge them.
+
+ After every build, check against these conditions. If ANY is true → **must** signal.
+
+ **Signal when (MUST):**
+
+ | # | Condition |
+ |---|-----------|
+ | S1 | A new test covers behavior, edge case, or error path with no corresponding AS in the spec |
+ | S2 | Code behavior no longer matches the Given/When/Then of an existing AS (spec is stale) |
+ | S3 | Implementation adds a new constraint or guard not documented in any AS or Constraints section |
+
+ **Do not signal when:**
+ - Pure refactor — behavior unchanged, all existing AS still map correctly
+ - Performance fix — same output, just faster
+ - Fix to match spec — code was wrong, spec was right, no new behavior added
+
+ **Signal format:**
+ ```
+ ⚠️ Spec Update Needed — run `/sp-plan docs/specs/<feature>/<feature>.md '<describe change>'`
+ Reason: [S1 | S2 | S3] — <one line: what is missing or mismatched>
+ ```
+
+ If S1 applies to a failing test: state **"This failure suggests a missing acceptance scenario."** Describe the gap and prompt the user to run `/sp-plan` before re-running `/sp-build`. Do not silently add the test without the AS.
+
+ ## Rules
+ 1. **Behavior over implementation.** Test what code DOES, not how.
+ 2. **Independent tests.** Each test sets up its own state, cleans up after.
+ 3. **Spec stays upstream.** If a test reveals a spec gap (S1), signal and update the spec before adding the test. If code drifts from spec (S2), signal. If a new constraint is added (S3), signal.