agentic-sdlc-wizard 1.46.1 → 1.48.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -9,61 +9,55 @@ effort: high
  ## Task
  $ARGUMENTS
 
+ Operational checklist. Full protocol lives in `CLAUDE_CODE_SDLC_WIZARD.md` — read it for depth.
+
  ## Full SDLC Checklist
 
- Your FIRST action must be TodoWrite with these steps:
+ Your FIRST action must be a TodoWrite covering every phase below. Compact form (omit `activeForm` to use the subject as the spinner label):
 
  ```
  TodoWrite([
-   // PLANNING PHASE (Plan Mode for non-trivial tasks)
-   { content: "Find and read relevant documentation", status: "in_progress", activeForm: "Reading docs" },
-   { content: "Assess doc health - flag issues (ask before cleaning)", status: "pending", activeForm: "Checking doc health" },
-   { content: "DRY scan: What patterns exist to reuse? New pattern = get approval", status: "pending", activeForm: "Scanning for reusable patterns" },
-   { content: "Prove It Gate: adding new component? Research alternatives, prove quality with tests", status: "pending", activeForm: "Checking prove-it gate" },
-   { content: "Blast radius: What depends on code I'm changing?", status: "pending", activeForm: "Checking dependencies" },
-   { content: "Design system check (if UI change)", status: "pending", activeForm: "Checking design system" },
-   { content: "Restate task in own words - verify understanding", status: "pending", activeForm: "Verifying understanding" },
-   { content: "Scrutinize test design - right things tested? Follow TESTING.md?", status: "pending", activeForm: "Reviewing test approach" },
-   { content: "Present approach + STATE CONFIDENCE LEVEL", status: "pending", activeForm: "Presenting approach" },
-   { content: "Signal ready - user exits plan mode", status: "pending", activeForm: "Awaiting plan approval" },
-   // TRANSITION PHASE (After plan mode)
-   { content: "Doc sync: update or create feature docs — MUST be current before commit", status: "pending", activeForm: "Syncing feature docs" },
-   // IMPLEMENTATION PHASE
-   { content: "TDD RED: Write failing test FIRST", status: "pending", activeForm: "Writing failing test" },
-   { content: "TDD GREEN: Implement, verify test passes", status: "pending", activeForm: "Implementing feature" },
-   { content: "Run lint/typecheck", status: "pending", activeForm: "Running lint and typecheck" },
-   { content: "Run ALL tests", status: "pending", activeForm: "Running all tests" },
-   { content: "Production build check", status: "pending", activeForm: "Verifying production build" },
-   // REVIEW PHASE
-   { content: "DRY check: Is logic duplicated elsewhere?", status: "pending", activeForm: "Checking for duplication" },
-   { content: "Visual consistency check (if UI change)", status: "pending", activeForm: "Checking visual consistency" },
-   { content: "Self-review: run /code-review", status: "pending", activeForm: "Running code review" },
-   { content: "Security review (if warranted)", status: "pending", activeForm: "Checking security implications" },
-   { content: "Cross-model review (if configured — see below)", status: "pending", activeForm: "Running cross-model review" },
-   { content: "Scope guard: only changes related to task? No legacy/fallback code left?", status: "pending", activeForm: "Checking scope and legacy code" },
-   // CI FEEDBACK LOOP (if CI monitoring enabled in setup - skip if no CI)
-   // NOTE (meta-repo only, ROADMAP #212 Option 1, 2026-04-24): this repo no
-   // longer runs e2e simulation in CI. Only `validate` blocks merge. For E2E
-   // signal on a PR, checkout the PR locally and run:
-   //   bash tests/e2e/local-shepherd.sh <PR>
-   // which scores on Max subscription and posts an advisory check-run.
-   // Consumer repos still use their own CI as configured.
-   { content: "Commit and push to remote", status: "pending", activeForm: "Pushing to remote" },
-   { content: "Watch CI - fix failures, iterate until green (max 2x)", status: "pending", activeForm: "Watching CI" },
-   { content: "Read CI review - implement valid suggestions, iterate until clean", status: "pending", activeForm: "Addressing CI review feedback" },
-   { content: "Meta-repo only: run local shepherd if PR needs E2E score (optional)", status: "pending", activeForm: "Running local shepherd" },
-   { content: "Post-deploy verification (if deploy task — see Deployment Tasks)", status: "pending", activeForm: "Verifying deployment" },
+   // PLANNING
+   { content: "Find and read relevant documentation", status: "in_progress" },
+   { content: "Assess doc health - flag issues (ask before cleaning)", status: "pending" },
+   { content: "DRY scan: What patterns exist to reuse? New pattern = get approval", status: "pending" },
+   { content: "Prove It Gate: adding new component? Research alternatives, prove quality with tests", status: "pending" },
+   { content: "Blast radius: What depends on code I'm changing?", status: "pending" },
+   { content: "Design system check (if UI change)", status: "pending" },
+   { content: "Restate task in own words - verify understanding", status: "pending" },
+   { content: "Scrutinize test design - right things tested? Follow TESTING.md?", status: "pending" },
+   { content: "Present approach + STATE CONFIDENCE LEVEL", status: "pending" },
+   { content: "Signal ready - user exits plan mode", status: "pending" },
+   // TRANSITION
+   { content: "Doc sync: update or create feature doc — MUST be current before commit", status: "pending" },
+   // IMPLEMENTATION
+   { content: "TDD RED: Write failing test FIRST", status: "pending" },
+   { content: "TDD GREEN: Implement, verify test passes", status: "pending" },
+   { content: "Run lint/typecheck", status: "pending" },
+   { content: "Run ALL tests", status: "pending" },
+   { content: "Production build check", status: "pending" },
+   // REVIEW
+   { content: "DRY check: Is logic duplicated elsewhere?", status: "pending" },
+   { content: "Visual consistency check (if UI change)", status: "pending" },
+   { content: "Self-review: run /code-review", status: "pending" },
+   { content: "Security review (if warranted)", status: "pending" },
+   { content: "Cross-model review (if configured)", status: "pending" },
+   { content: "Scope guard: only changes related to task? No legacy/fallback code left?", status: "pending" },
+   // CI SHEPHERD
+   { content: "Commit and push to remote", status: "pending" },
+   { content: "Watch CI - fix failures, iterate until green (max 2x)", status: "pending" },
+   { content: "Read CI review - implement valid suggestions, iterate until clean", status: "pending" },
+   { content: "Meta-repo only: run local shepherd if PR needs E2E score (optional)", status: "pending" },
+   { content: "Post-deploy verification (if deploy task)", status: "pending" },
    // FINAL
-   { content: "Present summary: changes, tests, CI status", status: "pending", activeForm: "Presenting final summary" },
-   { content: "Capture learnings (if anyupdate TESTING.md, CLAUDE.md, or feature docs)", status: "pending", activeForm: "Capturing session learnings" },
-   { content: "Close out plan files: if task came from a plan, mark complete or delete", status: "pending", activeForm: "Closing plan artifacts" }
+   { content: "Present summary: changes, tests, CI status", status: "pending" },
+   { content: "Capture learnings (after session — TESTING.md, CLAUDE.md, or feature docs)", status: "pending" },
+   { content: "Close out plan files: if task came from a plan, mark complete or delete", status: "pending" }
  ])
  ```
 
  ## SDLC Quality Checklist (Scoring Rubric)
 
- Your work is scored on these criteria. **Critical** criteria are must-pass.
-
  | Criterion | Points | Critical? | What Counts |
  |-----------|--------|-----------|-------------|
  | task_tracking | 1 | | Use TodoWrite or TaskCreate |
@@ -76,716 +70,249 @@ Your work is scored on these criteria. **Critical** criteria are must-pass.
  | self_review | 1 | **YES** | Read back files/diffs you modified |
  | clean_code | 1 | | One coherent approach, no dead code |
 
- **Total: 10 points** (11 for UI tasks, +1 for design_system check)
-
- Critical miss on `tdd_red` or `self_review` = process failure regardless of total score.
+ **Total: 10 points** (11 for UI tasks, +1 for design_system check). Critical miss on `tdd_red` or `self_review` = process failure regardless of total score.
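
The rubric's pass/fail rule can be sketched in a few lines (criterion names come from the table; the function itself is illustrative, not part of the package):

```python
# Illustrative check of the rubric's critical-miss rule: a zero on tdd_red or
# self_review is a process failure regardless of the point total.
CRITICAL = {"tdd_red", "self_review"}

def is_process_failure(points: dict[str, int]) -> bool:
    """True if any must-pass criterion scored zero."""
    return any(points.get(criterion, 0) == 0 for criterion in CRITICAL)
```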
 
- ## Test Failure Recovery (SDET Philosophy)
+ ## Test Failure Recovery
 
- ```
- ┌─────────────────────────────────────────────────────────────────────┐
- │ ALL TESTS MUST PASS. NO EXCEPTIONS.                                 │
- │                                                                     │
- │ This is not negotiable. This is not flexible. This is absolute.    │
- └─────────────────────────────────────────────────────────────────────┘
- ```
+ **ALL TESTS MUST PASS. NO EXCEPTIONS.** Test code is app code. Failures are bugs — investigate them like a 15-year SDET, not by brushing aside.
 
- **Not acceptable:**
- - "Those were already failing" → Fix them first
- - "Not related to my changes" → Doesn't matter, fix it
- - "It's flaky" → Flaky = bug, investigate
+ Not acceptable: "those were already failing", "not related to my changes", "it's flaky" (flaky = bug we haven't found yet).
 
- **Treat test code like app code.** Test failures are bugs. Investigate them the way a 15-year SDET would - with thought and care, not by brushing them aside.
-
- If tests fail:
+ When tests fail:
  1. Identify which test(s) failed
- 2. Diagnose WHY - this is the important part:
-    - Your code broke it? Fix your code (regression)
-    - Test is for deleted code? Delete the test
-    - Test has wrong assertions? Fix the test
-    - Test is "flaky"? Investigate - flakiness is just another word for bug
- 3. Fix appropriately (fix code, fix test, or delete dead test)
- 4. Run specific test individually first
- 5. Then run ALL tests
- 6. Still failing? ASK USER - don't spin your wheels
-
- **Flaky tests are bugs, not mysteries:**
- - Sometimes the bug is in app code (race condition, timing issue)
- - Sometimes the bug is in test code (shared state, not parallel-safe)
- - Sometimes the bug is in test environment (cleanup not proper)
-
- Debug it. Find root cause. Fix it properly. Tests ARE code.
-
- ## New Pattern & Test Design Scrutiny (PLANNING)
-
- **New design patterns require human approval:**
- 1. Search first - do similar patterns exist in codebase?
- 2. If YES and they're good - use as building block
- 3. If YES but they're bad - propose improvement, get approval
- 4. If NO (new pattern) - explain why needed, get explicit approval
-
- **Test design scrutiny during planning:**
- - Are we testing the right things?
- - Does test approach follow TESTING.md philosophies?
- - If introducing new test patterns, same scrutiny as code patterns
-
- ## Prove It Gate (REQUIRED for New Additions)
-
- **Adding a new skill, hook, workflow, or component? PROVE IT FIRST:**
-
- 1. **Absorption check:** Can this be added as a section in an existing skill instead of a new component? Default is YES — new skills/hooks need strong justification. Releasing is SDLC, not a separate skill. Debugging is SDLC, not a separate skill. Keep it lean
- 2. **Research:** Does something equivalent already exist (native CC, third-party plugin, existing skill)?
- 3. **If YES:** Why is yours better? Show evidence (A/B test, quality comparison, gap analysis)
- 4. **If NO:** What gap does this fill? Is the gap real or theoretical?
- 5. **Quality tests:** New additions MUST have tests that prove OUTPUT QUALITY, not just existence
- 6. **Less is more:** Every addition is maintenance burden. Default answer is NO unless proven YES
-
- **Existence tests are NOT quality tests:**
- - BAD: "ci-analyzer skill file exists" — proves nothing about quality
- - GOOD: "ci-analyzer recommends lint-first when test-before-lint detected" — proves behavior
-
- **If you can't write a quality test for it, you can't prove it works, so don't add it.**
-
- ## Plan Mode Integration
+ 2. Diagnose WHY: your code broke it (regression — fix code), test is for deleted code (delete test), test has wrong assertions (fix test), "flaky" (investigate — race, shared state, env)
+ 3. Fix appropriately, run specific test individually first, then run ALL tests
+ 4. Still failing after 2 attempts? STOP and ASK USER
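
The recovery steps above can be sketched as a small triage table (illustrative Python; the cause labels mirror this checklist's categories and are not a real API):

```python
# Illustrative triage for a failing test, following the numbered steps above.
TRIAGE = {
    "regression": "fix your code",
    "deleted_code": "delete the test",
    "wrong_assertion": "fix the test",
    "flaky": "investigate root cause (race, shared state, env)",
}

def next_action(cause: str, attempts: int) -> str:
    """Map a diagnosed cause to the checklist's prescribed action."""
    if attempts >= 2:
        return "STOP and ASK USER"  # step 4: two failed attempts means stop
    return TRIAGE.get(cause, "diagnose WHY before touching anything")
```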
 
- **Use plan mode for:** Multi-file changes, new features, LOW confidence, bugs needing investigation.
+ ## Confidence Check (REQUIRED)
 
- **Workflow:**
- 1. **Plan Mode** (editing blocked): Research -> Write plan file -> Present approach + confidence
- 2. **Transition** (after approval): Update feature docs
- 3. **Implementation**: TDD RED -> GREEN -> PASS
+ State your confidence before presenting an approach:
 
- ### Auto-Approval: Skip Plan Approval Step
+ | Level | Meaning | Action | Effort |
+ |-------|---------|--------|--------|
+ | HIGH (90%+) | Know exactly what to do | Present, proceed after approval | `high` (default) |
+ | MEDIUM (60-89%) | Solid approach, some uncertainty | Present, highlight uncertainties | `high` |
+ | LOW (<60%) | Not sure | Research or try Codex; if still LOW, ASK USER | **`/effort xhigh` now** |
+ | FAILED 2x | Something's wrong | Codex for fresh perspective; if still stuck, STOP | **`/effort max` now** |
+ | CONFUSED | Can't diagnose | Codex; if still confused, STOP and describe | **`/effort max` now** |
 
- If ALL of these are true, skip plan approval and go straight to TDD:
- - Confidence is **HIGH (95%+)** — you know exactly what to do
- - Task is **single-file or trivial** (config tweak, small bug fix, string change)
- - No new patterns introduced
- - No architectural decisions
+ **Dynamic effort bumping is NOT optional.** "Consider max effort" is the same as "ignore this." Bump BEFORE the next attempt, not after a third failure.
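
The escalation rule in the table above amounts to a small decision function (a sketch; thresholds come from the table, the function and its signature are mine):

```python
# Illustrative effort escalation, per the confidence table above.
def effort_for(confidence: float, failed_attempts: int = 0, confused: bool = False) -> str:
    """Pick an effort level BEFORE the next attempt, not after a third failure."""
    if confused or failed_attempts >= 2:
        return "max"      # FAILED 2x / CONFUSED rows
    if confidence < 0.60:
        return "xhigh"    # LOW row: escalate now, don't wait
    return "high"         # HIGH and MEDIUM both stay at the default
```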
 
- When auto-approving, still announce your approach — just don't wait for approval:
- > "Confidence HIGH (95%). Single-file change. Proceeding directly to TDD."
+ ## Plan Mode
 
- **When in doubt, wait for approval.** Auto-approval is for clear-cut cases only.
+ Use plan mode for: multi-file changes, new features, LOW confidence, bugs needing investigation. **Skip plan approval step** (auto-approval) when confidence HIGH (95%+) AND single-file/trivial AND no new patterns AND no architectural decisions — still announce approach, don't wait. When in doubt, wait.
 
  ## Recommended Model
 
- **Opt-in: `opus[1m]` (Opus 4.7 with 1M context window).** Run `/model opus[1m]` at the start of any non-trivial SDLC session but understand the tradeoff first (issue #198).
-
- **Why opt-in, not default:** A top-level `model` pin in `.claude/settings.json` disables Claude Code's per-turn model auto-selection. That's a real cost — Max-plan users pay for that auto-selection (Sonnet for cheap tasks, Opus for hard ones, plus weekly-limit smoothing). Pin only when you actually need the 1M headroom.
-
- **Why pin to `opus[1m]` when you do opt in:**
- - SDLC sessions (plan → TDD → review → CI shepherd) accumulate context fast — plans, test output, diffs, review artifacts. 200K fills up before you're done.
- - Forced auto-compact mid-task loses your working state. Extra headroom is cheaper than re-reading files.
- - At time of writing, Anthropic lists 1M context at standard pricing for supported Opus/Sonnet models — verify current rates for your plan before relying on this.
+ **Opt-in: `opus[1m]` (Opus 4.7 with 1M context).** `/model opus[1m]` at the start of non-trivial sessions — understand the tradeoff (issue #198). A top-level `model` pin in `.claude/settings.json` disables CC's per-turn auto-selection; pin only when you need 1M headroom. Requires CC v2.1.111+.
 
- **Requires Claude Code v2.1.111+** for Opus 4.7.
-
- **Pair with `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30`** when you opt in. Without it, CC's default auto-compact on 1M fires at ~76K and defeats the purpose. The setup wizard's Step 9.5 prompts to write both together (template ships with neither, opt-in only).
-
- **Fall back to `opus` (200K) only when:** your plan charges a premium for long-context prompts, the task is genuinely short (<30K), or team cost controls flag >200K prompts. See the "1M vs 200K Context Window" section in `CLAUDE_CODE_SDLC_WIZARD.md` for details.
-
- ## Confidence Check (REQUIRED)
+ **Pair with `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30` when you opt in.** Without it, the default fires at ~76K on 1M. **Pick ONE — do NOT set both `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30` AND `CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000`** — they compound to 30% × 400K = 120K trigger ≈ 12% of 1M, fires almost immediately (#207). See wizard "Autocompact Tuning" for details.
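
The compounding arithmetic above, as a quick sanity check (illustrative Python; variable names are mine, and the assumption, per #207 as described here, is that the percentage applies to the auto-compact window rather than the full context):

```python
# Why setting both overrides backfires: 30% of a 400K window is a 120K trigger,
# roughly 12% of the 1M context, so compaction fires almost immediately.
PCT_OVERRIDE = 30            # CLAUDE_AUTOCOMPACT_PCT_OVERRIDE
WINDOW_OVERRIDE = 400_000    # CLAUDE_CODE_AUTO_COMPACT_WINDOW
CONTEXT = 1_000_000          # opus[1m] context size

trigger = WINDOW_OVERRIDE * PCT_OVERRIDE // 100   # token count at which compact fires
share = 100 * trigger // CONTEXT                  # as a percentage of the 1M window
```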
 
- Before presenting approach, STATE your confidence:
-
- | Level | Meaning | Action | Effort |
- |-------|---------|--------|--------|
- | HIGH (90%+) | Know exactly what to do | Present approach, proceed after approval | `high` (default) |
- | MEDIUM (60-89%) | Solid approach, some uncertainty | Present approach, highlight uncertainties | `high` (default) |
- | LOW (<60%) | Not sure | Do more research or try cross-model research (Codex) to get to 95%. If still LOW after research, ASK USER | **Run `/effort xhigh` now** — don't wait |
- | FAILED 2x | Something's wrong | Try cross-model research (Codex) for a fresh perspective. If still stuck, STOP and ASK USER | **Run `/effort max` now** — you're already burning cycles at lower effort |
- | CONFUSED | Can't diagnose why something is failing | Try cross-model research (Codex). If still confused, STOP. Describe what you tried, ask for help | **Run `/effort max` now** — stop spinning |
-
- **Dynamic bumping is NOT optional.** "Consider max effort" is the same as "ignore this" in practice. If your confidence drops or tests fail twice, bump effort BEFORE the next attempt — not after a third failure. Spinning at low effort is an SDLC failure mode, not a style choice.
-
- ## Self-Review Loop (CRITICAL)
+ ## Self-Review Loop
 
  ```
- PLANNING -> DOCS -> TDD RED -> TDD GREEN -> Tests Pass -> Self-Review
-     ^                                                         |
-     |                                                         v
-     |                                                   Issues found?
-     |                                                   |-- NO -> Present to user
-     |                                                   +-- YES v
-     +------------------------------------------- Ask user: fix in new plan?
+ PLANNING  DOCS  TDD RED  GREEN  Tests Pass  Self-Review
+     ^                                           |
+     +--- Ask user: fix in new plan? ←- Issues found? YES (NO → Present)
  ```
 
- **The loop goes back to PLANNING, not TDD RED.** When self-review finds issues:
- 1. Ask user: "Found issues. Want to create a plan to fix?"
- 2. If yes -> back to PLANNING phase with new plan doc
- 3. Then -> docs update -> TDD -> review (proper SDLC loop)
-
- **How to self-review:**
- 1. Run `/code-review` to review your changes
- 2. It launches parallel agents (CLAUDE.md compliance, bug detection, logic & security)
- 3. Issues at confidence >= 80 are real findings — go back to PLANNING to fix
- 4. Issues below 80 are likely false positives — skip unless obviously valid
- 5. Address issues by going back through the proper SDLC loop
+ The loop goes back to PLANNING, not TDD RED. Run `/code-review`; issues at confidence ≥ 80 are real, < 80 are likely false positives. Found issues → ask "Want a plan to fix?" → new plan → docs → TDD → review.
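
The ≥ 80 confidence cut described above, sketched for clarity (the finding shape is an assumption for illustration; only the threshold comes from this doc):

```python
# Illustrative filter on /code-review findings: keep only those at confidence >= 80.
def real_findings(findings: list[dict]) -> list[dict]:
    """Findings below 80 are likely false positives and are dropped."""
    return [f for f in findings if f.get("confidence", 0) >= 80]
```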
 
  ## Cross-Model Review (If Configured)
 
- **When to run:** High-stakes changes (auth, payments, data handling), releases/publishes (version bumps, CHANGELOG, npm publish), complex refactors, research-heavy work.
- **When to skip:** Trivial changes (typo fixes, config tweaks), time-sensitive hotfixes, risk < review cost.
+ **When to run:** high-stakes changes (auth, payments, data), releases/publishes, complex refactors.
+ **When to skip:** trivial changes, time-sensitive hotfixes, risk < review cost.
+ **Prerequisites:** Codex CLI (`npm i -g @openai/codex`) + OpenAI API key.
 
- **Prerequisites:** Codex CLI installed (`npm i -g @openai/codex`), OpenAI API key set.
+ The PROTOCOL is universal across domains; only `review_instructions` and `verification_checklist` change. **Reviewer always at flagship tier (#233):** if the project pins `model: "sonnet[1m]"` (mixed-mode), the reviewer still runs `gpt-5.5` or Opus 4.7 max — adversarial diversity is the point.
 
- **The core insight:** The review PROTOCOL is universal across domains. Only the review INSTRUCTIONS change. Code review is the default template below. For non-code domains (research, persuasion, medical content), adapt the `review_instructions` and `verification_checklist` fields while keeping the same handoff/dialogue/convergence loop.
+ ### Step 0: Preflight Self-Review
 
- **Reviewer always at the flagship tier (roadmap #233):** if the project pins `model: "sonnet[1m]"` (mixed-mode) or any non-flagship coder, the cross-model reviewer **still runs at the flagship**: `codex exec -c 'model_reasoning_effort="xhigh"'` (gpt-5.5) or an Opus 4.7 max equivalent. The whole point of mixed-mode is that adversarial review catches Sonnet's blind spots — weakening the reviewer leg defeats the savings. Don't downscale the review just because the coder is downscaled.
+ At `.reviews/preflight-{review_id}.md`, document what you already checked: `/code-review` passed, all tests passing, specific concerns checked, what you verified manually, known limitations. Reduces reviewer findings to 0-1 per round.
 
- ### Step 0: Write Preflight Self-Review Doc
+ ### Step 1: Mission-First Handoff
 
- Before submitting to an external reviewer, document what YOU already checked. This is proven to reduce reviewer findings to 0-1 per round (evidence: anticheat repo preflight discipline).
-
- Write `.reviews/preflight-{review_id}.md`:
- ```markdown
- ## Preflight Self-Review: {feature}
- - [ ] Self-review via /code-review passed
- - [ ] All tests passing
- - [ ] Checked for: [specific concerns for this change]
- - [ ] Verified: [what you manually confirmed]
- - [ ] Known limitations: [what you couldn't verify]
- ```
-
- ### Step 1: Write Mission-First Handoff
-
- After self-review and preflight pass, write `.reviews/handoff.json`:
+ Write `.reviews/handoff.json`:
  ```jsonc
  {
    "review_id": "feature-xyz-001",
    "status": "PENDING_REVIEW",
    "round": 1,
-   "mission": "What changed and why — 2-3 sentences of context",
-   "success": "What 'correctly reviewed' looks like — the reviewer's goal",
-   "failure": "What gets missed if the reviewer is superficial",
+   "mission": "What changed and why — 2-3 sentences",
+   "success": "What 'correctly reviewed' looks like",
+   "failure": "What gets missed if reviewer is superficial",
    "files_changed": ["src/auth.ts", "tests/auth.test.ts"],
    "fixes_applied": [],
    "previous_score": null,
    "verification_checklist": [
      "(a) Verify input validation at auth.ts:45 handles empty strings",
-     "(b) Verify test covers the null-token edge case",
-     "(c) Check no hardcoded secrets in diff"
+     "(b) Verify test covers null-token edge case"
    ],
-   "review_instructions": "Focus on security and edge cases. Be strict — assume bugs may be present until proven otherwise.",
+   "review_instructions": "Focus on security and edge cases. Be strict — assume bugs may be present.",
    "preflight_path": ".reviews/preflight-feature-xyz-001.md",
-   "artifact_path": ".reviews/feature-xyz-001/",
    "pr_number": 205
  }
  ```
 
- **Key fields explained:**
- - `mission/success/failure` — Gives the reviewer context. Without this, you get generic "looks good" feedback. With it, reviewers read raw source files and verify specific claims (proven across 4 repos)
- - `verification_checklist` — Specific things to verify with file:line references. NOT "review for correctness" — that's too vague. Each item is independently verifiable
- - `preflight_path` — Shows the reviewer what you already checked, so they focus on what you might have missed
- - `pr_number` (optional) — PreCompact self-heal opt-in (ROADMAP #209). When the review tracks a specific PR, set this. The `precompact-seam-check.sh` hook queries `gh pr view N --json state` on every manual `/compact` and, if the PR is MERGED, treats this handoff as implicit CERTIFIED — unblocking `/compact` even if `status` is still `PENDING_*`. Without `pr_number`, a forgotten PENDING handoff blocks every future manual compact until you flip status by hand or hit the `SDLC_HANDOFF_STALE_DAYS` (default 14) auto-expire fallback. Omit for ad-hoc reviews not tied to a PR.
+ `mission/success/failure` give context (without them: generic "looks good"). `verification_checklist` is specific (file:line), not "review for correctness." `pr_number` (optional) is the **PreCompact self-heal opt-in (ROADMAP #209)**: when set, `precompact-seam-check.sh` checks `gh pr view N --json state` on `/compact` and, if MERGED, treats handoff as implicit CERTIFIED. Without it, a forgotten PENDING handoff blocks every manual compact until you flip status or hit `SDLC_HANDOFF_STALE_DAYS` (default 14).
 
- ### Step 2: Run the Independent Reviewer
+ ### Step 2: Run the Reviewer
 
  ```bash
- codex exec \
-   -c 'model_reasoning_effort="xhigh"' \
-   -s danger-full-access \
+ codex exec -c 'model_reasoning_effort="xhigh"' -s danger-full-access \
    -o .reviews/latest-review.md \
- "You are an independent code reviewer performing a certification audit. \
- Read .reviews/handoff.json for full context: mission, success/failure \
- conditions, and verification checklist. \
- Verify each checklist item with evidence (file:line, grep results, test output). \
- Output each finding with: ID (1, 2, ...), severity (P0/P1/P2), evidence, \
- and a 'certify condition' (what specific change resolves it). \
- Re-verify any prior-round passes still hold. \
+ "Independent code reviewer. Read .reviews/handoff.json for context. \
+ Verify each checklist item with evidence (file:line, grep, test output). \
+ Each finding: ID, severity (P0/P1/P2), evidence, certify condition. \
  End with: score (1-10), CERTIFIED or NOT CERTIFIED."
  ```
 
- **Always use `xhigh` reasoning effort.** Lower settings miss subtle errors (wrong-generation references, stale pricing, cross-file inconsistencies).
-
- **Sandbox note:** Codex's Rust binary requires access to macOS system configuration APIs (`SCDynamicStore`) during sandbox initialization. Claude Code's sandbox blocks this access, causing `codex exec` to crash with `system-configuration panicked: Attempted to create a NULL object`. When running `codex exec` from within Claude Code, you MUST use `dangerouslyDisableSandbox: true` on the Bash tool call. This is safe — Codex has its own sandbox (`-s danger-full-access` is already specified), and the CC sandbox bypass only affects the Codex process. This is a known Codex issue ([#15640](https://github.com/openai/codex/issues/15640)).
+ Always `xhigh`: lower settings miss subtle errors. **Progress (#259):** xhigh runs take 1-5 min; for a heartbeat use `scripts/codex-review-with-progress.sh` (`SDLC_CODEX_HEARTBEAT_INTERVAL` tunes). **Sandbox:** Codex's Rust binary needs `SCDynamicStore`; CC's sandbox blocks this. From CC, use `dangerouslyDisableSandbox: true` — Codex has its own sandbox via `-s danger-full-access`. Known issue: [codex#15640](https://github.com/openai/codex/issues/15640).
303
171
 
304
- If CERTIFIED → proceed to CI. If NOT CERTIFIED → go to dialogue loop.
172
+ CERTIFIED → CI. NOT CERTIFIED → dialogue loop.
305
173
 
306
174
  ### Step 3: Dialogue Loop
307
175
 
308
- Respond per-finding don't silently fix everything:
309
-
310
- 1. Write `.reviews/response.json`:
311
- ```jsonc
312
- {
313
- "review_id": "feature-xyz-001",
314
- "round": 2,
315
- "responding_to": ".reviews/latest-review.md",
316
- "responses": [
317
- { "finding": "1", "action": "FIXED", "summary": "Added missing validation" },
318
- { "finding": "2", "action": "DISPUTED", "justification": "Intentional — see CODE_REVIEW_EXCEPTIONS.md" },
319
- { "finding": "3", "action": "ACCEPTED", "summary": "Will add test coverage" }
320
- ]
321
- }
322
- ```
323
- - **FIXED**: "I fixed this. Here's what changed." Reviewer verifies against certify condition.
324
- - **DISPUTED**: "This is intentional/incorrect. Here's why." Reviewer accepts or rejects with reasoning.
325
- - **ACCEPTED**: "You're right. Fixing now." (Same as FIXED, batched.)
326
-
327
- 2. Update `handoff.json`: increment `round`, set `"status": "PENDING_RECHECK"`, add `fixes_applied` list with numbered items and file:line references, update `previous_score`.
328
-
329
- 3. Run targeted recheck (NOT a full re-review):
330
- ```bash
331
- codex exec \
332
- -c 'model_reasoning_effort="xhigh"' \
333
- -s danger-full-access \
334
- -o .reviews/latest-review.md \
335
- "TARGETED RECHECK — not a full re-review. Read .reviews/handoff.json \
336
- for previous_review path and response.json for the author's responses. \
337
- For each finding: FIXED → verify against original certify condition. \
338
- DISPUTED → evaluate justification (ACCEPT if sound, REJECT with reasoning). \
339
- ACCEPTED → verify it was applied. \
340
- Do NOT raise new findings unless P0 (critical/security). \
341
- New observations go in 'Notes for next review' (non-blocking). \
342
- Re-verify all prior passes still hold. \
343
- End with: score (1-10), CERTIFIED or NOT CERTIFIED."
344
- ```
345
-
346
- ### Convergence
347
-
348
- **2 rounds is the sweet spot. 3 max.** Research across 14 repos and 7 papers confirms additional rounds beyond 3 produce <5% position shift.
349
-
350
- Max 2 recheck rounds (3 total including initial review). If still NOT CERTIFIED after round 3, escalate to the user with a summary of open findings.
351
-
352
- ```
353
- Preflight → handoff.json (round 1) → FULL REVIEW
354
- |
355
- CERTIFIED? → YES → CI
356
- |
357
- NO (scored findings)
358
- |
359
- response.json (FIXED/DISPUTED/ACCEPTED)
360
- |
361
- handoff.json (round 2+) → TARGETED RECHECK
362
- |
363
- CERTIFIED? → YES → CI
364
- |
365
- NO → one more round, then escalate
366
- ```
367
-
368
- **Tool-agnostic:** The value is adversarial diversity (different model, different blind spots), not the specific tool. Any competing AI reviewer works.
369
-
- ### Anti-Patterns to Avoid
-
- - **"Find at least N problems"** — Incentivizes false positives. Use adversarial framing ("assume bugs may be present") instead
- - **"Review this"** — Too vague, gets generic feedback. Use mission + verification checklist
- - **Numeric 1-10 scales without criteria** — Unreliable. Decompose into specific checklist items
- - **Letting reviewer see author's reasoning** — Causes anchoring bias. Let them form independent opinion from code
-
- ### Release Review Focus
-
- Before any release/publish, add these to `verification_checklist`:
- - **CHANGELOG consistency** — all sections present, no lost entries during consolidation
- - **Version parity** — package.json, SDLC.md, CHANGELOG, wizard metadata all match
- - **Stale examples** — hardcoded version strings in docs match current release
- - **Docs accuracy** — README, ARCHITECTURE.md reflect current feature set
- - **CLI-distributed file parity** — live skills, hooks, settings match CLI templates
-
- ### Multiple Reviewers (N-Reviewer Pipeline)
-
- When multiple reviewers comment on a PR (Claude PR review, Codex, human reviewers), address each reviewer independently:
-
- 1. **Read all reviews** — `gh api repos/OWNER/REPO/pulls/PR/comments` to get every reviewer's feedback
- 2. **Respond per-reviewer** — Each reviewer has different blind spots and priorities. Address each one's findings separately
- 3. **Resolve conflicts** — If reviewers disagree, pick the stronger argument, note why
- 4. **Iterate until all approve** — Don't merge until every active reviewer is satisfied
- 5. **Max 3 iterations per reviewer** — If a reviewer keeps finding new things, escalate to the user
-
- ### Adapting for Non-Code Domains
-
- The handoff format and dialogue loop work for ANY domain. Only `review_instructions` and `verification_checklist` change:
+ Per-finding response in `.reviews/response.json`: `{"finding": "1", "action": "FIXED|DISPUTED|ACCEPTED", "summary": "..."}`. Update `handoff.json`: increment `round`, status `PENDING_RECHECK`, add `fixes_applied` (numbered, file:line refs).
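
The per-finding response and round bump can be sketched with plain shell. The `.reviews/response.json` and `.reviews/handoff.json` paths follow the convention above; the example findings and the `sed`-based round bump are illustrative, not the wizard's actual implementation (a real setup might use `jq`):

```shell
# Record per-finding responses, then flip the handoff to PENDING_RECHECK.
mkdir -p .reviews

cat > .reviews/response.json <<'EOF'
[
  {"finding": "1", "action": "FIXED",    "summary": "Added input validation (src/api.ts:42)"},
  {"finding": "2", "action": "DISPUTED", "summary": "Intentional: fallback is covered by test X"}
]
EOF

# Bump round 1 -> 2 and mark the handoff as awaiting a targeted recheck.
printf '{"round": 1, "status": "PENDING_REVIEW"}\n' > .reviews/handoff.json
sed -i.bak -e 's/"round": 1/"round": 2/' -e 's/PENDING_REVIEW/PENDING_RECHECK/' .reviews/handoff.json
cat .reviews/handoff.json
```

The `.bak` suffix keeps `sed -i` portable across GNU and BSD.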
 
- | Domain | Instructions Focus | Checklist Example |
- |--------|-------------------|-------------------|
- | **Code (default)** | Security, logic bugs, test coverage | "Verify input validation at file:line" |
- | **Research/Docs** | Factual accuracy, source verification, overclaims | "Verify $736-$804 appears in both docs, no stale $695-$723 remains" |
- | **Persuasion** | Audience psychology, tone, trust | "If you were [audience], what's the moment you'd stop reading?" |
+ Recheck prompt: "TARGETED RECHECK. For each finding: FIXED → verify certify condition. DISPUTED → ACCEPT if sound, REJECT with reasoning. ACCEPTED → verify applied. Do NOT raise new findings unless P0. End with score, CERTIFIED or NOT CERTIFIED."

- For non-code: add `"audience"` and `"stakes"` fields to handoff.json. For code, these are implied (audience = other developers, stakes = production impact).
+ **Convergence:** 2 rounds is the sweet spot, 3 max (research: 14 repos + 7 papers). If still NOT CERTIFIED after round 3, escalate to the user.
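
The convergence cap can be sketched as a bounded loop. `run_review` is a hypothetical stub standing in for the real reviewer invocation; it certifies on round 3 so both exit paths are visible:

```shell
# Sketch: at most 3 review rounds, then escalate to the user.
run_review() {  # stand-in for the real codex/gh reviewer call
  if [ "$1" -ge 3 ]; then echo "CERTIFIED"; else echo "NOT CERTIFIED"; fi
}

result="NOT CERTIFIED"
for round in 1 2 3; do
  result=$(run_review "$round")
  echo "round $round: $result"
  if [ "$result" = "CERTIFIED" ]; then break; fi
done

if [ "$result" != "CERTIFIED" ]; then
  echo "ESCALATE: still NOT CERTIFIED after round 3 - summarize open findings for the user"
fi
```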
 
- ### Custom Subagents (`.claude/agents/`)
-
- Claude Code supports custom subagents in `.claude/agents/`:
-
- - **`sdlc-reviewer`** — SDLC compliance review (planning, TDD, self-review checks)
- - **`ci-debug`** — CI failure diagnosis (reads logs, identifies root cause, suggests fix)
- - **`test-writer`** — Quality tests following TESTING.md philosophies
-
- **Skills** guide Claude's behavior. **Agents** run autonomously and return results. Use agents for parallel work or fresh context windows.
-
- ## Test Review (Harder Than Implementation)
-
- During self-review, critique tests HARDER than app code:
- 1. **Testing the right things?** - Not just that tests pass
- 2. **Tests prove correctness?** - Or just verify current behavior?
- 3. **Follow our philosophies (TESTING.md)?**
-    - Testing Diamond (integration-heavy)?
-    - Minimal mocking (see table below)?
-    - Real fixtures from captured data?
-
- **Tests are the foundation.** Bad tests = false confidence = production bugs.
-
- ### Testing Diamond — Know Your Layers
-
- | Layer | What It Tests | % of Suite | Key Trait |
- |-------|--------------|------------|-----------|
- | **E2E** | Full user flow through UI/browser (Playwright, Cypress) | ~5% | Slow, brittle, but proves the real thing works |
- | **Integration** | Real systems via API without UI — real DB, real cache, real services | ~90% | **Best bang for buck.** Fast, stable, high confidence |
- | **Unit** | Pure logic only — no DB, no API, no filesystem | ~5% | Fast but limited scope |
-
- **The critical boundary:** E2E tests go through the user's actual UI/browser. Integration tests hit real systems via API but without UI. If your test doesn't open a browser or render a UI, it's not E2E — it's integration. This distinction matters because mislabeling integration tests as E2E leads to overinvestment in slow browser tests when fast API-level tests would suffice.
-
- ### Minimal Mocking Philosophy
-
- | What | Mock? | Why |
- |------|-------|-----|
- | Database | NEVER | Use test DB or in-memory |
- | Cache | NEVER | Use isolated test instance |
- | External APIs | YES | Real calls = flaky + expensive |
- | Time/Date | YES | Determinism |
+ **Anti-patterns:** "find at least N problems," "review this," 1-10 without criteria, letting reviewer see author's reasoning (anchoring).

- **Mocks MUST come from REAL captured data** — capture real API responses, save to fixtures directory, import in tests. Never guess mock shapes.
+ **Multiple reviewers** (Claude review + Codex + human): `gh api repos/OWNER/REPO/pulls/PR/comments` for all feedback, respond to each reviewer independently (different blind spots), pick the stronger argument on conflicts, max 3 iterations per reviewer.
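
One way to sketch the per-reviewer split. `gh_comments` is a stub standing in for the real `gh api repos/OWNER/REPO/pulls/PR/comments` call (fixture data, not the API's actual shape), and the `.reviews/by-reviewer/` layout is an assumption, not a wizard convention:

```shell
# Group PR comments by reviewer so each one gets an independent response.
gh_comments() {  # hypothetical stub; real data would come from `gh api`
  printf 'codex\tMissing null check in parser\n'
  printf 'claude-review\tTest does not cover empty input\n'
  printf 'codex\tDead code in utils\n'
}

# One findings file per reviewer, so responses never get mixed together.
rm -rf .reviews/by-reviewer && mkdir -p .reviews/by-reviewer
gh_comments | while IFS="$(printf '\t')" read -r reviewer body; do
  echo "- $body" >> ".reviews/by-reviewer/$reviewer.md"
done

wc -l .reviews/by-reviewer/*.md
```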
 
- ### Unit Tests = Pure Logic ONLY
+ **Non-code domains** (research, persuasion, medical): same handoff format, adapt `review_instructions` + `verification_checklist`, add `audience` + `stakes`.
 
- A function qualifies for unit testing ONLY if:
- - No database calls
- - No external API calls
- - No file system access
- - No cache calls
- - Input -> Output transformation only
-
- Everything else needs integration tests.
-
- ### TDD Tests Must PROVE
-
- | Phase | What It Proves |
- |-------|----------------|
- | RED | Test FAILS -> Bug exists or feature missing |
- | GREEN | Test PASSES -> Fix works or feature implemented |
- | Forever | Regression protection |
-
- ## Flaky Test Recovery
-
- **Flaky tests are bugs. Period.** See: [How do you Address and Prevent Flaky Tests?](https://softwareautomation.notion.site/How-do-you-Address-and-Prevent-Flaky-Tests-23c539e19b3c46eeb655642b95237dc0)
-
- When a test fails intermittently:
- 1. **Don't dismiss it** — "flaky" means "bug we haven't found yet"
- 2. **Identify the layer** — test code? app code? environment?
- 3. **Stress-test** — run the suspect test N times to reproduce reliably
- 4. **Fix root cause** — don't just retry-and-pray
- 5. **If CI infrastructure** — make cosmetic steps non-blocking, keep quality gates strict
-
- ## Scope Guard (Stay in Your Lane)
-
- **Only make changes directly related to the task.**
-
- If you notice something else that should be fixed:
- - NOTE it in your summary ("I noticed X could be improved")
- - DON'T fix it unless asked
-
- **Why this matters:** AI agents can drift into "helpful" changes that weren't requested. This creates unexpected diffs, breaks unrelated things, and makes code review harder.
-
- ## Debugging Workflow (Systematic Investigation)
-
- When something breaks and the cause isn't obvious, follow this systematic debugging workflow:
-
- ```
- Reproduce → Isolate → Root Cause → Fix → Regression Test
- ```
-
- 1. **Reproduce** — Can you make it fail consistently? If intermittent, stress-test (run N times). If you can't reproduce it, you can't fix it
- 2. **Isolate** — Narrow the scope. Which file? Which function? Which input? Use binary search: comment out half the code, does it still fail?
- 3. **Root cause** — Don't fix symptoms. Ask "why?" until you hit the actual cause. "It crashes on line 42" is a symptom. "Null pointer because the API returns undefined when rate-limited" is a root cause
- 4. **Fix** — Fix the root cause, not the symptom. Write the fix
- 5. **Regression test** — Write a test that fails without your fix and passes with it (TDD GREEN)
-
- **For regressions** (it worked before, now it doesn't):
- - Use `git bisect` to find the exact commit that broke it
- - `git bisect start`, `git bisect bad` (current), `git bisect good <known-good-commit>`
- - Bisect narrows to the breaking commit in O(log n) steps
-
- **Environment-specific bugs** (works locally, fails in CI/staging/prod):
- - Check environment differences: env vars, OS version, dependency versions, file permissions
- - Reproduce the environment locally if possible (Docker, env vars)
- - Add logging at the failure point — don't guess, observe
-
- **When to stop and ask:**
- - After 2 failed fix attempts → STOP and ASK USER
- - If the bug is in code you don't understand → read first, then fix
- - If reproducing requires access you don't have → ASK USER
-
- ## CI Feedback Loop — Local Shepherd (After Commit)
-
- **This is the "local shepherd" — the CI fix mechanism.** It runs in your active session with full context.
-
- **The SDLC doesn't end at local tests.** CI must pass too.
-
- ```
- Local tests pass -> Commit -> Push -> Watch CI
- |
- CI passes? -+-> YES -> Present for review
- |
- +-> NO -> Fix -> Push -> Watch CI
- |
- (max 2 attempts)
- |
- Still failing?
- |
- STOP and ASK USER
- ```
-
- ```
- ┌─────────────────────────────────────────────────────────────────────┐
- │ NEVER AUTO-MERGE. NO EXCEPTIONS.                                    │
- │                                                                     │
- │ Do NOT run `gh pr merge --auto`. Ever.                              │
- │ Auto-merge fires before you can read review feedback.               │
- │ The shepherd loop IS the process. Skipping it = shipping bugs.      │
- └─────────────────────────────────────────────────────────────────────┘
- ```
-
- **The full shepherd sequence — every step is mandatory:**
- 1. Push changes to remote
- 2. Watch CI: `gh pr checks --watch`
- 3. Read CI logs — **pass or fail**: `gh run view <RUN_ID> --log` (not just `--log-failed`). Passing CI can still hide warnings, skipped steps, or degraded scores. Don't just check the green checkmark
- 4. **Cross-model review the CI logs themselves** — pipe `gh run view <RUN_ID> --log` to a tmp file and run `codex exec -c 'model_reasoning_effort="xhigh"' -s danger-full-access` with a prompt like *"Audit this CI log for silent failures, skipped tests, degraded metrics, or warnings-that-should-be-errors. Green checkmark is necessary but not sufficient."* A second model catches things the first missed (e.g., a job that passed but degraded an E2E score by 30%, or a test that was silently excluded). Cheap — one extra `codex exec` per PR. **Run separately on Tier 1 quick-check AND Tier 2 5x evaluation logs** — they exercise different code paths, so a clean Tier 1 audit doesn't imply a clean Tier 2. Evidence from PR #206: Tier 1 audit found 3 P1s (Node 24 false-green, "11/10" score leak, E2E incomplete); Tier 2 audit TBD — value is measured by running both and comparing.
- 5. If CI fails → diagnose from logs, fix, push again (max 2 attempts)
- 6. If CI passes → read ALL review comments: `gh api repos/OWNER/REPO/pulls/PR/comments`
- 7. Fix valid suggestions, push, iterate until clean
- 8. Only then: explicit merge with `gh pr merge --squash`
-
- **Why this is non-negotiable:** PR #145 auto-merged a release before review feedback was read. CI reviewer found a P1 dead-code bug that shipped to main. The fix required a follow-up commit. Auto-merge cost more time than the shepherd loop would have taken.
-
- **Why read passing logs:** v1.24.0 release only read logs on failure (round 1), then just checked the green checkmark on round 2. Passing CI can hide warnings, skipped steps, degraded E2E scores, or silent test exclusions. A green checkmark is necessary but not sufficient.
-
- **Context GC (compact during idle):** While waiting for CI (typically 3-5 min), suggest `/compact` if the conversation is long. Think of it like a time-based garbage collector — idle time + high memory pressure = good time to collect. Don't suggest on short conversations.
-
- **CI failures follow same rules as test failures:**
- - Your code broke it? Fix your code
- - CI config issue? Fix the config
- - Flaky? Investigate - flakiness is a bug
- - Stuck? ASK USER
-
- ## CI Review Feedback Loop — Local Shepherd (After CI Passes)
-
- **CI passing isn't the end.** If CI includes a code reviewer, read and address its suggestions.
-
- ```
- CI passes -> Read review suggestions
- |
- Valid improvements? -+-> YES -> Implement -> Run tests -> Push
- |                            |
- |                            Review again (iterate)
- |
- +-> NO (just opinions/style) -> Skip, note why
- |
- +-> None -> Done, present to user
- ```
-
- **How to evaluate suggestions:**
- 1. Read all CI review comments: `gh api repos/OWNER/REPO/pulls/PR/comments`
- 2. For each suggestion, ask: **"Is this a real improvement or just an opinion?"**
-    - **Real improvement:** Fixes a bug, improves performance, adds missing error handling, reduces duplication, improves test coverage → Implement it
-    - **Opinion/style:** Different but equivalent formatting, subjective naming preference, "you could also..." without clear benefit → Skip it
- 3. Implement the valid ones, run tests locally, push
- 4. CI re-reviews — repeat until no substantive suggestions remain
- 5. Max 3 iterations — if reviewer keeps finding new things, ASK USER
-
- **The goal:** User is only brought in at the very end, when both CI and reviewer are satisfied. The code should be polished before human review.
-
- **Customizable behavior** (set during wizard setup):
- - **Auto-implement** (default): Implement valid suggestions autonomously, skip opinions
- - **Ask first**: Present suggestions to user, let them decide which to implement
- - **Skip review feedback**: Ignore CI review suggestions, only fix CI failures
-
- ## Context Management
-
- - `/compact` between planning and implementation (plan preserved in summary)
- - `/clear` between unrelated tasks (stale context wastes tokens and misleads)
- - `/clear` after 2+ failed corrections (context polluted — start fresh with better prompt)
- - Auto-compact fires at ~95% capacity — no manual management needed
- - After committing a PR, `/clear` before starting the next feature
- - **Autocompact tuning:** Set `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE` to trigger compaction earlier (75% for 200K, 30% for 1M). On 1M models, the default fires at ~76K — pick ONE of: `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30` **OR** `CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000` (do NOT set both — they compound to 30% × 400K = 120K trigger ≈ 12% of 1M, which fires almost immediately, #207). See wizard doc "Autocompact Tuning" for full details
-
- **`--bare` mode (v2.1.81+):** `claude -p "prompt" --bare` skips ALL hooks, skills, LSP, and plugins. This is a complete wizard bypass — no SDLC enforcement, no TDD checks, no planning hooks. Use only for scripted headless calls (CI pipelines, automation) where you explicitly don't want wizard enforcement. Never use `--bare` for normal development work.
-
- ## DRY Principle
-
- **Before coding:** "What patterns exist that I can reuse?"
- **After coding:** "Did I accidentally duplicate anything?"
-
- ## Design System Check (If UI Change)
+ ### Release Review Focus

- **When to check:** CSS/styling changes, new UI components, color/font usage.
- **When to skip:** Backend-only changes, config/build changes, non-visual code.
+ Before any release/publish, add to `verification_checklist`: **CHANGELOG consistency** (sections present, no lost entries), **Version parity** (package.json + SDLC.md + CHANGELOG + wizard metadata), **Stale examples** (hardcoded version strings), **Docs accuracy** (README + ARCHITECTURE reflect current features), **CLI-distributed file parity** (live skills/hooks match CLI templates).
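
The version-parity item can be checked mechanically. A sketch with inline fixture files; a real run would point at the repo's actual `package.json`, `SDLC.md`, and `CHANGELOG.md`:

```shell
# Assert one version string everywhere before publishing.
rm -rf /tmp/parity-demo && mkdir -p /tmp/parity-demo && cd /tmp/parity-demo
printf '{\n  "name": "demo",\n  "version": "1.48.0"\n}\n' > package.json
printf '# SDLC\ncurrent release: 1.48.0\n' > SDLC.md
printf '# Changelog\n## 1.48.0\n- stuff\n' > CHANGELOG.md

# package.json is the source of truth; every other file must mention it.
want=$(sed -n 's/.*"version": "\([^"]*\)".*/\1/p' package.json)
status=ok
for f in SDLC.md CHANGELOG.md; do
  grep -q "$want" "$f" || { echo "PARITY FAIL: $f missing $want"; status=fail; }
done
echo "version parity: $status ($want)"
```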
 
- **Planning phase - "Design system check":**
- 1. Read DESIGN_SYSTEM.md if it exists
- 2. Check if change involves colors, fonts, spacing, or components
- 3. Verify intended styles match design system tokens
- 4. Flag if introducing new patterns not in design system
+ (Full protocol with rationale and convergence diagrams: `CLAUDE_CODE_SDLC_WIZARD.md` → Cross-Model Review.)

- **Review phase - "Visual consistency check":**
- 1. Are colors from the design system palette?
- 2. Are fonts/sizes from typography scale?
- 3. Are spacing values from the spacing scale?
- 4. Do new components follow existing patterns?
+ ## Documentation Sync (REQUIRED During Planning)

- **If no DESIGN_SYSTEM.md exists:** Skip these checks (project has no documented design system).
+ **Docs MUST be current before commit.** Stale docs = wrong implementations = wasted sessions.
 
639
- ## Release Planning (If Task Involves a Release)
198
+ Standard pattern: `*_DOCS.md` living documents that grow with the feature (`AUTH_DOCS.md`, `PAYMENTS_DOCS.md`).
640
199
 
641
- **When to check:** Task mentions "release", "publish", "version bump", "npm publish", or multiple items being shipped together.
642
- **When to skip:** Single feature implementation, bug fix, or anything that isn't a release.
200
+ 1. Read feature docs for the area being changed during planning
201
+ 2. When a code change contradicts what the doc says MUST update the feature doc
202
+ 3. When a code change extends behavior the doc describes → MUST update the feature doc (add new behavior)
203
+ 4. No `*_DOCS.md` exists and feature touches 3+ files → create one
204
+ 5. Project has `ROADMAP.md` → mark items done, add new items (ROADMAP feeds CHANGELOG)
643
205
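
Step 4 (create a doc when a feature touches 3+ files) can be sketched as a threshold check. The changed-file list is a fixture here (a real run would derive it from `git diff --name-only`), and the `AUTH_DOCS.md` naming is illustrative:

```shell
# Flag a feature that touches 3+ files but has no *_DOCS.md.
rm -rf /tmp/docsync-demo && mkdir -p /tmp/docsync-demo && cd /tmp/docsync-demo
cat > changed.txt <<'EOF'
src/auth/login.ts
src/auth/session.ts
src/auth/token.ts
EOF

feature=auth
count=$(grep -c "^src/$feature/" changed.txt)
if [ "$count" -ge 3 ] && [ ! -f "$(echo "$feature" | tr 'a-z' 'A-Z')_DOCS.md" ]; then
  verdict="CREATE ${feature} docs"
else
  verdict="ok"
fi
echo "$verdict ($count files)"
```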
 
- Before implementing any release items:
+ `/claude-md-improver` audits CLAUDE.md structure. Run it periodically. It does NOT cover feature docs — the SDLC workflow handles those.

- 1. **List all items** — Read ROADMAP.md (or equivalent), identify every item planned for this release
- 2. **Plan each at 95% confidence** — For each item: what files change, what tests prove it works, what's the blast radius. If confidence < 95% on any item, flag it
- 3. **Identify blocks** — Which items depend on others? What must go first?
- 4. **Present all plans together** — User reviews the complete batch, not one at a time. This catches conflicts, sequencing issues, and scope creep before any code is written
- 5. **Pre-release CI audit** — Before cutting the release, review CI runs across ALL PRs merged since last release. Look for: warnings in passing runs, degraded E2E scores, skipped test suites, silent failures masked by `continue-on-error`. Use `gh run list` + `gh run view <ID> --log` to audit. A green checkmark is necessary but not sufficient
- 6. **User approves, then implement** — Full SDLC per item (TDD RED → GREEN → self-review), in the prioritized order
+ ## CI Feedback Loop — Local Shepherd

- **Why batch planning works:** Ad-hoc one-at-a-time implementation leads to unvalidated additions and scope creep. Batch planning catches problems early — if you can't plan it at 95%, you're not ready to ship it.
+ **NEVER AUTO-MERGE. Do NOT run `gh pr merge --auto`.** Auto-merge fires before review feedback can be read. The shepherd loop IS the process.

- **Why pre-release CI audit:** v1.24.0 shipped without auditing CI logs across merged PRs #150-#152. Passing CI doesn't mean nothing fishy got through — warnings, degraded scores, and skipped steps can hide in green runs.
+ Mandatory steps:
+ 1. Push to remote
+ 2. `gh pr checks --watch`
+ 3. **Read CI logs whether pass or fail** (`gh run view <RUN_ID> --log`, not just `--log-failed`). Passing CI hides warnings, skipped steps, degraded scores
+ 4. **Cross-model audit the CI logs** — pipe to a tmp file, run `codex exec -c 'model_reasoning_effort="xhigh"' -s danger-full-access` with *"Audit for silent failures, skipped tests, degraded metrics, warnings-that-should-be-errors."* Tier 1 + Tier 2 separately
+ 5. CI fails → diagnose, fix, push (max 2 attempts)
+ 6. CI passes → `gh api repos/OWNER/REPO/pulls/PR/comments` for review feedback
+ 7. Implement valid suggestions (bugs, perf, missing error handling, dedup, coverage). Skip opinions/style. Max 3 iterations
+ 8. Explicit `gh pr merge --squash`
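
Step 3's "read passing logs too" can be approximated with a grep pass before handing the log to the cross-model audit. The log content below is a fixture, and the smell list is a starting point, not exhaustive:

```shell
# Scan a CI log for silent-failure smells before trusting the green checkmark.
# A real run would first do: gh run view <RUN_ID> --log > ci.log
cat > ci.log <<'EOF'
Run tests ... 212 passed, 0 failed
WARNING: suite e2e-visual skipped (missing browser)
Job lint completed (continue-on-error: true)
EOF

# Count lines that deserve a human (or second-model) read.
hits=$(grep -ciE 'warning|skipped|continue-on-error|deprecat' ci.log)
if [ "$hits" -gt 0 ]; then
  echo "AUDIT: $hits suspicious lines - read before merging"
else
  echo "AUDIT: clean"
fi
```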
 
- ## Deployment Tasks (If Task Involves Deploy)
+ **Evidence:** PR #145 auto-merged before review was read; the reviewer found a P1 dead-code bug that shipped. v1.24.0 only checked the green checkmark on round 2; passing CI hides degraded E2E scores and silent test exclusions. Use idle CI time (3-5 min) for `/compact` if context is long.

- **When to check:** Task mentions "deploy", "release", "push to prod", "staging", etc.
- **When to skip:** Code changes only, no deployment involved.
+ ## Scope, DRY, Patterns, Legacy

- **Before any deployment:**
- 1. Read ARCHITECTURE.md → Find the Environments table and Deployment Checklist
- 2. Verify which environment is the target (dev/staging/prod)
- 3. Follow the deployment checklist in ARCHITECTURE.md
+ - **Scope guard** — only task-related changes. Notice something else → NOTE it in your summary, don't fix unless asked. Drift into "helpful" changes breaks unrelated things.
+ - **DRY** — before coding: "what patterns exist to reuse?" After: "did I duplicate anything?"
+ - **New patterns** require human approval: search first, propose if no equivalent exists, get explicit approval.
+ - **DELETE legacy code** — backwards-compat shims, "just in case" fallbacks → gone. If it breaks, fix it properly.
 
- **Confidence levels for deployment:**
+ ## Debugging Workflow (Systematic)

- | Target | Required Confidence | If Lower |
- |--------|---------------------|----------|
- | Dev/Preview | MEDIUM or higher | Proceed with caution |
- | Staging | MEDIUM or higher | Proceed, note uncertainties |
- | **Production** | **HIGH only** | **ASK USER before deploying** |
+ Reproduce → Isolate → Root Cause → Fix → Regression Test. This is the systematic debugging methodology — do not skip steps. Regressions: `git bisect`. Env-specific: check env vars/OS/deps/permissions, reproduce locally, log at the failure point. 2 failed attempts → STOP and ASK USER.
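
For regressions, `git bisect run` automates the narrowing. A self-contained sketch: a throwaway repo with one deliberately bad commit, and a `grep` standing in for the real test suite (assumes `git` is available):

```shell
# Build six commits; the regression lands in commit 4 and stays broken.
rm -rf /tmp/bisect-demo && mkdir -p /tmp/bisect-demo && cd /tmp/bisect-demo
git init -q .
git config user.email demo@example.com
git config user.name demo

for i in 1 2 3 4 5 6; do
  if [ "$i" -ge 4 ]; then echo bad > state.txt; else echo good > state.txt; fi
  git add state.txt
  git commit -qm "commit $i"
done

# bad = HEAD, good = root commit; the run command exits 0 on good commits.
first=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$first" >/dev/null
out=$(git bisect run sh -c 'grep -qx good state.txt' 2>&1)
git bisect reset >/dev/null 2>&1
echo "$out" | grep 'first bad commit'
```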
 
- **Production deployment requires:**
- - All tests passing
- - Production build succeeding
- - Changes tested in staging/preview first
- - HIGH confidence (90%+)
- - If ANY doubt → ASK USER first
+ ## Release Planning (Task Ships a Release)

- **If ARCHITECTURE.md has no Environments section:** Ask user "How do you deploy to [target]?" before proceeding.
+ List all items from ROADMAP, plan each at 95% confidence, identify dependencies, present all plans together (catches conflicts/scope creep), run a pre-release CI audit across merged PRs (warnings, degraded scores, skipped suites — a green checkmark is insufficient), get user approval, then implement in priority order.
 
684
- **After deploying — Post-Deploy Verification:**
685
- 1. Read ARCHITECTURE.md → Find the Post-Deploy Verification table
686
- 2. Run health check for the target environment
687
- 3. Check logs for new errors
688
- 4. Run smoke tests if configured
689
- 5. Monitor error rates for 15 min (production only)
690
- 6. If issues found → rollback first, then start new SDLC loop to fix
239
+ ## Deployment Tasks
691
240
 
692
- **If ARCHITECTURE.md has no Post-Deploy section:** Ask user "How do you verify [target] is working after deploy?"
241
+ Read `ARCHITECTURE.md` Environments table + Deployment Checklist. **Production requires HIGH (90%+); ANY doubt → ASK USER.** **Post-deploy verification:** health check, log scan, smoke tests, monitor 15 min (prod only). Issues → rollback first, then new SDLC loop.
693
242
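
The production gate can be sketched as a guard function. The 90% threshold follows the rule above; `deploy_gate` and its inputs are hypothetical (in practice the confidence number comes from the agent's own stated assessment):

```shell
# Block production deploys below HIGH (90%) confidence.
deploy_gate() {
  target=$1; confidence=$2
  if [ "$target" = "production" ] && [ "$confidence" -lt 90 ]; then
    echo "BLOCKED: ask user before deploying to production (confidence ${confidence}%)"
    return 1
  fi
  echo "proceed: $target at ${confidence}%"
}

deploy_gate staging 70            # non-prod: proceed, note uncertainties
deploy_gate production 85 || true # below threshold: blocked
deploy_gate production 95         # HIGH confidence: proceed
```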
 
- ## DELETE Legacy Code
+ ## Test Review (Harder Than Implementation)

- - Legacy code? DELETE IT
- - Backwards compatibility? NO - DELETE IT
- - "Just in case" fallbacks? DELETE IT
+ Critique tests harder than app code: are they testing the right things? Do they prove correctness, or just verify current behavior? Do they follow TESTING.md (Testing Diamond, minimal mocking, real-captured fixtures)?

- **THE RULE:** Delete old code first. If it breaks, fix it properly.
+ **Testing Diamond:** E2E ~5% (slow, proves the real thing) → Integration ~90% (best bang for buck — real DB/cache/services via API, no UI) → Unit ~5% (pure logic only). If no UI/browser, it's integration, not E2E.
 
- ## Documentation Sync (REQUIRED — During Planning)
+ **Mocking:**

- Feature docs MUST be current before commit. Docs are code — stale docs mislead future sessions, waste tokens, and cause wrong implementations.
+ | What | Mock? | Why |
+ |------|-------|-----|
+ | Database | NEVER | Test DB or in-memory |
+ | Cache | NEVER | Isolated test instance |
+ | External APIs | YES | Real calls = flaky + expensive |
+ | Time/Date | YES | Determinism |

- **Standard pattern:** `*_DOCS.md` living documents that grow with the feature (e.g., `AUTH_DOCS.md`, `PAYMENTS_DOCS.md`, `SEARCH_DOCS.md`). Same philosophy as `TESTING.md` and `ARCHITECTURE.md` — one source of truth per topic, kept current.
+ Mocks MUST come from real captured data — never guess shapes. Unit tests qualify ONLY for pure I→O (no DB, API, FS, cache).
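
A sketch of the capture-then-replay flow. The payload is canned here instead of a real capture (no network in this sketch; a real capture would be something like `curl -s "$API_URL/user/42" > tests/fixtures/user-42.json`), and the `tests/fixtures/` layout is an assumed convention:

```shell
# Capture once (simulated), replay in tests forever.
mkdir -p tests/fixtures
cat > tests/fixtures/user-42.json <<'EOF'
{"id": 42, "name": "Ada", "plan": "pro"}
EOF

# Tests import the captured shape instead of a hand-written, guessed mock.
fixture=tests/fixtures/user-42.json
grep -q '"plan": "pro"' "$fixture" && echo "fixture shape verified"
```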
 
- ```
- ┌─────────────────────────────────────────────────────────────────────┐
- │ DOCS MUST BE CURRENT BEFORE COMMIT.                                 │
- │                                                                     │
- │ Stale docs = wrong implementations = wasted sessions.               │
- │ If you changed the feature, update its doc. No exceptions.          │
- └─────────────────────────────────────────────────────────────────────┘
- ```
+ **TDD proves:** RED (fails — bug or missing feature), GREEN (passes — fix works), Forever (regression protection).

- 1. **During planning**, read feature docs for the area being changed (`*_DOCS.md`, `docs/features/`, `docs/decisions/`)
- 2. If your code change contradicts what the doc says → MUST update the doc
- 3. If your code change extends behavior the doc describes → MUST add to the doc
- 4. If no `*_DOCS.md` exists and the feature touches 3+ files → create one. Keep it simple: what the feature does, key decisions, gotchas. Same structure as TESTING.md (topic-focused, not exhaustive)
- 5. If the project has a `ROADMAP.md` → update it (mark items done, add new items). ROADMAP feeds CHANGELOG — keeping it current means releases write themselves
+ ## Prove It Gate (New Additions Only)

- **Doc staleness signals:** Low confidence in an area often means the docs are stale, missing, or misleading. If you struggle during planning, check whether the docs match the actual code.
+ Adding a new skill/hook/workflow? Default answer is NO. Prove it: (1) **Absorption check** — can this be a section in an existing skill? (2) Research existing equivalents (native CC, third-party, existing skill). (3) If yes — why is yours better? Bring evidence. (4) If no — real gap or theoretical? (5) **Quality tests** must prove OUTPUT QUALITY (existence tests prove nothing). (6) Less is more — every addition is maintenance burden.

- **CLAUDE.md health:** `/claude-md-improver` audits CLAUDE.md structure and completeness. Run it periodically. It does NOT cover feature docs — the SDLC workflow handles those.
+ If you can't write a quality test for it, you can't prove it works.
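
The existence-test vs. quality-test distinction, sketched with a hypothetical `generate_plan` stand-in for whatever the new component produces:

```shell
# An existence test passes on any output; a quality test scores the content.
generate_plan() {  # hypothetical stand-in for the new component under test
  printf '## Approach\n...\n## Blast radius\n...\n## Tests\nRED then GREEN\n'
}

out=$(generate_plan)

# Existence test (proves nothing): any non-empty output passes.
if [ -n "$out" ]; then echo "existence: pass (weak)"; fi

# Quality test: required sections must actually be present.
score=0
for section in "## Approach" "## Blast radius" "## Tests"; do
  if echo "$out" | grep -qF "$section"; then score=$((score + 1)); fi
done
echo "quality: $score/3 sections"
```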
 
  ## After Session (Capture Learnings)
 
- If this session revealed insights, update the right place:
- - **Testing patterns, gotchas** → `TESTING.md`
- - **Feature-specific quirks** → Feature docs (`*_DOCS.md`, e.g., `AUTH_DOCS.md`)
- - **Architecture decisions** → `docs/decisions/` (ADR format) or `ARCHITECTURE.md`
- - **General project context** → `CLAUDE.md` (or `/revise-claude-md`)
- - **Plan files** → If this session's work came from a plan file, delete it or mark it complete. Stale plans mislead future sessions into thinking work is still pending
+ | Insight | Destination |
+ |---------|-------------|
+ | Testing patterns/gotchas | `TESTING.md` |
+ | Feature-specific quirks | `*_DOCS.md` (e.g., `AUTH_DOCS.md`) |
+ | Architecture decisions | `docs/decisions/` (ADR) or `ARCHITECTURE.md` |
+ | General project context | `CLAUDE.md` (or `/revise-claude-md`) |
+ | Plan files (work done) | Delete or mark complete (stale plans mislead) |
 
  ### Memory Audit Protocol
 
- Per-user memory at `~/.claude/projects/<proj>/memory/` accumulates private learnings. Some belong there (user preferences, external references). Others are portable technical lessons (tool quirks, platform gotchas, bash/GHA/macOS footguns) that would save the next contributor hours. Run this audit to promote the portable ones.
+ Per-user memory at `~/.claude/projects/<proj>/memory/` accumulates private learnings. Some are portable lessons (tool quirks, platform gotchas) worth promoting to wizard docs.

- **When to run:**
- - End-of-release (before cutting a tag)
- - After a debugging-heavy session with multiple memory additions
- - On explicit "audit my memory" request
+ **When to run:** end-of-release, after debugging-heavy sessions, or on explicit "audit my memory" request.

- **Classify each memory file in `~/.claude/projects/<proj>/memory/`:**
+ **Rule-based denylist** (deterministic, no LLM):
+ - `type: user` → keep (user identity, preferences — never promote)
+ - `type: reference` → keep (external pointers, private by default)
+ - `type: project` → manual review (mixed state + portable lesson)
+ - `type: feedback` → manual review (mixed personal preference + portable rule)
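
The type normalization the denylist depends on (quotes, trailing comments, surrounding whitespace) can be sketched with `sed`. `normalize_type` is illustrative, not the wizard's actual parser:

```shell
# Normalize YAML `type:` variants so `type: "user"  # note` and `type: user`
# classify identically.
normalize_type() {
  sed -n 's/^type:[[:space:]]*"\{0,1\}\([a-z-]*\)"\{0,1\}.*/\1/p' "$1" | head -n 1
}

printf 'type: "user"  # identity stuff\n' > mem1.md
printf 'type: reference\n' > mem2.md

t1=$(normalize_type mem1.md)
t2=$(normalize_type mem2.md)
echo "mem1=$t1 mem2=$t2"
```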
 
747
- 1. **Rule-based denylist (deterministic, no LLM):**
748
- - `type: user` → `keep` (user identity, preferences — never promote)
749
- - `type: reference` → `keep` (external pointers to Discord/URL/etc — private by default)
750
- - `type: project` → `manual-review` (often mixed state + portable lesson — human decides)
751
- - `type: feedback` → `manual-review` (often mixed personal preference + portable rule — human decides)
752
- - Parser must normalize YAML variants (`type: "user"`, `type: user # comment`, surrounding whitespace) — see `tests/test-memory-audit-protocol.sh::apply_denylist_rule` for the reference implementation
753
- 2. **Remaining entries** (no type, or type outside the 4 above) fall through to human-gated review. An LLM-assisted classification runner is Prove-It-Gated: build it only after running this protocol 4+ times with manual classification. Until then, human review at promotion time IS the quality gate
290
+ **Destinations for promote entries** (no new files): tool/platform gotchas → `SDLC.md` `## Lessons Learned`. Testing → `TESTING.md`. Tool quirks tied to a skill → that `SKILL.md`. Process rules → `CLAUDE.md`.

- **Destinations for `promote` entries (no new files; use existing wizard destinations):**
+ **Tracking:** `promoted_to: <path>` in the memory file's YAML frontmatter; later audits skip already-promoted entries.
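The skip check can be sketched as a small shell helper. The helper name `already_promoted` is hypothetical; it assumes memory files open with a `---`-delimited YAML frontmatter block:

```shell
# Hypothetical sketch: should an audit skip this memory file?
# Looks for a `promoted_to:` key inside the first frontmatter block only.
already_promoted() {
  awk '/^---$/{n++; next} n==1 && /^promoted_to:/{found=1} END{exit !found}' "$1"
}

f=$(mktemp)
printf -- '---\ntype: project\npromoted_to: SDLC.md\n---\nbody text\n' > "$f"
already_promoted "$f" && echo "skip"   # → skip
```

Restricting the match to the frontmatter block avoids false positives when the body happens to mention `promoted_to:`.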
 
- | Content | Target |
- |---------|--------|
- | Language/tool/platform gotchas (bash, gh CLI, GHA, macOS) | `SDLC.md` → `## Lessons Learned` section |
- | Testing gotchas (flaky patterns, mock-vs-integration lessons) | `TESTING.md` |
- | Tool-specific quirks tied to a skill | That skill's `SKILL.md` |
- | Process rules that should govern the project | `CLAUDE.md` |
+ **Human gate is MANDATORY.** Protocol produces diffs; user approves chunk-by-chunk. Never auto-apply. Prove-It: build a `/memory-audit` slash command only after running 4+ times manually. (Full protocol: wizard doc.)
 
- **Tracking:** When you promote an entry, add `promoted_to: <path>` to that memory file's YAML frontmatter. Subsequent audits skip already-promoted entries.
-
- **Human gate is MANDATORY.** Protocol produces diffs; user approves chunk-by-chunk before apply. Never auto-apply — private memory touching public docs needs human judgement.
-
- **Prove It Gate:** If you find yourself running this protocol 4+ times and manually doing the same classification work, that's evidence to build a `/memory-audit` slash command AND wire the LLM-gated quality tests (8/10 classification, 6/6 destination). Until then, protocol + human review is enough — and no stub tests that skip (they mislead reviewers into thinking a gate exists when it doesn't).
-
- ## Post-Mortem: When Process Fails, Feed It Back
-
- **Every process failure becomes an enforcement rule.** When you skip a step and it causes a problem, don't just fix the symptom — add a gate so it can't happen again.
+ ## Post-Mortem: Process Failures Become Rules
 
  ```
  Incident → Root Cause → New Rule → Test That Proves the Rule → Ship
  ```
 
- **How to post-mortem a process failure:**
- 1. **What happened?** — Describe the incident (what went wrong, what was the impact)
- 2. **Root cause** — Not "I forgot" — what structurally allowed the skip? Was it guidance (easy to ignore) instead of a gate (impossible to skip)?
- 3. **New rule** — Turn the failure into an enforcement rule in the SDLC skill
- 4. **Test** — Write a test that proves the rule exists (TDD — the rule is code too)
- 5. **Evidence** — Reference the incident so future readers understand WHY the rule exists
+ Don't fix only the symptom. Add a gate so it can't happen again. Example: PR #145 auto-merged before CI review → "NEVER AUTO-MERGE" block + `test_never_auto_merge_gate`.

- **Example (real incident):** PR #145 auto-merged before CI review was read. Root cause: auto-merge was enabled by default, no enforcement gate existed. New rule: "NEVER AUTO-MERGE" block added to CI Shepherd section with the same weight as "ALL TESTS MUST PASS." Test: `test_never_auto_merge_gate` verifies the block exists.
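A rule-proving test of this shape can be sketched as follows. This is hypothetical; the actual `test_never_auto_merge_gate` in the wizard's test suite may differ. The idea is simply to assert that the enforcement block exists in the skill document:

```shell
# Hypothetical sketch of a "test that proves the rule": grep the skill doc
# for the enforcement block and fail loudly if it is missing.
test_never_auto_merge_gate() {
  grep -q "NEVER AUTO-MERGE" "$1" && echo "PASS" || echo "FAIL: rule block missing"
}

doc=$(mktemp)
echo "CI Shepherd: NEVER AUTO-MERGE (same weight as ALL TESTS MUST PASS)" > "$doc"
test_never_auto_merge_gate "$doc"   # → PASS
```

A grep-based existence check is deliberately crude: it guards against the rule being silently deleted, not against it being worded badly.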
+ ## Context Management & Subagents

- **Industry pattern:** "Every mistake becomes a rule": the best SDLC systems are built from accumulated incident learnings, not theoretical best practices.
+ - `/compact` between planning and implementation (plan preserved in summary)
+ - `/clear` between unrelated tasks; after 2+ failed corrections (context polluted)
+ - Auto-compact fires at ~95% capacity
+ - After committing a PR, `/clear` before next feature
+ - `--bare` mode (v2.1.81+) skips ALL hooks/skills/LSP/plugins. Scripted headless only — never normal development.
+ - Custom subagents (`.claude/agents/`) run autonomously and return results. Skills guide behavior; agents do work. Use for parallel tasks or fresh context. Examples: `sdlc-reviewer`, `ci-debug`, `test-writer`.
 
- ---
+ ## Design System Check (UI Changes Only)

- **Full reference:** SDLC.md
+ Read `DESIGN_SYSTEM.md` if it exists. Verify colors/fonts/spacing match tokens; flag new patterns not in the design system. Skip on backend/config/non-visual code.
+
+ ---
+ **Full reference:** `CLAUDE_CODE_SDLC_WIZARD.md` (cross-model review, deployment, debugging, post-mortem, memory audit, design system). `TESTING.md` (testing diamond + mocking). `ARCHITECTURE.md` (environments + post-deploy).