agentic-sdlc-wizard 1.47.0 → 1.49.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -9,61 +9,55 @@ effort: high
  ## Task
  $ARGUMENTS
 
+ Operational checklist. Full protocol lives in `CLAUDE_CODE_SDLC_WIZARD.md` — read it for depth.
+
  ## Full SDLC Checklist
 
- Your FIRST action must be TodoWrite with these steps:
+ Your FIRST action must be a TodoWrite covering every phase below. Compact form (omit `activeForm` to use the subject as the spinner label):
 
  ```
  TodoWrite([
- // PLANNING PHASE (Plan Mode for non-trivial tasks)
- { content: "Find and read relevant documentation", status: "in_progress", activeForm: "Reading docs" },
- { content: "Assess doc health - flag issues (ask before cleaning)", status: "pending", activeForm: "Checking doc health" },
- { content: "DRY scan: What patterns exist to reuse? New pattern = get approval", status: "pending", activeForm: "Scanning for reusable patterns" },
- { content: "Prove It Gate: adding new component? Research alternatives, prove quality with tests", status: "pending", activeForm: "Checking prove-it gate" },
- { content: "Blast radius: What depends on code I'm changing?", status: "pending", activeForm: "Checking dependencies" },
- { content: "Design system check (if UI change)", status: "pending", activeForm: "Checking design system" },
- { content: "Restate task in own words - verify understanding", status: "pending", activeForm: "Verifying understanding" },
- { content: "Scrutinize test design - right things tested? Follow TESTING.md?", status: "pending", activeForm: "Reviewing test approach" },
- { content: "Present approach + STATE CONFIDENCE LEVEL", status: "pending", activeForm: "Presenting approach" },
- { content: "Signal ready - user exits plan mode", status: "pending", activeForm: "Awaiting plan approval" },
- // TRANSITION PHASE (After plan mode)
- { content: "Doc sync: update or create feature docs — MUST be current before commit", status: "pending", activeForm: "Syncing feature docs" },
- // IMPLEMENTATION PHASE
- { content: "TDD RED: Write failing test FIRST", status: "pending", activeForm: "Writing failing test" },
- { content: "TDD GREEN: Implement, verify test passes", status: "pending", activeForm: "Implementing feature" },
- { content: "Run lint/typecheck", status: "pending", activeForm: "Running lint and typecheck" },
- { content: "Run ALL tests", status: "pending", activeForm: "Running all tests" },
- { content: "Production build check", status: "pending", activeForm: "Verifying production build" },
- // REVIEW PHASE
- { content: "DRY check: Is logic duplicated elsewhere?", status: "pending", activeForm: "Checking for duplication" },
- { content: "Visual consistency check (if UI change)", status: "pending", activeForm: "Checking visual consistency" },
- { content: "Self-review: run /code-review", status: "pending", activeForm: "Running code review" },
- { content: "Security review (if warranted)", status: "pending", activeForm: "Checking security implications" },
- { content: "Cross-model review (if configured — see below)", status: "pending", activeForm: "Running cross-model review" },
- { content: "Scope guard: only changes related to task? No legacy/fallback code left?", status: "pending", activeForm: "Checking scope and legacy code" },
- // CI FEEDBACK LOOP (if CI monitoring enabled in setup - skip if no CI)
- // NOTE (meta-repo only, ROADMAP #212 Option 1, 2026-04-24): this repo no
- // longer runs e2e simulation in CI. Only `validate` blocks merge. For E2E
- // signal on a PR, checkout the PR locally and run:
- //   bash tests/e2e/local-shepherd.sh <PR>
- // which scores on Max subscription and posts an advisory check-run.
- // Consumer repos still use their own CI as configured.
- { content: "Commit and push to remote", status: "pending", activeForm: "Pushing to remote" },
- { content: "Watch CI - fix failures, iterate until green (max 2x)", status: "pending", activeForm: "Watching CI" },
- { content: "Read CI review - implement valid suggestions, iterate until clean", status: "pending", activeForm: "Addressing CI review feedback" },
- { content: "Meta-repo only: run local shepherd if PR needs E2E score (optional)", status: "pending", activeForm: "Running local shepherd" },
- { content: "Post-deploy verification (if deploy task — see Deployment Tasks)", status: "pending", activeForm: "Verifying deployment" },
+ // PLANNING
+ { content: "Find and read relevant documentation", status: "in_progress" },
+ { content: "Assess doc health - flag issues (ask before cleaning)", status: "pending" },
+ { content: "DRY scan: What patterns exist to reuse? New pattern = get approval", status: "pending" },
+ { content: "Prove It Gate: adding new component? Research alternatives, prove quality with tests", status: "pending" },
+ { content: "Blast radius: What depends on code I'm changing?", status: "pending" },
+ { content: "Design system check (if UI change)", status: "pending" },
+ { content: "Restate task in own words - verify understanding", status: "pending" },
+ { content: "Scrutinize test design - right things tested? Follow TESTING.md?", status: "pending" },
+ { content: "Present approach + STATE CONFIDENCE LEVEL", status: "pending" },
+ { content: "Signal ready - user exits plan mode", status: "pending" },
+ // TRANSITION
+ { content: "Doc sync: update or create feature doc — MUST be current before commit", status: "pending" },
+ // IMPLEMENTATION
+ { content: "TDD RED: Write failing test FIRST", status: "pending" },
+ { content: "TDD GREEN: Implement, verify test passes", status: "pending" },
+ { content: "Run lint/typecheck", status: "pending" },
+ { content: "Run ALL tests", status: "pending" },
+ { content: "Production build check", status: "pending" },
+ // REVIEW
+ { content: "DRY check: Is logic duplicated elsewhere?", status: "pending" },
+ { content: "Visual consistency check (if UI change)", status: "pending" },
+ { content: "Self-review: run /code-review", status: "pending" },
+ { content: "Security review (if warranted)", status: "pending" },
+ { content: "Cross-model review (if configured)", status: "pending" },
+ { content: "Scope guard: only changes related to task? No legacy/fallback code left?", status: "pending" },
+ // CI SHEPHERD
+ { content: "Commit and push to remote", status: "pending" },
+ { content: "Watch CI - fix failures, iterate until green (max 2x)", status: "pending" },
+ { content: "Read CI review - implement valid suggestions, iterate until clean", status: "pending" },
+ { content: "Meta-repo only: run local shepherd if PR needs E2E score (optional)", status: "pending" },
+ { content: "Post-deploy verification (if deploy task)", status: "pending" },
  // FINAL
- { content: "Present summary: changes, tests, CI status", status: "pending", activeForm: "Presenting final summary" },
- { content: "Capture learnings (if any — update TESTING.md, CLAUDE.md, or feature docs)", status: "pending", activeForm: "Capturing session learnings" },
- { content: "Close out plan files: if task came from a plan, mark complete or delete", status: "pending", activeForm: "Closing plan artifacts" }
+ { content: "Present summary: changes, tests, CI status", status: "pending" },
+ { content: "Capture learnings (after session — TESTING.md, CLAUDE.md, or feature docs)", status: "pending" },
+ { content: "Close out plan files: if task came from a plan, mark complete or delete", status: "pending" }
  ])
  ```
 
  ## SDLC Quality Checklist (Scoring Rubric)
 
- Your work is scored on these criteria. **Critical** criteria are must-pass.
-
  | Criterion | Points | Critical? | What Counts |
  |-----------|--------|-----------|-------------|
  | task_tracking | 1 | | Use TodoWrite or TaskCreate |
@@ -76,734 +70,249 @@ Your work is scored on these criteria. **Critical** criteria are must-pass.
  | self_review | 1 | **YES** | Read back files/diffs you modified |
  | clean_code | 1 | | One coherent approach, no dead code |
 
- **Total: 10 points** (11 for UI tasks, +1 for design_system check)
-
- Critical miss on `tdd_red` or `self_review` = process failure regardless of total score.
+ **Total: 10 points** (11 for UI tasks, +1 for design_system check). Critical miss on `tdd_red` or `self_review` = process failure regardless of total score.
 
- ## Test Failure Recovery (SDET Philosophy)
+ ## Test Failure Recovery
 
- ```
- ┌─────────────────────────────────────────────────────────────────────┐
- │ ALL TESTS MUST PASS. NO EXCEPTIONS.                                 │
- │                                                                     │
- │ This is not negotiable. This is not flexible. This is absolute.    │
- └─────────────────────────────────────────────────────────────────────┘
- ```
+ **ALL TESTS MUST PASS. NO EXCEPTIONS.** Test code is app code. Failures are bugs — investigate them like a 15-year SDET, not by brushing them aside.
 
- **Not acceptable:**
- - "Those were already failing" → Fix them first
- - "Not related to my changes" → Doesn't matter, fix it
- - "It's flaky" → Flaky = bug, investigate
+ Not acceptable: "those were already failing", "not related to my changes", "it's flaky" (flaky = bug we haven't found yet).
 
- **Treat test code like app code.** Test failures are bugs. Investigate them the way a 15-year SDET would - with thought and care, not by brushing them aside.
-
- If tests fail:
+ When tests fail:
  1. Identify which test(s) failed
- 2. Diagnose WHY - this is the important part:
-    - Your code broke it? Fix your code (regression)
-    - Test is for deleted code? Delete the test
-    - Test has wrong assertions? Fix the test
-    - Test is "flaky"? Investigate - flakiness is just another word for bug
- 3. Fix appropriately (fix code, fix test, or delete dead test)
- 4. Run specific test individually first
- 5. Then run ALL tests
- 6. Still failing? ASK USER - don't spin your wheels
-
- **Flaky tests are bugs, not mysteries:**
- - Sometimes the bug is in app code (race condition, timing issue)
- - Sometimes the bug is in test code (shared state, not parallel-safe)
- - Sometimes the bug is in test environment (cleanup not proper)
-
- Debug it. Find root cause. Fix it properly. Tests ARE code.
-
- ## New Pattern & Test Design Scrutiny (PLANNING)
-
- **New design patterns require human approval:**
- 1. Search first - do similar patterns exist in codebase?
- 2. If YES and they're good - use as building block
- 3. If YES but they're bad - propose improvement, get approval
- 4. If NO (new pattern) - explain why needed, get explicit approval
-
- **Test design scrutiny during planning:**
- - Are we testing the right things?
- - Does test approach follow TESTING.md philosophies?
- - If introducing new test patterns, same scrutiny as code patterns
-
- ## Prove It Gate (REQUIRED for New Additions)
-
- **Adding a new skill, hook, workflow, or component? PROVE IT FIRST:**
-
- 1. **Absorption check:** Can this be added as a section in an existing skill instead of a new component? Default is YES — new skills/hooks need strong justification. Releasing is SDLC, not a separate skill. Debugging is SDLC, not a separate skill. Keep it lean
- 2. **Research:** Does something equivalent already exist (native CC, third-party plugin, existing skill)?
- 3. **If YES:** Why is yours better? Show evidence (A/B test, quality comparison, gap analysis)
- 4. **If NO:** What gap does this fill? Is the gap real or theoretical?
- 5. **Quality tests:** New additions MUST have tests that prove OUTPUT QUALITY, not just existence
- 6. **Less is more:** Every addition is maintenance burden. Default answer is NO unless proven YES
-
- **Existence tests are NOT quality tests:**
- - BAD: "ci-analyzer skill file exists" — proves nothing about quality
- - GOOD: "ci-analyzer recommends lint-first when test-before-lint detected" — proves behavior
-
- **If you can't write a quality test for it, you can't prove it works, so don't add it.**
-
- ## Plan Mode Integration
+ 2. Diagnose WHY: your code broke it (regression — fix code), test is for deleted code (delete test), test has wrong assertions (fix test), "flaky" (investigate — race, shared state, env)
+ 3. Fix appropriately, run specific test individually first, then run ALL tests
+ 4. Still failing after 2 attempts? STOP and ASK USER
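 
A minimal sketch of step 3's order, assuming an npm project with a vitest/jest-style runner that accepts a single file path (the path is illustrative):

```bash
# Re-run only the failing test first, for fast feedback on the fix...
npx vitest run tests/auth.test.ts

# ...and only once it passes, run the FULL suite to catch collateral damage.
npm test
```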
 
- **Use plan mode for:** Multi-file changes, new features, LOW confidence, bugs needing investigation.
+ ## Confidence Check (REQUIRED)
 
- **Workflow:**
- 1. **Plan Mode** (editing blocked): Research -> Write plan file -> Present approach + confidence
- 2. **Transition** (after approval): Update feature docs
- 3. **Implementation**: TDD RED -> GREEN -> PASS
+ State your confidence before presenting an approach:
 
- ### Auto-Approval: Skip Plan Approval Step
+ | Level | Meaning | Action | Effort |
+ |-------|---------|--------|--------|
+ | HIGH (90%+) | Know exactly what to do | Present, proceed after approval | `high` (default) |
+ | MEDIUM (60-89%) | Solid approach, some uncertainty | Present, highlight uncertainties | `high` |
+ | LOW (<60%) | Not sure | Research or try Codex; if still LOW, ASK USER | **`/effort xhigh` now** |
+ | FAILED 2x | Something's wrong | Codex for fresh perspective; if still stuck, STOP | **`/effort max` now** |
+ | CONFUSED | Can't diagnose | Codex; if still confused, STOP and describe | **`/effort max` now** |
 
- If ALL of these are true, skip plan approval and go straight to TDD:
- - Confidence is **HIGH (95%+)** — you know exactly what to do
- - Task is **single-file or trivial** (config tweak, small bug fix, string change)
- - No new patterns introduced
- - No architectural decisions
+ **Dynamic effort bumping is NOT optional.** "Consider max effort" is the same as "ignore this." Bump BEFORE the next attempt, not after a third failure.
 
- When auto-approving, still announce your approach — just don't wait for approval:
- > "Confidence HIGH (95%). Single-file change. Proceeding directly to TDD."
+ ## Plan Mode
 
- **When in doubt, wait for approval.** Auto-approval is for clear-cut cases only.
+ Use plan mode for: multi-file changes, new features, LOW confidence, bugs needing investigation. **Skip plan approval step** (auto-approval) when confidence HIGH (95%+) AND single-file/trivial AND no new patterns AND no architectural decisions — still announce approach, don't wait. When in doubt, wait.
 
  ## Recommended Model
 
- **Opt-in: `opus[1m]` (Opus 4.7 with 1M context window).** Run `/model opus[1m]` at the start of any non-trivial SDLC session but understand the tradeoff first (issue #198).
-
- **Why opt-in, not default:** A top-level `model` pin in `.claude/settings.json` disables Claude Code's per-turn model auto-selection. That's a real cost — Max-plan users pay for that auto-selection (Sonnet for cheap tasks, Opus for hard ones, plus weekly-limit smoothing). Pin only when you actually need the 1M headroom.
-
- **Why pin to `opus[1m]` when you do opt in:**
- - SDLC sessions (plan → TDD → review → CI shepherd) accumulate context fast — plans, test output, diffs, review artifacts. 200K fills up before you're done.
- - Forced auto-compact mid-task loses your working state. Extra headroom is cheaper than re-reading files.
- - At time of writing, Anthropic lists 1M context at standard pricing for supported Opus/Sonnet models — verify current rates for your plan before relying on this.
+ **Opt-in: `opus[1m]` (Opus 4.7 with 1M context).** `/model opus[1m]` at the start of non-trivial sessions — understand the tradeoff (issue #198). A top-level `model` pin in `.claude/settings.json` disables CC's per-turn auto-selection; pin only when you need 1M headroom. Requires CC v2.1.111+.
 
- **Requires Claude Code v2.1.111+** for Opus 4.7.
-
- **Pair with `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30`** when you opt in. Without it, CC's default auto-compact on 1M fires at ~76K and defeats the purpose. The setup wizard's Step 9.5 prompts to write both together (template ships with neither, opt-in only).
-
- **Fall back to `opus` (200K) only when:** your plan charges a premium for long-context prompts, the task is genuinely short (<30K), or team cost controls flag >200K prompts. See the "1M vs 200K Context Window" section in `CLAUDE_CODE_SDLC_WIZARD.md` for details.
-
- ## Confidence Check (REQUIRED)
+ **Pair with `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30` when you opt in.** Without it, the default fires at ~76K on 1M. **Pick ONE — do NOT set both `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30` AND `CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000`** — they compound to 30% × 400K = 120K trigger ≈ 12% of 1M, fires almost immediately (#207). See wizard "Autocompact Tuning" for details.
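 
A minimal sketch of the pairing, assuming the override can be exported at the shell level before launching the session (the wizard's Step 9.5 writes the settings-file equivalent):

```bash
# Opt in for this session: export the earlier-compaction override, then
# pick the 1M model inside the session with /model opus[1m].
export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30   # compact at 30% of 1M instead of the ~76K default
claude

# Leave CLAUDE_CODE_AUTO_COMPACT_WINDOW unset: setting both compounds to a
# ~120K trigger (30% x 400K), which fires almost immediately (#207).
```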
 
- Before presenting approach, STATE your confidence:
-
- | Level | Meaning | Action | Effort |
- |-------|---------|--------|--------|
- | HIGH (90%+) | Know exactly what to do | Present approach, proceed after approval | `high` (default) |
- | MEDIUM (60-89%) | Solid approach, some uncertainty | Present approach, highlight uncertainties | `high` (default) |
- | LOW (<60%) | Not sure | Do more research or try cross-model research (Codex) to get to 95%. If still LOW after research, ASK USER | **Run `/effort xhigh` now** — don't wait |
- | FAILED 2x | Something's wrong | Try cross-model research (Codex) for a fresh perspective. If still stuck, STOP and ASK USER | **Run `/effort max` now** — you're already burning cycles at lower effort |
- | CONFUSED | Can't diagnose why something is failing | Try cross-model research (Codex). If still confused, STOP. Describe what you tried, ask for help | **Run `/effort max` now** — stop spinning |
-
- **Dynamic bumping is NOT optional.** "Consider max effort" is the same as "ignore this" in practice. If your confidence drops or tests fail twice, bump effort BEFORE the next attempt — not after a third failure. Spinning at low effort is an SDLC failure mode, not a style choice.
-
- ## Self-Review Loop (CRITICAL)
+ ## Self-Review Loop
 
  ```
- PLANNING -> DOCS -> TDD RED -> TDD GREEN -> Tests Pass -> Self-Review
-     ^                                                         |
-     |                                                         v
-     |                                                   Issues found?
-     |                                                   |-- NO -> Present to user
-     |                                                   +-- YES v
-     +---------------------------------------- Ask user: fix in new plan?
+ PLANNING → DOCS → TDD RED → GREEN → Tests Pass → Self-Review
+     ^                                                |
+     +--- Ask user: fix in new plan? ← Issues found? YES (NO → Present)
  ```
 
- **The loop goes back to PLANNING, not TDD RED.** When self-review finds issues:
- 1. Ask user: "Found issues. Want to create a plan to fix?"
- 2. If yes -> back to PLANNING phase with new plan doc
- 3. Then -> docs update -> TDD -> review (proper SDLC loop)
-
- **How to self-review:**
- 1. Run `/code-review` to review your changes
- 2. It launches parallel agents (CLAUDE.md compliance, bug detection, logic & security)
- 3. Issues at confidence >= 80 are real findings — go back to PLANNING to fix
- 4. Issues below 80 are likely false positives — skip unless obviously valid
- 5. Address issues by going back through the proper SDLC loop
+ The loop goes back to PLANNING, not TDD RED. Run `/code-review`; issues at confidence ≥ 80 are real, < 80 are likely false positives. Found issues → ask "Want a plan to fix?" → new plan → docs → TDD → review.
 
  ## Cross-Model Review (If Configured)
 
- **When to run:** High-stakes changes (auth, payments, data handling), releases/publishes (version bumps, CHANGELOG, npm publish), complex refactors, research-heavy work.
- **When to skip:** Trivial changes (typo fixes, config tweaks), time-sensitive hotfixes, risk < review cost.
-
- **Prerequisites:** Codex CLI installed (`npm i -g @openai/codex`), OpenAI API key set.
-
- **The core insight:** The review PROTOCOL is universal across domains. Only the review INSTRUCTIONS change. Code review is the default template below. For non-code domains (research, persuasion, medical content), adapt the `review_instructions` and `verification_checklist` fields while keeping the same handoff/dialogue/convergence loop.
+ **When to run:** high-stakes changes (auth, payments, data), releases/publishes, complex refactors.
+ **When to skip:** trivial changes, time-sensitive hotfixes, risk < review cost.
+ **Prerequisites:** Codex CLI (`npm i -g @openai/codex`) + OpenAI API key.
 
- **Reviewer always at the flagship tier (roadmap #233):** if the project pins `model: "sonnet[1m]"` (mixed-mode) or any non-flagship coder, the cross-model reviewer **still runs at the flagship**: `codex exec -c 'model_reasoning_effort="xhigh"'` (gpt-5.5) or an Opus 4.7 max equivalent. The whole point of mixed-mode is that adversarial review catches Sonnet's blind spots — weakening the reviewer leg defeats the savings. Don't downscale the review just because the coder is downscaled.
+ The PROTOCOL is universal across domains; only `review_instructions` and `verification_checklist` change. **Reviewer always at flagship tier (#233):** if the project pins `model: "sonnet[1m]"` (mixed-mode), the reviewer still runs `gpt-5.5` or an Opus 4.7 max equivalent — adversarial diversity is the point.
 
- ### Step 0: Write Preflight Self-Review Doc
+ ### Step 0: Preflight Self-Review
 
- Before submitting to an external reviewer, document what YOU already checked. This is proven to reduce reviewer findings to 0-1 per round (evidence: anticheat repo preflight discipline).
+ At `.reviews/preflight-{review_id}.md`, document what you already checked: `/code-review` passed, all tests passing, specific concerns checked, what you verified manually, known limitations. Reduces reviewer findings to 0-1 per round.
 
- Write `.reviews/preflight-{review_id}.md`:
- ```markdown
- ## Preflight Self-Review: {feature}
- - [ ] Self-review via /code-review passed
- - [ ] All tests passing
- - [ ] Checked for: [specific concerns for this change]
- - [ ] Verified: [what you manually confirmed]
- - [ ] Known limitations: [what you couldn't verify]
- ```
-
- ### Step 1: Write Mission-First Handoff
+ ### Step 1: Mission-First Handoff
 
- After self-review and preflight pass, write `.reviews/handoff.json`:
+ Write `.reviews/handoff.json`:
  ```jsonc
  {
    "review_id": "feature-xyz-001",
    "status": "PENDING_REVIEW",
    "round": 1,
-   "mission": "What changed and why — 2-3 sentences of context",
-   "success": "What 'correctly reviewed' looks like — the reviewer's goal",
-   "failure": "What gets missed if the reviewer is superficial",
+   "mission": "What changed and why — 2-3 sentences",
+   "success": "What 'correctly reviewed' looks like",
+   "failure": "What gets missed if reviewer is superficial",
    "files_changed": ["src/auth.ts", "tests/auth.test.ts"],
    "fixes_applied": [],
    "previous_score": null,
    "verification_checklist": [
      "(a) Verify input validation at auth.ts:45 handles empty strings",
-     "(b) Verify test covers the null-token edge case",
-     "(c) Check no hardcoded secrets in diff"
+     "(b) Verify test covers null-token edge case"
    ],
-   "review_instructions": "Focus on security and edge cases. Be strict — assume bugs may be present until proven otherwise.",
+   "review_instructions": "Focus on security and edge cases. Be strict — assume bugs may be present.",
    "preflight_path": ".reviews/preflight-feature-xyz-001.md",
-   "artifact_path": ".reviews/feature-xyz-001/",
    "pr_number": 205
  }
  ```
 
- **Key fields explained:**
278
- - `mission/success/failure` — Gives the reviewer context. Without this, you get generic "looks good" feedback. With it, reviewers read raw source files and verify specific claims (proven across 4 repos)
279
- - `verification_checklist` — Specific things to verify with file:line references. NOT "review for correctness" — that's too vague. Each item is independently verifiable
280
- - `preflight_path` — Shows the reviewer what you already checked, so they focus on what you might have missed
281
- - `pr_number` (optional) — PreCompact self-heal opt-in (ROADMAP #209). When the review tracks a specific PR, set this. The `precompact-seam-check.sh` hook queries `gh pr view N --json state` on every manual `/compact` and, if the PR is MERGED, treats this handoff as implicit CERTIFIED — unblocking `/compact` even if `status` is still `PENDING_*`. Without `pr_number`, a forgotten PENDING handoff blocks every future manual compact until you flip status by hand or hit the `SDLC_HANDOFF_STALE_DAYS` (default 14) auto-expire fallback. Omit for ad-hoc reviews not tied to a PR.
157
+ `mission/success/failure` give context (without them: generic "looks good"). `verification_checklist` is specific (file:line), not "review for correctness." `pr_number` (optional) is the **PreCompact self-heal opt-in (ROADMAP #209)**: when set, `precompact-seam-check.sh` checks `gh pr view N --json state` on `/compact` and, if MERGED, treats handoff as implicit CERTIFIED. Without it, a forgotten PENDING handoff blocks every manual compact until you flip status or hit `SDLC_HANDOFF_STALE_DAYS` (default 14).
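 
A sketch of the state check the hook is described as performing; the real `precompact-seam-check.sh` may differ:

```bash
# Read pr_number from the handoff, then ask GitHub whether that PR merged.
pr=$(jq -r '.pr_number // empty' .reviews/handoff.json)
if [ -n "$pr" ] && [ "$(gh pr view "$pr" --json state -q .state)" = "MERGED" ]; then
  echo "PR #$pr merged: treating handoff as CERTIFIED, /compact unblocked"
fi
```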
 
- ### Step 2: Run the Independent Reviewer
+ ### Step 2: Run the Reviewer
 
  ```bash
- codex exec \
-   -c 'model_reasoning_effort="xhigh"' \
-   -s danger-full-access \
+ codex exec -c 'model_reasoning_effort="xhigh"' -s danger-full-access \
    -o .reviews/latest-review.md \
- "You are an independent code reviewer performing a certification audit. \
- Read .reviews/handoff.json for full context — mission, success/failure \
- conditions, and verification checklist. \
- Verify each checklist item with evidence (file:line, grep results, test output). \
- Output each finding with: ID (1, 2, ...), severity (P0/P1/P2), evidence, \
- and a 'certify condition' (what specific change resolves it). \
- Re-verify any prior-round passes still hold. \
+ "Independent code reviewer. Read .reviews/handoff.json for context. \
+ Verify each checklist item with evidence (file:line, grep, test output). \
+ Each finding: ID, severity (P0/P1/P2), evidence, certify condition. \
  End with: score (1-10), CERTIFIED or NOT CERTIFIED."
  ```
 
- **Always use `xhigh` reasoning effort.** Lower settings miss subtle errors (wrong-generation references, stale pricing, cross-file inconsistencies).
-
- **Progress visibility (#259):** Reviews at `xhigh` routinely take 1-5 minutes. Without a heartbeat the user can't distinguish "still thinking" from "crashed silently". For long reviews, swap the bare invocation for `scripts/codex-review-with-progress.sh`:
+ Always `xhigh` — lower settings miss subtle errors. **Progress (#259):** xhigh runs take 1-5 min; for a heartbeat use `scripts/codex-review-with-progress.sh` (`SDLC_CODEX_HEARTBEAT_INTERVAL` tunes the interval). **Sandbox:** Codex's Rust binary needs `SCDynamicStore`; CC's sandbox blocks this. From CC, use `dangerouslyDisableSandbox: true` — Codex has its own sandbox via `-s danger-full-access`. Known issue: [codex#15640](https://github.com/openai/codex/issues/15640).
 
- ```bash
- scripts/codex-review-with-progress.sh \
-   .reviews/latest-review.md \
-   "You are an independent code reviewer ..."
- ```
+ CERTIFIED → CI. NOT CERTIFIED → dialogue loop.
 
- The wrapper passes the same default flags (`xhigh`, `danger-full-access`, `-o`) and emits a heartbeat to stderr every 10s (`SDLC_CODEX_HEARTBEAT_INTERVAL` to tune):
+ ### Step 3: Dialogue Loop
 
- ```
- [codex 0m10s elapsed, 0 bytes written to .reviews/latest-review.md] still running...
- [codex 0m20s elapsed, 1342 bytes written to .reviews/latest-review.md] still running...
- [codex finished in 47s with rc=0]
- ```
+ Per-finding response in `.reviews/response.json`: `{"finding": "1", "action": "FIXED|DISPUTED|ACCEPTED", "summary": "..."}`. Update `handoff.json`: increment `round`, status `PENDING_RECHECK`, add `fixes_applied` (numbered, file:line refs).
 
- The output file growth is the closest signal to "Codex is producing tokens". Empty + long elapsed = stuck; growing = working.
+ Recheck prompt: "TARGETED RECHECK. For each finding: FIXED → verify certify condition. DISPUTED → ACCEPT if sound, REJECT with reasoning. ACCEPTED → verify applied. Do NOT raise new findings unless P0. End with score, CERTIFIED or NOT CERTIFIED."
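 
A sketch of that round bookkeeping with `jq`, assuming the field names from the Step 1 example (the `fixes_applied` entry is illustrative):

```bash
# Bump the handoff for a targeted recheck.
jq '.round += 1
    | .status = "PENDING_RECHECK"
    | .fixes_applied += ["1: empty-string validation added (src/auth.ts:45)"]' \
  .reviews/handoff.json > .reviews/handoff.tmp && mv .reviews/handoff.tmp .reviews/handoff.json
```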
 
- **Sandbox note:** Codex's Rust binary requires access to macOS system configuration APIs (`SCDynamicStore`) during sandbox initialization. Claude Code's sandbox blocks this access, causing `codex exec` to crash with `system-configuration panicked: Attempted to create a NULL object`. When running `codex exec` from within Claude Code, you MUST use `dangerouslyDisableSandbox: true` on the Bash tool call. This is safe — Codex has its own sandbox (`-s danger-full-access` is already specified), and the CC sandbox bypass only affects the Codex process. This is a known Codex issue ([#15640](https://github.com/openai/codex/issues/15640)).
+ **Convergence:** 2 rounds is the sweet spot, 3 max (research: 14 repos + 7 papers). Still NOT CERTIFIED after round 3 → escalate to the user.
 
- If CERTIFIED → proceed to CI. If NOT CERTIFIED → go to dialogue loop.
+ **Anti-patterns:** "find at least N problems," "review this," 1-10 without criteria, letting reviewer see author's reasoning (anchoring).
 
- ### Step 3: Dialogue Loop
+ **Multiple reviewers** (Claude review + Codex + human): `gh api repos/OWNER/REPO/pulls/PR/comments` for all feedback, respond to each reviewer independently (different blind spots), pick stronger argument on conflicts, max 3 iterations per reviewer.
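 
A sketch of pulling all reviewer feedback in one shot, assuming `jq`-style filtering via `gh`; OWNER/REPO are placeholders and 205 is the example PR from the handoff:

```bash
# Count each reviewer's comments so every reviewer gets an independent response.
gh api repos/OWNER/REPO/pulls/205/comments \
  --jq 'group_by(.user.login)[] | "\(.[0].user.login): \(length) comment(s)"'
```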
 
- Respond per-finding — don't silently fix everything:
-
- 1. Write `.reviews/response.json`:
- ```jsonc
- {
-   "review_id": "feature-xyz-001",
-   "round": 2,
-   "responding_to": ".reviews/latest-review.md",
-   "responses": [
-     { "finding": "1", "action": "FIXED", "summary": "Added missing validation" },
-     { "finding": "2", "action": "DISPUTED", "justification": "Intentional — see CODE_REVIEW_EXCEPTIONS.md" },
-     { "finding": "3", "action": "ACCEPTED", "summary": "Will add test coverage" }
-   ]
- }
- ```
- - **FIXED**: "I fixed this. Here's what changed." Reviewer verifies against certify condition.
- - **DISPUTED**: "This is intentional/incorrect. Here's why." Reviewer accepts or rejects with reasoning.
- - **ACCEPTED**: "You're right. Fixing now." (Same as FIXED, batched.)
-
- 2. Update `handoff.json`: increment `round`, set `"status": "PENDING_RECHECK"`, add `fixes_applied` list with numbered items and file:line references, update `previous_score`.
-
- 3. Run targeted recheck (NOT a full re-review):
- ```bash
- codex exec \
-   -c 'model_reasoning_effort="xhigh"' \
-   -s danger-full-access \
-   -o .reviews/latest-review.md \
- "TARGETED RECHECK — not a full re-review. Read .reviews/handoff.json \
- for previous_review path and response.json for the author's responses. \
- For each finding: FIXED → verify against original certify condition. \
- DISPUTED → evaluate justification (ACCEPT if sound, REJECT with reasoning). \
- ACCEPTED → verify it was applied. \
- Do NOT raise new findings unless P0 (critical/security). \
- New observations go in 'Notes for next review' (non-blocking). \
- Re-verify all prior passes still hold. \
- End with: score (1-10), CERTIFIED or NOT CERTIFIED."
- ```
-
- ### Convergence
-
- **2 rounds is the sweet spot. 3 max.** Research across 14 repos and 7 papers confirms additional rounds beyond 3 produce <5% position shift.
-
- Max 2 recheck rounds (3 total including initial review). If still NOT CERTIFIED after round 3, escalate to the user with a summary of open findings.
+ **Non-code domains** (research, persuasion, medical): same handoff format, adapt `review_instructions` + `verification_checklist`, add `audience` + `stakes`.
 
- ```
- Preflight → handoff.json (round 1) → FULL REVIEW
-                                          |
-                              CERTIFIED? → YES → CI
-                                          |
-                               NO (scored findings)
-                                          |
-                    response.json (FIXED/DISPUTED/ACCEPTED)
-                                          |
-              handoff.json (round 2+) → TARGETED RECHECK
-                                          |
-                              CERTIFIED? → YES → CI
-                                          |
-                      NO → one more round, then escalate
- ```
+ ### Release Review Focus
 
- **Tool-agnostic:** The value is adversarial diversity (different model, different blind spots), not the specific tool. Any competing AI reviewer works.
+ Before any release/publish, add to `verification_checklist`: **CHANGELOG consistency** (sections present, no lost entries), **Version parity** (package.json + SDLC.md + CHANGELOG + wizard metadata), **Stale examples** (hardcoded version strings), **Docs accuracy** (README + ARCHITECTURE reflect current features), **CLI-distributed file parity** (live skills/hooks match CLI templates).
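 
A sketch of the version-parity item, assuming the file names listed above; the grep targets are illustrative:

```bash
# Does every doc mention the version being released?
v=$(jq -r .version package.json)
for f in SDLC.md CHANGELOG.md README.md; do
  grep -q "$v" "$f" && echo "OK   $f mentions $v" || echo "MISS $f lacks $v"
done
```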
 
- ### Anti-Patterns to Avoid
+ (Full protocol with rationale and convergence diagrams: `CLAUDE_CODE_SDLC_WIZARD.md` → Cross-Model Review.)
 
- - **"Find at least N problems"** — Incentivizes false positives. Use adversarial framing ("assume bugs may be present") instead
- - **"Review this"** — Too vague, gets generic feedback. Use mission + verification checklist
- - **Numeric 1-10 scales without criteria** — Unreliable. Decompose into specific checklist items
- - **Letting reviewer see author's reasoning** — Causes anchoring bias. Let them form independent opinion from code
+ ## Documentation Sync (REQUIRED — During Planning)
 
- ### Release Review Focus
+ **Docs MUST be current before commit.** Stale docs = wrong implementations = wasted sessions.
 
- Before any release/publish, add these to `verification_checklist`:
- - **CHANGELOG consistency** — all sections present, no lost entries during consolidation
- - **Version parity** — package.json, SDLC.md, CHANGELOG, wizard metadata all match
- - **Stale examples** — hardcoded version strings in docs match current release
- - **Docs accuracy** — README, ARCHITECTURE.md reflect current feature set
- - **CLI-distributed file parity** — live skills, hooks, settings match CLI templates
+ Standard pattern: `*_DOCS.md` — living documents that grow with the feature (`AUTH_DOCS.md`, `PAYMENTS_DOCS.md`).
 
- ### Multiple Reviewers (N-Reviewer Pipeline)
+ 1. Read feature docs for the area being changed during planning
+ 2. When a code change contradicts what the doc says → MUST update the feature doc
+ 3. When a code change extends behavior the doc describes → MUST update the feature doc (add new behavior)
+ 4. No `*_DOCS.md` exists and feature touches 3+ files → create one
+ 5. Project has `ROADMAP.md` → mark items done, add new items (ROADMAP feeds CHANGELOG)
 
- When multiple reviewers comment on a PR (Claude PR review, Codex, human reviewers), address each reviewer independently:
+ `/claude-md-improver` audits CLAUDE.md structure. Run it periodically. It does NOT cover feature docs — the SDLC workflow handles those.
 
- 1. **Read all reviews** — `gh api repos/OWNER/REPO/pulls/PR/comments` to get every reviewer's feedback
- 2. **Respond per-reviewer** — Each reviewer has different blind spots and priorities. Address each one's findings separately
- 3. **Resolve conflicts** — If reviewers disagree, pick the stronger argument, note why
- 4. **Iterate until all approve** — Don't merge until every active reviewer is satisfied
- 5. **Max 3 iterations per reviewer** — If a reviewer keeps finding new things, escalate to the user
+ ## CI Feedback Loop — Local Shepherd
 
- ### Adapting for Non-Code Domains
+ **NEVER AUTO-MERGE. Do NOT run `gh pr merge --auto`.** Auto-merge fires before review feedback can be read. The shepherd loop IS the process.
 
- The handoff format and dialogue loop work for ANY domain. Only `review_instructions` and `verification_checklist` change:
+ Mandatory steps (steps 2-4 sketched below):
+ 1. Push to remote
+ 2. `gh pr checks --watch`
+ 3. **Read CI logs whether pass or fail** (`gh run view <RUN_ID> --log`, not just `--log-failed`). Passing CI hides warnings, skipped steps, degraded scores
+ 4. **Cross-model audit the CI logs** — pipe to a tmp file, run `codex exec -c 'model_reasoning_effort="xhigh"' -s danger-full-access` with *"Audit for silent failures, skipped tests, degraded metrics, warnings-that-should-be-errors."* Audit Tier 1 and Tier 2 logs separately
+ 5. CI fails → diagnose, fix, push (max 2 attempts)
+ 6. CI passes → `gh api repos/OWNER/REPO/pulls/PR/comments` for review feedback
+ 7. Implement valid suggestions (bugs, perf, missing error handling, dedup, coverage). Skip opinions/style. Max 3 iterations
+ 8. Explicit `gh pr merge --squash`
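 
Steps 2-4 as one sketch; `$RUN_ID` comes from `gh run list`, and the audit prompt is the one quoted in step 4:

```bash
gh pr checks --watch                           # step 2: wait for CI
gh run view "$RUN_ID" --log > /tmp/ci.log      # step 3: full log, pass or fail
codex exec -c 'model_reasoning_effort="xhigh"' -s danger-full-access \
  "Audit /tmp/ci.log for silent failures, skipped tests, degraded metrics, \
   warnings-that-should-be-errors."            # step 4: second-model audit
```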
 
- | Domain | Instructions Focus | Checklist Example |
- |--------|-------------------|-------------------|
- | **Code (default)** | Security, logic bugs, test coverage | "Verify input validation at file:line" |
- | **Research/Docs** | Factual accuracy, source verification, overclaims | "Verify $736-$804 appears in both docs, no stale $695-$723 remains" |
- | **Persuasion** | Audience psychology, tone, trust | "If you were [audience], what's the moment you'd stop reading?" |
+ **Evidence:** PR #145 auto-merged before review was read; reviewer found a P1 dead-code bug that shipped. v1.24.0 only checked the green checkmark on round 2; passing CI hides degraded E2E scores and silent test exclusions. Use idle CI time (3-5 min) for `/compact` if context is long.
 
- For non-code: add `"audience"` and `"stakes"` fields to handoff.json. For code, these are implied (audience = other developers, stakes = production impact).
+ ## Scope, DRY, Patterns, Legacy
 
- ### Custom Subagents (`.claude/agents/`)
+ - **Scope guard** — only task-related changes. Notice something else → NOTE in summary, don't fix unless asked. AI drift into "helpful" changes breaks unrelated things.
+ - **DRY** — before coding: "what patterns exist to reuse?" After: "did I duplicate anything?"
+ - **New patterns** require human approval: search first, propose if no equivalent, get explicit approval.
+ - **DELETE legacy code** — backwards-compat shims, "just in case" fallbacks → gone. If it breaks, fix properly.
 
- Claude Code supports custom subagents in `.claude/agents/`:
+ ## Debugging Workflow (Systematic)
 
- - **`sdlc-reviewer`** — SDLC compliance review (planning, TDD, self-review checks)
- - **`ci-debug`** — CI failure diagnosis (reads logs, identifies root cause, suggests fix)
- - **`test-writer`** — Quality tests following TESTING.md philosophies
+ Reproduce → Isolate → Root Cause → Fix → Regression Test. This is the systematic debugging methodology; do not skip steps. Regressions: `git bisect`. Env-specific: check env vars/OS/deps/permissions, reproduce locally, log at the failure point. 2 failed attempts → STOP and ASK USER.
 
- **Skills** guide Claude's behavior. **Agents** run autonomously and return results. Use agents for parallel work or fresh context windows.
+ ## Release Planning (Task Ships a Release)
 
- ## Test Review (Harder Than Implementation)
+ List all items from ROADMAP, plan each at 95% confidence, identify dependencies, present all plans together (catches conflicts/scope creep), pre-release CI audit across merged PRs (warnings, degraded scores, skipped suites — green checkmark insufficient), user approves, then implement in priority order.
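 
A sketch of the pre-release audit using the commands the original checklist names (`--limit 20` is an arbitrary window):

```bash
# Scan recent runs for trouble hiding in green checkmarks.
gh run list --limit 20
gh run view "$RUN_ID" --log | grep -iE 'warn|skip|continue-on-error'
```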
 
- During self-review, critique tests HARDER than app code:
- 1. **Testing the right things?** - Not just that tests pass
- 2. **Tests prove correctness?** - Or just verify current behavior?
- 3. **Follow our philosophies (TESTING.md)?**
-    - Testing Diamond (integration-heavy)?
-    - Minimal mocking (see table below)?
-    - Real fixtures from captured data?
+ ## Deployment Tasks
 
- **Tests are the foundation.** Bad tests = false confidence = production bugs.
+ Read `ARCHITECTURE.md` Environments table + Deployment Checklist. **Production requires HIGH (90%+); ANY doubt → ASK USER.** **Post-deploy verification:** health check, log scan, smoke tests, monitor 15 min (prod only). Issues → rollback first, then new SDLC loop.
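 
A sketch of the first two checks, assuming a health endpoint and a smoke-test script exist; both names are placeholders for whatever `ARCHITECTURE.md` defines:

```bash
curl -fsS https://example.com/healthz   # health check: a non-2xx response fails the script
./scripts/smoke-tests.sh production     # smoke tests, if the project has them
# then watch error rates for 15 min (production only); on issues, roll back first
```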
 
- ### Testing Diamond — Know Your Layers
+ ## Test Review (Harder Than Implementation)
 
- | Layer | What It Tests | % of Suite | Key Trait |
- |-------|--------------|------------|-----------|
- | **E2E** | Full user flow through UI/browser (Playwright, Cypress) | ~5% | Slow, brittle, but proves the real thing works |
- | **Integration** | Real systems via API without UI — real DB, real cache, real services | ~90% | **Best bang for buck.** Fast, stable, high confidence |
- | **Unit** | Pure logic only — no DB, no API, no filesystem | ~5% | Fast but limited scope |
+ Critique tests harder than app code: testing the right things? Tests prove correctness or just verify current behavior? Follow TESTING.md (Testing Diamond, minimal mocking, real-captured fixtures).
 
- **The critical boundary:** E2E tests go through the user's actual UI/browser. Integration tests hit real systems via API but without UI. If your test doesn't open a browser or render a UI, it's not E2E — it's integration. This distinction matters because mislabeling integration tests as E2E leads to overinvestment in slow browser tests when fast API-level tests would suffice.
+ **Testing Diamond:** E2E ~5% (slow, proves the real thing); Integration ~90% (best bang for buck — real DB/cache/services via API, no UI); Unit ~5% (pure logic only). If no UI/browser, it's integration, not E2E.
 
- ### Minimal Mocking Philosophy
+ **Mocking:**
 
  | What | Mock? | Why |
  |------|-------|-----|
- | Database | NEVER | Use test DB or in-memory |
- | Cache | NEVER | Use isolated test instance |
+ | Database | NEVER | Test DB or in-memory |
+ | Cache | NEVER | Isolated test instance |
  | External APIs | YES | Real calls = flaky + expensive |
  | Time/Date | YES | Determinism |
 
- **Mocks MUST come from REAL captured data** — capture real API responses, save to fixtures directory, import in tests. Never guess mock shapes.
-
- ### Unit Tests = Pure Logic ONLY
-
- A function qualifies for unit testing ONLY if:
- - No database calls
- - No external API calls
- - No file system access
- - No cache calls
- - Input -> Output transformation only
-
- Everything else needs integration tests.
-
- ### TDD Tests Must PROVE
-
- | Phase | What It Proves |
- |-------|----------------|
- | RED | Test FAILS -> Bug exists or feature missing |
- | GREEN | Test PASSES -> Fix works or feature implemented |
- | Forever | Regression protection |
-
- ## Flaky Test Recovery
-
- **Flaky tests are bugs. Period.** See: [How do you Address and Prevent Flaky Tests?](https://softwareautomation.notion.site/How-do-you-Address-and-Prevent-Flaky-Tests-23c539e19b3c46eeb655642b95237dc0)
-
- When a test fails intermittently:
- 1. **Don't dismiss it** — "flaky" means "bug we haven't found yet"
- 2. **Identify the layer** — test code? app code? environment?
- 3. **Stress-test** — run the suspect test N times to reproduce reliably
- 4. **Fix root cause** — don't just retry-and-pray
- 5. **If CI infrastructure** — make cosmetic steps non-blocking, keep quality gates strict
-
- ## Scope Guard (Stay in Your Lane)
-
- **Only make changes directly related to the task.**
-
- If you notice something else that should be fixed:
- - NOTE it in your summary ("I noticed X could be improved")
- - DON'T fix it unless asked
-
- **Why this matters:** AI agents can drift into "helpful" changes that weren't requested. This creates unexpected diffs, breaks unrelated things, and makes code review harder.
-
- ## Debugging Workflow (Systematic Investigation)
-
- When something breaks and the cause isn't obvious, follow this systematic debugging workflow:
-
- ```
- Reproduce → Isolate → Root Cause → Fix → Regression Test
- ```
-
- 1. **Reproduce** — Can you make it fail consistently? If intermittent, stress-test (run N times). If you can't reproduce it, you can't fix it
- 2. **Isolate** — Narrow the scope. Which file? Which function? Which input? Use binary search: comment out half the code, does it still fail?
- 3. **Root cause** — Don't fix symptoms. Ask "why?" until you hit the actual cause. "It crashes on line 42" is a symptom. "Null pointer because the API returns undefined when rate-limited" is a root cause
- 4. **Fix** — Fix the root cause, not the symptom. Write the fix
- 5. **Regression test** — Write a test that fails without your fix and passes with it (TDD GREEN)
+ Mocks MUST come from real captured data — never guess shapes. Unit tests qualify ONLY for pure I→O (no DB, API, FS, cache).
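 
A sketch of capturing a fixture from the real API once, then committing it; the endpoint and paths are placeholders:

```bash
# Capture once from the real API, commit the fixture, import it in tests.
curl -fsS https://api.example.com/users/42 > tests/fixtures/user-42.json
```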
 
- **For regressions** (it worked before, now it doesn't):
- - Use `git bisect` to find the exact commit that broke it
- - `git bisect start`, `git bisect bad` (current), `git bisect good <known-good-commit>`
- - Bisect narrows to the breaking commit in O(log n) steps
+ **TDD proves:** RED (fails → bug exists or feature missing), GREEN (passes — fix works), Forever (regression protection).
 
- **Environment-specific bugs** (works locally, fails in CI/staging/prod):
- - Check environment differences: env vars, OS version, dependency versions, file permissions
- - Reproduce the environment locally if possible (Docker, env vars)
- - Add logging at the failure point — don't guess, observe
+ ## Prove It Gate (New Additions Only)
 
- **When to stop and ask:**
- - After 2 failed fix attempts → STOP and ASK USER
- - If the bug is in code you don't understand → read first, then fix
- - If reproducing requires access you don't have → ASK USER
+ Adding a new skill/hook/workflow? Default answer is NO. Prove it: (1) **Absorption check** — can this be a section in an existing skill? (2) Research existing equivalents (native CC, third-party, existing skill). (3) If yes — why is yours better, with evidence. (4) If no — real gap or theoretical? (5) **Quality tests** must prove OUTPUT QUALITY (existence tests prove nothing). (6) Less is more — every addition is maintenance burden.
 
- ## CI Feedback Loop — Local Shepherd (After Commit)
-
- **This is the "local shepherd" — the CI fix mechanism.** It runs in your active session with full context.
-
- **The SDLC doesn't end at local tests.** CI must pass too.
-
- ```
- Local tests pass -> Commit -> Push -> Watch CI
-                                          |
-                          CI passes? -+-> YES -> Present for review
-                                      |
-                                      +-> NO -> Fix -> Push -> Watch CI
-                                                         |
-                                                  (max 2 attempts)
-                                                         |
-                                                   Still failing?
-                                                         |
-                                                  STOP and ASK USER
- ```
-
- ```
- ┌─────────────────────────────────────────────────────────────────────┐
- │ NEVER AUTO-MERGE. NO EXCEPTIONS.                                    │
- │                                                                     │
- │ Do NOT run `gh pr merge --auto`. Ever.                              │
- │ Auto-merge fires before you can read review feedback.               │
- │ The shepherd loop IS the process. Skipping it = shipping bugs.      │
- └─────────────────────────────────────────────────────────────────────┘
- ```
-
- **The full shepherd sequence — every step is mandatory:**
- 1. Push changes to remote
- 2. Watch CI: `gh pr checks --watch`
- 3. Read CI logs — **pass or fail**: `gh run view <RUN_ID> --log` (not just `--log-failed`). Passing CI can still hide warnings, skipped steps, or degraded scores. Don't just check the green checkmark
- 4. **Cross-model review the CI logs themselves** — pipe `gh run view <RUN_ID> --log` to a tmp file and run `codex exec -c 'model_reasoning_effort="xhigh"' -s danger-full-access` with a prompt like *"Audit this CI log for silent failures, skipped tests, degraded metrics, or warnings-that-should-be-errors. Green checkmark is necessary but not sufficient."* A second model catches things the first missed (e.g., a job that passed but degraded an E2E score by 30%, or a test that was silently excluded). Cheap — one extra `codex exec` per PR. **Run separately on Tier 1 quick-check AND Tier 2 5x evaluation logs** — they exercise different code paths, so a clean Tier 1 audit doesn't imply a clean Tier 2. Evidence from PR #206: Tier 1 audit found 3 P1s (Node 24 false-green, "11/10" score leak, E2E incomplete); Tier 2 audit TBD — value is measured by running both and comparing.
- 5. If CI fails → diagnose from logs, fix, push again (max 2 attempts)
- 6. If CI passes → read ALL review comments: `gh api repos/OWNER/REPO/pulls/PR/comments`
- 7. Fix valid suggestions, push, iterate until clean
- 8. Only then: explicit merge with `gh pr merge --squash`
-
- **Why this is non-negotiable:** PR #145 auto-merged a release before review feedback was read. CI reviewer found a P1 dead-code bug that shipped to main. The fix required a follow-up commit. Auto-merge cost more time than the shepherd loop would have taken.
-
- **Why read passing logs:** v1.24.0 release only read logs on failure (round 1), then just checked the green checkmark on round 2. Passing CI can hide warnings, skipped steps, degraded E2E scores, or silent test exclusions. A green checkmark is necessary but not sufficient.
-
- **Context GC (compact during idle):** While waiting for CI (typically 3-5 min), suggest `/compact` if the conversation is long. Think of it like a time-based garbage collector — idle time + high memory pressure = good time to collect. Don't suggest on short conversations.
-
- **CI failures follow same rules as test failures:**
- - Your code broke it? Fix your code
- - CI config issue? Fix the config
- - Flaky? Investigate - flakiness is a bug
- - Stuck? ASK USER
-
- ## CI Review Feedback Loop — Local Shepherd (After CI Passes)
-
- **CI passing isn't the end.** If CI includes a code reviewer, read and address its suggestions.
-
- ```
- CI passes -> Read review suggestions
-                  |
-     Valid improvements? -+-> YES -> Implement -> Run tests -> Push
-                          |              |
-                          |        Review again (iterate)
-                          |
-                          +-> NO (just opinions/style) -> Skip, note why
-                          |
-                          +-> None -> Done, present to user
- ```
-
- **How to evaluate suggestions:**
- 1. Read all CI review comments: `gh api repos/OWNER/REPO/pulls/PR/comments`
- 2. For each suggestion, ask: **"Is this a real improvement or just an opinion?"**
-    - **Real improvement:** Fixes a bug, improves performance, adds missing error handling, reduces duplication, improves test coverage → Implement it
-    - **Opinion/style:** Different but equivalent formatting, subjective naming preference, "you could also..." without clear benefit → Skip it
- 3. Implement the valid ones, run tests locally, push
- 4. CI re-reviews — repeat until no substantive suggestions remain
- 5. Max 3 iterations — if reviewer keeps finding new things, ASK USER
-
- **The goal:** User is only brought in at the very end, when both CI and reviewer are satisfied. The code should be polished before human review.
-
- **Customizable behavior** (set during wizard setup):
- - **Auto-implement** (default): Implement valid suggestions autonomously, skip opinions
- - **Ask first**: Present suggestions to user, let them decide which to implement
- - **Skip review feedback**: Ignore CI review suggestions, only fix CI failures
-
- ## Context Management
-
- - `/compact` between planning and implementation (plan preserved in summary)
- - `/clear` between unrelated tasks (stale context wastes tokens and misleads)
- - `/clear` after 2+ failed corrections (context polluted — start fresh with better prompt)
- - Auto-compact fires at ~95% capacity — no manual management needed
- - After committing a PR, `/clear` before starting the next feature
- - **Autocompact tuning:** Set `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE` to trigger compaction earlier (75% for 200K, 30% for 1M). On 1M models, the default fires at ~76K — pick ONE of: `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=30` **OR** `CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000` (do NOT set both — they compound to 30% × 400K = 120K trigger ≈ 12% of 1M, which fires almost immediately, #207). See wizard doc "Autocompact Tuning" for full details
-
- **`--bare` mode (v2.1.81+):** `claude -p "prompt" --bare` skips ALL hooks, skills, LSP, and plugins. This is a complete wizard bypass — no SDLC enforcement, no TDD checks, no planning hooks. Use only for scripted headless calls (CI pipelines, automation) where you explicitly don't want wizard enforcement. Never use `--bare` for normal development work.
-
- ## DRY Principle
634
-
635
- **Before coding:** "What patterns exist I can reuse?"
636
- **After coding:** "Did I accidentally duplicate anything?"
637
-
638
- ## Design System Check (If UI Change)
639
-
640
- **When to check:** CSS/styling changes, new UI components, color/font usage.
641
- **When to skip:** Backend-only changes, config/build changes, non-visual code.
642
-
643
- **Planning phase - "Design system check":**
644
- 1. Read DESIGN_SYSTEM.md if it exists
645
- 2. Check if change involves colors, fonts, spacing, or components
646
- 3. Verify intended styles match design system tokens
647
- 4. Flag if introducing new patterns not in design system
648
-
649
- **Review phase - "Visual consistency check":**
650
- 1. Are colors from the design system palette?
651
- 2. Are fonts/sizes from typography scale?
652
- 3. Are spacing values from the spacing scale?
653
- 4. Do new components follow existing patterns?
654
-
655
- **If no DESIGN_SYSTEM.md exists:** Skip these checks (project has no documented design system).
656
-
657
- ## Release Planning (If Task Involves a Release)
658
-
659
- **When to check:** Task mentions "release", "publish", "version bump", "npm publish", or multiple items being shipped together.
660
- **When to skip:** Single feature implementation, bug fix, or anything that isn't a release.
661
-
662
- Before implementing any release items:
663
-
664
- 1. **List all items** — Read ROADMAP.md (or equivalent), identify every item planned for this release
665
- 2. **Plan each at 95% confidence** — For each item: what files change, what tests prove it works, what's the blast radius. If confidence < 95% on any item, flag it
666
- 3. **Identify blocks** — Which items depend on others? What must go first?
667
- 4. **Present all plans together** — User reviews the complete batch, not one at a time. This catches conflicts, sequencing issues, and scope creep before any code is written
668
- 5. **Pre-release CI audit** — Before cutting the release, review CI runs across ALL PRs merged since last release. Look for: warnings in passing runs, degraded E2E scores, skipped test suites, silent failures masked by `continue-on-error`. Use `gh run list` + `gh run view <ID> --log` to audit. A green checkmark is necessary but not sufficient
669
- 6. **User approves, then implement** — Full SDLC per item (TDD RED → GREEN → self-review), in the prioritized order
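-
- A starting point for the step-5 audit, using the `gh` commands named above (the grep patterns are illustrative, not exhaustive):
-
- ```
- # Scan recent run logs for quiet failure modes hiding in green runs
- for id in $(gh run list --limit 20 --json databaseId --jq '.[].databaseId'); do
-   echo "=== run $id ==="
-   gh run view "$id" --log | grep -inE 'warning|skipped|continue-on-error' || true
- done
- ```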
670
-
671
- **Why batch planning works:** Ad-hoc one-at-a-time implementation leads to unvalidated additions and scope creep. Batch planning catches problems early — if you can't plan it at 95%, you're not ready to ship it.
672
-
673
- **Why pre-release CI audit:** v1.24.0 shipped without auditing CI logs across merged PRs #150-#152. Passing CI doesn't mean nothing fishy got through — warnings, degraded scores, and skipped steps can hide in green runs.
674
-
675
- ## Deployment Tasks (If Task Involves Deploy)
676
-
677
- **When to check:** Task mentions "deploy", "release", "push to prod", "staging", etc.
678
- **When to skip:** Code changes only, no deployment involved.
679
-
680
- **Before any deployment:**
681
- 1. Read ARCHITECTURE.md → Find the Environments table and Deployment Checklist
682
- 2. Verify which environment is the target (dev/staging/prod)
683
- 3. Follow the deployment checklist in ARCHITECTURE.md
684
-
685
- **Confidence levels for deployment:**
686
-
687
- | Target | Required Confidence | If Lower |
688
- |--------|---------------------|----------|
689
- | Dev/Preview | MEDIUM or higher | Proceed with caution |
690
- | Staging | MEDIUM or higher | Proceed, note uncertainties |
691
- | **Production** | **HIGH only** | **ASK USER before deploying** |
692
-
693
- **Production deployment requires:**
694
- - All tests passing
695
- - Production build succeeding
696
- - Changes tested in staging/preview first
697
- - HIGH confidence (90%+)
698
- - If ANY doubt → ASK USER first
699
-
700
- **If ARCHITECTURE.md has no Environments section:** Ask user "How do you deploy to [target]?" before proceeding.
701
-
702
- **After deploying — Post-Deploy Verification:**
703
- 1. Read ARCHITECTURE.md → Find the Post-Deploy Verification table
704
- 2. Run health check for the target environment
705
- 3. Check logs for new errors
706
- 4. Run smoke tests if configured
707
- 5. Monitor error rates for 15 min (production only)
708
- 6. If issues found → rollback first, then start new SDLC loop to fix
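-
- A hedged sketch of steps 2-3 (the health endpoint and log command are placeholders; take the real values from ARCHITECTURE.md's tables):
-
- ```
- # Health check: endpoint comes from the Post-Deploy Verification table
- curl -fsS "https://staging.example.com/healthz" || { echo "health check FAILED"; exit 1; }
- # Log scan: replace with your platform's log command, e.g.
- # kubectl logs deploy/app --since=15m | grep -i error
- ```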
709
-
710
- **If ARCHITECTURE.md has no Post-Deploy section:** Ask user "How do you verify [target] is working after deploy?"
711
-
712
- ## DELETE Legacy Code
713
-
714
- - Legacy code? DELETE IT
715
- - Backwards compatibility? NO - DELETE IT
716
- - "Just in case" fallbacks? DELETE IT
717
-
718
- **THE RULE:** Delete old code first. If it breaks, fix it properly.
719
-
720
- ## Documentation Sync (REQUIRED — During Planning)
721
-
722
- Feature docs MUST be current before commit. Docs are code — stale docs mislead future sessions, waste tokens, and cause wrong implementations.
723
-
724
- **Standard pattern:** `*_DOCS.md` — living documents that grow with the feature (e.g., `AUTH_DOCS.md`, `PAYMENTS_DOCS.md`, `SEARCH_DOCS.md`). Same philosophy as `TESTING.md` and `ARCHITECTURE.md` — one source of truth per topic, kept current.
725
-
726
- ```
727
- ┌─────────────────────────────────────────────────────────────────────┐
728
- │ DOCS MUST BE CURRENT BEFORE COMMIT. │
729
- │ │
730
- │ Stale docs = wrong implementations = wasted sessions. │
731
- │ If you changed the feature, update its doc. No exceptions. │
732
- └─────────────────────────────────────────────────────────────────────┘
733
- ```
734
-
735
- 1. **During planning**, read feature docs for the area being changed (`*_DOCS.md`, `docs/features/`, `docs/decisions/`)
736
- 2. If your code change contradicts what the doc says → MUST update the doc
737
- 3. If your code change extends behavior the doc describes → MUST add to the doc
738
- 4. If no `*_DOCS.md` exists and the feature touches 3+ files → create one (skeleton sketch after this list). Keep it simple: what the feature does, key decisions, gotchas. Same structure as TESTING.md (topic-focused, not exhaustive)
739
- 5. If the project has a `ROADMAP.md` → update it (mark items done, add new items). ROADMAP feeds CHANGELOG — keeping it current means releases write themselves
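-
- A minimal skeleton for step 4 (`AUTH_DOCS.md` is only an example name; the sections follow the what/decisions/gotchas rule above):
-
- ```
- cat > AUTH_DOCS.md <<'EOF'
- # Auth: Feature Doc
- ## What it does
- ## Key decisions
- ## Gotchas
- EOF
- ```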
740
-
741
- **Doc staleness signals:** Low confidence in an area often means the docs are stale, missing, or misleading. If you struggle during planning, check whether the docs match the actual code.
742
-
743
- **CLAUDE.md health:** `/claude-md-improver` audits CLAUDE.md structure and completeness. Run it periodically. It does NOT cover feature docs — the SDLC workflow handles those.
266
+ If you can't write a quality test for it, you can't prove it works.
744
267
 
745
268
  ## After Session (Capture Learnings)
746
269
 
747
- If this session revealed insights, update the right place:
748
- - **Testing patterns, gotchas** → `TESTING.md`
749
- - **Feature-specific quirks** → Feature docs (`*_DOCS.md`, e.g., `AUTH_DOCS.md`)
750
- - **Architecture decisions** → `docs/decisions/` (ADR format) or `ARCHITECTURE.md`
751
- - **General project context** → `CLAUDE.md` (or `/revise-claude-md`)
752
- - **Plan files** → If this session's work came from a plan file, delete it or mark it complete. Stale plans mislead future sessions into thinking work is still pending
270
+ | Insight | Destination |
271
+ |---------|-------------|
272
+ | Testing patterns/gotchas | `TESTING.md` |
273
+ | Feature-specific quirks | `*_DOCS.md` (e.g., `AUTH_DOCS.md`) |
274
+ | Architecture decisions | `docs/decisions/` (ADR) or `ARCHITECTURE.md` |
275
+ | General project context | `CLAUDE.md` (or `/revise-claude-md`) |
276
+ | Plan files (work done) | Delete or mark complete (stale plans mislead) |
753
277
 
754
278
  ### Memory Audit Protocol
755
279
 
756
- Per-user memory at `~/.claude/projects/<proj>/memory/` accumulates private learnings. Some belong there (user preferences, external references). Others are portable technical lessons (tool quirks, platform gotchas, bash/GHA/macOS footguns) that would save the next contributor hours. Run this audit to promote the portable ones.
280
+ Per-user memory at `~/.claude/projects/<proj>/memory/` accumulates private learnings. Some are portable lessons (tool quirks, platform gotchas) worth promoting to wizard docs.
757
281
 
758
- **When to run:**
759
- - End-of-release (before cutting a tag)
760
- - After a debugging-heavy session with multiple memory additions
761
- - On explicit "audit my memory" request
282
+ **When to run:** end-of-release, after debugging-heavy sessions, or on explicit "audit my memory" request.
762
283
 
763
- **Classify each memory file in `~/.claude/projects/<proj>/memory/`:**
284
+ **Rule-based denylist** (deterministic, no LLM):
285
+ - `type: user` → keep (user identity, preferences — never promote)
286
+ - `type: reference` → keep (external pointers, private by default)
287
+ - `type: project` → manual review (mixed state + portable lesson)
288
+ - `type: feedback` → manual review (mixed personal preference + portable rule)
764
289
 
765
- 1. **Rule-based denylist (deterministic, no LLM):**
766
- - `type: user` → `keep` (user identity, preferences — never promote)
767
- - `type: reference` → `keep` (external pointers to Discord/URL/etc — private by default)
768
- - `type: project` → `manual-review` (often mixed state + portable lesson — human decides)
769
- - `type: feedback` → `manual-review` (often mixed personal preference + portable rule — human decides)
770
- - Parser must normalize YAML variants (`type: "user"`, `type: user # comment`, surrounding whitespace) — see `tests/test-memory-audit-protocol.sh::apply_denylist_rule` for the reference implementation (approximated in the sketch after this list)
771
- 2. **Remaining entries** (no type, or type outside the 4 above) fall through to human-gated review. An LLM-assisted classification runner is Prove-It-Gated: build it only after running this protocol 4+ times with manual classification. Until then, human review at promotion time IS the quality gate
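-
- A bash approximation of the step-1 normalization (the test file named above is the actual reference; quoting and comment handling are simplified here):
-
- ```
- apply_denylist_rule() {
-   # Normalize the frontmatter type: strip quotes, trailing comments, whitespace
-   local t
-   t=$(grep -m1 '^type:' "$1" | sed -E 's/^type:[[:space:]]*//; s/#.*$//; s/"//g; s/[[:space:]]*$//')
-   case "$t" in
-     user|reference)   echo "keep" ;;
-     project|feedback) echo "manual-review" ;;
-     *)                echo "manual-review" ;;  # no/unknown type: human-gated review
-   esac
- }
- ```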
290
+ **Destinations for promote entries** (no new files): tool/platform gotchas → `SDLC.md` `## Lessons Learned`. Testing → `TESTING.md`. Tool quirks tied to a skill → that `SKILL.md`. Process rules → `CLAUDE.md`.
772
291
 
773
- **Destinations for `promote` entries (no new files — use existing wizard destinations):**
292
+ **Tracking:** `promoted_to: <path>` in the memory file's YAML frontmatter; later audits skip already-promoted entries.
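+
+ A sketch of the skip logic (directory layout from above; `$PROJ` stands in for the project slug):
+
+ ```
+ for f in "$HOME/.claude/projects/$PROJ/memory"/*.md; do
+   grep -q '^promoted_to:' "$f" && continue  # already promoted: skip
+   # ...classify, human-review, then record `promoted_to: <path>` in frontmatter
+ done
+ ```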
774
293
 
775
- | Content | Target |
776
- |---------|--------|
777
- | Language/tool/platform gotchas (bash, gh CLI, GHA, macOS) | `SDLC.md` → `## Lessons Learned` section |
778
- | Testing gotchas (flaky patterns, mock-vs-integration lessons) | `TESTING.md` |
779
- | Tool-specific quirks tied to a skill | That skill's `SKILL.md` |
780
- | Process rules that should govern the project | `CLAUDE.md` |
294
+ **Human gate is MANDATORY.** Protocol produces diffs; user approves chunk-by-chunk. Never auto-apply. Prove-It: build a `/memory-audit` slash command only after running 4+ times manually. (Full protocol: wizard doc.)
781
295
 
782
- **Tracking:** When you promote an entry, add `promoted_to: <path>` to that memory file's YAML frontmatter. Subsequent audits skip already-promoted entries.
783
-
784
- **Human gate is MANDATORY.** Protocol produces diffs; user approves chunk-by-chunk before apply. Never auto-apply — private memory touching public docs needs human judgement.
785
-
786
- **Prove It Gate:** If you find yourself running this protocol 4+ times and manually doing the same classification work, that's evidence to build a `/memory-audit` slash command AND wire the LLM-gated quality tests (8/10 classification, 6/6 destination). Until then, protocol + human review is enough — and no stub tests that skip (they mislead reviewers into thinking a gate exists when it doesn't).
787
-
788
- ## Post-Mortem: When Process Fails, Feed It Back
789
-
790
- **Every process failure becomes an enforcement rule.** When you skip a step and it causes a problem, don't just fix the symptom — add a gate so it can't happen again.
296
+ ## Post-Mortem: Process Failures Become Rules
791
297
 
792
298
  ```
793
299
  Incident → Root Cause → New Rule → Test That Proves the Rule → Ship
794
300
  ```
795
301
 
796
- **How to post-mortem a process failure:**
797
- 1. **What happened?** — Describe the incident (what went wrong, what was the impact)
798
- 2. **Root cause** — Not "I forgot" — what structurally allowed the skip? Was it guidance (easy to ignore) instead of a gate (impossible to skip)?
799
- 3. **New rule** — Turn the failure into an enforcement rule in the SDLC skill
800
- 4. **Test** — Write a test that proves the rule exists (TDD — the rule is code too)
801
- 5. **Evidence** — Reference the incident so future readers understand WHY the rule exists
302
+ Don't fix only the symptom. Add a gate so it can't happen again. Example: PR #145 auto-merged before CI review → "NEVER AUTO-MERGE" block + `test_never_auto_merge_gate`.
802
303
 
803
- **Example (real incident):** PR #145 auto-merged before CI review was read. Root cause: auto-merge was enabled by default, no enforcement gate existed. New rule: "NEVER AUTO-MERGE" block added to CI Shepherd section with the same weight as "ALL TESTS MUST PASS." Test: `test_never_auto_merge_gate` verifies the block exists.
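-
- The proving test can be a one-line guard; a sketch (the doc path is assumed, the phrase comes from the rule):
-
- ```
- test_never_auto_merge_gate() {
-   # Fails the suite if the enforcement block ever disappears from the doc
-   grep -q 'NEVER AUTO-MERGE' SDLC.md \
-     || { echo "FAIL: NEVER AUTO-MERGE block missing"; return 1; }
- }
- ```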
304
+ ## Context Management & Subagents
804
305
 
805
- **Industry pattern:** "Every mistake becomes a rule" — the best SDLC systems are built from accumulated incident learnings, not theoretical best practices.
306
+ - `/compact` between planning and implementation (plan preserved in summary)
307
+ - `/clear` between unrelated tasks; after 2+ failed corrections (context polluted)
308
+ - Auto-compact fires at ~95% capacity
309
+ - After committing a PR, `/clear` before next feature
310
+ - `--bare` mode (v2.1.81+) skips ALL hooks/skills/LSP/plugins. Scripted headless only — never normal development.
311
+ - Custom subagents (`.claude/agents/`) run autonomously and return results. Skills guide behavior; agents do work. Use for parallel tasks or fresh context. Examples: `sdlc-reviewer`, `ci-debug`, `test-writer`.
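+
+ A minimal agent file sketch (frontmatter fields per Claude Code's agent format; the name and body prompt are illustrative):
+
+ ```
+ mkdir -p .claude/agents
+ cat > .claude/agents/sdlc-reviewer.md <<'EOF'
+ ---
+ name: sdlc-reviewer
+ description: Reviews a diff against the SDLC checklist and reports violations.
+ ---
+ You are a code reviewer. Check the diff for TDD, DRY, and scope violations.
+ EOF
+ ```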
806
312
 
807
- ---
313
+ ## Design System Check (UI Changes Only)
808
314
 
809
- **Full reference:** SDLC.md
315
+ Read `DESIGN_SYSTEM.md` if it exists. Verify colors/fonts/spacing match tokens; flag new patterns not in the design system. Skip for backend/config/non-visual code.
316
+
317
+ ---
318
+ **Full reference:** `CLAUDE_CODE_SDLC_WIZARD.md` (cross-model review, deployment, debugging, post-mortem, memory audit, design system). `TESTING.md` (testing diamond + mocking). `ARCHITECTURE.md` (environments + post-deploy).