agentic-sdlc-wizard 1.22.0 → 1.23.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -4,6 +4,31 @@ All notable changes to the SDLC Wizard.
 
  > **Note:** This changelog is for humans to read. Don't manually apply these changes - just run the wizard ("Check for SDLC wizard updates") and it handles everything automatically.
 
+ ## [1.23.0] - 2026-04-01
+
+ ### Added
+ - Update notification hook — `instructions-loaded-check.sh` checks npm for newer wizard version each session. Non-blocking, graceful on network failure. One-liner: "SDLC Wizard update available: X → Y (run /update-wizard)" (#64)
+ - Cross-model review standardization — mission-first handoff (mission/success/failure fields), preflight self-review doc, verification checklist, adversarial framing, domain template guidance, convergence reduced to 2-3 rounds. Audited 4 repos + 14 external repos + 7 papers (#72, #56)
+ - Release Planning Gate — section in SDLC skill. Before implementing release items: list all, plan each at 95% confidence, identify blocks, present plans as batch. Prove It Gate strengthened with absorption check (#73)
+ - 6 quality tests for update notification (fake npm in PATH, version comparison, failure modes)
+ - 12 quality tests for cross-model review, context position, release planning
+ - Testing Diamond boundary table — explicit E2E (UI/browser ~5%) vs Integration (API/no UI ~90%) vs Unit (pure logic ~5%) in SKILL.md and wizard doc (#65)
+ - Skill frontmatter docs — expanded to full table covering `paths:`, `context: fork`, `effort:`, `disable-model-invocation:`, `argument-hint:` (#69)
+ - `--bare` mode documentation in SKILL.md — complete wizard bypass warning for scripted headless calls (#70)
+ - 6 quality tests for #65/#69/#70
+ - "NEVER AUTO-MERGE" enforcement gate in CI Shepherd section — same weight as "ALL TESTS MUST PASS." Full shepherd sequence documented as mandatory (post-mortem from PR #145 incident)
+ - Post-Mortem pattern — when process fails, feed it back: Incident → Root Cause → New Rule → Test → Ship. "Every mistake becomes a rule"
+ - 4 quality tests for enforcement gate + post-mortem
+
+ ### Fixed
+ - Dead-code pipe in `test_prove_it_absorption()` — `grep -qi | grep -qi` was a no-op (P1 from PR #145 CI review)
+
+ ### Changed
+ - Moved "ALL TESTS MUST PASS" from 61% depth to 11% depth in SDLC skill (Lost in the Middle fix) (#57)
+ - Prove It Gate now requires absorption check — "can this be a section in an existing skill?" — before proposing new skills/components
+ - Wizard "E2E vs Manual Testing" section replaced with "E2E vs Integration — The Critical Boundary" (#65)
+ - Wizard "Skill Effort Frontmatter" section expanded to "Skill Frontmatter Fields" with full field reference (#69)
+
  ## [1.22.0] - 2026-04-01
 
  ### Added
@@ -307,9 +307,24 @@ New built-in commands available to use alongside the wizard:
 
  **Tip**: `/simplify` pairs well with the self-review phase. Run it after implementation as an additional quality check.
 
- ### Skill Effort Frontmatter (v2.1.80+)
+ ### Skill Frontmatter Fields (v2.1.80+)
 
- Skills can now set an `effort` level in frontmatter. The wizard's `/sdlc` skill uses `effort: high` to ensure Claude gives full attention to SDLC tasks.
+ Skills support these frontmatter fields:
+
+ | Field | Purpose | Example |
+ |-------|---------|---------|
+ | `name` | Skill name (matches `/command`) | `name: sdlc` |
+ | `description` | Trigger description for auto-invocation | `description: Full SDLC workflow...` |
+ | `effort` | Set reasoning effort level | `effort: high` |
+ | `paths` | Restrict skill to specific file patterns | `paths: ["src/**/*.ts", "tests/**"]` |
+ | `context` | Context mode (`fork` = isolated subagent) | `context: fork` |
+ | `argument-hint` | Hint for `$ARGUMENTS` placeholder | `argument-hint: [task description]` |
+ | `disable-model-invocation` | Prevent skill from being auto-invoked by model | `disable-model-invocation: true` |
+
+ **Key fields explained:**
+ - **`effort: high`** — The wizard's `/sdlc` skill uses this to ensure Claude gives full attention. `max` is available but costs significantly more tokens.
+ - **`paths:`** — Limits when a skill activates based on files being worked on. Useful for language-specific or directory-specific skills.
+ - **`context: fork`** — Runs the skill in an isolated subagent context. The subagent gets its own context window, so it won't pollute the main conversation. Useful for review skills or analysis that should run independently.
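Put together, a skill's frontmatter might look like this (a sketch only: the skill name, description, and glob values are illustrative, not from the wizard):

```markdown
---
name: ts-review
description: Review TypeScript changes for type-safety regressions
effort: high
context: fork
paths: ["src/**/*.ts", "tests/**"]
argument-hint: [files or PR to review]
disable-model-invocation: true
---
```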
 
  ### InstructionsLoaded Hook (v2.1.69+)
 
@@ -475,10 +490,11 @@ Here's the "Testing Diamond" approach (recommended for AI agents):
  - **Confidence**: If integration tests pass, production usually works
  - **AI-friendly**: Give Claude concrete pass/fail feedback on real behavior
 
- **E2E vs Manual Testing:**
- - **E2E (automated)**: Playwright, Cypress - runs without human
- - **Manual testing**: Human sign-off at the very end
- - **Goal**: Zero manual testing. Only for final verification when 100% confident.
+ **E2E vs Integration — The Critical Boundary:**
+ - **E2E**: Tests that go through the user's actual UI/browser (Playwright, Cypress). ~5% of suite.
+ - **Integration**: Tests that hit real systems via API without UI — real DB, real cache, real services. ~90% of suite.
+ - **Unit**: Pure logic only — no DB, no API, no filesystem. ~5% of suite.
+ - **The rule**: If your test doesn't open a browser or render a UI, it's not E2E — it's integration. Mislabeling leads to overinvestment in slow browser tests.
 
  **But your team decides:**
 
@@ -2401,7 +2417,7 @@ If deployment fails or post-deploy verification catches issues:
 
  **SDLC.md:**
  ```markdown
- <!-- SDLC Wizard Version: 1.22.0 -->
+ <!-- SDLC Wizard Version: 1.23.0 -->
  <!-- Setup Date: [DATE] -->
  <!-- Completed Steps: step-0.1, step-0.2, step-0.4, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
  <!-- Git Workflow: [PRs or Solo] -->
@@ -2797,6 +2813,7 @@ Docs drift when code changes but docs don't. The SDLC skill's planning phase det
  - During planning, Claude reads feature docs for the area being changed
  - If the code change contradicts what the doc says, Claude updates the doc
  - The "After Session" step routes learnings to the right doc
+ - Plan files get closed out — if the session's work came from a plan, it gets deleted or marked complete so future sessions aren't misled
  - Stale docs cause low confidence — if Claude struggles, the doc may need updating
 
  **CLAUDE.md health:** Run `/claude-md-improver` periodically (quarterly or after major changes). It audits CLAUDE.md specifically — structure, clarity, completeness (6 criteria, 100-point rubric). It does NOT cover feature docs, TESTING.md, or ADRs — the SDLC workflow handles those.
@@ -3014,7 +3031,7 @@ Use an independent AI model from a different company as a code reviewer. The aut
  **The Protocol:**
 
  1. Create a `.reviews/` directory in your project
- 2. After Claude completes its SDLC loop (self-review passes), write a handoff file:
+ 2. After Claude completes its SDLC loop (self-review passes), write a preflight doc (what you already checked) then a mission-first handoff file:
 
  ```jsonc
  // .reviews/handoff.json
@@ -3022,12 +3039,22 @@ Use an independent AI model from a different company as a code reviewer. The aut
  "review_id": "feature-xyz-001",
  "status": "PENDING_REVIEW",
  "round": 1,
+ "mission": "What changed and why — context for the reviewer",
+ "success": "What 'correctly reviewed' looks like",
+ "failure": "What gets missed if the reviewer is superficial",
  "files_changed": ["src/auth.ts", "tests/auth.test.ts"],
- "review_instructions": "Review for security, edge cases, and correctness",
+ "verification_checklist": [
+ "(a) Verify input validation at auth.ts:45",
+ "(b) Verify test covers null-token edge case"
+ ],
+ "review_instructions": "Focus on security and edge cases. Assume bugs may be present until proven otherwise.",
+ "preflight_path": ".reviews/preflight-feature-xyz-001.md",
  "artifact_path": ".reviews/feature-xyz-001/"
  }
  ```
 
+ The `mission/success/failure` fields give the reviewer context. Without them, you get generic "looks good" feedback. With them, reviewers dig into source files and verify specific claims. The `verification_checklist` tells the reviewer exactly what to verify — not "review this" but specific items with file:line references.
+
  3. Run the independent reviewer (Round 1 — full review). These commands use your Codex default model — configure it to the latest, most capable model available:
 
  ```bash
@@ -3302,7 +3329,7 @@ Walk through updates? (y/n)
  Store wizard state in `SDLC.md` as metadata comments (invisible to readers, parseable by Claude):
 
  ```markdown
- <!-- SDLC Wizard Version: 1.22.0 -->
+ <!-- SDLC Wizard Version: 1.23.0 -->
  <!-- Setup Date: 2026-01-24 -->
  <!-- Completed Steps: step-0.1, step-0.2, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
  <!-- Git Workflow: PRs -->
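Since the comments follow a fixed `Key: value` shape, any script can recover a field with grep and sed alone, the same pattern the update-check hook uses. A minimal sketch (the inline fixture stands in for a real SDLC.md):

```shell
# Sketch: parse wizard metadata out of SDLC.md (fixture inlined for illustration)
cat > SDLC.md <<'EOF'
<!-- SDLC Wizard Version: 1.23.0 -->
<!-- Setup Date: 2026-01-24 -->
EOF

VERSION=$(grep -o 'SDLC Wizard Version: [0-9.]*' SDLC.md | head -1 | sed 's/SDLC Wizard Version: //')
echo "$VERSION"
```

Because the comments are invisible in rendered markdown, the same file serves humans and tooling without duplication.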
@@ -20,4 +20,16 @@ if [ -n "$MISSING" ]; then
  echo "Invoke Skill tool, skill=\"setup-wizard\" to generate them."
  fi
 
+ # Version update check (non-blocking, best-effort)
+ SDLC_MD="$PROJECT_DIR/SDLC.md"
+ if [ -f "$SDLC_MD" ]; then
+ INSTALLED_VERSION=$(grep -o 'SDLC Wizard Version: [0-9.]*' "$SDLC_MD" | head -1 | sed 's/SDLC Wizard Version: //')
+ if [ -n "$INSTALLED_VERSION" ] && command -v npm > /dev/null 2>&1; then
+ LATEST_VERSION=$(npm view agentic-sdlc-wizard version 2>/dev/null) || true
+ if [ -n "$LATEST_VERSION" ] && [ "$LATEST_VERSION" != "$INSTALLED_VERSION" ]; then
+ echo "SDLC Wizard update available: ${INSTALLED_VERSION} → ${LATEST_VERSION} (run /update-wizard)"
+ fi
+ fi
+ fi
+
  exit 0
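One caveat on the hook's string comparison: `!=` also fires if the local install is somehow ahead of the registry. A version-aware variant is possible with `sort -V` (a sketch; assumes a `sort` that supports `-V`, and the version values are illustrative):

```shell
INSTALLED_VERSION="1.23.0"   # illustrative: local install ahead of the registry
LATEST_VERSION="1.22.0"
# sort -V orders version strings numerically; the newest sorts last
NEWEST=$(printf '%s\n%s\n' "$INSTALLED_VERSION" "$LATEST_VERSION" | sort -V | tail -n 1)
if [ "$NEWEST" = "$LATEST_VERSION" ] && [ "$LATEST_VERSION" != "$INSTALLED_VERSION" ]; then
  echo "SDLC Wizard update available: ${INSTALLED_VERSION} -> ${LATEST_VERSION}"
else
  echo "no update needed"   # versions equal, or local build is newer
fi
```

With this guard the one-liner only appears when the registry version is strictly newer.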
@@ -48,7 +48,8 @@ TodoWrite([
  { content: "Post-deploy verification (if deploy task — see Deployment Tasks)", status: "pending", activeForm: "Verifying deployment" },
  // FINAL
  { content: "Present summary: changes, tests, CI status", status: "pending", activeForm: "Presenting final summary" },
- { content: "Capture learnings (if any — update TESTING.md, CLAUDE.md, or feature docs)", status: "pending", activeForm: "Capturing session learnings" }
+ { content: "Capture learnings (if any — update TESTING.md, CLAUDE.md, or feature docs)", status: "pending", activeForm: "Capturing session learnings" },
+ { content: "Close out plan files: if task came from a plan, mark complete or delete", status: "pending", activeForm: "Closing plan artifacts" }
  ])
  ```
 
@@ -72,6 +73,42 @@ Your work is scored on these criteria. **Critical** criteria are must-pass.
 
  Critical miss on `tdd_red` or `self_review` = process failure regardless of total score.
 
+ ## Test Failure Recovery (SDET Philosophy)
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────────┐
+ │ ALL TESTS MUST PASS. NO EXCEPTIONS.                                 │
+ │                                                                     │
+ │ This is not negotiable. This is not flexible. This is absolute.     │
+ └─────────────────────────────────────────────────────────────────────┘
+ ```
+
+ **Not acceptable:**
+ - "Those were already failing" → Fix them first
+ - "Not related to my changes" → Doesn't matter, fix it
+ - "It's flaky" → Flaky = bug, investigate
+
+ **Treat test code like app code.** Test failures are bugs. Investigate them the way a 15-year SDET would - with thought and care, not by brushing them aside.
+
+ If tests fail:
+ 1. Identify which test(s) failed
+ 2. Diagnose WHY - this is the important part:
+ - Your code broke it? Fix your code (regression)
+ - Test is for deleted code? Delete the test
+ - Test has wrong assertions? Fix the test
+ - Test is "flaky"? Investigate - flakiness is just another word for bug
+ 3. Fix appropriately (fix code, fix test, or delete dead test)
+ 4. Run specific test individually first
+ 5. Then run ALL tests
+ 6. Still failing? ASK USER - don't spin your wheels
+
+ **Flaky tests are bugs, not mysteries:**
+ - Sometimes the bug is in app code (race condition, timing issue)
+ - Sometimes the bug is in test code (shared state, not parallel-safe)
+ - Sometimes the bug is in test environment (cleanup not proper)
+
+ Debug it. Find root cause. Fix it properly. Tests ARE code.
+
  ## New Pattern & Test Design Scrutiny (PLANNING)
 
  **New design patterns require human approval:**
@@ -89,11 +126,12 @@ Critical miss on `tdd_red` or `self_review` = process failure regardless of tota
 
  **Adding a new skill, hook, workflow, or component? PROVE IT FIRST:**
 
- 1. **Research:** Does something equivalent already exist (native CC, third-party plugin, existing skill)?
- 2. **If YES:** Why is yours better? Show evidence (A/B test, quality comparison, gap analysis)
- 3. **If NO:** What gap does this fill? Is the gap real or theoretical?
- 4. **Quality tests:** New additions MUST have tests that prove OUTPUT QUALITY, not just existence
- 5. **Less is more:** Every addition is maintenance burden. Default answer is NO unless proven YES
+ 1. **Absorption check:** Can this be added as a section in an existing skill instead of a new component? Default is YES — new skills/hooks need strong justification. Releasing is SDLC, not a separate skill. Debugging is SDLC, not a separate skill. Keep it lean
+ 2. **Research:** Does something equivalent already exist (native CC, third-party plugin, existing skill)?
+ 3. **If YES:** Why is yours better? Show evidence (A/B test, quality comparison, gap analysis)
+ 4. **If NO:** What gap does this fill? Is the gap real or theoretical?
+ 5. **Quality tests:** New additions MUST have tests that prove OUTPUT QUALITY, not just existence
+ 6. **Less is more:** Every addition is maintenance burden. Default answer is NO unless proven YES
 
  **Existence tests are NOT quality tests:**
  - BAD: "ci-analyzer skill file exists" — proves nothing about quality
@@ -131,9 +169,9 @@ Before presenting approach, STATE your confidence:
  |-------|---------|--------|--------|
  | HIGH (90%+) | Know exactly what to do | Present approach, proceed after approval | `high` (default) |
  | MEDIUM (60-89%) | Solid approach, some uncertainty | Present approach, highlight uncertainties | `high` (default) |
- | LOW (<60%) | Not sure | ASK USER before proceeding | Consider `/effort max` |
- | FAILED 2x | Something's wrong | STOP. ASK USER immediately | Try `/effort max` |
- | CONFUSED | Can't diagnose why something is failing | STOP. Describe what you tried, ask for help | Try `/effort max` |
+ | LOW (<60%) | Not sure | Do more research or try cross-model research (Codex) to get to 95%. If still LOW after research, ASK USER | Consider `/effort max` |
+ | FAILED 2x | Something's wrong | Try cross-model research (Codex) for a fresh perspective. If still stuck, STOP and ASK USER | Try `/effort max` |
+ | CONFUSED | Can't diagnose why something is failing | Try cross-model research (Codex). If still confused, STOP. Describe what you tried, ask for help | Try `/effort max` |
 
  ## Self-Review Loop (CRITICAL)
 
@@ -166,36 +204,76 @@ PLANNING -> DOCS -> TDD RED -> TDD GREEN -> Tests Pass -> Self-Review
 
  **Prerequisites:** Codex CLI installed (`npm i -g @openai/codex`), OpenAI API key set.
 
- ### Round 1: Initial Review
+ **The core insight:** The review PROTOCOL is universal across domains. Only the review INSTRUCTIONS change. Code review is the default template below. For non-code domains (research, persuasion, medical content), adapt the `review_instructions` and `verification_checklist` fields while keeping the same handoff/dialogue/convergence loop.
 
- 1. After self-review passes, write `.reviews/handoff.json`:
- ```jsonc
- {
- "review_id": "feature-xyz-001",
- "status": "PENDING_REVIEW",
- "round": 1,
- "files_changed": ["src/auth.ts", "tests/auth.test.ts"],
- "review_instructions": "Review for security, edge cases, and correctness",
- "artifact_path": ".reviews/feature-xyz-001/"
- }
- ```
- 2. Run the independent reviewer:
- ```bash
- codex exec \
- -c 'model_reasoning_effort="xhigh"' \
- -s danger-full-access \
- -o .reviews/latest-review.md \
- "You are an independent code reviewer. Read .reviews/handoff.json, \
- review the listed files. Output each finding with: an ID (1, 2, ...), \
- severity (P0/P1/P2), description, and a 'certify condition' stating \
- what specific change would resolve it. \
- End with CERTIFIED or NOT CERTIFIED."
- ```
- 3. If CERTIFIED → proceed to CI. If NOT CERTIFIED → go to Round 2.
+ ### Step 0: Write Preflight Self-Review Doc
 
- ### Round 2+: Dialogue Loop
+ Before submitting to an external reviewer, document what YOU already checked. This is proven to reduce reviewer findings to 0-1 per round (evidence: anticheat repo preflight discipline).
 
- When the reviewer finds issues, respond per-finding instead of silently fixing everything:
+ Write `.reviews/preflight-{review_id}.md`:
+ ```markdown
+ ## Preflight Self-Review: {feature}
+ - [ ] Self-review via /code-review passed
+ - [ ] All tests passing
+ - [ ] Checked for: [specific concerns for this change]
+ - [ ] Verified: [what you manually confirmed]
+ - [ ] Known limitations: [what you couldn't verify]
+ ```
+
+ ### Step 1: Write Mission-First Handoff
+
+ After self-review and preflight pass, write `.reviews/handoff.json`:
+ ```jsonc
+ {
+ "review_id": "feature-xyz-001",
+ "status": "PENDING_REVIEW",
+ "round": 1,
+ "mission": "What changed and why — 2-3 sentences of context",
+ "success": "What 'correctly reviewed' looks like — the reviewer's goal",
+ "failure": "What gets missed if the reviewer is superficial",
+ "files_changed": ["src/auth.ts", "tests/auth.test.ts"],
+ "fixes_applied": [],
+ "previous_score": null,
+ "verification_checklist": [
+ "(a) Verify input validation at auth.ts:45 handles empty strings",
+ "(b) Verify test covers the null-token edge case",
+ "(c) Check no hardcoded secrets in diff"
+ ],
+ "review_instructions": "Focus on security and edge cases. Be strict — assume bugs may be present until proven otherwise.",
+ "preflight_path": ".reviews/preflight-feature-xyz-001.md",
+ "artifact_path": ".reviews/feature-xyz-001/"
+ }
+ ```
+
+ **Key fields explained:**
+ - `mission/success/failure` — Gives the reviewer context. Without this, you get generic "looks good" feedback. With it, reviewers read raw source files and verify specific claims (proven across 4 repos)
+ - `verification_checklist` — Specific things to verify with file:line references. NOT "review for correctness" — that's too vague. Each item is independently verifiable
+ - `preflight_path` — Shows the reviewer what you already checked, so they focus on what you might have missed
+
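Because the handoff is plain JSON, a cheap guard can catch a missing mission-first field before the reviewer is ever invoked. A minimal sketch (plain grep, no jq dependency; the inline fixture stands in for a real handoff):

```shell
HANDOFF=".reviews/handoff.json"
mkdir -p .reviews
# Demo fixture: in practice the handoff already exists at this point
cat > "$HANDOFF" <<'EOF'
{ "mission": "demo", "success": "demo", "failure": "demo", "verification_checklist": [] }
EOF

# Fail fast if any mission-first field is absent
for field in mission success failure verification_checklist; do
  grep -q "\"$field\"" "$HANDOFF" || { echo "handoff missing: $field" >&2; exit 1; }
done
echo "handoff fields present"
```

A guard like this keeps a half-filled handoff from producing the generic "looks good" review the fields exist to prevent.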
+ ### Step 2: Run the Independent Reviewer
+
+ ```bash
+ codex exec \
+ -c 'model_reasoning_effort="xhigh"' \
+ -s danger-full-access \
+ -o .reviews/latest-review.md \
+ "You are an independent code reviewer performing a certification audit. \
+ Read .reviews/handoff.json for full context — mission, success/failure \
+ conditions, and verification checklist. \
+ Verify each checklist item with evidence (file:line, grep results, test output). \
+ Output each finding with: ID (1, 2, ...), severity (P0/P1/P2), evidence, \
+ and a 'certify condition' (what specific change resolves it). \
+ Re-verify any prior-round passes still hold. \
+ End with: score (1-10), CERTIFIED or NOT CERTIFIED."
+ ```
+
+ **Always use `xhigh` reasoning effort.** Lower settings miss subtle errors (wrong-generation references, stale pricing, cross-file inconsistencies).
+
+ If CERTIFIED → proceed to CI. If NOT CERTIFIED → go to dialogue loop.
+
+ ### Step 3: Dialogue Loop
+
+ Respond per-finding — don't silently fix everything:
 
  1. Write `.reviews/response.json`:
  ```jsonc
@@ -205,16 +283,16 @@ When the reviewer finds issues, respond per-finding instead of silently fixing e
  "responding_to": ".reviews/latest-review.md",
  "responses": [
  { "finding": "1", "action": "FIXED", "summary": "Added missing validation" },
- { "finding": "2", "action": "DISPUTED", "justification": "This is intentional — see CODE_REVIEW_EXCEPTIONS.md" },
+ { "finding": "2", "action": "DISPUTED", "justification": "Intentional — see CODE_REVIEW_EXCEPTIONS.md" },
  { "finding": "3", "action": "ACCEPTED", "summary": "Will add test coverage" }
  ]
  }
  ```
- - **FIXED**: "I fixed this. Here is what changed." Reviewer verifies.
- - **DISPUTED**: "This is intentional/incorrect. Here is why." Reviewer accepts or rejects.
- - **ACCEPTED**: "You are right. Fixing now." (Same as FIXED, batched.)
+ - **FIXED**: "I fixed this. Here's what changed." Reviewer verifies against certify condition.
+ - **DISPUTED**: "This is intentional/incorrect. Here's why." Reviewer accepts or rejects with reasoning.
+ - **ACCEPTED**: "You're right. Fixing now." (Same as FIXED, batched.)
 
- 2. Update `handoff.json` with `"status": "PENDING_RECHECK"`, increment `round`, add `"response_path"` and `"previous_review"` fields.
+ 2. Update `handoff.json`: increment `round`, set `"status": "PENDING_RECHECK"`, add `fixes_applied` list with numbered items and file:line references, update `previous_score`.
 
  3. Run targeted recheck (NOT a full re-review):
  ```bash
@@ -222,86 +300,88 @@ When the reviewer finds issues, respond per-finding instead of silently fixing e
  -c 'model_reasoning_effort="xhigh"' \
  -s danger-full-access \
  -o .reviews/latest-review.md \
- "You are doing a TARGETED RECHECK. First read .reviews/handoff.json \
- to find the previous_review path read that file for the original \
- findings and certify conditions. Then read .reviews/response.json \
- for the author's responses. For each: \
- FIXED → verify the fix against the original certify condition. \
- DISPUTED → evaluate the justification (ACCEPT if sound, REJECT if not). \
+ "TARGETED RECHECK (not a full re-review). Read .reviews/handoff.json \
+ for previous_review path and response.json for the author's responses. \
+ For each finding: FIXED → verify against original certify condition. \
+ DISPUTED → evaluate justification (ACCEPT if sound, REJECT with reasoning). \
  ACCEPTED → verify it was applied. \
  Do NOT raise new findings unless P0 (critical/security). \
  New observations go in 'Notes for next review' (non-blocking). \
- End with CERTIFIED or NOT CERTIFIED."
+ Re-verify all prior passes still hold. \
+ End with: score (1-10), CERTIFIED or NOT CERTIFIED."
  ```
 
- 4. If CERTIFIED → done. If NOT CERTIFIED (rejected disputes or failed fixes) → fix rejected items and repeat.
-
  ### Convergence
 
- Max 3 recheck rounds (4 total including initial review). If still NOT CERTIFIED after round 4, escalate to the user with a summary of open findings. Don't spin indefinitely.
+ **2 rounds is the sweet spot. 3 max.** Research across 14 repos and 7 papers confirms additional rounds beyond 3 produce <5% position shift.
+
+ Max 2 recheck rounds (3 total including initial review). If still NOT CERTIFIED after round 3, escalate to the user with a summary of open findings.
 
  ```
- Self-review passes → handoff.json (round 1, PENDING_REVIEW)
- |
- Reviewer: FULL REVIEW (structured findings)
- |
- CERTIFIED? YES → CI feedback loop
- |
- NO (findings with IDs + certify conditions)
- |
- Claude writes response.json:
- FIXED / DISPUTED / ACCEPTED per finding
- |
- handoff.json (round 2+, PENDING_RECHECK)
- |
- Reviewer: TARGETED RECHECK (previous findings only)
- |
- All resolved? → YES → CERTIFIED
- |
- NO → fix rejected items, repeat
- (max 3 rechecks, then escalate to user)
+ Preflight → handoff.json (round 1) → FULL REVIEW
+ |
+ CERTIFIED? → YES → CI
+ |
+ NO (scored findings)
+ |
+ response.json (FIXED/DISPUTED/ACCEPTED)
+ |
+ handoff.json (round 2+) → TARGETED RECHECK
+ |
+ CERTIFIED? → YES → CI
+ |
+ NO → one more round, then escalate
  ```
 
  **Tool-agnostic:** The value is adversarial diversity (different model, different blind spots), not the specific tool. Any competing AI reviewer works.
 
- **Full protocol:** See the wizard's "Cross-Model Review Loop (Optional)" section for key flags and reasoning effort guidance.
+ ### Anti-Patterns to Avoid
+
+ - **"Find at least N problems"** — Incentivizes false positives. Use adversarial framing ("assume bugs may be present") instead
+ - **"Review this"** — Too vague, gets generic feedback. Use mission + verification checklist
+ - **Numeric 1-10 scales without criteria** — Unreliable. Decompose into specific checklist items
+ - **Letting reviewer see author's reasoning** — Causes anchoring bias. Let them form independent opinion from code
 
  ### Release Review Focus
 
- Before any release/publish, add these to `review_instructions`:
+ Before any release/publish, add these to `verification_checklist`:
  - **CHANGELOG consistency** — all sections present, no lost entries during consolidation
  - **Version parity** — package.json, SDLC.md, CHANGELOG, wizard metadata all match
  - **Stale examples** — hardcoded version strings in docs match current release
  - **Docs accuracy** — README, ARCHITECTURE.md reflect current feature set
  - **CLI-distributed file parity** — live skills, hooks, settings match CLI templates
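Expressed as handoff checklist items, those might look like (illustrative wording):

```jsonc
"verification_checklist": [
  "(a) Verify every CHANGELOG section for this release survived consolidation",
  "(b) Verify package.json, SDLC.md, CHANGELOG, and wizard metadata agree on the version",
  "(c) Verify docs contain no stale hardcoded version strings"
]
```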
 
- Evidence: v1.20.0 cross-model review caught CHANGELOG section loss and stale wizard version examples that passed all tests and self-review. Tests catch version mismatches; cross-model review catches semantic issues tests cannot.
-
  ### Multiple Reviewers (N-Reviewer Pipeline)
 
  When multiple reviewers comment on a PR (Claude PR review, Codex, human reviewers), address each reviewer independently:
 
  1. **Read all reviews** — `gh api repos/OWNER/REPO/pulls/PR/comments` to get every reviewer's feedback
  2. **Respond per-reviewer** — Each reviewer has different blind spots and priorities. Address each one's findings separately
- 3. **Resolve conflicts** — If reviewers disagree, use your judgment: pick the stronger argument, note why you chose it
- 4. **Iterate until all approve** — Don't merge until every active reviewer is satisfied or their concerns are explicitly addressed
- 5. **Max 3 iterations per reviewer** — If a reviewer keeps finding new things after 3 rounds, escalate to the user
+ 3. **Resolve conflicts** — If reviewers disagree, pick the stronger argument, note why
+ 4. **Iterate until all approve** — Don't merge until every active reviewer is satisfied
+ 5. **Max 3 iterations per reviewer** — If a reviewer keeps finding new things, escalate to the user
 
- **The value of multiple reviewers:** Different models/humans catch different issues. Claude excels at SDLC/process compliance. Codex catches logic bugs. Humans catch "does this make sense for the product?" None alone is sufficient for high-stakes changes.
+ ### Adapting for Non-Code Domains
 
- ### Custom Subagents (`.claude/agents/`)
+ The handoff format and dialogue loop work for ANY domain. Only `review_instructions` and `verification_checklist` change:
+
+ | Domain | Instructions Focus | Checklist Example |
+ |--------|-------------------|-------------------|
+ | **Code (default)** | Security, logic bugs, test coverage | "Verify input validation at file:line" |
+ | **Research/Docs** | Factual accuracy, source verification, overclaims | "Verify $736-$804 appears in both docs, no stale $695-$723 remains" |
+ | **Persuasion** | Audience psychology, tone, trust | "If you were [audience], what's the moment you'd stop reading?" |
+
+ For non-code: add `"audience"` and `"stakes"` fields to handoff.json. For code, these are implied (audience = other developers, stakes = production impact).
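For example, a research/docs review handoff might add (field names from this section; values illustrative):

```jsonc
// .reviews/handoff.json (non-code review)
{
  "review_id": "pricing-docs-001",
  "status": "PENDING_REVIEW",
  "round": 1,
  "mission": "Updated the pricing range across two public docs",
  "audience": "prospective customers comparing plans",
  "stakes": "published numbers; an error erodes trust",
  "verification_checklist": [
    "(a) Verify the new range appears in both docs",
    "(b) Verify no stale range remains anywhere"
  ],
  "review_instructions": "Check factual accuracy and source consistency. Assume errors may be present until proven otherwise."
}
```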
 
- Claude Code supports custom subagents in `.claude/agents/`. These are specialized agents you can invoke for focused tasks:
+ ### Custom Subagents (`.claude/agents/`)
 
- - **`sdlc-reviewer`** An agent focused purely on SDLC compliance review (planning, TDD, self-review checks)
- - **`ci-debug`** — An agent specialized in diagnosing CI failures (reads logs, identifies root cause, suggests fix)
- - **`test-writer`** — An agent focused on writing quality tests following TESTING.md philosophies
+ Claude Code supports custom subagents in `.claude/agents/`:
 
- **When to use agents vs skills:**
- - **Skills** (`.claude/skills/`) Prompts that guide Claude's behavior for a task type. Claude reads and follows them
- - **Agents** (`.claude/agents/`) Independent subprocesses that run autonomously on a focused task and return results
+ - **`sdlc-reviewer`** — SDLC compliance review (planning, TDD, self-review checks)
+ - **`ci-debug`** — CI failure diagnosis (reads logs, identifies root cause, suggests fix)
+ - **`test-writer`** — Quality tests following TESTING.md philosophies
 
- Agents are useful when you want parallel work (e.g., run `sdlc-reviewer` while you continue implementing) or when a task benefits from a fresh context window focused on one thing.
+ **Skills** guide Claude's behavior. **Agents** run autonomously and return results. Use agents for parallel work or fresh context windows.
 
  ## Test Review (Harder Than Implementation)
 
@@ -315,6 +395,16 @@ During self-review, critique tests HARDER than app code:
 
  **Tests are the foundation.** Bad tests = false confidence = production bugs.
 
+ ### Testing Diamond — Know Your Layers
+
+ | Layer | What It Tests | % of Suite | Key Trait |
+ |-------|--------------|------------|-----------|
+ | **E2E** | Full user flow through UI/browser (Playwright, Cypress) | ~5% | Slow, brittle, but proves the real thing works |
403
+ | **Integration** | Real systems via API without UI — real DB, real cache, real services | ~90% | **Best bang for buck.** Fast, stable, high confidence |
404
+ | **Unit** | Pure logic only — no DB, no API, no filesystem | ~5% | Fast but limited scope |
405
+
406
+ **The critical boundary:** E2E tests go through the user's actual UI/browser. Integration tests hit real systems via API but without UI. If your test doesn't open a browser or render a UI, it's not E2E — it's integration. This distinction matters because mislabeling integration tests as E2E leads to overinvestment in slow browser tests when fast API-level tests would suffice.
407
+
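To make the boundary concrete, here is a minimal integration-style test sketch (Python stdlib only; the service, its `/health` route, and the handler are hypothetical stand-ins): it exercises a real HTTP server over a real socket, with no browser or UI involved — by the table above, that is integration, not E2E.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in for a real service; the /health route is hypothetical."""
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # silence per-request logging in tests
        pass

def test_health_endpoint():
    # Integration: real HTTP server, real socket, real JSON — but no browser/UI.
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/health"
        with urllib.request.urlopen(url) as resp:
            assert resp.status == 200
            assert json.load(resp)["status"] == "ok"
    finally:
        server.shutdown()

test_health_endpoint()
```

Because no browser is launched, this stays fast and stable while still proving the server-side behavior end to end at the API layer.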
 ### Minimal Mocking Philosophy

 | What | Mock? | Why |
@@ -366,42 +456,6 @@ If you notice something else that should be fixed:

 **Why this matters:** AI agents can drift into "helpful" changes that weren't requested. This creates unexpected diffs, breaks unrelated things, and makes code review harder.

- ## Test Failure Recovery (SDET Philosophy)
-
- ```
- ┌─────────────────────────────────────────────────────────────────────┐
- │ ALL TESTS MUST PASS. NO EXCEPTIONS. │
- │ │
- │ This is not negotiable. This is not flexible. This is absolute. │
- └─────────────────────────────────────────────────────────────────────┘
- ```
-
- **Not acceptable:**
- - "Those were already failing" → Fix them first
- - "Not related to my changes" → Doesn't matter, fix it
- - "It's flaky" → Flaky = bug, investigate
-
- **Treat test code like app code.** Test failures are bugs. Investigate them the way a 15-year SDET would - with thought and care, not by brushing them aside.
-
- If tests fail:
- 1. Identify which test(s) failed
- 2. Diagnose WHY - this is the important part:
-    - Your code broke it? Fix your code (regression)
-    - Test is for deleted code? Delete the test
-    - Test has wrong assertions? Fix the test
-    - Test is "flaky"? Investigate - flakiness is just another word for bug
- 3. Fix appropriately (fix code, fix test, or delete dead test)
- 4. Run specific test individually first
- 5. Then run ALL tests
- 6. Still failing? ASK USER - don't spin your wheels
-
- **Flaky tests are bugs, not mysteries:**
- - Sometimes the bug is in app code (race condition, timing issue)
- - Sometimes the bug is in test code (shared state, not parallel-safe)
- - Sometimes the bug is in test environment (cleanup not proper)
-
- Debug it. Find root cause. Fix it properly. Tests ARE code.
-
 ## Debugging Workflow (Systematic Investigation)

 When something breaks and the cause isn't obvious, follow this systematic debugging workflow:
@@ -451,25 +505,25 @@ Local tests pass -> Commit -> Push -> Watch CI
 STOP and ASK USER
 ```

- **How to watch CI:**
+ ```
+ ┌─────────────────────────────────────────────────────────────────────┐
+ │ NEVER AUTO-MERGE. NO EXCEPTIONS. │
+ │ │
+ │ Do NOT run `gh pr merge --auto`. Ever. │
+ │ Auto-merge fires before you can read review feedback. │
+ │ The shepherd loop IS the process. Skipping it = shipping bugs. │
+ └─────────────────────────────────────────────────────────────────────┘
+ ```
+
+ **The full shepherd sequence — every step is mandatory:**
  1. Push changes to remote
- 2. Check CI status:
- ```bash
- # Watch checks in real-time (blocks until complete)
- gh pr checks --watch
+ 2. Watch CI: `gh pr checks --watch`
+ 3. If CI fails → read logs (`gh run view <RUN_ID> --log-failed`), fix, push again (max 2 attempts)
+ 4. If CI passes → read ALL review comments: `gh api repos/OWNER/REPO/pulls/PR/comments`
+ 5. Fix valid suggestions, push, iterate until clean
+ 6. Only then: explicit merge with `gh pr merge --squash`
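The sequence above can be sketched as a loop. This is an illustrative Python sketch, not wizard code: `run` is an injected command runner (stubbed below, since the real `gh` calls need a live PR), and the command strings mirror the steps verbatim.

```python
def shepherd(run, max_ci_fixes=2):
    """Sketch of the shepherd sequence; run(cmd) -> bool executes one command."""
    run("git push")                                   # 1. push to remote
    for attempt in range(max_ci_fixes + 1):
        if run("gh pr checks --watch"):               # 2. watch CI
            break
        if attempt == max_ci_fixes:
            return "ASK USER"                         # don't spin your wheels
        run("gh run view <RUN_ID> --log-failed")      # 3. read logs, fix...
        run("git push")                               #    ...and push again
    run("gh api repos/OWNER/REPO/pulls/PR/comments")  # 4. read ALL review comments
    # 5. fix valid suggestions, push, iterate until clean ... then:
    return run("gh pr merge --squash") and "MERGED"   # 6. explicit merge, never --auto

# Demo with a stubbed runner: CI fails once, then passes.
calls, ci_runs = [], [0]
def fake_run(cmd):
    calls.append(cmd)
    if cmd == "gh pr checks --watch":
        ci_runs[0] += 1
        return ci_runs[0] > 1   # first CI run fails, second passes
    return True

print(shepherd(fake_run))  # → MERGED
```

The point of the structure: merging is the last call in the function, reachable only after CI and review-reading succeed — there is no path that merges early.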

- # Or check status without blocking
- gh pr checks
-
- # View specific failed run logs
- gh run view <RUN_ID> --log-failed
- ```
- 3. If CI fails:
-    - Read failure logs: `gh run view <RUN_ID> --log-failed`
-    - Diagnose root cause (same philosophy as local test failures)
-    - Fix and push again
- 4. Max 2 fix attempts - if still failing, ASK USER
- 5. If CI passes - proceed to present final summary
+ **Why this is non-negotiable:** PR #145 auto-merged a release before review feedback was read. CI reviewer found a P1 dead-code bug that shipped to main. The fix required a follow-up commit. Auto-merge cost more time than the shepherd loop would have taken.

 **Context GC (compact during idle):** While waiting for CI (typically 3-5 min), suggest `/compact` if the conversation is long. Think of it like a time-based garbage collector — idle time + high memory pressure = good time to collect. Don't suggest on short conversations.

@@ -519,6 +573,8 @@ CI passes -> Read review suggestions
 - Auto-compact fires at ~95% capacity — no manual management needed
 - After committing a PR, `/clear` before starting the next feature

+ **`--bare` mode (v2.1.81+):** `claude -p "prompt" --bare` skips ALL hooks, skills, LSP, and plugins. This is a complete wizard bypass — no SDLC enforcement, no TDD checks, no planning hooks. Use only for scripted headless calls (CI pipelines, automation) where you explicitly don't want wizard enforcement. Never use `--bare` for normal development work.
+
 ## DRY Principle

 **Before coding:** "What patterns exist I can reuse?"
@@ -543,6 +599,21 @@ CI passes -> Read review suggestions

 **If no DESIGN_SYSTEM.md exists:** Skip these checks (project has no documented design system).

+ ## Release Planning (If Task Involves a Release)
+
+ **When to check:** Task mentions "release", "publish", "version bump", "npm publish", or multiple items being shipped together.
+ **When to skip:** Single feature implementation, bug fix, or anything that isn't a release.
+
+ Before implementing any release items:
+
+ 1. **List all items** — Read ROADMAP.md (or equivalent), identify every item planned for this release
+ 2. **Plan each at 95% confidence** — For each item: what files change, what tests prove it works, what's the blast radius. If confidence < 95% on any item, flag it
+ 3. **Identify blocks** — Which items depend on others? What must go first?
+ 4. **Present all plans together** — User reviews the complete batch, not one at a time. This catches conflicts, sequencing issues, and scope creep before any code is written
+ 5. **User approves, then implement** — Full SDLC per item (TDD RED → GREEN → self-review), in the prioritized order
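A sketch of what this gate checks (Python; the item names, `deps` field, and confidence scores are illustrative, not a wizard API):

```python
def plan_release(items, threshold=0.95):
    """Gate: plan every item before implementing any. Flags low-confidence items."""
    flagged = [i["name"] for i in items if i["confidence"] < threshold]
    ordered, done, pending = [], set(), list(items)
    while pending:  # dependency-first ordering (step 3: identify blocks)
        ready = [i for i in pending if set(i.get("deps", [])) <= done]
        if not ready:
            raise ValueError("circular dependency among remaining items")
        for i in ready:
            ordered.append(i["name"]); done.add(i["name"]); pending.remove(i)
    # Step 4: the whole batch is returned for review together, not one at a time.
    return {"order": ordered, "flagged": flagged, "ready": not flagged}

batch = [
    {"name": "release-gate", "confidence": 0.99, "deps": ["update-hook"]},
    {"name": "update-hook", "confidence": 0.97},
    {"name": "bare-docs", "confidence": 0.90},   # below 95% — must be flagged
]
print(plan_release(batch))
```

Any flagged item blocks the whole batch (`ready` is false) — which is the point: if you can't plan it at 95%, you're not ready to ship it.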
+
+ **Why batch planning works:** Ad-hoc one-at-a-time implementation leads to unvalidated additions and scope creep. Batch planning catches problems early — if you can't plan it at 95%, you're not ready to ship it.
+
 ## Deployment Tasks (If Task Involves Deploy)

 **When to check:** Task mentions "deploy", "release", "push to prod", "staging", etc.
@@ -608,6 +679,26 @@ If this session revealed insights, update the right place:
 - **Feature-specific quirks** → Feature docs (`*_PLAN.md`, `*_DOCS.md`)
 - **Architecture decisions** → `docs/decisions/` (ADR format) or `ARCHITECTURE.md`
 - **General project context** → `CLAUDE.md` (or `/revise-claude-md`)
+ - **Plan files** → If this session's work came from a plan file, delete it or mark it complete. Stale plans mislead future sessions into thinking work is still pending
+
+ ## Post-Mortem: When Process Fails, Feed It Back
+
+ **Every process failure becomes an enforcement rule.** When you skip a step and it causes a problem, don't just fix the symptom — add a gate so it can't happen again.
+
+ ```
+ Incident → Root Cause → New Rule → Test That Proves the Rule → Ship
+ ```
+
+ **How to post-mortem a process failure:**
+ 1. **What happened?** — Describe the incident (what went wrong, what was the impact)
+ 2. **Root cause** — Not "I forgot" — what structurally allowed the skip? Was it guidance (easy to ignore) instead of a gate (impossible to skip)?
+ 3. **New rule** — Turn the failure into an enforcement rule in the SDLC skill
+ 4. **Test** — Write a test that proves the rule exists (TDD — the rule is code too)
+ 5. **Evidence** — Reference the incident so future readers understand WHY the rule exists
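Such a rule-proving test can be as small as a guard that fails whenever the rule's text disappears from the skill. A sketch (Python; `SKILL_TEXT` is an inline stand-in for reading the real skill file, and the marker mirrors the auto-merge incident):

```python
def rule_exists(skill_text, marker):
    """A post-mortem rule counts only if a test can prove it is still in the skill."""
    return marker in skill_text

# Illustrative stand-in for reading the SDLC skill file from disk.
SKILL_TEXT = """
## CI Shepherd
NEVER AUTO-MERGE. NO EXCEPTIONS.
"""

def test_never_auto_merge_gate():
    assert rule_exists(SKILL_TEXT, "NEVER AUTO-MERGE"), \
        "enforcement block removed — restore it and re-read the post-mortem"

test_never_auto_merge_gate()
```

The test's failure message points back at the incident, so whoever deletes the block learns why it existed before shipping the deletion.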
+
+ **Example (real incident):** PR #145 auto-merged before CI review was read. Root cause: auto-merge was enabled by default, no enforcement gate existed. New rule: "NEVER AUTO-MERGE" block added to CI Shepherd section with the same weight as "ALL TESTS MUST PASS." Test: `test_never_auto_merge_gate` verifies the block exists.
+
+ **Industry pattern:** "Every mistake becomes a rule" — the best SDLC systems are built from accumulated incident learnings, not theoretical best practices.

 ---

@@ -45,10 +45,11 @@ Extract the latest version from the first `## [X.X.X]` line.
 Parse all CHANGELOG entries between the user's installed version and the latest. Present a clear summary:

 ```
- Installed: 1.20.0
- Latest: 1.22.0
+ Installed: 1.21.0
+ Latest: 1.23.0

 What changed:
+ - [1.23.0] Update notification hook, ...
 - [1.22.0] Plan auto-approval, debugging workflow, /feedback skill, BRANDING.md detection, ...
 - [1.21.0] Confidence-driven setup, prove-it gate, cross-model release review, ...
 - [1.20.0] Version-pinned CC update gate, Tier 1 flakiness fix, flaky test guidance, ...
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "agentic-sdlc-wizard",
-  "version": "1.22.0",
+  "version": "1.23.0",
   "description": "SDLC enforcement for Claude Code — hooks, skills, and wizard setup in one command",
   "bin": {
     "sdlc-wizard": "./cli/bin/sdlc-wizard.js"