agentic-sdlc-wizard 1.18.0 → 1.21.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -4,6 +4,43 @@ All notable changes to the SDLC Wizard.
4
4
 
5
5
  > **Note:** This changelog is for humans to read. Don't manually apply these changes - just run the wizard ("Check for SDLC wizard updates") and it handles everything automatically.
6
6
 
7
+ ## [1.21.0] - 2026-03-31
8
+
9
+ ### Added
10
+ - Confidence-driven setup wizard — kills the fixed 18 questions. Scans repo, builds confidence per data point, only asks what it can't infer. Dynamic question count (0-2 for well-configured projects, 10+ for bare repos). 95% aggregate confidence threshold (#52)
11
+ - CI Shepherd opt-in question in setup wizard (#48 partial)
12
+ - Cross-model release review recommendation — releases/publishes as explicit trigger, Release Review Checklist with v1.20.0 evidence (#49)
13
+ - Prove It Gate enforcement in SDLC skill — prevents unvalidated additions with quality test requirements (#50)
14
+ - 6 confidence-driven setup tests, 10 prove-it-gate tests, 6 release review tests
15
+
16
+ ### Removed
17
+ - ci-analyzer skill — violated Prove It philosophy (existence-only tests, no quality validation, overlap with `/claude-automation-recommender`) (#50)
18
+ - ci-self-heal.yml deprecated — local shepherd is the primary CI fix mechanism
19
+
20
+ ### Changed
21
+ - Wizard doc: Q-numbered questions → data point descriptions with detection hints
22
+ - Setup skill: 12 steps (was 11) with new "Build Confidence Map" step
23
+ - CLI distributes 8 template files (was 9, removed ci-analyzer)
24
+
25
+ ## [1.20.0] - 2026-03-31
26
+
27
+ ### Added
28
+ - CC version-pinned update gate — E2E tests run actual new CC version, not bundled binary (#46)
29
+ - Tier 1 E2E flakiness fix — regression threshold 1.5→3.0, absorbs ±2-3 point LLM variance (#47)
30
+ - Flaky test prevention guidance with external reference in wizard, SKILL.md
31
+ - 2 release consistency tests (package.json ↔ CHANGELOG ↔ SDLC.md version parity)
32
+
33
+ ## [1.19.0] - 2026-03-31
34
+
35
+ ### Added
36
+ - CI Local Shepherd Model — two-tier CI fix model (shepherd primary, bot fallback), SHA-based suppression (#36)
37
+ - Gap Analysis vs `/claude-automation-recommender` — complementary tools positioning (#35)
38
+ - `/clear` vs `/compact` context management guidance (#38)
39
+ - Token efficiency auditing — `/cost`, `--max-budget-usd`, OpenTelemetry (#42)
40
+ - Blank repo support — verified clean install, 10 new E2E tests (#31)
41
+ - Feature documentation enforcement — ADR guidance, `claude-md-improver`, doc sync in SDLC (#43)
42
+ - Setup skill description trimmed to 199 chars (v2.1.86 caps at 250)
43
+
7
44
  ## [1.18.0] - 2026-03-30
8
45
 
9
46
  ### Added
@@ -37,7 +37,7 @@ As Claude Code improves, the wizard absorbs those improvements and removes its o
37
37
  **But here's the key:** This isn't a one-size-fits-all answer. It's a starting point that helps you find YOUR answer. Every project is different. The self-evaluating loop (plan → build → test → review → improve) needs to be tuned to your codebase, your team, your standards. The wizard gives you the framework — you shape it into something bespoke.
38
38
 
39
39
  **The living system:**
40
- - CI self-heal captures friction signals as GitHub issues for pattern analysis
40
+ - The local shepherd captures friction signals during active sessions
41
41
  - You approve changes to the process
42
42
  - Both sides learn over time
43
43
  - The system improves the system (recursive improvement)
@@ -93,8 +93,10 @@ This prevents both false positives (crying wolf) and false negatives (missing re
93
93
 
94
94
  **How We Apply This:**
95
95
  - Weekly workflow tests new Claude Code versions before recommending upgrade
96
+ - Version-pinned gate: installs the specific CC version and passes it via `path_to_claude_code_executable` so E2E actually runs the new binary (see the sketch after this list)
96
97
  - Phase A: Does new CC version break SDLC enforcement?
97
98
  - Phase B: Do changelog-suggested improvements actually help?
99
+ - Green CI = safe to upgrade. Red = stay on current version until fixed
98
100
  - Results shown in PR with statistical confidence
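A minimal sketch of the gate mechanics referenced in the list above. The temp install prefix, env-var plumbing, and npm package name are assumptions; only the `path_to_claude_code_executable` name comes from this doc:

```typescript
import { execSync } from "node:child_process";

// Illustrative only: how a gate might install the candidate CC version in
// isolation and point the E2E harness at that binary, not the bundled one.
const candidate = process.env.CC_CANDIDATE_VERSION ?? "2.1.70"; // assumed env var

execSync(`npm install --prefix /tmp/cc-gate @anthropic-ai/claude-code@${candidate}`, {
  stdio: "inherit",
});

execSync("npm run test:e2e", {
  stdio: "inherit",
  env: {
    ...process.env,
    path_to_claude_code_executable: "/tmp/cc-gate/node_modules/.bin/claude",
  },
});
```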
99
101
 
100
102
  ---
@@ -209,6 +211,8 @@ When Anthropic provides official plugins or tools that handle something:
209
211
  | **Claude Code v2.1.69+** | Required for InstructionsLoaded hook, skill directory variable, and Tasks system |
210
212
  | **Git repository** | Files should be committed for team sharing |
211
213
 
214
+ **Blank repos (no CLAUDE.md, no code):** The wizard works on empty repos. Run `npx agentic-sdlc-wizard init` — it installs hooks, skills, and the wizard doc. On first session, the hooks detect missing SDLC files and redirect to `/setup-wizard`, which generates CLAUDE.md, SDLC.md, TESTING.md, and ARCHITECTURE.md interactively. You do NOT need to run Claude's built-in `/init` first — the setup wizard handles everything.
215
+
212
216
  ---
213
217
 
214
218
  ## Recommended Effort Level
@@ -352,6 +356,14 @@ This applies to everything: native Claude Code commands vs custom skills, framew
352
356
 
353
357
  **For the wizard's CI/CD:** When the weekly-update workflow detects a new Claude Code feature that overlaps with a wizard feature, the CI should automatically run E2E with both versions and recommend KEEP CUSTOM / SWITCH TO NATIVE / TIE.
354
358
 
359
+ **This applies to YOUR OWN additions too — not just native vs custom:**
360
+ - Adding a new skill? Prove it fills a gap nothing else covers. Write quality tests.
361
+ - Adding a new hook? Prove it improves scores or catches real issues.
362
+ - Adding a new workflow? Prove the automation ROI exceeds maintenance cost.
363
+ - Existence tests ("file exists", "has frontmatter") are NOT proof. They prove the file was created, not that it works.
364
+
365
+ **Evidence:** ci-analyzer skill was added in v1.20.0 with 4 existence-only tests, zero quality validation, and overlap with the third-party `/claude-automation-recommender`. It was deleted in the next release. This gap led to the Prove It Gate enforcement in the SDLC skill.
366
+
355
367
  ---
356
368
 
357
369
  ## What You're Setting Up
@@ -644,6 +656,33 @@ Security review depth should match your project's risk profile. During wizard se
644
656
 
645
657
  ---
646
658
 
659
+ ## Context Management: `/clear` vs `/compact`
660
+
661
+ Two tools for managing context — use the right one:
662
+
663
+ | | `/compact` | `/clear` |
664
+ |---|---|---|
665
+ | **What it does** | Summarizes conversation, frees space | Resets conversation entirely |
666
+ | **When to use** | Continuing same task, need more room | Switching to an unrelated task |
667
+ | **Preserves** | Summary of decisions + progress | Nothing (fresh start) |
668
+ | **CLAUDE.md** | Re-loaded from disk | Re-loaded from disk |
669
+ | **Hooks/skills/settings** | Unaffected | Unaffected |
670
+ | **Task list** | Persists | Cleared |
671
+
672
+ **Rules of thumb:**
673
+ - `/compact` between planning and implementation (plan preserved in summary)
674
+ - `/clear` between unrelated tasks (stale context wastes tokens and misleads Claude)
675
+ - `/clear` after 2+ failed corrections on the same issue (context is polluted with bad approaches — start fresh with a better prompt)
676
+ - After committing a PR, `/clear` before starting the next feature
677
+
678
+ **Auto-compact** fires automatically at ~95% context capacity. You don't need to manage this manually — Claude Code handles it. The SDLC skill suggests `/compact` during CI idle time as a "context GC" opportunity.
679
+
680
+ **What survives `/compact`:** Key decisions, code changes, task state (as a summary). What can be lost: detailed early-conversation instructions not in CLAUDE.md, specific file contents read long ago.
681
+
682
+ **Best practice:** Put persistent instructions in CLAUDE.md (survives both `/compact` and `/clear`), not in conversation.
683
+
684
+ ---
685
+
647
686
  ## Example Workflow (End-to-End)
648
687
 
649
688
  Here's what a typical task looks like with this system:
@@ -912,21 +951,25 @@ The wizard creates TDD-specific automations that official plugins don't provide:
912
951
 
913
952
  ### Step 0.3: Additional Recommendations (Optional)
914
953
 
915
- After SDLC setup is complete, run `claude-code-setup` for additional recommendations:
954
+ After SDLC setup is complete, run `/claude-automation-recommender` for stack-specific tooling:
916
955
 
917
956
  ```
918
- "Based on your codebase, recommend additional automations"
957
+ /claude-automation-recommender
919
958
  ```
920
959
 
921
- This may suggest:
922
- - MCP Servers (context7 for docs, Playwright for frontend)
923
- - Additional hooks (auto-format if Prettier configured)
924
- - Subagents (security-reviewer if auth code detected)
960
+ **The wizard is an enforcement engine** — it installs working hooks, skills, and process guardrails that run automatically. **The recommender is a suggestion engine** — it analyzes your codebase and suggests additional automations you might want. They're complementary:
925
961
 
926
- **Claude prompts for each:**
927
- > "[Detected: Prettier config] Want to add auto-format hook? (y/n)"
962
+ | Category | Wizard Ships | Recommender Suggests |
963
+ |----------|-------------|---------------------|
964
+ | SDLC process (TDD, planning, review) | Enforced via hooks + skills | Not covered |
965
+ | CI workflows (PR review) | Templates + docs | Not covered |
966
+ | MCP servers (context7, Playwright, DB) | Not covered | Per-stack suggestions |
967
+ | Auto-formatting hooks (Prettier, ESLint) | Not covered | Per-stack suggestions |
968
+ | Type-checking hooks (tsc, mypy) | Not covered | Per-stack suggestions |
969
+ | Subagent templates (code-reviewer, etc.) | Cross-model review only | 8 templates |
970
+ | Plugin recommendations (LSPs, etc.) | Not covered | Per-stack suggestions |
928
971
 
929
- These are additive—they don't replace our TDD hooks.
972
+ The recommender's suggestions are additive; they don't replace the wizard's TDD hooks or SDLC enforcement.
930
973
 
931
974
  ### Git Workflow Preference
932
975
 
@@ -991,39 +1034,44 @@ Feature branches still recommended for solo devs (keeps main clean, easy rollbac
991
1034
 
992
1035
  **Back-and-forth:** User questions live in PR comments. Bot's response is always the latest sticky comment. Clean and organized.
993
1036
 
994
- **CI monitoring question:**
995
- > "Should Claude monitor CI checks after pushing and auto-diagnose failures? (y/n)"
1037
+ **CI shepherd opt-in (only if CI detected during auto-scan):**
1038
+ > "Enable CI shepherd role? Claude will actively watch CI, auto-fix failures, and iterate on review feedback. (y/n)"
996
1039
 
997
- - **Yes** → Enable CI feedback loop in SDLC skill, add `gh` CLI to allowedTools
998
- - **No** → Skip CI monitoring steps (Claude still runs local tests, just doesn't watch CI)
1040
+ - **Yes** → Enable full shepherd loop: CI fix loop + review feedback loop. Ask detail questions below
1041
+ - **No** → Skip CI shepherd entirely (Claude still runs local tests, just doesn't interact with CI after pushing)
999
1042
 
1000
- **What this does:**
1001
- 1. After pushing, Claude runs `gh pr checks` to watch CI status
1002
- 2. If checks fail, Claude reads logs via `gh run view --log-failed`
1003
- 3. Claude diagnoses the failure and proposes a fix
1004
- 4. Max 2 fix attempts, then asks user
1005
- 5. Job isn't done until CI is green
1043
+ **What the CI shepherd does:**
1044
+ 1. **CI fix loop:** After pushing, Claude watches CI via `gh pr checks`, reads failure logs, diagnoses and fixes, pushes again (max 2 attempts; see the sketch after this list)
1045
+ 2. **Review feedback loop:** After CI passes, Claude reads automated review comments, implements valid suggestions, pushes and re-reviews (max 3 iterations)
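A minimal sketch of the fix loop, assuming an authenticated `gh` CLI. The real shepherd runs inside your active Claude session with full context, not as a standalone script, so treat this as illustration only:

```typescript
import { execSync } from "node:child_process";

const MAX_FIX_ATTEMPTS = 2; // matches the shepherd's limit above

const sh = (cmd: string) => execSync(cmd, { encoding: "utf8", stdio: "pipe" });

for (let attempt = 1; attempt <= MAX_FIX_ATTEMPTS; attempt++) {
  try {
    sh("gh pr checks --watch"); // blocks until checks finish; throws on failure
    console.log("CI green: hand off to the review feedback loop");
    break;
  } catch {
    const logs = sh("gh run view --log-failed"); // failing-job logs, as the doc describes
    console.log(`Attempt ${attempt}: diagnosing from CI logs (${logs.length} bytes)`);
    // ...diagnose root cause, apply a fix, commit, push, then re-watch CI...
    if (attempt === MAX_FIX_ATTEMPTS) console.log("Still red after max attempts: ask the user");
  }
}
```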
1006
1046
 
1007
- **Recommendation:** Yes if you have CI configured. This closes the loop between
1008
- "local tests pass" and "PR is actually ready to merge."
1047
+ **Recommendation:** Yes if you have CI configured. The shepherd closes the loop between "local tests pass" and "PR is actually ready to merge."
1009
1048
 
1010
1049
  **Requirements:**
1011
1050
  - `gh` CLI installed and authenticated
1012
1051
  - CI/CD configured (GitHub Actions, etc.)
1013
1052
  - If no CI yet: skip, add later when you set up CI
1014
1053
 
1054
+ **Stored in SDLC.md metadata as:**
1055
+ ```
1056
+ <!-- CI Shepherd: enabled -->
1057
+ ```
1058
+
1059
+ **Detail questions (only if CI shepherd is enabled):**
1060
+
1061
+ **CI monitoring detail:**
1062
+ > "Should Claude monitor CI checks after pushing and auto-diagnose failures? (y/n)"
1063
+
1064
+ - **Yes** → Enable CI feedback loop in SDLC skill, add `gh` CLI to allowedTools
1065
+ - **No** → Skip CI monitoring steps (Claude still runs local tests, just doesn't watch CI)
1066
+
1015
1067
  **CI review feedback question (only if CI monitoring is enabled):**
1016
1068
  > "What level of automated review response do you want?"
1017
1069
 
1018
- | Level | Name | What autofix handles | Est. API cost |
1019
- |-------|------|---------------------|---------------|
1020
- | **L1** | `ci-only` | CI failures only (broken tests, lint) | ~$0.50/fix |
1021
- | **L2** | `criticals` (default) | + Critical review findings (must-fix) | ~$1/fix |
1022
- | **L3** | `all-findings` | + Every suggestion the reviewer flags | ~$2/fix |
1023
-
1024
- > **Cost note:** Higher levels mean more autofix iterations (each ~$0.50).
1025
- > L3 typically adds 1-2 extra iterations per PR but produces cleaner code.
1026
- > You can change this anytime by editing `AUTOFIX_LEVEL` in your ci-autofix workflow.
1070
+ | Level | Name | What the shepherd handles |
1071
+ |-------|------|--------------------------|
1072
+ | **L1** | `ci-only` | CI failures only (broken tests, lint) |
1073
+ | **L2** | `criticals` (default) | + Critical review findings (must-fix) |
1074
+ | **L3** | `all-findings` | + Every suggestion the reviewer flags |
1027
1075
 
1028
1076
  **What this does:**
1029
1077
  1. After CI passes, Claude reads the automated code review comments
@@ -1198,9 +1246,11 @@ Recommendation: Your current tests rely heavily on mocks.
1198
1246
 
1199
1247
  ---
1200
1248
 
1201
- ## Step 1: Confirm or Customize
1249
+ ## Step 1: Build Confidence Map and Fill Gaps
1202
1250
 
1203
- Claude presents what it found. You confirm or override:
1251
+ Claude assigns a state to each configuration data point based on scan results. **RESOLVED (detected)** items are presented for bulk confirmation. **RESOLVED (inferred)** items are presented with inferred values for the user to verify. **UNRESOLVED** items become questions. **The number of questions is dynamic — it depends on how much the scan resolves.** Stop asking when ALL data points are resolved (detected, inferred+confirmed, or answered by user).
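A rough TypeScript sketch of this resolution logic (type and function names are illustrative, not the wizard's internals; the 0.95 threshold is the 95% aggregate confidence threshold from the changelog):

```typescript
// Illustrative types and logic for the confidence map.
type Resolution =
  | { state: "resolved-detected"; value: string }                      // scanned directly
  | { state: "resolved-inferred"; value: string; confidence: number }  // verify with user
  | { state: "unresolved" };                                           // becomes a question

interface DataPoint {
  id: string; // e.g. "test-framework", "lint-command"
  detect(): Resolution;
}

const CONFIDENCE_THRESHOLD = 0.95;

function planQuestions(points: DataPoint[]): string[] {
  const questions: string[] = [];
  for (const p of points) {
    const r = p.detect();
    if (r.state === "unresolved") {
      questions.push(p.id);
    } else if (r.state === "resolved-inferred" && r.confidence < CONFIDENCE_THRESHOLD) {
      questions.push(p.id); // low-confidence inference: ask instead of assuming
    }
  }
  return questions; // 0-2 on well-configured repos, 10+ on bare ones
}
```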
1252
+
1253
+ Claude presents what it found, organized by resolution state:
1204
1254
 
1205
1255
  ### Project Structure (Auto-Detected)
1206
1256
 
@@ -1209,13 +1259,13 @@ Claude presents what it found. You confirm or override:
1209
1259
  Override? (leave blank to accept): _______________
1210
1260
  ```
1211
1261
 
1212
- **Q2: Where do your tests live?**
1262
+ **Test directory** (detect from tests/, __tests__/, spec/, test file patterns)
1213
1263
  ```
1214
1264
  Examples: tests/, __tests__/, src/**/*.test.ts, spec/
1215
1265
  Your answer: _______________
1216
1266
  ```
1217
1267
 
1218
- **Q3: What's your test framework?**
1268
+ **Test framework** (detect from jest.config, vitest.config, pytest.ini, etc.)
1219
1269
  ```
1220
1270
  Options: Jest, Vitest, Playwright, Cypress, pytest, Go testing, other
1221
1271
  Your answer: _______________
@@ -1223,31 +1273,31 @@ Your answer: _______________
1223
1273
 
1224
1274
  ### Commands
1225
1275
 
1226
- **Q4: What runs your linter?**
1276
+ **Lint command** (detect from package.json scripts, Makefile, config files)
1227
1277
  ```
1228
1278
  Examples: npm run lint, pnpm lint, eslint ., biome check
1229
1279
  Your answer: _______________
1230
1280
  ```
1231
1281
 
1232
- **Q5: What runs type checking?**
1282
+ **Type-check command** (detect from tsconfig.json, mypy.ini, etc.)
1233
1283
  ```
1234
1284
  Examples: npm run typecheck, tsc --noEmit, mypy, none
1235
1285
  Your answer: _______________
1236
1286
  ```
1237
1287
 
1238
- **Q6: What runs all tests?**
1288
+ **Run all tests command** (detect from package.json "test" script, Makefile)
1239
1289
  ```
1240
1290
  Examples: npm run test, pnpm test, pytest, go test ./...
1241
1291
  Your answer: _______________
1242
1292
  ```
1243
1293
 
1244
- **Q7: What runs a specific test file?**
1294
+ **Run single test file command** (infer from framework: jest → jest path, pytest → pytest path)
1245
1295
  ```
1246
1296
  Examples: npm run test -- path/to/test.ts, pytest path/to/test.py
1247
1297
  Your answer: _______________
1248
1298
  ```
1249
1299
 
1250
- **Q8: What builds for production?**
1300
+ **Production build command** (detect from package.json "build" script, Makefile)
1251
1301
  ```
1252
1302
  Examples: npm run build, pnpm build, go build, cargo build
1253
1303
  Your answer: _______________
@@ -1255,7 +1305,7 @@ Your answer: _______________
1255
1305
 
1256
1306
  ### Deployment
1257
1307
 
1258
- **Q8.5: How do you deploy? (auto-detected, confirm or override)**
1308
+ **Deployment setup** (auto-detected from Dockerfile, vercel.json, fly.toml, deploy scripts)
1259
1309
  ```
1260
1310
  Detected: [e.g., Vercel, GitHub Actions, Docker, none]
1261
1311
 
@@ -1278,19 +1328,19 @@ Your answer: _______________
1278
1328
 
1279
1329
  ### Infrastructure
1280
1330
 
1281
- **Q9: What database(s) do you use?**
1331
+ **Database(s)** (detect from prisma/, .env DB vars, docker-compose services)
1282
1332
  ```
1283
1333
  Examples: PostgreSQL, MySQL, SQLite, MongoDB, none
1284
1334
  Your answer: _______________
1285
1335
  ```
1286
1336
 
1287
- **Q10: Do you use caching (Redis, etc.)?**
1337
+ **Caching layer** (detect from .env REDIS vars, docker-compose redis service)
1288
1338
  ```
1289
1339
  Examples: Redis, Memcached, none
1290
1340
  Your answer: _______________
1291
1341
  ```
1292
1342
 
1293
- **Q11: How long do your tests take?**
1343
+ **Test duration** (estimate from test file count, CI run times if available)
1294
1344
  ```
1295
1345
  Examples: <1 minute, 1-5 minutes, 5+ minutes
1296
1346
  Your answer: _______________
@@ -1298,7 +1348,7 @@ Your answer: _______________
1298
1348
 
1299
1349
  ### Output Preferences
1300
1350
 
1301
- **Q12: How much detail in Claude's responses?**
1351
+ **Response detail level** (cannot detect; always ask if no preference found)
1302
1352
  ```
1303
1353
  Options:
1304
1354
  - Small - Minimal output, just essentials (experienced users)
@@ -1316,7 +1366,7 @@ Stored in `.claude/settings.json` as `"verbosity": "small|medium|large"`.
1316
1366
 
1317
1367
  ### Testing Philosophy
1318
1368
 
1319
- **Q13: What's your testing approach?**
1369
+ **Testing approach** (infer from existing test patterns — test-first files, coverage config)
1320
1370
  ```
1321
1371
  Options:
1322
1372
  - Strict TDD (test first always)
@@ -1327,7 +1377,7 @@ Options:
1327
1377
  Your answer: _______________
1328
1378
  ```
1329
1379
 
1330
- **Q14: What types of tests do you want?**
1380
+ **Test types** (detect from existing test file patterns: *.test.*, *.spec.*, e2e/, integration/)
1331
1381
  ```
1332
1382
  (Check all that apply)
1333
1383
  [ ] Unit tests (pure logic, isolated)
@@ -1337,7 +1387,7 @@ Your answer: _______________
1337
1387
  [ ] Other: _______________
1338
1388
  ```
1339
1389
 
1340
- **Q15: Your mocking philosophy?**
1390
+ **Mocking philosophy** (detect from jest.mock, unittest.mock usage patterns)
1341
1391
  ```
1342
1392
  Options:
1343
1393
  - Minimal mocking (real DB, mock external APIs only)
@@ -1352,7 +1402,7 @@ Your answer: _______________
1352
1402
  **If test framework detected (Jest, pytest, Go, etc.):**
1353
1403
 
1354
1404
  ```
1355
- Q16: Code Coverage (Optional)
1405
+ Code Coverage (Optional)
1356
1406
 
1357
1407
  Detected: [test framework] with coverage configuration
1358
1408
 
@@ -1373,7 +1423,7 @@ Your answer: _______________
1373
1423
  **If no test framework detected (docs/AI-heavy project):**
1374
1424
 
1375
1425
  ```
1376
- Q16: Code Coverage (Optional)
1426
+ Code Coverage (Optional)
1377
1427
 
1378
1428
  No test framework detected (documentation/AI-heavy project).
1379
1429
 
@@ -1393,19 +1443,19 @@ Your answer: _______________
1393
1443
 
1394
1444
  ---
1395
1445
 
1396
- ### Using Your Answers
1446
+ ### How Configuration Data Points Map to Files
1397
1447
 
1398
- Your answers map to these files:
1448
+ Each resolved data point (whether detected or confirmed by the user) maps to generated files:
1399
1449
 
1400
- | Question | Used In |
1401
- |----------|---------|
1402
- | Q1 (source dir) | `tdd-pretool-check.sh` - pattern match |
1403
- | Q2 (test dir) | `TESTING.md` - documentation |
1404
- | Q3 (test framework) | `TESTING.md` - documentation |
1405
- | Q4-Q8 (commands) | `CLAUDE.md` - Commands section |
1406
- | Q9-Q10 (infra) | `CLAUDE.md` - Architecture section, `TESTING.md` - mock decisions |
1407
- | Q11 (test duration) | `SDLC skill` - wait time note |
1408
- | Q12 (E2E) | `TESTING.md` - testing diamond top |
1450
+ | Data Point | Used In |
1451
+ |-----------|---------|
1452
+ | Source directory | `tdd-pretool-check.sh` - pattern match (sketch below) |
1453
+ | Test directory | `TESTING.md` - documentation |
1454
+ | Test framework | `TESTING.md` - documentation |
1455
+ | Commands (lint, typecheck, test, build) | `CLAUDE.md` - Commands section |
1456
+ | Infrastructure (DB, cache) | `CLAUDE.md` - Architecture section, `TESTING.md` - mock decisions |
1457
+ | Test duration | `SDLC skill` - wait time note |
1458
+ | Test types (E2E) | `TESTING.md` - testing diamond top |
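To make the first row concrete: a hypothetical TypeScript port of the pattern-match idea. The wizard actually ships this logic as `tdd-pretool-check.sh`, and the hook payload shape shown here is an assumption:

```typescript
// Hypothetical port: check whether a pre-tool edit targets source code
// without a test file in play, using the resolved "Source directory" value.
import { readFileSync } from "node:fs";

const SOURCE_DIR = "src/"; // filled in from the resolved source-directory data point

const payload = JSON.parse(readFileSync(0, "utf8")); // hook input arrives on stdin
const filePath: string = payload?.tool_input?.file_path ?? "";

if (filePath.startsWith(SOURCE_DIR) && !/\.(test|spec)\./.test(filePath)) {
  // Source edit with no obvious test file: nudge toward TDD RED first
  console.log("Reminder: write a failing test first (TDD RED) before editing source.");
}
```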
1409
1459
 
1410
1460
  ---
1411
1461
 
@@ -1654,13 +1704,14 @@ TodoWrite([
1654
1704
  { content: "Find and read relevant documentation", status: "in_progress", activeForm: "Reading docs" },
1655
1705
  { content: "Assess doc health - flag issues (ask before cleaning)", status: "pending", activeForm: "Checking doc health" },
1656
1706
  { content: "DRY scan: What patterns exist to reuse?", status: "pending", activeForm: "Scanning for reusable patterns" },
1707
+ { content: "Prove It Gate: adding new component? Research alternatives, prove quality with tests", status: "pending", activeForm: "Checking prove-it gate" },
1657
1708
  { content: "Blast radius: What depends on code I'm changing?", status: "pending", activeForm: "Checking dependencies" },
1658
1709
  { content: "Restate task in own words - verify understanding", status: "pending", activeForm: "Verifying understanding" },
1659
1710
  { content: "Scrutinize test design - right things tested? Follow TESTING.md?", status: "pending", activeForm: "Reviewing test approach" },
1660
1711
  { content: "Present approach + STATE CONFIDENCE LEVEL", status: "pending", activeForm: "Presenting approach" },
1661
1712
  { content: "Signal ready - user exits plan mode", status: "pending", activeForm: "Awaiting plan approval" },
1662
1713
  // TRANSITION PHASE (After plan mode, before compact)
1663
- { content: "Update feature docs with discovered gotchas", status: "pending", activeForm: "Updating feature docs" },
1714
+ { content: "Doc sync: update feature docs if code change contradicts or extends documented behavior", status: "pending", activeForm: "Syncing feature docs" },
1664
1715
  { content: "Request /compact before TDD", status: "pending", activeForm: "Requesting compact" },
1665
1716
  // IMPLEMENTATION PHASE (After compact)
1666
1717
  { content: "TDD RED: Write failing test FIRST", status: "pending", activeForm: "Writing failing test" },
@@ -1695,13 +1746,29 @@ TodoWrite([
1695
1746
  - Does test approach follow TESTING.md philosophies?
1696
1747
  - If introducing new test patterns, same scrutiny as code patterns
1697
1748
 
1749
+ ## Prove It Gate (REQUIRED for New Additions)
1750
+
1751
+ **Adding a new skill, hook, workflow, or component? PROVE IT FIRST:**
1752
+
1753
+ 1. **Research:** Does something equivalent already exist (native CC, third-party plugin, existing skill)?
1754
+ 2. **If YES:** Why is yours better? Show evidence (A/B test, quality comparison, gap analysis)
1755
+ 3. **If NO:** What gap does this fill? Is the gap real or theoretical?
1756
+ 4. **Quality tests:** New additions MUST have tests that prove OUTPUT QUALITY, not just existence
1757
+ 5. **Less is more:** Every addition is maintenance burden. Default answer is NO unless proven YES
1758
+
1759
+ **Existence tests are NOT quality tests:**
1760
+ - BAD: "ci-analyzer skill file exists" — proves nothing about quality
1761
+ - GOOD: "ci-analyzer recommends lint-first when test-before-lint detected" — proves behavior
1762
+
1763
+ **If you can't write a quality test for it, you can't prove it works, so don't add it.**
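A minimal sketch of the BAD/GOOD difference above, assuming Vitest and a hypothetical `runSkill` helper that executes a skill against a fixture repo:

```typescript
import { describe, it, expect } from "vitest";
import { existsSync } from "node:fs";
import { runSkill } from "./helpers"; // hypothetical fixture-runner helper

describe("existence vs quality", () => {
  // Existence test: proves only that the file was created
  it("skill file exists", () => {
    expect(existsSync(".claude/skills/ci-analyzer/SKILL.md")).toBe(true);
  });

  // Quality test: proves behavior on a known-bad fixture
  it("recommends lint-first when tests run before lint", async () => {
    const report = await runSkill("ci-analyzer", { fixture: "repos/test-before-lint" });
    expect(report.recommendations.some((r: string) => /lint.*before.*test/i.test(r))).toBe(true);
  });
});
```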
1764
+
1698
1765
  ## Plan Mode Integration
1699
1766
 
1700
1767
  **Use plan mode for:** Multi-file changes, new features, LOW confidence, bugs needing investigation.
1701
1768
 
1702
1769
  **Workflow:**
1703
1770
  1. **Plan Mode** (editing blocked): Research → Write plan file → Present approach + confidence
1704
- 2. **Transition** (after approval): Update feature docs → Request /compact
1771
+ 2. **Transition** (after approval): Doc sync (update feature docs if code contradicts/extends them) → Request /compact
1705
1772
  3. **Implementation** (after compact): TDD RED → GREEN → PASS
1706
1773
 
1707
1774
  **Before TDD, MUST ask:** "Docs updated. Run `/compact` before implementation?"
@@ -1744,7 +1811,7 @@ PLANNING → DOCS → TDD RED → TDD GREEN → Tests Pass → Self-Review
1744
1811
 
1745
1812
  ## Cross-Model Review (If Configured)
1746
1813
 
1747
- **When to run:** High-stakes changes (auth, payments, data handling), complex refactors, research-heavy work.
1814
+ **When to run:** High-stakes changes (auth, payments, data handling), releases/publishes (version bumps, CHANGELOG, npm publish), complex refactors, research-heavy work.
1748
1815
  **When to skip:** Trivial changes (typo fixes, config tweaks), time-sensitive hotfixes, risk < review cost.
1749
1816
 
1750
1817
  **Prerequisites:** Codex CLI installed (`npm i -g @openai/codex`), OpenAI API key set.
@@ -1849,6 +1916,17 @@ Self-review passes → handoff.json (round 1, PENDING_REVIEW)
1849
1916
 
1850
1917
  **Full protocol:** See the "Cross-Model Review Loop (Optional)" section below for key flags and reasoning effort guidance.
1851
1918
 
1919
+ ### Release Review Focus
1920
+
1921
+ Before any release/publish, add these to `review_instructions`:
1922
+ - **CHANGELOG consistency** — all sections present, no lost entries during consolidation
1923
+ - **Version parity** — package.json, SDLC.md, CHANGELOG, wizard metadata all match
1924
+ - **Stale examples** — hardcoded version strings in docs match current release
1925
+ - **Docs accuracy** — README, ARCHITECTURE.md reflect current feature set
1926
+ - **CLI-distributed file parity** — live skills, hooks, settings match CLI templates
1927
+
1928
+ Evidence: v1.20.0 cross-model review caught CHANGELOG section loss and stale wizard version examples that passed all tests and self-review.
1929
+
1852
1930
  ## Test Review (Harder Than Implementation)
1853
1931
 
1854
1932
  During self-review, critique tests HARDER than app code:
@@ -1898,7 +1976,7 @@ Debug it. Find root cause. Fix it properly. Tests ARE code.
1898
1976
 
1899
1977
  ## Flaky Test Prevention
1900
1978
 
1901
- **Flaky tests are bugs. Period.** They erode trust in the test suite, slow down teams, and mask real regressions.
1979
+ **Flaky tests are bugs. Period.** They erode trust in the test suite, slow down teams, and mask real regressions. For a deep dive, see: [How do you Address and Prevent Flaky Tests?](https://softwareautomation.notion.site/How-do-you-Address-and-Prevent-Flaky-Tests-23c539e19b3c46eeb655642b95237dc0)
1902
1980
 
1903
1981
  ### Principles
1904
1982
 
@@ -1926,7 +2004,9 @@ Sometimes the flakiness is genuinely in CI infrastructure (runner environment, G
1926
2004
  - **Keep quality gates strict** — the actual pass/fail decision must NOT have `continue-on-error`
1927
2005
  - **Separate "fail the build" from "nice to have"** — a missing PR comment is not a regression
1928
2006
 
1929
- ## CI Feedback Loop (After Commit)
2007
+ ## CI Feedback Loop — Local Shepherd (After Commit)
2008
+
2009
+ **This is the "local shepherd" — your CI fix mechanism.** It runs in your active session with full context.
1930
2010
 
1931
2011
  **The SDLC doesn't end at local tests.** CI must pass too.
1932
2012
 
@@ -1972,7 +2052,7 @@ Local tests pass -> Commit -> Push -> Watch CI
1972
2052
  - Flaky? Investigate - flakiness is a bug
1973
2053
  - Stuck? ASK USER
1974
2054
 
1975
- ## CI Review Feedback Loop (After CI Passes)
2055
+ ## CI Review Feedback Loop — Local Shepherd (After CI Passes)
1976
2056
 
1977
2057
  **CI passing isn't the end.** If CI includes a code reviewer, read and address its suggestions.
1978
2058
 
@@ -2102,7 +2182,7 @@ Create `CLAUDE.md` in your project root. This is your project-specific configura
2102
2182
 
2103
2183
  ## Commands
2104
2184
 
2105
- <!-- CUSTOMIZE: Replace with your actual commands from Q4-Q8 -->
2185
+ <!-- CUSTOMIZE: Replace with your actual detected/confirmed commands -->
2106
2186
 
2107
2187
  - Build: `[your build command]`
2108
2188
  - Run dev: `[your dev command]`
@@ -2189,7 +2269,7 @@ These are your full reference docs. Start with stubs and expand over time:
2189
2269
 
2190
2270
  ## Environments
2191
2271
 
2192
- <!-- Claude auto-populates this from Q8.5 deployment detection -->
2272
+ <!-- Claude auto-populates this from deployment detection -->
2193
2273
 
2194
2274
  | Environment | URL | Deploy Command | Trigger |
2195
2275
  |-------------|-----|----------------|---------|
@@ -2236,7 +2316,7 @@ If deployment fails or post-deploy verification catches issues:
2236
2316
 
2237
2317
  | Environment | Rollback Command | Notes |
2238
2318
  |-------------|------------------|-------|
2239
- | Preview | [auto-expires or redeploy] | Usually self-heals |
2319
+ | Preview | [auto-expires or redeploy] | Ephemeral; redeploy to fix |
2240
2320
  | Staging | `[your rollback command]` | [notes] |
2241
2321
  | Production | `[your rollback command]` | [critical - document clearly] |
2242
2322
 
@@ -2266,7 +2346,7 @@ If deployment fails or post-deploy verification catches issues:
2266
2346
 
2267
2347
  **SDLC.md:**
2268
2348
  ```markdown
2269
- <!-- SDLC Wizard Version: 1.18.0 -->
2349
+ <!-- SDLC Wizard Version: 1.21.0 -->
2270
2350
  <!-- Setup Date: [DATE] -->
2271
2351
  <!-- Completed Steps: step-0.1, step-0.2, step-0.4, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
2272
2352
  <!-- Git Workflow: [PRs or Solo] -->
@@ -2577,9 +2657,17 @@ Want me to file these? (yes/no/not now)
2577
2657
 
2578
2658
  ## Going Further
2579
2659
 
2580
- ### Create Feature Plan Docs
2660
+ ### Feature Documentation
2581
2661
 
2582
- For each major feature, create `FEATURE_NAME_PLAN.md`:
2662
+ Keep feature docs alongside code. Three patterns, use what fits:
2663
+
2664
+ | Pattern | When to Use | Example |
2665
+ |---------|-------------|---------|
2666
+ | `*_PLAN.md` / `*_DOCS.md` | Per-feature living docs | `AUTH_DOCS.md`, `PAYMENTS_PLAN.md` |
2667
+ | `docs/decisions/NNN-title.md` (ADR) | Architecture decisions that need rationale | `docs/decisions/001-use-postgres.md` |
2668
+ | `docs/features/name.md` | Feature docs in a `docs/` directory | `docs/features/auth.md` |
2669
+
2670
+ **Feature doc template:**
2583
2671
 
2584
2672
  ```markdown
2585
2673
  # Feature Name
@@ -2597,7 +2685,36 @@ Things that can trip you up.
2597
2685
  What's planned but not done.
2598
2686
  ```
2599
2687
 
2600
- Claude will read these during planning and update them with discoveries.
2688
+ **ADR (Architecture Decision Record) template** for decisions that need context:
2689
+
2690
+ ```markdown
2691
+ # ADR-NNN: Decision Title
2692
+
2693
+ ## Status
2694
+ Accepted | Superseded by ADR-NNN | Deprecated
2695
+
2696
+ ## Context
2697
+ What is the problem? What forces are at play?
2698
+
2699
+ ## Decision
2700
+ What did we decide and why?
2701
+
2702
+ ## Consequences
2703
+ What are the trade-offs? What becomes easier/harder?
2704
+ ```
2705
+
2706
+ Store ADRs in `docs/decisions/`. Number sequentially. Claude reads these during planning to understand why things are built the way they are.
2707
+
2708
+ **Keeping docs in sync with code:**
2709
+
2710
+ Docs drift when code changes but docs don't. The SDLC skill's planning phase detects this:
2711
+
2712
+ - During planning, Claude reads feature docs for the area being changed
2713
+ - If the code change contradicts what the doc says, Claude updates the doc
2714
+ - The "After Session" step routes learnings to the right doc
2715
+ - Stale docs cause low confidence — if Claude struggles, the doc may need updating
2716
+
2717
+ **CLAUDE.md health:** Run `/claude-md-improver` periodically (quarterly or after major changes). It audits CLAUDE.md specifically — structure, clarity, completeness (6 criteria, 100-point rubric). It does NOT cover feature docs, TESTING.md, or ADRs — the SDLC workflow handles those.
2601
2718
 
2602
2719
  ### Expand TESTING.md
2603
2720
 
@@ -2621,6 +2738,10 @@ Add project-specific guidance to skills:
2621
2738
  - Preferred patterns
2622
2739
  - Architecture decisions
2623
2740
 
2741
+ ### Complementary Tools
2742
+
2743
+ The wizard handles SDLC process enforcement. For stack-specific tooling, run `/claude-automation-recommender` — it suggests MCP servers, formatting hooks, type-checking hooks, subagent templates, and plugins based on your detected tech stack. See [Step 0.3](#step-03-additional-recommendations-optional) for the full comparison.
2744
+
2624
2745
  ---
2625
2746
 
2626
2747
  ## Testing AI Apps: What's Different
@@ -2688,6 +2809,49 @@ _Sources: [Confident AI](https://www.confident-ai.com/blog/llm-testing-in-2024-t
2688
2809
 
2689
2810
  ---
2690
2811
 
2812
+ ## Token Efficiency
2813
+
2814
+ Practical techniques to reduce token consumption without sacrificing quality.
2815
+
2816
+ ### Monitor Costs
2817
+
2818
+ | Tool | What It Shows | When to Use |
2819
+ |------|---------------|-------------|
2820
+ | `/cost` | Session total: USD, API time, code changes | After a session to review spend |
2821
+ | `/context` | What's consuming context window space | When hitting context limits |
2822
+ | Status line | Real-time `cost.total_cost_usd` + token counts | Continuous monitoring (sketch below) |
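For the status-line row, a minimal script sketch. This assumes Claude Code pipes session JSON, including `cost.total_cost_usd`, to whatever command you configure as your status line:

```typescript
// Minimal status-line command: read session JSON from stdin, print spend.
import { readFileSync } from "node:fs";

const session = JSON.parse(readFileSync(0, "utf8"));
const usd: number = session?.cost?.total_cost_usd ?? 0;
console.log(`$${usd.toFixed(2)} this session`);
```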
2823
+
2824
+ ### Reduce Consumption
2825
+
2826
+ | Technique | Savings | How |
2827
+ |-----------|---------|-----|
2828
+ | `/compact` between phases | ~40-60% context | Plan → compact → implement (plan preserved) |
2829
+ | `/clear` between tasks | 100% context reset | No stale context from prior work |
2830
+ | Delegate verbose ops to subagents | Separate context | `Agent` tool returns summary, not full output |
2831
+ | Use skills for on-demand knowledge | Smaller base context | Skills load only when invoked |
2832
+ | Scope investigations narrowly | Fewer tokens read | "investigate auth module" > "investigate codebase" |
2833
+ | `--effort low` for simple tasks | ~50% thinking tokens | Simple renames, config changes |
2834
+
2835
+ ### CI Cost Control
2836
+
2837
+ Add `--max-budget-usd` to CI workflows as a safety net:
2838
+
2839
+ ```yaml
2840
+ claude_args: "--max-budget-usd 5.00 --max-turns 30"
2841
+ ```
2842
+
2843
+ | Flag | Purpose |
2844
+ |------|---------|
2845
+ | `--max-budget-usd` | Hard dollar cap per CI invocation |
2846
+ | `--max-turns` | Limit agentic turns (prevents infinite loops) |
2847
+ | `--effort` | `low`/`medium`/`high` controls thinking depth |
2848
+
2849
+ ### Advanced: OpenTelemetry
2850
+
2851
+ For organization-wide cost tracking, enable `CLAUDE_CODE_ENABLE_TELEMETRY=1`. This exports per-request `cost_usd`, `input_tokens`, `output_tokens` to any OTLP-compatible backend (Datadog, Honeycomb, Prometheus).
2852
+
2853
+ ---
2854
+
2691
2855
  ## CI/CD Gotchas
2692
2856
 
2693
2857
  Common pitfalls when automating AI-assisted development workflows.
@@ -2749,85 +2913,6 @@ Claude: [fetches via gh api, discusses with you interactively]
2749
2913
 
2750
2914
  This is optional - skip if you prefer fresh reviews only.
2751
2915
 
2752
- ### CI Auto-Fix Loop (Optional)
2753
-
2754
- Automatically fix CI failures and PR review findings. Claude reads the error context, fixes the code, commits, and re-triggers CI. Loops until CI passes AND review has no findings at your chosen level, or max retries hit.
2755
-
2756
- **The Loop:**
2757
- ```
2758
- Push to PR
2759
- |
2760
- v
2761
- CI runs ──► FAIL ──► ci-autofix: Claude reads logs, fixes, commits [autofix 1/3] ──► re-trigger
2762
- |
2763
- └── PASS ──► PR Review ──► has findings at your level? ──► ci-autofix: fixes all ──► re-trigger
2764
- |
2765
- └── APPROVE, no findings ──► DONE
2766
- ```
2767
-
2768
- **Safety measures:**
2769
- - Never runs on main branch
2770
- - Max retries (default 3, configurable via `MAX_AUTOFIX_RETRIES`)
2771
- - `AUTOFIX_LEVEL` controls what findings to act on (`ci-only`, `criticals`, `all-findings`)
2772
- - Restricted Claude tools (no git, no npm)
2773
- - Self-modification ban (can't edit its own workflow file)
2774
- - `[autofix N/M]` commit tags for audit trail
2775
- - Sticky PR comments show status
2776
-
2777
- **Setup:**
2778
- 1. Create `.github/workflows/ci-autofix.yml`:
2779
-
2780
- ```yaml
2781
- name: CI Auto-Fix
2782
-
2783
- on:
2784
- workflow_run:
2785
- workflows: ["CI", "PR Code Review"]
2786
- types: [completed]
2787
-
2788
- permissions:
2789
- contents: write
2790
- pull-requests: write
2791
-
2792
- env:
2793
- MAX_AUTOFIX_RETRIES: 3
2794
- AUTOFIX_LEVEL: criticals # ci-only | criticals | all-findings
2795
-
2796
- jobs:
2797
- autofix:
2798
- runs-on: ubuntu-latest
2799
- if: |
2800
- github.event.workflow_run.head_branch != 'main' &&
2801
- github.event.workflow_run.event == 'pull_request' &&
2802
- (
2803
- (github.event.workflow_run.name == 'CI' && github.event.workflow_run.conclusion == 'failure') ||
2804
- (github.event.workflow_run.name == 'PR Code Review' && github.event.workflow_run.conclusion == 'success')
2805
- )
2806
- steps:
2807
- # Count previous [autofix] commits to enforce max retries
2808
- # Download CI failure logs or fetch review comment
2809
- # Check findings at your AUTOFIX_LEVEL (criticals + suggestions)
2810
- # Run Claude to fix ALL findings with restricted tools
2811
- # Commit [autofix N/M], push, re-trigger CI
2812
- # Post sticky PR comment with status
2813
- ```
2814
-
2815
- 2. Add `workflow_dispatch:` trigger to your CI workflow (so autofix can re-trigger it)
2816
- 3. Optionally configure a GitHub App for token generation (avoids `workflow_run` default-branch constraint)
2817
-
2818
- **Token approaches:**
2819
-
2820
- | Approach | When | Pros |
2821
- |----------|------|------|
2822
- | GITHUB_TOKEN + `gh workflow run` | Default | No extra setup |
2823
- | GitHub App token | `CI_AUTOFIX_APP_ID` secret exists | Push triggers `synchronize` naturally |
2824
-
2825
- **Note:** `workflow_run` only fires for workflows on the default branch. The ci-autofix workflow is dormant until first merged to main.
2826
-
2827
- > **Template vs. this repo:** The template above uses `ci-autofix.yml` with `criticals` as a safe default for new projects. The wizard's own repo has evolved this into `ci-self-heal.yml` with `all-findings` — a more aggressive configuration we dogfood internally. Both naming conventions work; the behavior is identical.
2828
-
2829
- ---
2830
-
2831
2916
  ### Cross-Model Review Loop (Optional)
2832
2917
 
2833
2918
  Use an independent AI model from a different company as a code reviewer. The author can't grade their own homework — a model with different training data and different biases catches blind spots the authoring model misses.
@@ -2966,6 +3051,7 @@ Claude writes code → self-review passes → handoff.json (round 1)
2966
3051
 
2967
3052
  **When to use this:**
2968
3053
  - High-stakes changes (auth, payments, data handling)
3054
+ - **Releases and publishes** (version bumps, CHANGELOG, npm publish) — see Release Review Checklist below
2969
3055
  - Research-heavy work where accuracy matters more than speed
2970
3056
  - Complex refactors touching many files
2971
3057
  - Any time you want higher confidence before merging
@@ -2975,6 +3061,30 @@ Claude writes code → self-review passes → handoff.json (round 1)
2975
3061
  - Time-sensitive hotfixes
2976
3062
  - Changes where the review cost exceeds the risk
2977
3063
 
3064
+ #### Release Review Checklist
3065
+
3066
+ Before any release or npm publish, add these focus areas to the cross-model `review_instructions`:
3067
+
3068
+ **Why:** Self-review and automated tests regularly miss release-specific inconsistencies. Evidence: v1.20.0 cross-model review caught 2 real issues (CHANGELOG section lost during consolidation, stale hardcoded version examples) that passed all tests and self-review.
3069
+
3070
+ | Check | What to Look For | Example Failure |
3071
+ |-------|-------------------|-----------------|
3072
+ | CHANGELOG consistency | All sections present, no lost entries during consolidation | v1.19.0 section dropped when merging into v1.20.0 |
3073
+ | Version parity | package.json, SDLC.md, CHANGELOG, wizard metadata all match | SDLC.md says 1.19.0 but package.json says 1.20.0 |
3074
+ | Stale examples | Hardcoded version strings in docs/wizard match current release | Wizard examples showing v1.15.0 when publishing v1.20.0 |
3075
+ | Docs accuracy | README, ARCHITECTURE.md reflect current feature set | "8 workflows" when there are actually 7 |
3076
+ | CLI-distributed file parity | Live skills, hooks, settings match CLI templates | SKILL.md edited but cli/templates/ not updated |
3077
+
3078
+ **Example `review_instructions` for releases:**
3079
+ ```
3080
+ Review for release consistency: CHANGELOG completeness (no lost sections),
3081
+ version parity across package.json/SDLC.md/CHANGELOG/wizard metadata,
3082
+ stale hardcoded versions in examples, docs accuracy vs actual features,
3083
+ CLI-distributed file parity (skills, hooks, settings).
3084
+ ```
3085
+
3086
+ **This complements automated tests, not replaces them.** Tests catch exact version mismatches (e.g., `test_package_version_matches_changelog`). Cross-model review catches semantic issues tests cannot — a section silently dropped, examples using outdated but syntactically valid versions, docs describing features that no longer exist.
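For the exact-match side, a sketch of the kind of parity test referenced above, using the Node test runner (file locations are assumptions):

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";
import { readFileSync } from "node:fs";

test("package.json version has a CHANGELOG heading", () => {
  const { version } = JSON.parse(readFileSync("package.json", "utf8"));
  const changelog = readFileSync("CHANGELOG.md", "utf8");
  const heading = new RegExp(`^## \\[${version.replace(/\./g, "\\.")}\\]`, "m");
  assert.match(changelog, heading);
});

test("SDLC.md wizard metadata matches package.json", () => {
  const { version } = JSON.parse(readFileSync("package.json", "utf8"));
  const sdlc = readFileSync("SDLC.md", "utf8");
  assert.ok(sdlc.includes(`<!-- SDLC Wizard Version: ${version} -->`));
});
```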
3087
+
2978
3088
  ---
2979
3089
 
2980
3090
  ## User Understanding and Periodic Feedback
@@ -3079,21 +3189,19 @@ Claude reads the CHANGELOG to show you what's new **before** applying anything.
3079
3189
  ```
3080
3190
  Claude: "Fetching CHANGELOG to check for updates..."
3081
3191
 
3082
- Your version: 1.8.0
3083
- Latest version: 1.13.0
3192
+ Your version: X.Y.0
3193
+ Latest version: X.Z.0
3084
3194
 
3085
- What's new since 1.8.0:
3086
- - v1.13.0: Self-update improvements, optional CI notification
3087
- - v1.12.0: Full system audit, apply step fixes
3088
- - v1.11.0: Stale output cleanup, error handling
3089
- - v1.10.0: "Prove It's Better" CI automation
3090
- - v1.9.0: Workflow consolidation (6 → 5 workflows)
3195
+ What's new since X.Y.0:
3196
+ - vX.Z.0: Latest features and improvements
3197
+ - vX.Y+1.0: Previous version changes
3198
+ (... entries from CHANGELOG between your version and latest ...)
3091
3199
 
3092
3200
  Now checking your setup against latest wizard...
3093
3201
 
3094
3202
  ✓ Hooks - up to date
3095
3203
  ✓ Skills - content differs (update available)
3096
- ✗ step-update-notify - NOT DONE (new in v1.13.0, optional)
3204
+ ✗ step-update-notify - NOT DONE (new in vX.Z.0, optional)
3097
3205
 
3098
3206
  Summary:
3099
3207
  - 1 file update available (SDLC skill)
@@ -3109,7 +3217,7 @@ Walk through updates? (y/n)
3109
3217
  Store wizard state in `SDLC.md` as metadata comments (invisible to readers, parseable by Claude):
3110
3218
 
3111
3219
  ```markdown
3112
- <!-- SDLC Wizard Version: 1.18.0 -->
3220
+ <!-- SDLC Wizard Version: 1.21.0 -->
3113
3221
  <!-- Setup Date: 2026-01-24 -->
3114
3222
  <!-- Completed Steps: step-0.1, step-0.2, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
3115
3223
  <!-- Git Workflow: PRs -->
package/README.md CHANGED
@@ -83,7 +83,7 @@ Layer 1: PHILOSOPHY
83
83
  | **SDP normalization** | Separates "the model had a bad day" from "our SDLC broke" by cross-referencing external benchmarks |
84
84
 | **CUSUM drift detection** | Catches gradual quality decay over time — borrowed from manufacturing quality control (sketch below) |
85
85
  | **Pre-tool TDD hooks** | Before source edits, a hook reminds Claude to write tests first. CI scoring checks whether it actually followed TDD |
86
- | **Self-evolving loop** | Weekly/monthly external research + CI friction signals from self-heal — you approve, the system gets better |
86
+ | **Self-evolving loop** | Weekly/monthly external research + local CI shepherd loop — you approve, the system gets better |
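For intuition on the CUSUM row above, a minimal sketch (target, slack, and threshold values are illustrative, not the wizard's tuned parameters):

```typescript
// One-sided CUSUM on E2E scores: accumulate small shortfalls below target
// until they cross a decision threshold.
function cusumDriftIndex(scores: number[], target = 90, slack = 1, threshold = 5): number | null {
  let s = 0;
  for (let i = 0; i < scores.length; i++) {
    s = Math.max(0, s + (target - scores[i] - slack)); // grows only on shortfalls
    if (s > threshold) return i; // the run where gradual decay becomes a signal
  }
  return null; // no drift detected
}
```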
87
87
 
88
88
  ## How It Works
89
89
 
@@ -186,14 +186,14 @@ This isn't the only Claude Code SDLC tool. Here's an honest comparison:
186
186
  |--------|------------|----------------------|-------------|
187
187
  | **Focus** | SDLC enforcement + measurement | Agent performance optimization | Plugin marketplace |
188
188
  | **Hooks** | 3 (SDLC, TDD, instructions) | 12+ (dev blocker, prettier, etc.) | Webhook watcher |
189
- | **Skills** | 2 (/sdlc, /setup) | 80+ domain-specific | 13 slash commands |
189
+ | **Skills** | 3 (/sdlc, /setup, /update) | 80+ domain-specific | 13 slash commands |
190
190
  | **Evaluation** | 95% CI, CUSUM, SDP, Tier 1/2 | Configuration testing | skilltest framework |
191
- | **Self-healing** | CI auto-fix + re-trigger | No | No |
191
+ | **CI Shepherd** | Local CI fix loop | No | No |
192
192
  | **Auto-updates** | Weekly CC + community scan | No | No |
193
193
  | **Install** | `npx agentic-sdlc-wizard init` | npm install | npm install |
194
194
  | **Philosophy** | Lightweight, prove-it-or-delete | Scale and optimization | Documentation-first |
195
195
 
196
- **Our unique strengths:** Statistical rigor (CUSUM + 95% CI), SDP scoring (model quality vs SDLC compliance), self-healing CI, Prove-It A/B pipeline, comprehensive automated test suite, dogfooding enforcement.
196
+ **Our unique strengths:** Statistical rigor (CUSUM + 95% CI), SDP scoring (model quality vs SDLC compliance), CI shepherd loop, Prove-It A/B pipeline, comprehensive automated test suite, dogfooding enforcement.
197
197
 
198
198
  **Where others are stronger:** everything-claude-code has broader language/framework coverage. claude-sdlc has webhook-driven automation. Both have npm distribution.
199
199
 
@@ -204,7 +204,7 @@ This isn't the only Claude Code SDLC tool. Here's an honest comparison:
204
204
  | Document | What It Covers |
205
205
  |----------|---------------|
206
206
  | [ARCHITECTURE.md](ARCHITECTURE.md) | System design, 5-layer diagram, data flows, file structure |
207
- | [CI_CD.md](CI_CD.md) | All 5 workflows, E2E scoring, tier system, SDP, integrity checks |
207
+ | [CI_CD.md](CI_CD.md) | All 4 workflows, E2E scoring, tier system, SDP, integrity checks |
208
208
  | [SDLC.md](SDLC.md) | Version tracking, enforcement rules, SDLC configuration |
209
209
  | [TESTING.md](TESTING.md) | Testing philosophy, test diamond, TDD approach |
210
210
  | [CHANGELOG.md](CHANGELOG.md) | Version history, what changed and when |
@@ -19,6 +19,7 @@ TodoWrite([
19
19
  { content: "Find and read relevant documentation", status: "in_progress", activeForm: "Reading docs" },
20
20
  { content: "Assess doc health - flag issues (ask before cleaning)", status: "pending", activeForm: "Checking doc health" },
21
21
  { content: "DRY scan: What patterns exist to reuse? New pattern = get approval", status: "pending", activeForm: "Scanning for reusable patterns" },
22
+ { content: "Prove It Gate: adding new component? Research alternatives, prove quality with tests", status: "pending", activeForm: "Checking prove-it gate" },
22
23
  { content: "Blast radius: What depends on code I'm changing?", status: "pending", activeForm: "Checking dependencies" },
23
24
  { content: "Design system check (if UI change)", status: "pending", activeForm: "Checking design system" },
24
25
  { content: "Restate task in own words - verify understanding", status: "pending", activeForm: "Verifying understanding" },
@@ -26,7 +27,7 @@ TodoWrite([
26
27
  { content: "Present approach + STATE CONFIDENCE LEVEL", status: "pending", activeForm: "Presenting approach" },
27
28
  { content: "Signal ready - user exits plan mode", status: "pending", activeForm: "Awaiting plan approval" },
28
29
  // TRANSITION PHASE (After plan mode)
29
- { content: "Update feature docs with discovered gotchas", status: "pending", activeForm: "Updating feature docs" },
30
+ { content: "Doc sync: update feature docs if code change contradicts or extends documented behavior", status: "pending", activeForm: "Syncing feature docs" },
30
31
  // IMPLEMENTATION PHASE
31
32
  { content: "TDD RED: Write failing test FIRST", status: "pending", activeForm: "Writing failing test" },
32
33
  { content: "TDD GREEN: Implement, verify test passes", status: "pending", activeForm: "Implementing feature" },
@@ -84,6 +85,22 @@ Critical miss on `tdd_red` or `self_review` = process failure regardless of tota
84
85
  - Does test approach follow TESTING.md philosophies?
85
86
  - If introducing new test patterns, same scrutiny as code patterns
86
87
 
88
+ ## Prove It Gate (REQUIRED for New Additions)
89
+
90
+ **Adding a new skill, hook, workflow, or component? PROVE IT FIRST:**
91
+
92
+ 1. **Research:** Does something equivalent already exist (native CC, third-party plugin, existing skill)?
93
+ 2. **If YES:** Why is yours better? Show evidence (A/B test, quality comparison, gap analysis)
94
+ 3. **If NO:** What gap does this fill? Is the gap real or theoretical?
95
+ 4. **Quality tests:** New additions MUST have tests that prove OUTPUT QUALITY, not just existence
96
+ 5. **Less is more:** Every addition is maintenance burden. Default answer is NO unless proven YES
97
+
98
+ **Existence tests are NOT quality tests:**
99
+ - BAD: "ci-analyzer skill file exists" — proves nothing about quality
100
+ - GOOD: "ci-analyzer recommends lint-first when test-before-lint detected" — proves behavior
101
+
102
+ **If you can't write a quality test for it, you can't prove it works, so don't add it.**
103
+
87
104
  ## Plan Mode Integration
88
105
 
89
106
  **Use plan mode for:** Multi-file changes, new features, LOW confidence, bugs needing investigation.
@@ -131,7 +148,7 @@ PLANNING -> DOCS -> TDD RED -> TDD GREEN -> Tests Pass -> Self-Review
131
148
 
132
149
  ## Cross-Model Review (If Configured)
133
150
 
134
- **When to run:** High-stakes changes (auth, payments, data handling), complex refactors, research-heavy work.
151
+ **When to run:** High-stakes changes (auth, payments, data handling), releases/publishes (version bumps, CHANGELOG, npm publish), complex refactors, research-heavy work.
135
152
  **When to skip:** Trivial changes (typo fixes, config tweaks), time-sensitive hotfixes, risk < review cost.
136
153
 
137
154
  **Prerequisites:** Codex CLI installed (`npm i -g @openai/codex`), OpenAI API key set.
@@ -236,6 +253,17 @@ Self-review passes → handoff.json (round 1, PENDING_REVIEW)
236
253
 
237
254
  **Full protocol:** See the wizard's "Cross-Model Review Loop (Optional)" section for key flags and reasoning effort guidance.
238
255
 
256
+ ### Release Review Focus
257
+
258
+ Before any release/publish, add these to `review_instructions`:
259
+ - **CHANGELOG consistency** — all sections present, no lost entries during consolidation
260
+ - **Version parity** — package.json, SDLC.md, CHANGELOG, wizard metadata all match
261
+ - **Stale examples** — hardcoded version strings in docs match current release
262
+ - **Docs accuracy** — README, ARCHITECTURE.md reflect current feature set
263
+ - **CLI-distributed file parity** — live skills, hooks, settings match CLI templates
264
+
265
+ Evidence: v1.20.0 cross-model review caught CHANGELOG section loss and stale wizard version examples that passed all tests and self-review. Tests catch version mismatches; cross-model review catches semantic issues tests cannot.
266
+
239
267
  ## Test Review (Harder Than Implementation)
240
268
 
241
269
  During self-review, critique tests HARDER than app code:
@@ -280,6 +308,8 @@ Everything else needs integration tests.
280
308
 
281
309
  ## Flaky Test Recovery
282
310
 
311
+ **Flaky tests are bugs. Period.** See: [How do you Address and Prevent Flaky Tests?](https://softwareautomation.notion.site/How-do-you-Address-and-Prevent-Flaky-Tests-23c539e19b3c46eeb655642b95237dc0)
312
+
283
313
  When a test fails intermittently:
284
314
  1. **Don't dismiss it** — "flaky" means "bug we haven't found yet"
285
315
  2. **Identify the layer** — test code? app code? environment?
@@ -333,7 +363,9 @@ If tests fail:
333
363
 
334
364
  Debug it. Find root cause. Fix it properly. Tests ARE code.
335
365
 
336
- ## CI Feedback Loop (After Commit)
366
+ ## CI Feedback Loop — Local Shepherd (After Commit)
367
+
368
+ **This is the "local shepherd" — the CI fix mechanism.** It runs in your active session with full context.
337
369
 
338
370
  **The SDLC doesn't end at local tests.** CI must pass too.
339
371
 
@@ -379,7 +411,7 @@ Local tests pass -> Commit -> Push -> Watch CI
379
411
  - Flaky? Investigate - flakiness is a bug
380
412
  - Stuck? ASK USER
381
413
 
382
- ## CI Review Feedback Loop (After CI Passes)
414
+ ## CI Review Feedback Loop — Local Shepherd (After CI Passes)
383
415
 
384
416
  **CI passing isn't the end.** If CI includes a code reviewer, read and address its suggestions.
385
417
 
@@ -411,6 +443,14 @@ CI passes -> Read review suggestions
411
443
  - **Ask first**: Present suggestions to user, let them decide which to implement
412
444
  - **Skip review feedback**: Ignore CI review suggestions, only fix CI failures
413
445
 
446
+ ## Context Management
447
+
448
+ - `/compact` between planning and implementation (plan preserved in summary)
449
+ - `/clear` between unrelated tasks (stale context wastes tokens and misleads)
450
+ - `/clear` after 2+ failed corrections (context polluted — start fresh with better prompt)
451
+ - Auto-compact fires at ~95% capacity — no manual management needed
452
+ - After committing a PR, `/clear` before starting the next feature
453
+
414
454
  ## DRY Principle
415
455
 
416
456
  **Before coding:** "What patterns exist I can reuse?"
@@ -480,11 +520,25 @@ CI passes -> Read review suggestions
 
  **THE RULE:** Delete old code first. If it breaks, fix it properly.
 
+ ## Documentation Sync (During Planning)
+
+ When a code change affects a documented feature, update the doc in the same PR:
+
+ 1. **During planning**, read feature docs for the area being changed (`*_PLAN.md`, `*_DOCS.md`, `docs/features/`, `docs/decisions/`)
+ 2. If your code change contradicts what the doc says → update the doc
+ 3. If your code change extends behavior the doc describes → add to the doc
+ 4. If no feature doc exists and the change is substantial → note it in the summary (don't create one unprompted)
+
+ **Doc staleness signals:** Low confidence in an area often means the docs are stale, missing, or misleading. If you struggle during planning, check whether the docs match the actual code.
+
+ **CLAUDE.md health:** `/claude-md-improver` audits CLAUDE.md structure and completeness. Run it periodically. It does NOT cover feature docs — the SDLC workflow handles those.
+
  ## After Session (Capture Learnings)
 
  If this session revealed insights, update the right place:
  - **Testing patterns, gotchas** → `TESTING.md`
- - **Feature-specific quirks** → Feature docs (`*_PLAN.md`)
+ - **Feature-specific quirks** → Feature docs (`*_PLAN.md`, `*_DOCS.md`)
+ - **Architecture decisions** → `docs/decisions/` (ADR format) or `ARCHITECTURE.md`
  - **General project context** → `CLAUDE.md` (or `/revise-claude-md`)
 
  ---
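
The wizard does not automate this lookup; as a purely hypothetical helper, planning could start by grepping the doc locations listed above for the paths being changed:

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical helper: list docs that mention any changed file, so planning
// can check them for staleness. Doc locations follow the conventions above.
function docsTouchingChange(changedFiles: string[]): string[] {
  const hits = new Set<string>();
  for (const file of changedFiles) {
    try {
      const out = execFileSync(
        "git",
        ["grep", "-lF", file, "--", "docs/", "*_PLAN.md", "*_DOCS.md"],
        { encoding: "utf8" },
      );
      for (const doc of out.split("\n").filter(Boolean)) hits.add(doc);
    } catch {
      // git grep exits non-zero when nothing matches; that just means no hits.
    }
  }
  return [...hits];
}
```
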
@@ -1,17 +1,19 @@
  ---
  name: setup-wizard
- description: Interactive project setup wizard that scans the codebase, asks all 16 configuration questions, generates SDLC files (CLAUDE.md, SDLC.md, TESTING.md, ARCHITECTURE.md), and verifies the installation. Use this skill when setting up the SDLC wizard for the first time or re-running setup.
+ description: Setup wizard scans codebase, builds confidence per data point, only asks what it can't figure out, generates SDLC files. Use for first-time setup or re-running setup.
  argument-hint: [optional: regenerate | verify-only]
  effort: high
  ---
- # Setup Wizard - Interactive Project Configuration
+ # Setup Wizard - Confidence-Driven Project Configuration
 
  ## Task
  $ARGUMENTS
 
  ## Purpose
 
- You are an interactive setup wizard. Your job is to scan the project, ask the user ALL configuration questions, and generate the SDLC files. DO NOT skip questions. DO NOT make assumptions. The user's answers drive the output.
+ You are a confidence-driven setup wizard. Your job is to scan the project, infer as much as possible, and only ask the user about what you can't figure out. The number of questions is DYNAMIC: it depends on how much you can detect. Stop asking when all configuration data points are resolved (detected, confirmed, or answered).
+
+ **DO NOT ask a fixed list of questions. DO NOT ask what you already know.**
 
  ## MANDATORY FIRST ACTION: Read the Wizard Doc
 
@@ -36,56 +38,70 @@ Scan the project root for:
  - Deployment: Dockerfile, vercel.json, fly.toml, netlify.toml, Procfile, k8s/
  - Design system: tailwind.config.*, .storybook/, theme files, CSS custom properties
  - Existing docs: README.md, CLAUDE.md, ARCHITECTURE.md
+ - Scripts in package.json (lint, test, build, typecheck, etc.)
+ - Database config files (prisma/, drizzle.config.*, knexfile.*, .env with DB_*)
+ - Cache config (redis.conf, .env with REDIS_*)
+
+ ### Step 2: Build Confidence Map
+
+ For each configuration data point, assign a confidence level based on scan results:
+
+ **Configuration Data Points:**
 
- Present findings to the user in a clear summary with detected values.
+ | Category | Data Point | How to Detect |
+ |----------|-----------|---------------|
+ | Structure | Source directory | Look for src/, app/, lib/, etc. |
+ | Structure | Test directory | Look for tests/, __tests__/, spec/ |
+ | Structure | Test framework | Config files (jest.config, vitest.config, pytest.ini) |
+ | Commands | Lint command | package.json scripts, Makefile, config files |
+ | Commands | Type-check command | tsconfig.json → tsc, mypy.ini → mypy |
+ | Commands | Run all tests | package.json "test" script, Makefile |
+ | Commands | Run single test file | Infer from framework (jest → jest path, pytest → pytest path) |
+ | Commands | Production build | package.json "build" script, Makefile |
+ | Commands | Deployment setup | Dockerfile, vercel.json, fly.toml, deploy scripts |
+ | Infra | Database(s) | prisma/, .env DB vars, docker-compose services |
+ | Infra | Caching layer | .env REDIS vars, docker-compose redis service |
+ | Infra | Test duration | Count test files, check CI run times if available |
+ | Preferences | Response detail level | Cannot detect — ALWAYS ASK |
+ | Preferences | Testing approach | Cannot detect intent from existing code — ALWAYS ASK |
+ | Preferences | Mocking philosophy | Cannot detect intent from existing code — ALWAYS ASK |
+ | Testing | Test types | What test files exist (*.test.*, *.spec.*, e2e/, integration/) |
+ | Coverage | Coverage config | nyc, c8, coverage.py config, CI coverage steps |
+ | CI | CI shepherd opt-in | Only if CI detected — ALWAYS ASK |
 
- ### Step 2: Ask ALL 17 Questions
+ **Each data point has one of three states:**
+ - **RESOLVED (detected):** Found concrete evidence — config file, script, directory exists. No question needed, just confirm.
+ - **RESOLVED (inferred):** Found indirect evidence — naming patterns, related config. Present inference, let user confirm or correct.
+ - **UNRESOLVED:** No evidence found — must ask user directly.
 
- Ask every question. Pre-fill detected values but let the user confirm or override.
+ **Preference data points** (response detail, testing approach, mocking philosophy, CI shepherd) are ALWAYS UNRESOLVED regardless of what code patterns exist. Current code patterns show what IS, not what the user WANTS going forward.
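
One way to picture the data model behind the confidence map (an illustrative sketch; these type and function names are not from the wizard's implementation):

```typescript
import { existsSync } from "node:fs";

// Each data point carries its resolution state plus the evidence behind it.
type Resolution =
  | { state: "detected"; value: string; evidence: string } // concrete config found
  | { state: "inferred"; value: string; reasoning: string } // present for confirmation
  | { state: "unresolved" }; // becomes a question

interface DataPoint {
  category: "Structure" | "Commands" | "Infra" | "Preferences" | "Testing" | "Coverage" | "CI";
  name: string;
  resolution: Resolution;
}

// Example resolver: test framework, detected from config files per the table.
function resolveTestFramework(): Resolution {
  if (existsSync("vitest.config.ts")) {
    return { state: "detected", value: "vitest", evidence: "vitest.config.ts" };
  }
  if (existsSync("jest.config.js")) {
    return { state: "detected", value: "jest", evidence: "jest.config.js" };
  }
  return { state: "unresolved" };
}

// The questions to ask are exactly the data points still unresolved.
const questionsFor = (points: DataPoint[]) =>
  points.filter((p) => p.resolution.state === "unresolved");
```
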
 
- **Project Structure:**
- 1. Source directory (detected or ask)
- 2. Test directory (detected or ask)
- 3. Test framework (detected or ask)
+ ### Step 3: Present Findings and Fill Gaps
 
- **Commands:**
- 4. Lint command
- 5. Type-check command
- 6. Run all tests command
- 7. Run single test file command
- 8. Production build command
- 9. Deployment setup (detected environments, confirm or customize)
+ Present ALL detected values organized by state to the user.
 
- **Infrastructure:**
- 10. Database(s) used
- 11. Caching layer (Redis, etc.)
- 12. Test duration (<1 min, 1-5 min, 5+ min)
+ **For RESOLVED (detected) items:** Show what was found, let user bulk-confirm with a single "Looks good" or override specific items.
 
- **Output Preferences:**
- 13. Response detail level (small/medium/large)
+ **For RESOLVED (inferred) items:** Show what was inferred with reasoning, ask user to confirm or correct.
 
- **Testing Philosophy:**
- 14. Testing approach (strict TDD, test-after, mixed, minimal, none yet)
- 15. Test types wanted (unit, integration, E2E, API)
- 16. Mocking philosophy (minimal, heavy, no mocking)
+ **For UNRESOLVED items:** Ask the user directly — these are your questions.
 
- **Coverage:**
- 17. Code coverage preferences (enforce threshold, report only, AI suggestions, skip)
+ **The ready rule:** You are ready to generate files when ALL data points are resolved (detected, inferred+confirmed, or answered by user). The number of questions you ask depends entirely on how many data points remain unresolved after scanning. A well-configured project might need 3-4 questions (just preferences). A bare repo might need 10+. There is no fixed count.
 
- DO NOT proceed to file generation until ALL 17 questions have answers.
+ DO NOT proceed to file generation until all data points are resolved.
 
- ### Step 3: Generate CLAUDE.md
+ ### Step 4: Generate CLAUDE.md
 
- Using the user's answers, generate `CLAUDE.md` with:
+ Using detected + confirmed values, generate `CLAUDE.md` with:
  - Project overview (from scan results)
- - Commands table (Q4-Q8 answers)
+ - Commands table (detected/confirmed commands)
  - Code style section (from detected linters/formatters)
  - Architecture summary (from scan)
- - Special notes (from Q9-Q11)
+ - Special notes (infra, deployment)
 
  Reference: See "Step 3" in `CLAUDE_CODE_SDLC_WIZARD.md` for the full template.
 
- ### Step 4: Generate SDLC.md
+ ### Step 5: Generate SDLC.md
 
  Generate `SDLC.md` with the full SDLC checklist customized to the project:
  - Plan mode guidance
@@ -98,35 +114,35 @@ Include metadata comments:
  ```
  <!-- SDLC Wizard Version: [version from CLAUDE_CODE_SDLC_WIZARD.md] -->
  <!-- Setup Date: [today's date] -->
- <!-- Completed Steps: 0.4, 1-10 -->
+ <!-- Completed Steps: step-0.1, step-0.2, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
  ```
 
  Reference: See "Step 4" in `CLAUDE_CODE_SDLC_WIZARD.md` for the full template.
 
- ### Step 5: Generate TESTING.md
+ ### Step 6: Generate TESTING.md
 
- Generate `TESTING.md` based on Q13-Q16 answers:
+ Generate `TESTING.md` based on detected/confirmed testing data:
  - Testing Diamond visualization
  - Test types and their purposes
- - Mocking rules (from Q15)
- - Test file organization (from Q2, Q3)
- - Coverage config (from Q16)
+ - Mocking rules (from detected patterns or user input)
+ - Test file organization (from detected structure)
+ - Coverage config (from detected config or user input)
  - Framework-specific patterns
 
  Reference: See "Step 5" in `CLAUDE_CODE_SDLC_WIZARD.md` for the full template.
 
- ### Step 6: Generate ARCHITECTURE.md
+ ### Step 7: Generate ARCHITECTURE.md
 
  Generate `ARCHITECTURE.md` with:
  - System overview diagram (from scan)
  - Component descriptions
- - Environments table (from Q8.5)
+ - Environments table (from detected deployment config)
  - Deployment checklist
  - Key technical decisions
 
  Reference: See "Step 6" in `CLAUDE_CODE_SDLC_WIZARD.md` for the full template.
 
- ### Step 7: Generate DESIGN_SYSTEM.md (If UI Detected)
+ ### Step 8: Generate DESIGN_SYSTEM.md (If UI Detected)
 
  Only if design system artifacts were found in Step 1:
  - Extract colors, fonts, spacing from config
@@ -135,7 +151,7 @@ Only if design system artifacts were found in Step 1:
 
  Skip this step if no UI/design system detected.
 
- ### Step 8: Configure Tool Permissions
+ ### Step 9: Configure Tool Permissions
 
  Based on detected stack, suggest `allowedTools` entries for `.claude/settings.json`:
  - Package manager commands (npm, pnpm, yarn, cargo, go, pip, etc.)
@@ -144,11 +160,11 @@ Based on detected stack, suggest `allowedTools` entries for `.claude/settings.js
 
  Present suggestions and let the user confirm.
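
As an illustration, a detected npm + git stack might yield suggestions like this sketch (the `Bash(...)` strings follow Claude Code's tool-permission pattern syntax; the detected values and entry list are assumptions, not the wizard's actual output):

```typescript
// Turn detected stack facts into suggested allowedTools entries for review.
const detected = { packageManager: "npm", usesGit: true }; // illustrative scan result

const allowedTools: string[] = [
  `Bash(${detected.packageManager} run lint)`,
  `Bash(${detected.packageManager} run build)`,
  `Bash(${detected.packageManager} test:*)`,
  ...(detected.usesGit
    ? [
        "Bash(git status)",
        "Bash(git diff:*)",
        "Bash(git log:*)",
        "Bash(git add:*)",
        "Bash(git commit:*)",
      ]
    : []),
];

// Printed for the user to confirm before it lands in .claude/settings.json.
console.log(JSON.stringify({ allowedTools }, null, 2));
```
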
 
- ### Step 9: Customize Hooks
+ ### Step 10: Customize Hooks
 
- Update `tdd-pretool-check.sh` with the actual source directory from Q1 (replace generic `/src/` pattern).
+ Update `tdd-pretool-check.sh` with the actual source directory (replace generic `/src/` pattern).
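
The substitution itself is a one-liner; a sketch, assuming the template hook contains a literal `/src/` pattern and lives under `.claude/hooks/` (both assumptions):

```typescript
import { readFileSync, writeFileSync } from "node:fs";

// Point the TDD hook at the project's real source directory by replacing
// the generic /src/ pattern. Hook path and "app" directory are assumptions.
function customizeHook(hookPath: string, sourceDir: string): void {
  const original = readFileSync(hookPath, "utf8");
  writeFileSync(hookPath, original.replaceAll("/src/", `/${sourceDir}/`));
}

customizeHook(".claude/hooks/tdd-pretool-check.sh", "app");
```
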
 
- ### Step 10: Verify Setup
+ ### Step 11: Verify Setup
 
  Run verification checks:
  1. All generated files exist and are non-empty
@@ -159,18 +175,24 @@ Run verification checks:
 
  Report any issues found.
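
The first check is easy to make concrete; a sketch over the four files the generation steps above always produce (DESIGN_SYSTEM.md omitted since it is conditional):

```typescript
import { statSync } from "node:fs";

// Verification check 1: generated files exist and are non-empty.
const generated = ["CLAUDE.md", "SDLC.md", "TESTING.md", "ARCHITECTURE.md"];

const problems = generated.filter((file) => {
  try {
    return statSync(file).size === 0; // exists but empty
  } catch {
    return true; // missing entirely
  }
});

if (problems.length > 0) {
  console.error(`Verification failed for: ${problems.join(", ")}`);
} else {
  console.log("All generated files exist and are non-empty.");
}
```
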
 
- ### Step 11: Instruct Restart
+ ### Step 12: Instruct Restart and Next Steps
 
  Tell the user:
  > Setup complete. Hooks and settings load at session start.
  > **Exit Claude Code and restart it** for the new configuration to take effect.
  > On restart, the SDLC hook will fire and you'll see the checklist in every response.
+ >
+ > **Optional next step:**
+ > - Run `/claude-automation-recommender` for stack-specific tooling suggestions (MCP servers, formatting hooks, type-checking hooks, plugins)
+ >
+ > The recommender is complementary to the SDLC wizard — it adds tooling recommendations, not process enforcement.
 
  ## Rules
 
- - NEVER skip a question. If the user says "I don't know", record that and move on.
- - NEVER assume answers. If auto-scan can't detect something, ASK.
- - ALWAYS show detected values and let the user confirm or override.
+ - NEVER ask what you already know from scanning. If you found it, confirm it — don't ask it.
+ - NEVER use a fixed question count. The number of questions is dynamic based on scan results.
+ - ALWAYS show detected values organized by resolution state and let the user confirm or override.
  - ALWAYS generate metadata comments in SDLC.md (version, date, steps).
+ - If most data points are resolved after scanning, present findings for bulk confirmation — don't force individual questions.
  - If the user passes `regenerate` as an argument, skip Q&A and regenerate files from existing SDLC.md metadata.
- - If the user passes `verify-only` as an argument, skip to Step 10 (verify) only.
+ - If the user passes `verify-only` as an argument, skip to Step 11 (verify) only.
@@ -45,13 +45,13 @@ Extract the latest version from the first `## [X.X.X]` line.
  Parse all CHANGELOG entries between the user's installed version and the latest. Present a clear summary:
 
  ```
- Installed: 1.15.0
- Latest: 1.18.0
+ Installed: 1.19.0
+ Latest: 1.21.0
 
  What changed:
- - [1.18.0] Added /update-wizard skill, ...
- - [1.17.0] Consolidated /testing into /sdlc, ...
- - [1.16.0] Cross-model review protocol, ...
+ - [1.21.0] Confidence-driven setup, prove-it gate, cross-model release review, ...
+ - [1.20.0] Version-pinned CC update gate, Tier 1 flakiness fix, flaky test guidance, ...
+ - [1.19.0] CI shepherd model, token efficiency, feature doc enforcement, ...
  ```
 
  **If versions match:** Say "You're up to date! (version X.X.X)" and stop.
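
The parse step amounts to splitting the CHANGELOG on `## [X.Y.Z]` headings and keeping entries newer than the installed version. A sketch, assuming plain MAJOR.MINOR.PATCH versions as this changelog uses:

```typescript
// Collect CHANGELOG entries strictly newer than the installed version.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return 0;
}

function entriesSince(
  changelog: string,
  installed: string,
): { version: string; body: string }[] {
  // With a capture group, split() alternates: [preamble, version, body, ...].
  const sections = changelog.split(/^## \[(\d+\.\d+\.\d+)\].*$/m);
  const entries: { version: string; body: string }[] = [];
  for (let i = 1; i < sections.length; i += 2) {
    if (compareVersions(sections[i], installed) > 0) {
      entries.push({ version: sections[i], body: (sections[i + 1] ?? "").trim() });
    }
  }
  return entries;
}
```
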
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "agentic-sdlc-wizard",
- "version": "1.18.0",
+ "version": "1.21.0",
  "description": "SDLC enforcement for Claude Code — hooks, skills, and wizard setup in one command",
  "bin": {
  "sdlc-wizard": "./cli/bin/sdlc-wizard.js"