npm - agentic-sdlc-wizard - Versions diffs - 1.36.1 → 1.37.0 - Mend

agentic-sdlc-wizard 1.36.1 → 1.37.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11) hide show

package/.claude-plugin/marketplace.json +1 -1
package/.claude-plugin/plugin.json +1 -1
package/CHANGELOG.md +23 -2
package/CLAUDE_CODE_SDLC_WIZARD.md +9 -6
package/README.md +22 -0
package/hooks/instructions-loaded-check.sh +4 -17
package/hooks/model-effort-check.sh +35 -16
package/hooks/precompact-seam-check.sh +47 -6
package/package.json +1 -1
package/skills/sdlc/SKILL.md +7 -0
package/skills/update/SKILL.md +2 -1

package/.claude-plugin/marketplace.json CHANGED Viewed

@@ -13,7 +13,7 @@
       "name": "sdlc-wizard",
       "source": ".",
       "description": "SDLC enforcement for AI agents — TDD, planning, self-review, CI shepherd",
-      "version": "1.36.1",
+      "version": "1.37.0",
       "author": {
         "name": "Stefan Ayala"
       },

package/.claude-plugin/plugin.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "sdlc-wizard",
-  "version": "1.36.1",
+  "version": "1.37.0",
   "description": "SDLC enforcement for AI agents — TDD, planning, self-review, CI shepherd",
   "author": {
     "name": "Stefan Ayala",

package/CHANGELOG.md CHANGED Viewed

@@ -4,6 +4,27 @@ All notable changes to the SDLC Wizard.
 > **Note:** This changelog is for humans to read. Don't manually apply these changes - just run the wizard ("Check for SDLC wizard updates") and it handles everything automatically.
+## [1.37.0] - 2026-04-24
+### Changed
+- **`monthly-research.yml` workflow deleted** (ROADMAP #231 Phase 1, PR #235). 519 lines + 4 claude-code-action steps removed. Zero merged artifacts in 30d while burning $11-23/month in Anthropic API. Research now happens inline in a Claude Code session, not on a scheduled cron. All 17 `test_monthly_*` assertions in `tests/test-workflow-triggers.sh` stubbed with `n/a per #231 Phase 1` pattern (165/165 tests still green). Live docs (CI_CD.md, ARCHITECTURE.md, plans/AUTO_SELF_UPDATE.md) mark monthly-research REMOVED; historical audit tables intentionally preserved. Codex cross-model review: 3-round, 9/10 CERTIFIED.
+- **`model-effort-check.sh` loud WARNING below xhigh** (ROADMAP #217, PR #236). Closed the coherence gap between the docs (`max` preferred, `xhigh` floor) and the hook behavior. Previously the hook treated any effort ≠ xhigh as "upgrade available" — including `max` (the preferred default), which was backwards. New behavior:
+  - `effort=max` or `xhigh` → silent (at or above floor)
+  - `effort=high/medium/low` or unset → LOUD WARNING block: `WARNING` marker, SDLC compliance mention, `/effort max` primary recommendation, `/effort xhigh` floor alternative, `opus[1m]` model reminder
+  - Removed duplicate effort/model check from `instructions-loaded-check.sh` — single source of truth is now `model-effort-check.sh`. Regression test asserts the dupe doesn't come back.
+  - 2 new TDD tests + 1 regression test. Updated `test_hooks_recommend_opus_1m_alias` for single-source-of-truth. 119/119 hook tests pass.
+  - Codex cross-model review: 3-round, 10/10 CERTIFIED.
+### Roadmap
+- **#232 added**: `/update-wizard` should mimic `claude update` UX — detect stale npm CLI and offer to refresh before applying in-session file updates. User call-out 2026-04-24.
+### Removed
+- `.github/workflows/monthly-research.yml` (519 lines, 4 claude-code-action steps, 0 merged artifacts in 30d).
 ## [1.36.1] - 2026-04-23
 ### Changed
@@ -117,7 +138,7 @@ All notable changes to the SDLC Wizard.
   - `run_init_split` test helper captures stdout/stderr separately with explicit exit code
   - 9 new CLI tests, 5 new hook tests; Codex xhigh 4-round review: 9/10 CERTIFIED
 - Model/effort upgrade detection at session start (#179, #180)
-  - SessionStart hook nudges when configured `effortLevel` is below `xhigh` recommendation
+  - SessionStart hook nudges when configured `effortLevel` is below `xhigh` (wording superseded by #217 on 2026-04-24: `max` preferred, `xhigh` floor)
   - Reads `.claude/settings.local.json` → `.claude/settings.json` → `$HOME/.claude/settings.json` precedence
   - Non-blocking (`exit 0`); asks Claude to compare recommended model against its own system prompt
   - `claude-opus-4-6` defaults bumped to `claude-opus-4-7` in `pr-review.yml`, `evaluate.sh`, `sdp-score.sh`, `pairwise-compare.sh`
@@ -137,7 +158,7 @@ All notable changes to the SDLC Wizard.
   - Default: opus-4-7 + xhigh (matches CC's new default)
   - 3 new tests (39 total model-comparison tests)
 - `xhigh` effort level documented in wizard (#178)
-  - New effort table: high → xhigh (recommended for coding) → max
+  - New effort table: high → xhigh → max (xhigh was called "recommended for coding" here; superseded by #217 on 2026-04-24: `max` is preferred, `xhigh` is the floor)
   - Opus 4.7 changes: stricter effort adherence, budget_tokens deprecated, 64k+ max_tokens guidance
 - Benchmark ceiling effect audit documented in wizard
   - Cross-model audit (Codex GPT-5.4, xhigh) rated benchmark 2/10 NOT CERTIFIED

package/CLAUDE_CODE_SDLC_WIZARD.md CHANGED Viewed

@@ -1107,16 +1107,18 @@ For Claude to be effective at SDLC enforcement, your project should have these d
 1. Go to: `Settings > Branches > Add rule`
 2. Branch name pattern: `main` (or `master`)
 3. Enable the settings above (solo or team, as appropriate)
-4. Add required status checks: `validate`, `e2e-quick-check`
+4. Add required status checks: `validate` (E2E is advisory — see note below)
 5. Save changes
+> **Note (ROADMAP #212 Option 1, April 2026):** We no longer require `e2e-quick-check` as a blocking check. It burned Anthropic API credits on every PR, and branch protection pinned to GitHub Actions made local-maintainer check-run satisfaction impossible. E2E now runs advisory-only via `tests/e2e/local-shepherd.sh` on the maintainer's Max subscription. See `ROADMAP.md` #212 for the full rationale.
 **How to enable (CLI — solo dev):**
 ```bash
 gh api repos/OWNER/REPO/branches/main/protection --method PUT --input - << 'EOF'
 {
   "required_status_checks": {
     "strict": true,
-    "contexts": ["validate", "e2e-quick-check"]
+    "contexts": ["validate"]
   },
   "enforce_admins": false,
   "required_pull_request_reviews": null,
@@ -1131,7 +1133,7 @@ gh api repos/OWNER/REPO/branches/main/protection --method PUT --input - << 'EOF'
 {
   "required_status_checks": {
     "strict": true,
-    "contexts": ["validate", "e2e-quick-check"]
+    "contexts": ["validate"]
   },
   "enforce_admins": true,
   "required_pull_request_reviews": {
@@ -2713,7 +2715,7 @@ If deployment fails or post-deploy verification catches issues:
 **SDLC.md:**
 ```markdown
-<!-- SDLC Wizard Version: 1.36.1 -->
+<!-- SDLC Wizard Version: 1.37.0 -->
 <!-- Setup Date: [DATE] -->
 <!-- Completed Steps: step-0.1, step-0.2, step-0.4, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
 <!-- Git Workflow: [PRs or Solo] -->
@@ -3465,11 +3467,12 @@ Use an independent AI model from a different company as a code reviewer. The aut
 **Why this works:** Two AI systems from different companies (e.g., Claude writes, GPT reviews) provide adversarial diversity. They have fundamentally different training, different failure modes, and different strengths. What one misses, the other catches.
-**Use the best model at the deepest reasoning.** This is your quality gate — don't economize on it. Always use the latest, most capable model available (currently GPT-5.4) at maximum reasoning effort (`xhigh`). Cheaper/faster models miss things. The whole point is catching what the authoring model couldn't.
+**Use the best model at the deepest reasoning.** This is your quality gate — don't economize on it. Always use the latest, most capable model available (**GPT-5.5 if you have access**, otherwise GPT-5.4) at maximum reasoning effort (`xhigh` — this is non-negotiable, lower settings miss subtle errors). Cheaper/faster models miss things. The whole point is catching what the authoring model couldn't.
 **Prerequisites:**
 - Codex CLI installed: `npm i -g @openai/codex`
 - OpenAI API key configured: `export OPENAI_API_KEY=...`
+- Codex CLI picks up your OpenAI account's best available model automatically. If you have GPT-5.5 access, `codex exec` uses it; otherwise it falls back to GPT-5.4. No config change needed on your side.
 - This is a local workflow tool — not required for CI/CD
 **The Protocol:**
@@ -3774,7 +3777,7 @@ Walk through updates? (y/n)
 Store wizard state in `SDLC.md` as metadata comments (invisible to readers, parseable by Claude):
 ```markdown
-<!-- SDLC Wizard Version: 1.36.1 -->
+<!-- SDLC Wizard Version: 1.37.0 -->
 <!-- Setup Date: 2026-01-24 -->
 <!-- Completed Steps: step-0.1, step-0.2, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
 <!-- Git Workflow: PRs -->

package/README.md CHANGED Viewed

@@ -110,6 +110,28 @@ Layer 1: PHILOSOPHY
 | **Pre-tool TDD hooks** | Before source edits, a hook reminds Claude to write tests first. CI scoring checks whether it actually followed TDD |
 | **Self-evolving loop** | Weekly/monthly external research + local CI shepherd loop — you approve, the system gets better |
+## Optional: Cross-Model Review (Codex)
+Claude can't grade its own homework. Have a **different AI from a different company** review Claude's work — different training, different blind spots, different biases. We use OpenAI's Codex CLI, and it's **three commands to set up**:
+```bash
+npm i -g @openai/codex
+export OPENAI_API_KEY=sk-...
+codex --version   # confirm ready
+```
+That's it. Codex picks up your OpenAI account's best available model automatically — **if you have GPT-5.5, it uses GPT-5.5; otherwise GPT-5.4**. No model config needed.
+**How to use it:** after Claude's self-review passes, write a one-file mission brief and run:
+```bash
+codex exec -c 'model_reasoning_effort="xhigh"' -s danger-full-access \
+  -o .reviews/latest-review.md \
+  "Read .reviews/handoff.json and review per the checklist. Output findings + CERTIFIED or NOT CERTIFIED."
+```
+`xhigh` reasoning is **non-negotiable** — lower settings miss subtle bugs. See [CLAUDE_CODE_SDLC_WIZARD.md](CLAUDE_CODE_SDLC_WIZARD.md#cross-model-review-loop-optional) for the full protocol (handoff format, round-2 dialogue loop, preflight docs). Real-world: this catches P0/P1 issues in 2-3 out of 10 reviews that Claude's self-review rated as clean.
 ## How It Works
 **Think Iron Man:** Jarvis is nothing without Tony Stark. Tony Stark is still Tony Stark. But together? They make Iron Man. This SDLC is your suit - you build it over time, improve it for your needs, and it makes you both better.

package/hooks/instructions-loaded-check.sh CHANGED Viewed

@@ -123,23 +123,10 @@ if command -v codex > /dev/null 2>&1 && [ -d "$PROJECT_DIR/.reviews" ]; then
     fi
 fi
-# Model/effort upgrade check (non-blocking, best-effort)
-RECOMMENDED_MODEL="opus[1m]"
-RECOMMENDED_EFFORT="xhigh"
-if command -v jq > /dev/null 2>&1; then
-    EFFORT=""
-    PROJ="${CLAUDE_PROJECT_DIR:-$PROJECT_DIR}"
-    for f in "$PROJ/.claude/settings.local.json" "$PROJ/.claude/settings.json" "$HOME/.claude/settings.json"; do
-        if [ -f "$f" ]; then
-            val=$(jq -r '.effortLevel // empty' "$f" 2>/dev/null)
-            if [ -n "$val" ]; then EFFORT="$val"; break; fi
-        fi
-    done
-    if [ -n "$EFFORT" ] && [ "$EFFORT" != "$RECOMMENDED_EFFORT" ]; then
-        echo "Upgrade available: effort $EFFORT → $RECOMMENDED_EFFORT (run: /effort $RECOMMENDED_EFFORT)"
-        echo "Recommended model: $RECOMMENDED_MODEL (run: /model $RECOMMENDED_MODEL)"
-    fi
-fi
+# Model/effort upgrade check is delegated to hooks/model-effort-check.sh
+# (single source of truth per ROADMAP #217). Don't duplicate the logic here —
+# this hook and model-effort-check.sh both fire on SessionStart, so two checks
+# would double-print the nudge and risk drifting out of sync.
 # Dual-channel install check (#181) — nudge when CLI skills + Claude plugin both present
 if [ -d "$PROJECT_DIR/.claude/skills/update" ]; then

package/hooks/model-effort-check.sh CHANGED Viewed

@@ -1,12 +1,19 @@
 #!/bin/bash
-# SessionStart hook — nudges user when effort level is below recommended
-# and tells Claude the recommended model so it can compare against its own
-# CC does NOT expose the model to hooks, so model nudge relies on Claude
-# seeing this output and comparing against its system prompt
-# Non-blocking: always exits 0
+# SessionStart hook — effort/model nudge.
+#
+# Behavior (per ROADMAP #217):
+#   effort=max    -> silent (preferred default, above floor)
+#   effort=xhigh  -> silent (minimum floor on Opus 4.7)
+#   effort=high|medium|low (or unset) -> LOUD WARNING:
+#     Opus 4.7 needs xhigh floor for SDLC compliance (TDD, self-review, deep reasoning).
+#     Recommends `/effort max`. Also reminds about recommended model `opus[1m]`.
+#
+# CC does not expose the current model to hooks, so the model nudge is emitted as
+# guidance for Claude to compare against its own system prompt.
+#
+# Non-blocking: always exits 0.
 RECOMMENDED_MODEL="opus[1m]"
-RECOMMENDED_EFFORT="xhigh"
 # Drain stdin (SessionStart sends JSON but model field isn't in it)
 cat > /dev/null
@@ -27,17 +34,29 @@ for f in "$project_dir/.claude/settings.local.json" "$project_dir/.claude/settin
     fi
 done
-nudge=""
-if [ -n "$effort" ] && [ "$effort" != "$RECOMMENDED_EFFORT" ]; then
-    nudge="effort: $effort → $RECOMMENDED_EFFORT (run: /effort $RECOMMENDED_EFFORT)"
+# At or above floor — silent.
+case "$effort" in
+    max|xhigh)
+        exit 0
+        ;;
+esac
+# Below floor OR unset — LOUD warning.
+# Note: test_model_effort_size_cap asserts output < 500 chars. Keep copy terse.
+if [ -z "$effort" ]; then
+    effort_display="unset"
+else
+    effort_display="$effort"
 fi
-if [ -n "$nudge" ]; then
-    echo "Upgrade available:"
-    echo "  $nudge"
-    echo "  recommended model: $RECOMMENDED_MODEL (run: /model $RECOMMENDED_MODEL)"
-    echo "  (Claude: compare recommended model against your current model — nudge user if different)"
-fi
+echo "=============================================================================="
+echo " WARNING: effort '$effort_display' breaks SDLC compliance on Opus 4.7."
+echo " Below xhigh = shallow reasoning, skipped TDD, dropped self-review."
+echo ""
+echo " Run: /effort max    (preferred, full SDLC compliance)"
+echo " Or:  /effort xhigh  (minimum floor)"
+echo ""
+echo " recommended model: $RECOMMENDED_MODEL (run: /model $RECOMMENDED_MODEL)"
+echo "=============================================================================="
 exit 0

package/hooks/precompact-seam-check.sh CHANGED Viewed

@@ -22,19 +22,57 @@ ROOT="${CLAUDE_PROJECT_DIR:-$PWD}"
 HOLD_REASONS=""
 # Check 1: Codex review mid-cycle
-# Self-heal: if handoff has a pr_number and gh reports that PR as MERGED,
-# the handoff is a stale artifact from a closed review — treat as implicit
-# CERTIFIED so a forgotten status field doesn't permanently block /compact.
+# Self-heal paths (ordered by preference):
+#   (a) #209: handoff has pr_number + gh reports PR MERGED → implicit CERTIFIED (silent)
+#   (b) #229: handoff has no pr_number but mtime > SDLC_HANDOFF_STALE_DAYS days
+#       → implicit CERTIFIED with WARN (the handoff predates #209 or was never
+#       PR-linked; blocking forever over a forgotten artifact is worse UX than
+#       the bug we're preventing). Default threshold: 14 days.
 HANDOFF="$ROOT/.reviews/handoff.json"
+# Validate SDLC_HANDOFF_STALE_DAYS as non-negative integer. Anything else
+# (empty, "foo", "-3", "10.5") silently falls back to 14 — we don't want a
+# typo in the user's env to leak a bash arithmetic error to stderr every
+# time the hook runs (caught by Codex P2 review of PR #227).
+STALE_DAYS_RAW="${SDLC_HANDOFF_STALE_DAYS:-14}"
+case "$STALE_DAYS_RAW" in
+    ''|*[!0-9]*) STALE_DAYS=14 ;;
+    *) STALE_DAYS="$STALE_DAYS_RAW" ;;
+esac
+STALE_WARN=""
 if [ -f "$HANDOFF" ]; then
     STATUS=$(grep -o '"status"[[:space:]]*:[[:space:]]*"[^"]*"' "$HANDOFF" 2>/dev/null | head -1 | sed 's/.*"\([^"]*\)"$/\1/')
     case "$STATUS" in
         PENDING_REVIEW|PENDING_RECHECK)
             PR_NUMBER=$(grep -o '"pr_number"[[:space:]]*:[[:space:]]*[0-9][0-9]*' "$HANDOFF" 2>/dev/null | head -1 | grep -o '[0-9][0-9]*$')
             HEALED=0
-            if [ -n "$PR_NUMBER" ] && command -v gh >/dev/null 2>&1; then
-                PR_STATE=$(gh pr view "$PR_NUMBER" --json state --jq .state 2>/dev/null)
-                [ "$PR_STATE" = "MERGED" ] && HEALED=1
+            if [ -n "$PR_NUMBER" ]; then
+                # Path (a): PR-linked self-heal (#209). Applies regardless of mtime —
+                # an OPEN PR with old mtime is still a live review.
+                if command -v gh >/dev/null 2>&1; then
+                    PR_STATE=$(gh pr view "$PR_NUMBER" --json state --jq .state 2>/dev/null)
+                    [ "$PR_STATE" = "MERGED" ] && HEALED=1
+                fi
+            else
+                # Path (b): stale-handoff auto-expire (#229). Only when no pr_number
+                # — we must not short-circuit PR-linked reviews.
+                # Try GNU stat first (Linux: `-c %Y` gives mtime, BSD stat errors out
+                # so `||` fires). Then BSD stat (macOS: `-f %m` gives mtime). The
+                # reverse order fails on Linux because `stat -f` on GNU means
+                # `--file-system` and dumps filesystem info to stdout (non-error).
+                MTIME=$(stat -c %Y "$HANDOFF" 2>/dev/null || stat -f %m "$HANDOFF" 2>/dev/null)
+                # Numeric guard: if stat fails both flavors or returns junk, skip
+                # the stale-check entirely and fall through to HOLD (safe default).
+                case "$MTIME" in
+                    ''|*[!0-9]*) ;;
+                    *)
+                        NOW=$(date +%s)
+                        AGE_DAYS=$(( (NOW - MTIME) / 86400 ))
+                        if [ "$AGE_DAYS" -ge "$STALE_DAYS" ]; then
+                            HEALED=1
+                            STALE_WARN="WARN: handoff.json is ${STATUS} and ${AGE_DAYS}d old with no pr_number — treating as stale CERTIFIED (override: set SDLC_HANDOFF_STALE_DAYS or close out the review)."
+                        fi
+                        ;;
+                esac
             fi
             if [ "$HEALED" -ne 1 ]; then
                 HOLD_REASONS="${HOLD_REASONS}  - Codex review is ${STATUS}. Round-1 evidence lives in this context — compacting now loses what round-2 needs to re-verify.
@@ -72,4 +110,7 @@ if [ -n "$HOLD_REASONS" ]; then
     exit 2
 fi
+# Stale-handoff unblock: emit a one-line WARN so the user knows the hook
+# self-healed over an abandoned PENDING artifact (but still allow /compact).
+[ -n "$STALE_WARN" ] && echo "$STALE_WARN" >&2
 exit 0

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "agentic-sdlc-wizard",
-  "version": "1.36.1",
+  "version": "1.37.0",
   "description": "SDLC enforcement for Claude Code — hooks, skills, and wizard setup in one command",
   "bin": {
     "sdlc-wizard": "cli/bin/sdlc-wizard.js"

package/skills/sdlc/SKILL.md CHANGED Viewed

@@ -42,9 +42,16 @@ TodoWrite([
   { content: "Cross-model review (if configured — see below)", status: "pending", activeForm: "Running cross-model review" },
   { content: "Scope guard: only changes related to task? No legacy/fallback code left?", status: "pending", activeForm: "Checking scope and legacy code" },
   // CI FEEDBACK LOOP (if CI monitoring enabled in setup - skip if no CI)
+  // NOTE (meta-repo only, ROADMAP #212 Option 1, 2026-04-24): this repo no
+  // longer runs e2e simulation in CI. Only `validate` blocks merge. For E2E
+  // signal on a PR, checkout the PR locally and run:
+  //   bash tests/e2e/local-shepherd.sh <PR>
+  // which scores on Max subscription and posts an advisory check-run.
+  // Consumer repos still use their own CI as configured.
   { content: "Commit and push to remote", status: "pending", activeForm: "Pushing to remote" },
   { content: "Watch CI - fix failures, iterate until green (max 2x)", status: "pending", activeForm: "Watching CI" },
   { content: "Read CI review - implement valid suggestions, iterate until clean", status: "pending", activeForm: "Addressing CI review feedback" },
+  { content: "Meta-repo only: run local shepherd if PR needs E2E score (optional)", status: "pending", activeForm: "Running local shepherd" },
   { content: "Post-deploy verification (if deploy task — see Deployment Tasks)", status: "pending", activeForm: "Verifying deployment" },
   // FINAL
   { content: "Present summary: changes, tests, CI status", status: "pending", activeForm: "Presenting final summary" },

package/skills/update/SKILL.md CHANGED Viewed

@@ -46,9 +46,10 @@ Parse all CHANGELOG entries between the user's installed version and the latest.
 ```
 Installed: 1.24.0
-Latest:    1.36.1
+Latest:    1.37.0
 What changed:
+- [1.37.0] `monthly-research.yml` workflow deleted (ROADMAP #231 Phase 1) — 0 merged artifacts in 30d while burning $11-23/month; research happens inline now. `model-effort-check.sh` loud WARNING below xhigh (#217) — max preferred, xhigh floor; duplicate effort nudge in `instructions-loaded-check.sh` removed; single source of truth. Both changes Codex-certified.
 - [1.36.1] Repo renamed `agentic-ai-sdlc-wizard` → `claude-sdlc-wizard` (matches sibling pattern; npm package unchanged); `npm pkg fix` metadata cleanup; slug migration across docs/tests/configs
 - [1.36.0] CC 2.1.118 `/usage` canonical + aliases, Tier 2 dead-gate fix (#215), score-history max_score correctness (#211), setup-bun regression guard (#210), post-mortem learnings (#220-222), GPT-5.5 adoption plan (#223), MCP-tool hooks + #198 re-verify in backlog (#218/#219)
 - [1.35.0] PreCompact seam gate + self-heal (#208/#209), effort auto-bump on LOW/FAILED/CONFUSED (#195), wizard staleness nudge (#196), Codex CI-log audit pattern, ...