agentic-sdlc-wizard 1.30.0 → 1.32.0

@@ -13,7 +13,7 @@
  "name": "sdlc-wizard",
  "source": ".",
  "description": "SDLC enforcement for AI agents — TDD, planning, self-review, CI shepherd",
- "version": "1.30.0",
+ "version": "1.31.0",
  "author": {
    "name": "Stefan Ayala"
  },
@@ -1,6 +1,6 @@
  {
    "name": "sdlc-wizard",
-   "version": "1.30.0",
+   "version": "1.31.0",
    "description": "SDLC enforcement for AI agents — TDD, planning, self-review, CI shepherd",
    "author": {
      "name": "Stefan Ayala",
package/CHANGELOG.md CHANGED
@@ -4,6 +4,46 @@ All notable changes to the SDLC Wizard.
 
  > **Note:** This changelog is for humans to read. Don't manually apply these changes - just run the wizard ("Check for SDLC wizard updates") and it handles everything automatically.
 
+ ## [1.32.0] - 2026-04-16
+
+ ### Added
+ - Opus 4.7 support in benchmark workflow (#178)
+   - `claude-opus-4-7` added to model choices, `effort` input (high/xhigh/max)
+   - `--effort` passed via `claude_args`, effort recorded in artifacts + summaries
+   - Hard-fail when xhigh used with non-4.7 models (inputs resolved before shell)
+   - Artifact names include effort level to prevent collision
+   - Default: opus-4-7 + xhigh (matches CC's new default)
+   - 3 new tests (39 total model-comparison tests)
+ - `xhigh` effort level documented in wizard (#178)
+   - New effort table: high → xhigh (recommended for coding) → max
+   - Opus 4.7 changes: stricter effort adherence, budget_tokens deprecated, 64k+ max_tokens guidance
+ - Benchmark ceiling effect audit documented in wizard
+   - Cross-model audit (Codex GPT-5.4, xhigh) rated benchmark 2/10 NOT CERTIFIED
+   - 4 P0 findings: fake trials, answer key leaked, no independent verification, binary rubric
+   - 3 concrete fixes documented (remove coaching, add correctness scoring, real trials)
+   - External benchmark comparison (SWE-Bench, Aider methodology)
+ - Automation Station community Discord link in README
+
+ ### Fixed
+ - Orphaned `skills/gdlc/` causing test-doc-consistency failures (deleted)
+
+ ## [1.31.0] - 2026-04-14
+
+ ### Added
+ - Ephemeral marketplace path detection in CLI `check` command (#174)
+   - Scans `~/.claude/settings.json` `extraKnownMarketplaces` for directory sources on ephemeral paths (`/tmp/`, `/private/tmp/`, `/var/folders/`)
+   - `EPHEMERAL` status (path exists but in ephemeral root) warns but doesn't fail check
+   - `DANGLING` status (path doesn't exist) errors with non-zero exit code
+   - Suggests moving to `~/.claude/plugins-local/<name>` for stable installs
+   - JSON output (`--json`) includes new `marketplace` field
+   - 10 new tests (51 total CLI tests)
+
+ ### Fixed
+ - Hook false-positive "SETUP NOT COMPLETE" in non-SDLC directories (#173, PR #175)
+   - Three-way detection: both files (normal), one file (warn partial setup), neither (silent exit)
+   - Added `find_partial_sdlc_root` helper for partial-setup detection
+   - 2 new hook tests (60 total hook tests)
+
  ## [1.30.0] - 2026-04-12
 
  ### Added
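
The `marketplace` field listed in the 1.31.0 entry could be consumed by a CI script roughly as follows. This is a minimal sketch under assumptions: the field names (`name`, `path`, `status`, `details`, `suggestion`) follow the `cli/init.js` change in this diff, while `summarizeMarketplace`, the sample report, and the marketplace names in it are hypothetical and not part of the package.

```javascript
// Hypothetical consumer of `check --json` output (not part of the package).
function summarizeMarketplace(report) {
  const entries = Array.isArray(report.marketplace) ? report.marketplace : [];
  const dangling = entries.filter((m) => m.status === 'DANGLING');
  const ephemeral = entries.filter((m) => m.status === 'EPHEMERAL');
  return {
    dangling: dangling.map((m) => m.name),
    ephemeral: ephemeral.map((m) => m.name),
    // Per the changelog: DANGLING fails the check, EPHEMERAL only warns.
    shouldFail: dangling.length > 0,
  };
}

// Made-up report; `update: null` is an assumption about the no-update case.
const sampleReport = {
  files: [],
  update: null,
  marketplace: [
    {
      name: 'demo-a',
      path: '/tmp/demo-a',
      status: 'EPHEMERAL',
      details: 'macOS reaps this path periodically — plugin may break silently',
      suggestion: '~/.claude/plugins-local/demo-a',
    },
    {
      name: 'demo-b',
      path: '/opt/missing',
      status: 'DANGLING',
      details: 'Path does not exist — plugin may be silently broken',
    },
  ],
};

const summary = summarizeMarketplace(sampleReport);
console.log(summary);
```

Treating a missing or malformed `marketplace` field as an empty list keeps the sketch usable against output from versions before 1.31.0, which do not emit the field.
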
@@ -99,6 +99,27 @@ This prevents both false positives (crying wolf) and false negatives (missing re
  - Green CI = safe to upgrade. Red = stay on current version until fixed
  - Results shown in PR with statistical confidence
 
+ ### Benchmark Ceiling Effect (Known Issue — April 2026)
+
+ **Our E2E benchmark currently has zero discriminating power.** Both Opus 4.6 and 4.7 scored perfect 10/10 on the `add-feature` scenario (3 trials each, `high` effort). A cross-model audit (Codex GPT-5.4, xhigh reasoning) rated the benchmark methodology **2/10, NOT CERTIFIED** and identified 4 P0 critical issues:
+
+ | Finding | Severity | Problem |
+ |---------|----------|---------|
+ | **Fake trials** | P0 | The workflow runs the simulation ONCE, then re-scores the same output N times. "Trials" measure judge jitter, not model variance |
+ | **Answer key leaked** | P0 | The simulation prompt tells the model exactly what's scored ("You MUST use TodoWrite... scored by automated checks"). This tests obedience to rubric, not SDLC judgment |
+ | **No independent verification** | P0 | "Tests pass" is self-reported from the transcript. The evaluator never re-runs `npm test` on the final code |
+ | **Binary rubric** | P0 | Every criterion is YES/NO. The evaluator is explicitly designed for "near-zero variance." On an easy coached task, scores collapse to 10/10 |
+
+ **Three concrete fixes to break the ceiling:**
+
+ 1. **Remove rubric leakage** — Don't tell the model what's scored in the simulation prompt. Let the wizard hooks and docs drive behavior naturally. Score hidden behaviors from traces, not coached compliance
+ 2. **Make correctness the majority of the score** — After simulation, run an external verifier: re-run `npm test` on the modified fixture, add hidden tests the model didn't know about, inspect the actual diff. Replace transcript-only `clean_code` with diff-based quality checks
+ 3. **Real trials on calibrated scenarios** — Each trial must be a fresh end-to-end simulation run on a fresh checkout. Select scenarios by pilot difficulty so top models don't all saturate (similar to Aider's hard-subset methodology). The current single-coached-toy-run approach is measuring nothing
+
+ **What external benchmarks do differently:** SWE-Bench gives a real issue plus a full repo snapshot, applies the agent's patch, and runs the repo's actual tests to score `% resolved`. Aider's polyglot benchmark was explicitly rebuilt because the old one saturated — it uses 225 harder tasks chosen to preserve headroom. Our benchmark lacks real task difficulty calibration, independent execution-based correctness, multi-task breadth, and headroom management.
+
+ **Status:** This is tracked as item #96 (E2E score audit) on the roadmap. Until fixed, the benchmark measures process compliance coaching, not model quality differentiation.
+
  ---
 
  ## Philosophy: Sensible Defaults, Smart Customization
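
The "external verifier" idea from fix 2 in the hunk above (re-run the tests instead of trusting the transcript) might look like the sketch below. It is an illustration only: `runVerifier` is a hypothetical helper, not code from the package, and the demo calls substitute trivial Node commands for `npm test` so the block runs anywhere Node is installed.

```javascript
const { spawnSync } = require('child_process');

// Runs a command in a directory and trusts only its exit status,
// never a transcript's own claim that "tests pass".
function runVerifier(command, args, cwd) {
  const result = spawnSync(command, args, { cwd, encoding: 'utf8' });
  return {
    passed: result.status === 0, // status is null if the process failed to spawn
    status: result.status,
    output: `${result.stdout || ''}${result.stderr || ''}`,
  };
}

// Stand-ins for `npm test`: one command that exits 0, one that exits 3.
const ok = runVerifier(process.execPath, ['-e', 'process.exit(0)'], process.cwd());
const bad = runVerifier(process.execPath, ['-e', 'process.exit(3)'], process.cwd());
console.log(ok.passed, bad.passed, bad.status); // prints: true false 3
```

In a real harness the call would presumably be `runVerifier('npm', ['test'], fixtureDir)` against a fresh checkout of the modified fixture, with the result feeding the score instead of the self-reported status.
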
@@ -221,12 +242,20 @@ Claude Code's **effort level** controls how much thinking the model does before
 
  | Level | When to Use | How to Set |
  |-------|-------------|------------|
- | `high` | **Default for all SDLC work.** Features, bug fixes, refactoring, tests, reviews | `effort: high` in skill frontmatter (already set) |
- | `max` | LOW confidence, FAILED 2x, architecture decisions, complex debugging, cross-model reviews | `/effort max` (session only resets next session) |
+ | `high` | Standard SDLC work. Features, bug fixes, refactoring, tests, reviews | `effort: high` in skill frontmatter (already set) |
+ | `xhigh` | **Recommended default for coding and agentic work (Opus 4.7+).** Long-running tasks, repeated tool calls, deep exploration. Claude Code defaults to this on Opus 4.7 | `/effort xhigh` or set in skill frontmatter |
+ | `max` | LOW confidence, FAILED 2x, architecture decisions, complex debugging, cross-model reviews. Reserve for genuinely frontier problems — on most workloads `max` adds cost for small quality gains | `/effort max` (session only — resets next session) |
+
+ **Effort level changes in Opus 4.7 (April 2026):**
+ - **`xhigh` is new** — sits between `high` and `max`, designed for coding and agentic work (30+ minute tasks with token budgets in the millions)
+ - **Claude Code now defaults to `xhigh`** on Opus 4.7 for all plans
+ - **Opus 4.7 respects effort levels more strictly** than 4.6 — at lower levels it scopes work tighter instead of going above and beyond. If you see shallow reasoning, raise effort rather than prompting around it
+ - **`budget_tokens` is deprecated** on Opus 4.7 — use adaptive thinking with effort instead
+ - When running at `xhigh` or `max`, set a large `max_tokens` (64k+) so the model has room to think across subagents and tool calls
 
- **Why `high` is the default:** Claude Code uses **adaptive thinking** to dynamically allocate reasoning budget per turn. On Pro and Max plans, the default effort level is **medium (85)**, which causes the model to under-allocate reasoning on complex multi-step tasks — leading to shallow analysis, missed edge cases, and "lazy" outputs. This was [confirmed by Anthropic engineer Boris Cherny](https://github.com/anthropics/claude-code/issues/42796) and is documented at [code.claude.com](https://code.claude.com/docs/en/model-config). API, Team, and Enterprise plans default to high effort and are not affected.
+ **Why `high` was the previous default:** Claude Code uses **adaptive thinking** to dynamically allocate reasoning budget per turn. On Pro and Max plans, the default effort level was **medium (85)**, which causes the model to under-allocate reasoning on complex multi-step tasks — leading to shallow analysis, missed edge cases, and "lazy" outputs. This was [confirmed by Anthropic engineer Boris Cherny](https://github.com/anthropics/claude-code/issues/42796) and is documented at [code.claude.com](https://code.claude.com/docs/en/model-config). API, Team, and Enterprise plans default to high effort and are not affected.
 
- The `/sdlc` skill sets `effort: high` in its frontmatter, overriding the medium default on every SDLC invocation. This gives thorough reasoning without the unbounded token cost of `max`.
+ The `/sdlc` skill sets `effort: high` in its frontmatter, overriding the medium default on every SDLC invocation. Consider upgrading to `effort: xhigh` on Opus 4.7+ for deeper reasoning on complex tasks.
 
  **Nuclear option — disable adaptive thinking entirely:** Set `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1` in your environment or settings.json `env` block. This forces a fixed reasoning budget per turn instead of letting the model dynamically allocate. Use this if you observe persistent quality issues even with `effort: high`. See [Claude Code model config docs](https://code.claude.com/docs/en/model-config) for details.
 
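
The changelog in this diff mentions a hard-fail when `xhigh` is paired with a non-4.7 model. The workflow that enforces this is not shown here, so the following is only a sketch of what such a guard could look like. `validateEffort` is a hypothetical name, the prefix match on `claude-opus-4-7` is an assumption, and `claude-opus-4-6` is used purely as an illustrative older model id.

```javascript
// The three choices listed for the benchmark's `effort` input, lowest to highest.
const EFFORT_LEVELS = ['high', 'xhigh', 'max'];

// Hypothetical guard (not the package's code): reject xhigh on non-4.7 models.
function validateEffort(model, effort) {
  if (!EFFORT_LEVELS.includes(effort)) {
    throw new Error(`Unknown effort level: ${effort}`);
  }
  // Assumption: 4.7 model ids start with 'claude-opus-4-7'.
  if (effort === 'xhigh' && !model.startsWith('claude-opus-4-7')) {
    throw new Error(`effort 'xhigh' requires Opus 4.7, got model '${model}'`);
  }
  return { model, effort };
}

console.log(validateEffort('claude-opus-4-7', 'xhigh'));
try {
  validateEffort('claude-opus-4-6', 'xhigh');
} catch (err) {
  console.log(err.message);
}
```

Failing before any model call is made matches the changelog's note that inputs are resolved before the shell step runs, so a misconfigured run costs nothing.
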
@@ -2628,7 +2657,7 @@ If deployment fails or post-deploy verification catches issues:
 
  **SDLC.md:**
  ```markdown
- <!-- SDLC Wizard Version: 1.30.0 -->
+ <!-- SDLC Wizard Version: 1.31.0 -->
  <!-- Setup Date: [DATE] -->
  <!-- Completed Steps: step-0.1, step-0.2, step-0.4, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
  <!-- Git Workflow: [PRs or Solo] -->
@@ -3687,7 +3716,7 @@ Walk through updates? (y/n)
  Store wizard state in `SDLC.md` as metadata comments (invisible to readers, parseable by Claude):
 
  ```markdown
- <!-- SDLC Wizard Version: 1.30.0 -->
+ <!-- SDLC Wizard Version: 1.31.0 -->
  <!-- Setup Date: 2026-01-24 -->
  <!-- Completed Steps: step-0.1, step-0.2, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
  <!-- Git Workflow: PRs -->
package/README.md CHANGED
@@ -87,7 +87,7 @@ Layer 4: STATISTICAL VALIDATION
  SDP normalizes for model quality. CUSUM catches drift.
 
  Layer 3: SCORING ENGINE
- 7 criteria, 10/11 points. Claude evaluates Claude.
+ Multi-criteria scoring, 10/11 points. Claude evaluates Claude.
  Before/after wizard A/B comparison in CI.
 
  Layer 2: ENFORCEMENT
@@ -229,12 +229,16 @@ This isn't the only Claude Code SDLC tool. Here's an honest comparison:
  | Document | What It Covers |
  |----------|---------------|
  | [ARCHITECTURE.md](ARCHITECTURE.md) | System design, 5-layer diagram, data flows, file structure |
- | [CI_CD.md](CI_CD.md) | All 7 workflows, E2E scoring, tier system, SDP, integrity checks |
+ | [CI_CD.md](CI_CD.md) | All workflows, E2E scoring, tier system, SDP, integrity checks |
  | [SDLC.md](SDLC.md) | Version tracking, enforcement rules, SDLC configuration |
  | [TESTING.md](TESTING.md) | Testing philosophy, test diamond, TDD approach |
  | [CHANGELOG.md](CHANGELOG.md) | Version history, what changed and when |
  | [CONTRIBUTING.md](CONTRIBUTING.md) | How to contribute, evaluation methodology |
 
+ ## Community
+
+ Come join **[Automation Station](https://discord.com/invite/fGPEF7GHrF)** — a community Discord packed with software engineers bringing 40+ years of combined experience across every area of the stack (frontend, backend, infra, embedded, data, QA, DevOps, you name it). Share patterns, ask questions, compare notes on AI agents, automation, and SDLC tooling.
+
  ## Contributing
 
  PRs welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for evaluation methodology and testing.
package/cli/init.js CHANGED
@@ -2,6 +2,7 @@
 
  const crypto = require('crypto');
  const fs = require('fs');
+ const os = require('os');
  const path = require('path');
 
  const RESET = '\x1b[0m';
@@ -285,23 +286,37 @@ function check(targetDir, { json = false } = {}) {
      // Offline or npm unavailable — skip update check
    }
 
-   const hasDrift = results.some((r) => r.status === 'MISSING' || r.status === 'DRIFT');
+   const marketplace = checkMarketplacePaths();
+   const hasDrift = results.some((r) => r.status === 'MISSING' || r.status === 'DRIFT')
+     || marketplace.some((m) => m.status === 'DANGLING');
 
    if (json) {
-     console.log(JSON.stringify({ files: results, update: updateInfo }, null, 2));
+     console.log(JSON.stringify({ files: results, update: updateInfo, marketplace }, null, 2));
    } else {
      for (const r of results) {
        const color = r.status === 'MATCH' ? GREEN : r.status === 'MISSING' ? RED : YELLOW;
        console.log(` ${color}${r.status}${RESET} ${r.file}`);
        if (r.details) console.log(` ${r.details}`);
      }
+     for (const m of marketplace) {
+       const color = m.status === 'DANGLING' ? RED : YELLOW;
+       const heading = m.status === 'EPHEMERAL'
+         ? `Marketplace '${m.name}' source path is ephemeral:`
+         : `Marketplace '${m.name}' source path does not exist:`;
+       console.log(`\n ${color}${m.status}${RESET} ${heading}`);
+       console.log(` ${m.path}`);
+       console.log(` ${m.details}`);
+       if (m.suggestion) {
+         console.log(` Recommended: move to ${m.suggestion}`);
+       }
+     }
      if (updateInfo) {
        console.log(`\n ${YELLOW}UPDATE${RESET} v${updateInfo.current} -> v${updateInfo.latest}`);
        console.log(' Run: npx agentic-sdlc-wizard init --force');
      }
    }
 
-   return { results, updateInfo, hasDrift };
+   return { results, updateInfo, hasDrift, marketplace };
  }
 
  function checkFile(srcPath, destPath, relativeDest, shouldBeExecutable) {
@@ -341,4 +356,56 @@ function checkGitignore(gitignorePath) {
    return { file: '.gitignore', status: 'MATCH' };
  }
 
+ const EPHEMERAL_ROOTS = /^(\/tmp\/|\/private\/tmp\/|\/var\/folders\/|\/private\/var\/folders\/)/;
+
+ function checkMarketplacePaths() {
+   const results = [];
+   const globalSettings = path.join(os.homedir(), '.claude', 'settings.json');
+
+   if (!fs.existsSync(globalSettings)) return results;
+
+   let data;
+   try {
+     data = JSON.parse(fs.readFileSync(globalSettings, 'utf8'));
+   } catch (_) {
+     return results;
+   }
+
+   const marketplaces = data.extraKnownMarketplaces;
+   if (!marketplaces || typeof marketplaces !== 'object') return results;
+
+   for (const [name, entry] of Object.entries(marketplaces)) {
+     const source = entry && entry.source;
+     if (!source || source.source !== 'directory' || !source.path || typeof source.path !== 'string') continue;
+
+     const sourcePath = source.path;
+     const isEphemeral = EPHEMERAL_ROOTS.test(sourcePath);
+     const exists = fs.existsSync(sourcePath);
+     const basename = path.basename(sourcePath);
+     const suggestion = `~/.claude/plugins-local/${basename}`;
+
+     if (!exists) {
+       results.push({
+         name,
+         path: sourcePath,
+         status: 'DANGLING',
+         details: isEphemeral
+           ? 'Ephemeral path has been reaped — plugin is broken'
+           : 'Path does not exist — plugin may be silently broken',
+         suggestion: isEphemeral ? suggestion : undefined,
+       });
+     } else if (isEphemeral) {
+       results.push({
+         name,
+         path: sourcePath,
+         status: 'EPHEMERAL',
+         details: 'macOS reaps this path periodically — plugin may break silently',
+         suggestion,
+       });
+     }
+   }
+
+   return results;
+ }
+
  module.exports = { init, check, planOperations, GITIGNORE_ENTRIES };
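
To make the input that `checkMarketplacePaths` expects concrete, the sketch below builds a made-up `settings.json` object and applies the same `EPHEMERAL_ROOTS` pattern to it. The shape (`extraKnownMarketplaces.<name>.source.source === 'directory'` plus `source.path`) is read off the code in the hunk above. The marketplace names, the paths, and the non-directory entry are hypothetical.

```javascript
// Same pattern as EPHEMERAL_ROOTS in cli/init.js.
const EPHEMERAL_ROOTS = /^(\/tmp\/|\/private\/tmp\/|\/var\/folders\/|\/private\/var\/folders\/)/;

// Made-up settings content; names and paths are illustrative only.
const settings = {
  extraKnownMarketplaces: {
    'scratch-plugin': { source: { source: 'directory', path: '/tmp/scratch-plugin' } },
    'stable-plugin': { source: { source: 'directory', path: '/opt/plugins/stable-plugin' } },
    // Hypothetical non-directory entry: skipped because source.source !== 'directory'.
    'remote-plugin': { source: { source: 'github', repo: 'example/remote-plugin' } },
  },
};

const directorySources = Object.entries(settings.extraKnownMarketplaces)
  .filter(([, entry]) => entry.source && entry.source.source === 'directory')
  .map(([name, entry]) => ({
    name,
    path: entry.source.path,
    ephemeral: EPHEMERAL_ROOTS.test(entry.source.path),
  }));

console.log(directorySources);
console.log(EPHEMERAL_ROOTS.test('/tmp'));           // false: the pattern requires a trailing slash
console.log(EPHEMERAL_ROOTS.test('/var/tmp/cache')); // false: /var/tmp is not in the list
```

Two properties of the pattern are worth noticing: it is anchored at the start of the string, so only absolute paths match, and relative or `~`-prefixed paths are never flagged as ephemeral.
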
@@ -0,0 +1,36 @@
+ #!/bin/bash
+ # Shared helper: walk up from CWD to find nearest SDLC.md + TESTING.md pair
+ # Sourced by sdlc-prompt-check.sh and instructions-loaded-check.sh
+ # Fixes #171: false-positive "SETUP NOT COMPLETE" in monorepos / nested projects
+
+ # find_sdlc_root — walks up from pwd, stops at $HOME (exclusive)
+ # Sets SDLC_ROOT to the found directory, or empty string if not found
+ find_sdlc_root() {
+     local check_dir
+     check_dir="$(pwd)"
+     SDLC_ROOT=""
+     while [ "$check_dir" != "/" ] && [ "$check_dir" != "$HOME" ] && [ -n "$check_dir" ]; do
+         if [ -f "$check_dir/SDLC.md" ] && [ -f "$check_dir/TESTING.md" ]; then
+             SDLC_ROOT="$check_dir"
+             return 0
+         fi
+         check_dir="$(dirname "$check_dir")"
+     done
+     return 1
+ }
+
+ # find_partial_sdlc_root — walks up looking for EITHER SDLC.md OR TESTING.md
+ # Used to detect partial setup (one file exists but not both) vs not-an-SDLC-project
+ find_partial_sdlc_root() {
+     local check_dir
+     check_dir="$(pwd)"
+     SDLC_ROOT=""
+     while [ "$check_dir" != "/" ] && [ "$check_dir" != "$HOME" ] && [ -n "$check_dir" ]; do
+         if [ -f "$check_dir/SDLC.md" ] || [ -f "$check_dir/TESTING.md" ]; then
+             SDLC_ROOT="$check_dir"
+             return 0
+         fi
+         check_dir="$(dirname "$check_dir")"
+     done
+     return 1
+ }
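
For readers who prefer to trace the walk-up outside of bash, here is a JavaScript port of the two helpers above, offered as an illustration and not as code from the package. `findSdlcRoot` and its `requireBoth` option are hypothetical names. The home directory is passed in explicitly (the shell version reads `$HOME`) so the demo can run inside a throwaway temp tree.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Equivalent of bash `[ -f path ]`: true only for an existing regular file.
function isFile(p) {
  try { return fs.statSync(p).isFile(); } catch (_) { return false; }
}

// Walks up from startDir, stopping before the filesystem root or homeDir.
// requireBoth=true mirrors find_sdlc_root; false mirrors find_partial_sdlc_root.
function findSdlcRoot(startDir, homeDir, { requireBoth = true } = {}) {
  let dir = path.resolve(startDir);
  const home = path.resolve(homeDir);
  while (dir !== path.parse(dir).root && dir !== home) {
    const hasSdlc = isFile(path.join(dir, 'SDLC.md'));
    const hasTesting = isFile(path.join(dir, 'TESTING.md'));
    if (requireBoth ? hasSdlc && hasTesting : hasSdlc || hasTesting) return dir;
    dir = path.dirname(dir);
  }
  return null;
}

// Throwaway tree: <base>/proj holds both files, <base>/half holds only one.
const base = fs.realpathSync(fs.mkdtempSync(path.join(os.tmpdir(), 'sdlc-walk-')));
const nested = path.join(base, 'proj', 'packages', 'api');
const half = path.join(base, 'half', 'src');
fs.mkdirSync(nested, { recursive: true });
fs.mkdirSync(half, { recursive: true });
fs.writeFileSync(path.join(base, 'proj', 'SDLC.md'), '# SDLC\n');
fs.writeFileSync(path.join(base, 'proj', 'TESTING.md'), '# Testing\n');
fs.writeFileSync(path.join(base, 'half', 'SDLC.md'), '# SDLC\n');

// `base` plays the role of $HOME, so the walk never leaves the temp tree.
const full = findSdlcRoot(nested, base);
const none = findSdlcRoot(half, base);
const partial = findSdlcRoot(half, base, { requireBoth: false });
console.log(full === path.join(base, 'proj'), none, partial === path.join(base, 'half'));
// prints: true null true

fs.rmSync(base, { recursive: true, force: true });
```

Because the stop at the home directory is exclusive, a stray `SDLC.md` sitting directly in `$HOME` is never picked up, which is the behavior the shell comment describes.
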
@@ -4,7 +4,21 @@
  # Available since Claude Code v2.1.69
  # Note: no set -e — this hook must always exit 0 to not block session start
 
- PROJECT_DIR="${CLAUDE_PROJECT_DIR:-.}"
+ # Walk up from CWD to find nearest SDLC.md + TESTING.md (#171: monorepo support)
+ HOOK_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ source "$HOOK_DIR/_find-sdlc-root.sh"
+
+ # CWD walk-up finds nearest SDLC project (#173: silent exit for non-SDLC dirs)
+ if find_sdlc_root; then
+     PROJECT_DIR="$SDLC_ROOT"
+ elif find_partial_sdlc_root; then
+     # Partial setup — one file exists but not both. Warn about missing files
+     PROJECT_DIR="$SDLC_ROOT"
+ else
+     # Not an SDLC project at all — exit silently
+     exit 0
+ fi
+
  MISSING=""
 
  if [ ! -f "$PROJECT_DIR/SDLC.md" ]; then
@@ -2,8 +2,21 @@
  # Light SDLC hook - baseline reminder every prompt (~100 tokens)
  # Full guidance in skill: .claude/skills/sdlc/
 
- # Check if setup has been completed
- PROJECT_DIR="${CLAUDE_PROJECT_DIR:-.}"
+ # Walk up from CWD to find nearest SDLC.md + TESTING.md (#171: monorepo support)
+ HOOK_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ source "$HOOK_DIR/_find-sdlc-root.sh"
+
+ # CWD walk-up finds nearest SDLC project (#173: silent exit for non-SDLC dirs)
+ if find_sdlc_root; then
+     PROJECT_DIR="$SDLC_ROOT"
+ elif find_partial_sdlc_root; then
+     # Partial setup — one file exists but not both. Warn about missing files
+     PROJECT_DIR="$SDLC_ROOT"
+ else
+     # Not an SDLC project at all — exit silently
+     exit 0
+ fi
+
  if [ ! -s "$PROJECT_DIR/SDLC.md" ] || [ ! -s "$PROJECT_DIR/TESTING.md" ]; then
      cat << 'SETUP'
  SETUP NOT COMPLETE: SDLC.md and/or TESTING.md are missing.
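
Both hooks now branch three ways. Reduced to its decision table, the logic can be sketched as below. `classifySetup` and its return labels are hypothetical names chosen for illustration. The real hooks decide by walking up the directory tree, not by inspecting a single directory.

```javascript
// Three outcomes described by the hook change; labels are this sketch's own.
function classifySetup(hasSdlc, hasTesting) {
  if (hasSdlc && hasTesting) return 'normal';  // both files found: proceed as usual
  if (hasSdlc || hasTesting) return 'partial'; // one file found: warn about the missing one
  return 'silent';                             // neither found: not an SDLC project, exit 0 quietly
}

console.log(
  classifySetup(true, true),
  classifySetup(true, false),
  classifySetup(false, true),
  classifySetup(false, false),
);
// prints: normal partial partial silent
```

The `silent` branch is what removes the false positive: before this change, a directory with neither file fell through to the "SETUP NOT COMPLETE" message.
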
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "agentic-sdlc-wizard",
-   "version": "1.30.0",
+   "version": "1.32.0",
    "description": "SDLC enforcement for Claude Code — hooks, skills, and wizard setup in one command",
    "bin": {
      "sdlc-wizard": "./cli/bin/sdlc-wizard.js"
@@ -46,9 +46,10 @@ Parse all CHANGELOG entries between the user's installed version and the latest.
 
  ```
  Installed: 1.24.0
- Latest: 1.30.0
+ Latest: 1.31.0
 
  What changed:
+ - [1.31.0] Hook false-positive fix for non-SDLC dirs, ephemeral marketplace path warning, ...
  - [1.30.0] Firmware fixture, model A/B comparison workflow, CC degradation detection, ...
  - [1.29.0] Node 24 compliance, autocompact in settings.json, effectiveness scoreboard, ...
  - [1.28.0] Autocompact benchmarking methodology, canary fact mechanism, benchmark harness, ...