agentic-sdlc-wizard 1.31.0 → 1.32.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +23 -0
- package/CLAUDE_CODE_SDLC_WIZARD.md +33 -4
- package/README.md +4 -0
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED
@@ -4,6 +4,29 @@ All notable changes to the SDLC Wizard.
 
 > **Note:** This changelog is for humans to read. Don't manually apply these changes - just run the wizard ("Check for SDLC wizard updates") and it handles everything automatically.
 
+## [1.32.0] - 2026-04-16
+
+### Added
+- Opus 4.7 support in benchmark workflow (#178)
+- `claude-opus-4-7` added to model choices, `effort` input (high/xhigh/max)
+- `--effort` passed via `claude_args`, effort recorded in artifacts + summaries
+- Hard-fail when xhigh used with non-4.7 models (inputs resolved before shell)
+- Artifact names include effort level to prevent collision
+- Default: opus-4-7 + xhigh (matches CC's new default)
+- 3 new tests (39 total model-comparison tests)
+- `xhigh` effort level documented in wizard (#178)
+- New effort table: high → xhigh (recommended for coding) → max
+- Opus 4.7 changes: stricter effort adherence, budget_tokens deprecated, 64k+ max_tokens guidance
+- Benchmark ceiling effect audit documented in wizard
+- Cross-model audit (Codex GPT-5.4, xhigh) rated benchmark 2/10 NOT CERTIFIED
+- 4 P0 findings: fake trials, answer key leaked, no independent verification, binary rubric
+- 3 concrete fixes documented (remove coaching, add correctness scoring, real trials)
+- External benchmark comparison (SWE-Bench, Aider methodology)
+- Automation Station community Discord link in README
+
+### Fixed
+- Orphaned `skills/gdlc/` causing test-doc-consistency failures (deleted)
+
 ## [1.31.0] - 2026-04-14
 
 ### Added
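To make the benchmark-workflow entries above concrete, here is a rough GitHub Actions sketch of how the new `effort` input and the xhigh guard could fit together. This is an illustrative reconstruction based only on the changelog wording: the input names, model IDs, step names, and the `run-benchmark.sh` script are assumptions, not the package's actual workflow.

```yaml
# Hypothetical sketch only; not the package's real benchmark workflow.
on:
  workflow_dispatch:
    inputs:
      model:
        type: choice
        options: [claude-opus-4-6, claude-opus-4-7]  # claude-opus-4-7 newly added in 1.32.0
        default: claude-opus-4-7
      effort:
        type: choice
        options: [high, xhigh, max]
        default: xhigh  # matches Claude Code's new default on Opus 4.7

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      # Hard-fail when xhigh is requested on a non-4.7 model.
      # Inputs are compared in the `if:` expression rather than interpolated into the shell.
      - name: Reject xhigh on non-4.7 models
        if: inputs.effort == 'xhigh' && inputs.model != 'claude-opus-4-7'
        run: |
          echo "effort=xhigh requires claude-opus-4-7" >&2
          exit 1

      - uses: actions/checkout@v4

      # The changelog says --effort is forwarded via claude_args; this script call is a stand-in.
      - name: Run benchmark
        run: ./run-benchmark.sh --model "${{ inputs.model }}" --claude-args "--effort ${{ inputs.effort }}"

      # Effort level in the artifact name keeps runs at different levels from colliding.
      - uses: actions/upload-artifact@v4
        with:
          name: benchmark-${{ inputs.model }}-${{ inputs.effort }}
          path: results/
```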
package/CLAUDE_CODE_SDLC_WIZARD.md
CHANGED
@@ -99,6 +99,27 @@ This prevents both false positives (crying wolf) and false negatives (missing re
 - Green CI = safe to upgrade. Red = stay on current version until fixed
 - Results shown in PR with statistical confidence
 
+### Benchmark Ceiling Effect (Known Issue — April 2026)
+
+**Our E2E benchmark currently has zero discriminating power.** Both Opus 4.6 and 4.7 scored perfect 10/10 on the `add-feature` scenario (3 trials each, `high` effort). A cross-model audit (Codex GPT-5.4, xhigh reasoning) rated the benchmark methodology **2/10, NOT CERTIFIED** and identified 4 P0 critical issues:
+
+| Finding | Severity | Problem |
+|---------|----------|---------|
+| **Fake trials** | P0 | The workflow runs the simulation ONCE, then re-scores the same output N times. "Trials" measure judge jitter, not model variance |
+| **Answer key leaked** | P0 | The simulation prompt tells the model exactly what's scored ("You MUST use TodoWrite... scored by automated checks"). This tests obedience to rubric, not SDLC judgment |
+| **No independent verification** | P0 | "Tests pass" is self-reported from the transcript. The evaluator never re-runs `npm test` on the final code |
+| **Binary rubric** | P0 | Every criterion is YES/NO. The evaluator is explicitly designed for "near-zero variance." On an easy coached task, scores collapse to 10/10 |
+
+**Three concrete fixes to break the ceiling:**
+
+1. **Remove rubric leakage** — Don't tell the model what's scored in the simulation prompt. Let the wizard hooks and docs drive behavior naturally. Score hidden behaviors from traces, not coached compliance
+2. **Make correctness the majority of the score** — After simulation, run an external verifier: re-run `npm test` on the modified fixture, add hidden tests the model didn't know about, inspect the actual diff. Replace transcript-only `clean_code` with diff-based quality checks
+3. **Real trials on calibrated scenarios** — Each trial must be a fresh end-to-end simulation run on a fresh checkout. Select scenarios by pilot difficulty so top models don't all saturate (similar to Aider's hard-subset methodology). The current single-coached-toy-run approach is measuring nothing
+
+**What external benchmarks do differently:** SWE-Bench gives a real issue plus a full repo snapshot, applies the agent's patch, and runs the repo's actual tests to score `% resolved`. Aider's polyglot benchmark was explicitly rebuilt because the old one saturated — it uses 225 harder tasks chosen to preserve headroom. Our benchmark lacks real task difficulty calibration, independent execution-based correctness, multi-task breadth, and headroom management.
+
+**Status:** This is tracked as item #96 (E2E score audit) on the roadmap. Until fixed, the benchmark measures process compliance coaching, not model quality differentiation.
+
 ---
 
 ## Philosophy: Sensible Defaults, Smart Customization
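As a sketch of what fix 2 (execution-based correctness) could look like in practice, the steps below re-run the fixture's own tests plus a set of hidden tests after the simulation, instead of trusting the transcript. The paths, file layout, and step names are assumptions for illustration; they are not part of the current benchmark.

```yaml
# Hypothetical verification steps that would follow the simulation step in the benchmark job.
# All paths and names are illustrative assumptions.
- name: Re-run fixture tests on the agent's actual output
  working-directory: benchmark/fixture
  run: |
    npm ci
    npm test   # correctness comes from execution, not from the transcript

- name: Run hidden tests the model never saw
  working-directory: benchmark/fixture
  run: |
    cp ../hidden-tests/*.test.js test/
    npm test

- name: Capture a diff-based quality signal
  run: git -C benchmark/fixture diff --stat > results/diff-stats.txt
```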
@@ -221,12 +242,20 @@ Claude Code's **effort level** controls how much thinking the model does before
 
 | Level | When to Use | How to Set |
 |-------|-------------|------------|
-| `high` |
-| `
+| `high` | Standard SDLC work. Features, bug fixes, refactoring, tests, reviews | `effort: high` in skill frontmatter (already set) |
+| `xhigh` | **Recommended default for coding and agentic work (Opus 4.7+).** Long-running tasks, repeated tool calls, deep exploration. Claude Code defaults to this on Opus 4.7 | `/effort xhigh` or set in skill frontmatter |
+| `max` | LOW confidence, FAILED 2x, architecture decisions, complex debugging, cross-model reviews. Reserve for genuinely frontier problems — on most workloads `max` adds cost for small quality gains | `/effort max` (session only — resets next session) |
+
+**Effort level changes in Opus 4.7 (April 2026):**
+- **`xhigh` is new** — sits between `high` and `max`, designed for coding and agentic work (30+ minute tasks with token budgets in the millions)
+- **Claude Code now defaults to `xhigh`** on Opus 4.7 for all plans
+- **Opus 4.7 respects effort levels more strictly** than 4.6 — at lower levels it scopes work tighter instead of going above and beyond. If you see shallow reasoning, raise effort rather than prompting around it
+- **`budget_tokens` is deprecated** on Opus 4.7 — use adaptive thinking with effort instead
+- When running at `xhigh` or `max`, set a large `max_tokens` (64k+) so the model has room to think across subagents and tool calls
 
-**Why `high`
+**Why `high` was the previous default:** Claude Code uses **adaptive thinking** to dynamically allocate reasoning budget per turn. On Pro and Max plans, the default effort level was **medium (85)**, which causes the model to under-allocate reasoning on complex multi-step tasks — leading to shallow analysis, missed edge cases, and "lazy" outputs. This was [confirmed by Anthropic engineer Boris Cherny](https://github.com/anthropics/claude-code/issues/42796) and is documented at [code.claude.com](https://code.claude.com/docs/en/model-config). API, Team, and Enterprise plans default to high effort and are not affected.
 
-The `/sdlc` skill sets `effort: high` in its frontmatter, overriding the medium default on every SDLC invocation.
+The `/sdlc` skill sets `effort: high` in its frontmatter, overriding the medium default on every SDLC invocation. Consider upgrading to `effort: xhigh` on Opus 4.7+ for deeper reasoning on complex tasks.
 
 **Nuclear option — disable adaptive thinking entirely:** Set `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1` in your environment or settings.json `env` block. This forces a fixed reasoning budget per turn instead of letting the model dynamically allocate. Use this if you observe persistent quality issues even with `effort: high`. See [Claude Code model config docs](https://code.claude.com/docs/en/model-config) for details.
 
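For illustration, bumping a skill's effort in its frontmatter is a one-line change along these lines. This is a sketch: field names other than `effort` are assumed placeholders, and `xhigh` requires Opus 4.7+.

```yaml
---
# SKILL.md frontmatter sketch; `name` and `description` values are placeholders
name: sdlc
description: Run the SDLC wizard workflow
effort: xhigh   # previously `high`; recommended for coding and agentic work on Opus 4.7+
---
```

The session-scoped `/effort xhigh` command and the `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1` env setting described above remain available when editing frontmatter isn't practical.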
package/README.md
CHANGED
@@ -235,6 +235,10 @@ This isn't the only Claude Code SDLC tool. Here's an honest comparison:
 | [CHANGELOG.md](CHANGELOG.md) | Version history, what changed and when |
 | [CONTRIBUTING.md](CONTRIBUTING.md) | How to contribute, evaluation methodology |
 
+## Community
+
+Come join **[Automation Station](https://discord.com/invite/fGPEF7GHrF)** — a community Discord packed with software engineers bringing 40+ years of combined experience across every area of the stack (frontend, backend, infra, embedded, data, QA, DevOps, you name it). Share patterns, ask questions, compare notes on AI agents, automation, and SDLC tooling.
+
 ## Contributing
 
 PRs welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for evaluation methodology and testing.