agentic-sdlc-wizard 1.15.0

package/CHANGELOG.md ADDED
@@ -0,0 +1,343 @@
+ # Changelog
+
+ All notable changes to the SDLC Wizard.
+
+ > **Note:** This changelog is for humans to read. Don't manually apply these changes - just run the wizard ("Check for SDLC wizard updates") and it handles everything automatically.
+
+ ## [1.15.0] - 2026-03-25
+
+ ### Added
+ - aistupidlevel.info as Source 3 in external benchmark cascade (DailyBench -> LiveBench -> aistupidlevel -> baseline)
+ - Competitive watchlist in `analyze-community.md` — weekly scan now checks 5 named repos for new releases/patterns
+ - `COMPETITIVE_AUDIT.md` — honest ecosystem comparison, unique strengths, tracked gaps, contribution ideas
+ - README "How This Compares" section with honest positioning table
+ - Token usage tracking gap documented (blocked until `claude-code-action` exposes usage data)
+ - 3 new tests in `test-external-benchmark.sh` for aistupidlevel integration
+ - 2 new tests in `test-prove-it.sh` for competitive watchlist and README positioning
+
+ ### Changed
+ - Roadmap reordered: competitive audit (#10) marked DONE
+
+ ## [1.14.0] - 2026-03-24
+
+ ### Fixed
+ - CI re-trigger bug: `workflow_dispatch` caused `e2e-quick-check` to skip, blocking auto-merge (PR #75). Jobs now accept dispatch events with simulation steps gated behind PR-only checks
+ - SDLC.md version stuck at 1.9.0 (should be 1.14.0)
+ - CONTRIBUTING.md missing 11 test scripts, outdated scoring criteria, wrong repo URL in discussions link
+
+ ### Added
+ - 3 tests in `test-workflow-triggers.sh`: verify required CI jobs accept `workflow_dispatch`
+ - 4 integration tests in `test-prove-it.sh`: prove `compare_ci` detects REGRESSION/STABLE/IMPROVED with synthetic scores
+ - 3 E2E tests in `test-self-update.sh`: verify live CHANGELOG and wizard URLs return valid content
+ - `should_simulate` gate in CI: dispatch runs produce green checks without burning API credits
+ - Documented `workflow_dispatch` behavior in `ci-self-heal.yml`
+
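The REGRESSION/STABLE/IMPROVED verdicts that the `compare_ci` tests exercise can be pictured as a three-way threshold on the score delta. This is a minimal sketch, not the real `compare_ci` logic. The function name is hypothetical, and the default threshold of 1.5 borrows the regression threshold quoted in the 1.7.0 entry and assumes, without confirmation from the changelog, that it applies symmetrically and inclusively.

```python
def classify(baseline: float, candidate: float, threshold: float = 1.5) -> str:
    """Map a baseline/candidate score pair to a verdict string.

    Assumption: the same threshold applies in both directions and the
    boundary counts as a change; the real tool may differ.
    """
    delta = candidate - baseline
    if delta <= -threshold:
        return "REGRESSION"
    if delta >= threshold:
        return "IMPROVED"
    # Anything inside the noise band is treated as no meaningful change.
    return "STABLE"
```

With synthetic scores, `classify(7.0, 5.0)` falls below the band, `classify(7.0, 6.0)` stays inside it, and `classify(5.0, 7.0)` rises above it.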
+ ### Changed
+ - Roadmap reordered: competitive audit (#10) before distribution (#30)
+ - CONTRIBUTING.md scoring criteria updated to v3 multi-call judge + v3.1 pairwise tiebreaker
+ - CONTRIBUTING.md test list updated to all 21 CI validate scripts
+
+ ## [1.13.0] - 2026-03-23
+
+ ### Changed
+ - Rewrote "Staying Updated" section with explicit fetch URLs, CHANGELOG-first update flow, and 4-phase process
+ - Claude now shows users what changed (via CHANGELOG) before offering to apply updates
+ - Fixed "CHANGELOG is for Humans, Not Claude" — Claude reads CHANGELOG first to drive the update flow
+
+ ### Added
+ - Optional "Wizard Update Notification" GitHub Action template — weekly check, creates issue when newer version exists ($0 cost, no API key)
+ - `step-update-notify` in wizard step registry (optional step for CI notification)
+ - 12 new tests in `tests/test-self-update.sh` (URL correctness, YAML validation, workflow template, step registry)
+
+ ## [1.12.0] - 2026-03-23
+
+ ### Fixed
+ - Apply step in `weekly-update.yml` and `monthly-research.yml` never propagated changes to test fixture (baseline == candidate, verdict always STABLE, comparison useless)
+ - Stale output file between baseline and candidate simulations in both auto-update workflows (same bug as ci.yml, fixed in #24)
+ - `sdp-score.sh` default model `claude-sonnet-4` corrected to `claude-opus-4-6` (matches evaluate.sh)
+ - README "All 6 workflows" corrected to "All 5 workflows" (stale since v1.9.0 consolidation)
+
+ ### Added
+ - 6 new audit tests: apply step propagation (2), stale output cleanup (2), SDP model consistency (1), README accuracy (1)
+ - Native CC feature overlap analysis: all 5 custom features audited — KEEP CUSTOM (no overlap with CC v2.1.81)
+
+ ### Audited (no changes needed)
+ - All 5 custom features (hooks + skills): value is in content (SDLC philosophy, TDD enforcement), not framework
+ - Noted for future: `continue-on-error` patterns, `/tmp` hardcodes, permission scoping
+
+ ## [1.11.0] - 2026-03-23
+
+ ### Fixed
+ - Stale output file between baseline and candidate simulations in Tier 2 (candidate eval could read baseline data on silent failure)
+ - Comment "3x evaluations" corrected to "5x evaluations" in ci.yml Tier 2 header
+ - `run-tier2-evaluation.sh` silent `score=0` fallback replaced with proper error handling (stderr separation, exit on failure)
+
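The `run-tier2-evaluation.sh` fix above replaces a silent `score=0` fallback with stderr separation and exit on failure. The sketch below shows the same idea in Python rather than the project's shell: stdout and stderr are captured separately, and a failing scorer raises instead of being read as zero. The function name and the child commands are hypothetical.

```python
import subprocess
import sys
from typing import Sequence


def run_scorer(cmd: Sequence[str]) -> float:
    """Run a scoring command and return the number it prints on stdout."""
    # capture_output keeps stderr out of stdout, so warnings cannot
    # corrupt the value that gets parsed as the score.
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    if proc.returncode != 0:
        # Fail loudly rather than falling back to a score of 0.
        raise RuntimeError(
            f"scorer exited with {proc.returncode}: {proc.stderr.strip()}"
        )
    return float(proc.stdout.strip())


# Hypothetical scorer: prints a score on stdout and a warning on stderr.
ok_cmd = [sys.executable, "-c",
          "import sys; print('7.5'); print('warning', file=sys.stderr)"]
# Hypothetical scorer that fails outright.
bad_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]
```

`run_scorer(ok_cmd)` parses only the stdout value, while `run_scorer(bad_cmd)` raises `RuntimeError`, so a broken evaluation cannot masquerade as a genuine zero.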
+ ### Added
+ - 13 test scripts wired into CI validate job (228 additional tests now run on every PR)
+ - Tests for Tier 2 comment accuracy and stale output cleanup
+ - Tests for `run-tier2-evaluation.sh` error handling (no stderr suppression, no silent fallback)
+
+ ### Removed
+ - Legacy duplicate `tests/test-self-heal-simulation.sh` (690 lines, subset of e2e version)
+
+ ## [1.10.0] - 2026-03-22
+
+ ### Added
+ - "Prove It's Better" CI automation — when weekly-update detects a CC release that overlaps a custom wizard feature, CI auto-runs a side-by-side Tier 2 comparison and recommends KEEP CUSTOM / SWITCH TO NATIVE / TIE
+ - `tests/e2e/lib/prove-it.sh` — path validation allowlist + fixture stripping library
+ - `prove-it-test` job in `weekly-update.yml` — only runs when overlap detected ($0 extra on typical weeks)
+ - Custom feature inventory table in `analyze-release.md` — tells Claude what to check for overlap
+ - `has_overlap` / `overlap_paths` outputs wired from `check-updates` job
+ - 13 new tests in `tests/test-prove-it.sh` (allowlist validation, fixture stripping, settings.json updates, overlap signal parsing, workflow integration)
+ - Test fixture `tests/fixtures/releases/v99.0.0-overlap.json`
+
+ ## [1.9.1] - 2026-03-22
+
+ ### Verified
+ - `all-findings` self-heal (#27): PR #70 confirmed `workflow_run` triggers on review suggestions, `AUTOFIX_LEVEL=all-findings` passes filtering, Claude invoked in `review-findings` mode
+
+ ### Added
+ - Real CI review format parsing test (h4 headers, `_None._` italic, line references)
+ - Roadmap ordering in AUTO_SELF_UPDATE.md
+
+ ## [1.9.0] - 2026-03-21
+
+ ### Changed
+ - Consolidated `daily-update.yml` + `weekly-community.yml` into single `weekly-update.yml`
+ - 4 jobs: check-updates, version-test, scan-community, community-e2e-test
+ - Single Monday 9 AM UTC schedule (was two separate cron entries)
+ - Reduces workflow count from 6 to 5, auto-update workflows from 3 to 2
+ - Cost: ~$2.50/week combined (unchanged)
+ - Updated all docs and 25+ tests to reference `weekly-update.yml`
+
+ ### Added
+ - 5 new workflow consolidation tests: 4-job structure, dependency chains, permissions, single cron
+
+ ## [1.8.1] - 2026-03-21
+
+ ### Fixed
+ - `tdd_red` deterministic checker: now parses JSON execution output via jq (was always scoring 0/2 due to regex mismatch with claude-code-action JSON format)
+ - Score history push: checkout actual PR branch before push (was silently failing from detached HEAD)
+ - `instructions-loaded-check.sh`: explicit `exit 0` for defensive safety
+
+ ### Changed
+ - Phase 5: Re-enabled all auto-update workflow schedules
+ - `weekly-update.yml` (formerly `daily-update.yml` + `weekly-community.yml`): Mondays 9 AM UTC
+ - `monthly-research.yml`: re-enabled (1st of month 11 AM UTC)
+ - Golden scores: `high-compliance.tdd_red` updated to 0 (text golden files lack JSON tool_use blocks; tdd_red correctness verified via dedicated JSON unit tests)
+
+ ### Added
+ - 7 new tests: JSON-based tdd_red checks (5), empty/nonexistent file edge cases (2)
+ - 3 new workflow trigger tests: weekly schedule validation, all-schedules-active, score-history-checkout
+
+ ## [1.8.0] - 2026-03-20
+
+ ### Added
+ - Version catch-up: consolidated update from Claude Code v2.1.15 to v2.1.81 (66 minor versions)
+ - `InstructionsLoaded` hook (`instructions-loaded-check.sh`) — validates SDLC.md and TESTING.md exist at session start (v2.1.69+)
+ - `effort: high` frontmatter on `/sdlc` and `/testing` skills (v2.1.80+)
+ - "Prove It's Better" core philosophy — use native features unless custom is proven better via E2E comparison
+ - Vision statement in README — "Mold an ever-evolving SDLC... replace with native... one day delete this repo"
+ - Documentation section in README linking ARCHITECTURE.md, CI_CD.md, SDLC.md, TESTING.md, CHANGELOG.md, CONTRIBUTING.md
+ - Documented new built-in commands in wizard: `/memory`, `/simplify`, `/batch`, `/loop`, `/effort`
+ - Documented security hardening fixes (v2.1.49, v2.1.72, v2.1.74, v2.1.77, v2.1.78)
+ - Documented `${CLAUDE_SKILL_DIR}` variable, `agent_id`/`agent_type` hook metadata
+ - Documented `CLAUDE_CODE_SIMPLE` bypass risk, HTML comment behavior, 128k output tokens, `--bare` flag
+ - 7 new hook tests (18 total) for InstructionsLoaded hook
+ - `plans/CATCHUP.md` — documents the version catch-up process for future reference
+
+ ### Changed
+ - Claude Code baseline bumped from v2.1.15+ to v2.1.81+
+ - Wizard version bumped from 1.7.0 to 1.8.0
+ - Prerequisites updated: minimum v2.1.69+ (was v2.1.16+)
+ - `.github/last-checked-version.txt` updated to v2.1.81
+ - Scheduled workflow triggers disabled (PR #66) to save API tokens — re-enable in Phase 5
+
+ ### Audited (Category C: no swap needed)
+ - No custom `/claude-api` skill exists — nothing to swap with native built-in
+
+ ## [1.7.0] - 2026-02-15
+
+ ### Added
+ - CI Auto-Fix Loop (`ci-self-heal.yml`) — automated fix cycle for CI failures and PR review findings
+ - Multi-call LLM judge (v3) — per-criterion API calls with dedicated calibration examples
+ - Golden output regression — 3 saved outputs with verified expected score ranges catch prompt drift
+ - Per-criterion CUSUM — tracks individual criterion drift, not just total score
+ - Pairwise tiebreaker (v3.1) — holistic comparison with full swap when scores within 1.0
+ - Deterministic pre-checks — grep-based scoring for task_tracking, confidence, tdd_red (free, fast)
+ - 3 real-world scenarios: multi-file-api-endpoint, production-bug-investigation, technical-debt-cleanup
+ - Score analytics (`score-analytics.sh`) — history parsing, trends, per-criterion averages, reports
+ - Score history persistence — results committed back to repo after each E2E evaluation
+ - Historical context in PR comments — scenario average and weakest criterion
+ - Color-coded PR comments — emoji indicators for PASS/WARN/FAIL per criterion
+ - Binary sub-criteria scoring with workflow input validation (PR #32)
+ - Evaluate bug regression tests (`test-evaluate-bugs.sh`)
+ - Score analytics tests (`test-score-analytics.sh`)
+ - Self-heal simulation tests (25 tests) — retry counting, AUTOFIX_LEVEL filtering, findings parsing, branch safety
+ - Self-heal live fire test procedure — validated full workflow_run → Claude fix → commit cycle (PR #52)
+
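The "Per-criterion CUSUM" item above refers to cumulative-sum drift detection applied to each scoring criterion separately. The sketch below is a textbook one-sided (downward) CUSUM, not the wizard's implementation. The function name, the `slack` and `limit` parameters, and the sample histories are all hypothetical.

```python
from typing import Dict, Optional, Sequence


def cusum_low(scores: Sequence[float], target: float,
              slack: float = 0.5, limit: float = 2.0) -> Optional[int]:
    """Return the index where downward drift is first flagged, else None.

    The statistic accumulates shortfalls below `target` that exceed
    `slack` and resets to zero whenever scores recover.
    """
    s = 0.0
    for i, x in enumerate(scores):
        s = max(0.0, s + (target - x - slack))
        if s > limit:
            return i
    return None


# Hypothetical per-criterion score histories (oldest first).
history: Dict[str, Sequence[float]] = {
    "tdd_red": [8.0, 8.0, 7.0, 6.0, 6.0, 6.0],
    "confidence": [8.0, 7.8, 8.1, 8.0],
}

# Run the detector independently for every criterion.
alarms = {name: cusum_low(vals, target=8.0) for name, vals in history.items()}
```

Tracking each criterion on its own means a slow decline in one criterion is not masked by a steady total score.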
+ ### Fixed
+ - `workflow_run` trigger dead for ci-autofix — invalid `workflows: write` permission scope caused GitHub parser to silently fail; removed it + renamed to `ci-self-heal.yml`
+ - Tier 1 E2E flakiness — regression threshold widened from -0.5 to -1.5 (absorbs ±1 LLM noise)
+ - Silent zero scores from `2>&1` mixing stderr into stdout (PR #33)
+ - Token/cost metrics always N/A — removed dead extraction code (action doesn't expose usage data)
+ - Score history never persisting (ephemeral runner) — added git commit step
+ - `show_full_output` invalid action input — deleted
+ - `configureGitAuth` crash — added `git init` before simulation
+ - `error_max_turns` on hard scenarios — bumped from 45 to 55
+ - Autofix can't push workflow files — requires PAT with `workflow` scope or GitHub App (not YAML permissions)
+ - `git push` silent error swallowing in `weekly-community.yml` — removed `|| echo` fallback
+ - Missing `pull-requests: write` permission in `monthly-research.yml` — e2e-test job creates PRs but permission wasn't declared
+ - Workflow input validation audit — removed `prompt_file`, `direct_prompt`, `model` invalid inputs across all 3 auto-update workflows
+ - `outputs.response` doesn't exist — read from execution output file instead
+ - CI re-trigger 403 in self-heal loop — missing `actions: write` permission for `gh workflow run` dispatch
+
+ ### Changed
+ - `monthly-research.yml` schedule enabled (1st of month, 11 AM UTC) — Item 23 Phase 3
+ - `weekly-community.yml` schedule enabled (Mondays 10 AM UTC) — Item 23 Phase 2
+ - `daily-update.yml` schedule re-enabled (9 AM UTC) — Item 23 Phase 1
+ - All auto-update workflows create PRs (removed "LOW → direct commit" path)
+ - Evaluation uses `claude-opus-4-6` model (was hardcoded to `claude-sonnet-4`)
+ - E2E scenarios expanded from 10 to 13
+
+ ## [1.6.0] - 2026-02-06
+
+ ### Added
+ - Full test coverage for stats library, hooks, and compliance checker (34 new tests)
+ - Extended SDP calculation and external benchmark tests (9 new tests)
+ - Future roadmap items 14-19 in AUTO_SELF_UPDATE.md
+
+ ### Fixed
+ - Version format validation before npm install (security: prevents injection)
+ - Hardcoded `/home/runner/work/_temp/` paths replaced with `${RUNNER_TEMP:-/tmp}`
+ - Silent fallback to v0.0.0 on API failure (now fails loudly)
+ - Duplicate prompt sources in daily-update workflow (prompt_file + inline prompt)
+ - Hardcoded output path in pr-review workflow
+ - Weekly community workflow hardcoded output path
+
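The "version format validation before npm install" fix guards against a version string that smuggles shell syntax into an install command. The changelog does not state the accepted format, so this sketch assumes a plain `MAJOR.MINOR.PATCH` string with an optional leading `v`. The pattern and function name are hypothetical.

```python
import re

# Assumed format: optional "v", then three dot-separated ASCII numbers.
_VERSION_RE = re.compile(r"v?[0-9]+\.[0-9]+\.[0-9]+")


def is_safe_version(value: str) -> bool:
    """Return True only when the whole string is a bare version number."""
    # fullmatch rejects any trailing payload such as "; rm -rf /".
    return _VERSION_RE.fullmatch(value) is not None
```

Anchoring the match to the entire string is what matters here: a prefix match would accept `2.1.81; rm -rf /` because it starts with a valid version.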
+ ### Changed
+ - Documentation overhaul: TESTING.md, CI_CD.md, CONTRIBUTING.md, README.md updated
+ - SDLC.md version tracking updated from 1.0.0 to 1.6.0
+
+ ### Files Added
+ - `tests/test-stats.sh` - Statistical functions tests (14 tests)
+ - `tests/test-hooks.sh` - Hook script tests (11 tests)
+ - `tests/test-compliance.sh` - Compliance checker tests (9 tests)
+
+ ### Files Modified
+ - `.github/workflows/daily-update.yml` - Security + correctness fixes
+ - `.github/workflows/pr-review.yml` - Hardcoded path fix
+ - `.github/workflows/weekly-community.yml` - Hardcoded path fix
+ - `tests/test-sdp-calculation.sh` - Extended (5 new tests)
+ - `tests/test-external-benchmark.sh` - Extended (4 new tests)
+
+ ## [1.5.0] - 2026-02-03
+
+ ### Added
+ - SDP (SDLC Degradation-adjusted Performance) scoring to distinguish "model issues" from "wizard issues"
+ - External benchmark tracking (DailyBench, LiveBench) with 24-hour caching
+ - Robustness metric showing how well SDLC holds up vs model changes
+ - Two-layer scoring: L1 (Model Quality) + L2 (SDLC Compliance)
+
+ ### How It Works
+ PR comments now show three metrics:
+ - **Raw Score**: Actual E2E measurement
+ - **SDP Score**: Adjusted for external model conditions
+ - **Robustness**: < 1.0 = resilient, > 1.0 = sensitive
+
+ When model benchmarks drop but your SDLC score holds steady, that's a sign your wizard setup is robust.
+
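The changelog gives the reading of the Robustness metric (below 1.0 is resilient, above 1.0 is sensitive) but not its formula. The sketch below assumes, purely for illustration, that robustness is the ratio of the SDLC score's relative change to the external benchmark's relative change. That ratio, both function names, and the "neutral" label for exactly 1.0 are assumptions, not what `sdp-score.sh` computes.

```python
from typing import Optional


def robustness(sdlc_change_pct: float, benchmark_change_pct: float) -> Optional[float]:
    """Assumed formula: |SDLC score change| / |benchmark change|."""
    if benchmark_change_pct == 0:
        # No external movement to compare against, so the ratio is undefined.
        return None
    return abs(sdlc_change_pct) / abs(benchmark_change_pct)


def interpret(value: float) -> str:
    """Apply the thresholds quoted in the changelog."""
    if value < 1.0:
        return "resilient"
    if value > 1.0:
        return "sensitive"
    # The changelog does not cover exactly 1.0; "neutral" is an assumption.
    return "neutral"
```

Under this assumed formula, a 10% benchmark drop paired with a 2% SDLC score drop gives a ratio well below 1.0, which matches the "score holds steady" case described above.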
+ ### Files Added
+ - `tests/e2e/lib/external-benchmark.sh` - Multi-source benchmark fetcher
+ - `tests/e2e/lib/sdp-score.sh` - SDP calculation logic
+ - `tests/e2e/external-baseline.json` - Baseline external benchmarks
+ - `tests/test-external-benchmark.sh` - Benchmark fetcher tests
+ - `tests/test-sdp-calculation.sh` - SDP calculation tests
+
+ ### Files Modified
+ - `tests/e2e/evaluate.sh` - Outputs SDP alongside raw scores
+ - `.github/workflows/ci.yml` - PR comments include SDP metrics
+ - Documentation updated (README, CONTRIBUTING, CI_CD, AUTO_SELF_UPDATE)
+
+ ## [1.4.0] - 2026-01-26
+
+ ### Added
+ - Auto-update system for staying current with Claude Code releases
+ - Daily workflow: monitors official releases, creates PRs for relevant updates
+ - Weekly workflow: scans community discussions, creates digest issues
+ - Analysis prompts with wizard philosophy baked in
+ - Version tracking files for state management
+
+ ### How It Works
+ GitHub Actions check for Claude Code updates daily (official releases) and weekly (community discussions). Claude analyzes relevance to the wizard, and HIGH/MEDIUM confidence updates create PRs for human review. Most community content is filtered as noise - that's expected.
+
+ ### Files Added
+ - `.github/workflows/daily-update.yml`
+ - `.github/workflows/weekly-community.yml`
+ - `.github/prompts/analyze-release.md`
+ - `.github/prompts/analyze-community.md`
+ - `.github/last-checked-version.txt`
+ - `.github/last-community-scan.txt`
+
+ ### Required Setup
+ Add `ANTHROPIC_API_KEY` to repository secrets for workflows to function.
+
+ ## [1.3.0] - 2026-01-24
+
+ ### Added
+ - Idempotent wizard - safe to run on any existing setup
+ - Setup tracking comments in SDLC.md (version, completed steps, preferences)
+ - Wizard step registry for tracking what's been done
+ - Backwards compatibility for old wizard users
+
+ ### Changed
+ - "Staying Updated" section rewritten for idempotent approach
+ - Update flow now checks plugins and questions, not just files
+ - One unified flow for setup AND updates (no separate paths)
+
+ ### How It Works
+ The wizard now tracks completed steps in SDLC.md metadata comments. Old users running "check for updates" will be walked through only the new steps they haven't done yet.
+
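The idempotent flow described above amounts to comparing a step registry against the steps already recorded in SDLC.md. The sketch below illustrates that comparison only. The changelog does not give the metadata comment syntax or the registry contents, so the `<!-- completed-steps: ... -->` format, the function names, and every step name except `step-update-notify` (named in the 1.13.0 entry) are hypothetical.

```python
import re
from typing import List, Sequence, Set

# Hypothetical metadata comment format; the real SDLC.md syntax may differ.
_STEPS_RE = re.compile(r"<!--\s*completed-steps:\s*(.*?)\s*-->")


def completed_steps(sdlc_md: str) -> Set[str]:
    """Extract the recorded step names from the metadata comment, if any."""
    match = _STEPS_RE.search(sdlc_md)
    if match is None:
        return set()
    return {name.strip() for name in match.group(1).split(",") if name.strip()}


def pending_steps(sdlc_md: str, registry: Sequence[str]) -> List[str]:
    """Return registry steps not yet recorded, preserving registry order."""
    done = completed_steps(sdlc_md)
    return [step for step in registry if step not in done]


# Hypothetical registry and document contents for illustration.
registry = ["step-plugins", "step-sdlc-setup", "step-update-notify"]
doc = "# SDLC\n<!-- completed-steps: step-plugins, step-sdlc-setup -->\n"
```

Because a document with no metadata comment yields an empty completed set, a first-time setup and an update run go through the same code path, which is what makes the flow idempotent.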
+ ## [1.2.0] - 2026-01-24
+
+ ### Added
+ - Official plugin integration (claude-md-management, code-review, claude-code-setup)
+ - Step 0.1-0.4: Plugin setup before auto-scan
+ - "Leverage Official Tools" principle in Philosophy section
+ - Post-mortem learnings table (what goes where)
+ - Testing skill "After Session" section for capturing learnings
+ - Clear update workflow in "Staying Updated" section
+
+ ### Changed
+ - Step 0 restructured: plugins first, then SDLC setup, then auto-scan
+ - Stay Lightweight section now includes official plugin table
+ - Clarified plugin scope: claude-md-management = CLAUDE.md only
+
+ ### Files Affected
+ - `.claude/skills/testing/SKILL.md` - Add "After Session" section
+ - `SDLC.md` - Consider adding version comment
+
+ ## [1.1.0] - 2026-01-23
+
+ ### Added
+ - Tasks system documentation (v2.1.16+)
+ - $ARGUMENTS skill parameter support (v2.1.19+)
+ - Ike the cat easter egg (8 pounds, Fancy Feast enthusiast)
+ - Iron Man analogy for human+AI partnership
+
+ ### Changed
+ - Test review preference: user chooses oversight level
+ - Shared environment awareness (not everyone runs isolated)
+
+ ## [1.0.0] - 2026-01-20
+
+ ### Added
+ - Initial SDLC Wizard release
+ - TDD enforcement hooks
+ - SDLC and Testing skills
+ - Confidence levels (HIGH/MEDIUM/LOW)
+ - Planning mode integration
+ - Self-review workflow
+ - Testing Diamond philosophy
+ - Mini-retro after tasks
+