agentic-sdlc-wizard 1.55.0 → 1.57.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -13,7 +13,7 @@
13
13
  "name": "sdlc-wizard",
14
14
  "source": ".",
15
15
  "description": "SDLC enforcement for AI agents — TDD, planning, self-review, CI shepherd",
16
- "version": "1.55.0",
16
+ "version": "1.57.0",
17
17
  "author": {
18
18
  "name": "Stefan Ayala"
19
19
  },
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "sdlc-wizard",
3
- "version": "1.55.0",
3
+ "version": "1.57.0",
4
4
  "description": "SDLC enforcement for AI agents — TDD, planning, self-review, CI shepherd",
5
5
  "author": {
6
6
  "name": "Stefan Ayala",
package/CHANGELOG.md CHANGED
@@ -4,6 +4,63 @@ All notable changes to the SDLC Wizard.
4
4
 
5
5
  > **Note:** This changelog is for humans to read. Don't manually apply these changes - just run the wizard ("Check for SDLC wizard updates") and it handles everything automatically.
6
6
 
7
+ ## [1.57.0] - 2026-04-30
8
+
9
+ ### Fixed
10
+
11
+ - **De-coached the E2E benchmark prompt — closes ROADMAP #96 Phase 1.** The local-shepherd + model-comparison prompts used to tell the agent EXACTLY what was scored (`"You MUST use TodoWrite (scored by automated checks)"`, `"MUST state confidence as exactly 'Confidence: HIGH/MEDIUM/LOW' (scored by automated checks)"`, `"Write or edit test files BEFORE implementation files (TDD RED phase is scored)"`, `"MUST self-review by using Read on files you modified (scored by automated checks)"`). v1.32.0 cross-model audit (Codex GPT-5.4 xhigh) rated the benchmark **2/10 NOT CERTIFIED** specifically because of this answer-key leakage — and Codex's 2026-04-30 priority review independently flagged it as the single highest-leverage next action. ROADMAP #96 had been marked "DONE (but benchmark is broken)" — meaning the infra shipped, but the actual de-coaching never happened. This release does it.
12
+
13
+ Replaced with neutral task framing: `"You are completing a coding task in a real working directory. Read the scenario file for the task spec. Complete it however you'd normally complete a coding task — using whatever practices your tooling, skills, or instructions teach you."` No mention of TodoWrite / confidence / TDD / self-review.
14
+
15
+ The model-comparison workflow installs the wizard into the test fixture (`tests/test-model-comparison.sh` Test 23), so the wizard's SDLC skill is what should drive behavior — not the prompt. The local-shepherd fixture does NOT have the wizard, so under the new prompt local-shepherd measures whether agents practice SDLC organically (calibration baseline). Phase 3 (future) will install wizard into the local-shepherd fixture and prove that wizard installation lifts organically-low scores back up — the actual *"does the wizard work?"* test.
16
+
17
+ ### Changed
18
+
19
+ - `tests/test-local-shepherd.sh::test_shepherd_prompt_has_required_signatures`: rewritten — asserts new neutral signatures present (`"You are completing a coding task"`, `"Working directory:"`, `"Scenario file:"`, `"Do NOT use EnterPlanMode"`) AND has a negative assertion that cheat-sheet phrases (`"scored by automated checks"`, `"MUST use TodoWrite"`, `"TDD RED phase is scored"`) do NOT resurrect.
20
+ - `tests/test-model-comparison.sh::test_prompt_quality` (Test 22): inverted — was requiring TDD + confidence in the prompt; now asserts cheat-sheet phrases are NOT in the prompt.
21
+
22
+ ### Files
23
+
24
+ - `tests/e2e/local-shepherd.sh` (prompt rewritten)
25
+ - `.github/workflows/benchmark-model-comparison.yml` (prompt rewritten)
26
+ - `tests/test-local-shepherd.sh` (Test 22 updated)
27
+ - `tests/test-model-comparison.sh` (Test 22 inverted)
28
+ - Version bump 1.56.0 → 1.57.0
29
+
30
+ ### Phase 2 + 3 (future)
31
+
32
+ - Phase 2: independent ground truth via `npm test` post-run. High score requires BOTH judge approval AND tests passing.
33
+ - Phase 3: install wizard files into the local-shepherd fixture; demonstrate wizard installation lifts organically-low scores. Calibration scenarios with subtle bugs to test self-review effectiveness.
34
+
35
+ ## [1.56.0] - 2026-04-29
36
+
37
+ ### Added
38
+
39
+ - **`tests/e2e/fetch-community.sh`** — closes ROADMAP **#207**. Pulls public threads from Reddit (r/ClaudeCode + r/ClaudeAI) and HN Algolia ("claude code" stories), emits combined transcript text to stdout. Pipe to `scan-community.sh` to surface candidate `/slash-command` mentions that the wizard doesn't already know.
40
+
41
+ Usage:
42
+ ```bash
43
+ ./tests/e2e/fetch-community.sh --reddit ClaudeCode,ClaudeAI --hn | ./tests/e2e/scan-community.sh -
44
+ ```
45
+
46
+ Live mode hits Reddit's public JSON API + HN Algolia (no auth, no API spend on the Anthropic side). `--offline DIR` reads fixture JSON from `DIR/reddit-${sub}.json` + `DIR/hn-claudecode.json` for tests + offline-on-laptop runs. 11 quality tests in `tests/test-community-fetch.sh` (all mocked HTTP via fixtures, runs offline).
47
+
48
+ Pairs with the existing `scan-community.sh` (which was the manual-input piece): the maintainer no longer has to copy-paste threads. Discord skipped (requires bot/OAuth) and GH Discussions deferred (GraphQL-only, low marginal value over Reddit+HN coverage).
49
+
50
+ ### Security
51
+
52
+ - **P0 fix during Codex round 1**: `parse_or_die` originally interpolated the path into `python3 -c` source. A crafted subreddit name with single quote + Python could escape the string and execute. Rewritten to pass the path via `JSON_PATH` environment variable. New regression test (`test_offline_fixture_path_injection_blocked`) creates a fixture with an injection-payload subreddit name and asserts the sentinel file is NOT created.
53
+ - **P2 fix during Codex round 1**: `--reddit` and `--offline` previously silently exited 1 when called without a value. Both flags now validate `${2:-}` is non-empty before consuming + emit a clear flag-specific error. Two new tests cover the missing-value paths.
54
+
55
+ ### Files
56
+
57
+ - `tests/e2e/fetch-community.sh` (new, ~180 lines after round-1 P0/P2 fixes)
58
+ - `tests/test-community-fetch.sh` (new, 14 tests — 11 happy/error-path + 3 round-1 P0/P2 regressions)
59
+ - `tests/fixtures/community-fetch/*.json` (new, 5 fixtures: reddit-claudecode, reddit-claudeai, hn-claudecode, reddit-empty, plus malformed.json for parse-error testing)
60
+ - `.github/workflows/ci.yml` (+1 step: `Run community fetcher tests`)
61
+ - `CONTRIBUTING.md` (test list updated)
62
+ - Version bump 1.55.0 → 1.56.0
63
+
7
64
  ## [1.55.0] - 2026-04-29
8
65
 
9
66
  ### Removed
@@ -2974,7 +2974,7 @@ If deployment fails or post-deploy verification catches issues:
2974
2974
 
2975
2975
  **SDLC.md:**
2976
2976
  ```markdown
2977
- <!-- SDLC Wizard Version: 1.55.0 -->
2977
+ <!-- SDLC Wizard Version: 1.57.0 -->
2978
2978
  <!-- Setup Date: [DATE] -->
2979
2979
  <!-- Completed Steps: step-0.1, step-0.2, step-0.4, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
2980
2980
  <!-- Git Workflow: [PRs or Solo] -->
@@ -4039,7 +4039,7 @@ Walk through updates? (y/n)
4039
4039
  Store wizard state in `SDLC.md` as metadata comments (invisible to readers, parseable by Claude):
4040
4040
 
4041
4041
  ```markdown
4042
- <!-- SDLC Wizard Version: 1.55.0 -->
4042
+ <!-- SDLC Wizard Version: 1.57.0 -->
4043
4043
  <!-- Setup Date: 2026-01-24 -->
4044
4044
  <!-- Completed Steps: step-0.1, step-0.2, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
4045
4045
  <!-- Git Workflow: PRs -->
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "agentic-sdlc-wizard",
3
- "version": "1.55.0",
3
+ "version": "1.57.0",
4
4
  "description": "SDLC enforcement for Claude Code — hooks, skills, and wizard setup in one command",
5
5
  "bin": {
6
6
  "sdlc-wizard": "cli/bin/sdlc-wizard.js"
@@ -93,9 +93,11 @@ Parse CHANGELOG entries between the user's installed version and latest. Present
93
93
 
94
94
  ```
95
95
  Installed: 1.42.0
96
- Latest: 1.55.0
96
+ Latest: 1.57.0
97
97
 
98
98
  What changed:
99
+ - [1.57.0] de-coach E2E benchmark prompt (#96 Phase 1) — remove answer-key leakage that saturated benchmark scores at 10/10; new neutral task framing measures organic SDLC behavior
100
+ - [1.56.0] community feature-discovery fetcher (#207) — `tests/e2e/fetch-community.sh` pulls Reddit + HN; pipe to `scan-community.sh` to surface candidate /slash-commands
99
101
  - [1.55.0] shrink weekly-update.yml dead code (#231 Phase 4) — delete env.VERSION_SCENARIO + has_overlap/overlap_paths outputs + placeholder/parse steps; 289 → 161 lines (-44%)
100
102
  - [1.54.0] delete check-updates Claude-ranker (#231 Phase 3d) — weekly-update.yml is now zero-API-spend; auto-update PR body links the manual analyze-release.md command
101
103
  - [1.53.0] delete scan-community cron (#231 Phase 3c) — manual `claude --print` invocation of analyze-community.md