agentic-sdlc-wizard 1.57.0 → 1.59.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md
CHANGED
|
@@ -4,6 +4,69 @@ All notable changes to the SDLC Wizard.
|
|
|
4
4
|
|
|
5
5
|
> **Note:** This changelog is for humans to read. Don't manually apply these changes - just run the wizard ("Check for SDLC wizard updates") and it handles everything automatically.
|
|
6
6
|
|
|
7
|
+
## [1.59.0] - 2026-04-30
|
|
8
|
+
|
|
9
|
+
### Added
|
|
10
|
+
|
|
11
|
+
- **Evaluator runs on Max via `claude --print` — closes ROADMAP #228.** New `EVAL_USE_CLI=1` mode in `tests/e2e/evaluate.sh` swaps the per-criterion judge transport from `curl` → `api.anthropic.com` to `claude --print --output-format json` against the user's Max subscription. Same model (`claude-opus-4-7`), same prompts, same JSON parsing — only the auth/billing path differs. `local-shepherd.sh` sets it by default and drops the `ANTHROPIC_API_KEY` hard-fail, so the local-Max shepherd is now **honestly zero-API**: simulation, evaluator, and orchestration all on Max quota. CI keeps the curl path (default) for paths without an authed CLI.
|
|
12
|
+
|
|
13
|
+
Why: the local-shepherd previously claimed "zero-API" but the evaluator still hit the paid API for per-criterion scoring (~$0.40/PR). Codex flagged this in the #212 review as the gap to close. With #228 done, the local-Max path is end-to-end on subscription quota.
|
|
14
|
+
|
|
15
|
+
CLI invocation flags: `--print --output-format json --max-turns 1 --model claude-opus-4-7 --tools "" --setting-sources user --mcp-config '{"mcpServers":{}}' --strict-mcp-config`. Single-shot, model pinned to match curl path, no built-in tool use (`--tools ""`), no MCP tool exposure (Codex round 1 P1 #1 — `--tools ""` alone leaves user MCP servers like `mcp__playwright__*` reachable; the criterion prompt embeds untrusted simulation output, so prompt-injection could otherwise reach them), settings limited to user-level so this repo's hooks (sdlc-prompt-check, etc.) don't fire and pollute the criterion prompt with SDLC-baseline reminders. Runs from a clean `mktemp -d` cwd for the same reason. Retry-once preserved from the curl path.
|
|
16
|
+
|
|
17
|
+
Score parity: same model, same prompts. Stochastic variance is the only expected delta (±1-2 pts per criterion). Statistical parity proof (Prove-It Gate, paired N=15 runs) is **deferred to ROADMAP #212(i)** — implementation here is engineering, not a parity claim.
|
|
18
|
+
|
|
19
|
+
### Tests
|
|
20
|
+
|
|
21
|
+
- New `tests/test-evaluate-cli-mode.sh` (15 tests): EVAL_USE_CLI branch present, calls `claude --print --output-format json --max-turns 1 --tools ""`, runs from clean cwd, retries on failure, extracts `.result` via jq selector tolerant of array+object shapes, gates ANTHROPIC_API_KEY check, API path intact, shepherd exports the env var, shepherd no longer hard-fails on missing key, MCP isolated via `--mcp-config '{}' --strict-mcp-config` on both calls (Codex round 1 P1 #1), `--model claude-opus-4-7` pinned on both calls (Codex round 1 P1 #2).
|
|
22
|
+
- `tests/test-local-shepherd.sh`: replaced obsolete `test_shepherd_aborts_on_missing_api_key` with `test_shepherd_runs_without_api_key` (positive contract — shepherd no longer cares about API key when CLI mode is on).
|
|
23
|
+
- Wired into `ci.yml` (new step "Run evaluator CLI-mode tests (#228)").
|
|
24
|
+
|
|
25
|
+
### Files
|
|
26
|
+
|
|
27
|
+
- `tests/e2e/evaluate.sh` (new `call_criterion_cli()` helper, branch in `call_criterion_api()`, conditional API-key check)
|
|
28
|
+
- `tests/e2e/local-shepherd.sh` (drop API-key requirement, export `EVAL_USE_CLI=1`, update stale provenance comments to reflect closed #228)
|
|
29
|
+
- `tests/test-evaluate-cli-mode.sh` (new, 15 tests)
|
|
30
|
+
- `tests/test-local-shepherd.sh` (replace obsolete API-key abort test with positive zero-API contract)
|
|
31
|
+
- `.github/workflows/ci.yml` (new test step)
|
|
32
|
+
- `CHANGELOG.md`, `package.json`, `SDLC.md`, `CLAUDE_CODE_SDLC_WIZARD.md`, `.claude-plugin/plugin.json` + `marketplace.json`, `skills/update/SKILL.md`, `.claude/skills/update/SKILL.md` (1.58.0 → 1.59.0)
|
|
33
|
+
|
|
34
|
+
## [1.58.0] - 2026-04-30
|
|
35
|
+
|
|
36
|
+
### Added
|
|
37
|
+
|
|
38
|
+
- **Ground-truth gate for the E2E benchmark — closes ROADMAP #96 Phase 2.** New `tests/e2e/ground-truth.sh` runs the fixture's own test suite (`npm test`) post-simulation and emits structured JSON: `{tests_run, tests_pass, tests_rc, tests_tail}`. `local-shepherd.sh` calls it after the evaluator, before the score-history append. If tests fail, the final score is **capped at 5/10** (configurable via `GROUND_TRUTH_FAIL_CAP`) — the judge can't tell if `npm test` actually passes; only running it can.
|
|
39
|
+
|
|
40
|
+
Combined with Phase 1's de-coaching (v1.57.0), the benchmark now requires **both** judge approval **and** real test passage. Catches "agent followed protocol but produced broken code" false-positives — exactly the failure mode the v1.32.0 cross-model audit warned about.
|
|
41
|
+
|
|
42
|
+
Score-history rows now record `original_judge_score`, `tests_run`, `tests_pass`, `ground_truth_gated` so trend analytics can distinguish judge noise from real regressions.
|
|
43
|
+
|
|
44
|
+
Cross-platform timeout: uses `timeout` if available, falls back to `gtimeout` (coreutils on macOS), and finally a portable `perl` fallback for stock systems. Default `GROUND_TRUTH_TIMEOUT=120s`.
|
|
45
|
+
|
|
46
|
+
Escape hatches:
|
|
47
|
+
- `SDLC_SHEPHERD_SKIP_GROUND_TRUTH=1` — disables the gate entirely (raw judge scores)
|
|
48
|
+
- `SDLC_SHEPHERD_FIXTURE_DIR=...` — override fixture location (default `tests/e2e/fixtures/test-repo`)
|
|
49
|
+
- `SDLC_SHEPHERD_GROUND_TRUTH=...` — override script path (used by tests)
|
|
50
|
+
|
|
51
|
+
### Tests
|
|
52
|
+
|
|
53
|
+
- New `tests/test-ground-truth.sh` (11 tests): passing/failing/no-test/no-package/missing-dir/no-args/help/JSON-validity/timeout-enforcement.
|
|
54
|
+
- 4 new integration tests in `tests/test-local-shepherd.sh` (38 total): gate caps judge=9 to score=5, gate leaves passing judge alone, no-tests fixture skips gate, `SKIP_GROUND_TRUTH` env var fully disables gate.
|
|
55
|
+
- Wired into `ci.yml` and `CONTRIBUTING.md` test list.
|
|
56
|
+
|
|
57
|
+
### Files
|
|
58
|
+
|
|
59
|
+
- `tests/e2e/ground-truth.sh` (new, 105 lines)
|
|
60
|
+
- `tests/test-ground-truth.sh` (new, 11 tests)
|
|
61
|
+
- `tests/e2e/local-shepherd.sh` (gate logic + new score-history fields)
|
|
62
|
+
- `tests/test-local-shepherd.sh` (4 new integration tests)
|
|
63
|
+
- `.github/workflows/ci.yml` + `CONTRIBUTING.md` (test list)
|
|
64
|
+
- Version bump 1.57.0 → 1.58.0
|
|
65
|
+
|
|
66
|
+
### Phase 3 (future)
|
|
67
|
+
|
|
68
|
+
- Install wizard files into the local-shepherd test fixture so the simulation tests "wizard-installed agent" vs "wizard-less agent." That's the actual *"does the wizard work?"* test — pairs with calibration scenarios that include subtle bugs to test self-review effectiveness.
|
|
69
|
+
|
|
7
70
|
## [1.57.0] - 2026-04-30
|
|
8
71
|
|
|
9
72
|
### Fixed
|
|
@@ -2974,7 +2974,7 @@ If deployment fails or post-deploy verification catches issues:
|
|
|
2974
2974
|
|
|
2975
2975
|
**SDLC.md:**
|
|
2976
2976
|
```markdown
|
|
2977
|
-
<!-- SDLC Wizard Version: 1.
|
|
2977
|
+
<!-- SDLC Wizard Version: 1.59.0 -->
|
|
2978
2978
|
<!-- Setup Date: [DATE] -->
|
|
2979
2979
|
<!-- Completed Steps: step-0.1, step-0.2, step-0.4, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
|
|
2980
2980
|
<!-- Git Workflow: [PRs or Solo] -->
|
|
@@ -4039,7 +4039,7 @@ Walk through updates? (y/n)
|
|
|
4039
4039
|
Store wizard state in `SDLC.md` as metadata comments (invisible to readers, parseable by Claude):
|
|
4040
4040
|
|
|
4041
4041
|
```markdown
|
|
4042
|
-
<!-- SDLC Wizard Version: 1.
|
|
4042
|
+
<!-- SDLC Wizard Version: 1.59.0 -->
|
|
4043
4043
|
<!-- Setup Date: 2026-01-24 -->
|
|
4044
4044
|
<!-- Completed Steps: step-0.1, step-0.2, step-1, step-2, step-3, step-4, step-5, step-6, step-7, step-8, step-9 -->
|
|
4045
4045
|
<!-- Git Workflow: PRs -->
|
package/package.json
CHANGED
package/skills/update/SKILL.md
CHANGED
|
@@ -93,9 +93,11 @@ Parse CHANGELOG entries between the user's installed version and latest. Present
|
|
|
93
93
|
|
|
94
94
|
```
|
|
95
95
|
Installed: 1.42.0
|
|
96
|
-
Latest: 1.
|
|
96
|
+
Latest: 1.59.0
|
|
97
97
|
|
|
98
98
|
What changed:
|
|
99
|
+
- [1.59.0] evaluator on Max via `claude --print` (#228) — `EVAL_USE_CLI=1` swaps `evaluate.sh`'s per-criterion judge transport from `curl` → API to `claude --print --output-format json`. local-shepherd.sh sets it by default, so the local path is honestly zero-API
|
|
100
|
+
- [1.58.0] ground-truth gate for E2E benchmark (#96 Phase 2) — `tests/e2e/ground-truth.sh` runs `npm test` post-sim; final score capped at 5 if tests fail. Catches "agent followed protocol but produced broken code"
|
|
99
101
|
- [1.57.0] de-coach E2E benchmark prompt (#96 Phase 1) — remove answer-key leakage that saturated benchmark scores at 10/10; new neutral task framing measures organic SDLC behavior
|
|
100
102
|
- [1.56.0] community feature-discovery fetcher (#207) — `tests/e2e/fetch-community.sh` pulls Reddit + HN; pipe to `scan-community.sh` to surface candidate /slash-commands
|
|
101
103
|
- [1.55.0] shrink weekly-update.yml dead code (#231 Phase 4) — delete env.VERSION_SCENARIO + has_overlap/overlap_paths outputs + placeholder/parse steps; 289 → 161 lines (-44%)
|