@windyroad/risk-scorer 0.12.3 → 0.12.4-preview.558

@@ -310,5 +310,5 @@
  }
  },
  "name": "wr-risk-scorer",
- "version": "0.12.3"
+ "version": "0.12.4"
  }
@@ -360,6 +360,24 @@ readiness may be listed separately as "post-release follow-ups" outside the
  residual risk computation, but MUST NOT appear in a Controls list and MUST NOT
  reduce any inherent risk score.
 
+ ### R009 control vocabulary — SKILL/agent-prose surfaces (P355 / RFC-012 / ADR-075)
+
+ For diffs touching `packages/*/skills/*/SKILL.md`, `packages/*/skills/*/REFERENCE.md`, or `packages/*/agents/*.md`, the **previously-irreducible R009 "no behavioural harness for the LLM-prose surface" floor is discharged** when a paired promptfoo Tier-A/B eval exists for the changed prose AND `npx promptfoo eval` passes on this commit. Credit it as a named likelihood-reducing control with the same evidence shape as behavioural bats:
+
+ - **Control name**: `Promptfoo Tier-A/B SKILL/agent-prose eval` at `packages/<pkg>/skills/<name>/eval/promptfooconfig.yaml` (or `packages/<pkg>/agents/<name>/eval/...` for agent prose).
+ - **Failure scenario exercised**: the LLM-prose surface emits the documented step ordering / vocabulary / refusal / classifier verdict. Pre-RFC-012, no test exercised this; defects landed in published SKILLs and ran in adopter sessions. Post-RFC-012, the eval's Tier-A deterministic assertions (icontains/contains/regex/not-regex) or Tier-B llm-rubric pass^k repro the prose surface behaviourally.
+ - **How it catches before reaching the user**: the eval runs the actual SKILL.md (via `run-skill-eval.sh` exec provider wrapping `claude -p --append-system-prompt`); a behavioural regression in the prose surface fails the assertion. The held-changeset / dogfood-window catches some prose defects, but promptfoo catches edge-case defects that dogfood replays miss (per R009 Watch-out clause).
+ - **Authority**: ADR-075 Amendment 2026-06-02 (scope extension to SKILL.md prose); RFC-012 (implementation); R009 standing-risk entry (`docs/risks/R009-*.active.md`) Controls table row 2 + per-action modulator + Residual risk Per-action quick path.
+
+ **Crediting rule** (matches R009 modulator semantics):
+
+ - **WITH paired promptfoo Tier-A/B eval AND tests pass on this commit** → `-1` likelihood for the prose-surface subset. Cite the eval config path + the assertion(s) that exercise the failure scenario in the Controls bullet.
+ - **NO paired promptfoo Tier-A/B eval for the changed prose** → `+1` likelihood for the prose-surface subset (the pre-RFC-012 floor stands for this subset). Surface this as a remediation candidate: `add a promptfoo eval covering <the specific prose step> at packages/<pkg>/skills/<name>/eval/promptfooconfig.yaml`.
+
+ **Do NOT credit promptfoo coverage as a global R009 discharge** — credit applies only to the prose-surface subset of the diff. A commit that changes a hook script AND a SKILL.md credits promptfoo only against the SKILL.md prose risk, not against the hook script's R009 score (which still requires behavioural bats per ADR-052).
+
+ **Discoverability**: check for paired promptfoo coverage by listing `packages/<pkg>/skills/<name>/eval/promptfooconfig.yaml` (or the agent-prose analogue). The first reference slice is `packages/itil/skills/manage-problem/eval/promptfooconfig.yaml` — model new slices on its shape. If the eval exists but you can't determine pass/fail from pipeline state, cite the eval as `un-verified` rather than crediting the -1.
+
  ## User-Stated Preconditions Check
 
  A technical control list never substitutes for an explicit user warning. Before
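
Reviewer note on the hunk above: the crediting rule it adds can be read as a small per-file decision procedure. The following is a minimal sketch of that reading, not code from the package. The function names, the return shape, and the zero adjustment for the `un-verified` and failing-eval cases are assumptions made for illustration; the added text only specifies `-1` (paired eval present and passing) and `+1` (no paired eval). Agent prose under `packages/*/agents/*.md` is left out because the diff elides its eval path.

```python
import re
from typing import Optional, Set, Tuple

# Skill prose surfaces named in the added section:
# packages/*/skills/*/SKILL.md and packages/*/skills/*/REFERENCE.md
SKILL_PROSE = re.compile(r"^(packages/[^/]+/skills/[^/]+)/(?:SKILL|REFERENCE)\.md$")


def paired_eval_path(changed_path: str) -> Optional[str]:
    """Return the promptfoo config path paired with a skill prose file, else None."""
    match = SKILL_PROSE.match(changed_path)
    if match is None:
        return None
    return f"{match.group(1)}/eval/promptfooconfig.yaml"


def prose_likelihood_delta(
    changed_path: str,
    existing_files: Set[str],
    eval_passed: Optional[bool],
) -> Optional[Tuple[int, str]]:
    """Hypothetical helper: likelihood adjustment for one changed file.

    eval_passed is True/False when pipeline state is known, None when it is not.
    Returns None for files outside the prose-surface subset, so promptfoo
    coverage is never credited against hook scripts or other surfaces.
    """
    eval_path = paired_eval_path(changed_path)
    if eval_path is None:
        return None
    if eval_path not in existing_files:
        return (+1, "no paired eval")  # pre-RFC-012 floor stands
    if eval_passed is True:
        return (-1, "credited")  # cite eval_path and its assertions
    if eval_passed is None:
        return (0, "un-verified")  # assumption: no credit, no penalty
    return (0, "eval failing")  # assumption: the source does not cover this case
```

Under these assumptions, a commit touching both a hook script and a `SKILL.md` yields an adjustment only for the `SKILL.md` entry, which mirrors the "do NOT credit promptfoo coverage as a global R009 discharge" clause.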
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@windyroad/risk-scorer",
- "version": "0.12.3",
+ "version": "0.12.4-preview.558",
  "description": "Pipeline risk scoring, commit/push gates, and secret leak detection",
  "bin": {
  "windyroad-risk-scorer": "./bin/install.mjs"