@windyroad/risk-scorer 0.9.0 → 0.10.0-preview.327

@@ -1,5 +1,5 @@
1
1
  {
2
2
  "name": "wr-risk-scorer",
3
- "version": "0.9.0",
3
+ "version": "0.10.0",
4
4
  "description": "Pipeline risk scoring, commit/push/release gates for Claude Code"
5
5
  }
@@ -10,14 +10,14 @@ model: inherit
10
10
 
11
11
  You are the External-Comms Risk Reviewer. Your single job: read the draft of an outbound prose tool call (a `gh issue create --body ...`, a PR description, a security-advisory body, a `.changeset/*.md` file, or the README diff that `npm publish` will publish) and return a structured PASS/FAIL verdict against RISK-POLICY.md's Confidential Information classes.
12
12
 
13
- You are read-only. You do NOT write files, do NOT commit, do NOT modify the draft. Your verdict is consumed by the `risk-score-mark.sh` PostToolUse hook (P064 / ADR-028 amended), which writes the marker that allows the gated tool call to proceed.
13
+ You are read-only. You do NOT write files, do NOT commit, do NOT modify the draft. Your verdict is consumed by the `risk-score-mark.sh` PostToolUse hook (P064 / ADR-028 amended 2026-05-14 + 2026-05-16), which derives the marker key from the prompt structure you receive and writes the marker that allows the gated tool call to proceed.
14
14
 
15
15
  ## What you receive
16
16
 
17
- The invoking skill (`/wr-risk-scorer:assess-external-comms`) or the agent that hit the gate provides:
17
+ The invoking skill (`/wr-risk-scorer:assess-external-comms`) or the agent that hit the gate provides a structured prompt (P166 / ADR-028 amended 2026-05-16):
18
18
 
19
- - The **draft body** verbatim: the exact prose that would land on the external surface.
20
- - The **target surface**, one of: `gh-issue-create`, `gh-issue-comment`, `gh-issue-edit`, `gh-pr-create`, `gh-pr-comment`, `gh-pr-edit`, `gh-api-security-advisories`, `gh-api-comments`, `npm-publish`, `changeset-author`.
19
+ - A leading `SURFACE: <name>` line, one of: `gh-issue-create`, `gh-issue-comment`, `gh-issue-edit`, `gh-pr-create`, `gh-pr-comment`, `gh-pr-edit`, `gh-api-security-advisories`, `gh-api-comments`, `npm-publish`, `changeset-author`.
20
+ - The **draft body** verbatim, wrapped in `<draft>...</draft>` markers so the PostToolUse hook can extract it for marker-key derivation.
21
21
  - The **destination** when known (e.g. `anthropics/claude-code#52831`).
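As a minimal sketch of the structured prompt described above, the helper below assembles the leading `SURFACE:` line and the `<draft>...</draft>` block. The function name `build_review_prompt`, the `KNOWN_SURFACES` constant, and the `Destination:` label are illustrative assumptions; the package itself only specifies the `SURFACE:` line and the `<draft>` markers.

```python
# Surface names copied from the list above; the constant name is hypothetical.
KNOWN_SURFACES = {
    "gh-issue-create", "gh-issue-comment", "gh-issue-edit",
    "gh-pr-create", "gh-pr-comment", "gh-pr-edit",
    "gh-api-security-advisories", "gh-api-comments",
    "npm-publish", "changeset-author",
}


def build_review_prompt(surface, draft, destination=None):
    """Assemble a prompt with a leading SURFACE line and a <draft> block.

    The 'Destination:' label is an assumption made for this sketch; the
    source only says the destination is provided "when known".
    """
    if surface not in KNOWN_SURFACES:
        raise ValueError(f"unknown surface: {surface!r}")
    lines = [f"SURFACE: {surface}", "<draft>", draft, "</draft>"]
    if destination:
        lines.append(f"Destination: {destination}")
    return "\n".join(lines)
```

Keeping the draft verbatim between the markers matters because the hook hashes exactly what sits inside them; any trimming or re-wrapping by the caller would presumably change the derived key.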
22
22
 
23
23
  Read `RISK-POLICY.md` (project root) to get the authoritative Confidential Information class list. As of P064 it covers:
@@ -42,28 +42,20 @@ The hybrid pre-filter (`packages/*/hooks/lib/leak-detect.sh`) has already caught
42
42
 
43
43
  ## Verdict format (MANDATORY)
44
44
 
45
- End your report with a structured block consumed by `risk-score-mark.sh`. Every field is required.
45
+ End your report with a structured block consumed by `risk-score-mark.sh`:
46
46
 
47
47
  ```
48
48
  EXTERNAL_COMMS_RISK_VERDICT: PASS
49
- EXTERNAL_COMMS_RISK_KEY: <sha256 hex string>
50
49
  ```
51
50
 
52
51
  OR for a failed review:
53
52
 
54
53
  ```
55
54
  EXTERNAL_COMMS_RISK_VERDICT: FAIL
56
- EXTERNAL_COMMS_RISK_KEY: <sha256 hex string>
57
55
  EXTERNAL_COMMS_RISK_REASON: <one-line description of the leak class + matched fragment>
58
56
  ```
59
57
 
60
- Compute the key as:
61
-
62
- ```
63
- printf '%s\n%s' "<draft body verbatim>" "<surface name>" | shasum -a 256 | cut -d' ' -f1
64
- ```
65
-
66
- The key MUST match the gate's computation exactly — a key mismatch means the marker is written for a different draft and the original gated call will continue to deny.
58
+ You do NOT need to emit `EXTERNAL_COMMS_RISK_KEY`. The PostToolUse hook derives the marker key directly from the `SURFACE:` line and `<draft>...</draft>` block in the prompt you received (P166 / ADR-028 amended 2026-05-16). Single fire per gate cycle.
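For illustration, a consumer of the verdict block might read it roughly as follows. This is a sketch, not the hook's actual code (the hook uses `grep`, `tail -1`, and `sed` in bash); the function name `parse_verdict` is hypothetical. It mirrors the hook's behaviour of taking the last matching line and stripping whitespace from the verdict value.

```python
def parse_verdict(report):
    """Return (verdict, reason) from an agent report.

    The last EXTERNAL_COMMS_RISK_VERDICT line wins, matching the hook's
    `tail -1`. Either value is None when its line is absent.
    """
    verdict = None
    reason = None
    for line in report.splitlines():
        if line.startswith("EXTERNAL_COMMS_RISK_VERDICT:"):
            # Remove all whitespace from the value, as `tr -d '[:space:]'` does.
            verdict = "".join(line.split(":", 1)[1].split())
        elif line.startswith("EXTERNAL_COMMS_RISK_REASON:"):
            reason = line.split(":", 1)[1].strip()
    return verdict, reason
```

Note that no key is parsed here: under the amended contract the key comes from the prompt, not from the agent's output.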
67
59
 
68
60
  ## Grounding (ADR-026)
69
61
 
@@ -82,7 +74,7 @@ Example:
82
74
  - You are a reviewer, not an editor — do NOT propose rewrites in the verdict block. (Free prose suggestions outside the verdict block are fine and helpful.)
83
75
  - Do NOT score by analogy when the policy names the class.
84
76
  - Do NOT write to `/tmp/` or any marker location yourself — the PostToolUse hook owns that.
85
- - Do NOT skip the `EXTERNAL_COMMS_RISK_KEY` line; without it, the marker hook has no key to write the marker against and the gate will deny again on retry.
77
+ - You do NOT need to emit `EXTERNAL_COMMS_RISK_KEY`; the hook derives the key from the prompt's `SURFACE:` + `<draft>` structure (P166 / ADR-028 amended 2026-05-16). If your prompt lacks that structure (legacy caller), the hook falls back to an emitted KEY line for backward compatibility, but the canonical path is hook-side derivation.
86
78
  - When the draft is empty (e.g. `npm publish` with no extractable body fragment), review the staged content the publish would push (README diff, package.json description) instead. If neither is available, FAIL with reason "draft body unresolvable; cannot risk-review without text" so the user can pre-review manually.
87
79
 
88
80
  ## Below-Appetite Output Rule (ADR-013 Rule 5)
@@ -269,14 +269,17 @@ This is the symmetric counterpart to ADR-042 Rule 2's move-to-holding contract.
269
269
 
270
270
  ### Mechanism — invoke the deterministic graduation evaluator
271
271
 
272
- The Rule 1a join (changeset → problem ID → ticket Priority) and the Rule 2 VP carve-out detection are deterministic lookups. Invoke the `wr-risk-scorer-evaluate-graduation` shim (ADR-049 `$PATH`-resolved) to read structured candidate lines for each held changeset:
272
+ The Rule 1a join (changeset → problem ID → ticket Priority), the Rule 2 VP carve-out detection, and the Rule 3b cohort grouping are deterministic lookups. Invoke the `wr-risk-scorer-evaluate-graduation` shim (ADR-049 `$PATH`-resolved) to read structured candidate lines for each held changeset:
273
273
 
274
274
  ```
275
275
  GRADUATION_CANDIDATE: changeset=<filename> | ticket=P<NNN> | priority=<N> | class=3a | status=<resolved|vp-blocked|halt-no-resolution>
276
+ GRADUATION_CANDIDATE: changeset=<filename> | ticket=P<NNN> | priority=<cohort-max-N> | class=3b | cohort=<id> | status=<resolved|vp-blocked|halt-no-resolution>
276
277
  GRADUATION_SUMMARY: total=<N> resolved=<N> vp_blocked=<N> halts=<N>
277
278
  ```
278
279
 
279
- The script does NOT compute release-risk and does NOT apply Rule 4 evidence-floor judgement; those are LLM-judgement surfaces you own per ADR-015's pure-scorer contract. The script's job is to emit candidates with their joined Priority; your job is to decide whether each candidate's release-risk + evidence-floor profile justifies emitting a `reinstate-from-holding` remediation line.
280
+ Class 3b lines insert a `cohort=<id>` column between `class` and `status`. The cohort id is derived from the normalised reinstate-trigger prose (first 8 tokens, kebab-sanitised) of the `docs/changesets-holding/README.md` "Currently held" entries that share an identical normalised trigger. Cohort `priority` is `max(Priority)` across all member tickets per ADR-061 Rule 3b; cohort `status` propagates atomically: any halt member halts the cohort, any VP-blocked member makes the cohort VP-blocked, otherwise the cohort is resolved. Single-member "cohorts" are emitted as class=3a (no Phase 2a regression).
281
+
282
+ The script does NOT compute release-risk and does NOT apply Rule 4 evidence-floor judgement — those are LLM-judgement surfaces you own per ADR-015's pure-scorer contract. The script's job is to emit candidates with their joined Priority + cohort classification; your job is to decide whether each candidate's release-risk + evidence-floor profile justifies emitting a `reinstate-from-holding` remediation line.
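The candidate line format above can be read with a small parser along these lines. This is a sketch under stated assumptions: the function name `parse_candidate` is hypothetical, and it assumes fields are separated by ` | ` and written as `key=value`, as the format strings show.

```python
def parse_candidate(line):
    """Parse one GRADUATION_CANDIDATE line into a dict of its columns.

    Returns None for lines that are not candidate lines (for example the
    GRADUATION_SUMMARY line). Class 3a lines simply have no 'cohort' key.
    """
    prefix = "GRADUATION_CANDIDATE:"
    if not line.startswith(prefix):
        return None
    fields = {}
    for part in line[len(prefix):].split(" | "):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key] = value
    return fields
```

Because the `cohort` column only appears on class 3b lines, a keyed lookup like this tolerates both shapes without positional indexing.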
280
283
 
281
284
  ### Per-candidate evaluation rules
282
285
 
@@ -308,13 +311,31 @@ For each `status=halt-no-resolution` candidate (Rule 1a terminal — no ticket r
308
311
 
309
312
  - **DO NOT auto-graduate**. Surface the unresolved candidate in your report body under an "Unresolvable graduation candidates" section so the caller (orchestrator) sees the join failure and can present it as a user-decision surface per ADR-013 + ADR-044 framework-resolution boundary. Per ADR-061 Rule 1a, join ambiguity is a user-decision surface, not an agent-decision surface.
310
313
 
311
- ### Scope: Phase 2a only
314
+ ### Class 3b atomic-cohort evaluation (Phase 2b — ADR-061 Rule 3b)
315
+
316
+ When candidate lines emit `class=3b` with a `cohort=<id>` column, ADR-061 Rule 3b applies: **the entire cohort ships atomically or none of it does**. Per-member graduation is not authorised. Evaluate the cohort as a single unit:
317
+
318
+ 1. **Group candidates by cohort id** — collect all `class=3b` candidates sharing the same `cohort=` column into a single evaluation set.
319
+ 2. **Compute cohort release-risk** — re-score the current pipeline as if the **full cohort** were `git mv`'d back to `.changeset/` together (not one at a time). The marginal release-risk delta is computed against the cohort's combined diff surface, not any single member's diff.
320
+ 3. **Compare against cohort priority** — the `priority=<cohort-max-N>` column on every cohort-member line already carries `max(Priority)` across all member tickets (deterministic join, Rule 3b math). Apply Rule 1: cohort graduates when `cohort-release-risk ≤ cohort-priority`.
321
+ 4. **Verify Rule 4 evidence floor per cohort** — every cohort member must independently satisfy its class-specific evidence shape (PreToolUse:Bash gate / UserPromptSubmit detector / commit-hook-with-auto-fix / SessionStart additionalContext). One floor failure in any member blocks the whole cohort. Per ADR-026 cite + persist + uncertainty: cite the artefact for each member in the audit trail.
322
+ 5. **Cohort-level VP carve-out** — if the deterministic evaluator already returned `status=vp-blocked` for the cohort (any member's ticket in Verification Pending), DO NOT emit a reinstate. The carve-out lifts when all member tickets transition out of `.verifying.md`.
323
+ 6. **Cohort-level halt-and-prompt** — if the deterministic evaluator returned `status=halt-no-resolution` for the cohort (any member fails Rule 1a join), DO NOT auto-graduate. Surface the cohort in the "Unresolvable graduation candidates" section. Per architect C1 (2026-05-17 P162 Phase 2b review), partial-cohort resolution is NOT authorised — the cohort is atomic.
324
+ 7. **Emit one `reinstate-from-holding` line per cohort member** when all six checks pass, all referencing the same cohort id so the consuming orchestrator can apply them as an atomic batch:
325
+
326
+ ```
327
+ RISK_REMEDIATIONS:
328
+ - R<N> | reinstate-from-holding <member-1>: cohort <id> release-risk <release-score>/25 ≤ cohort-priority <priority-value>; class 3b; evidence: <member-1 artefact citation> | S | -<release-score-share> | docs/changesets-holding/<member-1>, .changeset/<member-1>
329
+ - R<N+1> | reinstate-from-holding <member-2>: cohort <id> release-risk <release-score>/25 ≤ cohort-priority <priority-value>; class 3b; evidence: <member-2 artefact citation> | S | -<release-score-share> | docs/changesets-holding/<member-2>, .changeset/<member-2>
330
+ ```
331
+
332
+ The agent consuming these lines applies them as a single batch — either all members reinstate in one operation or none do. Partial application breaks ADR-061 Rule 3b atomicity.
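The atomic-cohort checks above can be sketched as follows. All function names are hypothetical. The release-risk score and the per-member evidence results are taken as inputs because the source assigns them to agent judgement, not to deterministic code. Checking halt before VP-blocked is an assumption based on the order the statuses are listed in.

```python
from collections import defaultdict


def group_by_cohort(candidates):
    """Collect class=3b candidates into {cohort_id: [member, ...]}."""
    cohorts = defaultdict(list)
    for cand in candidates:
        if cand.get("class") == "3b":
            cohorts[cand["cohort"]].append(cand)
    return dict(cohorts)


def cohort_status(members):
    """Propagate status atomically: one halted or VP-blocked member blocks all."""
    statuses = {m["status"] for m in members}
    if "halt-no-resolution" in statuses:
        return "halt-no-resolution"
    if "vp-blocked" in statuses:
        return "vp-blocked"
    return "resolved"


def cohort_may_graduate(members, cohort_release_risk, evidence_ok):
    """True only when every cohort-level check passes.

    cohort_release_risk: score for reinstating the full cohort together.
    evidence_ok: {changeset filename: bool} for the Rule 4 evidence floor.
    """
    if cohort_status(members) != "resolved":
        return False
    cohort_priority = max(int(m["priority"]) for m in members)
    if cohort_release_risk > cohort_priority:
        return False
    # One evidence-floor failure in any member blocks the whole cohort.
    return all(evidence_ok.get(m["changeset"], False) for m in members)
```

A caller would emit one `reinstate-from-holding` line per member only when `cohort_may_graduate` returns true, and nothing for that cohort otherwise.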
312
333
 
313
- This evaluation surface covers **orthogonal-gate class (3a) only** per ADR-061 Rule 3. Atomic-cohort class (3b — RFC-shaped held changesets that graduate as a single atomic unit per ADR-060 finding 12) requires RFC ticket cohort enumeration and is **deferred to Phase 2b**. When the holding-area contains entries that belong to an RFC cohort, the Phase 2a evaluator emits each entry as an independent 3a candidate; treat such candidates conservatively (the symmetric-balance math is identical but the evaluation unit is wrong) and prefer a `RISK_REGISTER_HINT:` over auto-emitting `reinstate-from-holding` until Phase 2b lands the cohort enumeration.
334
+ The cohort id-from-prose detection is the Phase 2b shape per the architect-approved 2026-05-17 design. If cohort grouping false-positives appear (e.g. two unrelated changesets coincidentally sharing trigger prose), ADR-061 Reassessment Triggers ("Manual graduations diverge from criterion verdicts") covers the upgrade to a structured cohort-declaration field.
314
335
 
315
336
  ### Audit trail (Rule 6)
316
337
 
317
- Every emitted `reinstate-from-holding` line MUST cite the resolved problem-ticket ID and Priority value in the description column so the audit trail extends ADR-042 Rule 6. The consuming orchestrator additionally appends to `docs/changesets-holding/README.md` "Recently reinstated" per Rule 6 § 2.
338
+ Every emitted `reinstate-from-holding` line MUST cite the resolved problem-ticket ID and Priority value in the description column so the audit trail extends ADR-042 Rule 6. For Class 3b cohort reinstates, every member line MUST additionally cite the cohort id and the cohort-level priority + release-risk values so the per-member audit row reconstructs the atomic cohort decision. The consuming orchestrator additionally appends to `docs/changesets-holding/README.md` "Recently reinstated" per Rule 6 § 2 with the class (3a or 3b) and, for cohort members, the cohort id.
318
339
 
319
340
  ## Confidential Information Disclosure
320
341
 
@@ -31,7 +31,12 @@
31
31
  # Marker location: ${TMPDIR:-/tmp}/claude-risk-${SESSION_ID}/external-comms-<EVALUATOR_ID>-reviewed-<sha256>
32
32
  # Marker writer: PostToolUse:Agent hook in each consumer plugin
33
33
  # (risk-score-mark.sh or external-comms-mark-reviewed.sh) on
34
- # subagent type wr-<plugin>:external-comms.
34
+ # subagent type wr-<plugin>:external-comms. The mark hook
35
+ # derives the marker key from the agent's tool_input.prompt
36
+ # by parsing the same `SURFACE:` + `<draft>` structure the
37
+ # orchestrator was instructed to include (P166 / ADR-028
38
+ # amended 2026-05-16). Single fire per gate cycle suffices;
39
+ # the agent no longer needs to compute the key itself.
35
40
  #
36
41
  # Per-evaluator marker scheme (ADR-028 amended 2026-05-14): when both
37
42
  # voice-tone and risk-scorer are installed, both gates fire on the same
@@ -234,8 +239,12 @@ if [ -f "$MARKER" ]; then
234
239
  fi
235
240
 
236
241
  # Marker absent — deny + delegate.
242
+ # P166: instruct the orchestrator to structure the agent prompt with a
243
+ # leading `SURFACE: <name>` line and a `<draft>...</draft>` block so the
244
+ # PostToolUse mark hook can derive the canonical marker key locally
245
+ # (sha256(DRAFT + '\n' + SURFACE)). Single fire per gate cycle.
237
246
  VERDICT_PREFIX="${EXTERNAL_COMMS_VERDICT_PREFIX:-EXTERNAL_COMMS_${EXTERNAL_COMMS_EVALUATOR_ID^^}}"
238
- REASON=$(printf 'BLOCKED (external-comms gate / %s evaluator): %s draft has not been reviewed by %s. Delegate to %s (subagent_type: '"'"'%s'"'"') with the draft body for review. The PostToolUse hook will mark this draft reviewed when the subagent emits %s_VERDICT: PASS. Use %s for an interactive walkthrough. Override only when intentional: BYPASS_RISK_GATE=1.' \
239
- "$EXTERNAL_COMMS_EVALUATOR_ID" "$SURFACE" "$EXTERNAL_COMMS_SUBAGENT_TYPE" "$EXTERNAL_COMMS_SUBAGENT_TYPE" "$EXTERNAL_COMMS_SUBAGENT_TYPE" "$VERDICT_PREFIX" "$EXTERNAL_COMMS_ASSESS_SKILL")
247
+ REASON=$(printf 'BLOCKED (external-comms gate / %s evaluator): %s draft has not been reviewed by %s. Delegate to %s (subagent_type: '"'"'%s'"'"') with a prompt that starts with the line `SURFACE: %s` and wraps the draft body verbatim inside `<draft>...</draft>` markers. The PostToolUse hook derives the marker key from that structure and marks the draft reviewed when the subagent emits %s_VERDICT: PASS — single fire suffices. Use %s for an interactive walkthrough. Override only when intentional: BYPASS_RISK_GATE=1.' \
248
+ "$EXTERNAL_COMMS_EVALUATOR_ID" "$SURFACE" "$EXTERNAL_COMMS_SUBAGENT_TYPE" "$EXTERNAL_COMMS_SUBAGENT_TYPE" "$EXTERNAL_COMMS_SUBAGENT_TYPE" "$SURFACE" "$VERDICT_PREFIX" "$EXTERNAL_COMMS_ASSESS_SKILL")
240
249
  deny_with_reason "$REASON"
241
250
  exit 0
@@ -0,0 +1,44 @@
1
+ #!/bin/bash
2
+ # Shared helper: derive the external-comms marker key from an agent's
3
+ # tool_input.prompt by extracting the structured `SURFACE: <name>` line
4
+ # and `<draft>...</draft>` block, then computing
5
+ # sha256(DRAFT + '\n' + SURFACE) — the same key shape the gate computes
6
+ # at PreToolUse time (external-comms-gate.sh line 229).
7
+ #
8
+ # P166 + ADR-028 amended 2026-05-16: the PostToolUse:Agent mark hook
9
+ # derives the marker key from observed runtime state instead of trusting
10
+ # an agent-emitted EXTERNAL_COMMS_<EVAL>_KEY line. Removes the
11
+ # double-invocation cost class — single fire per gate cycle suffices.
12
+ #
13
+ # Canonical source: packages/shared/hooks/lib/external-comms-key.sh
14
+ # Synced byte-identically into each consumer plugin's hooks/lib/ via
15
+ # scripts/sync-external-comms-gate.sh (ADR-017 duplicate-script pattern).
16
+ #
17
+ # Returns the 64-char hex sha256 on stdout when both markers are present
18
+ # in the prompt. Returns empty string when either marker is absent — the
19
+ # caller falls back to the agent-emitted KEY for backward compatibility
20
+ # with cached old SKILL.md / agent prompts.
21
+
22
+ derive_external_comms_key_from_prompt() {
23
+ local prompt="$1"
24
+ [ -n "$prompt" ] || { echo ""; return 0; }
25
+ printf '%s' "$prompt" | python3 -c "
26
+ import sys, re, hashlib
27
+ text = sys.stdin.read()
28
+ # DRAFT extraction: non-greedy match between <draft>...</draft>.
29
+ # Tolerates an optional newline immediately after <draft> and before </draft>
30
+ # so the body content does not capture wrapping newlines.
31
+ draft_match = re.search(r'<draft>\n?(.*?)\n?</draft>', text, re.DOTALL)
32
+ # SURFACE extraction: must be anchored to line start (MULTILINE) to avoid
33
+ # matching prose like 'context says SURFACE: x'. Surface name is a single
34
+ # token: letter + word/hyphen chars.
35
+ surface_match = re.search(r'^SURFACE:\s*([A-Za-z][\w-]*)', text, re.MULTILINE)
36
+ if not draft_match or not surface_match:
37
+ print('')
38
+ sys.exit(0)
39
+ draft = draft_match.group(1)
40
+ surface = surface_match.group(1)
41
+ payload = (draft + '\n' + surface).encode('utf-8')
42
+ print(hashlib.sha256(payload).hexdigest())
43
+ " 2>/dev/null
44
+ }
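For readers who want to check a key by hand, the derivation can be restated as a standalone Python function. This is a sketch that reuses the two regular expressions from the helper above; the name `derive_key` and the module-level constants are illustrative only.

```python
import hashlib
import re

# Same patterns as the helper: a non-greedy draft body with optional wrapping
# newlines, and a SURFACE line anchored to the start of a line.
DRAFT_RE = re.compile(r"<draft>\n?(.*?)\n?</draft>", re.DOTALL)
SURFACE_RE = re.compile(r"^SURFACE:\s*([A-Za-z][\w-]*)", re.MULTILINE)


def derive_key(prompt):
    """Return sha256(draft + '\\n' + surface) as hex, or '' if a marker is missing."""
    draft_match = DRAFT_RE.search(prompt)
    surface_match = SURFACE_RE.search(prompt)
    if not draft_match or not surface_match:
        return ""
    payload = draft_match.group(1) + "\n" + surface_match.group(1)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The payload has the same shape as the `printf '%s\n%s' "$draft" "$surface" | shasum -a 256` pipeline used in the bats tests, so both should yield the same digest for the same draft and surface. An empty return value is the signal that triggers the legacy fallback to an agent-emitted key.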
@@ -11,6 +11,8 @@ set -euo pipefail
11
11
 
12
12
  SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
13
13
  source "$SCRIPT_DIR/lib/gate-helpers.sh"
14
+ # shellcheck source=lib/external-comms-key.sh
15
+ source "$SCRIPT_DIR/lib/external-comms-key.sh"
14
16
  _enable_err_trap
15
17
 
16
18
  _parse_input
@@ -204,18 +206,43 @@ if echo "$SUBAGENT" | grep -qE 'risk-scorer.policy'; then
204
206
  fi
205
207
 
206
208
  # ---------------------------------------------------------------------------
207
- # External-comms reviewer (P064 / ADR-028 amended 2026-05-14): write
208
- # per-evaluator marker keyed on sha256(draft + '\n' + surface). Subagent
209
- # emits the key; this hook trusts and uses it. Marker file:
210
- # external-comms-risk-reviewed-<key>. The voice-tone evaluator (P038)
211
- # writes its own peer marker external-comms-voice-tone-reviewed-<key>
212
- # from packages/voice-tone/hooks/external-comms-mark-reviewed.sh.
209
+ # External-comms reviewer (P064 / ADR-028 amended 2026-05-14, further
210
+ # amended 2026-05-16 P166): write per-evaluator marker keyed on
211
+ # sha256(draft + '\n' + surface). The hook derives the key from the
212
+ # agent's tool_input.prompt (structured `SURFACE:` line + `<draft>`
213
+ # block) instead of trusting an agent-emitted KEY — single fire per
214
+ # gate cycle suffices. Backward-compat fallback to the agent's
215
+ # EXTERNAL_COMMS_RISK_KEY line preserved during the deprecation window
216
+ # (one release cycle).
217
+ # Marker file: external-comms-risk-reviewed-<key>. The voice-tone
218
+ # evaluator (P038) writes its own peer marker
219
+ # external-comms-voice-tone-reviewed-<key> from
220
+ # packages/voice-tone/hooks/external-comms-mark-reviewed.sh.
213
221
  # ---------------------------------------------------------------------------
214
222
  if echo "$SUBAGENT" | grep -qE 'risk-scorer.external-comms'; then
215
223
  VERDICT_LINE=$(echo "$AGENT_OUTPUT" | grep -E '^EXTERNAL_COMMS_RISK_VERDICT:' | tail -1) || true
216
- KEY_LINE=$(echo "$AGENT_OUTPUT" | grep -E '^EXTERNAL_COMMS_RISK_KEY:' | tail -1) || true
217
224
  VERDICT=$(echo "$VERDICT_LINE" | sed 's/^EXTERNAL_COMMS_RISK_VERDICT:[[:space:]]*//' | tr -d '[:space:]')
218
- KEY=$(echo "$KEY_LINE" | sed 's/^EXTERNAL_COMMS_RISK_KEY:[[:space:]]*//' | tr -d '[:space:]')
225
+
226
+ # Read the prompt the orchestrator sent to the agent so we can derive
227
+ # the canonical key locally. _HOOK_INPUT is set by gate-helpers.sh's
228
+ # _parse_input upstream of this branch.
229
+ PROMPT=$(echo "$_HOOK_INPUT" | python3 -c "
230
+ import sys, json
231
+ try:
232
+ print(json.load(sys.stdin).get('tool_input', {}).get('prompt', ''))
233
+ except Exception:
234
+ print('')
235
+ " 2>/dev/null || echo "")
236
+
237
+ # Primary: derive from the prompt (P166 single-fire path).
238
+ KEY=$(derive_external_comms_key_from_prompt "$PROMPT")
239
+ if [ -z "$KEY" ]; then
240
+ # Fallback: cached old SKILL.md still instructs the agent to emit
241
+ # EXTERNAL_COMMS_RISK_KEY. Honour it during the deprecation window.
242
+ KEY_LINE=$(echo "$AGENT_OUTPUT" | grep -E '^EXTERNAL_COMMS_RISK_KEY:' | tail -1) || true
243
+ KEY=$(echo "$KEY_LINE" | sed 's/^EXTERNAL_COMMS_RISK_KEY:[[:space:]]*//' | tr -d '[:space:]')
244
+ fi
245
+
219
246
  # Validate key: 64 hex chars (sha256 output). Reject anything else.
220
247
  if echo "$KEY" | grep -qE '^[0-9a-f]{64}$'; then
221
248
  case "$VERDICT" in
@@ -0,0 +1,94 @@
1
+ #!/usr/bin/env bats
2
+ # Behavioural tests for risk-score-mark.sh external-comms branch under
3
+ # P166 hook-side key derivation (ADR-028 amended 2026-05-16).
4
+ #
5
+ # Contract: the PostToolUse:Agent hook derives the marker key from
6
+ # tool_input.prompt's `SURFACE: <name>` + `<draft>...</draft>` structure
7
+ # instead of trusting an agent-emitted EXTERNAL_COMMS_RISK_KEY line.
8
+ # On PASS, writes external-comms-risk-reviewed-<KEY> at the derived key.
9
+ # Backward-compat: falls back to agent-emitted KEY when prompt has no
10
+ # structure (one release-cycle window).
11
+
12
+ setup() {
13
+ SCRIPT_DIR="$(cd "$(dirname "$BATS_TEST_FILENAME")/.." && pwd)"
14
+ HOOK="$SCRIPT_DIR/risk-score-mark.sh"
15
+ ORIG_DIR="$PWD"
16
+ TEST_DIR=$(mktemp -d)
17
+ cd "$TEST_DIR"
18
+ TMPDIR="$TEST_DIR/tmp"
19
+ export TMPDIR
20
+ mkdir -p "$TMPDIR"
21
+ SESSION_ID="test-rs-mark-extcomms-prompt-$$-${BATS_TEST_NUMBER}"
22
+ RDIR="$TMPDIR/claude-risk-${SESSION_ID}"
23
+ }
24
+
25
+ teardown() {
26
+ cd "$ORIG_DIR"
27
+ rm -rf "$TEST_DIR"
28
+ }
29
+
30
+ gate_key() {
31
+ local draft="$1" surface="$2"
32
+ printf '%s\n%s' "$draft" "$surface" | shasum -a 256 | cut -d' ' -f1
33
+ }
34
+
35
+ run_hook() {
36
+ local prompt="$1"
37
+ local agent_output="$2"
38
+ python3 -c "
39
+ import json, sys
40
+ print(json.dumps({
41
+ 'tool_name': 'Agent',
42
+ 'session_id': '${SESSION_ID}',
43
+ 'tool_input': {'subagent_type': 'wr-risk-scorer:external-comms', 'prompt': sys.argv[1]},
44
+ 'tool_response': {'content': [{'type': 'text', 'text': sys.argv[2]}]}
45
+ }))" "$prompt" "$agent_output" | bash "$HOOK"
46
+ }
47
+
48
+ @test "external-comms PASS with structured prompt: marker lands at hook-derived key" {
49
+ DRAFT="we observed a leaked secret pattern in the changeset"
50
+ SURFACE="changeset-author"
51
+ PROMPT=$'SURFACE: '"$SURFACE"$'\n<draft>\n'"$DRAFT"$'\n</draft>\nReview against RISK-POLICY.md.'
52
+ AGENT_OUTPUT=$'no Confidential Information class matched\nEXTERNAL_COMMS_RISK_VERDICT: PASS'
53
+ run_hook "$PROMPT" "$AGENT_OUTPUT"
54
+ KEY=$(gate_key "$DRAFT" "$SURFACE")
55
+ [ -f "$RDIR/external-comms-risk-reviewed-${KEY}" ]
56
+ }
57
+
58
+ @test "external-comms FAIL with structured prompt: no marker" {
59
+ DRAFT="client Acme Corp is hitting this"
60
+ SURFACE="gh-issue-create"
61
+ PROMPT=$'SURFACE: '"$SURFACE"$'\n<draft>\n'"$DRAFT"$'\n</draft>'
62
+ AGENT_OUTPUT=$'EXTERNAL_COMMS_RISK_VERDICT: FAIL\nEXTERNAL_COMMS_RISK_REASON: Client names class — "Acme Corp"'
63
+ run_hook "$PROMPT" "$AGENT_OUTPUT"
64
+ KEY=$(gate_key "$DRAFT" "$SURFACE")
65
+ [ ! -f "$RDIR/external-comms-risk-reviewed-${KEY}" ]
66
+ }
67
+
68
+ @test "external-comms PASS with structured prompt AND agent-emitted KEY: hook-derived key wins" {
69
+ DRAFT="hook-derived wins"
70
+ SURFACE="gh-pr-comment"
71
+ PROMPT=$'SURFACE: '"$SURFACE"$'\n<draft>\n'"$DRAFT"$'\n</draft>'
72
+ BOGUS_KEY="0000000000000000000000000000000000000000000000000000000000000000"
73
+ AGENT_OUTPUT=$'EXTERNAL_COMMS_RISK_VERDICT: PASS\nEXTERNAL_COMMS_RISK_KEY: '"$BOGUS_KEY"
74
+ run_hook "$PROMPT" "$AGENT_OUTPUT"
75
+ DERIVED_KEY=$(gate_key "$DRAFT" "$SURFACE")
76
+ [ -f "$RDIR/external-comms-risk-reviewed-${DERIVED_KEY}" ]
77
+ [ ! -f "$RDIR/external-comms-risk-reviewed-${BOGUS_KEY}" ]
78
+ }
79
+
80
+ @test "external-comms backward-compat: PASS with no structured prompt but agent KEY" {
81
+ LEGACY_KEY="fedcba9876543210fedcba9876543210fedcba9876543210fedcba9876543210"
82
+ PROMPT="legacy unstructured prompt"
83
+ AGENT_OUTPUT=$'EXTERNAL_COMMS_RISK_VERDICT: PASS\nEXTERNAL_COMMS_RISK_KEY: '"$LEGACY_KEY"
84
+ run_hook "$PROMPT" "$AGENT_OUTPUT"
85
+ [ -f "$RDIR/external-comms-risk-reviewed-${LEGACY_KEY}" ]
86
+ }
87
+
88
+ @test "external-comms no structured prompt and no agent KEY: no marker" {
89
+ PROMPT="legacy"
90
+ AGENT_OUTPUT=$'EXTERNAL_COMMS_RISK_VERDICT: PASS'
91
+ run_hook "$PROMPT" "$AGENT_OUTPUT"
92
+ ext_markers=$(find "$RDIR" -maxdepth 1 -name 'external-comms-risk-reviewed-*' 2>/dev/null | wc -l | tr -d ' ')
93
+ [ "$ext_markers" -eq 0 ]
94
+ }
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@windyroad/risk-scorer",
3
- "version": "0.9.0",
3
+ "version": "0.10.0-preview.327",
4
4
  "description": "Pipeline risk scoring, commit/push gates, and secret leak detection",
5
5
  "bin": {
6
6
  "windyroad-risk-scorer": "./bin/install.mjs"
@@ -3,14 +3,29 @@
3
3
  #
4
4
  # Evaluates held-changeset graduation candidates per ADR-061
5
5
  # (Dogfood graduation criteria for held changesets — symmetric risk balance).
6
- # Phase 2a: orthogonal-gate class only (Class 3a per ADR-061 Rule 3).
7
- # Atomic-cohort class (3b) requires RFC ticket cohort enumeration and is
8
- # deferred to Phase 2b per the architect-approved Phase 2a/2b split.
9
6
  #
10
- # This script implements the deterministic Rule 1a join + Rule 2 VP carve-out
11
- # detection. It does NOT compute release-risk and does NOT apply Rule 4
12
- # evidence-floor judgement — those are LLM-judgement surfaces owned by the
13
- # wr-risk-scorer:pipeline agent (per ADR-015 pure-scorer contract).
7
+ # Phase 2a, orthogonal-gate class (Class 3a per ADR-061 Rule 3): deterministic
8
+ # Rule 1a join + Rule 2 VP carve-out detection per changeset, independently.
9
+ #
10
+ # Phase 2b — atomic-cohort class (Class 3b per ADR-061 Rule 3b): parses
11
+ # docs/changesets-holding/README.md "Currently held" section, groups entries
12
+ # by shared reinstate-trigger prose (parenthetical elaborations stripped
13
+ # before grouping), and emits cohort-aware candidates. Cohort priority is
14
+ # max(Priority) across all member tickets; any VP-blocked or halt-no-resolution
15
+ # member propagates atomically to the entire cohort ("entire cohort ships or
16
+ # none does" — symmetric to Rule 2's per-changeset carve-out at cohort grain).
17
+ # Single-member "cohorts" fall back to class=3a (no Phase 2a regression).
18
+ #
19
+ # This script implements deterministic Rule 1a join + Rule 2 VP carve-out
20
+ # detection + Rule 3b cohort grouping. It does NOT compute release-risk and
21
+ # does NOT apply Rule 4 evidence-floor judgement — those are LLM-judgement
22
+ # surfaces owned by the wr-risk-scorer:pipeline agent (per ADR-015 pure-scorer
23
+ # contract).
24
+ #
25
+ # Cohort-id-from-prose is the Phase 2b shape per architect approval 2026-05-17.
26
+ # Reassessment Triggers in ADR-061 ("Manual graduations diverge from criterion
27
+ # verdicts") cover the upgrade to a structured cohort-declaration field if
28
+ # prose-shape brittleness appears in dogfood.
14
29
  #
15
30
  # Usage:
16
31
  # evaluate-graduation.sh [<project-root>]
@@ -27,12 +42,19 @@
27
42
  # docs/problems/<NNN>-*.md (flat) AND docs/problems/*/<NNN>-*.md (per-state)
28
43
  # - Extracts the Priority value from the ticket's `**Priority**: N (...)` line.
29
44
  # - Detects Rule 2 VP carve-out (ticket file ends in .verifying.md).
45
+ # - Parses docs/changesets-holding/README.md "Currently held" section and
46
+ # groups entries by normalised reinstate-trigger prose (Phase 2b).
47
+ # - Multi-member groups emit class=3b + cohort=<id> with cohort-level
48
+ # priority/status. Single-member groups emit class=3a unchanged.
30
49
  # - Emits one structured candidate line per held changeset to stdout.
31
50
  #
32
- # Stdout format (one candidate per held changeset, agent-parseable):
51
+ # Stdout format — Class 3a (one candidate per held changeset, agent-parseable):
33
52
  # GRADUATION_CANDIDATE: changeset=<filename> | ticket=P<NNN> | priority=<N> | class=3a | status=<resolved|vp-blocked|halt-no-resolution>
34
53
  #
35
- # Stdout summary line at end:
54
+ # Stdout format, Class 3b (cohort member; cohort= column added between class and status):
55
+ # GRADUATION_CANDIDATE: changeset=<filename> | ticket=P<NNN> | priority=<cohort-max-N> | class=3b | cohort=<id> | status=<resolved|vp-blocked|halt-no-resolution>
56
+ #
57
+ # Stdout summary line at end (member-level counts; cohorts count individually):
36
58
  # GRADUATION_SUMMARY: total=<N> resolved=<N> vp_blocked=<N> halts=<N>
37
59
  #
38
60
  # Exit codes:
@@ -42,13 +64,14 @@
42
64
  # 1 — no holding-area or empty holding-area (no-op caller signal)
43
65
  # 2 — invalid project root (missing docs/)
44
66
  #
45
- # @adr ADR-061 (graduation criteria — Phase 2a Rule 1a join + Rule 2 VP carve-out)
67
+ # @adr ADR-061 (graduation criteria — Phase 2a Rule 1a join + Rule 2 VP carve-out;
68
+ # Phase 2b Rule 3b atomic-cohort grouping + cohort-level propagation)
46
69
  # @adr ADR-049 (resolved via bin/wr-risk-scorer-evaluate-graduation shim)
47
70
  # @adr ADR-052 (behavioural-fixture coverage at scripts/test/evaluate-graduation.bats)
48
- # @adr ADR-015 (pure-scorer contract — script does deterministic join only;
71
+ # @adr ADR-015 (pure-scorer contract — script does deterministic join + grouping;
49
72
  # agent owns release-risk re-computation + evidence-floor judgement)
50
73
  # @adr ADR-031 (dual-tolerant problem-ticket layout per RFC-002 migration window)
51
- # @problem P162 (Phase 2a)
74
+ # @problem P162 (Phase 2a + Phase 2b)
52
75
 
53
76
  set -uo pipefail
54
77
 
@@ -83,8 +106,8 @@ if [ "${#HELD_FILES[@]}" -eq 0 ]; then
83
106
  exit 1
84
107
  fi
85
108
 
86
- # Delegate the per-candidate join + VP-check to python for re-readable
87
- # regex + dual-layout glob handling.
109
+ # Delegate the per-candidate join + VP-check + cohort grouping to python for
110
+ # re-readable regex + dual-layout glob handling.
88
111
  EVAL_RESULT=$(python3 - "$HOLDING_DIR" "$PROBLEMS_DIR" "${HELD_FILES[@]}" <<'PYEOF'
89
112
  import os
90
113
  import re
@@ -99,6 +122,18 @@ FILENAME_TICKET_RE = re.compile(r'-p(\d+)-', re.IGNORECASE)
99
122
  BODY_TICKET_RE = re.compile(r'\bP(\d+)\b')
100
123
  PRIORITY_LINE_RE = re.compile(r'^\*\*Priority\*\*:\s*(\d+)\b')
101
124
 
125
+ # Phase 2b — README "Currently held" bullet parser.
126
+ # Matches `- \`<filename>\` ... **Reinstate trigger**: <trigger-text>`.
127
+ # Captures the filename (group 1) and the trigger text (group 2; rest of line).
128
+ README_BULLET_RE = re.compile(
129
+ r'^-\s+`([^`]+\.md)`\s+.*?\*\*Reinstate trigger\*\*:\s*(.+?)\s*$'
130
+ )
131
+ # Strip parenthetical elaborations before grouping; nested parens are out of
132
+ # scope for Phase 2b (no observed README entry uses them in the trigger).
133
+ PAREN_RE = re.compile(r'\([^()]*\)')
134
+ # Sanitise cohort-id from normalised trigger prose.
135
+ NON_ID_CHAR_RE = re.compile(r'[^a-z0-9]+')
136
+
102
137
 
103
138
  def find_ticket_file(ticket_id_padded: str):
104
139
  """Dual-tolerant glob per ADR-031 / RFC-002 migration window.
@@ -167,55 +202,234 @@ def resolve_ticket_ids(changeset_path: str):
167
202
  return ids
168
203
 
169
204
 
170
- total = 0
171
- resolved = 0
172
- vp_blocked = 0
173
- halts = 0
205
+ def normalise_trigger(trigger_text: str) -> str:
206
+ """Normalise reinstate-trigger prose for cohort-key comparison.
207
+
208
+ Strips parenthetical elaborations (Reassessment criterion citations,
209
+ inline notes), takes the prefix up to the first em-dash separator
210
+ (typical for "trigger description — review at ..." continuations),
211
+ strips trailing punctuation, lowercases, and collapses whitespace
212
+ LAST so paren-strip artefacts (stray spaces before punctuation) do
213
+ not break equality matching.
214
+ """
215
+ # Strip parentheticals; loop in case there are multiple non-nested groups.
216
+ prior = None
217
+ cleaned = trigger_text
218
+ while cleaned != prior:
219
+ prior = cleaned
220
+ cleaned = PAREN_RE.sub('', cleaned)
221
+ # Take prefix up to first em-dash separator (continuations begin here).
222
+ cleaned = cleaned.split('—', 1)[0] # em-dash U+2014
223
+ # Lowercase, strip surrounding whitespace + trailing punctuation; collapse
224
+ # whitespace LAST so paren-strip leaves no orphaned single spaces before
225
+ # punctuation that would defeat equality comparison.
226
+ cleaned = cleaned.lower().strip().rstrip('.,;:').strip()
227
+ cleaned = ' '.join(cleaned.split())
228
+ # Strip any trailing punctuation that was previously space-separated.
229
+ cleaned = cleaned.rstrip('.,;:').strip()
230
+ return cleaned
231
+
232
+
233
+ def cohort_id_from_trigger(normalised: str) -> str:
234
+ """Compute a filename-safe cohort id from normalised trigger prose.
235
+
236
+ Takes the first 8 tokens, replaces non-alphanumeric runs with single
237
+ dashes, trims surrounding dashes, caps at 60 chars.
238
+ """
239
+ tokens = normalised.split()[:8]
240
+ joined = ' '.join(tokens)
241
+ slug = NON_ID_CHAR_RE.sub('-', joined).strip('-')
242
+ return slug[:60] if slug else 'cohort'
243
+
244
+
245
+ def parse_currently_held_cohorts(holding_dir: str):
246
+ """Parse docs/changesets-holding/README.md to build a filename→cohort-id map.
247
+
248
+ Reads only entries within the "## Currently held" section (case-insensitive),
249
+ extracts each bullet's filename + trigger text, normalises triggers, and
250
+ groups filenames sharing an identical normalised trigger. Cohorts with ≥ 2
251
+ members are returned as {filename: cohort_id}; single-member groups are
252
+ omitted so they fall back to class=3a per Phase 2a semantics.
253
+
254
+ Returns {} when README missing OR "Currently held" section absent OR no
255
+ multi-member groups present.
256
+ """
257
+ readme_path = os.path.join(holding_dir, 'README.md')
258
+ if not os.path.isfile(readme_path):
259
+ return {}
260
+ try:
261
+ with open(readme_path, 'r', encoding='utf-8') as f:
262
+ lines = f.readlines()
263
+ except (OSError, IOError):
264
+ return {}
265
+
266
+ # Walk lines; track whether we're inside the "Currently held" section.
267
+ in_section = False
268
+ bullets = [] # list of (filename, normalised_trigger)
269
+ for line in lines:
270
+ stripped = line.strip()
271
+ if stripped.startswith('## '):
272
+ heading = stripped[3:].strip().lower()
273
+ in_section = (heading == 'currently held')
274
+ continue
275
+ if not in_section:
276
+ continue
277
+ match = README_BULLET_RE.match(line.rstrip('\n'))
278
+ if not match:
279
+ continue
280
+ filename = match.group(1)
281
+ trigger = match.group(2)
282
+ normalised = normalise_trigger(trigger)
283
+ if not normalised:
284
+ continue
285
+ bullets.append((filename, normalised))
286
+
287
+ # Group bullets by normalised trigger.
288
+ groups = {}
289
+ for filename, normalised in bullets:
290
+ groups.setdefault(normalised, []).append(filename)
291
+
292
+ # Keep only multi-member groups; compute cohort id.
293
+ cohort_map = {}
294
+ for normalised, members in groups.items():
295
+ if len(members) < 2:
296
+ continue
297
+ cohort_id = cohort_id_from_trigger(normalised)
298
+ for filename in members:
299
+ cohort_map[filename] = cohort_id
300
+ return cohort_map
301
+
302
+
303
+ # Per-changeset resolution structure:
304
+ # {basename: {ticket: 'P<NNN>'|'-', priority: <int>|None, status: <str>,
305
+ # ticket_ids: [<padded>], chosen_suffix: <str>|None}}
306
+ per_changeset = {}
174
307
 
175
308
  for changeset_path in held_files:
176
- total += 1
177
309
  basename = os.path.basename(changeset_path)
178
310
  ticket_ids = resolve_ticket_ids(changeset_path)
179
311
 
180
312
  if not ticket_ids:
181
- # Rule 1a terminal — halt-and-prompt
182
- print(f'GRADUATION_CANDIDATE: changeset={basename} | ticket=- | priority=- | class=3a | status=halt-no-resolution')
183
- halts += 1
313
+ per_changeset[basename] = {
314
+ 'ticket_label': '-',
315
+ 'priority': None,
316
+ 'status': 'halt-no-resolution',
317
+ }
184
318
  continue
185
319
 
186
- # Resolve each referenced ticket; collect (ticket_id, priority, status_suffix) triples
187
320
  resolutions = []
188
- unresolved_ids = []
189
321
  for tid in ticket_ids:
190
322
  path, suffix = find_ticket_file(tid)
191
323
  if path is None:
192
- unresolved_ids.append(tid)
193
324
  continue
194
325
  priority = extract_priority(path)
195
326
  if priority is None:
196
- unresolved_ids.append(tid)
197
327
  continue
198
328
  resolutions.append((tid, priority, suffix))
199
329
 
200
330
  if not resolutions:
201
- # All referenced tickets failed to resolve — halt
202
- print(f'GRADUATION_CANDIDATE: changeset={basename} | ticket={",".join(f"P{i}" for i in ticket_ids)} | priority=- | class=3a | status=halt-no-resolution')
203
- halts += 1
331
+ per_changeset[basename] = {
332
+ 'ticket_label': ','.join(f'P{i}' for i in ticket_ids),
333
+ 'priority': None,
334
+ 'status': 'halt-no-resolution',
335
+ }
204
336
  continue
205
337
 
206
- # Rule 1a multi-ticket: max(Priority) across the referenced set
207
- # Pick the resolution with the highest priority; report its ticket ID.
208
338
  resolutions.sort(key=lambda r: r[1], reverse=True)
209
339
  chosen_tid, chosen_priority, chosen_suffix = resolutions[0]
210
-
211
- # Rule 2 VP carve-out
212
340
  if chosen_suffix == 'verifying':
213
- print(f'GRADUATION_CANDIDATE: changeset={basename} | ticket=P{chosen_tid} | priority={chosen_priority} | class=3a | status=vp-blocked')
214
- vp_blocked += 1
341
+ per_changeset[basename] = {
342
+ 'ticket_label': f'P{chosen_tid}',
343
+ 'priority': chosen_priority,
344
+ 'status': 'vp-blocked',
345
+ }
215
346
  continue
216
347
 
217
- print(f'GRADUATION_CANDIDATE: changeset={basename} | ticket=P{chosen_tid} | priority={chosen_priority} | class=3a | status=resolved')
218
- resolved += 1
348
+ per_changeset[basename] = {
349
+ 'ticket_label': f'P{chosen_tid}',
350
+ 'priority': chosen_priority,
351
+ 'status': 'resolved',
352
+ }
353
+
354
+ # Phase 2b — cohort detection.
355
+ cohort_map = parse_currently_held_cohorts(holding_dir)
356
+ # Build inverse: cohort_id → [member basenames].
357
+ cohort_members = {}
358
+ for filename, cohort_id in cohort_map.items():
359
+ cohort_members.setdefault(cohort_id, []).append(filename)
360
+
361
+ # Compute cohort-level rollups (priority + status).
362
+ # Atomic propagation: any halt → cohort halts; else any vp-blocked → cohort
363
+ # vp-blocked; else cohort resolved. Cohort priority = max(member priority)
364
+ # across resolved/vp-blocked members; '-' when all members halted.
365
+ cohort_rollup = {}
366
+ for cohort_id, members in cohort_members.items():
367
+ statuses = []
368
+ priorities = []
369
+ for filename in members:
370
+ # Only consider members that are actually in the holding-area glob;
371
+ # README may list entries that no longer exist on disk (stale README).
372
+ info = per_changeset.get(filename)
373
+ if info is None:
374
+ continue
375
+ statuses.append(info['status'])
376
+ if info['priority'] is not None:
377
+ priorities.append(info['priority'])
378
+
379
+ if not statuses:
380
+ # No cohort members are real held files; skip cohort treatment.
381
+ continue
382
+ if 'halt-no-resolution' in statuses:
383
+ cohort_status = 'halt-no-resolution'
384
+ elif 'vp-blocked' in statuses:
385
+ cohort_status = 'vp-blocked'
386
+ else:
387
+ cohort_status = 'resolved'
388
+ cohort_priority = max(priorities) if priorities else None
389
+ cohort_rollup[cohort_id] = {
390
+ 'status': cohort_status,
391
+ 'priority': cohort_priority,
392
+ }
393
+
394
+ # Emit candidate lines in held_files order.
395
+ total = 0
396
+ resolved = 0
397
+ vp_blocked = 0
398
+ halts = 0
399
+
400
+ for changeset_path in held_files:
401
+ total += 1
402
+ basename = os.path.basename(changeset_path)
403
+ info = per_changeset[basename]
404
+ cohort_id = cohort_map.get(basename)
405
+ is_cohort = cohort_id is not None and cohort_id in cohort_rollup
406
+
407
+ if is_cohort:
408
+ rollup = cohort_rollup[cohort_id]
409
+ # Use cohort-level priority + status; ticket_label remains member-local
410
+ # so audit trail still cites the specific resolved ticket.
411
+ priority_str = '-' if rollup['priority'] is None else str(rollup['priority'])
412
+ ticket_label = info['ticket_label']
413
+ status = rollup['status']
414
+ print(
415
+ f'GRADUATION_CANDIDATE: changeset={basename} | ticket={ticket_label} | '
416
+ f'priority={priority_str} | class=3b | cohort={cohort_id} | status={status}'
417
+ )
418
+ else:
419
+ priority_str = '-' if info['priority'] is None else str(info['priority'])
420
+ print(
421
+ f'GRADUATION_CANDIDATE: changeset={basename} | ticket={info["ticket_label"]} | '
422
+ f'priority={priority_str} | class=3a | status={info["status"]}'
423
+ )
424
+
425
+ # Tally member-level counts (cohorts count per-member for backward compat).
426
+ effective_status = cohort_rollup[cohort_id]['status'] if is_cohort else info['status']
427
+ if effective_status == 'resolved':
428
+ resolved += 1
429
+ elif effective_status == 'vp-blocked':
430
+ vp_blocked += 1
431
+ elif effective_status == 'halt-no-resolution':
432
+ halts += 1
219
433
 
220
434
  print(f'GRADUATION_SUMMARY: total={total} resolved={resolved} vp_blocked={vp_blocked} halts={halts}')
221
435
  PYEOF
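Worked example for the hunk above: a minimal sketch of how two README triggers with different surface prose collapse to one cohort key. The two helper functions restate `normalise_trigger` and `cohort_id_from_trigger` from the diff in condensed form so the snippet runs on its own. The sample strings are the ones seeded in test case (g.6); the variable names `plain` and `elaborated` are illustrative only.

```python
import re

# Same patterns as PAREN_RE / NON_ID_CHAR_RE in the diff above.
PAREN_RE = re.compile(r'\([^()]*\)')
NON_ID_CHAR_RE = re.compile(r'[^a-z0-9]+')


def normalise_trigger(trigger_text: str) -> str:
    """Condensed restatement of the diff's normalisation steps."""
    prior = None
    cleaned = trigger_text
    # Repeat until no parenthetical group is left to strip.
    while cleaned != prior:
        prior = cleaned
        cleaned = PAREN_RE.sub('', cleaned)
    # Keep only the text before the first em-dash (U+2014).
    cleaned = cleaned.split('\u2014', 1)[0]
    cleaned = cleaned.lower().strip().rstrip('.,;:').strip()
    # Collapse whitespace last, then strip punctuation exposed by the collapse.
    cleaned = ' '.join(cleaned.split())
    return cleaned.rstrip('.,;:').strip()


def cohort_id_from_trigger(normalised: str) -> str:
    """First 8 tokens, non-alphanumeric runs become dashes, capped at 60 chars."""
    tokens = normalised.split()[:8]
    slug = NON_ID_CHAR_RE.sub('-', ' '.join(tokens)).strip('-')
    return slug[:60] if slug else 'cohort'


# Sample triggers taken from test case (g.6).
plain = 'end-of-chain fires.'
elaborated = ('end-of-chain fires (only the slice 3 dependency remains, '
              'can defer per Reassessment criterion k).')

key_plain = normalise_trigger(plain)
key_elaborated = normalise_trigger(elaborated)
cohort_id = cohort_id_from_trigger(key_plain)
```

Both inputs normalise to `end-of-chain fires`, so the two changesets share the cohort id `end-of-chain-fires` and are emitted as `class=3b`.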
@@ -2,10 +2,16 @@
2
2
  # Behavioural-fixture coverage for packages/risk-scorer/scripts/evaluate-graduation.sh
3
3
  # per ADR-052 (behavioural tests default) and ADR-061 (dogfood graduation criteria).
4
4
  #
5
- # Phase 2a coverage — orthogonal-gate class (Class 3a) only. Atomic-cohort
6
- # class (Class 3b Rule 3b RFC cohort enumeration) is deferred to Phase 2b.
7
- # Maps to ADR-061 Confirmation criterion 2 items a-f (item g atomic-cohort
8
- # lands in Phase 2b alongside the RFC enumeration logic).
5
+ # Phase 2a coverage — orthogonal-gate class (Class 3a). Maps to
6
+ # ADR-061 Confirmation criterion 2 items a-f.
7
+ #
8
+ # Phase 2b coverage — atomic-cohort class (Class 3b — Rule 3b cohort enumeration).
9
+ # Maps to ADR-061 Confirmation criterion 2 item g (full-cohort evaluation,
10
+ # max(Priority) across cohort tickets, atomic VP-blocked + halt propagation).
11
+ # Cohort detection reads docs/changesets-holding/README.md "Currently held"
12
+ # section and groups entries by shared reinstate-trigger prose (parenthetical
13
+ # elaborations stripped before grouping). Single-member "cohorts" fall back
14
+ # to class=3a (no Phase 2a regression).
9
15
 
10
16
  setup() {
11
17
  REPO_ROOT="$(cd "$(dirname "$BATS_TEST_FILENAME")/../../../.." && pwd)"
@@ -270,3 +276,198 @@ EOF
270
276
  # Confirm body-referenced P800 was NOT picked up
271
277
  ! echo "$output" | grep -q 'ticket=P800'
272
278
  }
279
+
280
+ # ----- Phase 2b: ADR-061 Confirmation criterion 2 item (g) — atomic-cohort -----
281
+
282
+ # Helper: seed a Currently held entry into docs/changesets-holding/README.md.
283
+ # Cohort detection reads this file and groups entries by shared reinstate-trigger
284
+ # prose (parenthetical elaborations stripped) — see evaluate-graduation.sh.
285
+ seed_holding_readme() {
286
+ # seed_holding_readme <bullet-line> [<bullet-line>...]
287
+ local readme="docs/changesets-holding/README.md"
288
+ if [ ! -f "$readme" ]; then
289
+ cat > "$readme" <<'EOF'
290
+ # Changesets Holding Area
291
+
292
+ ## Currently held
293
+
294
+ EOF
295
+ fi
296
+ for bullet in "$@"; do
297
+ printf '%s\n' "$bullet" >> "$readme"
298
+ done
299
+ }
300
+
301
+ # Case (g.1) — two members sharing identical reinstate-trigger prose form a cohort
302
+ @test "case (g.1): two members sharing reinstate-trigger form Class 3b cohort" {
303
+ seed_problem "170" "open" "9"
304
+ seed_problem "171" "open" "12"
305
+ seed_changeset "wr-itil-p170-phase4.md"
306
+ seed_changeset "wr-itil-p171-phase3.md"
307
+ seed_holding_readme \
308
+ "- \`wr-itil-p170-phase4.md\` — patch. **Reinstate trigger**: Phase 3 + Phase 4 end-of-chain user verification fires." \
309
+ "- \`wr-itil-p171-phase3.md\` — minor. **Reinstate trigger**: Phase 3 + Phase 4 end-of-chain user verification fires."
310
+ run bash "$SCRIPT" "$WORK_DIR"
311
+ [ "$status" -eq 0 ]
312
+ # Both members emit class=3b
313
+ echo "$output" | grep 'changeset=wr-itil-p170-phase4.md' | grep -q 'class=3b'
314
+ echo "$output" | grep 'changeset=wr-itil-p171-phase3.md' | grep -q 'class=3b'
315
+ # Both members share the same cohort= column
316
+ cohort_p170=$(echo "$output" | grep 'changeset=wr-itil-p170-phase4.md' | sed -n 's/.*cohort=\([^ |]*\).*/\1/p')
317
+ cohort_p171=$(echo "$output" | grep 'changeset=wr-itil-p171-phase3.md' | sed -n 's/.*cohort=\([^ |]*\).*/\1/p')
318
+ [ -n "$cohort_p170" ]
319
+ [ "$cohort_p170" = "$cohort_p171" ]
320
+ }
321
+
322
+ # Case (g.2) — cohort uses max(Priority) across all member tickets per ADR-061 Rule 3b
323
+ @test "case (g.2): cohort priority is max across member tickets" {
324
+ seed_problem "172" "open" "6"
325
+ seed_problem "173" "open" "15"
326
+ seed_problem "174" "open" "9"
327
+ seed_changeset "wr-itil-p172-slice-a.md"
328
+ seed_changeset "wr-itil-p173-slice-b.md"
329
+ seed_changeset "wr-itil-p174-slice-c.md"
330
+ seed_holding_readme \
331
+ "- \`wr-itil-p172-slice-a.md\` — patch. **Reinstate trigger**: RFC-009 end-of-chain verification." \
332
+ "- \`wr-itil-p173-slice-b.md\` — patch. **Reinstate trigger**: RFC-009 end-of-chain verification." \
333
+ "- \`wr-itil-p174-slice-c.md\` — patch. **Reinstate trigger**: RFC-009 end-of-chain verification."
334
+ run bash "$SCRIPT" "$WORK_DIR"
335
+ [ "$status" -eq 0 ]
336
+ # Every cohort member carries priority=15 (max across P172/P173/P174)
337
+ echo "$output" | grep 'changeset=wr-itil-p172-slice-a.md' | grep -q 'priority=15'
338
+ echo "$output" | grep 'changeset=wr-itil-p173-slice-b.md' | grep -q 'priority=15'
339
+ echo "$output" | grep 'changeset=wr-itil-p174-slice-c.md' | grep -q 'priority=15'
340
+ }
341
+
342
+ # Case (g.3) — one VP-blocked cohort member marks the entire cohort vp-blocked
343
+ @test "case (g.3): VP-blocked member blocks entire cohort (Rule 2 carve-out symmetric)" {
344
+ seed_problem "175" "open" "9"
345
+ seed_problem "176" "verifying" "12"
346
+ seed_changeset "wr-itil-p175-slice-a.md"
347
+ seed_changeset "wr-itil-p176-slice-b.md"
348
+ seed_holding_readme \
349
+ "- \`wr-itil-p175-slice-a.md\` — minor. **Reinstate trigger**: cohort verification fires." \
350
+ "- \`wr-itil-p176-slice-b.md\` — minor. **Reinstate trigger**: cohort verification fires."
351
+ run bash "$SCRIPT" "$WORK_DIR"
352
+ [ "$status" -eq 0 ]
353
+ # Both members report status=vp-blocked even though only P176 is in verifying state
354
+ echo "$output" | grep 'changeset=wr-itil-p175-slice-a.md' | grep -q 'status=vp-blocked'
355
+ echo "$output" | grep 'changeset=wr-itil-p176-slice-b.md' | grep -q 'status=vp-blocked'
356
+ }
357
+
358
+ # Case (g.4) — one halt-no-resolution member propagates to entire cohort (architect C1)
359
+ @test "case (g.4): halt-no-resolution member propagates to entire cohort" {
360
+ seed_problem "177" "open" "9"
361
+ # P178 deliberately NOT seeded → halt-no-resolution
362
+ seed_changeset "wr-itil-p177-slice-a.md"
363
+ seed_changeset "wr-itil-p178-slice-b.md"
364
+ seed_holding_readme \
365
+ "- \`wr-itil-p177-slice-a.md\` — patch. **Reinstate trigger**: shared cohort fires." \
366
+ "- \`wr-itil-p178-slice-b.md\` — patch. **Reinstate trigger**: shared cohort fires."
367
+ run bash "$SCRIPT" "$WORK_DIR"
368
+ [ "$status" -eq 0 ]
369
+ # Both members report status=halt-no-resolution — cohort cannot graduate partially
370
+ echo "$output" | grep 'changeset=wr-itil-p177-slice-a.md' | grep -q 'status=halt-no-resolution'
371
+ echo "$output" | grep 'changeset=wr-itil-p178-slice-b.md' | grep -q 'status=halt-no-resolution'
372
+ }
373
+
374
+ # Case (g.5) — single-member "cohort" falls back to Class 3a (no Phase 2a regression)
375
+ @test "case (g.5): single-member 'cohort' falls back to class=3a" {
376
+ seed_problem "179" "open" "9"
377
+ seed_changeset "wr-itil-p179-solo.md"
378
+ seed_holding_readme \
379
+ "- \`wr-itil-p179-solo.md\` — patch. **Reinstate trigger**: nobody else shares this trigger."
380
+ run bash "$SCRIPT" "$WORK_DIR"
381
+ [ "$status" -eq 0 ]
382
+ echo "$output" | grep 'changeset=wr-itil-p179-solo.md' | grep -q 'class=3a'
383
+ ! echo "$output" | grep 'changeset=wr-itil-p179-solo.md' | grep -q 'cohort='
384
+ }
385
+
386
+ # Case (g.6) — parenthetical elaborations stripped before grouping
387
+ @test "case (g.6): parenthetical elaborations stripped before cohort grouping" {
388
+ seed_problem "180" "open" "9"
389
+ seed_problem "181" "open" "9"
390
+ seed_changeset "wr-itil-p180-a.md"
391
+ seed_changeset "wr-itil-p181-b.md"
392
+ # P180 trigger has no parens; P181 trigger has parenthetical elaboration —
393
+ # cohort detection must strip the paren content before comparison.
394
+ seed_holding_readme \
395
+ "- \`wr-itil-p180-a.md\` — patch. **Reinstate trigger**: end-of-chain fires." \
396
+ "- \`wr-itil-p181-b.md\` — patch. **Reinstate trigger**: end-of-chain fires (only the slice 3 dependency remains, can defer per Reassessment criterion k)."
397
+ run bash "$SCRIPT" "$WORK_DIR"
398
+ [ "$status" -eq 0 ]
399
+ # Despite different surface prose, the normalised trigger matches → both class=3b
400
+ echo "$output" | grep 'changeset=wr-itil-p180-a.md' | grep -q 'class=3b'
401
+ echo "$output" | grep 'changeset=wr-itil-p181-b.md' | grep -q 'class=3b'
402
+ }
403
+
404
+ # Case (g.7) — README without "Currently held" section → all entries fall back to class=3a
405
+ @test "case (g.7): README without 'Currently held' section falls back to class=3a (defensive)" {
406
+ seed_problem "182" "open" "9"
407
+ seed_problem "183" "open" "12"
408
+ seed_changeset "wr-itil-p182-a.md"
409
+ seed_changeset "wr-itil-p183-b.md"
410
+ # README exists but has no "Currently held" section — cohort detection finds nothing.
411
+ cat > "docs/changesets-holding/README.md" <<'EOF'
412
+ # Holding Area
413
+ Some unrelated content.
414
+ EOF
415
+ run bash "$SCRIPT" "$WORK_DIR"
416
+ [ "$status" -eq 0 ]
417
+ echo "$output" | grep 'changeset=wr-itil-p182-a.md' | grep -q 'class=3a'
418
+ echo "$output" | grep 'changeset=wr-itil-p183-b.md' | grep -q 'class=3a'
419
+ }
420
+
421
+ # Case (g.8) — README absent entirely → all entries fall back to class=3a (defensive)
422
+ @test "case (g.8): missing README falls back to class=3a" {
423
+ seed_problem "184" "open" "9"
424
+ seed_changeset "wr-itil-p184-a.md"
425
+ # Do NOT create README at all
426
+ run bash "$SCRIPT" "$WORK_DIR"
427
+ [ "$status" -eq 0 ]
428
+ echo "$output" | grep 'changeset=wr-itil-p184-a.md' | grep -q 'class=3a'
429
+ }
430
+
431
+ # Case (g.9) — multiple distinct cohorts in the same holding-area resolve independently
432
+ @test "case (g.9): multiple distinct cohorts coexist with distinct cohort= ids" {
433
+ seed_problem "185" "open" "9"
434
+ seed_problem "186" "open" "10"
435
+ seed_problem "187" "open" "12"
436
+ seed_problem "188" "open" "8"
437
+ seed_changeset "wr-itil-p185-cohort-a.md"
438
+ seed_changeset "wr-itil-p186-cohort-a.md"
439
+ seed_changeset "wr-itil-p187-cohort-b.md"
440
+ seed_changeset "wr-itil-p188-cohort-b.md"
441
+ seed_holding_readme \
442
+ "- \`wr-itil-p185-cohort-a.md\` — minor. **Reinstate trigger**: cohort alpha fires." \
443
+ "- \`wr-itil-p186-cohort-a.md\` — minor. **Reinstate trigger**: cohort alpha fires." \
444
+ "- \`wr-itil-p187-cohort-b.md\` — minor. **Reinstate trigger**: cohort beta fires." \
445
+ "- \`wr-itil-p188-cohort-b.md\` — minor. **Reinstate trigger**: cohort beta fires."
446
+ run bash "$SCRIPT" "$WORK_DIR"
447
+ [ "$status" -eq 0 ]
448
+ cohort_a1=$(echo "$output" | grep 'changeset=wr-itil-p185-cohort-a.md' | sed -n 's/.*cohort=\([^ |]*\).*/\1/p')
449
+ cohort_a2=$(echo "$output" | grep 'changeset=wr-itil-p186-cohort-a.md' | sed -n 's/.*cohort=\([^ |]*\).*/\1/p')
450
+ cohort_b1=$(echo "$output" | grep 'changeset=wr-itil-p187-cohort-b.md' | sed -n 's/.*cohort=\([^ |]*\).*/\1/p')
451
+ cohort_b2=$(echo "$output" | grep 'changeset=wr-itil-p188-cohort-b.md' | sed -n 's/.*cohort=\([^ |]*\).*/\1/p')
452
+ [ -n "$cohort_a1" ] && [ "$cohort_a1" = "$cohort_a2" ]
453
+ [ -n "$cohort_b1" ] && [ "$cohort_b1" = "$cohort_b2" ]
454
+ [ "$cohort_a1" != "$cohort_b1" ]
455
+ # Cohort A priority is max(9,10) = 10; Cohort B priority is max(12,8) = 12
456
+ echo "$output" | grep 'changeset=wr-itil-p185-cohort-a.md' | grep -q 'priority=10'
457
+ echo "$output" | grep 'changeset=wr-itil-p187-cohort-b.md' | grep -q 'priority=12'
458
+ }
459
+
460
+ # Case (g.10) — cohort detection does NOT regress Phase 2a summary counts
461
+ @test "case (g.10): cohort members still count individually in GRADUATION_SUMMARY" {
462
+ seed_problem "190" "open" "9"
463
+ seed_problem "191" "open" "9"
464
+ seed_changeset "wr-itil-p190-cohort.md"
465
+ seed_changeset "wr-itil-p191-cohort.md"
466
+ seed_holding_readme \
467
+ "- \`wr-itil-p190-cohort.md\` — patch. **Reinstate trigger**: shared cohort." \
468
+ "- \`wr-itil-p191-cohort.md\` — patch. **Reinstate trigger**: shared cohort."
469
+ run bash "$SCRIPT" "$WORK_DIR"
470
+ [ "$status" -eq 0 ]
471
+ # Phase 2a parsers see total=2 resolved=2 — backwards compatible
472
+ echo "$output" | grep -q 'GRADUATION_SUMMARY: total=2 resolved=2 vp_blocked=0 halts=0'
473
+ }
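The propagation rules exercised by cases (g.2) through (g.4) can be restated as a small pure function. This is a sketch under assumptions, not the script's own code: the function name `rollup` and the `(status, priority)` tuple shape are hypothetical, chosen only to isolate the precedence order (halt, then vp-blocked, then resolved) and the max-priority rule.

```python
from typing import List, Optional, Tuple

Member = Tuple[str, Optional[int]]  # (status, priority); hypothetical shape


def rollup(members: List[Member]) -> Tuple[str, Optional[int]]:
    """Return the cohort-level (status, priority) for a list of members."""
    statuses = [status for status, _ in members]
    priorities = [prio for _, prio in members if prio is not None]
    # Atomic propagation: the most restrictive member status wins.
    if 'halt-no-resolution' in statuses:
        status = 'halt-no-resolution'
    elif 'vp-blocked' in statuses:
        status = 'vp-blocked'
    else:
        status = 'resolved'
    # Priority is the max across members that resolved a priority at all.
    return status, (max(priorities) if priorities else None)


# Inputs mirroring the seeded tickets in cases (g.2), (g.3) and (g.4).
case_g2 = rollup([('resolved', 6), ('resolved', 15), ('resolved', 9)])
case_g3 = rollup([('resolved', 9), ('vp-blocked', 12)])
case_g4 = rollup([('resolved', 9), ('halt-no-resolution', None)])
```

Every member of a cohort is then reported with the cohort-level pair, which is why both changesets in case (g.3) show `status=vp-blocked` even though only one ticket is in the verifying state.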
@@ -51,12 +51,24 @@ Do not ask if the surface is obvious from the conversation context.
51
51
 
52
52
  ### 3. Construct the review prompt
53
53
 
54
- Build a self-contained prompt for the `wr-risk-scorer:external-comms` subagent that includes:
54
+ Build a self-contained prompt for the `wr-risk-scorer:external-comms` subagent. The prompt MUST be structured so the PostToolUse hook can derive the marker key locally (P166 / ADR-028 amended 2026-05-16) — single fire per gate cycle suffices:
55
55
 
56
- - The **draft body** verbatim (between explicit `<draft>...</draft>` markers so the agent's substring extraction is unambiguous).
57
- - The **target surface** (one of the canonical strings above).
58
- - The **destination** when known.
59
- - A reminder to compute `EXTERNAL_COMMS_RISK_KEY = sha256(draft + '\n' + surface)`.
56
+ ```
57
+ SURFACE: <surface-name>
58
+ <draft>
59
+ <draft body verbatim>
60
+ </draft>
61
+
62
+ Destination: <destination if known>
63
+ Review against RISK-POLICY.md Confidential Information classes.
64
+ ```
65
+
66
+ Two requirements:
67
+
68
+ - A leading line `SURFACE: <surface-name>` where `<surface-name>` is one of the canonical strings (`gh-issue-create`, `gh-pr-comment`, etc.) — anchored to line start, single token.
69
+ - The **draft body** wrapped verbatim inside `<draft>...</draft>` markers — the hook extracts everything between these markers and uses it for `sha256(DRAFT + '\n' + SURFACE)`.
70
+
71
+ The orchestrator does NOT pre-compute the key — the hook derives it from the prompt structure. Skip the agent-emitted key entirely.
60
72
 
61
73
  ### 4. Delegate to wr-risk-scorer:external-comms
62
74
 
@@ -67,7 +79,7 @@ subagent_type: wr-risk-scorer:external-comms
67
79
  prompt: <constructed review prompt from step 3>
68
80
  ```
69
81
 
70
- Wait for the subagent to complete. The subagent will output a structured verdict block (`EXTERNAL_COMMS_RISK_VERDICT: PASS|FAIL` + `EXTERNAL_COMMS_RISK_KEY: <sha>` + optional `EXTERNAL_COMMS_RISK_REASON: ...`). The `PostToolUse:Agent` hook (`risk-score-mark.sh`) reads that output and writes the marker automatically.
82
+ Wait for the subagent to complete. The subagent outputs a structured verdict block (`EXTERNAL_COMMS_RISK_VERDICT: PASS|FAIL` + optional `EXTERNAL_COMMS_RISK_REASON: ...` on FAIL). The `PostToolUse:Agent` hook (`risk-score-mark.sh`) parses the verdict, derives the marker key from the prompt's `SURFACE:` + `<draft>` structure, and writes the marker automatically on PASS.
71
83
 
72
84
  **Do not write to `${TMPDIR:-/tmp}/claude-risk-*` yourself.** The hook is the only correct mechanism.
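The key derivation described in steps 3 and 4 can be sketched as follows. This is an illustration only: the real hook is the shell script `risk-score-mark.sh`, whose parsing is not shown in this diff. The function name `derive_marker_key`, both regular expressions, and the choice to drop the single newline next to each `<draft>` marker are assumptions; only the formula `sha256(DRAFT + '\n' + SURFACE)` comes from the text above.

```python
import hashlib
import re
from typing import Optional

# Assumed patterns: a line-anchored single-token SURFACE line and a draft block.
SURFACE_RE = re.compile(r'^SURFACE:[ \t]*(\S+)[ \t]*$', re.MULTILINE)
DRAFT_RE = re.compile(r'<draft>\n?(.*?)\n?</draft>', re.DOTALL)


def derive_marker_key(prompt: str) -> Optional[str]:
    """Return sha256(draft + '\\n' + surface), or None if the structure is missing."""
    surface = SURFACE_RE.search(prompt)
    draft = DRAFT_RE.search(prompt)
    if surface is None or draft is None:
        return None
    payload = draft.group(1) + '\n' + surface.group(1)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


# Hypothetical prompt following the template from step 3.
example_prompt = (
    'SURFACE: gh-issue-create\n'
    '<draft>\n'
    'Hello world\n'
    '</draft>\n'
    '\n'
    'Destination: example/repo\n'
    'Review against RISK-POLICY.md Confidential Information classes.\n'
)
example_key = derive_marker_key(example_prompt)
```

Because the key depends only on the prompt text, a prompt that omits the `SURFACE:` line or the `<draft>` markers yields no key in this sketch, which matches the requirement that the prompt be structured.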
73
85