@windyroad/risk-scorer 0.3.3 → 0.3.4-preview.104
- package/.claude-plugin/plugin.json +2 -2
- package/agents/pipeline.md +78 -2
- package/agents/plan.md +28 -1
- package/agents/test/risk-scorer-above-appetite-stop.bats +80 -0
- package/agents/test/risk-scorer-monitoring-not-a-control.bats +76 -0
- package/agents/test/risk-scorer-reducing-bypass-criteria.bats +62 -0
- package/agents/test/risk-scorer-user-stated-preconditions.bats +89 -0
- package/agents/wip.md +28 -1
- package/package.json +1 -1
package/agents/pipeline.md
CHANGED
@@ -94,7 +94,36 @@ Commit score >= push score >= release score (risk accumulates upward).
 
 ## Risk-Reducing and Risk-Neutral Bypass
 
-
+`RISK_BYPASS: reducing` is reserved for commits that genuinely reduce risk.
+The 329-report retrospective found this label applied to 97.9% of commits in
+this repo because the old criteria were too loose — changeset metadata, ADR
+checkbox ticks, and docs-only edits all earned the bypass. When nearly
+everything is "reducing", the label provides no discriminating signal. These
+criteria tighten that.
+
+Emit `RISK_BYPASS: reducing` ONLY when ALL of the following are true:
+1. The commit closes a problem ticket (the diff includes a `.known-error.md` →
+`.closed.md` rename, references "closes P<NNN>" in the commit message, or
+adds a `## Fix Committed` section to a known-error ticket), OR
+2. The commit explicitly remediates a risk item previously flagged by the
+scorer in a prior report (the diff fixes something a prior risk report
+called out), OR
+3. The commit removes a documented risk (retires a hazardous hook, removes an
+insecure API, deletes a known-defective code path)
+
+Ordinary commits that do not meet at least one of these conditions are **risk-neutral, not risk-reducing**. Docs-only edits, test-only additions without a remediation link, and routine refactors are all neutral — do NOT emit the reducing bypass for them.
+
+When emitting `RISK_BYPASS: reducing`, cite the reason on a companion
+`RISK_BYPASS_REASON:` line so the bypass is auditable:
+
+```
+RISK_BYPASS: reducing
+RISK_BYPASS_REASON: closes P043 (tightens reducing-bypass criteria; removes previously-flagged over-application)
+```
+
+Acceptable `RISK_BYPASS_REASON:` values cite the ticket ID closed, the prior
+risk report remediated, or the removed risk — matching one of the three
+criteria above.
 
 For live incidents (outage, security, information disclosure), include `RISK_BYPASS: incident`.
 
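The pairing rule in this hunk (a `RISK_BYPASS: reducing` line must come with a `RISK_BYPASS_REASON:` line) can be checked mechanically. The following is a minimal sketch, not part of the package: the function name `check_reducing_bypass` is hypothetical, and it assumes a plain-text report in which both markers start at the beginning of a line, as in the example in the hunk.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with @windyroad/risk-scorer).
# Reads a scorer report on stdin and returns 0 when the bypass is auditable:
# either no `RISK_BYPASS: reducing` line is present, or it is accompanied by
# a `RISK_BYPASS_REASON:` line with some text after the colon.
check_reducing_bypass() {
  report=$(cat)

  # No reducing bypass claimed: nothing to audit.
  if ! printf '%s\n' "$report" | grep -qE '^RISK_BYPASS: reducing[[:space:]]*$'; then
    return 0
  fi

  # A reducing bypass needs a companion reason that is not empty.
  printf '%s\n' "$report" | grep -qE '^RISK_BYPASS_REASON:[[:space:]]*[^[:space:]]'
}
```

The check only confirms that a reason exists. Whether the reason really cites a closed ticket, a remediated report, or a removed risk still needs a reader or a stricter pattern.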
@@ -109,7 +138,18 @@ Do NOT emit: "Suggested Actions", "Your call:", advisory warnings, back-pressure
 
 ## Above-Appetite Remediations
 
-When ANY cumulative score exceeds appetite (> 4),
+When ANY cumulative score exceeds appetite (> 4), the verbal verdict is **STOP**.
+The scorer is not the primary decision-maker — the hook gate will block the
+action — but the scorer's verdict must match the structured score so the agent
+does not waste tool calls acting on an ambiguous nudge.
+
+**Do NOT emit** "Proceed", "Proceed with release", "Continue", "You may ship",
+"OK to commit/push/release", or any similar nudge language when cumulative risk
+exceeds appetite. The only sanctioned above-appetite output is the Risk Report
+structure, `RISK_SCORES: ...`, and the structured `RISK_REMEDIATIONS:` block
+defined below.
+
+Emit a structured `RISK_REMEDIATIONS:` block after the `RISK_SCORES:` line. This gives the calling skill machine-readable input for structured decision prompts.
 
 Format (5 columns — machine-readable for structured AskUserQuestion prompts in calling skills):
 ```
@@ -144,6 +184,42 @@ Do not rely on a static list. For each control claimed to reduce risk, you MUST:
 3. Ask: "Would this control catch this failure before reaching the user?"
 4. **Name the control**: "Tests pass" is not a control. Name the specific test file and scenario. If you cannot name it, it provides 0 reduction.
 
+**Monitoring is not a control.** Monitoring, alerting, dashboards, and any other post-release detection activity MUST NOT be credited as a control that reduces residual risk. Post-release detection does NOT reduce pre-release risk — it only shortens the time to notice a failure after it has already reached users. A genuine control exercises the failure
+scenario BEFORE the change ships: a test, a CI gate, a feature flag, a preview
+verification, an architect review, an installer dry-run. Monitoring and rollback
+readiness may be listed separately as "post-release follow-ups" outside the
+residual risk computation, but MUST NOT appear in a Controls list and MUST NOT
+reduce any inherent risk score.
+
+## User-Stated Preconditions Check
+
+A technical control list never substitutes for an explicit user warning. Before
+credit is given to any control, check for **user-stated preconditions** — conditions
+the user has named in the current conversation, commit messages, changesets, or
+problem tickets that tie this change to a paired capability (e.g., "A is only safe
+if B ships alongside", "don't release X until Y is merged").
+
+For each user-stated precondition:
+1. Determine whether the paired capability is released, queued in the unreleased
+changeset batch, or unmet.
+2. If unmet, the precondition is a failed control — credit zero reduction from
+otherwise-valid controls (tests, CI, architect review) that do not address the
+precondition itself.
+3. Surface the unmet precondition as a standalone **Risk item** with inherent
+impact and likelihood reflecting the consequence the user warned about.
+Inherent risk MUST be >= Medium (>= 5), even when the diff's technical risk
+alone would score Low. This routes the precondition through the existing
+above-appetite `RISK_REMEDIATIONS:` flow rather than burying it in prose.
+
+Sources to inspect for stated preconditions:
+- Recent conversation messages directed to the agent
+- Open or known-error problem tickets referenced in the diff or recent commits
+- Commit messages and changeset files on the unreleased queue
+- CLAUDE.md notes about cross-cutting dependencies
+
+User warnings reflect domain context the scorer cannot derive from the diff alone.
+They outrank the technical assessment.
+
 ## Constraints
 
 - You are a scorer, not an editor.
package/agents/plan.md
CHANGED
@@ -49,7 +49,13 @@ You are the Risk Scorer in plan review mode. Assess both the plan's own risk AND
 
 End your report with `RISK_VERDICT: PASS` or `RISK_VERDICT: FAIL` on its own line. A PostToolUse hook reads this and writes the marker files — do NOT write files yourself.
 
-On FAIL,
+On FAIL, the verbal verdict is **STOP**. **Do NOT emit** "Proceed", "Continue",
+"You may ship", "OK to implement", or any similar nudge language. The plan is
+not policy-authorised — the only sanctioned FAIL output is the Plan Risk Report,
+the `RISK_VERDICT: FAIL` marker, and the structured `RISK_REMEDIATIONS:` block
+defined below.
+
+Emit a structured `RISK_REMEDIATIONS:` block after the verdict (5 columns — machine-readable for structured AskUserQuestion prompts in calling skills):
 ```
 RISK_REMEDIATIONS:
 - R1 | <description of what the plan must add/change> | <effort S/M/L> | <risk_delta -N> | <affected area>
@@ -68,6 +74,27 @@ For each control claimed to reduce risk:
 2. Name the specific test file/scenario or hook
 3. If you cannot name it, it provides 0 reduction
 
+**Monitoring is not a control.** Monitoring, alerting, dashboards, and any other post-release detection activity MUST NOT be credited as a control in a plan's residual risk. Post-release detection does NOT reduce pre-release risk — it only shortens the time to notice a failure after it has already reached users. A genuine control exercises the failure scenario before the
+plan's changes ship: a test, a CI gate, a feature flag, a preview verification,
+an architect review. Monitoring MUST NOT appear in a Controls list and MUST NOT
+reduce any inherent risk score.
+
+## User-Stated Preconditions Check
+
+Before crediting any control, check for **user-stated preconditions** — conditions
+the user has named in the plan, associated problem tickets, commit messages, or
+CLAUDE.md that tie this plan to a paired capability (e.g., "A is only safe if B
+ships alongside", "don't release X until Y is merged").
+
+For each user-stated precondition:
+1. Check whether the plan already addresses or queues the paired capability.
+2. If the precondition is unmet in the plan, credit zero reduction from controls
+that do not cover the precondition, and surface the unmet precondition as a **Risk item** with inherent risk >= Medium (>= 5).
+3. A plan that ships a change without addressing a user-stated precondition
+must be FAIL, regardless of the diff's technical score.
+
+User warnings outrank technical control discovery.
+
 ## Constraints
 
 - You are a scorer, not an editor.
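The plan-mode precondition rule amounts to a floor on the score followed by a threshold test. The sketch below illustrates that arithmetic only. `plan_verdict` is a hypothetical name, and it assumes plan mode uses the same appetite as the `> 4` / `>= 5` thresholds quoted for pipeline.md and wip.md, which this diff does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
#   $1  technical risk score of the plan (integer)
#   $2  number of unmet user-stated preconditions (integer)
# Assumption: appetite is 4, so a score of 5 or more fails.
plan_verdict() {
  score=$1
  unmet=$2

  # An unmet precondition is a Risk item with inherent risk of at least 5,
  # even when the technical score alone is lower.
  if [ "$unmet" -gt 0 ] && [ "$score" -lt 5 ]; then
    score=5
  fi

  if [ "$score" -gt 4 ]; then
    echo "RISK_VERDICT: FAIL"
  else
    echo "RISK_VERDICT: PASS"
  fi
}

plan_verdict 2 0   # prints "RISK_VERDICT: PASS"
plan_verdict 2 1   # prints "RISK_VERDICT: FAIL"
```

The second call shows the point of the rule: the same low-risk diff fails once a single stated precondition is left unmet.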
package/agents/test/risk-scorer-above-appetite-stop.bats
ADDED
@@ -0,0 +1,80 @@
+#!/usr/bin/env bats
+# Doc-lint guard: risk-scorer agent prompts must contain an explicit
+# STOP / do-not-proceed directive in their Above-Appetite sections.
+#
+# Structural assertions — Permitted Exception to the source-grep ban (ADR-005 / P011).
+# These tests assert that the pipeline, wip, and plan scorer prompts forbid
+# "Proceed", "Continue", or "You may ship" nudges when cumulative risk
+# exceeds appetite.
+#
+# Background: P037 identified that scorer reports could include "Proceed
+# with release" or similar nudge language even when residual risk exceeded
+# appetite. The hook gate then correctly blocked the action, but only after
+# the agent wasted tool calls and tokens acting on the nudge. The scorer
+# is not the primary decision-maker, but its verbal verdict must match the
+# structured score — ambiguous "proceed" language undermines this.
+#
+# The Below-Appetite Output Rule (ADR-013 Rule 5) already requires silent
+# policy-authorised release when all scores are within appetite. This guard
+# enforces the inverse: an explicit STOP directive above appetite.
+#
+# Cross-reference:
+# P037: docs/problems/037-risk-scorer-proceeds-above-appetite.open.md
+# ADR-013: docs/decisions/013-structured-user-interaction-for-governance-decisions.proposed.md
+# @jtbd JTBD-001 (enforce governance without slowing down)
+# @jtbd JTBD-002 (ship with confidence — verbal verdict must match structured score)
+# @jtbd JTBD-202 (pre-flight governance — structured output is the only sanctioned channel)
+
+setup() {
+  AGENTS_DIR="$(cd "$(dirname "$BATS_TEST_FILENAME")/.." && pwd)"
+  PIPELINE="${AGENTS_DIR}/pipeline.md"
+  WIP="${AGENTS_DIR}/wip.md"
+  PLAN="${AGENTS_DIR}/plan.md"
+}
+
+# ──────────────────────────────────────────────────────────────────────────────
+# pipeline.md: Above-Appetite STOP directive
+# ──────────────────────────────────────────────────────────────────────────────
+
+@test "pipeline.md Above-Appetite section contains explicit STOP directive" {
+  # Must contain the word STOP (or BLOCKED) as the verdict above appetite.
+  run grep -qE "STOP|BLOCKED" "$PIPELINE"
+  [ "$status" -eq 0 ]
+}
+
+@test "pipeline.md Above-Appetite section forbids Proceed nudges" {
+  # Must explicitly forbid emitting "Proceed" / "Continue" nudges
+  # when risk exceeds appetite.
+  run grep -qE "[Dd]o NOT emit.*Proceed|forbid.*Proceed|not emit.*Continue|must not.*proceed" "$PIPELINE"
+  [ "$status" -eq 0 ]
+}
+
+# ──────────────────────────────────────────────────────────────────────────────
+# wip.md: Above-Appetite STOP directive
+# ──────────────────────────────────────────────────────────────────────────────
+
+@test "wip.md Above-Appetite section contains explicit STOP directive" {
+  # PAUSE is the wip-mode verdict equivalent of STOP.
+  run grep -qE "STOP|BLOCKED|PAUSE" "$WIP"
+  [ "$status" -eq 0 ]
+}
+
+@test "wip.md Above-Appetite section forbids Proceed nudges" {
+  run grep -qE "[Dd]o NOT emit.*Proceed|forbid.*Proceed|not emit.*Continue|must not.*proceed" "$WIP"
+  [ "$status" -eq 0 ]
+}
+
+# ──────────────────────────────────────────────────────────────────────────────
+# plan.md: FAIL directive (plan-mode equivalent of STOP)
+# ──────────────────────────────────────────────────────────────────────────────
+
+@test "plan.md FAIL section contains explicit STOP directive" {
+  # FAIL is the plan-mode verdict; reinforces STOP language.
+  run grep -qE "STOP|BLOCKED|FAIL" "$PLAN"
+  [ "$status" -eq 0 ]
+}
+
+@test "plan.md FAIL section forbids Proceed nudges" {
+  run grep -qE "[Dd]o NOT emit.*Proceed|forbid.*Proceed|not emit.*Continue|must not.*proceed" "$PLAN"
+  [ "$status" -eq 0 ]
+}
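The tests above lint the prompt text. A complementary check could lint a scorer report itself for nudge language when the score is above appetite. This is a rough sketch under stated assumptions: `verdict_matches_score` is a hypothetical name, the caller supplies the highest cumulative score because the `RISK_SCORES:` line format is not shown in this diff, and the phrase list is taken from the prompts in pipeline.md, plan.md and wip.md.

```shell
#!/bin/sh
# Hypothetical lint, not shipped with the package.
#   $1     highest cumulative score in the report (integer, supplied by caller)
#   stdin  the scorer report text
# Returns 1 when the score is above appetite (> 4) and the text contains one
# of the nudge phrases named in the prompts; returns 0 otherwise.
verdict_matches_score() {
  max_score=$1
  report=$(cat)

  # Within appetite: nudge language is not checked by this lint.
  if [ "$max_score" -le 4 ]; then
    return 0
  fi

  # Case-insensitive scan for the phrases the prompts forbid above appetite.
  if printf '%s\n' "$report" |
    grep -qiE 'proceed|continue|you may (ship|commit)|ok to (commit|push|release|edit|implement)'; then
    return 1
  fi
  return 0
}
```

The scan is deliberately crude: a sentence such as "do not proceed" would also trip it, so a real check would need a tighter pattern or an allow-list.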
package/agents/test/risk-scorer-monitoring-not-a-control.bats
ADDED
@@ -0,0 +1,76 @@
+#!/usr/bin/env bats
+# Doc-lint guard: risk-scorer agent prompts must explicitly state that
+# monitoring, alerting, and other post-release detection activities are
+# NOT controls and MUST NOT be credited against residual risk.
+#
+# Structural assertions — Permitted Exception to the source-grep ban (ADR-005 / P011).
+#
+# Background: P038 identified that scorer reports were crediting
+# "monitor for elevated errors", "be ready to rollback", and similar
+# post-release detection activities as controls that reduced residual
+# risk. These activities help detect failures after they occur — they
+# are incident response, not release-gate risk reduction. Crediting
+# them creates false confidence in risky releases.
+#
+# A genuine control exercises the failure scenario BEFORE the change
+# ships (tests, CI gates, feature flags, preview verification, architect
+# review). Monitoring shortens detection time; it does not prevent the
+# failure from reaching users.
+#
+# Cross-reference:
+# P038: docs/problems/038-risk-scorer-suggests-monitoring-as-control.open.md
+# ADR-013: docs/decisions/013-structured-user-interaction-for-governance-decisions.proposed.md
+# @jtbd JTBD-001 (enforce governance — control list must reflect actual prevention)
+# @jtbd JTBD-002 (ship with confidence — no false-confidence releases)
+# @jtbd JTBD-202 (pre-flight governance — scorer must distinguish prevention from detection)
+
+setup() {
+  AGENTS_DIR="$(cd "$(dirname "$BATS_TEST_FILENAME")/.." && pwd)"
+  PIPELINE="${AGENTS_DIR}/pipeline.md"
+  WIP="${AGENTS_DIR}/wip.md"
+  PLAN="${AGENTS_DIR}/plan.md"
+}
+
+# ──────────────────────────────────────────────────────────────────────────────
+# pipeline.md
+# ──────────────────────────────────────────────────────────────────────────────
+
+@test "pipeline.md states monitoring is not a control" {
+  run grep -qE "[Mm]onitoring is (not|NOT) a control|[Mm]onitoring.*MUST NOT.*credit" "$PIPELINE"
+  [ "$status" -eq 0 ]
+}
+
+@test "pipeline.md forbids crediting post-release detection as risk reduction" {
+  # Post-release detection activities (monitoring, alerting, rollback readiness)
+  # must not reduce residual risk.
+  run grep -qE "post-release.*(not|NOT) (reduce|control|credit)|detection.*(not|NOT) (reduce|prevention)" "$PIPELINE"
+  [ "$status" -eq 0 ]
+}
+
+# ──────────────────────────────────────────────────────────────────────────────
+# wip.md
+# ──────────────────────────────────────────────────────────────────────────────
+
+@test "wip.md states monitoring is not a control" {
+  run grep -qE "[Mm]onitoring is (not|NOT) a control|[Mm]onitoring.*MUST NOT.*credit" "$WIP"
+  [ "$status" -eq 0 ]
+}
+
+@test "wip.md forbids crediting post-release detection as risk reduction" {
+  run grep -qE "post-release.*(not|NOT) (reduce|control|credit)|detection.*(not|NOT) (reduce|prevention)" "$WIP"
+  [ "$status" -eq 0 ]
+}
+
+# ──────────────────────────────────────────────────────────────────────────────
+# plan.md
+# ──────────────────────────────────────────────────────────────────────────────
+
+@test "plan.md states monitoring is not a control" {
+  run grep -qE "[Mm]onitoring is (not|NOT) a control|[Mm]onitoring.*MUST NOT.*credit" "$PLAN"
+  [ "$status" -eq 0 ]
+}
+
+@test "plan.md forbids crediting post-release detection as risk reduction" {
+  run grep -qE "post-release.*(not|NOT) (reduce|control|credit)|detection.*(not|NOT) (reduce|prevention)" "$PLAN"
+  [ "$status" -eq 0 ]
+}
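Because `grep` matches one line at a time, these assertions pass only when the whole phrase sits on a single line of the prompt. That would be consistent with the long, unwrapped "Monitoring is not a control" lines in the three prompt files, though the diff does not say so. A small demonstration follows: the pattern is copied from the test above, the sample sentences are made up for illustration, and `matches` is a hypothetical helper.

```shell
#!/bin/sh
# Pattern copied from the bats test; sample text below is illustrative only.
pattern='post-release.*(not|NOT) (reduce|control|credit)|detection.*(not|NOT) (reduce|prevention)'

# Phrase kept on one line: grep sees "post-release ... NOT reduce" together.
one_line='any post-release detection activity does NOT reduce pre-release risk'

# Same words wrapped after "does": no single line holds the whole phrase.
wrapped='any post-release detection activity does
NOT reduce pre-release risk'

matches() {
  # Succeeds when at least one line of $1 matches the pattern.
  printf '%s\n' "$1" | grep -qE "$pattern"
}

matches "$one_line" && echo "one line: match"
matches "$wrapped"  || echo "wrapped: no match"
```

Running the script prints `one line: match` followed by `wrapped: no match`, so rewrapping the prompt prose could break the guard without changing its meaning.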
package/agents/test/risk-scorer-reducing-bypass-criteria.bats
ADDED
@@ -0,0 +1,62 @@
+#!/usr/bin/env bats
+# Doc-lint guard: risk-scorer agent prompts must scope the
+# `RISK_BYPASS: reducing` label to commits that actually reduce risk.
+#
+# Structural assertions — Permitted Exception to the source-grep ban (ADR-005 / P011).
+#
+# Background: P043 analysed 329 risk reports across 6 projects and found
+# `RISK_BYPASS: reducing` applied to 97.9% of commits in this repo and
+# 79.6% across consumer projects. The scorer treated changeset metadata,
+# ADR checkbox ticks, docs-only edits, and genuinely risk-reducing fixes
+# all the same way. When nearly every commit is "reducing", the label
+# provides no discriminating signal.
+#
+# The tightened criteria require the commit to:
+# 1. Close a problem ticket, OR
+# 2. Explicitly remediate a previously-flagged risk, OR
+# 3. Remove a documented risk
+# Ordinary docs-only or test-only commits that don't meet one of these
+# conditions are risk-neutral — no bypass label.
+#
+# Cross-reference:
+# P043: docs/problems/043-risk-bypass-reducing-lost-discriminating-power.open.md
+# ADR-013: docs/decisions/013-structured-user-interaction-for-governance-decisions.proposed.md
+# @jtbd JTBD-001 (enforce governance — bypass must reflect real risk reduction)
+# @jtbd JTBD-202 (pre-flight governance — bypass label must be auditable)
+
+setup() {
+  AGENTS_DIR="$(cd "$(dirname "$BATS_TEST_FILENAME")/.." && pwd)"
+  PIPELINE="${AGENTS_DIR}/pipeline.md"
+}
+
+# NOTE: wip.md is intentionally excluded from these assertions — wip-mode emits
+# RISK_VERDICT: CONTINUE/PAUSE, not RISK_BYPASS labels. Bypass criteria apply
+# only to the pipeline (commit/push/release) scorer.
+
+# ──────────────────────────────────────────────────────────────────────────────
+# pipeline.md: tightened reducing criteria
+# ──────────────────────────────────────────────────────────────────────────────
+
+@test "pipeline.md reducing bypass requires closing a ticket" {
+  # Must reference ticket closure as a valid trigger for reducing bypass.
+  run grep -qE "[Cc]lose[sd]?.*ticket|[Cc]loses P[0-9]|problem.*close" "$PIPELINE"
+  [ "$status" -eq 0 ]
+}
+
+@test "pipeline.md reducing bypass requires remediating a flagged risk" {
+  run grep -qE "remediate.*risk|remediates.*risk|flagged risk" "$PIPELINE"
+  [ "$status" -eq 0 ]
+}
+
+@test "pipeline.md reducing bypass excludes docs-only neutral commits" {
+  # Ordinary docs/test commits without ticket closure must NOT earn the bypass.
+  run grep -qE "docs-only.*neutral|test-only.*neutral|ordinary.*neutral|neutral.*no bypass" "$PIPELINE"
+  [ "$status" -eq 0 ]
+}
+
+@test "pipeline.md requires audit reason for reducing bypass" {
+  # Audit trail: cite which ticket closed, which risk remediated, etc.
+  run grep -qE "RISK_BYPASS_REASON|cite.*ticket|reason.*bypass|bypass.*reason" "$PIPELINE"
+  [ "$status" -eq 0 ]
+}
+
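The background comment quotes the share of commits that received the `reducing` label. A share like that could be recomputed from stored reports with a few lines of shell. This is a sketch only: `reducing_share` is a hypothetical name, and the layout (one plain-text report per regular file in a directory) is an assumption, since the diff does not show how reports are stored.

```shell
#!/bin/sh
# Hypothetical helper: prints the percentage of report files in directory $1
# that contain a line starting with `RISK_BYPASS: reducing`.
# Assumption: each regular file in the directory is one scorer report.
reducing_share() {
  dir=$1
  total=0
  reducing=0

  for f in "$dir"/*; do
    [ -f "$f" ] || continue            # skip subdirectories and empty globs
    total=$((total + 1))
    if grep -qE '^RISK_BYPASS: reducing' "$f"; then
      reducing=$((reducing + 1))
    fi
  done

  if [ "$total" -eq 0 ]; then
    echo "0.0"
    return 0
  fi

  # awk does the one-decimal formatting; sh arithmetic is integer-only.
  awk -v r="$reducing" -v t="$total" 'BEGIN { printf "%.1f\n", 100 * r / t }'
}
```

Tracking this number after the criteria change would show whether the label has regained any discriminating power.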
package/agents/test/risk-scorer-user-stated-preconditions.bats
ADDED
@@ -0,0 +1,89 @@
+#!/usr/bin/env bats
+# Doc-lint guard: risk-scorer agent prompts must define a User-Stated
+# Preconditions Check as a sub-rule of Control Discovery.
+#
+# Structural assertions — Permitted Exception to the source-grep ban (ADR-005 / P011).
+# These tests assert that the pipeline, wip, and plan scorer prompts
+# instruct the scorer to detect user-stated conditional-delivery warnings
+# and surface unmet preconditions as Risk items.
+#
+# Background: P041 identified that the risk scorer evaluated technical
+# risk of a diff in isolation and missed explicit user-stated warnings
+# that a change was conditional on a paired capability. Downstream this
+# caused a breaking change to ship to production despite a twice-stated
+# user warning. This guard prevents regression of the fix: every scoring
+# agent must have a User-Stated Preconditions Check.
+#
+# Cross-reference:
+# P041: docs/problems/041-risk-scorer-misses-user-stated-dependencies.known-error.md
+# ADR-013: structured user interaction for governance decisions
+# @jtbd JTBD-002 (ship with confidence — user-stated preconditions are honoured)
+# @jtbd JTBD-202 (pre-flight governance checks surface explicit warnings)
+
+setup() {
+  AGENTS_DIR="$(cd "$(dirname "$BATS_TEST_FILENAME")/.." && pwd)"
+  PIPELINE="${AGENTS_DIR}/pipeline.md"
+  WIP="${AGENTS_DIR}/wip.md"
+  PLAN="${AGENTS_DIR}/plan.md"
+}
+
+# ──────────────────────────────────────────────────────────────────────────────
+# pipeline.md: user-stated precondition check
+# ──────────────────────────────────────────────────────────────────────────────
+
+@test "pipeline.md defines User-Stated Preconditions Check section" {
+  run grep -q "User-Stated Preconditions" "$PIPELINE"
+  [ "$status" -eq 0 ]
+}
+
+@test "pipeline.md precondition check surfaces unmet preconditions as Risk items" {
+  # Unmet preconditions must flow through the existing Risk item structure,
+  # which feeds RISK_REMEDIATIONS above appetite (>= 5).
+  run grep -qE "precondition.*Risk item|Risk item.*precondition" "$PIPELINE"
+  [ "$status" -eq 0 ]
+}
+
+@test "pipeline.md precondition check credits zero reduction when paired capability is unmet" {
+  # Aligns with existing Control Discovery rule: if a control cannot be named,
+  # or a stated precondition is unmet, the control provides 0 reduction.
+  run grep -qE "zero reduction|0 reduction" "$PIPELINE"
+  [ "$status" -eq 0 ]
+}
+
+# ──────────────────────────────────────────────────────────────────────────────
+# wip.md: user-stated precondition check
+# ──────────────────────────────────────────────────────────────────────────────
+
+@test "wip.md defines User-Stated Preconditions Check section" {
+  run grep -q "User-Stated Preconditions" "$WIP"
+  [ "$status" -eq 0 ]
+}
+
+@test "wip.md precondition check surfaces unmet preconditions as Risk items" {
+  run grep -qE "precondition.*Risk item|Risk item.*precondition" "$WIP"
+  [ "$status" -eq 0 ]
+}
+
+@test "wip.md precondition check credits zero reduction when paired capability is unmet" {
+  run grep -qE "zero reduction|0 reduction" "$WIP"
+  [ "$status" -eq 0 ]
+}
+
+# ──────────────────────────────────────────────────────────────────────────────
+# plan.md: user-stated precondition check
+# ──────────────────────────────────────────────────────────────────────────────
+
+@test "plan.md defines User-Stated Preconditions Check section" {
+  run grep -q "User-Stated Preconditions" "$PLAN"
+  [ "$status" -eq 0 ]
+}
+
+@test "plan.md precondition check surfaces unmet preconditions as Risk items" {
+  run grep -qE "precondition.*Risk item|Risk item.*precondition" "$PLAN"
+  [ "$status" -eq 0 ]
+}
+
+@test "plan.md precondition check credits zero reduction when paired capability is unmet" {
+  run grep -qE "zero reduction|0 reduction" "$PLAN"
+  [ "$status" -eq 0 ]
+}
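The assertions above search each prompt file as a whole. For example, the `zero reduction|0 reduction` pattern can also be satisfied by the existing "it provides 0 reduction" sentence under Control Discovery. If a test needs to be scoped to one section, the section body can be extracted first. The helper below is a sketch, not part of the package: `section_body` is a hypothetical name, and it assumes ATX headings (`## Title`) with no `#`-prefixed lines inside code fences.

```shell
#!/bin/sh
# Hypothetical helper: prints the body of the markdown section in file $1
# whose heading text equals $2, stopping at the next heading of any level.
section_body() {
  awk -v title="$2" '
    /^#+ / {
      heading = $0
      sub(/^#+ +/, "", heading)       # strip the leading "#"s and spaces
      in_section = (heading == title) # 1 only while inside the named section
      next
    }
    in_section { print }
  ' "$1"
}
```

Inside a bats test this could be used as `section_body "$PIPELINE" "User-Stated Preconditions Check" | grep -qE "zero reduction|0 reduction"`, which would pass only if the phrase appears under that heading.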
package/agents/wip.md
CHANGED
@@ -51,7 +51,15 @@ If cumulative risk is **within appetite** (< 5): provide the assessment table an
 
 ### Above-Appetite Remediations
 
-If cumulative risk **exceeds appetite** (>= 5)
+If cumulative risk **exceeds appetite** (>= 5), the verbal verdict is **PAUSE**
+(the wip-mode equivalent of STOP).
+
+**Do NOT emit** "Proceed", "Continue", "OK to edit", "You may commit", or any
+similar nudge language when cumulative risk exceeds appetite. The only
+sanctioned above-appetite output is the WIP Risk Assessment table and the
+structured `RISK_REMEDIATIONS:` block defined below.
+
+Provide the assessment table, then emit a structured `RISK_REMEDIATIONS:` block with specific risk-reducing actions:
 
 Format (5 columns — machine-readable for structured AskUserQuestion prompts in calling skills):
 ```
@@ -100,6 +108,25 @@ RISK_COMMIT_REASON: <one-line description of the completed governance work detec
 
 For each control claimed to reduce risk, name the specific test file/scenario. If you cannot name it, it provides 0 reduction.
 
+**Monitoring is not a control.** Monitoring, alerting, dashboards, and any other post-release detection activity MUST NOT be credited or reduce residual risk. Post-release detection does NOT reduce pre-release risk — it only shortens the time to notice a failure after it has already reached users.
+A genuine control exercises the failure scenario before the change ships: a
+test, a CI gate, a feature flag, a preview verification. Monitoring MUST NOT
+appear in a Controls list and MUST NOT reduce any inherent risk score.
+
+## User-Stated Preconditions Check
+
+Before crediting any control, check for **user-stated preconditions** — conditions
+the user has named in the current conversation, commit messages, changesets, or
+problem tickets that tie this change to a paired capability (e.g., "A is only safe
+if B ships alongside").
+
+If a paired capability is unmet, credit zero reduction from controls that do not
+address the precondition, and surface the unmet precondition as a **Risk item**
+with inherent risk >= Medium (>= 5). This routes it into the above-appetite
+`RISK_REMEDIATIONS:` flow and forces a PAUSE verdict until the precondition is
+met or the change is revised. User warnings outrank the diff's technical
+assessment.
+
 ## Constraints
 
 - You are a scorer, not an editor. Do NOT write files — a PostToolUse hook handles that.
package/package.json
CHANGED