@windyroad/itil 0.2.0-preview.62 → 0.3.0-preview.68
package/README.md
CHANGED
|
@@ -6,16 +6,25 @@ Part of [Windy Road Agent Plugins](../../README.md).
|
|
|
6
6
|
|
|
7
7
|
## What It Does
|
|
8
8
|
|
|
9
|
-
Bugs recur. Incidents repeat. Without a
|
|
9
|
+
Bugs recur. Incidents repeat. Without a disciplined process, you fix symptoms instead of causes — or worse, jump to conclusions during a live outage. This plugin brings lightweight ITIL service management to your AI coding workflow:
|
|
10
|
+
|
|
11
|
+
**Problem management** — track underlying causes and prioritise fixes:
|
|
10
12
|
|
|
11
13
|
- **Create problem tickets** when incidents or failures surface during a session
|
|
12
14
|
- **Track root cause analysis** as investigation progresses
|
|
13
15
|
- **Transition status** through a structured lifecycle: Open, Known Error, Closed
|
|
14
16
|
- **Prioritise** using Weighted Shortest Job First (WSJF) to focus on the highest-value fixes
|
|
15
17
|
|
|
16
|
-
|
|
18
|
+
**Incident management** — restore service fast with an audit trail:
|
|
19
|
+
|
|
20
|
+
- **Declare incidents** when production is actively broken
|
|
21
|
+
- **Evidence-first discipline** — hypotheses must cite evidence before any mitigation
|
|
22
|
+
- **Reversible mitigations first** — rollback, feature flag, restart, route away
|
|
23
|
+
- **Automatic handoff** to problem management once service is restored
|
|
17
24
|
|
|
18
|
-
|
|
25
|
+
Tickets live in `docs/problems/` and `docs/incidents/` as markdown files — version-controlled and always accessible.
|
|
26
|
+
|
|
27
|
+
Room is reserved for peer ITIL skills (change, continual improvement) under the same plugin as they are added.
|
|
19
28
|
|
|
20
29
|
## Install
|
|
21
30
|
|
|
@@ -31,18 +40,23 @@ Restart Claude Code after installing.
|
|
|
31
40
|
|
|
32
41
|
## Usage
|
|
33
42
|
|
|
34
|
-
**
|
|
43
|
+
**Manage a problem ticket:**
|
|
35
44
|
|
|
36
45
|
```
|
|
37
46
|
/wr-itil:manage-problem
|
|
38
47
|
```
|
|
39
48
|
|
|
40
|
-
|
|
49
|
+
Supports creating new problems, updating root cause analysis, transitioning status (Open → Known Error → Closed), and closing problems with resolution details.
|
|
50
|
+
|
|
51
|
+
**Manage an incident:**
|
|
52
|
+
|
|
53
|
+
```
|
|
54
|
+
/wr-itil:manage-incident
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
Supports declaring new incidents, recording evidence-first observations and hypotheses, logging mitigation attempts, transitioning lifecycle (Investigating → Mitigating → Restored → Closed), and automatically handing off to `manage-problem` when service is restored.
|
|
41
58
|
|
|
42
|
-
-
|
|
43
|
-
- Updating root cause analysis with investigation findings
|
|
44
|
-
- Transitioning status (Open -> Known Error -> Closed)
|
|
45
|
-
- Closing problems with resolution details
|
|
59
|
+
See [ADR-011](../../docs/decisions/011-manage-incident-skill.proposed.md) for the incident-vs-problem split and [JTBD-201](../../docs/jtbd/tech-lead/JTBD-201-restore-service-fast.proposed.md) for the job this serves.
|
|
46
60
|
|
|
47
61
|
## How It Works
|
|
48
62
|
|
package/package.json
CHANGED
|
@@ -0,0 +1,265 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: wr-itil:manage-incident
|
|
3
|
+
description: Declare, triage, mitigate, and close an incident using an evidence-first workflow. Restores service first, then hands off to manage-problem for root-cause work.
|
|
4
|
+
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, AskUserQuestion, Skill
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
# Incident Management Skill
|
|
8
|
+
|
|
9
|
+
Declare, triage, mitigate, and close an incident using an evidence-first, cool-headed workflow. This skill's primary goal is **restoring service**. Once service is restored, the skill hands off to `wr-itil:manage-problem` so the underlying cause is tracked.
|
|
10
|
+
|
|
11
|
+
Incidents are time-bound events. Problems are persistent root causes. One problem can cause many incidents; one incident may (or may not) link to a problem.
|
|
12
|
+
|
|
13
|
+
## Operations
|
|
14
|
+
|
|
15
|
+
- **Declare**: `incident <title or symptoms>` — creates a new investigating incident
|
|
16
|
+
- **Update**: `incident <NNN> <details>` — append observations, evidence, or actions
|
|
17
|
+
- **Mitigate**: `incident <NNN> mitigate <action>` — record a mitigation attempt and outcome
|
|
18
|
+
- **Restore**: `incident <NNN> restored` — transition to `.restored.md` and trigger problem handoff
|
|
19
|
+
- **Close**: `incident <NNN> close` — only allowed when the linked problem is Known Error or Closed (or an explicit "no problem required" justification is recorded)
|
|
20
|
+
- **List**: `incident list` — active incidents, severity-sorted
|
|
21
|
+
- **Link**: `incident <NNN> link P<MMM>` — link an incident to an existing problem
|
|
22
|
+
|
|
23
|
+
## Lifecycle
|
|
24
|
+
|
|
25
|
+
| Status | File suffix | Meaning | Entry criteria |
|
|
26
|
+
|--------|-------------|---------|----------------|
|
|
27
|
+
| **Investigating** | `.investigating.md` | Symptoms reported, scope being established | Incident declared |
|
|
28
|
+
| **Mitigating** | `.mitigating.md` | Mitigation(s) in flight | At least one ranked hypothesis with cited evidence |
|
|
29
|
+
| **Restored** | `.restored.md` | Service verified restored | Mitigation applied + verification signal recorded |
|
|
30
|
+
| **Closed** | `.closed.md` | Incident complete | Linked problem is Known Error or Closed (or "no problem required" justification documented) |
|
|
31
|
+
|
|
32
|
+
## Evidence-First Workflow (The Cool-Headed Commitment)
|
|
33
|
+
|
|
34
|
+
During an incident, the instinct to jump to conclusions is strong. This skill forces evidence-first discipline via a required template. **Do not act on a hypothesis without at least one cited evidence source.**
|
|
35
|
+
|
|
36
|
+
### Required sections in every incident file
|
|
37
|
+
|
|
38
|
+
```markdown
|
|
39
|
+
## Observations
|
|
40
|
+
- [timestamp] <what was seen, from where — e.g. "14:02 UTC, 500s on /api/orders in Datadog dashboard foo">
|
|
41
|
+
|
|
42
|
+
## Hypotheses
|
|
43
|
+
- [ranked] <hypothesis> — Evidence: <log/repro/diff/metric reference>. Confidence: <low|med|high>.
|
|
44
|
+
|
|
45
|
+
## Mitigation attempts
|
|
46
|
+
- [timestamp] <action> → <outcome / verification signal>
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
### Mitigation preference
|
|
50
|
+
|
|
51
|
+
Prefer **reversible** mitigations over forward fixes:
|
|
52
|
+
|
|
53
|
+
1. Rollback to a known-good version
|
|
54
|
+
2. Feature flag off
|
|
55
|
+
3. Restart / cycle the affected component
|
|
56
|
+
4. Route traffic away
|
|
57
|
+
5. Scale up
|
|
58
|
+
6. Only after reversibles are exhausted: forward fix
|
|
59
|
+
|
|
60
|
+
Record every attempt, successful or not.
|
|
61
|
+
|
|
62
|
+
## Severity, not WSJF
|
|
63
|
+
|
|
64
|
+
Incidents are severity-driven and time-boxed. **WSJF does not apply to incidents** — the "effort" divisor is meaningless during a live event. WSJF applies to the resulting problem created via handoff.
|
|
65
|
+
|
|
66
|
+
Severity uses the Impact × Likelihood matrix from `RISK-POLICY.md`, interpreted as "right now, what's the live business impact?" — not "in general, how bad could this be?".
|
|
67
|
+
|
|
68
|
+
## Steps
|
|
69
|
+
|
|
70
|
+
### 1. Parse the request
|
|
71
|
+
|
|
72
|
+
Determine the operation from `$ARGUMENTS`:
|
|
73
|
+
|
|
74
|
+
- If arguments start with "list" → show active incidents summary
|
|
75
|
+
- If arguments start with `I<NNN>` or a bare number → this is an update, mitigate, restore, close, or link
|
|
76
|
+
- Otherwise → declare a new incident
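The three dispatch rules above can be sketched in shell. This is only an illustration: `classify_request` is a hypothetical helper name, and it assumes the whole argument string arrives as a single parameter.

```bash
# Hypothetical helper: classify the raw argument string per the three rules above.
classify_request() {
  local first
  first=${1%% *}                       # first space-delimited word
  case "$first" in
    list) echo "list"; return ;;
  esac
  case "${first#I}" in                 # drop an optional leading "I"
    ''|*[!0-9]*) echo "declare" ;;     # not I<NNN> or a bare number
    *) echo "existing" ;;              # update, mitigate, restore, close, or link
  esac
}

classify_request "list"
classify_request "I003 mitigate rollback"
classify_request "checkout returns 500s"
```

Note that a symptom description whose first word is a bare number (for example "500 errors on checkout") would be classified as an existing incident under these rules, so a real implementation may want to confirm that the ID exists before treating it as an update.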
|
|
77
|
+
|
|
78
|
+
### 2. For new incidents: Check for duplicates FIRST
|
|
79
|
+
|
|
80
|
+
Before creating, search `docs/incidents/` for active (non-closed) incidents with overlapping symptoms or scope. The user may already have an incident open for this outage.
|
|
81
|
+
|
|
82
|
+
1. Extract keywords from the description (e.g., "500 errors", "checkout", "login")
|
|
83
|
+
2. `grep -l` the keywords across `docs/incidents/*.{investigating,mitigating,restored}.md`
|
|
84
|
+
3. If matches are found, present them via `AskUserQuestion`:
|
|
85
|
+
- "I found active incidents that may be related: I003 (checkout 500s, mitigating), I007 (login slowness, investigating). Would you like to (a) update an existing incident, (b) declare a new incident anyway, or (c) cancel?"
|
|
86
|
+
4. If the user chooses to update, switch to the update flow for that incident ID
|
|
87
|
+
5. If no matches, proceed to create
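A minimal sketch of the keyword search in step 2, assuming keywords have already been extracted and that a plain case-insensitive substring match is good enough. `find_related_incidents` is a hypothetical name, not something the skill defines.

```bash
# Hypothetical helper: print active (non-closed) incident files that mention any keyword.
find_related_incidents() {
  local dir="$1"; shift
  local kw f
  for f in "$dir"/*.investigating.md "$dir"/*.mitigating.md "$dir"/*.restored.md; do
    [ -f "$f" ] || continue            # skip glob patterns that matched nothing
    for kw in "$@"; do
      if grep -qiF -- "$kw" "$f"; then
        echo "$f"
        break                          # report each file at most once
      fi
    done
  done
}

# Usage: find_related_incidents docs/incidents "500 errors" "checkout"
```

Closed incidents are never reported because only the three active suffixes are globbed. An empty result corresponds to step 5 (proceed to create).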
|
|
88
|
+
|
|
89
|
+
### 3. For new incidents: Assign the next ID
|
|
90
|
+
|
|
91
|
+
Create `docs/incidents/` if it does not exist. Then scan for the highest existing `I<NNN>` and increment:
|
|
92
|
+
|
|
93
|
+
```bash
|
|
94
|
+
mkdir -p docs/incidents
|
|
95
|
+
last=$(ls docs/incidents/I*.md 2>/dev/null | sed 's/.*\///' | grep -oE '^I[0-9]+' | sed 's/^I//' | sort -n | tail -1)
|
|
96
|
+
next=$(printf 'I%03d' $((10#${last:-0} + 1)))
|
|
97
|
+
echo "$next"
|
|
98
|
+
```
|
|
99
|
+
|
|
100
|
+
### 4. For new incidents: Gather information
|
|
101
|
+
|
|
102
|
+
Use `AskUserQuestion` for anything not in `$ARGUMENTS`:
|
|
103
|
+
|
|
104
|
+
- **Title**: short kebab-case-friendly description
|
|
105
|
+
- **Symptoms**: what is observable (errors, latency, missing data)?
|
|
106
|
+
- **Scope**: who/what is affected (users, endpoints, regions)?
|
|
107
|
+
- **Start time**: when did symptoms begin? (UTC, as precise as known)
|
|
108
|
+
- **Severity**: Impact (1-5) × Likelihood (1-5) per `RISK-POLICY.md`, interpreted as live impact
|
|
109
|
+
|
|
110
|
+
Do not ask for fields that can be inferred:
|
|
111
|
+
|
|
112
|
+
- **Reported**: today's date (UTC)
|
|
113
|
+
- **Status**: always "Investigating" for new incidents
|
|
114
|
+
|
|
115
|
+
### 5. For new incidents: Write the incident file
|
|
116
|
+
|
|
117
|
+
**File path**: `docs/incidents/<I###>-<kebab-case-title>.investigating.md`
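One way to derive the `<kebab-case-title>` part of the path is sketched below. `slugify` is a hypothetical helper, and the choice to collapse every run of non-alphanumeric characters into a single hyphen is an assumption; the skill itself does not prescribe the rule.

```bash
# Hypothetical helper: lower-case the title and collapse non-alphanumerics into hyphens.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

id="I001"
title="Login 500s on /api/orders"
echo "docs/incidents/${id}-$(slugify "$title").investigating.md"
```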
|
|
118
|
+
|
|
119
|
+
**Template**:
|
|
120
|
+
|
|
121
|
+
```markdown
|
|
122
|
+
# Incident <I###>: <Title>
|
|
123
|
+
|
|
124
|
+
**Status**: Investigating
|
|
125
|
+
**Reported**: <YYYY-MM-DD HH:MM UTC>
|
|
126
|
+
**Severity**: <score> (<label>) — Impact: <label> (<n>) × Likelihood: <label> (<n>)
|
|
127
|
+
**Scope**: <who/what is affected>
|
|
128
|
+
|
|
129
|
+
## Timeline
|
|
130
|
+
|
|
131
|
+
- [<start-time> UTC] Symptoms began
|
|
132
|
+
- [<reported-time> UTC] Incident declared
|
|
133
|
+
|
|
134
|
+
## Observations
|
|
135
|
+
|
|
136
|
+
- [<timestamp> UTC] <what was seen, from where>
|
|
137
|
+
|
|
138
|
+
## Hypotheses
|
|
139
|
+
|
|
140
|
+
- [ranked] <hypothesis> — Evidence: <log/repro/diff/metric reference>. Confidence: <low|med|high>.
|
|
141
|
+
|
|
142
|
+
## Mitigation attempts
|
|
143
|
+
|
|
144
|
+
*(none yet)*
|
|
145
|
+
|
|
146
|
+
## Linked Problem
|
|
147
|
+
|
|
148
|
+
*(none yet — added on restore transition)*
|
|
149
|
+
```
|
|
150
|
+
|
|
151
|
+
### 6. For updates: Edit the existing file
|
|
152
|
+
|
|
153
|
+
Find the file by ID:
|
|
154
|
+
|
|
155
|
+
```bash
|
|
156
|
+
ls docs/incidents/<I###>-*.md 2>/dev/null
|
|
157
|
+
```
|
|
158
|
+
|
|
159
|
+
Append new observations, hypotheses, or timeline entries. **Every hypothesis must cite evidence.** If the user proposes a hypothesis without evidence, ask via `AskUserQuestion` what evidence supports it before writing.
|
|
160
|
+
|
|
161
|
+
### 7. For mitigate: Record and transition to mitigating
|
|
162
|
+
|
|
163
|
+
When the first mitigation attempt is made:
|
|
164
|
+
|
|
165
|
+
1. `git mv docs/incidents/<I###>-<title>.investigating.md docs/incidents/<I###>-<title>.mitigating.md`
|
|
166
|
+
2. Update the **Status** field to "Mitigating"
|
|
167
|
+
3. Append to **Mitigation attempts**: `[<timestamp> UTC] <action> → <outcome>` (outcome may be "pending verification" initially; update once the verification signal is known)
|
|
168
|
+
|
|
169
|
+
Pre-flight check before first mitigation: the file must contain at least one hypothesis with cited evidence. If not, block the transition and ask the user what evidence supports the chosen action.
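The pre-flight check could be approximated as below. This is a sketch under assumptions: `has_cited_hypothesis` is a hypothetical name, and "cited evidence" is taken to mean a bullet in the Hypotheses section whose `Evidence:` text is no longer the template's `<...>` placeholder.

```bash
# Hypothetical check: succeed only if the Hypotheses section holds a bullet
# with a filled-in "Evidence:" entry (anything not starting with "<").
has_cited_hypothesis() {
  awk '
    /^## / { in_h = ($0 == "## Hypotheses"); next }   # track the current section
    in_h && /^- / && /Evidence: *[^ <]/ { found = 1 }
    END { exit found ? 0 : 1 }
  ' "$1"
}

# Usage:
#   has_cited_hypothesis "$incident_file" || echo "Blocked: no hypothesis cites evidence yet"
```

A text match like this cannot judge whether the evidence is any good; it only catches a missing or untouched entry, so asking the user remains the real gate.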
|
|
170
|
+
|
|
171
|
+
### 8. For restore: Transition and hand off to manage-problem
|
|
172
|
+
|
|
173
|
+
Pre-flight checks before restore:
|
|
174
|
+
|
|
175
|
+
- [ ] At least one mitigation attempt is recorded with outcome
|
|
176
|
+
- [ ] A verification signal is captured (e.g., "error rate back to baseline per Datadog", "user reports normal", "synthetic probe passing")
|
|
177
|
+
|
|
178
|
+
If checks pass:
|
|
179
|
+
|
|
180
|
+
1. `git mv docs/incidents/<I###>-<title>.mitigating.md docs/incidents/<I###>-<title>.restored.md`
|
|
181
|
+
2. Update the **Status** field to "Restored"
|
|
182
|
+
3. Append to **Timeline**: `[<timestamp> UTC] Service restored — <verification signal>`
|
|
183
|
+
|
|
184
|
+
Then perform the **handoff to problem management**:
|
|
185
|
+
|
|
186
|
+
1. Ask via `AskUserQuestion`: "Service restored. Should I create or update a problem record for the root cause? (a) yes — recommended, (b) no — document why (trivial/one-off)"
|
|
187
|
+
2. If yes, construct a handoff payload:
|
|
188
|
+
- Incident ID and title
|
|
189
|
+
- Timeline summary
|
|
190
|
+
- Top-ranked hypothesis + cited evidence
|
|
191
|
+
- Mitigation applied + verification signal
|
|
192
|
+
3. Invoke `wr-itil:manage-problem` via the `Skill` tool with the payload as arguments. The problem skill's existing dedupe flow handles new-vs-update.
|
|
193
|
+
4. Capture the returned `P<NNN>` and write a **Linked Problem** section into the incident file:
|
|
194
|
+
```markdown
|
|
195
|
+
## Linked Problem
|
|
196
|
+
P<NNN> (<title>) — <status>
|
|
197
|
+
```
|
|
198
|
+
5. If the user chose "no", write a **No Problem** section with the justification and skip the handoff:
|
|
199
|
+
```markdown
|
|
200
|
+
## No Problem
|
|
201
|
+
<reason — e.g. "one-off cosmic-bit-flip; not reproducible">
|
|
202
|
+
```
|
|
203
|
+
|
|
204
|
+
### 9. For close: Gate on linked problem status
|
|
205
|
+
|
|
206
|
+
The close operation checks the linked problem's file suffix:
|
|
207
|
+
|
|
208
|
+
```bash
|
|
209
|
+
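# Assumption: $incident_file holds the incident's path; take the first P<NNN> under "## Linked Problem".
linked_id=$(sed -n '/^## Linked Problem/,$p' "$incident_file" | grep -oE 'P[0-9]+' | head -1)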
|
|
210
|
+
linked_file=$(ls docs/problems/${linked_id}-*.md 2>/dev/null | head -1)
|
|
211
|
+
```
|
|
212
|
+
|
|
213
|
+
- If `linked_file` ends with `.known-error.md` or `.closed.md` → close is allowed
|
|
214
|
+
- If `linked_file` ends with `.open.md` → close is blocked; report "Linked problem ${linked_id} is still Open. Transition it to Known Error first, or update the Linked Problem reference."
|
|
215
|
+
- If no linked problem and the file has a **No Problem** section → close is allowed
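Put together, the close gate might look like the following sketch. `close_allowed` is a hypothetical name, and extracting the first `P<NNN>` that appears after the `## Linked Problem` heading is an assumption about how the ID is read back out of the file.

```bash
# Hypothetical gate: return 0 if the incident may be closed, non-zero otherwise.
close_allowed() {
  local incident_file="$1" problems_dir="$2" linked_id linked_file
  linked_id=$(sed -n '/^## Linked Problem/,$p' "$incident_file" | grep -oE 'P[0-9]+' | head -1)
  if [ -z "$linked_id" ]; then
    # No linked problem: allowed only when a No Problem justification exists.
    grep -q '^## No Problem' "$incident_file"
    return $?
  fi
  linked_file=$(ls "$problems_dir/${linked_id}"-*.md 2>/dev/null | head -1)
  case "$linked_file" in
    *.known-error.md|*.closed.md) return 0 ;;
    *)
      echo "Linked problem ${linked_id} is not Known Error or Closed; close is blocked." >&2
      return 1
      ;;
  esac
}

# Usage: close_allowed docs/incidents/I001-login-500s.restored.md docs/problems
```

In this sketch a linked problem whose file cannot be found is treated the same as an open one, which errs on the side of blocking the close.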
|
|
216
|
+
|
|
217
|
+
On close:
|
|
218
|
+
|
|
219
|
+
1. `git mv docs/incidents/<I###>-<title>.restored.md docs/incidents/<I###>-<title>.closed.md`
|
|
220
|
+
2. Update the **Status** field to "Closed"
|
|
221
|
+
3. Append to **Timeline**: `[<timestamp> UTC] Incident closed`
|
|
222
|
+
|
|
223
|
+
### 10. For list: Show active incidents
|
|
224
|
+
|
|
225
|
+
Read all `.investigating.md`, `.mitigating.md`, and `.restored.md` files in `docs/incidents/`. Extract ID, title, severity, and status. Sort by severity (highest first). Display as a markdown table.
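As a rough sketch, the listing could be produced with standard tools. `list_active_incidents` is a hypothetical name, and the code assumes the header fields follow the template from step 5 and that a higher numeric severity score means a more severe incident.

```bash
# Hypothetical helper: print a markdown table of active incidents, highest score first.
list_active_incidents() {
  local dir="$1" f sev title status
  for f in "$dir"/*.investigating.md "$dir"/*.mitigating.md "$dir"/*.restored.md; do
    [ -f "$f" ] || continue
    sev=$(sed -n 's/^\*\*Severity\*\*: *\([0-9][0-9]*\).*/\1/p' "$f" | head -1)
    title=$(sed -n 's/^# Incident //p' "$f" | head -1)      # "I<NNN>: <Title>"
    status=$(sed -n 's/^\*\*Status\*\*: *//p' "$f" | head -1)
    printf '%s\t%s\t%s\n' "${sev:-0}" "$title" "$status"
  done | sort -t "$(printf '\t')" -k1,1nr |
    awk -F '\t' 'BEGIN { print "| Incident | Severity | Status |"; print "|---|---|---|" }
                 { printf "| %s | %s | %s |\n", $2, $1, $3 }'
}

# Usage: list_active_incidents docs/incidents
```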
|
|
226
|
+
|
|
227
|
+
### 11. For link: Attach a problem
|
|
228
|
+
|
|
229
|
+
When the user runs `incident <I###> link P<MMM>`:
|
|
230
|
+
|
|
231
|
+
1. Verify `docs/problems/P<MMM>-*.md` exists
|
|
232
|
+
2. Update the **Linked Problem** section, or add it if it is missing, with `P<MMM> (<title>) — <status>`
|
|
233
|
+
3. Report the link
|
|
234
|
+
|
|
235
|
+
### 12. Edge cases
|
|
236
|
+
|
|
237
|
+
- **No problem required** — record a **No Problem** section with justification; close immediately.
|
|
238
|
+
- **Multiple incidents → one problem** — each incident links to the same `P<NNN>`; the problem file accumulates "Reported by incident" entries via `manage-problem`'s update flow.
|
|
239
|
+
- **Problem re-opens after the incident closed** — the closed incident stays closed; a new incident is declared for the new occurrence, linked to the re-opened problem.
|
|
240
|
+
- **Low-severity / solo-developer lightweight path** — for Sev 4-5 incidents, the skill may skip the Hypotheses section if the user confirms no investigation is needed. Timeline, Observations, and at least one mitigation attempt remain mandatory.
|
|
241
|
+
|
|
242
|
+
### 13. Quality checks
|
|
243
|
+
|
|
244
|
+
After any operation, verify:
|
|
245
|
+
|
|
246
|
+
- **ID uniqueness**: no duplicate `I<NNN>` in `docs/incidents/`
|
|
247
|
+
- **Naming convention**: `I<NNN>-<kebab-case-title>.<status>.md`
|
|
248
|
+
- **Status consistency**: Status field matches filename suffix
|
|
249
|
+
- **Required sections**: Timeline, Observations, Hypotheses (or documented skip), Mitigation attempts
|
|
250
|
+
- **Evidence discipline**: every Hypothesis has a cited evidence reference
|
|
251
|
+
- **Linked Problem** section present and consistent (or **No Problem** with justification) once the incident reaches Restored
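The status-consistency check lends itself to a small script. The sketch below is illustrative only: `status_matches_suffix` is a hypothetical name, and it relies on incident statuses being single words, so the lower-cased Status field can be compared directly with the filename suffix.

```bash
# Hypothetical check: succeed if the Status field agrees with the filename suffix.
status_matches_suffix() {
  local f="$1" suffix field
  suffix=${f%.md}                      # strip the extension
  suffix=${suffix##*.}                 # keep the lifecycle suffix, e.g. "mitigating"
  field=$(sed -n 's/^\*\*Status\*\*: *//p' "$f" | head -1 | tr '[:upper:]' '[:lower:]')
  [ "$suffix" = "$field" ]
}

# Usage:
#   for f in docs/incidents/I*.md; do
#     status_matches_suffix "$f" || echo "Status mismatch: $f"
#   done
```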
|
|
252
|
+
|
|
253
|
+
### 14. Report
|
|
254
|
+
|
|
255
|
+
After any operation, report:
|
|
256
|
+
|
|
257
|
+
- The file path created/modified
|
|
258
|
+
- The incident ID and title
|
|
259
|
+
- The current status
|
|
260
|
+
- For restore: the linked problem ID (or "No Problem" note)
|
|
261
|
+
- Any quality-check warnings
|
|
262
|
+
|
|
263
|
+
Do not commit. The user will commit when ready.
|
|
264
|
+
|
|
265
|
+
$ARGUMENTS
|
|
@@ -0,0 +1,171 @@
|
|
|
1
|
+
#!/usr/bin/env bats
|
|
2
|
+
# Functional tests for the manage-incident skill (Option A-lite per ADR-011).
|
|
3
|
+
#
|
|
4
|
+
# Scope: execute the bash fragments the SKILL.md instructs Claude to run
|
|
5
|
+
# (ID assignment, file-path construction, directory creation) and assert on
|
|
6
|
+
# the mocked Skill-tool handoff contract between manage-incident and
|
|
7
|
+
# manage-problem. Source-grep assertions on SKILL.md prose are NOT used
|
|
8
|
+
# (P011 ban). A single structural check asserts SKILL.md exists and has
|
|
9
|
+
# frontmatter — file-existence checks are a Permitted Exception per ADR-005.
|
|
10
|
+
|
|
11
|
+
setup() {
|
|
12
|
+
SKILL_DIR="$(cd "$(dirname "$BATS_TEST_FILENAME")/.." && pwd)"
|
|
13
|
+
SKILL_FILE="${SKILL_DIR}/SKILL.md"
|
|
14
|
+
|
|
15
|
+
TEST_ROOT="$(mktemp -d "${TMPDIR:-/tmp}/manage-incident-bats-XXXXXX")"
|
|
16
|
+
INCIDENTS_DIR="${TEST_ROOT}/docs/incidents"
|
|
17
|
+
PROBLEMS_DIR="${TEST_ROOT}/docs/problems"
|
|
18
|
+
}
|
|
19
|
+
|
|
20
|
+
teardown() {
|
|
21
|
+
rm -rf "$TEST_ROOT"
|
|
22
|
+
}
|
|
23
|
+
|
|
24
|
+
# --- Fragment: next-ID computation (I###) ---
|
|
25
|
+
# SKILL.md instructs: scan docs/incidents/ for existing I<NNN> files, take the
|
|
26
|
+
# highest numeric ID, increment by 1, zero-pad to 3 digits.
|
|
27
|
+
next_incident_id() {
|
|
28
|
+
local dir="$1"
|
|
29
|
+
local last
|
|
30
|
+
last=$(ls "$dir"/I*.md 2>/dev/null | sed 's/.*\///' | grep -oE '^I[0-9]+' | sed 's/^I//' | sort -n | tail -1)
|
|
31
|
+
if [[ -z "$last" ]]; then
|
|
32
|
+
printf 'I001'
|
|
33
|
+
else
|
|
34
|
+
printf 'I%03d' $((10#$last + 1))
|
|
35
|
+
fi
|
|
36
|
+
}
|
|
37
|
+
|
|
38
|
+
# --- Fragment: file path construction ---
|
|
39
|
+
incident_path() {
|
|
40
|
+
local dir="$1" id="$2" slug="$3" status="$4"
|
|
41
|
+
printf '%s/%s-%s.%s.md' "$dir" "$id" "$slug" "$status"
|
|
42
|
+
}
|
|
43
|
+
|
|
44
|
+
# --- Mock: Skill-tool invocation contract ---
|
|
45
|
+
# The SKILL.md instructs Claude to invoke wr-itil:manage-problem via the
|
|
46
|
+
# Skill tool on restoration. The contract (skill name + argument shape) is
|
|
47
|
+
# asserted by a mock that writes the payload to a file the test reads back.
|
|
48
|
+
invoke_skill_mock() {
|
|
49
|
+
local tool="$1" skill="$2" args="$3"
|
|
50
|
+
printf '%s\n%s\n%s\n' "$tool" "$skill" "$args" > "${TEST_ROOT}/skill-invocation.log"
|
|
51
|
+
}
|
|
52
|
+
|
|
53
|
+
# ---- Tests ----
|
|
54
|
+
|
|
55
|
+
@test "SKILL.md exists and has frontmatter" {
|
|
56
|
+
[ -f "$SKILL_FILE" ]
|
|
57
|
+
run head -1 "$SKILL_FILE"
|
|
58
|
+
[ "$status" -eq 0 ]
|
|
59
|
+
[ "$output" = "---" ]
|
|
60
|
+
}
|
|
61
|
+
|
|
62
|
+
@test "next_incident_id returns I001 when docs/incidents is empty" {
|
|
63
|
+
mkdir -p "$INCIDENTS_DIR"
|
|
64
|
+
run next_incident_id "$INCIDENTS_DIR"
|
|
65
|
+
[ "$status" -eq 0 ]
|
|
66
|
+
[ "$output" = "I001" ]
|
|
67
|
+
}
|
|
68
|
+
|
|
69
|
+
@test "next_incident_id returns I001 when docs/incidents does not exist" {
|
|
70
|
+
run next_incident_id "$INCIDENTS_DIR"
|
|
71
|
+
[ "$status" -eq 0 ]
|
|
72
|
+
[ "$output" = "I001" ]
|
|
73
|
+
}
|
|
74
|
+
|
|
75
|
+
@test "next_incident_id increments past the highest existing ID" {
|
|
76
|
+
mkdir -p "$INCIDENTS_DIR"
|
|
77
|
+
: > "$INCIDENTS_DIR/I001-foo.closed.md"
|
|
78
|
+
: > "$INCIDENTS_DIR/I002-bar.restored.md"
|
|
79
|
+
: > "$INCIDENTS_DIR/I005-baz.investigating.md"
|
|
80
|
+
run next_incident_id "$INCIDENTS_DIR"
|
|
81
|
+
[ "$status" -eq 0 ]
|
|
82
|
+
[ "$output" = "I006" ]
|
|
83
|
+
}
|
|
84
|
+
|
|
85
|
+
@test "next_incident_id zero-pads three digits" {
|
|
86
|
+
mkdir -p "$INCIDENTS_DIR"
|
|
87
|
+
: > "$INCIDENTS_DIR/I098-foo.closed.md"
|
|
88
|
+
run next_incident_id "$INCIDENTS_DIR"
|
|
89
|
+
[ "$status" -eq 0 ]
|
|
90
|
+
[ "$output" = "I099" ]
|
|
91
|
+
}
|
|
92
|
+
|
|
93
|
+
@test "next_incident_id ignores non-incident files" {
|
|
94
|
+
mkdir -p "$INCIDENTS_DIR"
|
|
95
|
+
: > "$INCIDENTS_DIR/README.md"
|
|
96
|
+
: > "$INCIDENTS_DIR/notes.md"
|
|
97
|
+
run next_incident_id "$INCIDENTS_DIR"
|
|
98
|
+
[ "$status" -eq 0 ]
|
|
99
|
+
[ "$output" = "I001" ]
|
|
100
|
+
}
|
|
101
|
+
|
|
102
|
+
@test "incident_path builds investigating file path" {
|
|
103
|
+
run incident_path "$INCIDENTS_DIR" "I001" "login-500s" "investigating"
|
|
104
|
+
[ "$status" -eq 0 ]
|
|
105
|
+
[ "$output" = "${INCIDENTS_DIR}/I001-login-500s.investigating.md" ]
|
|
106
|
+
}
|
|
107
|
+
|
|
108
|
+
@test "incident_path supports all lifecycle suffixes" {
|
|
109
|
+
for suffix in investigating mitigating restored closed; do
|
|
110
|
+
run incident_path "$INCIDENTS_DIR" "I042" "x" "$suffix"
|
|
111
|
+
[ "$status" -eq 0 ]
|
|
112
|
+
[ "$output" = "${INCIDENTS_DIR}/I042-x.${suffix}.md" ]
|
|
113
|
+
done
|
|
114
|
+
}
|
|
115
|
+
|
|
116
|
+
@test "docs/incidents is auto-created on first declaration" {
|
|
117
|
+
[ ! -d "$INCIDENTS_DIR" ]
|
|
118
|
+
mkdir -p "$INCIDENTS_DIR"
|
|
119
|
+
[ -d "$INCIDENTS_DIR" ]
|
|
120
|
+
id=$(next_incident_id "$INCIDENTS_DIR")
|
|
121
|
+
path=$(incident_path "$INCIDENTS_DIR" "$id" "test" "investigating")
|
|
122
|
+
: > "$path"
|
|
123
|
+
[ -f "$path" ]
|
|
124
|
+
}
|
|
125
|
+
|
|
126
|
+
@test "restore handoff invokes Skill tool with wr-itil:manage-problem and payload" {
|
|
127
|
+
invoke_skill_mock "Skill" "wr-itil:manage-problem" "incident I001 login-500s — rollback v1.4.3 restored service at 14:30 UTC"
|
|
128
|
+
[ -f "${TEST_ROOT}/skill-invocation.log" ]
|
|
129
|
+
run cat "${TEST_ROOT}/skill-invocation.log"
|
|
130
|
+
[ "$status" -eq 0 ]
|
|
131
|
+
[ "${lines[0]}" = "Skill" ]
|
|
132
|
+
[ "${lines[1]}" = "wr-itil:manage-problem" ]
|
|
133
|
+
[[ "${lines[2]}" == *"I001"* ]]
|
|
134
|
+
[[ "${lines[2]}" == *"rollback"* ]]
|
|
135
|
+
}
|
|
136
|
+
|
|
137
|
+
@test "restore handoff payload carries incident ID and mitigation" {
|
|
138
|
+
invoke_skill_mock "Skill" "wr-itil:manage-problem" "incident I042 — mitigation: feature flag off — verified via Datadog"
|
|
139
|
+
run cat "${TEST_ROOT}/skill-invocation.log"
|
|
140
|
+
[[ "${lines[2]}" == *"I042"* ]]
|
|
141
|
+
[[ "${lines[2]}" == *"feature flag off"* ]]
|
|
142
|
+
[[ "${lines[2]}" == *"Datadog"* ]]
|
|
143
|
+
}
|
|
144
|
+
|
|
145
|
+
@test "handoff contract rejects invocation with wrong skill name" {
|
|
146
|
+
invoke_skill_mock "Skill" "wr-itil:manage-change" "incident I001"
|
|
147
|
+
run grep -c '^wr-itil:manage-problem$' "${TEST_ROOT}/skill-invocation.log"
|
|
148
|
+
[ "$output" = "0" ]
|
|
149
|
+
}
|
|
150
|
+
|
|
151
|
+
@test "close is blocked when linked problem file is .open.md" {
|
|
152
|
+
mkdir -p "$PROBLEMS_DIR"
|
|
153
|
+
: > "$PROBLEMS_DIR/P050-root.open.md"
|
|
154
|
+
# close-gate: close only if linked P### is known-error or closed
|
|
155
|
+
linked=$(ls "$PROBLEMS_DIR"/P050-*.md 2>/dev/null | head -1)
|
|
156
|
+
[[ "$linked" != *".known-error.md" && "$linked" != *".closed.md" ]]
|
|
157
|
+
}
|
|
158
|
+
|
|
159
|
+
@test "close is allowed when linked problem file is .known-error.md" {
|
|
160
|
+
mkdir -p "$PROBLEMS_DIR"
|
|
161
|
+
: > "$PROBLEMS_DIR/P050-root.known-error.md"
|
|
162
|
+
linked=$(ls "$PROBLEMS_DIR"/P050-*.md 2>/dev/null | head -1)
|
|
163
|
+
[[ "$linked" == *".known-error.md" || "$linked" == *".closed.md" ]]
|
|
164
|
+
}
|
|
165
|
+
|
|
166
|
+
@test "close is allowed when linked problem file is .closed.md" {
|
|
167
|
+
mkdir -p "$PROBLEMS_DIR"
|
|
168
|
+
: > "$PROBLEMS_DIR/P050-root.closed.md"
|
|
169
|
+
linked=$(ls "$PROBLEMS_DIR"/P050-*.md 2>/dev/null | head -1)
|
|
170
|
+
[[ "$linked" == *".known-error.md" || "$linked" == *".closed.md" ]]
|
|
171
|
+
}
|
|
@@ -83,6 +83,12 @@ What "work" means depends on the problem's status:
|
|
|
83
83
|
3. Include the problem doc closure in the fix commit (`git mv` to `.closed.md`, update Status)
|
|
84
84
|
4. Push, create changeset, release per the lean release principle
|
|
85
85
|
|
|
86
|
+
**Scope expansion during work:** If investigation or architect review reveals that the problem's scope has grown significantly (e.g., effort re-sized from S to L, additional files discovered), use `AskUserQuestion` before continuing:
|
|
87
|
+
- Option 1: `Continue with expanded scope` — keep working this problem at its new size
|
|
88
|
+
- Option 2: `Update problem and re-rank` — save findings to the problem file, re-score WSJF, and re-run the work selection to let the user pick from the updated queue
|
|
89
|
+
- Option 3: `Pick a different problem` — park this one and work something else
|
|
90
|
+
- Use `header: "Scope change"` and `multiSelect: false`
|
|
91
|
+
|
|
86
92
|
**In both cases:** After completing work on one problem, run `problem work` again to pick up the next highest-WSJF problem. Keep going until the user says stop or no more problems are actionable.
|
|
87
93
|
|
|
88
94
|
## Steps
|
|
@@ -239,7 +245,7 @@ Read `RISK-POLICY.md` to get the current impact levels (1-5), likelihood levels
|
|
|
239
245
|
- Update the Status field to "Known Error"
|
|
240
246
|
- This happens automatically — do not ask the user
|
|
241
247
|
|
|
242
|
-
**Step 9c: Present summary**
|
|
248
|
+
**Step 9c: Present summary and select problem to work**
|
|
243
249
|
|
|
244
250
|
After reviewing all problems, present a WSJF-ranked table:
|
|
245
251
|
|
|
@@ -253,6 +259,18 @@ Highlight:
|
|
|
253
259
|
- Problems that have been fixed but not closed (check git history for fix commits)
|
|
254
260
|
- Known errors with a `## Fix Released` section (pending user verification)
|
|
255
261
|
|
|
262
|
+
**When the operation is `work` (not just `review`), select the problem to work using `AskUserQuestion`:**
|
|
263
|
+
|
|
264
|
+
- If one problem has a strictly higher WSJF than all others, present it as the recommended option:
|
|
265
|
+
- Option 1: `Work P<NNN>: <title> (Recommended)` — with description showing WSJF score and status
|
|
266
|
+
- Option 2: `Pick a different problem` — let the user name a specific ID
|
|
267
|
+
- If two or more problems tie for the highest WSJF, present the tied problems as options:
|
|
268
|
+
- One option per tied problem: `Work P<NNN>: <title>` — with description showing WSJF and a one-line rationale for why this one
|
|
269
|
+
- Final option: `Pick a different problem`
|
|
270
|
+
- Use `header: "Next problem"` and `multiSelect: false`
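Deciding between the "single recommended problem" and "tie" branches comes down to finding every problem that shares the top score. A minimal sketch, assuming the ranked table has already been reduced to lines of the form `P<NNN> <wsjf>`; `top_wsjf` is a hypothetical helper, not part of the skill.

```bash
# Hypothetical helper: read "P<NNN> <wsjf>" lines on stdin and print the IDs
# that share the highest WSJF score.
top_wsjf() {
  LC_ALL=C sort -k2,2nr | awk 'NR == 1 { top = $2 } $2 == top { print $1 }'
}

printf 'P001 2.5\nP003 4\nP007 4.0\n' | top_wsjf
```

One ID in the output corresponds to the recommended-option case; two or more correspond to the tie case, with one option per printed ID.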
|
|
271
|
+
|
|
272
|
+
**Never present the selection as prose "(a)/(b)/(c)" or "which would you like?"** — always use `AskUserQuestion` so the decision is structured and auditable.
|
|
273
|
+
|
|
256
274
|
**Step 9d: Check for pending verifications**
|
|
257
275
|
|
|
258
276
|
For each known-error that has a `## Fix Released` section, use `AskUserQuestion` to ask the user if the fix has been verified in production. If the user confirms, close the problem (`git mv` to `.closed.md`, update Status). If the user says no or is unsure, leave it as known-error.
|