@xcraftmind/mastermind 0.28.0 → 0.28.2
- package/README.md +4 -4
- package/package.json +9 -9
- package/share/agents/mastermind-auditor.md +76 -2
- package/share/agents/mastermind-critic.md +1 -0
- package/share/agents/mastermind-investigator.md +168 -0
- package/share/agents/mastermind-prompt-refiner.md +29 -10
- package/share/agents/mastermind-researcher.md +23 -4
- package/share/agents/mastermind-task-executor.md +29 -0
- package/share/skills/mastermind-prompt-refiner/SKILL.md +61 -8
- package/share/skills/mastermind-task-planning/SKILL.md +105 -3
- package/share/skills/mastermind-task-planning/references/design-review-packet.md +120 -0
- package/share/skills/mastermind-task-planning/references/spec-template.md +84 -4
- package/share/agents/mastermind-release.md +0 -442
- package/share/commands/api-shape-explorer.md +0 -107
- package/share/skills/doc-stub-sync/SKILL.md +0 -187
- package/share/skills/doc-stub-sync/references/error-handling.md +0 -79
- package/share/skills/doc-stub-sync/references/url-patterns.md +0 -83
- package/share/skills/doc-stub-sync/scripts/doc_update.py +0 -285
- package/share/skills/doc-stub-sync/scripts/requirements.txt +0 -2
- package/share/skills/flaky-finder/SKILL.md +0 -75
- package/share/skills/mastermind-incident-response/SKILL.md +0 -157
- package/share/skills/mastermind-incident-response/references/investigation-playbook.md +0 -174
- package/share/skills/mastermind-incident-response/references/postmortem-template.md +0 -184
- package/share/skills/mastermind-incident-response/references/triage-checklist.md +0 -118
- package/share/skills/pr-review/SKILL.md +0 -89
@@ -1,75 +0,0 @@
----
-name: flaky-finder
-description: Identify flaky tests by running the suite repeatedly and bisecting failures across runs. Use when the user says "find flaky tests", "this test is flaky", "tests pass locally but fail in CI", or sees intermittent test failures.
-metadata:
-version: 0.1.0
-authors:
-- mastermind
-tags:
-- testing
-model: sonnet
----
-
-# Flaky Test Finder
-
-Finds tests that pass and fail non-deterministically. Runs the suite N times, records which tests changed outcome between runs, and ranks them by flake rate.
-
-## When to use
-
-- User reports "tests pass locally but fail in CI"
-- A test failed once and the user wants to confirm if it's flaky before retrying
-- User explicitly asks for a flake audit before a release
-- Do NOT use for finding *broken* tests — those fail consistently. Use a regular test run for that.
-
-## Prerequisites
-
-- A working `<test-command>` for the project (`pytest`, `go test ./...`, `npm test`, etc.)
-- Time — flake hunting is inherently slow (10-50 runs of the full suite)
-
-## Steps
-
-1. **Confirm the test command.** Read the project's CI config or `package.json` / `Makefile`. If unclear, ask.
-2. **Establish a baseline.** Run the suite once. If it fails, the failures aren't flakes — they're broken. Stop and report.
-3. **Decide N.** Default to 20 runs. For long suites (>5min), drop to 10. For fast suites (<30s), go to 50.
-4. **Run N times, recording each test's pass/fail per run.** Use the project's machine-readable output if available (`pytest --junitxml`, `go test -json`, `jest --json`).
-5. **Compute flake rate per test.** A test that passed 18/20 times and failed 2/20 has a flake rate of 10%.
-6. **Rank by flake rate descending.** Anything between 1% and 99% is suspicious; 0% and 100% are deterministic.
-7. **For the top 3 flakiest, read the test code.** Look for: shared state, time-based assertions, network calls, ordering assumptions, race conditions.
-8. **Report findings.**
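Steps 3, 5, and 6 of the removed skill can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the skill's own implementation: the function names `decide_runs` and `flake_rates` and the `runs` structure (one dict per suite run, mapping a test ID to whether it passed) are hypothetical. In practice the per-run outcomes would be parsed from the machine-readable output named in step 4.

```python
from collections import Counter


def decide_runs(suite_seconds: float) -> int:
    """Pick N from the suite duration, following the thresholds in step 3."""
    if suite_seconds > 300:   # long suite (>5min)
        return 10
    if suite_seconds < 30:    # fast suite (<30s)
        return 50
    return 20                 # default


def flake_rates(runs: list[dict[str, bool]]) -> list[tuple[str, float]]:
    """Return (test_id, failure_rate) pairs for flaky tests, flakiest first.

    A test counts as flaky only if it both passed and failed at least once;
    tests that always pass or always fail are deterministic and are skipped.
    """
    failures = Counter()
    seen = Counter()
    for outcomes in runs:
        for test_id, passed in outcomes.items():
            seen[test_id] += 1
            if not passed:
                failures[test_id] += 1

    flaky = [
        (test_id, failures[test_id] / seen[test_id])
        for test_id in seen
        if 0 < failures[test_id] < seen[test_id]
    ]
    # Sort by rate descending, then by name so the order is stable.
    return sorted(flaky, key=lambda item: (-item[1], item[0]))


# Hypothetical data: 20 runs, test_a fails in 2 of them, test_b never fails.
runs = [{"test_a": i not in (3, 11), "test_b": True} for i in range(20)]
print(flake_rates(runs))  # [('test_a', 0.1)]
```

The 2-failures-in-20 case mirrors the 10% example in step 5.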
-
-## Outputs
-
-```markdown
-## Flake report — N=<N> runs of <test-command>
-
-### Flaky tests (sorted by flake rate)
-| Test | Flake rate | Likely cause |
-|---|---|---|
-| `tests/limiter_test.go::TestConcurrentBucket` | 35% (7/20 failed) | Race on shared counter, no `t.Parallel()` synchronization |
-| `tests/api_test.py::test_response_time` | 15% (3/20 failed) | Time-based assertion `< 100ms` — fails under load |
-
-### Deterministic failures
-- `tests/foo_test.py::test_bar` — failed all 20 runs. Not a flake; this test is broken.
-
-### Deterministic passes
-- <count> tests passed all <N> runs.
-```
-
-## Examples
-
-**Input:** "Our CI is flaky, can you find the culprit?"
-
-**Output (abbreviated):**
-```markdown
-## Flake report — N=20 runs of `pytest tests/`
-
-### Flaky tests
-| Test | Flake rate | Likely cause |
-|---|---|---|
-| `test_websocket_reconnect` | 25% | Race between `await ws.connect()` and the heartbeat loop |
-| `test_cache_eviction` | 5% | Wall-clock assertion `time.time() - start < 1.0` |
-
-### Deterministic
-- 312 passed all 20 runs
-- 0 failed all 20 runs
-```
@@ -1,157 +0,0 @@
----
-name: mastermind-incident-response
-description: Parallel workflow for production incidents — triage, stop the bleeding, investigate root cause via mmcg + git + .mastermind/tasks/ history, write a blameless postmortem, feed lessons back into CONTEXT.md and (if applicable) into the main workflow's spec template or critic dimensions. Use when the user says "incident", "outage", "rollback", "something broke in prod", or pastes paging alerts / error logs.
-metadata:
-version: 0.1.0
-authors:
-- mastermind
-tags:
-- workflow
-- incident-response
-- postmortem
-- operations
-model: opus
----
-
-# Mastermind — Incident Response
-
-A **parallel workflow** for handling production breakage. Different from the main 13-step planning workflow ([`mastermind-task-planning`](../mastermind-task-planning/SKILL.md)) which builds new things — this one **stops bleeding**, finds root cause, and turns lessons into systemic improvements.
-
-## When to Activate
-
-User says or pastes:
-- "incident", "outage", "production is down", "something broke in prod", "rollback"
-- Paging alerts (Datadog, PagerDuty, Sentry)
-- Error logs with stack traces
-- "users are reporting…"
-- "deploy broke something"
-
-## What this is NOT
-
-- **Not** the bug-triage flow for development-time bugs — those go through the regular planning workflow
-- **Not** a feature-request channel
-- **Not** a debugging session for the user's local environment
-- **Not** a substitute for paging an actual on-call engineer for sev0/sev1 incidents — the workflow assists, doesn't replace the human
-
-## Different Priorities Than Planning
-
-| Planning workflow | Incident response |
-|---|---|
-| Optimize for quality | Optimize for **time** |
-| 7-dim critic before doing anything | Bias toward **rollback first**, understand later |
-| Mandatory specs, alternatives, tests | Hot-fix is OK if rollback impossible |
-| "Did we design this right?" | "What's the fastest way to stop the bleeding?" |
-| Blameless review post-fact | Blameless reasoning **during** |
-
-You are in **operations mode**. Speed of bleed-stop > completeness of fix > root-cause depth > paperwork. Reverse that order during postmortem.
-
-## Phases
-
-### Phase 1 — Triage (target: first 5 minutes)
-
-Ask the user (or extract from pasted alert):
-
-1. **Symptom** — what users / monitoring see (one sentence, observable)
-2. **Scope** — how many users / how much traffic / which surfaces
-3. **Severity** — pick a number:
-- **sev0** — total outage, paging fire
-- **sev1** — major degradation, immediate action needed
-- **sev2** — partial degradation, action within hours
-- **sev3** — minor / cosmetic, action within days
-4. **Timeline** — when did this start? (correlate with deploys / changes)
-5. **What's been tried already**
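The five triage answers above can be captured as a small record, sketched here in Python. The names `Severity`, `TriageRecord`, and `needs_human_on_call` are hypothetical and do not come from the package; the only rule encoded is the one stated earlier in the file, that sev0/sev1 incidents still require a human on-call engineer.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the triage list; a lower number is more urgent."""
    SEV0 = 0  # total outage
    SEV1 = 1  # major degradation, immediate action needed
    SEV2 = 2  # partial degradation, action within hours
    SEV3 = 3  # minor / cosmetic, action within days


@dataclass
class TriageRecord:
    """One record per incident, filled in during the first five minutes."""
    symptom: str       # one observable sentence
    scope: str         # users / traffic / surfaces affected
    severity: Severity
    started_at: str    # when the symptom began, e.g. an ISO 8601 timestamp
    tried_already: list[str] = field(default_factory=list)

    def needs_human_on_call(self) -> bool:
        # The workflow assists but does not replace paging for sev0/sev1.
        return self.severity <= Severity.SEV1


# Hypothetical incident used only to show the shape of the record.
record = TriageRecord(
    symptom="Checkout requests return HTTP 500",
    scope="all web users",
    severity=Severity.SEV1,
    started_at="2024-01-01T14:32:00Z",
)
print(record.needs_human_on_call())  # True
```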
-
-While asking, parallel-research with `mastermind-researcher` subagent:
-```
-git log --since='2 hours ago' --oneline → what changed recently
-git log -10 --oneline → most recent commits
-ls -lt .mastermind/tasks/*/spec.md 2>/dev/null | head -10 → most recent specs (folder-per-task)
-mmcg_status → index health
-```
-
-See [`references/triage-checklist.md`](references/triage-checklist.md) for the full first-response checklist.
-
-### Phase 2 — Stop the bleeding (target: next 10 minutes after triage)
-
-**Order of preference:**
-1. **Rollback** to last known good — if you can identify it, do it
-2. **Disable the feature** — if feature-flagged, flip the flag off
-3. **Hot patch** — only if 1 and 2 not possible; this carries risk
-4. **Escalate** — if stuck > 10 min on Phase 2, page additional help / wake on-call
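The order of preference above reduces to a short decision function, sketched here in Python. The function name and its three inputs are hypothetical. Checking the 10-minute escalation threshold before the other options is an assumption of this sketch: the list ranks escalation last, but states its trigger as time spent stuck. The function only names an option to propose and executes nothing.

```python
def choose_mitigation(can_rollback: bool, has_feature_flag: bool,
                      minutes_in_phase: float) -> str:
    """Return the mitigation to propose to the user.

    Inputs are judgments made by whoever is running the incident:
    whether a last known good version is identified, whether the failing
    feature sits behind a flag, and how long Phase 2 has been running.
    """
    if minutes_in_phase > 10:
        return "escalate"             # stuck too long: page additional help
    if can_rollback:
        return "rollback"             # preferred: return to last known good
    if has_feature_flag:
        return "disable the feature"  # flip the flag off
    return "hot patch"                # last resort, carries risk


print(choose_mitigation(can_rollback=False, has_feature_flag=True,
                        minutes_in_phase=3))  # disable the feature
```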
-
-For each option, tell the user what you are about to propose. **Do not execute destructive ops** (`git push --force`, deploys) without explicit user confirmation per turn — they're operating the controls, you're advising.
-
-Mitigation tactics by failure type:
-- **Recent deploy broke things** → revert the deploy
-- **Recent config change broke things** → revert config
-- **Data corruption** → freeze writes, restore from backup, investigate cause separately
-- **External dependency degraded** → enable degraded-mode fallback if present; otherwise wait + monitor
-- **Resource exhaustion (memory, disk, connections)** → kill / restart / scale; investigate cause separately
-
-### Phase 3 — Investigate (after symptoms stop)
-
-With pressure off, find **root cause** — not just the symptom. Five-whys discipline.
-
-**Investigation playbook** — see [`references/investigation-playbook.md`](references/investigation-playbook.md) for the full set of mmcg + git + log patterns. Quick summary:
-
-- **What changed recently?** `git log --since='<time of incident start - 1h>' -- <suspected paths>`
-- **What's the blast radius of the change?** `mmcg query impact <symbol> --depth 3`
-- **Were the relevant specs in `.mastermind/tasks/` going to catch this?** Read their Tests Plan + Observability Plan sections
-- **Did observability fire?** If yes, why didn't it page sooner? If no, why wasn't it instrumented?
-- **Is this a recurrence?** Grep `CONTEXT.md` for the symptom — known gotcha?
-
-If a fix is needed, **don't write it inline in this incident flow** — open a `.mastermind/tasks/<NNN>-<short-name>/spec.md` via the main workflow. The fix goes through the normal critic/auditor gates. Incident response identifies the need; planner designs the response.
-
-### Phase 4 — Postmortem (within 24h of resolution)
-
-Use [`references/postmortem-template.md`](references/postmortem-template.md). Sections:
-
-- **Summary** (1-2 sentences — what happened, impact, resolution)
-- **Timeline** (UTC, minute-resolution where relevant)
-- **What went wrong** (root cause, contributing factors)
-- **What went well** (yes, name what worked — psychological safety + reinforces good patterns)
-- **Why detection took N minutes** (separate from why-it-happened — detection is its own failure mode)
-- **Why mitigation took N minutes** (rollback fast? unclear who could act? missing runbook?)
-- **Action items** (specific, owned, dated — each becomes a `.mastermind/tasks/` spec or a CONTEXT.md update)
-
-**Blameless framing** — write about systems, not people:
-- ❌ "Engineer X deployed without testing"
-- ✓ "The deploy pipeline allowed merging with failing tests because the test job was marked non-blocking three weeks ago"
-
-If a person made a judgment call that turned out wrong, frame it as: "given the information available at the time, the action was reasonable; the lesson is that information X needs to be more accessible / surfaced earlier."
-
-### Phase 5 — Feed forward
-
-Three destinations:
-
-**A. Project `CONTEXT.md`** (immediate):
-- **Known gotchas** entry for the failure pattern — concrete + scenario + reference to postmortem path
-- **Don't-touch list** entry if a code area has subtle constraints now known
-- **Decision log** entry if the postmortem changed an architectural decision
-
-**B. Action items as new `.mastermind/tasks/` specs** (within days):
-- Each action item becomes a spec
-- Specs go through normal workflow (planner → critic → executor → auditor)
-- Link back to postmortem in spec's Notes section
-
-**C. Workflow improvements** (if applicable):
-- Did the spec for the offending change include an Observability Plan? If no, that's evidence the planner skill should enforce it more strictly.
-- Did the critic's 7 dimensions miss this category of issue? Propose an 8th dimension or sharpen an existing one.
-- Did the auditor pass when it shouldn't have? Add a new check.
-
-Workflow improvements go into the mastermind repo itself as a meta-improvement spec. **The workflow learns from its own failures.**
-
-## Roles & subagents
-
-Most of incident response is run by the planner (in this mode), with these spawns:
-
-- **`mastermind-researcher`** — for git/mmcg fact-gathering during Phase 1 and Phase 3
-- **`mastermind-critic`** — for the postmortem's "what went wrong" section if there's a design question (e.g., "was this design fundamentally flawed?"). Optional.
-- **`mastermind-auditor`** — NOT used in incident response (it's a post-flight checker, doesn't apply here)
-- **`mastermind-task-planning`** (in main mode) — for any follow-up specs that come out of the postmortem
-
-## References
-
-- [`references/triage-checklist.md`](references/triage-checklist.md) — first 5 minutes
-- [`references/investigation-playbook.md`](references/investigation-playbook.md) — mmcg + git + .mastermind/tasks/ patterns for finding root cause
-- [`references/postmortem-template.md`](references/postmortem-template.md) — blameless postmortem fill-in
@@ -1,174 +0,0 @@
-# Investigation playbook — find root cause via mmcg + git + .mastermind/tasks/
-
-Reference for the [`mastermind-incident-response`](../SKILL.md) skill, Phase 3. After symptoms have stopped, find what actually broke.
-
-The patterns below are concrete recipes. Use them when the corresponding question comes up — don't run all of them speculatively (wastes context).
-
----
-
-## Question 1 — "What changed recently?"
-
-Most production incidents trace to a recent change. Start here.
-
-```bash
-# What was committed in the window when the incident started?
-git log --since='2 hours ago' --until='now' --oneline
-
-# What was committed in the file/dir we suspect?
-git log -20 --oneline -- <suspected/path/>
-
-# What did the most recent commits actually change?
-git log -5 -p --stat -- <suspected/path/>
-```
-
-Then for any candidate commit:
-```bash
-git show <commit-sha> --stat # what files
-git show <commit-sha> # full diff
-```
-
-**Heuristic for ranking suspect commits:**
-- Most recent first
-- Bigger diffs first (more surface to have bugs)
-- Commits to "interesting" paths (hot paths, recently-incident-prone dirs)
-- Commits that touch the symptom's component (grep error message in commit diff)
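One way to apply the ranking heuristic above is a multi-key sort, sketched here in Python. The playbook lists the four signals but does not say how to weigh them against each other, so the priority used here (component match, then hot path, then recency, then diff size) is purely an assumption of this sketch. The `Commit` fields are hypothetical and would be filled in by hand or from `git log` output.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    sha: str
    age_minutes: int                 # minutes between commit and incident start
    lines_changed: int               # size of the diff
    touches_hot_path: bool           # an "interesting" path
    touches_symptom_component: bool  # e.g. the error message appears in the diff


def rank_suspects(commits: list[Commit]) -> list[Commit]:
    """Order commits so the most suspicious come first.

    The key order is an assumed priority, not one given by the playbook.
    """
    return sorted(
        commits,
        key=lambda c: (
            not c.touches_symptom_component,  # False sorts before True
            not c.touches_hot_path,
            c.age_minutes,                    # smaller age means more recent
            -c.lines_changed,                 # bigger diffs first
        ),
    )


# Hypothetical commits used only to show the ordering.
commits = [
    Commit("aaa111", age_minutes=30, lines_changed=400,
           touches_hot_path=False, touches_symptom_component=False),
    Commit("bbb222", age_minutes=90, lines_changed=12,
           touches_hot_path=True, touches_symptom_component=True),
    Commit("ccc333", age_minutes=10, lines_changed=5,
           touches_hot_path=True, touches_symptom_component=False),
]
print([c.sha for c in rank_suspects(commits)])  # ['bbb222', 'ccc333', 'aaa111']
```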
-
----
-
-## Question 2 — "What's the blast radius of the change?"
-
-If you have a suspect commit, what does it touch that could explain the symptom?
-
-```bash
-# What files changed in the suspect commit?
-git show <commit-sha> --name-only
-
-# For each changed function/method, what calls it?
-mmcg_callers <symbol> --language <lang>
-
-# Transitive — what else depends on it?
-mmcg_impact <symbol> --depth 3 --language <lang>
-```
-
-If the blast radius doesn't include the symptom's component → the suspect commit probably isn't the cause. Move to the next candidate.
-
-If the blast radius DOES include the symptom's component → strong candidate. Read the code change.
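The two rules above amount to a membership test repeated over the candidate list, sketched here in Python. The output format of the mmcg commands is not shown in this diff, so the sketch assumes the blast radius of each commit has already been collected into a set of component names. The function name and the sample data are hypothetical.

```python
from typing import Optional


def first_strong_candidate(blast_radii: dict[str, set[str]],
                           symptom_component: str) -> Optional[str]:
    """Return the first commit whose blast radius includes the symptom's component.

    `blast_radii` maps commit SHA to the set of components that commit can
    affect, listed in the order the candidates should be examined
    (dicts preserve insertion order).
    """
    for sha, affected in blast_radii.items():
        if symptom_component in affected:
            return sha   # strong candidate: read this code change
    return None          # no candidate explains the symptom


# Hypothetical blast radii for two suspect commits.
radii = {
    "aaa111": {"billing", "invoices"},
    "bbb222": {"auth", "sessions", "api-gateway"},
}
print(first_strong_candidate(radii, "sessions"))  # bbb222
```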
-
----
-
-## Question 3 — "Were the relevant specs supposed to catch this?"
-
-The spec for any recent change should have had Tests Plan + Observability Plan + Performance Considerations sections (per spec template).
-
-```bash
-# Find the spec for this work (look in .mastermind/tasks/ for matching folder or timestamp)
-ls -lt .mastermind/tasks/ # task folders, newest first
-ls -lt .mastermind/tasks/*/spec.md 2>/dev/null # direct list of spec files by mtime
-
-# Read its Tests Plan
-grep -A 20 "Tests Plan" .mastermind/tasks/<NNN>-<name>/spec.md
-
-# Read its Observability Plan
-grep -A 10 "Observability Plan" .mastermind/tasks/<NNN>-<name>/spec.md
-
-# Read its Performance Considerations
-grep -A 10 "Performance Considerations" .mastermind/tasks/<NNN>-<name>/spec.md
-```
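The `grep -A` commands above read a fixed number of lines, which can cut a long section short or spill into the next one. A possible alternative, sketched in Python, reads a section up to the next heading of the same or a higher level. It assumes the spec uses ATX-style Markdown headings (`## Tests Plan`) whose titles match exactly, which this diff does not confirm. The function name and sample spec are hypothetical.

```python
import re


def extract_section(markdown: str, heading: str) -> str:
    """Return the body under the first heading titled `heading`.

    The section ends at the next heading of the same or a higher level.
    Limitation: a '# comment' line inside a fenced code block would be
    mistaken for a heading.
    """
    body = []
    level = None  # heading depth of the section once it has been found
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*?)\s*$", line)
        if level is None:
            if match and match.group(2) == heading:
                level = len(match.group(1))
            continue
        if match and len(match.group(1)) <= level:
            break
        body.append(line)
    return "\n".join(body).strip()


# Hypothetical spec content used only for illustration.
spec = (
    "# Spec\n\n"
    "## Tests Plan\n- unit test for retry logic\n\n"
    "## Observability Plan\n- error-rate metric on the new endpoint\n"
)
print(extract_section(spec, "Observability Plan"))
```

An empty return value means the section is missing, which is itself an answer to the question this step asks.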
-
-Then ask:
-
-- **Did the Tests Plan cover this failure mode?** If no, that's a gap in spec quality. Action item: "harden Tests Plan for similar work."
-- **Did Observability fire for this failure?** If no, was it instrumented? If yes, did it page? Detection time is its own failure to root-cause.
-- **Did Performance Considerations anticipate this load?** If the issue is scale-related, the spec should have predicted it.
-
-These answers feed the postmortem's "Why detection took N minutes" and "Workflow improvements" sections.
-
----
-
-## Question 4 — "Has this happened before?"
-
-```bash
-# Check known gotchas
-grep -B 2 -A 4 -i "<symptom keywords>" CONTEXT.md
-
-# Check don't-touch list
-grep -B 2 -A 4 "<file or path>" CONTEXT.md
-
-# Check past postmortems (if you keep them)
-ls postmortems/ 2>/dev/null || ls docs/postmortems/ 2>/dev/null
-grep -ri "<symptom>" postmortems/ docs/postmortems/ 2>/dev/null
-```
-
-If yes → this is a **recurrence**. That's a MUCH bigger finding than a first-time incident:
-- The previous fix didn't stick
-- The prevention didn't transfer
-- The CONTEXT.md entry was either missing or ignored
-
-A recurrence postmortem should focus heavily on Phase 5 (feed forward) and propose a STRUCTURAL fix, not just a code fix.
-
----
-
-## Question 5 — "What's the failure mode?"
-
-Classify the failure into a category. This guides both immediate response and what kind of structural improvement to propose:
-
-| Category | Examples | Typical fix |
-|---|---|---|
-| **Code bug** | null-pointer, off-by-one, wrong condition | Code change + test |
-| **Configuration bug** | wrong env var, bad timeout, missing flag | Config change + config validation in CI |
-| **Schema / migration** | column missing, type mismatch, FK violation | Migration + pre-flight schema check |
-| **Capacity / scale** | OOM, connection pool exhausted, CPU pegged | Scaling + capacity model in spec template |
-| **External dependency** | upstream API down, vendor SLA breach | Degraded-mode fallback + dep health probe |
-| **Race / concurrency** | lost write, deadlock, double-spend | Concurrency model documented + tests |
-| **Data quality** | bad input from upstream, encoding issue | Validation at boundary + schema for inputs |
-| **Process** | bad deploy, wrong branch, missed code review | Pipeline / process improvement |
-
-The category determines whether the postmortem proposes a CODE fix, a PROCESS fix, or a SYSTEM fix (architectural).
-
----
-
-## Question 6 — "What's the smallest reproducer?"
-
-Before declaring root cause found, get a reproducer:
-
-- **Unit test** — fastest; if you can write one that fails, you understand the bug
-- **Integration test** — if behavior depends on multiple components
-- **Manual reproduction** — if neither possible, document the exact steps
-
-A bug without a reproducer is a bug not understood. Don't ship the fix until you can reproduce.
-
-The reproducer becomes Test 1 in the fix spec's Tests Plan.
-
----
-
-## Five-whys discipline
-
-For the postmortem's "What went wrong" section, apply five-whys:
-
-1. **Why** did the symptom happen? → because <proximate cause>
-2. **Why** did that happen? → because <one level deeper>
-3. **Why** did THAT happen? → because <one more>
-4. **Why**?
-5. **Why**?
-
-Stop when:
-- The answer is a system property (e.g., "the deploy pipeline is asynchronous, so we ran without rollback safety")
-- The answer would require changing fundamental architecture (escalate, don't propose in postmortem)
-- You've gone 5 levels — at some point further whys are speculation
-
-Document the chain in the postmortem. The deepest "why" you can credibly answer is the **systemic** root cause; that's what the action items should address.
-
----
-
-## When to stop investigating
-
-You have enough when:
-
-1. ✓ You can name the proximate cause (the code/config/data thing that broke)
-2. ✓ You can name a systemic cause (the why that goes deeper than "engineer X did Y")
-3. ✓ You have a reproducer (or explicit decision that one is not feasible)
-4. ✓ You know what would have prevented this (a test? a check? a config? a process?)
-
-Then go to Phase 4 — write the postmortem.
-
-If after 1-2 hours you can't answer #1-2 → escalate or accept "we couldn't determine root cause" honestly in the postmortem. **Don't fabricate a cause to look complete.** Unknown root causes are themselves a finding.
@@ -1,184 +0,0 @@
-<!--
-Mastermind blameless postmortem template.
-
-HOW TO USE
-- Copy this file to postmortems/<YYYY-MM-DD>-<short-name>.md (or docs/postmortems/, whichever convention your repo uses)
-- Fill in every <placeholder> with concrete content
-- Delete sections that genuinely don't apply (e.g., if a sev3 had no user impact)
-- Keep the file short — a postmortem nobody reads is worse than no postmortem
-- Action items get linked back here from the .mastermind/tasks/ specs they spawn
-
-BLAMELESS PRINCIPLE
-Write about systems, not people. If a person made a judgment call, frame it as:
-"given the information available, the action was reasonable; the lesson is that
-information X needs to be more accessible / surfaced earlier."
-
-Anti-pattern: "Engineer X deployed without testing."
-Better: "The deploy pipeline allowed merging with failing tests because the
-test job was marked non-blocking three weeks ago."
-
-TONE
-- Past tense (this happened, this was tried)
-- Concrete (timestamps, error messages, file paths)
-- Honest about unknowns ("we don't yet know why X" is better than fabricating)
--->
-
-# Postmortem: <short title>
-
-## Summary
-
-**Date:** <YYYY-MM-DD>
-**Severity:** <sev0 | sev1 | sev2 | sev3>
-**Duration:** <N minutes from start to mitigation, M minutes to full resolution>
-**Impact:** <one sentence — what users / systems were affected, magnitude>
-
-<One to three sentences: what happened, what was the user impact, what was the resolution. Anyone reading should know what this is about in 30 seconds.>
-
----
-
-## Timeline (UTC)
-
-| Time | Event |
-|---|---|
-| <HH:MM> | First failure observed (per <source — log, monitor, user report>) |
-| <HH:MM> | Detection (paged / reported in #ops / noticed by …) |
-| <HH:MM> | Incident response engaged |
-| <HH:MM> | Triage complete; severity declared as <sev>; <initial hypothesis> |
-| <HH:MM> | Mitigation: <what was done> |
-| <HH:MM> | Symptoms stopped |
-| <HH:MM> | Root cause identified |
-| <HH:MM> | Full resolution (fix deployed / patch applied / dependency restored) |
-| <HH:MM> | Postmortem started |
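The N and M values in the Duration field, and the detection delay discussed in the template's Detection section, follow directly from the timeline rows. A minimal Python sketch, assuming `HH:MM` UTC timestamps and an incident shorter than 24 hours; the function name and the sample times are hypothetical:

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Minutes from `start` to `end`, both 'HH:MM' in UTC.

    A negative difference is treated as a single midnight rollover,
    so the result is only valid for spans under 24 hours.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return int(delta.total_seconds() // 60)


# Hypothetical timeline rows, keyed by event.
timeline = {
    "first_failure": "14:32",
    "detection": "14:38",
    "symptoms_stopped": "14:51",
    "full_resolution": "16:05",
}

start = timeline["first_failure"]
print(minutes_between(start, timeline["detection"]))         # 6
print(minutes_between(start, timeline["symptoms_stopped"]))  # 19
print(minutes_between(start, timeline["full_resolution"]))   # 93
```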
-
----
-
-## What happened
-
-<2-4 paragraphs of narrative. Lead with the proximate cause. Then walk through how the symptom manifested, what was tried, what worked, what didn't.>
-
-<If multiple things went wrong in sequence (cascading failure), name each component and how they interacted.>
-
----
-
-## Root cause analysis
-
-### Proximate cause
-<The specific code change / config / dependency that triggered the symptom. Cite file:line, commit SHA, or external system.>
-
-### Systemic causes (five-whys chain)
-1. **Why** did the symptom happen? <because…>
-2. **Why** did that happen? <because…>
-3. **Why** did that happen? <because…>
-4. **Why** did that happen? <because…>
-5. **Why** did that happen? <because…>
-
-The deepest "why" we can credibly answer is the **systemic root cause** — that's what the action items should address.
-
-### Failure category
-
-<Pick one from the investigation-playbook.md table: code bug / configuration bug / schema / capacity / external dependency / race or concurrency / data quality / process.>
-
----
-
-## Detection
-
-**Why did detection take <N> minutes?**
-<Why didn't this fire earlier? Was there a monitor for this failure mode? Did the monitor fire but not page? Did it page someone who couldn't act?>
-
-**What detection improvements would have caught this <K> minutes sooner?**
-<Specific: "a P99 latency alert on /api/messages would have fired at 14:33 instead of 14:38 when users reported.">
-
----
-
-## Mitigation
-
-**Why did mitigation take <M> minutes?**
-<Was the rollback path clear? Was someone with deploy access available? Was the on-call runbook accurate?>
-
-**What mitigation improvements would have stopped the bleed sooner?**
-<Specific: "a feature flag for the new code path would have let us disable in seconds instead of waiting for a rollback deploy.">
-
----
-
-## What went well
-
-<Yes, name what worked. Reinforces good patterns and supports psychological safety. Be specific.>
-
-- <Thing 1 — e.g., "The Datadog dashboard for /api/messages clearly showed the regression once someone looked at it">
-- <Thing 2 — e.g., "On-call rotation was clear; <person> was immediately available">
-- <Thing 3>
-
----
-
-## What didn't go well
-
-<The hard part — be honest, but blameless. Focus on systems, gaps, processes.>
-
-- <Thing 1 — e.g., "The spec for this change didn't include an Observability Plan, so the new code path had no metrics">
-- <Thing 2 — e.g., "The deploy pipeline reported success even though the smoke test failed">
-- <Thing 3>
-
----
-
-## Where the mastermind workflow gates failed (if applicable)
-
-*Only relevant if this incident came from a change shipped through the mastermind workflow. If it came from a hot-fix, manual ops, or pre-mastermind code, skip this section.*
-
-- **Critic dimension(s) that should have caught this:** <e.g., "dimension #3 Observability — the design didn't include any metric / log on the failure path, and the critic missed flagging it">
-- **Spec template section(s) that were empty or weak:** <e.g., "Observability Plan was 'n/a — no production runtime' but this code DOES run in production">
-- **Auditor checks that passed when they shouldn't have:** <e.g., "auditor verified tests ran, but spec didn't include a load test, so capacity issue wasn't tested for">
-
-→ Each of these maps to a workflow-improvement action item (see Action Items below).
-
----
-
-## Action items
-
-Each action item gets owned, dated, and either (a) becomes a `.mastermind/tasks/` spec or (b) becomes a CONTEXT.md update.
-
-| # | Action | Type | Owner | Due | Spec / CONTEXT entry |
-|---|---|---|---|---|---|
-| 1 | <Specific change — code, config, process, doc> | <code-fix \| context-md \| workflow-improvement \| process> | <person> | <YYYY-MM-DD> | <`.mastermind/tasks/NNN-name/spec.md` or "CONTEXT.md → Known gotchas">|
-| 2 | <…> | <…> | <…> | <…> | <…> |
-
-**Avoid action items like:**
-- ❌ "Be more careful when deploying" — not actionable
-- ❌ "Add monitoring" — too vague
-- ❌ "Train the team on X" — training without process change rarely sticks
-
-**Prefer action items like:**
-- ✓ "Add P99 latency alert on /api/messages at 200ms threshold via Datadog monitor — `.mastermind/tasks/NNN-add-messages-latency-alert/spec.md`"
-- ✓ "Add 'capacity test' as mandatory line in Performance Considerations section of spec-template — `.mastermind/tasks/NNN-spec-template-capacity/spec.md`"
-- ✓ "Add CONTEXT.md known-gotcha: 'Redis cluster mode silently drops MULTI on key migrations during rebalance'"
-
----
-
-## Feed forward to CONTEXT.md
-
-The following entries get appended to project `CONTEXT.md`:
-
-### Known gotchas (append)
-- **<one-line summary of the failure pattern>** — <bite scenario>. See `postmortems/<this file>`.
-
-### Don't-touch list (if applicable, append)
-- **`<path or symbol>`** — <constraint that emerged from this incident>
-
-### Decision log (if architecture changed, append)
-- **<YYYY-MM-DD> — <decision name>** — <one-sentence decision, why, alternatives rejected, source: postmortems/<this file>>
-
----
-
-## Unknowns
-
-*If root cause is partially or fully unknown, name what's unknown. This is honest — fabricating a cause is worse than admitting uncertainty.*
-
-- <Unknown 1 — e.g., "We don't yet know why the Redis client closed the connection at 14:32 specifically; logs are too sparse to tell">
-- <What we would need in order to find out: a debug log, a packet capture, a repro environment>
-
----
-
-## Sign-off
-
-- **Author:** <name>
-- **Reviewers:** <names who read this before publishing>
-- **Distribution:** <team / org / wider>