@brunosps00/dev-workflow 0.13.0 → 0.15.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +9 -3
- package/package.json +1 -1
- package/scaffold/en/commands/dw-bugfix.md +2 -1
- package/scaffold/en/commands/dw-code-review.md +1 -0
- package/scaffold/en/commands/dw-create-tasks.md +6 -0
- package/scaffold/en/commands/dw-deps-audit.md +1 -1
- package/scaffold/en/commands/dw-fix-qa.md +1 -1
- package/scaffold/en/commands/dw-functional-doc.md +1 -1
- package/scaffold/en/commands/dw-help.md +1 -1
- package/scaffold/en/commands/dw-redesign-ui.md +1 -1
- package/scaffold/en/commands/dw-run-qa.md +2 -1
- package/scaffold/en/commands/dw-run-task.md +1 -1
- package/scaffold/pt-br/commands/dw-bugfix.md +2 -1
- package/scaffold/pt-br/commands/dw-code-review.md +1 -0
- package/scaffold/pt-br/commands/dw-create-tasks.md +6 -0
- package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
- package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
- package/scaffold/pt-br/commands/dw-functional-doc.md +1 -1
- package/scaffold/pt-br/commands/dw-help.md +1 -1
- package/scaffold/pt-br/commands/dw-redesign-ui.md +1 -1
- package/scaffold/pt-br/commands/dw-run-qa.md +2 -1
- package/scaffold/pt-br/commands/dw-run-task.md +1 -1
- package/scaffold/skills/dw-incident-response/SKILL.md +164 -0
- package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
- package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
- package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
- package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
- package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
- package/scaffold/skills/dw-llm-eval/SKILL.md +148 -0
- package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
- package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
- package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
- package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
- package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
- package/scaffold/skills/dw-testing-discipline/SKILL.md +99 -76
- package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
- package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +6 -6
- package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
- package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +2 -2
- package/scaffold/skills/dw-ui-discipline/SKILL.md +101 -79
- package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +93 -73
- package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
- package/scaffold/skills/dw-testing-discipline/references/ai-agent-gates.md +0 -170
- package/scaffold/skills/dw-testing-discipline/references/iron-laws.md +0 -128
- package/scaffold/skills/dw-ui-discipline/references/anti-slop.md +0 -162
- package/scaffold/skills/dw-testing-discipline/references/{positive-patterns.md → patterns.md} +0 -0
package/scaffold/skills/dw-incident-response/references/communication-templates.md
@@ -0,0 +1,107 @@
# Communication templates

Two messages get drafted during Phase 4: an initial notification (sent during the incident) and a resolution notification (sent at all-clear). Both go through the same channels — status page, support team, internal Slack/Teams, customer email if applicable.

## Initial notification

Sent as soon as the incident is acknowledged; follow the update cadence below until resolved.

```markdown
## Incident: <Title>

**Severity:** SEV-<X>
**Status:** Investigating | Mitigated | Resolved
**Impact:** <What users experience — be specific>
**Started:** YYYY-MM-DD HH:MM UTC
**Next update:** HH:MM UTC

We are aware of <symptom> affecting <surface>. The team is actively investigating.
<Optional: known workaround if any.>

We will post the next update by HH:MM UTC.
```

### Update cadence

- **SEV-1:** every 15 min until mitigated, then every 30 min until resolved.
- **SEV-2:** every 30 min throughout.
- **SEV-3:** every 2 hours.
- **SEV-4:** initial + resolution; no interim updates.
### Status progression

The `Status` field moves through three values:
1. **Investigating** — we know it's happening; cause unknown.
2. **Mitigated** — symptoms are reduced or contained; root cause may still be unresolved.
3. **Resolved** — symptoms gone, monitoring shows baseline, no follow-up expected.

### Tone

- Direct. No "we're experiencing a slight disruption."
- Specific. Say which feature; don't generalize.
- Honest. If you don't know the cause, say so.
- No blame. Don't mention which team owns the area in public-facing comms.

## Resolution notification

Sent when Phase 3 confirms recovery.

```markdown
## Resolved: <Title>

**Duration:** Xh Ym (from HH:MM to HH:MM UTC)
**Root cause:** <1-2 sentences, technical but accessible>
**Fix applied:** <what was done — link to deploy / PR if public-facing isn't a concern>
**Postmortem:** Will be published within 48 hours at <link>

We apologize for the disruption. Customers affected by <specific issue> can <specific recovery action if applicable>.

Questions: <support@example.com> or <#incident-channel>.
```

### What to include

- Specific duration in minutes/hours.
- Root cause that's accurate but readable by non-engineers.
- Concrete remedy (refund window, retry-now action, etc.) if customers need to do something.
- Postmortem link or commitment to publish.

### What NOT to include

- Names of individuals.
- Detailed stack traces or internal tool names.
- Speculation on prevention before the postmortem is done.
- Apologies that promise "this will never happen again" — incidents do recur; promise to learn.

## Internal-only versions

For internal tools without external users, drop the apology section and add:

```markdown
## Internal incident — <Title>

**Severity:** SEV-X
**Affected team(s):** <list>
**Duration:** Xh Ym
**Cause:** <technical>
**Fix:** <commit/deploy>
**Action items:** see postmortem at <link>
```

Shorter, more technical, no need to be customer-friendly.

## Status page updates

If using a status page (Statuspage, Cachet, Better Uptime, etc.):

- Component-level granularity: mark which specific service is degraded, not "platform down."
- Match the severity to the visual indicator (degraded vs partial outage vs major outage).
- Close the incident on the status page only after the resolution notification confirms recovery.

## When the communication discipline bends

- **Internal-only tools:** initial + resolution only; no 30-min updates needed.
- **Security incidents:** legal/compliance may need to approve every external message. Build that handoff into Phase 4.
- **Customer-specific incidents** (affects 1-2 enterprise customers, not the whole user base): direct contact with those customers replaces the public status page.

Document the deviation in `04-communication.md`.
package/scaffold/skills/dw-incident-response/references/postmortem-template.md
@@ -0,0 +1,133 @@
# Postmortem template — blameless, action-oriented

Use this template for the Phase 5 output (`05-postmortem.md`). Every field is mandatory unless explicitly marked optional.

The discipline behind the template lives in `blameless-discipline.md` — read both.

## Template

```markdown
# Postmortem: <Incident title>

**Date:** YYYY-MM-DD
**Duration:** Xh Ym (from detection to all-clear)
**Severity:** SEV-X
**Authors:** @name1, @name2
**Status:** Draft | Reviewed | Published

## Summary

[2-3 sentences. What happened? What was the user-visible impact? How was it resolved?
No blame, no causes yet — just the observable narrative.]

## Timeline

All times UTC. Aim for ±5 min granularity; tighter for SEV-1/2.

| Time (UTC) | Event |
|------------|-------|
| HH:MM | First alert fired (or first user report) |
| HH:MM | On-call acknowledged |
| HH:MM | War room opened / incident channel created |
| HH:MM | Initial triage complete; severity confirmed |
| HH:MM | Hypothesis: <X> |
| HH:MM | Hypothesis rejected — observed <Y> |
| HH:MM | Root cause identified |
| HH:MM | Fix applied: <commit SHA / deploy> |
| HH:MM | Verified recovery (health checks green, error rate baseline) |
| HH:MM | All-clear confirmed; communications sent |

## Root cause

[Technical explanation. Walk through the chain:
1. What was the proximate cause? (the immediate trigger)
2. What was the underlying assumption that broke?
3. Why didn't existing safeguards catch it?

No blame. No "X did Y wrong." Frame as: "the system allowed Z because A was assumed."]

## Impact

- **Users affected:** N (counting method: <how we estimated>)
- **Geography:** [if applicable]
- **Revenue impact:** $X estimated (calculation: <method>)
- **Error budget consumed:** X% of monthly SLO
- **Data impact:** [data lost / corrupted? recoverable?]

## Detection

- **How was it detected?** [alert / user report / scheduled check]
- **Time to detection (TTD):** X minutes
- **Was the detection signal adequate?** [yes/no — and what would improve it]

## What went well

Concrete things to keep:
- [item — specific, not "we worked as a team"]
- [item]
- [item]

## What went wrong

Process / tooling / assumptions that failed:
- [item — specific failure mode, no blame]
- [item]
- [item]

## Where we got lucky

[Optional. What worked by accident that we shouldn't rely on next time?
e.g., "The bug only fired during business hours because of <X>; could have been overnight and worse."]

## Action items

Quality bar: each item has **owner**, **due date**, and **measurable outcome**. "Improve monitoring" does NOT count. Read `blameless-discipline.md` for the bar.

| Action | Owner | Due date | Priority | Tracking |
|--------|-------|----------|----------|----------|
| Add Datadog SLO alert at p99 > 800ms with PagerDuty routing | @bruno | 2026-06-01 | P1 | Linear ENG-1234 |
| Add idempotency key to /api/orders POST | @maria | 2026-05-25 | P1 | GitHub #5678 |
| Update runbook with new mitigation steps | @bruno | 2026-05-19 | P2 | Linear ENG-1235 |
| Constitution principle: every state-changing endpoint requires audit log | @team | 2026-06-15 | P3 | ADR-042 |

## Constitution implications

[Did this incident reveal a principle the team should commit to? If yes, an ADR
should be opened via `/dw-adr`. Examples:
- "Every write to billing tables requires audit log entry"
- "Database migrations must run in a separate maintenance window"
- "External API calls must have circuit breakers"

Link the ADR slug here.]

## Cross-incident pattern

[Have we seen this category of incident before? If yes, list the prior incidents.
3+ similar incidents = structural problem; needs design review, not just an action item.]

## References

- Incident channel: [Slack link]
- Diagnostic dashboards: [Grafana / Datadog links]
- Affected commits: [SHA range]
- Related ADRs: [list]
- Related runbook: `<path>` (updated in this incident if applicable)
```

## Quality checklist before publishing

- [ ] Summary is 2-3 sentences — not a wall of text.
- [ ] Timeline events have specific times (not "around lunch time").
- [ ] Root cause section explains the ASSUMPTION, not just the trigger.
- [ ] Every action item has owner + due date + measurable outcome.
- [ ] Action items are filed in the actual tracker (Linear/Jira/GitHub) with links.
- [ ] No names appear in "what went wrong" — only systems, processes, and assumptions.
- [ ] Cross-incident pattern section reviewed against `.dw/incidents/` for prior occurrences.
- [ ] If a constitution principle emerged, ADR opened via `/dw-adr` and linked.
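The action-item bar (owner, concrete due date, tracker link, measurable outcome) is checkable by machine. A sketch of such a lint, assuming this row shape — it is illustrative, not part of the skill:

```python
import re

# Vague opening verbs that fail the "measurable outcome" bar on their own.
VAGUE = re.compile(r"^(improve|investigate|look into|consider)\b", re.IGNORECASE)

def lint_action_item(action: str, owner: str, due: str, tracking: str) -> list[str]:
    """Return the problems with one action-item row; an empty list means it passes."""
    problems = []
    if not owner.startswith("@"):
        problems.append("owner must be a person or team handle (@name)")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", due):
        problems.append("due date must be a concrete YYYY-MM-DD date")
    if not tracking.strip():
        problems.append("must link a tracker issue (Linear/Jira/GitHub)")
    if VAGUE.match(action.strip()):
        problems.append("outcome is vague — name the measurable change")
    return problems
```

Run it over every table row before moving the postmortem from Draft to Reviewed; "Improve monitoring" with no tracker link fails on two counts.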
## Publishing

- Postmortems for SEV-1/2 are shared with stakeholders within 48 hours.
- Postmortems for SEV-3 are reviewed in the next team retro.
- Action items are tracked in the project's normal tooling — not in the postmortem file.
- The postmortem file in `.dw/incidents/` is the canonical write-up; everything else (Slack threads, status pages) eventually points back here.
package/scaffold/skills/dw-incident-response/references/runbook-templates.md
@@ -0,0 +1,169 @@
# Runbook templates + on-call handoff

Two contexts where this reference applies:
1. **Generating a runbook** (entry-mode 3 in the skill — no live incident, just producing operational docs).
2. **On-call handoff** at the end of a shift.

## Runbook: service outage template

````markdown
## Runbook — <Service Name> outage

### Quick diagnosis (< 5 min)

1. Health check: `curl -sf https://<service>/health | jq .`
2. Pod status (K8s): `kubectl get pods -l app=<service>`
3. Recent deploys: `kubectl rollout history deployment/<service>`
4. Recent logs: `kubectl logs -l app=<service> --tail=50 --since=5m`
5. Metrics dashboard: <Grafana / Datadog link>

### Common failure modes

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| OOMKilled | Memory leak or under-sized pod | Scale up replicas; increase memory limit; investigate leak |
| CrashLoopBackOff | Config error, missing env var | Check logs for startup errors; verify ConfigMap/Secret |
| ImagePullBackOff | Bad image tag or registry auth | Verify tag exists; check registry credentials |
| 503s from healthy pods | Downstream dependency down | Check the dependency (DB, cache, queue) |
| Slow responses (high p99) | DB connection saturation; N+1 query; cold cache | Check DB connections; tail slow query log; check cache hit rate |
| Latency spike after deploy | New code path slower than expected | Roll back; profile the new code path |

### Rollback procedure

```bash
# K8s
kubectl rollout undo deployment/<service>

# Confirm
kubectl rollout status deployment/<service>
```

### Escalation chain

- **L1 (on-call engineer):** <name / PagerDuty schedule>
- **L2 (team lead):** <name>
- **L3 (infrastructure / platform team):** <team channel>
- **L4 (CTO / VP Eng):** <name> (SEV-1 only)

### Known dependencies

- Database: <DB name + version>
- Cache: <Redis / Memcached>
- Queue: <SQS / RabbitMQ / Kafka>
- External APIs: <list with SLAs>

If a dependency is down, the service may degrade gracefully — see the `<degraded-mode>` section in the architecture doc.

### Related runbooks

- `<dependency-A>` runbook
- `<related-service-B>` runbook
````

## Runbook: database incident template

````markdown
## Runbook — Database (<engine, version>) incident

### Quick diagnosis (Postgres examples; adapt for MySQL/etc.)

```sql
-- Active connections vs baseline
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Expected baseline: <N>; alert if > <2N>

-- Long-running queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;

-- Lock waits
SELECT * FROM pg_locks WHERE NOT granted;

-- Replication lag (if applicable)
SELECT * FROM pg_stat_replication;
```

### Common DB failure modes

| Symptom | Likely cause | Mitigation |
|---------|--------------|------------|
| Connection pool exhaustion | App leak; spike in traffic | Restart app pods to release connections; scale app; investigate leak |
| Lock contention | Long-running transaction blocking writers | Identify holder via pg_stat_activity; terminate if safe (`pg_terminate_backend`) |
| Disk full | Unbounded growth; WAL retention | Free space; review retention; vacuum |
| Replication lag | Network or replica overload | Check replica health; review primary write rate |
| Slow query (p99 spike) | Missing index; bad plan after stats change | `EXPLAIN ANALYZE` the slow query; check `pg_stat_user_indexes` |

### Emergency procedures

- **Terminate a stuck query:** `SELECT pg_terminate_backend(<pid>);` — only after confirming it's safe (it won't roll back distributed transactions cleanly).
- **Failover to replica:** `<runbook-specific commands>`.
- **Restore from snapshot:** see `<DR runbook>` — only for data corruption / loss.

### Escalation

- **DB on-call:** <name / schedule>
- **DBA team:** <channel>
- **Vendor support:** <ticket portal> (RDS, CloudSQL, etc.)
````

## On-call handoff template

Sent at the end of every shift in the team's on-call channel and emailed to the incoming engineer.

```markdown
## On-call handoff — <Date>

**Outgoing:** @<name>
**Incoming:** @<name>

### Active incidents

- [None] OR list with status, severity, channel link

### Ongoing investigations (carried over)

- **<service/area>:** <one-line status, what's been tried, next step>
- **<issue>:** <what's blocking>

### Recent changes (last 24h)

- **<service>:** deploy at HH:MM UTC — <commit/PR>
- **<config>:** change at HH:MM — <what>
- **<infra>:** change at HH:MM — <what>

### Known issues (workarounds active)

- **<symptom>:** workaround: <action>. Tracking: <issue link>.

### Upcoming events

- **<date HH:MM UTC>:** scheduled maintenance — <service> — owner <name>
- **<date>:** traffic spike expected (launch, marketing campaign, etc.)

### Notes for the new shift

- [free-form: anything the incoming engineer should know that didn't fit above]
```

## Generating runbooks proactively

When the skill is invoked in entry-mode 3 (runbook generation, no live incident):

1. Ask: "Which service / area is the runbook for?"
2. Look up the service in `.dw/intel/` — pull the stack, dependencies, recent deploys.
3. Use the appropriate template above (service outage / DB / on-call handoff).
4. Fill in concrete commands, dashboard links, on-call info.
5. Save to `.dw/runbooks/<service-name>.md`.
6. Suggest adding the runbook path to the relevant `.dw/rules/<service>.md`.

Runbooks should be **executable** — every command listed must work without modification when copy-pasted by a tired engineer at 3 a.m.
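Steps 4-5 above reduce to rendering a template and writing one file under `.dw/runbooks/`. A minimal sketch of that, with a deliberately abbreviated template string (the paths follow the skill's conventions; the helper itself is illustrative):

```python
from pathlib import Path

# Abbreviated service-outage template; the real one carries the full sections.
TEMPLATE = """\
## Runbook — {service} outage

### Quick diagnosis (< 5 min)

1. Health check: `curl -sf https://{service}/health | jq .`
2. Pod status (K8s): `kubectl get pods -l app={service}`
3. Recent deploys: `kubectl rollout history deployment/{service}`
"""

def write_runbook(service: str, root: Path = Path(".")) -> Path:
    """Render the template and save it to .dw/runbooks/<service>.md (step 5)."""
    out_dir = root / ".dw" / "runbooks"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{service}.md"
    path.write_text(TEMPLATE.format(service=service), encoding="utf-8")
    return path
```

Filling `{service}` everywhere is what makes the result copy-paste executable rather than a fill-in-the-blanks form.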
## Maintenance cadence

- Runbooks reviewed quarterly.
- Updated immediately after any incident where the runbook was wrong or missing.
- Owner per runbook = the team that owns the service.

A stale runbook is worse than no runbook (false confidence). If you can't keep one current, delete it and rely on the incident-response workflow alone.
package/scaffold/skills/dw-incident-response/references/severity-and-triage.md
@@ -0,0 +1,186 @@
# Severity classification + triage commands

## Severity criteria — extended version

### SEV-1 (Critical)

**Trigger:** any of
- Production service fully down or returning >50% errors.
- Data loss in progress or already occurred.
- Security breach detected (active credential leak, RCE, unauthorized data access).
- Compliance violation (PCI, HIPAA, GDPR exposure).

**Response:** page on-call immediately. CEO/CTO notified within 30 min if customer-facing. War room opened.

### SEV-2 (Major)

**Trigger:** any of
- Significant feature degradation affecting >25% of users.
- Latency 10× normal or higher.
- Background jobs queue backing up beyond SLA.
- Partial outage of a critical dependency (auth, payments, search).

**Response:** on-call investigates within 30 min. Stakeholders notified within 1 hour.

### SEV-3 (Minor)

**Trigger:** any of
- Single endpoint returning errors with a workaround available.
- Performance regression affecting <25% of users.
- Non-critical feature broken.

**Response:** investigated within 4 hours. Communicated in standup/Slack channel.

### SEV-4 (Low)

**Trigger:** any of
- Cosmetic issue.
- Non-user-facing bug.
- Minor UX papercut.

**Response:** logged as a normal bug. Fixed in the next routine deploy.
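Read together, the four levels form a decision ladder: walk from SEV-1 down and take the first level whose trigger matches. A sketch of that ladder under assumed signal names (the field names are illustrative, not from the package):

```python
from dataclasses import dataclass

@dataclass
class Signals:
    error_rate: float          # fraction of requests failing, 0.0-1.0
    users_affected: float      # fraction of users seeing the issue, 0.0-1.0
    latency_multiplier: float  # current p99 divided by normal p99
    data_loss: bool = False
    security_breach: bool = False
    workaround_available: bool = False

def classify(s: Signals) -> int:
    """Return the SEV level (1-4), checking the most severe triggers first."""
    if s.error_rate > 0.5 or s.data_loss or s.security_breach:
        return 1
    if s.users_affected > 0.25 or s.latency_multiplier >= 10:
        return 2
    if s.users_affected > 0 or (s.error_rate > 0 and s.workaround_available):
        return 3
    return 4
```

The ordering matters: a security breach with a tiny error rate is still SEV-1, which a bottom-up check would misclassify.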
## Blast radius assessment

Before declaring severity, answer:
1. **Which services** are affected?
2. **How many users** are seeing the issue? Estimate from error rate × DAU.
3. **What's the revenue impact** per hour (if customer-facing)?
4. **What downstream systems** depend on the affected service?
5. **Is the issue spreading** or contained?

Document each in `01-triage.md`. The numbers go into the postmortem.
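Item 2's "error rate × DAU" and item 3's hourly revenue figure are back-of-envelope arithmetic worth doing the same way every time. A sketch, where all three inputs are assumptions you supply at triage time:

```python
def blast_radius(error_rate: float, dau: int, revenue_per_day: float) -> dict:
    """Rough affected-user and revenue-at-risk estimates for 01-triage.md.

    error_rate is the fraction of requests failing (0.0-1.0), used as a
    proxy for the fraction of users hitting the issue.
    """
    users_affected = round(error_rate * dau)
    # Assumes revenue is roughly uniform across the day and proportional to
    # the affected fraction — a triage estimate, not accounting.
    revenue_per_hour_at_risk = revenue_per_day / 24 * error_rate
    return {
        "users_affected": users_affected,
        "revenue_per_hour_at_risk": round(revenue_per_hour_at_risk, 2),
    }
```

For example, a 10% error rate against 50k DAU and $240k/day revenue puts roughly 5,000 users and $1,000/hour at risk — rough numbers, but consistent ones across incidents.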
## Triage commands by stack

### Kubernetes

```bash
# Find non-running pods
kubectl get pods -A | grep -v -E 'Running|Completed'

# Pods consuming the most memory
kubectl top pods --sort-by=memory --all-namespaces | head -20

# Recent logs from a specific service
kubectl logs -l app=<service> --tail=100 --since=10m

# Recent deploys
kubectl rollout history deployment/<service>

# Roll back the most recent deploy
kubectl rollout undo deployment/<service>
```

### Docker / Docker Compose

```bash
# Containers that exited
docker ps -a --filter "status=exited" --format "table {{.Names}}\t{{.Status}}"

# Recent container logs
docker logs <container> --tail=100 --since=10m

# Container resource usage
docker stats --no-stream

# Recent compose deploys (assumes versioned compose files)
git log --oneline --since="24 hours ago" -- docker-compose.yml
```

### Generic (any service with an HTTP health endpoint)

```bash
# Health check
curl -sf https://<host>/health | jq .

# Compare healthy vs unhealthy timing
time curl -sf https://<host>/health
time curl -sf https://<host>/api/<critical-endpoint>

# Recent application deploys (any git-tracked deployment)
git log --oneline --since="2 hours ago"

# Recent infra changes
git -C /path/to/infra log --oneline --since="24 hours ago"
```

### Database (Postgres)

```sql
-- Active connections (compare against normal baseline)
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Long-running queries (>30s)
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;

-- Lock contention
SELECT blocked_locks.pid AS blocked_pid,
       blocking_locks.pid AS blocking_pid,
       blocked_activity.query AS blocked_query,
       blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
     AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
     AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
     AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- Terminate a stuck query (last resort; investigate first)
SELECT pg_terminate_backend(<pid>);
```

### Generic application metrics

If your APM provides these (Datadog, New Relic, Sentry):
- Error rate per endpoint over the last hour vs the prior week's baseline.
- p50/p95/p99 latency per endpoint.
- Throughput (requests per second).
- Saturation (queue depth, connection pool usage).

The Google SRE "Four Golden Signals" (latency, errors, traffic, saturation) cover most cases.

## Immediate mitigation patterns

Before debugging the root cause, ask: can you reduce the blast radius now? Try in order:

1. **Roll back the most recent deploy** if the timing matches.
2. **Toggle the feature flag** for the affected feature.
3. **Redirect traffic** away from the unhealthy instance / region.
4. **Rate-limit** the abusive client or cap the queue depth.
5. **Scale up** if it's resource starvation (CPU, memory, connection pool).

If none apply, escalate to the investigation phase. Don't burn time forcing a mitigation that doesn't fit.
## What to record in `01-triage.md`

```markdown
# Triage — <incident title>

## Detected
- **Time:** YYYY-MM-DD HH:MM UTC
- **Detected by:** alert / user report / monitoring dashboard / customer support
- **Severity:** SEV-X

## Symptoms
- [observable behavior]
- [error rates, latency, etc.]

## Blast radius
- Services: [list]
- Users affected: [estimate]
- Revenue impact: [if known]
- Downstream impact: [systems that depend on this]

## Initial mitigation
- [what was tried]
- [what worked / didn't]

## Next steps
Proceeding to Phase 2 (investigation) after user confirmation.
```