@brunosps00/dev-workflow 0.13.0 → 0.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (46)
  1. package/README.md +9 -3
  2. package/package.json +1 -1
  3. package/scaffold/en/commands/dw-bugfix.md +2 -1
  4. package/scaffold/en/commands/dw-code-review.md +1 -0
  5. package/scaffold/en/commands/dw-create-tasks.md +6 -0
  6. package/scaffold/en/commands/dw-deps-audit.md +1 -1
  7. package/scaffold/en/commands/dw-fix-qa.md +1 -1
  8. package/scaffold/en/commands/dw-functional-doc.md +1 -1
  9. package/scaffold/en/commands/dw-help.md +1 -1
  10. package/scaffold/en/commands/dw-redesign-ui.md +1 -1
  11. package/scaffold/en/commands/dw-run-qa.md +2 -1
  12. package/scaffold/en/commands/dw-run-task.md +1 -1
  13. package/scaffold/pt-br/commands/dw-bugfix.md +2 -1
  14. package/scaffold/pt-br/commands/dw-code-review.md +1 -0
  15. package/scaffold/pt-br/commands/dw-create-tasks.md +6 -0
  16. package/scaffold/pt-br/commands/dw-deps-audit.md +1 -1
  17. package/scaffold/pt-br/commands/dw-fix-qa.md +1 -1
  18. package/scaffold/pt-br/commands/dw-functional-doc.md +1 -1
  19. package/scaffold/pt-br/commands/dw-help.md +1 -1
  20. package/scaffold/pt-br/commands/dw-redesign-ui.md +1 -1
  21. package/scaffold/pt-br/commands/dw-run-qa.md +2 -1
  22. package/scaffold/pt-br/commands/dw-run-task.md +1 -1
  23. package/scaffold/skills/dw-incident-response/SKILL.md +164 -0
  24. package/scaffold/skills/dw-incident-response/references/blameless-discipline.md +126 -0
  25. package/scaffold/skills/dw-incident-response/references/communication-templates.md +107 -0
  26. package/scaffold/skills/dw-incident-response/references/postmortem-template.md +133 -0
  27. package/scaffold/skills/dw-incident-response/references/runbook-templates.md +169 -0
  28. package/scaffold/skills/dw-incident-response/references/severity-and-triage.md +186 -0
  29. package/scaffold/skills/dw-llm-eval/SKILL.md +148 -0
  30. package/scaffold/skills/dw-llm-eval/references/agent-eval.md +252 -0
  31. package/scaffold/skills/dw-llm-eval/references/judge-calibration.md +169 -0
  32. package/scaffold/skills/dw-llm-eval/references/oracle-ladder.md +171 -0
  33. package/scaffold/skills/dw-llm-eval/references/rag-metrics.md +186 -0
  34. package/scaffold/skills/dw-llm-eval/references/reference-dataset.md +190 -0
  35. package/scaffold/skills/dw-testing-discipline/SKILL.md +99 -76
  36. package/scaffold/skills/dw-testing-discipline/references/agent-guardrails.md +170 -0
  37. package/scaffold/skills/dw-testing-discipline/references/anti-patterns.md +6 -6
  38. package/scaffold/skills/dw-testing-discipline/references/core-rules.md +128 -0
  39. package/scaffold/skills/dw-testing-discipline/references/playwright-recipes.md +2 -2
  40. package/scaffold/skills/dw-ui-discipline/SKILL.md +101 -79
  41. package/scaffold/skills/dw-ui-discipline/references/hard-gate.md +93 -73
  42. package/scaffold/skills/dw-ui-discipline/references/visual-slop.md +152 -0
  43. package/scaffold/skills/dw-testing-discipline/references/ai-agent-gates.md +0 -170
  44. package/scaffold/skills/dw-testing-discipline/references/iron-laws.md +0 -128
  45. package/scaffold/skills/dw-ui-discipline/references/anti-slop.md +0 -162
  46. package/scaffold/skills/dw-testing-discipline/references/{positive-patterns.md → patterns.md} +0 -0

@@ -0,0 +1,107 @@ package/scaffold/skills/dw-incident-response/references/communication-templates.md

# Communication templates

Two messages get drafted during Phase 4: an initial notification (sent during the incident) and a resolution notification (sent at all-clear). Both go through the same channels — status page, support team, internal Slack/Teams, customer email if applicable.

## Initial notification

Sent as soon as the incident is acknowledged. For SEV-1/2, update on the cadence below until resolved.

```markdown
## Incident: <Title>

**Severity:** SEV-<X>
**Status:** Investigating | Mitigated | Resolved
**Impact:** <What users experience — be specific>
**Started:** YYYY-MM-DD HH:MM UTC
**Next update:** HH:MM UTC

We are aware of <symptom> affecting <surface>. The team is actively investigating.
<Optional: known workaround if any.>

We will post the next update by HH:MM UTC.
```

### Update cadence

- **SEV-1:** every 15 min until mitigated, then every 30 min until resolved.
- **SEV-2:** every 30 min throughout.
- **SEV-3:** every 2 hours.
- **SEV-4:** initial + resolution; no interim updates.
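
For teams that automate reminders, the cadence above can be encoded directly. A minimal sketch, taking the intervals from the list above (the mapping and function name are illustrative, not part of the workflow's tooling):

```python
from datetime import datetime, timedelta
from typing import Optional

# Interim-update interval per severity, in minutes, per the cadence above.
# SEV-1 uses the 15-min interval until mitigated; None means no interim updates.
UPDATE_INTERVAL_MIN = {1: 15, 2: 30, 3: 120, 4: None}

def next_update(severity: int, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None for SEV-4."""
    interval = UPDATE_INTERVAL_MIN[severity]
    if interval is None:
        return None  # SEV-4: initial + resolution only
    return last_update + timedelta(minutes=interval)
```

The "Next update" field in the template is then whatever this returns for the severity in effect.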

### Status progression

The `Status` field moves through three values:
1. **Investigating** — we know it's happening; cause unknown.
2. **Mitigated** — symptoms are reduced or contained; root cause may still be unresolved.
3. **Resolved** — symptoms gone, monitoring shows baseline, no follow-up expected.

### Tone

- Direct. No "we're experiencing a slight disruption."
- Specific. Say which feature; don't generalize.
- Honest. If you don't know the cause, say so.
- No blame. Don't mention which team owns the area in public-facing comms.

## Resolution notification

Sent when Phase 3 confirms recovery.

```markdown
## Resolved: <Title>

**Duration:** Xh Ym (from HH:MM to HH:MM UTC)
**Root cause:** <1-2 sentences, technical but accessible>
**Fix applied:** <what was done — link to deploy / PR if sharing it publicly is acceptable>
**Postmortem:** Will be published within 48 hours at <link>

We apologize for the disruption. Customers affected by <specific issue> can <specific recovery action if applicable>.

Questions: <support@example.com> or <#incident-channel>.
```

### What to include

- Specific duration in minutes/hours.
- Root cause that's accurate but readable by non-engineers.
- Concrete remedy (refund window, retry-now action, etc.) if customers need to do something.
- Postmortem link or commitment to publish.

### What NOT to include

- Names of individuals.
- Detailed stack traces or internal tool names.
- Speculation on prevention before the postmortem is done.
- Apologies that promise "this will never happen again" — incidents do recur; promise to learn.

## Internal-only versions

For internal tools without external users, drop the apology section and add:

```markdown
## Internal incident — <Title>

**Severity:** SEV-X
**Affected team(s):** <list>
**Duration:** Xh Ym
**Cause:** <technical>
**Fix:** <commit/deploy>
**Action items:** see postmortem at <link>
```

Shorter, more technical, no need to be customer-friendly.

## Status page updates

If using a status page (Statuspage, Cachet, Better Uptime, etc.):

- Component-level granularity: mark which specific service is degraded, not "platform down."
- Match the severity to the visual indicator (degraded vs partial outage vs major outage).
- Close the incident on the status page only after the resolution notification confirms recovery.

## When the communication discipline bends

- **Internal-only tools:** initial + resolution only; no 30-min updates needed.
- **Security incidents:** legal/compliance may need to approve every external message. Build that handoff into Phase 4.
- **Customer-specific incidents** (affects 1-2 enterprise customers, not the whole user base): direct contact to those customers replaces public status page.

Document the deviation in `04-communication.md`.

@@ -0,0 +1,133 @@ package/scaffold/skills/dw-incident-response/references/postmortem-template.md

# Postmortem template — blameless, action-oriented

Use this template for the Phase 5 output (`05-postmortem.md`). Every field is mandatory unless explicitly marked optional.

The discipline behind the template lives in `blameless-discipline.md` — read both.

## Template

```markdown
# Postmortem: <Incident title>

**Date:** YYYY-MM-DD
**Duration:** Xh Ym (from detection to all-clear)
**Severity:** SEV-X
**Authors:** @name1, @name2
**Status:** Draft | Reviewed | Published

## Summary

[2-3 sentences. What happened? What was the user-visible impact? How was it resolved?
No blame, no causes yet — just the observable narrative.]

## Timeline

All times UTC. Aim for ±5 min granularity; tighter for SEV-1/2.

| Time (UTC) | Event |
|------------|-------|
| HH:MM | First alert fired (or first user report) |
| HH:MM | On-call acknowledged |
| HH:MM | War room opened / incident channel created |
| HH:MM | Initial triage complete; severity confirmed |
| HH:MM | Hypothesis: <X> |
| HH:MM | Hypothesis rejected — observed <Y> |
| HH:MM | Root cause identified |
| HH:MM | Fix applied: <commit SHA / deploy> |
| HH:MM | Verified recovery (health checks green, error rate baseline) |
| HH:MM | All-clear confirmed; communications sent |

## Root cause

[Technical explanation. Walk through the chain:
1. What was the proximate cause? (the immediate trigger)
2. What was the underlying assumption that broke?
3. Why didn't existing safeguards catch it?

No blame. No "X did Y wrong." Frame as: "the system allowed Z because A was assumed."]

## Impact

- **Users affected:** N (counting method: <how we estimated>)
- **Geography:** [if applicable]
- **Revenue impact:** $X estimated (calculation: <method>)
- **Error budget consumed:** X% of monthly SLO
- **Data impact:** [data lost / corrupted? recoverable?]

## Detection

- **How was it detected?** [alert / user report / scheduled check]
- **Time to detection (TTD):** X minutes
- **Was the detection signal adequate?** [yes/no — and what would improve it]

## What went well

Concrete things to keep:
- [item — specific, not "we worked as a team"]
- [item]
- [item]

## What went wrong

Process / tooling / assumptions that failed:
- [item — specific failure mode, no blame]
- [item]
- [item]

## Where we got lucky

[Optional. What worked by accident that we shouldn't rely on next time?
e.g., "The bug only fired during business hours because of <X>; could have been overnight and worse."]

## Action items

Quality bar: each item has **owner**, **due date**, and **measurable outcome**. "Improve monitoring" does NOT count. Read `blameless-discipline.md` for the bar.

| Action | Owner | Due date | Priority | Tracking |
|--------|-------|----------|----------|----------|
| Add Datadog SLO alert at p99 > 800ms with PagerDuty routing | @bruno | 2026-06-01 | P1 | Linear ENG-1234 |
| Add idempotency key to /api/orders POST | @maria | 2026-05-25 | P1 | GitHub #5678 |
| Update runbook with new mitigation steps | @bruno | 2026-05-19 | P2 | Linear ENG-1235 |
| Constitution principle: every state-changing endpoint requires audit log | @team | 2026-06-15 | P3 | ADR-042 |

## Constitution implications

[Did this incident reveal a principle the team should commit to? If yes, an ADR
should be opened via `/dw-adr`. Examples:
- "Every write to billing tables requires audit log entry"
- "Database migrations must run in a separate maintenance window"
- "External API calls must have circuit breakers"

Link the ADR slug here.]

## Cross-incident pattern

[Have we seen this category of incident before? If yes, list the prior incidents.
3+ similar incidents = structural problem; needs design review, not just an action item.]

## References

- Incident channel: [Slack link]
- Diagnostic dashboards: [Grafana / Datadog links]
- Affected commits: [SHA range]
- Related ADRs: [list]
- Related runbook: `<path>` (updated in this incident if applicable)
```
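
The template's "Error budget consumed" field follows from simple arithmetic on the SLO target. A minimal sketch, assuming a fractional availability target such as 0.999 and a 30-day month:

```python
def error_budget_consumed(downtime_minutes: float, slo: float,
                          days_in_month: int = 30) -> float:
    """Percent of the monthly error budget consumed by this incident.

    slo is the availability target as a fraction, e.g. 0.999 for 99.9%.
    """
    total_minutes = days_in_month * 24 * 60
    budget_minutes = (1 - slo) * total_minutes  # allowed downtime per month
    return 100 * downtime_minutes / budget_minutes
```

For a 99.9% monthly SLO the budget is roughly 43.2 minutes, so a 21.6-minute outage consumes about half of it.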

## Quality checklist before publishing

- [ ] Summary is 2-3 sentences — not a wall of text.
- [ ] Timeline events have specific times (not "around lunch time").
- [ ] Root cause section explains the ASSUMPTION, not just the trigger.
- [ ] Every action item has owner + due date + measurable outcome.
- [ ] Action items are filed in the actual tracker (Linear/Jira/GitHub) with links.
- [ ] No names appear in "what went wrong" — only systems, processes, and assumptions.
- [ ] Cross-incident pattern section reviewed against `.dw/incidents/` for prior occurrences.
- [ ] If a constitution principle emerged, ADR opened via `/dw-adr` and linked.

## Publishing

- Postmortems for SEV-1/2 are shared with stakeholders within 48 hours.
- Postmortems for SEV-3 are reviewed in the next team retro.
- Action items track in the project's normal tooling — not in the postmortem file.
- The postmortem file in `.dw/incidents/` is the canonical write-up; everything else (Slack threads, status pages) eventually points back here.

@@ -0,0 +1,169 @@ package/scaffold/skills/dw-incident-response/references/runbook-templates.md

# Runbook templates + on-call handoff

Two contexts where this reference applies:
1. **Generating a runbook** (entry-mode 3 in the skill — no live incident, just producing operational docs).
2. **On-call handoff** at the end of a shift.

## Runbook: service outage template

````markdown
## Runbook — <Service Name> outage

### Quick diagnosis (< 5 min)

1. Health check: `curl -sf https://<service>/health | jq .`
2. Pod status (K8s): `kubectl get pods -l app=<service>`
3. Recent deploys: `kubectl rollout history deployment/<service>`
4. Recent logs: `kubectl logs -l app=<service> --tail=50 --since=5m`
5. Metrics dashboard: <Grafana / Datadog link>

### Common failure modes

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| OOMKilled | Memory leak or under-sized pod | Scale up replicas; increase memory limit; investigate leak |
| CrashLoopBackOff | Config error, missing env var | Check logs for startup errors; verify ConfigMap/Secret |
| ImagePullBackOff | Bad image tag or registry auth | Verify tag exists; check registry credentials |
| 503s from healthy pods | Downstream dependency down | Check the dependency (DB, cache, queue) |
| Slow responses (high p99) | DB connection saturation; N+1 query; cold cache | Check DB connections; tail slow query log; check cache hit rate |
| Latency spike after deploy | New code path slower than expected | Roll back; profile the new code path |

### Rollback procedure

```bash
# K8s
kubectl rollout undo deployment/<service>

# Confirm
kubectl rollout status deployment/<service>
```

### Escalation chain

- **L1 (on-call engineer):** <name / PagerDuty schedule>
- **L2 (team lead):** <name>
- **L3 (infrastructure / platform team):** <team channel>
- **L4 (CTO / VP Eng):** <name> (SEV-1 only)

### Known dependencies

- Database: <DB name + version>
- Cache: <Redis / Memcached>
- Queue: <SQS / RabbitMQ / Kafka>
- External APIs: <list with SLAs>

If a dependency is down, the service may degrade gracefully — see `<degraded-mode>` section in the architecture doc.

### Related runbooks

- `<dependency-A>` runbook
- `<related-service-B>` runbook
````

## Runbook: database incident template

````markdown
## Runbook — Database (<engine, version>) incident

### Quick diagnosis (Postgres examples; adapt for MySQL/etc.)

```sql
-- Active connections vs baseline
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Expected baseline: <N>; alert if > <2N>

-- Long-running queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;

-- Lock waits
SELECT * FROM pg_locks WHERE NOT granted;

-- Replication lag (if applicable)
SELECT * FROM pg_stat_replication;
```

### Common DB failure modes

| Symptom | Likely cause | Mitigation |
|---------|--------------|------------|
| Connection pool exhaustion | App leak; spike in traffic | Restart app pods to release connections; scale app; investigate leak |
| Lock contention | Long-running transaction blocking writers | Identify holder via pg_stat_activity; terminate if safe (`pg_terminate_backend`) |
| Disk full | Unbounded growth; WAL retention | Free space; review retention; vacuum |
| Replication lag | Network or replica overload | Check replica health; review primary write rate |
| Slow query (p99 spike) | Missing index; bad plan after stats change | `EXPLAIN ANALYZE` the slow query; check `pg_stat_user_indexes` |

### Emergency procedures

- **Terminate a stuck query:** `SELECT pg_terminate_backend(<pid>);` — only after confirming it's safe (it won't roll back distributed transactions cleanly).
- **Failover to replica:** `<runbook-specific commands>`.
- **Restore from snapshot:** see `<DR runbook>` — only for data corruption / loss.

### Escalation

- **DB on-call:** <name / schedule>
- **DBA team:** <channel>
- **Vendor support:** <ticket portal> (RDS, CloudSQL, etc.)
````

## On-call handoff template

End of every shift. Sent in the team's on-call channel + emailed to the incoming engineer.

```markdown
## On-call handoff — <Date>

**Outgoing:** @<name>
**Incoming:** @<name>

### Active incidents

- [None] OR list with status, severity, channel link

### Ongoing investigations (carried over)

- **<service/area>:** <one-line status, what's been tried, next step>
- **<issue>:** <what's blocking>

### Recent changes (last 24h)

- **<service>:** deploy at HH:MM UTC — <commit/PR>
- **<config>:** change at HH:MM — <what>
- **<infra>:** change at HH:MM — <what>

### Known issues (workarounds active)

- **<symptom>:** workaround: <action>. Tracking: <issue link>.

### Upcoming events

- **<date HH:MM UTC>:** scheduled maintenance — <service> — owner <name>
- **<date>:** traffic spike expected (launch, marketing campaign, etc.)

### Notes for the new shift

- [free-form: anything the incoming engineer should know that didn't fit above]
```

## Generating runbooks proactively

When the skill is invoked in entry-mode 3 (runbook generation, no live incident):

1. Ask: "Which service / area is the runbook for?"
2. Look up the service in `.dw/intel/` — pull the stack, dependencies, recent deploys.
3. Use the appropriate template above (service outage / DB / on-call handoff).
4. Fill in concrete commands, dashboard links, on-call info.
5. Save to `.dw/runbooks/<service-name>.md`.
6. Suggest adding the runbook path to the relevant `.dw/rules/<service>.md`.

Runbooks should be **executable** — every command listed must work without modification when copy-pasted by a tired engineer at 3am.
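
One way to keep that bar honest is to pull every fenced `bash` block out of a runbook file so the commands can be smoke-tested in CI. A sketch only; the regex assumes the fence style used in the templates above, and the function name is illustrative:

```python
import re

# Matches the body of every ```bash fence in a markdown runbook.
FENCE_RE = re.compile(r"```bash\n(.*?)```", re.DOTALL)

def extract_bash_blocks(markdown: str) -> list:
    """Return the contents of each bash fence, ready for a dry-run check."""
    return [m.group(1).strip() for m in FENCE_RE.finditer(markdown)]
```

A CI job could run each extracted command with `--dry-run` or against a staging cluster; even a syntax check via `bash -n` catches the worst rot.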

## Maintenance cadence

- Runbooks reviewed quarterly.
- Updated immediately after any incident where the runbook was wrong or missing.
- Owner per runbook = the team that owns the service.

A stale runbook is worse than no runbook (false confidence). If you can't keep one current, delete it and rely on the incident-response workflow alone.

@@ -0,0 +1,186 @@ package/scaffold/skills/dw-incident-response/references/severity-and-triage.md

# Severity classification + triage commands

## Severity criteria — extended version

### SEV-1 (Critical)

**Trigger:** any of
- Production service fully down or returning >50% errors.
- Data loss in progress or already occurred.
- Security breach detected (active credential leak, RCE, unauthorized data access).
- Compliance violation (PCI, HIPAA, GDPR exposure).

**Response:** page on-call immediately. CEO/CTO notified within 30 min if customer-facing. War room opened.

### SEV-2 (Major)

**Trigger:** any of
- Significant feature degradation affecting >25% of users.
- Latency 10× normal or higher.
- Background jobs queue backing up beyond SLA.
- Partial outage of a critical dependency (auth, payments, search).

**Response:** on-call investigates within 30 min. Stakeholders notified within 1 hour.

### SEV-3 (Minor)

**Trigger:** any of
- Single endpoint returning errors with a workaround available.
- Performance regression affecting <25% of users.
- Non-critical feature broken.

**Response:** investigated within 4 hours. Communicated in standup/Slack channel.

### SEV-4 (Low)

**Trigger:** any of
- Cosmetic issue.
- Non-user-facing bug.
- Minor UX papercut.

**Response:** logged as a normal bug. Fixed in next routine deploy.
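
As a rough first pass, the criteria above can be approximated in code. This is a sketch with illustrative flags and thresholds; the result is a starting guess that a human confirms, never the final call:

```python
def classify_severity(error_rate: float, users_affected_pct: float,
                      data_loss: bool = False, security_breach: bool = False,
                      user_facing: bool = True) -> int:
    """First-pass SEV guess from the criteria above; a human confirms it.

    error_rate is a fraction (0.5 = 50% of requests failing).
    """
    if data_loss or security_breach or error_rate > 0.5:
        return 1  # SEV-1: down, data loss, or breach
    if users_affected_pct > 25:
        return 2  # SEV-2: significant degradation
    if user_facing:
        return 3  # SEV-3: user-visible but limited
    return 4      # SEV-4: internal / cosmetic
```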

## Blast radius assessment

Before declaring severity, answer:
1. **Which services** are affected?
2. **How many users** are seeing the issue? Estimate from error rate × DAU.
3. **What's the revenue impact** per hour (if customer-facing)?
4. **What downstream systems** depend on the affected service?
5. **Is the issue spreading** or contained?

Document each in `01-triage.md`. The numbers go into the postmortem.
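
Questions 2 and 3 are simple arithmetic once the inputs are known. A sketch with illustrative names, where `error_rate` is the fraction of failing requests:

```python
def blast_radius_estimate(error_rate: float, dau: int,
                          revenue_per_hour: float) -> dict:
    """Rough blast-radius numbers for 01-triage.md.

    error_rate: fraction of requests failing (0.12 = 12%).
    """
    return {
        "users_affected_est": round(error_rate * dau),
        "revenue_impact_per_hour_est": round(error_rate * revenue_per_hour, 2),
    }
```

These are order-of-magnitude estimates for triage, not the final postmortem numbers; record the calculation method alongside them.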

## Triage commands by stack

### Kubernetes

```bash
# Find non-running pods
kubectl get pods -A | grep -v -E 'Running|Completed'

# Pods consuming most memory
kubectl top pods --sort-by=memory --all-namespaces | head -20

# Recent logs from a specific service
kubectl logs -l app=<service> --tail=100 --since=10m

# Recent deploys
kubectl rollout history deployment/<service>

# Rollback the most recent deploy
kubectl rollout undo deployment/<service>
```

### Docker / Docker Compose

```bash
# Containers that exited
docker ps -a --filter "status=exited" --format "table {{.Names}}\t{{.Status}}"

# Recent container logs
docker logs <container> --tail=100 --since=10m

# Container resource usage
docker stats --no-stream

# Recent compose deploys (assumes versioned compose files)
git log --oneline --since="24 hours ago" -- docker-compose.yml
```

### Generic (any service with HTTP health endpoint)

```bash
# Health check
curl -sf https://<host>/health | jq .

# Compare healthy vs unhealthy timing
time curl -sf https://<host>/health
time curl -sf https://<host>/api/<critical-endpoint>

# Recent application deploys (any git-tracked deployment)
git log --oneline --since="2 hours ago"

# Recent infra changes
git -C /path/to/infra log --oneline --since="24 hours ago"
```

### Database (Postgres)

```sql
-- Active connections (compare against normal baseline)
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Long-running queries (>30s)
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;

-- Lock contention
SELECT blocked_locks.pid AS blocked_pid,
       blocking_locks.pid AS blocking_pid,
       blocked_activity.query AS blocked_query,
       blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- Terminate a stuck query (last resort; investigate first)
SELECT pg_terminate_backend(<pid>);
```

### Generic application metrics

If your APM provides these (Datadog, New Relic, Sentry):
- Error rate per endpoint over the last hour vs the prior week's baseline.
- p50/p95/p99 latency per endpoint.
- Throughput (requests per second).
- Saturation (queue depth, connection pool usage).

The Google SRE "Four Golden Signals" (latency, errors, traffic, saturation) cover most cases.
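
The first metric, error rate against the prior week's baseline, reduces to a one-line comparison. A sketch; the names and the 2x threshold are illustrative, not taken from any particular APM:

```python
def error_rate_anomalous(cur_errors: int, cur_total: int,
                         base_errors: int, base_total: int,
                         factor: float = 2.0) -> bool:
    """True if the current error rate exceeds `factor` times the baseline rate.

    Guards against division by zero on quiet endpoints.
    """
    cur = cur_errors / max(cur_total, 1)
    base = base_errors / max(base_total, 1)
    return cur > factor * max(base, 1e-9)
```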

## Immediate mitigation patterns

Before debugging root cause, can you reduce blast radius now? Try in order:

1. **Rollback the most recent deploy** if timing matches.
2. **Toggle feature flag** for the affected feature.
3. **Redirect traffic** away from the unhealthy instance / region.
4. **Rate-limit** the abusive client, or cap the queue depth.
5. **Scale up** if it's resource starvation (CPU, memory, connection pool).

If none apply, escalate to the investigation phase. Don't burn time forcing a mitigation that doesn't fit.

## What to record in `01-triage.md`

```markdown
# Triage — <incident title>

## Detected
- **Time:** YYYY-MM-DD HH:MM UTC
- **Detected by:** alert / user report / monitoring dashboard / customer support
- **Severity:** SEV-X

## Symptoms
- [observable behavior]
- [error rates, latency, etc.]

## Blast radius
- Services: [list]
- Users affected: [estimate]
- Revenue impact: [if known]
- Downstream impact: [systems that depend on this]

## Initial mitigation
- [what was tried]
- [what worked / didn't]

## Next steps
Proceeding to Phase 2 (investigation) after user confirmation.
```