@shaykec/bridge 0.4.25 → 0.4.26
- package/journeys/ai-engineer.yaml +34 -0
- package/journeys/backend-developer.yaml +36 -0
- package/journeys/business-analyst.yaml +37 -0
- package/journeys/devops-engineer.yaml +37 -0
- package/journeys/engineering-manager.yaml +44 -0
- package/journeys/frontend-developer.yaml +41 -0
- package/journeys/fullstack-developer.yaml +49 -0
- package/journeys/mobile-developer.yaml +42 -0
- package/journeys/product-manager.yaml +35 -0
- package/journeys/qa-engineer.yaml +37 -0
- package/journeys/ux-designer.yaml +43 -0
- package/modules/README.md +52 -0
- package/modules/accessibility-fundamentals/content.md +126 -0
- package/modules/accessibility-fundamentals/exercises.md +88 -0
- package/modules/accessibility-fundamentals/module.yaml +43 -0
- package/modules/accessibility-fundamentals/quick-ref.md +71 -0
- package/modules/accessibility-fundamentals/quiz.md +100 -0
- package/modules/accessibility-fundamentals/resources.md +29 -0
- package/modules/accessibility-fundamentals/walkthrough.md +80 -0
- package/modules/adr-writing/content.md +121 -0
- package/modules/adr-writing/exercises.md +81 -0
- package/modules/adr-writing/module.yaml +41 -0
- package/modules/adr-writing/quick-ref.md +57 -0
- package/modules/adr-writing/quiz.md +73 -0
- package/modules/adr-writing/resources.md +29 -0
- package/modules/adr-writing/walkthrough.md +64 -0
- package/modules/ai-agents/content.md +120 -0
- package/modules/ai-agents/exercises.md +82 -0
- package/modules/ai-agents/module.yaml +42 -0
- package/modules/ai-agents/quick-ref.md +60 -0
- package/modules/ai-agents/quiz.md +103 -0
- package/modules/ai-agents/resources.md +30 -0
- package/modules/ai-agents/walkthrough.md +85 -0
- package/modules/ai-assisted-research/content.md +136 -0
- package/modules/ai-assisted-research/exercises.md +80 -0
- package/modules/ai-assisted-research/module.yaml +42 -0
- package/modules/ai-assisted-research/quick-ref.md +67 -0
- package/modules/ai-assisted-research/quiz.md +73 -0
- package/modules/ai-assisted-research/resources.md +33 -0
- package/modules/ai-assisted-research/walkthrough.md +85 -0
- package/modules/ai-pair-programming/content.md +105 -0
- package/modules/ai-pair-programming/exercises.md +98 -0
- package/modules/ai-pair-programming/module.yaml +39 -0
- package/modules/ai-pair-programming/quick-ref.md +58 -0
- package/modules/ai-pair-programming/quiz.md +73 -0
- package/modules/ai-pair-programming/resources.md +34 -0
- package/modules/ai-pair-programming/walkthrough.md +117 -0
- package/modules/ai-test-generation/content.md +125 -0
- package/modules/ai-test-generation/exercises.md +98 -0
- package/modules/ai-test-generation/module.yaml +39 -0
- package/modules/ai-test-generation/quick-ref.md +65 -0
- package/modules/ai-test-generation/quiz.md +74 -0
- package/modules/ai-test-generation/resources.md +41 -0
- package/modules/ai-test-generation/walkthrough.md +100 -0
- package/modules/api-design/content.md +189 -0
- package/modules/api-design/exercises.md +84 -0
- package/modules/api-design/game.yaml +113 -0
- package/modules/api-design/module.yaml +45 -0
- package/modules/api-design/quick-ref.md +73 -0
- package/modules/api-design/quiz.md +100 -0
- package/modules/api-design/resources.md +55 -0
- package/modules/api-design/walkthrough.md +88 -0
- package/modules/clean-code/content.md +136 -0
- package/modules/clean-code/exercises.md +137 -0
- package/modules/clean-code/game.yaml +172 -0
- package/modules/clean-code/module.yaml +44 -0
- package/modules/clean-code/quick-ref.md +44 -0
- package/modules/clean-code/quiz.md +105 -0
- package/modules/clean-code/resources.md +40 -0
- package/modules/clean-code/walkthrough.md +78 -0
- package/modules/clean-code/workshop.yaml +149 -0
- package/modules/code-review/content.md +130 -0
- package/modules/code-review/exercises.md +95 -0
- package/modules/code-review/game.yaml +83 -0
- package/modules/code-review/module.yaml +42 -0
- package/modules/code-review/quick-ref.md +77 -0
- package/modules/code-review/quiz.md +105 -0
- package/modules/code-review/resources.md +40 -0
- package/modules/code-review/walkthrough.md +106 -0
- package/modules/daily-workflow/content.md +81 -0
- package/modules/daily-workflow/exercises.md +50 -0
- package/modules/daily-workflow/module.yaml +33 -0
- package/modules/daily-workflow/quick-ref.md +37 -0
- package/modules/daily-workflow/quiz.md +65 -0
- package/modules/daily-workflow/resources.md +38 -0
- package/modules/daily-workflow/walkthrough.md +83 -0
- package/modules/debugging-systematically/content.md +139 -0
- package/modules/debugging-systematically/exercises.md +91 -0
- package/modules/debugging-systematically/module.yaml +46 -0
- package/modules/debugging-systematically/quick-ref.md +59 -0
- package/modules/debugging-systematically/quiz.md +105 -0
- package/modules/debugging-systematically/resources.md +42 -0
- package/modules/debugging-systematically/walkthrough.md +84 -0
- package/modules/debugging-systematically/workshop.yaml +127 -0
- package/modules/demo-test/content.md +68 -0
- package/modules/demo-test/exercises.md +28 -0
- package/modules/demo-test/game.yaml +171 -0
- package/modules/demo-test/module.yaml +41 -0
- package/modules/demo-test/quick-ref.md +54 -0
- package/modules/demo-test/quiz.md +74 -0
- package/modules/demo-test/resources.md +21 -0
- package/modules/demo-test/walkthrough.md +122 -0
- package/modules/demo-test/workshop.yaml +31 -0
- package/modules/design-critique/content.md +93 -0
- package/modules/design-critique/exercises.md +71 -0
- package/modules/design-critique/module.yaml +41 -0
- package/modules/design-critique/quick-ref.md +63 -0
- package/modules/design-critique/quiz.md +73 -0
- package/modules/design-critique/resources.md +27 -0
- package/modules/design-critique/walkthrough.md +68 -0
- package/modules/design-patterns/content.md +335 -0
- package/modules/design-patterns/exercises.md +82 -0
- package/modules/design-patterns/game.yaml +55 -0
- package/modules/design-patterns/module.yaml +45 -0
- package/modules/design-patterns/quick-ref.md +44 -0
- package/modules/design-patterns/quiz.md +101 -0
- package/modules/design-patterns/resources.md +40 -0
- package/modules/design-patterns/walkthrough.md +64 -0
- package/modules/exploratory-testing/content.md +133 -0
- package/modules/exploratory-testing/exercises.md +88 -0
- package/modules/exploratory-testing/module.yaml +41 -0
- package/modules/exploratory-testing/quick-ref.md +68 -0
- package/modules/exploratory-testing/quiz.md +75 -0
- package/modules/exploratory-testing/resources.md +39 -0
- package/modules/exploratory-testing/walkthrough.md +87 -0
- package/modules/git/content.md +128 -0
- package/modules/git/exercises.md +53 -0
- package/modules/git/game.yaml +190 -0
- package/modules/git/module.yaml +44 -0
- package/modules/git/quick-ref.md +67 -0
- package/modules/git/quiz.md +89 -0
- package/modules/git/resources.md +49 -0
- package/modules/git/walkthrough.md +92 -0
- package/modules/git/workshop.yaml +145 -0
- package/modules/hiring-interviews/content.md +130 -0
- package/modules/hiring-interviews/exercises.md +88 -0
- package/modules/hiring-interviews/module.yaml +41 -0
- package/modules/hiring-interviews/quick-ref.md +68 -0
- package/modules/hiring-interviews/quiz.md +73 -0
- package/modules/hiring-interviews/resources.md +36 -0
- package/modules/hiring-interviews/walkthrough.md +75 -0
- package/modules/hooks/content.md +97 -0
- package/modules/hooks/exercises.md +69 -0
- package/modules/hooks/module.yaml +39 -0
- package/modules/hooks/quick-ref.md +93 -0
- package/modules/hooks/quiz.md +81 -0
- package/modules/hooks/resources.md +34 -0
- package/modules/hooks/walkthrough.md +105 -0
- package/modules/hooks/workshop.yaml +64 -0
- package/modules/incident-response/content.md +124 -0
- package/modules/incident-response/exercises.md +82 -0
- package/modules/incident-response/game.yaml +132 -0
- package/modules/incident-response/module.yaml +45 -0
- package/modules/incident-response/quick-ref.md +53 -0
- package/modules/incident-response/quiz.md +103 -0
- package/modules/incident-response/resources.md +40 -0
- package/modules/incident-response/walkthrough.md +82 -0
- package/modules/llm-fundamentals/content.md +114 -0
- package/modules/llm-fundamentals/exercises.md +83 -0
- package/modules/llm-fundamentals/module.yaml +42 -0
- package/modules/llm-fundamentals/quick-ref.md +64 -0
- package/modules/llm-fundamentals/quiz.md +103 -0
- package/modules/llm-fundamentals/resources.md +30 -0
- package/modules/llm-fundamentals/walkthrough.md +91 -0
- package/modules/one-on-ones/content.md +133 -0
- package/modules/one-on-ones/exercises.md +81 -0
- package/modules/one-on-ones/module.yaml +44 -0
- package/modules/one-on-ones/quick-ref.md +67 -0
- package/modules/one-on-ones/quiz.md +73 -0
- package/modules/one-on-ones/resources.md +37 -0
- package/modules/one-on-ones/walkthrough.md +69 -0
- package/modules/package.json +9 -0
- package/modules/prioritization-frameworks/content.md +130 -0
- package/modules/prioritization-frameworks/exercises.md +93 -0
- package/modules/prioritization-frameworks/module.yaml +41 -0
- package/modules/prioritization-frameworks/quick-ref.md +77 -0
- package/modules/prioritization-frameworks/quiz.md +73 -0
- package/modules/prioritization-frameworks/resources.md +32 -0
- package/modules/prioritization-frameworks/walkthrough.md +69 -0
- package/modules/prompt-engineering/content.md +123 -0
- package/modules/prompt-engineering/exercises.md +82 -0
- package/modules/prompt-engineering/game.yaml +101 -0
- package/modules/prompt-engineering/module.yaml +45 -0
- package/modules/prompt-engineering/quick-ref.md +65 -0
- package/modules/prompt-engineering/quiz.md +105 -0
- package/modules/prompt-engineering/resources.md +36 -0
- package/modules/prompt-engineering/walkthrough.md +81 -0
- package/modules/rag-fundamentals/content.md +111 -0
- package/modules/rag-fundamentals/exercises.md +80 -0
- package/modules/rag-fundamentals/module.yaml +45 -0
- package/modules/rag-fundamentals/quick-ref.md +58 -0
- package/modules/rag-fundamentals/quiz.md +75 -0
- package/modules/rag-fundamentals/resources.md +34 -0
- package/modules/rag-fundamentals/walkthrough.md +75 -0
- package/modules/react-fundamentals/content.md +140 -0
- package/modules/react-fundamentals/exercises.md +81 -0
- package/modules/react-fundamentals/game.yaml +145 -0
- package/modules/react-fundamentals/module.yaml +45 -0
- package/modules/react-fundamentals/quick-ref.md +62 -0
- package/modules/react-fundamentals/quiz.md +106 -0
- package/modules/react-fundamentals/resources.md +42 -0
- package/modules/react-fundamentals/walkthrough.md +89 -0
- package/modules/react-fundamentals/workshop.yaml +112 -0
- package/modules/react-native-fundamentals/content.md +141 -0
- package/modules/react-native-fundamentals/exercises.md +79 -0
- package/modules/react-native-fundamentals/module.yaml +42 -0
- package/modules/react-native-fundamentals/quick-ref.md +60 -0
- package/modules/react-native-fundamentals/quiz.md +61 -0
- package/modules/react-native-fundamentals/resources.md +24 -0
- package/modules/react-native-fundamentals/walkthrough.md +84 -0
- package/modules/registry.yaml +1650 -0
- package/modules/risk-management/content.md +162 -0
- package/modules/risk-management/exercises.md +86 -0
- package/modules/risk-management/module.yaml +41 -0
- package/modules/risk-management/quick-ref.md +82 -0
- package/modules/risk-management/quiz.md +73 -0
- package/modules/risk-management/resources.md +40 -0
- package/modules/risk-management/walkthrough.md +67 -0
- package/modules/running-effective-standups/content.md +119 -0
- package/modules/running-effective-standups/exercises.md +79 -0
- package/modules/running-effective-standups/module.yaml +40 -0
- package/modules/running-effective-standups/quick-ref.md +61 -0
- package/modules/running-effective-standups/quiz.md +73 -0
- package/modules/running-effective-standups/resources.md +36 -0
- package/modules/running-effective-standups/walkthrough.md +76 -0
- package/modules/solid-principles/content.md +154 -0
- package/modules/solid-principles/exercises.md +107 -0
- package/modules/solid-principles/module.yaml +42 -0
- package/modules/solid-principles/quick-ref.md +50 -0
- package/modules/solid-principles/quiz.md +102 -0
- package/modules/solid-principles/resources.md +39 -0
- package/modules/solid-principles/walkthrough.md +84 -0
- package/modules/sprint-planning/content.md +142 -0
- package/modules/sprint-planning/exercises.md +79 -0
- package/modules/sprint-planning/game.yaml +84 -0
- package/modules/sprint-planning/module.yaml +44 -0
- package/modules/sprint-planning/quick-ref.md +76 -0
- package/modules/sprint-planning/quiz.md +102 -0
- package/modules/sprint-planning/resources.md +39 -0
- package/modules/sprint-planning/walkthrough.md +75 -0
- package/modules/sql-fundamentals/content.md +160 -0
- package/modules/sql-fundamentals/exercises.md +87 -0
- package/modules/sql-fundamentals/game.yaml +105 -0
- package/modules/sql-fundamentals/module.yaml +45 -0
- package/modules/sql-fundamentals/quick-ref.md +53 -0
- package/modules/sql-fundamentals/quiz.md +103 -0
- package/modules/sql-fundamentals/resources.md +42 -0
- package/modules/sql-fundamentals/walkthrough.md +92 -0
- package/modules/sql-fundamentals/workshop.yaml +109 -0
- package/modules/stakeholder-communication/content.md +186 -0
- package/modules/stakeholder-communication/exercises.md +87 -0
- package/modules/stakeholder-communication/module.yaml +38 -0
- package/modules/stakeholder-communication/quick-ref.md +89 -0
- package/modules/stakeholder-communication/quiz.md +73 -0
- package/modules/stakeholder-communication/resources.md +41 -0
- package/modules/stakeholder-communication/walkthrough.md +74 -0
- package/modules/system-design/content.md +149 -0
- package/modules/system-design/exercises.md +83 -0
- package/modules/system-design/game.yaml +95 -0
- package/modules/system-design/module.yaml +46 -0
- package/modules/system-design/quick-ref.md +59 -0
- package/modules/system-design/quiz.md +102 -0
- package/modules/system-design/resources.md +46 -0
- package/modules/system-design/walkthrough.md +90 -0
- package/modules/team-topologies/content.md +166 -0
- package/modules/team-topologies/exercises.md +85 -0
- package/modules/team-topologies/module.yaml +41 -0
- package/modules/team-topologies/quick-ref.md +61 -0
- package/modules/team-topologies/quiz.md +101 -0
- package/modules/team-topologies/resources.md +37 -0
- package/modules/team-topologies/walkthrough.md +76 -0
- package/modules/technical-debt/content.md +111 -0
- package/modules/technical-debt/exercises.md +92 -0
- package/modules/technical-debt/module.yaml +39 -0
- package/modules/technical-debt/quick-ref.md +60 -0
- package/modules/technical-debt/quiz.md +73 -0
- package/modules/technical-debt/resources.md +25 -0
- package/modules/technical-debt/walkthrough.md +94 -0
- package/modules/technical-mentoring/content.md +128 -0
- package/modules/technical-mentoring/exercises.md +84 -0
- package/modules/technical-mentoring/module.yaml +41 -0
- package/modules/technical-mentoring/quick-ref.md +74 -0
- package/modules/technical-mentoring/quiz.md +73 -0
- package/modules/technical-mentoring/resources.md +33 -0
- package/modules/technical-mentoring/walkthrough.md +65 -0
- package/modules/test-strategy/content.md +136 -0
- package/modules/test-strategy/exercises.md +84 -0
- package/modules/test-strategy/game.yaml +99 -0
- package/modules/test-strategy/module.yaml +45 -0
- package/modules/test-strategy/quick-ref.md +66 -0
- package/modules/test-strategy/quiz.md +99 -0
- package/modules/test-strategy/resources.md +60 -0
- package/modules/test-strategy/walkthrough.md +97 -0
- package/modules/test-strategy/workshop.yaml +96 -0
- package/modules/typescript-fundamentals/content.md +127 -0
- package/modules/typescript-fundamentals/exercises.md +79 -0
- package/modules/typescript-fundamentals/game.yaml +111 -0
- package/modules/typescript-fundamentals/module.yaml +45 -0
- package/modules/typescript-fundamentals/quick-ref.md +55 -0
- package/modules/typescript-fundamentals/quiz.md +104 -0
- package/modules/typescript-fundamentals/resources.md +42 -0
- package/modules/typescript-fundamentals/walkthrough.md +71 -0
- package/modules/typescript-fundamentals/workshop.yaml +146 -0
- package/modules/user-story-mapping/content.md +123 -0
- package/modules/user-story-mapping/exercises.md +87 -0
- package/modules/user-story-mapping/module.yaml +41 -0
- package/modules/user-story-mapping/quick-ref.md +64 -0
- package/modules/user-story-mapping/quiz.md +73 -0
- package/modules/user-story-mapping/resources.md +29 -0
- package/modules/user-story-mapping/walkthrough.md +86 -0
- package/modules/writing-prds/content.md +133 -0
- package/modules/writing-prds/exercises.md +93 -0
- package/modules/writing-prds/game.yaml +83 -0
- package/modules/writing-prds/module.yaml +44 -0
- package/modules/writing-prds/quick-ref.md +77 -0
- package/modules/writing-prds/quiz.md +103 -0
- package/modules/writing-prds/resources.md +30 -0
- package/modules/writing-prds/walkthrough.md +87 -0
- package/package.json +1 -1
|
@@ -0,0 +1,124 @@
|
|
|
1
|
+
# Incident Response — From Alert to Postmortem
|
|
2
|
+
|
|
3
|
+
<!-- hint:slides topic="Incident response lifecycle: severity levels, roles (IC, Comms Lead), triage, mitigation, resolution, and blameless postmortem" slides="6" -->
|
|
4
|
+
|
|
5
|
+
## Incident Severity Levels
|
|
6
|
+
|
|
7
|
+
Define severity so everyone knows how to respond. Common tiers:
|
|
8
|
+
|
|
9
|
+
| Severity | Impact | Example | Response Time |
|
|
10
|
+
|----------|--------|---------|---------------|
|
|
11
|
+
| **SEV1** | Critical: full outage, data loss, security breach | Site down, payment broken | Immediate (e.g., 5 min) |
|
|
12
|
+
| **SEV2** | Major: significant degradation, key feature broken | Search down, 50% errors | Within 15–30 min |
|
|
13
|
+
| **SEV3** | Minor: limited impact, workaround exists | One region slow, non-critical bug | Within hours |
|
|
14
|
+
| **SEV4** | Low: cosmetic or edge case | UI glitch, rare error | Next business day |
|
|
15
|
+
|
|
16
|
+
Customize thresholds for your system. Document them in your runbooks so on-call knows when to escalate.
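
As a rough illustration of how the tiers in the table could be encoded for tooling, here is a minimal Python sketch. The `Signal` fields, the `classify` function, and the 10% error-rate threshold are assumptions made for this example, not part of the module; adjust them to your own definitions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical: full outage, data loss, security breach
    SEV2 = 2  # major: significant degradation, key feature broken
    SEV3 = 3  # minor: limited impact, workaround exists
    SEV4 = 4  # low: cosmetic or edge case


# Hypothetical threshold: share of failing requests that counts as "major".
MAJOR_ERROR_RATE = 0.10


@dataclass(frozen=True)
class Signal:
    """What triage knows about an incident (field names are illustrative)."""
    full_outage: bool = False
    data_loss: bool = False
    security_breach: bool = False
    key_feature_broken: bool = False
    error_rate: float = 0.0  # fraction of failing requests, 0.0 to 1.0
    degraded: bool = False   # slow or partially impaired, workaround exists


def classify(signal: Signal) -> Severity:
    """Return the most severe tier whose condition matches."""
    if signal.full_outage or signal.data_loss or signal.security_breach:
        return Severity.SEV1
    if signal.key_feature_broken or signal.error_rate >= MAJOR_ERROR_RATE:
        return Severity.SEV2
    if signal.degraded:
        return Severity.SEV3
    return Severity.SEV4
```

Checking the most severe conditions first means an incident that matches several tiers always gets the highest one, which is usually the safer default for paging.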
|
|
17
|
+
|
|
18
|
+
## Roles During Incidents
|
|
19
|
+
|
|
20
|
+
| Role | Responsibility |
|
|
21
|
+
|------|----------------|
|
|
22
|
+
| **Incident Commander (IC)** | Owns the response; coordinates, decides, delegates. Does not fix—orchestrates. |
|
|
23
|
+
| **Comms Lead** | Updates stakeholders, status page, and internal channels. Frees IC to focus on resolution. |
|
|
24
|
+
| **Responders** | Engineers debugging, rolling back, or applying fixes. Report progress to IC. |
|
|
25
|
+
|
|
26
|
+
IC and Comms Lead should be different people when possible. IC stays focused on technical decisions; Comms keeps everyone informed.
|
|
27
|
+
|
|
28
|
+
## The Incident Lifecycle
|
|
29
|
+
|
|
30
|
+
```mermaid
|
|
31
|
+
flowchart LR
|
|
32
|
+
A[Detect] --> B[Triage]
|
|
33
|
+
B --> C[Respond]
|
|
34
|
+
C --> D[Mitigate]
|
|
35
|
+
D --> E[Resolve]
|
|
36
|
+
E --> F[Postmortem]
|
|
37
|
+
```
|
|
38
|
+
|
|
39
|
+
```mermaid
|
|
40
|
+
flowchart TD
|
|
41
|
+
A[Detect] --> B[Triage]
|
|
42
|
+
B --> C[Mitigate]
|
|
43
|
+
C --> D[Resolve]
|
|
44
|
+
D --> E[Postmortem]
|
|
45
|
+
E --> F[Action Items]
|
|
46
|
+
|
|
47
|
+
A -.->|Alert, user report| A
|
|
48
|
+
B -.->|Severity, scope| B
|
|
49
|
+
C -.->|Rollback, fix, workaround| C
|
|
50
|
+
D -.->|Verify, all-clear| D
|
|
51
|
+
E -.->|Blameless, 5 whys| E
|
|
52
|
+
F -.->|Track, prevent recurrence| F
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
1. **Detect** — Alert fires, user reports, or monitoring catches the issue.
|
|
56
|
+
2. **Triage** — Assess severity, scope, and impact. Assign IC and responders.
|
|
57
|
+
3. **Mitigate** — Roll back, apply fix, or implement workaround. Goal: reduce impact.
|
|
58
|
+
4. **Resolve** — Verify the fix, confirm stability, declare incident closed.
|
|
59
|
+
5. **Postmortem** — Blameless analysis: what happened, why, what we'll do differently.
|
|
60
|
+
6. **Action Items** — Track improvements (runbooks, monitoring, code changes).
|
|
61
|
+
|
|
62
|
+
## Communication Templates
|
|
63
|
+
|
|
64
|
+
**Internal (Slack, email):**
|
|
65
|
+
> **Incident: [Brief title]**
|
|
66
|
+
> **Severity:** SEV[1–4]
|
|
67
|
+
> **Impact:** [Who/what is affected]
|
|
68
|
+
> **Status:** Investigating / Mitigating / Resolved
|
|
69
|
+
> **ETA:** [If known]
|
|
70
|
+
> **Incident Commander:** [Name]
|
|
71
|
+
|
|
72
|
+
**External (status page, customers):**
|
|
73
|
+
> **We're aware of [issue].** Impact: [description]. We're investigating and will update within [time]. ETA: [if known].
|
|
74
|
+
|
|
75
|
+
**Update cadence:** Every 15–30 min for SEV1/2; don't leave people guessing. "No update" is an update—say "Still investigating, next update in 15 min."
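
The internal template and the update cadence can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the SEV3/SEV4 intervals are placeholder assumptions, since the guidance above only covers SEV1/2.

```python
from datetime import datetime, timedelta

STATUSES = ("Investigating", "Mitigating", "Resolved")

# Minutes between updates. SEV1/SEV2 follow the 15-30 min guidance;
# the SEV3/SEV4 values are hypothetical placeholders.
UPDATE_INTERVAL_MIN = {1: 15, 2: 30, 3: 120, 4: 480}


def next_update_at(severity: int, last_update: datetime) -> datetime:
    """Return when the next status update is due."""
    return last_update + timedelta(minutes=UPDATE_INTERVAL_MIN[severity])


def internal_update(title, severity, impact, status, commander, eta="Unknown"):
    """Fill in the internal announcement template as plain text."""
    if status not in STATUSES:
        raise ValueError(f"status must be one of {STATUSES}")
    if severity not in UPDATE_INTERVAL_MIN:
        raise ValueError("severity must be between 1 and 4")
    return "\n".join([
        f"Incident: {title}",
        f"Severity: SEV{severity}",
        f"Impact: {impact}",
        f"Status: {status}",
        f"ETA: {eta}",
        f"Incident Commander: {commander}",
    ])
```

Rejecting unknown status values keeps every update in the same small vocabulary, so readers can scan a channel and see the state at a glance.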
|
|
76
|
+
|
|
77
|
+
## Blameless Postmortems
|
|
78
|
+
|
|
79
|
+
**Goal:** Learn, not blame. Focus on systems and process, not individuals.
|
|
80
|
+
|
|
81
|
+
**Structure:**
|
|
82
|
+
1. **Summary** — What happened, in plain language.
|
|
83
|
+
2. **Timeline** — Key events with timestamps.
|
|
84
|
+
3. **Root cause** — Use "5 whys" or similar. Go past symptoms to contributing factors.
|
|
85
|
+
4. **Impact** — Users affected, duration, business impact.
|
|
86
|
+
5. **Action items** — What we'll change (runbooks, monitoring, code, process). Owner and due date.
|
|
87
|
+
6. **Lessons learned** — What went well, what didn't.
|
|
88
|
+
|
|
89
|
+
**Rule:** No names in the "who messed up" sense. "The deploy went out" not "Alice deployed the bad code." Discuss the decision-making and systems that allowed the failure.
|
|
90
|
+
|
|
91
|
+
## On-Call Best Practices
|
|
92
|
+
|
|
93
|
+
- **Runbooks** — Step-by-step guides for common incidents. "If X, do Y."
|
|
94
|
+
- **Escalation paths** — Who gets paged when: primary → secondary → manager. Define SLA.
|
|
95
|
+
- **Rotation** — Fair distribution; avoid burnout. Tools: PagerDuty, OpsGenie, VictorOps.
|
|
96
|
+
- **Compensation** — On-call pay, time off, or flexibility. Make it sustainable.
|
|
97
|
+
- **Training** — Shadow new on-call; run game days (simulated incidents).
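
One way to make the escalation path concrete is a small lookup from "minutes without acknowledgment" to the roles that should have been paged. The sketch below assumes hypothetical delays of 10 and 20 minutes; paging tools such as PagerDuty configure this for you, so treat it as a model of the policy, not an integration.

```python
# (role, minutes without acknowledgment before this role is paged)
# The 10- and 20-minute delays are hypothetical; set them from your own SLA.
ESCALATION_CHAIN = [("primary", 0), ("secondary", 10), ("manager", 20)]


def roles_to_page(minutes_unacknowledged: float) -> list:
    """Return every role that should have been paged by now, in order."""
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    return [
        role
        for role, delay in ESCALATION_CHAIN
        if minutes_unacknowledged >= delay
    ]
```

For example, `roles_to_page(12)` includes the primary and the secondary but not yet the manager.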
|
|
98
|
+
|
|
99
|
+
## Preventing Incident Fatigue
|
|
100
|
+
|
|
101
|
+
- Limit consecutive weeks on primary.
|
|
102
|
+
- Handoff meetings: what’s in flight, recent changes.
|
|
103
|
+
- Post-incident rest: no meetings for IC for a few hours after SEV1/2.
|
|
104
|
+
- Track pages: if someone is woken up constantly, fix the alerts or the system.
|
|
105
|
+
|
|
106
|
+
## Learning from Incidents — SLOs and Error Budgets
|
|
107
|
+
|
|
108
|
+
**SLO (Service Level Objective):** "99.9% of requests succeed."
|
|
109
|
+
**Error budget:** The allowed failure rate (e.g., 0.1% ≈ 43 min of downtime in a 30-day month). When you're within budget, you can ship; when you're over, focus on reliability.
|
|
110
|
+
|
|
111
|
+
Incidents consume the error budget. Use postmortems to decide: was this a one-off or a systemic problem? Invest in fixes that prevent recurrence and protect the budget.
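
The arithmetic behind the error budget is short enough to sketch. The example below assumes a time-based budget over a 30-day window (an SLO stated in requests would count failed requests instead); the function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def budget_remaining(slo: float, downtime_minutes, window_days: int = 30) -> float:
    """Budget left after subtracting the downtime of each incident."""
    return error_budget_minutes(slo, window_days) - sum(downtime_minutes)


def can_ship(slo: float, downtime_minutes, window_days: int = 30) -> bool:
    """True while incidents have not used more than the budget."""
    return budget_remaining(slo, downtime_minutes, window_days) >= 0
```

With a 99.9% SLO the 30-day budget is about 43.2 minutes, so two incidents of 30 and 20 minutes would put the service over budget.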
|
|
112
|
+
|
|
113
|
+
## Incident Lifecycle Flow (Simplified)
|
|
114
|
+
|
|
115
|
+
```mermaid
|
|
116
|
+
flowchart LR
|
|
117
|
+
A[Alert] --> B[Triage]
|
|
118
|
+
B --> C[Mitigate]
|
|
119
|
+
C --> D[Resolve]
|
|
120
|
+
D --> E[Postmortem]
|
|
121
|
+
E --> A
|
|
122
|
+
```
|
|
123
|
+
|
|
124
|
+
Continuous improvement: each incident makes the system more resilient if we learn from it.
|
|
@@ -0,0 +1,82 @@
|
|
|
1
|
+
# Incident Response — Exercises
|
|
2
|
+
|
|
3
|
+
## Exercise 1: Severity Classification
|
|
4
|
+
|
|
5
|
+
**Task:** Classify these incidents as SEV1–4 and justify: (A) Payment processing returns 500 for 10% of users. (B) Typo on "About" page. (C) Full site returns 502 for all users. (D) Search is slow (3s vs 1s) for one region.
|
|
6
|
+
|
|
7
|
+
**Validation:**
|
|
8
|
+
- [ ] Each has a severity
|
|
9
|
+
- [ ] Justification references impact and scope
|
|
10
|
+
- [ ] Response time is implied or stated
|
|
11
|
+
|
|
12
|
+
**Hints:**
|
|
13
|
+
1. Full site down → SEV1
|
|
14
|
+
2. Payment 500 for 10% → SEV2 (major, key feature)
|
|
15
|
+
3. Search slow, one region → SEV3 (degradation, workaround: wait or retry)
|
|
16
|
+
4. Typo → SEV4
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
## Exercise 2: Communication Template in Action
|
|
21
|
+
|
|
22
|
+
**Task:** A database failover fails at 2 p.m.; the site is down. Write the internal announcement and external status update you'd post within 5 minutes. Then write the "Resolved" update for when you fix it 45 minutes later.
|
|
23
|
+
|
|
24
|
+
**Validation:**
|
|
25
|
+
- [ ] Internal has: severity, impact, status, IC, ETA
|
|
26
|
+
- [ ] External is customer-friendly, no jargon
|
|
27
|
+
- [ ] Resolved update thanks users and summarizes what happened
|
|
28
|
+
|
|
29
|
+
**Hints:**
|
|
30
|
+
1. Internal: "SEV1, full outage, mitigating, IC: [name], ETA: unknown"
|
|
31
|
+
2. External: "We're experiencing an outage. Our team is on it. Next update in 15 min."
|
|
32
|
+
3. Resolved: "Service restored. We'll share a postmortem within 48 hours."
|
|
33
|
+
|
|
34
|
+
---
|
|
35
|
+
|
|
36
|
+
## Exercise 3: Blameless Postmortem Draft
|
|
37
|
+
|
|
38
|
+
**Task:** A deploy at 10 a.m. introduced a bug; it wasn't caught by CI. Incident resolved at 11:30 a.m. Draft the "Root Cause" section using 5 whys. Use blameless language—focus on process, not people.
|
|
39
|
+
|
|
40
|
+
**Validation:**
|
|
41
|
+
- [ ] 5 whys go from symptom to systemic cause
|
|
42
|
+
- [ ] No blame ("the process allowed" not "Alice didn't test")
|
|
43
|
+
- [ ] At least one action item is implied
|
|
44
|
+
|
|
45
|
+
**Hints:**
|
|
46
|
+
1. Why did the bug reach prod? → CI didn't catch it
|
|
47
|
+
2. Why didn't CI catch it? → No test for this case
|
|
48
|
+
3. Why no test? → Gap in test coverage / requirements
|
|
49
|
+
4. Keep going to process: how do we close the gap?
|
|
50
|
+
|
|
51
|
+
---
|
|
52
|
+
|
|
53
|
+
## Exercise 4: Runbook Entry
|
|
54
|
+
|
|
55
|
+
**Task:** Write a runbook entry for "Database replica lag exceeds 30 seconds." Include: how to detect, how to confirm, mitigation steps (2–4), escalation path, and where to log the incident.
|
|
56
|
+
|
|
57
|
+
**Validation:**
|
|
58
|
+
- [ ] Detection is specific (metric, dashboard, alert)
|
|
59
|
+
- [ ] Mitigation steps are ordered and actionable
|
|
60
|
+
- [ ] Escalation path is clear
|
|
61
|
+
|
|
62
|
+
**Hints:**
|
|
63
|
+
1. Detect: CloudWatch/graph, PagerDuty alert
|
|
64
|
+
2. Confirm: Check replica lag metric, verify impact
|
|
65
|
+
3. Mitigate: Identify slow query, kill if safe, scale read replicas, failover if critical
|
|
66
|
+
4. Escalate: DB team, IC if no progress in 15 min
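
The decision logic in such a runbook entry can be sketched as a small function. The 30-second threshold and the 15-minute escalation window come from the task and hints above; the function name and return labels are assumptions for illustration.

```python
LAG_THRESHOLD_S = 30     # "replica lag exceeds 30 seconds"
ESCALATE_AFTER_MIN = 15  # escalate if there is no progress in 15 min


def next_step(lag_seconds: float, minutes_since_detection: float) -> str:
    """Return the runbook stage for the current lag and elapsed time."""
    if lag_seconds <= LAG_THRESHOLD_S:
        return "ok"
    if minutes_since_detection >= ESCALATE_AFTER_MIN:
        return "escalate"
    return "mitigate"
```

Keeping the thresholds as named constants makes it easy to keep the alert rule and the written runbook in agreement.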
|
|
67
|
+
|
|
68
|
+
---
|
|
69
|
+
|
|
70
|
+
## Exercise 5: On-Call Rotation Design
|
|
71
|
+
|
|
72
|
+
**Task:** Design an on-call rotation for a 5-person team. Include: primary and secondary, rotation cadence (e.g., weekly), handoff process, and one policy to prevent fatigue (e.g., no back-to-back primaries).
|
|
73
|
+
|
|
74
|
+
**Validation:**
|
|
75
|
+
- [ ] Primary and secondary defined
|
|
76
|
+
- [ ] Cadence and handoff are specified
|
|
77
|
+
- [ ] At least one fatigue-prevention policy
|
|
78
|
+
|
|
79
|
+
**Hints:**
|
|
80
|
+
1. Weekly primary, weekly secondary (different person)
|
|
81
|
+
2. Handoff: 15-min call, share recent incidents, in-flight work
|
|
82
|
+
3. Policy: No one does primary 2 weeks in a row; comp time after SEV1
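
A round-robin schedule that satisfies these hints can be sketched as follows. The team names and the `secondary_offset` parameter are hypothetical; the sketch assumes team members are distinct and that each week has exactly one primary and one secondary.

```python
def build_rotation(team, weeks, secondary_offset=2):
    """Return (week_number, primary, secondary) tuples in round-robin order."""
    n = len(team)
    if n < 2:
        raise ValueError("need at least two people to avoid back-to-back primaries")
    if secondary_offset % n == 0:
        raise ValueError("secondary would be the same person as primary")
    return [
        (week + 1, team[week % n], team[(week + secondary_offset) % n])
        for week in range(weeks)
    ]
```

With five people and an offset of 2, nobody is primary two weeks in a row, and a person who was secondary gets a week off before becoming primary.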
|
|
@@ -0,0 +1,132 @@
|
|
|
1
|
+
games:
|
|
2
|
+
- type: scenario
|
|
3
|
+
title: "Incident Commander"
|
|
4
|
+
startHealth: 5
|
|
5
|
+
steps:
|
|
6
|
+
- id: start
|
|
7
|
+
situation: "It's 2 AM. PagerDuty wakes you — the payment service is down. Alerts are firing: 503s on checkout, payment gateway timeouts. You're the incident commander. What do you do first?"
|
|
8
|
+
choices:
|
|
9
|
+
- text: "Declare SEV1 immediately and page the full team."
|
|
10
|
+
consequence: "You acted decisively. SEV1 is correct for customer-facing payment outages. The team is assembling."
|
|
11
|
+
health: 1
|
|
12
|
+
next: triage
|
|
13
|
+
- text: "Wait 5 minutes to see if it self-recovers before paging anyone."
|
|
14
|
+
consequence: "Every minute of payment downtime costs revenue and trust. SEV1 incidents need immediate response."
|
|
15
|
+
health: -2
|
|
16
|
+
next: triage
|
|
17
|
+
- text: "Post in Slack and ask if anyone has seen this before."
|
|
18
|
+
consequence: "Slack is too slow for payment outages. You need to formally declare severity and page. Time is critical."
|
|
19
|
+
health: -1
|
|
20
|
+
next: triage
|
|
21
|
+
- id: triage
|
|
22
|
+
situation: "The team is joining. You confirm: checkout is returning 503, payment gateway is timing out. One engineer says the gateway provider's status page shows 'degraded.' What's your next move?"
|
|
23
|
+
choices:
|
|
24
|
+
- text: "Create a shared doc, assign roles (comms, technical lead, scribe), and have tech lead investigate root cause."
|
|
25
|
+
consequence: "Good incident hygiene. Clear roles prevent chaos. The technical lead can focus on debugging while you coordinate."
|
|
26
|
+
health: 1
|
|
27
|
+
next: communicate
|
|
28
|
+
- text: "Have everyone start debugging in parallel."
|
|
29
|
+
consequence: "Too many cooks. Without a single technical lead and clear assignments, people duplicate work and step on each other."
|
|
30
|
+
health: -1
|
|
31
|
+
next: communicate
|
|
32
|
+
- text: "Pause and run a full architecture review before touching anything."
|
|
33
|
+
consequence: "Architecture reviews are for postmortems. During an incident, you need fast triage and mitigation, not deep analysis."
|
|
34
|
+
health: -2
|
|
35
|
+
next: communicate
|
|
36
|
+
- id: communicate
|
|
37
|
+
situation: "Stakeholders are asking for updates. Support is getting flooded with tickets. The technical lead is still investigating. What do you prioritize?"
|
|
38
|
+
choices:
|
|
39
|
+
- text: "Send a brief status to stakeholders — 'We're investigating payment gateway issues, ETA for next update in 15 min' — then let the tech lead work."
|
|
40
|
+
consequence: "Right balance. Stakeholders get reassurance; the team isn't distracted. Time-boxed updates manage expectations."
|
|
41
|
+
health: 1
|
|
42
|
+
next: debug_vs_comms
|
|
43
|
+
- text: "Ignore stakeholders until you have a root cause."
|
|
44
|
+
consequence: "Stakeholders need to know you're on it. Radio silence increases anxiety and can trigger escalations."
|
|
45
|
+
health: -1
|
|
46
|
+
next: debug_vs_comms
|
|
47
|
+
- text: "Pull the technical lead off investigation to draft a detailed customer-facing post."
|
|
48
|
+
consequence: "Investigation should take priority. Detailed posts can wait. Brief internal updates are enough for now."
|
|
49
|
+
health: -2
|
|
50
|
+
next: debug_vs_comms
|
|
51
|
+
- id: debug_vs_comms
|
|
52
|
+
situation: "The tech lead suspects a bad deploy from 2 hours ago. Rolling back would take ~20 minutes. The payment gateway provider's status still says 'degraded.' Do you roll back or wait for more data?"
|
|
53
|
+
choices:
|
|
54
|
+
- text: "Roll back now. Payment is critical; the deploy is the most likely culprit and we can't afford to wait."
|
|
55
|
+
consequence: "Pragmatic. For payment, speed matters. If the rollback fixes it, you're done. If not, you've ruled out a major variable."
|
|
56
|
+
health: 1
|
|
57
|
+
next: rollback_result
|
|
58
|
+
- text: "Wait for the gateway provider to confirm their status before deciding."
|
|
59
|
+
consequence: "External status pages can lag. Your own deploy is something you control. Delaying the rollback extends customer impact."
|
|
60
|
+
health: -1
|
|
61
|
+
next: rollback_result
|
|
62
|
+
- text: "Run A/B tests to confirm the deploy is the cause."
|
|
63
|
+
consequence: "A/B tests take time. During an incident, you need fast, reversible actions. Rollback is low risk and high signal."
|
|
64
|
+
health: -2
|
|
65
|
+
next: rollback_result
|
|
66
|
+
- id: rollback_result
|
|
67
|
+
situation: "You rolled back. Checkout recovers. The incident is resolved. Now the CEO asks for a postmortem by end of week. How do you approach it?"
|
|
68
|
+
choices:
|
|
69
|
+
- text: "Schedule a blameless postmortem with everyone involved. Focus on what we'll change (process, tooling, checks), not who screwed up."
|
|
70
|
+
consequence: "Blameless postmortems build learning culture. People share more, and you get better preventative measures."
|
|
71
|
+
health: 1
|
|
72
|
+
next: postmortem
|
|
73
|
+
- text: "Assign it to the engineer who pushed the deploy."
|
|
74
|
+
consequence: "That creates blame and discourages future transparency. Postmortems should be collaborative and blameless."
|
|
75
|
+
health: -2
|
|
76
|
+
next: postmortem
|
|
77
|
+
- text: "Skip the postmortem — we fixed it, let's move on."
|
|
78
|
+
consequence: "Every incident is a learning opportunity. Skipping postmortems means the same failures repeat."
|
|
79
|
+
health: -1
|
|
80
|
+
next: postmortem
|
|
81
|
+
- id: postmortem
|
|
82
|
+
situation: "In the postmortem, the team identifies two gaps: there is no canary for the payment service, and the deploy went out without a staging gate. What should you document?"
|
|
83
|
+
choices:
|
|
84
|
+
- text: "Write action items: add canary deployment, require staging validation for payment service, set up alerts for gateway timeouts."
|
|
85
|
+
consequence: "Well done, Incident Commander. Clear, actionable items. You triaged severity, coordinated the team, communicated with stakeholders, made a timely rollback decision, and ran a blameless postmortem."
|
|
86
|
+
health: 1
|
|
87
|
+
next: end
|
|
88
|
+
- text: "Document that 'we'll be more careful next time.'"
|
|
89
|
+
consequence: "Vague. 'Be more careful' doesn't change systems. You need concrete process or tooling changes."
|
|
90
|
+
health: -1
|
|
91
|
+
next: end
|
|
92
|
+
- text: "Blame the engineer who merged without sufficient review."
|
|
93
|
+
consequence: "Blameless means focusing on system failures, not individuals. Blame stops people from participating honestly."
|
|
94
|
+
health: -2
|
|
95
|
+
next: end
|
|
96
|
+
|
|
97
|
+
- type: classify
|
|
98
|
+
title: "Severity Sorter"
|
|
99
|
+
categories:
|
|
100
|
+
- name: "SEV1 - Critical"
|
|
101
|
+
color: "#f85149"
|
|
102
|
+
- name: "SEV2 - Major"
|
|
103
|
+
color: "#d29922"
|
|
104
|
+
- name: "SEV3 - Minor"
|
|
105
|
+
color: "#58a6ff"
|
|
106
|
+
- name: "SEV4 - Low"
|
|
107
|
+
color: "#8b949e"
|
|
108
|
+
items:
|
|
109
|
+
- text: "All checkouts failing, payment service returning 503. Revenue impact."
|
|
110
|
+
category: "SEV1 - Critical"
|
|
111
|
+
- text: "Authentication service down — no one can log in."
|
|
112
|
+
category: "SEV1 - Critical"
|
|
113
|
+
- text: "Database primary unreachable; replicas serving read-only. Writes failing."
|
|
114
|
+
category: "SEV1 - Critical"
|
|
115
|
+
- text: "Search is slow (2–3s) but returning results. Some users complaining."
|
|
116
|
+
category: "SEV2 - Major"
|
|
117
|
+
- text: "CDN cache miss rate up 20%. Page loads slightly slower."
|
|
118
|
+
category: "SEV2 - Major"
|
|
119
|
+
- text: "Admin dashboard export fails for reports over 10k rows."
|
|
120
|
+
category: "SEV2 - Major"
|
|
121
|
+
- text: "Non-critical feature flag not applying for a small segment."
|
|
122
|
+
category: "SEV3 - Minor"
|
|
123
|
+
- text: "Email notifications delayed by 5–10 minutes."
|
|
124
|
+
category: "SEV3 - Minor"
|
|
125
|
+
- text: "Minor UI glitch on settings page in Safari only."
|
|
126
|
+
category: "SEV3 - Minor"
|
|
127
|
+
- text: "Spelling error in FAQ page."
|
|
128
|
+
category: "SEV4 - Low"
|
|
129
|
+
- text: "Deprecated API endpoint returns 410 as expected; one client not updated."
|
|
130
|
+
category: "SEV4 - Low"
|
|
131
|
+
- text: "Analytics pipeline backlog of 1 hour; dashboards slightly stale."
|
|
132
|
+
category: "SEV4 - Low"
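As a rough illustration of how a `classify` game like the Severity Sorter above could be scored, the sketch below compares a player's picks against the expected category for each item. The package's real scoring logic is not shown in this diff, so the `score` function and the `EXPECTED` mapping are hypothetical; only the item texts and category names come from the YAML.

```python
from typing import Dict

# Three items copied from the Severity Sorter above; the mapping structure
# itself is an assumption made for this sketch.
EXPECTED: Dict[str, str] = {
    "All checkouts failing, payment service returning 503. Revenue impact.": "SEV1 - Critical",
    "Minor UI glitch on settings page in Safari only.": "SEV3 - Minor",
    "Spelling error in FAQ page.": "SEV4 - Low",
}


def score(answers: Dict[str, str], expected: Dict[str, str] = EXPECTED) -> int:
    """Count how many items the player placed in the expected category."""
    return sum(
        1 for text, category in expected.items() if answers.get(text) == category
    )
```

Unanswered items simply count as incorrect, which keeps the scoring tolerant of partial submissions.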
|
|
@@ -0,0 +1,45 @@
|
|
|
1
|
+
slug: incident-response
|
|
2
|
+
title: "Incident Response — From Alert to Postmortem"
|
|
3
|
+
version: 1.0.0
|
|
4
|
+
description: "Handle production incidents: severity levels, roles, lifecycle, communication, blameless postmortems, and on-call best practices."
|
|
5
|
+
category: leadership
|
|
6
|
+
tags: [incident-response, on-call, postmortem, reliability, sre, blameless]
|
|
7
|
+
difficulty: intermediate
|
|
8
|
+
|
|
9
|
+
xp:
|
|
10
|
+
read: 15
|
|
11
|
+
walkthrough: 40
|
|
12
|
+
exercise: 25
|
|
13
|
+
quiz: 20
|
|
14
|
+
quiz-perfect-bonus: 10
|
|
15
|
+
game: 25
|
|
16
|
+
game-perfect-bonus: 15
|
|
17
|
+
|
|
18
|
+
time:
|
|
19
|
+
quick: 5
|
|
20
|
+
read: 20
|
|
21
|
+
guided: 50
|
|
22
|
+
|
|
23
|
+
prerequisites: [code-review]
|
|
24
|
+
related: [risk-management, debugging-systematically, stakeholder-communication]
|
|
25
|
+
|
|
26
|
+
triggers:
|
|
27
|
+
- "How do I handle a production incident?"
|
|
28
|
+
- "What is a blameless postmortem?"
|
|
29
|
+
- "How do I set up an on-call rotation?"
|
|
30
|
+
- "How do I run an incident response process?"
|
|
31
|
+
|
|
32
|
+
visuals:
|
|
33
|
+
diagrams: [diagram-flow, diagram-mermaid]
|
|
34
|
+
quiz-types: [quiz-drag-order, quiz-timed-choice]
|
|
35
|
+
game-types: [scenario, classify]
|
|
36
|
+
playground: bash
|
|
37
|
+
slides: true
|
|
38
|
+
|
|
39
|
+
sources:
|
|
40
|
+
- url: "https://sre.google/sre-book/"
|
|
41
|
+
label: "Google SRE Book"
|
|
42
|
+
type: docs
|
|
43
|
+
- url: "https://response.pagerduty.com"
|
|
44
|
+
label: "PagerDuty Incident Response Guide"
|
|
45
|
+
type: docs
|
|
@@ -0,0 +1,53 @@
|
|
|
1
|
+
# Incident Response — Quick Reference
|
|
2
|
+
|
|
3
|
+
## Severity Levels
|
|
4
|
+
|
|
5
|
+
| SEV | Impact | Response |
|
|
6
|
+
|-----|--------|----------|
|
|
7
|
+
| 1 | Critical: full outage, data loss | Immediate (~5 min) |
|
|
8
|
+
| 2 | Major: key feature broken | 15–30 min |
|
|
9
|
+
| 3 | Minor: limited impact, workaround | Hours |
|
|
10
|
+
| 4 | Low: cosmetic, edge case | Next business day |
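The severity table above can be encoded as a small lookup so that tooling can flag a slow response. This is a minimal sketch: the table only says "Hours" and "Next business day" for SEV3 and SEV4, so the 4-hour and 24-hour values below are assumptions, and SEV2 uses the upper end of its 15 to 30 minute range.

```python
from datetime import timedelta

# Response targets loosely based on the table above.
RESPONSE_TARGETS = {
    1: timedelta(minutes=5),
    2: timedelta(minutes=30),  # upper bound of "15-30 min"
    3: timedelta(hours=4),     # assumption: "Hours" read as 4 hours
    4: timedelta(days=1),      # assumption: "Next business day" read as 24 hours
}


def is_response_late(sev: int, elapsed: timedelta) -> bool:
    """Return True if the time since detection exceeds the target for this severity."""
    return elapsed > RESPONSE_TARGETS[sev]
```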
|
|
11
|
+
|
|
12
|
+
## Roles
|
|
13
|
+
|
|
14
|
+
| Role | Responsibility |
|
|
15
|
+
|------|----------------|
|
|
16
|
+
| Incident Commander | Coordinate, decide, delegate |
|
|
17
|
+
| Comms Lead | Stakeholder updates, status page |
|
|
18
|
+
| Responders | Debug, rollback, fix |
|
|
19
|
+
|
|
20
|
+
## Lifecycle
|
|
21
|
+
|
|
22
|
+
1. **Detect** — Alert, user report
|
|
23
|
+
2. **Triage** — Severity, scope, assign IC
|
|
24
|
+
3. **Mitigate** — Rollback, fix, workaround
|
|
25
|
+
4. **Resolve** — Verify, all-clear
|
|
26
|
+
5. **Postmortem** — Blameless, 5 whys
|
|
27
|
+
6. **Action Items** — Track, prevent recurrence
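The six lifecycle phases above form a strict sequence, which can be modelled as an ordered enum. The `Phase` and `next_phase` names are illustrative and not part of the package.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    """Incident lifecycle phases, numbered in the order listed above."""
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    POSTMORTEM = 5
    ACTION_ITEMS = 6


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last phase."""
    if current < Phase.ACTION_ITEMS:
        return Phase(current + 1)
    return None
```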
|
|
28
|
+
|
|
29
|
+
## Communication Template
|
|
30
|
+
|
|
31
|
+
**Internal:** Severity, impact, status, IC, ETA
|
|
32
|
+
**External:** "We're aware of X. Impact: Y. ETA: Z."
|
|
33
|
+
**Cadence:** Every 15–30 min for SEV1/2
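The templates above translate directly into string formatting. The sketch below is one possible rendering; the function names and the internal message layout are assumptions, while the field list and the external wording follow the template.

```python
from typing import Optional


def external_update(issue: str, impact: str, eta: Optional[str] = None) -> str:
    """Render the external template; fall back to 'investigating' when no ETA is known."""
    eta_text = eta if eta else "investigating"
    return f"We're aware of {issue}. Impact: {impact}. ETA: {eta_text}."


def internal_update(sev: int, impact: str, status: str, ic: str, eta: str) -> str:
    """Render the internal fields (severity, impact, status, IC, ETA) on one line."""
    return f"SEV{sev} | Impact: {impact} | Status: {status} | IC: {ic} | ETA: {eta}"
```

Keeping the external message free of internal fields such as the IC's name matches the guidance that customer-facing updates stay short and non-technical.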
|
|
34
|
+
|
|
35
|
+
## Blameless Postmortem
|
|
36
|
+
|
|
37
|
+
- Summary, Timeline, Root Cause (5 whys), Impact, Action Items, Lessons
|
|
38
|
+
- No blame: focus on systems and process
|
|
39
|
+
- Action items: owner, due date
|
|
40
|
+
|
|
41
|
+
## On-Call Best Practices
|
|
42
|
+
|
|
43
|
+
- Runbooks for common incidents
|
|
44
|
+
- Escalation path: primary → secondary → manager
|
|
45
|
+
- Rotation: fair, no back-to-back primary
|
|
46
|
+
- Post-incident rest for IC
|
|
47
|
+
- Compensation / flexibility
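The escalation path above (primary, then secondary, then manager) can be sketched as a lookup driven by how long a page has gone unacknowledged. The 10-minute acknowledgement timeout is an assumption chosen for illustration; the source does not specify one.

```python
ESCALATION_PATH = ["primary", "secondary", "manager"]


def who_to_page(minutes_unacknowledged: int, ack_timeout_minutes: int = 10) -> str:
    """Pick the escalation level: move one step along the path per missed timeout."""
    step = minutes_unacknowledged // ack_timeout_minutes
    # Stay on the last level once the path is exhausted.
    return ESCALATION_PATH[min(step, len(ESCALATION_PATH) - 1)]
```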
|
|
48
|
+
|
|
49
|
+
## Error Budgets
|
|
50
|
+
|
|
51
|
+
- SLO: e.g., 99.9% success
|
|
52
|
+
- Error budget: allowed failure (0.1% ≈ 43 min/month)
|
|
53
|
+
- Incidents consume budget; postmortems guide investment in reliability
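The "0.1% ≈ 43 min/month" figure above follows from simple arithmetic on a 30-day month. The sketch below reproduces it; the function names are illustrative, and treating the SLO as time-based availability is an assumption, since the source phrases it as "99.9% success".

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for a given SLO over a window of `days`."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, days: int = 30) -> float:
    """Minutes of budget left after incidents; negative means the budget is exhausted."""
    return error_budget_minutes(slo, days) - downtime_minutes
```

For a 99.9% SLO this gives about 43.2 minutes per 30 days, so a single 20-minute rollback like the one in the scenario would consume close to half of the monthly budget.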
|
|
@@ -0,0 +1,103 @@
|
|
|
1
|
+
# Incident Response — Quiz
|
|
2
|
+
|
|
3
|
+
## Question 1
|
|
4
|
+
|
|
5
|
+
What is the primary role of the Incident Commander?
|
|
6
|
+
|
|
7
|
+
A) Fix the bug
|
|
8
|
+
B) Coordinate the response, make decisions, delegate—not necessarily fix
|
|
9
|
+
C) Update the status page
|
|
10
|
+
D) Write the postmortem
|
|
11
|
+
|
|
12
|
+
<!-- ANSWER: B -->
|
|
13
|
+
<!-- EXPLANATION: The IC owns the response and orchestrates. They coordinate responders, decide on approach, and delegate. They may not be the one typing the fix—their job is to ensure the team responds effectively. Fixing is the responders' job. -->
|
|
14
|
+
|
|
15
|
+
## Question 2
|
|
16
|
+
|
|
17
|
+
What makes a postmortem "blameless"?
|
|
18
|
+
|
|
19
|
+
A) No one is mentioned
|
|
20
|
+
B) Focus on systems and process, not individuals; we learn, not punish
|
|
21
|
+
C) Only positive feedback
|
|
22
|
+
D) No root cause is identified
|
|
23
|
+
|
|
24
|
+
<!-- ANSWER: B -->
|
|
25
|
+
<!-- EXPLANATION: Blameless means we analyze what happened in terms of systems, process, and decisions—not "who messed up." The goal is learning and preventing recurrence, not assigning fault. People need to feel safe to contribute honestly. -->
|
|
26
|
+
|
|
27
|
+
## Question 3
|
|
28
|
+
|
|
29
|
+
Which severity typically requires immediate (e.g., 5 min) response?
|
|
30
|
+
|
|
31
|
+
A) SEV4
|
|
32
|
+
B) SEV3
|
|
33
|
+
C) SEV2
|
|
34
|
+
D) SEV1
|
|
35
|
+
|
|
36
|
+
<!-- ANSWER: D -->
|
|
37
|
+
<!-- EXPLANATION: SEV1 is critical—full outage, data loss, security breach. Response time is immediate (e.g., 5 min). SEV2 is major (15–30 min); SEV3/4 are slower. -->
|
|
38
|
+
|
|
39
|
+
## Question 4
|
|
40
|
+
|
|
41
|
+
What should the external status update include?
|
|
42
|
+
|
|
43
|
+
A) Technical details of the failure
|
|
44
|
+
B) We're aware of X, impact is Y, ETA is Z (or "investigating")
|
|
45
|
+
C) Names of engineers working on it
|
|
46
|
+
D) Root cause analysis
|
|
47
|
+
|
|
48
|
+
<!-- ANSWER: B -->
|
|
49
|
+
<!-- EXPLANATION: External updates should be clear and customer-friendly: we're aware, here's the impact, here's our ETA (or that we're investigating). No technical jargon, no blame, no internal details. -->
|
|
50
|
+
|
|
51
|
+
## Question 5
|
|
52
|
+
|
|
53
|
+
The "5 whys" in a postmortem are used to:
|
|
54
|
+
|
|
55
|
+
A) Assign blame
|
|
56
|
+
B) Trace from symptom to systemic/root cause
|
|
57
|
+
C) List five people involved
|
|
58
|
+
D) Count how many things went wrong
|
|
59
|
+
|
|
60
|
+
<!-- ANSWER: B -->
|
|
61
|
+
<!-- EXPLANATION: 5 whys is a technique to dig from the surface symptom to deeper causes. "Why did X happen?" → "Because Y." "Why did Y happen?" → Repeat. You reach contributing factors in process and design, not just the immediate trigger. -->
|
|
62
|
+
|
|
63
|
+
## Question 6
|
|
64
|
+
|
|
65
|
+
To prevent incident fatigue, a good practice is:
|
|
66
|
+
|
|
67
|
+
A) Have one person always on-call
|
|
68
|
+
B) Limit consecutive weeks on primary; offer post-incident rest
|
|
69
|
+
C) Only page during business hours
|
|
70
|
+
D) Skip postmortems to save time
|
|
71
|
+
|
|
72
|
+
<!-- ANSWER: B -->
|
|
73
|
+
<!-- EXPLANATION: Limit consecutive primary weeks, rotate fairly, give IC rest after SEV1/2 (no meetings for a few hours). Keeping one person always on-call and skipping rest both lead to burnout. -->
|
|
74
|
+
|
|
75
|
+
## Question 7
|
|
76
|
+
|
|
77
|
+
<!-- VISUAL: quiz-drag-order -->
|
|
78
|
+
|
|
79
|
+
Put these incident lifecycle phases in the correct order:
|
|
80
|
+
|
|
81
|
+
A) Mitigation (contain and fix)
|
|
82
|
+
B) Detection and triage
|
|
83
|
+
C) Post-incident review
|
|
84
|
+
D) Recovery (restore service)
|
|
85
|
+
E) Preparation (runbooks, on-call)
|
|
86
|
+
|
|
87
|
+
<!-- ANSWER: E,B,A,D,C -->
|
|
88
|
+
<!-- EXPLANATION: Prepare first (runbooks, on-call). When an incident occurs, detect and triage, then mitigate (contain and fix), recover (restore service), and finally conduct a post-incident review to learn. -->
|
|
89
|
+
|
|
90
|
+
## Question 8
|
|
91
|
+
|
|
92
|
+
<!-- VISUAL: quiz-drag-order -->
|
|
93
|
+
|
|
94
|
+
Put these postmortem sections in a logical order for the document:
|
|
95
|
+
|
|
96
|
+
A) Timeline of events
|
|
97
|
+
B) Root cause analysis
|
|
98
|
+
C) Impact and severity
|
|
99
|
+
D) Action items
|
|
100
|
+
E) Summary
|
|
101
|
+
|
|
102
|
+
<!-- ANSWER: E,C,A,B,D -->
|
|
103
|
+
<!-- EXPLANATION: A postmortem typically opens with a summary, then impact/severity, a chronological timeline, root cause analysis (5 whys, etc.), and closes with action items to prevent recurrence. -->
|
|
@@ -0,0 +1,40 @@
|
|
|
1
|
+
# Incident Response — Resources
|
|
2
|
+
|
|
3
|
+
## Official Guides
|
|
4
|
+
|
|
5
|
+
- [Google SRE Book](https://sre.google/sre-book/) — Free online. Covers incident response, postmortems, SLOs, and error budgets. Essential reading.
|
|
6
|
+
- [Google SRE Book — Managing Incidents](https://sre.google/sre-book/managing-incidents/) — Direct chapter on incident management.
|
|
7
|
+
- [PagerDuty Incident Response Guide](https://response.pagerduty.com) — Roles, workflows, communication, and best practices.
|
|
8
|
+
- [PagerDuty — Postmortem Best Practices](https://response.pagerduty.com/before/incident-response/postmortem/) — Blameless postmortems.
|
|
9
|
+
|
|
10
|
+
## Articles
|
|
11
|
+
|
|
12
|
+
- [Blameless PostMortems and a Just Culture](https://codeascraft.com/2013/11/14/blameless-postmortems-and-a-just-culture/) — Etsy (Code as Craft). Why blameless matters and how to build the culture.
|
|
13
|
+
- [Writing an Incident Postmortem](https://www.atlassian.com/incident-management/postmortem) — Atlassian. Template and examples.
|
|
14
|
+
- [The Five Whys](https://www.atlassian.com/incident-management/postmortem/blameless) — Using 5 whys in postmortems.
|
|
15
|
+
- [Incident Communication Best Practices](https://www.pagerduty.com/blog/incident-communication-best-practices/) — PagerDuty. Status updates and stakeholder communication.
|
|
16
|
+
- [On-Call Best Practices](https://www.pagerduty.com/resources/learn/on-call-best-practices/) — PagerDuty. Rotation, runbooks, fatigue prevention.
|
|
17
|
+
|
|
18
|
+
## Books
|
|
19
|
+
|
|
20
|
+
- **Site Reliability Engineering** (O'Reilly) — Google SRE book in print. Covers incidents, SLOs, and more.
|
|
21
|
+
- **The Phoenix Project** by Gene Kim et al. — Novel about IT ops and incident response; illustrates principles.
|
|
22
|
+
- **An Elegant Puzzle** by Will Larson — Includes on-call, incident process, and team reliability.
|
|
23
|
+
|
|
24
|
+
## Tools
|
|
25
|
+
|
|
26
|
+
- [PagerDuty](https://www.pagerduty.com/) — Incident alerting, on-call scheduling, escalation.
|
|
27
|
+
- [OpsGenie](https://www.atlassian.com/software/opsgenie) — Alerting and on-call management.
|
|
28
|
+
- [Statuspage](https://www.atlassian.com/software/statuspage) — Public status pages.
|
|
29
|
+
- [Rootly](https://www.rootly.com/) — Incident management and runbooks.
|
|
30
|
+
- [Incident.io](https://incident.io/) — Incident response workflow and automation.
|
|
31
|
+
|
|
32
|
+
## Videos
|
|
33
|
+
|
|
34
|
+
- [Google SRE — How We Do It](https://www.youtube.com/results?search_query=google+sre+incident) — Search for talks on incident response.
|
|
35
|
+
- [Blameless Postmortems at Etsy](https://www.youtube.com/results?search_query=blameless+postmortem+etsy) — Culture and process.
|
|
36
|
+
|
|
37
|
+
## Podcasts
|
|
38
|
+
|
|
39
|
+
- [Engineering Culture by InfoQ](https://www.infoq.com/podcasts/engineering-culture/) — Episodes on reliability and incidents.
|
|
40
|
+
- [SRE Weekly](https://sreweekly.com/) — Newsletter; often covers incident and postmortem content.
|
|
@@ -0,0 +1,82 @@
|
|
|
1
|
+
# Incident Response Walkthrough — Learn by Doing
|
|
2
|
+
|
|
3
|
+
## Before We Begin
|
|
4
|
+
|
|
5
|
+
Incidents follow a lifecycle: detect, triage, communicate, mitigate, resolve, and learn. How you handle each phase—and who does what—determines whether you fix things quickly and improve, or repeat the same mistakes.
|
|
6
|
+
|
|
7
|
+
**Diagnostic question:** Have you (or your team) ever had something break in production? What went well? What was chaotic—alerts, communication, or knowing who owned what?
|
|
8
|
+
|
|
9
|
+
**Checkpoint:** You can name at least two phases of the incident lifecycle and one thing that often goes wrong.
|
|
10
|
+
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
## Step 1: Define Severity Levels
|
|
14
|
+
|
|
15
|
+
<!-- hint:diagram mermaid-type="flowchart" topic="Incident lifecycle from detect to postmortem" -->
|
|
16
|
+
<!-- hint:list style="cards" -->
|
|
17
|
+
|
|
18
|
+
**Task:** For your system (or a hypothetical one), define SEV1–4 with concrete examples. Include: impact description, example scenario, and target response time for each.
|
|
19
|
+
|
|
20
|
+
**Question:** How would someone know in the moment which severity applies? What's ambiguous?
|
|
21
|
+
|
|
22
|
+
**Checkpoint:** Each severity has a clear example and response time.
|
|
23
|
+
|
|
24
|
+
---
|
|
25
|
+
|
|
26
|
+
## Step 2: Draft a Communication Template
|
|
27
|
+
|
|
28
|
+
**Task:** Write two templates: (1) internal incident announcement (Slack/email) and (2) external status page update. Include placeholders for severity, impact, status, and ETA.
|
|
29
|
+
|
|
30
|
+
**Question:** What do stakeholders need to know immediately vs. what can wait? How often should you update?
|
|
31
|
+
|
|
32
|
+
**Checkpoint:** Templates are copy-pastable with clear placeholders.
|
|
33
|
+
|
|
34
|
+
---
|
|
35
|
+
|
|
36
|
+
## Step 3: Outline Incident Roles
|
|
37
|
+
|
|
38
|
+
**Task:** Define the roles for your team's incident response: Incident Commander, Comms Lead, Responders. For each, write 2–3 responsibilities. Who would fill each role in a real incident?
|
|
39
|
+
|
|
40
|
+
**Question:** What happens if the usual IC is unavailable? How do you rotate?
|
|
41
|
+
|
|
42
|
+
**Checkpoint:** Roles are defined; you know who does what and who backs up whom.
|
|
43
|
+
|
|
44
|
+
---
|
|
45
|
+
|
|
46
|
+
## Step 4: Write a Postmortem Template
|
|
47
|
+
|
|
48
|
+
**Task:** Create a blameless postmortem template. Include: Summary, Timeline, Root Cause (5 whys), Impact, Action Items (with owners), and Lessons Learned. Add 1–2 "blameless language" guidelines (how to describe what happened without naming individuals negatively).
|
|
49
|
+
|
|
50
|
+
**Question:** How do you make it safe for people to contribute honestly? What would discourage participation?
|
|
51
|
+
|
|
52
|
+
**Checkpoint:** Template is complete; language guidelines support blameless culture.
|
|
53
|
+
|
|
54
|
+
---
|
|
55
|
+
|
|
56
|
+
## Step 5: Map the Incident Lifecycle
|
|
57
|
+
|
|
58
|
+
**Task:** Draw or describe your team's incident flow from Detect → Resolve → Postmortem. Include: how alerts fire, who gets paged, how triage happens, where updates are posted, and when postmortem is scheduled.
|
|
59
|
+
|
|
60
|
+
**Question:** Where are the gaps? What would fail in a real 3 a.m. incident?
|
|
61
|
+
|
|
62
|
+
**Checkpoint:** Flow is documented; at least one gap is identified.
|
|
63
|
+
|
|
64
|
+
---
|
|
65
|
+
|
|
66
|
+
## Step 6: Design an On-Call Runbook Entry
|
|
67
|
+
|
|
68
|
+
**Task:** Pick one realistic incident type (e.g., "High error rate on API"). Write a runbook entry: symptoms, how to confirm, steps to mitigate, who to escalate to, and where to document.
|
|
69
|
+
|
|
70
|
+
**Question:** Could someone unfamiliar with the system follow this at 2 a.m.? What's missing?
|
|
71
|
+
|
|
72
|
+
**Checkpoint:** Runbook is actionable; an on-call engineer could follow it.
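One way to check that a runbook entry covers every part named in the task (symptoms, how to confirm, mitigation steps, escalation contact, where to document) is to model it as a small data structure. This is a sketch only; the `RunbookEntry` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class RunbookEntry:
    """One runbook entry with the sections listed in the Step 6 task."""
    title: str
    symptoms: List[str] = field(default_factory=list)
    confirm: List[str] = field(default_factory=list)
    mitigate: List[str] = field(default_factory=list)
    escalate_to: str = ""
    document_in: str = ""


def missing_sections(entry: RunbookEntry) -> List[str]:
    """Return the names of sections that are still empty."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]
```

An entry with an empty result from `missing_sections` is not guaranteed to be usable at 2 a.m., but an entry with gaps is a clear signal that the checkpoint has not been met.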
|
|
73
|
+
|
|
74
|
+
---
|
|
75
|
+
|
|
76
|
+
## Step 7: Plan a Post-Incident Review
|
|
77
|
+
|
|
78
|
+
**Task:** Schedule a blameless postmortem for a past incident (real or hypothetical). Write the agenda: timebox per section, who facilitates, how action items are tracked. Include one "retrospective" question: "What would we do differently in the response itself?"
|
|
79
|
+
|
|
80
|
+
**Question:** How do you avoid making the postmortem feel punitive? How do you ensure action items get done?
|
|
81
|
+
|
|
82
|
+
**Checkpoint:** Agenda is timeboxed; facilitation and follow-through are planned.
|