@shaykec/bridge 0.4.25 → 0.4.26

Files changed (319)
  1. package/journeys/ai-engineer.yaml +34 -0
  2. package/journeys/backend-developer.yaml +36 -0
  3. package/journeys/business-analyst.yaml +37 -0
  4. package/journeys/devops-engineer.yaml +37 -0
  5. package/journeys/engineering-manager.yaml +44 -0
  6. package/journeys/frontend-developer.yaml +41 -0
  7. package/journeys/fullstack-developer.yaml +49 -0
  8. package/journeys/mobile-developer.yaml +42 -0
  9. package/journeys/product-manager.yaml +35 -0
  10. package/journeys/qa-engineer.yaml +37 -0
  11. package/journeys/ux-designer.yaml +43 -0
  12. package/modules/README.md +52 -0
  13. package/modules/accessibility-fundamentals/content.md +126 -0
  14. package/modules/accessibility-fundamentals/exercises.md +88 -0
  15. package/modules/accessibility-fundamentals/module.yaml +43 -0
  16. package/modules/accessibility-fundamentals/quick-ref.md +71 -0
  17. package/modules/accessibility-fundamentals/quiz.md +100 -0
  18. package/modules/accessibility-fundamentals/resources.md +29 -0
  19. package/modules/accessibility-fundamentals/walkthrough.md +80 -0
  20. package/modules/adr-writing/content.md +121 -0
  21. package/modules/adr-writing/exercises.md +81 -0
  22. package/modules/adr-writing/module.yaml +41 -0
  23. package/modules/adr-writing/quick-ref.md +57 -0
  24. package/modules/adr-writing/quiz.md +73 -0
  25. package/modules/adr-writing/resources.md +29 -0
  26. package/modules/adr-writing/walkthrough.md +64 -0
  27. package/modules/ai-agents/content.md +120 -0
  28. package/modules/ai-agents/exercises.md +82 -0
  29. package/modules/ai-agents/module.yaml +42 -0
  30. package/modules/ai-agents/quick-ref.md +60 -0
  31. package/modules/ai-agents/quiz.md +103 -0
  32. package/modules/ai-agents/resources.md +30 -0
  33. package/modules/ai-agents/walkthrough.md +85 -0
  34. package/modules/ai-assisted-research/content.md +136 -0
  35. package/modules/ai-assisted-research/exercises.md +80 -0
  36. package/modules/ai-assisted-research/module.yaml +42 -0
  37. package/modules/ai-assisted-research/quick-ref.md +67 -0
  38. package/modules/ai-assisted-research/quiz.md +73 -0
  39. package/modules/ai-assisted-research/resources.md +33 -0
  40. package/modules/ai-assisted-research/walkthrough.md +85 -0
  41. package/modules/ai-pair-programming/content.md +105 -0
  42. package/modules/ai-pair-programming/exercises.md +98 -0
  43. package/modules/ai-pair-programming/module.yaml +39 -0
  44. package/modules/ai-pair-programming/quick-ref.md +58 -0
  45. package/modules/ai-pair-programming/quiz.md +73 -0
  46. package/modules/ai-pair-programming/resources.md +34 -0
  47. package/modules/ai-pair-programming/walkthrough.md +117 -0
  48. package/modules/ai-test-generation/content.md +125 -0
  49. package/modules/ai-test-generation/exercises.md +98 -0
  50. package/modules/ai-test-generation/module.yaml +39 -0
  51. package/modules/ai-test-generation/quick-ref.md +65 -0
  52. package/modules/ai-test-generation/quiz.md +74 -0
  53. package/modules/ai-test-generation/resources.md +41 -0
  54. package/modules/ai-test-generation/walkthrough.md +100 -0
  55. package/modules/api-design/content.md +189 -0
  56. package/modules/api-design/exercises.md +84 -0
  57. package/modules/api-design/game.yaml +113 -0
  58. package/modules/api-design/module.yaml +45 -0
  59. package/modules/api-design/quick-ref.md +73 -0
  60. package/modules/api-design/quiz.md +100 -0
  61. package/modules/api-design/resources.md +55 -0
  62. package/modules/api-design/walkthrough.md +88 -0
  63. package/modules/clean-code/content.md +136 -0
  64. package/modules/clean-code/exercises.md +137 -0
  65. package/modules/clean-code/game.yaml +172 -0
  66. package/modules/clean-code/module.yaml +44 -0
  67. package/modules/clean-code/quick-ref.md +44 -0
  68. package/modules/clean-code/quiz.md +105 -0
  69. package/modules/clean-code/resources.md +40 -0
  70. package/modules/clean-code/walkthrough.md +78 -0
  71. package/modules/clean-code/workshop.yaml +149 -0
  72. package/modules/code-review/content.md +130 -0
  73. package/modules/code-review/exercises.md +95 -0
  74. package/modules/code-review/game.yaml +83 -0
  75. package/modules/code-review/module.yaml +42 -0
  76. package/modules/code-review/quick-ref.md +77 -0
  77. package/modules/code-review/quiz.md +105 -0
  78. package/modules/code-review/resources.md +40 -0
  79. package/modules/code-review/walkthrough.md +106 -0
  80. package/modules/daily-workflow/content.md +81 -0
  81. package/modules/daily-workflow/exercises.md +50 -0
  82. package/modules/daily-workflow/module.yaml +33 -0
  83. package/modules/daily-workflow/quick-ref.md +37 -0
  84. package/modules/daily-workflow/quiz.md +65 -0
  85. package/modules/daily-workflow/resources.md +38 -0
  86. package/modules/daily-workflow/walkthrough.md +83 -0
  87. package/modules/debugging-systematically/content.md +139 -0
  88. package/modules/debugging-systematically/exercises.md +91 -0
  89. package/modules/debugging-systematically/module.yaml +46 -0
  90. package/modules/debugging-systematically/quick-ref.md +59 -0
  91. package/modules/debugging-systematically/quiz.md +105 -0
  92. package/modules/debugging-systematically/resources.md +42 -0
  93. package/modules/debugging-systematically/walkthrough.md +84 -0
  94. package/modules/debugging-systematically/workshop.yaml +127 -0
  95. package/modules/demo-test/content.md +68 -0
  96. package/modules/demo-test/exercises.md +28 -0
  97. package/modules/demo-test/game.yaml +171 -0
  98. package/modules/demo-test/module.yaml +41 -0
  99. package/modules/demo-test/quick-ref.md +54 -0
  100. package/modules/demo-test/quiz.md +74 -0
  101. package/modules/demo-test/resources.md +21 -0
  102. package/modules/demo-test/walkthrough.md +122 -0
  103. package/modules/demo-test/workshop.yaml +31 -0
  104. package/modules/design-critique/content.md +93 -0
  105. package/modules/design-critique/exercises.md +71 -0
  106. package/modules/design-critique/module.yaml +41 -0
  107. package/modules/design-critique/quick-ref.md +63 -0
  108. package/modules/design-critique/quiz.md +73 -0
  109. package/modules/design-critique/resources.md +27 -0
  110. package/modules/design-critique/walkthrough.md +68 -0
  111. package/modules/design-patterns/content.md +335 -0
  112. package/modules/design-patterns/exercises.md +82 -0
  113. package/modules/design-patterns/game.yaml +55 -0
  114. package/modules/design-patterns/module.yaml +45 -0
  115. package/modules/design-patterns/quick-ref.md +44 -0
  116. package/modules/design-patterns/quiz.md +101 -0
  117. package/modules/design-patterns/resources.md +40 -0
  118. package/modules/design-patterns/walkthrough.md +64 -0
  119. package/modules/exploratory-testing/content.md +133 -0
  120. package/modules/exploratory-testing/exercises.md +88 -0
  121. package/modules/exploratory-testing/module.yaml +41 -0
  122. package/modules/exploratory-testing/quick-ref.md +68 -0
  123. package/modules/exploratory-testing/quiz.md +75 -0
  124. package/modules/exploratory-testing/resources.md +39 -0
  125. package/modules/exploratory-testing/walkthrough.md +87 -0
  126. package/modules/git/content.md +128 -0
  127. package/modules/git/exercises.md +53 -0
  128. package/modules/git/game.yaml +190 -0
  129. package/modules/git/module.yaml +44 -0
  130. package/modules/git/quick-ref.md +67 -0
  131. package/modules/git/quiz.md +89 -0
  132. package/modules/git/resources.md +49 -0
  133. package/modules/git/walkthrough.md +92 -0
  134. package/modules/git/workshop.yaml +145 -0
  135. package/modules/hiring-interviews/content.md +130 -0
  136. package/modules/hiring-interviews/exercises.md +88 -0
  137. package/modules/hiring-interviews/module.yaml +41 -0
  138. package/modules/hiring-interviews/quick-ref.md +68 -0
  139. package/modules/hiring-interviews/quiz.md +73 -0
  140. package/modules/hiring-interviews/resources.md +36 -0
  141. package/modules/hiring-interviews/walkthrough.md +75 -0
  142. package/modules/hooks/content.md +97 -0
  143. package/modules/hooks/exercises.md +69 -0
  144. package/modules/hooks/module.yaml +39 -0
  145. package/modules/hooks/quick-ref.md +93 -0
  146. package/modules/hooks/quiz.md +81 -0
  147. package/modules/hooks/resources.md +34 -0
  148. package/modules/hooks/walkthrough.md +105 -0
  149. package/modules/hooks/workshop.yaml +64 -0
  150. package/modules/incident-response/content.md +124 -0
  151. package/modules/incident-response/exercises.md +82 -0
  152. package/modules/incident-response/game.yaml +132 -0
  153. package/modules/incident-response/module.yaml +45 -0
  154. package/modules/incident-response/quick-ref.md +53 -0
  155. package/modules/incident-response/quiz.md +103 -0
  156. package/modules/incident-response/resources.md +40 -0
  157. package/modules/incident-response/walkthrough.md +82 -0
  158. package/modules/llm-fundamentals/content.md +114 -0
  159. package/modules/llm-fundamentals/exercises.md +83 -0
  160. package/modules/llm-fundamentals/module.yaml +42 -0
  161. package/modules/llm-fundamentals/quick-ref.md +64 -0
  162. package/modules/llm-fundamentals/quiz.md +103 -0
  163. package/modules/llm-fundamentals/resources.md +30 -0
  164. package/modules/llm-fundamentals/walkthrough.md +91 -0
  165. package/modules/one-on-ones/content.md +133 -0
  166. package/modules/one-on-ones/exercises.md +81 -0
  167. package/modules/one-on-ones/module.yaml +44 -0
  168. package/modules/one-on-ones/quick-ref.md +67 -0
  169. package/modules/one-on-ones/quiz.md +73 -0
  170. package/modules/one-on-ones/resources.md +37 -0
  171. package/modules/one-on-ones/walkthrough.md +69 -0
  172. package/modules/package.json +9 -0
  173. package/modules/prioritization-frameworks/content.md +130 -0
  174. package/modules/prioritization-frameworks/exercises.md +93 -0
  175. package/modules/prioritization-frameworks/module.yaml +41 -0
  176. package/modules/prioritization-frameworks/quick-ref.md +77 -0
  177. package/modules/prioritization-frameworks/quiz.md +73 -0
  178. package/modules/prioritization-frameworks/resources.md +32 -0
  179. package/modules/prioritization-frameworks/walkthrough.md +69 -0
  180. package/modules/prompt-engineering/content.md +123 -0
  181. package/modules/prompt-engineering/exercises.md +82 -0
  182. package/modules/prompt-engineering/game.yaml +101 -0
  183. package/modules/prompt-engineering/module.yaml +45 -0
  184. package/modules/prompt-engineering/quick-ref.md +65 -0
  185. package/modules/prompt-engineering/quiz.md +105 -0
  186. package/modules/prompt-engineering/resources.md +36 -0
  187. package/modules/prompt-engineering/walkthrough.md +81 -0
  188. package/modules/rag-fundamentals/content.md +111 -0
  189. package/modules/rag-fundamentals/exercises.md +80 -0
  190. package/modules/rag-fundamentals/module.yaml +45 -0
  191. package/modules/rag-fundamentals/quick-ref.md +58 -0
  192. package/modules/rag-fundamentals/quiz.md +75 -0
  193. package/modules/rag-fundamentals/resources.md +34 -0
  194. package/modules/rag-fundamentals/walkthrough.md +75 -0
  195. package/modules/react-fundamentals/content.md +140 -0
  196. package/modules/react-fundamentals/exercises.md +81 -0
  197. package/modules/react-fundamentals/game.yaml +145 -0
  198. package/modules/react-fundamentals/module.yaml +45 -0
  199. package/modules/react-fundamentals/quick-ref.md +62 -0
  200. package/modules/react-fundamentals/quiz.md +106 -0
  201. package/modules/react-fundamentals/resources.md +42 -0
  202. package/modules/react-fundamentals/walkthrough.md +89 -0
  203. package/modules/react-fundamentals/workshop.yaml +112 -0
  204. package/modules/react-native-fundamentals/content.md +141 -0
  205. package/modules/react-native-fundamentals/exercises.md +79 -0
  206. package/modules/react-native-fundamentals/module.yaml +42 -0
  207. package/modules/react-native-fundamentals/quick-ref.md +60 -0
  208. package/modules/react-native-fundamentals/quiz.md +61 -0
  209. package/modules/react-native-fundamentals/resources.md +24 -0
  210. package/modules/react-native-fundamentals/walkthrough.md +84 -0
  211. package/modules/registry.yaml +1650 -0
  212. package/modules/risk-management/content.md +162 -0
  213. package/modules/risk-management/exercises.md +86 -0
  214. package/modules/risk-management/module.yaml +41 -0
  215. package/modules/risk-management/quick-ref.md +82 -0
  216. package/modules/risk-management/quiz.md +73 -0
  217. package/modules/risk-management/resources.md +40 -0
  218. package/modules/risk-management/walkthrough.md +67 -0
  219. package/modules/running-effective-standups/content.md +119 -0
  220. package/modules/running-effective-standups/exercises.md +79 -0
  221. package/modules/running-effective-standups/module.yaml +40 -0
  222. package/modules/running-effective-standups/quick-ref.md +61 -0
  223. package/modules/running-effective-standups/quiz.md +73 -0
  224. package/modules/running-effective-standups/resources.md +36 -0
  225. package/modules/running-effective-standups/walkthrough.md +76 -0
  226. package/modules/solid-principles/content.md +154 -0
  227. package/modules/solid-principles/exercises.md +107 -0
  228. package/modules/solid-principles/module.yaml +42 -0
  229. package/modules/solid-principles/quick-ref.md +50 -0
  230. package/modules/solid-principles/quiz.md +102 -0
  231. package/modules/solid-principles/resources.md +39 -0
  232. package/modules/solid-principles/walkthrough.md +84 -0
  233. package/modules/sprint-planning/content.md +142 -0
  234. package/modules/sprint-planning/exercises.md +79 -0
  235. package/modules/sprint-planning/game.yaml +84 -0
  236. package/modules/sprint-planning/module.yaml +44 -0
  237. package/modules/sprint-planning/quick-ref.md +76 -0
  238. package/modules/sprint-planning/quiz.md +102 -0
  239. package/modules/sprint-planning/resources.md +39 -0
  240. package/modules/sprint-planning/walkthrough.md +75 -0
  241. package/modules/sql-fundamentals/content.md +160 -0
  242. package/modules/sql-fundamentals/exercises.md +87 -0
  243. package/modules/sql-fundamentals/game.yaml +105 -0
  244. package/modules/sql-fundamentals/module.yaml +45 -0
  245. package/modules/sql-fundamentals/quick-ref.md +53 -0
  246. package/modules/sql-fundamentals/quiz.md +103 -0
  247. package/modules/sql-fundamentals/resources.md +42 -0
  248. package/modules/sql-fundamentals/walkthrough.md +92 -0
  249. package/modules/sql-fundamentals/workshop.yaml +109 -0
  250. package/modules/stakeholder-communication/content.md +186 -0
  251. package/modules/stakeholder-communication/exercises.md +87 -0
  252. package/modules/stakeholder-communication/module.yaml +38 -0
  253. package/modules/stakeholder-communication/quick-ref.md +89 -0
  254. package/modules/stakeholder-communication/quiz.md +73 -0
  255. package/modules/stakeholder-communication/resources.md +41 -0
  256. package/modules/stakeholder-communication/walkthrough.md +74 -0
  257. package/modules/system-design/content.md +149 -0
  258. package/modules/system-design/exercises.md +83 -0
  259. package/modules/system-design/game.yaml +95 -0
  260. package/modules/system-design/module.yaml +46 -0
  261. package/modules/system-design/quick-ref.md +59 -0
  262. package/modules/system-design/quiz.md +102 -0
  263. package/modules/system-design/resources.md +46 -0
  264. package/modules/system-design/walkthrough.md +90 -0
  265. package/modules/team-topologies/content.md +166 -0
  266. package/modules/team-topologies/exercises.md +85 -0
  267. package/modules/team-topologies/module.yaml +41 -0
  268. package/modules/team-topologies/quick-ref.md +61 -0
  269. package/modules/team-topologies/quiz.md +101 -0
  270. package/modules/team-topologies/resources.md +37 -0
  271. package/modules/team-topologies/walkthrough.md +76 -0
  272. package/modules/technical-debt/content.md +111 -0
  273. package/modules/technical-debt/exercises.md +92 -0
  274. package/modules/technical-debt/module.yaml +39 -0
  275. package/modules/technical-debt/quick-ref.md +60 -0
  276. package/modules/technical-debt/quiz.md +73 -0
  277. package/modules/technical-debt/resources.md +25 -0
  278. package/modules/technical-debt/walkthrough.md +94 -0
  279. package/modules/technical-mentoring/content.md +128 -0
  280. package/modules/technical-mentoring/exercises.md +84 -0
  281. package/modules/technical-mentoring/module.yaml +41 -0
  282. package/modules/technical-mentoring/quick-ref.md +74 -0
  283. package/modules/technical-mentoring/quiz.md +73 -0
  284. package/modules/technical-mentoring/resources.md +33 -0
  285. package/modules/technical-mentoring/walkthrough.md +65 -0
  286. package/modules/test-strategy/content.md +136 -0
  287. package/modules/test-strategy/exercises.md +84 -0
  288. package/modules/test-strategy/game.yaml +99 -0
  289. package/modules/test-strategy/module.yaml +45 -0
  290. package/modules/test-strategy/quick-ref.md +66 -0
  291. package/modules/test-strategy/quiz.md +99 -0
  292. package/modules/test-strategy/resources.md +60 -0
  293. package/modules/test-strategy/walkthrough.md +97 -0
  294. package/modules/test-strategy/workshop.yaml +96 -0
  295. package/modules/typescript-fundamentals/content.md +127 -0
  296. package/modules/typescript-fundamentals/exercises.md +79 -0
  297. package/modules/typescript-fundamentals/game.yaml +111 -0
  298. package/modules/typescript-fundamentals/module.yaml +45 -0
  299. package/modules/typescript-fundamentals/quick-ref.md +55 -0
  300. package/modules/typescript-fundamentals/quiz.md +104 -0
  301. package/modules/typescript-fundamentals/resources.md +42 -0
  302. package/modules/typescript-fundamentals/walkthrough.md +71 -0
  303. package/modules/typescript-fundamentals/workshop.yaml +146 -0
  304. package/modules/user-story-mapping/content.md +123 -0
  305. package/modules/user-story-mapping/exercises.md +87 -0
  306. package/modules/user-story-mapping/module.yaml +41 -0
  307. package/modules/user-story-mapping/quick-ref.md +64 -0
  308. package/modules/user-story-mapping/quiz.md +73 -0
  309. package/modules/user-story-mapping/resources.md +29 -0
  310. package/modules/user-story-mapping/walkthrough.md +86 -0
  311. package/modules/writing-prds/content.md +133 -0
  312. package/modules/writing-prds/exercises.md +93 -0
  313. package/modules/writing-prds/game.yaml +83 -0
  314. package/modules/writing-prds/module.yaml +44 -0
  315. package/modules/writing-prds/quick-ref.md +77 -0
  316. package/modules/writing-prds/quiz.md +103 -0
  317. package/modules/writing-prds/resources.md +30 -0
  318. package/modules/writing-prds/walkthrough.md +87 -0
  319. package/package.json +1 -1
@@ -0,0 +1,124 @@
1
+ # Incident Response — From Alert to Postmortem
2
+
3
+ <!-- hint:slides topic="Incident response lifecycle: severity levels, roles (IC, Comms Lead), triage, mitigation, resolution, and blameless postmortem" slides="6" -->
4
+
5
+ ## Incident Severity Levels
6
+
7
+ Define severity so everyone knows how to respond. Common tiers:
8
+
9
+ | Severity | Impact | Example | Response Time |
10
+ |----------|--------|---------|---------------|
11
+ | **SEV1** | Critical: full outage, data loss, security breach | Site down, payment broken | Immediate (e.g., 5 min) |
12
+ | **SEV2** | Major: significant degradation, key feature broken | Search down, 50% errors | Within 15–30 min |
13
+ | **SEV3** | Minor: limited impact, workaround exists | One region slow, non-critical bug | Within hours |
14
+ | **SEV4** | Low: cosmetic or edge case | UI glitch, rare error | Next business day |
15
+
16
+ Customize thresholds for your system. Document them in your runbooks so on-call knows when to escalate.
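
The severity table above can be sketched as a small lookup. This is a minimal illustration, not part of the package: the `Impact` flags and the `classify` function are hypothetical names, the rule that anything unmatched falls to SEV3 is an assumption, and the response targets are copied from the table as text.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Response targets taken verbatim from the table above.
RESPONSE_TARGET = {
    Severity.SEV1: "Immediate (e.g., 5 min)",
    Severity.SEV2: "Within 15-30 min",
    Severity.SEV3: "Within hours",
    Severity.SEV4: "Next business day",
}


@dataclass
class Impact:
    """Hypothetical impact flags; adapt these to your own system."""
    full_outage: bool = False
    data_loss: bool = False
    security_breach: bool = False
    key_feature_broken: bool = False
    cosmetic_only: bool = False


def classify(impact: Impact) -> Severity:
    """Map impact flags to a severity tier, checking the worst case first."""
    if impact.full_outage or impact.data_loss or impact.security_breach:
        return Severity.SEV1
    if impact.key_feature_broken:
        return Severity.SEV2
    if impact.cosmetic_only:
        return Severity.SEV4
    # Assumption: limited-impact issues with a workaround default to SEV3.
    return Severity.SEV3
```

Checking the worst case first means an incident that matches several rows gets the highest applicable severity, which is usually the safer default when on-call is deciding whether to escalate.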
17
+
18
+ ## Roles During Incidents
19
+
20
+ | Role | Responsibility |
21
+ |------|----------------|
22
+ | **Incident Commander (IC)** | Owns the response; coordinates, decides, delegates. Does not fix—orchestrates. |
23
+ | **Comms Lead** | Updates stakeholders, status page, and internal channels. Frees IC to focus on resolution. |
24
+ | **Responders** | Engineers debugging, rolling back, or applying fixes. Report progress to IC. |
25
+
26
+ IC and Comms Lead should be different people when possible. IC stays focused on technical decisions; Comms keeps everyone informed.
27
+
28
+ ## The Incident Lifecycle
29
+
30
+ ```mermaid
31
+ flowchart LR
32
+ A[Detect] --> B[Triage]
33
+ B --> C[Respond]
34
+ C --> D[Mitigate]
35
+ D --> E[Resolve]
36
+ E --> F[Postmortem]
37
+ ```
38
+
39
+ ```mermaid
40
+ flowchart TD
41
+ A[Detect] --> B[Triage]
42
+ B --> C[Mitigate]
43
+ C --> D[Resolve]
44
+ D --> E[Postmortem]
45
+ E --> F[Action Items]
46
+
47
+ A -.->|Alert, user report| A
48
+ B -.->|Severity, scope| B
49
+ C -.->|Rollback, fix, workaround| C
50
+ D -.->|Verify, all-clear| D
51
+ E -.->|Blameless, 5 whys| E
52
+ F -.->|Track, prevent recurrence| F
53
+ ```
54
+
55
+ 1. **Detect** — Alert fires, user reports, or monitoring catches the issue.
56
+ 2. **Triage** — Assess severity, scope, and impact. Assign IC and responders.
57
+ 3. **Mitigate** — Roll back, apply fix, or implement workaround. Goal: reduce impact.
58
+ 4. **Resolve** — Verify the fix, confirm stability, declare incident closed.
59
+ 5. **Postmortem** — Blameless analysis: what happened, why, what we'll do differently.
60
+ 6. **Action Items** — Track improvements (runbooks, monitoring, code changes).
61
+
62
+ ## Communication Templates
63
+
64
+ **Internal (Slack, email):**
65
+ > **Incident: [Brief title]**
66
+ > **Severity:** SEV[1–4]
67
+ > **Impact:** [Who/what is affected]
68
+ > **Status:** Investigating / Mitigating / Resolved
69
+ > **ETA:** [If known]
70
+ > **Incident Commander:** [Name]
71
+
72
+ **External (status page, customers):**
73
+ > **We're aware of [issue].** Impact: [description]. We're investigating and will update within [time]. ETA: [if known].
74
+
75
+ **Update cadence:** Every 15–30 min for SEV1/2; don't leave people guessing. "No update" is an update—say "Still investigating, next update in 15 min."
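
The internal template above could be filled in programmatically so every update carries the same fields. The following is a rough sketch under assumptions: `IncidentUpdate` and `render_internal` are hypothetical names, and dropping the ETA line when no ETA is known is one reading of "[If known]".

```python
from dataclasses import dataclass
from typing import Optional

# Status values listed in the internal template.
STATUSES = ("Investigating", "Mitigating", "Resolved")


@dataclass
class IncidentUpdate:
    title: str
    severity: int          # 1-4, rendered as SEV1..SEV4
    impact: str
    status: str
    commander: str
    eta: Optional[str] = None

    def __post_init__(self) -> None:
        if self.severity not in (1, 2, 3, 4):
            raise ValueError("severity must be between 1 and 4")
        if self.status not in STATUSES:
            raise ValueError(f"status must be one of {STATUSES}")


def render_internal(update: IncidentUpdate) -> str:
    """Render the internal announcement in the template's field order."""
    lines = [
        f"**Incident: {update.title}**",
        f"**Severity:** SEV{update.severity}",
        f"**Impact:** {update.impact}",
        f"**Status:** {update.status}",
    ]
    if update.eta is not None:
        # Assumption: omit the ETA line entirely when it is not known.
        lines.append(f"**ETA:** {update.eta}")
    lines.append(f"**Incident Commander:** {update.commander}")
    return "\n".join(lines)
```

Validating the status and severity up front keeps free-form values out of the channel, so readers can scan updates quickly.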
76
+
77
+ ## Blameless Postmortems
78
+
79
+ **Goal:** Learn, not blame. Focus on systems and process, not individuals.
80
+
81
+ **Structure:**
82
+ 1. **Summary** — What happened, in plain language.
83
+ 2. **Timeline** — Key events with timestamps.
84
+ 3. **Root cause** — Use "5 whys" or similar. Go past symptoms to contributing factors.
85
+ 4. **Impact** — Users affected, duration, business impact.
86
+ 5. **Action items** — What we'll change (runbooks, monitoring, code, process). Owner and due date.
87
+ 6. **Lessons learned** — What went well, what didn't.
88
+
89
+ **Rule:** No names in the "who messed up" sense. "The deploy went out" not "Alice deployed the bad code." Discuss the decision-making and systems that allowed the failure.
90
+
91
+ ## On-Call Best Practices
92
+
93
+ - **Runbooks** — Step-by-step guides for common incidents. "If X, do Y."
94
+ - **Escalation paths** — Who gets paged when: primary → secondary → manager. Define SLA.
95
+ - **Rotation** — Fair distribution; avoid burnout. Tools: PagerDuty, OpsGenie, VictorOps.
96
+ - **Compensation** — On-call pay, time off, or flexibility. Make it sustainable.
97
+ - **Training** — Shadow new on-call; run game days (simulated incidents).
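
The escalation path in the list above (primary → secondary → manager) can be expressed as a simple function of how long a page has gone unacknowledged. This is only a sketch: `who_to_page` is a hypothetical helper, and the 15-minute acknowledgement timeout per tier is an assumed value, not something the module specifies.

```python
# Escalation order from the list above.
ESCALATION_CHAIN = ["primary", "secondary", "manager"]


def who_to_page(minutes_unacknowledged: float, ack_timeout_min: float = 15.0) -> str:
    """Return the tier that should currently be paged.

    Each tier gets `ack_timeout_min` minutes (an assumed value) to
    acknowledge before the page moves to the next tier. The last tier
    stays paged indefinitely.
    """
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    if ack_timeout_min <= 0:
        raise ValueError("ack_timeout_min must be positive")
    tier = int(minutes_unacknowledged // ack_timeout_min)
    return ESCALATION_CHAIN[min(tier, len(ESCALATION_CHAIN) - 1)]
```

Paging tools typically handle this for you; writing the policy down in this form mainly helps confirm that the team agrees on the timeouts.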
98
+
99
+ ## Preventing Incident Fatigue
100
+
101
+ - Limit consecutive weeks on primary.
102
+ - Handoff meetings: what’s in flight, recent changes.
103
+ - Post-incident rest: no meetings for IC for a few hours after SEV1/2.
104
+ - Track pages: if someone is woken up constantly, fix the alerts or the system.
105
+
106
+ ## Learning from Incidents — SLOs and Error Budgets
107
+
108
+ **SLO (Service Level Objective):** "99.9% of requests succeed."
109
+ **Error budget:** The allowed failure (e.g., 0.1% = ~43 min downtime/month). When you're within budget, you can ship; when you're over, focus on reliability.
110
+
111
+ Incidents consume the error budget. Use postmortems to decide: was this a one-off or a systemic problem? Invest in fixes that prevent recurrence and protect the budget.
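
The "~43 min downtime/month" figure follows from simple arithmetic: a 30-day month has 43,200 minutes, and 0.1% of that is 43.2 minutes. A minimal sketch of the calculation, assuming a 30-day window and treating the request-success SLO as an availability target the way the text does (`error_budget_minutes` and `budget_remaining` are hypothetical helper names):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an SLO given as a fraction (e.g. 0.999)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of budget left; a negative result means the budget is overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

For example, a 45-minute SEV1 against a 99.9% SLO leaves `budget_remaining(0.999, 45)` slightly below zero, which is the signal to shift focus from shipping to reliability.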
112
+
113
+ ## Incident Lifecycle Flow (Simplified)
114
+
115
+ ```mermaid
116
+ flowchart LR
117
+ A[Alert] --> B[Triage]
118
+ B --> C[Mitigate]
119
+ C --> D[Resolve]
120
+ D --> E[Postmortem]
121
+ E --> A
122
+ ```
123
+
124
+ Continuous improvement: each incident makes the system more resilient if we learn from it.
@@ -0,0 +1,82 @@
1
+ # Incident Response — Exercises
2
+
3
+ ## Exercise 1: Severity Classification
4
+
5
+ **Task:** Classify these incidents as SEV1–4 and justify: (A) Payment processing returns 500 for 10% of users. (B) Typo on "About" page. (C) Full site returns 502 for all users. (D) Search is slow (3s vs 1s) for one region.
6
+
7
+ **Validation:**
8
+ - [ ] Each has a severity
9
+ - [ ] Justification references impact and scope
10
+ - [ ] Response time is implied or stated
11
+
12
+ **Hints:**
13
+ 1. Full site down → SEV1
14
+ 2. Payment 500 for 10% → SEV2 (major, key feature)
15
+ 3. Search slow, one region → SEV3 (degradation, workaround: wait or retry)
16
+ 4. Typo → SEV4
17
+
18
+ ---
19
+
20
+ ## Exercise 2: Communication Template in Action
21
+
22
+ **Task:** A database failover fails at 2 p.m.; the site is down. Write the internal announcement and external status update you'd post within 5 minutes. Then write the "Resolved" update for when you fix it 45 minutes later.
23
+
24
+ **Validation:**
25
+ - [ ] Internal has: severity, impact, status, IC, ETA
26
+ - [ ] External is customer-friendly, no jargon
27
+ - [ ] Resolved update thanks users and summarizes what happened
28
+
29
+ **Hints:**
30
+ 1. Internal: "SEV1, full outage, mitigating, IC: [name], ETA: investigating"
31
+ 2. External: "We're experiencing an outage. Our team is on it. Next update in 15 min."
32
+ 3. Resolved: "Service restored. We'll share a postmortem within 48 hours."
33
+
34
+ ---
35
+
36
+ ## Exercise 3: Blameless Postmortem Draft
37
+
38
+ **Task:** A deploy at 10 a.m. introduced a bug; it wasn't caught by CI. Incident resolved at 11:30 a.m. Draft the "Root Cause" section using 5 whys. Use blameless language—focus on process, not people.
39
+
40
+ **Validation:**
41
+ - [ ] 5 whys go from symptom to systemic cause
42
+ - [ ] No blame ("the process allowed" not "Alice didn't test")
43
+ - [ ] At least one action item is implied
44
+
45
+ **Hints:**
46
+ 1. Why did the bug reach prod? → CI didn't catch it
47
+ 2. Why didn't CI catch it? → No test for this case
48
+ 3. Why no test? → Gap in test coverage / requirements
49
+ 4. Keep going to process: how do we close the gap?
50
+
51
+ ---
52
+
53
+ ## Exercise 4: Runbook Entry
54
+
55
+ **Task:** Write a runbook entry for "Database replica lag exceeds 30 seconds." Include: how to detect, how to confirm, mitigation steps (2–4), escalation path, and where to log the incident.
56
+
57
+ **Validation:**
58
+ - [ ] Detection is specific (metric, dashboard, alert)
59
+ - [ ] Mitigation steps are ordered and actionable
60
+ - [ ] Escalation path is clear
61
+
62
+ **Hints:**
63
+ 1. Detect: CloudWatch/graph, PagerDuty alert
64
+ 2. Confirm: Check replica lag metric, verify impact
65
+ 3. Mitigate: Identify slow query, kill if safe, scale read replicas, failover if critical
66
+ 4. Escalate: DB team, IC if no progress in 15 min
67
+
68
+ ---
69
+
70
+ ## Exercise 5: On-Call Rotation Design
71
+
72
+ **Task:** Design an on-call rotation for a 5-person team. Include: primary and secondary, rotation cadence (e.g., weekly), handoff process, and one policy to prevent fatigue (e.g., no back-to-back primaries).
73
+
74
+ **Validation:**
75
+ - [ ] Primary and secondary defined
76
+ - [ ] Cadence and handoff are specified
77
+ - [ ] At least one fatigue-prevention policy
78
+
79
+ **Hints:**
80
+ 1. Weekly primary, weekly secondary (different person)
81
+ 2. Handoff: 15-min call, share recent incidents, in-flight work
82
+ 3. Policy: No one does primary 2 weeks in a row; comp time after SEV1
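
One possible answer to Exercise 5 is a round-robin schedule in which this week's secondary becomes next week's primary. The sketch below is an illustration under assumptions: `build_rotation` and `has_back_to_back_primary` are hypothetical helpers, the cadence is weekly, and the team member labels are placeholders.

```python
from typing import List, Tuple


def build_rotation(team: List[str], weeks: int) -> List[Tuple[int, str, str]]:
    """Return (week_number, primary, secondary) tuples for a weekly rotation.

    Round-robin: the secondary in week N is the primary in week N+1, so the
    incoming primary has already shadowed the previous week's incidents.
    """
    if len(team) < 2:
        raise ValueError("need at least two people for primary and secondary")
    schedule = []
    for week in range(weeks):
        primary = team[week % len(team)]
        secondary = team[(week + 1) % len(team)]
        schedule.append((week + 1, primary, secondary))
    return schedule


def has_back_to_back_primary(schedule: List[Tuple[int, str, str]]) -> bool:
    """Check the fatigue policy: nobody is primary two weeks in a row."""
    primaries = [primary for _, primary, _ in schedule]
    return any(a == b for a, b in zip(primaries, primaries[1:]))


team = ["eng-a", "eng-b", "eng-c", "eng-d", "eng-e"]  # placeholder labels
rotation = build_rotation(team, weeks=10)
```

With five people, each person is primary once every five weeks and the back-to-back check always passes; the handoff call and comp-time policy from the hints still need to be documented separately.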
@@ -0,0 +1,132 @@
1
+ games:
2
+ - type: scenario
3
+ title: "Incident Commander"
4
+ startHealth: 5
5
+ steps:
6
+ - id: start
7
+ situation: "It's 2 AM. PagerDuty wakes you — the payment service is down. Alerts are firing: 503s on checkout, payment gateway timeouts. You're the incident commander. What do you do first?"
8
+ choices:
9
+ - text: "Declare SEV1 immediately and page the full team."
10
+ consequence: "You acted decisively. SEV1 is correct for customer-facing payment outages. The team is assembling."
11
+ health: 1
12
+ next: triage
13
+ - text: "Wait 5 minutes to see if it self-recovers before paging anyone."
14
+ consequence: "Every minute of payment downtime costs revenue and trust. SEV1 incidents need immediate response."
15
+ health: -2
16
+ next: triage
17
+ - text: "Post in Slack and ask if anyone has seen this before."
18
+ consequence: "Slack is too slow for payment outages. You need to formally declare severity and page. Time is critical."
19
+ health: -1
20
+ next: triage
21
+ - id: triage
22
+ situation: "The team is joining. You confirm: checkout is returning 503, payment gateway is timing out. One engineer says the gateway provider's status page shows 'degraded.' What's your next move?"
23
+ choices:
24
+ - text: "Create a shared doc, assign roles (comms, technical lead, scribe), and have tech lead investigate root cause."
25
+ consequence: "Good incident hygiene. Clear roles prevent chaos. The technical lead can focus on debugging while you coordinate."
26
+ health: 1
27
+ next: communicate
28
+ - text: "Have everyone start debugging in parallel."
29
+ consequence: "Too many cooks. Without a single technical lead and clear assignments, people duplicate work and step on each other."
30
+ health: -1
31
+ next: communicate
32
+ - text: "Pause and run a full architecture review before touching anything."
33
+ consequence: "Architecture reviews are for postmortems. During an incident, you need fast triage and mitigation, not deep analysis."
34
+ health: -2
35
+ next: communicate
36
+ - id: communicate
37
+ situation: "Stakeholders are asking for updates. Support is getting flooded with tickets. The technical lead is still investigating. What do you prioritize?"
38
+ choices:
39
+ - text: "Send a brief status to stakeholders — 'We're investigating payment gateway issues, ETA for next update in 15 min' — then let the tech lead work."
40
+ consequence: "Right balance. Stakeholders get reassurance; the team isn't distracted. Time-boxed updates manage expectations."
41
+ health: 1
42
+ next: debug_vs_comms
43
+ - text: "Ignore stakeholders until you have a root cause."
44
+ consequence: "Stakeholders need to know you're on it. Radio silence increases anxiety and can trigger escalations."
45
+ health: -1
46
+ next: debug_vs_comms
47
+ - text: "Pull the technical lead off investigation to draft a detailed customer-facing post."
48
+ consequence: "Investigation should take priority. Detailed posts can wait. Brief internal updates are enough for now."
49
+ health: -2
50
+ next: debug_vs_comms
51
+ - id: debug_vs_comms
52
+ situation: "The tech lead suspects a bad deploy from 2 hours ago. Rolling back would take ~20 minutes. The payment gateway provider's status still says 'degraded.' Do you roll back or wait for more data?"
53
+ choices:
54
+ - text: "Rollback now. Payment is critical; the deploy is the most likely culprit and we can't afford to wait."
55
+ consequence: "Pragmatic. For payment, speed matters. If the rollback fixes it, you're done. If not, you've ruled out a major variable."
56
+ health: 1
57
+ next: rollback_result
58
+ - text: "Wait for the gateway provider to confirm their status before deciding."
59
+ consequence: "External status pages can lag. Your own deploy is something you control. Delaying the rollback extends customer impact."
60
+ health: -1
61
+ next: rollback_result
62
+ - text: "Run A/B tests to confirm the deploy is the cause."
63
+ consequence: "A/B tests take time. During an incident, you need fast, reversible actions. Rollback is low risk and high signal."
64
+ health: -2
65
+ next: rollback_result
66
+ - id: rollback_result
67
+ situation: "You rolled back. Checkout recovers. The incident is resolved. Now the CEO asks for a postmortem by end of week. How do you approach it?"
68
+ choices:
69
+ - text: "Schedule a blameless postmortem with everyone involved. Focus on what we'll change (process, tooling, checks), not who screwed up."
70
+ consequence: "Blameless postmortems build learning culture. People share more, and you get better preventative measures."
71
+ health: 1
72
+ next: postmortem
73
+ - text: "Assign it to the engineer who pushed the deploy."
74
+ consequence: "That creates blame and discourages future transparency. Postmortems should be collaborative and blameless."
75
+ health: -2
76
+ next: postmortem
77
+ - text: "Skip the postmortem — we fixed it, let's move on."
78
+ consequence: "Every incident is a learning opportunity. Skipping postmortems means the same failures repeat."
79
+ health: -1
80
+ next: postmortem
81
+ - id: postmortem
82
+ situation: "In the postmortem, the team identifies two gaps: the payment service had no canary, and the deploy went out without a staging gate. What should you document?"
83
+ choices:
84
+ - text: "Write action items: add canary deployment, require staging validation for payment service, set up alerts for gateway timeouts."
85
+ consequence: "Well done, Incident Commander. Clear, actionable items. You triaged severity, coordinated the team, communicated with stakeholders, made a timely rollback decision, and ran a blameless postmortem."
86
+ health: 1
87
+ next: end
88
+ - text: "Document that 'we'll be more careful next time.'"
89
+ consequence: "Vague. 'Be more careful' doesn't change systems. You need concrete process or tooling changes."
90
+ health: -1
91
+ next: end
92
+ - text: "Blame the engineer who merged without sufficient review."
93
+ consequence: "Blameless means focusing on system failures, not individuals. Blame stops people from participating honestly."
94
+ health: -2
95
+ next: end
96
+
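The scenario above reads as a small state machine: each step has an `id`, and each choice carries a `health` delta and a `next` target. The sketch below is one way to sanity-check such a graph and replay a path through it. It assumes health deltas are simply summed, which the diff does not confirm, and `Choice`, `STEPS`, `dangling_targets`, and `play` are hypothetical names with choice texts abbreviated from the file.

```python
from dataclasses import dataclass


@dataclass
class Choice:
    text: str
    health: int
    next: str


# Hypothetical, abbreviated copy of the last three steps of the scenario.
STEPS = {
    "debug_vs_comms": [
        Choice("Roll back now", 1, "rollback_result"),
        Choice("Wait for the gateway provider", -1, "rollback_result"),
        Choice("Run A/B tests", -2, "rollback_result"),
    ],
    "rollback_result": [
        Choice("Schedule a blameless postmortem", 1, "postmortem"),
        Choice("Assign it to the engineer who deployed", -2, "postmortem"),
        Choice("Skip the postmortem", -1, "postmortem"),
    ],
    "postmortem": [
        Choice("Write concrete action items", 1, "end"),
        Choice("Document 'be more careful'", -1, "end"),
        Choice("Blame the engineer", -2, "end"),
    ],
}


def dangling_targets(steps):
    """Return every `next` value that is neither a step id nor "end"."""
    known = set(steps) | {"end"}
    targets = {c.next for choices in steps.values() for c in choices}
    return sorted(targets - known)


def play(steps, start, picks):
    """Follow one pick (a choice index) per step and sum the health deltas."""
    total, step = 0, start
    for pick in picks:
        if step == "end":
            break
        choice = steps[step][pick]
        total += choice.health
        step = choice.next
    return total
```

Under these assumptions, `play(STEPS, "debug_vs_comms", [0, 0, 0])` follows the first choice at each step, and `dangling_targets(STEPS)` returns an empty list when every `next` resolves.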
97
+ - type: classify
98
+ title: "Severity Sorter"
99
+ categories:
100
+ - name: "SEV1 - Critical"
101
+ color: "#f85149"
102
+ - name: "SEV2 - Major"
103
+ color: "#d29922"
104
+ - name: "SEV3 - Minor"
105
+ color: "#58a6ff"
106
+ - name: "SEV4 - Low"
107
+ color: "#8b949e"
108
+ items:
109
+ - text: "All checkouts failing, payment service returning 503. Revenue impact."
110
+ category: "SEV1 - Critical"
111
+ - text: "Authentication service down — no one can log in."
112
+ category: "SEV1 - Critical"
113
+ - text: "Database primary unreachable; replicas serving read-only. Writes failing."
114
+ category: "SEV1 - Critical"
115
+ - text: "Search is slow (2–3s) but returning results. Some users complaining."
116
+ category: "SEV2 - Major"
117
+ - text: "CDN cache miss rate up 20%. Page loads slightly slower."
118
+ category: "SEV2 - Major"
119
+ - text: "Admin dashboard export fails for reports over 10k rows."
120
+ category: "SEV2 - Major"
121
+ - text: "Non-critical feature flag not applying for a small segment."
122
+ category: "SEV3 - Minor"
123
+ - text: "Email notifications delayed by 5–10 minutes."
124
+ category: "SEV3 - Minor"
125
+ - text: "Minor UI glitch on settings page in Safari only."
126
+ category: "SEV3 - Minor"
127
+ - text: "Spelling error in FAQ page."
128
+ category: "SEV4 - Low"
129
+ - text: "Deprecated API endpoint returns 410 as expected; one client not updated."
130
+ category: "SEV4 - Low"
131
+ - text: "Analytics pipeline backlog of 1 hour; dashboards slightly stale."
132
+ category: "SEV4 - Low"
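A classify game like the one above only works if every item's `category` matches one of the declared category names exactly. The following is a minimal sketch of that check plus a simple scorer. It assumes one point per correct placement, which the diff does not state, and `unknown_categories` and `score` are hypothetical helper names with the items abbreviated from the file.

```python
CATEGORIES = ["SEV1 - Critical", "SEV2 - Major", "SEV3 - Minor", "SEV4 - Low"]

# Abbreviated sample of the items listed in the game definition.
ITEMS = [
    {"text": "Authentication service down", "category": "SEV1 - Critical"},
    {"text": "Search is slow but returning results", "category": "SEV2 - Major"},
    {"text": "Email notifications delayed", "category": "SEV3 - Minor"},
    {"text": "Spelling error in FAQ page", "category": "SEV4 - Low"},
]


def unknown_categories(items, categories):
    """Return the text of every item whose category is not declared."""
    return [item["text"] for item in items if item["category"] not in categories]


def score(items, answers):
    """Count answers that match the expected category (assumed: 1 point each)."""
    return sum(1 for item, answer in zip(items, answers) if item["category"] == answer)
```

A typo such as `"SEV2 - major"` in an item would show up in `unknown_categories` rather than silently making that item unanswerable.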
@@ -0,0 +1,45 @@
1
+ slug: incident-response
2
+ title: "Incident Response — From Alert to Postmortem"
3
+ version: 1.0.0
4
+ description: "Handle production incidents: severity levels, roles, lifecycle, communication, blameless postmortems, and on-call best practices."
5
+ category: leadership
6
+ tags: [incident-response, on-call, postmortem, reliability, sre, blameless]
7
+ difficulty: intermediate
8
+
9
+ xp:
10
+ read: 15
11
+ walkthrough: 40
12
+ exercise: 25
13
+ quiz: 20
14
+ quiz-perfect-bonus: 10
15
+ game: 25
16
+ game-perfect-bonus: 15
17
+
18
+ time:
19
+ quick: 5
20
+ read: 20
21
+ guided: 50
22
+
23
+ prerequisites: [code-review]
24
+ related: [risk-management, debugging-systematically, stakeholder-communication]
25
+
26
+ triggers:
27
+ - "How do I handle a production incident?"
28
+ - "What is a blameless postmortem?"
29
+ - "How do I set up an on-call rotation?"
30
+ - "How do I run an incident response process?"
31
+
32
+ visuals:
33
+ diagrams: [diagram-flow, diagram-mermaid]
34
+ quiz-types: [quiz-drag-order, quiz-timed-choice]
35
+ game-types: [scenario, classify]
36
+ playground: bash
37
+ slides: true
38
+
39
+ sources:
40
+ - url: "https://sre.google/sre-book/"
41
+ label: "Google SRE Book"
42
+ type: docs
43
+ - url: "https://response.pagerduty.com"
44
+ label: "PagerDuty Incident Response Guide"
45
+ type: docs
@@ -0,0 +1,53 @@
1
+ # Incident Response — Quick Reference
2
+
3
+ ## Severity Levels
4
+
5
+ | SEV | Impact | Response |
6
+ |-----|--------|----------|
7
+ | 1 | Critical: full outage, data loss | Immediate (~5 min) |
8
+ | 2 | Major: key feature broken | 15–30 min |
9
+ | 3 | Minor: limited impact, workaround | Hours |
10
+ | 4 | Low: cosmetic, edge case | Next business day |
11
+
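The severity table can be turned into a lookup that computes a first-response deadline. This is a rough sketch, assuming the upper bound of each window is the target (5 minutes for SEV1, 30 minutes for SEV2). SEV3 and SEV4 return `None` because the table gives no numeric window for them. `RESPONSE_TARGET` and `respond_by` are hypothetical names.

```python
from datetime import datetime, timedelta

# Upper bound of each response window from the table above.
# None means the table gives no numeric target ("Hours", "Next business day").
RESPONSE_TARGET = {
    1: timedelta(minutes=5),
    2: timedelta(minutes=30),
    3: None,
    4: None,
}


def respond_by(sev, detected_at):
    """Return the latest acceptable first-response time, or None if unspecified."""
    target = RESPONSE_TARGET[sev]
    return None if target is None else detected_at + target
```

For a SEV1 detected at 03:00, `respond_by(1, datetime(2024, 1, 1, 3, 0))` gives 03:05 on the same day.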
12
+ ## Roles
13
+
14
+ | Role | Responsibility |
15
+ |------|----------------|
16
+ | Incident Commander | Coordinate, decide, delegate |
17
+ | Comms Lead | Stakeholder updates, status page |
18
+ | Responders | Debug, rollback, fix |
19
+
20
+ ## Lifecycle
21
+
22
+ 1. **Detect** — Alert, user report
23
+ 2. **Triage** — Severity, scope, assign IC
24
+ 3. **Mitigate** — Rollback, fix, workaround
25
+ 4. **Resolve** — Verify, all-clear
26
+ 5. **Postmortem** — Blameless, 5 whys
27
+ 6. **Action Items** — Track, prevent recurrence
28
+
29
+ ## Communication Template
30
+
31
+ **Internal:** Severity, impact, status, IC, ETA
32
+ **External:** "We're aware of X. Impact: Y. ETA: Z."
33
+ **Cadence:** Every 15–30 min for SEV1/2
34
+
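The two templates above map naturally onto small formatting helpers. The sketch below is illustrative only: the function names and the internal message layout are assumptions, and the `"investigating"` fallback for a missing ETA follows the wording used in the quiz for this module.

```python
def external_update(issue, impact, eta=None):
    """Customer-facing update: aware of X, impact Y, ETA Z (or 'investigating')."""
    eta_text = eta if eta else "investigating"
    return f"We're aware of {issue}. Impact: {impact}. ETA: {eta_text}."


def internal_update(severity, impact, status, ic, eta):
    """Internal announcement carrying the five fields listed in the template."""
    return f"[SEV{severity}] Impact: {impact} | Status: {status} | IC: {ic} | ETA: {eta}"
```

Keeping the external message free of names and technical detail is deliberate: only the internal helper takes an `ic` argument.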
35
+ ## Blameless Postmortem
36
+
37
+ - Summary, Timeline, Root Cause (5 whys), Impact, Action Items, Lessons
38
+ - No blame: focus on systems and process
39
+ - Action items: owner, due date
40
+
41
+ ## On-Call Best Practices
42
+
43
+ - Runbooks for common incidents
44
+ - Escalation path: primary → secondary → manager
45
+ - Rotation: fair, no back-to-back primary
46
+ - Post-incident rest for IC
47
+ - Compensation / flexibility
48
+
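The escalation path in the list above (primary, then secondary, then manager) can be modelled as a function of how long a page has gone unacknowledged. This is a minimal sketch. The 10-minute acknowledgement timeout is an assumption chosen for illustration and does not come from the source, and `current_tier` is a hypothetical name.

```python
ESCALATION_PATH = ["primary", "secondary", "manager"]


def current_tier(minutes_unacked, ack_timeout_min=10):
    """Return who should be paged after `minutes_unacked` without an acknowledgement.

    Each tier gets `ack_timeout_min` minutes (an assumed value) before the
    page moves on; the last tier stays paged indefinitely.
    """
    steps = int(minutes_unacked // ack_timeout_min)
    index = max(0, min(steps, len(ESCALATION_PATH) - 1))
    return ESCALATION_PATH[index]
```

With the default timeout, a page unacknowledged for 12 minutes is with the secondary, and anything past 20 minutes stays with the manager.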
49
+ ## Error Budgets
50
+
51
+ - SLO: e.g., 99.9% success
52
+ - Error budget: allowed failure (0.1% ≈ 43 min/month)
53
+ - Incidents consume budget; postmortems guide investment in reliability
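The "0.1% ≈ 43 min/month" figure above comes from multiplying the allowed failure fraction by the minutes in a 30-day month (43,200). The sketch below reproduces that arithmetic. It assumes a time-based SLO and a 30-day window, and both function names are hypothetical.

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    return (1 - slo) * days * 24 * 60


def budget_remaining(slo, downtime_minutes, days=30):
    """Minutes of budget left after `downtime_minutes` of incidents."""
    return error_budget_minutes(slo, days) - downtime_minutes
```

For a 99.9% SLO this gives about 43.2 minutes per 30 days, so a single 20-minute incident uses a little under half of the month's budget.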
@@ -0,0 +1,103 @@
1
+ # Incident Response — Quiz
2
+
3
+ ## Question 1
4
+
5
+ What is the primary role of the Incident Commander?
6
+
7
+ A) Fix the bug
8
+ B) Coordinate the response, make decisions, delegate—not necessarily fix
9
+ C) Update the status page
10
+ D) Write the postmortem
11
+
12
+ <!-- ANSWER: B -->
13
+ <!-- EXPLANATION: The IC owns the response and orchestrates. They coordinate responders, decide on approach, and delegate. They may not be the one typing the fix—their job is to ensure the team responds effectively. Fixing is the responders' job. -->
14
+
15
+ ## Question 2
16
+
17
+ What makes a postmortem "blameless"?
18
+
19
+ A) No one is mentioned
20
+ B) Focus on systems and process, not individuals; we learn, not punish
21
+ C) Only positive feedback
22
+ D) No root cause is identified
23
+
24
+ <!-- ANSWER: B -->
25
+ <!-- EXPLANATION: Blameless means we analyze what happened in terms of systems, process, and decisions—not "who messed up." The goal is learning and preventing recurrence, not assigning fault. People need to feel safe to contribute honestly. -->
26
+
27
+ ## Question 3
28
+
29
+ Which severity typically requires immediate (e.g., 5 min) response?
30
+
31
+ A) SEV4
32
+ B) SEV3
33
+ C) SEV2
34
+ D) SEV1
35
+
36
+ <!-- ANSWER: D -->
37
+ <!-- EXPLANATION: SEV1 is critical—full outage, data loss, security breach. Response time is immediate (e.g., 5 min). SEV2 is major (15–30 min); SEV3/4 are slower. -->
38
+
39
+ ## Question 4
40
+
41
+ What should the external status update include?
42
+
43
+ A) Technical details of the failure
44
+ B) We're aware of X, impact is Y, ETA is Z (or "investigating")
45
+ C) Names of engineers working on it
46
+ D) Root cause analysis
47
+
48
+ <!-- ANSWER: B -->
49
+ <!-- EXPLANATION: External updates should be clear and customer-friendly: we're aware, here's the impact, here's our ETA (or that we're investigating). No technical jargon, no blame, no internal details. -->
50
+
51
+ ## Question 5
52
+
53
+ The "5 whys" in a postmortem are used to:
54
+
55
+ A) Assign blame
56
+ B) Trace from symptom to systemic/root cause
57
+ C) List five people involved
58
+ D) Count how many things went wrong
59
+
60
+ <!-- ANSWER: B -->
61
+ <!-- EXPLANATION: 5 whys is a technique to dig from the surface symptom to deeper causes. "Why did X happen?" → "Because Y." "Why did Y happen?" → Repeat. You reach contributing factors in process and design, not just the immediate trigger. -->
62
+
63
+ ## Question 6
64
+
65
+ To prevent incident fatigue, a good practice is:
66
+
67
+ A) Have one person always on-call
68
+ B) Limit consecutive weeks on primary; offer post-incident rest
69
+ C) Only page during business hours
70
+ D) Skip postmortems to save time
71
+
72
+ <!-- ANSWER: B -->
73
+ <!-- EXPLANATION: Limit consecutive primary weeks, rotate fairly, and give the IC rest after a SEV1/2 (no meetings for a few hours). Keeping one person always on-call and skipping rest both lead to burnout. -->
74
+
75
+ ## Question 7
76
+
77
+ <!-- VISUAL: quiz-drag-order -->
78
+
79
+ Put these incident lifecycle phases in the correct order:
80
+
81
+ A) Mitigation (contain and fix)
82
+ B) Detection and triage
83
+ C) Post-incident review
84
+ D) Recovery (restore service)
85
+ E) Preparation (runbooks, on-call)
86
+
87
+ <!-- ANSWER: E,B,A,D,C -->
88
+ <!-- EXPLANATION: Prepare first (runbooks, on-call). When an incident occurs, detect and triage, then mitigate (contain and fix), recover (restore service), and finally conduct a post-incident review to learn. -->
89
+
90
+ ## Question 8
91
+
92
+ <!-- VISUAL: quiz-drag-order -->
93
+
94
+ Put these postmortem sections in a logical order for the document:
95
+
96
+ A) Timeline of events
97
+ B) Root cause analysis
98
+ C) Impact and severity
99
+ D) Action items
100
+ E) Summary
101
+
102
+ <!-- ANSWER: E,C,A,B,D -->
103
+ <!-- EXPLANATION: A postmortem typically opens with a summary, then impact/severity, a chronological timeline, root cause analysis (5 whys, etc.), and closes with action items to prevent recurrence. -->
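The quiz stores answers in HTML comments such as `<!-- ANSWER: B -->` and, for drag-order questions, `<!-- ANSWER: E,B,A,D,C -->`. How the package actually reads these is not shown in the diff. The following is only a sketch of one way to extract and grade them, with `ANSWER_RE`, `parse_answers`, and `is_correct` as hypothetical names.

```python
import re

# Matches a single letter or a comma-separated sequence of letters.
ANSWER_RE = re.compile(r"<!--\s*ANSWER:\s*([A-Z](?:\s*,\s*[A-Z])*)\s*-->")


def parse_answers(markdown):
    """Return one list of letters per ANSWER comment, in document order."""
    return [[part.strip() for part in match.split(",")]
            for match in ANSWER_RE.findall(markdown)]


def is_correct(expected, given):
    """Exact match; for drag-order questions the sequence order matters."""
    return expected == given
```

Treating a single-choice answer as a one-element list keeps the grading path identical for both question types.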
@@ -0,0 +1,40 @@
1
+ # Incident Response — Resources
2
+
3
+ ## Official Guides
4
+
5
+ - [Google SRE Book](https://sre.google/sre-book/) — Free online. Covers incident response, postmortems, SLOs, and error budgets. Essential reading.
6
+ - [Google SRE Book — Managing Incidents](https://sre.google/sre-book/managing-incidents/) — Direct chapter on incident management.
7
+ - [PagerDuty Incident Response Guide](https://response.pagerduty.com) — Roles, workflows, communication, and best practices.
8
+ - [PagerDuty — Postmortem Best Practices](https://response.pagerduty.com/before/incident-response/postmortem/) — Blameless postmortems.
9
+
10
+ ## Articles
11
+
12
+ - [Blameless PostMortems and a Just Culture](https://codeascraft.com/2013/11/14/blameless-postmortems-and-a-just-culture/) — Etsy (Code as Craft). Why blameless matters and how to build the culture.
13
+ - [Writing an Incident Postmortem](https://www.atlassian.com/incident-management/postmortem) — Atlassian. Template and examples.
14
+ - [The Five Whys](https://www.atlassian.com/incident-management/postmortem/blameless) — Using 5 whys in postmortems.
15
+ - [Incident Communication Best Practices](https://www.pagerduty.com/blog/incident-communication-best-practices/) — PagerDuty. Status updates and stakeholder communication.
16
+ - [On-Call Best Practices](https://www.pagerduty.com/resources/learn/on-call-best-practices/) — PagerDuty. Rotation, runbooks, fatigue prevention.
17
+
18
+ ## Books
19
+
20
+ - **Site Reliability Engineering** (O'Reilly) — Google SRE book in print. Covers incidents, SLOs, and more.
21
+ - **The Phoenix Project** by Gene Kim et al. — Novel about IT ops and incident response; illustrates principles.
22
+ - **An Elegant Puzzle** by Will Larson — Includes on-call, incident process, and team reliability.
23
+
24
+ ## Tools
25
+
26
+ - [PagerDuty](https://www.pagerduty.com/) — Incident alerting, on-call scheduling, escalation.
27
+ - [OpsGenie](https://www.atlassian.com/software/opsgenie) — Alerting and on-call management.
28
+ - [Statuspage](https://www.atlassian.com/software/statuspage) — Public status pages.
29
+ - [Rootly](https://www.rootly.com/) — Incident management and runbooks.
30
+ - [Incident.io](https://incident.io/) — Incident response workflow and automation.
31
+
32
+ ## Videos
33
+
34
+ - [Google SRE — How We Do It](https://www.youtube.com/results?search_query=google+sre+incident) — Search for talks on incident response.
35
+ - [Blameless Postmortems at Etsy](https://www.youtube.com/results?search_query=blameless+postmortem+etsy) — Culture and process.
36
+
37
+ ## Podcasts
38
+
39
+ - [Engineering Culture by InfoQ](https://www.infoq.com/podcasts/engineering-culture/) — Episodes on reliability and incidents.
40
+ - [SRE Weekly](https://sreweekly.com/) — Newsletter; often covers incident and postmortem content.
@@ -0,0 +1,82 @@
1
+ # Incident Response Walkthrough — Learn by Doing
2
+
3
+ ## Before We Begin
4
+
5
+ Incidents follow a lifecycle: detect, triage, communicate, mitigate, resolve, and learn. How you handle each phase—and who does what—determines whether you fix things quickly and improve, or repeat the same mistakes.
6
+
7
+ **Diagnostic question:** Have you (or your team) ever had something break in production? What went well? What was chaotic—alerts, communication, or knowing who owned what?
8
+
9
+ **Checkpoint:** You can name at least two phases of the incident lifecycle and one thing that often goes wrong.
10
+
11
+ ---
12
+
13
+ ## Step 1: Define Severity Levels
14
+
15
+ <!-- hint:diagram mermaid-type="flowchart" topic="Incident lifecycle from detect to postmortem" -->
16
+ <!-- hint:list style="cards" -->
17
+
18
+ **Task:** For your system (or a hypothetical one), define SEV1–4 with concrete examples. Include: impact description, example scenario, and target response time for each.
19
+
20
+ **Question:** How would someone know in the moment which severity applies? What's ambiguous?
21
+
22
+ **Checkpoint:** Each severity has a clear example and response time.
23
+
24
+ ---
25
+
26
+ ## Step 2: Draft a Communication Template
27
+
28
+ **Task:** Write two templates: (1) internal incident announcement (Slack/email) and (2) external status page update. Include placeholders for severity, impact, status, and ETA.
29
+
30
+ **Question:** What do stakeholders need to know immediately vs. what can wait? How often should you update?
31
+
32
+ **Checkpoint:** Templates are copy-pastable with clear placeholders.
33
+
34
+ ---
35
+
36
+ ## Step 3: Outline Incident Roles
37
+
38
+ **Task:** Define the roles for your team's incident response: Incident Commander, Comms Lead, Responders. For each, write 2–3 responsibilities. Who would fill each role in a real incident?
39
+
40
+ **Question:** What happens if the usual IC is unavailable? How do you rotate?
41
+
42
+ **Checkpoint:** Roles are defined; you know who does what and who backs up whom.
43
+
44
+ ---
45
+
46
+ ## Step 4: Write a Postmortem Template
47
+
48
+ **Task:** Create a blameless postmortem template. Include: Summary, Timeline, Root Cause (5 whys), Impact, Action Items (with owners), and Lessons Learned. Add 1–2 "blameless language" guidelines (how to describe what happened without naming individuals negatively).
49
+
50
+ **Question:** How do you make it safe for people to contribute honestly? What would discourage participation?
51
+
52
+ **Checkpoint:** Template is complete; language guidelines support blameless culture.
53
+
54
+ ---
55
+
56
+ ## Step 5: Map the Incident Lifecycle
57
+
58
+ **Task:** Draw or describe your team's incident flow from Detect → Resolve → Postmortem. Include: how alerts fire, who gets paged, how triage happens, where updates are posted, and when the postmortem is scheduled.
59
+
60
+ **Question:** Where are the gaps? What would fail in a real 3 a.m. incident?
61
+
62
+ **Checkpoint:** Flow is documented; at least one gap is identified.
63
+
64
+ ---
65
+
66
+ ## Step 6: Design an On-Call Runbook Entry
67
+
68
+ **Task:** Pick one realistic incident type (e.g., "High error rate on API"). Write a runbook entry: symptoms, how to confirm, steps to mitigate, who to escalate to, and where to document.
69
+
70
+ **Question:** Could someone unfamiliar with the system follow this at 2 a.m.? What's missing?
71
+
72
+ **Checkpoint:** Runbook is actionable; an on-call engineer could follow it.
73
+
74
+ ---
75
+
76
+ ## Step 7: Plan a Post-Incident Review
77
+
78
+ **Task:** Schedule a blameless postmortem for a past incident (real or hypothetical). Write the agenda: timebox per section, who facilitates, how action items are tracked. Include one "retrospective" question: "What would we do differently in the response itself?"
79
+
80
+ **Question:** How do you avoid making the postmortem feel punitive? How do you ensure action items get done?
81
+
82
+ **Checkpoint:** Agenda is timeboxed; facilitation and follow-through are planned.