codingbuddy-rules 2.4.2 → 3.0.2
- package/.ai-rules/CHANGELOG.md +122 -0
- package/.ai-rules/agents/README.md +527 -11
- package/.ai-rules/agents/accessibility-specialist.json +0 -1
- package/.ai-rules/agents/act-mode.json +0 -1
- package/.ai-rules/agents/agent-architect.json +0 -1
- package/.ai-rules/agents/ai-ml-engineer.json +0 -1
- package/.ai-rules/agents/architecture-specialist.json +14 -2
- package/.ai-rules/agents/backend-developer.json +14 -2
- package/.ai-rules/agents/code-quality-specialist.json +0 -1
- package/.ai-rules/agents/data-engineer.json +0 -1
- package/.ai-rules/agents/devops-engineer.json +24 -2
- package/.ai-rules/agents/documentation-specialist.json +0 -1
- package/.ai-rules/agents/eval-mode.json +0 -1
- package/.ai-rules/agents/event-architecture-specialist.json +719 -0
- package/.ai-rules/agents/frontend-developer.json +14 -2
- package/.ai-rules/agents/i18n-specialist.json +0 -1
- package/.ai-rules/agents/integration-specialist.json +11 -1
- package/.ai-rules/agents/migration-specialist.json +676 -0
- package/.ai-rules/agents/mobile-developer.json +0 -1
- package/.ai-rules/agents/observability-specialist.json +747 -0
- package/.ai-rules/agents/performance-specialist.json +24 -2
- package/.ai-rules/agents/plan-mode.json +0 -1
- package/.ai-rules/agents/platform-engineer.json +0 -1
- package/.ai-rules/agents/security-specialist.json +27 -16
- package/.ai-rules/agents/seo-specialist.json +0 -1
- package/.ai-rules/agents/solution-architect.json +0 -1
- package/.ai-rules/agents/technical-planner.json +0 -1
- package/.ai-rules/agents/test-strategy-specialist.json +14 -2
- package/.ai-rules/agents/ui-ux-designer.json +0 -1
- package/.ai-rules/rules/core.md +25 -0
- package/.ai-rules/skills/README.md +35 -0
- package/.ai-rules/skills/database-migration/SKILL.md +531 -0
- package/.ai-rules/skills/database-migration/expand-contract-patterns.md +314 -0
- package/.ai-rules/skills/database-migration/large-scale-migration.md +414 -0
- package/.ai-rules/skills/database-migration/rollback-strategies.md +359 -0
- package/.ai-rules/skills/database-migration/validation-procedures.md +428 -0
- package/.ai-rules/skills/dependency-management/SKILL.md +381 -0
- package/.ai-rules/skills/dependency-management/license-compliance.md +282 -0
- package/.ai-rules/skills/dependency-management/lock-file-management.md +437 -0
- package/.ai-rules/skills/dependency-management/major-upgrade-guide.md +292 -0
- package/.ai-rules/skills/dependency-management/security-vulnerability-response.md +230 -0
- package/.ai-rules/skills/incident-response/SKILL.md +373 -0
- package/.ai-rules/skills/incident-response/communication-templates.md +322 -0
- package/.ai-rules/skills/incident-response/escalation-matrix.md +347 -0
- package/.ai-rules/skills/incident-response/postmortem-template.md +351 -0
- package/.ai-rules/skills/incident-response/severity-classification.md +256 -0
- package/.ai-rules/skills/performance-optimization/CREATION-LOG.md +87 -0
- package/.ai-rules/skills/performance-optimization/SKILL.md +76 -0
- package/.ai-rules/skills/performance-optimization/documentation-template.md +70 -0
- package/.ai-rules/skills/pr-review/SKILL.md +768 -0
- package/.ai-rules/skills/refactoring/SKILL.md +192 -0
- package/.ai-rules/skills/refactoring/refactoring-catalog.md +1377 -0
- package/package.json +1 -1
@@ -0,0 +1,322 @@
+# Communication Templates
+
+Standardized templates for incident communication. Consistent format builds stakeholder trust.
+
+## The Communication Rule
+
+**Communicate BEFORE attempting fixes.** Stakeholders need to know:
+- That you're aware of the issue
+- What the impact is
+- What you're doing about it
+- When they'll hear next
+
+## Initial Incident Notification
+
+### Internal (Team/Slack/War Room)
+
+```
+🔴 INCIDENT DECLARED
+
+Severity: P[1-4]
+Started: [YYYY-MM-DD HH:MM UTC]
+Status: Investigating
+
+Impact:
+- [Affected service/feature]
+- [User impact - count/percentage]
+- [Business impact if known]
+
+Initial Observations:
+- [What triggered the alert]
+- [What we're seeing]
+
+Incident Commander: [Name]
+trace_id: [ID if available]
+
+Next Update: [Time - based on severity cadence]
+```
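
The internal template above could be filled in programmatically rather than by hand. The following is a minimal sketch, not part of the package: the `IncidentNotice` class, its field names, and `render_notice` are hypothetical names chosen for illustration, and the output simply mirrors the template's layout.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class IncidentNotice:
    """Fields of the internal INCIDENT DECLARED template (hypothetical names)."""
    severity: int              # 1-4, rendered as P1..P4
    started: datetime          # must be timezone-aware
    impact: list[str]
    observations: list[str]
    commander: str
    next_update: str
    trace_id: str = "n/a"
    status: str = "Investigating"


def render_notice(n: IncidentNotice) -> str:
    """Render the notice in the same order as the template."""
    if not 1 <= n.severity <= 4:
        raise ValueError("severity must be between 1 and 4")
    if n.started.tzinfo is None:
        raise ValueError("started must be timezone-aware")
    started = n.started.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    lines = [
        "🔴 INCIDENT DECLARED",
        "",
        f"Severity: P{n.severity}",
        f"Started: {started}",
        f"Status: {n.status}",
        "",
        "Impact:",
        *[f"- {item}" for item in n.impact],
        "",
        "Initial Observations:",
        *[f"- {item}" for item in n.observations],
        "",
        f"Incident Commander: {n.commander}",
        f"trace_id: {n.trace_id}",
        "",
        f"Next Update: {n.next_update}",
    ]
    return "\n".join(lines)
```

Converting the start time to UTC inside the renderer keeps every notice consistent with the template's `UTC` convention, whatever timezone the caller works in.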
+
+### External (Status Page)
+
+```
+[SERVICE] - Investigating Issues
+
+We are currently investigating reports of [brief description].
+
+Some users may experience [symptom].
+
+We are actively working to resolve this and will provide updates as we learn more.
+
+Started: [HH:MM UTC]
+```
+
+---
+
+## Status Updates
+
+### Progress Update (Internal)
+
+```
+📢 INCIDENT UPDATE - [INCIDENT_ID]
+
+Time: [HH:MM UTC]
+Severity: P[X] [unchanged/upgraded/downgraded]
+Status: [Investigating/Mitigating/Resolving/Resolved]
+
+Progress:
+- [What was done since last update]
+- [What was learned]
+
+Current Focus:
+- [What we're working on now]
+
+Blockers: [If any]
+
+ETA: [If known, or "Assessing"]
+
+Next Update: [Time]
+```
+
+### External Update (Status Page)
+
+```
+[SERVICE] - [Status]
+
+Update: We have identified the issue as [brief, non-technical description].
+
+We are [current action in user-friendly terms].
+
+[X]% of users may still experience [symptom].
+
+We will provide another update in [time period].
+```
+
+---
+
+## Mitigation Announcement
+
+### Mitigation Applied (Internal)
+
+```
+🟡 MITIGATION APPLIED - [INCIDENT_ID]
+
+Time: [HH:MM UTC]
+Action Taken: [Rollback/Feature flag/Traffic shift/etc.]
+
+Result:
+- [Immediate effect observed]
+- [Metrics improving/stable]
+
+Current Status:
+- Service: [Degraded/Partially restored/Stable]
+- Error rate: [Current vs peak]
+- User impact: [Current assessment]
+
+Root Cause: [If known, or "Under investigation"]
+
+Next Steps:
+- [Planned actions]
+
+Next Update: [Time]
+```
+
+### External Update (Status Page)
+
+```
+[SERVICE] - Monitoring
+
+We have implemented a fix for the issue affecting [service].
+
+Service is [restored/partially restored]. Some users may still experience [residual impact] while we complete restoration.
+
+We continue to monitor the situation.
+```
+
+---
+
+## Resolution Announcement
+
+### Incident Resolved (Internal)
+
+```
+✅ INCIDENT RESOLVED - [INCIDENT_ID]
+
+Resolved: [YYYY-MM-DD HH:MM UTC]
+Duration: [X hours Y minutes]
+Severity: P[X]
+
+Summary:
+- [Brief description of what happened]
+- [Root cause in one sentence]
+- [Resolution action]
+
+Impact:
+- Users affected: [Count/percentage]
+- Duration of impact: [Time]
+- SLO budget consumed: [Percentage]
+
+Timeline: [Link to detailed timeline]
+
+Postmortem: Scheduled for [Date/Time]
+
+Thank you to everyone who responded: [Names]
+```
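
The `Duration` and `SLO budget consumed` fields in the template above are simple arithmetic once start and end times are known. The sketch below is illustrative only: the function names are hypothetical, and the 99.9% target over a 30-day window is an assumed SLO that the source does not specify. It also assumes the service was fully unavailable for the whole impact period.

```python
from datetime import datetime, timezone


def format_duration(started: datetime, resolved: datetime) -> str:
    """Format elapsed time as '[X hours Y minutes]', as the template expects."""
    total_minutes = int((resolved - started).total_seconds() // 60)
    if total_minutes < 0:
        raise ValueError("resolved must not be earlier than started")
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


def slo_budget_consumed(impact_minutes: float,
                        slo_target: float = 0.999,   # assumed target, not from the source
                        window_days: int = 30) -> float:
    """Percentage of the window's error budget used by a full outage."""
    budget_minutes = (1 - slo_target) * window_days * 24 * 60
    return 100 * impact_minutes / budget_minutes
```

Under these assumptions the monthly budget is about 43.2 minutes, so a value above 100 means the incident alone exhausted the budget for the window.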
+
+### External Resolution (Status Page)
+
+```
+[SERVICE] - Resolved
+
+The issue affecting [service] has been resolved.
+
+All systems are operating normally.
+
+We apologize for any inconvenience this may have caused.
+
+Total duration: [X hours Y minutes]
+
+A detailed postmortem will be published at [URL/time].
+```
+
+---
+
+## Escalation Templates
+
+### Escalating to Leadership (P1/P2)
+
+```
+🚨 ESCALATION - P[X] Incident
+
+Incident: [Brief description]
+Started: [Time] ([Duration so far])
+Current Status: [Status]
+
+Business Impact:
+- [Revenue impact if known]
+- [User impact]
+- [Reputation/compliance concerns]
+
+Current Response:
+- [Who is engaged]
+- [What's being done]
+
+Challenges:
+- [What's blocking resolution]
+- [Resources needed]
+
+Decision Needed: [If any]
+
+Incident Commander: [Name]
+War Room: [Link]
+```
+
+### Requesting Additional Help
+
+```
+🆘 ASSISTANCE NEEDED - [INCIDENT_ID]
+
+We need help with: [Specific expertise/access/decision]
+
+Context:
+- [Current situation]
+- [What we've tried]
+- [Why we're stuck]
+
+Specifically looking for:
+- [Skill/access/knowledge needed]
+- [Time commitment expected]
+
+Join us: [War room link]
+Contact: [IC name and contact]
+```
+
+---
+
+## Handoff Template
+
+### Shift/IC Handoff
+
+```
+🔄 INCIDENT HANDOFF - [INCIDENT_ID]
+
+Outgoing IC: [Name]
+Incoming IC: [Name]
+Time: [HH:MM UTC]
+
+Current Status:
+- Severity: P[X]
+- Phase: [Detect/Triage/Communicate/Mitigate/Diagnose/Fix/Verify/Document]
+- Duration: [Time since start]
+
+Summary:
+- [What happened]
+- [What we've done]
+- [Current state]
+
+Key Context:
+- trace_id: [ID]
+- Root cause hypothesis: [If any]
+- Mitigation in place: [Yes/No - what]
+
+Immediate Next Steps:
+1. [Priority 1]
+2. [Priority 2]
+
+Open Questions:
+- [Unresolved items]
+
+Stakeholder Status:
+- Last external update: [Time]
+- Next update due: [Time]
+
+Resources:
+- Dashboard: [Link]
+- Logs: [Link/query]
+- War room: [Link]
+```
+
+---
+
+## Communication Cadence by Severity
+
+| Severity | Internal Updates | External Updates | Escalation |
+|----------|-----------------|------------------|------------|
+| P1 | Every 15 minutes | Every 30 minutes | Immediate to leadership |
+| P2 | Every 30 minutes | Every 1 hour | Within 1 hour |
+| P3 | At milestones | If customer-facing | If >8 hours |
+| P4 | On resolution | Not required | Not required |
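
The fixed intervals in the cadence table can drive the `Next Update` field automatically. This is a rough sketch under stated assumptions: `next_update_due` and the two lookup tables are hypothetical names, and rows without a fixed interval (P3 and P4) are represented as `None` because the table ties them to events rather than to a clock.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Intervals copied from the cadence table; None means "no fixed interval".
INTERNAL_INTERVAL = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(minutes=30),
    "P3": None,  # at milestones
    "P4": None,  # on resolution
}
EXTERNAL_INTERVAL = {
    "P1": timedelta(minutes=30),
    "P2": timedelta(hours=1),
    "P3": None,  # only if customer-facing
    "P4": None,  # not required
}


def next_update_due(severity: str, last_update: datetime,
                    audience: str = "internal") -> Optional[datetime]:
    """Return when the next update is due, or None if no fixed cadence applies."""
    if audience not in ("internal", "external"):
        raise ValueError("audience must be 'internal' or 'external'")
    table = INTERNAL_INTERVAL if audience == "internal" else EXTERNAL_INTERVAL
    interval = table[severity]  # raises KeyError for an unknown severity
    return None if interval is None else last_update + interval
```

A caller might print the result with `strftime("%H:%M UTC")` to populate the template field directly.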
+
+---
+
+## Key Fields to Always Include
+
+Every communication should include:
+
+1. **Timestamp** - When this update was sent (UTC)
+2. **Severity** - Current classification
+3. **Status** - Current phase/state
+4. **Impact** - Who/what is affected
+5. **trace_id** - **CRITICAL for investigation** - Links logs, traces, metrics; enables instant context jumping; include in EVERY internal communication
+6. **Next Update** - When stakeholders will hear again
+
+**Why trace_id matters:** Without trace_id, responders waste precious minutes searching logs. With it, they jump directly to the failing request across all systems.
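
One way to enforce the field list above is a small check that runs before an update is sent. The sketch below is an assumption-laden illustration: the dictionary keys and the `missing_fields` helper are hypothetical, and treating `trace_id` as required only for internal messages is an interpretation of "include in EVERY internal communication".

```python
REQUIRED_FIELDS = ("timestamp", "severity", "status", "impact", "trace_id", "next_update")


def missing_fields(update: dict, internal: bool = True) -> list[str]:
    """Return the required fields that are absent or blank in an update."""
    required = [name for name in REQUIRED_FIELDS if internal or name != "trace_id"]
    return [name for name in required if not str(update.get(name) or "").strip()]
```

An empty result means the update is complete; a chat bot or pre-send hook could refuse to post until the list is empty.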
+
+---
+
+## Anti-Patterns to Avoid
+
+❌ **Don't:**
+- Send updates without timestamps
+- Use technical jargon in external comms
+- Blame individuals or teams
+- Promise specific resolution times unless certain
+- Go silent for extended periods
+- Share sensitive details externally
+
+✅ **Do:**
+- Keep updates concise but informative
+- Use clear, non-technical language externally
+- Focus on impact and actions
+- Provide realistic expectations
+- Maintain consistent cadence
+- Include next update time
@@ -0,0 +1,347 @@
+# Escalation Matrix
+
+Role-based escalation paths per severity level.
+
+## The Escalation Rule
+
+**Escalate based on severity, not gut feeling.** Proper escalation:
+- Gets the right people involved at the right time
+- Avoids under-response (delayed resolution)
+- Avoids over-response (unnecessary panic)
+- Ensures accountability at each level
+
+**📝 For message templates, see:** `communication-templates.md` → Escalation Templates section
+
+---
+
+## Escalation Paths by Severity
+
+### P1 - Critical Incident
+
+```
+Timeline: 0-60 minutes from incident start
+
+0 min     Alert fires
+          └── On-call engineer acknowledges
+          └── Declares P1 incident
+
+5 min     └── Opens war room
+          └── Notifies: Primary on-call team
+
+15 min    └── IC (Incident Commander) assigned
+          └── Notifies: Engineering leadership
+          └── First status update sent
+
+30 min    └── Notifies: Executive team
+          └── Customer communications prepared
+          └── Status page updated
+
+60 min    └── If unresolved: Escalate to VP/CTO
+          └── Customer communication sent
+          └── Consider bringing in additional teams
+```
+
+**Who Gets Notified:**
+
+| Time | Role | Notification Method |
+|------|------|-------------------|
+| Immediate | On-call engineer | Page |
+| 5 min | On-call team | Page |
+| 15 min | Engineering manager | Page |
+| 30 min | Director/VP Engineering | Page + Call |
+| 30 min | Customer Success | Message |
+| 60 min | CTO/Executive | Call |
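
The P1 notification table lends itself to a lookup that answers "who should have been notified by now?". The following is a minimal sketch, assuming the incident is still unresolved when it is called; `P1_NOTIFICATIONS` and `due_notifications` are hypothetical names, and the rows are copied from the table above with "Immediate" treated as minute 0.

```python
# (minutes after incident start, role, notification method), from the P1 table.
P1_NOTIFICATIONS = [
    (0, "On-call engineer", "Page"),
    (5, "On-call team", "Page"),
    (15, "Engineering manager", "Page"),
    (30, "Director/VP Engineering", "Page + Call"),
    (30, "Customer Success", "Message"),
    (60, "CTO/Executive", "Call"),
]


def due_notifications(elapsed_minutes: int, already_notified=frozenset()):
    """Return (role, method) pairs that are due and have not been sent yet."""
    return [
        (role, method)
        for at_minute, role, method in P1_NOTIFICATIONS
        if at_minute <= elapsed_minutes and role not in already_notified
    ]
```

Tracking `already_notified` lets a periodic job call the function repeatedly without paging the same role twice.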
+
+---
+
+### P2 - High Severity
+
+```
+Timeline: 0-2 hours from incident start
+
+0 min     Alert fires
+          └── On-call engineer acknowledges
+          └── Classifies as P2
+
+15 min    └── Notifies: On-call team
+          └── Begins investigation
+
+30 min    └── First status update sent
+          └── Status page updated (if customer-facing)
+          └── Notifies: Team lead
+
+1 hour    └── If unresolved: Notifies Engineering manager
+          └── Consider escalating to P1
+
+2 hours   └── If unresolved: Escalate to Director
+          └── Mandatory progress review
+```
+
+**Who Gets Notified:**
+
+| Time | Role | Notification Method |
+|------|------|-------------------|
+| Immediate | On-call engineer | Page |
+| 15 min | On-call team | Page (business hours) |
+| 30 min | Team lead | Message |
+| 1 hour | Engineering manager | Message |
+| 2 hours | Director | Message + Call |
+
+---
+
+### P3 - Medium Severity
+
+```
+Timeline: 0-8 hours from incident start
+
+0 min     Alert fires / Issue reported
+          └── On-call engineer acknowledges
+          └── Classifies as P3
+
+1 hour    └── Creates incident ticket
+          └── Notifies: Team channel
+          └── Begins investigation
+
+4 hours   └── Progress update to team
+          └── If blocked: Request assistance
+
+8 hours   └── If unresolved: Escalate to team lead
+          └── Consider overnight handoff
+```
+
+**Who Gets Notified:**
+
+| Time | Role | Notification Method |
+|------|------|-------------------|
+| Immediate | On-call engineer | Alert |
+| 1 hour | Team | Channel message |
+| 4 hours | Team lead (if blocked) | Message |
+| 8 hours | Team lead | Message |
+
+---
+
+### P4 - Low Severity
+
+```
+Timeline: 0-5 business days
+
+0 min     Issue identified
+          └── Classifies as P4
+          └── Creates ticket
+
+Next day  └── Ticket prioritized in backlog
+          └── Assigned to engineer
+
+Sprint    └── Work scheduled
+          └── Resolved in normal flow
+```
+
+**Who Gets Notified:**
+
+| Time | Role | Notification Method |
+|------|------|-------------------|
+| Immediate | Reporter | Acknowledgment |
+| Daily standup | Team | Regular process |
+
+---
+
+## Roles and Responsibilities
+
+### Incident Commander (IC)
+
+**Responsibilities:**
+- Overall coordination of incident response
+- Decision-making authority during incident
+- Communication management
+- Resource allocation
+- Escalation decisions
+
+**When Assigned:**
+- P1: Immediately
+- P2: If duration >30 minutes
+- P3/P4: Not typically needed
+
+### Communications Lead
+
+**Responsibilities:**
+- External communications (status page, customer comms)
+- Internal stakeholder updates
+- Documentation of communications timeline
+
+**When Assigned:**
+- P1: Immediately
+- P2: If customer-facing
+- P3/P4: IC handles or not needed
+
+### Technical Lead
+
+**Responsibilities:**
+- Technical investigation coordination
+- Fix/mitigation decisions
+- Technical resource allocation
+
+**When Assigned:**
+- P1: Immediately
+- P2: If complex technical issue
+- P3/P4: Assigned engineer leads
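
The three "When Assigned" lists above amount to a small decision rule. A sketch of that rule follows; `roles_to_assign` and its keyword arguments are hypothetical names, and whether an issue counts as "complex" is left to the caller because the source does not define it.

```python
def roles_to_assign(severity: str, *, minutes_elapsed: int = 0,
                    customer_facing: bool = False,
                    complex_issue: bool = False) -> list[str]:
    """Return the dedicated roles to staff, following the When Assigned rules."""
    if severity == "P1":
        # All three roles are assigned immediately for P1.
        return ["Incident Commander", "Communications Lead", "Technical Lead"]
    if severity == "P2":
        roles = []
        if minutes_elapsed > 30:
            roles.append("Incident Commander")
        if customer_facing:
            roles.append("Communications Lead")
        if complex_issue:
            roles.append("Technical Lead")
        return roles
    if severity in ("P3", "P4"):
        # No dedicated roles; the assigned engineer leads.
        return []
    raise ValueError(f"unknown severity: {severity}")
```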
+
+---
+
+## Escalation Triggers
+
+### Escalate Immediately When:
+
+| Trigger | Action |
+|---------|--------|
+| Data breach suspected | Escalate to Security + Executive |
+| Customer data at risk | Escalate to Legal + Executive |
+| Compliance violation | Escalate to Compliance + Executive |
+| Extended outage (>1 hour P1) | Escalate to VP/CTO |
+| Multiple systems affected | Escalate to Platform team |
+| Resolution blocked | Escalate for resources/decisions |
+
+### Escalate After Threshold When:
+
+| Trigger | Threshold | Action |
+|---------|-----------|--------|
+| P1 unresolved | 1 hour | VP/CTO |
+| P2 unresolved | 4 hours | Director |
+| P3 unresolved | 2 days | Team lead review |
+| Multiple failed fixes | 3 attempts | Fresh eyes/architecture review |
+| IC needs decision | Immediately | Appropriate decision-maker |
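
The threshold table can be checked mechanically so that escalation does not depend on someone watching the clock. The sketch below encodes only the time-based and attempt-based rows exactly as listed in that table; `escalation_actions` and the constant names are hypothetical.

```python
from datetime import timedelta

# Severity -> (time unresolved before escalating, who to escalate to).
UNRESOLVED_THRESHOLDS = {
    "P1": (timedelta(hours=1), "VP/CTO"),
    "P2": (timedelta(hours=4), "Director"),
    "P3": (timedelta(days=2), "Team lead review"),
}
MAX_FAILED_FIXES = 3


def escalation_actions(severity: str, unresolved_for: timedelta,
                       failed_fix_attempts: int = 0) -> list[str]:
    """Return the escalation actions whose thresholds have been reached."""
    actions = []
    rule = UNRESOLVED_THRESHOLDS.get(severity)  # P4 has no time threshold
    if rule is not None and unresolved_for >= rule[0]:
        actions.append(rule[1])
    if failed_fix_attempts >= MAX_FAILED_FIXES:
        actions.append("Fresh eyes/architecture review")
    return actions
```

Keeping the thresholds in one dictionary makes it straightforward for a team to adjust them to its own policy.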
+
+---
+
+## On-Call Structure
+
+### Primary On-Call
+
+**Responsibilities:**
+- First responder to all alerts
+- Initial triage and classification
+- Escalation to secondary if needed
+
+**Rotation:** Weekly (adjust per team)
+
+### Secondary On-Call
+
+**Responsibilities:**
+- Backup for primary
+- Expertise escalation
+- Weekend/holiday coverage
+
+**Rotation:** Weekly, offset from primary
+
+### Management On-Call
+
+**Responsibilities:**
+- Decision escalation
+- Resource allocation
+- External communication approval
+- Executive escalation
+
+**Rotation:** Weekly among engineering managers
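
A weekly rotation with the secondary offset from the primary can be computed from the calendar alone. This is one possible reading, sketched under assumptions the source does not state: weeks start on Monday, the roster is a fixed ordered list, and "offset" means the secondary is the next person in the roster, so each week's secondary becomes the following week's primary. The `on_call_for` name is hypothetical.

```python
from datetime import date


def on_call_for(day: date, roster: list[str]) -> tuple[str, str]:
    """Return (primary, secondary) on-call for the week containing `day`."""
    if len(roster) < 2:
        raise ValueError("roster needs at least two people")
    # date(1, 1, 1) has ordinal 1 and is a Monday, so this index
    # increments every Monday.
    week_index = (day.toordinal() - 1) // 7
    primary = roster[week_index % len(roster)]
    secondary = roster[(week_index + 1) % len(roster)]
    return primary, secondary
```

A side effect of this offset is that the incoming primary has already spent a week as backup, which gives them context on any ongoing issues.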
+
+---
+
+## Escalation Communication
+
+**📝 Full templates:** See `communication-templates.md` for complete Internal/External notification formats, Status Updates, and Handoff templates.
+
+### How to Escalate
+
+```
+When escalating, provide:
+
+1. Incident ID and severity
+2. Current status and duration
+3. What's been tried
+4. What's blocking progress
+5. What decision/resource is needed
+6. War room/channel link
+```
+
+### Escalation Message Template
+
+```
+🚨 ESCALATION REQUEST
+
+Incident: [ID] - P[X]
+Duration: [Time]
+Current Status: [Status]
+
+Situation:
+[Brief description of current state]
+
+What We've Tried:
+- [Action 1]
+- [Action 2]
+
+Blocker:
+[What's preventing resolution]
+
+Request:
+[Specific decision/resource/expertise needed]
+
+Join: [War room link]
+```
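
Since escalating without context is listed as an anti-pattern, the message template above can be paired with a builder that refuses to produce a message when required context is missing. This is a sketch only; `build_escalation` and its parameter names are hypothetical, and the output follows the template's layout.

```python
def build_escalation(incident_id: str, severity: str, duration: str, status: str,
                     situation: str, tried: list[str], blocker: str,
                     request: str, war_room: str) -> str:
    """Render the escalation message, rejecting empty fields."""
    fields = {
        "incident_id": incident_id, "severity": severity, "duration": duration,
        "status": status, "situation": situation, "tried": tried,
        "blocker": blocker, "request": request, "war_room": war_room,
    }
    missing = [name for name, value in fields.items() if not value]
    if missing:
        raise ValueError("missing escalation context: " + ", ".join(missing))
    lines = [
        "🚨 ESCALATION REQUEST",
        "",
        f"Incident: {incident_id} - {severity}",
        f"Duration: {duration}",
        f"Current Status: {status}",
        "",
        "Situation:",
        situation,
        "",
        "What We've Tried:",
        *[f"- {action}" for action in tried],
        "",
        "Blocker:",
        blocker,
        "",
        "Request:",
        request,
        "",
        f"Join: {war_room}",
    ]
    return "\n".join(lines)
```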
+
+---
+
+## De-escalation
+
+### When to Reduce Severity
+
+- Impact reduced through mitigation
+- Fewer users affected than initially thought
+- Workaround discovered
+- Root cause isolated to limited scope
+
+### De-escalation Process
+
+1. IC proposes severity change
+2. Communicate change to all stakeholders
+3. Adjust response resources accordingly
+4. Update status page if applicable
+5. Document reason for change
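
Steps 1 and 5 of the process above (proposing the change and documenting the reason) can be captured in a single record so that the reason is never omitted. The following is a minimal sketch; `SeverityChange` and `propose_deescalation` are hypothetical names, and the check that the new severity is lower than the old one is an added safeguard rather than something the source requires.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

LEVELS = ["P1", "P2", "P3", "P4"]  # ordered from most to least severe


@dataclass(frozen=True)
class SeverityChange:
    incident_id: str
    old: str
    new: str
    reason: str
    proposed_by: str
    at: datetime


def propose_deescalation(incident_id: str, old: str, new: str, reason: str,
                         proposed_by: str,
                         now: Optional[datetime] = None) -> SeverityChange:
    """Create a documented de-escalation record, or raise if it is invalid."""
    if old not in LEVELS or new not in LEVELS:
        raise ValueError("severity must be one of P1-P4")
    if LEVELS.index(new) <= LEVELS.index(old):
        raise ValueError("a de-escalation must move to a lower severity")
    if not reason.strip():
        raise ValueError("a documented reason is required")
    timestamp = now or datetime.now(timezone.utc)
    return SeverityChange(incident_id, old, new, reason.strip(), proposed_by, timestamp)
```

The returned record can then be attached to the stakeholder update and the status page change described in steps 2 and 4.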
+
+---
+
+## Contact Directory
+
+Maintain an up-to-date contact list:
+
+```
+| Role | Name | Phone | Slack | Email |
+|------|------|-------|-------|-------|
+| On-call Primary | [Rotating] | [Phone] | @oncall-primary | - |
+| On-call Secondary | [Rotating] | [Phone] | @oncall-secondary | - |
+| Engineering Manager | [Name] | [Phone] | @[handle] | [email] |
+| Director | [Name] | [Phone] | @[handle] | [email] |
+| VP Engineering | [Name] | [Phone] | @[handle] | [email] |
+| CTO | [Name] | [Phone] | @[handle] | [email] |
+| Security | @security-team | - | @security | security@company |
+| Legal | [Name] | [Phone] | @legal | legal@company |
+```
+
+**Keep this updated!** Outdated contacts during P1 = delayed resolution.
+
+---
+
+## Escalation Anti-Patterns
+
+❌ **Avoid:**
+- Escalating without trying to resolve first
+- Not escalating when thresholds are reached
+- Escalating without context
+- Escalating to wrong person/team
+- Under-escalating due to fear
+- Over-escalating due to panic
+
+✅ **Do:**
+- Follow the defined thresholds
+- Provide full context when escalating
+- Know your escalation paths
+- Escalate early when uncertain
+- De-escalate when appropriate
+- Document escalation decisions