moicle 1.1.1 → 1.2.0
- package/README.md +12 -2
- package/assets/architecture/go-backend.md +930 -108
- package/assets/commands/brainstorm.md +1 -0
- package/assets/skills/api-integration/SKILL.md +883 -0
- package/assets/skills/architect-review/SKILL.md +473 -0
- package/assets/skills/deprecation/SKILL.md +923 -0
- package/assets/skills/documentation/SKILL.md +1333 -0
- package/assets/skills/fix-pr-comment/SKILL.md +283 -0
- package/assets/skills/go-module/SKILL.md +77 -0
- package/assets/skills/incident-response/SKILL.md +946 -0
- package/assets/skills/onboarding/SKILL.md +607 -0
- package/assets/skills/pr-review/SKILL.md +620 -0
- package/assets/skills/refactor/SKILL.md +756 -0
- package/assets/skills/spike/SKILL.md +535 -0
- package/assets/skills/tdd/SKILL.md +828 -0
- package/bin/cli.js +2 -1
- package/dist/commands/install.d.ts.map +1 -1
- package/dist/commands/install.js +20 -2
- package/dist/commands/install.js.map +1 -1
- package/dist/utils/symlink.d.ts +1 -0
- package/dist/utils/symlink.d.ts.map +1 -1
- package/dist/utils/symlink.js +1 -0
- package/dist/utils/symlink.js.map +1 -1
- package/package.json +1 -1
|
@@ -0,0 +1,946 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: incident-response
|
|
3
|
+
description: Incident response workflow for handling production issues. Use when there is an incident, an outage, or production is down, or when the user says "incident", "outage", "production down", "service down", or "emergency".
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Incident Response Workflow
|
|
7
|
+
|
|
8
|
+
Structured workflow for handling production incidents with clear phases from triage to postmortem.
|
|
9
|
+
|
|
10
|
+
## IMPORTANT: Read Architecture First
|
|
11
|
+
|
|
12
|
+
**Before responding to incidents, you MUST read the appropriate architecture reference:**
|
|
13
|
+
|
|
14
|
+
### Global Architecture Files
|
|
15
|
+
```
|
|
16
|
+
~/.claude/architecture/
|
|
17
|
+
├── clean-architecture.md # Core principles for all projects
|
|
18
|
+
├── flutter-mobile.md # Flutter + Riverpod
|
|
19
|
+
├── react-frontend.md # React + Vite + TypeScript
|
|
20
|
+
├── go-backend.md # Go + Gin
|
|
21
|
+
├── laravel-backend.md # Laravel + PHP
|
|
22
|
+
├── remix-fullstack.md # Remix fullstack
|
|
23
|
+
└── monorepo.md # Monorepo structure
|
|
24
|
+
```
|
|
25
|
+
|
|
26
|
+
### Project-specific (if exists)
|
|
27
|
+
```
|
|
28
|
+
.claude/architecture/ # Project overrides
|
|
29
|
+
```
|
|
30
|
+
|
|
31
|
+
**Understand the system architecture to diagnose and fix issues effectively.**
|
|
32
|
+
|
|
33
|
+
## Recommended Agents
|
|
34
|
+
|
|
35
|
+
| Phase | Agent | Purpose |
|
|
36
|
+
|-------|-------|---------|
|
|
37
|
+
| TRIAGE | `@devops` | Infrastructure & monitoring |
|
|
38
|
+
| INVESTIGATE | `@react-frontend-dev`, `@go-backend-dev`, `@laravel-backend-dev`, `@flutter-mobile-dev`, `@remix-fullstack-dev` | Stack-specific debugging |
|
|
39
|
+
| INVESTIGATE | `@security-audit` | Security incidents |
|
|
40
|
+
| INVESTIGATE | `@db-designer` | Database issues |
|
|
41
|
+
| MITIGATE | `@devops` | Quick rollback/scaling |
|
|
42
|
+
| RESOLVE | Stack-specific dev agents | Permanent fix |
|
|
43
|
+
| POSTMORTEM | `@docs-writer` | Document learnings |
|
|
44
|
+
|
|
45
|
+
## Workflow Overview
|
|
46
|
+
|
|
47
|
+
```
|
|
48
|
+
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│1. TRIAGE │──▶│2. INVESTI│──▶│3.MITIGATE│──▶│4. RESOLVE│──▶│5. POST-  │
│          │   │   GATE   │   │          │   │          │   │  MORTEM  │
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘
     │              │              │              │
     │              │              │              └──── Monitor ────┐
     │              │              └──── If fails ─────────┐        │
     │              └──── Escalate if needed               │        │
     └──── Notify stakeholders ────────────────────────────┴────────┘
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
---
|
|
60
|
+
|
|
61
|
+
## Severity Levels
|
|
62
|
+
|
|
63
|
+
| Level | Description | Response Time | Examples |
|
|
64
|
+
|-------|-------------|---------------|----------|
|
|
65
|
+
| 🔴 **P1** | Critical - Complete outage | Immediate (< 15 min) | Production down, data loss, security breach |
|
|
66
|
+
| 🟠 **P2** | High - Major degradation | < 1 hour | Core features broken, 50%+ users affected |
|
|
67
|
+
| 🟡 **P3** | Medium - Partial degradation | < 4 hours | Non-core features broken, < 50% users affected |
|
|
68
|
+
| 🟢 **P4** | Low - Minor issue | < 24 hours | UI bugs, minor performance issues |
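
The table above can be turned into a rough first-pass classifier during triage. The following is a minimal sketch, not a definitive rule set: the function name, its three arguments, and the cut-off between P3 and P4 are assumptions made for illustration, and the incident commander should still confirm the final level.

```bash
#!/usr/bin/env bash
# Sketch: map triage answers to the P1-P4 levels in the severity table.
# Arguments (hypothetical names, not part of the original workflow):
#   $1 critical     - 1 if complete outage, data loss, or security breach
#   $2 core_broken  - 1 if a core feature is broken
#   $3 pct_affected - integer percentage of users affected (0-100)
classify_severity() {
  local critical="$1" core_broken="$2" pct_affected="$3"
  if [ "$critical" -eq 1 ]; then
    echo "P1"
  elif [ "$core_broken" -eq 1 ] || [ "$pct_affected" -ge 50 ]; then
    echo "P2"
  elif [ "$pct_affected" -gt 0 ]; then
    echo "P3"   # assumption: any measurable user impact is at least P3
  else
    echo "P4"
  fi
}

classify_severity 0 0 12   # prints: P3
```

Treat the result as a starting point for the severity field of the incident report; re-run it as the impact estimate changes.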
|
|
69
|
+
|
|
70
|
+
---
|
|
71
|
+
|
|
72
|
+
## Phase 1: TRIAGE
|
|
73
|
+
|
|
74
|
+
**Goal**: Assess severity, scope, and impact immediately
|
|
75
|
+
|
|
76
|
+
**Timeline**: 5-15 minutes
|
|
77
|
+
|
|
78
|
+
### Actions
|
|
79
|
+
|
|
80
|
+
1. **Initial Assessment**:
|
|
81
|
+
- [ ] Incident reported (alert/user/monitoring)
|
|
82
|
+
- [ ] Timestamp recorded
|
|
83
|
+
- [ ] Initial severity assigned (P1-P4)
|
|
84
|
+
|
|
85
|
+
2. **Gather Critical Information**:
|
|
86
|
+
```markdown
|
|
87
|
+
- What is down/broken?
|
|
88
|
+
- When did it start?
|
|
89
|
+
- How many users affected?
|
|
90
|
+
- Error messages/logs?
|
|
91
|
+
- Recent deployments/changes?
|
|
92
|
+
```
|
|
93
|
+
|
|
94
|
+
3. **Identify System & Architecture**:
|
|
95
|
+
- [ ] Identify affected stack (Flutter/React/Go/Laravel/Remix)
|
|
96
|
+
- [ ] Identify affected service/component
|
|
97
|
+
- [ ] Read architecture doc for affected area
|
|
98
|
+
|
|
99
|
+
4. **Establish Incident Response**:
|
|
100
|
+
- [ ] Create incident channel (Slack/Teams)
|
|
101
|
+
- [ ] Assign incident commander
|
|
102
|
+
- [ ] Notify stakeholders (use template below)
|
|
103
|
+
|
|
104
|
+
5. **Quick Health Checks**:
|
|
105
|
+
```bash
|
|
106
|
+
# Check service status
|
|
107
|
+
curl -I https://[service-url]/health
|
|
108
|
+
|
|
109
|
+
# Check logs (last 100 lines)
|
|
110
|
+
kubectl logs -n [namespace] [pod] --tail=100
|
|
111
|
+
# or
|
|
112
|
+
docker logs [container] --tail=100
|
|
113
|
+
|
|
114
|
+
# Check metrics
|
|
115
|
+
# (CPU, memory, disk, network)
|
|
116
|
+
```
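
A single `curl -I` can miss an intermittent failure, so it may help to poll the endpoint a few times and log each result with a timestamp. This is a sketch under stated assumptions: the function names, the `/health` URL in the usage line, the retry count, and the "healthy/degraded/down" labels are illustrative and not part of the original workflow.

```bash
#!/usr/bin/env bash
# Sketch: poll a health endpoint a few times and label each response.

# Classify an HTTP status code (labels are an assumption for illustration).
classify_http_code() {
  case "$1" in
    2??)     echo "healthy" ;;
    429|5??) echo "degraded" ;;
    000)     echo "down" ;;        # curl reports 000 when no response arrived
    *)       echo "unexpected" ;;
  esac
}

# check_health URL [ATTEMPTS]: one timestamped line per attempt.
check_health() {
  local url="$1" attempts="${2:-3}" code
  for _ in $(seq 1 "$attempts"); do
    # -s: silent, -o /dev/null: drop the body, -w: print only the status code
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")"
    echo "$(date -u +%H:%M:%S) $url -> $code ($(classify_http_code "$code"))"
    sleep 1
  done
}

# Usage (placeholder URL): check_health "https://service.example.com/health" 3
```

The timestamped lines can be pasted directly into the "Initial Observations" section of the incident report.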
|
|
117
|
+
|
|
118
|
+
### Output Template
|
|
119
|
+
|
|
120
|
+
```markdown
|
|
121
|
+
## INCIDENT REPORT
|
|
122
|
+
|
|
123
|
+
**Severity**: P[1-4] - [CRITICAL/HIGH/MEDIUM/LOW]
|
|
124
|
+
**Status**: INVESTIGATING
|
|
125
|
+
**Started**: [timestamp]
|
|
126
|
+
**Affected**: [service/feature]
|
|
127
|
+
**User Impact**: [number/percentage] users
|
|
128
|
+
**Stack**: [Flutter/React/Go/Laravel/Remix]
|
|
129
|
+
**Architecture Doc**: [path]
|
|
130
|
+
|
|
131
|
+
### Summary
|
|
132
|
+
[One-line description of what's broken]
|
|
133
|
+
|
|
134
|
+
### Impact
|
|
135
|
+
- [ ] Production down
|
|
136
|
+
- [ ] Feature degraded
|
|
137
|
+
- [ ] Data integrity issues
|
|
138
|
+
- [ ] Security concerns
|
|
139
|
+
|
|
140
|
+
### Recent Changes
|
|
141
|
+
- [Last deployment/change]
|
|
142
|
+
- [Time of change]
|
|
143
|
+
|
|
144
|
+
### Initial Observations
|
|
145
|
+
- [Error messages]
|
|
146
|
+
- [Metrics anomalies]
|
|
147
|
+
- [User reports]
|
|
148
|
+
|
|
149
|
+
### Response Team
|
|
150
|
+
- Incident Commander: [name]
|
|
151
|
+
- Engineers: [names]
|
|
152
|
+
- On-call: [name]
|
|
153
|
+
|
|
154
|
+
### Communication
|
|
155
|
+
- Incident Channel: [link]
|
|
156
|
+
- Status Page Updated: [Y/N]
|
|
157
|
+
```
|
|
158
|
+
|
|
159
|
+
### Gate
|
|
160
|
+
- [ ] Severity assessed
|
|
161
|
+
- [ ] Stakeholders notified
|
|
162
|
+
- [ ] Architecture doc identified
|
|
163
|
+
- [ ] Response team assembled
|
|
164
|
+
- [ ] Communication channel established
|
|
165
|
+
|
|
166
|
+
---
|
|
167
|
+
|
|
168
|
+
## Phase 2: INVESTIGATE
|
|
169
|
+
|
|
170
|
+
**Goal**: Find the root cause
|
|
171
|
+
|
|
172
|
+
**Timeline**: 15-60 minutes (varies by severity)
|
|
173
|
+
|
|
174
|
+
### Actions
|
|
175
|
+
|
|
176
|
+
1. **Read Architecture Documentation**:
|
|
177
|
+
- [ ] Read architecture doc for affected stack
|
|
178
|
+
- [ ] Understand system components and dependencies
|
|
179
|
+
- [ ] Identify critical paths
|
|
180
|
+
|
|
181
|
+
2. **Collect Evidence**:
|
|
182
|
+
```bash
|
|
183
|
+
# Application logs
|
|
184
|
+
grep -i error /var/log/[app].log | tail -50
|
|
185
|
+
|
|
186
|
+
# System logs
|
|
187
|
+
journalctl -u [service] --since "30 minutes ago"
|
|
188
|
+
|
|
189
|
+
# Database logs
|
|
190
|
+
# (stack-specific commands from architecture doc)
|
|
191
|
+
|
|
192
|
+
# Network/traffic
|
|
193
|
+
netstat -an | grep [port]
|
|
194
|
+
```
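
To place the first error on a timeline, it can help to count error lines per minute rather than read the raw tail. The sketch below assumes each log line begins with an ISO-8601 timestamp (for example `2024-01-01T12:03:45Z`); that format and the function name are assumptions, so adjust the field handling to the actual log layout.

```bash
#!/usr/bin/env bash
# Sketch: count error lines per minute to see when failures began.
# Assumes the first whitespace-separated field is an ISO-8601 timestamp.
errors_per_minute() {
  local logfile="$1"
  grep -i 'error' "$logfile" \
    | awk '{ print substr($1, 1, 16) }' \
    | sort | uniq -c
}

# Usage (placeholder path): errors_per_minute /var/log/app.log
```

Each output line is a count followed by a minute (`YYYY-MM-DDTHH:MM`); the first minute with a non-trivial count is a candidate for the "First error logged" entry.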
|
|
195
|
+
|
|
196
|
+
3. **Check Monitoring Dashboards**:
|
|
197
|
+
- [ ] CPU/Memory usage
|
|
198
|
+
- [ ] Request rate/latency
|
|
199
|
+
- [ ] Error rate
|
|
200
|
+
- [ ] Database connections/queries
|
|
201
|
+
- [ ] External service dependencies
|
|
202
|
+
|
|
203
|
+
4. **Timeline Analysis**:
|
|
204
|
+
```markdown
|
|
205
|
+
- [HH:MM] - Normal operation
|
|
206
|
+
- [HH:MM] - [Event: deployment/config change/traffic spike]
|
|
207
|
+
- [HH:MM] - First error logged
|
|
208
|
+
- [HH:MM] - User reports started
|
|
209
|
+
- [HH:MM] - Full outage
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
5. **Root Cause Analysis (5 Whys)**:
|
|
213
|
+
```
|
|
214
|
+
Problem: [What is failing?]
|
|
215
|
+
|
|
216
|
+
Why? → [First level cause]
|
|
217
|
+
Why? → [Second level cause]
|
|
218
|
+
Why? → [Third level cause]
|
|
219
|
+
Why? → [Fourth level cause]
|
|
220
|
+
Why? → [ROOT CAUSE]
|
|
221
|
+
```
|
|
222
|
+
|
|
223
|
+
6. **Check Dependencies**:
|
|
224
|
+
- [ ] External APIs responding?
|
|
225
|
+
- [ ] Database accessible?
|
|
226
|
+
- [ ] Cache service healthy?
|
|
227
|
+
- [ ] Message queue working?
|
|
228
|
+
- [ ] CDN functioning?
|
|
229
|
+
|
|
230
|
+
### Investigation Output
|
|
231
|
+
|
|
232
|
+
```markdown
|
|
233
|
+
## ROOT CAUSE ANALYSIS
|
|
234
|
+
|
|
235
|
+
**Architecture Reference**: [path to doc]
|
|
236
|
+
**Affected Component**: [service/layer from architecture]
|
|
237
|
+
**Root Cause**: [detailed description]
|
|
238
|
+
|
|
239
|
+
### Timeline
|
|
240
|
+
- [HH:MM] - Event 1
|
|
241
|
+
- [HH:MM] - Event 2
|
|
242
|
+
- [HH:MM] - Failure point
|
|
243
|
+
|
|
244
|
+
### Evidence
|
|
245
|
+
**Logs**:
|
|
246
|
+
```
|
|
247
|
+
[relevant log entries]
|
|
248
|
+
```
|
|
249
|
+
|
|
250
|
+
**Metrics**:
|
|
251
|
+
- CPU: [before/after]
|
|
252
|
+
- Memory: [before/after]
|
|
253
|
+
- Error rate: [before/after]
|
|
254
|
+
|
|
255
|
+
**Related Changes**:
|
|
256
|
+
- Commit: [sha] - [description]
|
|
257
|
+
- Deploy: [time] - [version]
|
|
258
|
+
- Config: [change description]
|
|
259
|
+
|
|
260
|
+
### Hypothesis
|
|
261
|
+
[What we think caused it]
|
|
262
|
+
|
|
263
|
+
### Verification
|
|
264
|
+
[How to verify the hypothesis]
|
|
265
|
+
```
|
|
266
|
+
|
|
267
|
+
### Gate
|
|
268
|
+
- [ ] Root cause identified
|
|
269
|
+
- [ ] Evidence collected
|
|
270
|
+
- [ ] Timeline established
|
|
271
|
+
- [ ] Dependencies checked
|
|
272
|
+
|
|
273
|
+
---
|
|
274
|
+
|
|
275
|
+
## Phase 3: MITIGATE
|
|
276
|
+
|
|
277
|
+
**Goal**: Stop the bleeding with a temporary fix
|
|
278
|
+
|
|
279
|
+
**Timeline**: Immediate (5-30 minutes)
|
|
280
|
+
|
|
281
|
+
### Principles
|
|
282
|
+
1. **Speed over perfection** - Stop user impact NOW
|
|
283
|
+
2. **Temporary is OK** - Permanent fix comes later
|
|
284
|
+
3. **Minimize risk** - Don't make it worse
|
|
285
|
+
4. **Preserve evidence** - Keep logs/data for analysis
|
|
286
|
+
|
|
287
|
+
### Mitigation Strategies
|
|
288
|
+
|
|
289
|
+
#### Strategy 1: Rollback
|
|
290
|
+
```bash
|
|
291
|
+
# Rollback to previous version
|
|
292
|
+
git log --oneline -5
|
|
293
|
+
git revert [bad-commit-sha]
|
|
294
|
+
# or
|
|
295
|
+
kubectl rollout undo deployment/[name]
|
|
296
|
+
# or
|
|
297
|
+
docker service update --rollback [service]
|
|
298
|
+
```
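
For Kubernetes deployments, a rollback is easier to audit when the revision history is captured first and the rollout is watched until it settles. The following is a sketch rather than a prescribed procedure: the `run` and `rollback_deployment` helpers, the `DRY_RUN` variable, and the 120-second timeout are assumptions; the `kubectl rollout` subcommands themselves are standard.

```bash
#!/usr/bin/env bash
# Sketch: roll back a Kubernetes deployment and wait for it to settle.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# rollback_deployment NAME [NAMESPACE]
rollback_deployment() {
  local name="$1" namespace="${2:-default}"
  # Record the revision history first so it can go into the incident log.
  run kubectl rollout history "deployment/$name" -n "$namespace"
  # Undo to the previous revision, then block until the rollout finishes.
  run kubectl rollout undo "deployment/$name" -n "$namespace" \
    && run kubectl rollout status "deployment/$name" -n "$namespace" --timeout=120s
}

# Usage (placeholder names): DRY_RUN=1 rollback_deployment api production
```

Running with `DRY_RUN=1` first gives the exact commands to paste into the "Action Taken" section before anything changes in production.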
|
|
299
|
+
|
|
300
|
+
#### Strategy 2: Disable Feature
|
|
301
|
+
```bash
|
|
302
|
+
# Feature flag (if available)
|
|
303
|
+
# Turn off problematic feature
|
|
304
|
+
curl -X POST [config-service]/flags/[feature] -d '{"enabled": false}'
|
|
305
|
+
```
|
|
306
|
+
|
|
307
|
+
#### Strategy 3: Scale Resources
|
|
308
|
+
```bash
|
|
309
|
+
# Horizontal scaling
|
|
310
|
+
kubectl scale deployment/[name] --replicas=10
|
|
311
|
+
|
|
312
|
+
# Vertical scaling (increase resources)
|
|
313
|
+
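# Update deployment with more CPU/memory (example; choose values for your workload)
kubectl set resources deployment/[name] --limits=cpu=[cpu],memory=[memory]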
|
|
314
|
+
```
|
|
315
|
+
|
|
316
|
+
#### Strategy 4: Traffic Routing
|
|
317
|
+
```bash
|
|
318
|
+
# Route traffic away from bad instance
|
|
319
|
+
# Update load balancer/ingress
|
|
320
|
+
kubectl patch ingress [name] -p '[patch-json]'
|
|
321
|
+
```
|
|
322
|
+
|
|
323
|
+
#### Strategy 5: Kill Switch
|
|
324
|
+
```bash
|
|
325
|
+
# Emergency circuit breaker
|
|
326
|
+
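# Disable non-critical services to save core (example; pick services that are safe to stop)
kubectl scale deployment/[non-critical-service] --replicas=0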
|
|
327
|
+
```
|
|
328
|
+
|
|
329
|
+
#### Strategy 6: Database Mitigation
|
|
330
|
+
```sql
|
|
331
|
+
-- If database issue
|
|
332
|
+
-- Kill long-running queries
|
|
333
|
+
SELECT pg_terminate_backend([pid]);
|
|
334
|
+
|
|
335
|
+
-- Add missing index temporarily
|
|
336
|
+
CREATE INDEX CONCURRENTLY idx_temp ON [table]([column]);
|
|
337
|
+
```
|
|
338
|
+
|
|
339
|
+
### Mitigation Checklist
|
|
340
|
+
- [ ] Mitigation strategy selected
|
|
341
|
+
- [ ] Tested in staging (if time permits for P2-P4)
|
|
342
|
+
- [ ] Executed in production
|
|
343
|
+
- [ ] User impact reduced/eliminated
|
|
344
|
+
- [ ] System stable
|
|
345
|
+
- [ ] Monitoring confirms mitigation working
|
|
346
|
+
|
|
347
|
+
### Output Template
|
|
348
|
+
|
|
349
|
+
```markdown
|
|
350
|
+
## MITIGATION APPLIED
|
|
351
|
+
|
|
352
|
+
**Strategy**: [Rollback/Feature Flag/Scaling/etc]
|
|
353
|
+
**Applied At**: [timestamp]
|
|
354
|
+
**Applied By**: [name]
|
|
355
|
+
|
|
356
|
+
### Action Taken
|
|
357
|
+
```bash
|
|
358
|
+
[exact commands executed]
|
|
359
|
+
```
|
|
360
|
+
|
|
361
|
+
### Result
|
|
362
|
+
- User Impact: [before] → [after]
|
|
363
|
+
- Error Rate: [before] → [after]
|
|
364
|
+
- Service Status: [before] → [after]
|
|
365
|
+
|
|
366
|
+
### Monitoring
|
|
367
|
+
- [ ] Metrics returning to normal
|
|
368
|
+
- [ ] Error rate decreased
|
|
369
|
+
- [ ] Users can access service
|
|
370
|
+
- [ ] No new issues introduced
|
|
371
|
+
|
|
372
|
+
### Next Steps
|
|
373
|
+
- [ ] Monitor for 15-30 minutes
|
|
374
|
+
- [ ] Proceed to RESOLVE for permanent fix
|
|
375
|
+
```
|
|
376
|
+
|
|
377
|
+
### Gate
|
|
378
|
+
- [ ] User impact mitigated
|
|
379
|
+
- [ ] System stable
|
|
380
|
+
- [ ] Monitoring shows improvement
|
|
381
|
+
- [ ] No new critical issues
|
|
382
|
+
|
|
383
|
+
---
|
|
384
|
+
|
|
385
|
+
## Phase 4: RESOLVE
|
|
386
|
+
|
|
387
|
+
**Goal**: Implement a permanent fix
|
|
388
|
+
|
|
389
|
+
**Timeline**: Hours to days (depending on severity)
|
|
390
|
+
|
|
391
|
+
### Actions
|
|
392
|
+
|
|
393
|
+
1. **Read Architecture Documentation**:
|
|
394
|
+
- [ ] Re-read architecture doc for affected stack
|
|
395
|
+
- [ ] Understand proper fix location and approach
|
|
396
|
+
- [ ] Follow architecture patterns
|
|
397
|
+
|
|
398
|
+
2. **Design Permanent Fix**:
|
|
399
|
+
```markdown
|
|
400
|
+
### Fix Design
|
|
401
|
+
**Based on**: [architecture doc]
|
|
402
|
+
**Affected Layers**: [from architecture doc]
|
|
403
|
+
|
|
404
|
+
**Root Cause**: [from investigation]
|
|
405
|
+
|
|
406
|
+
**Permanent Solution**:
|
|
407
|
+
- [Change 1 - following architecture patterns]
|
|
408
|
+
- [Change 2 - following architecture patterns]
|
|
409
|
+
|
|
410
|
+
**Why This Prevents Recurrence**:
|
|
411
|
+
[Explanation]
|
|
412
|
+
```
|
|
413
|
+
|
|
414
|
+
3. **Implement Fix**:
|
|
415
|
+
- [ ] Create branch: `incident/[issue-id]-[description]`
|
|
416
|
+
- [ ] Implement following architecture doc patterns
|
|
417
|
+
- [ ] Add monitoring/alerting to prevent recurrence
|
|
418
|
+
- [ ] Add tests to catch this scenario
|
|
419
|
+
|
|
420
|
+
4. **Test Thoroughly**:
|
|
421
|
+
```bash
|
|
422
|
+
# Use test commands from architecture doc
|
|
423
|
+
flutter test # Flutter
|
|
424
|
+
go test ./... # Go
|
|
425
|
+
bun test # React/Remix
|
|
426
|
+
php artisan test # Laravel
|
|
427
|
+
|
|
428
|
+
# Integration tests
|
|
429
|
+
# Load tests (if performance issue)
|
|
430
|
+
# Security tests (if security issue)
|
|
431
|
+
```
|
|
432
|
+
|
|
433
|
+
5. **Gradual Rollout** (for P1/P2 incidents):
|
|
434
|
+
- [ ] Deploy to staging
|
|
435
|
+
- [ ] Canary deployment (5% traffic)
|
|
436
|
+
- [ ] Monitor for issues
|
|
437
|
+
- [ ] Gradual increase (25%, 50%, 100%)
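
The staged percentages above can be sketched as a loop that advances only while each stage stays healthy. This is an illustration under stated assumptions: `set_traffic_weight` and `stage_is_healthy` are hypothetical hooks that stand in for whatever the load balancer and monitoring actually expose, and the stub bodies below do nothing real.

```bash
#!/usr/bin/env bash
# Sketch: step traffic through 5% -> 25% -> 50% -> 100% and stop at the
# first unhealthy stage. Both hooks are hypothetical placeholders.
set_traffic_weight() { echo "routing $1% of traffic to the new version"; }
stage_is_healthy()   { return 0; }   # placeholder: always reports healthy

gradual_rollout() {
  local pct
  for pct in 5 25 50 100; do
    set_traffic_weight "$pct"
    if ! stage_is_healthy "$pct"; then
      echo "stage $pct% unhealthy, rolling back" >&2
      set_traffic_weight 0          # send all traffic back to the old version
      return 1
    fi
  done
  echo "rollout complete"
}

# Usage: gradual_rollout || echo "fix was rolled back; return to INVESTIGATE"
```

In practice `stage_is_healthy` would wait for a soak period and compare error rate and latency against the pre-incident baseline before returning.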
|
|
438
|
+
|
|
439
|
+
### Resolution Checklist
|
|
440
|
+
- [ ] Fix follows architecture doc
|
|
441
|
+
- [ ] Root cause addressed
|
|
442
|
+
- [ ] Tests added to prevent regression
|
|
443
|
+
- [ ] Monitoring/alerting enhanced
|
|
444
|
+
- [ ] Code reviewed
|
|
445
|
+
- [ ] Documentation updated
|
|
446
|
+
|
|
447
|
+
### Deployment
|
|
448
|
+
|
|
449
|
+
```bash
|
|
450
|
+
# Commit with incident reference
|
|
451
|
+
git add .
|
|
452
|
+
git commit -m "fix: [description] (incident #[id])
|
|
453
|
+
|
|
454
|
+
Root cause: [explanation]
|
|
455
|
+
Resolution: [what was changed]
|
|
456
|
+
|
|
457
|
+
Prevents recurrence by: [explanation]
|
|
458
|
+
|
|
459
|
+
Fixes #[incident-id]"
|
|
460
|
+
|
|
461
|
+
# Create PR
|
|
462
|
+
gh pr create --title "fix: [description] (incident #[id])" --body "$(cat <<'EOF'
|
|
463
|
+
## Summary
|
|
464
|
+
Resolves incident #[id]
|
|
465
|
+
|
|
466
|
+
## Root Cause
|
|
467
|
+
[Detailed explanation]
|
|
468
|
+
|
|
469
|
+
## Permanent Fix
|
|
470
|
+
[What was changed, following architecture patterns]
|
|
471
|
+
|
|
472
|
+
## Prevention
|
|
473
|
+
- [ ] Monitoring added
|
|
474
|
+
- [ ] Tests added
|
|
475
|
+
- [ ] Documentation updated
|
|
476
|
+
|
|
477
|
+
## Testing
|
|
478
|
+
- [ ] Unit tests pass
|
|
479
|
+
- [ ] Integration tests pass
|
|
480
|
+
- [ ] Staging deployment successful
|
|
481
|
+
- [ ] Canary deployment successful
|
|
482
|
+
|
|
483
|
+
## Rollback Plan
|
|
484
|
+
`git revert [commit-sha]`
|
|
485
|
+
EOF
|
|
486
|
+
)"
|
|
487
|
+
```
|
|
488
|
+
|
|
489
|
+
### Gate
|
|
490
|
+
- [ ] Permanent fix implemented
|
|
491
|
+
- [ ] Following architecture patterns
|
|
492
|
+
- [ ] All tests passing
|
|
493
|
+
- [ ] Deployed successfully
|
|
494
|
+
- [ ] Monitoring confirms fix
|
|
495
|
+
|
|
496
|
+
---
|
|
497
|
+
|
|
498
|
+
## Phase 5: POSTMORTEM
|
|
499
|
+
|
|
500
|
+
**Goal**: Learn and prevent future incidents
|
|
501
|
+
|
|
502
|
+
**Timeline**: Within 3-5 days after resolution
|
|
503
|
+
|
|
504
|
+
### Actions
|
|
505
|
+
|
|
506
|
+
1. **Schedule Postmortem Meeting**:
|
|
507
|
+
- [ ] Blameless environment
|
|
508
|
+
- [ ] All involved parties
|
|
509
|
+
- [ ] Within 1 week of incident
|
|
510
|
+
|
|
511
|
+
2. **Write Postmortem Document**:
|
|
512
|
+
|
|
513
|
+
### Postmortem Template
|
|
514
|
+
|
|
515
|
+
```markdown
|
|
516
|
+
# Incident Postmortem: [Title]
|
|
517
|
+
|
|
518
|
+
**Date**: [YYYY-MM-DD]
|
|
519
|
+
**Severity**: P[1-4]
|
|
520
|
+
**Duration**: [HH:MM]
|
|
521
|
+
**Impact**: [X users / Y transactions / Z revenue]
|
|
522
|
+
**Authors**: [names]
|
|
523
|
+
|
|
524
|
+
## Summary
|
|
525
|
+
[2-3 sentence summary of what happened]
|
|
526
|
+
|
|
527
|
+
## Timeline (All times in [timezone])
|
|
528
|
+
|
|
529
|
+
| Time | Event |
|
|
530
|
+
|------|-------|
|
|
531
|
+
| HH:MM | Normal operation |
|
|
532
|
+
| HH:MM | [Trigger event] |
|
|
533
|
+
| HH:MM | First error detected |
|
|
534
|
+
| HH:MM | Incident declared |
|
|
535
|
+
| HH:MM | Triage complete |
|
|
536
|
+
| HH:MM | Root cause identified |
|
|
537
|
+
| HH:MM | Mitigation applied |
|
|
538
|
+
| HH:MM | Service restored |
|
|
539
|
+
| HH:MM | Permanent fix deployed |
|
|
540
|
+
| HH:MM | Incident closed |
|
|
541
|
+
|
|
542
|
+
## Root Cause
|
|
543
|
+
[Detailed explanation of what went wrong and why]
|
|
544
|
+
|
|
545
|
+
### Contributing Factors
|
|
546
|
+
1. [Factor 1]
|
|
547
|
+
2. [Factor 2]
|
|
548
|
+
3. [Factor 3]
|
|
549
|
+
|
|
550
|
+
## Impact
|
|
551
|
+
|
|
552
|
+
### Users
|
|
553
|
+
- [X] users unable to access [feature]
|
|
554
|
+
- [Y] failed transactions
|
|
555
|
+
- [Z] support tickets
|
|
556
|
+
|
|
557
|
+
### Business
|
|
558
|
+
- [Estimated revenue impact]
|
|
559
|
+
- [Reputation impact]
|
|
560
|
+
- [SLA breach: Y/N]
|
|
561
|
+
|
|
562
|
+
### Engineering
|
|
563
|
+
- [On-call hours]
|
|
564
|
+
- [Development time]
|
|
565
|
+
|
|
566
|
+
## Resolution
|
|
567
|
+
|
|
568
|
+
### Immediate Mitigation
|
|
569
|
+
[What we did to stop the bleeding]
|
|
570
|
+
|
|
571
|
+
### Permanent Fix
|
|
572
|
+
[What we changed to prevent recurrence]
|
|
573
|
+
|
|
574
|
+
## What Went Well
|
|
575
|
+
1. [Positive aspect 1]
|
|
576
|
+
2. [Positive aspect 2]
|
|
577
|
+
3. [Positive aspect 3]
|
|
578
|
+
|
|
579
|
+
## What Went Wrong
|
|
580
|
+
1. [Problem 1]
|
|
581
|
+
2. [Problem 2]
|
|
582
|
+
3. [Problem 3]
|
|
583
|
+
|
|
584
|
+
## Action Items
|
|
585
|
+
|
|
586
|
+
| Action | Owner | Priority | Due Date | Status |
|
|
587
|
+
|--------|-------|----------|----------|--------|
|
|
588
|
+
| [Action 1] | [Name] | P1 | [Date] | [Open/In Progress/Done] |
|
|
589
|
+
| [Action 2] | [Name] | P2 | [Date] | [Open/In Progress/Done] |
|
|
590
|
+
| [Action 3] | [Name] | P3 | [Date] | [Open/In Progress/Done] |
|
|
591
|
+
|
|
592
|
+
### Preventive Measures
|
|
593
|
+
- [ ] Add monitoring for [metric]
|
|
594
|
+
- [ ] Add alerting for [condition]
|
|
595
|
+
- [ ] Improve documentation for [area]
|
|
596
|
+
- [ ] Add tests for [scenario]
|
|
597
|
+
- [ ] Update runbook for [process]
|
|
598
|
+
- [ ] Architecture review for [component]
|
|
599
|
+
|
|
600
|
+
### Process Improvements
|
|
601
|
+
- [ ] Update incident response playbook
|
|
602
|
+
- [ ] Improve deployment process
|
|
603
|
+
- [ ] Enhance testing strategy
|
|
604
|
+
- [ ] Update on-call rotation
|
|
605
|
+
|
|
606
|
+
## Supporting Information
|
|
607
|
+
|
|
608
|
+
### Logs
|
|
609
|
+
```
|
|
610
|
+
[Relevant log excerpts]
|
|
611
|
+
```
|
|
612
|
+
|
|
613
|
+
### Metrics
|
|
614
|
+
[Screenshots/graphs of relevant metrics]
|
|
615
|
+
|
|
616
|
+
### Related Incidents
|
|
617
|
+
- [Previous similar incident #123]
|
|
618
|
+
- [Related issue #456]
|
|
619
|
+
|
|
620
|
+
## Lessons Learned
|
|
621
|
+
1. [Key learning 1]
|
|
622
|
+
2. [Key learning 2]
|
|
623
|
+
3. [Key learning 3]
|
|
624
|
+
```
|
|
625
|
+
|
|
626
|
+
### Postmortem Checklist
|
|
627
|
+
- [ ] Postmortem doc written
|
|
628
|
+
- [ ] Meeting held (blameless)
|
|
629
|
+
- [ ] Action items assigned with owners
|
|
630
|
+
- [ ] Action items tracked
|
|
631
|
+
- [ ] Knowledge base updated
|
|
632
|
+
- [ ] Runbook updated (if applicable)
|
|
633
|
+
- [ ] Architecture doc updated (if applicable)
|
|
634
|
+
|
|
635
|
+
### Gate
|
|
636
|
+
- [ ] Postmortem complete
|
|
637
|
+
- [ ] Action items tracked
|
|
638
|
+
- [ ] Team learned lessons
|
|
639
|
+
- [ ] Prevention measures in place
|
|
640
|
+
|
|
641
|
+
---
|
|
642
|
+
|
|
643
|
+
## Communication Templates
|
|
644
|
+
|
|
645
|
+
### Initial Incident Notification
|
|
646
|
+
|
|
647
|
+
```
|
|
648
|
+
🚨 INCIDENT ALERT - P[X]
|
|
649
|
+
|
|
650
|
+
Status: INVESTIGATING
|
|
651
|
+
Affected: [service/feature]
|
|
652
|
+
Impact: [brief description]
|
|
653
|
+
Started: [timestamp]
|
|
654
|
+
|
|
655
|
+
We are investigating and will provide updates every [15/30] minutes.
|
|
656
|
+
|
|
657
|
+
Incident Channel: [link]
|
|
658
|
+
Status Page: [link]
|
|
659
|
+
|
|
660
|
+
- Incident Commander: [name]
|
|
661
|
+
```
|
|
662
|
+
|
|
663
|
+
### Status Update
|
|
664
|
+
|
|
665
|
+
```
|
|
666
|
+
📊 INCIDENT UPDATE - P[X]
|
|
667
|
+
|
|
668
|
+
Status: [INVESTIGATING/MITIGATING/RESOLVED]
|
|
669
|
+
Time: [timestamp]
|
|
670
|
+
|
|
671
|
+
Progress:
|
|
672
|
+
- [Update 1]
|
|
673
|
+
- [Update 2]
|
|
674
|
+
|
|
675
|
+
Current Impact: [description]
|
|
676
|
+
|
|
677
|
+
Next update in [15/30] minutes or when status changes.
|
|
678
|
+
|
|
679
|
+
- Incident Commander: [name]
|
|
680
|
+
```
|
|
681
|
+
|
|
682
|
+
### Resolution Notification
|
|
683
|
+
|
|
684
|
+
```
|
|
685
|
+
✅ INCIDENT RESOLVED - P[X]
|
|
686
|
+
|
|
687
|
+
Duration: [HH:MM]
|
|
688
|
+
Resolved: [timestamp]
|
|
689
|
+
|
|
690
|
+
Summary:
|
|
691
|
+
- Issue: [brief description]
|
|
692
|
+
- Cause: [root cause]
|
|
693
|
+
- Fix: [what was done]
|
|
694
|
+
|
|
695
|
+
Impact:
|
|
696
|
+
- [X] users affected
|
|
697
|
+
- [Y] duration
|
|
698
|
+
|
|
699
|
+
All services are now operating normally.
|
|
700
|
+
|
|
701
|
+
Postmortem will be shared within [3-5] days.
|
|
702
|
+
|
|
703
|
+
Thank you for your patience.
|
|
704
|
+
|
|
705
|
+
- Incident Commander: [name]
|
|
706
|
+
```
|
|
707
|
+
|
|
708
|
+
---
|
|
709
|
+
|
|
710
|
+
## Quick Reference
|
|
711
|
+
|
|
712
|
+
### Architecture Docs
|
|
713
|
+
| Stack | Doc |
|
|
714
|
+
|-------|-----|
|
|
715
|
+
| All | `clean-architecture.md` |
|
|
716
|
+
| Flutter | `flutter-mobile.md` |
|
|
717
|
+
| React | `react-frontend.md` |
|
|
718
|
+
| Go | `go-backend.md` |
|
|
719
|
+
| Laravel | `laravel-backend.md` |
|
|
720
|
+
| Remix | `remix-fullstack.md` |
|
|
721
|
+
| Monorepo | `monorepo.md` |
|
|
722
|
+
|
|
723
|
+
### Incident Response Checklist
|
|
724
|
+
|
|
725
|
+
#### First 5 Minutes
|
|
726
|
+
- [ ] Assign severity (P1-P4)
|
|
727
|
+
- [ ] Create incident channel
|
|
728
|
+
- [ ] Notify stakeholders
|
|
729
|
+
- [ ] Assign incident commander
|
|
730
|
+
- [ ] Begin triage
|
|
731
|
+
|
|
732
|
+
#### First 15 Minutes
|
|
733
|
+
- [ ] Read architecture doc
|
|
734
|
+
- [ ] Gather logs/metrics
|
|
735
|
+
- [ ] Identify affected components
|
|
736
|
+
- [ ] Establish timeline
|
|
737
|
+
- [ ] Update stakeholders
|
|
738
|
+
|
|
739
|
+
#### First 30 Minutes
|
|
740
|
+
- [ ] Root cause hypothesis
|
|
741
|
+
- [ ] Mitigation plan ready
|
|
742
|
+
- [ ] Execute mitigation
|
|
743
|
+
- [ ] Monitor impact
|
|
744
|
+
|
|
745
|
+
#### First Hour
|
|
746
|
+
- [ ] Service restored (mitigated)
|
|
747
|
+
- [ ] Monitoring stable
|
|
748
|
+
- [ ] Plan permanent fix
|
|
749
|
+
|
|
750
|
+
#### First Day
|
|
751
|
+
- [ ] Permanent fix designed
|
|
752
|
+
- [ ] Following architecture patterns
|
|
753
|
+
- [ ] Tests added
|
|
754
|
+
- [ ] Deployed with monitoring
|
|
755
|
+
|
|
756
|
+
#### First Week
|
|
757
|
+
- [ ] Postmortem scheduled
|
|
758
|
+
- [ ] Postmortem written
|
|
759
|
+
- [ ] Action items assigned
|
|
760
|
+
- [ ] Prevention measures planned
|
|
761
|
+
|
|
762
|
+
### Severity Response SLA
|
|
763
|
+
|
|
764
|
+
| Severity | Detection | Triage | Mitigation | Resolution |
|
|
765
|
+
|----------|-----------|--------|------------|------------|
|
|
766
|
+
| 🔴 P1 | < 5 min | < 15 min | < 30 min | < 4 hours |
|
|
767
|
+
| 🟠 P2 | < 15 min | < 30 min | < 1 hour | < 24 hours |
|
|
768
|
+
| 🟡 P3 | < 1 hour | < 2 hours | < 4 hours | < 1 week |
|
|
769
|
+
| 🟢 P4 | < 4 hours | < 1 day | Best effort | < 2 weeks |
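
The mitigation column of this table can be checked mechanically while an incident is open. The sketch below encodes only the mitigation targets (30, 60, and 240 minutes, with P4 as best effort); the function names and the output wording are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch: compare elapsed minutes against the mitigation targets in the table.
mitigation_target_minutes() {
  case "$1" in
    P1) echo 30 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    P4) echo 0 ;;      # best effort: no fixed target
    *)  return 1 ;;
  esac
}

# mitigation_sla_status SEVERITY ELAPSED_MINUTES
mitigation_sla_status() {
  local severity="$1" elapsed_minutes="$2" target
  target="$(mitigation_target_minutes "$severity")" \
    || { echo "unknown severity"; return 1; }
  if [ "$target" -eq 0 ]; then
    echo "best effort"
  elif [ "$elapsed_minutes" -lt "$target" ]; then
    echo "within target ($((target - elapsed_minutes)) min left)"
  else
    echo "target missed by $((elapsed_minutes - target)) min"
  fi
}

mitigation_sla_status P1 20   # prints: within target (10 min left)
```

A missed target is a natural trigger for moving one step up the escalation path.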
|
|
770
|
+
|
|
771
|
+
### Escalation Path
|
|
772
|
+
|
|
773
|
+
```
|
|
774
|
+
P1: Immediate
|
|
775
|
+
├── Incident Commander
|
|
776
|
+
├── Engineering Manager
|
|
777
|
+
├── CTO
|
|
778
|
+
└── CEO (if customer-facing)
|
|
779
|
+
|
|
780
|
+
P2: < 1 hour
|
|
781
|
+
├── Incident Commander
|
|
782
|
+
├── Engineering Manager
|
|
783
|
+
└── CTO (if not resolved in 2 hours)
|
|
784
|
+
|
|
785
|
+
P3: < 4 hours
|
|
786
|
+
├── Incident Commander
|
|
787
|
+
└── Engineering Manager
|
|
788
|
+
|
|
789
|
+
P4: Normal business hours
|
|
790
|
+
└── Incident Commander
|
|
791
|
+
```
|
|
792
|
+
|
|
793
|
+
### Key Metrics to Monitor
|
|
794
|
+
|
|
795
|
+
#### Application Metrics
|
|
796
|
+
- [ ] Request rate
|
|
797
|
+
- [ ] Response time (p50, p95, p99)
|
|
798
|
+
- [ ] Error rate
|
|
799
|
+
- [ ] Active users
|
|
800
|
+
- [ ] Transaction volume
|
|
801
|
+
|
|
802
|
+
#### Infrastructure Metrics
|
|
803
|
+
- [ ] CPU usage
|
|
804
|
+
- [ ] Memory usage
|
|
805
|
+
- [ ] Disk usage
|
|
806
|
+
- [ ] Network I/O
|
|
807
|
+
- [ ] Database connections
|
|
808
|
+
|
|
809
|
+
#### Business Metrics
|
|
810
|
+
- [ ] Conversion rate
|
|
811
|
+
- [ ] Revenue
|
|
812
|
+
- [ ] User satisfaction
|
|
813
|
+
- [ ] SLA compliance
|
|
814
|
+
|
|
815
|
+
### Common Incident Types
|
|
816
|
+
|
|
817
|
+
| Type | Common Causes | Typical Mitigation |
|
|
818
|
+
|------|---------------|-------------------|
|
|
819
|
+
| **Deployment** | Bad code, config error | Rollback |
|
|
820
|
+
| **Infrastructure** | Resource exhaustion | Scale up/out |
|
|
821
|
+
| **Database** | Slow queries, locks | Kill queries, add indexes |
|
|
822
|
+
| **External Dependency** | API down, timeout | Circuit breaker, fallback |
|
|
823
|
+
| **Security** | Attack, vulnerability | Block traffic, patch |
|
|
824
|
+
| **Data** | Corruption, loss | Restore from backup |
|
|
825
|
+
|
|
826
|
+
### War Room Best Practices
|
|
827
|
+
|
|
828
|
+
1. **Clear Roles**:
|
|
829
|
+
- Incident Commander (makes decisions)
|
|
830
|
+
- Investigators (find root cause)
|
|
831
|
+
- Communicators (update stakeholders)
|
|
832
|
+
- Scribes (document timeline)
|
|
833
|
+
|
|
834
|
+
2. **Communication**:
|
|
835
|
+
- Use dedicated channel
|
|
836
|
+
- Update every 15-30 minutes
|
|
837
|
+
- Be specific, not vague
|
|
838
|
+
- No blame, focus on fix
|
|
839
|
+
|
|
840
|
+
3. **Decision Making**:
|
|
841
|
+
- Incident Commander has final say
|
|
842
|
+
- Bias toward action
|
|
843
|
+
- Don't wait for perfect information
|
|
844
|
+
- Mitigate first, understand later
|
|
845
|
+
|
|
846
|
+
4. **Documentation**:
|
|
847
|
+
- Log every action with timestamp
|
|
848
|
+
- Save logs/metrics
|
|
849
|
+
- Screenshot dashboards
|
|
850
|
+
- Record decisions and rationale
|
|
851
|
+
|
|
852
|
+
---
|
|
853
|
+
|
|
854
|
+
## Runbook Examples
|
|
855
|
+
|
|
856
|
+
### Service Down
|
|
857
|
+
|
|
858
|
+
```bash
|
|
859
|
+
# 1. Check if service is running
|
|
860
|
+
systemctl status [service]
|
|
861
|
+
# or
|
|
862
|
+
kubectl get pods -n [namespace]
|
|
863
|
+
|
|
864
|
+
# 2. Check logs
|
|
865
|
+
journalctl -u [service] --since "10 minutes ago"
|
|
866
|
+
# or
|
|
867
|
+
kubectl logs -n [namespace] [pod] --tail=100
|
|
868
|
+
|
|
869
|
+
# 3. Restart service
|
|
870
|
+
systemctl restart [service]
|
|
871
|
+
# or
|
|
872
|
+
kubectl rollout restart deployment/[name]
|
|
873
|
+
|
|
874
|
+
# 4. Verify
|
|
875
|
+
curl https://[service]/health
|
|
876
|
+
```
|
|
877
|
+
|
|
878
|
+
### High CPU Usage
|
|
879
|
+
|
|
880
|
+
```bash
|
|
881
|
+
# 1. Identify process
|
|
882
|
+
top -o %CPU
|
|
883
|
+
# or
|
|
884
|
+
ps aux --sort=-%cpu | head
|
|
885
|
+
|
|
886
|
+
# 2. Check if legitimate
|
|
887
|
+
# (deployment, batch job, etc.)
|
|
888
|
+
|
|
889
|
+
# 3. If abnormal, scale up
|
|
890
|
+
kubectl scale deployment/[name] --replicas=[N+2]
|
|
891
|
+
|
|
892
|
+
# 4. Investigate root cause
|
|
893
|
+
# (profiling, logs, etc.)
|
|
894
|
+
```
|
|
895
|
+
|
|
896
|
+
### Database Slow
|
|
897
|
+
|
|
898
|
+
```sql
|
|
899
|
+
-- 1. Check active queries
|
|
900
|
+
SELECT pid, now() - query_start as duration, query
|
|
901
|
+
FROM pg_stat_activity
|
|
902
|
+
WHERE state = 'active'
|
|
903
|
+
ORDER BY duration DESC;
|
|
904
|
+
|
|
905
|
+
-- 2. Kill long-running queries (consider pg_cancel_backend([pid]) first; terminate closes the whole connection)
|
|
906
|
+
SELECT pg_terminate_backend([pid]);
|
|
907
|
+
|
|
908
|
+
-- 3. Check for locks
|
|
909
|
+
SELECT * FROM pg_locks WHERE NOT granted;
|
|
910
|
+
|
|
911
|
+
-- 4. Add missing indexes (if needed)
|
|
912
|
+
CREATE INDEX CONCURRENTLY [index_name] ON [table]([column]);
|
|
913
|
+
```
|
|
914
|
+
|
|
915
|
+
---
|
|
916
|
+
|
|
917
|
+
## Emergency Contacts
|
|
918
|
+
|
|
919
|
+
**Add your team's contacts:**
|
|
920
|
+
|
|
921
|
+
```markdown
|
|
922
|
+
### On-Call Rotation
|
|
923
|
+
- Primary: [name] - [contact]
|
|
924
|
+
- Secondary: [name] - [contact]
|
|
925
|
+
|
|
926
|
+
### Escalation
|
|
927
|
+
- Engineering Manager: [name] - [contact]
|
|
928
|
+
- CTO: [name] - [contact]
|
|
929
|
+
- CEO: [name] - [contact]
|
|
930
|
+
|
|
931
|
+
### External
|
|
932
|
+
- Cloud Provider Support: [contact]
|
|
933
|
+
- Database Support: [contact]
|
|
934
|
+
- Security Team: [contact]
|
|
935
|
+
```
|
|
936
|
+
|
|
937
|
+
---
|
|
938
|
+
|
|
939
|
+
## Additional Resources
|
|
940
|
+
|
|
941
|
+
- [Incident Response Plan]
|
|
942
|
+
- [Runbook Collection]
|
|
943
|
+
- [Monitoring Dashboard]
|
|
944
|
+
- [Status Page]
|
|
945
|
+
- [Postmortem Archive]
|
|
946
|
+
- [Architecture Documentation]
|