moicle 1.1.1 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
---
name: incident-response
description: Incident response workflow for handling production issues. Use when there's an incident, outage, production down, or when user says "incident", "outage", "production down", "service down", "emergency".
---

# Incident Response Workflow

Structured workflow for handling production incidents with clear phases from triage to postmortem.

## IMPORTANT: Read Architecture First

**Before responding to incidents, you MUST read the appropriate architecture reference:**

### Global Architecture Files
```
~/.claude/architecture/
├── clean-architecture.md   # Core principles for all projects
├── flutter-mobile.md       # Flutter + Riverpod
├── react-frontend.md       # React + Vite + TypeScript
├── go-backend.md           # Go + Gin
├── laravel-backend.md      # Laravel + PHP
├── remix-fullstack.md      # Remix fullstack
└── monorepo.md             # Monorepo structure
```

### Project-specific (if it exists)
```
.claude/architecture/       # Project overrides
```

**Understand the system architecture to diagnose and fix issues effectively.**

## Recommended Agents

| Phase | Agent | Purpose |
|-------|-------|---------|
| TRIAGE | `@devops` | Infrastructure & monitoring |
| INVESTIGATE | `@react-frontend-dev`, `@go-backend-dev`, `@laravel-backend-dev`, `@flutter-mobile-dev`, `@remix-fullstack-dev` | Stack-specific debugging |
| INVESTIGATE | `@security-audit` | Security incidents |
| INVESTIGATE | `@db-designer` | Database issues |
| MITIGATE | `@devops` | Quick rollback/scaling |
| RESOLVE | Stack-specific dev agents | Permanent fix |
| POSTMORTEM | `@docs-writer` | Document learnings |

## Workflow Overview

```
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│1. TRIAGE │──▶│2. INVESTI│──▶│3. MITIGA-│──▶│4. RESOLVE│──▶│5. POST-  │
│          │   │   GATE   │   │   TE     │   │          │   │   MORTEM │
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘
     │              │              │              │
     │              │              │              └──── Monitor ────┐
     │              │              └──── If fails ─────────┐        │
     │              └──── Escalate if needed               │        │
     └──── Notify stakeholders ────────────────────────────┴────────┘
```

---

## Severity Levels

| Level | Description | Response Time | Examples |
|-------|-------------|---------------|----------|
| 🔴 **P1** | Critical - Complete outage | Immediate (< 15 min) | Production down, data loss, security breach |
| 🟠 **P2** | High - Major degradation | < 1 hour | Core features broken, 50%+ users affected |
| 🟡 **P3** | Medium - Partial degradation | < 4 hours | Non-core features broken, < 50% users affected |
| 🟢 **P4** | Low - Minor issue | < 24 hours | UI bugs, minor performance issues |
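
As a worked example, the response-time column above can be encoded as a small lookup, e.g. for annotating alerts. This helper is purely illustrative and not part of any tooling referenced in this document.

```shell
# Illustrative helper: map a severity level from the table above to its
# target response time. The times mirror the table, nothing more.
response_target() {
  case "$1" in
    P1) echo "immediate (< 15 min)" ;;
    P2) echo "< 1 hour" ;;
    P3) echo "< 4 hours" ;;
    P4) echo "< 24 hours" ;;
    *)  echo "unknown severity"; return 1 ;;
  esac
}
```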

---

## Phase 1: TRIAGE

**Goal**: Assess severity, scope, and impact immediately

**Timeline**: 5-15 minutes

### Actions

1. **Initial Assessment**:
   - [ ] Incident reported (alert/user/monitoring)
   - [ ] Timestamp recorded
   - [ ] Initial severity assigned (P1-P4)

2. **Gather Critical Information**:
   ```markdown
   - What is down/broken?
   - When did it start?
   - How many users affected?
   - Error messages/logs?
   - Recent deployments/changes?
   ```

3. **Identify System & Architecture**:
   - [ ] Identify affected stack (Flutter/React/Go/Laravel/Remix)
   - [ ] Identify affected service/component
   - [ ] Read architecture doc for affected area

4. **Establish Incident Response**:
   - [ ] Create incident channel (Slack/Teams)
   - [ ] Assign incident commander
   - [ ] Notify stakeholders (use template below)

5. **Quick Health Checks**:
   ```bash
   # Check service status
   curl -I https://[service-url]/health

   # Check logs (last 100 lines)
   kubectl logs -n [namespace] [pod] --tail=100
   # or
   docker logs [container] --tail=100

   # Check metrics
   # (CPU, memory, disk, network)
   ```
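
A single failed probe during an incident can be noise; the quick health check above is often wrapped in a small retry loop before declaring a service down. This is a minimal sketch: the probe command is passed in as a string (the commented `curl` line is only an example endpoint, not a required one).

```shell
# Retry a health probe a few times before declaring the service unhealthy.
# $1 = probe command (exits 0 when healthy), $2 = retries, $3 = delay (s).
probe_with_retry() {
  probe_cmd=$1; retries=${2:-3}; delay=${3:-5}
  i=1
  while [ "$i" -le "$retries" ]; do
    # Intentionally unquoted so the probe string splits into a command.
    if $probe_cmd; then
      echo "healthy (attempt $i)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "unhealthy after $retries attempts"
  return 1
}

# Example (hypothetical URL):
# probe_with_retry "curl -fsS -m 5 https://service.example.com/health" 3 5
```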

### Output Template

```markdown
## INCIDENT REPORT

**Severity**: P[1-4] - [CRITICAL/HIGH/MEDIUM/LOW]
**Status**: INVESTIGATING
**Started**: [timestamp]
**Affected**: [service/feature]
**User Impact**: [number/percentage] users
**Stack**: [Flutter/React/Go/Laravel/Remix]
**Architecture Doc**: [path]

### Summary
[One-line description of what's broken]

### Impact
- [ ] Production down
- [ ] Feature degraded
- [ ] Data integrity issues
- [ ] Security concerns

### Recent Changes
- [Last deployment/change]
- [Time of change]

### Initial Observations
- [Error messages]
- [Metrics anomalies]
- [User reports]

### Response Team
- Incident Commander: [name]
- Engineers: [names]
- On-call: [name]

### Communication
- Incident Channel: [link]
- Status Page Updated: [Y/N]
```

### Gate
- [ ] Severity assessed
- [ ] Stakeholders notified
- [ ] Architecture doc identified
- [ ] Response team assembled
- [ ] Communication channel established

---

## Phase 2: INVESTIGATE

**Goal**: Find the root cause

**Timeline**: 15-60 minutes (varies by severity)

### Actions

1. **Read Architecture Documentation**:
   - [ ] Read architecture doc for affected stack
   - [ ] Understand system components and dependencies
   - [ ] Identify critical paths

2. **Collect Evidence**:
   ```bash
   # Application logs
   grep -i error /var/log/[app].log | tail -50

   # System logs
   journalctl -u [service] --since "30 minutes ago"

   # Database logs
   # (stack-specific commands from architecture doc)

   # Network/traffic
   netstat -an | grep [port]
   ```

3. **Check Monitoring Dashboards**:
   - [ ] CPU/memory usage
   - [ ] Request rate/latency
   - [ ] Error rate
   - [ ] Database connections/queries
   - [ ] External service dependencies

4. **Timeline Analysis**:
   ```markdown
   - [HH:MM] - Normal operation
   - [HH:MM] - [Event: deployment/config change/traffic spike]
   - [HH:MM] - First error logged
   - [HH:MM] - User reports started
   - [HH:MM] - Full outage
   ```

5. **Root Cause Analysis (5 Whys)**:
   ```
   Problem: [What is failing?]

   Why? → [First-level cause]
   Why? → [Second-level cause]
   Why? → [Third-level cause]
   Why? → [Fourth-level cause]
   Why? → [ROOT CAUSE]
   ```

6. **Check Dependencies**:
   - [ ] External APIs responding?
   - [ ] Database accessible?
   - [ ] Cache service healthy?
   - [ ] Message queue working?
   - [ ] CDN functioning?

### Investigation Output

````markdown
## ROOT CAUSE ANALYSIS

**Architecture Reference**: [path to doc]
**Affected Component**: [service/layer from architecture]
**Root Cause**: [detailed description]

### Timeline
- [HH:MM] - Event 1
- [HH:MM] - Event 2
- [HH:MM] - Failure point

### Evidence
**Logs**:
```
[relevant log entries]
```

**Metrics**:
- CPU: [before/after]
- Memory: [before/after]
- Error rate: [before/after]

**Related Changes**:
- Commit: [sha] - [description]
- Deploy: [time] - [version]
- Config: [change description]

### Hypothesis
[What we think caused it]

### Verification
[How to verify the hypothesis]
````
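
When filling in the "first error logged" entry of the timeline above, the timestamp can usually be pulled straight from the application log. A small sketch, assuming each log line begins with a timestamp (e.g. ISO-8601); adjust the field if your format differs:

```shell
# Print the timestamp of the first line matching "error" (case-insensitive).
# Assumes the timestamp is the first whitespace-separated field of each line.
first_error_time() {
  grep -i -m1 'error' "$1" | awk '{ print $1 }'
}
```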

### Gate
- [ ] Root cause identified
- [ ] Evidence collected
- [ ] Timeline established
- [ ] Dependencies checked

---

## Phase 3: MITIGATE

**Goal**: Stop the bleeding with a temporary fix

**Timeline**: Immediate (5-30 minutes)

### Principles
1. **Speed over perfection** - Stop user impact NOW
2. **Temporary is OK** - The permanent fix comes later
3. **Minimize risk** - Don't make it worse
4. **Preserve evidence** - Keep logs/data for analysis

### Mitigation Strategies

#### Strategy 1: Rollback
```bash
# Roll back to the previous version
git log --oneline -5
git revert [bad-commit-sha]
# or
kubectl rollout undo deployment/[name]
# or
docker service update --rollback [service]
```

#### Strategy 2: Disable Feature
```bash
# Feature flag (if available)
# Turn off the problematic feature
curl -X POST [config-service]/flags/[feature] -d '{"enabled": false}'
```

#### Strategy 3: Scale Resources
```bash
# Horizontal scaling
kubectl scale deployment/[name] --replicas=10

# Vertical scaling (increase resources)
# Update the deployment with more CPU/memory
```

#### Strategy 4: Traffic Routing
```bash
# Route traffic away from the bad instance
# Update the load balancer/ingress
kubectl patch ingress [name] -p '[patch-json]'
```

#### Strategy 5: Kill Switch
```bash
# Emergency circuit breaker
# Disable non-critical services to save the core
```

#### Strategy 6: Database Mitigation
```sql
-- If it's a database issue
-- Kill long-running queries
SELECT pg_terminate_backend([pid]);

-- Add a missing index temporarily
CREATE INDEX CONCURRENTLY idx_temp ON [table]([column]);
```

### Mitigation Checklist
- [ ] Mitigation strategy selected
- [ ] Tested in staging (if time permits for P2-P4)
- [ ] Executed in production
- [ ] User impact reduced/eliminated
- [ ] System stable
- [ ] Monitoring confirms mitigation working

### Output Template

````markdown
## MITIGATION APPLIED

**Strategy**: [Rollback/Feature Flag/Scaling/etc]
**Applied At**: [timestamp]
**Applied By**: [name]

### Action Taken
```bash
[exact commands executed]
```

### Result
- User Impact: [before] → [after]
- Error Rate: [before] → [after]
- Service Status: [before] → [after]

### Monitoring
- [ ] Metrics returning to normal
- [ ] Error rate decreased
- [ ] Users can access service
- [ ] No new issues introduced

### Next Steps
- [ ] Monitor for 15-30 minutes
- [ ] Proceed to RESOLVE for permanent fix
````

### Gate
- [ ] User impact mitigated
- [ ] System stable
- [ ] Monitoring shows improvement
- [ ] No new critical issues

---

## Phase 4: RESOLVE

**Goal**: Implement permanent fix

**Timeline**: Hours to days (depending on severity)

### Actions

1. **Read Architecture Documentation**:
   - [ ] Re-read architecture doc for affected stack
   - [ ] Understand proper fix location and approach
   - [ ] Follow architecture patterns

2. **Design Permanent Fix**:
   ```markdown
   ### Fix Design
   **Based on**: [architecture doc]
   **Affected Layers**: [from architecture doc]

   **Root Cause**: [from investigation]

   **Permanent Solution**:
   - [Change 1 - following architecture patterns]
   - [Change 2 - following architecture patterns]

   **Why This Prevents Recurrence**:
   [Explanation]
   ```

3. **Implement Fix**:
   - [ ] Create branch: `incident/[issue-id]-[description]`
   - [ ] Implement following architecture doc patterns
   - [ ] Add monitoring/alerting to prevent recurrence
   - [ ] Add tests to catch this scenario

4. **Test Thoroughly**:
   ```bash
   # Use test commands from the architecture doc
   flutter test        # Flutter
   go test ./...       # Go
   bun test            # React/Remix
   php artisan test    # Laravel

   # Integration tests
   # Load tests (if performance issue)
   # Security tests (if security issue)
   ```

5. **Gradual Rollout** (for P1/P2 incidents):
   - [ ] Deploy to staging
   - [ ] Canary deployment (5% traffic)
   - [ ] Monitor for issues
   - [ ] Gradual increase (25%, 50%, 100%)
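
The staged rollout above can be sketched as a simple ramp loop. Both helper functions here are placeholders, not real APIs: wire `set_traffic_weight` to your ingress/load balancer and `error_rate_ok` to your metrics backend.

```shell
# Canary ramp sketch: promote traffic in stages, backing out on the first
# failed error-budget check.
set_traffic_weight() { echo "canary weight set to ${1}%"; }  # placeholder
error_rate_ok()      { true; }                               # placeholder

canary_ramp() {
  for pct in 5 25 50 100; do
    set_traffic_weight "$pct" || return 1
    if ! error_rate_ok; then
      echo "error budget exceeded, rolling back at ${pct}%"
      set_traffic_weight 0
      return 1
    fi
    echo "promoted to ${pct}%"
  done
  echo "rollout complete"
}
```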

### Resolution Checklist
- [ ] Fix follows architecture doc
- [ ] Root cause addressed
- [ ] Tests added to prevent regression
- [ ] Monitoring/alerting enhanced
- [ ] Code reviewed
- [ ] Documentation updated

### Deployment

```bash
# Commit with incident reference
git add .
git commit -m "fix: [description] (incident #[id])

Root cause: [explanation]
Resolution: [what was changed]

Prevents recurrence by: [explanation]

Fixes #[incident-id]"

# Create PR
gh pr create --title "fix: [description] (incident #[id])" --body "$(cat <<'EOF'
## Summary
Resolves incident #[id]

## Root Cause
[Detailed explanation]

## Permanent Fix
[What was changed, following architecture patterns]

## Prevention
- [ ] Monitoring added
- [ ] Tests added
- [ ] Documentation updated

## Testing
- [ ] Unit tests pass
- [ ] Integration tests pass
- [ ] Staging deployment successful
- [ ] Canary deployment successful

## Rollback Plan
`git revert [commit-sha]`
EOF
)"
```

### Gate
- [ ] Permanent fix implemented
- [ ] Following architecture patterns
- [ ] All tests passing
- [ ] Deployed successfully
- [ ] Monitoring confirms fix

---

## Phase 5: POSTMORTEM

**Goal**: Learn and prevent future incidents

**Timeline**: Within 3-5 days after resolution

### Actions

1. **Schedule Postmortem Meeting**:
   - [ ] Blameless environment
   - [ ] All involved parties
   - [ ] Within 1 week of the incident

2. **Write Postmortem Document**:

### Postmortem Template

````markdown
# Incident Postmortem: [Title]

**Date**: [YYYY-MM-DD]
**Severity**: P[1-4]
**Duration**: [HH:MM]
**Impact**: [X users / Y transactions / Z revenue]
**Authors**: [names]

## Summary
[2-3 sentence summary of what happened]

## Timeline (All times in [timezone])

| Time | Event |
|------|-------|
| HH:MM | Normal operation |
| HH:MM | [Trigger event] |
| HH:MM | First error detected |
| HH:MM | Incident declared |
| HH:MM | Triage complete |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Service restored |
| HH:MM | Permanent fix deployed |
| HH:MM | Incident closed |

## Root Cause
[Detailed explanation of what went wrong and why]

### Contributing Factors
1. [Factor 1]
2. [Factor 2]
3. [Factor 3]

## Impact

### Users
- [X] users unable to access [feature]
- [Y] failed transactions
- [Z] support tickets

### Business
- [Estimated revenue impact]
- [Reputation impact]
- [SLA breach: Y/N]

### Engineering
- [On-call hours]
- [Development time]

## Resolution

### Immediate Mitigation
[What we did to stop the bleeding]

### Permanent Fix
[What we changed to prevent recurrence]

## What Went Well
1. [Positive aspect 1]
2. [Positive aspect 2]
3. [Positive aspect 3]

## What Went Wrong
1. [Problem 1]
2. [Problem 2]
3. [Problem 3]

## Action Items

| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| [Action 1] | [Name] | P1 | [Date] | [Open/In Progress/Done] |
| [Action 2] | [Name] | P2 | [Date] | [Open/In Progress/Done] |
| [Action 3] | [Name] | P3 | [Date] | [Open/In Progress/Done] |

### Preventive Measures
- [ ] Add monitoring for [metric]
- [ ] Add alerting for [condition]
- [ ] Improve documentation for [area]
- [ ] Add tests for [scenario]
- [ ] Update runbook for [process]
- [ ] Architecture review for [component]

### Process Improvements
- [ ] Update incident response playbook
- [ ] Improve deployment process
- [ ] Enhance testing strategy
- [ ] Update on-call rotation

## Supporting Information

### Logs
```
[Relevant log excerpts]
```

### Metrics
[Screenshots/graphs of relevant metrics]

### Related Incidents
- [Previous similar incident #123]
- [Related issue #456]

## Lessons Learned
1. [Key learning 1]
2. [Key learning 2]
3. [Key learning 3]
````

### Postmortem Checklist
- [ ] Postmortem doc written
- [ ] Meeting held (blameless)
- [ ] Action items assigned with owners
- [ ] Action items tracked
- [ ] Knowledge base updated
- [ ] Runbook updated (if applicable)
- [ ] Architecture doc updated (if applicable)

### Gate
- [ ] Postmortem complete
- [ ] Action items tracked
- [ ] Lessons shared with the team
- [ ] Prevention measures in place

---

## Communication Templates

### Initial Incident Notification

```
🚨 INCIDENT ALERT - P[X]

Status: INVESTIGATING
Affected: [service/feature]
Impact: [brief description]
Started: [timestamp]

We are investigating and will provide updates every [15/30] minutes.

Incident Channel: [link]
Status Page: [link]

- Incident Commander: [name]
```

### Status Update

```
📊 INCIDENT UPDATE - P[X]

Status: [INVESTIGATING/MITIGATING/RESOLVED]
Time: [timestamp]

Progress:
- [Update 1]
- [Update 2]

Current Impact: [description]

Next update in [15/30] minutes or when status changes.

- Incident Commander: [name]
```

### Resolution Notification

```
✅ INCIDENT RESOLVED - P[X]

Duration: [HH:MM]
Resolved: [timestamp]

Summary:
- Issue: [brief description]
- Cause: [root cause]
- Fix: [what was done]

Impact:
- [X] users affected
- [Y] duration

All services are now operating normally.

Postmortem will be shared within [3-5] days.

Thank you for your patience.

- Incident Commander: [name]
```
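
For posting to a chat webhook, the initial notification template above can be rendered from shell variables. A minimal sketch; the function name and argument order are ad hoc, not an established convention.

```shell
# Render the initial incident notification from variables, e.g. before
# piping it to a chat webhook. Fields mirror the template above.
incident_alert() {
  sev=$1; affected=$2; impact=$3; started=$4
  printf '🚨 INCIDENT ALERT - P%s\n\nStatus: INVESTIGATING\nAffected: %s\nImpact: %s\nStarted: %s\n' \
    "$sev" "$affected" "$impact" "$started"
}
```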

---

## Quick Reference

### Architecture Docs
| Stack | Doc |
|-------|-----|
| All | `clean-architecture.md` |
| Flutter | `flutter-mobile.md` |
| React | `react-frontend.md` |
| Go | `go-backend.md` |
| Laravel | `laravel-backend.md` |
| Remix | `remix-fullstack.md` |
| Monorepo | `monorepo.md` |

### Incident Response Checklist

#### First 5 Minutes
- [ ] Assign severity (P1-P4)
- [ ] Create incident channel
- [ ] Notify stakeholders
- [ ] Assign incident commander
- [ ] Begin triage

#### First 15 Minutes
- [ ] Read architecture doc
- [ ] Gather logs/metrics
- [ ] Identify affected components
- [ ] Establish timeline
- [ ] Update stakeholders

#### First 30 Minutes
- [ ] Root cause hypothesis
- [ ] Mitigation plan ready
- [ ] Execute mitigation
- [ ] Monitor impact

#### First Hour
- [ ] Service restored (mitigated)
- [ ] Monitoring stable
- [ ] Plan permanent fix

#### First Day
- [ ] Permanent fix designed
- [ ] Following architecture patterns
- [ ] Tests added
- [ ] Deployed with monitoring

#### First Week
- [ ] Postmortem scheduled
- [ ] Postmortem written
- [ ] Action items assigned
- [ ] Prevention measures planned

### Severity Response SLA

| Severity | Detection | Triage | Mitigation | Resolution |
|----------|-----------|--------|------------|------------|
| 🔴 P1 | < 5 min | < 15 min | < 30 min | < 4 hours |
| 🟠 P2 | < 15 min | < 30 min | < 1 hour | < 24 hours |
| 🟡 P3 | < 1 hour | < 2 hours | < 4 hours | < 1 week |
| 🟢 P4 | < 4 hours | < 1 day | Best effort | < 2 weeks |

### Escalation Path

```
P1: Immediate
├── Incident Commander
├── Engineering Manager
├── CTO
└── CEO (if customer-facing)

P2: < 1 hour
├── Incident Commander
├── Engineering Manager
└── CTO (if not resolved in 2 hours)

P3: < 4 hours
├── Incident Commander
└── Engineering Manager

P4: Normal business hours
└── Incident Commander
```

### Key Metrics to Monitor

#### Application Metrics
- [ ] Request rate
- [ ] Response time (p50, p95, p99)
- [ ] Error rate
- [ ] Active users
- [ ] Transaction volume

#### Infrastructure Metrics
- [ ] CPU usage
- [ ] Memory usage
- [ ] Disk usage
- [ ] Network I/O
- [ ] Database connections

#### Business Metrics
- [ ] Conversion rate
- [ ] Revenue
- [ ] User satisfaction
- [ ] SLA compliance
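
A rough error-rate reading can be computed straight from an access log when dashboards are unavailable. The sketch below assumes the HTTP status code is the 9th whitespace-separated field (as in Apache/Nginx common and combined log formats); adjust the field number for your format.

```shell
# Rough error rate from an access log: 5xx responses over total requests.
# Assumes the status code is field 9 (common/combined log format).
error_rate() {
  awk '{ total++; if ($9 >= 500 && $9 < 600) errors++ }
       END { if (total) printf "%.2f%% (%d/%d)\n", 100 * errors / total, errors, total }' "$1"
}
```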

### Common Incident Types

| Type | Common Causes | Typical Mitigation |
|------|---------------|--------------------|
| **Deployment** | Bad code, config error | Rollback |
| **Infrastructure** | Resource exhaustion | Scale up/out |
| **Database** | Slow queries, locks | Kill queries, add indexes |
| **External Dependency** | API down, timeout | Circuit breaker, fallback |
| **Security** | Attack, vulnerability | Block traffic, patch |
| **Data** | Corruption, loss | Restore from backup |

### War Room Best Practices

1. **Clear Roles**:
   - Incident Commander (makes decisions)
   - Investigators (find root cause)
   - Communicators (update stakeholders)
   - Scribes (document timeline)

2. **Communication**:
   - Use a dedicated channel
   - Update every 15-30 minutes
   - Be specific, not vague
   - No blame; focus on the fix

3. **Decision Making**:
   - The Incident Commander has final say
   - Bias toward action
   - Don't wait for perfect information
   - Mitigate first, understand later

4. **Documentation**:
   - Log every action with a timestamp
   - Save logs/metrics
   - Screenshot dashboards
   - Record decisions and rationale

---

## Runbook Examples

### Service Down

```bash
# 1. Check if the service is running
systemctl status [service]
# or
kubectl get pods -n [namespace]

# 2. Check logs
journalctl -u [service] --since "10 minutes ago"
# or
kubectl logs -n [namespace] [pod] --tail=100

# 3. Restart the service
systemctl restart [service]
# or
kubectl rollout restart deployment/[name]

# 4. Verify
curl https://[service]/health
```

### High CPU Usage

```bash
# 1. Identify the process
top -o %CPU
# or
ps aux --sort=-%cpu | head

# 2. Check whether the load is legitimate
# (deployment, batch job, etc.)

# 3. If abnormal, scale up
kubectl scale deployment/[name] --replicas=[N+2]

# 4. Investigate the root cause
# (profiling, logs, etc.)
```

### Database Slow

```sql
-- 1. Check active queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;

-- 2. Kill long-running queries
SELECT pg_terminate_backend([pid]);

-- 3. Check for locks
SELECT * FROM pg_locks WHERE NOT granted;

-- 4. Add missing indexes (if needed)
CREATE INDEX CONCURRENTLY idx_name ON table(column);
```

---

## Emergency Contacts

**Add your team's contacts:**

```markdown
### On-Call Rotation
- Primary: [name] - [contact]
- Secondary: [name] - [contact]

### Escalation
- Engineering Manager: [name] - [contact]
- CTO: [name] - [contact]
- CEO: [name] - [contact]

### External
- Cloud Provider Support: [contact]
- Database Support: [contact]
- Security Team: [contact]
```

---

## Additional Resources

- [Incident Response Plan]
- [Runbook Collection]
- [Monitoring Dashboard]
- [Status Page]
- [Postmortem Archive]
- [Architecture Documentation]