codingbuddy-rules 2.4.2 → 3.0.2

Files changed (53)
  1. package/.ai-rules/CHANGELOG.md +122 -0
  2. package/.ai-rules/agents/README.md +527 -11
  3. package/.ai-rules/agents/accessibility-specialist.json +0 -1
  4. package/.ai-rules/agents/act-mode.json +0 -1
  5. package/.ai-rules/agents/agent-architect.json +0 -1
  6. package/.ai-rules/agents/ai-ml-engineer.json +0 -1
  7. package/.ai-rules/agents/architecture-specialist.json +14 -2
  8. package/.ai-rules/agents/backend-developer.json +14 -2
  9. package/.ai-rules/agents/code-quality-specialist.json +0 -1
  10. package/.ai-rules/agents/data-engineer.json +0 -1
  11. package/.ai-rules/agents/devops-engineer.json +24 -2
  12. package/.ai-rules/agents/documentation-specialist.json +0 -1
  13. package/.ai-rules/agents/eval-mode.json +0 -1
  14. package/.ai-rules/agents/event-architecture-specialist.json +719 -0
  15. package/.ai-rules/agents/frontend-developer.json +14 -2
  16. package/.ai-rules/agents/i18n-specialist.json +0 -1
  17. package/.ai-rules/agents/integration-specialist.json +11 -1
  18. package/.ai-rules/agents/migration-specialist.json +676 -0
  19. package/.ai-rules/agents/mobile-developer.json +0 -1
  20. package/.ai-rules/agents/observability-specialist.json +747 -0
  21. package/.ai-rules/agents/performance-specialist.json +24 -2
  22. package/.ai-rules/agents/plan-mode.json +0 -1
  23. package/.ai-rules/agents/platform-engineer.json +0 -1
  24. package/.ai-rules/agents/security-specialist.json +27 -16
  25. package/.ai-rules/agents/seo-specialist.json +0 -1
  26. package/.ai-rules/agents/solution-architect.json +0 -1
  27. package/.ai-rules/agents/technical-planner.json +0 -1
  28. package/.ai-rules/agents/test-strategy-specialist.json +14 -2
  29. package/.ai-rules/agents/ui-ux-designer.json +0 -1
  30. package/.ai-rules/rules/core.md +25 -0
  31. package/.ai-rules/skills/README.md +35 -0
  32. package/.ai-rules/skills/database-migration/SKILL.md +531 -0
  33. package/.ai-rules/skills/database-migration/expand-contract-patterns.md +314 -0
  34. package/.ai-rules/skills/database-migration/large-scale-migration.md +414 -0
  35. package/.ai-rules/skills/database-migration/rollback-strategies.md +359 -0
  36. package/.ai-rules/skills/database-migration/validation-procedures.md +428 -0
  37. package/.ai-rules/skills/dependency-management/SKILL.md +381 -0
  38. package/.ai-rules/skills/dependency-management/license-compliance.md +282 -0
  39. package/.ai-rules/skills/dependency-management/lock-file-management.md +437 -0
  40. package/.ai-rules/skills/dependency-management/major-upgrade-guide.md +292 -0
  41. package/.ai-rules/skills/dependency-management/security-vulnerability-response.md +230 -0
  42. package/.ai-rules/skills/incident-response/SKILL.md +373 -0
  43. package/.ai-rules/skills/incident-response/communication-templates.md +322 -0
  44. package/.ai-rules/skills/incident-response/escalation-matrix.md +347 -0
  45. package/.ai-rules/skills/incident-response/postmortem-template.md +351 -0
  46. package/.ai-rules/skills/incident-response/severity-classification.md +256 -0
  47. package/.ai-rules/skills/performance-optimization/CREATION-LOG.md +87 -0
  48. package/.ai-rules/skills/performance-optimization/SKILL.md +76 -0
  49. package/.ai-rules/skills/performance-optimization/documentation-template.md +70 -0
  50. package/.ai-rules/skills/pr-review/SKILL.md +768 -0
  51. package/.ai-rules/skills/refactoring/SKILL.md +192 -0
  52. package/.ai-rules/skills/refactoring/refactoring-catalog.md +1377 -0
  53. package/package.json +1 -1
@@ -0,0 +1,230 @@
1
+ # Security Vulnerability Response
2
+
3
+ Detailed procedures for responding to CVEs and security advisories.
4
+
5
+ ## CVE Severity Classification
6
+
7
+ ### CVSS Score Interpretation
8
+
9
+ | Score Range | Severity | Description |
10
+ |-------------|----------|-------------|
11
+ | 9.0 - 10.0 | **Critical** | Exploitable remotely, no authentication required, complete system compromise |
12
+ | 7.0 - 8.9 | **High** | Significant impact, may require some conditions |
13
+ | 4.0 - 6.9 | **Medium** | Limited impact or requires significant preconditions |
14
+ | 0.1 - 3.9 | **Low** | Minimal impact, difficult to exploit |
15
+
16
+ ### Response Time Requirements
17
+
18
+ | Severity | Assessment | Remediation | Verification |
19
+ |----------|------------|-------------|--------------|
20
+ | Critical | 1 hour | 24 hours | Same day |
21
+ | High | 4 hours | 7 days | Within 48 hours |
22
+ | Medium | 24 hours | 30 days | Within 1 week |
23
+ | Low | 1 week | 90 days | Next release |
24
+
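As a rough illustration of how the two tables above fit together, the following TypeScript sketch maps a CVSS base score to a severity band and looks up the matching response windows. The names `classifyCvss`, `responseWindowFor`, and `RESPONSE_WINDOWS` are hypothetical and not part of the package; treating a score of 0.0 as "None" is an assumption, since the table starts at 0.1.

```typescript
// Severity bands from the "CVSS Score Interpretation" table.
type Severity = "Critical" | "High" | "Medium" | "Low" | "None";

interface ResponseWindow {
  assessment: string;
  remediation: string;
  verification: string;
}

// Deadlines copied from the "Response Time Requirements" table.
const RESPONSE_WINDOWS: Record<Exclude<Severity, "None">, ResponseWindow> = {
  Critical: { assessment: "1 hour", remediation: "24 hours", verification: "Same day" },
  High: { assessment: "4 hours", remediation: "7 days", verification: "Within 48 hours" },
  Medium: { assessment: "24 hours", remediation: "30 days", verification: "Within 1 week" },
  Low: { assessment: "1 week", remediation: "90 days", verification: "Next release" },
};

function classifyCvss(score: number): Severity {
  if (!Number.isFinite(score) || score < 0 || score > 10) {
    throw new RangeError(`CVSS score must be between 0.0 and 10.0, got ${score}`);
  }
  if (score >= 9.0) return "Critical";
  if (score >= 7.0) return "High";
  if (score >= 4.0) return "Medium";
  if (score > 0) return "Low";
  return "None"; // Assumption: 0.0 falls outside the table and needs no response
}

// Returns the response windows for a score, or undefined when no response is needed.
function responseWindowFor(score: number): ResponseWindow | undefined {
  const severity = classifyCvss(score);
  return severity === "None" ? undefined : RESPONSE_WINDOWS[severity];
}
```

For example, `responseWindowFor(7.5)` yields the High row, so remediation would be due within 7 days.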
25
+ ## Vulnerability Assessment Checklist
26
+
27
+ ### Step 1: Confirm Affected
28
+
29
+ ```bash
30
+ # Check if package is in your dependencies
31
+ npm ls <package-name>
32
+ # or
33
+ yarn why <package-name>
34
+ # or
35
+ pnpm why <package-name>
36
+ ```
37
+
38
+ Questions to answer:
39
+ - [ ] Is the vulnerable version in our lock file?
40
+ - [ ] Is it a direct or transitive dependency?
41
+ - [ ] Is the vulnerable code path actually used in our application?
42
+
43
+ ### Step 2: Analyze Attack Vector
44
+
45
+ Review the CVE details:
46
+
47
+ | Vector | Questions |
48
+ |--------|-----------|
49
+ | **Network** | Is the package exposed to network input? |
50
+ | **Local** | Can attackers access local filesystem? |
51
+ | **User Input** | Does user-controlled data reach the vulnerable code? |
52
+ | **Authentication** | Is the vulnerable path behind auth? |
53
+
54
+ ### Step 3: Determine If Exploitable
55
+
56
+ Common scenarios where you may NOT be affected:
57
+ - Package is dev-only (`devDependencies`) and not in production build
58
+ - Vulnerable function is never called in your code
59
+ - Input to vulnerable function is always sanitized
60
+ - Network path to vulnerable code doesn't exist
61
+
62
+ **Document your reasoning** - "Not affected because X" is a valid conclusion, but it must be backed by evidence.
63
+
64
+ ## Patch Availability Assessment
65
+
66
+ ### Patch Available
67
+
68
+ 1. Check package changelog/releases
69
+ 2. Verify fix is in latest version
70
+ 3. Check compatibility with your version constraints
71
+ 4. Plan upgrade using main workflow
72
+
73
+ ### No Patch Available
74
+
75
+ If no official patch exists:
76
+
77
+ 1. **Check for pre-release/beta with fix**
78
+ ```bash
79
+ npm view <package> versions --json | tail -20
80
+ ```
81
+
82
+ 2. **Check for fork with patch**
83
+ - Search GitHub issues for the CVE
84
+ - Look for community forks
85
+
86
+ 3. **Evaluate alternatives**
87
+ - Can you replace the package?
88
+ - What's the migration effort?
89
+
90
+ 4. **Implement temporary mitigation** (see below)
91
+
92
+ ## Temporary Mitigations
93
+
94
+ When an immediate patch isn't available:
95
+
96
+ ### Input Validation
97
+
98
+ Add validation before calling the vulnerable function:
99
+ ```typescript
100
+ // Before using vulnerable library function
101
+ function sanitizeInput(input: unknown): SafeInput {
102
+ // Validate and sanitize based on CVE details
103
+ // Return safe version or throw
104
+ }
105
+ ```
106
+
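The skeleton above leaves the actual checks open because they depend on the advisory. As one possible way to fill it in, the sketch below assumes a hypothetical CVE that is triggered by oversized or unusual string input; `MAX_INPUT_LENGTH`, `ALLOWED_PATTERN`, and the branded `SafeInput` type are illustrative choices, not values taken from any real advisory.

```typescript
// Hypothetical limits; derive real ones from the advisory's description.
const MAX_INPUT_LENGTH = 256;
const ALLOWED_PATTERN = /^[A-Za-z0-9 _.-]*$/;

// Branded type so only validated strings can be passed to the vulnerable call.
type SafeInput = string & { readonly __brand: "SafeInput" };

function sanitizeInput(input: unknown): SafeInput {
  if (typeof input !== "string") {
    throw new TypeError("input must be a string");
  }
  // Check length first so the regex never sees an oversized payload.
  if (input.length > MAX_INPUT_LENGTH) {
    throw new RangeError(`input exceeds ${MAX_INPUT_LENGTH} characters`);
  }
  // Allowlist of characters; anything else is rejected rather than rewritten.
  if (!ALLOWED_PATTERN.test(input)) {
    throw new RangeError("input contains characters outside the allowlist");
  }
  return input as SafeInput;
}
```

Rejecting instead of rewriting keeps the behavior easy to reason about: the vulnerable function either receives input that passed every check or is never called.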
107
+ ### Feature Disable
108
+
109
+ Disable the vulnerable feature if it is non-critical:
110
+ ```typescript
111
+ // Feature flag to disable vulnerable code path
112
+ if (config.FEATURE_X_ENABLED && !config.CVE_MITIGATION_MODE) {
113
+ // Use vulnerable feature
114
+ }
115
+ ```
116
+
117
+ ### Network Isolation
118
+
119
+ If network-exploitable:
120
+ - Add WAF rules
121
+ - Rate limiting
122
+ - IP allowlisting
123
+
124
+ ### Dependency Pinning
125
+
126
+ Lock to the last known safe version (`resolutions` is read by Yarn; npm 8.3+ uses the `overrides` field instead):
127
+ ```json
128
+ {
129
+   "resolutions": {
130
+     "vulnerable-package": "1.2.3"
131
+   }
132
+ }
133
+ ```
134
+
135
+ ## Documentation Template
136
+
137
+ When documenting a CVE response:
138
+
139
+ ```markdown
140
+ ## CVE-YYYY-XXXXX Response
141
+
142
+ **Package:** package-name@version
143
+ **Severity:** Critical/High/Medium/Low (CVSS: X.X)
144
+ **Discovered:** YYYY-MM-DD
145
+ **Resolved:** YYYY-MM-DD
146
+
147
+ ### Assessment
148
+ - [ ] Confirmed affected: Yes/No
149
+ - [ ] Attack vector applicable: Yes/No
150
+ - [ ] Evidence: [link or explanation]
151
+
152
+ ### Resolution
153
+ - [ ] Patch applied: version X.Y.Z
154
+ - [ ] Mitigation applied: [description]
155
+ - [ ] Verified: [test results]
156
+
157
+ ### Timeline
158
+ - HH:MM - CVE detected
159
+ - HH:MM - Assessment complete
160
+ - HH:MM - Patch/mitigation applied
161
+ - HH:MM - Verification complete
162
+ ```
163
+
164
+ ## Automation Setup
165
+
166
+ ### Continuous Monitoring
167
+
168
+ ```yaml
169
+ # GitHub Actions example
170
+ name: Security Audit
171
+ on:
172
+   schedule:
173
+     - cron: '0 6 * * *' # Daily at 06:00 UTC
174
+   push:
175
+     paths:
176
+       - '**/package-lock.json'
177
+       - '**/yarn.lock'
178
+       - '**/pnpm-lock.yaml'
179
+
180
+ jobs:
181
+   audit:
182
+     runs-on: ubuntu-latest
183
+     steps:
184
+       - uses: actions/checkout@v4
185
+       - run: npm audit --audit-level=high
186
+ ```
187
+
188
+ ### Alert Thresholds
189
+
190
+ Configure alerts based on severity:
191
+
192
+ | Severity | Alert Channel | Escalation |
193
+ |----------|--------------|------------|
194
+ | Critical | PagerDuty + Slack | Immediate |
195
+ | High | Slack #security | 4 hours |
196
+ | Medium | Daily digest | Weekly review |
197
+ | Low | Monthly report | Quarterly |
198
+
199
+ ## Common Vulnerability Types
200
+
201
+ ### Prototype Pollution
202
+
203
+ **Symptoms:** Objects gain unexpected properties
204
+ **Check:** Does library merge user input into objects?
205
+ **Mitigation:** Freeze prototypes, validate input shape
206
+
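To make "validate input shape" more concrete, here is a minimal sketch of a recursive merge that refuses the keys commonly used to reach `Object.prototype`. `safeMerge`, `isPlainObject`, and `FORBIDDEN_KEYS` are hypothetical helpers written for this illustration; they do not come from any particular library.

```typescript
// Keys that can reach Object.prototype when copied blindly.
const FORBIDDEN_KEYS = new Set(["__proto__", "constructor", "prototype"]);

type PlainObject = { [key: string]: unknown };

function isPlainObject(value: unknown): value is PlainObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Recursively merges `source` into `target`, skipping forbidden keys.
function safeMerge(target: PlainObject, source: PlainObject): PlainObject {
  for (const key of Object.keys(source)) {
    if (FORBIDDEN_KEYS.has(key)) continue;
    const incoming = source[key];
    // Only reuse values the target owns, never ones inherited from a prototype.
    const existing = Object.prototype.hasOwnProperty.call(target, key)
      ? target[key]
      : undefined;
    if (isPlainObject(incoming)) {
      const base = isPlainObject(existing) ? existing : {};
      target[key] = safeMerge(base, incoming);
    } else {
      target[key] = incoming;
    }
  }
  return target;
}
```

Merging a payload such as `JSON.parse('{"__proto__": {"polluted": true}}')` with this function leaves freshly created objects without a `polluted` property. Freezing prototypes, the other mitigation listed above, can be layered on top where the application tolerates it.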
207
+ ### ReDoS (Regular Expression DoS)
208
+
209
+ **Symptoms:** Regex hangs on crafted input
210
+ **Check:** Does library use regex on user input?
211
+ **Mitigation:** Input length limits, timeout
212
+
213
+ ### Path Traversal
214
+
215
+ **Symptoms:** File access outside intended directory
216
+ **Check:** Does library handle file paths from user input?
217
+ **Mitigation:** Validate paths, use allowlists
218
+
219
+ ### Command Injection
220
+
221
+ **Symptoms:** Shell commands executed with user input
222
+ **Check:** Does library spawn processes with user data?
223
+ **Mitigation:** Escape inputs, use parameterized commands
224
+
225
+ ## Resources
226
+
227
+ - [NVD - National Vulnerability Database](https://nvd.nist.gov/)
228
+ - [GitHub Advisory Database](https://github.com/advisories)
229
+ - [Snyk Vulnerability Database](https://snyk.io/vuln/)
230
+ - [npm Security Advisories](https://www.npmjs.com/advisories)
@@ -0,0 +1,373 @@
1
+ ---
2
+ name: incident-response
3
+ description: Use when production incident occurs, alerts fire, service degradation detected, or on-call escalation needed - guides systematic organizational response before technical fixes
4
+ ---
5
+
6
+ # Incident Response
7
+
8
+ ## Overview
9
+
10
+ Incidents demand disciplined response, not heroic fixes. Rushed patches mask problems and create new ones.
11
+
12
+ **Core principle:** ALWAYS triage, communicate, and contain before attempting fixes. Technical investigation follows organizational response.
13
+
14
+ **Violating the letter of this process is violating the spirit of incident response.**
15
+
16
+ ## First 5 Minutes Checklist
17
+
18
+ **Use this when alerts fire. Don't think, follow the list.**
19
+
20
+ | Minute | Action | Output |
21
+ |--------|--------|--------|
22
+ | 0-1 | Acknowledge alert | `incident_acknowledged` |
23
+ | 1-2 | Screenshot dashboards NOW | Evidence preserved |
24
+ | 2-3 | Assess: >50% users? Critical function? | Severity (P1-P4) |
25
+ | 3-4 | Post in incident channel: severity + impact + trace_id | `stakeholders_notified` |
26
+ | 4-5 | If P1/P2: Open war room, page on-call | Responders engaged |
27
+
28
+ **After 5 minutes:** Proceed to Phase 4 (Mitigate) - you've completed Detect, Triage, Communicate.
29
+
30
+ ## The Iron Law
31
+
32
+ ```
33
+ NO FIXES WITHOUT TRIAGE AND COMMUNICATION FIRST
34
+ ```
35
+
36
+ If you haven't classified severity and notified stakeholders, you cannot propose fixes.
37
+
38
+ ## When to Use
39
+
40
+ Use for ANY production incident:
41
+ - Alerts firing (P1-P4)
42
+ - Customer reports of issues
43
+ - Monitoring anomalies
44
+ - Service degradation
45
+ - Security incidents
46
+ - Data integrity concerns
47
+
48
+ **Use this ESPECIALLY when:**
49
+ - Under time pressure (emergencies make shortcuts tempting)
50
+ - Multiple systems failing (panic causes scattered responses)
51
+ - "Just fix it" pressure from stakeholders
52
+ - You don't fully understand the scope yet
53
+ - Previous fix attempt didn't work
54
+
55
+ **Don't skip when:**
56
+ - Issue seems simple (simple incidents have complex causes too)
57
+ - You're in a hurry (systematic response is faster than thrashing)
58
+ - Management wants it fixed NOW (protocol prevents extended outages)
59
+
60
+ ## When NOT to Use
61
+
62
+ This skill is for **organizational incident response**. Don't use it for:
63
+
64
+ - **Local development issues** - Your build failing isn't an incident
65
+ - **Non-production environments** - Staging bugs use normal debugging
66
+ - **Planned maintenance** - Scheduled changes have different protocols
67
+ - **Feature bugs with no urgency** - Normal backlog items, not incidents
68
+ - **Performance optimization** - Use systematic-debugging for investigation
69
+ - **One-off user reports** - Investigate first; escalate to incident if widespread
70
+
71
+ **Key distinction:** This skill manages the organizational response (communication, escalation, coordination). For technical root cause analysis, hand off to `superpowers:systematic-debugging` at Phase 5.
72
+
73
+ ## The Eight Phases
74
+
75
+ You MUST complete each phase before proceeding to the next.
76
+
77
+ ```dot
78
+ digraph incident_response {
79
+ rankdir=TB;
80
+ node [shape=box];
81
+
82
+ "1. Detect" -> "2. Triage";
83
+ "2. Triage" -> "3. Communicate";
84
+ "3. Communicate" -> "4. Mitigate";
85
+ "4. Mitigate" -> "5. Diagnose";
86
+ "5. Diagnose" -> "6. Fix/Rollback";
87
+ "6. Fix/Rollback" -> "7. Verify";
88
+ "7. Verify" -> "8. Document";
89
+
90
+ "Skip Phase?" [shape=diamond];
91
+ "2. Triage" -> "Skip Phase?" [style=dashed, label="tempted?"];
92
+ "Skip Phase?" -> "STOP" [label="NO"];
93
+ "STOP" [shape=doublecircle, color=red];
94
+ }
95
+ ```
96
+
97
+ ### Phase 1: Detect
98
+
99
+ **Acknowledge the incident:**
100
+
101
+ 1. **Confirm the alert is valid**
102
+ - Not a false positive?
103
+ - Reproducible?
104
+ - Multiple signals confirming?
105
+
106
+ 2. **Capture initial evidence**
107
+ - Screenshot dashboards NOW (before they scroll)
108
+ - Note the exact time
109
+ - Record the alert message verbatim
110
+ - Preserve trace_id if available
111
+
112
+ 3. **Declare incident start**
113
+ - "Incident acknowledged at [TIME]"
114
+ - Assign incident_id if your system has one
115
+
116
+ **Completion criteria:**
117
+ - [ ] `incident_acknowledged` - Alert confirmed and logged
118
+
119
+ ### Phase 2: Triage
120
+
121
+ **Classify severity BEFORE taking action:**
122
+
123
+ See `severity-classification.md` for detailed criteria.
124
+
125
+ 1. **Assess impact scope**
126
+ - How many users affected?
127
+ - Which features/services impacted?
128
+ - What's the blast radius?
129
+
130
+ 2. **Classify severity (P1-P4)**
131
+ - P1: >50% users, critical function down
132
+ - P2: 10-50% users, major feature unavailable
133
+ - P3: <10% users, workaround exists
134
+ - P4: Minimal impact, no urgency
135
+
136
+ 3. **Determine business impact**
137
+ - Revenue at risk?
138
+ - Compliance implications?
139
+ - Reputation damage?
140
+
141
+ 4. **Check SLO status**
142
+ - Which SLO is breaching?
143
+ - What's the error budget burn rate?
144
+ - How fast are we consuming budget?
145
+
146
+ **Completion criteria:**
147
+ - [ ] `severity_classified` - P1/P2/P3/P4 assigned
148
+ - [ ] `impact_assessed` - Users affected and scope documented
149
+
150
+ ### Phase 3: Communicate
151
+
152
+ **Notify stakeholders BEFORE attempting fixes:**
153
+
154
+ See `communication-templates.md` for notification formats.
155
+
156
+ 1. **Internal notification**
157
+ - Alert the on-call team
158
+ - Notify relevant stakeholders
159
+ - Open incident channel/war room (P1/P2)
160
+
161
+ 2. **External communication (if applicable)**
162
+ - Update status page
163
+ - Prepare customer communication
164
+ - Follow regulatory requirements
165
+
166
+ 3. **Set communication cadence**
167
+ - P1: Updates every 15 minutes
168
+ - P2: Updates every 30 minutes
169
+ - P3/P4: Updates at milestones
170
+
171
+ 4. **Include trace_id in all communications**
172
+ - Enables instant context jumping
173
+ - Links logs, traces, and metrics
174
+
175
+ **Completion criteria:**
176
+ - [ ] `stakeholders_notified` - Team/leadership informed
177
+ - [ ] `communication_cadence_set` - Update schedule established
178
+
179
+ ### Phase 4: Mitigate
180
+
181
+ **Contain the blast radius:**
182
+
183
+ 1. **Evaluate rollback FIRST**
184
+ - Was there a recent deployment?
185
+ - Can we rollback safely?
186
+ - Rollback is often faster than forward-fix
187
+
188
+ 2. **If rollback not viable:**
189
+ - Can we disable the failing feature?
190
+ - Can we route traffic away?
191
+ - Can we scale resources?
192
+
193
+ 3. **Preserve evidence before action**
194
+ - Export logs before rotation
195
+ - Capture heap/thread dumps
196
+ - Screenshot metrics dashboards
197
+
198
+ 4. **Implement containment**
199
+ - Apply the chosen mitigation
200
+ - Verify it's working
201
+ - Communicate the mitigation status
202
+
203
+ **Completion criteria:**
204
+ - [ ] `rollback_evaluated` - Rollback option assessed
205
+ - [ ] `containment_verified` - Blast radius limited
206
+ - [ ] `evidence_preserved` - Logs/traces captured before action
207
+
208
+ ### Phase 5: Diagnose
209
+
210
+ **Now investigate the root cause:**
211
+
212
+ **HANDOFF:** Use `superpowers:systematic-debugging` for technical investigation.
213
+
214
+ 1. **Form hypothesis**
215
+ - What changed recently?
216
+ - What do the logs/traces show?
217
+ - Which component is failing?
218
+
219
+ 2. **Test hypothesis**
220
+ - One variable at a time
221
+ - Gather evidence, don't guess
222
+ - Follow the systematic debugging phases
223
+
224
+ 3. **If 3+ hypotheses failed:**
225
+ - STOP and question architecture
226
+ - Escalate for fresh perspective
227
+ - Don't keep guessing
228
+
229
+ **Completion criteria:**
230
+ - [ ] `root_cause_identified` - Cause confirmed with evidence
231
+ - [ ] OR `escalated_for_help` - Fresh eyes requested after 3+ failed hypotheses
232
+
233
+ ### Phase 6: Fix or Rollback
234
+
235
+ **Apply the resolution:**
236
+
237
+ 1. **If rollback is the answer:**
238
+ - Execute rollback procedure
239
+ - Verify rollback completed
240
+ - Confirm service restored
241
+
242
+ 2. **If forward-fix required:**
243
+ - Implement minimal fix
244
+ - Get peer review (even quick)
245
+ - Stage the fix properly
246
+
247
+ 3. **Never skip verification**
248
+ - Test the fix works
249
+ - Confirm no side effects
250
+ - Don't just "hope it worked"
251
+
252
+ **Completion criteria:**
253
+ - [ ] `fix_deployed` - Forward fix applied and reviewed
254
+ - [ ] OR `rollback_completed` - Previous state restored
255
+
256
+ ### Phase 7: Verify
257
+
258
+ **Confirm the incident is resolved:**
259
+
260
+ **HANDOFF:** Use `superpowers:verification-before-completion` for thorough verification.
261
+
262
+ 1. **Service health check**
263
+ - Error rates back to normal?
264
+ - Latency within SLO?
265
+ - No new alerts?
266
+
267
+ 2. **User impact verification**
268
+ - Can users complete flows?
269
+ - Any lingering issues?
270
+ - Customer reports resolved?
271
+
272
+ 3. **Monitoring confirmation**
273
+ - Dashboards showing green?
274
+ - SLO back in budget?
275
+ - No anomalies in metrics?
276
+
277
+ **Completion criteria:**
278
+ - [ ] `service_restored` - Users can complete flows normally
279
+ - [ ] `slo_compliant` - Error rates and latency within SLO
280
+
281
+ ### Phase 8: Document
282
+
283
+ **The incident is NOT resolved until documentation exists:**
284
+
285
+ See `postmortem-template.md` for the blameless RCA framework.
286
+
287
+ 1. **Schedule postmortem**
288
+ - Within 48 hours for P1/P2
289
+ - Within 1 week for P3/P4
290
+ - Include all responders
291
+
292
+ 2. **Capture timeline**
293
+ - What happened, when, by whom
294
+ - Use trace_id to reconstruct
295
+ - Include communication history
296
+
297
+ 3. **Identify action items**
298
+ - Preventive measures
299
+ - Detection improvements
300
+ - Process enhancements
301
+
302
+ 4. **Close the incident**
303
+ - All stakeholders notified
304
+ - Status page updated
305
+ - Incident marked resolved
306
+
307
+ **Completion criteria:**
308
+ - [ ] `postmortem_scheduled` - Meeting booked within required window
309
+ - [ ] `incident_closed` - Stakeholders notified, status page updated
310
+
311
+ ## Red Flags - STOP and Follow Process
312
+
313
+ If you catch yourself thinking:
314
+ - "Quick fix for now, investigate later"
315
+ - "Just try this and see if it works"
316
+ - "No time for communication, focus on fixing"
317
+ - "Skip the triage, obviously it's P1"
318
+ - "Rollback is too risky, let's push forward"
319
+ - "We can document later, let's close this"
320
+ - "I know what the problem is, trust me"
321
+ - "The user wants speed, not process"
322
+
323
+ **ALL of these mean: STOP. Return to the current phase.**
324
+
325
+ ## Common Rationalizations
326
+
327
+ | Excuse | Reality |
328
+ |--------|---------|
329
+ | "This is obviously a P1" | Every incident feels urgent. Triage takes 2 minutes, wrong classification causes chaos. |
330
+ | "Fix first, communicate later" | Silent outages create panic. 30-second update prevents 30-minute escalation. |
331
+ | "Rollback is too risky" | Known state > uncertain forward fix. Rollback buys time to fix properly. |
332
+ | "We can investigate after fixing" | Evidence disappears after restart. Capture now, analyze later. |
333
+ | "It's a one-off, skip the postmortem" | All incidents feel unique. Patterns emerge only from documentation. |
334
+ | "I know this system, trust me" | Expert intuition fails under pressure. Protocol beats intuition. |
335
+ | "Communication slows us down" | Coordinated response > heroic solo effort. Communication accelerates resolution. |
336
+ | "Previous fix didn't work, try another" | 3+ fixes = wrong diagnosis. Stop guessing, investigate properly. |
337
+
338
+ ## Quick Reference
339
+
340
+ | Phase | Key Activities | Verification Key |
341
+ |-------|---------------|------------------|
342
+ | **1. Detect** | Acknowledge, capture evidence, record time, **capture trace_id** | `incident_acknowledged` |
343
+ | **2. Triage** | Assess impact, classify P1-P4, check SLO | `severity_classified` |
344
+ | **3. Communicate** | Notify team, update status, set cadence, **include trace_id** | `stakeholders_notified` |
345
+ | **4. Mitigate** | Evaluate rollback, contain blast radius | `containment_verified` |
346
+ | **5. Diagnose** | Form hypothesis, test systematically | `root_cause_identified` |
347
+ | **6. Fix/Rollback** | Apply resolution, get review | `fix_deployed` |
348
+ | **7. Verify** | Confirm resolution, check SLO | `service_restored` |
349
+ | **8. Document** | Schedule postmortem, capture timeline, **correlate via trace_id** | `postmortem_scheduled` |
350
+
351
+ **🔑 trace_id:** Include in ALL internal communications - enables instant context jumping across logs, traces, and metrics.
352
+
353
+ ## Supporting Files
354
+
355
+ These files provide detailed guidance for specific phases:
356
+
357
+ - **`severity-classification.md`** - P1-P4 criteria with SLO burn rate mapping
358
+ - **`communication-templates.md`** - Notification templates for stakeholders
359
+ - **`postmortem-template.md`** - Blameless RCA framework and action items
360
+ - **`escalation-matrix.md`** - Role-based escalation paths per severity
361
+
362
+ ## Related Skills
363
+
364
+ - **superpowers:systematic-debugging** - For Phase 5 (Diagnose) root cause analysis
365
+ - **superpowers:verification-before-completion** - For Phase 7 (Verify) confirmation
366
+
367
+ ## Real-World Impact
368
+
369
+ From incident response data:
370
+ - Systematic response: MTTR 30-60 minutes
371
+ - Ad-hoc heroics: MTTR 2-4 hours
372
+ - Communication during incident: 70% fewer escalations
373
+ - Postmortem completion: 50% fewer repeat incidents