codingbuddy-rules 2.4.2 → 3.0.2
- package/.ai-rules/CHANGELOG.md +122 -0
- package/.ai-rules/agents/README.md +527 -11
- package/.ai-rules/agents/accessibility-specialist.json +0 -1
- package/.ai-rules/agents/act-mode.json +0 -1
- package/.ai-rules/agents/agent-architect.json +0 -1
- package/.ai-rules/agents/ai-ml-engineer.json +0 -1
- package/.ai-rules/agents/architecture-specialist.json +14 -2
- package/.ai-rules/agents/backend-developer.json +14 -2
- package/.ai-rules/agents/code-quality-specialist.json +0 -1
- package/.ai-rules/agents/data-engineer.json +0 -1
- package/.ai-rules/agents/devops-engineer.json +24 -2
- package/.ai-rules/agents/documentation-specialist.json +0 -1
- package/.ai-rules/agents/eval-mode.json +0 -1
- package/.ai-rules/agents/event-architecture-specialist.json +719 -0
- package/.ai-rules/agents/frontend-developer.json +14 -2
- package/.ai-rules/agents/i18n-specialist.json +0 -1
- package/.ai-rules/agents/integration-specialist.json +11 -1
- package/.ai-rules/agents/migration-specialist.json +676 -0
- package/.ai-rules/agents/mobile-developer.json +0 -1
- package/.ai-rules/agents/observability-specialist.json +747 -0
- package/.ai-rules/agents/performance-specialist.json +24 -2
- package/.ai-rules/agents/plan-mode.json +0 -1
- package/.ai-rules/agents/platform-engineer.json +0 -1
- package/.ai-rules/agents/security-specialist.json +27 -16
- package/.ai-rules/agents/seo-specialist.json +0 -1
- package/.ai-rules/agents/solution-architect.json +0 -1
- package/.ai-rules/agents/technical-planner.json +0 -1
- package/.ai-rules/agents/test-strategy-specialist.json +14 -2
- package/.ai-rules/agents/ui-ux-designer.json +0 -1
- package/.ai-rules/rules/core.md +25 -0
- package/.ai-rules/skills/README.md +35 -0
- package/.ai-rules/skills/database-migration/SKILL.md +531 -0
- package/.ai-rules/skills/database-migration/expand-contract-patterns.md +314 -0
- package/.ai-rules/skills/database-migration/large-scale-migration.md +414 -0
- package/.ai-rules/skills/database-migration/rollback-strategies.md +359 -0
- package/.ai-rules/skills/database-migration/validation-procedures.md +428 -0
- package/.ai-rules/skills/dependency-management/SKILL.md +381 -0
- package/.ai-rules/skills/dependency-management/license-compliance.md +282 -0
- package/.ai-rules/skills/dependency-management/lock-file-management.md +437 -0
- package/.ai-rules/skills/dependency-management/major-upgrade-guide.md +292 -0
- package/.ai-rules/skills/dependency-management/security-vulnerability-response.md +230 -0
- package/.ai-rules/skills/incident-response/SKILL.md +373 -0
- package/.ai-rules/skills/incident-response/communication-templates.md +322 -0
- package/.ai-rules/skills/incident-response/escalation-matrix.md +347 -0
- package/.ai-rules/skills/incident-response/postmortem-template.md +351 -0
- package/.ai-rules/skills/incident-response/severity-classification.md +256 -0
- package/.ai-rules/skills/performance-optimization/CREATION-LOG.md +87 -0
- package/.ai-rules/skills/performance-optimization/SKILL.md +76 -0
- package/.ai-rules/skills/performance-optimization/documentation-template.md +70 -0
- package/.ai-rules/skills/pr-review/SKILL.md +768 -0
- package/.ai-rules/skills/refactoring/SKILL.md +192 -0
- package/.ai-rules/skills/refactoring/refactoring-catalog.md +1377 -0
- package/package.json +1 -1
|
@@ -0,0 +1,230 @@
# Security Vulnerability Response

Detailed procedures for responding to CVEs and security advisories.

## CVE Severity Classification

### CVSS Score Interpretation

| Score Range | Severity | Description |
|-------------|----------|-------------|
| 9.0 - 10.0 | **Critical** | Exploitable remotely, no authentication required, complete system compromise |
| 7.0 - 8.9 | **High** | Significant impact, may require some conditions |
| 4.0 - 6.9 | **Medium** | Limited impact or requires significant preconditions |
| 0.1 - 3.9 | **Low** | Minimal impact, difficult to exploit |
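
The score ranges above can be encoded as a small lookup. This is a minimal sketch, not part of the package: `severityFromCvss` is a hypothetical helper name, and the `None` label for a score of 0.0 is an assumption added for completeness, since the table starts at 0.1.

```typescript
// Severity labels from the table above. "None" covers a score of 0.0,
// which the table does not list (assumption added for completeness).
type Severity = "None" | "Low" | "Medium" | "High" | "Critical";

// Hypothetical helper: maps a CVSS base score to a severity label.
function severityFromCvss(score: number): Severity {
  if (!Number.isFinite(score) || score < 0 || score > 10) {
    throw new RangeError(`CVSS score must be between 0.0 and 10.0, got ${score}`);
  }
  if (score >= 9.0) return "Critical";
  if (score >= 7.0) return "High";
  if (score >= 4.0) return "Medium";
  if (score > 0) return "Low";
  return "None";
}
```

Checking the upper bands first keeps each boundary (9.0, 7.0, 4.0) in exactly one place, which makes the ranges easy to compare against the table.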

### Response Time Requirements

| Severity | Assessment | Remediation | Verification |
|----------|------------|-------------|--------------|
| Critical | 1 hour | 24 hours | Same day |
| High | 4 hours | 7 days | Within 48 hours |
| Medium | 24 hours | 30 days | Within 1 week |
| Low | 1 week | 90 days | Next release |
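
To turn the assessment and remediation windows above into concrete due dates, something like the following could be used. It is a sketch under assumptions: the names `RESPONSE_WINDOWS_MS` and `responseDeadlines` are hypothetical, the clock is assumed to start when the advisory is detected, and the verification column is left out because "Same day" and "Next release" are not fixed durations.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Assessment and remediation windows copied from the table above.
const RESPONSE_WINDOWS_MS: Record<Severity, { assessment: number; remediation: number }> = {
  Critical: { assessment: 1 * HOUR_MS, remediation: 24 * HOUR_MS },
  High: { assessment: 4 * HOUR_MS, remediation: 7 * DAY_MS },
  Medium: { assessment: 24 * HOUR_MS, remediation: 30 * DAY_MS },
  Low: { assessment: 7 * DAY_MS, remediation: 90 * DAY_MS },
};

// Hypothetical helper: computes due dates, counting from detection time (assumption).
function responseDeadlines(severity: Severity, detectedAt: Date) {
  const windows = RESPONSE_WINDOWS_MS[severity];
  return {
    assessmentDue: new Date(detectedAt.getTime() + windows.assessment),
    remediationDue: new Date(detectedAt.getTime() + windows.remediation),
  };
}
```

For example, a High-severity advisory detected at `2024-01-01T00:00:00Z` would have its assessment due four hours later and its remediation due on `2024-01-08`.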

## Vulnerability Assessment Checklist

### Step 1: Confirm Affected

```bash
# Check if package is in your dependencies
npm ls <package-name>
# or
yarn why <package-name>
# or
pnpm why <package-name>
```

Questions to answer:
- [ ] Is the vulnerable version in our lock file?
- [ ] Is it a direct or transitive dependency?
- [ ] Is the vulnerable code path actually used in our application?
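
The first two questions can also be answered programmatically from the dependency tree. The sketch below assumes the tree was obtained with `npm ls --all --json` and that each node carries a `version` and a nested `dependencies` object; only that minimal shape is relied on. `DepNode`, `Occurrence`, and `findOccurrences` are hypothetical names introduced for illustration.

```typescript
// Minimal shape of the `npm ls --json` tree that this sketch relies on (assumption).
interface DepNode {
  version?: string;
  dependencies?: Record<string, DepNode>;
}

interface Occurrence {
  path: string[];   // chain of package names from the root to the match
  version: string;  // installed version at that position
  direct: boolean;  // true when the package sits at the top level of the tree
}

// Walks the tree depth-first and collects every place the target package appears.
function findOccurrences(tree: DepNode, target: string, trail: string[] = []): Occurrence[] {
  const found: Occurrence[] = [];
  for (const [name, node] of Object.entries(tree.dependencies ?? {})) {
    const path = [...trail, name];
    if (name === target && node.version !== undefined) {
      found.push({ path, version: node.version, direct: trail.length === 0 });
    }
    found.push(...findOccurrences(node, target, path));
  }
  return found;
}
```

Each returned path shows which parent pulls the package in, which helps decide whether to upgrade the package itself or the dependency that brings it along.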

### Step 2: Analyze Attack Vector

Review the CVE details:

| Vector | Questions |
|--------|-----------|
| **Network** | Is the package exposed to network input? |
| **Local** | Can attackers access local filesystem? |
| **User Input** | Does user-controlled data reach the vulnerable code? |
| **Authentication** | Is the vulnerable path behind auth? |

### Step 3: Determine If Exploitable

Common scenarios where you may NOT be affected:
- Package is dev-only (`devDependencies`) and not in production build
- Vulnerable function is never called in your code
- Input to vulnerable function is always sanitized
- Network path to vulnerable code doesn't exist

**Document your reasoning** - "Not affected because X" is a valid conclusion, but must be evidenced.
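
One way to enforce the evidence requirement is to make "not affected" unreachable without a written justification. The following is a simplified sketch: `ExposureAnswers`, `Verdict`, and `assessExploitability` are hypothetical names, and treating any single scenario from the list above as sufficient for a "not affected" verdict is an assumption that a real assessment may need to refine.

```typescript
// Answers to the Step 2 and Step 3 questions (field names are illustrative).
interface ExposureAnswers {
  inProductionBuild: boolean;        // false for dev-only packages
  vulnerableFunctionCalled: boolean; // is the vulnerable code path used at all?
  untrustedInputReachesIt: boolean;  // false when input is always sanitized
  networkPathExists: boolean;        // can an attacker reach the code?
  evidence: string;                  // link or explanation backing the answers
}

type Verdict = "affected" | "not-affected" | "needs-evidence";

function assessExploitability(answers: ExposureAnswers): Verdict {
  const exploitable =
    answers.inProductionBuild &&
    answers.vulnerableFunctionCalled &&
    answers.untrustedInputReachesIt &&
    answers.networkPathExists;
  if (exploitable) return "affected";
  // "Not affected because X" is only accepted when X is written down.
  return answers.evidence.trim().length > 0 ? "not-affected" : "needs-evidence";
}
```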

## Patch Availability Assessment

### Patch Available

1. Check package changelog/releases
2. Verify fix is in latest version
3. Check compatibility with your version constraints
4. Plan upgrade using main workflow

### No Patch Available

If no official patch exists:

1. **Check for pre-release/beta with fix**
   ```bash
   npm view <package> versions --json | tail -20
   ```

2. **Check for fork with patch**
   - Search GitHub issues for the CVE
   - Look for community forks

3. **Evaluate alternatives**
   - Can you replace the package?
   - What's the migration effort?

4. **Implement temporary mitigation** (see below)
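
For the pre-release check in item 1, the version list can be filtered rather than read by eye. This sketch assumes the `npm view <package> versions --json` output has already been parsed into an array of version strings; `preReleaseVersions` is a hypothetical helper, and the pattern only recognizes the SemVer `MAJOR.MINOR.PATCH-tag` form.

```typescript
// Returns the versions that carry a SemVer pre-release tag (the part after "-"),
// such as "2.0.0-beta.1" or "3.0.0-rc.0". Build metadata ("+...") alone does not count.
function preReleaseVersions(versions: string[]): string[] {
  const preReleasePattern = /^\d+\.\d+\.\d+-[0-9A-Za-z.-]+/;
  return versions.filter((version) => preReleasePattern.test(version));
}

// Example input shaped like a parsed version list (made-up values).
const candidates = preReleaseVersions(["1.2.3", "2.0.0-beta.1", "2.0.0", "3.0.0-rc.0"]);
console.log(candidates);
```

Whether a given pre-release actually contains the fix still has to be confirmed from the package's changelog or commit history.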

## Temporary Mitigations

When an immediate patch isn't available:

### Input Validation

Add validation before the vulnerable function is called:
```typescript
// Before using vulnerable library function
type SafeInput = string; // placeholder type; narrow it per the CVE details

function sanitizeInput(input: unknown): SafeInput {
  // Validate and sanitize based on CVE details (the checks below are illustrative)
  if (typeof input !== 'string' || input.length > 1024) {
    throw new Error('Input rejected by CVE mitigation');
  }
  return input; // Return safe version or throw
}
```

### Feature Disable

Disable the vulnerable feature if it is non-critical:
```typescript
// Feature flag to disable vulnerable code path
if (config.FEATURE_X_ENABLED && !config.CVE_MITIGATION_MODE) {
  // Use vulnerable feature
}
```

### Network Isolation

If network-exploitable:
- Add WAF rules
- Rate limiting
- IP allowlisting

### Dependency Pinning

Lock to the last known safe version. The `resolutions` field shown here is read by Yarn; npm uses `overrides` and pnpm uses `pnpm.overrides` for the same purpose:
```json
{
  "resolutions": {
    "vulnerable-package": "1.2.3"
  }
}
```

## Documentation Template

When documenting CVE response:

```markdown
## CVE-YYYY-XXXXX Response

**Package:** package-name@version
**Severity:** Critical/High/Medium/Low (CVSS: X.X)
**Discovered:** YYYY-MM-DD
**Resolved:** YYYY-MM-DD

### Assessment
- [ ] Confirmed affected: Yes/No
- [ ] Attack vector applicable: Yes/No
- [ ] Evidence: [link or explanation]

### Resolution
- [ ] Patch applied: version X.Y.Z
- [ ] Mitigation applied: [description]
- [ ] Verified: [test results]

### Timeline
- HH:MM - CVE detected
- HH:MM - Assessment complete
- HH:MM - Patch/mitigation applied
- HH:MM - Verification complete
```

## Automation Setup

### Continuous Monitoring

```yaml
# GitHub Actions example
name: Security Audit
on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 06:00 UTC
  push:
    paths:
      - '**/package-lock.json'
      - '**/yarn.lock'
      - '**/pnpm-lock.yaml'

jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high
```

### Alert Thresholds

Configure alerts based on severity:

| Severity | Alert Channel | Escalation |
|----------|--------------|------------|
| Critical | PagerDuty + Slack | Immediate |
| High | Slack #security | 4 hours |
| Medium | Daily digest | Weekly review |
| Low | Monthly report | Quarterly |
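
The routing in this table could be wired to audit results roughly as follows. Note that `npm audit` labels its middle tier `moderate`, whereas this document says Medium, so the sketch maps one onto the other. `ALERT_ROUTES`, `severityFromNpmAudit`, and `routeAlert` are hypothetical names, and actually delivering the alert (PagerDuty, Slack) is out of scope here.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

interface AlertRoute {
  channel: string;
  escalation: string;
}

// Channels and escalation windows copied from the table above.
const ALERT_ROUTES: Record<Severity, AlertRoute> = {
  Critical: { channel: "PagerDuty + Slack", escalation: "Immediate" },
  High: { channel: "Slack #security", escalation: "4 hours" },
  Medium: { channel: "Daily digest", escalation: "Weekly review" },
  Low: { channel: "Monthly report", escalation: "Quarterly" },
};

// Maps an npm audit severity label onto this document's severity names.
function severityFromNpmAudit(level: string): Severity | undefined {
  switch (level.toLowerCase()) {
    case "critical": return "Critical";
    case "high": return "High";
    case "moderate": return "Medium";
    case "low": return "Low";
    default: return undefined; // e.g. "info" has no route in the table
  }
}

function routeAlert(level: string): AlertRoute | undefined {
  const severity = severityFromNpmAudit(level);
  return severity ? ALERT_ROUTES[severity] : undefined;
}
```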

## Common Vulnerability Types

### Prototype Pollution

**Symptoms:** Objects gain unexpected properties
**Check:** Does library merge user input into objects?
**Mitigation:** Freeze prototypes, validate input shape
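
As an illustration of "validate input shape", parsed input can be screened for the keys that prototype pollution usually relies on before it reaches a merge function. This is a sketch under assumptions, not a complete defense: `hasPollutionKeys` and the key list are hypothetical, and the right checks depend on the specific CVE.

```typescript
// Keys commonly abused to reach Object.prototype through a recursive merge.
const FORBIDDEN_KEYS = new Set(["__proto__", "constructor", "prototype"]);

// Recursively checks whether a parsed JSON value contains any forbidden key.
function hasPollutionKeys(value: unknown): boolean {
  if (Array.isArray(value)) {
    return value.some((item) => hasPollutionKeys(item));
  }
  if (value === null || typeof value !== "object") {
    return false;
  }
  return Object.entries(value as Record<string, unknown>).some(
    ([key, child]) => FORBIDDEN_KEYS.has(key) || hasPollutionKeys(child),
  );
}

// Usage: reject the payload before handing it to the vulnerable merge.
const payload: unknown = JSON.parse('{"profile": {"__proto__": {"isAdmin": true}}}');
if (hasPollutionKeys(payload)) {
  console.log("Rejected payload containing a prototype-pollution key");
}
```

Rejecting outright is stricter than stripping the keys, but it avoids guessing what the sender intended.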

### ReDoS (Regular Expression DoS)

**Symptoms:** Regex hangs on crafted input
**Check:** Does library use regex on user input?
**Mitigation:** Input length limits, timeout
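
A length limit is the simpler of the two mitigations to apply in JavaScript, which has no built-in regex timeout. The guard below is a minimal sketch: `safeTest` is a hypothetical helper, and the default limit of 256 characters is an arbitrary assumption that should be tuned to the input being validated.

```typescript
// Hypothetical guard: refuses to run a pattern against oversized input,
// which bounds the worst-case backtracking time for a vulnerable regex.
function safeTest(pattern: RegExp, input: string, maxLength = 256): boolean {
  if (input.length > maxLength) {
    throw new RangeError(`Input of ${input.length} chars exceeds limit of ${maxLength}`);
  }
  return pattern.test(input);
}
```

A true timeout would need the match to run somewhere it can be interrupted, for example in a separate worker, which is beyond this sketch.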

### Path Traversal

**Symptoms:** File access outside intended directory
**Check:** Does library handle file paths from user input?
**Mitigation:** Validate paths, use allowlists

### Command Injection

**Symptoms:** Shell commands executed with user input
**Check:** Does library spawn processes with user data?
**Mitigation:** Escape inputs, use parameterized commands

## Resources

- [NVD - National Vulnerability Database](https://nvd.nist.gov/)
- [GitHub Advisory Database](https://github.com/advisories)
- [Snyk Vulnerability Database](https://snyk.io/vuln/)
- [npm Security Advisories](https://www.npmjs.com/advisories)
@@ -0,0 +1,373 @@
---
name: incident-response
description: Use when production incident occurs, alerts fire, service degradation detected, or on-call escalation needed - guides systematic organizational response before technical fixes
---

# Incident Response

## Overview

Incidents demand disciplined response, not heroic fixes. Rushed patches mask problems and create new ones.

**Core principle:** ALWAYS triage, communicate, and contain before attempting fixes. Technical investigation follows organizational response.

**Violating the letter of this process is violating the spirit of incident response.**

## First 5 Minutes Checklist

**Use this when alerts fire. Don't think, follow the list.**

| Minute | Action | Output |
|--------|--------|--------|
| 0-1 | Acknowledge alert | `incident_acknowledged` |
| 1-2 | Screenshot dashboards NOW | Evidence preserved |
| 2-3 | Assess: >50% users? Critical function? | Severity (P1-P4) |
| 3-4 | Post in incident channel: severity + impact + trace_id | `stakeholders_notified` |
| 4-5 | If P1/P2: Open war room, page on-call | Responders engaged |

**After 5 minutes:** Proceed to Phase 4 (Mitigate) - you've completed Detect, Triage, Communicate.

## The Iron Law

```
NO FIXES WITHOUT TRIAGE AND COMMUNICATION FIRST
```

If you haven't classified severity and notified stakeholders, you cannot propose fixes.

## When to Use

Use for ANY production incident:
- Alerts firing (P1-P4)
- Customer reports of issues
- Monitoring anomalies
- Service degradation
- Security incidents
- Data integrity concerns

**Use this ESPECIALLY when:**
- Under time pressure (emergencies make shortcuts tempting)
- Multiple systems failing (panic causes scattered responses)
- "Just fix it" pressure from stakeholders
- You don't fully understand the scope yet
- Previous fix attempt didn't work

**Don't skip when:**
- Issue seems simple (simple incidents have complex causes too)
- You're in a hurry (systematic response is faster than thrashing)
- Management wants it fixed NOW (protocol prevents extended outages)

## When NOT to Use

This skill is for **organizational incident response**. Don't use it for:

- **Local development issues** - Your build failing isn't an incident
- **Non-production environments** - Staging bugs use normal debugging
- **Planned maintenance** - Scheduled changes have different protocols
- **Feature bugs with no urgency** - Normal backlog items, not incidents
- **Performance optimization** - Use systematic-debugging for investigation
- **One-off user reports** - Investigate first; escalate to incident if widespread

**Key distinction:** This skill manages the organizational response (communication, escalation, coordination). For technical root cause analysis, hand off to `superpowers:systematic-debugging` at Phase 5.

## The Eight Phases

You MUST complete each phase before proceeding to the next.

```dot
digraph incident_response {
    rankdir=TB;
    node [shape=box];

    "1. Detect" -> "2. Triage";
    "2. Triage" -> "3. Communicate";
    "3. Communicate" -> "4. Mitigate";
    "4. Mitigate" -> "5. Diagnose";
    "5. Diagnose" -> "6. Fix/Rollback";
    "6. Fix/Rollback" -> "7. Verify";
    "7. Verify" -> "8. Document";

    "Skip Phase?" [shape=diamond];
    "2. Triage" -> "Skip Phase?" [style=dashed, label="tempted?"];
    "Skip Phase?" -> "STOP" [label="NO"];
    "STOP" [shape=doublecircle, color=red];
}
```

### Phase 1: Detect

**Acknowledge the incident:**

1. **Confirm the alert is valid**
   - Not a false positive?
   - Reproducible?
   - Multiple signals confirming?

2. **Capture initial evidence**
   - Screenshot dashboards NOW (before they scroll)
   - Note the exact time
   - Record the alert message verbatim
   - Preserve trace_id if available

3. **Declare incident start**
   - "Incident acknowledged at [TIME]"
   - Assign incident_id if your system has one

**Completion criteria:**
- [ ] `incident_acknowledged` - Alert confirmed and logged

### Phase 2: Triage

**Classify severity BEFORE taking action:**

See `severity-classification.md` for detailed criteria.

1. **Assess impact scope**
   - How many users affected?
   - Which features/services impacted?
   - What's the blast radius?

2. **Classify severity (P1-P4)**
   - P1: >50% users, critical function down
   - P2: 10-50% users, major feature unavailable
   - P3: <10% users, workaround exists
   - P4: Minimal impact, no urgency

3. **Determine business impact**
   - Revenue at risk?
   - Compliance implications?
   - Reputation damage?

4. **Check SLO status**
   - Which SLO is breaching?
   - What's the error budget burn rate?
   - How fast are we consuming budget?

**Completion criteria:**
- [ ] `severity_classified` - P1/P2/P3/P4 assigned
- [ ] `impact_assessed` - Users affected and scope documented
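
The P1-P4 thresholds in step 2 can be sketched as a function so that classification takes seconds rather than debate. The interpretation below is an assumption: it treats either condition on a line (user percentage or functional impact) as enough to trigger that level, and treats any nonzero user impact below 10% as P3. `ImpactAssessment` and `classifyPriority` are hypothetical names; `severity-classification.md` remains the authority on the detailed criteria.

```typescript
type Priority = "P1" | "P2" | "P3" | "P4";

interface ImpactAssessment {
  affectedUserPercent: number;      // 0-100, best estimate at triage time
  criticalFunctionDown: boolean;    // e.g. a core user-facing function is unavailable
  majorFeatureUnavailable: boolean;
}

// Conservative reading of the thresholds: either condition escalates (assumption).
function classifyPriority(impact: ImpactAssessment): Priority {
  if (impact.affectedUserPercent > 50 || impact.criticalFunctionDown) return "P1";
  if (impact.affectedUserPercent >= 10 || impact.majorFeatureUnavailable) return "P2";
  if (impact.affectedUserPercent > 0) return "P3";
  return "P4";
}
```

Erring toward the higher priority when signals disagree keeps the initial response safe; the level can be lowered once the scope is confirmed.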

### Phase 3: Communicate

**Notify stakeholders BEFORE attempting fixes:**

See `communication-templates.md` for notification formats.

1. **Internal notification**
   - Alert the on-call team
   - Notify relevant stakeholders
   - Open incident channel/war room (P1/P2)

2. **External communication (if applicable)**
   - Update status page
   - Prepare customer communication
   - Follow regulatory requirements

3. **Set communication cadence**
   - P1: Updates every 15 minutes
   - P2: Updates every 30 minutes
   - P3/P4: Updates at milestones

4. **Include trace_id in all communications**
   - Enables instant context jumping
   - Links logs, traces, and metrics

**Completion criteria:**
- [ ] `stakeholders_notified` - Team/leadership informed
- [ ] `communication_cadence_set` - Update schedule established
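
The cadence from step 3 could be tracked with a small helper that says when the next update is due. This is a sketch: `UPDATE_INTERVAL_MIN` and `nextUpdateDue` are hypothetical names, and representing "updates at milestones" as `null` (no timer) is an assumption.

```typescript
type Priority = "P1" | "P2" | "P3" | "P4";

// Minutes between updates; null means "update at milestones" (P3/P4).
const UPDATE_INTERVAL_MIN: Record<Priority, number | null> = {
  P1: 15,
  P2: 30,
  P3: null,
  P4: null,
};

// Returns when the next stakeholder update is due, or null if updates are milestone-driven.
function nextUpdateDue(priority: Priority, lastUpdate: Date): Date | null {
  const minutes = UPDATE_INTERVAL_MIN[priority];
  if (minutes === null) return null;
  return new Date(lastUpdate.getTime() + minutes * 60 * 1000);
}
```

For a P1 whose last update went out at 10:00, the next one is due at 10:15, even if the only news is that investigation continues.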

### Phase 4: Mitigate

**Contain the blast radius:**

1. **Evaluate rollback FIRST**
   - Was there a recent deployment?
   - Can we rollback safely?
   - Rollback is often faster than forward-fix

2. **If rollback not viable:**
   - Can we disable the failing feature?
   - Can we route traffic away?
   - Can we scale resources?

3. **Preserve evidence before action**
   - Export logs before rotation
   - Capture heap/thread dumps
   - Screenshot metrics dashboards

4. **Implement containment**
   - Apply the chosen mitigation
   - Verify it's working
   - Communicate the mitigation status

**Completion criteria:**
- [ ] `rollback_evaluated` - Rollback option assessed
- [ ] `containment_verified` - Blast radius limited
- [ ] `evidence_preserved` - Logs/traces captured before action

### Phase 5: Diagnose

**Now investigate the root cause:**

**HANDOFF:** Use `superpowers:systematic-debugging` for technical investigation.

1. **Form hypothesis**
   - What changed recently?
   - What do the logs/traces show?
   - Which component is failing?

2. **Test hypothesis**
   - One variable at a time
   - Gather evidence, don't guess
   - Follow the systematic debugging phases

3. **If 3+ hypotheses failed:**
   - STOP and question architecture
   - Escalate for fresh perspective
   - Don't keep guessing

**Completion criteria:**
- [ ] `root_cause_identified` - Cause confirmed with evidence
- [ ] OR `escalated_for_help` - Fresh eyes requested after 3+ failed hypotheses

### Phase 6: Fix or Rollback

**Apply the resolution:**

1. **If rollback is the answer:**
   - Execute rollback procedure
   - Verify rollback completed
   - Confirm service restored

2. **If forward-fix required:**
   - Implement minimal fix
   - Get peer review (even quick)
   - Stage the fix properly

3. **Never skip verification**
   - Test the fix works
   - Confirm no side effects
   - Don't just "hope it worked"

**Completion criteria:**
- [ ] `fix_deployed` - Forward fix applied and reviewed
- [ ] OR `rollback_completed` - Previous state restored

### Phase 7: Verify

**Confirm the incident is resolved:**

**HANDOFF:** Use `superpowers:verification-before-completion` for thorough verification.

1. **Service health check**
   - Error rates back to normal?
   - Latency within SLO?
   - No new alerts?

2. **User impact verification**
   - Can users complete flows?
   - Any lingering issues?
   - Customer reports resolved?

3. **Monitoring confirmation**
   - Dashboards showing green?
   - SLO back in budget?
   - No anomalies in metrics?

**Completion criteria:**
- [ ] `service_restored` - Users can complete flows normally
- [ ] `slo_compliant` - Error rates and latency within SLO

### Phase 8: Document

**The incident is NOT resolved until documentation exists:**

See `postmortem-template.md` for the blameless RCA framework.

1. **Schedule postmortem**
   - Within 48 hours for P1/P2
   - Within 1 week for P3/P4
   - Include all responders

2. **Capture timeline**
   - What happened, when, by whom
   - Use trace_id to reconstruct
   - Include communication history

3. **Identify action items**
   - Preventive measures
   - Detection improvements
   - Process enhancements

4. **Close the incident**
   - All stakeholders notified
   - Status page updated
   - Incident marked resolved

**Completion criteria:**
- [ ] `postmortem_scheduled` - Meeting booked within required window
- [ ] `incident_closed` - Stakeholders notified, status page updated

## Red Flags - STOP and Follow Process

If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try this and see if it works"
- "No time for communication, focus on fixing"
- "Skip the triage, obviously it's P1"
- "Rollback is too risky, let's push forward"
- "We can document later, let's close this"
- "I know what the problem is, trust me"
- "The user wants speed, not process"

**ALL of these mean: STOP. Return to the current phase.**

## Common Rationalizations

| Excuse | Reality |
|--------|---------|
| "This is obviously a P1" | Every incident feels urgent. Triage takes 2 minutes, wrong classification causes chaos. |
| "Fix first, communicate later" | Silent outages create panic. 30-second update prevents 30-minute escalation. |
| "Rollback is too risky" | Known state > uncertain forward fix. Rollback buys time to fix properly. |
| "We can investigate after fixing" | Evidence disappears after restart. Capture now, analyze later. |
| "It's a one-off, skip the postmortem" | All incidents feel unique. Patterns emerge only from documentation. |
| "I know this system, trust me" | Expert intuition fails under pressure. Protocol beats intuition. |
| "Communication slows us down" | Coordinated response > heroic solo effort. Communication accelerates resolution. |
| "Previous fix didn't work, try another" | 3+ fixes = wrong diagnosis. Stop guessing, investigate properly. |

## Quick Reference

| Phase | Key Activities | Verification Key |
|-------|---------------|------------------|
| **1. Detect** | Acknowledge, capture evidence, record time, **capture trace_id** | `incident_acknowledged` |
| **2. Triage** | Assess impact, classify P1-P4, check SLO | `severity_classified` |
| **3. Communicate** | Notify team, update status, set cadence, **include trace_id** | `stakeholders_notified` |
| **4. Mitigate** | Evaluate rollback, contain blast radius | `containment_verified` |
| **5. Diagnose** | Form hypothesis, test systematically | `root_cause_identified` |
| **6. Fix/Rollback** | Apply resolution, get review | `fix_deployed` |
| **7. Verify** | Confirm resolution, check SLO | `service_restored` |
| **8. Document** | Schedule postmortem, capture timeline, **correlate via trace_id** | `postmortem_scheduled` |

**🔑 trace_id:** Include in ALL internal communications - enables instant context jumping across logs, traces, and metrics.
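
The phase ordering and verification keys in the table lend themselves to a simple gate. The sketch below is one possible encoding, not something the skill ships: `PHASES`, `currentPhase`, and `canProposeFix` are hypothetical names, and phases 5 and 6 accept either of the alternative keys listed in their completion criteria.

```typescript
interface Phase {
  name: string;
  anyOf: string[]; // recording any one of these keys completes the phase
}

// Phase order and keys taken from the Quick Reference table and completion criteria.
const PHASES: Phase[] = [
  { name: "Detect", anyOf: ["incident_acknowledged"] },
  { name: "Triage", anyOf: ["severity_classified"] },
  { name: "Communicate", anyOf: ["stakeholders_notified"] },
  { name: "Mitigate", anyOf: ["containment_verified"] },
  { name: "Diagnose", anyOf: ["root_cause_identified", "escalated_for_help"] },
  { name: "Fix/Rollback", anyOf: ["fix_deployed", "rollback_completed"] },
  { name: "Verify", anyOf: ["service_restored"] },
  { name: "Document", anyOf: ["postmortem_scheduled"] },
];

// Returns the first phase that is not yet complete, so later phases cannot be skipped to.
function currentPhase(completed: ReadonlySet<string>): string {
  const index = PHASES.findIndex((phase) => !phase.anyOf.some((key) => completed.has(key)));
  return index === -1 ? "Closed" : `${index + 1}. ${PHASES[index].name}`;
}

// The Iron Law: no fixes without triage and communication first.
function canProposeFix(completed: ReadonlySet<string>): boolean {
  return completed.has("severity_classified") && completed.has("stakeholders_notified");
}
```

Because `currentPhase` always reports the earliest incomplete phase, recording a later key early does not let a responder jump ahead.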

## Supporting Files

These files provide detailed guidance for specific phases:

- **`severity-classification.md`** - P1-P4 criteria with SLO burn rate mapping
- **`communication-templates.md`** - Notification templates for stakeholders
- **`postmortem-template.md`** - Blameless RCA framework and action items
- **`escalation-matrix.md`** - Role-based escalation paths per severity

## Related Skills

- **superpowers:systematic-debugging** - For Phase 5 (Diagnose) root cause analysis
- **superpowers:verification-before-completion** - For Phase 7 (Verify) confirmation

## Real-World Impact

From incident response data:
- Systematic response: MTTR 30-60 minutes
- Ad-hoc heroics: MTTR 2-4 hours
- Communication during incident: 70% fewer escalations
- Postmortem completion: 50% fewer repeat incidents