omgkit 2.1.1 → 2.2.0
- package/package.json +1 -1
- package/plugin/skills/SKILL_STANDARDS.md +743 -0
- package/plugin/skills/databases/mongodb/SKILL.md +797 -28
- package/plugin/skills/databases/prisma/SKILL.md +776 -30
- package/plugin/skills/databases/redis/SKILL.md +885 -25
- package/plugin/skills/devops/aws/SKILL.md +686 -28
- package/plugin/skills/devops/github-actions/SKILL.md +684 -29
- package/plugin/skills/devops/kubernetes/SKILL.md +621 -24
- package/plugin/skills/frameworks/django/SKILL.md +920 -20
- package/plugin/skills/frameworks/express/SKILL.md +1361 -35
- package/plugin/skills/frameworks/fastapi/SKILL.md +1260 -33
- package/plugin/skills/frameworks/laravel/SKILL.md +1244 -31
- package/plugin/skills/frameworks/nestjs/SKILL.md +1005 -26
- package/plugin/skills/frameworks/rails/SKILL.md +594 -28
- package/plugin/skills/frameworks/spring/SKILL.md +528 -35
- package/plugin/skills/frameworks/vue/SKILL.md +1296 -27
- package/plugin/skills/frontend/accessibility/SKILL.md +1108 -34
- package/plugin/skills/frontend/frontend-design/SKILL.md +1304 -26
- package/plugin/skills/frontend/responsive/SKILL.md +847 -21
- package/plugin/skills/frontend/shadcn-ui/SKILL.md +976 -38
- package/plugin/skills/frontend/tailwindcss/SKILL.md +831 -35
- package/plugin/skills/frontend/threejs/SKILL.md +1298 -29
- package/plugin/skills/languages/javascript/SKILL.md +935 -31
- package/plugin/skills/methodology/brainstorming/SKILL.md +597 -23
- package/plugin/skills/methodology/defense-in-depth/SKILL.md +832 -34
- package/plugin/skills/methodology/dispatching-parallel-agents/SKILL.md +665 -31
- package/plugin/skills/methodology/executing-plans/SKILL.md +556 -24
- package/plugin/skills/methodology/finishing-development-branch/SKILL.md +595 -25
- package/plugin/skills/methodology/problem-solving/SKILL.md +429 -61
- package/plugin/skills/methodology/receiving-code-review/SKILL.md +536 -24
- package/plugin/skills/methodology/requesting-code-review/SKILL.md +632 -21
- package/plugin/skills/methodology/root-cause-tracing/SKILL.md +641 -30
- package/plugin/skills/methodology/sequential-thinking/SKILL.md +262 -3
- package/plugin/skills/methodology/systematic-debugging/SKILL.md +571 -32
- package/plugin/skills/methodology/test-driven-development/SKILL.md +779 -24
- package/plugin/skills/methodology/testing-anti-patterns/SKILL.md +691 -29
- package/plugin/skills/methodology/token-optimization/SKILL.md +598 -29
- package/plugin/skills/methodology/verification-before-completion/SKILL.md +543 -22
- package/plugin/skills/methodology/writing-plans/SKILL.md +590 -18
- package/plugin/skills/omega/omega-architecture/SKILL.md +838 -39
- package/plugin/skills/omega/omega-coding/SKILL.md +636 -39
- package/plugin/skills/omega/omega-sprint/SKILL.md +855 -48
- package/plugin/skills/omega/omega-testing/SKILL.md +940 -41
- package/plugin/skills/omega/omega-thinking/SKILL.md +703 -50
- package/plugin/skills/security/better-auth/SKILL.md +1065 -28
- package/plugin/skills/security/oauth/SKILL.md +968 -31
- package/plugin/skills/security/owasp/SKILL.md +894 -33
- package/plugin/skills/testing/playwright/SKILL.md +764 -38
- package/plugin/skills/testing/pytest/SKILL.md +873 -36
- package/plugin/skills/testing/vitest/SKILL.md +980 -35
|
@@ -1,53 +1,664 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: root-cause-tracing
|
|
3
|
-
description:
|
|
3
|
+
description: Systematic root cause analysis using 5 Whys, Fishbone diagrams, and evidence-based investigation
|
|
4
|
+
category: methodology
|
|
5
|
+
triggers:
|
|
6
|
+
- root cause
|
|
7
|
+
- 5 whys
|
|
8
|
+
- fishbone diagram
|
|
9
|
+
- debugging
|
|
10
|
+
- incident analysis
|
|
11
|
+
- post mortem
|
|
12
|
+
- problem investigation
|
|
4
13
|
---
|
|
5
14
|
|
|
6
|
-
# Root Cause Tracing
|
|
15
|
+
# Root Cause Tracing
|
|
7
16
|
|
|
8
|
-
|
|
17
|
+
Master **systematic root cause analysis** to find the true underlying causes of problems, not just symptoms. This skill provides frameworks for investigating issues deeply and preventing recurrence.
|
|
18
|
+
|
|
19
|
+
## Purpose
|
|
20
|
+
|
|
21
|
+
Find and eliminate true root causes:
|
|
22
|
+
|
|
23
|
+
- Distinguish symptoms from underlying causes
|
|
24
|
+
- Use structured investigation methodologies
|
|
25
|
+
- Trace causality chains to their origins
|
|
26
|
+
- Identify systemic factors that allow problems to occur
|
|
27
|
+
- Prevent recurrence through proper fixes
|
|
28
|
+
- Document findings for organizational learning
|
|
29
|
+
- Build more resilient systems over time
|
|
30
|
+
|
|
31
|
+
## Features
|
|
32
|
+
|
|
33
|
+
### 1. The Root Cause Hierarchy
|
|
34
|
+
|
|
35
|
+
```markdown
|
|
36
|
+
## Understanding Cause Layers
|
|
37
|
+
|
|
38
|
+
┌─────────────────────────────────────────────────────────────────────────┐
|
|
39
|
+
│ ROOT CAUSE HIERARCHY │
|
|
40
|
+
├─────────────────────────────────────────────────────────────────────────┤
|
|
41
|
+
│ │
|
|
42
|
+
│ SYMPTOM │
|
|
43
|
+
│ └── What you observe: "App crashed" "Users can't login" │
|
|
44
|
+
│ │
|
|
45
|
+
│ PROXIMATE CAUSE │
|
|
46
|
+
│ └── Direct trigger: "Out of memory" "Database timeout" │
|
|
47
|
+
│ │
|
|
48
|
+
│ CONTRIBUTING FACTORS │
|
|
49
|
+
│ └── Conditions that enabled: "No memory limits" "Slow query" │
|
|
50
|
+
│ │
|
|
51
|
+
│ ROOT CAUSE │
|
|
52
|
+
│ └── Fundamental reason: "Memory leak in event handlers" │
|
|
53
|
+
│ │
|
|
54
|
+
│ SYSTEMIC FACTORS │
|
|
55
|
+
│ └── Why it wasn't caught: "No memory monitoring" "Missing tests" │
|
|
56
|
+
│ │
|
|
57
|
+
│ │
|
|
58
|
+
│ PRINCIPLE: Fix at the deepest level possible │
|
|
59
|
+
│ - Fixing symptoms: Problem returns │
|
|
60
|
+
│ - Fixing proximate cause: Similar problems emerge │
|
|
61
|
+
│ - Fixing root cause: This specific problem prevented │
|
|
62
|
+
│ - Fixing systemic factors: Entire class of problems prevented │
|
|
63
|
+
│ │
|
|
64
|
+
└─────────────────────────────────────────────────────────────────────────┘
|
|
9
65
|
```
|
|
10
|
-
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
|
|
66
|
+
|
|
67
|
+
### 2. The 5 Whys Technique
|
|
68
|
+
|
|
69
|
+
```markdown
|
|
70
|
+
## 5 Whys: Iterative Causality Tracing
|
|
71
|
+
|
|
72
|
+
### The Method
|
|
73
|
+
Ask "Why?" repeatedly (typically 5 times) until you reach a fundamental cause.
|
|
74
|
+
|
|
75
|
+
### Example: Production Outage
|
|
76
|
+
|
|
77
|
+
Problem: Website went down for 2 hours
|
|
78
|
+
|
|
79
|
+
Why #1: Why did the website go down?
|
|
80
|
+
→ The application server ran out of memory and crashed.
|
|
81
|
+
|
|
82
|
+
Why #2: Why did it run out of memory?
|
|
83
|
+
→ The number of active connections grew unbounded.
|
|
84
|
+
|
|
85
|
+
Why #3: Why did connections grow unbounded?
|
|
86
|
+
→ Connection objects weren't being released after use.
|
|
87
|
+
|
|
88
|
+
Why #4: Why weren't connections being released?
|
|
89
|
+
→ The release call sat after an early return statement rather than
|
|
90
|
+
inside a finally block, so it was skipped on that path.
|
|
91
|
+
|
|
92
|
+
Why #5: Why wasn't this caught before production?
|
|
93
|
+
→ No test existed for the connection cleanup path, and code review
|
|
94
|
+
missed the early return that skipped the release call.
|
|
95
|
+
|
|
96
|
+
### Root Causes Identified:
|
|
97
|
+
1. Technical: Missing cleanup code execution (fix the bug)
|
|
98
|
+
2. Systemic: Missing test coverage for cleanup paths (add tests)
|
|
99
|
+
3. Process: Code review didn't catch early-return anti-pattern (add checklist)
|
|
17
100
|
```
|
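The connection leak traced in the example above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `FakePool` class and handler names that do not appear in the skill file. In TypeScript a `finally` block runs even when the `try` body returns early, so a leak of this kind usually means the release call sits outside any `finally`.

```typescript
// Hypothetical in-memory pool used only for illustration.
class FakePool {
  active = 0;

  acquire(): { id: number } {
    this.active += 1;
    return { id: this.active };
  }

  release(_conn: { id: number }): void {
    this.active -= 1;
  }
}

// Leaky version: the early return skips the release call below it.
function handleLeaky(pool: FakePool, cached: boolean): string {
  const conn = pool.acquire();
  if (cached) {
    return "cache hit"; // conn is never released on this path
  }
  pool.release(conn);
  return "fetched";
}

// Safer version: finally runs on every exit path, including the early return.
function handleSafe(pool: FakePool, cached: boolean): string {
  const conn = pool.acquire();
  try {
    if (cached) {
      return "cache hit";
    }
    return "fetched";
  } finally {
    pool.release(conn);
  }
}

const leakyPool = new FakePool();
const safePool = new FakePool();
for (let i = 0; i < 3; i++) {
  handleLeaky(leakyPool, true);
  handleSafe(safePool, true);
}
console.log(leakyPool.active, safePool.active); // prints "3 0"
```

A regression test for the cleanup path can assert on the pool's active count after exercising every return path, which is the kind of test Why #5 found missing.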
|
18
101
|
|
|
19
|
-
|
|
102
|
+
```typescript
|
|
103
|
+
/**
|
|
104
|
+
* 5 Whys Investigation Framework
|
|
105
|
+
*/
|
|
106
|
+
|
|
107
|
+
interface WhyStep {
|
|
108
|
+
question: string;
|
|
109
|
+
answer: string;
|
|
110
|
+
evidence: string[];
|
|
111
|
+
confidence: 'confirmed' | 'likely' | 'hypothesis';
|
|
112
|
+
}
|
|
113
|
+
|
|
114
|
+
interface FiveWhysAnalysis {
|
|
115
|
+
problem: string;
|
|
116
|
+
impactAssessment: ImpactAssessment;
|
|
117
|
+
whys: WhyStep[];
|
|
118
|
+
rootCauses: RootCause[];
|
|
119
|
+
recommendations: Recommendation[];
|
|
120
|
+
}
|
|
121
|
+
|
|
122
|
+
function conductFiveWhys(problem: string): FiveWhysAnalysis {
|
|
123
|
+
const analysis: FiveWhysAnalysis = {
|
|
124
|
+
problem,
|
|
125
|
+
impactAssessment: assessImpact(problem),
|
|
126
|
+
whys: [],
|
|
127
|
+
rootCauses: [],
|
|
128
|
+
recommendations: []
|
|
129
|
+
};
|
|
130
|
+
|
|
131
|
+
let currentQuestion = `Why did ${problem}?`;
|
|
132
|
+
let depth = 0;
|
|
133
|
+
|
|
134
|
+
while (depth < 5) {
|
|
135
|
+
const answer = investigateQuestion(currentQuestion);
|
|
136
|
+
const evidence = gatherEvidence(answer);
|
|
137
|
+
|
|
138
|
+
analysis.whys.push({
|
|
139
|
+
question: currentQuestion,
|
|
140
|
+
answer: answer.text,
|
|
141
|
+
evidence: evidence.sources,
|
|
142
|
+
confidence: evidence.confidence
|
|
143
|
+
});
|
|
144
|
+
|
|
145
|
+
// Check if we've reached a root cause
|
|
146
|
+
if (isRootCause(answer)) {
|
|
147
|
+
analysis.rootCauses.push({
|
|
148
|
+
description: answer.text,
|
|
149
|
+
type: classifyRootCause(answer),
|
|
150
|
+
evidence: evidence
|
|
151
|
+
});
|
|
152
|
+
}
|
|
153
|
+
|
|
154
|
+
currentQuestion = `Why ${answer.text}?`;
|
|
155
|
+
depth++;
|
|
156
|
+
}
|
|
157
|
+
|
|
158
|
+
// Generate recommendations for each root cause
|
|
159
|
+
analysis.recommendations = analysis.rootCauses.map(rc =>
|
|
160
|
+
generateRecommendation(rc)
|
|
161
|
+
);
|
|
162
|
+
|
|
163
|
+
return analysis;
|
|
164
|
+
}
|
|
165
|
+
```
|
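The framework above depends on helpers such as `investigateQuestion` and `isRootCause` that the file leaves undefined. As a rough, self-contained sketch of the same loop, the version below replaces the investigation step with a lookup in a hypothetical `knownCauses` map and stops when no deeper cause is recorded or the depth limit is reached. The map contents, `walkWhys`, and `ChainStep` are assumptions for illustration only.

```typescript
interface ChainStep {
  question: string;
  answer: string;
}

// Hypothetical causal map standing in for real investigation work:
// each key is an observed effect, each value is its verified cause.
const knownCauses = new Map<string, string>([
  ["the website went down", "the application server ran out of memory"],
  ["the application server ran out of memory", "active connections grew unbounded"],
  ["active connections grew unbounded", "connections were not released after use"],
]);

function walkWhys(
  problem: string,
  causes: Map<string, string>,
  maxDepth = 5
): ChainStep[] {
  const steps: ChainStep[] = [];
  let current = problem;

  // The depth limit also guards against cycles in the causal map.
  while (steps.length < maxDepth) {
    const answer = causes.get(current);
    if (answer === undefined) {
      break; // no deeper cause is known, so stop asking
    }
    steps.push({ question: `Why: ${current}?`, answer });
    current = answer;
  }
  return steps;
}

const steps = walkWhys("the website went down", knownCauses);
const deepest = steps.length > 0 ? steps[steps.length - 1].answer : undefined;
console.log(steps.length, deepest);
```

Stopping when the chain runs out, rather than always asking exactly five times, reflects the "typically 5 times" wording in the method description.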
|
166
|
+
|
|
167
|
+
### 3. Fishbone (Ishikawa) Diagram
|
|
168
|
+
|
|
169
|
+
```markdown
|
|
170
|
+
## Fishbone Diagram: Category-Based Analysis
|
|
171
|
+
|
|
172
|
+
┌──────────────────┐
|
|
173
|
+
│ PROBLEM │
|
|
174
|
+
│ [Symptom Here] │
|
|
175
|
+
└────────┬─────────┘
|
|
176
|
+
│
|
|
177
|
+
┌────────────────────────────┼────────────────────────────┐
|
|
178
|
+
│ │ │
|
|
179
|
+
│ PEOPLE │ PROCESS │
|
|
180
|
+
│ ────── │ ─────── │
|
|
181
|
+
│ • Skill gaps │ • Missing │
|
|
182
|
+
│ • Communication │ steps │
|
|
183
|
+
│ • Training │ • Unclear │
|
|
184
|
+
│ • Handoffs │ ownership │
|
|
185
|
+
│ ╲ │ ╱ │
|
|
186
|
+
│ ╲ │ ╱ │
|
|
187
|
+
│ ╲ │ ╱ │
|
|
188
|
+
│ ╲──────────────┼──────────────╱ │
|
|
189
|
+
│ ╲ │ ╱ │
|
|
190
|
+
│ ╲ │ ╱ │
|
|
191
|
+
│ ╲ │ ╱ │
|
|
192
|
+
│ ╲──────────┼──────────╱ │
|
|
193
|
+
│ ╱ │ ╲ │
|
|
194
|
+
│ ╱ │ ╲ │
|
|
195
|
+
│ ╱ │ ╲ │
|
|
196
|
+
│ ╱──────────────┼──────────────╲ │
|
|
197
|
+
│ ╱ │ ╲ │
|
|
198
|
+
│ ╱ │ ╲ │
|
|
199
|
+
│ ╱ │ ╲ │
|
|
200
|
+
│ TECHNOLOGY │ ENVIRONMENT │
|
|
201
|
+
│ ────────── │ ─────────── │
|
|
202
|
+
│ • Code bugs │ • Load │
|
|
203
|
+
│ • Dependencies │ • Network │
|
|
204
|
+
│ • Infrastructure │ • Third-party │
|
|
205
|
+
│ • Configuration │ • Timing │
|
|
206
|
+
│ │ │
|
|
207
|
+
└────────────────────────────┴────────────────────────────┘
|
|
208
|
+
|
|
209
|
+
## Software-Specific Categories
|
|
210
|
+
|
|
211
|
+
1. CODE
|
|
212
|
+
- Logic errors
|
|
213
|
+
- Race conditions
|
|
214
|
+
- Memory leaks
|
|
215
|
+
- Error handling
|
|
216
|
+
|
|
217
|
+
2. DATA
|
|
218
|
+
- Invalid input
|
|
219
|
+
- Corrupt data
|
|
220
|
+
- Missing data
|
|
221
|
+
- Schema mismatches
|
|
222
|
+
|
|
223
|
+
3. CONFIGURATION
|
|
224
|
+
- Wrong settings
|
|
225
|
+
- Environment mismatch
|
|
226
|
+
- Secrets/credentials
|
|
227
|
+
- Feature flags
|
|
228
|
+
|
|
229
|
+
4. INFRASTRUCTURE
|
|
230
|
+
- Resource exhaustion
|
|
231
|
+
- Network issues
|
|
232
|
+
- Service failures
|
|
233
|
+
- Scaling problems
|
|
234
|
+
|
|
235
|
+
5. EXTERNAL
|
|
236
|
+
- Third-party APIs
|
|
237
|
+
- Dependencies
|
|
238
|
+
- User behavior
|
|
239
|
+
- Attack/abuse
|
|
240
|
+
|
|
241
|
+
6. PROCESS
|
|
242
|
+
- Missing tests
|
|
243
|
+
- Review gaps
|
|
244
|
+
- Deployment issues
|
|
245
|
+
- Monitoring blind spots
|
|
20
246
|
```
|
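One way to put the six software-specific categories to work is to group candidate causes by category and list the categories nobody has looked at yet. The sketch below assumes hypothetical `CandidateCause` and `groupByCategory` names and sample causes invented for illustration.

```typescript
type FishboneCategory =
  | "code"
  | "data"
  | "configuration"
  | "infrastructure"
  | "external"
  | "process";

interface CandidateCause {
  category: FishboneCategory;
  description: string;
}

const allCategories: FishboneCategory[] = [
  "code",
  "data",
  "configuration",
  "infrastructure",
  "external",
  "process",
];

// Group candidate causes into one "bone" per category.
function groupByCategory(causes: CandidateCause[]): Map<FishboneCategory, string[]> {
  const bones = new Map<FishboneCategory, string[]>();
  for (const cause of causes) {
    const existing = bones.get(cause.category) ?? [];
    existing.push(cause.description);
    bones.set(cause.category, existing);
  }
  return bones;
}

const bones = groupByCategory([
  { category: "code", description: "Connection not released on early return" },
  { category: "code", description: "Missing error handling around query" },
  { category: "process", description: "No test for cleanup path" },
]);

// Categories with no candidate cause yet are prompts for further questions.
const unexplored = allCategories.filter((c) => !bones.has(c));
console.log(unexplored);
```

Listing the unexplored categories is a cheap check against stopping at the first plausible explanation.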
|
21
|
-
|
|
22
|
-
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
|
|
247
|
+
|
|
248
|
+
### 4. Evidence-Based Investigation
|
|
249
|
+
|
|
250
|
+
```typescript
|
|
251
|
+
/**
|
|
252
|
+
* Evidence-Based Root Cause Investigation
|
|
253
|
+
* Every hypothesis must be backed by data
|
|
254
|
+
*/
|
|
255
|
+
|
|
256
|
+
interface Evidence {
|
|
257
|
+
type: 'log' | 'metric' | 'trace' | 'reproduction' | 'testimony';
|
|
258
|
+
source: string;
|
|
259
|
+
timestamp?: Date;
|
|
260
|
+
data: unknown;
|
|
261
|
+
reliability: 'high' | 'medium' | 'low';
|
|
262
|
+
}
|
|
263
|
+
|
|
264
|
+
interface Hypothesis {
|
|
265
|
+
description: string;
|
|
266
|
+
category: RootCauseCategory;
|
|
267
|
+
evidence: {
|
|
268
|
+
supporting: Evidence[];
|
|
269
|
+
contradicting: Evidence[];
|
|
270
|
+
};
|
|
271
|
+
confidence: number; // 0-1
|
|
272
|
+
testPlan?: string;
|
|
273
|
+
}
|
|
274
|
+
|
|
275
|
+
class RootCauseInvestigator {
|
|
276
|
+
private hypotheses: Hypothesis[] = [];
|
|
277
|
+
private evidence: Evidence[] = [];
|
|
278
|
+
|
|
279
|
+
// Step 1: Gather all available evidence
|
|
280
|
+
async gatherEvidence(incident: Incident): Promise<Evidence[]> {
|
|
281
|
+
const evidence: Evidence[] = [];
|
|
282
|
+
|
|
283
|
+
// Collect logs around incident time
|
|
284
|
+
const logs = await this.queryLogs({
|
|
285
|
+
startTime: incident.startTime.minus({ minutes: 30 }),
|
|
286
|
+
endTime: incident.endTime.plus({ minutes: 30 }),
|
|
287
|
+
services: incident.affectedServices
|
|
288
|
+
});
|
|
289
|
+
|
|
290
|
+
evidence.push(...logs.map(log => ({
|
|
291
|
+
type: 'log' as const,
|
|
292
|
+
source: log.service,
|
|
293
|
+
timestamp: log.timestamp,
|
|
294
|
+
data: log.message,
|
|
295
|
+
reliability: 'high' as const
|
|
296
|
+
})));
|
|
297
|
+
|
|
298
|
+
// Collect metrics
|
|
299
|
+
const metrics = await this.queryMetrics({
|
|
300
|
+
metrics: ['error_rate', 'latency_p99', 'memory_usage', 'cpu_usage'],
|
|
301
|
+
startTime: incident.startTime.minus({ hours: 1 }),
|
|
302
|
+
endTime: incident.endTime.plus({ hours: 1 })
|
|
303
|
+
});
|
|
304
|
+
|
|
305
|
+
evidence.push(...metrics.map(m => ({
|
|
306
|
+
type: 'metric' as const,
|
|
307
|
+
source: m.name,
|
|
308
|
+
timestamp: m.timestamp,
|
|
309
|
+
data: m.value,
|
|
310
|
+
reliability: 'high' as const
|
|
311
|
+
})));
|
|
312
|
+
|
|
313
|
+
// Collect traces for affected requests
|
|
314
|
+
const traces = await this.queryTraces({
|
|
315
|
+
traceIds: incident.affectedTraceIds.slice(0, 100)
|
|
316
|
+
});
|
|
317
|
+
|
|
318
|
+
evidence.push(...traces.map(t => ({
|
|
319
|
+
type: 'trace' as const,
|
|
320
|
+
source: t.serviceName,
|
|
321
|
+
data: t.spans,
|
|
322
|
+
reliability: 'high' as const
|
|
323
|
+
})));
|
|
324
|
+
|
|
325
|
+
this.evidence = evidence;
|
|
326
|
+
return evidence;
|
|
327
|
+
}
|
|
328
|
+
|
|
329
|
+
// Step 2: Generate hypotheses based on evidence
|
|
330
|
+
generateHypotheses(): Hypothesis[] {
|
|
331
|
+
const hypotheses: Hypothesis[] = [];
|
|
332
|
+
|
|
333
|
+
// Analyze log patterns
|
|
334
|
+
const errorPatterns = this.findErrorPatterns(this.evidence);
|
|
335
|
+
for (const pattern of errorPatterns) {
|
|
336
|
+
hypotheses.push({
|
|
337
|
+
description: `Error caused by: ${pattern.summary}`,
|
|
338
|
+
category: this.categorizePattern(pattern),
|
|
339
|
+
evidence: {
|
|
340
|
+
supporting: pattern.matchingLogs,
|
|
341
|
+
contradicting: []
|
|
342
|
+
},
|
|
343
|
+
confidence: pattern.frequency / this.evidence.length,
|
|
344
|
+
testPlan: `Reproduce with: ${pattern.reproductionHint}`
|
|
345
|
+
});
|
|
346
|
+
}
|
|
347
|
+
|
|
348
|
+
// Analyze metric anomalies
|
|
349
|
+
const anomalies = this.findMetricAnomalies(this.evidence);
|
|
350
|
+
for (const anomaly of anomalies) {
|
|
351
|
+
hypotheses.push({
|
|
352
|
+
description: `Resource issue: ${anomaly.metric} ${anomaly.direction}`,
|
|
353
|
+
category: 'infrastructure',
|
|
354
|
+
evidence: {
|
|
355
|
+
supporting: [anomaly.evidence],
|
|
356
|
+
contradicting: []
|
|
357
|
+
},
|
|
358
|
+
confidence: anomaly.deviation > 3 ? 0.8 : 0.5
|
|
359
|
+
});
|
|
360
|
+
}
|
|
361
|
+
|
|
362
|
+
// Sort by confidence
|
|
363
|
+
this.hypotheses = hypotheses.sort((a, b) => b.confidence - a.confidence);
|
|
364
|
+
return this.hypotheses;
|
|
365
|
+
}
|
|
366
|
+
|
|
367
|
+
// Step 3: Test hypotheses
|
|
368
|
+
async testHypothesis(hypothesis: Hypothesis): Promise<boolean> {
|
|
369
|
+
if (!hypothesis.testPlan) {
|
|
370
|
+
throw new Error('Hypothesis has no test plan');
|
|
371
|
+
}
|
|
372
|
+
|
|
373
|
+
// Attempt to reproduce in safe environment
|
|
374
|
+
const result = await this.runReproduction(hypothesis.testPlan);
|
|
375
|
+
|
|
376
|
+
if (result.reproduced) {
|
|
377
|
+
hypothesis.confidence = Math.min(hypothesis.confidence + 0.3, 1);
|
|
378
|
+
hypothesis.evidence.supporting.push({
|
|
379
|
+
type: 'reproduction',
|
|
380
|
+
source: 'test-environment',
|
|
381
|
+
data: result,
|
|
382
|
+
reliability: 'high'
|
|
383
|
+
});
|
|
384
|
+
return true;
|
|
385
|
+
} else {
|
|
386
|
+
hypothesis.confidence *= 0.5;
|
|
387
|
+
return false;
|
|
388
|
+
}
|
|
389
|
+
}
|
|
390
|
+
}
|
|
26
391
|
```
|
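`generateHypotheses` treats `anomaly.deviation > 3` as stronger evidence, but `findMetricAnomalies` is not defined in the file. One plausible reading, offered here as an assumption, is that `deviation` counts standard deviations from a pre-incident baseline. The sketch below computes that value; `deviationFromBaseline` and the sample numbers are hypothetical.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation of the samples.
function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

// How many baseline standard deviations `observed` sits from the baseline mean.
function deviationFromBaseline(baseline: number[], observed: number): number {
  if (baseline.length === 0) {
    return 0; // no baseline, so nothing can be called anomalous
  }
  const m = mean(baseline);
  const sd = stdDev(baseline);
  if (sd === 0) {
    return observed === m ? 0 : Infinity;
  }
  return Math.abs(observed - m) / sd;
}

// Hypothetical memory readings in MB before and during an incident.
const baselineMemoryMb = [510, 490, 500, 505, 495];
const duringIncidentMb = 900;

const deviation = deviationFromBaseline(baselineMemoryMb, duringIncidentMb);
// Same thresholds as generateHypotheses above.
const confidence = deviation > 3 ? 0.8 : 0.5;
console.log(deviation.toFixed(1), confidence);
```

A short baseline window makes the standard deviation noisy, so in practice the threshold deserves tuning per metric.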
|
27
392
|
|
|
28
|
-
|
|
29
|
-
1. **Code** - Bug in logic
|
|
30
|
-
2. **Data** - Invalid input
|
|
31
|
-
3. **Config** - Wrong settings
|
|
32
|
-
4. **Environment** - System issues
|
|
33
|
-
5. **External** - Third-party failure
|
|
393
|
+
### 5. Root Cause Analysis Template
|
|
34
394
|
|
|
35
|
-
## Output
|
|
36
395
|
```markdown
|
|
37
|
-
## Root Cause Analysis
|
|
396
|
+
## Root Cause Analysis Report
|
|
397
|
+
|
|
398
|
+
### Incident Summary
|
|
399
|
+
- **Incident ID:** [INC-XXXX]
|
|
400
|
+
- **Date/Time:** [When it occurred]
|
|
401
|
+
- **Duration:** [How long it lasted]
|
|
402
|
+
- **Severity:** [Critical/High/Medium/Low]
|
|
403
|
+
- **Affected Systems:** [What was impacted]
|
|
404
|
+
- **User Impact:** [How users were affected]
|
|
405
|
+
|
|
406
|
+
---
|
|
407
|
+
|
|
408
|
+
### Timeline
|
|
409
|
+
|
|
410
|
+
| Time | Event | Source |
|
|
411
|
+
|------|-------|--------|
|
|
412
|
+
| 09:00 | First error logged | Application logs |
|
|
413
|
+
| 09:05 | Alert triggered | Monitoring system |
|
|
414
|
+
| 09:15 | Investigation started | On-call engineer |
|
|
415
|
+
| 09:30 | Root cause identified | Log analysis |
|
|
416
|
+
| 09:45 | Fix deployed | Deployment system |
|
|
417
|
+
| 10:00 | Service restored | Health checks |
|
|
418
|
+
|
|
419
|
+
---
|
|
38
420
|
|
|
39
421
|
### Symptom
|
|
40
|
-
|
|
422
|
+
**What was observed:**
|
|
423
|
+
[Describe the visible symptoms - error messages, user reports, alerts]
|
|
424
|
+
|
|
425
|
+
**Evidence:**
|
|
426
|
+
- [Log excerpt 1]
|
|
427
|
+
- [Metric screenshot 2]
|
|
428
|
+
- [User report 3]
|
|
429
|
+
|
|
430
|
+
---
|
|
41
431
|
|
|
42
432
|
### Proximate Cause
|
|
43
|
-
|
|
433
|
+
**Immediate trigger:**
|
|
434
|
+
[What directly caused the symptom]
|
|
435
|
+
|
|
436
|
+
**Evidence:**
|
|
437
|
+
- [Supporting evidence]
|
|
438
|
+
|
|
439
|
+
---
|
|
44
440
|
|
|
45
441
|
### Root Cause
|
|
46
|
-
|
|
442
|
+
**Underlying reason:**
|
|
443
|
+
[The fundamental cause that, if fixed, prevents recurrence]
|
|
444
|
+
|
|
445
|
+
**5 Whys Analysis:**
|
|
446
|
+
1. Why [symptom]? → [answer]
|
|
447
|
+
2. Why [answer 1]? → [answer]
|
|
448
|
+
3. Why [answer 2]? → [answer]
|
|
449
|
+
4. Why [answer 3]? → [answer]
|
|
450
|
+
5. Why [answer 4]? → [ROOT CAUSE]
|
|
451
|
+
|
|
452
|
+
**Evidence:**
|
|
453
|
+
- [Code snippet showing the bug]
|
|
454
|
+
- [Configuration showing the misconfiguration]
|
|
455
|
+
|
|
456
|
+
---
|
|
47
457
|
|
|
48
458
|
### Systemic Factors
|
|
49
|
-
|
|
459
|
+
**Why wasn't this caught earlier?**
|
|
460
|
+
|
|
461
|
+
1. **Testing Gap:** [What test would have caught this?]
|
|
462
|
+
2. **Monitoring Gap:** [What alert would have warned us?]
|
|
463
|
+
3. **Process Gap:** [What review would have prevented this?]
|
|
464
|
+
|
|
465
|
+
---
|
|
466
|
+
|
|
467
|
+
### Action Items
|
|
50
468
|
|
|
51
|
-
|
|
52
|
-
|
|
469
|
+
| Action | Owner | Priority | Due Date | Status |
|
|
470
|
+
|--------|-------|----------|----------|--------|
|
|
471
|
+
| Fix the immediate bug | @engineer | P0 | Today | Done |
|
|
472
|
+
| Add regression test | @engineer | P1 | This week | In Progress |
|
|
473
|
+
| Add monitoring | @sre | P1 | This week | Not Started |
|
|
474
|
+
| Update runbook | @sre | P2 | Next week | Not Started |
|
|
475
|
+
| Add code review checklist item | @lead | P2 | Next sprint | Not Started |
|
|
476
|
+
|
|
477
|
+
---
|
|
478
|
+
|
|
479
|
+
### Lessons Learned
|
|
480
|
+
|
|
481
|
+
1. **What we learned:** [Key insight]
|
|
482
|
+
2. **What we'll do differently:** [Process change]
|
|
483
|
+
3. **Similar risks to address:** [Other areas with same pattern]
|
|
484
|
+
```
|
|
485
|
+
|
|
486
|
+
### 6. Common Root Cause Patterns
|
|
487
|
+
|
|
488
|
+
```typescript
|
|
489
|
+
/**
|
|
490
|
+
* Common root cause patterns in software systems
|
|
491
|
+
*/
|
|
492
|
+
|
|
493
|
+
const commonRootCausePatterns = {
|
|
494
|
+
resourceExhaustion: {
|
|
495
|
+
symptoms: [
|
|
496
|
+
'Out of memory errors',
|
|
497
|
+
'Connection pool exhausted',
|
|
498
|
+
'File descriptor limit reached',
|
|
499
|
+
'Thread pool saturation'
|
|
500
|
+
],
|
|
501
|
+
commonCauses: [
|
|
502
|
+
'Memory leaks (objects not garbage collected)',
|
|
503
|
+
'Connection leaks (not closing connections)',
|
|
504
|
+
'Unbounded queues or caches',
|
|
505
|
+
'Missing resource limits'
|
|
506
|
+
],
|
|
507
|
+
investigation: `
|
|
508
|
+
1. Check resource usage metrics over time
|
|
509
|
+
2. Look for steady growth patterns
|
|
510
|
+
3. Identify what's holding resources
|
|
511
|
+
4. Profile memory/connections in staging
|
|
512
|
+
`,
|
|
513
|
+
prevention: [
|
|
514
|
+
'Set explicit resource limits',
|
|
515
|
+
'Implement circuit breakers',
|
|
516
|
+
'Add resource usage monitoring',
|
|
517
|
+
'Use connection pooling with limits'
|
|
518
|
+
]
|
|
519
|
+
},
|
|
520
|
+
|
|
521
|
+
raceConditions: {
|
|
522
|
+
symptoms: [
|
|
523
|
+
'Intermittent failures',
|
|
524
|
+
'Data inconsistency',
|
|
525
|
+
'Deadlocks',
|
|
526
|
+
'Lost updates'
|
|
527
|
+
],
|
|
528
|
+
commonCauses: [
|
|
529
|
+
'Missing synchronization',
|
|
530
|
+
'Non-atomic operations',
|
|
531
|
+
'Improper lock ordering',
|
|
532
|
+
'Shared mutable state'
|
|
533
|
+
],
|
|
534
|
+
investigation: `
|
|
535
|
+
1. Look for concurrent access patterns
|
|
536
|
+
2. Check for shared mutable state
|
|
537
|
+
3. Review lock acquisition order
|
|
538
|
+
4. Add detailed tracing to suspect areas
|
|
539
|
+
`,
|
|
540
|
+
prevention: [
|
|
541
|
+
'Use immutable data structures',
|
|
542
|
+
'Implement proper locking',
|
|
543
|
+
'Use atomic operations',
|
|
544
|
+
'Add concurrency tests'
|
|
545
|
+
]
|
|
546
|
+
},
|
|
547
|
+
|
|
548
|
+
cascadingFailures: {
|
|
549
|
+
symptoms: [
|
|
550
|
+
'Multiple services failing',
|
|
551
|
+
'Rapid error propagation',
|
|
552
|
+
'Timeout storms',
|
|
553
|
+
'Complete outage from partial failure'
|
|
554
|
+
],
|
|
555
|
+
commonCauses: [
|
|
556
|
+
'Missing circuit breakers',
|
|
557
|
+
'Synchronous dependencies',
|
|
558
|
+
'No fallback mechanisms',
|
|
559
|
+
'Shared resource contention'
|
|
560
|
+
],
|
|
561
|
+
investigation: `
|
|
562
|
+
1. Map the failure propagation path
|
|
563
|
+
2. Identify the initial failure point
|
|
564
|
+
3. Check for missing isolation
|
|
565
|
+
4. Review timeout configurations
|
|
566
|
+
`,
|
|
567
|
+
prevention: [
|
|
568
|
+
'Implement circuit breakers',
|
|
569
|
+
'Add bulkhead patterns',
|
|
570
|
+
'Use async communication',
|
|
571
|
+
'Design for graceful degradation'
|
|
572
|
+
]
|
|
573
|
+
},
|
|
574
|
+
|
|
575
|
+
configurationErrors: {
|
|
576
|
+
symptoms: [
|
|
577
|
+
'Works in one environment, fails in another',
|
|
578
|
+
'Feature behaves unexpectedly',
|
|
579
|
+
'Connection failures',
|
|
580
|
+
'Permission denied'
|
|
581
|
+
],
|
|
582
|
+
commonCauses: [
|
|
583
|
+
'Environment variable mismatch',
|
|
584
|
+
'Missing or wrong credentials',
|
|
585
|
+
'Feature flag misconfiguration',
|
|
586
|
+
'Resource limits too low'
|
|
587
|
+
],
|
|
588
|
+
investigation: `
|
|
589
|
+
1. Compare configurations across environments
|
|
590
|
+
2. Check recent configuration changes
|
|
591
|
+
3. Verify secrets are present and correct
|
|
592
|
+
4. Review feature flag states
|
|
593
|
+
`,
|
|
594
|
+
prevention: [
|
|
595
|
+
'Use configuration validation',
|
|
596
|
+
'Implement config diffing',
|
|
597
|
+
'Require config reviews',
|
|
598
|
+
'Test with production-like config'
|
|
599
|
+
]
|
|
600
|
+
}
|
|
601
|
+
};
|
|
53
602
|
```
|
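A catalog like the one above can seed an investigation by matching observed symptoms against each pattern. The following is a minimal sketch, assuming abbreviated keyword lists and a hypothetical `rankPatterns` helper; the labels and keywords are illustrative and not part of the skill file.

```typescript
// Abbreviated, hypothetical keyword lists loosely based on the symptom lists above.
const symptomKeywords: Record<string, string[]> = {
  "resource exhaustion": ["out of memory", "pool exhausted", "file descriptor"],
  "race condition": ["intermittent", "deadlock", "lost update"],
  "cascading failure": ["multiple services", "timeout storm"],
  "configuration error": ["permission denied", "works in one environment"],
};

interface PatternMatch {
  pattern: string;
  hits: number;
}

// Count how many keywords of each pattern appear in the observed symptoms,
// then return the matching patterns with the most hits first.
function rankPatterns(observed: string[]): PatternMatch[] {
  const lowered = observed.map((s) => s.toLowerCase());
  return Object.keys(symptomKeywords)
    .map((pattern) => ({
      pattern,
      hits: symptomKeywords[pattern].filter((keyword) =>
        lowered.some((symptom) => symptom.includes(keyword))
      ).length,
    }))
    .filter((match) => match.hits > 0)
    .sort((a, b) => b.hits - a.hits);
}

const ranked = rankPatterns([
  "Out of memory errors on api-1",
  "Connection pool exhausted",
  "Permission denied reading config",
]);
console.log(ranked);
```

Keyword matching only suggests where to look first; each suggested pattern still needs evidence before it is treated as a cause.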
|
603
|
+
|
|
604
|
+
## Use Cases
|
|
605
|
+
|
|
606
|
+
### Production Incident Investigation
|
|
607
|
+
|
|
608
|
+
```typescript
|
|
609
|
+
async function investigateIncident(incidentId: string): Promise<RCAReport> {
|
|
610
|
+
const investigator = new RootCauseInvestigator();
|
|
611
|
+
|
|
612
|
+
// 1. Define the problem clearly
|
|
613
|
+
const incident = await getIncident(incidentId);
|
|
614
|
+
const problem = `${incident.summary} affecting ${incident.userCount} users`;
|
|
615
|
+
|
|
616
|
+
// 2. Gather evidence
|
|
617
|
+
await investigator.gatherEvidence(incident);
|
|
618
|
+
|
|
619
|
+
// 3. Generate and rank hypotheses
|
|
620
|
+
const hypotheses = investigator.generateHypotheses();
|
|
621
|
+
|
|
622
|
+
// 4. Test top hypotheses
|
|
623
|
+
for (const hypothesis of hypotheses.slice(0, 3)) {
|
|
624
|
+
const confirmed = await investigator.testHypothesis(hypothesis);
|
|
625
|
+
if (confirmed) {
|
|
626
|
+
break;
|
|
627
|
+
}
|
|
628
|
+
}
|
|
629
|
+
|
|
630
|
+
// 5. Document findings
|
|
631
|
+
return generateRCAReport(investigator);
|
|
632
|
+
}
|
|
633
|
+
```
|
|
634
|
+
|
|
635
|
+
## Best Practices
|
|
636
|
+
|
|
637
|
+
### Do's
|
|
638
|
+
|
|
639
|
+
- **Gather evidence first** before forming hypotheses
|
|
640
|
+
- **Use structured methods** (5 Whys, Fishbone) consistently
|
|
641
|
+
- **Involve multiple perspectives** for complex issues
|
|
642
|
+
- **Document everything** for future reference
|
|
643
|
+
- **Look for systemic factors** not just immediate causes
|
|
644
|
+
- **Create actionable recommendations** with owners and deadlines
|
|
645
|
+
- **Share learnings** across the organization
|
|
646
|
+
- **Verify fixes** actually prevent recurrence
|
|
647
|
+
|
|
648
|
+
### Don'ts
|
|
649
|
+
|
|
650
|
+
- Don't stop at the first answer - dig deeper
|
|
651
|
+
- Don't blame individuals - look for systemic issues
|
|
652
|
+
- Don't skip evidence gathering
|
|
653
|
+
- Don't accept "human error" as a root cause
|
|
654
|
+
- Don't confuse correlation with causation
|
|
655
|
+
- Don't rush to solutions before understanding the problem
|
|
656
|
+
- Don't ignore near-misses - investigate them too
|
|
657
|
+
- Don't let action items go untracked
|
|
658
|
+
|
|
659
|
+
## References
|
|
660
|
+
|
|
661
|
+
- [The Toyota Way: 5 Whys](https://en.wikipedia.org/wiki/Five_whys)
|
|
662
|
+
- [Ishikawa Diagram](https://en.wikipedia.org/wiki/Ishikawa_diagram)
|
|
663
|
+
- [Google SRE: Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
|
|
664
|
+
- [Learning from Incidents](https://www.learningfromincidents.io/)
|