omgkit 2.1.1 → 2.2.0

Files changed (50)
  1. package/package.json +1 -1
  2. package/plugin/skills/SKILL_STANDARDS.md +743 -0
  3. package/plugin/skills/databases/mongodb/SKILL.md +797 -28
  4. package/plugin/skills/databases/prisma/SKILL.md +776 -30
  5. package/plugin/skills/databases/redis/SKILL.md +885 -25
  6. package/plugin/skills/devops/aws/SKILL.md +686 -28
  7. package/plugin/skills/devops/github-actions/SKILL.md +684 -29
  8. package/plugin/skills/devops/kubernetes/SKILL.md +621 -24
  9. package/plugin/skills/frameworks/django/SKILL.md +920 -20
  10. package/plugin/skills/frameworks/express/SKILL.md +1361 -35
  11. package/plugin/skills/frameworks/fastapi/SKILL.md +1260 -33
  12. package/plugin/skills/frameworks/laravel/SKILL.md +1244 -31
  13. package/plugin/skills/frameworks/nestjs/SKILL.md +1005 -26
  14. package/plugin/skills/frameworks/rails/SKILL.md +594 -28
  15. package/plugin/skills/frameworks/spring/SKILL.md +528 -35
  16. package/plugin/skills/frameworks/vue/SKILL.md +1296 -27
  17. package/plugin/skills/frontend/accessibility/SKILL.md +1108 -34
  18. package/plugin/skills/frontend/frontend-design/SKILL.md +1304 -26
  19. package/plugin/skills/frontend/responsive/SKILL.md +847 -21
  20. package/plugin/skills/frontend/shadcn-ui/SKILL.md +976 -38
  21. package/plugin/skills/frontend/tailwindcss/SKILL.md +831 -35
  22. package/plugin/skills/frontend/threejs/SKILL.md +1298 -29
  23. package/plugin/skills/languages/javascript/SKILL.md +935 -31
  24. package/plugin/skills/methodology/brainstorming/SKILL.md +597 -23
  25. package/plugin/skills/methodology/defense-in-depth/SKILL.md +832 -34
  26. package/plugin/skills/methodology/dispatching-parallel-agents/SKILL.md +665 -31
  27. package/plugin/skills/methodology/executing-plans/SKILL.md +556 -24
  28. package/plugin/skills/methodology/finishing-development-branch/SKILL.md +595 -25
  29. package/plugin/skills/methodology/problem-solving/SKILL.md +429 -61
  30. package/plugin/skills/methodology/receiving-code-review/SKILL.md +536 -24
  31. package/plugin/skills/methodology/requesting-code-review/SKILL.md +632 -21
  32. package/plugin/skills/methodology/root-cause-tracing/SKILL.md +641 -30
  33. package/plugin/skills/methodology/sequential-thinking/SKILL.md +262 -3
  34. package/plugin/skills/methodology/systematic-debugging/SKILL.md +571 -32
  35. package/plugin/skills/methodology/test-driven-development/SKILL.md +779 -24
  36. package/plugin/skills/methodology/testing-anti-patterns/SKILL.md +691 -29
  37. package/plugin/skills/methodology/token-optimization/SKILL.md +598 -29
  38. package/plugin/skills/methodology/verification-before-completion/SKILL.md +543 -22
  39. package/plugin/skills/methodology/writing-plans/SKILL.md +590 -18
  40. package/plugin/skills/omega/omega-architecture/SKILL.md +838 -39
  41. package/plugin/skills/omega/omega-coding/SKILL.md +636 -39
  42. package/plugin/skills/omega/omega-sprint/SKILL.md +855 -48
  43. package/plugin/skills/omega/omega-testing/SKILL.md +940 -41
  44. package/plugin/skills/omega/omega-thinking/SKILL.md +703 -50
  45. package/plugin/skills/security/better-auth/SKILL.md +1065 -28
  46. package/plugin/skills/security/oauth/SKILL.md +968 -31
  47. package/plugin/skills/security/owasp/SKILL.md +894 -33
  48. package/plugin/skills/testing/playwright/SKILL.md +764 -38
  49. package/plugin/skills/testing/pytest/SKILL.md +873 -36
  50. package/plugin/skills/testing/vitest/SKILL.md +980 -35
@@ -1,53 +1,664 @@
1
1
  ---
2
2
  name: root-cause-tracing
3
- description: Finding root causes. Use when debugging complex issues.
3
+ description: Systematic root cause analysis using 5 Whys, Fishbone diagrams, and evidence-based investigation
4
+ category: methodology
5
+ triggers:
6
+ - root cause
7
+ - 5 whys
8
+ - fishbone diagram
9
+ - debugging
10
+ - incident analysis
11
+ - post mortem
12
+ - problem investigation
4
13
  ---
5
14
 
6
- # Root Cause Tracing Skill
15
+ # Root Cause Tracing
7
16
 
8
- ## 5 Whys
17
+ Master **systematic root cause analysis** to find the true underlying causes of problems, not just symptoms. This skill provides frameworks for investigating issues deeply and preventing recurrence.
18
+
19
+ ## Purpose
20
+
21
+ Find and eliminate true root causes:
22
+
23
+ - Distinguish symptoms from underlying causes
24
+ - Use structured investigation methodologies
25
+ - Trace causality chains to their origins
26
+ - Identify systemic factors that allow problems
27
+ - Prevent recurrence through proper fixes
28
+ - Document findings for organizational learning
29
+ - Build more resilient systems over time
30
+
31
+ ## Features
32
+
33
+ ### 1. The Root Cause Hierarchy
34
+
35
+ ```markdown
36
+ ## Understanding Cause Layers
37
+
38
+ ┌─────────────────────────────────────────────────────────────────────────┐
39
+ │ ROOT CAUSE HIERARCHY │
40
+ ├─────────────────────────────────────────────────────────────────────────┤
41
+ │ │
42
+ │ SYMPTOM │
43
+ │ └── What you observe: "App crashed" "Users can't login" │
44
+ │ │
45
+ │ PROXIMATE CAUSE │
46
+ │ └── Direct trigger: "Out of memory" "Database timeout" │
47
+ │ │
48
+ │ CONTRIBUTING FACTORS │
49
+ │ └── Conditions that enabled: "No memory limits" "Slow query" │
50
+ │ │
51
+ │ ROOT CAUSE │
52
+ │ └── Fundamental reason: "Memory leak in event handlers" │
53
+ │ │
54
+ │ SYSTEMIC FACTORS │
55
+ │ └── Why it wasn't caught: "No memory monitoring" "Missing tests" │
56
+ │ │
57
+ │ │
58
+ │ PRINCIPLE: Fix at the deepest level possible │
59
+ │ - Fixing symptoms: Problem returns │
60
+ │ - Fixing proximate cause: Similar problems emerge │
61
+ │ - Fixing root cause: This specific problem prevented │
62
+ │ - Fixing systemic factors: Entire class of problems prevented │
63
+ │ │
64
+ └─────────────────────────────────────────────────────────────────────────┘
9
65
  ```
10
- Problem: App crashed
11
- Why? Out of memory
12
- Why? → Memory leak
13
- Why? → Event listeners not cleaned
14
- Why? Missing cleanup in useEffect
15
- Why? → Developer unaware of cleanup pattern
16
- Root: Training/documentation gap
66
+
67
+ ### 2. The 5 Whys Technique
68
+
69
+ ```markdown
70
+ ## 5 Whys: Iterative Causality Tracing
71
+
72
+ ### The Method
73
+ Ask "Why?" repeatedly (typically 5 times) until you reach a fundamental cause.
74
+
75
+ ### Example: Production Outage
76
+
77
+ Problem: Website went down for 2 hours
78
+
79
+ Why #1: Why did the website go down?
80
+ → The application server ran out of memory and crashed.
81
+
82
+ Why #2: Why did it run out of memory?
83
+ → The number of active connections grew unbounded.
84
+
85
+ Why #3: Why did connections grow unbounded?
86
+ → Connection objects weren't being released after use.
87
+
88
+ Why #4: Why weren't connections being released?
89
+ → The cleanup code in the finally block wasn't being executed
90
+ due to an early return statement.
91
+
92
+ Why #5: Why wasn't this caught before production?
93
+ → No test existed for the connection cleanup path, and code review
94
+ missed the early return bypassing the finally block.
95
+
96
+ ### Root Causes Identified:
97
+ 1. Technical: Missing cleanup code execution (fix the bug)
98
+ 2. Systemic: Missing test coverage for cleanup paths (add tests)
99
+ 3. Process: Code review didn't catch early-return anti-pattern (add checklist)
17
100
  ```
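The Why #4 answer above turns on where the early return sits relative to the `try` block. In TypeScript (as in JavaScript), a `finally` block still runs when a `return` executes inside the matching `try`; it is bypassed only when the function returns before the `try` is entered. A minimal sketch, assuming a hypothetical in-memory `Pool` class (not part of omgkit) that only counts active connections:

```typescript
// Hypothetical pool used only for this illustration; it just counts connections.
class Pool {
  active = 0;
  acquire(): void { this.active++; }
  release(): void { this.active--; }
}

// Leaky: the early return happens after acquire() but before the try block,
// so the finally block is never entered and the connection is never released.
function handleLeaky(pool: Pool, cached: boolean): string {
  pool.acquire();
  if (cached) return "cache hit";
  try {
    return "queried";
  } finally {
    pool.release();
  }
}

// Fixed: every path after acquire() is inside try, so finally always runs.
function handleFixed(pool: Pool, cached: boolean): string {
  pool.acquire();
  try {
    if (cached) return "cache hit";
    return "queried";
  } finally {
    pool.release();
  }
}

const leakyPool = new Pool();
const fixedPool = new Pool();
for (let i = 0; i < 3; i++) {
  handleLeaky(leakyPool, true);
  handleFixed(fixedPool, true);
}
console.log(leakyPool.active, fixedPool.active); // prints "3 0"
```

A regression test for the cleanup path, as the Why #5 answer suggests, can assert that the active count returns to zero after every code path, including the early-return one.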
18
101
 
19
- ## Fishbone Diagram
102
+ ```typescript
103
+ /**
104
+ * 5 Whys Investigation Framework
105
+ */
106
+
107
+ interface WhyStep {
108
+ question: string;
109
+ answer: string;
110
+ evidence: string[];
111
+ confidence: 'confirmed' | 'likely' | 'hypothesis';
112
+ }
113
+
114
+ interface FiveWhysAnalysis {
115
+ problem: string;
116
+ impactAssessment: ImpactAssessment;
117
+ whys: WhyStep[];
118
+ rootCauses: RootCause[];
119
+ recommendations: Recommendation[];
120
+ }
121
+
122
+ function conductFiveWhys(problem: string): FiveWhysAnalysis {
123
+ const analysis: FiveWhysAnalysis = {
124
+ problem,
125
+ impactAssessment: assessImpact(problem),
126
+ whys: [],
127
+ rootCauses: [],
128
+ recommendations: []
129
+ };
130
+
131
+ let currentQuestion = `Why did ${problem}?`;
132
+ let depth = 0;
133
+
134
+ while (depth < 5) {
135
+ const answer = investigateQuestion(currentQuestion);
136
+ const evidence = gatherEvidence(answer);
137
+
138
+ analysis.whys.push({
139
+ question: currentQuestion,
140
+ answer: answer.text,
141
+ evidence: evidence.sources,
142
+ confidence: evidence.confidence
143
+ });
144
+
145
+ // Check if we've reached a root cause
146
+ if (isRootCause(answer)) {
147
+ analysis.rootCauses.push({
148
+ description: answer.text,
149
+ type: classifyRootCause(answer),
150
+ evidence: evidence
151
+ });
152
+ }
153
+
154
+ currentQuestion = `Why ${answer.text}?`;
155
+ depth++;
156
+ }
157
+
158
+ // Generate recommendations for each root cause
159
+ analysis.recommendations = analysis.rootCauses.map(rc =>
160
+ generateRecommendation(rc)
161
+ );
162
+
163
+ return analysis;
164
+ }
165
+ ```
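The framework above calls helpers such as `investigateQuestion` and `isRootCause` that this diff does not define, and its loop always runs five iterations. A minimal self-contained sketch of the same idea, assuming a hypothetical `investigate` callback that supplies each answer and flags when a fundamental cause has been reached (all names here are illustrative, not omgkit APIs):

```typescript
// Hypothetical types for this sketch; the diff does not define them.
interface WhyAnswer {
  text: string;
  isRoot: boolean; // assumption: the investigator marks a fundamental cause
}

interface WhyRecord {
  question: string;
  answer: string;
}

// `investigate` stands in for the human investigation behind each "Why?".
function traceWhys(
  problem: string,
  investigate: (question: string) => WhyAnswer | undefined,
  maxDepth = 5
): { whys: WhyRecord[]; rootCause?: string } {
  const whys: WhyRecord[] = [];
  let question = `Why did ${problem}?`;

  for (let depth = 0; depth < maxDepth; depth++) {
    const answer = investigate(question);
    if (!answer) break; // no further explanation available
    whys.push({ question, answer: answer.text });
    if (answer.isRoot) {
      return { whys, rootCause: answer.text }; // stop once a root cause is reached
    }
    question = `Why ${answer.text}?`;
  }
  return { whys }; // depth limit reached without a confirmed root cause
}

// Usage with canned answers keyed by the generated question text.
const cannedAnswers: Record<string, WhyAnswer> = {
  "Why did the server crash?": { text: "it ran out of memory", isRoot: false },
  "Why it ran out of memory?": { text: "connections were never released", isRoot: true },
};
const whyTrace = traceWhys("the server crash", (q) => cannedAnswers[q]);
console.log(whyTrace.rootCause); // prints "connections were never released"
```

Stopping early is a design choice in this sketch: "5" is a typical depth rather than a requirement, so the loop ends as soon as the callback reports a root cause and returns no `rootCause` when the depth limit is hit first.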
166
+
167
+ ### 3. Fishbone (Ishikawa) Diagram
168
+
169
+ ```markdown
170
+ ## Fishbone Diagram: Category-Based Analysis
171
+
172
+                     ┌──────────────────┐
173
+                     │     PROBLEM      │
174
+                     │  [Symptom Here]  │
175
+                     └────────┬─────────┘
176
+
177
+ ┌────────────────────────────┼────────────────────────────┐
178
+ │                            │                            │
179
+ │  PEOPLE                    │              PROCESS       │
180
+ │  ──────                    │              ───────       │
181
+ │  • Skill gaps              │              • Missing     │
182
+ │  • Communication           │                steps       │
183
+ │  • Training                │              • Unclear     │
184
+ │  • Handoffs                │                ownership   │
185
+ │          ╲                 │                 ╱          │
186
+ │           ╲                │                ╱           │
187
+ │            ╲               │               ╱            │
188
+ │             ╲──────────────┼──────────────╱             │
189
+ │              ╲             │             ╱              │
190
+ │               ╲            │            ╱               │
191
+ │                ╲           │           ╱                │
192
+ │                 ╲──────────┼──────────╱                 │
193
+ │                ╱           │           ╲                │
194
+ │               ╱            │            ╲               │
195
+ │              ╱             │             ╲              │
196
+ │             ╱──────────────┼──────────────╲             │
197
+ │            ╱               │               ╲            │
198
+ │           ╱                │                ╲           │
199
+ │          ╱                 │                 ╲          │
200
+ │  TECHNOLOGY                │              ENVIRONMENT   │
201
+ │  ──────────                │              ───────────   │
202
+ │  • Code bugs               │              • Load        │
203
+ │  • Dependencies            │              • Network     │
204
+ │  • Infrastructure          │              • Third-party │
205
+ │  • Configuration           │              • Timing      │
206
+ │                            │                            │
207
+ └────────────────────────────┴────────────────────────────┘
208
+
209
+ ## Software-Specific Categories
210
+
211
+ 1. CODE
212
+ - Logic errors
213
+ - Race conditions
214
+ - Memory leaks
215
+ - Error handling
216
+
217
+ 2. DATA
218
+ - Invalid input
219
+ - Corrupt data
220
+ - Missing data
221
+ - Schema mismatches
222
+
223
+ 3. CONFIGURATION
224
+ - Wrong settings
225
+ - Environment mismatch
226
+ - Secrets/credentials
227
+ - Feature flags
228
+
229
+ 4. INFRASTRUCTURE
230
+ - Resource exhaustion
231
+ - Network issues
232
+ - Service failures
233
+ - Scaling problems
234
+
235
+ 5. EXTERNAL
236
+ - Third-party APIs
237
+ - Dependencies
238
+ - User behavior
239
+ - Attack/abuse
240
+
241
+ 6. PROCESS
242
+ - Missing tests
243
+ - Review gaps
244
+ - Deployment issues
245
+ - Monitoring blind spots
20
246
  ```
21
- ┌─ Code
22
- ├─ Config
23
- Problem ────────────├─ Environment
24
- ├─ Data
25
- └─ Dependencies
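One way to apply the software-specific categories above is to sort candidate causes onto their "bones" and then check which categories nobody has examined yet. A minimal sketch, assuming hypothetical `CandidateCause` records whose category names mirror the six listed in the new version of the skill:

```typescript
// Category names mirror the six software-specific categories in the skill text.
type FishboneCategory =
  | "code" | "data" | "configuration"
  | "infrastructure" | "external" | "process";

const ALL_CATEGORIES: FishboneCategory[] = [
  "code", "data", "configuration", "infrastructure", "external", "process",
];

// Hypothetical record shape for a brainstormed cause.
interface CandidateCause {
  description: string;
  category: FishboneCategory;
}

// Group cause descriptions by category (one "bone" per category).
function groupByCategory(causes: CandidateCause[]): Map<FishboneCategory, string[]> {
  const bones = new Map<FishboneCategory, string[]>();
  for (const cause of causes) {
    const list = bones.get(cause.category) ?? [];
    list.push(cause.description);
    bones.set(cause.category, list);
  }
  return bones;
}

// Categories with no candidate causes yet: possible blind spots.
function unexploredCategories(bones: Map<FishboneCategory, string[]>): FishboneCategory[] {
  return ALL_CATEGORIES.filter((category) => !bones.has(category));
}

const fishbone = groupByCategory([
  { description: "Early return skips connection cleanup", category: "code" },
  { description: "No connection pool limit configured", category: "configuration" },
  { description: "No test for the cleanup path", category: "process" },
  { description: "Race between retry and release", category: "code" },
]);
console.log(unexploredCategories(fishbone)); // prints [ 'data', 'infrastructure', 'external' ]
```

An empty bone does not mean the category is innocent; it only shows where the investigation has not looked yet.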
247
+
248
+ ### 4. Evidence-Based Investigation
249
+
250
+ ```typescript
251
+ /**
252
+ * Evidence-Based Root Cause Investigation
253
+ * Every hypothesis must be backed by data
254
+ */
255
+
256
+ interface Evidence {
257
+ type: 'log' | 'metric' | 'trace' | 'reproduction' | 'testimony';
258
+ source: string;
259
+ timestamp?: Date;
260
+ data: unknown;
261
+ reliability: 'high' | 'medium' | 'low';
262
+ }
263
+
264
+ interface Hypothesis {
265
+ description: string;
266
+ category: RootCauseCategory;
267
+ evidence: {
268
+ supporting: Evidence[];
269
+ contradicting: Evidence[];
270
+ };
271
+ confidence: number; // 0-1
272
+ testPlan?: string;
273
+ }
274
+
275
+ class RootCauseInvestigator {
276
+ private hypotheses: Hypothesis[] = [];
277
+ private evidence: Evidence[] = [];
278
+
279
+ // Step 1: Gather all available evidence
280
+ async gatherEvidence(incident: Incident): Promise<Evidence[]> {
281
+ const evidence: Evidence[] = [];
282
+
283
+ // Collect logs around incident time
284
+ const logs = await this.queryLogs({
285
+ startTime: incident.startTime.minus({ minutes: 30 }),
286
+ endTime: incident.endTime.plus({ minutes: 30 }),
287
+ services: incident.affectedServices
288
+ });
289
+
290
+ evidence.push(...logs.map(log => ({
291
+ type: 'log' as const,
292
+ source: log.service,
293
+ timestamp: log.timestamp,
294
+ data: log.message,
295
+ reliability: 'high' as const
296
+ })));
297
+
298
+ // Collect metrics
299
+ const metrics = await this.queryMetrics({
300
+ metrics: ['error_rate', 'latency_p99', 'memory_usage', 'cpu_usage'],
301
+ startTime: incident.startTime.minus({ hours: 1 }),
302
+ endTime: incident.endTime.plus({ hours: 1 })
303
+ });
304
+
305
+ evidence.push(...metrics.map(m => ({
306
+ type: 'metric' as const,
307
+ source: m.name,
308
+ timestamp: m.timestamp,
309
+ data: m.value,
310
+ reliability: 'high' as const
311
+ })));
312
+
313
+ // Collect traces for affected requests
314
+ const traces = await this.queryTraces({
315
+ traceIds: incident.affectedTraceIds.slice(0, 100)
316
+ });
317
+
318
+ evidence.push(...traces.map(t => ({
319
+ type: 'trace' as const,
320
+ source: t.serviceName,
321
+ data: t.spans,
322
+ reliability: 'high' as const
323
+ })));
324
+
325
+ this.evidence = evidence;
326
+ return evidence;
327
+ }
328
+
329
+ // Step 2: Generate hypotheses based on evidence
330
+ generateHypotheses(): Hypothesis[] {
331
+ const hypotheses: Hypothesis[] = [];
332
+
333
+ // Analyze log patterns
334
+ const errorPatterns = this.findErrorPatterns(this.evidence);
335
+ for (const pattern of errorPatterns) {
336
+ hypotheses.push({
337
+ description: `Error caused by: ${pattern.summary}`,
338
+ category: this.categorizePattern(pattern),
339
+ evidence: {
340
+ supporting: pattern.matchingLogs,
341
+ contradicting: []
342
+ },
343
+ confidence: pattern.frequency / this.evidence.length,
344
+ testPlan: `Reproduce with: ${pattern.reproductionHint}`
345
+ });
346
+ }
347
+
348
+ // Analyze metric anomalies
349
+ const anomalies = this.findMetricAnomalies(this.evidence);
350
+ for (const anomaly of anomalies) {
351
+ hypotheses.push({
352
+ description: `Resource issue: ${anomaly.metric} ${anomaly.direction}`,
353
+ category: 'infrastructure',
354
+ evidence: {
355
+ supporting: [anomaly.evidence],
356
+ contradicting: []
357
+ },
358
+ confidence: anomaly.deviation > 3 ? 0.8 : 0.5
359
+ });
360
+ }
361
+
362
+ // Sort by confidence
363
+ this.hypotheses = hypotheses.sort((a, b) => b.confidence - a.confidence);
364
+ return this.hypotheses;
365
+ }
366
+
367
+ // Step 3: Test hypotheses
368
+ async testHypothesis(hypothesis: Hypothesis): Promise<boolean> {
369
+ if (!hypothesis.testPlan) {
370
+ throw new Error('Hypothesis has no test plan');
371
+ }
372
+
373
+ // Attempt to reproduce in safe environment
374
+ const result = await this.runReproduction(hypothesis.testPlan);
375
+
376
+ if (result.reproduced) {
377
+ hypothesis.confidence = Math.min(hypothesis.confidence + 0.3, 1);
378
+ hypothesis.evidence.supporting.push({
379
+ type: 'reproduction',
380
+ source: 'test-environment',
381
+ data: result,
382
+ reliability: 'high'
383
+ });
384
+ return true;
385
+ } else {
386
+ hypothesis.confidence *= 0.5;
387
+ return false;
388
+ }
389
+ }
390
+ }
26
391
  ```
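The confidence update inside `testHypothesis` is easier to reason about in isolation. The sketch below copies that rule (add 0.3 capped at 1 on a successful reproduction, halve otherwise) into a pure function; `updateConfidence` and `applyOutcomes` are hypothetical helper names, not part of the class above:

```typescript
// Same rule as testHypothesis: +0.3 (capped at 1) when reproduced, halved otherwise.
function updateConfidence(confidence: number, reproduced: boolean): number {
  return reproduced ? Math.min(confidence + 0.3, 1) : confidence * 0.5;
}

// Apply a sequence of reproduction outcomes to a starting confidence.
function applyOutcomes(start: number, outcomes: boolean[]): number {
  return outcomes.reduce(updateConfidence, start);
}

console.log(updateConfidence(0.8, true));         // prints 1 (0.8 + 0.3 is capped)
console.log(updateConfidence(0.5, false));        // prints 0.25
console.log(applyOutcomes(0.5, [false, false]));  // prints 0.125
console.log(applyOutcomes(0.8, [true, false]));   // prints 0.5
```

Under this rule a single failed reproduction outweighs a single success for any hypothesis above 0.6, so an anomaly-based hypothesis that starts at 0.8 drops to 0.4 after one failed attempt.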
27
392
 
28
- ## Categories
29
- 1. **Code** - Bug in logic
30
- 2. **Data** - Invalid input
31
- 3. **Config** - Wrong settings
32
- 4. **Environment** - System issues
33
- 5. **External** - Third-party failure
393
+ ### 5. Root Cause Analysis Template
34
394
 
35
- ## Output
36
395
  ```markdown
37
- ## Root Cause Analysis
396
+ ## Root Cause Analysis Report
397
+
398
+ ### Incident Summary
399
+ - **Incident ID:** [INC-XXXX]
400
+ - **Date/Time:** [When it occurred]
401
+ - **Duration:** [How long it lasted]
402
+ - **Severity:** [Critical/High/Medium/Low]
403
+ - **Affected Systems:** [What was impacted]
404
+ - **User Impact:** [How users were affected]
405
+
406
+ ---
407
+
408
+ ### Timeline
409
+
410
+ | Time | Event | Source |
411
+ |------|-------|--------|
412
+ | 09:00 | First error logged | Application logs |
413
+ | 09:05 | Alert triggered | Monitoring system |
414
+ | 09:15 | Investigation started | On-call engineer |
415
+ | 09:30 | Root cause identified | Log analysis |
416
+ | 09:45 | Fix deployed | Deployment system |
417
+ | 10:00 | Service restored | Health checks |
418
+
419
+ ---
38
420
 
39
421
  ### Symptom
40
- [What was observed]
422
+ **What was observed:**
423
+ [Describe the visible symptoms - error messages, user reports, alerts]
424
+
425
+ **Evidence:**
426
+ - [Log excerpt 1]
427
+ - [Metric screenshot 2]
428
+ - [User report 3]
429
+
430
+ ---
41
431
 
42
432
  ### Proximate Cause
43
- [Immediate cause]
433
+ **Immediate trigger:**
434
+ [What directly caused the symptom]
435
+
436
+ **Evidence:**
437
+ - [Supporting evidence]
438
+
439
+ ---
44
440
 
45
441
  ### Root Cause
46
- [Underlying reason]
442
+ **Underlying reason:**
443
+ [The fundamental cause that, if fixed, prevents recurrence]
444
+
445
+ **5 Whys Analysis:**
446
+ 1. Why [symptom]? → [answer]
447
+ 2. Why [answer 1]? → [answer]
448
+ 3. Why [answer 2]? → [answer]
449
+ 4. Why [answer 3]? → [answer]
450
+ 5. Why [answer 4]? → [ROOT CAUSE]
451
+
452
+ **Evidence:**
453
+ - [Code snippet showing the bug]
454
+ - [Configuration showing the misconfiguration]
455
+
456
+ ---
47
457
 
48
458
  ### Systemic Factors
49
- [Why it wasn't caught]
459
+ **Why wasn't this caught earlier?**
460
+
461
+ 1. **Testing Gap:** [What test would have caught this?]
462
+ 2. **Monitoring Gap:** [What alert would have warned us?]
463
+ 3. **Process Gap:** [What review would have prevented this?]
464
+
465
+ ---
466
+
467
+ ### Action Items
50
468
 
51
- ### Prevention
52
- [How to prevent recurrence]
469
+ | Action | Owner | Priority | Due Date | Status |
470
+ |--------|-------|----------|----------|--------|
471
+ | Fix the immediate bug | @engineer | P0 | Today | Done |
472
+ | Add regression test | @engineer | P1 | This week | In Progress |
473
+ | Add monitoring | @sre | P1 | This week | Not Started |
474
+ | Update runbook | @sre | P2 | Next week | Not Started |
475
+ | Add code review checklist item | @lead | P2 | Next sprint | Not Started |
476
+
477
+ ---
478
+
479
+ ### Lessons Learned
480
+
481
+ 1. **What we learned:** [Key insight]
482
+ 2. **What we'll do differently:** [Process change]
483
+ 3. **Similar risks to address:** [Other areas with same pattern]
484
+ ```
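The sample timeline in the template can be turned into the usual response metrics with simple arithmetic. A sketch, assuming same-day `HH:MM` timestamps and using the placeholder times from the table; the function and field names are hypothetical:

```typescript
// Convert "HH:MM" to minutes since midnight; throws on malformed input.
function toMinutes(hhmm: string): number {
  const match = /^(\d{2}):(\d{2})$/.exec(hhmm);
  if (!match) throw new Error(`Invalid time: ${hhmm}`);
  return Number(match[1]) * 60 + Number(match[2]);
}

interface TimelineEntry {
  time: string;
  event: string;
}

// Placeholder rows copied from the template's example timeline.
const sampleTimeline: TimelineEntry[] = [
  { time: "09:00", event: "First error logged" },
  { time: "09:05", event: "Alert triggered" },
  { time: "09:30", event: "Root cause identified" },
  { time: "10:00", event: "Service restored" },
];

// Minutes elapsed between two named events in the same timeline.
function minutesBetween(timeline: TimelineEntry[], fromEvent: string, toEvent: string): number {
  const from = timeline.find((entry) => entry.event === fromEvent);
  const to = timeline.find((entry) => entry.event === toEvent);
  if (!from || !to) throw new Error("Event not found in timeline");
  return toMinutes(to.time) - toMinutes(from.time);
}

const timeToDetect = minutesBetween(sampleTimeline, "First error logged", "Alert triggered");
const timeToDiagnose = minutesBetween(sampleTimeline, "First error logged", "Root cause identified");
const timeToRestore = minutesBetween(sampleTimeline, "First error logged", "Service restored");
console.log(timeToDetect, timeToDiagnose, timeToRestore); // prints "5 30 60"
```

Incidents that cross midnight would need full timestamps rather than `HH:MM` strings.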
485
+
486
+ ### 6. Common Root Cause Patterns
487
+
488
+ ```typescript
489
+ /**
490
+ * Common root cause patterns in software systems
491
+ */
492
+
493
+ const commonRootCausePatterns = {
494
+ resourceExhaustion: {
495
+ symptoms: [
496
+ 'Out of memory errors',
497
+ 'Connection pool exhausted',
498
+ 'File descriptor limit reached',
499
+ 'Thread pool saturation'
500
+ ],
501
+ commonCauses: [
502
+ 'Memory leaks (objects not garbage collected)',
503
+ 'Connection leaks (not closing connections)',
504
+ 'Unbounded queues or caches',
505
+ 'Missing resource limits'
506
+ ],
507
+ investigation: `
508
+ 1. Check resource usage metrics over time
509
+ 2. Look for steady growth patterns
510
+ 3. Identify what's holding resources
511
+ 4. Profile memory/connections in staging
512
+ `,
513
+ prevention: [
514
+ 'Set explicit resource limits',
515
+ 'Implement circuit breakers',
516
+ 'Add resource usage monitoring',
517
+ 'Use connection pooling with limits'
518
+ ]
519
+ },
520
+
521
+ racingConditions: {
522
+ symptoms: [
523
+ 'Intermittent failures',
524
+ 'Data inconsistency',
525
+ 'Deadlocks',
526
+ 'Lost updates'
527
+ ],
528
+ commonCauses: [
529
+ 'Missing synchronization',
530
+ 'Non-atomic operations',
531
+ 'Improper lock ordering',
532
+ 'Shared mutable state'
533
+ ],
534
+ investigation: `
535
+ 1. Look for concurrent access patterns
536
+ 2. Check for shared mutable state
537
+ 3. Review lock acquisition order
538
+ 4. Add detailed tracing to suspect areas
539
+ `,
540
+ prevention: [
541
+ 'Use immutable data structures',
542
+ 'Implement proper locking',
543
+ 'Use atomic operations',
544
+ 'Add concurrency tests'
545
+ ]
546
+ },
547
+
548
+ cascadingFailures: {
549
+ symptoms: [
550
+ 'Multiple services failing',
551
+ 'Rapid error propagation',
552
+ 'Timeout storms',
553
+ 'Complete outage from partial failure'
554
+ ],
555
+ commonCauses: [
556
+ 'Missing circuit breakers',
557
+ 'Synchronous dependencies',
558
+ 'No fallback mechanisms',
559
+ 'Shared resource contention'
560
+ ],
561
+ investigation: `
562
+ 1. Map the failure propagation path
563
+ 2. Identify the initial failure point
564
+ 3. Check for missing isolation
565
+ 4. Review timeout configurations
566
+ `,
567
+ prevention: [
568
+ 'Implement circuit breakers',
569
+ 'Add bulkhead patterns',
570
+ 'Use async communication',
571
+ 'Design for graceful degradation'
572
+ ]
573
+ },
574
+
575
+ configurationErrors: {
576
+ symptoms: [
577
+ 'Works in one environment, fails in another',
578
+ 'Feature behaves unexpectedly',
579
+ 'Connection failures',
580
+ 'Permission denied'
581
+ ],
582
+ commonCauses: [
583
+ 'Environment variable mismatch',
584
+ 'Missing or wrong credentials',
585
+ 'Feature flag misconfiguration',
586
+ 'Resource limits too low'
587
+ ],
588
+ investigation: `
589
+ 1. Compare configurations across environments
590
+ 2. Check recent configuration changes
591
+ 3. Verify secrets are present and correct
592
+ 4. Review feature flag states
593
+ `,
594
+ prevention: [
595
+ 'Use configuration validation',
596
+ 'Implement config diffing',
597
+ 'Require config reviews',
598
+ 'Test with production-like config'
599
+ ]
600
+ }
601
+ };
53
602
  ```
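A catalogue like the one above can be used as a first-pass lookup: count how many observed symptoms each pattern explains and start with the best match. A minimal sketch, assuming an abbreviated, hypothetical copy of two of the symptom lists and exact (case-insensitive) string matching:

```typescript
// Abbreviated copy of two symptom lists from the catalogue, for illustration only.
const patternSymptoms: Record<string, string[]> = {
  resourceExhaustion: ["Out of memory errors", "Connection pool exhausted"],
  cascadingFailures: ["Multiple services failing", "Timeout storms"],
};

interface PatternMatch {
  pattern: string;
  matches: number;
}

// Rank patterns by how many observed symptoms they list (case-insensitive exact match).
function rankPatterns(observed: string[], patterns: Record<string, string[]>): PatternMatch[] {
  const seen = new Set(observed.map((symptom) => symptom.toLowerCase()));
  return Object.entries(patterns)
    .map(([pattern, symptoms]) => ({
      pattern,
      matches: symptoms.filter((symptom) => seen.has(symptom.toLowerCase())).length,
    }))
    .filter((result) => result.matches > 0)
    .sort((a, b) => b.matches - a.matches);
}

const ranked = rankPatterns(
  ["timeout storms", "Out of memory errors", "Connection pool exhausted"],
  patternSymptoms
);
console.log(ranked.map((r) => `${r.pattern}:${r.matches}`).join(", "));
// prints "resourceExhaustion:2, cascadingFailures:1"
```

The ranking only suggests where to look first; each candidate pattern still needs the evidence-based checks listed under its `investigation` steps.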
603
+
604
+ ## Use Cases
605
+
606
+ ### Production Incident Investigation
607
+
608
+ ```typescript
609
+ async function investigateIncident(incidentId: string): Promise<RCAReport> {
610
+ const investigator = new RootCauseInvestigator();
611
+
612
+ // 1. Define the problem clearly
613
+ const incident = await getIncident(incidentId);
614
+ const problem = `${incident.summary} affecting ${incident.userCount} users`;
615
+
616
+ // 2. Gather evidence
617
+ await investigator.gatherEvidence(incident);
618
+
619
+ // 3. Generate and rank hypotheses
620
+ const hypotheses = investigator.generateHypotheses();
621
+
622
+ // 4. Test top hypotheses
623
+ for (const hypothesis of hypotheses.slice(0, 3)) {
624
+ const confirmed = await investigator.testHypothesis(hypothesis);
625
+ if (confirmed) {
626
+ break;
627
+ }
628
+ }
629
+
630
+ // 5. Document findings
631
+ return generateRCAReport(investigator);
632
+ }
633
+ ```
634
+
635
+ ## Best Practices
636
+
637
+ ### Do's
638
+
639
+ - **Gather evidence first** before forming hypotheses
640
+ - **Use structured methods** (5 Whys, Fishbone) consistently
641
+ - **Involve multiple perspectives** for complex issues
642
+ - **Document everything** for future reference
643
+ - **Look for systemic factors**, not just immediate causes
644
+ - **Create actionable recommendations** with owners and deadlines
645
+ - **Share learnings** across the organization
646
+ - **Verify fixes** actually prevent recurrence
647
+
648
+ ### Don'ts
649
+
650
+ - Don't stop at the first answer - dig deeper
651
+ - Don't blame individuals - look for systemic issues
652
+ - Don't skip evidence gathering
653
+ - Don't accept "human error" as a root cause
654
+ - Don't confuse correlation with causation
655
+ - Don't rush to solutions before understanding the problem
656
+ - Don't ignore near-misses - investigate them too
657
+ - Don't let action items go untracked
658
+
659
+ ## References
660
+
661
+ - [The Toyota Way: 5 Whys](https://en.wikipedia.org/wiki/Five_whys)
662
+ - [Ishikawa Diagram](https://en.wikipedia.org/wiki/Ishikawa_diagram)
663
+ - [Google SRE: Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
664
+ - [Learning from Incidents](https://www.learningfromincidents.io/)