omgkit 2.2.0 → 2.3.0

Files changed (55)
  1. package/package.json +1 -1
  2. package/plugin/skills/databases/mongodb/SKILL.md +60 -776
  3. package/plugin/skills/databases/prisma/SKILL.md +53 -744
  4. package/plugin/skills/databases/redis/SKILL.md +53 -860
  5. package/plugin/skills/devops/aws/SKILL.md +68 -672
  6. package/plugin/skills/devops/github-actions/SKILL.md +54 -657
  7. package/plugin/skills/devops/kubernetes/SKILL.md +67 -602
  8. package/plugin/skills/devops/performance-profiling/SKILL.md +59 -863
  9. package/plugin/skills/frameworks/django/SKILL.md +87 -853
  10. package/plugin/skills/frameworks/express/SKILL.md +95 -1301
  11. package/plugin/skills/frameworks/fastapi/SKILL.md +90 -1198
  12. package/plugin/skills/frameworks/laravel/SKILL.md +87 -1187
  13. package/plugin/skills/frameworks/nestjs/SKILL.md +106 -973
  14. package/plugin/skills/frameworks/react/SKILL.md +94 -962
  15. package/plugin/skills/frameworks/vue/SKILL.md +95 -1242
  16. package/plugin/skills/frontend/accessibility/SKILL.md +91 -1056
  17. package/plugin/skills/frontend/frontend-design/SKILL.md +69 -1262
  18. package/plugin/skills/frontend/responsive/SKILL.md +76 -799
  19. package/plugin/skills/frontend/shadcn-ui/SKILL.md +73 -921
  20. package/plugin/skills/frontend/tailwindcss/SKILL.md +60 -788
  21. package/plugin/skills/frontend/threejs/SKILL.md +72 -1266
  22. package/plugin/skills/languages/javascript/SKILL.md +106 -849
  23. package/plugin/skills/methodology/brainstorming/SKILL.md +70 -576
  24. package/plugin/skills/methodology/defense-in-depth/SKILL.md +79 -831
  25. package/plugin/skills/methodology/dispatching-parallel-agents/SKILL.md +81 -654
  26. package/plugin/skills/methodology/executing-plans/SKILL.md +86 -529
  27. package/plugin/skills/methodology/finishing-development-branch/SKILL.md +95 -586
  28. package/plugin/skills/methodology/problem-solving/SKILL.md +67 -681
  29. package/plugin/skills/methodology/receiving-code-review/SKILL.md +70 -533
  30. package/plugin/skills/methodology/requesting-code-review/SKILL.md +70 -610
  31. package/plugin/skills/methodology/root-cause-tracing/SKILL.md +70 -646
  32. package/plugin/skills/methodology/sequential-thinking/SKILL.md +70 -478
  33. package/plugin/skills/methodology/systematic-debugging/SKILL.md +66 -559
  34. package/plugin/skills/methodology/test-driven-development/SKILL.md +91 -752
  35. package/plugin/skills/methodology/testing-anti-patterns/SKILL.md +78 -687
  36. package/plugin/skills/methodology/token-optimization/SKILL.md +72 -602
  37. package/plugin/skills/methodology/verification-before-completion/SKILL.md +108 -529
  38. package/plugin/skills/methodology/writing-plans/SKILL.md +79 -566
  39. package/plugin/skills/omega/omega-architecture/SKILL.md +91 -752
  40. package/plugin/skills/omega/omega-coding/SKILL.md +161 -552
  41. package/plugin/skills/omega/omega-sprint/SKILL.md +132 -777
  42. package/plugin/skills/omega/omega-testing/SKILL.md +157 -845
  43. package/plugin/skills/omega/omega-thinking/SKILL.md +165 -606
  44. package/plugin/skills/security/better-auth/SKILL.md +46 -1034
  45. package/plugin/skills/security/oauth/SKILL.md +80 -934
  46. package/plugin/skills/security/owasp/SKILL.md +78 -862
  47. package/plugin/skills/testing/playwright/SKILL.md +77 -700
  48. package/plugin/skills/testing/pytest/SKILL.md +73 -811
  49. package/plugin/skills/testing/vitest/SKILL.md +60 -920
  50. package/plugin/skills/tools/document-processing/SKILL.md +111 -838
  51. package/plugin/skills/tools/image-processing/SKILL.md +126 -659
  52. package/plugin/skills/tools/mcp-development/SKILL.md +85 -758
  53. package/plugin/skills/tools/media-processing/SKILL.md +118 -735
  54. package/plugin/stdrules/SKILL_STANDARDS.md +490 -0
  55. package/plugin/skills/SKILL_STANDARDS.md +0 -743
@@ -1,664 +1,88 @@
  ---
- name: root-cause-tracing
- description: Systematic root cause analysis using 5 Whys, Fishbone diagrams, and evidence-based investigation
- category: methodology
- triggers:
- - root cause
- - 5 whys
- - fishbone diagram
- - debugging
- - incident analysis
- - post mortem
- - problem investigation
+ name: tracing-root-causes
+ description: AI agent performs systematic root cause analysis using 5 Whys, Fishbone diagrams, and evidence-based investigation. Use when debugging, conducting post-mortems, or investigating incidents.
  ---
 
- # Root Cause Tracing
+ # Tracing Root Causes
 
- Master **systematic root cause analysis** to find the true underlying causes of problems, not just symptoms. This skill provides frameworks for investigating issues deeply and preventing recurrence.
+ ## Quick Start
 
- ## Purpose
-
- Find and eliminate true root causes:
-
- - Distinguish symptoms from underlying causes
- - Use structured investigation methodologies
- - Trace causality chains to their origins
- - Identify systemic factors that allow problems
- - Prevent recurrence through proper fixes
- - Document findings for organizational learning
- - Build more resilient systems over time
+ 1. **Identify Symptom** - Document the observable problem
+ 2. **Gather Evidence** - Collect logs, metrics, traces around incident
+ 3. **Apply 5 Whys** - Ask "Why?" iteratively until fundamental cause found
+ 4. **Map Categories** - Use Fishbone to explore all cause categories
+ 5. **Document Findings** - Create RCA report with action items
 
  ## Features
 
- ### 1. The Root Cause Hierarchy
-
- ```markdown
- ## Understanding Cause Layers
-
- ┌─────────────────────────────────────────────────────────────────────────┐
- │ ROOT CAUSE HIERARCHY │
- ├─────────────────────────────────────────────────────────────────────────┤
- │ │
- │ SYMPTOM │
- │ └── What you observe: "App crashed" "Users can't login" │
- │ │
- │ PROXIMATE CAUSE │
- │ └── Direct trigger: "Out of memory" "Database timeout" │
- │ │
- │ CONTRIBUTING FACTORS │
- │ └── Conditions that enabled: "No memory limits" "Slow query" │
- │ │
- │ ROOT CAUSE │
- │ └── Fundamental reason: "Memory leak in event handlers" │
- │ │
- │ SYSTEMIC FACTORS │
- │ └── Why it wasn't caught: "No memory monitoring" "Missing tests" │
- │ │
- │ │
- │ PRINCIPLE: Fix at the deepest level possible │
- │ - Fixing symptoms: Problem returns │
- │ - Fixing proximate cause: Similar problems emerge │
- │ - Fixing root cause: This specific problem prevented │
- │ - Fixing systemic factors: Entire class of problems prevented │
- │ │
- └─────────────────────────────────────────────────────────────────────────┘
- ```
-
- ### 2. The 5 Whys Technique
-
- ```markdown
- ## 5 Whys: Iterative Causality Tracing
-
- ### The Method
- Ask "Why?" repeatedly (typically 5 times) until you reach a fundamental cause.
-
- ### Example: Production Outage
-
- Problem: Website went down for 2 hours
-
- Why #1: Why did the website go down?
- → The application server ran out of memory and crashed.
-
- Why #2: Why did it run out of memory?
- → The number of active connections grew unbounded.
-
- Why #3: Why did connections grow unbounded?
- → Connection objects weren't being released after use.
-
- Why #4: Why weren't connections being released?
- → The cleanup code in the finally block wasn't being executed
- due to an early return statement.
-
- Why #5: Why wasn't this caught before production?
- → No test existed for the connection cleanup path, and code review
- missed the early return bypassing the finally block.
-
- ### Root Causes Identified:
- 1. Technical: Missing cleanup code execution (fix the bug)
- 2. Systemic: Missing test coverage for cleanup paths (add tests)
- 3. Process: Code review didn't catch early-return anti-pattern (add checklist)
- ```
-
- ```typescript
- /**
- * 5 Whys Investigation Framework
- */
-
- interface WhyStep {
- question: string;
- answer: string;
- evidence: string[];
- confidence: 'confirmed' | 'likely' | 'hypothesis';
- }
-
- interface FiveWhysAnalysis {
- problem: string;
- impactAssessment: ImpactAssessment;
- whys: WhyStep[];
- rootCauses: RootCause[];
- recommendations: Recommendation[];
- }
-
- function conductFiveWhys(problem: string): FiveWhysAnalysis {
- const analysis: FiveWhysAnalysis = {
- problem,
- impactAssessment: assessImpact(problem),
- whys: [],
- rootCauses: [],
- recommendations: []
- };
-
- let currentQuestion = `Why did ${problem}?`;
- let depth = 0;
-
- while (depth < 5) {
- const answer = investigateQuestion(currentQuestion);
- const evidence = gatherEvidence(answer);
-
- analysis.whys.push({
- question: currentQuestion,
- answer: answer.text,
- evidence: evidence.sources,
- confidence: evidence.confidence
- });
-
- // Check if we've reached a root cause
- if (isRootCause(answer)) {
- analysis.rootCauses.push({
- description: answer.text,
- type: classifyRootCause(answer),
- evidence: evidence
- });
- }
-
- currentQuestion = `Why ${answer.text}?`;
- depth++;
- }
-
- // Generate recommendations for each root cause
- analysis.recommendations = analysis.rootCauses.map(rc =>
- generateRecommendation(rc)
- );
-
- return analysis;
- }
- ```
-
- ### 3. Fishbone (Ishikawa) Diagram
-
- ```markdown
- ## Fishbone Diagram: Category-Based Analysis
-
- ┌──────────────────┐
- │ PROBLEM │
- │ [Symptom Here] │
- └────────┬─────────┘
-
- ┌────────────────────────────┼────────────────────────────┐
- │ │ │
- │ PEOPLE │ PROCESS │
- │ ────── │ ─────── │
- │ • Skill gaps │ • Missing │
- │ • Communication │ steps │
- │ • Training │ • Unclear │
- │ • Handoffs │ ownership │
- │ ╲ │ ╱ │
- │ ╲ │ ╱ │
- │ ╲ │ ╱ │
- │ ╲──────────────┼──────────────╱ │
- │ ╲ │ ╱ │
- │ ╲ │ ╱ │
- │ ╲ │ ╱ │
- │ ╲──────────┼──────────╱ │
- │ ╱ │ ╲ │
- │ ╱ │ ╲ │
- │ ╱ │ ╲ │
- │ ╱──────────────┼──────────────╲ │
- │ ╱ │ ╲ │
- │ ╱ │ ╲ │
- │ ╱ │ ╲ │
- │ TECHNOLOGY │ ENVIRONMENT │
- │ ────────── │ ─────────── │
- │ • Code bugs │ • Load │
- │ • Dependencies │ • Network │
- │ • Infrastructure │ • Third-party │
- │ • Configuration │ • Timing │
- │ │ │
- └────────────────────────────┴────────────────────────────┘
-
- ## Software-Specific Categories
-
- 1. CODE
- - Logic errors
- - Race conditions
- - Memory leaks
- - Error handling
-
- 2. DATA
- - Invalid input
- - Corrupt data
- - Missing data
- - Schema mismatches
-
- 3. CONFIGURATION
- - Wrong settings
- - Environment mismatch
- - Secrets/credentials
- - Feature flags
-
- 4. INFRASTRUCTURE
- - Resource exhaustion
- - Network issues
- - Service failures
- - Scaling problems
-
- 5. EXTERNAL
- - Third-party APIs
- - Dependencies
- - User behavior
- - Attack/abuse
-
- 6. PROCESS
- - Missing tests
- - Review gaps
- - Deployment issues
- - Monitoring blind spots
- ```
-
- ### 4. Evidence-Based Investigation
-
- ```typescript
- /**
- * Evidence-Based Root Cause Investigation
- * Every hypothesis must be backed by data
- */
-
- interface Evidence {
- type: 'log' | 'metric' | 'trace' | 'reproduction' | 'testimony';
- source: string;
- timestamp?: Date;
- data: unknown;
- reliability: 'high' | 'medium' | 'low';
- }
-
- interface Hypothesis {
- description: string;
- category: RootCauseCategory;
- evidence: {
- supporting: Evidence[];
- contradicting: Evidence[];
- };
- confidence: number; // 0-1
- testPlan?: string;
- }
-
- class RootCauseInvestigator {
- private hypotheses: Hypothesis[] = [];
- private evidence: Evidence[] = [];
-
- // Step 1: Gather all available evidence
- async gatherEvidence(incident: Incident): Promise<Evidence[]> {
- const evidence: Evidence[] = [];
-
- // Collect logs around incident time
- const logs = await this.queryLogs({
- startTime: incident.startTime.minus({ minutes: 30 }),
- endTime: incident.endTime.plus({ minutes: 30 }),
- services: incident.affectedServices
- });
-
- evidence.push(...logs.map(log => ({
- type: 'log' as const,
- source: log.service,
- timestamp: log.timestamp,
- data: log.message,
- reliability: 'high' as const
- })));
-
- // Collect metrics
- const metrics = await this.queryMetrics({
- metrics: ['error_rate', 'latency_p99', 'memory_usage', 'cpu_usage'],
- startTime: incident.startTime.minus({ hours: 1 }),
- endTime: incident.endTime.plus({ hours: 1 })
- });
-
- evidence.push(...metrics.map(m => ({
- type: 'metric' as const,
- source: m.name,
- timestamp: m.timestamp,
- data: m.value,
- reliability: 'high' as const
- })));
-
- // Collect traces for affected requests
- const traces = await this.queryTraces({
- traceIds: incident.affectedTraceIds.slice(0, 100)
- });
-
- evidence.push(...traces.map(t => ({
- type: 'trace' as const,
- source: t.serviceName,
- data: t.spans,
- reliability: 'high' as const
- })));
-
- this.evidence = evidence;
- return evidence;
- }
-
- // Step 2: Generate hypotheses based on evidence
- generateHypotheses(): Hypothesis[] {
- const hypotheses: Hypothesis[] = [];
+ | Feature | Description | Guide |
+ |---------|-------------|-------|
+ | Cause Hierarchy | Symptom -> Proximate -> Root -> Systemic | Fix at deepest level possible |
+ | 5 Whys | Iterative "Why?" questioning | Typically 5 iterations to root cause |
+ | Fishbone Diagram | Category-based cause exploration | Code, Data, Config, Infra, External, Process |
+ | Evidence Gathering | Logs, metrics, traces, reproduction | Timestamp, source, reliability rating |
+ | RCA Report | Structured documentation | Timeline, cause chain, action items |
+ | Systemic Factors | Why wasn't this caught earlier? | Testing, monitoring, process gaps |
 
- // Analyze log patterns
- const errorPatterns = this.findErrorPatterns(this.evidence);
- for (const pattern of errorPatterns) {
- hypotheses.push({
- description: `Error caused by: ${pattern.summary}`,
- category: this.categorizePattern(pattern),
- evidence: {
- supporting: pattern.matchingLogs,
- contradicting: []
- },
- confidence: pattern.frequency / this.evidence.length,
- testPlan: `Reproduce with: ${pattern.reproductionHint}`
- });
- }
+ ## Common Patterns
 
- // Analyze metric anomalies
- const anomalies = this.findMetricAnomalies(this.evidence);
- for (const anomaly of anomalies) {
- hypotheses.push({
- description: `Resource issue: ${anomaly.metric} ${anomaly.direction}`,
- category: 'infrastructure',
- evidence: {
- supporting: [anomaly.evidence],
- contradicting: []
- },
- confidence: anomaly.deviation > 3 ? 0.8 : 0.5
- });
- }
-
- // Sort by confidence
- this.hypotheses = hypotheses.sort((a, b) => b.confidence - a.confidence);
- return this.hypotheses;
- }
-
- // Step 3: Test hypotheses
- async testHypothesis(hypothesis: Hypothesis): Promise<boolean> {
- if (!hypothesis.testPlan) {
- throw new Error('Hypothesis has no test plan');
- }
-
- // Attempt to reproduce in safe environment
- const result = await this.runReproduction(hypothesis.testPlan);
-
- if (result.reproduced) {
- hypothesis.confidence = Math.min(hypothesis.confidence + 0.3, 1);
- hypothesis.evidence.supporting.push({
- type: 'reproduction',
- source: 'test-environment',
- data: result,
- reliability: 'high'
- });
- return true;
- } else {
- hypothesis.confidence *= 0.5;
- return false;
- }
- }
- }
  ```
-
- ### 5. Root Cause Analysis Template
-
- ```markdown
- ## Root Cause Analysis Report
-
- ### Incident Summary
- - **Incident ID:** [INC-XXXX]
- - **Date/Time:** [When it occurred]
- - **Duration:** [How long it lasted]
- - **Severity:** [Critical/High/Medium/Low]
- - **Affected Systems:** [What was impacted]
- - **User Impact:** [How users were affected]
-
- ---
-
- ### Timeline
-
- | Time | Event | Source |
- |------|-------|--------|
- | 09:00 | First error logged | Application logs |
- | 09:05 | Alert triggered | Monitoring system |
- | 09:15 | Investigation started | On-call engineer |
- | 09:30 | Root cause identified | Log analysis |
- | 09:45 | Fix deployed | Deployment system |
- | 10:00 | Service restored | Health checks |
-
- ---
-
- ### Symptom
- **What was observed:**
- [Describe the visible symptoms - error messages, user reports, alerts]
-
- **Evidence:**
- - [Log excerpt 1]
- - [Metric screenshot 2]
- - [User report 3]
-
- ---
-
- ### Proximate Cause
- **Immediate trigger:**
- [What directly caused the symptom]
-
- **Evidence:**
- - [Supporting evidence]
-
- ---
-
- ### Root Cause
- **Underlying reason:**
- [The fundamental cause that, if fixed, prevents recurrence]
-
- **5 Whys Analysis:**
- 1. Why [symptom]? → [answer]
- 2. Why [answer 1]? → [answer]
- 3. Why [answer 2]? → [answer]
- 4. Why [answer 3]? → [answer]
- 5. Why [answer 4]? → [ROOT CAUSE]
-
- **Evidence:**
- - [Code snippet showing the bug]
- - [Configuration showing the misconfiguration]
-
- ---
-
- ### Systemic Factors
- **Why wasn't this caught earlier?**
-
- 1. **Testing Gap:** [What test would have caught this?]
- 2. **Monitoring Gap:** [What alert would have warned us?]
- 3. **Process Gap:** [What review would have prevented this?]
-
- ---
-
- ### Action Items
-
- | Action | Owner | Priority | Due Date | Status |
- |--------|-------|----------|----------|--------|
- | Fix the immediate bug | @engineer | P0 | Today | Done |
- | Add regression test | @engineer | P1 | This week | In Progress |
- | Add monitoring | @sre | P1 | This week | Not Started |
- | Update runbook | @sre | P2 | Next week | Not Started |
- | Add code review checklist item | @lead | P2 | Next sprint | Not Started |
-
- ---
-
- ### Lessons Learned
-
- 1. **What we learned:** [Key insight]
- 2. **What we'll do differently:** [Process change]
- 3. **Similar risks to address:** [Other areas with same pattern]
+ # 5 Whys Example
+ Problem: Website down for 2 hours
+
+ Why #1: Why down? -> Server out of memory
+ Why #2: Why out of memory? -> Connections unbounded
+ Why #3: Why unbounded? -> Not released after use
+ Why #4: Why not released? -> Early return skipped finally
+ Why #5: Why not caught? -> No test for cleanup path
+
+ Root Causes:
+ 1. Technical: Missing cleanup execution
+ 2. Systemic: Missing test coverage
+ 3. Process: Code review missed pattern
+
+ # Fishbone Categories (Software)
+ CODE: Logic errors, race conditions, memory leaks
+ DATA: Invalid input, corrupt data, schema mismatch
+ CONFIG: Wrong settings, env mismatch, feature flags
+ INFRA: Resource exhaustion, network, scaling
+ EXTERNAL: Third-party APIs, dependencies, attacks
+ PROCESS: Missing tests, review gaps, monitoring blind spots
  ```
 
- ### 6. Common Root Cause Patterns
-
- ```typescript
- /**
- * Common root cause patterns in software systems
- */
-
- const commonRootCausePatterns = {
- resourceExhaustion: {
- symptoms: [
- 'Out of memory errors',
- 'Connection pool exhausted',
- 'File descriptor limit reached',
- 'Thread pool saturation'
- ],
- commonCauses: [
- 'Memory leaks (objects not garbage collected)',
- 'Connection leaks (not closing connections)',
- 'Unbounded queues or caches',
- 'Missing resource limits'
- ],
- investigation: `
- 1. Check resource usage metrics over time
- 2. Look for steady growth patterns
- 3. Identify what's holding resources
- 4. Profile memory/connections in staging
- `,
- prevention: [
- 'Set explicit resource limits',
- 'Implement circuit breakers',
- 'Add resource usage monitoring',
- 'Use connection pooling with limits'
- ]
- },
-
- racingConditions: {
- symptoms: [
- 'Intermittent failures',
- 'Data inconsistency',
- 'Deadlocks',
- 'Lost updates'
- ],
- commonCauses: [
- 'Missing synchronization',
- 'Non-atomic operations',
- 'Improper lock ordering',
- 'Shared mutable state'
- ],
- investigation: `
- 1. Look for concurrent access patterns
- 2. Check for shared mutable state
- 3. Review lock acquisition order
- 4. Add detailed tracing to suspect areas
- `,
- prevention: [
- 'Use immutable data structures',
- 'Implement proper locking',
- 'Use atomic operations',
- 'Add concurrency tests'
- ]
- },
-
- cascadingFailures: {
- symptoms: [
- 'Multiple services failing',
- 'Rapid error propagation',
- 'Timeout storms',
- 'Complete outage from partial failure'
- ],
- commonCauses: [
- 'Missing circuit breakers',
- 'Synchronous dependencies',
- 'No fallback mechanisms',
- 'Shared resource contention'
- ],
- investigation: `
- 1. Map the failure propagation path
- 2. Identify the initial failure point
- 3. Check for missing isolation
- 4. Review timeout configurations
- `,
- prevention: [
- 'Implement circuit breakers',
- 'Add bulkhead patterns',
- 'Use async communication',
- 'Design for graceful degradation'
- ]
- },
-
- configurationErrors: {
- symptoms: [
- 'Works in one environment, fails in another',
- 'Feature behaves unexpectedly',
- 'Connection failures',
- 'Permission denied'
- ],
- commonCauses: [
- 'Environment variable mismatch',
- 'Missing or wrong credentials',
- 'Feature flag misconfiguration',
- 'Resource limits too low'
- ],
- investigation: `
- 1. Compare configurations across environments
- 2. Check recent configuration changes
- 3. Verify secrets are present and correct
- 4. Review feature flag states
- `,
- prevention: [
- 'Use configuration validation',
- 'Implement config diffing',
- 'Require config reviews',
- 'Test with production-like config'
- ]
- }
- };
  ```
-
- ## Use Cases
-
- ### Production Incident Investigation
-
- ```typescript
- async function investigateIncident(incidentId: string): Promise<RCAReport> {
- const investigator = new RootCauseInvestigator();
-
- // 1. Define the problem clearly
- const incident = await getIncident(incidentId);
- const problem = `${incident.summary} affecting ${incident.userCount} users`;
-
- // 2. Gather evidence
- await investigator.gatherEvidence(incident);
-
- // 3. Generate and rank hypotheses
- const hypotheses = investigator.generateHypotheses();
-
- // 4. Test top hypotheses
- for (const hypothesis of hypotheses.slice(0, 3)) {
- const confirmed = await investigator.testHypothesis(hypothesis);
- if (confirmed) {
- break;
- }
- }
-
- // 5. Document findings
- return generateRCAReport(investigator);
- }
+ # Cause Hierarchy
+ SYMPTOM: "App crashed"
+ |
+ PROXIMATE CAUSE: "Out of memory"
+ |
+ CONTRIBUTING FACTOR: "No memory limits"
+ |
+ ROOT CAUSE: "Memory leak in event handlers"
+ |
+ SYSTEMIC FACTOR: "No memory monitoring"
+
+ PRINCIPLE: Fix symptoms = problem returns
+ Fix root cause = this problem prevented
+ Fix systemic = class of problems prevented
  ```
 
  ## Best Practices
 
- ### Do's
-
- - **Gather evidence first** before forming hypotheses
- - **Use structured methods** (5 Whys, Fishbone) consistently
- - **Involve multiple perspectives** for complex issues
- - **Document everything** for future reference
- - **Look for systemic factors** not just immediate causes
- - **Create actionable recommendations** with owners and deadlines
- - **Share learnings** across the organization
- - **Verify fixes** actually prevent recurrence
-
- ### Don'ts
-
- - Don't stop at the first answer - dig deeper
- - Don't blame individuals - look for systemic issues
- - Don't skip evidence gathering
- - Don't accept "human error" as root cause
- - Don't confuse correlation with causation
- - Don't rush to solutions before understanding the problem
- - Don't ignore near-misses - investigate them too
- - Don't let action items go untracked
-
- ## References
-
- - [The Toyota Way: 5 Whys](https://en.wikipedia.org/wiki/Five_whys)
- - [Ishikawa Diagram](https://en.wikipedia.org/wiki/Ishikawa_diagram)
- - [Google SRE: Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
- - [Learning from Incidents](https://www.learningfromincidents.io/)
+ | Do | Avoid |
+ |----|-------|
+ | Gather evidence before forming hypotheses | Jumping to conclusions |
+ | Use structured methods consistently | Ad-hoc investigation |
+ | Involve multiple perspectives | Single viewpoint |
+ | Look for systemic factors | Just fixing immediate cause |
+ | Create actionable recommendations | Vague "be more careful" |
+ | Verify fixes prevent recurrence | Assuming fix works |
+ | Share learnings across team | Siloing knowledge |
+ | Investigate near-misses too | Only investigating failures |
+
+ ## Related Skills
+
+ - `debugging-systematically` - Four-phase debugging process
+ - `solving-problems` - 5-phase problem-solving framework
+ - `thinking-sequentially` - Numbered thought chains
+ - `verifying-before-completion` - Ensure fix completeness
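
Reader's note (not part of the package diff): the 2.3.0 version of this skill drops the TypeScript examples and keeps only the condensed 5 Whys chain and Fishbone category list. The sketch below is one possible way to represent that chain as data, reusing the example values from the added lines. Every name in it (`WhyStep`, `RootCause`, `buildWhyChain`, `countByCategory`) is hypothetical and does not come from omgkit, and the mapping of each root cause to a Fishbone category is an assumption made for illustration.

```typescript
// Hypothetical types for illustration only; none of these names are omgkit APIs.
type Confidence = "confirmed" | "likely" | "hypothesis";

interface WhyStep {
  question: string;
  answer: string;
  confidence: Confidence;
}

// Category labels taken from the "Fishbone Categories (Software)" list in the diff.
type FishboneCategory = "CODE" | "DATA" | "CONFIG" | "INFRA" | "EXTERNAL" | "PROCESS";

interface RootCause {
  kind: "Technical" | "Systemic" | "Process";
  category: FishboneCategory; // assumed mapping, not stated in the source
  description: string;
}

// Build a 5 Whys chain: each question after the first asks "why" about the previous answer.
function buildWhyChain(
  problem: string,
  answers: string[],
  confidence: Confidence = "hypothesis"
): WhyStep[] {
  const steps: WhyStep[] = [];
  let subject = problem;
  for (const answer of answers) {
    steps.push({ question: `Why: ${subject}?`, answer, confidence });
    subject = answer;
  }
  return steps;
}

// Count root causes per Fishbone category to show where fixes cluster.
function countByCategory(causes: RootCause[]): Partial<Record<FishboneCategory, number>> {
  const counts: Partial<Record<FishboneCategory, number>> = {};
  for (const cause of causes) {
    counts[cause.category] = (counts[cause.category] ?? 0) + 1;
  }
  return counts;
}

// Example values copied from the "5 Whys Example" block in the diff.
const chain = buildWhyChain("Website down for 2 hours", [
  "Server out of memory",
  "Connections unbounded",
  "Not released after use",
  "Early return skipped finally",
  "No test for cleanup path",
]);

const rootCauses: RootCause[] = [
  { kind: "Technical", category: "CODE", description: "Missing cleanup execution" },
  { kind: "Systemic", category: "PROCESS", description: "Missing test coverage" },
  { kind: "Process", category: "PROCESS", description: "Code review missed pattern" },
];

chain.forEach((step, i) => {
  console.log(`Why #${i + 1}: ${step.question} -> ${step.answer}`);
});
console.log(countByCategory(rootCauses));
```

Keeping each step's `confidence` at `"hypothesis"` until evidence confirms it follows the "Gather evidence before forming hypotheses" row of the Best Practices table. A real investigation would attach logs, metrics, or traces to each step.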