omgkit 2.2.0 → 2.3.1

This diff shows the changes between publicly released package versions as they appear in their public registries. It is provided for informational purposes only.
Files changed (60)
  1. package/README.md +3 -3
  2. package/package.json +1 -1
  3. package/plugin/skills/databases/database-management/SKILL.md +288 -0
  4. package/plugin/skills/databases/database-migration/SKILL.md +285 -0
  5. package/plugin/skills/databases/database-schema-design/SKILL.md +195 -0
  6. package/plugin/skills/databases/mongodb/SKILL.md +60 -776
  7. package/plugin/skills/databases/prisma/SKILL.md +53 -744
  8. package/plugin/skills/databases/redis/SKILL.md +53 -860
  9. package/plugin/skills/databases/supabase/SKILL.md +283 -0
  10. package/plugin/skills/devops/aws/SKILL.md +68 -672
  11. package/plugin/skills/devops/github-actions/SKILL.md +54 -657
  12. package/plugin/skills/devops/kubernetes/SKILL.md +67 -602
  13. package/plugin/skills/devops/performance-profiling/SKILL.md +59 -863
  14. package/plugin/skills/frameworks/django/SKILL.md +87 -853
  15. package/plugin/skills/frameworks/express/SKILL.md +95 -1301
  16. package/plugin/skills/frameworks/fastapi/SKILL.md +90 -1198
  17. package/plugin/skills/frameworks/laravel/SKILL.md +87 -1187
  18. package/plugin/skills/frameworks/nestjs/SKILL.md +106 -973
  19. package/plugin/skills/frameworks/react/SKILL.md +94 -962
  20. package/plugin/skills/frameworks/vue/SKILL.md +95 -1242
  21. package/plugin/skills/frontend/accessibility/SKILL.md +91 -1056
  22. package/plugin/skills/frontend/frontend-design/SKILL.md +69 -1262
  23. package/plugin/skills/frontend/responsive/SKILL.md +76 -799
  24. package/plugin/skills/frontend/shadcn-ui/SKILL.md +73 -921
  25. package/plugin/skills/frontend/tailwindcss/SKILL.md +60 -788
  26. package/plugin/skills/frontend/threejs/SKILL.md +72 -1266
  27. package/plugin/skills/languages/javascript/SKILL.md +106 -849
  28. package/plugin/skills/methodology/brainstorming/SKILL.md +70 -576
  29. package/plugin/skills/methodology/defense-in-depth/SKILL.md +79 -831
  30. package/plugin/skills/methodology/dispatching-parallel-agents/SKILL.md +81 -654
  31. package/plugin/skills/methodology/executing-plans/SKILL.md +86 -529
  32. package/plugin/skills/methodology/finishing-development-branch/SKILL.md +95 -586
  33. package/plugin/skills/methodology/problem-solving/SKILL.md +67 -681
  34. package/plugin/skills/methodology/receiving-code-review/SKILL.md +70 -533
  35. package/plugin/skills/methodology/requesting-code-review/SKILL.md +70 -610
  36. package/plugin/skills/methodology/root-cause-tracing/SKILL.md +70 -646
  37. package/plugin/skills/methodology/sequential-thinking/SKILL.md +70 -478
  38. package/plugin/skills/methodology/systematic-debugging/SKILL.md +66 -559
  39. package/plugin/skills/methodology/test-driven-development/SKILL.md +91 -752
  40. package/plugin/skills/methodology/testing-anti-patterns/SKILL.md +78 -687
  41. package/plugin/skills/methodology/token-optimization/SKILL.md +72 -602
  42. package/plugin/skills/methodology/verification-before-completion/SKILL.md +108 -529
  43. package/plugin/skills/methodology/writing-plans/SKILL.md +79 -566
  44. package/plugin/skills/omega/omega-architecture/SKILL.md +91 -752
  45. package/plugin/skills/omega/omega-coding/SKILL.md +161 -552
  46. package/plugin/skills/omega/omega-sprint/SKILL.md +132 -777
  47. package/plugin/skills/omega/omega-testing/SKILL.md +157 -845
  48. package/plugin/skills/omega/omega-thinking/SKILL.md +165 -606
  49. package/plugin/skills/security/better-auth/SKILL.md +46 -1034
  50. package/plugin/skills/security/oauth/SKILL.md +80 -934
  51. package/plugin/skills/security/owasp/SKILL.md +78 -862
  52. package/plugin/skills/testing/playwright/SKILL.md +77 -700
  53. package/plugin/skills/testing/pytest/SKILL.md +73 -811
  54. package/plugin/skills/testing/vitest/SKILL.md +60 -920
  55. package/plugin/skills/tools/document-processing/SKILL.md +111 -838
  56. package/plugin/skills/tools/image-processing/SKILL.md +126 -659
  57. package/plugin/skills/tools/mcp-development/SKILL.md +85 -758
  58. package/plugin/skills/tools/media-processing/SKILL.md +118 -735
  59. package/plugin/stdrules/SKILL_STANDARDS.md +490 -0
  60. package/plugin/skills/SKILL_STANDARDS.md +0 -743
@@ -1,664 +1,88 @@
1
1
  ---
2
- name: root-cause-tracing
3
- description: Systematic root cause analysis using 5 Whys, Fishbone diagrams, and evidence-based investigation
4
- category: methodology
5
- triggers:
6
- - root cause
7
- - 5 whys
8
- - fishbone diagram
9
- - debugging
10
- - incident analysis
11
- - post mortem
12
- - problem investigation
2
+ name: tracing-root-causes
3
+ description: AI agent performs systematic root cause analysis using 5 Whys, Fishbone diagrams, and evidence-based investigation. Use when debugging, conducting post-mortems, or investigating incidents.
13
4
  ---
14
5
 
15
- # Root Cause Tracing
6
+ # Tracing Root Causes
16
7
 
17
- Master **systematic root cause analysis** to find the true underlying causes of problems, not just symptoms. This skill provides frameworks for investigating issues deeply and preventing recurrence.
8
+ ## Quick Start
18
9
 
19
- ## Purpose
20
-
21
- Find and eliminate true root causes:
22
-
23
- - Distinguish symptoms from underlying causes
24
- - Use structured investigation methodologies
25
- - Trace causality chains to their origins
26
- - Identify systemic factors that allow problems
27
- - Prevent recurrence through proper fixes
28
- - Document findings for organizational learning
29
- - Build more resilient systems over time
10
+ 1. **Identify Symptom** - Document the observable problem
11
+ 2. **Gather Evidence** - Collect logs, metrics, traces around incident
12
+ 3. **Apply 5 Whys** - Ask "Why?" iteratively until fundamental cause found
13
+ 4. **Map Categories** - Use Fishbone to explore all cause categories
14
+ 5. **Document Findings** - Create RCA report with action items
30
15
 
31
16
  ## Features
32
17
 
33
- ### 1. The Root Cause Hierarchy
34
-
35
- ```markdown
36
- ## Understanding Cause Layers
37
-
38
- ┌─────────────────────────────────────────────────────────────────────────┐
39
- │ ROOT CAUSE HIERARCHY │
40
- ├─────────────────────────────────────────────────────────────────────────┤
41
- │ │
42
- │ SYMPTOM │
43
- │ └── What you observe: "App crashed" "Users can't login" │
44
- │ │
45
- │ PROXIMATE CAUSE │
46
- │ └── Direct trigger: "Out of memory" "Database timeout" │
47
- │ │
48
- │ CONTRIBUTING FACTORS │
49
- │ └── Conditions that enabled: "No memory limits" "Slow query" │
50
- │ │
51
- │ ROOT CAUSE │
52
- │ └── Fundamental reason: "Memory leak in event handlers" │
53
- │ │
54
- │ SYSTEMIC FACTORS │
55
- │ └── Why it wasn't caught: "No memory monitoring" "Missing tests" │
56
- │ │
57
- │ │
58
- │ PRINCIPLE: Fix at the deepest level possible │
59
- │ - Fixing symptoms: Problem returns │
60
- │ - Fixing proximate cause: Similar problems emerge │
61
- │ - Fixing root cause: This specific problem prevented │
62
- │ - Fixing systemic factors: Entire class of problems prevented │
63
- │ │
64
- └─────────────────────────────────────────────────────────────────────────┘
65
- ```
66
-
67
- ### 2. The 5 Whys Technique
68
-
69
- ```markdown
70
- ## 5 Whys: Iterative Causality Tracing
71
-
72
- ### The Method
73
- Ask "Why?" repeatedly (typically 5 times) until you reach a fundamental cause.
74
-
75
- ### Example: Production Outage
76
-
77
- Problem: Website went down for 2 hours
78
-
79
- Why #1: Why did the website go down?
80
- → The application server ran out of memory and crashed.
81
-
82
- Why #2: Why did it run out of memory?
83
- → The number of active connections grew unbounded.
84
-
85
- Why #3: Why did connections grow unbounded?
86
- → Connection objects weren't being released after use.
87
-
88
- Why #4: Why weren't connections being released?
89
- → The cleanup code in the finally block wasn't being executed
90
- due to an early return statement.
91
-
92
- Why #5: Why wasn't this caught before production?
93
- → No test existed for the connection cleanup path, and code review
94
- missed the early return bypassing the finally block.
95
-
96
- ### Root Causes Identified:
97
- 1. Technical: Missing cleanup code execution (fix the bug)
98
- 2. Systemic: Missing test coverage for cleanup paths (add tests)
99
- 3. Process: Code review didn't catch early-return anti-pattern (add checklist)
100
- ```
101
-
102
- ```typescript
103
- /**
104
- * 5 Whys Investigation Framework
105
- */
106
-
107
- interface WhyStep {
108
- question: string;
109
- answer: string;
110
- evidence: string[];
111
- confidence: 'confirmed' | 'likely' | 'hypothesis';
112
- }
113
-
114
- interface FiveWhysAnalysis {
115
- problem: string;
116
- impactAssessment: ImpactAssessment;
117
- whys: WhyStep[];
118
- rootCauses: RootCause[];
119
- recommendations: Recommendation[];
120
- }
121
-
122
- function conductFiveWhys(problem: string): FiveWhysAnalysis {
123
- const analysis: FiveWhysAnalysis = {
124
- problem,
125
- impactAssessment: assessImpact(problem),
126
- whys: [],
127
- rootCauses: [],
128
- recommendations: []
129
- };
130
-
131
- let currentQuestion = `Why did ${problem}?`;
132
- let depth = 0;
133
-
134
- while (depth < 5) {
135
- const answer = investigateQuestion(currentQuestion);
136
- const evidence = gatherEvidence(answer);
137
-
138
- analysis.whys.push({
139
- question: currentQuestion,
140
- answer: answer.text,
141
- evidence: evidence.sources,
142
- confidence: evidence.confidence
143
- });
144
-
145
- // Check if we've reached a root cause
146
- if (isRootCause(answer)) {
147
- analysis.rootCauses.push({
148
- description: answer.text,
149
- type: classifyRootCause(answer),
150
- evidence: evidence
151
- });
152
- }
153
-
154
- currentQuestion = `Why ${answer.text}?`;
155
- depth++;
156
- }
157
-
158
- // Generate recommendations for each root cause
159
- analysis.recommendations = analysis.rootCauses.map(rc =>
160
- generateRecommendation(rc)
161
- );
162
-
163
- return analysis;
164
- }
165
- ```
166
-
167
- ### 3. Fishbone (Ishikawa) Diagram
168
-
169
- ```markdown
170
- ## Fishbone Diagram: Category-Based Analysis
171
-
172
- ┌──────────────────┐
173
- │ PROBLEM │
174
- │ [Symptom Here] │
175
- └────────┬─────────┘
176
-
177
- ┌────────────────────────────┼────────────────────────────┐
178
- │ │ │
179
- │ PEOPLE │ PROCESS │
180
- │ ────── │ ─────── │
181
- │ • Skill gaps │ • Missing │
182
- │ • Communication │ steps │
183
- │ • Training │ • Unclear │
184
- │ • Handoffs │ ownership │
185
- │ ╲ │ ╱ │
186
- │ ╲ │ ╱ │
187
- │ ╲ │ ╱ │
188
- │ ╲──────────────┼──────────────╱ │
189
- │ ╲ │ ╱ │
190
- │ ╲ │ ╱ │
191
- │ ╲ │ ╱ │
192
- │ ╲──────────┼──────────╱ │
193
- │ ╱ │ ╲ │
194
- │ ╱ │ ╲ │
195
- │ ╱ │ ╲ │
196
- │ ╱──────────────┼──────────────╲ │
197
- │ ╱ │ ╲ │
198
- │ ╱ │ ╲ │
199
- │ ╱ │ ╲ │
200
- │ TECHNOLOGY │ ENVIRONMENT │
201
- │ ────────── │ ─────────── │
202
- │ • Code bugs │ • Load │
203
- │ • Dependencies │ • Network │
204
- │ • Infrastructure │ • Third-party │
205
- │ • Configuration │ • Timing │
206
- │ │ │
207
- └────────────────────────────┴────────────────────────────┘
208
-
209
- ## Software-Specific Categories
210
-
211
- 1. CODE
212
- - Logic errors
213
- - Race conditions
214
- - Memory leaks
215
- - Error handling
216
-
217
- 2. DATA
218
- - Invalid input
219
- - Corrupt data
220
- - Missing data
221
- - Schema mismatches
222
-
223
- 3. CONFIGURATION
224
- - Wrong settings
225
- - Environment mismatch
226
- - Secrets/credentials
227
- - Feature flags
228
-
229
- 4. INFRASTRUCTURE
230
- - Resource exhaustion
231
- - Network issues
232
- - Service failures
233
- - Scaling problems
234
-
235
- 5. EXTERNAL
236
- - Third-party APIs
237
- - Dependencies
238
- - User behavior
239
- - Attack/abuse
240
-
241
- 6. PROCESS
242
- - Missing tests
243
- - Review gaps
244
- - Deployment issues
245
- - Monitoring blind spots
246
- ```
247
-
248
- ### 4. Evidence-Based Investigation
249
-
250
- ```typescript
251
- /**
252
- * Evidence-Based Root Cause Investigation
253
- * Every hypothesis must be backed by data
254
- */
255
-
256
- interface Evidence {
257
- type: 'log' | 'metric' | 'trace' | 'reproduction' | 'testimony';
258
- source: string;
259
- timestamp?: Date;
260
- data: unknown;
261
- reliability: 'high' | 'medium' | 'low';
262
- }
263
-
264
- interface Hypothesis {
265
- description: string;
266
- category: RootCauseCategory;
267
- evidence: {
268
- supporting: Evidence[];
269
- contradicting: Evidence[];
270
- };
271
- confidence: number; // 0-1
272
- testPlan?: string;
273
- }
274
-
275
- class RootCauseInvestigator {
276
- private hypotheses: Hypothesis[] = [];
277
- private evidence: Evidence[] = [];
278
-
279
- // Step 1: Gather all available evidence
280
- async gatherEvidence(incident: Incident): Promise<Evidence[]> {
281
- const evidence: Evidence[] = [];
282
-
283
- // Collect logs around incident time
284
- const logs = await this.queryLogs({
285
- startTime: incident.startTime.minus({ minutes: 30 }),
286
- endTime: incident.endTime.plus({ minutes: 30 }),
287
- services: incident.affectedServices
288
- });
289
-
290
- evidence.push(...logs.map(log => ({
291
- type: 'log' as const,
292
- source: log.service,
293
- timestamp: log.timestamp,
294
- data: log.message,
295
- reliability: 'high' as const
296
- })));
297
-
298
- // Collect metrics
299
- const metrics = await this.queryMetrics({
300
- metrics: ['error_rate', 'latency_p99', 'memory_usage', 'cpu_usage'],
301
- startTime: incident.startTime.minus({ hours: 1 }),
302
- endTime: incident.endTime.plus({ hours: 1 })
303
- });
304
-
305
- evidence.push(...metrics.map(m => ({
306
- type: 'metric' as const,
307
- source: m.name,
308
- timestamp: m.timestamp,
309
- data: m.value,
310
- reliability: 'high' as const
311
- })));
312
-
313
- // Collect traces for affected requests
314
- const traces = await this.queryTraces({
315
- traceIds: incident.affectedTraceIds.slice(0, 100)
316
- });
317
-
318
- evidence.push(...traces.map(t => ({
319
- type: 'trace' as const,
320
- source: t.serviceName,
321
- data: t.spans,
322
- reliability: 'high' as const
323
- })));
324
-
325
- this.evidence = evidence;
326
- return evidence;
327
- }
328
-
329
- // Step 2: Generate hypotheses based on evidence
330
- generateHypotheses(): Hypothesis[] {
331
- const hypotheses: Hypothesis[] = [];
18
+ | Feature | Description | Guide |
19
+ |---------|-------------|-------|
20
+ | Cause Hierarchy | Symptom -> Proximate -> Root -> Systemic | Fix at deepest level possible |
21
+ | 5 Whys | Iterative "Why?" questioning | Typically 5 iterations to root cause |
22
+ | Fishbone Diagram | Category-based cause exploration | Code, Data, Config, Infra, External, Process |
23
+ | Evidence Gathering | Logs, metrics, traces, reproduction | Timestamp, source, reliability rating |
24
+ | RCA Report | Structured documentation | Timeline, cause chain, action items |
25
+ | Systemic Factors | Why wasn't this caught earlier? | Testing, monitoring, process gaps |
332
26
 
333
- // Analyze log patterns
334
- const errorPatterns = this.findErrorPatterns(this.evidence);
335
- for (const pattern of errorPatterns) {
336
- hypotheses.push({
337
- description: `Error caused by: ${pattern.summary}`,
338
- category: this.categorizePattern(pattern),
339
- evidence: {
340
- supporting: pattern.matchingLogs,
341
- contradicting: []
342
- },
343
- confidence: pattern.frequency / this.evidence.length,
344
- testPlan: `Reproduce with: ${pattern.reproductionHint}`
345
- });
346
- }
27
+ ## Common Patterns
347
28
 
348
- // Analyze metric anomalies
349
- const anomalies = this.findMetricAnomalies(this.evidence);
350
- for (const anomaly of anomalies) {
351
- hypotheses.push({
352
- description: `Resource issue: ${anomaly.metric} ${anomaly.direction}`,
353
- category: 'infrastructure',
354
- evidence: {
355
- supporting: [anomaly.evidence],
356
- contradicting: []
357
- },
358
- confidence: anomaly.deviation > 3 ? 0.8 : 0.5
359
- });
360
- }
361
-
362
- // Sort by confidence
363
- this.hypotheses = hypotheses.sort((a, b) => b.confidence - a.confidence);
364
- return this.hypotheses;
365
- }
366
-
367
- // Step 3: Test hypotheses
368
- async testHypothesis(hypothesis: Hypothesis): Promise<boolean> {
369
- if (!hypothesis.testPlan) {
370
- throw new Error('Hypothesis has no test plan');
371
- }
372
-
373
- // Attempt to reproduce in safe environment
374
- const result = await this.runReproduction(hypothesis.testPlan);
375
-
376
- if (result.reproduced) {
377
- hypothesis.confidence = Math.min(hypothesis.confidence + 0.3, 1);
378
- hypothesis.evidence.supporting.push({
379
- type: 'reproduction',
380
- source: 'test-environment',
381
- data: result,
382
- reliability: 'high'
383
- });
384
- return true;
385
- } else {
386
- hypothesis.confidence *= 0.5;
387
- return false;
388
- }
389
- }
390
- }
391
29
  ```
392
-
393
- ### 5. Root Cause Analysis Template
394
-
395
- ```markdown
396
- ## Root Cause Analysis Report
397
-
398
- ### Incident Summary
399
- - **Incident ID:** [INC-XXXX]
400
- - **Date/Time:** [When it occurred]
401
- - **Duration:** [How long it lasted]
402
- - **Severity:** [Critical/High/Medium/Low]
403
- - **Affected Systems:** [What was impacted]
404
- - **User Impact:** [How users were affected]
405
-
406
- ---
407
-
408
- ### Timeline
409
-
410
- | Time | Event | Source |
411
- |------|-------|--------|
412
- | 09:00 | First error logged | Application logs |
413
- | 09:05 | Alert triggered | Monitoring system |
414
- | 09:15 | Investigation started | On-call engineer |
415
- | 09:30 | Root cause identified | Log analysis |
416
- | 09:45 | Fix deployed | Deployment system |
417
- | 10:00 | Service restored | Health checks |
418
-
419
- ---
420
-
421
- ### Symptom
422
- **What was observed:**
423
- [Describe the visible symptoms - error messages, user reports, alerts]
424
-
425
- **Evidence:**
426
- - [Log excerpt 1]
427
- - [Metric screenshot 2]
428
- - [User report 3]
429
-
430
- ---
431
-
432
- ### Proximate Cause
433
- **Immediate trigger:**
434
- [What directly caused the symptom]
435
-
436
- **Evidence:**
437
- - [Supporting evidence]
438
-
439
- ---
440
-
441
- ### Root Cause
442
- **Underlying reason:**
443
- [The fundamental cause that, if fixed, prevents recurrence]
444
-
445
- **5 Whys Analysis:**
446
- 1. Why [symptom]? → [answer]
447
- 2. Why [answer 1]? → [answer]
448
- 3. Why [answer 2]? → [answer]
449
- 4. Why [answer 3]? → [answer]
450
- 5. Why [answer 4]? → [ROOT CAUSE]
451
-
452
- **Evidence:**
453
- - [Code snippet showing the bug]
454
- - [Configuration showing the misconfiguration]
455
-
456
- ---
457
-
458
- ### Systemic Factors
459
- **Why wasn't this caught earlier?**
460
-
461
- 1. **Testing Gap:** [What test would have caught this?]
462
- 2. **Monitoring Gap:** [What alert would have warned us?]
463
- 3. **Process Gap:** [What review would have prevented this?]
464
-
465
- ---
466
-
467
- ### Action Items
468
-
469
- | Action | Owner | Priority | Due Date | Status |
470
- |--------|-------|----------|----------|--------|
471
- | Fix the immediate bug | @engineer | P0 | Today | Done |
472
- | Add regression test | @engineer | P1 | This week | In Progress |
473
- | Add monitoring | @sre | P1 | This week | Not Started |
474
- | Update runbook | @sre | P2 | Next week | Not Started |
475
- | Add code review checklist item | @lead | P2 | Next sprint | Not Started |
476
-
477
- ---
478
-
479
- ### Lessons Learned
480
-
481
- 1. **What we learned:** [Key insight]
482
- 2. **What we'll do differently:** [Process change]
483
- 3. **Similar risks to address:** [Other areas with same pattern]
30
+ # 5 Whys Example
31
+ Problem: Website down for 2 hours
32
+
33
+ Why #1: Why down? -> Server out of memory
34
+ Why #2: Why out of memory? -> Connections unbounded
35
+ Why #3: Why unbounded? -> Not released after use
36
+ Why #4: Why not released? -> Early return skipped finally
37
+ Why #5: Why not caught? -> No test for cleanup path
38
+
39
+ Root Causes:
40
+ 1. Technical: Missing cleanup execution
41
+ 2. Systemic: Missing test coverage
42
+ 3. Process: Code review missed pattern
43
+
44
+ # Fishbone Categories (Software)
45
+ CODE: Logic errors, race conditions, memory leaks
46
+ DATA: Invalid input, corrupt data, schema mismatch
47
+ CONFIG: Wrong settings, env mismatch, feature flags
48
+ INFRA: Resource exhaustion, network, scaling
49
+ EXTERNAL: Third-party APIs, dependencies, attacks
50
+ PROCESS: Missing tests, review gaps, monitoring blind spots
484
51
  ```
485
52
 
486
- ### 6. Common Root Cause Patterns
487
-
488
- ```typescript
489
- /**
490
- * Common root cause patterns in software systems
491
- */
492
-
493
- const commonRootCausePatterns = {
494
- resourceExhaustion: {
495
- symptoms: [
496
- 'Out of memory errors',
497
- 'Connection pool exhausted',
498
- 'File descriptor limit reached',
499
- 'Thread pool saturation'
500
- ],
501
- commonCauses: [
502
- 'Memory leaks (objects not garbage collected)',
503
- 'Connection leaks (not closing connections)',
504
- 'Unbounded queues or caches',
505
- 'Missing resource limits'
506
- ],
507
- investigation: `
508
- 1. Check resource usage metrics over time
509
- 2. Look for steady growth patterns
510
- 3. Identify what's holding resources
511
- 4. Profile memory/connections in staging
512
- `,
513
- prevention: [
514
- 'Set explicit resource limits',
515
- 'Implement circuit breakers',
516
- 'Add resource usage monitoring',
517
- 'Use connection pooling with limits'
518
- ]
519
- },
520
-
521
- racingConditions: {
522
- symptoms: [
523
- 'Intermittent failures',
524
- 'Data inconsistency',
525
- 'Deadlocks',
526
- 'Lost updates'
527
- ],
528
- commonCauses: [
529
- 'Missing synchronization',
530
- 'Non-atomic operations',
531
- 'Improper lock ordering',
532
- 'Shared mutable state'
533
- ],
534
- investigation: `
535
- 1. Look for concurrent access patterns
536
- 2. Check for shared mutable state
537
- 3. Review lock acquisition order
538
- 4. Add detailed tracing to suspect areas
539
- `,
540
- prevention: [
541
- 'Use immutable data structures',
542
- 'Implement proper locking',
543
- 'Use atomic operations',
544
- 'Add concurrency tests'
545
- ]
546
- },
547
-
548
- cascadingFailures: {
549
- symptoms: [
550
- 'Multiple services failing',
551
- 'Rapid error propagation',
552
- 'Timeout storms',
553
- 'Complete outage from partial failure'
554
- ],
555
- commonCauses: [
556
- 'Missing circuit breakers',
557
- 'Synchronous dependencies',
558
- 'No fallback mechanisms',
559
- 'Shared resource contention'
560
- ],
561
- investigation: `
562
- 1. Map the failure propagation path
563
- 2. Identify the initial failure point
564
- 3. Check for missing isolation
565
- 4. Review timeout configurations
566
- `,
567
- prevention: [
568
- 'Implement circuit breakers',
569
- 'Add bulkhead patterns',
570
- 'Use async communication',
571
- 'Design for graceful degradation'
572
- ]
573
- },
574
-
575
- configurationErrors: {
576
- symptoms: [
577
- 'Works in one environment, fails in another',
578
- 'Feature behaves unexpectedly',
579
- 'Connection failures',
580
- 'Permission denied'
581
- ],
582
- commonCauses: [
583
- 'Environment variable mismatch',
584
- 'Missing or wrong credentials',
585
- 'Feature flag misconfiguration',
586
- 'Resource limits too low'
587
- ],
588
- investigation: `
589
- 1. Compare configurations across environments
590
- 2. Check recent configuration changes
591
- 3. Verify secrets are present and correct
592
- 4. Review feature flag states
593
- `,
594
- prevention: [
595
- 'Use configuration validation',
596
- 'Implement config diffing',
597
- 'Require config reviews',
598
- 'Test with production-like config'
599
- ]
600
- }
601
- };
602
53
  ```
603
-
604
- ## Use Cases
605
-
606
- ### Production Incident Investigation
607
-
608
- ```typescript
609
- async function investigateIncident(incidentId: string): Promise<RCAReport> {
610
- const investigator = new RootCauseInvestigator();
611
-
612
- // 1. Define the problem clearly
613
- const incident = await getIncident(incidentId);
614
- const problem = `${incident.summary} affecting ${incident.userCount} users`;
615
-
616
- // 2. Gather evidence
617
- await investigator.gatherEvidence(incident);
618
-
619
- // 3. Generate and rank hypotheses
620
- const hypotheses = investigator.generateHypotheses();
621
-
622
- // 4. Test top hypotheses
623
- for (const hypothesis of hypotheses.slice(0, 3)) {
624
- const confirmed = await investigator.testHypothesis(hypothesis);
625
- if (confirmed) {
626
- break;
627
- }
628
- }
629
-
630
- // 5. Document findings
631
- return generateRCAReport(investigator);
632
- }
54
+ # Cause Hierarchy
55
+ SYMPTOM: "App crashed"
56
+ |
57
+ PROXIMATE CAUSE: "Out of memory"
58
+ |
59
+ CONTRIBUTING FACTOR: "No memory limits"
60
+ |
61
+ ROOT CAUSE: "Memory leak in event handlers"
62
+ |
63
+ SYSTEMIC FACTOR: "No memory monitoring"
64
+
65
+ PRINCIPLE: Fix symptoms = problem returns
66
+ Fix root cause = this problem prevented
67
+ Fix systemic = class of problems prevented
633
68
  ```
634
69
 
635
70
  ## Best Practices
636
71
 
637
- ### Do's
638
-
639
- - **Gather evidence first** before forming hypotheses
640
- - **Use structured methods** (5 Whys, Fishbone) consistently
641
- - **Involve multiple perspectives** for complex issues
642
- - **Document everything** for future reference
643
- - **Look for systemic factors** not just immediate causes
644
- - **Create actionable recommendations** with owners and deadlines
645
- - **Share learnings** across the organization
646
- - **Verify fixes** actually prevent recurrence
647
-
648
- ### Don'ts
649
-
650
- - Don't stop at the first answer - dig deeper
651
- - Don't blame individuals - look for systemic issues
652
- - Don't skip evidence gathering
653
- - Don't accept "human error" as root cause
654
- - Don't confuse correlation with causation
655
- - Don't rush to solutions before understanding the problem
656
- - Don't ignore near-misses - investigate them too
657
- - Don't let action items go untracked
658
-
659
- ## References
660
-
661
- - [The Toyota Way: 5 Whys](https://en.wikipedia.org/wiki/Five_whys)
662
- - [Ishikawa Diagram](https://en.wikipedia.org/wiki/Ishikawa_diagram)
663
- - [Google SRE: Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
664
- - [Learning from Incidents](https://www.learningfromincidents.io/)
72
+ | Do | Avoid |
73
+ |----|-------|
74
+ | Gather evidence before forming hypotheses | Jumping to conclusions |
75
+ | Use structured methods consistently | Ad-hoc investigation |
76
+ | Involve multiple perspectives | Single viewpoint |
77
+ | Look for systemic factors | Just fixing immediate cause |
78
+ | Create actionable recommendations | Vague "be more careful" |
79
+ | Verify fixes prevent recurrence | Assuming fix works |
80
+ | Share learnings across team | Siloing knowledge |
81
+ | Investigate near-misses too | Only investigating failures |
82
+
83
+ ## Related Skills
84
+
85
+ - `debugging-systematically` - Four-phase debugging process
86
+ - `solving-problems` - 5-phase problem-solving framework
87
+ - `thinking-sequentially` - Numbered thought chains
88
+ - `verifying-before-completion` - Ensure fix completeness
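
Reviewer note on "Why #4" in the 5 Whys example: in JavaScript and TypeScript, a `return` placed inside a `try` block still runs the `finally` clause, so "Early return skipped finally" can only describe a return that happens before the `try` is entered. The sketch below illustrates that reading and the kind of cleanup-path check that "Why #5" says was missing. It is not code from the package; `FakeConnection`, `handleLeaky`, and `handleSafe` are hypothetical names introduced for illustration.

```typescript
// Hypothetical stand-in for a pooled connection; not part of omgkit.
class FakeConnection {
  released = false;

  release(): void {
    this.released = true;
  }
}

// Leaky variant: the early return happens BEFORE the try block is entered,
// so the finally clause never runs and the connection is never released.
function handleLeaky(conn: FakeConnection, cached: boolean): string {
  if (cached) {
    return "cached";
  }
  try {
    return "fresh";
  } finally {
    conn.release();
  }
}

// Safe variant: the early return sits INSIDE the try block,
// so the finally clause runs on every exit path.
function handleSafe(conn: FakeConnection, cached: boolean): string {
  try {
    if (cached) {
      return "cached";
    }
    return "fresh";
  } finally {
    conn.release();
  }
}

// Exercise the early-return path of both variants.
const leakyConn = new FakeConnection();
handleLeaky(leakyConn, true);

const safeConn = new FakeConnection();
handleSafe(safeConn, true);

console.log(leakyConn.released, safeConn.released); // prints: false true
```

A regression test that asserts `released` is true after the early-return path would fail against the leaky variant, which is the test-coverage gap the example lists as the systemic cause.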