codingbuddy-rules 2.4.2 → 3.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.ai-rules/CHANGELOG.md +122 -0
- package/.ai-rules/agents/README.md +527 -11
- package/.ai-rules/agents/accessibility-specialist.json +0 -1
- package/.ai-rules/agents/act-mode.json +0 -1
- package/.ai-rules/agents/agent-architect.json +0 -1
- package/.ai-rules/agents/ai-ml-engineer.json +0 -1
- package/.ai-rules/agents/architecture-specialist.json +14 -2
- package/.ai-rules/agents/backend-developer.json +14 -2
- package/.ai-rules/agents/code-quality-specialist.json +0 -1
- package/.ai-rules/agents/data-engineer.json +0 -1
- package/.ai-rules/agents/devops-engineer.json +24 -2
- package/.ai-rules/agents/documentation-specialist.json +0 -1
- package/.ai-rules/agents/eval-mode.json +0 -1
- package/.ai-rules/agents/event-architecture-specialist.json +719 -0
- package/.ai-rules/agents/frontend-developer.json +14 -2
- package/.ai-rules/agents/i18n-specialist.json +0 -1
- package/.ai-rules/agents/integration-specialist.json +11 -1
- package/.ai-rules/agents/migration-specialist.json +676 -0
- package/.ai-rules/agents/mobile-developer.json +0 -1
- package/.ai-rules/agents/observability-specialist.json +747 -0
- package/.ai-rules/agents/performance-specialist.json +24 -2
- package/.ai-rules/agents/plan-mode.json +0 -1
- package/.ai-rules/agents/platform-engineer.json +0 -1
- package/.ai-rules/agents/security-specialist.json +27 -16
- package/.ai-rules/agents/seo-specialist.json +0 -1
- package/.ai-rules/agents/solution-architect.json +0 -1
- package/.ai-rules/agents/technical-planner.json +0 -1
- package/.ai-rules/agents/test-strategy-specialist.json +14 -2
- package/.ai-rules/agents/ui-ux-designer.json +0 -1
- package/.ai-rules/rules/core.md +25 -0
- package/.ai-rules/skills/README.md +35 -0
- package/.ai-rules/skills/database-migration/SKILL.md +531 -0
- package/.ai-rules/skills/database-migration/expand-contract-patterns.md +314 -0
- package/.ai-rules/skills/database-migration/large-scale-migration.md +414 -0
- package/.ai-rules/skills/database-migration/rollback-strategies.md +359 -0
- package/.ai-rules/skills/database-migration/validation-procedures.md +428 -0
- package/.ai-rules/skills/dependency-management/SKILL.md +381 -0
- package/.ai-rules/skills/dependency-management/license-compliance.md +282 -0
- package/.ai-rules/skills/dependency-management/lock-file-management.md +437 -0
- package/.ai-rules/skills/dependency-management/major-upgrade-guide.md +292 -0
- package/.ai-rules/skills/dependency-management/security-vulnerability-response.md +230 -0
- package/.ai-rules/skills/incident-response/SKILL.md +373 -0
- package/.ai-rules/skills/incident-response/communication-templates.md +322 -0
- package/.ai-rules/skills/incident-response/escalation-matrix.md +347 -0
- package/.ai-rules/skills/incident-response/postmortem-template.md +351 -0
- package/.ai-rules/skills/incident-response/severity-classification.md +256 -0
- package/.ai-rules/skills/performance-optimization/CREATION-LOG.md +87 -0
- package/.ai-rules/skills/performance-optimization/SKILL.md +76 -0
- package/.ai-rules/skills/performance-optimization/documentation-template.md +70 -0
- package/.ai-rules/skills/pr-review/SKILL.md +768 -0
- package/.ai-rules/skills/refactoring/SKILL.md +192 -0
- package/.ai-rules/skills/refactoring/refactoring-catalog.md +1377 -0
- package/package.json +1 -1
package/.ai-rules/skills/incident-response/postmortem-template.md

@@ -0,0 +1,351 @@

# Postmortem Template

Blameless Root Cause Analysis (RCA) framework for organizational learning.

## The Documentation Rule

**The incident is NOT resolved until the postmortem is scheduled.** Postmortems:
- Capture learning while memory is fresh
- Prevent repeat incidents
- Build organizational knowledge
- Improve detection and response

## Postmortem Timeline

| Severity | Schedule Within | Complete Within | Review Meeting |
|----------|----------------|-----------------|----------------|
| P1 | 24 hours | 5 business days | Required |
| P2 | 48 hours | 7 business days | Required |
| P3 | 1 week | 2 weeks | Optional |
| P4 | 2 weeks | 1 month | Not required |

---

## Blameless Postmortem Principles

### The Core Philosophy

**Incidents are system failures, not people failures.**

- Focus on what happened, not who to blame
- Assume everyone acted with best intentions
- Look for systemic causes, not individual mistakes
- Identify process gaps, not performance gaps

### Language Guidelines

❌ **Avoid:**
- "Engineer X caused the outage by..."
- "If only Y had checked..."
- "Z failed to follow procedure"
- "The mistake was..."

✅ **Use:**
- "The deployment process allowed..."
- "The monitoring gap meant..."
- "The system permitted..."
- "The contributing factor was..."

---

## Postmortem Document Structure

### Header Information

```
# Postmortem: [Incident Title]

**Incident ID:** [ID]
**Date:** [YYYY-MM-DD]
**Duration:** [X hours Y minutes]
**Severity:** P[X]
**Author:** [Name]
**Review Status:** [Draft/Reviewed/Approved]

**Attendees:** [List of people involved in postmortem]
```

---

### Executive Summary

```
## Executive Summary

[2-3 sentences describing what happened, the impact, and the resolution]

**Key Metrics:**
- Time to Detect (TTD): [X minutes]
- Time to Mitigate (TTM): [X minutes]
- Time to Resolve (TTR): [X minutes]
- Users Affected: [Count/percentage]
- SLO Budget Consumed: [Percentage]
- Revenue Impact: [If applicable]
```

---

### Timeline

```
## Timeline

All times in UTC.

| Time | Event | Actor |
|------|-------|-------|
| HH:MM | [What happened] | [System/Team] |
| HH:MM | [Alert fired] | [Monitoring] |
| HH:MM | [Alert acknowledged] | [Responder] |
| HH:MM | [Investigation started] | [Team] |
| HH:MM | [Root cause identified] | [Team] |
| HH:MM | [Mitigation applied] | [Responder] |
| HH:MM | [Service restored] | [System] |
| HH:MM | [Incident closed] | [IC] |

**trace_id:** [ID for log/trace correlation]
```

---

### Impact Assessment

```
## Impact Assessment

### User Impact
- [X] users experienced [symptom] for [duration]
- [Affected user segments]
- [Geographic regions affected]

### Business Impact
- Revenue: [Lost/at-risk amount if applicable]
- SLA: [Any SLA breaches]
- Compliance: [Any compliance implications]

### Service Impact
- Primary: [Service(s) directly affected]
- Secondary: [Downstream services affected]
- Cascade: [Any cascade effects]

### SLO Impact
| SLO | Target | Actual During Incident | Budget Consumed |
|-----|--------|----------------------|-----------------|
| [Name] | [X]% | [Y]% | [Z]% |
```

---

### Contributing Factors

**Note:** We use "contributing factors" not "root cause" because incidents are complex and multi-causal.

```
## Contributing Factors

### Technical Factors
1. **[Factor]**
   - What: [Description]
   - Why it mattered: [How it contributed]
   - Evidence: [Logs/traces/metrics reference]

2. **[Factor]**
   - What: [Description]
   - Why it mattered: [How it contributed]
   - Evidence: [Reference]

### Process Factors
1. **[Factor]**
   - What: [Description]
   - Why it mattered: [How it contributed]

### Detection Factors
1. **[Factor]**
   - What monitoring gap existed?
   - How could earlier detection have helped?
```

---

### 5 Whys Analysis

```
## 5 Whys Analysis

**Symptom:** [What was observed]

1. **Why did [symptom] occur?**
   Because [reason 1]

2. **Why did [reason 1] happen?**
   Because [reason 2]

3. **Why did [reason 2] happen?**
   Because [reason 3]

4. **Why did [reason 3] happen?**
   Because [reason 4]

5. **Why did [reason 4] happen?**
   Because [reason 5 - usually systemic/process]

**Contributing Factor Identified:** [Summary]
```

---

### Observability Gap Analysis

```
## Observability Gap Analysis

### Detection Effectiveness
- Was the issue detected by monitoring? [Yes/No]
- If no, how was it detected? [Customer report/Manual check/etc.]
- Time between incident start and detection: [X minutes]

### Dashboard Effectiveness
- Did dashboards show the problem clearly? [Yes/No]
- What was missing from dashboards?
- What additional views would have helped?

### Log/Trace Effectiveness
- Could logs identify the root cause? [Yes/No]
- Was trace_id correlation helpful? [Yes/No]
- What additional logging would help?

### Alerting Effectiveness
- Did alerts fire appropriately? [Yes/No]
- Were there false negatives? [Alerts that should have fired]
- Were there false positives? [Alert noise during incident]
```

---

### Action Items

```
## Action Items

### Immediate (Within 1 Week)
| ID | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| AI-1 | [Action] | [Name] | [Date] | [Open/Done] |

### Short-term (Within 1 Month)
| ID | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| AI-2 | [Action] | [Name] | [Date] | [Open/Done] |

### Long-term (Within Quarter)
| ID | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| AI-3 | [Action] | [Name] | [Date] | [Open/Done] |

### Observability Improvements
| ID | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| OBS-1 | Add alert for [condition] | [Name] | [Date] | [Status] |
| OBS-2 | Add dashboard for [metric] | [Name] | [Date] | [Status] |
| OBS-3 | Add logging for [event] | [Name] | [Date] | [Status] |
```

---

### What Went Well

```
## What Went Well

- [Positive observation about response]
- [Effective tool/process that helped]
- [Good decision that was made]
- [Communication that worked well]
```

---

### Lessons Learned

```
## Lessons Learned

### Process Improvements
- [Learning about incident response process]

### Technical Insights
- [Technical learning from this incident]

### Communication Insights
- [Learning about communication effectiveness]

### Prevention Strategies
- [How similar incidents can be prevented]
```

---

## Postmortem Review Meeting

### Agenda (60 minutes)

1. **Timeline walkthrough** (10 min)
   - What happened, when
   - Key decision points

2. **Contributing factors discussion** (20 min)
   - Technical factors
   - Process factors
   - Detection gaps

3. **Action item review** (20 min)
   - Prioritize actions
   - Assign owners
   - Set due dates

4. **Lessons learned** (10 min)
   - What went well
   - What to improve
   - Share with broader team

### Meeting Rules

- No blame, no defensiveness
- Focus on systems and processes
- Everyone's perspective is valuable
- Actions must have owners and dates
- Follow up on action completion

---

## Action Item Categories

Ensure actions cover:

| Category | Example Actions |
|----------|----------------|
| **Prevention** | Code review requirement, input validation |
| **Detection** | New alerts, improved dashboards |
| **Mitigation** | Runbook updates, failover automation |
| **Response** | Training, process documentation |
| **Recovery** | Backup improvements, rollback procedures |

---

## Postmortem Anti-Patterns

❌ **Avoid:**
- Assigning blame to individuals
- Concluding "human error" (dig deeper)
- Action items without owners
- Actions without deadlines
- Publishing without review
- Skipping the meeting

✅ **Do:**
- Focus on system failures
- Find multiple contributing factors
- Assign clear ownership
- Set realistic deadlines
- Review with stakeholders
- Track action completion
package/.ai-rules/skills/incident-response/severity-classification.md

@@ -0,0 +1,256 @@

# Severity Classification

Objective severity classification based on SLO burn rates and business impact.

## The Classification Rule

**Classify severity BEFORE taking any action.** Severity determines:
- Response time expectations
- Who gets notified
- Resource allocation priority
- Communication cadence

## Severity Matrix

### P1 - Critical

**SLO Burn Rate:** >14.4x (consuming >2% error budget per hour)

**Impact Criteria (ANY of these):**
- Complete service outage
- >50% of users affected
- Critical business function unavailable
- Data loss or corruption risk
- Active security breach
- Revenue-generating flow completely blocked
- Compliance/regulatory violation in progress

**Response Expectations:**

| Metric | Target |
|--------|--------|
| Acknowledge | Within 5 minutes |
| First update | Within 15 minutes |
| War room formed | Within 15 minutes |
| Executive notification | Within 30 minutes |
| Customer communication | Within 1 hour |
| Update cadence | Every 15 minutes |

**Escalation:** Immediate page to on-call, all hands if needed

**Example Incidents:**
- Production database unreachable
- Authentication service down
- Payment processing 100% failure
- Major cloud region outage affecting services
- Data breach detected

---

### P2 - High

**SLO Burn Rate:** >6x (consuming >5% error budget per 6 hours)

**Impact Criteria (ANY of these):**
- Major feature unavailable
- 10-50% of users affected
- Significant performance degradation (>5x latency)
- Secondary business function blocked
- Partial data integrity issues
- Key integration failing

**Response Expectations:**

| Metric | Target |
|--------|--------|
| Acknowledge | Within 15 minutes |
| First update | Within 30 minutes |
| Status page update | Within 30 minutes |
| Stakeholder notification | Within 1 hour |
| Update cadence | Every 30 minutes |

**Escalation:** Page on-call during business hours, notify team lead

**Example Incidents:**
- Search functionality completely broken
- 30% of API requests failing
- Email notifications not sending
- Third-party payment provider degraded
- Mobile app login issues for subset of users

---

### P3 - Medium

**SLO Burn Rate:** >3x (consuming >10% error budget per 24 hours)

**Impact Criteria (ANY of these):**
- Minor feature impacted
- <10% of users affected
- Workaround available
- Non-critical function degraded
- Cosmetic issues affecting usability
- Performance slightly degraded

**Response Expectations:**

| Metric | Target |
|--------|--------|
| Acknowledge | Within 1 hour |
| First update | Within 2 hours |
| Resolution target | Within 8 business hours |
| Update cadence | At milestones |

**Escalation:** Create ticket, notify team channel

**Example Incidents:**
- Report generation slow but working
- Specific browser experiencing issues
- Non-critical API endpoint intermittent
- Email formatting broken
- Search results slightly inaccurate

---

### P4 - Low

**SLO Burn Rate:** >1x (projected budget exhaustion within SLO window)

**Impact Criteria (ALL of these):**
- Minimal or no user impact
- Edge case or rare scenario
- Cosmetic only
- Performance within acceptable range
- Workaround trivial

**Response Expectations:**

| Metric | Target |
|--------|--------|
| Acknowledge | Within 1 business day |
| Resolution target | Next sprint/release |
| Update cadence | On resolution |

**Escalation:** Backlog item, routine prioritization

**Example Incidents:**
- Minor UI misalignment
- Rare edge case error
- Documentation inconsistency
- Non-user-facing optimization opportunity

---

## Error Budget Integration

### Understanding Burn Rate

```
Burn Rate = (Error Rate) / (Allowed Error Rate)

Example: 99.9% SLO = 0.1% allowed errors
If current error rate = 1.44%
Burn Rate = 1.44 / 0.1 = 14.4x (P1!)
```
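
A minimal sketch of the same calculation in code (a hypothetical helper, not part of the package, using the thresholds from the severity matrix above):

```ts
// Burn rate = observed error rate / error rate the SLO allows.
// A 99.9% SLO allows 0.1% errors, so a 1.44% error rate burns at ~14.4x.
function burnRate(errorRate: number, sloTarget: number): number {
  return errorRate / (1 - sloTarget);
}

// Map a burn rate onto the P1-P4 thresholds defined above.
function severityForBurnRate(rate: number): "P1" | "P2" | "P3" | "P4" | "none" {
  if (rate > 14.4) return "P1";
  if (rate > 6) return "P2";
  if (rate > 3) return "P3";
  if (rate > 1) return "P4";
  return "none";
}

severityForBurnRate(burnRate(0.0144, 0.999)); // ~14.4x -> "P1"
```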

### Budget Consumption Thresholds

| Severity | Burn Rate | Budget Impact | Alert Type |
|----------|-----------|--------------|------------|
| P1 | >14.4x | >2% in 1 hour | Page immediately |
| P2 | >6x | >5% in 6 hours | Page business hours |
| P3 | >3x | >10% in 24 hours | Create ticket |
| P4 | >1x | Projected exhaustion | Add to backlog |

### SLO Tier Mapping

| Service Tier | SLO Target | Monthly Budget | Equivalent Downtime |
|--------------|-----------|----------------|---------------------|
| Tier 1 (Critical) | 99.99% | 0.01% | 4.38 minutes |
| Tier 2 (Important) | 99.9% | 0.1% | 43.8 minutes |
| Tier 3 (Standard) | 99.5% | 0.5% | 3.65 hours |
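
The "Equivalent Downtime" column is the monthly budget applied to an average month of roughly 730 hours; a quick worked check:

```ts
// Downtime budget per month = (1 - SLO target) * hours in an average month.
const HOURS_PER_MONTH = 730; // 8,766 hours/year / 12

const downtimeMinutes = (sloTarget: number): number =>
  (1 - sloTarget) * HOURS_PER_MONTH * 60;

downtimeMinutes(0.9999); // ~4.38 minutes (Tier 1)
downtimeMinutes(0.999);  // ~43.8 minutes (Tier 2)
downtimeMinutes(0.995);  // ~219 minutes = 3.65 hours (Tier 3)
```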

---

## Classification Decision Tree

```
Is the service completely unavailable?
├── Yes → P1
└── No → Continue

Are >50% of users affected?
├── Yes → P1
└── No → Continue

Is a critical business function blocked?
├── Yes → P1
└── No → Continue

Are 10-50% of users affected?
├── Yes → P2
└── No → Continue

Is a major feature unavailable?
├── Yes → P2
└── No → Continue

Is the burn rate >6x?
├── Yes → P2
└── No → Continue

Are <10% of users affected with workaround?
├── Yes → P3
└── No → Continue

Is impact minimal/cosmetic only?
├── Yes → P4
└── No → Default to P3 (when uncertain)
```
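
The same tree, sketched as a function (field names are illustrative, not part of the package):

```ts
interface IncidentSignals {
  serviceDown: boolean;             // complete outage
  usersAffectedPct: number;         // 0-100
  criticalFunctionBlocked: boolean;
  majorFeatureUnavailable: boolean;
  burnRate: number;                 // multiple of the allowed error rate
  workaroundAvailable: boolean;
  cosmeticOnly: boolean;
}

function classify(s: IncidentSignals): "P1" | "P2" | "P3" | "P4" {
  if (s.serviceDown || s.usersAffectedPct > 50 || s.criticalFunctionBlocked) {
    return "P1";
  }
  if (s.usersAffectedPct >= 10 || s.majorFeatureUnavailable || s.burnRate > 6) {
    return "P2";
  }
  if (s.usersAffectedPct < 10 && s.workaroundAvailable) return "P3";
  return s.cosmeticOnly ? "P4" : "P3"; // default to P3 when uncertain
}
```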

---

## When Uncertain, Classify Higher

**Rule:** If you're unsure between two severity levels, classify higher.

- Unsure between P1 and P2? → Classify as P1
- Unsure between P2 and P3? → Classify as P2
- Unsure between P3 and P4? → Classify as P3

**Rationale:** Over-response is better than under-response. You can always downgrade.

---

## Severity Changes During Incident

Severity can change as you learn more:

**Upgrade when:**
- Impact wider than initially assessed
- More users affected than thought
- Business impact greater than estimated
- Mitigation not working

**Downgrade when:**
- Successful mitigation reduced impact
- Fewer users affected than thought
- Workaround discovered
- Root cause isolated to non-critical path

**Always communicate severity changes** to all stakeholders immediately.

---

## Include in Incident Reports

When documenting severity, always include:

```
Severity: P[1-4]
Burn Rate: [X]x SLO budget
Users Affected: [Count/Percentage]
Impact: [Brief description]
SLO Status: [Which SLO is breaching]
Error Budget Remaining: [Percentage]
```
package/.ai-rules/skills/performance-optimization/CREATION-LOG.md

@@ -0,0 +1,87 @@

# Performance Optimization Skill Creation Log

## RED Phase: Pressure Scenarios & Expected Baseline Failures

### Scenario 1: "Make it faster" pressure
**Prompt:** "The API endpoint is slow, make it faster"
**Expected baseline behavior WITHOUT skill:**
- Jump to implementation without profiling
- Guess at bottlenecks based on assumptions
- Apply random optimizations (add caching, parallelize)
- No measurement before/after

### Scenario 2: "Quick benchmark" pressure
**Prompt:** "Run a quick benchmark to see the improvement"
**Expected baseline behavior WITHOUT skill:**
- Run once, report single number
- No warm-up runs
- No statistical analysis (variance, outliers)
- Inconsistent environment (background processes)
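
For contrast, a minimal sketch of what the skill requires instead — warm-up runs plus basic statistics rather than a single number (hypothetical harness, not from the skill itself):

```ts
// Benchmark a function: discard warm-up iterations, then report
// mean and standard deviation across many timed runs.
function bench(fn: () => void, warmup = 10, runs = 50) {
  for (let i = 0; i < warmup; i++) fn(); // let JIT and caches settle

  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }

  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((a, b) => a + (b - mean) ** 2, 0) / samples.length;
  return { mean, stdDev: Math.sqrt(variance) };
}
```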

### Scenario 3: "Optimize everything" pressure
**Prompt:** "The app feels slow, optimize the performance"
**Expected baseline behavior WITHOUT skill:**
- Optimize in arbitrary order
- Spend time on low-impact areas
- No ROI calculation (Amdahl's Law)
- No prioritization by bottleneck severity

### Scenario 4: "Ship it" time pressure
**Prompt:** "We need to deploy this fix today, skip the detailed analysis"
**Expected baseline behavior WITHOUT skill:**
- Skip regression test setup
- No performance budget definition
- No CI integration for performance monitoring
- "We'll add tests later" (never happens)

### Scenario 5: "We improved it" documentation pressure
**Prompt:** "Document the performance improvements we made"
**Expected baseline behavior WITHOUT skill:**
- Vague claims ("improved performance")
- No before/after numbers
- No reproducible methodology
- No context (environment, load, data size)

## Common Rationalizations Identified

| Excuse | Reality |
|--------|---------|
| "I know where the bottleneck is" | Profiling reveals surprises 90% of the time |
| "One run is enough" | Single measurements have high variance |
| "Micro-optimizations add up" | Amdahl's Law: 5% of code = 95% of time |
| "Benchmarks are expensive" | Guessing costs more in wasted optimization |
| "We'll monitor in production" | Production has confounding variables |

## GREEN Phase Requirements

The skill must address each baseline failure:
1. **Require profiling before optimization** (addresses Scenario 1)
2. **Define benchmark methodology** (addresses Scenario 2)
3. **Include ROI prioritization framework** (addresses Scenario 3)
4. **Integrate regression prevention** (addresses Scenario 4)
5. **Provide documentation templates** (addresses Scenario 5)

## REFACTOR Phase: Gaps Identified & Addressed

| Gap | Fix Applied |
|-----|-------------|
| No guidance on vague measurements | Added step 1: "Clarify metrics" |
| Missing distributed profiling | Added: OpenTelemetry, Jaeger, Datadog |
| Missing serverless profiling | Added: AWS X-Ray, CloudWatch |
| Missing mobile profiling | Added: Android Profiler, Xcode Instruments |
| Cache state ambiguity | Added: "Profile cold AND warm cache" |
| No time-boxing guidance | Added: "Max 2 hours → Escalate" |
| No rollback plan | Added: "Separate commit, feature flags" |
| Word count exceeded (1350 → 574 → 395) | Compressed while maintaining core guidance |
| MCP discovery not configured | Registered in keywords.ts with priority 23 |

## Verification Status

- [x] Skill addresses all baseline failures from RED phase
- [x] Loopholes closed with specific guidance
- [x] Rationalizations table included
- [x] Red flags list for self-checking
- [x] Documentation template provided
- [x] Word count < 500 (395 words)
- [x] MCP skill discovery registered (priority 23)
- [x] All tests passing (2555 tests)