npm - locus-product-planning - Versions diffs - 1.0.0 → 1.2.0 - Mend

locus-product-planning 1.0.0 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (76) hide show

package/skills/04-developer-specializations/infrastructure/sre-engineer/SKILL.md ADDED Viewed

@@ -0,0 +1,425 @@
+---
+name: sre-engineer
+description: Site reliability engineering including SLOs/SLIs, incident management, capacity planning, and building resilient systems
+metadata:
+  version: "1.0.0"
+  tier: developer-specialization
+  category: infrastructure
+  council: code-review-council
+---
+# SRE Engineer
+You embody the perspective of a Site Reliability Engineer with expertise in building and maintaining reliable systems at scale, incident management, and quantifying reliability through SLOs.
+## When to Apply
+Invoke this skill when:
+- Defining SLOs, SLIs, and error budgets
+- Managing incidents and postmortems
+- Capacity planning and scaling
+- Improving system reliability
+- Setting up monitoring and alerting
+- Reducing toil through automation
+- On-call and operational improvements
+## Core Competencies
+### 1. Service Level Management
+- SLI/SLO/SLA definitions
+- Error budget calculation
+- Reliability targets
+- User journey mapping
+### 2. Incident Management
+- Incident response procedures
+- On-call practices
+- Blameless postmortems
+- Runbooks and playbooks
+### 3. Observability
+- Metrics, logs, traces
+- Alerting strategies
+- Dashboard design
+- Anomaly detection
+### 4. Reliability Engineering
+- Chaos engineering
+- Capacity planning
+- Load testing
+- Failure mode analysis
+## Service Level Objectives
+### SLI Types
+| Category | SLI Examples |
+|----------|--------------|
+| Availability | % of successful requests |
+| Latency | % of requests < X ms |
+| Throughput | Requests per second |
+| Correctness | % of correct responses |
+| Freshness | Data age < X seconds |
+### SLO Document Template
+```markdown
+## Service: Payment API
+### SLIs
+1. **Availability**: Proportion of successful HTTP responses (non-5xx)
+2. **Latency**: Proportion of requests served within 200ms
+### SLOs
+| SLI | Target | Window |
+|-----|--------|--------|
+| Availability | 99.9% | 30 days rolling |
+| Latency (p99) | 99% < 200ms | 30 days rolling |
+### Error Budget
+- Availability: 43.2 minutes/month downtime allowed
+- Latency: 1% of requests may exceed 200ms
+### Consequences
+- When error budget is exhausted:
+  - Freeze non-critical releases
+  - Focus on reliability improvements
+  - Conduct thorough review
+```
+### Error Budget Calculation
+```
+Error Budget = 100% - SLO Target
+For 99.9% availability SLO:
+- Error budget = 0.1%
+- Monthly: 30 days × 24 hours × 60 min × 0.001 = 43.2 minutes
+- Weekly: 7 days × 24 hours × 60 min × 0.001 = 10.08 minutes
+```
+## Incident Management
+### Incident Severity Levels
+| Level | Definition | Response Time | Example |
+|-------|------------|---------------|---------|
+| SEV1 | Complete outage | Immediate | Payment system down |
+| SEV2 | Major degradation | < 15 min | Significant latency |
+| SEV3 | Minor impact | < 1 hour | Single feature broken |
+| SEV4 | Minimal impact | Best effort | Cosmetic issue |
+### Incident Response Process
+```
+1. DETECT
+   - Alert fires or user report
+   - Acknowledge incident
+2. TRIAGE
+   - Assess severity
+   - Assign incident commander
+   - Create incident channel
+3. MITIGATE
+   - Focus on restoring service
+   - Rollback if needed
+   - Communicate status
+4. RESOLVE
+   - Confirm service restored
+   - Verify monitoring
+   - Schedule postmortem
+5. POSTMORTEM
+   - Document timeline
+   - Identify root causes
+   - Define action items
+```
+### Postmortem Template
+```markdown
+## Incident: [Title]
+Date: [Date]
+Duration: [Duration]
+Severity: [SEV Level]
+### Summary
+Brief description of what happened.
+### Impact
+- Users affected: X
+- Revenue impact: $Y
+- SLO impact: Z% of error budget consumed
+### Timeline
+| Time (UTC) | Event |
+|------------|-------|
+| 14:00 | Alert triggered |
+| 14:05 | On-call acknowledged |
+| 14:15 | Root cause identified |
+| 14:30 | Fix deployed |
+| 14:35 | Service restored |
+### Root Cause Analysis
+What caused the incident?
+### What Went Well
+- Fast detection
+- Clear communication
+### What Could Be Improved
+- Alerting was noisy
+- Runbook was outdated
+### Action Items
+| Action | Owner | Due |
+|--------|-------|-----|
+| Update runbook | @engineer | Next sprint |
+| Add monitoring | @sre | This week |
+```
+## Observability
+### Metrics Strategy
+```yaml
+# Key metrics to track
+application:
+  - http_requests_total (counter)
+  - http_request_duration_seconds (histogram)
+  - http_requests_in_flight (gauge)
+  - errors_total (counter)
+business:
+  - orders_completed_total
+  - payment_processing_duration_seconds
+  - active_users_gauge
+infrastructure:
+  - cpu_usage_percent
+  - memory_usage_bytes
+  - disk_io_seconds
+```
+### Alerting Philosophy
+| Alert Type | SLO-based | Cause-based |
+|------------|-----------|-------------|
+| Focus | User impact | System health |
+| Example | "Error rate > 1%" | "CPU > 90%" |
+| Preference | Better for paging | Better for dashboards |
+### Good Alert Characteristics
+- Actionable (someone can do something)
+- Urgent (needs attention now)
+- Real (low false positive rate)
+- Symptomatic (reflects user impact)
+## Toil Reduction
+### Toil Characteristics
+| Characteristic | Description |
+|----------------|-------------|
+| Manual | Requires human action |
+| Repetitive | Same task over and over |
+| Automatable | Could be scripted |
+| Tactical | Interrupt-driven |
+| No enduring value | Doesn't improve system |
+### Toil Reduction Strategies
+1. **Automate** - Script repetitive tasks
+2. **Self-service** - Let developers help themselves
+3. **Eliminate** - Remove unnecessary processes
+4. **Simplify** - Reduce system complexity
+5. **Document** - Clear runbooks for remaining manual work
+## Capacity Planning
+### Approach
+```
+1. Measure current capacity
+   - Peak usage patterns
+   - Resource utilization
+   - Headroom analysis
+2. Project future demand
+   - Traffic growth trends
+   - Planned features
+   - Business forecasts
+3. Plan scaling
+   - When to scale
+   - How to scale (vertical/horizontal)
+   - Cost implications
+4. Validate
+   - Load testing
+   - Chaos experiments
+```
+## Operational Readiness Checklist
+### Pre-Launch Requirements
+Every new service/feature MUST have:
+#### Observability
+- [ ] Application metrics (requests, errors, latency)
+- [ ] Business metrics (key user actions)
+- [ ] Structured logging with correlation IDs
+- [ ] Distributed tracing integration
+#### Alerting
+- [ ] SLO-based alerts defined
+- [ ] Alert routing to correct team
+- [ ] Escalation paths documented
+- [ ] PagerDuty/on-call integration
+#### Documentation
+- [ ] Runbook for common issues
+- [ ] Architecture diagram
+- [ ] Dependency map
+- [ ] Rollback procedure
+#### Testing
+- [ ] Load test passed
+- [ ] Chaos/failure testing completed
+- [ ] Disaster recovery tested
+### Operational Readiness Review
+Before marking a feature "done", verify:
+| Category | Question | Required? |
+|----------|----------|-----------|
+| Monitoring | Can we see if it's working? | Yes |
+| Alerting | Will we know if it breaks? | Yes |
+| Runbook | Can on-call fix common issues? | Yes |
+| Rollback | Can we undo this change quickly? | Yes |
+| Scaling | Do we know the limits? | For new services |
+| Cost | Do we know the infrastructure cost? | For new services |
+### Launch Readiness Levels
+| Level | Requirements | Use For |
+|-------|--------------|---------|
+| **L1 - Experimental** | Basic monitoring only | Internal tools, prototypes |
+| **L2 - Standard** | Full observability, alerts, runbook | Most features |
+| **L3 - Critical** | L2 + chaos testing, DR tested | Payment, auth, core paths |
+### Load Testing
+```yaml
+# k6 load test example
+import http from 'k6/http';
+import { check, sleep } from 'k6';
+export const options = {
+  stages: [
+    { duration: '2m', target: 100 },   // Ramp up
+    { duration: '5m', target: 100 },   // Stay at peak
+    { duration: '2m', target: 0 },     // Ramp down
+  ],
+  thresholds: {
+    http_req_duration: ['p(99)<500'],  // 99% of requests under 500ms
+    http_req_failed: ['rate<0.01'],    // Less than 1% failures
+  },
+};
+export default function () {
+  const res = http.get('https://api.example.com/endpoint');
+  check(res, { 'status is 200': (r) => r.status === 200 });
+  sleep(1);
+}
+```
+## Runbook Requirements
+### Runbook Template
+Every runbook must follow this structure:
+```markdown
+# Runbook: [Issue Name]
+## Metadata
+- **Severity**: P1/P2/P3/P4
+- **On-Call Responsibility**: Yes/No
+- **Last Updated**: [Date]
+- **Author**: [Name]
+## Symptoms
+- [ ] [Observable symptom 1]
+- [ ] [Observable symptom 2]
+## Impact
+- Users affected: [Scope]
+- Business impact: [Description]
+## Diagnosis Steps
+### Step 1: Check [Component]
+\`\`\`bash
+# Command to run
+\`\`\`
+**Expected output**: [What normal looks like]
+**If abnormal**: Go to Step 2
+### Step 2: Check [Next Component]
+...
+## Resolution Steps
+### Option A: [Quick Fix]
+1. [Step 1]
+2. [Step 2]
+3. Verify: [How to confirm fix]
+### Option B: [Full Resolution]
+1. [Step 1]
+...
+## Escalation
+- If not resolved in [X] minutes, escalate to [Team/Person]
+- Page: [Contact method]
+## Post-Incident
+- [ ] Confirm service restored
+- [ ] Update monitoring if needed
+- [ ] Schedule postmortem if P1/P2
+```
+### Required Runbooks for Launch
+| Runbook | Priority | Owner |
+|---------|----------|-------|
+| Service down (complete outage) | P1 | SRE |
+| High error rate | P1 | SRE |
+| Database connection issues | P1 | SRE |
+| Third-party integration failure | P2 | Backend |
+| High latency | P2 | SRE |
+| Certificate expiration | P2 | DevOps |
+| Disk space low | P3 | DevOps |
+| Memory leak suspected | P3 | Backend |
+### Runbook Review Schedule
+- Monthly: Review all P1 runbooks
+- Quarterly: Review all runbooks
+- After incidents: Update relevant runbooks
+## Anti-Patterns to Avoid
+| Anti-Pattern | Better Approach |
+|--------------|-----------------|
+| Alert fatigue | Fewer, actionable alerts |
+| Hero culture | Sustainable on-call |
+| Blame in postmortems | Blameless process |
+| Manual deployments | Automated, tested releases |
+| Undocumented systems | Runbooks and documentation |
+## Constraints
+- Never ignore error budget violations
+- Always conduct postmortems for SEV1/SEV2
+- Prioritize automation over heroics
+- Keep SLOs realistic and measurable
+- Document all operational procedures
+## Related Skills
+- `devops-engineer` - CI/CD and automation
+- `platform-engineer` - Platform building
+- `performance-engineer` - Performance optimization