npm - blockmine - Versions diffs - 1.20.0 → 1.22.0 - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (434)

package/.claude/skills/sre/resources/alerting-best-practices.md ADDED

@@ -0,0 +1,282 @@
+# Alerting Best Practices
+Alert design principles, notification routing (PagerDuty, OpsGenie), alert fatigue prevention, and effective on-call alerting strategies.
+## Table of Contents
+- [Alert Design Principles](#alert-design-principles)
+- [Alert Rules](#alert-rules)
+- [Notification Routing](#notification-routing)
+- [Alert Fatigue Prevention](#alert-fatigue-prevention)
+- [Best Practices](#best-practices)
+## Alert Design Principles
+**Good Alerts:**
+```
+✅ Actionable - Can be fixed immediately
+✅ Specific - Clear what's wrong
+✅ User-impacting - Affects customers
+✅ Urgent - Requires immediate attention
+✅ Novel - Not duplicate of existing alert
+```
+**Bad Alerts:**
+```
+❌ Noisy - Frequent false positives
+❌ Vague - Unclear what to do
+❌ Premature - Fires before issue impacts users
+❌ Duplicate - Same as other alerts
+❌ Low-priority - Can wait until business hours
+```
+## Alert Rules
+**Prometheus Alerting:**
+```yaml
+groups:
+  - name: slo_alerts
+    rules:
+      # Good: User-impacting, actionable
+      - alert: HighErrorRate
+        expr: |
+          (
+            sum(rate(http_requests_total{status=~"5.."}[5m]))
+            /
+            sum(rate(http_requests_total[5m]))
+          ) > 0.05
+        for: 5m
+        labels:
+          severity: critical
+          team: platform
+        annotations:
+          summary: "Error rate above 5% for 5 minutes"
+          description: "{{ $value | humanizePercentage }} of requests failing"
+          runbook: "https://runbooks.example.com/high-error-rate"
+          dashboard: "https://grafana.example.com/d/service-health"
+      # Good: SLO-based, clear threshold
+      - alert: LatencyP95High
+        expr: |
+          histogram_quantile(0.95,
+            rate(http_request_duration_seconds_bucket[5m])
+          ) > 0.5
+        for: 10m
+        labels:
+          severity: warning
+          team: platform
+        annotations:
+          summary: "P95 latency above 500ms"
+          impact: "Users experiencing slow response times"
+```
+**Multi-Window Alerts:**
+```yaml
+# Fast burn: the long (1h) and short (5m) windows must both exceed 14.4x the budget rate for a 99.9% SLO
+- alert: ErrorBudgetBurn
+  expr: |
+    (
+      sum(rate(http_requests_total{status=~"5.."}[1h]))
+      /
+      sum(rate(http_requests_total[1h]))
+      > (14.4 * (1 - 0.999))
+    )
+    and
+    (
+      sum(rate(http_requests_total{status=~"5.."}[5m]))
+      /
+      sum(rate(http_requests_total[5m]))
+      > (14.4 * (1 - 0.999))
+    )
+  labels:
+    severity: critical
+  annotations:
+    summary: "Error budget burning at 14.4x rate"
+```
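The `14.4 * (1 - 0.999)` threshold in the rule above is plain arithmetic. The sketch below shows where it comes from and how much error budget a burn at that rate spends in one hour. It assumes a 30-day SLO window, which the original does not state, and the helper names are hypothetical.

```python
# Sketch: where 14.4 * (1 - 0.999) comes from.
# Assumption: the SLO is measured over a 30-day window (not stated in the source).

SLO_WINDOW_HOURS = 30 * 24  # 720 hours


def error_rate_threshold(slo_target: float, burn_rate: float) -> float:
    """Error ratio at which the budget burns `burn_rate` times faster than allowed."""
    error_budget = 1.0 - slo_target
    return burn_rate * error_budget


def budget_consumed(burn_rate: float, window_hours: float) -> float:
    """Fraction of the whole error budget spent if the burn lasts `window_hours`."""
    return burn_rate * window_hours / SLO_WINDOW_HOURS


threshold = error_rate_threshold(slo_target=0.999, burn_rate=14.4)
consumed = budget_consumed(burn_rate=14.4, window_hours=1)

print(f"error-rate threshold: {threshold:.4f}")
print(f"budget consumed in 1h: {consumed:.1%}")
```

Under these assumptions the alert condition corresponds to an error ratio of about 1.44%, and one hour at that rate spends roughly 2% of the monthly budget.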
+## Notification Routing
+**AlertManager Config:**
+```yaml
+route:
+  receiver: default
+  group_by: ['alertname', 'cluster']
+  group_wait: 30s
+  group_interval: 5m
+  repeat_interval: 12h
+  routes:
+    # Critical: Page immediately
+    - match:
+        severity: critical
+      receiver: pagerduty
+      group_wait: 10s
+      repeat_interval: 5m
+    # Warning: Slack notification
+    - match:
+        severity: warning
+      receiver: slack
+      repeat_interval: 4h
+    # Info: Email only
+    - match:
+        severity: info
+      receiver: email
+      repeat_interval: 24h
+receivers:
+  - name: default  # referenced by the top-level route; must be defined or the config fails to load
+  - name: pagerduty
+    pagerduty_configs:
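+      - service_key: $PAGERDUTY_SERVICE_KEY  # placeholder; Alertmanager does not expand environment variables, so substitute the real key at deploy time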
+        description: "{{ .GroupLabels.alertname }}"
+  - name: slack
+    slack_configs:
+      - api_url: $SLACK_WEBHOOK_URL
+        channel: '#alerts'
+        title: "{{ .GroupLabels.alertname }}"
+        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
+  - name: email
+    email_configs:
+      - to: 'team@example.com'
+        from: 'alertmanager@example.com'
+```
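The severity-based route tree above can be read as a first-match lookup. The following is a simplified sketch of that reading, not Alertmanager's actual implementation: it ignores grouping, timing settings, and the `continue` option, and `pick_receiver` is a hypothetical name.

```python
# Sketch of first-match routing by severity, mirroring the route tree above.
# Simplified: ignores grouping, timing, and the `continue` option.

ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "slack"),
    ({"severity": "info"}, "email"),
]
DEFAULT_RECEIVER = "default"  # the top-level route's receiver


def pick_receiver(labels: dict) -> str:
    """Return the receiver of the first child route whose matchers all equal the alert's labels."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    # No child route matched, so the alert stays with the top-level route.
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "HighErrorRate", "severity": "critical"}))
print(pick_receiver({"alertname": "DiskNotice", "severity": "debug"}))
```

An alert whose severity matches none of the child routes falls back to the top-level receiver, which is why that receiver needs to exist.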
+**PagerDuty Integration:**
+```yaml
+pagerduty_configs:
+  - routing_key: $PAGERDUTY_ROUTING_KEY
+    severity: "{{ .CommonLabels.severity }}"
+    client: "Alertmanager"
+    client_url: "{{ .ExternalURL }}"
+    description: "{{ .GroupLabels.alertname }}"
+    details:
+      firing: "{{ .Alerts.Firing | len }}"
+      resolved: "{{ .Alerts.Resolved | len }}"
+      summary: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
+```
+## Alert Fatigue Prevention
+**Strategies:**
+1. **High Signal-to-Noise Ratio**
+```
+Target: < 5% false positive rate
+If alert fires but no action taken → remove or adjust
+```
+2. **Appropriate Thresholds**
+```yaml
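+# Too sensitive
+expr: cpu_usage > 0.5  # Fires constantly
+# Better: higher threshold plus a hold period (`for` is a separate rule field, not part of the expression)
+expr: cpu_usage > 0.9  # Sustained high usage
+for: 10m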
+```
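To see why a higher threshold combined with a hold period (such as the 10 minutes above) cuts noise, here is a small sketch. It approximates the hold period as a count of consecutive samples, which is a simplification of how Prometheus tracks pending alerts; the sample values and the `firing_points` helper are made up for illustration.

```python
# Sketch: a threshold with a hold period, similar in spirit to a rule's `for:` field.
# Assumption: the hold period is modelled as a number of consecutive samples.


def firing_points(samples, threshold, hold):
    """Return, per sample, whether the alert would be firing at that point."""
    streak = 0  # consecutive samples above the threshold so far
    result = []
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        result.append(streak >= hold)
    return result


cpu = [0.95, 0.6, 0.92, 0.93, 0.96, 0.97, 0.5]  # illustrative CPU usage samples

spiky = firing_points(cpu, threshold=0.5, hold=1)      # low threshold, no hold period
sustained = firing_points(cpu, threshold=0.9, hold=3)  # high threshold, three samples

print("low threshold fires on", sum(spiky), "of", len(cpu), "samples")
print("sustained rule fires on", sum(sustained), "of", len(cpu), "samples")
```

The single spike at the start never reaches the hold count, so only the sustained run near the end triggers the second rule.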
+3. **Group Similar Alerts**
+```yaml
+route:
+  group_by: ['alertname', 'cluster', 'service']
+  group_wait: 30s  # Wait to group
+  group_interval: 5m  # Send grouped updates
+```
+4. **Escalation Policies**
+```yaml
+# PagerDuty escalation
+escalation_policy:
+  - level: 1
+    targets: [on_call_primary]
+    escalation_delay: 5m
+  - level: 2
+    targets: [on_call_secondary, team_lead]
+    escalation_delay: 10m
+  - level: 3
+    targets: [engineering_manager]
+    escalation_delay: 15m
+```
+5. **Alert Inhibition**
+```yaml
+inhibit_rules:
+  # If service is down, don't alert on high latency
+  - source_match:
+      severity: critical
+      alertname: ServiceDown
+    target_match:
+      severity: warning
+      alertname: HighLatency
+    equal: ['service']
+```
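The inhibition rule above can be read as: suppress a target alert when some firing alert matches the source and both agree on every label listed in `equal`. A minimal sketch of that logic follows, with alerts represented as plain label dictionaries; the function name and the sample alerts are hypothetical.

```python
# Sketch of the inhibition check described by the rule above.
# Alerts are modelled as dicts of labels; this is not Alertmanager's code.

RULE = {
    "source_match": {"severity": "critical", "alertname": "ServiceDown"},
    "target_match": {"severity": "warning", "alertname": "HighLatency"},
    "equal": ["service"],
}


def matches(alert, matchers):
    """True if the alert carries every label/value pair in `matchers`."""
    return all(alert.get(key) == value for key, value in matchers.items())


def is_inhibited(target, firing, rule=RULE):
    """True if a firing source alert shares the `equal` labels with `target`."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        if all(source.get(label) == target.get(label) for label in rule["equal"]):
            return True
    return False


down = {"alertname": "ServiceDown", "severity": "critical", "service": "api"}
slow_api = {"alertname": "HighLatency", "severity": "warning", "service": "api"}
slow_web = {"alertname": "HighLatency", "severity": "warning", "service": "web"}
firing = [down, slow_api, slow_web]

print(is_inhibited(slow_api, firing))
print(is_inhibited(slow_web, firing))
```

Only the latency alert for the same `service` as the outage is suppressed; the one for a different service still notifies.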
+## Best Practices
+### 1. Include Runbook Links
+```yaml
+annotations:
+  runbook: "https://runbooks.example.com/{{ $labels.alertname }}"
+```
+### 2. Add Context
+```yaml
+annotations:
+  description: |
+    Service {{ $labels.service }} error rate is {{ $value | humanizePercentage }}
+    Dashboard: https://grafana.example.com/d/{{ $labels.service }}
+    Logs: https://logs.example.com/?service={{ $labels.service }}
+```
+### 3. Test Alerts
+```bash
+# Send test alert
+amtool alert add alertname=TestAlert severity=warning
+# Check routing
+amtool config routes test --config.file=alertmanager.yml \
+  severity=critical team=platform
+```
+### 4. Review Alerts Regularly
+```yaml
+# Quarterly alert audit
+review_process:
+  - Check false positive rate
+  - Verify runbooks are current
+  - Update thresholds based on trends
+  - Remove unused alerts
+```
+### 5. Time-Based Routing
+```yaml
+# Different routing for business hours vs off-hours
+routes:
+  - match:
+      severity: warning
+    receiver: slack
+    active_time_intervals:
+      - business_hours
+  - match:
+      severity: warning
+    receiver: email
+    active_time_intervals:
+      - off_hours
+```
+---
+**Related Resources:**
+- [incident-management.md](incident-management.md)
+- [on-call-runbooks.md](on-call-runbooks.md)
+- [observability-stack.md](observability-stack.md)

package/.claude/skills/sre/resources/capacity-planning.md ADDED

@@ -0,0 +1,226 @@
+# Capacity Planning
+Resource forecasting, growth modeling, scalability analysis, load testing, and proactive capacity management.
+## Table of Contents
+- [Capacity Planning Process](#capacity-planning-process)
+- [Resource Forecasting](#resource-forecasting)
+- [Load Testing](#load-testing)
+- [Scalability Analysis](#scalability-analysis)
+## Capacity Planning Process
+```yaml
+quarterly_process:
+  1_collect_data:
+    - Current resource usage trends
+    - Traffic growth patterns
+    - Business projections
+    - Seasonal variations
+  2_forecast:
+    - Project 6-12 months ahead
+    - Account for growth initiatives
+    - Include safety margin (20-30%)
+  3_plan_upgrades:
+    - Identify bottlenecks
+    - Plan infrastructure changes
+    - Budget for new resources
+  4_implement:
+    - Gradual rollout
+    - Monitor impact
+    - Adjust as needed
+```
+## Resource Forecasting
+**Linear Growth Model:**
+```python
+import pandas as pd
+import numpy as np
+from sklearn.linear_model import LinearRegression
+def forecast_capacity(historical_data, months_ahead=6):
+    """
+    Forecast resource requirements
+    Args:
+        historical_data: DataFrame with 'date' and 'usage' columns
+        months_ahead: Number of months to forecast
+    Returns:
+        Forecasted usage values
+    """
+    # Prepare data
+    X = np.array(range(len(historical_data))).reshape(-1, 1)
+    y = historical_data['usage'].values
+    # Train model
+    model = LinearRegression()
+    model.fit(X, y)
+    # Forecast
+    future_X = np.array(range(len(historical_data),
+                              len(historical_data) + months_ahead)).reshape(-1, 1)
+    forecast = model.predict(future_X)
+    # Add 30% safety margin
+    return forecast * 1.3
+# Usage
+data = pd.DataFrame({
+    # 'MS' (month start) avoids the 'M' alias, which is deprecated since pandas 2.2
+    'date': pd.date_range('2023-01-01', periods=12, freq='MS'),
+    'usage': [100, 110, 115, 125, 130, 140, 145, 155, 160, 170, 175, 185]
+})
+forecast = forecast_capacity(data, months_ahead=6)
+print(f"Forecasted usage in 6 months: {forecast[-1]:.0f}")
+```
+**Capacity Metrics:**
+```yaml
+cpu:
+  current_avg: 45%
+  current_p95: 75%
+  target_max: 80%
+  growth_rate: 5% monthly
+  action_needed: Scale in 4 months
+memory:
+  current_avg: 60%
+  current_p95: 85%
+  target_max: 85%
+  growth_rate: 3% monthly
+  action_needed: Scale in 6 months
+storage:
+  current_usage: 500GB
+  total_capacity: 1TB
+  growth_rate: 50GB monthly
+  action_needed: Scale in 10 months
+```
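The `action_needed` value for storage above follows from dividing the remaining headroom by the monthly growth. The sketch below shows that calculation, assuming constant absolute growth and 1 TB = 1000 GB; `months_until_limit` is a hypothetical helper. The percentage-based CPU and memory figures depend on whether the growth rate is relative or in percentage points, so they are left out here.

```python
# Sketch: months of headroom under constant absolute growth.
# Assumptions: growth is linear and 1 TB is treated as 1000 GB.
import math


def months_until_limit(current: float, limit: float, growth_per_month: float) -> int:
    """Whole months before `current` reaches `limit` at a fixed monthly growth."""
    if growth_per_month <= 0:
        raise ValueError("growth_per_month must be positive")
    remaining = limit - current
    if remaining <= 0:
        return 0  # already at or over the limit
    return math.floor(remaining / growth_per_month)


storage_months = months_until_limit(current=500, limit=1000, growth_per_month=50)
print(f"storage reaches capacity in about {storage_months} months")
```

In practice the trigger would likely be a target below full capacity (for example the 20-30% safety margin from the planning process), which shortens the horizon accordingly.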
+## Load Testing
+**k6 Load Test:**
+```javascript
+// load-test.js
+import http from 'k6/http';
+import { check, sleep } from 'k6';
+export const options = {
+  stages: [
+    { duration: '5m', target: 100 },   // Ramp up to 100 users
+    { duration: '10m', target: 100 },  // Stay at 100 users
+    { duration: '5m', target: 500 },   // Ramp to 500 users
+    { duration: '10m', target: 500 },  // Stay at 500
+    { duration: '5m', target: 1000 },  // Spike to 1000
+    { duration: '5m', target: 0 },     // Ramp down
+  ],
+  thresholds: {
+    http_req_duration: ['p(95)<500'], // 95% of requests < 500ms
+    http_req_failed: ['rate<0.01'],   // Error rate < 1%
+  },
+};
+export default function () {
+  const res = http.get('https://api.example.com/');
+  check(res, {
+    'status is 200': (r) => r.status === 200,
+    'response time < 500ms': (r) => r.timings.duration < 500,
+  });
+  sleep(1);
+}
+```
+**Run Load Test:**
+```bash
+# Local test
+k6 run load-test.js
+# Cloud test (distributed)
+k6 cloud load-test.js
+# With custom VUs
+k6 run --vus 1000 --duration 30m load-test.js
+```
+## Scalability Analysis
+**Horizontal vs Vertical Scaling:**
+```yaml
+horizontal_scaling:
+  when: Stateless applications, need high availability
+  pros:
+    - No downtime
+    - Better fault tolerance
+    - Linear cost scaling
+  cons:
+    - More complex
+    - Coordination overhead
+vertical_scaling:
+  when: Stateful applications, simpler architecture
+  pros:
+    - Simpler architecture
+    - Less coordination
+  cons:
+    - Downtime required
+    - Upper limits
+    - Single point of failure
+```
+**Auto-scaling Configuration:**
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: api-hpa
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: api
+  minReplicas: 3
+  maxReplicas: 100
+  metrics:
+  - type: Resource
+    resource:
+      name: cpu
+      target:
+        type: Utilization
+        averageUtilization: 70
+  - type: Resource
+    resource:
+      name: memory
+      target:
+        type: Utilization
+        averageUtilization: 80
+  behavior:
+    scaleDown:
+      stabilizationWindowSeconds: 300
+      policies:
+      - type: Percent
+        value: 50
+        periodSeconds: 60
+    scaleUp:
+      stabilizationWindowSeconds: 0
+      policies:
+      - type: Percent
+        value: 100
+        periodSeconds: 30
+      - type: Pods
+        value: 5
+        periodSeconds: 30
+      selectPolicy: Max
+```
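For intuition about how the autoscaler above reacts to load, here is a sketch of the replica calculation documented for the Horizontal Pod Autoscaler: the current replica count is scaled by the ratio of observed to target utilization and rounded up. The sketch assumes the default 10% tolerance, handles a single metric, and ignores the `behavior` policies; the function name is hypothetical.

```python
# Sketch of the core HPA replica calculation for one metric.
# Assumptions: default tolerance of 0.1; scale-up/scale-down policies are ignored.
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=3, max_replicas=100, tolerance=0.1):
    """Return the replica count the autoscaler would aim for."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(10, current_utilization=90, target_utilization=70))
print(desired_replicas(10, current_utilization=72, target_utilization=70))
print(desired_replicas(4, current_utilization=20, target_utilization=70))
```

With several metrics configured, as in the manifest above, the autoscaler computes a value per metric and uses the largest one.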
+---
+**Related Resources:**
+- [performance-optimization.md](performance-optimization.md)
+- [resource-management.md](../platform-engineering/resources/resource-management.md)

package/.claude/skills/sre/resources/chaos-engineering.md ADDED

@@ -0,0 +1,193 @@
+# Chaos Engineering
+Chaos Monkey, fault injection, failure mode testing, Chaos Toolkit, Litmus Chaos, and resilience testing practices.
+## Table of Contents
+- [Principles](#principles)
+- [Tools](#tools)
+- [Experiments](#experiments)
+- [Best Practices](#best-practices)
+## Principles
+**Chaos Engineering Principles:**
+1. Build a hypothesis around steady state
+2. Vary real-world events
+3. Run experiments in production
+4. Automate experiments
+5. Minimize blast radius
+## Tools
+**Chaos Mesh (Kubernetes):**
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: PodChaos
+metadata:
+  name: pod-failure-example
+spec:
+  action: pod-failure
+  mode: one
+  selector:
+    namespaces:
+      - production
+    labelSelectors:
+      app: api-service
+  duration: "30s"
+  scheduler:
+    cron: "@every 2h"
+```
+**Network Chaos:**
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: NetworkChaos
+metadata:
+  name: network-delay
+spec:
+  action: delay
+  mode: all
+  selector:
+    namespaces:
+      - production
+    labelSelectors:
+      app: api-service
+  delay:
+    latency: "100ms"
+    correlation: "25"
+    jitter: "10ms"
+  duration: "5m"
+```
+**Litmus Chaos:**
+```yaml
+apiVersion: litmuschaos.io/v1alpha1
+kind: ChaosEngine
+metadata:
+  name: nginx-chaos
+spec:
+  appinfo:
+    appns: 'default'
+    applabel: 'app=nginx'
+    appkind: 'deployment'
+  chaosServiceAccount: litmus-admin
+  experiments:
+  - name: pod-delete
+    spec:
+      components:
+        env:
+        - name: TOTAL_CHAOS_DURATION
+          value: '30'
+        - name: CHAOS_INTERVAL
+          value: '10'
+        - name: FORCE
+          value: 'false'
+```
+## Experiments
+**Pod Deletion Test:**
+```bash
+# Verify system handles pod failures
+kubectl delete pod -l app=api-service --grace-period=0
+# Expected outcome:
+# - New pod starts automatically
+# - No service interruption
+# - Requests handled by other pods
+```
+**Database Failure Simulation:**
+```yaml
+# Simulate database connection issues
+apiVersion: chaos-mesh.org/v1alpha1
+kind: NetworkChaos
+metadata:
+  name: db-partition
+spec:
+  action: partition
+  mode: all
+  selector:
+    namespaces:
+      - production
+    labelSelectors:
+      app: api-service
+  direction: to
+  target:
+    selector:
+      namespaces:
+        - production
+      labelSelectors:
+        app: postgres
+  duration: "2m"
+```
+**CPU Stress Test:**
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: StressChaos
+metadata:
+  name: cpu-stress
+spec:
+  mode: one
+  selector:
+    namespaces:
+      - production
+    labelSelectors:
+      app: api-service
+  stressors:
+    cpu:
+      workers: 4
+      load: 80
+  duration: "5m"
+```
+## Best Practices
+### 1. Start Small
+```
+Begin in dev/staging
+Small blast radius
+Short duration
+Gradually increase scope
+```
+### 2. Define Success Criteria
+```yaml
+experiment:
+  hypothesis: "API continues serving traffic during pod failure"
+  success_criteria:
+    - Error rate < 0.1%
+    - P95 latency < 500ms
+    - No customer impact
+  failure_action: Rollback immediately
+```
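Success criteria like the ones above are easiest to enforce when they are checked mechanically after each run. The sketch below encodes the three criteria as a function over measured results; the `ExperimentResult` type, its field names, and the use of customer reports as a proxy for customer impact are assumptions made for illustration.

```python
# Sketch: checking an experiment's measurements against the criteria above.
# Assumption: "no customer impact" is approximated by zero customer reports.
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    error_rate: float       # fraction of failed requests, e.g. 0.0005 means 0.05%
    p95_latency_ms: float   # observed P95 latency in milliseconds
    customer_reports: int   # number of customer-facing incidents reported


def violations(result: ExperimentResult) -> list:
    """Return the criteria that were violated; an empty list means the hypothesis held."""
    failed = []
    if result.error_rate >= 0.001:
        failed.append("error rate >= 0.1%")
    if result.p95_latency_ms >= 500:
        failed.append("P95 latency >= 500ms")
    if result.customer_reports > 0:
        failed.append("customer impact reported")
    return failed


healthy = ExperimentResult(error_rate=0.0004, p95_latency_ms=320, customer_reports=0)
degraded = ExperimentResult(error_rate=0.0030, p95_latency_ms=640, customer_reports=0)

print(violations(healthy))
print(violations(degraded))
```

A non-empty result would be the signal to apply the `failure_action`, that is, to roll back immediately.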
+### 3. Automate Chaos
+```yaml
+# Regular chaos experiments
+schedule:
+  daily: Pod deletion
+  weekly: Network latency
+  monthly: Region failure simulation
+```
+### 4. Monitor During Experiments
+```yaml
+observability:
+  - Real-time dashboards
+  - Alert on anomalies
+  - Correlate with experiment timeline
+  - Document unexpected behavior
+```
+---
+**Related Resources:**
+- [reliability-patterns.md](reliability-patterns.md)
+- [incident-management.md](incident-management.md)