@jetrabbits/agentic 0.0.1

Files changed (440)

package/areas/devops/sre/rules/on-call-standards.md ADDED

@@ -0,0 +1,25 @@
+# Rule: On-Call Standards
+**Priority**: P1 — On-call engineers must be equipped and protected.
+## On-Call Requirements
+1. **Response times** — P0: 5 min; P1: 15 min; P2: 1h (business hours).
+2. **On-call rotation** — maximum 1 week primary + 1 week secondary per engineer per month.
+3. **Runbook coverage** — every alert that can page must have a runbook. No runbook = alert is demoted to warning until written.
+4. **Tooling access** — on-call engineer has prod read + limited write access (rollback, scale, restart). Full access requires separate MFA.
+5. **Escalation path** documented and tested quarterly.
+## Incident Severity (align with platform)
+| Severity | Definition | Response |
+|:---|:---|:---|
+| P0 | Complete outage; data loss | Immediate; all hands |
+| P1 | Major feature broken; >25% users affected | 15 min; on-call + lead |
+| P2 | Degraded; workaround available | 1h; business hours OK |
+| P3 | Minor issue; no user impact | Next sprint |
+## Toil Budget
+- On-call toil (repetitive, automatable work) must not exceed 50% of on-call hours.
+- Toil > 50% for 2 consecutive rotations → mandatory automation sprint.
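
The toil budget rule can be checked mechanically. The following is a minimal sketch, assuming toil and on-call hours are logged per rotation; the `Rotation` type and `needs_automation_sprint` helper are hypothetical names, not part of the package.

```python
from dataclasses import dataclass

TOIL_LIMIT = 0.50  # rule: toil must not exceed 50% of on-call hours


@dataclass
class Rotation:
    on_call_hours: float  # hours worked on call during the rotation
    toil_hours: float     # hours spent on repetitive, automatable work

    @property
    def toil_ratio(self) -> float:
        # Guard against an empty rotation to avoid division by zero
        return self.toil_hours / self.on_call_hours if self.on_call_hours else 0.0


def needs_automation_sprint(rotations: list[Rotation]) -> bool:
    """True when the two most recent rotations both exceeded the toil limit."""
    if len(rotations) < 2:
        return False
    return all(r.toil_ratio > TOIL_LIMIT for r in rotations[-2:])
```

A rotation at exactly 50% toil does not count as over budget here, because the rule is written as a strict "> 50%".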

package/areas/devops/sre/rules/slo-policy.md ADDED

@@ -0,0 +1,31 @@
+# Rule: SLO Policy
+**Priority**: P1 — Services in production must have defined SLOs with error budgets.
+## SLO Definition Requirements
+1. **Every Tier 1 service must define:**
+   - **SLI** (what we measure): e.g., proportion of requests completing < 500ms
+   - **SLO** (the target): e.g., 99.5% of requests complete < 500ms over 28 days
+   - **Error budget**: 100% - SLO = 0.5% = ~3.4h of downtime per 28 days
+2. **SLI types (choose appropriate)**
+   | SLI type | Formula | Use when |
+   |:---|:---|:---|
+   | Availability | good_requests / total_requests | HTTP services |
+   | Latency | requests_below_threshold / total_requests | Latency-sensitive APIs |
+   | Throughput | actual_throughput / target_throughput | Batch/stream processing |
+   | Correctness | correct_results / total_results | Data pipelines |
+3. **SLO tiers**
+   | Tier | Example SLO | Error budget / 28d |
+   |:---|:---|:---|
+   | Tier 1 (revenue) | 99.9% availability | 40 minutes |
+   | Tier 2 (internal) | 99.5% availability | 3.4 hours |
+   | Tier 3 (batch) | 99.0% availability | 6.7 hours |
+4. **28-day rolling window** is the default measurement period. Rolling is preferred over calendar month (avoids "burn ahead" gaming around the monthly reset).
+5. **SLOs reviewed quarterly** — adjust based on actual reliability data.
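
An error budget for a 28-day window follows directly from the SLO target. This is a small sketch of that arithmetic, assuming an availability SLO expressed as a fraction; `error_budget_minutes` is an illustrative helper name.

```python
WINDOW_DAYS = 28  # default rolling window from the policy


def error_budget_minutes(slo: float, window_days: int = WINDOW_DAYS) -> float:
    """Minutes of allowed unavailability in the window for an SLO given as a fraction."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60  # 28 days = 40,320 minutes
    return (1.0 - slo) * window_minutes


for slo in (0.999, 0.995, 0.99):
    minutes = error_budget_minutes(slo)
    print(f"{slo:.1%}: {minutes:.1f} min ({minutes / 60:.2f} h)")
```

Over 28 days this gives roughly 40.3, 201.6 and 403.2 minutes for 99.9%, 99.5% and 99.0%. The familiar 43 minute / 3.6 hour / 7.2 hour figures correspond to a 30-day window.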

package/areas/devops/sre/skills/capacity-planning/SKILL.md ADDED

@@ -0,0 +1,162 @@
+---
+name: capacity-planning
+type: skill
+description: Forecast infrastructure capacity needs — traffic projection, resource headroom calculations, node pool sizing, K8s cluster capacity.
+related-rules:
+  - slo-policy.md
+allowed-tools: Read, Write, Edit, Bash
+---
+# Skill: Capacity Planning
+> **Expertise:** Traffic forecasting, per-pod resource modeling, node pool sizing, cluster capacity headroom, VPA/HPA tuning for growth.
+## When to load
+When planning for growth, validating current cluster headroom, sizing node pools, or preparing for a high-traffic event (sale, launch).
+## Traffic Forecasting
+```promql
+# Current RPS baseline (7-day average)
+avg_over_time(
+  sum(rate(http_requests_total{service="checkout-service"}[5m]))[7d:5m]
+)
+# Peak RPS (7-day p99)
+quantile_over_time(0.99,
+  sum(rate(http_requests_total{service="checkout-service"}[5m]))[7d:5m]
+)
+# Week-over-week growth rate
+(
+  avg_over_time(sum(rate(http_requests_total[5m]))[7d:5m])
+  /
+  avg_over_time(sum(rate(http_requests_total[5m]))[7d:5m] offset 7d)
+) - 1
+# e.g. 0.08 = 8% weekly growth → ~7.4× in 6 months (1.08^26)
+```
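
Turning a week-over-week growth rate into a longer-range multiplier is a compounding calculation. The sketch below assumes growth stays constant and compounds weekly; both helper names are hypothetical.

```python
import math


def compound_growth(weekly_rate: float, weeks: int) -> float:
    """Traffic multiplier after `weeks` of constant week-over-week growth."""
    return (1.0 + weekly_rate) ** weeks


def weeks_to_multiple(weekly_rate: float, multiple: float) -> float:
    """Weeks needed for traffic to grow by `multiple` at a constant weekly rate."""
    return math.log(multiple) / math.log(1.0 + weekly_rate)


six_month_multiplier = compound_growth(0.08, 26)   # 26 weeks is about 6 months
doubling_time_weeks = weeks_to_multiple(0.08, 2.0)
```

At 8% weekly growth, traffic grows about 7.4× over 26 weeks and doubles roughly every 9 weeks.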
+## Per-Pod Resource Modeling
+```
+Model: what resources does 1 pod consume per RPS unit?
+Step 1: current pod metrics
+  - pods = 4 (HPA current)
+  - RPS = 200 req/s (avg)
+  - CPU per pod = 320m (avg), 480m (p99)
+  - Memory per pod = 280Mi (avg), 380Mi (peak)
+Step 2: per-RPS resource cost
+  - CPU per RPS = 320m / (200/4) = 6.4m CPU per RPS
+  - Mem per RPS = 280Mi / (200/4) = 5.6Mi per RPS
+Step 3: future requirements at 2× traffic (400 RPS)
+  - CPU needed = 400 × 6.4m = 2560m = 2.56 cores
+  - Mem needed = 400 × 5.6Mi = 2240Mi ≈ 2.2Gi
+  - Pods needed (500m CPU request per pod, 70% CPU target) = 2560m / (500m × 0.7) = 7.3 → 8 pods min
+  - Update HPA maxReplicas to accommodate
+```
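
The three modeling steps can be written as a short script. This is a sketch using the numbers from the worked example; the 500m CPU request per pod is taken from the pods-needed line, and the function names are illustrative.

```python
import math


def per_rps_cost(per_pod_usage: float, total_rps: float, pods: int) -> float:
    """Resource consumed per unit of RPS, given average per-pod usage."""
    rps_per_pod = total_rps / pods
    return per_pod_usage / rps_per_pod


def pods_needed(target_rps: float, cpu_per_rps_m: float,
                cpu_request_m: float, target_util: float) -> int:
    """Pods required so that each pod runs at `target_util` of its CPU request."""
    total_cpu_m = target_rps * cpu_per_rps_m
    return math.ceil(total_cpu_m / (cpu_request_m * target_util))


# Step 1 and 2: 4 pods serving 200 RPS at 320m CPU and 280Mi memory each
cpu_per_rps = per_rps_cost(320, total_rps=200, pods=4)   # millicores per RPS
mem_per_rps = per_rps_cost(280, total_rps=200, pods=4)   # MiB per RPS

# Step 3: requirements at 2x traffic (400 RPS)
pods = pods_needed(400, cpu_per_rps, cpu_request_m=500, target_util=0.7)
print(cpu_per_rps, mem_per_rps, pods)
```

The model is linear in RPS, which is a simplification: memory in particular often has a fixed baseline per pod that does not scale with traffic.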
+## Cluster Capacity Check
+```bash
+# Total cluster allocatable resources
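+kubectl get nodes -o json | jq '
+  # CPU is reported as whole cores ("4") or millicores ("3920m"); memory is assumed to be in Ki
+  def cores: if endswith("m") then (rtrimstr("m") | tonumber) / 1000 else tonumber end;
+  [.items[].status.allocatable] |
+  {
+    cpu: (map(.cpu | cores) | add),
+    memory_gi: (map(.memory | rtrimstr("Ki") | tonumber) | add / 1048576)
+  }'
+# Currently requested resources (sum of all pod requests)
+kubectl get pods -A -o json | jq '
+  def cores: if endswith("m") then (rtrimstr("m") | tonumber) / 1000 else tonumber end;
+  # Memory requests are assumed to use the Mi suffix; other units (Gi, M, G) need their own conversion
+  [.items[].spec.containers[].resources.requests // {}] |
+  {
+    cpu_requested: (map(.cpu // "0" | cores) | add),
+    mem_requested_gi: (map(.memory // "0Mi" | rtrimstr("Mi") | tonumber) | add / 1024)
+  }'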
+# Headroom per node (allocatable - requested)
+kubectl describe nodes | grep -A5 "Allocated resources:"
+# Quick headroom summary script
+kubectl get nodes -o custom-columns=\
+"NAME:.metadata.name,\
+CPU_ALLOC:.status.allocatable.cpu,\
+MEM_ALLOC:.status.allocatable.memory,\
+READY:.status.conditions[?(@.type==\"Ready\")].status"
+```
+## Node Pool Sizing Formula
+```
+Variables:
+  T = target RPS (peak)
+  cpu_per_rps = CPU cost per RPS (millicores), from the per-pod model
+  R_cpu = CPU request per pod (millicores)
+  R_mem = memory request per pod (MiB)
+  util = target utilisation (e.g. 0.70 = 70%)
+  headroom = spare capacity factor (e.g. 1.3 = 30% spare)
+  node_cpu = node allocatable CPU (millicores)
+  node_mem = node allocatable memory (MiB)
+Pods needed:
+  pods = ceil((T × cpu_per_rps × headroom) / (R_cpu × util))
+Nodes needed for CPU:
+  nodes_cpu = ceil((pods × R_cpu) / (node_cpu × util))
+Nodes needed for Memory:
+  nodes_mem = ceil((pods × R_mem) / (node_mem × util))
+Required nodes = max(nodes_cpu, nodes_mem) + 1 (N+1 for failure tolerance)
+```
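
The sizing formula can be sketched as a function. This version assumes pod count is derived from the per-pod CPU request (total CPU demand with headroom, divided by the usable share of each pod's request), and the example node size of 4000m / 16384MiB is a hypothetical round figure rather than a real allocatable value.

```python
import math
from dataclasses import dataclass


@dataclass
class SizingInput:
    target_rps: float          # T: peak RPS to plan for
    cpu_per_rps_m: float       # millicores consumed per RPS
    pod_cpu_request_m: float   # R_cpu
    pod_mem_request_mi: float  # R_mem
    node_cpu_m: float          # node allocatable CPU
    node_mem_mi: float         # node allocatable memory
    util: float = 0.70         # target utilisation
    headroom: float = 1.3      # spare capacity factor


def size_node_pool(s: SizingInput) -> dict[str, int]:
    """Return pod and node counts, including one extra node for failure tolerance."""
    demand_m = s.target_rps * s.cpu_per_rps_m * s.headroom
    pods = math.ceil(demand_m / (s.pod_cpu_request_m * s.util))
    nodes_cpu = math.ceil(pods * s.pod_cpu_request_m / (s.node_cpu_m * s.util))
    nodes_mem = math.ceil(pods * s.pod_mem_request_mi / (s.node_mem_mi * s.util))
    return {
        "pods": pods,
        "nodes_cpu": nodes_cpu,
        "nodes_mem": nodes_mem,
        "nodes": max(nodes_cpu, nodes_mem) + 1,  # N+1
    }


plan = size_node_pool(SizingInput(
    target_rps=400, cpu_per_rps_m=6.4,
    pod_cpu_request_m=500, pod_mem_request_mi=512,
    node_cpu_m=4000, node_mem_mi=16384,
))
print(plan)
```

With these inputs the workload is CPU-bound: memory alone would fit on one node, so the CPU figure plus the N+1 node determines the pool size.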
+## Pre-Event Capacity (sale, product launch)
+```bash
+# 1. Estimate peak multiplier from past events or product team forecast
+PEAK_MULTIPLIER=5   # "we expect 5× normal traffic for 2 hours"
+# 2. Pre-scale HPA min replicas before event
+kubectl patch hpa order-service -n production \
+  -p '{"spec":{"minReplicas":10}}'
+# 3. Pre-warm node pool (add nodes before autoscaler reacts)
+# AWS: adjust ASG desired capacity
+aws autoscaling set-desired-capacity \
+  --auto-scaling-group-name prod-workers \
+  --desired-capacity 12
+# 4. Delay HPA scale-down during event window (3600s is the maximum stabilization window)
+kubectl patch hpa order-service -n production \
+  -p '{"spec":{"behavior":{"scaleDown":{"stabilizationWindowSeconds":3600}}}}'
+# 5. Restore after event
+kubectl patch hpa order-service -n production \
+  -p '{"spec":{"minReplicas":2,"behavior":{"scaleDown":{"stabilizationWindowSeconds":300}}}}'
+```
+## Capacity Planning Report (monthly)
+```markdown
+## Capacity Report — November 2024
+### Current State
+- Cluster: 9 workers (cx41, 4 vCPU / 16Gi each)
+- CPU utilisation: 58% avg, 71% peak
+- Memory utilisation: 62% avg, 74% peak
+- Headroom: ~25% CPU, ~20% Memory
+### Growth Trend
+- Traffic WoW growth: +6.2% (8 weeks avg)
+- Extrapolation: peak CPU (71%) reaches 100% in ~6 weeks at current growth
+### Recommendations
+1. Add 2 nodes before end of Q4 (reduce peak CPU to < 60%)
+2. Evaluate spot nodes for worker pool (60-75% cost saving)
+3. Review order-service memory limit — VPA recommends 640Mi vs current 512Mi
+### Next Review: December 2024
+```
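
The extrapolation in a capacity report can be derived from peak utilisation and the growth trend. This is a rough sketch, assuming "exhausted" means peak CPU reaching 100% and that load compounds weekly at a constant rate; both helper names are hypothetical.

```python
import math


def weeks_until_exhausted(peak_util: float, weekly_growth: float,
                          limit: float = 1.0) -> float:
    """Weeks until peak utilisation reaches `limit` under constant weekly growth."""
    if peak_util <= 0 or weekly_growth <= 0:
        raise ValueError("peak_util and weekly_growth must be positive")
    if peak_util >= limit:
        return 0.0  # already at or past the limit
    return math.log(limit / peak_util) / math.log(1.0 + weekly_growth)


def util_after_resize(peak_util: float, nodes_now: int, nodes_after: int) -> float:
    """Peak utilisation after resizing a pool of identical nodes under the same load."""
    return peak_util * nodes_now / nodes_after


weeks = weeks_until_exhausted(0.71, 0.062)   # 71% peak CPU, +6.2% WoW
after = util_after_resize(0.71, 9, 11)       # adding 2 nodes to 9
print(f"{weeks:.1f} weeks to exhaustion; {after:.0%} peak CPU after resize")
```

For the example report this gives a little under 6 weeks of runway, and adding 2 nodes brings peak CPU to about 58%, under the 60% target.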

package/areas/devops/sre/skills/chaos-engineering/SKILL.md ADDED

@@ -0,0 +1,186 @@
+---
+name: chaos-engineering
+type: skill
+description: Design and run chaos experiments in Kubernetes — pod failures, network partitions, resource pressure with LitmusChaos and manual chaos.
+related-rules:
+  - slo-policy.md
+  - on-call-standards.md
+allowed-tools: Read, Write, Edit, Bash
+---
+# Skill: Chaos Engineering
+> **Expertise:** LitmusChaos experiments, manual K8s chaos, network partition testing, graceful degradation validation.
+## When to load
+When designing chaos experiments, validating failover behavior, verifying SLO headroom, or onboarding a service to chaos testing.
+## Chaos Experiment Design Principles
+```
+1. Define steady state first
+   → What does "working" look like? (SLI baseline: error rate < 0.1%, p99 < 200ms)
+2. Hypothesize
+   → "If 1/3 of pods die, the service will continue serving with p99 < 500ms"
+3. Blast radius control
+   → Start with staging. Start with 1 pod. Increase gradually.
+4. Abort conditions
+   → Auto-stop if error rate > 1% or p99 > 1s for > 2 min
+5. Document and act
+   → Passed = evidence of resilience. Failed = fix + re-test. Never just accept failure.
+```
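
The abort condition in principle 4 can be expressed as a check over a stream of metric samples. This sketch assumes the "> 2 min" duration applies to either threshold being breached continuously; the `Sample` type and `should_abort` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float           # seconds since experiment start
    error_rate: float  # fraction of failed requests, 0.01 == 1%
    p99_s: float       # p99 latency in seconds


def should_abort(samples: list[Sample], max_error_rate: float = 0.01,
                 max_p99_s: float = 1.0, sustain_s: float = 120.0) -> bool:
    """True once a threshold breach has lasted longer than `sustain_s` without recovery."""
    breach_start = None
    for s in samples:
        breached = s.error_rate > max_error_rate or s.p99_s > max_p99_s
        if not breached:
            breach_start = None  # recovery resets the timer
            continue
        if breach_start is None:
            breach_start = s.t
        if s.t - breach_start > sustain_s:
            return True
    return False
```

Resetting the timer on any healthy sample keeps short spikes from stopping an experiment, at the cost of tolerating a flapping service. A stricter variant could count total breached time instead.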
+## Manual Chaos (no tooling needed)
+```bash
+# ── Pod kill (test restart recovery) ──────────────────────────
+kubectl delete pod <pod-name> -n production
+# Watch: kubectl get pods -n production -l app=my-service -w
+# Expected: new pod starts, readiness probe passes, 0 user-visible errors
+# ── Restart all pods in deployment (test rolling restart recovery) ──
+kubectl rollout restart deployment/my-service -n production
+# Watch error rate during rollout
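+# ── Simulate OOMKill ──────────────────────────────────────────
+# Note: writes to /dev/shm count against the container memory limit, but /dev/shm
+# defaults to 64Mi; mount an emptyDir with medium: Memory there or dd stops early.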
+kubectl exec -it <pod> -n production -- sh -c \
+  "dd if=/dev/zero of=/dev/shm/blob bs=1M count=600"
+# Expected: pod OOMKilled, restarted, alert fired, no user impact
+# ── Resource pressure on node ─────────────────────────────────
+kubectl run stress --image=polinux/stress --restart=Never \
+  --overrides='{"spec":{"nodeSelector":{"kubernetes.io/hostname":"worker-01"}}}' \
+  -- stress --cpu 4 --vm 1 --vm-bytes 2G --timeout 120s
+# ── Network partition: isolate a pod (Cilium + network policy) ──
+# Apply a policy that drops all traffic from/to the pod
+kubectl apply -f - << 'EOF'
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata: { name: chaos-isolate, namespace: production }
+spec:
+  podSelector: { matchLabels: { chaos-target: "true" } }
+  policyTypes: [Ingress, Egress]
+EOF
+kubectl label pod <pod> chaos-target=true -n production
+# Observe: circuit breakers trip, retries, fallback behavior
+# Cleanup:
+kubectl delete networkpolicy chaos-isolate -n production
+kubectl label pod <pod> chaos-target- -n production
+```
+## LitmusChaos Experiments
+```yaml
+# Install LitmusChaos first (shell command, not part of the manifest):
+#   kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml
+# ── Pod Delete experiment ────────────────────────────────────
+apiVersion: litmuschaos.io/v1alpha1
+kind: ChaosEngine
+metadata:
+  name: pod-delete-experiment
+  namespace: production
+spec:
+  appinfo:
+    appns: production
+    applabel: app=order-service
+    appkind: deployment
+  engineState: active
+  chaosServiceAccount: litmus-admin
+  experiments:
+    - name: pod-delete
+      spec:
+        components:
+          env:
+            - name: TOTAL_CHAOS_DURATION
+              value: "60"          # run for 60 seconds
+            - name: CHAOS_INTERVAL
+              value: "10"          # delete a pod every 10s
+            - name: FORCE
+              value: "false"       # graceful delete (test SIGTERM handling)
+            - name: PODS_AFFECTED_PERC
+              value: "33"          # kill 33% of pods at a time
+```
+```yaml
+# ── Pod CPU Hog (test HPA scale-out) ─────────────────────────
+apiVersion: litmuschaos.io/v1alpha1
+kind: ChaosEngine
+metadata:
+  name: cpu-hog-experiment
+  namespace: production
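+spec:
+  appinfo:
+    appns: production
+    applabel: app=order-service
+    appkind: deployment
+  engineState: active
+  chaosServiceAccount: litmus-admin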
+  experiments:
+    - name: pod-cpu-hog
+      spec:
+        components:
+          env:
+            - name: CPU_CORES
+              value: "1"
+            - name: TOTAL_CHAOS_DURATION
+              value: "120"
+            - name: TARGET_PODS
+              value: "order-service-abc123"
+```
+## Chaos Game Days (structured runbook)
+```
+1. Define scope (30 min)
+   - Which services? Which failure modes?
+   - What is acceptable impact? (staging or prod with traffic shadow)
+2. Baseline measurement (10 min)
+   - Capture: RPS, error rate, p99, pod count
+   - Screenshot Grafana dashboard
+3. Run experiments (60–90 min)
+   Experiment A: Kill 1 of 3 pods → observe recovery time
+   Experiment B: Saturate CPU on 1 pod → observe HPA response
+   Experiment C: Partition service from its DB → observe circuit breaker
+4. Capture results per experiment
+   - Steady state maintained? (SLI threshold)
+   - Time to recovery
+   - Alerts fired? Correct ones?
+   - Runbook adequate?
+5. Action items (20 min)
+   - For each failure: fix or accept with documentation
+   - Schedule follow-up experiments after fixes
+```
+## Abort / Safety Controls
+```yaml
+# LitmusChaos: abort on SLO breach using steady-state hypothesis
+spec:
+  jobCleanUpPolicy: delete
+  monitoring: true
+  # Prometheus probe: abort if error rate > 1%
+  experiments:
+    - name: pod-delete
+      spec:
+        probe:
+          - name: check-error-rate
+            type: promProbe
+            promProbe/inputs:
+              endpoint: http://prometheus:9090
+              query: |
+                sum(rate(http_requests_total{service="order-service",status=~"5.."}[2m]))
+                / sum(rate(http_requests_total{service="order-service"}[2m]))
+              comparator:
+                type: float
+                criteria: "<="
+                value: "0.01"    # abort if error rate exceeds 1%
+            mode: Continuous
+            runProperties:
+              probeTimeout: 10s
+              interval: 15s
+              stopOnFailure: true   # stop the experiment as soon as the probe fails
+```

package/areas/devops/sre/skills/incident-command/SKILL.md ADDED

@@ -0,0 +1,119 @@
+---
+name: incident-command
+type: skill
+description: Structured incident command for P0/P1 — roles, timeline, communication templates, and mitigation-first approach.
+related-rules:
+  - on-call-standards.md
+  - error-budget-policy.md
+allowed-tools: Read, Bash
+---
+# Skill: Incident Command
+> **Expertise:** ICS-inspired incident structure, communication templates, mitigation over diagnosis, blameless culture.
+## When to load
+When responding to a P0/P1 incident, coordinating a multi-engineer response, or writing a war room update.
+## Incident Roles
+| Role | Responsibility | Who |
+|:---|:---|:---|
+| **Incident Commander (IC)** | Owns coordination; makes go/no-go calls | On-call lead or SRE |
+| **Technical Lead** | Diagnoses and implements fix | On-call engineer |
+| **Comms Lead** | Writes status page + stakeholder updates | PM or secondary on-call |
+| **Scribe** | Documents timeline in real-time | Any available engineer |
+## P0 Timeline (first 30 minutes, plus postmortem follow-up)
+```
+T+0:   ACKNOWLEDGE — "I'm on it" in #incidents Slack
+T+2:   SCOPE — What's broken? Since when? Who's affected?
+       → kubectl get pods -A | grep -v Running
+       → Check Grafana error rate + latency dashboard
+T+5:   PAGE escalation if > 10% users affected or revenue impacted
+T+10:  STATUS PAGE update: "We are investigating reports of [symptom]"
+T+15:  MITIGATION — Rollback > fix. Prefer reversible actions.
+       Order: rollback deploy → feature flag off → scale up → redirect traffic
+T+20:  COMMUNICATE — Slack update with mitigation status + ETA
+T+30:  STABILIZE — Confirm metrics returning to baseline
+       → Watch error rate for 10 min after mitigation
+T+60:  PRELIMINARY POSTMORTEM doc created (timeline captured)
+T+24h: FULL POSTMORTEM — 5-whys, action items, owners
+```
+## Mitigation Priority (always prefer fast+reversible)
+```
+1. Rollback deploy → helm rollback <release> -n <ns>    # < 2 min
+2. Feature flag off → LaunchDarkly / split.io toggle    # < 1 min
+3. Scale up replicas → kubectl scale deploy ... --replicas=N
+4. Restart pods → kubectl rollout restart deploy/<n>
+5. Redirect traffic → DNS change / load balancer weight
+6. Fix forward → only if rollback is not possible
+```
+## Slack Communication Templates
+```
+# P0 Opening Message (#incidents channel)
+🔴 **P0 INCIDENT OPEN** — [service] [symptom]
+IC: @you | Scribe: @name
+Impact: [who is affected, estimated user count]
+Current status: Investigating
+Thread: all updates in this thread
+War room: https://meet.google.com/...
+# Update every 15 min until resolved
+📊 **UPDATE T+15** — [service]
+Status: Mitigating / Resolved / Monitoring
+Action taken: Rolled back to v2.3.0
+Current error rate: 0.2% (was 8.4%)
+ETA: Monitoring for 10 min, then close
+# Resolution
+✅ **RESOLVED** — [service] — [duration]
+Root cause (preliminary): [1-sentence summary]
+Mitigation: [what fixed it]
+Next: Postmortem within 24h @[owner]
+```
+## Status Page Templates
+```
+# Investigating
+Investigating - We are investigating reports of [symptom] affecting [service].
+Users may experience [impact]. We will provide updates every 15 minutes.
+# Identified
+Identified - We have identified the issue causing [symptom].
+We are working on a fix and expect resolution by [ETA].
+# Monitoring
+Monitoring - A fix has been implemented and we are monitoring the results.
+Users should no longer experience [symptom].
+# Resolved
+Resolved - [symptom] affecting [service] has been resolved.
+This incident lasted [duration]. A postmortem will be published within 72 hours.
+```
+## Useful Emergency Commands
+```bash
+# Immediate rollback
+helm rollback <release-name> -n <namespace>           # rolls back 1 version
+helm rollback <release-name> <revision> -n <namespace> # specific revision
+helm history <release-name> -n <namespace>             # list revisions
+# Scale up quickly
+kubectl scale deploy <name> -n <ns> --replicas=10
+# Emergency pod restart (without rollout)
+kubectl delete pods -n <ns> -l app=<name>
+# Check what changed recently
+kubectl describe deploy <name> -n <ns> | grep -A5 "Events:"
+kubectl rollout history deploy/<name> -n <ns>
+```

package/areas/devops/sre/skills/postmortem-analysis/SKILL.md ADDED

@@ -0,0 +1,104 @@
+---
+name: postmortem-analysis
+type: skill
+description: Write blameless postmortems with 5-whys RCA, actionable follow-ups, and systematic prevention measures.
+related-rules:
+  - on-call-standards.md
+allowed-tools: Read, Write
+---
+# Skill: Postmortem Analysis
+> **Expertise:** Blameless culture, 5-whys root cause analysis, contributing factors, actionable items with owners and due dates.
+## When to load
+When writing a postmortem after a P0/P1 incident, reviewing a draft postmortem, or designing action items.
+## Postmortem Template
+```markdown
+# Postmortem: [Service] — [Date] — [Severity]
+**Status:** Draft / In Review / Complete
+**Severity:** P0 / P1
+**Duration:** [start] → [end] ([total duration])
+**Impact:** [N users affected, revenue impact if known, SLO budget consumed: X minutes]
+**Incident Commander:** [name]
+**Authors:** [name(s)]
+---
+## Summary
+[2–3 sentences: what broke, what caused it, what fixed it]
+## Timeline (UTC)
+| Time | Event |
+|:---|:---|
+| 14:22 | Alert fired: HighErrorRate on payment-service |
+| 14:24 | On-call acknowledged; war room opened |
+| 14:28 | Identified: error correlated with v2.4.1 deploy at 14:05 |
+| 14:31 | Mitigation: helm rollback payment-service to revision 3 |
+| 14:33 | Error rate returning to baseline |
+| 14:40 | Resolved; monitoring |
+## Root Cause Analysis (5-Whys)
+**Symptom:** Payment service returning 502s at 4.2% rate
+1. **Why?** → Upstream credit-card-service returning 503s
+2. **Why?** → credit-card-service pods OOMKilled
+3. **Why?** → Memory limit was 256Mi; new code path loaded full transaction history into memory
+4. **Why?** → Code review missed memory complexity of the new query (no performance test)
+5. **Why?** → No memory profiling step in CI; no load test in staging pipeline
+**Root cause:** Insufficient memory limit combined with absent memory regression testing.
+## Contributing Factors
+- [ ] Memory limits not updated with new feature PR
+- [ ] Staging environment has lower traffic than production (bug not triggered)
+- [ ] No VPA recommendation visible to developers
+## What Went Well
+- On-call responded in 2 minutes (SLO: 5 min) ✅
+- Rollback executed in 2 minutes ✅
+- Status page updated within 10 minutes ✅
+## What Went Poorly
+- Memory issue not caught in staging
+- Alert fired 17 minutes after deploy (too slow — alert `for: 2m` but high latency in detection)
+- Runbook for OOMKilled did not include memory limit increase steps
+## Action Items
+| Action | Owner | Priority | Due |
+|:---|:---|:---|:---|
+| Add memory profiling step to CI (`memory-profiler`) | @dev-team | P1 | 2024-11-22 |
+| Add k6 load test to staging pipeline (match prod traffic pattern) | @devops-team | P1 | 2024-11-29 |
+| Add VPA in "Off" mode for all services → surface recommendations | @devops-team | P2 | 2024-12-06 |
+| Update OOMKilled runbook with memory limit increase steps | @sre-team | P2 | 2024-11-20 |
+| Reduce alert `for:` to 1m for payment-service | @sre-team | P3 | 2024-11-20 |
+## SLO Impact
+- Error budget consumed: 18 minutes (of 201.6 min / 28d budget)
+- Budget remaining: 91.1%
+- Budget state: 🟢 Healthy
+```
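
Response metrics quoted in a postmortem can be derived from the timeline rather than estimated. The sketch below uses the timestamps from the template's example timeline; the metric names and the `minutes_between` helper are assumptions for illustration.

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps, allowing one midnight crossing."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)  # incident ran past midnight
    return int(delta.total_seconds() // 60)


# Timestamps from the example timeline (UTC)
timeline = {
    "deploy": "14:05",
    "alert": "14:22",
    "ack": "14:24",
    "mitigation": "14:31",
    "resolved": "14:40",
}

metrics = {
    "time_to_detect": minutes_between(timeline["deploy"], timeline["alert"]),
    "time_to_ack": minutes_between(timeline["alert"], timeline["ack"]),
    "time_to_mitigate": minutes_between(timeline["alert"], timeline["mitigation"]),
    "time_to_resolve": minutes_between(timeline["alert"], timeline["resolved"]),
}
print(metrics)
```

For the example incident this yields 17 minutes to detect, 2 to acknowledge, 9 to mitigate and 18 to resolve, which can be compared against the on-call response targets.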
+## 5-Whys Facilitation Tips
+1. **Start with the user-visible symptom**, not the technical failure.
+2. **Each "why" must be something that could have been different** — avoid "because the code had a bug" (that's not actionable).
+3. **Stop at organizational / process level** — usually why 4 or 5 reveals a missing process, test, or convention.
+4. **Multiple root causes are OK** — most incidents have 2-3 contributing causes, not one.
+5. **Blameless means systems-focused** — "the deployment process allowed an under-tested change" not "Alice didn't test well".
+## Action Item Quality
+| ❌ Weak | ✅ Strong |
+|:---|:---|
+| "Improve testing" | "Add k6 load test targeting payment endpoint to staging pipeline by Nov 29" |
+| "Fix monitoring" | "Add HighMemoryUsage alert (> 80% of limit) with `for: 5m` by Nov 20" |
+| "Be more careful" | "Add required checklist item in PR template: memory impact assessed for new DB queries" |
+| "Investigate X" | "Timebox investigation to 2h; report findings in Slack #postmortems by Nov 21" |