npm - @jetrabbits/agentic - Versions diffs - 0.0.1 - Mend

@jetrabbits/agentic 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (440)

package/areas/devops/observability/skills/log-aggregation/SKILL.md ADDED

@@ -0,0 +1,159 @@
+---
+name: log-aggregation
+type: skill
+description: Set up Loki or ELK log aggregation for K8s workloads — structured logging, log routing, and log-based alerting.
+related-rules:
+  - golden-signals.md
+  - data-retention.md
+allowed-tools: Read, Write, Edit, Bash
+---
+# Skill: Log Aggregation
+> **Expertise:** Loki (Grafana stack), Promtail/Fluent Bit, structured JSON logging, log-based alerting, ELK basics.
+## When to load
+When setting up log collection, writing log queries, debugging missing logs, or adding log-based alerts.
+## Loki Stack (K8s — recommended)
+```yaml
+# Promtail DaemonSet auto-discovers K8s pod logs
+# Install via helm:
+#   helm upgrade --install loki grafana/loki-stack \
+#     -n monitoring \
+#     -f loki-values.yaml
+# loki-values.yaml
+loki:
+  auth_enabled: false
+  limits_config:
+    retention_period: 720h   # 30 days
+    ingestion_rate_mb: 16
+    max_streams_per_user: 10000
+  storage_config:
+    boltdb_shipper:
+      active_index_directory: /data/loki/boltdb-index
+    filesystem:
+      directory: /data/loki/chunks
+promtail:
+  config:
+    clients:
+      - url: http://loki:3100/loki/api/v1/push
+    scrape_configs:
+      - job_name: kubernetes-pods
+        kubernetes_sd_configs:
+          - role: pod
+        pipeline_stages:
+          - docker: {}     # parse Docker JSON log format (use `cri: {}` on containerd/CRI-O nodes)
+          - json:          # extract fields from app JSON logs
+              expressions:
+                level: level
+                trace_id: trace_id
+                service: service
+          - labels:
+              level:
+              service:
+```
+## LogQL Queries
+```logql
+# All error logs from a service (the time range, e.g. last 5 min, comes from the query window)
+{namespace="production", app="order-service"} |= "ERROR"
+# Parse JSON and filter by field
+{namespace="production"} | json | level="error" | trace_id != ""
+# Count errors per service (for alerting)
+sum by (service) (
+  count_over_time({namespace="production"} | json | level="error" [5m])
+)
+# Log rate (to detect log explosion)
+sum(rate({namespace="production"}[5m])) by (app)
+# Find slow requests from logs
+{app="api-gateway"} | json | response_time_ms > 500
+```
+## Structured Logging Standards
+```python
+# Python — structlog
+import structlog
+log = structlog.get_logger()
+# Always include: service, version, trace_id, span_id, level
+log.info("order.created",
+    order_id="ord-123",
+    user_id="usr-456",    # OK in log; NOT in metrics labels
+    amount_cents=4999,
+    # trace_id injected automatically via TraceContextFilter
+)
+# Output (JSON):
+# {"event": "order.created", "level": "info", "order_id": "ord-123",
+#  "trace_id": "abc123def456", "span_id": "789xyz", "timestamp": "..."}
+```
+```go
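+// Go — slog (stdlib, Go 1.21+)
+package main
+import (
+    "log/slog"
+    "os"
+)
+func main() {
+    // JSON handler writing to stdout at INFO level and above
+    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
+        Level: slog.LevelInfo,
+    }))
+    slog.SetDefault(logger)
+    traceID := "abc123def456" // placeholder; inject from request context
+    slog.Info("order.created",
+        "order_id", "ord-123",
+        "amount_cents", 4999,
+        "trace_id", traceID,
+    )
+}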
+```
+## Log-Based Alerting (Loki ruler)
+```yaml
+# loki-rules.yaml
+groups:
+  - name: application.logs
+    rules:
+      - alert: HighErrorLogRate
+        expr: |
+          sum(rate({namespace="production"} | json | level="error" [5m])) by (app)
+          > 10
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Error log rate > 10/s — {{ $labels.app }}"
+          runbook_url: "https://runbooks.internal/high-error-logs"
+```
+## Fluent Bit (alternative — lower resource usage)
+```yaml
+# fluent-bit-config.yaml (K8s ConfigMap)
+[INPUT]
+    Name              tail
+    Tag               kube.*
+    Path              /var/log/containers/*.log
+    Parser            docker
+    Refresh_Interval  5
+[FILTER]
+    Name         kubernetes
+    Match        kube.*
+    Kube_Tag_Prefix  kube.var.log.containers.
+    Merge_Log    On
+    Keep_Log     Off
+[OUTPUT]
+    Name   loki
+    Match  kube.*
+    Host   loki
+    Port   3100
+    Labels job=fluent-bit, namespace=$kubernetes['namespace_name']
+```
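The structured logging standard in the log-aggregation skill above uses structlog. As a dependency-free counterpart, here is a minimal sketch that emits the same kind of JSON line with Python's standard `logging` module. The `JsonFormatter` class, the `extra={"fields": ...}` convention, and the sample values are assumptions made for illustration; they are not part of the package.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
        }
        # Fields passed via extra={"fields": {...}} are merged into the payload.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Attach the JSON formatter to a logger writing to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("order-service")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stands in for stdout so the output can be inspected
log = build_logger(buffer)
log.info("order.created", extra={"fields": {
    "service": "order-service",
    "order_id": "ord-123",
    "amount_cents": 4999,
    "trace_id": "abc123def456",  # placeholder; normally taken from the trace context
}})
line = json.loads(buffer.getvalue())
print(line["event"], line["level"], line["trace_id"])
```

With one JSON object per line, the `json` pipeline stage shown in the Promtail config can extract `level`, `trace_id`, and `service` without a custom parser.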

package/areas/devops/observability/skills/prometheus-alertmanager/SKILL.md ADDED

@@ -0,0 +1,188 @@
+---
+name: prometheus-alertmanager
+type: skill
+description: Write production-quality Prometheus alert rules, recording rules, and Alertmanager routing configs.
+related-rules:
+  - golden-signals.md
+  - alerting-standards.md
+allowed-tools: Read, Write, Edit, Bash
+---
+# Skill: Prometheus & Alertmanager
+> **Expertise:** PromQL, alert rules, recording rules, Alertmanager routing, inhibition, silences.
+## When to load
+When writing alert rules, debugging PromQL, configuring Alertmanager routing, or investigating a firing alert.
+## Golden Signal Alert Rules
+```yaml
+# alerts/service-golden-signals.yaml
+groups:
+  - name: service.golden-signals
+    rules:
+      # ── Errors ────────────────────────────────────────
+      - alert: HighErrorRate
+        expr: |
+          (
+            sum(rate(http_requests_total{status=~"5.."}[5m])) by (namespace, service)
+            /
+            sum(rate(http_requests_total[5m])) by (namespace, service)
+          ) > 0.01
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Error rate > 1% — {{ $labels.service }} in {{ $labels.namespace }}"
+          description: "Current error rate: {{ $value | humanizePercentage }}"
+          runbook_url: "https://runbooks.internal/high-error-rate"
+      # ── Latency ───────────────────────────────────────
+      - alert: HighP99Latency
+        expr: |
+          histogram_quantile(0.99,
+            sum(rate(http_request_duration_seconds_bucket[5m])) by (namespace, service, le)
+          ) > 1.0
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "p99 latency > 1s — {{ $labels.service }}"
+          description: "p99: {{ $value | humanizeDuration }}"
+          runbook_url: "https://runbooks.internal/high-latency"
+      # ── Saturation ────────────────────────────────────
+      - alert: PodMemoryPressure
+        expr: |
+          (
+            container_memory_working_set_bytes{container!=""}
+            /
+            container_spec_memory_limit_bytes{container!=""}
+          ) > 0.85
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Memory > 85% limit — {{ $labels.container }} in {{ $labels.namespace }}"
+          runbook_url: "https://runbooks.internal/memory-pressure"
+      # ── Traffic Drop ──────────────────────────────────
+      - alert: TrafficDrop
+        expr: |
+          (
+            sum(rate(http_requests_total[5m])) by (service)
+            /
+            sum(rate(http_requests_total[1h] offset 5m)) by (service)
+          ) < 0.5
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Traffic dropped > 50% vs the preceding 1h average — {{ $labels.service }}"
+          runbook_url: "https://runbooks.internal/traffic-drop"
+```
+## Recording Rules (pre-aggregate expensive queries)
+```yaml
+groups:
+  - name: service.recording
+    interval: 1m
+    rules:
+      # Pre-compute error rate (used in dashboards — no re-computation)
+      - record: job:http_requests:error_rate5m
+        expr: |
+          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, namespace)
+          /
+          sum(rate(http_requests_total[5m])) by (job, namespace)
+      # Pre-compute p99 (expensive histogram_quantile)
+      - record: job:http_request_duration_seconds:p99_5m
+        expr: |
+          histogram_quantile(0.99,
+            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
+          )
+```
+## PromQL Patterns
+```promql
+# Rate of requests (always use rate() on counters, not irate() for alerting)
+rate(http_requests_total[5m])
+# Error ratio
+sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
+/ sum(rate(http_requests_total[5m])) by (service)
+# Memory utilisation (working set vs limit)
+container_memory_working_set_bytes / container_spec_memory_limit_bytes
+# CPU throttling ratio: throttled CFS periods / total CFS periods (> 25% = limit too low)
+sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod)
+/ sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)
+# Absent metric (detect missing scrape targets)
+absent(up{job="my-service"} == 1)
+```
+## Alertmanager Config
+```yaml
+# alertmanager.yml
+global:
+  resolve_timeout: 5m
+  slack_api_url: https://hooks.slack.com/...
+route:
+  receiver: slack-warning
+  group_by: [alertname, namespace, service]
+  group_wait: 30s
+  group_interval: 5m
+  repeat_interval: 4h
+  routes:
+    - matchers: [severity="critical"]
+      receiver: pagerduty
+      group_wait: 0s            # page immediately
+    - matchers: [alertname="Watchdog"]
+      receiver: deadman-snitch  # heartbeat alert
+inhibit_rules:
+  # If a service is down (critical), suppress its latency/error warnings
+  - source_matchers: [severity="critical", alertname="ServiceDown"]
+    target_matchers: [severity="warning"]
+    equal: [namespace, service]
+receivers:
+  - name: pagerduty
+    pagerduty_configs:
+      - routing_key: $PD_ROUTING_KEY   # placeholder: Alertmanager does not expand env vars; substitute before loading
+        description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
+  - name: slack-warning
+    slack_configs:
+      - channel: '#alerts-warning'
+        title: '{{ .GroupLabels.alertname }}'
+        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
+```
+## Debugging Alerts
+```bash
+# Check currently firing alerts
+kubectl port-forward svc/alertmanager 9093:9093 -n monitoring
+# Open http://localhost:9093
+# Evaluate a PromQL expression (check why alert fired/didn't fire)
+kubectl port-forward svc/prometheus 9090:9090 -n monitoring
+# Open http://localhost:9090/graph
+# Check alert rule evaluation
+curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name=="HighErrorRate")'
+# Silence a noisy alert during maintenance
+amtool silence add alertname="HighErrorRate" namespace="staging" \
+  --duration=2h --comment="Scheduled maintenance window"
+```
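The `curl ... | jq` check in the debugging section above can also be scripted. The following is a minimal sketch, assuming Prometheus is reachable through the port-forward shown there; `find_rules`, `fetch_rules`, and the trimmed `sample` payload are hypothetical names and data used only for illustration.

```python
import json
import urllib.request


def find_rules(payload: dict, name: str) -> list:
    """Return every rule named `name` from a /api/v1/rules response body."""
    return [
        rule
        for group in payload.get("data", {}).get("groups", [])
        for rule in group.get("rules", [])
        if rule.get("name") == name
    ]


def fetch_rules(base_url: str) -> dict:
    """Fetch the rules payload; base_url is assumed to be e.g. http://localhost:9090."""
    with urllib.request.urlopen(f"{base_url}/api/v1/rules", timeout=10) as resp:
        return json.load(resp)


# Trimmed, made-up payload in the shape the endpoint returns.
sample = {
    "status": "success",
    "data": {"groups": [{
        "name": "service.golden-signals",
        "rules": [
            {"name": "HighErrorRate", "type": "alerting", "state": "firing", "health": "ok"},
            {"name": "HighP99Latency", "type": "alerting", "state": "inactive", "health": "ok"},
        ],
    }]},
}
matches = find_rules(sample, "HighErrorRate")
print([(r["name"], r["state"]) for r in matches])
```

Against a live server, `find_rules(fetch_rules("http://localhost:9090"), "HighErrorRate")` would return the same structure; the `state` and `health` fields are usually the first things to check when an alert did not fire as expected.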

package/areas/devops/observability/skills/slo-implementation/SKILL.md ADDED

@@ -0,0 +1,189 @@
+---
+name: slo-implementation
+type: skill
+description: Implement SLOs end-to-end in Prometheus — recording rules, burn rate alerts, error budget dashboards, and Sloth/pyrra integration.
+related-rules:
+  - golden-signals.md
+  - alerting-standards.md
+allowed-tools: Read, Write, Edit, Bash
+---
+# Skill: SLO Implementation
+> **Expertise:** Prometheus recording rules for SLOs, multi-window burn rate alerts, Sloth code generation, error budget Grafana panels.
+## When to load
+When implementing SLOs for a service in Prometheus, setting up burn rate alerts, or creating error budget dashboards.
+## Full SLO Stack (single service)
+### Step 1: Define the SLI Recording Rules
+```yaml
+# prometheus-rules/slo-checkout-service.yaml
+groups:
+  - name: slo:checkout-service:recording
+    interval: 30s
+    rules:
+      # Good requests: 2xx responses (availability SLI); for a latency SLI (< 500ms) see the commented bucket query
+      - record: slo:http_requests_good:rate5m
+        labels: { service: checkout-service }
+        expr: |
+          sum(rate(http_requests_total{
+            service="checkout-service",
+            status=~"2.."
+          }[5m]))
+          # For latency SLI, intersect with bucket:
+          # sum(rate(http_request_duration_seconds_bucket{
+          #   service="checkout-service", le="0.5"}[5m]))
+      - record: slo:http_requests_total:rate5m
+        labels: { service: checkout-service }
+        expr: |
+          sum(rate(http_requests_total{service="checkout-service"}[5m]))
+      # SLI ratio (5m window)
+      - record: slo:http_availability:ratio_rate5m
+        labels: { service: checkout-service }
+        expr: |
+          slo:http_requests_good:rate5m{service="checkout-service"}
+          / slo:http_requests_total:rate5m{service="checkout-service"}
+      # Pre-compute multiple windows for burn rate alerts
+      - record: slo:http_availability:ratio_rate30m
+        labels: { service: checkout-service }
+        expr: |
+          sum(rate(http_requests_total{service="checkout-service",status=~"2.."}[30m]))
+          / sum(rate(http_requests_total{service="checkout-service"}[30m]))
+      - record: slo:http_availability:ratio_rate1h
+        labels: { service: checkout-service }
+        expr: |
+          sum(rate(http_requests_total{service="checkout-service",status=~"2.."}[1h]))
+          / sum(rate(http_requests_total{service="checkout-service"}[1h]))
+      - record: slo:http_availability:ratio_rate6h
+        labels: { service: checkout-service }
+        expr: |
+          sum(rate(http_requests_total{service="checkout-service",status=~"2.."}[6h]))
+          / sum(rate(http_requests_total{service="checkout-service"}[6h]))
+      - record: slo:http_availability:ratio_rate1d
+        labels: { service: checkout-service }
+        expr: |
+          sum(rate(http_requests_total{service="checkout-service",status=~"2.."}[1d]))
+          / sum(rate(http_requests_total{service="checkout-service"}[1d]))
+      - record: slo:http_availability:ratio_rate28d
+        labels: { service: checkout-service }
+        expr: |
+          avg_over_time(slo:http_availability:ratio_rate5m{service="checkout-service"}[28d])
+          # avg_over_time does not depend on the sample count (this group evaluates every 30s, not every 5m)
+```
+### Step 2: Multi-Window Burn Rate Alerts
+```yaml
+  - name: slo:checkout-service:alerts
+    rules:
+      # ── Fast burn (1h + 5m windows, 14.4× rate) ──────────────────
+      # Consumes ~2% of the 28d budget in 1h (14.4 × 1h / 672h) → page immediately
+      - alert: CheckoutSLOFastBurn
+        expr: |
+          (slo:http_availability:ratio_rate1h{service="checkout-service"} < (1 - 14.4 * 0.005))
+          and
+          (slo:http_availability:ratio_rate5m{service="checkout-service"} < (1 - 14.4 * 0.005))
+        for: 2m
+        labels:
+          severity: critical
+          service: checkout-service
+          slo: availability-99.5
+        annotations:
+          summary: "Checkout SLO fast burn — error rate > 14.4× the budgeted rate"
+          description: "1h availability: {{ $value | humanizePercentage }}. Budget burning rapidly."
+          runbook_url: "https://runbooks.internal/checkout-slo-fast-burn"
+      # ── Slow burn (6h + 30m windows, 6× rate) ────────────────────
+      # Consumes ~5% of the 28d budget in 6h (6 × 6h / 672h) → ticket, fix in business hours
+      - alert: CheckoutSLOSlowBurn
+        expr: |
+          (slo:http_availability:ratio_rate6h{service="checkout-service"} < (1 - 6 * 0.005))
+          and
+          (slo:http_availability:ratio_rate30m{service="checkout-service"} < (1 - 6 * 0.005))
+        for: 15m
+        labels:
+          severity: warning
+          service: checkout-service
+          slo: availability-99.5
+        annotations:
+          summary: "Checkout SLO slow burn — error rate > 6× the budgeted rate"
+          runbook_url: "https://runbooks.internal/checkout-slo-slow-burn"
+      # ── Budget exhaustion warning ─────────────────────────────────
+      - alert: CheckoutSLOBudgetLow
+        expr: |
+          slo:http_availability:ratio_rate28d{service="checkout-service"}
+          < (1 - 0.005 * 0.75)   # < 25% budget remaining
+        for: 1h
+        labels:
+          severity: warning
+          service: checkout-service
+        annotations:
+          summary: "Checkout error budget < 25% remaining for the 28d window"
+          runbook_url: "https://runbooks.internal/checkout-error-budget"
+```
+### Step 3: Sloth (generate from YAML spec)
+```yaml
+# slo/checkout-service.yaml
+version: "prometheus/v1"
+service: checkout-service
+labels: { team: backend, tier: "1" }
+slos:
+  - name: requests-availability
+    objective: 99.5
+    description: "99.5% of checkout requests succeed"
+    sli:
+      events:
+        error_query: |
+          sum(rate(http_requests_total{
+            service="checkout-service",
+            status=~"5.."}[{{.window}}]))
+        total_query: |
+          sum(rate(http_requests_total{
+            service="checkout-service"}[{{.window}}]))
+    alerting:
+      name: CheckoutServiceAvailability
+      page_alert:
+        labels: { severity: critical }
+        annotations:
+          runbook_url: https://runbooks.internal/checkout-availability
+      ticket_alert:
+        labels: { severity: warning }
+```
+```bash
+# Generate Prometheus rules + alerts from Sloth spec
+sloth generate -i slo/checkout-service.yaml -o rules/slo-checkout-generated.yaml
+# Produces: recording rules for all windows + multi-window burn rate alerts
+```
+### Step 4: Error Budget Dashboard (Grafana)
+```promql
+# Current error budget remaining (percent of 28d budget)
+(
+  avg_over_time(slo:http_availability:ratio_rate5m{service="checkout-service"}[28d])
+  - (1 - 0.005)
+)
+/ 0.005 * 100
+# Hours of budget remaining at a steady 1× burn rate (remaining fraction × 28d window)
+(
+  (slo:http_availability:ratio_rate28d{service="checkout-service"} - (1 - 0.005))
+  / 0.005
+) * 28 * 24
+```
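The burn rate thresholds and budget percentages in the SLO skill above follow from simple arithmetic. The sketch below works through it for the 99.5% objective and 28-day window used there; the function names are hypothetical, and results are compared with a tolerance because of floating-point rounding.

```python
ERROR_BUDGET = 0.005          # 1 - 0.995 objective, as in the alert expressions
SLO_TARGET = 1 - ERROR_BUDGET
WINDOW_HOURS = 28 * 24        # 672h SLO window


def availability_threshold(burn_rate: float) -> float:
    """Availability below which the given burn rate is exceeded."""
    return 1 - burn_rate * ERROR_BUDGET


def budget_consumed(burn_rate: float, alert_window_hours: float) -> float:
    """Fraction of the 28d budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / WINDOW_HOURS


def budget_remaining(availability_28d: float) -> float:
    """Fraction of the error budget left, given the 28d availability ratio."""
    return (availability_28d - SLO_TARGET) / ERROR_BUDGET


fast_threshold = availability_threshold(14.4)   # used by CheckoutSLOFastBurn
slow_threshold = availability_threshold(6)      # used by CheckoutSLOSlowBurn
fast_spend = budget_consumed(14.4, 1)           # 1h long window
slow_spend = budget_consumed(6, 6)              # 6h long window
low_budget = budget_remaining(1 - ERROR_BUDGET * 0.75)

print(f"fast burn: availability < {fast_threshold:.3f}, spends {fast_spend:.1%} of budget")
print(f"slow burn: availability < {slow_threshold:.3f}, spends {slow_spend:.1%} of budget")
print(f"CheckoutSLOBudgetLow fires below {low_budget:.0%} budget remaining")
```

The 2% and 5% figures quoted in the alert comments are the rounded values for a 30-day window; over 28 days the same burn rates spend slightly more of the budget.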

package/areas/devops/observability/workflows/alert-investigation.md ADDED

@@ -0,0 +1,98 @@
+---
+name: alert-investigation
+type: workflow
+trigger: /alert-investigation
+description: Structured alert investigation — classify, correlate metrics/logs/traces, identify root cause, mitigate, and improve alert quality.
+inputs:
+  - alert_name
+  - alert_labels
+  - firing_since
+outputs:
+  - root_cause_summary
+  - mitigation_applied_or_deferred
+  - alert_quality_notes
+roles:
+  - devops-engineer
+  - developer
+execution:
+  initiator: developer
+related-rules:
+  - golden-signals.md
+  - alerting-standards.md
+uses-skills:
+  - prometheus-alertmanager
+  - grafana-dashboards
+  - log-aggregation
+  - distributed-tracing
+quality-gates:
+  - root cause identified before alert is silenced
+  - action item created for any alert that fired without a valid runbook step
+---
+## Steps
+### 1. Acknowledge & Classify — `@devops-engineer`
+- Open Grafana: navigate to service dashboard for the affected service
+- Check: is this a real user-impact alert or a false positive?
+  - Real: error rate / latency / saturation affecting users
+  - False: alert threshold too sensitive for normal traffic patterns
+- Check: when did the alert start? Correlate with recent deploys or cron jobs
+- **Done when:** alert classified (real/false-positive) and current status known
+### 2. Correlate Signals — `@devops-engineer`
+**Metrics (Prometheus):**
+```promql
+# Error rate breakdown by endpoint
+sum(rate(http_requests_total{service="$svc", status=~"5.."}[5m])) by (path)
+/ sum(rate(http_requests_total{service="$svc"}[5m])) by (path)
+# Latency distribution shift
+histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="$svc"}[5m])) by (le))
+# Recent pod restarts
+increase(kube_pod_container_status_restarts_total{namespace="$ns"}[30m])
+```
+**Logs (Loki):**
+```logql
+{namespace="$ns", app="$svc"} | json | level="error"
+  | line_format "{{.message}} trace={{.trace_id}}"
+```
+**Traces (Tempo):**
+- Search by trace_id from logs → view full request trace
+- Filter with TraceQL, e.g. `{ duration > 1s && status = error }`, to find slow/failing requests
+### 3. Identify Root Cause — `@devops-engineer` + `@developer`
+Decision tree:
+```
+Error rate spike?
+  → Recent deploy? → Check image diff, config changes → Rollback candidate
+  → No deploy? → Check upstream dependency health, DB connections, external API
+Latency spike?
+  → CPU throttling? → Check container_cpu_cfs_throttled_periods_total
+  → Memory pressure? → Check working set vs limits
+  → Downstream slow? → Trace to identify bottleneck service
+  → DB slow? → Check pg_stat_statements, lock waits
+Saturation?
+  → CPU: scale out or increase limits
+  → Memory: right-size or find leak
+  → Connections: check PgBouncer, connection leak
+```
+### 4. Mitigate — `@devops-engineer`
+- Apply fix (rollback, scale, restart, config change)
+- Watch: is the alert resolving? (it usually clears once the expression stops matching, i.e. after the query's rate window such as `[5m]` has drained; `for:` only delays firing)
+- If not resolving: escalate to P1
+### 5. Post-Investigation Notes — `@devops-engineer`
+- Was the runbook adequate? (could a junior follow it to resolution?)
+- Is the alert threshold correct? (too sensitive = toil; too loose = misses real issues)
+- Create ticket if: runbook needs update, threshold needs tuning, or root cause needs a code fix
+## Exit
+Alert resolved or escalated + root cause noted + runbook quality assessed = investigation complete.
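The decision tree in step 3 of the workflow above can be kept as data so that tooling or a chat bot can look up the next checks. This is a minimal sketch: the `DECISION_TREE` layout and the `next_steps` helper are hypothetical, and the branch texts are paraphrased from the workflow rather than taken from any real runbook API.

```python
# Symptom -> ordered (observation, next step) branches, paraphrased from step 3.
DECISION_TREE = {
    "error_rate": [
        ("recent deploy", "check image diff and config changes; rollback candidate"),
        ("no deploy", "check upstream dependency health, DB connections, external API"),
    ],
    "latency": [
        ("cpu throttling", "check CPU throttling metrics"),
        ("memory pressure", "check working set vs limits"),
        ("downstream slow", "trace to identify the bottleneck service"),
        ("db slow", "check pg_stat_statements and lock waits"),
    ],
    "saturation": [
        ("cpu", "scale out or increase limits"),
        ("memory", "right-size or find the leak"),
        ("connections", "check PgBouncer and look for a connection leak"),
    ],
}


def next_steps(symptom: str, observation=None) -> list:
    """Return the steps for a symptom, optionally narrowed to one observation."""
    branches = DECISION_TREE.get(symptom, [])
    if observation is None:
        return [step for _, step in branches]
    return [step for seen, step in branches if seen == observation]


print(next_steps("latency", "db slow"))
print(len(next_steps("saturation")))
```

An unknown symptom returns an empty list, which a caller could treat as a signal to escalate rather than guess.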