npm - @jetrabbits/agentic - Versions diffs - 0.0.1 - Mend

@jetrabbits/agentic 0.0.1

Files changed (440)

package/areas/devops/observability/prompts/observability-stack-setup.md ADDED

@@ -0,0 +1,99 @@
+---
+workflow: observability-stack-setup
+---
+# Prompt: `/observability-stack-setup`
+Use when: setting up the full observability stack from scratch on a K8s cluster.
+---
+## Example 1 — Full kube-prometheus-stack + Loki + Tempo
+**EN:**
+```
+/observability-stack-setup
+Cluster: prod-cluster-eu (bare-metal K8s 1.31, Cilium)
+Storage: Longhorn (block storage available)
+Stack to deploy (all via Helm + ArgoCD):
+  - kube-prometheus-stack (Prometheus + Alertmanager + Grafana + node-exporter + kube-state-metrics)
+  - Loki (single binary mode for start; distributed when logs > 50GB/day)
+  - Promtail DaemonSet (log collection from all pods)
+  - Tempo (distributed tracing, single binary)
+  - OpenTelemetry Collector (DaemonSet, receives OTLP)
+  - Grafana datasources: Prometheus + Loki + Tempo (auto-configured)
+Persistence:
+  - Prometheus: 50Gi Longhorn, 15-day retention
+  - Loki: 100Gi Longhorn, 30-day retention
+  - Tempo: 50Gi Longhorn, 7-day retention
+  - Grafana: 5Gi Longhorn (dashboards in ConfigMaps, not DB)
+Alertmanager: route critical → PagerDuty; warning → Slack #alerts
+Ingress: observability tools exposed via NGINX Ingress with mTLS (internal only)
+```
+**RU:**
+```
+/observability-stack-setup
+Кластер: prod-cluster-eu (bare-metal K8s 1.31, Cilium)
+Хранилище: Longhorn (блочное хранилище доступно)
+Стек для деплоя (через Helm + ArgoCD):
+  - kube-prometheus-stack (Prometheus + Alertmanager + Grafana + node-exporter + kube-state-metrics)
+  - Loki (single binary на старте; distributed когда логи > 50GB/день)
+  - Promtail DaemonSet (сбор логов со всех подов)
+  - Tempo (distributed tracing, single binary)
+  - OpenTelemetry Collector (DaemonSet, принимает OTLP)
+  - Datasources в Grafana: Prometheus + Loki + Tempo (авто-настройка)
+Persistence:
+  - Prometheus: 50Gi Longhorn, хранение 15 дней
+  - Loki: 100Gi Longhorn, хранение 30 дней
+  - Tempo: 50Gi Longhorn, хранение 7 дней
+  - Grafana: 5Gi Longhorn (дашборды в ConfigMaps, не в БД)
+Alertmanager: critical → PagerDuty; warning → Slack #alerts
+Ingress: инструменты observability через NGINX Ingress с mTLS (только внутренний доступ)
+```
+---
+## Example 2 — Migrate from ELK to Loki (cost reduction)
+**EN:**
+```
+/observability-stack-setup
+Task: migrate log aggregation from Elasticsearch + Kibana to Grafana Loki
+Current: ELK stack consuming 200Gi storage, 8 CPU, 32Gi memory (3 ES nodes)
+Target: Loki in single binary mode (~2 CPU, 4Gi memory, 100Gi storage)
+Migration constraints:
+  - Zero log gap during migration (dual-write period)
+  - Existing Kibana dashboards must be recreated in Grafana (or translated)
+  - Historical logs from ES: export last 7 days to Loki (beyond that, discard)
+  - Application log format: already JSON structured
+Migration plan:
+  1. Deploy Loki + Grafana alongside existing ELK
+  2. Configure Fluent Bit to ship to both (dual-write, 1 week)
+  3. Recreate critical Kibana dashboards in Grafana LogQL
+  4. Decommission ELK after 1 week parallel operation
+  5. Reclaim storage and compute
+```
+**RU:**
+```
+/observability-stack-setup
+Задача: миграция агрегации логов с Elasticsearch + Kibana на Grafana Loki
+Текущее: ELK стек потребляет 200Gi хранилища, 8 CPU, 32Gi памяти (3 ноды ES)
+Цель: Loki в single binary режиме (~2 CPU, 4Gi памяти, 100Gi хранилища)
+Ограничения миграции:
+  - Нулевой пробел в логах во время миграции (период двойной записи)
+  - Существующие дашборды Kibana должны быть воссозданы в Grafana (или транслированы)
+  - Исторические логи из ES: экспорт последних 7 дней в Loki (далее — удалить)
+  - Формат логов приложений: уже JSON структурированный
+План миграции:
+  1. Развернуть Loki + Grafana рядом с существующим ELK
+  2. Настроить Fluent Bit для отправки в оба (двойная запись, 1 неделя)
+  3. Воссоздать критичные дашборды Kibana в Grafana LogQL
+  4. Вывести из эксплуатации ELK после 1 недели параллельной работы
+  5. Вернуть хранилище и вычислительные ресурсы
+```

package/areas/devops/observability/prompts/onboard-service-monitoring.md ADDED

@@ -0,0 +1,79 @@
+---
+workflow: onboard-service-monitoring
+---
+# Prompt: `/onboard-service-monitoring`
+Use when: adding observability (metrics, logs, traces, alerts, dashboard) to a service.
+---
+## Example 1 — Full observability stack for Python service
+**EN:**
+```
+/onboard-service-monitoring
+Service: checkout-service / Namespace: production / Language: Python 3.12 + FastAPI
+Current state: service runs, zero observability
+Stack: Prometheus (kube-prometheus-stack) + Loki + Tempo + Grafana
+Required:
+  - Metrics: prometheus-client; expose /metrics; golden signals (latency, errors, traffic, saturation)
+  - Traces: opentelemetry-sdk auto-instrumentation; export to otel-collector:4317
+  - Logs: already JSON; add trace_id injection via TraceContextFilter
+  - ServiceMonitor: scrape interval 15s
+  - Alerts: HighErrorRate (critical > 1%), HighP99Latency (warning > 1s), PodMemoryPressure (warning > 85%)
+  - Dashboard: standard service overview + business metric: checkout_conversion_rate
+Business metric to add: checkout_success_total / checkout_attempt_total (gauge panel)
+```
+**RU:**
+```
+/onboard-service-monitoring
+Сервис: checkout-service / Namespace: production / Язык: Python 3.12 + FastAPI
+Текущее состояние: сервис работает, нет observability
+Стек: Prometheus (kube-prometheus-stack) + Loki + Tempo + Grafana
+Требуется:
+  - Метрики: prometheus-client; /metrics endpoint; golden signals (latency, errors, traffic, saturation)
+  - Трейсы: opentelemetry-sdk авто-инструментирование; экспорт на otel-collector:4317
+  - Логи: уже JSON; добавить инжекцию trace_id через TraceContextFilter
+  - ServiceMonitor: интервал scrape 15с
+  - Алерты: HighErrorRate (critical > 1%), HighP99Latency (warning > 1s), PodMemoryPressure (warning > 85%)
+  - Dashboard: стандартный обзор сервиса + бизнес-метрика: checkout_conversion_rate
+Бизнес-метрика: checkout_success_total / checkout_attempt_total (gauge панель)
+```
+---
+## Example 2 — Alerts-only for existing service
+**EN:**
+```
+/onboard-service-monitoring
+Service: notification-service / Namespace: production
+Current state: metrics already scraped (ServiceMonitor exists); no alerts defined
+Task: add alerting only
+Required alerts:
+  - HighErrorRate: critical if error rate > 1% for 2m
+  - QueueBacklog: warning if notification_queue_depth > 1000 for 5m (custom metric already exposed)
+  - PodRestarting: warning if pod restarts > 3 in 15m
+Each alert: runbook_url pointing to docs/runbooks/<alert-name>.md (create stub runbooks too)
+Alertmanager routing: critical → PagerDuty; warning → #alerts-warning Slack
+```
+**RU:**
+```
+/onboard-service-monitoring
+Сервис: notification-service / Namespace: production
+Текущее состояние: метрики уже собираются (ServiceMonitor есть); алерты не настроены
+Задача: только настройка алертов
+Необходимые алерты:
+  - HighErrorRate: critical при error rate > 1% в течение 2м
+  - QueueBacklog: warning при notification_queue_depth > 1000 в течение 5м (кастомная метрика уже доступна)
+  - PodRestarting: warning при рестартах пода > 3 за 15м
+Каждый алерт: runbook_url → docs/runbooks/<alert-name>.md (создать stub runbooks тоже)
+Маршрутизация Alertmanager: critical → PagerDuty; warning → #alerts-warning Slack
+```

package/areas/devops/observability/rules/alerting-standards.md ADDED

@@ -0,0 +1,36 @@
+# Rule: Alerting Standards
+**Priority**: P1 — Alerts without runbooks are not deployed.
+## Alert Quality Rules
+1. **Every alert has a runbook** — `annotations.runbook_url` is mandatory.
+2. **No alert fires without a human action** — if no one can do anything about it, it's not an alert (it's a dashboard).
+3. **Alert on symptoms, not causes** — `HighErrorRate` is an alert; `HighCPU` is a warning unless it causes user impact.
+4. **Severity classification**
+   | Severity | Meaning | Response |
+   |:---|:---|:---|
+   | `critical` | User-facing outage or data loss risk | Page on-call immediately |
+   | `warning` | Degraded but not broken; trending toward critical | Notify Slack; fix in business hours |
+   | `info` | Informational; no action required | Dashboard only |
+5. **`for:` duration** — critical alerts: `for: 2m`; warning alerts: `for: 10m`. Instant alerts cause false positives.
+6. **Alert fatigue policy** — if an alert fires more than 3 times in a week without action → reduce sensitivity or fix root cause.
+## Notification Routing (Alertmanager)
+```yaml
+route:
+  group_by: [alertname, namespace]
+  group_wait: 30s
+  group_interval: 5m
+  repeat_interval: 4h
+  receiver: slack-warning
+  routes:
+    - matchers: [severity="critical"]
+      receiver: pagerduty-oncall
+      continue: true
+    - matchers: [severity="critical"]
+      receiver: slack-critical
+```
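
The quality rules above lend themselves to an automated check before rules are deployed. The following is a minimal sketch, not part of the package: `lint_alert_rule` and `parse_duration` are hypothetical helpers, rules are assumed to be already loaded as plain dictionaries in the Prometheus alerting-rule shape, and the `for:` values from rule 5 are treated as minimums, which is an interpretation rather than something the rule states.

```python
import re

ALLOWED_SEVERITIES = {"critical", "warning", "info"}
# Assumption: the `for:` values in rule 5 are read as minimum durations (seconds).
MIN_FOR_SECONDS = {"critical": 120, "warning": 600}
_DURATION = re.compile(r"^(\d+)([smh])$")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a simple duration such as '2m' into seconds (0 if absent or unparsable)."""
    match = _DURATION.match(text or "")
    if not match:
        return 0
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def lint_alert_rule(rule):
    """Return a list of human-readable violations for one alerting rule."""
    problems = []
    name = rule.get("alert", "<unnamed>")
    severity = rule.get("labels", {}).get("severity")
    if not rule.get("annotations", {}).get("runbook_url"):
        problems.append(f"{name}: missing annotations.runbook_url")
    if severity not in ALLOWED_SEVERITIES:
        problems.append(f"{name}: severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    minimum = MIN_FOR_SECONDS.get(severity, 0)
    if parse_duration(rule.get("for")) < minimum:
        problems.append(f"{name}: 'for' is shorter than {minimum}s for {severity}")
    return problems
```

A CI step could run this over every rule file and fail the build when the returned list is non-empty, which would enforce the "alerts without runbooks are not deployed" priority mechanically.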

package/areas/devops/observability/rules/data-retention.md ADDED

@@ -0,0 +1,19 @@
+# Rule: Observability Data Retention
+**Priority**: P2 — Retention settings affect compliance and cost.
+## Retention Policy
+| Data type | Hot storage | Cold/archive | Delete |
+|:---|:---|:---|:---|
+| Metrics (Prometheus) | 15 days | 1 year (VictoriaMetrics/Thanos) | > 1 year |
+| Logs | 30 days (Loki/ES hot) | 90 days (cold) | > 90 days |
+| Traces | 7 days | Not archived (high volume) | > 7 days |
+| Alerting history | 90 days | — | > 90 days |
+## Cost Controls
+1. **High-cardinality labels are forbidden** — no `user_id`, `session_id`, `request_id` as Prometheus labels (use trace context instead).
+2. **Log sampling** — DEBUG logs sampled at 1% in production; INFO at 100%; ERROR/WARN at 100%.
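+3. **Trace sampling**: keep 10% of ordinary requests (probabilistic) and 100% of errors and slow requests (> p99). The error and latency rules require tail-based sampling in the collector, because a head-based decision is made before the outcome of the request is known.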
+4. **Prometheus recording rules** — pre-aggregate expensive queries; don't query raw metrics in dashboards.
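
The log sampling rule in the cost controls above (DEBUG at 1%, INFO and higher at 100%) can be sketched with the standard `logging` module. This is an illustrative sketch only: `LevelSamplingFilter` is a hypothetical name, and sampling in the application rather than in the log shipper is an assumption, since the rule does not say where sampling happens.

```python
import logging
import random


class LevelSamplingFilter(logging.Filter):
    """Pass every record at INFO and above; pass only a share of lower levels."""

    def __init__(self, debug_rate=0.01, rng=None):
        super().__init__()
        self.debug_rate = debug_rate
        # Injectable random source so the behaviour can be tested deterministically.
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True
        # Below INFO (i.e. DEBUG): keep roughly `debug_rate` of the records.
        return self.rng.random() < self.debug_rate
```

Attaching it with `handler.addFilter(LevelSamplingFilter())` leaves INFO, WARN, and ERROR untouched while thinning DEBUG output.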

package/areas/devops/observability/rules/golden-signals.md ADDED

@@ -0,0 +1,28 @@
+# Rule: Golden Signals & Observability Baseline
+**Priority**: P1 — Services without golden signals cannot be promoted to production.
+## Four Golden Signals (mandatory for every service)
+| Signal | Metric | Alert threshold |
+|:---|:---|:---|
+| **Latency** | p50, p95, p99 request duration | p99 > 1s for 5 min |
+| **Traffic** | Requests per second (RPS) | Drop > 50% from baseline |
+| **Errors** | 5xx rate / error rate | > 1% for 2 min |
+| **Saturation** | CPU %, memory %, queue depth | CPU > 80%, Memory > 85% |
+## Instrumentation Requirements
+1. **Every HTTP service exposes** `/metrics` in Prometheus format on port 9090 (or sidecar).
+2. **Every service has** a `ServiceMonitor` (kube-prometheus-stack) or scrape config.
+3. **Structured JSON logging** — no unstructured log lines in production.
+4. **Trace context propagated** — W3C TraceContext headers forwarded between all services.
+5. **Health endpoints** — `/health/ready` (readiness) and `/health/live` (liveness) separate.
+## Three Pillars Coverage
+| Pillar | Stack | Retention |
+|:---|:---|:---|
+| Metrics | Prometheus + VictoriaMetrics (long-term) | 15 days hot / 1 year cold |
+| Logs | Loki or ELK (Elasticsearch+Logstash+Kibana) | 30 days |
+| Traces | Tempo or Jaeger (via OpenTelemetry) | 7 days |
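
Instrumentation requirement 4 (W3C TraceContext propagation) comes down to forwarding a `traceparent` header while replacing the span ID at each hop. The sketch below shows that mechanic with the standard library only. `parse_traceparent` and `outgoing_traceparent` are hypothetical helpers, only version `00` headers are accepted, and starting a new trace with the sampled flag `01` when no valid header arrives is an assumption. In practice the OpenTelemetry SDK propagators handle this.

```python
import re
import secrets

# Version "00", 32 hex chars of trace-id, 16 hex chars of parent-id, 2 hex chars of flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip()) if header else None
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C Trace Context format.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, flags


def outgoing_traceparent(incoming):
    """Build the header for a downstream call: same trace ID, fresh span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        # Assumption: no valid incoming context means a new, sampled trace.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Because the trace ID survives every hop, spans emitted by different services can be stitched into one trace in Tempo or Jaeger.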

package/areas/devops/observability/skills/distributed-tracing/SKILL.md ADDED

@@ -0,0 +1,149 @@
+---
+name: distributed-tracing
+type: skill
+description: Implement distributed tracing with OpenTelemetry, Tempo/Jaeger — instrumentation, sampling, and trace-to-log correlation.
+related-rules:
+  - golden-signals.md
+  - data-retention.md
+allowed-tools: Read, Write, Edit
+---
+# Skill: Distributed Tracing
+> **Expertise:** OpenTelemetry SDK, auto-instrumentation, Tempo/Jaeger, trace-log correlation, sampling strategies.
+## When to load
+When adding tracing to a service, debugging slow distributed transactions, or setting up trace → log → metric correlation.
+## OpenTelemetry Collector (K8s DaemonSet)
+```yaml
+# otel-collector-config.yaml
+receivers:
+  otlp:
+    protocols:
+      grpc: { endpoint: "0.0.0.0:4317" }
+      http: { endpoint: "0.0.0.0:4318" }
+processors:
+  batch:
+    timeout: 1s
+    send_batch_size: 1000
+  memory_limiter:
+    check_interval: 1s
+    limit_mib: 400
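+  # Tail-based sampling: keep 100% of error/slow traces plus 10% of the rest.
+  # All spans of a trace must reach the same collector instance for the decision to be correct;
+  # with a DaemonSet, route by trace ID first (e.g. loadbalancing exporter in front of a gateway tier).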
+  tail_sampling:
+    decision_wait: 10s
+    policies:
+      - name: errors-policy
+        type: status_code
+        status_code: { status_codes: [ERROR] }
+      - name: slow-traces
+        type: latency
+        latency: { threshold_ms: 500 }
+      - name: probabilistic-10pct
+        type: probabilistic
+        probabilistic: { sampling_percentage: 10 }
+exporters:
+  otlp/tempo:
+    endpoint: tempo:4317
+    tls: { insecure: true }
+service:
+  pipelines:
+    traces:
+      receivers: [otlp]
+      processors: [memory_limiter, tail_sampling, batch]
+      exporters: [otlp/tempo]
+```
+## Python Auto-Instrumentation (FastAPI)
+```python
+# main.py: configure tracing, then create and instrument the app
+from fastapi import FastAPI
+from opentelemetry import trace
+from opentelemetry.sdk.resources import Resource
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
+from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
+from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
+from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
+# service.name labels the spans in the backend; "checkout-service" is a placeholder
+provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
+provider.add_span_processor(
+    # insecure=True because the collector endpoint above is served without TLS
+    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
+)
+trace.set_tracer_provider(provider)
+app = FastAPI()  # the app must exist before it can be instrumented
+# Auto-instrument frameworks
+FastAPIInstrumentor.instrument_app(app)
+HTTPXClientInstrumentor().instrument()  # outgoing HTTP calls
+SQLAlchemyInstrumentor().instrument()   # DB queries
+## Go Auto-Instrumentation
+```go
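+// tracing/setup.go
+package tracing
+import (
+    "context"
+    "go.opentelemetry.io/otel"
+    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
+    "go.opentelemetry.io/otel/propagation"
+    "go.opentelemetry.io/otel/sdk/resource"
+    tracesdk "go.opentelemetry.io/otel/sdk/trace"
+    semconv "go.opentelemetry.io/otel/semconv/v1.26.0" // pin to the version matching your SDK
+)
+// InitTracer installs a global tracer provider and returns a shutdown function.
+func InitTracer(ctx context.Context, serviceName string) (func(), error) {
+    exporter, err := otlptracegrpc.New(ctx,
+        otlptracegrpc.WithEndpoint("otel-collector:4317"),
+        otlptracegrpc.WithInsecure(),
+    )
+    if err != nil {
+        return nil, err
+    }
+    tp := tracesdk.NewTracerProvider(
+        tracesdk.WithBatcher(exporter),
+        tracesdk.WithResource(resource.NewWithAttributes(
+            semconv.SchemaURL,
+            semconv.ServiceNameKey.String(serviceName),
+        )),
+    )
+    otel.SetTracerProvider(tp)
+    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
+        propagation.TraceContext{},
+        propagation.Baggage{},
+    ))
+    // Use a fresh context so pending spans are flushed even if ctx is already cancelled.
+    return func() { _ = tp.Shutdown(context.Background()) }, nil
+}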
+```
+## Trace-to-Log Correlation
+```python
+# Inject trace_id and span_id into every log line
+import logging
+from opentelemetry import trace
+class TraceContextFilter(logging.Filter):
+    def filter(self, record):
+        span = trace.get_current_span()
+        ctx = span.get_span_context()
+        record.trace_id = format(ctx.trace_id, '032x') if ctx.is_valid else ''
+        record.span_id  = format(ctx.span_id,  '016x') if ctx.is_valid else ''
+        return True
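+# Attach with handler.addFilter(TraceContextFilter()) and reference trace_id/span_id in the formatter
+# Log format: {"message": "...", "trace_id": "<32 hex chars>", "span_id": "<16 hex chars>"}
+# Grafana can link trace_id → Tempo once a derived field is configured on the Loki datasource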
+```
+## K8s Pod Annotation for Auto-Instrumentation (Operator)
+```yaml
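+# Using OpenTelemetry Operator: instrumentation without code changes
+# (requires an Instrumentation resource that the annotation can resolve)
+metadata:
+  annotations:
+    instrumentation.opentelemetry.io/inject-python: "true"
+    # For Java: inject-java; for Go: inject-go (Go also needs the otel-go-auto-target-exe annotation)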
+```
+## Sampling Strategy
+| Scenario | Strategy | Rate |
+|:---|:---|:---|
+| Normal traffic | Probabilistic (`probabilistic` policy in the collector's `tail_sampling`) | 10% |
+| Errors | Always sample (tail-based `status_code` policy) | 100% |
+| Latency > p99 | Tail-based | 100% |
+| Debug/investigation | Force-sample via baggage | 100% |
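
The sampling strategy table above can be read as a single keep-or-drop decision per completed trace. The sketch below illustrates that decision logic under stated assumptions: `keep_trace` is a hypothetical function, the 500 ms threshold and 10% rate mirror the collector config earlier in this file, and hashing the trace ID is only one way to get a stable probabilistic decision. It is not how the collector's `tail_sampling` processor is implemented.

```python
import hashlib


def keep_trace(trace_id, has_error, duration_ms,
               slow_threshold_ms=500, sample_percent=10, force=False):
    """Decide whether a finished trace is kept (True) or dropped (False)."""
    # Forced (debug), failed, and slow traces are always kept.
    if force or has_error or duration_ms > slow_threshold_ms:
        return True
    # Map the trace ID to a stable bucket in 0..99 so that repeated
    # evaluations of the same trace always give the same answer.
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sample_percent
```

Because the error and latency checks need the finished trace, this decision can only be made after the spans have been collected, which is why those rows are tail-based.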

package/areas/devops/observability/skills/grafana-dashboards/SKILL.md ADDED

@@ -0,0 +1,201 @@
+---
+name: grafana-dashboards
+type: skill
+description: Design and maintain Grafana dashboards — service overview panels, SLO tracking, variable templates, dashboard-as-code with Grafonnet/Jsonnet.
+related-rules:
+  - golden-signals.md
+  - data-retention.md
+allowed-tools: Read, Write, Edit
+---
+# Skill: Grafana Dashboards
+> **Expertise:** Panel design, variable templates, cross-datasource linking (Prometheus + Loki + Tempo), SLO dashboards, Jsonnet/Grafonnet for dashboard-as-code.
+## When to load
+When creating a new service dashboard, adding SLO panels, debugging a missing metric in Grafana, or converting dashboards to code.
+## Standard Service Overview Dashboard Panels
+```
+Row 1: Golden Signals
+  [Traffic: RPS]  [Error Rate %]  [P50/P95/P99 Latency]  [Saturation: CPU/Mem]
+Row 2: Infrastructure
+  [Pod count / restarts]  [Memory usage vs limit]  [CPU throttling %]
+Row 3: Logs & Traces (via Explore links)
+  [Recent error logs]  [Slow trace exemplars]
+Row 4: Business Metrics (service-specific)
+  [Order rate]  [Checkout conversion]  [Queue depth]
+```
+## Panel Configurations
+```json
+// Error rate panel (Stat + threshold coloring)
+{
+  "type": "stat",
+  "title": "Error Rate",
+  "datasource": "Prometheus",
+  "targets": [{
+    "expr": "sum(rate(http_requests_total{service='$service', status=~'5..'}[5m])) / sum(rate(http_requests_total{service='$service'}[5m])) * 100",
+    "legendFormat": "Error %"
+  }],
+  "fieldConfig": {
+    "defaults": {
+      "unit": "percent",
+      "thresholds": {
+        "steps": [
+          {"color": "green", "value": 0},
+          {"color": "yellow", "value": 0.1},
+          {"color": "red",   "value": 1.0}
+        ]
+      }
+    }
+  }
+}
+```
+```json
+// Latency heatmap panel
+{
+  "type": "heatmap",
+  "title": "Request Latency Heatmap",
+  "targets": [{
+    "expr": "sum(rate(http_request_duration_seconds_bucket{service='$service'}[5m])) by (le)",
+    "format": "heatmap",
+    "legendFormat": "{{le}}"
+  }],
+  "yAxis": { "format": "s", "logBase": 2 }
+}
+```
+## Variable Templates (dashboard filters)
+```json
+// Namespace variable (auto-populated from Prometheus labels)
+{
+  "name": "namespace",
+  "type": "query",
+  "datasource": "Prometheus",
+  "query": "label_values(kube_pod_info, namespace)",
+  "refresh": 2,     // refresh on time range change
+  "includeAll": true,
+  "multi": true
+}
+// Service variable (filtered by selected namespace; =~ because namespace is multi-value)
+{
+  "name": "service",
+  "type": "query",
+  "query": "label_values(http_requests_total{namespace=~'$namespace'}, service)",
+  "refresh": 2
+}
+```
+## Trace Exemplars (link Prometheus → Tempo)
+```json
+// Enable exemplars on latency histogram panel
+{
+  "type": "timeseries",
+  "targets": [{
+    "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service='$service'}[5m])) by (le))",
+    "exemplar": true,    // show exemplar dots linking to traces
+    "legendFormat": "p99"
+  }]
+}
+// Requires: Tempo as datasource, traceID in exemplar labels
+```
+## Log Panel (Loki integration)
+```json
+{
+  "type": "logs",
+  "title": "Recent Errors",
+  "datasource": "Loki",
+  "targets": [{
+    "expr": "{namespace=~\"$namespace\", app=\"$service\"} | json | level=\"error\"",
+    "maxLines": 100
+  }]
+}
+```
+## SLO Dashboard Panels
+```json
+// SLO budget remaining (Gauge); 0.005 is the allowed error ratio for a 99.5% SLO
+{
+  "type": "gauge",
+  "title": "Error Budget Remaining",
+  "targets": [{
+    "expr": "(1 - (sum(increase(http_requests_total{service='$service',status=~'5..'}[28d])) / sum(increase(http_requests_total{service='$service'}[28d]))) / 0.005) * 100"
+  }],
+  "fieldConfig": {
+    "defaults": {
+      "unit": "percent",
+      "min": 0, "max": 100,
+      "thresholds": {
+        "steps": [
+          {"color": "red",    "value": 0},
+          {"color": "yellow", "value": 25},
+          {"color": "green",  "value": 50}
+        ]
+      }
+    }
+  }
+}
+```
+## Dashboard as Code (Jsonnet + Grafonnet)
+```jsonnet
+// dashboards/service-overview.jsonnet
+local grafana = import 'grafonnet/grafana.libsonnet';
+local dashboard = grafana.dashboard;
+local row = grafana.row;
+local prometheus = grafana.prometheus;
+local graphPanel = grafana.graphPanel;
+dashboard.new(
+  'Service Overview',
+  tags=['generated', 'service'],
+  refresh='1m',
+  time_from='now-1h',
+)
+.addPanel(
+  graphPanel.new(
+    'Request Rate',
+    datasource='Prometheus',
+  ).addTarget(
+    prometheus.target(
+      'sum(rate(http_requests_total{service="$service"}[5m]))',
+      legendFormat='RPS',
+    )
+  ),
+  gridPos={ x: 0, y: 0, w: 12, h: 8 }
+)
+```
+```bash
+# Generate JSON from Jsonnet and upload it to Grafana
+# (add an Authorization header if anonymous access is disabled)
+jsonnet -J vendor dashboards/service-overview.jsonnet > /tmp/dashboard.json
+curl -X POST http://grafana:3000/api/dashboards/db \
+  -H 'Content-Type: application/json' \
+  -d "{\"dashboard\": $(cat /tmp/dashboard.json), \"overwrite\": true}"
+```
+## Dashboard Git Workflow
+```bash
+# Export dashboard JSON from Grafana (for version control); DASHBOARD_UID holds the dashboard's uid
+curl -s "http://grafana:3000/api/dashboards/uid/${DASHBOARD_UID}" | jq '.dashboard' > dashboards/service-overview.json
+# Apply dashboard from Git (GitOps)
+# Use grafana-operator or grizzly (grafana/grizzly)
+grr apply dashboards/
+```
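
The "Error Budget Remaining" gauge in the SLO panel of this file is easier to sanity-check with the arithmetic written out. The sketch below assumes that the `0.005` divisor in the panel expression denotes a 99.5% availability SLO. `error_budget_remaining` is a hypothetical helper for checking numbers by hand, not something Grafana or the package provides.

```python
def error_budget_remaining(total_requests, failed_requests, slo_target=0.995):
    """Return the share of the error budget still unspent, as a percentage in 0..100."""
    if total_requests <= 0:
        # No traffic in the window means no budget has been consumed.
        return 100.0
    allowed_ratio = 1.0 - slo_target            # e.g. 0.005 for a 99.5% SLO
    observed_ratio = failed_requests / total_requests
    remaining = (1.0 - observed_ratio / allowed_ratio) * 100.0
    # Clamp so an overspent budget shows as 0 rather than a negative value.
    return max(0.0, min(100.0, remaining))
```

With one million requests in the 28-day window, 2,500 failures spend about half of the budget, and anything above 5,000 failures exhausts it, which is the range the gauge thresholds (red below 25, green from 50) are meant to colour.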