npm - @jetrabbits/agentic - Versions diffs - 0.0.1 - Mend

@jetrabbits/agentic 0.0.1


Files changed (440)

package/areas/devops/observability/workflows/observability-stack-setup.md ADDED

@@ -0,0 +1,156 @@
+---
+name: observability-stack-setup
+type: workflow
+trigger: /observability-stack-setup
+description: Deploy the full observability stack (Prometheus + Loki + Tempo + Grafana) to a Kubernetes cluster from scratch.
+inputs:
+  - cluster_name
+  - storage_class
+  - retention_days_metrics
+  - retention_days_logs
+outputs:
+  - running_observability_stack
+  - grafana_url
+  - setup_report
+roles:
+  - devops-engineer
+execution:
+  initiator: developer
+related-rules:
+  - golden-signals.md
+  - alerting-standards.md
+  - data-retention.md
+uses-skills:
+  - prometheus-alertmanager
+  - grafana-dashboards
+  - log-aggregation
+  - distributed-tracing
+  - slo-implementation
+quality-gates:
+  - all components healthy (Prometheus targets UP)
+  - sample alert fires and reaches Alertmanager
+  - Grafana shows data from all three pillars (metrics/logs/traces)
+---
+## Steps
+### 1. Namespace & Prerequisites — `@devops-engineer`
+```bash
+kubectl create namespace monitoring
+kubectl create namespace logging
+kubectl create namespace tracing
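+# Add Helm repos
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo add grafana              https://grafana.github.io/helm-charts
+helm repo add open-telemetry       https://open-telemetry.github.io/opentelemetry-helm-charts  # needed by step 5
+helm repo update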
+```
+### 2. kube-prometheus-stack (Prometheus + Grafana + Alertmanager) — `@devops-engineer`
+```bash
+helm upgrade --install kube-prometheus-stack \
+  prometheus-community/kube-prometheus-stack \
+  -n monitoring \
+  -f infra/observability/prometheus-values.yaml \
+  --create-namespace
+# prometheus-values.yaml (key sections)
+# prometheus:
+#   prometheusSpec:
+#     retention: 15d
+#     storageSpec:
+#       volumeClaimTemplate:
+#         spec:
+#           storageClassName: longhorn
+#           resources: { requests: { storage: 50Gi } }
+# alertmanager:
+#   config: <alertmanager routing config>
+# grafana:
+#   adminPassword: <from Vault>
+#   persistence: { enabled: true, storageClassName: longhorn, size: 10Gi }
+```
+- Verify: `kubectl get pods -n monitoring` — all Running
+- Check Prometheus targets: `kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring`, then open `http://localhost:9090/targets`
+### 3. Loki + Promtail (Logs) — `@devops-engineer`
+```bash
+helm upgrade --install loki grafana/loki-stack \
+  -n logging \
+  -f infra/observability/loki-values.yaml \
+  --create-namespace
+# loki-values.yaml key settings:
+# loki.config.limits_config.retention_period: 720h  (30d)
+# promtail.config.clients[0].url: http://loki.logging:3100/loki/api/v1/push
+```
+- Add Loki datasource in Grafana: `http://loki.logging:3100`
+- Verify: `{job="loki"}` returns logs in Grafana Explore
+### 4. Tempo (Traces) — `@devops-engineer`
+```bash
+helm upgrade --install tempo grafana/tempo \
+  -n tracing \
+  -f infra/observability/tempo-values.yaml \
+  --create-namespace
+# tempo-values.yaml key settings:
+# tempo.retention: 168h  (7d)
+# persistence.enabled: true
+# tempo.receivers.otlp.protocols.grpc.endpoint: 0.0.0.0:4317
+```
+- Add Tempo datasource in Grafana: `http://tempo.tracing:3100`
+- Configure trace-to-log correlation: set Loki derived field `traceID` → Tempo URL
+### 5. OpenTelemetry Collector (DaemonSet) — `@devops-engineer`
+```bash
+helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
+  -n monitoring \
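+  -f infra/observability/otel-collector-values.yaml
+# otel-collector-values.yaml key settings:
+# mode: daemonset   (the chart requires an explicit mode)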
+```
+- Accepts OTLP from apps (port 4317 gRPC, 4318 HTTP)
+- Forwards traces to Tempo (`tempo.tracing:4317`, the OTLP gRPC receiver from step 4)
+### 6. Validate Stack — `@devops-engineer`
+```bash
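+# Test Prometheus query (kube-prometheus-stack runs Prometheus as a StatefulSet, not a Deployment)
+kubectl exec -n monitoring sts/prometheus-kube-prometheus-stack-prometheus -c prometheus -- \
+  wget -qO- 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result | length'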
+# Test alert routing: create test alert
+kubectl apply -f - << 'YAML'
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
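+metadata:
+  name: test-alert
+  namespace: monitoring
+  labels:
+    release: kube-prometheus-stack   # the chart's default ruleSelector matches the Helm release label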
+spec:
+  groups:
+    - name: test
+      rules:
+        - alert: TestAlert
+          expr: vector(1)   # always fires
+          labels: { severity: warning }
+          annotations: { summary: "Test alert for stack validation" }
+YAML
+# Check Alertmanager received it: kubectl port-forward svc/kube-prometheus-stack-alertmanager 9093:9093 -n monitoring
+# Test logs visible
+kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
+# Open Grafana → Explore → Loki → {namespace="monitoring"} → should show logs
+```
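
The `jq '.data.result | length'` check in step 6 only counts series; it does not show whether each target is actually up. A minimal Python sketch of a stricter check follows, assuming the standard response shape of the Prometheus instant-query API. The function name `summarize_up` and the sample payload are illustrative and not part of the package.

```python
import json


def summarize_up(response: dict) -> dict:
    """Split the result of an instant `up` query into healthy and down targets."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    up, down = [], []
    for series in response["data"]["result"]:
        labels = series["metric"]
        target = f'{labels.get("job", "?")}/{labels.get("instance", "?")}'
        # Each value is [timestamp, "<sample as string>"]; up == 1 means the scrape succeeded.
        _, sample_value = series["value"]
        (up if float(sample_value) == 1.0 else down).append(target)
    return {"up": up, "down": down}


# Hypothetical payload in the shape returned by /api/v1/query?query=up
sample = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "job": "node-exporter", "instance": "10.0.0.1:9100"},
       "value": [1700000000.0, "1"]},
      {"metric": {"__name__": "up", "job": "kubelet", "instance": "10.0.0.2:10250"},
       "value": [1700000000.0, "0"]}
    ]
  }
}
""")

summary = summarize_up(sample)
print(summary)
```

A non-empty `down` list would fail the "all components healthy (Prometheus targets UP)" quality gate.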
+### 7. Import Dashboards — `@devops-engineer`
+```bash
+# Apply standard dashboard ConfigMaps (GitOps)
+kubectl apply -f infra/observability/dashboards/ -n monitoring
+# Or import via Grafana API
+for dashboard in infra/observability/dashboards/*.json; do
+  # Pass credentials with -u and quote the file name so paths with spaces work
+  curl -X POST -u "admin:${GRAFANA_PASS}" http://localhost:3000/api/dashboards/import \
+    -H 'Content-Type: application/json' \
+    -d "{\"dashboard\": $(cat "$dashboard"), \"overwrite\": true}"
+done
+```
+## Exit
+All 4 components healthy + test alert fired + dashboards showing data = stack deployed.
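
The workflow takes `retention_days_metrics` and `retention_days_logs` as inputs, while the values files expect duration strings (`15d` for Prometheus, `720h` for Loki, `168h` for Tempo). A minimal sketch of that conversion, assuming the dotted keys quoted in the steps above; the helper names are hypothetical, and the 7-day trace default only mirrors the hard-coded `168h` because the workflow defines no trace-retention input.

```python
def prometheus_retention(days: int) -> str:
    """Prometheus accepts day-suffixed durations such as '15d'."""
    if days <= 0:
        raise ValueError("retention must be a positive number of days")
    return f"{days}d"


def hours_retention(days: int) -> str:
    """Loki and Tempo settings in the workflow are written in hours, e.g. '720h'."""
    if days <= 0:
        raise ValueError("retention must be a positive number of days")
    return f"{days * 24}h"


def retention_settings(metrics_days: int, logs_days: int, traces_days: int = 7) -> dict:
    # Keys follow the paths quoted in steps 2-4; traces_days is an assumed default.
    return {
        "prometheus.prometheusSpec.retention": prometheus_retention(metrics_days),
        "loki.config.limits_config.retention_period": hours_retention(logs_days),
        "tempo.retention": hours_retention(traces_days),
    }


settings = retention_settings(metrics_days=15, logs_days=30)
for key, value in settings.items():
    print(f"{key}: {value}")
```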

package/areas/devops/observability/workflows/onboard-service-monitoring.md ADDED

@@ -0,0 +1,83 @@
+---
+name: onboard-service-monitoring
+type: workflow
+trigger: /onboard-service-monitoring
+description: Add full observability (metrics, logs, traces, alerts, dashboard) to an existing service.
+inputs:
+  - service_name
+  - namespace
+  - language/framework
+outputs:
+  - running_metrics_scrape
+  - grafana_dashboard
+  - alert_rules_deployed
+roles:
+  - devops-engineer
+  - developer
+execution:
+  initiator: developer
+related-rules:
+  - golden-signals.md
+  - alerting-standards.md
+uses-skills:
+  - prometheus-alertmanager
+  - grafana-dashboards
+  - distributed-tracing
+  - log-aggregation
+quality-gates:
+  - all four golden signals visible in Prometheus
+  - at least one critical alert deployed with runbook
+  - logs visible in Loki with trace_id field
+---
+## Steps
+### 1. Metrics Instrumentation — `@developer`
+- Add Prometheus client library to service
+- Expose standard HTTP metrics (requests_total, duration histogram, active_requests)
+- Expose `/metrics` endpoint on port 9090 (or sidecar annotation)
+- **Done when:** `curl http://<pod-ip>:9090/metrics` returns Prometheus format
+### 2. ServiceMonitor — `@devops-engineer`
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: ${SERVICE}
+  namespace: ${NAMESPACE}
+spec:
+  selector:
+    matchLabels: { app: ${SERVICE} }
+  endpoints:
+    - port: metrics
+      interval: 15s
+      path: /metrics
+```
+- **Done when:** Prometheus targets page shows service as UP
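
The `${SERVICE}` and `${NAMESPACE}` placeholders in the ServiceMonitor must be filled before the manifest is valid YAML. One way to do that is sketched below with Python's `string.Template`, which uses the same `${NAME}` syntax. The function name and the example service and namespace are assumptions for illustration; the package does not say which templating tool it expects.

```python
from string import Template

# Same manifest as in step 2, kept as a template.
SERVICE_MONITOR = Template("""\
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${SERVICE}
  namespace: ${NAMESPACE}
spec:
  selector:
    matchLabels: { app: ${SERVICE} }
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
""")


def render_service_monitor(service: str, namespace: str) -> str:
    # substitute() raises KeyError if any placeholder is left unfilled.
    return SERVICE_MONITOR.substitute(SERVICE=service, NAMESPACE=namespace)


manifest = render_service_monitor("order-service", "production")
print(manifest)
```

The rendered text can then be piped to `kubectl apply -f -`.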
+### 3. Tracing Instrumentation — `@developer`
+- Add OpenTelemetry SDK (or use K8s operator auto-injection)
+- Configure OTLP exporter → otel-collector:4317
+- Verify trace_id appears in application logs
+- **Done when:** traces visible in Tempo; trace_id in logs
+### 4. Log Labels — `@devops-engineer`
+- Verify Promtail/Fluent Bit picks up pod logs
+- Confirm JSON parsing works: `{namespace="${NS}", app="${SERVICE}"} | json`
+- Add log-based alert if service emits structured error logs
+- **Done when:** logs searchable in Loki with level + trace_id fields
+### 5. Alert Rules — `@devops-engineer`
+- Create `PrometheusRule` with golden signal alerts (HighErrorRate, HighP99Latency, PodMemoryPressure)
+- Write runbook for each alert in `docs/runbooks/`
+- Test alert firing: temporarily lower threshold, verify Alertmanager receives it
+- **Done when:** all alerts show in Prometheus rules page; test fire works
+### 6. Grafana Dashboard — `@devops-engineer`
+- Import standard service overview dashboard template
+- Customize: add service-specific panels (queue depth, custom business metrics)
+- Link trace panel (Tempo datasource) to request duration panel
+- **Done when:** dashboard saved in `infra/observability/dashboards/`; Grafana shows live data
+## Exit
+Golden signals in Prometheus + logs in Loki + traces in Tempo + alerts deployed + dashboard live = service monitored.

package/areas/devops/sre/AGENTS.md ADDED

@@ -0,0 +1,48 @@
+# SRE — guidance index
+## What this area covers
+Site reliability engineering: SLO/SLI design, error budget policy, chaos engineering, capacity planning, incident command, and post-mortem facilitation. The SRE area treats reliability as a measurable feature with a finite budget — not a vague aspiration.
+## Guidance chain
+1. Project `.agent/` baseline
+2. `sre/rules/*` — load all
+3. `sre/skills/*/SKILL.md` — load matching skill only
+4. `sre/workflows/*` — load matching workflow
+## Cross-cutting constraints
+- **SLOs drive decisions** — if error budget remains, ship features; if exhausted, halt features and fix reliability.
+- **No heroics** — every repeated manual action is a toil item to automate.
+- **Blameless culture** — incidents indict systems, not people. Post-mortems focus on what the system lacked.
+- **Data before action** — no reliability work starts without a metric showing the problem.
+## Spec map
+```text
+sre/
+├── rules/
+│   ├── slo-policy.md             ← SLO definition standards, window sizes, target tiers
+│   ├── error-budget-policy.md    ← budget consumption thresholds, freeze triggers
+│   └── on-call-standards.md      ← rotation design, escalation, response SLAs
+├── skills/
+│   ├── slo-sli-design/SKILL.md       ← SLI selection, SLO target setting, burn-rate alerts
+│   ├── chaos-engineering/SKILL.md    ← experiment design, blast radius, rollback gates
+│   ├── capacity-planning/SKILL.md    ← demand forecasting, right-sizing, headroom models
+│   ├── incident-command/SKILL.md     ← severity classification, role assignment, comms cadence
+│   └── postmortem-analysis/SKILL.md  ← 5 Whys, fault trees, systemic action items
+├── workflows/
+│   ├── incident-response.md   ← /incident-response
+│   ├── postmortem.md          ← /postmortem
+│   └── slo-review.md          ← /slo-review
+└── prompts/
+    └── *.md
+```
+## Discovery patterns
+- `rules/*.md`
+- `skills/*/SKILL.md`
+- `workflows/*.md`
+- `prompts/*.md`
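
The discovery patterns above are ordinary glob patterns relative to the `sre/` directory. A minimal sketch of resolving them with `pathlib` follows. The `discover` function and the temporary demo tree are illustrative assumptions; how the agent runtime actually loads these files is not described in the source.

```python
import tempfile
from pathlib import Path

DISCOVERY_PATTERNS = ["rules/*.md", "skills/*/SKILL.md", "workflows/*.md", "prompts/*.md"]


def discover(area_root: Path) -> dict[str, list[str]]:
    """Return the files matching each discovery pattern, relative to area_root."""
    found = {}
    for pattern in DISCOVERY_PATTERNS:
        found[pattern] = sorted(
            path.relative_to(area_root).as_posix()
            for path in area_root.glob(pattern)
            if path.is_file()
        )
    return found


# Demo on a throwaway tree that mimics part of the spec map.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "sre"
    for rel in [
        "rules/slo-policy.md",
        "skills/chaos-engineering/SKILL.md",
        "skills/chaos-engineering/notes.txt",  # not matched: only SKILL.md is discovered
        "workflows/postmortem.md",
    ]:
        file_path = root / rel
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.write_text("")
    found = discover(root)

print(found)
```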

package/areas/devops/sre/prompts/incident-response.md ADDED

@@ -0,0 +1,129 @@
+---
+workflow: incident-response
+---
+# Prompt: `/incident-response`
+Use when: actively responding to a production incident or resilience event that demands mitigation, communication, and a clear escalation path.
+---
+## Example 1 — P0 service outage
+**EN:**
+```
+/incident-response
+Severity: P0
+Service: order-service / Namespace: production
+Symptom: complete service outage — 100% error rate since 15:42 UTC
+Affected: all checkout flows; estimated 2,000 users/min impact
+IC: @me (on-call)
+Available data:
+  - Alert fired: HighErrorRate 100% for order-service
+  - Recent deploy: order-service v3.1.0 at 15:38 UTC (4 min before incident)
+  - Prometheus shows: all pods Running but 0 successful responses
+  - Logs show: "connection refused" to postgres-primary:5432
+Actions needed:
+  1. Immediate mitigation options (rollback, feature flag, scale)
+  2. Status page template for this incident
+  3. Slack communication template for #incidents
+  4. Scribe doc started
+```
+**RU:**
+```
+/incident-response
+Severity: P0
+Сервис: order-service / Namespace: production
+Симптом: полный отказ сервиса — 100% error rate с 15:42 UTC
+Затронуто: все checkout flow; ~2,000 пользователей/мин
+IC: @я (on-call)
+Доступные данные:
+  - Алерт: HighErrorRate 100% для order-service
+  - Последний деплой: order-service v3.1.0 в 15:38 UTC (за 4 мин до инцидента)
+  - Prometheus: все поды Running но 0 успешных ответов
+  - Логи: "connection refused" на postgres-primary:5432
+Необходимые действия:
+  1. Варианты немедленной митигации (откат, feature flag, масштабирование)
+  2. Шаблон для status page по этому инциденту
+  3. Шаблон Slack сообщения для #incidents
+  4. Начат scribe doc
+```
+---
+## Example 2 — P1 performance degradation
+**EN:**
+```
+/incident-response
+Severity: P1
+Service: payment-service / Namespace: production
+Symptom: p99 latency spiked from 200ms to 4.2s; error rate 0.8% (below 1% threshold but rising)
+Affected: ~15% of payment requests timing out; no complete outage
+Recent changes: database index migration ran at 03:00 UTC (8h ago)
+Metrics: CPU normal, memory normal, DB connections at 89% pool capacity
+Action needed:
+  1. Classify: is this trending toward P0? (burn rate calculation)
+  2. Identify: DB connection exhaustion vs slow query vs external dependency
+  3. Quick mitigation options that don't require deploy
+  4. Comms: notify on Slack (not status page yet — degraded only)
+```
+**RU:**
+```
+/incident-response
+Severity: P1
+Сервис: payment-service / Namespace: production
+Симптом: p99 latency вырос с 200ms до 4.2s; error rate 0.8% (ниже порога 1% но растёт)
+Затронуто: ~15% запросов на оплату с таймаутом; полного отказа нет
+Недавние изменения: миграция индекса БД в 03:00 UTC (8ч назад)
+Метрики: CPU в норме, память в норме, DB connections на 89% от pool capacity
+Необходимые действия:
+  1. Классификация: движется ли это к P0? (расчёт burn rate)
+  2. Определить: исчерпание DB connections vs медленный запрос vs внешняя зависимость
+  3. Варианты быстрой митигации без деплоя
+  4. Коммуникация: уведомить в Slack (не status page пока — только деградация)
+```
+---
+## Example 3 — Network partition: database failover test
+**EN:**
+```
+/incident-response
+System under test: payment-service → postgres-primary (CloudNativePG cluster)
+Hypothesis: "payment-service automatically reconnects within 30s after postgres primary failover, with < 500ms added latency per request during reconnect"
+Experiment:
+  - Inject NetworkChaos: block traffic from payment-service pods to postgres-primary for 90s
+  - CloudNativePG should auto-promote replica to primary during network partition
+  - Monitor: payment-service error rate, connection pool exhaustion (pgbouncer stats), reconnect time
+  - Verify: after partition heals, service recovers automatically (no manual intervention)
+Pre-conditions:
+  - Confirm postgres replica is healthy before starting
+  - Confirm pgbouncer server_login_retry and server_connect_timeout are set appropriately
+  - Run at 10% of normal traffic (k6 load generator)
+```
+**RU:**
+```
+/incident-response
+Система под тестом: payment-service → postgres-primary (CloudNativePG кластер)
+Гипотеза: "payment-service автоматически переподключается в течение 30с после failover postgres primary, с задержкой < 500ms на запрос во время переподключения"
+Эксперимент:
+  - Инжектировать NetworkChaos: блокировать трафик от подов payment-service к postgres-primary на 90с
+  - CloudNativePG должен автоматически назначить реплику primary во время сетевого раздела
+  - Мониторинг: error rate payment-service, исчерпание connection pool (статистика pgbouncer), время переподключения
+  - Проверить: после восстановления раздела сервис восстанавливается автоматически (без ручного вмешательства)
+Предусловия:
+  - Убедиться что postgres реплика здорова перед началом
+  - Убедиться что pgbouncer server_login_retry и server_connect_timeout настроены правильно
+  - Запустить при 10% от нормального трафика (k6 генератор нагрузки)
+```

package/areas/devops/sre/prompts/postmortem.md ADDED

@@ -0,0 +1,101 @@
+---
+workflow: postmortem
+---
+# Prompt: `/postmortem`
+Use when: writing or facilitating a blameless postmortem after a P0/P1 incident.
+---
+## Example 1 — Full postmortem from incident data
+**EN:**
+```
+/postmortem
+Incident: INC-2024-112 / Service: payment-service / Severity: P1
+Duration: 2024-11-15 03:42–04:01 UTC (19 min)
+Impact: 4.2% error rate; ~850 failed payment attempts; SLO: 18 min budget consumed
+Root cause (preliminary): OOMKilled pods after v2.4.1 deploy introduced high-memory code path
+Timeline (from scribe doc):
+  03:42 - Alert fired HighErrorRate 4.2%
+  03:44 - On-call acknowledged
+  03:49 - Identified: payment-service pods OOMKilling (exit 137)
+  03:51 - Mitigation: helm rollback payment-service to revision 3
+  03:53 - Error rate dropping
+  04:01 - Resolved; monitoring
+Tasks:
+  1. Full 5-whys RCA from the preliminary root cause
+  2. Contributing factors analysis
+  3. 3–5 action items (specific, owned, dated within 2 weeks)
+  4. What went well section (at least 3 items)
+  5. SLO impact calculation
+```
+**RU:**
+```
+/postmortem
+Инцидент: INC-2024-112 / Сервис: payment-service / Severity: P1
+Длительность: 2024-11-15 03:42–04:01 UTC (19 мин)
+Влияние: error rate 4.2%; ~850 неудачных попыток оплаты; SLO: потрачено 18 мин бюджета
+Корневая причина (предварительно): OOMKilled поды после деплоя v2.4.1 с высокопамятным кодом
+Timeline (из scribe doc):
+  03:42 - Алерт сработал HighErrorRate 4.2%
+  03:44 - On-call подтвердил
+  03:49 - Определено: поды payment-service OOMKilling (exit 137)
+  03:51 - Митигация: helm rollback payment-service до ревизии 3
+  03:53 - Error rate падает
+  04:01 - Разрешено; мониторинг
+Задачи:
+  1. Полный анализ 5-whys от предварительной корневой причины
+  2. Анализ способствующих факторов
+  3. 3–5 action items (конкретные, с владельцами, сроки в течение 2 недель)
+  4. Раздел "что прошло хорошо" (минимум 3 пункта)
+  5. Расчёт влияния на SLO
+```
+---
+## Example 2 — SLO review: define SLOs for new service
+**EN:**
+```
+/postmortem
+Task: define SLOs (not a postmortem — SLO design session)
+Service: notification-service
+User expectation: "Notifications arrive within 30 seconds; don't lose notifications"
+Current metrics available: delivery_attempts_total, delivery_success_total, delivery_latency_seconds
+Team size: 2 backend engineers + 1 devops
+Target tier: Tier 2 (internal tool; not directly revenue-impacting)
+Design:
+  1. Select 2 SLIs (availability + latency) with formulas
+  2. Propose SLO targets (start conservative)
+  3. Calculate error budget for 28-day window
+  4. Write burn rate alert thresholds (fast + slow)
+  5. Sloth YAML definition
+```
+**RU:**
+```
+/postmortem
+Задача: определить SLO (не postmortem — сессия проектирования SLO)
+Сервис: notification-service
+Ожидание пользователей: "Уведомления доставляются в течение 30 секунд; уведомления не теряются"
+Доступные метрики: delivery_attempts_total, delivery_success_total, delivery_latency_seconds
+Размер команды: 2 backend инженера + 1 devops
+Целевой tier: Tier 2 (внутренний инструмент; не влияет напрямую на выручку)
+Проектирование:
+  1. Выбрать 2 SLI (availability + latency) с формулами
+  2. Предложить цели SLO (начать консервативно)
+  3. Рассчитать error budget для 28-дневного окна
+  4. Написать пороги burn rate алертов (быстрый + медленный)
+  5. YAML определение для Sloth
+```
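
Both prompts above ask for error budget arithmetic ("SLO impact calculation", "error budget for 28-day window"). A minimal sketch of that calculation for a time-based availability SLO is shown here. The 99.5% and 99.9% targets are assumed examples, since the prompts do not state the services' targets; the 18 minutes is the budget consumption quoted in Example 1.

```python
def error_budget_minutes(slo_target: float, window_days: int = 28) -> float:
    """Allowed 'bad' minutes in the window for a time-based availability SLO."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo_target) * window_days * 24 * 60


def budget_consumed_fraction(bad_minutes: float, slo_target: float, window_days: int = 28) -> float:
    """Share of the window's error budget used by an incident."""
    return bad_minutes / error_budget_minutes(slo_target, window_days)


# Assumed targets for illustration only.
budget_995 = error_budget_minutes(0.995)  # about 201.6 minutes over 28 days
budget_999 = error_budget_minutes(0.999)  # about 40.32 minutes over 28 days
consumed = budget_consumed_fraction(18, 0.999)

print(f"99.5% target: {budget_995:.2f} min of budget")
print(f"99.9% target: {budget_999:.2f} min of budget")
print(f"18 bad minutes at 99.9%: {consumed:.1%} of the budget")
```

Under the assumed 99.9% target, a single 18-minute incident would use a little under half of the 28-day budget.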

package/areas/devops/sre/prompts/slo-review.md ADDED

@@ -0,0 +1,125 @@
+---
+workflow: slo-review
+---
+# Prompt: `/slo-review`
+Use when: reviewing SLO health, error budget burn, and upcoming capacity risks before committing to reliability or scaling work.
+---
+## Example 1 — Quarterly SLO review for 6 services
+**EN:**
+```
+/slo-review
+Review period: Q3 2024 (July–September)
+Services under review: checkout, payment, order, auth, user, notification
+Data available in Prometheus (Sloth recording rules)
+For each service, evaluate:
+  1. SLI achievement: actual ratio vs SLO target for the quarter
+  2. Error budget burn: how much was consumed, main events causing consumption
+  3. Incidents: count, severity, duration, correlation with budget consumption
+  4. Target calibration: is the target too tight (budget always exhausted) or too loose (never burns)?
+  5. Action items from previous review: completed? effective?
+Recommendations needed:
+  - Services to tighten (budget never used → target probably too conservative)
+  - Services to loosen (budget always exhausted → target not achievable with current architecture)
+  - Reliability investments for Q4 (prioritised by error budget consumed)
+Output format: executive summary + per-service table + Q4 recommendations
+```
+**RU:**
+```
+/slo-review
+Период проверки: Q3 2024 (июль–сентябрь)
+Сервисы на проверке: checkout, payment, order, auth, user, notification
+Данные доступны в Prometheus (Sloth recording rules)
+Для каждого сервиса оценить:
+  1. Достижение SLI: фактическое соотношение vs цель SLO за квартал
+  2. Сжигание error budget: сколько потрачено, основные события вызвавшие потребление
+  3. Инциденты: количество, severity, продолжительность, корреляция с потреблением бюджета
+  4. Калибровка цели: слишком жёсткая (бюджет всегда исчерпан) или слишком мягкая (никогда не горит)?
+  5. Action items из предыдущего review: выполнены? эффективны?
+Необходимые рекомендации:
+  - Сервисы для ужесточения (бюджет никогда не расходуется → цель вероятно слишком консервативная)
+  - Сервисы для смягчения (бюджет всегда исчерпан → цель недостижима с текущей архитектурой)
+  - Инвестиции в надёжность на Q4 (приоритизированы по потреблённому error budget)
+Формат вывода: executive summary + таблица по сервисам + рекомендации на Q4
+```
+---
+## Example 2 — Emergency SLO calibration after infra migration
+**EN:**
+```
+/slo-review
+Context: migrated from single-AZ to multi-AZ K8s (3 control plane + 6 workers)
+Pre-migration: payment-service SLO 99.5%, frequently in Freeze state
+Hypothesis: new HA setup should enable tightening to 99.9%
+Task:
+  1. Review pre-migration error budget consumption (last 3 months)
+  2. Classify error budget events: infra-caused vs app-caused vs dependency-caused
+  3. Estimate: if all infra-caused events are eliminated, what availability % would have been achieved?
+  4. Propose new SLO target with rationale
+  5. Set review checkpoint: evaluate new target after 30 days
+```
+**RU:**
+```
+/slo-review
+Контекст: миграция с single-AZ на multi-AZ K8s (3 control plane + 6 workers)
+До миграции: payment-service SLO 99.5%, часто в состоянии Freeze
+Гипотеза: новая HA конфигурация должна позволить ужесточить до 99.9%
+Задача:
+  1. Проверить потребление error budget до миграции (последние 3 месяца)
+  2. Классифицировать события error budget: вызванные инфрой / приложением / зависимостями
+  3. Оценить: если бы все события вызванные инфрой были исключены, какой % доступности был бы достигнут?
+  4. Предложить новую цель SLO с обоснованием
+  5. Установить точку проверки: оценить новую цель через 30 дней
+```
+---
+## Example 3 — Black Friday capacity runbook
+**EN:**
+```
+/slo-review
+Event: Black Friday (peak 5× normal traffic, 4-hour window)
+Services affected: checkout, payment, order (top 3 by load)
+Normal peak: 800 RPS; expected BF peak: 4000 RPS
+Pre-event checklist needed:
+  - Scale workers from 6 → 10 (pre-provision 48h before event)
+  - Set HPA min replicas: checkout→10, payment→8, order→8 (prevent cold start during spike)
+  - Pre-warm: connection pools, DNS TTLs flushed, CDN cache warmed
+  - Load test: k6 script targeting 4500 RPS (about 12% above expected peak); run 2 days before
+  - DB: run VACUUM + ANALYZE beforehand; connection pool max set to 80% of max_connections
+  - War room: open 1h before event; on-call + dev leads + DBA on standby
+  - Auto-scale-down: trigger 2h after event peak (cost control)
+Output: runbook document + pre-event checklist + post-event scale-down procedure
+```
+**RU:**
+```
+/slo-review
+Событие: Чёрная пятница (пик 5× нормального трафика, 4-часовое окно)
+Затронутые сервисы: checkout, payment, order (топ-3 по нагрузке)
+Нормальный пик: 800 RPS; ожидаемый пик ЧП: 4000 RPS
+Необходимый чеклист перед событием:
+  - Масштабировать workers с 6 → 10 (заранее за 48ч до события)
+  - Установить HPA min replicas: checkout→10, payment→8, order→8 (предотвратить cold start при скачке)
+  - Pre-warm: connection pools, сброс DNS TTL, прогрев CDN кэша
+  - Нагрузочное тестирование: k6 скрипт на 4500 RPS (примерно на 12% выше ожидаемого пика); запустить за 2 дня
+  - БД: прогрев vacuum + analyse; max connection pool = 80% от max_connections
+  - Военная комната: открыть за 1ч до события; on-call + dev leads + DBA в режиме ожидания
+  - Авто-уменьшение масштаба: через 2ч после пика события (контроль затрат)
+Результат: runbook документ + чеклист до события + процедура уменьшения масштаба после события
+```
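
Example 2 asks what availability would have been achieved if all infra-caused events were eliminated. A minimal sketch of that counterfactual follows, assuming each budget event has already been classified and measured in bad minutes. The event list, the 90-day window, and the names `BudgetEvent` and `availability` are hypothetical, not data from the package.

```python
from dataclasses import dataclass


@dataclass
class BudgetEvent:
    cause: str          # "infra", "app", or "dependency"
    bad_minutes: float  # minutes during which the SLI was violated


def availability(events, window_minutes: float, exclude_causes=frozenset()) -> float:
    """Availability over the window, optionally ignoring events with the given causes."""
    bad = sum(e.bad_minutes for e in events if e.cause not in exclude_causes)
    return 1 - bad / window_minutes


# Hypothetical pre-migration quarter (90 days).
WINDOW_MINUTES = 90 * 24 * 60
events = [
    BudgetEvent("infra", 400),
    BudgetEvent("infra", 180),
    BudgetEvent("app", 60),
    BudgetEvent("dependency", 40),
]

actual = availability(events, WINDOW_MINUTES)
without_infra = availability(events, WINDOW_MINUTES, exclude_causes={"infra"})

print(f"actual availability:        {actual:.4%}")
print(f"without infra-caused events: {without_infra:.4%}")
```

With these made-up numbers the service misses 99.5% as measured but would clear 99.9% without the infra-caused events. The estimate treats events as independent, so it is an upper bound at best; the 30-day checkpoint in the prompt is what would confirm it.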

package/areas/devops/sre/rules/error-budget-policy.md ADDED

@@ -0,0 +1,25 @@
+# Rule: Error Budget Policy
+**Priority**: P1 — Error budget governs feature development velocity vs reliability investment.
+## Error Budget States
+| State | Budget remaining | Action |
+|:---|:---|:---|
+| 🟢 Healthy | > 50% | Normal development velocity |
+| 🟡 Warning | 25–50% | Reliability work enters next sprint |
+| 🔴 Freeze | < 25% | Feature freeze; only reliability fixes ship |
+| ⛔ Exhausted | 0% | Mandatory postmortem; all features blocked until replenished |
+## Freeze Rules
+- Feature freeze requires: team-lead + product-owner sign-off.
+- Reliability work during freeze: reduce MTTR, add chaos tests, improve monitoring.
+- Exception for hotfixes (security, critical bugs) — requires VP Engineering approval.
+## Error Budget Tracking
+- Error budget burn rate alerts:
+  - Fast burn (> 14.4× in 1h): page on-call → investigate immediately
+  - Slow burn (> 3× over 6h): Slack alert → review in next stand-up
+- Monthly error budget report published to Confluence/Notion.
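
The state table and burn-rate thresholds above can be read as two small decision functions. The sketch below is one interpretation, not the package's implementation: the boundary handling (exactly 25% and 50% count as Warning), the function names, and the example error ratio and SLO target are assumptions.

```python
from typing import Optional


def budget_state(remaining: float) -> str:
    """Map the remaining budget fraction (0.4 means 40%) to a policy state."""
    if remaining <= 0:
        return "exhausted"
    if remaining < 0.25:
        return "freeze"
    if remaining <= 0.50:
        return "warning"
    return "healthy"


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo_target)


def burn_alert(rate_1h: float, rate_6h: float) -> Optional[str]:
    """Apply the policy thresholds: fast burn pages, slow burn goes to Slack."""
    if rate_1h > 14.4:
        return "page"
    if rate_6h > 3:
        return "slack"
    return None


# Hypothetical reading: 0.8% errors against an assumed 99.5% SLO.
example_rate = burn_rate(0.008, 0.995)
print(f"burn rate: {example_rate:.2f}x")
print(budget_state(0.40), burn_alert(rate_1h=example_rate, rate_6h=example_rate))
```

For reference, a sustained 14.4× burn for one hour uses 2% of a 30-day budget (14.4 / 720 hours).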