ecip-observability-stack 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (76)
  1. package/CLAUDE.md +48 -0
  2. package/README.md +75 -0
  3. package/alerts/analysis-backlog.yaml +39 -0
  4. package/alerts/cache-degradation.yaml +44 -0
  5. package/alerts/dlq-depth.yaml +56 -0
  6. package/alerts/lsp-daemon.yaml +43 -0
  7. package/alerts/mcp-latency.yaml +46 -0
  8. package/alerts/security-anomaly.yaml +59 -0
  9. package/alerts/sla-latency.yaml +61 -0
  10. package/chaos/kafka-broker-restart.sh +168 -0
  11. package/chaos/kill-lsp-daemon.sh +148 -0
  12. package/chaos/redis-node-failure.sh +318 -0
  13. package/ci/check-observability-contract.js +285 -0
  14. package/ci/eslint-plugin-ecip/index.js +209 -0
  15. package/ci/eslint-plugin-ecip/package.json +12 -0
  16. package/ci/github-actions-observability-gate.yaml +180 -0
  17. package/ci/ruff-shared.toml +41 -0
  18. package/collector/otel-collector-config.yaml +226 -0
  19. package/collector/otel-collector-daemonset.yaml +168 -0
  20. package/collector/sampling-config.yaml +83 -0
  21. package/dashboards/_provisioning/grafana-dashboards.yaml +16 -0
  22. package/dashboards/analysis-throughput.json +166 -0
  23. package/dashboards/cache-performance.json +129 -0
  24. package/dashboards/cross-repo-fanout.json +93 -0
  25. package/dashboards/event-bus-dlq.json +129 -0
  26. package/dashboards/lsp-daemon-health.json +104 -0
  27. package/dashboards/mcp-call-graph.json +114 -0
  28. package/dashboards/query-latency.json +160 -0
  29. package/dashboards/security-events.json +131 -0
  30. package/docs/M08-Observability-Design.md +639 -0
  31. package/docs/PROGRESS.md +375 -0
  32. package/docs/module-documentation.md +64 -0
  33. package/elasticsearch/ilm-policy.json +57 -0
  34. package/elasticsearch/index-template.json +62 -0
  35. package/elasticsearch/kibana-space.yaml +53 -0
  36. package/helm/Chart.yaml +30 -0
  37. package/helm/templates/configmaps.yaml +25 -0
  38. package/helm/templates/elasticsearch.yaml +68 -0
  39. package/helm/templates/grafana-secret.yaml +22 -0
  40. package/helm/templates/grafana.yaml +19 -0
  41. package/helm/templates/loki.yaml +33 -0
  42. package/helm/templates/otel-collector.yaml +119 -0
  43. package/helm/templates/prometheus.yaml +43 -0
  44. package/helm/templates/tempo.yaml +16 -0
  45. package/helm/values.prod.yaml +159 -0
  46. package/helm/values.yaml +146 -0
  47. package/logging-lib/nodejs/package.json +57 -0
  48. package/logging-lib/nodejs/pnpm-lock.yaml +4576 -0
  49. package/logging-lib/python/pyproject.toml +45 -0
  50. package/logging-lib/python/src/__init__.py +19 -0
  51. package/logging-lib/python/src/logger.py +131 -0
  52. package/logging-lib/python/src/security_events.py +150 -0
  53. package/logging-lib/python/src/tracer.py +185 -0
  54. package/logging-lib/python/tests/test_logger.py +113 -0
  55. package/package.json +21 -0
  56. package/prometheus/prometheus-values.yaml +170 -0
  57. package/prometheus/recording-rules.yaml +97 -0
  58. package/prometheus/scrape-configs.yaml +122 -0
  59. package/runbooks/SDK-INTEGRATION.md +239 -0
  60. package/runbooks/alert-response/ANALYSIS_BACKLOG.md +128 -0
  61. package/runbooks/alert-response/DLQ_DEPTH_EXCEEDED.md +150 -0
  62. package/runbooks/alert-response/HIGH_QUERY_LATENCY.md +134 -0
  63. package/runbooks/alert-response/LSP_DAEMON_RESTART.md +118 -0
  64. package/runbooks/alert-response/SECURITY_ANOMALY.md +160 -0
  65. package/runbooks/dashboard-guide.md +169 -0
  66. package/scripts/lint-dashboards.js +184 -0
  67. package/tempo/tempo-datasource.yaml +46 -0
  68. package/tempo/tempo-values.yaml +94 -0
  69. package/tests/alert-threshold-config.test.ts +283 -0
  70. package/tests/log-schema-validation.test.ts +246 -0
  71. package/tests/metric-label-validation.test.ts +292 -0
  72. package/tests/otel-pipeline-integration.test.ts +420 -0
  73. package/tests/security-events.test.ts +417 -0
  74. package/tsconfig.json +17 -0
  75. package/vitest.config.ts +21 -0
  76. package/vitest.integration.config.ts +9 -0
package/CLAUDE.md ADDED
@@ -0,0 +1,48 @@
1
+ # CLAUDE.md — ecip-observability-stack (M08)
2
+
3
+ ## Module Purpose
4
+ The **Observability Stack** is the eyes and ears of the platform. This module deploys and operates the monitoring infrastructure — it does NOT contain application code.
5
+
6
+ ## Key Principle: Auto-Instrumentation
7
+ Other modules should NOT need to write custom metrics code. The OpenTelemetry auto-instrumentation should cover standard HTTP, gRPC, Kafka, and database spans automatically.
8
+
9
+ When a module needs a custom metric (e.g., `ecip_analysis_embedding_duration`), it uses the OTel SDK's metric API — not Prometheus client directly.
10
+
11
+ ## Dashboard Ownership
12
+ Each dashboard in `dashboards/` corresponds to a module:
13
+ ```
14
+ dashboards/
15
+     analysis-throughput.json   ← M02 metrics
16
+     query-latency.json         ← M04 metrics
17
+     mcp-call-graph.json        ← M05 metrics
18
+     cache-performance.json     ← M03 metrics
19
+     event-bus-dlq.json         ← M07 metrics
20
+     security-events.json       ← Security events (M01/M06)
21
+ ```
22
+
23
+ Edit dashboards in Grafana UI → export JSON → commit here. This is the source of truth for dashboard definitions.
24
+
25
+ ## Alert Rules
26
+ Alert definitions are in `alerts/`. Each alert must have:
27
+ - `severity`: `critical`, `warning`, or `info`
28
+ - `runbook_url`: link to a runbook in `runbooks/`
29
+ - `for`: how long the condition must hold before firing
30
+
31
+ Critical alerts page on-call. Warning alerts send Slack notification only.
32
+
33
+ ## Required Span Attributes (from all modules)
34
+ Enforce these via OTel Collector attribute processor if modules don't emit them:
35
+ - `ecip.module` (e.g., `M02`, `M04`)
36
+ - `ecip.org_id`
37
+ - `ecip.repo_id` (where applicable)
38
+
39
+ ## Chaos Testing
40
+ Chaos test scripts are in `chaos/`. Run these before each production release:
41
+ - `chaos/kill-lsp-daemon.sh` — verify M02 circuit breaker fires correctly
42
+ - `chaos/redis-node-failure.sh` — verify M03 cache fallback
43
+ - `chaos/kafka-broker-restart.sh` — verify M07 producer retry
44
+
45
+ ## Do Not
46
+ - Do not add application logic here
47
+ - Do not write to M03 or any other module's data store
48
+ - Do not require other modules to install Prometheus client libraries (OTel SDK only)
package/README.md ADDED
@@ -0,0 +1,75 @@
1
+ # ecip-observability-stack (M08 — Observability Stack)
2
+
3
+ > **Team:** Platform/Infra · **Phase:** 5 (Weeks 23–28) · **Priority:** P5
4
+
5
+ Platform-wide observability infrastructure. Provides distributed tracing, metrics collection, log aggregation, Grafana dashboards, and alerting for all ECIP modules via OpenTelemetry auto-instrumentation.
6
+
7
+ ---
8
+
9
+ ## Responsibilities
10
+
11
+ - Deploy and maintain OpenTelemetry Collector
12
+ - Operate Grafana + Prometheus (metrics)
13
+ - Operate Jaeger or Tempo (distributed tracing)
14
+ - Provide dashboards for: analysis latency, query p95, cache hit rates, MCP call graphs
15
+ - Configure alerting rules and on-call routing
16
+ - Run chaos tests and load tests for production validation
17
+
18
+ ---
19
+
20
+ ## Technology Stack
21
+
22
+ | Component | Technology |
23
+ |-----------|-----------|
24
+ | Instrumentation | OpenTelemetry SDK (auto) |
25
+ | Metrics | Prometheus + Grafana |
26
+ | Tracing | Jaeger or Grafana Tempo |
27
+ | Logs | Loki or ELK |
28
+ | Alerting | Grafana Alerting + PagerDuty |
29
+
30
+ ---
31
+
32
+ ## Getting Started
33
+
34
+ ```bash
35
+ git clone git@github.com:ecip/ecip-observability-stack.git
36
+ cd ecip-observability-stack
37
+
38
+ # Deploy the full observability stack to Kubernetes
39
+ helm upgrade --install ecip-obs ./helm --namespace monitoring
40
+
41
+ # Access Grafana locally
42
+ kubectl port-forward svc/grafana 3000:3000 -n monitoring
43
+ # Open http://localhost:3000 (admin/admin)
44
+ ```
45
+
46
+ ---
47
+
48
+ ## Required Dashboards
49
+
50
+ | Dashboard | Key Metrics |
51
+ |-----------|------------|
52
+ | Analysis Pipeline | Events consumed/s, analysis duration p50/p95, embedding API latency, error rate |
53
+ | Query Service | Query duration p50/p95/p99, LLM API latency, cache hit rate, MCP fan-out depth |
54
+ | MCP Servers | Tool call latency per tool, per-repo RPS, auth failure rate |
55
+ | Knowledge Store | Redis hit rate, pgvector query duration, write throughput |
56
+ | Event Bus | Kafka consumer lag, DLQ depth, webhook processing latency |
57
+ | Platform SLAs | End-to-end query p95 < 1.5s, analysis p95 < 30s, uptime |
58
+
59
+ ---
60
+
61
+ ## SLA Targets
62
+
63
+ | SLA | Target |
64
+ |-----|--------|
65
+ | Query p95 latency | < 1.5s |
66
+ | Analysis p95 latency (trunk) | < 30s per file |
67
+ | Platform uptime | > 99.5% |
68
+ | Cache hit rate (M03) | > 80% |
69
+
70
+ ---
71
+
72
+ ## Module Dependencies
73
+
74
+ **Depends on:** OpenTelemetry SDKs in all other modules (auto-instrumented — no per-module code changes required)
75
+ **Called by:** Nothing — pull-based metrics collection
@@ -0,0 +1,39 @@
1
+ # =============================================================================
2
+ # ECIP Alert: Analysis Backlog Growing
3
+ # =============================================================================
4
+ # Fires when Kafka consumer lag for analysis topics exceeds 1000 events
5
+ # =============================================================================
6
+ groups:
7
+ - name: ecip.analysis.backlog
8
+ rules:
9
+ - alert: AnalysisBacklogCritical
10
+ expr: >
11
+ sum(kafka_consumergroup_lag{group=~"ecip-analysis.*"}) > 1000
12
+ for: 10m
13
+ labels:
14
+ severity: critical
15
+ module: M02
16
+ team: analysis-engine
17
+ annotations:
18
+ summary: "Analysis event backlog exceeds 1000 events"
19
+ description: >
20
+ The analysis engine's Kafka consumer lag is {{ $value }} events.
21
+ This means analysis is falling behind ingestion rate.
22
+ If sustained, newly pushed code will not be indexed in time.
23
+ runbook_url: "https://ecip.internal/runbooks/alert-response/ANALYSIS_BACKLOG.md"
24
+ dashboard_url: "https://grafana.ecip.internal/d/ecip-analysis-throughput"
25
+
26
+ - alert: AnalysisBacklogWarning
27
+ expr: >
28
+ sum(kafka_consumergroup_lag{group=~"ecip-analysis.*"}) > 500
29
+ for: 15m
30
+ labels:
31
+ severity: warning
32
+ module: M02
33
+ team: analysis-engine
34
+ annotations:
35
+ summary: "Analysis event backlog exceeds 500 events"
36
+ description: >
37
+ The analysis engine's Kafka consumer lag is {{ $value }} events.
38
+ Trending toward the critical threshold (1000).
39
+ runbook_url: "https://ecip.internal/runbooks/alert-response/ANALYSIS_BACKLOG.md"
@@ -0,0 +1,44 @@
# =============================================================================
# ECIP Alert: Cache Hit Rate Degradation
# =============================================================================
# SLA Target: cache_hit_rate > 80% (M03 Knowledge Store)
# Warning at < 60%
#
# NOTE: cache_hit_rate is a 0-1 ratio (the threshold below compares against
# 0.60), so the description renders it with `humanizePercentage`. The previous
# `printf "%.1f"` + literal "%" printed the raw ratio (e.g. "0.5%" for a 50%
# hit rate), which was misleading.
# =============================================================================
groups:
  - name: ecip.cache.degradation
    rules:
      - alert: CacheHitRateDegraded
        expr: >
          cache_hit_rate{job=~"ecip-knowledge-store|ecip-query-service"} < 0.60
        for: 15m
        labels:
          severity: warning
          module: M03
          team: knowledge-store
        annotations:
          summary: "Cache hit rate below 60% for {{ $labels.cache_type }} ({{ $labels.repo }})"
          description: >
            Cache hit rate for {{ $labels.cache_type }} on repo
            {{ $labels.repo }} is {{ $value | humanizePercentage }}.
            Target is > 80%. Possible causes: cold cache after
            deployment, Redis eviction, or traffic pattern change.
          runbook_url: "https://ecip.internal/runbooks/alert-response/HIGH_QUERY_LATENCY.md"
          dashboard_url: "https://grafana.ecip.internal/d/ecip-cache-performance"

      - alert: KnowledgeStoreWriteLatencyHigh
        # Write latency is in milliseconds, so plain printf formatting is correct here.
        expr: >
          histogram_quantile(0.95,
            sum(rate(knowledge_store_write_duration_ms_bucket[5m])) by (le, store_type)
          ) > 200
        for: 10m
        labels:
          severity: warning
          module: M03
          team: knowledge-store
        annotations:
          summary: "Knowledge Store write p95 > 200ms for {{ $labels.store_type }}"
          description: >
            Write latency p95 for {{ $labels.store_type }} is
            {{ $value | printf "%.0f" }}ms. This slows analysis
            indexing and may cause backlog growth.
          runbook_url: "https://ecip.internal/runbooks/alert-response/HIGH_QUERY_LATENCY.md"
@@ -0,0 +1,56 @@
# =============================================================================
# ECIP Alert: DLQ Depth Exceeded
# =============================================================================
# A growing dead-letter queue means messages keep failing after exhausting
# every retry. Three rules:
#   - critical: depth > 100 for 5m
#   - warning:  depth > 50  for 10m (trending toward critical)
#   - warning:  oldest message older than 24h (abandoned failures)
# =============================================================================
groups:
  - name: ecip.event-bus.dlq
    rules:
      - alert: DLQDepthExceeded
        expr: >
          event_bus_dlq_depth > 100
        for: 5m
        labels:
          severity: critical
          module: M07
          team: event-bus
        annotations:
          summary: "DLQ depth > 100 for topic {{ $labels.topic }}"
          description: >
            The dead letter queue for topic {{ $labels.topic }} has
            {{ $value }} messages. This means messages are consistently
            failing processing after all retry attempts.
            Investigate the DLQ messages and root cause.
          runbook_url: "https://ecip.internal/runbooks/alert-response/DLQ_DEPTH_EXCEEDED.md"
          dashboard_url: "https://grafana.ecip.internal/d/ecip-event-bus-dlq"

      - alert: DLQDepthWarning
        expr: >
          event_bus_dlq_depth > 50
        for: 10m
        labels:
          severity: warning
          module: M07
          team: event-bus
        annotations:
          summary: "DLQ depth > 50 for topic {{ $labels.topic }}"
          description: >
            DLQ for topic {{ $labels.topic }} has {{ $value }} messages.
            Trending toward the critical threshold (100).
          runbook_url: "https://ecip.internal/runbooks/alert-response/DLQ_DEPTH_EXCEEDED.md"

      - alert: DLQMessageAgeHigh
        # 86400s == 24h; humanizeDuration renders the age readably in the message.
        expr: >
          event_bus_dlq_oldest_message_age_seconds > 86400
        for: 30m
        labels:
          severity: warning
          module: M07
          team: event-bus
        annotations:
          summary: "DLQ has messages older than 24 hours for topic {{ $labels.topic }}"
          description: >
            The oldest DLQ message for {{ $labels.topic }} is
            {{ $value | humanizeDuration }} old. Stale DLQ messages
            indicate abandoned failures that need manual investigation.
          runbook_url: "https://ecip.internal/runbooks/alert-response/DLQ_DEPTH_EXCEEDED.md"
@@ -0,0 +1,43 @@
# =============================================================================
# ECIP Alert: LSP Daemon Restart Rate
# =============================================================================
# for: 0min is INTENTIONAL — daemon crash is always significant.
# The M04 circuit breaker handles graceful degradation.
# This alert gets a human involved in parallel.
#
# NOTE: the expression uses increase(), not rate(). rate() is per-SECOND, so
# `rate(...[1h]) > 2` would only fire at ~7200 restarts/hour. increase() over
# a 1h window yields restarts-per-hour, matching the documented "> 2/hour"
# threshold and the "times/hour" wording in the description.
# =============================================================================
groups:
  - name: ecip.lsp.daemon
    rules:
      - alert: LSPDaemonRestartRate
        expr: >
          increase(lsp_daemon_restarts_total[1h]) > 2
        for: 0m
        labels:
          severity: critical
          module: M02
          team: analysis-engine
        annotations:
          summary: "LSP daemon restart rate > 2/hour for {{ $labels.repo }} ({{ $labels.language }})"
          description: >
            LSP daemon for {{ $labels.repo }} ({{ $labels.language }})
            is restarting at {{ $value | printf "%.1f" }} times/hour.
            This indicates OOM, crash loop, or resource exhaustion.
            The M04 circuit breaker should be active — verify in Grafana.
          runbook_url: "https://ecip.internal/runbooks/alert-response/LSP_DAEMON_RESTART.md"
          dashboard_url: "https://grafana.ecip.internal/d/ecip-lsp-daemon-health"

      - alert: LSPDaemonOOMKill
        # kube-state-metrics exposes the last termination reason per container;
        # any OOMKilled occurrence in the last hour fires immediately.
        expr: >
          increase(kube_pod_container_status_last_terminated_reason{container=~"lsp-daemon.*", reason="OOMKilled"}[1h]) > 0
        for: 0m
        labels:
          severity: critical
          module: M02
          team: analysis-engine
        annotations:
          summary: "LSP daemon OOM killed: {{ $labels.pod }}"
          description: >
            LSP daemon pod {{ $labels.pod }} was OOM-killed.
            Consider increasing memory limits or reviewing the repo
            that triggered the analysis.
          runbook_url: "https://ecip.internal/runbooks/alert-response/LSP_DAEMON_RESTART.md"
@@ -0,0 +1,46 @@
# =============================================================================
# ECIP Alert: MCP Call Latency
# =============================================================================
# MCP fan-out calls must stay below 800ms p95 to meet the overall
# query latency SLA of 1500ms
#
# NOTE: the error-rate expression produces a 0-1 ratio (threshold 0.05), so
# the description uses `humanizePercentage`. The previous `printf "%.1f"` +
# literal "%" printed the raw ratio (e.g. "0.1%" for a 10% error rate).
# =============================================================================
groups:
  - name: ecip.mcp.latency
    rules:
      - alert: MCPCallLatencyWarn
        # Latency is in milliseconds, so plain printf formatting is correct here.
        expr: >
          histogram_quantile(0.95,
            sum(rate(mcp_call_duration_ms_bucket[5m])) by (le, target_repo, tool_name)
          ) > 800
        for: 10m
        labels:
          severity: warning
          module: M05
          team: mcp-server
        annotations:
          summary: "MCP call p95 > 800ms for {{ $labels.tool_name }} → {{ $labels.target_repo }}"
          description: >
            MCP tool call {{ $labels.tool_name }} targeting
            {{ $labels.target_repo }} has p95 latency of
            {{ $value | printf "%.0f" }}ms. This contributes to
            overall query latency SLA risk.
          runbook_url: "https://ecip.internal/runbooks/alert-response/HIGH_QUERY_LATENCY.md"
          dashboard_url: "https://grafana.ecip.internal/d/ecip-mcp-call-graph"

      - alert: MCPCallErrorRateHigh
        expr: >
          sum(rate(mcp_call_duration_ms_count{status_code=~"5.."}[5m])) by (tool_name)
          /
          sum(rate(mcp_call_duration_ms_count[5m])) by (tool_name)
          > 0.05
        for: 5m
        labels:
          severity: warning
          module: M05
          team: mcp-server
        annotations:
          summary: "MCP tool {{ $labels.tool_name }} error rate > 5%"
          description: >
            MCP tool {{ $labels.tool_name }} is failing at
            {{ $value | humanizePercentage }} error rate.
          runbook_url: "https://ecip.internal/runbooks/alert-response/HIGH_QUERY_LATENCY.md"
@@ -0,0 +1,59 @@
# =============================================================================
# ECIP Alert: Security Anomaly Detection
# =============================================================================
# All rules here use for: 0m — security events must page IMMEDIATELY,
# with no hold-down window. Covers:
#   - auth failure bursts (possible brute force / mass token expiry)
#   - RBAC denial bursts (misconfiguration or probing)
#   - any mTLS rejection between services (zero tolerance)
# =============================================================================
groups:
  - name: ecip.security.anomaly
    rules:
      - alert: SecurityAuthBurst
        expr: >
          increase(auth_failure_total[5m]) > 10
        for: 0m
        labels:
          severity: critical
          module: M01
          team: security
        annotations:
          summary: "Auth failure burst detected: {{ $value | printf \"%.0f\" }} failures in 5 minutes"
          description: >
            More than 10 authentication failures in the last 5 minutes.
            Reason breakdown: {{ $labels.reason }}.
            This may indicate a brute-force attempt or a mass token expiration event.
            Check the security events dashboard and Elasticsearch for details.
          runbook_url: "https://ecip.internal/runbooks/alert-response/SECURITY_ANOMALY.md"
          dashboard_url: "https://grafana.ecip.internal/d/ecip-security-events"

      - alert: SecurityRBACDenialBurst
        expr: >
          increase(rbac_denial_total[5m]) > 10
        for: 0m
        labels:
          severity: warning
          module: M06
          team: security
        annotations:
          summary: "RBAC denial burst: {{ $value | printf \"%.0f\" }} denials in 5 minutes"
          description: >
            More than 10 RBAC denials in the last 5 minutes for
            resource={{ $labels.resource }}, action={{ $labels.action }}.
            This may indicate misconfigured permissions or
            an unauthorized access attempt.
          runbook_url: "https://ecip.internal/runbooks/alert-response/SECURITY_ANOMALY.md"

      - alert: ServiceAuthFailure
        # Any mTLS rejection at all (> 0) is alert-worthy — no burst threshold.
        expr: >
          increase(auth_failure_total{reason="mtls_rejected"}[5m]) > 0
        for: 0m
        labels:
          severity: critical
          module: M01
          team: security
        annotations:
          summary: "Service-to-service authentication failure (mTLS rejection)"
          description: >
            An mTLS authentication failure was detected. This indicates
            either a certificate misconfiguration or a potential
            man-in-the-middle attempt. Investigate immediately.
          runbook_url: "https://ecip.internal/runbooks/alert-response/SECURITY_ANOMALY.md"
@@ -0,0 +1,61 @@
# =============================================================================
# ECIP Alert: Query Latency SLA Breach
# =============================================================================
# SLA Target: query_duration_ms p95 < 1500ms
# All rules compute p95 via histogram_quantile over _bucket rates — never a
# plain average, since averaging hides tail-latency problems.
# =============================================================================
groups:
  - name: ecip.sla.latency
    rules:
      - alert: QueryLatencySLABreach
        expr: >
          histogram_quantile(0.95,
            sum(rate(query_duration_ms_bucket{job="ecip-query-service"}[5m])) by (le)
          ) > 1500
        for: 5m
        labels:
          severity: critical
          module: M04
          team: query-service
        annotations:
          summary: "Query latency p95 exceeds SLA threshold (1500ms)"
          description: >
            Query service p95 latency is {{ $value | printf "%.0f" }}ms,
            which exceeds the 1500ms SLA threshold.
            This has been firing for more than 5 minutes.
          runbook_url: "https://ecip.internal/runbooks/alert-response/HIGH_QUERY_LATENCY.md"
          dashboard_url: "https://grafana.ecip.internal/d/ecip-query-latency"

      - alert: FilterAuthorizedReposLatency
        # 20ms budget comes from NFR-SEC-011; this RPC sits on the hot query path.
        expr: >
          histogram_quantile(0.95,
            sum(rate(filter_authorized_repos_duration_ms_bucket{job="ecip-registry-service"}[5m])) by (le)
          ) > 20
        for: 5m
        labels:
          severity: warning
          module: M06
          team: registry-service
        annotations:
          summary: "FilterAuthorizedRepos p95 exceeds 20ms (NFR-SEC-011)"
          description: >
            Registry service FilterAuthorizedRepos RPC p95 latency is
            {{ $value | printf "%.1f" }}ms, exceeding the 20ms SLA
            from NFR-SEC-011. This directly impacts the hot query path.
          runbook_url: "https://ecip.internal/runbooks/alert-response/HIGH_QUERY_LATENCY.md"

      - alert: GRPCRequestLatencyHigh
        # Broad catch-all over every gRPC service/method pair.
        # NOTE(review): unlike the other rules this one carries no `team`
        # label — confirm routing before relying on it for paging.
        expr: >
          histogram_quantile(0.95,
            sum(rate(grpc_request_duration_ms_bucket[5m])) by (le, service, method)
          ) > 500
        for: 5m
        labels:
          severity: warning
          module: M06
        annotations:
          summary: "gRPC request p95 latency > 500ms for {{ $labels.service }}/{{ $labels.method }}"
          description: >
            gRPC p95 latency is {{ $value | printf "%.0f" }}ms for
            {{ $labels.service }}/{{ $labels.method }}.
          runbook_url: "https://ecip.internal/runbooks/alert-response/HIGH_QUERY_LATENCY.md"
@@ -0,0 +1,168 @@
#!/bin/bash
# =============================================================================
# Chaos Test: Kafka Broker Restart
# =============================================================================
# Simulates a Kafka broker restart for M07 Event Bus.
# Validates that:
#   1. DLQDepthWarning/DLQDepthExceeded alerts fire if messages fail
#   2. Event Bus DLQ dashboard reflects message accumulation
#   3. Consumer groups rebalance correctly after broker recovery
#   4. No data loss — messages are replayed from Kafka after recovery
#
# Prerequisites:
#   - kubectl configured with access to the kafka namespace
#   - Kafka StatefulSet running
#   - Prometheus and Alertmanager running
#   - ecip-event-bus consumers running
#
# Usage:
#   ./kafka-broker-restart.sh [namespace] [broker-id]
#   Arguments are positional (defaults: namespace=kafka, broker-id=0).
#   (The previously documented --namespace/--broker-id flags were never
#   parsed; the script has always read $1 and $2.)
# =============================================================================

set -euo pipefail

KAFKA_NAMESPACE="${1:-kafka}"
BROKER_ID="${2:-0}"
BROKER_POD="kafka-$BROKER_ID"

#######################################
# Total depth of the ecip.dlq topic (sum of end offsets across partitions).
# Outputs: depth on stdout; "0" if the broker/topic is unreachable or empty.
#######################################
dlq_depth() {
  local depth
  # `print sum+0` forces a numeric "0" even when awk sees no input, so the
  # callers' $(( ... )) arithmetic never receives an empty string.
  depth=$(kubectl exec -n "$KAFKA_NAMESPACE" "$BROKER_POD" -- \
    kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 \
    --topic ecip.dlq --time -1 2>/dev/null \
    | awk -F: '{sum += $3} END {print sum+0}') || depth=0
  printf '%s\n' "${depth:-0}"
}

#######################################
# Describe the ECIP consumer group (best-effort; never aborts the test).
#######################################
describe_consumer_group() {
  kubectl exec -n "$KAFKA_NAMESPACE" "$BROKER_POD" -- \
    kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --group ecip-event-processors --describe 2>/dev/null \
    || echo "  (consumer group info unavailable)"
}

echo "=========================================="
echo " Chaos Test: Kafka Broker Restart"
echo "=========================================="
echo "  Kafka namespace: $KAFKA_NAMESPACE"
echo "  Target broker:   kafka-$BROKER_ID"
echo "=========================================="

# Verify the target broker pod exists before doing anything destructive.
if ! kubectl get pod -n "$KAFKA_NAMESPACE" "$BROKER_POD" &>/dev/null; then
  echo "ERROR: Pod $BROKER_POD not found in namespace $KAFKA_NAMESPACE"
  exit 1
fi

echo "Target pod: $BROKER_POD"

# Pre-test: capture consumer group state and DLQ depth baseline.
echo ""
echo "[PRE-TEST] Capturing baseline state..."
describe_consumer_group

echo "  Broker $BROKER_ID status: Running"

DLQ_DEPTH_BEFORE=$(dlq_depth)
echo "  DLQ depth before: $DLQ_DEPTH_BEFORE"

# Execute: delete the broker pod (the StatefulSet recreates it).
echo ""
echo "[CHAOS] Deleting Kafka broker pod $BROKER_POD..."
echo "  StatefulSet will recreate it automatically."

kubectl delete pod -n "$KAFKA_NAMESPACE" "$BROKER_POD" --grace-period=0 --force

echo "  Pod deleted. Watching for recreation..."

# Wait (up to TIMEOUT seconds) for the pod to be recreated and Ready.
echo ""
echo "[RECOVERY] Waiting for broker to restart..."

TIMEOUT=120
ELAPSED=0
READY="false"  # pre-initialized so the summary is safe under `set -u`
while [ "$ELAPSED" -lt "$TIMEOUT" ]; do
  STATUS=$(kubectl get pod -n "$KAFKA_NAMESPACE" "$BROKER_POD" \
    -o jsonpath='{.status.phase}' 2>/dev/null || echo "Pending")
  READY=$(kubectl get pod -n "$KAFKA_NAMESPACE" "$BROKER_POD" \
    -o jsonpath='{.status.containerStatuses[0].ready}' 2>/dev/null || echo "false")

  echo "  [${ELAPSED}s] Phase: $STATUS, Ready: $READY"

  if [ "$READY" = "true" ]; then
    echo "  Broker is back online!"
    break
  fi

  sleep 10
  ELAPSED=$((ELAPSED + 10))
done

if [ "$ELAPSED" -ge "$TIMEOUT" ]; then
  echo "  ⚠️  WARNING: Broker did not recover within ${TIMEOUT}s"
fi

# Give the consumer group time to rebalance onto the recovered broker.
echo ""
echo "[REBALANCE] Waiting 30s for consumer group rebalance..."
sleep 30

# Post-test: consumer group state and DLQ growth.
echo ""
echo "[POST-TEST] Checking consumer group state..."
describe_consumer_group

DLQ_DEPTH_AFTER=$(dlq_depth)
echo "  DLQ depth after: $DLQ_DEPTH_AFTER"
echo "  New DLQ messages: $((DLQ_DEPTH_AFTER - DLQ_DEPTH_BEFORE))"

# Check whether any DLQ alerts are firing in Prometheus.
echo ""
echo "[ALERTS] Checking for Event Bus alerts..."

PROM_POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus \
  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) || PROM_POD=""

if [ -n "$PROM_POD" ]; then
  # NB: `grep -c` prints "0" itself when nothing matches (and exits 1), so
  # the fallback must only ASSIGN a default — the old `|| echo "0"` appended
  # a second "0", corrupting the captured value.
  DLQ_ALERT=$(kubectl exec -n monitoring "$PROM_POD" -- \
    wget -qO- 'http://localhost:9090/api/v1/alerts' 2>/dev/null \
    | grep -c "DLQDepth") || DLQ_ALERT=0

  echo "  DLQ alerts firing: $DLQ_ALERT"
fi

# Verify no data loss — the consumer group should catch back up (lag → 0).
echo ""
echo "[VERIFICATION] Checking for data completeness..."

LAG_TOTAL=$(kubectl exec -n "$KAFKA_NAMESPACE" "$BROKER_POD" -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group ecip-event-processors --describe 2>/dev/null \
  | awk 'NR>1 {sum += $5} END {print sum+0}') || LAG_TOTAL="unknown"

echo "  Total consumer lag: $LAG_TOTAL"

# Summary
echo ""
echo "=========================================="
echo " Chaos Test Complete"
echo "=========================================="
echo "  Broker recovery: $( [ "$READY" = "true" ] && echo 'Success ✅' || echo 'FAILED ❌' )"
echo "  DLQ growth:      $((DLQ_DEPTH_AFTER - DLQ_DEPTH_BEFORE)) messages"
echo "  Consumer lag:    $LAG_TOTAL"
echo ""
echo "  Next steps:"
echo "    1. Check Grafana → ECIP → Event Bus DLQ dashboard"
echo "    2. Verify consumer lag returns to 0"
echo "    3. If DLQ messages accumulated, replay with:"
echo "       kubectl exec -n ecip deployment/ecip-event-bus -- \\"
echo "         node scripts/replay-dlq.js --topic ecip.dlq --batch-size 50"
echo "    4. Verify all alerts resolve within 15 minutes"
echo "=========================================="
+ echo "=========================================="