ecip-observability-stack 1.0.0

Files changed (76)
  1. package/CLAUDE.md +48 -0
  2. package/README.md +75 -0
  3. package/alerts/analysis-backlog.yaml +39 -0
  4. package/alerts/cache-degradation.yaml +44 -0
  5. package/alerts/dlq-depth.yaml +56 -0
  6. package/alerts/lsp-daemon.yaml +43 -0
  7. package/alerts/mcp-latency.yaml +46 -0
  8. package/alerts/security-anomaly.yaml +59 -0
  9. package/alerts/sla-latency.yaml +61 -0
  10. package/chaos/kafka-broker-restart.sh +168 -0
  11. package/chaos/kill-lsp-daemon.sh +148 -0
  12. package/chaos/redis-node-failure.sh +318 -0
  13. package/ci/check-observability-contract.js +285 -0
  14. package/ci/eslint-plugin-ecip/index.js +209 -0
  15. package/ci/eslint-plugin-ecip/package.json +12 -0
  16. package/ci/github-actions-observability-gate.yaml +180 -0
  17. package/ci/ruff-shared.toml +41 -0
  18. package/collector/otel-collector-config.yaml +226 -0
  19. package/collector/otel-collector-daemonset.yaml +168 -0
  20. package/collector/sampling-config.yaml +83 -0
  21. package/dashboards/_provisioning/grafana-dashboards.yaml +16 -0
  22. package/dashboards/analysis-throughput.json +166 -0
  23. package/dashboards/cache-performance.json +129 -0
  24. package/dashboards/cross-repo-fanout.json +93 -0
  25. package/dashboards/event-bus-dlq.json +129 -0
  26. package/dashboards/lsp-daemon-health.json +104 -0
  27. package/dashboards/mcp-call-graph.json +114 -0
  28. package/dashboards/query-latency.json +160 -0
  29. package/dashboards/security-events.json +131 -0
  30. package/docs/M08-Observability-Design.md +639 -0
  31. package/docs/PROGRESS.md +375 -0
  32. package/docs/module-documentation.md +64 -0
  33. package/elasticsearch/ilm-policy.json +57 -0
  34. package/elasticsearch/index-template.json +62 -0
  35. package/elasticsearch/kibana-space.yaml +53 -0
  36. package/helm/Chart.yaml +30 -0
  37. package/helm/templates/configmaps.yaml +25 -0
  38. package/helm/templates/elasticsearch.yaml +68 -0
  39. package/helm/templates/grafana-secret.yaml +22 -0
  40. package/helm/templates/grafana.yaml +19 -0
  41. package/helm/templates/loki.yaml +33 -0
  42. package/helm/templates/otel-collector.yaml +119 -0
  43. package/helm/templates/prometheus.yaml +43 -0
  44. package/helm/templates/tempo.yaml +16 -0
  45. package/helm/values.prod.yaml +159 -0
  46. package/helm/values.yaml +146 -0
  47. package/logging-lib/nodejs/package.json +57 -0
  48. package/logging-lib/nodejs/pnpm-lock.yaml +4576 -0
  49. package/logging-lib/python/pyproject.toml +45 -0
  50. package/logging-lib/python/src/__init__.py +19 -0
  51. package/logging-lib/python/src/logger.py +131 -0
  52. package/logging-lib/python/src/security_events.py +150 -0
  53. package/logging-lib/python/src/tracer.py +185 -0
  54. package/logging-lib/python/tests/test_logger.py +113 -0
  55. package/package.json +21 -0
  56. package/prometheus/prometheus-values.yaml +170 -0
  57. package/prometheus/recording-rules.yaml +97 -0
  58. package/prometheus/scrape-configs.yaml +122 -0
  59. package/runbooks/SDK-INTEGRATION.md +239 -0
  60. package/runbooks/alert-response/ANALYSIS_BACKLOG.md +128 -0
  61. package/runbooks/alert-response/DLQ_DEPTH_EXCEEDED.md +150 -0
  62. package/runbooks/alert-response/HIGH_QUERY_LATENCY.md +134 -0
  63. package/runbooks/alert-response/LSP_DAEMON_RESTART.md +118 -0
  64. package/runbooks/alert-response/SECURITY_ANOMALY.md +160 -0
  65. package/runbooks/dashboard-guide.md +169 -0
  66. package/scripts/lint-dashboards.js +184 -0
  67. package/tempo/tempo-datasource.yaml +46 -0
  68. package/tempo/tempo-values.yaml +94 -0
  69. package/tests/alert-threshold-config.test.ts +283 -0
  70. package/tests/log-schema-validation.test.ts +246 -0
  71. package/tests/metric-label-validation.test.ts +292 -0
  72. package/tests/otel-pipeline-integration.test.ts +420 -0
  73. package/tests/security-events.test.ts +417 -0
  74. package/tsconfig.json +17 -0
  75. package/vitest.config.ts +21 -0
  76. package/vitest.integration.config.ts +9 -0
@@ -0,0 +1,128 @@
# Alert Runbook: Analysis Backlog

> **Alert Name:** `AnalysisBacklogCritical` / `AnalysisBacklogWarning`
> **Severity:** Critical / Warning
> **Module:** M02 — Analysis Engine

---

## Alert Definitions

```yaml
# Warning
expr: ecip_analysis_backlog_size{module="M02"} > 500
for: 15m

# Critical
expr: ecip_analysis_backlog_size{module="M02"} > 1000
for: 10m
```

---

## Impact

- Code analysis jobs are queuing faster than they can be processed
- Repository updates are delayed
- Users see stale code intelligence results
- If backlog grows indefinitely: eventual disk/memory pressure

---

## Triage Steps

### 1. Check Analysis Throughput Dashboard

Open **Grafana → ECIP → Analysis Throughput**. Look for:
- Current backlog size and trend
- Analysis throughput (analyses per minute) — has it dropped?
- Error rate — are analyses failing?
- Active analyses — are workers busy or idle?

### 2. Check LSP Daemon Health

Analysis throughput depends on healthy LSP daemons. Open **Grafana → ECIP → LSP Daemon Health**.
- If daemon restarts are elevated: see the [LSP_DAEMON_RESTART runbook](./LSP_DAEMON_RESTART.md)
- If OOM kills are occurring: daemons are crashing before completing work

### 3. Check Event Bus

Is the backlog caused by a flood of events?

```bash
# Check Kafka lag for analysis consumer group
kubectl exec -n ecip kafka-0 -- kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group ecip-analysis-consumers \
  --describe
```

### 4. Check for Large Repository Events

```bash
kubectl logs -n ecip deployment/ecip-analysis-engine --tail=200 \
  | grep -E "files_to_analyze|large_repo|backlog"
```

A single large repository push can flood the queue.

---

## Common Causes & Fixes

### LSP Daemon Failures

**Symptoms:** `LSPDaemonRestartRate` also firing, analysis throughput dropped.

**Fix:**
1. Resolve LSP daemon issues first (see [LSP_DAEMON_RESTART.md](./LSP_DAEMON_RESTART.md))
2. Backlog will drain naturally once daemons stabilize

### Large Repository Push

**Symptoms:** Sudden backlog spike correlating with a specific large repo event.

**Fix:**
1. This may be normal — a large monorepo push creates many analysis jobs
2. Monitor: backlog should drain at normal throughput rate
3. If one repo is blocking everything: consider priority queuing
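
A quick sanity check on step 2: the expected drain time is just the current backlog divided by the steady-state throughput shown on the dashboard. A sketch (the two values below are placeholders, not real readings):

```shell
# Estimate how long the backlog should take to drain at current throughput.
# Read both values off the Analysis Throughput dashboard.
BACKLOG=1200       # current ecip_analysis_backlog_size
THROUGHPUT=40      # completed analyses per minute
awk -v b="$BACKLOG" -v t="$THROUGHPUT" \
  'BEGIN { printf "estimated drain time: %.1f minutes\n", b / t }'
# → estimated drain time: 30.0 minutes
```

If the observed drain is much slower than this estimate, workers are not running at their usual throughput and the backlog is not the only problem.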

### Insufficient Worker Capacity

**Symptoms:** Workers are all busy, throughput is at capacity, but event rate exceeds processing rate.

**Fix:**
1. Scale workers: `kubectl scale deployment -n ecip ecip-analysis-engine --replicas=N`
2. Check HPA: is it already at max replicas?
3. If at max: increase HPA maxReplicas or add nodes
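
For the HPA check, a quick way to see whether it is already pinned at its ceiling, assuming the HPA is named after the deployment (adjust the name to your environment):

```bash
# REPLICAS equal to MAXPODS, with TARGETS still over the threshold,
# means the HPA is saturated and cannot scale further
kubectl get hpa -n ecip ecip-analysis-engine
kubectl describe hpa -n ecip ecip-analysis-engine | grep -A4 Conditions
```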

### Downstream Failure (Knowledge Store)

**Symptoms:** Analysis completes but writing results to M03 fails, causing retries.

**Fix:**
1. Check M03 Knowledge Store health
2. Check write latency on Cache Performance dashboard
3. Fix M03 issue → retries succeed → backlog drains

---

## Escalation

- **Warning (> 500, 15min):** M02 on-call investigates
- **Critical (> 1000, 10min):** Page M02 on-call via PagerDuty
- **If backlog growing > 100/min:** Consider pausing non-critical analysis (PR branches)
- **If correlated with LSP issues:** Also engage Platform team
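
The "> 100/min" growth criterion can be evaluated directly in Prometheus. A sketch using the backlog gauge from this runbook (`deriv` returns a per-second slope, hence the `* 60`):

```promql
# Backlog growth rate in items per minute (positive = growing)
deriv(ecip_analysis_backlog_size{module="M02"}[10m]) * 60
```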

---

## Related Alerts

- `LSPDaemonRestartRate` — Often the root cause
- `LSPDaemonOOMKill` — Memory pressure variant
- `DLQDepthExceeded` — If event processing is part of the failure chain
- `KnowledgeStoreWriteLatencyHigh` — Downstream bottleneck

---

*Last updated: March 2026 · Platform Team*
@@ -0,0 +1,150 @@
# Alert Runbook: DLQ Depth Exceeded

> **Alert Name:** `DLQDepthExceeded` (critical) / `DLQDepthWarning` (warning)
> **Severity:** Critical / Warning
> **Module:** M07 — Event Bus

---

## Alert Definitions

```yaml
# Warning
expr: ecip_dlq_depth{module="M07"} > 10
for: 10m

# Critical
expr: ecip_dlq_depth{module="M07"} > 100
for: 5m
```

Also related:

```yaml
# DLQ Message Age
expr: ecip_dlq_oldest_message_age_seconds{module="M07"} > 3600
for: 5m
```

---

## Impact

- Failed events are not being processed
- Repositories may not be indexed or updated
- Stale data in M03 Knowledge Store
- Downstream: queries return outdated results

---

## Triage Steps

### 1. Check Event Bus DLQ Dashboard

Open **Grafana → ECIP → Event Bus DLQ**. Look for:
- Current DLQ depth and trend
- Oldest message age (> 1hr is concerning)
- DLQ ingestion rate (how fast are messages arriving?)
- Processing lag for main topics

### 2. Check Kafka Consumer Lag

```bash
kubectl exec -n ecip kafka-0 -- kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group ecip-event-processors \
  --describe
```

### 3. Inspect DLQ Messages

```bash
kubectl exec -n ecip kafka-0 -- kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic ecip.dlq \
  --from-beginning \
  --max-messages 5 \
  --property print.headers=true
```

Look at the `error` header to determine why messages were sent to DLQ.

### 4. Check Event Bus Service Logs

```bash
kubectl logs -n ecip deployment/ecip-event-bus --tail=200 \
  | grep -E "error|dlq|dead.letter|retry.exhausted"
```

---

## Common Causes & Fixes

### Schema Validation Failure

**Symptoms:** DLQ messages have `error: schema_validation_failed` header.

**Fix:**
1. Check if a recent schema change was deployed without updating consumers
2. Verify Schema Registry has the correct version
3. If schema was updated: update consumers and replay DLQ messages

### Downstream Service Unavailable

**Symptoms:** DLQ messages have `error: downstream_timeout` or `connection_refused`.

**Fix:**
1. Check the target service health (M02 Analysis Engine, M03 Knowledge Store)
2. If the service is down: fix the service, then replay DLQ
3. If the service is overloaded: scale it up, then replay

### Poison Messages

**Symptoms:** Same message(s) repeatedly failing, growing DLQ depth slowly.

**Fix:**
1. Inspect the failing messages for malformed data
2. If a specific repository event is broken, manually acknowledge and skip it
3. File a bug for the source that produced the malformed event
+
109
+ ### Kafka Broker Issues
110
+
111
+ **Symptoms:** Multiple consumer groups affected, not just ECIP.
112
+
113
+ **Fix:**
114
+ 1. Check Kafka broker health: `kubectl get pods -n kafka`
115
+ 2. Check broker disk usage and partition balance
116
+ 3. Escalate to Infrastructure/Kafka team
117
+
118
+ ---
119
+
120
+ ## Replaying DLQ Messages
121
+
122
+ After fixing the root cause, replay DLQ messages:
123
+
124
+ ```bash
125
+ # Using the M07 replay tool
126
+ kubectl exec -n ecip deployment/ecip-event-bus -- \
127
+ node scripts/replay-dlq.js --topic ecip.dlq --batch-size 50
128
+ ```
129
+
130
+ Monitor the DLQ depth dashboard to confirm depth decreasing.
131
+
132
+ ---
133
+
134
+ ## Escalation
135
+
136
+ - **Warning (> 10 messages):** Investigate within 30 minutes
137
+ - **Critical (> 100 messages):** Page M07 on-call
138
+ - **If age > 4hr:** Escalate to P1 — data staleness affecting users
139
+ - **If Kafka infrastructure issue:** Page Infrastructure on-call
140
+
141
+ ---
142
+
143
+ ## Related Alerts
144
+
145
+ - `DLQMessageAgeHigh` — Oldest message > 1 hour
146
+ - `AnalysisBacklogCritical` — May correlate if M02 events failing
147
+
148
+ ---
149
+
150
+ *Last updated: March 2026 · Platform Team*
@@ -0,0 +1,134 @@
# Alert Runbook: High Query Latency

> **Alert Name:** `QueryLatencySLABreach`
> **Severity:** Critical
> **Module:** M04 — Query Service
> **SLA:** p95 < 2000ms

---

## Alert Definition

```yaml
expr: >
  histogram_quantile(0.95,
    sum(rate(ecip_query_duration_ms_bucket{module="M04"}[5m])) by (le)
  ) > 1500
for: 5m
```

Fires when p95 query latency exceeds 1500ms for 5 consecutive minutes. The threshold is set deliberately below the 2000ms SLA to allow time for investigation before the SLA is actually breached.

---

## Impact

- Users experience slow code intelligence responses
- IDE integrations may time out
- SLA breach if not resolved within the `for` window

---

## Triage Steps

### 1. Check the Query Latency Dashboard

Open **Grafana → ECIP → Query Latency**. Look for:
- Is p95 above 1500ms? Which percentile is affected?
- Is this a gradual trend or a sudden spike?
- Is the request rate higher than normal? (capacity issue vs. performance regression)
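
To separate a capacity problem from a regression, compare the current request rate with the same window a week earlier. A sketch, assuming the conventional `_count` series that accompanies the `ecip_query_duration_ms_bucket` histogram:

```promql
# Ratio of the current request rate to the rate one week ago.
# A ratio near 1 with elevated latency points at a regression, not load.
sum(rate(ecip_query_duration_ms_count{module="M04"}[5m]))
  /
sum(rate(ecip_query_duration_ms_count{module="M04"}[5m] offset 1w))
```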

### 2. Check Downstream Dependencies

Query latency is often caused by slow dependencies:

| Dependency | Check | Dashboard |
|---|---|---|
| M03 Knowledge Store | Cache hit rate, write latency | Cache Performance |
| M05 MCP Server | MCP call latency | MCP Call Graph |
| M06 Registry | RBAC check latency | (Prometheus direct) |

```promql
# Check cache hit rate
ecip_cache_hit_rate{module="M03"}

# Check MCP call latency
histogram_quantile(0.95, sum(rate(ecip_mcp_call_duration_ms_bucket[5m])) by (le))

# Check FilterAuthorizedRepos latency
histogram_quantile(0.95, sum(rate(ecip_filter_authorized_repos_duration_ms_bucket[5m])) by (le))
```

### 3. Check Traces in Tempo

1. Open **Grafana → Explore → Tempo**
2. Search for traces with `service.name = ecip-query-service` and duration > 1500ms
3. Look at the span waterfall — which span is the bottleneck?

### 4. Check Pod Resources

```bash
kubectl top pods -n ecip -l app=ecip-query-service
```

If CPU or memory is near limits, this is a capacity issue.

---

## Common Causes & Fixes

### Cache Miss Storm

**Symptoms:** Cache hit rate drops below 85%, query latency rises proportionally.

**Fix:**
1. Check if cache was recently flushed (deployment, Redis restart)
2. Monitor cache warm-up — latency should recover as cache repopulates
3. If persistent: check for cache eviction due to memory pressure
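
Eviction under memory pressure can be confirmed on Redis itself. A sketch, assuming `redis-0` is a cache pod name in your environment; `evicted_keys` climbing between runs means the cache is shedding entries:

```bash
kubectl exec -n ecip redis-0 -- redis-cli INFO stats | grep evicted_keys
kubectl exec -n ecip redis-0 -- redis-cli INFO memory \
  | grep -E "used_memory_human|maxmemory_human|maxmemory_policy"
```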

### MCP Fan-out Explosion

**Symptoms:** Cross-repo queries with high fan-out depth (> 5 repos).

**Fix:**
1. Check `cross-repo-fanout` dashboard for depth distribution
2. If specific queries cause excessive fan-out, investigate query patterns
3. Consider circuit breaker tuning in M04

### FilterAuthorizedRepos Slow

**Symptoms:** RBAC filtering taking > 20ms (NFR-SEC-011 SLA breach).

**Fix:**
1. Check M06 Registry Service health
2. Verify RBAC cache is populated
3. Escalate to M06 team if their latency is the bottleneck

### Capacity Issue

**Symptoms:** Request rate is significantly higher than baseline, CPU/memory near limits.

**Fix:**
1. Scale horizontally: `kubectl scale deployment -n ecip ecip-query-service --replicas=N`
2. Review HPA configuration for M04

---

## Escalation

- **Critical:** Pages M04 on-call via PagerDuty automatically
- **If caused by M03 cache:** Escalate to M03 team
- **If caused by M06 RBAC:** Escalate to M06 team
- **If capacity issue:** Engage Platform team for scaling

---

## Related Alerts

- `CacheHitRateDegraded` — Often precedes query latency spikes
- `FilterAuthorizedReposLatency` — RBAC check SLA
- `MCPCallLatencyWarn` — MCP tool call slowness

---

*Last updated: March 2026 · Platform Team*
@@ -0,0 +1,118 @@
# Alert Runbook: LSP Daemon Restart Rate

> **Alert Name:** `LSPDaemonRestartRate`
> **Severity:** Warning
> **Module:** M02 — Analysis Engine
> **for:** 0m (fires immediately — M02-specific tuning)

---

## Alert Definition

```yaml
expr: increase(ecip_lsp_daemon_restarts_total{module="M02"}[5m]) > 3
for: 0m
```

Fires when LSP daemon restarts exceed 3 per 5-minute window. The `for: 0m` is intentional — LSP daemon instability cascades immediately to analysis failures and should be investigated without delay.

---

## Impact

- Active code analyses may fail or produce incomplete results
- Analysis backlog grows while daemons restart
- Downstream: M04 queries may return stale data if re-indexing is blocked

---

## Triage Steps

### 1. Check the LSP Daemon Health Dashboard

Open **Grafana → ECIP → LSP Daemon Health**. Look for:
- Restart rate spike pattern (steady increase vs. sudden burst)
- OOM kill correlation (if OOM kills are rising, go to the OOM Kill section below)
- Memory trend before restarts

### 2. Check Pod Events

```bash
kubectl get events -n ecip --field-selector involvedObject.name=ecip-analysis-engine \
  --sort-by='.lastTimestamp' | head -20
```

Look for:
- `OOMKilled` — memory limit exceeded
- `CrashLoopBackOff` — repeated crash-restart cycles
- `Unhealthy` — failed health checks

### 3. Check Pod Logs

```bash
kubectl logs -n ecip deployment/ecip-analysis-engine -c lsp-daemon --tail=200 \
  | grep -E "error|fatal|panic|OOM"
```

### 4. Check Recent Deployments

```bash
kubectl rollout history -n ecip deployment/ecip-analysis-engine
```

If a recent deploy correlates with the restart spike → consider rollback.
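
If rollback is the call, the standard mechanism (add `--to-revision=N` with a revision number from the history above to target a specific release):

```bash
kubectl rollout undo -n ecip deployment/ecip-analysis-engine
kubectl rollout status -n ecip deployment/ecip-analysis-engine
```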

---

## Common Causes & Fixes

### OOM Kill

**Symptoms:** `LSPDaemonOOMKill` alert also firing, `OOMKilled` in pod events.

**Fix:**
1. Check if a specific file type is causing excessive memory usage
2. If memory usage is generally trending up, increase the memory limit:
   ```bash
   kubectl set resources -n ecip deployment/ecip-analysis-engine \
     -c lsp-daemon --limits=memory=2Gi
   ```
3. Long-term: file an issue for the M02 team to investigate the memory leak

### Large File Processing

**Symptoms:** Restarts correlate with analysis of specific large repositories.

**Fix:**
1. Check analysis queue for unusually large repos
2. Consider adding file-size limits in M02 configuration
3. Check if `max_file_size_bytes` is properly set in M02 config

### Language Server Crash

**Symptoms:** Clean exit codes (no OOM), restarts correlate with specific file types.

**Fix:**
1. Check which language server is crashing (TypeScript, Python, Go, etc.)
2. Check for known issues in the language server version
3. Update the language server image if a fix is available

---

## Escalation

- **If restarts > 10/5min:** Page the M02 on-call engineer
- **If associated with `AnalysisBacklogCritical`:** Escalate to P1
- **If OOM and no obvious fix:** Engage Platform team for memory profiling

---

## Related Alerts

- `LSPDaemonOOMKill` — Fires on OOM kill events specifically
- `AnalysisBacklogCritical` — May fire as a downstream effect
- `AnalysisBacklogWarning` — Early warning of backlog buildup

---

*Last updated: March 2026 · Platform Team*
@@ -0,0 +1,160 @@
# Alert Runbook: Security Anomaly

> **Alert Names:** `SecurityAuthBurst`, `SecurityRBACDenialBurst`, `ServiceAuthFailure`
> **Severity:** Critical
> **Module:** All (Security cross-cutting)
> **Notification:** Slack #ecip-security + PagerDuty

---

## Alert Definitions

```yaml
# Auth burst — possible brute force
SecurityAuthBurst:
  expr: sum(increase(ecip_auth_failures_total[5m])) > 10
  for: 0m  # Fires immediately

# RBAC denial burst — possible privilege escalation attempt
SecurityRBACDenialBurst:
  expr: sum(increase(ecip_rbac_denials_total[5m])) > 20
  for: 0m

# Service auth failure — inter-service mTLS/token issues
ServiceAuthFailure:
  expr: rate(ecip_service_auth_failures_total[5m]) > 0
  for: 2m
```

All security alerts fire to the `#ecip-security` Slack channel **and** PagerDuty.

---

## Impact

- **Auth burst:** Possible credential stuffing or brute-force attack
- **RBAC denial burst:** Possible unauthorized access attempt or misconfigured permissions
- **Service auth failure:** Inter-service communication broken (mTLS cert expired, token invalid)

---

## Triage Steps

### 1. Check Security Events Dashboard

Open **Grafana → ECIP → Security Events**. This dashboard queries **Elasticsearch** (not Prometheus).

Look for:
- Auth failure rate and source IP distribution
- RBAC denial patterns — same user? same resource? same action?
- Geographic anomalies in auth failures

### 2. Query Elasticsearch Directly

```bash
# Recent auth failures
curl -s "http://elasticsearch.monitoring:9200/ecip-security-events/_search" \
  -H "Content-Type: application/json" -d '{
  "query": { "bool": { "must": [
    { "term": { "event.type": "authentication_failure" }},
    { "range": { "@timestamp": { "gte": "now-30m" }}}
  ]}},
  "sort": [{ "@timestamp": "desc" }],
  "size": 20
}'

# Recent RBAC denials
curl -s "http://elasticsearch.monitoring:9200/ecip-security-events/_search" \
  -H "Content-Type: application/json" -d '{
  "query": { "bool": { "must": [
    { "term": { "event.type": "rbac_denial" }},
    { "range": { "@timestamp": { "gte": "now-30m" }}}
  ]}},
  "sort": [{ "@timestamp": "desc" }],
  "size": 20
}'
```

### 3. Identify Patterns

For auth failures:
- **Same source IP:** Possible brute force → consider IP block
- **Same user (hashed):** Possible credential stuffing → force password reset
- **Distributed IPs, same user:** Possible botnet → escalate to security team

For RBAC denials:
- **Same user, various resources:** User recently lost permissions → check with access management
- **Various users, same resource:** Resource permissions recently changed → check recent RBAC config changes
- **Single user, escalating actions:** Possible privilege escalation → escalate immediately
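
These patterns can be surfaced in a single query with terms aggregations instead of eyeballing individual hits. A sketch; `source.ip` and `user.id_hash` are assumptions about the event schema, so substitute the actual field names from the index template:

```bash
curl -s "http://elasticsearch.monitoring:9200/ecip-security-events/_search" \
  -H "Content-Type: application/json" -d '{
  "size": 0,
  "query": { "bool": { "must": [
    { "term": { "event.type": "authentication_failure" }},
    { "range": { "@timestamp": { "gte": "now-30m" }}}
  ]}},
  "aggs": {
    "by_source_ip": { "terms": { "field": "source.ip", "size": 10 }},
    "by_user":      { "terms": { "field": "user.id_hash", "size": 10 }}
  }
}'
```

A single bucket dominating `by_source_ip` points at brute force from one host; one dominant `by_user` bucket across many IPs points at credential stuffing.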

---

## Response Actions

### SecurityAuthBurst

1. **If concentrated from single IP:**
   ```bash
   # Check source IP in recent events
   # If malicious, block at WAF/API Gateway level
   ```

2. **If distributed attack:** Engage security team; consider rate limit reduction at M01

3. **If false positive (e.g., misconfigured CI/CD):**
   - Identify the automated system
   - Fix its authentication configuration
   - Verify auth failures stop

### SecurityRBACDenialBurst

1. **Check recent permission changes:**
   - Query M06 Registry Service audit log
   - If permissions were recently revoked: expected behavior; adjust alert threshold if too sensitive

2. **If no recent changes:** Possible unauthorized access attempt
   - Identify the user(s) involved
   - Check if the denied actions are suspicious (e.g., accessing repos they've never accessed before)
   - Engage security team if pattern suggests enumeration

### ServiceAuthFailure

1. **Check certificate expiry:**
   ```bash
   kubectl get secrets -n ecip -l type=tls -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.cert-manager\.io/expiry}{"\n"}{end}'
   ```

2. **Check service account tokens:**
   ```bash
   kubectl get serviceaccounts -n ecip
   ```

3. **If cert expired:** Renew immediately; an expired cert blocks all inter-service communication

---

## Escalation

- **All security alerts:** Notify security team immediately in `#ecip-security`
- **Auth burst > 50/5min:** Page security on-call
- **RBAC burst with escalation pattern:** P1 security incident
- **Service auth failure:** Page Platform on-call (infrastructure issue)
- **Any confirmed attack:** Follow the organization's incident response procedure

---

## Important Notes

- **User IDs in security events are SHA-256 hashed.** You cannot directly identify users from Elasticsearch data. Correlation requires access to the identity provider.
- **Security events use a dedicated pipeline** (OTel Collector → Elasticsearch). They do NOT flow through the general logs pipeline.
- **Never disable these alerts** without security team approval, even during maintenance windows.

---

## Related Alerts

- `GRPCRequestLatencyHigh` — May correlate if auth checks are causing slowdowns
- `QueryLatencySLABreach` — May be downstream effect of auth issues

---

*Last updated: March 2026 · Platform Team · Security Team*