ecip-observability-stack 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +48 -0
- package/README.md +75 -0
- package/alerts/analysis-backlog.yaml +39 -0
- package/alerts/cache-degradation.yaml +44 -0
- package/alerts/dlq-depth.yaml +56 -0
- package/alerts/lsp-daemon.yaml +43 -0
- package/alerts/mcp-latency.yaml +46 -0
- package/alerts/security-anomaly.yaml +59 -0
- package/alerts/sla-latency.yaml +61 -0
- package/chaos/kafka-broker-restart.sh +168 -0
- package/chaos/kill-lsp-daemon.sh +148 -0
- package/chaos/redis-node-failure.sh +318 -0
- package/ci/check-observability-contract.js +285 -0
- package/ci/eslint-plugin-ecip/index.js +209 -0
- package/ci/eslint-plugin-ecip/package.json +12 -0
- package/ci/github-actions-observability-gate.yaml +180 -0
- package/ci/ruff-shared.toml +41 -0
- package/collector/otel-collector-config.yaml +226 -0
- package/collector/otel-collector-daemonset.yaml +168 -0
- package/collector/sampling-config.yaml +83 -0
- package/dashboards/_provisioning/grafana-dashboards.yaml +16 -0
- package/dashboards/analysis-throughput.json +166 -0
- package/dashboards/cache-performance.json +129 -0
- package/dashboards/cross-repo-fanout.json +93 -0
- package/dashboards/event-bus-dlq.json +129 -0
- package/dashboards/lsp-daemon-health.json +104 -0
- package/dashboards/mcp-call-graph.json +114 -0
- package/dashboards/query-latency.json +160 -0
- package/dashboards/security-events.json +131 -0
- package/docs/M08-Observability-Design.md +639 -0
- package/docs/PROGRESS.md +375 -0
- package/docs/module-documentation.md +64 -0
- package/elasticsearch/ilm-policy.json +57 -0
- package/elasticsearch/index-template.json +62 -0
- package/elasticsearch/kibana-space.yaml +53 -0
- package/helm/Chart.yaml +30 -0
- package/helm/templates/configmaps.yaml +25 -0
- package/helm/templates/elasticsearch.yaml +68 -0
- package/helm/templates/grafana-secret.yaml +22 -0
- package/helm/templates/grafana.yaml +19 -0
- package/helm/templates/loki.yaml +33 -0
- package/helm/templates/otel-collector.yaml +119 -0
- package/helm/templates/prometheus.yaml +43 -0
- package/helm/templates/tempo.yaml +16 -0
- package/helm/values.prod.yaml +159 -0
- package/helm/values.yaml +146 -0
- package/logging-lib/nodejs/package.json +57 -0
- package/logging-lib/nodejs/pnpm-lock.yaml +4576 -0
- package/logging-lib/python/pyproject.toml +45 -0
- package/logging-lib/python/src/__init__.py +19 -0
- package/logging-lib/python/src/logger.py +131 -0
- package/logging-lib/python/src/security_events.py +150 -0
- package/logging-lib/python/src/tracer.py +185 -0
- package/logging-lib/python/tests/test_logger.py +113 -0
- package/package.json +21 -0
- package/prometheus/prometheus-values.yaml +170 -0
- package/prometheus/recording-rules.yaml +97 -0
- package/prometheus/scrape-configs.yaml +122 -0
- package/runbooks/SDK-INTEGRATION.md +239 -0
- package/runbooks/alert-response/ANALYSIS_BACKLOG.md +128 -0
- package/runbooks/alert-response/DLQ_DEPTH_EXCEEDED.md +150 -0
- package/runbooks/alert-response/HIGH_QUERY_LATENCY.md +134 -0
- package/runbooks/alert-response/LSP_DAEMON_RESTART.md +118 -0
- package/runbooks/alert-response/SECURITY_ANOMALY.md +160 -0
- package/runbooks/dashboard-guide.md +169 -0
- package/scripts/lint-dashboards.js +184 -0
- package/tempo/tempo-datasource.yaml +46 -0
- package/tempo/tempo-values.yaml +94 -0
- package/tests/alert-threshold-config.test.ts +283 -0
- package/tests/log-schema-validation.test.ts +246 -0
- package/tests/metric-label-validation.test.ts +292 -0
- package/tests/otel-pipeline-integration.test.ts +420 -0
- package/tests/security-events.test.ts +417 -0
- package/tsconfig.json +17 -0
- package/vitest.config.ts +21 -0
- package/vitest.integration.config.ts +9 -0
@@ -0,0 +1,128 @@

# Alert Runbook: Analysis Backlog

> **Alert Name:** `AnalysisBacklogCritical` / `AnalysisBacklogWarning`
> **Severity:** Critical / Warning
> **Module:** M02 — Analysis Engine

---

## Alert Definitions

```yaml
# Warning
expr: ecip_analysis_backlog_size{module="M02"} > 500
for: 15m

# Critical
expr: ecip_analysis_backlog_size{module="M02"} > 1000
for: 10m
```

---

## Impact

- Code analysis jobs are queuing faster than they can be processed
- Repository updates are delayed
- Users see stale code intelligence results
- If backlog grows indefinitely: eventual disk/memory pressure

---

## Triage Steps

### 1. Check Analysis Throughput Dashboard

Open **Grafana → ECIP → Analysis Throughput**. Look for:
- Current backlog size and trend
- Analysis throughput (analyses per minute) — has it dropped?
- Error rate — are analyses failing?
- Active analyses — are workers busy or idle?

### 2. Check LSP Daemon Health

Analysis throughput depends on healthy LSP daemons. Open **Grafana → ECIP → LSP Daemon Health**.
- If daemon restarts are elevated: see [LSP_DAEMON_RESTART runbook](./LSP_DAEMON_RESTART.md)
- If OOM kills are occurring: daemons are crashing before completing work

### 3. Check Event Bus

Is the backlog caused by a flood of events?

```bash
# Check Kafka lag for analysis consumer group
kubectl exec -n ecip kafka-0 -- kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group ecip-analysis-consumers \
  --describe
```

### 4. Check for Large Repository Events

```bash
kubectl logs -n ecip deployment/ecip-analysis-engine --tail=200 \
  | grep -E "files_to_analyze|large_repo|backlog"
```

A single large repository push can flood the queue.
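To quantify how fast the backlog is growing, you can query Prometheus from the CLI. This is a sketch: the `prometheus.monitoring:9090` endpoint is an assumption for your cluster, while the metric name comes from the alert definition above.

```shell
# Backlog growth rate over the last 10 minutes, in items/second.
# deriv() is appropriate because ecip_analysis_backlog_size is a gauge.
# NOTE: the Prometheus service address below is a hypothetical example.
curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=deriv(ecip_analysis_backlog_size{module="M02"}[10m])' \
  | jq -r '.data.result[] | "\(.metric.instance) growth/s=\(.value[1])"'
```

A sustained positive growth rate confirms ingestion is outpacing processing; a negative rate means the backlog is already draining.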

---

## Common Causes & Fixes

### LSP Daemon Failures

**Symptoms:** `LSPDaemonRestartRate` also firing, analysis throughput dropped.

**Fix:**
1. Resolve LSP daemon issues first (see [LSP_DAEMON_RESTART.md](./LSP_DAEMON_RESTART.md))
2. Backlog will drain naturally once daemons stabilize

### Large Repository Push

**Symptoms:** Sudden backlog spike correlating with a specific large repo event.

**Fix:**
1. This may be normal — a large monorepo push creates many analysis jobs
2. Monitor: backlog should drain at normal throughput rate
3. If one repo is blocking everything: consider priority queuing

### Insufficient Worker Capacity

**Symptoms:** Workers are all busy, throughput is at capacity, but event rate exceeds processing rate.

**Fix:**
1. Scale workers: `kubectl scale deployment -n ecip ecip-analysis-engine --replicas=N`
2. Check HPA: is it already at max replicas?
3. If at max: increase HPA maxReplicas or add nodes
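The HPA check can be sketched as follows (this assumes the HPA shares the deployment's name, `ecip-analysis-engine`; adjust if your HPA is named differently):

```shell
# Current vs. min/max replicas for the analysis engine's autoscaler.
# The HPA name below is an assumption; list all HPAs with `kubectl get hpa -n ecip`.
kubectl get hpa -n ecip ecip-analysis-engine
kubectl describe hpa -n ecip ecip-analysis-engine \
  | grep -E "Min replicas|Max replicas|Deployment pods"
```

If "Deployment pods" already equals "Max replicas", scaling requires raising `maxReplicas` (and possibly adding nodes) rather than another `kubectl scale`.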

### Downstream Failure (Knowledge Store)

**Symptoms:** Analysis completes but writing results to M03 fails, causing retries.

**Fix:**
1. Check M03 Knowledge Store health
2. Check write latency on the Cache Performance dashboard
3. Fix the M03 issue → retries succeed → backlog drains

---

## Escalation

- **Warning (> 500 for 15 min):** M02 on-call investigates
- **Critical (> 1000 for 10 min):** Page M02 on-call via PagerDuty
- **If the backlog is growing by > 100/min:** Consider pausing non-critical analysis (PR branches)
- **If correlated with LSP issues:** Also engage the Platform team

---

## Related Alerts

- `LSPDaemonRestartRate` — often the root cause
- `LSPDaemonOOMKill` — memory-pressure variant
- `DLQDepthExceeded` — if event processing is part of the failure chain
- `KnowledgeStoreWriteLatencyHigh` — downstream bottleneck

---

*Last updated: March 2026 · Platform Team*
@@ -0,0 +1,150 @@

# Alert Runbook: DLQ Depth Exceeded

> **Alert Name:** `DLQDepthExceeded` (critical) / `DLQDepthWarning` (warning)
> **Severity:** Critical / Warning
> **Module:** M07 — Event Bus

---

## Alert Definitions

```yaml
# Warning
expr: ecip_dlq_depth{module="M07"} > 10
for: 10m

# Critical
expr: ecip_dlq_depth{module="M07"} > 100
for: 5m
```

Also related:
```yaml
# DLQ Message Age
expr: ecip_dlq_oldest_message_age_seconds{module="M07"} > 3600
for: 5m
```

---

## Impact

- Failed events are not being processed
- Repositories may not be indexed or updated
- Stale data in M03 Knowledge Store
- Downstream: queries return outdated results

---

## Triage Steps

### 1. Check Event Bus DLQ Dashboard

Open **Grafana → ECIP → Event Bus DLQ**. Look for:
- Current DLQ depth and trend
- Oldest message age (> 1hr is concerning)
- DLQ ingestion rate (how fast are messages arriving?)
- Processing lag for main topics

### 2. Check Kafka Consumer Lag

```bash
kubectl exec -n ecip kafka-0 -- kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group ecip-event-processors \
  --describe
```

### 3. Inspect DLQ Messages

```bash
kubectl exec -n ecip kafka-0 -- kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic ecip.dlq \
  --from-beginning \
  --max-messages 5 \
  --property print.headers=true
```

Look at the `error` header to determine why messages were sent to DLQ.

### 4. Check Event Bus Service Logs

```bash
kubectl logs -n ecip deployment/ecip-event-bus --tail=200 \
  | grep -E "error|dlq|dead.letter|retry.exhausted"
```

---

## Common Causes & Fixes

### Schema Validation Failure

**Symptoms:** DLQ messages have `error: schema_validation_failed` header.

**Fix:**
1. Check if a recent schema change was deployed without updating consumers
2. Verify Schema Registry has the correct version
3. If schema was updated: update consumers and replay DLQ messages

### Downstream Service Unavailable

**Symptoms:** DLQ messages have `error: downstream_timeout` or `connection_refused`.

**Fix:**
1. Check the target service health (M02 Analysis Engine, M03 Knowledge Store)
2. If the service is down: fix the service, then replay DLQ
3. If the service is overloaded: scale it up, then replay

### Poison Messages

**Symptoms:** Same message(s) repeatedly failing, growing DLQ depth slowly.

**Fix:**
1. Inspect the failing messages for malformed data
2. If a specific repository event is broken, manually acknowledge and skip it
3. File a bug for the source that produced the malformed event

### Kafka Broker Issues

**Symptoms:** Multiple consumer groups affected, not just ECIP.

**Fix:**
1. Check Kafka broker health: `kubectl get pods -n kafka`
2. Check broker disk usage and partition balance
3. Escalate to Infrastructure/Kafka team
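Steps 1–2 can be sketched like this (the `/var/lib/kafka` data directory and `kafka-0` pod name are assumptions; adjust for your Kafka deployment):

```shell
# Broker disk usage on a Kafka pod (data directory path is an assumption):
kubectl exec -n kafka kafka-0 -- df -h /var/lib/kafka

# Partitions missing in-sync replicas — a common sign of broker trouble:
kubectl exec -n kafka kafka-0 -- kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
```

Empty output from the second command means replication is healthy; any listed partitions point at a struggling broker.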

---

## Replaying DLQ Messages

After fixing the root cause, replay the DLQ messages:

```bash
# Using the M07 replay tool
kubectl exec -n ecip deployment/ecip-event-bus -- \
  node scripts/replay-dlq.js --topic ecip.dlq --batch-size 50
```

Monitor the DLQ depth dashboard to confirm the depth is decreasing.
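If you would rather watch from the terminal while the replay runs, you can poll the same metric via the Prometheus HTTP API. A minimal sketch, assuming Prometheus is reachable at `prometheus.monitoring:9090` (a hypothetical address):

```shell
# Poll ecip_dlq_depth every 30 seconds; Ctrl-C to stop.
while true; do
  curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
    --data-urlencode 'query=ecip_dlq_depth{module="M07"}' \
    | jq -r '.data.result[] | "\(.metric.instance) depth=\(.value[1])"'
  sleep 30
done
```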

---

## Escalation

- **Warning (> 10 messages):** Investigate within 30 minutes
- **Critical (> 100 messages):** Page M07 on-call
- **If the oldest message age > 4hr:** Escalate to P1 — data staleness is affecting users
- **If it is a Kafka infrastructure issue:** Page Infrastructure on-call

---

## Related Alerts

- `DLQMessageAgeHigh` — oldest message > 1 hour
- `AnalysisBacklogCritical` — may correlate if M02 events are failing

---

*Last updated: March 2026 · Platform Team*
@@ -0,0 +1,134 @@

# Alert Runbook: High Query Latency

> **Alert Name:** `QueryLatencySLABreach`
> **Severity:** Critical
> **Module:** M04 — Query Service
> **SLA:** p95 < 2000ms

---

## Alert Definition

```yaml
expr: >
  histogram_quantile(0.95,
    sum(rate(ecip_query_duration_ms_bucket{module="M04"}[5m])) by (le)
  ) > 1500
for: 5m
```

Fires when p95 query latency exceeds 1500ms for 5 consecutive minutes. The threshold is set at 1500ms (below the 2000ms SLA) to allow time for investigation before the SLA is breached.

---

## Impact

- Users experience slow code intelligence responses
- IDE integrations may time out
- SLA breach (p95 > 2000ms) if the regression is not resolved quickly

---

## Triage Steps

### 1. Check the Query Latency Dashboard

Open **Grafana → ECIP → Query Latency**. Look for:
- Is p95 above 1500ms? Which percentile is affected?
- Is this a gradual trend or a sudden spike?
- Is the request rate higher than normal? (capacity issue vs. performance regression)

### 2. Check Downstream Dependencies

Query latency is often caused by slow dependencies:

| Dependency | Check | Dashboard |
|---|---|---|
| M03 Knowledge Store | Cache hit rate, write latency | Cache Performance |
| M05 MCP Server | MCP call latency | MCP Call Graph |
| M06 Registry | RBAC check latency | (Prometheus direct) |

```promql
# Check cache hit rate
ecip_cache_hit_rate{module="M03"}

# Check MCP call latency
histogram_quantile(0.95, sum(rate(ecip_mcp_call_duration_ms_bucket[5m])) by (le))

# Check FilterAuthorizedRepos latency
histogram_quantile(0.95, sum(rate(ecip_filter_authorized_repos_duration_ms_bucket[5m])) by (le))
```

### 3. Check Traces in Tempo

1. Open **Grafana → Explore → Tempo**
2. Search for traces with `service.name = ecip-query-service` and duration > 1500ms
3. Look at the span waterfall — which span is the bottleneck?

### 4. Check Pod Resources

```bash
kubectl top pods -n ecip -l app=ecip-query-service
```

If CPU or memory is near limits, this is a capacity issue.
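To confirm a CPU ceiling specifically, check whether the containers are being throttled. This sketch assumes cAdvisor/kubelet metrics are scraped and that Prometheus is reachable at `prometheus.monitoring:9090` (both assumptions; the metric name itself is standard cAdvisor):

```shell
# Per-pod CPU throttling rate for the query service over the last 5 minutes.
# A sustained nonzero rate means the pods are hitting their CPU limits.
curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=rate(container_cpu_cfs_throttled_periods_total{namespace="ecip", pod=~"ecip-query-service.*"}[5m])' \
  | jq -r '.data.result[] | "\(.metric.pod) throttled/s=\(.value[1])"'
```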

---

## Common Causes & Fixes

### Cache Miss Storm

**Symptoms:** Cache hit rate drops below 85%, query latency rises proportionally.

**Fix:**
1. Check if cache was recently flushed (deployment, Redis restart)
2. Monitor cache warm-up — latency should recover as cache repopulates
3. If persistent: check for cache eviction due to memory pressure
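The eviction check can be done directly against Redis (the `redis-0` pod name is an assumption; adjust for your deployment):

```shell
# A growing evicted_keys counter means the cache is shedding entries
# under memory pressure rather than expiring them normally.
kubectl exec -n ecip redis-0 -- redis-cli INFO stats | grep evicted_keys
kubectl exec -n ecip redis-0 -- redis-cli INFO memory \
  | grep -E "used_memory_human|maxmemory_human|maxmemory_policy"
```

If `used_memory_human` is at `maxmemory_human` and the policy is an eviction policy (e.g. `allkeys-lru`), the hit-rate drop is memory pressure, not a flush.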

### MCP Fan-out Explosion

**Symptoms:** Cross-repo queries with high fan-out depth (> 5 repos).

**Fix:**
1. Check `cross-repo-fanout` dashboard for depth distribution
2. If specific queries cause excessive fan-out, investigate query patterns
3. Consider circuit breaker tuning in M04

### FilterAuthorizedRepos Slow

**Symptoms:** RBAC filtering taking > 20ms (NFR-SEC-011 SLA breach).

**Fix:**
1. Check M06 Registry Service health
2. Verify RBAC cache is populated
3. Escalate to M06 team if their latency is the bottleneck

### Capacity Issue

**Symptoms:** Request rate is significantly higher than baseline, CPU/memory near limits.

**Fix:**
1. Scale horizontally: `kubectl scale deployment -n ecip ecip-query-service --replicas=N`
2. Review HPA configuration for M04

---

## Escalation

- **Critical:** Pages M04 on-call via PagerDuty automatically
- **If caused by M03 cache:** Escalate to M03 team
- **If caused by M06 RBAC:** Escalate to M06 team
- **If capacity issue:** Engage Platform team for scaling

---

## Related Alerts

- `CacheHitRateDegraded` — Often precedes query latency spikes
- `FilterAuthorizedReposLatency` — RBAC check SLA
- `MCPCallLatencyWarn` — MCP tool call slowness

---

*Last updated: March 2026 · Platform Team*
@@ -0,0 +1,118 @@

# Alert Runbook: LSP Daemon Restart Rate

> **Alert Name:** `LSPDaemonRestartRate`
> **Severity:** Warning
> **Module:** M02 — Analysis Engine
> **for:** 0m (fires immediately — M02-specific tuning)

---

## Alert Definition

```yaml
expr: increase(ecip_lsp_daemon_restarts_total{module="M02"}[5m]) > 3
for: 0m
```

Fires when LSP daemon restarts exceed 3 in a 5-minute window. The `for: 0m` is intentional — LSP daemon instability cascades immediately into analysis failures and should be investigated without delay.

---

## Impact

- Active code analyses may fail or produce incomplete results
- Analysis backlog grows while daemons restart
- Downstream: M04 queries may return stale data if re-indexing is blocked

---

## Triage Steps

### 1. Check the LSP Daemon Health Dashboard

Open **Grafana → ECIP → LSP Daemon Health**. Look for:
- Restart rate spike pattern (steady increase vs. sudden burst)
- OOM kill correlation (if OOM kills are rising, see "OOM Kill" below)
- Memory trend before restarts

### 2. Check Pod Events

```bash
kubectl get events -n ecip --field-selector involvedObject.name=ecip-analysis-engine \
  --sort-by='.lastTimestamp' | head -20
```

Look for:
- `OOMKilled` — memory limit exceeded
- `CrashLoopBackOff` — repeated crash-restart cycles
- `Unhealthy` — failed health checks

### 3. Check Pod Logs

```bash
kubectl logs -n ecip deployment/ecip-analysis-engine -c lsp-daemon --tail=200 \
  | grep -E "error|fatal|panic|OOM"
```

### 4. Check Recent Deployments

```bash
kubectl rollout history -n ecip deployment/ecip-analysis-engine
```

If a recent deploy correlates with the restart spike, consider a rollback.
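If a rollback is warranted, it can be done in place with standard kubectl (the deployment name comes from the steps above):

```shell
# Roll back to the previous revision (use --to-revision=N to pin a
# specific revision from the rollout history above):
kubectl rollout undo -n ecip deployment/ecip-analysis-engine

# Wait for the rollback to complete before re-checking the restart rate:
kubectl rollout status -n ecip deployment/ecip-analysis-engine
```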

---

## Common Causes & Fixes

### OOM Kill

**Symptoms:** `LSPDaemonOOMKill` alert also firing, `OOMKilled` in pod events.

**Fix:**
1. Check if a specific file type is causing excessive memory usage
2. If memory usage is generally trending up, increase the memory limit:
   ```bash
   kubectl set resources -n ecip deployment/ecip-analysis-engine \
     -c lsp-daemon --limits=memory=2Gi
   ```
3. Long-term: file an issue for the M02 team to investigate the memory leak

### Large File Processing

**Symptoms:** Restarts correlate with analysis of specific large repositories.

**Fix:**
1. Check analysis queue for unusually large repos
2. Consider adding file-size limits in M02 configuration
3. Check if `max_file_size_bytes` is properly set in M02 config

### Language Server Crash

**Symptoms:** Clean exit codes (no OOM), restarts correlate with specific file types.

**Fix:**
1. Check which language server is crashing (TypeScript, Python, Go, etc.)
2. Check for known issues in the language server version
3. Update the language server image if a fix is available

---

## Escalation

- **If restarts > 10/5min:** Page the M02 on-call engineer
- **If associated with `AnalysisBacklogCritical`:** Escalate to P1
- **If OOM and no obvious fix:** Engage Platform team for memory profiling

---

## Related Alerts

- `LSPDaemonOOMKill` — Fires on OOM kill events specifically
- `AnalysisBacklogCritical` — May fire as a downstream effect
- `AnalysisBacklogWarning` — Early warning of backlog buildup

---

*Last updated: March 2026 · Platform Team*
@@ -0,0 +1,160 @@

# Alert Runbook: Security Anomaly

> **Alert Names:** `SecurityAuthBurst`, `SecurityRBACDenialBurst`, `ServiceAuthFailure`
> **Severity:** Critical
> **Module:** All (Security cross-cutting)
> **Notification:** Slack #ecip-security + PagerDuty

---

## Alert Definitions

```yaml
# Auth burst — possible brute force
SecurityAuthBurst:
  expr: sum(rate(ecip_auth_failures_total[5m])) > 10
  for: 0m  # Fires immediately

# RBAC denial burst — possible privilege escalation attempt
SecurityRBACDenialBurst:
  expr: sum(rate(ecip_rbac_denials_total[5m])) > 20
  for: 0m

# Service auth failure — inter-service mTLS/token issues
ServiceAuthFailure:
  expr: rate(ecip_service_auth_failures_total[5m]) > 0
  for: 2m
```

All security alerts fire to the `#ecip-security` Slack channel **and** PagerDuty.

---

## Impact

- **Auth burst:** Possible credential stuffing or brute-force attack
- **RBAC denial burst:** Possible unauthorized access attempt or misconfigured permissions
- **Service auth failure:** Inter-service communication broken (mTLS cert expired, token invalid)

---

## Triage Steps

### 1. Check Security Events Dashboard

Open **Grafana → ECIP → Security Events**. This dashboard queries **Elasticsearch** (not Prometheus).

Look for:
- Auth failure rate and source IP distribution
- RBAC denial patterns — same user? same resource? same action?
- Geographic anomalies in auth failures

### 2. Query Elasticsearch Directly

```bash
# Recent auth failures
curl -s "http://elasticsearch.monitoring:9200/ecip-security-events/_search" \
  -H "Content-Type: application/json" -d '{
    "query": { "bool": { "must": [
      { "term": { "event.type": "authentication_failure" }},
      { "range": { "@timestamp": { "gte": "now-30m" }}}
    ]}},
    "sort": [{ "@timestamp": "desc" }],
    "size": 20
  }'

# Recent RBAC denials
curl -s "http://elasticsearch.monitoring:9200/ecip-security-events/_search" \
  -H "Content-Type: application/json" -d '{
    "query": { "bool": { "must": [
      { "term": { "event.type": "rbac_denial" }},
      { "range": { "@timestamp": { "gte": "now-30m" }}}
    ]}},
    "sort": [{ "@timestamp": "desc" }],
    "size": 20
  }'
```

### 3. Identify Patterns

For auth failures:
- **Same source IP:** Possible brute force → consider IP block
- **Same user (hashed):** Possible credential stuffing → force password reset
- **Distributed IPs, same user:** Possible botnet → escalate to security team

For RBAC denials:
- **Same user, various resources:** User recently lost permissions → check with access management
- **Various users, same resource:** Resource permissions recently changed → check recent RBAC config changes
- **Single user, escalating actions:** Possible privilege escalation → escalate immediately
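A terms aggregation makes the "same source IP?" question answerable in one query. This sketch assumes the events carry a `source.ip` keyword field — field names may differ in your index mapping:

```shell
# Top source IPs among auth failures in the last 30 minutes.
# "source.ip" is an assumed field name; check the index mapping if empty.
curl -s "http://elasticsearch.monitoring:9200/ecip-security-events/_search" \
  -H "Content-Type: application/json" -d '{
    "size": 0,
    "query": { "bool": { "must": [
      { "term": { "event.type": "authentication_failure" }},
      { "range": { "@timestamp": { "gte": "now-30m" }}}
    ]}},
    "aggs": { "by_ip": { "terms": { "field": "source.ip", "size": 10 }}}
  }'
```

A single bucket dominating the counts points at brute force from one IP; an even spread suggests a distributed source.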

---

## Response Actions

### SecurityAuthBurst

1. **If concentrated from a single IP:** Identify the source IP in recent events; if malicious, block it at the WAF/API Gateway level

2. **If a distributed attack:** Engage the security team, consider rate limit reduction at M01

3. **If a false positive (e.g., misconfigured CI/CD):**
   - Identify the automated system
   - Fix its authentication configuration
   - Verify auth failures stop

### SecurityRBACDenialBurst

1. **Check recent permission changes:**
   - Query the M06 Registry Service audit log
   - If permissions were recently revoked: expected behavior, adjust the alert threshold if too sensitive

2. **If no recent changes:** Possible unauthorized access attempt
   - Identify the user(s) involved
   - Check if the denied actions are suspicious (e.g., accessing repos they've never accessed before)
   - Engage the security team if the pattern suggests enumeration

### ServiceAuthFailure

1. **Check certificate expiry:**
   ```bash
   kubectl get secrets -n ecip -l type=tls -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.cert-manager\.io/expiry}{"\n"}{end}'
   ```

2. **Check service account tokens:**
   ```bash
   kubectl get serviceaccounts -n ecip
   ```

3. **If a cert has expired:** Renew it immediately; an expired cert blocks all inter-service communication
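If the expiry annotation is missing, you can read the date straight from the certificate itself. The secret name below is a hypothetical placeholder; substitute one of the names listed by the command in step 1:

```shell
# Decode a TLS secret and print the certificate's expiry (notAfter) date.
SECRET=ecip-query-service-tls   # hypothetical secret name — substitute your own
kubectl get secret -n ecip "$SECRET" -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -enddate
```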

---

## Escalation

- **All security alerts:** Notify security team immediately in `#ecip-security`
- **Auth burst > 50/5min:** Page security on-call
- **RBAC burst with escalation pattern:** P1 security incident
- **Service auth failure:** Page Platform on-call (infrastructure issue)
- **Any confirmed attack:** Follow the organization's incident response procedure

---

## Important Notes

- **User IDs in security events are SHA-256 hashed.** You cannot directly identify users from Elasticsearch data. Correlation requires access to the identity provider.
- **Security events use a dedicated pipeline** (OTel Collector → Elasticsearch). They do NOT flow through the general logs pipeline.
- **Never disable these alerts** without security team approval, even during maintenance windows.

---

## Related Alerts

- `GRPCRequestLatencyHigh` — May correlate if auth checks are causing slowdowns
- `QueryLatencySLABreach` — May be downstream effect of auth issues

---

*Last updated: March 2026 · Platform Team · Security Team*