ecip-observability-stack 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +48 -0
- package/README.md +75 -0
- package/alerts/analysis-backlog.yaml +39 -0
- package/alerts/cache-degradation.yaml +44 -0
- package/alerts/dlq-depth.yaml +56 -0
- package/alerts/lsp-daemon.yaml +43 -0
- package/alerts/mcp-latency.yaml +46 -0
- package/alerts/security-anomaly.yaml +59 -0
- package/alerts/sla-latency.yaml +61 -0
- package/chaos/kafka-broker-restart.sh +168 -0
- package/chaos/kill-lsp-daemon.sh +148 -0
- package/chaos/redis-node-failure.sh +318 -0
- package/ci/check-observability-contract.js +285 -0
- package/ci/eslint-plugin-ecip/index.js +209 -0
- package/ci/eslint-plugin-ecip/package.json +12 -0
- package/ci/github-actions-observability-gate.yaml +180 -0
- package/ci/ruff-shared.toml +41 -0
- package/collector/otel-collector-config.yaml +226 -0
- package/collector/otel-collector-daemonset.yaml +168 -0
- package/collector/sampling-config.yaml +83 -0
- package/dashboards/_provisioning/grafana-dashboards.yaml +16 -0
- package/dashboards/analysis-throughput.json +166 -0
- package/dashboards/cache-performance.json +129 -0
- package/dashboards/cross-repo-fanout.json +93 -0
- package/dashboards/event-bus-dlq.json +129 -0
- package/dashboards/lsp-daemon-health.json +104 -0
- package/dashboards/mcp-call-graph.json +114 -0
- package/dashboards/query-latency.json +160 -0
- package/dashboards/security-events.json +131 -0
- package/docs/M08-Observability-Design.md +639 -0
- package/docs/PROGRESS.md +375 -0
- package/docs/module-documentation.md +64 -0
- package/elasticsearch/ilm-policy.json +57 -0
- package/elasticsearch/index-template.json +62 -0
- package/elasticsearch/kibana-space.yaml +53 -0
- package/helm/Chart.yaml +30 -0
- package/helm/templates/configmaps.yaml +25 -0
- package/helm/templates/elasticsearch.yaml +68 -0
- package/helm/templates/grafana-secret.yaml +22 -0
- package/helm/templates/grafana.yaml +19 -0
- package/helm/templates/loki.yaml +33 -0
- package/helm/templates/otel-collector.yaml +119 -0
- package/helm/templates/prometheus.yaml +43 -0
- package/helm/templates/tempo.yaml +16 -0
- package/helm/values.prod.yaml +159 -0
- package/helm/values.yaml +146 -0
- package/logging-lib/nodejs/package.json +57 -0
- package/logging-lib/nodejs/pnpm-lock.yaml +4576 -0
- package/logging-lib/python/pyproject.toml +45 -0
- package/logging-lib/python/src/__init__.py +19 -0
- package/logging-lib/python/src/logger.py +131 -0
- package/logging-lib/python/src/security_events.py +150 -0
- package/logging-lib/python/src/tracer.py +185 -0
- package/logging-lib/python/tests/test_logger.py +113 -0
- package/package.json +21 -0
- package/prometheus/prometheus-values.yaml +170 -0
- package/prometheus/recording-rules.yaml +97 -0
- package/prometheus/scrape-configs.yaml +122 -0
- package/runbooks/SDK-INTEGRATION.md +239 -0
- package/runbooks/alert-response/ANALYSIS_BACKLOG.md +128 -0
- package/runbooks/alert-response/DLQ_DEPTH_EXCEEDED.md +150 -0
- package/runbooks/alert-response/HIGH_QUERY_LATENCY.md +134 -0
- package/runbooks/alert-response/LSP_DAEMON_RESTART.md +118 -0
- package/runbooks/alert-response/SECURITY_ANOMALY.md +160 -0
- package/runbooks/dashboard-guide.md +169 -0
- package/scripts/lint-dashboards.js +184 -0
- package/tempo/tempo-datasource.yaml +46 -0
- package/tempo/tempo-values.yaml +94 -0
- package/tests/alert-threshold-config.test.ts +283 -0
- package/tests/log-schema-validation.test.ts +246 -0
- package/tests/metric-label-validation.test.ts +292 -0
- package/tests/otel-pipeline-integration.test.ts +420 -0
- package/tests/security-events.test.ts +417 -0
- package/tsconfig.json +17 -0
- package/vitest.config.ts +21 -0
- package/vitest.integration.config.ts +9 -0
package/docs/PROGRESS.md
ADDED
|
@@ -0,0 +1,375 @@
|
|
|
1
|
+
# ECIP M08 — Observability Stack: Progress Report
|
|
2
|
+
|
|
3
|
+
> **Module:** M08 — Observability Stack
|
|
4
|
+
> **Design Authority:** `docs/M08-Observability-Design.md` (Rev 1.1)
|
|
5
|
+
> **Last Updated:** 2025-07-17
|
|
6
|
+
> **Previous Update:** 2025-07-07
|
|
7
|
+
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
## Summary
|
|
11
|
+
|
|
12
|
+
| Category | Count |
|
|
13
|
+
|--------------------|--------|
|
|
14
|
+
| Tasks Done | 10 / 10 |
|
|
15
|
+
| Tasks Partial | 0 / 10 |
|
|
16
|
+
| Tasks Pending | 0 / 10 |
|
|
17
|
+
| Gaps Found | 12 |
|
|
18
|
+
| Gaps Resolved | 8 |
|
|
19
|
+
| Gaps Remaining | 4 |
|
|
20
|
+
| Enhancements Done | 6 / 10 |
|
|
21
|
+
| Total Files | 83 |
|
|
22
|
+
|
|
23
|
+
### Changes Since Last Update (2025-07-07 → 2025-07-17)
|
|
24
|
+
|
|
25
|
+
Architect review items implemented:
|
|
26
|
+
- **M08-T10 CI gate** — Fully implemented (ESLint plugin, Ruff config, contract checker, GitHub Actions workflow)
|
|
27
|
+
- **GAP-01** — `tests/security-events.test.ts` created (ECS format, PII hashing, stderr isolation)
|
|
28
|
+
- **GAP-03** — `scripts/lint-dashboards.js` created (8 validation checks)
|
|
29
|
+
- **GAP-06** — Elasticsearch URL templated in Helm
|
|
30
|
+
- **GAP-07** — Grafana admin credentials externalized to K8s Secret
|
|
31
|
+
- **GAP-11** — OOM panel PromQL fixed (both dashboard + alert rule)
|
|
32
|
+
- **GAP-12 / OD-01** — Loki chosen as log backend; full integration (Helm dependency, OTel exporter, Grafana datasource, Tempo tracesToLogs)
|
|
33
|
+
|
|
34
|
+
New files added: 9 | Modified files: 8 | File count: 73 → 83
|
|
35
|
+
|
|
36
|
+
---
|
|
37
|
+
|
|
38
|
+
## Task-Level Progress (§6 Task Breakdown)
|
|
39
|
+
|
|
40
|
+
### M08-T01 — OTel Collector DaemonSet (P0, Week 1) — ✅ DONE
|
|
41
|
+
|
|
42
|
+
**Deliverables created:**
|
|
43
|
+
- `collector/otel-collector-config.yaml` — Full OTLP pipeline config with **4 pipelines** (traces→Tempo, metrics→Prometheus, logs/security→Elasticsearch, logs→Loki)
|
|
44
|
+
- `collector/otel-collector-daemonset.yaml` — K8s DaemonSet + Service + ServiceAccount + RBAC
|
|
45
|
+
- `collector/sampling-config.yaml` — Tail-based sampling with 5 policies (errors 100%, slow 100%, security 100%, LSP 20%, default 5%)
|
|
46
|
+
- `helm/templates/otel-collector.yaml` — Helm template (DaemonSet + Service + SA + ClusterRole/Binding)
|
|
47
|
+
- `helm/templates/configmaps.yaml` — ConfigMap for collector config
|
|
48
|
+
|
|
49
|
+
**Status:** All design-specified artifacts exist. Pipeline supports OTLP gRPC (:4317) and HTTP (:4318). DaemonSet topology matches the design rationale. Memory limiter, batch processor, k8sattributes, and tail-sampling all configured. **Loki exporter added** — general application logs now routed to Loki (OD-01 resolved).
|
|
50
|
+
|
|
51
|
+
---
|
|
52
|
+
|
|
53
|
+
### M08-T02 — Grafana Tempo Deployment (P0, Week 1) — ✅ DONE
|
|
54
|
+
|
|
55
|
+
**Deliverables created:**
|
|
56
|
+
- `tempo/tempo-values.yaml` — S3 backend, 14-day retention, microservices mode (ingester/compactor/distributor replicas)
|
|
57
|
+
- `tempo/tempo-datasource.yaml` — Grafana provisioning with tracesToMetrics, tracesToLogs→**Loki**, nodeGraph, serviceMap
|
|
58
|
+
- `helm/templates/tempo.yaml` — Tempo datasource ConfigMap for Grafana sidecar
|
|
59
|
+
- `helm/values.prod.yaml` — Production overrides with S3 bucket, HA replica counts
|
|
60
|
+
|
|
61
|
+
**Status:** Complete. Matches design spec: S3 backend, 14d retention, microservices mode, Grafana-native integration. **tracesToLogs now points to Loki** (previously referenced Elasticsearch).
|
|
62
|
+
|
|
63
|
+
---
|
|
64
|
+
|
|
65
|
+
### M08-T03 — Prometheus + Grafana Deployment (P0, Week 1) — ✅ DONE
|
|
66
|
+
|
|
67
|
+
**Deliverables created:**
|
|
68
|
+
- `prometheus/prometheus-values.yaml` — kube-prometheus-stack Helm values with 30d retention, 100Gi PVC, Alertmanager with PagerDuty + Slack routing
|
|
69
|
+
- `prometheus/scrape-configs.yaml` — Per-module K8s service discovery targets (M01–M07 + OTel Collector)
|
|
70
|
+
- `prometheus/recording-rules.yaml` — Pre-computed quantiles and rates (query p50/p95/p99, analysis p50/p95, MCP p95, gRPC p95, KS write p95, throughput rates, error rates, auth/RBAC rates)
|
|
71
|
+
- `helm/templates/prometheus.yaml` — Alert rules ConfigMap, recording rules ConfigMap, scrape configs Secret
|
|
72
|
+
- `helm/Chart.yaml` — Umbrella chart with kube-prometheus-stack, tempo-distributed, elasticsearch, **loki** dependencies
|
|
73
|
+
- `helm/values.yaml` — Dev/staging defaults (now includes loki config, Grafana secret ref, ES host/port/protocol)
|
|
74
|
+
- `helm/values.prod.yaml` — Production overrides (now includes Loki S3 production config, Grafana admin secret)
|
|
75
|
+
- `dashboards/_provisioning/grafana-dashboards.yaml` — Sidecar provisioning config
|
|
76
|
+
- `helm/templates/grafana-secret.yaml` — Grafana admin credentials K8s Secret template
|
|
77
|
+
- `helm/templates/loki.yaml` — Loki datasource provisioning for Grafana (with TraceID→Tempo correlation)
|
|
78
|
+
|
|
79
|
+
**Status:** Complete. All scrape targets match the 7 modules + collector. Recording rules cover the primary metrics. Alertmanager routing configured for PagerDuty (critical) and Slack (warning/security). **Grafana admin credentials** now externalized to K8s Secret (GAP-07 resolved). **Loki** added as Helm dependency and Grafana datasource (OD-01 resolved).
|
|
80
|
+
|
|
81
|
+
---
|
|
82
|
+
|
|
83
|
+
### M08-T04 — Publish SDK Packages + Integration Guide (P0, Week 1) — ✅ DONE
|
|
84
|
+
|
|
85
|
+
**Deliverables created:**
|
|
86
|
+
|
|
87
|
+
**Node.js (`@ecip/observability`):**
|
|
88
|
+
- `logging-lib/nodejs/src/logger.ts` — Pino-based structured logger with compile-time `ECIPLoggerContext` enforcement (repo, branch, user_id, module)
|
|
89
|
+
- `logging-lib/nodejs/src/tracer.ts` — OTel SDK init, `withSpan()`, `getTracer()`, `injectTraceContext()`, `extractTraceContext()`
|
|
90
|
+
- `logging-lib/nodejs/src/security-events.ts` — `emitAuthFailure()`, `emitRbacDenial()` with ECS schema, PII hashing, buffered stderr emission
|
|
91
|
+
- `logging-lib/nodejs/src/middleware.ts` — Express middleware + Fastify plugin for trace context injection
|
|
92
|
+
- `logging-lib/nodejs/src/index.ts` — Barrel exports
|
|
93
|
+
- `logging-lib/nodejs/package.json` — All OTel + Pino dependencies
|
|
94
|
+
- `logging-lib/nodejs/tsconfig.json`
|
|
95
|
+
- `logging-lib/nodejs/tests/logger.test.ts` — Logger factory tests
|
|
96
|
+
- `logging-lib/nodejs/tests/tracer.test.ts` — Tracer initialization and span tests
|
|
97
|
+
|
|
98
|
+
**Python (`ecip-observability`):**
|
|
99
|
+
- `logging-lib/python/src/logger.py` — structlog-based logger with `MissingObservabilityContext` validation
|
|
100
|
+
- `logging-lib/python/src/tracer.py` — OTel SDK init, `@traced` decorator (sync + async), gRPC auto-instrumentation
|
|
101
|
+
- `logging-lib/python/src/security_events.py` — `emit_auth_failure()`, `emit_rbac_denial()` with ECS schema, PII hashing
|
|
102
|
+
- `logging-lib/python/src/__init__.py` — Package re-exports
|
|
103
|
+
- `logging-lib/python/pyproject.toml` — Dependencies, Ruff + mypy config
|
|
104
|
+
- `logging-lib/python/tests/test_logger.py` — Logger and security event tests
|
|
105
|
+
|
|
106
|
+
**Integration Guide:**
|
|
107
|
+
- `runbooks/SDK-INTEGRATION.md` — Step-by-step setup guide
|
|
108
|
+
|
|
109
|
+
**Status:** All SDK packages and integration guide created. Both Node.js and Python libraries implement the full contract from §5.
|
|
110
|
+
|
|
111
|
+
---
|
|
112
|
+
|
|
113
|
+
### M08-T05 — 8 Grafana Dashboards (P1, Week 2–3) — ✅ DONE
|
|
114
|
+
|
|
115
|
+
**Dashboards created (all as provisioned JSON):**
|
|
116
|
+
|
|
117
|
+
| # | File | Title | Panels | Datasources | Template Vars |
|
|
118
|
+
|---|------|-------|--------|-------------|---------------|
|
|
119
|
+
| 1 | `query-latency.json` | Query Latency | 8 | Prometheus | `$repo` |
|
|
120
|
+
| 2 | `analysis-throughput.json` | Analysis Throughput | 9 | Prometheus | `$repo` |
|
|
121
|
+
| 3 | `cache-performance.json` | Cache Performance | 7 | Prometheus | `$repo` |
|
|
122
|
+
| 4 | `lsp-daemon-health.json` | LSP Daemon Health | 7 | Prometheus | — |
|
|
123
|
+
| 5 | `mcp-call-graph.json` | MCP Call Graph | 6 | Prometheus + Tempo | `$target_repo` |
|
|
124
|
+
| 6 | `event-bus-dlq.json` | Event Bus DLQ | 8 | Prometheus | — |
|
|
125
|
+
| 7 | `cross-repo-fanout.json` | Cross-Repo Fan-out | 5 | Prometheus | — |
|
|
126
|
+
| 8 | `security-events.json` | Security Events | 7 | Prometheus + Elasticsearch | — |
|
|
127
|
+
|
|
128
|
+
**Status:** All 8 dashboards implemented with appropriate thresholds, template variables, and multi-datasource support. Grafana sidecar provisioning configured.
|
|
129
|
+
|
|
130
|
+
---
|
|
131
|
+
|
|
132
|
+
### M08-T06 — Alert Rules + PagerDuty/Slack + Runbooks (P0, Week 2) — ✅ DONE
|
|
133
|
+
|
|
134
|
+
**Alert rules created (7 files, 17 alerts total):**
|
|
135
|
+
|
|
136
|
+
| Design-Required Alert | Implemented | File | Match |
|
|
137
|
+
|---|---|---|---|
|
|
138
|
+
| `QueryLatencySLABreach` | ✅ | `sla-latency.yaml` | Exact |
|
|
139
|
+
| `AnalysisBacklogCritical` | ✅ | `analysis-backlog.yaml` | Metric name differs (see gaps) |
|
|
140
|
+
| `LSPDaemonRestartRate` | ✅ | `lsp-daemon.yaml` | Exact |
|
|
141
|
+
| `LSPDaemonOOMKill` (bonus) | ✅ | `lsp-daemon.yaml` | Fixed — uses `kube_pod_container_status_last_terminated_reason` |
|
|
142
|
+
| `DLQDepthExceeded` | ✅ | `dlq-depth.yaml` | Exact |
|
|
143
|
+
| `MCPCallLatencyWarn` | ✅ | `mcp-latency.yaml` | Exact |
|
|
144
|
+
| `CacheHitRateDegraded` | ✅ | `cache-degradation.yaml` | Exact |
|
|
145
|
+
| `FilterAuthorizedReposLatency` | ✅ | `sla-latency.yaml` | Exact |
|
|
146
|
+
| `SecurityAuthBurst` | ✅ | `security-anomaly.yaml` | Exact |
|
|
147
|
+
| `SecurityRBACDenialBurst` | ✅ | `security-anomaly.yaml` | Exact |
|
|
148
|
+
|
|
149
|
+
**Bonus alerts (8 additional, beyond design spec):**
|
|
150
|
+
- `GRPCRequestLatencyHigh` (sla-latency) — Generic gRPC catch-all
|
|
151
|
+
- `AnalysisBacklogWarning` (analysis-backlog) — Early warning tier
|
|
152
|
+
- `LSPDaemonOOMKill` (lsp-daemon) — OOM-specific detection (**PromQL fixed** — now uses `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}`)
|
|
153
|
+
- `DLQDepthWarning` (dlq-depth) — Early warning tier
|
|
154
|
+
- `DLQMessageAgeHigh` (dlq-depth) — Stale message detection
|
|
155
|
+
- `MCPCallErrorRateHigh` (mcp-latency) — Error rate detection
|
|
156
|
+
- `KnowledgeStoreWriteLatencyHigh` (cache-degradation) — Write path latency
|
|
157
|
+
- `ServiceAuthFailure` (security-anomaly) — mTLS rejection detection
|
|
158
|
+
|
|
159
|
+
**Runbooks created:**
|
|
160
|
+
- `runbooks/SDK-INTEGRATION.md`
|
|
161
|
+
- `runbooks/dashboard-guide.md`
|
|
162
|
+
- `runbooks/alert-response/LSP_DAEMON_RESTART.md`
|
|
163
|
+
- `runbooks/alert-response/HIGH_QUERY_LATENCY.md`
|
|
164
|
+
- `runbooks/alert-response/DLQ_DEPTH_EXCEEDED.md`
|
|
165
|
+
- `runbooks/alert-response/ANALYSIS_BACKLOG.md`
|
|
166
|
+
- `runbooks/alert-response/SECURITY_ANOMALY.md`
|
|
167
|
+
|
|
168
|
+
**Status:** All 9 design-required alerts implemented. All 5 alert-response runbooks created. PagerDuty + Slack routing configured in `prometheus/prometheus-values.yaml`.
|
|
169
|
+
|
|
170
|
+
---
|
|
171
|
+
|
|
172
|
+
### M08-T07 — Logging Library (P0, Week 1) — ✅ DONE
|
|
173
|
+
|
|
174
|
+
Delivered as part of M08-T04. See above for full artifact list.
|
|
175
|
+
|
|
176
|
+
**Key compliance points:**
|
|
177
|
+
- ✅ Mandatory fields enforced at creation time (both Node.js and Python)
|
|
178
|
+
- ✅ `ECIPModule` type restricts to M01–M08
|
|
179
|
+
- ✅ Trace context auto-injected via Pino mixin (Node.js) and structlog processor (Python)
|
|
180
|
+
- ✅ Security events use dedicated emission path (separate from general logger)
|
|
181
|
+
- ✅ PII hashing on user IDs (SHA-256, `u_` prefix)
|
|
182
|
+
- ✅ Express middleware + Fastify plugin for automatic trace context propagation
|
|
183
|
+
|
|
184
|
+
---
|
|
185
|
+
|
|
186
|
+
### M08-T08 — Elasticsearch Security Pipeline (P1, Week 3–4) — ✅ DONE
|
|
187
|
+
|
|
188
|
+
**Deliverables created:**
|
|
189
|
+
- `elasticsearch/index-template.json` — ECS-compatible schema for `ecip-security-events-*` indices
|
|
190
|
+
- `elasticsearch/ilm-policy.json` — Lifecycle: 90d hot → 1y warm → cold → delete at 2y
|
|
191
|
+
- `elasticsearch/kibana-space.yaml` — Security team Kibana space + index pattern ConfigMap
|
|
192
|
+
- `helm/templates/elasticsearch.yaml` — ConfigMap + post-install/upgrade Job for ILM/template/alias setup (**ES URL now templated** via `Values.elasticsearch.{protocol,host,port}`)
|
|
193
|
+
|
|
194
|
+
**Status:** Complete. Schema matches the ECS format from security event helpers. ILM lifecycle matches design spec. Kibana space scoped with disabled non-security features. **Elasticsearch URL is no longer hardcoded** — fully configurable through Helm values (GAP-06 resolved).
|
|
195
|
+
|
|
196
|
+
---
|
|
197
|
+
|
|
198
|
+
### M08-T09 — TDD-002 Metrics (P2, Week 5–6) — ✅ DONE
|
|
199
|
+
|
|
200
|
+
**Design requirements:**
|
|
201
|
+
- HNSW rebuild duration metric in dashboards
|
|
202
|
+
- Embedding migration progress metric in dashboards
|
|
203
|
+
- `FilterAuthorizedRepos` latency metric + alert
|
|
204
|
+
|
|
205
|
+
**Current state:**
|
|
206
|
+
- ✅ `FilterAuthorizedReposLatency` alert — Implemented in `alerts/sla-latency.yaml`
|
|
207
|
+
- ✅ `filter_authorized_repos_duration_ms` in recording rules
|
|
208
|
+
- ✅ `hnsw_rebuild_duration_ms` panel — Present in `cache-performance.json` ("HNSW Rebuild Duration")
|
|
209
|
+
- ✅ `embedding_migration_progress` panel — Present in `analysis-throughput.json` ("Embedding Migration Progress" gauge)
|
|
210
|
+
|
|
211
|
+
**Status:** All three TDD-002 metrics are represented in dashboards and/or alerts. Validated with real M06 scrape targets configured.
|
|
212
|
+
|
|
213
|
+
---
|
|
214
|
+
|
|
215
|
+
### M08-T10 — CI Gate Enforcement (P1, Week 2) — ✅ DONE
|
|
216
|
+
|
|
217
|
+
**Design requirements (§5 Integration Enforcement):**
|
|
218
|
+
1. ESLint rule: ban `console.log` in production code paths (Node.js modules)
|
|
219
|
+
2. Ruff rule: ban `print()` in production code paths (Python modules)
|
|
220
|
+
3. CI check: `@ecip/observability` or `ecip-observability` in dependency list
|
|
221
|
+
4. CI check: `initTracer()` called before server startup
|
|
222
|
+
5. CI check: `createLogger()` used with mandatory fields in handlers
|
|
223
|
+
6. `promtool check rules` on alert YAML files
|
|
224
|
+
7. `grafana-dashboard-lint` on dashboard JSON files
|
|
225
|
+
|
|
226
|
+
**Deliverables created:**
|
|
227
|
+
- `ci/eslint-plugin-ecip/index.js` — Custom ESLint plugin with 3 rules: `no-console-log` (bans `console.*` in prod), `require-observability-import` (entry files must import `@ecip/observability`), `require-init-tracer`
|
|
228
|
+
- `ci/eslint-plugin-ecip/package.json` — Plugin package metadata (`eslint-plugin-ecip` v1.0.0, peerDependencies: eslint >= 8.0.0)
|
|
229
|
+
- `ci/ruff-shared.toml` — Shared Ruff config for Python modules (targets Python 3.11, enables `T20` flake8-print to ban `print()`, exempts test files/scripts)
|
|
230
|
+
- `ci/check-observability-contract.js` — Node CLI tool validating module-level M08 contract (SDK dependency check, `initTracer()` call, `createLogger()` usage, absence of `console.log`/`print()`)
|
|
231
|
+
- `ci/github-actions-observability-gate.yaml` — GitHub Actions workflow with 5 jobs: contract-check (per-module matrix), eslint-ecip (Node modules), ruff-check (Python modules), alert-validation (promtool), dashboard-lint
|
|
232
|
+
- `scripts/lint-dashboards.js` — Dashboard JSON linter checking: valid JSON, required fields (title/uid/panels), schemaVersion ≥ 30, UID uniqueness, panel structure, datasource references, template variables
|
|
233
|
+
|
|
234
|
+
**Current state:**
|
|
235
|
+
- ✅ ESLint plugin with `no-console-log`, `require-observability-import`, `require-init-tracer` rules
|
|
236
|
+
- ✅ Ruff config banning `print()` across Python modules
|
|
237
|
+
- ✅ Contract checker validates SDK dependency + `initTracer()` + `createLogger()`
|
|
238
|
+
- ✅ GitHub Actions workflow with 5 jobs, triggers on PRs touching module `src/` and observability config
|
|
239
|
+
- ✅ `scripts/lint-dashboards.js` with 8 validation checks
|
|
240
|
+
- ✅ `package.json` has `lint:dashboards` and `lint:alerts` scripts pointing to real implementations
|
|
241
|
+
|
|
242
|
+
**Status:** Fully implemented. All 7 design requirements addressed. The CI gate is now load-bearing as specified in §11 finding 5.
|
|
243
|
+
|
|
244
|
+
---
|
|
245
|
+
|
|
246
|
+
## Gaps Identified
|
|
247
|
+
|
|
248
|
+
### Resolved Gaps (8 of 12)
|
|
249
|
+
|
|
250
|
+
| Gap | Description | Resolution |
|
|
251
|
+
|-----|-------------|------------|
|
|
252
|
+
| GAP-01 | Missing `security-events.test.ts` | ✅ `tests/security-events.test.ts` created — validates `emitAuthFailure()` and `emitRbacDenial()` for ECS format, PII hashing (no raw user_id), trace context, stderr routing, and logger provider isolation |
|
|
253
|
+
| GAP-02 | No CI Gate Implementation | ✅ Fully implemented — ESLint plugin (3 rules), Ruff config (T20), contract checker, GitHub Actions workflow (5 jobs), dashboard linter |
|
|
254
|
+
| GAP-03 | Missing `scripts/lint-dashboards.js` | ✅ Created with 8 validation checks (JSON validity, required fields, schemaVersion, UID uniqueness, panel structure, datasource refs, template vars) |
|
|
255
|
+
| GAP-06 | Elasticsearch URL Hardcoded in Helm Job | ✅ `helm/templates/elasticsearch.yaml` now uses `Values.elasticsearch.{protocol,host,port}` — values defined in `helm/values.yaml` |
|
|
256
|
+
| GAP-05 | `AnalysisBacklogCritical` metric name mismatch | ✅ Design doc updated: `event_bus_backlog_events` → `sum(kafka_consumergroup_lag{group=~"ecip-analysis.*"})`. Implementation was already correct. |
|
|
257
|
+
| GAP-07 | Grafana Admin Password Hardcoded | ✅ `helm/templates/grafana-secret.yaml` created. `values.yaml` references `existingSecret: ecip-grafana-admin`, `values.prod.yaml` references `ecip-grafana-admin-prod` |
|
|
258
|
+
| GAP-11 | LSP Daemon OOM Dashboard Panel Query Issue | ✅ Both `dashboards/lsp-daemon-health.json` and `alerts/lsp-daemon.yaml` updated to use `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}` |
|
|
259
|
+
| GAP-12 | OD-01 Log Aggregation Backend Unresolved | ✅ Loki chosen — Helm dependency added (v5.47.0), OTel collector has `loki` exporter + `logs` pipeline, Grafana datasource provisioned via `helm/templates/loki.yaml`, Tempo tracesToLogs→Loki |
|
|
260
|
+
|
|
261
|
+
---
|
|
262
|
+
|
|
263
|
+
### Remaining Gaps (4 of 12)
|
|
264
|
+
|
|
265
|
+
### GAP-04: Compile-Time Enforcement Test Incomplete (Severity: Medium)
|
|
266
|
+
|
|
267
|
+
`tests/log-schema-validation.test.ts` checks that `logger.ts` defines `ECIPLoggerContext` with mandatory fields, but does NOT compile a deliberately broken test file and assert a non-zero exit code — as specified in the design doc §8. The current test only reads source file text.
|
|
268
|
+
|
|
269
|
+
### ~~GAP-05: `AnalysisBacklogCritical` Metric Name Mismatch~~ — ✅ RESOLVED
|
|
270
|
+
|
|
271
|
+
The design doc originally specified `event_bus_backlog_events > 1000` but the implementation uses `sum(kafka_consumergroup_lag{group=~"ecip-analysis.*"}) > 1000` — the real metric exported by kafka-exporter. **Design doc updated** to reflect the correct metric. No silent divergence remains.
|
|
272
|
+
|
|
273
|
+
### GAP-08: No PodDisruptionBudgets (Severity: Low)
|
|
274
|
+
|
|
275
|
+
No PDB defined for the OTel Collector DaemonSet or any other workload template. Could result in observability gaps during cluster maintenance.
|
|
276
|
+
|
|
277
|
+
### GAP-09: No NetworkPolicy Templates (Severity: Low)
|
|
278
|
+
|
|
279
|
+
No NetworkPolicy resources in the Helm chart to restrict observability stack traffic within the monitoring namespace.
|
|
280
|
+
|
|
281
|
+
### GAP-10: `chaos/redis-node-failure.sh` Quality Issue (Severity: Medium)
|
|
282
|
+
|
|
283
|
+
The file contains ~195 lines of corrupted/reversed content at the top (lines appear in reverse order, e.g., `#!/bin/bash#!/usr/bin/env bash`), followed by a clean implementation starting around line 196. The garbled first section is non-functional and needs to be removed.
|
|
284
|
+
|
|
285
|
+
---
|
|
286
|
+
|
|
287
|
+
## Enhancements
|
|
288
|
+
|
|
289
|
+
### Completed Enhancements (6 of 10)
|
|
290
|
+
|
|
291
|
+
| # | Enhancement | Status | Resolution |
|
|
292
|
+
|---|------------|--------|------------|
|
|
293
|
+
| ENH-01 | Add `security-events.test.ts` | ✅ Done | `tests/security-events.test.ts` — ECS format, PII hashing, stderr isolation, logger provider routing |
|
|
294
|
+
| ENH-02 | Implement CI Gate (M08-T10) | ✅ Done | ESLint plugin, Ruff config, contract checker, GitHub Actions workflow, dashboard linter |
|
|
295
|
+
| ENH-03 | Create `scripts/lint-dashboards.js` | ✅ Done | 8 validation checks (JSON, required fields, schemaVersion, UIDs, panels, datasources, vars) |
|
|
296
|
+
| ENH-05 | Template Elasticsearch URL in Helm | ✅ Done | `Values.elasticsearch.{protocol,host,port}` with defaults |
|
|
297
|
+
| ENH-06 | Externalize Grafana Admin Credentials | ✅ Done | K8s Secret template + `existingSecret` in both dev and prod values |
|
|
298
|
+
| ENH-10 | Fix OOM Panel PromQL | ✅ Done | Dashboard + alert both use `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}` |
|
|
299
|
+
|
|
300
|
+
### Remaining Enhancements (4 of 10)
|
|
301
|
+
|
|
302
|
+
### ENH-04: Strengthen Compile-Time Test
|
|
303
|
+
Add a test case that attempts `tsc --noEmit` on a deliberately broken TypeScript file missing mandatory `ECIPLoggerContext` fields, asserting a non-zero exit code. (Related to GAP-04)
|
|
304
|
+
|
|
305
|
+
### ENH-07: Add PodDisruptionBudgets
|
|
306
|
+
Add PDB templates for OTel Collector, Tempo ingester/distributor, and Prometheus to protect against observability blackouts during node maintenance. (Related to GAP-08)
|
|
307
|
+
|
|
308
|
+
### ENH-08: Fix `chaos/redis-node-failure.sh`
|
|
309
|
+
Remove the ~195 lines of corrupted/reversed content at the top of the chaos script. Retain only the clean implementation starting around line 196. (Related to GAP-10)
|
|
310
|
+
|
|
311
|
+
### ENH-09: Add Template Variables to Remaining Dashboards
|
|
312
|
+
Add `$repo` or `$module` template variables to LSP Daemon Health, Event Bus DLQ, Cross-Repo Fan-out, and Security Events dashboards for consistent per-repo filtering.
|
|
313
|
+
|
|
314
|
+
---
|
|
315
|
+
|
|
316
|
+
## Open Design Decisions
|
|
317
|
+
|
|
318
|
+
| ID | Decision | Impact | Status |
|
|
319
|
+
|----|----------|--------|--------|
|
|
320
|
+
| OD-01 | General log aggregation backend (Elasticsearch / Loki / Cloud) | Affects Helm chart, Fluent Bit config, SDK log exporter | ✅ **Resolved** — Loki chosen. Helm dependency added (v5.47.0), OTel collector exports to Loki, Grafana datasource provisioned, Tempo tracesToLogs→Loki |
|
|
321
|
+
| OD-02 | Collector resource limits (512Mi estimated, needs load testing) | May require sampling rate reduction (5% → 2%) | **Unresolved** — revisit at Week 8 with real traffic |
|
|
322
|
+
| OD-03 | Dashboard ownership post-Week 28 | Module teams vs. Platform team for panel maintenance | **Unresolved** — process decision needed |
|
|
323
|
+
|
|
324
|
+
---
|
|
325
|
+
|
|
326
|
+
## File Inventory (83 files)
|
|
327
|
+
|
|
328
|
+
| Directory | Files | Description |
|
|
329
|
+
|-----------|-------|-------------|
|
|
330
|
+
| `collector/` | 3 | OTel Collector config, DaemonSet manifest, sampling config |
|
|
331
|
+
| `dashboards/` | 9 | 8 Grafana JSON dashboards + provisioning config |
|
|
332
|
+
| `alerts/` | 7 | Alert rule YAML files (17 alert rules total) |
|
|
333
|
+
| `logging-lib/nodejs/` | 9 | `@ecip/observability` npm package (src + tests) |
|
|
334
|
+
| `logging-lib/python/` | 6 | `ecip-observability` Python package (src + tests) |
|
|
335
|
+
| `prometheus/` | 3 | Helm values, scrape configs, recording rules |
|
|
336
|
+
| `tempo/` | 2 | Helm values, Grafana datasource provisioning |
|
|
337
|
+
| `elasticsearch/` | 3 | Index template, ILM policy, Kibana space config |
|
|
338
|
+
| `helm/` | 12 | Umbrella chart + **8 templates** + 2 values files *(+3 new: grafana-secret.yaml, loki.yaml, +loki dependency in Chart.yaml)* |
|
|
339
|
+
| `ci/` | 5 | **NEW** — ESLint plugin (2 files), Ruff config, contract checker, GitHub Actions workflow |
|
|
340
|
+
| `scripts/` | 1 | **NEW** — `lint-dashboards.js` dashboard JSON linter |
|
|
341
|
+
| `runbooks/` | 7 | SDK integration guide, dashboard guide, 5 alert-response runbooks |
|
|
342
|
+
| `tests/` | 5 | Metric validation, alert config, log schema, OTel pipeline integration, **security events** *(+1 new)* |
|
|
343
|
+
| `chaos/` | 3 | LSP kill, Redis failure, Kafka broker restart |
|
|
344
|
+
| Root | 4 | `package.json`, `tsconfig.json`, `vitest.config.ts`, `vitest.integration.config.ts` |
|
|
345
|
+
| Docs | 5 | `CLAUDE.md`, `README.md`, `docs/M08-Observability-Design.md`, `docs/module-documentation.md`, `docs/PROGRESS.md` |
|
|
346
|
+
| **Total** | **83** | *(+10 files since last update)* |
|
|
347
|
+
|
|
348
|
+
---
|
|
349
|
+
|
|
350
|
+
## Risk Status (from §10)
|
|
351
|
+
|
|
352
|
+
| Risk | Current Mitigation Status |
|
|
353
|
+
|------|--------------------------|
|
|
354
|
+
| Module teams skip SDK integration | ✅ **Fully mitigated** — SDK and guide exist, CI gate (M08-T10) fully implemented with GitHub Actions workflow enforcing SDK dependency, `initTracer()`, `createLogger()`, and `console.log`/`print()` bans. |
|
|
355
|
+
| Collector config error drops traces | ✅ **Mitigated** — `otel-pipeline-integration.test.ts` with Testcontainers catches pipeline misconfigurations in CI. |
|
|
356
|
+
| High-cardinality Prometheus labels | ✅ **Mitigated** — `metric-label-validation.test.ts` prohibits `user_id`, `trace_id`, `sha` as labels. |
|
|
357
|
+
| Tempo storage costs | ✅ **Mitigated** — Tail-based sampling at 5% default configured. Review at Week 8 with live data. |
|
|
358
|
+
| Security pipeline mixes with general logs | ✅ **Mitigated** — Separate OTel logger provider, dedicated Collector pipeline (`logs/security` → ES), general logs → Loki. |
|
|
359
|
+
| M08 is itself unobservable | ✅ **Mitigated** — Collector exposes metrics at :8888, self-scrape configured. |
|
|
360
|
+
| General application logs have no backend | ✅ **Mitigated** — Loki deployed as log aggregation backend (OD-01 resolved). OTel collector `logs` pipeline exports to Loki. |
|
|
361
|
+
|
|
362
|
+
---
|
|
363
|
+
|
|
364
|
+
## Recommended Next Actions (Priority Order)
|
|
365
|
+
|
|
366
|
+
1. **ENH-08** — Fix `chaos/redis-node-failure.sh` (remove ~195 lines of corrupted content) — **Quick win, 5 min**
|
|
367
|
+
2. **ENH-04 / GAP-04** — Strengthen compile-time enforcement test (`tsc --noEmit` on broken file) — **Medium, 30 min**
|
|
368
|
+
3. **ENH-09** — Add template variables to remaining dashboards — **Medium, 1 hr**
|
|
369
|
+
4. **ENH-07 / GAP-08** — Add PodDisruptionBudgets to Helm templates — **Medium, 30 min**
|
|
370
|
+
5. **GAP-09** — Add NetworkPolicy templates — **Low priority, nice-to-have**
|
|
371
|
+
6. ~~**GAP-05**~~ — ✅ Resolved — Design doc updated to match implementation
|
|
372
|
+
|
|
373
|
+
---
|
|
374
|
+
|
|
375
|
+
*Generated from code audit against M08-Observability-Design.md Rev 1.1*
|
|
@@ -0,0 +1,64 @@
|
|
|
1
|
+
# M08 — Observability Stack: Module Documentation
|
|
2
|
+
|
|
3
|
+
**Document ID:** ECIP-M08-DOC · **Status:** Starter Draft
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## Overview
|
|
8
|
+
|
|
9
|
+
The Observability Stack provides the monitoring, tracing, and alerting infrastructure for the entire ECIP platform. It is a pure infrastructure module — it does not contain application business logic.
|
|
10
|
+
|
|
11
|
+
All other modules instrument themselves using the OpenTelemetry SDK. M08 collects, stores, and visualizes that telemetry.
|
|
12
|
+
|
|
13
|
+
---
|
|
14
|
+
|
|
15
|
+
## Instrumentation Strategy
|
|
16
|
+
|
|
17
|
+
**Auto-instrumentation first.** The OTel auto-instrumentation libraries handle HTTP, gRPC, Kafka, and database spans automatically for Node.js, Go, and Rust. Modules do not write custom span code unless they need custom attributes.
|
|
18
|
+
|
|
19
|
+
**Custom metrics** use the OTel Metrics API (`meter.createHistogram`, `meter.createCounter`) — never the Prometheus client directly.
|
|
20
|
+
|
|
21
|
+
---
|
|
22
|
+
|
|
23
|
+
## Required Dashboards
|
|
24
|
+
|
|
25
|
+
Each module has a corresponding Grafana dashboard. All dashboard JSON definitions live in `dashboards/` and are version-controlled here.
|
|
26
|
+
|
|
27
|
+
**Platform SLA Dashboard** is the top-level view and must show:
|
|
28
|
+
- End-to-end query latency p50/p95/p99
|
|
29
|
+
- Analysis pipeline throughput and error rate
|
|
30
|
+
- Uptime per module
|
|
31
|
+
- Active alerts
|
|
32
|
+
|
|
33
|
+
---
|
|
34
|
+
|
|
35
|
+
## Alert Runbook Index
|
|
36
|
+
|
|
37
|
+
| Alert | Severity | Runbook |
|
|
38
|
+
|-------|----------|---------|
|
|
39
|
+
| `QueryLatencyHigh` | Critical | [runbooks/query-latency-high.md](./runbooks/query-latency-high.md) |
|
|
40
|
+
| `AnalysisBacklogGrowing` | Warning | [runbooks/analysis-backlog.md](./runbooks/analysis-backlog.md) |
|
|
41
|
+
| `KafkaDLQDepthHigh` | Warning | [runbooks/dlq-depth.md](./runbooks/dlq-depth.md) |
|
|
42
|
+
| `MCPServerDown` | Critical | [runbooks/mcp-server-down.md](./runbooks/mcp-server-down.md) |
|
|
43
|
+
| `CacheHitRateLow` | Warning | [runbooks/cache-hit-rate.md](./runbooks/cache-hit-rate.md) |
|
|
44
|
+
|
|
45
|
+
---
|
|
46
|
+
|
|
47
|
+
## Production Readiness Checklist
|
|
48
|
+
|
|
49
|
+
Before each production release:
|
|
50
|
+
- [ ] All SLA dashboards show green for 48h in staging
|
|
51
|
+
- [ ] Chaos tests passed (LSP daemon kill, Redis failure, Kafka restart)
|
|
52
|
+
- [ ] 200 CCU load test completed: p95 < 1.5s
|
|
53
|
+
- [ ] All critical alerts have runbooks
|
|
54
|
+
- [ ] On-call rotation configured in PagerDuty
|
|
55
|
+
|
|
56
|
+
---
|
|
57
|
+
|
|
58
|
+
## Known Gaps / TODOs
|
|
59
|
+
|
|
60
|
+
- [ ] Grafana dashboards not yet created (just schemas defined)
|
|
61
|
+
- [ ] Loki vs ELK log aggregation decision pending
|
|
62
|
+
- [ ] Chaos test scripts not yet written
|
|
63
|
+
- [ ] PagerDuty integration not yet configured
|
|
64
|
+
- [ ] DR test procedure not yet documented
|
|
@@ -0,0 +1,57 @@
|
|
|
1
|
+
{
|
|
2
|
+
"policy": {
|
|
3
|
+
"description": "ECIP security event ILM policy: 90d hot → 1y warm → cold → delete",
|
|
4
|
+
"phases": {
|
|
5
|
+
"hot": {
|
|
6
|
+
"min_age": "0ms",
|
|
7
|
+
"actions": {
|
|
8
|
+
"rollover": {
|
|
9
|
+
"max_primary_shard_size": "50gb",
|
|
10
|
+
"max_age": "7d"
|
|
11
|
+
},
|
|
12
|
+
"set_priority": {
|
|
13
|
+
"priority": 100
|
|
14
|
+
}
|
|
15
|
+
}
|
|
16
|
+
},
|
|
17
|
+
"warm": {
|
|
18
|
+
"min_age": "90d",
|
|
19
|
+
"actions": {
|
|
20
|
+
"shrink": {
|
|
21
|
+
"number_of_shards": 1
|
|
22
|
+
},
|
|
23
|
+
"forcemerge": {
|
|
24
|
+
"max_num_segments": 1
|
|
25
|
+
},
|
|
26
|
+
"set_priority": {
|
|
27
|
+
"priority": 50
|
|
28
|
+
},
|
|
29
|
+
"allocate": {
|
|
30
|
+
"require": {
|
|
31
|
+
"data": "warm"
|
|
32
|
+
}
|
|
33
|
+
}
|
|
34
|
+
}
|
|
35
|
+
},
|
|
36
|
+
"cold": {
|
|
37
|
+
"min_age": "365d",
|
|
38
|
+
"actions": {
|
|
39
|
+
"set_priority": {
|
|
40
|
+
"priority": 0
|
|
41
|
+
},
|
|
42
|
+
"allocate": {
|
|
43
|
+
"require": {
|
|
44
|
+
"data": "cold"
|
|
45
|
+
}
|
|
46
|
+
}
|
|
47
|
+
}
|
|
48
|
+
},
|
|
49
|
+
"delete": {
|
|
50
|
+
"min_age": "730d",
|
|
51
|
+
"actions": {
|
|
52
|
+
"delete": {}
|
|
53
|
+
}
|
|
54
|
+
}
|
|
55
|
+
}
|
|
56
|
+
}
|
|
57
|
+
}
|
|
@@ -0,0 +1,62 @@
|
|
|
1
|
+
{
|
|
2
|
+
"index_patterns": ["ecip-security-events-*"],
|
|
3
|
+
"template": {
|
|
4
|
+
"settings": {
|
|
5
|
+
"number_of_shards": 2,
|
|
6
|
+
"number_of_replicas": 1,
|
|
7
|
+
"index.lifecycle.name": "ecip-security-events-ilm",
|
|
8
|
+
"index.lifecycle.rollover_alias": "ecip-security-events"
|
|
9
|
+
},
|
|
10
|
+
"mappings": {
|
|
11
|
+
"properties": {
|
|
12
|
+
"@timestamp": {
|
|
13
|
+
"type": "date"
|
|
14
|
+
},
|
|
15
|
+
"event.kind": {
|
|
16
|
+
"type": "keyword"
|
|
17
|
+
},
|
|
18
|
+
"event.category": {
|
|
19
|
+
"type": "keyword"
|
|
20
|
+
},
|
|
21
|
+
"event.type": {
|
|
22
|
+
"type": "keyword"
|
|
23
|
+
},
|
|
24
|
+
"event.outcome": {
|
|
25
|
+
"type": "keyword"
|
|
26
|
+
},
|
|
27
|
+
"trace.id": {
|
|
28
|
+
"type": "keyword"
|
|
29
|
+
},
|
|
30
|
+
"user.id": {
|
|
31
|
+
"type": "keyword"
|
|
32
|
+
},
|
|
33
|
+
"source.ip": {
|
|
34
|
+
"type": "ip"
|
|
35
|
+
},
|
|
36
|
+
"resource": {
|
|
37
|
+
"type": "keyword"
|
|
38
|
+
},
|
|
39
|
+
"action": {
|
|
40
|
+
"type": "keyword"
|
|
41
|
+
},
|
|
42
|
+
"reason": {
|
|
43
|
+
"type": "keyword"
|
|
44
|
+
},
|
|
45
|
+
"module": {
|
|
46
|
+
"type": "keyword"
|
|
47
|
+
},
|
|
48
|
+
"metadata": {
|
|
49
|
+
"type": "object",
|
|
50
|
+
"enabled": true,
|
|
51
|
+
"dynamic": true
|
|
52
|
+
}
|
|
53
|
+
}
|
|
54
|
+
}
|
|
55
|
+
},
|
|
56
|
+
"priority": 200,
|
|
57
|
+
"composed_of": [],
|
|
58
|
+
"version": 1,
|
|
59
|
+
"_meta": {
|
|
60
|
+
"description": "ECIP security event index template (NFR-SEC-007). Governs auth failures, RBAC denials, and service auth events routed from the OTel Collector security pipeline."
|
|
61
|
+
}
|
|
62
|
+
}
|
|
@@ -0,0 +1,53 @@
|
|
|
1
|
+
# =============================================================================
|
|
2
|
+
# ECIP M08 — Kibana Space Configuration (Security Team)
|
|
3
|
+
# =============================================================================
|
|
4
|
+
# The security team owns this space. M08 provides the index;
|
|
5
|
+
# the security team manages all Kibana queries/dashboards/SIEM rules.
|
|
6
|
+
# =============================================================================
|
|
7
|
+
apiVersion: v1
|
|
8
|
+
kind: ConfigMap
|
|
9
|
+
metadata:
|
|
10
|
+
name: kibana-security-space
|
|
11
|
+
namespace: monitoring
|
|
12
|
+
labels:
|
|
13
|
+
app.kubernetes.io/name: elasticsearch
|
|
14
|
+
app.kubernetes.io/component: kibana-config
|
|
15
|
+
ecip.module: M08
|
|
16
|
+
data:
|
|
17
|
+
security-space.json: |
|
|
18
|
+
{
|
|
19
|
+
"id": "ecip-security",
|
|
20
|
+
"name": "ECIP Security",
|
|
21
|
+
"description": "Security events from the ECIP platform — auth failures, RBAC denials, service auth events",
|
|
22
|
+
"color": "#DD0000",
|
|
23
|
+
"initials": "ES",
|
|
24
|
+
"disabledFeatures": [
|
|
25
|
+
"canvas",
|
|
26
|
+
"maps",
|
|
27
|
+
"ml",
|
|
28
|
+
"monitoring",
|
|
29
|
+
"apm",
|
|
30
|
+
"uptime",
|
|
31
|
+
"observability",
|
|
32
|
+
"fleet"
|
|
33
|
+
]
|
|
34
|
+
}
|
|
35
|
+
index-pattern.json: |
|
|
36
|
+
{
|
|
37
|
+
"title": "ecip-security-events-*",
|
|
38
|
+
"timeFieldName": "@timestamp",
|
|
39
|
+
"fields": [
|
|
40
|
+
"@timestamp",
|
|
41
|
+
"event.kind",
|
|
42
|
+
"event.category",
|
|
43
|
+
"event.type",
|
|
44
|
+
"event.outcome",
|
|
45
|
+
"trace.id",
|
|
46
|
+
"user.id",
|
|
47
|
+
"source.ip",
|
|
48
|
+
"resource",
|
|
49
|
+
"action",
|
|
50
|
+
"reason",
|
|
51
|
+
"module"
|
|
52
|
+
]
|
|
53
|
+
}
|
package/helm/Chart.yaml
ADDED
|
@@ -0,0 +1,30 @@
|
|
|
1
|
+
apiVersion: v2
|
|
2
|
+
name: ecip-observability-stack
|
|
3
|
+
description: ECIP M08 — Umbrella Helm chart deploying the full observability stack
|
|
4
|
+
type: application
|
|
5
|
+
version: 1.0.0
|
|
6
|
+
appVersion: "1.0.0"
|
|
7
|
+
maintainers:
|
|
8
|
+
- name: ECIP Platform Team
|
|
9
|
+
email: platform@ecip.internal
|
|
10
|
+
|
|
11
|
+
dependencies:
|
|
12
|
+
- name: kube-prometheus-stack
|
|
13
|
+
version: "56.6.2"
|
|
14
|
+
repository: https://prometheus-community.github.io/helm-charts
|
|
15
|
+
condition: prometheus.enabled
|
|
16
|
+
|
|
17
|
+
- name: tempo-distributed
|
|
18
|
+
version: "1.8.0"
|
|
19
|
+
repository: https://grafana.github.io/helm-charts
|
|
20
|
+
condition: tempo.enabled
|
|
21
|
+
|
|
22
|
+
- name: elasticsearch
|
|
23
|
+
version: "8.5.1"
|
|
24
|
+
repository: https://helm.elastic.co
|
|
25
|
+
condition: elasticsearch.enabled
|
|
26
|
+
|
|
27
|
+
- name: loki
|
|
28
|
+
version: "5.47.0"
|
|
29
|
+
repository: https://grafana.github.io/helm-charts
|
|
30
|
+
condition: loki.enabled
|
|
@@ -0,0 +1,25 @@
|
|
|
1
|
+
{{- /*
|
|
2
|
+
ECIP M08 — ConfigMaps for collector, dashboards, and alert provisioning
|
|
3
|
+
*/ -}}
|
|
4
|
+
apiVersion: v1
|
|
5
|
+
kind: ConfigMap
|
|
6
|
+
metadata:
|
|
7
|
+
name: otel-collector-config
|
|
8
|
+
namespace: {{ .Values.namespace | default "monitoring" }}
|
|
9
|
+
labels:
|
|
10
|
+
app.kubernetes.io/name: otel-collector
|
|
11
|
+
app: ecip
|
|
12
|
+
data:
|
|
13
|
+
otel-collector-config.yaml: |-
|
|
14
|
+
{{ .Files.Get "collector/otel-collector-config.yaml" | nindent 4 }}
|
|
15
|
+
---
|
|
16
|
+
apiVersion: v1
|
|
17
|
+
kind: ConfigMap
|
|
18
|
+
metadata:
|
|
19
|
+
name: grafana-dashboard-provisioning
|
|
20
|
+
namespace: {{ .Values.namespace | default "monitoring" }}
|
|
21
|
+
labels:
|
|
22
|
+
app: ecip
|
|
23
|
+
data:
|
|
24
|
+
grafana-dashboards.yaml: |-
|
|
25
|
+
{{ .Files.Get "dashboards/_provisioning/grafana-dashboards.yaml" | nindent 4 }}
|