ecip-observability-stack 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +48 -0
- package/README.md +75 -0
- package/alerts/analysis-backlog.yaml +39 -0
- package/alerts/cache-degradation.yaml +44 -0
- package/alerts/dlq-depth.yaml +56 -0
- package/alerts/lsp-daemon.yaml +43 -0
- package/alerts/mcp-latency.yaml +46 -0
- package/alerts/security-anomaly.yaml +59 -0
- package/alerts/sla-latency.yaml +61 -0
- package/chaos/kafka-broker-restart.sh +168 -0
- package/chaos/kill-lsp-daemon.sh +148 -0
- package/chaos/redis-node-failure.sh +318 -0
- package/ci/check-observability-contract.js +285 -0
- package/ci/eslint-plugin-ecip/index.js +209 -0
- package/ci/eslint-plugin-ecip/package.json +12 -0
- package/ci/github-actions-observability-gate.yaml +180 -0
- package/ci/ruff-shared.toml +41 -0
- package/collector/otel-collector-config.yaml +226 -0
- package/collector/otel-collector-daemonset.yaml +168 -0
- package/collector/sampling-config.yaml +83 -0
- package/dashboards/_provisioning/grafana-dashboards.yaml +16 -0
- package/dashboards/analysis-throughput.json +166 -0
- package/dashboards/cache-performance.json +129 -0
- package/dashboards/cross-repo-fanout.json +93 -0
- package/dashboards/event-bus-dlq.json +129 -0
- package/dashboards/lsp-daemon-health.json +104 -0
- package/dashboards/mcp-call-graph.json +114 -0
- package/dashboards/query-latency.json +160 -0
- package/dashboards/security-events.json +131 -0
- package/docs/M08-Observability-Design.md +639 -0
- package/docs/PROGRESS.md +375 -0
- package/docs/module-documentation.md +64 -0
- package/elasticsearch/ilm-policy.json +57 -0
- package/elasticsearch/index-template.json +62 -0
- package/elasticsearch/kibana-space.yaml +53 -0
- package/helm/Chart.yaml +30 -0
- package/helm/templates/configmaps.yaml +25 -0
- package/helm/templates/elasticsearch.yaml +68 -0
- package/helm/templates/grafana-secret.yaml +22 -0
- package/helm/templates/grafana.yaml +19 -0
- package/helm/templates/loki.yaml +33 -0
- package/helm/templates/otel-collector.yaml +119 -0
- package/helm/templates/prometheus.yaml +43 -0
- package/helm/templates/tempo.yaml +16 -0
- package/helm/values.prod.yaml +159 -0
- package/helm/values.yaml +146 -0
- package/logging-lib/nodejs/package.json +57 -0
- package/logging-lib/nodejs/pnpm-lock.yaml +4576 -0
- package/logging-lib/python/pyproject.toml +45 -0
- package/logging-lib/python/src/__init__.py +19 -0
- package/logging-lib/python/src/logger.py +131 -0
- package/logging-lib/python/src/security_events.py +150 -0
- package/logging-lib/python/src/tracer.py +185 -0
- package/logging-lib/python/tests/test_logger.py +113 -0
- package/package.json +21 -0
- package/prometheus/prometheus-values.yaml +170 -0
- package/prometheus/recording-rules.yaml +97 -0
- package/prometheus/scrape-configs.yaml +122 -0
- package/runbooks/SDK-INTEGRATION.md +239 -0
- package/runbooks/alert-response/ANALYSIS_BACKLOG.md +128 -0
- package/runbooks/alert-response/DLQ_DEPTH_EXCEEDED.md +150 -0
- package/runbooks/alert-response/HIGH_QUERY_LATENCY.md +134 -0
- package/runbooks/alert-response/LSP_DAEMON_RESTART.md +118 -0
- package/runbooks/alert-response/SECURITY_ANOMALY.md +160 -0
- package/runbooks/dashboard-guide.md +169 -0
- package/scripts/lint-dashboards.js +184 -0
- package/tempo/tempo-datasource.yaml +46 -0
- package/tempo/tempo-values.yaml +94 -0
- package/tests/alert-threshold-config.test.ts +283 -0
- package/tests/log-schema-validation.test.ts +246 -0
- package/tests/metric-label-validation.test.ts +292 -0
- package/tests/otel-pipeline-integration.test.ts +420 -0
- package/tests/security-events.test.ts +417 -0
- package/tsconfig.json +17 -0
- package/vitest.config.ts +21 -0
- package/vitest.integration.config.ts +9 -0
package/docs/M08-Observability-Design.md
@@ -0,0 +1,639 @@

# M08 — Observability Stack

## Module Design Document

> **Document ID:** ECIP-M08-MDD · **Revision:** 1.1 · **Date:** March 2026
> **Status:** Design Complete — Ready for Implementation
> **Supplements:** ECIP-TDD-001 §4.8, ECIP-TDD-002 §9
> **Team:** Platform / Infra · **Classification:** Confidential — Internal Engineering Use Only

---

## Table of Contents

- [§0 — Architect's Note](#0--architects-note)
- [§1 — Module Overview](#1--module-overview)
  - [Responsibilities](#responsibilities)
  - [Non-Goals](#non-goals)
- [§2 — Folder Structure](#2--folder-structure)
  - [Top-Level Layout](#top-level-layout)
  - [Key File Explanations](#key-file-explanations)
- [§3 — Four Pillars: Detailed Design](#3--four-pillars-detailed-design)
  - [Pillar 1: Distributed Tracing](#pillar-1-distributed-tracing)
  - [Pillar 2: Metrics Collection](#pillar-2-metrics-collection)
  - [Pillar 3: Structured Log Aggregation](#pillar-3-structured-log-aggregation)
  - [Pillar 4: Alerting](#pillar-4-alerting)
- [§4 — Security Event Pipeline](#4--security-event-pipeline)
- [§5 — SDK Integration Contract](#5--sdk-integration-contract)
- [§6 — Task Breakdown](#6--task-breakdown)
- [§7 — Module Dependencies](#7--module-dependencies)
- [§8 — Testing Plan](#8--testing-plan)
- [§9 — Open Design Decisions](#9--open-design-decisions)
- [§10 — Risk Register](#10--risk-register)
- [§11 — Architect's Review Notes](#11--architects-review-notes)

---

## §0 — Architect's Note

M08 is not like the other seven modules. It has no business logic, owns no domain data, and ships no user-facing features. Its entire job is to make every other module's behaviour observable in production — and that makes it the most important thing to get right before anyone else writes a single line of production code.

**Three rules govern this module:**

1. **M08 ships in Week 1, before any other module integration.** The SDK guide and shared logging library are Day 1 deliverables. A module that starts without them will create permanent blind spots that are expensive to retrofit.

2. **Silent failure is the primary failure mode of observability infrastructure.** Instrumentation that appears to work while producing no data is far worse than outright failure — it creates false confidence. Every design decision in this document is oriented around making failures loud.

3. **M08 must not become a performance bottleneck.** Observability infrastructure that adds >5ms to the hot query path (M01 → M04 → M03) is worse than no observability. The DaemonSet topology, async span export, and tail-based sampling rules are all chosen for this reason.

> **Architect's sign-off:** This design supersedes the TDD-001 §4.8 task list. The task list was a starting point; this document is the implementable specification. Teams should reference this document, not TDD-001, for M08 implementation details.

---

## §1 — Module Overview

| Field | Value |
|---|---|
| **Module** | M08 — Observability Stack |
| **Type** | Cross-Cutting Infrastructure |
| **Team** | Platform / Infra |
| **Stack** | OTel Collector · Grafana Tempo · Prometheus · Grafana · Elasticsearch |
| **Timeline** | Week 1 → Week 28 (continuous; SDK guide Week 1, dashboards iterative) |
| **Deployment** | Kubernetes (DaemonSet for collector, Deployments for backends) |
| **Consumes** | Nothing — M08 has no runtime dependency on any other ECIP module |
| **Consumed by** | All modules (M01–M07) via `@ecip/observability` SDK |

**Purpose:** Distributed tracing, metrics collection, structured log aggregation, and alerting across all seven ECIP modules via OpenTelemetry auto-instrumentation. Minimal per-module code changes are required — the shared library handles all boilerplate.

### Responsibilities

- **Distributed tracing** — End-to-end W3C TraceContext traces from M01 through all downstream services, stored in Grafana Tempo with 14-day retention.
- **Metrics collection** — Standard Kubernetes metrics plus a curated catalog of application-level histograms, counters, and gauges scraped by Prometheus.
- **Structured log aggregation** — A mandatory JSON log schema enforced at compile time via the shared logging library. All services emit structured logs; unstructured `console.log` is a PR-review violation.
- **Alerting** — SLA-backed Prometheus alert rules routed to PagerDuty and Slack. Every alert has a corresponding runbook.
- **Security event logging** — Auth failures and RBAC denials routed to a dedicated Elasticsearch index for SIEM use. Entirely separate pipeline from general application logs (NFR-SEC-007).
- **Grafana dashboards** — Eight pre-built dashboards delivered as dashboard-as-code JSON, auto-provisioned via the Grafana sidecar.
- **SDK and integration guide** — Published packages (`@ecip/observability` for Node.js, `ecip-observability` for Python) plus a runbook targeting a 30-minute integration time per team.

### Non-Goals

- **SIEM analysis and alert rules** — The security team operates the Elasticsearch SIEM layer. M08 writes the raw events; the security team owns all query logic above that.
- **Business-level metrics** (e.g., repos indexed per org, daily active users) — Owned by individual product modules. M08 provides the emission infrastructure only.
- **Log retention beyond 14 days for traces, 30 days for application logs** — Retention policy is a platform decision; storage provisioning is an infra decision. Neither is M08's call.
- **APM continuous profiling** (e.g., Pyroscope) — Out of scope for V1. Re-evaluate at the Week 22 performance review.
- **Real user monitoring / browser RUM** — ECIP is a backend-only platform in V1.
- **On-call schedule management** — M08 wires alerts to PagerDuty; on-call rotation configuration is a team ops concern.

---

## §2 — Folder Structure

### Top-Level Layout

```
ecip-observability/                        # M08 root — Platform team repo
│
├── collector/                             # OTel Collector configuration
│   ├── otel-collector-config.yaml         # OTLP receiver, processors, exporters
│   ├── otel-collector-daemonset.yaml      # K8s DaemonSet manifest
│   └── sampling-config.yaml               # Tail sampling rules (5% default / 100% on error)
│
├── dashboards/                            # Grafana dashboard-as-code (JSON)
│   ├── query-latency.json                 # p50/p95/p99 per mode (lsp/vector/hybrid)
│   ├── analysis-throughput.json           # Events processed / backlog / Kafka consumer lag
│   ├── cache-performance.json             # Hit rate by cache_type and repo
│   ├── lsp-daemon-health.json             # Daemon status, restart rate, OOM events
│   ├── mcp-call-graph.json                # MCP fan-out topology, latency per target_repo
│   ├── event-bus-dlq.json                 # DLQ depth, DLQ age, retry counts
│   ├── cross-repo-fanout.json             # Fan-out depth distribution, cycle warnings
│   ├── security-events.json               # Auth failures, RBAC denials (from Elasticsearch)
│   └── _provisioning/
│       └── grafana-dashboards.yaml        # Grafana sidecar auto-provision config
│
├── alerts/                                # Prometheus alerting rules
│   ├── sla-latency.yaml                   # query_duration_ms p95 > 1500ms
│   ├── analysis-backlog.yaml              # event backlog > 1000 events
│   ├── lsp-daemon.yaml                    # restart_rate > 2/hour
│   ├── dlq-depth.yaml                     # DLQ depth > 100
│   ├── mcp-latency.yaml                   # mcp_call_duration_ms p95 > 800ms
│   ├── cache-degradation.yaml             # cache_hit_rate < 60%
│   └── security-anomaly.yaml              # Auth failure burst > 10 in 5 min
│
├── logging-lib/                           # Shared structured logging library (published package)
│   ├── nodejs/                            # For M01, M04, M05, M07 (TypeScript services)
│   │   ├── src/
│   │   │   ├── logger.ts                  # Pino wrapper — mandatory fields enforced at type level
│   │   │   ├── tracer.ts                  # OTel SDK init + trace/span helpers
│   │   │   ├── security-events.ts         # emitAuthFailure() / emitRbacDenial() helpers
│   │   │   └── middleware.ts              # Express/Fastify: injects trace_id into request context
│   │   ├── package.json                   # Published as @ecip/observability
│   │   └── tests/
│   │       ├── logger.test.ts
│   │       └── tracer.test.ts
│   │
│   └── python/                            # For M02 (Analysis Engine — Python components)
│       ├── src/
│       │   ├── logger.py                  # structlog wrapper with mandatory fields
│       │   ├── tracer.py                  # OTel Python SDK init + @traced decorator
│       │   └── security_events.py
│       ├── pyproject.toml                 # Published as ecip-observability
│       └── tests/
│           └── test_logger.py
│
├── prometheus/                            # Prometheus deployment
│   ├── prometheus-values.yaml             # Helm values for kube-prometheus-stack
│   ├── scrape-configs.yaml                # Per-service scrape targets (updated as modules ship)
│   └── recording-rules.yaml               # Pre-computed rate/ratio metrics for dashboard perf
│
├── tempo/                                 # Grafana Tempo trace storage
│   ├── tempo-values.yaml                  # Helm values: S3 backend, 14-day retention
│   └── tempo-datasource.yaml              # Grafana datasource provisioning
│
├── elasticsearch/                         # SIEM security event index (NFR-SEC-007)
│   ├── index-template.json                # Security event schema + field mappings
│   ├── ilm-policy.json                    # ILM: 90-day hot → 1-year cold → delete
│   └── kibana-space.yaml                  # Security team Kibana space config
│
├── helm/                                  # Umbrella Helm chart — deploys entire M08 stack
│   ├── Chart.yaml
│   ├── values.yaml                        # Default values (dev/staging)
│   ├── values.prod.yaml                   # Production overrides
│   └── templates/
│       ├── otel-collector.yaml
│       ├── prometheus.yaml
│       ├── grafana.yaml
│       ├── tempo.yaml
│       ├── elasticsearch.yaml
│       └── configmaps.yaml                # Alert rule + dashboard provisioning refs
│
├── runbooks/                              # Operations runbooks and SDK integration guide
│   ├── SDK-INTEGRATION.md                 # M08-T04: 30-min setup guide for all teams
│   ├── alert-response/
│   │   ├── LSP_DAEMON_RESTART.md
│   │   ├── HIGH_QUERY_LATENCY.md
│   │   ├── DLQ_DEPTH_EXCEEDED.md
│   │   ├── ANALYSIS_BACKLOG.md
│   │   └── SECURITY_ANOMALY.md
│   └── dashboard-guide.md
│
└── tests/                                 # M08 validation tests
    ├── metric-label-validation.test.ts    # Assert all required labels present on emission
    ├── alert-threshold-config.test.ts     # Parse + validate all alert YAML files
    ├── log-schema-validation.test.ts      # JSON log shape + compile-time field enforcement
    └── otel-pipeline-integration.test.ts  # Testcontainers: Collector → Tempo end-to-end
```

### Key File Explanations

**`collector/otel-collector-config.yaml`** — The most operationally critical file in M08. It defines three separate pipelines: `traces` (to Tempo), `metrics` (to Prometheus), and `logs` (to Elasticsearch, security events only). An error in this file silently drops all observability data with no visible failure to services. This file must be version-controlled and change-reviewed with the same rigour as application code.
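
The three-pipeline layout can be sketched as follows. This is a minimal illustration, not the production file — the exporter names, endpoints, and processor settings here are assumptions:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s          # batch spans/metrics/logs before export

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317              # traces → Tempo
  prometheus:
    endpoint: 0.0.0.0:8889                       # metrics → scraped by Prometheus
  elasticsearch/security:
    endpoints: ["https://es-security.monitoring:9200"]  # security events only

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [elasticsearch/security]
```

Note that a typo in any `exporters:` list here drops that entire signal silently, which is exactly the failure mode §0 warns about.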

**`logging-lib/nodejs/src/logger.ts`** — The single most consumed M08 artifact. Every Node.js service imports this. The mandatory fields (`trace_id`, `span_id`, `repo`, `branch`, `user_id`, `module`) are enforced at the TypeScript type level — missing any of them is a compilation failure. This is intentional: the alternative is runtime validation, which is too easily bypassed.
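
One plausible shape for that enforcement is sketched below: identity fields are required at logger creation, while `trace_id`/`span_id` are filled from the active span at emission time (stubbed here). The internals are illustrative, not the actual `logger.ts` implementation:

```typescript
// Mandatory identity fields — omitting any of them at the call site is a
// compile-time error, which is the enforcement the design relies on.
interface LoggerContext {
  repo: string;      // "{org}/{repo}"
  branch: string;
  user_id: string;   // hashed — no raw PII
  module: 'M01' | 'M02' | 'M03' | 'M04' | 'M05' | 'M06' | 'M07';
}

// Stand-in for the OTel active-span lookup; the real library would read these
// from the current span context at emission time.
function activeSpanIds(): { trace_id: string; span_id: string } {
  return { trace_id: '00000000000000000000000000000000', span_id: '0000000000000000' };
}

function createLogger(ctx: LoggerContext) {
  return {
    info(fields: Record<string, unknown>, msg: string) {
      // Every line carries the full mandatory context plus call-site fields.
      const line = {
        timestamp: new Date().toISOString(),
        level: 'info',
        ...activeSpanIds(),
        ...ctx,
        ...fields,
        msg,
      };
      console.log(JSON.stringify(line));
      return line;
    },
  };
}

// createLogger({ repo: 'acme-corp/auth-service', module: 'M04' })  // ✗ does not compile
```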

**`alerts/sla-latency.yaml`** — Directly governs whether NFR-AVL-001 (99.9% query-path availability) breaches are detected. The PromQL expression must use `histogram_quantile(0.95, ...)`, not an average — averaging latency hides tail problems. This is a common mistake and worth explicit attention during review.
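
As an illustration, the rule might look like the sketch below, with the `rate`-over-buckets form that `histogram_quantile` requires for a Prometheus histogram spelled out (the alert-table shorthand elides it). Group name and annotations are illustrative:

```yaml
groups:
  - name: sla-latency
    rules:
      - alert: QueryLatencySLABreach
        # p95 over a 5-minute window — never an average, which hides tail latency
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(query_duration_ms_bucket[5m])) by (le)
          ) > 1500
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "query_duration_ms p95 above 1500ms for 5 minutes"
          runbook: runbooks/alert-response/HIGH_QUERY_LATENCY.md
```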

**`runbooks/SDK-INTEGRATION.md`** — The highest-leverage document M08 produces. Every other module's observability quality is directly proportional to how well teams follow this guide. It must be ready by end of Week 1, before any other module writes production code. Treat it as a P0 deliverable, not documentation.

---

## §3 — Four Pillars: Detailed Design

### Pillar 1: Distributed Tracing

**Stack:** OTel Collector (DaemonSet) → Grafana Tempo (S3-backed) → Grafana UI

Every external request is injected with a W3C `traceparent` header at M01. The trace propagates automatically through every downstream gRPC and HTTP call via the OTel SDK's auto-instrumentation. Engineers debugging a slow query can jump from the Grafana latency dashboard to a complete flamegraph-style trace in Tempo in one click, with the trace ID correlating directly to the log entry that fired the alert.

**Topology decision: DaemonSet, not central Deployment**

The OTel Collector runs as a Kubernetes DaemonSet — one collector pod per node, receiving spans from all pods on the same node via localhost. This is a deliberate architectural choice:

- A central Deployment collector is a single point of failure; a DaemonSet crash only affects pods on that node.
- Localhost communication avoids network latency on the critical span export path.
- Horizontal scale is automatic as nodes are added to the cluster.

The tradeoff is that each collector pod must be configured identically, which is managed entirely by the Helm chart. No manual per-node configuration is ever required.
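
Assuming the common hostPort pattern for node-local delivery, the DaemonSet shape might look like the following sketch (namespace, labels, and the image tag are all illustrative, not the actual manifest):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels: { app: otel-collector }
  template:
    metadata:
      labels: { app: otel-collector }
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.96.0   # illustrative tag
          ports:
            # hostPort exposes the OTLP receivers on every node, so pods can
            # export spans to their own node's collector without crossing the network
            - containerPort: 4317
              hostPort: 4317        # OTLP gRPC
            - containerPort: 4318
              hostPort: 4318        # OTLP HTTP
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol
      volumes:
        - name: config
          configMap:
            name: otel-collector-config   # identical on every node, per the tradeoff above
```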

**Sampling strategy**

Tail-based sampling is configured at the collector, not the SDK. Sampling decisions are therefore made after the full trace is assembled, enabling intelligent rules:

| Rule | Rate | Condition |
|---|---|---|
| `errors-always-sample` | 100% | Any span with status `ERROR` |
| `slow-queries-sample` | 100% | Any trace with total latency > 1000ms |
| `default` | 5% | All other traces |

Head-based sampling (deciding at trace start) is explicitly rejected: the decision would be made before it is known whether the trace will error, so most error traces would be dropped. Tail-based sampling with a 10-second decision window ensures errors are always captured.
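
With the collector's `tail_sampling` processor, the three rules above translate roughly to the following sketch (the actual `sampling-config.yaml` may differ):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer the full trace before deciding
    policies:
      - name: errors-always-sample
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-queries-sample
        type: latency
        latency: { threshold_ms: 1000 }
      - name: default
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
```

Policies are evaluated per trace; a trace kept by any policy is kept, so errors and slow queries survive regardless of the 5% default.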

**Trace propagation flow**

```
Client Request
      │
      ▼
M01 (API Gateway)    → inject traceparent header
      │              → start root span: "gateway.route"
      ▼
M04 (Query Service)  → child span: "query.intent_classification"
      │              → child span: "query.context_fusion"
      │              → child span: "query.filter_authorized_repos"  [gRPC to M06]
      ├──────────────────────────────────────────────────────────────┐
      ▼                                                              ▼
M03 (Knowledge Store)                                        M06 (Registry)
  child span: "knowledge.vector_search"                        child span: "registry.check_access"
  child span: "knowledge.symbol_lookup"                        child span: "registry.filter_repos"
      │
      ▼ (if cross-repo, depth ≤ 2)
M05 (MCP Server) × N → child span: "mcp.tool_call"  [per target_repo, in parallel]
      │
      ▼
All spans → OTel Collector (localhost:4317) → Grafana Tempo (S3)
```

### Pillar 2: Metrics Collection

**Stack:** OTel SDK (emit) → OTel Collector → Prometheus → Grafana

All services expose a `/metrics` endpoint scraped by Prometheus. The scrape interval is 15 seconds for application metrics and 30 seconds for infrastructure metrics. Recording rules in `prometheus/recording-rules.yaml` pre-compute expensive quantile and rate calculations so Grafana dashboard queries are fast even at scale.
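
A sketch of what such a recording rule looks like — the rule names here follow the common `level:metric:operation` convention, but these exact rules are illustrative, not the contents of `recording-rules.yaml`:

```yaml
groups:
  - name: ecip-recording-rules
    interval: 30s
    rules:
      # Pre-computed p95 so dashboard panels query one series instead of
      # re-aggregating raw histogram buckets on every refresh
      - record: ecip:query_duration_ms:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(query_duration_ms_bucket[5m])) by (le, mode)
          )
      # Smoothed cache hit rate for the cache-performance dashboard
      - record: ecip:cache_hit_rate:avg_5m
        expr: avg_over_time(cache_hit_rate[5m])
```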

**Metrics Catalog**

This is the authoritative list of application-level metrics M08 defines. All other modules must instrument exactly these metrics with exactly these label sets. Modules may add further metrics, but each addition must be reviewed by the Platform team before shipping, to prevent high-cardinality label explosions.

| Metric Name | Type | Source | Labels | Alert Threshold |
|---|---|---|---|---|
| `query_duration_ms` | Histogram | M04 | `repo`, `mode` (lsp/vector/hybrid), `cached` | p95 > 1500ms → CRITICAL |
| `analysis_duration_ms` | Histogram | M02 | `repo`, `branch_type` (trunk/pr), `language` | p95 > 120000ms → WARN |
| `cache_hit_rate` | Gauge | M03, M04 | `cache_type` (symbol/query), `repo` | < 60% → WARN |
| `lsp_daemon_restarts_total` | Counter | M02 | `repo`, `language` | > 2/hour → CRITICAL |
| `mcp_call_duration_ms` | Histogram | M04 | `target_repo`, `tool_name` | p95 > 800ms → WARN |
| `event_bus_dlq_depth` | Gauge | M07 | `topic` | > 100 → CRITICAL |
| `cross_repo_fanout_count` | Histogram | M04 | `depth`, `repo` | — (dashboard only) |
| `rbac_denial_total` | Counter | M01, M06 | `resource`, `action` | > 10 in 5min → SECURITY WARN |
| `auth_failure_total` | Counter | M01 | `reason` (expired/invalid/missing) | > 10 in 5min → SECURITY WARN |
| `grpc_request_duration_ms` | Histogram | M01, M04, M06 | `service`, `method`, `status_code` | p95 > 500ms → WARN |
| `knowledge_store_write_duration_ms` | Histogram | M03 | `store_type` (redis/pgvector), `namespace` | p95 > 200ms → WARN |
| `hnsw_rebuild_duration_ms` | Histogram | M03 | `repo` | — (TDD-002; dashboard only) |
| `filter_authorized_repos_duration_ms` | Histogram | M06 | `org` | p95 > 20ms → WARN (NFR-SEC-011) |
| `embedding_migration_progress` | Gauge | M02, M03 | `repo`, `phase` (backfill/shadow/cutover) | — (TDD-002; dashboard only) |

> **Label cardinality rule:** `user_id` is explicitly prohibited as a Prometheus label. User-scoped metrics belong in Elasticsearch (security events) or application logs. A single high-cardinality label like `user_id` can cause Prometheus OOM on a busy cluster. This rule is enforced by the metric label validation test.
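
The enforcement can be sketched as a check of each catalog entry's label set against a deny-list. The deny-list contents beyond `user_id` are assumptions, and this is not the actual `tests/metric-label-validation.test.ts`:

```typescript
// Labels that must never appear on a Prometheus metric: each distinct value
// creates a new time series, so user- or request-scoped labels explode cardinality.
// Only user_id is mandated by the catalog; the rest are illustrative additions.
const PROHIBITED_LABELS = new Set(['user_id', 'trace_id', 'span_id', 'request_id']);

// Returns the list of violations for one metric (empty array = metric passes).
function validateLabels(metricName: string, labels: string[]): string[] {
  return labels
    .filter((label) => PROHIBITED_LABELS.has(label))
    .map((label) => `${metricName}: prohibited high-cardinality label "${label}"`);
}
```

For example, `validateLabels('query_duration_ms', ['repo', 'mode', 'cached'])` passes, while any label set containing `user_id` produces a violation that fails the test suite.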

### Pillar 3: Structured Log Aggregation

**Stack:** `@ecip/observability` logger → stdout (JSON) → Fluentd/Fluent Bit → central log store

All services must use the M08-published logging library. Raw `console.log`, `print()`, or any other unstructured log emission is a contract violation, caught at PR review, and treated as a build failure in CI.

**Mandatory JSON log schema**

```json
{
  "timestamp": "2026-03-01T10:00:00.123Z",  // ISO 8601, always UTC
  "level": "info",                          // trace | debug | info | warn | error | fatal
  "module": "M04",                          // which ECIP module emitted this
  "trace_id": "d9f3c1a8-2e7b-44cc-...",     // W3C TraceContext traceId
  "span_id": "f4a912b3...",                 // active span ID at emission time
  "repo": "acme-corp/auth-service",         // {org}/{repo} — required
  "branch": "main",                         // branch context of the operation
  "user_id": "u_8f3a1c",                    // hashed user identifier — no raw PII
  "env": "production",
  "msg": "Query resolved via hybrid mode",
  // ... arbitrary structured fields follow
  "query_type": "hybrid",
  "duration_ms": 43,
  "cached": false
}
```

`trace_id`, `span_id`, `repo`, `branch`, `user_id`, and `module` are mandatory at the TypeScript type level; missing any of them is a compilation failure. The Python wrapper raises `MissingObservabilityContext` at import time if the OTel SDK was not initialized.

**What goes in logs vs. metrics vs. traces**

This is a common source of confusion and must be explicit:

- **Metrics** — Aggregatable numeric values: latency histograms, counters, rates. If you need to know "how many times did X happen?", it's a metric.
- **Logs** — Discrete events with rich contextual detail. If you need to know "what were the exact parameters when X failed?", it's a log.
- **Traces** — Causal chains across service boundaries. If you need to know "which downstream call made this request slow?", it's a trace.

Never duplicate data across all three. A query duration belongs in a histogram metric and in the trace span duration — not also in a log line.

### Pillar 4: Alerting

**Stack:** Prometheus → Alertmanager → PagerDuty / Slack

All alert rules live in `alerts/`. Every alert must have: a corresponding entry in the metrics catalog, a PromQL expression reviewed by the Platform team, a severity level, a routing target, and a runbook document in `runbooks/alert-response/`.

**Alert rules**

| Alert Name | PromQL Condition | For | Severity | Routes To | Runbook |
|---|---|---|---|---|---|
| `QueryLatencySLABreach` | `histogram_quantile(0.95, query_duration_ms) > 1500` | 5min | CRITICAL | PagerDuty + Slack #alerts | `HIGH_QUERY_LATENCY.md` |
| `AnalysisBacklogCritical` | `sum(kafka_consumergroup_lag{group=~"ecip-analysis.*"}) > 1000` | 10min | CRITICAL | PagerDuty + Slack #alerts | `ANALYSIS_BACKLOG.md` |
| `LSPDaemonRestartRate` | `rate(lsp_daemon_restarts_total[1h]) > 2` | 0min | CRITICAL | PagerDuty + Slack #alerts | `LSP_DAEMON_RESTART.md` |
| `DLQDepthExceeded` | `event_bus_dlq_depth > 100` | 5min | CRITICAL | PagerDuty + Slack #alerts | `DLQ_DEPTH_EXCEEDED.md` |
| `MCPCallLatencyWarn` | `histogram_quantile(0.95, mcp_call_duration_ms) > 800` | 10min | WARNING | Slack #alerts-warn | `HIGH_QUERY_LATENCY.md` |
| `CacheHitRateDegraded` | `cache_hit_rate < 0.60` | 15min | WARNING | Slack #alerts-warn | `HIGH_QUERY_LATENCY.md` |
| `FilterAuthorizedReposLatency` | `histogram_quantile(0.95, filter_authorized_repos_duration_ms) > 20` | 5min | WARNING | Slack #alerts-warn | `HIGH_QUERY_LATENCY.md` |
| `SecurityAuthBurst` | `increase(auth_failure_total[5m]) > 10` | 0min | CRITICAL | PagerDuty + Slack #security | `SECURITY_ANOMALY.md` |
| `SecurityRBACDenialBurst` | `increase(rbac_denial_total[5m]) > 10` | 0min | WARNING | Slack #security | `SECURITY_ANOMALY.md` |

> **Architect's note on `LSPDaemonRestartRate`:** The `for: 0min` duration (fires immediately) is intentional. An LSP daemon crash is always a significant event; there is no grace period. The circuit breaker in M04 handles the graceful degradation; the alert gets a human involved in parallel.

---

## §4 — Security Event Pipeline

Security events are a distinct category from application logs. They are governed by NFR-SEC-007 and require a separate OTel pipeline that routes exclusively to Elasticsearch, never to the general log store.

**What constitutes a security event**

- JWT validation failure at M01 (expired, invalid signature, missing)
- RBAC denial at M01 or M06 (user lacks the required role for a resource)
- Service-to-service authentication failure (mTLS or service token rejection)
- Any access attempt to a repo not in the user's authorized set (detected by M06's `FilterAuthorizedRepos`)

**Security event schema**

```json
{
  "@timestamp": "2026-03-01T10:00:00.123Z",
  "event.kind": "event",
  "event.category": "authentication",   // or "authorization"
  "event.type": "denied",
  "event.outcome": "failure",
  "trace.id": "d9f3c1a8-2e7b-44cc-...", // correlates to Tempo trace
  "user.id": "u_8f3a1c",                // hashed — no raw PII
  "source.ip": "10.0.14.22",
  "resource": "acme-corp/auth-service",
  "action": "read",
  "reason": "rbac_insufficient_role",   // or "jwt_expired" | "jwt_invalid"
  "module": "M01"
}
```

**Pipeline separation**

Security events must never leak into general application logs and must never be emitted via the general `logger` object. The `emitSecurityEvent()` helper in `security-events.ts` is the only permitted emission path. It uses a separate OTel logger provider configured to route exclusively to the security Elasticsearch pipeline. This separation is enforced architecturally — it cannot be accidentally bypassed by a developer using the standard logger.
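
The separate-emitter pattern can be sketched as follows. The real helper hands events to a dedicated OTel logger provider wired to the security Elasticsearch pipeline; a pluggable transport stands in for that here, and the factory shape is illustrative, not the actual `security-events.ts`:

```typescript
// ECS-style security event, mirroring the schema above.
interface SecurityEvent {
  '@timestamp': string;
  'event.category': 'authentication' | 'authorization';
  'event.outcome': 'failure';
  'user.id': string;      // hashed — no raw PII
  resource: string;
  action: string;
  reason: string;
  module: string;
}

// Stand-in for the dedicated OTel logger provider: the ONLY sink security
// events may reach. The general application logger is never involved.
type SecurityTransport = (event: SecurityEvent) => void;

function makeEmitRbacDenial(transport: SecurityTransport) {
  return (args: {
    userId: string;
    resource: string;
    action: string;
    reason: string;
    module: string;
  }): SecurityEvent => {
    const event: SecurityEvent = {
      '@timestamp': new Date().toISOString(),
      'event.category': 'authorization',
      'event.outcome': 'failure',
      'user.id': args.userId,
      resource: args.resource,
      action: args.action,
      reason: args.reason,
      module: args.module,
    };
    transport(event);   // routes to the security pipeline only
    return event;
  };
}
```

Because the transport is baked in at construction, a caller cannot redirect a security event to the general log stream, which is the architectural guarantee the paragraph above describes.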

**Elasticsearch index lifecycle**

```
0–90 days    → hot tier  (SSD, fast query)
90–365 days  → warm tier (standard storage)
> 365 days   → cold tier (object storage), then delete
```
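
In Elasticsearch ILM terms, the lifecycle above translates roughly to the sketch below. The rollover cadence and the final delete age are placeholders, not decided values, and this is not the actual `ilm-policy.json`:

```json
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": { "rollover": { "max_age": "30d" } }     // rollover cadence is a placeholder
      },
      "warm": {
        "min_age": "90d",
        "actions": { "set_priority": { "priority": 50 } }
      },
      "cold": {
        "min_age": "365d",
        "actions": { "set_priority": { "priority": 0 } }
      },
      "delete": {
        "min_age": "400d",                                  // placeholder — final retention TBD
        "actions": { "delete": {} }
      }
    }
  }
}
```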

The security team owns all Kibana queries, dashboards, and SIEM detection rules above the index layer. M08 owns only the index template, ILM policy, and raw event schema.

---

## §5 — SDK Integration Contract

This section defines what M08 delivers to other module teams and what those teams are contractually required to do with it. It is the interface contract between M08 and every other module.

### What M08 delivers (by end of Week 1)

- **`@ecip/observability`** — npm package (Node.js / TypeScript). Published to the internal registry.
- **`ecip-observability`** — Python package. Published to the internal PyPI.
- **`runbooks/SDK-INTEGRATION.md`** — Step-by-step integration guide. Target: 30-minute setup for any team.
- **OTel Collector endpoint** — Available at `http://otel-collector.monitoring:4317` (gRPC) and `:4318` (HTTP) in all namespaces by Week 1.
|
|
401
|
+
|
|
402
|
+
### What every module team must do
|
|
403
|
+
|
|
404
|
+
**Node.js / TypeScript services (M01, M04, M05, M07)**
|
|
405
|
+
|
|
406
|
+
```typescript
|
|
407
|
+
// Step 1: Install
|
|
408
|
+
// npm install @ecip/observability @opentelemetry/sdk-node
|
|
409
|
+
|
|
410
|
+
// Step 2: src/instrument.ts — import BEFORE all other imports
|
|
411
|
+
import { initTracer } from '@ecip/observability';
|
|
412
|
+
|
|
413
|
+
initTracer({
|
|
414
|
+
  serviceName: 'ecip-query-service',
|
|
415
|
+
  otlpEndpoint: process.env.OTEL_EXPORTER_OTLP_ENDPOINT, // injected by Helm
|
|
416
|
+
});
|
|
417
|
+
|
|
418
|
+
// Step 3: Use the logger in every handler
|
|
419
|
+
import { createLogger } from '@ecip/observability';
|
|
420
|
+
|
|
421
|
+
const log = createLogger({ repo, branch, user_id, module: 'M04' });
|
|
422
|
+
log.info({ duration_ms: 43, cached: false }, 'Query resolved');
|
|
423
|
+
|
|
424
|
+
// Step 4: Emit security events via dedicated helper (never via log)
|
|
425
|
+
import { emitRbacDenial } from '@ecip/observability';
|
|
426
|
+
|
|
427
|
+
emitRbacDenial({ userId, resource: repoId, action: 'read', reason: 'rbac_insufficient_role' });
|
|
428
|
+
```
|
|
429
|
+
|
|
430
|
+
**Python services (M02)**
|
|
431
|
+
|
|
432
|
+
```python
|
|
433
|
+
# Step 1: pip install ecip-observability opentelemetry-sdk
|
|
434
|
+
|
|
435
|
+
# Step 2: Initialize at process entry (before any other imports)
|
|
436
|
+
from ecip_observability import init_tracer, get_logger
|
|
437
|
+
|
|
438
|
+
init_tracer(service_name="ecip-analysis-engine")
|
|
439
|
+
|
|
440
|
+
# Step 3: Use the logger
|
|
441
|
+
log = get_logger(repo=repo, branch=branch, user_id=user_id, module="M02")
|
|
442
|
+
log.info("Analysis complete", duration_ms=14200, files_indexed=47)
|
|
443
|
+
|
|
444
|
+
# Step 4: Decorate functions for automatic span creation
|
|
445
|
+
from ecip_observability import traced
|
|
446
|
+
|
|
447
|
+
@traced(name="lsp.symbol_extraction")
|
|
448
|
+
def extract_symbols(file_path: str) -> list:
|
|
449
|
+
    ...  # span automatically started/ended; exceptions auto-captured
|
|
450
|
+
```
|
|
451
|
+
|
|
452
|
+
**Helm value injection (Infra team responsibility)**
|
|
453
|
+
|
|
454
|
+
All OTel configuration is injected by the Helm chart. Module teams do not hardcode endpoints.
|
|
455
|
+
|
|
456
|
+
```yaml
|
|
457
|
+
env:
|
|
458
|
+
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
|
|
459
|
+
    value: "http://otel-collector.monitoring:4318"
|
|
460
|
+
  - name: OTEL_SERVICE_NAME
|
|
461
|
+
    valueFrom:
|
|
462
|
+
      fieldRef:
|
|
463
|
+
        fieldPath: metadata.labels['app.kubernetes.io/name']
|
|
464
|
+
  - name: OTEL_RESOURCE_ATTRIBUTES
|
|
465
|
+
    value: "ecip.module=M04,deployment.environment=production"
|
|
466
|
+
```
|
|
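Module code can still defend against a missing injection. A small sketch (hypothetical helper, not part of the published SDK) that fails fast at startup when the Helm-injected values are absent:

```typescript
// Hypothetical startup guard: a misdeployed pod fails at boot rather than
// running silently untraced.
function readOtelConfig(env: Record<string, string | undefined>): {
  endpoint: string;
  serviceName: string;
} {
  const endpoint = env["OTEL_EXPORTER_OTLP_ENDPOINT"];
  const serviceName = env["OTEL_SERVICE_NAME"];
  if (!endpoint || !serviceName) {
    throw new Error("OTel env vars not injected; check the Helm chart values");
  }
  return { endpoint, serviceName };
}
```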
467
|
+
|
|
468
|
+
### Integration enforcement (CI gate)
|
|
469
|
+
|
|
470
|
+
A module PR that does not satisfy all of the following is blocked from merging:
|
|
471
|
+
|
|
472
|
+
1. `@ecip/observability` or `ecip-observability` is in the dependency list.
|
|
473
|
+
2. `initTracer()` is called before the service starts accepting requests (verified by startup integration test).
|
|
474
|
+
3. At least one log line per request handler uses `createLogger()` with all mandatory fields.
|
|
475
|
+
4. No `console.log` or `print()` statements exist in production code paths (ESLint / Ruff rules).
|
|
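The first check is sketchable in a few lines. This is an illustrative assumption of how `check-observability-contract.js` might verify the dependency, not its actual implementation:

```typescript
// Assumed CI helper: returns true when the observability SDK is declared in
// either dependency block of a parsed package.json.
function hasObservabilitySdk(pkg: {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}): boolean {
  const deps = { ...pkg.dependencies, ...pkg.devDependencies };
  return "@ecip/observability" in deps;
}
```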
476
|
+
|
|
477
|
+
---
|
|
478
|
+
|
|
479
|
+
## §6 — Task Breakdown
|
|
480
|
+
|
|
481
|
+
| # | Task | Est. | Owner | Week | Priority | Notes |
|
|
482
|
+
|---|---|---|---|---|---|---|
|
|
483
|
+
| M08-T01 | Deploy OTel Collector as DaemonSet; configure OTLP receiver (gRPC :4317 + HTTP :4318) | 2d | Infra | W1 | **P0** | Must be online before any other module begins instrumentation. Hard blocker for all teams. |
|
|
484
|
+
| M08-T02 | Deploy Grafana Tempo; S3 backend; 14-day retention; microservices mode | 2d | Infra | W1 | **P0** | Grafana Tempo over Jaeger — native Grafana integration, S3 cost model better at scale. |
|
|
485
|
+
| M08-T03 | Deploy Prometheus + Grafana; scrape configs; recording rules | 2d | Infra | W1 | **P0** | `kube-prometheus-stack` Helm chart. Add per-module scrape configs as modules come online. |
|
|
486
|
+
| M08-T04 | Publish `@ecip/observability` + `ecip-observability`; write SDK-INTEGRATION.md | 2d | Senior Backend | W1 | **P0** | Highest-leverage M08 deliverable. Treat as feature, not documentation. |
|
|
487
|
+
| M08-T05 | Build 8 Grafana dashboards as JSON; wire via sidecar provisioning | 4d | Backend | W2–W3 | P1 | Ship `query-latency.json` first. Others follow as modules come online. |
|
|
488
|
+
| M08-T06 | Configure all alert rules; wire PagerDuty + Slack; write runbook for each alert | 2d | Backend | W2 | **P0** | `QueryLatencySLABreach` and `LSPDaemonRestartRate` must be live before CP1 (Week 4). |
|
|
489
|
+
| M08-T07 | Implement logging library: mandatory fields, security event helpers, middleware | 1d | Senior Backend | W1 | **P0** | Delivered as part of M08-T04 package. Compile-time field enforcement is non-negotiable. |
|
|
490
|
+
| M08-T08 | Elasticsearch security event index; ILM policy; schema; Kibana space | 2d | Security + Backend | W3–W4 | P1 | NFR-SEC-007. Security team owns Kibana; M08 owns index schema only. |
|
|
491
|
+
| M08-T09 | Add TDD-002 metrics: HNSW rebuild, embedding migration progress, `FilterAuthorizedRepos` latency | 1d | Backend | W5–W6 | P2 | Extend existing dashboards once M06's new RPCs are wired. |
|
|
492
|
+
| M08-T10 | Write CI gate enforcement: ESLint/Ruff rules, startup integration check, label validation | 1d | Backend | W2 | P1 | Without this, SDK adoption will drift. CI enforcement is the only reliable mechanism. |
|
|
493
|
+
|
|
494
|
+
**Total estimated effort:** ~19 days (Platform team; 2 engineers)
|
|
495
|
+
|
|
496
|
+
---
|
|
497
|
+
|
|
498
|
+
## §7 — Module Dependencies
|
|
499
|
+
|
|
500
|
+
M08 has no runtime dependency on any other ECIP module. All arrows point inward — every other module depends on M08.
|
|
501
|
+
|
|
502
|
+
| Module | Depends on M08 for | M08 Artifact | Required by Week |
|
|
503
|
+
|---|---|---|---|
|
|
504
|
+
| M01 — API Gateway | Trace injection, auth failure logging, request logging | `@ecip/observability`, Collector endpoint | **Week 1** |
|
|
505
|
+
| M03 — Knowledge Store | Write latency metrics, cache hit rate | `@ecip/observability`, Prometheus scrape config | **Week 1** |
|
|
506
|
+
| M07 — Event Bus | DLQ depth metrics, event processing latency | `@ecip/observability`, Prometheus scrape config | **Week 1** |
|
|
507
|
+
| M06 — Registry | RBAC denial events, gRPC duration metrics, `FilterAuthorizedRepos` latency | `@ecip/observability`, security event schema | Week 5 |
|
|
508
|
+
| M02 — Analysis Engine | Span wrapping around LSP calls, analysis duration metrics | `ecip-observability` (Python), Collector endpoint | Week 5 |
|
|
509
|
+
| M04 — Query Service | Query latency metrics, circuit breaker state logging, MCP call spans | `@ecip/observability`, Collector endpoint | Week 11 |
|
|
510
|
+
| M05 — MCP Server | Tool call duration metrics, RBAC denial logging | `@ecip/observability` | Week 17 |
|
|
511
|
+
|
|
512
|
+
M01, M03, and M07 all start in Week 1 or Week 2. M08's core infrastructure (Collector, Tempo, Prometheus, SDK) must be ready before any of them write production code. This is the hardest constraint in the M08 timeline.
|
|
513
|
+
|
|
514
|
+
---
|
|
515
|
+
|
|
516
|
+
## §8 — Testing Plan
|
|
517
|
+
|
|
518
|
+
### Unit Tests (coverage target: 80%)
|
|
519
|
+
|
|
520
|
+
| Test File | What It Validates |
|
|
521
|
+
|---|---|
|
|
522
|
+
| `metric-label-validation.test.ts` | All metric emissions include the exact label sets defined in the catalog. Uses a mock Prometheus registry to capture emitted metrics and validates label presence and types. |
|
|
523
|
+
| `alert-threshold-config.test.ts` | All YAML files in `alerts/` parse without error; all PromQL expressions are syntactically valid (`promtool check rules`); all referenced metrics exist in the catalog; all runbook file paths resolve on disk. |
|
|
524
|
+
| `log-schema-validation.test.ts` | JSON log output from `createLogger()` matches the mandatory field schema. Compile-time test: TypeScript build fails when mandatory fields are omitted (validated by compiling a deliberately broken test file and asserting non-zero exit code). |
|
|
525
|
+
| `security-events.test.ts` | `emitAuthFailure()` and `emitRbacDenial()` produce correct ECS-formatted events. No `user_id` raw values in output (only hashed). Events route to the correct OTel logger provider (not the general logger). |
|
|
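The core of the label check can be sketched as follows. The catalog entry shown is a placeholder assumption for illustration; the real required label sets come from the metrics catalog:

```typescript
// Sketch of the check in metric-label-validation.test.ts: a captured metric's
// labels must exactly match the catalog's required set (no extras, no gaps).
const labelCatalog: Record<string, string[]> = {
  // Placeholder entry; note user_id is deliberately absent (prohibited label).
  query_duration_ms: ["repo", "branch", "module", "cached"],
};

function labelsMatchCatalog(metric: string, labels: Record<string, string>): boolean {
  const required = labelCatalog[metric];
  if (required === undefined) return false; // metric missing from the catalog
  const got = Object.keys(labels).sort().join(",");
  return [...required].sort().join(",") === got;
}
```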
526
|
+
|
|
527
|
+
### Integration Tests (Testcontainers)
|
|
528
|
+
|
|
529
|
+
**`otel-pipeline-integration.test.ts`** — The most important test in M08.
|
|
530
|
+
|
|
531
|
+
Setup: Testcontainers spins up an OTel Collector container and a Grafana Tempo container. The `@ecip/observability` SDK is initialized pointing at the collector.
|
|
532
|
+
|
|
533
|
+
Assertions:
|
|
534
|
+
1. Emit 5 spans across a simulated async service boundary.
|
|
535
|
+
2. All 5 spans are retrievable from Tempo via HTTP API within 10 seconds.
|
|
536
|
+
3. `traceparent` header is correctly propagated across the async boundary (child spans have correct parent ID).
|
|
537
|
+
4. An error span (forced `status: ERROR`) is 100% sampled: it is present in Tempo despite the 5% default sampling rate.
|
|
538
|
+
5. A normal (non-error) span may or may not be present under the 5% sampling rate, so the test deliberately asserts nothing about it. Only error capture is asserted, which exercises the sampling rule without making the test flaky.
|
|
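Assertions 4 and 5 boil down to a simple keep/drop decision. A minimal model of that decision, assuming the 5% base rate from this document:

```typescript
// Minimal model of the tail-sampling rule: error traces are always kept;
// healthy traces are kept only when the random draw falls inside the base
// sampling rate.
function keepTrace(hasError: boolean, rand: number, baseRate = 0.05): boolean {
  return hasError || rand < baseRate;
}
```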
539
|
+
|
|
540
|
+
**Alert rule CI validation:**
|
|
541
|
+
```bash
|
|
542
|
+
# Run in CI on every PR touching alerts/
|
|
543
|
+
promtool check rules alerts/*.yaml
|
|
544
|
+
```
|
|
545
|
+
|
|
546
|
+
**Dashboard JSON lint:**
|
|
547
|
+
```bash
|
|
548
|
+
# Run in CI on every PR touching dashboards/
|
|
549
|
+
grafana-dashboard-lint dashboards/*.json
|
|
550
|
+
```
|
|
551
|
+
|
|
552
|
+
> **Why the Testcontainers test is non-negotiable:** Unit tests cannot catch a misconfigured collector pipeline, an mTLS cert error on the Tempo exporter, or a sampling rule that silently drops all traces. The integration test is the only meaningful correctness signal for the observability pipeline itself. It must run in CI on every PR, not just nightly.
|
|
553
|
+
|
|
554
|
+
### What M08 does not test
|
|
555
|
+
|
|
556
|
+
M08 does not test that other modules have correctly instrumented themselves. That responsibility belongs to each module's own test suite. M08's CI gate (§5) enforces minimum instrumentation standards; module teams own the quality of their own spans and log lines.
|
|
557
|
+
|
|
558
|
+
---
|
|
559
|
+
|
|
560
|
+
## §9 — Open Design Decisions
|
|
561
|
+
|
|
562
|
+
These decisions are not yet finalised and must be resolved before the relevant tasks begin.
|
|
563
|
+
|
|
564
|
+
**OD-01: Log aggregation backend for general application logs**
|
|
565
|
+
|
|
566
|
+
The current design routes security events to Elasticsearch but leaves the general application log destination unspecified. Options are: (a) Elasticsearch with a separate index, (b) Loki (Grafana-native, lower cost), (c) CloudWatch / GCP Logging if running on a managed cloud. This decision affects the Helm chart, the Fluent Bit / Fluentd sidecar configuration, and potentially the SDK (if the log backend requires a specific OTel exporter). Must be decided before M08-T01 is complete.
|
|
567
|
+
|
|
568
|
+
**OD-02: Collector resource limits in production**
|
|
569
|
+
|
|
570
|
+
The DaemonSet manifest specifies `limit_mib: 512` for the collector memory limiter. This was estimated for a 200-node cluster at expected trace volume with 5% sampling. It has not been load-tested. At Week 8, after M01, M03, and M07 are generating real traces, the actual memory profile must be measured and the limit updated. If volume exceeds estimates, the sampling rate is the first lever to pull (reduce from 5% to 2%), not the memory limit.
|
|
571
|
+
|
|
572
|
+
**OD-03: Dashboard ownership after Week 28**
|
|
573
|
+
|
|
574
|
+
The 8 dashboards in `dashboards/` are built and maintained by the Platform team during the build phase. Post-Week 28, who owns them? Module teams adding new metrics need to update the dashboards. Without a clear ownership model, dashboards will drift from the actual metrics being emitted. Recommendation: each module team owns the panels in the dashboards that relate to their metrics; Platform team owns the infrastructure-level panels (Collector health, Prometheus performance, Tempo storage).
|
|
575
|
+
|
|
576
|
+
---
|
|
577
|
+
|
|
578
|
+
## §10 — Risk Register
|
|
579
|
+
|
|
580
|
+
| Risk | Likelihood | Impact | Mitigation |
|
|
581
|
+
|---|---|---|---|
|
|
582
|
+
| Module teams skip SDK integration — observability blind spots in production | **High** | **High** | CI gate (M08-T10) blocks merges without instrumentation. SDK guide ready Week 1. Make it easier to integrate than to skip. |
|
|
583
|
+
| OTel Collector config error silently drops all trace data | **Medium** | **High** | Pipeline integration test (Testcontainers) catches this in CI. Config changes to `otel-collector-config.yaml` require Platform team review. Collector exposes its own health metrics at `:13133/healthz`. |
|
|
584
|
+
| High-cardinality Prometheus labels cause OOM | **Medium** | **High** | `user_id` prohibited as Prometheus label (enforced by label validation test). New metrics require Platform team review before shipping. Prometheus has a 10M series limit alert configured. |
|
|
585
|
+
| Tempo storage costs blow out with 14-day retention | **Medium** | **Low** | Tail-based sampling (5% healthy) reduces volume ~20x vs. 100% sampling. Review storage costs at Week 8 with real traffic data. Reduce sampling to 2% if needed before touching retention. |
|
|
586
|
+
| Security event pipeline mixes with general logs | **Low** | **High** | Architecturally separated: dedicated OTel logger provider, dedicated Collector logs pipeline, dedicated Elasticsearch index. `emitSecurityEvent()` is the only permitted path. Enforced by security-events unit test. |
|
|
587
|
+
| M08 infrastructure is itself unobservable | **Low** | **Medium** | OTel Collector exposes Prometheus metrics at `:8888`. Grafana has a collector health panel. Collector logs are shipped to the same log store as other services. The `monitoring` namespace has its own Prometheus alert for collector pod restarts. |
|
|
588
|
+
|
|
589
|
+
---
|
|
590
|
+
|
|
591
|
+
## §11 — Architect's Review Notes
|
|
592
|
+
|
|
593
|
+
These are design issues identified during the architectural review of TDD-001's M08 task list. They are not hypothetical concerns — they reflect gaps in the original specification that would have caused real problems during implementation.
|
|
594
|
+
|
|
595
|
+
### Review finding 1: The task list is not a design
|
|
596
|
+
|
|
597
|
+
TDD-001 §4.8 lists 8 tasks and a metrics table. That is a starting point, not an implementable specification. Missing from the original:
|
|
598
|
+
|
|
599
|
+
- No definition of the OTel Collector topology (DaemonSet vs. Deployment) and the rationale for it.
|
|
600
|
+
- No sampling strategy. The default OTel SDK samples 100% of traces, which is disastrously expensive at production volume.
|
|
601
|
+
- No specification of what "structured logging" means — just "deliver as logging library wrapper."
|
|
602
|
+
- No enforcement mechanism. A guide with no enforcement is a suggestion.
|
|
603
|
+
- No security event pipeline design. NFR-SEC-007 requires it, but TDD-001 only lists it as a task with no schema, no routing, no pipeline separation.
|
|
604
|
+
|
|
605
|
+
This document addresses all of the above.
|
|
606
|
+
|
|
607
|
+
### Review finding 2: Log aggregation destination is unspecified
|
|
608
|
+
|
|
609
|
+
TDD-001 mentions Elasticsearch for security events but says nothing about where general application logs go. This is a gap. The OTel Collector's `logs` pipeline needs an exporter target. Leaving it unspecified means the Infra team will make the decision independently, potentially choosing a backend that is incompatible with the Tempo + Grafana stack. This is captured as OD-01 above and must be resolved before M08-T01 closes.
|
|
610
|
+
|
|
611
|
+
### Review finding 3: `FilterAuthorizedRepos` latency alert was missing
|
|
612
|
+
|
|
613
|
+
TDD-002 adds `FilterAuthorizedRepos` as a new RPC to M06 with an explicit p95 < 20ms SLA (NFR-SEC-011). TDD-001's M08 task list predates TDD-002 and therefore has no alert for this SLA. The metrics catalog and alert rules in this document add `filter_authorized_repos_duration_ms` and the `FilterAuthorizedReposLatency` alert to cover it. This is not captured in M08-T09 (which only mentions dashboard additions) — the alert rule is a separate deliverable and must be added to M08-T06's scope.
|
|
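For intuition, the p95 the alert evaluates can be reproduced from cumulative histogram buckets the same way PromQL's `histogram_quantile` does (linear interpolation within the winning bucket). The bucket bounds below are illustrative assumptions, not the real metric's buckets:

```typescript
// Compute p95 from cumulative histogram buckets sorted by upper bound `le`.
function p95FromBuckets(buckets: Array<{ le: number; count: number }>): number {
  const total = buckets[buckets.length - 1].count;
  const target = 0.95 * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= target) {
      // Linear interpolation within the bucket, as histogram_quantile does.
      const fraction = (target - prevCount) / (b.count - prevCount);
      return prevLe + fraction * (b.le - prevLe);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}
```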
614
|
+
|
|
615
|
+
### Review finding 4: Dashboard delivery sequence matters
|
|
616
|
+
|
|
617
|
+
TDD-001 specifies all dashboards as a single 4-day task. In practice, a dashboard for `query_duration_ms` is useless before M04 exists, and a dashboard for `mcp_call_duration_ms` is useless before M05 exists. Delivering all 8 dashboards by Week 3 means 5 of them display empty panels for months.
|
|
618
|
+
|
|
619
|
+
The correct approach is to ship dashboards in module-dependency order:
|
|
620
|
+
|
|
621
|
+
1. **Week 2** — `lsp-daemon-health.json`, `event-bus-dlq.json` (M02, M07 start early)
|
|
622
|
+
2. **Week 3** — `cache-performance.json`, `analysis-throughput.json` (M03 foundation)
|
|
623
|
+
3. **Week 4** — `security-events.json` (tied to M08-T08, not module timelines)
|
|
624
|
+
4. **Week 11** — `query-latency.json`, `mcp-call-graph.json` (M04 comes online)
|
|
625
|
+
5. **Week 17** — `cross-repo-fanout.json` (M05 comes online)
|
|
626
|
+
|
|
627
|
+
This does not change the total effort; it changes when effort is spent and ensures dashboards are immediately useful when delivered.
|
|
628
|
+
|
|
629
|
+
### Review finding 5: The CI gate is load-bearing, not optional
|
|
630
|
+
|
|
631
|
+
The SDK integration guide says "all teams should use `@ecip/observability`." Without a CI gate, this will not happen. Engineers under deadline pressure skip instrumentation. Two months into the build, when the first production incident occurs and traces are missing for 3 of 7 modules, the cost of retrofitting is high and the debugging is impossible.
|
|
632
|
+
|
|
633
|
+
M08-T10 (CI gate enforcement) is listed as P1 in the task breakdown. Architecturally it should be P0, delivering alongside M08-T04 in Week 1. The CI gate is the enforcement mechanism that makes the SDK guide meaningful. Without it, the SDK guide is optional documentation.
|
|
634
|
+
|
|
635
|
+
---
|
|
636
|
+
|
|
637
|
+
*ECIP-M08-MDD · Observability Stack · Module Design Document · Rev 1.1 · March 2026*
|
|
638
|
+
*Confidential — Internal Engineering Use Only · Platform Team*
|
|
639
|
+
*This document supersedes ECIP-TDD-001 §4.8 for implementation purposes.*
|