ecip-observability-stack 1.0.0

Files changed (76)
  1. package/CLAUDE.md +48 -0
  2. package/README.md +75 -0
  3. package/alerts/analysis-backlog.yaml +39 -0
  4. package/alerts/cache-degradation.yaml +44 -0
  5. package/alerts/dlq-depth.yaml +56 -0
  6. package/alerts/lsp-daemon.yaml +43 -0
  7. package/alerts/mcp-latency.yaml +46 -0
  8. package/alerts/security-anomaly.yaml +59 -0
  9. package/alerts/sla-latency.yaml +61 -0
  10. package/chaos/kafka-broker-restart.sh +168 -0
  11. package/chaos/kill-lsp-daemon.sh +148 -0
  12. package/chaos/redis-node-failure.sh +318 -0
  13. package/ci/check-observability-contract.js +285 -0
  14. package/ci/eslint-plugin-ecip/index.js +209 -0
  15. package/ci/eslint-plugin-ecip/package.json +12 -0
  16. package/ci/github-actions-observability-gate.yaml +180 -0
  17. package/ci/ruff-shared.toml +41 -0
  18. package/collector/otel-collector-config.yaml +226 -0
  19. package/collector/otel-collector-daemonset.yaml +168 -0
  20. package/collector/sampling-config.yaml +83 -0
  21. package/dashboards/_provisioning/grafana-dashboards.yaml +16 -0
  22. package/dashboards/analysis-throughput.json +166 -0
  23. package/dashboards/cache-performance.json +129 -0
  24. package/dashboards/cross-repo-fanout.json +93 -0
  25. package/dashboards/event-bus-dlq.json +129 -0
  26. package/dashboards/lsp-daemon-health.json +104 -0
  27. package/dashboards/mcp-call-graph.json +114 -0
  28. package/dashboards/query-latency.json +160 -0
  29. package/dashboards/security-events.json +131 -0
  30. package/docs/M08-Observability-Design.md +639 -0
  31. package/docs/PROGRESS.md +375 -0
  32. package/docs/module-documentation.md +64 -0
  33. package/elasticsearch/ilm-policy.json +57 -0
  34. package/elasticsearch/index-template.json +62 -0
  35. package/elasticsearch/kibana-space.yaml +53 -0
  36. package/helm/Chart.yaml +30 -0
  37. package/helm/templates/configmaps.yaml +25 -0
  38. package/helm/templates/elasticsearch.yaml +68 -0
  39. package/helm/templates/grafana-secret.yaml +22 -0
  40. package/helm/templates/grafana.yaml +19 -0
  41. package/helm/templates/loki.yaml +33 -0
  42. package/helm/templates/otel-collector.yaml +119 -0
  43. package/helm/templates/prometheus.yaml +43 -0
  44. package/helm/templates/tempo.yaml +16 -0
  45. package/helm/values.prod.yaml +159 -0
  46. package/helm/values.yaml +146 -0
  47. package/logging-lib/nodejs/package.json +57 -0
  48. package/logging-lib/nodejs/pnpm-lock.yaml +4576 -0
  49. package/logging-lib/python/pyproject.toml +45 -0
  50. package/logging-lib/python/src/__init__.py +19 -0
  51. package/logging-lib/python/src/logger.py +131 -0
  52. package/logging-lib/python/src/security_events.py +150 -0
  53. package/logging-lib/python/src/tracer.py +185 -0
  54. package/logging-lib/python/tests/test_logger.py +113 -0
  55. package/package.json +21 -0
  56. package/prometheus/prometheus-values.yaml +170 -0
  57. package/prometheus/recording-rules.yaml +97 -0
  58. package/prometheus/scrape-configs.yaml +122 -0
  59. package/runbooks/SDK-INTEGRATION.md +239 -0
  60. package/runbooks/alert-response/ANALYSIS_BACKLOG.md +128 -0
  61. package/runbooks/alert-response/DLQ_DEPTH_EXCEEDED.md +150 -0
  62. package/runbooks/alert-response/HIGH_QUERY_LATENCY.md +134 -0
  63. package/runbooks/alert-response/LSP_DAEMON_RESTART.md +118 -0
  64. package/runbooks/alert-response/SECURITY_ANOMALY.md +160 -0
  65. package/runbooks/dashboard-guide.md +169 -0
  66. package/scripts/lint-dashboards.js +184 -0
  67. package/tempo/tempo-datasource.yaml +46 -0
  68. package/tempo/tempo-values.yaml +94 -0
  69. package/tests/alert-threshold-config.test.ts +283 -0
  70. package/tests/log-schema-validation.test.ts +246 -0
  71. package/tests/metric-label-validation.test.ts +292 -0
  72. package/tests/otel-pipeline-integration.test.ts +420 -0
  73. package/tests/security-events.test.ts +417 -0
  74. package/tsconfig.json +17 -0
  75. package/vitest.config.ts +21 -0
  76. package/vitest.integration.config.ts +9 -0
# M08 — Observability Stack
## Module Design Document

> **Document ID:** ECIP-M08-MDD · **Revision:** 1.1 · **Date:** March 2026
> **Status:** Design Complete — Ready for Implementation
> **Supplements:** ECIP-TDD-001 §4.8, ECIP-TDD-002 §9
> **Team:** Platform / Infra · **Classification:** Confidential — Internal Engineering Use Only

---

## Table of Contents

- [§0 — Architect's Note](#0--architects-note)
- [§1 — Module Overview](#1--module-overview)
  - [Responsibilities](#responsibilities)
  - [Non-Goals](#non-goals)
- [§2 — Folder Structure](#2--folder-structure)
  - [Top-Level Layout](#top-level-layout)
  - [Key File Explanations](#key-file-explanations)
- [§3 — Four Pillars: Detailed Design](#3--four-pillars-detailed-design)
  - [Pillar 1: Distributed Tracing](#pillar-1-distributed-tracing)
  - [Pillar 2: Metrics Collection](#pillar-2-metrics-collection)
  - [Pillar 3: Structured Log Aggregation](#pillar-3-structured-log-aggregation)
  - [Pillar 4: Alerting](#pillar-4-alerting)
- [§4 — Security Event Pipeline](#4--security-event-pipeline)
- [§5 — SDK Integration Contract](#5--sdk-integration-contract)
- [§6 — Task Breakdown](#6--task-breakdown)
- [§7 — Module Dependencies](#7--module-dependencies)
- [§8 — Testing Plan](#8--testing-plan)
- [§9 — Open Design Decisions](#9--open-design-decisions)
- [§10 — Risk Register](#10--risk-register)
- [§11 — Architect's Review Notes](#11--architects-review-notes)

---

## §0 — Architect's Note

M08 is not like the other 7 modules. It has no business logic. It owns no domain data. It ships no user-facing features. Its entire job is to make every other module's behaviour observable in production — and that makes it the most important thing to get right before anyone else writes a single line of production code.

**Three rules govern this module:**

1. **M08 ships in Week 1, before any other module integration.** The SDK guide and shared logging library are Day 1 deliverables. A module that starts without them will create permanent blind spots that are expensive to retrofit.

2. **Silent failure is the primary failure mode of observability infrastructure.** Instrumentation appearing to work while producing no data is far worse than outright failure — it creates false confidence. Every design decision in this document is oriented around making failures loud.

3. **M08 must not become a performance bottleneck.** Observability infrastructure that adds >5ms to the hot query path (M01 → M04 → M03) is worse than no observability. The DaemonSet topology, async span export, and tail-based sampling rules are all chosen for this reason.

> **Architect's sign-off:** This design supersedes the TDD-001 §4.8 task list. The task list was a starting point; this document is the implementable specification. Teams should reference this document, not TDD-001, for M08 implementation details.

---

## §1 — Module Overview

| Field | Value |
|---|---|
| **Module** | M08 — Observability Stack |
| **Type** | Cross-Cutting Infrastructure |
| **Team** | Platform / Infra |
| **Stack** | OTel Collector · Grafana Tempo · Prometheus · Grafana · Elasticsearch |
| **Timeline** | Week 1 → Week 28 (continuous; SDK guide Week 1, dashboards iterative) |
| **Deployment** | Kubernetes (DaemonSet for collector, Deployments for backends) |
| **Consumes** | Nothing — M08 has no runtime dependency on any other ECIP module |
| **Consumed by** | All modules (M01–M07) via `@ecip/observability` SDK |

**Purpose:** Distributed tracing, metrics collection, structured log aggregation, and alerting across all 7 ECIP modules via OpenTelemetry auto-instrumentation. Minimal per-module code changes are required — the shared library handles all boilerplate.

### Responsibilities

- **Distributed tracing** — End-to-end W3C TraceContext traces from M01 through all downstream services, stored in Grafana Tempo with 14-day retention.
- **Metrics collection** — Standard Kubernetes metrics plus a curated catalog of application-level histograms, counters, and gauges scraped by Prometheus.
- **Structured log aggregation** — A mandatory JSON log schema enforced at compile time via the shared logging library. All services emit structured logs; unstructured `console.log` is a PR-review violation.
- **Alerting** — SLA-backed Prometheus alert rules routed to PagerDuty and Slack. Every alert has a corresponding runbook.
- **Security event logging** — Auth failures and RBAC denials routed to a dedicated Elasticsearch index for SIEM use. Entirely separate pipeline from general application logs (NFR-SEC-007).
- **Grafana dashboards** — Eight pre-built dashboards delivered as dashboard-as-code JSON, auto-provisioned via Grafana sidecar.
- **SDK and integration guide** — Published packages (`@ecip/observability` for Node.js, `ecip-observability` for Python) plus a runbook targeting 30-minute integration time per team.

### Non-Goals

- **SIEM analysis and alert rules** — Security team operates the Elasticsearch SIEM layer. M08 writes the raw events; the security team owns all query logic above that.
- **Business-level metrics** (e.g., repos indexed per org, daily active users) — Owned by individual product modules. M08 provides the emission infrastructure only.
- **Log retention beyond 14 days for traces, 30 days for application logs** — Retention policy is a platform decision; storage provisioning is an infra decision. Neither is M08's call.
- **APM continuous profiling** (e.g., Pyroscope) — Out of scope for V1. Re-evaluate at Week 22 performance review.
- **Real user monitoring / browser RUM** — ECIP is a backend-only platform in V1.
- **On-call schedule management** — M08 wires alerts to PagerDuty; on-call rotation configuration is a team ops concern.

---

## §2 — Folder Structure

### Top-Level Layout

```
ecip-observability/                        # M08 root — Platform team repo

├── collector/                             # OTel Collector configuration
│   ├── otel-collector-config.yaml         # OTLP receiver, processors, exporters
│   ├── otel-collector-daemonset.yaml      # K8s DaemonSet manifest
│   └── sampling-config.yaml               # Tail sampling rules (5% default / 100% on error)

├── dashboards/                            # Grafana dashboard-as-code (JSON)
│   ├── query-latency.json                 # p50/p95/p99 per mode (lsp/vector/hybrid)
│   ├── analysis-throughput.json           # Events processed / backlog / Kafka consumer lag
│   ├── cache-performance.json             # Hit rate by cache_type and repo
│   ├── lsp-daemon-health.json             # Daemon status, restart rate, OOM events
│   ├── mcp-call-graph.json                # MCP fan-out topology, latency per target_repo
│   ├── event-bus-dlq.json                 # DLQ depth, DLQ age, retry counts
│   ├── cross-repo-fanout.json             # Fan-out depth distribution, cycle warnings
│   ├── security-events.json               # Auth failures, RBAC denials (from Elasticsearch)
│   └── _provisioning/
│       └── grafana-dashboards.yaml        # Grafana sidecar auto-provision config

├── alerts/                                # Prometheus alerting rules
│   ├── sla-latency.yaml                   # query_duration_ms p95 > 1500ms
│   ├── analysis-backlog.yaml              # event backlog > 1000 events
│   ├── lsp-daemon.yaml                    # restart_rate > 2/hour
│   ├── dlq-depth.yaml                     # DLQ depth > 100
│   ├── mcp-latency.yaml                   # mcp_call_duration_ms p95 > 800ms
│   ├── cache-degradation.yaml             # cache_hit_rate < 60%
│   └── security-anomaly.yaml              # Auth failure burst > 10 in 5 min

├── logging-lib/                           # Shared structured logging library (published package)
│   ├── nodejs/                            # For M01, M04, M05, M07 (TypeScript services)
│   │   ├── src/
│   │   │   ├── logger.ts                  # Pino wrapper — mandatory fields enforced at type level
│   │   │   ├── tracer.ts                  # OTel SDK init + trace/span helpers
│   │   │   ├── security-events.ts         # emitAuthFailure() / emitRbacDenial() helpers
│   │   │   └── middleware.ts              # Express/Fastify: injects trace_id into request context
│   │   ├── package.json                   # Published as @ecip/observability
│   │   └── tests/
│   │       ├── logger.test.ts
│   │       └── tracer.test.ts
│   │
│   └── python/                            # For M02 (Analysis Engine — Python components)
│       ├── src/
│       │   ├── logger.py                  # structlog wrapper with mandatory fields
│       │   ├── tracer.py                  # OTel Python SDK init + @traced decorator
│       │   └── security_events.py
│       ├── pyproject.toml                 # Published as ecip-observability
│       └── tests/
│           └── test_logger.py

├── prometheus/                            # Prometheus deployment
│   ├── prometheus-values.yaml             # Helm values for kube-prometheus-stack
│   ├── scrape-configs.yaml                # Per-service scrape targets (updated as modules ship)
│   └── recording-rules.yaml               # Pre-computed rate/ratio metrics for dashboard perf

├── tempo/                                 # Grafana Tempo trace storage
│   ├── tempo-values.yaml                  # Helm values: S3 backend, 14-day retention
│   └── tempo-datasource.yaml              # Grafana datasource provisioning

├── elasticsearch/                         # SIEM security event index (NFR-SEC-007)
│   ├── index-template.json                # Security event schema + field mappings
│   ├── ilm-policy.json                    # ILM: 90-day hot → 1-year cold → delete
│   └── kibana-space.yaml                  # Security team Kibana space config

├── helm/                                  # Umbrella Helm chart — deploys entire M08 stack
│   ├── Chart.yaml
│   ├── values.yaml                        # Default values (dev/staging)
│   ├── values.prod.yaml                   # Production overrides
│   └── templates/
│       ├── otel-collector.yaml
│       ├── prometheus.yaml
│       ├── grafana.yaml
│       ├── tempo.yaml
│       ├── elasticsearch.yaml
│       └── configmaps.yaml                # Alert rule + dashboard provisioning refs

├── runbooks/                              # Operations runbooks and SDK integration guide
│   ├── SDK-INTEGRATION.md                 # M08-T04: 30-min setup guide for all teams
│   ├── alert-response/
│   │   ├── LSP_DAEMON_RESTART.md
│   │   ├── HIGH_QUERY_LATENCY.md
│   │   ├── DLQ_DEPTH_EXCEEDED.md
│   │   ├── ANALYSIS_BACKLOG.md
│   │   └── SECURITY_ANOMALY.md
│   └── dashboard-guide.md

└── tests/                                 # M08 validation tests
    ├── metric-label-validation.test.ts    # Assert all required labels present on emission
    ├── alert-threshold-config.test.ts     # Parse + validate all alert YAML files
    ├── log-schema-validation.test.ts      # JSON log shape + compile-time field enforcement
    └── otel-pipeline-integration.test.ts  # Testcontainers: Collector → Tempo end-to-end
```

### Key File Explanations

**`collector/otel-collector-config.yaml`** — The most operationally critical file in M08. It defines three separate pipelines: `traces` (to Tempo), `metrics` (to Prometheus), and `logs` (to Elasticsearch, security events only). An error in this file silently drops all observability data with no visible failure to services. This file must be version-controlled and change-reviewed with the same rigour as application code.

**`logging-lib/nodejs/src/logger.ts`** — The single most consumed M08 artifact. Every Node.js service imports this. The mandatory fields (`trace_id`, `span_id`, `repo`, `branch`, `user_id`, `module`) are enforced at the TypeScript type level — missing any of them is a compilation failure. This is intentional. The alternative, runtime validation, is too easily bypassed.
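As an illustration of the type-level enforcement described above, a minimal sketch might look like the following. The field names match the documented schema, but the interface and function names are hypothetical, not the published `@ecip/observability` API:

```typescript
// Hypothetical sketch — shows how mandatory log fields can be made a
// compile-time requirement. Omitting any MandatoryFields property makes
// the makeLogLine() call fail to compile.
interface MandatoryFields {
  trace_id: string;
  span_id: string;
  repo: string;    // "{org}/{repo}"
  branch: string;
  user_id: string; // hashed identifier — no raw PII
  module: 'M01' | 'M02' | 'M03' | 'M04' | 'M05' | 'M06' | 'M07';
}

type LogLevel = 'trace' | 'debug' | 'info' | 'warn' | 'error' | 'fatal';

function makeLogLine(
  level: LogLevel,
  ctx: MandatoryFields, // every mandatory field required by the type checker
  msg: string,
  extra: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    ...ctx,
    msg,
    ...extra, // arbitrary structured fields follow the schema
  });
}

const line = makeLogLine(
  'info',
  {
    trace_id: 'd9f3c1a8-2e7b-44cc-9b1e-0a0a0a0a0a0a',
    span_id: 'f4a912b3',
    repo: 'acme-corp/auth-service',
    branch: 'main',
    user_id: 'u_8f3a1c',
    module: 'M04',
  },
  'Query resolved via hybrid mode',
  { duration_ms: 43, cached: false },
);
```

The real wrapper binds the context once via `createLogger()` rather than per call, but the compile-time contract is the same.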

**`alerts/sla-latency.yaml`** — Directly governs whether NFR-AVL-001 (99.9% query-path availability) breaches are detected. The PromQL expression must use `histogram_quantile(0.95, ...)` over the histogram's bucket rates, not an average — averaging latency hides tail problems. This is a common mistake and worth explicit attention during review.
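A sketch of what that rule might look like. The group name and annotation text are illustrative; the expression shows the required `histogram_quantile`-over-bucket-rates form:

```yaml
groups:
  - name: ecip-sla-latency
    rules:
      - alert: QueryLatencySLABreach
        # p95 from histogram buckets — never avg(query_duration_ms)
        expr: |
          histogram_quantile(0.95,
            sum(rate(query_duration_ms_bucket[5m])) by (le)) > 1500
        for: 5m
        labels:
          severity: critical
        annotations:
          runbook: runbooks/alert-response/HIGH_QUERY_LATENCY.md
```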

**`runbooks/SDK-INTEGRATION.md`** — The highest-leverage document M08 produces. Every other module's observability quality is directly proportional to how well teams follow this guide. It must be ready by end of Week 1, before any other module writes production code. Treat it as a P0 deliverable, not documentation.

---

## §3 — Four Pillars: Detailed Design

### Pillar 1: Distributed Tracing

**Stack:** OTel Collector (DaemonSet) → Grafana Tempo (S3-backed) → Grafana UI

Every external request is injected with a W3C `traceparent` header at M01. The trace propagates automatically through every downstream gRPC and HTTP call via the OTel SDK's auto-instrumentation. Engineers debugging a slow query can jump from the Grafana latency dashboard to a complete flamegraph-style trace in Tempo in one click, with the trace ID correlating directly to the log entry that fired the alert.

**Topology decision: DaemonSet, not central Deployment**

The OTel Collector runs as a Kubernetes DaemonSet — one collector pod per node, receiving spans from all pods on the same node via localhost. This is a deliberate architectural choice:

- A central Deployment collector is a single point of failure; a DaemonSet crash only affects pods on that node.
- Localhost communication avoids network latency on the critical span export path.
- Horizontal scale is automatic as nodes are added to the cluster.

The tradeoff is that each collector pod must be configured identically, which is managed entirely by the Helm chart. No manual per-node configuration is ever required.

**Sampling strategy**

Tail-based sampling is configured at the collector, not the SDK. This means sampling decisions are made after the full trace is assembled, enabling intelligent rules:

| Rule | Rate | Condition |
|---|---|---|
| `errors-always-sample` | 100% | Any span with status `ERROR` |
| `slow-queries-sample` | 100% | Any trace with total latency > 1000ms |
| `default` | 5% | All other traces |

Head-based sampling (sampling at trace start) is explicitly rejected because it would drop most error traces before they complete. Tail-based sampling with a 10-second decision window ensures errors are always captured.
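In `sampling-config.yaml`, these three rules map onto the collector's `tail_sampling` processor roughly as follows (a sketch; the exact file layout is the implementation's call):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # the 10-second decision window above
    policies:
      - name: errors-always-sample
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-queries-sample
        type: latency
        latency: {threshold_ms: 1000}
      - name: default
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```

Policies are evaluated per trace; a trace matching any 100% policy is kept regardless of the probabilistic default.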

**Trace propagation flow**

```
Client Request
      │
      ▼
M01 (API Gateway)     → inject traceparent header
      │               → start root span: "gateway.route"
      ▼
M04 (Query Service)   → child span: "query.intent_classification"
      │               → child span: "query.context_fusion"
      │               → child span: "query.filter_authorized_repos" [gRPC to M06]
      ├──────────────────────────────────────────────────────────────┐
      ▼                                                              ▼
M03 (Knowledge Store)                                   M06 (Registry)
  child span: "knowledge.vector_search"                   child span: "registry.check_access"
  child span: "knowledge.symbol_lookup"                   child span: "registry.filter_repos"
      │
      ▼ (if cross-repo, depth ≤ 2)
M05 (MCP Server) × N  → child span: "mcp.tool_call" [per target_repo, in parallel]
      │
      ▼
All spans → OTel Collector (localhost:4317) → Grafana Tempo (S3)
```

### Pillar 2: Metrics Collection

**Stack:** OTel SDK (emit) → OTel Collector → Prometheus → Grafana

All services expose a `/metrics` endpoint scraped by Prometheus. The scrape interval is 15 seconds for application metrics and 30 seconds for infrastructure metrics. Recording rules in `prometheus/recording-rules.yaml` pre-compute expensive quantile and rate calculations so Grafana dashboard queries are fast even at scale.
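For illustration, a recording rule that pre-computes the p95 used by the query-latency dashboard might look like this (rule names and the exact label grouping are a sketch, not the contents of `recording-rules.yaml`):

```yaml
groups:
  - name: ecip-recording-rules
    interval: 15s
    rules:
      # Pre-computed p95 so dashboards query one series instead of
      # running histogram_quantile over raw buckets on every refresh.
      - record: ecip:query_duration_ms:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(query_duration_ms_bucket[5m])) by (le, repo, mode))
      - record: ecip:cache_hit_rate:avg_5m
        expr: avg by (cache_type, repo) (avg_over_time(cache_hit_rate[5m]))
```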

**Metrics Catalog**

This is the authoritative list of application-level metrics M08 defines. All other modules must instrument exactly these metrics with exactly these label sets. Additional metrics may be added by modules but must be reviewed by the Platform team before shipping to prevent high-cardinality label explosions.

| Metric Name | Type | Source | Labels | Alert Threshold |
|---|---|---|---|---|
| `query_duration_ms` | Histogram | M04 | `repo`, `mode` (lsp/vector/hybrid), `cached` | p95 > 1500ms → CRITICAL |
| `analysis_duration_ms` | Histogram | M02 | `repo`, `branch_type` (trunk/pr), `language` | p95 > 120000ms → WARN |
| `cache_hit_rate` | Gauge | M03, M04 | `cache_type` (symbol/query), `repo` | < 60% → WARN |
| `lsp_daemon_restarts_total` | Counter | M02 | `repo`, `language` | > 2/hour → CRITICAL |
| `mcp_call_duration_ms` | Histogram | M04 | `target_repo`, `tool_name` | p95 > 800ms → WARN |
| `event_bus_dlq_depth` | Gauge | M07 | `topic` | > 100 → CRITICAL |
| `cross_repo_fanout_count` | Histogram | M04 | `depth`, `repo` | — (dashboard only) |
| `rbac_denial_total` | Counter | M01, M06 | `resource`, `action` | > 10 in 5min → SECURITY WARN |
| `auth_failure_total` | Counter | M01 | `reason` (expired/invalid/missing) | > 10 in 5min → SECURITY WARN |
| `grpc_request_duration_ms` | Histogram | M01, M04, M06 | `service`, `method`, `status_code` | p95 > 500ms → WARN |
| `knowledge_store_write_duration_ms` | Histogram | M03 | `store_type` (redis/pgvector), `namespace` | p95 > 200ms → WARN |
| `hnsw_rebuild_duration_ms` | Histogram | M03 | `repo` | — (TDD-002; dashboard only) |
| `filter_authorized_repos_duration_ms` | Histogram | M06 | `org` | p95 > 20ms → WARN (NFR-SEC-011) |
| `embedding_migration_progress` | Gauge | M02, M03 | `repo`, `phase` (backfill/shadow/cutover) | — (TDD-002; dashboard only) |

> **Label cardinality rule:** `user_id` is explicitly prohibited as a Prometheus label. User-scoped metrics belong in Elasticsearch (security events) or application logs. A single high-cardinality label like `user_id` can cause Prometheus OOM on a busy cluster. This rule is enforced by the metric label validation test.
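The shape of the check performed by `tests/metric-label-validation.test.ts` can be sketched as follows. The catalog subset and helper name are illustrative, not the actual test code:

```typescript
// Hypothetical sketch of the label-contract check: every metric must carry
// exactly its catalog labels, and prohibited high-cardinality labels are
// rejected outright.
const REQUIRED_LABELS: Record<string, string[]> = {
  query_duration_ms: ['repo', 'mode', 'cached'],
  cache_hit_rate: ['cache_type', 'repo'],
  event_bus_dlq_depth: ['topic'],
};

const PROHIBITED_LABELS = ['user_id']; // user scope goes to logs/Elasticsearch

function validateLabels(
  metric: string,
  labels: Record<string, string>,
): string[] {
  const errors: string[] = [];
  for (const l of REQUIRED_LABELS[metric] ?? []) {
    if (!(l in labels)) errors.push(`${metric}: missing required label "${l}"`);
  }
  for (const l of PROHIBITED_LABELS) {
    if (l in labels) errors.push(`${metric}: prohibited label "${l}"`);
  }
  return errors;
}
```

Running this against an emission that sneaks in `user_id` fails loudly, which is exactly the behaviour the cardinality rule demands.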

### Pillar 3: Structured Log Aggregation

**Stack:** `@ecip/observability` logger → stdout (JSON) → Fluentd/Fluent Bit → central log store

All services must use the M08-published logging library. Raw `console.log`, `print()`, or any other unstructured log emission is a contract violation, caught at PR review, and treated as a build failure in CI.

**Mandatory JSON log schema**

```json
{
  "timestamp": "2026-03-01T10:00:00.123Z",  // ISO 8601, always UTC
  "level": "info",                          // trace | debug | info | warn | error | fatal
  "module": "M04",                          // which ECIP module emitted this
  "trace_id": "d9f3c1a8-2e7b-44cc-...",     // W3C TraceContext traceId
  "span_id": "f4a912b3...",                 // active span ID at emission time
  "repo": "acme-corp/auth-service",         // {org}/{repo} — required
  "branch": "main",                         // branch context of the operation
  "user_id": "u_8f3a1c",                    // hashed user identifier — no raw PII
  "env": "production",
  "msg": "Query resolved via hybrid mode",
  // ... arbitrary structured fields follow
  "query_type": "hybrid",
  "duration_ms": 43,
  "cached": false
}
```

`trace_id`, `span_id`, `repo`, `branch`, `user_id`, and `module` are mandatory at the TypeScript type level. Missing them is a compilation failure. The Python wrapper raises `MissingObservabilityContext` at import time if the OTel SDK was not initialized.

**What goes in logs vs. metrics vs. traces**

This is a common source of confusion and must be explicit:

- **Metrics** — Aggregatable numeric values: latency histograms, counters, rates. If you need to know "how many times did X happen", it's a metric.
- **Logs** — Discrete events with rich contextual detail. If you need to know "what were the exact parameters when X failed", it's a log.
- **Traces** — Causal chains across service boundaries. If you need to know "which downstream call made this request slow", it's a trace.

Never duplicate data across all three. A query duration belongs in a histogram metric and in the trace span duration — not also in a log line.

### Pillar 4: Alerting

**Stack:** Prometheus → Alertmanager → PagerDuty / Slack

All alert rules live in `alerts/`. Every alert must have: a corresponding entry in the metrics catalog, a PromQL expression reviewed by the Platform team, a severity level, a routing target, and a runbook document in `runbooks/alert-response/`.

**Alert rules**

| Alert Name | PromQL Condition | For | Severity | Routes To | Runbook |
|---|---|---|---|---|---|
| `QueryLatencySLABreach` | `histogram_quantile(0.95, rate(query_duration_ms_bucket[5m])) > 1500` | 5min | CRITICAL | PagerDuty + Slack #alerts | `HIGH_QUERY_LATENCY.md` |
| `AnalysisBacklogCritical` | `sum(kafka_consumergroup_lag{group=~"ecip-analysis.*"}) > 1000` | 10min | CRITICAL | PagerDuty + Slack #alerts | `ANALYSIS_BACKLOG.md` |
| `LSPDaemonRestartRate` | `increase(lsp_daemon_restarts_total[1h]) > 2` | 0min | CRITICAL | PagerDuty + Slack #alerts | `LSP_DAEMON_RESTART.md` |
| `DLQDepthExceeded` | `event_bus_dlq_depth > 100` | 5min | CRITICAL | PagerDuty + Slack #alerts | `DLQ_DEPTH_EXCEEDED.md` |
| `MCPCallLatencyWarn` | `histogram_quantile(0.95, rate(mcp_call_duration_ms_bucket[5m])) > 800` | 10min | WARNING | Slack #alerts-warn | `HIGH_QUERY_LATENCY.md` |
| `CacheHitRateDegraded` | `cache_hit_rate < 0.60` | 15min | WARNING | Slack #alerts-warn | `HIGH_QUERY_LATENCY.md` |
| `FilterAuthorizedReposLatency` | `histogram_quantile(0.95, rate(filter_authorized_repos_duration_ms_bucket[5m])) > 20` | 5min | WARNING | Slack #alerts-warn | `HIGH_QUERY_LATENCY.md` |
| `SecurityAuthBurst` | `increase(auth_failure_total[5m]) > 10` | 0min | CRITICAL | PagerDuty + Slack #security | `SECURITY_ANOMALY.md` |
| `SecurityRBACDenialBurst` | `increase(rbac_denial_total[5m]) > 10` | 0min | WARNING | Slack #security | `SECURITY_ANOMALY.md` |

> **Architect's note on `LSPDaemonRestartRate`:** The `for: 0min` duration (fires immediately) is intentional. An LSP daemon crash is always a significant event; there is no grace period. The circuit breaker in M04 handles the graceful degradation; the alert gets a human involved in parallel.

---

## §4 — Security Event Pipeline

Security events are a distinct category from application logs. They are governed by NFR-SEC-007 and require a separate OTel pipeline that routes exclusively to Elasticsearch, never to the general log store.

**What constitutes a security event**

- JWT validation failure at M01 (expired, invalid signature, missing)
- RBAC denial at M01 or M06 (user lacks required role for resource)
- Service-to-service authentication failure (mTLS or service token rejection)
- Any access attempt to a repo not in the user's authorized set (detected by M06's `FilterAuthorizedRepos`)

**Security event schema**

```json
{
  "@timestamp": "2026-03-01T10:00:00.123Z",
  "event.kind": "event",
  "event.category": "authentication",   // or "authorization"
  "event.type": "denied",
  "event.outcome": "failure",
  "trace.id": "d9f3c1a8-2e7b-44cc-...", // correlates to Tempo trace
  "user.id": "u_8f3a1c",                // hashed — no raw PII
  "source.ip": "10.0.14.22",
  "resource": "acme-corp/auth-service",
  "action": "read",
  "reason": "rbac_insufficient_role",   // or "jwt_expired" | "jwt_invalid"
  "module": "M01"
}
```

**Pipeline separation**

Security events must never leak into general application logs and must never be emitted via the general `logger` object. The `emitSecurityEvent()` helper in `security-events.ts` is the only permitted emission path. It uses a separate OTel logger provider configured to route exclusively to the security Elasticsearch pipeline. This separation is enforced architecturally — it cannot be accidentally bypassed by a developer using the standard logger.
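A minimal sketch of the event construction behind such a helper, using the ECS-style field names from the schema above. The builder name and input shape are illustrative, not the actual `security-events.ts` API:

```typescript
// Hypothetical sketch — assembles an RBAC-denial event in the documented
// schema. The real helper additionally hands the event to a dedicated OTel
// logger provider wired only to the security Elasticsearch pipeline.
interface RbacDenialInput {
  userId: string;   // already hashed — no raw PII
  resource: string; // "{org}/{repo}"
  action: string;
  reason: string;
  traceId: string;
  sourceIp: string;
  module: string;
}

function buildRbacDenialEvent(input: RbacDenialInput) {
  return {
    '@timestamp': new Date().toISOString(),
    'event.kind': 'event',
    'event.category': 'authorization',
    'event.type': 'denied',
    'event.outcome': 'failure',
    'trace.id': input.traceId,
    'user.id': input.userId,
    'source.ip': input.sourceIp,
    resource: input.resource,
    action: input.action,
    reason: input.reason,
    module: input.module,
  };
}

const ev = buildRbacDenialEvent({
  userId: 'u_8f3a1c',
  resource: 'acme-corp/auth-service',
  action: 'read',
  reason: 'rbac_insufficient_role',
  traceId: 'd9f3c1a8-2e7b-44cc-9b1e-0a0a0a0a0a0a',
  sourceIp: '10.0.14.22',
  module: 'M06',
});
```

Because the only export is the dedicated helper, there is no code path by which this payload can reach the general log store.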

**Elasticsearch index lifecycle**

```
0–90 days    → hot tier (SSD, fast query)
90–365 days  → cold tier (standard/object storage)
> 365 days   → delete
```
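A sketch of how `elasticsearch/ilm-policy.json` might encode those boundaries using standard ILM phases (the `set_priority` actions are illustrative; the actual policy file is authoritative):

```json
{
  "policy": {
    "phases": {
      "hot":    { "min_age": "0ms",  "actions": { "set_priority": { "priority": 100 } } },
      "cold":   { "min_age": "90d",  "actions": { "set_priority": { "priority": 0 } } },
      "delete": { "min_age": "365d", "actions": { "delete": {} } }
    }
  }
}
```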
386
+
387
+ The security team owns all Kibana queries, dashboards, and SIEM detection rules above the index layer. M08 owns only the index template, ILM policy, and raw event schema.
388
+
389
+ ---
390
+
391
+ ## §5 — SDK Integration Contract
392
+
393
+ This section defines what M08 delivers to other module teams and what those teams are contractually required to do with it. This is the interface contract between M08 and every other module.
394
+
395
+ ### What M08 delivers (by end of Week 1)
396
+
397
+ - **`@ecip/observability`** — npm package (Node.js / TypeScript). Published to the internal registry.
398
+ - **`ecip-observability`** — Python package. Published to the internal PyPI.
399
+ - **`runbooks/SDK-INTEGRATION.md`** — Step-by-step integration guide. Target: 30-minute setup for any team.
400
+ - **OTel Collector endpoint** — Available at `http://otel-collector.monitoring:4317` (gRPC) and `:4318` (HTTP) in all namespaces by Week 1.
401
+
402
+ ### What every module team must do
403
+
404
+ **Node.js / TypeScript services (M01, M04, M05, M07)**
405
+
406
+ ```typescript
407
+ // Step 1: Install
408
+ // npm install @ecip/observability @opentelemetry/sdk-node
409
+
410
+ // Step 2: src/instrument.ts — import BEFORE all other imports
411
+ import { initTracer } from '@ecip/observability';
412
+
413
+ initTracer({
414
+ serviceName: 'ecip-query-service',
415
+ otlpEndpoint: process.env.OTEL_EXPORTER_OTLP_ENDPOINT, // injected by Helm
416
+ });
417
+
418
+ // Step 3: Use the logger in every handler
419
+ import { createLogger } from '@ecip/observability';
420
+
421
+ const log = createLogger({ repo, branch, user_id, module: 'M04' });
422
+ log.info({ duration_ms: 43, cached: false }, 'Query resolved');
423
+
424
+ // Step 4: Emit security events via dedicated helper (never via log)
425
+ import { emitRbacDenial } from '@ecip/observability';
426
+
427
+ emitRbacDenial({ userId, resource: repoId, action: 'read', reason: 'rbac_insufficient_role' });
428
+ ```
429
+
430
+ **Python services (M02)**
431
+
432
+ ```python
433
+ # Step 1: pip install ecip-observability opentelemetry-sdk
434
+
435
+ # Step 2: Initialize at process entry (before any other imports)
436
+ from ecip_observability import init_tracer, get_logger
437
+
438
+ init_tracer(service_name="ecip-analysis-engine")
439
+
440
+ # Step 3: Use the logger
441
+ log = get_logger(repo=repo, branch=branch, user_id=user_id, module="M02")
442
+ log.info("Analysis complete", duration_ms=14200, files_indexed=47)
443
+
444
+ # Step 4: Decorate functions for automatic span creation
445
+ from ecip_observability import traced
446
+
447
+ @traced(name="lsp.symbol_extraction")
448
+ def extract_symbols(file_path: str) -> list:
449
+ ... # span automatically started/ended; exceptions auto-captured
450
+ ```
451

**Helm value injection (Infra team responsibility)**

All OTel configuration is injected by the Helm chart. Module teams do not hardcode endpoints.

```yaml
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.monitoring:4318"
  - name: OTEL_SERVICE_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['app.kubernetes.io/name']
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "ecip.module=M04,deployment.environment=production"
```
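
Because module code must never hardcode endpoints, the SDK has to handle the case where injection failed. A hypothetical sketch of endpoint resolution follows; the fallback URL and the fail-fast policy are illustrative assumptions, not chart or SDK behavior:

```typescript
// Hypothetical sketch: resolving the Helm-injected endpoint, failing fast in
// production instead of silently dropping telemetry.
type Env = Record<string, string | undefined>;

function resolveOtlpEndpoint(env: Env): string {
  const endpoint = env.OTEL_EXPORTER_OTLP_ENDPOINT;
  if (endpoint) return endpoint;
  if (env.NODE_ENV === 'production') {
    // A missing endpoint in production is a deployment bug, not a soft default.
    throw new Error('OTEL_EXPORTER_OTLP_ENDPOINT not injected; check Helm values');
  }
  return 'http://localhost:4318'; // local-dev fallback (assumption, not chart policy)
}
```

Failing fast here surfaces a broken Helm values file at pod startup, where it is cheap to notice, rather than weeks later as a gap in traces.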

### Integration enforcement (CI gate)

A module PR that does not satisfy all of the following is blocked from merging:

1. `@ecip/observability` or `ecip-observability` is in the dependency list.
2. `initTracer()` is called before the service starts accepting requests (verified by a startup integration test).
3. At least one log line per request handler uses `createLogger()` with all mandatory fields.
4. No `console.log` or `print()` statements exist in production code paths (enforced by ESLint / Ruff rules).
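
Check 1 is mechanical enough to sketch. A hedged illustration of the dependency check follows; the real gate lives in `ci/check-observability-contract.js`, and this is not that code:

```typescript
// Hypothetical sketch of CI gate check #1: the SDK must be a direct dependency.
interface PackageJson {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

const SDK_NAMES = ['@ecip/observability', 'ecip-observability'];

/** Returns true when either observability SDK package is declared. */
function hasObservabilitySdk(pkg: PackageJson): boolean {
  const deps = { ...pkg.dependencies, ...pkg.devDependencies };
  return SDK_NAMES.some((name) => name in deps);
}
```

In CI, the gate would parse the module's `package.json` (or `pyproject.toml` for Python services) and fail the job when this returns false.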

---

## §6 — Task Breakdown

| # | Task | Est. | Owner | Week | Priority | Notes |
|---|---|---|---|---|---|---|
| M08-T01 | Deploy OTel Collector as DaemonSet; configure OTLP receiver (gRPC :4317 + HTTP :4318) | 2d | Infra | W1 | **P0** | Must be online before any other module begins instrumentation. Hard blocker for all teams. |
| M08-T02 | Deploy Grafana Tempo; S3 backend; 14-day retention; microservices mode | 2d | Infra | W1 | **P0** | Grafana Tempo over Jaeger — native Grafana integration, S3 cost model better at scale. |
| M08-T03 | Deploy Prometheus + Grafana; scrape configs; recording rules | 2d | Infra | W1 | **P0** | `kube-prometheus-stack` Helm chart. Add per-module scrape configs as modules come online. |
| M08-T04 | Publish `@ecip/observability` + `ecip-observability`; write SDK-INTEGRATION.md | 2d | Senior Backend | W1 | **P0** | Highest-leverage M08 deliverable. Treat as a feature, not documentation. |
| M08-T05 | Build 8 Grafana dashboards as JSON; wire via sidecar provisioning | 4d | Backend | W2–W3 | P1 | Ship dashboards in module-dependency order (see §11, review finding 4); later dashboards follow as their modules come online. |
| M08-T06 | Configure all alert rules; wire PagerDuty + Slack; write a runbook for each alert | 2d | Backend | W2 | **P0** | `QueryLatencySLABreach` and `LSPDaemonRestartRate` must be live before CP1 (Week 4). |
| M08-T07 | Implement logging library: mandatory fields, security event helpers, middleware | 1d | Senior Backend | W1 | **P0** | Delivered as part of the M08-T04 package. Compile-time field enforcement is non-negotiable. |
| M08-T08 | Elasticsearch security event index; ILM policy; schema; Kibana space | 2d | Security + Backend | W3–W4 | P1 | NFR-SEC-007. Security team owns Kibana; M08 owns the index schema only. |
| M08-T09 | Add TDD-002 metrics: HNSW rebuild, embedding migration progress, `FilterAuthorizedRepos` latency | 1d | Backend | W5–W6 | P2 | Extend existing dashboards once M06's new RPCs are wired. |
| M08-T10 | Write CI gate enforcement: ESLint/Ruff rules, startup integration check, label validation | 1d | Backend | W2 | P1 | Without this, SDK adoption will drift. CI enforcement is the only reliable mechanism. |

**Total estimated effort:** ~19 days of task effort (Platform team; 2 engineers working in parallel)

---

## §7 — Module Dependencies

M08 has no runtime dependency on any other ECIP module. All arrows point inward — every other module depends on M08.

| Module | Depends on M08 for | M08 Artifact | Required by Week |
|---|---|---|---|
| M01 — API Gateway | Trace injection, auth failure logging, request logging | `@ecip/observability`, Collector endpoint | **Week 1** |
| M03 — Knowledge Store | Write latency metrics, cache hit rate | `@ecip/observability`, Prometheus scrape config | **Week 1** |
| M07 — Event Bus | DLQ depth metrics, event processing latency | `@ecip/observability`, Prometheus scrape config | **Week 1** |
| M06 — Registry | RBAC denial events, gRPC duration metrics, `FilterAuthorizedRepos` latency | `@ecip/observability`, security event schema | Week 5 |
| M02 — Analysis Engine | Span wrapping around LSP calls, analysis duration metrics | `ecip-observability` (Python), Collector endpoint | Week 5 |
| M04 — Query Service | Query latency metrics, circuit breaker state logging, MCP call spans | `@ecip/observability`, Collector endpoint | Week 11 |
| M05 — MCP Server | Tool call duration metrics, RBAC denial logging | `@ecip/observability` | Week 17 |

M01, M03, and M07 all start in Week 1 or Week 2. M08's core infrastructure (Collector, Tempo, Prometheus, SDK) must be ready before any of them write production code. This is the hardest constraint in the M08 timeline.

---

## §8 — Testing Plan

### Unit Tests (coverage target: 80%)

| Test File | What It Validates |
|---|---|
| `metric-label-validation.test.ts` | All metric emissions include the exact label sets defined in the catalog. Uses a mock Prometheus registry to capture emitted metrics and validates label presence and types. |
| `alert-threshold-config.test.ts` | All YAML files in `alerts/` parse without error; all PromQL expressions are syntactically valid (`promtool check rules`); all referenced metrics exist in the catalog; all runbook file paths resolve on disk. |
| `log-schema-validation.test.ts` | JSON log output from `createLogger()` matches the mandatory field schema. Compile-time test: the TypeScript build fails when mandatory fields are omitted (validated by compiling a deliberately broken test file and asserting a non-zero exit code). |
| `security-events.test.ts` | `emitAuthFailure()` and `emitRbacDenial()` produce correct ECS-formatted events. No raw `user_id` values in output (only hashed). Events route to the dedicated OTel logger provider (not the general logger). |
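
The label check in the first row can be sketched compactly. The catalog entries below are illustrative placeholders, not the real metrics catalog; `user_id` is deliberately absent from every entry, so it always surfaces as an unexpected label:

```typescript
// Hypothetical sketch of the core check in metric-label-validation.test.ts.
// CATALOG maps metric name -> required label set (placeholder entries only).
const CATALOG: Record<string, string[]> = {
  query_duration_ms: ['repo', 'module', 'cached'],
  rbac_denials_total: ['module', 'reason'],
};

/** Diffs one emitted metric sample's labels against the catalog entry. */
function validateLabels(
  metric: string,
  labels: Record<string, string>,
): { missing: string[]; unexpected: string[] } {
  const required = CATALOG[metric] ?? [];
  const present = Object.keys(labels);
  return {
    missing: required.filter((l) => !present.includes(l)),
    unexpected: present.filter((l) => !required.includes(l)),
  };
}
```

In the real test, samples would be captured from a mock Prometheus registry and the test would fail if either array is non-empty, which also enforces the `user_id` cardinality prohibition from §10.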
### Integration Tests (Testcontainers)

**`otel-pipeline-integration.test.ts`** — The most important test in M08.

Setup: Testcontainers spins up an OTel Collector container and a Grafana Tempo container. The `@ecip/observability` SDK is initialized pointing at the collector.

Assertions:
1. Emit 5 spans across a simulated async service boundary.
2. All 5 spans are retrievable from Tempo via HTTP API within 10 seconds.
3. The `traceparent` header is correctly propagated across the async boundary (child spans have the correct parent ID).
4. An error span (forced `status: ERROR`) is 100% sampled: it is present in Tempo despite the 5% default rate.
5. A normal (non-error) span may or may not appear at the 5% sampling rate, so the test deliberately makes no assertion about its presence. Asserting only error capture keeps the test deterministic while still exercising the sampling rules.
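
Assertion 2 reduces to a bounded poll against Tempo's trace-by-ID endpoint. The sketch below is a hedged illustration, not the committed test code; the endpoint path and timings are assumptions, and the fetch function is injected so the poller can be exercised without a live Tempo:

```typescript
// Hypothetical poller for assertion 2.
type FetchLike = (url: string) => Promise<{ status: number }>;

/**
 * Polls Tempo's trace-by-ID endpoint until the trace is queryable or the
 * deadline passes. Resolves true when Tempo answers 200 within timeoutMs.
 */
async function waitForTrace(
  tempoBaseUrl: string,
  traceId: string,
  fetchFn: FetchLike,
  timeoutMs = 10_000,
  intervalMs = 500,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetchFn(`${tempoBaseUrl}/api/traces/${traceId}`);
    if (res.status === 200) return true; // span batch ingested and queryable
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // the test assertion then fails: trace never appeared
}
```

In the real test, `fetchFn` would be the global `fetch` pointed at the Testcontainers-mapped Tempo port.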

**Alert rule CI validation:**
```bash
# Run in CI on every PR touching alerts/
promtool check rules alerts/*.yaml
```

**Dashboard JSON lint:**
```bash
# Run in CI on every PR touching dashboards/
grafana-dashboard-lint dashboards/*.json
```

> **Why the Testcontainers test is non-negotiable:** Unit tests cannot catch a misconfigured collector pipeline, an mTLS cert error on the Tempo exporter, or a sampling rule that silently drops all traces. The integration test is the only meaningful correctness signal for the observability pipeline itself. It must run in CI on every PR, not just nightly.

### What M08 does not test

M08 does not test that other modules have correctly instrumented themselves. That responsibility belongs to each module's own test suite. M08's CI gate (§5) enforces minimum instrumentation standards; module teams own the quality of their own spans and log lines.

---
## §9 — Open Design Decisions

These decisions are not yet finalised and must be resolved before the relevant tasks begin.

**OD-01: Log aggregation backend for general application logs**

The current design routes security events to Elasticsearch but leaves the general application log destination unspecified. Options are: (a) Elasticsearch with a separate index, (b) Loki (Grafana-native, lower cost), (c) CloudWatch / GCP Logging if running on a managed cloud. This decision affects the Helm chart, the Fluent Bit / Fluentd sidecar configuration, and potentially the SDK (if the log backend requires a specific OTel exporter). Must be decided before M08-T01 is complete.

**OD-02: Collector resource limits in production**

The DaemonSet manifest specifies `limit_mib: 512` for the collector memory limiter. This was estimated for a 200-node cluster at expected trace volume with 5% sampling. It has not been load-tested. At Week 8, after M01, M03, and M07 are generating real traces, the actual memory profile must be measured and the limit updated. If volume exceeds estimates, the sampling rate is the first lever to pull (reduce from 5% to 2%), not the memory limit.

**OD-03: Dashboard ownership after Week 28**

The 8 dashboards in `dashboards/` are built and maintained by the Platform team during the build phase. Post-Week 28, who owns them? Module teams adding new metrics need to update the dashboards. Without a clear ownership model, dashboards will drift from the actual metrics being emitted. Recommendation: each module team owns the panels in the dashboards that relate to their metrics; the Platform team owns the infrastructure-level panels (Collector health, Prometheus performance, Tempo storage).

---

## §10 — Risk Register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Module teams skip SDK integration — observability blind spots in production | **High** | **High** | CI gate (M08-T10) blocks merges without instrumentation. SDK guide ready Week 1. Make it easier to integrate than to skip. |
| OTel Collector config error silently drops all trace data | **Medium** | **High** | Pipeline integration test (Testcontainers) catches this in CI. Config changes to `otel-collector-config.yaml` require Platform team review. Collector exposes its own health metrics at `:13133/healthz`. |
| High-cardinality Prometheus labels cause OOM | **Medium** | **High** | `user_id` prohibited as a Prometheus label (enforced by the label validation test). New metrics require Platform team review before shipping. Prometheus has a 10M series limit alert configured. |
| Tempo storage costs blow out with 14-day retention | **Medium** | **Low** | Tail-based sampling (5% healthy) reduces volume ~20x vs. 100% sampling. Review storage costs at Week 8 with real traffic data. Reduce sampling to 2% if needed before touching retention. |
| Security event pipeline mixes with general logs | **Low** | **High** | Architecturally separated: dedicated OTel logger provider, dedicated Collector logs pipeline, dedicated Elasticsearch index. `emitSecurityEvent()` is the only permitted path. Enforced by the security-events unit test. |
| M08 infrastructure is itself unobservable | **Low** | **Medium** | OTel Collector exposes Prometheus metrics at `:8888`. Grafana has a collector health panel. Collector logs are shipped to the same log store as other services. The `monitoring` namespace has its own Prometheus alert for collector pod restarts. |

---

## §11 — Architect's Review Notes

These are design issues identified during the architectural review of TDD-001's M08 task list. They are not hypothetical concerns — they reflect gaps in the original specification that would have caused real problems during implementation.

### Review finding 1: The task list is not a design

TDD-001 §4.8 lists 8 tasks and a metrics table. That is a starting point, not an implementable specification. Missing from the original:

- No definition of the OTel Collector topology (DaemonSet vs. Deployment) and the rationale for it.
- No sampling strategy. The default OTel SDK samples 100% of traces, which is disastrously expensive at production volume.
- No specification of what "structured logging" means — just "deliver as logging library wrapper."
- No enforcement mechanism. A guide with no enforcement is a suggestion.
- No security event pipeline design. NFR-SEC-007 requires it, but TDD-001 only lists it as a task with no schema, no routing, no pipeline separation.

This document addresses all of the above.

### Review finding 2: Log aggregation destination is unspecified

TDD-001 mentions Elasticsearch for security events but says nothing about where general application logs go. This is a gap. The OTel Collector's `logs` pipeline needs an exporter target. Leaving it unspecified means the Infra team will make the decision independently, potentially choosing a backend that is incompatible with the Grafana Tempo + Grafana stack. This is captured as OD-01 above and must be resolved before M08-T01 closes.

### Review finding 3: `FilterAuthorizedRepos` latency alert was missing

TDD-002 adds `FilterAuthorizedRepos` as a new RPC to M06 with an explicit p95 < 20ms SLA (NFR-SEC-011). TDD-001's M08 task list predates TDD-002 and therefore has no alert for this SLA. The metrics catalog and alert rules in this document add `filter_authorized_repos_duration_ms` and the `FilterAuthorizedReposLatency` alert to cover it. This is not captured in M08-T09 (which only mentions dashboard additions) — the alert rule is a separate deliverable and must be added to M08-T06's scope.
### Review finding 4: Dashboard delivery sequence matters

TDD-001 specifies all dashboards as a single 4-day task. In practice, a dashboard for `query_duration_ms` is useless before M04 exists, and a dashboard for `mcp_call_duration_ms` is useless before M05 exists. Delivering all 8 dashboards by Week 3 means 5 of them display empty panels for months.

The correct approach is to ship dashboards in module-dependency order:

1. **Week 2** — `lsp-daemon-health.json`, `event-bus-dlq.json` (M02 and M07 start early)
2. **Week 3** — `cache-performance.json`, `analysis-throughput.json` (M03 foundation)
3. **Week 4** — `security-events.json` (tied to M08-T08, not module timelines)
4. **Week 11** — `query-latency.json`, `mcp-call-graph.json` (M04 comes online)
5. **Week 17** — `cross-repo-fanout.json` (M05 comes online)

This does not change the total effort; it changes when effort is spent and ensures dashboards are immediately useful when delivered.

### Review finding 5: The CI gate is load-bearing, not optional

The SDK integration guide says "all teams should use `@ecip/observability`." Without a CI gate, this will not happen. Engineers under deadline pressure skip instrumentation. Two months into the build, when the first production incident occurs and traces are missing for 3 of 7 modules, the cost of retrofitting is high and the debugging is impossible.

M08-T10 (CI gate enforcement) is listed as P1 in the task breakdown. Architecturally it should be P0, delivered alongside M08-T04 in Week 1. The CI gate is the enforcement mechanism that makes the SDK guide meaningful. Without it, the SDK guide is optional documentation.

---
+
637
+ *ECIP-M08-MDD · Observability Stack · Module Design Document · Rev 1.1 · March 2026*
638
+ *Confidential — Internal Engineering Use Only · Platform Team*
639
+ *This document supersedes ECIP-TDD-001 §4.8 for implementation purposes.*