@decocms/start 6.0.1 → 6.2.0

@@ -121,6 +121,15 @@ this plan.
  | 2026-05-07 | **D6.1 — Cloudflare credentials never leave `deco-start`** | Same-day refinement of D6 after the first central deploy on `baggagio-tanstack` failed with `Secret CLOUDFLARE_API_TOKEN is required, but not provided while calling`. The original D6 design used `secrets: inherit` from the storefront stub and required `CLOUDFLARE_*` to live in the `deco-sites` org, which broke the principle that *the only secrets a storefront repo holds are the secrets that go into wrangler secrets, not the ones used to deploy*. First-pass refinement: the central `deploy.yml` / `preview.yml` / `sync-secrets.yml` jobs declared `environment: production` to try to make `${{ secrets.CLOUDFLARE_* }}` resolve from `decocms/deco-start`'s `production` Environment. **Found broken empirically on 2026-05-07** — the deployment registers in the *caller* repo, not the called workflow's repo, so the environment lookup uses the caller's `production` env (auto-created with no secrets). Superseded by D6.2 the same evening. |
  | 2026-05-07 | **D6.2 — App-mediated dispatch + no per-site registry (supersedes D6 + D6.1)** | After D6.1's `environment:` mechanism was empirically shown not to work cross-repo, the architecture pivoted: a `decocms-deployer` GitHub App is installed on `decocms/deco-start` (`actions:write`) and on each storefront repo (`contents:read`, optionally `pull-requests:write`). The storefront caller stub mints a short-lived App-installation token and calls `gh workflow run deploy.yml --repo decocms/deco-start --ref v3 -f site_owner=… -f site_name=…`. The central workflow runs in `decocms/deco-start`'s context, so `CLOUDFLARE_API_TOKEN` / `CLOUDFLARE_ACCOUNT_ID` are ordinary repo secrets. For runtime `SECRET_*` values, each storefront has a `<site_name>-secrets` GitHub Environment in `decocms/deco-start` (S1 design); `sync-secrets.yml` binds to that environment and pushes to `wrangler secret put`. The per-site registry under `deploy/sites/<repo>.jsonc` was dropped entirely (Pure C): worker name = repo basename by convention; the App being installed on the storefront repo is the deploy authorization gate; rare per-worker derived fields (like AE dataset name) use `$WORKER_*` substitution tokens in the template. Force-rollback is impossible for production deploys because the central workflow ignores caller-supplied `site_sha` and resolves the storefront's current default-branch HEAD itself. See [`deploy/README.md`](./deploy/README.md) for the full trust model. **Operational migrations required by Pure C:** `miess-01-tanstack` repo's worker shifts from `miess-tanstack` to `miess-01-tanstack` (CF-side cutover); `lebiscuit-tanstack` AE dataset shifts from `deco_metrics_lebiscuit` to `deco_metrics_lebiscuit_tanstack` (orphans old data). |
  | 2026-05-07 | **D6.3 — Revert D6/D6.1/D6.2; deploys move to Cloudflare Workers Builds** | The whole D6 family (centralized GitHub Actions reusable workflows + `decocms-deployer` GitHub App + per-storefront GitHub Environments + central `deploy/wrangler-template.jsonc` + `deco-wrangler` CLI + per-site caller stubs) is being **reverted**. Trigger: GitHub Free orgs do not propagate org-level secrets to private repos, which forced the App private key to live as a per-storefront repo secret in every storefront — that key gives the holder the ability to mint installation tokens that can trigger workflows on `decocms/deco-start`, which in turn have the only Cloudflare credentials in the system. Per-repo distribution + rotation of that key across N customer storefronts didn't scale and concentrated blast radius on one credential. **Replacement (chosen, to be detailed in a follow-up D-record once shipped):** [Cloudflare Workers Builds](https://developers.cloudflare.com/workers/ci-cd/builds/) owns the deploy/preview pipelines per-worker. Verified empirically on `baggagio-tanstack` 2026-05-07: a malicious `wrangler.jsonc` `name` field pointing at a different worker (`americanas-tanstack`) is **ignored** by CF Builds — the deploy lands on the connected worker (`baggagio-tanstack`), CF surfaces a warning banner in the dashboard, and CF auto-opens a PR to fix the config (deco-sites/baggagio-tanstack#34). The dashboard repo<->worker connection is the source of truth; the in-repo config is treated as a secondary input. Per-storefront wiring (one CF dashboard click per worker) is acceptable at our scale; revisit when CF's [git-integration enable API](https://github.com/cloudflare/workers-sdk/issues/12058) lands. The `deco-build` CLI (regenerates `wrangler.jsonc` bindings from a central template) and runtime-secrets management remain to be designed in a separate PR. |
+ | 2026-05-22 | **D-9 — Stable `request.id` propagation across all observability channels** | The fragmentation surfaced during the May 2026 error triage: tail-worker logs, direct-POST metrics, CF-Destinations spans, and structured `console.log` JSON all carry overlapping information but no shared join key. A single user-visible 5xx required hand-correlating timestamps across four ClickHouse tables. **Decision:** the framework generates a stable `request.id` once at request entry (precedence: inbound `x-request-id` → `cf-ray` → `crypto.randomUUID()`) inside `RequestContext.run`, then stamps it on (a) the root span as the `request.id` attribute, (b) every log line via the logger attribute floor, (c) the response as `X-Request-Id` (read by `deco-otel-tail`), and (d) the metric labels via `extra`. Symmetric for `trace.id` — read from the root span's spanContext, echoed as `X-Trace-Id`. This re-establishes a single join key across all channels: pick any row from any table, filter by `request.id`, and reconstruct the full request lifecycle. **Files:** `RequestContext.requestId` + logger floor + workerEntry response-header echo + tail-worker enrichment. **Captured in Phase 1 of [the observability refinement plan](../../.cursor/plans/observability_refinement_plan_4fa41548.plan.md)**. |
+ | 2026-05-22 | **D-10 — Server-side log normalization at the ingest worker, cost-neutral** | CF Destinations wraps every `console.log(JSON.stringify(...))` line into an OTLP LogRecord with the JSON body in `body.stringValue`. Querying by structured fields requires `JSONExtract` everywhere — slow, query-fragile, and tied to whatever the producer happens to embed. **Two design choices considered:** (a) migrate all framework `logger.{info,warn,debug}` calls to direct-POST native OTLP, ditching the CF Destinations sampled path. Substantially more code volume in production direct-POST traffic; bypasses the head sampling that keeps fleet cost bounded. (b) lift the JSON-in-body into native OTLP `LogAttributes` server-side at the `deco-otel-ingest` worker. Same wire volume, same cost. **Decision: option (b).** The ingest worker's `logsToRows` now detects JSON-shaped `body.stringValue`, lifts `level`/`msg`/`trace_id`/`span_id` plus arbitrary keys into native OTLP attributes, reduces `Body` to the human-readable `msg`, and falls back unchanged for non-JSON strings (third-party `console.log`). Dashboards drop `JSONExtractString(Body, 'level') = 'error'` in favor of `SeverityText = 'ERROR'`. **Files:** `stats-lake/ingestion/otel-ingest/src/index.ts`. Phase 4 of the observability refinement plan. |
+ | 2026-05-22 | **D-11 — Outcome metrics layer becomes the truth source for "did we serve users today?"** | Earlier metric labels (`method`, `path`, `status`) couldn't answer "5xx rate per route per site" without joining metrics to tail-worker logs. The path label was raw-URL (unbounded cardinality risk); status was opaque (no class bucketing); no cache decision / cache layer; no commerce histogram in the framework (only apps-start sites that bumped to a recent version had it). **Decision:** expand the canonical label set for `http_requests_total` / `http_request_duration_ms` / `http_request_errors_total` to `{ method, route_pattern, status, status_class, outcome?, cache_decision?, cache_layer?, region?, …extra }`. `route_pattern` is the TanStack closed-set pattern (`/_products/$slug/p`); fallback is the normalized path. `status_class` is `2xx`/.../`5xx`/`unknown`. Cache labels lift the existing `X-Cache` / `X-Cache-Profile` headers up to the metric so dashboards answer cache-hit rate per route from the counter alone. Move `commerce_request_duration_ms` declaration into `@decocms/start` so every site emits it as soon as the framework is bumped, regardless of apps-start version (apps register operation strings only). Labels: `{ provider, operation, status_class?, cached? }`. **Files:** `src/middleware/observability.ts` (`statusClassFor`, `RequestMetricLabels`, `CacheLayer`, `recordCommerceMetric`, expanded `recordCacheMetric` signature). Phase 2 of the observability refinement plan. |
+ | 2026-05-22 | **D-12 — Direct-POST OTLP trace exporter for framework `deco.*` spans** | Empirical verification (May 2026) confirmed the framework's 10+ `withTracing` calls produced zero rows in `otel_traces`. Root cause: the bridge tracer in `instrumentWorker` delegates to `trace.getTracer(...)` on the `@opentelemetry/api` global. With no `TracerProvider` registered (the common case — CF Workers only auto-installs a provider when `observability.traces.destinations` is set), every framework span is silently discarded. **Decision:** introduce `otelHttpTracer.ts` — a direct-POST OTLP/HTTP trace exporter that mirrors the existing meter + error-log adapters. Same transport: per-isolate buffer, ctx.waitUntil flush, FNV-1a hash sampling at `headSamplingRate` (default 0.01 matches CF Destinations recommendation). Consistent per-trace decision so child spans are kept iff their root is kept. Honors inbound W3C `traceparent` — if the remote parent arrived sampled, every span in that trace is exported regardless of the rate. Wired alongside the existing `@opentelemetry/api` bridge via `configureTracerStack` — CF auto-spans still flow to the CF dashboard, framework spans direct-POST to ClickHouse. Default-on `injectTraceContext` inside `createInstrumentedFetch` was already in place. **Files:** `src/sdk/otelHttpTracer.ts`, `src/sdk/otel.ts` (`configureTracerStack`), `src/sdk/workerEntry.ts` (traceparent parsing). Phase 3 of the observability refinement plan. |
+ | 2026-05-22 | **D-13 — Per-site Grafana dashboards + alert rules are auto-provisioned from `dim_sites`** | Hand-built dashboards drift the moment a new site lands: a fleet of 100 sites can't be maintained by a human curator. **Decision:** the canonical observability provisioning lives in [`stats-lake/observability/`](../../../stats-lake/observability/) — a single dashboard template + a single alert-rule template, parameterized by `{{site}}`/`{{team}}`/`{{datasource_uid}}` and rendered once per site by `scripts/provision-dashboards.ts` (reads `dim_sites` joined to `dim_teams`, writes to `dashboards/dist/<team>/<site>.json` + `alerts/dist/<team>/<site>.yaml`). Alerts use **anomaly bands, not thresholds** — current 5-min mean vs 24h rolling mean ± 3σ, fires after 10 minutes outside the band. Same rule set runs on every site, but the baseline is per-site so a noisy storefront doesn't false-positive against a quiet one. Every alert carries a `runbook_url` annotation pointing at [`deco-start/docs/runbooks/`](../docs/runbooks/) — the runbook is part of the alert, not a separate artifact. **Decision points open** (Phase 5 of the refinement plan): alerting venue (Grafana → email, Linear MCP tickets, both, or none for v1). **Files:** `stats-lake/observability/{dashboards,alerts,scripts,README.md}` + `deco-start/docs/runbooks/`. |
+ | 2026-05-22 | **D-17 — Alerting venue: none for v1; ship dashboards-only** | The action layer (Phase 5) generates anomaly-band alert rule templates per site, but every alert needs a pager and we don't have an on-call rotation. **Decision:** ship dashboards-only for v1. The alert templates in `stats-lake/observability/alerts/templates/site-rules.yaml` stay versioned so they evolve with the dashboards, but `provision-dashboards.ts` only templates them into `alerts/dist/` when `--with-alerts` is passed, and no Grafana → email / Linear MCP / PagerDuty receiver is wired up. Rejected alternatives: (a) Grafana → email — emails get muted within a week without a triage owner; (b) Linear MCP ticket-per-fire — creates noise during active incidents and you can't triage a ticket while firefighting; (c) both — overkill before we know which storefronts will be noisiest. **Revisit when:** an on-call rotation exists, OR a specific incident class earns dedicated paging (e.g., billing-critical sites that need 24/7 coverage). **Files:** `stats-lake/observability/scripts/provision-dashboards.ts` (`--with-alerts` flag), `stats-lake/observability/README.md`. Phase 5 of the observability refinement plan. |
+ | 2026-05-22 | **D-16 — `deco-audit-observability` is warn-by-default; promote to block once the fleet is clean** | The audit (D-14) detects drift in `tail_consumers`, `version_metadata`, `DECO_METRICS`, and `DECO_OTEL_*_ENDPOINT` vars across every storefront wrangler.jsonc. Pre-merge blocking on day one would fail PRs that have nothing to do with observability — storefronts are upgraded over weeks, not all at once. Advisory comments alone get ignored. **Decision:** `--mode warn` (default) annotates findings via `::warning::` GitHub Actions lines but always exits 0, so observability drift surfaces in CI without blocking ship. `--mode block` exits 1 on any `error`-severity finding for use once the fleet has been pulled current. Storefronts opt into `block` per-repo by wiring `--mode block` in their workflow when they've cleared their findings. Includes `--github` flag to emit native annotations. **Files:** `scripts/audit-observability-config.ts` (parseArgs `--mode` / `--github`, main exit policy), test coverage in the matching `.test.ts` (6 new CLI smoke tests via tsx subprocess). Phase 6 of the observability refinement plan. |
+ | 2026-05-22 | **D-15 — OTel Collector swap is a documented target state, not a committed milestone** | The current `deco-otel-ingest` Worker is a hand-rolled OTLP/HTTP parser + ClickHouse inserter. It works at ~14M POSTs/month, but each new OTLP protocol revision, each new receiver (gRPC OTLP, Prometheus remote-write), and each new sink would cost us code to write and maintain — code the upstream OTel Collector + `clickhouseexporter` ship as a maintained product. **Decision:** mark the Collector swap as the eventual target state, **with no committed timeline**, and capture the explicit revisit-triggers so we know when to act rather than relying on "we should think about this sometime." The decision is cheap because the ClickHouse schema is **already** the canonical `clickhouseexporter` shape — that was a deliberate design choice in `clickhouse/schema/otel/` so the ingest path stays swappable. Migration when triggered is config-only at the data layer: stand up a Collector in the same CF account, configure `otlphttp`/`otlpgrpc` receivers + `clickhouse/v1` exporter + `transform` processors for the existing PII redaction and JSON-body lift (1:1 from current Worker logic), DNS-cutover the `DECO_OTEL_*_ENDPOINT` vars. **Revisit triggers:** OTLP 2.0 ships, we need gRPC OTLP, we need a non-ClickHouse sink, ingest volume exceeds 100M POSTs/mo, or a hand-rolled parser develops a defect we can't fix quickly. **Files:** [`stats-lake/ingestion/otel-ingest/COLLECTOR_TARGET.md`](../../../stats-lake/ingestion/otel-ingest/COLLECTOR_TARGET.md) holds the full migration runbook + rollback story. Phase 7 of the observability refinement plan; explicitly **optional**. |
+ | 2026-05-22 | **D-14 — `deco-audit-observability` covers fleet bindings, not just the `observability` block** | The existing audit only checked the `observability` block (sampling rates, persist, destinations). Phase 1+2+3 made several other wrangler keys load-bearing: `tail_consumers` must list `deco-otel-tail` (Phase 1 enrichment is a no-op without it), `version_metadata` must bind `CF_VERSION_METADATA` (no `service.version` without it = no deploy correlation), `analytics_engine_datasets` must bind `DECO_METRICS` (no AE meter), `vars.DECO_OTEL_{METRICS,TRACES,LOGS}_ENDPOINT` must resolve (direct-POST channels silently no-op otherwise). **Decision:** expand the audit with six new rules under a sibling function `auditFleetBindings` and a composing `auditWranglerConfig`. Severity tuned to the impact: tail consumer + version_metadata are `error` (operational coverage gap); the rest are `warn` (degraded mode, not total failure). Drift is detected today; the matching `--fix` codemod and CI gate hardness (block / warn / advisory) are Phase 6 decision points still open. **Files:** `scripts/audit-observability-config.ts` (`auditFleetBindings`, `auditWranglerConfig`). |
  | 2026-05-19 | **D-8 — Cloudflare Tail Worker (Strategy B) is the canonical 100% error capture mechanism** | At fleet scale (100 sites, 2.5B req/month) head sampling forces a tradeoff: 1% sampling makes the `head_sampling_rate * 5B-event-cap` math work, but 99% of error traces and 99% of error-correlated logs get dropped at the CF Destinations head. The framework already covers framework-emitted errors via the in-Worker direct-POST channel (`DECO_OTEL_LOGS_ENDPOINT`) — that's 100% of `logger.error(...)` regardless of `head_sampling_rate`. But three structural gaps remain that *no* in-Worker code can close from inside its own request handler: (a) uncaught throws (the worker isolate is already unwinding when the throw bubbles out of `instrumentWorker`), (b) `exceededCpu` / `exceededMemory` outcomes (the runtime kills the producer before any in-Worker code can run), (c) raw `console.error(...)` from third-party SDKs that bypass the framework logger. **Decision:** introduce [`deco-otel-tail`](https://github.com/decocms/stats-lake/tree/main/ingestion/otel-tail) — a Cloudflare Tail Worker in `stats-lake/ingestion/otel-tail/`. CF invokes it on every execution of any producer worker that lists it under `tail_consumers` (`wrangler.jsonc`). The handler filters TraceItems down to the interesting subset (`outcome !== "ok" \|\| exceptions.length > 0 \|\| logs.some(l => l.level === "error")`), translates each to OTLP LogRecords (one per exception, one per `error`-level log line, plus a synthetic LogRecord for non-ok outcomes that didn't surface either), and forwards them to `deco-otel-ingest` via an in-account service binding (no public hop). Rows land in `otel_logs` with `Attributes['_source'] = 'tail-worker'` so dashboards can split tail-captured errors from direct-POST + CF-Destinations errors. 
**Rejected alternatives:** (1) **Codemod + lint to enforce `logger.error` calls** — structural coverage gap; can't catch uncaught throws or 1101s by definition, and a lint can't enforce calls inside third-party code. (2) **Logpush + ingest pipeline** — bypassed because Logpush isn't OTLP-shaped and the pricing curve loses to tail-worker at our scale. (3) **CF dashboard log retention only** — no fan-out to ClickHouse, no fleet-wide query surface. (4) **DO-buffered tail-on-error** — ~$8K/mo at fleet scale per the cost model in `docs/observability.md`. **Coverage matrix lives in [`docs/observability.md`](./docs/observability.md) → "Error capture — three-channel model".** Producer-side wiring is one line per `wrangler.jsonc`: `tail_consumers: [{ service: "deco-otel-tail" }]`. **Operational dependency:** the tail worker MUST be deployed to the same Cloudflare account as `deco-otel-ingest` (currently `c95fc4cec7fc52453228d9db170c372c`) so the `[[services]]` binding resolves. If `deco-otel-ingest` ever moves accounts, the service binding collapses to a public HTTPS POST and the model needs revisiting. **Agent behaviour:** when designing error capture for new Worker-deployed code, default to Strategy B for the long tail; don't reach for codemod/lint enforcement unless there's a specific code-quality concern beyond capture. |
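The `request.id` precedence described in D-9 (inbound `x-request-id`, then `cf-ray`, then `crypto.randomUUID()`) can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's code: `resolveRequestId` and `echoRequestId` are hypothetical names, and treating blank header values as absent is an assumption.

```typescript
// Hypothetical helper: pick a stable request id using the D-9 precedence.
function resolveRequestId(headers: Headers): string {
  for (const name of ["x-request-id", "cf-ray"]) {
    const value = headers.get(name)?.trim();
    // Assumption: an empty header value counts as "not provided".
    if (value) return value;
  }
  // Last resort: mint a fresh id for this request.
  return crypto.randomUUID();
}

// Hypothetical helper: echo the id on the response so downstream consumers
// (for example a tail worker) can read it from the `X-Request-Id` header.
function echoRequestId(response: Response, requestId: string): Response {
  const out = new Response(response.body, response);
  out.headers.set("X-Request-Id", requestId);
  return out;
}
```

Resolving the id once at request entry and passing the same string to the span, logger, metric labels, and response header is what makes it usable as a join key.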
 
  The full text of the constitutional rule (loaded into every agent
@@ -71,17 +71,24 @@ This makes it possible to filter traces by cache decision directly in ClickStack
 
  ## What's measured
 
- | Metric | Type | Source | Labels |
- | ------------------------------ | --------- | ----------------------------------- | ------------------------------- |
- | `http_requests_total` | counter | `workerEntry` | `method`, `path`, `status` |
- | `http_request_duration_ms` | histogram | `workerEntry` | `method`, `path`, `status` |
- | `http_request_errors_total` | counter | `workerEntry` (status >= 500) | `method`, `path`, `status` |
- | `cache_hit_total` | counter | edge cache decision | `profile`, `decision` |
- | `cache_miss_total` | counter | edge cache decision | `profile`, `decision` |
- | `resolve_duration_ms` | histogram | `resolveDecoPage` | |
+ | Metric | Type | Source | Labels (canonical, Phase 2 / D-11) |
+ | ------------------------------- | --------- | ----------------------------------- | ----------------------------------------------------------------------------------------------- |
+ | `http_requests_total` | counter | `workerEntry` | `method`, `route_pattern`, `status`, `status_class`, `outcome?`, `cache_decision?`, `cache_layer?`, `region?` |
+ | `http_request_duration_ms` | histogram | `workerEntry` | same as `http_requests_total` |
+ | `http_request_errors_total` | counter | `workerEntry` (status >= 500) | same as `http_requests_total` |
+ | `cache_hit_total` | counter | edge cache decision | `profile`, `decision`, `layer` (`edge` \| `cachedLoader` \| `vtex-swr`) |
+ | `cache_miss_total` | counter | edge cache decision | `profile`, `decision`, `layer` |
+ | `commerce_request_duration_ms` | histogram | commerce clients (vtex/shopify/…) | `provider`, `operation`, `status_class?`, `cached?` |
+ | `resolve_duration_ms` | histogram | `resolveDecoPage` | — |
 
  `decision` values mirror the `X-Cache` response header: `HIT`, `STALE-HIT`, `STALE-ERROR`, `MISS`, `BYPASS`.
 
+ `route_pattern` is the TanStack route pattern (e.g. `/_products/$slug/p`) rather than the raw URL path — bounded cardinality, joinable to the route table. Callers that don't supply one get a normalized path with dynamic segments collapsed (`/products/:slug/p`).
+
+ `status_class` is the canonical `2xx`/.../`5xx`/`unknown` bucket. Dashboards aggregate by `status_class` for SLO panels and by `status` for incident drill-down.
+
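The `status_class` bucketing can be illustrated with a short sketch. The source names a `statusClassFor` helper but does not show its body, so the implementation below is an assumption, including the choice to map 1xx and out-of-range codes to `unknown`; `requestLabels` is a hypothetical helper that only shows how the required labels fit together.

```typescript
type StatusClass = "2xx" | "3xx" | "4xx" | "5xx" | "unknown";

// Illustrative only: the real `statusClassFor` may treat edge cases differently.
function statusClassFor(status: number): StatusClass {
  if (!Number.isInteger(status)) return "unknown";
  if (status >= 200 && status < 300) return "2xx";
  if (status >= 300 && status < 400) return "3xx";
  if (status >= 400 && status < 500) return "4xx";
  if (status >= 500 && status < 600) return "5xx";
  // Assumption: 1xx and anything outside 200-599 fall into "unknown".
  return "unknown";
}

// Hypothetical label builder covering the required (non-optional) labels.
function requestLabels(method: string, routePattern: string, status: number) {
  return {
    method,
    route_pattern: routePattern,
    status: String(status),
    status_class: statusClassFor(status),
  };
}
```

Keeping both `status` and `status_class` on the same series lets one counter serve the SLO panels and the incident drill-down.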
+ `commerce_request_duration_ms` is owned by the framework (Phase 2 / D-11) so every site emits it as soon as `@decocms/start` is bumped, regardless of `@decocms/apps` version. Apps register operation strings via `recordCommerceMetric`; the framework owns the cardinality contract.
+
  ### Metrics: AE vs OTLP (the two-meter split)
 
  `instrumentWorker` plugs **up to two meters in parallel**, composed via `createCompositeMeter`:
@@ -181,8 +188,9 @@ The direct-POST channels are wired automatically when the relevant env vars reso
  | ---------------------------------- | -------------------- | ---------------------- | ---------------------------------------------------- |
  | `DECO_OTEL_METRICS_ENDPOINT` | OTLP metrics POST | `""` (unset) | OTLP meter is not created; AE-only metrics |
  | `DECO_OTEL_LOGS_ENDPOINT` | OTLP error-log POST | `""` (unset) | Error logs ride CF Destinations only (head-sampled) |
+ | `DECO_OTEL_TRACES_ENDPOINT` | OTLP traces POST | `""` (unset) | Framework `deco.*` spans drop unless CF Traces is on |
 
- Both are opt-out via `OtelOptions.otlpMetricsEnabled: false` / `otlpErrorLogsEnabled: false` if you need to disable them at boot for a specific environment without changing the env vars.
+ All three are opt-out via `OtelOptions.otlpMetricsEnabled: false` / `otlpErrorLogsEnabled: false` / `otlpTracesEnabled: false` if you need to disable them at boot for a specific environment without changing the env vars. Traces honor `OtelOptions.otlpTracesSamplingRate` (default `0.01` to match CF Destinations) — sampling decisions are consistent per trace (`FNV-1a` hash of `trace_id`), so child spans are kept iff their root is kept. Remote parents that arrive sampled (`traceparent` flags `01`) override the rate and are always exported.
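The per-trace sampling rule above can be sketched as a deterministic hash test. This is a minimal sketch, assuming a 32-bit FNV-1a over the hex `trace_id` and a "hash divided by 2^32 is below the rate" comparison; the source does not state the hash width or the comparison, and `fnv1a32`, `isParentSampled`, and `shouldExportTrace` are hypothetical names.

```typescript
// 32-bit FNV-1a over the string's code units (one byte each for hex trace ids).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit wraparound
  }
  return hash >>> 0; // force unsigned
}

// W3C traceparent: version-traceid-parentid-flags, all lowercase hex.
const TRACEPARENT = /^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$/;

// True when the inbound header is well-formed and its sampled flag (bit 0) is set.
function isParentSampled(traceparent?: string | null): boolean {
  if (!traceparent || !TRACEPARENT.test(traceparent)) return false;
  return (parseInt(traceparent.slice(-2), 16) & 0x01) === 1;
}

// Same trace id always yields the same decision, so child spans follow their root.
function shouldExportTrace(traceId: string, rate: number, traceparent?: string | null): boolean {
  if (isParentSampled(traceparent)) return true; // remote sampled parent overrides the rate
  return fnv1a32(traceId) / 2 ** 32 < rate;
}
```

Because the decision depends only on `trace_id` and the rate, no per-trace state has to be shared between spans or isolates.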
 
  ## Log shape (and how to query it)
 
@@ -213,7 +221,9 @@ In the ClickStack UI you can also filter logs panel by `trace_id` directly — p
 
  ## Outbound trace propagation
 
- For any outbound `fetch` issued during a request (VTEX, Shopify, internal APIs), inject a W3C `traceparent` header so upstream services that participate in OTel can join your trace:
+ For commerce clients (VTEX, Shopify), `createInstrumentedFetch` injects the W3C `traceparent` header by default. To opt out for a specific endpoint that rejects unknown headers, pass `injectTraceparent: false`.
+
+ For any other outbound `fetch` issued during a request, inject a `traceparent` header manually so upstream services that participate in OTel can join your trace:
 
  ```ts
  import { injectTraceContext } from "@decocms/start/sdk/observability";
@@ -0,0 +1,209 @@
+ # RUM (Real-User Monitoring) — Plan
+
+ > Sibling deliverable to [`observability_refinement_plan_4fa41548.plan.md`](../../../.cursor/plans/observability_refinement_plan_4fa41548.plan.md).
+ > The refinement plan covers server-side observability end-to-end. This
+ > plan covers what runs **in the browser** — Core Web Vitals, JS errors,
+ > long tasks, resource timing, custom user-journey events.
+
+ ## Why this is a separate plan
+
+ Server-side telemetry tells us what our Workers did. RUM tells us what
+ the **user actually experienced** — including everything outside our
+ edge (DNS, TLS handshake, third-party scripts, the user's CPU, their
+ flaky LTE link, their ad-blocker). The two answer different questions:
+
+ | Question | Answered by |
+ |---|---|
+ | "Did we serve the request?" | server outcomes (Phase 2 of refinement plan) |
+ | "Was the user able to read the page?" | RUM (this plan) |
+ | "Did our deploy regress LCP on iOS Safari?" | RUM (this plan) |
+ | "Why did checkout abandon at 73%?" | RUM + server outcomes joined |
+
+ Today the answer to every RUM question is "we don't know." The plan
+ puts a floor on that.
+
+ ## Scope tiers — the decision you're making
+
+ The size of this plan changes by an order of magnitude depending on
+ scope. Three defensible tiers below; the work is **strictly additive**
+ between them so we can commit to Tier 1, run it for a quarter, and
+ upgrade to Tier 2 or 3 only if the data we collect surfaces a need.
+
+ ### Tier 1 — Core Web Vitals + JS errors (recommended for v1)
+
+ **What's collected:**
+ - LCP (Largest Contentful Paint), CLS (Cumulative Layout Shift),
+ INP (Interaction to Next Paint), FCP, TTFB — via the standard
+ [`web-vitals`](https://github.com/GoogleChrome/web-vitals) library.
+ - `window.onerror` + `window.onunhandledrejection` — uncaught JS errors
+ with stack, source URL, line/col, user agent, route pattern, deploy
+ id, `request.id` (same one the server stamped in Phase 1, joinable to
+ `otel_logs` and `otel_traces`).
+ - Page-context attributes: `route_pattern`, `service.version`,
+ `service.name`, `deployment.environment`, viewport, connection type
+ (`navigator.connection.effectiveType`), `cf-ray`.
+
+ **What's NOT collected:**
+ - Session replay.
+ - Custom user-journey events (add-to-cart, scroll-to-fold, etc.).
+ - Resource timing for every asset.
+ - User interaction heatmaps.
+
+ **Implementation footprint:** ~3 dev-weeks.
+ - `@decocms/start/sdk/rum.ts` — single browser-side module bundled into
+ the site entry. Reads `web-vitals` (peer dep, ~3KB gzipped), batches
+ events, sends to `/__deco/rum` on the same origin (no CORS / no
+ third-party script).
+ - `cmsRoute.ts` already serves a worker; add a `/__deco/rum` handler
+ that validates the payload, redacts referrer/URL via the shared PII
+ library (Phase 1), and forwards to the OTLP HTTP endpoint as
+ `otel_logs` with `SeverityText="INFO"` and `LogAttributes.rum.*`.
+ - ClickHouse rows land in the existing `otel_logs` table — no new
+ schema, no new pipeline. The Tier 1 query "p75 LCP per route per
+ site, last 7 days" is a single SQL join on the existing tables.
+ - One Grafana dashboard template added to
+ `stats-lake/observability/dashboards/templates/site-rum.json`. Auto-
+ provisioned via the existing Phase 5 script — `--with-rum` flag
+ mirrors `--with-alerts`.
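The browser-side batching described in the Tier 1 footprint can be sketched as below. This is a hedged illustration, not the planned `rum.ts`: `createRumBatcher`, the `RumEvent` shape, and the `{ events: [...] }` payload envelope are all assumptions, and the transport is injected so the sketch runs outside a browser.

```typescript
// Hypothetical event shape; field names are illustrative, not a committed schema.
interface RumEvent {
  name: string; // e.g. "LCP", "CLS", "INP"
  value: number;
  routePattern: string;
  requestId?: string;
}

// Hypothetical batcher: queues events and hands a JSON payload to `send`
// once `maxBatch` events accumulate or `flush()` is called explicitly.
function createRumBatcher(send: (body: string) => void, maxBatch = 10) {
  let queue: RumEvent[] = [];

  const flush = (): void => {
    if (queue.length === 0) return; // nothing to send
    const batch = queue;
    queue = [];
    send(JSON.stringify({ events: batch }));
  };

  const push = (event: RumEvent): void => {
    queue.push(event);
    if (queue.length >= maxBatch) flush();
  };

  return { push, flush };
}
```

In a browser, `send` could be `(body) => navigator.sendBeacon("/__deco/rum", body)`, `push` could be fed from the `onLCP` / `onCLS` / `onINP` / `onFCP` / `onTTFB` callbacks exported by `web-vitals`, and `flush` could be called from a `visibilitychange` listener so queued events survive page unload.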
+
+ **Cost:** marginal. One row per pageview per metric (~5 rows / pageview)
+ adds < 5% to existing log volume. Within the cost guardrail dashboard
+ (Phase 6) headroom.
+
+ **Risks:**
+ - INP requires the modern API; falls back to FID on older browsers.
+ Reported separately so the metric isn't muddied.
+ - `web-vitals` runs ~50ms of JS on first input; teams that obsess over
+ shaving milliseconds will want a build flag to disable it. Ship one
+ off the bat.
+
+ ### Tier 2 — Tier 1 + custom user-journey events + resource timing
+
+ **What's added:**
+ - A typed `rum.track(name, attributes)` API exposed from
+ `@decocms/start/sdk/rum.ts`. Sites call
+ `rum.track('add_to_cart', { sku, price, currency })` and the event
+ flows through the same `/__deco/rum` endpoint into `otel_logs` with a
+ reserved attribute namespace (`rum.event.*`).
+ - Resource Timing API rollup: for each pageview, total resource bytes,
+ count by content-type, slowest 5 URLs (with paths redacted). Helps
+ diagnose when a third-party tag has gone bad.
+ - A `rum.identify(userIdHash)` call that lets sites cohort by logged-in
+ user without sending PII (the site hashes the user id before passing
+ it in; we never see plaintext).
+
+ **Implementation footprint:** ~6 additional dev-weeks on top of Tier 1.
+ - The framework-side API + types: ~2 weeks.
+ - Resource Timing payload shape + redaction: ~1 week.
+ - Documentation, codemod fixtures, and an audit rule that enforces
+ the redaction helpers stay in use: ~3 weeks.
+ - A second Grafana dashboard (`site-rum-events.json`) that pivots on
+ `rum.event.*`.
+
+ **Risks:**
+ - Cardinality explosion: a site that emits `track('view_product', { id })`
+ with the product id as a label creates one time-series per SKU. The
+ attribute system has to **enforce** id-as-attribute / id-not-as-label
+ via the type system + a runtime check in the framework. Doable but
+ it's the new piece that has to be designed right.
+ - Custom events drift across sites unless we standardize a vocabulary.
+ Recommend shipping a small reserved-name list (`add_to_cart`,
+ `begin_checkout`, `purchase`, `view_product`) so fleet-wide dashboards
+ can roll up "conversion funnel" without per-site cooperation.
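The two Tier 2 risks (cardinality and vocabulary drift) suggest keeping the event name as the only bounded label and pushing everything else into `rum.event.*` attributes. The sketch below illustrates that split under stated assumptions: `buildRumEvent`, the `labels`/`attributes` record shape, and the letters-and-underscores name rule are hypothetical; only the four reserved names and the `rum.event.*` namespace come from the plan.

```typescript
// Reserved vocabulary proposed in the plan for fleet-wide funnel dashboards.
const RESERVED_EVENTS = ["add_to_cart", "begin_checkout", "purchase", "view_product"] as const;

type AttributeValue = string | number | boolean;

interface RumEventRecord {
  // Bounded values only: safe to aggregate on.
  labels: { event: string; reserved: boolean };
  // Unbounded values (ids, prices) live here, never in labels.
  attributes: Record<string, AttributeValue>;
}

// Assumption: names are lowercase words joined by underscores, which rejects
// names that embed an id such as "view_product_123".
const EVENT_NAME = /^[a-z]+(_[a-z]+)*$/;

function buildRumEvent(
  name: string,
  attributes: Record<string, AttributeValue> = {},
): RumEventRecord {
  if (!EVENT_NAME.test(name)) {
    throw new Error(`invalid RUM event name: ${name}`);
  }
  const namespaced: Record<string, AttributeValue> = {};
  for (const [key, value] of Object.entries(attributes)) {
    namespaced[`rum.event.${key}`] = value;
  }
  const reserved = (RESERVED_EVENTS as readonly string[]).includes(name);
  return { labels: { event: name, reserved }, attributes: namespaced };
}
```

A call such as `buildRumEvent("add_to_cart", { sku: "123", price: 10 })` keeps `sku` out of the label set, so the number of distinct label combinations grows with the event vocabulary rather than with the catalog.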
+
+ ### Tier 3 — Tier 2 + session replay + interaction heatmaps
+
+ **What's added:**
+ - Session replay: capture every DOM mutation + every user input as a
+ delta-encoded stream, replay it as a video in HyperDX / a custom
+ viewer. The canonical OSS implementation is
+ [`rrweb`](https://github.com/rrweb-io/rrweb).
121
+ - Interaction heatmaps: aggregate click positions over a page within a
122
+ given time window.
123
+
124
+ **Implementation footprint:** months.
125
+ - ~30KB gzipped of `rrweb` on every pageview — measurable LCP impact.
126
+ Mitigation: lazy-load after first interaction, accepting that events
127
+ before that point (including first paint) are not captured.
128
+ - The replay payload is enormous (~100KB/min/session compressed).
129
+ Multiplied by realistic session counts this is a 10-100× ingest
130
+ blowup vs Tier 1. Replay storage is the new bottleneck, not the
131
+ schema; we'd need an R2-backed cold tier separate from ClickHouse.
132
+ - A privacy review needs to ship before the first byte of replay
133
+ flows. PII redaction has to happen client-side before the network —
134
+ rrweb's "mask all inputs" mode is the floor; password fields, credit
135
+ card numbers, and authenticated user-data attributes need explicit
136
+ privacy classes.
137
+
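To get a feel for the replay volume, here is a back-of-the-envelope sketch. Only the ~100KB/min/session figure comes from the plan (read here as 100,000 bytes); the session count, average session length, and sample rate in the example are made-up inputs, and `estimateReplayBytesPerDay` is a hypothetical helper.

```typescript
// ~100KB/min/session compressed, per the plan (decimal kilobytes assumed).
const REPLAY_BYTES_PER_SESSION_MINUTE = 100 * 1000;

function estimateReplayBytesPerDay(
  sessionsPerDay: number,
  avgSessionMinutes: number,
  sampleRate: number,
): number {
  const raw =
    sessionsPerDay * avgSessionMinutes * REPLAY_BYTES_PER_SESSION_MINUTE * sampleRate;
  return Math.round(raw); // whole bytes; avoids floating-point noise
}

// Hypothetical site: 50,000 sessions/day, 4 minutes each, 10% sampled.
const bytesPerDay = estimateReplayBytesPerDay(50000, 4, 0.1);
const gigabytesPerDay = bytesPerDay / 1e9; // 2 GB/day under these assumptions
```

Even at 10% sampling, a single mid-sized site under these assumptions produces gigabytes per day, which is why a separate cold tier is on the table.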
138
+ **Risks:**
139
+ - Privacy: replay is a high-magnification footgun. One regression in
140
+ the mask config and we've recorded a user's credit card. The whole
141
+ payment-flow page must be force-masked at the framework level, not
142
+ opt-in per site.
143
+ - Cost: 10-100× the ingest of Tier 1 even with aggressive sampling.
144
+ The Phase 6 cost-guardrail dashboard will trip; needs a new tier of
145
+ retention policies (replay rows expire at 30d, not 90d like logs).
146
+
147
+ ## Out of scope (regardless of tier)
148
+
149
+ - **Synthetic monitoring.** Lighthouse runs against canonical journeys
150
+ on a cron. Covers a different need — "would a clean browser have
151
+ hit our SLO?" rather than "what did real users see?". A separate
152
+ initiative if we want it.
153
+ - **Heatmaps for individual users.** Aggregate heatmaps only; never
154
+ identifiable.
155
+ - **A/B-test attribution.** Sites that want it can pass an experiment
156
+ cohort id through `rum.identify` — we don't ship the experiment
157
+ framework itself.
158
+
159
+ ## Recommended sequencing
160
+
161
+ Ship Tier 1 in one PR. Live with the data for a quarter. If during
162
+ that quarter the question "but what did the user actually click before
163
+ they bounced?" comes up more than twice, plan and ship Tier 2. If
164
+ during *that* quarter we hit a category of bug that can only be
165
+ diagnosed by replay (so far: 0), plan Tier 3.
166
+
167
+ **Anti-recommendation:** do not commit to Tier 3 up front. Replay is
168
+ the highest-cost, highest-privacy-risk piece of the whole observability
169
+ surface, and the bug categories that genuinely need it are rare.
170
+
171
+ ## Decision points
172
+
173
+ These mirror the main plan's structure — answer once, then this
174
+ document gets a follow-up PR turning answers into TODOs.
175
+
176
+ 1. **Tier selection.** Tier 1 only / Tier 1 + Tier 2 / Full Tier 3.
177
+ 2. **Identify hashing.** If we're shipping Tier 2, do sites pass in a
178
+ client-side hash, or do we accept plaintext IDs and hash at the
179
+ ingest worker? (Recommend client-side hash — keeps plaintext out of
180
+ our pipeline entirely.)
181
+ 3. **Sampling.** RUM events are cheap per-row but high-volume. Default
182
+ sample rate: 100% (Tier 1), 100% (Tier 2 events), 10% (Tier 3
183
+ replay). Confirm or revise per tier.
184
+
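If sampling ends up below 100% for any tier, one option is to decide per session rather than per event, so a sampled session is captured in full. The sketch below hashes a session id with 32-bit FNV-1a (over UTF-16 code units) and compares the result against the rate. The function names and the `DEFAULT_SAMPLE_RATES` object are assumptions; only the three default rates come from the list above.

```typescript
// Default rates from decision point 3; the object shape is hypothetical.
const DEFAULT_SAMPLE_RATES = { tier1: 1, tier2Events: 1, tier3Replay: 0.1 };

// 32-bit FNV-1a over UTF-16 code units, using shifts instead of Math.imul.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    // Multiply by the FNV prime 16777619 (2^24 + 2^8 + 0x93), modulo 2^32.
    hash =
      (hash + ((hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24))) >>> 0;
  }
  return hash >>> 0;
}

// Deterministic: the same session id always gets the same decision.
function shouldSample(sessionId: string, rate: number): boolean {
  if (rate >= 1) return true;
  if (rate <= 0) return false;
  return fnv1a32(sessionId) / 0x100000000 < rate;
}

const recordReplay = shouldSample("session-abc", DEFAULT_SAMPLE_RATES.tier3Replay);
```

A hash-based decision needs no shared state between the browser and the ingest worker, and it lets either side recompute whether a given session was in the sample.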
185
+ ## Files this plan would touch (Tier 1)
186
+
187
+ ```
188
+ deco-start/
189
+ ├── src/sdk/rum.ts # NEW — browser-side instrumentation
190
+ ├── src/sdk/rum.server.ts # NEW — /__deco/rum handler
191
+ ├── src/admin/setup.ts # ROUTE — mount /__deco/rum
192
+ ├── src/sdk/observability.ts # EXPORT — re-export rum API
193
+ ├── package.json # DEP — add `web-vitals` peer dep
194
+ └── docs/rum.md # NEW — site-side usage docs
195
+
196
+ stats-lake/observability/
197
+ └── dashboards/templates/site-rum.json # NEW
198
+ ```
199
+
200
+ ## Why this is a plan and not an implementation
201
+
202
+ The user explicitly asked for RUM to be in a separate plan document
203
+ rather than rolled into the refinement plan. The scope-tier decision
204
+ above is the single highest-leverage choice; everything downstream
205
+ follows from it, and "Tier 3 because Tier 3 is biggest" is the wrong
206
+ default. A 30-minute conversation on the tier choice saves weeks of
207
+ work in the wrong direction.
208
+
209
+ Open the matching Linear / GitHub issue once a tier is selected.
@@ -0,0 +1,40 @@
1
+ # Runbooks
2
+
3
+ Tested response procedures for the alerts auto-provisioned by
4
+ [`stats-lake/observability/`](https://github.com/decocms/stats-lake/blob/main/observability/README.md).
5
+
6
+ Every alert generated by `provision-dashboards.ts` carries a
7
+ `runbook_url` annotation that points to a Markdown file in this
8
+ directory. Each runbook follows the same structure so on-call can read
9
+ top-to-bottom under stress:
10
+
11
+ 1. **What this alert means.** One paragraph, no jargon.
12
+ 2. **First check (60 seconds).** The single most likely cause and how to
13
+ confirm it.
14
+ 3. **Diagnostic queries.** ClickHouse SQL the responder can paste into
15
+ Grafana / ClickStack to dig deeper.
16
+ 4. **Common causes & fixes.** Ranked by frequency. Each with a "did
17
+ that fix it?" verification step.
18
+ 5. **Escalation.** When to page a domain owner, and who.
19
+ 6. **Post-mortem hook.** What to capture so the post-mortem isn't
20
+ reconstructed from memory.
21
+
22
+ ## Runbook catalogue
23
+
24
+ | Alert ID | Runbook |
25
+ |---------------------------|--------------------------------------------------------|
26
+ | `http-error-spike` | [`http-error-spike.md`](./http-error-spike.md) |
27
+ | `http-latency-spike` | [`http-latency-spike.md`](./http-latency-spike.md) |
28
+ | `cache-hit-drop` | [`cache-hit-drop.md`](./cache-hit-drop.md) |
29
+ | `commerce-upstream-slow` | [`commerce-upstream-slow.md`](./commerce-upstream-slow.md) |
30
+ | `tail-exception-spike` | [`tail-exception-spike.md`](./tail-exception-spike.md) |
31
+
32
+ ## Authoring a new runbook
33
+
34
+ When you add a new alert template, add a runbook with the same ID. The
35
+ provisioning script doesn't enforce the link today — that gate lands in
36
+ Phase 6 (Governance), but treat it as required: an alert without a
37
+ runbook is a half-shipped alert.
38
+
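Until the Phase 6 gate exists, a small check like the following could be run by hand or in CI. It is only a sketch: `missingRunbooks` is a hypothetical helper, and it assumes the caller has already collected the alert IDs and the file names in this directory.

```typescript
// Returns the alert IDs that have no `<id>.md` runbook next to them.
function missingRunbooks(alertIds: string[], runbookFiles: string[]): string[] {
  return alertIds.filter((id) => runbookFiles.indexOf(id + ".md") === -1);
}

// Example using the catalogue above plus one alert without a runbook.
const missing = missingRunbooks(
  ["http-error-spike", "cache-hit-drop", "new-alert-without-runbook"],
  ["http-error-spike.md", "cache-hit-drop.md", "README.md"],
);
```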
39
+ Keep runbooks short. If a section grows past 20 lines, split it out
40
+ into a dedicated incident-pattern doc and link to it from the runbook.
@@ -0,0 +1,83 @@
1
+ # Runbook: `cache-hit-drop`
2
+
3
+ > A site's edge cache hit rate fell below its own 24h rolling baseline by 3σ for ≥ 10 minutes.
4
+
5
+ ## What this alert means
6
+
7
+ The edge cache is missing more than usual. On the user side this
8
+ manifests as slower page loads. On the cost side it means more origin
9
+ requests (more billing for Workers + commerce API calls). On the
10
+ upstream side it can become a thundering herd if many users hit a
11
+ freshly-evicted entry simultaneously.
12
+
13
+ ## First check (60 seconds)
14
+
15
+ Was there a deploy or a cache purge in the last 10 minutes? Cold caches
16
+ recover quickly (5–10m) so if the alert is fresh and a deploy is
17
+ recent, this often self-heals.
18
+
19
+ ```sql
20
+ -- Recent deploys (any change to service.version visible in metrics)
21
+ SELECT ResourceAttributes['service.version'] AS version, min(TimeUnix) AS first_seen
22
+ FROM otel_metrics_sum
23
+ WHERE ServiceName = '{site}'
24
+ AND TimeUnix > now() - INTERVAL 1 HOUR
25
+ GROUP BY version
26
+ ORDER BY first_seen DESC;
27
+ ```
28
+
29
+ If neither a deploy nor a purge happened in the window, the elevated miss
30
+ share likely reflects a real regression; proceed below.
31
+
32
+ ## Diagnostic queries
33
+
34
+ ```sql
35
+ -- Hit / miss share by route_pattern, last 30 minutes
36
+ SELECT
37
+ Attributes['route_pattern'] AS route,
38
+ sumIf(toFloat64(Value), MetricName = 'cache_hit_total') AS hits,
39
+ sumIf(toFloat64(Value), MetricName = 'cache_miss_total') AS misses,
40
+ hits / nullIf(hits + misses, 0) AS hit_rate
41
+ FROM otel_metrics_sum
42
+ WHERE MetricName IN ('cache_hit_total', 'cache_miss_total')
43
+ AND ServiceName = '{site}'
44
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
45
+ GROUP BY route
46
+ ORDER BY misses DESC
47
+ LIMIT 20;
48
+ ```
49
+
50
+ ```sql
51
+ -- Cache decision distribution by cache_profile
52
+ SELECT
53
+ Attributes['profile'] AS profile,
54
+ Attributes['decision'] AS decision,
55
+ sum(toFloat64(Value)) AS n
56
+ FROM otel_metrics_sum
57
+ WHERE MetricName = 'cache_hit_total'
58
+ AND ServiceName = '{site}'
59
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
60
+ GROUP BY profile, decision
61
+ ORDER BY n DESC;
62
+ ```
63
+
64
+ ## Common causes & fixes
65
+
66
+ | Rank | Cause | How to confirm | Fix |
67
+ |------|-------------------------------------------------------------|----------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
68
+ | 1 | Deploy purged the version cache (`X-Cache-Version` flipped) | Recent `service.version` in the deploy query | Wait 10m for cache to warm. If sustained, check that the new build hash is propagating consistently. |
69
+ | 2 | A new query parameter is hashing into the cache key | One route's MISS share is far higher than the rest | Check `cacheHeaders` / `ignoreSearchParams` config for that route; add the new param to the ignore list. |
70
+ | 3 | Set-Cookie present on a previously cacheable response | `X-Cache: BYPASS` with `X-Cache-Reason: private-set-cookie` on the affected route | Inspect the section that started emitting cookies; move the cookie write to a non-cacheable POST handler. |
71
+ | 4 | A real burst of unique URLs (e.g. crawler scanning long-tail) | `Attributes['route_pattern']` doesn't change but distinct paths multiply | If a known bot, add a WAF rule. If a real catalog query, consider broader cache profile. |
72
+
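To illustrate cause #2, the sketch below shows how an ignore list keeps a tracking parameter out of the cache key. `normalizeCacheKey` is a hypothetical function written for this runbook, not the framework's actual key builder; only the `ignoreSearchParams` name is borrowed from the config mentioned in the table.

```typescript
// Hypothetical key builder: path plus the sorted, non-ignored query params.
function normalizeCacheKey(rawUrl: string, ignoreSearchParams: string[]): string {
  const url = new URL(rawUrl);
  ignoreSearchParams.forEach((param) => url.searchParams.delete(param));
  url.searchParams.sort(); // stable order so ?a=1&b=2 and ?b=2&a=1 share a key
  const query = url.searchParams.toString();
  return url.pathname + (query ? "?" + query : "");
}

// Without the ignore list, every distinct utm_source value is a separate cache entry.
const withIgnore = normalizeCacheKey(
  "https://shop.example/p/shoe?utm_source=mail&color=red",
  ["utm_source"],
);
const withoutIgnore = normalizeCacheKey(
  "https://shop.example/p/shoe?utm_source=mail&color=red",
  [],
);
```

If a route's miss share jumps right after a marketing campaign starts, comparing keys with and without the new parameter in this way is a quick way to confirm the cause.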
73
+ ## Escalation
74
+
75
+ - Sustained > 1 hour despite no deploy → page the site team owner.
76
+ - Suspected bot/abuse → loop in security / WAF on-call.
77
+
78
+ ## Post-mortem hook
79
+
80
+ - The "before" hit rate and the "after" hit rate.
81
+ - The top route that lost the hit rate.
82
+ - A representative response header snippet showing `X-Cache`,
83
+ `X-Cache-Profile`, `X-Cache-Reason`.
@@ -0,0 +1,88 @@
1
+ # Runbook: `commerce-upstream-slow`
2
+
3
+ > A site's `commerce_request_duration_ms` p95 exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
4
+
5
+ ## What this alert means
6
+
7
+ Calls out to a commerce provider (VTEX, Shopify, or similar) are
8
+ taking abnormally long for *this* site. Because SSR is synchronous on
9
+ upstream commerce calls, a slow upstream cascades into the user-facing
10
+ `http-latency-spike` alert almost immediately. If both fired together,
11
+ this is the root cause — fix here first.
12
+
13
+ ## First check (60 seconds)
14
+
15
+ Which provider/operation is slow? The same dashboard's "Commerce p95
16
+ by provider/operation" panel breaks it out. Note the
17
+ `provider.operation` string — e.g. `vtex.intelligent-search.product_search`.
18
+
19
+ If a single operation is responsible, jump to "Common causes" #1.
20
+ If multiple operations from the same provider are slow simultaneously,
21
+ that's a provider-wide regression — jump to "Common causes" #2.
22
+
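The decision above can be written down as a small classifier. This is a sketch only: the `SlowOp` shape and the `triage` function are hypothetical, and deciding which operations count as "slow" is left to whoever reads the dashboard panel.

```typescript
// Hypothetical shapes for the slow rows read off the dashboard panel.
interface SlowOp {
  provider: string;
  operation: string;
}

type Triage =
  | { kind: "none" }
  | { kind: "single-operation"; op: SlowOp } // go to "Common causes" #1
  | { kind: "provider-wide"; provider: string } // go to "Common causes" #2
  | { kind: "mixed" }; // several providers at once: look at causes #3 and #4

function triage(slowOps: SlowOp[]): Triage {
  if (slowOps.length === 0) return { kind: "none" };
  if (slowOps.length === 1) return { kind: "single-operation", op: slowOps[0] };
  const provider = slowOps[0].provider;
  const sameProvider = slowOps.every((op) => op.provider === provider);
  return sameProvider ? { kind: "provider-wide", provider } : { kind: "mixed" };
}

const verdict = triage([
  { provider: "vtex", operation: "intelligent-search.product_search" },
  { provider: "vtex", operation: "checkout.orderform" },
]);
```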
23
+ ## Diagnostic queries
24
+
25
+ ```sql
26
+ -- Approximate p95 commerce latency (quantile over per-data-point means) by provider + operation, last hour
27
+ SELECT
28
+ toStartOfInterval(TimeUnix, INTERVAL 5 MINUTE) AS t,
29
+ Attributes['provider'] AS provider,
30
+ Attributes['operation'] AS op,
31
+ quantileBFloat16(0.95)(toFloat64(Sum / nullIf(Count, 0))) AS p95
32
+ FROM otel_metrics_histogram
33
+ WHERE MetricName = 'commerce_request_duration_ms'
34
+ AND ServiceName = '{site}'
35
+ AND TimeUnix > now() - INTERVAL 1 HOUR
36
+ GROUP BY t, provider, op
37
+ ORDER BY t, p95 DESC;
38
+ ```
39
+
40
+ ```sql
41
+ -- Commerce call status distribution — are we getting 5xx from upstream?
42
+ SELECT
43
+ Attributes['provider'] AS provider,
44
+ Attributes['operation'] AS op,
45
+ Attributes['status_class'] AS status_class,
46
+ sum(Count) AS n
47
+ FROM otel_metrics_histogram
48
+ WHERE MetricName = 'commerce_request_duration_ms'
49
+ AND ServiceName = '{site}'
50
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
51
+ GROUP BY provider, op, status_class
52
+ ORDER BY n DESC;
53
+ ```
54
+
55
+ ```sql
56
+ -- VTEX SWR cache effectiveness on the slow operation
57
+ SELECT
58
+ Attributes['cached'] AS cached,
59
+ count() AS n,
60
+ avg(toFloat64(Sum / nullIf(Count, 0))) AS avg_ms
61
+ FROM otel_metrics_histogram
62
+ WHERE MetricName = 'commerce_request_duration_ms'
63
+ AND ServiceName = '{site}'
64
+ AND Attributes['operation'] = '<paste operation here>'
65
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
66
+ GROUP BY cached;
67
+ ```
68
+
69
+ ## Common causes & fixes
70
+
71
+ | Rank | Cause | How to confirm | Fix |
72
+ |------|------------------------------------------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
73
+ | 1 | One specific upstream operation is slow | Single `provider.operation` row dominates the p95 query | Check provider status page (status.vtex.com, www.shopifystatus.com). If clean, see if we recently changed payload size or filter on that operation. |
74
+ | 2 | Provider-wide regression | Multiple operations from the same `provider` regressed simultaneously | Public provider status page is usually the source of truth. Open a ticket with the provider citing our timing window. |
75
+ | 3 | VTEX SWR / cachedLoader hit rate dropped | Query 3 shows `cached=false` share rose | Inspect recent loader changes for the affected section. May have invalidated the cache key by changing the loader signature. |
76
+ | 4 | Region-specific (CF colo → upstream latency) | `region` label on the metric isolates one CF colo | Usually transient; CF will rebalance. If sustained, file a CF support ticket. |
77
+
78
+ ## Escalation
79
+
80
+ - Provider-wide regression confirmed → notify the affected customer-facing teams; this is communication-shaped, not engineering-shaped.
81
+ - One operation slow, no provider status incident → page the site team owner for that route.
82
+
83
+ ## Post-mortem hook
84
+
85
+ - The `provider.operation` string and its p95 timeline.
86
+ - The cache (`cached=true/false`) split on that operation.
87
+ - A representative trace from `otel_traces` showing the slow span
88
+ (`SpanName LIKE 'vtex.%'` or `'shopify.%'`).
@@ -0,0 +1,98 @@
1
+ # Runbook: `http-error-spike`
2
+
3
+ > A site's 5xx error rate exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
4
+
5
+ ## What this alert means
6
+
7
+ Real users are getting 5xx responses at a rate that's statistically
8
+ abnormal for this specific site. The alert uses a per-site anomaly band
9
+ (not a fleet-wide threshold) so a site that normally runs at 0.3% 5xx
10
+ fires for spikes other sites wouldn't notice — and a site that normally
11
+ runs at 4% (a known-noisy legacy storefront) doesn't false-positive at
12
+ 4.1%.
13
+
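The two sites in that example can be worked through with a simple mean-plus-3σ band. This is one plausible reading of the rule, not the alert's actual query: the baseline samples are invented, the helper names are hypothetical, and the "for ≥ 10 minutes" persistence condition is not modelled.

```typescript
function mean(values: number[]): number {
  return values.reduce((acc, v) => acc + v, 0) / values.length;
}

// Population standard deviation of the baseline window.
function stddev(values: number[]): number {
  const m = mean(values);
  const variance = mean(values.map((v) => (v - m) * (v - m)));
  return Math.sqrt(variance);
}

// True when `current` sits more than `sigmas` deviations above the site's own baseline.
function isAnomalous(baseline: number[], current: number, sigmas: number = 3): boolean {
  return current > mean(baseline) + sigmas * stddev(baseline);
}

// Invented 5xx-rate samples for a quiet site (~0.3%) and a noisy one (~4%).
const quietSite = [0.003, 0.0028, 0.0032, 0.0031, 0.0029];
const noisySite = [0.04, 0.038, 0.042, 0.041, 0.039];

const quietFires = isAnomalous(quietSite, 0.01); // 1% is far outside the quiet band
const noisyFires = isAnomalous(noisySite, 0.041); // 4.1% is inside the noisy band
```

The same 1% error rate that fires for the quiet site would be unremarkable on the noisy one, which is the point of a per-site band.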
14
+ ## First check (60 seconds)
15
+
16
+ Look at the **commerce upstream p95** panel on the same dashboard. If
17
+ that spiked at the same moment, the root cause is almost always an
18
+ upstream commerce API regressing. Stop here, jump to
19
+ [`commerce-upstream-slow.md`](./commerce-upstream-slow.md).
20
+
21
+ If commerce p95 is flat, the 5xx is internal — proceed below.
22
+
23
+ ## Diagnostic queries
24
+
25
+ Paste into ClickStack or a Grafana Explore panel pointed at the
26
+ ClickHouse datasource.
27
+
28
+ ```sql
29
+ -- Top error routes for this site, last 30 minutes
30
+ SELECT
31
+ Attributes['route_pattern'] AS route,
32
+ sumIf(toFloat64(Value), Attributes['status_class'] = '5xx') AS errors,
33
+ sum(toFloat64(Value)) AS total,
34
+ errors / total AS rate
35
+ FROM otel_metrics_sum
36
+ WHERE MetricName = 'http_requests_total'
37
+ AND ServiceName = '{site}'
38
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
39
+ GROUP BY route
40
+ HAVING errors > 0
41
+ ORDER BY rate DESC
42
+ LIMIT 20;
43
+ ```
44
+
45
+ ```sql
46
+ -- Recent exceptions captured by the tail worker
47
+ SELECT Timestamp, Body, LogAttributes['url.path'] AS path, LogAttributes['http.response.status_code'] AS status
48
+ FROM otel_logs
49
+ WHERE ServiceName = '{site}'
50
+ AND SeverityText = 'ERROR'
51
+ AND LogAttributes['_source'] = 'tail-worker'
52
+ AND LogAttributes['_outcome'] = 'exception'
53
+ AND Timestamp > now() - INTERVAL 30 MINUTE
54
+ ORDER BY Timestamp DESC
55
+ LIMIT 100;
56
+ ```
57
+
58
+ ```sql
59
+ -- Did a deploy correlate? List versions seen in the last hour
60
+ SELECT
61
+ ResourceAttributes['service.version'] AS version,
62
+ min(Timestamp) AS first_seen,
63
+ max(Timestamp) AS last_seen,
64
+ count() AS log_count
65
+ FROM otel_logs
66
+ WHERE ServiceName = '{site}'
67
+ AND Timestamp > now() - INTERVAL 1 HOUR
68
+ GROUP BY version
69
+ ORDER BY first_seen DESC;
70
+ ```
71
+
72
+ ## Common causes & fixes
73
+
74
+ | Rank | Cause | How to confirm | Fix |
75
+ |------|----------------------------------------------------|-------------------------------------------------------------|---------------------------------------------------------------------|
76
+ | 1 | A recent deploy regressed | Deploy-correlation query above (the third one) shows a `service.version` that flipped just before the spike | Roll back via Cloudflare dashboard `Deployments → Rollback`. Confirm via a fresh `service.version` line in the next 5m. |
77
+ | 2 | A specific route is broken (one bad section) | Top error routes query shows one `route_pattern` at 100% error rate | Check the recent commits to that section. Roll back or `Lazy` wrap it for graceful degradation. |
78
+ | 3 | Upstream cache layer evicted; cold-cache thundering herd | `cache_miss_total` for the same window spikes proportionally to errors | Wait it out — usually self-heals in 5m. If sustained, check that `staleTime` is set correctly on cmsRouteConfig. |
79
+ | 4 | Origin (commerce API) returning 5xx | `commerce_request_duration_ms` spike OR commerce logs | See [`commerce-upstream-slow.md`](./commerce-upstream-slow.md). |
80
+
81
+ ## Escalation
82
+
83
+ - **Site team owner** if a fix isn't obvious in 15 minutes (Slack
84
+ `#deco-platform`).
85
+ - **Cloudflare support** if all sites in a region are affected
86
+ simultaneously (look at the `region` label on the metrics) — this
87
+ has happened during CF colo incidents.
88
+
89
+ ## Post-mortem hook
90
+
91
+ Capture before the alert clears:
92
+ - `request.id` of one failing request (from the response header
93
+ `X-Request-Id` of a manually-reproduced 5xx).
94
+ - A representative tail-worker log row with stack trace.
95
+ - The deploy `service.version` window during the spike.
96
+
97
+ Stash them in the incident ticket so the post-mortem has the
98
+ correlation IDs it needs.