@decocms/start 6.1.0 → 6.2.0

@@ -0,0 +1,209 @@
1
+ # RUM (Real-User Monitoring) — Plan
2
+
3
+ > Sibling deliverable to [`observability_refinement_plan_4fa41548.plan.md`](../../../.cursor/plans/observability_refinement_plan_4fa41548.plan.md).
4
+ > The refinement plan covers server-side observability end-to-end. This
5
+ > plan covers what runs **in the browser** — Core Web Vitals, JS errors,
6
+ > long tasks, resource timing, custom user-journey events.
7
+
8
+ ## Why this is a separate plan
9
+
10
+ Server-side telemetry tells us what our Workers did. RUM tells us what
11
+ the **user actually experienced** — including everything outside our
12
+ edge (DNS, TLS handshake, third-party scripts, the user's CPU, their
13
+ flaky LTE link, their ad-blocker). The two answer different questions:
14
+
15
+ | Question | Answered by |
16
+ |---|---|
17
+ | "Did we serve the request?" | server outcomes (Phase 2 of refinement plan) |
18
+ | "Was the user able to read the page?" | RUM (this plan) |
19
+ | "Did our deploy regress LCP on iOS Safari?" | RUM (this plan) |
20
+ | "Why did checkout abandon at 73%?" | RUM + server outcomes joined |
21
+
22
+ Today the answer to every RUM question is "we don't know." The plan
23
+ puts a floor on that.
24
+
25
+ ## Scope tiers — the decision you're making
26
+
27
+ The size of this plan changes by an order of magnitude depending on
28
+ scope. Three defensible tiers below; the work is **strictly additive**
29
+ between them so we can commit to Tier 1, run it for a quarter, and
30
+ upgrade to Tier 2 or 3 only if the data we collect surfaces a need.
31
+
32
+ ### Tier 1 — Core Web Vitals + JS errors (recommended for v1)
33
+
34
+ **What's collected:**
35
+ - LCP (Largest Contentful Paint), CLS (Cumulative Layout Shift),
36
+ INP (Interaction to Next Paint), FCP, TTFB — via the standard
37
+ [`web-vitals`](https://github.com/GoogleChrome/web-vitals) library.
38
+ - `window.onerror` + `window.onunhandledrejection` — uncaught JS errors
39
+ with stack, source URL, line/col, user agent, route pattern, deploy
40
+ id, `request.id` (same one the server stamped in Phase 1, joinable to
41
+ `otel_logs` and `otel_traces`).
42
+ - Page-context attributes: `route_pattern`, `service.version`,
43
+ `service.name`, `deployment.environment`, viewport, connection type
44
+ (`navigator.connection.effectiveType`), `cf-ray`.
45
+
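The batching that the Tier 1 collector would need can be sketched as below. This is a minimal illustration, not the planned `rum.ts`: the `RumEvent` shape, the `RumBatcher` class, and the `maxBatch` default are all hypothetical names chosen for this example, and the transport is injected so the sketch stays independent of any browser API.

```typescript
// Hypothetical event shape; the fields mirror a few attributes from the list above.
interface RumEvent {
  name: string; // e.g. "LCP", "CLS", "INP", "js_error"
  value: number;
  routePattern: string;
  requestId: string;
}

type SendFn = (batch: RumEvent[]) => void;

// Queues events and hands them to `send` in batches (assumed design, not the real module).
class RumBatcher {
  private queue: RumEvent[] = [];
  private send: SendFn;
  private maxBatch: number;

  constructor(send: SendFn, maxBatch: number = 10) {
    this.send = send;
    this.maxBatch = maxBatch;
  }

  push(event: RumEvent): void {
    this.queue.push(event);
    if (this.queue.length >= this.maxBatch) this.flush();
  }

  // Sends whatever is queued; a no-op when the queue is empty.
  flush(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.send(batch);
  }
}
```

In a browser, `send` could wrap `navigator.sendBeacon("/__deco/rum", JSON.stringify(batch))`, with `flush()` called on `visibilitychange` so queued events are not lost when the tab is hidden. Both choices are assumptions for illustration.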
46
+ **What's NOT collected:**
47
+ - Session replay.
48
+ - Custom user-journey events (add-to-cart, scroll-to-fold, etc.).
49
+ - Resource timing for every asset.
50
+ - User interaction heatmaps.
51
+
52
+ **Implementation footprint:** ~3 dev-weeks.
53
+ - `@decocms/start/sdk/rum.ts` — single browser-side module bundled into
54
+ the site entry. Reads `web-vitals` (peer dep, ~3KB gzipped), batches
55
+ events, sends to `/__deco/rum` on the same origin (no CORS / no
56
+ third-party script).
57
+ - `cmsRoute.ts` already serves a worker; add a `/__deco/rum` handler
58
+ that validates the payload, redacts referrer/URL via the shared PII
59
+ library (Phase 1), and forwards to the OTLP HTTP endpoint as
60
+ `otel_logs` with `SeverityText="INFO"` and `LogAttributes.rum.*`.
61
+ - ClickHouse rows land in the existing `otel_logs` table — no new
62
+ schema, no new pipeline. The Tier 1 query "p75 LCP per route per
63
+ site, last 7 days" is a single SQL query against the existing table.
64
+ - One Grafana dashboard template added to
65
+ `stats-lake/observability/dashboards/templates/site-rum.json`. Auto-
66
+ provisioned via the existing Phase 5 script — `--with-rum` flag
67
+ mirrors `--with-alerts`.
68
+
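The validate-then-forward step of the `/__deco/rum` handler described above could look roughly like this. It is a sketch under stated assumptions: the payload shape, the `rum.metric.*` attribute names, and the `toLogRecords` helper are hypothetical, and the PII redaction step from the shared library is deliberately left out because its API is not shown in this plan.

```typescript
// Hypothetical wire format for one RUM event.
interface RumPayloadEvent {
  name: string;
  value: number;
  route_pattern: string;
}

// Minimal stand-in for a row bound for `otel_logs`.
interface LogRecord {
  SeverityText: "INFO";
  Body: string;
  LogAttributes: Record<string, string>;
}

const ALLOWED_METRICS = new Set(["LCP", "CLS", "INP", "FCP", "TTFB"]);

// Type guard: rejects anything that is not a well-formed, known metric.
function isRumEvent(x: unknown): x is RumPayloadEvent {
  if (typeof x !== "object" || x === null) return false;
  const e = x as Record<string, unknown>;
  return (
    typeof e.name === "string" &&
    ALLOWED_METRICS.has(e.name) &&
    typeof e.value === "number" &&
    Number.isFinite(e.value) &&
    typeof e.route_pattern === "string"
  );
}

// Drops invalid entries and maps the rest to INFO log records under `rum.*`.
function toLogRecords(payload: unknown): LogRecord[] {
  if (!Array.isArray(payload)) return [];
  return payload.filter(isRumEvent).map((e) => ({
    SeverityText: "INFO" as const,
    Body: `rum ${e.name}`,
    LogAttributes: {
      "rum.metric.name": e.name,
      "rum.metric.value": String(e.value),
      "rum.route_pattern": e.route_pattern,
    },
  }));
}
```

Silently dropping malformed entries, rather than rejecting the whole batch, is one possible design; it keeps a single bad event from discarding the rest of a pageview's metrics.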
69
+ **Cost:** marginal. One row per pageview per metric (~5 rows / pageview)
70
+ adds < 5% to existing log volume, which fits within the headroom of
71
+ the cost guardrail dashboard (Phase 6).
72
+
73
+ **Risks:**
74
+ - INP requires the Event Timing API; on browsers without it, fall back
75
+ to FID and report it as a separate metric so INP isn't muddied.
76
+ - `web-vitals` runs ~50ms of JS on first input; teams that obsess over
77
+ shaving milliseconds will want a build flag to disable it. Ship one
78
+ off the bat.
79
+
80
+ ### Tier 2 — Tier 1 + custom user-journey events + resource timing
81
+
82
+ **What's added:**
83
+ - A typed `rum.track(name, attributes)` API exposed from
84
+ `@decocms/start/sdk/rum.ts`. Sites call
85
+ `rum.track('add_to_cart', { sku, price, currency })` and the event
86
+ flows through the same `/__deco/rum` endpoint into `otel_logs` with a
87
+ reserved attribute namespace (`rum.event.*`).
88
+ - Resource Timing API rollup: for each pageview, total resource bytes,
89
+ count by content-type, slowest 5 URLs (with paths redacted). Helps
90
+ diagnose when a third-party tag has gone bad.
91
+ - A `rum.identify(userIdHash)` call that lets sites cohort by logged-in
92
+ user without sending PII (the site hashes the user id before passing
93
+ it in; we never see plaintext).
94
+
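The Resource Timing rollup described above can be illustrated with a small pure function. This is a sketch only: the `ResourceSample` and `ResourceRollup` shapes are hypothetical, `initiatorType` stands in for content-type (which `PerformanceResourceTiming` does not expose directly), and "paths redacted" is interpreted here as keeping only the URL origin.

```typescript
// Subset of PerformanceResourceTiming fields, copied into a plain object.
interface ResourceSample {
  url: string; // PerformanceResourceTiming.name
  initiatorType: string;
  transferSize: number; // bytes
  duration: number; // milliseconds
}

interface ResourceRollup {
  totalBytes: number;
  countByType: Record<string, number>;
  slowest: { origin: string; duration: number }[];
}

// Aggregates one pageview's resources into a compact, path-free summary.
function rollupResources(samples: ResourceSample[], topN: number = 5): ResourceRollup {
  const countByType: Record<string, number> = {};
  let totalBytes = 0;
  for (const s of samples) {
    totalBytes += s.transferSize;
    countByType[s.initiatorType] = (countByType[s.initiatorType] ?? 0) + 1;
  }
  const slowest = [...samples]
    .sort((a, b) => b.duration - a.duration)
    .slice(0, topN)
    // Assumed redaction: keep scheme + host only, drop path and query.
    .map((s) => ({ origin: new URL(s.url).origin, duration: s.duration }));
  return { totalBytes, countByType, slowest };
}
```

In a browser the input could come from `performance.getEntriesByType("resource")`, mapped into `ResourceSample` objects before the rollup.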
95
+ **Implementation footprint:** ~6 additional dev-weeks on top of Tier 1.
96
+ - The framework-side API + types: ~2 weeks.
97
+ - Resource Timing payload shape + redaction: ~1 week.
98
+ - Documentation, codemod fixtures, and an audit rule that enforces
99
+ the redaction helpers stay in use: ~3 weeks.
100
+ - A second Grafana dashboard (`site-rum-events.json`) that pivots on
101
+ `rum.event.*`.
102
+
103
+ **Risks:**
104
+ - Cardinality explosion: a site that emits `track('view_product', { id })`
105
+ with the product id as a label creates one time-series per SKU. The
106
+ attribute system has to **enforce** id-as-attribute / id-not-as-label
107
+ via the type system + a runtime check in the framework. Doable but
108
+ it's the new piece that has to be designed right.
109
+ - Custom events drift across sites unless we standardize a vocabulary.
110
+ Recommend shipping a small reserved-name list (`add_to_cart`,
111
+ `begin_checkout`, `purchase`, `view_product`) so fleet-wide dashboards
112
+ can roll up "conversion funnel" without per-site cooperation.
113
+
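One way the two mitigations above (a reserved-name vocabulary and an id-as-attribute rule) might combine is sketched below. Everything here is hypothetical: the `buildEvent` helper, the split between `labels` and `attributes`, the `custom.` prefix for site-specific names, and the deny-list of keys are assumptions made for illustration, not the planned `rum.track` signature.

```typescript
// Reserved names taken from the list above; the union type lets the compiler check callers.
const RESERVED_EVENTS = ["add_to_cart", "begin_checkout", "purchase", "view_product"] as const;
type ReservedEvent = (typeof RESERVED_EVENTS)[number];
type EventName = ReservedEvent | `custom.${string}`;

// Hypothetical deny-list of keys that must never be used as labels.
const HIGH_CARDINALITY_KEYS = new Set(["id", "sku", "user_id"]);

interface TrackedEvent {
  name: string;
  labels: Record<string, string>; // low-cardinality, safe to group by
  attributes: Record<string, string>; // free-form, namespaced under rum.event.*
}

function buildEvent(
  name: EventName,
  labels: Record<string, string>,
  attributes: Record<string, string> = {},
): TrackedEvent {
  // Runtime check backing up the type-level rule.
  for (const key of Object.keys(labels)) {
    if (HIGH_CARDINALITY_KEYS.has(key)) {
      throw new Error(`"${key}" is high-cardinality; pass it as an attribute, not a label`);
    }
  }
  const namespaced: Record<string, string> = {};
  for (const [key, value] of Object.entries(attributes)) {
    namespaced[`rum.event.${key}`] = value;
  }
  return { name, labels, attributes: namespaced };
}
```

A call such as `buildEvent("view_product", { currency: "BRL" }, { id: "123" })` keeps the product id out of the label set, while a misspelled reserved name fails at compile time.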
114
+ ### Tier 3 — Tier 2 + session replay + interaction heatmaps
115
+
116
+ **What's added:**
117
+ - Session replay: capture every DOM mutation + every user input as a
118
+ delta-encoded stream, replay it as a video in HyperDX / a custom
119
+ viewer. The canonical OSS implementation is
120
+ [`rrweb`](https://github.com/rrweb-io/rrweb).
121
+ - Interaction heatmaps: aggregate click positions over a page within a
122
+ given time window.
123
+
124
+ **Implementation footprint:** months.
125
+ - ~30KB gzipped of `rrweb` on every pageview — measurable LCP impact.
126
+ Mitigation: lazy-load after first interaction; some events miss the
127
+ first paint.
128
+ - The replay payload is enormous (~100KB/min/session compressed).
129
+ Multiplied by realistic session counts this is a 10-100× ingest
130
+ blowup vs Tier 1. Replay storage is the new bottleneck, not the
131
+ schema; we'd need an R2-backed cold tier separate from ClickHouse.
132
+ - A privacy review needs to ship before the first byte of replay
133
+ flows. PII redaction has to happen client-side before the network —
134
+ rrweb's "mask all inputs" mode is the floor; password fields, credit
135
+ card numbers, and authenticated user-data attributes need explicit
136
+ privacy classes.
137
+
138
+ **Risks:**
139
+ - Privacy: replay is a high-magnification footgun. One regression in
140
+ the mask config and we've recorded a user's credit card. The whole
141
+ payment-flow page must be force-masked at the framework level, not
142
+ opt-in per site.
143
+ - Cost: 10-100× the ingest of Tier 1 even with aggressive sampling.
144
+ The Phase 6 cost-guardrail dashboard will trip; needs a new tier of
145
+ retention policies (replay rows expire at 30d, not 90d like logs).
146
+
147
+ ## Out of scope (regardless of tier)
148
+
149
+ - **Synthetic monitoring.** Lighthouse runs against canonical journeys
150
+ on a cron. Covers a different need — "would a clean browser have
151
+ hit our SLO?" rather than "what did real users see?". A separate
152
+ initiative if we want it.
153
+ - **Heatmaps for individual users.** Aggregate heatmaps only; never
154
+ identifiable.
155
+ - **A/B-test attribution.** Sites that want it can pass an experiment
156
+ cohort id through `rum.identify` — we don't ship the experiment
157
+ framework itself.
158
+
159
+ ## Recommended sequencing
160
+
161
+ Ship Tier 1 in one PR. Live with the data for a quarter. If during
162
+ that quarter the question "but what did the user actually click before
163
+ they bounced?" comes up more than twice, plan and ship Tier 2. If
164
+ during *that* quarter we hit a category of bug that can only be
165
+ diagnosed by replay (so far: 0), plan Tier 3.
166
+
167
+ **Anti-recommendation:** do not commit to Tier 3 up front. Replay is
168
+ the highest-cost, highest-privacy-risk piece of the whole observability
169
+ surface, and the bug categories that genuinely need it are rare.
170
+
171
+ ## Decision points
172
+
173
+ These mirror the main plan's structure — answer once, then this
174
+ document gets a follow-up PR turning answers into TODOs.
175
+
176
+ 1. **Tier selection.** Tier 1 only / Tier 1 + Tier 2 / Full Tier 3.
177
+ 2. **Identify hashing.** If we're shipping Tier 2, do sites pass in a
178
+ client-side hash, or do we accept plaintext IDs and hash at the
179
+ ingest worker? (Recommend client-side hash — keeps plaintext out of
180
+ our pipeline entirely.)
181
+ 3. **Sampling.** RUM events are cheap per-row but high-volume. Default
182
+ sample rate: 100% (Tier 1), 100% (Tier 2 events), 10% (Tier 3
183
+ replay). Confirm or revise per tier.
184
+
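If sampling is applied, a deterministic per-session decision keeps all events from one session together. The sketch below assumes a hypothetical `shouldSample` helper and a session id string supplied by the caller; the hash is an FNV-1a-style function over UTF-16 code units, chosen only because it is short, and the rates are the defaults proposed above.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units; returns an unsigned integer.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Default rates from the decision point above (subject to revision per tier).
const SAMPLE_RATES = { vitals: 1.0, events: 1.0, replay: 0.1 } as const;
type SampleKind = keyof typeof SAMPLE_RATES;

// Maps the session id to a number in [0, 1) and compares it to the rate,
// so the same session always gets the same answer.
function shouldSample(sessionId: string, kind: SampleKind): boolean {
  return fnv1a(sessionId) / 0x100000000 < SAMPLE_RATES[kind];
}
```

Hashing the session id rather than calling `Math.random()` per event means a sampled replay session is either fully recorded or not recorded at all.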
185
+ ## Files this plan would touch (Tier 1)
186
+
187
+ ```
188
+ deco-start/
189
+ ├── src/sdk/rum.ts # NEW — browser-side instrumentation
190
+ ├── src/sdk/rum.server.ts # NEW — /__deco/rum handler
191
+ ├── src/admin/setup.ts # ROUTE — mount /__deco/rum
192
+ ├── src/sdk/observability.ts # EXPORT — re-export rum API
193
+ ├── package.json # DEP — add `web-vitals` peer dep
194
+ └── docs/rum.md # NEW — site-side usage docs
195
+
196
+ stats-lake/observability/
197
+ └── dashboards/templates/site-rum.json # NEW
198
+ ```
199
+
200
+ ## Why this is a plan and not an implementation
201
+
202
+ The user explicitly asked for RUM to be in a separate plan document
203
+ rather than rolled into the refinement plan. The scope-tier decision
204
+ above is the single highest-leverage choice; everything downstream
205
+ follows from it, and "Tier 3 because Tier 3 is biggest" is the wrong
206
+ default. A 30-minute conversation on the tier choice saves weeks of
207
+ work in the wrong direction.
208
+
209
+ Open the matching Linear / GitHub issue once a tier is selected.
@@ -0,0 +1,40 @@
1
+ # Runbooks
2
+
3
+ Tested response procedures for the alerts auto-provisioned by
4
+ [`stats-lake/observability/`](https://github.com/decocms/stats-lake/blob/main/observability/README.md).
5
+
6
+ Every alert generated by `provision-dashboards.ts` carries a
7
+ `runbook_url` annotation that points to a Markdown file in this
8
+ directory. Each runbook follows the same structure so on-call can read
9
+ top-to-bottom under stress:
10
+
11
+ 1. **What this alert means.** One paragraph, no jargon.
12
+ 2. **First check (60 seconds).** The single most likely cause and how to
13
+ confirm it.
14
+ 3. **Diagnostic queries.** ClickHouse SQL the responder can paste into
15
+ Grafana / ClickStack to dig deeper.
16
+ 4. **Common causes & fixes.** Ranked by frequency. Each with a "did
17
+ that fix it?" verification step.
18
+ 5. **Escalation.** When to page a domain owner, and who.
19
+ 6. **Post-mortem hook.** What to capture so the post-mortem isn't
20
+ reconstructed from memory.
21
+
22
+ ## Runbook catalogue
23
+
24
+ | Alert ID | Runbook |
25
+ |---------------------------|--------------------------------------------------------|
26
+ | `http-error-spike` | [`http-error-spike.md`](./http-error-spike.md) |
27
+ | `http-latency-spike` | [`http-latency-spike.md`](./http-latency-spike.md) |
28
+ | `cache-hit-drop` | [`cache-hit-drop.md`](./cache-hit-drop.md) |
29
+ | `commerce-upstream-slow` | [`commerce-upstream-slow.md`](./commerce-upstream-slow.md) |
30
+ | `tail-exception-spike` | [`tail-exception-spike.md`](./tail-exception-spike.md) |
31
+
32
+ ## Authoring a new runbook
33
+
34
+ When you add a new alert template, add a runbook with the same ID. The
35
+ provisioning script doesn't enforce the link today — that gate lands in
36
+ Phase 6 (Governance), but treat it as required: an alert without a
37
+ runbook is a half-shipped alert.
38
+
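Until the Phase 6 gate exists, the alert-to-runbook link could be checked by hand with something like the following. This is a minimal sketch: `runbookUrl` and `missingRunbooks` are hypothetical helpers, not functions from `provision-dashboards.ts`, and the base URL is supplied by the caller.

```typescript
// Builds the value an alert's `runbook_url` annotation would carry.
function runbookUrl(baseUrl: string, alertId: string): string {
  return `${baseUrl.replace(/\/+$/, "")}/${alertId}.md`;
}

// Returns the alert IDs that have no `<id>.md` file in the runbook directory.
function missingRunbooks(alertIds: string[], runbookFiles: string[]): string[] {
  const present = new Set(runbookFiles.map((file) => file.replace(/\.md$/, "")));
  return alertIds.filter((id) => !present.has(id));
}
```

A CI step could fail the build whenever `missingRunbooks` returns a non-empty list, which is one way the planned gate might be enforced.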
39
+ Keep runbooks short. If a section grows past 20 lines, split it out
40
+ into a dedicated incident-pattern doc and link to it from the runbook.
@@ -0,0 +1,83 @@
1
+ # Runbook: `cache-hit-drop`
2
+
3
+ > A site's edge cache hit rate fell below its own 24h rolling baseline by 3σ for ≥ 10 minutes.
4
+
5
+ ## What this alert means
6
+
7
+ The edge cache is missing more than usual. On the user side this
8
+ manifests as slower page loads. On the cost side it means more origin
9
+ requests (more billing for Workers + commerce API calls). On the
10
+ upstream side it can become a thundering herd if many users hit a
11
+ freshly-evicted entry simultaneously.
12
+
13
+ ## First check (60 seconds)
14
+
15
+ Was there a deploy or a cache purge in the last 10 minutes? Cold caches
16
+ recover quickly (5–10m) so if the alert is fresh and a deploy is
17
+ recent, this often self-heals.
18
+
19
+ ```sql
20
+ -- Recent deploys (any change to service.version visible in metrics)
21
+ SELECT ResourceAttributes['service.version'] AS version, min(TimeUnix) AS first_seen
22
+ FROM otel_metrics_sum
23
+ WHERE ServiceName = '{site}'
24
+ AND TimeUnix > now() - INTERVAL 1 HOUR
25
+ GROUP BY version
26
+ ORDER BY first_seen DESC;
27
+ ```
28
+
29
+ If neither deploy nor purge fired in the window, the cache miss share
30
+ indicates a real regression — proceed below.
31
+
32
+ ## Diagnostic queries
33
+
34
+ ```sql
35
+ -- Hit / miss share by route_pattern, last 30 minutes
36
+ SELECT
37
+ Attributes['route_pattern'] AS route,
38
+ sumIf(toFloat64(Value), MetricName = 'cache_hit_total') AS hits,
39
+ sumIf(toFloat64(Value), MetricName = 'cache_miss_total') AS misses,
40
+ hits / nullIf(hits + misses, 0) AS hit_rate
41
+ FROM otel_metrics_sum
42
+ WHERE MetricName IN ('cache_hit_total', 'cache_miss_total')
43
+ AND ServiceName = '{site}'
44
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
45
+ GROUP BY route
46
+ ORDER BY misses DESC
47
+ LIMIT 20;
48
+ ```
49
+
50
+ ```sql
51
+ -- Cache decision distribution by cache_profile
52
+ SELECT
53
+ Attributes['profile'] AS profile,
54
+ Attributes['decision'] AS decision,
55
+ sum(toFloat64(Value)) AS n
56
+ FROM otel_metrics_sum
57
+ WHERE MetricName = 'cache_hit_total'
58
+ AND ServiceName = '{site}'
59
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
60
+ GROUP BY profile, decision
61
+ ORDER BY n DESC;
62
+ ```
63
+
64
+ ## Common causes & fixes
65
+
66
+ | Rank | Cause | How to confirm | Fix |
67
+ |------|-------------------------------------------------------------|----------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
68
+ | 1 | Deploy purged the version cache (`X-Cache-Version` flipped) | Recent `service.version` in the deploy query | Wait 10m for cache to warm. If sustained, check that the new build hash is propagating consistently. |
69
+ | 2 | A new query parameter is hashing into the cache key | One route's MISS share is far higher than the rest | Check `cacheHeaders` / `ignoreSearchParams` config for that route; add the new param to the ignore list. |
70
+ | 3 | Set-Cookie present on a previously cacheable response | `X-Cache: BYPASS` with `X-Cache-Reason: private-set-cookie` on the affected route | Inspect the section that started emitting cookies; move the cookie write to a non-cacheable POST handler. |
71
+ | 4 | A real burst of unique URLs (e.g. crawler scanning long-tail) | `Attributes['route_pattern']` doesn't change but distinct paths multiply | If a known bot, add a WAF rule. If a real catalog query, consider broader cache profile. |
72
+
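Cause 2 in the table is easier to reason about with a concrete cache-key function. The sketch below is an assumption about how such a key might be derived, not the framework's actual implementation; `cacheKey` is a hypothetical helper, and only the `ignoreSearchParams` name comes from the table above.

```typescript
// Derives a cache key from a URL, dropping ignored params and sorting the rest
// so that parameter order and tracking params do not fragment the cache.
function cacheKey(rawUrl: string, ignoreSearchParams: string[]): string {
  const url = new URL(rawUrl);
  for (const param of ignoreSearchParams) {
    url.searchParams.delete(param);
  }
  url.searchParams.sort();
  return url.pathname + url.search;
}
```

With this shape, a newly introduced marketing parameter that is missing from the ignore list yields a distinct key per value, which matches the "one route's MISS share is far higher than the rest" symptom.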
73
+ ## Escalation
74
+
75
+ - Sustained > 1 hour despite no deploy → page the site team owner.
76
+ - Suspected bot/abuse → loop in security / WAF on-call.
77
+
78
+ ## Post-mortem hook
79
+
80
+ - The "before" hit rate and the "after" hit rate.
81
+ - The top route that lost the hit rate.
82
+ - A representative response header snippet showing `X-Cache`,
83
+ `X-Cache-Profile`, `X-Cache-Reason`.
@@ -0,0 +1,88 @@
1
+ # Runbook: `commerce-upstream-slow`
2
+
3
+ > A site's `commerce_request_duration_ms` p95 exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
4
+
5
+ ## What this alert means
6
+
7
+ Calls out to a commerce provider (VTEX, Shopify, or similar) are
8
+ taking abnormally long for *this* site. Because SSR is synchronous on
9
+ upstream commerce calls, a slow upstream cascades into the user-facing
10
+ `http-latency-spike` alert almost immediately. If both fired together,
11
+ this is the root cause — fix here first.
12
+
13
+ ## First check (60 seconds)
14
+
15
+ Which provider/operation is slow? The same dashboard's "Commerce p95
16
+ by provider/operation" panel breaks it out. Note the
17
+ `provider.operation` string — e.g. `vtex.intelligent-search.product_search`.
18
+
19
+ If a single operation is responsible, jump to "Common causes" #1.
20
+ If multiple operations from the same provider are slow simultaneously,
21
+ that's a provider-wide regression — jump to "Common causes" #2.
22
+
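The single-operation versus provider-wide decision above can be written down as a small classifier. This is a sketch under assumptions: `classifySlowOps` is a hypothetical helper, and it treats everything before the first dot of the `provider.operation` string as the provider, which fits the `vtex.intelligent-search.product_search` example but may not hold for every provider.

```typescript
type SlowOpClass = "single-operation" | "provider-wide";

// Groups slow `provider.operation` strings by provider and classifies each provider.
function classifySlowOps(slowOps: string[]): Record<string, SlowOpClass> {
  const byProvider = new Map<string, Set<string>>();
  for (const op of slowOps) {
    const dot = op.indexOf(".");
    const provider = dot === -1 ? op : op.slice(0, dot);
    const operation = dot === -1 ? "" : op.slice(dot + 1);
    if (!byProvider.has(provider)) byProvider.set(provider, new Set<string>());
    byProvider.get(provider)!.add(operation);
  }
  const result: Record<string, SlowOpClass> = {};
  byProvider.forEach((operations, provider) => {
    // More than one distinct slow operation points at the provider, not at our call.
    result[provider] = operations.size > 1 ? "provider-wide" : "single-operation";
  });
  return result;
}
```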
23
+ ## Diagnostic queries
24
+
25
+ ```sql
26
+ -- p95 commerce latency by provider + operation, last hour
27
+ SELECT
28
+ toStartOfInterval(TimeUnix, INTERVAL 5 MINUTE) AS t,
29
+ Attributes['provider'] AS provider,
30
+ Attributes['operation'] AS op,
31
+ quantileBFloat16(0.95)(toFloat64(Sum / nullIf(Count, 0))) AS p95
32
+ FROM otel_metrics_histogram
33
+ WHERE MetricName = 'commerce_request_duration_ms'
34
+ AND ServiceName = '{site}'
35
+ AND TimeUnix > now() - INTERVAL 1 HOUR
36
+ GROUP BY t, provider, op
37
+ ORDER BY t, p95 DESC;
38
+ ```
39
+
40
+ ```sql
41
+ -- Commerce call status distribution — are we getting 5xx from upstream?
42
+ SELECT
43
+ Attributes['provider'] AS provider,
44
+ Attributes['operation'] AS op,
45
+ Attributes['status_class'] AS status_class,
46
+ count() AS n
47
+ FROM otel_metrics_histogram
48
+ WHERE MetricName = 'commerce_request_duration_ms'
49
+ AND ServiceName = '{site}'
50
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
51
+ GROUP BY provider, op, status_class
52
+ ORDER BY n DESC;
53
+ ```
54
+
55
+ ```sql
56
+ -- VTEX SWR cache effectiveness on the slow operation
57
+ SELECT
58
+ Attributes['cached'] AS cached,
59
+ count() AS n,
60
+ avg(toFloat64(Sum / nullIf(Count, 0))) AS avg_ms
61
+ FROM otel_metrics_histogram
62
+ WHERE MetricName = 'commerce_request_duration_ms'
63
+ AND ServiceName = '{site}'
64
+ AND Attributes['operation'] = '<paste operation here>'
65
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
66
+ GROUP BY cached;
67
+ ```
68
+
69
+ ## Common causes & fixes
70
+
71
+ | Rank | Cause | How to confirm | Fix |
72
+ |------|------------------------------------------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
73
+ | 1 | One specific upstream operation is slow | Single `provider.operation` row dominates the p95 query | Check provider status page (status.vtex.com, www.shopifystatus.com). If clean, see if we recently changed payload size or filter on that operation. |
74
+ | 2 | Provider-wide regression | Multiple operations from the same `provider` regressed simultaneously | Public provider status page is usually the source of truth. Open a ticket with the provider citing our timing window. |
75
+ | 3 | VTEX SWR / cachedLoader hit rate dropped | Query 3 shows `cached=false` share rose | Inspect recent loader changes for the affected section. May have invalidated the cache key by changing the loader signature. |
76
+ | 4 | Region-specific (CF colo → upstream latency) | `region` label on the metric isolates one CF colo | Usually transient; CF will rebalance. If sustained, file a CF support ticket. |
77
+
78
+ ## Escalation
79
+
80
+ - Provider-wide regression confirmed → notify the affected customer-facing teams; this is communication-shaped, not engineering-shaped.
81
+ - One operation slow, no provider status incident → page the site team owner for that route.
82
+
83
+ ## Post-mortem hook
84
+
85
+ - The `provider.operation` string and its p95 timeline.
86
+ - The cache (`cached=true/false`) split on that operation.
87
+ - A representative trace from `otel_traces` showing the slow span
88
+ (`SpanName LIKE 'vtex.%'` or `'shopify.%'`).
@@ -0,0 +1,98 @@
1
+ # Runbook: `http-error-spike`
2
+
3
+ > A site's 5xx error rate exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
4
+
5
+ ## What this alert means
6
+
7
+ Real users are getting 5xx responses at a rate that's statistically
8
+ abnormal for this specific site. The alert uses a per-site anomaly band
9
+ (not a fleet-wide threshold) so a site that normally runs at 0.3% 5xx
10
+ fires for spikes other sites wouldn't notice — and a site that normally
11
+ runs at 4% (a known-noisy legacy storefront) doesn't false-positive at
12
+ 4.1%.
13
+
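The per-site anomaly band can be made concrete with a few lines of arithmetic. The sketch below is an illustration of a generic mean-plus-3σ check, not the alert rule as provisioned; `exceedsBaseline` is a hypothetical helper, it uses the population standard deviation, and it ignores the "for ≥ 10 minutes" duration condition.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((acc, x) => acc + x, 0) / xs.length;
}

// Population standard deviation of the baseline samples.
function stddev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

// True when `current` sits more than `sigmas` standard deviations above the baseline mean.
function exceedsBaseline(baseline: number[], current: number, sigmas: number = 3): boolean {
  if (baseline.length === 0) return false;
  return current > mean(baseline) + sigmas * stddev(baseline);
}
```

For a quiet site with baseline samples around 0.3%, a jump to 1% clears the band, while a noisy site whose samples spread around 4% does not fire at 4.1% because its own σ is wider.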
14
+ ## First check (60 seconds)
15
+
16
+ Look at the **commerce upstream p95** panel on the same dashboard. If
17
+ that spiked at the same moment, the root cause is almost always an
18
+ upstream commerce API regressing. Stop here, jump to
19
+ [`commerce-upstream-slow.md`](./commerce-upstream-slow.md).
20
+
21
+ If commerce p95 is flat, the 5xx is internal — proceed below.
22
+
23
+ ## Diagnostic queries
24
+
25
+ Paste into ClickStack or a Grafana Explore panel pointed at the
26
+ ClickHouse datasource.
27
+
28
+ ```sql
29
+ -- Top error routes for this site, last 30 minutes
30
+ SELECT
31
+ Attributes['route_pattern'] AS route,
32
+ sumIf(toFloat64(Value), Attributes['status_class'] = '5xx') AS errors,
33
+ sum(toFloat64(Value)) AS total,
34
+ errors / total AS rate
35
+ FROM otel_metrics_sum
36
+ WHERE MetricName = 'http_requests_total'
37
+ AND ServiceName = '{site}'
38
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
39
+ GROUP BY route
40
+ HAVING errors > 0
41
+ ORDER BY rate DESC
42
+ LIMIT 20;
43
+ ```
44
+
45
+ ```sql
46
+ -- Recent exceptions captured by the tail worker
47
+ SELECT Timestamp, Body, LogAttributes['url.path'] AS path, LogAttributes['http.response.status_code'] AS status
48
+ FROM otel_logs
49
+ WHERE ServiceName = '{site}'
50
+ AND SeverityText = 'ERROR'
51
+ AND LogAttributes['_source'] = 'tail-worker'
52
+ AND LogAttributes['_outcome'] = 'exception'
53
+ AND Timestamp > now() - INTERVAL 30 MINUTE
54
+ ORDER BY Timestamp DESC
55
+ LIMIT 100;
56
+ ```
57
+
58
+ ```sql
59
+ -- Did a deploy correlate? List versions seen in the last hour
60
+ SELECT
61
+ ResourceAttributes['service.version'] AS version,
62
+ min(Timestamp) AS first_seen,
63
+ max(Timestamp) AS last_seen,
64
+ count() AS log_count
65
+ FROM otel_logs
66
+ WHERE ServiceName = '{site}'
67
+ AND Timestamp > now() - INTERVAL 1 HOUR
68
+ GROUP BY version
69
+ ORDER BY first_seen DESC;
70
+ ```
71
+
72
+ ## Common causes & fixes
73
+
74
+ | Rank | Cause | How to confirm | Fix |
75
+ |------|----------------------------------------------------|-------------------------------------------------------------|---------------------------------------------------------------------|
76
+ | 1 | A recent deploy regressed | Deploy-correlation query above (the third one) shows a `service.version` that flipped just before the spike | Roll back via Cloudflare dashboard `Deployments → Rollback`. Confirm via a fresh `service.version` line in the next 5m. |
77
+ | 2 | A specific route is broken (one bad section) | Top error routes query shows one `route_pattern` at 100% error rate | Check the recent commits to that section. Roll back or `Lazy` wrap it for graceful degradation. |
78
+ | 3 | Upstream cache layer evicted; cold-cache thundering herd | `cache_miss_total` for the same window spikes proportionally to errors | Wait it out — usually self-heals in 5m. If sustained, check that `staleTime` is set correctly on cmsRouteConfig. |
79
+ | 4 | Origin (commerce API) returning 5xx | `commerce_request_duration_ms` spike OR commerce logs | See [`commerce-upstream-slow.md`](./commerce-upstream-slow.md). |
80
+
81
+ ## Escalation
82
+
83
+ - **Site team owner** if a fix isn't obvious in 15 minutes (Slack
84
+ `#deco-platform`).
85
+ - **Cloudflare support** if all sites in a region are affected
86
+ simultaneously (look at the `region` label on the metrics) — this
87
+ has happened during CF colo incidents.
88
+
89
+ ## Post-mortem hook
90
+
91
+ Capture before the alert clears:
92
+ - `request.id` of one failing request (from the response header
93
+ `X-Request-Id` of a manually-reproduced 5xx).
94
+ - A representative tail-worker log row with stack trace.
95
+ - The deploy `service.version` window during the spike.
96
+
97
+ Stash them in the incident ticket so the post-mortem has the
98
+ correlation IDs it needs.
@@ -0,0 +1,82 @@
1
+ # Runbook: `http-latency-spike`
2
+
3
+ > A site's p95 latency exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
4
+
5
+ ## What this alert means
6
+
7
+ User-perceived latency on this site is statistically abnormal vs the
8
+ last 24 hours. Latency rarely degrades in isolation — almost always
9
+ something else is bottlenecked underneath. Use this alert as the
10
+ "something is wrong, look around" signal, then triangulate.
11
+
12
+ ## First check (60 seconds)
13
+
14
+ Open the dashboard's **commerce p95 by provider/operation** panel. The
15
+ most common cause of p95 spikes is an upstream commerce API (VTEX,
16
+ Shopify) slowing down — and our SSR is synchronous on the upstream
17
+ call.
18
+
19
+ If commerce p95 spiked at the same moment, jump to
20
+ [`commerce-upstream-slow.md`](./commerce-upstream-slow.md).
21
+
22
+ ## Diagnostic queries
23
+
24
+ ```sql
25
+ -- Latency p95 by route_pattern, last hour
26
+ SELECT
27
+ toStartOfInterval(TimeUnix, INTERVAL 5 MINUTE) AS t,
28
+ Attributes['route_pattern'] AS route,
29
+ quantileBFloat16(0.95)(toFloat64(Sum / nullIf(Count, 0))) AS p95
30
+ FROM otel_metrics_histogram
31
+ WHERE MetricName = 'http_request_duration_ms'
32
+ AND ServiceName = '{site}'
33
+ AND TimeUnix > now() - INTERVAL 1 HOUR
34
+ GROUP BY t, route
35
+ ORDER BY t, p95 DESC;
36
+ ```
37
+
38
+ ```sql
39
+ -- Cache decision distribution — did hit rate drop while latency rose?
40
+ SELECT
41
+ Attributes['cache_decision'] AS decision,
42
+ count() AS n,
43
+ avg(toFloat64(Sum / nullIf(Count, 0))) AS avg_ms
44
+ FROM otel_metrics_histogram
45
+ WHERE MetricName = 'http_request_duration_ms'
46
+ AND ServiceName = '{site}'
47
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
48
+ GROUP BY decision
49
+ ORDER BY n DESC;
50
+ ```
51
+
52
+ ```sql
53
+ -- Slow traces with full span breakdown (sampled ~1%, so re-run if empty)
54
+ SELECT TraceId, SpanName, Duration / 1e6 AS ms, SpanAttributes['url.path'] AS path
55
+ FROM otel_traces
56
+ WHERE ServiceName = '{site}'
57
+ AND Timestamp > now() - INTERVAL 30 MINUTE
58
+ AND SpanName = 'deco.http.request'
59
+ AND (Duration / 1e6) > 2000
60
+ ORDER BY Duration DESC
61
+ LIMIT 50;
62
+ ```
63
+
64
+ ## Common causes & fixes
65
+
66
+ | Rank | Cause | How to confirm | Fix |
67
+ |------|------------------------------------------------------|-----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
68
+ | 1 | Upstream commerce API slow | Commerce p95 panel spikes with the same shape | See [`commerce-upstream-slow.md`](./commerce-upstream-slow.md). |
69
+ | 2 | Cache hit rate dropped (cold cache after deploy/purge) | Cache panel shows MISS share rose at spike start; usually self-heals within 5-10m | Wait it out unless sustained; if sustained check the route-level cache profile. |
70
+ | 3 | One specific route is slow (heavy loader added) | Per-route p95 query shows one `route_pattern` dominating | Inspect recent commits to that route's loader. Consider deferring sections via `Lazy`. |
71
+ | 4 | Cloudflare edge / colo issue | `region` label distribution skewed to one or two colos | Check CF status page; usually clears on its own. |
72
+
73
+ ## Escalation
74
+
75
+ - 30 minutes without resolution → page the site team owner.
76
+ - All sites in a region affected → suspect CF infra; check status.cloudflare.com.
77
+
78
+ ## Post-mortem hook
79
+
80
+ - A representative slow `TraceId` from the third query above.
81
+ - The cache hit rate before/during the spike.
82
+ - Deploy version at the start of the window.
@@ -0,0 +1,100 @@
1
+ # Runbook: `tail-exception-spike`
2
+
3
+ > A site's tail-worker `_outcome=exception` count exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
4
+
5
+ ## What this alert means
6
+
7
+ Real, uncaught exceptions are happening in the Worker — captured by
8
+ the tail consumer with 100% fidelity (`deco-otel-tail`). After Phase 1
9
+ severity reclassification, this alert specifically excludes `canceled`
10
+ and `responseStreamDisconnected` outcomes (those are client-disconnect
11
+ noise, not bugs). What's left is a true bug, OOM, or CPU-limit kill.
12
+
13
+ ## First check (60 seconds)
14
+
15
+ ```sql
16
+ -- What's blowing up, last 15 minutes
17
+ SELECT Body, LogAttributes['url.path'] AS path, count() AS n
18
+ FROM otel_logs
19
+ WHERE ServiceName = '{site}'
20
+ AND SeverityText = 'ERROR'
21
+ AND LogAttributes['_source'] = 'tail-worker'
22
+ AND LogAttributes['_outcome'] = 'exception'
23
+ AND Timestamp > now() - INTERVAL 15 MINUTE
24
+ GROUP BY Body, path
25
+ ORDER BY n DESC
26
+ LIMIT 30;
27
+ ```
28
+
29
+ If 90% of the rows share the same `Body` (same exception class /
30
+ message), that's the bug — proceed to "Common causes" #1.
31
+
32
+ If the exceptions are scattered across many distinct messages, you
33
+ likely have a resource problem (OOM / CPU limit) — proceed to #2.
34
+
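The 90% rule above can be applied mechanically to the rows the first-check query returns. The sketch below is illustrative only: `triageExceptions` and the row shape are hypothetical, and because the query groups by both `Body` and `path`, the helper first re-aggregates counts per `Body` so one exception spread over many paths still counts as a single bug.

```typescript
interface ExceptionRow {
  body: string;
  n: number;
}

type Triage = "single-bug" | "scattered" | "no-data";

// Classifies the spike: one dominant exception message versus many distinct ones.
function triageExceptions(rows: ExceptionRow[], dominance: number = 0.9): Triage {
  const perBody = new Map<string, number>();
  let total = 0;
  for (const row of rows) {
    perBody.set(row.body, (perBody.get(row.body) ?? 0) + row.n);
    total += row.n;
  }
  if (total === 0) return "no-data";
  let top = 0;
  perBody.forEach((count) => {
    if (count > top) top = count;
  });
  return top / total >= dominance ? "single-bug" : "scattered";
}
```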
35
+ ## Diagnostic queries
36
+
37
+ ```sql
38
+ -- Outcome distribution — separate exception from exceededMemory / exceededCpu
39
+ SELECT
40
+ LogAttributes['_outcome'] AS outcome,
41
+ count() AS n
42
+ FROM otel_logs
43
+ WHERE ServiceName = '{site}'
44
+ AND LogAttributes['_source'] = 'tail-worker'
45
+ AND Timestamp > now() - INTERVAL 30 MINUTE
46
+ GROUP BY outcome
47
+ ORDER BY n DESC;
48
+ ```
49
+
50
+ ```sql
51
+ -- Did a specific deploy cause it?
52
+ SELECT
53
+ LogAttributes['service.version'] AS version,
54
+ LogAttributes['_outcome'] AS outcome,
55
+ count() AS n
56
+ FROM otel_logs
57
+ WHERE ServiceName = '{site}'
58
+ AND LogAttributes['_source'] = 'tail-worker'
59
+ AND Timestamp > now() - INTERVAL 1 HOUR
60
+ GROUP BY version, outcome
61
+ ORDER BY n DESC;
62
+ ```
63
+
64
+ ```sql
65
+ -- Pull the full record for one offending request to get request.id
66
+ -- and trace.id for join queries
67
+ SELECT *
68
+ FROM otel_logs
69
+ WHERE ServiceName = '{site}'
70
+ AND SeverityText = 'ERROR'
71
+ AND LogAttributes['_source'] = 'tail-worker'
72
+ AND LogAttributes['_outcome'] = 'exception'
73
+ AND Timestamp > now() - INTERVAL 15 MINUTE
74
+ ORDER BY Timestamp DESC
75
+ LIMIT 1;
76
+ ```
77
+
78
+ ## Common causes & fixes
79
+
80
+ | Rank | Cause | How to confirm | Fix |
81
+ |------|----------------------------------------------------|-------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
82
+ | 1 | A single uncaught throw, recent deploy | Same `Body` dominates; one `service.version` correlates | Roll back the deploy. File a bug with the offending stack + `request.id` for repro. Add a try/catch + structured `logger.error`. |
83
+ | 2 | `exceededMemory` (OOM) | Outcome query shows non-trivial `exceededMemory` count | Look for large in-memory buffers — a `Response.text()` on a multi-MB upstream, a runaway `JSON.parse`. See [`deco-site-memory-debugging`](https://github.com/decocms/deco-start/blob/main/.cursor/skills/deco-site-memory-debugging/SKILL.md) skill. |
84
+ | 3 | `exceededCpu` (CPU-limit kill) | Outcome query shows `exceededCpu` | Investigate a section with a heavy synchronous loop. Move work to a server function or shed load via cache. |
85
+ | 4 | A new upstream returning malformed responses | `Body` references a third-party hostname; matches a known endpoint | Add defensive parsing + a structured `logger.error` so the throw becomes a typed error, not a crash. |
86
+
87
+ ## Escalation
88
+
89
+ - `exceededMemory` / `exceededCpu` sustained → page site team + platform on-call. May indicate a leak that will recur until isolate restart.
90
+ - A throw we can't decode in 15 minutes → page site team owner.
91
+
92
+ ## Post-mortem hook
93
+
94
+ - One full record from query #3 above — preserves the
95
+ `request.id` / `trace.id` for cross-channel correlation.
96
+ - The dominant `Body` (the exception message).
97
+ - The `service.version` window.
98
+ - Whether the alert fired on `exception` or `exceededMemory` /
99
+ `exceededCpu` — drives whether the post-mortem investigates code or
100
+ resource bounds.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@decocms/start",
3
- "version": "6.1.0",
3
+ "version": "6.2.0",
4
4
  "type": "module",
5
5
  "description": "Deco framework for TanStack Start - CMS bridge, admin protocol, hooks, schema generation",
6
6
  "main": "./src/index.ts",
@@ -2,7 +2,11 @@ import * as fs from "node:fs";
2
2
  import * as os from "node:os";
3
3
  import * as path from "node:path";
4
4
  import { afterEach, beforeEach, describe, expect, it } from "vitest";
5
- import { auditObservabilityBlock } from "./audit-observability-config";
5
+ import {
6
+ auditFleetBindings,
7
+ auditObservabilityBlock,
8
+ auditWranglerConfig,
9
+ } from "./audit-observability-config";
6
10
  import { parseJsonc, stripJsoncTrailingCommas } from "./lib/jsonc";
7
11
 
8
12
  describe("auditObservabilityBlock", () => {
@@ -138,6 +142,125 @@ describe("auditObservabilityBlock", () => {
138
142
  });
139
143
  });
140
144
 
145
+ describe("auditFleetBindings (D-14)", () => {
146
+ const canonicalBindings = {
147
+ version_metadata: { binding: "CF_VERSION_METADATA" },
148
+ analytics_engine_datasets: [{ binding: "DECO_METRICS", dataset: "deco_metrics_site" }],
149
+ tail_consumers: [{ service: "deco-otel-tail" }],
150
+ vars: {
151
+ DECO_OTEL_METRICS_ENDPOINT: "https://deco-otel-ingest.example/v1/metrics",
152
+ DECO_OTEL_TRACES_ENDPOINT: "https://deco-otel-ingest.example/v1/traces",
153
+ DECO_OTEL_LOGS_ENDPOINT: "https://deco-otel-ingest.example/v1/logs",
154
+ },
155
+ };
156
+
157
+ it("returns no findings for canonical bindings", () => {
158
+ expect(auditFleetBindings(canonicalBindings)).toEqual([]);
159
+ });
160
+
161
+ it("flags version_metadata_binding_missing as error", () => {
162
+ const { version_metadata: _, ...rest } = canonicalBindings;
163
+ const findings = auditFleetBindings(rest);
164
+ const f = findings.find((x) => x.id === "version_metadata_binding_missing");
165
+ expect(f?.severity).toBe("error");
166
+ });
167
+
168
+ it("flags version_metadata_binding_missing when binding is empty", () => {
169
+ const findings = auditFleetBindings({
170
+ ...canonicalBindings,
171
+ version_metadata: { binding: "" },
172
+ });
173
+ const f = findings.find((x) => x.id === "version_metadata_binding_missing");
174
+ expect(f).toBeDefined();
175
+ });
176
+
177
+ it("flags analytics_engine_binding_missing as warn", () => {
178
+ const findings = auditFleetBindings({
179
+ ...canonicalBindings,
180
+ analytics_engine_datasets: [],
181
+ });
182
+ const f = findings.find((x) => x.id === "analytics_engine_binding_missing");
183
+ expect(f?.severity).toBe("warn");
184
+ });
185
+
186
+ it("flags analytics_engine_binding_missing when binding name doesn't match DECO_METRICS", () => {
187
+ const findings = auditFleetBindings({
188
+ ...canonicalBindings,
189
+ analytics_engine_datasets: [{ binding: "OTHER_NAME" }],
190
+ });
191
+ expect(findings.some((f) => f.id === "analytics_engine_binding_missing")).toBe(true);
192
+ });
193
+
194
+ it("flags tail_consumer_missing as error", () => {
195
+ const findings = auditFleetBindings({
196
+ ...canonicalBindings,
197
+ tail_consumers: [],
198
+ });
199
+ const f = findings.find((x) => x.id === "tail_consumer_missing");
200
+ expect(f?.severity).toBe("error");
201
+ });
202
+
203
+ it("flags tail_consumer_missing when an unrelated tail consumer is configured", () => {
204
+ const findings = auditFleetBindings({
205
+ ...canonicalBindings,
206
+ tail_consumers: [{ service: "another-tail" }],
207
+ });
208
+ expect(findings.some((f) => f.id === "tail_consumer_missing")).toBe(true);
209
+ });
210
+
211
+ it("flags otel_metrics_endpoint_missing when DECO_OTEL_METRICS_ENDPOINT is an empty string", () => {
212
+ const findings = auditFleetBindings({
213
+ ...canonicalBindings,
214
+ vars: {
215
+ ...canonicalBindings.vars,
216
+ DECO_OTEL_METRICS_ENDPOINT: "",
217
+ },
218
+ });
219
+ expect(findings.some((f) => f.id === "otel_metrics_endpoint_missing")).toBe(true);
220
+ });
221
+
222
+ it("flags all three otel_*_endpoint_missing findings when the vars object is absent", () => {
223
+ const { vars: _vars, ...rest } = canonicalBindings;
224
+ const findings = auditFleetBindings(rest);
225
+ expect(findings.some((f) => f.id === "otel_traces_endpoint_missing")).toBe(true);
226
+ expect(findings.some((f) => f.id === "otel_logs_endpoint_missing")).toBe(true);
227
+ expect(findings.some((f) => f.id === "otel_metrics_endpoint_missing")).toBe(true);
228
+ });
229
+
230
+ it("handles missing vars object gracefully", () => {
231
+ expect(() => auditFleetBindings({ vars: undefined })).not.toThrow();
232
+ });
233
+ });
234
+
235
+ describe("auditWranglerConfig — composition", () => {
236
+ it("composes observability + fleet rules", () => {
237
+ const findings = auditWranglerConfig({});
238
+ const ids = findings.map((f) => f.id);
239
+ expect(ids).toContain("observability_missing");
240
+ expect(ids).toContain("version_metadata_binding_missing");
241
+ expect(ids).toContain("tail_consumer_missing");
242
+ });
243
+
244
+ it("returns no findings on a fully canonical wrangler", () => {
245
+ const findings = auditWranglerConfig({
246
+ observability: {
247
+ enabled: true,
248
+ logs: { enabled: true, head_sampling_rate: 1, persist: true },
249
+ traces: { enabled: true, head_sampling_rate: 0.01, persist: true },
250
+ },
251
+ version_metadata: { binding: "CF_VERSION_METADATA" },
252
+ analytics_engine_datasets: [{ binding: "DECO_METRICS", dataset: "deco_metrics_x" }],
253
+ tail_consumers: [{ service: "deco-otel-tail" }],
254
+ vars: {
255
+ DECO_OTEL_METRICS_ENDPOINT: "https://ingest.example/v1/metrics",
256
+ DECO_OTEL_TRACES_ENDPOINT: "https://ingest.example/v1/traces",
257
+ DECO_OTEL_LOGS_ENDPOINT: "https://ingest.example/v1/logs",
258
+ },
259
+ });
260
+ expect(findings).toEqual([]);
261
+ });
262
+ });
263
+
141
264
  describe("JSONC handling — trailing commas + comments", () => {
142
265
  it("stripJsoncTrailingCommas removes commas before `}` and `]`", () => {
143
266
  expect(stripJsoncTrailingCommas(`{ "a": 1, "b": 2, }`)).toBe(`{ "a": 1, "b": 2 }`);
@@ -165,6 +288,133 @@ describe("JSONC handling — trailing commas + comments", () => {
165
288
  });
166
289
  });
167
290
 
291
+ describe("CLI gate hardness (D-16) — --mode warn|block + --github", () => {
292
+ let tmpdir: string;
293
+ const cliPath = path.resolve(__dirname, "audit-observability-config.ts");
294
+
295
+ beforeEach(() => {
296
+ tmpdir = fs.mkdtempSync(path.join(os.tmpdir(), "audit-mode-"));
297
+ });
298
+ afterEach(() => {
299
+ fs.rmSync(tmpdir, { recursive: true, force: true });
300
+ });
301
+
302
+ // Spawn the script via tsx in a child process so we exercise the real
303
+ // `process.exit()` paths instead of monkey-patching them. This is the
304
+ // contract storefront CI consumes, so it's the contract under test.
305
+ function runCli(args: string[]): {
306
+ status: number | null;
307
+ stdout: string;
308
+ stderr: string;
309
+ } {
310
+ const { spawnSync } = require("node:child_process") as typeof import(
311
+ "node:child_process"
312
+ );
313
+ const result = spawnSync(
314
+ process.execPath,
315
+ [
316
+ require.resolve("tsx/cli"),
317
+ cliPath,
318
+ "--source",
319
+ tmpdir,
320
+ ...args,
321
+ ],
322
+ { encoding: "utf8" },
323
+ );
324
+ return {
325
+ status: result.status,
326
+ stdout: result.stdout,
327
+ stderr: result.stderr,
328
+ };
329
+ }
330
+
331
+ it("default mode is warn — exits 0 even with error findings", () => {
332
+ // Empty wrangler triggers `observability_missing` (error) +
333
+ // `tail_consumer_missing` (error) + `version_metadata_*` (error). Warn
334
+ // mode must annotate but exit 0.
335
+ fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
336
+ const { status, stdout } = runCli([]);
337
+ expect(status).toBe(0);
338
+ expect(stdout).toMatch(/observability_missing/);
339
+ });
340
+
341
+ it("--mode block exits 1 when an error-severity finding is present", () => {
342
+ fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
343
+ const { status, stdout } = runCli(["--mode", "block"]);
344
+ expect(status).toBe(1);
345
+ expect(stdout).toMatch(/observability_missing/);
346
+ });
347
+
348
+ it("--mode block exits 0 when only warn-severity findings are present", () => {
349
+ // Canonical observability block + the rest of the fleet bindings → only
350
+ // the DECO_OTEL_*_ENDPOINT warns survive. Block mode must exit 0 because
351
+ // those are `warn`, not `error`.
352
+ fs.writeFileSync(
353
+ path.join(tmpdir, "wrangler.jsonc"),
354
+ JSON.stringify({
355
+ name: "my-store",
356
+ observability: {
357
+ enabled: true,
358
+ traces: { enabled: true, head_sampling_rate: 0.01, persist: true },
359
+ logs: { enabled: true, head_sampling_rate: 1, persist: true },
360
+ },
361
+ version_metadata: { binding: "CF_VERSION_METADATA" },
362
+ analytics_engine_datasets: [{ binding: "DECO_METRICS" }],
363
+ tail_consumers: [{ service: "deco-otel-tail" }],
364
+ }),
365
+ );
366
+ const { status } = runCli(["--mode", "block"]);
367
+ expect(status).toBe(0);
368
+ });
369
+
370
+ it("--mode block exits 0 on a fully clean wrangler.jsonc", () => {
371
+ fs.writeFileSync(
372
+ path.join(tmpdir, "wrangler.jsonc"),
373
+ JSON.stringify({
374
+ name: "my-store",
375
+ observability: {
376
+ enabled: true,
377
+ traces: { enabled: true, head_sampling_rate: 0.01, persist: true },
378
+ logs: { enabled: true, head_sampling_rate: 1, persist: true },
379
+ },
380
+ version_metadata: { binding: "CF_VERSION_METADATA" },
381
+ analytics_engine_datasets: [{ binding: "DECO_METRICS" }],
382
+ tail_consumers: [{ service: "deco-otel-tail" }],
383
+ vars: {
384
+ DECO_OTEL_METRICS_ENDPOINT: "https://ingest.example.com",
385
+ DECO_OTEL_TRACES_ENDPOINT: "https://ingest.example.com",
386
+ DECO_OTEL_LOGS_ENDPOINT: "https://ingest.example.com",
387
+ },
388
+ }),
389
+ );
390
+ const { status } = runCli(["--mode", "block"]);
391
+ expect(status).toBe(0);
392
+ });
393
+
394
+ it("--github emits ::warning::/::error:: annotations matched to mode", () => {
395
+ fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
396
+ // In warn mode, even error-severity findings annotate as `warning` (we
397
+ // never escalate to GitHub `error` annotations when we won't fail the
398
+ // check — keeps the PR annotation channel quiet at v1).
399
+ const warnRun = runCli(["--github"]);
400
+ expect(warnRun.status).toBe(0);
401
+ expect(warnRun.stdout).toMatch(/::warning title=observability_missing::/);
402
+ expect(warnRun.stdout).not.toMatch(/::error title=/);
403
+
404
+ // In block mode, error-severity findings escalate to `::error::`.
405
+ const blockRun = runCli(["--mode", "block", "--github"]);
406
+ expect(blockRun.status).toBe(1);
407
+ expect(blockRun.stdout).toMatch(/::error title=observability_missing::/);
408
+ });
409
+
410
+ it("--mode rejects values other than warn|block with exit 2", () => {
411
+ fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
412
+ const { status, stderr } = runCli(["--mode", "advisory"]);
413
+ expect(status).toBe(2);
414
+ expect(stderr).toMatch(/--mode must be "warn" or "block"/);
415
+ });
416
+ });
417
+
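The `--github` behaviour exercised by the tests above can be sketched as a pair of pure functions. This is a simplified illustration, not the package's code: `annotationLevel`, `escapeData`, and `toAnnotation` are hypothetical names, and the `Finding` shape is reduced to the fields the annotation needs.

```typescript
type Severity = "info" | "warn" | "error";
type GateMode = "warn" | "block";

interface Finding {
  id: string;
  severity: Severity;
  message: string;
  fix?: string;
}

// Only block mode escalates error-severity findings to `error`;
// warn mode caps everything at `warning`, and `info` maps to `notice`.
function annotationLevel(
  mode: GateMode,
  severity: Severity,
): "error" | "warning" | "notice" {
  if (mode === "block" && severity === "error") return "error";
  return severity === "info" ? "notice" : "warning";
}

// Workflow-command message escaping. `%` must be replaced first so the
// percent signs introduced by the later replacements are not re-escaped.
function escapeData(text: string): string {
  return text.replace(/%/g, "%25").replace(/\r/g, "%0D").replace(/\n/g, "%0A");
}

function toAnnotation(mode: GateMode, f: Finding): string {
  const msg = f.fix ? `${f.message} (fix: ${f.fix})` : f.message;
  return `::${annotationLevel(mode, f.severity)} title=${f.id}::${escapeData(msg)}`;
}
```

Keeping the level decision separate from the string formatting makes the warn/block difference testable without spawning a child process, although the child-process tests remain the ones that cover the real exit codes.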
168
418
  describe("CLI smoke — wrangler.jsonc with trailing commas", () => {
169
419
  let tmpdir: string;
170
420
  beforeEach(() => {
@@ -15,27 +15,45 @@
15
15
  *
16
16
  * Rules (id — severity — what it catches):
17
17
  *
18
- * observability_missing error No `observability` key at all. CF captures nothing.
19
- * observability_disabled error `observability.enabled: false`. Master switch off.
20
- * traces_disabled warn `observability.traces.enabled: false`. No traces in dashboard.
21
- * logs_disabled warn `observability.logs.enabled: false`. No logs in dashboard.
22
- * head_sampling_rate_elevated error `traces.head_sampling_rate > 0.01`. Fleet-scale cost risk; see docs/observability.md.
23
- * logs_head_sampling_rate_low warn `logs.head_sampling_rate < 1`. Sampling info/warn logs loses signal cheaply; errors go via the direct-POST channel.
24
- * persist_disabled_no_destination error `persist: false` with no destination configured. Data captured then discarded.
18
+ * observability_missing error No `observability` key at all. CF captures nothing.
19
+ * observability_disabled error `observability.enabled: false`. Master switch off.
20
+ * traces_disabled warn `observability.traces.enabled: false`. No traces in dashboard.
21
+ * logs_disabled warn `observability.logs.enabled: false`. No logs in dashboard.
22
+ * head_sampling_rate_elevated error `traces.head_sampling_rate > 0.01`. Fleet-scale cost risk; see docs/observability.md.
23
+ * logs_head_sampling_rate_low warn `logs.head_sampling_rate < 1`. Sampling info/warn logs loses signal cheaply; errors go via the direct-POST channel.
24
+ * persist_disabled_no_destination error `persist: false` with no destination configured. Data captured then discarded.
25
+ *
26
+ * Phase 6 / D-14 — fleet-config drift rules (live outside the
27
+ * `observability` block but still owned by this audit):
28
+ *
29
+ * version_metadata_binding_missing error Missing `version_metadata` binding. `service.version` won't be stamped — regressions can't be attributed to a deploy.
30
+ * analytics_engine_binding_missing warn No `DECO_METRICS` AE binding. AE meter is off; OTLP meter still works but CF dashboard panels go dark.
31
+ * tail_consumer_missing error No `tail_consumers` entry pointing at `deco-otel-tail`. 100% error-capture is broken.
32
+ * otel_metrics_endpoint_missing warn `DECO_OTEL_METRICS_ENDPOINT` not set on `vars`. OTLP meter is off; only AE works.
33
+ * otel_traces_endpoint_missing warn `DECO_OTEL_TRACES_ENDPOINT` not set on `vars`. Framework `deco.*` spans drop unless CF Traces is on.
34
+ * otel_logs_endpoint_missing warn `DECO_OTEL_LOGS_ENDPOINT` not set on `vars`. Error logs ride CF Destinations only (head-sampled).
25
35
  *
26
36
  * Usage (from a site directory):
27
- * npx -p @decocms/start deco-audit-observability # audit cwd
28
- * npx -p @decocms/start deco-audit-observability --source ./ # explicit
37
+ * npx -p @decocms/start deco-audit-observability # audit cwd (warn mode — exit 0)
38
+ * npx -p @decocms/start deco-audit-observability --source ./ # explicit source dir
29
39
  * npx -p @decocms/start deco-audit-observability --json # machine-readable
40
+ * npx -p @decocms/start deco-audit-observability --mode block # error findings exit 1 (CI gate)
41
+ * npx -p @decocms/start deco-audit-observability --github # GitHub Actions annotations
30
42
  *
31
43
  * Options:
32
44
  * --source <dir> Site directory (default: .)
33
- * --json Emit findings as JSON to stdout (still exits non-zero on findings)
45
+ * --json Emit findings as JSON to stdout
46
+ * --mode <m> Gate hardness: "warn" (default — always exit 0 on findings,
47
+ * just print them) or "block" (exit 1 on any `error` finding).
48
+ * See D-16 in MIGRATION_TOOLING_PLAN.md for the rationale on
49
+ * why warn is the v1 default.
50
+ * --github Emit `::warning::` / `::error::` lines for GitHub Actions
51
+ * annotations in addition to the normal text output.
34
52
  * --help, -h Show this message
35
53
  *
36
54
  * Exit codes:
37
- * 0 — no findings (or only `info`-level findings; none defined yet)
38
- * 1 — at least one finding (warn or error)
55
+ * 0 — no findings, or `--mode warn` (the default) regardless of findings
56
+ * 1 — `--mode block` and at least one `error`-severity finding
39
57
  * 2 — file invalid / can't parse, or an invalid `--mode` value
40
58
  */
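The exit-code table in the header comment reduces to one predicate. The sketch below assumes a trimmed `Finding` type, and `exitCodeFor` is a hypothetical name; the parse-failure path (exit 2) is deliberately left out because it is decided before any findings exist.

```typescript
type Severity = "info" | "warn" | "error";
type GateMode = "warn" | "block";

interface Finding {
  id: string;
  severity: Severity;
}

// warn  -> always 0, findings are printed but never fail the run
// block -> 1 only when at least one finding has severity "error"
function exitCodeFor(mode: GateMode, findings: Finding[]): 0 | 1 {
  const hasError = findings.some((f) => f.severity === "error");
  return mode === "block" && hasError ? 1 : 0;
}
```

Note that warn-severity findings never change the exit code in either mode, which is why a config that only lacks the `DECO_OTEL_*_ENDPOINT` vars still passes a `--mode block` gate.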
41
59
 
@@ -53,14 +71,24 @@ export interface Finding {
53
71
  fix?: string;
54
72
  }
55
73
 
74
+ export type GateMode = "warn" | "block";
75
+
56
76
  interface CliOpts {
57
77
  source: string;
58
78
  json: boolean;
59
79
  help: boolean;
80
+ mode: GateMode;
81
+ github: boolean;
60
82
  }
61
83
 
62
84
  function parseArgs(argv: string[]): CliOpts {
63
- const opts: CliOpts = { source: ".", json: false, help: false };
85
+ const opts: CliOpts = {
86
+ source: ".",
87
+ json: false,
88
+ help: false,
89
+ mode: "warn",
90
+ github: false,
91
+ };
64
92
  for (let i = 0; i < argv.length; i++) {
65
93
  const flag = argv[i];
66
94
  switch (flag) {
@@ -70,6 +98,20 @@ function parseArgs(argv: string[]): CliOpts {
70
98
  case "--json":
71
99
  opts.json = true;
72
100
  break;
101
+ case "--mode": {
102
+ const value = argv[++i];
103
+ if (value !== "warn" && value !== "block") {
104
+ console.error(
105
+ `audit: --mode must be "warn" or "block" (got "${value ?? ""}")`,
106
+ );
107
+ process.exit(2);
108
+ }
109
+ opts.mode = value;
110
+ break;
111
+ }
112
+ case "--github":
113
+ opts.github = true;
114
+ break;
73
115
  case "--help":
74
116
  case "-h":
75
117
  opts.help = true;
@@ -91,14 +133,18 @@ function showHelp(): void {
91
133
  npx -p @decocms/start deco-audit-observability [options]
92
134
 
93
135
  Options:
94
- --source <dir> Site directory (default: .)
95
- --json Emit findings as JSON
96
- --help, -h This message
136
+ --source <dir> Site directory (default: .)
137
+ --json Emit findings as JSON
138
+ --mode <m> "warn" (default, exit 0) | "block" (exit 1 on errors)
139
+ --github Emit ::warning::/::error:: lines for GitHub Actions
140
+ --help, -h This message
97
141
 
98
142
  Exit codes:
99
- 0 no findings
100
- 1 one or more findings (warn or error)
143
+ 0 no findings, OR --mode warn (default — annotate and move on)
144
+ 1 --mode block AND at least one error-severity finding
101
145
  2 wrangler.jsonc missing or unparseable, or an invalid --mode value
146
+
147
+ See D-16 in MIGRATION_TOOLING_PLAN.md for the v1 "warn-only" policy.
102
148
  `);
103
149
  }
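The `--mode` validation described in the help text can also be expressed as a pure function that returns a result instead of exiting. This is a sketch under assumptions: `ModeParse` and `parseMode` are hypothetical names, and the caller would still be responsible for printing the error and exiting with status 2.

```typescript
type GateMode = "warn" | "block";

type ModeParse =
  | { ok: true; mode: GateMode }
  | { ok: false; error: string };

// Accepts only the two documented values; anything else (including a
// missing value after `--mode`) yields the same message the CLI prints.
function parseMode(value: string | undefined): ModeParse {
  if (value === "warn" || value === "block") {
    return { ok: true, mode: value };
  }
  return {
    ok: false,
    error: `audit: --mode must be "warn" or "block" (got "${value ?? ""}")`,
  };
}
```

Returning a discriminated union keeps the parsing logic unit-testable in-process, while the `process.exit(2)` call stays at the edge of the program.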
104
150
 
@@ -236,6 +282,138 @@ export function auditObservabilityBlock(
236
282
  return findings;
237
283
  }
238
284
 
285
+ /**
286
+ * Fleet-config drift rules — owned by the same audit because the
287
+ * cumulative effect of "observability block correct, bindings missing"
288
+ * is identical to "observability block missing" (no data lands in
289
+ * ClickHouse).
290
+ *
291
+ * The CLI composes `auditObservabilityBlock` + `auditFleetBindings`.
292
+ * Both return Finding[]; callers concatenate.
293
+ */
294
+ export interface WranglerLike {
295
+ observability?: ObservabilityBlock;
296
+ version_metadata?: { binding?: string } | unknown;
297
+ analytics_engine_datasets?: Array<{ binding?: string; dataset?: string }> | unknown;
298
+ tail_consumers?: Array<{ service?: string; environment?: string }> | unknown;
299
+ vars?: Record<string, unknown> | unknown;
300
+ }
301
+
302
+ export function auditFleetBindings(wrangler: WranglerLike): Finding[] {
303
+ const findings: Finding[] = [];
304
+
305
+ // version_metadata — required so `service.version` is stamped on every
306
+ // span and log line. Without it, regressions can't be tied to a
307
+ // specific deployment.
308
+ const vmBinding =
309
+ typeof wrangler.version_metadata === "object" &&
310
+ wrangler.version_metadata !== null &&
311
+ "binding" in wrangler.version_metadata
312
+ ? (wrangler.version_metadata as { binding?: string }).binding
313
+ : undefined;
314
+ if (!vmBinding) {
315
+ findings.push({
316
+ id: "version_metadata_binding_missing",
317
+ severity: "error",
318
+ message:
319
+ "wrangler.jsonc is missing a `version_metadata.binding` entry. " +
320
+ "`service.version` won't appear on spans/logs and the deploy-correlation " +
321
+ "panel will be empty. Recommended value: `CF_VERSION_METADATA`.",
322
+ fix: "npx -p @decocms/start deco-cf-observability --write",
323
+ });
324
+ }
325
+
326
+ // DECO_METRICS — Analytics Engine binding. The AE meter is the hot-
327
+ // path CF dashboard view; OTLP works without it but we lose the
328
+ // operator-grade short-window panels.
329
+ const aeDatasets = Array.isArray(wrangler.analytics_engine_datasets)
330
+ ? (wrangler.analytics_engine_datasets as Array<{ binding?: string }>)
331
+ : [];
332
+ const hasMetricsBinding = aeDatasets.some((d) => d?.binding === "DECO_METRICS");
333
+ if (!hasMetricsBinding) {
334
+ findings.push({
335
+ id: "analytics_engine_binding_missing",
336
+ severity: "warn",
337
+ message:
338
+ "wrangler.jsonc has no `analytics_engine_datasets[].binding == 'DECO_METRICS'`. " +
339
+ "The AE meter is off; the hot-path operator dashboards in CF will be empty. " +
340
+ "OTLP metrics keep flowing if `DECO_OTEL_METRICS_ENDPOINT` is set.",
341
+ fix: "npx -p @decocms/start deco-cf-observability --write",
342
+ });
343
+ }
344
+
345
+ // tail_consumers — must list deco-otel-tail. Phase 1 enrichment is
346
+ // useless without the tail consumer firing.
347
+ const tail = Array.isArray(wrangler.tail_consumers)
348
+ ? (wrangler.tail_consumers as Array<{ service?: string }>)
349
+ : [];
350
+ const hasTailConsumer = tail.some((t) => t?.service === "deco-otel-tail");
351
+ if (!hasTailConsumer) {
352
+ findings.push({
353
+ id: "tail_consumer_missing",
354
+ severity: "error",
355
+ message:
356
+ "wrangler.jsonc has no `tail_consumers[].service == 'deco-otel-tail'` entry. " +
357
+ "100% error-capture is broken — only the head-sampled CF Destinations path " +
358
+ "will report errors, and isolate crashes will be invisible.",
359
+ fix: "npx -p @decocms/start deco-cf-observability --write",
360
+ });
361
+ }
362
+
363
+ // DECO_OTEL_*_ENDPOINT env vars. Audit each separately so the message
364
+ // explains which channel is silently no-op.
365
+ const vars =
366
+ typeof wrangler.vars === "object" && wrangler.vars !== null
367
+ ? (wrangler.vars as Record<string, unknown>)
368
+ : {};
369
+ const checkVar = (id: string, name: string, severity: Severity, channel: string) => {
370
+ const v = vars[name];
371
+ if (typeof v !== "string" || v.length === 0) {
372
+ findings.push({
373
+ id,
374
+ severity,
375
+ message:
376
+ `wrangler.jsonc \`vars.${name}\` is not set. ${channel} is a no-op; ` +
377
+ `data captured in this channel never lands in ClickHouse. ` +
378
+ `See docs/observability.md for the canonical endpoints.`,
379
+ fix: "npx -p @decocms/start deco-cf-observability --write",
380
+ });
381
+ }
382
+ };
383
+ checkVar(
384
+ "otel_metrics_endpoint_missing",
385
+ "DECO_OTEL_METRICS_ENDPOINT",
386
+ "warn",
387
+ "OTLP metrics direct-POST",
388
+ );
389
+ checkVar(
390
+ "otel_traces_endpoint_missing",
391
+ "DECO_OTEL_TRACES_ENDPOINT",
392
+ "warn",
393
+ "OTLP traces direct-POST",
394
+ );
395
+ checkVar(
396
+ "otel_logs_endpoint_missing",
397
+ "DECO_OTEL_LOGS_ENDPOINT",
398
+ "warn",
399
+ "OTLP error-log direct-POST",
400
+ );
401
+
402
+ return findings;
403
+ }
404
+
405
+ /**
406
+ * One-stop call for the full wrangler audit — composes the
407
+ * observability-block rules with the fleet-binding rules. Both keep
408
+ * working standalone for fine-grained tests.
409
+ */
410
+ export function auditWranglerConfig(wrangler: WranglerLike): Finding[] {
411
+ return [
412
+ ...auditObservabilityBlock(wrangler.observability),
413
+ ...auditFleetBindings(wrangler),
414
+ ];
415
+ }
416
+
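The `analytics_engine_datasets` and `tail_consumers` checks above share one pattern: treat the field as `unknown`, accept it only if it is an array, and look for one entry with an expected string value. A minimal sketch of that pattern follows; `hasEntryWith` is a hypothetical helper that does not exist in the package, and it is slightly stricter than the original because it also skips non-object array items.

```typescript
// Returns true only when `list` is an array containing at least one
// non-null object whose `key` property strictly equals `expected`.
function hasEntryWith(list: unknown, key: string, expected: string): boolean {
  if (!Array.isArray(list)) return false;
  return list.some(
    (item) =>
      typeof item === "object" &&
      item !== null &&
      (item as Record<string, unknown>)[key] === expected,
  );
}

// Example inputs shaped like a parsed wrangler.jsonc (values are illustrative).
const wrangler: Record<string, unknown> = {
  analytics_engine_datasets: [{ binding: "OTHER_NAME" }],
  tail_consumers: [{ service: "deco-otel-tail" }],
};

const hasMetricsBinding = hasEntryWith(
  wrangler.analytics_engine_datasets,
  "binding",
  "DECO_METRICS",
);
const hasTailConsumer = hasEntryWith(
  wrangler.tail_consumers,
  "service",
  "deco-otel-tail",
);
```

Because a hand-edited `wrangler.jsonc` can contain any shape, failing closed (returning `false`) on non-array or malformed input turns a malformed field into a "missing" finding rather than a thrown exception.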
239
417
  function findingsToText(file: string, findings: Finding[]): string {
240
418
  if (findings.length === 0) {
241
419
  return `OK ${file} — observability config looks canonical`;
@@ -263,26 +441,49 @@ function main(): void {
263
441
  process.exit(2);
264
442
  }
265
443
 
266
- let parsed: { observability?: ObservabilityBlock };
444
+ let parsed: WranglerLike;
267
445
  try {
268
- parsed = parseJsonc(fs.readFileSync(file, "utf8"));
446
+ parsed = parseJsonc(fs.readFileSync(file, "utf8")) as WranglerLike;
269
447
  } catch (err) {
270
448
  console.error(`audit: ${file} could not be parsed: ${(err as Error).message}`);
271
449
  process.exit(2);
272
450
  }
273
451
 
274
- const findings = auditObservabilityBlock(parsed.observability);
452
+ const findings = auditWranglerConfig(parsed);
275
453
 
276
454
  if (opts.json) {
277
- process.stdout.write(JSON.stringify({ file, findings }, null, 2) + "\n");
455
+ process.stdout.write(
456
+ JSON.stringify({ file, mode: opts.mode, findings }, null, 2) + "\n",
457
+ );
278
458
  } else {
279
459
  process.stdout.write(findingsToText(file, findings) + "\n");
280
460
  }
281
461
 
282
- // Any finding (warn or error) is a non-zero exit. info-severity would not
283
- // flip the exit, but no info-severity rules are defined yet.
284
- const blocking = findings.some((f) => f.severity !== "info");
285
- process.exit(blocking ? 1 : 0);
462
+ if (opts.github) {
463
+ for (const f of findings) {
464
+ // GitHub Actions workflow command. `error` and `warning` annotate the
465
+ // diff; `notice` is informational. We never emit `error` in warn mode
466
+ // even for error-severity rules — the v1 policy is annotate-don't-fail.
467
+ const level = opts.mode === "block" && f.severity === "error"
468
+ ? "error"
469
+ : f.severity === "info" ? "notice" : "warning";
470
+ const msg = `${f.message}${f.fix ? ` (fix: ${f.fix})` : ""}`;
471
+ const escaped = msg.replace(/%/g, "%25").replace(/\r/g, "%0D").replace(
472
+ /\n/g,
473
+ "%0A",
474
+ );
475
+ process.stdout.write(`::${level} title=${f.id}::${escaped}\n`);
476
+ }
477
+ }
478
+
479
+ // Exit policy: D-16 / Phase 6 decision.
480
+ // warn — annotate only; always exit 0 (CI sees the findings but ships)
481
+ // block — exit 1 on any `error`-severity finding
482
+ // The default is `warn` because storefronts are upgraded over weeks; a
483
+ // day-one block would fail PRs that have nothing to do with observability.
484
+ const shouldFail = opts.mode === "block" &&
485
+ findings.some((f) => f.severity === "error");
486
+ process.exit(shouldFail ? 1 : 0);
286
487
  }
287
488
 
288
489
  // Only run when invoked directly, not when imported by tests.