npm - @decocms/start - Versions diffs - 6.1.0 → 6.2.0 - Mend

@decocms/start 6.1.0 → 6.2.0

Files changed (10)

package/docs/rum-plan.md +209 -0
package/docs/runbooks/README.md +40 -0
package/docs/runbooks/cache-hit-drop.md +83 -0
package/docs/runbooks/commerce-upstream-slow.md +88 -0
package/docs/runbooks/http-error-spike.md +98 -0
package/docs/runbooks/http-latency-spike.md +82 -0
package/docs/runbooks/tail-exception-spike.md +100 -0
package/package.json +1 -1
package/scripts/audit-observability-config.test.ts +251 -1
package/scripts/audit-observability-config.ts +227 -26

package/docs/rum-plan.md ADDED

@@ -0,0 +1,209 @@
+# RUM (Real-User Monitoring) — Plan
+> Sibling deliverable to [`observability_refinement_plan_4fa41548.plan.md`](../../../.cursor/plans/observability_refinement_plan_4fa41548.plan.md).
+> The refinement plan covers server-side observability end-to-end. This
+> plan covers what runs **in the browser** — Core Web Vitals, JS errors,
+> long tasks, resource timing, custom user-journey events.
+## Why this is a separate plan
+Server-side telemetry tells us what our Workers did. RUM tells us what
+the **user actually experienced** — including everything outside our
+edge (DNS, TLS handshake, third-party scripts, the user's CPU, their
+flaky LTE link, their ad-blocker). The two answer different questions:
+| Question | Answered by |
+|---|---|
+| "Did we serve the request?" | server outcomes (Phase 2 of refinement plan) |
+| "Was the user able to read the page?" | RUM (this plan) |
+| "Did our deploy regress LCP on iOS Safari?" | RUM (this plan) |
+| "Why did checkout abandon at 73%?" | RUM + server outcomes joined |
+Today the answer to every RUM question is "we don't know." The plan
+puts a floor on that.
+## Scope tiers — the decision you're making
+The size of this plan changes by an order of magnitude depending on
+scope. Three defensible tiers below; the work is **strictly additive**
+between them so we can commit to Tier 1, run it for a quarter, and
+upgrade to Tier 2 or 3 only if the data we collect surfaces a need.
+### Tier 1 — Core Web Vitals + JS errors (recommended for v1)
+**What's collected:**
+- LCP (Largest Contentful Paint), CLS (Cumulative Layout Shift),
+  INP (Interaction to Next Paint), FCP, TTFB — via the standard
+  [`web-vitals`](https://github.com/GoogleChrome/web-vitals) library.
+- `window.onerror` + `window.onunhandledrejection` — uncaught JS errors
+  with stack, source URL, line/col, user agent, route pattern, deploy
+  id, `request.id` (same one the server stamped in Phase 1, joinable to
+  `otel_logs` and `otel_traces`).
+- Page-context attributes: `route_pattern`, `service.version`,
+  `service.name`, `deployment.environment`, viewport, connection type
+  (`navigator.connection.effectiveType`), `cf-ray`.
+**What's NOT collected:**
+- Session replay.
+- Custom user-journey events (add-to-cart, scroll-to-fold, etc.).
+- Resource timing for every asset.
+- User interaction heatmaps.
+**Implementation footprint:** ~3 dev-weeks.
+- `@decocms/start/sdk/rum.ts` — single browser-side module bundled into
+  the site entry. Reads `web-vitals` (peer dep, ~3KB gzipped), batches
+  events, sends to `/__deco/rum` on the same origin (no CORS / no
+  third-party script).
+- `cmsRoute.ts` already serves a worker; add a `/__deco/rum` handler
+  that validates the payload, redacts referrer/URL via the shared PII
+  library (Phase 1), and forwards to the OTLP HTTP endpoint as
+  `otel_logs` with `SeverityText="INFO"` and `LogAttributes.rum.*`.
+- ClickHouse rows land in the existing `otel_logs` table — no new
+  schema, no new pipeline. The Tier 1 query "p75 LCP per route per
+  site, last 7 days" is a single SQL join on the existing tables.
+- One Grafana dashboard template added to
+  `stats-lake/observability/dashboards/templates/site-rum.json`.
+  Auto-provisioned via the existing Phase 5 script; a `--with-rum` flag
+  mirrors `--with-alerts`.
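Not part of the published diff: a minimal sketch of the batching step described for `rum.ts` above. The `RumEvent` shape, the `RumBatcher` class and the batch size are hypothetical; only the `/__deco/rum` endpoint comes from the plan. The transport is injected so the logic can run outside a browser.

```typescript
// Hypothetical event shape; the real rum.ts payload may differ.
type RumEvent = {
  name: string; // e.g. "LCP", "CLS", "INP"
  value: number;
  routePattern: string;
  ts: number; // epoch milliseconds
};

// Returns true when the payload was accepted for delivery.
type Transport = (url: string, body: string) => boolean;

class RumBatcher {
  private queue: RumEvent[] = [];
  private readonly send: Transport;
  private readonly endpoint: string;
  private readonly maxBatch: number;

  constructor(send: Transport, endpoint = "/__deco/rum", maxBatch = 10) {
    this.send = send;
    this.endpoint = endpoint;
    this.maxBatch = maxBatch;
  }

  // Queue one event and flush eagerly once the batch is full.
  push(event: RumEvent): void {
    this.queue.push(event);
    if (this.queue.length >= this.maxBatch) this.flush();
  }

  // Send everything queued so far as one JSON payload.
  // Returns the number of events handed to the transport.
  flush(): number {
    if (this.queue.length === 0) return 0;
    const batch = this.queue;
    this.queue = [];
    const ok = this.send(this.endpoint, JSON.stringify({ events: batch }));
    if (!ok) {
      // Keep the events so a later flush can retry them.
      this.queue = batch.concat(this.queue);
      return 0;
    }
    return batch.length;
  }
}
```

In a browser, the transport could plausibly be `(url, body) => navigator.sendBeacon(url, body)`, with the `web-vitals` callbacks (`onLCP`, `onCLS`, `onINP`, `onFCP`, `onTTFB`) calling `push` and a `visibilitychange` listener calling `flush`. That wiring is an assumption, not something the plan specifies.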
+**Cost:** marginal. One row per pageview per metric (~5 rows / pageview)
+adds < 5% to existing log volume. Within the cost guardrail dashboard
+(Phase 6) headroom.
+**Risks:**
+- INP requires the modern API; falls back to FID on older browsers.
+  Reported separately so the metric isn't muddied.
+- `web-vitals` runs ~50ms of JS on first input; teams that obsess over
+  shaving milliseconds will want a build flag to disable it. Ship one
+  off the bat.
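Not part of the published diff: a rough sketch of what the `/__deco/rum` handler's validate, redact and forward steps might look like. The payload shape, the `rum.name` / `rum.value` / `rum.url` attribute keys and `redactUrl` (a stand-in for the shared PII library) are all assumptions. The record fields (`timeUnixNano`, `severityText`, `body`, `attributes`) follow the OTLP/JSON log encoding.

```typescript
// Hypothetical incoming event shape.
type RumPayloadEvent = { name: string; value: number; url: string; ts: number };

type OtlpValue = { stringValue: string } | { doubleValue: number };
type OtlpAttribute = { key: string; value: OtlpValue };
type OtlpLogRecord = {
  timeUnixNano: string;
  severityText: string;
  body: { stringValue: string };
  attributes: OtlpAttribute[];
};

// Stand-in for the shared PII library: drop the query string and fragment.
function redactUrl(raw: string): string {
  const cut = raw.search(/[?#]/);
  return cut === -1 ? raw : raw.slice(0, cut);
}

// Reject anything that is not a well-formed event before it reaches the pipeline.
function isRumEvent(x: unknown): x is RumPayloadEvent {
  if (typeof x !== "object" || x === null) return false;
  const { name, value, url, ts } = x as Record<string, unknown>;
  return (
    typeof name === "string" && name.length > 0 && name.length <= 64 &&
    typeof value === "number" && isFinite(value) &&
    typeof url === "string" &&
    typeof ts === "number" && isFinite(ts) && ts >= 0
  );
}

// Map one validated event to an OTLP log record with rum.* attributes.
function toLogRecord(e: RumPayloadEvent): OtlpLogRecord {
  return {
    // Milliseconds to nanoseconds by appending six zeros (ts is non-negative).
    timeUnixNano: String(Math.floor(e.ts)) + "000000",
    severityText: "INFO",
    body: { stringValue: "rum." + e.name },
    attributes: [
      { key: "rum.name", value: { stringValue: e.name } },
      { key: "rum.value", value: { doubleValue: e.value } },
      { key: "rum.url", value: { stringValue: redactUrl(e.url) } },
    ],
  };
}
```

Redacting before the record is built means the unredacted URL never leaves the handler, which matches the plan's intent of keeping PII out of `otel_logs`.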
+### Tier 2 — Tier 1 + custom user-journey events + resource timing
+**What's added:**
+- A typed `rum.track(name, attributes)` API exposed from
+  `@decocms/start/sdk/rum.ts`. Sites call
+  `rum.track('add_to_cart', { sku, price, currency })` and the event
+  flows through the same `/__deco/rum` endpoint into `otel_logs` with a
+  reserved attribute namespace (`rum.event.*`).
+- Resource Timing API rollup: for each pageview, total resource bytes,
+  count by content-type, slowest 5 URLs (with paths redacted). Helps
+  diagnose when a third-party tag has gone bad.
+- A `rum.identify(userIdHash)` call that lets sites cohort by logged-in
+  user without sending PII (the site hashes the user id before passing
+  it in; we never see plaintext).
+**Implementation footprint:** ~6 additional dev-weeks on top of Tier 1.
+- The framework-side API + types: ~2 weeks.
+- Resource Timing payload shape + redaction: ~1 week.
+- Documentation, codemod fixtures, and an audit rule that enforces
+  the redaction helpers stay in use: ~3 weeks.
+- A second Grafana dashboard (`site-rum-events.json`) that pivots on
+  `rum.event.*`.
+**Risks:**
+- Cardinality explosion: a site that emits `track('view_product', { id })`
+  with the product id as a label creates one time-series per SKU. The
+  attribute system has to **enforce** id-as-attribute / id-not-as-label
+  via the type system + a runtime check in the framework. Doable but
+  it's the new piece that has to be designed right.
+- Custom events drift across sites unless we standardize a vocabulary.
+  Recommend shipping a small reserved-name list (`add_to_cart`,
+  `begin_checkout`, `purchase`, `view_product`) so fleet-wide dashboards
+  can roll up "conversion funnel" without per-site cooperation.
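Not part of the published diff: one possible shape for the runtime side of the cardinality guard and the reserved-name list discussed above. `buildEvent`, `isReserved` and the name pattern are hypothetical; the four reserved names and the `rum.event.*` namespace are taken from the plan.

```typescript
// Reserved vocabulary from the plan, so fleet-wide dashboards can roll up funnels.
const RESERVED_EVENTS = ["add_to_cart", "begin_checkout", "purchase", "view_product"] as const;
type ReservedEvent = (typeof RESERVED_EVENTS)[number];

type AttrValue = string | number | boolean;
type TrackedEvent = { name: string; attributes: Record<string, AttrValue> };

// Assumed rule: names are short, lowercase, digit-free identifiers, so an id
// such as a SKU cannot be embedded in the event name.
const NAME_PATTERN = /^[a-z][a-z_]{0,39}$/;

function isReserved(name: string): name is ReservedEvent {
  return (RESERVED_EVENTS as readonly string[]).indexOf(name) !== -1;
}

// Validate the name and move every attribute under the rum.event.* namespace.
function buildEvent(name: string, attributes: Record<string, AttrValue>): TrackedEvent {
  if (!NAME_PATTERN.test(name)) {
    throw new Error("invalid event name: " + name + " (ids belong in attributes)");
  }
  const namespaced: Record<string, AttrValue> = {};
  for (const key of Object.keys(attributes)) {
    namespaced["rum.event." + key] = attributes[key];
  }
  return { name, attributes: namespaced };
}
```

With this rule, `buildEvent("view_product", { id: "8841" })` passes while `buildEvent("view_product_8841", {})` throws, which keeps the id as an attribute rather than as part of the series name.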
+### Tier 3 — Tier 2 + session replay + interaction heatmaps
+**What's added:**
+- Session replay: capture every DOM mutation + every user input as a
+  delta-encoded stream, replay it as a video in HyperDX / a custom
+  viewer. The canonical OSS implementation is
+  [`rrweb`](https://github.com/rrweb-io/rrweb).
+- Interaction heatmaps: aggregate click positions over a page within a
+  given time window.
+**Implementation footprint:** months.
+- ~30KB gzipped of `rrweb` on every pageview, a measurable LCP impact.
+  Mitigation: lazy-load after first interaction, at the cost of missing
+  events that occur before the recorder loads (including first paint).
+- The replay payload is enormous (~100KB/min/session compressed).
+  Multiplied by realistic session counts this is a 10-100× ingest
+  blowup vs Tier 1. Replay storage is the new bottleneck, not the
+  schema; we'd need an R2-backed cold tier separate from ClickHouse.
+- A privacy review needs to ship before the first byte of replay
+  flows. PII redaction has to happen client-side before the network —
+  rrweb's "mask all inputs" mode is the floor; password fields, credit
+  card numbers, and authenticated user-data attributes need explicit
+  privacy classes.
+**Risks:**
+- Privacy: replay is a high-magnification footgun. One regression in
+  the mask config and we've recorded a user's credit card. The whole
+  payment-flow page must be force-masked at the framework level, not
+  opt-in per site.
+- Cost: 10-100× the ingest of Tier 1 even with aggressive sampling.
+  The Phase 6 cost-guardrail dashboard will trip; needs a new tier of
+  retention policies (replay rows expire at 30d, not 90d like logs).
+## Out of scope (regardless of tier)
+- **Synthetic monitoring.** Lighthouse runs against canonical journeys
+  on a cron. Covers a different need — "would a clean browser have
+  hit our SLO?" rather than "what did real users see?". A separate
+  initiative if we want it.
+- **Heatmaps for individual users.** Aggregate heatmaps only; never
+  identifiable.
+- **A/B-test attribution.** Sites that want it can pass an experiment
+  cohort id through `rum.identify` — we don't ship the experiment
+  framework itself.
+## Recommended sequencing
+Ship Tier 1 in one PR. Live with the data for a quarter. If during
+that quarter the question "but what did the user actually click before
+they bounced?" comes up more than twice, plan and ship Tier 2. If
+during *that* quarter we hit a category of bug that can only be
+diagnosed by replay (so far: 0), plan Tier 3.
+**Anti-recommendation:** do not commit to Tier 3 up front. Replay is
+the highest-cost, highest-privacy-risk piece of the whole observability
+surface, and the bug categories that genuinely need it are rare.
+## Decision points
+These mirror the main plan's structure — answer once, then this
+document gets a follow-up PR turning answers into TODOs.
+1. **Tier selection.** Tier 1 only / Tier 1 + Tier 2 / Full Tier 3.
+2. **Identify hashing.** If we're shipping Tier 2, do sites pass in a
+   client-side hash, or do we accept plaintext IDs and hash at the
+   ingest worker? (Recommend client-side hash — keeps plaintext out of
+   our pipeline entirely.)
+3. **Sampling.** RUM events are cheap per-row but high-volume. Default
+   sample rate: 100% (Tier 1), 100% (Tier 2 events), 10% (Tier 3
+   replay). Confirm or revise per tier.
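Not part of the published diff: a small sketch of how the per-tier sample rates above could be applied deterministically, so that a given session is either fully in or fully out of the sample. The tier names, `shouldSample` and the choice of an FNV-1a-style hash are assumptions; only the 100% / 100% / 10% defaults come from the plan.

```typescript
type Tier = "vitals" | "events" | "replay";

// Default rates proposed in the plan (Tier 1, Tier 2 events, Tier 3 replay).
const SAMPLE_RATES: Record<Tier, number> = { vitals: 1.0, events: 1.0, replay: 0.1 };

// FNV-1a style 32-bit hash over UTF-16 code units (equal to byte-wise FNV-1a for ASCII).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, modulo 2^32.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash >>> 0;
}

// Map the session id to [0, 1) and compare against the rate.
function shouldSample(sessionId: string, rate: number): boolean {
  return fnv1a(sessionId) / 4294967296 < rate;
}

function shouldSampleTier(sessionId: string, tier: Tier): boolean {
  return shouldSample(sessionId, SAMPLE_RATES[tier]);
}
```

Hashing the session id rather than calling `Math.random()` per event keeps every event of a sampled session together, which matters most for replay, where a partial recording is of little use.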
+## Files this plan would touch (Tier 1)
+```
+deco-start/
+├── src/sdk/rum.ts                  # NEW — browser-side instrumentation
+├── src/sdk/rum.server.ts           # NEW — /__deco/rum handler
+├── src/admin/setup.ts              # ROUTE — mount /__deco/rum
+├── src/sdk/observability.ts        # EXPORT — re-export rum API
+├── package.json                    # DEP — add `web-vitals` peer dep
+└── docs/rum.md                     # NEW — site-side usage docs
+stats-lake/observability/
+└── dashboards/templates/site-rum.json   # NEW
+```
+## Why this is a plan and not an implementation
+The user explicitly asked for RUM to be in a separate plan document
+rather than rolled into the refinement plan. The scope-tier decision
+above is the single highest-leverage choice; everything downstream
+follows from it, and "Tier 3 because Tier 3 is biggest" is the wrong
+default. A 30-minute conversation on the tier choice saves weeks of
+work in the wrong direction.
+Open the matching Linear / GitHub issue once a tier is selected.

package/docs/runbooks/README.md ADDED

@@ -0,0 +1,40 @@
+# Runbooks
+Tested response procedures for the alerts auto-provisioned by
+[`stats-lake/observability/`](https://github.com/decocms/stats-lake/blob/main/observability/README.md).
+Every alert generated by `provision-dashboards.ts` carries a
+`runbook_url` annotation that points to a Markdown file in this
+directory. Each runbook follows the same structure so on-call can read
+top-to-bottom under stress:
+1. **What this alert means.** One paragraph, no jargon.
+2. **First check (60 seconds).** The single most likely cause and how to
+   confirm it.
+3. **Diagnostic queries.** ClickHouse SQL the responder can paste into
+   Grafana / ClickStack to dig deeper.
+4. **Common causes & fixes.** Ranked by frequency. Each with a "did
+   that fix it?" verification step.
+5. **Escalation.** When to page a domain owner, and who.
+6. **Post-mortem hook.** What to capture so the post-mortem isn't
+   reconstructed from memory.
+## Runbook catalogue
+| Alert ID                  | Runbook                                                |
+|---------------------------|--------------------------------------------------------|
+| `http-error-spike`        | [`http-error-spike.md`](./http-error-spike.md)         |
+| `http-latency-spike`      | [`http-latency-spike.md`](./http-latency-spike.md)     |
+| `cache-hit-drop`          | [`cache-hit-drop.md`](./cache-hit-drop.md)             |
+| `commerce-upstream-slow`  | [`commerce-upstream-slow.md`](./commerce-upstream-slow.md) |
+| `tail-exception-spike`    | [`tail-exception-spike.md`](./tail-exception-spike.md) |
+## Authoring a new runbook
+When you add a new alert template, add a runbook with the same ID. The
+provisioning script doesn't enforce the link today — that gate lands in
+Phase 6 (Governance), but treat it as required: an alert without a
+runbook is a half-shipped alert.
+Keep runbooks short. If a section grows past 20 lines, split it out
+into a dedicated incident-pattern doc and link to it from the runbook.
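Not part of the published diff: since the README notes that the provisioning script does not yet enforce the alert-to-runbook link, here is a minimal sketch of what such a check could look like. `missingRunbooks` and `runbookUrl` are hypothetical helpers, not functions from `provision-dashboards.ts`.

```typescript
// Return the alert IDs that have no "<id>.md" file in the runbooks directory.
function missingRunbooks(alertIds: string[], runbookFiles: string[]): string[] {
  const present: Record<string, boolean> = Object.create(null);
  for (const file of runbookFiles) present[file] = true;
  return alertIds.filter((id) => !present[id + ".md"]);
}

// Build the value for an alert's runbook_url annotation from a base URL.
function runbookUrl(baseUrl: string, alertId: string): string {
  const trimmed = baseUrl.replace(/\/+$/, "");
  return trimmed + "/" + alertId + ".md";
}
```

A provisioning run could call `missingRunbooks` with the alert template IDs and the directory listing, and fail when the result is non-empty. Whether the Phase 6 gate works this way is not stated in the source.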

package/docs/runbooks/cache-hit-drop.md ADDED

@@ -0,0 +1,83 @@
+# Runbook: `cache-hit-drop`
+> A site's edge cache hit rate fell below its own 24h rolling baseline by 3σ for ≥ 10 minutes.
+## What this alert means
+The edge cache is missing more than usual. On the user side this
+manifests as slower page loads. On the cost side it means more origin
+requests (more billing for Workers + commerce API calls). On the
+upstream side it can become a thundering herd if many users hit a
+freshly-evicted entry simultaneously.
+## First check (60 seconds)
+Was there a deploy or a cache purge in the last 10 minutes? Cold caches
+recover quickly (5–10m) so if the alert is fresh and a deploy is
+recent, this often self-heals.
+```sql
+-- Recent deploys (any change to service.version visible in metrics)
+SELECT ResourceAttributes['service.version'] AS version, min(TimeUnix) AS first_seen
+FROM otel_metrics_sum
+WHERE ServiceName = '{site}'
+  AND TimeUnix > now() - INTERVAL 1 HOUR
+GROUP BY version
+ORDER BY first_seen DESC;
+```
+If neither deploy nor purge fired in the window, the cache miss share
+indicates a real regression — proceed below.
+## Diagnostic queries
+```sql
+-- Hit / miss share by route_pattern, last 30 minutes
+SELECT
+  Attributes['route_pattern'] AS route,
+  countIf(MetricName = 'cache_hit_total') AS hits,
+  countIf(MetricName = 'cache_miss_total') AS misses,
+  hits / nullIf(hits + misses, 0) AS hit_rate
+FROM otel_metrics_sum
+WHERE MetricName IN ('cache_hit_total', 'cache_miss_total')
+  AND ServiceName = '{site}'
+  AND TimeUnix > now() - INTERVAL 30 MINUTE
+GROUP BY route
+ORDER BY misses DESC
+LIMIT 20;
+```
+```sql
+-- Cache decision distribution by cache_profile
+SELECT
+  Attributes['profile'] AS profile,
+  Attributes['decision'] AS decision,
+  sum(toFloat64(Value)) AS n
+FROM otel_metrics_sum
+WHERE MetricName = 'cache_hit_total'
+  AND ServiceName = '{site}'
+  AND TimeUnix > now() - INTERVAL 30 MINUTE
+GROUP BY profile, decision
+ORDER BY n DESC;
+```
+## Common causes & fixes
+| Rank | Cause                                                       | How to confirm                                                | Fix                                                                                                       |
+|------|-------------------------------------------------------------|----------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
+| 1    | Deploy purged the version cache (`X-Cache-Version` flipped) | Recent `service.version` in the deploy query                  | Wait 10m for cache to warm. If sustained, check that the new build hash is propagating consistently.       |
+| 2    | A new query parameter is hashing into the cache key         | One route's MISS share is far higher than the rest             | Check `cacheHeaders` / `ignoreSearchParams` config for that route; add the new param to the ignore list.   |
+| 3    | Set-Cookie present on a previously cacheable response       | `X-Cache: BYPASS` with `X-Cache-Reason: private-set-cookie` on the affected route | Inspect the section that started emitting cookies; move the cookie write to a non-cacheable POST handler. |
+| 4    | A real burst of unique URLs (e.g. crawler scanning long-tail) | `Attributes['route_pattern']` doesn't change but distinct paths multiply | If a known bot, add a WAF rule. If a real catalog query, consider broader cache profile.                  |
+## Escalation
+- Sustained > 1 hour despite no deploy → page the site team owner.
+- Suspected bot/abuse → loop in security / WAF on-call.
+## Post-mortem hook
+- The "before" hit rate and the "after" hit rate.
+- The top route that lost the hit rate.
+- A representative response header snippet showing `X-Cache`,
+  `X-Cache-Profile`, `X-Cache-Reason`.

package/docs/runbooks/commerce-upstream-slow.md ADDED

@@ -0,0 +1,88 @@
+# Runbook: `commerce-upstream-slow`
+> A site's `commerce_request_duration_ms` p95 exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
+## What this alert means
+Calls out to a commerce provider (VTEX, Shopify, or similar) are
+taking abnormally long for *this* site. Because SSR is synchronous on
+upstream commerce calls, a slow upstream cascades into the user-facing
+`http-latency-spike` alert almost immediately. If both fired together,
+this is the root cause — fix here first.
+## First check (60 seconds)
+Which provider/operation is slow? The same dashboard's "Commerce p95
+by provider/operation" panel breaks it out. Note the
+`provider.operation` string — e.g. `vtex.intelligent-search.product_search`.
+If a single operation is responsible, jump to "Common causes" #1.
+If multiple operations from the same provider are slow simultaneously,
+that's a provider-wide regression — jump to "Common causes" #2.
+## Diagnostic queries
+```sql
+-- p95 commerce latency by provider + operation, last hour
+SELECT
+  toStartOfInterval(TimeUnix, INTERVAL 5 MINUTE) AS t,
+  Attributes['provider'] AS provider,
+  Attributes['operation'] AS op,
+  quantileBFloat16(0.95)(toFloat64(Sum / nullIf(Count, 0))) AS p95
+FROM otel_metrics_histogram
+WHERE MetricName = 'commerce_request_duration_ms'
+  AND ServiceName = '{site}'
+  AND TimeUnix > now() - INTERVAL 1 HOUR
+GROUP BY t, provider, op
+ORDER BY t, p95 DESC;
+```
+```sql
+-- Commerce call status distribution — are we getting 5xx from upstream?
+SELECT
+  Attributes['provider'] AS provider,
+  Attributes['operation'] AS op,
+  Attributes['status_class'] AS status_class,
+  count() AS n
+FROM otel_metrics_histogram
+WHERE MetricName = 'commerce_request_duration_ms'
+  AND ServiceName = '{site}'
+  AND TimeUnix > now() - INTERVAL 30 MINUTE
+GROUP BY provider, op, status_class
+ORDER BY n DESC;
+```
+```sql
+-- VTEX SWR cache effectiveness on the slow operation
+SELECT
+  Attributes['cached'] AS cached,
+  count() AS n,
+  avg(toFloat64(Sum / nullIf(Count, 0))) AS avg_ms
+FROM otel_metrics_histogram
+WHERE MetricName = 'commerce_request_duration_ms'
+  AND ServiceName = '{site}'
+  AND Attributes['operation'] = '<paste operation here>'
+  AND TimeUnix > now() - INTERVAL 30 MINUTE
+GROUP BY cached;
+```
+## Common causes & fixes
+| Rank | Cause                                                | How to confirm                                                                                  | Fix                                                                                                                |
+|------|------------------------------------------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
+| 1    | One specific upstream operation is slow              | Single `provider.operation` row dominates the p95 query                                          | Check provider status page (status.vtex.com, www.shopifystatus.com). If clean, see if we recently changed payload size or filter on that operation. |
+| 2    | Provider-wide regression                             | Multiple operations from the same `provider` regressed simultaneously                            | Public provider status page is usually the source of truth. Open a ticket with the provider citing our timing window.    |
+| 3    | VTEX SWR / cachedLoader hit rate dropped             | Query 3 shows `cached=false` share rose                                                          | Inspect recent loader changes for the affected section. May have invalidated the cache key by changing the loader signature. |
+| 4    | Region-specific (CF colo → upstream latency)         | `region` label on the metric isolates one CF colo                                                | Usually transient; CF will rebalance. If sustained, file a CF support ticket.                                       |
+## Escalation
+- Provider-wide regression confirmed → notify the affected customer-facing teams; this is communication-shaped, not engineering-shaped.
+- One operation slow, no provider status incident → page the site team owner for that route.
+## Post-mortem hook
+- The `provider.operation` string and its p95 timeline.
+- The cache (`cached=true/false`) split on that operation.
+- A representative trace from `otel_traces` showing the slow span
+  (`SpanName LIKE 'vtex.%'` or `'shopify.%'`).

package/docs/runbooks/http-error-spike.md ADDED

@@ -0,0 +1,98 @@
+# Runbook: `http-error-spike`
+> A site's 5xx error rate exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
+## What this alert means
+Real users are getting 5xx responses at a rate that's statistically
+abnormal for this specific site. The alert uses a per-site anomaly band
+(not a fleet-wide threshold) so a site that normally runs at 0.3% 5xx
+fires for spikes other sites wouldn't notice — and a site that normally
+runs at 4% (a known-noisy legacy storefront) doesn't false-positive at
+4.1%.
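Not part of the published diff: a minimal sketch of the per-site anomaly band described above, assuming one error-rate sample per minute, a population standard deviation over the 24h baseline, and "sustained" meaning every one of the last 10 samples is above the band. The real alert rule may be computed differently; `sustainedBreach` is a hypothetical name.

```typescript
function mean(xs: number[]): number {
  let sum = 0;
  for (const x of xs) sum += x;
  return sum / xs.length;
}

// Population standard deviation of the baseline samples.
function stddev(xs: number[]): number {
  const m = mean(xs);
  let sq = 0;
  for (const x of xs) sq += (x - m) * (x - m);
  return Math.sqrt(sq / xs.length);
}

// True when each of the last `sustainSamples` values exceeds
// baseline mean + sigmas * baseline standard deviation.
function sustainedBreach(
  baseline: number[],
  recent: number[],
  sigmas = 3,
  sustainSamples = 10,
): boolean {
  if (baseline.length === 0 || recent.length < sustainSamples) return false;
  const threshold = mean(baseline) + sigmas * stddev(baseline);
  const tail = recent.slice(recent.length - sustainSamples);
  return tail.every((v) => v > threshold);
}
```

Because the threshold is derived from each site's own baseline, a sustained 1% error rate trips the check for a site that normally sits near 0.3%, while a site that normally sits near 4% stays quiet at 4.1%.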
+## First check (60 seconds)
+Look at the **commerce upstream p95** panel on the same dashboard. If
+that spiked at the same moment, the root cause is almost always an
+upstream commerce API regressing. Stop here, jump to
+[`commerce-upstream-slow.md`](./commerce-upstream-slow.md).
+If commerce p95 is flat, the 5xx is internal — proceed below.
+## Diagnostic queries
+Paste into ClickStack or a Grafana Explore panel pointed at the
+ClickHouse datasource.
+```sql
+-- Top error routes for this site, last 30 minutes
+SELECT
+  Attributes['route_pattern'] AS route,
+  countIf(Attributes['status_class'] = '5xx') AS errors,
+  count()                                     AS total,
+  errors / total                              AS rate
+FROM otel_metrics_sum
+WHERE MetricName = 'http_requests_total'
+  AND ServiceName = '{site}'
+  AND TimeUnix > now() - INTERVAL 30 MINUTE
+GROUP BY route
+HAVING errors > 0
+ORDER BY rate DESC
+LIMIT 20;
+```
+```sql
+-- Recent exceptions captured by the tail worker
+SELECT Timestamp, Body, LogAttributes['url.path'] AS path, LogAttributes['http.response.status_code'] AS status
+FROM otel_logs
+WHERE ServiceName = '{site}'
+  AND SeverityText = 'ERROR'
+  AND LogAttributes['_source'] = 'tail-worker'
+  AND LogAttributes['_outcome'] = 'exception'
+  AND Timestamp > now() - INTERVAL 30 MINUTE
+ORDER BY Timestamp DESC
+LIMIT 100;
+```
+```sql
+-- Did a deploy correlate? List versions seen in the last hour
+SELECT
+  ResourceAttributes['service.version'] AS version,
+  min(Timestamp) AS first_seen,
+  max(Timestamp) AS last_seen,
+  count() AS log_count
+FROM otel_logs
+WHERE ServiceName = '{site}'
+  AND Timestamp > now() - INTERVAL 1 HOUR
+GROUP BY version
+ORDER BY first_seen DESC;
+```
+## Common causes & fixes
+| Rank | Cause                                              | How to confirm                                              | Fix                                                                 |
+|------|----------------------------------------------------|-------------------------------------------------------------|---------------------------------------------------------------------|
+| 1    | A recent deploy regressed                          | Deploy-correlation query above (the third) shows a `service.version` that flipped just before the spike | Roll back via Cloudflare dashboard `Deployments → Rollback`. Confirm via a fresh `service.version` line in the next 5m. |
+| 2    | A specific route is broken (one bad section)       | Top error routes query shows one `route_pattern` at 100% error rate | Check the recent commits to that section. Roll back or `Lazy` wrap it for graceful degradation. |
+| 3    | Upstream cache layer evicted; cold-cache thundering herd | `cache_miss_total` for the same window spikes proportionally to errors | Wait it out — usually self-heals in 5m. If sustained, check that `staleTime` is set correctly on cmsRouteConfig. |
+| 4    | Origin (commerce API) returning 5xx                | `commerce_request_duration_ms` spike OR commerce logs       | See [`commerce-upstream-slow.md`](./commerce-upstream-slow.md).      |
+## Escalation
+- **Site team owner** if a fix isn't obvious in 15 minutes (Slack
+  `#deco-platform`).
+- **Cloudflare support** if all sites in a region are affected
+  simultaneously (look at the `region` label on the metrics) — this
+  has happened during CF colo incidents.
+## Post-mortem hook
+Capture before the alert clears:
+- `request.id` of one failing request (from the response header
+  `X-Request-Id` of a manually-reproduced 5xx).
+- A representative tail-worker log row with stack trace.
+- The deploy `service.version` window during the spike.
+Stash them in the incident ticket so the post-mortem has the
+correlation IDs it needs.

package/docs/runbooks/http-latency-spike.md ADDED

@@ -0,0 +1,82 @@
+# Runbook: `http-latency-spike`
+> A site's p95 latency exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
+## What this alert means
+User-perceived latency on this site is statistically abnormal vs the
+last 24 hours. Latency rarely degrades in isolation — almost always
+something else is bottlenecked underneath. Use this alert as the
+"something is wrong, look around" signal, then triangulate.
+## First check (60 seconds)
+Open the dashboard's **commerce p95 by provider/operation** panel. The
+most common cause of p95 spikes is an upstream commerce API (VTEX,
+Shopify) slowing down — and our SSR is synchronous on the upstream
+call.
+If commerce p95 spiked at the same moment, jump to
+[`commerce-upstream-slow.md`](./commerce-upstream-slow.md).
+## Diagnostic queries
+```sql
+-- Latency p95 by route_pattern, last hour
+SELECT
+  toStartOfInterval(TimeUnix, INTERVAL 5 MINUTE) AS t,
+  Attributes['route_pattern'] AS route,
+  quantileBFloat16(0.95)(toFloat64(Sum / nullIf(Count, 0))) AS p95
+FROM otel_metrics_histogram
+WHERE MetricName = 'http_request_duration_ms'
+  AND ServiceName = '{site}'
+  AND TimeUnix > now() - INTERVAL 1 HOUR
+GROUP BY t, route
+ORDER BY t, p95 DESC;
+```
+```sql
+-- Cache decision distribution — did hit rate drop while latency rose?
+SELECT
+  Attributes['cache_decision'] AS decision,
+  count() AS n,
+  avg(toFloat64(Sum / nullIf(Count, 0))) AS avg_ms
+FROM otel_metrics_histogram
+WHERE MetricName = 'http_request_duration_ms'
+  AND ServiceName = '{site}'
+  AND TimeUnix > now() - INTERVAL 30 MINUTE
+GROUP BY decision
+ORDER BY n DESC;
+```
+```sql
+-- Slow traces with full span breakdown (sampled ~1%, so re-run if empty)
+SELECT TraceId, SpanName, Duration / 1e6 AS ms, SpanAttributes['url.path'] AS path
+FROM otel_traces
+WHERE ServiceName = '{site}'
+  AND Timestamp > now() - INTERVAL 30 MINUTE
+  AND SpanName = 'deco.http.request'
+  AND (Duration / 1e6) > 2000
+ORDER BY Duration DESC
+LIMIT 50;
+```
+## Common causes & fixes
+| Rank | Cause                                                | How to confirm                                                                                | Fix                                                                                  |
+|------|------------------------------------------------------|-----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
+| 1    | Upstream commerce API slow                           | Commerce p95 panel spikes with the same shape                                                 | See [`commerce-upstream-slow.md`](./commerce-upstream-slow.md).                       |
+| 2    | Cache hit rate dropped (cold cache after deploy/purge) | Cache panel shows MISS share rose at spike start; usually self-heals within 5-10m            | Wait it out unless sustained; if sustained check the route-level cache profile.      |
+| 3    | One specific route is slow (heavy loader added)      | Per-route p95 query shows one `route_pattern` dominating                                      | Inspect recent commits to that route's loader. Consider deferring sections via `Lazy`. |
+| 4    | Cloudflare edge / colo issue                         | `region` label distribution skewed to one or two colos                                        | Check CF status page; usually clears on its own.                                     |
+## Escalation
+- 30 minutes without resolution → page the site team owner.
+- All sites in a region affected → suspect CF infra; check status.cloudflare.com.
+## Post-mortem hook
+- A representative slow `TraceId` from the third query above.
+- The cache hit rate before/during the spike.
+- Deploy version at the start of the window.

package/docs/runbooks/tail-exception-spike.md ADDED

@@ -0,0 +1,100 @@
+# Runbook: `tail-exception-spike`
+> A site's tail-worker `_outcome=exception` count exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
+## What this alert means
+Real, uncaught exceptions are happening in the Worker — captured by
+the tail consumer with 100% fidelity (`deco-otel-tail`). After Phase 1
+severity reclassification, this alert specifically excludes `canceled`
+and `responseStreamDisconnected` outcomes (those are client-disconnect
+noise, not bugs). What's left is a true bug, OOM, or CPU-limit kill.
+## First check (60 seconds)
+```sql
+-- What's blowing up, last 15 minutes
+SELECT Body, LogAttributes['url.path'] AS path, count() AS n
+FROM otel_logs
+WHERE ServiceName = '{site}'
+  AND SeverityText = 'ERROR'
+  AND LogAttributes['_source'] = 'tail-worker'
+  AND LogAttributes['_outcome'] = 'exception'
+  AND Timestamp > now() - INTERVAL 15 MINUTE
+GROUP BY Body, path
+ORDER BY n DESC
+LIMIT 30;
+```
+If 90% of the rows share the same `Body` (same exception class /
+message), that's the bug — proceed to "Common causes" #1.
+If the exceptions are scattered across many distinct messages, you
+likely have a resource problem (OOM / CPU limit) — proceed to #2.
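Not part of the published diff: the first-check rule above can be expressed as a small triage helper. This is a sketch under assumptions: rows are the `(Body, path, n)` results of the first-check query, counts are summed per `Body` across paths, and anything below the 90% share is treated as "scattered". The runbook itself only names the two extremes. `triageExceptions` is a hypothetical name.

```typescript
type ExceptionRow = { body: string; path: string; n: number };
type Triage = "single-bug" | "scattered" | "no-data";

// Decide whether one exception message dominates the recent window.
function triageExceptions(rows: ExceptionRow[], dominantShare = 0.9): Triage {
  // The query groups by Body and path, so sum counts per Body first.
  const byBody: Record<string, number> = Object.create(null);
  let total = 0;
  for (const row of rows) {
    byBody[row.body] = (byBody[row.body] || 0) + row.n;
    total += row.n;
  }
  if (total === 0) return "no-data";

  let top = 0;
  for (const body of Object.keys(byBody)) {
    if (byBody[body] > top) top = byBody[body];
  }
  // "single-bug" maps to Common causes #1, "scattered" to #2.
  return top / total >= dominantShare ? "single-bug" : "scattered";
}
```

The threshold is a parameter because a share such as 60% is a judgment call the runbook does not cover; a responder would still want to look at the outcome distribution in that case.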
+## Diagnostic queries
+```sql
+-- Outcome distribution — separate exception from exceededMemory / exceededCpu
+SELECT
+  LogAttributes['_outcome'] AS outcome,
+  count() AS n
+FROM otel_logs
+WHERE ServiceName = '{site}'
+  AND LogAttributes['_source'] = 'tail-worker'
+  AND Timestamp > now() - INTERVAL 30 MINUTE
+GROUP BY outcome
+ORDER BY n DESC;
+```
+```sql
+-- Did a specific deploy cause it?
+SELECT
+  LogAttributes['service.version'] AS version,
+  LogAttributes['_outcome'] AS outcome,
+  count() AS n
+FROM otel_logs
+WHERE ServiceName = '{site}'
+  AND LogAttributes['_source'] = 'tail-worker'
+  AND Timestamp > now() - INTERVAL 1 HOUR
+GROUP BY version, outcome
+ORDER BY n DESC;
+```
+```sql
+-- Pull the full record for one offending request to get request.id
+-- and trace.id for join queries
+SELECT *
+FROM otel_logs
+WHERE ServiceName = '{site}'
+  AND SeverityText = 'ERROR'
+  AND LogAttributes['_source'] = 'tail-worker'
+  AND LogAttributes['_outcome'] = 'exception'
+  AND Timestamp > now() - INTERVAL 15 MINUTE
+ORDER BY Timestamp DESC
+LIMIT 1;
+```
+## Common causes & fixes
+| Rank | Cause                                              | How to confirm                                                                | Fix                                                                                                              |
+|------|----------------------------------------------------|-------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
+| 1    | A single uncaught throw, recent deploy             | Same `Body` dominates; one `service.version` correlates                       | Roll back the deploy. File a bug with the offending stack + `request.id` for repro. Add a try/catch + structured `logger.error`. |
+| 2    | `exceededMemory` (OOM)                             | Outcome query shows non-trivial `exceededMemory` count                        | Look for large in-memory buffers, such as a `Response.text()` on a multi-MB upstream or a runaway `JSON.parse`. See the [`deco-site-memory-debugging`](https://github.com/decocms/deco-start/blob/main/.cursor/skills/deco-site-memory-debugging/SKILL.md) skill. |
+| 3    | `exceededCpu` (CPU-limit kill)                    | Outcome query shows `exceededCpu`                                            | Investigate a section with a heavy synchronous loop. Move work to a server function or shed load via cache.       |
+| 4    | A new upstream returning malformed responses      | `Body` references a third-party hostname; matches a known endpoint           | Add defensive parsing + a structured `logger.error` so the throw becomes a typed error, not a crash.             |
+## Escalation
+- `exceededMemory` / `exceededCpu` sustained → page site team + platform on-call. May indicate a leak that will recur until the isolate restarts.
+- A throw we can't decode in 15 minutes → page site team owner.
+## Post-mortem hook
+- One full record from query #3 above — preserves the
+  `request.id` / `trace.id` for cross-channel correlation.
+- The dominant `Body` (the exception message).
+- The `service.version` window.
+- Whether the alert fired on `exception` or `exceededMemory` /
+  `exceededCpu` — drives whether the post-mortem investigates code or
+  resource bounds.
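
The fix suggested for causes #1 and #4 in the runbook above (catch the throw, log it in a structured way, and return a typed error instead of crashing) could look roughly like the sketch below. It is not part of the package: `logger`, `UpstreamParseError`, `parseUpstreamJson`, and the `catalog.example` hostname are hypothetical names chosen for illustration, and the logger is a stand-in for whatever structured logger the site already uses.

```typescript
// Hypothetical structured logger; a stand-in for the site's real logger.
const logger = {
  error(message: string, attrs: Record<string, unknown>): void {
    console.error(JSON.stringify({ level: "error", message, ...attrs }));
  },
};

// Typed error so callers can branch on it instead of crashing the request.
class UpstreamParseError extends Error {
  upstream: string;
  constructor(upstream: string, reason: string) {
    super(`malformed response from ${upstream}: ${reason}`);
    this.name = "UpstreamParseError";
    this.upstream = upstream;
  }
}

type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: UpstreamParseError };

// Parse an upstream body without letting a bad payload escape as an
// uncaught exception. No schema validation happens here; the cast to T
// is an assumption the caller still has to verify.
function parseUpstreamJson<T>(upstream: string, body: string): ParseResult<T> {
  try {
    return { ok: true, value: JSON.parse(body) as T };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    const error = new UpstreamParseError(upstream, reason);
    logger.error(error.message, { upstream, bodyLength: body.length });
    return { ok: false, error };
  }
}

// Usage: a truncated payload becomes a value the handler can branch on.
const result = parseUpstreamJson<{ sku: string }>("catalog.example", '{"sku": ');
const useFallback = !result.ok; // e.g. serve a cached or empty section
```

Returning a result object rather than rethrowing keeps the failure out of the tail consumer's `exception` outcome while still leaving an ERROR-level log line to alert on.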

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@decocms/start",
-  "version": "6.1.0",
+  "version": "6.2.0",
   "type": "module",
   "description": "Deco framework for TanStack Start - CMS bridge, admin protocol, hooks, schema generation",
   "main": "./src/index.ts",

package/scripts/audit-observability-config.test.ts CHANGED

@@ -2,7 +2,11 @@ import * as fs from "node:fs";
 import * as os from "node:os";
 import * as path from "node:path";
 import { afterEach, beforeEach, describe, expect, it } from "vitest";
-import { auditObservabilityBlock } from "./audit-observability-config";
+import {
+  auditFleetBindings,
+  auditObservabilityBlock,
+  auditWranglerConfig,
+} from "./audit-observability-config";
 import { parseJsonc, stripJsoncTrailingCommas } from "./lib/jsonc";
 describe("auditObservabilityBlock", () => {
@@ -138,6 +142,125 @@ describe("auditObservabilityBlock", () => {
   });
 });
+describe("auditFleetBindings (D-14)", () => {
+  const canonicalBindings = {
+    version_metadata: { binding: "CF_VERSION_METADATA" },
+    analytics_engine_datasets: [{ binding: "DECO_METRICS", dataset: "deco_metrics_site" }],
+    tail_consumers: [{ service: "deco-otel-tail" }],
+    vars: {
+      DECO_OTEL_METRICS_ENDPOINT: "https://deco-otel-ingest.example/v1/metrics",
+      DECO_OTEL_TRACES_ENDPOINT: "https://deco-otel-ingest.example/v1/traces",
+      DECO_OTEL_LOGS_ENDPOINT: "https://deco-otel-ingest.example/v1/logs",
+    },
+  };
+  it("returns no findings for canonical bindings", () => {
+    expect(auditFleetBindings(canonicalBindings)).toEqual([]);
+  });
+  it("flags version_metadata_binding_missing as error", () => {
+    const { version_metadata: _, ...rest } = canonicalBindings;
+    const findings = auditFleetBindings(rest);
+    const f = findings.find((x) => x.id === "version_metadata_binding_missing");
+    expect(f?.severity).toBe("error");
+  });
+  it("flags version_metadata_binding_missing when binding is empty", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      version_metadata: { binding: "" },
+    });
+    const f = findings.find((x) => x.id === "version_metadata_binding_missing");
+    expect(f).toBeDefined();
+  });
+  it("flags analytics_engine_binding_missing as warn", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      analytics_engine_datasets: [],
+    });
+    const f = findings.find((x) => x.id === "analytics_engine_binding_missing");
+    expect(f?.severity).toBe("warn");
+  });
+  it("flags analytics_engine_binding_missing when binding name doesn't match DECO_METRICS", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      analytics_engine_datasets: [{ binding: "OTHER_NAME" }],
+    });
+    expect(findings.some((f) => f.id === "analytics_engine_binding_missing")).toBe(true);
+  });
+  it("flags tail_consumer_missing as error", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      tail_consumers: [],
+    });
+    const f = findings.find((x) => x.id === "tail_consumer_missing");
+    expect(f?.severity).toBe("error");
+  });
+  it("flags tail_consumer_missing when an unrelated tail consumer is configured", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      tail_consumers: [{ service: "another-tail" }],
+    });
+    expect(findings.some((f) => f.id === "tail_consumer_missing")).toBe(true);
+  });
+  it("flags otel_metrics_endpoint_missing when DECO_OTEL_METRICS_ENDPOINT is unset", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      vars: {
+        ...canonicalBindings.vars,
+        DECO_OTEL_METRICS_ENDPOINT: "",
+      },
+    });
+    expect(findings.some((f) => f.id === "otel_metrics_endpoint_missing")).toBe(true);
+  });
+  it("flags all three otel_*_endpoint_missing findings when `vars` is absent", () => {
+    const { vars: _vars, ...rest } = canonicalBindings;
+    const findings = auditFleetBindings(rest);
+    expect(findings.some((f) => f.id === "otel_traces_endpoint_missing")).toBe(true);
+    expect(findings.some((f) => f.id === "otel_logs_endpoint_missing")).toBe(true);
+    expect(findings.some((f) => f.id === "otel_metrics_endpoint_missing")).toBe(true);
+  });
+  it("handles missing vars object gracefully", () => {
+    expect(() => auditFleetBindings({ vars: undefined })).not.toThrow();
+  });
+});
+describe("auditWranglerConfig — composition", () => {
+  it("composes observability + fleet rules", () => {
+    const findings = auditWranglerConfig({});
+    const ids = findings.map((f) => f.id);
+    expect(ids).toContain("observability_missing");
+    expect(ids).toContain("version_metadata_binding_missing");
+    expect(ids).toContain("tail_consumer_missing");
+  });
+  it("returns no findings on a fully canonical wrangler", () => {
+    const findings = auditWranglerConfig({
+      observability: {
+        enabled: true,
+        logs: { enabled: true, head_sampling_rate: 1, persist: true },
+        traces: { enabled: true, head_sampling_rate: 0.01, persist: true },
+      },
+      version_metadata: { binding: "CF_VERSION_METADATA" },
+      analytics_engine_datasets: [{ binding: "DECO_METRICS", dataset: "deco_metrics_x" }],
+      tail_consumers: [{ service: "deco-otel-tail" }],
+      vars: {
+        DECO_OTEL_METRICS_ENDPOINT: "https://ingest.example/v1/metrics",
+        DECO_OTEL_TRACES_ENDPOINT: "https://ingest.example/v1/traces",
+        DECO_OTEL_LOGS_ENDPOINT: "https://ingest.example/v1/logs",
+      },
+    });
+    expect(findings).toEqual([]);
+  });
+});
 describe("JSONC handling — trailing commas + comments", () => {
   it("stripJsoncTrailingCommas removes commas before `}` and `]`", () => {
     expect(stripJsoncTrailingCommas(`{ "a": 1, "b": 2, }`)).toBe(`{ "a": 1, "b": 2 }`);
@@ -165,6 +288,133 @@ describe("JSONC handling — trailing commas + comments", () => {
   });
 });
+describe("CLI gate hardness (D-16) — --mode warn|block + --github", () => {
+  let tmpdir: string;
+  const cliPath = path.resolve(__dirname, "audit-observability-config.ts");
+  beforeEach(() => {
+    tmpdir = fs.mkdtempSync(path.join(os.tmpdir(), "audit-mode-"));
+  });
+  afterEach(() => {
+    fs.rmSync(tmpdir, { recursive: true, force: true });
+  });
+  // Spawn the script via tsx in a child process so we exercise the real
+  // `process.exit()` paths instead of monkey-patching them. This is the
+  // contract storefront CI consumes, so it's the contract under test.
+  function runCli(args: string[]): {
+    status: number | null;
+    stdout: string;
+    stderr: string;
+  } {
+    const { spawnSync } = require("node:child_process") as typeof import(
+      "node:child_process"
+    );
+    const result = spawnSync(
+      process.execPath,
+      [
+        require.resolve("tsx/cli"),
+        cliPath,
+        "--source",
+        tmpdir,
+        ...args,
+      ],
+      { encoding: "utf8" },
+    );
+    return {
+      status: result.status,
+      stdout: result.stdout,
+      stderr: result.stderr,
+    };
+  }
+  it("default mode is warn — exits 0 even with error findings", () => {
+    // Empty wrangler triggers `observability_missing` (error) +
+    // `tail_consumer_missing` (error) + `version_metadata_*` (error). Warn
+    // mode must annotate but exit 0.
+    fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
+    const { status, stdout } = runCli([]);
+    expect(status).toBe(0);
+    expect(stdout).toMatch(/observability_missing/);
+  });
+  it("--mode block exits 1 when an error-severity finding is present", () => {
+    fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
+    const { status, stdout } = runCli(["--mode", "block"]);
+    expect(status).toBe(1);
+    expect(stdout).toMatch(/observability_missing/);
+  });
+  it("--mode block exits 0 when only warn-severity findings are present", () => {
+    // Canonical observability block + the rest of the fleet bindings → only
+    // the DECO_OTEL_*_ENDPOINT warns survive. Block mode must exit 0 because
+    // those are `warn`, not `error`.
+    fs.writeFileSync(
+      path.join(tmpdir, "wrangler.jsonc"),
+      JSON.stringify({
+        name: "my-store",
+        observability: {
+          enabled: true,
+          traces: { enabled: true, head_sampling_rate: 0.01, persist: true },
+          logs: { enabled: true, head_sampling_rate: 1, persist: true },
+        },
+        version_metadata: { binding: "CF_VERSION_METADATA" },
+        analytics_engine_datasets: [{ binding: "DECO_METRICS" }],
+        tail_consumers: [{ service: "deco-otel-tail" }],
+      }),
+    );
+    const { status } = runCli(["--mode", "block"]);
+    expect(status).toBe(0);
+  });
+  it("--mode block exits 0 on a fully clean wrangler.jsonc", () => {
+    fs.writeFileSync(
+      path.join(tmpdir, "wrangler.jsonc"),
+      JSON.stringify({
+        name: "my-store",
+        observability: {
+          enabled: true,
+          traces: { enabled: true, head_sampling_rate: 0.01, persist: true },
+          logs: { enabled: true, head_sampling_rate: 1, persist: true },
+        },
+        version_metadata: { binding: "CF_VERSION_METADATA" },
+        analytics_engine_datasets: [{ binding: "DECO_METRICS" }],
+        tail_consumers: [{ service: "deco-otel-tail" }],
+        vars: {
+          DECO_OTEL_METRICS_ENDPOINT: "https://ingest.example.com",
+          DECO_OTEL_TRACES_ENDPOINT: "https://ingest.example.com",
+          DECO_OTEL_LOGS_ENDPOINT: "https://ingest.example.com",
+        },
+      }),
+    );
+    const { status } = runCli(["--mode", "block"]);
+    expect(status).toBe(0);
+  });
+  it("--github emits ::warning::/::error:: annotations matched to mode", () => {
+    fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
+    // In warn mode, even error-severity findings annotate as `warning` (we
+    // never escalate to GitHub `error` annotations when we won't fail the
+    // check — keeps the PR annotation channel quiet at v1).
+    const warnRun = runCli(["--github"]);
+    expect(warnRun.status).toBe(0);
+    expect(warnRun.stdout).toMatch(/::warning title=observability_missing::/);
+    expect(warnRun.stdout).not.toMatch(/::error title=/);
+    // In block mode, error-severity findings escalate to `::error::`.
+    const blockRun = runCli(["--mode", "block", "--github"]);
+    expect(blockRun.status).toBe(1);
+    expect(blockRun.stdout).toMatch(/::error title=observability_missing::/);
+  });
+  it("--mode rejects values other than warn|block with exit 2", () => {
+    fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
+    const { status, stderr } = runCli(["--mode", "advisory"]);
+    expect(status).toBe(2);
+    expect(stderr).toMatch(/--mode must be "warn" or "block"/);
+  });
+});
 describe("CLI smoke — wrangler.jsonc with trailing commas", () => {
   let tmpdir: string;
   beforeEach(() => {
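
The gate contract pinned down by the D-16 tests above (exit code per mode, and annotation level per mode and severity) can be summarized as a few pure functions. This is a minimal sketch under stated assumptions: the types are re-declared locally, the function names `annotationLevel`, `escapeData`, `toAnnotation`, and `exitCodeFor` are illustrative rather than exports of the package, and the two sample findings carry made-up messages.

```typescript
type Severity = "info" | "warn" | "error";
type GateMode = "warn" | "block";
type AnnotationLevel = "notice" | "warning" | "error";

interface Finding {
  id: string;
  severity: Severity;
  message: string;
  fix?: string;
}

// Error-severity findings escalate to `::error` only in block mode;
// in warn mode they are annotated as warnings.
function annotationLevel(mode: GateMode, severity: Severity): AnnotationLevel {
  if (mode === "block" && severity === "error") return "error";
  return severity === "info" ? "notice" : "warning";
}

// Escape the message part of a GitHub Actions workflow command (%, CR, LF).
function escapeData(text: string): string {
  return text.replace(/%/g, "%25").replace(/\r/g, "%0D").replace(/\n/g, "%0A");
}

function toAnnotation(mode: GateMode, f: Finding): string {
  const msg = f.fix ? `${f.message} (fix: ${f.fix})` : f.message;
  return `::${annotationLevel(mode, f.severity)} title=${f.id}::${escapeData(msg)}`;
}

// 0 in warn mode regardless of findings; 1 in block mode only when an
// error-severity finding exists. Exit 2 (bad input) is not modeled here.
function exitCodeFor(mode: GateMode, findings: Finding[]): 0 | 1 {
  return mode === "block" && findings.some((f) => f.severity === "error") ? 1 : 0;
}

// Sample findings with illustrative messages (not the package's wording).
const errFinding: Finding = {
  id: "observability_missing",
  severity: "error",
  message: "No `observability` key.",
};
const warnFinding: Finding = {
  id: "otel_logs_endpoint_missing",
  severity: "warn",
  message: "Logs endpoint unset.",
};
const findings: Finding[] = [errFinding, warnFinding];

for (const mode of ["warn", "block"] as const) {
  console.log(`mode=${mode} exit=${exitCodeFor(mode, findings)}`);
  for (const f of findings) console.log(toAnnotation(mode, f));
}
```

Keeping the decision in pure functions like these would let the policy be unit-tested without spawning a child process, while the spawn-based tests continue to cover the real `process.exit()` paths.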

package/scripts/audit-observability-config.ts CHANGED

@@ -15,27 +15,45 @@
  *
  * Rules (id — severity — what it catches):
  *
- *   observability_missing            error   No `observability` key at all. CF captures nothing.
- *   observability_disabled           error   `observability.enabled: false`. Master switch off.
- *   traces_disabled                  warn    `observability.traces.enabled: false`. No traces in dashboard.
- *   logs_disabled                    warn    `observability.logs.enabled: false`. No logs in dashboard.
- *   head_sampling_rate_elevated      error   `traces.head_sampling_rate > 0.01`. Fleet-scale cost risk; see docs/observability.md.
- *   logs_head_sampling_rate_low      warn    `logs.head_sampling_rate < 1`. Sampling info/warn logs loses signal cheaply; errors go via the direct-POST channel.
- *   persist_disabled_no_destination  error   `persist: false` with no destination configured. Data captured then discarded.
+ *   observability_missing                error   No `observability` key at all. CF captures nothing.
+ *   observability_disabled               error   `observability.enabled: false`. Master switch off.
+ *   traces_disabled                      warn    `observability.traces.enabled: false`. No traces in dashboard.
+ *   logs_disabled                        warn    `observability.logs.enabled: false`. No logs in dashboard.
+ *   head_sampling_rate_elevated          error   `traces.head_sampling_rate > 0.01`. Fleet-scale cost risk; see docs/observability.md.
+ *   logs_head_sampling_rate_low          warn    `logs.head_sampling_rate < 1`. Sampling info/warn logs loses signal cheaply; errors go via the direct-POST channel.
+ *   persist_disabled_no_destination      error   `persist: false` with no destination configured. Data captured then discarded.
+ *
+ * Phase 6 / D-14 — fleet-config drift rules (live outside the
+ * `observability` block but still owned by this audit):
+ *
+ *   version_metadata_binding_missing     error   Missing `version_metadata` binding. `service.version` won't be stamped — regressions can't be attributed to a deploy.
+ *   analytics_engine_binding_missing     warn    No `DECO_METRICS` AE binding. AE meter is off; OTLP meter still works but CF dashboard panels go dark.
+ *   tail_consumer_missing                error   No `tail_consumers` entry pointing at `deco-otel-tail`. 100% error-capture is broken.
+ *   otel_metrics_endpoint_missing        warn    `DECO_OTEL_METRICS_ENDPOINT` not set on `vars`. OTLP meter is off; only AE works.
+ *   otel_traces_endpoint_missing         warn    `DECO_OTEL_TRACES_ENDPOINT` not set on `vars`. Framework `deco.*` spans drop unless CF Traces is on.
+ *   otel_logs_endpoint_missing           warn    `DECO_OTEL_LOGS_ENDPOINT` not set on `vars`. Error logs ride CF Destinations only (head-sampled).
  *
  * Usage (from a site directory):
- *   npx -p @decocms/start deco-audit-observability                # audit cwd
- *   npx -p @decocms/start deco-audit-observability --source ./   # explicit
+ *   npx -p @decocms/start deco-audit-observability                # audit cwd (warn mode — exit 0)
+ *   npx -p @decocms/start deco-audit-observability --source ./   # explicit source dir
  *   npx -p @decocms/start deco-audit-observability --json         # machine-readable
+ *   npx -p @decocms/start deco-audit-observability --mode block   # error findings exit 1 (CI gate)
+ *   npx -p @decocms/start deco-audit-observability --github       # GitHub Actions annotations
  *
  * Options:
  *   --source <dir>   Site directory (default: .)
- *   --json           Emit findings as JSON to stdout (still exits non-zero on findings)
+ *   --json           Emit findings as JSON to stdout
+ *   --mode <m>       Gate hardness: "warn" (default — always exit 0 on findings,
+ *                    just print them) or "block" (exit 1 on any `error` finding).
+ *                    See D-16 in MIGRATION_TOOLING_PLAN.md for the rationale on
+ *                    why warn is the v1 default.
+ *   --github         Emit `::warning::` / `::error::` lines for GitHub Actions
+ *                    annotations in addition to the normal text output.
  *   --help, -h       Show this message
  *
  * Exit codes:
- *   0 — no findings (or only `info`-level findings; none defined yet)
- *   1 — at least one finding (warn or error)
+ *   0 — no findings, or `--mode warn` (the default) regardless of findings
+ *   1 — `--mode block` and at least one `error`-severity finding
  *   2 — file invalid / can't parse
  */
@@ -53,14 +71,24 @@ export interface Finding {
   fix?: string;
 }
+export type GateMode = "warn" | "block";
 interface CliOpts {
   source: string;
   json: boolean;
   help: boolean;
+  mode: GateMode;
+  github: boolean;
 }
 function parseArgs(argv: string[]): CliOpts {
-  const opts: CliOpts = { source: ".", json: false, help: false };
+  const opts: CliOpts = {
+    source: ".",
+    json: false,
+    help: false,
+    mode: "warn",
+    github: false,
+  };
   for (let i = 0; i < argv.length; i++) {
     const flag = argv[i];
     switch (flag) {
@@ -70,6 +98,20 @@ function parseArgs(argv: string[]): CliOpts {
       case "--json":
         opts.json = true;
         break;
+      case "--mode": {
+        const value = argv[++i];
+        if (value !== "warn" && value !== "block") {
+          console.error(
+            `audit: --mode must be "warn" or "block" (got "${value ?? ""}")`,
+          );
+          process.exit(2);
+        }
+        opts.mode = value;
+        break;
+      }
+      case "--github":
+        opts.github = true;
+        break;
       case "--help":
       case "-h":
         opts.help = true;
@@ -91,14 +133,18 @@ function showHelp(): void {
     npx -p @decocms/start deco-audit-observability [options]
   Options:
-    --source <dir>   Site directory (default: .)
-    --json           Emit findings as JSON
-    --help, -h       This message
+    --source <dir>     Site directory (default: .)
+    --json             Emit findings as JSON
+    --mode <m>         "warn" (default, exit 0) | "block" (exit 1 on errors)
+    --github           Emit ::warning::/::error:: lines for GitHub Actions
+    --help, -h         This message
   Exit codes:
-    0   no findings
-    1   one or more findings (warn or error)
+    0   no findings, OR --mode warn (default — annotate and move on)
+    1   --mode block AND at least one error-severity finding
     2   wrangler.jsonc missing or unparseable
+  See D-16 in MIGRATION_TOOLING_PLAN.md for the v1 "warn-only" policy.
 `);
 }
@@ -236,6 +282,138 @@ export function auditObservabilityBlock(
   return findings;
 }
+/**
+ * Fleet-config drift rules — owned by the same audit because the
+ * cumulative effect of "observability block correct, bindings missing"
+ * is identical to "observability block missing" (no data lands in
+ * ClickHouse).
+ *
+ * The CLI composes `auditObservabilityBlock` + `auditFleetBindings`.
+ * Both return Finding[]; callers concatenate.
+ */
+export interface WranglerLike {
+  observability?: ObservabilityBlock;
+  version_metadata?: { binding?: string } | unknown;
+  analytics_engine_datasets?: Array<{ binding?: string; dataset?: string }> | unknown;
+  tail_consumers?: Array<{ service?: string; environment?: string }> | unknown;
+  vars?: Record<string, unknown> | unknown;
+}
+export function auditFleetBindings(wrangler: WranglerLike): Finding[] {
+  const findings: Finding[] = [];
+  // version_metadata — required so `service.version` is stamped on every
+  // span and log line. Without it, regressions can't be tied to a
+  // specific deployment.
+  const vmBinding =
+    typeof wrangler.version_metadata === "object" &&
+    wrangler.version_metadata !== null &&
+    "binding" in wrangler.version_metadata
+      ? (wrangler.version_metadata as { binding?: string }).binding
+      : undefined;
+  if (!vmBinding) {
+    findings.push({
+      id: "version_metadata_binding_missing",
+      severity: "error",
+      message:
+        "wrangler.jsonc is missing a `version_metadata.binding` entry. " +
+        "`service.version` won't appear on spans/logs and the deploy-correlation " +
+        "panel will be empty. Recommended value: `CF_VERSION_METADATA`.",
+      fix: "npx -p @decocms/start deco-cf-observability --write",
+    });
+  }
+  // DECO_METRICS — Analytics Engine binding. The AE meter is the hot-
+  // path CF dashboard view; OTLP works without it but we lose the
+  // operator-grade short-window panels.
+  const aeDatasets = Array.isArray(wrangler.analytics_engine_datasets)
+    ? (wrangler.analytics_engine_datasets as Array<{ binding?: string }>)
+    : [];
+  const hasMetricsBinding = aeDatasets.some((d) => d?.binding === "DECO_METRICS");
+  if (!hasMetricsBinding) {
+    findings.push({
+      id: "analytics_engine_binding_missing",
+      severity: "warn",
+      message:
+        "wrangler.jsonc has no `analytics_engine_datasets[].binding == 'DECO_METRICS'`. " +
+        "The AE meter is off; the hot-path operator dashboards in CF will be empty. " +
+        "OTLP metrics keep flowing if `DECO_OTEL_METRICS_ENDPOINT` is set.",
+      fix: "npx -p @decocms/start deco-cf-observability --write",
+    });
+  }
+  // tail_consumers — must list deco-otel-tail. Phase 1 enrichment is
+  // useless without the tail consumer firing.
+  const tail = Array.isArray(wrangler.tail_consumers)
+    ? (wrangler.tail_consumers as Array<{ service?: string }>)
+    : [];
+  const hasTailConsumer = tail.some((t) => t?.service === "deco-otel-tail");
+  if (!hasTailConsumer) {
+    findings.push({
+      id: "tail_consumer_missing",
+      severity: "error",
+      message:
+        "wrangler.jsonc has no `tail_consumers[].service == 'deco-otel-tail'` entry. " +
+        "100% error-capture is broken — only the head-sampled CF Destinations path " +
+        "will report errors, and isolate crashes will be invisible.",
+      fix: "npx -p @decocms/start deco-cf-observability --write",
+    });
+  }
+  // DECO_OTEL_*_ENDPOINT env vars. Audit each separately so the message
+  // explains which channel is silently no-op.
+  const vars =
+    typeof wrangler.vars === "object" && wrangler.vars !== null
+      ? (wrangler.vars as Record<string, unknown>)
+      : {};
+  const checkVar = (id: string, name: string, severity: Severity, channel: string) => {
+    const v = vars[name];
+    if (typeof v !== "string" || v.length === 0) {
+      findings.push({
+        id,
+        severity,
+        message:
+          `wrangler.jsonc \`vars.${name}\` is not set. ${channel} is a no-op; ` +
+          `data captured in this channel never lands in ClickHouse. ` +
+          `See docs/observability.md for the canonical endpoints.`,
+        fix: "npx -p @decocms/start deco-cf-observability --write",
+      });
+    }
+  };
+  checkVar(
+    "otel_metrics_endpoint_missing",
+    "DECO_OTEL_METRICS_ENDPOINT",
+    "warn",
+    "OTLP metrics direct-POST",
+  );
+  checkVar(
+    "otel_traces_endpoint_missing",
+    "DECO_OTEL_TRACES_ENDPOINT",
+    "warn",
+    "OTLP traces direct-POST",
+  );
+  checkVar(
+    "otel_logs_endpoint_missing",
+    "DECO_OTEL_LOGS_ENDPOINT",
+    "warn",
+    "OTLP error-log direct-POST",
+  );
+  return findings;
+}
+/**
+ * One-stop call for the full wrangler audit — composes the
+ * observability-block rules with the fleet-binding rules. Both keep
+ * working standalone for fine-grained tests.
+ */
+export function auditWranglerConfig(wrangler: WranglerLike): Finding[] {
+  return [
+    ...auditObservabilityBlock(wrangler.observability),
+    ...auditFleetBindings(wrangler),
+  ];
+}
 function findingsToText(file: string, findings: Finding[]): string {
   if (findings.length === 0) {
     return `OK   ${file} — observability config looks canonical`;
@@ -263,26 +441,49 @@ function main(): void {
     process.exit(2);
   }
-  let parsed: { observability?: ObservabilityBlock };
+  let parsed: WranglerLike;
   try {
-    parsed = parseJsonc(fs.readFileSync(file, "utf8"));
+    parsed = parseJsonc(fs.readFileSync(file, "utf8")) as WranglerLike;
   } catch (err) {
     console.error(`audit: ${file} could not be parsed: ${(err as Error).message}`);
     process.exit(2);
   }
-  const findings = auditObservabilityBlock(parsed.observability);
+  const findings = auditWranglerConfig(parsed);
   if (opts.json) {
-    process.stdout.write(JSON.stringify({ file, findings }, null, 2) + "\n");
+    process.stdout.write(
+      JSON.stringify({ file, mode: opts.mode, findings }, null, 2) + "\n",
+    );
   } else {
     process.stdout.write(findingsToText(file, findings) + "\n");
   }
-  // Any finding (warn or error) is a non-zero exit. info-severity would not
-  // flip the exit, but no info-severity rules are defined yet.
-  const blocking = findings.some((f) => f.severity !== "info");
-  process.exit(blocking ? 1 : 0);
+  if (opts.github) {
+    for (const f of findings) {
+      // GitHub Actions workflow command. `error` and `warning` annotate the
+      // diff; `notice` is informational. We never emit `error` in warn mode
+      // even for error-severity rules — the v1 policy is annotate-don't-fail.
+      const level = opts.mode === "block" && f.severity === "error"
+        ? "error"
+        : f.severity === "info" ? "notice" : "warning";
+      const msg = `${f.message}${f.fix ? ` (fix: ${f.fix})` : ""}`;
+      const escaped = msg.replace(/%/g, "%25").replace(/\r/g, "%0D").replace(
+        /\n/g,
+        "%0A",
+      );
+      process.stdout.write(`::${level} title=${f.id}::${escaped}\n`);
+    }
+  }
+  // Exit policy: D-16 / Phase 6 decision.
+  //   warn   — annotate only; always exit 0 (CI sees the findings but ships)
+  //   block  — exit 1 on any `error`-severity finding
+  // The default is `warn` because storefronts are upgraded over weeks; a
+  // day-one block would fail PRs that have nothing to do with observability.
+  const shouldFail = opts.mode === "block" &&
+    findings.some((f) => f.severity === "error");
+  process.exit(shouldFail ? 1 : 0);
 }
 // Only run when invoked directly, not when imported by tests.