@decocms/start 6.1.0 → 6.2.1

@@ -293,9 +293,10 @@ Generates `MIGRATION_REPORT.md` with:
293
293
  ### Phase 7: Bootstrap
294
294
 
295
295
  Runs automatically after all phases (skipped in `--dry-run`):
296
- 1. `npm install` (or `bun install`)
297
- 2. `npx tsx node_modules/@decocms/start/scripts/generate-blocks.ts`
298
- 3. `npx tsr generate`
296
+ 1. `bun install`
297
+ 2. `bunx tsx node_modules/@decocms/start/scripts/generate-blocks.ts`
298
+ 3. `bunx tsx node_modules/@decocms/start/scripts/generate-invoke.ts` — emits `src/server/invoke.gen.ts` (top-level `createServerFn` declarations for every VTEX action, plus the `forwardResponseCookies()` Set-Cookie bridge). Without this step the site falls back to the `/deco/invoke/...` proxy and the cart breaks at `/checkout` after addItemToCart. See `.cursor/skills/deco-server-functions-invoke/troubleshooting.md` ("Cart 'forgets' items between requests") for the failure mode.
299
+ 4. `bunx tsr generate`
299
300
 
300
301
  ### Phase 8: Compile
301
302
 
@@ -31,6 +31,19 @@ export async function vtexFetchWithCookies<T>(
31
31
 
32
32
  Use in: `checkout.ts`, `auth.ts`, `session.ts` (create/edit).
33
33
 
34
+ ### Pitfall: never copy with `Headers.entries()` or `forEach`
35
+
36
+ When a caller pushes the captured cookies onto the request scope's response headers (e.g. `RequestContext.responseHeaders.append("set-cookie", c)`), the eventual HTTP-response bridge must read them back with `Headers.getSetCookie()`. **Do not** iterate `headers.entries()` (or `forEach`) and re-append, because both collapse multiple `Set-Cookie` values into a single comma-joined string, which browsers silently discard. The result: the cart appears empty after addItemToCart even though the action returned an OrderForm with items.
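
The difference between the two read paths can be sketched with the standard `Headers` API. This is a minimal illustration, not the generated `forwardResponseCookies()` or the framework's `forwardCtxHeadersTo()`: the helper name `copySetCookies` and the cookie values are hypothetical. It uses `get()` to show the comma-joined form, because that behavior is fixed by the Fetch standard, and `getSetCookie()` to show the un-collapsed list.

```typescript
// Hypothetical helper: copy every Set-Cookie value from one Headers object
// to another, one append per cookie, so nothing gets joined.
function copySetCookies(from: Headers, to: Headers): void {
  for (const cookie of from.getSetCookie()) {
    to.append("set-cookie", cookie);
  }
}

// Cookies captured from an upstream cart action (made-up values).
const captured = new Headers();
captured.append("set-cookie", "checkout.vtex.com__orderFormId=abc123; Path=/");
captured.append("set-cookie", "vtex_session=xyz; Path=/; HttpOnly");

// Safe path: the outgoing response keeps two distinct Set-Cookie values.
const outgoing = new Headers();
copySetCookies(captured, outgoing);
const forwarded = outgoing.getSetCookie();

// Unsafe path: get() returns one comma-joined string. Writing that string
// back as a single header produces the malformed cookie described above.
const collapsed = captured.get("set-cookie");

console.log(forwarded.length, collapsed);
```

`getSetCookie()` needs a reasonably recent runtime; if it is missing, upgrade the runtime rather than falling back to a join-and-split approach.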
37
+
38
+ Two bridges where this rule applies in a TanStack Start site:
39
+
40
+ - `src/server/invoke.gen.ts` — the auto-generated `forwardResponseCookies()` already uses `getSetCookie()`. Re-run `bunx tsx node_modules/@decocms/start/scripts/generate-invoke.ts` if the file is missing or stale.
41
+ - `@decocms/start/src/admin/invoke.ts` — `forwardCtxHeadersTo()` does the same for the `/deco/invoke/...` HTTP path. Versions ≥ 5.0.0 ship the fix.
42
+
43
+ ### Client-side `setOrderFormIdCookie` is defense-in-depth, not the fix
44
+
45
+ Some migrated `useCart` hooks (e.g. miess-01-tanstack) manually set `document.cookie = "checkout.vtex.com__orderFormId=..."` after each cart action. That only patches one cookie; `segment`, `sc`, `vtex_session`, etc. are still dropped. Keep the workaround if you like (it is cheap and idempotent), but the real fix is the two server-side bridges above. Once both are in place, the manual cookie write can be removed without regressing the cart.
46
+
34
47
  ## buildAuthCookieHeader
35
48
 
36
49
  VTEX IO GraphQL at `{account}.myvtex.com` requires both cookie names:
@@ -63,6 +63,21 @@ const result = await vtexFetchWithCookies<OrderForm>(url, opts);
63
63
 
64
64
  **Where NOT needed**: Read-only loaders, GraphQL queries.
65
65
 
66
+ **Server→browser bridge** — the cookies that `vtexFetchWithCookies` captures must be forwarded onto the outgoing HTTP response, or the browser never sees them and the cart appears empty on the next request. There are two bridge points in a TanStack Start site, and both must be wired:
67
+
68
+ 1. **`src/server/invoke.gen.ts`** (TanStack RPC path). Generated by `bunx tsx node_modules/@decocms/start/scripts/generate-invoke.ts`. Audit:
69
+ - File exists?
70
+ - Contains `function forwardResponseCookies()`?
71
+ - Every action handler calls `forwardResponseCookies()` after the `await`?
72
+
73
+ If any answer is "no", regenerate with the script above. Then make sure `useCart`, `useUser`, `useWishlist` import `invoke` from `~/server/invoke.gen` (or a barrel that re-exports it), not from the proxy `~/runtime.ts`.
74
+
75
+ 2. **`@decocms/start/src/admin/invoke.ts`** (`/deco/invoke/...` HTTP path). The framework's single + batch invoke handlers must use `Headers.getSetCookie()` (not `entries()`!) when copying `RequestContext.responseHeaders` onto the response. Pin to a version ≥ 5.0.0 that ships `forwardCtxHeadersTo`.
76
+
77
+ The historical failure mode: a `for…of headers.entries()` loop collapsed N `Set-Cookie` values into one comma-joined string, which browsers silently discard. Every VTEX cart action returns 3–5 cookies (`checkout.vtex.com__orderFormId`, `segment`, `sc`, `vtex_session`…), so even one collapse breaks the entire cart flow.
78
+
79
+ **Quick diagnosis**: Add an item, watch DevTools → Network → the cart-action response should have **multiple** `Set-Cookie:` rows, not one comma-joined line.
80
+
66
81
  ### 2. Auth Cookie Headers
67
82
 
68
83
  All authenticated VTEX IO GraphQL calls need both cookie variants:
@@ -1,5 +1,27 @@
1
1
  # Troubleshooting
2
2
 
3
+ ## Cart "forgets" items between requests / /checkout opens empty after addItemToCart
4
+
5
+ **Symptom**: `invoke.vtex.actions.addItemsToCart(...)` succeeds (returns an `OrderForm` with items), but the next page load — or clicking the cart icon — shows an empty cart, and `/checkout` lands on a fresh empty order. Sometimes the orderFormId is different from the one just returned.
6
+
7
+ **Root cause**: The VTEX cart cookies (`checkout.vtex.com__orderFormId`, `segment`, `sc`, `vtex_session`) never reach the browser because, somewhere in the chain, multiple `Set-Cookie` headers were collapsed into a single comma-joined string. Browsers silently discard malformed `Set-Cookie` values, so every subsequent request reaches VTEX without the cart cookies and gets a new, empty orderForm.
8
+
9
+ **Two places this can break**:
10
+
11
+ 1. **`src/server/invoke.gen.ts` missing or stale**. This is the TanStack RPC path. Each action must call `forwardResponseCookies()` after awaiting the underlying VTEX call. The helper uses `Headers.getSetCookie()` (not `entries()`!) to read the un-collapsed list and writes each value to TanStack's response via `setResponseHeader("set-cookie", [...])`. If the file doesn't exist, the site falls back to the `/deco/invoke/...` proxy.
12
+
13
+ 2. **`/deco/invoke/...` HTTP proxy** (`~/runtime.ts` pattern). The framework's admin handler (`@decocms/start/src/admin/invoke.ts`) used to iterate `RequestContext.responseHeaders.entries()` which collapses Set-Cookie. Fixed in @decocms/start ≥ 5.0.0 by switching to `getSetCookie()` + a `forwardCtxHeadersTo()` helper applied on both single and batch paths.
14
+
15
+ **Diagnosis**: Open DevTools → Network → response to the cart action. You should see **multiple distinct** `Set-Cookie:` rows. If you see a single `Set-Cookie: foo=1, bar=2; Path=/, baz=3` line, that's the collapse bug.
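
When checking many responses by eye is impractical, the same inspection can be approximated in code. The sketch below is a heuristic only and not part of `@decocms/start`; the name `looksCollapsed` is hypothetical. It flags a single `Set-Cookie` value that contains a comma followed directly by `name=`, which is what a joined header looks like. The comma inside an `Expires` date is followed by a day number and a space, so it does not match.

```typescript
// Heuristic pattern: ", <token>=" inside one Set-Cookie value suggests that
// several cookies were joined into one header line.
const JOINED_COOKIE = /,\s*[^\s;,=]+=/;

// Hypothetical helper: true when a single Set-Cookie value looks like
// several cookies collapsed together.
function looksCollapsed(setCookieValue: string): boolean {
  return JOINED_COOKIE.test(setCookieValue);
}

// The collapsed example from the diagnosis text.
const collapsedLine = looksCollapsed("foo=1, bar=2; Path=/, baz=3");

// A well-formed cookie whose only comma sits inside the Expires date.
const singleCookie = looksCollapsed(
  "foo=1; Expires=Wed, 21 Oct 2015 07:28:00 GMT; Path=/",
);

console.log(collapsedLine, singleCookie);
```

A cookie whose value itself contains `,name=` would be a false positive, so treat a hit as a prompt to look at the raw response, not as proof.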
16
+
17
+ **Fix**:
18
+ 1. Upgrade `@decocms/start` to the version with `forwardCtxHeadersTo` in `src/admin/invoke.ts` (search the file — both single and batch handlers should call it).
19
+ 2. Run `bunx tsx node_modules/@decocms/start/scripts/generate-invoke.ts` to regenerate `src/server/invoke.gen.ts`. Verify it has `function forwardResponseCookies()` and that every emitted handler calls it.
20
+ 3. Make sure `useCart` (and other VTEX hooks) imports `invoke` from `~/server/invoke.gen` (or a barrel re-export of it), not from `~/runtime`.
21
+ 4. The migration script (`scripts/migrate.ts` bootstrap) runs `generate-invoke.ts` automatically on freshly-migrated sites — if a site was migrated before that, run the generator manually.
22
+
23
+ **Client-side workaround** (defense-in-depth, removable): some sites manually set `document.cookie = "checkout.vtex.com__orderFormId=..."` inside `useCart`. That only patches one cookie of many. With the server-side fix in place, the workaround is harmless but no longer load-bearing. See `~/conductor/workspaces/miess-01-tanstack/newport-beach/src/hooks/useCart.ts` for an example.
24
+
3
25
  ## CORS Error on Add to Cart / Checkout
4
26
 
5
27
  **Symptom**: Browser console shows CORS error when calling VTEX API directly.
@@ -0,0 +1,209 @@
1
+ # RUM (Real-User Monitoring) — Plan
2
+
3
+ > Sibling deliverable to [`observability_refinement_plan_4fa41548.plan.md`](../../../.cursor/plans/observability_refinement_plan_4fa41548.plan.md).
4
+ > The refinement plan covers server-side observability end-to-end. This
5
+ > plan covers what runs **in the browser** — Core Web Vitals, JS errors,
6
+ > long tasks, resource timing, custom user-journey events.
7
+
8
+ ## Why this is a separate plan
9
+
10
+ Server-side telemetry tells us what our Workers did. RUM tells us what
11
+ the **user actually experienced** — including everything outside our
12
+ edge (DNS, TLS handshake, third-party scripts, the user's CPU, their
13
+ flaky LTE link, their ad-blocker). The two answer different questions:
14
+
15
+ | Question | Answered by |
16
+ |---|---|
17
+ | "Did we serve the request?" | server outcomes (Phase 2 of refinement plan) |
18
+ | "Was the user able to read the page?" | RUM (this plan) |
19
+ | "Did our deploy regress LCP on iOS Safari?" | RUM (this plan) |
20
+ | "Why did checkout abandon at 73%?" | RUM + server outcomes joined |
21
+
22
+ Today the answer to every RUM question is "we don't know." The plan
23
+ puts a floor on that.
24
+
25
+ ## Scope tiers — the decision you're making
26
+
27
+ The size of this plan changes by an order of magnitude depending on
28
+ scope. Three defensible tiers below; the work is **strictly additive**
29
+ between them so we can commit to Tier 1, run it for a quarter, and
30
+ upgrade to Tier 2 or 3 only if the data we collect surfaces a need.
31
+
32
+ ### Tier 1 — Core Web Vitals + JS errors (recommended for v1)
33
+
34
+ **What's collected:**
35
+ - LCP (Largest Contentful Paint), CLS (Cumulative Layout Shift),
36
+ INP (Interaction to Next Paint), FCP, TTFB — via the standard
37
+ [`web-vitals`](https://github.com/GoogleChrome/web-vitals) library.
38
+ - `window.onerror` + `window.onunhandledrejection` — uncaught JS errors
39
+ with stack, source URL, line/col, user agent, route pattern, deploy
40
+ id, `request.id` (same one the server stamped in Phase 1, joinable to
41
+ `otel_logs` and `otel_traces`).
42
+ - Page-context attributes: `route_pattern`, `service.version`,
43
+ `service.name`, `deployment.environment`, viewport, connection type
44
+ (`navigator.connection.effectiveType`), `cf-ray`.
45
+
46
+ **What's NOT collected:**
47
+ - Session replay.
48
+ - Custom user-journey events (add-to-cart, scroll-to-fold, etc.).
49
+ - Resource timing for every asset.
50
+ - User interaction heatmaps.
51
+
52
+ **Implementation footprint:** ~3 dev-weeks.
53
+ - `@decocms/start/sdk/rum.ts` — single browser-side module bundled into
54
+ the site entry. Reads `web-vitals` (peer dep, ~3KB gzipped), batches
55
+ events, sends to `/__deco/rum` on the same origin (no CORS / no
56
+ third-party script).
57
+ - `cmsRoute.ts` already serves a worker; add a `/__deco/rum` handler
58
+ that validates the payload, redacts referrer/URL via the shared PII
59
+ library (Phase 1), and forwards to the OTLP HTTP endpoint as
60
+ `otel_logs` with `SeverityText="INFO"` and `LogAttributes.rum.*`.
61
+ - ClickHouse rows land in the existing `otel_logs` table — no new
62
+ schema, no new pipeline. The Tier 1 query "p75 LCP per route per
63
+ site, last 7 days" is a single SQL join on the existing tables.
64
+ - One Grafana dashboard template added to
65
+ `stats-lake/observability/dashboards/templates/site-rum.json`. Auto-
66
+ provisioned via the existing Phase 5 script — `--with-rum` flag
67
+ mirrors `--with-alerts`.
68
+
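The batching step of the browser-side module could look roughly like the sketch below. Everything here is an assumption made for illustration: the plan does not specify a payload schema, so `RumMetric`, `createRumBatcher`, and the `{ events: [...] }` envelope are hypothetical. The sender is injected so the logic can run outside a browser; in a real page it might wrap `navigator.sendBeacon("/__deco/rum", body)`, with metrics supplied by the `web-vitals` callbacks.

```typescript
// Hypothetical event shape; the real payload schema is not defined in this plan.
interface RumMetric {
  name: string; // e.g. "LCP", "CLS", "INP"
  value: number;
  route: string; // route pattern, not the raw URL
}

type SendFn = (body: string) => void;

// Queue metrics and hand them to `send` in batches of `maxBatchSize`.
function createRumBatcher(send: SendFn, maxBatchSize = 10) {
  let queue: RumMetric[] = [];

  function flush(): void {
    if (queue.length === 0) return;
    const batch = queue;
    queue = [];
    send(JSON.stringify({ events: batch }));
  }

  function push(metric: RumMetric): void {
    queue.push(metric);
    if (queue.length >= maxBatchSize) flush();
  }

  return { push, flush };
}

// Usage with an in-memory sender standing in for the network call.
const sentBodies: string[] = [];
const batcher = createRumBatcher((body) => sentBodies.push(body), 2);

batcher.push({ name: "LCP", value: 2100, route: "/p/:slug" });
batcher.push({ name: "CLS", value: 0.04, route: "/p/:slug" }); // batch is full, first send
batcher.push({ name: "INP", value: 180, route: "/p/:slug" });
batcher.flush(); // e.g. when the page is being hidden, second send
```

Flushing on page hide matters because metrics such as CLS and INP are typically only final once the user leaves the page.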
69
+ **Cost:** marginal. One row per pageview per metric (~5 rows / pageview)
70
+ adds < 5% to existing log volume. Within the cost guardrail dashboard
71
+ (Phase 6) headroom.
72
+
73
+ **Risks:**
74
+ - INP requires the modern API; falls back to FID on older browsers.
75
+ Reported separately so the metric isn't muddied.
76
+ - `web-vitals` runs ~50ms of JS on first input; teams that obsess over
77
+ shaving milliseconds will want a build flag to disable it. Ship one
78
+ off the bat.
79
+
80
+ ### Tier 2 — Tier 1 + custom user-journey events + resource timing
81
+
82
+ **What's added:**
83
+ - A typed `rum.track(name, attributes)` API exposed from
84
+ `@decocms/start/sdk/rum.ts`. Sites call
85
+ `rum.track('add_to_cart', { sku, price, currency })` and the event
86
+ flows through the same `/__deco/rum` endpoint into `otel_logs` with a
87
+ reserved attribute namespace (`rum.event.*`).
88
+ - Resource Timing API rollup: for each pageview, total resource bytes,
89
+ count by content-type, slowest 5 URLs (with paths redacted). Helps
90
+ diagnose when a third-party tag has gone bad.
91
+ - A `rum.identify(userIdHash)` call that lets sites cohort by logged-in
92
+ user without sending PII (the site hashes the user id before passing
93
+ it in; we never see plaintext).
94
+
95
+ **Implementation footprint:** ~6 additional dev-weeks on top of Tier 1.
96
+ - The framework-side API + types: ~2 weeks.
97
+ - Resource Timing payload shape + redaction: ~1 week.
98
+ - Documentation, codemod fixtures, and an audit rule that enforces
99
+ the redaction helpers stay in use: ~3 weeks.
100
+ - A second Grafana dashboard (`site-rum-events.json`) that pivots on
101
+ `rum.event.*`.
102
+
103
+ **Risks:**
104
+ - Cardinality explosion: a site that emits `track('view_product', { id })`
105
+ with the product id as a label creates one time-series per SKU. The
106
+ attribute system has to **enforce** id-as-attribute / id-not-as-label
107
+ via the type system + a runtime check in the framework. Doable but
108
+ it's the new piece that has to be designed right.
109
+ - Custom events drift across sites unless we standardize a vocabulary.
110
+ Recommend shipping a small reserved-name list (`add_to_cart`,
111
+ `begin_checkout`, `purchase`, `view_product`) so fleet-wide dashboards
112
+ can roll up "conversion funnel" without per-site cooperation.
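
One way to combine the reserved-name list with the `rum.event.*` namespace is sketched below. This is an assumption about how such an API might be shaped, not the planned implementation: `toLogAttributes`, the `rum.event.name` key, and the sample values are all hypothetical. The type system rejects unknown event names at compile time, and a runtime check covers callers that bypass the types.

```typescript
// Reserved vocabulary suggested in the plan.
const RESERVED_EVENTS = ["add_to_cart", "begin_checkout", "purchase", "view_product"] as const;
type ReservedEvent = (typeof RESERVED_EVENTS)[number];
type AttributeValue = string | number | boolean;

// Hypothetical helper: flatten an event into log attributes under the
// reserved `rum.event.*` namespace. Ids travel as attribute values, never
// as part of the event name.
function toLogAttributes(
  name: ReservedEvent,
  attributes: Record<string, AttributeValue>,
): Record<string, AttributeValue> {
  // Runtime guard for untyped callers (plain JS, casts).
  if (!RESERVED_EVENTS.includes(name)) {
    throw new Error(`Unknown RUM event: ${name}`);
  }
  const out: Record<string, AttributeValue> = { "rum.event.name": name };
  for (const [key, value] of Object.entries(attributes)) {
    out[`rum.event.${key}`] = value;
  }
  return out;
}

const attrs = toLogAttributes("add_to_cart", {
  sku: "sku-123",
  price: 49.9,
  currency: "BRL",
});
console.log(attrs);
```

Keeping the event name to a small closed set is what lets fleet-wide dashboards group by `rum.event.name` without the cardinality growing per SKU.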
113
+
114
+ ### Tier 3 — Tier 2 + session replay + interaction heatmaps
115
+
116
+ **What's added:**
117
+ - Session replay: capture every DOM mutation + every user input as a
118
+ delta-encoded stream, replay it as a video in HyperDX / a custom
119
+ viewer. The canonical OSS implementation is
120
+ [`rrweb`](https://github.com/rrweb-io/rrweb).
121
+ - Interaction heatmaps: aggregate click positions over a page within a
122
+ given time window.
123
+
124
+ **Implementation footprint:** months.
125
+ - ~30KB gzipped of `rrweb` on every pageview — measurable LCP impact.
126
+ Mitigation: lazy-load after first interaction; some events miss the
127
+ first paint.
128
+ - The replay payload is enormous (~100KB/min/session compressed).
129
+ Multiplied by realistic session counts this is a 10-100× ingest
130
+ blowup vs Tier 1. Replay storage is the new bottleneck, not the
131
+ schema; we'd need an R2-backed cold tier separate from ClickHouse.
132
+ - A privacy review needs to ship before the first byte of replay
133
+ flows. PII redaction has to happen client-side before the network —
134
+ rrweb's "mask all inputs" mode is the floor; password fields, credit
135
+ card numbers, and authenticated user-data attributes need explicit
136
+ privacy classes.
137
+
138
+ **Risks:**
139
+ - Privacy: replay is a high-magnification footgun. One regression in
140
+ the mask config and we've recorded a user's credit card. The whole
141
+ payment-flow page must be force-masked at the framework level, not
142
+ opt-in per site.
143
+ - Cost: 10-100× the ingest of Tier 1 even with aggressive sampling.
144
+ The Phase 6 cost-guardrail dashboard will trip; needs a new tier of
145
+ retention policies (replay rows expire at 30d, not 90d like logs).
146
+
147
+ ## Out of scope (regardless of tier)
148
+
149
+ - **Synthetic monitoring.** Lighthouse runs against canonical journeys
150
+ on a cron. Covers a different need — "would a clean browser have
151
+ hit our SLO?" rather than "what did real users see?". A separate
152
+ initiative if we want it.
153
+ - **Heatmaps for individual users.** Aggregate heatmaps only; never
154
+ identifiable.
155
+ - **A/B-test attribution.** Sites that want it can pass an experiment
156
+ cohort id through `rum.identify` — we don't ship the experiment
157
+ framework itself.
158
+
159
+ ## Recommended sequencing
160
+
161
+ Ship Tier 1 in one PR. Live with the data for a quarter. If during
162
+ that quarter the question "but what did the user actually click before
163
+ they bounced?" comes up more than twice, plan and ship Tier 2. If
164
+ during *that* quarter we hit a category of bug that can only be
165
+ diagnosed by replay (so far: 0), plan Tier 3.
166
+
167
+ **Anti-recommendation:** do not commit to Tier 3 up front. Replay is
168
+ the highest-cost, highest-privacy-risk piece of the whole observability
169
+ surface, and the bug categories that genuinely need it are rare.
170
+
171
+ ## Decision points
172
+
173
+ These mirror the main plan's structure — answer once, then this
174
+ document gets a follow-up PR turning answers into TODOs.
175
+
176
+ 1. **Tier selection.** Tier 1 only / Tier 1 + Tier 2 / Full Tier 3.
177
+ 2. **Identify hashing.** If we're shipping Tier 2, do sites pass in a
178
+ client-side hash, or do we accept plaintext IDs and hash at the
179
+ ingest worker? (Recommend client-side hash — keeps plaintext out of
180
+ our pipeline entirely.)
181
+ 3. **Sampling.** RUM events are cheap per-row but high-volume. Default
182
+ sample rate: 100% (Tier 1), 100% (Tier 2 events), 10% (Tier 3
183
+ replay). Confirm or revise per tier.
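
For rates below 100%, one common approach is to sample deterministically per session, so a session is either fully recorded or not at all. The sketch below assumes that approach; the plan does not say how sampling would be keyed, and `isSampled` and the session id are hypothetical. It hashes the id with 32-bit FNV-1a (over UTF-16 code units) and compares the result against the rate.

```typescript
// 32-bit FNV-1a over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned
}

// Hypothetical helper: map the hash into [0, 1) and keep the session when
// it falls below the sample rate.
function isSampled(sessionId: string, rate: number): boolean {
  return fnv1a(sessionId) / 0x100000000 < rate;
}

const always = isSampled("session-42", 1); // rate 1 keeps every session
const never = isSampled("session-42", 0); // rate 0 keeps none
const replayDecision = isSampled("session-42", 0.1); // stable for this id
console.log(always, never, replayDecision);
```

Because the decision depends only on the id and the rate, the browser and the ingest worker can reach the same answer without coordinating.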
184
+
185
+ ## Files this plan would touch (Tier 1)
186
+
187
+ ```
188
+ deco-start/
189
+ ├── src/sdk/rum.ts # NEW — browser-side instrumentation
190
+ ├── src/sdk/rum.server.ts # NEW — /__deco/rum handler
191
+ ├── src/admin/setup.ts # ROUTE — mount /__deco/rum
192
+ ├── src/sdk/observability.ts # EXPORT — re-export rum API
193
+ ├── package.json # DEP — add `web-vitals` peer dep
194
+ └── docs/rum.md # NEW — site-side usage docs
195
+
196
+ stats-lake/observability/
197
+ └── dashboards/templates/site-rum.json # NEW
198
+ ```
199
+
200
+ ## Why this is a plan and not an implementation
201
+
202
+ The user explicitly asked for RUM to be in a separate plan document
203
+ rather than rolled into the refinement plan. The scope-tier decision
204
+ above is the single highest-leverage choice; everything downstream
205
+ follows from it, and "Tier 3 because Tier 3 is biggest" is the wrong
206
+ default. A 30-minute conversation on the tier choice saves weeks of
207
+ work in the wrong direction.
208
+
209
+ Open the matching Linear / GitHub issue once a tier is selected.
@@ -0,0 +1,40 @@
1
+ # Runbooks
2
+
3
+ Tested response procedures for the alerts auto-provisioned by
4
+ [`stats-lake/observability/`](https://github.com/decocms/stats-lake/blob/main/observability/README.md).
5
+
6
+ Every alert generated by `provision-dashboards.ts` carries a
7
+ `runbook_url` annotation that points to a Markdown file in this
8
+ directory. Each runbook follows the same structure so on-call can read
9
+ top-to-bottom under stress:
10
+
11
+ 1. **What this alert means.** One paragraph, no jargon.
12
+ 2. **First check (60 seconds).** The single most likely cause and how to
13
+ confirm it.
14
+ 3. **Diagnostic queries.** ClickHouse SQL the responder can paste into
15
+ Grafana / ClickStack to dig deeper.
16
+ 4. **Common causes & fixes.** Ranked by frequency. Each with a "did
17
+ that fix it?" verification step.
18
+ 5. **Escalation.** When to page a domain owner, and who.
19
+ 6. **Post-mortem hook.** What to capture so the post-mortem isn't
20
+ reconstructed from memory.
21
+
22
+ ## Runbook catalogue
23
+
24
+ | Alert ID | Runbook |
25
+ |---------------------------|--------------------------------------------------------|
26
+ | `http-error-spike` | [`http-error-spike.md`](./http-error-spike.md) |
27
+ | `http-latency-spike` | [`http-latency-spike.md`](./http-latency-spike.md) |
28
+ | `cache-hit-drop` | [`cache-hit-drop.md`](./cache-hit-drop.md) |
29
+ | `commerce-upstream-slow` | [`commerce-upstream-slow.md`](./commerce-upstream-slow.md) |
30
+ | `tail-exception-spike` | [`tail-exception-spike.md`](./tail-exception-spike.md) |
31
+
32
+ ## Authoring a new runbook
33
+
34
+ When you add a new alert template, add a runbook with the same ID. The
35
+ provisioning script doesn't enforce the link today — that gate lands in
36
+ Phase 6 (Governance), but treat it as required: an alert without a
37
+ runbook is a half-shipped alert.
38
+
39
+ Keep runbooks short. If a section grows past 20 lines, split it out
40
+ into a dedicated incident-pattern doc and link to it from the runbook.
@@ -0,0 +1,83 @@
1
+ # Runbook: `cache-hit-drop`
2
+
3
+ > A site's edge cache hit rate fell below its own 24h rolling baseline by 3σ for ≥ 10 minutes.
4
+
5
+ ## What this alert means
6
+
7
+ The edge cache is missing more than usual. On the user side this
8
+ manifests as slower page loads. On the cost side it means more origin
9
+ requests (more billing for Workers + commerce API calls). On the
10
+ upstream side it can become a thundering herd if many users hit a
11
+ freshly-evicted entry simultaneously.
12
+
13
+ ## First check (60 seconds)
14
+
15
+ Was there a deploy or a cache purge in the last 10 minutes? Cold caches
16
+ recover quickly (5–10m) so if the alert is fresh and a deploy is
17
+ recent, this often self-heals.
18
+
19
+ ```sql
20
+ -- Recent deploys (any change to service.version visible in metrics)
21
+ SELECT ResourceAttributes['service.version'] AS version, min(TimeUnix) AS first_seen
22
+ FROM otel_metrics_sum
23
+ WHERE ServiceName = '{site}'
24
+ AND TimeUnix > now() - INTERVAL 1 HOUR
25
+ GROUP BY version
26
+ ORDER BY first_seen DESC;
27
+ ```
28
+
29
+ If neither a deploy nor a purge happened in the window, the elevated
30
+ miss share likely indicates a real regression; proceed below.
31
+
32
+ ## Diagnostic queries
33
+
34
+ ```sql
35
+ -- Hit / miss share by route_pattern, last 30 minutes
36
+ SELECT
37
+ Attributes['route_pattern'] AS route,
38
+ countIf(MetricName = 'cache_hit_total') AS hits,
39
+ countIf(MetricName = 'cache_miss_total') AS misses,
40
+ hits / nullIf(hits + misses, 0) AS hit_rate
41
+ FROM otel_metrics_sum
42
+ WHERE MetricName IN ('cache_hit_total', 'cache_miss_total')
43
+ AND ServiceName = '{site}'
44
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
45
+ GROUP BY route
46
+ ORDER BY misses DESC
47
+ LIMIT 20;
48
+ ```
49
+
50
+ ```sql
51
+ -- Cache decision distribution by cache_profile
52
+ SELECT
53
+ Attributes['profile'] AS profile,
54
+ Attributes['decision'] AS decision,
55
+ sum(toFloat64(Value)) AS n
56
+ FROM otel_metrics_sum
57
+ WHERE MetricName = 'cache_hit_total'
58
+ AND ServiceName = '{site}'
59
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
60
+ GROUP BY profile, decision
61
+ ORDER BY n DESC;
62
+ ```
63
+
64
+ ## Common causes & fixes
65
+
66
+ | Rank | Cause | How to confirm | Fix |
67
+ |------|-------------------------------------------------------------|----------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
68
+ | 1 | Deploy purged the version cache (`X-Cache-Version` flipped) | Recent `service.version` in the deploy query | Wait 10m for cache to warm. If sustained, check that the new build hash is propagating consistently. |
69
+ | 2 | A new query parameter is hashing into the cache key | One route's MISS share is far higher than the rest | Check `cacheHeaders` / `ignoreSearchParams` config for that route; add the new param to the ignore list. |
70
+ | 3 | Set-Cookie present on a previously cacheable response | `X-Cache: BYPASS` with `X-Cache-Reason: private-set-cookie` on the affected route | Inspect the section that started emitting cookies; move the cookie write to a non-cacheable POST handler. |
71
+ | 4 | A real burst of unique URLs (e.g. crawler scanning long-tail) | `Attributes['route_pattern']` doesn't change but distinct paths multiply | If a known bot, add a WAF rule. If a real catalog query, consider broader cache profile. |
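
Cause 2 is easier to reason about with a concrete picture of how a query parameter fragments the cache. The sketch below is an illustration under assumptions, not the framework's cache-key logic: `normalizeCacheKey`, the host name, and the parameter names are hypothetical. It drops ignored parameters and sorts the rest, so URLs that differ only by a tracking parameter map to the same key.

```typescript
// Hypothetical helper: derive a cache key from a URL by removing ignored
// search params and sorting the remaining ones.
function normalizeCacheKey(rawUrl: string, ignoredParams: readonly string[]): string {
  const url = new URL(rawUrl);
  for (const name of ignoredParams) {
    url.searchParams.delete(name);
  }
  url.searchParams.sort();
  return url.toString();
}

const ignored = ["utm_source", "gclid"];

// Same page, different tracking params: one key once the params are ignored.
const keyA = normalizeCacheKey("https://shop.example/p?utm_source=mail&sku=1", ignored);
const keyB = normalizeCacheKey("https://shop.example/p?sku=1&gclid=abc", ignored);

// Without the ignore list, each tracking value would create its own entry.
const keyUnignored = normalizeCacheKey("https://shop.example/p?utm_source=mail&sku=1", []);
console.log(keyA, keyB, keyUnignored);
```

If a route's MISS share jumps right after a marketing campaign starts, comparing keys this way for a handful of sampled URLs is a quick way to spot the new parameter.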
72
+
73
+ ## Escalation
74
+
75
+ - Sustained > 1 hour despite no deploy → page the site team owner.
76
+ - Suspected bot/abuse → loop in security / WAF on-call.
77
+
78
+ ## Post-mortem hook
79
+
80
+ - The "before" hit rate and the "after" hit rate.
81
+ - The top route that lost the hit rate.
82
+ - A representative response header snippet showing `X-Cache`,
83
+ `X-Cache-Profile`, `X-Cache-Reason`.
@@ -0,0 +1,88 @@
1
+ # Runbook: `commerce-upstream-slow`
2
+
3
+ > A site's `commerce_request_duration_ms` p95 exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
4
+
5
+ ## What this alert means
6
+
7
+ Calls out to a commerce provider (VTEX, Shopify, or similar) are
8
+ taking abnormally long for *this* site. Because SSR is synchronous on
9
+ upstream commerce calls, a slow upstream cascades into the user-facing
10
+ `http-latency-spike` alert almost immediately. If both fired together,
11
+ this is the root cause — fix here first.
12
+
13
+ ## First check (60 seconds)
14
+
15
+ Which provider/operation is slow? The same dashboard's "Commerce p95
16
+ by provider/operation" panel breaks it out. Note the
17
+ `provider.operation` string — e.g. `vtex.intelligent-search.product_search`.
18
+
19
+ If a single operation is responsible, jump to "Common causes" #1.
20
+ If multiple operations from the same provider are slow simultaneously,
21
+ that's a provider-wide regression — jump to "Common causes" #2.
22
+
23
+ ## Diagnostic queries
24
+
25
+ ```sql
26
+ -- p95 commerce latency by provider + operation, last hour
27
+ SELECT
28
+ toStartOfInterval(TimeUnix, INTERVAL 5 MINUTE) AS t,
29
+ Attributes['provider'] AS provider,
30
+ Attributes['operation'] AS op,
31
+ quantileBFloat16(0.95)(toFloat64(Sum / nullIf(Count, 0))) AS p95
32
+ FROM otel_metrics_histogram
33
+ WHERE MetricName = 'commerce_request_duration_ms'
34
+ AND ServiceName = '{site}'
35
+ AND TimeUnix > now() - INTERVAL 1 HOUR
36
+ GROUP BY t, provider, op
37
+ ORDER BY t, p95 DESC;
38
+ ```
39
+
40
+ ```sql
41
+ -- Commerce call status distribution — are we getting 5xx from upstream?
42
+ SELECT
43
+ Attributes['provider'] AS provider,
44
+ Attributes['operation'] AS op,
45
+ Attributes['status_class'] AS status_class,
46
+ count() AS n
47
+ FROM otel_metrics_histogram
48
+ WHERE MetricName = 'commerce_request_duration_ms'
49
+ AND ServiceName = '{site}'
50
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
51
+ GROUP BY provider, op, status_class
52
+ ORDER BY n DESC;
53
+ ```
54
+
55
+ ```sql
56
+ -- VTEX SWR cache effectiveness on the slow operation
57
+ SELECT
58
+ Attributes['cached'] AS cached,
59
+ count() AS n,
60
+ avg(toFloat64(Sum / nullIf(Count, 0))) AS avg_ms
61
+ FROM otel_metrics_histogram
62
+ WHERE MetricName = 'commerce_request_duration_ms'
63
+ AND ServiceName = '{site}'
64
+ AND Attributes['operation'] = '<paste operation here>'
65
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
66
+ GROUP BY cached;
67
+ ```
68
+
69
+ ## Common causes & fixes
70
+
71
+ | Rank | Cause | How to confirm | Fix |
72
+ |------|------------------------------------------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
73
+ | 1 | One specific upstream operation is slow | Single `provider.operation` row dominates the p95 query | Check provider status page (status.vtex.com, www.shopifystatus.com). If clean, see if we recently changed payload size or filter on that operation. |
74
+ | 2 | Provider-wide regression | Multiple operations from the same `provider` regressed simultaneously | Public provider status page is usually the source of truth. Open a ticket with the provider citing our timing window. |
75
+ | 3 | VTEX SWR / cachedLoader hit rate dropped | Query 3 shows `cached=false` share rose | Inspect recent loader changes for the affected section. May have invalidated the cache key by changing the loader signature. |
76
+ | 4 | Region-specific (CF colo → upstream latency) | `region` label on the metric isolates one CF colo | Usually transient; CF will rebalance. If sustained, file a CF support ticket. |
77
+
78
+ ## Escalation
79
+
80
+ - Provider-wide regression confirmed → notify the affected customer-facing teams; this is communication-shaped, not engineering-shaped.
81
+ - One operation slow, no provider status incident → page the site team owner for that route.
82
+
83
+ ## Post-mortem hook
84
+
85
+ - The `provider.operation` string and its p95 timeline.
86
+ - The cache (`cached=true/false`) split on that operation.
87
+ - A representative trace from `otel_traces` showing the slow span
88
+ (`SpanName LIKE 'vtex.%'` or `'shopify.%'`).
@@ -0,0 +1,98 @@
1
+ # Runbook: `http-error-spike`
2
+
3
+ > A site's 5xx error rate exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
4
+
5
+ ## What this alert means
6
+
7
+ Real users are getting 5xx responses at a rate that's statistically
8
+ abnormal for this specific site. The alert uses a per-site anomaly band
9
+ (not a fleet-wide threshold) so a site that normally runs at 0.3% 5xx
10
+ fires for spikes other sites wouldn't notice — and a site that normally
11
+ runs at 4% (a known-noisy legacy storefront) doesn't false-positive at
12
+ 4.1%.
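
The per-site band can be pictured with a small calculation. The sketch below only illustrates the 3σ idea from the alert description: the actual rule lives in the provisioned alert templates, which are not shown here, and `exceedsBaseline` and the sample rates are made up. It also leaves out the "for ≥ 10 minutes" condition.

```typescript
// Hypothetical helper: true when `current` is more than `sigmas` standard
// deviations above the mean of the baseline samples.
function exceedsBaseline(baseline: number[], current: number, sigmas = 3): boolean {
  const mean = baseline.reduce((sum, x) => sum + x, 0) / baseline.length;
  const variance =
    baseline.reduce((sum, x) => sum + (x - mean) ** 2, 0) / baseline.length;
  return current > mean + sigmas * Math.sqrt(variance);
}

// 5xx rates sampled over the baseline window (made-up numbers).
const quietSite = [0.003, 0.002, 0.004, 0.003, 0.003]; // around 0.3%
const noisySite = [0.04, 0.035, 0.045, 0.038, 0.042]; // around 4%

const quietFires = exceedsBaseline(quietSite, 0.01); // true: 1% is far above this site's band
const noisyFires = exceedsBaseline(noisySite, 0.041); // false: 4.1% is inside this site's band
console.log(quietFires, noisyFires);
```

The same absolute error rate can therefore be an incident for one site and background noise for another.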
13
+
14
+ ## First check (60 seconds)
15
+
16
+ Look at the **commerce upstream p95** panel on the same dashboard. If
17
+ that spiked at the same moment, the root cause is almost always an
18
+ upstream commerce API regressing. Stop here, jump to
19
+ [`commerce-upstream-slow.md`](./commerce-upstream-slow.md).
20
+
21
+ If commerce p95 is flat, the 5xx is internal — proceed below.
22
+
23
+ ## Diagnostic queries
24
+
25
+ Paste into ClickStack or a Grafana Explore panel pointed at the
26
+ ClickHouse datasource.
27
+
28
+ ```sql
29
+ -- Top error routes for this site, last 30 minutes
30
+ SELECT
31
+ Attributes['route_pattern'] AS route,
32
+ countIf(Attributes['status_class'] = '5xx') AS errors,
33
+ count() AS total,
34
+ errors / total AS rate
35
+ FROM otel_metrics_sum
36
+ WHERE MetricName = 'http_requests_total'
37
+ AND ServiceName = '{site}'
38
+ AND TimeUnix > now() - INTERVAL 30 MINUTE
39
+ GROUP BY route
40
+ HAVING errors > 0
41
+ ORDER BY rate DESC
42
+ LIMIT 20;
43
+ ```
44
+
45
+ ```sql
46
+ -- Recent exceptions captured by the tail worker
47
+ SELECT Timestamp, Body, LogAttributes['url.path'] AS path, LogAttributes['http.response.status_code'] AS status
48
+ FROM otel_logs
49
+ WHERE ServiceName = '{site}'
50
+ AND SeverityText = 'ERROR'
51
+ AND LogAttributes['_source'] = 'tail-worker'
52
+ AND LogAttributes['_outcome'] = 'exception'
53
+ AND Timestamp > now() - INTERVAL 30 MINUTE
54
+ ORDER BY Timestamp DESC
55
+ LIMIT 100;
56
+ ```
57
+
58
+ ```sql
59
+ -- Did a deploy correlate? List versions seen in the last hour
60
+ SELECT
61
+ ResourceAttributes['service.version'] AS version,
62
+ min(Timestamp) AS first_seen,
63
+ max(Timestamp) AS last_seen,
64
+ count() AS log_count
65
+ FROM otel_logs
66
+ WHERE ServiceName = '{site}'
67
+ AND Timestamp > now() - INTERVAL 1 HOUR
68
+ GROUP BY version
69
+ ORDER BY first_seen DESC;
70
+ ```
71
+
72
+ ## Common causes & fixes
73
+
74
+ | Rank | Cause | How to confirm | Fix |
75
+ |------|----------------------------------------------------|-------------------------------------------------------------|---------------------------------------------------------------------|
76
+ | 1 | A recent deploy regressed | The deploy-correlation query above (third query) shows a `service.version` that flipped just before the spike | Roll back via Cloudflare dashboard `Deployments → Rollback`. Confirm via a fresh `service.version` line in the next 5m. |
77
+ | 2 | A specific route is broken (one bad section) | Top error routes query shows one `route_pattern` at 100% error rate | Check the recent commits to that section. Roll back or `Lazy` wrap it for graceful degradation. |
78
+ | 3 | Upstream cache layer evicted; cold-cache thundering herd | `cache_miss_total` for the same window spikes proportionally to errors | Wait it out — usually self-heals in 5m. If sustained, check that `staleTime` is set correctly on cmsRouteConfig. |
79
+ | 4 | Origin (commerce API) returning 5xx | `commerce_request_duration_ms` spike OR commerce logs | See [`commerce-upstream-slow.md`](./commerce-upstream-slow.md). |
80
+
81
+ ## Escalation
82
+
83
+ - **Site team owner** if a fix isn't obvious in 15 minutes (Slack
84
+ `#deco-platform`).
85
+ - **Cloudflare support** if all sites in a region are affected
86
+ simultaneously (look at the `region` label on the metrics) — this
87
+ has happened during CF colo incidents.
88
+
89
+ ## Post-mortem hook
90
+
91
+ Capture before the alert clears:
92
+ - `request.id` of one failing request (from the response header
93
+ `X-Request-Id` of a manually-reproduced 5xx).
94
+ - A representative tail-worker log row with stack trace.
95
+ - The deploy `service.version` window during the spike.
96
+
97
+ Stash them in the incident ticket so the post-mortem has the
98
+ correlation IDs it needs.