@decocms/start 6.1.0 → 6.2.1
- package/.agents/skills/deco-migrate-script/SKILL.md +4 -3
- package/.cursor/skills/deco-apps-vtex-porting/cookie-auth-patterns.md +13 -0
- package/.cursor/skills/deco-apps-vtex-review/SKILL.md +15 -0
- package/.cursor/skills/deco-server-functions-invoke/troubleshooting.md +22 -0
- package/docs/rum-plan.md +209 -0
- package/docs/runbooks/README.md +40 -0
- package/docs/runbooks/cache-hit-drop.md +83 -0
- package/docs/runbooks/commerce-upstream-slow.md +88 -0
- package/docs/runbooks/http-error-spike.md +98 -0
- package/docs/runbooks/http-latency-spike.md +82 -0
- package/docs/runbooks/tail-exception-spike.md +100 -0
- package/package.json +1 -1
- package/scripts/audit-observability-config.test.ts +251 -1
- package/scripts/audit-observability-config.ts +227 -26
- package/scripts/migrate/post-cleanup/rules.ts +90 -0
- package/scripts/migrate/post-cleanup/runner.test.ts +103 -0
- package/scripts/migrate.ts +13 -0
- package/src/admin/invoke.test.ts +141 -0
- package/src/admin/invoke.ts +47 -14
|
@@ -293,9 +293,10 @@ Generates `MIGRATION_REPORT.md` with:
|
|
|
293
293
|
### Phase 7: Bootstrap
|
|
294
294
|
|
|
295
295
|
Runs automatically after all phases (skipped in `--dry-run`):
|
|
296
|
-
1. `
|
|
297
|
-
2. `
|
|
298
|
-
3. `
|
|
296
|
+
1. `bun install`
|
|
297
|
+
2. `bunx tsx node_modules/@decocms/start/scripts/generate-blocks.ts`
|
|
298
|
+
3. `bunx tsx node_modules/@decocms/start/scripts/generate-invoke.ts` — emits `src/server/invoke.gen.ts` (top-level `createServerFn` declarations for every VTEX action, plus the `forwardResponseCookies()` Set-Cookie bridge). Without this step the site falls back to the `/deco/invoke/...` proxy and the cart breaks at `/checkout` after addItemToCart. See `.cursor/skills/deco-server-functions-invoke/troubleshooting.md` ("Cart 'forgets' items between requests") for the failure mode.
|
|
299
|
+
4. `bunx tsr generate`
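
The ordering of the four bootstrap steps can be sketched as follows. This is an illustration only, not the code in `scripts/migrate.ts`: `BOOTSTRAP_STEPS` and `runBootstrap` are hypothetical names, and the command runner is injected so the sketch runs without spawning processes.

```typescript
// Hypothetical sketch: the four bootstrap commands in the order listed above.
const BOOTSTRAP_STEPS: readonly string[] = [
  "bun install",
  "bunx tsx node_modules/@decocms/start/scripts/generate-blocks.ts",
  "bunx tsx node_modules/@decocms/start/scripts/generate-invoke.ts",
  "bunx tsr generate",
];

// Runs every step in order through the injected runner.
// In dry-run mode nothing is executed, mirroring "skipped in --dry-run".
function runBootstrap(
  run: (command: string) => void,
  dryRun: boolean = false,
): string[] {
  if (dryRun) return [];
  const executed: string[] = [];
  for (const step of BOOTSTRAP_STEPS) {
    run(step);
    executed.push(step);
  }
  return executed;
}

// Usage with a recording runner instead of a real shell.
const recorded: string[] = [];
const executedSteps = runBootstrap((command) => recorded.push(command));
const skippedSteps = runBootstrap(() => {}, true);
```

A real runner would stop on the first failing command; that behaviour is left out here.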
|
|
299
300
|
|
|
300
301
|
### Phase 8: Compile
|
|
301
302
|
|
|
@@ -31,6 +31,19 @@ export async function vtexFetchWithCookies<T>(
|
|
|
31
31
|
|
|
32
32
|
Use in: `checkout.ts`, `auth.ts`, `session.ts` (create/edit).
|
|
33
33
|
|
|
34
|
+
### Pitfall: never copy with `Headers.entries()` or `forEach`
|
|
35
|
+
|
|
36
|
+
When a caller pushes the captured cookies onto the request scope's response headers (e.g. `RequestContext.responseHeaders.append("set-cookie", c)`), the eventual HTTP-response bridge must read them back with `Headers.getSetCookie()`. **Do not** iterate `headers.entries()` (or `forEach`) and re-append, because both collapse multiple `Set-Cookie` values into a single comma-joined string, which browsers silently discard. The result: the cart appears empty after addItemToCart even though the action returned an OrderForm with items.
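
A minimal TypeScript sketch of the rule above, assuming a runtime that implements `Headers.getSetCookie()` (recent Node.js releases do). `copySetCookies` is a hypothetical name used for illustration, not the framework's helper. How header iteration treats `Set-Cookie` varies between runtimes, so the sketch only contrasts `get()` with `getSetCookie()`.

```typescript
// Hypothetical helper: copy every Set-Cookie value from `source` onto
// `target` as its own header, never as a joined string.
function copySetCookies(source: Headers, target: Headers): void {
  // getSetCookie() returns one array entry per Set-Cookie header.
  for (const cookie of source.getSetCookie()) {
    target.append("set-cookie", cookie);
  }
}

// Two cookies captured from an upstream response (values are made up).
const captured = new Headers();
captured.append("set-cookie", "checkout.vtex.com__orderFormId=abc; Path=/");
captured.append("set-cookie", "vtex_session=xyz; Path=/; HttpOnly");

const outgoing = new Headers();
copySetCookies(captured, outgoing);

// get() exposes the comma-joined form that must never be re-emitted.
const joined = captured.get("set-cookie");
// getSetCookie() on the outgoing headers still yields two separate values.
const separate = outgoing.getSetCookie();
```

The design point is that the helper never touches a joined string at all, so no later splitting step is needed.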
|
|
37
|
+
|
|
38
|
+
Two bridges where this rule applies in a TanStack Start site:
|
|
39
|
+
|
|
40
|
+
- `src/server/invoke.gen.ts` — the auto-generated `forwardResponseCookies()` already uses `getSetCookie()`. Re-run `bunx tsx node_modules/@decocms/start/scripts/generate-invoke.ts` if the file is missing or stale.
|
|
41
|
+
- `@decocms/start/src/admin/invoke.ts` — `forwardCtxHeadersTo()` does the same for the `/deco/invoke/...` HTTP path. Versions ≥ 5.0.0 ship the fix.
|
|
42
|
+
|
|
43
|
+
### Client-side `setOrderFormIdCookie` is defense-in-depth, not the fix
|
|
44
|
+
|
|
45
|
+
Some migrated `useCart` hooks (e.g. miess-01-tanstack) manually call `document.cookie = "checkout.vtex.com__orderFormId=..."` after each cart action. That only patches one cookie — `segment`, `sc`, `vtex_session` etc. are still dropped. Keep the workaround if you like (cheap, idempotent), but the real fix is the two server-side bridges above. Once both are in place, the manual cookie write can be removed without regressing the cart.
|
|
46
|
+
|
|
34
47
|
## buildAuthCookieHeader
|
|
35
48
|
|
|
36
49
|
VTEX IO GraphQL at `{account}.myvtex.com` requires both cookie names:
|
|
@@ -63,6 +63,21 @@ const result = await vtexFetchWithCookies<OrderForm>(url, opts);
|
|
|
63
63
|
|
|
64
64
|
**Where NOT needed**: Read-only loaders, GraphQL queries.
|
|
65
65
|
|
|
66
|
+
**Server→browser bridge** — the cookies that `vtexFetchWithCookies` captures must be forwarded onto the outgoing HTTP response, or the browser never sees them and the cart appears empty on the next request. There are two bridge points in a TanStack Start site, and both must be wired:
|
|
67
|
+
|
|
68
|
+
1. **`src/server/invoke.gen.ts`** (TanStack RPC path). Generated by `bunx tsx node_modules/@decocms/start/scripts/generate-invoke.ts`. Audit:
|
|
69
|
+
- File exists?
|
|
70
|
+
- Contains `function forwardResponseCookies()`?
|
|
71
|
+
- Every action handler calls `forwardResponseCookies()` after the `await`?
|
|
72
|
+
|
|
73
|
+
If any answer is "no", regenerate with the script above. Then make sure `useCart`, `useUser`, `useWishlist` import `invoke` from `~/server/invoke.gen` (or a barrel that re-exports it), not from the proxy `~/runtime.ts`.
|
|
74
|
+
|
|
75
|
+
2. **`@decocms/start/src/admin/invoke.ts`** (`/deco/invoke/...` HTTP path). The framework's single and batch invoke handlers must use `Headers.getSetCookie()` (not `entries()`) when copying `RequestContext.responseHeaders` onto the response. Pin to a version ≥ 5.0.0 that ships `forwardCtxHeadersTo`.
|
|
76
|
+
|
|
77
|
+
The historical failure mode: a `for…of headers.entries()` loop collapsed N `Set-Cookie` values into one comma-joined string, which browsers silently discard. Every VTEX cart action returns 3–5 cookies (`checkout.vtex.com__orderFormId`, `segment`, `sc`, `vtex_session`…), so even one collapse breaks the entire cart flow.
|
|
78
|
+
|
|
79
|
+
**Quick diagnosis**: Add an item, watch DevTools → Network → the cart-action response should have **multiple** `Set-Cookie:` rows, not one comma-joined line.
|
|
80
|
+
|
|
66
81
|
### 2. Auth Cookie Headers
|
|
67
82
|
|
|
68
83
|
All authenticated VTEX IO GraphQL calls need both cookie variants:
|
|
@@ -1,5 +1,27 @@
|
|
|
1
1
|
# Troubleshooting
|
|
2
2
|
|
|
3
|
+
## Cart "forgets" items between requests / /checkout opens empty after addItemToCart
|
|
4
|
+
|
|
5
|
+
**Symptom**: `invoke.vtex.actions.addItemsToCart(...)` succeeds (returns an `OrderForm` with items), but the next page load — or clicking the cart icon — shows an empty cart, and `/checkout` lands on a fresh empty order. Sometimes the orderFormId is different from the one just returned.
|
|
6
|
+
|
|
7
|
+
**Root cause**: The VTEX cart cookies (`checkout.vtex.com__orderFormId`, `segment`, `sc`, `vtex_session`) never reach the browser because, somewhere in the chain, multiple `Set-Cookie` headers were collapsed into a single comma-joined string. Browsers silently discard malformed `Set-Cookie` values, so every subsequent request reaches VTEX without the cart and session cookies and gets a new, empty orderForm.
|
|
8
|
+
|
|
9
|
+
**Two places this can break**:
|
|
10
|
+
|
|
11
|
+
1. **`src/server/invoke.gen.ts` missing or stale**. This is the TanStack RPC path. Each action must call `forwardResponseCookies()` after awaiting the underlying VTEX call. The helper uses `Headers.getSetCookie()` (not `entries()`!) to read the un-collapsed list and writes each value to TanStack's response via `setResponseHeader("set-cookie", [...])`. If the file doesn't exist, the site falls back to the `/deco/invoke/...` proxy.
|
|
12
|
+
|
|
13
|
+
2. **`/deco/invoke/...` HTTP proxy** (`~/runtime.ts` pattern). The framework's admin handler (`@decocms/start/src/admin/invoke.ts`) used to iterate `RequestContext.responseHeaders.entries()` which collapses Set-Cookie. Fixed in @decocms/start ≥ 5.0.0 by switching to `getSetCookie()` + a `forwardCtxHeadersTo()` helper applied on both single and batch paths.
|
|
14
|
+
|
|
15
|
+
**Diagnosis**: Open DevTools → Network → response to the cart action. You should see **multiple distinct** `Set-Cookie:` rows. If you see a single `Set-Cookie: foo=1, bar=2; Path=/, baz=3` line, that's the collapse bug.
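
The visual check above can be approximated in code. The sketch below is a heuristic, not part of the framework: `splitCollapsedSetCookie` and `looksCollapsed` are hypothetical names, and the split only fires at commas that are followed by a new `name=` pair, so the comma inside an `Expires` date is left alone.

```typescript
// Hypothetical diagnostic: split one header value at commas that start a
// new `name=value` pair. Commas inside attribute values (for example the
// date in `Expires=Wed, 21 Oct ...`) are not followed by `name=` and stay.
function splitCollapsedSetCookie(value: string): string[] {
  return value.split(/,\s*(?=[^;,=\s]+=)/).map((part) => part.trim());
}

// A value is suspicious when it contains more than one cookie.
function looksCollapsed(value: string): boolean {
  return splitCollapsedSetCookie(value).length > 1;
}

// The collapsed line quoted in the diagnosis step.
const collapsedParts = splitCollapsedSetCookie("foo=1, bar=2; Path=/, baz=3");
// A single well-formed cookie whose Expires attribute contains a comma.
const singleIsCollapsed = looksCollapsed(
  "a=1; Expires=Wed, 21 Oct 2015 07:28:00 GMT; Path=/",
);
```

Because cookie values may legally contain text that resembles `, name=`, treat a positive result as a prompt to inspect the response, not as proof.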
|
|
16
|
+
|
|
17
|
+
**Fix**:
|
|
18
|
+
1. Upgrade `@decocms/start` to the version with `forwardCtxHeadersTo` in `src/admin/invoke.ts` (search the file — both single and batch handlers should call it).
|
|
19
|
+
2. Run `bunx tsx node_modules/@decocms/start/scripts/generate-invoke.ts` to regenerate `src/server/invoke.gen.ts`. Verify it has `function forwardResponseCookies()` and that every emitted handler calls it.
|
|
20
|
+
3. Make sure `useCart` (and other VTEX hooks) imports `invoke` from `~/server/invoke.gen` (or a barrel re-export of it), not from `~/runtime`.
|
|
21
|
+
4. The migration script (`scripts/migrate.ts` bootstrap) runs `generate-invoke.ts` automatically on freshly-migrated sites — if a site was migrated before that, run the generator manually.
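
The verification in step 2 can be sketched as a text check over the generated file. This assumes the helper is emitted as a plain `function forwardResponseCookies(` declaration, as the wording above suggests; `auditInvokeGen` and the sample source are hypothetical and do not reflect the generator's real output.

```typescript
interface InvokeGenAudit {
  hasHelper: boolean; // the helper declaration is present
  helperCalls: number; // call sites, excluding the declaration itself
}

// Hypothetical audit over the text of src/server/invoke.gen.ts.
function auditInvokeGen(source: string): InvokeGenAudit {
  const declarations =
    source.match(/function\s+forwardResponseCookies\s*\(/g) ?? [];
  const mentions = source.match(/forwardResponseCookies\s*\(/g) ?? [];
  return {
    hasHelper: declarations.length > 0,
    helperCalls: mentions.length - declarations.length,
  };
}

// Made-up stand-in for a generated file with one action handler.
const sampleSource = [
  "function forwardResponseCookies() {}",
  "export const addItem = async () => {",
  "  const result = await call();",
  "  forwardResponseCookies();",
  "  return result;",
  "};",
].join("\n");

const healthy = auditInvokeGen(sampleSource);
const missing = auditInvokeGen("export const addItem = async () => call();");
```

Comparing `helperCalls` with the number of emitted handlers is left to the reader, since the handler shape is not specified here.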
|
|
22
|
+
|
|
23
|
+
**Client-side workaround** (defense-in-depth, removable): some sites manually set `document.cookie = "checkout.vtex.com__orderFormId=..."` inside `useCart`. That patches only one cookie of several. With the server-side fix in place, the workaround is harmless but no longer load-bearing; see `~/conductor/workspaces/miess-01-tanstack/newport-beach/src/hooks/useCart.ts` for an example.
|
|
24
|
+
|
|
3
25
|
## CORS Error on Add to Cart / Checkout
|
|
4
26
|
|
|
5
27
|
**Symptom**: Browser console shows CORS error when calling VTEX API directly.
|
package/docs/rum-plan.md
ADDED
|
@@ -0,0 +1,209 @@
|
|
|
1
|
+
# RUM (Real-User Monitoring) — Plan
|
|
2
|
+
|
|
3
|
+
> Sibling deliverable to [`observability_refinement_plan_4fa41548.plan.md`](../../../.cursor/plans/observability_refinement_plan_4fa41548.plan.md).
|
|
4
|
+
> The refinement plan covers server-side observability end-to-end. This
|
|
5
|
+
> plan covers what runs **in the browser** — Core Web Vitals, JS errors,
|
|
6
|
+
> long tasks, resource timing, custom user-journey events.
|
|
7
|
+
|
|
8
|
+
## Why this is a separate plan
|
|
9
|
+
|
|
10
|
+
Server-side telemetry tells us what our Workers did. RUM tells us what
|
|
11
|
+
the **user actually experienced** — including everything outside our
|
|
12
|
+
edge (DNS, TLS handshake, third-party scripts, the user's CPU, their
|
|
13
|
+
flaky LTE link, their ad-blocker). The two answer different questions:
|
|
14
|
+
|
|
15
|
+
| Question | Answered by |
|
|
16
|
+
|---|---|
|
|
17
|
+
| "Did we serve the request?" | server outcomes (Phase 2 of refinement plan) |
|
|
18
|
+
| "Was the user able to read the page?" | RUM (this plan) |
|
|
19
|
+
| "Did our deploy regress LCP on iOS Safari?" | RUM (this plan) |
|
|
20
|
+
| "Why did checkout abandon at 73%?" | RUM + server outcomes joined |
|
|
21
|
+
|
|
22
|
+
Today the answer to every RUM question is "we don't know." The plan
|
|
23
|
+
puts a floor on that.
|
|
24
|
+
|
|
25
|
+
## Scope tiers — the decision you're making
|
|
26
|
+
|
|
27
|
+
The size of this plan changes by an order of magnitude depending on
|
|
28
|
+
scope. Three defensible tiers below; the work is **strictly additive**
|
|
29
|
+
between them so we can commit to Tier 1, run it for a quarter, and
|
|
30
|
+
upgrade to Tier 2 or 3 only if the data we collect surfaces a need.
|
|
31
|
+
|
|
32
|
+
### Tier 1 — Core Web Vitals + JS errors (recommended for v1)
|
|
33
|
+
|
|
34
|
+
**What's collected:**
|
|
35
|
+
- LCP (Largest Contentful Paint), CLS (Cumulative Layout Shift),
|
|
36
|
+
INP (Interaction to Next Paint), FCP, TTFB — via the standard
|
|
37
|
+
[`web-vitals`](https://github.com/GoogleChrome/web-vitals) library.
|
|
38
|
+
- `window.onerror` + `window.onunhandledrejection` — uncaught JS errors
|
|
39
|
+
with stack, source URL, line/col, user agent, route pattern, deploy
|
|
40
|
+
id, `request.id` (same one the server stamped in Phase 1, joinable to
|
|
41
|
+
`otel_logs` and `otel_traces`).
|
|
42
|
+
- Page-context attributes: `route_pattern`, `service.version`,
|
|
43
|
+
`service.name`, `deployment.environment`, viewport, connection type
|
|
44
|
+
(`navigator.connection.effectiveType`), `cf-ray`.
|
|
45
|
+
|
|
46
|
+
**What's NOT collected:**
|
|
47
|
+
- Session replay.
|
|
48
|
+
- Custom user-journey events (add-to-cart, scroll-to-fold, etc.).
|
|
49
|
+
- Resource timing for every asset.
|
|
50
|
+
- User interaction heatmaps.
|
|
51
|
+
|
|
52
|
+
**Implementation footprint:** ~3 dev-weeks.
|
|
53
|
+
- `@decocms/start/sdk/rum.ts` — single browser-side module bundled into
|
|
54
|
+
the site entry. Reads `web-vitals` (peer dep, ~3KB gzipped), batches
|
|
55
|
+
events, sends to `/__deco/rum` on the same origin (no CORS / no
|
|
56
|
+
third-party script).
|
|
57
|
+
- `cmsRoute.ts` already serves a worker; add a `/__deco/rum` handler
|
|
58
|
+
that validates the payload, redacts referrer/URL via the shared PII
|
|
59
|
+
library (Phase 1), and forwards to the OTLP HTTP endpoint as
|
|
60
|
+
`otel_logs` with `SeverityText="INFO"` and `LogAttributes.rum.*`.
|
|
61
|
+
- ClickHouse rows land in the existing `otel_logs` table — no new
|
|
62
|
+
schema, no new pipeline. The Tier 1 query "p75 LCP per route per
|
|
63
|
+
site, last 7 days" is a single SQL join on the existing tables.
|
|
64
|
+
- One Grafana dashboard template added to
|
|
65
|
+
`stats-lake/observability/dashboards/templates/site-rum.json`. Auto-
|
|
66
|
+
provisioned via the existing Phase 5 script — `--with-rum` flag
|
|
67
|
+
mirrors `--with-alerts`.
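
The "batches events, sends to `/__deco/rum`" behaviour described above could look roughly like the sketch below. Since this is a plan and no `rum.ts` exists yet, every name here (`RumEvent`, `RumBatcher`, the batch size of 10) is an assumption. The transport is injected so the sketch runs outside a browser.

```typescript
// Assumed event shape; the plan does not fix a payload format.
interface RumEvent {
  name: string; // e.g. "LCP"
  value: number;
  route_pattern: string;
}

type Transport = (url: string, body: string) => void;

// Hypothetical batcher: queue events and send them in one request.
class RumBatcher {
  private queue: RumEvent[] = [];
  private readonly send: Transport;
  private readonly maxBatch: number;
  private readonly endpoint: string;

  constructor(send: Transport, maxBatch: number = 10, endpoint: string = "/__deco/rum") {
    this.send = send;
    this.maxBatch = maxBatch;
    this.endpoint = endpoint;
  }

  // Queue one event and flush automatically once the batch is full.
  push(event: RumEvent): void {
    this.queue.push(event);
    if (this.queue.length >= this.maxBatch) this.flush();
  }

  // Send whatever is queued; a no-op when the queue is empty.
  flush(): void {
    if (this.queue.length === 0) return;
    const body = JSON.stringify({ events: this.queue });
    this.queue = [];
    this.send(this.endpoint, body);
  }
}

// Usage with a recording transport and a batch size of 2.
const sentBodies: string[] = [];
const batcher = new RumBatcher((_url, body) => sentBodies.push(body), 2);
batcher.push({ name: "LCP", value: 1800, route_pattern: "/p/:slug" });
batcher.push({ name: "CLS", value: 0.02, route_pattern: "/p/:slug" });
batcher.push({ name: "INP", value: 120, route_pattern: "/p/:slug" });
batcher.flush();
```

In a browser the transport would typically wrap `navigator.sendBeacon`, with `flush()` called when the page is hidden so the last partial batch is not lost.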
|
|
68
|
+
|
|
69
|
+
**Cost:** marginal. One row per pageview per metric (~5 rows / pageview)
|
|
70
|
+
adds < 5% to existing log volume. Within the cost guardrail dashboard
|
|
71
|
+
(Phase 6) headroom.
|
|
72
|
+
|
|
73
|
+
**Risks:**
|
|
74
|
+
- INP requires the modern API; falls back to FID on older browsers.
|
|
75
|
+
Reported separately so the metric isn't muddied.
|
|
76
|
+
- `web-vitals` runs ~50ms of JS on first input; teams that obsess over
|
|
77
|
+
shaving milliseconds will want a build flag to disable it. Ship one
|
|
78
|
+
off the bat.
|
|
79
|
+
|
|
80
|
+
### Tier 2 — Tier 1 + custom user-journey events + resource timing
|
|
81
|
+
|
|
82
|
+
**What's added:**
|
|
83
|
+
- A typed `rum.track(name, attributes)` API exposed from
|
|
84
|
+
`@decocms/start/sdk/rum.ts`. Sites call
|
|
85
|
+
`rum.track('add_to_cart', { sku, price, currency })` and the event
|
|
86
|
+
flows through the same `/__deco/rum` endpoint into `otel_logs` with a
|
|
87
|
+
reserved attribute namespace (`rum.event.*`).
|
|
88
|
+
- Resource Timing API rollup: for each pageview, total resource bytes,
|
|
89
|
+
count by content-type, slowest 5 URLs (with paths redacted). Helps
|
|
90
|
+
diagnose when a third-party tag has gone bad.
|
|
91
|
+
- A `rum.identify(userIdHash)` call that lets sites cohort by logged-in
|
|
92
|
+
user without sending PII (the site hashes the user id before passing
|
|
93
|
+
it in; we never see plaintext).
|
|
94
|
+
|
|
95
|
+
**Implementation footprint:** ~6 additional dev-weeks on top of Tier 1.
|
|
96
|
+
- The framework-side API + types: ~2 weeks.
|
|
97
|
+
- Resource Timing payload shape + redaction: ~1 week.
|
|
98
|
+
- Documentation, codemod fixtures, and an audit rule that enforces
|
|
99
|
+
the redaction helpers stay in use: ~3 weeks.
|
|
100
|
+
- A second Grafana dashboard (`site-rum-events.json`) that pivots on
|
|
101
|
+
`rum.event.*`.
|
|
102
|
+
|
|
103
|
+
**Risks:**
|
|
104
|
+
- Cardinality explosion: a site that emits `track('view_product', { id })`
|
|
105
|
+
with the product id as a label creates one time-series per SKU. The
|
|
106
|
+
attribute system has to **enforce** id-as-attribute / id-not-as-label
|
|
107
|
+
via the type system + a runtime check in the framework. Doable but
|
|
108
|
+
it's the new piece that has to be designed right.
|
|
109
|
+
- Custom events drift across sites unless we standardize a vocabulary.
|
|
110
|
+
Recommend shipping a small reserved-name list (`add_to_cart`,
|
|
111
|
+
`begin_checkout`, `purchase`, `view_product`) so fleet-wide dashboards
|
|
112
|
+
can roll up "conversion funnel" without per-site cooperation.
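
One way to read the two risks above in code: keep event names to a small fixed vocabulary and push identifiers into attributes under `rum.event.*`. The sketch is a design illustration under stated assumptions, not the planned API: `isAllowedEventName`, `buildEvent`, and the `custom.` prefix for site-specific names are hypothetical.

```typescript
// Reserved names taken from the list above.
const RESERVED_EVENTS = [
  "add_to_cart",
  "begin_checkout",
  "purchase",
  "view_product",
] as const;

type AttributeValue = string | number | boolean;

// Runtime check: a name is either reserved or `custom.` plus lowercase
// letters and underscores. Digits are rejected so ids cannot leak into names.
function isAllowedEventName(name: string): boolean {
  return (
    (RESERVED_EVENTS as readonly string[]).includes(name) ||
    /^custom\.[a-z_]+$/.test(name)
  );
}

// Build an event whose attributes live in the reserved namespace.
function buildEvent(name: string, attributes: Record<string, AttributeValue>) {
  if (!isAllowedEventName(name)) {
    throw new Error(`event name not in vocabulary: ${name}`);
  }
  const namespaced: Record<string, AttributeValue> = {};
  for (const [key, value] of Object.entries(attributes)) {
    namespaced[`rum.event.${key}`] = value;
  }
  return { name, attributes: namespaced };
}

// The product id travels as an attribute, not as part of the name.
const viewEvent = buildEvent("view_product", { id: "sku-123" });

let rejected = false;
try {
  buildEvent("view_product_123", {});
} catch {
  rejected = true;
}
```

A compile-time layer (a union type over the reserved names) would complement this runtime check; it is omitted to keep the sketch short.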
|
|
113
|
+
|
|
114
|
+
### Tier 3 — Tier 2 + session replay + interaction heatmaps
|
|
115
|
+
|
|
116
|
+
**What's added:**
|
|
117
|
+
- Session replay: capture every DOM mutation + every user input as a
|
|
118
|
+
delta-encoded stream, replay it as a video in HyperDX / a custom
|
|
119
|
+
viewer. The canonical OSS implementation is
|
|
120
|
+
[`rrweb`](https://github.com/rrweb-io/rrweb).
|
|
121
|
+
- Interaction heatmaps: aggregate click positions over a page within a
|
|
122
|
+
given time window.
|
|
123
|
+
|
|
124
|
+
**Implementation footprint:** months.
|
|
125
|
+
- ~30KB gzipped of `rrweb` on every pageview — measurable LCP impact.
|
|
126
|
+
Mitigation: lazy-load after first interaction; some events miss the
|
|
127
|
+
first paint.
|
|
128
|
+
- The replay payload is enormous (~100KB/min/session compressed).
|
|
129
|
+
Multiplied by realistic session counts this is a 10-100× ingest
|
|
130
|
+
blowup vs Tier 1. Replay storage is the new bottleneck, not the
|
|
131
|
+
schema; we'd need an R2-backed cold tier separate from ClickHouse.
|
|
132
|
+
- A privacy review needs to ship before the first byte of replay
|
|
133
|
+
flows. PII redaction has to happen client-side before the network —
|
|
134
|
+
rrweb's "mask all inputs" mode is the floor; password fields, credit
|
|
135
|
+
card numbers, and authenticated user-data attributes need explicit
|
|
136
|
+
privacy classes.
|
|
137
|
+
|
|
138
|
+
**Risks:**
|
|
139
|
+
- Privacy: replay is a high-magnification footgun. One regression in
|
|
140
|
+
the mask config and we've recorded a user's credit card. The whole
|
|
141
|
+
payment-flow page must be force-masked at the framework level, not
|
|
142
|
+
opt-in per site.
|
|
143
|
+
- Cost: 10-100× the ingest of Tier 1 even with aggressive sampling.
|
|
144
|
+
The Phase 6 cost-guardrail dashboard will trip; needs a new tier of
|
|
145
|
+
retention policies (replay rows expire at 30d, not 90d like logs).
|
|
146
|
+
|
|
147
|
+
## Out of scope (regardless of tier)
|
|
148
|
+
|
|
149
|
+
- **Synthetic monitoring.** Lighthouse runs against canonical journeys
|
|
150
|
+
on a cron. Covers a different need — "would a clean browser have
|
|
151
|
+
hit our SLO?" rather than "what did real users see?". A separate
|
|
152
|
+
initiative if we want it.
|
|
153
|
+
- **Heatmaps for individual users.** Aggregate heatmaps only; never
|
|
154
|
+
identifiable.
|
|
155
|
+
- **A/B-test attribution.** Sites that want it can pass an experiment
|
|
156
|
+
cohort id through `rum.identify` — we don't ship the experiment
|
|
157
|
+
framework itself.
|
|
158
|
+
|
|
159
|
+
## Recommended sequencing
|
|
160
|
+
|
|
161
|
+
Ship Tier 1 in one PR. Live with the data for a quarter. If during
|
|
162
|
+
that quarter the question "but what did the user actually click before
|
|
163
|
+
they bounced?" comes up more than twice, plan and ship Tier 2. If
|
|
164
|
+
during *that* quarter we hit a category of bug that can only be
|
|
165
|
+
diagnosed by replay (so far: 0), plan Tier 3.
|
|
166
|
+
|
|
167
|
+
**Anti-recommendation:** do not commit to Tier 3 up front. Replay is
|
|
168
|
+
the highest-cost, highest-privacy-risk piece of the whole observability
|
|
169
|
+
surface, and the bug categories that genuinely need it are rare.
|
|
170
|
+
|
|
171
|
+
## Decision points
|
|
172
|
+
|
|
173
|
+
These mirror the main plan's structure — answer once, then this
|
|
174
|
+
document gets a follow-up PR turning answers into TODOs.
|
|
175
|
+
|
|
176
|
+
1. **Tier selection.** Tier 1 only / Tier 1 + Tier 2 / Full Tier 3.
|
|
177
|
+
2. **Identify hashing.** If we're shipping Tier 2, do sites pass in a
|
|
178
|
+
client-side hash, or do we accept plaintext IDs and hash at the
|
|
179
|
+
ingest worker? (Recommend client-side hash — keeps plaintext out of
|
|
180
|
+
our pipeline entirely.)
|
|
181
|
+
3. **Sampling.** RUM events are cheap per-row but high-volume. Default
|
|
182
|
+
sample rate: 100% (Tier 1), 100% (Tier 2 events), 10% (Tier 3
|
|
183
|
+
replay). Confirm or revise per tier.
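
The default sample rates in item 3 can be expressed as a small lookup. The stream names and `shouldSample` are hypothetical; the roll is passed in so the decision is easy to test, and a caller would normally supply `Math.random()`.

```typescript
// Default sample rates proposed above, keyed by an assumed stream name.
const SAMPLE_RATES = {
  vitals: 1.0, // Tier 1
  events: 1.0, // Tier 2 events
  replay: 0.1, // Tier 3 replay
} as const;

type RumStream = keyof typeof SAMPLE_RATES;

// Keep the event when the roll (a number in [0, 1)) falls below the rate.
function shouldSample(stream: RumStream, roll: number): boolean {
  return roll < SAMPLE_RATES[stream];
}

const keepsVitals = shouldSample("vitals", 0.999);
const keepsReplayLowRoll = shouldSample("replay", 0.05);
const keepsReplayHighRoll = shouldSample("replay", 0.5);
```

For replay, deciding once per session instead of once per event would keep recordings whole; that choice is outside this sketch.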
|
|
184
|
+
|
|
185
|
+
## Files this plan would touch (Tier 1)
|
|
186
|
+
|
|
187
|
+
```
|
|
188
|
+
deco-start/
|
|
189
|
+
├── src/sdk/rum.ts # NEW — browser-side instrumentation
|
|
190
|
+
├── src/sdk/rum.server.ts # NEW — /__deco/rum handler
|
|
191
|
+
├── src/admin/setup.ts # ROUTE — mount /__deco/rum
|
|
192
|
+
├── src/sdk/observability.ts # EXPORT — re-export rum API
|
|
193
|
+
├── package.json # DEP — add `web-vitals` peer dep
|
|
194
|
+
└── docs/rum.md # NEW — site-side usage docs
|
|
195
|
+
|
|
196
|
+
stats-lake/observability/
|
|
197
|
+
└── dashboards/templates/site-rum.json # NEW
|
|
198
|
+
```
|
|
199
|
+
|
|
200
|
+
## Why this is a plan and not an implementation
|
|
201
|
+
|
|
202
|
+
The user explicitly asked for RUM to be in a separate plan document
|
|
203
|
+
rather than rolled into the refinement plan. The scope-tier decision
|
|
204
|
+
above is the single highest-leverage choice; everything downstream
|
|
205
|
+
follows from it, and "Tier 3 because Tier 3 is biggest" is the wrong
|
|
206
|
+
default. A 30-minute conversation on the tier choice saves weeks of
|
|
207
|
+
work in the wrong direction.
|
|
208
|
+
|
|
209
|
+
Open the matching Linear / GitHub issue once a tier is selected.
|
|
@@ -0,0 +1,40 @@
|
|
|
1
|
+
# Runbooks
|
|
2
|
+
|
|
3
|
+
Tested response procedures for the alerts auto-provisioned by
|
|
4
|
+
[`stats-lake/observability/`](https://github.com/decocms/stats-lake/blob/main/observability/README.md).
|
|
5
|
+
|
|
6
|
+
Every alert generated by `provision-dashboards.ts` carries a
|
|
7
|
+
`runbook_url` annotation that points to a Markdown file in this
|
|
8
|
+
directory. Each runbook follows the same structure so on-call can read
|
|
9
|
+
top-to-bottom under stress:
|
|
10
|
+
|
|
11
|
+
1. **What this alert means.** One paragraph, no jargon.
|
|
12
|
+
2. **First check (60 seconds).** The single most likely cause and how to
|
|
13
|
+
confirm it.
|
|
14
|
+
3. **Diagnostic queries.** ClickHouse SQL the responder can paste into
|
|
15
|
+
Grafana / ClickStack to dig deeper.
|
|
16
|
+
4. **Common causes & fixes.** Ranked by frequency. Each with a "did
|
|
17
|
+
that fix it?" verification step.
|
|
18
|
+
5. **Escalation.** When to page a domain owner, and who.
|
|
19
|
+
6. **Post-mortem hook.** What to capture so the post-mortem isn't
|
|
20
|
+
reconstructed from memory.
|
|
21
|
+
|
|
22
|
+
## Runbook catalogue
|
|
23
|
+
|
|
24
|
+
| Alert ID | Runbook |
|
|
25
|
+
|---------------------------|--------------------------------------------------------|
|
|
26
|
+
| `http-error-spike` | [`http-error-spike.md`](./http-error-spike.md) |
|
|
27
|
+
| `http-latency-spike` | [`http-latency-spike.md`](./http-latency-spike.md) |
|
|
28
|
+
| `cache-hit-drop` | [`cache-hit-drop.md`](./cache-hit-drop.md) |
|
|
29
|
+
| `commerce-upstream-slow` | [`commerce-upstream-slow.md`](./commerce-upstream-slow.md) |
|
|
30
|
+
| `tail-exception-spike` | [`tail-exception-spike.md`](./tail-exception-spike.md) |
|
|
31
|
+
|
|
32
|
+
## Authoring a new runbook
|
|
33
|
+
|
|
34
|
+
When you add a new alert template, add a runbook with the same ID. The
|
|
35
|
+
provisioning script doesn't enforce the link today — that gate lands in
|
|
36
|
+
Phase 6 (Governance), but treat it as required: an alert without a
|
|
37
|
+
runbook is a half-shipped alert.
|
|
38
|
+
|
|
39
|
+
Keep runbooks short. If a section grows past 20 lines, split it out
|
|
40
|
+
into a dedicated incident-pattern doc and link to it from the runbook.
|
|
@@ -0,0 +1,83 @@
|
|
|
1
|
+
# Runbook: `cache-hit-drop`
|
|
2
|
+
|
|
3
|
+
> A site's edge cache hit rate fell below its own 24h rolling baseline by 3σ for ≥ 10 minutes.
|
|
4
|
+
|
|
5
|
+
## What this alert means
|
|
6
|
+
|
|
7
|
+
The edge cache is missing more than usual. On the user side this
|
|
8
|
+
manifests as slower page loads. On the cost side it means more origin
|
|
9
|
+
requests (more billing for Workers + commerce API calls). On the
|
|
10
|
+
upstream side it can become a thundering herd if many users hit a
|
|
11
|
+
freshly-evicted entry simultaneously.
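
The alert condition (below the 24h baseline by 3σ for at least 10 minutes) can be sketched as follows. This is not the provisioned alert rule: `isSustainedDrop` is a hypothetical name, the population standard deviation is an assumption, and `recent` is assumed to hold one hit-rate sample per minute.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (an assumption; the rule may differ).
function stddev(values: number[]): number {
  const m = mean(values);
  return Math.sqrt(mean(values.map((v) => (v - m) ** 2)));
}

// True when each of the last `minMinutes` samples sits below
// mean(baseline) - sigmas * stddev(baseline).
function isSustainedDrop(
  baseline: number[],
  recent: number[],
  sigmas: number = 3,
  minMinutes: number = 10,
): boolean {
  if (baseline.length === 0 || recent.length < minMinutes) return false;
  const threshold = mean(baseline) - sigmas * stddev(baseline);
  return recent.slice(-minMinutes).every((rate) => rate < threshold);
}

// Baseline around 0.91 with a small spread gives a threshold near 0.87.
const baselineRates = [0.9, 0.92, 0.91, 0.89, 0.93];
const tenLowMinutes = new Array<number>(10).fill(0.8);
const tenNormalMinutes = new Array<number>(10).fill(0.88);
const nineLowMinutes = new Array<number>(9).fill(0.8);

const firesOnDrop = isSustainedDrop(baselineRates, tenLowMinutes);
const quietWhenNormal = isSustainedDrop(baselineRates, tenNormalMinutes);
const quietWhenTooShort = isSustainedDrop(baselineRates, nineLowMinutes);
```

Because the band is derived from the site's own history, a site with a noisy baseline needs a larger drop before the condition holds.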
|
|
12
|
+
|
|
13
|
+
## First check (60 seconds)
|
|
14
|
+
|
|
15
|
+
Was there a deploy or a cache purge in the last 10 minutes? Cold caches
|
|
16
|
+
recover quickly (5–10m) so if the alert is fresh and a deploy is
|
|
17
|
+
recent, this often self-heals.
|
|
18
|
+
|
|
19
|
+
```sql
|
|
20
|
+
-- Recent deploys (any change to service.version visible in metrics)
|
|
21
|
+
SELECT ResourceAttributes['service.version'] AS version, min(TimeUnix) AS first_seen
|
|
22
|
+
FROM otel_metrics_sum
|
|
23
|
+
WHERE ServiceName = '{site}'
|
|
24
|
+
AND TimeUnix > now() - INTERVAL 1 HOUR
|
|
25
|
+
GROUP BY version
|
|
26
|
+
ORDER BY first_seen DESC;
|
|
27
|
+
```
|
|
28
|
+
|
|
29
|
+
If neither deploy nor purge fired in the window, the cache miss share
|
|
30
|
+
indicates a real regression — proceed below.
|
|
31
|
+
|
|
32
|
+
## Diagnostic queries
|
|
33
|
+
|
|
34
|
+
```sql
|
|
35
|
+
-- Hit / miss share by route_pattern, last 30 minutes
|
|
36
|
+
SELECT
|
|
37
|
+
Attributes['route_pattern'] AS route,
|
|
38
|
+
countIf(MetricName = 'cache_hit_total') AS hits,
|
|
39
|
+
countIf(MetricName = 'cache_miss_total') AS misses,
|
|
40
|
+
hits / nullIf(hits + misses, 0) AS hit_rate
|
|
41
|
+
FROM otel_metrics_sum
|
|
42
|
+
WHERE MetricName IN ('cache_hit_total', 'cache_miss_total')
|
|
43
|
+
AND ServiceName = '{site}'
|
|
44
|
+
AND TimeUnix > now() - INTERVAL 30 MINUTE
|
|
45
|
+
GROUP BY route
|
|
46
|
+
ORDER BY misses DESC
|
|
47
|
+
LIMIT 20;
|
|
48
|
+
```
|
|
49
|
+
|
|
50
|
+
```sql
|
|
51
|
+
-- Cache decision distribution by cache_profile
|
|
52
|
+
SELECT
|
|
53
|
+
Attributes['profile'] AS profile,
|
|
54
|
+
Attributes['decision'] AS decision,
|
|
55
|
+
sum(toFloat64(Value)) AS n
|
|
56
|
+
FROM otel_metrics_sum
|
|
57
|
+
WHERE MetricName = 'cache_hit_total'
|
|
58
|
+
AND ServiceName = '{site}'
|
|
59
|
+
AND TimeUnix > now() - INTERVAL 30 MINUTE
|
|
60
|
+
GROUP BY profile, decision
|
|
61
|
+
ORDER BY n DESC;
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
## Common causes & fixes
|
|
65
|
+
|
|
66
|
+
| Rank | Cause | How to confirm | Fix |
|
|
67
|
+
|------|-------------------------------------------------------------|----------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
|
|
68
|
+
| 1 | Deploy purged the version cache (`X-Cache-Version` flipped) | Recent `service.version` in the deploy query | Wait 10m for cache to warm. If sustained, check that the new build hash is propagating consistently. |
|
|
69
|
+
| 2 | A new query parameter is hashing into the cache key | One route's MISS share is far higher than the rest | Check `cacheHeaders` / `ignoreSearchParams` config for that route; add the new param to the ignore list. |
|
|
70
|
+
| 3 | Set-Cookie present on a previously cacheable response | `X-Cache: BYPASS` with `X-Cache-Reason: private-set-cookie` on the affected route | Inspect the section that started emitting cookies; move the cookie write to a non-cacheable POST handler. |
|
|
71
|
+
| 4 | A real burst of unique URLs (e.g. crawler scanning long-tail) | `Attributes['route_pattern']` doesn't change but distinct paths multiply | If a known bot, add a WAF rule. If a real catalog query, consider broader cache profile. |
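
Cause 2 in the table is easier to see with a toy cache key. The function below is a hypothetical illustration of how an ignore list stops a tracking parameter from fragmenting the cache; it is not the framework's key derivation, and only the option name `ignoreSearchParams` comes from the text above.

```typescript
// Hypothetical cache-key builder: drop ignored parameters, then sort the
// rest so parameter order does not create distinct keys.
function cacheKey(rawUrl: string, ignoreSearchParams: string[]): string {
  const url = new URL(rawUrl);
  for (const name of ignoreSearchParams) {
    url.searchParams.delete(name);
  }
  url.searchParams.sort();
  return url.origin + url.pathname + url.search;
}

// Same product page reached through two campaigns (made-up URLs).
const first = "https://shop.example/p?sku=1&utm_source=mail";
const second = "https://shop.example/p?utm_source=ads&sku=1";

// Without the ignore list every campaign value is a separate cache entry.
const fragmented = cacheKey(first, []) !== cacheKey(second, []);
// With it, both requests share one key.
const sharedKey = cacheKey(first, ["utm_source"]);
const sameKey = sharedKey === cacheKey(second, ["utm_source"]);
```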
|
|
72
|
+
|
|
73
|
+
## Escalation
|
|
74
|
+
|
|
75
|
+
- Sustained > 1 hour despite no deploy → page the site team owner.
|
|
76
|
+
- Suspected bot/abuse → loop in security / WAF on-call.
|
|
77
|
+
|
|
78
|
+
## Post-mortem hook
|
|
79
|
+
|
|
80
|
+
- The "before" hit rate and the "after" hit rate.
|
|
81
|
+
- The top route that lost the hit rate.
|
|
82
|
+
- A representative response header snippet showing `X-Cache`,
|
|
83
|
+
`X-Cache-Profile`, `X-Cache-Reason`.
|
|
@@ -0,0 +1,88 @@
|
|
|
1
|
+
# Runbook: `commerce-upstream-slow`
|
|
2
|
+
|
|
3
|
+
> A site's `commerce_request_duration_ms` p95 exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
|
|
4
|
+
|
|
5
|
+
## What this alert means
|
|
6
|
+
|
|
7
|
+
Calls out to a commerce provider (VTEX, Shopify, or similar) are
|
|
8
|
+
taking abnormally long for *this* site. Because SSR is synchronous on
|
|
9
|
+
upstream commerce calls, a slow upstream cascades into the user-facing
|
|
10
|
+
`http-latency-spike` alert almost immediately. If both fired together,
|
|
11
|
+
this is the root cause — fix here first.
|
|
12
|
+
|
|
13
|
+
## First check (60 seconds)
|
|
14
|
+
|
|
15
|
+
Which provider/operation is slow? The same dashboard's "Commerce p95
|
|
16
|
+
by provider/operation" panel breaks it out. Note the
|
|
17
|
+
`provider.operation` string — e.g. `vtex.intelligent-search.product_search`.
|
|
18
|
+
|
|
19
|
+
If a single operation is responsible, jump to "Common causes" #1.
|
|
20
|
+
If multiple operations from the same provider are slow simultaneously,
|
|
21
|
+
that's a provider-wide regression — jump to "Common causes" #2.
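
The triage rule above (one slow operation versus several from the same provider) can be written down as a small function. `triageSlowOperations` and its input shape are hypothetical; deciding which operations count as slow is assumed to happen upstream, for example from the dashboard panel.

```typescript
interface SlowOperation {
  provider: string; // e.g. "vtex"
  operation: string; // e.g. "intelligent-search.product_search"
}

type Triage = "none" | "single-operation" | "provider-wide";

// Provider-wide when any provider has more than one distinct slow
// operation; otherwise a single-operation problem.
function triageSlowOperations(slow: SlowOperation[]): Triage {
  if (slow.length === 0) return "none";
  const perProvider = new Map<string, Set<string>>();
  for (const { provider, operation } of slow) {
    const operations = perProvider.get(provider) ?? new Set<string>();
    operations.add(operation);
    perProvider.set(provider, operations);
  }
  const providerWide = Array.from(perProvider.values()).some(
    (operations) => operations.size > 1,
  );
  return providerWide ? "provider-wide" : "single-operation";
}

const oneSlow = triageSlowOperations([
  { provider: "vtex", operation: "intelligent-search.product_search" },
]);
const manySlow = triageSlowOperations([
  { provider: "vtex", operation: "intelligent-search.product_search" },
  { provider: "vtex", operation: "checkout.order_form" },
]);
```

The operation name `checkout.order_form` is made up for the example.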
|
|
22
|
+
|
|
23
|
+
## Diagnostic queries
|
|
24
|
+
|
|
25
|
+
```sql
|
|
26
|
+
-- p95 commerce latency by provider + operation, last hour
|
|
27
|
+
SELECT
|
|
28
|
+
toStartOfInterval(TimeUnix, INTERVAL 5 MINUTE) AS t,
|
|
29
|
+
Attributes['provider'] AS provider,
|
|
30
|
+
Attributes['operation'] AS op,
|
|
31
|
+
quantileBFloat16(0.95)(toFloat64(Sum / nullIf(Count, 0))) AS p95
|
|
32
|
+
FROM otel_metrics_histogram
|
|
33
|
+
WHERE MetricName = 'commerce_request_duration_ms'
|
|
34
|
+
AND ServiceName = '{site}'
|
|
35
|
+
AND TimeUnix > now() - INTERVAL 1 HOUR
|
|
36
|
+
GROUP BY t, provider, op
|
|
37
|
+
ORDER BY t, p95 DESC;
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
```sql
|
|
41
|
+
-- Commerce call status distribution — are we getting 5xx from upstream?
|
|
42
|
+
SELECT
|
|
43
|
+
Attributes['provider'] AS provider,
|
|
44
|
+
Attributes['operation'] AS op,
|
|
45
|
+
Attributes['status_class'] AS status_class,
|
|
46
|
+
count() AS n
|
|
47
|
+
FROM otel_metrics_histogram
|
|
48
|
+
WHERE MetricName = 'commerce_request_duration_ms'
|
|
49
|
+
AND ServiceName = '{site}'
|
|
50
|
+
AND TimeUnix > now() - INTERVAL 30 MINUTE
|
|
51
|
+
GROUP BY provider, op, status_class
|
|
52
|
+
ORDER BY n DESC;
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
```sql
|
|
56
|
+
-- VTEX SWR cache effectiveness on the slow operation
|
|
57
|
+
SELECT
|
|
58
|
+
Attributes['cached'] AS cached,
|
|
59
|
+
count() AS n,
|
|
60
|
+
avg(toFloat64(Sum / nullIf(Count, 0))) AS avg_ms
|
|
61
|
+
FROM otel_metrics_histogram
|
|
62
|
+
WHERE MetricName = 'commerce_request_duration_ms'
|
|
63
|
+
AND ServiceName = '{site}'
|
|
64
|
+
AND Attributes['operation'] = '<paste operation here>'
|
|
65
|
+
AND TimeUnix > now() - INTERVAL 30 MINUTE
|
|
66
|
+
GROUP BY cached;
|
|
67
|
+
```

## Common causes & fixes

| Rank | Cause | How to confirm | Fix |
|------|-------|----------------|-----|
| 1 | One specific upstream operation is slow | Single `provider.operation` row dominates the p95 query | Check provider status page (status.vtex.com, www.shopifystatus.com). If clean, see if we recently changed payload size or filter on that operation. |
| 2 | Provider-wide regression | Multiple operations from the same `provider` regressed simultaneously | Public provider status page is usually the source of truth. Open a ticket with the provider citing our timing window. |
| 3 | VTEX SWR / cachedLoader hit rate dropped | Query 3 shows `cached=false` share rose | Inspect recent loader changes for the affected section. May have invalidated the cache key by changing the loader signature. |
| 4 | Region-specific (CF colo → upstream latency) | `region` label on the metric isolates one CF colo | Usually transient; CF will rebalance. If sustained, file a CF support ticket. |

## Escalation

- Provider-wide regression confirmed → notify the affected customer-facing teams; this is communication-shaped, not engineering-shaped.
- One operation slow, no provider status incident → page the site team owner for that route.

## Post-mortem hook

- The `provider.operation` string and its p95 timeline.
- The cache (`cached=true/false`) split on that operation.
- A representative trace from `otel_traces` showing the slow span
  (`SpanName LIKE 'vtex.%'` or `'shopify.%'`).
@@ -0,0 +1,98 @@
# Runbook: `http-error-spike`

> A site's 5xx error rate exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.

## What this alert means

Real users are getting 5xx responses at a rate that's statistically
abnormal for this specific site. The alert uses a per-site anomaly band
(not a fleet-wide threshold) so a site that normally runs at 0.3% 5xx
fires for spikes other sites wouldn't notice, and a site that normally
runs at 4% (a known-noisy legacy storefront) doesn't false-positive at
4.1%.
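
To make the per-site band concrete, here is a minimal sketch of a "mean + 3σ for 10 minutes" check. It assumes one 5xx-rate sample per minute and a population standard deviation over the baseline; the real alert rule may be computed differently, and `meanAndStdDev` and `isErrorSpike` are hypothetical names used only for illustration:

```typescript
// Mean and population standard deviation of the baseline samples.
function meanAndStdDev(samples: number[]): { mean: number; stdDev: number } {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((a, b) => a + (b - mean) ** 2, 0) / samples.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Fires only when every one of the last `minMinutes` samples sits above
// the site's own band (baseline mean + `sigmas` standard deviations).
function isErrorSpike(
  baseline: number[], // per-minute 5xx rates for the previous 24h (assumed)
  recent: number[], // most recent per-minute 5xx rates
  sigmas = 3,
  minMinutes = 10,
): boolean {
  if (baseline.length === 0 || recent.length < minMinutes) return false;
  const { mean, stdDev } = meanAndStdDev(baseline);
  const upperBand = mean + sigmas * stdDev;
  return recent.slice(-minMinutes).every((rate) => rate > upperBand);
}

// Illustrative baselines: a quiet site around 0.3% and a noisy one around 4%.
const quietSite = Array.from({ length: 1440 }, (_, i) => (i % 2 === 0 ? 0.002 : 0.004));
const noisySite = Array.from({ length: 1440 }, (_, i) => (i % 2 === 0 ? 0.038 : 0.042));
const tenMinutesAt = (rate: number): number[] => Array.from({ length: 10 }, () => rate);

console.log(isErrorSpike(quietSite, tenMinutesAt(0.02))); // true
console.log(isErrorSpike(noisySite, tenMinutesAt(0.041))); // false
```

The same 2% error rate that fires on the quiet site stays inside the noisy site's band, which is the behavior the paragraph above describes.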

## First check (60 seconds)

Look at the **commerce upstream p95** panel on the same dashboard. If
that spiked at the same moment, the root cause is almost always an
upstream commerce API regressing. Stop here, jump to
[`commerce-upstream-slow.md`](./commerce-upstream-slow.md).

If commerce p95 is flat, the 5xx is internal; proceed below.

## Diagnostic queries

Paste into ClickStack or a Grafana Explore panel pointed at the
ClickHouse datasource.

```sql
-- Top error routes for this site, last 30 minutes
SELECT
  Attributes['route_pattern'] AS route,
  countIf(Attributes['status_class'] = '5xx') AS errors,
  count() AS total,
  errors / total AS rate
FROM otel_metrics_sum
WHERE MetricName = 'http_requests_total'
  AND ServiceName = '{site}'
  AND TimeUnix > now() - INTERVAL 30 MINUTE
GROUP BY route
HAVING errors > 0
ORDER BY rate DESC
LIMIT 20;
```

```sql
-- Recent exceptions captured by the tail worker
SELECT
  Timestamp,
  Body,
  LogAttributes['url.path'] AS path,
  LogAttributes['http.response.status_code'] AS status
FROM otel_logs
WHERE ServiceName = '{site}'
  AND SeverityText = 'ERROR'
  AND LogAttributes['_source'] = 'tail-worker'
  AND LogAttributes['_outcome'] = 'exception'
  AND Timestamp > now() - INTERVAL 30 MINUTE
ORDER BY Timestamp DESC
LIMIT 100;
```

```sql
-- Did a deploy correlate? List versions seen in the last hour
SELECT
  ResourceAttributes['service.version'] AS version,
  min(Timestamp) AS first_seen,
  max(Timestamp) AS last_seen,
  count() AS log_count
FROM otel_logs
WHERE ServiceName = '{site}'
  AND Timestamp > now() - INTERVAL 1 HOUR
GROUP BY version
ORDER BY first_seen DESC;
```
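
Reading deploy correlation off the version list comes down to comparing each version's `first_seen` with the start of the spike. A minimal sketch, assuming the rows have been parsed into `Date` values; the `VersionRow` type, the `suspectDeploys` helper, and the 15-minute lookback are illustrative assumptions rather than anything the runbook prescribes:

```typescript
// Mirrors the version / first_seen / last_seen columns of the deploy query.
interface VersionRow {
  version: string;
  firstSeen: Date;
  lastSeen: Date;
}

// Hypothetical helper: versions that first appeared shortly before the spike.
function suspectDeploys(
  rows: VersionRow[],
  spikeStart: Date,
  lookbackMinutes = 15, // assumed window, tune per site
): string[] {
  const windowEnd = spikeStart.getTime();
  const windowStart = windowEnd - lookbackMinutes * 60 * 1000;
  return rows
    .filter((row) => {
      const t = row.firstSeen.getTime();
      return t >= windowStart && t <= windowEnd;
    })
    .map((row) => row.version);
}

// Illustrative data only.
const versions: VersionRow[] = [
  {
    version: "d4e5f6",
    firstSeen: new Date("2024-01-01T12:05:00Z"),
    lastSeen: new Date("2024-01-01T12:40:00Z"),
  },
  {
    version: "a1b2c3",
    firstSeen: new Date("2024-01-01T11:00:00Z"),
    lastSeen: new Date("2024-01-01T12:06:00Z"),
  },
];
const suspects = suspectDeploys(versions, new Date("2024-01-01T12:10:00Z"));
console.log(suspects); // [ 'd4e5f6' ]
```

An empty result suggests the spike did not follow a deploy, which points toward the other causes.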

## Common causes & fixes

| Rank | Cause | How to confirm | Fix |
|------|-------|----------------|-----|
| 1 | A recent deploy regressed | Deploy query above (versions seen in the last hour) shows a new `service.version` whose `first_seen` is just before the spike | Roll back via Cloudflare dashboard `Deployments → Rollback`. Confirm via a fresh `service.version` line in the next 5m. |
| 2 | A specific route is broken (one bad section) | Top error routes query shows one `route_pattern` at 100% error rate | Check the recent commits to that section. Roll back or `Lazy` wrap it for graceful degradation. |
| 3 | Upstream cache layer evicted; cold-cache thundering herd | `cache_miss_total` for the same window spikes proportionally to errors | Wait it out; this usually self-heals in 5m. If sustained, check that `staleTime` is set correctly on cmsRouteConfig. |
| 4 | Origin (commerce API) returning 5xx | `commerce_request_duration_ms` spike OR commerce logs | See [`commerce-upstream-slow.md`](./commerce-upstream-slow.md). |

## Escalation

- **Site team owner** if a fix isn't obvious in 15 minutes (Slack
  `#deco-platform`).
- **Cloudflare support** if all sites in a region are affected
  simultaneously (look at the `region` label on the metrics); this
  has happened during CF colo incidents.

## Post-mortem hook

Capture before the alert clears:
- `request.id` of one failing request (from the response header
  `X-Request-Id` of a manually-reproduced 5xx).
- A representative tail-worker log row with stack trace.
- The deploy `service.version` window during the spike.

Stash them in the incident ticket so the post-mortem has the
correlation IDs it needs.
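
Grabbing the `X-Request-Id` from a manually reproduced 5xx can be scripted. A minimal sketch, assuming the response exposes the header as described above; `ResponseLike` and `captureRequestId` are hypothetical names, and the stubbed response stands in for the result of a real `fetch` call:

```typescript
// Minimal structural type; a standard fetch Response satisfies it.
interface ResponseLike {
  status: number;
  headers: { get(name: string): string | null };
}

// Hypothetical helper: returns the correlation ID only for 5xx responses.
function captureRequestId(response: ResponseLike): string | null {
  if (response.status < 500 || response.status > 599) return null;
  return response.headers.get("X-Request-Id");
}

// Stub standing in for a reproduced failure; the ID value is made up.
const reproduced: ResponseLike = {
  status: 503,
  headers: {
    get: (name) =>
      name.toLowerCase() === "x-request-id" ? "req-example-123" : null,
  },
};
console.log(captureRequestId(reproduced)); // req-example-123
```

In practice the argument would be the response of `await fetch(failingUrl)`, and the returned ID goes into the incident ticket alongside the log row and version window.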