@decocms/start 6.0.1 → 6.1.0

@@ -121,6 +121,15 @@ this plan.
121
121
  | 2026-05-07 | **D6.1 — Cloudflare credentials never leave `deco-start`** | Same-day refinement of D6 after the first central deploy on `baggagio-tanstack` failed with `Secret CLOUDFLARE_API_TOKEN is required, but not provided while calling`. The original D6 design used `secrets: inherit` from the storefront stub and required `CLOUDFLARE_*` to live in the `deco-sites` org, which broke the principle that *the only secrets a storefront repo holds are the secrets that go into wrangler secrets, not the ones used to deploy*. First-pass refinement: the central `deploy.yml` / `preview.yml` / `sync-secrets.yml` jobs declared `environment: production` to try to make `${{ secrets.CLOUDFLARE_* }}` resolve from `decocms/deco-start`'s `production` Environment. **Found broken empirically on 2026-05-07** — the deployment registers in the *caller* repo, not the called workflow's repo, so the environment lookup uses the caller's `production` env (auto-created with no secrets). Superseded by D6.2 the same evening. |
122
122
  | 2026-05-07 | **D6.2 — App-mediated dispatch + no per-site registry (supersedes D6 + D6.1)** | After D6.1's `environment:` mechanism was empirically shown not to work cross-repo, the architecture pivoted: a `decocms-deployer` GitHub App is installed on `decocms/deco-start` (`actions:write`) and on each storefront repo (`contents:read`, optionally `pull-requests:write`). The storefront caller stub mints a short-lived App-installation token and calls `gh workflow run deploy.yml --repo decocms/deco-start --ref v3 -f site_owner=… -f site_name=…`. The central workflow runs in `decocms/deco-start`'s context, so `CLOUDFLARE_API_TOKEN` / `CLOUDFLARE_ACCOUNT_ID` are ordinary repo secrets. For runtime `SECRET_*` values, each storefront has a `<site_name>-secrets` GitHub Environment in `decocms/deco-start` (S1 design); `sync-secrets.yml` binds to that environment and pushes to `wrangler secret put`. The per-site registry under `deploy/sites/<repo>.jsonc` was dropped entirely (Pure C): worker name = repo basename by convention; the App being installed on the storefront repo is the deploy authorization gate; rare per-worker derived fields (like AE dataset name) use `$WORKER_*` substitution tokens in the template. Force-rollback is impossible for production deploys because the central workflow ignores caller-supplied `site_sha` and resolves the storefront's current default-branch HEAD itself. See [`deploy/README.md`](./deploy/README.md) for the full trust model. **Operational migrations required by Pure C:** `miess-01-tanstack` repo's worker shifts from `miess-tanstack` to `miess-01-tanstack` (CF-side cutover); `lebiscuit-tanstack` AE dataset shifts from `deco_metrics_lebiscuit` to `deco_metrics_lebiscuit_tanstack` (orphans old data). |
123
123
  | 2026-05-07 | **D6.3 — Revert D6/D6.1/D6.2; deploys move to Cloudflare Workers Builds** | The whole D6 family (centralized GitHub Actions reusable workflows + `decocms-deployer` GitHub App + per-storefront GitHub Environments + central `deploy/wrangler-template.jsonc` + `deco-wrangler` CLI + per-site caller stubs) is being **reverted**. Trigger: GitHub Free orgs do not propagate org-level secrets to private repos, which forced the App private key to live as a per-storefront repo secret in every storefront — that key gives the holder the ability to mint installation tokens that can trigger workflows on `decocms/deco-start`, which in turn have the only Cloudflare credentials in the system. Per-repo distribution + rotation of that key across N customer storefronts didn't scale and concentrated blast radius on one credential. **Replacement (chosen, to be detailed in a follow-up D-record once shipped):** [Cloudflare Workers Builds](https://developers.cloudflare.com/workers/ci-cd/builds/) owns the deploy/preview pipelines per-worker. Verified empirically on `baggagio-tanstack` 2026-05-07: a malicious `wrangler.jsonc` `name` field pointing at a different worker (`americanas-tanstack`) is **ignored** by CF Builds — the deploy lands on the connected worker (`baggagio-tanstack`), CF surfaces a warning banner in the dashboard, and CF auto-opens a PR to fix the config (deco-sites/baggagio-tanstack#34). The dashboard repo<->worker connection is the source of truth; the in-repo config is treated as a secondary input. Per-storefront wiring (one CF dashboard click per worker) is acceptable at our scale; revisit when CF's [git-integration enable API](https://github.com/cloudflare/workers-sdk/issues/12058) lands. The `deco-build` CLI (regenerates `wrangler.jsonc` bindings from a central template) and runtime-secrets management remain to be designed in a separate PR. |
124
+ | 2026-05-22 | **D-9 — Stable `request.id` propagation across all observability channels** | The fragmentation surfaced during the May 2026 error triage: tail-worker logs, direct-POST metrics, CF-Destinations spans, and structured `console.log` JSON all carry overlapping information but no shared join key. A single user-visible 5xx required hand-correlating timestamps across four ClickHouse tables. **Decision:** the framework generates a stable `request.id` once at request entry (precedence: inbound `x-request-id` → `cf-ray` → `crypto.randomUUID()`) inside `RequestContext.run`, then stamps it on (a) the root span as the `request.id` attribute, (b) every log line via the logger attribute floor, (c) the response as `X-Request-Id` (read by `deco-otel-tail`), and (d) the metric labels via `extra`. Symmetric for `trace.id` — read from the root span's spanContext, echoed as `X-Trace-Id`. This re-establishes a single join key across all channels: pick any row from any table, filter by `request.id`, and reconstruct the full request lifecycle. **Files:** `RequestContext.requestId` + logger floor + workerEntry response-header echo + tail-worker enrichment. **Captured in Phase 1 of [the observability refinement plan](../../.cursor/plans/observability_refinement_plan_4fa41548.plan.md)**. |
125
+ | 2026-05-22 | **D-10 — Server-side log normalization at the ingest worker, cost-neutral** | CF Destinations wraps every `console.log(JSON.stringify(...))` line into an OTLP LogRecord with the JSON body in `body.stringValue`. Querying by structured fields requires `JSONExtract` everywhere — slow, query-fragile, and tied to whatever the producer happens to embed. **Two design choices considered:** (a) migrate all framework `logger.{info,warn,debug}` calls to direct-POST native OTLP, ditching the CF Destinations sampled path. Substantially more code volume in production direct-POST traffic; bypasses the head sampling that keeps fleet cost bounded. (b) lift the JSON-in-body into native OTLP `LogAttributes` server-side at the `deco-otel-ingest` worker. Same wire volume, same cost. **Decision: option (b).** The ingest worker's `logsToRows` now detects JSON-shaped `body.stringValue`, lifts `level`/`msg`/`trace_id`/`span_id` plus arbitrary keys into native OTLP attributes, reduces `Body` to the human-readable `msg`, and falls back unchanged for non-JSON strings (third-party `console.log`). Dashboards drop `JSONExtractString(Body, 'level') = 'error'` in favor of `SeverityText = 'ERROR'`. **Files:** `stats-lake/ingestion/otel-ingest/src/index.ts`. Phase 4 of the observability refinement plan. |
126
+ | 2026-05-22 | **D-11 — Outcome metrics layer becomes the truth source for "did we serve users today?"** | Earlier metric labels (`method`, `path`, `status`) couldn't answer "5xx rate per route per site" without joining metrics to tail-worker logs. The path label was raw-URL (unbounded cardinality risk); status was opaque (no class bucketing); no cache decision / cache layer; no commerce histogram in the framework (only apps-start sites that bumped to a recent version had it). **Decision:** expand the canonical label set for `http_requests_total` / `http_request_duration_ms` / `http_request_errors_total` to `{ method, route_pattern, status, status_class, outcome?, cache_decision?, cache_layer?, region?, …extra }`. `route_pattern` is the TanStack closed-set pattern (`/_products/$slug/p`); fallback is the normalized path. `status_class` is `2xx`/.../`5xx`/`unknown`. Cache labels lift the existing `X-Cache` / `X-Cache-Profile` headers up to the metric so dashboards answer cache-hit rate per route from the counter alone. Move `commerce_request_duration_ms` declaration into `@decocms/start` so every site emits it as soon as the framework is bumped, regardless of apps-start version (apps register operation strings only). Labels: `{ provider, operation, status_class?, cached? }`. **Files:** `src/middleware/observability.ts` (`statusClassFor`, `RequestMetricLabels`, `CacheLayer`, `recordCommerceMetric`, expanded `recordCacheMetric` signature). Phase 2 of the observability refinement plan. |
127
+ | 2026-05-22 | **D-12 — Direct-POST OTLP trace exporter for framework `deco.*` spans** | Empirical verification (May 2026) confirmed the framework's 10+ `withTracing` calls produced zero rows in `otel_traces`. Root cause: the bridge tracer in `instrumentWorker` delegates to `trace.getTracer(...)` on the `@opentelemetry/api` global. With no `TracerProvider` registered (the common case — CF Workers only auto-installs a provider when `observability.traces.destinations` is set), every framework span is silently discarded. **Decision:** introduce `otelHttpTracer.ts` — a direct-POST OTLP/HTTP trace exporter that mirrors the existing meter + error-log adapters. Same transport: per-isolate buffer, ctx.waitUntil flush, FNV-1a hash sampling at `headSamplingRate` (default 0.01 matches CF Destinations recommendation). Consistent per-trace decision so child spans are kept iff their root is kept. Honors inbound W3C `traceparent` — if the remote parent arrived sampled, every span in that trace is exported regardless of the rate. Wired alongside the existing `@opentelemetry/api` bridge via `configureTracerStack` — CF auto-spans still flow to the CF dashboard, framework spans direct-POST to ClickHouse. Default-on `injectTraceContext` inside `createInstrumentedFetch` was already in place. **Files:** `src/sdk/otelHttpTracer.ts`, `src/sdk/otel.ts` (`configureTracerStack`), `src/sdk/workerEntry.ts` (traceparent parsing). Phase 3 of the observability refinement plan. |
128
+ | 2026-05-22 | **D-13 — Per-site Grafana dashboards + alert rules are auto-provisioned from `dim_sites`** | Hand-built dashboards drift the moment a new site lands: a fleet of 100 sites can't be maintained by a human curator. **Decision:** the canonical observability provisioning lives in [`stats-lake/observability/`](../../../stats-lake/observability/) — a single dashboard template + a single alert-rule template, parameterized by `{{site}}`/`{{team}}`/`{{datasource_uid}}` and rendered once per site by `scripts/provision-dashboards.ts` (reads `dim_sites` joined to `dim_teams`, writes to `dashboards/dist/<team>/<site>.json` + `alerts/dist/<team>/<site>.yaml`). Alerts use **anomaly bands, not thresholds** — current 5-min mean vs 24h rolling mean ± 3σ, fires after 10 minutes outside the band. Same rule set runs on every site, but the baseline is per-site so a noisy storefront doesn't false-positive against a quiet one. Every alert carries a `runbook_url` annotation pointing at [`deco-start/docs/runbooks/`](../docs/runbooks/) — the runbook is part of the alert, not a separate artifact. **Decision points open** (Phase 5 of the refinement plan): alerting venue (Grafana → email, Linear MCP tickets, both, or none for v1). **Files:** `stats-lake/observability/{dashboards,alerts,scripts,README.md}` + `deco-start/docs/runbooks/`. |
129
+ | 2026-05-22 | **D-17 — Alerting venue: none for v1; ship dashboards-only** | The action layer (Phase 5) generates anomaly-band alert rule templates per site, but every alert needs a pager and we don't have an on-call rotation. **Decision:** ship dashboards-only for v1. The alert templates in `stats-lake/observability/alerts/templates/site-rules.yaml` stay versioned so they evolve with the dashboards, but `provision-dashboards.ts` only templates them into `alerts/dist/` when `--with-alerts` is passed, and no Grafana → email / Linear MCP / PagerDuty receiver is wired up. Rejected alternatives: (a) Grafana → email — emails get muted within a week without a triage owner; (b) Linear MCP ticket-per-fire — creates noise during active incidents and you can't triage a ticket while firefighting; (c) both — overkill before we know which storefronts will be noisiest. **Revisit when:** an on-call rotation exists, OR a specific incident class earns dedicated paging (e.g., billing-critical sites that need 24/7 coverage). **Files:** `stats-lake/observability/scripts/provision-dashboards.ts` (`--with-alerts` flag), `stats-lake/observability/README.md`. Phase 5 of the observability refinement plan. |
130
+ | 2026-05-22 | **D-16 — `deco-audit-observability` is warn-by-default; promote to block once the fleet is clean** | The audit (D-14) detects drift in `tail_consumers`, `version_metadata`, `DECO_METRICS`, and `DECO_OTEL_*_ENDPOINT` vars across every storefront wrangler.jsonc. Pre-merge blocking on day one would fail PRs that have nothing to do with observability — storefronts are upgraded over weeks, not all at once. Advisory comments alone get ignored. **Decision:** `--mode warn` (default) annotates findings via `::warning::` GitHub Actions lines but always exits 0, so observability drift surfaces in CI without blocking ship. `--mode block` exits 1 on any `error`-severity finding for use once the fleet has been pulled current. Storefronts opt into `block` per-repo by wiring `--mode block` in their workflow when they've cleared their findings. Includes `--github` flag to emit native annotations. **Files:** `scripts/audit-observability-config.ts` (parseArgs `--mode` / `--github`, main exit policy), test coverage in the matching `.test.ts` (6 new CLI smoke tests via tsx subprocess). Phase 6 of the observability refinement plan. |
131
+ | 2026-05-22 | **D-15 — OTel Collector swap is a documented target state, not a committed milestone** | The current `deco-otel-ingest` Worker is a hand-rolled OTLP/HTTP parser + ClickHouse inserter. It works at ~14M POSTs/month, but each new OTLP protocol revision, each new receiver (gRPC OTLP, Prometheus remote-write), and each new sink would cost us code to write and maintain — code the upstream OTel Collector + `clickhouseexporter` ship as a maintained product. **Decision:** mark the Collector swap as the eventual target state, **with no committed timeline**, and capture the explicit revisit-triggers so we know when to act rather than relying on "we should think about this sometime." The decision is cheap because the ClickHouse schema is **already** the canonical `clickhouseexporter` shape — that was a deliberate design choice in `clickhouse/schema/otel/` so the ingest path stays swappable. Migration when triggered is config-only at the data layer: stand up a Collector in the same CF account, configure `otlphttp`/`otlpgrpc` receivers + `clickhouse/v1` exporter + `transform` processors for the existing PII redaction and JSON-body lift (1:1 from current Worker logic), DNS-cutover the `DECO_OTEL_*_ENDPOINT` vars. **Revisit triggers:** OTLP 2.0 ships, we need gRPC OTLP, we need a non-ClickHouse sink, ingest volume exceeds 100M POSTs/mo, or a hand-rolled parser develops a defect we can't fix quickly. **Files:** [`stats-lake/ingestion/otel-ingest/COLLECTOR_TARGET.md`](../../../stats-lake/ingestion/otel-ingest/COLLECTOR_TARGET.md) holds the full migration runbook + rollback story. Phase 7 of the observability refinement plan; explicitly **optional**. |
132
+ | 2026-05-22 | **D-14 — `deco-audit-observability` covers fleet bindings, not just the `observability` block** | The existing audit only checked the `observability` block (sampling rates, persist, destinations). Phase 1+2+3 made several other wrangler keys load-bearing: `tail_consumers` must list `deco-otel-tail` (Phase 1 enrichment is a no-op without it), `version_metadata` must bind `CF_VERSION_METADATA` (no `service.version` without it = no deploy correlation), `analytics_engine_datasets` must bind `DECO_METRICS` (no AE meter), `vars.DECO_OTEL_{METRICS,TRACES,LOGS}_ENDPOINT` must resolve (direct-POST channels silently no-op otherwise). **Decision:** expand the audit with six new rules under a sibling function `auditFleetBindings` and a composing `auditWranglerConfig`. Severity tuned to the impact: tail consumer + version_metadata are `error` (operational coverage gap); the rest are `warn` (degraded mode, not total failure). Drift is detected today; the matching `--fix` codemod and CI gate hardness (block / warn / advisory) are Phase 6 decision points still open. **Files:** `scripts/audit-observability-config.ts` (`auditFleetBindings`, `auditWranglerConfig`). |
124
133
  | 2026-05-19 | **D-8 — Cloudflare Tail Worker (Strategy B) is the canonical 100% error capture mechanism** | At fleet scale (100 sites, 2.5B req/month) head sampling forces a tradeoff: 1% sampling makes the `head_sampling_rate * 5B-event-cap` math work, but 99% of error traces and 99% of error-correlated logs get dropped at the CF Destinations head. The framework already covers framework-emitted errors via the in-Worker direct-POST channel (`DECO_OTEL_LOGS_ENDPOINT`) — that's 100% of `logger.error(...)` regardless of `head_sampling_rate`. But three structural gaps remain that *no* in-Worker code can close from inside its own request handler: (a) uncaught throws (the worker isolate is already unwinding when the throw bubbles out of `instrumentWorker`), (b) `exceededCpu` / `exceededMemory` outcomes (the runtime kills the producer before any in-Worker code can run), (c) raw `console.error(...)` from third-party SDKs that bypass the framework logger. **Decision:** introduce [`deco-otel-tail`](https://github.com/decocms/stats-lake/tree/main/ingestion/otel-tail) — a Cloudflare Tail Worker in `stats-lake/ingestion/otel-tail/`. CF invokes it on every execution of any producer worker that lists it under `tail_consumers` (`wrangler.jsonc`). The handler filters TraceItems down to the interesting subset (`outcome !== "ok" \|\| exceptions.length > 0 \|\| logs.some(l => l.level === "error")`), translates each to OTLP LogRecords (one per exception, one per `error`-level log line, plus a synthetic LogRecord for non-ok outcomes that didn't surface either), and forwards them to `deco-otel-ingest` via an in-account service binding (no public hop). Rows land in `otel_logs` with `Attributes['_source'] = 'tail-worker'` so dashboards can split tail-captured errors from direct-POST + CF-Destinations errors. 
**Rejected alternatives:** (1) **Codemod + lint to enforce `logger.error` calls** — structural coverage gap; can't catch uncaught throws or 1101s by definition, and a lint can't enforce calls inside third-party code. (2) **Logpush + ingest pipeline** — bypassed because Logpush isn't OTLP-shaped and the pricing curve loses to tail-worker at our scale. (3) **CF dashboard log retention only** — no fan-out to ClickHouse, no fleet-wide query surface. (4) **DO-buffered tail-on-error** — ~$8K/mo at fleet scale per the cost model in `docs/observability.md`. **Coverage matrix lives in [`docs/observability.md`](./docs/observability.md) → "Error capture — three-channel model".** Producer-side wiring is one line per `wrangler.jsonc`: `tail_consumers: [{ service: "deco-otel-tail" }]`. **Operational dependency:** the tail worker MUST be deployed to the same Cloudflare account as `deco-otel-ingest` (currently `c95fc4cec7fc52453228d9db170c372c`) so the `[[services]]` binding resolves. If `deco-otel-ingest` ever moves accounts, the service binding collapses to a public HTTPS POST and the model needs revisiting. **Agent behaviour:** when designing error capture for new Worker-deployed code, default to Strategy B for the long tail; don't reach for codemod/lint enforcement unless there's a specific code-quality concern beyond capture. |
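
The `request.id` precedence recorded in D-9 (inbound `x-request-id`, then `cf-ray`, then `crypto.randomUUID()`) can be sketched as below. This is a minimal illustration, not the framework's code: `resolveRequestId`, the `HeaderReader` interface, and the injectable `generate` parameter are hypothetical names introduced here, and treating blank header values as absent is an assumption.

```ts
// Hypothetical sketch of the D-9 precedence; names are illustrative only.
interface HeaderReader {
  get(name: string): string | null;
}

function resolveRequestId(
  headers: HeaderReader,
  // Assumption: the random fallback is injectable so it can be stubbed in tests.
  generate: () => string = () => crypto.randomUUID(),
): string {
  // 1. An inbound id supplied by the caller or an upstream proxy wins.
  const inbound = headers.get("x-request-id")?.trim();
  if (inbound) return inbound;

  // 2. Otherwise reuse the Cloudflare ray id when present.
  const ray = headers.get("cf-ray")?.trim();
  if (ray) return ray;

  // 3. Last resort: mint a fresh id for this request.
  return generate();
}
```

The value returned here is what D-9 describes stamping on the root span, the log attribute floor, the `X-Request-Id` response header, and the metric labels, so that all four channels share one join key.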
125
134
 
126
135
  The full text of the constitutional rule (loaded into every agent
@@ -71,17 +71,24 @@ This makes it possible to filter traces by cache decision directly in ClickStack
71
71
 
72
72
  ## What's measured
73
73
 
74
- | Metric | Type | Source | Labels |
75
- | ------------------------------ | --------- | ----------------------------------- | ------------------------------- |
76
- | `http_requests_total` | counter | `workerEntry` | `method`, `path`, `status` |
77
- | `http_request_duration_ms` | histogram | `workerEntry` | `method`, `path`, `status` |
78
- | `http_request_errors_total` | counter | `workerEntry` (status >= 500) | `method`, `path`, `status` |
79
- | `cache_hit_total` | counter | edge cache decision | `profile`, `decision` |
80
- | `cache_miss_total` | counter | edge cache decision | `profile`, `decision` |
81
- | `resolve_duration_ms` | histogram | `resolveDecoPage` | |
74
+ | Metric | Type | Source | Labels (canonical, Phase 2 / D-11) |
75
+ | ------------------------------- | --------- | ----------------------------------- | ----------------------------------------------------------------------------------------------- |
76
+ | `http_requests_total` | counter | `workerEntry` | `method`, `route_pattern`, `status`, `status_class`, `outcome?`, `cache_decision?`, `cache_layer?`, `region?` |
77
+ | `http_request_duration_ms` | histogram | `workerEntry` | same as `http_requests_total` |
78
+ | `http_request_errors_total` | counter | `workerEntry` (status >= 500) | same as `http_requests_total` |
79
+ | `cache_hit_total` | counter | edge cache decision | `profile`, `decision`, `layer` (`edge` \| `cachedLoader` \| `vtex-swr`) |
80
+ | `cache_miss_total` | counter | edge cache decision | `profile`, `decision`, `layer` |
81
+ | `commerce_request_duration_ms` | histogram | commerce clients (vtex/shopify/…) | `provider`, `operation`, `status_class?`, `cached?` |
82
+ | `resolve_duration_ms` | histogram | `resolveDecoPage` | — |
82
83
 
83
84
  `decision` values mirror the `X-Cache` response header: `HIT`, `STALE-HIT`, `STALE-ERROR`, `MISS`, `BYPASS`.
84
85
 
86
+ `route_pattern` is the TanStack route pattern (e.g. `/_products/$slug/p`) rather than the raw URL path — bounded cardinality, joinable to the route table. Callers that don't supply one get a normalized path with dynamic segments collapsed (`/products/:slug/p`).
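
One plausible way to produce the fallback label is a small table of known route shapes followed by a generic pass that collapses id-like segments. The diff does not show the framework's actual normalization rules, so everything below (`normalizePathSketch`, the `RULES` table, the `:id` placeholder for numeric and UUID segments) is an assumption made for illustration.

```ts
// Hypothetical sketch: the real normalizer in @decocms/start may differ.
interface RouteRule {
  match: RegExp;
  pattern: string;
}

// Assumed rule table; only the product route from the docs is listed.
const RULES: RouteRule[] = [
  { match: /^\/products\/[^/]+\/p$/, pattern: "/products/:slug/p" },
];

const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function normalizePathSketch(path: string): string {
  // Drop the query string so it never leaks into the label.
  const q = path.indexOf("?");
  const clean = q === -1 ? path : path.slice(0, q);

  // Known shapes map to a fixed pattern.
  for (const rule of RULES) {
    if (rule.match.test(clean)) return rule.pattern;
  }

  // Otherwise collapse segments that look like ids to bound cardinality.
  return clean
    .split("/")
    .map((seg) => (/^\d+$/.test(seg) || UUID_RE.test(seg) ? ":id" : seg))
    .join("/");
}
```

Whatever the real rules are, the property that matters for the metric is the one stated above: the label set stays closed even when URLs carry unbounded slugs or ids.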
87
+
88
+ `status_class` is the canonical `2xx`/.../`5xx`/`unknown` bucket. Dashboards aggregate by `status_class` for SLO panels and by `status` for incident drill-down.
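
The bucketing can be sketched in a few lines. This is not the package's `statusClassFor`; it is a stand-in named `statusClassSketch` that reproduces the behaviour asserted by the test file in this diff (200 and 204 map to `2xx`, 99, 600, `NaN` and `Infinity` map to `unknown`). Mapping 1xx codes to `unknown` is an assumption, since the docs only list `2xx` through `5xx`.

```ts
// Illustrative stand-in for the bucketing described above.
type StatusClass = "2xx" | "3xx" | "4xx" | "5xx" | "unknown";

function statusClassSketch(status: number): StatusClass {
  // Rejects NaN, Infinity, fractions, and anything outside 200..599.
  // Assumption: 1xx is folded into "unknown".
  if (!Number.isInteger(status) || status < 200 || status > 599) {
    return "unknown";
  }
  return `${Math.floor(status / 100)}xx` as StatusClass;
}
```

Keeping both `status` and `status_class` on the metric lets an SLO panel group by five or so values while an incident drill-down can still distinguish 502 from 504.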
89
+
90
+ `commerce_request_duration_ms` is owned by the framework (Phase 2 / D-11) so every site emits it as soon as `@decocms/start` is bumped, regardless of `@decocms/apps` version. Apps register operation strings via `recordCommerceMetric`; the framework owns the cardinality contract.
91
+
85
92
  ### Metrics: AE vs OTLP (the two-meter split)
86
93
 
87
94
  `instrumentWorker` plugs **up to two meters in parallel**, composed via `createCompositeMeter`:
@@ -181,8 +188,9 @@ The direct-POST channels are wired automatically when the relevant env vars reso
181
188
  | ---------------------------------- | -------------------- | ---------------------- | ---------------------------------------------------- |
182
189
  | `DECO_OTEL_METRICS_ENDPOINT` | OTLP metrics POST | `""` (unset) | OTLP meter is not created; AE-only metrics |
183
190
  | `DECO_OTEL_LOGS_ENDPOINT` | OTLP error-log POST | `""` (unset) | Error logs ride CF Destinations only (head-sampled) |
191
+ | `DECO_OTEL_TRACES_ENDPOINT` | OTLP traces POST | `""` (unset) | Framework `deco.*` spans drop unless CF Traces is on |
184
192
 
185
- Both are opt-out via `OtelOptions.otlpMetricsEnabled: false` / `otlpErrorLogsEnabled: false` if you need to disable them at boot for a specific environment without changing the env vars.
193
+ All three are opt-out via `OtelOptions.otlpMetricsEnabled: false` / `otlpErrorLogsEnabled: false` / `otlpTracesEnabled: false` if you need to disable them at boot for a specific environment without changing the env vars. Traces honor `OtelOptions.otlpTracesSamplingRate` (default `0.01` to match CF Destinations) — sampling decisions are consistent per trace (`FNV-1a` hash of `trace_id`), so child spans are kept iff their root is kept. Remote parents that arrive sampled (`traceparent` flags `01`) override the rate and are always exported.
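
The per-trace sampling decision can be sketched as follows. This is an illustration under stated assumptions, not the contents of `otelHttpTracer.ts`: the function names (`fnv1a32`, `isRemoteParentSampled`, `shouldExportTrace`) are hypothetical, and comparing the 32-bit hash scaled to `[0, 1)` against the rate is one common way to turn a hash into a keep/drop decision.

```ts
// 32-bit FNV-1a over the characters of the input (trace ids are ASCII hex).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// W3C traceparent: version-traceid-spanid-flags; bit 0 of flags means sampled.
function isRemoteParentSampled(traceparent?: string | null): boolean {
  if (!traceparent) return false;
  const m = /^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-([0-9a-f]{2})$/.exec(
    traceparent.trim().toLowerCase(),
  );
  if (!m) return false;
  return (parseInt(m[1] ?? "00", 16) & 0x01) === 1;
}

// Hypothetical decision function: same trace id always yields the same answer,
// so every span of a trace is kept or dropped together.
function shouldExportTrace(
  traceId: string,
  rate: number,
  traceparent?: string | null,
): boolean {
  if (isRemoteParentSampled(traceparent)) return true; // remote parent wins
  if (rate >= 1) return true;
  if (rate <= 0) return false;
  return fnv1a32(traceId) / 4294967296 < rate;
}
```

Because the decision depends only on `trace_id`, a child span created in a different isolate reaches the same verdict as its root without any shared state.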
186
194
 
187
195
  ## Log shape (and how to query it)
188
196
 
@@ -213,7 +221,9 @@ In the ClickStack UI you can also filter logs panel by `trace_id` directly — p
213
221
 
214
222
  ## Outbound trace propagation
215
223
 
216
- For any outbound `fetch` issued during a request (VTEX, Shopify, internal APIs), inject a W3C `traceparent` header so upstream services that participate in OTel can join your trace:
224
+ For commerce clients (VTEX, Shopify), `createInstrumentedFetch` injects the W3C `traceparent` header by default. To opt out for a specific endpoint that rejects unknown headers, pass `injectTraceparent: false`.
225
+
226
+ For any other outbound `fetch` issued during a request, inject a `traceparent` header manually so upstream services that participate in OTel can join your trace:
217
227
 
218
228
  ```ts
219
229
  import { injectTraceContext } from "@decocms/start/sdk/observability";
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@decocms/start",
3
- "version": "6.0.1",
3
+ "version": "6.1.0",
4
4
  "type": "module",
5
5
  "description": "Deco framework for TanStack Start - CMS bridge, admin protocol, hooks, schema generation",
6
6
  "main": "./src/index.ts",
@@ -0,0 +1,237 @@
1
+ /**
2
+ * Phase 2 (D-11) coverage for the metric surface — canonical label set,
3
+ * cache_layer, commerce_request_duration_ms. The Phase 1 logger/trace
4
+ * tests live under `src/sdk/logger.test.ts` and `src/sdk/otel.test.ts`;
5
+ * this file focuses on the middleware-level helpers.
6
+ */
7
+ import { afterEach, beforeEach, describe, expect, it } from "vitest";
8
+ import {
9
+ configureMeter,
10
+ type MeterAdapter,
11
+ MetricNames,
12
+ recordCacheMetric,
13
+ recordCommerceMetric,
14
+ recordRequestMetric,
15
+ statusClassFor,
16
+ } from "./observability";
17
+
18
+ interface Counter {
19
+ name: string;
20
+ value: number;
21
+ labels?: Record<string, unknown>;
22
+ }
23
+ interface Histogram {
24
+ name: string;
25
+ value: number;
26
+ labels?: Record<string, unknown>;
27
+ }
28
+
29
+ function captureMeter(): {
30
+ adapter: MeterAdapter;
31
+ counters: Counter[];
32
+ histograms: Histogram[];
33
+ } {
34
+ const counters: Counter[] = [];
35
+ const histograms: Histogram[] = [];
36
+ const adapter: MeterAdapter = {
37
+ counterInc(name, value, labels) {
38
+ counters.push({ name, value: value ?? 1, labels });
39
+ },
40
+ histogramRecord(name, value, labels) {
41
+ histograms.push({ name, value, labels });
42
+ },
43
+ };
44
+ return { adapter, counters, histograms };
45
+ }
46
+
47
+ describe("statusClassFor", () => {
48
+ it("maps 2xx / 3xx / 4xx / 5xx to canonical class labels", () => {
49
+ expect(statusClassFor(200)).toBe("2xx");
50
+ expect(statusClassFor(204)).toBe("2xx");
51
+ expect(statusClassFor(301)).toBe("3xx");
52
+ expect(statusClassFor(404)).toBe("4xx");
53
+ expect(statusClassFor(500)).toBe("5xx");
54
+ expect(statusClassFor(503)).toBe("5xx");
55
+ });
56
+
57
+ it("returns 'unknown' for out-of-range / NaN / non-numeric inputs", () => {
58
+ expect(statusClassFor(-1)).toBe("unknown");
59
+ expect(statusClassFor(99)).toBe("unknown");
60
+ expect(statusClassFor(600)).toBe("unknown");
61
+ expect(statusClassFor(Number.NaN)).toBe("unknown");
62
+ expect(statusClassFor(Infinity)).toBe("unknown");
63
+ });
64
+ });
65
+
66
+ describe("recordRequestMetric — canonical labels (D-11)", () => {
67
+ afterEach(() => {
68
+ // Reset meter so other tests start clean.
69
+ configureMeter({ counterInc: () => {} });
70
+ });
71
+
72
+ it("stamps method + route_pattern + status + status_class by default", () => {
73
+ const { adapter, counters, histograms } = captureMeter();
74
+ configureMeter(adapter);
75
+
76
+ recordRequestMetric("GET", "/products/abc123/p", 200, 42);
77
+
78
+ expect(counters).toHaveLength(1);
79
+ expect(counters[0]?.name).toBe(MetricNames.HTTP_REQUESTS_TOTAL);
80
+ expect(counters[0]?.labels).toMatchObject({
81
+ method: "GET",
82
+ // Default normalization: dynamic segments collapsed.
83
+ route_pattern: "/products/:slug/p",
84
+ status: 200,
85
+ status_class: "2xx",
86
+ });
87
+ expect(histograms).toHaveLength(1);
88
+ expect(histograms[0]?.name).toBe(MetricNames.HTTP_REQUEST_DURATION_MS);
89
+ expect(histograms[0]?.value).toBe(42);
90
+ });
91
+
92
+ it("prefers caller-supplied route_pattern over normalized path", () => {
93
+ const { adapter, counters } = captureMeter();
94
+ configureMeter(adapter);
95
+
96
+ recordRequestMetric("GET", "/anything/random/123", 200, 5, {
97
+ route_pattern: "/_products/$slug/p",
98
+ });
99
+
100
+ expect(counters[0]?.labels?.route_pattern).toBe("/_products/$slug/p");
101
+ });
102
+
103
+ it("emits http_request_errors_total on 5xx", () => {
104
+ const { adapter, counters } = captureMeter();
105
+ configureMeter(adapter);
106
+
107
+ recordRequestMetric("POST", "/checkout", 503, 120);
108
+
109
+ const errCounter = counters.find((c) => c.name === MetricNames.HTTP_REQUEST_ERRORS);
110
+ expect(errCounter).toBeDefined();
111
+ expect(errCounter?.labels?.status_class).toBe("5xx");
112
+ });
113
+
114
+ it("propagates optional labels (outcome, cache_decision, cache_layer, region, extra)", () => {
115
+ const { adapter, counters } = captureMeter();
116
+ configureMeter(adapter);
117
+
118
+ recordRequestMetric("GET", "/", 200, 10, {
119
+ outcome: "ok",
120
+ cache_decision: "STALE-HIT",
121
+ cache_layer: "edge",
122
+ region: "GRU",
123
+ extra: { ab_variant: "B" },
124
+ });
125
+
126
+ expect(counters[0]?.labels).toMatchObject({
127
+ outcome: "ok",
128
+ cache_decision: "STALE-HIT",
129
+ cache_layer: "edge",
130
+ region: "GRU",
131
+ ab_variant: "B",
132
+ });
133
+ });
134
+
135
+ it("is a no-op when no meter is configured", () => {
136
+ // We can't easily prove a no-op other than verifying no throw —
137
+ // safer than calling configureMeter(null), which would mask real
138
+ // bugs. The previous test's `afterEach` reset already gives us a
139
+ // bare meter; this test confirms the call is benign.
140
+ expect(() => recordRequestMetric("GET", "/", 200, 1)).not.toThrow();
141
+ });
142
+ });
143
+
144
+ describe("recordCacheMetric — cache_layer label", () => {
145
+ beforeEach(() => {
146
+ configureMeter({ counterInc: () => {} });
147
+ });
148
+
149
+ it("stamps profile + decision + layer when all are provided", () => {
150
+ const { adapter, counters } = captureMeter();
151
+ configureMeter(adapter);
152
+
153
+ recordCacheMetric(true, "product", "HIT", "edge");
154
+
155
+ expect(counters).toHaveLength(1);
156
+ expect(counters[0]?.name).toBe(MetricNames.CACHE_HIT);
157
+ expect(counters[0]?.labels).toMatchObject({
158
+ profile: "product",
159
+ decision: "HIT",
160
+ layer: "edge",
161
+ });
162
+ });
163
+
164
+ it("emits cache_miss_total when hit=false", () => {
165
+ const { adapter, counters } = captureMeter();
166
+ configureMeter(adapter);
167
+
168
+ recordCacheMetric(false, "search", "MISS", "edge");
169
+
170
+ expect(counters[0]?.name).toBe(MetricNames.CACHE_MISS);
171
+ });
172
+
173
+ it("supports the legacy 3-arg signature for backward compat", () => {
174
+ const { adapter, counters } = captureMeter();
175
+ configureMeter(adapter);
176
+
177
+ recordCacheMetric(true, "static");
178
+
179
+ expect(counters[0]?.labels).toEqual({ profile: "static" });
180
+ });
181
+
182
+ it("distinguishes cachedLoader vs edge vs vtex-swr layers", () => {
183
+ const { adapter, counters } = captureMeter();
184
+ configureMeter(adapter);
185
+
186
+ recordCacheMetric(true, "loader-x", "HIT", "cachedLoader");
187
+ recordCacheMetric(true, "vtex-product", "HIT", "vtex-swr");
188
+
189
+ expect(counters[0]?.labels?.layer).toBe("cachedLoader");
190
+ expect(counters[1]?.labels?.layer).toBe("vtex-swr");
191
+ });
192
+ });
193
+
194
+ describe("recordCommerceMetric (D-11)", () => {
195
+ beforeEach(() => {
196
+ configureMeter({ counterInc: () => {} });
197
+ });
198
+
199
+ it("emits commerce_request_duration_ms with provider + operation labels", () => {
200
+ const { adapter, histograms } = captureMeter();
201
+ configureMeter(adapter);
202
+
203
+ recordCommerceMetric(123, {
204
+ provider: "vtex",
205
+ operation: "intelligent-search.product_search",
206
+ status_class: "2xx",
207
+ });
208
+
209
+ expect(histograms).toHaveLength(1);
210
+ expect(histograms[0]?.name).toBe(MetricNames.COMMERCE_REQUEST_DURATION_MS);
211
+ expect(histograms[0]?.value).toBe(123);
212
+ expect(histograms[0]?.labels).toMatchObject({
213
+ provider: "vtex",
214
+ operation: "intelligent-search.product_search",
215
+ status_class: "2xx",
216
+ });
217
+ });
218
+
219
+ it("includes the cached boolean when provided", () => {
220
+ const { adapter, histograms } = captureMeter();
221
+ configureMeter(adapter);
222
+
223
+ recordCommerceMetric(5, {
224
+ provider: "shopify",
225
+ operation: "graphql.cart_query",
226
+ cached: true,
227
+ });
228
+
229
+ expect(histograms[0]?.labels?.cached).toBe(true);
230
+ });
231
+
232
+ it("is a no-op when no meter is configured", () => {
233
+ expect(() =>
234
+ recordCommerceMetric(1, { provider: "vtex", operation: "test" }),
235
+ ).not.toThrow();
236
+ });
237
+ });
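The tests above depend on a `captureMeter()` helper that is defined outside this excerpt. As a rough sketch of what such a helper could look like, assuming the meter adapter only needs `counterInc` and an optional `histogramRecord` (the two calls visible in the source hunks), the following stand-in records every sample into plain arrays. The names `captureMeterSketch`, `MeterSketch`, and `Sample` are hypothetical and not part of the package.

```typescript
// Hypothetical stand-in for the suite's captureMeter helper; the real one
// is not shown in this diff, so every name and shape here is an assumption.
type LabelValue = string | number | boolean;
type Labels = Record<string, LabelValue>;

interface MeterSketch {
  counterInc(name: string, value: number, labels?: Labels): void;
  // Optional, matching the `m.histogramRecord?.(...)` call style in the source.
  histogramRecord?(name: string, value: number, labels?: Labels): void;
}

interface Sample {
  name: string;
  value: number;
  labels?: Labels;
}

function captureMeterSketch() {
  const counters: Sample[] = [];
  const histograms: Sample[] = [];
  const adapter: MeterSketch = {
    counterInc: (name, value, labels) => {
      counters.push({ name, value, labels });
    },
    histogramRecord: (name, value, labels) => {
      histograms.push({ name, value, labels });
    },
  };
  return { adapter, counters, histograms };
}

// Drive the adapter directly, the way a recorder function would.
const { adapter, counters, histograms } = captureMeterSketch();
adapter.counterInc("cache_miss_total", 1, { profile: "search", layer: "edge" });
adapter.histogramRecord?.("commerce_request_duration_ms", 123, { provider: "vtex" });
```

Because the arrays are returned alongside the adapter, a test can assert on `counters[0]?.labels` or `histograms[0]?.value` without any mocking library.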
@@ -239,25 +239,121 @@ export const MetricNames = {
239
239
  CACHE_MISS: "cache_miss_total",
240
240
  RESOLVE_DURATION_MS: "resolve_duration_ms",
241
241
  FETCH_DURATION_MS: "fetch_duration_ms",
242
+ /**
243
+ * Per-provider outbound commerce fetch duration. Owned by
244
+ * `@decocms/start` (not `@decocms/apps`) so every site emits this
245
+ * histogram unconditionally as soon as it bumps the framework,
246
+ * regardless of apps-start version. Apps register operation strings
247
+ * (`vtex.intelligent-search.product_search`,
248
+ * `shopify.graphql.cart_create`, ...) via `recordCommerceMetric`
249
+ * below; the framework owns the cardinality contract.
250
+ *
251
+ * Canonical labels: `provider`, `operation`, `status_class`, `cached`.
252
+ * See `recordCommerceMetric` for the full label set and Phase 2 in
253
+ * `MIGRATION_TOOLING_PLAN.md` for the rationale.
254
+ */
255
+ COMMERCE_REQUEST_DURATION_MS: "commerce_request_duration_ms",
242
256
  } as const;
243
257
 
258
+ /**
259
+ * Map an HTTP status code to its canonical class label (`2xx` / ... /
260
+ * `5xx`). Out-of-range numbers (e.g. -1 from a thrown fetch) fall back
261
+ * to `"unknown"` so dashboards don't break on edge cases.
262
+ *
263
+ * Exported because callers occasionally need the same mapping for
264
+ * non-metric purposes (logging, tail enrichment).
265
+ */
266
+ export function statusClassFor(status: number): string {
267
+ if (typeof status !== "number" || !Number.isFinite(status)) return "unknown";
268
+ if (status < 100 || status >= 600) return "unknown";
269
+ return `${Math.floor(status / 100)}xx`;
270
+ }
271
+
272
+ /**
273
+ * Optional dimensions stamped on `http_requests_total` /
274
+ * `http_request_duration_ms` / `http_request_errors_total`. All fields
275
+ * are optional — callers pass what they have, the framework fills in
276
+ * the rest from defaults.
277
+ *
278
+ * Cardinality discipline: every field here is bounded. `route_pattern`
279
+ * comes from the TanStack router (a closed set), `outcome` is the CF
280
+ * Workers Observability enum, `cache_decision` / `cache_layer` are
281
+ * union types declared in this module, `region` is a small set of CF
282
+ * colo codes. Status is unbounded by spec but bounded in practice; the
283
+ * `status_class` label bounds the cardinality further for dashboards
284
+ * that don't need the raw value.
285
+ */
286
+ export interface RequestMetricLabels {
287
+ /** TanStack route pattern (`/_products/$slug/p`) — closed set. */
288
+ route_pattern?: string;
289
+ /** Cloudflare Workers Observability `outcome` (`ok`, `exception`, ...). */
290
+ outcome?: string;
291
+ /** Cache layer + decision when known. */
292
+ cache_decision?: CacheDecision;
293
+ cache_layer?: CacheLayer;
294
+ /** Cloudflare colo (`GRU`, `IAD`, ...). */
295
+ region?: string;
296
+ /**
297
+ * Arbitrary extra labels — callers should avoid this and add fields
298
+ * to the typed surface above instead. Kept as an escape hatch so
299
+ * non-canonical experiments don't require a framework release.
300
+ */
301
+ extra?: Record<string, string | number | boolean>;
302
+ }
303
+
244
304
  /**
245
305
  * Record an HTTP request metric.
246
- * Call in middleware after the response is produced.
306
+ *
307
+ * Call in middleware after the response is produced. Two-call surface
308
+ * for backward compat:
309
+ *
310
+ * recordRequestMetric(method, path, status, durationMs)
311
+ * recordRequestMetric(method, path, status, durationMs, labels)
312
+ *
313
+ * The labels argument is optional. Callers that omit it still emit
314
+ * `method`, `route_pattern` (the normalized path, labelled `path`
315
+ * before this change), `status`, and the derived `status_class`.
316
+ * Optional labels only add keys; `extra` is merged last and can overwrite.
247
317
  */
248
318
  export function recordRequestMetric(
249
319
  method: string,
250
320
  path: string,
251
321
  status: number,
252
322
  durationMs: number,
323
+ labels?: RequestMetricLabels,
253
324
  ) {
254
325
  const m = getState().meter;
255
326
  if (!m) return;
256
- const labels: Labels = { method, path: normalizePath(path), status };
257
- m.counterInc(MetricNames.HTTP_REQUESTS_TOTAL, 1, labels);
258
- m.histogramRecord?.(MetricNames.HTTP_REQUEST_DURATION_MS, durationMs, labels);
327
+ // Cardinality discipline:
328
+ // - `method`: small (GET, POST, ...).
329
+ // - `route_pattern`: closed set (caller-supplied) OR normalized path
330
+ // (fallback). Either way bounded.
331
+ // - `status`: full HTTP code (bounded ~50 values in practice).
332
+ // - `status_class`: 5-element enum (2xx / 3xx / 4xx / 5xx / unknown).
333
+ // - `outcome`: CF outcome enum (~7 values).
334
+ // - `cache_decision`: 5-element enum.
335
+ // - `cache_layer`: 3-element enum (edge / cachedLoader / vtex-swr).
336
+ // - `region`: ~250 CF colo codes worldwide.
337
+ // Total combinations are bounded — safe for unbounded series on
338
+ // ClickHouse but operators should still avoid grouping by `region`
339
+ // unless explicitly needed.
340
+ const merged: Labels = {
341
+ method,
342
+ route_pattern: labels?.route_pattern ?? normalizePath(path),
343
+ status,
344
+ status_class: statusClassFor(status),
345
+ };
346
+ if (labels?.outcome) merged.outcome = labels.outcome;
347
+ if (labels?.cache_decision) merged.cache_decision = labels.cache_decision;
348
+ if (labels?.cache_layer) merged.cache_layer = labels.cache_layer;
349
+ if (labels?.region) merged.region = labels.region;
350
+ if (labels?.extra) {
351
+ for (const [k, v] of Object.entries(labels.extra)) merged[k] = v;
352
+ }
353
+ m.counterInc(MetricNames.HTTP_REQUESTS_TOTAL, 1, merged);
354
+ m.histogramRecord?.(MetricNames.HTTP_REQUEST_DURATION_MS, durationMs, merged);
259
355
  if (status >= 500) {
260
- m.counterInc(MetricNames.HTTP_REQUEST_ERRORS, 1, labels);
356
+ m.counterInc(MetricNames.HTTP_REQUEST_ERRORS, 1, merged);
261
357
  }
262
358
  }
263
359
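The label-merging behaviour in the hunk above can be exercised in isolation. The sketch below copies the `statusClassFor` logic from the diff and wraps the merge step in a hypothetical `mergeRequestLabels` helper. The real function inlines the merge and falls back to `normalizePath(path)`; here the fallback route is simply passed in as an argument, which is an assumption made to keep the example self-contained.

```typescript
// Standalone sketch mirroring the hunk above; not the package source.
type LabelValue = string | number | boolean;
type Labels = Record<string, LabelValue>;

// Same mapping as the diff: 100..599 -> "1xx".."5xx", anything else -> "unknown".
function statusClassFor(status: number): string {
  if (typeof status !== "number" || !Number.isFinite(status)) return "unknown";
  if (status < 100 || status >= 600) return "unknown";
  return `${Math.floor(status / 100)}xx`;
}

interface RequestLabelsSketch {
  route_pattern?: string;
  outcome?: string;
  cache_decision?: string;
  cache_layer?: string;
  region?: string;
  extra?: Record<string, LabelValue>;
}

// Hypothetical helper: `fallbackRoute` stands in for normalizePath(path).
function mergeRequestLabels(
  method: string,
  fallbackRoute: string,
  status: number,
  labels?: RequestLabelsSketch,
): Labels {
  const merged: Labels = {
    method,
    route_pattern: labels?.route_pattern ?? fallbackRoute,
    status,
    status_class: statusClassFor(status),
  };
  if (labels?.outcome) merged.outcome = labels.outcome;
  if (labels?.cache_decision) merged.cache_decision = labels.cache_decision;
  if (labels?.cache_layer) merged.cache_layer = labels.cache_layer;
  if (labels?.region) merged.region = labels.region;
  if (labels?.extra) {
    // Merged last, so an `extra` key that collides with a canonical label wins.
    for (const [k, v] of Object.entries(labels.extra)) merged[k] = v;
  }
  return merged;
}

const plain = mergeRequestLabels("GET", "/", 200);
const overridden = mergeRequestLabels("GET", "/", 503, {
  region: "GRU",
  extra: { ab_variant: "B", status: "shadowed" },
});
```

Two details are worth noting. `statusClassFor(101)` yields `"1xx"`, so the class label has six possible values rather than five. Because `extra` is applied after the canonical fields, a caller that reuses a canonical key such as `status` replaces the framework's value, which is one more reason to prefer the typed fields.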
 
@@ -272,22 +368,45 @@ export function recordRequestMetric(
272
368
  */
273
369
  export type CacheDecision = "HIT" | "STALE-HIT" | "STALE-ERROR" | "MISS" | "BYPASS";
274
370
 
371
+ /**
372
+ * Where the cache lives. Phase 2 label expansion (D-11).
373
+ * - `edge` — Cloudflare Cache API (HTML pages, server-fn responses)
374
+ * - `cachedLoader` — In-memory per-isolate via `sdk/cachedLoader.ts`
375
+ * (loader-level SWR, dedup, in-flight)
376
+ * - `vtex-swr` — Apps-side in-memory cache shared by VTEX clients
377
+ * (intelligent-search, cross-selling, etc.)
378
+ */
379
+ export type CacheLayer = "edge" | "cachedLoader" | "vtex-swr";
380
+
275
381
  /**
276
382
  * Record a cache hit/miss metric. Also stamps the decision on the active
277
383
  * trace span (when one exists) as `deco.cache.decision` / `deco.cache.profile`
278
384
  * so operators can filter ClickStack traces by cache decision directly,
279
385
  * without joining to metrics.
280
386
  *
281
- * `decision` is optional — when omitted, the metric still records HIT vs MISS
282
- * but dashboards can't distinguish SWR/SIE paths. Pass it whenever known.
387
+ * Backward-compatible signature:
388
+ * recordCacheMetric(hit, profile?, decision?)
389
+ * recordCacheMetric(hit, profile?, decision?, layer?)
390
+ *
391
+ * `decision` is optional — when omitted, the metric still records HIT
392
+ * vs MISS but dashboards can't distinguish SWR/SIE paths. Pass it
393
+ * whenever known. `layer` defaults to `edge` when called from
394
+ * workerEntry; cachedLoader / vtex-swr call sites should pass their
395
+ * value explicitly.
283
396
  */
284
- export function recordCacheMetric(hit: boolean, profile?: string, decision?: CacheDecision) {
397
+ export function recordCacheMetric(
398
+ hit: boolean,
399
+ profile?: string,
400
+ decision?: CacheDecision,
401
+ layer?: CacheLayer,
402
+ ) {
285
403
  // Stamp on the active span FIRST so the attribute survives even if the
286
404
  // meter is a no-op (e.g. on tests, or in dev without DECO_METRICS).
287
405
  const active = getActiveSpan();
288
406
  if (active) {
289
407
  if (decision) active.setAttribute?.("deco.cache.decision", decision);
290
408
  if (profile) active.setAttribute?.("deco.cache.profile", profile);
409
+ if (layer) active.setAttribute?.("deco.cache.layer", layer);
291
410
  }
292
411
 
293
412
  const m = getState().meter;
@@ -295,9 +414,47 @@ export function recordCacheMetric(hit: boolean, profile?: string, decision?: Cac
295
414
  const labels: Labels = {};
296
415
  if (profile) labels.profile = profile;
297
416
  if (decision) labels.decision = decision;
417
+ if (layer) labels.layer = layer;
298
418
  m.counterInc(hit ? MetricNames.CACHE_HIT : MetricNames.CACHE_MISS, 1, labels);
299
419
  }
300
420
 
421
+ /**
422
+ * Labels for `commerce_request_duration_ms`. Owned by the framework so
423
+ * apps-start (and any future provider package) can register operation
424
+ * strings without owning the histogram declaration. Phase 2 (D-11).
425
+ */
426
+ export interface CommerceMetricLabels {
427
+ /** `vtex`, `shopify`, `wake`, ... — small closed set. */
428
+ provider: string;
429
+ /** Per-provider operation, e.g. `intelligent-search.product_search`. */
430
+ operation: string;
431
+ /** Set when known (e.g. from the HTTP response). Bounded enum. */
432
+ status_class?: string;
433
+ /** Whether the underlying fetch was served from a cache. */
434
+ cached?: boolean;
435
+ }
436
+
437
+ /**
438
+ * Record a commerce / outbound-fetch duration sample. No-op when no
439
+ * meter is configured. The metric name is constant
440
+ * (`commerce_request_duration_ms`) — providers vary by the `provider`
441
+ * label, not by name, so dashboards aggregate cleanly across the fleet.
442
+ */
443
+ export function recordCommerceMetric(
444
+ durationMs: number,
445
+ labels: CommerceMetricLabels,
446
+ ) {
447
+ const m = getState().meter;
448
+ if (!m) return;
449
+ const merged: Labels = {
450
+ provider: labels.provider,
451
+ operation: labels.operation,
452
+ };
453
+ if (labels.status_class) merged.status_class = labels.status_class;
454
+ if (typeof labels.cached === "boolean") merged.cached = labels.cached;
455
+ m.histogramRecord?.(MetricNames.COMMERCE_REQUEST_DURATION_MS, durationMs, merged);
456
+ }
457
+
301
458
  function normalizePath(path: string): string {
302
459
  // Collapse dynamic segments to reduce cardinality
303
460
  return path
@@ -99,13 +99,13 @@ export function createCachedLoader<TProps, TResult>(
99
99
  const inflight = inflightRequests.get(cacheKey);
100
100
  if (inflight) {
101
101
  // Treat in-flight dedup as a cache hit — avoided the origin call.
102
- recordCacheMetric(true, name);
102
+ recordCacheMetric(true, name, undefined, "cachedLoader");
103
103
  return inflight as Promise<TResult>;
104
104
  }
105
105
 
106
106
  if (isDev) {
107
107
  // Dev mode: no caching, but still useful to count attempts.
108
- recordCacheMetric(false, name);
108
+ recordCacheMetric(false, name, undefined, "cachedLoader");
109
109
  const promise = withTracing(
110
110
  "deco.cachedLoader",
111
111
  () => loaderFn(props).finally(() => inflightRequests.delete(cacheKey)),
@@ -121,20 +121,20 @@ export function createCachedLoader<TProps, TResult>(
121
121
 
122
122
  if (policy === "no-cache") {
123
123
  if (entry && !isStale) {
124
- recordCacheMetric(true, name);
124
+ recordCacheMetric(true, name, "HIT", "cachedLoader");
125
125
  return entry.value;
126
126
  }
127
127
  }
128
128
 
129
129
  if (policy === "stale-while-revalidate") {
130
130
  if (entry && !isStale) {
131
- recordCacheMetric(true, name);
131
+ recordCacheMetric(true, name, "HIT", "cachedLoader");
132
132
  return entry.value;
133
133
  }
134
134
 
135
135
  if (entry && isStale && !entry.refreshing) {
136
136
  // Stale-while-revalidate hit: serve stale, refresh in background.
137
- recordCacheMetric(true, name);
137
+ recordCacheMetric(true, name, "STALE-HIT", "cachedLoader");
138
138
  entry.refreshing = true;
139
139
  loaderFn(props)
140
140
  .then((result) => {
@@ -156,14 +156,17 @@ export function createCachedLoader<TProps, TResult>(
156
156
  }
157
157
 
158
158
  if (entry) {
159
- recordCacheMetric(true, name);
159
+ // Past SIE window — still serve the stale value once but mark
160
+ // the decision as STALE-ERROR so dashboards can distinguish
161
+ // this from healthy SWR.
162
+ recordCacheMetric(true, name, "STALE-ERROR", "cachedLoader");
160
163
  return entry.value;
161
164
  }
162
165
  }
163
166
 
164
167
  // Cache miss — emit metric, then run loader inside a span so individual
165
168
  // slow loaders are visible in traces.
166
- recordCacheMetric(false, name);
169
+ recordCacheMetric(false, name, "MISS", "cachedLoader");
167
170
  const promise = withTracing("deco.cachedLoader", () => loaderFn(props), {
168
171
  "deco.loader": name,
169
172
  "deco.cache.policy": policy,