npm - @decocms/start - Versions diffs - 6.0.1 → 6.2.0 - Mend

@decocms/start 6.0.1 → 6.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

package/MIGRATION_TOOLING_PLAN.md +9 -0
package/docs/observability.md +20 -10
package/docs/rum-plan.md +209 -0
package/docs/runbooks/README.md +40 -0
package/docs/runbooks/cache-hit-drop.md +83 -0
package/docs/runbooks/commerce-upstream-slow.md +88 -0
package/docs/runbooks/http-error-spike.md +98 -0
package/docs/runbooks/http-latency-spike.md +82 -0
package/docs/runbooks/tail-exception-spike.md +100 -0
package/package.json +1 -1
package/scripts/audit-observability-config.test.ts +251 -1
package/scripts/audit-observability-config.ts +227 -26
package/src/middleware/observability.test.ts +237 -0
package/src/middleware/observability.ts +165 -8
package/src/sdk/cachedLoader.ts +10 -7
package/src/sdk/logger.test.ts +99 -0
package/src/sdk/logger.ts +18 -7
package/src/sdk/observability.ts +18 -0
package/src/sdk/otel.ts +228 -38
package/src/sdk/otelHttpTracer.test.ts +422 -0
package/src/sdk/otelHttpTracer.ts +489 -0
package/src/sdk/requestContext.ts +46 -0
package/src/sdk/workerEntry.ts +138 -17

package/docs/runbooks/http-latency-spike.md ADDED

@@ -0,0 +1,82 @@
+# Runbook: `http-latency-spike`
+> A site's p95 latency exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
+## What this alert means
+User-perceived latency on this site is statistically abnormal vs the
+last 24 hours. Latency rarely degrades in isolation — almost always
+something else is bottlenecked underneath. Use this alert as the
+"something is wrong, look around" signal, then triangulate.
+## First check (60 seconds)
+Open the dashboard's **commerce p95 by provider/operation** panel. The
+most common cause of p95 spikes is an upstream commerce API (VTEX,
+Shopify) slowing down — and our SSR is synchronous on the upstream
+call.
+If commerce p95 spiked at the same moment, jump to
+[`commerce-upstream-slow.md`](./commerce-upstream-slow.md).
+## Diagnostic queries
+```sql
+-- Latency p95 by route_pattern, last hour
+SELECT
+  toStartOfInterval(TimeUnix, INTERVAL 5 MINUTE) AS t,
+  Attributes['route_pattern'] AS route,
+  quantileBFloat16(0.95)(toFloat64(Sum / nullIf(Count, 0))) AS p95
+FROM otel_metrics_histogram
+WHERE MetricName = 'http_request_duration_ms'
+  AND ServiceName = '{site}'
+  AND TimeUnix > now() - INTERVAL 1 HOUR
+GROUP BY t, route
+ORDER BY t, p95 DESC;
+```
+```sql
+-- Cache decision distribution — did hit rate drop while latency rose?
+SELECT
+  Attributes['cache_decision'] AS decision,
+  count() AS n,
+  avg(toFloat64(Sum / nullIf(Count, 0))) AS avg_ms
+FROM otel_metrics_histogram
+WHERE MetricName = 'http_request_duration_ms'
+  AND ServiceName = '{site}'
+  AND TimeUnix > now() - INTERVAL 30 MINUTE
+GROUP BY decision
+ORDER BY n DESC;
+```
+```sql
+-- Slow traces with full span breakdown (sampled ~1%, so re-run if empty)
+SELECT TraceId, SpanName, Duration / 1e6 AS ms, SpanAttributes['url.path'] AS path
+FROM otel_traces
+WHERE ServiceName = '{site}'
+  AND Timestamp > now() - INTERVAL 30 MINUTE
+  AND SpanName = 'deco.http.request'
+  AND (Duration / 1e6) > 2000
+ORDER BY Duration DESC
+LIMIT 50;
+```
+## Common causes & fixes
+| Rank | Cause                                                | How to confirm                                                                                | Fix                                                                                  |
+|------|------------------------------------------------------|-----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
+| 1    | Upstream commerce API slow                           | Commerce p95 panel spikes with the same shape                                                 | See [`commerce-upstream-slow.md`](./commerce-upstream-slow.md).                       |
+| 2    | Cache hit rate dropped (cold cache after deploy/purge) | Cache panel shows MISS share rose at spike start; usually self-heals within 5-10m            | Wait it out unless sustained; if sustained check the route-level cache profile.      |
+| 3    | One specific route is slow (heavy loader added)      | Per-route p95 query shows one `route_pattern` dominating                                      | Inspect recent commits to that route's loader. Consider deferring sections via `Lazy`. |
+| 4    | Cloudflare edge / colo issue                         | `region` label distribution skewed to one or two colos                                        | Check CF status page; usually clears on its own.                                     |
+## Escalation
+- 30 minutes without resolution → page the site team owner.
+- All sites in a region affected → suspect CF infra; check status.cloudflare.com.
+## Post-mortem hook
+- A representative slow `TraceId` from the third query above.
+- The cache hit rate before/during the spike.
+- Deploy version at the start of the window.
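
The alert condition quoted at the top of this runbook (p95 above the site's own 24h rolling baseline by 3σ for at least 10 minutes) can be sketched roughly as below. This is a minimal illustration, not the alerting pipeline's actual implementation: the function name `isLatencySpike`, the one-sample-per-minute layout, and the use of a population standard deviation are all assumptions made for the example.

```typescript
// Hypothetical sketch: none of these names come from @decocms/start.
// Assumes one p95 sample (in ms) per minute for both arrays.

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Population standard deviation (an assumption; the real alert may differ).
function stddev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

/**
 * Returns true when each of the last `sustainMinutes` samples in `recent`
 * exceeds mean(baseline) + sigmas * stddev(baseline).
 */
function isLatencySpike(
  baseline: number[],
  recent: number[],
  sigmas = 3,
  sustainMinutes = 10,
): boolean {
  if (baseline.length < 2 || recent.length < sustainMinutes) return false;
  const threshold = mean(baseline) + sigmas * stddev(baseline);
  return recent.slice(-sustainMinutes).every((v) => v > threshold);
}
```

Requiring every sample in the window to exceed the threshold mirrors the "for ≥ 10 minutes" wording: a single fast minute resets the condition, which keeps short blips from paging anyone.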

package/docs/runbooks/tail-exception-spike.md ADDED

@@ -0,0 +1,100 @@
+# Runbook: `tail-exception-spike`
+> A site's tail-worker `_outcome=exception` count exceeded its own 24h rolling baseline by 3σ for ≥ 10 minutes.
+## What this alert means
+Real, uncaught exceptions are happening in the Worker — captured by
+the tail consumer with 100% fidelity (`deco-otel-tail`). After Phase 1
+severity reclassification, this alert specifically excludes `canceled`
+and `responseStreamDisconnected` outcomes (those are client-disconnect
+noise, not bugs). What's left is a true bug, OOM, or CPU-limit kill.
+## First check (60 seconds)
+```sql
+-- What's blowing up, last 15 minutes
+SELECT Body, LogAttributes['url.path'] AS path, count() AS n
+FROM otel_logs
+WHERE ServiceName = '{site}'
+  AND SeverityText = 'ERROR'
+  AND LogAttributes['_source'] = 'tail-worker'
+  AND LogAttributes['_outcome'] = 'exception'
+  AND Timestamp > now() - INTERVAL 15 MINUTE
+GROUP BY Body, path
+ORDER BY n DESC
+LIMIT 30;
+```
+If 90% of the rows share the same `Body` (same exception class /
+message), that's the bug — proceed to "Common causes" #1.
+If the exceptions are scattered across many distinct messages, you
+likely have a resource problem (OOM / CPU limit) — proceed to #2.
+## Diagnostic queries
+```sql
+-- Outcome distribution — separate exception from exceededMemory / exceededCpu
+SELECT
+  LogAttributes['_outcome'] AS outcome,
+  count() AS n
+FROM otel_logs
+WHERE ServiceName = '{site}'
+  AND LogAttributes['_source'] = 'tail-worker'
+  AND Timestamp > now() - INTERVAL 30 MINUTE
+GROUP BY outcome
+ORDER BY n DESC;
+```
+```sql
+-- Did a specific deploy cause it?
+SELECT
+  LogAttributes['service.version'] AS version,
+  LogAttributes['_outcome'] AS outcome,
+  count() AS n
+FROM otel_logs
+WHERE ServiceName = '{site}'
+  AND LogAttributes['_source'] = 'tail-worker'
+  AND Timestamp > now() - INTERVAL 1 HOUR
+GROUP BY version, outcome
+ORDER BY n DESC;
+```
+```sql
+-- Pull the full record for one offending request to get request.id
+-- and trace.id for join queries
+SELECT *
+FROM otel_logs
+WHERE ServiceName = '{site}'
+  AND SeverityText = 'ERROR'
+  AND LogAttributes['_source'] = 'tail-worker'
+  AND LogAttributes['_outcome'] = 'exception'
+  AND Timestamp > now() - INTERVAL 15 MINUTE
+ORDER BY Timestamp DESC
+LIMIT 1;
+```
+## Common causes & fixes
+| Rank | Cause                                              | How to confirm                                                                | Fix                                                                                                              |
+|------|----------------------------------------------------|-------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
+| 1    | A single uncaught throw, recent deploy             | Same `Body` dominates; one `service.version` correlates                       | Roll back the deploy. File a bug with the offending stack + `request.id` for repro. Add a try/catch + structured `logger.error`. |
+| 2    | `exceededMemory` (OOM)                             | Outcome query shows non-trivial `exceededMemory` count                        | Look for large in-memory buffers — a `Response.text()` on a multi-MB upstream, a runaway `JSON.parse`. See [`deco-site-memory-debugging`](https://github.com/decocms/deco-start/blob/main/.cursor/skills/deco-site-memory-debugging/SKILL.md) skill. |
+| 3    | `exceededCpu` (CPU-limit kill)                    | Outcome query shows `exceededCpu`                                            | Investigate a section with a heavy synchronous loop. Move work to a server function or shed load via cache.       |
+| 4    | A new upstream returning malformed responses      | `Body` references a third-party hostname; matches a known endpoint           | Add defensive parsing + a structured `logger.error` so the throw becomes a typed error, not a crash.             |
+## Escalation
+- `exceededMemory` / `exceededCpu` sustained → page site team + platform on-call. May indicate a leak that will recur until isolate restart.
+- A throw we can't decode in 15 minutes → page site team owner.
+## Post-mortem hook
+- One full record from query #3 above — preserves the
+  `request.id` / `trace.id` for cross-channel correlation.
+- The dominant `Body` (the exception message).
+- The `service.version` window.
+- Whether the alert fired on `exception` or `exceededMemory` /
+  `exceededCpu` — drives whether the post-mortem investigates code or
+  resource bounds.
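
The "First check" rule in this runbook (one dominant `Body` points at a single bug, scattered messages suggest a resource problem) could be applied to the query's result rows along these lines. This is a sketch under stated assumptions: the `ExceptionRow` shape, the `triageExceptions` name, and weighting the 90% share by the per-row count `n` rather than by the number of rows are choices made for the example, not something the runbook specifies.

```typescript
// Hypothetical sketch: row shape mirrors the first-check query's columns
// (Body, path, n); nothing here is part of @decocms/start.
interface ExceptionRow {
  body: string;
  path: string;
  n: number;
}

type Triage = "dominant-message" | "scattered" | "no-data";

/**
 * Classifies exception rows by how concentrated they are on one message.
 * `dominance` is the share of all exceptions one Body must reach
 * (0.9 follows the runbook's 90% rule of thumb).
 */
function triageExceptions(rows: ExceptionRow[], dominance = 0.9): Triage {
  const total = rows.reduce((sum, r) => sum + r.n, 0);
  if (total === 0) return "no-data";

  // The query groups by (Body, path), so fold paths together per Body.
  const byBody = new Map<string, number>();
  for (const r of rows) {
    byBody.set(r.body, (byBody.get(r.body) ?? 0) + r.n);
  }

  const top = Math.max(...Array.from(byBody.values()));
  return top / total >= dominance ? "dominant-message" : "scattered";
}
```

A `"dominant-message"` result corresponds to cause #1 in the table above; `"scattered"` is only a hint toward #2 and #3, and the outcome-distribution query is still needed to confirm `exceededMemory` or `exceededCpu`.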

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@decocms/start",
-  "version": "6.0.1",
+  "version": "6.2.0",
   "type": "module",
   "description": "Deco framework for TanStack Start - CMS bridge, admin protocol, hooks, schema generation",
   "main": "./src/index.ts",
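
The test file `audit-observability-config.test.ts` in this diff pins down an exit-code contract for the audit CLI: warn mode always exits 0, block mode exits 1 only when an error-severity finding is present, and an unrecognised `--mode` value exits 2. The sketch below restates that contract as pure functions. It is inferred from the test expectations alone, not copied from `audit-observability-config.ts`; the helper names `parseMode`, `exitCodeFor`, and `githubAnnotation` are hypothetical, and the annotation level for warn-severity findings in block mode is an assumption the tests do not cover.

```typescript
// Hypothetical sketch reconstructed from the test expectations only.
type Severity = "error" | "warn";
type Mode = "warn" | "block";

interface Finding {
  id: string;
  severity: Severity;
}

// Default mode is "warn"; anything other than warn|block is rejected
// (the CLI tests expect exit code 2 in that case).
function parseMode(raw: string | undefined): Mode | null {
  if (raw === undefined) return "warn";
  return raw === "warn" || raw === "block" ? raw : null;
}

// Only block mode with at least one error-severity finding fails the check.
function exitCodeFor(mode: Mode, findings: Finding[]): 0 | 1 {
  return mode === "block" && findings.some((f) => f.severity === "error") ? 1 : 0;
}

// Annotation prefix only. Errors escalate to `::error` solely in block mode;
// treating warn-severity findings as `::warning` in block mode is assumed.
function githubAnnotation(mode: Mode, finding: Finding): string {
  const level = mode === "block" && finding.severity === "error" ? "error" : "warning";
  return `::${level} title=${finding.id}::`;
}
```

Keeping the annotation level tied to whether the check can actually fail matches the comment in the tests about not emitting GitHub `error` annotations when the run will exit 0.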

package/scripts/audit-observability-config.test.ts CHANGED

@@ -2,7 +2,11 @@ import * as fs from "node:fs";
 import * as os from "node:os";
 import * as path from "node:path";
 import { afterEach, beforeEach, describe, expect, it } from "vitest";
-import { auditObservabilityBlock } from "./audit-observability-config";
+import {
+  auditFleetBindings,
+  auditObservabilityBlock,
+  auditWranglerConfig,
+} from "./audit-observability-config";
 import { parseJsonc, stripJsoncTrailingCommas } from "./lib/jsonc";
 describe("auditObservabilityBlock", () => {
@@ -138,6 +142,125 @@ describe("auditObservabilityBlock", () => {
   });
 });
+describe("auditFleetBindings (D-14)", () => {
+  const canonicalBindings = {
+    version_metadata: { binding: "CF_VERSION_METADATA" },
+    analytics_engine_datasets: [{ binding: "DECO_METRICS", dataset: "deco_metrics_site" }],
+    tail_consumers: [{ service: "deco-otel-tail" }],
+    vars: {
+      DECO_OTEL_METRICS_ENDPOINT: "https://deco-otel-ingest.example/v1/metrics",
+      DECO_OTEL_TRACES_ENDPOINT: "https://deco-otel-ingest.example/v1/traces",
+      DECO_OTEL_LOGS_ENDPOINT: "https://deco-otel-ingest.example/v1/logs",
+    },
+  };
+  it("returns no findings for canonical bindings", () => {
+    expect(auditFleetBindings(canonicalBindings)).toEqual([]);
+  });
+  it("flags version_metadata_binding_missing as error", () => {
+    const { version_metadata: _, ...rest } = canonicalBindings;
+    const findings = auditFleetBindings(rest);
+    const f = findings.find((x) => x.id === "version_metadata_binding_missing");
+    expect(f?.severity).toBe("error");
+  });
+  it("flags version_metadata_binding_missing when binding is empty", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      version_metadata: { binding: "" },
+    });
+    const f = findings.find((x) => x.id === "version_metadata_binding_missing");
+    expect(f).toBeDefined();
+  });
+  it("flags analytics_engine_binding_missing as warn", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      analytics_engine_datasets: [],
+    });
+    const f = findings.find((x) => x.id === "analytics_engine_binding_missing");
+    expect(f?.severity).toBe("warn");
+  });
+  it("flags analytics_engine_binding_missing when binding name doesn't match DECO_METRICS", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      analytics_engine_datasets: [{ binding: "OTHER_NAME" }],
+    });
+    expect(findings.some((f) => f.id === "analytics_engine_binding_missing")).toBe(true);
+  });
+  it("flags tail_consumer_missing as error", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      tail_consumers: [],
+    });
+    const f = findings.find((x) => x.id === "tail_consumer_missing");
+    expect(f?.severity).toBe("error");
+  });
+  it("flags tail_consumer_missing when an unrelated tail consumer is configured", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      tail_consumers: [{ service: "another-tail" }],
+    });
+    expect(findings.some((f) => f.id === "tail_consumer_missing")).toBe(true);
+  });
+  it("flags otel_metrics_endpoint_missing when DECO_OTEL_METRICS_ENDPOINT is unset", () => {
+    const findings = auditFleetBindings({
+      ...canonicalBindings,
+      vars: {
+        ...canonicalBindings.vars,
+        DECO_OTEL_METRICS_ENDPOINT: "",
+      },
+    });
+    expect(findings.some((f) => f.id === "otel_metrics_endpoint_missing")).toBe(true);
+  });
+  it("flags otel_traces_endpoint_missing when DECO_OTEL_TRACES_ENDPOINT is missing", () => {
+    const { vars: _vars, ...rest } = canonicalBindings;
+    const findings = auditFleetBindings(rest);
+    expect(findings.some((f) => f.id === "otel_traces_endpoint_missing")).toBe(true);
+    expect(findings.some((f) => f.id === "otel_logs_endpoint_missing")).toBe(true);
+    expect(findings.some((f) => f.id === "otel_metrics_endpoint_missing")).toBe(true);
+  });
+  it("handles missing vars object gracefully", () => {
+    expect(() => auditFleetBindings({ vars: undefined })).not.toThrow();
+  });
+});
+describe("auditWranglerConfig — composition", () => {
+  it("composes observability + fleet rules", () => {
+    const findings = auditWranglerConfig({});
+    const ids = findings.map((f) => f.id);
+    expect(ids).toContain("observability_missing");
+    expect(ids).toContain("version_metadata_binding_missing");
+    expect(ids).toContain("tail_consumer_missing");
+  });
+  it("returns no findings on a fully canonical wrangler", () => {
+    const findings = auditWranglerConfig({
+      observability: {
+        enabled: true,
+        logs: { enabled: true, head_sampling_rate: 1, persist: true },
+        traces: { enabled: true, head_sampling_rate: 0.01, persist: true },
+      },
+      version_metadata: { binding: "CF_VERSION_METADATA" },
+      analytics_engine_datasets: [{ binding: "DECO_METRICS", dataset: "deco_metrics_x" }],
+      tail_consumers: [{ service: "deco-otel-tail" }],
+      vars: {
+        DECO_OTEL_METRICS_ENDPOINT: "https://ingest.example/v1/metrics",
+        DECO_OTEL_TRACES_ENDPOINT: "https://ingest.example/v1/traces",
+        DECO_OTEL_LOGS_ENDPOINT: "https://ingest.example/v1/logs",
+      },
+    });
+    expect(findings).toEqual([]);
+  });
+});
 describe("JSONC handling — trailing commas + comments", () => {
   it("stripJsoncTrailingCommas removes commas before `}` and `]`", () => {
     expect(stripJsoncTrailingCommas(`{ "a": 1, "b": 2, }`)).toBe(`{ "a": 1, "b": 2 }`);
@@ -165,6 +288,133 @@ describe("JSONC handling — trailing commas + comments", () => {
   });
 });
+describe("CLI gate hardness (D-16) — --mode warn|block + --github", () => {
+  let tmpdir: string;
+  const cliPath = path.resolve(__dirname, "audit-observability-config.ts");
+  beforeEach(() => {
+    tmpdir = fs.mkdtempSync(path.join(os.tmpdir(), "audit-mode-"));
+  });
+  afterEach(() => {
+    fs.rmSync(tmpdir, { recursive: true, force: true });
+  });
+  // Spawn the script via tsx in a child process so we exercise the real
+  // `process.exit()` paths instead of monkey-patching them. This is the
+  // contract storefront CI consumes, so it's the contract under test.
+  function runCli(args: string[]): {
+    status: number | null;
+    stdout: string;
+    stderr: string;
+  } {
+    const { spawnSync } = require("node:child_process") as typeof import(
+      "node:child_process"
+    );
+    const result = spawnSync(
+      process.execPath,
+      [
+        require.resolve("tsx/cli"),
+        cliPath,
+        "--source",
+        tmpdir,
+        ...args,
+      ],
+      { encoding: "utf8" },
+    );
+    return {
+      status: result.status,
+      stdout: result.stdout,
+      stderr: result.stderr,
+    };
+  }
+  it("default mode is warn — exits 0 even with error findings", () => {
+    // Empty wrangler triggers `observability_missing` (error) +
+    // `tail_consumer_missing` (error) + `version_metadata_*` (error). Warn
+    // mode must annotate but exit 0.
+    fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
+    const { status, stdout } = runCli([]);
+    expect(status).toBe(0);
+    expect(stdout).toMatch(/observability_missing/);
+  });
+  it("--mode block exits 1 when an error-severity finding is present", () => {
+    fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
+    const { status, stdout } = runCli(["--mode", "block"]);
+    expect(status).toBe(1);
+    expect(stdout).toMatch(/observability_missing/);
+  });
+  it("--mode block exits 0 when only warn-severity findings are present", () => {
+    // Canonical observability block + the rest of the fleet bindings → only
+    // the DECO_OTEL_*_ENDPOINT warns survive. Block mode must exit 0 because
+    // those are `warn`, not `error`.
+    fs.writeFileSync(
+      path.join(tmpdir, "wrangler.jsonc"),
+      JSON.stringify({
+        name: "my-store",
+        observability: {
+          enabled: true,
+          traces: { enabled: true, head_sampling_rate: 0.01, persist: true },
+          logs: { enabled: true, head_sampling_rate: 1, persist: true },
+        },
+        version_metadata: { binding: "CF_VERSION_METADATA" },
+        analytics_engine_datasets: [{ binding: "DECO_METRICS" }],
+        tail_consumers: [{ service: "deco-otel-tail" }],
+      }),
+    );
+    const { status } = runCli(["--mode", "block"]);
+    expect(status).toBe(0);
+  });
+  it("--mode block exits 0 on a fully clean wrangler.jsonc", () => {
+    fs.writeFileSync(
+      path.join(tmpdir, "wrangler.jsonc"),
+      JSON.stringify({
+        name: "my-store",
+        observability: {
+          enabled: true,
+          traces: { enabled: true, head_sampling_rate: 0.01, persist: true },
+          logs: { enabled: true, head_sampling_rate: 1, persist: true },
+        },
+        version_metadata: { binding: "CF_VERSION_METADATA" },
+        analytics_engine_datasets: [{ binding: "DECO_METRICS" }],
+        tail_consumers: [{ service: "deco-otel-tail" }],
+        vars: {
+          DECO_OTEL_METRICS_ENDPOINT: "https://ingest.example.com",
+          DECO_OTEL_TRACES_ENDPOINT: "https://ingest.example.com",
+          DECO_OTEL_LOGS_ENDPOINT: "https://ingest.example.com",
+        },
+      }),
+    );
+    const { status } = runCli(["--mode", "block"]);
+    expect(status).toBe(0);
+  });
+  it("--github emits ::warning::/::error:: annotations matched to mode", () => {
+    fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
+    // In warn mode, even error-severity findings annotate as `warning` (we
+    // never escalate to GitHub `error` annotations when we won't fail the
+    // check — keeps the PR annotation channel quiet at v1).
+    const warnRun = runCli(["--github"]);
+    expect(warnRun.status).toBe(0);
+    expect(warnRun.stdout).toMatch(/::warning title=observability_missing::/);
+    expect(warnRun.stdout).not.toMatch(/::error title=/);
+    // In block mode, error-severity findings escalate to `::error::`.
+    const blockRun = runCli(["--mode", "block", "--github"]);
+    expect(blockRun.status).toBe(1);
+    expect(blockRun.stdout).toMatch(/::error title=observability_missing::/);
+  });
+  it("--mode rejects values other than warn|block with exit 2", () => {
+    fs.writeFileSync(path.join(tmpdir, "wrangler.jsonc"), "{}");
+    const { status, stderr } = runCli(["--mode", "advisory"]);
+    expect(status).toBe(2);
+    expect(stderr).toMatch(/--mode must be "warn" or "block"/);
+  });
+});
 describe("CLI smoke — wrangler.jsonc with trailing commas", () => {
   let tmpdir: string;
   beforeEach(() => {