@decocms/start 5.3.0-rc.2 → 5.4.0

package/docs/o11y.md DELETED
@@ -1,602 +0,0 @@
1
- # First context, made at stats-lake:
2
-
3
- @decocms/start Observability — Investigation & Implementation Brief
4
-
5
- Goal
6
-
7
- Make @decocms/start’s instrumentWorker() produce rich, well-shaped OTel telemetry that lands in our existing ClickStack instance (hyperdx.clickhouse.cloud) backed by stats-lake ClickHouse. The transport is already solved (CF Destinations → deco-otel-ingest Worker → otel_traces / otel_logs). This brief is about what the framework should emit.
8
-
9
- Context — what’s already built
10
-
11
- casaevideo-tanstack (CF Worker)
12
- observability.destinations in wrangler.jsonc
13
- │ OTLP/HTTP JSON
14
-
15
- deco-otel-ingest (https://deco-otel-ingest.deco-cx.workers.dev)
16
- maps OTLP → ClickHouse clickhouseexporter schema
17
- redacts PII (cookie, authorization, x-vtex-*)
18
-
19
- stats-lake ClickHouse
20
- default.otel_traces / default.otel_logs (30-day TTL)
21
-
22
- hyperdx.clickhouse.cloud (ClickStack UI)
23
- End-to-end is verified. Synthetic spans + logs ingest correctly, render in the ClickStack UI under service:deco-otel-ingest-smoke.
24
-
25
- Repos involved:
26
-
27
- /Users/fernandofrizzatti/development/workspace/decocms/deco-start/ — this brief is about this repo
28
- /Users/fernandofrizzatti/conductor/workspaces/stats-lake/calgary/ — ingest Worker + DDL (Phase 1, PR https://github.com/decocms/stats-lake/pull/1)
29
- /Users/fernandofrizzatti/development/workspace/decocms/casaevideo-storefront/ — first site to wire (Phase 3)
30
- /Users/fernandofrizzatti/development/workspace/decocms/apps/ — @decocms/apps, has createInstrumentedFetch used by sites
31
- What works today in @decocms/start v5.0
32
-
33
- instrumentWorker(handler, { serviceName }) — wraps a Worker; configures structured JSON logger + per-span attribute floor via configureTracer() bridge onto @opentelemetry/api global tracer
34
- withTracing("name", fn, attrs) — public API, called from workerEntry.ts:981 around every request as deco.http.request
35
- defaultLoggerAdapter — writes { level, msg, timestamp, ...attrs } to console.*, which CF Destinations captures as OTLP logs
36
- AE meter wiring via DECO_METRICS binding (existing, unrelated to the OTLP path)
37
- Single dependency: @opentelemetry/api ^1.9.1. No SDK, no exporters. The transport is entirely CF Destinations.
38
-
39
- Confirmed gaps (ranked by impact)
40
-
41
- 1. service.version is never set from CF_VERSION_METADATA
42
- instrumentWorker() reads decoRuntimeVersion option (defaults to hardcoded "5.0.0") and stamps it as deco.runtime.version
43
- It does NOT read env.CF_VERSION_METADATA?.id to set the OTel-standard service.version
44
- Why this matters: dashboards correlate errors / latency with specific deployments via service.version. Without it, “did the regression land in deployment X?” is unanswerable
45
- Fix: in bootObservability(), read env.CF_VERSION_METADATA?.id and stamp it as service.version in spanAttributeFloor AND in the logger floor. Files: src/sdk/otel.ts:168-214
46
- 2. Per-span attribute floor only covers framework-created spans, not CF auto-instrumented ones
47
- spanAttributeFloor (deco.runtime.version, deployment.environment) is applied inside the configureTracer bridge in otel.ts:137-149 — only spans created through withTracing() get it
48
- CF runtime auto-instrumented spans (fetch/KV/R2/DO) come out without these attributes
49
- Why this matters: ~90% of spans in a typical request come from CF auto-instrumentation. Without the floor, dashboards can’t filter those by environment or runtime version
50
- Fix options:
51
- (a) Move stamping into ingest Worker: if a resource attribute is missing, fill from the request’s known service. Requires the ingest Worker to know per-service defaults — leaks coupling.
52
- (b) Accept the gap. Filter dashboards by service.name only (resource attribute, always present). deco.runtime.version becomes “best-effort for framework spans”
53
- (c) Use CF Tail Workers to enrich spans post-hoc (heavy)
54
- Recommendation: (b) for now, document explicitly. Revisit if a real dashboard need surfaces.
55
- 3. Sampling: head-only via CF Destinations, no tail-on-error
56
- CF Destinations only supports head_sampling_rate — decided at trace start, before status is known
57
- Goal: “100% errors + 1% baseline” — tail-on-error semantics. Not achievable with CF Destinations alone
58
- Options:
59
- A (long term) — CF Tail Workers: bind a separate Worker as a tail handler that receives every invocation summary (status + logs). It decides to forward full traces to ingest based on outcome. Most flexible; requires a second Worker.
60
- B (pragmatic short term) — Head 100% + ingest-side filter: set traces.head_sampling_rate: 1.0 in wrangler, let everything flow. deco-otel-ingest keeps all errors (StatusCode == STATUS_CODE_ERROR or http.response.status_code >= 500) plus a deterministic % of OK traces (hash on TraceId). Costs more CF events but no missed errors.
61
- C — Status quo: head=0.1, logs=1.0. Miss 90% of error traces; lean on 100% logs for error context.
62
- Recommendation: start with C for the pilot (no extra work), measure ratio of 5xx in the first 24h of real traffic on casaevideo. If we’re losing relevant signal, move to B (1-2 hour task in the ingest Worker).
63
- Where the change lives: not in @decocms/start. Either the site’s wrangler.jsonc (option C → B head rate change) or deco-otel-ingest/src/index.ts (option B filter logic).
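Option B's ingest-side filter can be sketched as follows. This is a minimal illustration, not code from deco-otel-ingest: the `SpanSummary` shape, the `shouldKeepTrace` name, and the choice of FNV-1a as the hash are all assumptions. It keeps any batch that contains an error span and samples OK traces deterministically by hashing the trace id, so repeated decisions for the same trace agree.

```typescript
// Hypothetical ingest-side filter; names and shapes are illustrative only.
interface SpanSummary {
  traceId: string; // hex trace id as received over OTLP
  statusCode: string; // e.g. "STATUS_CODE_ERROR", as named in the brief
  httpStatus?: number; // http.response.status_code, when the span carries it
}

// FNV-1a 32-bit hash: cheap and deterministic, which is all bucketing needs.
function hashTraceId(traceId: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < traceId.length; i++) {
    h ^= traceId.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // force an unsigned 32-bit result
}

function isErrorSpan(span: SpanSummary): boolean {
  return span.statusCode === "STATUS_CODE_ERROR" || (span.httpStatus ?? 0) >= 500;
}

// Keep every trace that contains an error; keep OK traces at okSampleRate (0..1).
function shouldKeepTrace(spans: SpanSummary[], okSampleRate: number): boolean {
  const first = spans[0];
  if (first === undefined) return false;
  if (spans.some(isErrorSpan)) return true;
  const bucket = hashTraceId(first.traceId) / 0x100000000; // value in [0, 1)
  return bucket < okSampleRate;
}
```

One caveat under these assumptions: the decision is made per batch, so an error span that arrives in a different batch from its siblings would not retroactively rescue the spans already dropped.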
64
- 4. Logger output is JSON-in-stringValue, not native OTel log attributes
65
- defaultLoggerAdapter does console.log(JSON.stringify({ level, msg, timestamp, ...attrs }))
66
- CF Destinations wraps that into an OTLP LogRecord with body.stringValue = "<the JSON>". The attrs object isn’t exposed as native log attributes
67
- Our ingest Worker maps body.stringValue → otel_logs.Body verbatim. Queries need JSONExtract(Body, 'level', 'String') to filter by level
68
- Why this matters: ClickStack UI auto-detects severity from body.stringValue substring matching, so “INFO” / “ERROR” show up correctly (confirmed empirically). But filtering by custom attributes (requestId, userId) requires JSON extraction at query time
69
- Fix options:
70
- (a) Have the logger emit severityText and severityNumber to a separate sidecar that’s understood by CF Destinations. Looks hard — CF only sees console output
71
- (b) Accept the JSON-in-body shape, document the JSONExtract pattern for queries
72
- (c) Change the logger to log key-value pairs as separate console.log calls, but that breaks correlation
73
- Recommendation: (b). Document with example queries in @decocms/start observability README.
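To make the JSON-in-body shape concrete, here is a small sketch of what the logger writes and what any consumer has to do to read a field back. `formatLogLine` and `extractField` are hypothetical names, and the ISO timestamp format is an assumption; only the `{ level, msg, timestamp, ...attrs }` shape comes from the brief. `extractField` plays the same role that JSONExtract plays at query time.

```typescript
type LogAttrs = Record<string, string | number | boolean>;

// Mirrors the shape the brief describes: one JSON string per console call.
function formatLogLine(level: string, msg: string, attrs: LogAttrs = {}): string {
  return JSON.stringify({ level, msg, timestamp: new Date().toISOString(), ...attrs });
}

// What a reader of otel_logs.Body must do: parse the string, then pick a key.
// Returns undefined for non-JSON bodies or missing keys.
function extractField(body: string, key: string): string | undefined {
  try {
    const parsed: unknown = JSON.parse(body);
    if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
      const value = (parsed as Record<string, unknown>)[key];
      return value === undefined ? undefined : String(value);
    }
  } catch {
    // Body was plain text, not JSON; fall through.
  }
  return undefined;
}
```

The point of the sketch is the cost model: every filter on level, requestId, or userId pays for a parse of the whole body, which is why option (b) needs the query pattern documented.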
74
- Open investigation points (verify empirically with casaevideo)
75
-
76
- What does CF Destinations actually emit for resource attributes? I assumed service.name = Worker name from wrangler.jsonc. Confirm by reading from otel_traces.ResourceAttributes after the first casaevideo request lands. Also check: cloud.provider, cloud.platform, faas.name, cloudflare.script_version.id.
77
-
78
- What does withTracing() in workerEntry.ts:981 actually emit? Confirm deco.http.request spans show up with http.method + url.path and inherit the spanAttributeFloor. Cross-check that the framework span and the CF root span are properly parent-child related (or whether they’re siblings — depends on how CF auto-instrumentation handles user spans).
79
-
80
- createInstrumentedFetch("vtex") from @decocms/apps — does it use withTracing from @decocms/start, or its own tracer? Either way, do VTEX calls show up as child spans of the request root in ClickStack? Trace propagation (W3C traceparent header) to VTEX would be a nice future win.
81
-
82
- TanStack-specific spans — server entry render, data fetch, route resolution. Are there withTracing() calls in the framework / @tanstack/react-start/server-entry that produce meaningful spans? If not, add them at strategic points.
83
-
84
- Error reporting: when a request throws, does withTracing() call setError on the span (workerEntry.ts:996 area)? Verify the span gets StatusCode = STATUS_CODE_ERROR and a StatusMessage. If not, fix in src/middleware/observability.ts:77-95.
85
-
86
- CF_VERSION_METADATA binding: is it declared in casaevideo’s wrangler.jsonc? (Yes, confirmed earlier.) Does instrumentWorker() receive it via env? It should, since the binding is on env.
87
-
88
- Concrete deliverables for the @decocms/start PR
89
-
90
- In priority order:
91
-
92
- src/sdk/otel.ts — read env.CF_VERSION_METADATA?.id in bootObservability(). Stamp as service.version in both spanAttributeFloor and the logger floor. Boot-time log should include version. ~10 LOC.
93
-
94
- src/sdk/otel.ts — also stamp service.name in both floors (so framework-created spans carry it, even though the resource attribute already has it from CF). Defensive, lets ClickStack dashboards filter by either path. ~3 LOC.
95
-
96
- src/middleware/observability.ts:77-95 — confirm withTracing() sets span status to STATUS_CODE_ERROR on thrown exception via the bridge’s setError. Add a test.
97
-
98
- README.md / docs/observability.md — document the architecture: instrumentWorker() + CF Destinations + ingest Worker. Show the recommended wrangler.jsonc shape. Show the ClickStack query patterns (JSONExtract for log filtering).
99
-
100
- Optional: add withTracing() calls at strategic framework points — render entry, loader resolution, data fetching. Only if the empirical investigation (point 2 above) shows current coverage is too thin.
101
-
102
- Out of scope for this PR
103
-
104
- Reviving the OTLP SDK exporters that lived at v4.6.0 — not needed; CF Destinations replaces that path
105
- Tail-on-error sampling — lives in the ingest Worker or a CF Tail Worker, not in @decocms/start
106
- @microlabs/otel-cf-workers integration — rejected; CF Destinations + thin framework attributes is sufficient
107
- PII redaction — handled at ingest Worker, no per-site code needed
108
- How to validate the framework changes
109
-
110
- Branch off @decocms/start main, apply the 4 changes above
111
- pnpm link it into casaevideo-storefront
112
- Add destinations block to casaevideo/wrangler.jsonc pointing at https://deco-otel-ingest.deco-cx.workers.dev
113
- Wrap casaevideo/src/worker-entry.ts final export with instrumentWorker(handler, { serviceName: "casaevideo-tanstack" })
114
- wrangler deploy (preview URL is fine; routes are still commented out in production)
115
- Hit the preview URL a few times
116
- ClickHouse: verify service.version lands in otel_traces.ResourceAttributes['service.version'] AND otel_traces.SpanAttributes['service.version']
117
- ClickStack UI: filter service:casaevideo-tanstack, confirm spans show with sensible names, durations, and errors-as-errors
118
- Reference files
119
-
120
- src/sdk/otel.ts — instrumentWorker(), bootObservability(), attribute floor wiring
121
- src/sdk/logger.ts — defaultLoggerAdapter, severity routing
122
- src/sdk/workerEntry.ts:968-996 — createDecoWorkerEntry, RequestContext.run, root withTracing("deco.http.request", ...)
123
- src/middleware/observability.ts:77-95 — withTracing() implementation, error path
124
- package.json — only @opentelemetry/api ^1.9.1, no SDK
125
- src/sdk/otelAdapters/clickhouseCollector.ts — documented stub, unchanged; leave for future use
126
-
127
-
128
- # Second Context: made at deco-start
129
-
130
- Let's plan for all of those, straight away. We need to be able to track everything we need: cache, performance, errors, etc.
131
- Separate what goes into this repo as a PLAN from what goes into decocms/apps as a plan prompt for investigation, please.
132
- Let's implement what goes into this repo, and I will send the prompt to apps in another thread.
133
- Observability completeness — full coverage for cache, performance, errors
134
-
135
- Context
136
-
137
- The brief at /tmp/attachments/pasted_text_2026-05-15_15-42-09.txt covers the next iteration of @decocms/start observability. Transport is solved end-to-end:
138
-
139
- site Worker → CF Destinations (OTLP/HTTP JSON)
140
- → deco-otel-ingest (Worker)
141
- → stats-lake ClickHouse (otel_traces / otel_logs)
142
- → hyperdx.clickhouse.cloud (ClickStack UI)
143
- The framework already has scaffolding in place — instrumentWorker, withTracing, AE meter, structured logger, log severity floor — but the signal it emits is incomplete:
144
-
145
- Identity fields missing: service.version, service.name aren’t on the per-span/per-log floor.
146
- Error spans don’t carry STATUS_CODE_ERROR — they look successful with an exception attached.
147
- Two CMS hotspots are already instrumented (resolveDecoPage → deco.cms.resolvePage, runSingleSectionLoader → deco.section.loader). The rest of the framework is dark to ClickStack.
148
- No cache decision spans / metrics at the application layer (CF Cache API spans show storage ops but not HIT/MISS/profile).
149
- Admin protocol handlers (handleMeta, handleDecofile, handleRender, handleInvoke) emit no spans, so admin traffic mixes with customer traffic in the trace tree.
150
- Deferred section loads (loadDeferredSection server fn) have no span — on-scroll latency is invisible.
151
- logRequest doesn’t emit traceId, so logs and traces can’t be cross-referenced in ClickStack.
152
- No W3C trace-context propagation outbound — upstream services (VTEX, Shopify) can’t link their spans to ours.
153
- No docs/observability.md.
154
- Goal: ship a single coherent PR that makes the framework emit rich, queryable telemetry covering cache decisions, request shape, admin protocol, errors, and trace correlation — without changing the public API. The companion work in @decocms/apps (createInstrumentedFetch, VTEX-specific spans) is described as a standalone brief at the bottom of this document for the user to send to a separate agent thread.
155
-
156
- Plan A — Changes in @decocms/start (this repo)
157
-
158
- A1. Identity on the floor (src/core/sdk/otel.ts:168-214)
159
- bootObservability():
160
-
161
- Read env.CF_VERSION_METADATA?.id; if present, add service.version to floor.
162
- Add service.name (already computed at line 171) to floor.
163
- Update the boot breadcrumb (otel.ts:208-213) to log version when known.
164
- Resulting shape:
165
-
166
- const floor: Record<string, string> = {
167
- "service.name": serviceName,
168
- "deco.runtime.version": decoRuntimeVersion,
169
- "deployment.environment": deploymentEnvironment,
170
- };
171
- const versionId = (env.CF_VERSION_METADATA as { id?: string } | undefined)?.id;
172
- if (versionId) floor["service.version"] = versionId;
173
- if (opts.decoAppsVersion) floor["deco.apps.version"] = opts.decoAppsVersion;
174
- Both spanAttributeFloor = floor and setLoggerAttributeFloor(floor) already exist (lines 185, 191) — no plumbing needed downstream.
175
-
176
- A2. Span-status on errors (src/core/sdk/otel.ts:137-149)
177
- In the configureTracer bridge, extend setError to also call span.setStatus({ code: SpanStatusCode.ERROR, message }). Import SpanStatusCode from @opentelemetry/api.
178
-
179
- setError: (error) => {
180
- const message = error instanceof Error ? error.message : String(error);
181
- if (error instanceof Error) span.recordException(error);
182
- span.setStatus({ code: SpanStatusCode.ERROR, message });
183
- },
184
- withTracing() at src/core/sdk/observability.ts:101-105 already calls span.setError?.(error) in its catch — no change needed there.
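The A2 behavior can be exercised without the OTel package by using local stand-ins. In this sketch `OtelSpanLike`, `makeSetError`, and `RecordingSpan` are hypothetical names; the numeric status values mirror the SpanStatusCode enum in @opentelemetry/api (UNSET = 0, OK = 1, ERROR = 2). It is meant as a shape for the test described in A10, not as the bridge itself.

```typescript
// Local stand-in so the sketch runs without @opentelemetry/api installed.
const SpanStatusCode = { UNSET: 0, OK: 1, ERROR: 2 } as const;

interface SpanStatus {
  code: number;
  message?: string;
}

// The subset of the OTel span surface that setError touches.
interface OtelSpanLike {
  recordException(error: Error): void;
  setStatus(status: SpanStatus): void;
}

// Builds the bridge's setError: record the exception when there is one,
// and always mark the span as failed with a readable message.
function makeSetError(span: OtelSpanLike): (error: unknown) => void {
  return (error) => {
    const message = error instanceof Error ? error.message : String(error);
    if (error instanceof Error) span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message });
  };
}

// Recording stub, useful for asserting on what the bridge did.
class RecordingSpan implements OtelSpanLike {
  exceptions: Error[] = [];
  status: SpanStatus = { code: SpanStatusCode.UNSET };
  recordException(error: Error): void {
    this.exceptions.push(error);
  }
  setStatus(status: SpanStatus): void {
    this.status = status;
  }
}
```

Non-Error throwables (strings, plain objects) still flip the status to ERROR; they just skip recordException, which matches the snippet in A2.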
185
-
186
- A3. Extend Span interface to expose trace context (src/core/sdk/observability.ts:43-47)
187
- Required by A7 (logger traceId) and A8 (traceparent propagation).
188
-
189
- export interface Span {
190
- end(): void;
191
- setError?(error: unknown): void;
192
- setAttribute?(key: string, value: string | number | boolean): void;
193
- spanContext?(): { traceId: string; spanId: string; traceFlags: number };
194
- }
195
- In the bridge (src/core/sdk/otel.ts:140-148), expose spanContext by delegating to the OTel API span:
196
-
197
- spanContext: () => {
198
- const ctx = span.spanContext();
199
- return { traceId: ctx.traceId, spanId: ctx.spanId, traceFlags: ctx.traceFlags };
200
- },
201
- This is opt-in via the optional method — adapters that don’t implement it still work; helpers that need it (logger, traceparent) just no-op gracefully.
202
-
203
- A4. Cache decision spans + metrics (src/tanstack/sdk/workerEntry.ts)
204
- Two cache stores, two lookup paths. Wrap each cache.match call in withTracing("deco.cache.lookup", ...) and emit recordCacheMetric(...) at the decision points already in the code.
205
-
206
- Server-function cache (POST _serverFn): workerEntry.ts:1246-1303
207
-
208
- Wrap line 1247-1251 lookup: span deco.cache.lookup, attrs { "cache.profile": "listing", "cache.kind": "serverFn" }.
209
- At each decision point (HIT line 1254-1264, STALE-HIT line 1267-1292, MISS line 1303+), call recordCacheMetric(hit, profile) and set span attribute cache.decision.
210
- Wrap cache.put at line 1334 with deco.cache.store, attrs { "cache.profile": "listing", "cache.kind": "serverFn" }.
211
- HTML cache (GET): workerEntry.ts:1484-1595
212
-
213
- Wrap line 1486-1489 lookup: span deco.cache.lookup, attrs { "cache.profile": profile, "cache.kind": "html" }.
214
- At each decision point (HIT line 1492-1501, STALE-HIT line 1502-1511, MISS line 1514+), call recordCacheMetric(...) with the right profile, set cache.decision.
215
- Wrap cache.put (helper storeInCache line 1446-1458) with deco.cache.store, attrs { "cache.profile": profile, "cache.kind": "html" }.
216
- Attributes on every cache span: cache.profile, cache.kind, cache.decision (HIT/STALE-HIT/MISS), cache.age_seconds (when known).
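The attribute set above can be illustrated with a small classifier. Everything here is an assumption made for the sketch: the `CacheWindow` shape, the fresh/stale thresholds, and the rule that an entry older than fresh plus stale counts as a MISS. The actual decision logic in workerEntry.ts is not shown in this document and may differ.

```typescript
type CacheDecision = "HIT" | "STALE-HIT" | "MISS";
type CacheKind = "serverFn" | "html";

// Hypothetical per-profile freshness window, in seconds.
interface CacheWindow {
  freshSeconds: number;
  staleSeconds: number;
}

interface CacheLookupResult {
  decision: CacheDecision;
  ageSeconds?: number; // only known when an entry was found
}

// Classify a lookup from the entry's store time (undefined means nothing was cached).
function classifyLookup(
  storedAtMs: number | undefined,
  nowMs: number,
  window: CacheWindow,
): CacheLookupResult {
  if (storedAtMs === undefined) return { decision: "MISS" };
  const ageSeconds = Math.max(0, Math.floor((nowMs - storedAtMs) / 1000));
  if (ageSeconds <= window.freshSeconds) return { decision: "HIT", ageSeconds };
  if (ageSeconds <= window.freshSeconds + window.staleSeconds) {
    return { decision: "STALE-HIT", ageSeconds };
  }
  return { decision: "MISS", ageSeconds };
}

// Build the span attributes listed in A4; cache.age_seconds is set only when known.
function cacheSpanAttributes(
  profile: string,
  kind: CacheKind,
  result: CacheLookupResult,
): Record<string, string | number> {
  const attrs: Record<string, string | number> = {
    "cache.profile": profile,
    "cache.kind": kind,
    "cache.decision": result.decision,
  };
  if (result.ageSeconds !== undefined) attrs["cache.age_seconds"] = result.ageSeconds;
  return attrs;
}
```

Keeping classification separate from attribute building means the same result can feed both the span and the recordCacheMetric call at each decision point.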
217
-
218
- A5. Section loader batch wrapper (src/core/cms/sectionLoaders.ts:253)
219
- Wrap runSectionLoaders body in withTracing("deco.section.loaders.batch", ..., { "section.count": sections.length }) so the existing per-section spans (deco.section.loader at sectionLoaders.ts:332-334) nest cleanly under one batch parent. Makes traces readable when a page has 15+ sections.
220
-
221
- A6. Admin protocol handler spans
222
- The map agent found 5 admin handlers. Wrap each at the call site in src/tanstack/sdk/workerEntry.ts (inside tryAdminRoute) — not inside the core handler — so the handlers stay framework-agnostic and reusable from src/next/. Names + attrs:
223
-
224
- Span name | Call site | Attributes
225
- deco.admin.meta | wrapper around handleMeta | (none; handler is sync, fast)
226
- deco.admin.decofile.read | wrapper around handleDecofileRead | (none)
227
- deco.admin.decofile.reload | wrapper around handleDecofileReload | decofile.block_count (set after reload)
228
- deco.admin.render | wrapper around handleRender | cms.component (from pathComponent), cms.resolve_chain (bool)
229
- deco.admin.invoke | wrapper around handleInvoke | invoke.key, invoke.kind (loader
230
- For the sync handlers (handleMeta, handleDecofileRead), use a thin async wrapper inside the withTracing callback. Negligible cost; uniform span shape.
231
-
232
- A7. Logger trace correlation (src/core/sdk/observability.ts:189-216)
233
- Inside logRequest, before the JSON-serialize step, pull the active span’s spanContext() (via the interface from A3) and add its traceId and spanId to the payload as trace_id and span_id. No-op when there is no active span. The field names follow OTel conventions so that ClickStack’s auto-correlation panel works.
234
-
235
- Do not add the same enrichment to the logger floor in setLoggerAttributeFloor: the floor is per-boot, not per-request. Instead, plumb trace_id / span_id inside the emit() function in src/core/sdk/logger.ts:228-250 so every log line (not just logRequest) carries trace correlation. This is cheap: one getActiveSpan()?.spanContext?.() call per line, and no allocation when there is no active span.
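A minimal sketch of the per-line enrichment described in A7. The module-level `activeSpan` variable and `setActiveSpanForDemo` are stand-ins invented for this example; the real getActiveSpan in src/core/sdk/observability.ts presumably tracks the span across async boundaries, which this sketch does not attempt. `buildLogPayload` is likewise a hypothetical name for the step inside emit().

```typescript
interface SpanContextLike {
  traceId: string;
  spanId: string;
  traceFlags: number;
}

// Same optional-method shape as the Span interface in A3.
interface Span {
  end(): void;
  spanContext?(): SpanContextLike;
}

// Demo-only stand-in for the framework's active-span tracking.
let activeSpan: Span | undefined;
function setActiveSpanForDemo(span: Span | undefined): void {
  activeSpan = span;
}
function getActiveSpan(): Span | undefined {
  return activeSpan;
}

// Build the log payload, adding trace_id / span_id only when a span
// with a spanContext() implementation is active.
function buildLogPayload(
  level: string,
  msg: string,
  attrs: Record<string, unknown> = {},
): Record<string, unknown> {
  const payload: Record<string, unknown> = { level, msg, ...attrs };
  const ctx = getActiveSpan()?.spanContext?.();
  if (ctx) {
    payload["trace_id"] = ctx.traceId;
    payload["span_id"] = ctx.spanId;
  }
  return payload;
}
```

Adapters whose spans lack spanContext() fall through the optional call and produce the same payload as a log emitted outside any span.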
236
-
237
- A8. W3C trace-context propagation helper (src/core/sdk/http.ts or new src/core/sdk/traceContext.ts)
238
- Add a small helper:
239
-
240
- export function injectTraceContext(headers: Headers): void {
241
- const ctx = getActiveSpan()?.spanContext?.();
242
- if (!ctx) return;
243
- const flags = ctx.traceFlags.toString(16).padStart(2, "0");
244
- headers.set("traceparent", `00-${ctx.traceId}-${ctx.spanId}-${flags}`);
245
- }
246
- Frame as a public helper exported from @decocms/start/sdk/observability so apps repo (@decocms/apps) can call it from createInstrumentedFetch. No new dependency — we already have @opentelemetry/api.
247
-
248
- Document on the export: “Call before sending an outbound fetch from within a withTracing scope to enable upstream trace correlation.”
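For reference, the header that injectTraceContext writes follows the W3C Trace Context layout: version `00`, a 32-hex-character trace id, a 16-hex-character parent id, and two hex characters of flags. The sketch below separates formatting from header mutation so it can be checked in isolation; `formatTraceparent` and `parseTraceparent` are hypothetical helpers, not part of the plan.

```typescript
interface SpanContextLike {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  traceFlags: number; // 8-bit flags; bit 0 means "sampled"
}

// Format a version-00 traceparent value from a span context.
function formatTraceparent(ctx: SpanContextLike): string {
  const flags = (ctx.traceFlags & 0xff).toString(16).padStart(2, "0");
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse a version-00 traceparent value; returns null when it is malformed
// or when the ids are all zeros, which the spec treats as invalid.
function parseTraceparent(header: string): SpanContextLike | null {
  const match = TRACEPARENT_RE.exec(header);
  if (match === null) return null;
  const traceId = match[1] ?? "";
  const spanId = match[2] ?? "";
  const flags = match[3] ?? "00";
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, traceFlags: parseInt(flags, 16) };
}
```

A round trip through both functions is a cheap way to assert "well-formed traceparent" in the injectTraceContext test listed under A10.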
249
-
250
- A9. Deferred section span (src/tanstack/routes/cmsRoute.ts:167-214)
251
- Wrap the internal resolveDeferredSectionPure call at line 197:
252
-
253
- const result = await withTracing(
254
- "deco.section.deferred.load",
255
- () => resolveDeferredSectionPure(pagePath, component, matcherCtx, opts),
256
- { "section.name": component, "section.index": index, "page.path": pagePath },
257
- );
258
- Lives inside the TanStack server fn handler — the span naturally parents under whatever withTracing context CF threads through (or stands alone if not).
259
-
260
- A10. Tests
261
- src/core/sdk/otel.test.ts:
262
-
263
- service.version lands on the logger floor when env.CF_VERSION_METADATA = { id: "abc" }.
264
- service.name lands on the logger floor (from serviceName option and DECO_SITE_NAME fallback).
265
- The bridge’s setError sets SpanStatusCode.ERROR — assert via a vi.spyOn(trace, "getTracer") stub that returns a recording span.
266
- spanContext() round-trips traceId/spanId from a stub OTel span.
267
- src/core/sdk/observability.test.ts (new file):
268
-
269
- injectTraceContext writes a well-formed traceparent when an active span exists, no-ops otherwise.
270
- withTracing nests getActiveSpan correctly across await.
271
- src/core/cms/sectionLoaders.test.ts:
272
-
273
- New runSectionLoaders batch test: with a stub tracer, verify one deco.section.loaders.batch parent appears with section.count set.
274
- src/tanstack/sdk/workerEntry.test.ts (new file or extend existing):
275
-
276
- Stub caches.open + a fake cache that returns a fresh Response — verify span deco.cache.lookup is emitted with cache.decision = "HIT" and recordCacheMetric(true, profile) is called.
277
- Same for MISS.
278
- Admin route hits emit the right span names.
279
- src/core/sdk/logger.test.ts:
280
-
281
- A log emitted inside withTracing carries trace_id / span_id fields.
282
- A log emitted outside any span doesn’t carry them.
283
- A11. Documentation — docs/observability.md (new) + README link
284
- Single page covering:
285
-
286
- Architecture — site Worker → CF Destinations → deco-otel-ingest → ClickHouse → ClickStack. Diagram in plain text.
287
- What’s instrumented — table of every span name and its attributes:
288
- Span | Source | Key attributes
289
- deco.http.request | workerEntry root | http.method, url.path
290
- deco.cms.resolvePage | resolveDecoPage | deco.route
291
- deco.section.loaders.batch | runSectionLoaders | section.count
292
- deco.section.loader | runSingleSectionLoader | deco.section
293
- deco.section.deferred.load | loadDeferredSection | section.name, section.index
294
- deco.cache.lookup | edge cache match | cache.profile, cache.kind, cache.decision, cache.age_seconds
295
- deco.cache.store | edge cache put | cache.profile, cache.kind
296
- deco.admin.meta / decofile.* / render / invoke | admin route wrappers | see table in A6
297
- What’s measured — table of metric names:
298
- Metric | Source | Labels
299
- http_requests_total (counter) | workerEntry | method, path, status
300
- http_request_duration_ms (histogram) | workerEntry | method, path, status
301
- http_request_errors_total (counter, 5xx) | workerEntry | method, path, status
302
- cache_hit_total / cache_miss_total (counter) | workerEntry | profile
303
- resolve_duration_ms (histogram) | resolveDecoPage | (none)
304
- Identity stamped on every span and log: service.name, service.version (when CF_VERSION_METADATA bound), deco.runtime.version, deployment.environment, deco.apps.version (optional). Each log line additionally carries trace_id / span_id when emitted inside an active span.
305
-
306
- Required wrangler.jsonc — lift the JSDoc example from src/core/sdk/otel.ts:19-30, plus the observability.logs.destinations / observability.traces.destinations blocks pointing at https://deco-otel-ingest.deco-cx.workers.dev. Include version_metadata binding and analytics_engine_datasets.
307
-
308
- Logger output shape gap — JSON-in-stringValue for log records. Show one ClickHouse query:
309
-
310
- SELECT JSONExtract(Body, 'level', 'String') as level, count()
311
- FROM otel_logs
312
- WHERE ServiceName = 'my-store' AND Timestamp > now() - INTERVAL 1 HOUR
313
- GROUP BY level
314
- Cross-referencing logs and traces in ClickStack — filter logs by trace_id to jump from a log line to its trace.
315
-
316
- Outbound trace propagation — short note on injectTraceContext(headers) for sites/apps that issue outbound fetches.
317
-
318
- Not in scope — tail-on-error sampling (lives in ingest Worker), VTEX/Shopify-specific spans (lives in @decocms/apps).
319
-
320
- Add a link line under the “Documentation” section of README.md.
321
-
322
- Critical files
323
- src/core/sdk/otel.ts (A1, A2, A3)
324
- src/core/sdk/observability.ts (A3, A7, A8)
325
- src/core/sdk/logger.ts (A7)
326
- src/core/sdk/http.ts or new src/core/sdk/traceContext.ts (A8)
327
- src/core/cms/sectionLoaders.ts (A5)
328
- src/tanstack/sdk/workerEntry.ts (A4, A6)
329
- src/tanstack/routes/cmsRoute.ts (A9)
330
- src/core/sdk/otel.test.ts, src/core/sdk/observability.test.ts (new), src/core/sdk/logger.test.ts, src/core/cms/sectionLoaders.test.ts, src/tanstack/sdk/workerEntry.test.ts (new) (A10)
331
- docs/observability.md (new), README.md (link) (A11)
332
- Existing utilities to reuse
333
- withTracing (src/core/sdk/observability.ts:88-106) — span lifecycle.
334
- recordCacheMetric / recordRequestMetric (src/core/sdk/observability.ts:145-167) — existing meter helpers; just wire them at the cache decision points.
335
- MetricNames (src/core/sdk/observability.ts:131-139) — already declares RESOLVE_DURATION_MS, FETCH_DURATION_MS, CACHE_HIT, CACHE_MISS. No new metric names needed.
336
- getActiveSpan (src/core/sdk/observability.ts:79-81) — used in A7 (logger) and A8 (traceparent helper).
337
- setLoggerAttributeFloor (src/core/sdk/logger.ts:118-120) — already wired from bootObservability.
338
- defaultLoggerAdapter (src/core/sdk/logger.ts:49-82) — wraps console.*, picked up by CF Destinations.
339
- Verification
340
- bun run typecheck — clean.
341
- bun test — all new + existing tests green.
342
- bun run build — package builds clean; dist/ exports unchanged.
343
- Manual smoke on a preview deployment (after the user merges):
344
- Deploy site Worker with version_metadata binding + observability.{logs,traces}.destinations pointing at deco-otel-ingest.
345
- Hit a PDP, a PLP, a cart route (so we exercise multiple cache profiles).
346
- ClickHouse query confirms ResourceAttributes['service.version'], SpanAttributes['service.version'], and SpanAttributes['cache.profile'] are populated.
347
- Force a 5xx (invalid CMS resolve), confirm the framework span shows STATUS_CODE_ERROR and the exception message in ClickStack.
348
- Filter ClickStack logs by trace_id from a known request, confirm logs and traces correlate.
349
- Release channel
350
- All changes are additive — no public API removed or renamed. The only behavioral change for sites that don’t have CF_VERSION_METADATA bound is that they get richer spans and per-line trace_id on logs; no breaking surface. Target main → @decocms/start@latest. Confirm with user before opening the PR.
351
-
352
- Plan B — Brief for @decocms/apps (paste into a separate agent thread)
353
-
354
- The user will copy the section below verbatim into a new agent conversation rooted at /Users/fernandofrizzatti/development/workspace/decocms/apps/. Self-contained, no shared state with this thread.
355
-
356
- Brief: instrument @decocms/apps for end-to-end observability
357
- Context
358
-
359
- The deco platform pipes telemetry from site Workers → CF Destinations → deco-otel-ingest (https://deco-otel-ingest.deco-cx.workers.dev) → stats-lake ClickHouse → ClickStack UI (hyperdx.clickhouse.cloud). The framework (@decocms/start) emits service identity (service.name, service.version from CF_VERSION_METADATA), a root deco.http.request span, per-section spans, cache decision spans, and admin protocol spans; it also exports a W3C traceparent-injection helper, injectTraceContext(headers), from @decocms/start/sdk/observability.
360
-
361
- What’s missing is per-app commerce instrumentation. VTEX, Shopify, Nuvemshop, etc. all issue many outbound fetches per request — search, catalog, OMS, intelligent-search, checkout. Today these are CF-auto-instrumented as flat fetch spans with no semantic grouping. Operators can’t answer “is intelligent-search slow today?” without grep-style URL filtering.
362
-
363
- Investigation tasks
364
-
365
- Audit createInstrumentedFetch — find its current implementation in this repo (likely under vtex/utils/ or utils/). For each:
366
- Does it use withTracing from @decocms/start? If yes, what’s the span name shape?
367
- Are spans named per-endpoint (e.g. vtex.intelligent-search.product_search) or just per-client (vtex.fetch)?
368
- Are HTTP-level attributes set (http.method, http.url, http.status_code, http.response.body_size)?
369
- Does it set STATUS_CODE_ERROR on 5xx responses or on thrown errors?
370
- Is response timing recorded as a histogram metric (e.g. vtex_request_duration_ms with endpoint label)?
371
- W3C trace propagation: does createInstrumentedFetch inject traceparent headers? If not, import injectTraceContext from @decocms/start/sdk/observability and call it before each outbound fetch. This enables downstream correlation when the upstream service is OTel-aware.
372
- Per-app coverage matrix — list every commerce app (VTEX, Shopify, Wake, Nuvemshop, …) and whether each uses createInstrumentedFetch consistently across loaders, actions, and hooks. Identify any direct fetch() calls that bypass instrumentation.
373
- Cross-app pattern — should there be a shared createInstrumentedFetch(provider: string) factory that emits spans {provider}.{endpoint}, with provider as a label on the per-app histogram? Today each app likely has its own.
374
- GraphQL — VTEX/Shopify use GraphQL. Should the span attributes include graphql.operation.name so dashboards can group by GraphQL operation rather than POST endpoint?
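The cross-app pattern question above could be explored with a factory along these lines. This is a sketch under stated assumptions, not the existing createInstrumentedFetch: the dependency-injected `fetch`, `withTracing`, and `injectTraceContext` parameters, the `endpoint` argument, and the plain-object headers are all invented here so the example stays self-contained (the helper described in A8 takes a `Headers` instance).

```typescript
type Attrs = Record<string, string | number | boolean>;
type WithTracing = <T>(name: string, fn: () => Promise<T>, attrs?: Attrs) => Promise<T>;

interface FetchInit {
  method?: string;
  headers?: Record<string, string>;
}
type FetchLike = (url: string, init: FetchInit) => Promise<{ status: number }>;

// Hypothetical dependencies, injected so the sketch needs no imports.
interface InstrumentDeps {
  fetch: FetchLike;
  withTracing: WithTracing;
  injectTraceContext: (headers: Record<string, string>) => void;
}

// Returns a fetch wrapper that names spans `{provider}.{endpoint}` and
// adds the traceparent header before the request leaves the Worker.
function createInstrumentedFetch(provider: string, deps: InstrumentDeps) {
  return (endpoint: string, url: string, init: FetchInit = {}) => {
    const method = init.method ?? "GET";
    return deps.withTracing(
      `${provider}.${endpoint}`,
      async () => {
        const headers = { ...(init.headers ?? {}) }; // copy; never mutate caller input
        deps.injectTraceContext(headers);
        return deps.fetch(url, { ...init, method, headers });
      },
      { "http.method": method, "http.url": url, provider },
    );
  };
}
```

The sketch deliberately leaves out status handling: marking a 5xx response as an error without throwing needs access to the span inside the callback, and whether withTracing provides that is one of the things the audit should establish.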
375
- Deliverables (after investigation)
376
-
377
- A short report (in the conversation, no file) summarizing the current state of fetch instrumentation across apps.
378
- A unified PR (or per-app PRs) that:
379
- Standardizes createInstrumentedFetch to emit a span per outbound call with {provider}.{operation_or_endpoint} shape, full HTTP attrs, error status on failures, and traceparent propagation.
380
- Adds a per-provider duration histogram (vtex_request_duration_ms, shopify_request_duration_ms) with endpoint / operation labels.
381
- Wraps GraphQL clients to expose graphql.operation.name on the span.
382
- Audits direct fetch() calls and converts them to createInstrumentedFetch.
383
- Local smoke verification (run a site against a real VTEX account, confirm spans appear in ClickStack with the expected shape).
384
- Constraints
385
-
386
- Don’t add new dependencies. @opentelemetry/api is already pulled in transitively via @decocms/start.
387
- Don’t change public function signatures — wrap implementations, don’t break consumers.
388
- PII redaction is handled at the ingest Worker; no per-app code needed for that.
389
- Don’t introduce a custom OTLP exporter — telemetry rides the CF Destinations pipe.
390
- Reference files (in @decocms/start, after the upcoming PR lands)
391
-
392
- src/core/sdk/observability.ts — withTracing, injectTraceContext, getActiveSpan.
393
- src/core/sdk/logger.ts — logger.info|warn|error (logs carry trace_id automatically).
394
- docs/observability.md — full pipeline + query patterns.
395
- Approved
396
- All 11 plan items complete. Verification clean: bun run typecheck, all 553 tests pass, bun run build succeeds, biome lint clean on touched files, tier boundaries clean.
397
-
398
- Summary of changes (this repo)
399
-
400
- Identity & errors — service.name and service.version (from env.CF_VERSION_METADATA.id) now stamp on every framework span and every log line via the attribute floor. Framework spans correctly carry SpanStatusCode.ERROR with the exception message when withTracing() rethrows. Span interface gained an optional spanContext() so logger/propagation helpers can read trace IDs.
401
-
402
- New spans — deco.section.loaders.batch, deco.section.deferred.load, deco.cache.lookup (×2 paths: SFn + HTML), deco.cache.store (×2), deco.admin.{meta,decofile.read,decofile.reload,render,invoke}. Wired at workerEntry callsites and TanStack admin routes.
403
-
404
- Cache metrics — extended recordCacheMetric(hit, profile, decision) with a CacheDecision union (“HIT” | “STALE-HIT” | “STALE-ERROR” | “MISS” | “BYPASS”). Wired at every HIT / STALE-HIT / MISS / STALE-ERROR decision point in both cache paths.
405
-
406
- Trace correlation — every log emitted inside a withTracing() scope now carries trace_id / span_id for cross-referencing logs and traces in ClickStack. logRequest does the same.
407
-
408
- injectTraceContext(headers) — new public helper that writes a W3C traceparent header from the active span. Exported from @decocms/start/sdk/observability (and /middleware) so @decocms/apps’s createInstrumentedFetch can propagate context to upstream services.
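For readers unfamiliar with the header that injectTraceContext writes, here is a minimal sketch of W3C traceparent formatting. The names SpanContextLike, formatTraceparent, and injectInto are hypothetical and used only for illustration; the actual helper in @decocms/start may be shaped differently.

```typescript
// Hypothetical shape of a span context; not taken from @decocms/start.
interface SpanContextLike {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  traceFlags: number; // bit 0 set means "sampled"
}

// Builds a version-00 traceparent value: 00-{trace-id}-{parent-id}-{flags}.
// Returns undefined for malformed or all-zero ids, which the W3C spec treats as invalid.
function formatTraceparent(ctx: SpanContextLike): string | undefined {
  const validTrace = /^[0-9a-f]{32}$/.test(ctx.traceId) && !/^0+$/.test(ctx.traceId);
  const validSpan = /^[0-9a-f]{16}$/.test(ctx.spanId) && !/^0+$/.test(ctx.spanId);
  if (!validTrace || !validSpan) return undefined;
  const flags = (ctx.traceFlags & 0xff).toString(16).padStart(2, "0");
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

// No-op when there is no active span context, mirroring the behavior described above.
function injectInto(headers: Headers, ctx: SpanContextLike | undefined): void {
  if (!ctx) return;
  const value = formatTraceparent(ctx);
  if (value) headers.set("traceparent", value);
}
```

An upstream service that understands the header can parse the trace id and attach its own spans to the same trace.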
409
-
410
- Tests — 11 new test cases covering: identity floor, error status, spanContext round-trip, batch-loader span, deferred span, trace-id in logs, injectTraceContext happy/no-op paths, getActiveSpan nesting.
411
-
412
- Docs — new docs/observability.md covering architecture, instrumented spans, metrics, identity floor, wrangler.jsonc shape, ClickHouse JSONExtract query patterns, log-to-trace correlation, injectTraceContext usage, sampling notes, and what’s out of scope. README links it under “Documentation”.
413
-
414
- Apps brief
415
-
416
- The standalone apps brief is in Plan B of /Users/fernandofrizzatti/.claude/plans/system-instruction-you-are-working-shimmying-badger.md — copy that section into a new agent thread rooted at /Users/fernandofrizzatti/development/workspace/decocms/apps/.
417
-
418
- PR instructions.md
419
- The user likes the current state of the code. Ther...
420
- Create a PR
421
- PR created: https://github.com/decocms/deco-start/pull/179
422
-
423
- The branch is pushed, the commit follows the repo’s conventional-commit style, and the PR body lays out what shipped plus the post-merge smoke steps from the plan.
424
-
425
-
426
- # Third Context: decocms/apps
427
- https://github.com/decocms/deco-start/pull/179 check that PR, check if that solves some problems, if any
428
- Exit code 1 accepts at most 1 arg(s), received 3
429
- PR 179 in decocms/deco-start directly addresses the framework gaps I found. Summary:
430
-
431
- injectTraceContext(headers: Headers) is added and exported from @decocms/start/sdk/observability — exactly the symbol the brief assumed exists. No-op when no active span. ✅
432
- Log↔trace correlation is automatic: any log emitted inside a withTracing(...) scope gets trace_id/span_id stamped via the logger attribute floor. Free win for @decocms/apps once we wrap call sites.
433
- recordCacheMetric(hit, profile, decision?) extended with decision label (HIT|STALE-HIT|STALE-ERROR|MISS|BYPASS).
434
- New framework spans added (cache lookup/store, loaders batch, deferred section load, admin handlers). Doesn’t affect us, but means once our outbound fetch spans nest under deco.http.request, the whole tree is contiguous.
435
- spanContext?() added to Span — adapters that implement it light up trace propagation; adapters that don’t degrade gracefully.
436
- What PR 179 does not solve (still on the @decocms/apps plate):
437
-
438
- createInstrumentedFetch is unchanged — still emits only ${name}.fetch spans, still does not call injectTraceContext (the PR’s doc claims it does, but the diff doesn’t touch src/core/sdk/instrumentedFetch.ts). So either (a) we add injectTraceContext inside our per-app fetch wrappers in this repo, or (b) we open a follow-up upstream PR to wire it into createInstrumentedFetch itself.
439
- Per-operation span granularity — upstream still produces vtex.fetch. To get vtex.intelligent-search.product_search, we wrap call sites in withTracing(...) inside @decocms/apps.
440
- GraphQL operationName on spans — needs to happen in our Shopify GraphQL client.
441
- ~27 raw fetch( sites that bypass _fetch need to be routed through it.
442
- Per-provider duration histograms — need to call getMeter()?.histogramRecord(...) ourselves at the per-app fetch helper.
443
- Let me write the plan, then I’ll ask a couple of decisions via ExitPlanMode.
444
- Instrument @decocms/apps for end-to-end observability
445
-
446
- Context
447
-
448
- The deco platform pipes telemetry from site Workers → CF Destinations → deco-otel-ingest → ClickHouse stats-lake → ClickStack UI. The framework half (@decocms/start) emits a root deco.http.request span, per-section spans, cache decision spans, admin protocol spans, and exposes tracer/meter primitives. The apps half (@decocms/apps — this repo) is the integration layer where outbound commerce fetches happen (VTEX REST, Shopify GraphQL, Resend, Google Fonts, etc.). Today those fetches are essentially invisible to operators — the only span shape produced is vtex.fetch / shopify.fetch via createInstrumentedFetch(name), and most fetches bypass even that.
449
-
450
- The goal: every outbound fetch in every app produces a span named {provider}.{operation}, carries full HTTP attributes, sets error status on failures, records a per-provider duration histogram with operation label, and propagates W3C traceparent so any OTel-aware upstream joins the trace.
451
-
452
- Upstream dependency: this plan depends on decocms/deco-start#179 being released. That PR adds injectTraceContext, CacheDecision, spanContext?() on the Span interface, and automatic log/trace correlation via the logger attribute floor. Without it, we’d need to vendor a W3C propagator. Once it ships, we just import.
453
-
454
- Current state (from exploration)
455
-
456
- createInstrumentedFetch is defined in @decocms/start/sdk/instrumentedFetch, not in this repo. Span shape is fixed at ${name}.fetch. No HTTP attributes, no traceparent injection, no per-operation granularity.
457
- Wiring lives in this repo via setter pattern:
458
- vtex/client.ts:117 — setVtexFetch(fetchFn) stores into module-level _fetch. Called at startup in setup.ts.
459
- shopify/client.ts:24 — setShopifyFetch(fetchFn) does the same for Shopify.
460
- Only 4 paths actually use _fetch:
461
- vtex/client.ts — vtexFetchResponse:214, vtexCachedFetch:260, intelligentSearch:360
462
- shopify/utils/graphql.ts:32 — createGraphqlClient.query (when fetchFn arg passed)
463
- ~27 raw fetch( sites bypass instrumentation entirely:
464
- VTEX actions (6): actions/analytics/sendEvent.ts:54, actions/profile.ts, actions/masterData.ts, …
465
- VTEX hooks (9): hooks/useCart.ts:77,96,111,131, hooks/useWishlist.ts:34,48,58, hooks/useAutocomplete.ts:49, hooks/useUser.ts:34
466
- VTEX utils (5): utils/authHelpers.ts:52, utils/proxy.ts:176,383, utils/fetch.ts:91
467
- Shopify: 1 fallback in utils/graphql.ts when fetchFn is undefined
468
- Resend: resend/actions/send.ts:35
469
- Website: website/loaders/fonts/googleFonts.ts:97,103
470
- Shopify GraphQL client (shopify/utils/graphql.ts) does not surface operationName anywhere — span attribute opportunity.
471
- No file in this repo currently imports from @decocms/start/sdk/observability.
472
- Approach
473
-
474
- Two layers of change in this repo:
475
-
476
- A. A small shared “traced fetch” helper
477
- Add a single helper that all per-app fetch utilities funnel through. It composes the three pieces we need on every outbound call:
478
-
479
- // utils/tracedFetch.ts (new file)
480
- import { withTracing, injectTraceContext, getMeter, setSpanAttribute } from "@decocms/start/sdk/observability";
481
-
482
- export interface TracedFetchOptions {
483
- provider: string; // e.g. "vtex"
484
- operation: string; // e.g. "intelligent-search.product_search"
485
- attributes?: Record<string, string | number | boolean>;
486
- }
487
-
488
- export async function tracedFetch(
489
- input: Request | URL | string,
490
- init: RequestInit | undefined,
491
- opts: TracedFetchOptions,
492
- underlyingFetch: typeof fetch = fetch,
493
- ): Promise<Response> {
494
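- const headers = new Headers(init?.headers ?? (input instanceof Request ? input.headers : undefined)); // keep a Request's own headers when init has none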
495
- injectTraceContext(headers);
496
- const url = typeof input === "string" ? input : input instanceof URL ? input.toString() : input.url;
497
- const method = init?.method ?? (input instanceof Request ? input.method : "GET");
498
- const start = performance.now();
499
-
500
- return withTracing(
501
- `${opts.provider}.${opts.operation}`,
502
- async () => {
503
- setSpanAttribute("http.method", method);
504
- setSpanAttribute("http.url", redactUrl(url)); // strip query secrets like authToken
505
- for (const [k, v] of Object.entries(opts.attributes ?? {})) setSpanAttribute(k, v);
506
-
507
- let res: Response;
508
- try {
509
- res = await underlyingFetch(input, { ...init, headers });
510
- } catch (err) {
511
- recordDuration(opts, performance.now() - start, "error");
512
- throw err;
513
- }
514
-
515
- setSpanAttribute("http.status_code", res.status);
516
- const len = res.headers.get("content-length");
517
- if (len) setSpanAttribute("http.response.body.size", Number(len));
518
- recordDuration(opts, performance.now() - start, String(res.status));
519
- return res;
520
- },
521
- );
522
- }
523
-
524
- function recordDuration(opts: TracedFetchOptions, ms: number, status: string) {
525
- getMeter()?.histogramRecord?.("commerce_request_duration_ms", ms, {
526
- provider: opts.provider,
527
- operation: opts.operation,
528
- status,
529
- });
530
- }
531
- Notes:
532
-
533
- Error status on thrown errors is handled by withTracing itself (PR 179 sets SpanStatusCode.ERROR when the wrapped function rethrows). A 5xx response does not throw, so its span status stays unset and filtering relies on the http.status_code attribute. Note that the OTel HTTP semantic conventions recommend setting span status to Error on client spans for 4xx and 5xx responses, so if that behavior is wanted, tracedFetch would need to set the error status explicitly for those responses.
534
- We rely on getMeter()?.histogramRecord?.(...) from PR 179. Single generic histogram commerce_request_duration_ms with provider + operation labels is more useful for cross-provider dashboards than per-provider names; matches how http_request_duration_ms is shaped upstream.
535
- redactUrl strips known-sensitive query params (authToken, cookie, vtex_session, etc.) consistent with the existing sanitization in vtex/utils/fetch.ts.
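The tracedFetch listing calls redactUrl without defining it. A minimal sketch follows, assuming the sensitive parameter names mentioned in the note (authToken, cookie, vtex_session); the real list and the existing sanitization in vtex/utils/fetch.ts may differ, and the "REDACTED" placeholder is an illustrative choice.

```typescript
// Assumed list of sensitive query parameters, compared case-insensitively.
const SENSITIVE_PARAMS = new Set(["authtoken", "cookie", "vtex_session"]);

function redactUrl(url: string): string {
  let parsed: URL;
  try {
    parsed = new URL(url);
  } catch {
    // Not an absolute URL: drop the whole query string rather than risk leaking it.
    return url.split("?")[0];
  }
  // Collect keys first so the params are not mutated while iterating.
  const keys: string[] = [];
  parsed.searchParams.forEach((_value, key) => keys.push(key));
  for (const key of keys) {
    if (SENSITIVE_PARAMS.has(key.toLowerCase())) {
      parsed.searchParams.set(key, "REDACTED");
    }
  }
  // Strip any userinfo credentials embedded in the URL.
  parsed.username = "";
  parsed.password = "";
  return parsed.toString();
}
```

Keeping the parameter name while masking its value preserves the URL shape for grouping in dashboards without exposing the secret.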
536
- B. Funnel every raw fetch through it
537
- VTEX (vtex/):
538
-
539
- In vtex/client.ts, change vtexFetchResponse, vtexCachedFetch, intelligentSearch to call tracedFetch(url, init, { provider: "vtex", operation }, _fetch) instead of _fetch(url, init). operation is hardcoded per call path and excludes the provider prefix, which tracedFetch prepends. The resulting span names are:
540
- vtex.api (catch-all in vtexFetchResponse)
541
- vtex.intelligent-search.{endpoint} — parsed once from the URL path inside intelligentSearch
542
- vtex.catalog.*, vtex.checkout.orderForm, etc. as appropriate
543
- Convert each raw fetch(...) in vtex/actions/, vtex/hooks/, vtex/utils/authHelpers.ts, vtex/utils/proxy.ts, vtex/utils/fetch.ts to either:
544
- Go through _fetch (preferred, since they’re VTEX-account-aware), then through tracedFetch, with an operation derived from the call site (giving span names such as vtex.checkout.addItems, vtex.events.send, vtex.wishlist.list, …).
545
- For the proxy paths (utils/proxy.ts) where we forward arbitrary requests, use operation proxy (span name vtex.proxy) and set proxy.path as an attribute.
546
- The hook fetches (hooks/useCart.ts, etc.) currently run on the browser via the invoke proxy → server function path. They still need instrumentation on the server side; if the hooks themselves run on the client they won’t have a span context, and injectTraceContext will no-op. That’s fine. The server-side _fetch call is where instrumentation lives.
547
- Shopify (shopify/):
548
-
549
- shopify/utils/graphql.ts — wrap the POST in tracedFetch with operation: graphqlOperationName(query) ?? "graphql" and set graphql.operation.name as a span attribute. Parse the operation name with a small regex on the GraphQL document (/^\s*(?:query|mutation|subscription)\s+(\w+)/) — no external dep needed.
550
- Remove the globalThis.fetch fallback path: require that callers route through setShopifyFetch at startup. (Or wrap the fallback in tracedFetch too — pick whichever is least disruptive to existing tests.)
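The operation-name parsing described for the Shopify client can be sketched as follows, using the regex quoted above. The function name graphqlOperationName comes from the plan; its body here is an assumption, not the shipped implementation.

```typescript
// Matches a named operation at the start of a GraphQL document.
const OPERATION_RE = /^\s*(?:query|mutation|subscription)\s+(\w+)/;

// Returns the operation name, or undefined for anonymous operations.
// Limitation: documents that begin with a comment or a fragment definition
// are not matched, because the pattern is anchored at the start.
function graphqlOperationName(document: string): string | undefined {
  return OPERATION_RE.exec(document)?.[1];
}

// Usage as described in the plan: fall back to a generic operation label.
const operation =
  graphqlOperationName("query GetProduct($id: ID!) { product(id: $id) { title } }") ?? "graphql";
```

A full GraphQL parser would handle the unmatched cases, but that would conflict with the no-new-dependencies constraint, so the regex trade-off seems reasonable here.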
551
- Resend, Website:
552
-
553
- resend/actions/send.ts:35 — wrap in tracedFetch with provider: "resend", operation: "emails.send".
554
- website/loaders/fonts/googleFonts.ts:97,103 — wrap with provider: "google-fonts", operation: "css". Low priority but cheap.
555
- Setter signature unchanged: setVtexFetch / setShopifyFetch still accept the result of createInstrumentedFetch("vtex") / ("shopify") from @decocms/start. The new tracedFetch helper sits between the call site and _fetch, so our per-operation span becomes the parent of the upstream ${provider}.fetch span, which runs inside it. That gives operators both views.
556
-
557
- Files to modify
558
-
559
- utils/tracedFetch.ts (new)
560
- utils/tracedFetch.test.ts (new — exercise span name, attributes, traceparent header, histogram record, error status)
561
- vtex/client.ts — route vtexFetchResponse, vtexCachedFetch, intelligentSearch through tracedFetch
562
- vtex/utils/fetch.ts — route the generic wrapper through tracedFetch
563
- vtex/utils/proxy.ts — wrap the two fetch calls
564
- vtex/utils/authHelpers.ts — wrap the logout call
565
- vtex/actions/analytics/sendEvent.ts, vtex/actions/profile.ts, vtex/actions/masterData.ts, and the rest of the VTEX action files identified during audit
566
- vtex/hooks/useCart.ts, vtex/hooks/useWishlist.ts, vtex/hooks/useAutocomplete.ts, vtex/hooks/useUser.ts — convert direct fetch to the per-app wrapper
567
- shopify/utils/graphql.ts — wrap query in tracedFetch, surface graphql.operation.name
568
- resend/actions/send.ts — wrap in tracedFetch
569
- website/loaders/fonts/googleFonts.ts — wrap in tracedFetch
570
- package.json — bump @decocms/start to the version that includes PR 179 (likely a next channel release first)
571
- Reused helpers
572
-
573
- withTracing, injectTraceContext, setSpanAttribute, getMeter, MetricNames — all from @decocms/start/sdk/observability (post-PR-179).
574
- createInstrumentedFetch — still the boundary at startup, unchanged.
575
- RequestContext from @decocms/start/sdk/requestContext — already imported in this repo; useful inside tracedFetch if we want to stamp cookie.account or similar on spans.
576
- Release path
577
-
578
- Per CLAUDE.md, this is behavior-changing (new spans, new metric names, log volume). Route through @next:
579
-
580
- Open PR to next branch (not main). Title/body must avoid the canonical CI-skip token.
581
- Validate on a real VTEX storefront via npm install @decocms/apps@next.
582
- Confirm spans appear in ClickStack with expected shape (vtex.intelligent-search.product_search, full HTTP attrs, traceparent received downstream).
583
- Promote next → main via PR.
584
- Verification
585
-
586
- npm run check clean (typecheck + biome + knip).
587
- npm test — new unit tests on tracedFetch pass:
588
- Asserts span name is ${provider}.${operation}.
589
- Asserts http.method, http.url (redacted), http.status_code attributes set.
590
- Asserts traceparent header is injected when called inside withTracing (use the fake tracer pattern from PR 179’s observability.test.ts).
591
- Asserts histogram histogramRecord is called with commerce_request_duration_ms and the right labels.
592
- Asserts span status is SpanStatusCode.ERROR when the underlying fetch throws.
593
- Local smoke against a real VTEX storefront:
594
- Bind to a dev site, hit PLP/PDP/cart pages.
595
- Tail ClickHouse / ClickStack for service.name = <site> and confirm vtex.intelligent-search.*, vtex.checkout.*, vtex.api spans appear nested under deco.http.request.
596
- Confirm logs from inside those fetches show trace_id/span_id matching the parent trace.
597
- Confirm at least one VTEX endpoint that has its own OTel collector receives our traceparent header (curl or check VTEX-side logs if available).
598
- Open decisions
599
-
600
- Operation-name source — explicit per-caller string vs URL-derived inside tracedFetch. The plan above assumes explicit per-caller (most accurate, no regex maintenance), but a URL→operation router inside tracedFetch would mean fewer per-call-site edits. Worth deciding before implementation starts.
601
- Should injectTraceContext live in upstream createInstrumentedFetch instead? PR 179’s doc claims it does, but the diff doesn’t touch instrumentedFetch.ts. If yes, file a separate upstream PR; if no, our tracedFetch is the right place.
602
- Per-provider vs generic histogram name: the plan uses the generic commerce_request_duration_ms with labels, while the brief asked for vtex_request_duration_ms. The generic name is more flexible; the brief’s per-provider names would match existing per-provider dashboards, if any exist. Decide.
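To make the first open decision (operation-name source) more concrete, this is a rough sketch of what a URL-to-operation router inside tracedFetch could look like. Everything here is hypothetical: the OperationRoute type, the operationFromUrl function, and the two route patterns are illustrative and not taken from the repo.

```typescript
// Hypothetical route entry: first matching pattern wins.
interface OperationRoute {
  pattern: RegExp;
  operation: string;
}

// Illustrative routes only; a real table would be derived from the audit.
const VTEX_ROUTES: OperationRoute[] = [
  { pattern: /\/intelligent-search\/product_search/, operation: "intelligent-search.product_search" },
  { pattern: /\/api\/checkout\/pub\/orderForm/, operation: "checkout.orderForm" },
];

// Maps a URL to an operation label, falling back to a catch-all.
function operationFromUrl(url: string, routes: OperationRoute[], fallback = "api"): string {
  let path: string;
  try {
    path = new URL(url).pathname;
  } catch {
    path = url; // relative or malformed input: match against the raw string
  }
  for (const route of routes) {
    if (route.pattern.test(path)) return route.operation;
  }
  return fallback;
}
```

The trade-off matches the one stated in the decision: a table like this reduces per-call-site edits but has to be maintained as endpoints change, whereas explicit per-caller strings stay accurate without any pattern upkeep.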