@axlsdk/studio 0.14.0 → 0.16.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (29)

package/README.md +69 -10
package/dist/chunk-RE6VPUXA.js +2213 -0
package/dist/chunk-RE6VPUXA.js.map +1 -0
package/dist/cli.cjs +1191 -143
package/dist/cli.cjs.map +1 -1
package/dist/cli.js +1 -1
package/dist/client/assets/index-ClajLxib.js +288 -0
package/dist/client/assets/index-DnHL_gtF.css +1 -0
package/dist/client/index.html +2 -2
package/dist/connection-manager-DAuqk9lM.d.cts +166 -0
package/dist/connection-manager-DAuqk9lM.d.ts +166 -0
package/dist/middleware.cjs +1222 -150
package/dist/middleware.cjs.map +1 -1
package/dist/middleware.d.cts +76 -6
package/dist/middleware.d.ts +76 -6
package/dist/middleware.js +32 -8
package/dist/middleware.js.map +1 -1
package/dist/server/index.cjs +1194 -142
package/dist/server/index.cjs.map +1 -1
package/dist/server/index.d.cts +171 -28
package/dist/server/index.d.ts +171 -28
package/dist/server/index.js +7 -3
package/package.json +13 -9
package/dist/chunk-HUKUQDYL.js +0 -1163
package/dist/chunk-HUKUQDYL.js.map +0 -1
package/dist/client/assets/index-7aDhMztu.css +0 -1
package/dist/client/assets/index-Bzr3vDPz.js +0 -255
package/dist/connection-manager-B7AWpsCD.d.cts +0 -81
package/dist/connection-manager-B7AWpsCD.d.ts +0 -81

package/README.md CHANGED

@@ -95,7 +95,12 @@ Execute workflows with custom JSON input. View execution timelines showing each
 Waterfall visualization of execution traces. Filter by type, agent, or tool. View token counts, cost per step, and duration.
 ### Cost Dashboard
-Track spending across agents, models, and workflows. Live cost updates via WebSocket. Per-agent and per-model breakdowns.
+Track spending across agents, models, workflows, and embedders with time-window filtering (24h/7d/30d/all). Live cost updates via WebSocket; all breakdown tables are user-sortable by any column. Two sections appear conditionally:
+- **Retry Overhead** — when retries have happened, decomposes `agent_call` cost by `retryReason` (`primary` / `schema` / `validate` / `guardrail`) with per-reason call counts. Surfaces how much spend was wasted on gate failures.
+- **Memory (Embedder)** — when semantic memory ops have run, shows `CostData.byEmbedder: Record<string, { cost, calls, tokens }>` bucketed by embedder model (e.g., `text-embedding-3-small`).
+Sub-cent cost values use tiered precision (`< $0.000001` sentinel, `< $0.0001` scientific, `< $0.01` six decimals, `>= $0.01` two decimals) so embedder costs don't collapse to `$0.0000`.
 ### Memory Browser
 View and manage agent memory (session and global scope). Create, edit, and delete entries. Test semantic recall queries.
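Reviewer note on the Cost Dashboard hunk above: the added README text lists four precision tiers for cost values but not the exact strings the UI prints. The sketch below is one way those tiers could be implemented. `formatCost` is a hypothetical name, the output strings are assumptions, and treating zero as `$0.00` is also an assumption; only the thresholds come from the diff.

```typescript
// Hypothetical formatter illustrating the tiered precision described in the README.
// Thresholds are from the diff; the exact output strings are assumed.
function formatCost(usd: number): string {
  if (usd === 0) return "$0.00"; // assumption: zero is not shown as the sentinel
  if (usd < 0.000001) return "< $0.000001"; // sentinel tier
  if (usd < 0.0001) return `$${usd.toExponential(2)}`; // scientific tier
  if (usd < 0.01) return `$${usd.toFixed(6)}`; // six-decimal tier
  return `$${usd.toFixed(2)}`; // two-decimal tier
}

console.log(formatCost(0.00002)); // "$2.00e-5"
console.log(formatCost(0.0042)); // "$0.004200"
console.log(formatCost(1.5)); // "$1.50"
```

With a fixed four-decimal format, the first two values would both print as `$0.0000`, which is the collapse the README says the tiers avoid.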
@@ -107,7 +112,7 @@ Browse active sessions with conversation history. Replay sessions step by step.
 Browse all registered tools with their schemas rendered as forms. Test any tool directly with custom input and see the result.
 ### Eval Runner
-Run evaluations from the UI. View per-item results with scores, timing, and cost. Drill into individual items to see LLM scorer reasoning, per-scorer timing/cost, and annotations. Filter items by error state or score threshold, sort by score/duration/cost. Score distribution chart shows how scores are spread across bins. Compare two runs with the run picker (baseline/candidate selection from history), timing/cost tradeoff analysis, item-level comparison table, and expandable regression detail showing side-by-side outputs and reasoning. History tab groups multi-run results and tracks mean scores across runs with an eval name filter. Multi-run switcher navigates between individual runs. LLM scorer badges distinguish LLM-judged from deterministic scorers. Significance tooltips explain bootstrap CI methodology. Requires `@axlsdk/eval` as an optional peer dependency.
+Run evaluations from the UI. Toggle **Capture traces** in the command bar to populate per-item `EvalItem.traces` — the item detail panel renders each captured event inline with type, agent/tool, duration, and cost (success and failure paths both). View per-item results with scores, timing, and cost. Drill into individual items to see LLM scorer reasoning, per-scorer timing/cost, and annotations. Filter items by error state or score threshold, sort by score/duration/cost. Score distribution chart shows how scores are spread across bins. Compare two runs with the run picker (baseline/candidate selection from history), timing/cost tradeoff analysis, item-level comparison table, and expandable regression detail showing side-by-side outputs and reasoning. History tab groups multi-run results and tracks mean scores across runs with an eval name filter. Multi-run switcher navigates between individual runs. LLM scorer badges distinguish LLM-judged from deterministic scorers. Significance tooltips explain bootstrap CI methodology. Requires `@axlsdk/eval` as an optional peer dependency.
 ## What gets registered
@@ -139,14 +144,17 @@ Studio exposes a REST API that the SPA consumes. You can also call these directl
 | `POST /api/tools/:name/test` | Test a tool with `{ input: {...} }` |
 | `GET /api/sessions` | List sessions |
 | `GET /api/executions` | List executions |
-| `GET /api/costs` | Aggregated cost data |
-| `POST /api/costs/reset` | Reset cost counters |
+| `GET /api/costs?window=24h\|7d\|30d\|all` | Aggregated cost data for a time window (default `7d`). `?windows=all` returns all four windows at once for debugging |
+| `GET /api/eval-trends?window=` | Per-eval score trends (latest, mean, std), cost totals, recent runs with `model`/`duration` |
+| `GET /api/workflow-stats?window=` | Per-workflow totals, completed/failed counts, p50/p95/avg duration, failure rate |
+| `GET /api/trace-stats?window=` | Event-type distribution, tool call counts (calls/approved/denied), retry breakdown by agent |
 | `GET /api/memory/:scope/:key` | Read memory entry |
 | `PUT /api/memory/:scope/:key` | Save memory entry |
 | `DELETE /api/memory/:scope/:key` | Delete memory entry |
 | `GET /api/evals` | List registered eval configs |
 | `GET /api/evals/history` | List eval run history |
-| `POST /api/evals/:name/run` | Run a registered eval by name. Accepts `{ runs: N }` (capped at 25) |
+| `POST /api/evals/:name/run` | Run a registered eval by name. Body: `{ runs?: N, stream?: true, captureTraces?: true }` (`runs` capped at 25). When `stream: true`, returns `{ evalRunId }` immediately and broadcasts progress over the `eval:{evalRunId}` WS channel (`item_done` / `run_done` / `done` / `error`). The `done` event carries only `{ evalResultId, runGroupId? }` — a pointer, not the full result — so multi-item evals don't hit the 64KB WS frame cap. Clients refetch the full result from history. `captureTraces: true` populates per-item `EvalItem.traces` on every item (success + failure); the Eval Runner panel renders these inline on item detail. Synchronous mode (default) is unchanged |
+| `POST /api/evals/runs/:evalRunId/cancel` | Abort an active streaming eval run. The cancelled run appears in history with remaining items marked as cancelled |
 | `POST /api/evals/:name/rescore` | Re-score a history entry with the eval's current scorers |
 | `POST /api/evals/import` | Import a CLI eval artifact (parsed `EvalResult` JSON) into runtime history |
 | `DELETE /api/evals/history/:id` | Delete a single history entry. Blocked in readOnly |
@@ -166,7 +174,15 @@ Single endpoint at `ws://localhost:4400/ws` with channel multiplexing:
 { "type": "event", "channel": "trace:abc-123", "data": { ... } }
 ```
-Channels: `execution:{id}`, `trace:{id}`, `trace:*`, `costs`, `decisions`. Execution channels have replay buffering — late subscribers receive the full event history (capped at 500 events, cleaned up 30s after stream completes).
+Channels: `execution:{id}`, `trace:{id}`, `trace:*`, `eval:{id}`, `eval:{evalRunId}`, `eval:*`, `costs`, `eval-trends`, `workflow-stats`, `trace-stats`, `decisions`. Execution and eval channels have replay buffering — late subscribers receive the full event history (capped at 1000 events by default; tunable via `bufferCaps`, see below). Buffers are cleaned up 30s after the stream completes. Aggregate channels (`costs`, `eval-trends`, `workflow-stats`, `trace-stats`) broadcast `{ snapshots: Record<WindowId, State>, updatedAt }` on every fold or rebuild.
+**Outbound frame budget.** The WS broadcast layer enforces a 64KB soft cap via `truncateIfOversized`. Oversized verbose-mode `agent_call_end.data.messages` snapshots are replaced with a `{ __truncated: true, originalBytes, maxBytes, hint }` placeholder that preserves the event's `type`/`step`/`agent`/`tool` so the Trace Explorer still renders the row. The 64KB threshold matches the inbound message reject limit in the WS protocol (shared constant).
+### Migrating from 0.14
+- **`POST /api/costs/reset` has been removed.** Any script hitting the old endpoint gets `404`. Use window selection (`?window=`) instead — snapshots evict automatically as their window slides.
+- **`CostAggregator` class is no longer exported** from `@axlsdk/studio`. Replaced by `TraceAggregator<CostData>` configured with a pure `reduceCost` reducer. Behavior is preserved.
+- **`costs` WS channel payload shape changed** from `CostData` to `{ snapshots: Record<WindowId, CostData>, updatedAt: number }`. Clients that read the old shape must select a window (typically `snapshots['7d']`).
 ## Embeddable Middleware
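Reviewer note on the "Migrating from 0.14" hunk above: clients of the `costs` channel now receive `{ snapshots, updatedAt }` instead of a bare `CostData`. A minimal sketch of a client-side adapter that tolerates both shapes during a rollout might look like this. `selectWindow`, `isAggregatePayload`, and the `totalCost` field in the usage lines are hypothetical; only the payload shape and the window ids come from the diff.

```typescript
type WindowId = "24h" | "7d" | "30d" | "all";

// Payload shape described in the 0.16 README for aggregate channels.
interface AggregatePayload<State> {
  snapshots: Record<WindowId, State>;
  updatedAt: number;
}

// Hypothetical type guard: distinguishes the new envelope from the old bare state.
function isAggregatePayload<State>(p: unknown): p is AggregatePayload<State> {
  return typeof p === "object" && p !== null && "snapshots" in p && "updatedAt" in p;
}

// Hypothetical adapter: returns the state for one window, or the payload itself
// when it is still in the pre-0.16 shape.
function selectWindow<State>(payload: unknown, window: WindowId = "7d"): State {
  return isAggregatePayload<State>(payload) ? payload.snapshots[window] : (payload as State);
}

// Usage with an illustrative state shape (the real CostData has other fields).
const next = {
  snapshots: { "24h": { totalCost: 1 }, "7d": { totalCost: 5 }, "30d": { totalCost: 9 }, all: { totalCost: 20 } },
  updatedAt: 1700000000000,
};
console.log(selectWindow<{ totalCost: number }>(next).totalCost); // 5
```

Defaulting to `7d` mirrors the default window the README gives for `GET /api/costs`.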
@@ -205,9 +221,11 @@ studio.upgradeWebSocket(server);
 | `runtime` | `AxlRuntime` | required | The runtime instance to observe and control |
 | `basePath` | `string` | `''` | URL path prefix (e.g., `'/studio'`) |
 | `serveClient` | `boolean` | `true` | Serve the pre-built SPA |
-| `verifyUpgrade` | `(req) => boolean \| Promise<boolean>` | — | Auth callback for WebSocket upgrades |
-| `readOnly` | `boolean` | `false` | Disable all mutating endpoints |
+| `verifyUpgrade` | `(req) => boolean \| { allowed: boolean, metadata?: unknown } \| Promise<...>` | — | Auth callback for WebSocket upgrades. The object form attaches `metadata` (tenant/user id / role) to the connection, available to `filterTraceEvent` on every outbound broadcast. Bare boolean still works (back-compat) |
+| `filterTraceEvent` | `(event, metadata) => boolean` | — | Per-connection broadcast filter for multi-tenant deployments. Called on every outbound trace event (and on replay buffer events for late subscribers, so historical cross-tenant events can't leak on reconnect). Predicate errors are fail-closed — event is dropped |
+| `readOnly` | `boolean` | `false` | Disable all mutating endpoints. `POST /api/evals/compare` is allowed (pure computation); `POST /api/evals/import`, `POST /api/evals/:name/run`, `POST /api/evals/:name/rescore`, `POST /api/evals/runs/:evalRunId/cancel`, and `DELETE /api/evals/history/:id` are blocked |
 | `evals` | `string \| string[] \| { files, conditions? }` | — | Lazy-load eval files for the Eval Runner panel |
+| `bufferCaps` | `{ maxEventsPerBuffer?, maxBytesPerBuffer?, maxActiveBuffers? }` | `{ 1000, 4 MiB, 256 }` | Override the default WebSocket replay-buffer resource caps for high-churn deployments. Worst-case memory is roughly `maxActiveBuffers × maxBytesPerBuffer` (≈1 GiB at defaults). Terminal `done`/`error` events are always buffered regardless of caps |
 ### Return value
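Reviewer note on the `bufferCaps` row in the hunk above: the README estimates worst-case replay-buffer memory as `maxActiveBuffers × maxBytesPerBuffer`. The arithmetic can be checked with a small helper. `worstCaseReplayBytes` and `DEFAULT_CAPS` are hypothetical names; the field names and default values are taken from the diff.

```typescript
interface BufferCaps {
  maxEventsPerBuffer?: number;
  maxBytesPerBuffer?: number;
  maxActiveBuffers?: number;
}

const MiB = 1024 * 1024;

// Defaults as listed in the README table: { 1000, 4 MiB, 256 }.
const DEFAULT_CAPS: Required<BufferCaps> = {
  maxEventsPerBuffer: 1000,
  maxBytesPerBuffer: 4 * MiB,
  maxActiveBuffers: 256,
};

// Hypothetical helper: upper bound on bytes held by replay buffers.
function worstCaseReplayBytes(caps: BufferCaps = {}): number {
  const buffers = caps.maxActiveBuffers ?? DEFAULT_CAPS.maxActiveBuffers;
  const bytesEach = caps.maxBytesPerBuffer ?? DEFAULT_CAPS.maxBytesPerBuffer;
  return buffers * bytesEach;
}

console.log(worstCaseReplayBytes() / (1024 * MiB)); // 1 (GiB at defaults)
console.log(worstCaseReplayBytes({ maxActiveBuffers: 64 }) / MiB); // 256 (MiB)
```

The event-count cap does not enter the bound because the byte cap already limits each buffer's size.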
@@ -421,6 +439,42 @@ Lazy-loaded evals coexist with evals registered directly via `runtime.registerEv
 - CORS is not applied in embedded mode — the host framework owns CORS policy
 - `basePath` is validated against unsafe characters and path traversal
+### Observability-boundary redaction
+When the runtime is constructed with `config.trace.redact: true`, Studio scrubs user/LLM content at three layers — trace events at emission, REST route responses at serialization, and WebSocket broadcasts at send time — while preserving structural metadata (IDs, keys, agent/tool/workflow names, roles, cost/token/duration metrics, timestamps).
+```typescript
+const runtime = new AxlRuntime({ trace: { redact: true } });
+const studio = createStudioMiddleware({ runtime });
+```
+Under `redact: true`, the following Studio endpoints scrub user content server-side before responding: `GET /api/executions{,/:id}`, `GET /api/memory/:scope{,/:key}` (keys preserved so Memory Browser stays navigable), `GET /api/sessions/:id`, `GET /api/evals/history`, `POST /api/evals/:name/run` (sync), `POST /api/evals/:name/rescore`, `GET /api/decisions`, `POST /api/tools/:name/test`, `POST /api/workflows/:name/execute` (sync); streaming WS broadcasts on `/workflows/:name/execute` with `stream: true`, `/api/playground/chat`, AND the trace channel firehose (`trace:{executionId}`) all scrub `AxlEvent` content before send.
+Studio checks the flag via `runtime.isRedactEnabled(): boolean` — it does **not** reach into the config object directly, because `Readonly<AxlConfig>` is shallow and consumers could mutate the nested `trace.redact` field via sub-object access. `GET /api/health` also reports `readOnly: boolean` so clients can gate mutating UI affordances.
+See [`docs/observability.md`](../../docs/observability.md#pii-and-redaction) for the complete scrubbed/preserved field table.
+### Multi-tenant deployments
+Combine `verifyUpgrade` returning `{ allowed, metadata }` with `filterTraceEvent` to scope each WebSocket connection to a tenant/user:
+```typescript
+const studio = createStudioMiddleware({
+  runtime,
+  verifyUpgrade: (req) => {
+    const userId = authenticate(req);
+    if (!userId) return { allowed: false };
+    return { allowed: true, metadata: { userId, tenantId: lookupTenant(userId) } };
+  },
+  filterTraceEvent: (event, metadata) => {
+    // Scope the trace firehose: only let a connection see its own tenant's events.
+    return event.metadata?.tenantId === metadata?.tenantId;
+  },
+});
+```
+The filter runs on live broadcasts **and** on replay buffer events delivered to late subscribers, so historical cross-tenant events can't leak on reconnect. Predicate errors are fail-closed (event dropped).
 ### Migrating from the standalone CLI
 If you currently use `npx @axlsdk/studio` with a config file:
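Reviewer note on the multi-tenant hunk above: the README states that `filterTraceEvent` predicate errors are fail-closed. The diff does not show how Studio implements this, so the following is only a sketch of the general pattern. `applyFailClosed` and the event and metadata shapes are hypothetical.

```typescript
type TraceFilter<E, M> = (event: E, metadata: M) => boolean;

// Hypothetical wrapper: an event is delivered only when the predicate
// returns true; a thrown error is treated as "drop the event".
function applyFailClosed<E, M>(filter: TraceFilter<E, M>, event: E, metadata: M): boolean {
  try {
    return filter(event, metadata) === true;
  } catch {
    return false; // fail closed: never leak an event because the filter crashed
  }
}

// Illustrative shapes, loosely following the README example.
interface DemoEvent { metadata?: { tenantId?: string } }
interface DemoMeta { tenantId: string }

const sameTenant: TraceFilter<DemoEvent, DemoMeta | undefined> = (event, meta) => {
  if (!meta) throw new Error("connection has no metadata");
  return event.metadata?.tenantId === meta.tenantId;
};

console.log(applyFailClosed(sameTenant, { metadata: { tenantId: "a" } }, { tenantId: "a" })); // true
console.log(applyFailClosed(sameTenant, { metadata: { tenantId: "a" } }, undefined)); // false
```

The alternative, failing open, would broadcast an event whenever the predicate throws, which is the cross-tenant leak the README says the filter is meant to prevent.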
@@ -443,10 +497,15 @@ src/
   server/
     index.ts              createServer() — Hono app composition (basePath, readOnly, cors)
     types.ts              API types, WebSocket message types
-    cost-aggregator.ts    Accumulates cost from trace events
+    aggregates/
+      aggregate-snapshots.ts  AggregateSnapshots<State> helper (per-window state, fold, replace, broadcastTransform)
+      trace-aggregator.ts     TraceAggregator<State> — AxlEvent consumer (costs, trace-stats)
+      execution-aggregator.ts ExecutionAggregator<State> — ExecutionInfo consumer (workflow-stats)
+      eval-aggregator.ts      EvalAggregator<State> — EvalHistoryEntry consumer (eval-trends)
+      reducers.ts             Pure reducers: reduceCost, reduceWorkflowStats, reduceTraceStats, reduceEvalTrends + enrichWorkflowStats
     middleware/
       error-handler.ts    Axl errors → JSON error envelope
-    routes/               One file per resource (health, workflows, agents, tools, etc.)
+    routes/               One file per resource (health, workflows, agents, tools, costs, eval-trends, workflow-stats, trace-stats, evals, etc.)
     ws/
       handler.ts          WebSocket message routing (Hono adapter)
       connection-manager.ts  Channel subscriptions + broadcast (BroadcastTarget) + replay buffer for execution channels