npm - @axlsdk/studio - Versions diffs - 0.13.8 → 0.15.0 - Mend

@axlsdk/studio 0.13.8 → 0.15.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (31)

package/README.md +122 -11
package/dist/chunk-IPDMFFTQ.js +2142 -0
package/dist/chunk-IPDMFFTQ.js.map +1 -0
package/dist/{chunk-6VDX5CRP.js → chunk-JGQ3MSIG.js} +5 -2
package/dist/chunk-JGQ3MSIG.js.map +1 -0
package/dist/cli.cjs +1336 -209
package/dist/cli.cjs.map +1 -1
package/dist/cli.js +8 -3
package/dist/cli.js.map +1 -1
package/dist/client/assets/index-CLKKOaE2.css +1 -0
package/dist/client/assets/index-rvds50cZ.js +278 -0
package/dist/client/index.html +2 -2
package/dist/{connection-manager-B7AWpsCD.d.cts → connection-manager-BMPahDuY.d.cts} +63 -1
package/dist/{connection-manager-B7AWpsCD.d.ts → connection-manager-BMPahDuY.d.ts} +63 -1
package/dist/middleware.cjs +1353 -211
package/dist/middleware.cjs.map +1 -1
package/dist/middleware.d.cts +52 -5
package/dist/middleware.d.ts +52 -5
package/dist/middleware.js +31 -8
package/dist/middleware.js.map +1 -1
package/dist/server/index.cjs +1325 -202
package/dist/server/index.cjs.map +1 -1
package/dist/server/index.d.cts +165 -28
package/dist/server/index.d.ts +165 -28
package/dist/server/index.js +7 -3
package/package.json +11 -6
package/dist/chunk-6VDX5CRP.js.map +0 -1
package/dist/chunk-YWRYXT7U.js +0 -1021
package/dist/chunk-YWRYXT7U.js.map +0 -1
package/dist/client/assets/index-C_uwupnn.js +0 -221
package/dist/client/assets/index-DVcH6P9w.css +0 -1

package/README.md CHANGED

@@ -95,7 +95,12 @@ Execute workflows with custom JSON input. View execution timelines showing each
 Waterfall visualization of execution traces. Filter by type, agent, or tool. View token counts, cost per step, and duration.
 ### Cost Dashboard
-Track spending across agents, models, and workflows. Live cost updates via WebSocket. Per-agent and per-model breakdowns.
+Track spending across agents, models, workflows, and embedders with time-window filtering (24h/7d/30d/all). Live cost updates via WebSocket; all breakdown tables are user-sortable by any column. Two sections appear conditionally:
+- **Retry Overhead** — when retries have happened, decomposes `agent_call` cost by `retryReason` (`primary` / `schema` / `validate` / `guardrail`) with per-reason call counts. Surfaces how much spend was wasted on gate failures.
+- **Memory (Embedder)** — when semantic memory ops have run, shows `CostData.byEmbedder: Record<string, { cost, calls, tokens }>` bucketed by embedder model (e.g., `text-embedding-3-small`).
+Sub-cent cost values use tiered precision (`< $0.000001` sentinel, `< $0.0001` scientific, `< $0.01` six decimals, `>= $0.01` two decimals) so embedder costs don't collapse to `$0.0000`.
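The tiered precision rule described in the added README text could be sketched as a small formatter. This is only an illustration of the four tiers, not Studio's actual implementation: the function name `formatCost`, the exact sentinel string, the treatment of zero, and the number of digits in the scientific tier are all assumptions.

```typescript
// Hypothetical formatter illustrating the four precision tiers.
// Assumes a non-negative USD amount; name and output strings are illustrative.
function formatCost(usd: number): string {
  if (usd === 0) return '$0.00'; // assumption: exact zero renders as plain zero
  if (usd < 0.000001) return '< $0.000001'; // sentinel tier
  if (usd < 0.0001) return `$${usd.toExponential(2)}`; // scientific tier
  if (usd < 0.01) return `$${usd.toFixed(6)}`; // six-decimal tier
  return `$${usd.toFixed(2)}`; // two-decimal tier
}
```

The point of the tiers is that a typical per-call embedder cost still renders as a distinguishable value instead of rounding to zero.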
 ### Memory Browser
 View and manage agent memory (session and global scope). Create, edit, and delete entries. Test semantic recall queries.
@@ -107,7 +112,7 @@ Browse active sessions with conversation history. Replay sessions step by step.
 Browse all registered tools with their schemas rendered as forms. Test any tool directly with custom input and see the result.
 ### Eval Runner
-Run evaluations from the UI. View per-item results with scores, timing, and cost. Drill into individual items to see LLM scorer reasoning, per-scorer timing/cost, and annotations. Filter items by error state or score threshold, sort by score/duration/cost. Score distribution chart shows how scores are spread across bins. Compare two runs with the run picker (baseline/candidate selection from history), timing/cost tradeoff analysis, item-level comparison table, and expandable regression detail showing side-by-side outputs and reasoning. History tab groups multi-run results and tracks mean scores across runs with an eval name filter. Multi-run switcher navigates between individual runs. LLM scorer badges distinguish LLM-judged from deterministic scorers. Significance tooltips explain bootstrap CI methodology. Requires `@axlsdk/eval` as an optional peer dependency.
+Run evaluations from the UI. Toggle **Capture traces** in the command bar to populate per-item `EvalItem.traces` — the item detail panel renders each captured event inline with type, agent/tool, duration, and cost (success and failure paths both). View per-item results with scores, timing, and cost. Drill into individual items to see LLM scorer reasoning, per-scorer timing/cost, and annotations. Filter items by error state or score threshold, sort by score/duration/cost. Score distribution chart shows how scores are spread across bins. Compare two runs with the run picker (baseline/candidate selection from history), timing/cost tradeoff analysis, item-level comparison table, and expandable regression detail showing side-by-side outputs and reasoning. History tab groups multi-run results and tracks mean scores across runs with an eval name filter. Multi-run switcher navigates between individual runs. LLM scorer badges distinguish LLM-judged from deterministic scorers. Significance tooltips explain bootstrap CI methodology. Requires `@axlsdk/eval` as an optional peer dependency.
 ## What gets registered
@@ -139,16 +144,21 @@ Studio exposes a REST API that the SPA consumes. You can also call these directl
 | `POST /api/tools/:name/test` | Test a tool with `{ input: {...} }` |
 | `GET /api/sessions` | List sessions |
 | `GET /api/executions` | List executions |
-| `GET /api/costs` | Aggregated cost data |
-| `POST /api/costs/reset` | Reset cost counters |
+| `GET /api/costs?window=24h\|7d\|30d\|all` | Aggregated cost data for a time window (default `7d`). `?windows=all` returns all four windows at once for debugging |
+| `GET /api/eval-trends?window=` | Per-eval score trends (latest, mean, std), cost totals, recent runs with `model`/`duration` |
+| `GET /api/workflow-stats?window=` | Per-workflow totals, completed/failed counts, p50/p95/avg duration, failure rate |
+| `GET /api/trace-stats?window=` | Event-type distribution, tool call counts (calls/approved/denied), retry breakdown by agent |
 | `GET /api/memory/:scope/:key` | Read memory entry |
 | `PUT /api/memory/:scope/:key` | Save memory entry |
 | `DELETE /api/memory/:scope/:key` | Delete memory entry |
 | `GET /api/evals` | List registered eval configs |
 | `GET /api/evals/history` | List eval run history |
-| `POST /api/evals/:name/run` | Run a registered eval by name. Accepts `{ runs: N }` (capped at 25) |
+| `POST /api/evals/:name/run` | Run a registered eval by name. Body: `{ runs?: N, stream?: true, captureTraces?: true }` (`runs` capped at 25). When `stream: true`, returns `{ evalRunId }` immediately and broadcasts progress over the `eval:{evalRunId}` WS channel (`item_done` / `run_done` / `done` / `error`). The `done` event carries only `{ evalResultId, runGroupId? }` — a pointer, not the full result — so multi-item evals don't hit the 64KB WS frame cap. Clients refetch the full result from history. `captureTraces: true` populates per-item `EvalItem.traces` on every item (success + failure); the Eval Runner panel renders these inline on item detail. Synchronous mode (default) is unchanged |
+| `POST /api/evals/runs/:evalRunId/cancel` | Abort an active streaming eval run. The cancelled run appears in history with remaining items marked as cancelled |
 | `POST /api/evals/:name/rescore` | Re-score a history entry with the eval's current scorers |
-| `POST /api/evals/compare` | Compare two eval results |
+| `POST /api/evals/import` | Import a CLI eval artifact (parsed `EvalResult` JSON) into runtime history |
+| `DELETE /api/evals/history/:id` | Delete a single history entry. Blocked in readOnly |
+| `POST /api/evals/compare` | Compare two eval results by history ID. Body: `{ baselineId, candidateId, options? }` where each ID is `string` (single run) or `string[]` (pooled multi-run). Resolves IDs server-side from `runtime.getEvalHistory()` so the wire payload stays small |
 | `POST /api/playground/chat` | Chat with an agent directly (no workflow required). Accepts `{ message, agent?, sessionId? }`. Streams results via WebSocket |
 | `GET /api/decisions` | List pending decisions |
 | `POST /api/decisions/:id/resolve` | Resolve a pending decision |
@@ -164,7 +174,15 @@ Single endpoint at `ws://localhost:4400/ws` with channel multiplexing:
 { "type": "event", "channel": "trace:abc-123", "data": { ... } }
 ```
-Channels: `execution:{id}`, `trace:{id}`, `trace:*`, `costs`, `decisions`. Execution channels have replay buffering — late subscribers receive the full event history (capped at 500 events, cleaned up 30s after stream completes).
+Channels: `execution:{id}`, `trace:{id}`, `trace:*`, `eval:{id}`, `eval:{evalRunId}`, `eval:*`, `costs`, `eval-trends`, `workflow-stats`, `trace-stats`, `decisions`. Execution and eval channels have replay buffering — late subscribers receive the full event history (capped at 500 events, cleaned up 30s after stream completes). Aggregate channels (`costs`, `eval-trends`, `workflow-stats`, `trace-stats`) broadcast `{ snapshots: Record<WindowId, State>, updatedAt }` on every fold or rebuild.
+**Outbound frame budget.** The WS broadcast layer enforces a 64KB soft cap via `truncateIfOversized`. Oversized verbose-mode `agent_call.data.messages` snapshots are replaced with a `{ __truncated: true, originalBytes, maxBytes, hint }` placeholder that preserves the event's `type`/`step`/`agent`/`tool` so the Trace Explorer still renders the row. The 64KB threshold matches the inbound message reject limit in the WS protocol (shared constant).
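A client that renders trace rows may want to recognize the truncation placeholder before treating `messages` as an array. The sketch below assumes the placeholder sits where `agent_call.data.messages` would otherwise be. `isTruncated` and `describeMessages` are hypothetical helpers, not exports of `@axlsdk/studio`.

```typescript
// Placeholder shape as listed in the README text above.
interface TruncatedPlaceholder {
  __truncated: true;
  originalBytes: number;
  maxBytes: number;
  hint: string;
}

// Hypothetical client-side type guard for the placeholder.
function isTruncated(value: unknown): value is TruncatedPlaceholder {
  return (
    typeof value === 'object' &&
    value !== null &&
    (value as { __truncated?: unknown }).__truncated === true
  );
}

// Hypothetical helper: build a short label for a trace row's messages field.
function describeMessages(messages: unknown): string {
  if (isTruncated(messages)) {
    return `truncated (${messages.originalBytes} bytes, limit ${messages.maxBytes})`;
  }
  return Array.isArray(messages) ? `${messages.length} messages` : 'no messages';
}
```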
+### Migrating from 0.14
+- **`POST /api/costs/reset` has been removed.** Any script hitting the old endpoint gets `404`. Use window selection (`?window=`) instead — snapshots evict automatically as their window slides.
+- **`CostAggregator` class is no longer exported** from `@axlsdk/studio`. Replaced by `TraceAggregator<CostData>` configured with a pure `reduceCost` reducer. Behavior is preserved.
+- **`costs` WS channel payload shape changed** from `CostData` to `{ snapshots: Record<WindowId, CostData>, updatedAt: number }`. Clients that read the old shape must select a window (typically `snapshots['7d']`).
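For the `costs` payload change, a client could normalize both shapes behind one accessor while it migrates. This is a sketch under stated assumptions: `CostDataLike` models only the `byEmbedder` field mentioned earlier in the README, and `selectCostWindow` is a hypothetical helper rather than part of the package.

```typescript
// Window ids as listed for GET /api/costs.
type WindowId = '24h' | '7d' | '30d' | 'all';

// Partial stand-in for CostData; only `byEmbedder` is modeled here.
interface CostDataLike {
  byEmbedder: Record<string, { cost: number; calls: number; tokens: number }>;
}

// New broadcast envelope described in the migration note.
interface CostsBroadcast {
  snapshots: Record<WindowId, CostDataLike>;
  updatedAt: number;
}

// Hypothetical adapter: accepts the old bare payload or the new envelope
// and returns the state for a single window.
function selectCostWindow(
  payload: CostDataLike | CostsBroadcast,
  windowId: WindowId = '7d',
): CostDataLike {
  if ('snapshots' in payload) {
    return payload.snapshots[windowId];
  }
  return payload; // pre-migration shape: the payload is the state itself
}
```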
 ## Embeddable Middleware
@@ -203,8 +221,9 @@ studio.upgradeWebSocket(server);
 | `runtime` | `AxlRuntime` | required | The runtime instance to observe and control |
 | `basePath` | `string` | `''` | URL path prefix (e.g., `'/studio'`) |
 | `serveClient` | `boolean` | `true` | Serve the pre-built SPA |
-| `verifyUpgrade` | `(req) => boolean \| Promise<boolean>` | — | Auth callback for WebSocket upgrades |
-| `readOnly` | `boolean` | `false` | Disable all mutating endpoints |
+| `verifyUpgrade` | `(req) => boolean \| { allowed: boolean, metadata?: unknown } \| Promise<...>` | — | Auth callback for WebSocket upgrades. The object form attaches `metadata` (tenant/user id / role) to the connection, available to `filterTraceEvent` on every outbound broadcast. Bare boolean still works (back-compat) |
+| `filterTraceEvent` | `(event, metadata) => boolean` | — | Per-connection broadcast filter for multi-tenant deployments. Called on every outbound trace event (and on replay buffer events for late subscribers, so historical cross-tenant events can't leak on reconnect). Predicate errors are fail-closed — event is dropped |
+| `readOnly` | `boolean` | `false` | Disable all mutating endpoints. `POST /api/evals/compare` is allowed (pure computation); `POST /api/evals/import`, `POST /api/evals/:name/run`, `POST /api/evals/:name/rescore`, `POST /api/evals/runs/:evalRunId/cancel`, and `DELETE /api/evals/history/:id` are blocked |
 | `evals` | `string \| string[] \| { files, conditions? }` | — | Lazy-load eval files for the Eval Runner panel |
 ### Return value
@@ -220,6 +239,57 @@ studio.upgradeWebSocket(server);
 **Note:** `upgradeWebSocket(server)` is required for real-time features (trace streaming, cost updates, execution events, decision resolution). Without it, the Studio SPA loads but panels relying on live data will show no updates. If your framework manages WebSocket connections itself (NestJS gateway, Fastify plugin), use `handleWebSocket()` instead.
+### Host body limits
+Studio's API uses small request bodies — the eval comparison flow sends history IDs (~100 bytes), not full result payloads — so the default body limits in Express, NestJS, Fastify, and Koa (typically 100KB) are sufficient for normal use.
+The one exception is `POST /api/evals/import`, which accepts a full `EvalResult` JSON (typically a CLI artifact from `axl-eval --output result.json`). If you import sizeable eval files through Studio, raise your host framework's JSON body limit *on the Studio sub-mount only*.
+**Express:**
+```typescript
+import express from 'express';
+const app = express();
+// Larger limit just for Studio; the rest of the app keeps its defaults.
+app.use('/studio', express.json({ limit: '10mb' }), studio.handler);
+```
+**NestJS:** NestJS registers its own body-parser at bootstrap, so `app.use(express.json(...))` added after `NestFactory.create()` does *not* override it — the built-in parser runs first and still rejects with `PayloadTooLargeError`. Disable the built-in parser and register a conditional one:
+```typescript
+// main.ts
+import { NestFactory, HttpAdapterHost } from '@nestjs/core';
+import { json } from 'express';
+import { AppModule } from './app.module';
+import { createStudioMiddleware } from '@axlsdk/studio/middleware';
+async function bootstrap() {
+  // Disable Nest's built-in body parser so we control limits ourselves.
+  const app = await NestFactory.create(AppModule, { bodyParser: false });
+  // Apply 10 MB limit to the Studio sub-mount only; rest of the app keeps
+  // the 100 KB default. This is the maintainer-endorsed pattern for
+  // per-route body limits in NestJS (see nestjs/nest#14734).
+  const studioJson = json({ limit: '10mb' });
+  const defaultJson = json();
+  app.use((req, res, next) =>
+    req.url.startsWith('/studio') ? studioJson(req, res, next) : defaultJson(req, res, next),
+  );
+  const studio = createStudioMiddleware({ runtime });
+  const expressApp = app.get(HttpAdapterHost).httpAdapter.getInstance();
+  expressApp.use('/studio', studio.handler);
+  studio.upgradeWebSocket(app.getHttpServer());
+  await app.listen(3000);
+}
+bootstrap();
+```
+> `app.useBodyParser('json', { limit })` raises the limit **globally**, not per-route — avoid it if you want the larger limit scoped to Studio.
+**Fastify:** set `bodyLimit` on the Fastify instance or pass it via `fastify({ bodyLimit: 10 * 1024 * 1024 })`. There's no per-route equivalent as clean as Express's; if Studio is the only route that needs a larger limit, either raise the global limit or mount Studio on a separate Fastify instance.
 ### Framework examples
 #### NestJS
@@ -368,6 +438,42 @@ Lazy-loaded evals coexist with evals registered directly via `runtime.registerEv
 - CORS is not applied in embedded mode — the host framework owns CORS policy
 - `basePath` is validated against unsafe characters and path traversal
+### Observability-boundary redaction
+When the runtime is constructed with `config.trace.redact: true`, Studio scrubs user/LLM content at three layers — trace events at emission, REST route responses at serialization, and WebSocket broadcasts at send time — while preserving structural metadata (IDs, keys, agent/tool/workflow names, roles, cost/token/duration metrics, timestamps).
+```typescript
+const runtime = new AxlRuntime({ config: { trace: { redact: true } } });
+const studio = createStudioMiddleware({ runtime });
+```
+Under `redact: true`, the following Studio endpoints scrub user content server-side before responding: `GET /api/executions{,/:id}`, `GET /api/memory/:scope{,/:key}` (keys preserved so Memory Browser stays navigable), `GET /api/sessions/:id`, `GET /api/evals/history`, `POST /api/evals/:name/run` (sync), `POST /api/evals/:name/rescore`, `GET /api/decisions`, `POST /api/tools/:name/test`, `POST /api/workflows/:name/execute` (sync); streaming WS broadcasts on `/workflows/:name/execute` with `stream: true` and `/api/playground/chat` also scrub `StreamEvent` content before send.
+Studio checks the flag via `runtime.isRedactEnabled(): boolean` — it does **not** reach into the config object directly, because `Readonly<AxlConfig>` is shallow and consumers could mutate the nested `trace.redact` field via sub-object access. `GET /api/health` also reports `readOnly: boolean` so clients can gate mutating UI affordances.
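The shallowness of `Readonly<T>` mentioned above is standard TypeScript behavior and can be shown in a few lines. `ConfigLike` is a stand-in for the relevant slice of the config, and `makeRedactFlag` is a hypothetical illustration of why a method accessor is harder to tamper with. How the runtime actually stores the flag is not stated in the README.

```typescript
// Stand-in for the relevant slice of the config; not the real AxlConfig type.
interface ConfigLike {
  trace: { redact: boolean };
}

const config: Readonly<ConfigLike> = { trace: { redact: true } };

// config.trace = { redact: false }; // compile error: `trace` is read-only
config.trace.redact = false; // compiles: Readonly<T> does not reach nested fields

// Hypothetical sketch: a method that reports a value captured at construction
// time gives callers no nested object to mutate.
function makeRedactFlag(initial: boolean): { isRedactEnabled(): boolean } {
  const enabled = initial;
  return { isRedactEnabled: () => enabled };
}
```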
+See [`docs/observability.md`](../../docs/observability.md#pii-and-redaction) for the complete scrubbed/preserved field table.
+### Multi-tenant deployments
+Combine `verifyUpgrade` returning `{ allowed, metadata }` with `filterTraceEvent` to scope each WebSocket connection to a tenant/user:
+```typescript
+const studio = createStudioMiddleware({
+  runtime,
+  verifyUpgrade: (req) => {
+    const userId = authenticate(req);
+    if (!userId) return { allowed: false };
+    return { allowed: true, metadata: { userId, tenantId: lookupTenant(userId) } };
+  },
+  filterTraceEvent: (event, metadata) => {
+    // Scope the trace firehose: only let a connection see its own tenant's events.
+    return event.metadata?.tenantId === metadata?.tenantId;
+  },
+});
+```
+The filter runs on live broadcasts **and** on replay buffer events delivered to late subscribers, so historical cross-tenant events can't leak on reconnect. Predicate errors are fail-closed (event dropped).
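The fail-closed behavior can be pictured as a guard around the predicate that is applied to live events and replayed events alike. The sketch below is an assumption about the general pattern, not Studio's internal code; `failClosed` and `filterReplay` are hypothetical names.

```typescript
type TraceFilter<E, M> = (event: E, metadata: M) => boolean;

// Hypothetical wrapper: if the predicate throws, treat the event as not allowed.
function failClosed<E, M>(filter: TraceFilter<E, M>): TraceFilter<E, M> {
  return (event, metadata) => {
    try {
      return filter(event, metadata) === true;
    } catch {
      return false; // drop the event rather than risk a cross-tenant leak
    }
  };
}

// Hypothetical helper: apply the same guarded filter to a replay buffer
// before delivering it to a late subscriber.
function filterReplay<E, M>(
  buffer: readonly E[],
  metadata: M,
  filter: TraceFilter<E, M>,
): E[] {
  const guarded = failClosed(filter);
  return buffer.filter((event) => guarded(event, metadata));
}
```

Routing live and replayed events through one guarded predicate means a reconnecting client cannot see anything the live stream would have withheld.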
 ### Migrating from the standalone CLI
 If you currently use `npx @axlsdk/studio` with a config file:
@@ -390,10 +496,15 @@ src/
   server/
     index.ts              createServer() — Hono app composition (basePath, readOnly, cors)
     types.ts              API types, WebSocket message types
-    cost-aggregator.ts    Accumulates cost from trace events
+    aggregates/
+      aggregate-snapshots.ts  AggregateSnapshots<State> helper (per-window state, fold, replace, broadcastTransform)
+      trace-aggregator.ts     TraceAggregator<State> — TraceEvent consumer (costs, trace-stats)
+      execution-aggregator.ts ExecutionAggregator<State> — ExecutionInfo consumer (workflow-stats)
+      eval-aggregator.ts      EvalAggregator<State> — EvalHistoryEntry consumer (eval-trends)
+      reducers.ts             Pure reducers: reduceCost, reduceWorkflowStats, reduceTraceStats, reduceEvalTrends + enrichWorkflowStats
     middleware/
       error-handler.ts    Axl errors → JSON error envelope
-    routes/               One file per resource (health, workflows, agents, tools, etc.)
+    routes/               One file per resource (health, workflows, agents, tools, costs, eval-trends, workflow-stats, trace-stats, evals, etc.)
     ws/
       handler.ts          WebSocket message routing (Hono adapter)
       connection-manager.ts  Channel subscriptions + broadcast (BroadcastTarget) + replay buffer for execution channels