@axlsdk/studio 0.14.0 → 0.16.0
- package/README.md +69 -10
- package/dist/chunk-RE6VPUXA.js +2213 -0
- package/dist/chunk-RE6VPUXA.js.map +1 -0
- package/dist/cli.cjs +1191 -143
- package/dist/cli.cjs.map +1 -1
- package/dist/cli.js +1 -1
- package/dist/client/assets/index-ClajLxib.js +288 -0
- package/dist/client/assets/index-DnHL_gtF.css +1 -0
- package/dist/client/index.html +2 -2
- package/dist/connection-manager-DAuqk9lM.d.cts +166 -0
- package/dist/connection-manager-DAuqk9lM.d.ts +166 -0
- package/dist/middleware.cjs +1222 -150
- package/dist/middleware.cjs.map +1 -1
- package/dist/middleware.d.cts +76 -6
- package/dist/middleware.d.ts +76 -6
- package/dist/middleware.js +32 -8
- package/dist/middleware.js.map +1 -1
- package/dist/server/index.cjs +1194 -142
- package/dist/server/index.cjs.map +1 -1
- package/dist/server/index.d.cts +171 -28
- package/dist/server/index.d.ts +171 -28
- package/dist/server/index.js +7 -3
- package/package.json +13 -9
- package/dist/chunk-HUKUQDYL.js +0 -1163
- package/dist/chunk-HUKUQDYL.js.map +0 -1
- package/dist/client/assets/index-7aDhMztu.css +0 -1
- package/dist/client/assets/index-Bzr3vDPz.js +0 -255
- package/dist/connection-manager-B7AWpsCD.d.cts +0 -81
- package/dist/connection-manager-B7AWpsCD.d.ts +0 -81
package/README.md
CHANGED
@@ -95,7 +95,12 @@ Execute workflows with custom JSON input. View execution timelines showing each
 Waterfall visualization of execution traces. Filter by type, agent, or tool. View token counts, cost per step, and duration.
 
 ### Cost Dashboard
-Track spending across agents, models, and
+Track spending across agents, models, workflows, and embedders with time-window filtering (24h/7d/30d/all). Live cost updates via WebSocket; all breakdown tables are user-sortable by any column. Two sections appear conditionally:
+
+- **Retry Overhead** — when retries have happened, decomposes `agent_call` cost by `retryReason` (`primary` / `schema` / `validate` / `guardrail`) with per-reason call counts. Surfaces how much spend was wasted on gate failures.
+- **Memory (Embedder)** — when semantic memory ops have run, shows `CostData.byEmbedder: Record<string, { cost, calls, tokens }>` bucketed by embedder model (e.g., `text-embedding-3-small`).
+
+Sub-cent cost values use tiered precision (`< $0.000001` sentinel, `< $0.0001` scientific, `< $0.01` six decimals, `>= $0.01` two decimals) so embedder costs don't collapse to `$0.0000`.
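The tiered precision rule above can be sketched as a small formatter. This is a minimal illustration, not Studio's implementation: the function name `formatCost`, the exact output strings, the two-digit mantissa in the scientific tier, and the special case for zero are all assumptions. Only the four thresholds come from the README text.

```typescript
// Hypothetical helper illustrating the four tiers described in the README.
// Assumption: zero is shown as "$0.00" rather than falling into the sentinel tier.
function formatCost(usd: number): string {
  if (usd === 0) return "$0.00";
  if (usd < 0.000001) return "< $0.000001"; // sentinel tier
  if (usd < 0.0001) return `$${usd.toExponential(2)}`; // scientific tier
  if (usd < 0.01) return `$${usd.toFixed(6)}`; // six-decimal tier
  return `$${usd.toFixed(2)}`; // two-decimal tier
}

console.log(formatCost(0.0000001)); // "< $0.000001"
console.log(formatCost(0.00002)); // "$2.00e-5"
console.log(formatCost(0.0042)); // "$0.004200"
console.log(formatCost(1.5)); // "$1.50"
```

With a fixed four-decimal format, the first three values would all print as `$0.0000` or close to it, which is the collapse the README says the tiers avoid.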
 
 ### Memory Browser
 View and manage agent memory (session and global scope). Create, edit, and delete entries. Test semantic recall queries.
@@ -107,7 +112,7 @@ Browse active sessions with conversation history. Replay sessions step by step.
 Browse all registered tools with their schemas rendered as forms. Test any tool directly with custom input and see the result.
 
 ### Eval Runner
-Run evaluations from the UI. View per-item results with scores, timing, and cost. Drill into individual items to see LLM scorer reasoning, per-scorer timing/cost, and annotations. Filter items by error state or score threshold, sort by score/duration/cost. Score distribution chart shows how scores are spread across bins. Compare two runs with the run picker (baseline/candidate selection from history), timing/cost tradeoff analysis, item-level comparison table, and expandable regression detail showing side-by-side outputs and reasoning. History tab groups multi-run results and tracks mean scores across runs with an eval name filter. Multi-run switcher navigates between individual runs. LLM scorer badges distinguish LLM-judged from deterministic scorers. Significance tooltips explain bootstrap CI methodology. Requires `@axlsdk/eval` as an optional peer dependency.
+Run evaluations from the UI. Toggle **Capture traces** in the command bar to populate per-item `EvalItem.traces` — the item detail panel renders each captured event inline with type, agent/tool, duration, and cost (success and failure paths both). View per-item results with scores, timing, and cost. Drill into individual items to see LLM scorer reasoning, per-scorer timing/cost, and annotations. Filter items by error state or score threshold, sort by score/duration/cost. Score distribution chart shows how scores are spread across bins. Compare two runs with the run picker (baseline/candidate selection from history), timing/cost tradeoff analysis, item-level comparison table, and expandable regression detail showing side-by-side outputs and reasoning. History tab groups multi-run results and tracks mean scores across runs with an eval name filter. Multi-run switcher navigates between individual runs. LLM scorer badges distinguish LLM-judged from deterministic scorers. Significance tooltips explain bootstrap CI methodology. Requires `@axlsdk/eval` as an optional peer dependency.
 
 ## What gets registered
 
@@ -139,14 +144,17 @@ Studio exposes a REST API that the SPA consumes. You can also call these directl
 | `POST /api/tools/:name/test` | Test a tool with `{ input: {...} }` |
 | `GET /api/sessions` | List sessions |
 | `GET /api/executions` | List executions |
-| `GET /api/costs` | Aggregated cost data |
-| `
+| `GET /api/costs?window=24h\|7d\|30d\|all` | Aggregated cost data for a time window (default `7d`). `?windows=all` returns all four windows at once for debugging |
+| `GET /api/eval-trends?window=` | Per-eval score trends (latest, mean, std), cost totals, recent runs with `model`/`duration` |
+| `GET /api/workflow-stats?window=` | Per-workflow totals, completed/failed counts, p50/p95/avg duration, failure rate |
+| `GET /api/trace-stats?window=` | Event-type distribution, tool call counts (calls/approved/denied), retry breakdown by agent |
 | `GET /api/memory/:scope/:key` | Read memory entry |
 | `PUT /api/memory/:scope/:key` | Save memory entry |
 | `DELETE /api/memory/:scope/:key` | Delete memory entry |
 | `GET /api/evals` | List registered eval configs |
 | `GET /api/evals/history` | List eval run history |
-| `POST /api/evals/:name/run` | Run a registered eval by name.
+| `POST /api/evals/:name/run` | Run a registered eval by name. Body: `{ runs?: N, stream?: true, captureTraces?: true }` (`runs` capped at 25). When `stream: true`, returns `{ evalRunId }` immediately and broadcasts progress over the `eval:{evalRunId}` WS channel (`item_done` / `run_done` / `done` / `error`). The `done` event carries only `{ evalResultId, runGroupId? }` — a pointer, not the full result — so multi-item evals don't hit the 64KB WS frame cap. Clients refetch the full result from history. `captureTraces: true` populates per-item `EvalItem.traces` on every item (success + failure); the Eval Runner panel renders these inline on item detail. Synchronous mode (default) is unchanged |
+| `POST /api/evals/runs/:evalRunId/cancel` | Abort an active streaming eval run. The cancelled run appears in history with remaining items marked as cancelled |
 | `POST /api/evals/:name/rescore` | Re-score a history entry with the eval's current scorers |
 | `POST /api/evals/import` | Import a CLI eval artifact (parsed `EvalResult` JSON) into runtime history |
 | `DELETE /api/evals/history/:id` | Delete a single history entry. Blocked in readOnly |
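The streaming contract described in the `POST /api/evals/:name/run` row above can be sketched from the client side. This is an illustration under assumptions: the helper names `buildRunBody` and `onEvalEvent`, the `ClientAction` type, and the `{ type, data }` nesting of each event are hypothetical. Only the body fields, the cap of 25, the four event names, and the `done` pointer payload come from the README.

```typescript
// Request body fields as listed in the README row.
interface EvalRunBody {
  runs?: number;
  stream?: boolean;
  captureTraces?: boolean;
}

const MAX_RUNS = 25; // documented cap on `runs`

// Hypothetical helper: clamp client-side so the request stays within the documented cap.
function buildRunBody(runs: number, captureTraces = false): EvalRunBody {
  const clamped = Math.min(Math.max(1, Math.floor(runs)), MAX_RUNS);
  return { runs: clamped, stream: true, captureTraces };
}

// Assumed event envelope; payloads other than `done` are left opaque.
type EvalStreamEvent =
  | { type: "item_done"; data?: unknown }
  | { type: "run_done"; data?: unknown }
  | { type: "done"; data: { evalResultId: string; runGroupId?: string } }
  | { type: "error"; data?: unknown };

type ClientAction =
  | { kind: "progress" }
  | { kind: "refetch"; evalResultId: string }
  | { kind: "failed" };

// `done` is only a pointer, so the client's next step is to refetch from history.
function onEvalEvent(event: EvalStreamEvent): ClientAction {
  switch (event.type) {
    case "item_done":
    case "run_done":
      return { kind: "progress" };
    case "done":
      return { kind: "refetch", evalResultId: event.data.evalResultId };
    case "error":
      return { kind: "failed" };
  }
}
```

Keeping `done` as a pointer means the client needs one extra history fetch, but the final frame stays small however many items the eval produced.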
@@ -166,7 +174,15 @@ Single endpoint at `ws://localhost:4400/ws` with channel multiplexing:
 { "type": "event", "channel": "trace:abc-123", "data": { ... } }
 ```
 
-Channels: `execution:{id}`, `trace:{id}`, `trace:*`, `costs`, `decisions`. Execution channels have replay buffering — late subscribers receive the full event history (capped at
+Channels: `execution:{id}`, `trace:{id}`, `trace:*`, `eval:{id}`, `eval:{evalRunId}`, `eval:*`, `costs`, `eval-trends`, `workflow-stats`, `trace-stats`, `decisions`. Execution and eval channels have replay buffering — late subscribers receive the full event history (capped at 1000 events by default; tunable via `bufferCaps`, see below). Buffers are cleaned up 30s after the stream completes. Aggregate channels (`costs`, `eval-trends`, `workflow-stats`, `trace-stats`) broadcast `{ snapshots: Record<WindowId, State>, updatedAt }` on every fold or rebuild.
+
+**Outbound frame budget.** The WS broadcast layer enforces a 64KB soft cap via `truncateIfOversized`. Oversized verbose-mode `agent_call_end.data.messages` snapshots are replaced with a `{ __truncated: true, originalBytes, maxBytes, hint }` placeholder that preserves the event's `type`/`step`/`agent`/`tool` so the Trace Explorer still renders the row. The 64KB threshold matches the inbound message reject limit in the WS protocol (shared constant).
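A consumer of the trace channels may want to detect the truncation placeholder before treating `messages` as an array. The following is a minimal client-side sketch, not Studio's `truncateIfOversized`: the helper names `isTruncated` and `summarizeMessages` are hypothetical, and the field types (`number` for the byte counts, `string` for `hint`) are assumptions based on the field names.

```typescript
// Placeholder shape from the README; field types are assumed.
interface TruncatedPlaceholder {
  __truncated: true;
  originalBytes: number;
  maxBytes: number;
  hint: string;
}

// Type guard: true only for objects carrying the `__truncated: true` marker.
function isTruncated(value: unknown): value is TruncatedPlaceholder {
  return (
    typeof value === "object" &&
    value !== null &&
    (value as { __truncated?: unknown }).__truncated === true
  );
}

// Hypothetical renderer for the `messages` field of an `agent_call_end` event.
function summarizeMessages(messages: unknown): string {
  if (isTruncated(messages)) {
    return `truncated (${messages.originalBytes} of max ${messages.maxBytes} bytes)`;
  }
  if (Array.isArray(messages)) return `${messages.length} message(s)`;
  return "no messages";
}
```

Checking the marker first keeps a UI from calling array methods on the placeholder object when a verbose snapshot has been replaced.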
+
+### Migrating from 0.14
+
+- **`POST /api/costs/reset` has been removed.** Any script hitting the old endpoint gets `404`. Use window selection (`?window=`) instead — snapshots evict automatically as their window slides.
+- **`CostAggregator` class is no longer exported** from `@axlsdk/studio`. Replaced by `TraceAggregator<CostData>` configured with a pure `reduceCost` reducer. Behavior is preserved.
+- **`costs` WS channel payload shape changed** from `CostData` to `{ snapshots: Record<WindowId, CostData>, updatedAt: number }`. Clients that read the old shape must select a window (typically `snapshots['7d']`).
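For clients that must tolerate both payload shapes during a rollout, the `costs` channel change in the last bullet can be handled with a small adapter. This is a sketch under assumptions: `WindowedPayload`, `isWindowed`, and `selectCosts` are hypothetical names, the window ids are taken from the `?window=24h|7d|30d|all` query values, and the cost state is kept generic because the fields of `CostData` are not shown in this diff.

```typescript
type WindowId = "24h" | "7d" | "30d" | "all";

// New 0.16 payload shape, generic over the per-window state.
interface WindowedPayload<T> {
  snapshots: Record<WindowId, T>;
  updatedAt: number;
}

// Distinguish the new envelope from a bare 0.14-style payload.
function isWindowed<T>(payload: T | WindowedPayload<T>): payload is WindowedPayload<T> {
  const p = payload as unknown;
  return typeof p === "object" && p !== null && "snapshots" in p && "updatedAt" in p;
}

// Return the per-window state, defaulting to the 7d window as the README suggests.
function selectCosts<T>(payload: T | WindowedPayload<T>, window: WindowId = "7d"): T {
  return isWindowed<T>(payload) ? payload.snapshots[window] : payload;
}
```

The adapter assumes the old cost state never carried both a `snapshots` and an `updatedAt` key. If it did, the shape check would need a stricter discriminator.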
 
 ## Embeddable Middleware
 
@@ -205,9 +221,11 @@ studio.upgradeWebSocket(server);
 | `runtime` | `AxlRuntime` | required | The runtime instance to observe and control |
 | `basePath` | `string` | `''` | URL path prefix (e.g., `'/studio'`) |
 | `serveClient` | `boolean` | `true` | Serve the pre-built SPA |
-| `verifyUpgrade` | `(req) => boolean \|
-| `
+| `verifyUpgrade` | `(req) => boolean \| { allowed: boolean, metadata?: unknown } \| Promise<...>` | — | Auth callback for WebSocket upgrades. The object form attaches `metadata` (tenant/user id / role) to the connection, available to `filterTraceEvent` on every outbound broadcast. Bare boolean still works (back-compat) |
+| `filterTraceEvent` | `(event, metadata) => boolean` | — | Per-connection broadcast filter for multi-tenant deployments. Called on every outbound trace event (and on replay buffer events for late subscribers, so historical cross-tenant events can't leak on reconnect). Predicate errors are fail-closed — event is dropped |
+| `readOnly` | `boolean` | `false` | Disable all mutating endpoints. `POST /api/evals/compare` is allowed (pure computation); `POST /api/evals/import`, `POST /api/evals/:name/run`, `POST /api/evals/:name/rescore`, `POST /api/evals/runs/:evalRunId/cancel`, and `DELETE /api/evals/history/:id` are blocked |
 | `evals` | `string \| string[] \| { files, conditions? }` | — | Lazy-load eval files for the Eval Runner panel |
+| `bufferCaps` | `{ maxEventsPerBuffer?, maxBytesPerBuffer?, maxActiveBuffers? }` | `{ 1000, 4 MiB, 256 }` | Override the default WebSocket replay-buffer resource caps for high-churn deployments. Worst-case memory is roughly `maxActiveBuffers × maxBytesPerBuffer` (≈1 GiB at defaults). Terminal `done`/`error` events are always buffered regardless of caps |
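The worst-case memory figure in the `bufferCaps` row is plain arithmetic and can be checked when choosing overrides. In this sketch the option names and default values come from the table, while `DEFAULT_CAPS` and `worstCaseBytes` are hypothetical names introduced for illustration.

```typescript
const MiB = 1024 * 1024;

// Option names as listed in the README table.
interface BufferCaps {
  maxEventsPerBuffer?: number;
  maxBytesPerBuffer?: number;
  maxActiveBuffers?: number;
}

// Defaults as listed in the README table: 1000 events, 4 MiB, 256 buffers.
const DEFAULT_CAPS: Required<BufferCaps> = {
  maxEventsPerBuffer: 1000,
  maxBytesPerBuffer: 4 * MiB,
  maxActiveBuffers: 256,
};

// Rough upper bound: every active buffer filled to its byte cap.
function worstCaseBytes(caps: BufferCaps = {}): number {
  const buffers = caps.maxActiveBuffers ?? DEFAULT_CAPS.maxActiveBuffers;
  const bytes = caps.maxBytesPerBuffer ?? DEFAULT_CAPS.maxBytesPerBuffer;
  return buffers * bytes;
}

console.log(worstCaseBytes() / (1024 * MiB)); // 1 (GiB at defaults)
console.log(worstCaseBytes({ maxActiveBuffers: 64, maxBytesPerBuffer: MiB }) / MiB); // 64 (MiB)
```

At defaults, 256 × 4 MiB is exactly 1 GiB, which matches the README's ≈1 GiB estimate. The bound leaves out the terminal `done`/`error` events that are buffered regardless of caps.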
 
 ### Return value
 
@@ -421,6 +439,42 @@ Lazy-loaded evals coexist with evals registered directly via `runtime.registerEv
 - CORS is not applied in embedded mode — the host framework owns CORS policy
 - `basePath` is validated against unsafe characters and path traversal
 
+### Observability-boundary redaction
+
+When the runtime is constructed with `config.trace.redact: true`, Studio scrubs user/LLM content at three layers — trace events at emission, REST route responses at serialization, and WebSocket broadcasts at send time — while preserving structural metadata (IDs, keys, agent/tool/workflow names, roles, cost/token/duration metrics, timestamps).
+
+```typescript
+const runtime = new AxlRuntime({ trace: { redact: true } });
+const studio = createStudioMiddleware({ runtime });
+```
+
+Under `redact: true`, the following Studio endpoints scrub user content server-side before responding: `GET /api/executions{,/:id}`, `GET /api/memory/:scope{,/:key}` (keys preserved so Memory Browser stays navigable), `GET /api/sessions/:id`, `GET /api/evals/history`, `POST /api/evals/:name/run` (sync), `POST /api/evals/:name/rescore`, `GET /api/decisions`, `POST /api/tools/:name/test`, `POST /api/workflows/:name/execute` (sync); streaming WS broadcasts on `/workflows/:name/execute` with `stream: true`, `/api/playground/chat`, AND the trace channel firehose (`trace:{executionId}`) all scrub `AxlEvent` content before send.
+
+Studio checks the flag via `runtime.isRedactEnabled(): boolean` — it does **not** reach into the config object directly, because `Readonly<AxlConfig>` is shallow and consumers could mutate the nested `trace.redact` field via sub-object access. `GET /api/health` also reports `readOnly: boolean` so clients can gate mutating UI affordances.
+
+See [`docs/observability.md`](../../docs/observability.md#pii-and-redaction) for the complete scrubbed/preserved field table.
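The point about `Readonly<AxlConfig>` being shallow is standard TypeScript behavior and can be reproduced in isolation. In the sketch below, `SketchConfig` and `SketchRuntime` are hypothetical stand-ins for `AxlConfig` and `AxlRuntime`. Capturing the flag at construction is only one plausible way to back an `isRedactEnabled()` accessor, not a description of the real runtime.

```typescript
// Stand-in for the relevant slice of AxlConfig.
interface SketchConfig {
  trace: { redact?: boolean };
}

// Assumption: the accessor returns a value captured once, so later
// mutation of the caller's config object cannot change it.
class SketchRuntime {
  private readonly redact: boolean;

  constructor(config: Readonly<SketchConfig>) {
    this.redact = config.trace.redact === true;
  }

  isRedactEnabled(): boolean {
    return this.redact;
  }
}

const sharedConfig: Readonly<SketchConfig> = { trace: { redact: true } };
const sketchRuntime = new SketchRuntime(sharedConfig);

// sharedConfig.trace = { redact: false }; // rejected by the compiler: `trace` is read-only
sharedConfig.trace.redact = false; // accepted: Readonly<T> does not recurse into `trace`

console.log(sketchRuntime.isRedactEnabled()); // true
```

Code that read `sharedConfig.trace.redact` directly at request time would now see `false`. This is the failure mode the accessor is described as avoiding.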
+
+### Multi-tenant deployments
+
+Combine `verifyUpgrade` returning `{ allowed, metadata }` with `filterTraceEvent` to scope each WebSocket connection to a tenant/user:
+
+```typescript
+const studio = createStudioMiddleware({
+  runtime,
+  verifyUpgrade: (req) => {
+    const userId = authenticate(req);
+    if (!userId) return { allowed: false };
+    return { allowed: true, metadata: { userId, tenantId: lookupTenant(userId) } };
+  },
+  filterTraceEvent: (event, metadata) => {
+    // Scope the trace firehose: only let a connection see its own tenant's events.
+    return event.metadata?.tenantId === metadata?.tenantId;
+  },
+});
+```
+
+The filter runs on live broadcasts **and** on replay buffer events delivered to late subscribers, so historical cross-tenant events can't leak on reconnect. Predicate errors are fail-closed (event dropped).
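The fail-closed behavior described above can be illustrated with a small wrapper. This is a sketch of the general technique, not Studio's connection manager: `TraceFilter`, `failClosed`, and `replayFor` are hypothetical names, and the event and metadata types are left generic.

```typescript
type TraceFilter<E, M> = (event: E, metadata: M) => boolean;

// Wrap a predicate so that a thrown error is treated as "drop the event".
function failClosed<E, M>(filter: TraceFilter<E, M>): TraceFilter<E, M> {
  return (event, metadata) => {
    try {
      return filter(event, metadata) === true;
    } catch {
      return false; // fail closed: never deliver when the predicate errors
    }
  };
}

// Apply the same guarded predicate to a replay buffer for a late subscriber.
function replayFor<E, M>(buffer: readonly E[], metadata: M, filter: TraceFilter<E, M>): E[] {
  const guarded = failClosed(filter);
  return buffer.filter((event) => guarded(event, metadata));
}
```

Because live broadcast and replay share one guarded predicate, a reconnecting client sees the same tenant scoping on buffered history as on new events.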
+
 ### Migrating from the standalone CLI
 
 If you currently use `npx @axlsdk/studio` with a config file:
@@ -443,10 +497,15 @@ src/
   server/
     index.ts createServer() — Hono app composition (basePath, readOnly, cors)
     types.ts API types, WebSocket message types
-
+    aggregates/
+      aggregate-snapshots.ts AggregateSnapshots<State> helper (per-window state, fold, replace, broadcastTransform)
+      trace-aggregator.ts TraceAggregator<State> — AxlEvent consumer (costs, trace-stats)
+      execution-aggregator.ts ExecutionAggregator<State> — ExecutionInfo consumer (workflow-stats)
+      eval-aggregator.ts EvalAggregator<State> — EvalHistoryEntry consumer (eval-trends)
+      reducers.ts Pure reducers: reduceCost, reduceWorkflowStats, reduceTraceStats, reduceEvalTrends + enrichWorkflowStats
     middleware/
       error-handler.ts Axl errors → JSON error envelope
-    routes/ One file per resource (health, workflows, agents, tools, etc.)
+    routes/ One file per resource (health, workflows, agents, tools, costs, eval-trends, workflow-stats, trace-stats, evals, etc.)
     ws/
       handler.ts WebSocket message routing (Hono adapter)
       connection-manager.ts Channel subscriptions + broadcast (BroadcastTarget) + replay buffer for execution channels