@axlsdk/studio 0.13.8 → 0.15.0
- package/README.md +122 -11
- package/dist/chunk-IPDMFFTQ.js +2142 -0
- package/dist/chunk-IPDMFFTQ.js.map +1 -0
- package/dist/{chunk-6VDX5CRP.js → chunk-JGQ3MSIG.js} +5 -2
- package/dist/chunk-JGQ3MSIG.js.map +1 -0
- package/dist/cli.cjs +1336 -209
- package/dist/cli.cjs.map +1 -1
- package/dist/cli.js +8 -3
- package/dist/cli.js.map +1 -1
- package/dist/client/assets/index-CLKKOaE2.css +1 -0
- package/dist/client/assets/index-rvds50cZ.js +278 -0
- package/dist/client/index.html +2 -2
- package/dist/{connection-manager-B7AWpsCD.d.cts → connection-manager-BMPahDuY.d.cts} +63 -1
- package/dist/{connection-manager-B7AWpsCD.d.ts → connection-manager-BMPahDuY.d.ts} +63 -1
- package/dist/middleware.cjs +1353 -211
- package/dist/middleware.cjs.map +1 -1
- package/dist/middleware.d.cts +52 -5
- package/dist/middleware.d.ts +52 -5
- package/dist/middleware.js +31 -8
- package/dist/middleware.js.map +1 -1
- package/dist/server/index.cjs +1325 -202
- package/dist/server/index.cjs.map +1 -1
- package/dist/server/index.d.cts +165 -28
- package/dist/server/index.d.ts +165 -28
- package/dist/server/index.js +7 -3
- package/package.json +11 -6
- package/dist/chunk-6VDX5CRP.js.map +0 -1
- package/dist/chunk-YWRYXT7U.js +0 -1021
- package/dist/chunk-YWRYXT7U.js.map +0 -1
- package/dist/client/assets/index-C_uwupnn.js +0 -221
- package/dist/client/assets/index-DVcH6P9w.css +0 -1
package/README.md
CHANGED
|
@@ -95,7 +95,12 @@ Execute workflows with custom JSON input. View execution timelines showing each
|
|
|
95
95
|
Waterfall visualization of execution traces. Filter by type, agent, or tool. View token counts, cost per step, and duration.
|
|
96
96
|
|
|
97
97
|
### Cost Dashboard
|
|
98
|
-
Track spending across agents, models, and
|
|
98
|
+
Track spending across agents, models, workflows, and embedders with time-window filtering (24h/7d/30d/all). Live cost updates via WebSocket; all breakdown tables are user-sortable by any column. Two sections appear conditionally:
|
|
99
|
+
|
|
100
|
+
- **Retry Overhead** — when retries have happened, decomposes `agent_call` cost by `retryReason` (`primary` / `schema` / `validate` / `guardrail`) with per-reason call counts. Surfaces how much spend was wasted on gate failures.
|
|
101
|
+
- **Memory (Embedder)** — when semantic memory ops have run, shows `CostData.byEmbedder: Record<string, { cost, calls, tokens }>` bucketed by embedder model (e.g., `text-embedding-3-small`).
|
|
102
|
+
|
|
103
|
+
Sub-cent cost values use tiered precision (`< $0.000001` sentinel, `< $0.0001` scientific, `< $0.01` six decimals, `>= $0.01` two decimals) so embedder costs don't collapse to `$0.0000`.
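
The tiers above can be read as a small formatting function. What follows is a minimal sketch, not Studio's actual formatter: the name `formatCost`, the exact output strings, and the handling of zero are assumptions made for illustration, and the input is assumed to be a non-negative USD amount.

```typescript
// Hypothetical helper illustrating tiered precision for small USD amounts.
// Thresholds come from the README text; the output formats are assumed.
function formatCost(usd: number): string {
  if (usd === 0) return '$0.00'; // assumption: zero renders as a plain zero
  if (usd < 0.000001) return '< $0.000001'; // sentinel for vanishingly small costs
  if (usd < 0.0001) return `$${usd.toExponential(2)}`; // scientific notation
  if (usd < 0.01) return `$${usd.toFixed(6)}`; // six decimals
  return `$${usd.toFixed(2)}`; // ordinary two-decimal currency
}

console.log(formatCost(0.00005)); // "$5.00e-5"
console.log(formatCost(0.0042)); // "$0.004200"
console.log(formatCost(1.5)); // "$1.50"
```

With a flat four-decimal format, the first value would print as `$0.0001` and anything smaller as `$0.0000`, which is the collapse the tiers are meant to avoid.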
|
|
99
104
|
|
|
100
105
|
### Memory Browser
|
|
101
106
|
View and manage agent memory (session and global scope). Create, edit, and delete entries. Test semantic recall queries.
|
|
@@ -107,7 +112,7 @@ Browse active sessions with conversation history. Replay sessions step by step.
|
|
|
107
112
|
Browse all registered tools with their schemas rendered as forms. Test any tool directly with custom input and see the result.
|
|
108
113
|
|
|
109
114
|
### Eval Runner
|
|
110
|
-
Run evaluations from the UI. View per-item results with scores, timing, and cost. Drill into individual items to see LLM scorer reasoning, per-scorer timing/cost, and annotations. Filter items by error state or score threshold, sort by score/duration/cost. Score distribution chart shows how scores are spread across bins. Compare two runs with the run picker (baseline/candidate selection from history), timing/cost tradeoff analysis, item-level comparison table, and expandable regression detail showing side-by-side outputs and reasoning. History tab groups multi-run results and tracks mean scores across runs with an eval name filter. Multi-run switcher navigates between individual runs. LLM scorer badges distinguish LLM-judged from deterministic scorers. Significance tooltips explain bootstrap CI methodology. Requires `@axlsdk/eval` as an optional peer dependency.
|
|
115
|
+
Run evaluations from the UI. Toggle **Capture traces** in the command bar to populate per-item `EvalItem.traces` — the item detail panel renders each captured event inline with type, agent/tool, duration, and cost (success and failure paths both). View per-item results with scores, timing, and cost. Drill into individual items to see LLM scorer reasoning, per-scorer timing/cost, and annotations. Filter items by error state or score threshold, sort by score/duration/cost. Score distribution chart shows how scores are spread across bins. Compare two runs with the run picker (baseline/candidate selection from history), timing/cost tradeoff analysis, item-level comparison table, and expandable regression detail showing side-by-side outputs and reasoning. History tab groups multi-run results and tracks mean scores across runs with an eval name filter. Multi-run switcher navigates between individual runs. LLM scorer badges distinguish LLM-judged from deterministic scorers. Significance tooltips explain bootstrap CI methodology. Requires `@axlsdk/eval` as an optional peer dependency.
|
|
111
116
|
|
|
112
117
|
## What gets registered
|
|
113
118
|
|
|
@@ -139,16 +144,21 @@ Studio exposes a REST API that the SPA consumes. You can also call these directl
|
|
|
139
144
|
| `POST /api/tools/:name/test` | Test a tool with `{ input: {...} }` |
|
|
140
145
|
| `GET /api/sessions` | List sessions |
|
|
141
146
|
| `GET /api/executions` | List executions |
|
|
142
|
-
| `GET /api/costs` | Aggregated cost data |
|
|
143
|
-
| `
|
|
147
|
+
| `GET /api/costs?window=24h\|7d\|30d\|all` | Aggregated cost data for a time window (default `7d`). `?windows=all` returns all four windows at once for debugging |
|
|
148
|
+
| `GET /api/eval-trends?window=` | Per-eval score trends (latest, mean, std), cost totals, recent runs with `model`/`duration` |
|
|
149
|
+
| `GET /api/workflow-stats?window=` | Per-workflow totals, completed/failed counts, p50/p95/avg duration, failure rate |
|
|
150
|
+
| `GET /api/trace-stats?window=` | Event-type distribution, tool call counts (calls/approved/denied), retry breakdown by agent |
|
|
144
151
|
| `GET /api/memory/:scope/:key` | Read memory entry |
|
|
145
152
|
| `PUT /api/memory/:scope/:key` | Save memory entry |
|
|
146
153
|
| `DELETE /api/memory/:scope/:key` | Delete memory entry |
|
|
147
154
|
| `GET /api/evals` | List registered eval configs |
|
|
148
155
|
| `GET /api/evals/history` | List eval run history |
|
|
149
|
-
| `POST /api/evals/:name/run` | Run a registered eval by name.
|
|
156
|
+
| `POST /api/evals/:name/run` | Run a registered eval by name. Body: `{ runs?: N, stream?: true, captureTraces?: true }` (`runs` capped at 25). When `stream: true`, returns `{ evalRunId }` immediately and broadcasts progress over the `eval:{evalRunId}` WS channel (`item_done` / `run_done` / `done` / `error`). The `done` event carries only `{ evalResultId, runGroupId? }` — a pointer, not the full result — so multi-item evals don't hit the 64KB WS frame cap. Clients refetch the full result from history. `captureTraces: true` populates per-item `EvalItem.traces` on every item (success + failure); the Eval Runner panel renders these inline on item detail. Synchronous mode (default) is unchanged |
|
|
157
|
+
| `POST /api/evals/runs/:evalRunId/cancel` | Abort an active streaming eval run. The cancelled run appears in history with remaining items marked as cancelled |
|
|
150
158
|
| `POST /api/evals/:name/rescore` | Re-score a history entry with the eval's current scorers |
|
|
151
|
-
| `POST /api/evals/
|
|
159
|
+
| `POST /api/evals/import` | Import a CLI eval artifact (parsed `EvalResult` JSON) into runtime history |
|
|
160
|
+
| `DELETE /api/evals/history/:id` | Delete a single history entry. Blocked in readOnly |
|
|
161
|
+
| `POST /api/evals/compare` | Compare two eval results by history ID. Body: `{ baselineId, candidateId, options? }` where each ID is `string` (single run) or `string[]` (pooled multi-run). Resolves IDs server-side from `runtime.getEvalHistory()` so the wire payload stays small |
|
|
152
162
|
| `POST /api/playground/chat` | Chat with an agent directly (no workflow required). Accepts `{ message, agent?, sessionId? }`. Streams results via WebSocket |
|
|
153
163
|
| `GET /api/decisions` | List pending decisions |
|
|
154
164
|
| `POST /api/decisions/:id/resolve` | Resolve a pending decision |
|
|
@@ -164,7 +174,15 @@ Single endpoint at `ws://localhost:4400/ws` with channel multiplexing:
|
|
|
164
174
|
{ "type": "event", "channel": "trace:abc-123", "data": { ... } }
|
|
165
175
|
```
|
|
166
176
|
|
|
167
|
-
Channels: `execution:{id}`, `trace:{id}`, `trace:*`, `costs`, `decisions`. Execution channels have replay buffering — late subscribers receive the full event history (capped at 500 events, cleaned up 30s after stream completes).
|
|
177
|
+
Channels: `execution:{id}`, `trace:{id}`, `trace:*`, `eval:{id}`, `eval:{evalRunId}`, `eval:*`, `costs`, `eval-trends`, `workflow-stats`, `trace-stats`, `decisions`. Execution and eval channels have replay buffering — late subscribers receive the full event history (capped at 500 events, cleaned up 30s after stream completes). Aggregate channels (`costs`, `eval-trends`, `workflow-stats`, `trace-stats`) broadcast `{ snapshots: Record<WindowId, State>, updatedAt }` on every fold or rebuild.
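
To illustrate how a client might consume a streaming eval run, the sketch below folds messages from an `eval:{evalRunId}` channel into a small progress state. The event names and the pointer-only `done` payload come from the REST table above; the `EvalStreamEvent` type, its `type` discriminator field, the state shape, and `applyEvalEvent` are assumptions for illustration, not Studio's client code.

```typescript
// Assumed event shapes. Only the event names and the `done` payload fields
// (`evalResultId`, `runGroupId?`) are taken from the README.
type EvalStreamEvent =
  | { type: 'item_done'; item?: unknown }
  | { type: 'run_done' }
  | { type: 'done'; evalResultId: string; runGroupId?: string }
  | { type: 'error'; message?: string };

interface EvalProgress {
  itemsDone: number;
  runsDone: number;
  resultId?: string; // set once `done` arrives; used to refetch the full result
  failed: boolean;
}

const initialProgress: EvalProgress = { itemsDone: 0, runsDone: 0, failed: false };

// Pure reducer: returns a new state for each incoming event.
function applyEvalEvent(state: EvalProgress, ev: EvalStreamEvent): EvalProgress {
  switch (ev.type) {
    case 'item_done':
      return { ...state, itemsDone: state.itemsDone + 1 };
    case 'run_done':
      return { ...state, runsDone: state.runsDone + 1 };
    case 'done':
      // `done` is only a pointer; the full result lives in eval history.
      return { ...state, resultId: ev.evalResultId };
    case 'error':
      return { ...state, failed: true };
    default:
      return state; // ignore event types this sketch does not know about
  }
}

const replayed: EvalStreamEvent[] = [
  { type: 'item_done' },
  { type: 'item_done' },
  { type: 'run_done' },
  { type: 'done', evalResultId: 'result-1' },
];
const finalState = replayed.reduce(applyEvalEvent, initialProgress);
console.log(finalState.itemsDone, finalState.resultId);
```

Because eval channels have replay buffering, a reducer like this behaves the same whether the events arrive live or as a replayed history after a late subscribe.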
|
|
178
|
+
|
|
179
|
+
**Outbound frame budget.** The WS broadcast layer enforces a 64KB soft cap via `truncateIfOversized`. Oversized verbose-mode `agent_call.data.messages` snapshots are replaced with a `{ __truncated: true, originalBytes, maxBytes, hint }` placeholder that preserves the event's `type`/`step`/`agent`/`tool` so the Trace Explorer still renders the row. The 64KB threshold matches the inbound message reject limit in the WS protocol (shared constant).
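
The truncation behavior can be pictured roughly as follows. This is a hypothetical re-implementation for illustration only: Studio's real `truncateIfOversized` is not shown in this diff, so the `TraceEventLike` type, the way bytes are measured, the choice to replace only `data.messages`, and the hint wording are all assumptions.

```typescript
const MAX_FRAME_BYTES = 64 * 1024; // 64KB soft cap, per the README

// Assumed minimal event shape; the real TraceEvent has more fields.
interface TraceEventLike {
  type: string;
  step?: number;
  agent?: string;
  tool?: string;
  data?: Record<string, unknown>;
}

// Size of the JSON-serialized value in UTF-8 bytes.
function byteLength(value: unknown): number {
  return new TextEncoder().encode(JSON.stringify(value)).length;
}

// Sketch: swap an oversized `data.messages` snapshot for a placeholder while
// keeping the fields the Trace Explorer needs to render the row.
function truncateSketch(event: TraceEventLike, maxBytes: number = MAX_FRAME_BYTES): TraceEventLike {
  const originalBytes = byteLength(event);
  if (originalBytes <= maxBytes || event.data?.messages === undefined) {
    return event; // within budget, or nothing this sketch knows how to trim
  }
  return {
    ...event, // preserves type / step / agent / tool
    data: {
      ...event.data,
      messages: {
        __truncated: true,
        originalBytes,
        maxBytes,
        hint: 'messages snapshot exceeded the outbound frame budget', // assumed wording
      },
    },
  };
}

const big: TraceEventLike = {
  type: 'agent_call',
  step: 3,
  agent: 'planner',
  data: { messages: ['x'.repeat(70_000)] },
};
const trimmed = truncateSketch(big);
console.log(trimmed.type, byteLength(trimmed) <= MAX_FRAME_BYTES);
```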
|
|
180
|
+
|
|
181
|
+
### Migrating from 0.14
|
|
182
|
+
|
|
183
|
+
- **`POST /api/costs/reset` has been removed.** Any script hitting the old endpoint gets `404`. Use window selection (`?window=`) instead — snapshots evict automatically as their window slides.
|
|
184
|
+
- **`CostAggregator` class is no longer exported** from `@axlsdk/studio`. Replaced by `TraceAggregator<CostData>` configured with a pure `reduceCost` reducer. Behavior is preserved.
|
|
185
|
+
- **`costs` WS channel payload shape changed** from `CostData` to `{ snapshots: Record<WindowId, CostData>, updatedAt: number }`. Clients that read the old shape must select a window (typically `snapshots['7d']`).
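
For the `costs` channel change in particular, a client-side shim might look like the sketch below. The `WindowId` values and the `{ snapshots, updatedAt }` envelope come from this README; the `CostData` placeholder type and the `selectCostData` helper are hypothetical, and accepting the old flat shape as a fallback is a design choice of this sketch rather than anything Studio prescribes.

```typescript
type WindowId = '24h' | '7d' | '30d' | 'all';

// Placeholder: the real CostData fields are not reproduced here.
type CostData = Record<string, unknown>;

interface CostsBroadcast {
  snapshots: Record<WindowId, CostData>;
  updatedAt: number;
}

// Returns the cost data for one window from the new envelope, or the payload
// itself when it still has the old flat shape.
function selectCostData(payload: unknown, windowId: WindowId = '7d'): CostData | undefined {
  if (typeof payload !== 'object' || payload === null) return undefined;
  const maybe = payload as Partial<CostsBroadcast>;
  if (maybe.snapshots && typeof maybe.updatedAt === 'number') {
    return maybe.snapshots[windowId];
  }
  return payload as CostData; // legacy flat shape
}

const message: CostsBroadcast = {
  snapshots: {
    '24h': { totalCost: 0.12 },
    '7d': { totalCost: 1.5 },
    '30d': { totalCost: 4.2 },
    all: { totalCost: 9.9 },
  },
  updatedAt: 1_700_000_000_000,
};
console.log(selectCostData(message)); // the 7d snapshot
console.log(selectCostData(message, '24h')); // the 24h snapshot
```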
|
|
168
186
|
|
|
169
187
|
## Embeddable Middleware
|
|
170
188
|
|
|
@@ -203,8 +221,9 @@ studio.upgradeWebSocket(server);
|
|
|
203
221
|
| `runtime` | `AxlRuntime` | required | The runtime instance to observe and control |
|
|
204
222
|
| `basePath` | `string` | `''` | URL path prefix (e.g., `'/studio'`) |
|
|
205
223
|
| `serveClient` | `boolean` | `true` | Serve the pre-built SPA |
|
|
206
|
-
| `verifyUpgrade` | `(req) => boolean \|
|
|
207
|
-
| `
|
|
224
|
+
| `verifyUpgrade` | `(req) => boolean \| { allowed: boolean, metadata?: unknown } \| Promise<...>` | — | Auth callback for WebSocket upgrades. The object form attaches `metadata` (tenant/user id / role) to the connection, available to `filterTraceEvent` on every outbound broadcast. Bare boolean still works (back-compat) |
|
|
225
|
+
| `filterTraceEvent` | `(event, metadata) => boolean` | — | Per-connection broadcast filter for multi-tenant deployments. Called on every outbound trace event (and on replay buffer events for late subscribers, so historical cross-tenant events can't leak on reconnect). Predicate errors are fail-closed — event is dropped |
|
|
226
|
+
| `readOnly` | `boolean` | `false` | Disable all mutating endpoints. `POST /api/evals/compare` is allowed (pure computation); `POST /api/evals/import`, `POST /api/evals/:name/run`, `POST /api/evals/:name/rescore`, `POST /api/evals/runs/:evalRunId/cancel`, and `DELETE /api/evals/history/:id` are blocked |
|
|
208
227
|
| `evals` | `string \| string[] \| { files, conditions? }` | — | Lazy-load eval files for the Eval Runner panel |
|
|
209
228
|
|
|
210
229
|
### Return value
|
|
@@ -220,6 +239,57 @@ studio.upgradeWebSocket(server);
|
|
|
220
239
|
|
|
221
240
|
**Note:** `upgradeWebSocket(server)` is required for real-time features (trace streaming, cost updates, execution events, decision resolution). Without it, the Studio SPA loads but panels relying on live data will show no updates. If your framework manages WebSocket connections itself (NestJS gateway, Fastify plugin), use `handleWebSocket()` instead.
|
|
222
241
|
|
|
242
|
+
### Host body limits
|
|
243
|
+
|
|
244
|
+
Studio's API uses small request bodies: the eval comparison flow sends history IDs (~100 bytes), not full result payloads. The default JSON body limits of common host frameworks (100KB in Express and NestJS, 1MB in Fastify and koa-bodyparser) are therefore sufficient for normal use.
|
|
245
|
+
|
|
246
|
+
The one exception is `POST /api/evals/import`, which accepts a full `EvalResult` JSON (typically a CLI artifact from `axl-eval --output result.json`). If you import sizeable eval files through Studio, raise your host framework's JSON body limit *on the Studio sub-mount only*.
|
|
247
|
+
|
|
248
|
+
**Express:**
|
|
249
|
+
|
|
250
|
+
```typescript
|
|
251
|
+
import express from 'express';
|
|
252
|
+
const app = express();
|
|
253
|
+
// Larger limit just for Studio; the rest of the app keeps its defaults.
|
|
254
|
+
app.use('/studio', express.json({ limit: '10mb' }), studio.handler);
|
|
255
|
+
```
|
|
256
|
+
|
|
257
|
+
**NestJS:** NestJS registers its own body-parser at bootstrap, so `app.use(express.json(...))` added after `NestFactory.create()` does *not* override it — the built-in parser runs first and still rejects with `PayloadTooLargeError`. Disable the built-in parser and register a conditional one:
|
|
258
|
+
|
|
259
|
+
```typescript
|
|
260
|
+
// main.ts
|
|
261
|
+
import { NestFactory, HttpAdapterHost } from '@nestjs/core';
|
|
262
|
+
import { json } from 'express';
|
|
263
|
+
import { AppModule } from './app.module';
|
|
264
|
+
import { createStudioMiddleware } from '@axlsdk/studio/middleware';
|
|
265
|
+
|
|
266
|
+
async function bootstrap() {
|
|
267
|
+
// Disable Nest's built-in body parser so we control limits ourselves.
|
|
268
|
+
const app = await NestFactory.create(AppModule, { bodyParser: false });
|
|
269
|
+
|
|
270
|
+
// Apply 10 MB limit to the Studio sub-mount only; rest of the app keeps
|
|
271
|
+
// the 100 KB default. This is the maintainer-endorsed pattern for
|
|
272
|
+
// per-route body limits in NestJS (see nestjs/nest#14734).
|
|
273
|
+
const studioJson = json({ limit: '10mb' });
|
|
274
|
+
const defaultJson = json();
|
|
275
|
+
app.use((req, res, next) =>
|
|
276
|
+
req.url.startsWith('/studio') ? studioJson(req, res, next) : defaultJson(req, res, next),
|
|
277
|
+
);
|
|
278
|
+
|
|
279
|
+
const studio = createStudioMiddleware({ runtime });
|
|
280
|
+
const expressApp = app.get(HttpAdapterHost).httpAdapter.getInstance();
|
|
281
|
+
expressApp.use('/studio', studio.handler);
|
|
282
|
+
studio.upgradeWebSocket(app.getHttpServer());
|
|
283
|
+
|
|
284
|
+
await app.listen(3000);
|
|
285
|
+
}
|
|
286
|
+
bootstrap();
|
|
287
|
+
```
|
|
288
|
+
|
|
289
|
+
> `app.useBodyParser('json', { limit })` raises the limit **globally**, not per-route — avoid it if you want the larger limit scoped to Studio.
|
|
290
|
+
|
|
291
|
+
**Fastify:** set `bodyLimit` when creating the instance, e.g. `fastify({ bodyLimit: 10 * 1024 * 1024 })`. There's no per-route equivalent as clean as Express's; if Studio is the only route that needs a larger limit, either raise the global limit or mount Studio on a separate Fastify instance.
|
|
292
|
+
|
|
223
293
|
### Framework examples
|
|
224
294
|
|
|
225
295
|
#### NestJS
|
|
@@ -368,6 +438,42 @@ Lazy-loaded evals coexist with evals registered directly via `runtime.registerEv
|
|
|
368
438
|
- CORS is not applied in embedded mode — the host framework owns CORS policy
|
|
369
439
|
- `basePath` is validated against unsafe characters and path traversal
|
|
370
440
|
|
|
441
|
+
### Observability-boundary redaction
|
|
442
|
+
|
|
443
|
+
When the runtime is constructed with `config.trace.redact: true`, Studio scrubs user/LLM content at three layers — trace events at emission, REST route responses at serialization, and WebSocket broadcasts at send time — while preserving structural metadata (IDs, keys, agent/tool/workflow names, roles, cost/token/duration metrics, timestamps).
|
|
444
|
+
|
|
445
|
+
```typescript
|
|
446
|
+
const runtime = new AxlRuntime({ config: { trace: { redact: true } } });
|
|
447
|
+
const studio = createStudioMiddleware({ runtime });
|
|
448
|
+
```
|
|
449
|
+
|
|
450
|
+
Under `redact: true`, the following Studio endpoints scrub user content server-side before responding: `GET /api/executions{,/:id}`, `GET /api/memory/:scope{,/:key}` (keys preserved so Memory Browser stays navigable), `GET /api/sessions/:id`, `GET /api/evals/history`, `POST /api/evals/:name/run` (sync), `POST /api/evals/:name/rescore`, `GET /api/decisions`, `POST /api/tools/:name/test`, and `POST /api/workflows/:name/execute` (sync). Streaming WS broadcasts from `POST /api/workflows/:name/execute` with `stream: true` and from `POST /api/playground/chat` also scrub `StreamEvent` content before send.
|
|
451
|
+
|
|
452
|
+
Studio checks the flag via `runtime.isRedactEnabled(): boolean` — it does **not** reach into the config object directly, because `Readonly<AxlConfig>` is shallow and consumers could mutate the nested `trace.redact` field via sub-object access. `GET /api/health` also reports `readOnly: boolean` so clients can gate mutating UI affordances.
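
The shallow-`Readonly` point is easy to reproduce in isolation. The sketch below uses stand-in names (`SketchConfig`, `SketchRuntime`) and assumes the flag is snapshotted at construction; it is not `AxlRuntime`'s implementation, only an illustration of why an accessor is safer than reading the nested field.

```typescript
// Stand-in for the real config type; only the nested flag matters here.
interface SketchConfig {
  trace: { redact: boolean };
}

class SketchRuntime {
  private readonly redact: boolean;

  constructor(readonly config: Readonly<SketchConfig>) {
    // Assumption of this sketch: capture the flag once, at construction.
    this.redact = config.trace.redact === true;
  }

  isRedactEnabled(): boolean {
    return this.redact;
  }
}

const sketchRuntime = new SketchRuntime({ trace: { redact: true } });

// This compiles: Readonly<T> only makes the top-level keys readonly,
// so the nested `trace.redact` field is still writable.
sketchRuntime.config.trace.redact = false;

console.log(sketchRuntime.config.trace.redact); // false (nested field was mutated)
console.log(sketchRuntime.isRedactEnabled()); // true (accessor is unaffected)
```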
|
|
453
|
+
|
|
454
|
+
See [`docs/observability.md`](../../docs/observability.md#pii-and-redaction) for the complete scrubbed/preserved field table.
|
|
455
|
+
|
|
456
|
+
### Multi-tenant deployments
|
|
457
|
+
|
|
458
|
+
Combine `verifyUpgrade` returning `{ allowed, metadata }` with `filterTraceEvent` to scope each WebSocket connection to a tenant/user:
|
|
459
|
+
|
|
460
|
+
```typescript
|
|
461
|
+
const studio = createStudioMiddleware({
|
|
462
|
+
runtime,
|
|
463
|
+
verifyUpgrade: (req) => {
|
|
464
|
+
const userId = authenticate(req);
|
|
465
|
+
if (!userId) return { allowed: false };
|
|
466
|
+
return { allowed: true, metadata: { userId, tenantId: lookupTenant(userId) } };
|
|
467
|
+
},
|
|
468
|
+
filterTraceEvent: (event, metadata) => {
|
|
469
|
+
// Scope the trace firehose: only let a connection see its own tenant's events.
|
|
470
|
+
return event.metadata?.tenantId === metadata?.tenantId;
|
|
471
|
+
},
|
|
472
|
+
});
|
|
473
|
+
```
|
|
474
|
+
|
|
475
|
+
The filter runs on live broadcasts **and** on replay buffer events delivered to late subscribers, so historical cross-tenant events can't leak on reconnect. Predicate errors are fail-closed (event dropped).
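
Fail-closed evaluation can be sketched as a small wrapper around the predicate. The `failClosed` helper and the event and metadata shapes below are hypothetical and only illustrate the semantics described above; Studio applies this behavior internally, so host code does not need to wrap its own `filterTraceEvent`.

```typescript
type TraceFilter<E, M> = (event: E, metadata: M) => boolean;

// Wraps a predicate so that any thrown error counts as "drop the event".
function failClosed<E, M>(predicate: TraceFilter<E, M>): TraceFilter<E, M> {
  return (event, metadata) => {
    try {
      return predicate(event, metadata) === true;
    } catch {
      return false; // fail closed: never deliver on predicate failure
    }
  };
}

// Assumed shapes for the example only.
interface SketchEvent {
  metadata?: { tenantId?: string };
}
interface SketchMeta {
  tenantId: string;
}

const sameTenant = failClosed<SketchEvent, SketchMeta | undefined>((event, meta) => {
  if (!meta) throw new Error('connection has no metadata');
  return event.metadata?.tenantId === meta.tenantId;
});

// The same wrapped predicate is applied to live and replayed events alike.
const replayBuffer: SketchEvent[] = [
  { metadata: { tenantId: 'a' } },
  { metadata: { tenantId: 'b' } },
  {},
];
const visibleToA = replayBuffer.filter((e) => sameTenant(e, { tenantId: 'a' }));
console.log(visibleToA.length); // 1
console.log(sameTenant({ metadata: { tenantId: 'a' } }, undefined)); // false
```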
|
|
476
|
+
|
|
371
477
|
### Migrating from the standalone CLI
|
|
372
478
|
|
|
373
479
|
If you currently use `npx @axlsdk/studio` with a config file:
|
|
@@ -390,10 +496,15 @@ src/
|
|
|
390
496
|
server/
|
|
391
497
|
index.ts createServer() — Hono app composition (basePath, readOnly, cors)
|
|
392
498
|
types.ts API types, WebSocket message types
|
|
393
|
-
|
|
499
|
+
aggregates/
|
|
500
|
+
aggregate-snapshots.ts AggregateSnapshots<State> helper (per-window state, fold, replace, broadcastTransform)
|
|
501
|
+
trace-aggregator.ts TraceAggregator<State> — TraceEvent consumer (costs, trace-stats)
|
|
502
|
+
execution-aggregator.ts ExecutionAggregator<State> — ExecutionInfo consumer (workflow-stats)
|
|
503
|
+
eval-aggregator.ts EvalAggregator<State> — EvalHistoryEntry consumer (eval-trends)
|
|
504
|
+
reducers.ts Pure reducers: reduceCost, reduceWorkflowStats, reduceTraceStats, reduceEvalTrends + enrichWorkflowStats
|
|
394
505
|
middleware/
|
|
395
506
|
error-handler.ts Axl errors → JSON error envelope
|
|
396
|
-
routes/ One file per resource (health, workflows, agents, tools, etc.)
|
|
507
|
+
routes/ One file per resource (health, workflows, agents, tools, costs, eval-trends, workflow-stats, trace-stats, evals, etc.)
|
|
397
508
|
ws/
|
|
398
509
|
handler.ts WebSocket message routing (Hono adapter)
|
|
399
510
|
connection-manager.ts Channel subscriptions + broadcast (BroadcastTarget) + replay buffer for execution channels
|