@mantyx/sdk 0.10.1 → 0.12.0

@@ -9,7 +9,7 @@ SDK is expected to ship for client-resolved (`*_local`) tools.
9
9
 
10
10
  If you're just looking for HTTP routes, auth, body shapes, or session
11
11
  semantics, start with `agent-runs-protocol.md`. If you're writing or
12
- maintaining an SDK and want to know *exactly* what a `local_tool_call` event
12
+ maintaining an SDK and want to know _exactly_ what a `local_tool_call` event
13
13
  looks like for `mcp_local`, you're in the right place.
14
14
 
15
15
  > **Authentication.** Every example below uses
@@ -21,9 +21,10 @@ looks like for `mcp_local`, you're in the right place.
21
21
  > `models:read`, `mantyx.identity:read`); see §2 of
22
22
  > `agent-runs-protocol.md` for the per-endpoint scope table and
23
23
  > [`docs/oauth.md`](./oauth.md) for the registration / Authorization Code
24
- > + PKCE flow.
24
+ >
25
+ > - PKCE flow.
25
26
 
26
- > **Stability.** Field names listed in *bold* are part of the documented
27
+ > **Stability.** Field names listed in _bold_ are part of the documented
27
28
  > stable surface. Any other fields are passed through verbatim and survive
28
29
  > round-trips, but their semantics are not contractually guaranteed. The
29
30
  > server uses Zod with `passthrough` for all `*_local` resolved-content
@@ -34,15 +35,15 @@ looks like for `mcp_local`, you're in the right place.
34
35
 
35
36
  ## 0. Glossary
36
37
 
37
- | Term | Meaning |
38
- | ------------------- | ------- |
39
- | **MANTYX** | The agent operating system server (this repo). Owns LLM orchestration, tool execution for server-resolved tools, persistence. |
40
- | **SDK** | Anything calling the public agent-runs API — typically `@mantyx/ts-sdk`, but also other-language SDKs and direct HTTP clients. |
41
- | **Agent run** | A single LLM execution. Streams events; ends with a terminal `result` / `error` / `cancelled`. |
42
- | **Spec** | The JSON object describing what the run does — model, prompt, tools, budgets, optional `reasoningLevel`. Sent in the `POST /agent-runs` (or `.../messages`) body. |
43
- | **Tool ref** | One entry in `spec.tools[]`. A discriminated union keyed by `kind`. |
44
- | **Server-resolved** | A tool MANTYX executes itself (`mantyx`, `mantyx_plugin`, `a2a`, `mcp`). The SDK only sees informational `tool_result` events. |
45
- | **Client-resolved** | A tool the SDK executes (`local`, `a2a_local`, `mcp_local`). MANTYX emits `local_tool_call`, the SDK does the work, the SDK posts back to `.../tool-results`. |
38
+ | Term | Meaning |
39
+ | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
40
+ | **MANTYX** | The agent operating system server (this repo). Owns LLM orchestration, tool execution for server-resolved tools, persistence. |
41
+ | **SDK** | Anything calling the public agent-runs API — typically `@mantyx/ts-sdk`, but also other-language SDKs and direct HTTP clients. |
42
+ | **Agent run** | A single LLM execution. Streams events; ends with a terminal `result` / `error` / `cancelled`. |
43
+ | **Spec** | The JSON object describing what the run does — model, prompt, tools, budgets, optional `reasoningLevel`. Sent in the `POST /agent-runs` (or `.../messages`) body. |
44
+ | **Tool ref** | One entry in `spec.tools[]`. A discriminated union keyed by `kind`. |
45
+ | **Server-resolved** | A tool MANTYX executes itself (`mantyx`, `mantyx_plugin`, `a2a`, `mcp`). The SDK only sees informational `tool_result` events. |
46
+ | **Client-resolved** | A tool the SDK executes (`local`, `a2a_local`, `mcp_local`). MANTYX emits `local_tool_call`, the SDK does the work, the SDK posts back to `.../tool-results`. |
46
47
  | **Resolution** | The act of turning an external resource (A2A peer, MCP server) into a self-contained JSON document the model can reason about. For `*_local` kinds, resolution is the **SDK's** responsibility. |
47
48
 
48
49
  ---
@@ -108,23 +109,34 @@ short-circuit, etc.) see `agent-runs-protocol.md` §4.
108
109
  {
109
110
  "modelId": "openai:gpt-5.5",
110
111
  "systemPrompt": "...",
111
- "prompt": "...", // OR "messages": [...]
112
- "tools": [ /* tool refs — see §3 */ ],
113
- "reasoningLevel": "medium", // optional; see §6
112
+ "prompt": "...", // OR "messages": [...]
113
+ "tools": [
114
+ /* tool refs see §3 */
115
+ ],
116
+ "reasoningLevel": "medium", // optional; see §6
114
117
  "budgets": { "maxToolTurns": 32 },
115
- "outputSchema": { // optional; see §7
116
- "name": "weather_report", // defaults to "output"
117
- "schema": { /* JSON Schema */ }
118
+ "outputSchema": {
119
+ // optional; see §7
120
+ "name": "weather_report", // defaults to "output"
121
+ "schema": {
122
+ /* JSON Schema */
123
+ },
118
124
  },
119
- "loopDetection": { // optional; see §8
125
+ "loopDetection": {
126
+ // optional; see §8
120
127
  "consecutiveThreshold": 3,
121
- "hardCutoffThreshold": 6
128
+ "hardCutoffThreshold": 6,
129
+ },
130
+ "toolBudgets": {
131
+ // optional; see §8
132
+ "recall": { "maxCalls": 4 },
133
+ "hive_consult_ontology": { "maxCalls": 4 },
122
134
  },
123
- "toolBudgets": { // optional; see §8
124
- "recall": { "maxCalls": 4 },
125
- "hive_consult_ontology": { "maxCalls": 4 }
135
+ "supervisor": {
136
+ // optional; see §8.4 — platform LLM judge on ephemeral runs
137
+ "interval": 5,
126
138
  },
127
- "metadata": { "customer": "acme" } // optional, free-form k/v
139
+ "metadata": { "customer": "acme" }, // optional, free-form k/v
128
140
  }
129
141
  ```
130
142
 
@@ -132,29 +144,29 @@ short-circuit, etc.) see `agent-runs-protocol.md` §4.
132
144
 
133
145
  Same body shape, posted to `POST /agent-sessions/:id/messages`. The session
134
146
  keeps the conversation history; per-message `tools`, `reasoningLevel`,
135
- `outputSchema`, `loopDetection`, and `toolBudgets` *replace* the session's
136
- defaults for that single run only — the next run falls back to whatever
137
- the session was created with.
147
+ `outputSchema`, `loopDetection`, `toolBudgets`, and `supervisor` _replace_
148
+ the session's defaults for that single run only — the next run falls back to
149
+ whatever the session was created with.
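The override rule in the paragraph above can be modelled as a whole-field replacement scoped to a single run. The sketch below is only an illustration of that reading: `RunSettings`, `OVERRIDABLE_KEYS`, and `effectiveSettings` are hypothetical names, and it assumes a field is replaced in full rather than deep-merged, which the text implies but does not spell out.

```ts
// Hypothetical model of the per-message override rule; not MANTYX's code.
type RunSettings = {
  tools?: unknown[];
  reasoningLevel?: string;
  outputSchema?: Record<string, unknown>;
  loopDetection?: Record<string, unknown>;
  toolBudgets?: Record<string, { maxCalls: number }>;
  supervisor?: { interval: number };
};

// The six fields the text lists as replaceable per message.
const OVERRIDABLE_KEYS = [
  "tools",
  "reasoningLevel",
  "outputSchema",
  "loopDetection",
  "toolBudgets",
  "supervisor",
] as const;

function effectiveSettings(
  sessionDefaults: RunSettings,
  perMessage: RunSettings,
): RunSettings {
  // Copy the defaults so the session itself is never mutated.
  const merged: RunSettings = { ...sessionDefaults };
  for (const key of OVERRIDABLE_KEYS) {
    if (perMessage[key] !== undefined) {
      // Assumption: the whole field is swapped, not deep-merged.
      (merged as Record<string, unknown>)[key] = perMessage[key];
    }
  }
  return merged;
}
```

Because the defaults object is copied and never written to, a later call with an empty per-message object yields the session's original values again, which matches the "next run falls back" behaviour described above.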
138
150
 
139
151
  ---
140
152
 
141
153
  ## 3. Tool ref taxonomy
142
154
 
143
155
  Every entry in `spec.tools[]` is one of the seven shapes below. The
144
- *resolution column* is the contract that drives everything else: **server**
156
+ _resolution column_ is the contract that drives everything else: **server**
145
157
  means MANTYX runs the tool itself and the SDK only ever sees a
146
158
  `tool_result` event; **client** means MANTYX is a transport and the SDK
147
159
  must answer `local_tool_call` events.
148
160
 
149
- | Kind | Resolution | Wire-payload contract |
150
- | ---------------- | ---------- | --------------------- |
151
- | `mantyx` | server | `{ id }` reference to a workspace `Tool` row. |
152
- | `mantyx_plugin` | server | `{ name }` reference to a platform plugin tool. |
153
- | `local` | client | `{ name, description?, parameters?, outputSchema?, longRunning? }` — `parameters` is **JSON Schema** (object schema with `properties`/`required`); forwarded verbatim to the LLM provider and validated against incoming tool-call args before execution. `outputSchema` (optional) is JSON Schema for the tool's structured return value, surfaced to providers that accept per-tool response schemas. `longRunning` (optional, default `false`) annotates the model-facing description with a "don't double-call while pending" hint so every provider treats the tool as long-running. |
154
- | `a2a` | server | `{ name, agentCardUrl, headers?, contextId?, description? }`. |
155
- | `a2a_local` | client | `{ name, agentCard }` — **resolved A2A Agent Card JSON content**. |
156
- | `mcp` | server | `{ name, url, headers?, toolFilter? }`. |
157
- | `mcp_local` | client | `{ name, serverInfo?, tools[] }` — **resolved MCP `Tool[]`**. |
161
+ | Kind | Resolution | Wire-payload contract |
162
+ | --------------- | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
163
+ | `mantyx` | server | `{ id }` reference to a workspace `Tool` row. |
164
+ | `mantyx_plugin` | server | `{ name }` reference to a platform plugin tool. |
165
+ | `local` | client | `{ name, description?, parameters?, outputSchema?, longRunning? }` — `parameters` is **JSON Schema** (object schema with `properties`/`required`); forwarded verbatim to the LLM provider and validated against incoming tool-call args before execution. `outputSchema` (optional) is JSON Schema for the tool's structured return value, surfaced to providers that accept per-tool response schemas. `longRunning` (optional, default `false`) annotates the model-facing description with a "don't double-call while pending" hint so every provider treats the tool as long-running. |
166
+ | `a2a` | server | `{ name, agentCardUrl, headers?, contextId?, description? }`. |
167
+ | `a2a_local` | client | `{ name, agentCard }` — **resolved A2A Agent Card JSON content**. |
168
+ | `mcp` | server | `{ name, url, headers?, toolFilter? }`. |
169
+ | `mcp_local` | client | `{ name, serverInfo?, tools[] }` — **resolved MCP `Tool[]`**. |
158
170
 
159
171
  The remainder of this document focuses on `local`, `a2a_local`, and
160
172
  `mcp_local`, because they're the ones that carry SDK-defined structured
@@ -173,38 +185,40 @@ caller-specific business logic.
173
185
  ```jsonc
174
186
  {
175
187
  "kind": "local",
176
- "name": "send_email", // model-facing; /^[a-zA-Z0-9_]{1,64}$/
188
+ "name": "send_email", // model-facing; /^[a-zA-Z0-9_]{1,64}$/
177
189
  "description": "Send a transactional email.",
178
- "parameters": { // OPTIONAL; JSON Schema for args
190
+ "parameters": {
191
+ // OPTIONAL; JSON Schema for args
179
192
  "type": "object",
180
193
  "properties": {
181
- "to": { "type": "string", "format": "email" },
194
+ "to": { "type": "string", "format": "email" },
182
195
  "subject": { "type": "string" },
183
- "body": { "type": "string" }
196
+ "body": { "type": "string" },
184
197
  },
185
198
  "required": ["to", "subject", "body"],
186
- "additionalProperties": false
199
+ "additionalProperties": false,
187
200
  },
188
- "outputSchema": { // OPTIONAL; JSON Schema for the return value
201
+ "outputSchema": {
202
+ // OPTIONAL; JSON Schema for the return value
189
203
  "type": "object",
190
204
  "properties": { "id": { "type": "string" } },
191
205
  "required": ["id"],
192
- "additionalProperties": false
206
+ "additionalProperties": false,
193
207
  },
194
- "longRunning": false // OPTIONAL; default false
208
+ "longRunning": false, // OPTIONAL; default false
195
209
  }
196
210
  ```
197
211
 
198
212
  **Field reference:**
199
213
 
200
- | Field | Required | Notes |
201
- | -------------- | -------- | ----- |
202
- | `kind` | yes | Discriminator literal `"local"`. |
203
- | `name` | yes | Model-facing tool name. Must match `/^[a-zA-Z0-9_]{1,64}$/`. |
204
- | `description` | no | Free-form. When omitted the model sees an empty description (acceptable but reduces tool selection accuracy). |
205
- | `parameters` | no | JSON Schema for the tool's input. Must be an object schema (`type: "object"` with `properties`); other shapes are coerced to an empty object schema server-side. Nested constraints (`array.items`, `enum`, `anyOf`, …) are preserved end-to-end. Args that fail server-side validation produce a structured `tool_input_invalid` tool result the model can recover from instead of crashing the call. |
206
- | `outputSchema` | no | JSON Schema for the structured value the tool returns. Forwarded to providers with per-tool response schemas (Gemini's `responseJsonSchema` on the FunctionDeclaration); other engines surface it through the description and rely on host-side validation. The model uses it to plan follow-up arguments more reliably. Must be an object schema; non-object roots are dropped server-side (engines reject non-object roots in this position). |
207
- | `longRunning` | no | When `true`, MANTYX appends a stable hint to the description:<br>*"NOTE: This is a long-running operation. Do not call this tool again if it has already returned an intermediate or pending status."*<br>Useful for tools where a single call may yield a `pending` / status response and the SDK polls on its own; without the hint, models routinely fire repeat calls and waste turns. Pure declarative — MANTYX does not change scheduling. |
214
+ | Field | Required | Notes |
215
+ | -------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
216
+ | `kind` | yes | Discriminator literal `"local"`. |
217
+ | `name` | yes | Model-facing tool name. Must match `/^[a-zA-Z0-9_]{1,64}$/`. |
218
+ | `description` | no | Free-form. When omitted the model sees an empty description (acceptable but reduces tool selection accuracy). |
219
+ | `parameters` | no | JSON Schema for the tool's input. Must be an object schema (`type: "object"` with `properties`); other shapes are coerced to an empty object schema server-side. Nested constraints (`array.items`, `enum`, `anyOf`, …) are preserved end-to-end. Args that fail server-side validation produce a structured `tool_input_invalid` tool result the model can recover from instead of crashing the call. |
220
+ | `outputSchema` | no | JSON Schema for the structured value the tool returns. Forwarded to providers with per-tool response schemas (Gemini's `responseJsonSchema` on the FunctionDeclaration); other engines surface it through the description and rely on host-side validation. The model uses it to plan follow-up arguments more reliably. Must be an object schema; non-object roots are dropped server-side (engines reject non-object roots in this position). |
221
+ | `longRunning` | no | When `true`, MANTYX appends a stable hint to the description:<br>_"NOTE: This is a long-running operation. Do not call this tool again if it has already returned an intermediate or pending status."_<br>Useful for tools where a single call may yield a `pending` / status response and the SDK polls on its own; without the hint, models routinely fire repeat calls and waste turns. Pure declarative — MANTYX does not change scheduling. |
208
222
 
209
223
  **Tool call dispatch.** When the model calls a `local` tool, the SSE
210
224
  stream emits `local_tool_call` with `kind: "local"` (or omitted, for
@@ -222,9 +236,10 @@ the `agentCard` field. MANTYX never reaches out to discover it.
222
236
  ```jsonc
223
237
  {
224
238
  "kind": "a2a_local",
225
- "name": "intranet_hr_agent", // model-facing; /^[a-zA-Z0-9_]{1,64}$/
226
- "description": "...", // OPTIONAL; overrides the synthesized one
227
- "agentCard": { // REQUIRED; A2A Agent Card content
239
+ "name": "intranet_hr_agent", // model-facing; /^[a-zA-Z0-9_]{1,64}$/
240
+ "description": "...", // OPTIONAL; overrides the synthesized one
241
+ "agentCard": {
242
+ // REQUIRED; A2A Agent Card content
228
243
  "protocolVersion": "0.3.0",
229
244
  "name": "Acme HR",
230
245
  "description": "Answers questions about HR policies and benefits.",
@@ -242,34 +257,38 @@ the `agentCard` field. MANTYX never reaches out to discover it.
242
257
  "name": "PTO lookup",
243
258
  "description": "Find a teammate's remaining PTO days for the year.",
244
259
  "tags": ["hr", "pto"],
245
- "examples": ["How many PTO days does Alice have left?"]
246
- }
260
+ "examples": ["How many PTO days does Alice have left?"],
261
+ },
262
+ ],
263
+ "securitySchemes": {
264
+ /* spec-shaped, never read by MANTYX */
265
+ },
266
+ "security": [
267
+ /* spec-shaped, never read by MANTYX */
247
268
  ],
248
- "securitySchemes": { /* spec-shaped, never read by MANTYX */ },
249
- "security": [ /* spec-shaped, never read by MANTYX */ ]
250
269
  /* …any other A2A spec field passes through unchanged. */
251
- }
270
+ },
252
271
  }
253
272
  ```
254
273
 
255
274
  **Where the SDK obtains `agentCard`:**
256
275
 
257
- - *Well-known URL.* Most peers expose the card at
276
+ - _Well-known URL._ Most peers expose the card at
258
277
  `<peer>/.well-known/agent-card.json`. The SDK can simply
259
278
  `fetch` it (with whatever auth applies on the local network).
260
- - *Static config.* For peers that don't publish a card, hand-craft one — the
279
+ - _Static config._ For peers that don't publish a card, hand-craft one — the
261
280
  spec only requires a couple of fields and the rest is all metadata.
262
- - *Registry / cache.* Cache cards locally and refresh periodically. MANTYX
281
+ - _Registry / cache._ Cache cards locally and refresh periodically. MANTYX
263
282
  treats every spec submission as a fresh snapshot, so new cards take
264
283
  effect on the next run / message.
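As a rough illustration of the well-known-URL option, the sketch below derives the card URL from a peer base URL and wraps an already-fetched card in an `a2a_local` ref. `wellKnownCardUrl`, `toA2aLocalRef`, and the `AgentCard` type are hypothetical helpers, not part of any SDK; the name check reuses the `/^[a-zA-Z0-9_]{1,64}$/` pattern quoted for model-facing names.

```ts
// Hypothetical helpers; the card itself would come from the SDK's own fetch.
const TOOL_NAME_PATTERN = /^[a-zA-Z0-9_]{1,64}$/;

type AgentCard = { name: string; description?: string; [field: string]: unknown };

// Builds `<peer>/.well-known/agent-card.json`, tolerating trailing slashes.
function wellKnownCardUrl(peerBaseUrl: string): string {
  return `${peerBaseUrl.replace(/\/+$/, "")}/.well-known/agent-card.json`;
}

// Wraps a resolved card in an `a2a_local` tool ref, passing every card field through.
function toA2aLocalRef(name: string, agentCard: AgentCard) {
  if (!TOOL_NAME_PATTERN.test(name)) {
    throw new Error(`invalid model-facing tool name: ${name}`);
  }
  return { kind: "a2a_local" as const, name, agentCard };
}
```

The card is deliberately left untouched: since unknown fields are echoed back in `local_tool_call` events, trimming it here would remove data the SDK may later want to dispatch on.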
265
284
 
266
285
  **What MANTYX does with `agentCard`:**
267
286
 
268
- | Field | Used for | Notes |
269
- | ------------------------ | -------- | ----- |
270
- | `name`, `description` | Tool description for the model | Used to compose `"Delegate a task to <name>: <description>"` if no `description` override is supplied at the ref level. |
271
- | `skills[]` (first 12) | Tool description for the model | Bulleted into the description so the model can choose a peer based on capability. |
272
- | All other fields | Echo only | Forwarded back to the SDK in every `local_tool_call` event so the SDK can dispatch by `url`, by `provider.organization`, by `protocolVersion`, or whatever it indexed on. |
287
+ | Field | Used for | Notes |
288
+ | --------------------- | ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
289
+ | `name`, `description` | Tool description for the model | Used to compose `"Delegate a task to <name>: <description>"` if no `description` override is supplied at the ref level. |
290
+ | `skills[]` (first 12) | Tool description for the model | Bulleted into the description so the model can choose a peer based on capability. |
291
+ | All other fields | Echo only | Forwarded back to the SDK in every `local_tool_call` event so the SDK can dispatch by `url`, by `provider.organization`, by `protocolVersion`, or whatever it indexed on. |
273
292
 
274
293
  ### 3.3 `mcp_local` — SDK-resolved Tool catalog
275
294
 
@@ -283,28 +302,32 @@ ship the `Implementation` block from MCP `Initialize` as `serverInfo`.
283
302
  ```jsonc
284
303
  {
285
304
  "kind": "mcp_local",
286
- "name": "fs", // SDK-side server label; not a name prefix
287
- "serverInfo": { // OPTIONAL; from MCP Initialize
305
+ "name": "fs", // SDK-side server label; not a name prefix
306
+ "serverInfo": {
307
+ // OPTIONAL; from MCP Initialize
288
308
  "name": "mcp-server-filesystem",
289
- "version": "0.4.1"
309
+ "version": "0.4.1",
290
310
  /* …any other Implementation field passes through unchanged. */
291
311
  },
292
- "tools": [ // REQUIRED; verbatim MCP tools/list output
312
+ "tools": [
313
+ // REQUIRED; verbatim MCP tools/list output
293
314
  {
294
- "name": "fs_read_file", // model-facing; /^[a-zA-Z0-9_]{1,64}$/; SDK owns naming
315
+ "name": "fs_read_file", // model-facing; /^[a-zA-Z0-9_]{1,64}$/; SDK owns naming
295
316
  "description": "Read a file under /workspace.",
296
- "inputSchema": { // MCP's term for the JSON Schema
317
+ "inputSchema": {
318
+ // MCP's term for the JSON Schema
297
319
  "type": "object",
298
320
  "properties": { "path": { "type": "string" } },
299
- "required": ["path"]
321
+ "required": ["path"],
300
322
  },
301
- "annotations": { // OPTIONAL; spec-defined hints
323
+ "annotations": {
324
+ // OPTIONAL; spec-defined hints
302
325
  "readOnlyHint": true,
303
- "openWorldHint": false
304
- }
326
+ "openWorldHint": false,
327
+ },
305
328
  /* …any other MCP Tool field passes through unchanged. */
306
- }
307
- ]
329
+ },
330
+ ],
308
331
  }
309
332
  ```
310
333
 
@@ -313,8 +336,8 @@ ship the `Implementation` block from MCP `Initialize` as `serverInfo`.
313
336
  ```ts
314
337
  // pseudo-code, MCP-SDK-flavoured
315
338
  const client = new McpClient(stdio("./fs-server"));
316
- const init = await client.initialize(); // → { name, version, … }
317
- const list = await client.listTools(); // → { tools: [...] }
339
+ const init = await client.initialize(); // → { name, version, … }
340
+ const list = await client.listTools(); // → { tools: [...] }
318
341
 
319
342
  // drop straight into the spec
320
343
  const ref = {
@@ -327,18 +350,18 @@ const ref = {
327
350
 
328
351
  **What MANTYX does with the catalog:**
329
352
 
330
- | Field | Used for | Notes |
331
- | ------------------------ | -------- | ----- |
332
- | `tools[].name` | Model-facing tool name | Used as-is. MANTYX does **not** prefix with the ref's `name`. The SDK is responsible for any naming convention (e.g. emit `fs_read_file` instead of `read_file` if you have multiple servers). |
333
- | `tools[].description` | Model-facing description | Used as-is. |
334
- | `tools[].inputSchema` | LLM tool-call schema | Forwarded **verbatim** to the LLM provider as the tool's JSON Schema, then validated against incoming tool-call args (Ajv) before execution. Nested constraints (`array.items`, `enum`, `anyOf`, …) are preserved end-to-end. Empty / missing schema → no-arg tool. Args that violate the schema produce a structured `tool_input_invalid` tool result the model can recover from instead of crashing the tool. |
335
- | `tools[].annotations` | Echo only | Forwarded to the SDK in `local_tool_call` events (as part of the call envelope) for observability. |
336
- | `serverInfo` | Echo only | Forwarded to the SDK in `local_tool_call.mcpServerInfo`. |
353
+ | Field | Used for | Notes |
354
+ | --------------------- | ------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
355
+ | `tools[].name` | Model-facing tool name | Used as-is. MANTYX does **not** prefix with the ref's `name`. The SDK is responsible for any naming convention (e.g. emit `fs_read_file` instead of `read_file` if you have multiple servers). |
356
+ | `tools[].description` | Model-facing description | Used as-is. |
357
+ | `tools[].inputSchema` | LLM tool-call schema | Forwarded **verbatim** to the LLM provider as the tool's JSON Schema, then validated against incoming tool-call args (Ajv) before execution. Nested constraints (`array.items`, `enum`, `anyOf`, …) are preserved end-to-end. Empty / missing schema → no-arg tool. Args that violate the schema produce a structured `tool_input_invalid` tool result the model can recover from instead of crashing the tool. |
358
+ | `tools[].annotations` | Echo only | Forwarded to the SDK in `local_tool_call` events (as part of the call envelope) for observability. |
359
+ | `serverInfo` | Echo only | Forwarded to the SDK in `local_tool_call.mcpServerInfo`. |
337
360
 
338
361
  > **Naming convention reminder.** Because MANTYX doesn't prefix names for
339
362
  > `mcp_local`, two refs that both expose a tool called `read_file` will
340
363
  > collide. Either give the second one a different `name` in the catalog or
341
- > drop it via SDK-side filtering. (For `mcp` — *remote* MCP — MANTYX does
364
+ > drop it via SDK-side filtering. (For `mcp` — _remote_ MCP — MANTYX does
342
365
  > auto-prefix with the ref's `name`, so collisions are impossible.)
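Since name collisions across `mcp_local` refs are the SDK's problem, a pre-flight check before submitting the spec may be worthwhile. The following is a minimal sketch under that assumption; `McpLocalRef` is a trimmed-down type and `findToolNameCollisions` is a hypothetical helper.

```ts
// Trimmed-down shape of an `mcp_local` ref; only the fields the check needs.
type McpLocalRef = { kind: "mcp_local"; name: string; tools: { name: string }[] };

// Returns each tool name declared more than once, mapped to the ref labels declaring it.
function findToolNameCollisions(refs: McpLocalRef[]): Record<string, string[]> {
  // Null-prototype object so tool names like "constructor" are safe keys.
  const owners: Record<string, string[]> = Object.create(null);
  for (const ref of refs) {
    for (const tool of ref.tools) {
      const servers = owners[tool.name] ?? [];
      servers.push(ref.name);
      owners[tool.name] = servers;
    }
  }
  const collisions: Record<string, string[]> = Object.create(null);
  for (const toolName of Object.keys(owners)) {
    const servers = owners[toolName] ?? [];
    if (servers.length > 1) collisions[toolName] = servers;
  }
  return collisions;
}
```

An SDK could run this over `spec.tools` and either rename the offending catalog entries (for example `fs_read_file`) or filter them out before posting the spec.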
343
366
 
344
367
  ---
@@ -352,24 +375,31 @@ so reconnects can use `Last-Event-ID`.
352
375
  Every event payload has the same envelope:
353
376
 
354
377
  ```jsonc
355
- { "seq": 7, "type": "<event-type>", "data": { /* type-specific */ } }
378
+ {
379
+ "seq": 7,
380
+ "type": "<event-type>",
381
+ "data": {
382
+ /* type-specific */
383
+ },
384
+ }
356
385
  ```
357
386
 
358
387
  The vocabulary (`EphemeralEventType` in `bus.ts`):
359
388
 
360
- | Type | Direction | Frequency | Purpose |
361
- | ----------------------- | --------- | --------- | ------- |
362
- | `assistant_delta` | M → SDK | Many | Streamed assistant text token / chunk. |
363
- | `thinking_delta` | M → SDK | Many (iff `reasoningLevel > 0`) | Streamed extended-thinking text (provider redacts when policy requires). |
364
- | `tool_result` | M → SDK | Per server-resolved tool call | Informational — tells the SDK that MANTYX ran a server-resolved tool (`mantyx`, `mantyx_plugin`, `a2a`, `mcp`) and got a result. The SDK does not need to act on it. |
365
- | `local_tool_call` | M → SDK | Per client-resolved tool call | **Action required.** SDK must POST a tool-result. |
366
- | `local_tool_result_in` | M → SDK | Per client-resolved tool call | Informational mirror of the tool-result the SDK just posted, persisted for observability. Re-emitted to late subscribers so they can replay the conversation. |
367
- | `loop_detected` | M → SDK | 0–2× per run (soft nudge + optional hard cutoff) | Observability for the loop-detection guard (see §8). The server already substituted the synthetic skip + steering nudge — SDK clients render a status note (`looping — nudged` / `looping — gave up`) and otherwise leave the run alone. |
368
- | `tool_budget_exceeded` | M → SDK | Per intercepted tool call | Observability for per-tool call budgets (see §8). The synthetic `tool_result` carrying the "budget exceeded — pivot or finalize" body lands on the normal tool-result channel; this event is purely so SDK clients can surface a UI banner. |
369
- | `assistant_message` | M → SDK | per turn | Final assistant message for the turn (concatenated, persistence-ready). |
370
- | `result` | M → SDK | 1× terminal | Successful completion. Carries the final assistant text and run summary. |
371
- | `error` | M → SDK | 1× terminal | Failure. Carries `error` (message), `code` / `errorClass` (category), `finishReason`, and an optional `partialText` salvage payload. See §4.7. |
372
- | `cancelled` | M → SDK | 1× terminal | Cancellation. Run was aborted via `POST /cancel`. |
389
+ | Type | Direction | Frequency | Purpose |
390
+ | ---------------------- | --------- | ------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
391
+ | `assistant_delta` | M → SDK | Many | Streamed assistant text token / chunk. |
392
+ | `thinking_delta` | M → SDK | Many (iff `reasoningLevel > 0`) | Streamed extended-thinking text (provider redacts when policy requires). |
393
+ | `tool_result` | M → SDK | Per server-resolved tool call | Informational — tells the SDK that MANTYX ran a server-resolved tool (`mantyx`, `mantyx_plugin`, `a2a`, `mcp`) and got a result. The SDK does not need to act on it. |
394
+ | `local_tool_call` | M → SDK | Per client-resolved tool call | **Action required.** SDK must POST a tool-result. |
395
+ | `local_tool_result_in` | M → SDK | Per client-resolved tool call | Informational mirror of the tool-result the SDK just posted, persisted for observability. Re-emitted to late subscribers so they can replay the conversation. |
396
+ | `loop_detected` | M → SDK | 0–2× per run (soft nudge + optional hard cutoff) | Observability for the loop-detection guard (see §8). The server already substituted the synthetic skip + steering nudge — SDK clients render a status note (`looping — nudged` / `looping — gave up`) and otherwise leave the run alone. |
397
+ | `tool_budget_exceeded` | M → SDK | Per intercepted tool call | Observability for per-tool call budgets (see §8). The synthetic `tool_result` carrying the "budget exceeded — pivot or finalize" body lands on the normal tool-result channel; this event is purely so SDK clients can surface a UI banner. |
398
+ | `supervisor` | M → SDK | 0–N× per run (every `interval` LLM calls) | Run-supervisor check (see §4.7 / §8.4). Fired on **every** review — including `on_track` — so SDK clients can render supervisor activity. When the judge steers the run (`redirect` / `finalize`), the pipeline has already injected the steering message or forced a tools-disabled finalize turn. |
399
+ | `assistant_message` | M → SDK | 1× per turn | Final assistant message for the turn (concatenated, persistence-ready). |
400
+ | `result` | M → SDK | 1× terminal | Successful completion. Carries the final assistant text and run summary. |
401
+ | `error` | M → SDK | 1× terminal | Failure. Carries `error` (message), `code` / `errorClass` (category), `finishReason`, and an optional `partialText` salvage payload. See §4.7. |
402
+ | `cancelled` | M → SDK | 1× terminal | Cancellation. Run was aborted via `POST /cancel`. |
373
403
 
374
404
  `result`, `error`, and `cancelled` are the **terminal** events — the SDK
375
405
  should close the SSE stream after one of them arrives.
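The terminal-event rule can be sketched as a small consumer that stops at the first `result`, `error`, or `cancelled` event. `RunEvent`, `isTerminal`, and `consumeUntilTerminal` are illustrative names; the code works on already-parsed envelopes and leaves SSE transport out entirely.

```ts
// Parsed event envelope, as described above: { seq, type, data }.
type RunEvent = { seq: number; type: string; data: unknown };

const TERMINAL_TYPES = ["result", "error", "cancelled"];

function isTerminal(event: RunEvent): boolean {
  return TERMINAL_TYPES.indexOf(event.type) !== -1;
}

// Processes events in order and stops after the first terminal one.
function consumeUntilTerminal(events: RunEvent[]): {
  handled: RunEvent[];
  lastSeq: number | null;
  done: boolean;
} {
  const handled: RunEvent[] = [];
  let lastSeq: number | null = null;
  for (const event of events) {
    handled.push(event);
    lastSeq = event.seq; // highest sequence number seen so far
    if (isTerminal(event)) {
      return { handled, lastSeq, done: true }; // caller should close the stream here
    }
  }
  return { handled, lastSeq, done: false };
}
```

When `done` is `false` the stream ended without a terminal event, so a reconnect is probably appropriate; whether `lastSeq` maps directly onto `Last-Event-ID` is an assumption this excerpt does not confirm.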
@@ -395,8 +425,8 @@ progress text — it's not part of the canonical assistant response.
395
425
  "data": {
396
426
  "toolUseId": "tu_a",
397
427
  "name": "github_search_repos",
398
- "result": "..." // truncated for display; never JSON-parsed by SDK
399
- }
428
+ "result": "...", // truncated for display; never JSON-parsed by SDK
429
+ },
400
430
  }
401
431
  ```
402
432
 
@@ -436,8 +466,8 @@ No extras. Dispatch by `name`.
436
466
  "toolUseId": "tu_x",
437
467
  "name": "compute_total",
438
468
  "args": { "amount": 42, "currency": "USD" },
439
- "kind": "local" // OR omitted (legacy)
440
- }
469
+ "kind": "local", // OR omitted (legacy)
470
+ },
441
471
  }
442
472
  ```
443
473
 
@@ -455,17 +485,20 @@ dispatch to the right A2A client when it manages multiple peers.
455
485
  "name": "intranet_hr_agent",
456
486
  "args": { "message": "When does PTO reset?" },
457
487
  "kind": "a2a_local",
458
- "agentCard": { // full Agent Card from the spec
488
+ "agentCard": {
489
+ // full Agent Card from the spec
459
490
  "name": "Acme HR",
460
491
  "url": "https://hr.intranet.acme/a2a",
461
- "skills": [ /* ... */ ]
492
+ "skills": [
493
+ /* ... */
494
+ ],
462
495
  /* ...all other fields the SDK shipped... */
463
- }
464
- }
496
+ },
497
+ },
465
498
  }
466
499
  ```
467
500
 
468
- `args.message` is *always* `{ "message": string }` for `a2a_local` — the
501
+ `args.message` is _always_ `{ "message": string }` for `a2a_local` — the
469
502
  LLM's task is reduced to "what do I want to ask the peer in plain text?"
470
503
  so the SDK doesn't have to re-derive an A2A `message` envelope from a
471
504
  tool-specific schema.
@@ -481,23 +514,24 @@ parsing the tool name back into pieces.
481
514
  "type": "local_tool_call",
482
515
  "data": {
483
516
  "toolUseId": "tu_z",
484
- "name": "fs_read_file", // identical to what the SDK declared
517
+ "name": "fs_read_file", // identical to what the SDK declared
485
518
  "args": { "path": "/etc/hosts" },
486
519
  "kind": "mcp_local",
487
- "mcpServer": "fs", // ref's `name` — SDK's MCP-client key
488
- "mcpToolName": "fs_read_file", // duplicates `name` for the SDK's convenience
489
- "mcpServerInfo": { // present iff the spec carried `serverInfo`
520
+ "mcpServer": "fs", // ref's `name` — SDK's MCP-client key
521
+ "mcpToolName": "fs_read_file", // duplicates `name` for the SDK's convenience
522
+ "mcpServerInfo": {
523
+ // present iff the spec carried `serverInfo`
490
524
  "name": "mcp-server-filesystem",
491
- "version": "0.4.1"
492
- }
493
- }
525
+ "version": "0.4.1",
526
+ },
527
+ },
494
528
  }
495
529
  ```
496
530
 
497
531
  The SDK's typical dispatch path is:
498
532
 
499
533
  ```ts
500
- const client = mcpClients.get(call.mcpServer); // by SDK label
534
+ const client = mcpClients.get(call.mcpServer); // by SDK label
501
535
  if (!client) throw new Error(`unknown MCP server ${call.mcpServer}`);
502
536
  const result = await client.callTool({
503
537
  name: call.mcpToolName,
@@ -509,7 +543,10 @@ const text = result.content
509
543
  .join("\n");
510
544
  await fetch(`${baseUrl}/agent-runs/${runId}/tool-results`, {
511
545
  method: "POST",
512
- headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
546
+ headers: {
547
+ "Content-Type": "application/json",
548
+ Authorization: `Bearer ${apiKey}`,
549
+ },
513
550
  body: JSON.stringify({ toolUseId: call.toolUseId, result: text }),
514
551
  });
515
552
  ```
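The three `local_tool_call` payload shapes shown above (plain local, `a2a_local`, `mcp_local`) can be told apart by `kind` alone. The sketch below models them as a discriminated union and returns a routing label rather than calling any client. The interface names and `routeFor` are hypothetical; the field names are taken from the examples.

```ts
interface BaseCall {
  toolUseId: string;
  name: string;
}
interface PlainLocalCall extends BaseCall {
  kind?: "local"; // omitted on legacy payloads
  args: Record<string, unknown>;
}
interface A2aLocalCall extends BaseCall {
  kind: "a2a_local";
  args: { message: string };
  agentCard: { name: string; url: string };
}
interface McpLocalCall extends BaseCall {
  kind: "mcp_local";
  args: Record<string, unknown>;
  mcpServer: string;
  mcpToolName: string;
}
type LocalToolCall = PlainLocalCall | A2aLocalCall | McpLocalCall;

// Hypothetical helper: decide where a call should go without invoking anything.
function routeFor(call: LocalToolCall): string {
  switch (call.kind) {
    case "a2a_local":
      return `a2a:${call.agentCard.url}`;
    case "mcp_local":
      return `mcp:${call.mcpServer}/${call.mcpToolName}`;
    default:
      // kind === "local" or kind omitted (legacy): dispatch by name.
      return `local:${call.name}`;
  }
}
```

Treating a missing `kind` as the plain local case keeps legacy payloads working without a separate branch.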
@@ -523,20 +560,27 @@ await fetch(`${baseUrl}/agent-runs/${runId}/tool-results`, {
523
560
  "data": {
524
561
  "text": "Here's what I found...",
525
562
  "turn": 0,
526
- "finishReason": "tool_use", // optional; canonical lowercase token
527
- "toolCalls": [ // optional; absent when the turn was text-only
528
- { "id": "call_abc", "name": "search", "input": { /* JSON Schema-matching args */ } }
529
- ]
530
- }
563
+ "finishReason": "tool_use", // optional; canonical lowercase token
564
+ "toolCalls": [
565
+ // optional; absent when the turn was text-only
566
+ {
567
+ "id": "call_abc",
568
+ "name": "search",
569
+ "input": {
570
+ /* JSON Schema-matching args */
571
+ },
572
+ },
573
+ ],
574
+ },
531
575
  }
532
576
  ```
533
577
 
534
- | Field | Type | Required | Notes |
535
- | ---------------- | -------- | -------- | ----- |
536
- | `text` | string | yes | Full assistant text for this turn (concatenation of every preceding `assistant_delta` for this turn, plus any non-streaming snapshot the engine appended at close). May be empty when the turn was tool-only. |
537
- | `turn` | integer | yes | 0-based tool-turn index this assistant message closes. Useful for SDK clients pairing the message with the subsequent `tool_result` rows. |
538
- | `finishReason` | string\|null | no | Canonical lowercase stop reason normalized across providers (`"end_turn"`, `"tool_use"`, `"max_tokens"`, `"refusal"`, `"malformed_function_call"`, …). Pulled from the engine's per-turn `stopReason` after normalization — Gemini's `MAX_TOKENS` lands as `"max_tokens"`, OpenAI's `length` lands as `"max_tokens"`, etc. `null` / omitted when the provider did not report one. |
539
- | `toolCalls` | array | no | Tool calls the model emitted on this turn (id, sanitized pipeline-side name, JSON-matching `input`). Omitted when the model did not call any tools. |
578
+ | Field | Type | Required | Notes |
579
+ | -------------- | ------------ | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
580
+ | `text` | string | yes | Full assistant text for this turn (concatenation of every preceding `assistant_delta` for this turn, plus any non-streaming snapshot the engine appended at close). May be empty when the turn was tool-only. |
581
+ | `turn` | integer | yes | 0-based tool-turn index this assistant message closes. Useful for SDK clients pairing the message with the subsequent `tool_result` rows. |
582
+ | `finishReason` | string\|null | no | Canonical lowercase stop reason normalized across providers (`"end_turn"`, `"tool_use"`, `"max_tokens"`, `"refusal"`, `"malformed_function_call"`, …). Pulled from the engine's per-turn `stopReason` after normalization — Gemini's `MAX_TOKENS` lands as `"max_tokens"`, OpenAI's `length` lands as `"max_tokens"`, etc. `null` / omitted when the provider did not report one. |
583
+ | `toolCalls` | array | no | Tool calls the model emitted on this turn (id, sanitized pipeline-side name, JSON-matching `input`). Omitted when the model did not call any tools. |
540
584
 
541
585
  **Emission frequency.** Exactly **one** `assistant_message` per completed
542
586
  assistant turn — including the last turn before a terminal `error`. SDK
@@ -548,7 +592,7 @@ and avoid stitching a turn out of `assistant_delta` chunks themselves
548
592
  Gemini `MAX_TOKENS` while emitting `outputSchema` JSON), the last
549
593
  `assistant_message` preceding the `error` carries the partial text plus
550
594
  `finishReason: "max_tokens"`. The terminal `error` event then carries the
551
- *same* text on `data.partialText` so reconnect / replay sees both pieces
595
+ _same_ text on `data.partialText` so reconnect / replay sees both pieces
552
596
  without depending on event ordering.
553
597
 
554
598
  ### 4.5 `loop_detected`
@@ -563,11 +607,11 @@ without depending on event ordering.
563
607
  "data": { "consecutiveCount": 6, "hardCutoff": true, "tools": ["recall"] } }
564
608
  ```
565
609
 
566
- | Field | Type | Notes |
567
- | ------------------ | ------- | ----- |
568
- | `consecutiveCount` | integer | Length of the identical-batch streak that just tripped the threshold (`>= consecutiveThreshold`). |
610
+ | Field | Type | Notes |
611
+ | ------------------ | ------- | ---------------------------------------------------------------------------------------------------------------------------- |
612
+ | `consecutiveCount` | integer | Length of the identical-batch streak that just tripped the threshold (`>= consecutiveThreshold`). |
569
613
  | `hardCutoff` | boolean | `false` for the soft nudge round; `true` once the pipeline forces finalisation. The SDK may see one of each in a single run. |
570
- | `tools` | array | Names of the tool calls in the looping batch (no args — those are persisted on the matching `tool_result` events). |
614
+ | `tools` | array | Names of the tool calls in the looping batch (no args — those are persisted on the matching `tool_result` events). |
571
615
 
572
616
  Observability only: the synthetic skip + steering nudge are emitted on the
573
617
  normal `tool_result` and assistant-message channels by the time this event
@@ -580,14 +624,17 @@ See §8 for the wire-spec field that controls thresholds.
580
624
  ### 4.6 `tool_budget_exceeded`
581
625
 
582
626
  ```jsonc
583
- { "seq": 14, "type": "tool_budget_exceeded",
584
- "data": { "tool": "recall", "maxCalls": 4, "callIndex": 5 } }
627
+ {
628
+ "seq": 14,
629
+ "type": "tool_budget_exceeded",
630
+ "data": { "tool": "recall", "maxCalls": 4, "callIndex": 5 },
631
+ }
585
632
  ```
586
633
 
587
- | Field | Type | Notes |
588
- | ----------- | ------- | ----- |
589
- | `tool` | string | Logical tool name as the model saw it (matches the key in `spec.toolBudgets`). |
590
- | `maxCalls` | integer | Configured cap. |
634
+ | Field | Type | Notes |
635
+ | ----------- | ------- | ----------------------------------------------------------------------------------------------------------- |
636
+ | `tool` | string | Logical tool name as the model saw it (matches the key in `spec.toolBudgets`). |
637
+ | `maxCalls` | integer | Configured cap. |
591
638
  | `callIndex` | integer | 1-based count of attempts to call this tool over the run lifetime; always strictly greater than `maxCalls`. |
592
639
 
593
640
  Observability only: the synthetic "budget exceeded — pivot or finalize"
@@ -598,17 +645,66 @@ re-parsing tool-result bodies.
598
645
 
599
646
  See §8 for the wire-spec field that defines budgets.
600
647
 
601
- ### 4.7 Terminal events
648
+ ### 4.7 `supervisor`
649
+
650
+ ```jsonc
651
+ // on_track — the judge reviewed the run and decided not to intervene
652
+ { "seq": 15, "type": "supervisor",
653
+ "data": { "action": "on_track", "reason": "Agent is gathering context via search before answering.", "llmCalls": 5 } }
654
+
655
+ // redirect — a steering user message was injected; the agent keeps its tools
656
+ { "seq": 20, "type": "supervisor",
657
+ "data": { "action": "redirect", "reason": "Repeating the same search with identical args.", "redirect": "Stop re-querying; synthesize an answer from the results you already have.", "llmCalls": 10 } }
658
+
659
+ // finalize — the run was forced to wrap up on a tools-disabled turn
660
+ { "seq": 25, "type": "supervisor",
661
+ "data": { "action": "finalize", "reason": "Enough evidence to answer; further tool use is unlikely to help.", "llmCalls": 15 } }
662
+ ```
663
+
664
+ | Field | Type | Notes |
665
+ | ---------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
666
+ | `action` | string | One of `"on_track"`, `"redirect"`, `"finalize"`. |
667
+ | `reason` | string | One- or two-sentence explanation from the judge. |
668
+ | `redirect` | string | Present when `action === "redirect"`: the steering message injected into the conversation (same text the agent sees as a user message). Omitted for `on_track` / `finalize`. |
669
+ | `llmCalls` | integer | Number of LLM calls (`completeTurn` invocations) completed when this review fired. Matches the pipeline's `modelInvocations` counter at the check boundary. |
670
+
671
+ Observability for the run-supervisor guard (see §8.4). The event fires on
672
+ **every** check, not only when the judge intervenes — `on_track` reviews are
673
+ included so SDK clients can show "supervisor reviewed" activity without
674
+ inferring it from missing events.
675
+
676
+ When `action` is `redirect` or `finalize`, the pipeline has already applied
677
+ the verdict by the time this event arrives: a steering user message was
678
+ appended (`redirect`) or the next turn was forced tools-disabled
679
+ (`finalize`). SDK clients should render a status note and **not** try to
680
+ steer the run themselves.
681
+
682
+ Pass `"supervisor": false` in the spec (§8.4) to disable the platform judge
683
+ for a run. Omission keeps the runtime default (supervisor **enabled** on
684
+ ephemeral runs).
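Since the verdict is already applied when the event arrives, a client only needs to display it. The sketch below turns a `supervisor` payload into a status note. The `SupervisorData` type and `supervisorStatusNote` are hypothetical names, and the wording of the note is illustrative.

```ts
// Payload shapes follow the field table above; `redirect` exists only on the redirect action.
type SupervisorData =
  | { action: "on_track"; reason: string; llmCalls: number }
  | { action: "redirect"; reason: string; redirect: string; llmCalls: number }
  | { action: "finalize"; reason: string; llmCalls: number };

// Render-only helper: it never tries to steer the run itself.
function supervisorStatusNote(d: SupervisorData): string {
  const prefix = `Supervisor review after ${d.llmCalls} LLM calls`;
  switch (d.action) {
    case "on_track":
      return `${prefix}: on track. ${d.reason}`;
    case "redirect":
      return `${prefix}: redirected. ${d.reason} Steering message: ${d.redirect}`;
    case "finalize":
      return `${prefix}: finalizing. ${d.reason}`;
  }
}
```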
685
+
686
+ ### 4.8 Terminal events
602
687
 
603
688
  ```jsonc
604
- { "seq": 14, "type": "result", "data": { "ok": true, "text": "..." } }
689
+ // Every terminal `result` and `error` event also carries `tokens`, `turns`,
690
+ // and `model` for cost attribution and dashboards; see §4.8.1.
691
+ { "seq": 14, "type": "result", "data": {
692
+ "ok": true,
693
+ "text": "...",
694
+ "tokens": { "inputTokens": 1283, "cachedTokens": 512, "reasoningTokens": 96, "outputTokens": 240 },
695
+ "turns": 3,
696
+ "model": { "id": "platform:demo", "provider": "openai", "vendorModelId": "gpt-5.4-mini", "reasoningEffort": "low" }
697
+ } }
605
698
  { "seq": 14, "type": "error", "data": {
606
699
  "error": "Model output was truncated (stop_reason=max_tokens). …",
607
700
  "code": "truncation", // mirrors `errorClass`; legacy alias
608
701
  "errorClass": "truncation", // canonical category (see below)
609
702
  "finishReason": "max_tokens", // canonical lowercase stop reason
610
703
  "partialText": "{\n \"answer\":… (truncated JSON) …",
611
- "retryable": false // optional; per-class retry hint
704
+ "retryable": false, // optional; per-class retry hint
705
+ "tokens": { "inputTokens": 8190, "cachedTokens": 0, "reasoningTokens": 0, "outputTokens": 1024 },
706
+ "turns": 1,
707
+ "model": { "id": "provider:cmf…", "provider": "google", "vendorModelId": "gemini-2.5-pro" }
612
708
  } }
613
709
  { "seq": 14, "type": "cancelled", "data": { "reason": "user" } }
614
710
  ```
@@ -620,25 +716,116 @@ SSE stream.
620
716
  with structured triage attributes when the failure carried a salvage path
621
717
  (typically truncation, upstream deadline, or max-budget-with-text):
622
718
 
623
- | Field | Type | Required | Notes |
624
- | -------------- | -------- | -------- | ----- |
625
- | `error` | string | yes | Human-readable message (also persisted on `EphemeralAgentRun.error`). |
626
- | `code` | string | yes | Legacy alias for `errorClass`. Equals `errorClass` when present; otherwise a small lowercase token (`"error"`, `"invalid_spec"`, `"worker_error"`, …) the SDK can switch on. |
627
- | `errorClass` | string | no | Canonical category. One of `"rate_limit"`, `"overloaded"`, `"server"`, `"context_window"` (input too big), `"truncation"` (output budget exhausted), `"invalid_request"`, `"auth"`, `"timeout"`, `"local_timeout"`, `"upstream_deadline"`, `"unknown"`. New categories may land additively. |
628
- | `finishReason` | string\|null | no | Canonical lowercase stop reason normalized across providers (`"max_tokens"`, `"refusal"`, `"malformed_function_call"`, …). When present, mirrors the value on the last `assistant_message`. |
629
- | `partialText` | string | no | **Best-effort raw bytes** the model emitted before the failure. For `outputSchema` runs this is likely **incomplete JSON** that will fail `JSON.parse` — see §7 below. Also persisted on `EphemeralAgentRun.finalText` so the Calls UI can render it alongside a truncation banner. |
630
- | `retryable` | boolean | no | Coarse retry hint inherited from the pipeline's error classifier. Informational; the SDK still owns the actual retry decision. |
719
+ | Field | Type | Required | Notes |
720
+ | -------------- | ------------ | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
721
+ | `error` | string | yes | Human-readable message (also persisted on `EphemeralAgentRun.error`). |
722
+ | `code` | string | yes | Legacy alias for `errorClass`. Equals `errorClass` when present; otherwise a small lowercase token (`"error"`, `"invalid_spec"`, `"worker_error"`, …) the SDK can switch on. |
723
+ | `errorClass` | string | no | Canonical category. One of `"rate_limit"`, `"overloaded"`, `"server"`, `"context_window"` (input too big), `"truncation"` (output budget exhausted), `"invalid_request"`, `"auth"`, `"timeout"`, `"local_timeout"`, `"upstream_deadline"`, `"unknown"`. New categories may land additively. |
724
+ | `finishReason` | string\|null | no | Canonical lowercase stop reason normalized across providers (`"max_tokens"`, `"refusal"`, `"malformed_function_call"`, …). When present, mirrors the value on the last `assistant_message`. |
725
+ | `partialText` | string | no | **Best-effort raw bytes** the model emitted before the failure. For `outputSchema` runs this is likely **incomplete JSON** that will fail `JSON.parse` — see §7 below. Also persisted on `EphemeralAgentRun.finalText` so the Calls UI can render it alongside a truncation banner. |
726
+ | `retryable` | boolean | no | Coarse retry hint inherited from the pipeline's error classifier. Informational; the SDK still owns the actual retry decision. |
631
727
 
632
728
  When `errorClass` is `"truncation"`, the `EphemeralAgentRun` row that the
633
729
  SDK can re-fetch via `GET /agent-runs/:runId` will have:
634
730
 
635
- | Field | Value |
636
- | --------------- | ----- |
637
- | `status` | `"failed"` |
638
- | `finalText` | Same string as `data.partialText` (so SDKs can ignore the SSE stream and still recover the salvage). |
639
- | `error` | Same string as `data.error`. |
731
+ | Field | Value |
732
+ | --------------- | ------------------------------------------------------------------------------------------------------------------------ |
733
+ | `status` | `"failed"` |
734
+ | `finalText` | Same string as `data.partialText` (so SDKs can ignore the SSE stream and still recover the salvage). |
735
+ | `error` | Same string as `data.error`. |
640
736
  | `failureReason` | `{ "errorClass": "truncation", "finishReason": "max_tokens" }` (JSON object, future-proof for additional triage fields). |
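Because `partialText` on an `outputSchema` run is likely to be incomplete JSON, a client should not assume it parses. The sketch below shows one way to handle that: attempt `JSON.parse` and fall back to the raw string. The `ErrorData` interface mirrors the field table above; `Salvage` and `salvagePartial` are hypothetical names.

```ts
interface ErrorData {
  error: string;
  code: string;
  errorClass?: string;
  finishReason?: string | null;
  partialText?: string;
  retryable?: boolean;
}

type Salvage =
  | { kind: "json"; value: unknown }
  | { kind: "raw"; text: string }
  | { kind: "none" };

// Try to recover something useful from a terminal `error` payload.
function salvagePartial(data: ErrorData): Salvage {
  if (data.partialText === undefined || data.partialText === "") {
    return { kind: "none" };
  }
  try {
    return { kind: "json", value: JSON.parse(data.partialText) };
  } catch {
    // Truncated output usually lands here; keep the raw bytes for display.
    return { kind: "raw", text: data.partialText };
  }
}
```

A caller might show the `raw` variant next to a truncation notice and leave the retry decision to its own policy, since `retryable` is only a hint.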
641
737
 
738
+ ### 4.8.1 Cost-attribution fields (`tokens`, `turns`, `model`)
739
+
740
+ Every terminal `result` and `error` event carries three additional
741
+ fields so callers can drive cost dashboards, per-turn budgets, and
742
+ provider/model spend reports without a follow-up `GET /agent-runs/:runId`
743
+ round trip. The same fields are persisted on the `EphemeralAgentRun`
744
+ row (columns `tokens` / `turns` / `model`) and surfaced by that
745
+ endpoint.
746
+
747
+ | Field | Type | Notes |
748
+ | -------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
749
+ | `tokens` | object | Per-run token totals aggregated across every model invocation. Schema below. |
750
+ | `turns` | int | Total `engine.completeTurn(...)` invocations for the run, **including** the failing call when a run errors out mid-loop. A single-shot run reports `1`; a tool loop is `>= 2`. Tracked by the pipeline as `modelInvocations` in `PipelineLoopState` and emitted on the terminal `PipelineEvent` (see `packages/agent-pipeline/src/types.ts`). Distinct from "tool turns" — `turns` counts **model invocations**, regardless of whether the model called any tools. |
751
+ | `model` | object | Resolved model that actually executed the run. Schema below. |
752
+
753
+ Always present on terminal `result` and `error` events for runs created against
754
+ **MANTYX ≥ 2026-09** servers. Older servers omit these fields entirely;
755
+ SDK clients (TS/Go/Python) detect "no usage data" by checking that
756
+ `model.provider` is empty / falsy. JSON keys follow MANTYX's standard
757
+ camelCase wire convention.
758
+
759
+ **`tokens` schema** — mirrors the wire shape produced by
760
+ `tokenUsageToWireTokens` in `packages/ts-sdk/src/usage-wire.ts`, which
761
+ is the single source of truth across the TS SDK return value, REST/SSE,
762
+ and A2A surfaces:
763
+
764
+ | Field | Type | Notes |
765
+ | ----------------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
766
+ | `inputTokens` | int | **Total billable input** — fresh prompt tokens **plus** the cached-read slice the provider still bills (at a discount) **plus** any cache-creation tokens **plus** tool-prompt tokens. Equal to the sum of every provider-reported input bucket for the run. |
767
+ | `cachedTokens` | int | The discounted slice of `inputTokens` that came from a prompt cache hit (Anthropic prompt caching, OpenAI cached prompt, Gemini implicit cache). `0` when the provider doesn't report cache reads or the run didn't hit cache. |
768
+ | `reasoningTokens` | int | Non-visible thinking tokens. **Already counted inside `outputTokens`** — surfaced separately so dashboards can break out "thinking cost" vs visible output. `0` when the model didn't reason or didn't report it. |
769
+ | `outputTokens` | int | **All** tokens the model emitted for this run, visible + reasoning. Matches the provider's "completion tokens" / "output tokens" billing line. |
770
+
771
+ `inputTokens` and `outputTokens` together cover every billable token the
772
+ run consumed; `cachedTokens` and `reasoningTokens` are diagnostic
773
+ breakdowns _inside_ those two totals (not separate buckets to be added).
774
+ All four are clamped to non-negative integers — a misbehaving engine
775
+ emitting `NaN` or negatives cannot poison the JSON snapshot or Prisma
776
+ write.
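The containment relationship can be made concrete with a little arithmetic. In the sketch below, `WireTokens` mirrors the schema above, while `tokenBreakdown` and `clampCount` are hypothetical helpers. The clamp is one possible reading of "non-negative integers" (flooring is an assumption) and is not the behaviour of `tokenUsageToWireTokens`.

```ts
interface WireTokens {
  inputTokens: number;
  cachedTokens: number;
  reasoningTokens: number;
  outputTokens: number;
}

// Derived views for a dashboard. The two breakdown fields are subtracted
// from their totals, never added on top of them.
function tokenBreakdown(t: WireTokens) {
  return {
    nonCachedInput: Math.max(0, t.inputTokens - t.cachedTokens),
    visibleOutput: Math.max(0, t.outputTokens - t.reasoningTokens),
    totalBillable: t.inputTokens + t.outputTokens,
  };
}

// Assumed clamp: anything that is not a finite positive number becomes 0.
function clampCount(n: unknown): number {
  return typeof n === "number" && Number.isFinite(n) && n > 0 ? Math.floor(n) : 0;
}
```

For the `result` example in §4.8 (`inputTokens: 1283`, `cachedTokens: 512`, `reasoningTokens: 96`, `outputTokens: 240`), this gives 771 non-cached input tokens, 144 visible output tokens, and 1523 billable tokens in total.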
777
+
778
+ **`model` schema** — fields the platform stamps onto every successful
779
+ or failed run via `services/agent-runs/resolve-model.ts`:
780
+
781
+ | Field | Type | Notes |
782
+ | ----------------- | ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
783
+ | `id` | string | Catalog id — the same string a caller would pass back as `modelId` (in §2.1) to re-select this exact entry (e.g. `"platform:demo"`, `"provider:cmf…"`). Empty string against legacy fallbacks that didn't synthesise a catalog id. |
784
+ | `provider` | string | Lowercase provider id: `"openai"`, `"anthropic"`, `"google"`, `"azure-openai"`. |
785
+ | `vendorModelId` | string | The model id the platform actually sent to the provider (e.g. `"gpt-5.4-mini"`, `"claude-opus-4-7"`, `"gemini-2.5-pro"`). Carried through from the `model` field on `AgentSpec` after resolution. |
786
+ | `reasoningEffort` | string | Optional. `"off"`, `"low"`, `"medium"`, `"high"`. Computed via `resolveReasoningEffortForOptions` (`packages/ts-sdk/src/usage-wire.ts`) from the unified 0–100 `reasoningLevel` knob: 0 → `"off"`, 1–35 → `"low"`, 36–65 → `"medium"`, 66–100 → `"high"`. Omitted when the provider doesn't expose a reasoning-level knob or the run didn't request one. |
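The bucket boundaries in the `reasoningEffort` row can be written down directly. The function below is a hypothetical re-statement of those thresholds for illustration only. It is not the source of `resolveReasoningEffortForOptions`, and treating values below 0 as `"off"` is an assumption.

```ts
type ReasoningEffort = "off" | "low" | "medium" | "high";

// Map the unified 0-100 `reasoningLevel` onto the documented buckets:
// 0 -> off, 1-35 -> low, 36-65 -> medium, 66-100 -> high.
function effortBucket(reasoningLevel: number): ReasoningEffort {
  if (reasoningLevel <= 0) return "off";
  if (reasoningLevel <= 35) return "low";
  if (reasoningLevel <= 65) return "medium";
  return "high";
}
```

The string inputs described in §6 snap to `0`, `30`, `50`, and `80`, so each one lands back in the bucket of the same name.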
787
+
788
+ **Per-provider token mapping.** Provider responses vary in how they
789
+ report token usage. MANTYX normalises them into the wire shape above as
790
+ follows (see `packages/agent-pipeline/src/engines/*` for the engine-
791
+ side aggregation that feeds `tokenUsageToWireTokens`):
792
+
793
+ | Provider | `inputTokens` ← | `cachedTokens` ← | `reasoningTokens` ← | `outputTokens` ← |
794
+ | --------- | ----------------------------------------------------------------------------------------------- | ------------------------------------------- | ----------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
795
+ | OpenAI | `usage.prompt_tokens` (already includes cached read tokens) | `usage.prompt_tokens_details.cached_tokens` | `usage.completion_tokens_details.reasoning_tokens` | `usage.completion_tokens` |
796
+ | Anthropic | `usage.input_tokens` + `usage.cache_read_input_tokens` + `usage.cache_creation_input_tokens` | `usage.cache_read_input_tokens` | (extended-thinking tokens; folded into `output_tokens` by the provider) | `usage.output_tokens` |
797
+ | Google | `usageMetadata.promptTokenCount` + `usageMetadata.cachedContentTokenCount` + tool-prompt tokens | `usageMetadata.cachedContentTokenCount` | `usageMetadata.thoughtsTokenCount` | `usageMetadata.candidatesTokenCount` (or `totalTokenCount - promptTokenCount` for older Gemini SDKs) |
798
+
799
+ If a provider doesn't report a given bucket the corresponding field is
800
+ `0`, never `null`.
801
+
802
+ **Tool-loop accounting.** When the run executes tool turns, every
803
+ `engine.completeTurn(...)` invocation contributes its usage to the
804
+ aggregated `tokens` object — so a run with one tool round (model →
805
+ tool → model) reports `turns: 2` and the **sum** of both model calls'
806
+ token usage. The counter is incremented in a `try/finally` around the
807
+ engine call inside `runMainPipelineLoop`
808
+ (`packages/agent-pipeline/src/pipeline.ts`), so the failing call still
809
+ counts toward `turns` even when the engine throws. The terminal event
810
+ carries cumulative totals only; per-turn observability lives on
811
+ `assistant_message` events.
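The `try/finally` pattern described above is what makes a failing call count toward `turns`. The sketch below isolates that pattern and is not the code in `runMainPipelineLoop`. `LoopState` and `countedTurn` are hypothetical, and the call is synchronous for brevity, whereas a real engine call would presumably be asynchronous.

```ts
interface LoopState {
  modelInvocations: number;
}

// Wrap one model call so the counter advances whether the call
// returns normally or throws.
function countedTurn<T>(state: LoopState, completeTurn: () => T): T {
  try {
    return completeTurn();
  } finally {
    state.modelInvocations += 1;
  }
}
```

With this shape, one successful call followed by one throwing call leaves `modelInvocations` at 2, matching the "including the failing call" wording for `turns`.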
812
+
813
+ **A2A exposure.** The MANTYX-hosted A2A endpoint
814
+ (`POST /api/a2a/{workspaceSlug}/agents/{agentSlug}`) returns the same
815
+ triple under `result.metadata.mantyx`. The block is omitted entirely
816
+ against legacy runners that haven't implemented the optional
817
+ `runWithUsage` method on `AgentRunner` (see
818
+ `packages/ts-sdk/src/a2a/adapter.ts`); cross-platform A2A clients
819
+ should treat its absence as "no usage data" rather than as zero usage.
820
+
821
+ **SDK return-value exposure.** The TS SDK exposes the same triple via
822
+ the opt-in `runAgentWithUsage` (returning a `RunAgentResult` with
823
+ `text`, `tokens`, `turns`, `model`). The legacy `runAgent` still
824
+ returns just `string` for backward compatibility — see
825
+ `packages/ts-sdk/src/run.ts`. Go and Python SDKs surface the fields
826
+ directly on the existing `RunResult` struct/dataclass (additive,
827
+ non-breaking since those return types were already objects).
828
+
642
829
  ---
643
830
 
644
831
  ## 5. SDK → MANTYX: tool-result POST
@@ -657,19 +844,19 @@ Authorization: Bearer <api-key>
657
844
  }
658
845
  ```
659
846
 
660
- | Field | Type | Required | Notes |
661
- | ------------ | ------- | -------- | ----- |
662
- | `toolUseId` | string | yes | Must match a pending `local_tool_call`'s id. |
663
- | `result` | string | one-of | Successful textual result (≤ 2 MB). For MCP tools, flatten content blocks to text. For A2A delegations, the peer's reply text. |
664
- | `error` | string | one-of | Human-readable failure message (≤ 8 KB). Surfaced to the model so it can recover. |
847
+ | Field | Type | Required | Notes |
848
+ | ----------- | ------ | -------- | ------------------------------------------------------------------------------------------------------------------------------ |
849
+ | `toolUseId` | string | yes | Must match a pending `local_tool_call`'s id. |
850
+ | `result` | string | one-of | Successful textual result (≤ 2 MB). For MCP tools, flatten content blocks to text. For A2A delegations, the peer's reply text. |
851
+ | `error` | string | one-of | Human-readable failure message (≤ 8 KB). Surfaced to the model so it can recover. |
665
852
 
666
853
  Server response codes:
667
854
 
668
- | Code | When |
669
- | ---- | ---- |
670
- | `204` | Accepted; the runner was woken and will resume the model loop. |
671
- | `400` | Body failed Zod validation (missing `toolUseId`, both/neither of `result`/`error`, etc.). |
672
- | `404` | `unknown_tool_use` — `toolUseId` doesn't match any pending call (already answered or unknown id). |
855
+ | Code | When |
856
+ | ----- | ------------------------------------------------------------------------------------------------------------------- |
857
+ | `204` | Accepted; the runner was woken and will resume the model loop. |
858
+ | `400` | Body failed Zod validation (missing `toolUseId`, both/neither of `result`/`error`, etc.). |
859
+ | `404` | `unknown_tool_use` — `toolUseId` doesn't match any pending call (already answered or unknown id). |
673
860
  | `409` | `run_terminal` — the run already finished (success, failure, cancel, or local-tool timeout). The result is dropped. |
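A client-side sketch of the two tables above: build a body that carries exactly one of `result` / `error`, and map the response status to an outcome. All names (`buildToolResultBody`, `classifyToolResultStatus`, `PostOutcome`) are hypothetical. The status meanings come from the response-code table, and any other status is reported as `"unexpected"`.

```ts
type ToolResultBody =
  | { toolUseId: string; result: string }
  | { toolUseId: string; error: string };

// Enforce the one-of rule locally so a malformed body never reaches the server.
function buildToolResultBody(
  toolUseId: string,
  outcome: { result?: string; error?: string },
): ToolResultBody {
  const hasResult = typeof outcome.result === "string";
  const hasError = typeof outcome.error === "string";
  if (hasResult === hasError) {
    throw new Error("exactly one of `result` or `error` is required");
  }
  return hasResult
    ? { toolUseId, result: outcome.result as string }
    : { toolUseId, error: outcome.error as string };
}

type PostOutcome = "accepted" | "invalid_body" | "unknown_tool_use" | "run_terminal" | "unexpected";

function classifyToolResultStatus(status: number): PostOutcome {
  switch (status) {
    case 204:
      return "accepted";
    case 400:
      return "invalid_body";
    case 404:
      return "unknown_tool_use";
    case 409:
      return "run_terminal"; // run already finished; the result was dropped
    default:
      return "unexpected";
  }
}
```

Since a `409` means the result was dropped, a client would typically stop posting for that run.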
674
861
 
675
862
  The runner enforces a per-call `localToolTimeoutMs` (default 5 minutes).
@@ -684,20 +871,20 @@ After timeout the model loop unblocks with a synthetic
684
871
  `spec.reasoningLevel` controls the LLM's extended-thinking effort. Two
685
872
  input shapes are accepted; both map to a numeric `0–100` internally.
686
873
 
687
- | Form | Values | Notes |
688
- | ----------- | ------------------------------------- | ----- |
689
- | **String** | `"off"`, `"low"`, `"medium"`, `"high"` | Snaps to `0`, `30`, `50`, `80` (matches the web composer). |
690
- | **Number** | integer `0`–`100` | Pass-through. `0` explicitly disables provider thinking. |
874
+ | Form | Values | Notes |
875
+ | ---------- | -------------------------------------- | ---------------------------------------------------------- |
876
+ | **String** | `"off"`, `"low"`, `"medium"`, `"high"` | Snaps to `0`, `30`, `50`, `80` (matches the web composer). |
877
+ | **Number** | integer `0`–`100` | Pass-through. `0` explicitly disables provider thinking. |
691
878
 
692
879
  Per provider:
693
880
 
694
- | Provider | Knob driven by `reasoningLevel` |
695
- | -------------------------- | ------------------------------- |
696
- | OpenAI Responses (o-series, GPT-5.x) | `reasoning.effort` |
697
- | Gemini ≥ 3 | `thinkingConfig.thinkingLevel` |
698
- | Gemini ≤ 2.5 | `thinkingConfig.thinkingBudget` (token budget; scaled) |
699
- | Anthropic / Bedrock-Anthropic | extended thinking budget (≈ 512 tokens at `low` → ≈ 8 000 at `high`) |
700
- | xAI Grok, others | ignored |
881
+ | Provider | Knob driven by `reasoningLevel` |
882
+ | ------------------------------------ | -------------------------------------------------------------------- |
883
+ | OpenAI Responses (o-series, GPT-5.x) | `reasoning.effort` |
884
+ | Gemini ≥ 3 | `thinkingConfig.thinkingLevel` |
885
+ | Gemini ≤ 2.5 | `thinkingConfig.thinkingBudget` (token budget; scaled) |
886
+ | Anthropic / Bedrock-Anthropic | extended thinking budget (≈ 512 tokens at `low` → ≈ 8 000 at `high`) |
887
+ | xAI Grok, others | ignored |
701
888
 
702
889
  When `reasoningLevel > 0` and the provider supports it, the SSE stream
703
890
  will include `thinking_delta` events alongside `assistant_delta`.
@@ -718,21 +905,21 @@ guaranteed-parseable JSON matching the supplied schema.
718
905
  }
719
906
  ```
720
907
 
721
- | Field | Type | Required | Notes |
722
- | -------- | ------ | -------- | ----- |
723
- | `name` | string | no | Stable identifier passed to providers (OpenAI `text.format.name`, Anthropic synthetic-tool name). Defaults to `"output"`. |
908
+ | Field | Type | Required | Notes |
909
+ | -------- | ------ | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
910
+ | `name` | string | no | Stable identifier passed to providers (OpenAI `text.format.name`, Anthropic synthetic-tool name). Defaults to `"output"`. |
724
911
  | `schema` | object | yes | JSON Schema for the assistant text. Root must be a JSON object — most providers reject array/scalar roots in structured-output mode. Passed through verbatim; MANTYX does not validate the schema's contents. |
725
912
 
726
913
  Per provider:
727
914
 
728
- | Provider | How the schema is enforced |
729
- | ------------------------------ | -------------------------- |
730
- | OpenAI Responses (o-series, GPT-5.x, …) | `text.format = { type: "json_schema", strict: true, name, schema }` on every `completeTurn` (compatible with tool calls). |
731
- | Gemini 3+ (any turn) | `responseMimeType: "application/json"` + `responseJsonSchema` on every `completeTurn`. Gemini 3 accepts the schema alongside `functionDeclarations`. |
732
- | Gemini ≤ 2.5 with no tools | Same as Gemini 3+: `responseMimeType: "application/json"` + `responseJsonSchema`. |
733
- | Gemini ≤ 2.5 **with tools** | Synthetic `set_model_response` function declaration is injected; its `parametersJsonSchema` is the supplied schema. The system instruction is augmented to direct the model to call this tool with the final answer. The engine intercepts the call, hides it from the SDK, and surfaces the call's arguments as the assistant text (JSON-stringified). Sidesteps the API rejection ("Function calling with a response mime type: 'application/json' is unsupported") without round-tripping a 4xx. |
734
- | Anthropic / Bedrock-Anthropic | Synthetic `final_report` tool whose `input_schema` is the supplied schema; `tool_choice` is forced on the no-tools finishing turn. The tool's input is surfaced as the assistant text. |
735
- | xAI Grok, others | Ignored — the model returns plain text. |
915
+ | Provider | How the schema is enforced |
916
+ | --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
917
+ | OpenAI Responses (o-series, GPT-5.x, …) | `text.format = { type: "json_schema", strict: true, name, schema }` on every `completeTurn` (compatible with tool calls). |
918
+ | Gemini 3+ (any turn) | `responseMimeType: "application/json"` + `responseJsonSchema` on every `completeTurn`. Gemini 3 accepts the schema alongside `functionDeclarations`. |
919
+ | Gemini ≤ 2.5 with no tools | Same as Gemini 3+: `responseMimeType: "application/json"` + `responseJsonSchema`. |
920
+ | Gemini ≤ 2.5 **with tools** | Synthetic `set_model_response` function declaration is injected; its `parametersJsonSchema` is the supplied schema. The system instruction is augmented to direct the model to call this tool with the final answer. The engine intercepts the call, hides it from the SDK, and surfaces the call's arguments as the assistant text (JSON-stringified). Sidesteps the API rejection ("Function calling with a response mime type: 'application/json' is unsupported") without round-tripping a 4xx. |
921
+ | Anthropic / Bedrock-Anthropic | Synthetic `final_report` tool whose `input_schema` is the supplied schema; `tool_choice` is forced on the no-tools finishing turn. The tool's input is surfaced as the assistant text. |
922
+ | xAI Grok, others | Ignored — the model returns plain text. |
736
923
 
737
924
  The synthetic-tool paths (Gemini 2.5 + tools, Anthropic) are entirely
738
925
  internal: the SDK still receives `data.text: string` on the terminal
@@ -741,11 +928,11 @@ or `final_report`. They never appear in the tools array the SDK declared.
741
928
 
742
929
  Validation (server-side, `400 invalid_request` on violation):
743
930
 
744
- | Constraint | Limit |
745
- | ----------------------------------------- | ----- |
746
- | Serialized JSON size of `outputSchema` | ≤ 32 KB |
747
- | `name` regex | `/^[a-zA-Z0-9_-]{1,64}$/` |
748
- | `schema` shape | non-`null`, non-array JSON object |
931
+ | Constraint | Limit |
932
+ | -------------------------------------- | --------------------------------- |
933
+ | Serialized JSON size of `outputSchema` | ≤ 32 KB |
934
+ | `name` regex | `/^[a-zA-Z0-9_-]{1,64}$/` |
935
+ | `schema` shape | non-`null`, non-array JSON object |
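An SDK may want to catch these violations before the request leaves the process. The following is a minimal sketch of such a pre-check, not the server's implementation: the function name `validateOutputSchema`, the error strings, and the reading of "32 KB" as 32 × 1024 bytes are all assumptions made for illustration.

```ts
// Hypothetical client-side pre-check mirroring the documented limits.
// Names and messages are illustrative, not part of any published SDK.
interface OutputSchemaSpec {
  name?: string;
  schema: unknown;
}

const MAX_OUTPUT_SCHEMA_BYTES = 32 * 1024; // assumes "32 KB" means 32 * 1024 bytes
const NAME_PATTERN = /^[a-zA-Z0-9_-]{1,64}$/;

function validateOutputSchema(spec: OutputSchemaSpec): string[] {
  const problems: string[] = [];

  // `name` is optional; when omitted the server defaults it to "output".
  if (spec.name !== undefined && !NAME_PATTERN.test(spec.name)) {
    problems.push("name must match /^[a-zA-Z0-9_-]{1,64}$/");
  }

  // `schema` must be a non-null, non-array JSON object.
  const s = spec.schema;
  if (typeof s !== "object" || s === null || Array.isArray(s)) {
    problems.push("schema must be a non-null, non-array JSON object");
  }

  // The size limit applies to the serialized JSON of the whole outputSchema value.
  const bytes = new TextEncoder().encode(JSON.stringify(spec)).length;
  if (bytes > MAX_OUTPUT_SCHEMA_BYTES) {
    problems.push(`serialized size ${bytes} B exceeds 32 KB`);
  }
  return problems;
}
```

An empty array means the spec passed the local checks; the server remains the source of truth and can still answer `400 invalid_request`.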
749
936
 
750
937
  **SDK guidance.** Even though the server enforces JSON shape via the
751
938
  provider, transient model errors (refusal text, truncation under
@@ -768,8 +955,8 @@ bytes that already streamed. Instead:
768
955
  bytes (§4.7).
769
956
  3. The run row exposes the salvage on
770
957
  `GET /agent-runs/:runId` as `{ status: "failed", finalText: "<partial JSON>",
771
- error: "Model output was truncated …", failureReason: { errorClass:
772
- "truncation", finishReason: "max_tokens" } }`.
958
+ error: "Model output was truncated …", failureReason: { errorClass:
959
+ "truncation", finishReason: "max_tokens" } }`.
773
960
 
774
961
  `partialText` is a **best-effort raw byte sequence** — for `outputSchema`
775
962
  runs it will almost always fail `JSON.parse` because the JSON object was
@@ -781,7 +968,7 @@ falling back to it as the answer is not.
781
968
  `outputSchema` works for both ephemeral runs (`systemPrompt`-defined) and
782
969
  `agentId`-backed runs — the runner applies the schema to whichever
783
970
  `AgentSpec` it built. `outputSchema` is independent of `reasoningLevel`:
784
- the model can think extensively *and* emit JSON.
971
+ the model can think extensively _and_ emit JSON.
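The SDK guidance earlier in this section (re-validate the terminal `data.text`, and do not fall back to `partialText` as the answer) could be sketched as a small parse-then-validate helper. This is an illustrative sketch only: `parseStructuredText` and the `isValid` type guard are hypothetical names, and a real SDK would more likely plug in Zod or a JSON Schema validator.

```ts
// Result of attempting to read structured output from the assistant text.
type Parsed<T> = { ok: true; value: T } | { ok: false; reason: string };

// Hypothetical helper: parse the terminal text, then check it against a
// caller-supplied type guard standing in for the source-of-truth schema.
function parseStructuredText<T>(
  text: string,
  isValid: (v: unknown) => v is T,
): Parsed<T> {
  let raw: unknown;
  try {
    raw = JSON.parse(text);
  } catch {
    // Truncated output and refusal text both land here.
    return { ok: false, reason: "not valid JSON" };
  }
  if (!isValid(raw)) {
    return { ok: false, reason: "JSON does not match the expected shape" };
  }
  return { ok: true, value: raw };
}
```

Returning a tagged result rather than throwing lets the caller decide whether to retry, surface the failure, or show the salvaged bytes for debugging.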
785
972
 
786
973
  ---
787
974
 
@@ -798,10 +985,10 @@ The pipeline tracks an order-invariant canonical signature for every
798
985
  assistant turn that emits one or more tool calls. When the same signature
799
986
  repeats consecutively the guard intervenes:
800
987
 
801
- | Trigger | Server action |
802
- | -------------------------------------------------- | ------------- |
988
+ | Trigger | Server action |
989
+ | ------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
803
990
  | `consecutiveThreshold` identical batches in a row | Skip the duplicate batch with a synthetic "you've made this exact call before" tool result, then prepend a user-style **steering nudge** ("either deliver a final answer or change strategy") before the next model turn. |
804
- | `hardCutoffThreshold` identical batches in a row | Force a tools-disabled finalise turn (same path as `budgets.maxToolTurnsExceeded: "finalize"`) so the run lands cleanly. |
991
+ | `hardCutoffThreshold` identical batches in a row | Force a tools-disabled finalise turn (same path as `budgets.maxToolTurnsExceeded: "finalize"`) so the run lands cleanly. |
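To make the trigger table concrete, here is one way an order-invariant signature and a consecutive-repeat counter could be built. This is a sketch under stated assumptions, not the server's algorithm: the signature format, `stableStringify`, `batchSignature`, and `createLoopGuard` are hypothetical, and the choice to keep returning `"nudge"` for every repeat between the two thresholds is a guess.

```ts
interface ToolCall {
  name: string;
  args: unknown;
}

// Serialize with object keys sorted so {a, b} and {b, a} compare equal.
function stableStringify(v: unknown): string {
  if (Array.isArray(v)) return `[${v.map(stableStringify).join(",")}]`;
  if (v !== null && typeof v === "object") {
    const o = v as Record<string, unknown>;
    const body = Object.keys(o)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(o[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(v) ?? "null";
}

// Sorting the per-call strings makes the batch signature order-invariant.
function batchSignature(calls: ToolCall[]): string {
  return calls
    .map((c) => `${c.name}:${stableStringify(c.args)}`)
    .sort()
    .join("|");
}

type GuardAction = "none" | "nudge" | "finalize";

// Tracks how many identical batches have arrived in a row.
function createLoopGuard(consecutiveThreshold = 3, hardCutoffThreshold = 6) {
  let last = "";
  let streak = 0;
  return (calls: ToolCall[]): GuardAction => {
    const sig = batchSignature(calls);
    streak = sig === last ? streak + 1 : 1;
    last = sig;
    if (streak >= hardCutoffThreshold) return "finalize";
    if (streak >= consecutiveThreshold) return "nudge";
    return "none";
  };
}
```

With the default thresholds, the third identical batch yields `"nudge"` and the sixth yields `"finalize"`; any different batch resets the streak.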
805
992
 
806
993
  ```jsonc
807
994
  "loopDetection": {
@@ -813,11 +1000,11 @@ repeats consecutively the guard intervenes:
813
1000
  "loopDetection": false // explicitly disable for this run
814
1001
  ```
815
1002
 
816
- | Field | Type | Notes |
817
- | ---------------------- | --------------- | ----- |
818
- | `consecutiveThreshold` | integer ≥ 2 | Default `3`. Single batch = single tool call, not a loop, so the floor is `2`. |
1003
+ | Field | Type | Notes |
1004
+ | ---------------------- | --------------- | --------------------------------------------------------------------------------------------------------------------- |
1005
+ | `consecutiveThreshold` | integer ≥ 2 | Default `3`. Single batch = single tool call, not a loop, so the floor is `2`. |
819
1006
  | `hardCutoffThreshold` | integer ≥ 3 | Default `6`. Must be **strictly greater** than `consecutiveThreshold` (otherwise the soft nudge never gets a chance). |
820
- | (top-level `false`) | literal `false` | Disables the guard. `budgets.maxToolTurns` still applies. |
1007
+ | (top-level `false`) | literal `false` | Disables the guard. `budgets.maxToolTurns` still applies. |
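The field constraints above, together with the server-side cap of `100` on both thresholds, can be checked locally before submitting a run. The sketch below is illustrative: `validateLoopDetection` is a hypothetical helper, and applying the defaults (`3` and `6`) before the cross-field comparison is an assumption about how the server treats partially specified objects.

```ts
type LoopDetection =
  | false
  | { consecutiveThreshold?: number; hardCutoffThreshold?: number };

// Hypothetical pre-check; returns a list of human-readable problems.
function validateLoopDetection(cfg: LoopDetection): string[] {
  if (cfg === false) return []; // guard disabled for this run

  // Assumption: omitted fields fall back to the documented defaults.
  const soft = cfg.consecutiveThreshold ?? 3;
  const hard = cfg.hardCutoffThreshold ?? 6;
  const problems: string[] = [];

  if (!Number.isInteger(soft) || soft < 2 || soft > 100) {
    problems.push("consecutiveThreshold must be an integer in 2..100");
  }
  if (!Number.isInteger(hard) || hard < 3 || hard > 100) {
    problems.push("hardCutoffThreshold must be an integer in 3..100");
  }
  // The hard cutoff must leave room for at least one soft nudge.
  if (hard <= soft) {
    problems.push("hardCutoffThreshold must be strictly greater than consecutiveThreshold");
  }
  return problems;
}
```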
821
1008
 
822
1009
  Validation (server-side, `400 invalid_request` on violation): both
823
1010
  thresholds capped at `100`; `hardCutoffThreshold` must exceed
@@ -844,10 +1031,10 @@ loop and either changes strategy or finalises.
844
1031
  }
845
1032
  ```
846
1033
 
847
- | Field | Type | Notes |
848
- | ---------- | ----------- | ----- |
1034
+ | Field | Type | Notes |
1035
+ | ---------- | -------------------- | -------------------------------------------------------------------------------------------------------------- |
849
1036
  | `<key>` | string (1–120 chars) | Logical tool name as the model sees it (`ResolvedTool.name`). The SDK + pipeline handle internal sanitisation. |
850
- | `maxCalls` | integer ≥ 0 | Hard cap. `0` disables the tool entirely (the first attempt returns the synthetic body). |
1037
+ | `maxCalls` | integer ≥ 0 | Hard cap. `0` disables the tool entirely (the first attempt returns the synthetic body). |
851
1038
 
852
1039
  Budgets are **per-tool, not pooled** — `hive_search_deals: { maxCalls: 5 }`
853
1040
  and `hive_search_meetings: { maxCalls: 5 }` give the agent five of each,
@@ -855,23 +1042,23 @@ not five between them.
855
1042
 
856
1043
  Validation (server-side, `400 invalid_request` on violation):
857
1044
 
858
- | Constraint | Limit |
859
- | --------------------- | ----- |
860
- | Max entries | `32` |
861
- | `<key>` length | `1..120` |
1045
+ | Constraint | Limit |
1046
+ | ---------------------- | ---------------------------------------------------------------- |
1047
+ | Max entries | `32` |
1048
+ | `<key>` length | `1..120` |
862
1049
  | `maxCalls` upper bound | `1000` (functionally unlimited; `maxToolTurns: 100` fires first) |
863
1050
 
864
1051
  **Default budgets** (applied when the field is omitted; caller-provided
865
1052
  entries are layered on top so per-run overrides win):
866
1053
 
867
- | Tool | Default `maxCalls` |
868
- | ------------------------------------------------------------------------------------------------ | ------------------ |
869
- | `recall` (workspace memory hybrid search) | `4` |
870
- | `traverse` (memory graph BFS) | `3` |
871
- | `hive_consult_ontology` (per-hive ontology read; same name across all three hives) | `4` |
872
- | `hive_search_deals` / `_meetings` / `_companies` / `_people` (Sales Hive general search) | `5` |
873
- | `hive_search_tickets` / `_conversations` / `_accounts` (Customer Hive general search) | `5` |
874
- | `hive_search_releases` / `_issues` (Product Hive general search) | `5` |
1054
+ | Tool | Default `maxCalls` |
1055
+ | ---------------------------------------------------------------------------------------- | ------------------ |
1056
+ | `recall` (workspace memory hybrid search) | `4` |
1057
+ | `traverse` (memory graph BFS) | `3` |
1058
+ | `hive_consult_ontology` (per-hive ontology read; same name across all three hives) | `4` |
1059
+ | `hive_search_deals` / `_meetings` / `_companies` / `_people` (Sales Hive general search) | `5` |
1060
+ | `hive_search_tickets` / `_conversations` / `_accounts` (Customer Hive general search) | `5` |
1061
+ | `hive_search_releases` / `_issues` (Product Hive general search) | `5` |
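The layering rules (omitted field keeps the defaults, `{}` clears them, explicit entries override per tool) and the per-tool counting can be sketched as follows. This is a hedged illustration rather than the pipeline's code: `resolveBudgets` and `createBudgetTracker` are hypothetical names, and `DEFAULT_BUDGETS` lists only a subset of the table above.

```ts
type ToolBudgets = Record<string, { maxCalls: number }>;

// Subset of the documented defaults, for illustration only.
const DEFAULT_BUDGETS: ToolBudgets = {
  recall: { maxCalls: 4 },
  traverse: { maxCalls: 3 },
  hive_consult_ontology: { maxCalls: 4 },
  hive_search_deals: { maxCalls: 5 },
};

function resolveBudgets(requested?: ToolBudgets): ToolBudgets {
  if (requested === undefined) return { ...DEFAULT_BUDGETS }; // omitted: defaults apply
  if (Object.keys(requested).length === 0) return {}; // {}: clean slate
  return { ...DEFAULT_BUDGETS, ...requested }; // caller entries win
}

// Budgets are per tool, not pooled: each name gets its own counter.
function createBudgetTracker(budgets: ToolBudgets) {
  const used = new Map<string, number>();
  return (toolName: string): boolean => {
    const cap = budgets[toolName]?.maxCalls;
    if (cap === undefined) return true; // no budget declared for this tool
    const n = used.get(toolName) ?? 0;
    if (n >= cap) return false; // this call would be intercepted
    used.set(toolName, n + 1);
    return true;
  };
}
```

A `maxCalls` of `0` makes the very first attempt return `false`, matching the "disables the tool entirely" note in the field table.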
875
1062
 
876
1063
  Pass `"toolBudgets": {}` to start from a clean slate (no defaults applied
877
1064
  on top — useful for runs that intentionally want unbounded research). When
@@ -886,20 +1073,67 @@ banners without re-parsing tool-result bodies:
886
1073
  - `loop_detected` — fired on the soft nudge and again on the hard cutoff
887
1074
  if reached. See §4.5.
888
1075
  - `tool_budget_exceeded` — fired each time a call is intercepted. See §4.6.
1076
+ - `supervisor` — fired on every run-supervisor review (`on_track`,
1077
+ `redirect`, or `finalize`). See §4.7.
1078
+
1079
+ Both guard events (`loop_detected`, `tool_budget_exceeded`) are
1080
+ observability-only: the server has already substituted the synthetic
1081
+ tool-result / steering nudge by the time the SDK sees the event. The
1082
+ `supervisor` event is also observability-only when `action` is
1083
+ `redirect` / `finalize` — the pipeline already applied the verdict. The
1084
+ run continues to its terminal `result` / `error` / `cancelled` as usual.
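On the SDK side, "observability-only" means these events produce a note and nothing else, while the stream keeps being consumed until a terminal event arrives. The sketch below illustrates that split. It is not the reference SDK's handler: `handleEvent` and the `SseEvent` shape are hypothetical, and only `type` and the supervisor's `action` field are read, since the full payloads are defined in §4.5 to §4.7.

```ts
interface SseEvent {
  type: string;
  data: Record<string, unknown>;
}

const TERMINAL_TYPES = new Set(["result", "error", "cancelled"]);
const GUARD_TYPES = new Set(["loop_detected", "tool_budget_exceeded", "supervisor"]);

// Returns true once a terminal event has been seen.
// Guard events only append a note; they never end the loop.
function handleEvent(ev: SseEvent, notes: string[]): boolean {
  if (GUARD_TYPES.has(ev.type)) {
    const action = ev.data["action"];
    const suffix =
      ev.type === "supervisor" && typeof action === "string" ? `: ${action}` : "";
    notes.push(ev.type + suffix);
    return false;
  }
  return TERMINAL_TYPES.has(ev.type);
}
```

A caller would invoke this inside its `for await` loop over the SSE stream and `break` when it returns `true`.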
1085
+
1086
+ ### 8.4 `supervisor` (run judge)
1087
+
1088
+ An optional LLM **run supervisor** periodically reviews the agent's
1089
+ transcript (reasoning, tool calls, tool results, visible text) and may
1090
+ steer the run:
1091
+
1092
+ | Verdict | Server action |
1093
+ | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
1094
+ | `on_track` | No-op — the run continues unchanged. |
1095
+ | `redirect` | A steering **user message** is injected; tools stay available on the next turn. |
1096
+ | `finalize` | The next turn is forced **tools-disabled** so the run lands a clean final answer (optionally prefaced by the supervisor's message). |
1097
+
1098
+ Reviews fire every **`interval` LLM calls** (`completeTurn` invocations),
1099
+ measured at the bottom of tool-emitting rounds. Default interval is **5**
1100
+ when the field is omitted.
1101
+
1102
+ ```jsonc
1103
+ "supervisor": {
1104
+ "interval": 5 // optional — LLM calls between reviews; default 5
1105
+ }
1106
+
1107
+ // or:
1108
+ "supervisor": false // explicitly disable the platform judge for this run
1109
+ ```
1110
+
1111
+ | Field | Type | Notes |
1112
+ | ---------- | --------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
1113
+ | `interval` | integer ≥ 1 | Optional. Default **5** when omitted. Capped at **100** server-side. |
1114
+ | (literal `false`) | `false` | Disables the run supervisor for this run. Loop detection and tool budgets still apply. |
1115
+
1116
+ **Defaults.** When `supervisor` is **omitted**, MANTYX enables the platform
1117
+ LLM judge on ephemeral runs (web chat enables it separately via the chat
1118
+ runner). Pass `"supervisor": false` to opt out.
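The three wire shapes (omitted, object, literal `false`) reduce to a small decision, sketched below for ephemeral API runs. `planSupervisor` is a hypothetical helper, and rejecting an out-of-range `interval` with an error is an assumption: the table only says the value is capped at 100 server-side, which could equally mean clamping.

```ts
type SupervisorField = false | { interval?: number } | undefined;
type SupervisorPlan = { enabled: false } | { enabled: true; interval: number };

// Hypothetical normalizer for the wire field on ephemeral API runs.
function planSupervisor(field: SupervisorField): SupervisorPlan {
  if (field === false) return { enabled: false }; // explicit opt-out

  // Omitted field or omitted interval: platform judge with the default of 5.
  const interval = field?.interval ?? 5;
  if (!Number.isInteger(interval) || interval < 1 || interval > 100) {
    // Assumption: treat out-of-range values as a caller error.
    throw new RangeError("supervisor.interval must be an integer in 1..100");
  }
  return { enabled: true, interval };
}
```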
889
1119
 
890
- Both events are observability-only: the server has already substituted
891
- the synthetic tool-result / steering nudge by the time the SDK sees the
892
- event. The run continues to its terminal `result` / `error` / `cancelled`
893
- as usual.
1120
+ **SDK-only runs.** When a caller uses `@mantyx/ts-sdk` directly (not via
1121
+ `POST /agent-runs`), the supervisor is **off unless explicitly configured**:
1122
+ pass a `RunAgentSupervisor` object with a `review` callback to enable it, or
1123
+ pass `supervisor: false` (or omit the field) to keep it disabled. The wire
1124
+ field above controls the **platform-hosted** judge on ephemeral API runs only.
894
1125
 
895
- ### 8.4 Session inheritance
1126
+ Each review emits an SSE `supervisor` event (§4.7). Supervisor LLM usage is
1127
+ recorded under the `supervisor` usage surface for cost attribution.
896
1128
 
897
- Like `reasoningLevel` and `outputSchema`, both fields support
1129
+ ### 8.5 Session inheritance
1130
+
1131
+ Like `reasoningLevel` and `outputSchema`, the run-guard fields support
898
1132
  session-default + per-message override:
899
1133
 
900
- - `POST /agent-sessions { loopDetection, toolBudgets }` — sets the
1134
+ - `POST /agent-sessions { loopDetection, toolBudgets, supervisor }` — sets the
901
1135
  session-default applied to every subsequent message run.
902
- - `POST /agent-sessions/:id/messages { loopDetection, toolBudgets }` —
1136
+ - `POST /agent-sessions/:id/messages { loopDetection, toolBudgets, supervisor }` —
903
1137
  optional per-message override. Applies to that one run only and does
904
1138
  not mutate the session's stored value.
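The override rule can be sketched as a per-field merge. One detail worth noting is that `false` is a meaningful value for `loopDetection` and `supervisor`, so the check has to be "is the field present" rather than "is the field truthy". This is an illustrative sketch: `effectiveGuards` is a hypothetical name, and treating each field as replaced wholesale (rather than deep-merged) is an assumption.

```ts
interface RunGuards {
  loopDetection?: false | { consecutiveThreshold?: number; hardCutoffThreshold?: number };
  toolBudgets?: Record<string, { maxCalls: number }>;
  supervisor?: false | { interval?: number };
}

// Per-message values win when present; the session object is never mutated.
function effectiveGuards(session: RunGuards, message: RunGuards): RunGuards {
  const out: RunGuards = { ...session };
  // Compare against undefined: `false` is a valid override, not "missing".
  if (message.loopDetection !== undefined) out.loopDetection = message.loopDetection;
  if (message.toolBudgets !== undefined) out.toolBudgets = message.toolBudgets;
  if (message.supervisor !== undefined) out.supervisor = message.supervisor;
  return out;
}
```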
905
1139
 
@@ -942,23 +1176,27 @@ terminal event.
942
1176
  import { fetch } from "undici";
943
1177
 
944
1178
  // ── 1. Resolve the Agent Card locally ───────────────────────────────────
945
- const cardResp = await fetch("https://hr.intranet.acme/.well-known/agent-card.json", {
946
- headers: { Authorization: `Bearer ${INTRANET_TOKEN}` },
947
- });
948
- const agentCard = await cardResp.json(); // ← whole document, passed through
1179
+ const cardResp = await fetch(
1180
+ "https://hr.intranet.acme/.well-known/agent-card.json",
1181
+ {
1182
+ headers: { Authorization: `Bearer ${INTRANET_TOKEN}` },
1183
+ },
1184
+ );
1185
+ const agentCard = await cardResp.json(); // ← whole document, passed through
949
1186
 
950
1187
  // ── 2. Submit the spec ──────────────────────────────────────────────────
951
1188
  const create = await fetch(`${MANTYX}/api/v1/workspaces/${slug}/agent-runs`, {
952
1189
  method: "POST",
953
- headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
1190
+ headers: {
1191
+ "Content-Type": "application/json",
1192
+ Authorization: `Bearer ${apiKey}`,
1193
+ },
954
1194
  body: JSON.stringify({
955
1195
  modelId: "openai:gpt-5.5",
956
1196
  systemPrompt: "You can delegate HR questions to the Acme HR agent.",
957
1197
  prompt: "How many PTO days does Alice have left this year?",
958
1198
  reasoningLevel: "low",
959
- tools: [
960
- { kind: "a2a_local", name: "intranet_hr_agent", agentCard },
961
- ],
1199
+ tools: [{ kind: "a2a_local", name: "intranet_hr_agent", agentCard }],
962
1200
  }),
963
1201
  });
964
1202
  const { runId, streamUrl } = await create.json();
@@ -972,14 +1210,23 @@ for await (const ev of parseSSE(stream)) {
972
1210
  if (ev.type !== "local_tool_call") continue;
973
1211
  if (ev.data.kind !== "a2a_local") continue;
974
1212
 
975
- const peer = a2aClients.get(ev.data.agentCard.url); // ← dispatch by URL
1213
+ const peer = a2aClients.get(ev.data.agentCard.url); // ← dispatch by URL
976
1214
  const reply = await peer.send({ message: ev.data.args.message });
977
1215
 
978
- await fetch(`${MANTYX}/api/v1/workspaces/${slug}/agent-runs/${runId}/tool-results`, {
979
- method: "POST",
980
- headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
981
- body: JSON.stringify({ toolUseId: ev.data.toolUseId, result: reply.text }),
982
- });
1216
+ await fetch(
1217
+ `${MANTYX}/api/v1/workspaces/${slug}/agent-runs/${runId}/tool-results`,
1218
+ {
1219
+ method: "POST",
1220
+ headers: {
1221
+ "Content-Type": "application/json",
1222
+ Authorization: `Bearer ${apiKey}`,
1223
+ },
1224
+ body: JSON.stringify({
1225
+ toolUseId: ev.data.toolUseId,
1226
+ result: reply.text,
1227
+ }),
1228
+ },
1229
+ );
983
1230
  }
984
1231
  ```
985
1232
 
@@ -990,13 +1237,16 @@ for await (const ev of parseSSE(stream)) {
990
1237
  ```ts
991
1238
  // ── 1. Connect + resolve catalog locally ────────────────────────────────
992
1239
  const mcp = new McpClient(stdio("./mcp-server-filesystem"));
993
- const initImpl = await mcp.initialize(); // → { name, version, ... }
994
- const { tools } = await mcp.listTools(); // → MCP Tool[]
1240
+ const initImpl = await mcp.initialize(); // → { name, version, ... }
1241
+ const { tools } = await mcp.listTools(); // → MCP Tool[]
995
1242
 
996
1243
  // ── 2. Submit the spec ──────────────────────────────────────────────────
997
1244
  const create = await fetch(`${MANTYX}/api/v1/workspaces/${slug}/agent-runs`, {
998
1245
  method: "POST",
999
- headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
1246
+ headers: {
1247
+ "Content-Type": "application/json",
1248
+ Authorization: `Bearer ${apiKey}`,
1249
+ },
1000
1250
  body: JSON.stringify({
1001
1251
  modelId: "openai:gpt-5.5",
1002
1252
  prompt: "Tell me what's at /etc/hosts.",
@@ -1005,7 +1255,7 @@ const create = await fetch(`${MANTYX}/api/v1/workspaces/${slug}/agent-runs`, {
1005
1255
  kind: "mcp_local",
1006
1256
  name: "fs",
1007
1257
  serverInfo: initImpl,
1008
- tools, // ← verbatim from listTools()
1258
+ tools, // ← verbatim from listTools()
1009
1259
  },
1010
1260
  ],
1011
1261
  }),
@@ -1018,7 +1268,7 @@ for await (const ev of parseSSE(streamFromUrl(streamUrl, apiKey))) {
1018
1268
  if (ev.data.kind !== "mcp_local") continue;
1019
1269
 
1020
1270
  const result = await mcp.callTool({
1021
- name: ev.data.mcpToolName, // identical to ev.data.name
1271
+ name: ev.data.mcpToolName, // identical to ev.data.name
1022
1272
  arguments: ev.data.args,
1023
1273
  });
1024
1274
  const text = result.content
@@ -1026,11 +1276,17 @@ for await (const ev of parseSSE(streamFromUrl(streamUrl, apiKey))) {
1026
1276
  .map((b) => b.text)
1027
1277
  .join("\n");
1028
1278
 
1029
- await fetch(`${MANTYX}/api/v1/workspaces/${slug}/agent-runs/${runId}/tool-results`, {
1030
- method: "POST",
1031
- headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
1032
- body: JSON.stringify({ toolUseId: ev.data.toolUseId, result: text }),
1033
- });
1279
+ await fetch(
1280
+ `${MANTYX}/api/v1/workspaces/${slug}/agent-runs/${runId}/tool-results`,
1281
+ {
1282
+ method: "POST",
1283
+ headers: {
1284
+ "Content-Type": "application/json",
1285
+ Authorization: `Bearer ${apiKey}`,
1286
+ },
1287
+ body: JSON.stringify({ toolUseId: ev.data.toolUseId, result: text }),
1288
+ },
1289
+ );
1034
1290
  }
1035
1291
  ```
1036
1292
 
@@ -1049,24 +1305,21 @@ A reference SDK should:
1049
1305
  source-of-truth schema (Zod / Pydantic / etc.) — the server enforces
1050
1306
  JSON shape via the provider, but transient model errors can still
1051
1307
  produce strings that fail to parse in rare cases.
1052
- - [ ] Accept `loopDetection` and `toolBudgets` from the caller and pass
1053
- them through unchanged (see §8). Both are *additive* omitting
1054
- them keeps the runtime defaults; passing `loopDetection: false` opts
1055
- out; passing `toolBudgets: {}` clears the defaults; passing entries
1056
- layers caller overrides on top of the defaults. Do **not** translate
1057
- to vendor-specific knobs.
1058
- - [ ] Treat `loop_detected` and `tool_budget_exceeded` SSE events as
1059
- observability-only (see §4.5 / §4.6). Surface them as status notes
1060
- / log lines / telemetry — the server already substituted the
1061
- synthetic tool-results / steering nudges, so the SDK should keep
1062
- consuming the stream until the terminal event lands.
1308
+ - [ ] Accept `loopDetection`, `toolBudgets`, and `supervisor` from the caller
1309
+ and pass them through unchanged (see §8). All three are _additive_:
1310
+ omitting them keeps the runtime defaults; passing `loopDetection: false`
1311
+ or `supervisor: false` opts out; passing `toolBudgets: {}` clears the
1312
+ defaults; passing entries layers caller overrides on top of the defaults.
1313
+ Do **not** translate to vendor-specific knobs.
1314
+ - [ ] Treat `loop_detected`, `tool_budget_exceeded`, and `supervisor` SSE
1315
+ events as observability-only (see §4.5 / §4.6 / §4.7). Surface them as
1316
+ status notes / log lines / telemetry — the server already substituted
1317
+ synthetic tool-results / steering nudges / supervisor verdicts, so the
1318
+ SDK should keep consuming the stream until the terminal event lands.
1063
1319
  - [ ] Maintain three local-callback registries (or one tagged-union
1064
- registry), keyed by `name`:
1065
- - generic local tools (`kind: "local"`),
1066
- - local A2A peers (`kind: "a2a_local"`, indexed by some Agent Card
1067
- field — typically `agentCard.url`),
1068
- - local MCP servers (`kind: "mcp_local"`, indexed by the SDK-side
1069
- server label that matches `local_tool_call.mcpServer`).
1320
+ registry), keyed by `name`:
+ - generic local tools (`kind: "local"`),
+ - local A2A peers (`kind: "a2a_local"`, indexed by some Agent Card
+ field — typically `agentCard.url`),
+ - local MCP servers (`kind: "mcp_local"`, indexed by the SDK-side
+ server label that matches `local_tool_call.mcpServer`).
1070
1323
  - [ ] For `kind: "local"`, accept developer-supplied `parameters` (Zod /
1071
1324
  JSON Schema) and serialize to JSON Schema before submission. When the
1072
1325
  caller declares an output schema, forward it as `outputSchema` (same