@mastra/voice-inworld 0.2.1-alpha.0 → 0.3.0-alpha.1

package/CHANGELOG.md CHANGED
@@ -1,5 +1,68 @@
1
1
  # @mastra/voice-inworld
2
2
 
3
+ ## 0.3.0-alpha.1
4
+
5
+ ### Minor Changes
6
+
7
+ - `@mastra/voice-inworld` now ships `InworldRealtimeVoice` for full-duplex realtime voice — mic in, speakers out, server-side LLM routing, semantic VAD turn-taking, tool calling, barge-in, and live transcripts of both sides — alongside the existing streaming TTS and batch STT. No separate package needed; import both from the same entry point. ([#16865](https://github.com/mastra-ai/mastra/pull/16865))
8
+
9
+ ```typescript
10
+ // Batch TTS / STT (unchanged)
11
+ import { InworldVoice } from '@mastra/voice-inworld';
12
+
13
+ // New: realtime full-duplex voice, from the same package
14
+ import { InworldRealtimeVoice } from '@mastra/voice-inworld';
15
+
16
+ const voice = new InworldRealtimeVoice({
17
+ apiKey: process.env.INWORLD_API_KEY,
18
+ // Defaults: model 'inworld/models/gemma-4-26b-a4b-it', speaker 'Sarah',
19
+ // STT 'inworld/inworld-stt-1', semantic-VAD turn detection.
20
+ });
21
+
22
+ await voice.connect();
23
+ voice.on('speaker', stream => playAudio(stream)); // PCM16 @ 24kHz
24
+ voice.on('writing', ({ text, role }) => console.log(role, text));
25
+ voice.on('interrupted', ({ response_id }) => stopAudio(response_id));
26
+ await voice.send(getMicrophoneStream());
27
+ ```
28
+
29
+ **Typed `providerData` for Inworld realtime extensions**
30
+
31
+ `InworldRealtimeVoice` now accepts a typed `providerData` object for Inworld-specific extensions — STT tuning, TTS segmentation and steering, automatic memory, back-channel, and responsiveness — sent under `session.providerData`. The provider also surfaces inbound extension data: a `voiceProfile` on user `writing` events, a `memory` event for the rolling summary/facts state, and `backchannel` / `backchannel.done` / `backchannel.skipped` events for back-channel audio.
32
+
33
+ ```typescript
34
+ const voice = new InworldRealtimeVoice({
35
+ providerData: {
36
+ stt: { voice_profile: true, language_hints: ['en-US'] },
37
+ tts: { delivery_mode: 'CREATIVE', segmenter_strategy: 'balanced' },
38
+ memory: { enabled: true, turn_interval: 4 },
39
+ backchannel: { enabled: true, max_per_turn: 1 },
40
+ },
41
+ });
42
+
43
+ voice.on('memory', state => console.log(state.summary, state.facts));
44
+ voice.on('backchannel', stream => playAudio(stream));
45
+ voice.on('writing', ({ role, voiceProfile }) => console.log(role, voiceProfile?.emotion));
46
+ ```
47
+
48
+ **Realtime fixes and additions**
49
+ - Fixed the per-call `speak(text, { speaker })` voice override. It is now sent as the flat `response.voice` field, so the per-call speaker is no longer silently ignored by the server.
50
+ - Added manual turn-taking methods `commitInput()`, `clearInput()`, and `clearOutput()` for push-to-talk and manual turn control (use `clearOutput()` only to hard-stop all playback — it also stops in-flight back-channels).
51
+ - Added smart-turn and playback-state events: `turn-suggestion`, `turn-suggestion-revoked`, `input-committed`, `input-cleared`, `input-timeout`, and `output-audio-started` / `output-audio-stopped` / `output-audio-cleared`.
52
+ - Added richer typed session config: input noise reduction, telephony (8 kHz) and float32 audio formats, a server-VAD `idle_timeout_ms`, plus `tracing`, `include`, and `prompt`.
53
+
54
+ ```typescript
55
+ // Push-to-talk with no auto-VAD
56
+ const voice = new InworldRealtimeVoice({
57
+ session: { audio: { input: { turn_detection: null } } },
58
+ });
59
+
60
+ await voice.send(getMicrophoneStream());
61
+ voice.commitInput(); // end the user turn manually
62
+
63
+ voice.on('output-audio-stopped', () => console.log('playback finished'));
64
+ ```
65
+
3
66
  ## 0.2.1-alpha.0
4
67
 
5
68
  ### Patch Changes
package/README.md CHANGED
@@ -1,6 +1,6 @@
  # @mastra/voice-inworld
 
- [Inworld AI](https://inworld.ai) voice provider for [Mastra](https://mastra.ai) — streaming TTS and batch STT.
+ [Inworld AI](https://inworld.ai) voice provider for [Mastra](https://mastra.ai) — streaming TTS, batch STT, and realtime full-duplex voice.
 
  ## Installation
 
@@ -106,6 +106,341 @@ Alex, Ashley, Craig, Deborah, Dennis, Dominus, Edward, Elizabeth, Hades, Heitor,
106
106
 
107
107
  The `speak()` method uses Inworld's streaming TTS endpoint (`/tts/v1/voice:stream`), returning audio chunks progressively as they are generated. This is ideal for agentic workflows where low time-to-first-audio matters.
108
108
 
109
+ ## Realtime (full-duplex) voice
110
+
111
+ Alongside the batch TTS/STT `InworldVoice`, this package ships `InworldRealtimeVoice` — full-duplex, websocket-based speech-to-speech with tool calling, barge-in, and live transcripts of both sides of the conversation. Inworld runs the LLM server-side via its router, so you don't need a second model client.
112
+
113
+ Inworld's wire protocol is the OpenAI Realtime GA spec — same client/server event names (`conversation.item.added`, `conversation.item.done`, `response.output_audio.delta`, etc.). The provider-level differences are:
114
+
115
+ - Endpoint: `wss://api.inworld.ai/api/v1/realtime/session?key=<sessionId>&protocol=realtime`. The model is configured via the initial `session.update`, not the URL.
116
+ - Auth: `Authorization: Basic <key>` (Inworld keys ship pre-encoded; pass verbatim).
117
+ - Typed first-class session knobs (`audio.output.speed`, `audio.output.model`, `audio.input.turn_detection`, `audio.input.transcription`, `output_modalities`, `tool_choice`, …) via the `session` constructor field.
118
+ - A typed `providerData` object for Inworld extensions (STT tuning, TTS segmentation/steering, automatic memory, back-channel, responsiveness, plus `user_id`/`metadata`), sent under `session.providerData`.
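The endpoint and auth bullets above can be sketched as two small helpers. This is an illustration only, not code from the package: `buildRealtimeUrl` and `buildAuthHeader` are hypothetical names, and the query layout simply follows the URL shape quoted in the list.

```typescript
// Hypothetical helpers (not exported by @mastra/voice-inworld).
const DEFAULT_URL = 'wss://api.inworld.ai/api/v1/realtime/session';

// The session id travels as the `key` query parameter. The model is not part
// of the URL; per the notes above it is configured via `session.update`.
function buildRealtimeUrl(sessionId: string, baseUrl: string = DEFAULT_URL): string {
  return `${baseUrl}?key=${encodeURIComponent(sessionId)}&protocol=realtime`;
}

// Inworld keys are described as already Basic-encoded, so the key is passed
// through verbatim instead of being base64-encoded a second time.
function buildAuthHeader(apiKey: string): { Authorization: string } {
  return { Authorization: `Basic ${apiKey}` };
}

const exampleUrl = buildRealtimeUrl('voice-1700000000000');
const exampleHeader = buildAuthHeader('abc123');
```

Encoding a pre-encoded key again is the most likely mistake here, which is why the header helper does nothing beyond prefixing `Basic `.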
119
+
120
+ ### Usage
121
+
122
+ ```typescript
123
+ import { InworldRealtimeVoice } from '@mastra/voice-inworld';
124
+
125
+ const voice = new InworldRealtimeVoice({
126
+ apiKey: process.env.INWORLD_API_KEY,
127
+ model: 'inworld/models/gemma-4-26b-a4b-it',
128
+ speaker: 'Sarah',
129
+ instructions: 'You are a helpful voice assistant.',
130
+ // Typed first-class session knobs:
131
+ session: {
132
+ audio: {
133
+ output: { speed: 1.1 },
134
+ input: {
135
+ transcription: { model: 'inworld/inworld-stt-1' },
136
+ turn_detection: { type: 'semantic_vad', eagerness: 'high' },
137
+ },
138
+ },
139
+ },
140
+ });
141
+
142
+ await voice.connect();
143
+
144
+ voice.on('speaker', stream => {
145
+ // PCM16 @ 24kHz stream — pipe to your audio output
146
+ });
147
+
148
+ voice.on('writing', ({ text, role }) => {
149
+ // Transcription / assistant text
150
+ });
151
+
152
+ voice.on('error', err => {
153
+ console.error('Voice error:', err);
154
+ });
155
+
156
+ await voice.speak('Hello from Mastra!');
157
+
158
+ // Tool integration
159
+ voice.addTools({
160
+ search: searchTool,
161
+ });
162
+
163
+ // Streaming audio in
164
+ await voice.send(microphoneStream);
165
+
166
+ // Stop
167
+ voice.close();
168
+ ```
169
+
170
+ ### Options
171
+
172
+ | Option | Type | Default | Description |
173
+ | ------------------ | ------------------------------- | ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
174
+ | `apiKey` | `string` | `process.env.INWORLD_API_KEY` | Inworld API key (Basic-encoded, passed verbatim). |
175
+ | `url` | `string` | `wss://api.inworld.ai/api/v1/realtime/session` | Realtime websocket endpoint. |
176
+ | `model` | `string` | `inworld/models/gemma-4-26b-a4b-it` | LLM Router model ID. |
177
+ | `speaker` | `string` | `Sarah` | Default voice ID. Any catalog voice is accepted. |
178
+ | `sessionId` | `string` | `voice-{Date.now()}` | Client-generated session key surfaced as the URL `?key=` parameter. |
179
+ | `instructions` | `string` | `undefined` | System prompt sent with the initial `session.update`. |
180
+ | `session` | `Partial<InworldSessionConfig>` | `undefined` | Typed first-class session knobs (see below). Deep-merged into every `session.update`. |
181
+ | `debug` | `boolean` | `false` | Log raw server events. |
182
+ | `providerData` | `InworldProviderData` | `undefined` | Inworld extension config sent under `session.providerData`; composes with `session.providerData` (constructor option wins on collision). |
183
+ | `connectTimeoutMs` | `number` | `15000` | Max time `connect()` will wait for both the WebSocket handshake and the initial `session.updated` round-trip. A pre-open error/close or timeout becomes a rejected `connect()`. |
184
+
185
+ #### `session` (typed knobs)
186
+
187
+ Use the typed `session` field for known Inworld realtime options. Fields compose with the connect-time defaults (e.g. `audio.output.voice` set from `speaker`):
188
+
189
+ ```typescript
190
+ new InworldRealtimeVoice({
191
+ speaker: 'Dennis',
192
+ session: {
193
+ output_modalities: ['audio', 'text'],
194
+ audio: {
195
+ output: { speed: 1.15, model: 'inworld-tts-2' },
196
+ input: {
197
+ transcription: { model: 'inworld/inworld-stt-1', language: 'en-US' },
198
+ turn_detection: { type: 'semantic_vad', eagerness: 'medium' },
199
+ },
200
+ },
201
+ tool_choice: { type: 'mcp', server_label: 'my-mcp' },
202
+ temperature: 0.6,
203
+ },
204
+ });
205
+ ```
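How user-supplied `session` fields might compose with connect-time defaults can be sketched with a generic recursive merge. This is an assumption about the merge semantics, not the package's implementation: `deepMerge`, `connectDefaults`, and `userSession` are hypothetical names, and the sketch treats arrays and `null` as replacement values so that `turn_detection: null` disables the default.

```typescript
type Plain = { [key: string]: unknown };

function isPlain(value: unknown): value is Plain {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Recursively merge `override` into a copy of `base`. Nested plain objects are
// merged key by key; everything else (including null) replaces the base value.
function deepMerge(base: Plain, override: Plain): Plain {
  const merged: Plain = { ...base };
  for (const key of Object.keys(override)) {
    const current = merged[key];
    const incoming = override[key];
    merged[key] = isPlain(current) && isPlain(incoming) ? deepMerge(current, incoming) : incoming;
  }
  return merged;
}

// Hypothetical defaults: voice derived from `speaker`, default semantic VAD.
const connectDefaults: Plain = {
  audio: {
    output: { voice: 'Dennis' },
    input: { turn_detection: { type: 'semantic_vad', eagerness: 'medium' } },
  },
};

// User config adds a speed and disables turn detection.
const userSession: Plain = {
  audio: { output: { speed: 1.15 }, input: { turn_detection: null } },
};

const mergedSession = deepMerge(connectDefaults, userSession);
```

In this sketch the default `voice` survives next to the user's `speed`, while the explicit `null` replaces the whole default `turn_detection` object.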
206
+
207
+ Other typed `session` fields:
208
+
209
+ - `audio.input.noise_reduction`: `{ type: 'near_field' | 'far_field' }` — denoise input before VAD/transcription.
210
+ - `audio.{input,output}.format`: a codec string or `{ type, rate? }` object. Supports telephony 8kHz (`audio/pcmu`, `audio/pcma`) and `audio/float32`; `rate` (Hz) applies to `audio/pcm` + `audio/float32` (default 24000).
211
+ - `audio.input.turn_detection.idle_timeout_ms`: `server_vad` only — idle window before the server commits a turn.
212
+ - `audio.input.transcription.prompt`: bias transcription with vocabulary/spelling/style hints.
213
+ - `tracing`: `'auto'` or `{ workflow_name?, group_id?, metadata? }`.
214
+ - `include`: opt-in extra event fields (e.g. `['item.input_audio_transcription.logprobs']`).
215
+ - `prompt`: a server-side prompt-template reference (string or `null`).
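The format options above imply different raw byte rates, which matters when sizing buffers or chunking microphone input. The sketch below is an illustration under stated assumptions: `resolveFormat` and `bytesPerSecond` are hypothetical helpers, audio is assumed to be mono, `audio/pcm` is assumed to mean 16-bit samples (matching the PCM16 default), and the telephony codecs are assumed to be fixed at 8 kHz with one byte per sample.

```typescript
type Codec = 'audio/pcm' | 'audio/pcmu' | 'audio/pcma' | 'audio/float32';
type AudioFormat = Codec | { type: Codec; rate?: number };

interface ResolvedFormat {
  type: Codec;
  rate: number; // samples per second
  bytesPerSample: number;
}

const BYTES_PER_SAMPLE: Record<Codec, number> = {
  'audio/pcm': 2, // 16-bit PCM
  'audio/pcmu': 1, // G.711 mu-law
  'audio/pcma': 1, // G.711 A-law
  'audio/float32': 4,
};

// Accept either the string form or the `{ type, rate? }` object form.
function resolveFormat(format: AudioFormat): ResolvedFormat {
  const type = typeof format === 'string' ? format : format.type;
  const requestedRate = typeof format === 'string' ? undefined : format.rate;
  const isTelephony = type === 'audio/pcmu' || type === 'audio/pcma';
  // `rate` only applies to pcm/float32; telephony is pinned to 8000 Hz here.
  const rate = isTelephony ? 8000 : (requestedRate ?? 24000);
  return { type, rate, bytesPerSample: BYTES_PER_SAMPLE[type] };
}

// Raw mono byte rate for a given format.
function bytesPerSecond(format: AudioFormat): number {
  const { rate, bytesPerSample } = resolveFormat(format);
  return rate * bytesPerSample;
}

const defaultRate = bytesPerSecond('audio/pcm');
const telephonyRate = bytesPerSecond('audio/pcmu');
const floatRate = bytesPerSecond({ type: 'audio/float32', rate: 16000 });
```

Under these assumptions the default PCM16 stream at 24 kHz works out to 48,000 bytes per second, six times the rate of an 8 kHz telephony stream.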
216
+
217
+ #### `providerData` (Inworld extensions)
218
+
219
+ `providerData` is a typed object for Inworld-specific realtime extensions. It's sent under `session.providerData` on every `session.update` and composes with any `session.providerData` you set via the `session` field — the constructor option wins on key collisions.
220
+
221
+ It has five branches plus two session-level fields:
222
+
223
+ - `stt`: STT tuning (`prompt`, `voice_profile`, `language_hints`, VAD/end-of-turn thresholds).
224
+ - `tts`: TTS segmentation and delivery (`segmenter_strategy`, `steering_handling`, `delivery_mode`, `conversational`, `user_turn_mode`, `language`).
225
+ - `memory`: automatic rolling memory (`enabled`, `turn_interval`, `max_facts`, …). Inworld echoes its state back via the `memory` event.
226
+ - `backchannel`: short acknowledgements ("uh-huh") while the user speaks. Audio arrives on the `backchannel` event.
227
+ - `responsiveness`: early "filler" audio while the main response generates. Filler audio reuses the normal `speaker`/`speaking` path — there are no distinct events.
228
+ - `user_id` and `metadata`: session-level identifiers passed through to Inworld.
229
+
230
+ ```typescript
231
+ new InworldRealtimeVoice({
232
+ providerData: {
233
+ stt: { voice_profile: true, language_hints: ['en-US'] },
234
+ tts: { delivery_mode: 'CREATIVE', segmenter_strategy: 'balanced' },
235
+ memory: { enabled: true, turn_interval: 4 },
236
+ backchannel: { enabled: true, max_per_turn: 1 },
237
+ user_id: 'user-123',
238
+ },
239
+ });
240
+ ```
241
+
242
+ ### Events
243
+
244
+ `on()` and `off()` are typed against `InworldVoiceEventMap`: known event names give you a typed callback payload, while unknown event names fall back to `unknown`.
245
+
246
+ | Event | Payload |
247
+ | ------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
248
+ | `speaking` | `{ audio: Buffer; response_id: string }` |
249
+ | `speaking.done` | `{ response_id: string }` |
250
+ | `speaker` | `PassThrough` stream of PCM audio |
251
+ | `writing` | `{ text: string; response_id: string; role: 'assistant' \| 'user'; voiceProfile? }`. Deduplicated across audio-transcript + text deltas in one response. `voiceProfile` is present on user events when `providerData.stt.voice_profile` is enabled. |
252
+ | `speech-started` | Raw server `input_audio_buffer.speech_started` payload (VAD edge). |
253
+ | `speech-stopped` | Raw server `input_audio_buffer.speech_stopped` payload (VAD edge). |
254
+ | `interrupted` | `{ response_id: string }`. Synthesized once per in-flight response when the user starts speaking. |
255
+ | `turn-suggestion` | `{ item_id, utterance_index, probability, trailing_silence_ms?, audio_duration_ms?, inference_ms? }`. Smart-turn endpointing hint for a buffered user utterance. |
256
+ | `turn-suggestion-revoked` | `{ item_id, utterance_index }`. A prior `turn-suggestion` was retracted. |
257
+ | `input-committed` | `{ item_id, previous_item_id? }`. Buffered input audio was committed as a user turn (`previous_item_id` may be `null`). |
258
+ | `input-cleared` | `{}`. Buffered input audio was discarded. |
259
+ | `input-timeout` | `{ audio_start_ms, audio_end_ms, item_id }`. Server-VAD idle timeout committed a user turn. |
260
+ | `output-audio-started` | `{}`. Server began emitting output audio. |
261
+ | `output-audio-stopped` | `{}`. Server stopped emitting output audio for the current response. |
262
+ | `output-audio-cleared` | `{}`. Server output audio buffer was flushed (playback stopped). |
263
+ | `memory` | `InworldMemoryState` — Inworld's rolling summary/facts state (deduped by version). Requires `providerData.memory.enabled`. |
264
+ | `backchannel` | `PassThrough` stream of back-channel PCM audio. Requires `providerData.backchannel.enabled`. |
265
+ | `backchannel.done` | `{ backchannel_id: string; phrase? }`. Fires when a back-channel finishes. |
266
+ | `backchannel.skipped` | `{ reason: string }`. Fires when the decider skips a back-channel before any audio. |
267
+ | `response.created` | Full server event. |
268
+ | `response.done` | Full server event. |
269
+ | `conversation.item.added` | Full server event. |
270
+ | `conversation.item.done` | Full server event. |
271
+ | `function_call.arguments` | `{ call_id, name, arguments }` JSON. |
272
+ | `tool-call-start` | `{ toolCallId, toolName, args, … }`. |
273
+ | `tool-call-result` | `{ toolCallId, …, result }`. |
274
+ | `error` | `Error` (or a server error event). |
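The "typed for known names, `unknown` otherwise" behaviour described for `on()` and `off()` can be expressed with a conditional type. The sketch below shows one way to get that behaviour; it is not the package's source. `DemoEventMap` is a reduced, hypothetical stand-in for `InworldVoiceEventMap`, and `TypedEmitter` is a hypothetical class.

```typescript
// Reduced stand-in for the real event map, using two payloads from the table.
interface DemoEventMap {
  interrupted: { response_id: string };
  'backchannel.skipped': { reason: string };
}

// Known names resolve to their payload type; anything else is `unknown`.
type PayloadOf<K extends string> = K extends keyof DemoEventMap ? DemoEventMap[K] : unknown;

type Listener = (payload: unknown) => void;

class TypedEmitter {
  private listeners = new Map<string, Listener[]>();

  on<K extends string>(event: K, callback: (payload: PayloadOf<K>) => void): void {
    const list = this.listeners.get(event) ?? [];
    list.push(callback as Listener);
    this.listeners.set(event, list);
  }

  off<K extends string>(event: K, callback: (payload: PayloadOf<K>) => void): void {
    const target = callback as Listener;
    const list = this.listeners.get(event) ?? [];
    this.listeners.set(event, list.filter(listener => listener !== target));
  }

  emit<K extends string>(event: K, payload: PayloadOf<K>): void {
    for (const listener of this.listeners.get(event) ?? []) listener(payload);
  }
}

const emitter = new TypedEmitter();
const seen: string[] = [];

// Payload is typed as `{ response_id: string }` because the name is known.
const onInterrupted = (payload: { response_id: string }): void => {
  seen.push(payload.response_id);
};
emitter.on('interrupted', onInterrupted);

// Unknown name: the payload is `unknown` and must be narrowed before use.
emitter.on('custom-event', payload => {
  if (typeof payload === 'string') seen.push(payload);
});

emitter.emit('interrupted', { response_id: 'resp_1' });
emitter.off('interrupted', onInterrupted);
emitter.emit('interrupted', { response_id: 'resp_2' }); // no listener left
```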
275
+
276
+ #### Barge-in
277
+
278
+ `speech-started` and `speech-stopped` mirror Inworld's raw VAD edges. `interrupted` is a synthetic, client-side signal: whenever `speech-started` fires while one or more responses are in flight, `interrupted` is emitted once per active `response_id`. Listen to `interrupted` to stop audio playback without having to track response state yourself.
279
+
280
+ **Back-channels are exempt from barge-in by design.** Back-channel audio (`"uh-huh"`, `"I see"`) is meant to play _while the user is still speaking_, so user speech must not cut it off. The two kinds of audio are kept on separate channels, so this falls out naturally:
281
+
282
+ - Main response audio arrives on the **`speaker`** event; each stream's `.id` is a `response_id`. `interrupted` only ever carries `response_id`s, so stopping the `speaker` stream whose id matches `interrupted.response_id` stops main content instantly on barge-in.
283
+ - Back-channel audio arrives on the **`backchannel`** event; each stream's `.id` is a `backchannel_id`. These ids never appear in `interrupted`, and the SDK never sends `response.cancel` for them — so a client that keys players by stream id and only kills on `interrupted` will leave back-channels playing automatically.
284
+
285
+ The one footgun: don't blanket-kill all players on barge-in. Keep back-channel players in a separate collection (or just key by stream `.id` and match only `interrupted.response_id`). Each back-channel stream ends itself on `backchannel.done`, so its player exits naturally.
286
+
287
+ ```typescript
288
+ const players = new Map(); // main response audio — stopped on barge-in
289
+ const bcPlayers = new Map(); // back-channels — never stopped on barge-in
290
+
291
+ voice.on('speaker', stream => startPlayer(players, stream.id, stream));
292
+ voice.on('interrupted', ({ response_id }) => players.get(response_id)?.stop()); // main only
293
+
294
+ voice.on('backchannel', stream => startPlayer(bcPlayers, stream.id, stream)); // overlaps user speech
295
+ ```
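The two-collection pattern above can be made self-contained with a small registry. The sketch is runnable without audio hardware: `BargeInRegistry` and `FakePlayer` are hypothetical names, and the registry methods are meant to be called from the `speaker`, `backchannel`, `interrupted`, and `backchannel.done` handlers.

```typescript
interface Stoppable {
  stop(): void;
}

class BargeInRegistry {
  private main = new Map<string, Stoppable>();
  private backchannels = new Map<string, Stoppable>();

  // Call from the `speaker` handler, keyed by the stream's response id.
  addMain(responseId: string, player: Stoppable): void {
    this.main.set(responseId, player);
  }

  // Call from the `backchannel` handler, keyed by the stream's backchannel id.
  addBackchannel(backchannelId: string, player: Stoppable): void {
    this.backchannels.set(backchannelId, player);
  }

  // Call from the `interrupted` handler. Only main players are looked up, so
  // a back-channel id can never be stopped through this path.
  interrupt(responseId: string): boolean {
    const player = this.main.get(responseId);
    if (!player) return false;
    player.stop();
    this.main.delete(responseId);
    return true;
  }

  // Call from the `backchannel.done` handler.
  finishBackchannel(backchannelId: string): void {
    this.backchannels.delete(backchannelId);
  }

  backchannelCount(): number {
    return this.backchannels.size;
  }
}

class FakePlayer implements Stoppable {
  stopped = false;
  stop(): void {
    this.stopped = true;
  }
}

const registry = new BargeInRegistry();
const mainPlayer = new FakePlayer();
const backchannelPlayer = new FakePlayer();
registry.addMain('resp_1', mainPlayer);
registry.addBackchannel('bc_1', backchannelPlayer);

const stoppedMain = registry.interrupt('resp_1'); // barge-in on main audio
const stoppedBackchannel = registry.interrupt('bc_1'); // ignored: not a main id
```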
296
+
297
+ Enable back-channels with `providerData: { backchannel: { enabled: true } }` (gated by server prerequisites — contact your Inworld account team).
298
+
299
+ #### Manual turn-taking & playback signals
300
+
301
+ For push-to-talk or manual turn-taking (set `turn_detection` to `null` to disable auto-VAD), drive turns yourself:
302
+
303
+ - `commitInput()` — commit buffered input audio as a user turn.
304
+ - `clearInput()` — discard buffered input audio.
305
+ - `clearOutput()` — clear the server's **entire** output audio buffer, stopping playback. This also stops any in-flight **back-channel** audio. The default barge-in path (`response.cancel` on `interrupted`) is back-channel-safe; prefer it. Use `clearOutput()` only when you explicitly want to flush everything.
306
+
307
+ ```typescript
308
+ voice.commitInput(); // end the current user turn
309
+ voice.clearInput(); // throw away what's buffered
310
+ voice.clearOutput(); // hard-stop all playback (back-channels included)
311
+ ```
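A push-to-talk button maps naturally onto `commitInput()` and `clearInput()`. The controller below is a sketch under assumptions: `PushToTalk`, `TurnControl`, and the `minHoldMs` threshold are hypothetical, and the fake voice object only records calls so the logic can be run without a connection.

```typescript
// The subset of the voice API this sketch relies on.
interface TurnControl {
  commitInput(): void;
  clearInput(): void;
}

type ReleaseOutcome = 'committed' | 'cleared' | 'idle';

class PushToTalk {
  private pressedAt: number | null = null;
  private readonly voice: TurnControl;
  private readonly minHoldMs: number;

  constructor(voice: TurnControl, minHoldMs: number = 200) {
    this.voice = voice;
    this.minHoldMs = minHoldMs;
  }

  press(nowMs: number): void {
    this.pressedAt = nowMs;
  }

  // Very short presses are treated as accidental and the buffered audio is
  // discarded; longer presses are committed as a user turn.
  release(nowMs: number): ReleaseOutcome {
    if (this.pressedAt === null) return 'idle';
    const heldMs = nowMs - this.pressedAt;
    this.pressedAt = null;
    if (heldMs < this.minHoldMs) {
      this.voice.clearInput();
      return 'cleared';
    }
    this.voice.commitInput();
    return 'committed';
  }
}

const calls: string[] = [];
const fakeVoice: TurnControl = {
  commitInput: () => {
    calls.push('commitInput');
  },
  clearInput: () => {
    calls.push('clearInput');
  },
};

const ptt = new PushToTalk(fakeVoice);
ptt.press(0);
const tap = ptt.release(50); // shorter than minHoldMs
ptt.press(1000);
const utterance = ptt.release(2500); // a real utterance
```

With a real `InworldRealtimeVoice` instance in place of `fakeVoice`, this only makes sense when `turn_detection` is set to `null`, as described above.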
312
+
313
+ The server emits matching signals you can listen to:
314
+
315
+ - `turn-suggestion` / `turn-suggestion-revoked` — smart-turn endpointing hints (and retractions) for a buffered utterance.
316
+ - `input-committed` / `input-cleared` — acknowledgements for `commitInput()` / `clearInput()` (also fire on auto-VAD commits).
317
+ - `input-timeout` — a server-VAD idle timeout committed a user turn.
318
+ - `output-audio-started` / `output-audio-stopped` / `output-audio-cleared` — output playback state on the server.
319
+
320
+ ```typescript
321
+ voice.on('turn-suggestion', ({ probability }) => console.log('end-of-turn?', probability));
322
+ voice.on('input-committed', ({ item_id }) => console.log('committed', item_id));
323
+ voice.on('output-audio-stopped', () => console.log('playback finished'));
324
+ ```
325
+
326
+ #### Awaitable `speak()`
327
+
328
+ `speak()` resolves only after the full response lifecycle completes (`response.done` for the response it triggered). It rejects if the response is interrupted by user speech, or on a transport error. Serial calls are the supported pattern — concurrent `speak()` calls share the same listener pool and have undefined response-pinning order.
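Because serial calls are the supported pattern, callers that trigger speech from several places may want to funnel everything through a queue. The sketch below is one way to do that, not part of the package: `createSerialQueue` is a hypothetical helper and `fakeSpeak` stands in for `voice.speak()`, rejecting to simulate an interrupted response.

```typescript
type Task<T> = () => Promise<T>;

// Returns an `enqueue` function that runs tasks strictly one after another.
// A rejected task does not block the tasks queued behind it.
function createSerialQueue(): <T>(task: Task<T>) => Promise<T> {
  let tail: Promise<void> = Promise.resolve();
  return function enqueue<T>(task: Task<T>): Promise<T> {
    const result = tail.then(() => task());
    // Keep the chain alive whether the task resolved or rejected.
    tail = result.then(() => undefined, () => undefined);
    return result;
  };
}

const speechLog: string[] = [];

// Stand-in for `voice.speak(text)`; `fail` simulates an interruption.
function fakeSpeak(text: string, fail: boolean = false): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    speechLog.push(`start:${text}`);
    if (fail) {
      reject(new Error('interrupted'));
      return;
    }
    speechLog.push(`done:${text}`);
    resolve();
  });
}

const enqueue = createSerialQueue();
const first = enqueue(() => fakeSpeak('one'));
const second = enqueue(() => fakeSpeak('two', true)).catch(() => {
  speechLog.push('rejected:two');
});
const third = enqueue(() => fakeSpeak('three'));

const allSettled = Promise.all([first, second, third]).then(() => speechLog);
```

Each caller still receives its own promise, so an interruption surfaces as a rejection on that one call only, while later calls still run.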
329
+
330
+ #### Default `turn_detection`
331
+
332
+ `audio.input.turn_detection` defaults to `{ type: 'semantic_vad', eagerness: 'medium', create_response: true, interrupt_response: true }`. To override, set `session.audio.input.turn_detection` to your own object. To disable turn detection entirely, set it to `null`.
333
+
334
+ `eagerness` controls how quickly semantic VAD ends a user turn — `low` waits for clearer pauses (more interruption-resistant), `high` ends turns sooner (snappier, more prone to cutting users off). Default `medium` balances both.
335
+
336
+ #### Default `transcription`
337
+
338
+ `audio.input.transcription` defaults to `{ model: 'inworld/inworld-stt-1' }`, so user-side `writing` events (with `role: 'user'`) fire out of the box. To override, set `session.audio.input.transcription` to your own object. To disable user-side transcription, set it to `null`.
339
+
340
+ ### Full CLI example
341
+
342
+ A complete, terminal-based demo wiring `InworldRealtimeVoice` into a Mastra `Agent` with mic input, speaker output, semantic-VAD turn-taking, barge-in, and tool calling — all in one file.
343
+
344
+ Prereqs: Node 22+, `sox` (provides `sox` and `play`; `brew install sox` on macOS), `INWORLD_API_KEY`.
345
+
346
+ The same code as a clone-and-run repo: [github.com/cshape/inworld-mastra-cli-demo](https://github.com/cshape/inworld-mastra-cli-demo).
347
+
348
+ ```typescript
349
+ import 'dotenv/config';
350
+ import { spawn, type ChildProcess } from 'node:child_process';
351
+ import { Agent } from '@mastra/core/agent';
352
+ import { createTool } from '@mastra/core/tools';
353
+ import { InworldRealtimeVoice } from '@mastra/voice-inworld';
354
+ import { z } from 'zod';
355
+
356
+ const getCurrentTime = createTool({
357
+ id: 'get-current-time',
358
+ description: 'Returns the current local time.',
359
+ inputSchema: z.object({}),
360
+ outputSchema: z.object({ time: z.string() }),
361
+ execute: async () => ({ time: new Date().toLocaleTimeString() }),
362
+ });
363
+
364
+ const voice = new InworldRealtimeVoice({
365
+ model: 'openai/gpt-5.4-nano',
366
+ speaker: 'Jason',
367
+ session: {
368
+ audio: {
369
+ input: {
370
+ transcription: { model: 'inworld/inworld-stt-1', language: 'en-US' },
371
+ turn_detection: { type: 'semantic_vad', eagerness: 'high', interrupt_response: true },
372
+ },
373
+ output: { model: 'inworld-tts-2', speed: 1.0 },
374
+ },
375
+ temperature: 0.7,
376
+ max_output_tokens: 150,
377
+ },
378
+ });
379
+
380
+ new Agent({
381
+ id: 'voice-demo',
382
+ name: 'Voice Demo',
383
+ instructions:
384
+ 'You are a concise voice assistant. Reply in one or two short sentences. Use the get-current-time tool when asked the time.',
385
+ model: 'n/a',
386
+ tools: { getCurrentTime },
387
+ voice,
388
+ });
389
+
390
+ const SOX = ['-t', 'raw', '-r', '24000', '-e', 'signed', '-b', '16', '-c', '1', '-q', '-'];
391
+ const players = new Map<string, ChildProcess>();
392
+
393
+ voice.on('speaker', stream => {
394
+ // Any new response supersedes the prior one — kill leftover players so
395
+ // a missed barge-in can't leave two streams playing at once.
396
+ for (const p of players.values()) p.kill('SIGTERM');
397
+ players.clear();
398
+ const id = (stream as unknown as { id: string }).id;
399
+ const player = spawn('play', SOX, { stdio: ['pipe', 'ignore', 'ignore'] });
400
+ players.set(id, player);
401
+ // Swallow EPIPE when `play` exits while the PassThrough still has buffered frames.
402
+ player.stdin!.on('error', () => {});
403
+ stream.pipe(player.stdin!);
404
+ player.on('exit', () => players.delete(id));
405
+ });
406
+
407
+ voice.on('interrupted', ({ response_id }) => players.get(response_id)?.kill('SIGTERM'));
408
+
409
+ let lastRole: 'user' | 'assistant' | null = null;
410
+ voice.on('writing', ({ text, role }) => {
411
+ if (role !== lastRole) {
412
+ process.stdout.write(role === 'user' ? '\n[you] ' : '\n[bot] ');
413
+ lastRole = role;
414
+ }
415
+ process.stdout.write(text);
416
+ });
417
+
418
+ voice.on('tool-call-start', ({ toolName }) => console.log(`\n[tool] ${toolName}`));
419
+ voice.on('error', err => console.error('\n[error]', err));
420
+
421
+ await voice.connect();
422
+ console.log('Connected. Use headphones for best experience. Speak when ready. Ctrl+C to exit.');
423
+
424
+ const mic = spawn('sox', ['-d', ...SOX], { stdio: ['ignore', 'pipe', 'ignore'] });
425
+ await voice.send(mic.stdout);
426
+
427
+ process.on('SIGINT', () => {
428
+ mic.kill('SIGTERM');
429
+ for (const p of players.values()) p.kill('SIGTERM');
430
+ voice.close();
431
+ process.exit(0);
432
+ });
433
+ ```
434
+
435
+ ### Realtime protocol notes
436
+
437
+ These match what the live API emits (verified via raw-websocket smoke tests):
438
+
439
+ - Audio default: PCM16 @ 24kHz. Also supports telephony `audio/pcmu` / `audio/pcma` @ 8kHz and `audio/float32`.
440
+ - Server emits `session.created` on connect (older docs claim it doesn't).
441
+ - Function call args arrive via `response.function_call_arguments.delta` (singular). Some docs say plural; the docs are wrong.
442
+ - Audio deltas arrive on `response.output_audio.delta` / `…audio.done` (GA spec), not the older `response.audio.delta`.
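Since function call arguments arrive as deltas, a client reading the raw socket has to reassemble them before parsing. The sketch below shows one way to do that and rests on assumptions: only the `response.function_call_arguments.delta` event name comes from the notes above, while the matching `.done` event and the `call_id` / `delta` field names are assumed to follow the OpenAI Realtime spec that Inworld is described as mirroring. `FunctionCallAssembler` is a hypothetical class.

```typescript
interface ArgsDelta {
  type: 'response.function_call_arguments.delta';
  call_id: string; // assumed field name
  delta: string; // assumed field name: a fragment of the JSON arguments
}

interface ArgsDone {
  type: 'response.function_call_arguments.done'; // assumed event name
  call_id: string;
}

type ArgsEvent = ArgsDelta | ArgsDone;

class FunctionCallAssembler {
  private buffers = new Map<string, string>();

  // Returns the parsed arguments once a call completes, otherwise undefined.
  handle(event: ArgsEvent): unknown {
    if (event.type === 'response.function_call_arguments.delta') {
      const soFar = this.buffers.get(event.call_id) ?? '';
      this.buffers.set(event.call_id, soFar + event.delta);
      return undefined;
    }
    const raw = this.buffers.get(event.call_id) ?? '';
    this.buffers.delete(event.call_id);
    // A call with no arguments is treated as an empty object.
    return JSON.parse(raw === '' ? '{}' : raw);
  }
}

const assembler = new FunctionCallAssembler();
assembler.handle({ type: 'response.function_call_arguments.delta', call_id: 'call_1', delta: '{"city":' });
assembler.handle({ type: 'response.function_call_arguments.delta', call_id: 'call_1', delta: '"Paris"}' });
const parsedArgs = assembler.handle({ type: 'response.function_call_arguments.done', call_id: 'call_1' });
```

When using `InworldRealtimeVoice` itself this bookkeeping should not be needed, since the provider surfaces `function_call.arguments` and `tool-call-start` events.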
443
+
109
444
  ## Authentication
110
445
 
111
- Set your API key via the `INWORLD_API_KEY` environment variable or pass it in the config. Get your key from [platform.inworld.ai](https://platform.inworld.ai) → Settings → API Keys.
446
+ Set your API key via the `INWORLD_API_KEY` environment variable or pass it in the config. Get your key from [platform.inworld.ai](https://platform.inworld.ai) → Settings → API Keys. Inworld API keys ship **already Basic-encoded** — paste verbatim; the package will not re-encode the key.
@@ -1,3 +1,5 @@
+ import { JSONSchema7 } from 'json-schema';
+ import { JSONSchema7Definition } from 'json-schema';
  import { ServerResponse } from 'node:http';
  import { ServerResponse as ServerResponse_2 } from 'http';
  import { z } from '../../zod/v4/index.d.cts';
@@ -3507,164 +3509,7 @@ export declare function jsonSchema<OBJECT = unknown>(jsonSchema: JSONSchema7 | (
3507
3509
  validate?: (value: unknown) => ValidationResult<OBJECT> | PromiseLike<ValidationResult<OBJECT>>;
3508
3510
  }): Schema<OBJECT>;
3509
3511
 
3510
- export declare interface JSONSchema7 {
3511
- $id?: string | undefined;
3512
- $ref?: string | undefined;
3513
- $schema?: JSONSchema7Version | undefined;
3514
- $comment?: string | undefined;
3515
-
3516
- /**
3517
- * @see https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-00#section-8.2.4
3518
- * @see https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-validation-00#appendix-A
3519
- */
3520
- $defs?: {
3521
- [key: string]: JSONSchema7Definition;
3522
- } | undefined;
3523
-
3524
- /**
3525
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-6.1
3526
- */
3527
- type?: JSONSchema7TypeName | JSONSchema7TypeName[] | undefined;
3528
- enum?: JSONSchema7Type[] | undefined;
3529
- const?: JSONSchema7Type | undefined;
3530
-
3531
- /**
3532
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-6.2
3533
- */
3534
- multipleOf?: number | undefined;
3535
- maximum?: number | undefined;
3536
- exclusiveMaximum?: number | undefined;
3537
- minimum?: number | undefined;
3538
- exclusiveMinimum?: number | undefined;
3539
-
3540
- /**
3541
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-6.3
3542
- */
3543
- maxLength?: number | undefined;
3544
- minLength?: number | undefined;
3545
- pattern?: string | undefined;
3546
-
3547
- /**
3548
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-6.4
3549
- */
3550
- items?: JSONSchema7Definition | JSONSchema7Definition[] | undefined;
3551
- additionalItems?: JSONSchema7Definition | undefined;
3552
- maxItems?: number | undefined;
3553
- minItems?: number | undefined;
3554
- uniqueItems?: boolean | undefined;
3555
- contains?: JSONSchema7Definition | undefined;
3556
-
3557
- /**
3558
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-6.5
3559
- */
3560
- maxProperties?: number | undefined;
3561
- minProperties?: number | undefined;
3562
- required?: string[] | undefined;
3563
- properties?: {
3564
- [key: string]: JSONSchema7Definition;
3565
- } | undefined;
3566
- patternProperties?: {
3567
- [key: string]: JSONSchema7Definition;
3568
- } | undefined;
3569
- additionalProperties?: JSONSchema7Definition | undefined;
3570
- dependencies?: {
3571
- [key: string]: JSONSchema7Definition | string[];
3572
- } | undefined;
3573
- propertyNames?: JSONSchema7Definition | undefined;
3574
-
3575
- /**
3576
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-6.6
3577
- */
3578
- if?: JSONSchema7Definition | undefined;
3579
- then?: JSONSchema7Definition | undefined;
3580
- else?: JSONSchema7Definition | undefined;
3581
-
3582
- /**
3583
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-6.7
3584
- */
3585
- allOf?: JSONSchema7Definition[] | undefined;
3586
- anyOf?: JSONSchema7Definition[] | undefined;
3587
- oneOf?: JSONSchema7Definition[] | undefined;
3588
- not?: JSONSchema7Definition | undefined;
3589
-
3590
- /**
3591
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-7
3592
- */
3593
- format?: string | undefined;
3594
-
3595
- /**
3596
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-8
3597
- */
3598
- contentMediaType?: string | undefined;
3599
- contentEncoding?: string | undefined;
3600
-
3601
- /**
3602
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-9
3603
- */
3604
- definitions?: {
3605
- [key: string]: JSONSchema7Definition;
3606
- } | undefined;
3607
-
3608
- /**
3609
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-10
3610
- */
3611
- title?: string | undefined;
3612
- description?: string | undefined;
3613
- default?: JSONSchema7Type | undefined;
3614
- readOnly?: boolean | undefined;
3615
- writeOnly?: boolean | undefined;
3616
- examples?: JSONSchema7Type | undefined;
3617
- }
3618
-
3619
- declare interface JSONSchema7Array extends Array<JSONSchema7Type> {}
3620
-
3621
- /**
3622
- * JSON Schema v7
3623
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01
3624
- */
3625
- declare type JSONSchema7Definition = JSONSchema7 | boolean;
3626
-
3627
- declare interface JSONSchema7Object {
3628
- [key: string]: JSONSchema7Type;
3629
- }
3630
-
3631
- /**
3632
- * Primitive type
3633
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-6.1.1
3634
- */
3635
- declare type JSONSchema7Type =
3636
- | string //
3637
- | number
3638
- | boolean
3639
- | JSONSchema7Object
3640
- | JSONSchema7Array
3641
- | null;
3642
-
3643
- /**
3644
- * Primitive type
3645
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-6.1.1
3646
- */
3647
- declare type JSONSchema7TypeName =
3648
- | "string" //
3649
- | "number"
3650
- | "integer"
3651
- | "boolean"
3652
- | "object"
3653
- | "array"
3654
- | "null";
3655
-
3656
- /**
3657
- * Meta schema
3658
- *
3659
- * Recommended values:
3660
- * - 'http://json-schema.org/schema#'
3661
- * - 'http://json-schema.org/hyper-schema#'
3662
- * - 'http://json-schema.org/draft-07/schema#'
3663
- * - 'http://json-schema.org/draft-07/hyper-schema#'
3664
- *
3665
- * @see https://tools.ietf.org/html/draft-handrews-json-schema-validation-01#section-5
3666
- */
3667
- declare type JSONSchema7Version = string;
3512
+ export { JSONSchema7 }
3668
3513
 
3669
3514
  export declare class JsonToSseTransformStream extends TransformStream<unknown, string> {
3670
3515
  constructor();
@@ -9039,5 +8884,5 @@ export declare function zodSchema<OBJECT>(zodSchema: z4.core.$ZodType<OBJECT, an
9039
8884
  }): Schema<OBJECT>;
9040
8885
 
9041
8886
  export { }
9042
- export { GatewayProviderSettings as GatewayProviderSettings, GatewayProvider as GatewayProvider, ParseResult as ParseResult, FlexibleValidator as FlexibleValidator, ValidationResult as ValidationResult, LanguageModelV2ToolResultOutput as LanguageModelV2ToolResultOutput, ProviderOptions as ProviderOptions, ReasoningPart as ReasoningPart, ToolOutputProperties as ToolOutputProperties, LanguageModelV2ToolResultPart as LanguageModelV2ToolResultPart, FlexibleSchema as FlexibleSchema, JSONSchema7Version as JSONSchema7Version, JSONSchema7Definition as JSONSchema7Definition, JSONSchema7TypeName as JSONSchema7TypeName, JSONSchema7Type as JSONSchema7Type, ImageModelV2CallWarning as ImageModelV2CallWarning, SpeechModelV2CallWarning as SpeechModelV2CallWarning, TranscriptionModelV2CallWarning as TranscriptionModelV2CallWarning, AttributeValue as AttributeValue, Tracer as Tracer, EmbeddingModelV2Embedding as EmbeddingModelV2Embedding, ImageModelV2ProviderMetadata as ImageModelV2ProviderMetadata, GlobalProviderModelId as GlobalProviderModelId, LanguageModelV2FinishReason as LanguageModelV2FinishReason, LanguageModelV2CallWarning as LanguageModelV2CallWarning, LanguageModelV2Middleware as LanguageModelV2Middleware, SharedV2ProviderMetadata as SharedV2ProviderMetadata, LanguageModelV2Usage as LanguageModelV2Usage, Source as Source, ResponseMessage as ResponseMessage, DeepPartialInternal as DeepPartialInternal, LanguageModelV2ToolCall as LanguageModelV2ToolCall, StreamTextOnAbortCallback as StreamTextOnAbortCallback, ValueOf as ValueOf, asUITool as asUITool, _ai_sdk_provider_utils as _ai_sdk_provider_utils, _ai_sdk_provider as _ai_sdk_provider, DataUIMessageChunk as DataUIMessageChunk, ConsumeStreamOptions as ConsumeStreamOptions, Output_2 as Output_2, InferAgentTools as InferAgentTools, SingleRequestTextStreamPart as SingleRequestTextStreamPart, RetryErrorReason as RetryErrorReason, InferSchema as InferSchema, Job as Job, StreamObjectOnErrorCallback as StreamObjectOnErrorCallback, JSONValue_2 as JSONValue_2, LanguageModelV2CallOptions as LanguageModelV2CallOptions, LanguageModelV2 as LanguageModelV2, EmbeddingModelV2 as EmbeddingModelV2, ImageModelV2 as ImageModelV2, TranscriptionModelV2 as TranscriptionModelV2, SpeechModelV2 as SpeechModelV2, ExtractModelId as ExtractModelId, ExtractLiteralUnion as ExtractLiteralUnion, ProviderV2 as ProviderV2, getOriginalFetch as getOriginalFetch, UIMessageStreamResponseInit as UIMessageStreamResponseInit, InferUIMessageToolCall as InferUIMessageToolCall, Validator as Validator, StandardSchemaV1 as StandardSchemaV1, UIDataTypesToSchemas as UIDataTypesToSchemas, InferUIMessageData as InferUIMessageData, InferUIMessageMetadata as InferUIMessageMetadata, InferUIMessageTools as InferUIMessageTools, Resolvable as Resolvable, FetchFunction as FetchFunction };
8887
+ export { GatewayProviderSettings as GatewayProviderSettings, GatewayProvider as GatewayProvider, ParseResult as ParseResult, FlexibleValidator as FlexibleValidator, ValidationResult as ValidationResult, LanguageModelV2ToolResultOutput as LanguageModelV2ToolResultOutput, ProviderOptions as ProviderOptions, ReasoningPart as ReasoningPart, ToolOutputProperties as ToolOutputProperties, LanguageModelV2ToolResultPart as LanguageModelV2ToolResultPart, FlexibleSchema as FlexibleSchema, ImageModelV2CallWarning as ImageModelV2CallWarning, SpeechModelV2CallWarning as SpeechModelV2CallWarning, TranscriptionModelV2CallWarning as TranscriptionModelV2CallWarning, AttributeValue as AttributeValue, Tracer as Tracer, EmbeddingModelV2Embedding as EmbeddingModelV2Embedding, ImageModelV2ProviderMetadata as ImageModelV2ProviderMetadata, GlobalProviderModelId as GlobalProviderModelId, LanguageModelV2FinishReason as LanguageModelV2FinishReason, LanguageModelV2CallWarning as LanguageModelV2CallWarning, LanguageModelV2Middleware as LanguageModelV2Middleware, SharedV2ProviderMetadata as SharedV2ProviderMetadata, LanguageModelV2Usage as LanguageModelV2Usage, Source as Source, ResponseMessage as ResponseMessage, DeepPartialInternal as DeepPartialInternal, LanguageModelV2ToolCall as LanguageModelV2ToolCall, StreamTextOnAbortCallback as StreamTextOnAbortCallback, ValueOf as ValueOf, asUITool as asUITool, _ai_sdk_provider_utils as _ai_sdk_provider_utils, _ai_sdk_provider as _ai_sdk_provider, DataUIMessageChunk as DataUIMessageChunk, ConsumeStreamOptions as ConsumeStreamOptions, Output_2 as Output_2, InferAgentTools as InferAgentTools, SingleRequestTextStreamPart as SingleRequestTextStreamPart, RetryErrorReason as RetryErrorReason, InferSchema as InferSchema, Job as Job, StreamObjectOnErrorCallback as StreamObjectOnErrorCallback, JSONValue_2 as JSONValue_2, LanguageModelV2CallOptions as LanguageModelV2CallOptions, LanguageModelV2 as LanguageModelV2, EmbeddingModelV2 as EmbeddingModelV2, ImageModelV2 as ImageModelV2, TranscriptionModelV2 as TranscriptionModelV2, SpeechModelV2 as SpeechModelV2, ExtractModelId as ExtractModelId, ExtractLiteralUnion as ExtractLiteralUnion, ProviderV2 as ProviderV2, getOriginalFetch as getOriginalFetch, UIMessageStreamResponseInit as UIMessageStreamResponseInit, InferUIMessageToolCall as InferUIMessageToolCall, Validator as Validator, StandardSchemaV1 as StandardSchemaV1, UIDataTypesToSchemas as UIDataTypesToSchemas, InferUIMessageData as InferUIMessageData, InferUIMessageMetadata as InferUIMessageMetadata, InferUIMessageTools as InferUIMessageTools, Resolvable as Resolvable, FetchFunction as FetchFunction };
9043
8888
  export { schemaSymbol, symbol$1, symbol$1_2, symbol$2, symbol$2_2, symbol$3, symbol$3_2, symbol$4, symbol$4_2, symbol$5, symbol$5_2, symbol$6, symbol$6_2, symbol$7, symbol$7_2, symbol$8, symbol$8_2, symbol$9, symbol$9_2, symbol$a, symbol$a_2, symbol$b, symbol$b_2, symbol$c, symbol$c_2, symbol$d, symbol$d_2, symbol$e, symbol, symbol_2, symbol_3, validatorSymbol };
@@ -52,6 +52,8 @@ type VersionSelector = {
52
52
  };
53
53
  type VersionOverrides = {
54
54
  agents?: Record<string, VersionSelector>;
55
+ /** Fallback status for sub-agents (and future primitives) without an explicit entry. */
56
+ defaultStatus?: 'draft' | 'published';
55
57
  };
56
58
  declare function mergeVersionOverrides(base?: VersionOverrides, overrides?: VersionOverrides): VersionOverrides | undefined;
57
59
  declare class RequestContext<Values extends Record<string, any> | unknown = unknown> {
@@ -3,7 +3,7 @@ name: mastra-voice-inworld
3
3
  description: Documentation for @mastra/voice-inworld. Use when working with @mastra/voice-inworld APIs, configuration, or implementation.
4
4
  metadata:
5
5
  package: "@mastra/voice-inworld"
6
- version: "0.2.1-alpha.0"
6
+ version: "0.3.0-alpha.1"
7
7
  ---
8
8
 
9
9
  ## When to use
@@ -17,10 +17,12 @@ Read the individual reference documents for detailed explanations and code examp
17
17
  ### Docs
18
18
 
19
19
  - [Voice in Mastra](references/docs-voice-overview.md) - Overview of voice capabilities in Mastra, including text-to-speech, speech-to-text, and real-time speech-to-speech interactions.
20
+ - [Speech-to-Speech capabilities in Mastra](references/docs-voice-speech-to-speech.md) - Overview of speech-to-speech capabilities in Mastra, including real-time interactions and event-driven architecture.
20
21
 
21
22
  ### Reference
22
23
 
23
24
  - [Reference: Inworld](references/reference-voice-inworld.md) - Documentation for the InworldVoice class, providing streaming text-to-speech and batch speech-to-text capabilities using Inworld AI.
25
+ - [Reference: Inworld Realtime voice](references/reference-voice-inworld-realtime.md) - Documentation for the InworldRealtimeVoice class, providing real-time speech-to-speech, tool calling, and Inworld-specific session controls via WebSockets.
24
26
 
25
27
 
26
28
  Read [assets/SOURCE_MAP.json](assets/SOURCE_MAP.json) for source code references.
@@ -1,5 +1,5 @@
1
1
  {
2
- "version": "0.2.1-alpha.0",
2
+ "version": "0.3.0-alpha.1",
3
3
  "package": "@mastra/voice-inworld",
4
4
  "exports": {},
5
5
  "modules": {}