@mastra/voice-inworld 0.2.1-alpha.0 → 0.3.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,137 @@
1
1
  # @mastra/voice-inworld
2
2
 
3
+ ## 0.3.0
4
+
5
+ ### Minor Changes
6
+
7
+ - `@mastra/voice-inworld` now ships `InworldRealtimeVoice` for full-duplex realtime voice — mic in, speakers out, server-side LLM routing, semantic VAD turn-taking, tool calling, barge-in, and live transcripts of both sides — alongside the existing streaming TTS and batch STT. No separate package needed; import both from the same entry point. ([#16865](https://github.com/mastra-ai/mastra/pull/16865))
8
+
9
+ ```typescript
10
+ // Streaming TTS / batch STT (unchanged)
11
+ import { InworldVoice } from '@mastra/voice-inworld';
12
+
13
+ // New: realtime full-duplex voice, from the same package
14
+ import { InworldRealtimeVoice } from '@mastra/voice-inworld';
15
+
16
+ const voice = new InworldRealtimeVoice({
17
+ apiKey: process.env.INWORLD_API_KEY,
18
+ // Defaults: model 'inworld/models/gemma-4-26b-a4b-it', speaker 'Sarah',
19
+ // STT 'inworld/inworld-stt-1', semantic-VAD turn detection.
20
+ });
21
+
22
+ await voice.connect();
23
+ voice.on('speaker', stream => playAudio(stream)); // PCM16 @ 24kHz
24
+ voice.on('writing', ({ text, role }) => console.log(role, text));
25
+ voice.on('interrupted', ({ response_id }) => stopAudio(response_id));
26
+ await voice.send(getMicrophoneStream());
27
+ ```
28
+
29
+ **Typed `providerData` for Inworld realtime extensions**
30
+
31
+ `InworldRealtimeVoice` now accepts a typed `providerData` object for Inworld-specific extensions — STT tuning, TTS segmentation and steering, automatic memory, back-channel, and responsiveness — sent under `session.providerData`. The provider also surfaces inbound extension data: a `voiceProfile` on user `writing` events, a `memory` event for the rolling summary/facts state, and `backchannel` / `backchannel.done` / `backchannel.skipped` events for back-channel audio.
32
+
33
+ ```typescript
34
+ const voice = new InworldRealtimeVoice({
35
+ providerData: {
36
+ stt: { voice_profile: true, language_hints: ['en-US'] },
37
+ tts: { delivery_mode: 'CREATIVE', segmenter_strategy: 'balanced' },
38
+ memory: { enabled: true, turn_interval: 4 },
39
+ backchannel: { enabled: true, max_per_turn: 1 },
40
+ },
41
+ });
42
+
43
+ voice.on('memory', state => console.log(state.summary, state.facts));
44
+ voice.on('backchannel', stream => playAudio(stream));
45
+ voice.on('writing', ({ role, voiceProfile }) => console.log(role, voiceProfile?.emotion));
46
+ ```
47
+
48
+ **Realtime fixes and additions**
49
+ - Fixed the per-call `speak(text, { speaker })` voice override. It is now sent as the flat `response.voice` field, so the per-call speaker is no longer silently ignored by the server.
50
+ - Added manual turn-taking methods `commitInput()`, `clearInput()`, and `clearOutput()` for push-to-talk and manual turn control (use `clearOutput()` only to hard-stop all playback — it also stops in-flight back-channels).
51
+ - Added smart-turn and playback-state events: `turn-suggestion`, `turn-suggestion-revoked`, `input-committed`, `input-cleared`, `input-timeout`, and `output-audio-started` / `output-audio-stopped` / `output-audio-cleared`.
52
+ - Added richer typed session config: input noise reduction, telephony (8 kHz) and float32 audio formats, a server-VAD `idle_timeout_ms`, plus `tracing`, `include`, and `prompt`.
53
+
54
+ ```typescript
55
+ // Push-to-talk with no auto-VAD
56
+ const voice = new InworldRealtimeVoice({
57
+ session: { audio: { input: { turn_detection: null } } },
58
+ });
59
+
60
+ await voice.send(getMicrophoneStream());
61
+ voice.commitInput(); // end the user turn manually
62
+
63
+ voice.on('output-audio-stopped', () => console.log('playback finished'));
64
+ ```
65
+
66
+ ### Patch Changes
67
+
68
+ - Moved shared voice primitives and route metadata into the new `@internal/voice` package so voice providers no longer depend on `@mastra/core` and server voice routes share the same route definitions. ([#16725](https://github.com/mastra-ai/mastra/pull/16725))
69
+
70
+ `@mastra/core/voice` continues to re-export the voice APIs for backwards compatibility.
71
+
72
+ ## 0.3.0-alpha.1
73
+
74
+ ### Minor Changes
75
+
76
+ - `@mastra/voice-inworld` now ships `InworldRealtimeVoice` for full-duplex realtime voice — mic in, speakers out, server-side LLM routing, semantic VAD turn-taking, tool calling, barge-in, and live transcripts of both sides — alongside the existing streaming TTS and batch STT. No separate package needed; import both from the same entry point. ([#16865](https://github.com/mastra-ai/mastra/pull/16865))
77
+
78
+ ```typescript
79
+ // Streaming TTS / batch STT (unchanged)
80
+ import { InworldVoice } from '@mastra/voice-inworld';
81
+
82
+ // New: realtime full-duplex voice, from the same package
83
+ import { InworldRealtimeVoice } from '@mastra/voice-inworld';
84
+
85
+ const voice = new InworldRealtimeVoice({
86
+ apiKey: process.env.INWORLD_API_KEY,
87
+ // Defaults: model 'inworld/models/gemma-4-26b-a4b-it', speaker 'Sarah',
88
+ // STT 'inworld/inworld-stt-1', semantic-VAD turn detection.
89
+ });
90
+
91
+ await voice.connect();
92
+ voice.on('speaker', stream => playAudio(stream)); // PCM16 @ 24kHz
93
+ voice.on('writing', ({ text, role }) => console.log(role, text));
94
+ voice.on('interrupted', ({ response_id }) => stopAudio(response_id));
95
+ await voice.send(getMicrophoneStream());
96
+ ```
97
+
98
+ **Typed `providerData` for Inworld realtime extensions**
99
+
100
+ `InworldRealtimeVoice` now accepts a typed `providerData` object for Inworld-specific extensions — STT tuning, TTS segmentation and steering, automatic memory, back-channel, and responsiveness — sent under `session.providerData`. The provider also surfaces inbound extension data: a `voiceProfile` on user `writing` events, a `memory` event for the rolling summary/facts state, and `backchannel` / `backchannel.done` / `backchannel.skipped` events for back-channel audio.
101
+
102
+ ```typescript
103
+ const voice = new InworldRealtimeVoice({
104
+ providerData: {
105
+ stt: { voice_profile: true, language_hints: ['en-US'] },
106
+ tts: { delivery_mode: 'CREATIVE', segmenter_strategy: 'balanced' },
107
+ memory: { enabled: true, turn_interval: 4 },
108
+ backchannel: { enabled: true, max_per_turn: 1 },
109
+ },
110
+ });
111
+
112
+ voice.on('memory', state => console.log(state.summary, state.facts));
113
+ voice.on('backchannel', stream => playAudio(stream));
114
+ voice.on('writing', ({ role, voiceProfile }) => console.log(role, voiceProfile?.emotion));
115
+ ```
116
+
117
+ **Realtime fixes and additions**
118
+ - Fixed the per-call `speak(text, { speaker })` voice override. It is now sent as the flat `response.voice` field, so the per-call speaker is no longer silently ignored by the server.
119
+ - Added manual turn-taking methods `commitInput()`, `clearInput()`, and `clearOutput()` for push-to-talk and manual turn control (use `clearOutput()` only to hard-stop all playback — it also stops in-flight back-channels).
120
+ - Added smart-turn and playback-state events: `turn-suggestion`, `turn-suggestion-revoked`, `input-committed`, `input-cleared`, `input-timeout`, and `output-audio-started` / `output-audio-stopped` / `output-audio-cleared`.
121
+ - Added richer typed session config: input noise reduction, telephony (8 kHz) and float32 audio formats, a server-VAD `idle_timeout_ms`, plus `tracing`, `include`, and `prompt`.
122
+
123
+ ```typescript
124
+ // Push-to-talk with no auto-VAD
125
+ const voice = new InworldRealtimeVoice({
126
+ session: { audio: { input: { turn_detection: null } } },
127
+ });
128
+
129
+ await voice.send(getMicrophoneStream());
130
+ voice.commitInput(); // end the user turn manually
131
+
132
+ voice.on('output-audio-stopped', () => console.log('playback finished'));
133
+ ```
134
+
3
135
  ## 0.2.1-alpha.0
4
136
 
5
137
  ### Patch Changes
package/README.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # @mastra/voice-inworld
2
2
 
3
- [Inworld AI](https://inworld.ai) voice provider for [Mastra](https://mastra.ai) — streaming TTS and batch STT.
3
+ [Inworld AI](https://inworld.ai) voice provider for [Mastra](https://mastra.ai) — streaming TTS, batch STT, and realtime full-duplex voice.
4
4
 
5
5
  ## Installation
6
6
 
@@ -106,6 +106,341 @@ Alex, Ashley, Craig, Deborah, Dennis, Dominus, Edward, Elizabeth, Hades, Heitor,
106
106
 
107
107
  The `speak()` method uses Inworld's streaming TTS endpoint (`/tts/v1/voice:stream`), returning audio chunks progressively as they are generated. This is ideal for agentic workflows where low time-to-first-audio matters.
108
108
 
109
+ ## Realtime (full-duplex) voice
110
+
111
+ Alongside `InworldVoice` (streaming TTS and batch STT), this package ships `InworldRealtimeVoice`: full-duplex, websocket-based speech-to-speech with tool calling, barge-in, and live transcripts of both sides of the conversation. Inworld runs the LLM server-side via its router, so you don't need a second model client.
112
+
113
+ Inworld's wire protocol is the OpenAI Realtime GA spec — same client/server event names (`conversation.item.added`, `conversation.item.done`, `response.output_audio.delta`, etc.). The provider-level differences are:
114
+
115
+ - Endpoint: `wss://api.inworld.ai/api/v1/realtime/session?key=<sessionId>&protocol=realtime`. The model is configured via the initial `session.update`, not the URL.
116
+ - Auth: `Authorization: Basic <key>` (Inworld keys ship pre-encoded; pass verbatim).
117
+ - Typed first-class session knobs (`audio.output.speed`, `audio.output.model`, `audio.input.turn_detection`, `audio.input.transcription`, `output_modalities`, `tool_choice`, …) via the `session` constructor field.
118
+ - A typed `providerData` object for Inworld extensions (STT tuning, TTS segmentation/steering, automatic memory, back-channel, responsiveness, plus `user_id`/`metadata`), sent under `session.providerData`.
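
As a rough illustration of the endpoint and auth bullets above, the sketch below assembles the websocket URL and the `Authorization` header. `buildRealtimeRequest` and `RealtimeRequest` are hypothetical names, not exports of `@mastra/voice-inworld`, and URL-encoding the session id is an assumption; the package's internal logic may differ.

```typescript
// Hypothetical helper (not part of @mastra/voice-inworld): it only restates
// the endpoint and auth rules from the list above.
interface RealtimeRequest {
  url: string;
  headers: Record<string, string>;
}

const DEFAULT_URL = 'wss://api.inworld.ai/api/v1/realtime/session';

function buildRealtimeRequest(
  apiKey: string,
  sessionId: string = `voice-${Date.now()}`, // documented default for `sessionId`
  baseUrl: string = DEFAULT_URL,
): RealtimeRequest {
  // The session id travels as ?key=...; the model is not part of the URL.
  // Assumption: the id is URL-encoded before being appended.
  const url = `${baseUrl}?key=${encodeURIComponent(sessionId)}&protocol=realtime`;
  // Inworld keys ship Basic-encoded, so the key is passed verbatim (no base64 step).
  return { url, headers: { Authorization: `Basic ${apiKey}` } };
}

const req = buildRealtimeRequest('abc123==', 'voice-42');
console.log(req.url);
console.log(req.headers.Authorization);
```

The model, voice, and other settings would then be sent in the initial `session.update` rather than encoded in the URL.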
119
+
120
+ ### Usage
121
+
122
+ ```typescript
123
+ import { InworldRealtimeVoice } from '@mastra/voice-inworld';
124
+
125
+ const voice = new InworldRealtimeVoice({
126
+ apiKey: process.env.INWORLD_API_KEY,
127
+ model: 'inworld/models/gemma-4-26b-a4b-it',
128
+ speaker: 'Sarah',
129
+ instructions: 'You are a helpful voice assistant.',
130
+ // Typed first-class session knobs:
131
+ session: {
132
+ audio: {
133
+ output: { speed: 1.1 },
134
+ input: {
135
+ transcription: { model: 'inworld/inworld-stt-1' },
136
+ turn_detection: { type: 'semantic_vad', eagerness: 'high' },
137
+ },
138
+ },
139
+ },
140
+ });
141
+
142
+ await voice.connect();
143
+
144
+ voice.on('speaker', stream => {
145
+ // PCM16 @ 24kHz stream — pipe to your audio output
146
+ });
147
+
148
+ voice.on('writing', ({ text, role }) => {
149
+ // Transcription / assistant text
150
+ });
151
+
152
+ voice.on('error', err => {
153
+ console.error('Voice error:', err);
154
+ });
155
+
156
+ await voice.speak('Hello from Mastra!');
157
+
158
+ // Tool integration
159
+ voice.addTools({
160
+ search: searchTool,
161
+ });
162
+
163
+ // Streaming audio in
164
+ await voice.send(microphoneStream);
165
+
166
+ // Stop
167
+ voice.close();
168
+ ```
169
+
170
+ ### Options
171
+
172
+ | Option | Type | Default | Description |
173
+ | ------------------ | ------------------------------- | ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
174
+ | `apiKey` | `string` | `process.env.INWORLD_API_KEY` | Inworld API key (Basic-encoded, passed verbatim). |
175
+ | `url` | `string` | `wss://api.inworld.ai/api/v1/realtime/session` | Realtime websocket endpoint. |
176
+ | `model` | `string` | `inworld/models/gemma-4-26b-a4b-it` | LLM Router model ID. |
177
+ | `speaker` | `string` | `Sarah` | Default voice ID. Any catalog voice is accepted. |
178
+ | `sessionId` | `string` | `voice-{Date.now()}` | Client-generated session key surfaced as the URL `?key=` parameter. |
179
+ | `instructions` | `string` | `undefined` | System prompt sent with the initial `session.update`. |
180
+ | `session` | `Partial<InworldSessionConfig>` | `undefined` | Typed first-class session knobs (see below). Deep-merged into every `session.update`. |
181
+ | `debug` | `boolean` | `false` | Log raw server events. |
182
+ | `providerData` | `InworldProviderData` | `undefined` | Inworld extension config sent under `session.providerData`; composes with `session.providerData` (constructor option wins on collision). |
183
+ | `connectTimeoutMs` | `number` | `15000` | Max time `connect()` will wait for both the WebSocket handshake and the initial `session.updated` round-trip. A pre-open error/close or timeout becomes a rejected `connect()`. |
184
+
185
+ #### `session` (typed knobs)
186
+
187
+ Use the typed `session` field for known Inworld realtime options. Fields compose with the connect-time defaults (e.g. `audio.output.voice` set from `speaker`):
188
+
189
+ ```typescript
190
+ new InworldRealtimeVoice({
191
+ speaker: 'Dennis',
192
+ session: {
193
+ output_modalities: ['audio', 'text'],
194
+ audio: {
195
+ output: { speed: 1.15, model: 'inworld-tts-2' },
196
+ input: {
197
+ transcription: { model: 'inworld/inworld-stt-1', language: 'en-US' },
198
+ turn_detection: { type: 'semantic_vad', eagerness: 'medium' },
199
+ },
200
+ },
201
+ tool_choice: { type: 'mcp', server_label: 'my-mcp' },
202
+ temperature: 0.6,
203
+ },
204
+ });
205
+ ```
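
To make the "fields compose with the connect-time defaults" behaviour concrete, here is a minimal deep-merge sketch. `deepMerge` and `DEFAULT_SESSION` are illustrative names, not package exports, and the exact merge semantics (arrays and `null` replacing rather than merging) are an assumption.

```typescript
// Hypothetical sketch: models how a user-supplied `session` could be layered
// over connect-time defaults. Not the package's actual implementation.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

function isPlainObject(value: Json | undefined): value is JsonObject {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function deepMerge(base: JsonObject, override: JsonObject): JsonObject {
  const out: JsonObject = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const current = out[key];
    // Nested objects merge key by key; scalars, arrays, and null replace outright.
    out[key] = isPlainObject(current) && isPlainObject(value) ? deepMerge(current, value) : value;
  }
  return out;
}

// Defaults as described in this README (speaker 'Sarah', semantic VAD, STT model).
const DEFAULT_SESSION: JsonObject = {
  audio: {
    output: { voice: 'Sarah' },
    input: {
      transcription: { model: 'inworld/inworld-stt-1' },
      turn_detection: { type: 'semantic_vad', eagerness: 'medium', create_response: true, interrupt_response: true },
    },
  },
};

// Push-to-talk: keep the default voice and transcription, disable auto-VAD.
const pushToTalk = deepMerge(DEFAULT_SESSION, {
  audio: { output: { speed: 1.15 }, input: { turn_detection: null } },
});
console.log(JSON.stringify(pushToTalk, null, 2));
```

Under this sketch, setting `turn_detection` to `null` replaces the default object instead of merging into it, which matches the documented way to disable turn detection.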
206
+
207
+ Other typed `session` fields:
208
+
209
+ - `audio.input.noise_reduction`: `{ type: 'near_field' | 'far_field' }` — denoise input before VAD/transcription.
210
+ - `audio.{input,output}.format`: a codec string or `{ type, rate? }` object. Supports telephony 8 kHz (`audio/pcmu`, `audio/pcma`) and `audio/float32`; `rate` (Hz) applies to `audio/pcm` and `audio/float32` (default 24000).
211
+ - `audio.input.turn_detection.idle_timeout_ms`: `server_vad` only — idle window before the server commits a turn.
212
+ - `audio.input.transcription.prompt`: bias transcription with vocabulary/spelling/style hints.
213
+ - `tracing`: `'auto'` or `{ workflow_name?, group_id?, metadata? }`.
214
+ - `include`: opt-in extra event fields (e.g. `['item.input_audio_transcription.logprobs']`).
215
+ - `prompt`: a server-side prompt-template reference (string or `null`).
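
The `format` rules in the list above can be sketched as a small normalizer. `resolveFormat` and `bytesPerSecond` are hypothetical helpers, and the sample widths (16-bit for `audio/pcm`, 8-bit for the G.711 codecs, 32-bit for `audio/float32`) and mono channel layout are assumptions used only to size buffers.

```typescript
// Hypothetical sketch of the `audio.{input,output}.format` rules; not a package API.
type AudioCodec = 'audio/pcm' | 'audio/pcmu' | 'audio/pcma' | 'audio/float32';
type AudioFormat = AudioCodec | { type: AudioCodec; rate?: number };

function resolveFormat(format: AudioFormat): { type: AudioCodec; rate: number } {
  const spec: { type: AudioCodec; rate?: number } = typeof format === 'string' ? { type: format } : format;
  // Telephony codecs run at 8 kHz; `rate` only applies to pcm and float32.
  if (spec.type === 'audio/pcmu' || spec.type === 'audio/pcma') {
    return { type: spec.type, rate: 8000 };
  }
  return { type: spec.type, rate: spec.rate ?? 24000 };
}

// Assumed sample widths in bytes (mono audio).
const BYTES_PER_SAMPLE: Record<AudioCodec, number> = {
  'audio/pcm': 2,
  'audio/pcmu': 1,
  'audio/pcma': 1,
  'audio/float32': 4,
};

function bytesPerSecond(format: AudioFormat): number {
  const { type, rate } = resolveFormat(format);
  return rate * BYTES_PER_SAMPLE[type];
}

console.log(resolveFormat('audio/pcm'));
console.log(bytesPerSecond({ type: 'audio/float32', rate: 16000 }));
```

A helper like this is mainly useful for sizing playback buffers or converting chunk byte counts into durations.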
216
+
217
+ #### `providerData` (Inworld extensions)
218
+
219
+ `providerData` is a typed object for Inworld-specific realtime extensions. It's sent under `session.providerData` on every `session.update` and composes with any `session.providerData` you set via the `session` field — the constructor option wins on key collisions.
220
+
221
+ It has five branches plus two session-level fields:
222
+
223
+ - `stt`: STT tuning (`prompt`, `voice_profile`, `language_hints`, VAD/end-of-turn thresholds).
224
+ - `tts`: TTS segmentation and delivery (`segmenter_strategy`, `steering_handling`, `delivery_mode`, `conversational`, `user_turn_mode`, `language`).
225
+ - `memory`: automatic rolling memory (`enabled`, `turn_interval`, `max_facts`, …). Inworld echoes its state back via the `memory` event.
226
+ - `backchannel`: short acknowledgements ("uh-huh") while the user speaks. Audio arrives on the `backchannel` event.
227
+ - `responsiveness`: early "filler" audio while the main response generates. Filler audio reuses the normal `speaker`/`speaking` path — there are no distinct events.
228
+ - `user_id` and `metadata`: session-level identifiers passed through to Inworld.
229
+
230
+ ```typescript
231
+ new InworldRealtimeVoice({
232
+ providerData: {
233
+ stt: { voice_profile: true, language_hints: ['en-US'] },
234
+ tts: { delivery_mode: 'CREATIVE', segmenter_strategy: 'balanced' },
235
+ memory: { enabled: true, turn_interval: 4 },
236
+ backchannel: { enabled: true, max_per_turn: 1 },
237
+ user_id: 'user-123',
238
+ },
239
+ });
240
+ ```
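
The collision rule described above (the constructor `providerData` wins over `session.providerData`) can be sketched as follows. `composeProviderData` is a hypothetical helper, and merging at the top-level key (rather than deeply within each branch) is an assumption.

```typescript
// Hypothetical sketch of the documented precedence; not the package's code.
type ProviderData = Record<string, unknown>;

function composeProviderData(
  fromSession: ProviderData | undefined,
  fromConstructor: ProviderData | undefined,
): ProviderData {
  // The later spread wins, so constructor keys override session keys on collision.
  return { ...fromSession, ...fromConstructor };
}

const merged = composeProviderData(
  { user_id: 'from-session', memory: { enabled: false } },
  { user_id: 'user-123', backchannel: { enabled: true, max_per_turn: 1 } },
);
console.log(merged);
```

Here `user_id` resolves to the constructor value, while `memory` (set only via `session`) and `backchannel` (set only via the constructor) both survive.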
241
+
242
+ ### Events
243
+
244
+ `on()` and `off()` are typed against `InworldVoiceEventMap`: known event names give you a typed callback payload; unknown event names fall back to `unknown`.
245
+
246
+ | Event | Payload |
247
+ | ------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
248
+ | `speaking` | `{ audio: Buffer; response_id: string }` |
249
+ | `speaking.done` | `{ response_id: string }` |
250
+ | `speaker` | `PassThrough` stream of PCM audio |
251
+ | `writing` | `{ text: string; response_id: string; role: 'assistant' \| 'user'; voiceProfile? }`. Deduplicated across audio-transcript + text deltas in one response. `voiceProfile` is present on user events when `providerData.stt.voice_profile` is enabled. |
252
+ | `speech-started` | Raw server `input_audio_buffer.speech_started` payload (VAD edge). |
253
+ | `speech-stopped` | Raw server `input_audio_buffer.speech_stopped` payload (VAD edge). |
254
+ | `interrupted` | `{ response_id: string }`. Synthesized once per in-flight response when the user starts speaking. |
255
+ | `turn-suggestion` | `{ item_id, utterance_index, probability, trailing_silence_ms?, audio_duration_ms?, inference_ms? }`. Smart-turn endpointing hint for a buffered user utterance. |
256
+ | `turn-suggestion-revoked` | `{ item_id, utterance_index }`. A prior `turn-suggestion` was retracted. |
257
+ | `input-committed` | `{ item_id, previous_item_id? }`. Buffered input audio was committed as a user turn (`previous_item_id` may be `null`). |
258
+ | `input-cleared` | `{}`. Buffered input audio was discarded. |
259
+ | `input-timeout` | `{ audio_start_ms, audio_end_ms, item_id }`. Server-VAD idle timeout committed a user turn. |
260
+ | `output-audio-started` | `{}`. Server began emitting output audio. |
261
+ | `output-audio-stopped` | `{}`. Server stopped emitting output audio for the current response. |
262
+ | `output-audio-cleared` | `{}`. Server output audio buffer was flushed (playback stopped). |
263
+ | `memory` | `InworldMemoryState` — Inworld's rolling summary/facts state (deduped by version). Requires `providerData.memory.enabled`. |
264
+ | `backchannel` | `PassThrough` stream of back-channel PCM audio. Requires `providerData.backchannel.enabled`. |
265
+ | `backchannel.done` | `{ backchannel_id: string; phrase? }`. Fires when a back-channel finishes. |
266
+ | `backchannel.skipped` | `{ reason: string }`. Fires when the decider skips a back-channel before any audio. |
267
+ | `response.created` | Full server event. |
268
+ | `response.done` | Full server event. |
269
+ | `conversation.item.added` | Full server event. |
270
+ | `conversation.item.done` | Full server event. |
271
+ | `function_call.arguments` | `{ call_id, name, arguments }` JSON. |
272
+ | `tool-call-start` | `{ toolCallId, toolName, args, … }`. |
273
+ | `tool-call-result` | `{ toolCallId, …, result }`. |
274
+ | `error` | `Error` (or a server error event). |
275
+
276
+ #### Barge-in
277
+
278
+ `speech-started` and `speech-stopped` mirror Inworld's raw VAD edges. `interrupted` is a synthetic, client-side signal: whenever `speech-started` fires while one or more responses are in flight, `interrupted` is emitted once per active `response_id`. Listen to `interrupted` to stop audio playback without having to track response state yourself.
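
The synthesis rule in the paragraph above (one `interrupted` per in-flight `response_id` when `speech-started` fires) can be modelled with a small tracker. `BargeInTracker` is a hypothetical class written for illustration; clearing the in-flight set after an interruption is an assumption about how "once per response" is enforced.

```typescript
// Hypothetical model of client-side `interrupted` synthesis; not the package's code.
class BargeInTracker {
  private readonly inFlight = new Set<string>();

  // Call on `response.created`.
  responseCreated(responseId: string): void {
    this.inFlight.add(responseId);
  }

  // Call on `response.done`.
  responseDone(responseId: string): void {
    this.inFlight.delete(responseId);
  }

  // Call on `speech-started`; returns the ids to emit `interrupted` for.
  // Each id is returned at most once because the set is cleared afterwards.
  speechStarted(): string[] {
    const interrupted = Array.from(this.inFlight);
    this.inFlight.clear();
    return interrupted;
  }
}

const tracker = new BargeInTracker();
tracker.responseCreated('resp_1');
tracker.responseCreated('resp_2');
tracker.responseDone('resp_1'); // finished before the user spoke
const firstEdge = tracker.speechStarted();
const secondEdge = tracker.speechStarted();
console.log(firstEdge, secondEdge);
```

Back-channel ids never enter this set in the sketch, which mirrors the statement that `interrupted` only ever carries `response_id`s.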
279
+
280
+ **Back-channels are exempt from barge-in — by design.** Back-channel audio (`"uh-huh"`, `"I see"`) is meant to play _while the user is still speaking_, so it must NOT be cut off when they do. The two audio kinds are kept on separate channels so this falls out naturally:
281
+
282
+ - Main response audio arrives on the **`speaker`** event; each stream's `.id` is a `response_id`. `interrupted` only ever carries `response_id`s, so stopping the `speaker` stream whose id matches `interrupted.response_id` stops main content instantly on barge-in.
283
+ - Back-channel audio arrives on the **`backchannel`** event; each stream's `.id` is a `backchannel_id`. These ids never appear in `interrupted`, and the SDK never sends `response.cancel` for them — so a client that keys players by stream id and only kills on `interrupted` will leave back-channels playing automatically.
284
+
285
+ The one footgun: don't blanket-kill all players on barge-in. Keep back-channel players in a separate collection (or just key by stream `.id` and match only `interrupted.response_id`). Each back-channel stream ends itself on `backchannel.done`, so its player exits naturally.
286
+
287
+ ```typescript
288
+ const players = new Map(); // main response audio — stopped on barge-in
289
+ const bcPlayers = new Map(); // back-channels — never stopped on barge-in
290
+
291
+ voice.on('speaker', stream => startPlayer(players, stream.id, stream));
292
+ voice.on('interrupted', ({ response_id }) => players.get(response_id)?.stop()); // main only
293
+
294
+ voice.on('backchannel', stream => startPlayer(bcPlayers, stream.id, stream)); // overlaps user speech
295
+ ```
296
+
297
+ Enable back-channels with `providerData: { backchannel: { enabled: true } }` (gated by server prerequisites — contact your Inworld account team).
298
+
299
+ #### Manual turn-taking & playback signals
300
+
301
+ For push-to-talk or manual turn-taking (set `turn_detection` to `null` to disable auto-VAD), drive turns yourself:
302
+
303
+ - `commitInput()` — commit buffered input audio as a user turn.
304
+ - `clearInput()` — discard buffered input audio.
305
+ - `clearOutput()` — clear the server's **entire** output audio buffer, stopping playback. This also stops any in-flight **back-channel** audio. The default barge-in path (`response.cancel` on `interrupted`) is back-channel-safe; prefer it. Use `clearOutput()` only when you explicitly want to flush everything.
306
+
307
+ ```typescript
308
+ voice.commitInput(); // end the current user turn
309
+ voice.clearInput(); // throw away what's buffered
310
+ voice.clearOutput(); // hard-stop all playback (back-channels included)
311
+ ```
312
+
313
+ The server emits matching signals you can listen to:
314
+
315
+ - `turn-suggestion` / `turn-suggestion-revoked` — smart-turn endpointing hints (and retractions) for a buffered utterance.
316
+ - `input-committed` / `input-cleared` — acknowledgements for `commitInput()` / `clearInput()` (also fire on auto-VAD commits).
317
+ - `input-timeout` — a server-VAD idle timeout committed a user turn.
318
+ - `output-audio-started` / `output-audio-stopped` / `output-audio-cleared` — output playback state on the server.
319
+
320
+ ```typescript
321
+ voice.on('turn-suggestion', ({ probability }) => console.log('end-of-turn?', probability));
322
+ voice.on('input-committed', ({ item_id }) => console.log('committed', item_id));
323
+ voice.on('output-audio-stopped', () => console.log('playback finished'));
324
+ ```
325
+
326
+ #### Awaitable `speak()`
327
+
328
+ `speak()` resolves only after the full response lifecycle completes (`response.done` for the response it triggered). It rejects if the response is interrupted by user speech, or on a transport error. Serial calls are the supported pattern — concurrent `speak()` calls share the same listener pool and have undefined response-pinning order.
329
+
330
+ #### Default `turn_detection`
331
+
332
+ `audio.input.turn_detection` defaults to `{ type: 'semantic_vad', eagerness: 'medium', create_response: true, interrupt_response: true }`. To override, set `session.audio.input.turn_detection` to your own object. To disable turn detection entirely, set it to `null`.
333
+
334
+ `eagerness` controls how quickly semantic VAD ends a user turn — `low` waits for clearer pauses (more interruption-resistant), `high` ends turns sooner (snappier, more prone to cutting users off). Default `medium` balances both.
335
+
336
+ #### Default `transcription`
337
+
338
+ `audio.input.transcription` defaults to `{ model: 'inworld/inworld-stt-1' }`, so user-side `writing` events (with `role: 'user'`) fire out of the box. To override, set `session.audio.input.transcription` to your own object. To disable user-side transcription, set it to `null`.
339
+
340
+ ### Full CLI example
341
+
342
+ A complete, terminal-based demo wiring `InworldRealtimeVoice` into a Mastra `Agent` with mic input, speaker output, semantic-VAD turn-taking, barge-in, and tool calling — all in one file.
343
+
344
+ Prereqs: Node 22+, `sox` (provides `sox` and `play`; `brew install sox` on macOS), `INWORLD_API_KEY`.
345
+
346
+ The same code as a clone-and-run repo: [github.com/cshape/inworld-mastra-cli-demo](https://github.com/cshape/inworld-mastra-cli-demo).
347
+
348
+ ```typescript
349
+ import 'dotenv/config';
350
+ import { spawn, type ChildProcess } from 'node:child_process';
351
+ import { Agent } from '@mastra/core/agent';
352
+ import { createTool } from '@mastra/core/tools';
353
+ import { InworldRealtimeVoice } from '@mastra/voice-inworld';
354
+ import { z } from 'zod';
355
+
356
+ const getCurrentTime = createTool({
357
+ id: 'get-current-time',
358
+ description: 'Returns the current local time.',
359
+ inputSchema: z.object({}),
360
+ outputSchema: z.object({ time: z.string() }),
361
+ execute: async () => ({ time: new Date().toLocaleTimeString() }),
362
+ });
363
+
364
+ const voice = new InworldRealtimeVoice({
365
+ model: 'openai/gpt-5.4-nano',
366
+ speaker: 'Jason',
367
+ session: {
368
+ audio: {
369
+ input: {
370
+ transcription: { model: 'inworld/inworld-stt-1', language: 'en-US' },
371
+ turn_detection: { type: 'semantic_vad', eagerness: 'high', interrupt_response: true },
372
+ },
373
+ output: { model: 'inworld-tts-2', speed: 1.0 },
374
+ },
375
+ temperature: 0.7,
376
+ max_output_tokens: 150,
377
+ },
378
+ });
379
+
380
+ new Agent({
381
+ id: 'voice-demo',
382
+ name: 'Voice Demo',
383
+ instructions:
384
+ 'You are a concise voice assistant. Reply in one or two short sentences. Use the get-current-time tool when asked the time.',
385
+ model: 'n/a',
386
+ tools: { getCurrentTime },
387
+ voice,
388
+ });
389
+
390
+ const SOX = ['-t', 'raw', '-r', '24000', '-e', 'signed', '-b', '16', '-c', '1', '-q', '-'];
391
+ const players = new Map<string, ChildProcess>();
392
+
393
+ voice.on('speaker', stream => {
394
+ // Any new response supersedes the prior one — kill leftover players so
395
+ // a missed barge-in can't leave two streams playing at once.
396
+ for (const p of players.values()) p.kill('SIGTERM');
397
+ players.clear();
398
+ const id = (stream as unknown as { id: string }).id;
399
+ const player = spawn('play', SOX, { stdio: ['pipe', 'ignore', 'ignore'] });
400
+ players.set(id, player);
401
+ // Swallow EPIPE when `play` exits while the PassThrough still has buffered frames.
402
+ player.stdin!.on('error', () => {});
403
+ stream.pipe(player.stdin!);
404
+ player.on('exit', () => players.delete(id));
405
+ });
406
+
407
+ voice.on('interrupted', ({ response_id }) => players.get(response_id)?.kill('SIGTERM'));
408
+
409
+ let lastRole: 'user' | 'assistant' | null = null;
410
+ voice.on('writing', ({ text, role }) => {
411
+ if (role !== lastRole) {
412
+ process.stdout.write(role === 'user' ? '\n[you] ' : '\n[bot] ');
413
+ lastRole = role;
414
+ }
415
+ process.stdout.write(text);
416
+ });
417
+
418
+ voice.on('tool-call-start', ({ toolName }) => console.log(`\n[tool] ${toolName}`));
419
+ voice.on('error', err => console.error('\n[error]', err));
420
+
421
+ await voice.connect();
422
+ console.log('Connected. Use headphones for best experience. Speak when ready. Ctrl+C to exit.');
423
+
424
+ const mic = spawn('sox', ['-d', ...SOX], { stdio: ['ignore', 'pipe', 'ignore'] });
425
+ await voice.send(mic.stdout);
426
+
427
+ process.on('SIGINT', () => {
428
+ mic.kill('SIGTERM');
429
+ for (const p of players.values()) p.kill('SIGTERM');
430
+ voice.close();
431
+ process.exit(0);
432
+ });
433
+ ```
434
+
435
+ ### Realtime protocol notes
436
+
437
+ These match what the live API emits (verified via raw-websocket smoke tests):
438
+
439
+ - Audio default: PCM16 @ 24kHz. Also supports telephony `audio/pcmu` / `audio/pcma` @ 8kHz and `audio/float32`.
440
+ - Server emits `session.created` on connect (older docs claim it doesn't).
441
+ - Function call args arrive via `response.function_call_arguments.delta` (singular). Some docs say plural; the docs are wrong.
442
+ - Audio deltas arrive on `response.output_audio.delta` / `…audio.done` (GA spec), not the older `response.audio.delta`.
443
+
109
444
  ## Authentication
110
445
 
111
- Set your API key via the `INWORLD_API_KEY` environment variable or pass it in the config. Get your key from [platform.inworld.ai](https://platform.inworld.ai) → Settings → API Keys.
446
+ Set your API key via the `INWORLD_API_KEY` environment variable or pass it in the config. Get your key from [platform.inworld.ai](https://platform.inworld.ai) → Settings → API Keys. Inworld API keys ship **already Basic-encoded** — paste verbatim; the package will not re-encode the key.