@mastra/voice-google-gemini-live 0.11.5-alpha.2 → 0.12.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,108 @@
  # @mastra/voice-google-gemini-live
 
+ ## 0.12.0
+
+ ### Minor Changes
+
+ - Surface native-audio behavioral signals on Gemini Live realtime sessions (#17021). ([#17434](https://github.com/mastra-ai/mastra/pull/17434))
+
+ The `@mastra/voice-google-gemini-live` provider now enables transcription and barge-in detection in the setup payload and exposes them through Mastra's standard realtime event contract. This makes native-audio models such as `gemini-2.5-flash-native-audio-preview-12-2025` and `gemini-3.1-flash-live-preview` behaviorally usable end-to-end. Until now, the spoken response was silently dropped on native-audio because it arrives on a different wire channel from the model's internal reasoning.
+
+ **What changed**
+ - Setup payload unconditionally includes `input_audio_transcription`, `output_audio_transcription`, and `realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`, matching how the OpenAI, xAI, Inworld, and AWS Nova Sonic providers enable transcription by default.
+ - User-side transcripts emit as `writing` with `role: 'user'`. Model-side transcripts emit as `writing` with `role: 'assistant'`. This matches the cross-provider `writing` contract.
+ - Barge-in (the server cancelling its in-flight response when the user starts speaking) emits an `interrupt` event with `{ type: 'user', timestamp }`, matching `@mastra/voice-aws-nova-sonic`.
+ - On native-audio models, `modelTurn.parts.text` is the model's internal chain-of-thought, not the spoken response. It now emits as a Gemini-specific `thinking` event so consumers can render reasoning separately. On non-native-audio models, `modelTurn.parts.text` continues to emit as `writing` (it is the spoken response there).
+
+ **Example**
+
+ ```ts
+ import { GeminiLiveVoice } from '@mastra/voice-google-gemini-live';
+
+ const voice = new GeminiLiveVoice({
+   apiKey: process.env.GOOGLE_API_KEY,
+   model: 'gemini-2.5-flash-native-audio-preview-12-2025',
+ });
+
+ voice.on('writing', ({ text, role }) => {
+   // role: 'user' → speech-to-text of the caller
+   // role: 'assistant' → speech-to-text of the model's spoken reply
+ });
+
+ voice.on('thinking', ({ text }) => {
+   // Gemini's internal reasoning on native-audio models
+ });
+
+ voice.on('interrupt', ({ type, timestamp }) => {
+   // Drop queued TTS audio — the user just barged in
+ });
+
+ await voice.connect();
+ ```
+
+ ### Patch Changes
+
+ - Moved shared voice primitives and route metadata into the new `@internal/voice` package so voice providers no longer depend on `@mastra/core` and server voice routes share the same route definitions. ([#16725](https://github.com/mastra-ai/mastra/pull/16725))
+
+ `@mastra/core/voice` continues to re-export the voice APIs for backwards compatibility.
+
+ - Fixed Gemini Live tool registration failing with `1007 Unknown name` errors for tools using discriminated unions, literals, and nullable types. The `sanitizeToolParameters` method now rewrites `oneOf` → `anyOf`, `const` → `enum`, and collapses nullable `anyOf` patterns into OpenAPI 3.0-compatible `type` + `nullable: true` form. Fixes #17020. ([#17179](https://github.com/mastra-ai/mastra/pull/17179))
+
+ - **Fixed** Gemini Live sessions now connect successfully when using native-audio models. Previously the connection failed during session setup. ([#17019](https://github.com/mastra-ai/mastra/pull/17019))
+
+ **Fixed** tools are now invoked correctly. Previously tool calls were silently ignored even when tools were registered during setup.
+
+ **Fixed** tool results of any shape (arrays, primitives, objects) are now accepted. Previously, non-object tool return values caused sessions to close unexpectedly.
+
+ **Fixed** the `speaker` option is now honored when passed at the `VoiceConfig` root alongside `realtimeConfig`, not only when passed in the flat config shape.
+
+ **Changed** default model from `gemini-2.0-flash-exp` (shut down 2025-12-09) to `gemini-3.1-flash-live-preview` (Google's current Live API quickstart model). If you weren't explicitly setting `model`, your sessions will start connecting again.
+
+ Fixes #17018.
+
+ - Updated dependencies [[`00eca42`](https://github.com/mastra-ai/mastra/commit/00eca4252393aa114dc8c9a5e1da68df91fa06cf), [`ff9d743`](https://github.com/mastra-ai/mastra/commit/ff9d743f71d7e072927725c0d700632aca0c1fee)]:
+ - @mastra/schema-compat@1.2.11
+
+ ## 0.12.0-alpha.3
+
+ ### Minor Changes
+
+ - Surface native-audio behavioral signals on Gemini Live realtime sessions (#17021). ([#17434](https://github.com/mastra-ai/mastra/pull/17434))
+
+ The `@mastra/voice-google-gemini-live` provider now enables transcription and barge-in detection in the setup payload and exposes them through Mastra's standard realtime event contract. This makes native-audio models such as `gemini-2.5-flash-native-audio-preview-12-2025` and `gemini-3.1-flash-live-preview` behaviorally usable end-to-end. Until now, the spoken response was silently dropped on native-audio because it arrives on a different wire channel from the model's internal reasoning.
+
+ **What changed**
+ - Setup payload unconditionally includes `input_audio_transcription`, `output_audio_transcription`, and `realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`, matching how the OpenAI, xAI, Inworld, and AWS Nova Sonic providers enable transcription by default.
+ - User-side transcripts emit as `writing` with `role: 'user'`. Model-side transcripts emit as `writing` with `role: 'assistant'`. This matches the cross-provider `writing` contract.
+ - Barge-in (the server cancelling its in-flight response when the user starts speaking) emits an `interrupt` event with `{ type: 'user', timestamp }`, matching `@mastra/voice-aws-nova-sonic`.
+ - On native-audio models, `modelTurn.parts.text` is the model's internal chain-of-thought, not the spoken response. It now emits as a Gemini-specific `thinking` event so consumers can render reasoning separately. On non-native-audio models, `modelTurn.parts.text` continues to emit as `writing` (it is the spoken response there).
+
+ **Example**
+
+ ```ts
+ import { GeminiLiveVoice } from '@mastra/voice-google-gemini-live';
+
+ const voice = new GeminiLiveVoice({
+   apiKey: process.env.GOOGLE_API_KEY,
+   model: 'gemini-2.5-flash-native-audio-preview-12-2025',
+ });
+
+ voice.on('writing', ({ text, role }) => {
+   // role: 'user' → speech-to-text of the caller
+   // role: 'assistant' → speech-to-text of the model's spoken reply
+ });
+
+ voice.on('thinking', ({ text }) => {
+   // Gemini's internal reasoning on native-audio models
+ });
+
+ voice.on('interrupt', ({ type, timestamp }) => {
+   // Drop queued TTS audio — the user just barged in
+ });
+
+ await voice.connect();
+ ```
+
  ## 0.11.5-alpha.2
 
  ### Patch Changes
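The tool-schema fix listed in the 0.12.0 patch notes names three rewrites: `oneOf` to `anyOf`, `const` to `enum`, and nullable `anyOf` collapsed into `type` plus `nullable: true`. The sketch below illustrates that kind of rewrite under stated assumptions. `sanitizeSchema`, `JsonSchema`, and `isObject` are hypothetical names, and the edge-case handling is a guess; this is not the package's `sanitizeToolParameters` implementation.

```typescript
// Hypothetical, simplified JSON Schema shape used only for this sketch.
type JsonSchema = { [key: string]: unknown };

const isObject = (v: unknown): v is JsonSchema =>
  typeof v === 'object' && v !== null && !Array.isArray(v);

// Recursively rewrites a schema toward an OpenAPI 3.0-style subset.
function sanitizeSchema(node: unknown): unknown {
  if (Array.isArray(node)) return node.map(sanitizeSchema);
  if (!isObject(node)) return node;

  const out: JsonSchema = {};
  for (const key of Object.keys(node)) {
    const value = node[key];
    if (key === 'oneOf') {
      out.anyOf = sanitizeSchema(value); // oneOf -> anyOf
    } else if (key === 'const') {
      out.enum = [value]; // const -> single-value enum
    } else if (key === 'properties' && isObject(value)) {
      // Property names are data, not keywords: sanitize only their schemas.
      const props: JsonSchema = {};
      for (const name of Object.keys(value)) props[name] = sanitizeSchema(value[name]);
      out.properties = props;
    } else if (key === 'enum' || key === 'default') {
      out[key] = value; // literal values are copied untouched
    } else {
      out[key] = sanitizeSchema(value);
    }
  }

  // Collapse anyOf: [X, { type: 'null' }] into { ...X, nullable: true }.
  const variants = out.anyOf;
  if (Array.isArray(variants) && variants.length === 2) {
    const nonNull = variants.filter((v: unknown) => !(isObject(v) && v.type === 'null'));
    const only = nonNull[0];
    if (nonNull.length === 1 && isObject(only)) {
      const collapsed: JsonSchema = { ...only, nullable: true };
      for (const k of Object.keys(out)) {
        if (k !== 'anyOf' && !(k in collapsed)) collapsed[k] = out[k];
      }
      return collapsed;
    }
  }
  return out;
}
```

Called on a schema whose property uses `anyOf: [{ type: 'string' }, { type: 'null' }]`, this sketch returns `{ type: 'string', nullable: true }` for that property, while a two-branch `oneOf` with no null variant is only renamed to `anyOf`.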
package/README.md CHANGED
@@ -99,10 +99,21 @@ voice.on('speaker', audioStream => {
  });
 
  voice.on('writing', ({ text, role }) => {
-   // Handle transcribed text
+   // role: 'user' → speech-to-text of the caller
+   // role: 'assistant' → speech-to-text of the model's spoken reply
    console.log(`${role}: ${text}`);
  });
 
+ // Native-audio models only: model's internal reasoning
+ voice.on('thinking', ({ text }) => {
+   console.log(`thinking: ${text}`);
+ });
+
+ // Drop queued playback when the user barges in over the model
+ voice.on('interrupt', ({ type, timestamp }) => {
+   console.log(`interrupt by ${type} at ${timestamp}`);
+ });
+
  // Send text to speech
  await voice.speak('Hello from Mastra!');
 
@@ -313,16 +324,26 @@ Registers an event listener.
 
  - `'speaking'` - Audio response from model
  - `'speaker'` - Readable stream of concatenated audio for the active response
- - `'writing'` - Text response or transcription
+ - `'writing'` - Transcribed text. Callback receives `{ text, role: 'user' | 'assistant' }`. On native-audio models the assistant transcript is driven by the server's `output_audio_transcription` channel
+ - `'thinking'` - Model chain-of-thought / reasoning text on native-audio models. Callback receives `{ text }`. Does not fire on non-native-audio models, where reasoning is not surfaced separately
  - `'error'` - Error events
  - `'session'` - Session state changes
  - `'toolCall'` - Tool calls from model
  - `'vad'` - Voice activity detection events
- - `'interrupt'` - Interrupt events
+ - `'interrupt'` - Emitted on barge-in when the user starts speaking over an in-flight model response. Callback receives `{ type: 'user', timestamp }`
  - `'usage'` - Token usage information
  - `'sessionHandle'` - Session resumption handle
  - `'turnComplete'` - Turn completion for the current model response
 
+ #### Native-audio models
+
+ Native-audio models (any model whose ID contains `native-audio`, e.g. `gemini-2.5-flash-native-audio-preview-12-2025`) split text output across two channels:
+
+ - The model's spoken reply is delivered as audio plus an `output_audio_transcription` transcript — surfaced as `writing` with `role: 'assistant'`.
+ - The model's internal reasoning is delivered as `modelTurn.parts.text` — surfaced as `thinking`.
+
+ On non-native-audio models there is no `output_audio_transcription` channel; `modelTurn.parts.text` is the spoken response itself and is emitted as `writing` (so `thinking` will not fire). Transcription and barge-in detection are enabled automatically in the setup payload — no extra configuration is required.
+
  ### Tools
 
  Add tools with `addTools()` using either `@mastra/core/tools` or a plain object matching `ToolsInput`.
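The README changes above describe three events a consumer can listen for: `writing`, `thinking`, and `interrupt`. One way a client might fold them into display state is sketched below. `TranscriptView`, `VoiceEvents`, and `EventMap` are hypothetical names introduced for illustration; only the event names and payload shapes come from the README text. Clearing the partial assistant text on barge-in is a design assumption of this sketch, not documented behavior.

```typescript
// Payload shapes as described in the README event list.
type EventMap = {
  writing: { text: string; role: 'user' | 'assistant' };
  thinking: { text: string };
  interrupt: { type: 'user'; timestamp: number };
};

// Hypothetical minimal surface: anything exposing a compatible `on` method.
interface VoiceEvents {
  on<K extends keyof EventMap>(event: K, cb: (e: EventMap[K]) => void): void;
}

// Collects transcript fragments into separate buffers for rendering.
class TranscriptView {
  user = '';
  assistant = '';
  reasoning = '';
  interruptions: number[] = [];

  attach(voice: VoiceEvents): void {
    voice.on('writing', ({ text, role }) => {
      // Fragments are appended in arrival order.
      if (role === 'user') this.user += text;
      else this.assistant += text;
    });
    voice.on('thinking', ({ text }) => {
      // Reasoning is kept apart from the spoken reply.
      this.reasoning += text;
    });
    voice.on('interrupt', ({ timestamp }) => {
      // Assumption: a cut-off reply is discarded rather than shown as final.
      this.interruptions.push(timestamp);
      this.assistant = '';
    });
  }
}
```

A `GeminiLiveVoice` instance would be passed to `attach` in place of the interface, assuming its `on` method accepts these event names as the README states.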
@@ -3,7 +3,7 @@ name: mastra-voice-google-gemini-live
  description: Documentation for @mastra/voice-google-gemini-live. Use when working with @mastra/voice-google-gemini-live APIs, configuration, or implementation.
  metadata:
    package: "@mastra/voice-google-gemini-live"
-   version: "0.11.5-alpha.2"
+   version: "0.12.0"
  ---
 
  ## When to use
@@ -1,5 +1,5 @@
  {
-   "version": "0.11.5-alpha.2",
+   "version": "0.12.0",
    "package": "@mastra/voice-google-gemini-live",
    "exports": {},
    "modules": {}
@@ -16,7 +16,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new OpenAIVoice(),
  })
  ```
@@ -40,7 +40,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new OpenAIVoice(),
  })
 
@@ -68,7 +68,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new AzureVoice(),
  })
 
@@ -95,7 +95,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new ElevenLabsVoice(),
  })
 
@@ -122,7 +122,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new PlayAIVoice(),
  })
 
@@ -149,7 +149,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new GoogleVoice(),
  })
 
@@ -176,7 +176,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new CloudflareVoice(),
  })
 
@@ -203,7 +203,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new DeepgramVoice(),
  })
 
@@ -230,7 +230,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new InworldVoice(),
  })
 
@@ -257,7 +257,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new SpeechifyVoice(),
  })
 
@@ -284,7 +284,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new SarvamVoice(),
  })
 
@@ -311,7 +311,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new MurfVoice(),
  })
 
@@ -346,7 +346,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new OpenAIVoice(),
  })
 
@@ -375,7 +375,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new AzureVoice(),
  })
 
@@ -403,7 +403,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new ElevenLabsVoice(),
  })
 
@@ -431,7 +431,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new GoogleVoice(),
  })
 
@@ -459,7 +459,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new CloudflareVoice(),
  })
 
@@ -487,7 +487,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new DeepgramVoice(),
  })
 
@@ -515,7 +515,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new InworldVoice(),
  })
 
@@ -543,7 +543,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new SarvamVoice(),
  })
 
@@ -575,7 +575,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new OpenAIRealtimeVoice(),
  })
 
@@ -605,7 +605,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new GeminiLiveVoice({
      // Live API mode
      apiKey: process.env.GOOGLE_API_KEY,
@@ -654,7 +654,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new NovaSonicVoice({
      region: 'us-east-1',
      speaker: 'matthew',
@@ -697,7 +697,7 @@ const voiceAgent = new Agent({
    id: 'voice-agent',
    name: 'Voice Agent',
    instructions: 'You are a voice assistant that can help users with their tasks.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new InworldRealtimeVoice({
      apiKey: process.env.INWORLD_API_KEY,
      model: 'inworld/models/gemma-4-26b-a4b-it',
@@ -1132,7 +1132,7 @@ const voiceAgent = new Agent({
    id: 'aisdk-voice-agent',
    name: 'AI SDK Voice Agent',
    instructions: 'You are a helpful assistant with voice capabilities.',
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice,
  })
  ```
@@ -32,7 +32,7 @@ const agent = new Agent({
    id: 'agent',
    name: 'OpenAI Realtime Agent',
    instructions: `You are a helpful assistant with real-time voice capabilities.`,
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new OpenAIRealtimeVoice(),
  })
 
@@ -66,7 +66,7 @@ const agent = new Agent({
    name: 'Gemini Live Agent',
    instructions: 'You are a helpful assistant with real-time voice capabilities.',
    // Model used for text generation; voice provider handles realtime audio
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new GeminiLiveVoice({
      apiKey: process.env.GOOGLE_API_KEY,
      model: 'gemini-2.0-flash-exp',
@@ -113,7 +113,7 @@ const agent = new Agent({
    name: 'Nova Sonic Agent',
    instructions: 'You are a helpful assistant with real-time voice capabilities.',
    // Model used for text generation; voice provider handles realtime audio
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new NovaSonicVoice({
      region: 'us-east-1',
      speaker: 'matthew',
@@ -157,7 +157,7 @@ const agent = new Agent({
    name: 'Inworld Realtime Agent',
    instructions: 'You are a helpful assistant with real-time voice capabilities.',
    // Model used for text generation; voice provider handles realtime audio
-   model: 'openai/gpt-5.4',
+   model: 'openai/gpt-5.5',
    voice: new InworldRealtimeVoice({
      apiKey: process.env.INWORLD_API_KEY,
      model: 'inworld/models/gemma-4-26b-a4b-it',
@@ -242,7 +242,9 @@ The GeminiLiveVoice class emits the following events:
 
  **speaking** (`event`): Emitted with audio metadata. Callback receives { audioData?: Int16Array, sampleRate?: number }.
 
- **writing** (`event`): Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }.
+ **writing** (`event`): Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }. On native-audio models the assistant transcript is driven by the server's \`output\_audio\_transcription\` channel rather than \`modelTurn.parts.text\`.
+
+ **thinking** (`event`): Emitted on native-audio models with the model's chain-of-thought / reasoning text from \`modelTurn.parts.text\`. Callback receives { text: string }. Does not fire on non-native-audio models, where \`modelTurn.parts.text\` is the spoken response and is emitted as \`writing\` instead.
 
  **session** (`event`): Emitted on session state changes. Callback receives { state: 'connecting' | 'connected' | 'disconnected' | 'disconnecting' | 'updated', config?: object }.
 
@@ -254,7 +256,18 @@ The GeminiLiveVoice class emits the following events:
 
  **error** (`event`): Emitted when an error occurs. Callback receives { message: string, code?: string, details?: unknown }.
 
- **interrupt** (`event`): Interrupt events. Callback receives { type: 'user' | 'model', timestamp: number }.
+ **interrupt** (`event`): Emitted on barge-in when the user starts speaking over an in-flight model response. The server cancels any further audio for the current turn. Callback receives { type: 'user', timestamp: number }.
+
+ ## Native-audio behavior
+
+ Native-audio Gemini Live models — any model whose ID contains `native-audio`, such as `gemini-2.5-flash-native-audio-preview-12-2025` — split text output across two channels:
+
+ - The model's spoken reply is delivered as audio plus an `output_audio_transcription` transcript and surfaced as `writing` with `role: 'assistant'`.
+ - The model's internal reasoning is delivered as `modelTurn.parts.text` and surfaced as `thinking`.
+
+ On non-native-audio models there is no `output_audio_transcription` channel, so `modelTurn.parts.text` is the spoken response itself and is emitted as `writing`; the `thinking` event does not fire.
+
+ Input transcription, output transcription, and barge-in detection (`realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`) are enabled automatically in the setup payload — no extra configuration is required.
 
  ## Available models
 
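The native-audio documentation above rests on two rules: a model counts as native-audio when its ID contains `native-audio`, and the setup payload always carries the two transcription fields plus the barge-in setting. A minimal sketch of both rules follows. The function names `eventForModelTurnText` and `buildSetup` are hypothetical, and the payload omits fields such as `generation_config` that the real provider also sends; only the field names and the substring rule are taken from the text above.

```typescript
// Substring rule stated in the documentation for detecting native-audio models.
function isNativeAudioModel(modelId: string): boolean {
  return modelId.includes('native-audio');
}

// Which event `modelTurn.parts.text` maps to for a given model ID.
function eventForModelTurnText(modelId: string): 'thinking' | 'writing' {
  return isNativeAudioModel(modelId) ? 'thinking' : 'writing';
}

// Hypothetical builder showing only the fields named in the documentation.
// The real provider resolves the model identifier before sending it.
function buildSetup(modelId: string) {
  return {
    setup: {
      model: modelId,
      // Empty objects switch transcription on with default settings.
      input_audio_transcription: {},
      output_audio_transcription: {},
      // Barge-in: user speech interrupts the in-flight model response.
      realtime_input_config: {
        activity_handling: 'START_OF_ACTIVITY_INTERRUPTS',
      },
    },
  };
}
```

For `gemini-2.5-flash-native-audio-preview-12-2025` the sketch routes model-turn text to `thinking`; for an ID without the substring, such as `gemini-2.0-flash-exp`, it routes to `writing`.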
package/dist/index.cjs CHANGED
@@ -1529,6 +1529,10 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
    requestContext;
    // Store the configuration options
    options;
+   // Accumulates assistant text across `serverContent` frames for the current
+   // turn. Live API streams responses over many frames; we aggregate here and
+   // flush to context history once on `turnComplete`.
+   pendingAssistantResponse = "";
    /**
     * Normalize configuration to ensure proper VoiceConfig format
     * Handles backward compatibility with direct GeminiLiveVoiceConfig
@@ -2483,15 +2487,38 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
      if (!data) {
        return;
      }
-     let assistantResponse = "";
+     if (data.interrupted) {
+       this.log("Model response interrupted by user activity");
+       this.audioStreamManager.cleanupSpeakerStreams();
+       this.pendingAssistantResponse = "";
+       this.emit("interrupt", { type: "user", timestamp: Date.now() });
+     }
+     if (data.inputTranscription?.text) {
+       this.emit("writing", {
+         text: data.inputTranscription.text,
+         role: "user"
+       });
+     }
+     if (data.outputTranscription?.text) {
+       this.pendingAssistantResponse += data.outputTranscription.text;
+       this.emit("writing", {
+         text: data.outputTranscription.text,
+         role: "assistant"
+       });
+     }
+     const nativeAudio = this.isNativeAudioModel();
      if (data.modelTurn?.parts) {
        for (const part of data.modelTurn.parts) {
          if (part.text) {
-           assistantResponse += part.text;
-           this.emit("writing", {
-             text: part.text,
-             role: "assistant"
-           });
+           if (nativeAudio) {
+             this.emit("thinking", { text: part.text });
+           } else {
+             this.pendingAssistantResponse += part.text;
+             this.emit("writing", {
+               text: part.text,
+               role: "assistant"
+             });
+           }
          }
          if (part.functionCall) {
            this.log("Found function call in serverContent.modelTurn.parts", part.functionCall);
@@ -2560,11 +2587,12 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
          }
        }
      }
-     if (assistantResponse.trim()) {
-       this.addToContext("assistant", assistantResponse);
-     }
      if (data.turnComplete) {
        this.log("Turn completed");
+       if (this.pendingAssistantResponse.trim()) {
+         this.addToContext("assistant", this.pendingAssistantResponse);
+       }
+       this.pendingAssistantResponse = "";
        this.audioStreamManager.cleanupSpeakerStreams();
        this.emit("turnComplete", {
          timestamp: Date.now()
@@ -2717,6 +2745,27 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
    getVertexLocation() {
      return this.options.location?.trim() || "us-central1";
    }
+   /**
+    * Whether the active model is a Gemini Live "native-audio" variant.
+    *
+    * Native-audio models emit a different `serverContent.modelTurn.parts.text`
+    * stream than their half-cascade siblings: on native-audio, that text is the
+    * model's internal reasoning (chain-of-thought), and the *spoken* response
+    * arrives separately via `serverContent.outputTranscription.text`. On
+    * non-native-audio models there is no `outputTranscription` channel, and
+    * `modelTurn.parts.text` is the spoken response.
+    *
+    * Used to decide whether `modelTurn.parts.text` should be emitted as
+    * `thinking` (native-audio) or `writing` (non-native-audio). All native-audio
+    * model IDs in `GeminiVoiceModel` contain the literal substring
+    * `native-audio`, so a substring check is sufficient and forward-compatible
+    * with new variants that follow the same naming convention.
+    * @private
+    */
+   isNativeAudioModel() {
+     const model = this.options.model ?? DEFAULT_MODEL;
+     return model.includes("native-audio");
+   }
    /**
     * Resolve the correct model identifier for Gemini API or Vertex AI
     * @private
@@ -2758,7 +2807,20 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
      const setupMessage = {
        setup: {
          model: this.resolveModelIdentifier(),
-         generation_config: generationConfig
+         generation_config: generationConfig,
+         // Transcription is on by default — matches the pattern in @mastra/voice-openai-realtime,
+         // @mastra/voice-xai-realtime, and @mastra/voice-inworld where realtime sessions
+         // unconditionally enable STT in `connect()`. On native-audio models this is the ONLY way
+         // to receive the spoken response as text (`modelTurn.parts.text` carries reasoning, not
+         // speech), so without these flags the assistant's words are silently dropped client-side.
+         input_audio_transcription: {},
+         output_audio_transcription: {},
+         // Activity-based interrupts surface barge-in as `serverContent.interrupted = true` and
+         // cancel the in-flight model response. This is the only way to wire up the `interrupt`
+         // event declared in `GeminiLiveEventMap`.
+         realtime_input_config: {
+           activity_handling: "START_OF_ACTIVITY_INTERRUPTS"
+         }
        }
      };
      if (this.options.instructions) {
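The `dist/index.cjs` hunks above change how assistant text is aggregated: fragments accumulate across `serverContent` frames, flush to history once on `turnComplete`, and are discarded on `interrupted`. The standalone sketch below models that flow so it can be read and exercised in isolation. `TurnAccumulator`, `ServerContent`, and `Emitted` are hypothetical names; audio cleanup, logging, tool calls, and timestamps from the real handler are left out.

```typescript
// Subset of the frame fields read in the diff above (assumed shape).
type ServerContent = {
  interrupted?: boolean;
  inputTranscription?: { text?: string };
  outputTranscription?: { text?: string };
  modelTurn?: { parts?: { text?: string }[] };
  turnComplete?: boolean;
};

type Emitted =
  | { event: 'interrupt'; type: 'user' }
  | { event: 'writing'; text: string; role: 'user' | 'assistant' }
  | { event: 'thinking'; text: string }
  | { event: 'turnComplete' };

class TurnAccumulator {
  private pending = '';
  private nativeAudio: boolean;
  history: string[] = [];

  constructor(nativeAudio: boolean) {
    this.nativeAudio = nativeAudio;
  }

  // Processes one frame and returns the events it would emit, in order.
  handle(data: ServerContent): Emitted[] {
    const out: Emitted[] = [];
    if (data.interrupted) {
      this.pending = ''; // a cut-off reply never reaches history
      out.push({ event: 'interrupt', type: 'user' });
    }
    if (data.inputTranscription?.text) {
      out.push({ event: 'writing', text: data.inputTranscription.text, role: 'user' });
    }
    if (data.outputTranscription?.text) {
      this.pending += data.outputTranscription.text;
      out.push({ event: 'writing', text: data.outputTranscription.text, role: 'assistant' });
    }
    for (const part of data.modelTurn?.parts ?? []) {
      if (!part.text) continue;
      if (this.nativeAudio) {
        out.push({ event: 'thinking', text: part.text }); // reasoning, not speech
      } else {
        this.pending += part.text;
        out.push({ event: 'writing', text: part.text, role: 'assistant' });
      }
    }
    if (data.turnComplete) {
      if (this.pending.trim()) this.history.push(this.pending);
      this.pending = '';
      out.push({ event: 'turnComplete' });
    }
    return out;
  }
}
```

Keeping the buffer on the instance rather than in a per-frame local is what lets a reply split over several frames land in history as one entry, which is the behavior the diff introduces.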