@mastra/voice-google-gemini-live 0.11.5-alpha.2 → 0.12.0
- package/CHANGELOG.md +103 -0
- package/README.md +24 -3
- package/dist/docs/SKILL.md +1 -1
- package/dist/docs/assets/SOURCE_MAP.json +1 -1
- package/dist/docs/references/docs-voice-overview.md +25 -25
- package/dist/docs/references/docs-voice-speech-to-speech.md +4 -4
- package/dist/docs/references/reference-voice-google-gemini-live.md +15 -2
- package/dist/index.cjs +72 -10
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.ts +19 -0
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +72 -10
- package/dist/index.js.map +1 -1
- package/dist/types.d.ts +39 -0
- package/dist/types.d.ts.map +1 -1
- package/package.json +7 -7
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,108 @@
 # @mastra/voice-google-gemini-live
 
+## 0.12.0
+
+### Minor Changes
+
+- Surface native-audio behavioral signals on Gemini Live realtime sessions (#17021). ([#17434](https://github.com/mastra-ai/mastra/pull/17434))
+
+  The `@mastra/voice-google-gemini-live` provider now enables transcription and barge-in detection in the setup payload and exposes them through Mastra's standard realtime event contract. This makes native-audio models such as `gemini-2.5-flash-native-audio-preview-12-2025` and `gemini-3.1-flash-live-preview` behaviorally usable end-to-end. Until now, the spoken response was silently dropped on native-audio because it arrives on a different wire channel from the model's internal reasoning.
+
+  **What changed**
+  - Setup payload unconditionally includes `input_audio_transcription`, `output_audio_transcription`, and `realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`, matching how the OpenAI, xAI, Inworld, and AWS Nova Sonic providers enable transcription by default.
+  - User-side transcripts emit as `writing` with `role: 'user'`. Model-side transcripts emit as `writing` with `role: 'assistant'`. This matches the cross-provider `writing` contract.
+  - Barge-in (the server cancelling its in-flight response when the user starts speaking) emits an `interrupt` event with `{ type: 'user', timestamp }`, matching `@mastra/voice-aws-nova-sonic`.
+  - On native-audio models, `modelTurn.parts.text` is the model's internal chain-of-thought, not the spoken response. It now emits as a Gemini-specific `thinking` event so consumers can render reasoning separately. On non-native-audio models, `modelTurn.parts.text` continues to emit as `writing` (it is the spoken response there).
+
+  **Example**
+
+  ```ts
+  import { GeminiLiveVoice } from '@mastra/voice-google-gemini-live';
+
+  const voice = new GeminiLiveVoice({
+    apiKey: process.env.GOOGLE_API_KEY,
+    model: 'gemini-2.5-flash-native-audio-preview-12-2025',
+  });
+
+  voice.on('writing', ({ text, role }) => {
+    // role: 'user' → speech-to-text of the caller
+    // role: 'assistant' → speech-to-text of the model's spoken reply
+  });
+
+  voice.on('thinking', ({ text }) => {
+    // Gemini's internal reasoning on native-audio models
+  });
+
+  voice.on('interrupt', ({ type, timestamp }) => {
+    // Drop queued TTS audio — the user just barged in
+  });
+
+  await voice.connect();
+  ```
+
+### Patch Changes
+
+- Moved shared voice primitives and route metadata into the new `@internal/voice` package so voice providers no longer depend on `@mastra/core` and server voice routes share the same route definitions. ([#16725](https://github.com/mastra-ai/mastra/pull/16725))
+
+  `@mastra/core/voice` continues to re-export the voice APIs for backwards compatibility.
+
+- Fixed Gemini Live tool registration failing with `1007 Unknown name` errors for tools using discriminated unions, literals, and nullable types. The `sanitizeToolParameters` method now rewrites `oneOf` → `anyOf`, `const` → `enum`, and collapses nullable `anyOf` patterns into OpenAPI 3.0-compatible `type` + `nullable: true` form. Fixes #17020. ([#17179](https://github.com/mastra-ai/mastra/pull/17179))
+
+- **Fixed** Gemini Live sessions now connect successfully when using native-audio models. Previously the connection failed during session setup. ([#17019](https://github.com/mastra-ai/mastra/pull/17019))
+
+  **Fixed** tools are now invoked correctly. Previously tool calls were silently ignored even when tools were registered during setup.
+
+  **Fixed** tool results of any shape (arrays, primitives, objects) are now accepted. Previously, non-object tool return values caused sessions to close unexpectedly.
+
+  **Fixed** the `speaker` option is now honored when passed at the `VoiceConfig` root alongside `realtimeConfig`, not only when passed in the flat config shape.
+
+  **Changed** default model from `gemini-2.0-flash-exp` (shut down 2025-12-09) to `gemini-3.1-flash-live-preview` (Google's current Live API quickstart model). If you weren't explicitly setting `model`, your sessions will start connecting again.
+
+  Fixes #17018.
+
+- Updated dependencies [[`00eca42`](https://github.com/mastra-ai/mastra/commit/00eca4252393aa114dc8c9a5e1da68df91fa06cf), [`ff9d743`](https://github.com/mastra-ai/mastra/commit/ff9d743f71d7e072927725c0d700632aca0c1fee)]:
+  - @mastra/schema-compat@1.2.11
+
+## 0.12.0-alpha.3
+
+### Minor Changes
+
+- Surface native-audio behavioral signals on Gemini Live realtime sessions (#17021). ([#17434](https://github.com/mastra-ai/mastra/pull/17434))
+
+  The `@mastra/voice-google-gemini-live` provider now enables transcription and barge-in detection in the setup payload and exposes them through Mastra's standard realtime event contract. This makes native-audio models such as `gemini-2.5-flash-native-audio-preview-12-2025` and `gemini-3.1-flash-live-preview` behaviorally usable end-to-end. Until now, the spoken response was silently dropped on native-audio because it arrives on a different wire channel from the model's internal reasoning.
+
+  **What changed**
+  - Setup payload unconditionally includes `input_audio_transcription`, `output_audio_transcription`, and `realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`, matching how the OpenAI, xAI, Inworld, and AWS Nova Sonic providers enable transcription by default.
+  - User-side transcripts emit as `writing` with `role: 'user'`. Model-side transcripts emit as `writing` with `role: 'assistant'`. This matches the cross-provider `writing` contract.
+  - Barge-in (the server cancelling its in-flight response when the user starts speaking) emits an `interrupt` event with `{ type: 'user', timestamp }`, matching `@mastra/voice-aws-nova-sonic`.
+  - On native-audio models, `modelTurn.parts.text` is the model's internal chain-of-thought, not the spoken response. It now emits as a Gemini-specific `thinking` event so consumers can render reasoning separately. On non-native-audio models, `modelTurn.parts.text` continues to emit as `writing` (it is the spoken response there).
+
+  **Example**
+
+  ```ts
+  import { GeminiLiveVoice } from '@mastra/voice-google-gemini-live';
+
+  const voice = new GeminiLiveVoice({
+    apiKey: process.env.GOOGLE_API_KEY,
+    model: 'gemini-2.5-flash-native-audio-preview-12-2025',
+  });
+
+  voice.on('writing', ({ text, role }) => {
+    // role: 'user' → speech-to-text of the caller
+    // role: 'assistant' → speech-to-text of the model's spoken reply
+  });
+
+  voice.on('thinking', ({ text }) => {
+    // Gemini's internal reasoning on native-audio models
+  });
+
+  voice.on('interrupt', ({ type, timestamp }) => {
+    // Drop queued TTS audio — the user just barged in
+  });
+
+  await voice.connect();
+  ```
+
 ## 0.11.5-alpha.2
 
 ### Patch Changes
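The schema rewrite described in the 0.12.0 patch note for `sanitizeToolParameters` can be illustrated with a small standalone sketch. It is not the package's implementation: `Schema`, `isObject`, `isNullType`, and `sanitizeSchema` are hypothetical names, and the sketch assumes only the three rewrites the changelog names (`oneOf` to `anyOf`, `const` to a single-value `enum`, and a two-branch nullable `anyOf` collapsed into the non-null branch plus `nullable: true`).

```ts
// Hypothetical, simplified JSON Schema node (not the package's real types).
type Schema = { [key: string]: unknown };

const isObject = (v: unknown): v is Schema =>
  typeof v === 'object' && v !== null && !Array.isArray(v);

const isNullType = (v: unknown): boolean => isObject(v) && v.type === 'null';

// Illustrative stand-in for the rewrite described in the changelog entry.
function sanitizeSchema(node: unknown): unknown {
  if (!isObject(node)) return node;

  const out: Schema = {};
  for (const key of Object.keys(node)) {
    const value = node[key];
    if (key === 'const') {
      out.enum = [value]; // const -> single-value enum
    } else if (key === 'oneOf' || key === 'anyOf' || key === 'allOf') {
      const target = key === 'oneOf' ? 'anyOf' : key; // oneOf -> anyOf
      out[target] = Array.isArray(value) ? value.map(sanitizeSchema) : value;
    } else if (key === 'properties' && isObject(value)) {
      // Property names are data, so only the property schemas are rewritten.
      const props: Schema = {};
      for (const name of Object.keys(value)) props[name] = sanitizeSchema(value[name]);
      out.properties = props;
    } else if (key === 'items') {
      out.items = sanitizeSchema(value);
    } else {
      out[key] = value;
    }
  }

  // Collapse anyOf: [X, { type: 'null' }] into X plus nullable: true.
  const branches = out.anyOf;
  if (Array.isArray(branches) && branches.length === 2) {
    const nonNull = branches.filter(b => !isNullType(b));
    const other: unknown = nonNull[0];
    if (nonNull.length === 1 && isObject(other)) {
      delete out.anyOf;
      return { ...out, ...other, nullable: true };
    }
  }
  return out;
}

const example = sanitizeSchema({
  type: 'object',
  properties: {
    kind: { const: 'circle' },
    shape: { oneOf: [{ type: 'string' }, { type: 'number' }] },
    note: { anyOf: [{ type: 'string' }, { type: 'null' }] },
  },
});
console.log(JSON.stringify(example, null, 2));
```

The sketch recurses only into keywords that hold sub-schemas, so values under `enum` or `default` are copied through untouched; the real method may cover more keywords than this.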
package/README.md
CHANGED
@@ -99,10 +99,21 @@ voice.on('speaker', audioStream => {
 });
 
 voice.on('writing', ({ text, role }) => {
-  //
+  // role: 'user' → speech-to-text of the caller
+  // role: 'assistant' → speech-to-text of the model's spoken reply
   console.log(`${role}: ${text}`);
 });
 
+// Native-audio models only: model's internal reasoning
+voice.on('thinking', ({ text }) => {
+  console.log(`thinking: ${text}`);
+});
+
+// Drop queued playback when the user barges in over the model
+voice.on('interrupt', ({ type, timestamp }) => {
+  console.log(`interrupt by ${type} at ${timestamp}`);
+});
+
 // Send text to speech
 await voice.speak('Hello from Mastra!');
 
@@ -313,16 +324,26 @@ Registers an event listener.
 
 - `'speaking'` - Audio response from model
 - `'speaker'` - Readable stream of concatenated audio for the active response
-- `'writing'` -
+- `'writing'` - Transcribed text. Callback receives `{ text, role: 'user' | 'assistant' }`. On native-audio models the assistant transcript is driven by the server's `output_audio_transcription` channel
+- `'thinking'` - Model chain-of-thought / reasoning text on native-audio models. Callback receives `{ text }`. Does not fire on non-native-audio models, where reasoning is not surfaced separately
 - `'error'` - Error events
 - `'session'` - Session state changes
 - `'toolCall'` - Tool calls from model
 - `'vad'` - Voice activity detection events
-- `'interrupt'` -
+- `'interrupt'` - Emitted on barge-in when the user starts speaking over an in-flight model response. Callback receives `{ type: 'user', timestamp }`
 - `'usage'` - Token usage information
 - `'sessionHandle'` - Session resumption handle
 - `'turnComplete'` - Turn completion for the current model response
 
+#### Native-audio models
+
+Native-audio models (any model whose ID contains `native-audio`, e.g. `gemini-2.5-flash-native-audio-preview-12-2025`) split text output across two channels:
+
+- The model's spoken reply is delivered as audio plus an `output_audio_transcription` transcript — surfaced as `writing` with `role: 'assistant'`.
+- The model's internal reasoning is delivered as `modelTurn.parts.text` — surfaced as `thinking`.
+
+On non-native-audio models there is no `output_audio_transcription` channel; `modelTurn.parts.text` is the spoken response itself and is emitted as `writing` (so `thinking` will not fire). Transcription and barge-in detection are enabled automatically in the setup payload — no extra configuration is required.
+
 ### Tools
 
 Add tools with `addTools()` using either `@mastra/core/tools` or a plain object matching `ToolsInput`.
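A consumer of the events documented in the README hunks above might keep per-role transcripts, collect reasoning text separately, and clear a playback queue on barge-in. The following is a minimal sketch under assumptions: `VoiceLike`, `FakeVoice`, and `attachConversationState` are hypothetical names, and `FakeVoice` is a tiny in-memory emitter standing in for a connected `GeminiLiveVoice` so the example runs without an API key. Only the payload shapes (`{ text, role }`, `{ text }`, `{ audioData? }`, `{ type, timestamp }`) come from the documented events.

```ts
type Role = 'user' | 'assistant';
type Listener = (payload: any) => void;

// Minimal surface this sketch needs from a voice instance (assumption).
interface VoiceLike {
  on(event: string, cb: Listener): unknown;
}

// Hypothetical in-memory emitter standing in for a connected GeminiLiveVoice.
class FakeVoice implements VoiceLike {
  private listeners: { [event: string]: Listener[] } = {};

  on(event: string, cb: Listener): void {
    const list = this.listeners[event] || (this.listeners[event] = []);
    list.push(cb);
  }

  emit(event: string, payload: unknown): void {
    for (const cb of this.listeners[event] || []) cb(payload);
  }
}

// Wires transcript, reasoning, and barge-in handling onto a voice instance.
function attachConversationState(voice: VoiceLike) {
  const state = {
    transcript: { user: '', assistant: '' },
    reasoning: '',
    playbackQueue: [] as Int16Array[],
    interruptions: 0,
  };

  voice.on('writing', ({ text, role }: { text: string; role: Role }) => {
    state.transcript[role] += text; // transcripts arrive in chunks per role
  });
  voice.on('thinking', ({ text }: { text: string }) => {
    state.reasoning += text; // native-audio models only
  });
  voice.on('speaking', ({ audioData }: { audioData?: Int16Array }) => {
    if (audioData) state.playbackQueue.push(audioData);
  });
  voice.on('interrupt', () => {
    state.playbackQueue.length = 0; // drop audio the user talked over
    state.interruptions += 1;
  });

  return state;
}

// Simulated session: the user barges in while the assistant is speaking.
const voice = new FakeVoice();
const state = attachConversationState(voice);
voice.emit('writing', { text: 'What is the weather?', role: 'user' });
voice.emit('speaking', { audioData: new Int16Array(4) });
voice.emit('writing', { text: 'It is', role: 'assistant' });
voice.emit('interrupt', { type: 'user', timestamp: Date.now() });
console.log(state.transcript, state.playbackQueue.length, state.interruptions);
```

Keeping the partial assistant transcript after an interrupt is a design choice of this sketch; an application could equally discard it when `interrupt` fires.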
package/dist/docs/SKILL.md
CHANGED
@@ -3,7 +3,7 @@ name: mastra-voice-google-gemini-live
 description: Documentation for @mastra/voice-google-gemini-live. Use when working with @mastra/voice-google-gemini-live APIs, configuration, or implementation.
 metadata:
   package: "@mastra/voice-google-gemini-live"
-  version: "0.
+  version: "0.12.0"
 ---
 
 ## When to use
package/dist/docs/references/docs-voice-overview.md
CHANGED
@@ -16,7 +16,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new OpenAIVoice(),
 })
 ```
@@ -40,7 +40,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new OpenAIVoice(),
 })
 
@@ -68,7 +68,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new AzureVoice(),
 })
 
@@ -95,7 +95,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new ElevenLabsVoice(),
 })
 
@@ -122,7 +122,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new PlayAIVoice(),
 })
 
@@ -149,7 +149,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new GoogleVoice(),
 })
 
@@ -176,7 +176,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new CloudflareVoice(),
 })
 
@@ -203,7 +203,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new DeepgramVoice(),
 })
 
@@ -230,7 +230,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new InworldVoice(),
 })
 
@@ -257,7 +257,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new SpeechifyVoice(),
 })
 
@@ -284,7 +284,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new SarvamVoice(),
 })
 
@@ -311,7 +311,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new MurfVoice(),
 })
 
@@ -346,7 +346,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new OpenAIVoice(),
 })
 
@@ -375,7 +375,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new AzureVoice(),
 })
 
@@ -403,7 +403,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new ElevenLabsVoice(),
 })
 
@@ -431,7 +431,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new GoogleVoice(),
 })
 
@@ -459,7 +459,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new CloudflareVoice(),
 })
 
@@ -487,7 +487,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new DeepgramVoice(),
 })
 
@@ -515,7 +515,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new InworldVoice(),
 })
 
@@ -543,7 +543,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new SarvamVoice(),
 })
 
@@ -575,7 +575,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new OpenAIRealtimeVoice(),
 })
 
@@ -605,7 +605,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new GeminiLiveVoice({
     // Live API mode
     apiKey: process.env.GOOGLE_API_KEY,
@@ -654,7 +654,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new NovaSonicVoice({
     region: 'us-east-1',
     speaker: 'matthew',
@@ -697,7 +697,7 @@ const voiceAgent = new Agent({
   id: 'voice-agent',
   name: 'Voice Agent',
   instructions: 'You are a voice assistant that can help users with their tasks.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new InworldRealtimeVoice({
     apiKey: process.env.INWORLD_API_KEY,
     model: 'inworld/models/gemma-4-26b-a4b-it',
@@ -1132,7 +1132,7 @@ const voiceAgent = new Agent({
   id: 'aisdk-voice-agent',
   name: 'AI SDK Voice Agent',
   instructions: 'You are a helpful assistant with voice capabilities.',
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice,
 })
 ```
package/dist/docs/references/docs-voice-speech-to-speech.md
CHANGED
@@ -32,7 +32,7 @@ const agent = new Agent({
   id: 'agent',
   name: 'OpenAI Realtime Agent',
   instructions: `You are a helpful assistant with real-time voice capabilities.`,
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new OpenAIRealtimeVoice(),
 })
 
@@ -66,7 +66,7 @@ const agent = new Agent({
   name: 'Gemini Live Agent',
   instructions: 'You are a helpful assistant with real-time voice capabilities.',
   // Model used for text generation; voice provider handles realtime audio
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new GeminiLiveVoice({
     apiKey: process.env.GOOGLE_API_KEY,
     model: 'gemini-2.0-flash-exp',
@@ -113,7 +113,7 @@ const agent = new Agent({
   name: 'Nova Sonic Agent',
   instructions: 'You are a helpful assistant with real-time voice capabilities.',
   // Model used for text generation; voice provider handles realtime audio
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new NovaSonicVoice({
     region: 'us-east-1',
     speaker: 'matthew',
@@ -157,7 +157,7 @@ const agent = new Agent({
   name: 'Inworld Realtime Agent',
   instructions: 'You are a helpful assistant with real-time voice capabilities.',
   // Model used for text generation; voice provider handles realtime audio
-  model: 'openai/gpt-5.
+  model: 'openai/gpt-5.5',
   voice: new InworldRealtimeVoice({
     apiKey: process.env.INWORLD_API_KEY,
     model: 'inworld/models/gemma-4-26b-a4b-it',
package/dist/docs/references/reference-voice-google-gemini-live.md
CHANGED
@@ -242,7 +242,9 @@ The GeminiLiveVoice class emits the following events:
 
 **speaking** (`event`): Emitted with audio metadata. Callback receives { audioData?: Int16Array, sampleRate?: number }.
 
-**writing** (`event`): Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }.
+**writing** (`event`): Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }. On native-audio models the assistant transcript is driven by the server's \`output\_audio\_transcription\` channel rather than \`modelTurn.parts.text\`.
+
+**thinking** (`event`): Emitted on native-audio models with the model's chain-of-thought / reasoning text from \`modelTurn.parts.text\`. Callback receives { text: string }. Does not fire on non-native-audio models, where \`modelTurn.parts.text\` is the spoken response and is emitted as \`writing\` instead.
 
 **session** (`event`): Emitted on session state changes. Callback receives { state: 'connecting' | 'connected' | 'disconnected' | 'disconnecting' | 'updated', config?: object }.
 
@@ -254,7 +256,18 @@ The GeminiLiveVoice class emits the following events:
 
 **error** (`event`): Emitted when an error occurs. Callback receives { message: string, code?: string, details?: unknown }.
 
-**interrupt** (`event`):
+**interrupt** (`event`): Emitted on barge-in when the user starts speaking over an in-flight model response. The server cancels any further audio for the current turn. Callback receives { type: 'user', timestamp: number }.
+
+## Native-audio behavior
+
+Native-audio Gemini Live models — any model whose ID contains `native-audio`, such as `gemini-2.5-flash-native-audio-preview-12-2025` — split text output across two channels:
+
+- The model's spoken reply is delivered as audio plus an `output_audio_transcription` transcript and surfaced as `writing` with `role: 'assistant'`.
+- The model's internal reasoning is delivered as `modelTurn.parts.text` and surfaced as `thinking`.
+
+On non-native-audio models there is no `output_audio_transcription` channel, so `modelTurn.parts.text` is the spoken response itself and is emitted as `writing`; the `thinking` event does not fire.
+
+Input transcription, output transcription, and barge-in detection (`realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`) are enabled automatically in the setup payload — no extra configuration is required.
 
 ## Available models
 
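The routing rules in the reference text above can be condensed into a pure function, which makes the expected events for each model family easy to check. This is an illustrative sketch, not the provider's code: `ServerContent`, `VoiceEvent`, and `routeServerContent` are hypothetical names, the frame shape is simplified, and the `timestamp` field of `interrupt` is left out so the output is deterministic. The `native-audio` substring check follows the rule stated in the documentation.

```ts
// Simplified view of one server frame (assumed shape for illustration).
type ServerContent = {
  interrupted?: boolean;
  inputTranscription?: { text?: string };
  outputTranscription?: { text?: string };
  modelTurn?: { parts?: { text?: string }[] };
};

type VoiceEvent =
  | { name: 'interrupt'; type: 'user' }
  | { name: 'writing'; text: string; role: 'user' | 'assistant' }
  | { name: 'thinking'; text: string };

// Maps one frame to the events a consumer would observe, in order.
function routeServerContent(frame: ServerContent, modelId: string): VoiceEvent[] {
  const events: VoiceEvent[] = [];
  const nativeAudio = modelId.indexOf('native-audio') !== -1;

  if (frame.interrupted) events.push({ name: 'interrupt', type: 'user' });

  const heard = frame.inputTranscription?.text;
  if (heard) events.push({ name: 'writing', text: heard, role: 'user' });

  const spoken = frame.outputTranscription?.text;
  if (spoken) events.push({ name: 'writing', text: spoken, role: 'assistant' });

  for (const part of frame.modelTurn?.parts ?? []) {
    if (!part.text) continue;
    events.push(
      nativeAudio
        ? { name: 'thinking', text: part.text } // reasoning, not speech
        : { name: 'writing', text: part.text, role: 'assistant' }, // spoken reply
    );
  }
  return events;
}

const nativeFrame: ServerContent = {
  outputTranscription: { text: 'Hello there.' },
  modelTurn: { parts: [{ text: 'Greet the caller briefly.' }] },
};
const nativeEvents = routeServerContent(nativeFrame, 'gemini-2.5-flash-native-audio-preview-12-2025');
const cascadeEvents = routeServerContent(
  { modelTurn: { parts: [{ text: 'Hello there.' }] } },
  'gemini-2.0-flash-exp',
);
console.log(nativeEvents, cascadeEvents);
```

Returning a list of events instead of emitting them directly is a choice made for testability in this sketch; the same branching could sit inside an event-emitting message handler.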
package/dist/index.cjs
CHANGED
@@ -1529,6 +1529,10 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
   requestContext;
   // Store the configuration options
   options;
+  // Accumulates assistant text across `serverContent` frames for the current
+  // turn. Live API streams responses over many frames; we aggregate here and
+  // flush to context history once on `turnComplete`.
+  pendingAssistantResponse = "";
   /**
    * Normalize configuration to ensure proper VoiceConfig format
    * Handles backward compatibility with direct GeminiLiveVoiceConfig
@@ -2483,15 +2487,38 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
     if (!data) {
       return;
     }
-
+    if (data.interrupted) {
+      this.log("Model response interrupted by user activity");
+      this.audioStreamManager.cleanupSpeakerStreams();
+      this.pendingAssistantResponse = "";
+      this.emit("interrupt", { type: "user", timestamp: Date.now() });
+    }
+    if (data.inputTranscription?.text) {
+      this.emit("writing", {
+        text: data.inputTranscription.text,
+        role: "user"
+      });
+    }
+    if (data.outputTranscription?.text) {
+      this.pendingAssistantResponse += data.outputTranscription.text;
+      this.emit("writing", {
+        text: data.outputTranscription.text,
+        role: "assistant"
+      });
+    }
+    const nativeAudio = this.isNativeAudioModel();
     if (data.modelTurn?.parts) {
       for (const part of data.modelTurn.parts) {
         if (part.text) {
-
-
-
-
-
+          if (nativeAudio) {
+            this.emit("thinking", { text: part.text });
+          } else {
+            this.pendingAssistantResponse += part.text;
+            this.emit("writing", {
+              text: part.text,
+              role: "assistant"
+            });
+          }
         }
         if (part.functionCall) {
           this.log("Found function call in serverContent.modelTurn.parts", part.functionCall);
@@ -2560,11 +2587,12 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
         }
       }
     }
-    if (assistantResponse.trim()) {
-      this.addToContext("assistant", assistantResponse);
-    }
     if (data.turnComplete) {
       this.log("Turn completed");
+      if (this.pendingAssistantResponse.trim()) {
+        this.addToContext("assistant", this.pendingAssistantResponse);
+      }
+      this.pendingAssistantResponse = "";
       this.audioStreamManager.cleanupSpeakerStreams();
       this.emit("turnComplete", {
         timestamp: Date.now()
@@ -2717,6 +2745,27 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
   getVertexLocation() {
     return this.options.location?.trim() || "us-central1";
   }
+  /**
+   * Whether the active model is a Gemini Live "native-audio" variant.
+   *
+   * Native-audio models emit a different `serverContent.modelTurn.parts.text`
+   * stream than their half-cascade siblings: on native-audio, that text is the
+   * model's internal reasoning (chain-of-thought), and the *spoken* response
+   * arrives separately via `serverContent.outputTranscription.text`. On
+   * non-native-audio models there is no `outputTranscription` channel, and
+   * `modelTurn.parts.text` is the spoken response.
+   *
+   * Used to decide whether `modelTurn.parts.text` should be emitted as
+   * `thinking` (native-audio) or `writing` (non-native-audio). All native-audio
+   * model IDs in `GeminiVoiceModel` contain the literal substring
+   * `native-audio`, so a substring check is sufficient and forward-compatible
+   * with new variants that follow the same naming convention.
+   * @private
+   */
+  isNativeAudioModel() {
+    const model = this.options.model ?? DEFAULT_MODEL;
+    return model.includes("native-audio");
+  }
   /**
    * Resolve the correct model identifier for Gemini API or Vertex AI
    * @private
@@ -2758,7 +2807,20 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
     const setupMessage = {
       setup: {
         model: this.resolveModelIdentifier(),
-        generation_config: generationConfig
+        generation_config: generationConfig,
+        // Transcription is on by default — matches the pattern in @mastra/voice-openai-realtime,
+        // @mastra/voice-xai-realtime, and @mastra/voice-inworld where realtime sessions
+        // unconditionally enable STT in `connect()`. On native-audio models this is the ONLY way
+        // to receive the spoken response as text (`modelTurn.parts.text` carries reasoning, not
+        // speech), so without these flags the assistant's words are silently dropped client-side.
+        input_audio_transcription: {},
+        output_audio_transcription: {},
+        // Activity-based interrupts surface barge-in as `serverContent.interrupted = true` and
+        // cancel the in-flight model response. This is the only way to wire up the `interrupt`
+        // event declared in `GeminiLiveEventMap`.
+        realtime_input_config: {
+          activity_handling: "START_OF_ACTIVITY_INTERRUPTS"
+        }
       }
     };
     if (this.options.instructions) {