@mastra/voice-google-gemini-live 0.11.5-alpha.2 → 0.12.0-alpha.3
- package/CHANGELOG.md +40 -0
- package/README.md +24 -3
- package/dist/docs/SKILL.md +1 -1
- package/dist/docs/assets/SOURCE_MAP.json +1 -1
- package/dist/docs/references/reference-voice-google-gemini-live.md +15 -2
- package/dist/index.cjs +72 -10
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.ts +19 -0
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +72 -10
- package/dist/index.js.map +1 -1
- package/dist/types.d.ts +39 -0
- package/dist/types.d.ts.map +1 -1
- package/package.json +3 -3
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,45 @@
 # @mastra/voice-google-gemini-live
 
+## 0.12.0-alpha.3
+
+### Minor Changes
+
+- Surface native-audio behavioral signals on Gemini Live realtime sessions (#17021). ([#17434](https://github.com/mastra-ai/mastra/pull/17434))
+
+The `@mastra/voice-google-gemini-live` provider now enables transcription and barge-in detection in the setup payload and exposes them through Mastra's standard realtime event contract. This makes native-audio models such as `gemini-2.5-flash-native-audio-preview-12-2025` and `gemini-3.1-flash-live-preview` behaviorally usable end-to-end. Until now, the spoken response was silently dropped on native-audio because it arrives on a different wire channel from the model's internal reasoning.
+
+**What changed**
+- Setup payload unconditionally includes `input_audio_transcription`, `output_audio_transcription`, and `realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`, matching how the OpenAI, xAI, Inworld, and AWS Nova Sonic providers enable transcription by default.
+- User-side transcripts emit as `writing` with `role: 'user'`. Model-side transcripts emit as `writing` with `role: 'assistant'`. This matches the cross-provider `writing` contract.
+- Barge-in (the server cancelling its in-flight response when the user starts speaking) emits an `interrupt` event with `{ type: 'user', timestamp }`, matching `@mastra/voice-aws-nova-sonic`.
+- On native-audio models, `modelTurn.parts.text` is the model's internal chain-of-thought, not the spoken response. It now emits as a Gemini-specific `thinking` event so consumers can render reasoning separately. On non-native-audio models, `modelTurn.parts.text` continues to emit as `writing` (it is the spoken response there).
+
+**Example**
+
+```ts
+import { GeminiLiveVoice } from '@mastra/voice-google-gemini-live';
+
+const voice = new GeminiLiveVoice({
+  apiKey: process.env.GOOGLE_API_KEY,
+  model: 'gemini-2.5-flash-native-audio-preview-12-2025',
+});
+
+voice.on('writing', ({ text, role }) => {
+  // role: 'user' → speech-to-text of the caller
+  // role: 'assistant' → speech-to-text of the model's spoken reply
+});
+
+voice.on('thinking', ({ text }) => {
+  // Gemini's internal reasoning on native-audio models
+});
+
+voice.on('interrupt', ({ type, timestamp }) => {
+  // Drop queued TTS audio — the user just barged in
+});
+
+await voice.connect();
+```
+
 ## 0.11.5-alpha.2
 
 ### Patch Changes
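Reviewer sketch, not part of the package: the changelog describes `writing` events that carry a `role`, and the handler added later in this diff emits each transcription frame as it arrives. A consumer that wants whole utterances might merge consecutive chunks per role. The example below assumes chunks are incremental and uses Node's `EventEmitter` as a stand-in for a voice instance; `collectTranscript` and `TranscriptLine` are hypothetical names.

```typescript
import { EventEmitter } from 'node:events';

type Role = 'user' | 'assistant';

interface TranscriptLine {
  role: Role;
  text: string;
}

// Hypothetical helper: merges consecutive `writing` chunks that share a role
// into a single transcript line. Returns the live array it appends to.
function collectTranscript(source: EventEmitter): TranscriptLine[] {
  const lines: TranscriptLine[] = [];
  source.on('writing', ({ text, role }: { text: string; role: Role }) => {
    const last = lines[lines.length - 1];
    if (last && last.role === role) {
      last.text += text; // same speaker: extend the current line
    } else {
      lines.push({ role, text }); // speaker changed: start a new line
    }
  });
  return lines;
}
```

With a real `GeminiLiveVoice` instance the same listener body could be passed to `voice.on('writing', ...)`, assuming the payload shape shown in the changelog.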
package/README.md
CHANGED
@@ -99,10 +99,21 @@ voice.on('speaker', audioStream => {
 });
 
 voice.on('writing', ({ text, role }) => {
-  //
+  // role: 'user' → speech-to-text of the caller
+  // role: 'assistant' → speech-to-text of the model's spoken reply
   console.log(`${role}: ${text}`);
 });
 
+// Native-audio models only: model's internal reasoning
+voice.on('thinking', ({ text }) => {
+  console.log(`thinking: ${text}`);
+});
+
+// Drop queued playback when the user barges in over the model
+voice.on('interrupt', ({ type, timestamp }) => {
+  console.log(`interrupt by ${type} at ${timestamp}`);
+});
+
 // Send text to speech
 await voice.speak('Hello from Mastra!');
 
@@ -313,16 +324,26 @@ Registers an event listener.
 
 - `'speaking'` - Audio response from model
 - `'speaker'` - Readable stream of concatenated audio for the active response
-- `'writing'` -
+- `'writing'` - Transcribed text. Callback receives `{ text, role: 'user' | 'assistant' }`. On native-audio models the assistant transcript is driven by the server's `output_audio_transcription` channel
+- `'thinking'` - Model chain-of-thought / reasoning text on native-audio models. Callback receives `{ text }`. Does not fire on non-native-audio models, where reasoning is not surfaced separately
 - `'error'` - Error events
 - `'session'` - Session state changes
 - `'toolCall'` - Tool calls from model
 - `'vad'` - Voice activity detection events
-- `'interrupt'` -
+- `'interrupt'` - Emitted on barge-in when the user starts speaking over an in-flight model response. Callback receives `{ type: 'user', timestamp }`
 - `'usage'` - Token usage information
 - `'sessionHandle'` - Session resumption handle
 - `'turnComplete'` - Turn completion for the current model response
 
+#### Native-audio models
+
+Native-audio models (any model whose ID contains `native-audio`, e.g. `gemini-2.5-flash-native-audio-preview-12-2025`) split text output across two channels:
+
+- The model's spoken reply is delivered as audio plus an `output_audio_transcription` transcript — surfaced as `writing` with `role: 'assistant'`.
+- The model's internal reasoning is delivered as `modelTurn.parts.text` — surfaced as `thinking`.
+
+On non-native-audio models there is no `output_audio_transcription` channel; `modelTurn.parts.text` is the spoken response itself and is emitted as `writing` (so `thinking` will not fire). Transcription and barge-in detection are enabled automatically in the setup payload — no extra configuration is required.
+
 ### Tools
 
 Add tools with `addTools()` using either `@mastra/core/tools` or a plain object matching `ToolsInput`.
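Reviewer sketch, not part of the package: the README comment about dropping queued playback on barge-in could look roughly like the following on the consumer side. It assumes the application keeps its own queue of PCM chunks; `PlaybackQueue` and `wireBargeIn` are hypothetical names, and Node's `EventEmitter` stands in for the voice instance.

```typescript
import { EventEmitter } from 'node:events';

// Hypothetical application-side buffer of audio chunks awaiting playback.
class PlaybackQueue {
  private chunks: Int16Array[] = [];

  enqueue(chunk: Int16Array): void {
    this.chunks.push(chunk);
  }

  // Discards everything still queued and reports how many chunks were dropped.
  clear(): number {
    const dropped = this.chunks.length;
    this.chunks = [];
    return dropped;
  }

  get length(): number {
    return this.chunks.length;
  }
}

// Clears the queue whenever an `interrupt` event of type 'user' arrives,
// matching the `{ type: 'user', timestamp }` payload described in the README.
function wireBargeIn(source: EventEmitter, queue: PlaybackQueue): void {
  source.on('interrupt', ({ type }: { type: string; timestamp: number }) => {
    if (type === 'user') {
      queue.clear();
    }
  });
}
```

Clearing locally matters because audio the client has already buffered would otherwise keep playing after the server has cancelled the rest of the turn.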
package/dist/docs/SKILL.md
CHANGED
@@ -3,7 +3,7 @@ name: mastra-voice-google-gemini-live
 description: Documentation for @mastra/voice-google-gemini-live. Use when working with @mastra/voice-google-gemini-live APIs, configuration, or implementation.
 metadata:
   package: "@mastra/voice-google-gemini-live"
-  version: "0.
+  version: "0.12.0-alpha.3"
 ---
 
 ## When to use
package/dist/docs/references/reference-voice-google-gemini-live.md
CHANGED
@@ -242,7 +242,9 @@ The GeminiLiveVoice class emits the following events:
 
 **speaking** (`event`): Emitted with audio metadata. Callback receives { audioData?: Int16Array, sampleRate?: number }.
 
-**writing** (`event`): Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }.
+**writing** (`event`): Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }. On native-audio models the assistant transcript is driven by the server's \`output\_audio\_transcription\` channel rather than \`modelTurn.parts.text\`.
+
+**thinking** (`event`): Emitted on native-audio models with the model's chain-of-thought / reasoning text from \`modelTurn.parts.text\`. Callback receives { text: string }. Does not fire on non-native-audio models, where \`modelTurn.parts.text\` is the spoken response and is emitted as \`writing\` instead.
 
 **session** (`event`): Emitted on session state changes. Callback receives { state: 'connecting' | 'connected' | 'disconnected' | 'disconnecting' | 'updated', config?: object }.
 
@@ -254,7 +256,18 @@ The GeminiLiveVoice class emits the following events:
 
 **error** (`event`): Emitted when an error occurs. Callback receives { message: string, code?: string, details?: unknown }.
 
-**interrupt** (`event`):
+**interrupt** (`event`): Emitted on barge-in when the user starts speaking over an in-flight model response. The server cancels any further audio for the current turn. Callback receives { type: 'user', timestamp: number }.
+
+## Native-audio behavior
+
+Native-audio Gemini Live models — any model whose ID contains `native-audio`, such as `gemini-2.5-flash-native-audio-preview-12-2025` — split text output across two channels:
+
+- The model's spoken reply is delivered as audio plus an `output_audio_transcription` transcript and surfaced as `writing` with `role: 'assistant'`.
+- The model's internal reasoning is delivered as `modelTurn.parts.text` and surfaced as `thinking`.
+
+On non-native-audio models there is no `output_audio_transcription` channel, so `modelTurn.parts.text` is the spoken response itself and is emitted as `writing`; the `thinking` event does not fire.
+
+Input transcription, output transcription, and barge-in detection (`realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`) are enabled automatically in the setup payload — no extra configuration is required.
 
 ## Available models
 
package/dist/index.cjs
CHANGED
@@ -1529,6 +1529,10 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
   requestContext;
   // Store the configuration options
   options;
+  // Accumulates assistant text across `serverContent` frames for the current
+  // turn. Live API streams responses over many frames; we aggregate here and
+  // flush to context history once on `turnComplete`.
+  pendingAssistantResponse = "";
   /**
    * Normalize configuration to ensure proper VoiceConfig format
    * Handles backward compatibility with direct GeminiLiveVoiceConfig
@@ -2483,15 +2487,38 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
     if (!data) {
       return;
     }
-
+    if (data.interrupted) {
+      this.log("Model response interrupted by user activity");
+      this.audioStreamManager.cleanupSpeakerStreams();
+      this.pendingAssistantResponse = "";
+      this.emit("interrupt", { type: "user", timestamp: Date.now() });
+    }
+    if (data.inputTranscription?.text) {
+      this.emit("writing", {
+        text: data.inputTranscription.text,
+        role: "user"
+      });
+    }
+    if (data.outputTranscription?.text) {
+      this.pendingAssistantResponse += data.outputTranscription.text;
+      this.emit("writing", {
+        text: data.outputTranscription.text,
+        role: "assistant"
+      });
+    }
+    const nativeAudio = this.isNativeAudioModel();
     if (data.modelTurn?.parts) {
       for (const part of data.modelTurn.parts) {
         if (part.text) {
-
-
-
-
-
+          if (nativeAudio) {
+            this.emit("thinking", { text: part.text });
+          } else {
+            this.pendingAssistantResponse += part.text;
+            this.emit("writing", {
+              text: part.text,
+              role: "assistant"
+            });
+          }
         }
         if (part.functionCall) {
           this.log("Found function call in serverContent.modelTurn.parts", part.functionCall);
@@ -2560,11 +2587,12 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
         }
       }
     }
-    if (assistantResponse.trim()) {
-      this.addToContext("assistant", assistantResponse);
-    }
     if (data.turnComplete) {
       this.log("Turn completed");
+      if (this.pendingAssistantResponse.trim()) {
+        this.addToContext("assistant", this.pendingAssistantResponse);
+      }
+      this.pendingAssistantResponse = "";
       this.audioStreamManager.cleanupSpeakerStreams();
       this.emit("turnComplete", {
         timestamp: Date.now()
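Reviewer sketch, not part of the package: the two hunks above move context-history writes from once per frame to once per turn. The reduced model below tries to capture that accumulate, reset, flush sequence in isolation. `TurnBuffer` and `ServerContentFrame` are hypothetical names, and the frame shape is trimmed to the three fields the diff reads for this purpose.

```typescript
// Hypothetical, trimmed-down view of a `serverContent` frame.
interface ServerContentFrame {
  interrupted?: boolean;
  outputTranscription?: { text?: string };
  turnComplete?: boolean;
}

class TurnBuffer {
  private pending = '';
  // Stand-in for the provider's context history.
  readonly history: string[] = [];

  handle(frame: ServerContentFrame): void {
    // Barge-in discards whatever was accumulated for the cancelled response.
    if (frame.interrupted) {
      this.pending = '';
    }
    // Transcript fragments are appended as they stream in.
    if (frame.outputTranscription?.text) {
      this.pending += frame.outputTranscription.text;
    }
    // The whole turn is written to history once, then the buffer resets.
    if (frame.turnComplete) {
      if (this.pending.trim()) {
        this.history.push(this.pending);
      }
      this.pending = '';
    }
  }
}
```

The checks run in the same order as in the diff, so a frame that carries both `interrupted` and new transcript text clears the buffer first and then appends the new text.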
@@ -2717,6 +2745,27 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
   getVertexLocation() {
     return this.options.location?.trim() || "us-central1";
   }
+  /**
+   * Whether the active model is a Gemini Live "native-audio" variant.
+   *
+   * Native-audio models emit a different `serverContent.modelTurn.parts.text`
+   * stream than their half-cascade siblings: on native-audio, that text is the
+   * model's internal reasoning (chain-of-thought), and the *spoken* response
+   * arrives separately via `serverContent.outputTranscription.text`. On
+   * non-native-audio models there is no `outputTranscription` channel, and
+   * `modelTurn.parts.text` is the spoken response.
+   *
+   * Used to decide whether `modelTurn.parts.text` should be emitted as
+   * `thinking` (native-audio) or `writing` (non-native-audio). All native-audio
+   * model IDs in `GeminiVoiceModel` contain the literal substring
+   * `native-audio`, so a substring check is sufficient and forward-compatible
+   * with new variants that follow the same naming convention.
+   * @private
+   */
+  isNativeAudioModel() {
+    const model = this.options.model ?? DEFAULT_MODEL;
+    return model.includes("native-audio");
+  }
   /**
    * Resolve the correct model identifier for Gemini API or Vertex AI
    * @private
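Reviewer sketch, not part of the package: the routing decision that `isNativeAudioModel()` feeds can be restated as a pure function over the model ID. `eventForModelText` is a hypothetical name; the substring test is copied from the hunk above.

```typescript
type TextEvent = 'thinking' | 'writing';

// Same substring test as the diff's `isNativeAudioModel()`.
function isNativeAudioModel(modelId: string): boolean {
  return modelId.includes('native-audio');
}

// Hypothetical helper: which event `modelTurn.parts.text` is emitted as.
function eventForModelText(modelId: string): TextEvent {
  return isNativeAudioModel(modelId) ? 'thinking' : 'writing';
}

console.log(eventForModelText('gemini-2.5-flash-native-audio-preview-12-2025'));
console.log(eventForModelText('gemini-3.1-flash-live-preview'));
```

The second ID is worth checking against the changelog: it is listed there among the native-audio models, but it does not contain the `native-audio` substring, so as read from this diff its `modelTurn.parts.text` would be routed to `writing` rather than `thinking`.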
@@ -2758,7 +2807,20 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
     const setupMessage = {
       setup: {
         model: this.resolveModelIdentifier(),
-        generation_config: generationConfig
+        generation_config: generationConfig,
+        // Transcription is on by default — matches the pattern in @mastra/voice-openai-realtime,
+        // @mastra/voice-xai-realtime, and @mastra/voice-inworld where realtime sessions
+        // unconditionally enable STT in `connect()`. On native-audio models this is the ONLY way
+        // to receive the spoken response as text (`modelTurn.parts.text` carries reasoning, not
+        // speech), so without these flags the assistant's words are silently dropped client-side.
+        input_audio_transcription: {},
+        output_audio_transcription: {},
+        // Activity-based interrupts surface barge-in as `serverContent.interrupted = true` and
+        // cancel the in-flight model response. This is the only way to wire up the `interrupt`
+        // event declared in `GeminiLiveEventMap`.
+        realtime_input_config: {
+          activity_handling: "START_OF_ACTIVITY_INTERRUPTS"
+        }
       }
     };
     if (this.options.instructions) {
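Reviewer sketch, not part of the package: the setup payload after this change, written as a standalone builder so the wire shape is easy to inspect. Field names are taken from the hunk above; `buildSetupMessage` and the `SetupMessage` type are hypothetical, and `generationConfig` is treated as an opaque record because its contents are not shown in this diff.

```typescript
interface SetupMessage {
  setup: {
    model: string;
    generation_config: Record<string, unknown>;
    input_audio_transcription: Record<string, never>;
    output_audio_transcription: Record<string, never>;
    realtime_input_config: { activity_handling: 'START_OF_ACTIVITY_INTERRUPTS' };
  };
}

// Hypothetical builder mirroring the fields the diff adds to the setup message.
function buildSetupMessage(
  model: string,
  generationConfig: Record<string, unknown>,
): SetupMessage {
  return {
    setup: {
      model,
      generation_config: generationConfig,
      // Empty objects switch transcription on with default settings.
      input_audio_transcription: {},
      output_audio_transcription: {},
      // Lets the start of user speech interrupt the in-flight response.
      realtime_input_config: {
        activity_handling: 'START_OF_ACTIVITY_INTERRUPTS',
      },
    },
  };
}

console.log(JSON.stringify(buildSetupMessage('example-model', {}), null, 2));
```

Because the diff adds these fields unconditionally, a caller has no option to turn transcription or barge-in off through this payload as shown.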