@mastra/voice-google-gemini-live 0.11.5-alpha.2 → 0.12.0-alpha.3

package/CHANGELOG.md CHANGED
@@ -1,5 +1,45 @@
  # @mastra/voice-google-gemini-live

+ ## 0.12.0-alpha.3
+
+ ### Minor Changes
+
+ - Surface native-audio behavioral signals on Gemini Live realtime sessions (#17021). ([#17434](https://github.com/mastra-ai/mastra/pull/17434))
+
+ The `@mastra/voice-google-gemini-live` provider now enables transcription and barge-in detection in the setup payload and exposes them through Mastra's standard realtime event contract. This makes native-audio models such as `gemini-2.5-flash-native-audio-preview-12-2025` and `gemini-3.1-flash-live-preview` behaviorally usable end-to-end. Until now, the spoken response was silently dropped on native-audio because it arrives on a different wire channel from the model's internal reasoning.
+
+ **What changed**
+ - Setup payload unconditionally includes `input_audio_transcription`, `output_audio_transcription`, and `realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`, matching how the OpenAI, xAI, Inworld, and AWS Nova Sonic providers enable transcription by default.
+ - User-side transcripts emit as `writing` with `role: 'user'`. Model-side transcripts emit as `writing` with `role: 'assistant'`. This matches the cross-provider `writing` contract.
+ - Barge-in (the server cancelling its in-flight response when the user starts speaking) emits an `interrupt` event with `{ type: 'user', timestamp }`, matching `@mastra/voice-aws-nova-sonic`.
+ - On native-audio models, `modelTurn.parts.text` is the model's internal chain-of-thought, not the spoken response. It now emits as a Gemini-specific `thinking` event so consumers can render reasoning separately. On non-native-audio models, `modelTurn.parts.text` continues to emit as `writing` (it is the spoken response there).
+
+ **Example**
+
+ ```ts
+ import { GeminiLiveVoice } from '@mastra/voice-google-gemini-live';
+
+ const voice = new GeminiLiveVoice({
+   apiKey: process.env.GOOGLE_API_KEY,
+   model: 'gemini-2.5-flash-native-audio-preview-12-2025',
+ });
+
+ voice.on('writing', ({ text, role }) => {
+   // role: 'user' → speech-to-text of the caller
+   // role: 'assistant' → speech-to-text of the model's spoken reply
+ });
+
+ voice.on('thinking', ({ text }) => {
+   // Gemini's internal reasoning on native-audio models
+ });
+
+ voice.on('interrupt', ({ type, timestamp }) => {
+   // Drop queued TTS audio — the user just barged in
+ });
+
+ await voice.connect();
+ ```
+
  ## 0.11.5-alpha.2

  ### Patch Changes
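
The event contract described in the changelog entry above can be exercised without a live session. The following is a minimal sketch under stated assumptions: `VoiceEvents`, `FakeVoice`, and `collectTranscript` are hypothetical names that are not part of `@mastra/voice-google-gemini-live`, and the payload types are inferred from the entry's wording rather than taken from the package's own typings. Only the event names (`writing`, `thinking`, `interrupt`) and their documented payload fields come from the source.

```typescript
// Payload shapes as described in the changelog entry; the package's real types may differ.
type EventMap = {
  writing: { text: string; role: 'user' | 'assistant' };
  thinking: { text: string };
  interrupt: { type: 'user'; timestamp: number };
};

// Hypothetical minimal surface: anything exposing a compatible `on` method.
interface VoiceEvents {
  on<K extends keyof EventMap>(event: K, cb: (payload: EventMap[K]) => void): void;
}

// Hypothetical stand-in for a connected voice instance, used only for this demo.
class FakeVoice implements VoiceEvents {
  private handlers = new Map<string, Array<(payload: unknown) => void>>();

  on<K extends keyof EventMap>(event: K, cb: (payload: EventMap[K]) => void): void {
    const list = this.handlers.get(event) ?? [];
    list.push(cb as (payload: unknown) => void);
    this.handlers.set(event, list);
  }

  emit<K extends keyof EventMap>(event: K, payload: EventMap[K]): void {
    for (const cb of this.handlers.get(event) ?? []) cb(payload);
  }
}

interface TranscriptState {
  user: string;
  assistant: string;
  thinking: string;
  interruptions: number;
}

// Keeps spoken transcripts, reasoning text, and barge-in counts in separate buckets.
function collectTranscript(voice: VoiceEvents): TranscriptState {
  const state: TranscriptState = { user: '', assistant: '', thinking: '', interruptions: 0 };
  voice.on('writing', ({ text, role }) => {
    state[role] += text; // transcripts arrive in chunks, so append per role
  });
  voice.on('thinking', ({ text }) => {
    state.thinking += text; // reasoning is kept out of the spoken transcript
  });
  voice.on('interrupt', () => {
    state.interruptions += 1; // a real client would also stop queued playback here
  });
  return state;
}

const fake = new FakeVoice();
const transcript = collectTranscript(fake);
fake.emit('writing', { text: 'What time is it?', role: 'user' });
fake.emit('thinking', { text: 'User asks for the time.' });
fake.emit('writing', { text: 'It is ', role: 'assistant' });
fake.emit('writing', { text: 'noon.', role: 'assistant' });
fake.emit('interrupt', { type: 'user', timestamp: Date.now() });
console.log(transcript);
```

Swapping `FakeVoice` for a real `GeminiLiveVoice` instance should only require that its `on` method accepts these event names, which the entry above states but this sketch does not verify.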
package/README.md CHANGED
@@ -99,10 +99,21 @@ voice.on('speaker', audioStream => {
  });

  voice.on('writing', ({ text, role }) => {
-   // Handle transcribed text
+   // role: 'user' → speech-to-text of the caller
+   // role: 'assistant' → speech-to-text of the model's spoken reply
    console.log(`${role}: ${text}`);
  });

+ // Native-audio models only: model's internal reasoning
+ voice.on('thinking', ({ text }) => {
+   console.log(`thinking: ${text}`);
+ });
+
+ // Drop queued playback when the user barges in over the model
+ voice.on('interrupt', ({ type, timestamp }) => {
+   console.log(`interrupt by ${type} at ${timestamp}`);
+ });
+
  // Send text to speech
  await voice.speak('Hello from Mastra!');

@@ -313,16 +324,26 @@ Registers an event listener.

  - `'speaking'` - Audio response from model
  - `'speaker'` - Readable stream of concatenated audio for the active response
- - `'writing'` - Text response or transcription
+ - `'writing'` - Transcribed text. Callback receives `{ text, role: 'user' | 'assistant' }`. On native-audio models the assistant transcript is driven by the server's `output_audio_transcription` channel
+ - `'thinking'` - Model chain-of-thought / reasoning text on native-audio models. Callback receives `{ text }`. Does not fire on non-native-audio models, where reasoning is not surfaced separately
  - `'error'` - Error events
  - `'session'` - Session state changes
  - `'toolCall'` - Tool calls from model
  - `'vad'` - Voice activity detection events
- - `'interrupt'` - Interrupt events
+ - `'interrupt'` - Emitted on barge-in when the user starts speaking over an in-flight model response. Callback receives `{ type: 'user', timestamp }`
  - `'usage'` - Token usage information
  - `'sessionHandle'` - Session resumption handle
  - `'turnComplete'` - Turn completion for the current model response

+ #### Native-audio models
+
+ Native-audio models (any model whose ID contains `native-audio`, e.g. `gemini-2.5-flash-native-audio-preview-12-2025`) split text output across two channels:
+
+ - The model's spoken reply is delivered as audio plus an `output_audio_transcription` transcript — surfaced as `writing` with `role: 'assistant'`.
+ - The model's internal reasoning is delivered as `modelTurn.parts.text` — surfaced as `thinking`.
+
+ On non-native-audio models there is no `output_audio_transcription` channel; `modelTurn.parts.text` is the spoken response itself and is emitted as `writing` (so `thinking` will not fire). Transcription and barge-in detection are enabled automatically in the setup payload — no extra configuration is required.
+
  ### Tools

  Add tools with `addTools()` using either `@mastra/core/tools` or a plain object matching `ToolsInput`.
@@ -3,7 +3,7 @@ name: mastra-voice-google-gemini-live
  description: Documentation for @mastra/voice-google-gemini-live. Use when working with @mastra/voice-google-gemini-live APIs, configuration, or implementation.
  metadata:
    package: "@mastra/voice-google-gemini-live"
-   version: "0.11.5-alpha.2"
+   version: "0.12.0-alpha.3"
  ---

  ## When to use
@@ -1,5 +1,5 @@
  {
-   "version": "0.11.5-alpha.2",
+   "version": "0.12.0-alpha.3",
    "package": "@mastra/voice-google-gemini-live",
    "exports": {},
    "modules": {}
@@ -242,7 +242,9 @@ The GeminiLiveVoice class emits the following events:

  **speaking** (`event`): Emitted with audio metadata. Callback receives { audioData?: Int16Array, sampleRate?: number }.

- **writing** (`event`): Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }.
+ **writing** (`event`): Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }. On native-audio models the assistant transcript is driven by the server's \`output\_audio\_transcription\` channel rather than \`modelTurn.parts.text\`.
+
+ **thinking** (`event`): Emitted on native-audio models with the model's chain-of-thought / reasoning text from \`modelTurn.parts.text\`. Callback receives { text: string }. Does not fire on non-native-audio models, where \`modelTurn.parts.text\` is the spoken response and is emitted as \`writing\` instead.

  **session** (`event`): Emitted on session state changes. Callback receives { state: 'connecting' | 'connected' | 'disconnected' | 'disconnecting' | 'updated', config?: object }.

@@ -254,7 +256,18 @@ The GeminiLiveVoice class emits the following events:

  **error** (`event`): Emitted when an error occurs. Callback receives { message: string, code?: string, details?: unknown }.

- **interrupt** (`event`): Interrupt events. Callback receives { type: 'user' | 'model', timestamp: number }.
+ **interrupt** (`event`): Emitted on barge-in when the user starts speaking over an in-flight model response. The server cancels any further audio for the current turn. Callback receives { type: 'user', timestamp: number }.
+
+ ## Native-audio behavior
+
+ Native-audio Gemini Live models — any model whose ID contains `native-audio`, such as `gemini-2.5-flash-native-audio-preview-12-2025` — split text output across two channels:
+
+ - The model's spoken reply is delivered as audio plus an `output_audio_transcription` transcript and surfaced as `writing` with `role: 'assistant'`.
+ - The model's internal reasoning is delivered as `modelTurn.parts.text` and surfaced as `thinking`.
+
+ On non-native-audio models there is no `output_audio_transcription` channel, so `modelTurn.parts.text` is the spoken response itself and is emitted as `writing`; the `thinking` event does not fire.
+
+ Input transcription, output transcription, and barge-in detection (`realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`) are enabled automatically in the setup payload — no extra configuration is required.

  ## Available models

package/dist/index.cjs CHANGED
@@ -1529,6 +1529,10 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
    requestContext;
    // Store the configuration options
    options;
+   // Accumulates assistant text across `serverContent` frames for the current
+   // turn. Live API streams responses over many frames; we aggregate here and
+   // flush to context history once on `turnComplete`.
+   pendingAssistantResponse = "";
    /**
     * Normalize configuration to ensure proper VoiceConfig format
     * Handles backward compatibility with direct GeminiLiveVoiceConfig
@@ -2483,15 +2487,38 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
      if (!data) {
        return;
      }
-     let assistantResponse = "";
+     if (data.interrupted) {
+       this.log("Model response interrupted by user activity");
+       this.audioStreamManager.cleanupSpeakerStreams();
+       this.pendingAssistantResponse = "";
+       this.emit("interrupt", { type: "user", timestamp: Date.now() });
+     }
+     if (data.inputTranscription?.text) {
+       this.emit("writing", {
+         text: data.inputTranscription.text,
+         role: "user"
+       });
+     }
+     if (data.outputTranscription?.text) {
+       this.pendingAssistantResponse += data.outputTranscription.text;
+       this.emit("writing", {
+         text: data.outputTranscription.text,
+         role: "assistant"
+       });
+     }
+     const nativeAudio = this.isNativeAudioModel();
      if (data.modelTurn?.parts) {
        for (const part of data.modelTurn.parts) {
          if (part.text) {
-           assistantResponse += part.text;
-           this.emit("writing", {
-             text: part.text,
-             role: "assistant"
-           });
+           if (nativeAudio) {
+             this.emit("thinking", { text: part.text });
+           } else {
+             this.pendingAssistantResponse += part.text;
+             this.emit("writing", {
+               text: part.text,
+               role: "assistant"
+             });
+           }
          }
          if (part.functionCall) {
            this.log("Found function call in serverContent.modelTurn.parts", part.functionCall);
@@ -2560,11 +2587,12 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
          }
        }
      }
-     if (assistantResponse.trim()) {
-       this.addToContext("assistant", assistantResponse);
-     }
      if (data.turnComplete) {
        this.log("Turn completed");
+       if (this.pendingAssistantResponse.trim()) {
+         this.addToContext("assistant", this.pendingAssistantResponse);
+       }
+       this.pendingAssistantResponse = "";
        this.audioStreamManager.cleanupSpeakerStreams();
        this.emit("turnComplete", {
          timestamp: Date.now()
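
The two handler hunks above move context writes from once per frame to once per turn. The sketch below restates that accumulate-and-flush logic as a standalone class so the behavior can be checked in isolation. It is an approximation under assumptions: `ServerContentFrame`, `Emitted`, and `TurnAccumulator` are hypothetical names, the frame type covers only the fields the diff reads, and the `context` entry stands in for the provider's `addToContext("assistant", ...)` call.

```typescript
// Subset of a `serverContent` frame, limited to the fields the handler in this diff reads.
interface ServerContentFrame {
  interrupted?: boolean;
  inputTranscription?: { text?: string };
  outputTranscription?: { text?: string };
  modelTurn?: { parts?: Array<{ text?: string }> };
  turnComplete?: boolean;
}

// Hypothetical record of what the handler would emit or persist for one frame.
type Emitted =
  | { event: 'interrupt'; type: 'user' }
  | { event: 'writing'; text: string; role: 'user' | 'assistant' }
  | { event: 'thinking'; text: string }
  | { event: 'context'; role: 'assistant'; text: string };

class TurnAccumulator {
  private pending = '';
  private readonly nativeAudio: boolean;

  constructor(nativeAudio: boolean) {
    this.nativeAudio = nativeAudio;
  }

  handle(frame: ServerContentFrame): Emitted[] {
    const out: Emitted[] = [];
    if (frame.interrupted) {
      this.pending = ''; // barge-in discards the partial assistant text
      out.push({ event: 'interrupt', type: 'user' });
    }
    if (frame.inputTranscription?.text) {
      out.push({ event: 'writing', text: frame.inputTranscription.text, role: 'user' });
    }
    if (frame.outputTranscription?.text) {
      this.pending += frame.outputTranscription.text;
      out.push({ event: 'writing', text: frame.outputTranscription.text, role: 'assistant' });
    }
    for (const part of frame.modelTurn?.parts ?? []) {
      if (!part.text) continue;
      if (this.nativeAudio) {
        out.push({ event: 'thinking', text: part.text }); // reasoning never enters context
      } else {
        this.pending += part.text;
        out.push({ event: 'writing', text: part.text, role: 'assistant' });
      }
    }
    if (frame.turnComplete) {
      if (this.pending.trim()) {
        out.push({ event: 'context', role: 'assistant', text: this.pending });
      }
      this.pending = ''; // reset for the next turn
    }
    return out;
  }
}

// Native-audio turn: reasoning, two transcript chunks, then completion.
const nativeAcc = new TurnAccumulator(true);
const nativeEvents: Emitted[] = [
  ...nativeAcc.handle({ modelTurn: { parts: [{ text: 'Plan the reply.' }] } }),
  ...nativeAcc.handle({ outputTranscription: { text: 'Hello ' } }),
  ...nativeAcc.handle({ outputTranscription: { text: 'there.' } }),
  ...nativeAcc.handle({ turnComplete: true }),
];

// Non-native-audio turn that is interrupted before it completes.
const plainAcc = new TurnAccumulator(false);
const plainEvents: Emitted[] = [
  ...plainAcc.handle({ modelTurn: { parts: [{ text: 'Partial answer' }] } }),
  ...plainAcc.handle({ interrupted: true }),
  ...plainAcc.handle({ turnComplete: true }),
];

console.log(JSON.stringify(nativeEvents), JSON.stringify(plainEvents));
```

In the first run the only `context` entry carries the joined transcript `Hello there.`, and in the second run no `context` entry is produced because the interrupt cleared the buffer before `turnComplete` arrived.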
@@ -2717,6 +2745,27 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
    getVertexLocation() {
      return this.options.location?.trim() || "us-central1";
    }
+   /**
+    * Whether the active model is a Gemini Live "native-audio" variant.
+    *
+    * Native-audio models emit a different `serverContent.modelTurn.parts.text`
+    * stream than their half-cascade siblings: on native-audio, that text is the
+    * model's internal reasoning (chain-of-thought), and the *spoken* response
+    * arrives separately via `serverContent.outputTranscription.text`. On
+    * non-native-audio models there is no `outputTranscription` channel, and
+    * `modelTurn.parts.text` is the spoken response.
+    *
+    * Used to decide whether `modelTurn.parts.text` should be emitted as
+    * `thinking` (native-audio) or `writing` (non-native-audio). All native-audio
+    * model IDs in `GeminiVoiceModel` contain the literal substring
+    * `native-audio`, so a substring check is sufficient and forward-compatible
+    * with new variants that follow the same naming convention.
+    * @private
+    */
+   isNativeAudioModel() {
+     const model = this.options.model ?? DEFAULT_MODEL;
+     return model.includes("native-audio");
+   }
    /**
     * Resolve the correct model identifier for Gemini API or Vertex AI
     * @private
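
The substring check in the hunk above can be tried against the two model IDs that the changelog names. This is a hedged sketch: `isNativeAudioModel` here is a free-function copy of the check, not the private method itself, and `eventForModelText` is a hypothetical helper that mirrors how the handler routes `modelTurn.parts.text`.

```typescript
// Standalone copy of the substring check; the shipped method also falls back to a default model.
function isNativeAudioModel(model: string): boolean {
  return model.includes('native-audio');
}

// Hypothetical helper: which event `modelTurn.parts.text` would be emitted as.
function eventForModelText(model: string): 'thinking' | 'writing' {
  return isNativeAudioModel(model) ? 'thinking' : 'writing';
}

const routing = {
  'gemini-2.5-flash-native-audio-preview-12-2025': eventForModelText(
    'gemini-2.5-flash-native-audio-preview-12-2025',
  ),
  'gemini-3.1-flash-live-preview': eventForModelText('gemini-3.1-flash-live-preview'),
};

console.log(routing);
```

Under a plain substring check, `gemini-3.1-flash-live-preview` routes to `writing` because its ID does not contain `native-audio`, even though the changelog lists it alongside the native-audio models. Whether that ID is handled elsewhere is not visible in this diff, so it may be worth confirming against the package source before relying on `thinking` for that model.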
@@ -2758,7 +2807,20 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
      const setupMessage = {
        setup: {
          model: this.resolveModelIdentifier(),
-         generation_config: generationConfig
+         generation_config: generationConfig,
+         // Transcription is on by default — matches the pattern in @mastra/voice-openai-realtime,
+         // @mastra/voice-xai-realtime, and @mastra/voice-inworld where realtime sessions
+         // unconditionally enable STT in `connect()`. On native-audio models this is the ONLY way
+         // to receive the spoken response as text (`modelTurn.parts.text` carries reasoning, not
+         // speech), so without these flags the assistant's words are silently dropped client-side.
+         input_audio_transcription: {},
+         output_audio_transcription: {},
+         // Activity-based interrupts surface barge-in as `serverContent.interrupted = true` and
+         // cancel the in-flight model response. This is the only way to wire up the `interrupt`
+         // event declared in `GeminiLiveEventMap`.
+         realtime_input_config: {
+           activity_handling: "START_OF_ACTIVITY_INTERRUPTS"
+         }
        }
      };
      if (this.options.instructions) {
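
For reference, the payload built in the hunk above has roughly the following shape on the wire. This is an illustrative sketch only: `SetupMessage` and `buildSetupMessage` are hypothetical names, the model string is passed through unchanged instead of going through `resolveModelIdentifier()`, and the empty `generation_config` is a placeholder, not a recommended configuration.

```typescript
// Hypothetical type covering only the fields visible in this diff.
interface SetupMessage {
  setup: {
    model: string;
    generation_config: Record<string, unknown>;
    input_audio_transcription: Record<string, never>;
    output_audio_transcription: Record<string, never>;
    realtime_input_config: { activity_handling: 'START_OF_ACTIVITY_INTERRUPTS' };
  };
}

// Builds the setup payload with transcription and barge-in handling always enabled.
function buildSetupMessage(
  model: string,
  generationConfig: Record<string, unknown>,
): SetupMessage {
  return {
    setup: {
      model,
      generation_config: generationConfig,
      input_audio_transcription: {}, // empty object turns user-side transcription on
      output_audio_transcription: {}, // empty object turns model-side transcription on
      realtime_input_config: {
        activity_handling: 'START_OF_ACTIVITY_INTERRUPTS',
      },
    },
  };
}

const setupPayload = buildSetupMessage('gemini-2.5-flash-native-audio-preview-12-2025', {});
const wireJson = JSON.stringify(setupPayload, null, 2);
console.log(wireJson);
```

Because the three fields are set unconditionally, a caller has no option to switch them off in this sketch, which matches the "no extra configuration is required" wording in the README and docs hunks.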