npm - @mastra/voice-google-gemini-live - Versions diffs - 0.11.5-alpha.2 → 0.12.0-alpha.3 - Mend

@mastra/voice-google-gemini-live 0.11.5-alpha.2 → 0.12.0-alpha.3


Files changed (14)

package/CHANGELOG.md +40 -0
package/README.md +24 -3
package/dist/docs/SKILL.md +1 -1
package/dist/docs/assets/SOURCE_MAP.json +1 -1
package/dist/docs/references/reference-voice-google-gemini-live.md +15 -2
package/dist/index.cjs +72 -10
package/dist/index.cjs.map +1 -1
package/dist/index.d.ts +19 -0
package/dist/index.d.ts.map +1 -1
package/dist/index.js +72 -10
package/dist/index.js.map +1 -1
package/dist/types.d.ts +39 -0
package/dist/types.d.ts.map +1 -1
package/package.json +3 -3

package/CHANGELOG.md CHANGED

@@ -1,5 +1,45 @@
 # @mastra/voice-google-gemini-live
+## 0.12.0-alpha.3
+### Minor Changes
+- Surface native-audio behavioral signals on Gemini Live realtime sessions (#17021). ([#17434](https://github.com/mastra-ai/mastra/pull/17434))
+  The `@mastra/voice-google-gemini-live` provider now enables transcription and barge-in detection in the setup payload and exposes them through Mastra's standard realtime event contract. This makes native-audio models such as `gemini-2.5-flash-native-audio-preview-12-2025` and `gemini-3.1-flash-live-preview` behaviorally usable end-to-end. Until now, the spoken response was silently dropped on native-audio because it arrives on a different wire channel from the model's internal reasoning.
+  **What changed**
+  - Setup payload unconditionally includes `input_audio_transcription`, `output_audio_transcription`, and `realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`, matching how the OpenAI, xAI, Inworld, and AWS Nova Sonic providers enable transcription by default.
+  - User-side transcripts emit as `writing` with `role: 'user'`. Model-side transcripts emit as `writing` with `role: 'assistant'`. This matches the cross-provider `writing` contract.
+  - Barge-in (the server cancelling its in-flight response when the user starts speaking) emits an `interrupt` event with `{ type: 'user', timestamp }`, matching `@mastra/voice-aws-nova-sonic`.
+  - On native-audio models, `modelTurn.parts.text` is the model's internal chain-of-thought, not the spoken response. It now emits as a Gemini-specific `thinking` event so consumers can render reasoning separately. On non-native-audio models, `modelTurn.parts.text` continues to emit as `writing` (it is the spoken response there).
+  **Example**
+  ```ts
+  import { GeminiLiveVoice } from '@mastra/voice-google-gemini-live';
+  const voice = new GeminiLiveVoice({
+    apiKey: process.env.GOOGLE_API_KEY,
+    model: 'gemini-2.5-flash-native-audio-preview-12-2025',
+  });
+  voice.on('writing', ({ text, role }) => {
+    // role: 'user'      → speech-to-text of the caller
+    // role: 'assistant' → speech-to-text of the model's spoken reply
+  });
+  voice.on('thinking', ({ text }) => {
+    // Gemini's internal reasoning on native-audio models
+  });
+  voice.on('interrupt', ({ type, timestamp }) => {
+    // Drop queued TTS audio — the user just barged in
+  });
+  await voice.connect();
+  ```
 ## 0.11.5-alpha.2
 ### Patch Changes
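
The changelog example leaves the `interrupt` handler body as a comment ("Drop queued TTS audio"). One way that step might look is sketched below. `PlaybackQueue` and its methods are hypothetical names introduced for illustration and are not part of `@mastra/voice-google-gemini-live`; only the `{ type: 'user', timestamp }` payload shape comes from the changelog.

```ts
// Hypothetical client-side playback buffer (illustrative, not a package API).
type InterruptEvent = { type: 'user'; timestamp: number };

class PlaybackQueue {
  private chunks: Int16Array[] = [];
  private lastInterruptAt: number | null = null;

  // Buffer an audio chunk that has not been played yet.
  enqueue(chunk: Int16Array): void {
    this.chunks.push(chunk);
  }

  // Next chunk to hand to the audio device, or undefined when the queue is empty.
  next(): Int16Array | undefined {
    return this.chunks.shift();
  }

  // Call from the `interrupt` handler: discard everything not yet played
  // and report how many chunks were dropped.
  handleInterrupt(event: InterruptEvent): number {
    const dropped = this.chunks.length;
    this.chunks = [];
    this.lastInterruptAt = event.timestamp;
    return dropped;
  }

  get size(): number {
    return this.chunks.length;
  }

  get interruptedAt(): number | null {
    return this.lastInterruptAt;
  }
}
```

Assuming a connected `voice` instance, this could be wired as `voice.on('interrupt', e => queue.handleInterrupt(e))`, with chunks enqueued from whichever audio event the application plays back.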

package/README.md CHANGED

@@ -99,10 +99,21 @@ voice.on('speaker', audioStream => {
 });
 voice.on('writing', ({ text, role }) => {
-  // Handle transcribed text
+  // role: 'user'      → speech-to-text of the caller
+  // role: 'assistant' → speech-to-text of the model's spoken reply
   console.log(`${role}: ${text}`);
 });
+// Native-audio models only: model's internal reasoning
+voice.on('thinking', ({ text }) => {
+  console.log(`thinking: ${text}`);
+});
+// Drop queued playback when the user barges in over the model
+voice.on('interrupt', ({ type, timestamp }) => {
+  console.log(`interrupt by ${type} at ${timestamp}`);
+});
 // Send text to speech
 await voice.speak('Hello from Mastra!');
@@ -313,16 +324,26 @@ Registers an event listener.
 - `'speaking'` - Audio response from model
 - `'speaker'` - Readable stream of concatenated audio for the active response
-- `'writing'` - Text response or transcription
+- `'writing'` - Transcribed text. Callback receives `{ text, role: 'user' | 'assistant' }`. On native-audio models the assistant transcript is driven by the server's `output_audio_transcription` channel
+- `'thinking'` - Model chain-of-thought / reasoning text on native-audio models. Callback receives `{ text }`. Does not fire on non-native-audio models, where reasoning is not surfaced separately
 - `'error'` - Error events
 - `'session'` - Session state changes
 - `'toolCall'` - Tool calls from model
 - `'vad'` - Voice activity detection events
-- `'interrupt'` - Interrupt events
+- `'interrupt'` - Emitted on barge-in when the user starts speaking over an in-flight model response. Callback receives `{ type: 'user', timestamp }`
 - `'usage'` - Token usage information
 - `'sessionHandle'` - Session resumption handle
 - `'turnComplete'` - Turn completion for the current model response
+#### Native-audio models
+Native-audio models (any model whose ID contains `native-audio`, e.g. `gemini-2.5-flash-native-audio-preview-12-2025`) split text output across two channels:
+- The model's spoken reply is delivered as audio plus an `output_audio_transcription` transcript — surfaced as `writing` with `role: 'assistant'`.
+- The model's internal reasoning is delivered as `modelTurn.parts.text` — surfaced as `thinking`.
+On non-native-audio models there is no `output_audio_transcription` channel; `modelTurn.parts.text` is the spoken response itself and is emitted as `writing` (so `thinking` will not fire). Transcription and barge-in detection are enabled automatically in the setup payload — no extra configuration is required.
 ### Tools
 Add tools with `addTools()` using either `@mastra/core/tools` or a plain object matching `ToolsInput`.
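
Because `writing` now carries both sides of the conversation, a consumer may want to merge the streamed chunks into one transcript entry per speaker turn. The sketch below is one possible approach, not something the README prescribes. `TranscriptLog` is a hypothetical helper, and it assumes chunks of the same role arrive consecutively until `turnComplete` fires.

```ts
type Role = 'user' | 'assistant';
type WritingEvent = { text: string; role: Role };
type TranscriptEntry = { role: Role; text: string };

// Hypothetical helper: merges consecutive `writing` chunks of the same role.
class TranscriptLog {
  readonly entries: TranscriptEntry[] = [];
  // True while the last entry may still receive more chunks.
  private open = false;

  onWriting({ text, role }: WritingEvent): void {
    const last = this.entries[this.entries.length - 1];
    if (this.open && last && last.role === role) {
      last.text += text; // same speaker, same turn: append
    } else {
      this.entries.push({ role, text }); // speaker changed or new turn
      this.open = true;
    }
  }

  // Call from the `turnComplete` handler so the next chunk starts a new entry.
  onTurnComplete(): void {
    this.open = false;
  }
}
```

With a `voice` instance this might be attached as `voice.on('writing', e => log.onWriting(e))` and `voice.on('turnComplete', () => log.onTurnComplete())`.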

package/dist/docs/SKILL.md CHANGED

@@ -3,7 +3,7 @@ name: mastra-voice-google-gemini-live
 description: Documentation for @mastra/voice-google-gemini-live. Use when working with @mastra/voice-google-gemini-live APIs, configuration, or implementation.
 metadata:
   package: "@mastra/voice-google-gemini-live"
-  version: "0.11.5-alpha.2"
+  version: "0.12.0-alpha.3"
 ---
 ## When to use

package/dist/docs/assets/SOURCE_MAP.json CHANGED

@@ -1,5 +1,5 @@
 {
-  "version": "0.11.5-alpha.2",
+  "version": "0.12.0-alpha.3",
   "package": "@mastra/voice-google-gemini-live",
   "exports": {},
   "modules": {}

package/dist/docs/references/reference-voice-google-gemini-live.md CHANGED

@@ -242,7 +242,9 @@ The GeminiLiveVoice class emits the following events:
 **speaking** (`event`): Emitted with audio metadata. Callback receives { audioData?: Int16Array, sampleRate?: number }.
-**writing** (`event`): Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }.
+**writing** (`event`): Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }. On native-audio models the assistant transcript is driven by the server's \`output\_audio\_transcription\` channel rather than \`modelTurn.parts.text\`.
+**thinking** (`event`): Emitted on native-audio models with the model's chain-of-thought / reasoning text from \`modelTurn.parts.text\`. Callback receives { text: string }. Does not fire on non-native-audio models, where \`modelTurn.parts.text\` is the spoken response and is emitted as \`writing\` instead.
 **session** (`event`): Emitted on session state changes. Callback receives { state: 'connecting' | 'connected' | 'disconnected' | 'disconnecting' | 'updated', config?: object }.
@@ -254,7 +256,18 @@ The GeminiLiveVoice class emits the following events:
 **error** (`event`): Emitted when an error occurs. Callback receives { message: string, code?: string, details?: unknown }.
-**interrupt** (`event`): Interrupt events. Callback receives { type: 'user' | 'model', timestamp: number }.
+**interrupt** (`event`): Emitted on barge-in when the user starts speaking over an in-flight model response. The server cancels any further audio for the current turn. Callback receives { type: 'user', timestamp: number }.
+## Native-audio behavior
+Native-audio Gemini Live models — any model whose ID contains `native-audio`, such as `gemini-2.5-flash-native-audio-preview-12-2025` — split text output across two channels:
+- The model's spoken reply is delivered as audio plus an `output_audio_transcription` transcript and surfaced as `writing` with `role: 'assistant'`.
+- The model's internal reasoning is delivered as `modelTurn.parts.text` and surfaced as `thinking`.
+On non-native-audio models there is no `output_audio_transcription` channel, so `modelTurn.parts.text` is the spoken response itself and is emitted as `writing`; the `thinking` event does not fire.
+Input transcription, output transcription, and barge-in detection (`realtime_input_config.activity_handling = 'START_OF_ACTIVITY_INTERRUPTS'`) are enabled automatically in the setup payload — no extra configuration is required.
 ## Available models
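
For orientation, the setup payload fields named in the reference text can be pictured as the object below. The field names and the `START_OF_ACTIVITY_INTERRUPTS` value are taken from this diff. The `buildSetupMessage` helper and the `SetupMessage` type are hypothetical, and the real payload may carry further fields not shown here.

```ts
// Hypothetical shape, limited to the fields this release mentions.
type SetupMessage = {
  setup: {
    model: string;
    generation_config: Record<string, unknown>;
    input_audio_transcription: Record<string, never>;
    output_audio_transcription: Record<string, never>;
    realtime_input_config: { activity_handling: 'START_OF_ACTIVITY_INTERRUPTS' };
  };
};

// Illustrative builder: transcription and barge-in are always switched on.
function buildSetupMessage(
  model: string,
  generationConfig: Record<string, unknown>,
): SetupMessage {
  return {
    setup: {
      model,
      generation_config: generationConfig,
      // Empty objects enable user-side and model-side transcripts.
      input_audio_transcription: {},
      output_audio_transcription: {},
      // User speech cancels the in-flight model response.
      realtime_input_config: {
        activity_handling: 'START_OF_ACTIVITY_INTERRUPTS',
      },
    },
  };
}
```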

package/dist/index.cjs CHANGED

@@ -1529,6 +1529,10 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
   requestContext;
   // Store the configuration options
   options;
+  // Accumulates assistant text across `serverContent` frames for the current
+  // turn. Live API streams responses over many frames; we aggregate here and
+  // flush to context history once on `turnComplete`.
+  pendingAssistantResponse = "";
   /**
    * Normalize configuration to ensure proper VoiceConfig format
    * Handles backward compatibility with direct GeminiLiveVoiceConfig
@@ -2483,15 +2487,38 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
     if (!data) {
       return;
     }
-    let assistantResponse = "";
+    if (data.interrupted) {
+      this.log("Model response interrupted by user activity");
+      this.audioStreamManager.cleanupSpeakerStreams();
+      this.pendingAssistantResponse = "";
+      this.emit("interrupt", { type: "user", timestamp: Date.now() });
+    }
+    if (data.inputTranscription?.text) {
+      this.emit("writing", {
+        text: data.inputTranscription.text,
+        role: "user"
+      });
+    }
+    if (data.outputTranscription?.text) {
+      this.pendingAssistantResponse += data.outputTranscription.text;
+      this.emit("writing", {
+        text: data.outputTranscription.text,
+        role: "assistant"
+      });
+    }
+    const nativeAudio = this.isNativeAudioModel();
     if (data.modelTurn?.parts) {
       for (const part of data.modelTurn.parts) {
         if (part.text) {
-          assistantResponse += part.text;
-          this.emit("writing", {
-            text: part.text,
-            role: "assistant"
-          });
+          if (nativeAudio) {
+            this.emit("thinking", { text: part.text });
+          } else {
+            this.pendingAssistantResponse += part.text;
+            this.emit("writing", {
+              text: part.text,
+              role: "assistant"
+            });
+          }
         }
         if (part.functionCall) {
           this.log("Found function call in serverContent.modelTurn.parts", part.functionCall);
@@ -2560,11 +2587,12 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
         }
       }
     }
-    if (assistantResponse.trim()) {
-      this.addToContext("assistant", assistantResponse);
-    }
     if (data.turnComplete) {
       this.log("Turn completed");
+      if (this.pendingAssistantResponse.trim()) {
+        this.addToContext("assistant", this.pendingAssistantResponse);
+      }
+      this.pendingAssistantResponse = "";
       this.audioStreamManager.cleanupSpeakerStreams();
       this.emit("turnComplete", {
         timestamp: Date.now()
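
The routing in the two hunks above can be easier to follow as a pure function. The sketch below restates that logic without the class, logging, or audio cleanup. `TurnRouter`, `ServerContent`, and `Routed` are hypothetical names, and `addToContext` is modelled as a returned record rather than a method call.

```ts
type ServerContent = {
  interrupted?: boolean;
  inputTranscription?: { text?: string };
  outputTranscription?: { text?: string };
  modelTurn?: { parts?: { text?: string }[] };
  turnComplete?: boolean;
};

type Routed =
  | { event: 'interrupt'; type: 'user' }
  | { event: 'writing'; text: string; role: 'user' | 'assistant' }
  | { event: 'thinking'; text: string }
  | { event: 'addToContext'; text: string }
  | { event: 'turnComplete' };

class TurnRouter {
  // Assistant text accumulated across frames of the current turn.
  private pending = '';
  private nativeAudio: boolean;

  constructor(nativeAudio: boolean) {
    this.nativeAudio = nativeAudio;
  }

  route(data: ServerContent): Routed[] {
    const out: Routed[] = [];
    if (data.interrupted) {
      this.pending = ''; // a barged-in reply is not written to history
      out.push({ event: 'interrupt', type: 'user' });
    }
    if (data.inputTranscription?.text) {
      out.push({ event: 'writing', text: data.inputTranscription.text, role: 'user' });
    }
    if (data.outputTranscription?.text) {
      this.pending += data.outputTranscription.text;
      out.push({ event: 'writing', text: data.outputTranscription.text, role: 'assistant' });
    }
    for (const part of data.modelTurn?.parts ?? []) {
      if (!part.text) continue;
      if (this.nativeAudio) {
        out.push({ event: 'thinking', text: part.text }); // reasoning, not speech
      } else {
        this.pending += part.text;
        out.push({ event: 'writing', text: part.text, role: 'assistant' });
      }
    }
    if (data.turnComplete) {
      // Flush the aggregated reply once per turn, then reset.
      if (this.pending.trim()) out.push({ event: 'addToContext', text: this.pending });
      this.pending = '';
      out.push({ event: 'turnComplete' });
    }
    return out;
  }
}
```

One consequence worth noting: because the buffer is cleared on `interrupted`, an interrupted reply never reaches context history, and on native-audio models reasoning text is never added to it either.
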
@@ -2717,6 +2745,27 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
   getVertexLocation() {
     return this.options.location?.trim() || "us-central1";
   }
+  /**
+   * Whether the active model is a Gemini Live "native-audio" variant.
+   *
+   * Native-audio models emit a different `serverContent.modelTurn.parts.text`
+   * stream than their half-cascade siblings: on native-audio, that text is the
+   * model's internal reasoning (chain-of-thought), and the *spoken* response
+   * arrives separately via `serverContent.outputTranscription.text`. On
+   * non-native-audio models there is no `outputTranscription` channel, and
+   * `modelTurn.parts.text` is the spoken response.
+   *
+   * Used to decide whether `modelTurn.parts.text` should be emitted as
+   * `thinking` (native-audio) or `writing` (non-native-audio). All native-audio
+   * model IDs in `GeminiVoiceModel` contain the literal substring
+   * `native-audio`, so a substring check is sufficient and forward-compatible
+   * with new variants that follow the same naming convention.
+   * @private
+   */
+  isNativeAudioModel() {
+    const model = this.options.model ?? DEFAULT_MODEL;
+    return model.includes("native-audio");
+  }
   /**
    * Resolve the correct model identifier for Gemini API or Vertex AI
    * @private
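
The substring check in `isNativeAudioModel()` can be tried in isolation. The sketch below takes the default model as a parameter, since the value of `DEFAULT_MODEL` is not shown in this diff. The standalone function signature is an assumption made for illustration.

```ts
// Standalone restatement of the check; `defaultModel` stands in for DEFAULT_MODEL.
function isNativeAudioModel(model: string | undefined, defaultModel: string): boolean {
  return (model ?? defaultModel).includes('native-audio');
}

// Model IDs quoted in the changelog entry for this release.
const flash25 = isNativeAudioModel('gemini-2.5-flash-native-audio-preview-12-2025', 'unused-default');
const flash31 = isNativeAudioModel('gemini-3.1-flash-live-preview', 'unused-default');
```

Here `flash25` is `true` and `flash31` is `false`. The changelog lists `gemini-3.1-flash-live-preview` alongside the native-audio models, yet its ID does not contain `native-audio`, so under this check its `modelTurn.parts.text` would be emitted as `writing` rather than `thinking`. Whether that is the intended behavior is not stated in the diff.
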
@@ -2758,7 +2807,20 @@ var GeminiLiveVoice = class _GeminiLiveVoice extends MastraVoice {
     const setupMessage = {
       setup: {
         model: this.resolveModelIdentifier(),
-        generation_config: generationConfig
+        generation_config: generationConfig,
+        // Transcription is on by default — matches the pattern in @mastra/voice-openai-realtime,
+        // @mastra/voice-xai-realtime, and @mastra/voice-inworld where realtime sessions
+        // unconditionally enable STT in `connect()`. On native-audio models this is the ONLY way
+        // to receive the spoken response as text (`modelTurn.parts.text` carries reasoning, not
+        // speech), so without these flags the assistant's words are silently dropped client-side.
+        input_audio_transcription: {},
+        output_audio_transcription: {},
+        // Activity-based interrupts surface barge-in as `serverContent.interrupted = true` and
+        // cancel the in-flight model response. This is the only way to wire up the `interrupt`
+        // event declared in `GeminiLiveEventMap`.
+        realtime_input_config: {
+          activity_handling: "START_OF_ACTIVITY_INTERRUPTS"
+        }
       }
     };
     if (this.options.instructions) {