npm - @mastra/voice-aws-nova-sonic - Versions diffs - 0.0.0-studio-cli-20260504022012 - Mend

@mastra/voice-aws-nova-sonic 0.0.0-studio-cli-20260504022012

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (21)

package/CHANGELOG.md +51 -0
package/LICENSE.md +30 -0
package/README.md +384 -0
package/dist/docs/SKILL.md +27 -0
package/dist/docs/assets/SOURCE_MAP.json +6 -0
package/dist/docs/references/docs-voice-overview.md +1028 -0
package/dist/docs/references/docs-voice-speech-to-speech.md +146 -0
package/dist/docs/references/reference-voice-aws-nova-sonic.md +247 -0
package/dist/index.cjs +1619 -0
package/dist/index.cjs.map +1 -0
package/dist/index.d.ts +269 -0
package/dist/index.d.ts.map +1 -0
package/dist/index.js +1615 -0
package/dist/index.js.map +1 -0
package/dist/types.d.ts +354 -0
package/dist/types.d.ts.map +1 -0
package/dist/utils/auth.d.ts +6 -0
package/dist/utils/auth.d.ts.map +1 -0
package/dist/utils/errors.d.ts +17 -0
package/dist/utils/errors.d.ts.map +1 -0
package/package.json +68 -0

package/dist/docs/references/docs-voice-speech-to-speech.md ADDED

@@ -0,0 +1,146 @@
+# Speech-to-Speech capabilities in Mastra
+## Introduction
+Speech-to-Speech (STS) in Mastra provides a standardized interface for real-time interactions across multiple providers. STS enables continuous, bidirectional audio communication by listening to events emitted by realtime models. Unlike separate TTS and STT operations, STS keeps an open connection that processes speech continuously in both directions.
+## Configuration
+- **`apiKey`**: Your OpenAI API key. Falls back to the `OPENAI_API_KEY` environment variable.
+- **`model`**: The model ID to use for real-time voice interactions (e.g., `gpt-5.1-realtime`).
+- **`speaker`**: The default voice ID used for speech output.
+```typescript
+import { OpenAIRealtimeVoice } from '@mastra/voice-openai-realtime'
+const voice = new OpenAIRealtimeVoice({
+  apiKey: 'your-openai-api-key',
+  model: 'gpt-5.1-realtime',
+  speaker: 'alloy', // Default voice
+})
+// With default settings (API key read from OPENAI_API_KEY), the configuration can be simplified to:
+const defaultVoice = new OpenAIRealtimeVoice()
+```
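The `apiKey` fallback described above can be sketched as a small resolver. It prefers an explicit key and otherwise reads `OPENAI_API_KEY`. `RealtimeOptions` and `resolveRealtimeOptions` are hypothetical names for illustration, not exports of `@mastra/voice-openai-realtime`. The environment is passed in as a plain record so the sketch runs without Node type definitions.

```typescript
// Hypothetical shape mirroring the documented options; not the package's real type.
interface RealtimeOptions {
  apiKey?: string
  model?: string
  speaker?: string
}

// Explicit options win; otherwise fall back to OPENAI_API_KEY from the supplied environment.
function resolveRealtimeOptions(
  options: RealtimeOptions,
  env: Record<string, string | undefined>,
): RealtimeOptions & { apiKey: string } {
  const apiKey = options.apiKey ?? env.OPENAI_API_KEY
  if (!apiKey) {
    throw new Error('No API key: pass apiKey or set OPENAI_API_KEY')
  }
  return { ...options, apiKey }
}

const resolved = resolveRealtimeOptions({ speaker: 'alloy' }, { OPENAI_API_KEY: 'sk-test' })
console.log(resolved.apiKey, resolved.speaker) // sk-test alloy
```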
+## Using STS
+```typescript
+import { Agent } from '@mastra/core/agent'
+import { OpenAIRealtimeVoice } from '@mastra/voice-openai-realtime'
+import { playAudio, getMicrophoneStream } from '@mastra/node-audio'
+const agent = new Agent({
+  id: 'agent',
+  name: 'OpenAI Realtime Agent',
+  instructions: `You are a helpful assistant with real-time voice capabilities.`,
+  model: 'openai/gpt-5.4',
+  voice: new OpenAIRealtimeVoice(),
+})
+// Connect to the voice service
+await agent.voice.connect()
+// Listen for agent audio responses
+agent.voice.on('speaker', ({ audio }) => {
+  playAudio(audio)
+})
+// Initiate the conversation
+await agent.voice.speak('How can I help you today?')
+// Send continuous audio from the microphone
+const micStream = getMicrophoneStream()
+await agent.voice.send(micStream)
+```
+For integrating Speech-to-Speech capabilities with agents, refer to the [Adding Voice to Agents](https://mastra.ai/docs/agents/adding-voice) documentation.
+## Google Gemini Live (Realtime)
+```typescript
+import { Agent } from '@mastra/core/agent'
+import { GeminiLiveVoice } from '@mastra/voice-google-gemini-live'
+import { playAudio, getMicrophoneStream } from '@mastra/node-audio'
+const agent = new Agent({
+  id: 'agent',
+  name: 'Gemini Live Agent',
+  instructions: 'You are a helpful assistant with real-time voice capabilities.',
+  // Model used for text generation; voice provider handles realtime audio
+  model: 'openai/gpt-5.4',
+  voice: new GeminiLiveVoice({
+    apiKey: process.env.GOOGLE_API_KEY,
+    model: 'gemini-2.0-flash-exp',
+    speaker: 'Puck',
+    debug: true,
+    // Vertex AI option:
+    // vertexAI: true,
+    // project: 'your-gcp-project',
+    // location: 'us-central1',
+    // serviceAccountKeyFile: '/path/to/service-account.json',
+  }),
+})
+await agent.voice.connect()
+agent.voice.on('speaker', ({ audio }) => {
+  playAudio(audio)
+})
+agent.voice.on('writing', ({ role, text }) => {
+  console.log(`${role}: ${text}`)
+})
+await agent.voice.speak('How can I help you today?')
+const micStream = getMicrophoneStream()
+await agent.voice.send(micStream)
+```
+Note:
+- Live API requires `GOOGLE_API_KEY`. Vertex AI requires project/location and service account credentials.
+- Events: `speaker` (audio stream), `writing` (text), `turnComplete`, `usage`, and `error`.
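Because `writing` events arrive as a stream of partial text, it can help to buffer them until `turnComplete` fires and then store one utterance per speaker. This is a minimal sketch. `TurnBuffer` is a hypothetical helper, and the sketch makes three assumptions:

- roles are `'user' | 'assistant'`;
- chunks already carry their own spacing;
- a turn holds at most one utterance per role.

You would wire it up with `agent.voice.on('writing', (e) => buffer.onWriting(e))` and `agent.voice.on('turnComplete', () => buffer.onTurnComplete())`.

```typescript
// Assumed role values, matching the `writing` handler in the example above.
type Role = 'user' | 'assistant'

interface Utterance {
  role: Role
  text: string
}

// Collects streamed `writing` chunks and records one utterance per role when a turn completes.
class TurnBuffer {
  private pending = new Map<Role, string[]>()
  readonly history: Utterance[] = []

  onWriting({ role, text }: Utterance): void {
    const chunks = this.pending.get(role) ?? []
    chunks.push(text)
    this.pending.set(role, chunks)
  }

  onTurnComplete(): void {
    // The user speaks first in a turn, so flush user text before assistant text.
    for (const role of ['user', 'assistant'] as const) {
      const chunks = this.pending.get(role)
      if (chunks && chunks.length > 0) {
        this.history.push({ role, text: chunks.join('') })
      }
    }
    this.pending.clear()
  }
}

const buffer = new TurnBuffer()
buffer.onWriting({ role: 'user', text: 'What time ' })
buffer.onWriting({ role: 'user', text: 'is it?' })
buffer.onWriting({ role: 'assistant', text: 'It is noon.' })
buffer.onTurnComplete()
console.log(buffer.history.map((u) => `${u.role}: ${u.text}`).join('\n'))
// user: What time is it?
// assistant: It is noon.
```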
+## AWS Nova Sonic (Realtime)
+```typescript
+import { Agent } from '@mastra/core/agent'
+import { NovaSonicVoice } from '@mastra/voice-aws-nova-sonic'
+import { playAudio, getMicrophoneStream } from '@mastra/node-audio'
+const agent = new Agent({
+  id: 'agent',
+  name: 'Nova Sonic Agent',
+  instructions: 'You are a helpful assistant with real-time voice capabilities.',
+  // Model used for text generation; voice provider handles realtime audio
+  model: 'openai/gpt-5.4',
+  voice: new NovaSonicVoice({
+    region: 'us-east-1',
+    speaker: 'matthew',
+    // Static credentials are optional. The default AWS credential provider
+    // chain is used when none are passed.
+  }),
+})
+await agent.voice.connect()
+// Assistant audio is emitted as 16-bit PCM on the `speaking` event
+agent.voice.on('speaking', ({ audioData }) => {
+  if (audioData) playAudio(audioData)
+})
+agent.voice.on('writing', ({ role, text }) => {
+  console.log(`${role}: ${text}`)
+})
+await agent.voice.speak('How can I help you today?')
+const micStream = getMicrophoneStream()
+await agent.voice.send(micStream)
+```
+Note:
+- Available regions: `us-east-1`, `us-west-2`, and `ap-northeast-1`.
+- Authenticates through the standard AWS credential provider chain. Pass `credentials` to override.
+- Events: `speaking` (Int16Array audio), `writing` (text with `generationStage`), `toolCall`, `interrupt`, `turnComplete`, `usage`, `session`, and `error`.
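The region restriction in these notes can be checked before the agent is built, which helps when the region comes from configuration. This is a minimal sketch. `pickRegion` and `isNovaSonicRegion` are hypothetical helpers, and the provider may perform its own validation as well.

```typescript
// Regions listed in the notes above; verify against current AWS availability.
const NOVA_SONIC_REGIONS = ['us-east-1', 'us-west-2', 'ap-northeast-1'] as const
type NovaSonicRegion = (typeof NOVA_SONIC_REGIONS)[number]

// Type guard that narrows an arbitrary string to a supported region.
function isNovaSonicRegion(value: string): value is NovaSonicRegion {
  return (NOVA_SONIC_REGIONS as readonly string[]).indexOf(value) !== -1
}

// Fail fast on unsupported regions, defaulting to us-east-1 when nothing is configured.
function pickRegion(value: string | undefined): NovaSonicRegion {
  const region = value ?? 'us-east-1'
  if (!isNovaSonicRegion(region)) {
    throw new Error(`Nova Sonic is not available in ${region}; use one of ${NOVA_SONIC_REGIONS.join(', ')}`)
  }
  return region
}

console.log(pickRegion(undefined)) // us-east-1
```

The narrowed `NovaSonicRegion` value can then be passed as `region` to `new NovaSonicVoice({ ... })`.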

package/dist/docs/references/reference-voice-aws-nova-sonic.md ADDED

@@ -0,0 +1,247 @@
+# AWS Nova Sonic voice
+The `NovaSonicVoice` class provides real-time speech-to-speech capabilities backed by [AWS Bedrock Nova 2 Sonic](https://docs.aws.amazon.com/nova/latest/userguide/speech.html). It opens a bidirectional stream to the model and emits events for assistant audio, transcribed text, tool calls, turn boundaries, and interruptions.
+## Usage example
+```typescript
+import { NovaSonicVoice } from '@mastra/voice-aws-nova-sonic'
+import { playAudio, getMicrophoneStream } from '@mastra/node-audio'
+// Initialize using the default AWS credential provider chain
+const voice = new NovaSonicVoice({
+  region: 'us-east-1',
+  speaker: 'matthew',
+})
+// Or pass explicit credentials
+const voiceWithCredentials = new NovaSonicVoice({
+  region: 'us-east-1',
+  speaker: 'tiffany',
+  credentials: {
+    accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
+    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!,
+  },
+})
+// Establish the bidirectional stream
+await voice.connect()
+// Listen for assistant audio (Int16Array PCM)
+voice.on('speaking', ({ audioData }) => {
+  if (audioData) playAudio(audioData)
+})
+// Listen for transcribed text from the user and assistant
+voice.on('writing', ({ text, role, generationStage }) => {
+  console.log(`${role} (${generationStage ?? 'FINAL'}): ${text}`)
+})
+// Stream microphone audio in real time
+const microphoneStream = getMicrophoneStream()
+await voice.send(microphoneStream)
+// Disconnect when done
+voice.close()
+```
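Depending on your playback stack, the `Int16Array` chunks from `speaking` may need converting to floating-point samples. The Web Audio API, for instance, works with `Float32Array` data. This is a minimal sketch that assumes mono audio. `int16ToFloat32` and `chunkDurationMs` are hypothetical helpers. The 24 kHz fallback rate is an assumption, so prefer the event's `sampleRate` when it is present.

```typescript
// Assumed fallback when the event omits sampleRate; confirm against your session's output settings.
const DEFAULT_OUTPUT_SAMPLE_RATE = 24000

// Scale signed 16-bit samples into the [-1, 1) range used by floating-point audio APIs.
function int16ToFloat32(samples: Int16Array): Float32Array {
  const out = new Float32Array(samples.length)
  for (let i = 0; i < samples.length; i++) {
    out[i] = samples[i] / 32768
  }
  return out
}

// Duration of a mono chunk in milliseconds.
function chunkDurationMs(samples: Int16Array, sampleRate = DEFAULT_OUTPUT_SAMPLE_RATE): number {
  return (samples.length * 1000) / sampleRate
}

console.log(chunkDurationMs(new Int16Array(2400))) // 100
```

Inside the `speaking` handler you would call `int16ToFloat32(audioData)` and `chunkDurationMs(audioData, sampleRate)`.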
+## Authentication
+`NovaSonicVoice` uses the AWS SDK credential resolution chain when no `credentials` option is passed. Mastra calls `defaultProvider()` from `@aws-sdk/credential-provider-node`, which checks environment variables first, then the shared credentials and config files (including SSO profiles), web identity tokens such as those used on EKS, and finally ECS container or EC2 instance roles.
+To use static credentials, pass them on the constructor:
+```typescript
+new NovaSonicVoice({
+  region: 'us-east-1',
+  credentials: {
+    accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
+    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!,
+    sessionToken: process.env.AWS_SESSION_TOKEN,
+  },
+})
+```
+The voice provider never logs credential values.
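The same rule is worth following in application code that reports which credential source is in use. This is a minimal sketch. `describeCredentials` is a hypothetical helper, and `StaticCredentials` is a local stand-in for the relevant fields of `AwsCredentialIdentity`. It does not reproduce the provider's internal masking.

```typescript
// Local stand-in for the AwsCredentialIdentity fields shown in the example above.
interface StaticCredentials {
  accessKeyId: string
  secretAccessKey: string
  sessionToken?: string
}

// Report the credential source without exposing secrets: keep only the last four
// characters of the access key ID and never include the secret or session token.
function describeCredentials(creds: StaticCredentials | undefined): string {
  if (!creds) return 'default credential provider chain'
  const tail = creds.accessKeyId.slice(-4)
  const token = creds.sessionToken ? ', with session token' : ''
  return `static credentials (accessKeyId ****${tail}${token})`
}

console.log(describeCredentials(undefined)) // default credential provider chain
```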
+## Configuration
+### Constructor options
+**region** (`'us-east-1' | 'us-west-2' | 'ap-northeast-1'`): AWS region that hosts the Nova Sonic model. (Default: `'us-east-1'`)
+**model** (`string`): Bedrock model ID for the bidirectional stream. (Default: `'amazon.nova-2-sonic-v1:0'`)
+**credentials** (`AwsCredentialIdentity`): Static AWS credentials. When omitted the default AWS credential provider chain is used.
+**speaker** (`string | NovaSonicVoiceConfigDetails`): Default voice for the assistant. Pass a voice ID string such as 'matthew' or an object that includes a language code and gender. (Default: `'matthew'`)
+**languageCode** (`NovaSonicLanguageCode`): Language code used for the session. Polyglot voices support all listed languages.
+**instructions** (`string`): System prompt sent at session start. Equivalent to calling addInstructions() before connect().
+**tools** (`NovaSonicToolConfig[]`): Tools exposed to the model. When the voice instance is attached to an Agent, the Agent's tools are added automatically.
+**sessionConfig** (`NovaSonicSessionConfig`): Inference, turn-detection, and tool-choice configuration. See Session configuration below.
+**debug** (`boolean`): Enable verbose logging for stream events. Sensitive fields are masked. (Default: `false`)
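The documented defaults can be expressed as a small resolver, which is handy for logging the effective configuration at startup. This is a minimal sketch. `NovaSonicOptionsSketch` and `withDefaults` are hypothetical and cover only a subset of the options. They also treat `speaker` as a plain voice ID, even though the real option also accepts an object.

```typescript
type Region = 'us-east-1' | 'us-west-2' | 'ap-northeast-1'

// Hypothetical subset of the constructor options listed above.
interface NovaSonicOptionsSketch {
  region?: Region
  model?: string
  speaker?: string
  instructions?: string
  debug?: boolean
}

// Defaults as stated in the option descriptions above.
const DEFAULTS = {
  region: 'us-east-1',
  model: 'amazon.nova-2-sonic-v1:0',
  speaker: 'matthew',
  debug: false,
} as const

// Fill unset fields from the documented defaults; explicit values always win.
function withDefaults(options: NovaSonicOptionsSketch = {}) {
  return {
    region: options.region ?? DEFAULTS.region,
    model: options.model ?? DEFAULTS.model,
    speaker: options.speaker ?? DEFAULTS.speaker,
    debug: options.debug ?? DEFAULTS.debug,
    instructions: options.instructions,
  }
}

console.log(withDefaults({ speaker: 'tiffany' }))
```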
+### Session configuration
+`sessionConfig` controls inference parameters and turn-taking behavior. All fields are optional.
+**inferenceConfiguration** (`object`): Sampling and decoding parameters.
+**inferenceConfiguration.maxTokens** (`number`): Maximum tokens generated per turn.
+**inferenceConfiguration.temperature** (`number`): Sampling temperature.
+**inferenceConfiguration.topP** (`number`): Nucleus sampling probability.
+**inferenceConfiguration.topK** (`number`): Top-k sampling.
+**inferenceConfiguration.stopSequences** (`string[]`): Sequences that end generation.
+**turnDetectionConfiguration** (`object`): Endpointing sensitivity for turn detection.
+**turnDetectionConfiguration.endpointingSensitivity** (`'HIGH' | 'MEDIUM' | 'LOW'`): Pause duration before the model considers a turn complete. HIGH ends turns fastest (about 1.5s pause), MEDIUM is balanced (about 1.75s), LOW waits longest (about 2s).
+**toolChoice** (`'auto' | 'any' | { tool: { name: string } }`): How the model decides whether to call a tool.
+**enableKnowledgeGrounding** (`boolean`): Enable retrieval-augmented grounding against a Bedrock knowledge base.
+**knowledgeBaseConfig** (`{ knowledgeBaseId?: string; dataSourceId?: string }`): Knowledge base used when knowledge grounding is enabled.
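The following sketch shows a `sessionConfig` value together with a hypothetical `sensitivityFor` helper. The helper maps a desired minimum pause onto the approximate durations quoted for `endpointingSensitivity`. The types are local mirrors, not imports from the package, and numbers such as `maxTokens: 1024` are placeholders rather than recommendations.

```typescript
// Local mirrors of the session fields listed above (not the package's exported types).
type EndpointingSensitivity = 'HIGH' | 'MEDIUM' | 'LOW'

interface SessionConfigSketch {
  inferenceConfiguration?: {
    maxTokens?: number
    temperature?: number
    topP?: number
    topK?: number
    stopSequences?: string[]
  }
  turnDetectionConfiguration?: { endpointingSensitivity?: EndpointingSensitivity }
  toolChoice?: 'auto' | 'any' | { tool: { name: string } }
}

// Approximate pause lengths quoted above for each sensitivity level, in seconds.
const APPROX_PAUSE_SECONDS: Record<EndpointingSensitivity, number> = {
  HIGH: 1.5,
  MEDIUM: 1.75,
  LOW: 2,
}

// Pick the most responsive level whose approximate pause is at least `minPauseSeconds`,
// so speakers who pause mid-sentence are less likely to be cut off.
function sensitivityFor(minPauseSeconds: number): EndpointingSensitivity {
  const order: EndpointingSensitivity[] = ['HIGH', 'MEDIUM', 'LOW']
  return order.find((level) => APPROX_PAUSE_SECONDS[level] >= minPauseSeconds) ?? 'LOW'
}

const sessionConfig: SessionConfigSketch = {
  inferenceConfiguration: { maxTokens: 1024, temperature: 0.7, topP: 0.9 },
  turnDetectionConfiguration: { endpointingSensitivity: sensitivityFor(1.6) }, // 'MEDIUM'
  toolChoice: 'auto',
}
```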
+## Methods
+### `connect()`
+Opens the bidirectional stream to AWS Bedrock and sends the initial session, prompt, and system events. Call this before `speak`, `listen`, or `send`.
+**options** (`{ requestContext?: RequestContext }`): Optional request context propagated to tool calls made during the session.
+Returns: `Promise<void>`
+### `speak()`
+Synthesizes speech for a text prompt and emits `speaking` events as audio is produced.
+**input** (`string | NodeJS.ReadableStream`): Text or text stream to synthesize.
+**options** (`NovaSonicVoiceOptions`): Per-call overrides such as the speaker or language code.
+Returns: `Promise<void>`
+### `send()`
+Streams microphone audio (or any PCM source) to the model. Use this for live, continuous conversation.
+**audioData** (`NodeJS.ReadableStream | Int16Array`): 16-bit PCM audio to forward to the model.
+Returns: `Promise<void>`
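Many capture libraries produce `Float32Array` samples, while `send()` expects 16-bit PCM. This is a minimal conversion and framing sketch. `float32ToInt16` and `toFrames` are hypothetical helpers. The 320-sample (20 ms) frame size assumes a 16 kHz input rate, which you should check against the model's input audio settings.

```typescript
// Convert floating-point samples in [-1, 1] to signed 16-bit PCM, clamping out-of-range values.
function float32ToInt16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length)
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]))
    // Negative values scale to -32768, positive values to 32767, so both ends fit.
    out[i] = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767)
  }
  return out
}

// Split PCM into fixed-size frames; the last frame may be shorter.
// 320 samples is 20 ms at an assumed 16 kHz input rate.
function toFrames(pcm: Int16Array, frameSize = 320): Int16Array[] {
  const frames: Int16Array[] = []
  for (let start = 0; start < pcm.length; start += frameSize) {
    frames.push(pcm.subarray(start, start + frameSize))
  }
  return frames
}

console.log(toFrames(new Int16Array(700)).map((f) => f.length)) // [ 320, 320, 60 ]
```

For continuous microphone input, the documented path is passing a stream to `send()`. Fixed frames are mainly useful when you need per-frame processing such as level metering.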
+### `listen()`
+Convenience wrapper that delegates to `send()`. Use it when you want a single transcription pass over a finite audio stream.
+**audioData** (`NodeJS.ReadableStream`): Audio stream to transcribe.
+Returns: `Promise<void>`
+### `endAudioInput()`
+Signals the end of the current audio turn so the model can finalize its response. Call this when the user stops speaking and the provider is not configured for server-side turn detection.
+Returns: `Promise<void>`
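When the server is not ending turns for you, the application has to decide when the user has stopped speaking. One simple approach is an energy (RMS) threshold over fixed-size frames: after enough consecutive quiet frames, call `endAudioInput()`. This technique is swapped in here for illustration and is not something the package provides. `rms` and `SilenceDetector` are hypothetical, and the threshold and frame count are untuned placeholders.

```typescript
// Root-mean-square level of a 16-bit PCM frame, normalized to [0, 1].
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0
  let sumSquares = 0
  for (let i = 0; i < frame.length; i++) {
    const s = frame[i] / 32768
    sumSquares += s * s
  }
  return Math.sqrt(sumSquares / frame.length)
}

// Returns true from push() once `silentFramesNeeded` consecutive frames fall below
// `threshold`, but only after at least one frame of speech has been heard.
class SilenceDetector {
  private readonly threshold: number
  private readonly silentFramesNeeded: number
  private silentFrames = 0
  private heardSpeech = false

  // Defaults are placeholders: 50 frames of 20 ms is about 1 s of quiet.
  constructor(threshold = 0.01, silentFramesNeeded = 50) {
    this.threshold = threshold
    this.silentFramesNeeded = silentFramesNeeded
  }

  push(frame: Int16Array): boolean {
    if (rms(frame) >= this.threshold) {
      this.heardSpeech = true
      this.silentFrames = 0
      return false
    }
    this.silentFrames++
    return this.heardSpeech && this.silentFrames >= this.silentFramesNeeded
  }

  reset(): void {
    this.silentFrames = 0
    this.heardSpeech = false
  }
}
```

When `push()` returns true, you would `await voice.endAudioInput()` and then call `reset()` before the next turn.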
+### `addInstructions()`
+Updates the system prompt for the active session.
+**instructions** (`string`): System prompt to apply to the session.
+Returns: `void`
+### `addTools()`
+Registers tools with the voice instance. When `NovaSonicVoice` is attached to an Agent, the Agent's tools are added automatically.
+**tools** (`ToolsInput`): Tools exposed to the model.
+Returns: `void`
+### `getSpeakers()`
+Returns the list of voices supported by Nova 2 Sonic.
+Returns: `Promise<Array<{ voiceId: string; name: string; language: string; locale: string; gender: 'masculine' | 'feminine'; polyglot: boolean }>>`
+### `getListener()`
+Returns whether the voice instance currently holds an open stream.
+Returns: `Promise<{ enabled: boolean }>`
+### `close()`
+Closes the bidirectional stream and destroys the underlying Bedrock client. Call this when the conversation ends.
+Returns: `void`
+### `on()` / `off()`
+Registers and removes event listeners. See [Voice events](https://mastra.ai/reference/voice/voice.events) for the shared event API.
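As with Node's `EventEmitter`, removing a listener typically requires the same function reference that was registered. Keep handlers you plan to remove in named variables rather than inline arrows. The sketch below illustrates that contract with a hypothetical `TypedEmitter`, keyed by a local copy of some payload types listed under Events. It is not Mastra's implementation.

```typescript
// Subset of the event payloads documented for NovaSonicVoice, declared locally.
interface NovaSonicEventMap {
  speaking: { audioData: Int16Array; sampleRate?: number }
  writing: { text: string; role: 'assistant' | 'user'; generationStage?: 'SPECULATIVE' | 'FINAL' }
  turnComplete: { timestamp: number }
  error: { message: string; code?: string; details?: unknown }
}

type AnyListener = (payload: any) => void

// Minimal typed emitter: on() and off() must receive the same function reference.
class TypedEmitter<Events> {
  private listeners = new Map<keyof Events, Set<AnyListener>>()

  on<K extends keyof Events>(event: K, listener: (payload: Events[K]) => void): void {
    let set = this.listeners.get(event)
    if (!set) {
      set = new Set<AnyListener>()
      this.listeners.set(event, set)
    }
    set.add(listener)
  }

  off<K extends keyof Events>(event: K, listener: (payload: Events[K]) => void): void {
    this.listeners.get(event)?.delete(listener)
  }

  emit<K extends keyof Events>(event: K, payload: Events[K]): void {
    this.listeners.get(event)?.forEach((listener) => listener(payload))
  }
}

const emitter = new TypedEmitter<NovaSonicEventMap>()
const onTurn = ({ timestamp }: { timestamp: number }) => console.log('turn complete at', timestamp)
emitter.on('turnComplete', onTurn)
emitter.emit('turnComplete', { timestamp: 1 }) // logs once
emitter.off('turnComplete', onTurn)
emitter.emit('turnComplete', { timestamp: 2 }) // no listener left, nothing logged
```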
+## Events
+`NovaSonicVoice` emits the following events:
+**speaking** (`event`): Assistant audio chunk. Callback receives `{ audioData: Int16Array, sampleRate?: number }`.
+**writing** (`event`): Transcribed text from the user or assistant. Callback receives `{ text: string, role: 'assistant' | 'user', generationStage?: 'SPECULATIVE' | 'FINAL' }`.
+**toolCall** (`event`): Model requested a tool call. Callback receives `{ name: string, args: Record<string, any>, id: string }`.
+**interrupt** (`event`): User or model interrupted the current turn. Callback receives `{ type: 'user' | 'model', timestamp: number }`.
+**turnComplete** (`event`): Model finished its turn. Callback receives `{ timestamp: number }`.
+**session** (`event`): Session state transition. Callback receives `{ state: 'connecting' | 'connected' | 'disconnected' | 'disconnecting' | 'error' }`.
+**usage** (`event`): Token usage for the turn. Callback receives `{ inputTokens: number, outputTokens: number, totalTokens: number }`.
+**error** (`event`): Stream or provider error. Callback receives `{ message: string, code?: string, details?: unknown }`.
+`generationStage` distinguishes provisional transcripts (`'SPECULATIVE'`) from finalized ones (`'FINAL'`). Use `'FINAL'` text for persistent storage and `'SPECULATIVE'` text for live captions.
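The split between provisional and final text maps naturally onto two stores:

- a live caption per role, overwritten by each `'SPECULATIVE'` event;
- an append-only transcript of `'FINAL'` text.

This is a minimal sketch. `TranscriptStore` is hypothetical, and it assumes each speculative event carries the full provisional text for the current utterance rather than a delta.

```typescript
type Stage = 'SPECULATIVE' | 'FINAL'

interface WritingEvent {
  text: string
  role: 'assistant' | 'user'
  generationStage?: Stage
}

// Keeps the latest provisional caption per role and appends only finalized text to the transcript.
class TranscriptStore {
  readonly captions: { user: string; assistant: string } = { user: '', assistant: '' }
  readonly finalized: Array<{ role: 'assistant' | 'user'; text: string }> = []

  handle(event: WritingEvent): void {
    // Like the usage example earlier on this page, treat a missing stage as FINAL.
    const stage = event.generationStage ?? 'FINAL'
    if (stage === 'SPECULATIVE') {
      this.captions[event.role] = event.text // replace, assuming full provisional text
      return
    }
    this.captions[event.role] = '' // the final text supersedes the live caption
    this.finalized.push({ role: event.role, text: event.text })
  }
}

const store = new TranscriptStore()
store.handle({ role: 'assistant', text: 'Hel', generationStage: 'SPECULATIVE' })
store.handle({ role: 'assistant', text: 'Hello there.', generationStage: 'FINAL' })
console.log(store.finalized.length) // 1
```

In an application, `voice.on('writing', (e) => store.handle(e))` would feed the store.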
+## Available voices
+Nova 2 Sonic ships voices in ten locales. Tiffany and Matthew are polyglot and can speak any supported language.
+| Voice ID   | Name     | Language   | Locale | Gender    | Polyglot |
+| ---------- | -------- | ---------- | ------ | --------- | -------- |
+| `tiffany`  | Tiffany  | English    | en-US  | feminine  | yes      |
+| `matthew`  | Matthew  | English    | en-US  | masculine | yes      |
+| `amy`      | Amy      | English    | en-GB  | feminine  | no       |
+| `olivia`   | Olivia   | English    | en-AU  | feminine  | no       |
+| `kiara`    | Kiara    | English    | en-IN  | feminine  | no       |
+| `arjun`    | Arjun    | English    | en-IN  | masculine | no       |
+| `ambre`    | Ambre    | French     | fr-FR  | feminine  | no       |
+| `florian`  | Florian  | French     | fr-FR  | masculine | no       |
+| `beatrice` | Beatrice | Italian    | it-IT  | feminine  | no       |
+| `lorenzo`  | Lorenzo  | Italian    | it-IT  | masculine | no       |
+| `tina`     | Tina     | German     | de-DE  | feminine  | no       |
+| `lennart`  | Lennart  | German     | de-DE  | masculine | no       |
+| `lupe`     | Lupe     | Spanish    | es-US  | feminine  | no       |
+| `carlos`   | Carlos   | Spanish    | es-US  | masculine | no       |
+| `carolina` | Carolina | Portuguese | pt-BR  | feminine  | no       |
+| `leo`      | Leo      | Portuguese | pt-BR  | masculine | no       |
+| `kiara`    | Kiara    | Hindi      | hi-IN  | feminine  | no       |
+| `arjun`    | Arjun    | Hindi      | hi-IN  | masculine | no       |
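Because Tiffany and Matthew are polyglot, a voice can be chosen per locale with a fallback:

1. Prefer a native voice of the requested gender.
2. Otherwise use a polyglot voice.
3. Reject locales that no voice covers.

This is a minimal sketch that assumes the table above is complete. `pickVoice` is hypothetical. In an application the list could come from `getSpeakers()` instead of being hard-coded.

```typescript
type Gender = 'masculine' | 'feminine'

interface VoiceInfo {
  voiceId: string
  locale: string
  gender: Gender
  polyglot: boolean
}

// Rows copied from the table above: [voiceId, locale, gender, polyglot].
const TABLE: Array<[string, string, Gender, boolean]> = [
  ['tiffany', 'en-US', 'feminine', true],
  ['matthew', 'en-US', 'masculine', true],
  ['amy', 'en-GB', 'feminine', false],
  ['olivia', 'en-AU', 'feminine', false],
  ['kiara', 'en-IN', 'feminine', false],
  ['arjun', 'en-IN', 'masculine', false],
  ['ambre', 'fr-FR', 'feminine', false],
  ['florian', 'fr-FR', 'masculine', false],
  ['beatrice', 'it-IT', 'feminine', false],
  ['lorenzo', 'it-IT', 'masculine', false],
  ['tina', 'de-DE', 'feminine', false],
  ['lennart', 'de-DE', 'masculine', false],
  ['lupe', 'es-US', 'feminine', false],
  ['carlos', 'es-US', 'masculine', false],
  ['carolina', 'pt-BR', 'feminine', false],
  ['leo', 'pt-BR', 'masculine', false],
  ['kiara', 'hi-IN', 'feminine', false],
  ['arjun', 'hi-IN', 'masculine', false],
]

const VOICES: VoiceInfo[] = TABLE.map(([voiceId, locale, gender, polyglot]) => ({ voiceId, locale, gender, polyglot }))

function pickVoice(locale: string, gender: Gender, voices: VoiceInfo[] = VOICES): string | undefined {
  // Polyglot voices only cover supported languages, so reject locales absent from the table.
  if (!voices.some((v) => v.locale === locale)) return undefined
  const native = voices.find((v) => v.locale === locale && v.gender === gender)
  if (native) return native.voiceId
  // No native voice of that gender (e.g. en-GB masculine): fall back to a polyglot voice.
  return voices.find((v) => v.polyglot && v.gender === gender)?.voiceId
}

console.log(pickVoice('en-GB', 'masculine')) // matthew
```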
+## Notes
+- Audio is streamed as 16-bit PCM. Assistant audio is emitted as `Int16Array` on the `speaking` event.
+- The voice instance must call `connect()` before any other streaming method.
+- `close()` destroys the underlying `BedrockRuntimeClient` to release the HTTP/2 session.
+- Nova 2 Sonic is available in `us-east-1`, `us-west-2`, and `ap-northeast-1`. Other regions throw a configuration error during construction.
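When audio comes from a file, socket, or other byte source rather than a microphone helper, it has to be decoded into 16-bit samples before it can be passed to `send()` as an `Int16Array`. This is a minimal sketch that assumes little-endian PCM, the usual layout for raw and WAV audio. `bytesToInt16` and `int16ToBytes` are hypothetical helpers.

```typescript
// Decode little-endian 16-bit PCM bytes into samples, independent of the host's byte order.
function bytesToInt16(bytes: Uint8Array): Int16Array {
  if (bytes.byteLength % 2 !== 0) {
    throw new Error('16-bit PCM needs an even number of bytes')
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const out = new Int16Array(bytes.byteLength / 2)
  for (let i = 0; i < out.length; i++) {
    out[i] = view.getInt16(i * 2, true) // true = little-endian
  }
  return out
}

// Inverse: encode samples as little-endian bytes, e.g. for writing a raw .pcm file.
function int16ToBytes(samples: Int16Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2)
  const view = new DataView(out.buffer)
  for (let i = 0; i < samples.length; i++) {
    view.setInt16(i * 2, samples[i], true)
  }
  return out
}

console.log(bytesToInt16(new Uint8Array([0x01, 0x00, 0x00, 0x80]))[1]) // -32768
```

`DataView` is used instead of `new Int16Array(bytes.buffer)` for two reasons. The latter throws when `byteOffset` is odd, and it reads samples in the host's native byte order.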