@mastra/voice-xai-realtime 0.0.0
- package/CHANGELOG.md +39 -0
- package/README.md +106 -0
- package/dist/docs/SKILL.md +26 -0
- package/dist/docs/assets/SOURCE_MAP.json +6 -0
- package/dist/docs/references/docs-voice-overview.md +1188 -0
- package/dist/docs/references/reference-voice-xai-realtime.md +267 -0
- package/dist/index.cjs +851 -0
- package/dist/index.cjs.map +1 -0
- package/dist/index.d.ts +91 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +849 -0
- package/dist/index.js.map +1 -0
- package/dist/types.d.ts +181 -0
- package/dist/types.d.ts.map +1 -0
- package/dist/utils.d.ts +25 -0
- package/dist/utils.d.ts.map +1 -0
- package/package.json +67 -0
@@ -0,0 +1,267 @@

# xAI Realtime voice

The `XAIRealtimeVoice` class provides realtime voice interaction capabilities using the xAI Grok Voice Agent API. It implements Mastra's `MastraVoice` realtime contract and supports bidirectional audio streaming, text turns, server VAD, xAI voices, function tools, and xAI server-side tools.

## Usage example

```typescript
import { Agent } from '@mastra/core/agent'
import { getMicrophoneStream, playAudio } from '@mastra/node-audio'
import { XAIRealtimeVoice } from '@mastra/voice-xai-realtime'

const voice = new XAIRealtimeVoice({
  apiKey: process.env.XAI_API_KEY,
  model: 'grok-voice-think-fast-1.0',
  speaker: 'eve',
  instructions: 'You are a concise voice assistant.',
  turnDetection: { type: 'server_vad' },
})

const agent = new Agent({
  id: 'voice-agent',
  name: 'Voice Agent',
  instructions: 'You are a helpful voice assistant.',
  model: 'xai/grok-4.3',
  voice,
})

await agent.voice.connect()

agent.voice.on('speaker', audioStream => {
  playAudio(audioStream)
})

agent.voice.on('writing', ({ text, role }) => {
  console.log(`${role}: ${text}`)
})

await agent.voice.speak('How can I help you today?')

const microphoneStream = getMicrophoneStream()
await agent.voice.send(microphoneStream)

agent.voice.close()
```

## Configuration

### Constructor options

**apiKey** (`string`): xAI API key. Falls back to the XAI\_API\_KEY environment variable.

**ephemeralToken** (`string`): Short-lived xAI token sent with the WebSocket protocol instead of an authorization header.

**model** (`XAIRealtimeModel`): The Grok voice model to use. (Default: `'grok-voice-think-fast-1.0'`)

**speaker** (`XAIVoice`): Voice ID to use for speech output. Built-in values are eve, ara, rex, sal, and leo. Custom xAI voice IDs are also supported. (Default: `'eve'`)

**instructions** (`string`): System instructions sent in session.update.

**turnDetection** (`XAITurnDetection`): Voice activity detection configuration. (Default: `{ type: 'server_vad' }`)

**audio** (`XAIAudioConfig`): Input and output audio format configuration. (Default: `24 kHz audio/pcm input and output`)

**serverTools** (`XAIServerTool[]`): xAI server-side tools to send in session.update. Supports file\_search, web\_search, x\_search, and mcp. These are merged with session.tools.

**session** (`Partial<XAISessionConfig>`): Additional xAI session fields to merge into the initial session.update event.

**url** (`string`): Override the xAI realtime WebSocket URL. (Default: `'wss://api.x.ai/v1/realtime'`)

**debug** (`boolean`): Enable debug logging for received xAI events. Debug logs can include transcripts and tool-call arguments. (Default: `false`)

### VoiceConfig pattern

You can also use Mastra's shared voice configuration shape:

```typescript
const voice = new XAIRealtimeVoice({
  speaker: 'ara',
  realtimeConfig: {
    model: 'grok-voice-think-fast-1.0',
    apiKey: process.env.XAI_API_KEY,
    options: {
      instructions: 'Answer briefly.',
      turnDetection: { type: 'server_vad', threshold: 0.85 },
    },
  },
})
```

## Authentication

This provider is built for Node.js server-side runtimes. Use `apiKey` or the `XAI_API_KEY` environment variable to authenticate server-side applications. If you already mint xAI ephemeral tokens on your server, you can pass one as `ephemeralToken`; the provider then uses the `xai-client-secret.<token>` WebSocket protocol instead of an authorization header. If both `apiKey` and `ephemeralToken` are configured, the ephemeral token takes precedence.

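The precedence described above can be sketched as a small standalone function. This is a minimal illustration, not the provider's source: `resolveAuth`, `AuthInput`, `AuthResult`, and `envApiKey` are hypothetical names, and the `Bearer` header format for API keys is an assumption.

```typescript
// Hypothetical helper: not part of @mastra/voice-xai-realtime.
interface AuthInput {
  apiKey?: string
  ephemeralToken?: string
  envApiKey?: string // stands in for process.env.XAI_API_KEY
}

interface AuthResult {
  protocols: string[] // WebSocket subprotocols to request
  headers: Record<string, string> // headers for the upgrade request
}

function resolveAuth({ apiKey, ephemeralToken, envApiKey }: AuthInput): AuthResult {
  // The ephemeral token wins when both credentials are configured.
  if (ephemeralToken) {
    return { protocols: [`xai-client-secret.${ephemeralToken}`], headers: {} }
  }
  const key = apiKey ?? envApiKey
  if (!key) {
    throw new Error('No xAI credential: set apiKey, XAI_API_KEY, or ephemeralToken')
  }
  // Assumption: API keys travel as a Bearer token in the Authorization header.
  return { protocols: [], headers: { Authorization: `Bearer ${key}` } }
}

const withToken = resolveAuth({ apiKey: 'key-1', ephemeralToken: 'tok-1' })
const withKey = resolveAuth({ envApiKey: 'key-2' })
```

Here `withToken` carries only the subprotocol and no header, while `withKey` carries only the header.
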
## Methods

### `connect()`

Establishes the WebSocket connection and sends the initial `session.update`.

**requestContext** (`RequestContext`): Optional Mastra request context passed to function tool executions.

Returns: `Promise<void>`

### `close()`

Closes the WebSocket connection, ends active speaker streams, and clears queued events, pending function-call state, and request context. `disconnect()` is an alias for `close()`.

Returns: `void`

### `addInstructions()`

Sets session instructions. If the WebSocket is open, the provider sends a `session.update`; passing `undefined` stores an empty string and clears the active instructions on the current session or the next connection.

**instructions** (`string`): System instructions to send to xAI.

Returns: `void`

### `addTools()`

Registers Mastra function tools and, when connected, refreshes the session tools with `session.update`.

**tools** (`ToolsInput`): Mastra tools to expose as xAI function tools.

Returns: `void`

### `updateConfig()`

Sends a `session.update` event with additional xAI session fields.

**sessionConfig** (`Partial<XAISessionConfig>`): Session fields to update.

Returns: `void`

### `speak()`

Sends a text turn using `conversation.item.create` and then requests a response.

**input** (`string | NodeJS.ReadableStream`): Text or a readable text stream to send as user input.

**options.speaker** (`XAIVoice`): Voice override. This updates the active xAI session voice and is used for subsequent turns.

**options.response** (`Record<string, unknown>`): Additional xAI response.create fields.

Returns: `Promise<void>`

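To make the two-step text turn concrete, here is a rough sketch of the pair of client events it implies. `buildTextTurn` and `ClientEvent` are hypothetical, and the nested field names (`item`, `content`, `input_text`, and the `response` wrapper) are assumptions based on the common realtime event layout, not a verified xAI wire format.

```typescript
// Hypothetical payload builder: illustrates event order only.
type ClientEvent = { type: string; [key: string]: unknown }

function buildTextTurn(text: string, response: Record<string, unknown> = {}): ClientEvent[] {
  return [
    {
      // Step 1: add the text as a user message to the conversation.
      type: 'conversation.item.create',
      item: { type: 'message', role: 'user', content: [{ type: 'input_text', text }] },
    },
    // Step 2: ask for a response; extra fields stand in for options.response.
    { type: 'response.create', response },
  ]
}

const turn = buildTextTurn('How can I help you today?')
const turnTypes = turn.map(event => event.type)
```

The point of the sketch is the ordering: the conversation item is created first, and the response request follows it.
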
### `send()`

Streams realtime audio chunks with `input_audio_buffer.append`.

`send()` requires an open connection. Use it for live microphone audio after `connect()` resolves. Readable stream chunks must be binary audio chunks (`Buffer`, `ArrayBuffer`, or a typed array).

**audioData** (`NodeJS.ReadableStream | Int16Array`): PCM audio stream or Int16Array audio data.

**eventId** (`string`): Optional xAI event ID.

Returns: `Promise<void>`

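Capture pipelines often produce floating-point samples rather than the `Int16Array` that `send()` accepts. The following is a minimal conversion sketch under that assumption; `floatToPcm16` is a hypothetical helper and not something the package exports.

```typescript
// Convert float samples in [-1, 1] to 16-bit signed PCM.
function floatToPcm16(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length)
  samples.forEach((sample, i) => {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, sample))
    // Negative values scale by 32768 and positive values by 32767,
    // which covers the full signed 16-bit range.
    pcm[i] = clamped < 0 ? Math.round(clamped * 32768) : Math.round(clamped * 32767)
  })
  return pcm
}

const pcmChunk = floatToPcm16(new Float32Array([0, 0.5, -0.5, 1, -1, 2]))
// pcmChunk is [0, 16384, -16384, 32767, -32768, 32767]
// await voice.send(pcmChunk) // requires an open connection
```
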
### `listen()`

Sends a finite audio stream with `input_audio_buffer.append`. By default it commits the input buffer and requests a response.

**audioData** (`NodeJS.ReadableStream`): Audio stream to send.

**options.commit** (`boolean`): Whether to send input\_audio\_buffer.commit after the audio item. (Default: `true`)

**options.createResponse** (`boolean`): Whether to send response.create after the audio item. (Default: `true`)

Returns: `Promise<void>`

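The effect of the two options can be modelled as a plan of client event types. This is only a sketch of the documented defaults: `planListenEvents` and `ListenOptions` are hypothetical, and the assumption that the commit precedes `response.create` is mine.

```typescript
interface ListenOptions {
  commit?: boolean
  createResponse?: boolean
}

// Returns the ordered client event types for a finite stream of `chunkCount` chunks.
function planListenEvents(chunkCount: number, options: ListenOptions = {}): string[] {
  const { commit = true, createResponse = true } = options
  const events: string[] = []
  // One append event per audio chunk read from the stream.
  for (let i = 0; i < chunkCount; i++) events.push('input_audio_buffer.append')
  if (commit) events.push('input_audio_buffer.commit')
  if (createResponse) events.push('response.create')
  return events
}

const defaultPlan = planListenEvents(2)
const appendOnly = planListenEvents(2, { commit: false, createResponse: false })
```

With the defaults, two chunks produce two appends followed by a commit and a response request; with both options disabled, only the appends remain, which suits manual turn control.
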
### `answer()`

Sends `response.create` to ask xAI to continue the conversation.

Returns: `Promise<void>`

### `commitAudioBuffer()` and `clearAudioBuffer()`

Send the matching xAI realtime client events for manual turn control.

Returns: `Promise<void>`

### `cancelResponse()`

Sends `response.cancel` to interrupt an in-flight response.

**responseId** (`string`): Optional xAI response ID to cancel.

**eventId** (`string`): Optional xAI event ID.

Returns: `Promise<void>`

## Events

`XAIRealtimeVoice` maps xAI realtime server events onto Mastra voice events:

- `speaker`: emits a readable stream for assistant audio.
- `speaking`: emits assistant audio deltas.
- `speaking.done`: emits when an assistant audio response completes.
- `writing`: emits assistant text deltas and user input transcriptions.
- `error`: emits xAI errors, provider execution errors, tool execution errors, and malformed function-call arguments. Tool errors include `details.call_id` and `details.name`.
- `close`: emits when the WebSocket closes.
- `tool-call-start`: emits before a Mastra function tool is executed.
- `tool-call-result`: emits after a Mastra function tool returns.

Raw xAI event names are also emitted, so you can subscribe to events such as `response.output_audio.delta`, `response.text.delta`, `response.function_call_arguments.done`, and `response.done`.

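The dual emission described above, one raw event name plus an optional mapped Mastra event, can be illustrated with a toy dispatcher. Everything here is hypothetical: `TinyEmitter`, `dispatch`, and especially the `mastraEventFor` table, whose raw-to-Mastra pairs are assumed for the sake of the example.

```typescript
type Handler = (payload: unknown) => void

class TinyEmitter {
  private handlers = new Map<string, Handler[]>()

  on(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? []
    list.push(handler)
    this.handlers.set(event, list)
  }

  emit(event: string, payload: unknown): void {
    for (const handler of this.handlers.get(event) ?? []) handler(payload)
  }
}

// Assumed mapping from raw xAI event names to Mastra voice events.
const mastraEventFor: Record<string, string | undefined> = {
  'response.output_audio.delta': 'speaking',
  'response.text.delta': 'writing',
}

function dispatch(emitter: TinyEmitter, raw: { type: string }): void {
  emitter.emit(raw.type, raw) // the raw event name is always emitted
  const mapped = mastraEventFor[raw.type]
  if (mapped) emitter.emit(mapped, raw) // mapped Mastra event, when one exists
}

const seen: string[] = []
const emitter = new TinyEmitter()
emitter.on('writing', () => { seen.push('writing') })
emitter.on('response.text.delta', () => { seen.push('response.text.delta') })
emitter.on('response.done', () => { seen.push('response.done') })
dispatch(emitter, { type: 'response.text.delta' })
dispatch(emitter, { type: 'response.done' })
```

A listener on `writing` and a listener on `response.text.delta` both fire for the same server event, while `response.done` reaches only its raw-name listener in this sketch.
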
## Tools

### Mastra function tools

Tools added with `addTools()` are converted into xAI function tools and included in `session.update`.

```typescript
import { createTool } from '@mastra/core/tools'
import { z } from 'zod'

const weatherTool = createTool({
  id: 'getWeather',
  description: 'Get current weather for a location.',
  inputSchema: z.object({
    location: z.string(),
  }),
  execute: async ({ location }) => {
    return { location, temperature: 22 }
  },
})

voice.addTools({ getWeather: weatherTool })
```

When xAI emits `response.function_call_arguments.done`, the provider executes the matching Mastra tool and sends a `function_call_output` item. If xAI emits multiple function calls for one response, the provider waits for every tool result and the response's `response.done` event before sending one continuation `response.create`.

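The "wait for every tool result and `response.done`, then continue once" rule can be sketched as a small piece of bookkeeping. `ToolCallTracker` is a hypothetical class written for illustration and makes no claim about how the provider tracks this state internally.

```typescript
// Hypothetical bookkeeping for a single response with function calls.
class ToolCallTracker {
  private pending = new Set<string>()
  private sawCalls = false
  private responseDone = false
  private continued = false

  // Call when response.function_call_arguments.done arrives.
  callStarted(callId: string): void {
    this.pending.add(callId)
    this.sawCalls = true
  }

  // Call after the function_call_output item for this call has been sent.
  callFinished(callId: string): boolean {
    this.pending.delete(callId)
    return this.shouldContinue()
  }

  // Call when response.done arrives.
  markResponseDone(): boolean {
    this.responseDone = true
    return this.shouldContinue()
  }

  // True exactly once: the response is done and no call is outstanding.
  private shouldContinue(): boolean {
    if (this.continued || !this.sawCalls || !this.responseDone || this.pending.size > 0) {
      return false
    }
    this.continued = true
    return true
  }
}

const tracker = new ToolCallTracker()
tracker.callStarted('call_a')
tracker.callStarted('call_b')
const steps = [
  tracker.callFinished('call_a'), // false: call_b pending, response not done
  tracker.markResponseDone(), // false: call_b still pending
  tracker.callFinished('call_b'), // true: send one response.create now
  tracker.markResponseDone(), // false: continuation was already sent
]
```

Whichever of the two signals arrives last triggers the single continuation, so the order in which tool results and `response.done` arrive does not matter.
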
### xAI server-side tools

xAI server-side tools are passed through in the session configuration and executed by xAI. Tools passed in `session.tools` and `serverTools` are merged:

```typescript
const voice = new XAIRealtimeVoice({
  apiKey: process.env.XAI_API_KEY,
  serverTools: [
    { type: 'web_search' },
    { type: 'x_search', allowed_x_handles: ['xai'] },
    { type: 'file_search', vector_store_ids: ['collection_123'], max_num_results: 10 },
    {
      type: 'mcp',
      server_url: 'https://mcp.example.com/mcp',
      server_label: 'business-tools',
      allowed_tools: ['lookup_order'],
    },
  ],
})
```

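As a rough picture of what merging the two lists could look like, the sketch below simply concatenates them. The text above does not state the ordering or any deduplication rules, so both the order (`session.tools` first) and the absence of deduplication are assumptions; `mergeTools` and `SessionTool` are hypothetical names.

```typescript
type SessionTool = { type: string; [key: string]: unknown }

// Assumption: plain concatenation, session.tools first, then serverTools.
function mergeTools(sessionTools: SessionTool[] = [], serverTools: SessionTool[] = []): SessionTool[] {
  return [...sessionTools, ...serverTools]
}

const merged = mergeTools(
  [{ type: 'web_search' }],
  [{ type: 'x_search', allowed_x_handles: ['xai'] }],
)
const mergedTypes = merged.map(tool => tool.type)
```
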
## Audio formats

The default input and output format is 24 kHz PCM16. You can also configure supported PCM sample rates or telephony codecs:

```typescript
const voice = new XAIRealtimeVoice({
  audio: {
    input: { format: { type: 'audio/pcm', rate: 16000 } },
    output: { format: { type: 'audio/pcm', rate: 16000 } },
  },
})
```

Supported format types are `audio/pcm`, `audio/pcmu`, and `audio/pcma`. PCM supports the documented sample rates from 8 kHz through 48 kHz. `audio/pcmu` and `audio/pcma` are G.711 telephony codecs and use 8 kHz.

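The format rules above can be checked before constructing the voice. The sketch below is a hypothetical pre-flight check, not part of the package: `checkFormat` only enforces the 8 kHz to 48 kHz bounds because the individual supported rates are not listed here, and `pcm16BytesPerSecond` assumes single-channel audio.

```typescript
type AudioFormatType = 'audio/pcm' | 'audio/pcmu' | 'audio/pcma'

interface AudioFormat {
  type: AudioFormatType
  rate?: number
}

// Returns an error message, or null when the format passes the checks.
function checkFormat(format: AudioFormat): string | null {
  // Fall back to the documented defaults: 24 kHz for PCM, 8 kHz for G.711.
  const rate = format.rate ?? (format.type === 'audio/pcm' ? 24000 : 8000)
  if (format.type === 'audio/pcm') {
    // Only the range bounds are checked; specific supported rates are not.
    return rate >= 8000 && rate <= 48000 ? null : `PCM rate ${rate} is outside 8000-48000 Hz`
  }
  return rate === 8000 ? null : `${format.type} is a G.711 codec and uses 8000 Hz, got ${rate}`
}

// Bytes per second for PCM16 (2 bytes per sample), assuming one channel.
function pcm16BytesPerSecond(rate: number): number {
  return rate * 2
}

const pcmOk = checkFormat({ type: 'audio/pcm', rate: 16000 })
const pcmuBad = checkFormat({ type: 'audio/pcmu', rate: 16000 })
const defaultByteRate = pcm16BytesPerSecond(24000)
```

Under the mono assumption, the default 24 kHz PCM16 stream works out to 48,000 bytes per second in each direction, which is a useful figure when sizing buffers for `send()`.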