@micdrop/server 1.7.1 → 2.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,20 +1,23 @@
+ # 🖐️🎤 Micdrop: Real-Time Voice Conversations with AI
+
+ Micdrop is a set of open source TypeScript packages to build real-time voice conversations with AI agents. It handles all the complexities on the browser and server side (microphone, speaker, VAD, network communication, etc.) and provides ready-to-use implementations for various AI providers.
+
  # @micdrop/server

- A Node.js library for handling real-time voice conversations with WebSocket-based audio streaming.
+ The Node.js server implementation of [Micdrop](../../README.md).

  For browser implementation, see [@micdrop/client](../client/README.md) package.

  ## Features

- - 🌐 WebSocket server for real-time audio streaming
+ - 🤖 AI agent integration
+ - 🎙️ Speech-to-text and text-to-speech integration
  - 🔊 Advanced audio processing:
-   - Streaming TTS support
-   - Efficient audio chunk delivery
-   - Interrupt handling
+   - Streaming input and output
+   - Audio conversion
+   - Interruption handling
  - 💬 Conversation state management
- - 🎙️ Speech-to-text and text-to-speech integration
- - 🤖 AI conversation generation support
- - 💾 Debug mode with optional audio saving
+ - 🌐 WebSocket-based audio streaming

  ## Installation

@@ -29,272 +32,271 @@ pnpm add @micdrop/server
  ## Quick Start

  ```typescript
+ import { ElevenLabsTTS } from '@micdrop/elevenlabs'
+ import { GladiaSTT } from '@micdrop/gladia'
+ import { OpenaiAgent } from '@micdrop/openai'
+ import { MicdropServer } from '@micdrop/server'
  import { WebSocketServer } from 'ws'
- import { CallServer, CallConfig } from '@micdrop/server'

- // Create WebSocket server
  const wss = new WebSocketServer({ port: 8080 })

- // Define call configuration
- const config: CallConfig = {
-   // Initial system prompt for the conversation
-   systemPrompt: 'You are a helpful assistant',
-
-   // Optional first message from assistant
-   // Omit to generate the first message
-   firstMessage: 'Hello!',
-
-   // Function to generate assistant responses
-   async generateAnswer(conversation) {
-     // Implement your LLM or response generation logic
-     return 'Assistant response'
-   },
-
-   // Function to convert speech to text
-   async speech2Text(audioBlob, lastMessagePrompt) {
-     // Implement your STT logic
-     return 'Transcribed text'
-   },
-
-   // Function to convert text to speech
-   // Can return either a complete ArrayBuffer or a ReadableStream for streaming
-   async text2Speech(
-     text: string
-   ): Promise<ArrayBuffer | NodeJS.ReadableStream> {
-     // Implement your TTS logic
-     return new ArrayBuffer(0) // Audio data
-   },
-
-   // Optional callback when a message is added
-   onMessage(message) {
-     console.log('New message:', message)
-   },
-
-   // Optional callback when call ends
-   onEnd(summary) {
-     console.log('Call ended:', summary)
-   },
- }
-
- // Handle new connections
- wss.on('connection', (ws) => {
-   // Create call handler with configuration
-   new CallServer(ws, config)
- })
- ```
+ wss.on('connection', (socket) => {
+   // Setup agent
+   const agent = new OpenaiAgent({
+     apiKey: process.env.OPENAI_API_KEY || '',
+     systemPrompt: 'You are a helpful assistant',
+   })

- ## Demo
+   // Setup STT
+   const stt = new GladiaSTT({
+     apiKey: process.env.GLADIA_API_KEY || '',
+   })

- Check out the demo implementation in the [@micdrop/demo-server](../demo-server/README.md) package. It shows:
+   // Setup TTS
+   const tts = new ElevenLabsTTS({
+     apiKey: process.env.ELEVENLABS_API_KEY || '',
+     voiceId: process.env.ELEVENLABS_VOICE_ID || '',
+   })

- - Setting up a Fastify server with WebSocket support
- - Configuring the CallServer with custom handlers
- - Basic authentication flow
- - Example speech-to-text and text-to-speech implementations
- - Error handling patterns
+   // Handle call
+   new MicdropServer(socket, {
+     firstMessage: 'Hello, how can I help you today?',
+     agent,
+     stt,
+     tts,
+   })
+ })
+ ```

- Here's a simplified version from the demo:
+ ## Agent / STT / TTS

- ## Documentation
+ The Micdrop server has 3 main components:

- The server package provides several core components:
+ - `Agent` - AI agent using an LLM
+ - `STT` - Speech-to-text
+ - `TTS` - Text-to-speech

- - **CallServer** - Main class that handles WebSocket connections, audio streaming, and conversation flow
- - **CallConfig** - Configuration interface for customizing speech processing and conversation behavior
- - **Types** - Common TypeScript types and interfaces for messages and commands
- - **Error Handling** - Standardized error handling with specific error codes
+ ### Available implementations

- ## API Reference
+ Micdrop provides ready-to-use implementations for the following AI providers:

- ### CallServer
+ - [@micdrop/openai](../openai/README.md)
+ - [@micdrop/elevenlabs](../elevenlabs/README.md)
+ - [@micdrop/cartesia](../cartesia/README.md)
+ - [@micdrop/mistral](../mistral/README.md)
+ - [@micdrop/gladia](../gladia/README.md)

- The main class for managing WebSocket connections and audio streaming.
+ ### Custom implementations

- ```typescript
- class CallServer {
-   constructor(socket: WebSocket, config: CallConfig)
+ You can use the provided abstractions to write your own implementations, as sketched below:

-   // Add assistant message and send to client with audio (TTS)
-   answer(message: string): Promise<void>
+ - **[Agent](./docs/Agent.md)** - Abstract class for answer generation
+ - **[STT](./docs/STT.md)** - Abstract class for speech-to-text
+ - **[TTS](./docs/TTS.md)** - Abstract class for text-to-speech

-   // Reset conversation (including system prompt)
-   resetConversation(conversation: Conversation): void
- }
- ```
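+ For example, a custom TTS could extend the `TTS` abstract class. The sketch below is illustrative only: the exported base class, method name, and signature are assumptions, so check [TTS](./docs/TTS.md) for the actual contract.
+
+ ```typescript
+ import { TTS } from '@micdrop/server' // Assumed export, see ./docs/TTS.md
+ import { Readable } from 'stream'
+
+ // Placeholder for your own provider call returning audio bytes (hypothetical)
+ declare function myProviderSynthesize(text: string): Promise<Buffer>
+
+ // Hypothetical sketch of a custom text-to-speech implementation
+ class MyTTS extends TTS {
+   // Assumed hook: convert a chunk of text into an audio stream
+   async speak(text: string): Promise<Readable> {
+     const audio = await myProviderSynthesize(text)
+     return Readable.from([audio])
+   }
+ }
+ ```
+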
+ ## Examples

- ### CallConfig
+ ### Authorization and Language Parameters

- Configuration interface for customizing the call behavior.
+ For production applications, you'll want to handle authorization and language configuration:

  ```typescript
- interface CallConfig {
-   // Initial system prompt for the conversation
-   systemPrompt: string
-
-   // Optional first message from assistant
-   firstMessage?: string
-
-   // Enable debug logging with timestamps
-   debugLog?: boolean
-
-   // Save last speech audio file for debugging (speech.ogg)
-   debugSaveSpeech?: boolean
-
-   // Disable text-to-speech conversion
-   disableTTS?: boolean
-
-   // Generate assistant's response
-   generateAnswer(
-     conversation: Conversation
-   ): Promise<string | ConversationMessage>
+ import { ElevenLabsTTS } from '@micdrop/elevenlabs'
+ import { GladiaSTT } from '@micdrop/gladia'
+ import { OpenaiAgent } from '@micdrop/openai'
+ import {
+   MicdropServer,
+   waitForParams,
+   MicdropError,
+   MicdropErrorCode,
+   handleError,
+ } from '@micdrop/server'
+ import { WebSocketServer } from 'ws'
+ import { z } from 'zod'

-   // Convert audio to text
-   speech2Text(audioBlob: Blob, prevMessage?: string): Promise<string>
+ const wss = new WebSocketServer({ port: 8080 })

-   // Convert text to audio
-   // Can return either a complete ArrayBuffer or a ReadableStream for streaming
-   text2Speech(text: string): Promise<ArrayBuffer | NodeJS.ReadableStream>
+ // Define params schema for authorization and language
+ const callParamsSchema = z.object({
+   authorization: z.string(),
+   lang: z.string().regex(/^[a-z]{2}(-[A-Z]{2})?$/), // e.g., "en", "fr", "en-US"
+ })

-   // Optional callbacks
-   onMessage?(message: ConversationMessage): void
-   onEnd?(summary: CallSummary): void
- }
+ wss.on('connection', async (socket) => {
+   try {
+     // Wait for client parameters (authorization & language)
+     const params = await waitForParams(socket, callParamsSchema.parse)
+
+     // Validate authorization
+     if (params.authorization !== process.env.AUTHORIZATION_KEY) {
+       throw new MicdropError(
+         MicdropErrorCode.Unauthorized,
+         'Invalid authorization'
+       )
+     }
+
+     // Setup agent with language-specific system prompt
+     const agent = new OpenaiAgent({
+       apiKey: process.env.OPENAI_API_KEY || '',
+       systemPrompt: `You are a helpful assistant. Respond in ${params.lang} language.`,
+     })
+
+     // Setup STT with language configuration
+     const stt = new GladiaSTT({
+       apiKey: process.env.GLADIA_API_KEY || '',
+       language: params.lang,
+     })
+
+     // Setup TTS with language configuration
+     const tts = new ElevenLabsTTS({
+       apiKey: process.env.ELEVENLABS_API_KEY || '',
+       voiceId: process.env.ELEVENLABS_VOICE_ID || '',
+       language: params.lang,
+     })
+
+     // Handle call
+     new MicdropServer(socket, {
+       firstMessage: 'Hello! How can I help you today?',
+       agent,
+       stt,
+       tts,
+     })
+   } catch (error) {
+     handleError(socket, error)
+   }
+ })
  ```

- ## WebSocket Protocol
-
- The server implements a specific protocol for client-server communication:
-
- ### Client Commands
+ ### With Fastify

- The client can send the following commands to the server:
-
- - `CallClientCommands.StartSpeaking` - The user starts speaking
- - `CallClientCommands.StopSpeaking` - The user stops speaking
- - `CallClientCommands.Mute` - The user mutes the microphone
-
- ### Server Commands
-
- The server can send the following commands to the client:
-
- - `CallServerCommands.Message` - A message from the assistant.
- - `CallServerCommands.CancelLastAssistantMessage` - Cancel the last assistant message.
- - `CallServerCommands.CancelLastUserMessage` - Cancel the last user message.
- - `CallServerCommands.SkipAnswer` - Notify that the last generated answer was ignored, it's listening again.
- - `CallServerCommands.EnableSpeakerStreaming` - Enable speaker streaming.
- - `CallServerCommands.EndCall` - End the call.
-
- ### Message Flow
-
- 1. Client connects to WebSocket server
- 2. Server sends initial assistant message (generated if not provided)
- 3. Client sends audio chunks when user speaks
- 4. Server processes audio and responds with text+audio
- 5. Process continues until call ends
-
- See detailed protocol in [README.md](../README.md).
-
- ## Message metadata
-
- You can add metadata to the generated answers, that will be accessible in the conversation on the client and server side.
+ Using Fastify for WebSocket handling:

  ```typescript
- const metadata: AnswerMetadata = {
-   // ...
- }
+ import { ElevenLabsTTS } from '@micdrop/elevenlabs'
+ import { GladiaSTT } from '@micdrop/gladia'
+ import { OpenaiAgent } from '@micdrop/openai'
+ import { MicdropServer } from '@micdrop/server'
+ import Fastify from 'fastify'
+
+ const fastify = Fastify()
+
+ // Register WebSocket support
+ await fastify.register(import('@fastify/websocket'))
+
+ // WebSocket route for voice calls
+ fastify.register(async function (fastify) {
+   fastify.get('/call', { websocket: true }, (socket) => {
+     // Setup agent
+     const agent = new OpenaiAgent({
+       apiKey: process.env.OPENAI_API_KEY || '',
+       systemPrompt: 'You are a helpful voice assistant',
+     })
+
+     // Setup STT
+     const stt = new GladiaSTT({
+       apiKey: process.env.GLADIA_API_KEY || '',
+     })
+
+     // Setup TTS
+     const tts = new ElevenLabsTTS({
+       apiKey: process.env.ELEVENLABS_API_KEY || '',
+       voiceId: process.env.ELEVENLABS_VOICE_ID || '',
+     })
+
+     // Handle call
+     new MicdropServer(socket, {
+       firstMessage: 'Hello, how can I help you today?',
+       agent,
+       stt,
+       tts,
+     })
+   })
+ })

- const message: ConversationMessage = {
-   role: 'assistant',
-   content: 'Hello!',
-   metadata,
- }
+ // Start server
+ fastify
+   .listen({ port: 8080 })
+   .then(() => console.log('Server listening on port 8080'))
+   .catch((err) => fastify.log.error(err))
  ```

- ## Ending the call
-
- The call has two ways to end:
+ ### With NestJS

- - When the client closes the websocket connection.
- - When the generated answer contains the commands `endCall: true`.
-
- Example:
+ Using NestJS for WebSocket handling:

  ```typescript
- const END_CALL = 'END_CALL'
- const systemPrompt = `
- You are a voice assistant interviewing the user.
- To end the interview, briefly thank the user and say good bye, then say ${END_CALL}.
- `
-
- async function generateAnswer(
-   conversation: ConversationMessage[]
- ): Promise<ConversationMessage> {
-   const response = await openai.chat.completions.create({
-     model: 'gpt-4o',
-     messages: conversation,
-     temperature: 0.5,
-     max_tokens: 250,
-   })
-
-   let text = response.choices[0].message.content
-   if (!text) throw new Error('Empty response')
-
-   // Add metadata
-   const commands: AnswerCommands = {}
-   if (text.includes(END_CALL)) {
-     text = text.replace(END_CALL, '').trim()
-     commands.endCall = true
+ // websocket.gateway.ts
+ import {
+   WebSocketGateway,
+   WebSocketServer,
+   OnGatewayConnection,
+ } from '@nestjs/websockets'
+ import { ElevenLabsTTS } from '@micdrop/elevenlabs'
+ import { GladiaSTT } from '@micdrop/gladia'
+ import { OpenaiAgent } from '@micdrop/openai'
+ import { MicdropServer } from '@micdrop/server'
+ import { Server, WebSocket } from 'ws'
+ import { Injectable } from '@nestjs/common'
+
+ @Injectable()
+ @WebSocketGateway(8080)
+ export class MicdropGateway implements OnGatewayConnection {
+   @WebSocketServer()
+   server: Server
+
+   handleConnection(socket: WebSocket) {
+     // Setup agent
+     const agent = new OpenaiAgent({
+       apiKey: process.env.OPENAI_API_KEY || '',
+       systemPrompt: 'You are a helpful voice assistant built with NestJS',
+     })
+
+     // Setup STT
+     const stt = new GladiaSTT({
+       apiKey: process.env.GLADIA_API_KEY || '',
+     })
+
+     // Setup TTS
+     const tts = new ElevenLabsTTS({
+       apiKey: process.env.ELEVENLABS_API_KEY || '',
+       voiceId: process.env.ELEVENLABS_VOICE_ID || '',
+     })
+
+     // Handle call
+     new MicdropServer(socket, {
+       firstMessage: 'Hello, how can I help you today?',
+       agent,
+       stt,
+       tts,
+     })
    }
-
-   return { role: 'assistant', content: text, commands }
  }
  ```

- See demo [system prompt](../demo-server/src/call.ts) and [generateAnswer](../demo-server/src/ai/generateAnswer.ts) for a complete example.
-
- ## Integration Example
-
- Here's an example using Fastify:
-
  ```typescript
- import fastify from 'fastify'
- import fastifyWebsocket from '@fastify/websocket'
- import { CallServer, CallConfig } from '@micdrop/server'
-
- const server = fastify()
- server.register(fastifyWebsocket)
+ // app.module.ts
+ import { Module } from '@nestjs/common'
+ import { MicdropGateway } from './websocket.gateway'

- server.get('/call', { websocket: true }, (socket) => {
-   const config: CallConfig = {
-     systemPrompt: 'You are a helpful assistant',
-     // ... other config options
-   }
-   new CallServer(socket, config)
+ @Module({
+   providers: [MicdropGateway],
  })
-
- server.listen({ port: 8080 })
+ export class AppModule {}
  ```

- See [@micdrop/demo-server](../demo-server/src/call.ts) for a complete example using OpenAI and ElevenLabs.
-
- ## Debug Mode
-
- The server includes a debug mode that can:
-
- - Log detailed timing information
- - Save audio files for debugging (optional)
- - Track conversation state
- - Monitor WebSocket events
+ ## Demo

- See `debugLog`, `debugSaveSpeech` and `disableTTS` options in [CallConfig](#callconfig).
+ Check out the demo implementation in the [@micdrop/demo-server](../../examples/demo-server/README.md) package. It shows:

- ## Browser Support
+ - Setting up a Fastify server with WebSocket support
+ - Configuring the MicdropServer with custom handlers
+ - Basic authentication flow
+ - Example agent, speech-to-text, and text-to-speech implementations
+ - Error handling patterns

- The server is designed to work with any WebSocket client, but is specifically tested with:
+ ## Documentation

- - Modern browsers supporting WebSocket API
- - Node.js clients
- - @micdrop/client package
+ Learn more about the Micdrop protocol in [protocol.md](./docs/protocol.md).

  ## License

@@ -302,6 +304,4 @@ MIT

  ## Author

- Originally developed for [Raconte.ai](https://www.raconte.ai)
-
- by [Lonestone](https://www.lonestone.io) ([GitHub](https://github.com/lonestone))
+ Originally developed for [Raconte.ai](https://www.raconte.ai) and open-sourced by [Lonestone](https://www.lonestone.io) ([GitHub](https://github.com/lonestone))