@micdrop/server 1.7.0 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,20 +1,23 @@
+ # 🖐️🎤 Micdrop: Real-Time Voice Conversations with AI
+
+ Micdrop is a set of open source TypeScript packages for building real-time voice conversations with AI agents. It handles all the complexity on both the browser and server side (microphone, speaker, VAD, network communication, etc.) and provides ready-to-use implementations for various AI providers.
+
  # @micdrop/server
 
- A Node.js library for handling real-time voice conversations with WebSocket-based audio streaming.
+ The Node.js server implementation of [Micdrop](../../README.md).
 
  For the browser implementation, see the [@micdrop/client](../client/README.md) package.
 
  ## Features
 
- - 🌐 WebSocket server for real-time audio streaming
+ - 🤖 AI agent integration
+ - 🎙️ Speech-to-text and text-to-speech integration
  - 🔊 Advanced audio processing:
-   - Streaming TTS support
-   - Efficient audio chunk delivery
-   - Interrupt handling
+   - Streaming input and output
+   - Audio conversion
+   - Interruption handling
  - 💬 Conversation state management
- - 🎙️ Speech-to-text and text-to-speech integration
- - 🤖 AI conversation generation support
- - 💾 Debug mode with optional audio saving
+ - 🌐 WebSocket-based audio streaming
 
  ## Installation
 
@@ -29,272 +32,271 @@ pnpm add @micdrop/server
  ## Quick Start
 
  ```typescript
+ import { ElevenLabsTTS } from '@micdrop/elevenlabs'
+ import { GladiaSTT } from '@micdrop/gladia'
+ import { OpenaiAgent } from '@micdrop/openai'
+ import { MicdropServer } from '@micdrop/server'
  import { WebSocketServer } from 'ws'
- import { CallServer, CallConfig } from '@micdrop/server'
 
- // Create WebSocket server
  const wss = new WebSocketServer({ port: 8080 })
 
- // Define call configuration
- const config: CallConfig = {
-   // Initial system prompt for the conversation
-   systemPrompt: 'You are a helpful assistant',
-
-   // Optional first message from assistant
-   // Omit to generate the first message
-   firstMessage: 'Hello!',
-
-   // Function to generate assistant responses
-   async generateAnswer(conversation) {
-     // Implement your LLM or response generation logic
-     return 'Assistant response'
-   },
-
-   // Function to convert speech to text
-   async speech2Text(audioBlob, lastMessagePrompt) {
-     // Implement your STT logic
-     return 'Transcribed text'
-   },
-
-   // Function to convert text to speech
-   // Can return either a complete ArrayBuffer or a ReadableStream for streaming
-   async text2Speech(
-     text: string
-   ): Promise<ArrayBuffer | NodeJS.ReadableStream> {
-     // Implement your TTS logic
-     return new ArrayBuffer(0) // Audio data
-   },
-
-   // Optional callback when a message is added
-   onMessage(message) {
-     console.log('New message:', message)
-   },
-
-   // Optional callback when call ends
-   onEnd(summary) {
-     console.log('Call ended:', summary)
-   },
- }
-
- // Handle new connections
- wss.on('connection', (ws) => {
-   // Create call handler with configuration
-   new CallServer(ws, config)
- })
- ```
-
- ## Demo
-
- Check out the demo implementation in the [@micdrop/demo-server](../demo-server/README.md) package. It shows:
-
- - Setting up a Fastify server with WebSocket support
- - Configuring the CallServer with custom handlers
- - Basic authentication flow
- - Example speech-to-text and text-to-speech implementations
- - Error handling patterns
-
- Here's a simplified version from the demo:
+ wss.on('connection', (socket) => {
+   // Setup agent
+   const agent = new OpenaiAgent({
+     apiKey: process.env.OPENAI_API_KEY || '',
+     systemPrompt: 'You are a helpful assistant',
+   })
 
- ## Documentation
+   // Setup STT
+   const stt = new GladiaSTT({
+     apiKey: process.env.GLADIA_API_KEY || '',
+   })
 
- The server package provides several core components:
+   // Setup TTS
+   const tts = new ElevenLabsTTS({
+     apiKey: process.env.ELEVENLABS_API_KEY || '',
+     voiceId: process.env.ELEVENLABS_VOICE_ID || '',
+   })
 
- - **CallServer** - Main class that handles WebSocket connections, audio streaming, and conversation flow
- - **CallConfig** - Configuration interface for customizing speech processing and conversation behavior
- - **Types** - Common TypeScript types and interfaces for messages and commands
- - **Error Handling** - Standardized error handling with specific error codes
+   // Handle call
+   new MicdropServer(socket, {
+     firstMessage: 'Hello, how can I help you today?',
+     agent,
+     stt,
+     tts,
+   })
+ })
+ ```
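+
+ A browser client can then connect to this server. As a rough sketch (the `Micdrop.start` call is an assumption here; see [@micdrop/client](../client/README.md) for the actual API):
+
+ ```typescript
+ import { Micdrop } from '@micdrop/client'
+
+ // Assumed client API: start a call against the server above
+ await Micdrop.start({ url: 'ws://localhost:8080' })
+ ```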
 
- ## API Reference
+ ## Examples
 
- ### CallServer
+ ### Authorization and Language Parameters
 
- The main class for managing WebSocket connections and audio streaming.
+ For production applications, you'll want to handle authorization and language configuration:
 
  ```typescript
- class CallServer {
-   constructor(socket: WebSocket, config: CallConfig)
+ import { ElevenLabsTTS } from '@micdrop/elevenlabs'
+ import { GladiaSTT } from '@micdrop/gladia'
+ import { OpenaiAgent } from '@micdrop/openai'
+ import {
+   MicdropServer,
+   waitForParams,
+   MicdropError,
+   MicdropErrorCode,
+   handleError,
+ } from '@micdrop/server'
+ import { WebSocketServer } from 'ws'
+ import { z } from 'zod'
 
-   // Add assistant message and send to client with audio (TTS)
-   answer(message: string): Promise<void>
+ const wss = new WebSocketServer({ port: 8080 })
 
-   // Reset conversation (including system prompt)
-   resetConversation(conversation: Conversation): void
- }
+ // Define params schema for authorization and language
+ const callParamsSchema = z.object({
+   authorization: z.string(),
+   lang: z.string().regex(/^[a-z]{2}(-[A-Z]{2})?$/), // e.g., "en", "fr", "en-US"
+ })
+
+ wss.on('connection', async (socket) => {
+   try {
+     // Wait for client parameters (authorization & language)
+     const params = await waitForParams(socket, callParamsSchema.parse)
+
+     // Validate authorization
+     if (params.authorization !== process.env.AUTHORIZATION_KEY) {
+       throw new MicdropError(
+         MicdropErrorCode.Unauthorized,
+         'Invalid authorization'
+       )
+     }
+
+     // Setup agent with language-specific system prompt
+     const agent = new OpenaiAgent({
+       apiKey: process.env.OPENAI_API_KEY || '',
+       systemPrompt: `You are a helpful assistant. Respond in ${params.lang} language.`,
+     })
+
+     // Setup STT with language configuration
+     const stt = new GladiaSTT({
+       apiKey: process.env.GLADIA_API_KEY || '',
+       language: params.lang,
+     })
+
+     // Setup TTS with language configuration
+     const tts = new ElevenLabsTTS({
+       apiKey: process.env.ELEVENLABS_API_KEY || '',
+       voiceId: process.env.ELEVENLABS_VOICE_ID || '',
+       language: params.lang,
+     })
+
+     // Handle call
+     new MicdropServer(socket, {
+       firstMessage: 'Hello! How can I help you today?',
+       agent,
+       stt,
+       tts,
+     })
+   } catch (error) {
+     handleError(socket, error)
+   }
+ })
  ```
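+
+ If you issue JSON Web Tokens instead of a static key, the validation step inside the `try` block could be swapped for something like this sketch (assuming the `jsonwebtoken` package; `JWT_SECRET` is an illustrative variable name):
+
+ ```typescript
+ import jwt from 'jsonwebtoken'
+
+ // Validate a JWT sent by the client as the `authorization` param
+ try {
+   jwt.verify(params.authorization, process.env.JWT_SECRET || '')
+ } catch {
+   throw new MicdropError(MicdropErrorCode.Unauthorized, 'Invalid token')
+ }
+ ```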
 
- ### CallConfig
+ ### With Fastify
 
- Configuration interface for customizing the call behavior.
+ Using Fastify for WebSocket handling:
 
  ```typescript
- interface CallConfig {
-   // Initial system prompt for the conversation
-   systemPrompt: string
-
-   // Optional first message from assistant
-   firstMessage?: string
-
-   // Enable debug logging with timestamps
-   debugLog?: boolean
-
-   // Save last speech audio file for debugging (speech.ogg)
-   debugSaveSpeech?: boolean
-
-   // Disable text-to-speech conversion
-   disableTTS?: boolean
-
-   // Generate assistant's response
-   generateAnswer(
-     conversation: Conversation
-   ): Promise<string | ConversationMessage>
-
-   // Convert audio to text
-   speech2Text(audioBlob: Blob, prevMessage?: string): Promise<string>
-
-   // Convert text to audio
-   // Can return either a complete ArrayBuffer or a ReadableStream for streaming
-   text2Speech(text: string): Promise<ArrayBuffer | NodeJS.ReadableStream>
+ import { ElevenLabsTTS } from '@micdrop/elevenlabs'
+ import { GladiaSTT } from '@micdrop/gladia'
+ import { OpenaiAgent } from '@micdrop/openai'
+ import { MicdropServer } from '@micdrop/server'
+ import Fastify from 'fastify'
+
+ const fastify = Fastify()
+
+ // Register WebSocket support
+ await fastify.register(import('@fastify/websocket'))
+
+ // WebSocket route for voice calls
+ fastify.register(async function (fastify) {
+   fastify.get('/call', { websocket: true }, (socket) => {
+     // Setup agent
+     const agent = new OpenaiAgent({
+       apiKey: process.env.OPENAI_API_KEY || '',
+       systemPrompt: 'You are a helpful voice assistant',
+     })
+
+     // Setup STT
+     const stt = new GladiaSTT({
+       apiKey: process.env.GLADIA_API_KEY || '',
+     })
+
+     // Setup TTS
+     const tts = new ElevenLabsTTS({
+       apiKey: process.env.ELEVENLABS_API_KEY || '',
+       voiceId: process.env.ELEVENLABS_VOICE_ID || '',
+     })
+
+     // Handle call
+     new MicdropServer(socket, {
+       firstMessage: 'Hello, how can I help you today?',
+       agent,
+       stt,
+       tts,
+     })
+   })
+ })
 
-   // Optional callbacks
-   onMessage?(message: ConversationMessage): void
-   onEnd?(summary: CallSummary): void
- }
+ // Start server
+ fastify
+   .listen({ port: 8080 })
+   .then(() => console.log('Server listening on port 8080'))
+   .catch((err) => fastify.log.error(err))
  ```
 
- ## WebSocket Protocol
-
- The server implements a specific protocol for client-server communication:
-
- ### Client Commands
-
- The client can send the following commands to the server:
-
- - `CallClientCommands.StartSpeaking` - The user starts speaking
- - `CallClientCommands.StopSpeaking` - The user stops speaking
- - `CallClientCommands.Mute` - The user mutes the microphone
-
- ### Server Commands
-
- The server can send the following commands to the client:
-
- - `CallServerCommands.Message` - A message from the assistant.
- - `CallServerCommands.CancelLastAssistantMessage` - Cancel the last assistant message.
- - `CallServerCommands.CancelLastUserMessage` - Cancel the last user message.
- - `CallServerCommands.SkipAnswer` - Notify that the last generated answer was ignored and the server is listening again.
- - `CallServerCommands.EnableSpeakerStreaming` - Enable speaker streaming.
- - `CallServerCommands.EndCall` - End the call.
-
- ### Message Flow
+ ### With NestJS
 
- 1. Client connects to WebSocket server
- 2. Server sends initial assistant message (generated if not provided)
- 3. Client sends audio chunks when user speaks
- 4. Server processes audio and responds with text+audio
- 5. Process continues until call ends
-
- See the detailed protocol in [README.md](../README.md).
-
- ## Message metadata
-
- You can add metadata to the generated answers that will be accessible in the conversation on both the client and server side.
+ Using NestJS for WebSocket handling:
 
  ```typescript
- const metadata: AnswerMetadata = {
-   // ...
- }
-
- const message: ConversationMessage = {
-   role: 'assistant',
-   content: 'Hello!',
-   metadata,
+ // websocket.gateway.ts
+ import {
+   WebSocketGateway,
+   WebSocketServer,
+   OnGatewayConnection,
+ } from '@nestjs/websockets'
+ import { ElevenLabsTTS } from '@micdrop/elevenlabs'
+ import { GladiaSTT } from '@micdrop/gladia'
+ import { OpenaiAgent } from '@micdrop/openai'
+ import { MicdropServer } from '@micdrop/server'
+ import { Server, WebSocket } from 'ws'
+ import { Injectable } from '@nestjs/common'
+
+ @Injectable()
+ @WebSocketGateway(8080)
+ export class MicdropGateway implements OnGatewayConnection {
+   @WebSocketServer()
+   server: Server
+
+   handleConnection(socket: WebSocket) {
+     // Setup agent
+     const agent = new OpenaiAgent({
+       apiKey: process.env.OPENAI_API_KEY || '',
+       systemPrompt: 'You are a helpful voice assistant built with NestJS',
+     })
+
+     // Setup STT
+     const stt = new GladiaSTT({
+       apiKey: process.env.GLADIA_API_KEY || '',
+     })
+
+     // Setup TTS
+     const tts = new ElevenLabsTTS({
+       apiKey: process.env.ELEVENLABS_API_KEY || '',
+       voiceId: process.env.ELEVENLABS_VOICE_ID || '',
+     })
+
+     // Handle call
+     new MicdropServer(socket, {
+       firstMessage: 'Hello, how can I help you today?',
+       agent,
+       stt,
+       tts,
+     })
+   }
  }
  ```
 
- ## Ending the call
-
- A call can end in two ways:
-
- - When the client closes the WebSocket connection.
- - When the generated answer contains the command `endCall: true`.
-
- Example:
-
  ```typescript
- const END_CALL = 'END_CALL'
- const systemPrompt = `
- You are a voice assistant interviewing the user.
- To end the interview, briefly thank the user and say goodbye, then say ${END_CALL}.
- `
-
- async function generateAnswer(
-   conversation: ConversationMessage[]
- ): Promise<ConversationMessage> {
-   const response = await openai.chat.completions.create({
-     model: 'gpt-4o',
-     messages: conversation,
-     temperature: 0.5,
-     max_tokens: 250,
-   })
-
-   let text = response.choices[0].message.content
-   if (!text) throw new Error('Empty response')
-
-   // Add commands
-   const commands: AnswerCommands = {}
-   if (text.includes(END_CALL)) {
-     text = text.replace(END_CALL, '').trim()
-     commands.endCall = true
-   }
+ // app.module.ts
+ import { Module } from '@nestjs/common'
+ import { MicdropGateway } from './websocket.gateway'
 
-   return { role: 'assistant', content: text, commands }
- }
+ @Module({
+   providers: [MicdropGateway],
+ })
+ export class AppModule {}
  ```
 
- See the demo [system prompt](../demo-server/src/call.ts) and [generateAnswer](../demo-server/src/ai/generateAnswer.ts) for a complete example.
+ ## Agent / STT / TTS
 
- ## Integration Example
+ The Micdrop server has three main components:
 
- Here's an example using Fastify:
+ - `Agent` - AI agent using an LLM
+ - `STT` - Speech-to-text
+ - `TTS` - Text-to-speech
 
- ```typescript
- import fastify from 'fastify'
- import fastifyWebsocket from '@fastify/websocket'
- import { CallServer, CallConfig } from '@micdrop/server'
+ ### Available implementations
 
- const server = fastify()
- server.register(fastifyWebsocket)
+ Micdrop provides ready-to-use implementations for the following AI providers:
 
- server.get('/call', { websocket: true }, (socket) => {
-   const config: CallConfig = {
-     systemPrompt: 'You are a helpful assistant',
-     // ... other config options
-   }
-   new CallServer(socket, config)
- })
-
- server.listen({ port: 8080 })
- ```
+ - [@micdrop/openai](../openai/README.md)
+ - [@micdrop/elevenlabs](../elevenlabs/README.md)
+ - [@micdrop/cartesia](../cartesia/README.md)
+ - [@micdrop/mistral](../mistral/README.md)
+ - [@micdrop/gladia](../gladia/README.md)
 
- See [@micdrop/demo-server](../demo-server/src/call.ts) for a complete example using OpenAI and ElevenLabs.
+ ### Custom implementations
 
- ## Debug Mode
+ You can use the provided abstractions to write your own implementations (see the sketch below):
 
- The server includes a debug mode that can:
+ - **[Agent](./docs/Agent.md)** - Abstract class for answer generation
+ - **[STT](./docs/STT.md)** - Abstract class for speech-to-text
+ - **[TTS](./docs/TTS.md)** - Abstract class for text-to-speech
 
- - Log detailed timing information
- - Save audio files for debugging (optional)
- - Track conversation state
- - Monitor WebSocket events
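+ For example, a minimal custom TTS subclass might look like this sketch. The `speak` method name and the `TTS` export are assumptions for illustration; see [TTS](./docs/TTS.md) for the actual interface:
+
+ ```typescript
+ import { Readable } from 'stream'
+ import { TTS } from '@micdrop/server'
+
+ // Hypothetical TTS that calls an arbitrary HTTP synthesis endpoint
+ class MyTTS extends TTS {
+   // Assumed abstract method: turn text into an audio stream
+   async speak(text: string): Promise<NodeJS.ReadableStream> {
+     const response = await fetch('https://example.com/synthesize', {
+       method: 'POST',
+       headers: { 'Content-Type': 'application/json' },
+       body: JSON.stringify({ text }),
+     })
+     if (!response.ok || !response.body) {
+       throw new Error(`TTS request failed: ${response.status}`)
+     }
+     // Node 18+: convert the web stream to a Node.js Readable
+     return Readable.fromWeb(response.body as any)
+   }
+ }
+ ```
+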
+ ## Demo
 
- See `debugLog`, `debugSaveSpeech` and `disableTTS` options in [CallConfig](#callconfig).
+ Check out the demo implementation in the [@micdrop/demo-server](../../examples/demo-server/README.md) package. It shows:
 
- ## Browser Support
+ - Setting up a Fastify server with WebSocket support
+ - Configuring the MicdropServer with custom handlers
+ - Basic authentication flow
+ - Example agent, speech-to-text, and text-to-speech implementations
+ - Error handling patterns
 
- The server is designed to work with any WebSocket client, but is specifically tested with:
+ ## Documentation
 
- - Modern browsers supporting WebSocket API
- - Node.js clients
- - @micdrop/client package
+ Learn more about the Micdrop protocol in [protocol.md](./docs/protocol.md).
 
  ## License
 
@@ -302,6 +304,4 @@ MIT
 
  ## Author
 
- Originally developed for [Raconte.ai](https://www.raconte.ai)
-
- by [Lonestone](https://www.lonestone.io) ([GitHub](https://github.com/lonestone))
+ Originally developed for [Raconte.ai](https://www.raconte.ai) and open-sourced by [Lonestone](https://www.lonestone.io) ([GitHub](https://github.com/lonestone))