opencode-skills-collection 1.0.185 → 1.0.187
- package/bundled-skills/.antigravity-install-manifest.json +5 -1
- package/bundled-skills/3d-web-experience/SKILL.md +152 -37
- package/bundled-skills/agent-evaluation/SKILL.md +1088 -26
- package/bundled-skills/agent-memory-systems/SKILL.md +1037 -25
- package/bundled-skills/agent-tool-builder/SKILL.md +668 -16
- package/bundled-skills/ai-agents-architect/SKILL.md +271 -31
- package/bundled-skills/ai-product/SKILL.md +716 -26
- package/bundled-skills/ai-wrapper-product/SKILL.md +450 -44
- package/bundled-skills/algolia-search/SKILL.md +867 -15
- package/bundled-skills/autonomous-agents/SKILL.md +1033 -26
- package/bundled-skills/aws-serverless/SKILL.md +1046 -35
- package/bundled-skills/azure-functions/SKILL.md +1318 -19
- package/bundled-skills/browser-automation/SKILL.md +1065 -28
- package/bundled-skills/browser-extension-builder/SKILL.md +159 -32
- package/bundled-skills/bullmq-specialist/SKILL.md +347 -16
- package/bundled-skills/clerk-auth/SKILL.md +796 -15
- package/bundled-skills/computer-use-agents/SKILL.md +1870 -28
- package/bundled-skills/context-window-management/SKILL.md +271 -18
- package/bundled-skills/conversation-memory/SKILL.md +453 -24
- package/bundled-skills/crewai/SKILL.md +252 -46
- package/bundled-skills/discord-bot-architect/SKILL.md +1207 -34
- package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
- package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
- package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
- package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
- package/bundled-skills/docs/users/bundles.md +1 -1
- package/bundled-skills/docs/users/claude-code-skills.md +1 -1
- package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
- package/bundled-skills/docs/users/getting-started.md +1 -1
- package/bundled-skills/docs/users/kiro-integration.md +1 -1
- package/bundled-skills/docs/users/usage.md +4 -4
- package/bundled-skills/docs/users/visual-guide.md +4 -4
- package/bundled-skills/email-systems/SKILL.md +646 -26
- package/bundled-skills/faf-expert/SKILL.md +221 -0
- package/bundled-skills/faf-wizard/SKILL.md +252 -0
- package/bundled-skills/file-uploads/SKILL.md +212 -11
- package/bundled-skills/firebase/SKILL.md +646 -16
- package/bundled-skills/gcp-cloud-run/SKILL.md +1117 -32
- package/bundled-skills/graphql/SKILL.md +1026 -27
- package/bundled-skills/hubspot-integration/SKILL.md +804 -19
- package/bundled-skills/idea-darwin/SKILL.md +120 -0
- package/bundled-skills/inngest/SKILL.md +431 -16
- package/bundled-skills/interactive-portfolio/SKILL.md +342 -44
- package/bundled-skills/langfuse/SKILL.md +296 -41
- package/bundled-skills/langgraph/SKILL.md +259 -50
- package/bundled-skills/micro-saas-launcher/SKILL.md +343 -44
- package/bundled-skills/neon-postgres/SKILL.md +572 -15
- package/bundled-skills/nextjs-supabase-auth/SKILL.md +269 -21
- package/bundled-skills/notion-template-business/SKILL.md +371 -44
- package/bundled-skills/personal-tool-builder/SKILL.md +537 -44
- package/bundled-skills/plaid-fintech/SKILL.md +825 -19
- package/bundled-skills/prompt-caching/SKILL.md +438 -25
- package/bundled-skills/rag-engineer/SKILL.md +271 -29
- package/bundled-skills/salesforce-development/SKILL.md +912 -19
- package/bundled-skills/satori/SKILL.md +54 -0
- package/bundled-skills/scroll-experience/SKILL.md +381 -44
- package/bundled-skills/segment-cdp/SKILL.md +817 -19
- package/bundled-skills/shopify-apps/SKILL.md +1475 -19
- package/bundled-skills/slack-bot-builder/SKILL.md +1162 -28
- package/bundled-skills/telegram-bot-builder/SKILL.md +152 -37
- package/bundled-skills/telegram-mini-app/SKILL.md +445 -44
- package/bundled-skills/trigger-dev/SKILL.md +916 -27
- package/bundled-skills/twilio-communications/SKILL.md +1310 -28
- package/bundled-skills/upstash-qstash/SKILL.md +898 -27
- package/bundled-skills/vercel-deployment/SKILL.md +637 -39
- package/bundled-skills/viral-generator-builder/SKILL.md +132 -37
- package/bundled-skills/voice-agents/SKILL.md +937 -27
- package/bundled-skills/voice-ai-development/SKILL.md +375 -46
- package/bundled-skills/workflow-automation/SKILL.md +982 -29
- package/bundled-skills/zapier-make-patterns/SKILL.md +772 -27
- package/package.json +1 -1
|
@@ -1,22 +1,36 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: voice-agents
|
|
3
|
-
description:
|
|
3
|
+
description: Voice agents represent the frontier of AI interaction - humans
|
|
4
|
+
speaking naturally with AI systems.
|
|
4
5
|
risk: safe
|
|
5
|
-
source:
|
|
6
|
-
date_added:
|
|
6
|
+
source: vibeship-spawner-skills (Apache 2.0)
|
|
7
|
+
date_added: 2026-02-27
|
|
7
8
|
---
|
|
8
9
|
|
|
9
10
|
# Voice Agents
|
|
10
11
|
|
|
11
|
-
|
|
12
|
-
|
|
13
|
-
|
|
14
|
-
|
|
12
|
+
Voice agents represent the frontier of AI interaction - humans speaking
|
|
13
|
+
naturally with AI systems. The challenge isn't just speech recognition
|
|
14
|
+
and synthesis, it's achieving natural conversation flow with sub-800ms
|
|
15
|
+
latency while handling interruptions, background noise, and emotional
|
|
16
|
+
nuance.
|
|
15
17
|
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
18
|
+
This skill covers two architectures: speech-to-speech (OpenAI Realtime API,
|
|
19
|
+
lowest latency, most natural) and pipeline (STT→LLM→TTS, more control,
|
|
20
|
+
easier to debug). Key insight: latency is the constraint. Humans expect
|
|
21
|
+
responses within roughly 500ms. Every millisecond matters.
|
|
22
|
+
|
|
23
|
+
84% of organizations are increasing voice AI budgets in 2025. This is the
|
|
24
|
+
year voice agents go mainstream.
|
|
25
|
+
|
|
26
|
+
## Principles
|
|
27
|
+
|
|
28
|
+
- Latency is the constraint - target <800ms end-to-end
|
|
29
|
+
- Jitter (variance) matters as much as absolute latency
|
|
30
|
+
- VAD quality determines conversation flow
|
|
31
|
+
- Interruption handling makes or breaks the experience
|
|
32
|
+
- Start with focused MVP, iterate based on real conversations
|
|
33
|
+
- Combine best-in-class components (Deepgram STT + ElevenLabs TTS)
|
|
20
34
|
|
|
21
35
|
## Capabilities
|
|
22
36
|
|
|
@@ -30,44 +44,940 @@ step but add latency. Mos
|
|
|
30
44
|
- barge-in-detection
|
|
31
45
|
- voice-interfaces
|
|
32
46
|
|
|
47
|
+
## Scope
|
|
48
|
+
|
|
49
|
+
- phone-system-integration → backend
|
|
50
|
+
- audio-processing-dsp → audio-specialist
|
|
51
|
+
- music-generation → audio-specialist
|
|
52
|
+
- accessibility-compliance → accessibility-specialist
|
|
53
|
+
|
|
54
|
+
## Tooling
|
|
55
|
+
|
|
56
|
+
### Speech-to-Speech
|
|
57
|
+
|
|
58
|
+
- OpenAI Realtime API - When: Lowest latency, most natural conversation. Note: gpt-4o-realtime-preview, native voice, sub-500ms
|
|
59
|
+
- Pipecat - When: Open-source voice orchestration. Note: Daily-backed, enterprise-grade, modular
|
|
60
|
+
|
|
61
|
+
### Speech-to-Text
|
|
62
|
+
|
|
63
|
+
- OpenAI Whisper - When: Highest accuracy, multilingual. Note: gpt-4o-transcribe for best results
|
|
64
|
+
- Deepgram Nova-3 - When: Production workloads, 54% lower WER. Note: 150-184ms TTFT, 90%+ accuracy on noisy audio
|
|
65
|
+
- AssemblyAI - When: Real-time streaming, speaker diarization. Note: Good accuracy-latency balance
|
|
66
|
+
|
|
67
|
+
### Text-to-Speech
|
|
68
|
+
|
|
69
|
+
- ElevenLabs - When: Most natural voice, emotional control. Note: Flash model 75ms latency, V3 for expression
|
|
70
|
+
- OpenAI TTS - When: Integrated with OpenAI stack. Note: gpt-4o-mini-tts, 13 voices, streaming
|
|
71
|
+
- Deepgram Aura-2 - When: Cost-effective production TTS. Note: 40% cheaper than ElevenLabs, 184ms TTFB
|
|
72
|
+
|
|
73
|
+
### Frameworks
|
|
74
|
+
|
|
75
|
+
- Pipecat - When: Open-source voice agent orchestration. Note: Silero VAD, SmartTurn, interruption handling
|
|
76
|
+
- Vapi - When: Managed voice agent platform. Note: No infrastructure management
|
|
77
|
+
- Retell AI - When: Low-latency voice agents. Note: Best context preservation on interruption
|
|
78
|
+
|
|
33
79
|
## Patterns
|
|
34
80
|
|
|
35
81
|
### Speech-to-Speech Architecture
|
|
36
82
|
|
|
37
83
|
Direct audio-to-audio processing for lowest latency
|
|
38
84
|
|
|
85
|
+
**When to use**: Maximum naturalness, emotional preservation, real-time conversation
|
|
86
|
+
|
|
87
|
+
# SPEECH-TO-SPEECH ARCHITECTURE:
|
|
88
|
+
|
|
89
|
+
"""
|
|
90
|
+
[User Audio] → [S2S Model] → [Agent Audio]
|
|
91
|
+
|
|
92
|
+
Advantages:
|
|
93
|
+
- Lowest latency (sub-500ms)
|
|
94
|
+
- Preserves emotion, emphasis, accents
|
|
95
|
+
- Most natural conversation flow
|
|
96
|
+
|
|
97
|
+
Disadvantages:
|
|
98
|
+
- Less control over responses
|
|
99
|
+
- Harder to debug/audit
|
|
100
|
+
- Can't easily modify what's said
|
|
101
|
+
"""
|
|
102
|
+
|
|
103
|
+
## OpenAI Realtime API
|
|
104
|
+
"""
|
|
105
|
+
import { RealtimeClient } from '@openai/realtime-api-beta';
|
|
106
|
+
|
|
107
|
+
const client = new RealtimeClient({
|
|
108
|
+
apiKey: process.env.OPENAI_API_KEY,
|
|
109
|
+
});
|
|
110
|
+
|
|
111
|
+
// Configure for voice conversation
|
|
112
|
+
client.updateSession({
|
|
113
|
+
modalities: ['text', 'audio'],
|
|
114
|
+
voice: 'alloy',
|
|
115
|
+
input_audio_format: 'pcm16',
|
|
116
|
+
output_audio_format: 'pcm16',
|
|
117
|
+
instructions: `You are a helpful customer service agent.
|
|
118
|
+
Be concise and friendly. If you don't know something,
|
|
119
|
+
say so rather than making things up.`,
|
|
120
|
+
turn_detection: {
|
|
121
|
+
type: 'server_vad', // or 'semantic_vad'
|
|
122
|
+
threshold: 0.5,
|
|
123
|
+
prefix_padding_ms: 300,
|
|
124
|
+
silence_duration_ms: 500,
|
|
125
|
+
},
|
|
126
|
+
});
|
|
127
|
+
|
|
128
|
+
// Handle audio streams
|
|
129
|
+
client.on('conversation.item.input_audio_transcription', (event) => {
|
|
130
|
+
console.log('User said:', event.transcript);
|
|
131
|
+
});
|
|
132
|
+
|
|
133
|
+
client.on('response.audio.delta', (event) => {
|
|
134
|
+
// Stream audio to speaker
|
|
135
|
+
audioPlayer.write(Buffer.from(event.delta, 'base64'));
|
|
136
|
+
});
|
|
137
|
+
|
|
138
|
+
// Send user audio
|
|
139
|
+
client.appendInputAudio(audioBuffer);
|
|
140
|
+
"""
|
|
141
|
+
|
|
142
|
+
## Use Cases:
|
|
143
|
+
- Real-time customer support
|
|
144
|
+
- Voice assistants
|
|
145
|
+
- Interactive voice response (IVR)
|
|
146
|
+
- Live language translation
|
|
147
|
+
|
|
39
148
|
### Pipeline Architecture
|
|
40
149
|
|
|
41
150
|
Separate STT → LLM → TTS for maximum control
|
|
42
151
|
|
|
152
|
+
**When to use**: Need to know/control exactly what's said, debugging, compliance
|
|
153
|
+
|
|
154
|
+
# PIPELINE ARCHITECTURE:
|
|
155
|
+
|
|
156
|
+
"""
|
|
157
|
+
[Audio] → [STT] → [Text] → [LLM] → [Text] → [TTS] → [Audio]
|
|
158
|
+
|
|
159
|
+
Advantages:
|
|
160
|
+
- Full control at each step
|
|
161
|
+
- Can log/audit all text
|
|
162
|
+
- Easier to debug
|
|
163
|
+
- Mix best-in-class components
|
|
164
|
+
|
|
165
|
+
Disadvantages:
|
|
166
|
+
- Higher latency (700-1200ms typical)
|
|
167
|
+
- Loses some emotion/nuance
|
|
168
|
+
- More components to manage
|
|
169
|
+
"""
|
|
170
|
+
|
|
171
|
+
## Production Pipeline Example
|
|
172
|
+
"""
|
|
173
|
+
import { Deepgram } from '@deepgram/sdk';
|
|
174
|
+
import { ElevenLabsClient } from 'elevenlabs';
|
|
175
|
+
import OpenAI from 'openai';
|
|
176
|
+
|
|
177
|
+
// Initialize clients
|
|
178
|
+
const deepgram = new Deepgram(process.env.DEEPGRAM_API_KEY);
|
|
179
|
+
const elevenlabs = new ElevenLabsClient();
|
|
180
|
+
const openai = new OpenAI();
|
|
181
|
+
|
|
182
|
+
async function processVoiceInput(audioStream) {
|
|
183
|
+
// 1. Speech-to-Text (Deepgram Nova-3)
|
|
184
|
+
const transcription = await deepgram.transcription.live({
|
|
185
|
+
model: 'nova-3',
|
|
186
|
+
punctuate: true,
|
|
187
|
+
endpointing: 300, // ms of silence before end
|
|
188
|
+
});
|
|
189
|
+
|
|
190
|
+
transcription.on('transcript', async (data) => {
|
|
191
|
+
if (data.is_final && data.speech_final) {
|
|
192
|
+
const userText = data.channel.alternatives[0].transcript;
|
|
193
|
+
console.log('User:', userText);
|
|
194
|
+
|
|
195
|
+
// 2. LLM Processing
|
|
196
|
+
const completion = await openai.chat.completions.create({
|
|
197
|
+
model: 'gpt-4o-mini',
|
|
198
|
+
messages: [
|
|
199
|
+
{ role: 'system', content: 'You are a concise voice assistant.' },
|
|
200
|
+
{ role: 'user', content: userText }
|
|
201
|
+
],
|
|
202
|
+
max_tokens: 150, // Keep responses short for voice
|
|
203
|
+
});
|
|
204
|
+
|
|
205
|
+
const agentText = completion.choices[0].message.content;
|
|
206
|
+
console.log('Agent:', agentText);
|
|
207
|
+
|
|
208
|
+
// 3. Text-to-Speech (ElevenLabs)
|
|
209
|
+
const ttsStream = await elevenlabs.textToSpeech.stream({
|
|
210
|
+
voice_id: 'voice_id_here',
|
|
211
|
+
text: agentText,
|
|
212
|
+
model_id: 'eleven_flash_v2_5', // Lowest latency
|
|
213
|
+
});
|
|
214
|
+
|
|
215
|
+
// Stream to user
|
|
216
|
+
playAudioStream(ttsStream);
|
|
217
|
+
}
|
|
218
|
+
});
|
|
219
|
+
|
|
220
|
+
// Pipe audio to transcription
|
|
221
|
+
audioStream.pipe(transcription);
|
|
222
|
+
}
|
|
223
|
+
"""
|
|
224
|
+
|
|
225
|
+
## Optimization Tips:
|
|
226
|
+
- Start TTS while LLM still generating (streaming)
|
|
227
|
+
- Pre-compute first response segment during user speech
|
|
228
|
+
- Use Flash/turbo models for latency
|
|
229
|
+
|
|
43
230
|
### Voice Activity Detection Pattern
|
|
44
231
|
|
|
45
232
|
Detect when user starts/stops speaking
|
|
46
233
|
|
|
47
|
-
|
|
234
|
+
**When to use**: All voice agents need VAD for turn-taking
|
|
235
|
+
|
|
236
|
+
# VOICE ACTIVITY DETECTION (VAD):
|
|
237
|
+
|
|
238
|
+
"""
|
|
239
|
+
VAD Types:
|
|
240
|
+
1. Energy-based: Simple, fast, noise-sensitive
|
|
241
|
+
2. Model-based: Silero VAD, more accurate
|
|
242
|
+
3. Semantic VAD: Understands meaning, best for conversation
|
|
243
|
+
"""
|
|
244
|
+
|
|
245
|
+
## Silero VAD (Popular Open Source)
|
|
246
|
+
"""
|
|
247
|
+
import { SileroVAD } from '@pipecat-ai/silero-vad';
|
|
248
|
+
|
|
249
|
+
const vad = new SileroVAD({
|
|
250
|
+
threshold: 0.5, // Speech probability threshold
|
|
251
|
+
min_speech_duration: 250, // ms before speech confirmed
|
|
252
|
+
min_silence_duration: 500, // ms of silence = end of turn
|
|
253
|
+
});
|
|
254
|
+
|
|
255
|
+
vad.on('speech_start', () => {
|
|
256
|
+
console.log('User started speaking');
|
|
257
|
+
// Stop any playing TTS (barge-in)
|
|
258
|
+
audioPlayer.stop();
|
|
259
|
+
});
|
|
260
|
+
|
|
261
|
+
vad.on('speech_end', () => {
|
|
262
|
+
console.log('User finished speaking');
|
|
263
|
+
// Trigger response generation
|
|
264
|
+
processTranscript();
|
|
265
|
+
});
|
|
266
|
+
|
|
267
|
+
// Feed audio to VAD
|
|
268
|
+
audioStream.on('data', (chunk) => {
|
|
269
|
+
vad.process(chunk);
|
|
270
|
+
});
|
|
271
|
+
"""
|
|
272
|
+
|
|
273
|
+
## OpenAI Semantic VAD
|
|
274
|
+
"""
|
|
275
|
+
// In Realtime API session config
|
|
276
|
+
client.updateSession({
|
|
277
|
+
turn_detection: {
|
|
278
|
+
type: 'semantic_vad', // Uses meaning, not just silence
|
|
279
|
+
// Model waits longer after "ummm..."
|
|
280
|
+
// Responds faster after "Yes, that's correct."
|
|
281
|
+
},
|
|
282
|
+
});
|
|
283
|
+
"""
|
|
284
|
+
|
|
285
|
+
## Barge-In Handling
|
|
286
|
+
"""
|
|
287
|
+
// When user interrupts:
|
|
288
|
+
function handleBargeIn() {
|
|
289
|
+
// 1. Stop TTS immediately
|
|
290
|
+
audioPlayer.stop();
|
|
291
|
+
|
|
292
|
+
// 2. Cancel pending LLM generation
|
|
293
|
+
llmController.abort();
|
|
294
|
+
|
|
295
|
+
// 3. Checkpoint conversation state
|
|
296
|
+
conversationState.checkpoint();
|
|
297
|
+
|
|
298
|
+
// 4. Listen to new input
|
|
299
|
+
startListening();
|
|
300
|
+
}
|
|
301
|
+
|
|
302
|
+
// VAD triggers barge-in
|
|
303
|
+
vad.on('speech_start', () => {
|
|
304
|
+
if (audioPlayer.isPlaying) {
|
|
305
|
+
handleBargeIn();
|
|
306
|
+
}
|
|
307
|
+
});
|
|
308
|
+
"""
|
|
309
|
+
|
|
310
|
+
### Latency Optimization Pattern
|
|
311
|
+
|
|
312
|
+
Achieving <800ms end-to-end response time
|
|
313
|
+
|
|
314
|
+
**When to use**: Production voice agents
|
|
315
|
+
|
|
316
|
+
# LATENCY OPTIMIZATION:
|
|
317
|
+
|
|
318
|
+
"""
|
|
319
|
+
Target Metrics:
|
|
320
|
+
- End-to-end: <800ms (ideal: <500ms)
|
|
321
|
+
- Time-to-First-Token (TTFT): <300ms
|
|
322
|
+
- Barge-in response: <200ms
|
|
323
|
+
- Jitter variance: <100ms std dev
|
|
324
|
+
"""
|
|
325
|
+
|
|
326
|
+
## Pipeline Latency Breakdown
|
|
327
|
+
"""
|
|
328
|
+
Typical breakdown:
|
|
329
|
+
- VAD processing: 50-100ms
|
|
330
|
+
- STT first result: 150-200ms
|
|
331
|
+
- LLM TTFT: 100-300ms
|
|
332
|
+
- TTS TTFA: 75-200ms
|
|
333
|
+
- Audio buffering: 50-100ms
|
|
334
|
+
|
|
335
|
+
Total: 425-900ms
|
|
336
|
+
"""
|
|
337
|
+
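The breakdown above can be checked mechanically. The following is a minimal sketch, assuming the per-stage ranges are exactly the ones listed. The names `stages`, `totalBudget`, and `overBudgetStages` are hypothetical and do not come from the skill itself.

```javascript
// Per-stage latency ranges in milliseconds, copied from the breakdown above.
const stages = {
  vad: [50, 100],
  sttFirstResult: [150, 200],
  llmTtft: [100, 300],
  ttsTtfa: [75, 200],
  audioBuffering: [50, 100],
};

// Sum the best-case and worst-case figures across all stages.
function totalBudget(ranges) {
  return Object.values(ranges).reduce(
    ([lo, hi], [stageLo, stageHi]) => [lo + stageLo, hi + stageHi],
    [0, 0],
  );
}

// Hypothetical helper: list stages whose worst case alone exceeds
// a given share of the end-to-end target.
function overBudgetStages(ranges, targetMs, maxShare) {
  return Object.entries(ranges)
    .filter(([, [, hi]]) => hi > targetMs * maxShare)
    .map(([name]) => name);
}

const [best, worst] = totalBudget(stages);
console.log(`Total: ${best}-${worst}ms`); // Total: 425-900ms
console.log(overBudgetStages(stages, 800, 0.3)); // [ 'llmTtft' ]
```

With an 800ms target and a 30% share per stage, only the LLM time-to-first-token worst case stands out, which suggests where to look first when the budget is exceeded.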
|
|
338
|
+
## Optimization Strategies
|
|
339
|
+
|
|
340
|
+
### 1. Streaming Everything
|
|
341
|
+
"""
|
|
342
|
+
// Stream STT results as they come
|
|
343
|
+
stt.on('partial_transcript', (text) => {
|
|
344
|
+
// Start processing before final transcript
|
|
345
|
+
llmPreprocessor.prepare(text);
|
|
346
|
+
});
|
|
347
|
+
|
|
348
|
+
// Stream LLM output to TTS
|
|
349
|
+
const llmStream = await openai.chat.completions.create({
|
|
350
|
+
stream: true,
|
|
351
|
+
// ...
|
|
352
|
+
});
|
|
353
|
+
|
|
354
|
+
for await (const chunk of llmStream) {
|
|
355
|
+
tts.appendText(chunk.choices[0]?.delta?.content ?? ''); // some chunks carry no content
|
|
356
|
+
}
|
|
357
|
+
"""
|
|
358
|
+
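Streaming LLM output to TTS usually works better sentence by sentence than token by token, since most TTS engines need some context to produce natural prosody. Below is a minimal sketch of a sentence buffer. `SentenceChunker` is a hypothetical helper, and the `onSentence` callback stands in for whatever your TTS client exposes. The boundary rule (punctuation followed by whitespace) is a simplification that will split after abbreviations such as "Dr. ".

```javascript
// Buffers streamed text deltas and emits complete sentences.
class SentenceChunker {
  constructor(onSentence) {
    this.buffer = '';
    this.onSentence = onSentence; // callback, e.g. (s) => tts.appendText(s)
  }

  // Accept one streamed delta; undefined or empty deltas are ignored.
  push(delta) {
    if (!delta) return;
    this.buffer += delta;
    // A sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]\s+/g;
    let match;
    let start = 0;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + 1; // include the punctuation mark
      this.onSentence(this.buffer.slice(start, end).trim());
      start = boundary.lastIndex;
    }
    this.buffer = this.buffer.slice(start);
  }

  // Emit whatever remains once the stream ends.
  flush() {
    const rest = this.buffer.trim();
    this.buffer = '';
    if (rest) this.onSentence(rest);
  }
}

const spoken = [];
const chunker = new SentenceChunker((s) => spoken.push(s));
for (const delta of ['Got it. I', "'ll set that", ' reminder. ', undefined, 'Anything else?']) {
  chunker.push(delta);
}
chunker.flush();
console.log(spoken);
// [ 'Got it.', "I'll set that reminder.", 'Anything else?' ]
```

The first sentence reaches the TTS callback as soon as its terminating punctuation arrives, so playback can start while the model is still generating the rest.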
|
|
359
|
+
### 2. Pre-computation
|
|
360
|
+
"""
|
|
361
|
+
// While user is speaking, predict and prepare
|
|
362
|
+
stt.on('partial_transcript', async (text) => {
|
|
363
|
+
// Pre-fetch relevant context
|
|
364
|
+
const context = await retrieveContext(text);
|
|
365
|
+
|
|
366
|
+
// Pre-compute likely first sentence
|
|
367
|
+
const firstSentence = await generateOpener(context);
|
|
368
|
+
});
|
|
369
|
+
"""
|
|
370
|
+
|
|
371
|
+
### 3. Use Low-Latency Models
|
|
372
|
+
"""
|
|
373
|
+
// STT: Deepgram Nova-3 (150ms TTFT)
|
|
374
|
+
// LLM: gpt-4o-mini (fastest GPT-4 class)
|
|
375
|
+
// TTS: ElevenLabs Flash (75ms) or Deepgram Aura-2 (184ms)
|
|
376
|
+
"""
|
|
377
|
+
|
|
378
|
+
### 4. Edge Deployment
|
|
379
|
+
"""
|
|
380
|
+
// Run inference closer to user
|
|
381
|
+
// - Cloud regions near user
|
|
382
|
+
// - Edge computing for VAD/STT
|
|
383
|
+
// - WebSocket instead of per-request HTTP for lower overhead
|
|
384
|
+
"""
|
|
385
|
+
|
|
386
|
+
### Conversation Design Pattern
|
|
387
|
+
|
|
388
|
+
Designing natural voice conversations
|
|
389
|
+
|
|
390
|
+
**When to use**: Building voice UX
|
|
391
|
+
|
|
392
|
+
# CONVERSATION DESIGN:
|
|
393
|
+
|
|
394
|
+
## Voice-First Principles
|
|
395
|
+
"""
|
|
396
|
+
Voice is different from text:
|
|
397
|
+
- No undo button - say it right the first time
|
|
398
|
+
- Linear - user can't scroll back
|
|
399
|
+
- Ephemeral - easy to miss information
|
|
400
|
+
- Emotional - tone matters as much as words
|
|
401
|
+
"""
|
|
402
|
+
|
|
403
|
+
## Response Design
|
|
404
|
+
"""
|
|
405
|
+
# Keep responses short (10-20 seconds max)
|
|
406
|
+
# Front-load the answer
|
|
407
|
+
# Use signposting for lists
|
|
408
|
+
|
|
409
|
+
Bad: "I found several options. The first is... second is..."
|
|
410
|
+
Good: "I found 3 options. Want me to go through them?"
|
|
411
|
+
|
|
412
|
+
# Confirm understanding
|
|
413
|
+
Bad: "I'll transfer $500 to John."
|
|
414
|
+
Good: "So that's $500 to John Smith. Should I proceed?"
|
|
415
|
+
"""
|
|
416
|
+
|
|
417
|
+
## Prompting for Voice
|
|
418
|
+
"""
|
|
419
|
+
system_prompt = '''
|
|
420
|
+
You are a voice assistant. Follow these rules:
|
|
421
|
+
|
|
422
|
+
1. Be concise - keep responses under 30 words
|
|
423
|
+
2. Use natural speech - contractions, casual language
|
|
424
|
+
3. Never use formatting (bullets, numbers in lists)
|
|
425
|
+
4. Spell out numbers and abbreviations
|
|
426
|
+
5. End with a question to keep conversation flowing
|
|
427
|
+
6. If unclear, ask for clarification
|
|
428
|
+
7. Never say "I'm an AI" unless asked
|
|
429
|
+
|
|
430
|
+
Good: "Got it. I'll set that reminder for three pm. Anything else?"
|
|
431
|
+
Bad: "I have set a reminder for 3:00 PM. Is there anything else I can assist you with today?"
|
|
432
|
+
'''
|
|
433
|
+
"""
|
|
434
|
+
|
|
435
|
+
## Error Recovery
|
|
436
|
+
"""
|
|
437
|
+
// Handle recognition errors gracefully
|
|
438
|
+
const errorResponses = {
|
|
439
|
+
no_speech: "I didn't catch that. Could you say it again?",
|
|
440
|
+
unclear: "Sorry, I'm not sure I understood. You said [repeat]. Is that right?",
|
|
441
|
+
timeout: "Still there? I'm here when you're ready.",
|
|
442
|
+
};
|
|
443
|
+
|
|
444
|
+
// Always offer human fallback for complex issues
|
|
445
|
+
if (confidenceScore < 0.6) {
|
|
446
|
+
response = "I want to make sure I get this right. Would you like to speak with a human agent?";
|
|
447
|
+
}
|
|
448
|
+
"""
|
|
449
|
+
|
|
450
|
+
## Sharp Edges
|
|
451
|
+
|
|
452
|
+
### Response Latency Exceeds 800ms
|
|
453
|
+
|
|
454
|
+
Severity: CRITICAL
|
|
455
|
+
|
|
456
|
+
Situation: Building a voice agent pipeline
|
|
457
|
+
|
|
458
|
+
Symptoms:
|
|
459
|
+
Conversations feel awkward. Users repeat themselves. "Are you
|
|
460
|
+
there?" questions. Users hang up or give up. Low satisfaction
|
|
461
|
+
scores despite correct answers.
|
|
462
|
+
|
|
463
|
+
Why this breaks:
|
|
464
|
+
In human conversation, responses typically arrive within 500ms.
|
|
465
|
+
Anything over 800ms feels like the agent is slow or confused.
|
|
466
|
+
Users lose confidence and patience. Every component adds latency:
|
|
467
|
+
VAD (100ms) + STT (200ms) + LLM (300ms) + TTS (200ms) = 800ms.
|
|
468
|
+
|
|
469
|
+
Recommended fix:
|
|
470
|
+
|
|
471
|
+
# Measure and budget latency for each component:
|
|
472
|
+
|
|
473
|
+
## Target latencies:
|
|
474
|
+
- VAD processing: <100ms
|
|
475
|
+
- STT time-to-first-token: <200ms
|
|
476
|
+
- LLM time-to-first-token: <300ms
|
|
477
|
+
- TTS time-to-first-audio: <150ms
|
|
478
|
+
- Total end-to-end: <800ms
|
|
479
|
+
|
|
480
|
+
## Optimization strategies:
|
|
481
|
+
|
|
482
|
+
1. Use low-latency models:
|
|
483
|
+
- STT: Deepgram Nova-3 (150ms) vs Whisper (500ms+)
|
|
484
|
+
- TTS: ElevenLabs Flash (75ms) vs standard (200ms+)
|
|
485
|
+
- LLM: gpt-4o-mini streaming
|
|
486
|
+
|
|
487
|
+
2. Stream everything:
|
|
488
|
+
- Don't wait for full STT transcript
|
|
489
|
+
- Stream LLM output to TTS
|
|
490
|
+
- Start audio playback before TTS finishes
|
|
491
|
+
|
|
492
|
+
3. Pre-compute:
|
|
493
|
+
- While user speaks, prepare context
|
|
494
|
+
- Generate opening phrase in parallel
|
|
495
|
+
|
|
496
|
+
4. Edge deployment:
|
|
497
|
+
- Run VAD/STT at edge
|
|
498
|
+
- Use nearest cloud region
|
|
499
|
+
|
|
500
|
+
## Measure continuously:
|
|
501
|
+
Log timestamps at each stage, track P50/P95 latency
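One way to log timestamps at each stage is a small per-turn timer. This is a sketch under assumptions: `TurnTimer` and the stage names are hypothetical, and the injected clock exists only so the example is deterministic. In production the default `Date.now()` clock applies.

```javascript
// Records a timestamp per pipeline stage and reports per-stage durations.
class TurnTimer {
  constructor(now = () => Date.now()) {
    this.now = now; // injectable clock, handy for tests
    this.marks = [];
  }

  // Call once when each stage completes.
  mark(stage) {
    this.marks.push({ stage, at: this.now() });
  }

  // Duration of a stage = gap between its mark and the previous mark.
  durations() {
    const result = {};
    for (let i = 1; i < this.marks.length; i++) {
      result[this.marks[i].stage] = this.marks[i].at - this.marks[i - 1].at;
    }
    return result;
  }

  // End-to-end time from the first mark to the last.
  total() {
    if (this.marks.length < 2) return 0;
    return this.marks[this.marks.length - 1].at - this.marks[0].at;
  }
}

// Demonstration with a fake clock (milliseconds since the user stopped speaking).
const ticks = [0, 80, 260, 520, 650];
const timer = new TurnTimer(() => ticks.shift());
for (const stage of ['speech_end', 'vad_done', 'stt_final', 'llm_first_token', 'tts_first_audio']) {
  timer.mark(stage);
}
console.log(timer.durations());
// { vad_done: 80, stt_final: 180, llm_first_token: 260, tts_first_audio: 130 }
console.log(timer.total()); // 650
```

Collecting `total()` for every turn gives the sample set from which P50 and P95 can be derived.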
|
|
502
|
+
|
|
503
|
+
### Response Time Variance Disrupts Rhythm
|
|
504
|
+
|
|
505
|
+
Severity: HIGH
|
|
506
|
+
|
|
507
|
+
Situation: Voice agent with inconsistent response times
|
|
508
|
+
|
|
509
|
+
Symptoms:
|
|
510
|
+
Conversations feel unpredictable. User doesn't know when to speak.
|
|
511
|
+
Sometimes agent responds immediately, sometimes after long pause.
|
|
512
|
+
Users talk over agent. Agent talks over users.
|
|
513
|
+
|
|
514
|
+
Why this breaks:
|
|
515
|
+
Jitter (variance in response time) disrupts conversational rhythm
|
|
516
|
+
more than absolute latency. Consistent 800ms feels better than
|
|
517
|
+
alternating 400ms and 1200ms. Users can't adapt to unpredictable
|
|
518
|
+
timing.
|
|
519
|
+
|
|
520
|
+
Recommended fix:
|
|
521
|
+
|
|
522
|
+
# Target jitter metrics:
|
|
523
|
+
- Standard deviation: <100ms
|
|
524
|
+
- P95-P50 gap: <200ms
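Both jitter targets can be computed from a list of per-turn response times. The sketch below assumes a non-empty array of latencies in milliseconds, uses the population standard deviation, and picks percentiles by nearest rank. `percentile` and `jitterReport` are hypothetical names.

```javascript
// Nearest-rank percentile over a list of samples (p in 0..100).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Summarise jitter and compare it with the targets stated above.
function jitterReport(latenciesMs) {
  const n = latenciesMs.length;
  const mean = latenciesMs.reduce((sum, x) => sum + x, 0) / n;
  const variance = latenciesMs.reduce((sum, x) => sum + (x - mean) ** 2, 0) / n;
  const stdDev = Math.sqrt(variance);
  const p95p50Gap = percentile(latenciesMs, 95) - percentile(latenciesMs, 50);
  return {
    mean,
    stdDev,
    p95p50Gap,
    withinTarget: stdDev < 100 && p95p50Gap < 200,
  };
}

const steady = jitterReport([780, 800, 820, 790, 810]);
const alternating = jitterReport([400, 1200, 400, 1200, 400, 1200]);
console.log(steady.withinTarget, alternating.withinTarget); // true false
```

Both sample sets average 800ms, yet only the steady one meets the targets, which mirrors the point that a consistent 800ms feels better than alternating 400ms and 1200ms.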
|
|
525
|
+
|
|
526
|
+
## Reduce jitter sources:
|
|
527
|
+
|
|
528
|
+
1. Consistent model loading:
|
|
529
|
+
- Keep models warm
|
|
530
|
+
- Pre-load on connection start
|
|
531
|
+
|
|
532
|
+
2. Buffer audio output:
|
|
533
|
+
- Small buffer (50-100ms) smooths playback
|
|
534
|
+
- Don't start playing until buffer filled
|
|
535
|
+
|
|
536
|
+
3. Handle LLM variance:
|
|
537
|
+
- gpt-4o-mini more consistent than larger models
|
|
538
|
+
- Set max_tokens to limit long responses
|
|
539
|
+
|
|
540
|
+
4. Monitor and alert:
|
|
541
|
+
- Track response time distribution
|
|
542
|
+
- Alert on jitter spikes
|
|
543
|
+
|
|
544
|
+
## Implementation:
|
|
545
|
+
const MIN_RESPONSE_TIME = 400; // ms
|
|
546
|
+
|
|
547
|
+
async function respondWithConsistentTiming(text) {
|
|
548
|
+
const startTime = Date.now();
|
|
549
|
+
const audio = await generateSpeech(text);
|
|
550
|
+
|
|
551
|
+
const elapsed = Date.now() - startTime;
|
|
552
|
+
if (elapsed < MIN_RESPONSE_TIME) {
|
|
553
|
+
await delay(MIN_RESPONSE_TIME - elapsed);
|
|
554
|
+
}
|
|
555
|
+
|
|
556
|
+
playAudio(audio);
|
|
557
|
+
}
|
|
558
|
+
|
|
559
|
+
### Using Silence Duration for Turn Detection
|
|
560
|
+
|
|
561
|
+
Severity: HIGH
|
|
48
562
|
|
|
49
|
-
|
|
563
|
+
Situation: Detecting when user finishes speaking
|
|
50
564
|
|
|
51
|
-
|
|
565
|
+
Symptoms:
|
|
566
|
+
Agent interrupts user mid-thought. Or waits too long after user
|
|
567
|
+
finishes. "Let me think..." triggers premature response. Short
|
|
568
|
+
answers have awkward pause before response.
|
|
52
569
|
|
|
53
|
-
|
|
570
|
+
Why this breaks:
|
|
571
|
+
Simple silence detection (e.g., "end turn after 500ms silence")
|
|
572
|
+
doesn't understand conversation. Humans pause mid-sentence.
|
|
573
|
+
"Yes." needs fast response, "Well, let me think about that..."
|
|
574
|
+
needs patience. Fixed timeout fits neither.
|
|
54
575
|
|
|
55
|
-
|
|
576
|
+
Recommended fix:
|
|
56
577
|
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
578
|
+
# Use semantic VAD:
|
|
579
|
+
|
|
580
|
+
## OpenAI Semantic VAD:
|
|
581
|
+
client.updateSession({
|
|
582
|
+
turn_detection: {
|
|
583
|
+
type: 'semantic_vad',
|
|
584
|
+
// Waits longer after "umm..."
|
|
585
|
+
// Responds faster after "Yes, that's correct."
|
|
586
|
+
},
|
|
587
|
+
});
|
|
588
|
+
|
|
589
|
+
## Pipecat SmartTurn:
|
|
590
|
+
const pipeline = new Pipeline({
|
|
591
|
+
vad: new SileroVAD(),
|
|
592
|
+
turnDetection: new SmartTurn(),
|
|
593
|
+
});
|
|
594
|
+
|
|
595
|
+
// SmartTurn considers:
|
|
596
|
+
// - Speech content (complete sentence?)
|
|
597
|
+
// - Prosody (falling intonation?)
|
|
598
|
+
// - Context (question asked?)
|
|
599
|
+
|
|
600
|
+
## Fallback: Adaptive silence threshold:
|
|
601
|
+
function calculateSilenceThreshold(transcript) {
|
|
602
|
+
const endsWithComplete = transcript.match(/[.!?]\s*$/);
|
|
603
|
+
const hasFillers = transcript.match(/\b(um+|uh+|like|well)\b/i);
|
|
604
|
+
|
|
605
|
+
if (endsWithComplete && !hasFillers) {
|
|
606
|
+
return 300; // Fast response
|
|
607
|
+
} else if (hasFillers) {
|
|
608
|
+
return 1500; // Wait for continuation
|
|
609
|
+
}
|
|
610
|
+
return 700; // Default
|
|
611
|
+
}
|
|
612
|
+
|
|
613
|
+
### Agent Doesn't Stop When User Interrupts
|
|
614
|
+
|
|
615
|
+
Severity: HIGH
|
|
616
|
+
|
|
617
|
+
Situation: User tries to interrupt agent mid-sentence
|
|
618
|
+
|
|
619
|
+
Symptoms:
|
|
620
|
+
Agent talks over user. User has to wait for agent to finish.
|
|
621
|
+
Frustrating experience. Users give up and abandon call.
|
|
622
|
+
"STOP! STOP!" doesn't work.
|
|
623
|
+
|
|
624
|
+
Why this breaks:
|
|
625
|
+
Without barge-in handling, the TTS plays to completion regardless
|
|
626
|
+
of user input. This violates basic conversational norms - in human
|
|
627
|
+
conversation, we stop when interrupted.
|
|
628
|
+
|
|
629
|
+
Recommended fix:
|
|
630
|
+
|
|
631
|
+
# Implement barge-in detection:
|
|
632
|
+
|
|
633
|
+
## Basic barge-in:
|
|
634
|
+
vad.on('speech_start', () => {
|
|
635
|
+
if (ttsPlayer.isPlaying) {
|
|
636
|
+
// 1. Stop audio immediately
|
|
637
|
+
ttsPlayer.stop();
|
|
638
|
+
|
|
639
|
+
// 2. Cancel pending TTS generation
|
|
640
|
+
ttsController.abort();
|
|
641
|
+
|
|
642
|
+
// 3. Checkpoint conversation state
|
|
643
|
+
conversationState.save();
|
|
644
|
+
|
|
645
|
+
// 4. Listen to new input
|
|
646
|
+
startTranscription();
|
|
647
|
+
}
|
|
648
|
+
});
|
|
649
|
+
|
|
650
|
+
## Advanced: Distinguish interruption types:
|
|
651
|
+
vad.on('speech_start', async () => {
|
|
652
|
+
if (!ttsPlayer.isPlaying) return;
|
|
653
|
+
|
|
654
|
+
// Wait 200ms to get first words
|
|
655
|
+
await delay(200);
|
|
656
|
+
const firstWords = getTranscriptSoFar();
|
|
657
|
+
|
|
658
|
+
if (isBackchannel(firstWords)) {
|
|
659
|
+
// "uh-huh", "yeah" - don't interrupt
|
|
660
|
+
return;
|
|
661
|
+
}
|
|
662
|
+
|
|
663
|
+
if (isClarification(firstWords)) {
|
|
664
|
+
// "What?", "Sorry?" - repeat last sentence
|
|
665
|
+
repeatLastSentence();
|
|
666
|
+
} else {
|
|
667
|
+
// Real interruption - stop and listen
|
|
668
|
+
handleFullInterruption();
|
|
669
|
+
}
|
|
670
|
+
});
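The handler above relies on `isBackchannel` and `isClarification`, which the source does not define. A minimal sketch follows, assuming exact matching against small word lists is good enough as a starting point. The lists, `normalize`, `classifyInterruption`, and the extra `'unknown'` result for an empty transcript are all hypothetical.

```javascript
// Hypothetical word lists; tune them against real call transcripts.
const BACKCHANNELS = new Set(['uh-huh', 'mm-hmm', 'mhm', 'yeah', 'yep', 'right', 'okay', 'ok', 'sure']);
const CLARIFICATIONS = new Set(['what', 'sorry', 'pardon', 'huh', 'come again', 'say again']);

// Lowercase, drop punctuation except hyphens and apostrophes, collapse whitespace.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z'\-\s]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// True only when the whole utterance so far is a listening noise.
function isBackchannel(firstWords) {
  return BACKCHANNELS.has(normalize(firstWords));
}

// True only when the whole utterance so far asks for a repeat.
function isClarification(firstWords) {
  return CLARIFICATIONS.has(normalize(firstWords));
}

function classifyInterruption(firstWords) {
  if (normalize(firstWords) === '') return 'unknown'; // nothing transcribed yet
  if (isBackchannel(firstWords)) return 'backchannel';
  if (isClarification(firstWords)) return 'clarification';
  return 'interruption';
}

console.log(classifyInterruption('Uh-huh.')); // backchannel
console.log(classifyInterruption('Sorry?')); // clarification
console.log(classifyInterruption('Yeah but wait')); // interruption
```

Matching the whole utterance rather than its first word means "Yeah but wait" still counts as a real interruption, at the cost of missing backchannels that are not on the list.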
|
|
671
|
+
|
|
672
|
+
## Response time target:
|
|
673
|
+
- Barge-in response: <200ms
|
|
674
|
+
- User should feel heard immediately
|
|
675
|
+
|
|
676
|
+
### Generating Text-Length Responses for Voice
|
|
677
|
+
|
|
678
|
+
Severity: MEDIUM
|
|
679
|
+
|
|
680
|
+
Situation: Prompting LLM for voice agent responses
|
|
681
|
+
|
|
682
|
+
Symptoms:
|
|
683
|
+
Agent rambles. Users lose track of information. "Can you repeat
|
|
684
|
+
that?" requests. Users interrupt to ask for shorter version.
|
|
685
|
+
Low comprehension of conveyed information.
|
|
686
|
+
|
|
687
|
+
Why this breaks:
|
|
688
|
+
Text can be scanned and re-read. Voice is linear and ephemeral.
|
|
689
|
+
A 3-paragraph response that works in chat is overwhelming in voice.
|
|
690
|
+
Users can only hold ~7 items in working memory.
|
|
691
|
+
|
|
692
|
+
Recommended fix:
|
|
693
|
+
|
|
694
|
+
# Constrain response length in prompts:
|
|
695
|
+
|
|
696
|
+
system_prompt = '''
|
|
697
|
+
You are a voice assistant. Keep responses UNDER 30 WORDS.
|
|
698
|
+
For complex information, break into chunks and confirm
|
|
699
|
+
understanding between each.
|
|
700
|
+
|
|
701
|
+
Instead of: "Here are the three options. First, you could...
|
|
702
|
+
Second... Third..."
|
|
703
|
+
|
|
704
|
+
Say: "I found 3 options. Want me to go through them?"
|
|
705
|
+
|
|
706
|
+
Never list more than 3 items without pausing for confirmation.
|
|
707
|
+
'''
|
|
708
|
+
|
|
709
|
+
## Enforce at generation:
|
|
710
|
+
const response = await openai.chat.completions.create({
|
|
711
|
+
max_tokens: 100, // Hard limit
|
|
712
|
+
// ...
|
|
713
|
+
});
|
|
714
|
+
|
|
715
|
+
## Chunking pattern:
|
|
716
|
+
if (information.length > 3) {
|
|
717
|
+
response = `I have ${information.length} items. Let's go through them one at a time. First: ${information[0]}. Ready for the next?`;
|
|
718
|
+
}
|
|
719
|
+
|
|
720
|
+
## Progressive disclosure:
|
|
721
|
+
"I found your account. Want the balance, recent transactions, or something else?"
|
|
722
|
+
// Don't dump all info at once
|
|
723
|
+
|
|
724
|
+
### Using Bullets/Numbers/Markdown in Voice
|
|
725
|
+
|
|
726
|
+
Severity: MEDIUM
|
|
727
|
+
|
|
728
|
+
Situation: Formatting LLM output for voice
|
|
729
|
+
|
|
730
|
+
Symptoms:
|
|
731
|
+
"First bullet point: item one" read aloud. Numbers read as "one
|
|
732
|
+
two three" instead of "one, two, three." Markdown artifacts in
|
|
733
|
+
speech. Robotic, unnatural delivery.
|
|
734
|
+
|
|
735
|
+
Why this breaks:
|
|
736
|
+
TTS models read what they're given. Text formatting intended for
|
|
737
|
+
visual display sounds robotic when read aloud. Users can't "see"
|
|
738
|
+
structure in audio.
|
|
739
|
+
|
|
740
|
+
Recommended fix:
|
|
741
|
+
|
|
742
|
+
# Prompt for spoken format:
|
|
743
|
+
|
|
744
|
+
system_prompt = '''
|
|
745
|
+
Format responses for SPOKEN delivery:
|
|
746
|
+
- No bullet points, numbered lists, or markdown
|
|
747
|
+
- Spell out numbers: "twenty-three" not "23"
|
|
748
|
+
- Spell out abbreviations: "United States" not "US"
|
|
749
|
+
- Use verbal signposting: "There are three things. First..."
|
|
750
|
+
- Never use asterisks, dashes, or special characters
|
|
751
|
+
'''
|
|
752
|
+
|
|
753
|
+
## Post-processing:
|
|
754
|
+
function prepareForSpeech(text) {
|
|
755
|
+
return text
|
|
756
|
+
// Remove markdown
|
|
757
|
+
.replace(/[*_#`]/g, '')
|
|
758
|
+
// Convert numbers
|
|
759
|
+
.replace(/\d+/g, (digits) => numToWords(Number(digits))) // numToWords: helper you supply
|
|
760
|
+
// Expand abbreviations
|
|
761
|
+
.replace(/\betc\b/gi, 'et cetera')
|
|
762
|
+
.replace(/\be\.g\./gi, 'for example')
|
|
763
|
+
// Add pauses
|
|
764
|
+
.replace(/\. /g, '... ')
|
|
765
|
+
.replace(/, /g, '... ');
|
|
766
|
+
}
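The post-processing step relies on a `numToWords` helper that is not defined here. One possible sketch follows, assuming English output and non-negative integers up to 999,999; it accepts either a number or a digit string, so it works as a `String.prototype.replace` callback as well. Decimals, currency, and ordinals are out of scope and would need separate handling.

```javascript
const ONES = [
  "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
  "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
  "seventeen", "eighteen", "nineteen",
];
const TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"];

// Illustrative implementation (an assumption, not the skill's own helper).
// Converts an integer in [0, 999999] to English words.
function numToWords(value) {
  const n = typeof value === "string" ? Number(value) : value;
  if (!Number.isInteger(n) || n < 0 || n > 999999) {
    throw new RangeError("numToWords supports integers from 0 to 999999");
  }
  if (n < 20) return ONES[n];
  if (n < 100) {
    const rest = n % 10;
    return TENS[Math.floor(n / 10)] + (rest ? "-" + ONES[rest] : "");
  }
  if (n < 1000) {
    const rest = n % 100;
    return ONES[Math.floor(n / 100)] + " hundred" + (rest ? " " + numToWords(rest) : "");
  }
  const rest = n % 1000;
  return numToWords(Math.floor(n / 1000)) + " thousand" + (rest ? " " + numToWords(rest) : "");
}

// Example: numToWords(23) → "twenty-three"
```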
|
|
767
|
+
|
|
768
|
+
## SSML for precise control:
|
|
769
|
+
<speak>
|
|
770
|
+
The total is <say-as interpret-as="currency">$49.99</say-as>.
|
|
771
|
+
<break time="500ms"/>
|
|
772
|
+
Want to proceed?
|
|
773
|
+
</speak>
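When SSML like the snippet above is assembled from dynamic text, the text should be XML-escaped first, or a stray `&` or `<` can make the markup invalid. The sketch below is an assumption-laden illustration: `escapeXml` and `confirmTotalSsml` are hypothetical names, and support for `interpret-as="currency"` varies between TTS engines, so check your provider before relying on it.

```javascript
// Escape the five XML special characters so user or LLM text cannot break the SSML.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical builder mirroring the SSML example: amount, pause, then a question.
function confirmTotalSsml(amount, question, pauseMs = 500) {
  return (
    "<speak>" +
    `The total is <say-as interpret-as="currency">${escapeXml(amount)}</say-as>.` +
    `<break time="${pauseMs}ms"/>` +
    escapeXml(question) +
    "</speak>"
  );
}
```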
|
|
774
|
+
|
|
775
|
+
### VAD/STT Fails in Noisy Environments
|
|
776
|
+
|
|
777
|
+
Severity: MEDIUM
|
|
778
|
+
|
|
779
|
+
Situation: Users in cars, cafes, outdoors
|
|
780
|
+
|
|
781
|
+
Symptoms:
|
|
782
|
+
"I didn't catch that" frequently. Background noise triggers
|
|
783
|
+
false starts. Fan/AC causes continuous listening. Car engine
|
|
784
|
+
noise confuses STT.
|
|
785
|
+
|
|
786
|
+
Why this breaks:
|
|
787
|
+
Default VAD thresholds work for quiet environments. Real-world
|
|
788
|
+
usage includes background noise that triggers false positives
|
|
789
|
+
or masks speech, causing false negatives.
|
|
790
|
+
|
|
791
|
+
Recommended fix:
|
|
792
|
+
|
|
793
|
+
# Implement noise handling:
|
|
794
|
+
|
|
795
|
+
## 1. Noise reduction in STT:
|
|
796
|
+
const transcription = await deepgram.transcription.live({
|
|
797
|
+
model: 'nova-3',
|
|
798
|
+
noise_reduction: true, // assumed option name; confirm your STT provider actually exposes it
|
|
799
|
+
// smart_format cleans up transcript formatting (punctuation, numerals); it does not filter noise
|
|
800
|
+
smart_format: true,
|
|
801
|
+
});
|
|
802
|
+
|
|
803
|
+
## 2. Adaptive VAD threshold:
|
|
804
|
+
// Measure ambient noise level
|
|
805
|
+
const ambientLevel = measureAmbientNoise(5000); // 5 sec sample
|
|
806
|
+
|
|
807
|
+
vad.setThreshold(ambientLevel * 1.5); // Above ambient
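`measureAmbientNoise` and `vad.setThreshold` above are placeholders. One way to compute the underlying values is an energy-based approach using the root mean square (RMS) of audio samples. The sketch below assumes samples are floats in the range -1 to 1; the `floor` option is an added assumption that keeps the threshold from collapsing to zero in a silent room. Production VADs usually use trained models rather than plain energy.

```javascript
// Root mean square of a frame of audio samples (floats in [-1, 1]).
function rms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

// Threshold set a fixed factor above the measured ambient level,
// never dropping below a small floor.
function adaptiveThreshold(ambientSamples, { factor = 1.5, floor = 0.01 } = {}) {
  return Math.max(floor, rms(ambientSamples) * factor);
}

// A frame counts as speech when its energy exceeds the threshold.
function isSpeech(frame, threshold) {
  return rms(frame) > threshold;
}
```

Re-measuring the ambient level periodically (for example, during long silences) lets the threshold follow a user who moves from a quiet room into a car.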
|
|
808
|
+
|
|
809
|
+
## 3. Confidence filtering:
|
|
810
|
+
stt.on('transcript', (data) => {
|
|
811
|
+
if (data.confidence < 0.7) {
|
|
812
|
+
// Low confidence - probably noise
|
|
813
|
+
askForRepeat();
|
|
814
|
+
return;
|
|
815
|
+
}
|
|
816
|
+
processTranscript(data.transcript);
|
|
817
|
+
});
|
|
818
|
+
|
|
819
|
+
## 4. Echo cancellation:
|
|
820
|
+
// Prevent agent's voice from being transcribed
|
|
821
|
+
const echoCanceller = new EchoCanceller();
|
|
822
|
+
echoCanceller.reference(ttsOutput);
|
|
823
|
+
const cleanedAudio = echoCanceller.process(userAudio);
|
|
824
|
+
|
|
825
|
+
### STT Produces Incorrect or Hallucinated Text
|
|
826
|
+
|
|
827
|
+
Severity: MEDIUM
|
|
828
|
+
|
|
829
|
+
Situation: Processing unclear or accented speech
|
|
830
|
+
|
|
831
|
+
Symptoms:
|
|
832
|
+
Agent responds to something user didn't say. Names consistently
|
|
833
|
+
wrong. Technical terms misheard. "I said X, not Y" frustration.
|
|
834
|
+
|
|
835
|
+
Why this breaks:
|
|
836
|
+
STT models can hallucinate, especially on proper nouns, technical
|
|
837
|
+
terms, or accented speech. These errors propagate through the
|
|
838
|
+
pipeline and produce nonsensical responses.
|
|
839
|
+
|
|
840
|
+
Recommended fix:
|
|
841
|
+
|
|
842
|
+
# Mitigate STT errors:
|
|
843
|
+
|
|
844
|
+
## 1. Use keywords/biasing:
|
|
845
|
+
const transcription = await deepgram.transcription.live({
|
|
846
|
+
keywords: ['Acme Corp', 'ProductName', 'John Smith'],
|
|
847
|
+
keyword_boost: 'high',
|
|
848
|
+
});
|
|
849
|
+
|
|
850
|
+
## 2. Confirmation for critical info:
|
|
851
|
+
if (containsNameOrNumber(transcript)) {
|
|
852
|
+
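response = `I heard "${extractNameOrNumber(transcript)}". Is that correct?`; // extractNameOrNumber: hypothetical helper returning the detected name or number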
|
|
853
|
+
}
|
|
854
|
+
|
|
855
|
+
## 3. Confidence-based fallback:
|
|
856
|
+
if (confidence < 0.8) {
|
|
857
|
+
response = `I think you said "${transcript}". Did I get that right?`;
|
|
858
|
+
}
|
|
859
|
+
|
|
860
|
+
## 4. Multiple hypothesis handling:
|
|
861
|
+
// Some STT APIs return n-best list
|
|
862
|
+
const alternatives = transcription.alternatives;
|
|
863
|
+
if (alternatives.length > 1 && alternatives[0].confidence - alternatives[1].confidence < 0.1) {
|
|
864
|
+
// Ambiguous - ask for clarification
|
|
865
|
+
}
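The confidence fallback and n-best checks above can be combined into one decision function. This is a sketch under assumptions: each alternative is taken to be an object with `transcript` and `confidence` fields (the exact shape depends on the STT provider), `chooseTranscript` is a hypothetical name, and the 0.8 and 0.1 defaults simply reuse the thresholds from the snippets above.

```javascript
// Decide what to do with an n-best list from STT.
// Returns one of: repeat (nothing heard), confirm (low confidence),
// clarify (top two are too close), accept.
function chooseTranscript(alternatives, { minConfidence = 0.8, minMargin = 0.1 } = {}) {
  if (!alternatives || alternatives.length === 0) {
    return { action: "repeat" };
  }
  // Sort a copy so the caller's array is not mutated.
  const ranked = [...alternatives].sort((a, b) => b.confidence - a.confidence);
  const [best, runnerUp] = ranked;
  if (best.confidence < minConfidence) {
    return { action: "confirm", transcript: best.transcript };
  }
  if (runnerUp && best.confidence - runnerUp.confidence < minMargin) {
    return { action: "clarify", options: [best.transcript, runnerUp.transcript] };
  }
  return { action: "accept", transcript: best.transcript };
}
```

A `clarify` result could be spoken as "Did you say X or Y?", while `confirm` maps to the "I think you said..." prompt shown earlier.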
|
|
866
|
+
|
|
867
|
+
## 5. Error correction patterns:
|
|
868
|
+
const promptPattern = `
|
|
869
|
+
User may correct previous mistakes. If they say "no, I said X"
|
|
870
|
+
or "not Y, Z", update your understanding accordingly.
|
|
871
|
+
`;
|
|
872
|
+
|
|
873
|
+
## Validation Checks
|
|
874
|
+
|
|
875
|
+
### Missing Latency Measurement
|
|
876
|
+
|
|
877
|
+
Severity: ERROR
|
|
878
|
+
|
|
879
|
+
Voice agents must track latency at each stage
|
|
880
|
+
|
|
881
|
+
Message: Voice pipeline without latency tracking. Add timestamps at each stage to measure performance.
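A minimal sketch of per-stage timestamps, assuming a runtime with a global `performance.now()` (Node 16+ or a browser). `LatencyTracker` and the stage names are hypothetical; the clock is injectable so the tracker can be tested deterministically.

```javascript
// Records a timestamp per pipeline stage and reports the gaps between them.
class LatencyTracker {
  constructor(now = () => performance.now()) {
    this.now = now;
    this.marks = [];
  }

  // Call at each stage boundary, e.g. "speech_end", "stt_final", "tts_first_audio".
  mark(stage) {
    this.marks.push({ stage, at: this.now() });
  }

  // Durations between consecutive marks, plus the end-to-end total, in milliseconds.
  report() {
    const durations = {};
    for (let i = 1; i < this.marks.length; i++) {
      const prev = this.marks[i - 1];
      const cur = this.marks[i];
      durations[`${prev.stage}->${cur.stage}`] = cur.at - prev.at;
    }
    const totalMs =
      this.marks.length > 1 ? this.marks[this.marks.length - 1].at - this.marks[0].at : 0;
    return { durations, totalMs };
  }
}
```

Create one tracker per user turn and log `report()` once the first TTS audio is sent; the end-to-end total is the number users actually perceive.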
|
|
882
|
+
|
|
883
|
+
### Using Batch STT Instead of Streaming
|
|
884
|
+
|
|
885
|
+
Severity: WARNING
|
|
886
|
+
|
|
887
|
+
Streaming STT reduces latency significantly
|
|
888
|
+
|
|
889
|
+
Message: Using batch transcription. Consider streaming for lower latency in voice agents.
|
|
890
|
+
|
|
891
|
+
### TTS Without Streaming Output
|
|
892
|
+
|
|
893
|
+
Severity: WARNING
|
|
894
|
+
|
|
895
|
+
Streaming TTS reduces time to first audio
|
|
896
|
+
|
|
897
|
+
Message: TTS without streaming. Stream audio to reduce time to first audio.
|
|
898
|
+
|
|
899
|
+
### Hardcoded VAD Silence Threshold
|
|
900
|
+
|
|
901
|
+
Severity: WARNING
|
|
902
|
+
|
|
903
|
+
Fixed silence thresholds don't adapt to conversation
|
|
904
|
+
|
|
905
|
+
Message: Fixed silence threshold. Consider semantic VAD or adaptive thresholds for better turn-taking.
|
|
906
|
+
|
|
907
|
+
### Missing Barge-In Handling
|
|
908
|
+
|
|
909
|
+
Severity: WARNING
|
|
910
|
+
|
|
911
|
+
Voice agents should stop when user interrupts
|
|
912
|
+
|
|
913
|
+
Message: VAD without barge-in handling. Stop TTS when user starts speaking.
|
|
914
|
+
|
|
915
|
+
### Voice Prompt Without Length Constraints
|
|
916
|
+
|
|
917
|
+
Severity: WARNING
|
|
918
|
+
|
|
919
|
+
Voice prompts should constrain response length
|
|
920
|
+
|
|
921
|
+
Message: Voice prompt without length constraints. Add 'Keep responses under 30 words' to the system prompt.
|
|
922
|
+
|
|
923
|
+
### Markdown Formatting Sent to TTS
|
|
924
|
+
|
|
925
|
+
Severity: WARNING
|
|
926
|
+
|
|
927
|
+
Markdown will be read literally by TTS
|
|
928
|
+
|
|
929
|
+
Message: Check for markdown in TTS input. Strip formatting before sending to TTS.
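A check like this could be approximated with a few regular expressions run on text just before it is sent to TTS. The pattern list below is an illustrative assumption, not the rule this validation actually uses, and it will miss some markdown constructs.

```javascript
// Heuristic patterns for common markdown: headings, bullets, numbered lists,
// bold, inline code, and links. No `g` flag, so `test` stays stateless.
const MARKDOWN_PATTERNS = [
  /^\s{0,3}#{1,6}\s/m,        // "# Heading"
  /^\s*[-*+]\s+/m,            // "- item"
  /^\s*\d+\.\s+/m,            // "1. item"
  /\*\*[^*]+\*\*/,            // "**bold**"
  /`[^`]+`/,                  // "`code`"
  /\[[^\]]+\]\([^)]+\)/,      // "[text](url)"
];

// Returns true when the text looks like it still contains markdown.
function containsMarkdown(text) {
  return MARKDOWN_PATTERNS.some((pattern) => pattern.test(text));
}
```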
|
|
930
|
+
|
|
931
|
+
### STT Without Error Handling
|
|
932
|
+
|
|
933
|
+
Severity: WARNING
|
|
934
|
+
|
|
935
|
+
STT can fail or return low confidence
|
|
936
|
+
|
|
937
|
+
Message: STT without error handling. Check confidence scores and handle failures.
|
|
938
|
+
|
|
939
|
+
### WebSocket Without Reconnection
|
|
940
|
+
|
|
941
|
+
Severity: WARNING
|
|
942
|
+
|
|
943
|
+
Realtime APIs need reconnection handling
|
|
944
|
+
|
|
945
|
+
Message: Realtime connection without reconnection logic. Handle disconnects gracefully.
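Reconnection is commonly implemented with exponential backoff plus jitter, so that many clients do not retry in lockstep. The sketch below is generic and not tied to any realtime API: `connect` is whatever async function opens your connection, and the names, delays, and attempt count are assumptions to tune.

```javascript
// Delay before retry number `attempt` (0-based): a random value in
// [0, min(maxMs, baseMs * 2^attempt)), i.e. "full jitter" backoff.
function backoffDelayMs(attempt, { baseMs = 250, maxMs = 8000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Calls `connect` until it resolves or `maxAttempts` is exhausted,
// sleeping with backoff between failures. Rethrows the last error.
async function connectWithRetry(
  connect,
  { maxAttempts = 5, sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)), ...backoff } = {}
) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, backoff));
      }
    }
  }
  throw lastError;
}
```

In a voice agent, it also helps to tell the user something like "One moment, reconnecting" if the retry loop runs for more than a second or two, since silence reads as a dropped call.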
|
|
946
|
+
|
|
947
|
+
### Missing Noise Handling
|
|
948
|
+
|
|
949
|
+
Severity: INFO
|
|
950
|
+
|
|
951
|
+
Real-world audio includes background noise
|
|
952
|
+
|
|
953
|
+
Message: Consider adding noise handling for real-world audio quality.
|
|
954
|
+
|
|
955
|
+
## Collaboration
|
|
956
|
+
|
|
957
|
+
### Delegation Triggers
|
|
958
|
+
|
|
959
|
+
- user needs phone/telephony integration -> backend (Twilio, Vonage, SIP integration)
|
|
960
|
+
- user needs LLM optimization -> llm-architect (Model selection, prompting, fine-tuning)
|
|
961
|
+
- user needs tools for voice agent -> agent-tool-builder (Tool design for voice context)
|
|
962
|
+
- user needs multi-agent voice system -> multi-agent-orchestration (Voice agents working together)
|
|
963
|
+
- user needs accessibility compliance -> accessibility-specialist (Voice interface accessibility)
|
|
67
964
|
|
|
68
965
|
## Related Skills
|
|
69
966
|
|
|
70
967
|
Works well with: `agent-tool-builder`, `multi-agent-orchestration`, `llm-architect`, `backend`
|
|
71
968
|
|
|
72
969
|
## When to Use
|
|
73
|
-
|
|
970
|
+
|
|
971
|
+
- User mentions or implies: voice agent
|
|
972
|
+
- User mentions or implies: speech to text
|
|
973
|
+
- User mentions or implies: text to speech
|
|
974
|
+
- User mentions or implies: whisper
|
|
975
|
+
- User mentions or implies: elevenlabs
|
|
976
|
+
- User mentions or implies: deepgram
|
|
977
|
+
- User mentions or implies: realtime api
|
|
978
|
+
- User mentions or implies: voice assistant
|
|
979
|
+
- User mentions or implies: voice ai
|
|
980
|
+
- User mentions or implies: conversational ai
|
|
981
|
+
- User mentions or implies: tts
|
|
982
|
+
- User mentions or implies: stt
|
|
983
|
+
- User mentions or implies: asr
|