opencode-skills-collection 1.0.185 → 1.0.187

Files changed (71)
  1. package/bundled-skills/.antigravity-install-manifest.json +5 -1
  2. package/bundled-skills/3d-web-experience/SKILL.md +152 -37
  3. package/bundled-skills/agent-evaluation/SKILL.md +1088 -26
  4. package/bundled-skills/agent-memory-systems/SKILL.md +1037 -25
  5. package/bundled-skills/agent-tool-builder/SKILL.md +668 -16
  6. package/bundled-skills/ai-agents-architect/SKILL.md +271 -31
  7. package/bundled-skills/ai-product/SKILL.md +716 -26
  8. package/bundled-skills/ai-wrapper-product/SKILL.md +450 -44
  9. package/bundled-skills/algolia-search/SKILL.md +867 -15
  10. package/bundled-skills/autonomous-agents/SKILL.md +1033 -26
  11. package/bundled-skills/aws-serverless/SKILL.md +1046 -35
  12. package/bundled-skills/azure-functions/SKILL.md +1318 -19
  13. package/bundled-skills/browser-automation/SKILL.md +1065 -28
  14. package/bundled-skills/browser-extension-builder/SKILL.md +159 -32
  15. package/bundled-skills/bullmq-specialist/SKILL.md +347 -16
  16. package/bundled-skills/clerk-auth/SKILL.md +796 -15
  17. package/bundled-skills/computer-use-agents/SKILL.md +1870 -28
  18. package/bundled-skills/context-window-management/SKILL.md +271 -18
  19. package/bundled-skills/conversation-memory/SKILL.md +453 -24
  20. package/bundled-skills/crewai/SKILL.md +252 -46
  21. package/bundled-skills/discord-bot-architect/SKILL.md +1207 -34
  22. package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
  23. package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
  24. package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
  25. package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
  26. package/bundled-skills/docs/users/bundles.md +1 -1
  27. package/bundled-skills/docs/users/claude-code-skills.md +1 -1
  28. package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
  29. package/bundled-skills/docs/users/getting-started.md +1 -1
  30. package/bundled-skills/docs/users/kiro-integration.md +1 -1
  31. package/bundled-skills/docs/users/usage.md +4 -4
  32. package/bundled-skills/docs/users/visual-guide.md +4 -4
  33. package/bundled-skills/email-systems/SKILL.md +646 -26
  34. package/bundled-skills/faf-expert/SKILL.md +221 -0
  35. package/bundled-skills/faf-wizard/SKILL.md +252 -0
  36. package/bundled-skills/file-uploads/SKILL.md +212 -11
  37. package/bundled-skills/firebase/SKILL.md +646 -16
  38. package/bundled-skills/gcp-cloud-run/SKILL.md +1117 -32
  39. package/bundled-skills/graphql/SKILL.md +1026 -27
  40. package/bundled-skills/hubspot-integration/SKILL.md +804 -19
  41. package/bundled-skills/idea-darwin/SKILL.md +120 -0
  42. package/bundled-skills/inngest/SKILL.md +431 -16
  43. package/bundled-skills/interactive-portfolio/SKILL.md +342 -44
  44. package/bundled-skills/langfuse/SKILL.md +296 -41
  45. package/bundled-skills/langgraph/SKILL.md +259 -50
  46. package/bundled-skills/micro-saas-launcher/SKILL.md +343 -44
  47. package/bundled-skills/neon-postgres/SKILL.md +572 -15
  48. package/bundled-skills/nextjs-supabase-auth/SKILL.md +269 -21
  49. package/bundled-skills/notion-template-business/SKILL.md +371 -44
  50. package/bundled-skills/personal-tool-builder/SKILL.md +537 -44
  51. package/bundled-skills/plaid-fintech/SKILL.md +825 -19
  52. package/bundled-skills/prompt-caching/SKILL.md +438 -25
  53. package/bundled-skills/rag-engineer/SKILL.md +271 -29
  54. package/bundled-skills/salesforce-development/SKILL.md +912 -19
  55. package/bundled-skills/satori/SKILL.md +54 -0
  56. package/bundled-skills/scroll-experience/SKILL.md +381 -44
  57. package/bundled-skills/segment-cdp/SKILL.md +817 -19
  58. package/bundled-skills/shopify-apps/SKILL.md +1475 -19
  59. package/bundled-skills/slack-bot-builder/SKILL.md +1162 -28
  60. package/bundled-skills/telegram-bot-builder/SKILL.md +152 -37
  61. package/bundled-skills/telegram-mini-app/SKILL.md +445 -44
  62. package/bundled-skills/trigger-dev/SKILL.md +916 -27
  63. package/bundled-skills/twilio-communications/SKILL.md +1310 -28
  64. package/bundled-skills/upstash-qstash/SKILL.md +898 -27
  65. package/bundled-skills/vercel-deployment/SKILL.md +637 -39
  66. package/bundled-skills/viral-generator-builder/SKILL.md +132 -37
  67. package/bundled-skills/voice-agents/SKILL.md +937 -27
  68. package/bundled-skills/voice-ai-development/SKILL.md +375 -46
  69. package/bundled-skills/workflow-automation/SKILL.md +982 -29
  70. package/bundled-skills/zapier-make-patterns/SKILL.md +772 -27
  71. package/package.json +1 -1
@@ -1,22 +1,36 @@
1
1
  ---
2
2
  name: voice-agents
3
- description: "You are a voice AI architect who has shipped production voice agents handling millions of calls. You understand the physics of latency - every component adds milliseconds, and the sum determines whether conversations feel natural or awkward."
3
+ description: Voice agents represent the frontier of AI interaction - humans
4
+ speaking naturally with AI systems.
4
5
  risk: safe
5
- source: "vibeship-spawner-skills (Apache 2.0)"
6
- date_added: "2026-02-27"
6
+ source: vibeship-spawner-skills (Apache 2.0)
7
+ date_added: 2026-02-27
7
8
  ---
8
9
 
9
10
  # Voice Agents
10
11
 
11
- You are a voice AI architect who has shipped production voice agents handling
12
- millions of calls. You understand the physics of latency - every component
13
- adds milliseconds, and the sum determines whether conversations feel natural
14
- or awkward.
12
+ Voice agents represent the frontier of AI interaction - humans speaking
13
+ naturally with AI systems. The challenge isn't just speech recognition
14
+ and synthesis, it's achieving natural conversation flow with sub-800ms
15
+ latency while handling interruptions, background noise, and emotional
16
+ nuance.
15
17
 
16
- Your core insight: Two architectures exist. Speech-to-speech (S2S) models like
17
- OpenAI Realtime API preserve emotion and achieve lowest latency but are less
18
- controllable. Pipeline architectures (STT→LLM→TTS) give you control at each
19
- step but add latency. Mos
18
+ This skill covers two architectures: speech-to-speech (OpenAI Realtime API,
19
+ lowest latency, most natural) and pipeline (STT→LLM→TTS, more control,
20
+ easier to debug). Key insight: latency is the constraint. Humans expect
21
+ responses in 500ms. Every millisecond matters.
22
+
23
+ 84% of organizations are increasing voice AI budgets in 2025. This is the
24
+ year voice agents go mainstream.
25
+
26
+ ## Principles
27
+
28
+ - Latency is the constraint - target <800ms end-to-end
29
+ - Jitter (variance) matters as much as absolute latency
30
+ - VAD quality determines conversation flow
31
+ - Interruption handling makes or breaks the experience
32
+ - Start with focused MVP, iterate based on real conversations
33
+ - Combine best-in-class components (Deepgram STT + ElevenLabs TTS)
20
34
 
21
35
  ## Capabilities
22
36
 
@@ -30,44 +44,940 @@ step but add latency. Mos
30
44
  - barge-in-detection
31
45
  - voice-interfaces
32
46
 
47
+ ## Scope
48
+
49
+ - phone-system-integration → backend
50
+ - audio-processing-dsp → audio-specialist
51
+ - music-generation → audio-specialist
52
+ - accessibility-compliance → accessibility-specialist
53
+
54
+ ## Tooling
55
+
56
+ ### Speech_to_speech
57
+
58
+ - OpenAI Realtime API - When: Lowest latency, most natural conversation Note: gpt-4o-realtime-preview, native voice, sub-500ms
59
+ - Pipecat - When: Open-source voice orchestration Note: Daily-backed, enterprise-grade, modular
60
+
61
+ ### Speech_to_text
62
+
63
+ - OpenAI Whisper - When: Highest accuracy, multilingual Note: gpt-4o-transcribe for best results
64
+ - Deepgram Nova-3 - When: Production workloads, 54% lower WER Note: 150-184ms TTFT, 90%+ accuracy on noisy audio
65
+ - AssemblyAI - When: Real-time streaming, speaker diarization Note: Good accuracy-latency balance
66
+
67
+ ### Text_to_speech
68
+
69
+ - ElevenLabs - When: Most natural voice, emotional control Note: Flash model 75ms latency, V3 for expression
70
+ - OpenAI TTS - When: Integrated with OpenAI stack Note: gpt-4o-mini-tts, 13 voices, streaming
71
+ - Deepgram Aura-2 - When: Cost-effective production TTS Note: 40% cheaper than ElevenLabs, 184ms TTFB
72
+
73
+ ### Frameworks
74
+
75
+ - Pipecat - When: Open-source voice agent orchestration Note: Silero VAD, SmartTurn, interruption handling
76
+ - Vapi - When: Managed voice agent platform Note: No infrastructure management
77
+ - Retell AI - When: Low-latency voice agents Note: Best context preservation on interruption
78
+
33
79
  ## Patterns
34
80
 
35
81
  ### Speech-to-Speech Architecture
36
82
 
37
83
  Direct audio-to-audio processing for lowest latency
38
84
 
85
+ **When to use**: Maximum naturalness, emotional preservation, real-time conversation
86
+
87
+ # SPEECH-TO-SPEECH ARCHITECTURE:
88
+
89
+ """
90
+ [User Audio] → [S2S Model] → [Agent Audio]
91
+
92
+ Advantages:
93
+ - Lowest latency (sub-500ms)
94
+ - Preserves emotion, emphasis, accents
95
+ - Most natural conversation flow
96
+
97
+ Disadvantages:
98
+ - Less control over responses
99
+ - Harder to debug/audit
100
+ - Can't easily modify what's said
101
+ """
102
+
103
+ ## OpenAI Realtime API
104
+ """
105
+ import { RealtimeClient } from '@openai/realtime-api-beta';
106
+
107
+ const client = new RealtimeClient({
108
+ apiKey: process.env.OPENAI_API_KEY,
109
+ });
110
+
111
+ // Configure for voice conversation
112
+ client.updateSession({
113
+ modalities: ['text', 'audio'],
114
+ voice: 'alloy',
115
+ input_audio_format: 'pcm16',
116
+ output_audio_format: 'pcm16',
117
+ instructions: `You are a helpful customer service agent.
118
+ Be concise and friendly. If you don't know something,
119
+ say so rather than making things up.`,
120
+ turn_detection: {
121
+ type: 'server_vad', // or 'semantic_vad'
122
+ threshold: 0.5,
123
+ prefix_padding_ms: 300,
124
+ silence_duration_ms: 500,
125
+ },
126
+ });
127
+
128
+ // Handle audio streams
129
+ client.on('conversation.item.input_audio_transcription', (event) => {
130
+ console.log('User said:', event.transcript);
131
+ });
132
+
133
+ client.on('response.audio.delta', (event) => {
134
+ // Stream audio to speaker
135
+ audioPlayer.write(Buffer.from(event.delta, 'base64'));
136
+ });
137
+
138
+ // Send user audio
139
+ client.appendInputAudio(audioBuffer);
140
+ """
141
+
142
+ ## Use Cases:
143
+ - Real-time customer support
144
+ - Voice assistants
145
+ - Interactive voice response (IVR)
146
+ - Live language translation
147
+
39
148
  ### Pipeline Architecture
40
149
 
41
150
  Separate STT → LLM → TTS for maximum control
42
151
 
152
+ **When to use**: Need to know/control exactly what's said, debugging, compliance
153
+
154
+ # PIPELINE ARCHITECTURE:
155
+
156
+ """
157
+ [Audio] → [STT] → [Text] → [LLM] → [Text] → [TTS] → [Audio]
158
+
159
+ Advantages:
160
+ - Full control at each step
161
+ - Can log/audit all text
162
+ - Easier to debug
163
+ - Mix best-in-class components
164
+
165
+ Disadvantages:
166
+ - Higher latency (700-1200ms typical)
167
+ - Loses some emotion/nuance
168
+ - More components to manage
169
+ """
170
+
171
+ ## Production Pipeline Example
172
+ """
173
+ import { Deepgram } from '@deepgram/sdk';
174
+ import { ElevenLabsClient } from 'elevenlabs';
175
+ import OpenAI from 'openai';
176
+
177
+ // Initialize clients
178
+ const deepgram = new Deepgram(process.env.DEEPGRAM_API_KEY);
179
+ const elevenlabs = new ElevenLabsClient();
180
+ const openai = new OpenAI();
181
+
182
+ async function processVoiceInput(audioStream) {
183
+ // 1. Speech-to-Text (Deepgram Nova-3)
184
+ const transcription = await deepgram.transcription.live({
185
+ model: 'nova-3',
186
+ punctuate: true,
187
+ endpointing: 300, // ms of silence before end
188
+ });
189
+
190
+ transcription.on('transcript', async (data) => {
191
+ if (data.is_final && data.speech_final) {
192
+ const userText = data.channel.alternatives[0].transcript;
193
+ console.log('User:', userText);
194
+
195
+ // 2. LLM Processing
196
+ const completion = await openai.chat.completions.create({
197
+ model: 'gpt-4o-mini',
198
+ messages: [
199
+ { role: 'system', content: 'You are a concise voice assistant.' },
200
+ { role: 'user', content: userText }
201
+ ],
202
+ max_tokens: 150, // Keep responses short for voice
203
+ });
204
+
205
+ const agentText = completion.choices[0].message.content;
206
+ console.log('Agent:', agentText);
207
+
208
+ // 3. Text-to-Speech (ElevenLabs)
209
+ const audioStream = await elevenlabs.textToSpeech.stream({
210
+ voice_id: 'voice_id_here',
211
+ text: agentText,
212
+ model_id: 'eleven_flash_v2_5', // Lowest latency
213
+ });
214
+
215
+ // Stream to user
216
+ playAudioStream(audioStream);
217
+ }
218
+ });
219
+
220
+ // Pipe audio to transcription
221
+ audioStream.pipe(transcription);
222
+ }
223
+ """
224
+
225
+ ## Optimization Tips:
226
+ - Start TTS while LLM still generating (streaming)
227
+ - Pre-compute first response segment during user speech
228
+ - Use Flash/turbo models for latency
229
+
43
230
  ### Voice Activity Detection Pattern
44
231
 
45
232
  Detect when user starts/stops speaking
46
233
 
47
- ## Anti-Patterns
234
+ **When to use**: All voice agents need VAD for turn-taking
235
+
236
+ # VOICE ACTIVITY DETECTION (VAD):
237
+
238
+ """
239
+ VAD Types:
240
+ 1. Energy-based: Simple, fast, noise-sensitive
241
+ 2. Model-based: Silero VAD, more accurate
242
+ 3. Semantic VAD: Understands meaning, best for conversation
243
+ """
244
+
245
+ ## Silero VAD (Popular Open Source)
246
+ """
247
+ import { SileroVAD } from '@pipecat-ai/silero-vad';
248
+
249
+ const vad = new SileroVAD({
250
+ threshold: 0.5, // Speech probability threshold
251
+ min_speech_duration: 250, // ms before speech confirmed
252
+ min_silence_duration: 500, // ms of silence = end of turn
253
+ });
254
+
255
+ vad.on('speech_start', () => {
256
+ console.log('User started speaking');
257
+ // Stop any playing TTS (barge-in)
258
+ audioPlayer.stop();
259
+ });
260
+
261
+ vad.on('speech_end', () => {
262
+ console.log('User finished speaking');
263
+ // Trigger response generation
264
+ processTranscript();
265
+ });
266
+
267
+ // Feed audio to VAD
268
+ audioStream.on('data', (chunk) => {
269
+ vad.process(chunk);
270
+ });
271
+ """
272
+
273
+ ## OpenAI Semantic VAD
274
+ """
275
+ // In Realtime API session config
276
+ client.updateSession({
277
+ turn_detection: {
278
+ type: 'semantic_vad', // Uses meaning, not just silence
279
+ // Model waits longer after "ummm..."
280
+ // Responds faster after "Yes, that's correct."
281
+ },
282
+ });
283
+ """
284
+
285
+ ## Barge-In Handling
286
+ """
287
+ // When user interrupts:
288
+ function handleBargeIn() {
289
+ // 1. Stop TTS immediately
290
+ audioPlayer.stop();
291
+
292
+ // 2. Cancel pending LLM generation
293
+ llmController.abort();
294
+
295
+ // 3. Reset state
296
+ conversationState.checkpoint();
297
+
298
+ // 4. Listen to new input
299
+ startListening();
300
+ }
301
+
302
+ // VAD triggers barge-in
303
+ vad.on('speech_start', () => {
304
+ if (audioPlayer.isPlaying) {
305
+ handleBargeIn();
306
+ }
307
+ });
308
+ """
309
+
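The `llmController.abort()` step in the barge-in handler above can be backed by the standard `AbortController`. The following is a minimal sketch, not the skill's implementation: `createTurn` and `fakeGenerate` are hypothetical names, and `fakeGenerate` only simulates a streaming LLM call that checks an abort signal between tokens.

```javascript
// Hypothetical stand-in for a streaming LLM call: emits tokens until aborted.
async function fakeGenerate(tokens, signal, onToken) {
  for (const token of tokens) {
    if (signal.aborted) return 'aborted';
    onToken(token);
    // Simulate pacing between streamed tokens.
    await new Promise((resolve) => setTimeout(resolve, 5));
  }
  return 'completed';
}

// One controller per agent turn, so a barge-in cancels only the current turn.
function createTurn() {
  const controller = new AbortController();
  return {
    signal: controller.signal,
    bargeIn: () => controller.abort(),
  };
}

// Usage: start generating, then interrupt as soon as the user speaks.
async function demoBargeIn() {
  const turn = createTurn();
  const spoken = [];
  const pending = fakeGenerate(['Sure,', 'your', 'balance', 'is'], turn.signal, (t) => spoken.push(t));
  turn.bargeIn();
  const status = await pending;
  return { status, spoken };
}
```

Creating a fresh controller per turn matters because an `AbortController` cannot be reset once aborted.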
310
+ ### Latency Optimization Pattern
311
+
312
+ Achieving <800ms end-to-end response time
313
+
314
+ **When to use**: Production voice agents
315
+
316
+ # LATENCY OPTIMIZATION:
317
+
318
+ """
319
+ Target Metrics:
320
+ - End-to-end: <800ms (ideal: <500ms)
321
+ - Time-to-First-Token (TTFT): <300ms
322
+ - Barge-in response: <200ms
323
+ - Jitter variance: <100ms std dev
324
+ """
325
+
326
+ ## Pipeline Latency Breakdown
327
+ """
328
+ Typical breakdown:
329
+ - VAD processing: 50-100ms
330
+ - STT first result: 150-200ms
331
+ - LLM TTFT: 100-300ms
332
+ - TTS TTFA: 75-200ms
333
+ - Audio buffering: 50-100ms
334
+
335
+ Total: 425-900ms
336
+ """
337
+
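The breakdown above can be checked mechanically. This sketch sums the per-stage ranges quoted in the skill and tests the worst case against the end-to-end target; the object layout and the function names are assumptions made for illustration.

```javascript
// Per-stage latency ranges in milliseconds, copied from the breakdown above.
const stageBudgetMs = {
  vad: [50, 100],
  sttFirstResult: [150, 200],
  llmTtft: [100, 300],
  ttsTtfa: [75, 200],
  audioBuffering: [50, 100],
};

// Sum the best-case and worst-case latency across all stages.
function totalLatencyRange(budget) {
  let best = 0;
  let worst = 0;
  for (const [min, max] of Object.values(budget)) {
    best += min;
    worst += max;
  }
  return { best, worst };
}

// Report whether the worst case still fits the end-to-end target.
function fitsTarget(budget, targetMs) {
  return totalLatencyRange(budget).worst < targetMs;
}
```

With these figures the worst case exceeds the 800ms target, which is why the strategies that follow focus on overlapping stages rather than running them strictly in sequence.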
338
+ ## Optimization Strategies
339
+
340
+ ### 1. Streaming Everything
341
+ """
342
+ // Stream STT results as they come
343
+ stt.on('partial_transcript', (text) => {
344
+ // Start processing before final transcript
345
+ llmPreprocessor.prepare(text);
346
+ });
347
+
348
+ // Stream LLM output to TTS
349
+ const llmStream = await openai.chat.completions.create({
350
+ stream: true,
351
+ // ...
352
+ });
353
+
354
+ for await (const chunk of llmStream) {
355
+ const text = chunk.choices[0]?.delta?.content; if (text) tts.appendText(text); // delta.content can be empty or undefined, e.g. on the final chunk
356
+ }
357
+ """
358
+
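Feeding raw token deltas to a TTS engine tends to produce choppy prosody, so one common approach is to buffer the stream and forward whole sentences. The sketch below assumes a hypothetical `createSentenceBuffer` helper and a caller-supplied `onSentence` callback; the sentence boundary rule (`.`, `!` or `?` followed by whitespace) is a simplification that will split on abbreviations such as "Dr. Smith".

```javascript
// Buffers streamed LLM text and emits complete sentences for TTS.
function createSentenceBuffer(onSentence) {
  let pending = '';
  return {
    push(delta) {
      if (!delta) return; // streamed deltas can be empty or undefined
      pending += delta;
      // Emit every complete sentence: text ending in . ! or ? followed by whitespace.
      let match;
      while ((match = pending.match(/^(.*?[.!?])\s+/s)) !== null) {
        onSentence(match[1].trim());
        pending = pending.slice(match[0].length);
      }
    },
    // Call once the LLM stream ends to emit whatever text is left.
    flush() {
      const rest = pending.trim();
      pending = '';
      if (rest) onSentence(rest);
    },
  };
}
```

In a pipeline, `push` would be called for each streamed chunk and `flush` once the stream closes, with `onSentence` handing each sentence to the TTS client.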
359
+ ### 2. Pre-computation
360
+ """
361
+ // While user is speaking, predict and prepare
362
+ stt.on('partial_transcript', async (text) => {
363
+ // Pre-fetch relevant context
364
+ const context = await retrieveContext(text);
365
+
366
+ // Pre-compute likely first sentence
367
+ const firstSentence = await generateOpener(context);
368
+ });
369
+ """
370
+
371
+ ### 3. Use Low-Latency Models
372
+ """
373
+ // STT: Deepgram Nova-3 (150ms TTFT)
374
+ // LLM: gpt-4o-mini (fastest GPT-4 class)
375
+ // TTS: ElevenLabs Flash (75ms) or Deepgram Aura-2 (184ms)
376
+ """
377
+
378
+ ### 4. Edge Deployment
379
+ """
380
+ // Run inference closer to user
381
+ // - Cloud regions near user
382
+ // - Edge computing for VAD/STT
383
+ // - Persistent WebSocket connections instead of per-request HTTP, for lower overhead
384
+ """
385
+
386
+ ### Conversation Design Pattern
387
+
388
+ Designing natural voice conversations
389
+
390
+ **When to use**: Building voice UX
391
+
392
+ # CONVERSATION DESIGN:
393
+
394
+ ## Voice-First Principles
395
+ """
396
+ Voice is different from text:
397
+ - No undo button - say it right the first time
398
+ - Linear - user can't scroll back
399
+ - Ephemeral - easy to miss information
400
+ - Emotional - tone matters as much as words
401
+ """
402
+
403
+ ## Response Design
404
+ """
405
+ # Keep responses short (10-20 seconds max)
406
+ # Front-load the answer
407
+ # Use signposting for lists
408
+
409
+ Bad: "I found several options. The first is... second is..."
410
+ Good: "I found 3 options. Want me to go through them?"
411
+
412
+ # Confirm understanding
413
+ Bad: "I'll transfer $500 to John."
414
+ Good: "So that's $500 to John Smith. Should I proceed?"
415
+ """
416
+
417
+ ## Prompting for Voice
418
+ """
419
+ system_prompt = '''
420
+ You are a voice assistant. Follow these rules:
421
+
422
+ 1. Be concise - keep responses under 30 words
423
+ 2. Use natural speech - contractions, casual language
424
+ 3. Never use formatting (bullets, numbers in lists)
425
+ 4. Spell out numbers and abbreviations
426
+ 5. End with a question to keep conversation flowing
427
+ 6. If unclear, ask for clarification
428
+ 7. Never say "I'm an AI" unless asked
429
+
430
+ Good: "Got it. I'll set that reminder for three pm. Anything else?"
431
+ Bad: "I have set a reminder for 3:00 PM. Is there anything else I can assist you with today?"
432
+ '''
433
+ """
434
+
435
+ ## Error Recovery
436
+ """
437
+ // Handle recognition errors gracefully
438
+ const errorResponses = {
439
+ no_speech: "I didn't catch that. Could you say it again?",
440
+ unclear: "Sorry, I'm not sure I understood. You said [repeat]. Is that right?",
441
+ timeout: "Still there? I'm here when you're ready.",
442
+ };
443
+
444
+ // Always offer human fallback for complex issues
445
+ if (confidenceScore < 0.6) {
446
+ response = "I want to make sure I get this right. Would you like to speak with a human agent?";
447
+ }
448
+ """
449
+
450
+ ## Sharp Edges
451
+
452
+ ### Response Latency Exceeds 800ms
453
+
454
+ Severity: CRITICAL
455
+
456
+ Situation: Building a voice agent pipeline
457
+
458
+ Symptoms:
459
+ Conversations feel awkward. Users repeat themselves. "Are you
460
+ there?" questions. Users hang up or give up. Low satisfaction
461
+ scores despite correct answers.
462
+
463
+ Why this breaks:
464
+ In human conversation, responses typically arrive within 500ms.
465
+ Anything over 800ms feels like the agent is slow or confused.
466
+ Users lose confidence and patience. Every component adds latency:
467
+ VAD (100ms) + STT (200ms) + LLM (300ms) + TTS (200ms) = 800ms.
468
+
469
+ Recommended fix:
470
+
471
+ # Measure and budget latency for each component:
472
+
473
+ ## Target latencies:
474
+ - VAD processing: <100ms
475
+ - STT time-to-first-token: <200ms
476
+ - LLM time-to-first-token: <300ms
477
+ - TTS time-to-first-audio: <150ms
478
+ - Total end-to-end: <800ms
479
+
480
+ ## Optimization strategies:
481
+
482
+ 1. Use low-latency models:
483
+ - STT: Deepgram Nova-3 (150ms) vs Whisper (500ms+)
484
+ - TTS: ElevenLabs Flash (75ms) vs standard (200ms+)
485
+ - LLM: gpt-4o-mini streaming
486
+
487
+ 2. Stream everything:
488
+ - Don't wait for full STT transcript
489
+ - Stream LLM output to TTS
490
+ - Start audio playback before TTS finishes
491
+
492
+ 3. Pre-compute:
493
+ - While user speaks, prepare context
494
+ - Generate opening phrase in parallel
495
+
496
+ 4. Edge deployment:
497
+ - Run VAD/STT at edge
498
+ - Use nearest cloud region
499
+
500
+ ## Measure continuously:
501
+ Log timestamps at each stage, track P50/P95 latency
502
+
503
+ ### Response Time Variance Disrupts Rhythm
504
+
505
+ Severity: HIGH
506
+
507
+ Situation: Voice agent with inconsistent response times
508
+
509
+ Symptoms:
510
+ Conversations feel unpredictable. User doesn't know when to speak.
511
+ Sometimes agent responds immediately, sometimes after long pause.
512
+ Users talk over agent. Agent talks over users.
513
+
514
+ Why this breaks:
515
+ Jitter (variance in response time) disrupts conversational rhythm
516
+ more than absolute latency. Consistent 800ms feels better than
517
+ alternating 400ms and 1200ms. Users can't adapt to unpredictable
518
+ timing.
519
+
520
+ Recommended fix:
521
+
522
+ # Target jitter metrics:
523
+ - Standard deviation: <100ms
524
+ - P95-P50 gap: <200ms
525
+
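The two jitter targets above can be computed from logged response times. This is a sketch under stated assumptions: the function names are illustrative, the percentile uses the nearest-rank method, and the standard deviation is the population form; a monitoring stack may define either differently.

```javascript
// Nearest-rank percentile over a list of latencies in milliseconds.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Population standard deviation.
function stdDev(samples) {
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  const variance = samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / samples.length;
  return Math.sqrt(variance);
}

// Compare a batch of response times against the targets listed above.
function jitterReport(samples) {
  const sd = stdDev(samples);
  const gap = percentile(samples, 95) - percentile(samples, 50);
  return { stdDev: sd, p95p50Gap: gap, withinTargets: sd < 100 && gap < 200 };
}
```

Applied to the example from the text, a steady 800ms passes both targets, while alternating 400ms and 1200ms fails both even though the mean is the same.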
526
+ ## Reduce jitter sources:
527
+
528
+ 1. Consistent model loading:
529
+ - Keep models warm
530
+ - Pre-load on connection start
531
+
532
+ 2. Buffer audio output:
533
+ - Small buffer (50-100ms) smooths playback
534
+ - Don't start playing until buffer filled
535
+
536
+ 3. Handle LLM variance:
537
+ - gpt-4o-mini more consistent than larger models
538
+ - Set max_tokens to limit long responses
539
+
540
+ 4. Monitor and alert:
541
+ - Track response time distribution
542
+ - Alert on jitter spikes
543
+
544
+ ## Implementation:
545
+ const MIN_RESPONSE_TIME = 400; // ms
546
+
547
+ async function respondWithConsistentTiming(text) {
548
+ const startTime = Date.now();
549
+ const audio = await generateSpeech(text);
550
+
551
+ const elapsed = Date.now() - startTime;
552
+ if (elapsed < MIN_RESPONSE_TIME) {
553
+ await delay(MIN_RESPONSE_TIME - elapsed);
554
+ }
555
+
556
+ playAudio(audio);
557
+ }
558
+
559
+ ### Using Silence Duration for Turn Detection
560
+
561
+ Severity: HIGH
48
562
 
49
- ### Ignoring Latency Budget
563
+ Situation: Detecting when user finishes speaking
50
564
 
51
- ### ❌ Silence-Only Turn Detection
565
+ Symptoms:
566
+ Agent interrupts user mid-thought. Or waits too long after user
567
+ finishes. "Let me think..." triggers premature response. Short
568
+ answers have awkward pause before response.
52
569
 
53
- ### Long Responses
570
+ Why this breaks:
571
+ Simple silence detection (e.g., "end turn after 500ms silence")
572
+ doesn't understand conversation. Humans pause mid-sentence.
573
+ "Yes." needs fast response, "Well, let me think about that..."
574
+ needs patience. Fixed timeout fits neither.
54
575
 
55
- ## ⚠️ Sharp Edges
576
+ Recommended fix:
56
577
 
57
- | Issue | Severity | Solution |
58
- |-------|----------|----------|
59
- | Issue | critical | # Measure and budget latency for each component: |
60
- | Issue | high | # Target jitter metrics: |
61
- | Issue | high | # Use semantic VAD: |
62
- | Issue | high | # Implement barge-in detection: |
63
- | Issue | medium | # Constrain response length in prompts: |
64
- | Issue | medium | # Prompt for spoken format: |
65
- | Issue | medium | # Implement noise handling: |
66
- | Issue | medium | # Mitigate STT errors: |
578
+ # Use semantic VAD:
579
+
580
+ ## OpenAI Semantic VAD:
581
+ client.updateSession({
582
+ turn_detection: {
583
+ type: 'semantic_vad',
584
+ // Waits longer after "umm..."
585
+ // Responds faster after "Yes, that's correct."
586
+ },
587
+ });
588
+
589
+ ## Pipecat SmartTurn:
590
+ const pipeline = new Pipeline({
591
+ vad: new SileroVAD(),
592
+ turnDetection: new SmartTurn(),
593
+ });
594
+
595
+ // SmartTurn considers:
596
+ // - Speech content (complete sentence?)
597
+ // - Prosody (falling intonation?)
598
+ // - Context (question asked?)
599
+
600
+ ## Fallback: Adaptive silence threshold:
601
+ function calculateSilenceThreshold(transcript) {
602
+ const endsWithComplete = transcript.match(/[.!?]$/);
603
+ const hasFillers = transcript.match(/\b(um|uh|like|well)\b/i); // word boundaries avoid matching "summer" or "likely"
604
+
605
+ if (endsWithComplete && !hasFillers) {
606
+ return 300; // Fast response
607
+ } else if (hasFillers) {
608
+ return 1500; // Wait for continuation
609
+ }
610
+ return 700; // Default
611
+ }
612
+
613
+ ### Agent Doesn't Stop When User Interrupts
614
+
615
+ Severity: HIGH
616
+
617
+ Situation: User tries to interrupt agent mid-sentence
618
+
619
+ Symptoms:
620
+ Agent talks over user. User has to wait for agent to finish.
621
+ Frustrating experience. Users give up and abandon call.
622
+ "STOP! STOP!" doesn't work.
623
+
624
+ Why this breaks:
625
+ Without barge-in handling, the TTS plays to completion regardless
626
+ of user input. This violates basic conversational norms - in human
627
+ conversation, we stop when interrupted.
628
+
629
+ Recommended fix:
630
+
631
+ # Implement barge-in detection:
632
+
633
+ ## Basic barge-in:
634
+ vad.on('speech_start', () => {
635
+ if (ttsPlayer.isPlaying) {
636
+ // 1. Stop audio immediately
637
+ ttsPlayer.stop();
638
+
639
+ // 2. Cancel pending TTS generation
640
+ ttsController.abort();
641
+
642
+ // 3. Checkpoint conversation state
643
+ conversationState.save();
644
+
645
+ // 4. Listen to new input
646
+ startTranscription();
647
+ }
648
+ });
649
+
650
+ ## Advanced: Distinguish interruption types:
651
+ vad.on('speech_start', async () => {
652
+ if (!ttsPlayer.isPlaying) return;
653
+
654
+ // Wait 200ms to get first words
655
+ await delay(200);
656
+ const firstWords = getTranscriptSoFar();
657
+
658
+ if (isBackchannel(firstWords)) {
659
+ // "uh-huh", "yeah" - don't interrupt
660
+ return;
661
+ }
662
+
663
+ if (isClarification(firstWords)) {
664
+ // "What?", "Sorry?" - repeat last sentence
665
+ repeatLastSentence();
666
+ } else {
667
+ // Real interruption - stop and listen
668
+ handleFullInterruption();
669
+ }
670
+ });
671
+
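The advanced handler above relies on `isBackchannel` and `isClarification`, which the skill does not define. One possible shape for them is a lookup against short phrase lists, as sketched here; the phrase lists, `normalize`, and `classifyInterruption` are assumptions for illustration and would need tuning per language and domain.

```javascript
// Illustrative phrase lists, not taken from the skill.
const BACKCHANNELS = new Set(['uh-huh', 'mm-hmm', 'yeah', 'yep', 'right', 'okay', 'ok', 'i see']);
const CLARIFICATIONS = new Set(['what', 'sorry', 'pardon', 'come again', 'say that again']);

// Lowercase, drop punctuation except hyphens and apostrophes, collapse whitespace.
function normalize(text) {
  return text.toLowerCase().replace(/[^a-z0-9\s'-]/g, '').replace(/\s+/g, ' ').trim();
}

function isBackchannel(firstWords) {
  return BACKCHANNELS.has(normalize(firstWords));
}

function isClarification(firstWords) {
  return CLARIFICATIONS.has(normalize(firstWords));
}

// Anything that is neither a backchannel nor a clarification counts as a real interruption.
function classifyInterruption(firstWords) {
  if (isBackchannel(firstWords)) return 'backchannel';
  if (isClarification(firstWords)) return 'clarification';
  return 'interruption';
}
```

Exact matching on the whole utterance keeps "yeah" from stopping playback while still treating "yeah, but wait" as a real interruption.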
672
+ ## Response time target:
673
+ - Barge-in response: <200ms
674
+ - User should feel heard immediately
675
+
676
+ ### Generating Text-Length Responses for Voice
677
+
678
+ Severity: MEDIUM
679
+
680
+ Situation: Prompting LLM for voice agent responses
681
+
682
+ Symptoms:
683
+ Agent rambles. Users lose track of information. "Can you repeat
684
+ that?" requests. Users interrupt to ask for shorter version.
685
+ Low comprehension of conveyed information.
686
+
687
+ Why this breaks:
688
+ Text can be scanned and re-read. Voice is linear and ephemeral.
689
+ A 3-paragraph response that works in chat is overwhelming in voice.
690
+ Users can only hold ~7 items in working memory.
691
+
692
+ Recommended fix:
693
+
694
+ # Constrain response length in prompts:
695
+
696
+ system_prompt = '''
697
+ You are a voice assistant. Keep responses UNDER 30 WORDS.
698
+ For complex information, break into chunks and confirm
699
+ understanding between each.
700
+
701
+ Instead of: "Here are the three options. First, you could...
702
+ Second... Third..."
703
+
704
+ Say: "I found 3 options. Want me to go through them?"
705
+
706
+ Never list more than 3 items without pausing for confirmation.
707
+ '''
708
+
709
+ ## Enforce at generation:
710
+ const response = await openai.chat.completions.create({
711
+ max_tokens: 100, // Hard limit
712
+ // ...
713
+ });
714
+
715
+ ## Chunking pattern:
716
+ if (information.length > 3) {
717
+ response = `I have ${information.length} items. Let's go through them one at a time. First: ${information[0]}. Ready for the next?`;
718
+ }
719
+
720
+ ## Progressive disclosure:
721
+ "I found your account. Want the balance, recent transactions, or something else?"
722
+ // Don't dump all info at once
723
+
724
+ ### Using Bullets/Numbers/Markdown in Voice
725
+
726
+ Severity: MEDIUM
727
+
728
+ Situation: Formatting LLM output for voice
729
+
730
+ Symptoms:
731
+ "First bullet point: item one" read aloud. Numbers read as "one
732
+ two three" instead of "one, two, three." Markdown artifacts in
733
+ speech. Robotic, unnatural delivery.
734
+
735
+ Why this breaks:
736
+ TTS models read what they're given. Text formatting intended for
737
+ visual display sounds robotic when read aloud. Users can't "see"
738
+ structure in audio.
739
+
740
+ Recommended fix:
741
+
742
+ # Prompt for spoken format:
743
+
744
+ system_prompt = '''
745
+ Format responses for SPOKEN delivery:
746
+ - No bullet points, numbered lists, or markdown
747
+ - Spell out numbers: "twenty-three" not "23"
748
+ - Spell out abbreviations: "United States" not "US"
749
+ - Use verbal signposting: "There are three things. First..."
750
+ - Never use asterisks, dashes, or special characters
751
+ '''
752
+
753
+ ## Post-processing:
754
+ function prepareForSpeech(text) {
755
+ return text
756
+ // Remove markdown
757
+ .replace(/[*_#`]/g, '')
758
+ // Convert numbers
759
+ .replace(/\d+/g, numToWords)
760
+ // Expand abbreviations
761
+ .replace(/\betc\b/gi, 'et cetera')
762
+ .replace(/\be\.g\./gi, 'for example')
763
+ // Add pauses
764
+ .replace(/\. /g, '... ')
765
+ .replace(/, /g, '... ');
766
+ }
767
+
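`prepareForSpeech` above calls `numToWords` without defining it. A minimal sketch of such a helper follows, assuming English output and whole numbers from 0 to 999; larger values are returned unchanged, and decimals or currency amounts such as `$49.99` are out of scope and would need separate handling (or SSML).

```javascript
const ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine',
  'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen',
  'eighteen', 'nineteen'];
const TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety'];

// Converts a digit string (as passed by String.prototype.replace) to English words.
// Supports 0-999; anything else is returned unchanged.
function numToWords(digits) {
  const n = Number.parseInt(digits, 10);
  if (Number.isNaN(n) || n < 0 || n > 999) return digits;
  if (n < 20) return ONES[n];
  if (n < 100) {
    const rest = n % 10;
    return TENS[Math.floor(n / 10)] + (rest ? '-' + ONES[rest] : '');
  }
  const rest = n % 100;
  return ONES[Math.floor(n / 100)] + ' hundred' + (rest ? ' ' + numToWords(String(rest)) : '');
}
```

Because `replace` passes the matched text as the first argument, the function can be used directly as the replacement callback, as in `text.replace(/\d+/g, numToWords)`.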
768
+ ## SSML for precise control:
769
+ <speak>
770
+ The total is <say-as interpret-as="currency">$49.99</say-as>.
771
+ <break time="500ms"/>
772
+ Want to proceed?
773
+ </speak>
774
+
775
+ ### VAD/STT Fails in Noisy Environments
776
+
777
+ Severity: MEDIUM
778
+
779
+ Situation: Users in cars, cafes, outdoors
780
+
781
+ Symptoms:
782
+ "I didn't catch that" frequently. Background noise triggers
783
+ false starts. Fan/AC causes continuous listening. Car engine
784
+ noise confuses STT.
785
+
786
+ Why this breaks:
787
+ Default VAD thresholds work for quiet environments. Real-world
788
+ usage includes background noise that triggers false positives
789
+ or masks speech, causing false negatives.
790
+
791
+ Recommended fix:
792
+
793
+ # Implement noise handling:
794
+
795
+ ## 1. Noise reduction in STT:
796
+ const transcription = await deepgram.transcription.live({
797
+ model: 'nova-3',
798
+ noise_reduction: true,
799
+ // or
800
+ smart_format: true,
801
+ });
802
+
803
+ ## 2. Adaptive VAD threshold:
804
+ // Measure ambient noise level
805
+ const ambientLevel = measureAmbientNoise(5000); // 5 sec sample
806
+
807
+ vad.setThreshold(ambientLevel * 1.5); // Above ambient
808
+
809
+ ## 3. Confidence filtering:
810
+ stt.on('transcript', (data) => {
811
+ if (data.confidence < 0.7) {
812
+ // Low confidence - probably noise
813
+ askForRepeat();
814
+ return;
815
+ }
816
+ processTranscript(data.transcript);
817
+ });
818
+
819
+ ## 4. Echo cancellation:
820
+ // Prevent agent's voice from being transcribed
821
+ const echoCanceller = new EchoCanceller();
822
+ echoCanceller.reference(ttsOutput);
823
+ const cleanedAudio = echoCanceller.process(userAudio);
824
+
825
+ ### STT Produces Incorrect or Hallucinated Text
826
+
827
+ Severity: MEDIUM
828
+
829
+ Situation: Processing unclear or accented speech
830
+
831
+ Symptoms:
832
+ Agent responds to something user didn't say. Names consistently
833
+ wrong. Technical terms misheard. "I said X, not Y" frustration.
834
+
835
+ Why this breaks:
836
+ STT models can hallucinate, especially on proper nouns, technical
837
+ terms, or accented speech. These errors propagate through the
838
+ pipeline and produce nonsensical responses.
839
+
840
+ Recommended fix:
841
+
842
+ # Mitigate STT errors:
843
+
844
+ ## 1. Use keywords/biasing:
845
+ const transcription = await deepgram.transcription.live({
846
+ keywords: ['Acme Corp', 'ProductName', 'John Smith'],
847
+ keyword_boost: 'high', // value as given here; check your provider's docs for accepted settings
848
+ });
849
+
850
+ ## 2. Confirmation for critical info:
851
+ if (containsNameOrNumber(transcript)) {
852
+ response = `I heard "${transcript}". Is that correct?`;
853
+ }
854
+
855
+ ## 3. Confidence-based fallback:
856
+ if (confidence < 0.8) {
857
+ response = `I think you said "${transcript}". Did I get that right?`;
858
+ }
859
+
860
+ ## 4. Multiple hypothesis handling:
861
+ // Some STT APIs return n-best list
862
+ const alternatives = transcription.alternatives;
863
+ if (alternatives.length > 1 && alternatives[0].confidence - alternatives[1].confidence < 0.1) {
864
+ // Ambiguous - ask for clarification
865
+ }
866
+
867
+ ## 5. Error correction patterns:
868
+ const promptPattern = `
869
+ User may correct previous mistakes. If they say "no, I said X"
870
+ or "not Y, Z", update your understanding accordingly.
871
+ `;
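
Steps 3 and 4 above can be combined into one decision function. The sketch below assumes each alternative is an object with `transcript` and `confidence` fields; `decideOnAlternatives` and the action names are hypothetical, and the 0.8 and 0.1 cut-offs are the ones used in the snippets above.

```javascript
// Hypothetical helper: decide what to do with an n-best list from STT.
// Each alternative is assumed to look like { transcript, confidence }.
function decideOnAlternatives(alternatives, { minConfidence = 0.8, minMargin = 0.1 } = {}) {
  if (!alternatives || alternatives.length === 0) {
    return { action: 'repeat', transcript: '' }; // nothing usable was heard
  }

  // Do not rely on the provider's ordering; rank by confidence explicitly.
  const ranked = [...alternatives].sort((a, b) => b.confidence - a.confidence);
  const [best, runnerUp] = ranked;

  // Two hypotheses that are nearly tied: ask which one was meant.
  if (runnerUp && best.confidence - runnerUp.confidence < minMargin) {
    return { action: 'clarify', transcript: best.transcript, other: runnerUp.transcript };
  }

  // A clear winner, but a weak one: read it back for confirmation.
  if (best.confidence < minConfidence) {
    return { action: 'confirm', transcript: best.transcript };
  }

  return { action: 'accept', transcript: best.transcript };
}
```

Returning an action rather than a finished sentence keeps the wording of the follow-up question in the prompt layer, where it can be tuned for voice.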
872
+
873
+ ## Validation Checks
874
+
875
+ ### Missing Latency Measurement
876
+
877
+ Severity: ERROR
878
+
879
+ Voice agents must track latency at each stage
880
+
881
+ Message: Voice pipeline without latency tracking. Add timestamps at each stage to measure performance.
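
One way to satisfy this check is a small tracker that records a timestamp per stage and reports the gaps between them. This is a sketch under assumptions: `LatencyTracker` and the stage names are hypothetical, and the clock is injectable so the class can be tested without real audio.

```javascript
// Hypothetical helper: record a timestamp per pipeline stage and report gaps.
class LatencyTracker {
  // `now` is injectable for tests; defaults to the high-resolution clock.
  constructor(now = () => performance.now()) {
    this.now = now;
    this.marks = [];
  }

  // Call once when each stage completes, e.g. mark('stt_final').
  mark(stage) {
    this.marks.push({ stage, at: this.now() });
  }

  // Milliseconds between consecutive marks, plus the end-to-end total.
  report() {
    const stages = {};
    for (let i = 1; i < this.marks.length; i++) {
      const prev = this.marks[i - 1];
      const cur = this.marks[i];
      stages[`${prev.stage} -> ${cur.stage}`] = cur.at - prev.at;
    }
    const totalMs = this.marks.length > 1
      ? this.marks[this.marks.length - 1].at - this.marks[0].at
      : 0;
    return { stages, totalMs };
  }
}

// Usage: one tracker per conversational turn.
const turn = new LatencyTracker();
turn.mark('speech_end');
turn.mark('stt_final');
turn.mark('llm_first_token');
turn.mark('tts_first_audio');
console.log(turn.report());
```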
882
+
883
+ ### Using Batch STT Instead of Streaming
884
+
885
+ Severity: WARNING
886
+
887
+ Streaming STT returns partial transcripts while the user is still speaking, which reduces latency
888
+
889
+ Message: Using batch transcription. Consider streaming for lower latency in voice agents.
890
+
891
+ ### TTS Without Streaming Output
892
+
893
+ Severity: WARNING
894
+
895
+ Streaming TTS reduces time to first audio
896
+
897
+ Message: TTS without streaming. Stream audio to reduce time to first audio.
898
+
899
+ ### Hardcoded VAD Silence Threshold
900
+
901
+ Severity: WARNING
902
+
903
+ Fixed silence thresholds don't adapt to conversation
904
+
905
+ Message: Fixed silence threshold. Consider semantic VAD or adaptive thresholds for better turn-taking.
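
A rough illustration of an adaptive threshold: choose the end-of-turn silence timeout from the partial transcript instead of using one fixed value. This is a heuristic sketch, not semantic VAD; `silenceTimeoutMs`, the word list, and the millisecond values are all assumptions to tune against real conversations.

```javascript
// Words that suggest the user is mid-thought when they trail off (assumed list).
const TRAILING_HESITATION = /\b(and|but|so|or|because|um|uh|the|a|to)$/i;

// Hypothetical helper: how long to wait in silence before ending the turn.
function silenceTimeoutMs(partialTranscript, { short = 400, normal = 700, long = 1500 } = {}) {
  const text = partialTranscript.trim();
  if (text === '') return normal;                   // nothing heard yet
  if (/[.?!]$/.test(text)) return short;            // looks like a finished sentence
  if (TRAILING_HESITATION.test(text)) return long;  // likely mid-thought, wait longer
  return normal;
}

console.log(silenceTimeoutMs('What time is it?')); // shortest timeout
console.log(silenceTimeoutMs('I would like to'));  // longest timeout
```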
906
+
907
+ ### Missing Barge-In Handling
908
+
909
+ Severity: WARNING
910
+
911
+ Voice agents should stop when user interrupts
912
+
913
+ Message: VAD without barge-in handling. Stop TTS when user starts speaking.
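
Barge-in handling comes down to one rule: when the VAD reports speech while the agent is talking, stop playback immediately. A minimal sketch, assuming a hypothetical `player` object that exposes `play(audio)` and `stop()`; `BargeInController` is not part of any SDK.

```javascript
// Hypothetical controller; `player` is any object with play(audio) and stop().
class BargeInController {
  constructor(player) {
    this.player = player;
    this.speaking = false;
    this.interruptions = 0;
  }

  // Call when TTS audio starts playing.
  startSpeaking(audio) {
    this.speaking = true;
    this.player.play(audio);
  }

  // Call when playback finishes on its own.
  finishedSpeaking() {
    this.speaking = false;
  }

  // Wire this to the VAD "speech started" event.
  // Returns true if the agent was interrupted.
  onUserSpeechStart() {
    if (!this.speaking) return false;
    this.player.stop();
    this.speaking = false;
    this.interruptions += 1;
    return true;
  }
}
```

In a full pipeline the same event would typically also cancel any in-flight LLM or TTS request, so that stale audio does not start playing after the user finishes.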
914
+
915
+ ### Voice Prompt Without Length Constraints
916
+
917
+ Severity: WARNING
918
+
919
+ Voice prompts should constrain response length
920
+
921
+ Message: Voice prompt without length constraints. Add 'Keep responses under 30 words' to system prompt.
922
+
923
+ ### Markdown Formatting Sent to TTS
924
+
925
+ Severity: WARNING
926
+
927
+ Markdown will be read literally by TTS
928
+
929
+ Message: Check for markdown in TTS input. Strip formatting before sending to TTS.
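
A minimal sketch of stripping formatting before TTS, assuming the LLM output uses common markdown only. `stripMarkdownForTts` is a hypothetical helper; the regular expressions cover frequent cases (code, links, headings, lists, emphasis) and are not a full markdown parser.

```javascript
// Hypothetical helper: remove common markdown so TTS does not read symbols aloud.
function stripMarkdownForTts(text) {
  return text
    .replace(/`{3}[\s\S]*?`{3}/g, ' ')          // drop fenced code blocks entirely
    .replace(/`([^`]+)`/g, '$1')                // inline code -> plain text
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, '$1')   // images -> alt text
    .replace(/\[([^\]]+)\]\([^)]*\)/g, '$1')    // links -> link text
    .replace(/^\s{0,3}#{1,6}\s+/gm, '')         // heading markers
    .replace(/^\s*[-*+]\s+/gm, '')              // bullet markers
    .replace(/^\s*\d+\.\s+/gm, '')              // numbered list markers
    .replace(/(\*\*|__)(.*?)\1/g, '$2')         // bold
    .replace(/\*(.*?)\*/g, '$1')                // italic (asterisks only)
    .replace(/\s+/g, ' ')                       // collapse whitespace and newlines
    .trim();
}

console.log(stripMarkdownForTts('## Total\n\n**$49.99** for the [Pro plan](https://example.com).'));
```

Underscore italics are left alone on purpose so that identifiers such as `snake_case` are not mangled.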
930
+
931
+ ### STT Without Error Handling
932
+
933
+ Severity: WARNING
934
+
935
+ STT can fail or return low confidence
936
+
937
+ Message: STT without error handling. Check confidence scores and handle failures.
938
+
939
+ ### WebSocket Without Reconnection
940
+
941
+ Severity: WARNING
942
+
943
+ Realtime APIs need reconnection handling
944
+
945
+ Message: Realtime connection without reconnection logic. Handle disconnects gracefully.
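
Reconnection logic usually pairs a retry loop with exponential backoff and jitter. The sketch below is provider-agnostic and rests on assumptions: `backoffDelayMs` and `connectWithRetry` are hypothetical names, `connect` is any async function that opens the realtime connection, and the delay values are placeholders.

```javascript
// Hypothetical helper: exponential backoff with "equal jitter".
// Half of the delay is fixed and half is random, so clients do not retry in lockstep.
function backoffDelayMs(attempt, { baseMs = 500, maxMs = 30000, random = Math.random } = {}) {
  const capped = Math.min(maxMs, baseMs * 2 ** attempt);
  return capped / 2 + random() * (capped / 2);
}

// Hypothetical helper: retry an async `connect` function until it succeeds
// or `maxAttempts` is reached, sleeping between attempts.
async function connectWithRetry(connect, options = {}) {
  const {
    maxAttempts = 5,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    ...backoff
  } = options;

  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, backoff));
      }
    }
  }
  throw lastError;
}
```

Calling `connectWithRetry` again from the connection's close handler covers mid-session drops; the conversation state to restore after reconnecting is specific to each provider.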
946
+
947
+ ### Missing Noise Handling
948
+
949
+ Severity: INFO
950
+
951
+ Real-world audio includes background noise
952
+
953
+ Message: Consider adding noise handling for real-world audio quality.
954
+
955
+ ## Collaboration
956
+
957
+ ### Delegation Triggers
958
+
959
+ - user needs phone/telephony integration -> backend (Twilio, Vonage, SIP integration)
960
+ - user needs LLM optimization -> llm-architect (Model selection, prompting, fine-tuning)
961
+ - user needs tools for voice agent -> agent-tool-builder (Tool design for voice context)
962
+ - user needs multi-agent voice system -> multi-agent-orchestration (Voice agents working together)
963
+ - user needs accessibility compliance -> accessibility-specialist (Voice interface accessibility)
67
964
 
68
965
  ## Related Skills
69
966
 
70
967
  Works well with: `agent-tool-builder`, `multi-agent-orchestration`, `llm-architect`, `backend`
71
968
 
72
969
  ## When to Use
73
- This skill is applicable to execute the workflow or actions described in the overview.
970
+
971
+ - User mentions or implies: voice agent
972
+ - User mentions or implies: speech to text
973
+ - User mentions or implies: text to speech
974
+ - User mentions or implies: whisper
975
+ - User mentions or implies: elevenlabs
976
+ - User mentions or implies: deepgram
977
+ - User mentions or implies: realtime api
978
+ - User mentions or implies: voice assistant
979
+ - User mentions or implies: voice ai
980
+ - User mentions or implies: conversational ai
981
+ - User mentions or implies: tts
982
+ - User mentions or implies: stt
983
+ - User mentions or implies: asr