voicesmith-mcp 1.0.10 → 1.0.12

package/bin/utils.js CHANGED
@@ -325,14 +325,17 @@ function generateVoiceRules(mainAgentName) {
     content = fs.readFileSync(tplPath, "utf8");
     content = content.replace(/\{\{MAIN_AGENT\}\}/g, mainAgentName);
   } else {
-    // Fallback inline template
+    // Fallback inline template (mirrors templates/voice-rules.md)
     content = `# Voice Behavior Rules (VoiceSmith MCP)
 
 You have access to voice tools via the VoiceSmith MCP server.
 
 ## Your Voice
-- You are **${mainAgentName}**. Always call \`speak\` with \`name: "${mainAgentName}"\` this is your voice.
-- Do not use "${mainAgentName}" for sub-agents. Each agent needs its own unique name.
+- Your default voice name is **${mainAgentName}**, but your actual assigned name may differ if another session claimed it first.
+- **IMPORTANT:** If your session context says "Your assigned voice for this session is: [Name]", use THAT name — not "${mainAgentName}". This is your real identity for this session.
+- On your first response, speak a brief intro using your assigned name: "[Name] here, ready to go."
+- Do not use your assigned name for sub-agents. Each agent needs its own unique name.
+- Tone: Be conversational and natural. Match the user's energy — casual if they're casual, focused if they're focused.
 
 ## Voice Switching
 - If the user asks to switch to a voice and \`speak\` returns \`"error": "name_occupied"\`, tell the user that voice is occupied by another session.
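The hunk above shows `generateVoiceRules` reading a template file, substituting every `{{MAIN_AGENT}}` placeholder, and otherwise falling back to an inline string. A minimal sketch of that pattern follows. The helper names `fillTemplate` and `loadRules` are hypothetical, and routing the fallback through the same placeholder substitution is an assumption made for illustration (the package's fallback interpolates `${mainAgentName}` directly).

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: replace every {{MAIN_AGENT}} placeholder in a template.
// A function replacement is used so that "$"-sequences in the name (such as
// "$&") are inserted literally instead of being treated as replacement patterns.
function fillTemplate(template, mainAgentName) {
  return template.replace(/\{\{MAIN_AGENT\}\}/g, () => mainAgentName);
}

// Hypothetical loader: prefer the template file when it exists, otherwise use
// the inline fallback. Both paths go through fillTemplate (an assumption).
function loadRules(tplPath, mainAgentName, fallback) {
  if (fs.existsSync(tplPath)) {
    return fillTemplate(fs.readFileSync(tplPath, "utf8"), mainAgentName);
  }
  return fillTemplate(fallback, mainAgentName);
}

// Usage: no file at this path, so the inline fallback is rendered.
const missing = path.join(os.tmpdir(), "no-such-voice-rules-template.md");
console.log(loadRules(missing, "Nova", "You are **{{MAIN_AGENT}}**."));
// prints "You are **Nova**."
```

Keeping the substitution in one helper means the file-backed and inline templates cannot drift in how they treat the agent name.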
@@ -340,32 +343,38 @@ You have access to voice tools via the VoiceSmith MCP server.
 - Do NOT silently fall back to a different voice.
 
 ## Speaking
-- Speak twice per response:
-  1. **Opening** — Brief acknowledgment when starting work. Use \`block: false\`.
-  2. **Closing** Summary when done. Use \`block: true\`. Never skip this.
-- Keep spoken messages to 1-2 sentences. Write details, speak summaries.
-- Do not speak code, file paths, or long lists aloud.
-- Speak at transitions only: start, finish, error, question.
+- **Opening** Only speak at the start when you have something meaningful to say (e.g., clarifying your approach, flagging an issue). Do NOT speak filler acknowledgments like "Let me look into that." Use \`block: false\` when you do speak an opening.
+- **Closing** — Always speak a summary when done. Use \`block: true\`. Never skip the closing.
+- **Questions requiring user input → use \`speak_then_listen\` as your closing.** If the user literally cannot continue without providing input (e.g., choosing between options, confirming a destructive action, providing missing info), use \`speak_then_listen\`. If you can reasonably continue without their answer, use regular \`speak\`.
+- Keep spoken output brief — prefer 1-2 sentences, never exceed 3. Write details, speak summaries. No code or paths aloud.
+
+## Speed Preferences
+- The \`speak\` tool accepts a \`speed\` parameter (default 1.0). Values < 1.0 are slower, > 1.0 are faster.
+- If the user asks to speak slower or faster, adjust the speed and remember their preference for the session.
 
 ## Listening
-- When asking a short-answer question, use \`speak_then_listen\`.
-- If listen times out or is cancelled, fall back to text. Do not retry.
+- Use \`speak_then_listen\` whenever you need user input — it combines speaking and opening the mic in one call.
+- If \`listen\` returns timeout or cancelled, fall back to requesting text input. Do not retry \`listen\`.
 
 ## Sub-Agents
-- Before assigning a name to a sub-agent, call \`get_voice_registry\` to see which names are taken and which voices are available.
-- Pick a name that matches an available Kokoro voice (e.g., af_nova → "Nova", am_fenrir → "Fenrir").
+- Pick voice names matching available Kokoro voices (the voice ID suffix is the name e.g., af_nova → "Nova", am_fenrir → "Fenrir").
 - Each sub-agent must use its own unique name. Never reuse "${mainAgentName}".
-- On handoffs, both agents speak: outgoing announces, incoming acknowledges.
+- On handoffs, both agents speak: the outgoing agent announces the handoff, the incoming agent acknowledges before starting.
+
+## Error Handling
+- If \`speak\` or \`speak_then_listen\` fails, fall back to text silently. Do not retry.
+- If \`listen\` times out, fall back to text. Do not retry.
 
 ## Fallback
-- If voice tools are not available, respond in text only.
-- If muted, \`speak\` succeeds silently. Do not call \`unmute\` unless asked.`;
+- If voice tools are not available, respond in text only. Do not mention voice capabilities.
+- If muted, \`speak\` succeeds silently. Do not call \`unmute\` unless the user asks.`;
   }
 
   return content;
 }
 
 function generateCursorRule(mainAgentName) {
+  const rules = generateVoiceRules(mainAgentName);
   return `---
 description: Voice interaction rules for VoiceSmith MCP server
 globs:
@@ -373,83 +382,16 @@ alwaysApply: true
 ---
 
 ${VOICE_RULES_SENTINEL}
-# Voice Behavior Rules (VoiceSmith MCP)
-
-You have access to voice tools via the VoiceSmith MCP server.
-
-## Your Voice
-- Your default voice name is **${mainAgentName}**, but your actual assigned name may differ if another session claimed it first.
-- **IMPORTANT:** If your session context says "Your assigned voice for this session is: [Name]", use THAT name — not "${mainAgentName}". This is your real identity for this session.
-- On your first response, speak a brief intro using your assigned name: "[Name] here, ready to go."
-- Do not use your assigned name for sub-agents.
-
-## Voice Switching
-- If the user asks to switch to a voice and \`speak\` returns \`"error": "name_occupied"\`, tell the user that voice is occupied by another session.
-- Then call \`get_voice_registry\` and show the user which voices are available to pick from.
-- Do NOT silently fall back to a different voice.
-
-## Speaking
-- Speak twice per response:
-  1. **Opening** — Brief acknowledgment. Use \`block: false\`.
-  2. **Closing** — Summary when done. Use \`block: true\`. Never skip.
-- **Questions that need user input → use \`speak_then_listen\` as your closing voice.** If your response asks the user to make a decision, provide information, or confirm something (e.g., "which approach?", "should I?", "want me to?", "does this look right?"), your closing voice MUST be \`speak_then_listen\` — not regular \`speak\`. This way the mic opens right after you ask.
-- Rhetorical wrap-ups ("What's next?", "Standing by.") do NOT require listen — use regular \`speak\` for those.
-- 1-2 sentences max. Write details, speak summaries. No code or paths aloud.
-- Speak at transitions: start, finish, error, question.
-
-## Listening
-- Use \`speak_then_listen\` whenever you need user input — it is your closing voice AND listen in one call.
-- Fall back to text on timeout. Do not retry listen.
-
-## Sub-Agents
-- Call \`get_voice_registry\` to find available voice names before assigning.
-- Pick names matching available Kokoro voices (e.g., af_nova → "Nova").
-- Never reuse "${mainAgentName}". On handoffs, both agents speak.
-
-## Fallback
-- No voice tools? Text only. Muted? Don't call \`unmute\` unless asked.
+${rules}
 `;
 }
 
 function generateAppendBlock(mainAgentName) {
-  // Block to append to CLAUDE.md or AGENTS.md
+  // Block to append to CLAUDE.md or AGENTS.md — reads from the template
+  const rules = generateVoiceRules(mainAgentName);
   return `
 ${VOICE_RULES_SENTINEL}
-# Voice Behavior Rules (VoiceSmith MCP)
-
-You have access to voice tools via the VoiceSmith MCP server.
-
-## Your Voice
-- Your default voice name is **${mainAgentName}**, but your actual assigned name may differ if another session claimed it first.
-- **IMPORTANT:** If your session context says "Your assigned voice for this session is: [Name]", use THAT name — not "${mainAgentName}". This is your real identity for this session.
-- On your first response, speak a brief intro using your assigned name: "[Name] here, ready to go."
-- Do not use your assigned name for sub-agents.
-
-## Voice Switching
-- If the user asks to switch to a voice and \`speak\` returns \`"error": "name_occupied"\`, tell the user that voice is occupied by another session.
-- Then call \`get_voice_registry\` and show the user which voices are available to pick from.
-- Do NOT silently fall back to a different voice.
-
-## Speaking
-- Speak twice per response:
-  1. **Opening** — Brief acknowledgment. Use \`block: false\`.
-  2. **Closing** — Summary when done. Use \`block: true\`. Never skip.
-- **Questions that need user input → use \`speak_then_listen\` as your closing voice.** If your response asks the user to make a decision, provide information, or confirm something (e.g., "which approach?", "should I?", "want me to?", "does this look right?"), your closing voice MUST be \`speak_then_listen\` — not regular \`speak\`. This way the mic opens right after you ask.
-- Rhetorical wrap-ups ("What's next?", "Standing by.") do NOT require listen — use regular \`speak\` for those.
-- 1-2 sentences max. Write details, speak summaries. No code or paths aloud.
-- Speak at transitions: start, finish, error, question.
-
-## Listening
-- Use \`speak_then_listen\` whenever you need user input — it is your closing voice AND listen in one call.
-- Fall back to text on timeout. Do not retry listen.
-
-## Sub-Agents
-- Call \`get_voice_registry\` to find available voice names before assigning.
-- Pick names matching available Kokoro voices (e.g., af_nova → "Nova").
-- Never reuse "${mainAgentName}". On handoffs, both agents speak.
-
-## Fallback
-- No voice tools? Text only. Muted? Don't call \`unmute\` unless asked.
+${rules}
 `;
 }
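The hunk above replaces two hand-maintained copies of the rules text with calls to `generateVoiceRules`, so the Cursor rule and the appended block can no longer drift apart. A minimal sketch of that single-source arrangement follows, under stated assumptions: the sentinel value is a placeholder (the real `VOICE_RULES_SENTINEL` is not shown in the diff), and the `generateVoiceRules` body is abbreviated to two lines.

```javascript
// Placeholder value: the real VOICE_RULES_SENTINEL is not visible in the diff.
const VOICE_RULES_SENTINEL = "<!-- voice-rules-sentinel (hypothetical) -->";

// Abbreviated stand-in for the real generateVoiceRules.
function generateVoiceRules(mainAgentName) {
  return [
    "# Voice Behavior Rules (VoiceSmith MCP)",
    `- Your default voice name is **${mainAgentName}**.`,
  ].join("\n");
}

// Cursor rule: front matter, sentinel, then the shared rules text.
function generateCursorRule(mainAgentName) {
  const rules = generateVoiceRules(mainAgentName);
  return [
    "---",
    "description: Voice interaction rules for VoiceSmith MCP server",
    "globs:",
    "alwaysApply: true",
    "---",
    "",
    VOICE_RULES_SENTINEL,
    rules,
    "",
  ].join("\n");
}

// Block appended to CLAUDE.md or AGENTS.md: sentinel, then the same rules text.
function generateAppendBlock(mainAgentName) {
  const rules = generateVoiceRules(mainAgentName);
  return ["", VOICE_RULES_SENTINEL, rules, ""].join("\n");
}

// Both outputs embed exactly the same rules text.
const shared = generateVoiceRules("Nova");
console.log(generateCursorRule("Nova").includes(shared)); // prints true
console.log(generateAppendBlock("Nova").includes(shared)); // prints true
```

With this shape, a wording change to the rules is made in one place (the template or its fallback) and reaches every generated file.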
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "voicesmith-mcp",
-  "version": "1.0.10",
+  "version": "1.0.12",
   "description": "Local AI voice for coding assistants — TTS & STT via MCP. Kokoro ONNX + faster-whisper, fully offline.",
   "bin": {
     "voicesmith-mcp": "bin/cli.js"
@@ -61,6 +61,11 @@ class MicCapture:
         silence_duration = 0.0
         loop = asyncio.get_event_loop()
 
+        # Reset VAD state — the LSTM hidden state and context window must
+        # be cleared between recordings to avoid stale state from previous
+        # audio affecting speech detection.
+        vad.reset()
+
         stream = None
         try:
             stream = sd.InputStream(
@@ -73,6 +78,17 @@ class MicCapture:
             stream.start()
             logger.info("Microphone recording started")
 
+            # Discard the first ~200ms of audio to avoid picking up residual
+            # speaker output (Tink sound or TTS playback that just finished).
+            # This prevents VAD from detecting speaker bleed as "speech" and
+            # then cutting off when the bleed stops.
+            flush_chunks = int(0.2 * self._sample_rate / 512)  # ~6 chunks
+            for _ in range(flush_chunks):
+                try:
+                    self._audio_queue.get(timeout=0.1)
+                except queue.Empty:
+                    break
+
             start_time = asyncio.get_event_loop().time()
 
             while not self._stop_flag: