voicesmith-mcp 1.0.11 → 1.0.13

package/bin/utils.js CHANGED
@@ -345,7 +345,7 @@ You have access to voice tools via the VoiceSmith MCP server.
  ## Speaking
  - **Opening** — Only speak at the start when you have something meaningful to say (e.g., clarifying your approach, flagging an issue). Do NOT speak filler acknowledgments like "Let me look into that." Use \`block: false\` when you do speak an opening.
  - **Closing** — Always speak a summary when done. Use \`block: true\`. Never skip the closing.
- - **Questions requiring user input → use \`speak_then_listen\` as your closing.** If the user literally cannot continue without providing input (e.g., choosing between options, confirming a destructive action, providing missing info), use \`speak_then_listen\`. If you can reasonably continue without their answer, use regular \`speak\`.
+ - **Questions → use \`speak_then_listen\`.** If your closing statement ends with a question directed at the user (ends with \`?\`), use \`speak_then_listen\` not regular \`speak\`. The only exceptions are rhetorical wrap-ups like "Standing by." or "What's next?" where you don't actually need an answer.
  - Keep spoken output brief — prefer 1-2 sentences, never exceed 3. Write details, speak summaries. No code or paths aloud.
 
  ## Speed Preferences
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "voicesmith-mcp",
-   "version": "1.0.11",
+   "version": "1.0.13",
    "description": "Local AI voice for coding assistants — TTS & STT via MCP. Kokoro ONNX + faster-whisper, fully offline.",
    "bin": {
      "voicesmith-mcp": "bin/cli.js"
@@ -61,6 +61,11 @@ class MicCapture:
  silence_duration = 0.0
  loop = asyncio.get_event_loop()
 
+ # Reset VAD state — the LSTM hidden state and context window must
+ # be cleared between recordings to avoid stale state from previous
+ # audio affecting speech detection.
+ vad.reset()
+
  stream = None
  try:
      stream = sd.InputStream(
@@ -73,6 +78,17 @@ class MicCapture:
  stream.start()
  logger.info("Microphone recording started")
 
+ # Discard the first ~200ms of audio to avoid picking up residual
+ # speaker output (Tink sound or TTS playback that just finished).
+ # This prevents VAD from detecting speaker bleed as "speech" and
+ # then cutting off when the bleed stops.
+ flush_chunks = int(0.2 * self._sample_rate / 512) # ~6 chunks
+ for _ in range(flush_chunks):
+     try:
+         self._audio_queue.get(timeout=0.1)
+     except queue.Empty:
+         break
+
  start_time = asyncio.get_event_loop().time()
 
  while not self._stop_flag:
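The flush step added in this hunk can be tried in isolation. The following is a minimal sketch, not the package's code: the function name `flush_initial_audio` and its parameters are hypothetical, and the 16 kHz sample rate used in the usage note is an assumption chosen because it matches the hunk's "~6 chunks" comment. Only the 512-sample chunk size, the 0.2 s window, and the 0.1 s timeout come from the diff.

```python
import queue

CHUNK_SIZE = 512  # samples per queued chunk, taken from the divisor in the diff


def flush_initial_audio(audio_queue, sample_rate, flush_seconds=0.2, timeout=0.1):
    """Discard roughly `flush_seconds` of queued audio; return the chunk count dropped.

    Hypothetical helper mirroring the loop in the hunk above.
    """
    flush_chunks = int(flush_seconds * sample_rate / CHUNK_SIZE)
    dropped = 0
    for _ in range(flush_chunks):
        try:
            audio_queue.get(timeout=timeout)
        except queue.Empty:
            # The stream has not produced enough audio yet; stop early
            # rather than wait out every remaining timeout.
            break
        dropped += 1
    return dropped
```

At an assumed 16000 Hz, `int(0.2 * 16000 / 512)` is 6, so a queue holding ten chunks keeps four after the call. An empty queue returns 0 after a single timeout, so the flush adds at most about 0.1 s of delay in that case.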
@@ -17,7 +17,7 @@ You have access to voice tools via the VoiceSmith MCP server.
  ## Speaking
  - **Opening** — Only speak at the start when you have something meaningful to say (e.g., clarifying your approach, flagging an issue). Do NOT speak filler acknowledgments like "Let me look into that." Use `block: false` when you do speak an opening.
  - **Closing** — Always speak a summary when done. Use `block: true`. Never skip the closing.
- - **Questions requiring user input → use `speak_then_listen` as your closing.** If the user literally cannot continue without providing input (e.g., choosing between options, confirming a destructive action, providing missing info), use `speak_then_listen`. If you can reasonably continue without their answer, use regular `speak`.
+ - **Questions → use `speak_then_listen`.** If your closing statement ends with a question directed at the user (ends with `?`), use `speak_then_listen` not regular `speak`. The only exceptions are rhetorical wrap-ups like "Standing by." or "What's next?" where you don't actually need an answer.
  - Keep spoken output brief — prefer 1-2 sentences, never exceed 3. Write details, speak summaries. No code or paths aloud.
 
  ## Speed Preferences
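The revised "Questions" rule is written for the assistant, not enforced in code, but it reads as a simple decision. A rough sketch of that reading, assuming a hypothetical `choose_closing_tool` function and a wrap-up list holding only the two phrases the prompt quotes:

```python
# Rhetorical wrap-ups quoted in the prompt text; the real set is open-ended.
RHETORICAL_WRAP_UPS = ("standing by.", "what's next?")


def choose_closing_tool(closing_text):
    """Return the tool name the revised rule suggests for a closing statement.

    Hypothetical helper: the package itself leaves this choice to the model.
    """
    text = closing_text.strip().lower()
    if text.endswith(RHETORICAL_WRAP_UPS):
        return "speak"  # no answer is actually needed
    if text.endswith("?"):
        return "speak_then_listen"  # a real question directed at the user
    return "speak"
```

Under this reading, "Should I delete the branch?" maps to `speak_then_listen`, while "All tests pass. What's next?" stays on `speak`.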