npm - voice-mcp-server - Versions diffs - 0.1.25 → 0.3.0 - Mend

voice-mcp-server 0.1.25 → 0.3.0

Files changed (28)

package/README.md +31 -3
package/config/config.yaml +1 -1
package/config/vad/ptt_vad.yaml +1 -1
package/package.json +5 -1
package/requirements.txt +1 -0
package/src/adapters_real/elevenlabs_speaker.py +1 -1
package/src/adapters_real/kokoro_speaker.py +7 -6
package/src/adapters_real/live_mic.py +15 -4
package/src/adapters_real/ptt_sidecar.swift +156 -0
package/src/adapters_real/ptt_vad.py +143 -25
package/src/adapters_real/whisper_stt.py +5 -4
package/src/daemon/audio_server.py +47 -13
package/src/logger.py +29 -0
package/src/mcp_server.py +113 -65
package/src/simulation/engine.py +12 -1
package/src/simulation/tests/test_abort_daemon.py +109 -0
package/src/simulation/tests/test_mcp_cancellation.py +83 -0
package/src/simulation/tests/test_ptt_vad.py +81 -0
package/src/adapters_real/__pycache__/__init__.cpython-312.pyc +0 -0
package/src/adapters_real/__pycache__/kokoro_speaker.cpython-312.pyc +0 -0
package/src/adapters_real/__pycache__/live_mic.cpython-312.pyc +0 -0
package/src/adapters_real/__pycache__/ptt_vad.cpython-312.pyc +0 -0
package/src/adapters_real/__pycache__/queue_llm.cpython-312.pyc +0 -0
package/src/adapters_real/__pycache__/whisper_stt.cpython-312.pyc +0 -0
package/src/simulation/__pycache__/__init__.cpython-312.pyc +0 -0
package/src/simulation/__pycache__/engine.cpython-312.pyc +0 -0
package/src/simulation/__pycache__/models.cpython-312.pyc +0 -0
package/src/simulation/__pycache__/ports.cpython-312.pyc +0 -0

package/README.md CHANGED

@@ -5,6 +5,7 @@
 **Give your AI agents a voice, real ears, and the ability to handle interruptions in real-time.**
 [![npm version](https://img.shields.io/npm/v/voice-mcp-server.svg?color=red&style=flat-square&logo=npm)](https://www.npmjs.com/package/voice-mcp-server)
+[![Publish to NPM](https://img.shields.io/github/actions/workflow/status/erickvs/voice-mcp-server/publish.yml?style=flat-square&logo=github)](https://github.com/erickvs/voice-mcp-server/actions)
 [![Platform: macOS Apple Silicon](https://img.shields.io/badge/Platform-macOS%20%7C%20Apple%20Silicon-lightgrey?style=flat-square&logo=apple)](#-target-environment)
 [![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python&style=flat-square)](https://python.org)
 [![MCP Compatible](https://img.shields.io/badge/MCP-Compatible-success?style=flat-square)](https://modelcontextprotocol.io/)
@@ -66,7 +67,7 @@ The system is built on a highly modular adapter pattern configured via `hydra` Y
 | | `elevenlabs_speaker` | Premium cloud-based ultra-realistic voices. |
 | **🎙️ Microphones** | `live_mic` | Direct hardware integration via PyAudio. |
 | **🤫 VAD (Activity)** | `silero_vad` | Conversational mode powered by Silero, heavily optimized for 1-second barge-ins. *(Note: **Headphones are strictly required** for this mode to prevent the AI from hearing its own audio output and endlessly interrupting itself).* |
-| | `ptt_vad` | Manual Push-to-Talk / Walkie-Talkie mode. **(Default: Hold 'Shift' to talk)** |
+| | `ptt_vad` | Manual Push-to-Talk / Walkie-Talkie mode. **(Default: Hold 'Right Option' to talk)** |
 | **📝 STT (Transcription)**| `mlx_whisper_large_v3`| Blazing fast local transcription leveraging Apple's MLX framework. |
 -----
@@ -149,7 +150,34 @@ Simply use `voice-mcp-server` as the command in your configuration.
 > [!NOTE]
 > **First Run Performance:** The very first time you invoke the voice tool, it will take a few minutes to initialize the Python environment and download the heavy ML weights (~4GB). **The tools will not be available until this background setup completes.** You can monitor progress in your terminal logs. *Depending on your AI client, you may need to restart the application/CLI for the tools to appear after setup.*
-### 4. Uninstalling
+### 4. Customizing the Voice (ElevenLabs)
+If you prefer to use **ElevenLabs** for ultra-realistic cloud TTS instead of the default local Kokoro engine, you can easily configure it using Environment Variables!
+> [!WARNING]
+> **Privacy Notice:** By configuring and using ElevenLabs, the text generated by your LLM will be transmitted over the internet to ElevenLabs' servers for audio rendering. This data is subject to ElevenLabs' own privacy policies and terms of service. If you require absolute privacy and air-gapped security, do not configure this key and continue using the default local MLX engine.
+When adding the server to your MCP Client (like `claude_desktop_config.json`), simply provide your API key and your preferred Voice ID in the `env` object:
+```json
+{
+  "mcpServers": {
+    "voice-mcp-server": {
+      "command": "voice-mcp-server",
+      "args": [],
+      "env": {
+        "ELEVENLABS_API_KEY": "sk_your_api_key_here",
+        "ELEVENLABS_VOICE_ID": "aEO01A4wXwd1O8GPgGlF"
+      }
+    }
+  }
+}
+```
+*(If you are using Gemini CLI or Claude Code, you can simply `export` these variables in your terminal profile like `.zshrc`!)*
+Once configured, simply tell your AI: *"Switch your audio engine to use the elevenlabs_speaker adapter."*
+### 5. Uninstalling
 If you wish to completely remove the server and its downloaded ML models from your system to free up space:
@@ -192,7 +220,7 @@ If you wish to contribute to the project or run it from source, follow these ste
 Once connected, test the server by sending this prompt to your AI:
-> *"Let's test your voice capabilities! Please use the `voice_converse` tool to introduce yourself and tell me a story about a brave robot. If I interrupt you while you are speaking, stop the story and acknowledge my interruption in your next response."*
+> *"Let's test your voice capabilities! Please introduce yourself, seamlessly tell me how to use the Right Option key to interact with you, and then start telling me a long story about a brave robot. I will practice using the Right Option key to interrupt you mid-story. When I interrupt, stop the story instantly, acknowledge my interruption naturally, and ask what we should work on instead."*
 -----

package/config/config.yaml CHANGED

@@ -9,7 +9,7 @@ defaults:
   - speaker: kokoro_speaker
   # Available VADs:
-  #   - ptt_vad: Walkie-Talkie mode (Hold 'Shift' to talk. Instant response. Ignores TV/noise).
+  #   - ptt_vad: Walkie-Talkie mode (Hold 'Right Option' to talk. Instant response. Ignores TV/noise).
   #   - silero_vad: Conversational AI mode (Listens automatically. Tuned for 1-second barge-ins).
   - vad: ptt_vad

package/config/vad/ptt_vad.yaml CHANGED

@@ -1,5 +1,5 @@
 _target_: adapters_real.ptt_vad.PushToTalkVAD
-key_name: "shift"
+key_name: "right_option"
 # PTT specific tuning (tuned for instant response)
 vad_probability_threshold: 0.50

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "voice-mcp-server",
-  "version": "0.1.25",
+  "version": "0.3.0",
   "description": "An MCP server to allow LLMs to speak and listen via bidirectional voice loops",
   "main": "build/index.js",
   "type": "module",
@@ -30,6 +30,10 @@
   ],
   "author": "Erick Vazquez Santillan",
   "license": "MIT",
+  "repository": {
+    "type": "git",
+    "url": "git+https://github.com/erickvs/voice-mcp-server.git"
+  },
   "dependencies": {
     "@modelcontextprotocol/sdk": "^1.5.0"
   },

package/requirements.txt CHANGED

@@ -97,6 +97,7 @@ shellingham==1.5.4
 silero-vad==6.2.1
 six==1.17.0
 smart_open==7.5.1
+sounddevice==0.5.5
 soundfile==0.13.1
 spacy==3.8.14
 spacy-curated-transformers==0.3.1

package/src/adapters_real/elevenlabs_speaker.py CHANGED

@@ -16,7 +16,7 @@ class ElevenLabsSpeaker(ISpeaker):
         self.words = []
         self.process = None
         self.start_time = 0
-        self.voice_id = voice_id
+        self.voice_id = os.getenv("ELEVENLABS_VOICE_ID", voice_id)
         self.api_key = os.getenv("ELEVENLABS_API_KEY")
         self.temp_file = "/tmp/elevenlabs_output.mp3"
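
The hunk above lets the `ELEVENLABS_VOICE_ID` environment variable take precedence over the `voice_id` constructor argument. A minimal sketch of that lookup, assuming nothing beyond `os.getenv`; the helper name `resolve_voice_id` and the sample values are hypothetical and do not appear in the package:

```python
import os


def resolve_voice_id(configured: str) -> str:
    """Return ELEVENLABS_VOICE_ID when it is set, else the configured value.

    Hypothetical helper for illustration; the package does this inline
    in ElevenLabsSpeaker.__init__.
    """
    return os.getenv("ELEVENLABS_VOICE_ID", configured)


# Case 1: variable unset, so the configured value is used.
os.environ.pop("ELEVENLABS_VOICE_ID", None)
fallback = resolve_voice_id("from-config")

# Case 2: variable set, so it overrides the configured value.
os.environ["ELEVENLABS_VOICE_ID"] = "from-env"
overridden = resolve_voice_id("from-config")

# Case 3: variable set to an empty string. os.getenv only falls back
# when the variable is missing, so the empty string is returned as-is.
os.environ["ELEVENLABS_VOICE_ID"] = ""
empty = resolve_voice_id("from-config")

# Leave the environment as we found it.
os.environ.pop("ELEVENLABS_VOICE_ID", None)
```

One consequence worth knowing: an exported but empty `ELEVENLABS_VOICE_ID` would replace the configured voice with an empty string rather than fall back to it.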

package/src/adapters_real/kokoro_speaker.py CHANGED

@@ -1,3 +1,4 @@
+from logger import logger
 import os
 import time
 import subprocess
@@ -17,7 +18,7 @@ class KokoroSpeaker(ISpeaker):
         self.start_time = 0
         self.temp_file = "/tmp/kokoro_output.wav"
-        print(f"[DEBUG SPEAKER] Loading local Kokoro TTS model (Voice: {voice})...")
+        logger.info(f"Loading local Kokoro TTS model (Voice: {voice})...")
         # Load the pipeline. Since you are on M4 Max, we will try to use MPS if available
         if torch.backends.mps.is_available():
             self.device = "mps"
@@ -30,7 +31,7 @@ class KokoroSpeaker(ISpeaker):
         # We use lang_code 'a' for American English
         self.pipeline = KPipeline(lang_code='a', device=self.device)
         self.voice = voice
-        print(f"[DEBUG SPEAKER] Kokoro TTS loaded successfully on {self.device}.")
+        logger.info(f"Kokoro TTS loaded successfully on {self.device}.")
     def speak(self, text: str):
         if not text.strip():
@@ -40,7 +41,7 @@ class KokoroSpeaker(ISpeaker):
         self.words = text.split()
         try:
-            print(f"[DEBUG SPEAKER] Generating Kokoro audio for: {text[:50]}...")
+            logger.debug(f"Generating Kokoro audio for: {text[:50]}...")
             # Generate the audio locally
             generator = self.pipeline(
                 text, voice=self.voice, # <= change voice here
@@ -54,14 +55,14 @@ class KokoroSpeaker(ISpeaker):
                 audio_segments.append(audio)
             if not audio_segments:
-                print("[DEBUG SPEAKER] Kokoro generated empty audio.")
+                logger.warning("Kokoro generated empty audio.")
                 return
             final_audio = torch.cat(audio_segments, dim=0).cpu().numpy()
             # Save to temporary file at 24kHz (Kokoro's default sample rate)
             sf.write(self.temp_file, final_audio, 24000)
-            print("[DEBUG SPEAKER] Audio generated, starting playback.")
+            logger.debug("Audio generated, starting playback.")
             # Play the generated audio using afplay (macOS native)
             self.start_time = time.time()
@@ -72,7 +73,7 @@ class KokoroSpeaker(ISpeaker):
             )
         except Exception as e:
-            print(f"[DEBUG SPEAKER] Kokoro Generation Error: {e}")
+            logger.error(f"Kokoro Generation Error: {e}")
             # Fallback to macOS say
             self.start_time = time.time()
             self.process = subprocess.Popen(

package/src/adapters_real/live_mic.py CHANGED

@@ -1,3 +1,4 @@
+from logger import logger
 import pyaudio
 import queue
 from simulation.ports import IMicrophone
@@ -10,6 +11,7 @@ class LiveMicrophone(IMicrophone):
         self.q = queue.Queue(maxsize=100)
         self.p = pyaudio.PyAudio()
         self.stream = None
+        logger.info(f"Initialized LiveMicrophone with rate={rate}, chunk={chunk}")
     def start_stream(self):
         if self.stream is not None:
@@ -31,12 +33,21 @@ class LiveMicrophone(IMicrophone):
             stream_callback=self._callback
         )
         self.stream.start_stream()
+        logger.info("LiveMicrophone stream started")
     def stop_stream(self):
-        if self.stream is not None:
-            self.stream.stop_stream()
-            self.stream.close()
-            self.stream = None
+        stream = self.stream
+        self.stream = None
+        if stream is not None:
+            try:
+                stream.stop_stream()
+            except OSError as e:
+                logger.debug(f"Ignored PyAudio OSError during stop_stream: {e}")
+            try:
+                stream.close()
+            except Exception:
+                pass
+            logger.info("LiveMicrophone stream stopped")
     def _callback(self, in_data, frame_count, time_info, status):
         try:
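
The rewritten `stop_stream` detaches `self.stream` before tearing it down and tolerates an `OSError` from a stream that is already stopped. The pattern can be sketched without audio hardware as follows; `FakeStream` and `Mic` are hypothetical stand-ins, not classes from the package:

```python
class FakeStream:
    """Stand-in for a PyAudio stream (hypothetical, for illustration)."""

    def __init__(self, fail_on_stop: bool = False):
        self.fail_on_stop = fail_on_stop
        self.closed = False

    def stop_stream(self):
        if self.fail_on_stop:
            raise OSError("stream not open")

    def close(self):
        self.closed = True


class Mic:
    def __init__(self, stream):
        self.stream = stream

    def stop_stream(self) -> bool:
        # Detach first: a repeated call then sees None and returns early.
        stream, self.stream = self.stream, None
        if stream is None:
            return False
        try:
            stream.stop_stream()
        except OSError:
            pass  # already stopped; still attempt to close below
        try:
            stream.close()
        except Exception:
            pass
        return True


mic = Mic(FakeStream(fail_on_stop=True))
held = mic.stream
first_call = mic.stop_stream()   # stops despite the OSError and closes
second_call = mic.stop_stream()  # nothing left to stop
```

Returning a boolean is an addition made here so the two calls can be told apart; the method in the diff returns nothing.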

package/src/adapters_real/ptt_sidecar.swift ADDED

@@ -0,0 +1,156 @@
+import Foundation
+import IOKit.hid
+import AudioToolbox
+import AppKit
+var isPTTActive = false
+var isCtrlPressed = false
+var pingID: SystemSoundID = 0
+var popID: SystemSoundID = 0
+var idleTimer: Timer?
+let IDLE_TIMEOUT: TimeInterval = 900 // 15 minutes
+func resetIdleTimer() {
+    idleTimer?.invalidate()
+    idleTimer = Timer.scheduledTimer(withTimeInterval: IDLE_TIMEOUT, repeats: false) { _ in
+        print("💤 [SWIFT] Sidecar idle for \(Int(IDLE_TIMEOUT / 60)) minutes. Exiting to save resources.")
+        exit(0)
+    }
+}
+// Load uncompressed audio for 0ms latency
+let pingURL = URL(fileURLWithPath: "/System/Library/Sounds/Morse.aiff") as CFURL
+let popURL = URL(fileURLWithPath: "/System/Library/Sounds/Pop.aiff") as CFURL
+AudioServicesCreateSystemSoundID(pingURL, &pingID)
+AudioServicesCreateSystemSoundID(popURL, &popID)
+func sendSocketMessage(code: UInt8) -> Bool {
+    let fd = socket(AF_UNIX, SOCK_STREAM, 0)
+    guard fd >= 0 else { return false }
+    defer { close(fd) }
+    var addr = sockaddr_un()
+    addr.sun_family = sa_family_t(AF_UNIX)
+    let path = "/tmp/voice_mcp_ptt.sock"
+    let pathSize = Int(MemoryLayout.size(ofValue: addr.sun_path))
+    _ = withUnsafeMutablePointer(to: &addr.sun_path.0) { ptr in
+        path.withCString { cstr in
+            strncpy(ptr, cstr, pathSize)
+        }
+    }
+    let len = socklen_t(MemoryLayout<sockaddr_un>.size)
+    let connectResult = withUnsafePointer(to: &addr) {
+        $0.withMemoryRebound(to: sockaddr.self, capacity: 1) { connect(fd, $0, len) }
+    }
+    if connectResult == 0 {
+        var byte: UInt8 = code
+        write(fd, &byte, 1)
+        return true
+    }
+    return false
+}
+func isTerminalFrontmost() -> Bool {
+    guard let frontApp = NSWorkspace.shared.frontmostApplication else { return false }
+    let bundleID = frontApp.bundleIdentifier ?? ""
+    // Add common terminal emulators and editors
+    let allowedTerminals = [
+        "com.apple.Terminal",
+        "com.googlecode.iterm2",
+        "dev.warp.Warp-Stable",
+        "co.zeit.hyper",
+        "com.mitchellh.ghostty",
+        "net.kovidgoyal.kitty",
+        "org.alacritty",
+        "com.anthropic.claudedesktop",
+        "com.microsoft.VSCode",
+        "com.todesktop.Cursor"
+    ]
+    return allowedTerminals.contains(bundleID)
+}
+var lastPressTime: TimeInterval = 0
+var lastReleaseTime: TimeInterval = 0
+let DOUBLE_TAP_THRESHOLD: TimeInterval = 0.4 // 400 milliseconds
+let hidCallback: IOHIDValueCallback = { context, result, sender, value in
+    let element = IOHIDValueGetElement(value)
+    let usagePage = IOHIDElementGetUsagePage(element)
+    let usage = IOHIDElementGetUsage(element)
+    let intValue = IOHIDValueGetIntegerValue(value)
+    // 0x07 = Generic Desktop Keyboard
+    if usagePage == 0x07 {
+        let isPressed = (intValue == 1)
+        // 0xE6 = Right Option
+        if usage == 0xE6 {
+            // Only process events if our terminal is the active window!
+            if isTerminalFrontmost() {
+                let now = Date().timeIntervalSince1970
+                if isPressed && !isPTTActive {
+                    resetIdleTimer()
+                    // Check for Double-Tap!
+                    // If the time since the LAST release is very short, and the time
+                    // since the LAST press is also very short, this is the second press of a double-tap.
+                    if (now - lastReleaseTime) < DOUBLE_TAP_THRESHOLD && (now - lastPressTime) < DOUBLE_TAP_THRESHOLD {
+                        // Abort signal!
+                        if sendSocketMessage(code: 2) {
+                            print("🚨 [SWIFT] -> DOUBLE TAP DETECTED! Transmitted 0x02 (Abort)")
+                            AudioServicesPlaySystemSound(popID) // Play pop to confirm abort
+                        }
+                        // Reset timestamps so we don't accidentally triple-tap
+                        lastPressTime = 0
+                        lastReleaseTime = 0
+                        return
+                    }
+                    // Normal Single Press
+                    lastPressTime = now
+                    if sendSocketMessage(code: 1) {
+                        isPTTActive = true
+                        AudioServicesPlaySystemSound(pingID)
+                        print("[SWIFT] -> Transmitted 0x01 (Press)")
+                    }
+                } else if !isPressed && isPTTActive {
+                    lastReleaseTime = now
+                    isPTTActive = false
+                    // Normal Release
+                    _ = sendSocketMessage(code: 0)
+                    AudioServicesPlaySystemSound(popID)
+                    print("[SWIFT] -> Transmitted 0x00 (Release)")
+                }
+            }
+        }
+    }
+}
+let manager = IOHIDManagerCreate(kCFAllocatorDefault, IOOptionBits(kIOHIDOptionsTypeNone))
+let deviceMatch: [String: Any] = ["DeviceUsagePage": 1, "DeviceUsage": 6]
+IOHIDManagerSetDeviceMatching(manager, deviceMatch as CFDictionary)
+IOHIDManagerRegisterInputValueCallback(manager, hidCallback, nil)
+IOHIDManagerScheduleWithRunLoop(manager, CFRunLoopGetMain(), CFRunLoopMode.defaultMode.rawValue)
+let openResult = IOHIDManagerOpen(manager, IOOptionBits(kIOHIDOptionsTypeNone))
+if openResult != kIOReturnSuccess {
+    print("❌ FATAL: macOS blocked hardware access.")
+    print("👉 ACTION REQUIRED: Open System Settings -> Privacy & Security -> Input Monitoring.")
+    print("👉 Add your Terminal application, toggle it ON, completely restart the terminal, and try again.")
+    exit(1)
+}
+print("✅ [SWIFT] Sidecar Online with Context-Aware Focus Filter.")
+print("🎧 [SWIFT] Listening natively for Right Option (Hardware Matrix 0xE6)...")
+print("🔒 [SWIFT] Mic will ONLY open if a Terminal window is currently active.")
+resetIdleTimer() // Start the idle timer initially
+CFRunLoopRun()
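
The press, release, and double-tap handling in the HID callback above is a small state machine. The following Python port is a sketch for reading and testing that logic in isolation. It assumes every socket send succeeds (the Swift code only sets `isPTTActive` after a successful send) and takes timestamps as arguments. `TapDetector` and `on_key` are hypothetical names:

```python
DOUBLE_TAP_THRESHOLD = 0.4  # seconds, same value as the Swift constant


class TapDetector:
    """Python port of the sidecar's key-state logic (illustrative only)."""

    def __init__(self, threshold: float = DOUBLE_TAP_THRESHOLD):
        self.threshold = threshold
        self.active = False
        self.last_press = 0.0
        self.last_release = 0.0

    def on_key(self, pressed: bool, now: float):
        """Return 'press', 'release', 'abort', or None for ignored events."""
        if pressed and not self.active:
            recent_release = (now - self.last_release) < self.threshold
            recent_press = (now - self.last_press) < self.threshold
            if recent_release and recent_press:
                # Second press of a double-tap: reset so a third tap
                # is treated as a fresh single press.
                self.last_press = 0.0
                self.last_release = 0.0
                return "abort"
            self.last_press = now
            self.active = True
            return "press"
        if not pressed and self.active:
            self.last_release = now
            self.active = False
            return "release"
        return None


detector = TapDetector()
events = [
    (True, 100.0),   # first press
    (False, 100.1),  # quick release
    (True, 100.2),   # second press inside the window
    (False, 100.3),  # release after the abort, ignored
    (True, 101.0),   # later press starts a new cycle
]
results = [detector.on_key(pressed, t) for pressed, t in events]
```

Because an abort leaves the detector inactive, the release that follows the second tap produces no message, which matches the early `return` in the Swift callback.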

package/src/adapters_real/ptt_vad.py CHANGED

@@ -1,36 +1,154 @@
-from pynput import keyboard
+from logger import logger
+import threading
+import subprocess
+import socket
+import os
+import sys
+import atexit
+import http.client
 from simulation.ports import IVAD
 from simulation.models import VirtualAudioFrame
+SOCKET_PATH = "/tmp/voice_mcp_ptt.sock"
+class UDSHTTPConnection(http.client.HTTPConnection):
+    def __init__(self, socket_path, timeout=300.0):
+        super().__init__("localhost", timeout=timeout)
+        self.socket_path = socket_path
+    def connect(self):
+        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        self.sock.settimeout(self.timeout)
+        self.sock.connect(self.socket_path)
 class PushToTalkVAD(IVAD):
-    def __init__(self, key_name="shift", **kwargs):
-        self.is_pressed = False
-        print(f"[DEBUG VAD] Initializing Push-To-Talk VAD. Walkie-Talkie Hotkey: '{key_name}'")
+    def __init__(self, key_name="right_option", **kwargs):
+        self.lock = threading.Lock()
+        self.is_ptt_active = False
-        # Map string names to pynput Key objects
-        key_map = {
-            "shift": keyboard.Key.shift,
-            "shift_r": keyboard.Key.shift_r,
-            "ctrl": keyboard.Key.ctrl,
-            "alt": keyboard.Key.alt,
-            "cmd": keyboard.Key.cmd,
-            "space": keyboard.Key.space
-        }
+        logger.info("Initializing Push-To-Talk VAD via Swift Sidecar.")
-        self.hotkey = key_map.get(key_name.lower(), keyboard.Key.shift)
+        self.sidecar_process = None
+        self.server_socket = None
+        self.listener_thread = None
+        self._stop_event = threading.Event()
+        self._start_sidecar()
+        atexit.register(self._cleanup)
-        def on_press(key):
-            if key == self.hotkey:
-                self.is_pressed = True
+    def set_active(self, active: bool):
+        if active and self.server_socket is None:
+            self._start_server()
+        elif not active and self.server_socket is not None:
+            self._stop_server()
-        def on_release(key):
-            if key == self.hotkey:
-                self.is_pressed = False
+    def _start_server(self):
+        self._stop_event.clear()
+        if os.path.exists(SOCKET_PATH):
+            try:
+                os.remove(SOCKET_PATH)
+            except OSError:
+                pass
+        self.server_socket = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+        self.server_socket.bind(SOCKET_PATH)
+        self.server_socket.listen(1)
+        logger.debug(f"PTT socket created at {SOCKET_PATH}")
+        self.listener_thread = threading.Thread(target=self._listen_loop, daemon=True)
+        self.listener_thread.start()
+    def _stop_server(self):
+        self._stop_event.set()
+        if self.server_socket:
+            try:
+                self.server_socket.close()
+            except Exception:
+                pass
+            self.server_socket = None
+        if self.listener_thread:
+            self.listener_thread.join(timeout=1.0)
+            self.listener_thread = None
+        if os.path.exists(SOCKET_PATH):
+            try:
+                os.remove(SOCKET_PATH)
+            except OSError:
+                pass
+        with self.lock:
+            self.is_ptt_active = False
+    def _start_sidecar(self):
+        try:
+            output = subprocess.check_output(["pgrep", "-x", "ptt_sidecar"])
+            if len(output.strip()) > 0:
+                logger.debug("Swift Sidecar is already running.")
+                return
+        except subprocess.CalledProcessError:
+            pass
+        sidecar_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "ptt_sidecar")
+        if not os.path.exists(sidecar_path):
+            logger.info(f"Compiling Swift Sidecar at {sidecar_path}...")
+            swift_src = os.path.join(os.path.dirname(os.path.abspath(__file__)), "ptt_sidecar.swift")
+            subprocess.run(["swiftc", swift_src, "-o", sidecar_path])
+        if os.path.exists(sidecar_path):
+            self.sidecar_process = subprocess.Popen(
+                [sidecar_path],
+                stdout=sys.stdout,
+                stderr=sys.stderr,
+                start_new_session=True
+            )
+            logger.info("Swift Sidecar started.")
+        else:
+            logger.error("Failed to start Swift Sidecar, executable not found.")
-        self.listener = keyboard.Listener(on_press=on_press, on_release=on_release)
-        self.listener.start()
+    def _listen_loop(self):
+        while not self._stop_event.is_set():
+            try:
+                if not self.server_socket:
+                    break
+                conn, _ = self.server_socket.accept()
+                with conn:
+                    while not self._stop_event.is_set():
+                        data = conn.recv(1)
+                        if not data:
+                            break
+                        with self.lock:
+                            if data == b'\x01':
+                                logger.info("Mic Alive (Right Option Pressed) - Received 0x01")
+                                self.is_ptt_active = True
+                            elif data == b'\x00':
+                                logger.info("Mic Dead (Right Option Released) - Received 0x00")
+                                self.is_ptt_active = False
+                            elif data == b'\x02':
+                                logger.info("Abort (Esc/Ctrl+C Pressed) - Received 0x02. Triggering /abort")
+                                try:
+                                    daemon_sock = os.path.expanduser("~/Library/Application Support/VoiceMCP/daemon.sock")
+                                    conn_uds = UDSHTTPConnection(daemon_sock, timeout=1.0)
+                                    conn_uds.request("POST", "/abort", body=None, headers={})
+                                    conn_uds.getresponse().read()
+                                    conn_uds.close()
+                                except Exception as e:
+                                    logger.error(f"Failed to trigger /abort natively: {e}")
+            except Exception as e:
+                pass
     def analyze(self, frame: VirtualAudioFrame) -> float:
-        # If the key is held down, we return 1.0 (100% certainty of speech).
-        # If the key is released, we return 0.0 (0% certainty of speech).
-        return 1.0 if self.is_pressed else 0.0
+        with self.lock:
+            return 1.0 if self.is_ptt_active else 0.0
+    def _cleanup(self):
+        self._stop_server()
+        if self.sidecar_process:
+            try:
+                self.sidecar_process.terminate()
+            except Exception:
+                pass
+    def __del__(self):
+        self._cleanup()
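
The sidecar and `PushToTalkVAD` communicate with single-byte codes over a Unix domain socket: `0x01` for press, `0x00` for release, and `0x02` for abort, which the sidecar sends on a double-tap. A minimal sketch of the receiving side, using `socket.socketpair()` in place of the `/tmp/voice_mcp_ptt.sock` listener so it runs without touching the filesystem. The sidecar opens a new connection per message, whereas this sketch sends both codes over one connection. The `drain` function and its `on_abort` callback are hypothetical:

```python
import socket

PRESS, RELEASE, ABORT = b"\x01", b"\x00", b"\x02"


def drain(conn: socket.socket, on_abort) -> bool:
    """Read one-byte codes until the peer closes; return the final PTT state."""
    active = False
    while True:
        data = conn.recv(1)
        if not data:  # empty read means the peer closed the connection
            break
        if data == PRESS:
            active = True
        elif data == RELEASE:
            active = False
        elif data == ABORT:
            on_abort()  # abort does not change the PTT state by itself
    return active


sender, receiver = socket.socketpair()
aborts = []
with sender:
    sender.sendall(PRESS + ABORT)  # closing the sender ends the stream
with receiver:
    final_state = drain(receiver, lambda: aborts.append("abort"))
```

In the package the abort branch issues a `POST /abort` to the daemon socket through `UDSHTTPConnection`; the callback here only records that the code arrived.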

package/src/adapters_real/whisper_stt.py CHANGED

@@ -1,3 +1,4 @@
+from logger import logger
 import numpy as np
 import mlx_whisper
 from typing import List
@@ -8,7 +9,7 @@ from simulation.models import VirtualAudioFrame
 class RealWhisperSTT(ISTT):
     def __init__(self, model_size="mlx-community/whisper-large-v3-mlx"):
         self.model_size = model_size
-        print(f"[DEBUG STT] Preparing MLX Whisper model ({model_size}) for Apple Silicon...")
+        logger.info(f"Preparing MLX Whisper model ({model_size}) for Apple Silicon...")
         # MLX will lazily load and compile the model on the first inference, but we print here to indicate we are using the MLX backend.
     def transcribe(self, frames: List[VirtualAudioFrame]) -> str:
@@ -19,14 +20,14 @@ class RealWhisperSTT(ISTT):
         # Convert 16-bit PCM (expected from microphone) to float32 [-1.0, 1.0] expected by Whisper
         audio_data = np.frombuffer(raw_bytes, dtype=np.int16).astype(np.float32) / 32768.0
-        print(f"[DEBUG STT] Transcribing {len(audio_data)} samples with Apple MLX Whisper ({self.model_size})...")
+        logger.debug(f"Transcribing {len(audio_data)} samples with Apple MLX Whisper ({self.model_size})...")
         try:
             # We explicitly set English since you are speaking English, and fp16 for Metal acceleration
             result = mlx_whisper.transcribe(audio_data, path_or_hf_repo=self.model_size, language="en")
             text = result.get("text", "").strip()
-            print(f"[DEBUG STT] MLX Transcription result: {text}")
+            logger.debug(f"MLX Transcription result: {text}")
             return text
         except Exception as e:
-            print(f"[DEBUG STT] MLX Whisper transcription error: {e}")
+            logger.error(f"MLX Whisper transcription error: {e}")
             return ""
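
The context lines in `transcribe` convert 16-bit PCM to floats by dividing by 32768.0. The scaling can be checked with the standard library alone. This sketch swaps NumPy for the `array` module and assumes little-endian input, which is an assumption of this example rather than something the diff states. `pcm16_to_float` is a hypothetical name:

```python
import array
import sys


def pcm16_to_float(raw: bytes) -> list:
    """Map signed 16-bit PCM samples to floats in [-1.0, 1.0)."""
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed to be little-endian
    return [s / 32768.0 for s in samples]


# Four samples: silence, the most negative value, half scale, and the maximum.
raw = b"".join(
    value.to_bytes(2, "little", signed=True)
    for value in (0, -32768, 16384, 32767)
)
floats = pcm16_to_float(raw)
```

Dividing by 32768 rather than 32767 maps the most negative sample to exactly -1.0, so the largest positive sample lands just below 1.0.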