PyPI - scribe-cli - Versions diffs - 0.17.1__tar.gz → 0.18.0__tar.gz - Mend

Files changed (62)

{scribe_cli-0.17.1 → scribe_cli-0.18.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: scribe-cli
-Version: 0.17.1
+Version: 0.18.0
 Summary: Speech-to-text CLI and system-tray app for dictating into any focused window. Local (vosk, faster-whisper) or cloud (groq, openai) backends, batch or streaming.
 Author-email: Mahé Perrette <mahe.perrette@gmail.com>
 License: MIT License
@@ -52,6 +52,7 @@ Requires-Dist: unidecode
 Requires-Dist: termcolor
 Requires-Dist: platformdirs
 Requires-Dist: desktop-ai-core>=0.2.0
+Requires-Dist: onnxruntime
 Provides-Extra: keyboard
 Requires-Dist: pynput; extra == "keyboard"
 Provides-Extra: whisper
@@ -69,6 +70,7 @@ Requires-Dist: soundfile; extra == "openai"
 Provides-Extra: groq
 Requires-Dist: openai<3,>=2.37.0; extra == "groq"
 Requires-Dist: soundfile; extra == "groq"
+Provides-Extra: vad
 Provides-Extra: all
 Requires-Dist: pynput; extra == "all"
 Requires-Dist: faster-whisper; extra == "all"

{scribe_cli-0.17.1 → scribe_cli-0.18.0}/docs/backends.md RENAMED

@@ -225,8 +225,8 @@ Whisper's prompt window is capped at ~224 tokens; 200 chars of French
 sits well under that and leaves room for your static prompt + words
 list.
-The rolling tail is **dropped** whenever the pause that triggered the
-chunk cut exceeded 1.5 seconds — a long pause is treated as a new
+The rolling tail is **dropped** when the silence between two
+utterances exceeds 1.5 seconds — a long pause is treated as a new
 sentence/idea boundary, where carrying a possibly-bad prior chunk
 forward biases the next one more than it helps. This mirrors
 `whisper.cpp`'s `--keep-context off` default: prior-text conditioning
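
The rule described in this hunk can be sketched in a few lines. This is only an illustration, not the package's code: `next_prompt_tail`, `LONG_PAUSE_S`, and `TAIL_CHARS` are hypothetical names, and the 200-character tail length is inferred from the context line above rather than confirmed by the diff.

```python
LONG_PAUSE_S = 1.5   # from the docs: silence longer than this drops the tail
TAIL_CHARS = 200     # assumed tail length, inferred from "200 chars" above


def next_prompt_tail(previous_text: str, silence_s: float) -> str:
    """Return the prior text to carry into the next chunk's prompt.

    Hypothetical helper: an empty string means "start fresh".
    """
    if silence_s > LONG_PAUSE_S:
        # Long pause: treat as a new sentence/idea, carry nothing forward.
        return ""
    # Short pause: keep only the last TAIL_CHARS characters as context.
    return previous_text[-TAIL_CHARS:]
```

With this sketch, a 0.4 s pause keeps the tail and a 2 s pause clears it, so a possibly wrong earlier chunk cannot bias the next utterance.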

{scribe_cli-0.17.1 → scribe_cli-0.18.0}/docs/cli.md RENAMED

@@ -65,20 +65,39 @@ flag suppresses only its own side (giving `--prompt ""` still loads
 | `--type-direct`             | In keystroke mode, type the transcription as keystrokes instead of synthesising Ctrl+V.     |
 | `-o, --output-file FILE`    | Also append the transcription to this file.                                                 |
-## Silence detection (shared)
+## Silence detection
 | Flag                       | Default | Purpose                                                                |
 |----------------------------|---------|------------------------------------------------------------------------|
 | `--duration SECS`          | `120`   | Max recording duration in seconds.                                     |
-| `--silence-db DB`          | `-40`   | dBFS volume floor for "this frame is silent". Used by every silence-driven behavior. |
 | `--silence-duration SECS`  | `0.6`   | How long silence must persist before triggering a backend's silence behavior (realtime auto-commit, pseudo-streaming cut). |
+## Voice activity detection
+scribe ships two silence-detection backends. By default
+(`--vad-mode auto`) it picks **silero-vad** when `onnxruntime` is
+importable (always true on a stock `pip install scribe-cli` since
+`onnxruntime` is a base dependency) and falls back to a plain dB
+volume threshold otherwise. silero is much more robust to ambient
+noise (clicks, fan, traffic) and to soft speech than dB, which drops
+sub-threshold syllables and gets fooled by loud non-speech.
+The dB and silero parameter groups are independent — the inactive
+mode's knobs are ignored.
+| Flag                          | Default | Purpose                                                                |
+|-------------------------------|---------|------------------------------------------------------------------------|
+| `--vad-mode {auto,db,silero}` | `auto`  | Silence-detection backend. `auto` picks silero when available, dB otherwise. |
+| `--vad-threshold FLOAT`       | `0.5`   | **[silero only]** Speech-probability threshold in `[0,1]`. Lower = more permissive (catches quiet speech and more noise); higher = stricter. |
+| `--vad-min-silence-ms INT`    | `300`   | **[silero only]** Minimum sustained low-probability span before speech-end fires, in ms. silero's onset/offset smoothing window. |
+| `--silence-db DB`             | `-40`   | **[dB only]** dBFS volume floor for "this frame is silent". Ignored when silero is the active mode. |
 ## Realtime (`gpt-realtime-whisper`)
 | Flag                                              | Default  | Purpose                                                                                                                                                                                  |
 |---------------------------------------------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | `--realtime-delay {minimal,low,medium,high,xhigh}` | `medium` | Trade off latency vs accuracy on `gpt-realtime-whisper`. Lower = faster partials but more paste churn in the focused window.                                                             |
-| `--realtime-gate` / `--no-realtime-gate`          | on       | Drop silent frames (per `--silence-db`) before sending them over the WebSocket so silent audio isn't billed as input tokens. After `--silence-duration` of silence, also commit mid-session so trailing words flush live. |
+| `--realtime-gate` / `--no-realtime-gate`          | on       | Drop silent frames (per the active `--vad-mode`) before sending them over the WebSocket so silent audio isn't billed as input tokens. After `--silence-duration` of silence, also commit mid-session so trailing words flush live. |
 Streaming models (Vosk, `gpt-realtime-whisper`) ignore the batch
 silence-chunking knobs; they have their own end-of-utterance signal.
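
As a rough illustration of the `--vad-mode auto` behaviour documented in this hunk, the selection could look like the sketch below. The real check lives in `scribe/audio.py`, which this diff does not show, so `resolve_vad_mode` is a hypothetical name and the import probe is an assumption about how "importable" is tested.

```python
import importlib.util


def resolve_vad_mode(requested: str = "auto") -> str:
    """Map a --vad-mode value to the backend that would actually run (sketch)."""
    if requested not in ("auto", "db", "silero"):
        raise ValueError(f"unknown VAD mode: {requested!r}")
    if requested != "auto":
        # Explicit choices are passed through unchanged in this sketch.
        return requested
    # Assumption: "available" means onnxruntime can be found on the import path.
    has_onnx = importlib.util.find_spec("onnxruntime") is not None
    return "silero" if has_onnx else "db"
```

What the package does when `silero` is requested explicitly but `onnxruntime` is missing is not stated in the docs above; the sketch leaves that case to the caller.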

{scribe_cli-0.17.1 → scribe_cli-0.18.0}/docs/tray.md RENAMED

@@ -58,10 +58,15 @@ Options ▶
     Keyboard backend ▶            eitype / pynput / ydotool / wtype
                                   (rows incompatible with this OS are hidden;
                                    submenu hidden entirely when ≤ 1 row left)
-    Advanced ▶                    silence duration, silence threshold,
-                                    realtime gate, pseudo-streaming
-                                    [experimental], streaming window
-                                    [experimental], output file
+    Advanced ▶                    silence duration, VAD mode toggle
+                                    (silero ↔ dB), per-mode VAD knobs
+                                    (silero: speech-probability threshold,
+                                    min silence duration; dB: silence
+                                    threshold — only the active mode's
+                                    knobs are shown), realtime gate,
+                                    pseudo-streaming [experimental],
+                                    streaming window [experimental],
+                                    output file
 Quit
 ```

{scribe_cli-0.17.1 → scribe_cli-0.18.0}/pyproject.toml RENAMED

@@ -22,6 +22,11 @@ dependencies = [
     "termcolor",
     "platformdirs",
     "desktop-ai-core>=0.2.0",
+    # Runs the bundled silero VAD ONNX model (~2 MB shipped in scribe_data).
+    # In base deps so silero is available out of the box — see scribe/audio.py.
+    # `faster-whisper` already pulls it transitively, so installing with
+    # [whisper] is free; standalone adds ~57 MB which is trivial for an STT tool.
+    "onnxruntime",
 ]
 classifiers = [
@@ -67,12 +72,18 @@ vosk = ["vosk"]
 app = ["pystray", "PyGObject"]
 openai = ["openai>=2.37.0,<3", "soundfile"]
 groq = ["openai>=2.37.0,<3", "soundfile"]
+# [vad] is now a no-op alias kept for back-compat (`pip install scribe-cli[vad]`
+# was the documented install before onnxruntime moved into base deps).
+vad = []
 all = ["pynput", "faster-whisper", "pywhispercpp", "openai>=2.37.0,<3", "soundfile", "vosk", "pystray"]
 [tool.setuptools]
 packages = [ "scribe", "scribe_data" ]
+[tool.setuptools.package-data]
+scribe_data = ["share/*.png", "templates/*", "silero_vad.onnx", "silero_vad.LICENSE"]
 [tool.setuptools_scm]
 write_to = "scribe/_version.py"

{scribe_cli-0.17.1 → scribe_cli-0.18.0}/scribe/_version.py RENAMED

@@ -18,7 +18,7 @@ version_tuple: tuple[int | str, ...]
 commit_id: str | None
 __commit_id__: str | None
-__version__ = version = '0.17.1'
-__version_tuple__ = version_tuple = (0, 17, 1)
+__version__ = version = '0.18.0'
+__version_tuple__ = version_tuple = (0, 18, 0)
-__commit_id__ = commit_id = 'g67d90f5e4'
+__commit_id__ = commit_id = 'gd48d707c7'

{scribe_cli-0.17.1 → scribe_cli-0.18.0}/scribe/app.py RENAMED

@@ -171,7 +171,8 @@ def _resolve_prompt_and_words(prompt_text, prompt_file, words, words_file):
 def _build_backend_kwargs(backend, model, language, samplerate, duration,
-                          silence_db, silence_onset_db, silence_duration,
+                          silence_db, silence_duration,
+                          vad_mode, vad_threshold, vad_min_silence_ms,
                           download_folder_vosk, download_folder_whisper,
                           download_folder_whisper_futo,
                           realtime_delay, realtime_gate,
@@ -186,6 +187,8 @@ def _build_backend_kwargs(backend, model, language, samplerate, duration,
         word_blob = " ".join(words)
         merged_prompt = f"{prompt_text} {word_blob}" if prompt_text else word_blob
+    vad_kwargs = dict(vad_mode=vad_mode, vad_threshold=vad_threshold,
+                      vad_min_silence_ms=vad_min_silence_ms)
     if backend == "vosk":
         # Vosk has no soft prompt; only a hard grammar. Silently ignore for now.
         return dict(model_name=model, language=language, samplerate=samplerate,
@@ -194,11 +197,12 @@ def _build_backend_kwargs(backend, model, language, samplerate, duration,
     if backend == "whisper":
         return dict(model_name=model, language=language, samplerate=samplerate,
                     timeout=duration, silence_duration=silence_duration,
-                    silence_thresh=silence_db, silence_thresh_onset=silence_onset_db,
+                    silence_thresh=silence_db,
                     pseudo_streaming=pseudo_streaming, streaming_window=streaming_window,
                     prompt=prompt_text,
                     hotwords=(" ".join(words) if words else None),
-                    model_kwargs={"download_root": download_folder_whisper})
+                    model_kwargs={"download_root": download_folder_whisper},
+                    **vad_kwargs)
     if backend == "whisper-futo":
         # pywhispercpp 1.4.1 exposes `initial_prompt`; the backend folds
         # words+prompt into it (and adds a rolling chunk-tail in
@@ -206,17 +210,19 @@ def _build_backend_kwargs(backend, model, language, samplerate, duration,
         # everything into the prompt like the cloud backends do.
         return dict(model_name=model, language=language, samplerate=samplerate,
                     timeout=duration, silence_duration=silence_duration,
-                    silence_thresh=silence_db, silence_thresh_onset=silence_onset_db,
+                    silence_thresh=silence_db,
                     pseudo_streaming=pseudo_streaming, streaming_window=streaming_window,
                     prompt=merged_prompt,
-                    download_folder=download_folder_whisper_futo)
+                    download_folder=download_folder_whisper_futo,
+                    **vad_kwargs)
     if backend in ("openai", "groq"):
         from scribe.backends.openai_api import REALTIME_MODELS
         kwargs = dict(model_name=model, samplerate=samplerate,
                       timeout=duration, silence_duration=silence_duration,
-                      silence_thresh=silence_db, silence_thresh_onset=silence_onset_db,
+                      silence_thresh=silence_db,
                       pseudo_streaming=pseudo_streaming, streaming_window=streaming_window,
-                      prompt=merged_prompt)
+                      prompt=merged_prompt,
+                      **vad_kwargs)
         if backend == "openai" and model in REALTIME_MODELS:
             kwargs["realtime_delay"] = realtime_delay
             kwargs["realtime_gate"] = realtime_gate
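
The pattern in the hunks above is to build the three VAD settings once and splat them into each batch or cloud backend's kwargs, while the vosk branch is left unchanged. A simplified stand-in, assuming a hypothetical `build_kwargs` with most real parameters omitted, might read:

```python
def build_kwargs(backend: str, vad_mode: str = "auto",
                 vad_threshold: float = 0.5,
                 vad_min_silence_ms: int = 300) -> dict:
    """Simplified stand-in for _build_backend_kwargs (illustration only)."""
    # Shared VAD settings, built once and reused by several branches.
    vad_kwargs = dict(vad_mode=vad_mode, vad_threshold=vad_threshold,
                      vad_min_silence_ms=vad_min_silence_ms)
    if backend == "vosk":
        # The vosk branch is untouched by this diff, so it gets no VAD keys.
        return dict(backend=backend)
    # whisper, whisper-futo, openai and groq all receive **vad_kwargs.
    return dict(backend=backend, silence_thresh=-40.0, **vad_kwargs)
```

Keeping the shared settings in one dict means a new VAD knob only has to be added in one place rather than in every branch.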
@@ -231,7 +237,8 @@ def _build_backend_kwargs(backend, model, language, samplerate, duration,
 def get_transcriber(model=None, backend=None, dummy=False, interactive=True, language=None,
                     samplerate=None, duration=None,
-                    silence_db=None, silence_onset_db=None, silence_duration=0.6,
+                    silence_db=None, silence_duration=0.6,
+                    vad_mode="auto", vad_threshold=0.5, vad_min_silence_ms=300,
                     download_folder_vosk=None, download_folder_whisper=None,
                     download_folder_whisper_futo=None,
                     realtime_delay="medium", realtime_gate=True,
@@ -261,17 +268,14 @@ def get_transcriber(model=None, backend=None, dummy=False, interactive=True, lan
     else:
         model = _prompt_model_for_backend(backend, language, interactive)
     print(f"Selected model: {model}")
-    # silence_db is the LOW threshold (in-speech pause detection) — default
-    # -40 in all modes. silence_onset_db is the HIGH threshold (speech-start
-    # gate) used only in pseudo-streaming via hysteresis; -25 keeps ambient
-    # noise (keyboard, breathing) from triggering a chunk.
+    # silence_db is the single volume floor used by the dB fallback. Silero
+    # mode ignores it. Default -40 dBFS — keeps the gate simple by design.
     if silence_db is None:
         silence_db = -40.0
-    if silence_onset_db is None:
-        silence_onset_db = -25.0 if pseudo_streaming else silence_db
     prompt_text, word_list = _resolve_prompt_and_words(prompt, prompt_file, words, words_file)
     backend_kwargs = _build_backend_kwargs(backend, model, language, samplerate, duration,
-                                          silence_db, silence_onset_db, silence_duration,
+                                          silence_db, silence_duration,
+                                          vad_mode, vad_threshold, vad_min_silence_ms,
                                           download_folder_vosk, download_folder_whisper,
                                           download_folder_whisper_futo,
                                           realtime_delay, realtime_gate,
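
For readers unfamiliar with a dBFS volume floor, here is a minimal sketch of what a single-threshold gate like the `-40` default might compute. It assumes float samples in `[-1.0, 1.0]` and an RMS level; whether scribe measures RMS or peak is not shown in this diff, and `frame_dbfs` and `is_silent` are hypothetical names.

```python
import math


def frame_dbfs(samples: list) -> float:
    """RMS level of one frame in dBFS, for float samples in [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Digital silence has no finite level; report -inf instead of log10(0).
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def is_silent(samples: list, silence_db: float = -40.0) -> bool:
    """True when the frame level sits below the volume floor."""
    return frame_dbfs(samples) < silence_db
```

A constant amplitude of 0.1 sits near -20 dBFS and passes the gate, while 0.001 sits near -60 dBFS and is treated as silence. This also shows why soft syllables below the floor get dropped in dB mode.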
@@ -335,21 +339,9 @@ def get_parser():
     group.add_argument("-o", "--output-file",
                        help="Also append the transcription to this file.")
-    group = parser.add_argument_group("Silence detection (shared)")
+    group = parser.add_argument_group("Silence detection")
     group.add_argument("--duration", default=120, type=float,
                        help="Max recording duration in seconds (default: %(default)s).")
-    group.add_argument("--silence-db", default=None, type=float,
-                       help="LOW silence floor in dBFS — applied while we're "
-                            "already inside an utterance, so soft trailing "
-                            "syllables aren't cut. Default: -40. Used by every "
-                            "silence-driven behavior (pseudo-streaming pause "
-                            "detection, realtime gate, realtime auto-commit).")
-    group.add_argument("--silence-onset-db", default=None, type=float,
-                       help="HIGH silence floor in dBFS — applied before we've "
-                            "started capturing speech (audio buffer empty). "
-                            "Stricter so ambient noise (keyboard, breathing) "
-                            "doesn't trigger a chunk. Default: -25 in "
-                            "pseudo-streaming, same as --silence-db otherwise.")
     group.add_argument("--silence-duration", default=0.6, type=float,
                        help="Seconds of silence required before triggering a "
                             "backend's silence behavior (default: %(default)s). "
@@ -358,6 +350,31 @@ def get_parser():
                             "batch backends: candidate cut point within the "
                             "streaming window.")
+    group = parser.add_argument_group("Voice activity detection")
+    group.add_argument("--vad-mode", choices=("auto", "db", "silero"), default="auto",
+                       help="Silence-detection backend (default: %(default)s). "
+                            "'auto' picks silero if installed, dB otherwise. "
+                            "'silero' uses silero-vad — much more robust to "
+                            "ambient noise (ticks, fan, traffic) AND to soft "
+                            "speech (the dB gate drops sub-threshold syllables; "
+                            "silero recognises speech spectrally). "
+                            "'db' is a volume-threshold fallback used when "
+                            "onnxruntime is unavailable (see --silence-db). "
+                            "The dB and silero parameter groups are independent.")
+    group.add_argument("--vad-threshold", default=0.5, type=float,
+                       help="[silero only] Speech-probability threshold in [0,1] "
+                            "(default: %(default)s). Lower = more permissive (catches "
+                            "quiet speech but also more noise); higher = stricter.")
+    group.add_argument("--vad-min-silence-ms", default=300, type=int,
+                       help="[silero only] Minimum sustained low-probability span before "
+                            "speech-end is emitted, in ms (default: %(default)s). "
+                            "Acts as silero's onset/offset smoothing window.")
+    group.add_argument("--silence-db", default=None, type=float,
+                       help="[dB only] Silence floor in dBFS for the dB-mode "
+                            "fallback (default: -40). Ignored when "
+                            "--vad-mode=silero (or =auto and silero is "
+                            "available).")
     group = parser.add_argument_group("Realtime (gpt-realtime-whisper)")
     group.add_argument("--realtime-delay",
                        choices=("minimal", "low", "medium", "high", "xhigh"),
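
To make the interplay of `--vad-threshold` and `--vad-min-silence-ms` concrete, the sketch below scans per-frame speech probabilities and reports where speech-end would fire. It is an assumption-laden illustration: `speech_end_index` is a hypothetical name, the 32 ms frame length is assumed (512 samples at 16 kHz), and the package's actual smoothing logic is not part of this diff.

```python
def speech_end_index(probs, threshold=0.5, min_silence_ms=300, frame_ms=32):
    """Index of the frame at which speech-end would fire, or None (sketch)."""
    # Number of consecutive low-probability frames required (ceiling division).
    needed = -(-min_silence_ms // frame_ms)
    in_speech = False
    quiet = 0
    for i, p in enumerate(probs):
        if p >= threshold:
            # Speech frame: (re)enter speech and reset the silence counter.
            in_speech = True
            quiet = 0
        elif in_speech:
            quiet += 1
            if quiet >= needed:
                return i
    return None
```

With the defaults, 10 consecutive quiet frames are needed, so a brief dip in probability inside a word does not end the utterance.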
@@ -367,10 +384,10 @@ def get_parser():
                             "paste churn in the focused window).")
     group.add_argument("--realtime-gate", action=argparse.BooleanOptionalAction,
                        default=True,
-                       help="Drop silent frames (per --silence-db) before sending "
-                            "them over the WebSocket so silent audio isn't billed "
-                            "as input tokens (default: on; pass --no-realtime-gate "
-                            "to disable).")
+                       help="Drop silent frames (per the active --vad-mode) before "
+                            "sending them over the WebSocket so silent audio "
+                            "isn't billed as input tokens (default: on; pass "
+                            "--no-realtime-gate to disable).")
     group = parser.add_argument_group("Pseudo-streaming (experimental)")
     group.add_argument("--pseudo-streaming", action="store_true",
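
The new flags can be exercised in isolation with a cut-down parser. The sketch below copies only the argument names, types, and defaults that appear in the hunks above; the `scribe-sketch` program name is made up and help strings are omitted.

```python
import argparse

parser = argparse.ArgumentParser(prog="scribe-sketch")
group = parser.add_argument_group("Voice activity detection")
group.add_argument("--vad-mode", choices=("auto", "db", "silero"), default="auto")
group.add_argument("--vad-threshold", default=0.5, type=float)
group.add_argument("--vad-min-silence-ms", default=300, type=int)
# None here is later replaced by -40.0 in get_transcriber, per the diff.
group.add_argument("--silence-db", default=None, type=float)
parser.add_argument("--realtime-gate", action=argparse.BooleanOptionalAction,
                    default=True)

args = parser.parse_args(
    ["--vad-mode", "silero", "--vad-threshold", "0.4", "--no-realtime-gate"]
)
```

`argparse.BooleanOptionalAction` (Python 3.9+) is what generates the paired `--realtime-gate` / `--no-realtime-gate` flags.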
@@ -538,14 +555,24 @@ def create_app(micro, app_state):
     image = Image.open(Path(scribe_data.__file__).parent / "share" / "icon.png")
     image_recording = Image.open(Path(scribe_data.__file__).parent / "share" / "icon_recording.png")
     image_writing = Image.open(Path(scribe_data.__file__).parent / "share" / "icon_writing.png")
+    # Composite (red + writing 'a'): shown while recording AND the silence
+    # gate says speech is active. Gives the user a visual confirmation that
+    # the audio is actually being captured/sent — not just sitting in
+    # detected silence. Plain red = recording but waiting for speech.
+    image_recording_active = Image.alpha_composite(
+        image_recording.convert("RGBA"), image_writing.convert("RGBA"),
+    )
     if transcriber.backend == "vosk":
-        # Recording and writing happen at the same time in this backend.
-        image_recording = Image.alpha_composite(image_recording.convert("RGBA"), image_writing.convert("RGBA"))
+        # vosk transcribes while recording — both recording sub-states show
+        # the composite (no meaningful "waiting" since vosk streams
+        # continuously).
+        image_recording = image_recording_active
     state_images = {
         None: image,
         "recording": image_recording,
+        "recording_active": image_recording_active,
         "busy": image_writing,
     }
@@ -564,7 +591,12 @@ def create_app(micro, app_state):
             return "busy"
         s = icon._session
         if s.recording:
-            return "recording"
+            # session.waiting flips True after silence_duration of detected
+            # silence, False on the first non-silent chunk. The composite
+            # ("recording_active") tells the user audio is actually being
+            # sent to the backend — solves the "is it hearing me?" question
+            # without printing partial transcripts to the tray.
+            return "recording" if s.waiting else "recording_active"
         if s.busy:
             return "busy"
         return None
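
The new tray state selection can be summarised with a small stand-alone sketch. `Session` and `icon_state` are hypothetical stand-ins for the session object and the nested state function in `create_app`; the earlier `"busy"` check that precedes this code in the real function is not visible in the hunk and is left out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    """Hypothetical stand-in for the tray session flags used in the diff."""
    recording: bool = False
    waiting: bool = False
    busy: bool = False


def icon_state(s: Session) -> Optional[str]:
    """Pick the key used to look up the tray icon (sketch)."""
    if s.recording:
        # waiting=True: recording, but the gate currently hears only silence.
        return "recording" if s.waiting else "recording_active"
    if s.busy:
        return "busy"
    return None
```

The returned key indexes the `state_images` dict from the earlier hunk, where `"recording_active"` maps to the composite icon.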
