PyPI - scribe-cli - Versions diffs - 0.17.0__tar.gz → 0.18.0__tar.gz - Mend

scribe-cli 0.17.0__tar.gz → 0.18.0__tar.gz


Files changed (63)

{scribe_cli-0.17.0 → scribe_cli-0.18.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: scribe-cli
-Version: 0.17.0
+Version: 0.18.0
 Summary: Speech-to-text CLI and system-tray app for dictating into any focused window. Local (vosk, faster-whisper) or cloud (groq, openai) backends, batch or streaming.
 Author-email: Mahé Perrette <mahe.perrette@gmail.com>
 License: MIT License
@@ -52,6 +52,7 @@ Requires-Dist: unidecode
 Requires-Dist: termcolor
 Requires-Dist: platformdirs
 Requires-Dist: desktop-ai-core>=0.2.0
+Requires-Dist: onnxruntime
 Provides-Extra: keyboard
 Requires-Dist: pynput; extra == "keyboard"
 Provides-Extra: whisper
@@ -69,6 +70,7 @@ Requires-Dist: soundfile; extra == "openai"
 Provides-Extra: groq
 Requires-Dist: openai<3,>=2.37.0; extra == "groq"
 Requires-Dist: soundfile; extra == "groq"
+Provides-Extra: vad
 Provides-Extra: all
 Requires-Dist: pynput; extra == "all"
 Requires-Dist: faster-whisper; extra == "all"

{scribe_cli-0.17.0 → scribe_cli-0.18.0}/docs/backends.md RENAMED

@@ -149,9 +149,10 @@ differently:
 | Backend                              | `--prompt`                    | `--words`                                              |
 |--------------------------------------|-------------------------------|--------------------------------------------------------|
 | `whisper` (faster-whisper, local)    | passed as `initial_prompt=`   | passed as `hotwords=` — a **dedicated biasing channel** separate from the prompt |
+| `whisper-futo` (pywhispercpp, local) | passed as `initial_prompt=`   | joined onto the prompt string (no separate hotwords channel here) |
 | `openai` batch (`gpt-4o*-transcribe`) | passed as `prompt=`           | joined onto the prompt string                          |
 | `groq` (`whisper-large-v3-turbo`)     | passed as `prompt=`           | joined onto the prompt string                          |
-| `openai` realtime (`gpt-realtime-whisper`) | included in the session config as `transcription.prompt` | joined onto the prompt string |
+| `openai` realtime (`gpt-realtime-whisper`) | *silently ignored* — the model rejects the prompt parameter server-side (HTTP 400 *"The 'prompt' parameter is not supported for this model."*). The kwarg stays accepted for plumbing compatibility but never reaches the API. | same — joined into the (ignored) prompt |
 | `vosk`                               | *ignored* (no soft prompt)    | *ignored* (Vosk only supports a hard `grammar` allowlist; not yet exposed) |
 The whisper-family APIs cap the prompt around ~224 tokens; longer
@@ -202,3 +203,36 @@ more than latency.
 This is experimental and off by default. The tray menu surfaces the
 same toggle under Options ▶ Advanced ▶ Pseudo-streaming.
+### Cross-chunk prompt context
+In pseudo-streaming mode scribe automatically augments each chunk's
+prompt with the trailing ~200 characters of the *previous* chunk's
+transcription. This rolling tail is concatenated onto whatever static
+`--prompt` / `--words` you configured and reaches the backend through
+the same channel as the static prompt (the vocabulary biasing table
+above). The motivation is cross-chunk continuity:
+- **Capitalization drift** — without context, a chunk that starts
+  right after a period might come back lowercased.
+- **Article gender (FR/IT/ES/…)** — `"la nouveau"` → `"le nouveau"`
+  once the prior chunk has established the noun.
+- **Language lock** — `whisper.cpp` auto-detects language per call;
+  feeding the previous chunk's tokens keeps the language stable
+  across cuts.
+Whisper's prompt window is capped at ~224 tokens; 200 chars of French
+sits well under that and leaves room for your static prompt + words
+list.
+The rolling tail is **dropped** when the silence between two
+utterances exceeds 1.5 seconds — a long pause is treated as a new
+sentence/idea boundary, where carrying a possibly-bad prior chunk
+forward biases the next one more than it helps. This mirrors
+`whisper.cpp`'s `--keep-context off` default: prior-text conditioning
+can self-reinforce errors (hallucinations, decoder repetition loops)
+more readily than it provides useful continuity, so we cap it at
+natural sentence boundaries.
+Short pauses (mid-sentence punctuation) keep the context; the cut at
+the start of every new recording also clears it.
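
The rolling-tail rule described in this hunk can be sketched as follows. This is a minimal illustration, not scribe's implementation: the class and method names (`RollingPromptContext`, `build_prompt`, `update`, `reset`) are hypothetical, and only the ~200-character tail and the 1.5-second reset threshold come from the text above.

```python
from dataclasses import dataclass

TAIL_CHARS = 200        # "~200 characters" of the previous chunk, per the docs
MAX_GAP_SECONDS = 1.5   # a longer silence drops the rolling tail


@dataclass
class RollingPromptContext:
    """Hypothetical helper: carries the previous chunk's tail into the next prompt."""
    static_prompt: str = ""
    tail: str = ""

    def build_prompt(self, silence_before: float) -> str:
        # A long pause is treated as a new sentence/idea boundary.
        if silence_before > MAX_GAP_SECONDS:
            self.tail = ""
        parts = [p for p in (self.static_prompt, self.tail) if p]
        return " ".join(parts)

    def update(self, transcription: str) -> None:
        # Keep only the trailing characters of the latest chunk.
        self.tail = transcription[-TAIL_CHARS:]

    def reset(self) -> None:
        # Would be called at the start of every new recording.
        self.tail = ""
```

Passing the measured silence into `build_prompt` keeps the reset decision next to the prompt assembly, so a caller cannot forget to clear a stale tail before the next chunk.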

{scribe_cli-0.17.0 → scribe_cli-0.18.0}/docs/cli.md RENAMED

@@ -65,20 +65,39 @@ flag suppresses only its own side (giving `--prompt ""` still loads
 | `--type-direct`             | In keystroke mode, type the transcription as keystrokes instead of synthesising Ctrl+V.     |
 | `-o, --output-file FILE`    | Also append the transcription to this file.                                                 |
-## Silence detection (shared)
+## Silence detection
 | Flag                       | Default | Purpose                                                                |
 |----------------------------|---------|------------------------------------------------------------------------|
 | `--duration SECS`          | `120`   | Max recording duration in seconds.                                     |
-| `--silence-db DB`          | `-40`   | dBFS volume floor for "this frame is silent". Used by every silence-driven behavior. |
 | `--silence-duration SECS`  | `0.6`   | How long silence must persist before triggering a backend's silence behavior (realtime auto-commit, pseudo-streaming cut). |
+## Voice activity detection
+scribe ships two silence-detection backends. By default
+(`--vad-mode auto`) it picks **silero-vad** when `onnxruntime` is
+importable (always true on a stock `pip install scribe-cli` since
+`onnxruntime` is a base dependency) and falls back to a plain dB
+volume threshold otherwise. silero is much more robust to ambient
+noise (clicks, fan, traffic) and to soft speech than dB, which drops
+sub-threshold syllables and gets fooled by loud non-speech.
+The dB and silero parameter groups are independent — the inactive
+mode's knobs are ignored.
+| Flag                          | Default | Purpose                                                                |
+|-------------------------------|---------|------------------------------------------------------------------------|
+| `--vad-mode {auto,db,silero}` | `auto`  | Silence-detection backend. `auto` picks silero when available, dB otherwise. |
+| `--vad-threshold FLOAT`       | `0.5`   | **[silero only]** Speech-probability threshold in `[0,1]`. Lower = more permissive (catches quiet speech and more noise); higher = stricter. |
+| `--vad-min-silence-ms INT`    | `300`   | **[silero only]** Minimum sustained low-probability span before speech-end fires, in ms. silero's onset/offset smoothing window. |
+| `--silence-db DB`             | `-40`   | **[dB only]** dBFS volume floor for "this frame is silent". Ignored when silero is the active mode. |
 ## Realtime (`gpt-realtime-whisper`)
 | Flag                                              | Default  | Purpose                                                                                                                                                                                  |
 |---------------------------------------------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | `--realtime-delay {minimal,low,medium,high,xhigh}` | `medium` | Trade off latency vs accuracy on `gpt-realtime-whisper`. Lower = faster partials but more paste churn in the focused window.                                                             |
-| `--realtime-gate` / `--no-realtime-gate`          | on       | Drop silent frames (per `--silence-db`) before sending them over the WebSocket so silent audio isn't billed as input tokens. After `--silence-duration` of silence, also commit mid-session so trailing words flush live. |
+| `--realtime-gate` / `--no-realtime-gate`          | on       | Drop silent frames (per the active `--vad-mode`) before sending them over the WebSocket so silent audio isn't billed as input tokens. After `--silence-duration` of silence, also commit mid-session so trailing words flush live. |
 Streaming models (Vosk, `gpt-realtime-whisper`) ignore the batch
 silence-chunking knobs; they have their own end-of-utterance signal.
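
The `--vad-mode auto` selection and the "inactive mode's knobs are ignored" rule documented in this hunk could look roughly like the sketch below. The function names (`resolve_vad_mode`, `active_knobs`) are assumptions made for illustration and do not appear in the diff. What happens when `silero` is requested explicitly without `onnxruntime` is not stated in the source, so the sketch simply returns the requested mode.

```python
import importlib.util
from typing import Optional


def resolve_vad_mode(requested: str = "auto",
                     onnx_available: Optional[bool] = None) -> str:
    """Hypothetical resolver: map --vad-mode to the backend actually used."""
    if requested not in ("auto", "db", "silero"):
        raise ValueError(f"unknown vad mode: {requested!r}")
    if onnx_available is None:
        # find_spec returns None when the package cannot be found.
        onnx_available = importlib.util.find_spec("onnxruntime") is not None
    if requested == "auto":
        return "silero" if onnx_available else "db"
    return requested


def active_knobs(mode: str, vad_threshold: float = 0.5,
                 vad_min_silence_ms: int = 300,
                 silence_db: Optional[float] = None) -> dict:
    """Return only the parameters that apply to the resolved mode."""
    if mode == "silero":
        return {"vad_threshold": vad_threshold,
                "vad_min_silence_ms": vad_min_silence_ms}
    # dB mode: None means "not given on the command line", so use -40 dBFS.
    return {"silence_db": -40.0 if silence_db is None else silence_db}
```

The `None` default for `silence_db` mirrors the change in this release, where the parser default moved from `-40.0` to `None` and the fallback value is filled in later.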

{scribe_cli-0.17.0 → scribe_cli-0.18.0}/docs/keyboard.md RENAMED

@@ -167,3 +167,34 @@ If `eitype` is unavailable, two older workarounds also work:
 Roadmap for native libei integration (eventual Python bindings,
 expanded compositor support) is tracked in
 [docs/roadmap-libei.md](roadmap-libei.md).
+## Realtime backend: delta coalescing
+The `gpt-realtime-whisper` backend emits one transcription delta per
+word/subword at ~30–80 ms intervals — much faster than the
+`pyperclip.copy()` + Ctrl+V cycle can settle on Wayland (≥100 ms,
+because `wl-copy` is asynchronous). Pasting every delta led to
+clipboard races where successive copies overwrote each other before
+Ctrl+V landed, manifesting as dropped and duplicated words
+(*"fait fait le mot mot time time…"*).
+In **paste mode** (default keystroke output) scribe therefore
+coalesces deltas: incoming tokens accumulate into a small buffer and
+are flushed only when *either* ~400 ms have elapsed since the last
+flush, *or* the buffer ends on sentence-final punctuation
+(`. ! ? \n`). A 200 ms floor between any two flushes prevents
+back-to-back punctuation flushes from racing each other through the
+clipboard.
+With **`--type-direct`** the coalescing is bypassed entirely — each
+delta goes through the chosen typer as a raw keystroke synchronously
+(uinput / xtest / portal libei), no clipboard involved, no race to
+defeat. The UX is also snappier: tokens appear one at a time rather
+than in ~400 ms-cadenced bursts.
+macOS and Windows clipboards are synchronous, so the race that
+motivates coalescing is essentially a Wayland artefact; scribe still
+coalesces in paste mode there for consistency, but it's harmless.
+This whole behaviour is realtime-specific — Vosk's per-phrase commits
+already arrive at a sane cadence, and the pseudo-streaming backends
+emit one chunk per silence cut (already coarse enough).
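
The flush rules described in this hunk (about 400 ms since the last flush, or sentence-final punctuation, with a 200 ms floor between flushes) can be sketched as a small buffer. `DeltaCoalescer` and its methods are hypothetical names, and timestamps are passed in explicitly so the logic can be exercised without a real clock; the actual backend may be structured quite differently.

```python
from typing import Optional

FLUSH_INTERVAL = 0.400   # flush once this much time has passed since the last flush
MIN_GAP = 0.200          # floor between two consecutive flushes
SENTENCE_END = (".", "!", "?", "\n")


class DeltaCoalescer:
    """Hypothetical buffer that batches realtime deltas before pasting."""

    def __init__(self, start: float = 0.0) -> None:
        self.buffer = ""
        self.last_flush = start

    def push(self, delta: str, now: float) -> Optional[str]:
        """Add a delta; return the text to paste if a flush is due, else None."""
        self.buffer += delta
        elapsed = now - self.last_flush
        if elapsed < MIN_GAP:
            return None  # too soon after the previous flush
        if elapsed >= FLUSH_INTERVAL or self.buffer.endswith(SENTENCE_END):
            return self.flush(now)
        return None

    def flush(self, now: float) -> Optional[str]:
        """Emit whatever is buffered (also usable at end of stream)."""
        if not self.buffer:
            return None
        text, self.buffer = self.buffer, ""
        self.last_flush = now
        return text
```

In this sketch a delta that arrives inside the 200 ms floor stays buffered until a later `push` or an explicit `flush`, so a real implementation would presumably also need a timer or an end-of-stream flush to avoid holding the last words back.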

{scribe_cli-0.17.0 → scribe_cli-0.18.0}/docs/tray.md RENAMED

@@ -58,10 +58,15 @@ Options ▶
     Keyboard backend ▶            eitype / pynput / ydotool / wtype
                                   (rows incompatible with this OS are hidden;
                                    submenu hidden entirely when ≤ 1 row left)
-    Advanced ▶                    silence duration, silence threshold,
-                                    realtime gate, pseudo-streaming
-                                    [experimental], streaming window
-                                    [experimental], output file
+    Advanced ▶                    silence duration, VAD mode toggle
+                                    (silero ↔ dB), per-mode VAD knobs
+                                    (silero: speech-probability threshold,
+                                    min silence duration; dB: silence
+                                    threshold — only the active mode's
+                                    knobs are shown), realtime gate,
+                                    pseudo-streaming [experimental],
+                                    streaming window [experimental],
+                                    output file
 Quit
 ```

{scribe_cli-0.17.0 → scribe_cli-0.18.0}/pyproject.toml RENAMED

@@ -22,6 +22,11 @@ dependencies = [
     "termcolor",
     "platformdirs",
     "desktop-ai-core>=0.2.0",
+    # Runs the bundled silero VAD ONNX model (~2 MB shipped in scribe_data).
+    # In base deps so silero is available out of the box — see scribe/audio.py.
+    # `faster-whisper` already pulls it transitively, so installing with
+    # [whisper] is free; standalone adds ~57 MB which is trivial for an STT tool.
+    "onnxruntime",
 ]
 classifiers = [
@@ -67,12 +72,18 @@ vosk = ["vosk"]
 app = ["pystray", "PyGObject"]
 openai = ["openai>=2.37.0,<3", "soundfile"]
 groq = ["openai>=2.37.0,<3", "soundfile"]
+# [vad] is now a no-op alias kept for back-compat (`pip install scribe-cli[vad]`
+# was the documented install before onnxruntime moved into base deps).
+vad = []
 all = ["pynput", "faster-whisper", "pywhispercpp", "openai>=2.37.0,<3", "soundfile", "vosk", "pystray"]
 [tool.setuptools]
 packages = [ "scribe", "scribe_data" ]
+[tool.setuptools.package-data]
+scribe_data = ["share/*.png", "templates/*", "silero_vad.onnx", "silero_vad.LICENSE"]
 [tool.setuptools_scm]
 write_to = "scribe/_version.py"

{scribe_cli-0.17.0 → scribe_cli-0.18.0}/scribe/_version.py RENAMED

@@ -18,7 +18,7 @@ version_tuple: tuple[int | str, ...]
 commit_id: str | None
 __commit_id__: str | None
-__version__ = version = '0.17.0'
-__version_tuple__ = version_tuple = (0, 17, 0)
+__version__ = version = '0.18.0'
+__version_tuple__ = version_tuple = (0, 18, 0)
-__commit_id__ = commit_id = 'gbfcd2e228'
+__commit_id__ = commit_id = 'gd48d707c7'

{scribe_cli-0.17.0 → scribe_cli-0.18.0}/scribe/app.py RENAMED

@@ -66,7 +66,10 @@ class DummyTranscriber:
 whisper_models = ["tiny", "base", "small", "medium", "large-v3", "large-v3-turbo"]
 whisper_english_models = ["tiny.en", "base.en", "small.en", "medium.en"]
-# FUTO ACFT publishes only tiny/base/small (+ .en variants); no medium/large/turbo.
+# FUTO ACFT publishes only tiny/base/small (+ .en variants). Community
+# conversions exist for large/turbo but their large-v3 encoder is
+# incompatible with the audio_ctx shrinkage that's the point of this
+# backend — for large models use the `whisper` backend instead.
 whisper_futo_models = ["tiny", "base", "small"]
 whisper_futo_english_models = ["tiny.en", "base.en", "small.en"]
 whisperapi_models = ["gpt-4o-transcribe", "gpt-4o-mini-transcribe", "gpt-realtime-whisper"]
@@ -169,6 +172,7 @@ def _resolve_prompt_and_words(prompt_text, prompt_file, words, words_file):
 def _build_backend_kwargs(backend, model, language, samplerate, duration,
                           silence_db, silence_duration,
+                          vad_mode, vad_threshold, vad_min_silence_ms,
                           download_folder_vosk, download_folder_whisper,
                           download_folder_whisper_futo,
                           realtime_delay, realtime_gate,
@@ -183,6 +187,8 @@ def _build_backend_kwargs(backend, model, language, samplerate, duration,
         word_blob = " ".join(words)
         merged_prompt = f"{prompt_text} {word_blob}" if prompt_text else word_blob
+    vad_kwargs = dict(vad_mode=vad_mode, vad_threshold=vad_threshold,
+                      vad_min_silence_ms=vad_min_silence_ms)
     if backend == "vosk":
         # Vosk has no soft prompt; only a hard grammar. Silently ignore for now.
         return dict(model_name=model, language=language, samplerate=samplerate,
@@ -190,25 +196,33 @@ def _build_backend_kwargs(backend, model, language, samplerate, duration,
                     model_kwargs={"download_root": download_folder_vosk})
     if backend == "whisper":
         return dict(model_name=model, language=language, samplerate=samplerate,
-                    timeout=duration, silence_duration=silence_duration, silence_thresh=silence_db,
+                    timeout=duration, silence_duration=silence_duration,
+                    silence_thresh=silence_db,
                     pseudo_streaming=pseudo_streaming, streaming_window=streaming_window,
                     prompt=prompt_text,
                     hotwords=(" ".join(words) if words else None),
-                    model_kwargs={"download_root": download_folder_whisper})
+                    model_kwargs={"download_root": download_folder_whisper},
+                    **vad_kwargs)
     if backend == "whisper-futo":
-        # whisper.cpp via pywhispercpp doesn't take prompt/hotwords through the
-        # same surface; drop them for now. Audio_ctx is computed per-call inside
-        # the backend from actual audio length (the ACFT speedup).
+        # pywhispercpp 1.4.1 exposes `initial_prompt`; the backend folds
+        # words+prompt into it (and adds a rolling chunk-tail in
+        # pseudo-streaming). No separate hotwords channel here — fold
+        # everything into the prompt like the cloud backends do.
         return dict(model_name=model, language=language, samplerate=samplerate,
-                    timeout=duration, silence_duration=silence_duration, silence_thresh=silence_db,
+                    timeout=duration, silence_duration=silence_duration,
+                    silence_thresh=silence_db,
                     pseudo_streaming=pseudo_streaming, streaming_window=streaming_window,
-                    download_folder=download_folder_whisper_futo)
+                    prompt=merged_prompt,
+                    download_folder=download_folder_whisper_futo,
+                    **vad_kwargs)
     if backend in ("openai", "groq"):
         from scribe.backends.openai_api import REALTIME_MODELS
         kwargs = dict(model_name=model, samplerate=samplerate,
-                      timeout=duration, silence_duration=silence_duration, silence_thresh=silence_db,
+                      timeout=duration, silence_duration=silence_duration,
+                      silence_thresh=silence_db,
                       pseudo_streaming=pseudo_streaming, streaming_window=streaming_window,
-                      prompt=merged_prompt)
+                      prompt=merged_prompt,
+                      **vad_kwargs)
         if backend == "openai" and model in REALTIME_MODELS:
             kwargs["realtime_delay"] = realtime_delay
             kwargs["realtime_gate"] = realtime_gate
@@ -223,7 +237,8 @@ def _build_backend_kwargs(backend, model, language, samplerate, duration,
 def get_transcriber(model=None, backend=None, dummy=False, interactive=True, language=None,
                     samplerate=None, duration=None,
-                    silence_db=-40.0, silence_duration=0.6,
+                    silence_db=None, silence_duration=0.6,
+                    vad_mode="auto", vad_threshold=0.5, vad_min_silence_ms=300,
                     download_folder_vosk=None, download_folder_whisper=None,
                     download_folder_whisper_futo=None,
                     realtime_delay="medium", realtime_gate=True,
@@ -253,9 +268,14 @@ def get_transcriber(model=None, backend=None, dummy=False, interactive=True, lan
     else:
         model = _prompt_model_for_backend(backend, language, interactive)
     print(f"Selected model: {model}")
+    # silence_db is the single volume floor used by the dB fallback. Silero
+    # mode ignores it. Default -40 dBFS — keeps the gate simple by design.
+    if silence_db is None:
+        silence_db = -40.0
     prompt_text, word_list = _resolve_prompt_and_words(prompt, prompt_file, words, words_file)
     backend_kwargs = _build_backend_kwargs(backend, model, language, samplerate, duration,
                                           silence_db, silence_duration,
+                                          vad_mode, vad_threshold, vad_min_silence_ms,
                                           download_folder_vosk, download_folder_whisper,
                                           download_folder_whisper_futo,
                                           realtime_delay, realtime_gate,
@@ -319,14 +339,9 @@ def get_parser():
     group.add_argument("-o", "--output-file",
                        help="Also append the transcription to this file.")
-    group = parser.add_argument_group("Silence detection (shared)")
+    group = parser.add_argument_group("Silence detection")
     group.add_argument("--duration", default=120, type=float,
                        help="Max recording duration in seconds (default: %(default)s).")
-    group.add_argument("--silence-db", default=-40.0, type=float,
-                       help="dBFS volume floor for 'this frame is silent' "
-                            "(default: %(default)s). Used by every silence-driven "
-                            "behavior (realtime gate, realtime auto-commit, "
-                            "pseudo-streaming chunking).")
     group.add_argument("--silence-duration", default=0.6, type=float,
                        help="Seconds of silence required before triggering a "
                             "backend's silence behavior (default: %(default)s). "
@@ -335,6 +350,31 @@ def get_parser():
                             "batch backends: candidate cut point within the "
                             "streaming window.")
+    group = parser.add_argument_group("Voice activity detection")
+    group.add_argument("--vad-mode", choices=("auto", "db", "silero"), default="auto",
+                       help="Silence-detection backend (default: %(default)s). "
+                            "'auto' picks silero if installed, dB otherwise. "
+                            "'silero' uses silero-vad — much more robust to "
+                            "ambient noise (ticks, fan, traffic) AND to soft "
+                            "speech (the dB gate drops sub-threshold syllables; "
+                            "silero recognises speech spectrally). "
+                            "'db' is a volume-threshold fallback used when "
+                            "onnxruntime is unavailable (see --silence-db). "
+                            "The dB and silero parameter groups are independent.")
+    group.add_argument("--vad-threshold", default=0.5, type=float,
+                       help="[silero only] Speech-probability threshold in [0,1] "
+                            "(default: %(default)s). Lower = more permissive (catches "
+                            "quiet speech but also more noise); higher = stricter.")
+    group.add_argument("--vad-min-silence-ms", default=300, type=int,
+                       help="[silero only] Minimum sustained low-probability span before "
+                            "speech-end is emitted, in ms (default: %(default)s). "
+                            "Acts as silero's onset/offset smoothing window.")
+    group.add_argument("--silence-db", default=None, type=float,
+                       help="[dB only] Silence floor in dBFS for the dB-mode "
+                            "fallback (default: -40). Ignored when "
+                            "--vad-mode=silero (or =auto and silero is "
+                            "available).")
     group = parser.add_argument_group("Realtime (gpt-realtime-whisper)")
     group.add_argument("--realtime-delay",
                        choices=("minimal", "low", "medium", "high", "xhigh"),
@@ -344,10 +384,10 @@ def get_parser():
                             "paste churn in the focused window).")
     group.add_argument("--realtime-gate", action=argparse.BooleanOptionalAction,
                        default=True,
-                       help="Drop silent frames (per --silence-db) before sending "
-                            "them over the WebSocket so silent audio isn't billed "
-                            "as input tokens (default: on; pass --no-realtime-gate "
-                            "to disable).")
+                       help="Drop silent frames (per the active --vad-mode) before "
+                            "sending them over the WebSocket so silent audio "
+                            "isn't billed as input tokens (default: on; pass "
+                            "--no-realtime-gate to disable).")
     group = parser.add_argument_group("Pseudo-streaming (experimental)")
     group.add_argument("--pseudo-streaming", action="store_true",
@@ -399,8 +439,16 @@ def start_recording(micro, session, mode="keystroke", typer="auto",
     # Query the live transcriber instance — the registered class may dispatch
     # to a streaming sibling for specific models (e.g. openai →
     # gpt-realtime-whisper), so a class-level lookup via BACKENDS would lie.
+    # Pseudo-streaming also yields chunks (silence-cut batch transcriptions)
+    # so the output should treat it the same: live paste/type per chunk.
     backend_obj = getattr(session, "backend", session)
-    is_streaming = bool(getattr(backend_obj, "supports_streaming", False)) if not isinstance(backend_obj, str) else False
+    if isinstance(backend_obj, str):
+        is_streaming = False
+    else:
+        is_streaming = (
+            bool(getattr(backend_obj, "supports_streaming", False))
+            or bool(getattr(backend_obj, "pseudo_streaming", False))
+        )
     # Clipboard is written in clipboard mode (the user pastes manually) and in
     # paste-based keystroke mode (the paste source). type_direct keystroke
     # mode bypasses the clipboard entirely — we type the chunks/text raw.
@@ -427,6 +475,16 @@ def start_recording(micro, session, mode="keystroke", typer="auto",
         import pyperclip
         session.log("The transcription will be copied to clipboard as it becomes available.")
+    # Tell streaming backends whether their output is about to hit the
+    # clipboard-paste race or a direct-keystroke typer. The realtime
+    # backend's per-token deltas only need coalescing in paste mode;
+    # type-direct (ydotool/wtype/pynput via uinput/xtest) types each
+    # character synchronously and benefits from raw per-delta emission
+    # for snappier UX. Set as a plain attribute — backends that don't
+    # implement coalescing ignore it.
+    if not isinstance(backend_obj, str) and hasattr(backend_obj, "_coalesce_deltas"):
+        backend_obj._coalesce_deltas = do_live_paste
     fulltext = ""
     for result in session.start_recording(micro, **greetings):
@@ -497,14 +555,24 @@ def create_app(micro, app_state):
     image = Image.open(Path(scribe_data.__file__).parent / "share" / "icon.png")
     image_recording = Image.open(Path(scribe_data.__file__).parent / "share" / "icon_recording.png")
     image_writing = Image.open(Path(scribe_data.__file__).parent / "share" / "icon_writing.png")
+    # Composite (red + writing 'a'): shown while recording AND the silence
+    # gate says speech is active. Gives the user a visual confirmation that
+    # the audio is actually being captured/sent — not just sitting in
+    # detected silence. Plain red = recording but waiting for speech.
+    image_recording_active = Image.alpha_composite(
+        image_recording.convert("RGBA"), image_writing.convert("RGBA"),
+    )
     if transcriber.backend == "vosk":
-        # Recording and writing happen at the same time in this backend.
-        image_recording = Image.alpha_composite(image_recording.convert("RGBA"), image_writing.convert("RGBA"))
+        # vosk transcribes while recording — both recording sub-states show
+        # the composite (no meaningful "waiting" since vosk streams
+        # continuously).
+        image_recording = image_recording_active
     state_images = {
         None: image,
         "recording": image_recording,
+        "recording_active": image_recording_active,
         "busy": image_writing,
     }
@@ -523,7 +591,12 @@ def create_app(micro, app_state):
             return "busy"
         s = icon._session
         if s.recording:
-            return "recording"
+            # session.waiting flips True after silence_duration of detected
+            # silence, False on the first non-silent chunk. The composite
+            # ("recording_active") tells the user audio is actually being
+            # sent to the backend — solves the "is it hearing me?" question
+            # without printing partial transcripts to the tray.
+            return "recording" if s.waiting else "recording_active"
         if s.busy:
             return "busy"
         return None
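
The tray-icon state selection in the last hunk can be exercised in isolation with a stand-in session object. The sketch below covers only the branch visible in the diff; `icon_state` is a hypothetical name, and the earlier `"busy"` return whose condition is cut off by the hunk boundary is deliberately left out.

```python
from types import SimpleNamespace


def icon_state(session):
    """Hypothetical stand-in for the visible part of the tray state callback."""
    if session.recording:
        # waiting=True: silence detected, plain recording icon.
        # waiting=False: speech is active, composite icon.
        return "recording" if session.waiting else "recording_active"
    if session.busy:
        return "busy"
    return None


# Usage: a fake session with the three attributes the callback reads.
session = SimpleNamespace(recording=True, waiting=False, busy=False)
state = icon_state(session)
```

Each returned key maps onto an entry of the `state_images` dictionary from the preceding hunk, where `"recording_active"` is the newly added composite icon.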
