PyPI - scribe-cli - Versions diffs - 0.17.0__tar.gz → 0.17.1__tar.gz - Mend

scribe-cli 0.17.0__tar.gz → 0.17.1__tar.gz


Files changed (59)

{scribe_cli-0.17.0 → scribe_cli-0.17.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: scribe-cli
-Version: 0.17.0
+Version: 0.17.1
 Summary: Speech-to-text CLI and system-tray app for dictating into any focused window. Local (vosk, faster-whisper) or cloud (groq, openai) backends, batch or streaming.
 Author-email: Mahé Perrette <mahe.perrette@gmail.com>
 License: MIT License

{scribe_cli-0.17.0 → scribe_cli-0.17.1}/docs/backends.md RENAMED

@@ -149,9 +149,10 @@ differently:
 | Backend                              | `--prompt`                    | `--words`                                              |
 |--------------------------------------|-------------------------------|--------------------------------------------------------|
 | `whisper` (faster-whisper, local)    | passed as `initial_prompt=`   | passed as `hotwords=` — a **dedicated biasing channel** separate from the prompt |
+| `whisper-futo` (pywhispercpp, local) | passed as `initial_prompt=`   | joined onto the prompt string (no separate hotwords channel here) |
 | `openai` batch (`gpt-4o*-transcribe`) | passed as `prompt=`           | joined onto the prompt string                          |
 | `groq` (`whisper-large-v3-turbo`)     | passed as `prompt=`           | joined onto the prompt string                          |
-| `openai` realtime (`gpt-realtime-whisper`) | included in the session config as `transcription.prompt` | joined onto the prompt string |
+| `openai` realtime (`gpt-realtime-whisper`) | *silently ignored* — the model rejects the prompt parameter server-side (HTTP 400 *"The 'prompt' parameter is not supported for this model."*). The kwarg stays accepted for plumbing compatibility but never reaches the API. | same — joined into the (ignored) prompt |
 | `vosk`                               | *ignored* (no soft prompt)    | *ignored* (Vosk only supports a hard `grammar` allowlist; not yet exposed) |
 The whisper-family APIs cap the prompt around ~224 tokens; longer
@@ -202,3 +203,36 @@ more than latency.
 This is experimental and off by default. The tray menu surfaces the
 same toggle under Options ▶ Advanced ▶ Pseudo-streaming.
+### Cross-chunk prompt context
+In pseudo-streaming mode scribe automatically augments each chunk's
+prompt with the trailing ~200 characters of the *previous* chunk's
+transcription. This rolling tail is concatenated onto whatever static
+`--prompt` / `--words` you configured and reaches the backend through
+the same channel as the static prompt (the vocabulary biasing table
+above). The motivation is cross-chunk continuity:
+- **Capitalization drift** — without context, a chunk that starts
+  right after a period might come back lowercased.
+- **Article gender (FR/IT/ES/…)** — `"la nouveau"` → `"le nouveau"`
+  once the prior chunk has established the noun.
+- **Language lock** — `whisper.cpp` auto-detects language per call;
+  feeding the previous chunk's tokens keeps the language stable
+  across cuts.
+Whisper's prompt window is capped at ~224 tokens; 200 chars of French
+sits well under that and leaves room for your static prompt + words
+list.
+The rolling tail is **dropped** whenever the pause that triggered the
+chunk cut exceeded 1.5 seconds — a long pause is treated as a new
+sentence/idea boundary, where carrying a possibly-bad prior chunk
+forward biases the next one more than it helps. This mirrors
+`whisper.cpp`'s `--keep-context off` default: prior-text conditioning
+can self-reinforce errors (hallucinations, decoder repetition loops)
+more readily than it provides useful continuity, so we cap it at
+natural sentence boundaries.
+Short pauses (mid-sentence punctuation) keep the context; the cut at
+the start of every new recording also clears it.
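
The rolling-tail rule described in this hunk can be sketched as follows. This is a minimal illustration under stated assumptions, not scribe's implementation: the diff only shows the call sites `compose_prompt()` and `update_streaming_context()`, so the class name `RollingPromptContext`, its methods, and the choice to join prompt and tail with a single space are all hypothetical. The 200-character and 1.5-second figures come from the documentation text above.

```python
from __future__ import annotations

TAIL_CHARS = 200    # "trailing ~200 characters" per the docs above
LONG_PAUSE_S = 1.5  # a cut pause longer than this drops the tail


class RollingPromptContext:
    """Hypothetical helper illustrating cross-chunk prompt context."""

    def __init__(self, static_prompt: str | None = None):
        self.static_prompt = static_prompt or ""
        self.tail = ""

    def start_recording(self) -> None:
        # The start of every new recording clears the context.
        self.tail = ""

    def compose(self) -> str:
        # Static prompt first, then the rolling tail (assumed separator: " ").
        return " ".join(p for p in (self.static_prompt, self.tail) if p)

    def update(self, transcription: str, cut_pause_s: float) -> None:
        # A long pause is treated as a sentence/idea boundary: carry nothing.
        if cut_pause_s > LONG_PAUSE_S:
            self.tail = ""
        else:
            self.tail = transcription.strip()[-TAIL_CHARS:]


ctx = RollingPromptContext("Glossary: scribe, vosk.")
ctx.update("Bonjour tout le monde,", cut_pause_s=0.7)
print(ctx.compose())
```

Passing an empty transcription (for example a chunk that was filtered out as a hallucination) clears the tail in this sketch; whether scribe clears or keeps it in that case is not visible in the diff.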

{scribe_cli-0.17.0 → scribe_cli-0.17.1}/docs/keyboard.md RENAMED

@@ -167,3 +167,34 @@ If `eitype` is unavailable, two older workarounds also work:
 Roadmap for native libei integration (eventual Python bindings,
 expanded compositor support) is tracked in
 [docs/roadmap-libei.md](roadmap-libei.md).
+## Realtime backend: delta coalescing
+The `gpt-realtime-whisper` backend emits one transcription delta per
+word/subword at ~30–80 ms intervals — much faster than the
+`pyperclip.copy()` + Ctrl+V cycle can settle on Wayland (≥100 ms,
+because `wl-copy` is asynchronous). Pasting every delta led to
+clipboard races where successive copies overwrote each other before
+Ctrl+V landed, manifesting as dropped and duplicated words
+(*"fait fait le mot mot time time…"*).
+In **paste mode** (default keystroke output) scribe therefore
+coalesces deltas: incoming tokens accumulate into a small buffer and
+are flushed only when *either* ~400 ms have elapsed since the last
+flush, *or* the buffer ends on sentence-final punctuation
+(`. ! ? \n`). A 200 ms floor between any two flushes prevents
+back-to-back punctuation flushes from racing each other through the
+clipboard.
+With **`--type-direct`** the coalescing is bypassed entirely — each
+delta goes through the chosen typer as a raw keystroke synchronously
+(uinput / xtest / portal libei), no clipboard involved, no race to
+defeat. The UX is also snappier: tokens appear one at a time rather
+than in ~400 ms-cadenced bursts.
+macOS and Windows clipboards are synchronous, so the race that
+motivates coalescing is essentially a Wayland artefact; scribe still
+coalesces in paste mode there for consistency, but it's harmless.
+This whole behaviour is realtime-specific — Vosk's per-phrase commits
+already arrive at a sane cadence, and the pseudo-streaming backends
+emit one chunk per silence cut (already coarse enough).
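
The flush rule documented in this hunk (a 400 ms regular interval, flush on sentence-final punctuation, and a 200 ms floor between any two flushes) can be sketched in isolation. The constants mirror `_DELTA_FLUSH_INTERVAL_S`, `_DELTA_FLUSH_MIN_INTERVAL_S` and `_DELTA_FLUSH_PUNCT` from `scribe/backends/openai_realtime.py` in this diff; the `DeltaCoalescer` class, its method names, and the injectable `clock` are illustrative assumptions. The sketch also defaults to `time.monotonic`, whereas the diff uses `time.time()`.

```python
import time
from typing import Callable, Optional

FLUSH_INTERVAL_S = 0.4      # regular cadence for in-progress sentences
MIN_INTERVAL_S = 0.2        # floor between two flushes, whatever the trigger
FLUSH_PUNCT = frozenset(".!?\n")


class DeltaCoalescer:
    """Hypothetical standalone version of the paste-mode buffering."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._buffer = ""
        self._last_flush = clock()

    def push(self, delta: str) -> None:
        # Accumulate a token-level delta without emitting anything.
        self._buffer += delta

    def poll(self) -> Optional[str]:
        # Return the buffered text if it is due for a flush, else None.
        if not self._buffer:
            return None
        now = self._clock()
        elapsed = now - self._last_flush
        ends_on_punct = self._buffer[-1] in FLUSH_PUNCT
        if elapsed >= MIN_INTERVAL_S and (
            elapsed >= FLUSH_INTERVAL_S or ends_on_punct
        ):
            out, self._buffer = self._buffer, ""
            self._last_flush = now
            return out
        return None

    def drain(self) -> str:
        # Unconditional flush, e.g. when the recording is finalized.
        out, self._buffer = self._buffer, ""
        return out
```

Injecting the clock keeps the timing logic testable without sleeping; a caller would `push()` every incoming delta and `poll()` once per audio frame, pasting whatever `poll()` returns.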

{scribe_cli-0.17.0 → scribe_cli-0.17.1}/scribe/_version.py RENAMED

@@ -18,7 +18,7 @@ version_tuple: tuple[int | str, ...]
 commit_id: str | None
 __commit_id__: str | None
-__version__ = version = '0.17.0'
-__version_tuple__ = version_tuple = (0, 17, 0)
+__version__ = version = '0.17.1'
+__version_tuple__ = version_tuple = (0, 17, 1)
-__commit_id__ = commit_id = 'gbfcd2e228'
+__commit_id__ = commit_id = 'g67d90f5e4'

{scribe_cli-0.17.0 → scribe_cli-0.17.1}/scribe/app.py RENAMED

@@ -66,7 +66,10 @@ class DummyTranscriber:
 whisper_models = ["tiny", "base", "small", "medium", "large-v3", "large-v3-turbo"]
 whisper_english_models = ["tiny.en", "base.en", "small.en", "medium.en"]
-# FUTO ACFT publishes only tiny/base/small (+ .en variants); no medium/large/turbo.
+# FUTO ACFT publishes only tiny/base/small (+ .en variants). Community
+# conversions exist for large/turbo but their large-v3 encoder is
+# incompatible with the audio_ctx shrinkage that's the point of this
+# backend — for large models use the `whisper` backend instead.
 whisper_futo_models = ["tiny", "base", "small"]
 whisper_futo_english_models = ["tiny.en", "base.en", "small.en"]
 whisperapi_models = ["gpt-4o-transcribe", "gpt-4o-mini-transcribe", "gpt-realtime-whisper"]
@@ -168,7 +171,7 @@ def _resolve_prompt_and_words(prompt_text, prompt_file, words, words_file):
 def _build_backend_kwargs(backend, model, language, samplerate, duration,
-                          silence_db, silence_duration,
+                          silence_db, silence_onset_db, silence_duration,
                           download_folder_vosk, download_folder_whisper,
                           download_folder_whisper_futo,
                           realtime_delay, realtime_gate,
@@ -190,23 +193,28 @@ def _build_backend_kwargs(backend, model, language, samplerate, duration,
                     model_kwargs={"download_root": download_folder_vosk})
     if backend == "whisper":
         return dict(model_name=model, language=language, samplerate=samplerate,
-                    timeout=duration, silence_duration=silence_duration, silence_thresh=silence_db,
+                    timeout=duration, silence_duration=silence_duration,
+                    silence_thresh=silence_db, silence_thresh_onset=silence_onset_db,
                     pseudo_streaming=pseudo_streaming, streaming_window=streaming_window,
                     prompt=prompt_text,
                     hotwords=(" ".join(words) if words else None),
                     model_kwargs={"download_root": download_folder_whisper})
     if backend == "whisper-futo":
-        # whisper.cpp via pywhispercpp doesn't take prompt/hotwords through the
-        # same surface; drop them for now. Audio_ctx is computed per-call inside
-        # the backend from actual audio length (the ACFT speedup).
+        # pywhispercpp 1.4.1 exposes `initial_prompt`; the backend folds
+        # words+prompt into it (and adds a rolling chunk-tail in
+        # pseudo-streaming). No separate hotwords channel here — fold
+        # everything into the prompt like the cloud backends do.
         return dict(model_name=model, language=language, samplerate=samplerate,
-                    timeout=duration, silence_duration=silence_duration, silence_thresh=silence_db,
+                    timeout=duration, silence_duration=silence_duration,
+                    silence_thresh=silence_db, silence_thresh_onset=silence_onset_db,
                     pseudo_streaming=pseudo_streaming, streaming_window=streaming_window,
+                    prompt=merged_prompt,
                     download_folder=download_folder_whisper_futo)
     if backend in ("openai", "groq"):
         from scribe.backends.openai_api import REALTIME_MODELS
         kwargs = dict(model_name=model, samplerate=samplerate,
-                      timeout=duration, silence_duration=silence_duration, silence_thresh=silence_db,
+                      timeout=duration, silence_duration=silence_duration,
+                      silence_thresh=silence_db, silence_thresh_onset=silence_onset_db,
                       pseudo_streaming=pseudo_streaming, streaming_window=streaming_window,
                       prompt=merged_prompt)
         if backend == "openai" and model in REALTIME_MODELS:
@@ -223,7 +231,7 @@ def _build_backend_kwargs(backend, model, language, samplerate, duration,
 def get_transcriber(model=None, backend=None, dummy=False, interactive=True, language=None,
                     samplerate=None, duration=None,
-                    silence_db=-40.0, silence_duration=0.6,
+                    silence_db=None, silence_onset_db=None, silence_duration=0.6,
                     download_folder_vosk=None, download_folder_whisper=None,
                     download_folder_whisper_futo=None,
                     realtime_delay="medium", realtime_gate=True,
@@ -253,9 +261,17 @@ def get_transcriber(model=None, backend=None, dummy=False, interactive=True, lan
     else:
         model = _prompt_model_for_backend(backend, language, interactive)
     print(f"Selected model: {model}")
+    # silence_db is the LOW threshold (in-speech pause detection) — default
+    # -40 in all modes. silence_onset_db is the HIGH threshold (speech-start
+    # gate) used only in pseudo-streaming via hysteresis; -25 keeps ambient
+    # noise (keyboard, breathing) from triggering a chunk.
+    if silence_db is None:
+        silence_db = -40.0
+    if silence_onset_db is None:
+        silence_onset_db = -25.0 if pseudo_streaming else silence_db
     prompt_text, word_list = _resolve_prompt_and_words(prompt, prompt_file, words, words_file)
     backend_kwargs = _build_backend_kwargs(backend, model, language, samplerate, duration,
-                                          silence_db, silence_duration,
+                                          silence_db, silence_onset_db, silence_duration,
                                           download_folder_vosk, download_folder_whisper,
                                           download_folder_whisper_futo,
                                           realtime_delay, realtime_gate,
@@ -322,11 +338,18 @@ def get_parser():
     group = parser.add_argument_group("Silence detection (shared)")
     group.add_argument("--duration", default=120, type=float,
                        help="Max recording duration in seconds (default: %(default)s).")
-    group.add_argument("--silence-db", default=-40.0, type=float,
-                       help="dBFS volume floor for 'this frame is silent' "
-                            "(default: %(default)s). Used by every silence-driven "
-                            "behavior (realtime gate, realtime auto-commit, "
-                            "pseudo-streaming chunking).")
+    group.add_argument("--silence-db", default=None, type=float,
+                       help="LOW silence floor in dBFS — applied while we're "
+                            "already inside an utterance, so soft trailing "
+                            "syllables aren't cut. Default: -40. Used by every "
+                            "silence-driven behavior (pseudo-streaming pause "
+                            "detection, realtime gate, realtime auto-commit).")
+    group.add_argument("--silence-onset-db", default=None, type=float,
+                       help="HIGH silence floor in dBFS — applied before we've "
+                            "started capturing speech (audio buffer empty). "
+                            "Stricter so ambient noise (keyboard, breathing) "
+                            "doesn't trigger a chunk. Default: -25 in "
+                            "pseudo-streaming, same as --silence-db otherwise.")
     group.add_argument("--silence-duration", default=0.6, type=float,
                        help="Seconds of silence required before triggering a "
                             "backend's silence behavior (default: %(default)s). "
@@ -399,8 +422,16 @@ def start_recording(micro, session, mode="keystroke", typer="auto",
     # Query the live transcriber instance — the registered class may dispatch
     # to a streaming sibling for specific models (e.g. openai →
     # gpt-realtime-whisper), so a class-level lookup via BACKENDS would lie.
+    # Pseudo-streaming also yields chunks (silence-cut batch transcriptions)
+    # so the output should treat it the same: live paste/type per chunk.
     backend_obj = getattr(session, "backend", session)
-    is_streaming = bool(getattr(backend_obj, "supports_streaming", False)) if not isinstance(backend_obj, str) else False
+    if isinstance(backend_obj, str):
+        is_streaming = False
+    else:
+        is_streaming = (
+            bool(getattr(backend_obj, "supports_streaming", False))
+            or bool(getattr(backend_obj, "pseudo_streaming", False))
+        )
     # Clipboard is written in clipboard mode (the user pastes manually) and in
     # paste-based keystroke mode (the paste source). type_direct keystroke
     # mode bypasses the clipboard entirely — we type the chunks/text raw.
@@ -427,6 +458,16 @@ def start_recording(micro, session, mode="keystroke", typer="auto",
         import pyperclip
         session.log("The transcription will be copied to clipboard as it becomes available.")
+    # Tell streaming backends whether their output is about to hit the
+    # clipboard-paste race or a direct-keystroke typer. The realtime
+    # backend's per-token deltas only need coalescing in paste mode;
+    # type-direct (ydotool/wtype/pynput via uinput/xtest) types each
+    # character synchronously and benefits from raw per-delta emission
+    # for snappier UX. Set as a plain attribute — backends that don't
+    # implement coalescing ignore it.
+    if not isinstance(backend_obj, str) and hasattr(backend_obj, "_coalesce_deltas"):
+        backend_obj._coalesce_deltas = do_live_paste
     fulltext = ""
     for result in session.start_recording(micro, **greetings):
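
The two-threshold scheme introduced in this file can be sketched as below. `resolve_silence_thresholds` restates the default resolution shown in the `get_transcriber` hunk (-40 dBFS low floor, -25 dBFS onset floor in pseudo-streaming, otherwise equal to the low floor). `frame_is_silent` is a hypothetical illustration of how a hysteresis gate might consume the pair; the actual gate lives in backend code that is not part of this diff, and both function names are assumptions.

```python
from __future__ import annotations


def resolve_silence_thresholds(
    silence_db: float | None,
    silence_onset_db: float | None,
    pseudo_streaming: bool,
) -> tuple[float, float]:
    """Fill in the defaults the way the get_transcriber hunk does."""
    if silence_db is None:
        silence_db = -40.0
    if silence_onset_db is None:
        silence_onset_db = -25.0 if pseudo_streaming else silence_db
    return silence_db, silence_onset_db


def frame_is_silent(
    level_db: float,
    in_utterance: bool,
    silence_db: float,
    silence_onset_db: float,
) -> bool:
    """Hypothetical hysteresis gate.

    Before speech has started, the stricter onset floor applies, so ambient
    noise does not open a chunk. Once inside an utterance, the lower floor
    applies, so soft trailing syllables are not cut.
    """
    threshold = silence_db if in_utterance else silence_onset_db
    return level_db < threshold


low, onset = resolve_silence_thresholds(None, None, pseudo_streaming=True)
# A -30 dBFS frame is ignored before speech starts but kept mid-utterance.
print(frame_is_silent(-30.0, False, low, onset), frame_is_silent(-30.0, True, low, onset))
```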

{scribe_cli-0.17.0 → scribe_cli-0.17.1}/scribe/backends/openai_api.py RENAMED

@@ -47,7 +47,8 @@ class OpenaiAPITranscriber(WhisperTranscriber):
         sf.write(buffer, audio_data, self.samplerate, format='WAV')
         buffer.seek(0)
         buffer.name = "audio.wav"  # Set a filename with a valid extension
-        extra = {"prompt": self._prompt} if self._prompt else {}
+        prompt = self.compose_prompt(self._prompt)
+        extra = {"prompt": prompt} if prompt else {}
         try:
             transcription = self.model.audio.transcriptions.create(
                 model=self.model_name,
@@ -58,6 +59,7 @@ class OpenaiAPITranscriber(WhisperTranscriber):
             title, message = format_openai_error(e)
             self.notify_error(title, message)
             return {"text": ""}
+        self.update_streaming_context(transcription.text)
         return {"text": transcription.text}

{scribe_cli-0.17.0 → scribe_cli-0.17.1}/scribe/backends/openai_realtime.py RENAMED

@@ -2,6 +2,7 @@ import base64
 import logging
 import queue
 import threading
+import time
 from typing import ClassVar
 import numpy as np
@@ -34,6 +35,29 @@ class OpenaiRealtimeTranscriber(AbstractStreamingTranscriber):
     # click) followed by silence would otherwise trigger an error popup.
     _SERVER_COMMIT_MIN_MS = 100.0
+    # Coalesce token-level deltas before yielding to the app layer.
+    # gpt-realtime-whisper emits one delta per word/subword (~30-80 ms
+    # apart). The live-paste path (paste_via_clipboard) needs ~100 ms
+    # per call to defeat Wayland's wl-copy async race — pasting every
+    # delta caused token drops + duplications because the clipboard got
+    # overwritten before Ctrl+V landed.
+    #
+    # _INTERVAL: regular cadence for in-progress sentences (no punct
+    # yet). Long enough that most short sentences finish before it fires
+    # — that way the natural commit point is the period, not a mid-
+    # sentence timeout (which would split a phrase across two pastes and
+    # race them through the clipboard).
+    #
+    # _MIN_INTERVAL: floor between successive flushes regardless of
+    # trigger. Even when the buffer ends on a period, we hold the flush
+    # until the floor has elapsed since the prior one. Two punctuation
+    # flushes <200ms apart was the residual failure mode that mangled
+    # rapid repeated phrases ("Tout rentre dans l'ordre. Tout rentre
+    # dans l'ordre.") even after the initial coalescing landed.
+    _DELTA_FLUSH_INTERVAL_S = 0.4
+    _DELTA_FLUSH_MIN_INTERVAL_S = 0.2
+    _DELTA_FLUSH_PUNCT = frozenset(".!?\n")
     def __init__(self, model_name="gpt-realtime-whisper", language=None, model_kwargs={},
                  model=None, realtime_delay="medium",
                  realtime_gate=True, prompt=None, **kwargs):
@@ -66,6 +90,16 @@ class OpenaiRealtimeTranscriber(AbstractStreamingTranscriber):
         self._has_uncommitted_audio = False
         self._silent_samples = 0
         self._uncommitted_ms = 0.0
+        # Delta coalescing state (see _DELTA_FLUSH_INTERVAL_S). The flag
+        # below is set by the app layer at recording time: True when
+        # live-paste-via-clipboard is the output (clipboard race exists
+        # → coalesce); False in type-direct mode (uinput/xtest tap each
+        # character — no clipboard, no race, no need to batch). Default
+        # True so backends instantiated outside the scribe app loop
+        # (smoke tests, library use) keep the safer batched behaviour.
+        self._coalesce_deltas = True
+        self._delta_buffer = ""
+        self._last_delta_flush = 0.0
     def _session_config(self) -> dict:
         # gpt-realtime-whisper does NOT support server VAD (rejected as
@@ -73,11 +107,17 @@ class OpenaiRealtimeTranscriber(AbstractStreamingTranscriber):
         # The streaming knob for this model is `delay` — "minimal" emits
         # partials as early as possible; higher values trade latency for
         # accuracy. Surfaced as the --realtime-delay CLI flag.
+        #
+        # NOTE: this model also rejects `prompt` server-side
+        # (400 "The 'prompt' parameter is not supported for this model.",
+        # param `session.audio.input.transcription.prompt`). The shared
+        # backend kwarg `prompt` is silently ignored here — the
+        # pseudo-streaming chunk-tail context machinery doesn't apply
+        # either (this backend is true streaming, not chunked). If a
+        # future REALTIME_MODELS entry supports it, gate by model name.
         transcription: dict = {"model": self.model_name, "delay": self._realtime_delay}
         if self.language:
             transcription["language"] = self.language
-        if self._prompt:
-            transcription["prompt"] = self._prompt
         audio_input: dict = {
             "format": {"type": "audio/pcm", "rate": self._GA_SAMPLE_RATE},
             "transcription": transcription,
@@ -100,6 +140,8 @@ class OpenaiRealtimeTranscriber(AbstractStreamingTranscriber):
         self._has_uncommitted_audio = False
         self._silent_samples = 0
         self._uncommitted_ms = 0.0
+        self._delta_buffer = ""
+        self._last_delta_flush = time.time()
         self._client = openai.OpenAI()
@@ -250,6 +292,9 @@ class OpenaiRealtimeTranscriber(AbstractStreamingTranscriber):
             else:
                 self._silent_samples = 0
+        # Drain queue. Errors surface immediately in both modes. Text
+        # deltas either get buffered for coalesced flush (paste mode)
+        # or yielded raw (type-direct mode — see _coalesce_deltas).
         while True:
             try:
                 item = self._event_queue.get_nowait()
@@ -259,7 +304,31 @@ class OpenaiRealtimeTranscriber(AbstractStreamingTranscriber):
                 title, message = item["_error"]
                 self.notify_error(title, message)
                 continue
-            yield item
+            text = item.get("text", "")
+            if not text:
+                continue
+            if self._coalesce_deltas:
+                self._delta_buffer += text
+            else:
+                yield {"text": text}
+        # Flush the coalesced buffer when both:
+        #   (a) the floor _DELTA_FLUSH_MIN_INTERVAL_S has elapsed since
+        #       the last flush — no two pastes within the clipboard race
+        #       window, regardless of trigger; and
+        #   (b) either the regular interval elapsed, or the buffer ends
+        #       on sentence-final punctuation (natural commit boundary).
+        # In raw-delta mode the buffer stays empty so this is a no-op.
+        if self._delta_buffer:
+            now = time.time()
+            elapsed = now - self._last_delta_flush
+            ends_on_punct = self._delta_buffer[-1] in self._DELTA_FLUSH_PUNCT
+            if elapsed >= self._DELTA_FLUSH_MIN_INTERVAL_S and (
+                elapsed >= self._DELTA_FLUSH_INTERVAL_S or ends_on_punct
+            ):
+                yield {"text": self._delta_buffer}
+                self._delta_buffer = ""
+                self._last_delta_flush = now
     def finalize(self):
         if self._connection is None or self._closed:
@@ -288,7 +357,14 @@ class OpenaiRealtimeTranscriber(AbstractStreamingTranscriber):
             # transcript was already streamed live as `text` deltas during
             # recording, so we only return the tail.
             self._completed_event.wait(timeout=self._FINALIZE_TIMEOUT)
+        # Start with whatever sat in the coalescing buffer (deltas seen
+        # by feed_audio but not yet flushed by the interval/punct check),
+        # then append any tail deltas the recv_loop pushed in after the
+        # recording loop exited.
         tail_parts: list[str] = []
+        if self._delta_buffer:
+            tail_parts.append(self._delta_buffer)
+            self._delta_buffer = ""
         while True:
             try:
                 item = self._event_queue.get_nowait()

{scribe_cli-0.17.0 → scribe_cli-0.17.1}/scribe/backends/whisper.py RENAMED

@@ -29,7 +29,7 @@ class WhisperTranscriber(AbstractTranscriber):
             language=self.language,
             vad_filter=True,
             beam_size=1,
-            initial_prompt=self._prompt,
+            initial_prompt=self.compose_prompt(self._prompt),
             hotwords=self._hotwords,
             no_speech_threshold=0.6,
             log_prob_threshold=-1.0,
@@ -37,6 +37,7 @@ class WhisperTranscriber(AbstractTranscriber):
             temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
         )
         text = "".join(segment.text for segment in segments)
+        self.update_streaming_context(text)
         return {"text": text}
     def finalize(self):

{scribe_cli-0.17.0 → scribe_cli-0.17.1}/scribe/backends/whisper_futo.py RENAMED

@@ -15,6 +15,7 @@ from __future__ import annotations
 import math
 import os
+import re
 from pathlib import Path
 from typing import ClassVar
@@ -23,9 +24,36 @@ import numpy as np
 from scribe.models import AbstractTranscriber
+# Whisper hallucinates sound-effect annotations like "(music)", "[Applause]"
+# on near-silence, and occasionally emits IPA-modifier-letter garbage
+# (U+02B0–02FF) or U+FFFD when the audio is unintelligible. Two filters:
+#   - WHOLE_RE: chunk is one such artifact end-to-end → drop.
+#   - INLINE_RE: artifact embedded mid-text ("Bonjour (typing) ça va") →
+#     substitute out. Restricted to lowercase ASCII + spaces inside the
+#     brackets so legitimate French parentheticals (accents) and proper
+#     nouns (uppercase) are preserved. pywhispercpp 1.4.1 advertises
+#     `suppress_non_speech_tokens` in its schema but the C struct doesn't
+#     expose it, so this lives at the text layer.
+_NON_SPEECH_WHOLE_RE = re.compile(r"^\s*[(\[*][^()\[\]*]{1,60}[)\]*]\s*[.!?]?\s*$")
+# Allow any case ([Breathing], [KNOCKING], [Door opens], (footsteps)) and
+# consume any trailing punctuation so adjacent text doesn't end up with
+# stray commas. Substitute with a space (not "") so adjacent words don't
+# collide when the noise token has no surrounding whitespace
+# ("[door][door]" or "word(typing)word"); a follow-up \s+ collapse cleans
+# up any doubles.
+_NON_SPEECH_INLINE_RE = re.compile(r"[(\[][A-Za-z][A-Za-z\s\-]{0,30}[)\]][.,!?:;]?")
+_WHITESPACE_RE = re.compile(r"\s+")
+_PHONETIC_RE = re.compile(r"[ʰ-˿�]")
 _FUTO_BASE_URL = "https://voiceinput.futo.org/VoiceInput/"
-# Map user-visible model name → ggml filename on FUTO's CDN.
+# Map user-visible model name → ggml filename on FUTO's CDN. FUTO publishes
+# only tiny/base/small (+ .en variants). The DeadBranches community q8_0 of
+# large-v3-turbo was tried briefly but its large-v3 encoder is incompatible
+# with the audio_ctx-shrinkage that's the whole point of this backend
+# (Progress: 1612% / CJK garbage on short clips), so we stick to the FUTO
+# set where ACFT works as advertised.
 _FUTO_MODELS: dict[str, str] = {
     "tiny":     "tiny_acft_q8_0.bin",
     "tiny.en":  "tiny_en_acft_q8_0.bin",
@@ -90,7 +118,7 @@ class WhisperFutoTranscriber(AbstractTranscriber):
     is_local: ClassVar[bool] = True
     def __init__(self, model_name, language=None, model=None, model_kwargs={},
-                 download_folder=None, **kwargs):
+                 download_folder=None, prompt=None, **kwargs):
         if model is None:
             from pywhispercpp.model import Model
             path = _model_path(model_name, download_folder)
@@ -101,29 +129,62 @@ class WhisperFutoTranscriber(AbstractTranscriber):
             init_kwargs = {k: v for k, v in model_kwargs.items() if k != "n_threads"}
             model = Model(str(path), n_threads=n_threads, **init_kwargs)
         super().__init__(model, model_name, language, model_kwargs=model_kwargs, **kwargs)
+        self._prompt = prompt
     def transcribe_audio(self, audio_bytes):
         self.log("\nTranscribing")
         audio = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0
-        # ACFT shortcut: shrink the encoder window to the actual audio length.
-        # Works for both explicit language and auto-detect (whisper.cpp runs its
-        # language ID head on the same shrunk encoder output; FUTO's L2-distill
-        # training preserves enough representational quality at short contexts).
-        # pywhispercpp wants "" (not "auto") to request auto-detection.
         duration_s = len(audio) / self.samplerate
-        audio_ctx = min(_AUDIO_CTX_MAX,
-                        max(_AUDIO_CTX_MIN,
-                            math.ceil(duration_s * _AUDIO_CTX_PER_SECOND)))
-        segments = self.model.transcribe(
-            audio,
-            language=self.language or "",
-            audio_ctx=audio_ctx,
-            no_speech_thold=0.6,
-            entropy_thold=2.4,
-            logprob_thold=-1.0,
-            temperature_inc=0.2,
-        )
-        return {"text": "".join(s.text for s in segments)}
+        # ACFT shortcut: shrink the encoder window to the actual audio
+        # length. This is the whole point of the FUTO backend — without it,
+        # a 2 s clip runs against the full 30 s window and inference is
+        # 5-10× slower. Safe for the FUTO ACFT set (tiny/base/small +
+        # .en) which was trained to preserve quality at short audio_ctx.
+        # pywhispercpp wants "" (not "auto") to request auto-detect.
+        kwargs = {
+            "language": self.language or "",
+            "audio_ctx": min(_AUDIO_CTX_MAX,
+                             max(_AUDIO_CTX_MIN,
+                                 math.ceil(duration_s * _AUDIO_CTX_PER_SECOND))),
+        }
+        prompt = self.compose_prompt(self._prompt)
+        if prompt:
+            kwargs["initial_prompt"] = prompt
+        # Streaming-only safety nets. max_tokens caps decoder repetition
+        # loops on short silence-split chunks; the non-speech filter
+        # below drops "(music)"-style hallucinations from those same
+        # tiny chunks. Both can clip real speech in batch where the
+        # recording is a single longer utterance.
+        if self.pseudo_streaming:
+            kwargs["max_tokens"] = max(12, int(duration_s * 12))
+        segments = self.model.transcribe(audio, **kwargs)
+        text = "".join(s.text for s in segments)
+        if self.pseudo_streaming:
+            # Inline pass first: catches concatenated noise tokens like
+            # "[door opens][door closes]" and mid-sentence "(typing)"
+            # inserts. Replace with " " then collapse to avoid gluing
+            # adjacent words. Whole-chunk fallback catches artifacts the
+            # inline pattern misses (internal punctuation inside brackets).
+            text = _NON_SPEECH_INLINE_RE.sub(" ", text)
+            text = _WHITESPACE_RE.sub(" ", text).strip()
+            if _NON_SPEECH_WHOLE_RE.match(text):
+                text = ""
+        else:
+            text = text.strip()
+        # Phonetic garbage (IPA modifier letters, U+FFFD) is always a
+        # decode failure — drop in both modes.
+        if _PHONETIC_RE.search(text):
+            text = ""
+        # Carry the cleaned text forward as cross-chunk context. Done
+        # post-filter so hallucination/phonetic-garbage chunks (now "")
+        # don't poison the next chunk's prompt.
+        self.update_streaming_context(text)
+        # Trailing space lets pseudo-streaming chunks concatenate cleanly
+        # (vosk convention). Harmless in batch mode — downstream strips.
+        if text:
+            text += " "
+        return {"text": text}
     def finalize(self):
         if len(self.session.audio_buffer) == 0:
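
The text-layer filtering added to `transcribe_audio` can be exercised on its own. The sketch below copies the three bracket/whitespace patterns from this diff and writes the phonetic range with explicit escapes (U+02B0 to U+02FF plus U+FFFD); the wrapper `clean_chunk` is a hypothetical name that reproduces only the filtering order, without the model call, the context update, or the trailing space. Note that the inline pattern as written accepts ASCII letters of either case, so a short capitalised parenthetical such as `(Paris)` is stripped too, while accented parentheticals survive.

```python
import re

# Patterns copied from the diff above.
_NON_SPEECH_WHOLE_RE = re.compile(r"^\s*[(\[*][^()\[\]*]{1,60}[)\]*]\s*[.!?]?\s*$")
_NON_SPEECH_INLINE_RE = re.compile(r"[(\[][A-Za-z][A-Za-z\s\-]{0,30}[)\]][.,!?:;]?")
_WHITESPACE_RE = re.compile(r"\s+")
# Same character class as the diff, spelled with escapes for readability.
_PHONETIC_RE = re.compile(r"[\u02b0-\u02ff\ufffd]")


def clean_chunk(text: str, pseudo_streaming: bool = True) -> str:
    """Hypothetical wrapper reproducing the filter order from the diff."""
    if pseudo_streaming:
        # Inline pass: replace bracketed noise tokens with a space,
        # then collapse whitespace so adjacent words are not glued.
        text = _NON_SPEECH_INLINE_RE.sub(" ", text)
        text = _WHITESPACE_RE.sub(" ", text).strip()
        # Whole-chunk fallback for artifacts the inline pattern misses.
        if _NON_SPEECH_WHOLE_RE.match(text):
            text = ""
    else:
        text = text.strip()
    # Phonetic garbage is treated as a decode failure in both modes.
    if _PHONETIC_RE.search(text):
        text = ""
    return text


for sample in ["Bonjour (typing) ça va",
               "[door opens][door closes]",
               "(door opens, closes)",
               "le café (très chaud) est prêt"]:
    print(repr(clean_chunk(sample)))
```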
