PyPI - scribe-cli - Versions diffs - 0.17.1__tar.gz → 1.0.0__tar.gz - Mend

scribe-cli 0.17.1__tar.gz → 1.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (73)

{scribe_cli-0.17.1 → scribe_cli-1.0.0}/.gitignore RENAMED

@@ -7,3 +7,4 @@ scribe/_version.py
 # Autonomous roadmap workflows (local coordination artifacts; never committed)
 workflows/
+.worktrees/

{scribe_cli-0.17.1 → scribe_cli-1.0.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: scribe-cli
-Version: 0.17.1
+Version: 1.0.0
 Summary: Speech-to-text CLI and system-tray app for dictating into any focused window. Local (vosk, faster-whisper) or cloud (groq, openai) backends, batch or streaming.
 Author-email: Mahé Perrette <mahe.perrette@gmail.com>
 License: MIT License
@@ -33,13 +33,34 @@ License: MIT License
         licenses of all dependencies before using or distributing this software to
         ensure compliance with their respective terms.
 Project-URL: Homepage, https://github.com/perrette/scribe
-Keywords: speech-to-text,speech recognition,transcription,dictation,voice-typing,voice-to-text,realtime,streaming,language,AI,local,API,cli,tray,vosk,whisper,openai,groq,gpt-4o,linux,wayland,keyboard,clipboard
+Project-URL: Source, https://github.com/perrette/scribe
+Project-URL: Issues, https://github.com/perrette/scribe/issues
+Project-URL: Changelog, https://github.com/perrette/scribe/releases
+Project-URL: Funding, https://github.com/sponsors/perrette
+Keywords: speech-to-text,stt,transcription,dictation,voice-typing,voice-recognition,multilingual,realtime,streaming,cli,tray,vosk,whisper,faster-whisper,openai,groq,gpt-4o,linux,wayland,keyboard,clipboard,microphone,audio
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: End Users/Desktop
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
 Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: 3.13
 Classifier: Operating System :: OS Independent
+Classifier: Environment :: Console
+Classifier: Environment :: X11 Applications
+Classifier: Environment :: MacOS X
+Classifier: Environment :: Win32 (MS Windows)
+Classifier: Natural Language :: English
+Classifier: Natural Language :: French
+Classifier: Natural Language :: German
+Classifier: Natural Language :: Italian
+Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Office/Business
+Classifier: Topic :: Text Processing :: Linguistic
+Classifier: Topic :: Utilities
 Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
@@ -52,6 +73,7 @@ Requires-Dist: unidecode
 Requires-Dist: termcolor
 Requires-Dist: platformdirs
 Requires-Dist: desktop-ai-core>=0.2.0
+Requires-Dist: onnxruntime
 Provides-Extra: keyboard
 Requires-Dist: pynput; extra == "keyboard"
 Provides-Extra: whisper
@@ -69,6 +91,7 @@ Requires-Dist: soundfile; extra == "openai"
 Provides-Extra: groq
 Requires-Dist: openai<3,>=2.37.0; extra == "groq"
 Requires-Dist: soundfile; extra == "groq"
+Provides-Extra: vad
 Provides-Extra: all
 Requires-Dist: pynput; extra == "all"
 Requires-Dist: faster-whisper; extra == "all"
@@ -90,11 +113,13 @@ cloud-based APIs, batch and streaming workflows.
 ## What it does
-- Records from your mic and transcribes via one of four backends —
-  **Vosk** (local, streaming), **Whisper** (local, batch), **OpenAI**
-  (cloud, batch *or* streaming), **Groq** (cloud, batch).
-- Delivers the transcript three ways: paste into the focused window
-  (default), copy to clipboard, or print to the terminal.
+- Records from your mic and transcribes via one of five backends —
+  **Vosk** (local, streaming), **Whisper** (local, batch),
+  **Whisper FUTO** (local, batch — ACFT-tuned for short dictations),
+  **OpenAI** (cloud, batch *or* streaming), **Groq** (cloud, batch).
+- Delivers the transcript four ways: paste into the focused window
+  (default), copy to clipboard, print to the terminal, or write to
+  a file.
 - Runs as a **system tray icon** with a single Record button, or as an
   interactive **terminal TUI** — same menu in both.
 - Hooks into your DE's keyboard shortcuts via `SIGUSR1` (toggle
@@ -124,8 +149,8 @@ scribe
 This launches the system tray icon. Press Record, speak, press Stop —
 the transcription lands in the focused window. Scribe picks the first
 backend whose key / dependency is present, in order **`groq` →
-`openai` → `whisper` → `vosk`**, so with `GROQ_API_KEY` set the
-command above is equivalent to:
+`openai` → `whisper-futo` → `whisper` → `vosk`**, so with `GROQ_API_KEY`
+set the command above is equivalent to:
 ```bash
 scribe --backend groq --model whisper-large-v3-turbo
@@ -140,15 +165,17 @@ scribe --backend openai --model gpt-4o-mini-transcribe # OpenAI sweet spot
 scribe --backend openai --model gpt-realtime-whisper   # OpenAI streaming
 scribe --backend whisper --model small                 # local, no API key
 scribe --frontend terminal                             # interactive TUI menu
-scribe --frontend terminal --no-interactive            # record immediately, no menu
+scribe --record                                        # start recording immediately on launch (works in tray or terminal)
+scribe --record --frontend terminal --mode file        # one-shot batched dictation → file
+scribe --record --frontend terminal --mode file --stream  # streamed: chunks appended live as you speak
 scribe --mode clipboard                                # copy to clipboard, no keystroke
 scribe --mode terminal                                 # only print to stdout
-scribe -o transcript.txt                               # also append to a file
+scribe --mode file -o transcript.txt                   # append to a file (no keystroke / clipboard)
 ```
 With `--no-interactive` (terminal frontend only), scribe skips the
 interactive menu and starts recording right away — handy for scripted,
-one-shot transcriptions. `--no-prompt` is kept as a deprecated alias.
+one-shot transcriptions.
 Bias the recogniser toward names, jargon, or a domain glossary with
 `--prompt "free text hint"` and `--words word1 word2 ...` (each also
@@ -159,12 +186,13 @@ for what each backend does with them.
 ## Backends at a glance
-| Backend         | `--backend` | Default model              | Streaming model(s)        | Requires                            |
-|-----------------|-------------|----------------------------|---------------------------|-------------------------------------|
-| Groq (cloud)    | `groq`      | `whisper-large-v3-turbo`   | —                         | `GROQ_API_KEY`                      |
-| OpenAI (cloud)  | `openai`    | `gpt-4o-mini-transcribe`   | `gpt-realtime-whisper`    | `OPENAI_API_KEY`                    |
-| Whisper (local) | `whisper`   | `small`                    | —                         | `pip install scribe-cli[whisper]`   |
-| Vosk (local)    | `vosk`      | language-dependent         | all Vosk models           | `pip install scribe-cli[vosk]`      |
+| Backend              | `--backend`     | Default model              | Streaming model(s)        | Requires                               |
+|----------------------|-----------------|----------------------------|---------------------------|----------------------------------------|
+| Groq (cloud)         | `groq`          | `whisper-large-v3-turbo`   | —                         | `GROQ_API_KEY`                         |
+| OpenAI (cloud)       | `openai`        | `gpt-4o-mini-transcribe`   | `gpt-realtime-whisper`    | `OPENAI_API_KEY`                       |
+| Whisper FUTO (local) | `whisper-futo`  | `small`                    | —                         | `pip install scribe-cli[whisper-futo]` |
+| Whisper (local)      | `whisper`       | `small`                    | —                         | `pip install scribe-cli[whisper]`      |
+| Vosk (local)         | `vosk`          | language-dependent         | all Vosk models           | `pip install scribe-cli[vosk]`         |
 Whether a transcription appears live as you speak or all at once when
 you stop depends on the **model** picked — see
@@ -173,8 +201,11 @@ you stop depends on the **model** picked — see
 ### Getting an API key
-Groq is a good cloud backend to start with — very fast, quite accurate, and the
-**free tier** is generous enough for everyday dictation. Sign up at
+Groq is the **recommended cloud backend by default** — extremely fast
+(by a wide margin compared to other cloud STT options, especially in
+**Stream** mode where the per-chunk roundtrip latency dominates the
+perceived speed), quite accurate, and the **free tier** is generous
+enough for everyday dictation. Sign up at
 [console.groq.com](https://console.groq.com/), create an API key
 under **Settings → API Keys**, and export it as `GROQ_API_KEY`.
@@ -187,7 +218,7 @@ I personally use [OpenAI](https://openai.com/api/) with `gpt-4o-mini-transcribe`
   extras, Ubuntu / GNOME tray libs.
 - [Backends in detail](docs/backends.md) — model lists, when to pick
   which, the realtime model.
-- [Keyboard modes & typer backends](docs/keyboard.md) — keystroke vs
+- [Output modes & typer backends](docs/output.md) — keystroke vs
   clipboard, Wayland / `eitype`, `--type-direct`.
 - [System tray & global hotkeys](docs/tray.md) — menu tree, icon
   states, `SIGUSR1`/`SIGUSR2`.
@@ -196,10 +227,17 @@ I personally use [OpenAI](https://openai.com/api/) with `gpt-4o-mini-transcribe`
 - [Fine tuning & CLI reference](docs/cli.md) — every `scribe --help`
   flag with examples.
+## Related projects
+- **[bard](https://github.com/perrette/bard)** — TTS sibling of scribe,
+  same tray/CLI architecture in reverse: highlight text, hear it
+  spoken. Shares the [`desktop-ai-core`](https://github.com/perrette/desktop-ai-core)
+  backbone (frontends, providers, dialog helpers).
 ## Compatibility
 Initially developed for Python 3 on Ubuntu 24.04 (GNOME + Wayland);
 works on macOS and Windows too. Wayland keystroke injection is
-convoluted but [solved](docs/keyboard.md). For dependencies of
+convoluted but [solved](docs/output.md). For dependencies of
 individual subsystems, check `pynput` (keyboard) and `pystray` (tray
 icon).

{scribe_cli-0.17.1 → scribe_cli-1.0.0}/README.md RENAMED

@@ -9,11 +9,13 @@ cloud-based APIs, batch and streaming workflows.
 ## What it does
-- Records from your mic and transcribes via one of four backends —
-  **Vosk** (local, streaming), **Whisper** (local, batch), **OpenAI**
-  (cloud, batch *or* streaming), **Groq** (cloud, batch).
-- Delivers the transcript three ways: paste into the focused window
-  (default), copy to clipboard, or print to the terminal.
+- Records from your mic and transcribes via one of five backends —
+  **Vosk** (local, streaming), **Whisper** (local, batch),
+  **Whisper FUTO** (local, batch — ACFT-tuned for short dictations),
+  **OpenAI** (cloud, batch *or* streaming), **Groq** (cloud, batch).
+- Delivers the transcript four ways: paste into the focused window
+  (default), copy to clipboard, print to the terminal, or write to
+  a file.
 - Runs as a **system tray icon** with a single Record button, or as an
   interactive **terminal TUI** — same menu in both.
 - Hooks into your DE's keyboard shortcuts via `SIGUSR1` (toggle
@@ -43,8 +45,8 @@ scribe
 This launches the system tray icon. Press Record, speak, press Stop —
 the transcription lands in the focused window. Scribe picks the first
 backend whose key / dependency is present, in order **`groq` →
-`openai` → `whisper` → `vosk`**, so with `GROQ_API_KEY` set the
-command above is equivalent to:
+`openai` → `whisper-futo` → `whisper` → `vosk`**, so with `GROQ_API_KEY`
+set the command above is equivalent to:
 ```bash
 scribe --backend groq --model whisper-large-v3-turbo
@@ -59,15 +61,17 @@ scribe --backend openai --model gpt-4o-mini-transcribe # OpenAI sweet spot
 scribe --backend openai --model gpt-realtime-whisper   # OpenAI streaming
 scribe --backend whisper --model small                 # local, no API key
 scribe --frontend terminal                             # interactive TUI menu
-scribe --frontend terminal --no-interactive            # record immediately, no menu
+scribe --record                                        # start recording immediately on launch (works in tray or terminal)
+scribe --record --frontend terminal --mode file        # one-shot batched dictation → file
+scribe --record --frontend terminal --mode file --stream  # streamed: chunks appended live as you speak
 scribe --mode clipboard                                # copy to clipboard, no keystroke
 scribe --mode terminal                                 # only print to stdout
-scribe -o transcript.txt                               # also append to a file
+scribe --mode file -o transcript.txt                   # append to a file (no keystroke / clipboard)
 ```
 With `--no-interactive` (terminal frontend only), scribe skips the
 interactive menu and starts recording right away — handy for scripted,
-one-shot transcriptions. `--no-prompt` is kept as a deprecated alias.
+one-shot transcriptions.
 Bias the recogniser toward names, jargon, or a domain glossary with
 `--prompt "free text hint"` and `--words word1 word2 ...` (each also
@@ -78,12 +82,13 @@ for what each backend does with them.
 ## Backends at a glance
-| Backend         | `--backend` | Default model              | Streaming model(s)        | Requires                            |
-|-----------------|-------------|----------------------------|---------------------------|-------------------------------------|
-| Groq (cloud)    | `groq`      | `whisper-large-v3-turbo`   | —                         | `GROQ_API_KEY`                      |
-| OpenAI (cloud)  | `openai`    | `gpt-4o-mini-transcribe`   | `gpt-realtime-whisper`    | `OPENAI_API_KEY`                    |
-| Whisper (local) | `whisper`   | `small`                    | —                         | `pip install scribe-cli[whisper]`   |
-| Vosk (local)    | `vosk`      | language-dependent         | all Vosk models           | `pip install scribe-cli[vosk]`      |
+| Backend              | `--backend`     | Default model              | Streaming model(s)        | Requires                               |
+|----------------------|-----------------|----------------------------|---------------------------|----------------------------------------|
+| Groq (cloud)         | `groq`          | `whisper-large-v3-turbo`   | —                         | `GROQ_API_KEY`                         |
+| OpenAI (cloud)       | `openai`        | `gpt-4o-mini-transcribe`   | `gpt-realtime-whisper`    | `OPENAI_API_KEY`                       |
+| Whisper FUTO (local) | `whisper-futo`  | `small`                    | —                         | `pip install scribe-cli[whisper-futo]` |
+| Whisper (local)      | `whisper`       | `small`                    | —                         | `pip install scribe-cli[whisper]`      |
+| Vosk (local)         | `vosk`          | language-dependent         | all Vosk models           | `pip install scribe-cli[vosk]`         |
 Whether a transcription appears live as you speak or all at once when
 you stop depends on the **model** picked — see
@@ -92,8 +97,11 @@ you stop depends on the **model** picked — see
 ### Getting an API key
-Groq is a good cloud backend to start with — very fast, quite accurate, and the
-**free tier** is generous enough for everyday dictation. Sign up at
+Groq is the **recommended cloud backend by default** — extremely fast
+(by a wide margin compared to other cloud STT options, especially in
+**Stream** mode where the per-chunk roundtrip latency dominates the
+perceived speed), quite accurate, and the **free tier** is generous
+enough for everyday dictation. Sign up at
 [console.groq.com](https://console.groq.com/), create an API key
 under **Settings → API Keys**, and export it as `GROQ_API_KEY`.
@@ -106,7 +114,7 @@ I personally use [OpenAI](https://openai.com/api/) with `gpt-4o-mini-transcribe`
   extras, Ubuntu / GNOME tray libs.
 - [Backends in detail](docs/backends.md) — model lists, when to pick
   which, the realtime model.
-- [Keyboard modes & typer backends](docs/keyboard.md) — keystroke vs
+- [Output modes & typer backends](docs/output.md) — keystroke vs
   clipboard, Wayland / `eitype`, `--type-direct`.
 - [System tray & global hotkeys](docs/tray.md) — menu tree, icon
   states, `SIGUSR1`/`SIGUSR2`.
@@ -115,10 +123,17 @@ I personally use [OpenAI](https://openai.com/api/) with `gpt-4o-mini-transcribe`
 - [Fine tuning & CLI reference](docs/cli.md) — every `scribe --help`
   flag with examples.
+## Related projects
+- **[bard](https://github.com/perrette/bard)** — TTS sibling of scribe,
+  same tray/CLI architecture in reverse: highlight text, hear it
+  spoken. Shares the [`desktop-ai-core`](https://github.com/perrette/desktop-ai-core)
+  backbone (frontends, providers, dialog helpers).
 ## Compatibility
 Initially developed for Python 3 on Ubuntu 24.04 (GNOME + Wayland);
 works on macOS and Windows too. Wayland keystroke injection is
-convoluted but [solved](docs/keyboard.md). For dependencies of
+convoluted but [solved](docs/output.md). For dependencies of
 individual subsystems, check `pynput` (keyboard) and `pystray` (tray
 icon).
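
The auto-selection order described in the README hunk above (`groq` → `openai` → `whisper-futo` → `whisper` → `vosk`) could look roughly like the sketch below. It is illustrative only and is not taken from scribe's source: the function names `pick_backend` and `module_available`, the use of the API-key variables as the sole availability test for cloud backends, and the module names used to probe local backends (`pywhispercpp`, `faster_whisper`, `vosk`) are all assumptions.

```python
import importlib.util
import os

# Order as stated in the README; everything below it is an assumption.
BACKEND_ORDER = ["groq", "openai", "whisper-futo", "whisper", "vosk"]
# Cloud backends: assumed available when their API key is set.
REQUIRED_ENV = {"groq": "GROQ_API_KEY", "openai": "OPENAI_API_KEY"}
# Local backends: assumed available when this module can be imported.
REQUIRED_MODULE = {
    "whisper-futo": "pywhispercpp",
    "whisper": "faster_whisper",
    "vosk": "vosk",
}


def module_available(name):
    """Return True if `name` is importable in the current environment."""
    return importlib.util.find_spec(name) is not None


def pick_backend(env=None, has_module=module_available):
    """Return the first backend whose key or dependency is present, else None."""
    env = os.environ if env is None else env
    for backend in BACKEND_ORDER:
        if backend in REQUIRED_ENV:
            if env.get(REQUIRED_ENV[backend]):
                return backend
        elif has_module(REQUIRED_MODULE[backend]):
            return backend
    return None
```

Passing `env` and `has_module` explicitly keeps the sketch testable without touching the real environment; calling `pick_backend()` with no arguments probes the current process.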

scribe_cli-1.0.0/docs/app-tray-menu.png ADDED

Binary file

{scribe_cli-0.17.1 → scribe_cli-1.0.0}/docs/backends.md RENAMED

@@ -70,7 +70,7 @@ Vosk transcribes in real time and is very good at one language at a
 time, but tends to make more mistakes than Whisper and does not produce
 punctuation. It becomes really useful in longer, interactive sessions
 where the live "appears as you speak" UX matters — see
-[keyboard.md](keyboard.md) for how the keystroke mode interacts with
+[output.md](output.md) for how the keystroke mode interacts with
 streaming models.
 There are many [Vosk models](https://alphacephei.com/vosk/models)
@@ -117,12 +117,15 @@ for the full picture.
 ## `groq` (Groq cloud)
 Talks to Groq's OpenAI-compatible API and defaults to
-`whisper-large-v3-turbo`. Typically the fastest cloud option for
-full-utterance transcription:
+`whisper-large-v3-turbo`. **Extremely fast** thanks to Groq's
+inference hardware — the recommended cloud backend by default, and
+the natural pick for `--stream` mode where per-chunk roundtrip
+latency dominates perceived speed:
 ```bash
 export GROQ_API_KEY=YOURAPIKEY
-scribe --backend groq
+scribe --backend groq          # Clip mode (default)
+scribe --backend groq --stream # live transcription, per-chunk
 ```
 The `groq` backend reuses the `openai` Python client under the hood, so
@@ -146,14 +149,14 @@ style, domain, or word list. The concept is generic across the
 whisper-family backends but each backend exposes it slightly
 differently:
-| Backend                              | `--prompt`                    | `--words`                                              |
-|--------------------------------------|-------------------------------|--------------------------------------------------------|
-| `whisper` (faster-whisper, local)    | passed as `initial_prompt=`   | passed as `hotwords=` — a **dedicated biasing channel** separate from the prompt |
-| `whisper-futo` (pywhispercpp, local) | passed as `initial_prompt=`   | joined onto the prompt string (no separate hotwords channel here) |
-| `openai` batch (`gpt-4o*-transcribe`) | passed as `prompt=`           | joined onto the prompt string                          |
-| `groq` (`whisper-large-v3-turbo`)     | passed as `prompt=`           | joined onto the prompt string                          |
-| `openai` realtime (`gpt-realtime-whisper`) | *silently ignored* — the model rejects the prompt parameter server-side (HTTP 400 *"The 'prompt' parameter is not supported for this model."*). The kwarg stays accepted for plumbing compatibility but never reaches the API. | same — joined into the (ignored) prompt |
-| `vosk`                               | *ignored* (no soft prompt)    | *ignored* (Vosk only supports a hard `grammar` allowlist; not yet exposed) |
+| Backend                              | `--prompt`                    | `--words`                                              | `--language`                                           |
+|--------------------------------------|-------------------------------|--------------------------------------------------------|---------------------------------------------------------|
+| `whisper` (faster-whisper, local)    | passed as `initial_prompt=`   | passed as `hotwords=` — a **dedicated biasing channel** separate from the prompt | passed as `language=` (ISO 639-1); `-l en` also auto-substitutes `small.en` etc. |
+| `whisper-futo` (pywhispercpp, local) | passed as `initial_prompt=`   | joined onto the prompt string (no separate hotwords channel here) | passed as `language=` (ISO 639-1); `-l en` auto-substitutes `small.en` etc. |
+| `openai` batch (`gpt-4o*-transcribe`) | passed as `prompt=`           | joined onto the prompt string                          | passed as `language=` hint (ISO 639-1)                  |
+| `groq` (`whisper-large-v3-turbo`)     | passed as `prompt=`           | joined onto the prompt string                          | passed as `language=` hint (ISO 639-1)                  |
+| `openai` realtime (`gpt-realtime-whisper`) | *silently ignored* — the model rejects the prompt parameter server-side (HTTP 400 *"The 'prompt' parameter is not supported for this model."*). The kwarg stays accepted for plumbing compatibility but never reaches the API. | same — joined into the (ignored) prompt | passed as `language=` (ISO 639-1) |
+| `vosk`                               | *ignored* (no soft prompt)    | *ignored* (Vosk only supports a hard `grammar` allowlist; not yet exposed) | picks a per-language model from `scribe/models.toml`; no runtime parameter |
 The whisper-family APIs cap the prompt around ~224 tokens; longer
 hints are silently truncated. Faster-whisper's `hotwords` channel is
@@ -184,34 +187,117 @@ invocation, pass an explicit empty value: `--prompt ""` (or
 arguments (or `--words-file ""`) suppresses the words default. Each
 side is independent.
-## Pseudo-streaming (experimental)
-`--pseudo-streaming` makes a batch backend behave streaming-like by
-cutting the running buffer into chunks driven by silence:
+## Language
+`-l / --language LANG` tells the backend which language to expect.
+What that means in practice varies by backend (see the per-backend
+column in the table above):
+- **Whisper-family** (`whisper`, `whisper-futo`, `openai` batch +
+  realtime, `groq`) — the language is passed to the model as a hard
+  lock: the decoder generates that language regardless of what it
+  hears acoustically. Accepts any [ISO 639-1 short code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
+  Whisper recognises (~99 languages). When unset, Whisper auto-detects
+  per chunk.
+- **English-only model variants** — for `whisper` and `whisper-futo`,
+  `-l en` *also* auto-substitutes the English-only model when one
+  exists (`small` → `small.en`, etc.). These variants trade
+  multilingual coverage for English accuracy.
+- **Vosk** — language isn't a runtime parameter; vosk ships a
+  separate model per language. `-l fr` looks up the vosk model
+  pre-mapped to French in [`scribe/models.toml`](../scribe/models.toml)
+  and instantiates that one. Vosk has no auto-detect path, so the
+  Language menu's `Auto` entry on vosk falls back to a sensible
+  default — the tray shows `Auto (🇬🇧 en)` to make this explicit
+  without mutating the stored `language=None`.
+The tray's **Language** submenu exposes the four curated languages
+(`en` / `fr` / `de` / `it`) with origin-country flag prefixes
+(🇬🇧 / 🇫🇷 / 🇩🇪 / 🇮🇹). The CLI accepts these plus any other ISO 639-1
+code the active backend recognises.
+## Stream mode (works with any backend)
+`--stream` (or **Mode: Stream** in the tray) emits transcribed text
+**live as you speak**, regardless of which backend you picked. This
+is the headline v1.0.0 improvement: scribe abstracts over the two
+different mechanisms that backends use to deliver live output, so
+`--stream` works uniformly across every supported backend.
+- **Native streaming backends** (Vosk, `gpt-realtime-whisper`) push
+  partial results from the server as audio is received — scribe just
+  forwards them to the chosen output (focused window / clipboard /
+  terminal / file). These backends are *always* in Stream mode; the
+  Mode toggle reads "Mode: Stream (native)" for them and is read-only.
+- **Batch backends** (Whisper local, Whisper FUTO, OpenAI
+  `gpt-4o-*-transcribe`, Groq `whisper-large-v3-turbo`) don't accept
+  partial audio. scribe instead cuts the recording buffer on
+  detected silence and issues a separate transcription request for
+  each chunk — internally called *pseudo-streaming*. The user sees
+  the same live experience.
 ```bash
-scribe --pseudo-streaming --streaming-window 5
+scribe --stream                       # any backend, live transcription
+scribe --stream --backend groq        # Groq + Stream is the sweet spot
+scribe --stream --backend whisper     # local, live, no API key
 ```
-After `--streaming-window` seconds of buffered audio, scribe cuts at
-the first silence of at least `--silence-duration` and transcribes the
-chunk; if no silence arrives by `2 × --streaming-window`, it
-force-cuts. The session continues until you stop it. Default `5` s
-trades a little Whisper context for snappier "text appears as you
-speak" UX; raise it (10–30 s) if accuracy on long sentences matters
-more than latency.
-This is experimental and off by default. The tray menu surfaces the
-same toggle under Options ▶ Advanced ▶ Pseudo-streaming.
+### How pseudo-streaming carves up a recording
+Once the buffer has grown to at least `--stream-chunk-min` (default
+1.5 s), silence of at least `--stream-chunk-silence-break` (default
+0.6 s) triggers a chunk cut. A force-cut fires at `--stream-chunk-max`
+(default 10 s) regardless of silence, to cap latency. The session
+continues until you stop it manually.
+### Does pseudo-streaming change the API cost?
+For cloud backends, going from one big transcription to N chunked
+requests **does not normally change the bill**:
+- **Groq** (`whisper-large-v3-turbo`) is billed per second of audio.
+  Total audio is unchanged → same cost.
+- **OpenAI `whisper-1`** (legacy) is billed per minute of audio. Same
+  logic, same cost.
+- **OpenAI `gpt-4o-transcribe` / `gpt-4o-mini-transcribe`** are token-
+  billed (audio-in + text-out + prompt-in). Audio and output stay
+  identical; the only delta is the rolling cross-chunk *prompt*
+  context (~200 chars ≈ 50–60 tokens per chunk after the first).
+  At gpt-4o-mini-transcribe input rates this is negligible — well
+  under a cent per long session.
+That said, your real cost depends on your usage and your account's
+pricing tier — **verify on your provider's billing dashboard** if
+cost is a hard constraint.
+Two special values for `--stream-chunk-silence-break` (set via the
+tray's **Silence break** picker or `--stream-chunk-silence-break 0`
+at the CLI):
+- **Auto** (`0`) — disables the fixed-threshold trigger. At force-cut
+  time scribe picks the *longest* silence interval within the window
+  whose start position is at least `--stream-chunk-min` into the chunk,
+  re-cutting there for a more natural word boundary. Falls back to a
+  brute force-cut if no qualifying silence is found.
+- **Max** — disables silence-based cuts entirely; only the force-cut at
+  `--stream-chunk-max` fires. Useful when you want uniform chunk sizes
+  regardless of speech patterns. (Only selectable from the tray picker.)
+Stream mode is off by default — the default `Clip` mode transcribes the
+whole recording at end (`--clip`). The tray menu surfaces the same
+toggle as the top-level **Mode: Stream / Clip** item. Native
+streamers (vosk, `gpt-realtime-whisper`) are always streaming and the
+menu shows **Mode: Stream (native)** for them.
 ### Cross-chunk prompt context
-In pseudo-streaming mode scribe automatically augments each chunk's
-prompt with the trailing ~200 characters of the *previous* chunk's
-transcription. This rolling tail is concatenated onto whatever static
-`--prompt` / `--words` you configured and reaches the backend through
-the same channel as the static prompt (the vocabulary biasing table
-above). The motivation is cross-chunk continuity:
+In Stream mode (pseudo-streaming) scribe automatically augments
+each chunk's prompt with the trailing ~200 characters of the
+*previous* chunk's transcription. This rolling tail is concatenated
+onto whatever static `--prompt` / `--words` you configured and
+reaches the backend through the same channel as the static prompt
+(the vocabulary biasing table above). The motivation is cross-chunk
+continuity:
 - **Capitalization drift** — without context, a chunk that starts
   right after a period might come back lowercased.
@@ -225,14 +311,13 @@ Whisper's prompt window is capped at ~224 tokens; 200 chars of French
 sits well under that and leaves room for your static prompt + words
 list.
-The rolling tail is **dropped** whenever the pause that triggered the
-chunk cut exceeded 1.5 seconds — a long pause is treated as a new
-sentence/idea boundary, where carrying a possibly-bad prior chunk
-forward biases the next one more than it helps. This mirrors
-`whisper.cpp`'s `--keep-context off` default: prior-text conditioning
-can self-reinforce errors (hallucinations, decoder repetition loops)
-more readily than it provides useful continuity, so we cap it at
-natural sentence boundaries.
+The rolling tail is **dropped** when the silence between two
+utterances exceeds `--stream-context-reset-silence` ×
+`--stream-chunk-silence-break` (default 3 × 0.6 s = 1.8 s) — a long
+pause is treated as a new sentence/idea boundary, where carrying a
+possibly-bad prior chunk forward biases the next one more than it
+helps. Use `--stream-context-reset-silence inf` to keep context across
+arbitrarily long pauses.
 Short pauses (mid-sentence punctuation) keep the context; the cut at
 the start of every new recording also clears it.
