scribe-cli 0.18.0.tar.gz → 1.0.0.tar.gz

Files changed (68)

{scribe_cli-0.18.0 → scribe_cli-1.0.0}/.gitignore RENAMED

@@ -7,3 +7,4 @@ scribe/_version.py
 # Autonomous roadmap workflows (local coordination artifacts; never committed)
 workflows/
+.worktrees/

{scribe_cli-0.18.0 → scribe_cli-1.0.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: scribe-cli
-Version: 0.18.0
+Version: 1.0.0
 Summary: Speech-to-text CLI and system-tray app for dictating into any focused window. Local (vosk, faster-whisper) or cloud (groq, openai) backends, batch or streaming.
 Author-email: Mahé Perrette <mahe.perrette@gmail.com>
 License: MIT License
@@ -33,13 +33,34 @@ License: MIT License
         licenses of all dependencies before using or distributing this software to
         ensure compliance with their respective terms.
 Project-URL: Homepage, https://github.com/perrette/scribe
-Keywords: speech-to-text,speech recognition,transcription,dictation,voice-typing,voice-to-text,realtime,streaming,language,AI,local,API,cli,tray,vosk,whisper,openai,groq,gpt-4o,linux,wayland,keyboard,clipboard
+Project-URL: Source, https://github.com/perrette/scribe
+Project-URL: Issues, https://github.com/perrette/scribe/issues
+Project-URL: Changelog, https://github.com/perrette/scribe/releases
+Project-URL: Funding, https://github.com/sponsors/perrette
+Keywords: speech-to-text,stt,transcription,dictation,voice-typing,voice-recognition,multilingual,realtime,streaming,cli,tray,vosk,whisper,faster-whisper,openai,groq,gpt-4o,linux,wayland,keyboard,clipboard,microphone,audio
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: End Users/Desktop
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
 Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: 3.13
 Classifier: Operating System :: OS Independent
+Classifier: Environment :: Console
+Classifier: Environment :: X11 Applications
+Classifier: Environment :: MacOS X
+Classifier: Environment :: Win32 (MS Windows)
+Classifier: Natural Language :: English
+Classifier: Natural Language :: French
+Classifier: Natural Language :: German
+Classifier: Natural Language :: Italian
+Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Office/Business
+Classifier: Topic :: Text Processing :: Linguistic
+Classifier: Topic :: Utilities
 Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 License-File: LICENSE
@@ -92,11 +113,13 @@ cloud-based APIs, batch and streaming workflows.
 ## What it does
-- Records from your mic and transcribes via one of four backends —
-  **Vosk** (local, streaming), **Whisper** (local, batch), **OpenAI**
-  (cloud, batch *or* streaming), **Groq** (cloud, batch).
-- Delivers the transcript three ways: paste into the focused window
-  (default), copy to clipboard, or print to the terminal.
+- Records from your mic and transcribes via one of five backends —
+  **Vosk** (local, streaming), **Whisper** (local, batch),
+  **Whisper FUTO** (local, batch — ACFT-tuned for short dictations),
+  **OpenAI** (cloud, batch *or* streaming), **Groq** (cloud, batch).
+- Delivers the transcript four ways: paste into the focused window
+  (default), copy to clipboard, print to the terminal, or write to
+  a file.
 - Runs as a **system tray icon** with a single Record button, or as an
   interactive **terminal TUI** — same menu in both.
 - Hooks into your DE's keyboard shortcuts via `SIGUSR1` (toggle
@@ -126,8 +149,8 @@ scribe
 This launches the system tray icon. Press Record, speak, press Stop —
 the transcription lands in the focused window. Scribe picks the first
 backend whose key / dependency is present, in order **`groq` →
-`openai` → `whisper` → `vosk`**, so with `GROQ_API_KEY` set the
-command above is equivalent to:
+`openai` → `whisper-futo` → `whisper` → `vosk`**, so with `GROQ_API_KEY`
+set the command above is equivalent to:
 ```bash
 scribe --backend groq --model whisper-large-v3-turbo
@@ -142,15 +165,17 @@ scribe --backend openai --model gpt-4o-mini-transcribe # OpenAI sweet spot
 scribe --backend openai --model gpt-realtime-whisper   # OpenAI streaming
 scribe --backend whisper --model small                 # local, no API key
 scribe --frontend terminal                             # interactive TUI menu
-scribe --frontend terminal --no-interactive            # record immediately, no menu
+scribe --record                                        # start recording immediately on launch (works in tray or terminal)
+scribe --record --frontend terminal --mode file        # one-shot batched dictation → file
+scribe --record --frontend terminal --mode file --stream  # streamed: chunks appended live as you speak
 scribe --mode clipboard                                # copy to clipboard, no keystroke
 scribe --mode terminal                                 # only print to stdout
-scribe -o transcript.txt                               # also append to a file
+scribe --mode file -o transcript.txt                   # append to a file (no keystroke / clipboard)
 ```
 With `--no-interactive` (terminal frontend only), scribe skips the
 interactive menu and starts recording right away — handy for scripted,
-one-shot transcriptions. `--no-prompt` is kept as a deprecated alias.
+one-shot transcriptions.
 Bias the recogniser toward names, jargon, or a domain glossary with
 `--prompt "free text hint"` and `--words word1 word2 ...` (each also
@@ -161,12 +186,13 @@ for what each backend does with them.
 ## Backends at a glance
-| Backend         | `--backend` | Default model              | Streaming model(s)        | Requires                            |
-|-----------------|-------------|----------------------------|---------------------------|-------------------------------------|
-| Groq (cloud)    | `groq`      | `whisper-large-v3-turbo`   | —                         | `GROQ_API_KEY`                      |
-| OpenAI (cloud)  | `openai`    | `gpt-4o-mini-transcribe`   | `gpt-realtime-whisper`    | `OPENAI_API_KEY`                    |
-| Whisper (local) | `whisper`   | `small`                    | —                         | `pip install scribe-cli[whisper]`   |
-| Vosk (local)    | `vosk`      | language-dependent         | all Vosk models           | `pip install scribe-cli[vosk]`      |
+| Backend              | `--backend`     | Default model              | Streaming model(s)        | Requires                               |
+|----------------------|-----------------|----------------------------|---------------------------|----------------------------------------|
+| Groq (cloud)         | `groq`          | `whisper-large-v3-turbo`   | —                         | `GROQ_API_KEY`                         |
+| OpenAI (cloud)       | `openai`        | `gpt-4o-mini-transcribe`   | `gpt-realtime-whisper`    | `OPENAI_API_KEY`                       |
+| Whisper FUTO (local) | `whisper-futo`  | `small`                    | —                         | `pip install scribe-cli[whisper-futo]` |
+| Whisper (local)      | `whisper`       | `small`                    | —                         | `pip install scribe-cli[whisper]`      |
+| Vosk (local)         | `vosk`          | language-dependent         | all Vosk models           | `pip install scribe-cli[vosk]`         |
 Whether a transcription appears live as you speak or all at once when
 you stop depends on the **model** picked — see
@@ -175,8 +201,11 @@ you stop depends on the **model** picked — see
 ### Getting an API key
-Groq is a good cloud backend to start with — very fast, quite accurate, and the
-**free tier** is generous enough for everyday dictation. Sign up at
+Groq is the **recommended cloud backend by default** — extremely fast
+(by a wide margin compared to other cloud STT options, especially in
+**Stream** mode where the per-chunk roundtrip latency dominates the
+perceived speed), quite accurate, and the **free tier** is generous
+enough for everyday dictation. Sign up at
 [console.groq.com](https://console.groq.com/), create an API key
 under **Settings → API Keys**, and export it as `GROQ_API_KEY`.
@@ -189,7 +218,7 @@ I personally use [OpenAI](https://openai.com/api/) with `gpt-4o-mini-transcribe`
   extras, Ubuntu / GNOME tray libs.
 - [Backends in detail](docs/backends.md) — model lists, when to pick
   which, the realtime model.
-- [Keyboard modes & typer backends](docs/keyboard.md) — keystroke vs
+- [Output modes & typer backends](docs/output.md) — keystroke vs
   clipboard, Wayland / `eitype`, `--type-direct`.
 - [System tray & global hotkeys](docs/tray.md) — menu tree, icon
   states, `SIGUSR1`/`SIGUSR2`.
@@ -198,10 +227,17 @@ I personally use [OpenAI](https://openai.com/api/) with `gpt-4o-mini-transcribe`
 - [Fine tuning & CLI reference](docs/cli.md) — every `scribe --help`
   flag with examples.
+## Related projects
+- **[bard](https://github.com/perrette/bard)** — TTS sibling of scribe,
+  same tray/CLI architecture in reverse: highlight text, hear it
+  spoken. Shares the [`desktop-ai-core`](https://github.com/perrette/desktop-ai-core)
+  backbone (frontends, providers, dialog helpers).
 ## Compatibility
 Initially developed for Python 3 on Ubuntu 24.04 (GNOME + Wayland);
 works on macOS and Windows too. Wayland keystroke injection is
-convoluted but [solved](docs/keyboard.md). For dependencies of
+convoluted but [solved](docs/output.md). For dependencies of
 individual subsystems, check `pynput` (keyboard) and `pystray` (tray
 icon).

{scribe_cli-0.18.0 → scribe_cli-1.0.0}/README.md RENAMED

@@ -9,11 +9,13 @@ cloud-based APIs, batch and streaming workflows.
 ## What it does
-- Records from your mic and transcribes via one of four backends —
-  **Vosk** (local, streaming), **Whisper** (local, batch), **OpenAI**
-  (cloud, batch *or* streaming), **Groq** (cloud, batch).
-- Delivers the transcript three ways: paste into the focused window
-  (default), copy to clipboard, or print to the terminal.
+- Records from your mic and transcribes via one of five backends —
+  **Vosk** (local, streaming), **Whisper** (local, batch),
+  **Whisper FUTO** (local, batch — ACFT-tuned for short dictations),
+  **OpenAI** (cloud, batch *or* streaming), **Groq** (cloud, batch).
+- Delivers the transcript four ways: paste into the focused window
+  (default), copy to clipboard, print to the terminal, or write to
+  a file.
 - Runs as a **system tray icon** with a single Record button, or as an
   interactive **terminal TUI** — same menu in both.
 - Hooks into your DE's keyboard shortcuts via `SIGUSR1` (toggle
@@ -43,8 +45,8 @@ scribe
 This launches the system tray icon. Press Record, speak, press Stop —
 the transcription lands in the focused window. Scribe picks the first
 backend whose key / dependency is present, in order **`groq` →
-`openai` → `whisper` → `vosk`**, so with `GROQ_API_KEY` set the
-command above is equivalent to:
+`openai` → `whisper-futo` → `whisper` → `vosk`**, so with `GROQ_API_KEY`
+set the command above is equivalent to:
 ```bash
 scribe --backend groq --model whisper-large-v3-turbo
@@ -59,15 +61,17 @@ scribe --backend openai --model gpt-4o-mini-transcribe # OpenAI sweet spot
 scribe --backend openai --model gpt-realtime-whisper   # OpenAI streaming
 scribe --backend whisper --model small                 # local, no API key
 scribe --frontend terminal                             # interactive TUI menu
-scribe --frontend terminal --no-interactive            # record immediately, no menu
+scribe --record                                        # start recording immediately on launch (works in tray or terminal)
+scribe --record --frontend terminal --mode file        # one-shot batched dictation → file
+scribe --record --frontend terminal --mode file --stream  # streamed: chunks appended live as you speak
 scribe --mode clipboard                                # copy to clipboard, no keystroke
 scribe --mode terminal                                 # only print to stdout
-scribe -o transcript.txt                               # also append to a file
+scribe --mode file -o transcript.txt                   # append to a file (no keystroke / clipboard)
 ```
 With `--no-interactive` (terminal frontend only), scribe skips the
 interactive menu and starts recording right away — handy for scripted,
-one-shot transcriptions. `--no-prompt` is kept as a deprecated alias.
+one-shot transcriptions.
 Bias the recogniser toward names, jargon, or a domain glossary with
 `--prompt "free text hint"` and `--words word1 word2 ...` (each also
@@ -78,12 +82,13 @@ for what each backend does with them.
 ## Backends at a glance
-| Backend         | `--backend` | Default model              | Streaming model(s)        | Requires                            |
-|-----------------|-------------|----------------------------|---------------------------|-------------------------------------|
-| Groq (cloud)    | `groq`      | `whisper-large-v3-turbo`   | —                         | `GROQ_API_KEY`                      |
-| OpenAI (cloud)  | `openai`    | `gpt-4o-mini-transcribe`   | `gpt-realtime-whisper`    | `OPENAI_API_KEY`                    |
-| Whisper (local) | `whisper`   | `small`                    | —                         | `pip install scribe-cli[whisper]`   |
-| Vosk (local)    | `vosk`      | language-dependent         | all Vosk models           | `pip install scribe-cli[vosk]`      |
+| Backend              | `--backend`     | Default model              | Streaming model(s)        | Requires                               |
+|----------------------|-----------------|----------------------------|---------------------------|----------------------------------------|
+| Groq (cloud)         | `groq`          | `whisper-large-v3-turbo`   | —                         | `GROQ_API_KEY`                         |
+| OpenAI (cloud)       | `openai`        | `gpt-4o-mini-transcribe`   | `gpt-realtime-whisper`    | `OPENAI_API_KEY`                       |
+| Whisper FUTO (local) | `whisper-futo`  | `small`                    | —                         | `pip install scribe-cli[whisper-futo]` |
+| Whisper (local)      | `whisper`       | `small`                    | —                         | `pip install scribe-cli[whisper]`      |
+| Vosk (local)         | `vosk`          | language-dependent         | all Vosk models           | `pip install scribe-cli[vosk]`         |
 Whether a transcription appears live as you speak or all at once when
 you stop depends on the **model** picked — see
@@ -92,8 +97,11 @@ you stop depends on the **model** picked — see
 ### Getting an API key
-Groq is a good cloud backend to start with — very fast, quite accurate, and the
-**free tier** is generous enough for everyday dictation. Sign up at
+Groq is the **recommended cloud backend by default** — extremely fast
+(by a wide margin compared to other cloud STT options, especially in
+**Stream** mode where the per-chunk roundtrip latency dominates the
+perceived speed), quite accurate, and the **free tier** is generous
+enough for everyday dictation. Sign up at
 [console.groq.com](https://console.groq.com/), create an API key
 under **Settings → API Keys**, and export it as `GROQ_API_KEY`.
@@ -106,7 +114,7 @@ I personally use [OpenAI](https://openai.com/api/) with `gpt-4o-mini-transcribe`
   extras, Ubuntu / GNOME tray libs.
 - [Backends in detail](docs/backends.md) — model lists, when to pick
   which, the realtime model.
-- [Keyboard modes & typer backends](docs/keyboard.md) — keystroke vs
+- [Output modes & typer backends](docs/output.md) — keystroke vs
   clipboard, Wayland / `eitype`, `--type-direct`.
 - [System tray & global hotkeys](docs/tray.md) — menu tree, icon
   states, `SIGUSR1`/`SIGUSR2`.
@@ -115,10 +123,17 @@ I personally use [OpenAI](https://openai.com/api/) with `gpt-4o-mini-transcribe`
 - [Fine tuning & CLI reference](docs/cli.md) — every `scribe --help`
   flag with examples.
+## Related projects
+- **[bard](https://github.com/perrette/bard)** — TTS sibling of scribe,
+  same tray/CLI architecture in reverse: highlight text, hear it
+  spoken. Shares the [`desktop-ai-core`](https://github.com/perrette/desktop-ai-core)
+  backbone (frontends, providers, dialog helpers).
 ## Compatibility
 Initially developed for Python 3 on Ubuntu 24.04 (GNOME + Wayland);
 works on macOS and Windows too. Wayland keystroke injection is
-convoluted but [solved](docs/keyboard.md). For dependencies of
+convoluted but [solved](docs/output.md). For dependencies of
 individual subsystems, check `pynput` (keyboard) and `pystray` (tray
 icon).
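
Reviewer note: the updated README describes backend auto-selection as "first backend whose key / dependency is present" in the order `groq` → `openai` → `whisper-futo` → `whisper` → `vosk`. A minimal sketch of that rule follows, to make the new position of `whisper-futo` concrete. It is not the package's actual code: `pick_backend`, the lookup tables, and the importable module names (`pywhispercpp`, `faster_whisper`, `vosk`) are assumptions made for illustration.

```python
import os
from importlib.util import find_spec

# Order as stated in the 1.0.0 README.
BACKEND_ORDER = ["groq", "openai", "whisper-futo", "whisper", "vosk"]

# Cloud backends are considered available when their API key is set.
API_KEY_ENV = {"groq": "GROQ_API_KEY", "openai": "OPENAI_API_KEY"}

# Local backends are considered available when a module can be imported.
# These module names are assumptions, not taken from the diff.
LOCAL_MODULE = {
    "whisper-futo": "pywhispercpp",
    "whisper": "faster_whisper",
    "vosk": "vosk",
}


def _module_installed(name):
    """True if a top-level module with this name can be found."""
    return find_spec(name) is not None


def pick_backend(env=None, has_module=_module_installed):
    """Return the first available backend in README order, or None."""
    env = os.environ if env is None else env
    for backend in BACKEND_ORDER:
        if backend in API_KEY_ENV:
            if env.get(API_KEY_ENV[backend]):
                return backend
        elif has_module(LOCAL_MODULE[backend]):
            return backend
    return None
```

Passing `env` and `has_module` explicitly keeps the rule testable without touching the real environment; with only local extras installed, the sketch prefers `whisper-futo` over `whisper`, matching the new order.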

scribe_cli-1.0.0/docs/app-tray-menu.png ADDED

Binary file

{scribe_cli-0.18.0 → scribe_cli-1.0.0}/docs/backends.md RENAMED

@@ -70,7 +70,7 @@ Vosk transcribes in real time and is very good at one language at a
 time, but tends to make more mistakes than Whisper and does not produce
 punctuation. It becomes really useful in longer, interactive sessions
 where the live "appears as you speak" UX matters — see
-[keyboard.md](keyboard.md) for how the keystroke mode interacts with
+[output.md](output.md) for how the keystroke mode interacts with
 streaming models.
 There are many [Vosk models](https://alphacephei.com/vosk/models)
@@ -117,12 +117,15 @@ for the full picture.
 ## `groq` (Groq cloud)
 Talks to Groq's OpenAI-compatible API and defaults to
-`whisper-large-v3-turbo`. Typically the fastest cloud option for
-full-utterance transcription:
+`whisper-large-v3-turbo`. **Extremely fast** thanks to Groq's
+inference hardware — the recommended cloud backend by default, and
+the natural pick for `--stream` mode where per-chunk roundtrip
+latency dominates perceived speed:
 ```bash
 export GROQ_API_KEY=YOURAPIKEY
-scribe --backend groq
+scribe --backend groq          # Clip mode (default)
+scribe --backend groq --stream # live transcription, per-chunk
 ```
 The `groq` backend reuses the `openai` Python client under the hood, so
@@ -146,14 +149,14 @@ style, domain, or word list. The concept is generic across the
 whisper-family backends but each backend exposes it slightly
 differently:
-| Backend                              | `--prompt`                    | `--words`                                              |
-|--------------------------------------|-------------------------------|--------------------------------------------------------|
-| `whisper` (faster-whisper, local)    | passed as `initial_prompt=`   | passed as `hotwords=` — a **dedicated biasing channel** separate from the prompt |
-| `whisper-futo` (pywhispercpp, local) | passed as `initial_prompt=`   | joined onto the prompt string (no separate hotwords channel here) |
-| `openai` batch (`gpt-4o*-transcribe`) | passed as `prompt=`           | joined onto the prompt string                          |
-| `groq` (`whisper-large-v3-turbo`)     | passed as `prompt=`           | joined onto the prompt string                          |
-| `openai` realtime (`gpt-realtime-whisper`) | *silently ignored* — the model rejects the prompt parameter server-side (HTTP 400 *"The 'prompt' parameter is not supported for this model."*). The kwarg stays accepted for plumbing compatibility but never reaches the API. | same — joined into the (ignored) prompt |
-| `vosk`                               | *ignored* (no soft prompt)    | *ignored* (Vosk only supports a hard `grammar` allowlist; not yet exposed) |
+| Backend                              | `--prompt`                    | `--words`                                              | `--language`                                           |
+|--------------------------------------|-------------------------------|--------------------------------------------------------|---------------------------------------------------------|
+| `whisper` (faster-whisper, local)    | passed as `initial_prompt=`   | passed as `hotwords=` — a **dedicated biasing channel** separate from the prompt | passed as `language=` (ISO 639-1); `-l en` also auto-substitutes `small.en` etc. |
+| `whisper-futo` (pywhispercpp, local) | passed as `initial_prompt=`   | joined onto the prompt string (no separate hotwords channel here) | passed as `language=` (ISO 639-1); `-l en` auto-substitutes `small.en` etc. |
+| `openai` batch (`gpt-4o*-transcribe`) | passed as `prompt=`           | joined onto the prompt string                          | passed as `language=` hint (ISO 639-1)                  |
+| `groq` (`whisper-large-v3-turbo`)     | passed as `prompt=`           | joined onto the prompt string                          | passed as `language=` hint (ISO 639-1)                  |
+| `openai` realtime (`gpt-realtime-whisper`) | *silently ignored* — the model rejects the prompt parameter server-side (HTTP 400 *"The 'prompt' parameter is not supported for this model."*). The kwarg stays accepted for plumbing compatibility but never reaches the API. | same — joined into the (ignored) prompt | passed as `language=` (ISO 639-1) |
+| `vosk`                               | *ignored* (no soft prompt)    | *ignored* (Vosk only supports a hard `grammar` allowlist; not yet exposed) | picks a per-language model from `scribe/models.toml`; no runtime parameter |
 The whisper-family APIs cap the prompt around ~224 tokens; longer
 hints are silently truncated. Faster-whisper's `hotwords` channel is
@@ -184,34 +187,117 @@ invocation, pass an explicit empty value: `--prompt ""` (or
 arguments (or `--words-file ""`) suppresses the words default. Each
 side is independent.
-## Pseudo-streaming (experimental)
-`--pseudo-streaming` makes a batch backend behave streaming-like by
-cutting the running buffer into chunks driven by silence:
+## Language
+`-l / --language LANG` tells the backend which language to expect.
+What that means in practice varies by backend (see the per-backend
+column in the table above):
+- **Whisper-family** (`whisper`, `whisper-futo`, `openai` batch +
+  realtime, `groq`) — the language is passed to the model as a hard
+  lock: the decoder generates that language regardless of what it
+  hears acoustically. Accepts any [ISO 639-1 short code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
+  Whisper recognises (~99 languages). When unset, Whisper auto-detects
+  per chunk.
+- **English-only model variants** — for `whisper` and `whisper-futo`,
+  `-l en` *also* auto-substitutes the English-only model when one
+  exists (`small` → `small.en`, etc.). These variants trade
+  multilingual coverage for English accuracy.
+- **Vosk** — language isn't a runtime parameter; vosk ships a
+  separate model per language. `-l fr` looks up the vosk model
+  pre-mapped to French in [`scribe/models.toml`](../scribe/models.toml)
+  and instantiates that one. Vosk has no auto-detect path, so the
+  Language menu's `Auto` entry on vosk falls back to a sensible
+  default — the tray shows `Auto (🇬🇧 en)` to make this explicit
+  without mutating the stored `language=None`.
+The tray's **Language** submenu exposes the four curated languages
+(`en` / `fr` / `de` / `it`) with origin-country flag prefixes
+(🇬🇧 / 🇫🇷 / 🇩🇪 / 🇮🇹). The CLI accepts these plus any other ISO 639-1
+code the active backend recognises.
+## Stream mode (works with any backend)
+`--stream` (or **Mode: Stream** in the tray) emits transcribed text
+**live as you speak**, regardless of which backend you picked. This
+is the headline v1.0.0 improvement: scribe abstracts over the two
+different mechanisms that backends use to deliver live output, so
+`--stream` works uniformly across every supported backend.
+- **Native streaming backends** (Vosk, `gpt-realtime-whisper`) push
+  partial results from the server as audio is received — scribe just
+  forwards them to the chosen output (focused window / clipboard /
+  terminal / file). These backends are *always* in Stream mode; the
+  Mode toggle reads "Mode: Stream (native)" for them and is read-only.
+- **Batch backends** (Whisper local, Whisper FUTO, OpenAI
+  `gpt-4o-*-transcribe`, Groq `whisper-large-v3-turbo`) don't accept
+  partial audio. scribe instead cuts the recording buffer on
+  detected silence and issues a separate transcription request for
+  each chunk — internally called *pseudo-streaming*. The user sees
+  the same live experience.
 ```bash
-scribe --pseudo-streaming --streaming-window 5
+scribe --stream                       # any backend, live transcription
+scribe --stream --backend groq        # Groq + Stream is the sweet spot
+scribe --stream --backend whisper     # local, live, no API key
 ```
-After `--streaming-window` seconds of buffered audio, scribe cuts at
-the first silence of at least `--silence-duration` and transcribes the
-chunk; if no silence arrives by `2 × --streaming-window`, it
-force-cuts. The session continues until you stop it. Default `5` s
-trades a little Whisper context for snappier "text appears as you
-speak" UX; raise it (10–30 s) if accuracy on long sentences matters
-more than latency.
-This is experimental and off by default. The tray menu surfaces the
-same toggle under Options ▶ Advanced ▶ Pseudo-streaming.
+### How pseudo-streaming carves up a recording
+Once the buffer has grown to at least `--stream-chunk-min` (default
+1.5 s), silence of at least `--stream-chunk-silence-break` (default
+0.6 s) triggers a chunk cut. A force-cut fires at `--stream-chunk-max`
+(default 10 s) regardless of silence, to cap latency. The session
+continues until you stop it manually.
+### Does pseudo-streaming change the API cost?
+For cloud backends, going from one big transcription to N chunked
+requests **does not normally change the bill**:
+- **Groq** (`whisper-large-v3-turbo`) is billed per second of audio.
+  Total audio is unchanged → same cost.
+- **OpenAI `whisper-1`** (legacy) is billed per minute of audio. Same
+  logic, same cost.
+- **OpenAI `gpt-4o-transcribe` / `gpt-4o-mini-transcribe`** are token-
+  billed (audio-in + text-out + prompt-in). Audio and output stay
+  identical; the only delta is the rolling cross-chunk *prompt*
+  context (~200 chars ≈ 50–60 tokens per chunk after the first).
+  At gpt-4o-mini-transcribe input rates this is negligible — well
+  under a cent per long session.
+That said, your real cost depends on your usage and your account's
+pricing tier — **verify on your provider's billing dashboard** if
+cost is a hard constraint.
+Two special values for `--stream-chunk-silence-break` (set via the
+tray's **Silence break** picker or `--stream-chunk-silence-break 0`
+at the CLI):
+- **Auto** (`0`) — disables the fixed-threshold trigger. At force-cut
+  time scribe picks the *longest* silence interval within the window
+  whose start position is at least `--stream-chunk-min` into the chunk,
+  re-cutting there for a more natural word boundary. Falls back to a
+  brute force-cut if no qualifying silence is found.
+- **Max** — disables silence-based cuts entirely; only the force-cut at
+  `--stream-chunk-max` fires. Useful when you want uniform chunk sizes
+  regardless of speech patterns. (Only selectable from the tray picker.)
+Stream mode is off by default — the default `Clip` mode transcribes the
+whole recording at end (`--clip`). The tray menu surfaces the same
+toggle as the top-level **Mode: Stream / Clip** item. Native
+streamers (vosk, `gpt-realtime-whisper`) are always streaming and the
+menu shows **Mode: Stream (native)** for them.
 ### Cross-chunk prompt context
-In pseudo-streaming mode scribe automatically augments each chunk's
-prompt with the trailing ~200 characters of the *previous* chunk's
-transcription. This rolling tail is concatenated onto whatever static
-`--prompt` / `--words` you configured and reaches the backend through
-the same channel as the static prompt (the vocabulary biasing table
-above). The motivation is cross-chunk continuity:
+In Stream mode (pseudo-streaming) scribe automatically augments
+each chunk's prompt with the trailing ~200 characters of the
+*previous* chunk's transcription. This rolling tail is concatenated
+onto whatever static `--prompt` / `--words` you configured and
+reaches the backend through the same channel as the static prompt
+(the vocabulary biasing table above). The motivation is cross-chunk
+continuity:
 - **Capitalization drift** — without context, a chunk that starts
   right after a period might come back lowercased.
@@ -226,13 +312,12 @@ sits well under that and leaves room for your static prompt + words
 list.
 The rolling tail is **dropped** when the silence between two
-utterances exceeds 1.5 seconds — a long pause is treated as a new
-sentence/idea boundary, where carrying a possibly-bad prior chunk
-forward biases the next one more than it helps. This mirrors
-`whisper.cpp`'s `--keep-context off` default: prior-text conditioning
-can self-reinforce errors (hallucinations, decoder repetition loops)
-more readily than it provides useful continuity, so we cap it at
-natural sentence boundaries.
+utterances exceeds `--stream-context-reset-silence` ×
+`--stream-chunk-silence-break` (default 3 × 0.6 s = 1.8 s) — a long
+pause is treated as a new sentence/idea boundary, where carrying a
+possibly-bad prior chunk forward biases the next one more than it
+helps. Use `--stream-context-reset-silence inf` to keep context across
+arbitrarily long pauses.
 Short pauses (mid-sentence punctuation) keep the context; the cut at
 the start of every new recording also clears it.
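
Reviewer note: the new `docs/backends.md` text replaces the old `--streaming-window` rule with three thresholds (`--stream-chunk-min`, `--stream-chunk-silence-break`, `--stream-chunk-max`) plus a context-reset factor. The sketch below restates those rules as two small functions, using the documented defaults. It is an illustration only, not scribe's implementation: the function names and signatures are hypothetical, and the special **Auto** (`0`) and **Max** values of the silence break are deliberately left out.

```python
def should_cut(buffered_s, trailing_silence_s,
               chunk_min=1.5, silence_break=0.6, chunk_max=10.0):
    """Decide whether to cut the current chunk.

    Returns "force" once the buffer reaches chunk_max (regardless of
    silence), "silence" when the buffer is at least chunk_min long and
    the trailing silence reaches silence_break, otherwise None.
    """
    if buffered_s >= chunk_max:
        return "force"
    if buffered_s >= chunk_min and trailing_silence_s >= silence_break:
        return "silence"
    return None


def keeps_context(pause_s, reset_factor=3.0, silence_break=0.6):
    """True if the rolling prompt tail survives a pause of pause_s seconds.

    The tail is dropped when the pause exceeds
    reset_factor * silence_break (about 1.8 s with the defaults);
    reset_factor=float("inf") keeps it across any pause.
    """
    return pause_s <= reset_factor * silence_break
```

For example, a 2 s buffer followed by 0.7 s of silence is cut on silence, while a 1 s buffer with the same silence is not, because it is still shorter than the minimum chunk length.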

{scribe_cli-0.18.0 → scribe_cli-1.0.0}/docs/cli.md RENAMED

@@ -13,10 +13,11 @@ The flags are grouped to mirror the source-of-truth in
 | Flag                            | Purpose                                                                 |
 |---------------------------------|-------------------------------------------------------------------------|
-| `--backend {vosk,whisper,openai,groq}` | Speech-recognition backend (prompted if omitted).                |
+| `--backend {vosk,whisper,whisper-futo,openai,groq}` | Speech-recognition backend (prompted if omitted).            |
 | `--model NAME`                  | Model name for the chosen backend. Auto-routes to the right backend for known model names (e.g. `--model gpt-realtime-whisper` selects `openai`). |
-| `-l, --language LANG`           | Language alias selecting a preset Vosk model (`en`/`fr`/`de`/`it`), or `en` for English-only Whisper models. |
+| `-l, --language LANG`           | Language alias selecting a preset Vosk model (`en`/`fr`/`de`/`it`), or `en` for English-only Whisper / Whisper-FUTO models. |
 | `--download-folder-whisper DIR` | Folder to store Whisper models.                                         |
+| `--download-folder-whisper-futo DIR` | Folder to store Whisper-FUTO ACFT ggml models (default: `$XDG_CACHE_HOME/whisper-futo`). |
 | `--download-folder-vosk DIR`    | Folder to store Vosk models.                                            |
 ## Prompting & vocabulary biasing
@@ -55,22 +56,23 @@ flag suppresses only its own side (giving `--prompt ""` still loads
 | Flag                  | Purpose                                                  |
 |-----------------------|----------------------------------------------------------|
 | `--input-device N`    | Microphone device index (see `python -m sounddevice`).   |
+| `--dry-run`           | Short-circuit the STT request boundary in every backend: model load is skipped and the network/SDK call returns a canned `[dry-run transcript]`. Used by the backend × mode smoke-test matrix; handy for plumbing without network access. |
 ## Output
 | Flag                        | Purpose                                                                                     |
 |-----------------------------|---------------------------------------------------------------------------------------------|
-| `-m, --mode {keystroke,clipboard,terminal}` | Where transcribed text goes (default `keystroke`). See [keyboard.md](keyboard.md). |
+| `-m, --mode {keystroke,clipboard,terminal,file}` | Where transcribed text goes (default `keystroke`). `file` routes the transcript exclusively to `--output-file` and suppresses keyboard/clipboard output. See [output.md](output.md). |
 | `--typer {auto,eitype,pynput,wtype,ydotool}` | Keystroke-injection backend (default `auto`).                                |
 | `--type-direct`             | In keystroke mode, type the transcription as keystrokes instead of synthesising Ctrl+V.     |
-| `-o, --output-file FILE`    | Also append the transcription to this file.                                                 |
+| `-o, --output-file FILE`    | Path the transcription is appended to when `--mode file`. Defaults to `<user-desktop>/scribe-notes.txt` (the platform Desktop folder — `~/Desktop` on Linux/macOS, `%USERPROFILE%\Desktop` on Windows; falls back to home dir if Desktop is absent). Ignored when `--mode` is anything other than `file` (the four output modes are mutually exclusive). |
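The new `--output-file` row describes a small resolution rule: the path only matters in `--mode file`, and the default falls back from the Desktop folder to the home directory. A minimal sketch of that rule follows, assuming the Desktop folder is simply `<home>/Desktop` on every platform. The function name `resolve_output_file` and its `home` parameter are hypothetical and are not taken from the scribe source.

```python
from pathlib import Path
from typing import Optional

# File name taken from the table row above.
DEFAULT_NOTES_NAME = "scribe-notes.txt"


def resolve_output_file(mode: str, output_file: Optional[str], home: Path) -> Optional[Path]:
    """Return the file to append to, or None when the mode writes no file.

    Hypothetical helper: illustrates the documented rule, not scribe's code.
    """
    if mode != "file":
        # Per the table, -o is ignored for keystroke/clipboard/terminal.
        return None
    if output_file:
        return Path(output_file)
    # Assumption: Desktop is <home>/Desktop; fall back to home if absent.
    desktop = home / "Desktop"
    base = desktop if desktop.is_dir() else home
    return base / DEFAULT_NOTES_NAME
```

Passing `home` explicitly keeps the sketch easy to exercise against a temporary directory; a real caller would likely pass `Path.home()`.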
 ## Silence detection
-| Flag                       | Default | Purpose                                                                |
-|----------------------------|---------|------------------------------------------------------------------------|
-| `--duration SECS`          | `120`   | Max recording duration in seconds.                                     |
-| `--silence-duration SECS`  | `0.6`   | How long silence must persist before triggering a backend's silence behavior (realtime auto-commit, pseudo-streaming cut). |
+> **Deprecated aliases** (still accepted, hidden from `--help`):
+> `--duration N` maps to `--clip-timeout N`; `--silence-duration N`
+> sets both `--stream-chunk-silence-break` and `--realtime-commit-silence`
+> to `N`. Existing scripts using these flags continue to work.
 ## Voice activity detection
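The deprecated-alias note in this hunk (`--duration` and `--silence-duration`) amounts to a small fan-out from old option names to new ones. The sketch below shows one way to express that mapping over an already-parsed options dictionary. The function name and the dictionary keys are hypothetical, and the choice to let an explicitly given new option win over a deprecated alias is an assumption, since the diff does not say which takes precedence.

```python
from typing import Dict


def expand_deprecated_aliases(args: Dict[str, float]) -> Dict[str, float]:
    """Rewrite deprecated option names to their 1.0.0 equivalents.

    Hypothetical helper; keys mirror the flag names with dashes
    replaced by underscores.
    """
    out = dict(args)
    if "duration" in out:
        value = out.pop("duration")
        # --duration N maps to --clip-timeout N.
        out.setdefault("clip_timeout", value)
    if "silence_duration" in out:
        value = out.pop("silence_duration")
        # --silence-duration N sets both new silence knobs to N.
        out.setdefault("stream_chunk_silence_break", value)
        out.setdefault("realtime_commit_silence", value)
    return out
```

Using `setdefault` means a script that passes both the old and the new flag keeps the new flag's value (the assumed precedence noted above).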
@@ -94,22 +96,53 @@ mode's knobs are ignored.
 ## Realtime (`gpt-realtime-whisper`)
-| Flag                                              | Default  | Purpose                                                                                                                                                                                  |
-|---------------------------------------------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `--realtime-delay {minimal,low,medium,high,xhigh}` | `medium` | Trade off latency vs accuracy on `gpt-realtime-whisper`. Lower = faster partials but more paste churn in the focused window.                                                             |
-| `--realtime-gate` / `--no-realtime-gate`          | on       | Drop silent frames (per the active `--vad-mode`) before sending them over the WebSocket so silent audio isn't billed as input tokens. After `--silence-duration` of silence, also commit mid-session so trailing words flush live. |
+| Flag                                              | Default      | Purpose                                                                                                                                                                                  |
+|---------------------------------------------------|--------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `--realtime-delay {minimal,low,medium,high,xhigh}` | `medium`    | Trade off latency vs accuracy on `gpt-realtime-whisper`. Lower = faster partials but more paste churn in the focused window.                                                             |
+| `--realtime-gate` / `--no-realtime-gate`          | on           | Drop silent frames (per the active `--vad-mode`) before sending them over the WebSocket so silent audio isn't billed as input tokens. |
+| `--realtime-commit-silence SECS`                  | `0.6`        | Seconds of silence before a mid-session commit flushes trailing words to the server (default `0.6`). Set to `0` to rely solely on the server's turn detection. |
+The tray's **Stream (advanced) › Stream** picker unifies `--realtime-gate`
+and `--realtime-commit-silence` into a single choice: **Live** (gate
+off, commit disabled — server turn detection only) or **Offline after
+Xs** (gate on, commit after X seconds of silence). At the CLI level the
+two flags remain independent. The auto-stop is documented under
+**Listening mode → `--stream-timeout`** below (covers both native
+streamers and pseudo-streaming on batch backends).
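The tray picker described above collapses two independent CLI flags into one choice. A rough sketch of the equivalent CLI arguments follows. It assumes that "commit disabled" corresponds to `--realtime-commit-silence 0`, which is how the table row describes relying solely on server turn detection. The function name `stream_picker_to_flags` is hypothetical.

```python
from typing import List, Optional


def stream_picker_to_flags(offline_after: Optional[float]) -> List[str]:
    """Translate the tray's Stream picker choice into CLI flags.

    offline_after=None models "Live"; a number models "Offline after Xs".
    Hypothetical helper, not part of scribe.
    """
    if offline_after is None:
        # Live: gate off, mid-session commit disabled (assumed to be 0).
        return ["--no-realtime-gate", "--realtime-commit-silence", "0"]
    # Offline after Xs: gate on, commit after X seconds of silence.
    return ["--realtime-gate", "--realtime-commit-silence", str(offline_after)]
```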
 Streaming models (Vosk, `gpt-realtime-whisper`) ignore the batch
 silence-chunking knobs; they have their own end-of-utterance signal.
+## Listening mode
+| Flag                              | Default | Purpose                                                                                   |
+|-----------------------------------|---------|-------------------------------------------------------------------------------------------|
+| `--stream`                        | —       | Force a batch backend (whisper, whisper-futo, openai non-realtime, groq) into pseudo-streaming — live chunks driven by `--stream-chunk-max` and `--stream-chunk-silence-break`. Same as the tray's **Mode: Stream**. |
+| `--clip`                          | default | Transcribe the whole recording at end. Same as the tray's **Mode: Clip**.                 |
+| `--stream-chunk-max SECS`         | `10`    | Maximum chunk duration in seconds. Force-cut fires at this threshold when no silence pause has been detected (default `10`). |
+| `--stream-chunk-min SECS`         | `1.5`   | Minimum chunk size before a silence-cut is allowed (default `1.5`). Prevents very short clips that cause Whisper hallucinations. |
+| `--stream-chunk-silence-break SECS` | `0.6` | Silence duration that triggers a chunk cut (default `0.6`). Special value `0` enables Auto mode (best-silence-in-window at force-cut time). |
+| `--stream-context-reset-silence X` | `3.0`  | Multiplier of `--stream-chunk-silence-break` above which the rolling cross-chunk prompt context is discarded (default `3.0`, i.e. 1.8 s at default silence-break). Use `inf` to never reset. |
+| `--clip-timeout SECS`             | `120`   | Auto-stop after this many seconds in Clip mode (default `120`). |
+| `--stream-timeout SECS`           | `None`  | Auto-stop after this many seconds in Stream mode (`None` = Always On, no auto-stop). Tray equivalent: **Stream timeout** in the Stream (advanced) submenu. |
+Native streamers (vosk, `gpt-realtime-whisper`) are always streaming
+and ignore `--clip`. `--realtime`, `--pseudo-streaming`,
+`--streaming-window`, and `--realtime-timeout` are kept as hidden
+back-compat aliases (`--streaming-window N` maps to
+`--stream-chunk-max 2N` to preserve the old effective force-cut
+threshold; `--realtime-timeout` maps to `--stream-timeout`).
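The Listening mode knobs interact through simple arithmetic, which the sketch below spells out using the defaults from the table. It is an illustration of the documented thresholds only: all function names are hypothetical, and Auto mode (`--stream-chunk-silence-break 0`) is reduced to "force-cut only" because the diff does not describe how the best silence in the window is chosen.

```python
import math
from typing import Optional


def context_reset_threshold(silence_break: float = 0.6, multiplier: float = 3.0) -> float:
    """Silence (seconds) above which the rolling prompt context is discarded."""
    return silence_break * multiplier


def should_cut_chunk(
    chunk_len: float,
    trailing_silence: float,
    chunk_max: float = 10.0,
    chunk_min: float = 1.5,
    silence_break: float = 0.6,
) -> Optional[str]:
    """Return "force", "silence", or None for the current chunk state."""
    if chunk_len >= chunk_max:
        return "force"  # --stream-chunk-max reached without a pause
    if silence_break > 0 and trailing_silence >= silence_break and chunk_len >= chunk_min:
        return "silence"  # pause long enough and chunk not too short
    return None  # keep recording (also covers Auto mode before the force-cut)


def legacy_streaming_window_to_chunk_max(window: float) -> float:
    """--streaming-window N maps to --stream-chunk-max 2N."""
    return 2 * window
```

With the defaults, the context-reset threshold works out to roughly 1.8 s, and passing `math.inf` as the multiplier yields an infinite threshold, matching the "never reset" behaviour described for `inf`.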
 ## Frontend
 | Flag                        | Purpose                                                              |
 |-----------------------------|----------------------------------------------------------------------|
 | `--frontend {tray,terminal}` | UI to launch (default `tray`).                                       |
-| `--no-interactive`          | In terminal mode, skip the interactive menu and record immediately. (`--no-prompt` is kept as a deprecated alias.) |
+| `--no-interactive`          | In terminal mode, skip the interactive menu and record immediately. |
+| `--record`                  | Start recording immediately on launch, frontend-agnostic. In terminal it's a one-line shortcut for `--no-interactive`; in tray it auto-fires the Record action ~0.5 s after the icon comes up. Useful for hotkey bindings (`scribe --record` triggers a recording from anywhere) and batched / scripted invocations. |
 | `--vosk-models M [M ...]`   | Vosk models offered in the tray menu.                                |
 | `--whisper-models M [M ...]` | Whisper models offered in the tray menu.                             |
+| `--whisper-futo-models M [M ...]` | Whisper-FUTO ACFT models offered in the tray menu.              |
 ## Examples
@@ -134,13 +167,31 @@ environment) — you'll pay for silent audio while the session is open:
 scribe --model gpt-realtime-whisper --no-realtime-gate
 ```
-Run scribe headlessly into a file without touching the clipboard or
-focused window:
+**Batched / scripted use** — record one dictation headlessly, write
+it where you want, exit. No tray, no menu, no clipboard:
 ```bash
-scribe --frontend terminal --no-interactive --mode terminal -o session.txt
+# Append to a file (default <Desktop>/scribe-notes.txt — override with -o)
+scribe --record --frontend terminal --mode file
+# Same with a custom path
+scribe --record --frontend terminal --mode file -o /tmp/notes.txt
+# Pipe-friendly: transcript on stdout
+scribe --record --frontend terminal --mode terminal
+# Streamed: chunks appended live (as you speak) instead of all-at-once
+# at end-of-recording. Useful for long dictations and tail-following:
+#   tail -f /tmp/notes.txt
+scribe --record --frontend terminal --mode file --stream -o /tmp/notes.txt
 ```
+`--record` starts the recording immediately, `--frontend terminal`
+skips the tray icon, `--mode file` (or `terminal`) picks where the
+transcript lands, `--stream` (optional) emits chunks live instead of
+the default Clip-mode all-at-once. Combine with a hotkey or cron for
+one-shot capture.
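For scripted use from Python, the flag combinations explained above can be assembled programmatically. The sketch below only builds the argument list and uses only flags that appear in this diff. The helper name `build_scribe_command` is hypothetical, and the list would still need to be handed to something like `subprocess.run` on a machine where `scribe` is installed.

```python
import shlex
from typing import List, Optional


def build_scribe_command(
    mode: str = "file",
    output_file: Optional[str] = None,
    stream: bool = False,
) -> List[str]:
    """Assemble a one-shot, headless scribe invocation (hypothetical helper)."""
    cmd = ["scribe", "--record", "--frontend", "terminal", "--mode", mode]
    if stream:
        cmd.append("--stream")  # emit chunks live instead of at the end
    if output_file is not None and mode == "file":
        # -o is documented as ignored outside --mode file, so omit it there.
        cmd += ["-o", output_file]
    return cmd


# Reproduces the streamed example from the bash block above.
print(shlex.join(build_scribe_command(stream=True, output_file="/tmp/notes.txt")))
```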
 Bias the recogniser toward domain jargon (medical terms, proper names):
 ```bash
