supervoxtral 0.1.4__tar.gz → 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (29)
  1. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/.gitignore +2 -0
  2. supervoxtral-0.3.0/AGENTS.md +104 -0
  3. supervoxtral-0.3.0/CLAUDE.md +1 -0
  4. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/PKG-INFO +3 -6
  5. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/README.md +24 -7
  6. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/pyproject.toml +4 -7
  7. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/cli.py +9 -4
  8. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/core/audio.py +7 -1
  9. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/core/config.py +17 -1
  10. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/core/pipeline.py +45 -32
  11. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/core/prompt.py +37 -12
  12. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/core/storage.py +1 -1
  13. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/providers/base.py +29 -9
  14. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/providers/mistral.py +75 -68
  15. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/ui/qt_app.py +35 -4
  16. supervoxtral-0.3.0/uv.lock +973 -0
  17. supervoxtral-0.1.4/AGENTS.md +0 -88
  18. supervoxtral-0.1.4/notes.md +0 -8
  19. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/LICENSE +0 -0
  20. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/logs/.gitkeep +0 -0
  21. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/macos-shortcut.png +0 -0
  22. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/prompt/.gitkeep +0 -0
  23. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/recordings/.gitkeep +0 -0
  24. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/supervoxtral.gif +0 -0
  25. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/__init__.py +0 -0
  26. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/core/__init__.py +0 -0
  27. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/core/clipboard.py +0 -0
  28. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/svx/providers/__init__.py +0 -0
  29. {supervoxtral-0.1.4 → supervoxtral-0.3.0}/transcripts/.gitkeep +0 -0
@@ -98,3 +98,5 @@ Icon?
  *.swp
  *.swo
  .python-version
+ .aider*
+ notes.md
@@ -0,0 +1,104 @@
+ # SuperVoxtral — Agent Guide
+
+ ## Project overview
+ Python CLI/GUI for audio recording + transcription via APIs (Mistral Voxtral). MVP: manual stop, API-based, zero-footprint defaults (temp files, no persistent dirs unless overridden), results in `transcripts/` when persisted.
+
+ ### Core Design Principles
+
+ 1. **Centralized Pipeline**: All recording/transcription flows through `RecordingPipeline` (svx/core/pipeline.py) for consistency between CLI and GUI
+ 2. **Config-driven**: Structured `Config` dataclass (svx/core/config.py) loaded from the user's config.toml; CLI args override specific values
+ 3. **Zero-footprint defaults**: Temp files auto-deleted unless `keep_*` flags or `--save-all` enabled; no project directories created by default
+ 4. **Provider abstraction**: `Provider` protocol (svx/providers/base.py) for pluggable transcription services (sketched just below)
+
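The `Provider` protocol itself is not shown in this diff; a rough sketch of its shape, inferred from the `prov.transcribe(...)`/`prov.chat(...)` call sites in the pipeline.py hunk further down (field and parameter details are assumptions):

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Protocol, TypedDict


class TranscriptionResult(TypedDict):
    text: str            # transcript (step 1) or transformed text (step 2)
    raw: dict[str, Any]  # raw provider response


class Provider(Protocol):
    """Pluggable transcription service (svx/providers/base.py)."""

    def transcribe(
        self, audio_path: Path, model: str, language: str | None = None
    ) -> TranscriptionResult: ...

    def chat(self, transcript: str, prompt: str, model: str) -> TranscriptionResult: ...
```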
+ ### Module Structure
+
+ - **svx/cli.py**: Typer CLI entrypoint; orchestration only, delegates to Config and Pipeline
+ - **svx/core/**:
+   - `config.py`: Config dataclasses, TOML loading, prompt resolution (supports multiple prompts via [prompt.key] sections), logging setup
+   - `pipeline.py`: RecordingPipeline class - records, converts (ffmpeg), transcribes, saves conditionally, copies to clipboard
+   - `audio.py`: WAV recording (sounddevice), ffmpeg detection/conversion to MP3/Opus
+   - `prompt.py`: Multi-prompt resolution from config dict (key-based: "default", "test", etc.)
+   - `storage.py`: Save transcripts/JSON conditionally based on keep_transcript_files
+   - `clipboard.py`: Cross-platform clipboard copy
+ - **svx/providers/**:
+   - `base.py`: Provider protocol, TranscriptionResult TypedDict, ProviderError
+   - `mistral.py`: Mistral Voxtral implementation (dedicated transcription endpoint + text-based LLM chat)
+   - `openai.py`: OpenAI Whisper implementation
+   - `__init__.py`: Provider registry (get_provider)
+ - **svx/ui/**:
+   - `qt_app.py`: PySide6 GUI (RecorderWindow/Worker) using Pipeline; dynamic buttons per prompt key
+
+ ### Execution Flow
+
+ 1. **Entry**: CLI parses args (--prompt, --save-all, --gui, --transcribe)
+ 2. **Config Load**: Config.load() reads config.toml (supports [prompt.default], [prompt.other], etc.); `chat_model` for the text LLM; API keys in [providers.mistral] or [providers.openai]
+ 3. **Context Bias**: Optional `context_bias` list in `[defaults]` (up to 100 items) — passed to Mistral's transcription endpoint to improve recognition of specific vocabulary (proper nouns, technical terms). Stored in `DefaultsConfig`, read by `MistralProvider.__init__`.
+ 4. **Prompt Resolution**:
+    - CLI: Uses the "default" prompt key unless --prompt/--prompt-file overrides
+    - GUI: Dynamic buttons for each [prompt.key]; the "Transcribe" button bypasses prompts
+    - Priority: CLI arg > config [prompt.key] > user prompt file > fallback
+ 5. **Pipeline Execution** (RecordingPipeline) — a 2-step pipeline, condensed in the sketch after this list:
+    - record(): WAV recording via sounddevice, temp file if keep_audio_files=false
+    - process(): Optional ffmpeg conversion, then:
+      - Step 1 (Transcription): audio → text via provider.transcribe() (dedicated endpoint, always)
+      - Step 2 (Transformation): text + prompt → text via provider.chat() (text LLM, only when a prompt is provided)
+      - Uses `cfg.defaults.model` for transcription, `cfg.defaults.chat_model` for transformation
+      - Conditional save_transcript (+ raw transcript file when transformation applied), clipboard copy
+    - clean(): Temp file cleanup
+ 6. **Transcribe Mode** (CLI only):
+    - --transcribe flag: No prompt, step 1 only (dedicated transcription endpoint)
+    - GUI: --transcribe ignored (warning); use the "Transcribe" button instead
+ 7. **Output**: CLI prints result; GUI emits via callback; temp files auto-deleted unless keep_* enabled
+
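Condensed from the `RecordingPipeline.process()` changes later in this diff, the two steps reduce to roughly the following (conversion, saving, and error handling elided; variable names as in pipeline.py):

```python
# Step 1: transcription via the dedicated endpoint (always runs).
result = prov.transcribe(to_send_path, model=cfg.defaults.model, language=cfg.defaults.language)
raw_transcript = result["text"]

# Step 2: transformation via the text LLM (only when a prompt is provided).
if not transcribe_mode and final_user_prompt:
    chat_result = prov.chat(raw_transcript, final_user_prompt, model=cfg.defaults.chat_model)
    text = chat_result["text"]
else:
    text = raw_transcript
```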
+ ## Build
+ ```bash
+ # Setup (creates .venv, editable install, lockfile-based)
+ uv sync --extra dev --extra gui
+ ```
+
+ ## Linting and Type Checking
+ ```bash
+ # Lint & format
+ uv run ruff check svx/
+
+ # Type checker
+ uv run basedpyright svx
+ ```
+
+ ## Running the Application
+ ```bash
+ # CLI: Record with prompt
+ svx record --prompt "Transcribe this audio"
+
+ # CLI: Pure transcription (no prompt)
+ svx record --transcribe
+
+ # GUI: Launch interactive recorder
+ svx record --gui
+
+ # Config management
+ svx config init   # Create default config.toml
+ svx config open   # Open config directory
+ svx config show   # Display current config
+ ```
+
+ ## Maintenance
+
+ - use `uv sync --extra dev --extra gui` to install/update dependencies
+ - after updating `pyproject.toml`, run `uv sync --extra dev --extra gui` to refresh the environment
+ - When adding modules: Propagate the Config instance; use RecordingPipeline for recording flows; handle temp files via keep_* flags.
+ - Test temp cleanup: Verify no leftovers in default mode (keep_*=false).
+
+
+ ## Code style
+ - **Imports**: `from __future__ import annotations` first, then stdlib, third-party, local
+ - **Formatting**: Black (100 line length), ruff for linting (E, F, I, UP rules)
+ - **Types**: Full type hints required, use `TypedDict` for data structures, `Protocol` for interfaces (e.g., Provider protocol, Config dataclasses with type hints)
+ - **Naming**: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
+ - **Docstrings**: Google-style with clear purpose/dependencies/`__all__` exports
+
+ ## Security
+ - API keys are configured in the user config file (`config.toml`), under provider-specific sections (see the fragment just below).
+   - Mistral: define `[providers.mistral].api_key`
+ - Environment variables are not used for API keys.
+ - Validate user inputs (e.g., paths in Config, prompt resolution).
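For reference, a minimal config fragment matching the placement described above (illustrative; the placeholder key and exact file location depend on `svx config init`):

```toml
[providers.mistral]
api_key = "YOUR_MISTRAL_API_KEY"  # placeholder value

[defaults]
provider = "mistral"
model = "voxtral-mini-latest"        # step 1: transcription
chat_model = "mistral-small-latest"  # step 2: transformation
```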
@@ -0,0 +1 @@
+ AGENTS.md
@@ -1,7 +1,7 @@
  Metadata-Version: 2.4
  Name: supervoxtral
- Version: 0.1.4
- Summary: CLI/GUI audio recorder and transcription client using Mistral Voxtral (chat with audio and transcription).
+ Version: 0.3.0
+ Summary: CLI/GUI audio recorder with 2-step pipeline: transcription (Voxtral) then text transformation (LLM).
  License: MIT
  License-File: LICENSE
  Keywords: audio,cli,gui,mistral,transcription,voxtral,whisper
@@ -14,10 +14,7 @@ Requires-Dist: sounddevice
  Requires-Dist: soundfile
  Requires-Dist: typer
  Provides-Extra: dev
- Requires-Dist: black; extra == 'dev'
- Requires-Dist: mypy; extra == 'dev'
- Requires-Dist: pytest; extra == 'dev'
+ Requires-Dist: basedpyright; extra == 'dev'
  Requires-Dist: ruff; extra == 'dev'
- Requires-Dist: types-python-dotenv; extra == 'dev'
  Provides-Extra: gui
  Requires-Dist: pyside6-essentials; extra == 'gui'
@@ -2,9 +2,9 @@

  ![Supervoxtral](supervoxtral.gif)

- SuperVoxtral is a lightweight Python CLI/GUI utility for recording microphone audio and integrate with Mistral's Voxtral APIs for transcription or audio-enabled chat.
+ SuperVoxtral is a lightweight Python CLI/GUI utility for recording microphone audio and processing it via a 2-step pipeline using Mistral's APIs.

- Voxtral models, such as `voxtral-mini-latest` and `voxtral-small-latest`, deliver fast inference times, high transcription accuracy across languages and accents, and minimal API costs. In contrast to OpenAI's Whisper, which performs only standalone transcription, Voxtral supports two modes: pure transcription via a dedicated endpoint (no prompts needed) or chat mode, where audio input combines with text prompts for refined outputs—like error correction or contextual summarization—without invoking a separate LLM.
+ The pipeline works in two stages: (1) **Transcription** — audio is converted to text using Voxtral's dedicated transcription endpoint (`voxtral-mini-latest`), which delivers fast inference, high accuracy across languages, and minimal API costs; (2) **Transformation** — the raw transcript is refined by a text-based LLM (e.g., `mistral-small-latest`) using a configurable prompt for tasks like error correction, summarization, or reformatting. In pure transcription mode (`--transcribe`), only step 1 is performed.

  For instance, use a prompt like: "_Transcribe this audio precisely and remove all minor speech hesitations: "um", "uh", "er", "euh", "ben", etc._"

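As a rough illustration of the two stages (invented sample text, not actual model output):

```
Step 1 (raw Voxtral transcript): "euh, je veux dire... je pense que c'est, um, une bonne idée"
Step 2 (LLM-transformed output): "Je pense que c'est une bonne idée."
```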
@@ -26,6 +26,7 @@ The package is available on PyPI. We recommend using `uv` (a fast Python package
  - For GUI support (includes PySide6):
  ```
  uv tool install "supervoxtral[gui]"
+ # to upgrade: uv tool upgrade "supervoxtral[gui]"
  ```

  - For core CLI only functionality:
@@ -64,10 +65,15 @@ This installs the `svx` command within the virtual environment. Make sure to act

  **For development** (local editing):
  1. Clone the repo and navigate to the project root.
- 2. Create/activate a virtual environment:
-    - macOS/Linux: `python -m venv .venv && source .venv/bin/activate`
-    - Windows: `python -m venv .venv && .\.venv\Scripts\Activate.ps1`
- 3. Install in editable mode: `pip install -e .` (or `pip install -e ".[dev]"` for dev tools).
+ 2. Install dependencies (creates `.venv` automatically, editable mode, lockfile-based):
+ ```
+ uv sync --extra dev --extra gui
+ ```
+ 3. Run linting and type checking:
+ ```
+ uv run ruff check svx/
+ uv run basedpyright svx
+ ```

  ## Quick Start

@@ -143,12 +149,20 @@ provider = "mistral"
  # Recording is always WAV; conversion is applied if "mp3" or "opus"
  format = "opus"

- # Model to use on the provider side (example for Mistral Voxtral)
+ # Model for audio transcription (dedicated endpoint)
  model = "voxtral-mini-latest"

+ # Model for text transformation via LLM (applied after transcription when a prompt is used)
+ chat_model = "mistral-small-latest"
+
  # Language hint (may help the provider)
  language = "fr"

+ # Context bias: up to 100 words/phrases to help recognize specific vocabulary
+ # (proper nouns, technical terms, brand names, etc.)
+ # context_bias = ["SuperVoxtral", "Mistral AI", "Voxtral"]
+ context_bias = []
+
  # Audio recording parameters
  rate = 16000
  channels = 1
@@ -233,6 +247,9 @@ By default in CLI, uses the 'default' prompt from config.toml [prompt.default].

  ## Changelog

+ - 0.3.0: Add `context_bias` support for Mistral Voxtral transcription — a list of up to 100 words/phrases to help the model recognize specific vocabulary (proper nouns, technical terms, brand names). Configurable in `config.toml` under `[defaults]`.
+ - 0.2.0: 2-step pipeline (transcription → transformation). Replaces chat-with-audio with a dedicated transcription endpoint + text-based LLM. New `chat_model` config option. Raw transcript saved separately when transformation is applied.
+ - 0.1.5: Fix bug in prompt selection
  - 0.1.4: Support for multiple prompts in config.toml with dynamic GUI buttons for each prompt key
  - 0.1.3: Minor style update
  - 0.1.2: Interactive mode in GUI (choose transcribe / prompt / cancel while recording)
@@ -4,8 +4,8 @@ build-backend = "hatchling.build"

  [project]
  name = "supervoxtral"
- version = "0.1.4"
- description = "CLI/GUI audio recorder and transcription client using Mistral Voxtral (chat with audio and transcription)."
+ version = "0.3.0"
+ description = "CLI/GUI audio recorder with 2-step pipeline: transcription (Voxtral) then text transformation (LLM)."
  requires-python = ">=3.11"
  license = { text = "MIT" }
  keywords = [
@@ -29,15 +29,11 @@ dependencies = [

  [project.optional-dependencies]
  gui = ["PySide6-Essentials"]
- dev = ["black", "ruff", "mypy", "types-python-dotenv", "pytest"]
+ dev = ["ruff", "basedpyright"]

  [project.scripts]
  svx = "svx.cli:app"

- [tool.black]
- line-length = 100
- target-version = ["py310"]
-
  [tool.ruff]
  line-length = 100
  target-version = "py310"
@@ -51,6 +47,7 @@ packages = ["svx"]

  [tool.basedpyright]
  typeCheckingMode = "standard" # "basic" | "standard" | "strict" (default: "standard")
+ reportUnknownParameterType = true
  # reportUnknownArgumentType = false
  # reportUnknownVariableType = false
  # reportUnusedCallResult = false
@@ -162,10 +162,14 @@ def record(
  ),
  ):
      """
-     Record audio from the microphone and send it to the selected provider.
+     Record audio from the microphone and process it via a 2-step pipeline.
+
+     Pipeline:
+     1. Transcription: audio -> text via dedicated transcription endpoint (always).
+     2. Transformation: text + prompt -> text via text-based LLM (when a prompt is provided).

      This CLI accepts only a small set of runtime flags. Most defaults (provider, format,
-     model, language, sample rate, channels, device,
+     model, chat_model, language, sample rate, channels, device,
      file retention, copy-to-clipboard)
      must be configured in the user's `config.toml` under [defaults].

@@ -178,11 +182,12 @@ def record(
      Flow:
      - Records WAV until you press Enter (CLI mode).
      - Optionally converts to MP3/Opus depending on config.
-     - Sends the file per provider rules.
+     - Transcribes via dedicated endpoint (step 1).
+     - If a prompt is provided, transforms the transcript via LLM (step 2).
      - Prints and saves the result.

      Note: In --transcribe mode, prompts (--user-prompt or --user-prompt-file) are ignored,
-     as it uses a dedicated transcription endpoint without prompting.
+     and only step 1 (transcription) is performed.
      """
  cfg = Config.load(log_level=log_level)

@@ -22,6 +22,7 @@ from pathlib import Path
  from threading import Event, Thread
  from typing import Any

+ import numpy as np
  import sounddevice as sd
  import soundfile as sf

@@ -149,7 +150,12 @@ def record_wav(
  writer_stop = Event()
  start_time = time.time()

- def audio_callback(indata, frames, time_info, status):
+ def audio_callback(
+     indata: np.ndarray[Any, np.dtype[np.int16]],
+     frames: int,
+     time_info: sd.CallbackFlags,
+     status: sd.CallbackFlags,
+ ) -> None:
      if status:
          logging.warning("SoundDevice status: %s", status)
      q.put(indata.copy())
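The callback above pushes captured blocks onto a queue that a writer drains to disk; a minimal standalone sketch of that pattern (the block count, file name, and writer structure here are assumptions; `record_wav` itself uses a stop `Event` and a writer `Thread`):

```python
import queue

import sounddevice as sd
import soundfile as sf

q: queue.Queue = queue.Queue()

def audio_callback(indata, frames, time_info, status) -> None:
    if status:
        print(status)
    q.put(indata.copy())  # hand each captured block to the writer, as in the diff above

# Writer side: pull blocks off the queue and append them to the WAV file.
with sf.SoundFile("out.wav", mode="w", samplerate=16000, channels=1, subtype="PCM_16") as f:
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16", callback=audio_callback):
        for _ in range(200):  # fixed block count for the sketch; the real writer runs until stopped
            f.write(q.get())
```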
@@ -220,10 +220,17 @@ def init_user_config(force: bool = False, prompt_file: Path | None = None) -> Pa
  '# File format sent to the provider: "wav" | "mp3" | "opus"\n'
  '# Recording is always WAV; conversion is applied if "mp3" or "opus"\n'
  'format = "opus"\n\n'
- "# Model to use on the provider side (example for Mistral Voxtral)\n"
+ "# Model for audio transcription (dedicated endpoint)\n"
  'model = "voxtral-mini-latest"\n\n'
+ "# Model for text transformation via LLM\n"
+ "# (applied after transcription when a prompt is used)\n"
+ 'chat_model = "mistral-small-latest"\n\n'
  "# Language hint (may help the provider)\n"
  'language = "fr"\n\n'
+ "# Context bias: up to 100 words/phrases to help recognize specific vocabulary\n"
+ "# (proper nouns, technical terms, brand names, etc.)\n"
+ '# context_bias = ["SuperVoxtral", "Mistral AI", "Voxtral"]\n'
+ "context_bias = []\n\n"
  "# Audio recording parameters\n"
  "rate = 16000\n"
  "channels = 1\n"
@@ -271,7 +278,9 @@ class DefaultsConfig:
  provider: str = "mistral"
  format: str = "opus"
  model: str = "voxtral-mini-latest"
+ chat_model: str = "mistral-small-latest"
  language: str | None = None
+ context_bias: list[str] = field(default_factory=list)
  rate: int = 16000
  channels: int = 1
  device: str | None = None
@@ -315,7 +324,11 @@ class Config:
  "provider": str(user_defaults_raw.get("provider", "mistral")),
  "format": str(user_defaults_raw.get("format", "opus")),
  "model": str(user_defaults_raw.get("model", "voxtral-mini-latest")),
+ "chat_model": str(user_defaults_raw.get("chat_model", "mistral-small-latest")),
  "language": user_defaults_raw.get("language"),
+ "context_bias": list(user_defaults_raw.get("context_bias", []))
+ if isinstance(user_defaults_raw.get("context_bias"), list)
+ else [],
  "rate": int(user_defaults_raw.get("rate", 16000)),
  "channels": int(user_defaults_raw.get("channels", 1)),
  "device": user_defaults_raw.get("device"),
@@ -335,6 +348,9 @@ class Config:
  format_ = defaults_data["format"]
  if format_ not in {"wav", "mp3", "opus"}:
      raise ValueError("format must be one of wav|mp3|opus")
+ context_bias = defaults_data["context_bias"]
+ if len(context_bias) > 100:
+     raise ValueError("context_bias cannot contain more than 100 items (Mistral API limit)")
  defaults = DefaultsConfig(**defaults_data)
  # Conditional output directories
  if defaults.keep_audio_files:
@@ -12,18 +12,23 @@ import svx.core.config as config
  from svx.core.audio import convert_audio, record_wav, timestamp
  from svx.core.clipboard import copy_to_clipboard
  from svx.core.config import Config
- from svx.core.storage import save_transcript
+ from svx.core.storage import save_text_file, save_transcript
  from svx.providers import get_provider


  class RecordingPipeline:
      """
-     Centralized pipeline for recording audio, transcribing via provider, saving outputs,
-     and copying to clipboard. Handles temporary files when not keeping audio.
+     Centralized pipeline for recording audio, transcribing via provider, optionally
+     transforming with a text LLM, saving outputs, and copying to clipboard.

+     Pipeline steps:
+     1. Transcription: audio -> text via dedicated transcription endpoint (always)
+     2. Transformation: text + prompt -> text via text-based LLM (when a prompt is provided)
+
+     Handles temporary files when not keeping audio.
      Supports runtime overrides like save_all for keeping all files and adding log handlers.
      Optional progress_callback for status updates (e.g., for GUI).
-     Supports transcribe_mode for pure transcription without prompt using dedicated endpoint.
+     Supports transcribe_mode for pure transcription without prompt (step 1 only).
      """

      def __init__(
@@ -136,31 +141,26 @@ class RecordingPipeline:
  self, wav_path: Path, duration: float, transcribe_mode: bool, user_prompt: str | None = None
  ) -> dict[str, Any]:
      """
-     Process recorded audio: convert if needed, transcribe, save, copy.
+     Process recorded audio: convert if needed, transcribe, optionally transform, save, copy.
+
+     Pipeline:
+     1. Transcription: audio -> text via dedicated endpoint (always)
+     2. Transformation: text + prompt -> text via LLM (when prompt is provided)

      Args:
          wav_path: Path to the recorded WAV file.
          duration: Recording duration in seconds.
-         transcribe_mode: Whether to use pure transcription mode.
-         user_prompt: User prompt to use (None for transcribe_mode).
+         transcribe_mode: Whether to use pure transcription mode (step 1 only).
+         user_prompt: User prompt to use for transformation (None for transcribe_mode).

      Returns:
-         Dict with 'text' (str), 'raw' (dict), 'duration' (float),
-         'paths' (dict of Path or None).
+         Dict with 'text' (str), 'raw_transcript' (str), 'raw' (dict),
+         'duration' (float), 'paths' (dict of Path or None).
      """
      # Resolve parameters
      provider = self.cfg.defaults.provider
      audio_format = self.cfg.defaults.format
      model = self.cfg.defaults.model
-     original_model = model
-     if transcribe_mode:
-         model = "voxtral-mini-latest"
-         if original_model != "voxtral-mini-latest":
-             logging.warning(
-                 "Transcribe mode: model override from '%s' to 'voxtral-mini-latest'\n"
-                 "(optimized for transcription).",
-                 original_model,
-             )
      language = self.cfg.defaults.language
      if wav_path.stem.endswith(".wav"):
          base = wav_path.stem.replace(".wav", "")
@@ -176,9 +176,11 @@ class RecordingPipeline:
          final_user_prompt = self.cfg.resolve_prompt(self.user_prompt, self.user_prompt_file)
      else:
          final_user_prompt = user_prompt
-         self._status("Transcribe mode not activated: using prompt.")
+         self._status("Prompt mode: transcription then transformation.")
  else:
-     self._status("Transcribe mode activated: no prompt used.")
+     self._status("Transcribe mode: transcription only, no prompt.")
+
+ logging.debug(f"Applied prompt: {final_user_prompt or 'None (transcribe mode)'}")

  paths: dict[str, Path | None] = {"wav": wav_path}

@@ -192,18 +194,22 @@ class RecordingPipeline:
      paths["converted"] = to_send_path
      _converted = True

- # Transcribe
+ # Step 1: Transcription (always)
  self._status("Transcribing...")
  prov = get_provider(provider, cfg=self.cfg)
- result = prov.transcribe(
-     to_send_path,
-     user_prompt=final_user_prompt,
-     model=model,
-     language=language,
-     transcribe_mode=transcribe_mode,
- )
- text = result["text"]
- raw = result["raw"]
+ result = prov.transcribe(to_send_path, model=model, language=language)
+ raw_transcript = result["text"]
+
+ # Step 2: Transformation (if prompt)
+ if not transcribe_mode and final_user_prompt:
+     self._status("Applying prompt...")
+     chat_model = self.cfg.defaults.chat_model
+     chat_result = prov.chat(raw_transcript, final_user_prompt, model=chat_model)
+     text = chat_result["text"]
+     raw = {"transcription": result["raw"], "transformation": chat_result["raw"]}
+ else:
+     text = raw_transcript
+     raw = result["raw"]

  # Save if keeping transcripts
  if keep_transcript:
@@ -213,6 +219,12 @@ class RecordingPipeline:
      )
      paths["txt"] = txt_path
      paths["json"] = json_path
+
+     # Save raw transcript separately when transformation was applied
+     if not transcribe_mode and final_user_prompt:
+         raw_txt_path = self.cfg.transcripts_dir / f"{base}_{provider}_raw.txt"
+         save_text_file(raw_txt_path, raw_transcript)
+         paths["raw_txt"] = raw_txt_path
  else:
      paths["txt"] = None
      paths["json"] = None
@@ -228,6 +240,7 @@ class RecordingPipeline:
  logging.info("Processing finished (%.2fs)", duration)
  return {
      "text": text,
+     "raw_transcript": raw_transcript,
      "raw": raw,
      "duration": duration,
      "paths": paths,
@@ -261,8 +274,8 @@ class RecordingPipeline:
      stop_event: Optional event to signal recording stop (e.g., for GUI).

  Returns:
-     Dict with 'text' (str), 'raw' (dict), 'duration' (float),
-     'paths' (dict of Path or None).
+     Dict with 'text' (str), 'raw_transcript' (str), 'raw' (dict),
+     'duration' (float), 'paths' (dict of Path or None).

  Raises:
      Exception: On recording, conversion, or transcription errors.
@@ -12,6 +12,7 @@ Intended to be small and dependency-light so it can be imported broadly.
  from __future__ import annotations

  import logging
+ from collections.abc import Callable
  from pathlib import Path

  from .config import USER_PROMPT_DIR, Config, PromptEntry
@@ -121,22 +122,45 @@ def resolve_user_prompt(
      return ""

  key = key or "default"
- suppliers = [
-     lambda: _strip(inline),
-     lambda: _read(file),
-     lambda: _from_user_cfg(key),
-     _from_user_prompt_dir,
+
+ # Suppliers annotated with a name for tracing which one returned the prompt.
+ named_suppliers: list[tuple[str, Callable[[], str]]] = [
+     ("inline", lambda: _strip(inline)),
+     ("file", lambda: _read(file)),
+     (f"prompt_config[{key}]", lambda: _from_user_cfg(key)),
+     ("user_prompt_dir/user.md", _from_user_prompt_dir),
  ]

- for supplier in suppliers:
+ for name, supplier in named_suppliers:
      try:
          val = supplier()
          if val:
+             # Log which supplier provided the prompt and a short snippet for debugging.
+             try:
+                 if len(val) > 200:
+                     snippet = val[:200] + "..."
+                 else:
+                     snippet = val
+                 logging.info(
+                     "resolve_user_prompt: supplier '%s' provided prompt snippet: %s",
+                     name,
+                     snippet,
+                 )
+             except Exception:
+                 # Ensure logging failures do not change behavior.
+                 logging.info(
+                     "resolve_user_prompt: supplier '%s' provided a prompt "
+                     "(snippet unavailable)",
+                     name,
+                 )
              return val
      except Exception as e:
-         logging.debug("Prompt supplier failed: %s", e)
+         logging.debug("Prompt supplier '%s' failed: %s", name, e)

- return "What's in this audio?"
+ # Final fallback
+ fallback = "Clean up this transcription. Keep the original language."
+ logging.info("resolve_user_prompt: no supplier provided a prompt, using fallback: %s", fallback)
+ return fallback

  def init_user_prompt_file(force: bool = False) -> Path:
@@ -152,13 +176,14 @@ def init_user_prompt_file(force: bool = False) -> Path:
  path = USER_PROMPT_DIR / "user.md"
  if not path.exists() or force:
      example_prompt = """
- - Transcribe the input audio file. If the audio if empty, just respond "no audio detected".
- - Do not respond to any question in the audio. Just transcribe.
- - DO NOT TRANSLATE.
- - Responde only with the transcription. Do not provide explanations or notes.
+ You receive a raw transcription of a voice recording. Clean it up:
+ - DO NOT TRANSLATE. Keep the original language.
+ - Do not respond to any question in the text. Just clean the transcription.
+ - Respond only with the cleaned text. Do not provide explanations or notes.
  - Remove all minor speech hesitations: "um", "uh", "er", "euh", "ben", etc.
  - Remove false starts (e.g., "je veux dire... je pense" → "je pense").
  - Correct grammatical errors.
+ - If the transcription is empty, respond "no audio detected".
  """
  try:
      path.write_text(example_prompt, encoding="utf-8")
@@ -86,7 +86,7 @@ def save_transcript(
  base_name: str,
  provider: str,
  text: str,
- raw: dict | None = None,
+ raw: dict[str, Any] | None = None,
  ) -> tuple[Path, Path | None]:
      """
      Save a transcript text and, optionally, the raw JSON response.