supervoxtral 0.1.5__tar.gz → 0.3.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/.gitignore +2 -0
- supervoxtral-0.3.0/AGENTS.md +104 -0
- supervoxtral-0.3.0/CLAUDE.md +1 -0
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/PKG-INFO +3 -6
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/README.md +22 -7
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/pyproject.toml +4 -7
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/cli.py +9 -4
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/core/audio.py +7 -1
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/core/config.py +17 -1
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/core/pipeline.py +43 -32
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/core/prompt.py +6 -5
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/core/storage.py +1 -1
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/providers/base.py +29 -9
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/providers/mistral.py +75 -68
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/ui/qt_app.py +1 -1
- supervoxtral-0.3.0/uv.lock +973 -0
- supervoxtral-0.1.5/AGENTS.md +0 -88
- supervoxtral-0.1.5/notes.md +0 -8
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/LICENSE +0 -0
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/logs/.gitkeep +0 -0
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/macos-shortcut.png +0 -0
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/prompt/.gitkeep +0 -0
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/recordings/.gitkeep +0 -0
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/supervoxtral.gif +0 -0
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/__init__.py +0 -0
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/core/__init__.py +0 -0
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/core/clipboard.py +0 -0
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/svx/providers/__init__.py +0 -0
- {supervoxtral-0.1.5 → supervoxtral-0.3.0}/transcripts/.gitkeep +0 -0
supervoxtral-0.3.0/AGENTS.md (new file)
@@ -0,0 +1,104 @@
+# SuperVoxtral — Agent Guide
+
+## Project overview
+
+Python CLI/GUI for audio recording + transcription via APIs (Mistral Voxtral). MVP: manual stop, API-based, zero-footprint defaults (temp files, no persistent dirs unless overridden), results in `transcripts/` when persisted.
+
+### Core Design Principles
+
+1. **Centralized Pipeline**: All recording/transcription flows through `RecordingPipeline` (svx/core/pipeline.py) for consistency between CLI and GUI
+2. **Config-driven**: Structured `Config` dataclass (svx/core/config.py) loaded from user's config.toml; CLI args override specific values
+3. **Zero-footprint defaults**: Temp files auto-deleted unless `keep_*` flags or `--save-all` enabled; no project directories created by default
+4. **Provider abstraction**: `Provider` protocol (svx/providers/base.py) for pluggable transcription services
+
+### Module Structure
+
+- **svx/cli.py**: Typer CLI entrypoint; orchestration only, delegates to Config and Pipeline
+- **svx/core/**:
+  - `config.py`: Config dataclasses, TOML loading, prompt resolution (supports multiple prompts via [prompt.key] sections), logging setup
+  - `pipeline.py`: RecordingPipeline class - records, converts (ffmpeg), transcribes, saves conditionally, copies to clipboard
+  - `audio.py`: WAV recording (sounddevice), ffmpeg detection/conversion to MP3/Opus
+  - `prompt.py`: Multi-prompt resolution from config dict (key-based: "default", "test", etc.)
+  - `storage.py`: Save transcripts/JSON conditionally based on keep_transcript_files
+  - `clipboard.py`: Cross-platform clipboard copy
+- **svx/providers/**:
+  - `base.py`: Provider protocol, TranscriptionResult TypedDict, ProviderError
+  - `mistral.py`: Mistral Voxtral implementation (dedicated transcription endpoint + text-based LLM chat)
+  - `openai.py`: OpenAI Whisper implementation
+  - `__init__.py`: Provider registry (get_provider)
+- **svx/ui/**:
+  - `qt_app.py`: PySide6 GUI (RecorderWindow/Worker) using Pipeline; dynamic buttons per prompt key
+
+### Execution Flow
+
+1. **Entry**: CLI parses args (--prompt, --save-all, --gui, --transcribe)
+2. **Config Load**: Config.load() reads config.toml (supports [prompt.default], [prompt.other], etc.); `chat_model` for text LLM; API keys in [providers.mistral] or [providers.openai]
+3. **Context Bias**: Optional `context_bias` list in `[defaults]` (up to 100 items) — passed to Mistral's transcription endpoint to improve recognition of specific vocabulary (proper nouns, technical terms). Stored in `DefaultsConfig`, read by `MistralProvider.__init__`.
+4. **Prompt Resolution**:
+   - CLI: Uses "default" prompt key unless --prompt/--prompt-file overrides
+   - GUI: Dynamic buttons for each [prompt.key]; "Transcribe" button bypasses prompt
+   - Priority: CLI arg > config [prompt.key] > user prompt file > fallback
+5. **Pipeline Execution** (RecordingPipeline) — 2-step pipeline:
+   - record(): WAV recording via sounddevice, temp file if keep_audio_files=false
+   - process(): Optional ffmpeg conversion, then:
+     - Step 1 (Transcription): audio → text via provider.transcribe() (dedicated endpoint, always)
+     - Step 2 (Transformation): text + prompt → text via provider.chat() (text LLM, only when prompt provided)
+   - Uses `cfg.defaults.model` for transcription, `cfg.defaults.chat_model` for transformation
+   - Conditional save_transcript (+ raw transcript file when transformation applied), clipboard copy
+   - clean(): Temp file cleanup
+6. **Transcribe Mode** (CLI only):
+   - --transcribe flag: No prompt, step 1 only (dedicated transcription endpoint)
+   - GUI: --transcribe ignored (warning); use "Transcribe" button instead
+7. **Output**: CLI prints result; GUI emits via callback; temp files auto-deleted unless keep_* enabled
+
+## Build
+```bash
+# Setup (creates .venv, editable install, lockfile-based)
+uv sync --extra dev --extra gui
+```
+
+## Linting and Type Checking
+```bash
+# Lint & format
+uv run ruff check svx/
+
+# Type checker
+uv run basedpyright svx
+```
+
+## Running the Application
+```bash
+# CLI: Record with prompt
+svx record --prompt "Transcribe this audio"
+
+# CLI: Pure transcription (no prompt)
+svx record --transcribe
+
+# GUI: Launch interactive recorder
+svx record --gui
+
+# Config management
+svx config init   # Create default config.toml
+svx config open   # Open config directory
+svx config show   # Display current config
+```
+
+## Maintenance
+
+- use `uv sync --extra dev --extra gui` to install/update dependencies
+- after updating `pyproject.toml`, run `uv sync --extra dev --extra gui` to refresh the environment
+- When adding modules: Propagate Config instance; use RecordingPipeline for recording flows; handle temp files via keep_* flags.
+- Test temp cleanup: Verify no leftovers in default mode (keep_*=false).
+
+## Code style
+- **Imports**: `from __future__ import annotations` first, then stdlib, third-party, local
+- **Formatting**: Black (100 line length), ruff for linting (E, F, I, UP rules)
+- **Types**: Full type hints required, use `TypedDict` for data structures, `Protocol` for interfaces (e.g., Provider protocol, Config dataclasses with type hints)
+- **Naming**: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
+- **Docstrings**: Google-style with clear purpose/dependencies/`__all__` exports
+
+## Security
+- API keys are configured in the user config file (`config.toml`), under provider-specific sections.
+  - Mistral: define `[providers.mistral].api_key`
+- Environment variables are not used for API keys.
+- Validate user inputs (e.g., paths in Config, prompt resolution).
supervoxtral-0.3.0/CLAUDE.md (new file)
@@ -0,0 +1 @@
+AGENTS.md
supervoxtral: PKG-INFO
@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: supervoxtral
-Version: 0.1.5
-Summary: CLI/GUI audio recorder
+Version: 0.3.0
+Summary: CLI/GUI audio recorder with 2-step pipeline: transcription (Voxtral) then text transformation (LLM).
 License: MIT
 License-File: LICENSE
 Keywords: audio,cli,gui,mistral,transcription,voxtral,whisper
@@ -14,10 +14,7 @@ Requires-Dist: sounddevice
 Requires-Dist: soundfile
 Requires-Dist: typer
 Provides-Extra: dev
-Requires-Dist:
-Requires-Dist: mypy; extra == 'dev'
-Requires-Dist: pytest; extra == 'dev'
+Requires-Dist: basedpyright; extra == 'dev'
 Requires-Dist: ruff; extra == 'dev'
-Requires-Dist: types-python-dotenv; extra == 'dev'
 Provides-Extra: gui
 Requires-Dist: pyside6-essentials; extra == 'gui'
supervoxtral: README.md
@@ -2,9 +2,9 @@
 
 
 
-SuperVoxtral is a lightweight Python CLI/GUI utility for recording microphone audio and
+SuperVoxtral is a lightweight Python CLI/GUI utility for recording microphone audio and processing it via a 2-step pipeline using Mistral's APIs.
 
-Voxtral
+The pipeline works in two stages: (1) **Transcription** — audio is converted to text using Voxtral's dedicated transcription endpoint (`voxtral-mini-latest`), which delivers fast inference, high accuracy across languages, and minimal API costs; (2) **Transformation** — the raw transcript is refined by a text-based LLM (e.g., `mistral-small-latest`) using a configurable prompt for tasks like error correction, summarization, or reformatting. In pure transcription mode (`--transcribe`), only step 1 is performed.
 
 For instance, use a prompt like: "_Transcribe this audio precisely and remove all minor speech hesitations: "um", "uh", "er", "euh", "ben", etc._"
 
@@ -65,10 +65,15 @@ This installs the `svx` command within the virtual environment. Make sure to act
 
 **For development** (local editing):
 1. Clone the repo and navigate to the project root.
-2.
-
-
-
+2. Install dependencies (creates `.venv` automatically, editable mode, lockfile-based):
+   ```
+   uv sync --extra dev --extra gui
+   ```
+3. Run linting and type checking:
+   ```
+   uv run ruff check svx/
+   uv run basedpyright svx
+   ```
 
 ## Quick Start
 
@@ -144,12 +149,20 @@ provider = "mistral"
 # Recording is always WAV; conversion is applied if "mp3" or "opus"
 format = "opus"
 
-# Model
+# Model for audio transcription (dedicated endpoint)
 model = "voxtral-mini-latest"
 
+# Model for text transformation via LLM (applied after transcription when a prompt is used)
+chat_model = "mistral-small-latest"
+
 # Language hint (may help the provider)
 language = "fr"
 
+# Context bias: up to 100 words/phrases to help recognize specific vocabulary
+# (proper nouns, technical terms, brand names, etc.)
+# context_bias = ["SuperVoxtral", "Mistral AI", "Voxtral"]
+context_bias = []
+
 # Audio recording parameters
 rate = 16000
 channels = 1
@@ -234,6 +247,8 @@ By default in CLI, uses the 'default' prompt from config.toml [prompt.default].
 
 ## Changelog
 
+- 0.3.0: Add `context_bias` support for Mistral Voxtral transcription — a list of up to 100 words/phrases to help the model recognize specific vocabulary (proper nouns, technical terms, brand names). Configurable in `config.toml` under `[defaults]`.
+- 0.2.0: 2-step pipeline (transcription → transformation). Replaces chat-with-audio by dedicated transcription endpoint + text-based LLM. New `chat_model` config option. Raw transcript saved separately when transformation is applied.
 - 0.1.5: Fix bug on prompt selecting
 - 0.1.4: Support for multiple prompts in config.toml with dynamic GUI buttons for each prompt key
 - 0.1.3: Minor style update
supervoxtral: pyproject.toml
@@ -4,8 +4,8 @@ build-backend = "hatchling.build"
 
 [project]
 name = "supervoxtral"
-version = "0.1.5"
-description = "CLI/GUI audio recorder
+version = "0.3.0"
+description = "CLI/GUI audio recorder with 2-step pipeline: transcription (Voxtral) then text transformation (LLM)."
 requires-python = ">=3.11"
 license = { text = "MIT" }
 keywords = [
@@ -29,15 +29,11 @@ dependencies = [
 
 [project.optional-dependencies]
 gui = ["PySide6-Essentials"]
-dev = ["
+dev = ["ruff", "basedpyright"]
 
 [project.scripts]
 svx = "svx.cli:app"
 
-[tool.black]
-line-length = 100
-target-version = ["py310"]
-
 [tool.ruff]
 line-length = 100
 target-version = "py310"
@@ -51,6 +47,7 @@ packages = ["svx"]
 
 [tool.basedpyright]
 typeCheckingMode = "standard" # "basic" | "standard" | "strict" (default: "standard")
+reportUnknownParameterType = true
 # reportUnknownArgumentType = false
 # reportUnknownVariableType = false
 # reportUnusedCallResult = false
svx/cli.py
@@ -162,10 +162,14 @@ def record(
     ),
 ):
     """
-    Record audio from the microphone and
+    Record audio from the microphone and process it via a 2-step pipeline.
+
+    Pipeline:
+    1. Transcription: audio -> text via dedicated transcription endpoint (always).
+    2. Transformation: text + prompt -> text via text-based LLM (when a prompt is provided).
 
     This CLI accepts only a small set of runtime flags. Most defaults (provider, format,
-    model, language, sample rate, channels, device,
+    model, chat_model, language, sample rate, channels, device,
     file retention, copy-to-clipboard)
     must be configured in the user's `config.toml` under [defaults].
 
@@ -178,11 +182,12 @@ def record(
     Flow:
     - Records WAV until you press Enter (CLI mode).
    - Optionally converts to MP3/Opus depending on config.
-    -
+    - Transcribes via dedicated endpoint (step 1).
+    - If a prompt is provided, transforms the transcript via LLM (step 2).
     - Prints and saves the result.
 
     Note: In --transcribe mode, prompts (--user-prompt or --user-prompt-file) are ignored,
-
+    and only step 1 (transcription) is performed.
     """
     cfg = Config.load(log_level=log_level)
 
svx/core/audio.py
@@ -22,6 +22,7 @@ from pathlib import Path
 from threading import Event, Thread
 from typing import Any
 
+import numpy as np
 import sounddevice as sd
 import soundfile as sf
 
@@ -149,7 +150,12 @@ def record_wav(
     writer_stop = Event()
     start_time = time.time()
 
-    def audio_callback(
+    def audio_callback(
+        indata: np.ndarray[Any, np.dtype[np.int16]],
+        frames: int,
+        time_info: sd.CallbackFlags,
+        status: sd.CallbackFlags,
+    ) -> None:
         if status:
             logging.warning("SoundDevice status: %s", status)
         q.put(indata.copy())
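The callback annotated above feeds a queue that a writer thread drains; that producer/consumer pattern can be exercised without any audio device. A minimal sketch with stub data (no `sounddevice` call; the callback signature here is a plain-typed stand-in, not the exact `svx` one):

```python
import queue

import numpy as np

# Queue shared between the audio callback (producer) and the writer (consumer).
q: "queue.Queue[np.ndarray]" = queue.Queue()


def audio_callback(indata: np.ndarray, frames: int, time_info, status) -> None:
    if status:
        print("status:", status)
    # Copy the block: the audio driver reuses its buffer between callbacks.
    q.put(indata.copy())


# Simulate one 4-frame mono int16 block arriving from the driver.
block = np.zeros((4, 1), dtype=np.int16)
audio_callback(block, 4, None, None)
assert q.get().shape == (4, 1)
```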
svx/core/config.py
@@ -220,10 +220,17 @@ def init_user_config(force: bool = False, prompt_file: Path | None = None) -> Pa
     '# File format sent to the provider: "wav" | "mp3" | "opus"\n'
     '# Recording is always WAV; conversion is applied if "mp3" or "opus"\n'
     'format = "opus"\n\n'
-    "# Model
+    "# Model for audio transcription (dedicated endpoint)\n"
     'model = "voxtral-mini-latest"\n\n'
+    "# Model for text transformation via LLM\n"
+    "# (applied after transcription when a prompt is used)\n"
+    'chat_model = "mistral-small-latest"\n\n'
     "# Language hint (may help the provider)\n"
     'language = "fr"\n\n'
+    "# Context bias: up to 100 words/phrases to help recognize specific vocabulary\n"
+    "# (proper nouns, technical terms, brand names, etc.)\n"
+    '# context_bias = ["SuperVoxtral", "Mistral AI", "Voxtral"]\n'
+    "context_bias = []\n\n"
     "# Audio recording parameters\n"
     "rate = 16000\n"
     "channels = 1\n"
@@ -271,7 +278,9 @@ class DefaultsConfig:
     provider: str = "mistral"
     format: str = "opus"
     model: str = "voxtral-mini-latest"
+    chat_model: str = "mistral-small-latest"
     language: str | None = None
+    context_bias: list[str] = field(default_factory=list)
     rate: int = 16000
     channels: int = 1
     device: str | None = None
@@ -315,7 +324,11 @@ class Config:
     "provider": str(user_defaults_raw.get("provider", "mistral")),
     "format": str(user_defaults_raw.get("format", "opus")),
     "model": str(user_defaults_raw.get("model", "voxtral-mini-latest")),
+    "chat_model": str(user_defaults_raw.get("chat_model", "mistral-small-latest")),
     "language": user_defaults_raw.get("language"),
+    "context_bias": list(user_defaults_raw.get("context_bias", []))
+    if isinstance(user_defaults_raw.get("context_bias"), list)
+    else [],
     "rate": int(user_defaults_raw.get("rate", 16000)),
     "channels": int(user_defaults_raw.get("channels", 1)),
     "device": user_defaults_raw.get("device"),
@@ -335,6 +348,9 @@ class Config:
     format_ = defaults_data["format"]
     if format_ not in {"wav", "mp3", "opus"}:
         raise ValueError("format must be one of wav|mp3|opus")
+    context_bias = defaults_data["context_bias"]
+    if len(context_bias) > 100:
+        raise ValueError("context_bias cannot contain more than 100 items (Mistral API limit)")
     defaults = DefaultsConfig(**defaults_data)
     # Conditional output directories
     if defaults.keep_audio_files:
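The two `context_bias` guards shown in the config.py hunks above (fall back to `[]` for non-list values, reject more than 100 items) can be condensed into one helper. A minimal sketch; `validate_context_bias` is an illustrative name, not a function in the package:

```python
def validate_context_bias(value: object) -> list[str]:
    """Normalize and validate a context_bias config value."""
    # Non-list values fall back to an empty list, as in Config.load.
    items = [str(x) for x in value] if isinstance(value, list) else []
    # The diff attributes the 100-item ceiling to the Mistral API.
    if len(items) > 100:
        raise ValueError("context_bias cannot contain more than 100 items (Mistral API limit)")
    return items


assert validate_context_bias(["SuperVoxtral", "Voxtral"]) == ["SuperVoxtral", "Voxtral"]
assert validate_context_bias("not-a-list") == []
```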
svx/core/pipeline.py
@@ -12,18 +12,23 @@ import svx.core.config as config
 from svx.core.audio import convert_audio, record_wav, timestamp
 from svx.core.clipboard import copy_to_clipboard
 from svx.core.config import Config
-from svx.core.storage import save_transcript
+from svx.core.storage import save_text_file, save_transcript
 from svx.providers import get_provider
 
 
 class RecordingPipeline:
     """
-    Centralized pipeline for recording audio, transcribing via provider,
-
+    Centralized pipeline for recording audio, transcribing via provider, optionally
+    transforming with a text LLM, saving outputs, and copying to clipboard.
 
+    Pipeline steps:
+    1. Transcription: audio -> text via dedicated transcription endpoint (always)
+    2. Transformation: text + prompt -> text via text-based LLM (when a prompt is provided)
+
+    Handles temporary files when not keeping audio.
     Supports runtime overrides like save_all for keeping all files and adding log handlers.
     Optional progress_callback for status updates (e.g., for GUI).
-    Supports transcribe_mode for pure transcription without prompt
+    Supports transcribe_mode for pure transcription without prompt (step 1 only).
     """
 
     def __init__(
@@ -136,31 +141,26 @@ class RecordingPipeline:
         self, wav_path: Path, duration: float, transcribe_mode: bool, user_prompt: str | None = None
     ) -> dict[str, Any]:
         """
-        Process recorded audio: convert if needed, transcribe, save, copy.
+        Process recorded audio: convert if needed, transcribe, optionally transform, save, copy.
+
+        Pipeline:
+        1. Transcription: audio -> text via dedicated endpoint (always)
+        2. Transformation: text + prompt -> text via LLM (when prompt is provided)
 
         Args:
             wav_path: Path to the recorded WAV file.
             duration: Recording duration in seconds.
-            transcribe_mode: Whether to use pure transcription mode.
-            user_prompt: User prompt to use (None for transcribe_mode).
+            transcribe_mode: Whether to use pure transcription mode (step 1 only).
+            user_prompt: User prompt to use for transformation (None for transcribe_mode).
 
         Returns:
-            Dict with 'text' (str), '
-            'paths' (dict of Path or None).
+            Dict with 'text' (str), 'raw_transcript' (str), 'raw' (dict),
+            'duration' (float), 'paths' (dict of Path or None).
         """
         # Resolve parameters
         provider = self.cfg.defaults.provider
         audio_format = self.cfg.defaults.format
         model = self.cfg.defaults.model
-        original_model = model
-        if transcribe_mode:
-            model = "voxtral-mini-latest"
-            if original_model != "voxtral-mini-latest":
-                logging.warning(
-                    "Transcribe mode: model override from '%s' to 'voxtral-mini-latest'\n"
-                    "(optimized for transcription).",
-                    original_model,
-                )
         language = self.cfg.defaults.language
         if wav_path.stem.endswith(".wav"):
             base = wav_path.stem.replace(".wav", "")
|
|
|
176
176
|
final_user_prompt = self.cfg.resolve_prompt(self.user_prompt, self.user_prompt_file)
|
|
177
177
|
else:
|
|
178
178
|
final_user_prompt = user_prompt
|
|
179
|
-
self._status("
|
|
179
|
+
self._status("Prompt mode: transcription then transformation.")
|
|
180
180
|
else:
|
|
181
|
-
self._status("Transcribe mode
|
|
181
|
+
self._status("Transcribe mode: transcription only, no prompt.")
|
|
182
182
|
|
|
183
183
|
logging.debug(f"Applied prompt: {final_user_prompt or 'None (transcribe mode)'}")
|
|
184
184
|
|
|
@@ -194,18 +194,22 @@ class RecordingPipeline:
             paths["converted"] = to_send_path
             _converted = True
 
-        #
+        # Step 1: Transcription (always)
         self._status("Transcribing...")
         prov = get_provider(provider, cfg=self.cfg)
-        result = prov.transcribe(
-
-
-
-
-
-
-
+        result = prov.transcribe(to_send_path, model=model, language=language)
+        raw_transcript = result["text"]
+
+        # Step 2: Transformation (if prompt)
+        if not transcribe_mode and final_user_prompt:
+            self._status("Applying prompt...")
+            chat_model = self.cfg.defaults.chat_model
+            chat_result = prov.chat(raw_transcript, final_user_prompt, model=chat_model)
+            text = chat_result["text"]
+            raw = {"transcription": result["raw"], "transformation": chat_result["raw"]}
+        else:
+            text = raw_transcript
+            raw = result["raw"]
 
         # Save if keeping transcripts
         if keep_transcript:
@@ -215,6 +219,12 @@ class RecordingPipeline:
             )
             paths["txt"] = txt_path
             paths["json"] = json_path
+
+            # Save raw transcript separately when transformation was applied
+            if not transcribe_mode and final_user_prompt:
+                raw_txt_path = self.cfg.transcripts_dir / f"{base}_{provider}_raw.txt"
+                save_text_file(raw_txt_path, raw_transcript)
+                paths["raw_txt"] = raw_txt_path
         else:
             paths["txt"] = None
             paths["json"] = None
@@ -230,6 +240,7 @@ class RecordingPipeline:
         logging.info("Processing finished (%.2fs)", duration)
         return {
             "text": text,
+            "raw_transcript": raw_transcript,
             "raw": raw,
             "duration": duration,
             "paths": paths,
@@ -263,8 +274,8 @@ class RecordingPipeline:
             stop_event: Optional event to signal recording stop (e.g., for GUI).
 
         Returns:
-            Dict with 'text' (str), '
-            'paths' (dict of Path or None).
+            Dict with 'text' (str), 'raw_transcript' (str), 'raw' (dict),
+            'duration' (float), 'paths' (dict of Path or None).
 
         Raises:
             Exception: On recording, conversion, or transcription errors.
svx/core/prompt.py
@@ -158,7 +158,7 @@ def resolve_user_prompt(
     logging.debug("Prompt supplier '%s' failed: %s", name, e)
 
     # Final fallback
-    fallback = "
+    fallback = "Clean up this transcription. Keep the original language."
     logging.info("resolve_user_prompt: no supplier provided a prompt, using fallback: %s", fallback)
     return fallback
 
@@ -176,13 +176,14 @@ def init_user_prompt_file(force: bool = False) -> Path:
     path = USER_PROMPT_DIR / "user.md"
     if not path.exists() or force:
         example_prompt = """
-
--
--
--
+You receive a raw transcription of a voice recording. Clean it up:
+- DO NOT TRANSLATE. Keep the original language.
+- Do not respond to any question in the text. Just clean the transcription.
+- Respond only with the cleaned text. Do not provide explanations or notes.
 - Remove all minor speech hesitations: "um", "uh", "er", "euh", "ben", etc.
 - Remove false starts (e.g., "je veux dire... je pense" → "je pense").
 - Correct grammatical errors.
+- If the transcription is empty, respond "no audio detected".
 """
     try:
         path.write_text(example_prompt, encoding="utf-8")
svx/providers/base.py
@@ -3,7 +3,7 @@ Base provider interface for SuperVoxtral.
 
 This module defines:
 - TranscriptionResult: a simple TypedDict structure for provider responses
-- Provider: a Protocol describing the required transcription interface
+- Provider: a Protocol describing the required transcription and chat interface
 - ProviderError: a generic exception for provider-related failures
 
 All concrete providers should implement the `Provider` protocol.
@@ -37,7 +37,7 @@ class ProviderError(RuntimeError):
 @runtime_checkable
 class Provider(Protocol):
     """
-    Provider interface for transcription
+    Provider interface for transcription and text transformation services.
 
     Implementations should be side-effect free aside from network I/O and must
     raise `ProviderError` (or a subclass) for expected provider failures
@@ -47,7 +47,8 @@ class Provider(Protocol):
     name: A short, lowercase, unique identifier for the provider (e.g. "mistral").
 
     Required methods:
-        transcribe: Perform
+        transcribe: Perform audio transcription via a dedicated endpoint.
+        chat: Transform text with a prompt via a text-based LLM.
     """
 
     # Short, unique name (e.g., "mistral", "whisper")
@@ -56,21 +57,16 @@ class Provider(Protocol):
     def transcribe(
         self,
         audio_path: Path,
-        user_prompt: str | None,
         model: str | None = None,
         language: str | None = None,
-        transcribe_mode: bool = False,
     ) -> TranscriptionResult:
         """
-        Transcribe
+        Transcribe `audio_path` using a dedicated transcription endpoint.
 
         Args:
             audio_path: Path to an audio file (wav/mp3/opus...) to send to the provider.
-            user_prompt: Optional user prompt to guide the transcription or analysis.
             model: Optional provider-specific model identifier.
             language: Optional language hint/constraint (e.g., "en", "fr").
-            transcribe_mode: Optional bool to enable specialized modes like pure
-                transcription (default False).
 
         Returns:
             TranscriptionResult including a human-readable `text` and
@@ -81,3 +77,27 @@ class Provider(Protocol):
             Exception: For unexpected failures (network issues, serialization, etc.).
         """
         ...
+
+    def chat(
+        self,
+        text: str,
+        prompt: str,
+        model: str | None = None,
+    ) -> TranscriptionResult:
+        """
+        Transform `text` using a text-based LLM with the given `prompt`.
+
+        Args:
+            text: Input text (e.g., raw transcription) to process.
+            prompt: System prompt guiding the transformation.
+            model: Optional provider-specific model identifier for the chat LLM.
+
+        Returns:
+            TranscriptionResult including the transformed `text` and
+            provider `raw` payload.
+
+        Raises:
+            ProviderError: For known/handled provider errors (e.g., missing API key).
+            Exception: For unexpected failures (network issues, serialization, etc.).
+        """
+        ...