minidic 1.0.0__tar.gz

minidic-1.0.0/PKG-INFO ADDED
@@ -0,0 +1,125 @@
+ Metadata-Version: 2.4
+ Name: minidic
+ Version: 1.0.0
+ Summary: Voice dictation for macOS
+ Keywords: dictation,speech-to-text,transcription,macOS,menubar
+ Author: Yejun Su
+ Author-email: Yejun Su <goofan.su@gmail.com>
+ License-Expression: MIT
+ Classifier: Environment :: Console
+ Classifier: Environment :: MacOS X
+ Classifier: Operating System :: MacOS :: MacOS X
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
+ Requires-Dist: google-genai>=1.0.0
+ Requires-Dist: parakeet-mlx>=0.5.1
+ Requires-Dist: pyobjc-framework-quartz>=12.1
+ Requires-Dist: sounddevice>=0.5.5
+ Requires-Dist: soxr>=1.0.0
+ Requires-Python: >=3.12
+ Project-URL: Homepage, https://github.com/goofansu/minidic
+ Project-URL: Issues, https://github.com/goofansu/minidic/issues
+ Project-URL: Repository, https://github.com/goofansu/minidic
+ Description-Content-Type: text/markdown
+
+ # minidic
+
+ A tiny **vibe coding** project for voice dictation on macOS, built as a personal, fast-iteration tool for local use on one machine (not a polished or distributed app).
+
+ ## Install
+
+ `minidic` is published on PyPI for macOS users.
+
+ ```bash
+ uv tool install minidic
+ ```
+
+ To upgrade an existing install:
+
+ ```bash
+ uv tool install --reinstall minidic
+ ```
+
+ The first run downloads the `mlx-community/parakeet-tdt-0.6b-v3` model.
+
+ `uv tool` installs `minidic` to `~/.local/bin/minidic`.
+ Make sure `~/.local/bin` is on your `PATH`.
+
+ ## Usage
+
+ On first use, macOS will prompt for the permissions required by `minidic`. In general, you need to grant these permissions to the terminal app you use to run the commands:
+
+ - **Microphone**: needed to capture live audio for dictation
+ - **Accessibility**: needed to inject the transcribed text into the active app and handle global hotkeys in menu bar mode
+
+ To use `--gemini`, set `GEMINI_API_KEY` in your environment before running `minidic`.
+
+ ### Console
+
+ Run an interactive dictation session in the terminal. This records from your microphone, transcribes locally, and inserts the final text into the active app.
+
+ ```bash
+ minidic console
+ minidic console --gemini
+ ```
+
+ ### Transcribe
+
+ Transcribe an existing audio file from disk instead of recording live microphone input.
+
+ ```bash
+ minidic transcribe path/to/file.wav
+ minidic transcribe --gemini path/to/file.wav
+ ```
+
+ ### Menubar
+
+ Run `minidic` as a menu bar app with a background daemon and a global `F5` hotkey that toggles dictation.
+
+ ```bash
+ minidic menubar
+ ```
+
+ ![Menu bar icon (stopped)](screenshots/menubar-daemon-stopped.png)
+ ![Menu bar icon (running)](screenshots/menubar-daemon-started.png)
+
+ 1. Start the menu bar app.
+ 2. Optionally choose a max recording length from **Duration** in the menu.
+ 3. Click **Start daemon** (or **Stop daemon** to stop it).
+ 4. Press `F5` to toggle start/stop dictation (captured globally; other apps will not receive `F5` while the daemon is running).
+
+ ## Technical overview
+
+ `minidic` captures microphone audio, resamples it to 16 kHz, and runs local speech-to-text with streaming-style decoding.
+
+ ### Models used
+
+ - **ASR model:** `parakeet-mlx` for on-device audio transcription on Apple Silicon / MLX
+ - **LLM model:** `gemini-3.1-flash-lite-preview` for optional transcript cleanup (thinking disabled)
+
+ ### High-level pipeline
+
+ 1. Capture mic audio with `sounddevice`
+ 2. Resample to 16 kHz with `soxr` (when needed)
+ 3. Transcribe with `parakeet-mlx` on-device
+ 4. Clean up the transcript by default with local regex rules (removing filler words such as `um` and `uh`)
+ 5. Optionally clean up further with Gemini when `GEMINI_API_KEY` is set and Gemini mode is enabled (via `--gemini` for `console`/`transcribe`, or via the menu bar toggle)
+ 6. Inject text into the active app on macOS
+
+ Daemon mode is hotkey-driven and lazily loads and unloads the model to reduce idle resource usage.
+
+ ### Directory structure
+
+ ```text
+ ~/.minidic/
+ └── recordings/      # saved WAV recordings captured during dictation/transcription
+
+ ~/.local/state/minidic/
+ ├── config.json      # persisted runtime config such as Gemini and duration settings
+ ├── daemon.log       # daemon logs
+ ├── daemon.pid       # daemon process ID
+ ├── daemon.state     # current daemon state: idle, recording, transcribing
+ ├── menubar.log      # menu bar app logs
+ └── menubar.pid      # menu bar process ID
+ ```
@@ -0,0 +1,100 @@
+ # minidic
+
+ A tiny **vibe coding** project for voice dictation on macOS, built as a personal, fast-iteration tool for local use on one machine (not a polished or distributed app).
+
+ ## Install
+
+ `minidic` is published on PyPI for macOS users.
+
+ ```bash
+ uv tool install minidic
+ ```
+
+ To upgrade an existing install:
+
+ ```bash
+ uv tool install --reinstall minidic
+ ```
+
+ The first run downloads the `mlx-community/parakeet-tdt-0.6b-v3` model.
+
+ `uv tool` installs `minidic` to `~/.local/bin/minidic`.
+ Make sure `~/.local/bin` is on your `PATH`.
+
+ ## Usage
+
+ On first use, macOS will prompt for the permissions required by `minidic`. In general, you need to grant these permissions to the terminal app you use to run the commands:
+
+ - **Microphone**: needed to capture live audio for dictation
+ - **Accessibility**: needed to inject the transcribed text into the active app and handle global hotkeys in menu bar mode
+
+ To use `--gemini`, set `GEMINI_API_KEY` in your environment before running `minidic`.
+
+ ### Console
+
+ Run an interactive dictation session in the terminal. This records from your microphone, transcribes locally, and inserts the final text into the active app.
+
+ ```bash
+ minidic console
+ minidic console --gemini
+ ```
+
+ ### Transcribe
+
+ Transcribe an existing audio file from disk instead of recording live microphone input.
+
+ ```bash
+ minidic transcribe path/to/file.wav
+ minidic transcribe --gemini path/to/file.wav
+ ```
+
+ ### Menubar
+
+ Run `minidic` as a menu bar app with a background daemon and a global `F5` hotkey that toggles dictation.
+
+ ```bash
+ minidic menubar
+ ```
+
+ ![Menu bar icon (stopped)](screenshots/menubar-daemon-stopped.png)
+ ![Menu bar icon (running)](screenshots/menubar-daemon-started.png)
+
+ 1. Start the menu bar app.
+ 2. Optionally choose a max recording length from **Duration** in the menu.
+ 3. Click **Start daemon** (or **Stop daemon** to stop it).
+ 4. Press `F5` to toggle start/stop dictation (captured globally; other apps will not receive `F5` while the daemon is running).
+
+ ## Technical overview
+
+ `minidic` captures microphone audio, resamples it to 16 kHz, and runs local speech-to-text with streaming-style decoding.
+
+ ### Models used
+
+ - **ASR model:** `parakeet-mlx` for on-device audio transcription on Apple Silicon / MLX
+ - **LLM model:** `gemini-3.1-flash-lite-preview` for optional transcript cleanup (thinking disabled)
+
+ ### High-level pipeline
+
+ 1. Capture mic audio with `sounddevice`
+ 2. Resample to 16 kHz with `soxr` (when needed)
+ 3. Transcribe with `parakeet-mlx` on-device
+ 4. Clean up the transcript by default with local regex rules (removing filler words such as `um` and `uh`)
+ 5. Optionally clean up further with Gemini when `GEMINI_API_KEY` is set and Gemini mode is enabled (via `--gemini` for `console`/`transcribe`, or via the menu bar toggle)
+ 6. Inject text into the active app on macOS
+
+ Daemon mode is hotkey-driven and lazily loads and unloads the model to reduce idle resource usage.
+
+ ### Directory structure
+
+ ```text
+ ~/.minidic/
+ └── recordings/      # saved WAV recordings captured during dictation/transcription
+
+ ~/.local/state/minidic/
+ ├── config.json      # persisted runtime config such as Gemini and duration settings
+ ├── daemon.log       # daemon logs
+ ├── daemon.pid       # daemon process ID
+ ├── daemon.state     # current daemon state: idle, recording, transcribing
+ ├── menubar.log      # menu bar app logs
+ └── menubar.pid      # menu bar process ID
+ ```
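
Reviewer note: step 4 of the README's high-level pipeline (local regex cleanup of filler words) is described but the cleanup module is not part of this excerpt. The sketch below is only an illustration of what such a pass might look like. The filler list, the pattern, and the name `strip_fillers` are all assumptions, not the package's actual rules.

```python
import re

# Hypothetical filler list; the package's real patterns are not shown in this diff.
# \b on both sides keeps words such as "umbrella" or "sum" intact.
_FILLER = re.compile(r"\b(?:um+|uh+|erm+)\b[,.]?\s*", re.IGNORECASE)
_EXTRA_SPACES = re.compile(r"\s{2,}")


def strip_fillers(text: str) -> str:
    """Remove standalone filler words and tidy the whitespace left behind."""
    cleaned = _FILLER.sub("", text)
    return _EXTRA_SPACES.sub(" ", cleaned).strip()


if __name__ == "__main__":
    print(strip_fillers("Um, I think uh we should go"))
```

A regex pass like this is cheap and runs offline, which is presumably why the README keeps it as the default and leaves the Gemini pass optional.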
@@ -0,0 +1,38 @@
+ [project]
+ name = "minidic"
+ version = "1.0.0"
+ description = "Voice dictation for macOS"
+ readme = "README.md"
+ license = "MIT"
+ authors = [
+     { name = "Yejun Su", email = "goofan.su@gmail.com" }
+ ]
+ requires-python = ">=3.12"
+ keywords = ["dictation", "speech-to-text", "transcription", "macOS", "menubar"]
+ classifiers = [
+     "Environment :: Console",
+     "Environment :: MacOS X",
+     "Operating System :: MacOS :: MacOS X",
+     "Programming Language :: Python :: 3",
+     "Programming Language :: Python :: 3.12",
+     "Topic :: Multimedia :: Sound/Audio :: Speech",
+ ]
+ dependencies = [
+     "google-genai>=1.0.0",
+     "parakeet-mlx>=0.5.1",
+     "pyobjc-framework-quartz>=12.1",
+     "sounddevice>=0.5.5",
+     "soxr>=1.0.0",
+ ]
+
+ [project.urls]
+ Homepage = "https://github.com/goofansu/minidic"
+ Repository = "https://github.com/goofansu/minidic"
+ Issues = "https://github.com/goofansu/minidic/issues"
+
+ [project.scripts]
+ minidic = "minidic.main:main"
+
+ [build-system]
+ requires = ["uv_build>=0.9.15,<0.10.0"]
+ build-backend = "uv_build"
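
Reviewer note: the `[project.scripts]` entry `minidic = "minidic.main:main"` uses the `module:attribute` form for console scripts, which is why the installed `minidic` command and `python -m minidic` should end up calling the same `main`. The helper below is a rough sketch of how such a string can be resolved; `resolve_entry_point` is a hypothetical name, and real installers generate their own wrapper scripts rather than calling anything like it.

```python
import importlib
from typing import Any


def resolve_entry_point(spec: str) -> Any:
    """Resolve a ``module:attr`` string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj: Any = importlib.import_module(module_name)
    # Walk dotted attributes after the colon; no colon means "the module itself".
    for part in attr_path.split(".") if attr_path else []:
        obj = getattr(obj, part)
    return obj


if __name__ == "__main__":
    # Demonstrated with a stdlib target so the sketch runs without minidic installed.
    dumps = resolve_entry_point("json:dumps")
    print(dumps({"a": 1}))
```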
@@ -0,0 +1 @@
+ """minidic — macOS voice dictation using parakeet-mlx."""
@@ -0,0 +1,5 @@
+ """Allow running as ``python -m minidic``."""
+
+ from minidic.main import main
+
+ main()
@@ -0,0 +1,234 @@
+ """Audio capture from microphone via sounddevice."""
+
+ from __future__ import annotations
+
+ import logging
+ import queue
+ from types import TracebackType
+
+ import numpy as np
+ import sounddevice as sd
+ import soxr
+
+ logger = logging.getLogger(__name__)
+
+ TARGET_RATE = 16_000  # What VAD/ASR expect
+ CHANNELS = 1
+ DTYPE = "int16"
+ BLOCKSIZE = 512  # 32 ms chunks at 16 kHz
+
+
+ def int16_to_float32(audio: np.ndarray) -> np.ndarray:
+     """Convert int16 audio samples to float32 in [-1, 1)."""
+     return audio.astype(np.float32) / 32768.0
+
+
+ def _refresh_portaudio() -> None:
+     """Terminate and reinitialize PortAudio to pick up newly connected devices.
+
+     PortAudio captures the device list at initialization time. Calling
+     Pa_Terminate / Pa_Initialize forces a fresh enumeration so that devices
+     connected after the process started (e.g. Bluetooth headsets, USB mics)
+     are visible to subsequent ``sd.query_devices`` / ``sd.InputStream`` calls.
+
+     Uses the private ``sd._terminate`` / ``sd._initialize`` API; there is no
+     public equivalent. Both symbols have been stable since sounddevice 0.5.1
+     and the project requires ``sounddevice>=0.5.5``. If either call raises,
+     the exception is logged and re-raised so the caller receives a clear
+     error rather than a deferred cryptic PortAudio failure. A partial reinit
+     (e.g. terminate succeeded but initialize failed) leaves PortAudio
+     uninitialized and must not be silently swallowed.
+     """
+     try:
+         sd._terminate()
+         sd._initialize()
+     except Exception:
+         logger.warning("PortAudio reinit failed", exc_info=True)
+         raise
+
+
+ def _get_device_samplerate(device: int | str | None) -> float:
+     """Query the native sample rate for the given input device."""
+     info = sd.query_devices(device, kind="input")
+     return float(info["default_samplerate"])
+
+
+ class AudioStream:
+     """Captures audio from the microphone and pushes 16 kHz chunks to a queue.
+
+     If the device's native sample rate differs from 16 kHz, audio is
+     captured at the native rate and resampled with libsoxr.
+
+     Usage::
+
+         with AudioStream() as stream:
+             while True:
+                 chunk = stream.read()  # np.ndarray int16, shape (blocksize,)
+                 ...
+
+     Parameters
+     ----------
+     blocksize:
+         Number of *output* samples per chunk at 16 kHz (default 512 = 32 ms).
+     device:
+         Input device index or name. ``None`` uses the system default.
+     """
+
+     def __init__(
+         self,
+         blocksize: int = BLOCKSIZE,
+         device: int | str | None = None,
+     ) -> None:
+         self.blocksize = blocksize
+         self.device = device
+         self._queue: queue.Queue[np.ndarray] = queue.Queue()
+         self._stream: sd.InputStream | None = None
+
+         # Determined at start() time
+         self._native_rate: float = 0
+         self._resampler: soxr.ResampleStream | None = None
+         self._resample_buf: np.ndarray = np.array([], dtype=np.float32)
+
+     # -- callback ----------------------------------------------------------
+
+     def _callback(
+         self,
+         indata: np.ndarray,
+         frames: int,
+         time_info: object,
+         status: sd.CallbackFlags,
+     ) -> None:
+         if status:
+             logger.warning("sounddevice status: %s", status)
+
+         # indata shape is (frames, 1) int16 at the native rate; flatten to (frames,).
+         raw = indata[:, 0].copy()
+
+         if self._resampler is not None:
+             # Resample to 16 kHz. soxr expects float32/float64.
+             f32 = raw.astype(np.float32)
+             resampled = self._resampler.resample_chunk(f32)
+             # Buffer resampled samples and emit fixed-size chunks.
+             self._resample_buf = np.concatenate([self._resample_buf, resampled])
+             while len(self._resample_buf) >= self.blocksize:
+                 chunk = self._resample_buf[: self.blocksize]
+                 self._resample_buf = self._resample_buf[self.blocksize :]
+                 # Convert back to int16 for consistency
+                 self._queue.put_nowait(
+                     np.clip(chunk, -32768, 32767).astype(np.int16)
+                 )
+         else:
+             self._queue.put_nowait(raw)
+
+     # -- internal helpers --------------------------------------------------
+
+     def _do_open(self) -> None:
+         """Query device info, configure resampling, and open the PortAudio stream.
+
+         Separated from ``start()`` so the retry-on-failure path in ``start()``
+         can call it without duplicating the setup logic. Caller must ensure
+         ``self._stream is None`` before calling.
+         """
+         self._native_rate = _get_device_samplerate(self.device)
+         needs_resample = abs(self._native_rate - TARGET_RATE) > 1
+
+         if needs_resample:
+             self._resampler = soxr.ResampleStream(
+                 self._native_rate,
+                 TARGET_RATE,
+                 num_channels=1,
+                 dtype=np.float32,
+             )
+             self._resample_buf = np.array([], dtype=np.float32)
+             # Capture blocksize scaled to native rate
+             native_blocksize = int(self.blocksize * self._native_rate / TARGET_RATE)
+         else:
+             self._resampler = None
+             native_blocksize = self.blocksize
+
+         stream = sd.InputStream(
+             samplerate=self._native_rate,
+             blocksize=native_blocksize,
+             device=self.device,
+             channels=CHANNELS,
+             dtype=DTYPE,
+             callback=self._callback,
+         )
+         try:
+             stream.start()
+         except Exception:
+             stream.close()
+             raise
+         self._stream = stream
+         logger.info(
+             "Audio stream started native_rate=%d target_rate=%d "
+             "blocksize=%d resample=%s device=%s",
+             int(self._native_rate),
+             TARGET_RATE,
+             self.blocksize,
+             needs_resample,
+             # Compare against None explicitly: device index 0 is falsy but valid.
+             self.device if self.device is not None else "default",
+         )
+
+     # -- public API --------------------------------------------------------
+
+     def start(self) -> None:
+         """Open and start the audio stream.
+
+         On the first attempt the stream is opened against the current
+         PortAudio device list. If that fails (e.g. the user connected a new
+         device since the process started and PortAudio's list is stale),
+         PortAudio is reinitialized, which forces CoreAudio to re-enumerate
+         devices, and the open is retried once. A failure on the retry
+         propagates to the caller.
+         """
+         if self._stream is not None:
+             return
+
+         try:
+             self._do_open()
+         except Exception as exc:
+             logger.warning(
+                 "Audio stream open failed (%s); reinitializing PortAudio and retrying",
+                 exc,
+             )
+             _refresh_portaudio()
+             self._do_open()  # propagates on second failure
+
+     def stop(self) -> None:
+         """Stop and close the audio stream."""
+         if self._stream is None:
+             return
+         self._stream.stop()
+         self._stream.close()
+         self._stream = None
+         self._resampler = None
+         logger.info("Audio stream stopped")
+
+     def read(self, timeout: float | None = None) -> np.ndarray:
+         """Block until the next audio chunk is available.
+
+         Returns an int16 numpy array of shape ``(blocksize,)``.
+
+         Raises ``queue.Empty`` if *timeout* expires.
+         """
+         return self._queue.get(timeout=timeout)
+
+     @property
+     def queue(self) -> queue.Queue[np.ndarray]:
+         """Direct access to the underlying chunk queue."""
+         return self._queue
+
+     # -- context manager ---------------------------------------------------
+
+     def __enter__(self) -> AudioStream:
+         self.start()
+         return self
+
+     def __exit__(
+         self,
+         exc_type: type[BaseException] | None,
+         exc_val: BaseException | None,
+         exc_tb: TracebackType | None,
+     ) -> None:
+         self.stop()
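
Reviewer note: two pieces of arithmetic in `AudioStream` are easy to misread, so here is a dependency-free sketch of both. `native_blocksize` mirrors the scaling done in `_do_open`, and `Rechunker` mirrors the buffer-and-slice loop in `_callback`. Both names are hypothetical, and plain Python lists stand in for the NumPy arrays the real code uses, so treat this as an illustration rather than the package's implementation.

```python
def native_blocksize(blocksize: int, native_rate: float, target_rate: int = 16_000) -> int:
    """Scale an output blocksize at target_rate to a capture blocksize at native_rate."""
    return int(blocksize * native_rate / target_rate)


class Rechunker:
    """Accumulate variable-length sample runs and emit fixed-size chunks."""

    def __init__(self, blocksize: int) -> None:
        if blocksize <= 0:
            raise ValueError("blocksize must be positive")
        self.blocksize = blocksize
        self._buf: list[float] = []

    def push(self, samples: list[float]) -> list[list[float]]:
        """Append samples and return every complete chunk now available."""
        self._buf.extend(samples)
        chunks: list[list[float]] = []
        while len(self._buf) >= self.blocksize:
            chunks.append(self._buf[: self.blocksize])
            self._buf = self._buf[self.blocksize :]
        return chunks

    @property
    def pending(self) -> int:
        """Number of buffered samples still waiting for a full chunk."""
        return len(self._buf)


if __name__ == "__main__":
    print(native_blocksize(512, 48_000.0))  # capture size for a 48 kHz device
    rechunker = Rechunker(4)
    print(rechunker.push([1, 2, 3]))
    print(rechunker.push([4, 5, 6, 7, 8, 9]), rechunker.pending)
```

The buffering matters because a streaming resampler does not promise exactly `blocksize` output samples per callback (44.1 kHz input, for instance, scales 512 to a non-integer 1411.2 samples), so leftover samples carry over to the next callback instead of being dropped.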