scribe-cli 0.3.1__tar.gz → 0.4.0__tar.gz
- {scribe_cli-0.3.1/scribe_cli.egg-info → scribe_cli-0.4.0}/PKG-INFO +4 -4
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/README.md +3 -3
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/_version.py +2 -2
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/models.py +35 -20
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/streamer.py +17 -1
- {scribe_cli-0.3.1 → scribe_cli-0.4.0/scribe_cli.egg-info}/PKG-INFO +4 -4
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/.github/workflows/pypi.yml +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/.gitignore +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/LICENSE +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/pyproject.toml +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/__init__.py +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/audio.py +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/install_desktop.py +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/keyboard.py +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/models.toml +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/saverecording.py +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/testpynput.py +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/util.py +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_cli.egg-info/SOURCES.txt +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_cli.egg-info/dependency_links.txt +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_cli.egg-info/entry_points.txt +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_cli.egg-info/requires.txt +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_cli.egg-info/top_level.txt +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_data/__init__.py +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_data/share/icon.jpg +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_data/templates/scribe.desktop +0 -0
- {scribe_cli-0.3.1 → scribe_cli-0.4.0}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: scribe-cli
-Version: 0.
+Version: 0.4.0
 Summary: scribe is a local speech recognition tool that provides real-time transcription using vosk and whisper AI.
 Author-email: Mahé Perrette <mahe.perrette@gmail.com>
 License: MIT License
@@ -102,9 +102,9 @@ or until after recording is complete (`whisper`).
 You can interrupt the recording via Ctrl + C and start again or change model.
 
 The default (`whisper`) is excellent at transcribing a full-length audio sequences in [many languages](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). It is really impressive,
-but it cannot do real-time
-With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
-
+but it cannot do real-time, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
+With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
+there is a maximum duration after which it will stop by itself, which is setup to 60s by default (unless `--duration` is set to something else).
 
 The `vosk` backend is good at
 doing real-time transcription for one language, but tended to make more mistakes in my tests and it does not do punctuation.
@@ -46,9 +46,9 @@ or until after recording is complete (`whisper`).
 You can interrupt the recording via Ctrl + C and start again or change model.
 
 The default (`whisper`) is excellent at transcribing a full-length audio sequences in [many languages](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). It is really impressive,
-but it cannot do real-time
-With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
-
+but it cannot do real-time, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
+With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
+there is a maximum duration after which it will stop by itself, which is setup to 60s by default (unless `--duration` is set to something else).
 
 The `vosk` backend is good at
 doing real-time transcription for one language, but tended to make more mistakes in my tests and it does not do punctuation.
@@ -9,23 +9,38 @@ VOSK_MODELS_FOLDER = os.path.join(os.environ.get("HOME"),
 
 class AbstractTranscriber:
     backend = None
-    def __init__(self, model, model_name=None, language=None, samplerate=16000, model_kwargs={}):
+    def __init__(self, model, model_name=None, language=None, samplerate=16000, max_duration=None, model_kwargs={}):
         self.model_name = model_name
         self.language = language
         self.model = model
         self.model_kwargs = model_kwargs
         self.samplerate = samplerate
+        self.max_duration = max_duration
+        self.one_second_bytes = self.samplerate * 2 # 16-bit audio, 1 channel ~ 32000 bytes
+        self.audio_buffer = b''
+
+    def get_elapsed(self, size=None):
+        return len(size or self.audio_buffer) / self.one_second_bytes
+
+    def is_overtime(self, elapsed=None, size=None):
+        return self.max_duration and (elapsed or self.get_elapsed(size)) > self.max_duration
+
+    def transcribe_realtime_audio(self, audio_bytes=b""):
+        self.audio_buffer += audio_bytes
+        return {"partial": f"{len(self.audio_buffer)} bytes received (duration: {self.get_elapsed()} seconds)"}
 
     def transcribe_audio(self, audio_data):
         raise NotImplementedError()
 
-    def
-
+    def reset(self):
+        self.audio_buffer = b''
 
     def start_recording(self, microphone,
                         start_message="Recording... Press Ctrl+C to stop.",
                         stop_message="Stopped recording."):
 
+        self.reset()
+
         with microphone.open_stream():
             print(start_message)
 
@@ -35,6 +50,9 @@ class AbstractTranscriber:
                     data = microphone.q.get()
                     yield self.transcribe_realtime_audio(data)
 
+                    if self.is_overtime():
+                        raise KeyboardInterrupt("Overtime: {:.2f} seconds".format(self.get_elapsed()))
+
             except KeyboardInterrupt:
                 pass
 
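The overtime check added in this hunk depends on converting a buffered byte count into seconds of audio. The following is a minimal sketch of that arithmetic, assuming 16-bit mono PCM as the comment in the diff states. The names `elapsed_seconds` and `is_overtime` are hypothetical standalone helpers rather than the package's methods, and they take a byte count directly instead of calling `len()` on the `size` argument.

```python
BYTES_PER_SAMPLE = 2  # 16-bit audio, 1 channel (assumption taken from the diff's comment)


def elapsed_seconds(num_bytes, samplerate=16000):
    """Convert a count of raw PCM bytes into seconds of audio."""
    return num_bytes / (samplerate * BYTES_PER_SAMPLE)


def is_overtime(num_bytes, max_duration, samplerate=16000):
    """Return True once the buffered audio exceeds max_duration seconds.

    max_duration=None is treated as "no limit".
    """
    if max_duration is None:
        return False
    return elapsed_seconds(num_bytes, samplerate) > max_duration


audio_buffer = b"\x00" * 64000  # silence, 64000 bytes at 16 kHz
print(elapsed_seconds(len(audio_buffer)))
print(is_overtime(len(audio_buffer), max_duration=1))
```

Returning a plain boolean for the `None` case avoids the `None`/`0` short-circuit value that `self.max_duration and ...` would yield; whether that matters depends on how the caller uses the result.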
@@ -75,7 +93,8 @@ class VoskTranscriber(AbstractTranscriber):
         super().__init__(model, model_name, model_kwargs=model_kwargs, **kwargs)
         self.recognizer = get_vosk_recognizer(model, self.samplerate)
 
-    def transcribe_realtime_audio(self, audio_bytes=b""
+    def transcribe_realtime_audio(self, audio_bytes=b""):
+        self.audio_buffer += audio_bytes
         final = self.recognizer.AcceptWaveform(audio_bytes)
         if final:
             result = self.recognizer.Result()
@@ -85,20 +104,26 @@ class VoskTranscriber(AbstractTranscriber):
 
         if final:
             pass
-        elif finalize:
-            result_dict["text"] = result_dict.pop("partial", "")
         else:
             assert not final
             if "text" in result_dict:
                 del result_dict["text"]
         return result_dict
 
-    def transcribe_audio(self, audio_data=
-
+    def transcribe_audio(self, audio_data=b""):
+        results = self.transcribe_realtime_audio(audio_data)
+        if not results.get("text") and "partial" in results:
+            results["text"] = results.pop("partial", "")
+        return results
+
 
     def finalize(self):
         return self.transcribe_audio(b"")
 
+    def reset(self):
+        super().reset()
+        self.recognizer = get_vosk_recognizer(self.model, self.samplerate)
+
 
 class WhisperTranscriber(AbstractTranscriber):
     backend = "whisper"
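The new `transcribe_audio` in this hunk promotes a pending partial result to final text when no final text is present. A minimal sketch of that dictionary handling in isolation is shown here; `promote_partial` is a hypothetical helper name, and plain dicts stand in for the recognizer output.

```python
def promote_partial(results):
    """Return a copy in which a leftover "partial" entry becomes "text"
    when no non-empty "text" entry is present."""
    results = dict(results)  # work on a copy so the caller's dict is untouched
    if not results.get("text") and "partial" in results:
        results["text"] = results.pop("partial", "")
    return results


print(promote_partial({"partial": "hello world"}))
print(promote_partial({"text": "done"}))
```

Unlike the code in the diff, the sketch copies the input before modifying it; that is a design choice for the example, not something the package does.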
@@ -108,20 +133,10 @@ class WhisperTranscriber(AbstractTranscriber):
         if model is None:
             model = whisper.load_model(model_name)
         super().__init__(model, model_name, language, model_kwargs=model_kwargs, **kwargs)
-        self.audio_buffer = b''
-
-    def transcribe_realtime_audio(self, audio_bytes=b"", max_duration=60):
-        self.audio_buffer += audio_bytes
-
-        one_second = self.samplerate * 2 # 16-bit audio, 1 channel ~ 32000 bytes
-        if len(self.audio_buffer) < max_duration * one_second:
-            return {"partial": f"{len(self.audio_buffer)} bytes received (duration: {len(self.audio_buffer) / one_second:.2f} seconds)"}
-
-        else:
-            return self.finalize()
 
     def transcribe_audio(self, audio_bytes):
         print("\nTranscribing...")
+        print("If --keyboard is set, change focus to target app NOW !")
         audio_array = np.frombuffer(audio_bytes, dtype=np.int16).flatten().astype(np.float32) / 32768.0
         return self.model.transcribe(audio_array, fp16=False, language=self.language)
 
@@ -130,4 +145,4 @@ class WhisperTranscriber(AbstractTranscriber):
             return {"text": ""}
         result = self.transcribe_audio(self.audio_buffer)
         self.audio_buffer = b''
-        return result
+        return result
@@ -136,6 +136,7 @@ def get_transcriber(o, prompt=True):
             transcriber = VoskTranscriber(model_name=model,
                                           language=o.language,
                                           samplerate=o.samplerate,
+                                          max_duration=None, # vosk keeps going (no timeout)
                                           model_kwargs={"data_folder": o.data_folder})
         except Exception as error:
             print(error)
@@ -143,7 +144,7 @@ def get_transcriber(o, prompt=True):
             exit(1)
 
     elif backend == "whisper":
-        transcriber = WhisperTranscriber(model_name=model, language=o.language, samplerate=o.samplerate)
+        transcriber = WhisperTranscriber(model_name=model, language=o.language, samplerate=o.samplerate, max_duration=o.duration)
 
     else:
         raise ValueError(f"Unknown backend: {backend}")
@@ -167,6 +168,7 @@ def get_parser():
     parser.add_argument("--no-prompt", action="store_false", dest="prompt", help="Disable prompts for backend and model selection and jump to recording")
 
     parser.add_argument("--samplerate", default=16000, type=int, help=argparse.SUPPRESS)
+    parser.add_argument("--duration", default=60, type=int, help="duration in seconds before whisper models start transcribing (default %(default)ss)")
     parser.add_argument("--keyboard", action="store_true")
     parser.add_argument("--latency", default=0, type=float, help="keyboard latency")
 
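The `--duration` option added in this hunk can be tried on its own. The sketch below rebuilds only that one argument with the standard `argparse` module; the `build_parser` function and the `scribe-sketch` program name are hypothetical and not part of the package.

```python
import argparse


def build_parser():
    """Build a parser containing only the --duration option from the hunk above."""
    parser = argparse.ArgumentParser(prog="scribe-sketch")
    parser.add_argument(
        "--duration",
        default=60,
        type=int,
        # %(default)s is expanded by argparse when help is rendered;
        # the trailing "s" is the unit suffix for seconds.
        help="duration in seconds before whisper models start transcribing (default %(default)ss)",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).duration)
print(parser.parse_args(["--duration", "30"]).duration)
```

Because `type=int` is set, a non-integer value such as `--duration abc` makes `argparse` exit with a usage error before any recording starts.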
@@ -194,6 +196,9 @@ def main(args=None):
         print(f"Choose any of the following actions:")
         print(f"[q] quit")
         print(f"[e] change model")
+        print(f"[k] toggle keyboard {'off' if o.keyboard else 'on'}")
+        if transcriber.backend == "whisper":
+            print(f"[t] change duration (currently {transcriber.max_duration}s)")
         print(colored(f"Press [Enter] or any other key to start recording.", "BOLD"))
 
         key = input()
@@ -202,6 +207,17 @@ def main(args=None):
         if key == "e":
             transcriber = None
             continue
+        if key == "k":
+            o.keyboard = not o.keyboard
+            continue
+        if key == "t":
+            duration = input(f"Enter new duration in seconds (current: {transcriber.max_duration}): ")
+            try:
+                o.duration = transcriber.max_duration = int(duration)
+            except:
+                print("Invalid duration. Must be an integer.")
+            continue
+
         start_recording(micro, transcriber, keyboard=o.keyboard, latency=o.latency)
 
         # if we arrived so far, that means we pressed Ctrl + C anyway, and need Enter to move on.
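The `[t]` handler in this hunk converts user input to an integer inside a bare `except:`, which also swallows `KeyboardInterrupt`. A minimal sketch of the same validation that catches only `ValueError` is given below; `parse_duration` is a hypothetical helper, not a function from the package.

```python
def parse_duration(raw, current):
    """Return int(raw) when raw is a valid integer string, otherwise keep current."""
    try:
        return int(raw)
    except ValueError:
        # Only conversion failures are handled here, so Ctrl+C still propagates.
        print("Invalid duration. Must be an integer.")
        return current


print(parse_duration("30", 60))
print(parse_duration("abc", 60))
```

Returning the previous value on failure keeps the caller's state unchanged, which matches the behavior of the diff where the assignment is skipped when `int()` raises.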
@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: scribe-cli
-Version: 0.
+Version: 0.4.0
 Summary: scribe is a local speech recognition tool that provides real-time transcription using vosk and whisper AI.
 Author-email: Mahé Perrette <mahe.perrette@gmail.com>
 License: MIT License
@@ -102,9 +102,9 @@ or until after recording is complete (`whisper`).
 You can interrupt the recording via Ctrl + C and start again or change model.
 
 The default (`whisper`) is excellent at transcribing a full-length audio sequences in [many languages](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). It is really impressive,
-but it cannot do real-time
-With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
-
+but it cannot do real-time, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
+With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
+there is a maximum duration after which it will stop by itself, which is setup to 60s by default (unless `--duration` is set to something else).
 
 The `vosk` backend is good at
 doing real-time transcription for one language, but tended to make more mistakes in my tests and it does not do punctuation.
File without changes