scribe-cli 0.3.1__tar.gz → 0.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (27)
  1. {scribe_cli-0.3.1/scribe_cli.egg-info → scribe_cli-0.4.0}/PKG-INFO +4 -4
  2. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/README.md +3 -3
  3. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/_version.py +2 -2
  4. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/models.py +35 -20
  5. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/streamer.py +17 -1
  6. {scribe_cli-0.3.1 → scribe_cli-0.4.0/scribe_cli.egg-info}/PKG-INFO +4 -4
  7. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/.github/workflows/pypi.yml +0 -0
  8. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/.gitignore +0 -0
  9. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/LICENSE +0 -0
  10. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/pyproject.toml +0 -0
  11. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/__init__.py +0 -0
  12. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/audio.py +0 -0
  13. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/install_desktop.py +0 -0
  14. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/keyboard.py +0 -0
  15. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/models.toml +0 -0
  16. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/saverecording.py +0 -0
  17. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/testpynput.py +0 -0
  18. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/util.py +0 -0
  19. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_cli.egg-info/SOURCES.txt +0 -0
  20. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_cli.egg-info/dependency_links.txt +0 -0
  21. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_cli.egg-info/entry_points.txt +0 -0
  22. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_cli.egg-info/requires.txt +0 -0
  23. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_cli.egg-info/top_level.txt +0 -0
  24. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_data/__init__.py +0 -0
  25. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_data/share/icon.jpg +0 -0
  26. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe_data/templates/scribe.desktop +0 -0
  27. {scribe_cli-0.3.1 → scribe_cli-0.4.0}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: scribe-cli
-Version: 0.3.1
+Version: 0.4.0
 Summary: scribe is a local speech recognition tool that provides real-time transcription using vosk and whisper AI.
 Author-email: Mahé Perrette <mahe.perrette@gmail.com>
 License: MIT License
@@ -102,9 +102,9 @@ or until after recording is complete (`whisper`).
 You can interrupt the recording via Ctrl + C and start again or change model.
 
 The default (`whisper`) is excellent at transcribing a full-length audio sequences in [many languages](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). It is really impressive,
-but it cannot do real-time out of the box, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
-With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though after
-60 seconds it will stop automatically (and try to continue afterward).
+but it cannot do real-time, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
+With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
+there is a maximum duration after which it will stop by itself, which is setup to 60s by default (unless `--duration` is set to something else).
 
 The `vosk` backend is good at
 doing real-time transcription for one language, but tended to make more mistakes in my tests and it does not do punctuation.
@@ -46,9 +46,9 @@ or until after recording is complete (`whisper`).
 You can interrupt the recording via Ctrl + C and start again or change model.
 
 The default (`whisper`) is excellent at transcribing a full-length audio sequences in [many languages](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). It is really impressive,
-but it cannot do real-time out of the box, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
-With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though after
-60 seconds it will stop automatically (and try to continue afterward).
+but it cannot do real-time, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
+With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
+there is a maximum duration after which it will stop by itself, which is setup to 60s by default (unless `--duration` is set to something else).
 
 The `vosk` backend is good at
 doing real-time transcription for one language, but tended to make more mistakes in my tests and it does not do punctuation.
@@ -12,5 +12,5 @@ __version__: str
 __version_tuple__: VERSION_TUPLE
 version_tuple: VERSION_TUPLE
 
-__version__ = version = '0.3.1'
-__version_tuple__ = version_tuple = (0, 3, 1)
+__version__ = version = '0.4.0'
+__version_tuple__ = version_tuple = (0, 4, 0)
@@ -9,23 +9,38 @@ VOSK_MODELS_FOLDER = os.path.join(os.environ.get("HOME"),
 
 class AbstractTranscriber:
     backend = None
-    def __init__(self, model, model_name=None, language=None, samplerate=16000, model_kwargs={}):
+    def __init__(self, model, model_name=None, language=None, samplerate=16000, max_duration=None, model_kwargs={}):
         self.model_name = model_name
         self.language = language
         self.model = model
         self.model_kwargs = model_kwargs
         self.samplerate = samplerate
+        self.max_duration = max_duration
+        self.one_second_bytes = self.samplerate * 2 # 16-bit audio, 1 channel ~ 32000 bytes
+        self.audio_buffer = b''
+
+    def get_elapsed(self, size=None):
+        return len(size or self.audio_buffer) / self.one_second_bytes
+
+    def is_overtime(self, elapsed=None, size=None):
+        return self.max_duration and (elapsed or self.get_elapsed(size)) > self.max_duration
+
+    def transcribe_realtime_audio(self, audio_bytes=b""):
+        self.audio_buffer += audio_bytes
+        return {"partial": f"{len(self.audio_buffer)} bytes received (duration: {self.get_elapsed()} seconds)"}
 
     def transcribe_audio(self, audio_data):
         raise NotImplementedError()
 
-    def transcribe_realtime_audio(self, audio_data):
-        raise NotImplementedError()
+    def reset(self):
+        self.audio_buffer = b''
 
     def start_recording(self, microphone,
                         start_message="Recording... Press Ctrl+C to stop.",
                         stop_message="Stopped recording."):
 
+        self.reset()
+
         with microphone.open_stream():
             print(start_message)
 
@@ -35,6 +50,9 @@ class AbstractTranscriber:
                     data = microphone.q.get()
                     yield self.transcribe_realtime_audio(data)
 
+                    if self.is_overtime():
+                        raise KeyboardInterrupt("Overtime: {:.2f} seconds".format(self.get_elapsed()))
+
             except KeyboardInterrupt:
                 pass
 
@@ -75,7 +93,8 @@ class VoskTranscriber(AbstractTranscriber):
         super().__init__(model, model_name, model_kwargs=model_kwargs, **kwargs)
         self.recognizer = get_vosk_recognizer(model, self.samplerate)
 
-    def transcribe_realtime_audio(self, audio_bytes=b"", finalize=False):
+    def transcribe_realtime_audio(self, audio_bytes=b""):
+        self.audio_buffer += audio_bytes
         final = self.recognizer.AcceptWaveform(audio_bytes)
         if final:
             result = self.recognizer.Result()
@@ -85,20 +104,26 @@ class VoskTranscriber(AbstractTranscriber):
 
         if final:
             pass
-        elif finalize:
-            result_dict["text"] = result_dict.pop("partial", "")
         else:
             assert not final
             if "text" in result_dict:
                 del result_dict["text"]
         return result_dict
 
-    def transcribe_audio(self, audio_data=None):
-        return self.transcribe_realtime_audio(audio_data, finalize=True)
+    def transcribe_audio(self, audio_data=b""):
+        results = self.transcribe_realtime_audio(audio_data)
+        if not results.get("text") and "partial" in results:
+            results["text"] = results.pop("partial", "")
+        return results
+
 
     def finalize(self):
         return self.transcribe_audio(b"")
 
+    def reset(self):
+        super().reset()
+        self.recognizer = get_vosk_recognizer(self.model, self.samplerate)
+
 
 class WhisperTranscriber(AbstractTranscriber):
     backend = "whisper"
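The new `VoskTranscriber.transcribe_audio` above falls back to the recognizer's partial result when no final text is available. A minimal standalone sketch of that fallback follows; the helper name `promote_partial` is hypothetical and not part of the package, and it copies the dict rather than mutating it in place as the diff does.

```python
def promote_partial(result: dict) -> dict:
    """Return a copy in which a missing or empty "text" is filled from "partial".

    Hypothetical helper mirroring the fallback in the diff; not part of scribe.
    """
    out = dict(result)  # copy so the caller's dict is left untouched
    if not out.get("text") and "partial" in out:
        out["text"] = out.pop("partial", "")
    return out


if __name__ == "__main__":
    print(promote_partial({"partial": "hello wor"}))
    print(promote_partial({"text": "hello world"}))
```

A non-empty `"text"` is always left alone, so a final result is never overwritten by a stale partial one.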
@@ -108,20 +133,10 @@ class WhisperTranscriber(AbstractTranscriber):
         if model is None:
             model = whisper.load_model(model_name)
         super().__init__(model, model_name, language, model_kwargs=model_kwargs, **kwargs)
-        self.audio_buffer = b''
-
-    def transcribe_realtime_audio(self, audio_bytes=b"", max_duration=60):
-        self.audio_buffer += audio_bytes
-
-        one_second = self.samplerate * 2 # 16-bit audio, 1 channel ~ 32000 bytes
-        if len(self.audio_buffer) < max_duration * one_second:
-            return {"partial": f"{len(self.audio_buffer)} bytes received (duration: {len(self.audio_buffer) / one_second:.2f} seconds)"}
-
-        else:
-            return self.finalize()
 
     def transcribe_audio(self, audio_bytes):
         print("\nTranscribing...")
+        print("If --keyboard is set, change focus to target app NOW !")
         audio_array = np.frombuffer(audio_bytes, dtype=np.int16).flatten().astype(np.float32) / 32768.0
         return self.model.transcribe(audio_array, fp16=False, language=self.language)
 
@@ -130,4 +145,4 @@ class WhisperTranscriber(AbstractTranscriber):
             return {"text": ""}
         result = self.transcribe_audio(self.audio_buffer)
         self.audio_buffer = b''
-        return result
+        return result
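The `models.py` changes above move the recording timeout into the base class: the buffer length in bytes is divided by the bytes per second (sample rate × 2 for 16-bit mono) to get the elapsed time, which is then compared with `max_duration`. A minimal sketch of that arithmetic, assuming 16-bit mono PCM as in the diff's comment; the class name `BufferClock` and its method names are hypothetical. Unlike the diff, which uses a truthiness check (so `0` also disables the limit), this sketch treats only `None` as "no limit".

```python
from typing import Optional


class BufferClock:
    """Tracks recorded duration from the size of a raw PCM byte buffer.

    Hypothetical illustration; assumes 16-bit (2-byte) mono samples.
    """

    def __init__(self, samplerate: int = 16000, max_duration: Optional[float] = None):
        self.one_second_bytes = samplerate * 2  # 2 bytes per sample, 1 channel
        self.max_duration = max_duration
        self.audio_buffer = b""

    def feed(self, chunk: bytes) -> None:
        self.audio_buffer += chunk

    def elapsed(self) -> float:
        return len(self.audio_buffer) / self.one_second_bytes

    def is_overtime(self) -> bool:
        # None means "no limit" (as for the vosk backend in the diff).
        return self.max_duration is not None and self.elapsed() > self.max_duration

    def reset(self) -> None:
        self.audio_buffer = b""


if __name__ == "__main__":
    clock = BufferClock(samplerate=16000, max_duration=1)
    clock.feed(b"\x00" * 32000)  # exactly one second of silence
    print(clock.elapsed(), clock.is_overtime())
    clock.feed(b"\x00" * 2)      # one more sample pushes it past the limit
    print(clock.is_overtime())
```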
@@ -136,6 +136,7 @@ def get_transcriber(o, prompt=True):
             transcriber = VoskTranscriber(model_name=model,
                                           language=o.language,
                                           samplerate=o.samplerate,
+                                          max_duration=None, # vosk keeps going (no timeout)
                                           model_kwargs={"data_folder": o.data_folder})
         except Exception as error:
             print(error)
@@ -143,7 +144,7 @@ def get_transcriber(o, prompt=True):
             exit(1)
 
     elif backend == "whisper":
-        transcriber = WhisperTranscriber(model_name=model, language=o.language, samplerate=o.samplerate)
+        transcriber = WhisperTranscriber(model_name=model, language=o.language, samplerate=o.samplerate, max_duration=o.duration)
 
     else:
         raise ValueError(f"Unknown backend: {backend}")
@@ -167,6 +168,7 @@ def get_parser():
     parser.add_argument("--no-prompt", action="store_false", dest="prompt", help="Disable prompts for backend and model selection and jump to recording")
 
     parser.add_argument("--samplerate", default=16000, type=int, help=argparse.SUPPRESS)
+    parser.add_argument("--duration", default=60, type=int, help="duration in seconds before whisper models start transcribing (default %(default)ss)")
     parser.add_argument("--keyboard", action="store_true")
     parser.add_argument("--latency", default=0, type=float, help="keyboard latency")
 
@@ -194,6 +196,9 @@ def main(args=None):
         print(f"Choose any of the following actions:")
         print(f"[q] quit")
         print(f"[e] change model")
+        print(f"[k] toggle keyboard {'off' if o.keyboard else 'on'}")
+        if transcriber.backend == "whisper":
+            print(f"[t] change duration (currently {transcriber.max_duration}s)")
         print(colored(f"Press [Enter] or any other key to start recording.", "BOLD"))
 
         key = input()
@@ -202,6 +207,17 @@ def main(args=None):
         if key == "e":
             transcriber = None
             continue
+        if key == "k":
+            o.keyboard = not o.keyboard
+            continue
+        if key == "t":
+            duration = input(f"Enter new duration in seconds (current: {transcriber.max_duration}): ")
+            try:
+                o.duration = transcriber.max_duration = int(duration)
+            except:
+                print("Invalid duration. Must be an integer.")
+            continue
+
         start_recording(micro, transcriber, keyboard=o.keyboard, latency=o.latency)
 
         # if we arrived so far, that means we pressed Ctrl + C anyway, and need Enter to move on.
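The `streamer.py` hunks above add a `--duration` option (default 60, integer) and a `[t]` menu entry that re-reads the value interactively. A minimal sketch of both pieces in isolation follows; `build_parser` and `parse_duration` are hypothetical names, the help text is shortened, and the sketch catches `ValueError` specifically where the diff uses a bare `except:`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing only the --duration option from the diff (hypothetical wrapper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--duration", default=60, type=int,
                        help="maximum recording duration in seconds (default %(default)ss)")
    return parser


def parse_duration(raw: str, current: int) -> int:
    """Return int(raw), or keep the current duration when raw is not an integer."""
    try:
        return int(raw)
    except ValueError:
        print("Invalid duration. Must be an integer.")
        return current


if __name__ == "__main__":
    o = build_parser().parse_args(["--duration", "120"])
    print(o.duration)
    print(parse_duration("abc", o.duration))
```

Catching only `ValueError` lets Ctrl + C (`KeyboardInterrupt`) propagate instead of being reported as an invalid duration.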
@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: scribe-cli
-Version: 0.3.1
+Version: 0.4.0
 Summary: scribe is a local speech recognition tool that provides real-time transcription using vosk and whisper AI.
 Author-email: Mahé Perrette <mahe.perrette@gmail.com>
 License: MIT License
@@ -102,9 +102,9 @@ or until after recording is complete (`whisper`).
 You can interrupt the recording via Ctrl + C and start again or change model.
 
 The default (`whisper`) is excellent at transcribing a full-length audio sequences in [many languages](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). It is really impressive,
-but it cannot do real-time out of the box, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
-With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though after
-60 seconds it will stop automatically (and try to continue afterward).
+but it cannot do real-time, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
+With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
+there is a maximum duration after which it will stop by itself, which is setup to 60s by default (unless `--duration` is set to something else).
 
 The `vosk` backend is good at
 doing real-time transcription for one language, but tended to make more mistakes in my tests and it does not do punctuation.
File without changes