scribe-cli 0.3.1.tar.gz → 0.4.0.tar.gz

Files changed (27)

{scribe_cli-0.3.1/scribe_cli.egg-info → scribe_cli-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: scribe-cli
-Version: 0.3.1
+Version: 0.4.0
 Summary: scribe is a local speech recognition tool that provides real-time transcription using vosk and whisper AI.
 Author-email: Mahé Perrette <mahe.perrette@gmail.com>
 License: MIT License
@@ -102,9 +102,9 @@ or until after recording is complete (`whisper`).
 You can interrupt the recording via Ctrl + C and start again or change model.
 The default (`whisper`) is excellent at transcribing a full-length audio sequences in [many languages](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). It is really impressive,
-but it cannot do real-time out of the box, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
-With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though after
-60 seconds it will stop automatically (and try to continue afterward).
+but it cannot do real-time, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
+With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
+there is a maximum duration after which it will stop by itself, which is setup to 60s by default (unless `--duration` is set to something else).
 The `vosk` backend is good at
 doing real-time transcription for one language, but tended to make more mistakes in my tests and it does not do punctuation.

{scribe_cli-0.3.1 → scribe_cli-0.4.0}/README.md RENAMED

@@ -46,9 +46,9 @@ or until after recording is complete (`whisper`).
 You can interrupt the recording via Ctrl + C and start again or change model.
 The default (`whisper`) is excellent at transcribing a full-length audio sequences in [many languages](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). It is really impressive,
-but it cannot do real-time out of the box, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
-With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though after
-60 seconds it will stop automatically (and try to continue afterward).
+but it cannot do real-time, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
+With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
+there is a maximum duration after which it will stop by itself, which is setup to 60s by default (unless `--duration` is set to something else).
 The `vosk` backend is good at
 doing real-time transcription for one language, but tended to make more mistakes in my tests and it does not do punctuation.

{scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/_version.py RENAMED

@@ -12,5 +12,5 @@ __version__: str
 __version_tuple__: VERSION_TUPLE
 version_tuple: VERSION_TUPLE
-__version__ = version = '0.3.1'
-__version_tuple__ = version_tuple = (0, 3, 1)
+__version__ = version = '0.4.0'
+__version_tuple__ = version_tuple = (0, 4, 0)

{scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/models.py RENAMED

@@ -9,23 +9,38 @@ VOSK_MODELS_FOLDER = os.path.join(os.environ.get("HOME"),
 class AbstractTranscriber:
     backend = None
-    def __init__(self, model, model_name=None, language=None, samplerate=16000, model_kwargs={}):
+    def __init__(self, model, model_name=None, language=None, samplerate=16000, max_duration=None, model_kwargs={}):
         self.model_name = model_name
         self.language = language
         self.model = model
         self.model_kwargs = model_kwargs
         self.samplerate = samplerate
+        self.max_duration = max_duration
+        self.one_second_bytes = self.samplerate * 2 # 16-bit audio, 1 channel  ~ 32000 bytes
+        self.audio_buffer = b''
+    def get_elapsed(self, size=None):
+        return len(size or self.audio_buffer) / self.one_second_bytes
+    def is_overtime(self, elapsed=None, size=None):
+        return self.max_duration and (elapsed or self.get_elapsed(size)) > self.max_duration
+    def transcribe_realtime_audio(self, audio_bytes=b""):
+        self.audio_buffer += audio_bytes
+        return {"partial": f"{len(self.audio_buffer)} bytes received (duration: {self.get_elapsed()} seconds)"}
     def transcribe_audio(self, audio_data):
         raise NotImplementedError()
-    def transcribe_realtime_audio(self, audio_data):
-        raise NotImplementedError()
+    def reset(self):
+        self.audio_buffer = b''
     def start_recording(self, microphone,
                         start_message="Recording... Press Ctrl+C to stop.",
                         stop_message="Stopped recording."):
+        self.reset()
         with microphone.open_stream():
             print(start_message)
@@ -35,6 +50,9 @@ class AbstractTranscriber:
                         data = microphone.q.get()
                         yield self.transcribe_realtime_audio(data)
+                        if self.is_overtime():
+                            raise KeyboardInterrupt("Overtime: {:.2f} seconds".format(self.get_elapsed()))
             except KeyboardInterrupt:
                 pass
@@ -75,7 +93,8 @@ class VoskTranscriber(AbstractTranscriber):
         super().__init__(model, model_name, model_kwargs=model_kwargs, **kwargs)
         self.recognizer = get_vosk_recognizer(model, self.samplerate)
-    def transcribe_realtime_audio(self, audio_bytes=b"", finalize=False):
+    def transcribe_realtime_audio(self, audio_bytes=b""):
+        self.audio_buffer += audio_bytes
         final = self.recognizer.AcceptWaveform(audio_bytes)
         if final:
             result = self.recognizer.Result()
@@ -85,20 +104,26 @@ class VoskTranscriber(AbstractTranscriber):
         if final:
             pass
-        elif finalize:
-            result_dict["text"] = result_dict.pop("partial", "")
         else:
             assert not final
             if "text" in result_dict:
                 del result_dict["text"]
         return result_dict
-    def transcribe_audio(self, audio_data=None):
-        return self.transcribe_realtime_audio(audio_data, finalize=True)
+    def transcribe_audio(self, audio_data=b""):
+        results = self.transcribe_realtime_audio(audio_data)
+        if not results.get("text") and "partial" in results:
+            results["text"] = results.pop("partial", "")
+        return results
     def finalize(self):
         return self.transcribe_audio(b"")
+    def reset(self):
+        super().reset()
+        self.recognizer = get_vosk_recognizer(self.model, self.samplerate)
 class WhisperTranscriber(AbstractTranscriber):
     backend = "whisper"
@@ -108,20 +133,10 @@ class WhisperTranscriber(AbstractTranscriber):
         if model is None:
             model = whisper.load_model(model_name)
         super().__init__(model, model_name, language, model_kwargs=model_kwargs, **kwargs)
-        self.audio_buffer = b''
-    def transcribe_realtime_audio(self, audio_bytes=b"", max_duration=60):
-        self.audio_buffer += audio_bytes
-        one_second = self.samplerate * 2 # 16-bit audio, 1 channel  ~ 32000 bytes
-        if len(self.audio_buffer) < max_duration * one_second:
-            return {"partial": f"{len(self.audio_buffer)} bytes received (duration: {len(self.audio_buffer) / one_second:.2f} seconds)"}
-        else:
-            return self.finalize()
     def transcribe_audio(self, audio_bytes):
         print("\nTranscribing...")
+        print("If --keyboard is set, change focus to target app NOW !")
         audio_array = np.frombuffer(audio_bytes, dtype=np.int16).flatten().astype(np.float32) / 32768.0
         return self.model.transcribe(audio_array, fp16=False, language=self.language)
@@ -130,4 +145,4 @@ class WhisperTranscriber(AbstractTranscriber):
             return {"text": ""}
         result = self.transcribe_audio(self.audio_buffer)
         self.audio_buffer = b''
-        return result
+        return result
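
The `models.py` hunks above move the audio buffer and the duration check from `WhisperTranscriber` into `AbstractTranscriber`. Two details of the new code are worth noting when reading it: `get_elapsed(size)` calls `len(size or self.audio_buffer)`, so the `size` argument has to be a bytes-like object rather than an integer, and `is_overtime()` returns `None` (falsy) rather than `False` when `max_duration` is `None`. The following is a minimal standalone sketch of the same accounting, not the package's code: the class name `DurationBuffer` and the method `feed` are hypothetical, and the explicit `None` check is an assumption about the intended behavior.

```python
class DurationBuffer:
    """Hypothetical stand-in for the buffering logic added to AbstractTranscriber."""

    def __init__(self, samplerate=16000, max_duration=None):
        self.samplerate = samplerate
        self.max_duration = max_duration
        # 16-bit mono PCM uses 2 bytes per sample, so 32000 bytes/s at 16 kHz.
        self.one_second_bytes = samplerate * 2
        self.audio_buffer = b""

    def feed(self, audio_bytes=b""):
        # Accumulate raw audio, as transcribe_realtime_audio does in the diff.
        self.audio_buffer += audio_bytes

    def get_elapsed(self):
        # Buffered duration in seconds, derived from the byte count.
        return len(self.audio_buffer) / self.one_second_bytes

    def is_overtime(self):
        # Assumption: no limit (None) means "never overtime", returned as a real bool.
        if self.max_duration is None:
            return False
        return self.get_elapsed() > self.max_duration

    def reset(self):
        # Drop buffered audio before a new recording starts.
        self.audio_buffer = b""


buf = DurationBuffer(samplerate=16000, max_duration=2)
chunk = b"\x00" * 16000  # half a second of silence at 16 kHz, 16-bit mono
for _ in range(5):
    buf.feed(chunk)
print(buf.get_elapsed(), buf.is_overtime())  # 2.5 True
```

With `max_duration=None`, as the diff passes for the `vosk` backend, the same loop never reports overtime, which matches the "vosk keeps going (no timeout)" comment in `streamer.py`.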

{scribe_cli-0.3.1 → scribe_cli-0.4.0}/scribe/streamer.py RENAMED

@@ -136,6 +136,7 @@ def get_transcriber(o, prompt=True):
             transcriber = VoskTranscriber(model_name=model,
                                         language=o.language,
                                         samplerate=o.samplerate,
+                                        max_duration=None, # vosk keeps going (no timeout)
                                         model_kwargs={"data_folder": o.data_folder})
         except Exception as error:
             print(error)
@@ -143,7 +144,7 @@ def get_transcriber(o, prompt=True):
             exit(1)
     elif backend == "whisper":
-        transcriber = WhisperTranscriber(model_name=model, language=o.language, samplerate=o.samplerate)
+        transcriber = WhisperTranscriber(model_name=model, language=o.language, samplerate=o.samplerate, max_duration=o.duration)
     else:
         raise ValueError(f"Unknown backend: {backend}")
@@ -167,6 +168,7 @@ def get_parser():
     parser.add_argument("--no-prompt", action="store_false", dest="prompt", help="Disable prompts for backend and model selection and jump to recording")
     parser.add_argument("--samplerate", default=16000, type=int, help=argparse.SUPPRESS)
+    parser.add_argument("--duration", default=60, type=int, help="duration in seconds before whisper models start transcribing (default %(default)ss)")
     parser.add_argument("--keyboard", action="store_true")
     parser.add_argument("--latency", default=0, type=float, help="keyboard latency")
@@ -194,6 +196,9 @@ def main(args=None):
             print(f"Choose any of the following actions:")
             print(f"[q] quit")
             print(f"[e] change model")
+            print(f"[k] toggle keyboard {'off' if o.keyboard else 'on'}")
+            if transcriber.backend == "whisper":
+                print(f"[t] change duration (currently {transcriber.max_duration}s)")
             print(colored(f"Press [Enter] or any other key to start recording.", "BOLD"))
             key = input()
@@ -202,6 +207,17 @@ def main(args=None):
             if key == "e":
                 transcriber = None
                 continue
+            if key == "k":
+                o.keyboard = not o.keyboard
+                continue
+            if key == "t":
+                duration = input(f"Enter new duration in seconds (current: {transcriber.max_duration}): ")
+                try:
+                    o.duration = transcriber.max_duration = int(duration)
+                except:
+                    print("Invalid duration. Must be an integer.")
+                continue
         start_recording(micro, transcriber, keyboard=o.keyboard, latency=o.latency)
         # if we arrived so far, that means we pressed Ctrl + C anyway, and need Enter to move on.
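
The `streamer.py` hunks above add a `--duration` option and a `[t]` menu entry that re-reads the duration interactively. The interactive branch uses a bare `except:`, which also catches `KeyboardInterrupt` raised while converting the input. A hedged sketch of the same idea with a narrower handler follows; `build_parser` and `parse_duration` are hypothetical names, and rejecting non-positive values is an assumption that the diff itself does not make.

```python
import argparse


def build_parser():
    # Mirrors only the --duration argument from the diff; prog name is illustrative.
    parser = argparse.ArgumentParser(prog="scribe-sketch")
    parser.add_argument(
        "--duration", default=60, type=int,
        help="duration in seconds before whisper models start transcribing "
             "(default %(default)ss)")
    return parser


def parse_duration(text, current):
    """Return the new duration, or `current` if `text` is not a usable integer."""
    try:
        value = int(text)
    except ValueError:
        # Only conversion errors are handled; Ctrl+C still propagates.
        return current
    # Assumption: a zero or negative limit is treated as invalid input.
    return value if value > 0 else current


options = build_parser().parse_args(["--duration", "15"])
print(options.duration)                          # 15
print(parse_duration("abc", options.duration))   # 15
print(parse_duration("30", options.duration))    # 30
```

Keeping the conversion in a small function makes the fallback behavior testable without driving the interactive prompt.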

{scribe_cli-0.3.1 → scribe_cli-0.4.0/scribe_cli.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: scribe-cli
-Version: 0.3.1
+Version: 0.4.0
 Summary: scribe is a local speech recognition tool that provides real-time transcription using vosk and whisper AI.
 Author-email: Mahé Perrette <mahe.perrette@gmail.com>
 License: MIT License
@@ -102,9 +102,9 @@ or until after recording is complete (`whisper`).
 You can interrupt the recording via Ctrl + C and start again or change model.
 The default (`whisper`) is excellent at transcribing a full-length audio sequences in [many languages](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). It is really impressive,
-but it cannot do real-time out of the box, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
-With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though after
-60 seconds it will stop automatically (and try to continue afterward).
+but it cannot do real-time, and depending on the model can have relatively long execution time, especially with the `turbo` model (at least on my laptop with CPU only). The `small` model is also excellent and runs much faster. It is selected as default in `scribe` for that reason.
+With the `whisker` model you need to stop the registration manually before the transcription occurs (Ctrl + C), though
+there is a maximum duration after which it will stop by itself, which is setup to 60s by default (unless `--duration` is set to something else).
 The `vosk` backend is good at
 doing real-time transcription for one language, but tended to make more mistakes in my tests and it does not do punctuation.