whispercpp 1.2.0.2 → 1.3.1

Files changed (135)
  1. checksums.yaml +4 -4
  2. data/.gitignore +5 -0
  3. data/LICENSE +1 -1
  4. data/README.md +165 -434
  5. data/Rakefile +46 -86
  6. data/ext/.gitignore +13 -0
  7. data/ext/cpu.mk +9 -0
  8. data/ext/{dr_wav.h → examples/dr_wav.h} +3560 -1179
  9. data/ext/extconf.rb +185 -7
  10. data/ext/ggml/include/ggml-alloc.h +76 -0
  11. data/ext/ggml/include/ggml-backend.h +352 -0
  12. data/ext/ggml/include/ggml-blas.h +25 -0
  13. data/ext/ggml/include/ggml-cann.h +123 -0
  14. data/ext/ggml/include/ggml-cpp.h +38 -0
  15. data/ext/ggml/include/ggml-cpu.h +135 -0
  16. data/ext/ggml/include/ggml-cuda.h +47 -0
  17. data/ext/ggml/include/ggml-kompute.h +50 -0
  18. data/ext/ggml/include/ggml-metal.h +66 -0
  19. data/ext/ggml/include/ggml-opencl.h +26 -0
  20. data/ext/ggml/include/ggml-opt.h +216 -0
  21. data/ext/ggml/include/ggml-rpc.h +28 -0
  22. data/ext/ggml/include/ggml-sycl.h +49 -0
  23. data/ext/ggml/include/ggml-vulkan.h +31 -0
  24. data/ext/ggml/include/ggml.h +2285 -0
  25. data/ext/ggml/src/ggml-alloc.c +1037 -0
  26. data/ext/ggml/src/ggml-amx/common.h +94 -0
  27. data/ext/ggml/src/ggml-amx/ggml-amx.cpp +446 -0
  28. data/ext/ggml/src/ggml-amx/mmq.cpp +2510 -0
  29. data/ext/ggml/src/ggml-amx/mmq.h +17 -0
  30. data/ext/ggml/src/ggml-backend-impl.h +256 -0
  31. data/ext/ggml/src/ggml-backend-reg.cpp +552 -0
  32. data/ext/ggml/src/ggml-backend.cpp +1999 -0
  33. data/ext/ggml/src/ggml-blas/ggml-blas.cpp +517 -0
  34. data/ext/ggml/src/ggml-cann/acl_tensor.cpp +175 -0
  35. data/ext/ggml/src/ggml-cann/acl_tensor.h +258 -0
  36. data/ext/ggml/src/ggml-cann/aclnn_ops.cpp +3427 -0
  37. data/ext/ggml/src/ggml-cann/aclnn_ops.h +592 -0
  38. data/ext/ggml/src/ggml-cann/common.h +286 -0
  39. data/ext/ggml/src/ggml-cann/ggml-cann.cpp +2188 -0
  40. data/ext/ggml/src/ggml-cann/kernels/ascendc_kernels.h +19 -0
  41. data/ext/ggml/src/ggml-cann/kernels/dup.cpp +236 -0
  42. data/ext/ggml/src/ggml-cann/kernels/get_row_f16.cpp +197 -0
  43. data/ext/ggml/src/ggml-cann/kernels/get_row_f32.cpp +190 -0
  44. data/ext/ggml/src/ggml-cann/kernels/get_row_q4_0.cpp +204 -0
  45. data/ext/ggml/src/ggml-cann/kernels/get_row_q8_0.cpp +191 -0
  46. data/ext/ggml/src/ggml-cann/kernels/quantize_f16_q8_0.cpp +218 -0
  47. data/ext/ggml/src/ggml-cann/kernels/quantize_f32_q8_0.cpp +216 -0
  48. data/ext/ggml/src/ggml-cann/kernels/quantize_float_to_q4_0.cpp +295 -0
  49. data/ext/ggml/src/ggml-common.h +1853 -0
  50. data/ext/ggml/src/ggml-cpu/amx/amx.cpp +220 -0
  51. data/ext/ggml/src/ggml-cpu/amx/amx.h +8 -0
  52. data/ext/ggml/src/ggml-cpu/amx/common.h +91 -0
  53. data/ext/ggml/src/ggml-cpu/amx/mmq.cpp +2511 -0
  54. data/ext/ggml/src/ggml-cpu/amx/mmq.h +10 -0
  55. data/ext/ggml/src/ggml-cpu/cpu-feats-x86.cpp +323 -0
  56. data/ext/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp +4262 -0
  57. data/ext/ggml/src/ggml-cpu/ggml-cpu-aarch64.h +8 -0
  58. data/ext/ggml/src/ggml-cpu/ggml-cpu-hbm.cpp +55 -0
  59. data/ext/ggml/src/ggml-cpu/ggml-cpu-hbm.h +8 -0
  60. data/ext/ggml/src/ggml-cpu/ggml-cpu-impl.h +386 -0
  61. data/ext/ggml/src/ggml-cpu/ggml-cpu-quants.c +10835 -0
  62. data/ext/ggml/src/ggml-cpu/ggml-cpu-quants.h +63 -0
  63. data/ext/ggml/src/ggml-cpu/ggml-cpu-traits.cpp +36 -0
  64. data/ext/ggml/src/ggml-cpu/ggml-cpu-traits.h +38 -0
  65. data/ext/ggml/src/ggml-cpu/ggml-cpu.c +14123 -0
  66. data/ext/ggml/src/ggml-cpu/ggml-cpu.cpp +622 -0
  67. data/ext/ggml/src/ggml-cpu/llamafile/sgemm.cpp +1884 -0
  68. data/ext/ggml/src/ggml-cpu/llamafile/sgemm.h +14 -0
  69. data/ext/ggml/src/ggml-cuda/vendors/cuda.h +14 -0
  70. data/ext/ggml/src/ggml-cuda/vendors/hip.h +186 -0
  71. data/ext/ggml/src/ggml-cuda/vendors/musa.h +134 -0
  72. data/ext/ggml/src/ggml-impl.h +556 -0
  73. data/ext/ggml/src/ggml-kompute/ggml-kompute.cpp +2251 -0
  74. data/ext/ggml/src/ggml-metal/ggml-metal-impl.h +288 -0
  75. data/ext/ggml/src/ggml-metal/ggml-metal.m +4884 -0
  76. data/ext/ggml/src/ggml-metal/ggml-metal.metal +6732 -0
  77. data/ext/ggml/src/ggml-opt.cpp +854 -0
  78. data/ext/ggml/src/ggml-quants.c +5238 -0
  79. data/ext/ggml/src/ggml-quants.h +100 -0
  80. data/ext/ggml/src/ggml-rpc/ggml-rpc.cpp +1406 -0
  81. data/ext/ggml/src/ggml-sycl/common.cpp +95 -0
  82. data/ext/ggml/src/ggml-sycl/concat.cpp +196 -0
  83. data/ext/ggml/src/ggml-sycl/conv.cpp +99 -0
  84. data/ext/ggml/src/ggml-sycl/convert.cpp +547 -0
  85. data/ext/ggml/src/ggml-sycl/dmmv.cpp +1023 -0
  86. data/ext/ggml/src/ggml-sycl/element_wise.cpp +1030 -0
  87. data/ext/ggml/src/ggml-sycl/ggml-sycl.cpp +4729 -0
  88. data/ext/ggml/src/ggml-sycl/im2col.cpp +126 -0
  89. data/ext/ggml/src/ggml-sycl/mmq.cpp +3031 -0
  90. data/ext/ggml/src/ggml-sycl/mmvq.cpp +1015 -0
  91. data/ext/ggml/src/ggml-sycl/norm.cpp +378 -0
  92. data/ext/ggml/src/ggml-sycl/outprod.cpp +56 -0
  93. data/ext/ggml/src/ggml-sycl/rope.cpp +276 -0
  94. data/ext/ggml/src/ggml-sycl/softmax.cpp +251 -0
  95. data/ext/ggml/src/ggml-sycl/tsembd.cpp +72 -0
  96. data/ext/ggml/src/ggml-sycl/wkv6.cpp +141 -0
  97. data/ext/ggml/src/ggml-threading.cpp +12 -0
  98. data/ext/ggml/src/ggml-threading.h +14 -0
  99. data/ext/ggml/src/ggml-vulkan/ggml-vulkan.cpp +8657 -0
  100. data/ext/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp +593 -0
  101. data/ext/ggml/src/ggml.c +7694 -0
  102. data/ext/include/whisper.h +672 -0
  103. data/ext/metal-embed.mk +17 -0
  104. data/ext/metal.mk +6 -0
  105. data/ext/ruby_whisper.cpp +1608 -159
  106. data/ext/ruby_whisper.h +10 -0
  107. data/ext/scripts/get-flags.mk +38 -0
  108. data/ext/src/coreml/whisper-decoder-impl.h +146 -0
  109. data/ext/src/coreml/whisper-decoder-impl.m +201 -0
  110. data/ext/src/coreml/whisper-encoder-impl.h +142 -0
  111. data/ext/src/coreml/whisper-encoder-impl.m +197 -0
  112. data/ext/src/coreml/whisper-encoder.h +26 -0
  113. data/ext/src/openvino/whisper-openvino-encoder.cpp +108 -0
  114. data/ext/src/openvino/whisper-openvino-encoder.h +31 -0
  115. data/ext/src/whisper.cpp +7393 -0
  116. data/extsources.rb +6 -0
  117. data/lib/whisper/model/uri.rb +157 -0
  118. data/lib/whisper.rb +2 -0
  119. data/tests/helper.rb +7 -0
  120. data/tests/jfk_reader/.gitignore +5 -0
  121. data/tests/jfk_reader/extconf.rb +3 -0
  122. data/tests/jfk_reader/jfk_reader.c +68 -0
  123. data/tests/test_callback.rb +160 -0
  124. data/tests/test_error.rb +20 -0
  125. data/tests/test_model.rb +71 -0
  126. data/tests/test_package.rb +31 -0
  127. data/tests/test_params.rb +160 -0
  128. data/tests/test_segment.rb +83 -0
  129. data/tests/test_whisper.rb +211 -123
  130. data/whispercpp.gemspec +36 -0
  131. metadata +137 -11
  132. data/ext/ggml.c +0 -8616
  133. data/ext/ggml.h +0 -748
  134. data/ext/whisper.cpp +0 -4829
  135. data/ext/whisper.h +0 -402
data/README.md CHANGED
@@ -1,500 +1,231 @@
- # whisper.cpp
+ whispercpp
+ ==========
 
- [![Actions Status](https://github.com/ggerganov/whisper.cpp/workflows/CI/badge.svg)](https://github.com/ggerganov/whisper.cpp/actions)
- [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
- [![npm](https://img.shields.io/npm/v/whisper.cpp.svg)](https://www.npmjs.com/package/whisper.cpp/)
+ ![whisper.cpp](https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg)
 
- Stable: [v1.2.0](https://github.com/ggerganov/whisper.cpp/releases/tag/v1.2.0) / [Roadmap | F.A.Q.](https://github.com/ggerganov/whisper.cpp/discussions/126)
+ Ruby bindings for [whisper.cpp][], a high-performance implementation of OpenAI's Whisper automatic speech recognition (ASR) model.
 
- High-performance inference of [OpenAI's Whisper](https://github.com/openai/whisper) automatic speech recognition (ASR) model:
+ Installation
+ ------------
 
- - Plain C/C++ implementation without dependencies
- - Apple silicon first-class citizen - optimized via Arm Neon and Accelerate framework
- - AVX intrinsics support for x86 architectures
- - VSX intrinsics support for POWER architectures
- - Mixed F16 / F32 precision
- - Low memory usage (Flash Attention)
- - Zero memory allocations at runtime
- - Runs on the CPU
- - [C-style API](https://github.com/ggerganov/whisper.cpp/blob/master/whisper.h)
+ Install the gem and add to the application's Gemfile by executing:
 
- Supported platforms:
+     $ bundle add whispercpp
 
- - [x] Mac OS (Intel and Arm)
- - [x] [iOS](examples/whisper.objc)
- - [x] [Android](examples/whisper.android)
- - [x] Linux / [FreeBSD](https://github.com/ggerganov/whisper.cpp/issues/56#issuecomment-1350920264)
- - [x] [WebAssembly](examples/whisper.wasm)
- - [x] Windows ([MSVC](https://github.com/ggerganov/whisper.cpp/blob/master/.github/workflows/build.yml#L117-L144) and [MinGW](https://github.com/ggerganov/whisper.cpp/issues/168))
- - [x] [Raspberry Pi](https://github.com/ggerganov/whisper.cpp/discussions/166)
+ If bundler is not being used to manage dependencies, install the gem by executing:
 
- The entire implementation of the model is contained in 2 source files:
+     $ gem install whispercpp
 
- - Tensor operations: [ggml.h](ggml.h) / [ggml.c](ggml.c)
- - Transformer inference: [whisper.h](whisper.h) / [whisper.cpp](whisper.cpp)
+ Usage
+ -----
 
- Having such a lightweight implementation of the model allows to easily integrate it in different platforms and applications.
- As an example, here is a video of running the model on an iPhone 13 device - fully offline, on-device: [whisper.objc](examples/whisper.objc)
+ ```ruby
+ require "whisper"
 
- https://user-images.githubusercontent.com/1991296/197385372-962a6dea-bca1-4d50-bf96-1d8c27b98c81.mp4
+ whisper = Whisper::Context.new("base")
 
- You can also easily make your own offline voice assistant application: [command](examples/command)
+ params = Whisper::Params.new
+ params.language = "en"
+ params.offset = 10_000 # start offset, in milliseconds
+ params.duration = 60_000 # audio duration to process, in milliseconds
+ params.max_text_tokens = 300
+ params.translate = true
+ params.print_timestamps = false
+ params.initial_prompt = "Initial prompt here."
 
- https://user-images.githubusercontent.com/1991296/204038393-2f846eae-c255-4099-a76d-5735c25c49da.mp4
+ whisper.transcribe("path/to/audio.wav", params) do |whole_text|
+   puts whole_text
+ end
 
- Or you can even run it straight in the browser: [talk.wasm](examples/talk.wasm)
-
- ## Implementation details
-
- - The core tensor operations are implemented in C ([ggml.h](ggml.h) / [ggml.c](ggml.c))
- - The transformer model and the high-level C-style API are implemented in C++ ([whisper.h](whisper.h) / [whisper.cpp](whisper.cpp))
- - Sample usage is demonstrated in [main.cpp](examples/main)
- - Sample real-time audio transcription from the microphone is demonstrated in [stream.cpp](examples/stream)
- - Various other examples are available in the [examples](examples) folder
-
- The tensor operators are optimized heavily for Apple silicon CPUs. Depending on the computation size, Arm Neon SIMD
- intrinsics or CBLAS Accelerate framework routines are used. The latter are especially effective for bigger sizes since
- the Accelerate framework utilizes the special-purpose AMX coprocessor available in modern Apple products.
-
- ## Quick start
-
- First, download one of the Whisper models converted in [ggml format](models). For example:
-
- ```bash
- bash ./models/download-ggml-model.sh base.en
 ```
 
- Now build the [main](examples/main) example and transcribe an audio file like this:
+ ### Preparing model ###
 
- ```bash
- # build the main example
- make
+ Some models are prepared up-front:
 
- # transcribe an audio file
- ./main -f samples/jfk.wav
+ ```ruby
+ base_en = Whisper::Model.pre_converted_models["base.en"]
+ whisper = Whisper::Context.new(base_en)
 ```
 
- ---
+ The first time you use a model, it is downloaded automatically. After that, the cached file is used. To clear the cache, call `#clear_cache`:
 
- For a quick demo, simply run `make base.en`:
-
- ```java
- $ make base.en
-
- cc -I. -O3 -std=c11 -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
- c++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp -o whisper.o
- c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main -framework Accelerate
- ./main -h
-
- usage: ./main [options] file0.wav file1.wav ...
-
- options:
- -h, --help [default] show this help message and exit
- -t N, --threads N [4 ] number of threads to use during computation
- -p N, --processors N [1 ] number of processors to use during computation
- -ot N, --offset-t N [0 ] time offset in milliseconds
- -on N, --offset-n N [0 ] segment index offset
- -d N, --duration N [0 ] duration of audio to process in milliseconds
- -mc N, --max-context N [-1 ] maximum number of text context tokens to store
- -ml N, --max-len N [0 ] maximum segment length in characters
- -bo N, --best-of N [5 ] number of best candidates to keep
- -bs N, --beam-size N [-1 ] beam size for beam search
- -wt N, --word-thold N [0.01 ] word timestamp probability threshold
- -et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
- -lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
- -su, --speed-up [false ] speed up audio by x2 (reduced accuracy)
- -tr, --translate [false ] translate from source language to english
- -di, --diarize [false ] stereo audio diarization
- -nf, --no-fallback [false ] do not use temperature fallback while decoding
- -otxt, --output-txt [false ] output result in a text file
- -ovtt, --output-vtt [false ] output result in a vtt file
- -osrt, --output-srt [false ] output result in a srt file
- -owts, --output-words [false ] output script for generating karaoke video
- -ocsv, --output-csv [false ] output result in a CSV file
- -of FNAME, --output-file FNAME [ ] output file path (without file extension)
- -ps, --print-special [false ] print special tokens
- -pc, --print-colors [false ] print colors
- -pp, --print-progress [false ] print progress
- -nt, --no-timestamps [true ] do not print timestamps
- -l LANG, --language LANG [en ] spoken language ('auto' for auto-detect)
- --prompt PROMPT [ ] initial prompt
- -m FNAME, --model FNAME [models/ggml-base.en.bin] model path
- -f FNAME, --file FNAME [ ] input WAV file path
-
-
- bash ./models/download-ggml-model.sh base.en
- Downloading ggml model base.en ...
- ggml-base.en.bin 100%[========================>] 141.11M 6.34MB/s in 24s
- Done! Model 'base.en' saved in 'models/ggml-base.en.bin'
- You can now use it like this:
-
- $ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
-
-
- ===============================================
- Running base.en on all samples in ./samples ...
- ===============================================
-
- ----------------------------------------------
- [+] Running base.en on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
- ----------------------------------------------
-
- whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
- whisper_model_load: loading model
- whisper_model_load: n_vocab = 51864
- whisper_model_load: n_audio_ctx = 1500
- whisper_model_load: n_audio_state = 512
- whisper_model_load: n_audio_head = 8
- whisper_model_load: n_audio_layer = 6
- whisper_model_load: n_text_ctx = 448
- whisper_model_load: n_text_state = 512
- whisper_model_load: n_text_head = 8
- whisper_model_load: n_text_layer = 6
- whisper_model_load: n_mels = 80
- whisper_model_load: f16 = 1
- whisper_model_load: type = 2
- whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
- whisper_model_load: kv self size = 5.25 MB
- whisper_model_load: kv cross size = 17.58 MB
- whisper_model_load: adding 1607 extra tokens
- whisper_model_load: model ctx = 140.60 MB
- whisper_model_load: model size = 140.54 MB
-
- system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
-
- main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
-
-
- [00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
-
-
- whisper_print_timings: fallbacks = 0 p / 0 h
- whisper_print_timings: load time = 113.81 ms
- whisper_print_timings: mel time = 15.40 ms
- whisper_print_timings: sample time = 11.58 ms / 27 runs ( 0.43 ms per run)
- whisper_print_timings: encode time = 266.60 ms / 1 runs ( 266.60 ms per run)
- whisper_print_timings: decode time = 66.11 ms / 27 runs ( 2.45 ms per run)
- whisper_print_timings: total time = 476.31 ms
+ ```ruby
+ Whisper::Model.pre_converted_models["base"].clear_cache
 ```
 
- The command downloads the `base.en` model converted to custom `ggml` format and runs the inference on all `.wav` samples in the folder `samples`.
-
- For detailed usage instructions, run: `./main -h`
-
- Note that the [main](examples/main) example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool.
- For example, you can use `ffmpeg` like this:
+ You can also use shorthand names for pre-converted models:
 
- ```java
- ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
+ ```ruby
+ whisper = Whisper::Context.new("base.en")
 ```
 
- ## More audio samples
-
- If you want some extra audio samples to play with, simply run:
-
+ You can see the list of prepared model names via `Whisper::Model.pre_converted_models.keys`:
+
+ ```ruby
+ puts Whisper::Model.pre_converted_models.keys
+ # tiny
+ # tiny.en
+ # tiny-q5_1
+ # tiny.en-q5_1
+ # tiny-q8_0
+ # base
+ # base.en
+ # base-q5_1
+ # base.en-q5_1
+ # base-q8_0
+ #   :
+ #   :
 ```
- make samples
- ```
-
- This will download a few more audio files from Wikipedia and convert them to 16-bit WAV format via `ffmpeg`.
 
- You can download and run the other models as follows:
+ You can also use local model files you have prepared:
 
+ ```ruby
+ whisper = Whisper::Context.new("path/to/your/model.bin")
 ```
- make tiny.en
- make tiny
- make base.en
- make base
- make small.en
- make small
- make medium.en
- make medium
- make large-v1
- make large
- ```
-
- ## Memory usage
-
- | Model | Disk | Mem | SHA |
- | --- | --- | --- | --- |
- | tiny | 75 MB | ~125 MB | `bd577a113a864445d4c299885e0cb97d4ba92b5f` |
- | base | 142 MB | ~210 MB | `465707469ff3a37a2b9b8d8f89f2f99de7299dac` |
- | small | 466 MB | ~600 MB | `55356645c2b361a969dfd0ef2c5a50d530afd8d5` |
- | medium | 1.5 GB | ~1.7 GB | `fd9727b6e1217c2f614f9b698455c4ffd82463b4` |
- | large | 2.9 GB | ~3.3 GB | `0f4c8e34f21cf1a914c59d8b3ce882345ad349d6` |
-
- ## Limitations
-
- - Inference only
- - No GPU support (yet)
-
- ## Another example
-
- Here is another example of transcribing a [3:24 min speech](https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg)
- in about half a minute on a MacBook M1 Pro, using `medium.en` model:
-
- <details>
-   <summary>Expand to see the result</summary>
-
- ```java
- $ ./main -m models/ggml-medium.en.bin -f samples/gb1.wav -t 8
-
- whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
- whisper_model_load: loading model
- whisper_model_load: n_vocab = 51864
- whisper_model_load: n_audio_ctx = 1500
- whisper_model_load: n_audio_state = 1024
- whisper_model_load: n_audio_head = 16
- whisper_model_load: n_audio_layer = 24
- whisper_model_load: n_text_ctx = 448
- whisper_model_load: n_text_state = 1024
- whisper_model_load: n_text_head = 16
- whisper_model_load: n_text_layer = 24
- whisper_model_load: n_mels = 80
- whisper_model_load: f16 = 1
- whisper_model_load: type = 4
- whisper_model_load: mem required = 1720.00 MB (+ 43.00 MB per decoder)
- whisper_model_load: kv self size = 42.00 MB
- whisper_model_load: kv cross size = 140.62 MB
- whisper_model_load: adding 1607 extra tokens
- whisper_model_load: model ctx = 1462.35 MB
- whisper_model_load: model size = 1462.12 MB
-
- system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
-
- main: processing 'samples/gb1.wav' (3179750 samples, 198.7 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
-
-
- [00:00:00.000 --> 00:00:08.000] My fellow Americans, this day has brought terrible news and great sadness to our country.
- [00:00:08.000 --> 00:00:17.000] At nine o'clock this morning, Mission Control in Houston lost contact with our Space Shuttle Columbia.
- [00:00:17.000 --> 00:00:23.000] A short time later, debris was seen falling from the skies above Texas.
- [00:00:23.000 --> 00:00:29.000] The Columbia's lost. There are no survivors.
- [00:00:29.000 --> 00:00:32.000] On board was a crew of seven.
- [00:00:32.000 --> 00:00:39.000] Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark,
- [00:00:39.000 --> 00:00:48.000] Captain David Brown, Commander William McCool, Dr. Kultna Shavla, and Ilan Ramon,
- [00:00:48.000 --> 00:00:52.000] a colonel in the Israeli Air Force.
- [00:00:52.000 --> 00:00:58.000] These men and women assumed great risk in the service to all humanity.
- [00:00:58.000 --> 00:01:03.000] In an age when space flight has come to seem almost routine,
- [00:01:03.000 --> 00:01:07.000] it is easy to overlook the dangers of travel by rocket
- [00:01:07.000 --> 00:01:12.000] and the difficulties of navigating the fierce outer atmosphere of the Earth.
- [00:01:12.000 --> 00:01:18.000] These astronauts knew the dangers, and they faced them willingly,
- [00:01:18.000 --> 00:01:23.000] knowing they had a high and noble purpose in life.
- [00:01:23.000 --> 00:01:31.000] Because of their courage and daring and idealism, we will miss them all the more.
- [00:01:31.000 --> 00:01:36.000] All Americans today are thinking as well of the families of these men and women
- [00:01:36.000 --> 00:01:40.000] who have been given this sudden shock and grief.
- [00:01:40.000 --> 00:01:45.000] You're not alone. Our entire nation grieves with you,
- [00:01:45.000 --> 00:01:52.000] and those you love will always have the respect and gratitude of this country.
- [00:01:52.000 --> 00:01:56.000] The cause in which they died will continue.
- [00:01:56.000 --> 00:02:04.000] Mankind is led into the darkness beyond our world by the inspiration of discovery
- [00:02:04.000 --> 00:02:11.000] and the longing to understand. Our journey into space will go on.
- [00:02:11.000 --> 00:02:16.000] In the skies today, we saw destruction and tragedy.
- [00:02:16.000 --> 00:02:22.000] Yet farther than we can see, there is comfort and hope.
- [00:02:22.000 --> 00:02:29.000] In the words of the prophet Isaiah, "Lift your eyes and look to the heavens
- [00:02:29.000 --> 00:02:35.000] who created all these. He who brings out the starry hosts one by one
- [00:02:35.000 --> 00:02:39.000] and calls them each by name."
- [00:02:39.000 --> 00:02:46.000] Because of His great power and mighty strength, not one of them is missing.
- [00:02:46.000 --> 00:02:55.000] The same Creator who names the stars also knows the names of the seven souls we mourn today.
- [00:02:55.000 --> 00:03:01.000] The crew of the shuttle Columbia did not return safely to earth,
- [00:03:01.000 --> 00:03:05.000] yet we can pray that all are safely home.
- [00:03:05.000 --> 00:03:13.000] May God bless the grieving families, and may God continue to bless America.
- [00:03:13.000 --> 00:03:19.000] [Silence]
-
-
- whisper_print_timings: fallbacks = 1 p / 0 h
- whisper_print_timings: load time = 569.03 ms
- whisper_print_timings: mel time = 146.85 ms
- whisper_print_timings: sample time = 238.66 ms / 553 runs ( 0.43 ms per run)
- whisper_print_timings: encode time = 18665.10 ms / 9 runs ( 2073.90 ms per run)
- whisper_print_timings: decode time = 13090.93 ms / 549 runs ( 23.85 ms per run)
- whisper_print_timings: total time = 32733.52 ms
- ```
- </details>
-
- ## Real-time audio input example
 
- This is a naive example of performing real-time inference on audio from your microphone.
- The [stream](examples/stream) tool samples the audio every half a second and runs the transcription continuously.
- More info is available in [issue #10](https://github.com/ggerganov/whisper.cpp/issues/10).
+ Or you can specify the URL of a model file to download:
 
- ```java
- make stream
- ./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
+ ```ruby
+ model_uri = Whisper::Model::URI.new("http://example.net/uri/of/your/model.bin")
+ whisper = Whisper::Context.new(model_uri)
 ```
 
- https://user-images.githubusercontent.com/1991296/194935793-76afede7-cfa8-48d8-a80f-28ba83be7d09.mp4
+ See the [models][] page for details.
 
- ## Confidence color-coding
+ ### Preparing an audio file ###
 
- Adding the `--print-colors` argument will print the transcribed text using an experimental color coding strategy
- to highlight words with high or low confidence:
+ Currently, whisper.cpp accepts only 16-bit WAV files.
 
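+ If your audio is in another format, you can convert it with `ffmpeg`, as the upstream whisper.cpp README suggests:
+
+ ```bash
+ # resample to 16 kHz mono, 16-bit signed PCM
+ ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
+ ```
 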
- <img width="965" alt="image" src="https://user-images.githubusercontent.com/1991296/197356445-311c8643-9397-4e5e-b46e-0b4b4daa2530.png">
-
- ## Controlling the length of the generated text segments (experimental)
+ API
+ ---
 
- For example, to limit the line length to a maximum of 16 characters, simply add `-ml 16`:
+ ### Segments ###
 
- ```java
- ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16
+ Once `Whisper::Context#transcribe` has been called, you can retrieve segments with `#each_segment`:
 
- whisper_model_load: loading model from './models/ggml-base.en.bin'
- ...
- system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
+ ```ruby
+ def format_time(time_ms)
+   sec, decimal_part = time_ms.divmod(1000)
+   min, sec = sec.divmod(60)
+   hour, min = min.divmod(60)
+   "%02d:%02d:%02d.%03d" % [hour, min, sec, decimal_part]
+ end
 
- main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
+ whisper.transcribe("path/to/audio.wav", params)
 
- [00:00:00.000 --> 00:00:00.850] And so my
- [00:00:00.850 --> 00:00:01.590] fellow
- [00:00:01.590 --> 00:00:04.140] Americans, ask
- [00:00:04.140 --> 00:00:05.660] not what your
- [00:00:05.660 --> 00:00:06.840] country can do
- [00:00:06.840 --> 00:00:08.430] for you, ask
- [00:00:08.430 --> 00:00:09.440] what you can do
- [00:00:09.440 --> 00:00:10.020] for your
- [00:00:10.020 --> 00:00:11.000] country.
- ```
+ whisper.each_segment.with_index do |segment, index|
+   line = "[%{nth}: %{st} --> %{ed}] %{text}" % {
+     nth: index + 1,
+     st: format_time(segment.start_time),
+     ed: format_time(segment.end_time),
+     text: segment.text
+   }
+   line << " (speaker turned)" if segment.speaker_next_turn?
+   puts line
+ end
 
- ## Word-level timestamp
-
- The `--max-len` argument can be used to obtain word-level timestamps. Simply use `-ml 1`:
-
- ```java
- ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
-
- whisper_model_load: loading model from './models/ggml-base.en.bin'
- ...
- system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
-
- main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
-
- [00:00:00.000 --> 00:00:00.320]
- [00:00:00.320 --> 00:00:00.370] And
- [00:00:00.370 --> 00:00:00.690] so
- [00:00:00.690 --> 00:00:00.850] my
- [00:00:00.850 --> 00:00:01.590] fellow
- [00:00:01.590 --> 00:00:02.850] Americans
- [00:00:02.850 --> 00:00:03.300] ,
- [00:00:03.300 --> 00:00:04.140] ask
- [00:00:04.140 --> 00:00:04.990] not
- [00:00:04.990 --> 00:00:05.410] what
- [00:00:05.410 --> 00:00:05.660] your
- [00:00:05.660 --> 00:00:06.260] country
- [00:00:06.260 --> 00:00:06.600] can
- [00:00:06.600 --> 00:00:06.840] do
- [00:00:06.840 --> 00:00:07.010] for
- [00:00:07.010 --> 00:00:08.170] you
- [00:00:08.170 --> 00:00:08.190] ,
- [00:00:08.190 --> 00:00:08.430] ask
- [00:00:08.430 --> 00:00:08.910] what
- [00:00:08.910 --> 00:00:09.040] you
- [00:00:09.040 --> 00:00:09.320] can
- [00:00:09.320 --> 00:00:09.440] do
- [00:00:09.440 --> 00:00:09.760] for
- [00:00:09.760 --> 00:00:10.020] your
- [00:00:10.020 --> 00:00:10.510] country
- [00:00:10.510 --> 00:00:11.000] .
 ```
 
- ## Karaoke-style movie generation (experimental)
+ You can also add a hook to the params that is called on each new segment:
 
- The [main](examples/main) example provides support for output of karaoke-style movies, where the
- currently pronounced word is highlighted. Use the `-wts` argument and run the generated bash script.
- This requires to have `ffmpeg` installed.
+ ```ruby
+ # Add the hook before calling #transcribe
+ params.on_new_segment do |segment|
+   line = "[%{st} --> %{ed}] %{text}" % {
+     st: format_time(segment.start_time),
+     ed: format_time(segment.end_time),
+     text: segment.text
+   }
+   line << " (speaker turned)" if segment.speaker_next_turn?
+   puts line
+ end
 
- Here are a few *"typical"* examples:
+ whisper.transcribe("path/to/audio.wav", params)
 
- ```java
- ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -owts
- source ./samples/jfk.wav.wts
- ffplay ./samples/jfk.wav.mp4
 ```
 
- https://user-images.githubusercontent.com/1991296/199337465-dbee4b5e-9aeb-48a3-b1c6-323ac4db5b2c.mp4
+ ### Models ###
 
- ---
+ You can inspect model information:
 
- ```java
- ./main -m ./models/ggml-base.en.bin -f ./samples/mm0.wav -owts
- source ./samples/mm0.wav.wts
- ffplay ./samples/mm0.wav.mp4
- ```
+ ```ruby
+ whisper = Whisper::Context.new("base")
+ model = whisper.model
 
- https://user-images.githubusercontent.com/1991296/199337504-cc8fd233-0cb7-4920-95f9-4227de3570aa.mp4
+ model.n_vocab # => 51864
+ model.n_audio_ctx # => 1500
+ model.n_audio_state # => 512
+ model.n_audio_head # => 8
+ model.n_audio_layer # => 6
+ model.n_text_ctx # => 448
+ model.n_text_state # => 512
+ model.n_text_head # => 8
+ model.n_text_layer # => 6
+ model.n_mels # => 80
+ model.ftype # => 1
+ model.type # => "base"
 
- ---
-
- ```java
- ./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav -owts
- source ./samples/gb0.wav.wts
- ffplay ./samples/gb0.wav.mp4
 ```
 
- https://user-images.githubusercontent.com/1991296/199337538-b7b0c7a3-2753-4a88-a0cd-f28a317987ba.mp4
-
- ---
-
- ## Benchmarks
-
- In order to have an objective comparison of the performance of the inference across different system configurations,
- use the [bench](examples/bench) tool. The tool simply runs the Encoder part of the model and prints how much time it
- took to execute it. The results are summarized in the following Github issue:
-
- [Benchmark results](https://github.com/ggerganov/whisper.cpp/issues/89)
-
- ## ggml format
-
- The original models are converted to a custom binary format. This allows to pack everything needed into a single file:
+ ### Logging ###
+
+ You can set a log callback:
+
+ ```ruby
+ prefix = "[MyApp] "
+ log_callback = ->(level, buffer, user_data) {
+   case level
+   when Whisper::LOG_LEVEL_NONE
+     puts "#{user_data}none: #{buffer}"
+   when Whisper::LOG_LEVEL_INFO
+     puts "#{user_data}info: #{buffer}"
+   when Whisper::LOG_LEVEL_WARN
+     puts "#{user_data}warn: #{buffer}"
+   when Whisper::LOG_LEVEL_ERROR
+     puts "#{user_data}error: #{buffer}"
+   when Whisper::LOG_LEVEL_DEBUG
+     puts "#{user_data}debug: #{buffer}"
+   when Whisper::LOG_LEVEL_CONT
+     puts "#{user_data}same as previous: #{buffer}"
+   end
+ }
+ Whisper.log_set log_callback, prefix
+ ```
 
- - model parameters
- - mel filters
- - vocabulary
- - weights
+ Using this feature, you can also suppress logging:
 
- You can download the converted models using the [models/download-ggml-model.sh](models/download-ggml-model.sh) script
- or manually from here:
+ ```ruby
+ Whisper.log_set ->(level, buffer, user_data) {
+   # do nothing
+ }, nil
+ Whisper::Context.new("base")
+ ```
 
- - https://huggingface.co/datasets/ggerganov/whisper.cpp
- - https://ggml.ggerganov.com
+ ### Low-level API to transcribe ###
 
- For more details, see the conversion script [models/convert-pt-to-ggml.py](models/convert-pt-to-ggml.py) or the README
- in [models](models).
+ You can also call `Whisper::Context#full` and `#full_parallel` with a Ruby array of samples. Using `#transcribe` with an audio file path is recommended because it extracts PCM samples in C++ and is fast, but `#full` and `#full_parallel` give you more flexibility.
 
- ## [Bindings](https://github.com/ggerganov/whisper.cpp/discussions/categories/bindings)
+ ```ruby
+ require "whisper"
+ require "wavefile"
 
- - [X] Rust: [tazz4843/whisper-rs](https://github.com/tazz4843/whisper-rs) | [#310](https://github.com/ggerganov/whisper.cpp/discussions/310)
- - [X] Javascript: [bindings/javascript](bindings/javascript) | [#309](https://github.com/ggerganov/whisper.cpp/discussions/309)
- - [X] Go: [bindings/go](bindings/go) | [#312](https://github.com/ggerganov/whisper.cpp/discussions/312)
- - [X] Ruby: [bindings/ruby](bindings/ruby) | [#507](https://github.com/ggerganov/whisper.cpp/discussions/507)
- - [X] Objective-C / Swift: [ggerganov/whisper.spm](https://github.com/ggerganov/whisper.spm) | [#313](https://github.com/ggerganov/whisper.cpp/discussions/313)
- - [X] .NET: | [#422](https://github.com/ggerganov/whisper.cpp/discussions/422)
-   - [sandrohanea/whisper.net](https://github.com/sandrohanea/whisper.net)
-   - [NickDarvey/whisper](https://github.com/NickDarvey/whisper)
- - [X] Python: | [#9](https://github.com/ggerganov/whisper.cpp/issues/9)
-   - [stlukey/whispercpp.py](https://github.com/stlukey/whispercpp.py) (Cython)
+ reader = WaveFile::Reader.new("path/to/audio.wav", WaveFile::Format.new(:mono, :float, 16000))
+ samples = reader.enum_for(:each_buffer).map(&:samples).flatten
 
- ## Examples
+ whisper = Whisper::Context.new("base")
+ whisper.full(Whisper::Params.new, samples)
+ whisper.each_segment do |segment|
+   puts segment.text
+ end
+ ```
 
- There are various examples of using the library for different projects in the [examples](examples) folder.
- Some of the examples are even ported to run in the browser using WebAssembly. Check them out!
+ The second argument `samples` may be an array, an object with a `length` method, or a MemoryView. If you can prepare the audio data as a C array and export it as a MemoryView, whispercpp accepts it and works with zero copy.
 
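+ For instance, a plain Ruby array of Float samples works. A minimal sketch (one second of silence at 16 kHz, just to show the call shape):
+
+ ```ruby
+ # Any Float array of PCM values in [-1.0, 1.0] can be passed as samples
+ silence = Array.new(16_000, 0.0)
+ whisper.full(Whisper::Params.new, silence)
+ ```
 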
- | Example | Web | Description |
- | --- | --- | --- |
- | [main](examples/main) | [whisper.wasm](examples/whisper.wasm) | Tool for translating and transcribing audio using Whisper |
- | [bench](examples/bench) | [bench.wasm](examples/bench.wasm) | Benchmark the performance of Whisper on your machine |
- | [stream](examples/stream) | [stream.wasm](examples/stream.wasm) | Real-time transcription of raw microphone capture |
- | [command](examples/command) | [command.wasm](examples/command.wasm) | Basic voice assistant example for receiving voice commands from the mic |
- | [talk](examples/talk) | [talk.wasm](examples/talk.wasm) | Talk with a GPT-2 bot |
- | [whisper.objc](examples/whisper.objc) | | iOS mobile application using whisper.cpp |
- | [whisper.swiftui](examples/whisper.swiftui) | | SwiftUI iOS / macOS application using whisper.cpp |
- | [whisper.android](examples/whisper.android) | | Android mobile application using whisper.cpp |
- | [whisper.nvim](examples/whisper.nvim) | | Speech-to-text plugin for Neovim |
- | [generate-karaoke.sh](examples/generate-karaoke.sh) | | Helper script to easily [generate a karaoke video](https://youtu.be/uj7hVta4blM) of raw audio capture |
- | [livestream.sh](examples/livestream.sh) | | [Livestream audio transcription](https://github.com/ggerganov/whisper.cpp/issues/185) |
- | [yt-wsp.sh](examples/yt-wsp.sh) | | Download + transcribe and/or translate any VOD [(original)](https://gist.github.com/DaniruKun/96f763ec1a037cc92fe1a059b643b818) |
+ License
+ -------
 
- ## [Discussions](https://github.com/ggerganov/whisper.cpp/discussions)
+ Same as [whisper.cpp][].
 
- If you have any kind of feedback about this project feel free to use the Discussions section and open a new topic.
- You can use the [Show and tell](https://github.com/ggerganov/whisper.cpp/discussions/categories/show-and-tell) category
- to share your own projects that use `whisper.cpp`. If you have a question, make sure to check the
- [Frequently asked questions (#126)](https://github.com/ggerganov/whisper.cpp/discussions/126) discussion.
+ [whisper.cpp]: https://github.com/ggerganov/whisper.cpp
+ [models]: https://github.com/ggerganov/whisper.cpp/tree/master/models