whispercpp 1.3.0 → 1.3.1
- checksums.yaml +4 -4
- data/.gitignore +5 -0
- data/LICENSE +1 -1
- data/README.md +165 -434
- data/Rakefile +60 -11
- data/ext/.gitignore +13 -0
- data/ext/cpu.mk +9 -0
- data/ext/{dr_wav.h → examples/dr_wav.h} +3560 -1179
- data/ext/extconf.rb +185 -16
- data/ext/ggml/include/ggml-alloc.h +76 -0
- data/ext/ggml/include/ggml-backend.h +352 -0
- data/ext/ggml/include/ggml-blas.h +25 -0
- data/ext/ggml/include/ggml-cann.h +123 -0
- data/ext/ggml/include/ggml-cpp.h +38 -0
- data/ext/ggml/include/ggml-cpu.h +135 -0
- data/ext/ggml/include/ggml-cuda.h +47 -0
- data/ext/ggml/include/ggml-kompute.h +50 -0
- data/ext/ggml/include/ggml-metal.h +66 -0
- data/ext/ggml/include/ggml-opencl.h +26 -0
- data/ext/ggml/include/ggml-opt.h +216 -0
- data/ext/ggml/include/ggml-rpc.h +28 -0
- data/ext/ggml/include/ggml-sycl.h +49 -0
- data/ext/ggml/include/ggml-vulkan.h +31 -0
- data/ext/{ggml.h → ggml/include/ggml.h} +479 -596
- data/ext/ggml/src/ggml-alloc.c +1037 -0
- data/ext/ggml/src/ggml-amx/common.h +94 -0
- data/ext/ggml/src/ggml-amx/ggml-amx.cpp +446 -0
- data/ext/ggml/src/ggml-amx/mmq.cpp +2510 -0
- data/ext/ggml/src/ggml-amx/mmq.h +17 -0
- data/ext/ggml/src/ggml-backend-impl.h +256 -0
- data/ext/ggml/src/ggml-backend-reg.cpp +552 -0
- data/ext/ggml/src/ggml-backend.cpp +1999 -0
- data/ext/ggml/src/ggml-blas/ggml-blas.cpp +517 -0
- data/ext/ggml/src/ggml-cann/acl_tensor.cpp +175 -0
- data/ext/ggml/src/ggml-cann/acl_tensor.h +258 -0
- data/ext/ggml/src/ggml-cann/aclnn_ops.cpp +3427 -0
- data/ext/ggml/src/ggml-cann/aclnn_ops.h +592 -0
- data/ext/ggml/src/ggml-cann/common.h +286 -0
- data/ext/ggml/src/ggml-cann/ggml-cann.cpp +2188 -0
- data/ext/ggml/src/ggml-cann/kernels/ascendc_kernels.h +19 -0
- data/ext/ggml/src/ggml-cann/kernels/dup.cpp +236 -0
- data/ext/ggml/src/ggml-cann/kernels/get_row_f16.cpp +197 -0
- data/ext/ggml/src/ggml-cann/kernels/get_row_f32.cpp +190 -0
- data/ext/ggml/src/ggml-cann/kernels/get_row_q4_0.cpp +204 -0
- data/ext/ggml/src/ggml-cann/kernels/get_row_q8_0.cpp +191 -0
- data/ext/ggml/src/ggml-cann/kernels/quantize_f16_q8_0.cpp +218 -0
- data/ext/ggml/src/ggml-cann/kernels/quantize_f32_q8_0.cpp +216 -0
- data/ext/ggml/src/ggml-cann/kernels/quantize_float_to_q4_0.cpp +295 -0
- data/ext/ggml/src/ggml-common.h +1853 -0
- data/ext/ggml/src/ggml-cpu/amx/amx.cpp +220 -0
- data/ext/ggml/src/ggml-cpu/amx/amx.h +8 -0
- data/ext/ggml/src/ggml-cpu/amx/common.h +91 -0
- data/ext/ggml/src/ggml-cpu/amx/mmq.cpp +2511 -0
- data/ext/ggml/src/ggml-cpu/amx/mmq.h +10 -0
- data/ext/ggml/src/ggml-cpu/cpu-feats-x86.cpp +323 -0
- data/ext/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp +4262 -0
- data/ext/ggml/src/ggml-cpu/ggml-cpu-aarch64.h +8 -0
- data/ext/ggml/src/ggml-cpu/ggml-cpu-hbm.cpp +55 -0
- data/ext/ggml/src/ggml-cpu/ggml-cpu-hbm.h +8 -0
- data/ext/ggml/src/ggml-cpu/ggml-cpu-impl.h +386 -0
- data/ext/ggml/src/ggml-cpu/ggml-cpu-quants.c +10835 -0
- data/ext/ggml/src/ggml-cpu/ggml-cpu-quants.h +63 -0
- data/ext/ggml/src/ggml-cpu/ggml-cpu-traits.cpp +36 -0
- data/ext/ggml/src/ggml-cpu/ggml-cpu-traits.h +38 -0
- data/ext/ggml/src/ggml-cpu/ggml-cpu.c +14123 -0
- data/ext/ggml/src/ggml-cpu/ggml-cpu.cpp +622 -0
- data/ext/ggml/src/ggml-cpu/llamafile/sgemm.cpp +1884 -0
- data/ext/ggml/src/ggml-cpu/llamafile/sgemm.h +14 -0
- data/ext/ggml/src/ggml-cuda/vendors/cuda.h +14 -0
- data/ext/ggml/src/ggml-cuda/vendors/hip.h +186 -0
- data/ext/ggml/src/ggml-cuda/vendors/musa.h +134 -0
- data/ext/ggml/src/ggml-impl.h +556 -0
- data/ext/ggml/src/ggml-kompute/ggml-kompute.cpp +2251 -0
- data/ext/ggml/src/ggml-metal/ggml-metal-impl.h +288 -0
- data/ext/ggml/src/ggml-metal/ggml-metal.m +4884 -0
- data/ext/ggml/src/ggml-metal/ggml-metal.metal +6732 -0
- data/ext/ggml/src/ggml-opt.cpp +854 -0
- data/ext/ggml/src/ggml-quants.c +5238 -0
- data/ext/ggml/src/ggml-quants.h +100 -0
- data/ext/ggml/src/ggml-rpc/ggml-rpc.cpp +1406 -0
- data/ext/ggml/src/ggml-sycl/common.cpp +95 -0
- data/ext/ggml/src/ggml-sycl/concat.cpp +196 -0
- data/ext/ggml/src/ggml-sycl/conv.cpp +99 -0
- data/ext/ggml/src/ggml-sycl/convert.cpp +547 -0
- data/ext/ggml/src/ggml-sycl/dmmv.cpp +1023 -0
- data/ext/ggml/src/ggml-sycl/element_wise.cpp +1030 -0
- data/ext/ggml/src/ggml-sycl/ggml-sycl.cpp +4729 -0
- data/ext/ggml/src/ggml-sycl/im2col.cpp +126 -0
- data/ext/ggml/src/ggml-sycl/mmq.cpp +3031 -0
- data/ext/ggml/src/ggml-sycl/mmvq.cpp +1015 -0
- data/ext/ggml/src/ggml-sycl/norm.cpp +378 -0
- data/ext/ggml/src/ggml-sycl/outprod.cpp +56 -0
- data/ext/ggml/src/ggml-sycl/rope.cpp +276 -0
- data/ext/ggml/src/ggml-sycl/softmax.cpp +251 -0
- data/ext/ggml/src/ggml-sycl/tsembd.cpp +72 -0
- data/ext/ggml/src/ggml-sycl/wkv6.cpp +141 -0
- data/ext/ggml/src/ggml-threading.cpp +12 -0
- data/ext/ggml/src/ggml-threading.h +14 -0
- data/ext/ggml/src/ggml-vulkan/ggml-vulkan.cpp +8657 -0
- data/ext/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp +593 -0
- data/ext/ggml/src/ggml.c +7694 -0
- data/ext/{whisper.h → include/whisper.h} +23 -22
- data/ext/metal-embed.mk +17 -0
- data/ext/metal.mk +6 -0
- data/ext/ruby_whisper.cpp +1492 -9
- data/ext/ruby_whisper.h +10 -0
- data/ext/scripts/get-flags.mk +38 -0
- data/ext/src/coreml/whisper-decoder-impl.h +146 -0
- data/ext/src/coreml/whisper-decoder-impl.m +201 -0
- data/ext/src/coreml/whisper-encoder-impl.h +142 -0
- data/ext/src/coreml/whisper-encoder-impl.m +197 -0
- data/ext/src/coreml/whisper-encoder.h +26 -0
- data/ext/src/openvino/whisper-openvino-encoder.cpp +108 -0
- data/ext/src/openvino/whisper-openvino-encoder.h +31 -0
- data/ext/{whisper.cpp → src/whisper.cpp} +661 -492
- data/extsources.rb +6 -0
- data/lib/whisper/model/uri.rb +157 -0
- data/lib/whisper.rb +2 -0
- data/tests/helper.rb +7 -0
- data/tests/jfk_reader/.gitignore +5 -0
- data/tests/jfk_reader/extconf.rb +3 -0
- data/tests/jfk_reader/jfk_reader.c +68 -0
- data/tests/test_callback.rb +160 -0
- data/tests/test_error.rb +20 -0
- data/tests/test_model.rb +71 -0
- data/tests/test_package.rb +31 -0
- data/tests/test_params.rb +160 -0
- data/tests/test_segment.rb +83 -0
- data/tests/test_whisper.rb +211 -123
- data/whispercpp.gemspec +36 -0
- metadata +137 -11
- data/ext/ggml.c +0 -21755
data/README.md
CHANGED
@@ -1,500 +1,231 @@
-
+whispercpp
+==========
 
-
-[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
-[![npm](https://img.shields.io/npm/v/whisper.cpp.svg)](https://www.npmjs.com/package/whisper.cpp/)
+![whisper.cpp](https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg)
 
-
+Ruby bindings for [whisper.cpp][], an interface to an automatic speech recognition model.
 
-
+Installation
+------------
 
-
-- Apple silicon first-class citizen - optimized via Arm Neon and Accelerate framework
-- AVX intrinsics support for x86 architectures
-- VSX intrinsics support for POWER architectures
-- Mixed F16 / F32 precision
-- Low memory usage (Flash Attention)
-- Zero memory allocations at runtime
-- Runs on the CPU
-- [C-style API](https://github.com/ggerganov/whisper.cpp/blob/master/whisper.h)
+Install the gem and add to the application's Gemfile by executing:
 
-
+    $ bundle add whispercpp
 
-
-- [x] [iOS](examples/whisper.objc)
-- [x] [Android](examples/whisper.android)
-- [x] Linux / [FreeBSD](https://github.com/ggerganov/whisper.cpp/issues/56#issuecomment-1350920264)
-- [x] [WebAssembly](examples/whisper.wasm)
-- [x] Windows ([MSVC](https://github.com/ggerganov/whisper.cpp/blob/master/.github/workflows/build.yml#L117-L144) and [MinGW](https://github.com/ggerganov/whisper.cpp/issues/168))
-- [x] [Raspberry Pi](https://github.com/ggerganov/whisper.cpp/discussions/166)
+If bundler is not being used to manage dependencies, install the gem by executing:
 
-
+    $ gem install whispercpp
 
-
-
+Usage
+-----
 
-
-
+```ruby
+require "whisper"
 
-
+whisper = Whisper::Context.new("base")
 
-
+params = Whisper::Params.new
+params.language = "en"
+params.offset = 10_000
+params.duration = 60_000
+params.max_text_tokens = 300
+params.translate = true
+params.print_timestamps = false
+params.initial_prompt = "Initial prompt here."
 
-
+whisper.transcribe("path/to/audio.wav", params) do |whole_text|
+  puts whole_text
+end
 
-Or you can even run it straight in the browser: [talk.wasm](examples/talk.wasm)
-
-## Implementation details
-
-- The core tensor operations are implemented in C ([ggml.h](ggml.h) / [ggml.c](ggml.c))
-- The transformer model and the high-level C-style API are implemented in C++ ([whisper.h](whisper.h) / [whisper.cpp](whisper.cpp))
-- Sample usage is demonstrated in [main.cpp](examples/main)
-- Sample real-time audio transcription from the microphone is demonstrated in [stream.cpp](examples/stream)
-- Various other examples are available in the [examples](examples) folder
-
-The tensor operators are optimized heavily for Apple silicon CPUs. Depending on the computation size, Arm Neon SIMD
-intrinsics or CBLAS Accelerate framework routines are used. The latter are especially effective for bigger sizes since
-the Accelerate framework utilizes the special-purpose AMX coprocessor available in modern Apple products.
-
-## Quick start
-
-First, download one of the Whisper models converted in [ggml format](models). For example:
-
-```bash
-bash ./models/download-ggml-model.sh base.en
 ```
 
-
+### Preparing model ###
 
-
-# build the main example
-make
+Some models are prepared up-front:
 
-
-
+```ruby
+base_en = Whisper::Model.pre_converted_models["base.en"]
+whisper = Whisper::Context.new(base_en)
 ```
 
-
+The first time you use a model, it is downloaded automatically. After that, the cached file is used. To clear the cache, call `#clear_cache`:
 
-
-
-```java
-$ make base.en
-
-cc -I. -O3 -std=c11 -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
-c++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp -o whisper.o
-c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main -framework Accelerate
-./main -h
-
-usage: ./main [options] file0.wav file1.wav ...
-
-options:
-  -h, --help [default] show this help message and exit
-  -t N, --threads N [4 ] number of threads to use during computation
-  -p N, --processors N [1 ] number of processors to use during computation
-  -ot N, --offset-t N [0 ] time offset in milliseconds
-  -on N, --offset-n N [0 ] segment index offset
-  -d N, --duration N [0 ] duration of audio to process in milliseconds
-  -mc N, --max-context N [-1 ] maximum number of text context tokens to store
-  -ml N, --max-len N [0 ] maximum segment length in characters
-  -bo N, --best-of N [5 ] number of best candidates to keep
-  -bs N, --beam-size N [-1 ] beam size for beam search
-  -wt N, --word-thold N [0.01 ] word timestamp probability threshold
-  -et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
-  -lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
-  -su, --speed-up [false ] speed up audio by x2 (reduced accuracy)
-  -tr, --translate [false ] translate from source language to english
-  -di, --diarize [false ] stereo audio diarization
-  -nf, --no-fallback [false ] do not use temperature fallback while decoding
-  -otxt, --output-txt [false ] output result in a text file
-  -ovtt, --output-vtt [false ] output result in a vtt file
-  -osrt, --output-srt [false ] output result in a srt file
-  -owts, --output-words [false ] output script for generating karaoke video
-  -ocsv, --output-csv [false ] output result in a CSV file
-  -of FNAME, --output-file FNAME [ ] output file path (without file extension)
-  -ps, --print-special [false ] print special tokens
-  -pc, --print-colors [false ] print colors
-  -pp, --print-progress [false ] print progress
-  -nt, --no-timestamps [true ] do not print timestamps
-  -l LANG, --language LANG [en ] spoken language ('auto' for auto-detect)
-  --prompt PROMPT [ ] initial prompt
-  -m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-  -f FNAME, --file FNAME [ ] input WAV file path
-
-
-bash ./models/download-ggml-model.sh base.en
-Downloading ggml model base.en ...
-ggml-base.en.bin 100%[========================>] 141.11M 6.34MB/s in 24s
-Done! Model 'base.en' saved in 'models/ggml-base.en.bin'
-You can now use it like this:
-
-$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
-
-
-===============================================
-Running base.en on all samples in ./samples ...
-===============================================
-
-----------------------------------------------
-[+] Running base.en on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
-----------------------------------------------
-
-whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
-whisper_model_load: loading model
-whisper_model_load: n_vocab = 51864
-whisper_model_load: n_audio_ctx = 1500
-whisper_model_load: n_audio_state = 512
-whisper_model_load: n_audio_head = 8
-whisper_model_load: n_audio_layer = 6
-whisper_model_load: n_text_ctx = 448
-whisper_model_load: n_text_state = 512
-whisper_model_load: n_text_head = 8
-whisper_model_load: n_text_layer = 6
-whisper_model_load: n_mels = 80
-whisper_model_load: f16 = 1
-whisper_model_load: type = 2
-whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
-whisper_model_load: kv self size = 5.25 MB
-whisper_model_load: kv cross size = 17.58 MB
-whisper_model_load: adding 1607 extra tokens
-whisper_model_load: model ctx = 140.60 MB
-whisper_model_load: model size = 140.54 MB
-
-system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
-
-main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
-
-
-[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
-
-
-whisper_print_timings: fallbacks = 0 p / 0 h
-whisper_print_timings: load time = 113.81 ms
-whisper_print_timings: mel time = 15.40 ms
-whisper_print_timings: sample time = 11.58 ms / 27 runs ( 0.43 ms per run)
-whisper_print_timings: encode time = 266.60 ms / 1 runs ( 266.60 ms per run)
-whisper_print_timings: decode time = 66.11 ms / 27 runs ( 2.45 ms per run)
-whisper_print_timings: total time = 476.31 ms
+```ruby
+Whisper::Model.pre_converted_models["base"].clear_cache
 ```
 
-
-
-For detailed usage instructions, run: `./main -h`
-
-Note that the [main](examples/main) example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool.
-For example, you can use `ffmpeg` like this:
+You can also use shorthand for pre-converted models:
 
-```
-
+```ruby
+whisper = Whisper::Context.new("base.en")
 ```
 
-
-
-
-
+You can see the list of prepared model names with `Whisper::Model.preconverted_model_names`:
+
+```ruby
+puts Whisper::Model.preconverted_model_names
+# tiny
+# tiny.en
+# tiny-q5_1
+# tiny.en-q5_1
+# tiny-q8_0
+# base
+# base.en
+# base-q5_1
+# base.en-q5_1
+# base-q8_0
+# :
+# :
 ```
-make samples
-```
-
-This will download a few more audio files from Wikipedia and convert them to 16-bit WAV format via `ffmpeg`.
 
-You can
+You can also use local model files you prepared:
 
+```ruby
+whisper = Whisper::Context.new("path/to/your/model.bin")
 ```
-make tiny.en
-make tiny
-make base.en
-make base
-make small.en
-make small
-make medium.en
-make medium
-make large-v1
-make large
-```
-
-## Memory usage
-
-| Model | Disk | Mem | SHA |
-| --- | --- | --- | --- |
-| tiny | 75 MB | ~125 MB | `bd577a113a864445d4c299885e0cb97d4ba92b5f` |
-| base | 142 MB | ~210 MB | `465707469ff3a37a2b9b8d8f89f2f99de7299dac` |
-| small | 466 MB | ~600 MB | `55356645c2b361a969dfd0ef2c5a50d530afd8d5` |
-| medium | 1.5 GB | ~1.7 GB | `fd9727b6e1217c2f614f9b698455c4ffd82463b4` |
-| large | 2.9 GB | ~3.3 GB | `0f4c8e34f21cf1a914c59d8b3ce882345ad349d6` |
-
-## Limitations
-
-- Inference only
-- No GPU support (yet)
-
-## Another example
-
-Here is another example of transcribing a [3:24 min speech](https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg)
-in about half a minute on a MacBook M1 Pro, using `medium.en` model:
-
-<details>
-<summary>Expand to see the result</summary>
-
-```java
-$ ./main -m models/ggml-medium.en.bin -f samples/gb1.wav -t 8
-
-whisper_init_from_file: loading model from 'models/ggml-medium.en.bin'
-whisper_model_load: loading model
-whisper_model_load: n_vocab = 51864
-whisper_model_load: n_audio_ctx = 1500
-whisper_model_load: n_audio_state = 1024
-whisper_model_load: n_audio_head = 16
-whisper_model_load: n_audio_layer = 24
-whisper_model_load: n_text_ctx = 448
-whisper_model_load: n_text_state = 1024
-whisper_model_load: n_text_head = 16
-whisper_model_load: n_text_layer = 24
-whisper_model_load: n_mels = 80
-whisper_model_load: f16 = 1
-whisper_model_load: type = 4
-whisper_model_load: mem required = 1720.00 MB (+ 43.00 MB per decoder)
-whisper_model_load: kv self size = 42.00 MB
-whisper_model_load: kv cross size = 140.62 MB
-whisper_model_load: adding 1607 extra tokens
-whisper_model_load: model ctx = 1462.35 MB
-whisper_model_load: model size = 1462.12 MB
-
-system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
-
-main: processing 'samples/gb1.wav' (3179750 samples, 198.7 sec), 8 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
-
-
-[00:00:00.000 --> 00:00:08.000] My fellow Americans, this day has brought terrible news and great sadness to our country.
-[00:00:08.000 --> 00:00:17.000] At nine o'clock this morning, Mission Control in Houston lost contact with our Space Shuttle Columbia.
-[00:00:17.000 --> 00:00:23.000] A short time later, debris was seen falling from the skies above Texas.
-[00:00:23.000 --> 00:00:29.000] The Columbia's lost. There are no survivors.
-[00:00:29.000 --> 00:00:32.000] On board was a crew of seven.
-[00:00:32.000 --> 00:00:39.000] Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark,
-[00:00:39.000 --> 00:00:48.000] Captain David Brown, Commander William McCool, Dr. Kultna Shavla, and Ilan Ramon,
-[00:00:48.000 --> 00:00:52.000] a colonel in the Israeli Air Force.
-[00:00:52.000 --> 00:00:58.000] These men and women assumed great risk in the service to all humanity.
-[00:00:58.000 --> 00:01:03.000] In an age when space flight has come to seem almost routine,
-[00:01:03.000 --> 00:01:07.000] it is easy to overlook the dangers of travel by rocket
-[00:01:07.000 --> 00:01:12.000] and the difficulties of navigating the fierce outer atmosphere of the Earth.
-[00:01:12.000 --> 00:01:18.000] These astronauts knew the dangers, and they faced them willingly,
-[00:01:18.000 --> 00:01:23.000] knowing they had a high and noble purpose in life.
-[00:01:23.000 --> 00:01:31.000] Because of their courage and daring and idealism, we will miss them all the more.
-[00:01:31.000 --> 00:01:36.000] All Americans today are thinking as well of the families of these men and women
-[00:01:36.000 --> 00:01:40.000] who have been given this sudden shock and grief.
-[00:01:40.000 --> 00:01:45.000] You're not alone. Our entire nation grieves with you,
-[00:01:45.000 --> 00:01:52.000] and those you love will always have the respect and gratitude of this country.
-[00:01:52.000 --> 00:01:56.000] The cause in which they died will continue.
-[00:01:56.000 --> 00:02:04.000] Mankind is led into the darkness beyond our world by the inspiration of discovery
-[00:02:04.000 --> 00:02:11.000] and the longing to understand. Our journey into space will go on.
-[00:02:11.000 --> 00:02:16.000] In the skies today, we saw destruction and tragedy.
-[00:02:16.000 --> 00:02:22.000] Yet farther than we can see, there is comfort and hope.
-[00:02:22.000 --> 00:02:29.000] In the words of the prophet Isaiah, "Lift your eyes and look to the heavens
-[00:02:29.000 --> 00:02:35.000] who created all these. He who brings out the starry hosts one by one
-[00:02:35.000 --> 00:02:39.000] and calls them each by name."
-[00:02:39.000 --> 00:02:46.000] Because of His great power and mighty strength, not one of them is missing.
-[00:02:46.000 --> 00:02:55.000] The same Creator who names the stars also knows the names of the seven souls we mourn today.
-[00:02:55.000 --> 00:03:01.000] The crew of the shuttle Columbia did not return safely to earth,
-[00:03:01.000 --> 00:03:05.000] yet we can pray that all are safely home.
-[00:03:05.000 --> 00:03:13.000] May God bless the grieving families, and may God continue to bless America.
-[00:03:13.000 --> 00:03:19.000] [Silence]
-
-
-whisper_print_timings: fallbacks = 1 p / 0 h
-whisper_print_timings: load time = 569.03 ms
-whisper_print_timings: mel time = 146.85 ms
-whisper_print_timings: sample time = 238.66 ms / 553 runs ( 0.43 ms per run)
-whisper_print_timings: encode time = 18665.10 ms / 9 runs ( 2073.90 ms per run)
-whisper_print_timings: decode time = 13090.93 ms / 549 runs ( 23.85 ms per run)
-whisper_print_timings: total time = 32733.52 ms
-```
-</details>
-
-## Real-time audio input example
 
-
-The [stream](examples/stream) tool samples the audio every half a second and runs the transcription continuously.
-More info is available in [issue #10](https://github.com/ggerganov/whisper.cpp/issues/10).
+Or, you can download model files:
 
-```
-
-
+```ruby
+model_uri = Whisper::Model::URI.new("http://example.net/uri/of/your/model.bin")
+whisper = Whisper::Context.new(model_uri)
 ```
 
-
+See the [models][] page for details.
 
-
+### Preparing audio file ###
 
-
-to highlight words with high or low confidence:
+Currently, whisper.cpp accepts only 16-bit WAV files.
 
-
-
-## Controlling the length of the generated text segments (experimental)
+API
+---
 
-
+### Segments ###
 
-
-./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16
+Once `Whisper::Context#transcribe` has been called, you can retrieve segments with `#each_segment`:
 
-
-
-
+```ruby
+def format_time(time_ms)
+  sec, decimal_part = time_ms.divmod(1000)
+  min, sec = sec.divmod(60)
+  hour, min = min.divmod(60)
+  "%02d:%02d:%02d.%03d" % [hour, min, sec, decimal_part]
+end
 
-
+whisper.transcribe("path/to/audio.wav", params)
 
-
-[
-
-
-
-
-
-
-
-
+whisper.each_segment.with_index do |segment, index|
+  line = "[%{nth}: %{st} --> %{ed}] %{text}" % {
+    nth: index + 1,
+    st: format_time(segment.start_time),
+    ed: format_time(segment.end_time),
+    text: segment.text
+  }
+  line << " (speaker turned)" if segment.speaker_next_turn?
+  puts line
+end
 
-## Word-level timestamp
-
-The `--max-len` argument can be used to obtain word-level timestamps. Simply use `-ml 1`:
-
-```java
-./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
-
-whisper_model_load: loading model from './models/ggml-base.en.bin'
-...
-system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
-
-main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
-
-[00:00:00.000 --> 00:00:00.320]
-[00:00:00.320 --> 00:00:00.370] And
-[00:00:00.370 --> 00:00:00.690] so
-[00:00:00.690 --> 00:00:00.850] my
-[00:00:00.850 --> 00:00:01.590] fellow
-[00:00:01.590 --> 00:00:02.850] Americans
-[00:00:02.850 --> 00:00:03.300] ,
-[00:00:03.300 --> 00:00:04.140] ask
-[00:00:04.140 --> 00:00:04.990] not
-[00:00:04.990 --> 00:00:05.410] what
-[00:00:05.410 --> 00:00:05.660] your
-[00:00:05.660 --> 00:00:06.260] country
-[00:00:06.260 --> 00:00:06.600] can
-[00:00:06.600 --> 00:00:06.840] do
-[00:00:06.840 --> 00:00:07.010] for
-[00:00:07.010 --> 00:00:08.170] you
-[00:00:08.170 --> 00:00:08.190] ,
-[00:00:08.190 --> 00:00:08.430] ask
-[00:00:08.430 --> 00:00:08.910] what
-[00:00:08.910 --> 00:00:09.040] you
-[00:00:09.040 --> 00:00:09.320] can
-[00:00:09.320 --> 00:00:09.440] do
-[00:00:09.440 --> 00:00:09.760] for
-[00:00:09.760 --> 00:00:10.020] your
-[00:00:10.020 --> 00:00:10.510] country
-[00:00:10.510 --> 00:00:11.000] .
 ```
 
-
+You can also add a hook to the params that is called on each new segment:
 
-
-
-
+```ruby
+# Add the hook before calling #transcribe
+params.on_new_segment do |segment|
+  line = "[%{st} --> %{ed}] %{text}" % {
+    st: format_time(segment.start_time),
+    ed: format_time(segment.end_time),
+    text: segment.text
+  }
+  line << " (speaker turned)" if segment.speaker_next_turn?
+  puts line
+end
 
-
+whisper.transcribe("path/to/audio.wav", params)
 
-```java
-./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -owts
-source ./samples/jfk.wav.wts
-ffplay ./samples/jfk.wav.mp4
 ```
 
-
+### Models ###
 
-
+You can see model information:
 
-```
-
-
-ffplay ./samples/mm0.wav.mp4
-```
+```ruby
+whisper = Whisper::Context.new("base")
+model = whisper.model
 
-
+model.n_vocab # => 51864
+model.n_audio_ctx # => 1500
+model.n_audio_state # => 512
+model.n_audio_head # => 8
+model.n_audio_layer # => 6
+model.n_text_ctx # => 448
+model.n_text_state # => 512
+model.n_text_head # => 8
+model.n_text_layer # => 6
+model.n_mels # => 80
+model.ftype # => 1
+model.type # => "base"
 
----
-
-```java
-./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav -owts
-source ./samples/gb0.wav.wts
-ffplay ./samples/gb0.wav.mp4
 ```
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+### Logging ###
+
+You can set a log callback:
+
+```ruby
+prefix = "[MyApp] "
+log_callback = ->(level, buffer, user_data) {
+  case level
+  when Whisper::LOG_LEVEL_NONE
+    puts "#{user_data}none: #{buffer}"
+  when Whisper::LOG_LEVEL_INFO
+    puts "#{user_data}info: #{buffer}"
+  when Whisper::LOG_LEVEL_WARN
+    puts "#{user_data}warn: #{buffer}"
+  when Whisper::LOG_LEVEL_ERROR
+    puts "#{user_data}error: #{buffer}"
+  when Whisper::LOG_LEVEL_DEBUG
+    puts "#{user_data}debug: #{buffer}"
+  when Whisper::LOG_LEVEL_CONT
+    puts "#{user_data}same to previous: #{buffer}"
+  end
+}
+Whisper.log_set log_callback, prefix
+```
 
-
-- mel filters
-- vocabulary
-- weights
+Using this feature, you are also able to suppress logs:
 
-
-
+```ruby
+Whisper.log_set ->(level, buffer, user_data) {
+  # do nothing
+}, nil
+Whisper::Context.new("base")
+```
 
--
-- https://ggml.ggerganov.com
+### Low-level API to transcribe ###
 
-
-in [models](models).
+You can also call `Whisper::Context#full` and `#full_parallel` with a Ruby array as the samples. Although `#transcribe` with an audio file path is recommended because it extracts PCM samples in C++ and is fast, `#full` and `#full_parallel` give you more flexibility.
 
-
+```ruby
+require "whisper"
+require "wavefile"
 
-
-
-- [X] Go: [bindings/go](bindings/go) | [#312](https://github.com/ggerganov/whisper.cpp/discussions/312)
-- [X] Ruby: [bindings/ruby](bindings/ruby) | [#507](https://github.com/ggerganov/whisper.cpp/discussions/507)
-- [X] Objective-C / Swift: [ggerganov/whisper.spm](https://github.com/ggerganov/whisper.spm) | [#313](https://github.com/ggerganov/whisper.cpp/discussions/313)
-- [X] .NET: | [#422](https://github.com/ggerganov/whisper.cpp/discussions/422)
-  - [sandrohanea/whisper.net](https://github.com/sandrohanea/whisper.net)
-  - [NickDarvey/whisper](https://github.com/NickDarvey/whisper)
-- [X] Python: | [#9](https://github.com/ggerganov/whisper.cpp/issues/9)
-  - [stlukey/whispercpp.py](https://github.com/stlukey/whispercpp.py) (Cython)
+reader = WaveFile::Reader.new("path/to/audio.wav", WaveFile::Format.new(:mono, :float, 16000))
+samples = reader.enum_for(:each_buffer).map(&:samples).flatten
 
-
+whisper = Whisper::Context.new("base")
+whisper.full(Whisper::Params.new, samples)
+whisper.each_segment do |segment|
+  puts segment.text
+end
+```
 
-
-Some of the examples are even ported to run in the browser using WebAssembly. Check them out!
+The second argument `samples` may be an array, an object with a `length` method, or a MemoryView. If you can prepare the audio data as a C array and export it as a MemoryView, whispercpp accepts it and works with it with zero copy.
 
-
-
-| [main](examples/main) | [whisper.wasm](examples/whisper.wasm) | Tool for translating and transcribing audio using Whisper |
-| [bench](examples/bench) | [bench.wasm](examples/bench.wasm) | Benchmark the performance of Whisper on your machine |
-| [stream](examples/stream) | [stream.wasm](examples/stream.wasm) | Real-time transcription of raw microphone capture |
-| [command](examples/command) | [command.wasm](examples/command.wasm) | Basic voice assistant example for receiving voice commands from the mic |
-| [talk](examples/talk) | [talk.wasm](examples/talk.wasm) | Talk with a GPT-2 bot |
-| [whisper.objc](examples/whisper.objc) | | iOS mobile application using whisper.cpp |
-| [whisper.swiftui](examples/whisper.swiftui) | | SwiftUI iOS / macOS application using whisper.cpp |
-| [whisper.android](examples/whisper.android) | | Android mobile application using whisper.cpp |
-| [whisper.nvim](examples/whisper.nvim) | | Speech-to-text plugin for Neovim |
-| [generate-karaoke.sh](examples/generate-karaoke.sh) | | Helper script to easily [generate a karaoke video](https://youtu.be/uj7hVta4blM) of raw audio capture |
-| [livestream.sh](examples/livestream.sh) | | [Livestream audio transcription](https://github.com/ggerganov/whisper.cpp/issues/185) |
-| [yt-wsp.sh](examples/yt-wsp.sh) | | Download + transcribe and/or translate any VOD [(original)](https://gist.github.com/DaniruKun/96f763ec1a037cc92fe1a059b643b818) |
+License
+-------
 
-
+The same as [whisper.cpp][].
 
-
-
-to share your own projects that use `whisper.cpp`. If you have a question, make sure to check the
-[Frequently asked questions (#126)](https://github.com/ggerganov/whisper.cpp/discussions/126) discussion.
+[whisper.cpp]: https://github.com/ggerganov/whisper.cpp
+[models]: https://github.com/ggerganov/whisper.cpp/tree/master/models