pyannote-cpp-node 0.5.0 → 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +565 -106
- package/dist/Pipeline.d.ts +4 -3
- package/dist/Pipeline.d.ts.map +1 -1
- package/dist/Pipeline.js +120 -17
- package/dist/Pipeline.js.map +1 -1
- package/dist/PipelineSession.d.ts +1 -1
- package/dist/PipelineSession.d.ts.map +1 -1
- package/dist/PipelineSession.js +11 -3
- package/dist/PipelineSession.js.map +1 -1
- package/dist/binding.d.ts +16 -6
- package/dist/binding.d.ts.map +1 -1
- package/dist/binding.js +77 -20
- package/dist/binding.js.map +1 -1
- package/dist/index.d.ts +31 -5
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +35 -3
- package/dist/index.js.map +1 -1
- package/dist/types.d.ts +178 -57
- package/dist/types.d.ts.map +1 -1
- package/package.json +20 -8
- package/dist/Pyannote.d.ts +0 -12
- package/dist/Pyannote.d.ts.map +0 -1
- package/dist/Pyannote.js +0 -50
- package/dist/Pyannote.js.map +0 -1
- package/dist/StreamingSession.d.ts +0 -12
- package/dist/StreamingSession.d.ts.map +0 -1
- package/dist/StreamingSession.js +0 -34
- package/dist/StreamingSession.js.map +0 -1
package/README.md
CHANGED
````diff
@@ -1,49 +1,69 @@
 # pyannote-cpp-node
 
-
+
 
 
-Node.js native bindings for
+Node.js native bindings for whisper.cpp transcription/VAD plus the pyannote speaker diarization pipeline.
 
 ## Overview
 
-`pyannote-cpp-node`
+`pyannote-cpp-node` is now the single package for both:
 
-
+- low-level whisper.cpp APIs: `WhisperContext`, `VadContext`, `transcribe`, `transcribeAsync`, `getGpuDevices`
+- high-level pyannote pipeline APIs: `Pipeline`, `PipelineSession`
 
-
+Platform support:
+
+- `darwin-arm64`: full pipeline (CoreML + Metal acceleration)
+- `win32-x64`: full pipeline (Vulkan GPU + optional OpenVINO acceleration)
+- unsupported: `darwin-x64`, `win32-ia32`, Linux
+
+On both supported pipeline platforms, `getCapabilities().pipeline` is `true`.
+
+The integrated pipeline combines Whisper transcription and optional speaker diarization into a single API (`transcriptionOnly: true` skips diarization).
+
+Given 16 kHz mono PCM audio (`Float32Array`), it produces transcript segments shaped as below. In streaming mode, diarization emits cumulative `segments` events, while `transcriptionOnly: true` emits incremental `segments` events. `finalize()` returns all segments in both modes.
+
+- speaker label (`SPEAKER_00`, `SPEAKER_01`, ...), `"UNKNOWN"` when diarization could not assign a speaker, or empty string (`""`) when `transcriptionOnly` is `true`
 - segment start/duration in seconds
 - segment text
 
-The API supports three modes: **offline** batch processing (`transcribeOffline`), **one-shot** streaming (`transcribe`), and **incremental** streaming (`createSession` + `push`/`finalize`). All heavy operations are asynchronous and run on libuv worker threads.
+The API supports three modes: **offline** batch processing (`transcribeOffline`), **one-shot** streaming (`transcribe`), and **incremental** streaming (`createSession` + `push`/`finalize`). All three modes support transcription-only operation via `transcriptionOnly: true`. All heavy operations are asynchronous and run on libuv worker threads.
 
 ## Features
 
+- Low-level whisper.cpp transcription API compatible with prior `whisper-cpp-node` usage
+- Built-in Silero VAD via `VadContext`
+- GPU device enumeration via `getGpuDevices()`
 - Integrated transcription + diarization in one pipeline
 - Speaker-labeled transcript segments with sentence-level text
 - **Offline mode**: runs Whisper on the full audio at once + offline diarization (fastest for batch)
 - **One-shot mode**: streaming pipeline with automatic chunking
 - **Streaming mode**: incremental push/finalize with real-time `segments` events and `audio` chunk streaming
+- **Transcription-only mode**: skip speaker diarization entirely; only segmentation, VAD, and Whisper models required
 - Deterministic output for the same audio/models/config
 - CoreML-accelerated inference on macOS
 - **Shared model cache**: all models loaded once during `Pipeline.load()`, reused across offline/streaming/session modes
-- **Runtime backend switching**: switch
+- **Runtime backend switching**: switch inference backends at runtime on macOS and Windows
 - **Progress reporting**: optional `onProgress` callback for `transcribeOffline` reports Whisper, diarization, and alignment phases
 - **Real-time segment streaming**: optional `onSegment` callback for `transcribeOffline` delivers each Whisper segment (start, end, text) as it's produced — enables live transcript preview and time-based loading bars
 - TypeScript-first API with complete type definitions
 
 ## Requirements
 
-- macOS
+- macOS Apple Silicon or Windows x64
 - Node.js >= 18
 - Model files:
-  - Segmentation GGUF (`segModelPath`)
-  - Embedding GGUF (`embModelPath`)
-  - PLDA GGUF (`pldaPath`)
-  -
-  - Segmentation CoreML `.mlpackage` (`segCoremlPath`)
-  - Whisper GGUF (`whisperModelPath`)
+  - Segmentation GGUF (`segModelPath`) — required on all platforms
+  - Embedding GGUF (`embModelPath`) — required unless `transcriptionOnly` is `true`
+  - PLDA GGUF (`pldaPath`) — required unless `transcriptionOnly` is `true`
+  - Whisper GGUF (`whisperModelPath`) — required on all platforms
   - Optional Silero VAD model (`vadModelPath`)
+- Required backend config (`backend`) with one of: `metal`, `vulkan`, `coreml`, or `openvino-hybrid`
+  - Accelerator-specific paths now live inside `backend`
+  - `backend: { type: 'coreml', segPath, embPath? }` uses CoreML `.mlpackage` assets on macOS
+  - `backend: { type: 'openvino-hybrid', whisperEncoderPath, embPath? }` uses OpenVINO IR `.xml` assets on Windows
+  - `backend: { type: 'metal' }` on macOS and `backend: { type: 'vulkan' }` on Windows do not need extra accelerator paths
 
 ## Installation
 
````
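The platform-support matrix above can be turned into a small default-backend chooser. This is an illustrative sketch, not a package export; `pickDefaultBackend` and `SimpleBackend` are hypothetical names, and it only covers the two backends that need no extra accelerator paths.

```typescript
// Hypothetical helper: map (platform, arch) to the zero-asset backend
// from the support matrix: darwin-arm64 -> metal, win32-x64 -> vulkan.
type SimpleBackend = { type: 'metal' } | { type: 'vulkan' };

function pickDefaultBackend(platform: string, arch: string): SimpleBackend {
  // darwin-arm64: Metal works without extra accelerator assets
  if (platform === 'darwin' && arch === 'arm64') return { type: 'metal' };
  // win32-x64: Vulkan works without extra accelerator assets
  if (platform === 'win32' && arch === 'x64') return { type: 'vulkan' };
  // darwin-x64, win32-ia32, and Linux are unsupported per the README
  throw new Error(`unsupported platform: ${platform}-${arch}`);
}
```

The result could be passed as the `backend` field of `Pipeline.load()` (with `process.platform` / `process.arch` as inputs); CoreML and OpenVINO variants need their asset paths and so cannot be defaulted this way.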
````diff
@@ -57,8 +77,45 @@ pnpm add pyannote-cpp-node
 
 The package installs a platform-specific native addon through `optionalDependencies`.
 
+## Low-Level Quick Start
+
+```typescript
+import {
+  WhisperContext,
+  createVadContext,
+  getCapabilities,
+  transcribeAsync,
+} from 'pyannote-cpp-node';
+
+const capabilities = getCapabilities();
+console.log(capabilities);
+
+const ctx = new WhisperContext({
+  model: './models/ggml-base.en.bin',
+  use_gpu: true,
+  no_prints: true,
+});
+
+const result = await transcribeAsync(ctx, {
+  fname_inp: './audio.wav',
+  language: 'en',
+});
+
+const vad = createVadContext({
+  model: './models/ggml-silero-v6.2.0.bin',
+});
+
+console.log(result.segments);
+console.log(vad.getWindowSamples());
+
+vad.free();
+ctx.free();
+```
+
 ## Quick Start
 
+### macOS (Apple Silicon)
+
 ```typescript
 import { Pipeline } from 'pyannote-cpp-node';
 
````
````diff
@@ -66,15 +123,43 @@ const pipeline = await Pipeline.load({
   segModelPath: './models/segmentation.gguf',
   embModelPath: './models/embedding.gguf',
   pldaPath: './models/plda.gguf',
-  coremlPath: './models/embedding.mlpackage',
-  segCoremlPath: './models/segmentation.mlpackage',
   whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+  backend: {
+    type: 'coreml',
+    segPath: './models/segmentation.mlpackage',
+    embPath: './models/embedding.mlpackage',
+  },
   language: 'en',
 });
 
 const audio = loadAudioAsFloat32Array('./audio-16khz-mono.wav');
+const result = await pipeline.transcribeOffline(audio);
+
+for (const segment of result.segments) {
+  const end = segment.start + segment.duration;
+  console.log(
+    `[${segment.speaker}] ${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`
+  );
+}
+
+pipeline.close();
+```
+
+### Windows (x64)
+
+```typescript
+import { Pipeline } from 'pyannote-cpp-node';
+
+const pipeline = await Pipeline.load({
+  segModelPath: './models/segmentation.gguf',
+  embModelPath: './models/embedding.gguf',
+  pldaPath: './models/plda.gguf',
+  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+  language: 'en',
+  backend: { type: 'vulkan' },
+});
 
-
+const audio = loadAudioAsFloat32Array('./audio-16khz-mono.wav');
 const result = await pipeline.transcribeOffline(audio);
 
 for (const segment of result.segments) {
````
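The quick-start snippets call a `loadAudioAsFloat32Array` helper that the README never defines. A minimal sketch is below, assuming canonical 16-bit PCM WAV input; it walks the RIFF chunk list (headers are not always exactly 44 bytes) and does no resampling, since the pipeline expects 16 kHz mono audio already.

```typescript
import { readFileSync } from 'node:fs';

// Decode a 16-bit PCM WAV buffer into Float32Array samples in [-1, 1).
// Illustrative only; real projects may prefer a dedicated WAV library.
function decodePcm16Wav(buf: Buffer): Float32Array {
  if (buf.toString('ascii', 0, 4) !== 'RIFF' || buf.toString('ascii', 8, 12) !== 'WAVE') {
    throw new Error('not a RIFF/WAVE file');
  }
  let off = 12;
  while (off + 8 <= buf.length) {
    const id = buf.toString('ascii', off, off + 4);
    const size = buf.readUInt32LE(off + 4);
    if (id === 'data') {
      const out = new Float32Array(size / 2);
      for (let i = 0; i < out.length; i++) {
        out[i] = buf.readInt16LE(off + 8 + i * 2) / 32768; // scale to [-1, 1)
      }
      return out;
    }
    off += 8 + size + (size % 2); // chunks are word-aligned
  }
  throw new Error('no data chunk');
}

function loadAudioAsFloat32Array(path: string): Float32Array {
  return decodePcm16Wav(readFileSync(path));
}
```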
````diff
@@ -87,31 +172,86 @@ for (const segment of result.segments) {
 pipeline.close();
 ```
 
+To use the Windows OpenVINO hybrid path instead, pass the OpenVINO assets through `backend`:
+
+```typescript
+const pipeline = await Pipeline.load({
+  segModelPath: './models/segmentation.gguf',
+  embModelPath: './models/embedding.gguf',
+  pldaPath: './models/plda.gguf',
+  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+  backend: {
+    type: 'openvino-hybrid',
+    whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
+    embPath: './models/embedding-openvino.xml',
+  },
+});
+```
+
+### Transcription-only mode
+
+```typescript
+const macPipeline = await Pipeline.load({
+  segModelPath: './models/segmentation.gguf',
+  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+  language: 'en',
+  transcriptionOnly: true,
+  backend: {
+    type: 'coreml',
+    segPath: './models/segmentation.mlpackage',
+  },
+});
+
+const windowsPipeline = await Pipeline.load({
+  segModelPath: './models/segmentation.gguf',
+  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+  language: 'en',
+  transcriptionOnly: true,
+  backend: { type: 'vulkan' },
+});
+
+const result = await macPipeline.transcribe(audio);
+
+for (const segment of result.segments) {
+  const end = segment.start + segment.duration;
+  // No speaker label - segment.speaker is empty string
+  console.log(`${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`);
+}
+
+macPipeline.close();
+windowsPipeline.close();
+```
+
 ## API Reference
 
 ### `Pipeline`
 
 ```typescript
 class Pipeline {
-  static async load(config:
+  static async load(config: PipelineConfig): Promise<Pipeline>;
   async transcribeOffline(audio: Float32Array, onProgress?: (phase: number, progress: number) => void, onSegment?: (start: number, end: number, text: string) => void): Promise<TranscriptionResult>;
   async transcribe(audio: Float32Array): Promise<TranscriptionResult>;
   setLanguage(language: string): void;
   setDecodeOptions(options: DecodeOptions): void;
   createSession(): PipelineSession;
-  async
+  async setExecutionBackend(options: BackendConfig): Promise<void>;
   close(): void;
   get isClosed(): boolean;
 }
 ```
 
-#### `static async load(config:
+#### `static async load(config: PipelineConfig): Promise<Pipeline>`
 
-Validates model paths and loads all models
+Validates model paths and loads all models into a shared cache on a background thread. The accelerator assets are selected by `config.backend`, which is required and has no default. On macOS, `backend: { type: 'coreml', ... }` loads CoreML segmentation and embedding assets, while `backend: { type: 'metal' }` uses Metal. On Windows x64, `backend: { type: 'vulkan' }` loads the Vulkan path, and `backend: { type: 'openvino-hybrid', ... }` also loads OpenVINO IR models for the Whisper encoder and embedding model. When `transcriptionOnly` is `true`, embedding, PLDA, and embedding-specific backend assets are not loaded. Models are loaded once and reused across all subsequent `transcribe()`, `transcribeOffline()`, and `createSession()` calls. Models are freed only when `close()` is called.
+
+- `backend` is required in every `Pipeline.load()` call
+- `coreml` requires `segPath`, and `embPath` unless `transcriptionOnly` is `true`
+- `openvino-hybrid` requires `whisperEncoderPath`, and `embPath` unless `transcriptionOnly` is `true`
+- `metal` and `vulkan` do not require extra accelerator model paths
 
 #### `async transcribeOffline(audio: Float32Array, onProgress?, onSegment?): Promise<TranscriptionResult>`
 
-Runs Whisper on the **entire** audio buffer in a single `whisper_full()` call, then runs offline diarization and WhisperX-style speaker alignment. This is the fastest mode for batch processing — no streaming infrastructure is involved.
+Runs Whisper on the **entire** audio buffer in a single `whisper_full()` call, then runs offline diarization and WhisperX-style speaker alignment. In transcription-only mode, diarization and speaker alignment are skipped, and segments have an empty `speaker` field. This is the fastest mode for batch processing — no streaming infrastructure is involved.
 
 The optional `onProgress` callback receives `(phase, progress)` updates:
 
````
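The `backend` requirements that `Pipeline.load()` enforces (per the 1.0.0 README) can be modeled as a small validator. This is an illustrative sketch, not a package export; `validateBackendConfig` and `BackendLike` are hypothetical names.

```typescript
// Hypothetical helper mirroring the documented load() rules:
// coreml needs segPath (+ embPath unless transcriptionOnly),
// openvino-hybrid needs whisperEncoderPath (+ embPath unless transcriptionOnly),
// metal and vulkan need no extra accelerator paths.
type BackendLike = {
  type: 'metal' | 'vulkan' | 'coreml' | 'openvino-hybrid';
  segPath?: string;
  embPath?: string;
  whisperEncoderPath?: string;
};

function validateBackendConfig(backend: BackendLike, transcriptionOnly = false): string[] {
  const errors: string[] = [];
  if (backend.type === 'coreml') {
    if (!backend.segPath) errors.push('coreml requires segPath');
    if (!transcriptionOnly && !backend.embPath) {
      errors.push('coreml requires embPath unless transcriptionOnly is true');
    }
  }
  if (backend.type === 'openvino-hybrid') {
    if (!backend.whisperEncoderPath) errors.push('openvino-hybrid requires whisperEncoderPath');
    if (!transcriptionOnly && !backend.embPath) {
      errors.push('openvino-hybrid requires embPath unless transcriptionOnly is true');
    }
  }
  return errors; // empty array means the config is acceptable
}
```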
````diff
@@ -153,7 +293,7 @@ const result = await pipeline.transcribeOffline(
 
 #### `async transcribe(audio: Float32Array): Promise<TranscriptionResult>`
 
-Runs one-shot transcription + diarization using the streaming pipeline internally (pushes 1-second chunks then finalizes).
+Runs one-shot transcription (+ diarization unless `transcriptionOnly` is set) using the streaming pipeline internally (pushes 1-second chunks then finalizes).
 
 #### `setLanguage(language: string): void`
 
````
````diff
@@ -164,28 +304,62 @@ Updates the Whisper decode language for subsequent `transcribe()` calls. This is
 Updates one or more Whisper decode options for subsequent `transcribe()` calls. Only the fields you pass are changed; others retain their current values. See `DecodeOptions` for available fields.
 
 
-#### `async
+#### `async setExecutionBackend(options: BackendConfig): Promise<void>`
+
+Switches the inference backend at runtime. Tears down and reloads the entire model cache with the new backend configuration. The promise resolves when the new models are ready.
 
-
+- **macOS**: supports `metal` and `coreml`
+- **Windows**: supports `vulkan` and `openvino-hybrid`
 
-
-- Throws if the pipeline is closed, busy, or models are not loaded.
-- After switching, all subsequent `transcribe()`, `transcribeOffline()`, and streaming session calls use the new backend.
+Pass one of these `BackendConfig` variants:
 
 ```typescript
-
-
-
-
+type BackendConfig =
+  | { type: 'metal'; gpuDevice?: number; flashAttn?: boolean }
+  | { type: 'vulkan'; gpuDevice?: number; flashAttn?: boolean }
+  | {
+      type: 'coreml';
+      gpuDevice?: number;
+      flashAttn?: boolean;
+      segPath: string;
+      embPath?: string;
+      whisperEncoderPath?: string;
+    }
+  | {
+      type: 'openvino-hybrid';
+      gpuDevice?: number;
+      flashAttn?: boolean;
+      whisperEncoderPath: string;
+      embPath?: string;
+      openvinoDevice?: string;
+      openvinoCacheDir?: string;
+    };
+```
+
+> **Warning**: This is a heavy operation (~5-6s on Intel iGPU). It fully tears down and rebuilds the model cache. Treat it as a one-time configuration change, not something to call in a loop. See [Warnings and Known Issues](#warnings-and-known-issues) for Intel iGPU limitations.
+
+```typescript
+// macOS: switch to Metal
+await pipeline.setExecutionBackend({ type: 'metal' });
+
+// macOS: switch to CoreML
+await pipeline.setExecutionBackend({
+  type: 'coreml',
+  segPath: './models/segmentation.mlpackage',
+  embPath: './models/embedding.mlpackage',
 });
 
-//
-await pipeline.
-
+// Windows: switch to OpenVINO hybrid
+await pipeline.setExecutionBackend({
+  type: 'openvino-hybrid',
+  whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
+  embPath: './models/embedding-openvino.xml',
+});
 
-//
-await pipeline.
+// Windows: switch back to Vulkan
+await pipeline.setExecutionBackend({ type: 'vulkan' });
 ```
+
 #### `createSession(): PipelineSession`
 
 Creates an independent streaming session for incremental processing.
````
````diff
@@ -242,11 +416,13 @@ Updates one or more Whisper decode options on the live streaming session. Takes
 
 #### `async finalize(): Promise<TranscriptionResult>`
 
-Flushes all stages, runs final recluster + alignment, and returns the definitive result.
+Flushes all stages, runs final recluster + alignment, and returns the definitive result. `finalize()` always returns all accumulated segments regardless of mode. In diarization mode this is the final re-aligned output, and in transcription-only mode this is the union of all incremental `segments` emissions.
 
 ```typescript
 type TranscriptionResult = {
   segments: AlignedSegment[];
+  /** Silence-filtered audio when VAD model is loaded. Timestamps align to this audio. */
+  filteredAudio?: Float32Array;
 };
 ```
 
````
````diff
@@ -260,11 +436,30 @@ Returns `true` after `close()`.
 
 #### Event: `'segments'`
 
-Emitted after each Whisper transcription result
+Emitted after each Whisper transcription result. Behavior depends on mode:
+
+- With diarization (default): each emission contains all segments re-aligned against the latest speaker clustering. Earlier segments may get updated speaker labels as more data arrives. The final emission after `finalize()` is the definitive output.
+- With `transcriptionOnly: true`: each emission contains only the new segments from the latest Whisper result. Earlier segments never change, so incremental delivery is safe. Accumulate across emissions to build the full transcript.
 
 ```typescript
+// With diarization (default): cumulative, re-aligned output
 session.on('segments', (segments: AlignedSegment[]) => {
-  // `segments` contains the latest
+  // `segments` contains the latest full speaker-labeled transcript so far
+  const latest = segments[segments.length - 1];
+  if (latest) {
+    const end = latest.start + latest.duration;
+    console.log(`[${latest.speaker}] ${latest.start.toFixed(2)}-${end.toFixed(2)} ${latest.text.trim()}`);
+  }
+});
+
+// With transcriptionOnly: incremental output, accumulate manually
+const allSegments: AlignedSegment[] = [];
+session.on('segments', (newSegments: AlignedSegment[]) => {
+  allSegments.push(...newSegments);
+  for (const seg of newSegments) {
+    const end = seg.start + seg.duration;
+    console.log(`${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
+  }
 });
 ```
 
````
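The two event semantics (cumulative vs. incremental) can be absorbed by one small state-update function, so a single listener works in both modes. Illustrative sketch; `applySegmentsEvent` and `Seg` are hypothetical names, not package exports.

```typescript
// Minimal AlignedSegment shape from the README types.
type Seg = { speaker: string; start: number; duration: number; text: string };

// Diarization mode: each emission is the full re-aligned transcript, so replace.
// Transcription-only mode: each emission holds only new segments, so append.
function applySegmentsEvent(current: Seg[], emitted: Seg[], transcriptionOnly: boolean): Seg[] {
  return transcriptionOnly ? [...current, ...emitted] : [...emitted];
}
```

A listener would then do `state = applySegmentsEvent(state, segments, transcriptionOnly)` on every `'segments'` event, leaving `state` as the best transcript so far in either mode.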
````diff
@@ -281,46 +476,32 @@ session.on('audio', (chunk: Float32Array) => {
 ### Types
 
 ```typescript
-export interface
+export interface PipelineConfig {
   // === Required Model Paths ===
   /** Path to segmentation GGUF model */
   segModelPath: string;
 
-  /** Path to embedding GGUF model */
-  embModelPath: string;
-
-  /** Path to PLDA GGUF model */
-  pldaPath: string;
-
-  /** Path to embedding CoreML .mlpackage directory */
-  coremlPath: string;
-
-  /** Path to segmentation CoreML .mlpackage directory */
-  segCoremlPath: string;
-
   /** Path to Whisper GGUF model */
   whisperModelPath: string;
 
+  /** Path to embedding GGUF model (required unless transcriptionOnly is true) */
+  embModelPath?: string;
+
+  /** Path to PLDA GGUF model (required unless transcriptionOnly is true) */
+  pldaPath?: string;
+
   // === Optional Model Paths ===
   /** Path to Silero VAD model (optional, enables silence compression) */
   vadModelPath?: string;
 
-  // === Whisper Context Options (model loading) ===
-  /** Enable GPU acceleration (default: true) */
-  useGpu?: boolean;
-
-  /** Enable Flash Attention (default: true) */
-  flashAttn?: boolean;
-
-  /** GPU device index (default: 0) */
-  gpuDevice?: number;
-
   /**
-   *
-   *
-   * e.g., ggml-base.en.bin -> ggml-base.en-encoder.mlmodelc/
+   * Transcription-only mode - skip speaker diarization (default: false).
+   * When true, embModelPath, pldaPath, and backend embedding assets are not required.
    */
-
+  transcriptionOnly?: boolean;
+
+  /** Required execution backend configuration */
+  backend: BackendConfig;
 
   /** Suppress whisper.cpp log output (default: false) */
   noPrints?: boolean;
````
````diff
@@ -378,6 +559,54 @@ export interface ModelConfig {
   suppressNst?: boolean;
 }
 
+export type BackendConfig =
+  | {
+      /** Metal backend on macOS */
+      type: 'metal';
+      /** GPU device index */
+      gpuDevice?: number;
+      /** Enable Flash Attention */
+      flashAttn?: boolean;
+    }
+  | {
+      /** Vulkan backend on Windows */
+      type: 'vulkan';
+      /** GPU device index */
+      gpuDevice?: number;
+      /** Enable Flash Attention */
+      flashAttn?: boolean;
+    }
+  | {
+      /** CoreML backend on macOS */
+      type: 'coreml';
+      /** GPU device index */
+      gpuDevice?: number;
+      /** Enable Flash Attention */
+      flashAttn?: boolean;
+      /** Path to segmentation CoreML .mlpackage directory */
+      segPath: string;
+      /** Path to embedding CoreML .mlpackage directory (required unless transcriptionOnly is true) */
+      embPath?: string;
+      /** Optional path to Whisper encoder CoreML .mlmodelc directory */
+      whisperEncoderPath?: string;
+    }
+  | {
+      /** OpenVINO hybrid backend on Windows */
+      type: 'openvino-hybrid';
+      /** GPU device index */
+      gpuDevice?: number;
+      /** Enable Flash Attention */
+      flashAttn?: boolean;
+      /** Path to Whisper encoder OpenVINO IR (.xml) */
+      whisperEncoderPath: string;
+      /** Path to embedding OpenVINO IR (.xml) (required unless transcriptionOnly is true) */
+      embPath?: string;
+      /** OpenVINO device target (default: 'GPU') */
+      openvinoDevice?: string;
+      /** OpenVINO model cache directory */
+      openvinoCacheDir?: string;
+    };
+
 export interface DecodeOptions {
   /** Language code (e.g., 'en', 'zh'). Omit for auto-detect. */
   language?: string;
````
````diff
@@ -414,7 +643,7 @@ export interface DecodeOptions {
 }
 
 export interface AlignedSegment {
-  /** Global speaker label (e.g., SPEAKER_00). */
+  /** Global speaker label (e.g., SPEAKER_00). "UNKNOWN" when diarization could not assign a speaker. Empty string when transcriptionOnly is true. */
   speaker: string;
 
   /** Segment start time in seconds. */
````
````diff
@@ -428,8 +657,16 @@ export interface AlignedSegment {
 }
 
 export interface TranscriptionResult {
-  /**
+  /** Transcript segments. Speaker-labeled when diarization is enabled; speaker is empty string in transcription-only mode. */
   segments: AlignedSegment[];
+  /**
+   * Silence-filtered audio (16 kHz mono Float32Array).
+   * Present when a VAD model is loaded (`vadModelPath` in config).
+   * Silence longer than 2 seconds is compressed to 2 seconds.
+   * All segment timestamps are aligned to this audio —
+   * save it directly and timestamps will sync correctly.
+   */
+  filteredAudio?: Float32Array;
 }
 ```
 
````
````diff
@@ -445,9 +682,12 @@ async function runOffline(audio: Float32Array) {
   segModelPath: './models/segmentation.gguf',
   embModelPath: './models/embedding.gguf',
   pldaPath: './models/plda.gguf',
-  coremlPath: './models/embedding.mlpackage',
-  segCoremlPath: './models/segmentation.mlpackage',
   whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+  backend: {
+    type: 'coreml',
+    segPath: './models/segmentation.mlpackage',
+    embPath: './models/embedding.mlpackage',
+  },
 });
 
 // Runs Whisper on full audio at once + offline diarization
````
````diff
@@ -462,6 +702,46 @@ async function runOffline(audio: Float32Array) {
 }
 ```
 
+### Offline transcription with silence filtering
+
+When a VAD model is provided, `transcribeOffline` automatically compresses silence longer than 2 seconds down to 2 seconds before running Whisper and diarization. The filtered audio is returned alongside segments so you can save it with correctly aligned timestamps.
+
+```typescript
+import { Pipeline } from 'pyannote-cpp-node';
+import { writeFileSync } from 'node:fs';
+
+async function runOfflineWithVAD(audio: Float32Array) {
+  const pipeline = await Pipeline.load({
+    segModelPath: './models/segmentation.gguf',
+    embModelPath: './models/embedding.gguf',
+    pldaPath: './models/plda.gguf',
+    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+    backend: {
+      type: 'coreml',
+      segPath: './models/segmentation.mlpackage',
+      embPath: './models/embedding.mlpackage',
+    },
+    vadModelPath: './models/ggml-silero-v6.2.0.bin', // enables silence filtering
+  });
+
+  const result = await pipeline.transcribeOffline(audio);
+
+  // Save the silence-filtered audio — timestamps in result.segments align to this
+  if (result.filteredAudio) {
+    // filteredAudio is 16 kHz mono Float32Array with silence compressed
+    writeFileSync('./output-filtered.pcm', Buffer.from(result.filteredAudio.buffer));
+    console.log(`Filtered: ${audio.length} -> ${result.filteredAudio.length} samples`);
+  }
+
+  for (const seg of result.segments) {
+    const end = seg.start + seg.duration;
+    console.log(`[${seg.speaker}] ${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
+  }
+
+  pipeline.close();
+}
+```
+
 ### Offline transcription with progress and live transcript preview
 
 ```typescript
````
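The "silence longer than 2 seconds is compressed to 2 seconds" behavior can be modeled with a pure function over VAD speech intervals. This is only an illustrative sketch of the idea, not the package's internal implementation; `compressSilence` is a hypothetical name, and the speech intervals are assumed to come from a VAD as `[start, end)` sample ranges.

```typescript
// Copy audio through, but cap every silent run between speech intervals
// (and the trailing silence) at maxSilence samples.
function compressSilence(
  audio: Float32Array,
  speech: Array<[number, number]>, // [start, end) sample intervals flagged as speech
  maxSilence = 2 * 16000           // 2 s at 16 kHz
): Float32Array {
  const parts: Float32Array[] = [];
  let cursor = 0;
  for (const [s, e] of speech) {
    const gap = Math.min(s - cursor, maxSilence); // keep at most maxSilence of the gap
    parts.push(audio.subarray(s - gap, e));
    cursor = e;
  }
  const tail = Math.min(audio.length - cursor, maxSilence); // trailing silence, capped
  parts.push(audio.subarray(cursor, cursor + tail));
  const out = new Float32Array(parts.reduce((n, p) => n + p.length, 0));
  let off = 0;
  for (const p of parts) { out.set(p, off); off += p.length; }
  return out;
}
```

As the README notes, segment timestamps refer to the compressed timeline, which is why the pipeline returns `filteredAudio` for saving rather than expecting you to re-derive it.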
````diff
@@ -472,9 +752,12 @@ async function runOfflineWithCallbacks(audio: Float32Array) {
   segModelPath: './models/segmentation.gguf',
   embModelPath: './models/embedding.gguf',
   pldaPath: './models/plda.gguf',
-  coremlPath: './models/embedding.mlpackage',
-  segCoremlPath: './models/segmentation.mlpackage',
   whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+  backend: {
+    type: 'coreml',
+    segPath: './models/segmentation.mlpackage',
+    embPath: './models/embedding.mlpackage',
+  },
 });
 
 const result = await pipeline.transcribeOffline(
````
````diff
@@ -506,9 +789,12 @@ async function runOneShot(audio: Float32Array) {
   segModelPath: './models/segmentation.gguf',
   embModelPath: './models/embedding.gguf',
   pldaPath: './models/plda.gguf',
-  coremlPath: './models/embedding.mlpackage',
-  segCoremlPath: './models/segmentation.mlpackage',
   whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+  backend: {
+    type: 'coreml',
+    segPath: './models/segmentation.mlpackage',
+    embPath: './models/embedding.mlpackage',
+  },
 });
 
 // Uses streaming pipeline internally (push 1s chunks + finalize)
````
@@ -533,12 +819,16 @@ async function runStreaming(audio: Float32Array) {
|
|
|
533
819
|
segModelPath: './models/segmentation.gguf',
|
|
534
820
|
embModelPath: './models/embedding.gguf',
|
|
535
821
|
pldaPath: './models/plda.gguf',
|
|
536
|
-
coremlPath: './models/embedding.mlpackage',
|
|
537
|
-
segCoremlPath: './models/segmentation.mlpackage',
|
|
538
822
|
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
|
|
823
|
+
backend: {
|
|
824
|
+
type: 'coreml',
|
|
825
|
+
segPath: './models/segmentation.mlpackage',
|
|
826
|
+
embPath: './models/embedding.mlpackage',
|
|
827
|
+
},
|
|
539
828
|
});
|
|
540
829
|
|
|
541
830
|
const session = pipeline.createSession();
|
|
831
|
+
// Diarization mode (default): each event is cumulative and may relabel earlier segments
|
|
542
832
|
session.on('segments', (segments) => {
|
|
543
833
|
const latest = segments[segments.length - 1];
|
|
544
834
|
if (latest) {
|
|
@@ -569,6 +859,47 @@ async function runStreaming(audio: Float32Array) {
|
|
|
569
859
|
}
|
|
570
860
|
```
|
|
571
861
|
|
|
862
|
+
```typescript
|
|
863
|
+
import { Pipeline, type AlignedSegment } from 'pyannote-cpp-node';
|
|
864
|
+
|
|
865
|
+
async function runStreamingTranscriptionOnly(audio: Float32Array) {
|
|
866
|
+
const pipeline = await Pipeline.load({
|
|
867
|
+
segModelPath: './models/segmentation.gguf',
|
|
868
|
+
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
|
|
869
|
+
transcriptionOnly: true,
|
|
870
|
+
backend: {
|
|
871
|
+
type: 'coreml',
|
|
872
|
+
segPath: './models/segmentation.mlpackage',
|
|
873
|
+
},
|
|
874
|
+
});
|
|
875
|
+
|
|
876
|
+
const session = pipeline.createSession();
|
|
877
|
+
|
|
878
|
+
// Transcription-only: each event has only NEW segments
|
|
879
|
+
const allSegments: AlignedSegment[] = [];
|
|
880
|
+
session.on('segments', (newSegments) => {
|
|
881
|
+
allSegments.push(...newSegments);
|
|
882
|
+
for (const seg of newSegments) {
|
|
883
|
+
const end = seg.start + seg.duration;
|
|
884
|
+
console.log(`${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
|
|
885
|
+
}
|
|
886
|
+
});
|
|
887
|
+
|
|
888
|
+
const chunkSize = 16000;
|
|
889
|
+
for (let i = 0; i < audio.length; i += chunkSize) {
|
|
890
|
+
const chunk = audio.slice(i, Math.min(i + chunkSize, audio.length));
|
|
891
|
+
await session.push(chunk);
|
|
892
|
+
}
|
|
893
|
+
|
|
894
|
+
const finalResult = await session.finalize();
|
|
895
|
+
console.log(`Final segments from finalize(): ${finalResult.segments.length}`);
|
|
896
|
+
console.log(`Accumulated from incremental events: ${allSegments.length}`);
|
|
897
|
+
|
|
898
|
+
session.close();
|
|
899
|
+
pipeline.close();
|
|
900
|
+
}
|
|
901
|
+
```
|
|
902
|
+
|
|
572
903
|
### Custom Whisper decode options
|
|
573
904
|
|
|
574
905
|
```typescript
|
|
@@ -578,15 +909,14 @@ const pipeline = await Pipeline.load({
|
|
|
578
909
|
segModelPath: './models/segmentation.gguf',
|
|
579
910
|
embModelPath: './models/embedding.gguf',
|
|
580
911
|
pldaPath: './models/plda.gguf',
|
|
581
|
-
coremlPath: './models/embedding.mlpackage',
|
|
582
|
-
segCoremlPath: './models/segmentation.mlpackage',
|
|
583
912
|
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
|
|
584
|
-
|
|
585
|
-
|
|
586
|
-
|
|
587
|
-
|
|
588
|
-
|
|
589
|
-
|
|
913
|
+
backend: {
|
|
914
|
+
type: 'coreml',
|
|
915
|
+
segPath: './models/segmentation.mlpackage',
|
|
916
|
+
embPath: './models/embedding.mlpackage',
|
|
917
|
+
flashAttn: true,
|
|
918
|
+
gpuDevice: 0,
|
|
919
|
+
},
|
|
590
920
|
|
|
591
921
|
// Decode strategy
|
|
592
922
|
nThreads: 8,
|
|
@@ -619,9 +949,12 @@ const pipeline = await Pipeline.load({
|
|
|
619
949
|
segModelPath: './models/segmentation.gguf',
|
|
620
950
|
embModelPath: './models/embedding.gguf',
|
|
621
951
|
pldaPath: './models/plda.gguf',
|
|
622
|
-
coremlPath: './models/embedding.mlpackage',
|
|
623
|
-
segCoremlPath: './models/segmentation.mlpackage',
|
|
624
952
|
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
|
|
953
|
+
backend: {
|
|
954
|
+
type: 'coreml',
|
|
955
|
+
segPath: './models/segmentation.mlpackage',
|
|
956
|
+
embPath: './models/embedding.mlpackage',
|
|
957
|
+
},
|
|
625
958
|
language: 'en',
|
|
626
959
|
});
|
|
627
960
|
|
|
@@ -643,34 +976,71 @@ const result3 = await pipeline.transcribe(chineseAudio);
|
|
|
643
976
|
pipeline.close();
|
|
644
977
|
```
|
|
645
978
|
|
|
646
|
-
### Switching
|
|
979
|
+
### Switching execution backend at runtime (macOS)
|
|
647
980
|
|
|
648
981
|
```typescript
|
|
649
982
|
import { Pipeline } from 'pyannote-cpp-node';
|
|
650
983
|
|
|
651
|
-
// Start with
|
|
984
|
+
// Start with Metal
|
|
652
985
|
const pipeline = await Pipeline.load({
|
|
653
986
|
segModelPath: './models/segmentation.gguf',
|
|
654
987
|
embModelPath: './models/embedding.gguf',
|
|
655
988
|
pldaPath: './models/plda.gguf',
|
|
656
|
-
coremlPath: './models/embedding.mlpackage',
|
|
657
|
-
segCoremlPath: './models/segmentation.mlpackage',
|
|
658
989
|
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
|
|
659
|
-
|
|
990
|
+
backend: { type: 'metal' },
|
|
660
991
|
});
|
|
661
992
|
|
|
662
|
-
// Switch to CoreML
|
|
663
|
-
|
|
664
|
-
|
|
993
|
+
// Switch to CoreML
|
|
994
|
+
await pipeline.setExecutionBackend({
|
|
995
|
+
type: 'coreml',
|
|
996
|
+
segPath: './models/segmentation.mlpackage',
|
|
997
|
+
embPath: './models/embedding.mlpackage',
|
|
998
|
+
whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
|
|
999
|
+
});
|
|
665
1000
|
const result1 = await pipeline.transcribeOffline(audio);
|
|
666
1001
|
|
|
667
|
-
// Switch back to
|
|
668
|
-
await pipeline.
|
|
1002
|
+
// Switch back to Metal
|
|
1003
|
+
await pipeline.setExecutionBackend({ type: 'metal' });
|
|
669
1004
|
const result2 = await pipeline.transcribeOffline(audio);
|
|
670
1005
|
|
|
671
1006
|
pipeline.close();
|
|
672
1007
|
```
|
|
673
1008
|
|
|
1009
|
+
## Execution Backends
|
|
1010
|
+
|
|
1011
|
+
`setExecutionBackend(options)` switches the inference backend at runtime.
|
|
1012
|
+
|
|
1013
|
+
- On macOS: supports `metal` and `coreml`
|
|
1014
|
+
- On Windows: supports `vulkan` and `openvino-hybrid`
|
|
1015
|
+
- `openvino-hybrid` uses OpenVINO for the Whisper encoder and embedding model, and Vulkan for everything else
|
|
1016
|
+
- `gpuDevice` and `flashAttn` are configured inside the backend object
|
|
1017
|
+
|
|
1018
|
+
```typescript
|
|
1019
|
+
const pipeline = await Pipeline.load({
|
|
1020
|
+
segModelPath: './models/segmentation.gguf',
|
|
1021
|
+
embModelPath: './models/embedding.gguf',
|
|
1022
|
+
pldaPath: './models/plda.gguf',
|
|
1023
|
+
whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
|
|
1024
|
+
language: 'en',
|
|
1025
|
+
backend: {
|
|
1026
|
+
type: 'openvino-hybrid',
|
|
1027
|
+
whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
|
|
1028
|
+
embPath: './models/embedding-openvino.xml',
|
|
1029
|
+
},
|
|
1030
|
+
});
|
|
1031
|
+
|
|
1032
|
+
// Switch to OpenVINO-hybrid at runtime
|
|
1033
|
+
await pipeline.setExecutionBackend({
|
|
1034
|
+
type: 'openvino-hybrid',
|
|
1035
|
+
whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
|
|
1036
|
+
embPath: './models/embedding-openvino.xml',
|
|
1037
|
+
});
|
|
1038
|
+
const result = await pipeline.transcribeOffline(audio);
|
|
1039
|
+
|
|
1040
|
+
// Switch back to Vulkan
|
|
1041
|
+
await pipeline.setExecutionBackend({ type: 'vulkan' });
|
|
1042
|
+
```
|
|
1043
|
+
|
|
674
1044
|
Streaming sessions also support runtime changes:
|
|
675
1045
|
|
|
676
1046
|
```typescript
|
|
@@ -709,6 +1079,21 @@ The pipeline returns this JSON shape:
|
|
|
709
1079
|
}
|
|
710
1080
|
```
|
|
711
1081
|
|
|
1082
|
+
When `transcriptionOnly` is `true`, the `speaker` field is an empty string:
|
|
1083
|
+
|
|
1084
|
+
```json
|
|
1085
|
+
{
|
|
1086
|
+
"segments": [
|
|
1087
|
+
{
|
|
1088
|
+
"speaker": "",
|
|
1089
|
+
"start": 0.497000,
|
|
1090
|
+
"duration": 2.085000,
|
|
1091
|
+
"text": "Hello world"
|
|
1092
|
+
}
|
|
1093
|
+
]
|
|
1094
|
+
}
|
|
1095
|
+
```
|
|
1096
|
+
|
|
712
1097
|
## Audio Format Requirements
|
|
713
1098
|
|
|
714
1099
|
- Input must be `Float32Array`
|
|
@@ -722,9 +1107,13 @@ All API methods expect decoded PCM samples; file decoding/resampling is handled
|
|
|
722
1107
|
|
|
723
1108
|
### Offline mode (`transcribeOffline`)
|
|
724
1109
|
|
|
725
|
-
1.
|
|
726
|
-
2.
|
|
727
|
-
3.
|
|
1110
|
+
1. VAD silence filter (optional — compresses silence >2s to 2s when `vadModelPath` provided)
|
|
1111
|
+
2. Single `whisper_full()` call on filtered audio
|
|
1112
|
+
3. Offline diarization (segmentation → powerset → embeddings → PLDA → AHC → VBx) on filtered audio
|
|
1113
|
+
4. WhisperX-style alignment (speaker assignment by maximum segment overlap)
|
|
1114
|
+
5. Return segments + filtered audio bytes (timestamps aligned to filtered audio)
|
|
1115
|
+
|
|
1116
|
+
In transcription-only mode, steps 3 (diarization) and 4 (alignment) are skipped.
|
|
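Because the timestamps from step 5 are relative to the filtered audio, a segment must be cut out of `result.filteredAudio` rather than the original input. A minimal sketch, assuming 16 kHz mono samples (the helper names are illustrative, not part of the API):

```typescript
// Illustrative helpers only — not part of the pyannote-cpp-node API.

// Convert Float32 samples in [-1, 1] to 16-bit PCM for playback/export.
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = Math.round(s * 32767);
  }
  return out;
}

// Cut one aligned segment out of the filtered audio (16 kHz assumed).
function sliceSegment(
  filtered: Float32Array,
  start: number,
  duration: number,
  rate = 16000,
): Float32Array {
  return filtered.slice(Math.floor(start * rate), Math.floor((start + duration) * rate));
}
```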
728
1117
|
|
|
729
1118
|
### Streaming mode (`transcribe` / `createSession`)
|
|
730
1119
|
|
|
@@ -738,6 +1127,8 @@ The streaming pipeline runs in 7 stages:
|
|
|
738
1127
|
6. Finalize (flush + final recluster + final alignment)
|
|
739
1128
|
7. Callback/event emission (`segments` updates + `audio` chunk streaming)
|
|
740
1129
|
|
|
1130
|
+
In transcription-only mode, steps 5 (alignment) and 6 (recluster) are skipped, and segments are emitted with an empty `speaker` field. Each `segments` event contains only the new segments from that Whisper call (incremental), unlike diarization mode which re-emits all segments after each recluster (cumulative).
|
|
1131
|
+
|
|
741
1132
|
## Performance
|
|
742
1133
|
|
|
743
1134
|
- Offline transcription + diarization: **~12x real-time** (30s audio in 2.5s)
|
|
@@ -747,14 +1138,82 @@ The streaming pipeline runs in 7 stages:
|
|
|
747
1138
|
- Each Whisper segment maps 1:1 to a speaker-labeled segment (no merging)
|
|
748
1139
|
- Speaker confusion rate: **2.55%**
|
|
749
1140
|
|
|
1141
|
+
## Warnings and Known Issues
|
|
1142
|
+
|
|
1143
|
+
### Intel Integrated GPU (Iris Xe) - Vulkan driver memory leak
|
|
1144
|
+
|
|
1145
|
+
- Intel Iris Xe (12th/13th gen) Vulkan drivers have a known memory leak when GPU contexts are repeatedly created and destroyed
|
|
1146
|
+
- This is a confirmed Intel driver bug (Intel internal tracking ID: 14022504159), not a bug in this library
|
|
1147
|
+
- Affects: repeated `Pipeline.load()` / `pipeline.close()` cycles in the same process (crashes after ~8-10 cycles)
|
|
1148
|
+
- Affects: repeated `setExecutionBackend()` calls (each call tears down and rebuilds the full GPU context)
|
|
1149
|
+
- Does NOT affect: creating/closing sessions (sessions borrow cached GPU contexts, no new Vulkan allocations)
|
|
1150
|
+
- Does NOT affect: NVIDIA or AMD discrete GPUs, or Intel Core Ultra (newer gen) integrated GPUs
|
|
1151
|
+
- Workaround: load the pipeline once at application startup and reuse it. Close only at shutdown.
|
|
1152
|
+
- Reference: https://community.intel.com/t5/Developing-Games-on-Intel/Memory-leaks-on-Intel-Iris-Xe-graphics/td-p/1585566
|
|
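On affected Intel iGPUs, the load-once workaround above can be enforced with a process-wide singleton. A minimal sketch (the `PipelineLike` interface and helper names are illustrative, standing in for the real `Pipeline`):

```typescript
// Illustrative sketch — PipelineLike stands in for the real Pipeline type.
interface PipelineLike { close(): void; }

let shared: PipelineLike | null = null;
let contextsCreated = 0; // each load creates a fresh GPU context

// Lazily create the pipeline once per process and reuse it everywhere.
async function getPipeline(load: () => Promise<PipelineLike>): Promise<PipelineLike> {
  if (shared === null) {
    shared = await load();
    contextsCreated++;
  }
  return shared;
}

// Call only at application shutdown.
function shutdownPipeline(): void {
  shared?.close();
  shared = null;
}
```

With this in place, every caller gets the same pipeline instance, so only one GPU context is ever created per process regardless of how many files are processed.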
1153
|
+
|
|
1154
|
+
### setExecutionBackend is a heavy operation
|
|
1155
|
+
|
|
1156
|
+
- Each call fully tears down and reloads the model cache (Whisper context, GGML models, Vulkan/OpenVINO backends)
|
|
1157
|
+
- Takes ~5-6 seconds on Intel Iris Xe
|
|
1158
|
+
- Treat it as a one-time configuration change, not something to call repeatedly
|
|
1159
|
+
- On Intel iGPU: limit to 1-2 switches per process lifetime to avoid the driver leak
|
|
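One way to honor the limit above is to make switches idempotent, so repeated calls with the same backend type become no-ops. A sketch with an illustrative wrapper (the `SwitchablePipeline` shape mirrors `setExecutionBackend` but is not the real type):

```typescript
// Illustrative sketch — SwitchablePipeline mirrors the setExecutionBackend
// surface; it is not the real Pipeline type.
interface SwitchablePipeline {
  setExecutionBackend(opts: { type: string } & Record<string, unknown>): Promise<void>;
}

let activeBackend: string | null = null;

// Only perform a real switch when the backend type actually changes —
// each real switch tears down and rebuilds the full GPU context.
async function ensureBackend(
  pipeline: SwitchablePipeline,
  opts: { type: string } & Record<string, unknown>,
): Promise<boolean> {
  if (activeBackend === opts.type) return false; // no-op, nothing torn down
  await pipeline.setExecutionBackend(opts);
  activeBackend = opts.type;
  return true;
}
```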
1160
|
+
|
|
1161
|
+
### One operation at a time
|
|
1162
|
+
|
|
1163
|
+
The pipeline enforces exclusive access to its GPU resources. Only one of the following can be active at a time:
|
|
1164
|
+
|
|
1165
|
+
- A streaming session (`createSession()`)
|
|
1166
|
+
- A one-shot transcription (`transcribe()`)
|
|
1167
|
+
- An offline transcription (`transcribeOffline()`)
|
|
1168
|
+
- A backend switch (`setExecutionBackend()`)
|
|
1169
|
+
|
|
1170
|
+
Attempting to start a second operation while one is active throws an error. Close the current session or wait for the current operation to complete before starting the next one.
|
|
1171
|
+
|
|
1172
|
+
```typescript
|
|
1173
|
+
// CORRECT: sequential operations
|
|
1174
|
+
const session = pipeline.createSession();
|
|
1175
|
+
// ... push audio, finalize ...
|
|
1176
|
+
session.close();
|
|
1177
|
+
const result = await pipeline.transcribeOffline(audio); // OK — session is closed
|
|
1178
|
+
|
|
1179
|
+
// ERROR: concurrent operations
|
|
1180
|
+
const session1 = pipeline.createSession();
|
|
1181
|
+
const session2 = pipeline.createSession(); // throws: "A session is already active"
|
|
1182
|
+
await pipeline.transcribeOffline(audio); // throws: "Model is busy"
|
|
1183
|
+
```
|
|
1184
|
+
|
|
1185
|
+
### Session creation is cheap
|
|
1186
|
+
|
|
1187
|
+
- `createSession()` borrows pre-loaded models and GPU contexts from the cache
|
|
1188
|
+
- No new Vulkan backends or model loads occur
|
|
1189
|
+
- Close the session when done, then create another — safe to repeat unlimited times
|
|
1190
|
+
|
|
1191
|
+
```typescript
|
|
1192
|
+
// SAFE: load once, create many sessions sequentially
|
|
1193
|
+
const pipeline = await Pipeline.load(config);
|
|
1194
|
+
for (const file of files) {
|
|
1195
|
+
const session = pipeline.createSession();
|
|
1196
|
+
// ... push audio, finalize ...
|
|
1197
|
+
session.close(); // cheap — no GPU teardown
|
|
1198
|
+
}
|
|
1199
|
+
pipeline.close(); // once, at shutdown
|
|
1200
|
+
|
|
1201
|
+
// DANGEROUS on Intel iGPU: repeated load/close cycles
|
|
1202
|
+
for (const file of files) {
|
|
1203
|
+
const pipeline = await Pipeline.load(config); // creates Vulkan context
|
|
1204
|
+
await pipeline.transcribe(audio);
|
|
1205
|
+
pipeline.close(); // destroys Vulkan context; the driver leak crashes around the 8th cycle
|
|
1206
|
+
}
|
|
1207
|
+
```
|
|
1208
|
+
|
|
750
1209
|
## Platform Support
|
|
751
1210
|
|
|
752
|
-
| Platform |
|
|
753
|
-
| --- | --- |
|
|
754
|
-
| macOS arm64 (Apple Silicon) | Supported |
|
|
755
|
-
|
|
|
756
|
-
|
|
|
757
|
-
|
|
|
1211
|
+
| Platform | Low-level (Whisper/VAD) | Pipeline (Transcription + Diarization) |
|
|
1212
|
+
| --- | --- | --- |
|
|
1213
|
+
| macOS arm64 (Apple Silicon) | Supported | Supported (CoreML + Metal) |
|
|
1214
|
+
| Windows x64 | Supported | Supported (Vulkan + optional OpenVINO) |
|
|
1215
|
+
| macOS x64 (Intel) | Supported | Not tested |
|
|
1216
|
+
| Linux | Not supported | Not supported |
|
|
758
1217
|
|
|
759
1218
|
## License
|
|
760
1219
|
|