pyannote-cpp-node 0.5.0 → 1.0.0

package/README.md CHANGED
@@ -1,49 +1,69 @@
  # pyannote-cpp-node

- ![Platform](https://img.shields.io/badge/platform-macOS-lightgrey)
+ ![Platform](https://img.shields.io/badge/platform-macOS%20arm64%20%7C%20Windows%20x64-lightgrey)
  ![Node](https://img.shields.io/badge/node-%3E%3D18-brightgreen)

- Node.js native bindings for integrated Whisper transcription + speaker diarization with speaker-labeled segment output.
+ Node.js native bindings for whisper.cpp transcription/VAD plus the pyannote speaker diarization pipeline.

  ## Overview

- `pyannote-cpp-node` exposes the integrated C++ pipeline that combines Whisper transcription and speaker diarization into a single API.
+ `pyannote-cpp-node` is now the single package for both:

- Given 16 kHz mono PCM audio (`Float32Array`), it produces cumulative and final transcript segments shaped as:
+ - low-level whisper.cpp APIs: `WhisperContext`, `VadContext`, `transcribe`, `transcribeAsync`, `getGpuDevices`
+ - high-level pyannote pipeline APIs: `Pipeline`, `PipelineSession`

- - speaker label (`SPEAKER_00`, `SPEAKER_01`, ...)
+ Platform support:
+
+ - `darwin-arm64`: full pipeline (CoreML + Metal acceleration)
+ - `win32-x64`: full pipeline (Vulkan GPU + optional OpenVINO acceleration)
+ - unsupported: `darwin-x64`, `win32-ia32`, Linux
+
+ On both supported pipeline platforms, `getCapabilities().pipeline` is `true`.
+
+ The integrated pipeline combines Whisper transcription and optional speaker diarization into a single API (`transcriptionOnly: true` skips diarization).
+
+ Given 16 kHz mono PCM audio (`Float32Array`), it produces transcript segments shaped as below. In streaming mode, diarization emits cumulative `segments` events, while `transcriptionOnly: true` emits incremental `segments` events. `finalize()` returns all segments in both modes.
+
+ - speaker label (`SPEAKER_00`, `SPEAKER_01`, ...), `"UNKNOWN"` when diarization could not assign a speaker, or empty string (`""`) when `transcriptionOnly` is `true`
  - segment start/duration in seconds
  - segment text

- The API supports three modes: **offline** batch processing (`transcribeOffline`), **one-shot** streaming (`transcribe`), and **incremental** streaming (`createSession` + `push`/`finalize`). All heavy operations are asynchronous and run on libuv worker threads.
+ The API supports three modes: **offline** batch processing (`transcribeOffline`), **one-shot** streaming (`transcribe`), and **incremental** streaming (`createSession` + `push`/`finalize`). All three modes support transcription-only operation via `transcriptionOnly: true`. All heavy operations are asynchronous and run on libuv worker threads.

  ## Features

+ - Low-level whisper.cpp transcription API compatible with prior `whisper-cpp-node` usage
+ - Built-in Silero VAD via `VadContext`
+ - GPU device enumeration via `getGpuDevices()`
  - Integrated transcription + diarization in one pipeline
  - Speaker-labeled transcript segments with sentence-level text
  - **Offline mode**: runs Whisper on the full audio at once + offline diarization (fastest for batch)
  - **One-shot mode**: streaming pipeline with automatic chunking
  - **Streaming mode**: incremental push/finalize with real-time `segments` events and `audio` chunk streaming
+ - **Transcription-only mode**: skips speaker diarization entirely; only the segmentation, VAD, and Whisper models are required
  - Deterministic output for the same audio/models/config
  - CoreML-accelerated inference on macOS
  - **Shared model cache**: all models loaded once during `Pipeline.load()`, reused across offline/streaming/session modes
- - **Runtime backend switching**: switch Whisper between GPU-only and CoreML-accelerated without reloading the pipeline
+ - **Runtime backend switching**: switch inference backends at runtime on macOS and Windows
  - **Progress reporting**: optional `onProgress` callback for `transcribeOffline` reports Whisper, diarization, and alignment phases
  - **Real-time segment streaming**: optional `onSegment` callback for `transcribeOffline` delivers each Whisper segment (start, end, text) as it's produced — enables live transcript preview and time-based loading bars
  - TypeScript-first API with complete type definitions

  ## Requirements

- - macOS (Apple Silicon or Intel)
+ - macOS Apple Silicon or Windows x64
  - Node.js >= 18
  - Model files:
-   - Segmentation GGUF (`segModelPath`)
-   - Embedding GGUF (`embModelPath`)
-   - PLDA GGUF (`pldaPath`)
-   - Embedding CoreML `.mlpackage` (`coremlPath`)
-   - Segmentation CoreML `.mlpackage` (`segCoremlPath`)
-   - Whisper GGUF (`whisperModelPath`)
+   - Segmentation GGUF (`segModelPath`) — required on all platforms
+   - Embedding GGUF (`embModelPath`) — required unless `transcriptionOnly` is `true`
+   - PLDA GGUF (`pldaPath`) — required unless `transcriptionOnly` is `true`
+   - Whisper GGUF (`whisperModelPath`) — required on all platforms
    - Optional Silero VAD model (`vadModelPath`)
+ - Required backend config (`backend`) with one of: `metal`, `vulkan`, `coreml`, or `openvino-hybrid`
+   - Accelerator-specific paths now live inside `backend`
+   - `backend: { type: 'coreml', segPath, embPath? }` uses CoreML `.mlpackage` assets on macOS
+   - `backend: { type: 'openvino-hybrid', whisperEncoderPath, embPath? }` uses OpenVINO IR `.xml` assets on Windows
+   - `backend: { type: 'metal' }` on macOS and `backend: { type: 'vulkan' }` on Windows do not need extra accelerator paths

  ## Installation

@@ -57,8 +77,45 @@ pnpm add pyannote-cpp-node

  The package installs a platform-specific native addon through `optionalDependencies`.

+ ## Low-Level Quick Start
+
+ ```typescript
+ import {
+   WhisperContext,
+   createVadContext,
+   getCapabilities,
+   transcribeAsync,
+ } from 'pyannote-cpp-node';
+
+ const capabilities = getCapabilities();
+ console.log(capabilities);
+
+ const ctx = new WhisperContext({
+   model: './models/ggml-base.en.bin',
+   use_gpu: true,
+   no_prints: true,
+ });
+
+ const result = await transcribeAsync(ctx, {
+   fname_inp: './audio.wav',
+   language: 'en',
+ });
+
+ const vad = createVadContext({
+   model: './models/ggml-silero-v6.2.0.bin',
+ });
+
+ console.log(result.segments);
+ console.log(vad.getWindowSamples());
+
+ vad.free();
+ ctx.free();
+ ```
+
  ## Quick Start

+ ### macOS (Apple Silicon)
+
  ```typescript
  import { Pipeline } from 'pyannote-cpp-node';

@@ -66,15 +123,43 @@ const pipeline = await Pipeline.load({
    segModelPath: './models/segmentation.gguf',
    embModelPath: './models/embedding.gguf',
    pldaPath: './models/plda.gguf',
-   coremlPath: './models/embedding.mlpackage',
-   segCoremlPath: './models/segmentation.mlpackage',
    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+   backend: {
+     type: 'coreml',
+     segPath: './models/segmentation.mlpackage',
+     embPath: './models/embedding.mlpackage',
+   },
    language: 'en',
  });

  const audio = loadAudioAsFloat32Array('./audio-16khz-mono.wav');
+ const result = await pipeline.transcribeOffline(audio);
+
+ for (const segment of result.segments) {
+   const end = segment.start + segment.duration;
+   console.log(
+     `[${segment.speaker}] ${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`
+   );
+ }
+
+ pipeline.close();
+ ```
+
+ ### Windows (x64)
+
+ ```typescript
+ import { Pipeline } from 'pyannote-cpp-node';
+
+ const pipeline = await Pipeline.load({
+   segModelPath: './models/segmentation.gguf',
+   embModelPath: './models/embedding.gguf',
+   pldaPath: './models/plda.gguf',
+   whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+   language: 'en',
+   backend: { type: 'vulkan' },
+ });

- // Offline mode — fastest for batch processing
+ const audio = loadAudioAsFloat32Array('./audio-16khz-mono.wav');
  const result = await pipeline.transcribeOffline(audio);

  for (const segment of result.segments) {
@@ -87,31 +172,86 @@ for (const segment of result.segments) {
  pipeline.close();
  ```

+ To use the Windows OpenVINO hybrid path instead, pass the OpenVINO assets through `backend`:
+
+ ```typescript
+ const pipeline = await Pipeline.load({
+   segModelPath: './models/segmentation.gguf',
+   embModelPath: './models/embedding.gguf',
+   pldaPath: './models/plda.gguf',
+   whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+   backend: {
+     type: 'openvino-hybrid',
+     whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
+     embPath: './models/embedding-openvino.xml',
+   },
+ });
+ ```
+
+ ### Transcription-only mode
+
+ ```typescript
+ const macPipeline = await Pipeline.load({
+   segModelPath: './models/segmentation.gguf',
+   whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+   language: 'en',
+   transcriptionOnly: true,
+   backend: {
+     type: 'coreml',
+     segPath: './models/segmentation.mlpackage',
+   },
+ });
+
+ const windowsPipeline = await Pipeline.load({
+   segModelPath: './models/segmentation.gguf',
+   whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+   language: 'en',
+   transcriptionOnly: true,
+   backend: { type: 'vulkan' },
+ });
+
+ const result = await macPipeline.transcribe(audio);
+
+ for (const segment of result.segments) {
+   const end = segment.start + segment.duration;
+   // No speaker label - segment.speaker is empty string
+   console.log(`${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`);
+ }
+
+ macPipeline.close();
+ windowsPipeline.close();
+ ```
+
  ## API Reference

  ### `Pipeline`

  ```typescript
  class Pipeline {
-   static async load(config: ModelConfig): Promise<Pipeline>;
+   static async load(config: PipelineConfig): Promise<Pipeline>;
    async transcribeOffline(audio: Float32Array, onProgress?: (phase: number, progress: number) => void, onSegment?: (start: number, end: number, text: string) => void): Promise<TranscriptionResult>;
    async transcribe(audio: Float32Array): Promise<TranscriptionResult>;
    setLanguage(language: string): void;
    setDecodeOptions(options: DecodeOptions): void;
    createSession(): PipelineSession;
-   async setUseCoreml(useCoreml: boolean): Promise<void>;
+   async setExecutionBackend(options: BackendConfig): Promise<void>;
    close(): void;
    get isClosed(): boolean;
  }
  ```

- #### `static async load(config: ModelConfig): Promise<Pipeline>`
+ #### `static async load(config: PipelineConfig): Promise<Pipeline>`

- Validates model paths and loads all models (Whisper, CoreML segmentation/embedding, PLDA, and optionally VAD) into a shared cache on a background thread. Models are loaded once and reused across all subsequent `transcribe()`, `transcribeOffline()`, and `createSession()` calls — no redundant loading occurs when switching between modes. Models are freed only when `close()` is called.
+ Validates model paths and loads all models into a shared cache on a background thread. The accelerator assets are selected by `config.backend`, which is required and has no default. On macOS, `backend: { type: 'coreml', ... }` loads CoreML segmentation and embedding assets, while `backend: { type: 'metal' }` uses Metal. On Windows x64, `backend: { type: 'vulkan' }` loads the Vulkan path, and `backend: { type: 'openvino-hybrid', ... }` also loads OpenVINO IR models for the Whisper encoder and embedding model. When `transcriptionOnly` is `true`, embedding, PLDA, and embedding-specific backend assets are not loaded. Models are loaded once and reused across all subsequent `transcribe()`, `transcribeOffline()`, and `createSession()` calls. Models are freed only when `close()` is called.
+
+ - `backend` is required in every `Pipeline.load()` call
+ - `coreml` requires `segPath`, and `embPath` unless `transcriptionOnly` is `true`
+ - `openvino-hybrid` requires `whisperEncoderPath`, and `embPath` unless `transcriptionOnly` is `true`
+ - `metal` and `vulkan` do not require extra accelerator model paths

  #### `async transcribeOffline(audio: Float32Array, onProgress?, onSegment?): Promise<TranscriptionResult>`

- Runs Whisper on the **entire** audio buffer in a single `whisper_full()` call, then runs offline diarization and WhisperX-style speaker alignment. This is the fastest mode for batch processing — no streaming infrastructure is involved.
+ Runs Whisper on the **entire** audio buffer in a single `whisper_full()` call, then runs offline diarization and WhisperX-style speaker alignment. In transcription-only mode, diarization and speaker alignment are skipped, and segments have an empty `speaker` field. This is the fastest mode for batch processing — no streaming infrastructure is involved.

  The optional `onProgress` callback receives `(phase, progress)` updates:

@@ -153,7 +293,7 @@ const result = await pipeline.transcribeOffline(

  #### `async transcribe(audio: Float32Array): Promise<TranscriptionResult>`

- Runs one-shot transcription + diarization using the streaming pipeline internally (pushes 1-second chunks then finalizes).
+ Runs one-shot transcription (+ diarization unless `transcriptionOnly` is set) using the streaming pipeline internally (pushes 1-second chunks, then finalizes).

  #### `setLanguage(language: string): void`

@@ -164,28 +304,62 @@ Updates the Whisper decode language for subsequent `transcribe()` calls. This is
  Updates one or more Whisper decode options for subsequent `transcribe()` calls. Only the fields you pass are changed; others retain their current values. See `DecodeOptions` for available fields.


- #### `async setUseCoreml(useCoreml: boolean): Promise<void>`
+ #### `async setExecutionBackend(options: BackendConfig): Promise<void>`
+
+ Switches the inference backend at runtime. Tears down and reloads the entire model cache with the new backend configuration. The promise resolves when the new models are ready.

- Switches the Whisper inference backend between GPU-only (`false`) and GPU+CoreML (`true`) at runtime. The method reloads the Whisper context on a background thread with the new `use_coreml` setting. The promise resolves when the new context is ready.
+ - **macOS**: supports `metal` and `coreml`
+ - **Windows**: supports `vulkan` and `openvino-hybrid`

- - If the requested mode matches the current mode, returns immediately (no reload).
- - Throws if the pipeline is closed, busy, or models are not loaded.
- - After switching, all subsequent `transcribe()`, `transcribeOffline()`, and streaming session calls use the new backend.
+ Pass one of these `BackendConfig` variants:

  ```typescript
- // Start with GPU-only Whisper
- const pipeline = await Pipeline.load({
-   ...modelPaths,
-   useCoreml: false,
+ type BackendConfig =
+   | { type: 'metal'; gpuDevice?: number; flashAttn?: boolean }
+   | { type: 'vulkan'; gpuDevice?: number; flashAttn?: boolean }
+   | {
+       type: 'coreml';
+       gpuDevice?: number;
+       flashAttn?: boolean;
+       segPath: string;
+       embPath?: string;
+       whisperEncoderPath?: string;
+     }
+   | {
+       type: 'openvino-hybrid';
+       gpuDevice?: number;
+       flashAttn?: boolean;
+       whisperEncoderPath: string;
+       embPath?: string;
+       openvinoDevice?: string;
+       openvinoCacheDir?: string;
+     };
+ ```
+
+ > **Warning**: This is a heavy operation (~5-6s on Intel iGPU). It fully tears down and rebuilds the model cache. Treat it as a one-time configuration change, not something to call in a loop. See [Warnings and Known Issues](#warnings-and-known-issues) for Intel iGPU limitations.
+
+ ```typescript
+ // macOS: switch to Metal
+ await pipeline.setExecutionBackend({ type: 'metal' });
+
+ // macOS: switch to CoreML
+ await pipeline.setExecutionBackend({
+   type: 'coreml',
+   segPath: './models/segmentation.mlpackage',
+   embPath: './models/embedding.mlpackage',
  });

- // Switch to CoreML-accelerated Whisper at runtime
- await pipeline.setUseCoreml(true);
- const result = await pipeline.transcribeOffline(audio);
+ // Windows: switch to OpenVINO hybrid
+ await pipeline.setExecutionBackend({
+   type: 'openvino-hybrid',
+   whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
+   embPath: './models/embedding-openvino.xml',
+ });

- // Switch back to GPU-only
- await pipeline.setUseCoreml(false);
+ // Windows: switch back to Vulkan
+ await pipeline.setExecutionBackend({ type: 'vulkan' });
  ```
+
  #### `createSession(): PipelineSession`

  Creates an independent streaming session for incremental processing.
@@ -242,11 +416,13 @@ Updates one or more Whisper decode options on the live streaming session. Takes

  #### `async finalize(): Promise<TranscriptionResult>`

- Flushes all stages, runs final recluster + alignment, and returns the definitive result.
+ Flushes all stages, runs final recluster + alignment, and returns the definitive result. `finalize()` always returns all accumulated segments regardless of mode. In diarization mode this is the final re-aligned output, and in transcription-only mode this is the union of all incremental `segments` emissions.

  ```typescript
  type TranscriptionResult = {
    segments: AlignedSegment[];
+   /** Silence-filtered audio when VAD model is loaded. Timestamps align to this audio. */
+   filteredAudio?: Float32Array;
  };
  ```

@@ -260,11 +436,30 @@ Returns `true` after `close()`.

  #### Event: `'segments'`

- Emitted after each Whisper transcription result with the latest cumulative aligned output.
+ Emitted after each Whisper transcription result. Behavior depends on mode:
+
+ - With diarization (default): each emission contains all segments re-aligned against the latest speaker clustering. Earlier segments may get updated speaker labels as more data arrives. The final emission after `finalize()` is the definitive output.
+ - With `transcriptionOnly: true`: each emission contains only the new segments from the latest Whisper result. Earlier segments never change, so incremental delivery is safe. Accumulate across emissions to build the full transcript.

  ```typescript
+ // With diarization (default): cumulative, re-aligned output
  session.on('segments', (segments: AlignedSegment[]) => {
-   // `segments` contains the latest cumulative speaker-labeled transcript
+   // `segments` contains the latest full speaker-labeled transcript so far
+   const latest = segments[segments.length - 1];
+   if (latest) {
+     const end = latest.start + latest.duration;
+     console.log(`[${latest.speaker}] ${latest.start.toFixed(2)}-${end.toFixed(2)} ${latest.text.trim()}`);
+   }
+ });
+
+ // With transcriptionOnly: incremental output, accumulate manually
+ const allSegments: AlignedSegment[] = [];
+ session.on('segments', (newSegments: AlignedSegment[]) => {
+   allSegments.push(...newSegments);
+   for (const seg of newSegments) {
+     const end = seg.start + seg.duration;
+     console.log(`${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
+   }
  });
  ```

@@ -281,46 +476,32 @@ session.on('audio', (chunk: Float32Array) => {
  ### Types

  ```typescript
- export interface ModelConfig {
+ export interface PipelineConfig {
    // === Required Model Paths ===
    /** Path to segmentation GGUF model */
    segModelPath: string;

-   /** Path to embedding GGUF model */
-   embModelPath: string;
-
-   /** Path to PLDA GGUF model */
-   pldaPath: string;
-
-   /** Path to embedding CoreML .mlpackage directory */
-   coremlPath: string;
-
-   /** Path to segmentation CoreML .mlpackage directory */
-   segCoremlPath: string;
-
    /** Path to Whisper GGUF model */
    whisperModelPath: string;

+   /** Path to embedding GGUF model (required unless transcriptionOnly is true) */
+   embModelPath?: string;
+
+   /** Path to PLDA GGUF model (required unless transcriptionOnly is true) */
+   pldaPath?: string;
+
    // === Optional Model Paths ===
    /** Path to Silero VAD model (optional, enables silence compression) */
    vadModelPath?: string;

-   // === Whisper Context Options (model loading) ===
-   /** Enable GPU acceleration (default: true) */
-   useGpu?: boolean;
-
-   /** Enable Flash Attention (default: true) */
-   flashAttn?: boolean;
-
-   /** GPU device index (default: 0) */
-   gpuDevice?: number;
-
    /**
-    * Enable CoreML acceleration for Whisper encoder on macOS (default: false).
-    * The CoreML model must be placed next to the GGUF model with naming convention:
-    * e.g., ggml-base.en.bin -> ggml-base.en-encoder.mlmodelc/
+    * Transcription-only mode - skip speaker diarization (default: false).
+    * When true, embModelPath, pldaPath, and backend embedding assets are not required.
     */
-   useCoreml?: boolean;
+   transcriptionOnly?: boolean;
+
+   /** Required execution backend configuration */
+   backend: BackendConfig;

    /** Suppress whisper.cpp log output (default: false) */
    noPrints?: boolean;
@@ -378,6 +559,54 @@ export interface ModelConfig {
    suppressNst?: boolean;
  }

+ export type BackendConfig =
+   | {
+       /** Metal backend on macOS */
+       type: 'metal';
+       /** GPU device index */
+       gpuDevice?: number;
+       /** Enable Flash Attention */
+       flashAttn?: boolean;
+     }
+   | {
+       /** Vulkan backend on Windows */
+       type: 'vulkan';
+       /** GPU device index */
+       gpuDevice?: number;
+       /** Enable Flash Attention */
+       flashAttn?: boolean;
+     }
+   | {
+       /** CoreML backend on macOS */
+       type: 'coreml';
+       /** GPU device index */
+       gpuDevice?: number;
+       /** Enable Flash Attention */
+       flashAttn?: boolean;
+       /** Path to segmentation CoreML .mlpackage directory */
+       segPath: string;
+       /** Path to embedding CoreML .mlpackage directory (required unless transcriptionOnly is true) */
+       embPath?: string;
+       /** Optional path to Whisper encoder CoreML .mlmodelc directory */
+       whisperEncoderPath?: string;
+     }
+   | {
+       /** OpenVINO hybrid backend on Windows */
+       type: 'openvino-hybrid';
+       /** GPU device index */
+       gpuDevice?: number;
+       /** Enable Flash Attention */
+       flashAttn?: boolean;
+       /** Path to Whisper encoder OpenVINO IR (.xml) */
+       whisperEncoderPath: string;
+       /** Path to embedding OpenVINO IR (.xml) (required unless transcriptionOnly is true) */
+       embPath?: string;
+       /** OpenVINO device target (default: 'GPU') */
+       openvinoDevice?: string;
+       /** OpenVINO model cache directory */
+       openvinoCacheDir?: string;
+     };
+
  export interface DecodeOptions {
    /** Language code (e.g., 'en', 'zh'). Omit for auto-detect. */
    language?: string;
@@ -414,7 +643,7 @@ export interface DecodeOptions {
  }

  export interface AlignedSegment {
-   /** Global speaker label (e.g., SPEAKER_00). */
+   /** Global speaker label (e.g., SPEAKER_00). "UNKNOWN" when diarization could not assign a speaker. Empty string when transcriptionOnly is true. */
    speaker: string;

    /** Segment start time in seconds. */
@@ -428,8 +657,16 @@ export interface AlignedSegment {
  }

  export interface TranscriptionResult {
-   /** Full speaker-labeled transcript segments. */
+   /** Transcript segments. Speaker-labeled when diarization is enabled; speaker is empty string in transcription-only mode. */
    segments: AlignedSegment[];
+   /**
+    * Silence-filtered audio (16 kHz mono Float32Array).
+    * Present when a VAD model is loaded (`vadModelPath` in config).
+    * Silence longer than 2 seconds is compressed to 2 seconds.
+    * All segment timestamps are aligned to this audio —
+    * save it directly and timestamps will sync correctly.
+    */
+   filteredAudio?: Float32Array;
  }
  ```

@@ -445,9 +682,12 @@ async function runOffline(audio: Float32Array) {
    segModelPath: './models/segmentation.gguf',
    embModelPath: './models/embedding.gguf',
    pldaPath: './models/plda.gguf',
-   coremlPath: './models/embedding.mlpackage',
-   segCoremlPath: './models/segmentation.mlpackage',
    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+   backend: {
+     type: 'coreml',
+     segPath: './models/segmentation.mlpackage',
+     embPath: './models/embedding.mlpackage',
+   },
  });

  // Runs Whisper on full audio at once + offline diarization
@@ -462,6 +702,46 @@ async function runOffline(audio: Float32Array) {
  }
  ```

+ ### Offline transcription with silence filtering
+
+ When a VAD model is provided, `transcribeOffline` automatically compresses silence longer than 2 seconds down to 2 seconds before running Whisper and diarization. The filtered audio is returned alongside the segments so you can save it with correctly aligned timestamps.
+
+ ```typescript
+ import { Pipeline } from 'pyannote-cpp-node';
+ import { writeFileSync } from 'node:fs';
+
+ async function runOfflineWithVAD(audio: Float32Array) {
+   const pipeline = await Pipeline.load({
+     segModelPath: './models/segmentation.gguf',
+     embModelPath: './models/embedding.gguf',
+     pldaPath: './models/plda.gguf',
+     whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+     backend: {
+       type: 'coreml',
+       segPath: './models/segmentation.mlpackage',
+       embPath: './models/embedding.mlpackage',
+     },
+     vadModelPath: './models/ggml-silero-v6.2.0.bin', // enables silence filtering
+   });
+
+   const result = await pipeline.transcribeOffline(audio);
+
+   // Save the silence-filtered audio — timestamps in result.segments align to this
+   if (result.filteredAudio) {
+     // filteredAudio is 16 kHz mono Float32Array with silence compressed
+     writeFileSync('./output-filtered.pcm', Buffer.from(result.filteredAudio.buffer));
+     console.log(`Filtered: ${audio.length} -> ${result.filteredAudio.length} samples`);
+   }
+
+   for (const seg of result.segments) {
+     const end = seg.start + seg.duration;
+     console.log(`[${seg.speaker}] ${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
+   }
+
+   pipeline.close();
+ }
+ ```
+
  ### Offline transcription with progress and live transcript preview

  ```typescript
@@ -472,9 +752,12 @@ async function runOfflineWithCallbacks(audio: Float32Array) {
    segModelPath: './models/segmentation.gguf',
    embModelPath: './models/embedding.gguf',
    pldaPath: './models/plda.gguf',
-   coremlPath: './models/embedding.mlpackage',
-   segCoremlPath: './models/segmentation.mlpackage',
    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+   backend: {
+     type: 'coreml',
+     segPath: './models/segmentation.mlpackage',
+     embPath: './models/embedding.mlpackage',
+   },
  });

  const result = await pipeline.transcribeOffline(
@@ -506,9 +789,12 @@ async function runOneShot(audio: Float32Array) {
    segModelPath: './models/segmentation.gguf',
    embModelPath: './models/embedding.gguf',
    pldaPath: './models/plda.gguf',
-   coremlPath: './models/embedding.mlpackage',
-   segCoremlPath: './models/segmentation.mlpackage',
    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+   backend: {
+     type: 'coreml',
+     segPath: './models/segmentation.mlpackage',
+     embPath: './models/embedding.mlpackage',
+   },
  });

  // Uses streaming pipeline internally (push 1s chunks + finalize)
@@ -533,12 +819,16 @@ async function runStreaming(audio: Float32Array) {
    segModelPath: './models/segmentation.gguf',
    embModelPath: './models/embedding.gguf',
    pldaPath: './models/plda.gguf',
-   coremlPath: './models/embedding.mlpackage',
-   segCoremlPath: './models/segmentation.mlpackage',
    whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+   backend: {
+     type: 'coreml',
+     segPath: './models/segmentation.mlpackage',
+     embPath: './models/embedding.mlpackage',
+   },
  });

  const session = pipeline.createSession();
+ // Diarization mode (default): each event is cumulative and may relabel earlier segments
  session.on('segments', (segments) => {
    const latest = segments[segments.length - 1];
    if (latest) {
@@ -569,6 +859,47 @@ async function runStreaming(audio: Float32Array) {
  }
  ```

+ ```typescript
+ import { Pipeline, type AlignedSegment } from 'pyannote-cpp-node';
+
+ async function runStreamingTranscriptionOnly(audio: Float32Array) {
+   const pipeline = await Pipeline.load({
+     segModelPath: './models/segmentation.gguf',
+     whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+     transcriptionOnly: true,
+     backend: {
+       type: 'coreml',
+       segPath: './models/segmentation.mlpackage',
+     },
+   });
+
+   const session = pipeline.createSession();
+
+   // Transcription-only: each event has only NEW segments
+   const allSegments: AlignedSegment[] = [];
+   session.on('segments', (newSegments) => {
+     allSegments.push(...newSegments);
+     for (const seg of newSegments) {
+       const end = seg.start + seg.duration;
+       console.log(`${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
+     }
+   });
+
+   const chunkSize = 16000;
+   for (let i = 0; i < audio.length; i += chunkSize) {
+     const chunk = audio.slice(i, Math.min(i + chunkSize, audio.length));
+     await session.push(chunk);
+   }
+
+   const finalResult = await session.finalize();
+   console.log(`Final segments from finalize(): ${finalResult.segments.length}`);
+   console.log(`Accumulated from incremental events: ${allSegments.length}`);
+
+   session.close();
+   pipeline.close();
+ }
+ ```
+
  ### Custom Whisper decode options

  ```typescript
@@ -578,15 +909,14 @@ const pipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  embModelPath: './models/embedding.gguf',
  pldaPath: './models/plda.gguf',
- coremlPath: './models/embedding.mlpackage',
- segCoremlPath: './models/segmentation.mlpackage',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
-
- // Whisper runtime options
- useGpu: true,
- flashAttn: true,
- gpuDevice: 0,
- useCoreml: false,
+ backend: {
+ type: 'coreml',
+ segPath: './models/segmentation.mlpackage',
+ embPath: './models/embedding.mlpackage',
+ flashAttn: true,
+ gpuDevice: 0,
+ },

  // Decode strategy
  nThreads: 8,
@@ -619,9 +949,12 @@ const pipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  embModelPath: './models/embedding.gguf',
  pldaPath: './models/plda.gguf',
- coremlPath: './models/embedding.mlpackage',
- segCoremlPath: './models/segmentation.mlpackage',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+ backend: {
+ type: 'coreml',
+ segPath: './models/segmentation.mlpackage',
+ embPath: './models/embedding.mlpackage',
+ },
  language: 'en',
  });

@@ -643,34 +976,71 @@ const result3 = await pipeline.transcribe(chineseAudio);
  pipeline.close();
  ```

- ### Switching Whisper backend at runtime
+ ### Switching execution backend at runtime (macOS)

  ```typescript
  import { Pipeline } from 'pyannote-cpp-node';

- // Start with GPU-only Whisper (default)
+ // Start with Metal
  const pipeline = await Pipeline.load({
  segModelPath: './models/segmentation.gguf',
  embModelPath: './models/embedding.gguf',
  pldaPath: './models/plda.gguf',
- coremlPath: './models/embedding.mlpackage',
- segCoremlPath: './models/segmentation.mlpackage',
  whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
- useCoreml: false,
+ backend: { type: 'metal' },
  });

- // Switch to CoreML-accelerated Whisper encoder at runtime
- // (requires ggml-large-v3-turbo-q5_0-encoder.mlmodelc next to the GGUF)
- await pipeline.setUseCoreml(true);
+ // Switch to CoreML
+ await pipeline.setExecutionBackend({
+ type: 'coreml',
+ segPath: './models/segmentation.mlpackage',
+ embPath: './models/embedding.mlpackage',
+ whisperEncoderPath: './models/ggml-large-v3-turbo-q5_0-encoder.mlmodelc',
+ });
  const result1 = await pipeline.transcribeOffline(audio);

- // Switch back to GPU-only
- await pipeline.setUseCoreml(false);
+ // Switch back to Metal
+ await pipeline.setExecutionBackend({ type: 'metal' });
  const result2 = await pipeline.transcribeOffline(audio);

  pipeline.close();
  ```

+ ## Execution Backends
+
+ `setExecutionBackend(options)` switches the inference backend at runtime.
+
+ - On macOS: supports `metal` and `coreml`
+ - On Windows: supports `vulkan` and `openvino-hybrid`
+ - `openvino-hybrid` uses OpenVINO for the Whisper encoder and embedding model, and Vulkan for everything else
+ - `gpuDevice` and `flashAttn` are configured inside the backend object
+
+ ```typescript
+ const pipeline = await Pipeline.load({
+ segModelPath: './models/segmentation.gguf',
+ embModelPath: './models/embedding.gguf',
+ pldaPath: './models/plda.gguf',
+ whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+ language: 'en',
+ backend: {
+ type: 'openvino-hybrid',
+ whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
+ embPath: './models/embedding-openvino.xml',
+ },
+ });
+
+ // Switch to OpenVINO-hybrid at runtime
+ await pipeline.setExecutionBackend({
+ type: 'openvino-hybrid',
+ whisperEncoderPath: './models/ggml-large-v3-turbo-encoder-openvino.xml',
+ embPath: './models/embedding-openvino.xml',
+ });
+ const result = await pipeline.transcribeOffline(audio);
+
+ // Switch back to Vulkan
+ await pipeline.setExecutionBackend({ type: 'vulkan' });
+ ```
+
  Streaming sessions also support runtime changes:

  ```typescript
@@ -709,6 +1079,21 @@ The pipeline returns this JSON shape:
  }
  ```

+ When `transcriptionOnly` is `true`, the `speaker` field is an empty string:
+
+ ```json
+ {
+ "segments": [
+ {
+ "speaker": "",
+ "start": 0.497000,
+ "duration": 2.085000,
+ "text": "Hello world"
+ }
+ ]
+ }
+ ```
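This JSON corresponds to the `AlignedSegment` type imported in the TypeScript examples. The interface below is inferred from the output shown here, not copied from the package's typings, so treat it as an approximation:

```typescript
// Inferred from the JSON output above; the field comments are assumptions.
interface AlignedSegment {
  speaker: string;  // 'SPEAKER_00', 'SPEAKER_01', ... or '' in transcription-only mode
  start: number;    // segment start, in seconds
  duration: number; // segment length, in seconds
  text: string;     // transcript text for this segment
}

// The end time is derived rather than stored:
const seg: AlignedSegment = { speaker: '', start: 0.497, duration: 2.085, text: 'Hello world' };
const end = seg.start + seg.duration;
```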
+

  ## Audio Format Requirements

  - Input must be `Float32Array`
@@ -722,9 +1107,13 @@ All API methods expect decoded PCM samples; file decoding/resampling is handled

  ### Offline mode (`transcribeOffline`)

- 1. Single `whisper_full()` call on entire audio
- 2. Offline diarization (segmentation powerset embeddings → PLDA → AHC → VBx)
- 3. WhisperX-style alignment (speaker assignment by maximum segment overlap)
+ 1. VAD silence filter (optional: silence runs longer than 2s are compressed to 2s when `vadModelPath` is provided)
+ 2. Single `whisper_full()` call on filtered audio
+ 3. Offline diarization (segmentation powerset embeddings → PLDA → AHC → VBx) on filtered audio
+ 4. WhisperX-style alignment (speaker assignment by maximum segment overlap)
+ 5. Return segments + filtered audio bytes (timestamps aligned to filtered audio)
+
+ In transcription-only mode, steps 3 (diarization) and 4 (alignment) are skipped.
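The silence compression in step 1 can be sketched as a pure function. This is an illustrative approximation, assuming a caller-supplied per-sample silence test (the real filter is driven by the VAD model); `compressSilence` and its parameters are hypothetical names, not library API:

```typescript
// Sketch of the step-1 behavior: any silent run longer than `maxSilence`
// samples is truncated to exactly `maxSilence` samples. At 16 kHz, the
// 2s cap described above corresponds to maxSilence = 32000.
function compressSilence(
  audio: Float32Array,
  isSilent: (sample: number) => boolean,
  maxSilence: number,
): Float32Array {
  const out: number[] = [];
  let run = 0; // length of the current silent run
  for (const s of audio) {
    if (isSilent(s)) {
      run += 1;
      if (run <= maxSilence) out.push(s); // keep up to the cap, drop the rest
    } else {
      run = 0;
      out.push(s);
    }
  }
  return Float32Array.from(out);
}
```

After this step all timestamps refer to the filtered audio, which is why step 5 returns the filtered bytes alongside the segments.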

  ### Streaming mode (`transcribe` / `createSession`)

@@ -738,6 +1127,8 @@ The streaming pipeline runs in 7 stages:
  6. Finalize (flush + final recluster + final alignment)
  7. Callback/event emission (`segments` updates + `audio` chunk streaming)

+ In transcription-only mode, steps 5 (alignment) and 6 (recluster) are skipped, and segments are emitted with an empty `speaker` field. Each `segments` event contains only the new segments from that Whisper call (incremental), unlike diarization mode, which re-emits all segments after each recluster (cumulative).
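The alignment stage (speaker assignment by maximum segment overlap) can be sketched as a pure function. This is a hypothetical illustration, not the library's code; `Interval` and `SpeakerTurn` are assumed shapes for a transcript segment and a diarization turn:

```typescript
interface Interval { start: number; duration: number }
interface SpeakerTurn extends Interval { speaker: string }

// Assign a transcript segment the speaker whose diarization turns
// overlap it the most (WhisperX-style maximum-overlap assignment).
function assignSpeaker(seg: Interval, turns: SpeakerTurn[]): string {
  const segEnd = seg.start + seg.duration;
  const overlapBySpeaker = new Map<string, number>();
  for (const t of turns) {
    const overlap = Math.min(segEnd, t.start + t.duration) - Math.max(seg.start, t.start);
    if (overlap > 0) {
      overlapBySpeaker.set(t.speaker, (overlapBySpeaker.get(t.speaker) ?? 0) + overlap);
    }
  }
  let best = ''; // '' mirrors the transcription-only output when nothing overlaps
  let bestOverlap = 0;
  for (const [speaker, total] of overlapBySpeaker) {
    if (total > bestOverlap) {
      best = speaker;
      bestOverlap = total;
    }
  }
  return best;
}
```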
+
  ## Performance

  - Offline transcription + diarization: **~12x real-time** (30s audio in 2.5s)
@@ -747,14 +1138,82 @@ The streaming pipeline runs in 7 stages:
  - Each Whisper segment maps 1:1 to a speaker-labeled segment (no merging)
  - Speaker confusion rate: **2.55%**

+ ## Warnings and Known Issues
+
+ ### Intel Integrated GPU (Iris Xe) - Vulkan driver memory leak
+
+ - Intel Iris Xe (12th/13th gen) Vulkan drivers have a known memory leak when GPU contexts are repeatedly created and destroyed
+ - This is a confirmed Intel driver bug (Intel internal tracking ID: 14022504159), not a bug in this library
+ - Affects: repeated `Pipeline.load()` / `pipeline.close()` cycles in the same process (crashes after ~8-10 cycles)
+ - Affects: repeated `setExecutionBackend()` calls (each call tears down and rebuilds the full GPU context)
+ - Does NOT affect: creating/closing sessions (sessions borrow cached GPU contexts, no new Vulkan allocations)
+ - Does NOT affect: NVIDIA or AMD discrete GPUs, or Intel Core Ultra (newer gen) integrated GPUs
+ - Workaround: load the pipeline once at application startup and reuse it. Close only at shutdown.
+ - Reference: https://community.intel.com/t5/Developing-Games-on-Intel/Memory-leaks-on-Intel-Iris-Xe-graphics/td-p/1585566
+
+ ### setExecutionBackend is a heavy operation
+
+ - Each call fully tears down and reloads the model cache (Whisper context, GGML models, Vulkan/OpenVINO backends)
+ - Takes ~5-6 seconds on Intel Iris Xe
+ - Treat it as a one-time configuration change, not something to call repeatedly
+ - On Intel iGPU: limit to 1-2 switches per process lifetime to avoid the driver leak
+
+ ### One operation at a time
+
+ The pipeline enforces exclusive access to its GPU resources. Only one of the following can be active at a time:
+
+ - A streaming session (`createSession()`)
+ - A one-shot transcription (`transcribe()`)
+ - An offline transcription (`transcribeOffline()`)
+ - A backend switch (`setExecutionBackend()`)
+
+ Attempting to start a second operation while one is active throws an error. Close the current session or wait for the current operation to complete before starting the next one.
+
+ ```typescript
+ // CORRECT: sequential operations
+ const session = pipeline.createSession();
+ // ... push audio, finalize ...
+ session.close();
+ const result = await pipeline.transcribeOffline(audio); // OK — session is closed
+
+ // ERROR: concurrent operations
+ const session1 = pipeline.createSession();
+ const session2 = pipeline.createSession(); // throws: "A session is already active"
+ await pipeline.transcribeOffline(audio); // throws: "Model is busy"
+ ```
+
+ ### Session creation is cheap
+
+ - `createSession()` borrows pre-loaded models and GPU contexts from the cache
+ - No new Vulkan backends or model loads occur
+ - Close the session when done, then create another — safe to repeat unlimited times
+
+ ```typescript
+ // SAFE: load once, create many sessions sequentially
+ const pipeline = await Pipeline.load(config);
+ for (const file of files) {
+ const session = pipeline.createSession();
+ // ... push audio, finalize ...
+ session.close(); // cheap — no GPU teardown
+ }
+ pipeline.close(); // once, at shutdown
+
+ // DANGEROUS on Intel iGPU: repeated load/close cycles
+ for (const file of files) {
+ const pipeline = await Pipeline.load(config); // creates Vulkan context
+ await pipeline.transcribe(audio);
+ pipeline.close(); // destroys Vulkan context; the driver leak crashes around the ~8th cycle
+ }
+ ```
+
  ## Platform Support

- | Platform | Status |
- | --- | --- |
- | macOS arm64 (Apple Silicon) | Supported |
- | macOS x64 (Intel) | Supported |
- | Linux | Not supported |
- | Windows | Not supported |
+ | Platform | Low-level (Whisper/VAD) | Pipeline (Transcription + Diarization) |
+ | --- | --- | --- |
+ | macOS arm64 (Apple Silicon) | Supported | Supported (CoreML + Metal) |
+ | Windows x64 | Supported | Supported (Vulkan + optional OpenVINO) |
+ | macOS x64 (Intel) | Supported | Not tested |
+ | Linux | Not supported | Not supported |
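The matrix above can also be checked at runtime: `getCapabilities().pipeline` (see the overview) is the authoritative check for pipeline support. The helper below merely mirrors the table's pipeline column and is purely illustrative:

```typescript
// Mirrors the pipeline column of the support table above.
// Prefer getCapabilities().pipeline at runtime; this is a static sketch.
function pipelineSupported(platform: string, arch: string): boolean {
  return (
    (platform === 'darwin' && arch === 'arm64') ||
    (platform === 'win32' && arch === 'x64')
  );
}

// e.g. pipelineSupported(process.platform, process.arch)
```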

  ## License