pyannote-cpp-node 0.1.0 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,43 +1,45 @@
1
1
  # pyannote-cpp-node
2
2
 
3
- Node.js native bindings for real-time speaker diarization
4
-
5
3
  ![Platform](https://img.shields.io/badge/platform-macOS-lightgrey)
6
4
  ![Node](https://img.shields.io/badge/node-%3E%3D18-brightgreen)
7
5
 
6
+ Node.js native bindings for integrated Whisper transcription + speaker diarization with speaker-labeled, word-level output.
7
+
8
8
  ## Overview
9
9
 
10
- `pyannote-cpp-node` provides Node.js bindings to a high-performance C++ port of the [`pyannote/speaker-diarization-community-1`](https://huggingface.co/pyannote/speaker-diarization-community-1) pipeline. It achieves **39x real-time** performance on Apple Silicon by leveraging CoreML acceleration (Neural Engine + GPU) for neural network inference and optimized C++ implementations of clustering algorithms.
10
+ `pyannote-cpp-node` exposes the integrated C++ pipeline that combines streaming diarization and Whisper transcription into a single API.
11
11
 
12
- The library supports two modes:
12
+ Given 16 kHz mono PCM audio (`Float32Array`), it produces cumulative and final transcript segments shaped as:
13
13
 
14
- - **Offline diarization**: Process an entire audio file at once and receive speaker-labeled segments
15
- - **Streaming diarization**: Process audio incrementally in real-time, receive voice activity detection (VAD) as audio arrives, and trigger speaker clustering on demand
14
+ - speaker label (`SPEAKER_00`, `SPEAKER_01`, ...)
15
+ - segment start/duration in seconds
16
+ - segment text
17
+ - per-word timestamps
16
18
 
17
- All heavy operations are asynchronous and run on libuv worker threads, ensuring the Node.js event loop remains responsive.
19
+ The API supports both one-shot processing (`transcribe`) and incremental streaming (`createSession` + `push`/`finalize`). All heavy operations are asynchronous and run on libuv worker threads.
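
For orientation, the sketch below condenses the two modes into one function; `bothModes`, its arguments, and the logging are illustrative only and assume an already-loaded pipeline plus 16 kHz mono PCM buffers (complete examples follow under Usage Examples).

```typescript
import type { Pipeline } from 'pyannote-cpp-node';

// Illustrative comparison of the two modes; `audio` and `chunk` are 16 kHz mono PCM.
async function bothModes(pipeline: Pipeline, audio: Float32Array, chunk: Float32Array) {
  // One-shot: hand over the whole buffer and await the final transcript.
  const oneShot = await pipeline.transcribe(audio);

  // Streaming: feed audio incrementally, listen for live updates, then finalize.
  const session = pipeline.createSession();
  session.on('segments', (segments) => {
    console.log(`live update: ${segments.length} cumulative segments`);
  });
  await session.push(chunk); // repeat as audio arrives
  const streamed = await session.finalize();
  session.close();

  return { oneShot, streamed };
}
```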
18
20
 
19
21
  ## Features
20
22
 
21
- - **Offline diarization** Process full audio files and get speaker-labeled segments
22
- - **Streaming diarization** — Push audio incrementally, receive real-time VAD, recluster on demand
23
- - **Async/await API** All heavy operations return Promises and run on worker threads
24
- - **CoreML acceleration** Neural networks run on Apple's Neural Engine, GPU, and CPU
25
- - **TypeScript-first** Full type definitions included
26
- - **Zero-copy audio input** — Direct `Float32Array` input for maximum efficiency
27
- - **Byte-identical output** Streaming finalize produces identical results to offline pipeline
23
+ - Integrated transcription + diarization in one pipeline
24
+ - Speaker-labeled, word-level transcript output
25
+ - One-shot and streaming APIs with the same output schema
26
+ - Incremental `segments` events for live applications
27
+ - Deterministic output for the same audio/models/config
28
+ - CoreML-accelerated inference on macOS
29
+ - TypeScript-first API with complete type definitions
28
30
 
29
31
  ## Requirements
30
32
 
31
- - **macOS** with Apple Silicon (M1/M2/M3/M4) or Intel x64
32
- - **Node.js** >= 18
33
- - **Model files**:
34
- - Segmentation GGUF model (`segmentation.gguf`)
35
- - Embedding GGUF model (`embedding.gguf`)
36
- - PLDA GGUF model (`plda.gguf`)
37
- - Segmentation CoreML model package (`segmentation.mlpackage/`)
38
- - Embedding CoreML model package (`embedding.mlpackage/`)
39
-
40
- Model files can be obtained by converting the original PyTorch models using the conversion scripts in the parent repository.
33
+ - macOS (Apple Silicon or Intel)
34
+ - Node.js >= 18
35
+ - Model files:
36
+ - Segmentation GGUF (`segModelPath`)
37
+ - Embedding GGUF (`embModelPath`)
38
+ - PLDA GGUF (`pldaPath`)
39
+ - Embedding CoreML `.mlpackage` (`coremlPath`)
40
+ - Segmentation CoreML `.mlpackage` (`segCoremlPath`)
41
+ - Whisper GGUF (`whisperModelPath`)
42
+ - Optional Silero VAD model (`vadModelPath`)
41
43
 
42
44
  ## Installation
43
45
 
@@ -45,721 +47,414 @@ Model files can be obtained by converting the original PyTorch models using the
45
47
  npm install pyannote-cpp-node
46
48
  ```
47
49
 
48
- Or with pnpm:
49
-
50
50
  ```bash
51
51
  pnpm add pyannote-cpp-node
52
52
  ```
53
53
 
54
- The package uses `optionalDependencies` to automatically install the correct platform-specific native addon (`@pyannote-cpp-node/darwin-arm64` or `@pyannote-cpp-node/darwin-x64`).
54
+ The package installs the matching platform-specific native addon (`@pyannote-cpp-node/darwin-arm64` or `@pyannote-cpp-node/darwin-x64`) through `optionalDependencies`.
55
55
 
56
56
  ## Quick Start
57
57
 
58
58
  ```typescript
59
- import { Pyannote } from 'pyannote-cpp-node';
60
- import { readFileSync } from 'node:fs';
59
+ import { Pipeline } from 'pyannote-cpp-node';
61
60
 
62
- // Load model (validates all paths exist)
63
- const model = await Pyannote.load({
61
+ const pipeline = await Pipeline.load({
64
62
  segModelPath: './models/segmentation.gguf',
65
63
  embModelPath: './models/embedding.gguf',
66
64
  pldaPath: './models/plda.gguf',
67
65
  coremlPath: './models/embedding.mlpackage',
68
66
  segCoremlPath: './models/segmentation.mlpackage',
67
+ whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
68
+ language: 'en',
69
69
  });
70
70
 
71
- // Load audio (16kHz mono Float32Array - see "Audio Format Requirements")
72
- const audio = loadWavFile('./audio.wav');
73
-
74
- // Run diarization
75
- const result = await model.diarize(audio);
71
+ const audio = loadAudioAsFloat32Array('./audio-16khz-mono.wav'); // your own loader; 16 kHz mono PCM (see "Audio Format Requirements")
72
+ const result = await pipeline.transcribe(audio);
76
73
 
77
- // Print results
78
74
  for (const segment of result.segments) {
75
+ const end = segment.start + segment.duration;
79
76
  console.log(
80
- `[${segment.start.toFixed(2)}s - ${(segment.start + segment.duration).toFixed(2)}s] ${segment.speaker}`
77
+ `[${segment.speaker}] ${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`
81
78
  );
82
79
  }
83
80
 
84
- // Clean up
85
- model.close();
81
+ pipeline.close();
86
82
  ```
87
83
 
88
84
  ## API Reference
89
85
 
90
- ### `Pyannote` Class
91
-
92
- The main entry point for loading diarization models.
86
+ ### `Pipeline`
93
87
 
94
- #### `static async load(config: ModelConfig): Promise<Pyannote>`
95
-
96
- Factory method for loading a diarization model. Validates that all model paths exist before initializing. CoreML model compilation happens synchronously during initialization and is typically fast.
97
-
98
- **Parameters:**
99
- - `config: ModelConfig` — Configuration object with paths to all required model files
100
-
101
- **Returns:** `Promise<Pyannote>` — Initialized model instance
102
-
103
- **Throws:**
104
- - `Error` if any model path does not exist or is invalid
105
-
106
- **Example:**
107
88
  ```typescript
108
- const model = await Pyannote.load({
109
- segModelPath: './models/segmentation.gguf',
110
- embModelPath: './models/embedding.gguf',
111
- pldaPath: './models/plda.gguf',
112
- coremlPath: './models/embedding.mlpackage',
113
- segCoremlPath: './models/segmentation.mlpackage',
114
- });
89
+ class Pipeline {
90
+ static async load(config: ModelConfig): Promise<Pipeline>;
91
+ async transcribe(audio: Float32Array): Promise<TranscriptionResult>;
92
+ createSession(): PipelineSession;
93
+ close(): void;
94
+ get isClosed(): boolean;
95
+ }
115
96
  ```
116
97
 
117
- #### `async diarize(audio: Float32Array): Promise<DiarizationResult>`
98
+ #### `static async load(config: ModelConfig): Promise<Pipeline>`
118
99
 
119
- Performs offline diarization on the entire audio file. Audio must be 16kHz mono in `Float32Array` format with values in the range [-1.0, 1.0].
100
+ Validates model paths and initializes native pipeline resources.
120
101
 
121
- Internally, this method uses the streaming API: it initializes a streaming session, pushes all audio in 1-second chunks, calls finalize, and cleans up. The operation runs on a worker thread and is non-blocking.
122
-
123
- **Parameters:**
124
- - `audio: Float32Array` — Audio samples (16kHz mono, values in [-1.0, 1.0])
125
-
126
- **Returns:** `Promise<DiarizationResult>` — Diarization result with speaker-labeled segments sorted by start time
127
-
128
- **Throws:**
129
- - `Error` if model is closed
130
- - `TypeError` if audio is not a `Float32Array`
131
- - `Error` if audio is empty
132
-
133
- **Example:**
134
- ```typescript
135
- const result = await model.diarize(audio);
136
- console.log(`Detected ${result.segments.length} segments`);
137
- ```
102
+ #### `async transcribe(audio: Float32Array): Promise<TranscriptionResult>`
138
103
 
139
- #### `createStreamingSession(): StreamingSession`
104
+ Runs one-shot transcription + diarization on the full audio buffer.
140
105
 
141
- Creates a new independent streaming session. Each session maintains its own internal state and can be used to process audio incrementally.
106
+ #### `createSession(): PipelineSession`
142
107
 
143
- **Returns:** `StreamingSession` New streaming session instance
108
+ Creates an independent streaming session for incremental processing.
144
109
 
145
- **Throws:**
146
- - `Error` if model is closed
110
+ #### `close(): void`
147
111
 
148
- **Example:**
149
- ```typescript
150
- const session = model.createStreamingSession();
151
- ```
112
+ Releases native resources. Safe to call multiple times.
152
113
 
153
- #### `close(): void`
114
+ #### `get isClosed(): boolean`
154
115
 
155
- Releases all native resources associated with the model. This method is idempotent and safe to call multiple times.
116
+ Returns `true` after `close()`.
156
117
 
157
- Once closed, the model cannot be used for diarization or creating new streaming sessions. Existing streaming sessions should be closed before closing the model.
118
+ ### `PipelineSession` (extends `EventEmitter`)
158
119
 
159
- **Example:**
160
120
  ```typescript
161
- model.close();
162
- console.log(model.isClosed); // true
121
+ class PipelineSession extends EventEmitter {
122
+ async push(audio: Float32Array): Promise<boolean[]>;
123
+ async finalize(): Promise<TranscriptionResult>;
124
+ close(): void;
125
+ get isClosed(): boolean;
126
+ // Event: 'segments' -> (segments: AlignedSegment[], audio: Float32Array)
127
+ }
163
128
  ```
164
129
 
165
- #### `get isClosed: boolean`
130
+ #### `async push(audio: Float32Array): Promise<boolean[]>`
166
131
 
167
- Indicates whether the model has been closed.
132
+ Pushes an arbitrary number of samples into the streaming pipeline.
168
133
 
169
- **Returns:** `boolean` `true` if the model is closed, `false` otherwise
134
+ - The resolved value is an array of per-frame VAD booleans (`true` = speech, `false` = silence)
135
+ - Pushes during the first 10 seconds resolve to an empty array because the pipeline needs a full 10-second window before producing output
136
+ - Chunk size is flexible; pushes are not restricted to 16,000 samples (see the sketch below)
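
A minimal sketch of a push loop under these rules; the chunk sizes, helper name, and logging are illustrative, and `PipelineSession` is assumed to be exported as typed in the API reference.

```typescript
import type { PipelineSession } from 'pyannote-cpp-node';

// Push variable-sized chunks and report VAD once the 10-second window has filled.
async function pushVaryingChunks(session: PipelineSession, audio: Float32Array) {
  const chunkSizes = [8000, 16000, 24000]; // 0.5 s, 1 s, 1.5 s at 16 kHz
  let offset = 0;
  for (let i = 0; offset < audio.length; i++) {
    const size = chunkSizes[i % chunkSizes.length];
    const vad = await session.push(audio.slice(offset, offset + size));
    offset += size;
    if (vad.length === 0) {
      console.log('warming up: less than 10 s of audio accumulated');
    } else {
      console.log(`speech frames: ${vad.filter(Boolean).length}/${vad.length}`);
    }
  }
}
```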
170
137
 
171
- ### `StreamingSession` Class
138
+ #### `async finalize(): Promise<TranscriptionResult>`
172
139
 
173
- Handles incremental audio processing for real-time diarization.
140
+ Flushes all stages, runs final recluster + alignment, and returns the definitive result.
174
141
 
175
- #### `async push(audio: Float32Array): Promise<VADChunk[]>`
176
-
177
- Pushes audio samples to the streaming session. Audio must be 16kHz mono `Float32Array`. Typically, push 1 second of audio (16,000 samples) at a time.
142
+ #### `close(): void`
178
143
 
179
- The first chunk requires 10 seconds of accumulated audio to produce output (the segmentation model uses a 10-second window). After that, each subsequent push returns approximately one `VADChunk` (depending on the 1-second hop size).
144
+ Releases native session resources. Safe to call multiple times.
180
145
 
181
- The returned VAD chunks contain frame-level voice activity (OR of all speakers) for the newly processed 10-second windows.
146
+ #### `get isClosed(): boolean`
182
147
 
183
- **Parameters:**
184
- - `audio: Float32Array` — Audio samples (16kHz mono, values in [-1.0, 1.0])
148
+ Returns `true` after `close()`.
185
149
 
186
- **Returns:** `Promise<VADChunk[]>` — Array of VAD chunks (empty until 10 seconds accumulated)
150
+ #### Event: `'segments'`
187
151
 
188
- **Throws:**
189
- - `Error` if session is closed
190
- - `TypeError` if audio is not a `Float32Array`
152
+ Emitted after each Whisper transcription pass, carrying the latest cumulative aligned output.
191
153
 
192
- **Example:**
193
154
  ```typescript
194
- const vadChunks = await session.push(audioChunk);
195
- for (const chunk of vadChunks) {
196
- console.log(`VAD chunk ${chunk.chunkIndex}: ${chunk.numFrames} frames`);
197
- }
155
+ session.on('segments', (segments: AlignedSegment[], audio: Float32Array) => {
156
+ // `segments` contains the latest cumulative speaker-labeled transcript
157
+ // `audio` contains the chunk submitted for this callback cycle
158
+ });
198
159
  ```
199
160
 
200
- #### `async recluster(): Promise<DiarizationResult>`
201
-
202
- Triggers full clustering on all accumulated audio data. This runs the complete diarization pipeline (embedding extraction → PLDA scoring → hierarchical clustering → VBx refinement → speaker assignment) and returns speaker-labeled segments with global speaker IDs.
203
-
204
- **Warning:** This method mutates the internal session state. Specifically, it replaces the internal embedding and chunk index arrays with filtered versions (excluding silent speakers). Calling `push` after `recluster` may produce unexpected results. Use `recluster` sparingly (e.g., every 30 seconds for live progress updates) or only call `finalize` when the stream ends.
205
-
206
- The operation runs on a worker thread and is non-blocking.
207
-
208
- **Returns:** `Promise<DiarizationResult>` — Complete diarization result with global speaker labels
209
-
210
- **Throws:**
211
- - `Error` if session is closed
161
+ ### Types
212
162
 
213
- **Example:**
214
163
  ```typescript
215
- // Trigger intermediate clustering after accumulating data
216
- const intermediateResult = await session.recluster();
217
- console.log(`Current speaker count: ${new Set(intermediateResult.segments.map(s => s.speaker)).size}`);
218
- ```
164
+ export interface ModelConfig {
165
+ /** Path to segmentation GGUF model file. */
166
+ segModelPath: string;
219
167
 
220
- #### `async finalize(): Promise<DiarizationResult>`
168
+ /** Path to embedding GGUF model file. */
169
+ embModelPath: string;
221
170
 
222
- Processes any remaining audio (zero-padding partial chunks to match the offline pipeline's chunk count formula), then performs final clustering. This method produces byte-identical output to the offline `diarize()` method when given the same input audio.
171
+ /** Path to PLDA GGUF model file. */
172
+ pldaPath: string;
223
173
 
224
- Call this method when the audio stream has ended to get the final diarization result.
174
+ /** Path to embedding CoreML .mlpackage directory. */
175
+ coremlPath: string;
225
176
 
226
- The operation runs on a worker thread and is non-blocking.
177
+ /** Path to segmentation CoreML .mlpackage directory. */
178
+ segCoremlPath: string;
227
179
 
228
- **Returns:** `Promise<DiarizationResult>` Final diarization result
180
+ /** Path to Whisper GGUF model file. */
181
+ whisperModelPath: string;
229
182
 
230
- **Throws:**
231
- - `Error` if session is closed
183
+ /** Optional path to Silero VAD model file; enables silence compression. */
184
+ vadModelPath?: string;
232
185
 
233
- **Example:**
234
- ```typescript
235
- const finalResult = await session.finalize();
236
- console.log(`Final result: ${finalResult.segments.length} segments`);
237
- ```
186
+ /** Enable GPU for Whisper. Default: true. */
187
+ useGpu?: boolean;
238
188
 
239
- #### `close(): void`
189
+ /** Enable flash attention when supported. Default: true. */
190
+ flashAttn?: boolean;
240
191
 
241
- Releases all native resources associated with the streaming session. This method is idempotent and safe to call multiple times.
192
+ /** GPU device index. Default: 0. */
193
+ gpuDevice?: number;
242
194
 
243
- **Example:**
244
- ```typescript
245
- session.close();
246
- ```
195
+ /**
196
+ * Enable Whisper CoreML encoder.
197
+ * Default: false.
198
+ * Requires a matching `-encoder.mlmodelc` next to the GGUF model.
199
+ */
200
+ useCoreml?: boolean;
247
201
 
248
- #### `get isClosed: boolean`
202
+ /** Suppress Whisper native logs. Default: false. */
203
+ noPrints?: boolean;
249
204
 
250
- Indicates whether the session has been closed.
205
+ /** Number of decode threads. Default: 4. */
206
+ nThreads?: number;
251
207
 
252
- **Returns:** `boolean` `true` if the session is closed, `false` otherwise
208
+ /** Language code for transcription. Default: 'en'. Omit to defer to the model's auto-detect behavior. */
209
+ language?: string;
253
210
 
254
- ### Types
211
+ /** Translate to English. Default: false. */
212
+ translate?: boolean;
255
213
 
256
- #### `ModelConfig`
214
+ /** Force language detection pass. Default: false. */
215
+ detectLanguage?: boolean;
257
216
 
258
- Configuration object for loading diarization models.
217
+ /** Base sampling temperature. Default: 0.0 (greedy). */
218
+ temperature?: number;
259
219
 
260
- ```typescript
261
- interface ModelConfig {
262
- segModelPath: string; // Path to segmentation GGUF model file
263
- embModelPath: string; // Path to embedding GGUF model file
264
- pldaPath: string; // Path to PLDA GGUF model file
265
- coremlPath: string; // Path to embedding CoreML .mlpackage directory
266
- segCoremlPath: string; // Path to segmentation CoreML .mlpackage directory
267
- }
268
- ```
220
+ /** Temperature increment for fallback sampling. Default: 0.2. */
221
+ temperatureInc?: number;
269
222
 
270
- #### `VADChunk`
223
+ /** Disable temperature fallback ladder. Default: false. */
224
+ noFallback?: boolean;
271
225
 
272
- Voice activity detection result for a single 10-second audio chunk.
226
+ /** Beam size. Default: -1 (greedy with best_of). */
227
+ beamSize?: number;
273
228
 
274
- ```typescript
275
- interface VADChunk {
276
- chunkIndex: number; // Zero-based chunk number (increments every 1 second)
277
- startTime: number; // Absolute start time in seconds (chunkIndex * 1.0)
278
- duration: number; // Always 10.0 (chunk window size)
279
- numFrames: number; // Always 589 (segmentation model output frames)
280
- vad: Float32Array; // [589] frame-level voice activity: 1.0 if any speaker active, 0.0 otherwise
281
- }
282
- ```
229
+ /** Number of candidates in best-of sampling. Default: 5. */
230
+ bestOf?: number;
283
231
 
284
- The `vad` array contains 589 frames, each representing approximately 17ms of audio. A value of 1.0 indicates speech activity (any speaker), 0.0 indicates silence.
232
+ /** Compression/entropy threshold. Default: 2.4. */
233
+ entropyThold?: number;
285
234
 
286
- #### `Segment`
235
+ /** Average logprob threshold. Default: -1.0. */
236
+ logprobThold?: number;
287
237
 
288
- A contiguous speech segment with speaker label.
238
+ /** No-speech probability threshold. Default: 0.6. */
239
+ noSpeechThold?: number;
289
240
 
290
- ```typescript
291
- interface Segment {
292
- start: number; // Start time in seconds
293
- duration: number; // Duration in seconds
294
- speaker: string; // Speaker label (e.g., "SPEAKER_00", "SPEAKER_01", ...)
295
- }
296
- ```
241
+ /** Optional initial prompt text. Default: undefined. */
242
+ prompt?: string;
297
243
 
298
- #### `DiarizationResult`
244
+ /** Disable context carry-over between decode windows. Default: true. */
245
+ noContext?: boolean;
299
246
 
300
- Complete diarization output with speaker-labeled segments.
247
+ /** Suppress blank tokens. Default: true. */
248
+ suppressBlank?: boolean;
301
249
 
302
- ```typescript
303
- interface DiarizationResult {
304
- segments: Segment[]; // Array of segments, sorted by start time
250
+ /** Suppress non-speech tokens. Default: false. */
251
+ suppressNst?: boolean;
305
252
  }
306
- ```
307
253
 
308
- ## Usage Examples
254
+ export interface AlignedWord {
255
+ /** Word text (may include leading space from Whisper tokenization). */
256
+ text: string;
309
257
 
310
- ### Example 1: Offline Diarization
258
+ /** Word start time in seconds. */
259
+ start: number;
311
260
 
312
- Process an entire audio file and print a timeline of speaker segments.
313
-
314
- ```typescript
315
- import { Pyannote } from 'pyannote-cpp-node';
316
- import { readFileSync } from 'node:fs';
317
-
318
- // Helper to load 16-bit PCM WAV and convert to Float32Array
319
- function loadWavFile(filePath: string): Float32Array {
320
- const buffer = readFileSync(filePath);
321
- const view = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength);
322
-
323
- // Find data chunk
324
- let offset = 12; // Skip RIFF header
325
- while (offset < view.byteLength - 8) {
326
- const chunkId = String.fromCharCode(
327
- view.getUint8(offset),
328
- view.getUint8(offset + 1),
329
- view.getUint8(offset + 2),
330
- view.getUint8(offset + 3)
331
- );
332
- const chunkSize = view.getUint32(offset + 4, true);
333
- offset += 8;
334
-
335
- if (chunkId === 'data') {
336
- // Convert Int16 PCM to Float32 by dividing by 32768
337
- const numSamples = chunkSize / 2;
338
- const float32 = new Float32Array(numSamples);
339
- for (let i = 0; i < numSamples; i++) {
340
- float32[i] = view.getInt16(offset + i * 2, true) / 32768.0;
341
- }
342
- return float32;
343
- }
344
-
345
- offset += chunkSize;
346
- if (chunkSize % 2 !== 0) offset++; // Align to word boundary
347
- }
348
-
349
- throw new Error('No data chunk found in WAV file');
261
+ /** Word end time in seconds. */
262
+ end: number;
350
263
  }
351
264
 
352
- async function main() {
353
- // Load model
354
- const model = await Pyannote.load({
355
- segModelPath: './models/segmentation.gguf',
356
- embModelPath: './models/embedding.gguf',
357
- pldaPath: './models/plda.gguf',
358
- coremlPath: './models/embedding.mlpackage',
359
- segCoremlPath: './models/segmentation.mlpackage',
360
- });
361
-
362
- // Load audio
363
- const audio = loadWavFile('./audio.wav');
364
- console.log(`Loaded ${audio.length} samples (${(audio.length / 16000).toFixed(1)}s)`);
265
+ export interface AlignedSegment {
266
+ /** Global speaker label (for example, SPEAKER_00). */
267
+ speaker: string;
365
268
 
366
- // Diarize
367
- const result = await model.diarize(audio);
269
+ /** Segment start time in seconds. */
270
+ start: number;
368
271
 
369
- // Print timeline
370
- console.log(`\nDetected ${result.segments.length} segments:`);
371
- for (const segment of result.segments) {
372
- const startTime = segment.start.toFixed(2);
373
- const endTime = (segment.start + segment.duration).toFixed(2);
374
- console.log(`[${startTime}s - ${endTime}s] ${segment.speaker}`);
375
- }
272
+ /** Segment duration in seconds. */
273
+ duration: number;
376
274
 
377
- // Count speakers
378
- const speakers = new Set(result.segments.map(s => s.speaker));
379
- console.log(`\nTotal speakers: ${speakers.size}`);
275
+ /** Segment text (concatenated from words). */
276
+ text: string;
380
277
 
381
- model.close();
278
+ /** Word-level timestamps for the segment. */
279
+ words: AlignedWord[];
382
280
  }
383
281
 
384
- main();
282
+ export interface TranscriptionResult {
283
+ /** Full speaker-labeled transcript segments. */
284
+ segments: AlignedSegment[];
285
+ }
385
286
  ```
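
As a small, hypothetical illustration of how these types compose (not part of the package), the helper below totals speaking time per speaker from a `TranscriptionResult`:

```typescript
import type { TranscriptionResult } from 'pyannote-cpp-node';

// Sum segment durations per speaker label, e.g. { SPEAKER_00: 123.4, SPEAKER_01: 98.7 }.
function talkTimeBySpeaker(result: TranscriptionResult): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const seg of result.segments) {
    totals[seg.speaker] = (totals[seg.speaker] ?? 0) + seg.duration;
  }
  return totals;
}
```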
386
287
 
387
- ### Example 2: Streaming Diarization
288
+ ## Usage Examples
388
289
 
389
- Process audio incrementally in 1-second chunks, displaying real-time VAD.
290
+ ### One-shot transcription
390
291
 
391
292
  ```typescript
392
- import { Pyannote } from 'pyannote-cpp-node';
293
+ import { Pipeline } from 'pyannote-cpp-node';
393
294
 
394
- async function streamingDiarization() {
395
- const model = await Pyannote.load({
295
+ async function runOneShot(audio: Float32Array) {
296
+ const pipeline = await Pipeline.load({
396
297
  segModelPath: './models/segmentation.gguf',
397
298
  embModelPath: './models/embedding.gguf',
398
299
  pldaPath: './models/plda.gguf',
399
300
  coremlPath: './models/embedding.mlpackage',
400
301
  segCoremlPath: './models/segmentation.mlpackage',
302
+ whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
401
303
  });
402
304
 
403
- const session = model.createStreamingSession();
404
-
405
- // Load full audio file
406
- const audio = loadWavFile('./audio.wav');
407
-
408
- // Push audio in 1-second chunks (16,000 samples)
409
- const CHUNK_SIZE = 16000;
410
- let totalChunks = 0;
411
-
412
- for (let offset = 0; offset < audio.length; offset += CHUNK_SIZE) {
413
- const end = Math.min(offset + CHUNK_SIZE, audio.length);
414
- const chunk = audio.slice(offset, end);
415
-
416
- const vadChunks = await session.push(chunk);
417
-
418
- // VAD chunks are returned after first 10 seconds
419
- for (const vad of vadChunks) {
420
- // Count active frames (speech detected)
421
- const activeFrames = vad.vad.filter(v => v > 0.5).length;
422
- const speechRatio = (activeFrames / vad.numFrames * 100).toFixed(1);
423
-
424
- console.log(
425
- `Chunk ${vad.chunkIndex}: ${vad.startTime.toFixed(1)}s - ${(vad.startTime + vad.duration).toFixed(1)}s | ` +
426
- `Speech: ${speechRatio}%`
427
- );
428
- totalChunks++;
429
- }
430
- }
431
-
432
- console.log(`\nProcessed ${totalChunks} chunks`);
433
-
434
- // Get final diarization result
435
- console.log('\nFinalizing...');
436
- const result = await session.finalize();
305
+ const result = await pipeline.transcribe(audio);
437
306
 
438
- console.log(`\nFinal result: ${result.segments.length} segments`);
439
- for (const segment of result.segments) {
440
- console.log(
441
- `[${segment.start.toFixed(2)}s - ${(segment.start + segment.duration).toFixed(2)}s] ${segment.speaker}`
442
- );
307
+ for (const seg of result.segments) {
308
+ const end = seg.start + seg.duration;
309
+ console.log(`[${seg.speaker}] ${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
443
310
  }
444
311
 
445
- session.close();
446
- model.close();
312
+ pipeline.close();
447
313
  }
448
-
449
- streamingDiarization();
450
314
  ```
451
315
 
452
- ### Example 3: On-Demand Reclustering
453
-
454
- Push audio and trigger reclustering every 30 seconds to get intermediate results.
316
+ ### Streaming transcription
455
317
 
456
318
  ```typescript
457
- import { Pyannote } from 'pyannote-cpp-node';
319
+ import { Pipeline } from 'pyannote-cpp-node';
458
320
 
459
- async function reclusteringExample() {
460
- const model = await Pyannote.load({
321
+ async function runStreaming(audio: Float32Array) {
322
+ const pipeline = await Pipeline.load({
461
323
  segModelPath: './models/segmentation.gguf',
462
324
  embModelPath: './models/embedding.gguf',
463
325
  pldaPath: './models/plda.gguf',
464
326
  coremlPath: './models/embedding.mlpackage',
465
327
  segCoremlPath: './models/segmentation.mlpackage',
328
+ whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
466
329
  });
467
330
 
468
- const session = model.createStreamingSession();
469
- const audio = loadWavFile('./audio.wav');
470
-
471
- const CHUNK_SIZE = 16000; // 1 second
472
- const RECLUSTER_INTERVAL = 30; // Recluster every 30 seconds
473
-
474
- let secondsProcessed = 0;
475
-
476
- for (let offset = 0; offset < audio.length; offset += CHUNK_SIZE) {
477
- const end = Math.min(offset + CHUNK_SIZE, audio.length);
478
- const chunk = audio.slice(offset, end);
479
-
480
- await session.push(chunk);
481
- secondsProcessed++;
331
+ const session = pipeline.createSession();
332
+ session.on('segments', (segments) => {
333
+ const latest = segments[segments.length - 1];
334
+ if (latest) {
335
+ const end = latest.start + latest.duration;
336
+ console.log(`[live][${latest.speaker}] ${latest.start.toFixed(2)}-${end.toFixed(2)} ${latest.text.trim()}`);
337
+ }
338
+ });
482
339
 
483
- // Recluster every 30 seconds
484
- if (secondsProcessed % RECLUSTER_INTERVAL === 0) {
485
- console.log(`\n--- Reclustering at ${secondsProcessed}s ---`);
486
- const intermediateResult = await session.recluster();
487
-
488
- const speakers = new Set(intermediateResult.segments.map(s => s.speaker));
489
- console.log(`Current speakers detected: ${speakers.size}`);
490
- console.log(`Current segments: ${intermediateResult.segments.length}`);
340
+ const chunkSize = 16000;
341
+ for (let i = 0; i < audio.length; i += chunkSize) {
342
+ const chunk = audio.slice(i, Math.min(i + chunkSize, audio.length));
343
+ const vad = await session.push(chunk);
344
+ if (vad.length > 0) {
345
+ const speechFrames = vad.filter(Boolean).length;
346
+ console.log(`VAD frames: ${vad.length}, speech frames: ${speechFrames}`);
491
347
  }
492
348
  }
493
349
 
494
- // Final result
495
- console.log('\n--- Final result ---');
496
350
  const finalResult = await session.finalize();
497
- const speakers = new Set(finalResult.segments.map(s => s.speaker));
498
- console.log(`Total speakers: ${speakers.size}`);
499
- console.log(`Total segments: ${finalResult.segments.length}`);
351
+ console.log(`Final segments: ${finalResult.segments.length}`);
500
352
 
501
353
  session.close();
502
- model.close();
354
+ pipeline.close();
503
355
  }
504
-
505
- reclusteringExample();
506
356
  ```
507
357
 
508
- ### Example 4: Generating RTTM Output
509
-
510
- Format diarization results into standard RTTM (Rich Transcription Time Marked) format.
358
+ ### Custom Whisper decode options
511
359
 
512
360
  ```typescript
513
- import { Pyannote, type DiarizationResult } from 'pyannote-cpp-node';
514
- import { writeFileSync } from 'node:fs';
515
-
516
- function toRTTM(result: DiarizationResult, filename: string = 'audio'): string {
517
- const lines = result.segments.map(segment => {
518
- // RTTM format: SPEAKER <file> <chnl> <tbeg> <tdur> <ortho> <stype> <name> <conf> <slat>
519
- return [
520
- 'SPEAKER',
521
- filename,
522
- '1',
523
- segment.start.toFixed(3),
524
- segment.duration.toFixed(3),
525
- '<NA>',
526
- '<NA>',
527
- segment.speaker,
528
- '<NA>',
529
- '<NA>',
530
- ].join(' ');
531
- });
532
-
533
- return lines.join('\n') + '\n';
534
- }
361
+ import { Pipeline } from 'pyannote-cpp-node';
535
362
 
536
- async function generateRTTM() {
537
- const model = await Pyannote.load({
538
- segModelPath: './models/segmentation.gguf',
539
- embModelPath: './models/embedding.gguf',
540
- pldaPath: './models/plda.gguf',
541
- coremlPath: './models/embedding.mlpackage',
542
- segCoremlPath: './models/segmentation.mlpackage',
543
- });
544
-
545
- const audio = loadWavFile('./audio.wav');
546
- const result = await model.diarize(audio);
547
-
548
- // Generate RTTM
549
- const rttm = toRTTM(result, 'audio');
550
-
551
- // Write to file
552
- writeFileSync('./output.rttm', rttm);
553
- console.log('RTTM file written to output.rttm');
554
-
555
- // Also print to console
556
- console.log('\nRTTM output:');
557
- console.log(rttm);
363
+ const pipeline = await Pipeline.load({
364
+ segModelPath: './models/segmentation.gguf',
365
+ embModelPath: './models/embedding.gguf',
366
+ pldaPath: './models/plda.gguf',
367
+ coremlPath: './models/embedding.mlpackage',
368
+ segCoremlPath: './models/segmentation.mlpackage',
369
+ whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
370
+
371
+ // Whisper runtime options
372
+ useGpu: true,
373
+ flashAttn: true,
374
+ gpuDevice: 0,
375
+ useCoreml: false,
376
+
377
+ // Decode strategy
378
+ nThreads: 8,
379
+ language: 'ko',
380
+ translate: false,
381
+ detectLanguage: false,
382
+ temperature: 0.0,
383
+ temperatureInc: 0.2,
384
+ noFallback: false,
385
+ beamSize: 5,
386
+ bestOf: 5,
387
+
388
+ // Thresholds and context
389
+ entropyThold: 2.4,
390
+ logprobThold: -1.0,
391
+ noSpeechThold: 0.6,
392
+ prompt: 'Meeting transcript with technical terminology.',
393
+ noContext: true,
394
+ suppressBlank: true,
395
+ suppressNst: false,
396
+ });
397
+ ```
558
398
 
559
- model.close();
399
+ ## JSON Output Format
400
+
401
+ The pipeline returns this JSON shape:
402
+
403
+ ```json
404
+ {
405
+ "segments": [
406
+ {
407
+ "speaker": "SPEAKER_00",
408
+ "start": 0.497000,
409
+ "duration": 2.085000,
410
+ "text": "Hello world",
411
+ "words": [
412
+ {"text": " Hello", "start": 0.500000, "end": 0.800000},
413
+ {"text": " world", "start": 0.900000, "end": 1.200000}
414
+ ]
415
+ }
416
+ ]
560
417
  }
561
-
562
- generateRTTM();
563
418
  ```
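
As an example consumer (a sketch, not part of the package), the helper below flattens this shape into one speaker-tagged line per word:

```typescript
import type { TranscriptionResult } from 'pyannote-cpp-node';

// Produce lines like "0.50-0.80 SPEAKER_00 Hello" from the word-level timestamps.
function toWordLines(result: TranscriptionResult): string {
  const lines: string[] = [];
  for (const seg of result.segments) {
    for (const word of seg.words) {
      lines.push(`${word.start.toFixed(2)}-${word.end.toFixed(2)} ${seg.speaker} ${word.text.trim()}`);
    }
  }
  return lines.join('\n');
}
```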
564
419
 
565
- ## Architecture
566
-
567
- The diarization pipeline consists of four main stages:
568
-
569
- ### 1. Segmentation (SincNet + BiLSTM)
570
-
571
- The segmentation model processes 10-second audio windows and outputs 7-class powerset logits for 589 frames (approximately one frame every 17ms). The model architecture:
572
-
573
- - **SincNet**: Learnable sinc filter bank for feature extraction
574
- - **4-layer BiLSTM**: Bidirectional long short-term memory layers
575
- - **Linear classifier**: Projects to 7 powerset classes with log-softmax
576
-
577
- The 7 powerset classes represent all possible combinations of up to 3 simultaneous speakers:
578
- - Class 0: silence (no speakers)
579
- - Classes 1-3: single speakers
580
- - Classes 4-6: speaker overlaps
581
-
582
- ### 2. Powerset Decoding
583
-
584
- Converts the 7-class powerset predictions into binary speaker activity for 3 local speakers per chunk. Each frame is decoded to indicate which of the 3 local speaker "slots" are active.
585
-
586
- ### 3. Embedding Extraction (WeSpeaker ResNet34)
587
-
588
- For each active speaker in each chunk, the embedding model extracts a 256-dimensional speaker vector:
589
-
590
- - **Mel filterbank**: 80-bin log-mel spectrogram features
591
- - **ResNet34**: Deep residual network for speaker representation
592
- - **Output**: 256-dimensional L2-normalized embedding
593
-
594
- Silent speakers receive NaN embeddings, which are filtered before clustering.
595
-
596
- ### 4. Clustering (PLDA + AHC + VBx)
597
-
598
- The final stage maps local speaker labels to global speaker identities:
599
-
600
- - **PLDA transformation**: Probabilistic Linear Discriminant Analysis projects embeddings from 256 to 128 dimensions
601
- - **Agglomerative Hierarchical Clustering (AHC)**: fastcluster implementation with O(n²) complexity, using centroid linkage and a distance threshold of 0.6
602
- - **VBx refinement**: Variational Bayes diarization with parameters FA=0.07, FB=0.8, maximum 20 iterations
603
-
604
- The clustering stage computes speaker centroids and assigns each embedding to the closest centroid while respecting the constraint that two local speakers in the same chunk cannot map to the same global speaker.
605
-
606
- ### CoreML Acceleration
607
-
608
- Both neural networks run on Apple's CoreML framework, which automatically distributes computation across:
609
-
610
- - **Neural Engine**: Dedicated ML accelerator on Apple Silicon
611
- - **GPU**: Metal-accelerated operations
612
- - **CPU**: Fallback for unsupported operations
613
-
614
- CoreML models use Float16 computation for optimal performance while maintaining accuracy within acceptable bounds (cosine similarity > 0.999 vs Float32).
615
-
616
- ### Streaming Architecture
617
-
618
- The streaming API uses a sliding 10-second window with a 1-second hop (9 seconds of overlap between consecutive chunks). Three data stores maintain the state:
619
-
620
- - **`audio_buffer`**: Sliding window (~10s, ~640 KB for 1 hour) — old samples are discarded
621
- - **`embeddings`**: Grows forever (~11 MB for 1 hour) — stores 3 × 256-dim vectors per chunk (NaN for silent speakers)
622
- - **`binarized`**: Grows forever (~25 MB for 1 hour) — stores 589 × 3 binary activity masks per chunk
623
-
624
- During reclustering, all accumulated embeddings are used to compute soft cluster assignments, and all binarized segmentations are used to reconstruct the global timeline. This is why the `embeddings` and `binarized` arrays must persist for the entire session.
625
-
626
- ### Constants
627
-
628
- | Constant | Value | Description |
629
- |----------|-------|-------------|
630
- | SAMPLE_RATE | 16000 Hz | Audio sample rate |
631
- | CHUNK_SAMPLES | 160000 | 10-second window size |
632
- | STEP_SAMPLES | 16000 | 1-second hop between chunks |
633
- | FRAMES_PER_CHUNK | 589 | Segmentation output frames |
634
- | NUM_LOCAL_SPEAKERS | 3 | Maximum speakers per chunk |
635
- | EMBEDDING_DIM | 256 | Speaker embedding dimension |
636
- | FBANK_NUM_BINS | 80 | Mel filterbank bins |
637
-
638
420
  ## Audio Format Requirements
639
421
 
640
- The library expects raw PCM audio in a specific format:
641
-
642
- - **Sample rate**: 16000 Hz (16 kHz) — **required**
643
- - **Channels**: Mono (single channel) — **required**
644
- - **Format**: `Float32Array` with values in the range **[-1.0, 1.0]**
645
-
646
- The library does **not** handle audio decoding. You must provide raw PCM samples.
647
-
648
- ### Loading Audio Files
649
-
650
- For WAV files, you can use the `loadWavFile` function from Example 1, or use third-party libraries:
651
-
652
- ```bash
653
- npm install node-wav
654
- ```
655
-
656
- ```typescript
657
- import { read } from 'node-wav';
658
- import { readFileSync } from 'node:fs';
659
-
660
- const buffer = readFileSync('./audio.wav');
661
- const wav = read(buffer);
662
-
663
- // Convert to mono if stereo
664
- const mono = wav.channelData.length > 1
665
- ? wav.channelData[0].map((v, i) => (v + wav.channelData[1][i]) / 2)
666
- : wav.channelData[0];
667
-
668
- // Resample to 16kHz if needed (using a resampling library)
669
- // ...
670
-
671
- const audio = new Float32Array(mono);
672
- ```
673
-
674
- For other audio formats (MP3, M4A, etc.), use ffmpeg to convert to 16kHz mono WAV first:
675
-
676
- ```bash
677
- ffmpeg -i input.mp3 -ar 16000 -ac 1 -f f32le -acodec pcm_f32le - | \
678
- node process.js
679
- ```
422
+ - Input must be `Float32Array`
423
+ - Sample rate must be `16000` Hz
424
+ - Audio must be mono
425
+ - Recommended amplitude range: `[-1.0, 1.0]`
680
426
 
681
- ## Important Notes and Caveats
427
+ All API methods expect decoded PCM samples; file decoding/resampling is handled by the caller.
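
One common approach (a sketch, assuming you pre-convert files with ffmpeg, e.g. `ffmpeg -i input.mp3 -ar 16000 -ac 1 -f f32le -acodec pcm_f32le audio.f32`) is to read raw float32 PCM directly:

```typescript
import { readFileSync } from 'node:fs';

// Read raw little-endian float32 PCM (as produced by the ffmpeg command above)
// into the Float32Array the pipeline expects.
function loadRawFloat32(path: string): Float32Array {
  const buf = readFileSync(path);
  const samples = new Float32Array(Math.floor(buf.byteLength / 4));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = buf.readFloatLE(i * 4);
  }
  return samples;
}
```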
682
428
 
683
- ### Platform Limitations
684
-
685
- - **macOS only**: The library requires CoreML for neural network inference. There is currently no fallback implementation for other platforms.
686
- - **No Linux/Windows support**: CoreML is exclusive to Apple platforms.
687
-
688
- ### `recluster()` Mutates State
689
-
690
- The `recluster()` method overwrites the internal session state, specifically replacing the `embeddings` and chunk index arrays with filtered versions (excluding NaN embeddings from silent speakers). This means:
691
-
692
- - Calling `push()` after `recluster()` may produce incorrect results
693
- - Subsequent `recluster()` calls may not work as expected
694
- - The data structure assumes the original unfiltered layout (3 embeddings per chunk)
695
-
696
- **Best practice**: Use `recluster()` sparingly for live progress updates (e.g., every 30 seconds), or avoid it entirely and only call `finalize()` when the stream ends.
697
-
698
- ### Operations Are Serialized
699
-
700
- Operations on a streaming session are serialized internally. Do not call `push()` while another `push()`, `recluster()`, or `finalize()` is in progress. Wait for the Promise to resolve before making the next call.
701
-
702
- ### Resource Management
703
-
704
- - **Close sessions before models**: Always close streaming sessions before closing the parent model
705
- - **Idempotent close**: Both `model.close()` and `session.close()` are safe to call multiple times
706
- - **No reuse after close**: Once closed, models and sessions cannot be reused
707
-
708
- ### Model Loading
709
-
710
- - **Path validation**: `Pyannote.load()` validates that all paths exist using `fs.accessSync()` before initialization
711
- - **CoreML compilation**: The CoreML framework compiles `.mlpackage` models internally on first load (typically fast, ~100ms)
712
- - **No explicit loading step**: Model weights are loaded synchronously in the constructor
713
-
714
- ### Threading Model
715
-
716
- All heavy operations (`diarize`, `push`, `recluster`, `finalize`) run on libuv worker threads and never block the Node.js event loop. However, the operations do hold native locks internally, so concurrent operations on the same session are serialized.
717
-
718
- ### Memory Usage
429
+ ## Architecture
719
430
 
720
- For a 1-hour audio file:
721
- - `audio_buffer`: ~640 KB (sliding window)
722
- - `embeddings`: ~11 MB (grows throughout session)
723
- - `binarized`: ~25 MB (grows throughout session)
724
- - CoreML models: ~50 MB (loaded once per model)
431
+ The integrated pipeline runs in 7 stages:
725
432
 
726
- Total memory footprint: approximately 100 MB for a 1-hour streaming session.
433
+ 1. VAD silence filter (optional compression of long silence)
434
+ 2. Audio buffer (stream-safe FIFO with timestamp tracking)
435
+ 3. Segmentation (speech activity over rolling windows)
436
+ 4. Transcription (Whisper sentence + word timestamps)
437
+ 5. Alignment (segment-level speaker assignment by overlap)
438
+ 6. Finalize (flush + final recluster + final alignment)
439
+ 7. Callback/event emission (`segments` updates)
727
440
 
728
441
  ## Performance
729
442
 
730
- Measured on Apple M2 Pro with 16 GB RAM:
731
-
732
- | Component | Time per Chunk | Notes |
733
- |-----------|----------------|-------|
734
- | Segmentation (CoreML) | ~12ms | 10-second audio window, 589 frames |
735
- | Embedding (CoreML) | ~13ms | Per speaker per chunk (up to 3 speakers) |
736
- | AHC Clustering | ~0.8s | 3000 embeddings (1000 chunks) |
737
- | VBx Refinement | ~1.2s | 20 iterations, 3000 embeddings |
738
- | **Full Pipeline (offline)** | **39x real-time** | 45-minute audio processed in 70 seconds |
739
-
740
- ### Streaming Performance
443
+ - Diarization only: **39x real-time**
444
+ - Integrated transcription + diarization: **~14.6x real-time**
445
+ - 45-minute Korean meeting test (6 speakers): **2713s audio in 186s**
446
+ - Alignment reduction: **701 Whisper segments -> 186 aligned speaker segments**
447
+ - Speaker confusion rate: **2.55%**
741
448
 
742
- - **First chunk latency**: 10 seconds (requires full window)
743
- - **Incremental latency**: ~30ms per 1-second push (after first chunk)
744
- - **Recluster latency**: ~2 seconds for 30 minutes of audio (~1800 embeddings)
449
+ ## Platform Support
745
450
 
746
- Streaming mode has higher per-chunk overhead due to the incremental nature but enables real-time applications.
747
-
748
- ## Supported Platforms
749
-
750
- | Platform | Architecture | Status |
751
- |----------|--------------|--------|
752
- | macOS | arm64 (Apple Silicon) | ✅ Supported |
753
- | macOS | x64 (Intel) | 🔜 Planned |
754
- | Linux | any | ❌ Not supported (CoreML unavailable) |
755
- | Windows | any | ❌ Not supported (CoreML unavailable) |
756
-
757
- Intel macOS support is planned but not yet available. The CoreML dependency makes cross-platform support challenging without alternative inference backends.
451
+ | Platform | Status |
452
+ | --- | --- |
453
+ | macOS arm64 (Apple Silicon) | Supported |
454
+ | macOS x64 (Intel) | Supported |
455
+ | Linux | Not supported |
456
+ | Windows | Not supported |
758
457
 
759
458
  ## License
760
459
 
761
460
  MIT
762
-
763
- ---
764
-
765
- For issues, feature requests, or contributions, please visit the [GitHub repository](https://github.com/predict-woo/pyannote-ggml).