pyannote-cpp-node 0.1.0 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +277 -582
- package/dist/Pipeline.d.ts +12 -0
- package/dist/Pipeline.d.ts.map +1 -0
- package/dist/Pipeline.js +48 -0
- package/dist/Pipeline.js.map +1 -0
- package/dist/PipelineSession.d.ts +18 -0
- package/dist/PipelineSession.d.ts.map +1 -0
- package/dist/PipelineSession.js +38 -0
- package/dist/PipelineSession.js.map +1 -0
- package/dist/binding.d.ts +9 -10
- package/dist/binding.d.ts.map +1 -1
- package/dist/binding.js +3 -3
- package/dist/binding.js.map +1 -1
- package/dist/index.d.ts +5 -3
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +3 -2
- package/dist/index.js.map +1 -1
- package/dist/types.d.ts +65 -10
- package/dist/types.d.ts.map +1 -1
- package/package.json +3 -3
package/README.md
CHANGED
@@ -1,43 +1,45 @@
  # pyannote-cpp-node

- Node.js native bindings for real-time speaker diarization
-



+ Node.js native bindings for integrated Whisper transcription + speaker diarization with speaker-labeled, word-level output.
+
  ## Overview

- `pyannote-cpp-node`
+ `pyannote-cpp-node` exposes the integrated C++ pipeline that combines streaming diarization and Whisper transcription into a single API.

-
+ Given 16 kHz mono PCM audio (`Float32Array`), it produces cumulative and final transcript segments shaped as:

- -
- -
+ - speaker label (`SPEAKER_00`, `SPEAKER_01`, ...)
+ - segment start/duration in seconds
+ - segment text
+ - per-word timestamps

- All heavy operations are asynchronous and run on libuv worker threads
+ The API supports both one-shot processing (`transcribe`) and incremental streaming (`createSession` + `push`/`finalize`). All heavy operations are asynchronous and run on libuv worker threads.

  ## Features

- -
- -
- -
- -
- -
- -
- -
+ - Integrated transcription + diarization in one pipeline
+ - Speaker-labeled, word-level transcript output
+ - One-shot and streaming APIs with the same output schema
+ - Incremental `segments` events for live applications
+ - Deterministic output for the same audio/models/config
+ - CoreML-accelerated inference on macOS
+ - TypeScript-first API with complete type definitions

  ## Requirements

- -
- -
- -
- - Segmentation GGUF
- - Embedding GGUF
- - PLDA GGUF
- -
- -
-
-
+ - macOS (Apple Silicon or Intel)
+ - Node.js >= 18
+ - Model files:
+   - Segmentation GGUF (`segModelPath`)
+   - Embedding GGUF (`embModelPath`)
+   - PLDA GGUF (`pldaPath`)
+   - Embedding CoreML `.mlpackage` (`coremlPath`)
+   - Segmentation CoreML `.mlpackage` (`segCoremlPath`)
+   - Whisper GGUF (`whisperModelPath`)
+   - Optional Silero VAD model (`vadModelPath`)

  ## Installation

@@ -45,721 +47,414 @@ Model files can be obtained by converting the original PyTorch models using the
  npm install pyannote-cpp-node
  ```

- Or with pnpm:
-
  ```bash
  pnpm add pyannote-cpp-node
  ```

- The package
+ The package installs a platform-specific native addon through `optionalDependencies`.

  ## Quick Start

  ```typescript
- import {
- import { readFileSync } from 'node:fs';
+ import { Pipeline } from 'pyannote-cpp-node';

-
- const model = await Pyannote.load({
+ const pipeline = await Pipeline.load({
    segModelPath: './models/segmentation.gguf',
    embModelPath: './models/embedding.gguf',
    pldaPath: './models/plda.gguf',
    coremlPath: './models/embedding.mlpackage',
    segCoremlPath: './models/segmentation.mlpackage',
+   whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+   language: 'en',
  });

-
- const
-
- // Run diarization
- const result = await model.diarize(audio);
+ const audio = loadAudioAsFloat32Array('./audio-16khz-mono.wav');
+ const result = await pipeline.transcribe(audio);

- // Print results
  for (const segment of result.segments) {
+   const end = segment.start + segment.duration;
    console.log(
-     `[${segment.
+     `[${segment.speaker}] ${segment.start.toFixed(2)}-${end.toFixed(2)} ${segment.text.trim()}`
    );
  }

-
- model.close();
+ pipeline.close();
  ```
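The new Quick Start calls a `loadAudioAsFloat32Array` helper that the 0.2.0 README never defines. A minimal sketch of one possible implementation, adapted from the `loadWavFile` helper that the 0.1.0 README shipped (removed later in this diff); it assumes a 16-bit PCM, 16 kHz mono WAV input:

```typescript
import { readFileSync } from 'node:fs';

// Hypothetical helper: parse a 16-bit PCM WAV file and normalize samples to [-1.0, 1.0].
function loadAudioAsFloat32Array(filePath: string): Float32Array {
  const buffer = readFileSync(filePath);
  const view = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength);

  let offset = 12; // skip the RIFF/WAVE header
  while (offset < view.byteLength - 8) {
    const chunkId = String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3),
    );
    const chunkSize = view.getUint32(offset + 4, true);
    offset += 8;

    if (chunkId === 'data') {
      // Convert Int16 PCM to Float32 by dividing by 32768
      const numSamples = chunkSize / 2;
      const float32 = new Float32Array(numSamples);
      for (let i = 0; i < numSamples; i++) {
        float32[i] = view.getInt16(offset + i * 2, true) / 32768.0;
      }
      return float32;
    }

    offset += chunkSize;
    if (chunkSize % 2 !== 0) offset++; // chunks are word-aligned
  }

  throw new Error('No data chunk found in WAV file');
}
```

Any decoding path works, as long as it yields 16 kHz mono `Float32Array` samples in `[-1.0, 1.0]`.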

  ## API Reference

- ### `
-
- The main entry point for loading diarization models.
+ ### `Pipeline`

- #### `static async load(config: ModelConfig): Promise<Pyannote>`
-
- Factory method for loading a diarization model. Validates that all model paths exist before initializing. CoreML model compilation happens synchronously during initialization and is typically fast.
-
- **Parameters:**
- - `config: ModelConfig` — Configuration object with paths to all required model files
-
- **Returns:** `Promise<Pyannote>` — Initialized model instance
-
- **Throws:**
- - `Error` if any model path does not exist or is invalid
-
- **Example:**
  ```typescript
-
-
-
-
-
-
- }
+ class Pipeline {
+   static async load(config: ModelConfig): Promise<Pipeline>;
+   async transcribe(audio: Float32Array): Promise<TranscriptionResult>;
+   createSession(): PipelineSession;
+   close(): void;
+   get isClosed(): boolean;
+ }
  ```

- #### `async
+ #### `static async load(config: ModelConfig): Promise<Pipeline>`

-
+ Validates model paths and initializes native pipeline resources.

-
-
- **Parameters:**
- - `audio: Float32Array` — Audio samples (16kHz mono, values in [-1.0, 1.0])
-
- **Returns:** `Promise<DiarizationResult>` — Diarization result with speaker-labeled segments sorted by start time
-
- **Throws:**
- - `Error` if model is closed
- - `TypeError` if audio is not a `Float32Array`
- - `Error` if audio is empty
-
- **Example:**
- ```typescript
- const result = await model.diarize(audio);
- console.log(`Detected ${result.segments.length} segments`);
- ```
+ #### `async transcribe(audio: Float32Array): Promise<TranscriptionResult>`

-
+ Runs one-shot transcription + diarization on the full audio buffer.

-
+ #### `createSession(): PipelineSession`

-
+ Creates an independent streaming session for incremental processing.

-
- - `Error` if model is closed
+ #### `close(): void`

-
- ```typescript
- const session = model.createStreamingSession();
- ```
+ Releases native resources. Safe to call multiple times.

- #### `
+ #### `get isClosed(): boolean`

-
+ Returns `true` after `close()`.
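`close()` is idempotent, but it still has to run on every code path, including errors. A minimal sketch of a wrapper that guarantees cleanup, assuming `ModelConfig` is exported from the package root alongside `Pipeline` (the `export interface` declarations in the Types section suggest it is):

```typescript
import { Pipeline, type ModelConfig } from 'pyannote-cpp-node';

// Hypothetical convenience wrapper: load, run, and always release the pipeline.
async function withPipeline<T>(
  config: ModelConfig,
  fn: (pipeline: Pipeline) => Promise<T>,
): Promise<T> {
  const pipeline = await Pipeline.load(config);
  try {
    return await fn(pipeline);
  } finally {
    pipeline.close(); // idempotent, so safe even if fn already closed it
  }
}
```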

-
+ ### `PipelineSession` (extends `EventEmitter`)

- **Example:**
  ```typescript
-
-
+ class PipelineSession extends EventEmitter {
+   async push(audio: Float32Array): Promise<boolean[]>;
+   async finalize(): Promise<TranscriptionResult>;
+   close(): void;
+   get isClosed(): boolean;
+   // Event: 'segments' -> (segments: AlignedSegment[], audio: Float32Array)
+ }
  ```

- #### `
+ #### `async push(audio: Float32Array): Promise<boolean[]>`

-
+ Pushes an arbitrary number of samples into the streaming pipeline.

-
+ - Return value is per-frame VAD booleans (`true` = speech, `false` = silence)
+ - First 10 seconds return an empty array because the pipeline needs a full 10-second window
+ - Chunk size is flexible; not restricted to 16,000-sample pushes
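A minimal sketch of the warm-up behavior described above, assuming `session` comes from `pipeline.createSession()` and `audio` is a 16 kHz mono `Float32Array`:

```typescript
const SAMPLE_RATE = 16000;
const CHUNK = SAMPLE_RATE; // 1-second pushes, though any chunk size works

for (let i = 0; i < audio.length; i += CHUNK) {
  const vad = await session.push(audio.slice(i, i + CHUNK));
  if (vad.length === 0) {
    // Expected until 10 * SAMPLE_RATE samples (10 s) have been pushed:
    // the first full 10-second window has not been filled yet.
  } else {
    // vad[k] === true marks frame k of the newly evaluated window as speech.
  }
}
```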

-
+ #### `async finalize(): Promise<TranscriptionResult>`

-
+ Flushes all stages, runs final recluster + alignment, and returns the definitive result.

- #### `
-
- Pushes audio samples to the streaming session. Audio must be 16kHz mono `Float32Array`. Typically, push 1 second of audio (16,000 samples) at a time.
+ #### `close(): void`

-
+ Releases native session resources. Safe to call multiple times.

-
+ #### `get isClosed(): boolean`

-
- - `audio: Float32Array` — Audio samples (16kHz mono, values in [-1.0, 1.0])
+ Returns `true` after `close()`.

-
+ #### Event: `'segments'`

-
- - `Error` if session is closed
- - `TypeError` if audio is not a `Float32Array`
+ Emitted after each Whisper transcription result with the latest cumulative aligned output.

- **Example:**
  ```typescript
-
-
-
- }
+ session.on('segments', (segments: AlignedSegment[], audio: Float32Array) => {
+   // `segments` contains the latest cumulative speaker-labeled transcript
+   // `audio` contains the chunk submitted for this callback cycle
+ });
  ```

-
-
- Triggers full clustering on all accumulated audio data. This runs the complete diarization pipeline (embedding extraction → PLDA scoring → hierarchical clustering → VBx refinement → speaker assignment) and returns speaker-labeled segments with global speaker IDs.
-
- **Warning:** This method mutates the internal session state. Specifically, it replaces the internal embedding and chunk index arrays with filtered versions (excluding silent speakers). Calling `push` after `recluster` may produce unexpected results. Use `recluster` sparingly (e.g., every 30 seconds for live progress updates) or only call `finalize` when the stream ends.
-
- The operation runs on a worker thread and is non-blocking.
-
- **Returns:** `Promise<DiarizationResult>` — Complete diarization result with global speaker labels
-
- **Throws:**
- - `Error` if session is closed
+ ### Types

- **Example:**
  ```typescript
-
-
-
- ```
+ export interface ModelConfig {
+   /** Path to segmentation GGUF model file. */
+   segModelPath: string;

-
+   /** Path to embedding GGUF model file. */
+   embModelPath: string;

-
+   /** Path to PLDA GGUF model file. */
+   pldaPath: string;

-
+   /** Path to embedding CoreML .mlpackage directory. */
+   coremlPath: string;

-
+   /** Path to segmentation CoreML .mlpackage directory. */
+   segCoremlPath: string;

-
+   /** Path to Whisper GGUF model file. */
+   whisperModelPath: string;

-
-
+   /** Optional path to Silero VAD model file; enables silence compression. */
+   vadModelPath?: string;

-
-
- const finalResult = await session.finalize();
- console.log(`Final result: ${finalResult.segments.length} segments`);
- ```
+   /** Enable GPU for Whisper. Default: true. */
+   useGpu?: boolean;

-
+   /** Enable flash attention when supported. Default: true. */
+   flashAttn?: boolean;

-
+   /** GPU device index. Default: 0. */
+   gpuDevice?: number;

-
-
-
-
+   /**
+    * Enable Whisper CoreML encoder.
+    * Default: false.
+    * Requires a matching `-encoder.mlmodelc` next to the GGUF model.
+    */
+   useCoreml?: boolean;

-
+   /** Suppress Whisper native logs. Default: false. */
+   noPrints?: boolean;

-
+   /** Number of decode threads. Default: 4. */
+   nThreads?: number;

-
+   /** Language code for transcription. Default: 'en'. Omit to fall back to the model's auto-detect behavior. */
+   language?: string;

-
+   /** Translate output to English. Default: false. */
+   translate?: boolean;

-
+   /** Force a language-detection pass. Default: false. */
+   detectLanguage?: boolean;

-
+   /** Base sampling temperature. Default: 0.0 (greedy). */
+   temperature?: number;

-
-
-   segModelPath: string;    // Path to segmentation GGUF model file
-   embModelPath: string;    // Path to embedding GGUF model file
-   pldaPath: string;        // Path to PLDA GGUF model file
-   coremlPath: string;      // Path to embedding CoreML .mlpackage directory
-   segCoremlPath: string;   // Path to segmentation CoreML .mlpackage directory
- }
- ```
+   /** Temperature increment for fallback sampling. Default: 0.2. */
+   temperatureInc?: number;

-
+   /** Disable the temperature fallback ladder. Default: false. */
+   noFallback?: boolean;

-
+   /** Beam size. Default: -1 (greedy with best_of). */
+   beamSize?: number;

-
-
-   chunkIndex: number;   // Zero-based chunk number (increments every 1 second)
-   startTime: number;    // Absolute start time in seconds (chunkIndex * 1.0)
-   duration: number;     // Always 10.0 (chunk window size)
-   numFrames: number;    // Always 589 (segmentation model output frames)
-   vad: Float32Array;    // [589] frame-level voice activity: 1.0 if any speaker active, 0.0 otherwise
- }
- ```
+   /** Number of candidates in best-of sampling. Default: 5. */
+   bestOf?: number;

-
+   /** Entropy (compression) threshold. Default: 2.4. */
+   entropyThold?: number;

-
+   /** Average log-probability threshold. Default: -1.0. */
+   logprobThold?: number;

-
+   /** No-speech probability threshold. Default: 0.6. */
+   noSpeechThold?: number;

-
-
-   start: number;      // Start time in seconds
-   duration: number;   // Duration in seconds
-   speaker: string;    // Speaker label (e.g., "SPEAKER_00", "SPEAKER_01", ...)
- }
- ```
+   /** Optional initial prompt text. Default: undefined. */
+   prompt?: string;

-
+   /** Disable context carry-over between decode windows. Default: true. */
+   noContext?: boolean;

-
+   /** Suppress blank tokens. Default: true. */
+   suppressBlank?: boolean;

-
-
-   segments: Segment[];   // Array of segments, sorted by start time
+   /** Suppress non-speech tokens. Default: false. */
+   suppressNst?: boolean;
  }
- ```

-
+ export interface AlignedWord {
+   /** Word text (may include leading space from Whisper tokenization). */
+   text: string;

-
+   /** Word start time in seconds. */
+   start: number;

-
-
- ```typescript
- import { Pyannote } from 'pyannote-cpp-node';
- import { readFileSync } from 'node:fs';
-
- // Helper to load 16-bit PCM WAV and convert to Float32Array
- function loadWavFile(filePath: string): Float32Array {
-   const buffer = readFileSync(filePath);
-   const view = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength);
-
-   // Find data chunk
-   let offset = 12; // Skip RIFF header
-   while (offset < view.byteLength - 8) {
-     const chunkId = String.fromCharCode(
-       view.getUint8(offset),
-       view.getUint8(offset + 1),
-       view.getUint8(offset + 2),
-       view.getUint8(offset + 3)
-     );
-     const chunkSize = view.getUint32(offset + 4, true);
-     offset += 8;
-
-     if (chunkId === 'data') {
-       // Convert Int16 PCM to Float32 by dividing by 32768
-       const numSamples = chunkSize / 2;
-       const float32 = new Float32Array(numSamples);
-       for (let i = 0; i < numSamples; i++) {
-         float32[i] = view.getInt16(offset + i * 2, true) / 32768.0;
-       }
-       return float32;
-     }
-
-     offset += chunkSize;
-     if (chunkSize % 2 !== 0) offset++; // Align to word boundary
-   }
-
-   throw new Error('No data chunk found in WAV file');
+   /** Word end time in seconds. */
+   end: number;
  }

-
-
-
-   segModelPath: './models/segmentation.gguf',
-   embModelPath: './models/embedding.gguf',
-   pldaPath: './models/plda.gguf',
-   coremlPath: './models/embedding.mlpackage',
-   segCoremlPath: './models/segmentation.mlpackage',
- });
-
- // Load audio
- const audio = loadWavFile('./audio.wav');
- console.log(`Loaded ${audio.length} samples (${(audio.length / 16000).toFixed(1)}s)`);
+ export interface AlignedSegment {
+   /** Global speaker label (for example, SPEAKER_00). */
+   speaker: string;

-
-
+   /** Segment start time in seconds. */
+   start: number;

-
-
- for (const segment of result.segments) {
-   const startTime = segment.start.toFixed(2);
-   const endTime = (segment.start + segment.duration).toFixed(2);
-   console.log(`[${startTime}s - ${endTime}s] ${segment.speaker}`);
- }
+   /** Segment duration in seconds. */
+   duration: number;

-
-
- console.log(`\nTotal speakers: ${speakers.size}`);
+   /** Segment text (concatenated from words). */
+   text: string;

-
+   /** Word-level timestamps for the segment. */
+   words: AlignedWord[];
  }

-
+ export interface TranscriptionResult {
+   /** Full speaker-labeled transcript segments. */
+   segments: AlignedSegment[];
+ }
  ```
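Because `transcribe()` and `finalize()` return the same `TranscriptionResult` shape, downstream consumers can be shared between the one-shot and streaming paths. A small sketch that aggregates speaking time per speaker, assuming the interfaces above are importable from the package root:

```typescript
import type { TranscriptionResult } from 'pyannote-cpp-node';

// Sketch: total speaking time per speaker label, in seconds.
function speakingTime(result: TranscriptionResult): Map<string, number> {
  const totals = new Map<string, number>();
  for (const seg of result.segments) {
    totals.set(seg.speaker, (totals.get(seg.speaker) ?? 0) + seg.duration);
  }
  return totals;
}
```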

-
+ ## Usage Examples

-
+ ### One-shot transcription

  ```typescript
- import {
+ import { Pipeline } from 'pyannote-cpp-node';

- async function
- const
+ async function runOneShot(audio: Float32Array) {
+   const pipeline = await Pipeline.load({
      segModelPath: './models/segmentation.gguf',
      embModelPath: './models/embedding.gguf',
      pldaPath: './models/plda.gguf',
      coremlPath: './models/embedding.mlpackage',
      segCoremlPath: './models/segmentation.mlpackage',
+     whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
    });

- const
-
- // Load full audio file
- const audio = loadWavFile('./audio.wav');
-
- // Push audio in 1-second chunks (16,000 samples)
- const CHUNK_SIZE = 16000;
- let totalChunks = 0;
-
- for (let offset = 0; offset < audio.length; offset += CHUNK_SIZE) {
-   const end = Math.min(offset + CHUNK_SIZE, audio.length);
-   const chunk = audio.slice(offset, end);
-
-   const vadChunks = await session.push(chunk);
-
-   // VAD chunks are returned after first 10 seconds
-   for (const vad of vadChunks) {
-     // Count active frames (speech detected)
-     const activeFrames = vad.vad.filter(v => v > 0.5).length;
-     const speechRatio = (activeFrames / vad.numFrames * 100).toFixed(1);
-
-     console.log(
-       `Chunk ${vad.chunkIndex}: ${vad.startTime.toFixed(1)}s - ${(vad.startTime + vad.duration).toFixed(1)}s | ` +
-       `Speech: ${speechRatio}%`
-     );
-     totalChunks++;
-   }
- }
-
- console.log(`\nProcessed ${totalChunks} chunks`);
-
- // Get final diarization result
- console.log('\nFinalizing...');
- const result = await session.finalize();
+   const result = await pipeline.transcribe(audio);

-
-
-   console.log(
-     `[${segment.start.toFixed(2)}s - ${(segment.start + segment.duration).toFixed(2)}s] ${segment.speaker}`
-   );
+   for (const seg of result.segments) {
+     const end = seg.start + seg.duration;
+     console.log(`[${seg.speaker}] ${seg.start.toFixed(2)}-${end.toFixed(2)} ${seg.text.trim()}`);
    }

-
- model.close();
+   pipeline.close();
  }
-
- streamingDiarization();
  ```

- ###
-
- Push audio and trigger reclustering every 30 seconds to get intermediate results.
+ ### Streaming transcription

  ```typescript
- import {
+ import { Pipeline } from 'pyannote-cpp-node';

- async function
- const
+ async function runStreaming(audio: Float32Array) {
+   const pipeline = await Pipeline.load({
      segModelPath: './models/segmentation.gguf',
      embModelPath: './models/embedding.gguf',
      pldaPath: './models/plda.gguf',
      coremlPath: './models/embedding.mlpackage',
      segCoremlPath: './models/segmentation.mlpackage',
+     whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
    });

- const session =
-
-
-
-
-
-
- for (let offset = 0; offset < audio.length; offset += CHUNK_SIZE) {
-   const end = Math.min(offset + CHUNK_SIZE, audio.length);
-   const chunk = audio.slice(offset, end);
-
-   await session.push(chunk);
-   secondsProcessed++;
+   const session = pipeline.createSession();
+   session.on('segments', (segments) => {
+     const latest = segments[segments.length - 1];
+     if (latest) {
+       const end = latest.start + latest.duration;
+       console.log(`[live][${latest.speaker}] ${latest.start.toFixed(2)}-${end.toFixed(2)} ${latest.text.trim()}`);
+     }
+   });

-
-
-
-
-
- const
- console.log(`
- console.log(`Current segments: ${intermediateResult.segments.length}`);
+   const chunkSize = 16000;
+   for (let i = 0; i < audio.length; i += chunkSize) {
+     const chunk = audio.slice(i, Math.min(i + chunkSize, audio.length));
+     const vad = await session.push(chunk);
+     if (vad.length > 0) {
+       const speechFrames = vad.filter(Boolean).length;
+       console.log(`VAD frames: ${vad.length}, speech frames: ${speechFrames}`);
      }
    }

- // Final result
- console.log('\n--- Final result ---');
    const finalResult = await session.finalize();
-
- console.log(`Total speakers: ${speakers.size}`);
- console.log(`Total segments: ${finalResult.segments.length}`);
+   console.log(`Final segments: ${finalResult.segments.length}`);

    session.close();
-
+   pipeline.close();
  }
-
- reclusteringExample();
  ```

- ###
-
- Format diarization results into standard RTTM (Rich Transcription Time Marked) format.
+ ### Custom Whisper decode options

  ```typescript
- import {
- import { writeFileSync } from 'node:fs';
-
- function toRTTM(result: DiarizationResult, filename: string = 'audio'): string {
-   const lines = result.segments.map(segment => {
-     // RTTM format: SPEAKER <file> <chnl> <tbeg> <tdur> <ortho> <stype> <name> <conf> <slat>
-     return [
-       'SPEAKER',
-       filename,
-       '1',
-       segment.start.toFixed(3),
-       segment.duration.toFixed(3),
-       '<NA>',
-       '<NA>',
-       segment.speaker,
-       '<NA>',
-       '<NA>',
-     ].join(' ');
-   });
-
-   return lines.join('\n') + '\n';
- }
+ import { Pipeline } from 'pyannote-cpp-node';

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+ const pipeline = await Pipeline.load({
+   segModelPath: './models/segmentation.gguf',
+   embModelPath: './models/embedding.gguf',
+   pldaPath: './models/plda.gguf',
+   coremlPath: './models/embedding.mlpackage',
+   segCoremlPath: './models/segmentation.mlpackage',
+   whisperModelPath: './models/ggml-large-v3-turbo-q5_0.bin',
+
+   // Whisper runtime options
+   useGpu: true,
+   flashAttn: true,
+   gpuDevice: 0,
+   useCoreml: false,
+
+   // Decode strategy
+   nThreads: 8,
+   language: 'ko',
+   translate: false,
+   detectLanguage: false,
+   temperature: 0.0,
+   temperatureInc: 0.2,
+   noFallback: false,
+   beamSize: 5,
+   bestOf: 5,
+
+   // Thresholds and context
+   entropyThold: 2.4,
+   logprobThold: -1.0,
+   noSpeechThold: 0.6,
+   prompt: 'Meeting transcript with technical terminology.',
+   noContext: true,
+   suppressBlank: true,
+   suppressNst: false,
+ });
+ ```

-
+ ## JSON Output Format
+
+ The pipeline returns this JSON shape:
+
+ ```json
+ {
+   "segments": [
+     {
+       "speaker": "SPEAKER_00",
+       "start": 0.497000,
+       "duration": 2.085000,
+       "text": "Hello world",
+       "words": [
+         {"text": " Hello", "start": 0.500000, "end": 0.800000},
+         {"text": " world", "start": 0.900000, "end": 1.200000}
+       ]
+     }
+   ]
  }
-
- generateRTTM();
  ```
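Since the resolved result object already has this shape, persisting a transcript is direct. A minimal sketch, reusing `pipeline` and `audio` from the examples above inside an async context:

```typescript
import { writeFileSync } from 'node:fs';

// Sketch: persist a TranscriptionResult as pretty-printed JSON.
const result = await pipeline.transcribe(audio);
writeFileSync('transcript.json', JSON.stringify(result, null, 2));
```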

- ## Architecture
-
- The diarization pipeline consists of four main stages:
-
- ### 1. Segmentation (SincNet + BiLSTM)
-
- The segmentation model processes 10-second audio windows and outputs 7-class powerset logits for 589 frames (approximately one frame every 17 ms). The model architecture:
-
- - **SincNet**: Learnable sinc filter bank for feature extraction
- - **4-layer BiLSTM**: Bidirectional long short-term memory layers
- - **Linear classifier**: Projects to 7 powerset classes with log-softmax
-
- The 7 powerset classes represent all possible combinations of up to 3 simultaneous speakers:
- - Class 0: silence (no speakers)
- - Classes 1-3: single speakers
- - Classes 4-6: speaker overlaps
-
- ### 2. Powerset Decoding
-
- Converts the 7-class powerset predictions into binary speaker activity for 3 local speakers per chunk. Each frame is decoded to indicate which of the 3 local speaker "slots" are active.
-
- ### 3. Embedding Extraction (WeSpeaker ResNet34)
-
- For each active speaker in each chunk, the embedding model extracts a 256-dimensional speaker vector:
-
- - **Mel filterbank**: 80-bin log-mel spectrogram features
- - **ResNet34**: Deep residual network for speaker representation
- - **Output**: 256-dimensional L2-normalized embedding
-
- Silent speakers receive NaN embeddings, which are filtered before clustering.
-
- ### 4. Clustering (PLDA + AHC + VBx)
-
- The final stage maps local speaker labels to global speaker identities:
-
- - **PLDA transformation**: Probabilistic Linear Discriminant Analysis projects embeddings from 256 to 128 dimensions
- - **Agglomerative Hierarchical Clustering (AHC)**: fastcluster implementation with O(n²) complexity, using centroid linkage and a distance threshold of 0.6
- - **VBx refinement**: Variational Bayes diarization with parameters FA=0.07, FB=0.8, maximum 20 iterations
-
- The clustering stage computes speaker centroids and assigns each embedding to the closest centroid while respecting the constraint that two local speakers in the same chunk cannot map to the same global speaker.
-
- ### CoreML Acceleration
-
- Both neural networks run on Apple's CoreML framework, which automatically distributes computation across:
-
- - **Neural Engine**: Dedicated ML accelerator on Apple Silicon
- - **GPU**: Metal-accelerated operations
- - **CPU**: Fallback for unsupported operations
-
- CoreML models use Float16 computation for optimal performance while maintaining accuracy within acceptable bounds (cosine similarity > 0.999 vs Float32).
-
- ### Streaming Architecture
-
- The streaming API uses a sliding 10-second window with a 1-second hop (9 seconds of overlap between consecutive chunks). Three data stores maintain the state:
-
- - **`audio_buffer`**: Sliding window (~10s, ~640 KB for 1 hour) — old samples are discarded
- - **`embeddings`**: Grows forever (~11 MB for 1 hour) — stores 3 × 256-dim vectors per chunk (NaN for silent speakers)
- - **`binarized`**: Grows forever (~25 MB for 1 hour) — stores 589 × 3 binary activity masks per chunk
-
- During reclustering, all accumulated embeddings are used to compute soft cluster assignments, and all binarized segmentations are used to reconstruct the global timeline. This is why the `embeddings` and `binarized` arrays must persist for the entire session.
-
- ### Constants
-
- | Constant | Value | Description |
- |----------|-------|-------------|
- | SAMPLE_RATE | 16000 Hz | Audio sample rate |
- | CHUNK_SAMPLES | 160000 | 10-second window size |
- | STEP_SAMPLES | 16000 | 1-second hop between chunks |
- | FRAMES_PER_CHUNK | 589 | Segmentation output frames |
- | NUM_LOCAL_SPEAKERS | 3 | Maximum speakers per chunk |
- | EMBEDDING_DIM | 256 | Speaker embedding dimension |
- | FBANK_NUM_BINS | 80 | Mel filterbank bins |
-
  ## Audio Format Requirements

-
-
- -
- -
- - **Format**: `Float32Array` with values in the range **[-1.0, 1.0]**
-
- The library does **not** handle audio decoding. You must provide raw PCM samples.
-
- ### Loading Audio Files
-
- For WAV files, you can use the `loadWavFile` function from Example 1, or use third-party libraries:
-
- ```bash
- npm install node-wav
- ```
-
- ```typescript
- import { read } from 'node-wav';
- import { readFileSync } from 'node:fs';
-
- const buffer = readFileSync('./audio.wav');
- const wav = read(buffer);
-
- // Convert to mono if stereo
- const mono = wav.channelData.length > 1
-   ? wav.channelData[0].map((v, i) => (v + wav.channelData[1][i]) / 2)
-   : wav.channelData[0];
-
- // Resample to 16kHz if needed (using a resampling library)
- // ...
-
- const audio = new Float32Array(mono);
- ```
-
- For other audio formats (MP3, M4A, etc.), use ffmpeg to convert to 16kHz mono WAV first:
-
- ```bash
- ffmpeg -i input.mp3 -ar 16000 -ac 1 -f f32le -acodec pcm_f32le - | \
-   node process.js
- ```
+ - Input must be `Float32Array`
+ - Sample rate must be `16000` Hz
+ - Audio must be mono
+ - Recommended amplitude range: `[-1.0, 1.0]`

-
+ All API methods expect decoded PCM samples; file decoding/resampling is handled by the caller.

-
-
- - **macOS only**: The library requires CoreML for neural network inference. There is currently no fallback implementation for other platforms.
- - **No Linux/Windows support**: CoreML is exclusive to Apple platforms.
-
- ### `recluster()` Mutates State
-
- The `recluster()` method overwrites the internal session state, specifically replacing the `embeddings` and chunk index arrays with filtered versions (excluding NaN embeddings from silent speakers). This means:
-
- - Calling `push()` after `recluster()` may produce incorrect results
- - Subsequent `recluster()` calls may not work as expected
- - The data structure assumes the original unfiltered layout (3 embeddings per chunk)
-
- **Best practice**: Use `recluster()` sparingly for live progress updates (e.g., every 30 seconds), or avoid it entirely and only call `finalize()` when the stream ends.
-
- ### Operations Are Serialized
-
- Operations on a streaming session are serialized internally. Do not call `push()` while another `push()`, `recluster()`, or `finalize()` is in progress. Wait for the Promise to resolve before making the next call.
-
- ### Resource Management
-
- - **Close sessions before models**: Always close streaming sessions before closing the parent model
- - **Idempotent close**: Both `model.close()` and `session.close()` are safe to call multiple times
- - **No reuse after close**: Once closed, models and sessions cannot be reused
-
- ### Model Loading
-
- - **Path validation**: `Pyannote.load()` validates that all paths exist using `fs.accessSync()` before initialization
- - **CoreML compilation**: The CoreML framework compiles `.mlpackage` models internally on first load (typically fast, ~100ms)
- - **No explicit loading step**: Model weights are loaded synchronously in the constructor
-
- ### Threading Model
-
- All heavy operations (`diarize`, `push`, `recluster`, `finalize`) run on libuv worker threads and never block the Node.js event loop. However, the operations do hold native locks internally, so concurrent operations on the same session are serialized.
-
- ### Memory Usage
+ ## Architecture

-
- - `audio_buffer`: ~640 KB (sliding window)
- - `embeddings`: ~11 MB (grows throughout session)
- - `binarized`: ~25 MB (grows throughout session)
- - CoreML models: ~50 MB (loaded once per model)
+ The integrated pipeline runs in 7 stages:

-
+ 1. VAD silence filter (optional compression of long silence)
+ 2. Audio buffer (stream-safe FIFO with timestamp tracking)
+ 3. Segmentation (speech activity over rolling windows)
+ 4. Transcription (Whisper sentence + word timestamps)
+ 5. Alignment (segment-level speaker assignment by overlap)
+ 6. Finalize (flush + final recluster + final alignment)
+ 7. Callback/event emission (`segments` updates)
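A sketch of how these internal stages surface through the public session API; the stage numbers refer to the list above, the stages themselves run on the native side, and `pipeline`/`chunk` are assumed from the earlier examples inside an async context:

```typescript
const session = pipeline.createSession();

// Stage 7: cumulative speaker-labeled output is emitted after each Whisper pass.
session.on('segments', (segments) => {
  console.log(`cumulative segments: ${segments.length}`);
});

// Stages 1-5 advance incrementally with each push; the returned booleans
// are the per-frame voice activity computed along the way.
await session.push(chunk);

// Stage 6: flush remaining audio, recluster, realign, and return the final result.
const result = await session.finalize();
session.close();
```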

  ## Performance

-
-
-
-
-
- | Embedding (CoreML) | ~13ms | Per speaker per chunk (up to 3 speakers) |
- | AHC Clustering | ~0.8s | 3000 embeddings (1000 chunks) |
- | VBx Refinement | ~1.2s | 20 iterations, 3000 embeddings |
- | **Full Pipeline (offline)** | **39x real-time** | 45-minute audio processed in 70 seconds |
-
- ### Streaming Performance
+ - Diarization only: **39x real-time**
+ - Integrated transcription + diarization: **~14.6x real-time**
+ - 45-minute Korean meeting test (6 speakers): **2713 s of audio processed in 186 s**
+ - Alignment reduction: **701 Whisper segments -> 186 aligned speaker segments**
+ - Speaker confusion rate: **2.55%**

-
- - **Incremental latency**: ~30ms per 1-second push (after first chunk)
- - **Recluster latency**: ~2 seconds for 30 minutes of audio (~1800 embeddings)
+ ## Platform Support

-
-
-
-
- |
-
- | macOS | arm64 (Apple Silicon) | ✅ Supported |
- | macOS | x64 (Intel) | 🔜 Planned |
- | Linux | any | ❌ Not supported (CoreML unavailable) |
- | Windows | any | ❌ Not supported (CoreML unavailable) |
-
- Intel macOS support is planned but not yet available. The CoreML dependency makes cross-platform support challenging without alternative inference backends.
+ | Platform | Status |
+ | --- | --- |
+ | macOS arm64 (Apple Silicon) | Supported |
+ | macOS x64 (Intel) | Supported |
+ | Linux | Not supported |
+ | Windows | Not supported |

  ## License

  MIT
-
- ---
-
- For issues, feature requests, or contributions, please visit the [GitHub repository](https://github.com/predict-woo/pyannote-ggml).