embed-cluster 0.3.0 → 0.3.1

Files changed (2)
  1. package/README.md +612 -63
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -1,6 +1,32 @@
1
1
  # embed-cluster
2
2
 
3
- Cluster embeddings into topics with automatic labeling.
3
+ Cluster embedding vectors into semantically coherent groups with automatic k selection, silhouette analysis, and custom labeling.
4
+
5
+ [![npm version](https://img.shields.io/npm/v/embed-cluster.svg)](https://www.npmjs.com/package/embed-cluster)
6
+ [![npm downloads](https://img.shields.io/npm/dt/embed-cluster.svg)](https://www.npmjs.com/package/embed-cluster)
7
+ [![license](https://img.shields.io/npm/l/embed-cluster.svg)](https://github.com/SiluPanda/embed-cluster/blob/master/LICENSE)
8
+ [![node](https://img.shields.io/node/v/embed-cluster.svg)](https://nodejs.org)
9
+ [![types](https://img.shields.io/npm/types/embed-cluster.svg)](https://www.npmjs.com/package/embed-cluster)
10
+
11
+ ---
12
+
13
+ ## Description
14
+
15
+ `embed-cluster` groups high-dimensional embedding vectors into semantically coherent clusters using k-means++ initialization and configurable distance metrics. It is built for the characteristics of vectors produced by modern language model embedding APIs (768--3072 dimensions, cosine-similarity geometry) where generic clustering libraries require significant hand-tuning.
16
+
17
+ The package provides a complete clustering pipeline in a single function call: L2 normalization, k-means++ centroid initialization, iterative assignment and convergence, silhouette quality scoring, and optional automatic k selection. All algorithms are self-contained with zero mandatory runtime dependencies.
18
+
19
+ Key capabilities:
20
+
21
+ - **k-means++ clustering** with smart centroid initialization for faster convergence and better separation.
22
+ - **Automatic k selection** via silhouette analysis across a range of candidate k values.
23
+ - **Silhouette scoring** at the per-item, per-cluster, and aggregate level.
24
+ - **Custom labeling** through a caller-supplied sync or async labeling function.
25
+ - **Reproducible results** using a seeded pseudo-random number generator.
26
+ - **L2 normalization** built in, enabled by default, so cosine-distance clustering works out of the box.
27
+ - **Custom distance functions** for specialized similarity metrics beyond Euclidean and cosine.
28
+
29
+ ---
4
30
 
5
31
  ## Installation
6
32
 
@@ -8,115 +34,638 @@ Cluster embeddings into topics with automatic labeling.
8
34
  npm install embed-cluster
9
35
  ```
10
36
 
11
- For dimensionality reduction (PCA or UMAP), install the optional peer dependencies:
37
+ For optional dimensionality reduction, install the peer dependencies:
12
38
 
13
39
  ```bash
14
40
  npm install ml-pca # PCA-based dimensionality reduction
15
41
  npm install umap-js # UMAP-based dimensionality reduction
16
42
  ```
17
43
 
44
+ Both peer dependencies are optional and the core clustering API works without them.
45
+
46
+ ---
47
+
18
48
  ## Quick Start
19
49
 
20
50
  ```ts
21
- import { cluster, createClusterer } from 'embed-cluster';
51
+ import { cluster } from "embed-cluster";
52
+
53
+ // Prepare items with id, text, and embedding vector
54
+ const items = [
55
+ { id: "doc-1", text: "Introduction to machine learning", embedding: [0.12, 0.85, 0.33, ...] },
56
+ { id: "doc-2", text: "Deep learning architectures", embedding: [0.14, 0.82, 0.31, ...] },
57
+ { id: "doc-3", text: "Cooking Italian pasta", embedding: [0.91, 0.05, 0.72, ...] },
58
+ // ...more items
59
+ ];
60
+
61
+ // Cluster with a fixed k
62
+ const result = await cluster(items, { k: 3, seed: 42 });
22
63
 
23
- // Cluster raw embedding vectors
24
- const result = await cluster(embeddings, { k: 5 });
64
+ console.log(result.k); // 3
65
+ console.log(result.clusters); // Array of 3 Cluster objects
66
+ console.log(result.quality); // { silhouette, inertia }
67
+ console.log(result.converged); // true
68
+ console.log(result.durationMs); // elapsed time in milliseconds
69
+ ```
70
+
71
+ ### Automatic k selection
25
72
 
26
- console.log(result.clusters); // 5 clusters with labels, centroids, items
27
- console.log(result.quality); // silhouette score, inertia, interpretation
73
+ ```ts
74
+ const result = await cluster(items, { autoK: true, maxK: 10 });
75
+ // k is chosen automatically to maximize silhouette score
76
+ console.log(result.k); // e.g. 4
77
+ ```
78
+
79
+ ### Pre-configured clusterer
80
+
81
+ ```ts
82
+ import { createClusterer } from "embed-cluster";
28
83
 
29
- // Or use a pre-configured clusterer
30
84
  const clusterer = createClusterer({
31
85
  autoK: true,
32
86
  maxK: 15,
33
87
  normalize: true,
88
+ seed: 42,
34
89
  });
35
90
 
36
91
  const result = await clusterer.cluster(items);
37
92
  const optimal = await clusterer.findOptimalK(items);
93
+ const quality = clusterer.silhouetteScore(result);
94
+ ```
95
+
96
+ ### Custom labeling
97
+
98
+ ```ts
99
+ const result = await cluster(items, {
100
+ k: 5,
101
+ labeler: async (clusterItems, clusterId) => {
102
+ // Call an LLM, run TF-IDF, or apply any labeling logic
103
+ const texts = clusterItems.map((item) => item.text).join("\n");
104
+ return `Topic ${clusterId}: ${texts.slice(0, 50)}`;
105
+ },
106
+ });
107
+
108
+ for (const c of result.clusters) {
109
+ console.log(c.label); // e.g. "Topic 0: " followed by the first 50 characters of the cluster's joined texts
110
+ }
111
+ ```
112
+
113
+ ---
114
+
115
+ ## Features
116
+
117
+ - **k-means++ initialization** -- Selects initial centroids using D-squared weighted probabilistic sampling, producing better starting positions than random initialization and converging in fewer iterations.
118
+ - **Silhouette analysis** -- Computes per-item silhouette coefficients measuring how well each point fits its assigned cluster versus the nearest alternative cluster. Returns per-cluster and overall mean scores in the range [-1, 1].
119
+ - **Automatic k selection** -- Sweeps k from 2 to min(maxK, floor(sqrt(n))), where n is the number of items, scores each candidate with silhouette analysis, and selects the k that maximizes the mean silhouette coefficient.
120
+ - **L2 normalization** -- Normalizes embedding vectors to unit length before clustering (enabled by default), which is the correct preprocessing for cosine-distance clustering of embedding vectors.
121
+ - **Custom distance functions** -- Supply any `(a: number[], b: number[]) => number` function to replace the default Euclidean distance. Built-in `cosineDistance` is exported for convenience.
122
+ - **Custom labeling** -- Provide a sync or async `LabelerFn` to generate human-readable labels for each cluster. The function receives the cluster's items and cluster ID.
123
+ - **Seeded PRNG** -- Deterministic pseudo-random number generation (mulberry32) ensures identical results across runs when a seed is provided.
124
+ - **Convergence control** -- Configurable maximum iterations and tolerance threshold for centroid movement. The result reports whether the algorithm converged and how many iterations were needed.
125
+ - **Quality metrics** -- Every result includes inertia (within-cluster sum of squared distances), silhouette scores, cohesion (average pairwise intra-cluster distance), and average distance to centroid per cluster.
126
+ - **Zero runtime dependencies** -- All algorithms are self-contained TypeScript. No mandatory third-party libraries.
127
+ - **TypeScript-first** -- Full type definitions with strict typing throughout. All interfaces and types are exported for consumer use.
128
+
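The seeded PRNG bullet above names mulberry32. As a rough illustration, here is the generator in its commonly published form. This is a sketch, not the package's source: the function name `mulberry32` and the factory shape are assumptions.

```ts
// Sketch of the widely circulated mulberry32 generator (assumed shape, not the package's code).
// Takes a 32-bit integer seed and returns a function producing floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators built from the same seed yield the same sequence.
const randA = mulberry32(42);
const randB = mulberry32(42);
const seqA = [randA(), randA(), randA()];
const seqB = [randB(), randB(), randB()];
```

Because the sequence depends only on the seed, any step that draws from it (such as centroid initialization) can be replayed exactly.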
129
+ ---
130
+
131
+ ## API Reference
132
+
133
+ ### `cluster(items, options?)`
134
+
135
+ Cluster a set of `EmbedItem` objects using k-means++ and return a `ClusterResult` with silhouette scores.
136
+
137
+ ```ts
138
+ function cluster(
139
+ items: EmbedItem[],
140
+ options?: ClusterOptions
141
+ ): Promise<ClusterResult>;
142
+ ```
143
+
144
+ **Parameters:**
145
+
146
+ | Parameter | Type | Description |
147
+ |-----------|------|-------------|
148
+ | `items` | `EmbedItem[]` | Array of items to cluster. Must not be empty. |
149
+ | `options` | `ClusterOptions` | Clustering configuration. Provide `k` or set `autoK: true`. |
150
+
151
+ **Returns:** `Promise<ClusterResult>`
152
+
153
+ **Throws:** `ClusterError` with code `EMPTY_INPUT` if `items` is empty, `INVALID_OPTIONS` if neither `k` nor `autoK: true` is provided, `INVALID_K` if `k` is not a positive integer or exceeds the number of items, and `INCONSISTENT_DIMENSIONS` if embedding dimensions differ.
154
+
155
+ ---
156
+
157
+ ### `createClusterer(config?)`
158
+
159
+ Create a pre-configured `Clusterer` instance. The returned object exposes `cluster()`, `findOptimalK()`, and `silhouetteScore()` methods that merge the bound config with per-call options.
160
+
161
+ ```ts
162
+ function createClusterer(config?: ClusterOptions): Clusterer;
163
+ ```
164
+
165
+ **Parameters:**
166
+
167
+ | Parameter | Type | Description |
168
+ |-----------|------|-------------|
169
+ | `config` | `ClusterOptions` | Default configuration applied to all method calls. Per-call options override these defaults. |
170
+
171
+ **Returns:** `Clusterer`
172
+
173
+ The `Clusterer` interface:
174
+
175
+ ```ts
176
+ interface Clusterer {
177
+ cluster(items: EmbedItem[], options?: ClusterOptions): Promise<ClusterResult>;
178
+ findOptimalK(items: EmbedItem[], options?: Omit<ClusterOptions, "k">): Promise<OptimalKResult>;
179
+ silhouetteScore(result: ClusterResult): SilhouetteResult;
180
+ }
181
+ ```
182
+
183
+ ---
184
+
185
+ ### `findOptimalK(items, options?)`
186
+
187
+ Try each k from 2 to min(maxK, floor(sqrt(n))), where n is the number of items: run k-means for each candidate, compute its silhouette score, and return the k with the highest score.
188
+
189
+ ```ts
190
+ function findOptimalK(
191
+ items: EmbedItem[],
192
+ options?: Omit<ClusterOptions, "k">
193
+ ): OptimalKResult;
194
+ ```
195
+
196
+ **Parameters:**
197
+
198
+ | Parameter | Type | Description |
199
+ |-----------|------|-------------|
200
+ | `items` | `EmbedItem[]` | Array of items to evaluate. |
201
+ | `options` | `Omit<ClusterOptions, "k">` | Configuration (k is excluded since it is being searched). |
202
+
203
+ **Returns:** `OptimalKResult`
204
+
205
+ ```ts
206
+ interface OptimalKResult {
207
+ k: number; // optimal k
208
+ scores: Array<{ k: number; silhouette: number; inertia: number }>; // score per candidate
209
+ method: "silhouette" | "elbow" | "combined"; // selection method used
210
+ }
211
+ ```
212
+
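The search range described in this section can be sketched as follows. `candidateKs` is a hypothetical helper, not an export of the package, and `n` is assumed to be the number of items.

```ts
// Hypothetical helper: lists the k values a silhouette sweep would try,
// from 2 up to min(maxK, floor(sqrt(n))), following the rule stated above.
function candidateKs(n: number, maxK: number = 10): number[] {
  const upper = Math.min(maxK, Math.floor(Math.sqrt(n)));
  const ks: number[] = [];
  for (let k = 2; k <= upper; k++) ks.push(k);
  return ks;
}

const ksFor100 = candidateKs(100, 15); // sqrt(100) = 10 caps the range: 2..10
const ksFor30 = candidateKs(30);       // floor(sqrt(30)) = 5: [2, 3, 4, 5]
```

Under this rule the range is empty for fewer than 4 items. The README does not say how the package handles that case.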
213
+ ---
214
+
215
+ ### `silhouetteScore(result, distFn?)`
216
+
217
+ Compute the silhouette score for an existing `ClusterResult`.
218
+
219
+ ```ts
220
+ function silhouetteScore(
221
+ result: ClusterResult,
222
+ distFn?: (a: number[], b: number[]) => number
223
+ ): SilhouetteResult;
224
+ ```
225
+
226
+ **Parameters:**
227
+
228
+ | Parameter | Type | Default | Description |
229
+ |-----------|------|---------|-------------|
230
+ | `result` | `ClusterResult` | -- | A clustering result to evaluate. |
231
+ | `distFn` | `(a: number[], b: number[]) => number` | `euclideanDistance` | Distance function for silhouette computation. |
232
+
233
+ **Returns:** `SilhouetteResult`
234
+
235
+ ```ts
236
+ interface SilhouetteResult {
237
+ score: number; // overall mean silhouette coefficient (-1 to 1)
238
+ perCluster: number[]; // per-cluster mean silhouette
239
+ perItem?: number[]; // per-item silhouette (optional, expensive)
240
+ }
241
+ ```
242
+
243
+ Returns `{ score: 0, perCluster: [0, ...], perItem: [0, ...] }` when fewer than 2 clusters exist.
244
+
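For readers unfamiliar with the metric, the sketch below computes the standard per-item silhouette coefficient, `(b - a) / max(a, b)`, where `a` is the mean distance to the other members of the point's own cluster and `b` is the smallest mean distance to any other cluster. It works on raw vectors and labels rather than a `ClusterResult`; `silhouettePerItem` and `euclid` are illustrative names, and the package's internals may differ.

```ts
type DistFn = (a: number[], b: number[]) => number;

// Plain Euclidean distance, used as the default metric in this sketch.
const euclid: DistFn = (a, b) =>
  Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]!) ** 2, 0));

// Illustrative only: per-item silhouette from raw vectors and cluster labels.
function silhouettePerItem(vectors: number[][], labels: number[], dist: DistFn = euclid): number[] {
  if (new Set(labels).size < 2) return vectors.map(() => 0);
  return vectors.map((v, i) => {
    const ownLabel = labels[i]!;
    // Sum and count of distances from point i to every other point, grouped by cluster.
    const byCluster = new Map<number, { total: number; count: number }>();
    vectors.forEach((w, j) => {
      if (j === i) return;
      const label = labels[j]!;
      const entry = byCluster.get(label) ?? { total: 0, count: 0 };
      entry.total += dist(v, w);
      entry.count += 1;
      byCluster.set(label, entry);
    });
    const own = byCluster.get(ownLabel);
    if (!own) return 0; // singleton cluster: silhouette is conventionally 0
    const a = own.total / own.count; // mean distance within the own cluster
    let b = Infinity; // smallest mean distance to any other cluster
    byCluster.forEach((e, label) => {
      if (label !== ownLabel) b = Math.min(b, e.total / e.count);
    });
    const denom = Math.max(a, b);
    return denom === 0 ? 0 : (b - a) / denom;
  });
}

const samplePoints = [[0, 0], [0, 1], [10, 0], [10, 1]];
const wellSeparated = silhouettePerItem(samplePoints, [0, 0, 1, 1]); // all close to 1
const mislabeled = silhouettePerItem(samplePoints, [0, 1, 0, 1]);    // all negative
```

The pairwise loop is quadratic in the number of items, which is consistent with the interface comment marking `perItem` as expensive.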
245
+ ---
246
+
247
+ ### `kMeans(items, k, options?)`
248
+
249
+ Low-level k-means implementation. Runs a single k-means pass with k-means++ initialization and returns a `ClusterResult` with placeholder silhouette scores (use `silhouetteScore()` separately to populate them).
250
+
251
+ ```ts
252
+ function kMeans(
253
+ items: EmbedItem[],
254
+ k: number,
255
+ options?: ClusterOptions
256
+ ): ClusterResult;
257
+ ```
258
+
259
+ **Parameters:**
260
+
261
+ | Parameter | Type | Description |
262
+ |-----------|------|-------------|
263
+ | `items` | `EmbedItem[]` | Array of items to cluster. |
264
+ | `k` | `number` | Number of clusters. Must be a positive integer not exceeding `items.length`. |
265
+ | `options` | `ClusterOptions` | Clustering configuration. |
266
+
267
+ **Returns:** `ClusterResult`
268
+
269
+ ---
270
+
271
+ ### `kMeansPlusPlusInit(vectors, k, distFn, rand)`
272
+
273
+ k-means++ centroid initialization. Selects k initial centroids from the input vectors using D-squared weighted probabilistic selection.
274
+
275
+ ```ts
276
+ function kMeansPlusPlusInit(
277
+ vectors: number[][],
278
+ k: number,
279
+ distFn: (a: number[], b: number[]) => number,
280
+ rand: () => number
281
+ ): number[][];
38
282
  ```
39
283
 
40
- ## Available Exports
284
+ **Parameters:**
285
+
286
+ | Parameter | Type | Description |
287
+ |-----------|------|-------------|
288
+ | `vectors` | `number[][]` | Input data points. |
289
+ | `k` | `number` | Number of centroids to select. |
290
+ | `distFn` | `(a: number[], b: number[]) => number` | Distance function. |
291
+ | `rand` | `() => number` | Random number generator returning values in [0, 1). |
292
+
293
+ **Returns:** `number[][]` -- Array of k centroid vectors.
294
+
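The D-squared selection described above can be sketched like this. The function mirrors the documented parameter order but is not the package's source; `kMeansPlusPlusSketch` and the fixed `rand` used in the example are assumptions for illustration.

```ts
type DistanceFn = (a: number[], b: number[]) => number;

// Sketch of k-means++ seeding: each new centroid is drawn with probability
// proportional to its squared distance from the nearest centroid chosen so far.
function kMeansPlusPlusSketch(
  vectors: number[][],
  k: number,
  distFn: DistanceFn,
  rand: () => number
): number[][] {
  if (k < 1 || k > vectors.length) throw new Error("k must be between 1 and vectors.length");
  // First centroid: uniform random pick.
  const first = vectors[Math.floor(rand() * vectors.length)]!.slice();
  const centroids: number[][] = [first];
  // Squared distance from each point to its nearest chosen centroid.
  const d2 = vectors.map((v) => distFn(v, first) ** 2);
  while (centroids.length < k) {
    const total = d2.reduce((s, x) => s + x, 0);
    let pick = Math.floor(rand() * vectors.length); // fallback when all weights are zero
    if (total > 0) {
      let r = rand() * total;
      for (let i = 0; i < d2.length; i++) {
        if (d2[i]! === 0) continue;
        pick = i; // remember the last positive-weight index as a rounding fallback
        r -= d2[i]!;
        if (r < 0) break;
      }
    }
    const next = vectors[pick]!.slice();
    centroids.push(next);
    vectors.forEach((v, i) => {
      d2[i] = Math.min(d2[i]!, distFn(v, next) ** 2);
    });
  }
  return centroids;
}

const l2: DistanceFn = (a, b) => Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]!) ** 2, 0));
const twoBlobs = [[0, 0], [0, 0.1], [10, 10], [10, 10.1]];
// A constant rand keeps the example deterministic; real use would pass a seeded PRNG.
const seeds = kMeansPlusPlusSketch(twoBlobs, 2, l2, () => 0.5);
```

With two well-separated groups, the weighting makes it very likely that the second centroid lands in the group the first one missed.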
295
+ ---
296
+
297
+ ### `euclideanDistance(a, b)`
41
298
 
42
- ### Clustering Functions
299
+ Compute the Euclidean distance between two vectors.
43
300
 
44
- - **`cluster(items, options): Promise<ClusterResult>`** -- Cluster EmbedItems using k-means++ and return a full ClusterResult with silhouette scores. Provide `options.k` for a fixed cluster count or `options.autoK = true` for automatic selection.
45
- - **`findOptimalK(items, options?): OptimalKResult`** -- Try k from 2 to `maxK` (default min(10, √n)), compute silhouette score for each, and return the k that maximises the score.
46
- - **`silhouetteScore(result): SilhouetteResult`** -- Compute per-item, per-cluster, and overall mean silhouette coefficients for an existing ClusterResult. Returns a value in [-1, 1]; higher is better.
47
- - **`createClusterer(config): Clusterer`** -- Create a pre-configured Clusterer instance with `cluster()`, `findOptimalK()`, and `silhouetteScore()` bound to the given config.
301
+ ```ts
302
+ function euclideanDistance(a: number[], b: number[]): number;
303
+ ```
48
304
 
49
- ### Utility Functions
305
+ ---
50
306
 
51
- - **`normalizeVector(vec: number[]): number[]`** -- L2-normalize a single vector to unit length. Returns a zero vector unchanged.
52
- - **`normalizeVectors(vecs: number[][]): number[][]`** -- L2-normalize a batch of vectors independently.
53
- - **`euclideanDistance(a, b): number`** -- Euclidean distance between two vectors.
54
- - **`cosineDistance(a, b): number`** -- Cosine distance (1 - cosine similarity) between two vectors.
307
+ ### `cosineDistance(a, b)`
55
308
 
56
- ### Types
309
+ Compute the cosine distance (1 - cosine similarity) between two vectors. Returns 1 for zero vectors.
57
310
 
58
- All TypeScript interfaces are exported for consumer use:
311
+ ```ts
312
+ function cosineDistance(a: number[], b: number[]): number;
313
+ ```
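A minimal sketch of the cosine distance described above, including the documented zero-vector case. `cosineDistanceSketch` is an illustrative name and the package's own implementation may differ in detail.

```ts
// Cosine distance = 1 - (a . b) / (|a| * |b|); returns 1 when either vector has zero length.
function cosineDistanceSketch(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i]!;
    const y = b[i]!;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 1; // documented zero-vector behaviour
  return 1 - dot / Math.sqrt(normA * normB);
}

const sameDirection = cosineDistanceSketch([1, 0], [2, 0]); // 0
const orthogonal = cosineDistanceSketch([1, 0], [0, 1]);    // 1
const opposite = cosineDistanceSketch([1, 0], [-1, 0]);     // 2
const withZero = cosineDistanceSketch([0, 0], [1, 0]);      // 1
```

The result ranges from 0 (same direction) to 2 (opposite direction) and ignores vector magnitude.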
59
314
 
60
- - `EmbedItem` -- Input item with id, text, embedding, and optional metadata
61
- - `ClusterItem` -- An `EmbedItem` assigned to a cluster with distance-to-centroid
62
- - `Cluster` -- A cluster with centroid, items, label, size, and cohesion metrics
63
- - `ClusterOptions` -- Configuration for clustering (k, autoK, maxK, tolerance, seed, etc.)
64
- - `ClusterResult` -- Full result including clusters, quality metrics, iteration count, and timing
65
- - `Clusterer` -- Interface for a pre-configured clusterer instance
66
- - `SilhouetteResult` -- Silhouette coefficient scores (overall, per-cluster, per-item)
67
- - `OptimalKResult` -- Result of automatic k selection with scores per candidate k
68
- - `ClusterQuality` -- Quality metrics including silhouette, inertia, and optional indices
69
- - `VisualizationData` -- 2D projected points for visualization (PCA, UMAP, or t-SNE)
70
- - `LabelerFn` -- Custom labeling function signature
315
+ ---
71
316
 
72
- ### Error Handling
317
+ ### `normalizeVector(vec)`
73
318
 
74
- - **`ClusterError`** -- Error class with typed `code` field for programmatic error handling
75
- - **`ClusterErrorCode`** -- Union type of error codes: `EMPTY_INPUT`, `INCONSISTENT_DIMENSIONS`, `DEGENERATE_INPUT`, `INVALID_K`, `INVALID_OPTIONS`
319
+ L2-normalize a single vector to unit length. Returns a copy; does not mutate the input. Returns a zero vector unchanged.
76
320
 
77
321
  ```ts
78
- import { ClusterError } from 'embed-cluster';
322
+ function normalizeVector(vec: number[]): number[];
323
+ ```
324
+
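The behaviour described for `normalizeVector` (unit length, no mutation, zero vector left as zeros) can be sketched as below. `normalizeSketch` is an illustrative stand-in, not the exported function.

```ts
// Divide each component by the vector's L2 norm; a zero vector is returned as a copy.
function normalizeSketch(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? vec.slice() : vec.map((x) => x / norm);
}

const sourceVec = [3, 4];
const unitVec = normalizeSketch(sourceVec);  // [0.6, 0.8], since the norm is 5
const zeroVec = normalizeSketch([0, 0, 0]);  // [0, 0, 0]
```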
325
+ ---
326
+
327
+ ### `normalizeVectors(vecs)`
328
+
329
+ L2-normalize a batch of vectors independently.
330
+
331
+ ```ts
332
+ function normalizeVectors(vecs: number[][]): number[][];
333
+ ```
334
+
335
+ ---
336
+
337
+ ### `ClusterError`
338
+
339
+ Error class with a typed `code` field for programmatic error handling.
340
+
341
+ ```ts
342
+ class ClusterError extends Error {
343
+ readonly name: "ClusterError";
344
+ readonly code: ClusterErrorCode;
345
+ constructor(message: string, code: ClusterErrorCode);
346
+ }
347
+ ```
348
+
349
+ ---
350
+
351
+ ## Types
352
+
353
+ All TypeScript interfaces are exported from the package entry point.
354
+
355
+ ### `EmbedItem`
356
+
357
+ ```ts
358
+ interface EmbedItem {
359
+ id: string;
360
+ text: string;
361
+ embedding: number[];
362
+ metadata?: Record<string, unknown>;
363
+ }
364
+ ```
365
+
366
+ An input item pairing text content with its embedding vector. The optional `metadata` field carries arbitrary data through the clustering pipeline.
367
+
368
+ ### `ClusterItem`
369
+
370
+ ```ts
371
+ interface ClusterItem extends EmbedItem {
372
+ clusterId: number;
373
+ distanceToCentroid: number;
374
+ }
375
+ ```
376
+
377
+ An `EmbedItem` after cluster assignment, annotated with the assigned cluster ID and its distance to the cluster centroid.
378
+
379
+ ### `Cluster`
380
+
381
+ ```ts
382
+ interface Cluster {
383
+ id: number;
384
+ centroid: number[];
385
+ items: ClusterItem[];
386
+ label?: string;
387
+ size: number;
388
+ avgDistanceToCentroid: number;
389
+ cohesion: number; // average intra-cluster pairwise distance
390
+ }
391
+ ```
392
+
393
+ A single cluster containing its centroid, assigned items, optional label, and quality metrics.
394
+
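To make the two per-cluster metrics concrete, the sketch below computes them for a small set of points. It assumes Euclidean distance and returns 0 for clusters too small to have pairs; both choices, and the name `clusterMetricsSketch`, are assumptions rather than documented behaviour.

```ts
const pointDistance = (a: number[], b: number[]): number =>
  Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]!) ** 2, 0));

// avgDistanceToCentroid: mean distance from each member to the centroid.
// cohesion: mean distance over all unordered pairs of members.
function clusterMetricsSketch(points: number[][], centroid: number[]) {
  const avgDistanceToCentroid =
    points.length === 0
      ? 0
      : points.reduce((s, p) => s + pointDistance(p, centroid), 0) / points.length;
  let pairSum = 0;
  let pairCount = 0;
  for (let i = 0; i < points.length; i++) {
    for (let j = i + 1; j < points.length; j++) {
      pairSum += pointDistance(points[i]!, points[j]!);
      pairCount++;
    }
  }
  const cohesion = pairCount === 0 ? 0 : pairSum / pairCount;
  return { avgDistanceToCentroid, cohesion };
}

// Three collinear points with centroid (1, 0):
// distances to centroid are 1, 1, 0; pairwise distances are 2, 1, 1.
const metrics = clusterMetricsSketch([[0, 0], [2, 0], [1, 0]], [1, 0]);
```

Lower values of either metric indicate a tighter cluster.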
395
+ ### `ClusterOptions`
396
+
397
+ ```ts
398
+ interface ClusterOptions {
399
+ k?: number;
400
+ autoK?: boolean;
401
+ maxK?: number;
402
+ maxIterations?: number;
403
+ tolerance?: number;
404
+ seed?: number;
405
+ normalize?: boolean;
406
+ labeler?: LabelerFn;
407
+ distanceFn?: (a: number[], b: number[]) => number;
408
+ }
409
+ ```
410
+
411
+ ### `ClusterResult`
412
+
413
+ ```ts
414
+ interface ClusterResult {
415
+ clusters: Cluster[];
416
+ quality: ClusterQuality;
417
+ k: number;
418
+ iterations: number;
419
+ converged: boolean;
420
+ durationMs: number;
421
+ }
422
+ ```
423
+
424
+ ### `ClusterQuality`
425
+
426
+ ```ts
427
+ interface ClusterQuality {
428
+ silhouette: SilhouetteResult;
429
+ inertia: number;
430
+ daviesBouldin?: number;
431
+ calinski?: number;
432
+ }
433
+ ```
434
+
435
+ ### `SilhouetteResult`
436
+
437
+ ```ts
438
+ interface SilhouetteResult {
439
+ score: number; // overall mean (-1 to 1)
440
+ perCluster: number[]; // per-cluster mean
441
+ perItem?: number[]; // per-item scores
442
+ }
443
+ ```
444
+
445
+ ### `OptimalKResult`
446
+
447
+ ```ts
448
+ interface OptimalKResult {
449
+ k: number;
450
+ scores: Array<{ k: number; silhouette: number; inertia: number }>;
451
+ method: "silhouette" | "elbow" | "combined";
452
+ }
453
+ ```
454
+
455
+ ### `VisualizationData`
456
+
457
+ ```ts
458
+ interface VisualizationData {
459
+ points: Array<{ id: string; x: number; y: number; clusterId: number }>;
460
+ method: "pca" | "umap" | "tsne";
461
+ }
462
+ ```
463
+
464
+ ### `LabelerFn`
465
+
466
+ ```ts
467
+ type LabelerFn = (
468
+ items: EmbedItem[],
469
+ clusterId: number
470
+ ) => Promise<string> | string;
471
+ ```
472
+
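As one possible shape for a synchronous labeler, the sketch below names a cluster after its two most frequent words. The types are copied from the definitions in this README so the snippet stands alone; `frequentWordsLabeler` and its four-letter word filter are illustrative choices, not part of the package.

```ts
interface EmbedItem {
  id: string;
  text: string;
  embedding: number[];
  metadata?: Record<string, unknown>;
}
type LabelerFn = (items: EmbedItem[], clusterId: number) => Promise<string> | string;

// Labels a cluster with its two most frequent words of four or more letters.
// Ties are broken alphabetically; an empty cluster falls back to "Cluster <id>".
const frequentWordsLabeler: LabelerFn = (items, clusterId) => {
  const counts = new Map<string, number>();
  for (const item of items) {
    for (const word of item.text.toLowerCase().match(/[a-z]{4,}/g) ?? []) {
      counts.set(word, (counts.get(word) ?? 0) + 1);
    }
  }
  const top = Array.from(counts.entries())
    .sort((x, y) => y[1] - x[1] || x[0].localeCompare(y[0]))
    .slice(0, 2)
    .map(([word]) => word);
  return top.length > 0 ? top.join(" / ") : `Cluster ${clusterId}`;
};

const demoItems: EmbedItem[] = [
  { id: "doc-1", text: "Deep learning architectures", embedding: [0.1, 0.9] },
  { id: "doc-2", text: "Machine learning basics", embedding: [0.2, 0.8] },
];
const demoLabel = frequentWordsLabeler(demoItems, 0); // "learning / architectures"
```

A function of this type would be passed as the `labeler` option; an async variant could call an external service and return a `Promise<string>` instead.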
473
+ ### `ClusterErrorCode`
474
+
475
+ ```ts
476
+ type ClusterErrorCode =
477
+ | "EMPTY_INPUT"
478
+ | "INCONSISTENT_DIMENSIONS"
479
+ | "DEGENERATE_INPUT"
480
+ | "INVALID_K"
481
+ | "INVALID_OPTIONS";
482
+ ```
483
+
484
+ ---
485
+
486
+ ## Configuration
487
+
488
+ | Option | Type | Default | Description |
489
+ |--------|------|---------|-------------|
490
+ | `k` | `number` | -- | Number of clusters. Required when `autoK` is `false`. |
491
+ | `autoK` | `boolean` | `false` | Automatically select the optimal k using silhouette analysis. |
492
+ | `maxK` | `number` | `min(10, floor(sqrt(n)))` | Maximum k to evaluate when `autoK` is `true`; `n` is the number of items. |
493
+ | `maxIterations` | `number` | `100` | Maximum number of k-means iterations before stopping. |
494
+ | `tolerance` | `number` | `1e-4` | Convergence tolerance. The algorithm stops when the maximum centroid shift falls below this value. |
495
+ | `seed` | `number` | `42` | Seed for the pseudo-random number generator. Runs with the same seed and the same input produce identical results. |
496
+ | `normalize` | `boolean` | `true` | L2-normalize all embedding vectors before clustering. Recommended for cosine-distance semantics. |
497
+ | `labeler` | `LabelerFn` | -- | Custom function to generate a label for each cluster. Called once per cluster after assignment. |
498
+ | `distanceFn` | `(a: number[], b: number[]) => number` | `euclideanDistance` | Custom distance function. Use `cosineDistance` for angular separation or provide your own metric. |
499
+
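The `tolerance` row describes stopping once the maximum centroid shift falls below the threshold. A rough sketch of that check follows, assuming the shift is measured as Euclidean distance between a centroid's old and new position (the README does not specify the metric); `maxCentroidShift` is a hypothetical helper.

```ts
// Largest Euclidean movement of any centroid between two consecutive iterations.
function maxCentroidShift(prev: number[][], next: number[][]): number {
  let maxShift = 0;
  prev.forEach((p, c) => {
    const q = next[c]!;
    const shift = Math.sqrt(p.reduce((s, x, i) => s + (x - q[i]!) ** 2, 0));
    maxShift = Math.max(maxShift, shift);
  });
  return maxShift;
}

const toleranceValue = 1e-4; // the documented default
const centroidsBefore = [[0, 0], [1, 1]];
const smallMove = [[0, 0.00001], [1, 1]];
const largeMove = [[0, 0.5], [1, 1]];

const stopsOnSmallMove = maxCentroidShift(centroidsBefore, smallMove) < toleranceValue; // true
const stopsOnLargeMove = maxCentroidShift(centroidsBefore, largeMove) < toleranceValue; // false
```

A tighter tolerance such as `1e-6` trades extra iterations for more stable centroids, which is why it is usually paired with a higher `maxIterations`.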
500
+ ---
501
+
502
+ ## Error Handling
503
+
504
+ All errors thrown by the library are instances of `ClusterError`, which extends `Error` and carries a typed `code` field for programmatic handling.
505
+
506
+ ```ts
507
+ import { cluster, ClusterError } from "embed-cluster";
79
508
 
80
509
  try {
81
- const result = await cluster([], { k: 3 });
510
+ await cluster([], { k: 3 });
82
511
  } catch (err) {
83
512
  if (err instanceof ClusterError) {
84
- console.error(err.code); // 'EMPTY_INPUT'
513
+ switch (err.code) {
514
+ case "EMPTY_INPUT":
515
+ console.error("No items provided");
516
+ break;
517
+ case "INVALID_K":
518
+ console.error("k is out of range");
519
+ break;
520
+ case "INVALID_OPTIONS":
521
+ console.error("Provide k or set autoK: true");
522
+ break;
523
+ case "INCONSISTENT_DIMENSIONS":
524
+ console.error("All embeddings must have the same dimension");
525
+ break;
526
+ case "DEGENERATE_INPUT":
527
+ console.error("Input data is degenerate");
528
+ break;
529
+ }
85
530
  }
86
531
  }
87
532
  ```
88
533
 
89
- ## Features
534
+ | Error Code | Condition |
535
+ |------------|-----------|
536
+ | `EMPTY_INPUT` | The items array is empty. |
537
+ | `INCONSISTENT_DIMENSIONS` | Embedding vectors have different lengths. |
538
+ | `DEGENERATE_INPUT` | Input data is degenerate (e.g., all identical vectors). |
539
+ | `INVALID_K` | k is not a positive integer, or k exceeds the number of items. |
540
+ | `INVALID_OPTIONS` | Neither `k` nor `autoK: true` was provided. |
90
541
 
91
- - **k-means++ clustering** -- Smart centroid initialization for faster convergence and better results
92
- - **Silhouette analysis** -- Evaluate cluster quality with per-point and per-cluster silhouette coefficients
93
- - **Automatic k selection** -- Find the optimal number of clusters using silhouette and elbow methods
94
- - **PCA/UMAP visualization** -- Reduce high-dimensional embeddings to 2D for visualization (requires optional peer deps)
95
- - **L2 normalization** -- Built-in vector normalization for cosine-distance clustering
96
- - **Custom labeling** -- Provide your own labeling function or use built-in TF-IDF topic extraction
97
- - **Reproducible results** -- Seed-based random number generation for deterministic clustering
98
- - **TypeScript-first** -- Full type definitions with strict typing throughout
542
+ ---
99
543
 
100
- ## Configuration
544
+ ## Advanced Usage
101
545
 
102
- | Option | Type | Default | Description |
103
- |--------|------|---------|-------------|
104
- | `k` | `number` | -- | Number of clusters (required if `autoK` is false) |
105
- | `autoK` | `boolean` | `false` | Automatically select optimal k |
106
- | `maxK` | `number` | `10` | Maximum k to try when `autoK` is true |
107
- | `maxIterations` | `number` | `100` | Maximum k-means iterations |
108
- | `tolerance` | `number` | `1e-4` | Convergence tolerance |
109
- | `seed` | `number` | -- | Random seed for reproducibility |
110
- | `normalize` | `boolean` | `true` | L2-normalize embeddings before clustering |
111
- | `labeler` | `LabelerFn` | -- | Custom labeling function |
112
- | `distanceFn` | `Function` | -- | Custom distance function |
113
-
114
- ## CLI
546
+ ### Using cosine distance
115
547
 
116
- ```bash
117
- npx embed-cluster --input embeddings.json --k 5 --format summary
548
+ ```ts
549
+ import { cluster, cosineDistance } from "embed-cluster";
550
+
551
+ const result = await cluster(items, {
552
+ k: 5,
553
+ normalize: true,
554
+ distanceFn: cosineDistance,
555
+ });
118
556
  ```
119
557
 
558
+ When `normalize` is `true` (the default), all vectors are L2-normalized before clustering. Combined with `cosineDistance`, this performs angular clustering -- the standard approach for embedding vectors from language models.
559
+
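A useful fact behind this pairing: for unit-length vectors, squared Euclidean distance equals exactly twice the cosine distance, so both metrics rank neighbours in the same order once vectors are normalized. The snippet below checks the identity numerically; the helper names are illustrative, and the README does not state whether the package relies on this property.

```ts
const dotProduct = (a: number[], b: number[]): number =>
  a.reduce((s, x, i) => s + x * b[i]!, 0);

const toUnit = (v: number[]): number[] => {
  const norm = Math.sqrt(dotProduct(v, v));
  return v.map((x) => x / norm);
};

const u = toUnit([1, 2, 3]);
const w = toUnit([-2, 0.5, 4]);

// |u - w|^2 = |u|^2 + |w|^2 - 2 (u . w) = 2 - 2 (u . w) = 2 * (1 - u . w)
const squaredEuclidean = u.reduce((s, x, i) => s + (x - w[i]!) ** 2, 0);
const cosineDist = 1 - dotProduct(u, w);
const identityGap = Math.abs(squaredEuclidean - 2 * cosineDist); // zero up to rounding error
```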
560
+ ### Evaluating an existing clustering
561
+
562
+ ```ts
563
+ import { kMeans, silhouetteScore, cosineDistance } from "embed-cluster";
564
+
565
+ const result = kMeans(items, 4, { seed: 123, normalize: true });
566
+ const quality = silhouetteScore(result, cosineDistance);
567
+
568
+ console.log(quality.score); // overall mean silhouette
569
+ console.log(quality.perCluster); // e.g. [0.72, 0.85, 0.61, 0.78]
570
+ console.log(quality.perItem); // per-item scores (same length as items)
571
+ ```
572
+
573
+ ### Comparing different k values
574
+
575
+ ```ts
576
+ import { findOptimalK } from "embed-cluster";
577
+
578
+ const optimal = findOptimalK(items, {
579
+ maxK: 15,
580
+ normalize: true,
581
+ seed: 42,
582
+ });
583
+
584
+ console.log(`Best k: ${optimal.k}`);
585
+ for (const entry of optimal.scores) {
586
+ console.log(` k=${entry.k} silhouette=${entry.silhouette.toFixed(3)} inertia=${entry.inertia.toFixed(1)}`);
587
+ }
588
+ ```
589
+
590
+ ### Identifying outliers
591
+
592
+ Points with a negative per-item silhouette coefficient are poorly assigned and may be outliers:
593
+
594
+ ```ts
595
+ const result = await cluster(items, { k: 5, seed: 42 });
596
+ const sil = silhouetteScore(result); // requires importing silhouetteScore from "embed-cluster"
597
+
598
+ const outlierIndices: number[] = [];
599
+ if (sil.perItem) {
600
+ sil.perItem.forEach((score, index) => {
601
+ if (score < 0) {
602
+ outlierIndices.push(index);
603
+ }
604
+ });
605
+ }
606
+ console.log(`Found ${outlierIndices.length} outlier(s)`);
607
+ ```
608
+
609
+ ### Extracting cluster membership
610
+
611
+ ```ts
612
+ const result = await cluster(items, { k: 4, seed: 42 });
613
+
614
+ for (const c of result.clusters) {
615
+ console.log(`Cluster ${c.id} (${c.size} items, cohesion=${c.cohesion.toFixed(4)}):`);
616
+ for (const item of c.items) {
617
+ console.log(` ${item.id}: ${item.text} (dist=${item.distanceToCentroid.toFixed(4)})`);
618
+ }
619
+ }
620
+ ```
621
+
622
+ ### Reusing configuration across calls
623
+
624
+ ```ts
625
+ import { createClusterer } from "embed-cluster";
626
+
627
+ const clusterer = createClusterer({
628
+ normalize: true,
629
+ seed: 42,
630
+ maxIterations: 200,
631
+ tolerance: 1e-6,
632
+ });
633
+
634
+ // All calls inherit the bound configuration
635
+ const r1 = await clusterer.cluster(datasetA, { k: 3 });
636
+ const r2 = await clusterer.cluster(datasetB, { k: 5 });
637
+
638
+ // Per-call options override bound config
639
+ const r3 = await clusterer.cluster(datasetC, { autoK: true, maxK: 20 });
640
+ ```
641
+
642
+ ---
643
+
644
+ ## TypeScript
645
+
646
+ The package is written in TypeScript with strict mode enabled and ships type declarations alongside the compiled JavaScript. All interfaces, types, and the `ClusterError` class are exported from the package entry point.
647
+
648
+ ```ts
649
+ import type {
650
+ EmbedItem,
651
+ ClusterItem,
652
+ Cluster,
653
+ ClusterOptions,
654
+ ClusterResult,
655
+ ClusterQuality,
656
+ SilhouetteResult,
657
+ OptimalKResult,
658
+ VisualizationData,
659
+ LabelerFn,
660
+ Clusterer,
661
+ ClusterErrorCode,
662
+ } from "embed-cluster";
663
+ ```
664
+
665
+ The package targets ES2022 and compiles to CommonJS. It requires Node.js 18 or later.
666
+
667
+ ---
668
+
120
669
  ## License
121
670
 
122
671
  MIT
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "embed-cluster",
3
- "version": "0.3.0",
3
+ "version": "0.3.1",
4
4
  "description": "Cluster embeddings into topics with automatic labeling",
5
5
  "main": "dist/index.js",
6
6
  "types": "dist/index.d.ts",