twokeys 2.2.0 → 3.0.1

package/README.md CHANGED
@@ -1,422 +1,385 @@
1
- # Twokeys
2
-
3
- > A small data exploration and manipulation library, named after **John Tukey**, the legendary statistician who pioneered exploratory data analysis (EDA).
4
-
5
- [![CI](https://github.com/buley/twokeys/actions/workflows/ci.yml/badge.svg)](https://github.com/buley/twokeys/actions/workflows/ci.yml)
6
- [![npm version](https://badge.fury.io/js/twokeys.svg)](https://www.npmjs.com/package/twokeys)
7
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
8
-
9
- ## About John Tukey
10
-
11
- John Wilder Tukey (1915–2000) revolutionized how we look at data. He invented the box plot, coined the terms "bit" and "software," and championed the idea that **looking at data** is just as important as modeling it. His book *Exploratory Data Analysis* (1977) changed statistics forever.
12
-
13
- This library is named after him, a founding mind in [data exploration and analysis](http://en.wikipedia.org/wiki/Exploratory_data_analysis) and a personal hero of the author.
14
-
15
- ## Features
16
-
17
- - **Summary Statistics**: Mean, median, mode, trimean, quartiles (hinges)
18
- - **Outlier Detection**: Tukey fences (inner and outer)
19
- - **Letter Values**: Extended quartiles (eighths, sixteenths, etc.)
20
- - **Stem-and-Leaf**: Text-based distribution visualization
21
- - **Ranking**: Full ranking with tie handling
22
- - **Binning**: Histogram-style grouping
23
- - **Smoothing**: Hanning filter, Tukey's 3RSSH smoothing
24
- - **Transforms**: Logarithms, square roots, reciprocals
25
- - **Graph Analytics**: Centrality, communities, paths, flow, clustering, TSP approximation
26
- - **GDS-style Catalog**: In-memory graph projections with procedure wrappers and pipelines
27
- - **WASM Support**: Optional WebAssembly for maximum performance
28
- - **Zero Dependencies**: Pure TypeScript, works everywhere
29
- - **Tiny**: <3KB minified and gzipped
30
-
31
- ## Packages
32
-
33
- | Package | Description |
34
- |---------|-------------|
35
- | `twokeys` | Core TypeScript library |
36
- | `@buley/twokeys-wasm` | WebAssembly implementation with TypeScript fallback |
37
- | `@buley/twokeys-types` | Shared Zod schemas for runtime validation |
38
-
39
- ## Installation
40
-
41
- ```bash
42
- npm install twokeys
43
- # or
44
- bun add twokeys
45
- # or
46
- yarn add twokeys
47
- ```
48
-
49
- For WASM acceleration (optional):
50
-
51
- ```bash
52
- npm install @buley/twokeys-wasm
53
- ```
54
-
55
- ## Quick Start
56
-
57
- ```typescript
58
- import { Series } from 'twokeys';
59
-
60
- // Create a series from your data
61
- const series = new Series({ data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 100] });
62
-
63
- // Get summary statistics
64
- console.log(series.mean()); // 14.5
65
- console.log(series.median()); // { datum: 5.5, depth: 5.5 }
66
- console.log(series.trimean()); // Tukey's trimean
67
-
68
- // Detect outliers (using Tukey fences)
69
- console.log(series.outliers()); // [100]
70
-
71
- // Get a full description
72
- const desc = series.describe();
73
- console.log(desc.summary);
74
- ```
75
-
76
- ### Using WASM (Optional)
77
-
78
- ```typescript
79
- import { loadWasm, analyze, isWasmLoaded } from '@buley/twokeys-wasm';
80
-
81
- // Load WASM (falls back to TypeScript if unavailable)
82
- await loadWasm();
83
-
84
- console.log(isWasmLoaded()); // true if WASM loaded
85
-
86
- // Use the same API - automatically uses WASM when available
87
- const result = analyze([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]);
88
- console.log(result.summary.outliers); // [100]
89
- ```
90
-
91
- ## Graph Analytics + GDS-style Catalog
92
-
93
- ```typescript
94
- import {
95
- createGraphCatalog,
96
- topologicalSort,
97
- louvainCommunities,
98
- kNearestNeighbors,
99
- linkPrediction,
100
- aStarShortestPath,
101
- yenKShortestPaths,
102
- allPairsShortestPaths,
103
- maximumFlow,
104
- minCostMaxFlow,
105
- } from 'twokeys';
106
-
107
- const nodes = ['root', 'plan', 'build', 'ship'] as const;
108
- const edges = [
109
- { from: 'root', to: 'plan', weight: 1 },
110
- { from: 'plan', to: 'build', weight: 2 },
111
- { from: 'build', to: 'ship', weight: 1 },
112
- ];
113
-
114
- // Vote-weighted linearization for "abstract -> concrete" flows
115
- const linear = topologicalSort(nodes, edges, {
116
- priorityByNode: { ship: 10, build: 6, plan: 3 },
117
- });
118
-
119
- // Community + similarity/link prediction family
120
- const communities = louvainCommunities(nodes, edges);
121
- const knn = kNearestNeighbors(nodes, edges, { k: 2 });
122
- const links = linkPrediction(nodes, edges, { limit: 5 });
123
-
124
- // Path + flow family
125
- const aStar = aStarShortestPath(nodes, edges, 'root', 'ship');
126
- const yen = yenKShortestPaths(nodes, edges, 'root', 'ship', { k: 3 });
127
- const apsp = allPairsShortestPaths(nodes, edges);
128
- const maxFlow = maximumFlow(nodes, edges, 'root', 'ship');
129
- const minCost = minCostMaxFlow(
130
- nodes,
131
- edges.map((edge) => ({
132
- from: edge.from,
133
- to: edge.to,
134
- capacity: edge.weight ?? 1,
135
- cost: edge.weight ?? 1,
136
- })),
137
- 'root',
138
- 'ship',
139
- );
140
-
141
- // GDS-style catalog/procedure wrapper
142
- const gds = createGraphCatalog<string>();
143
- gds.project('tasks', [...nodes], edges, { directed: true });
144
- const rank = gds.pageRank('tasks');
145
- const pipeline = gds.runPipeline('tasks', [
146
- { id: 'rank', kind: 'page-rank' },
147
- { id: 'sim', kind: 'similarity', options: { metric: 'jaccard' } },
148
- { id: 'links', kind: 'link-prediction', options: { limit: 10 } },
149
- ]);
150
- ```
151
-
152
- ## Benchmarks
153
-
154
- Performance on different dataset sizes (operations per second, higher is better):
155
-
156
- ### TypeScript Implementation
157
-
158
- | Method | 100 elements | 1,000 elements | 10,000 elements |
159
- |--------|-------------:|---------------:|----------------:|
160
- | `sorted()` | 224,599 | 14,121 | 874 |
161
- | `median()` | 199,397 | 14,127 | 876 |
162
- | `mean()` | 550,610 | 413,551 | 68,399 |
163
- | `mode()` | 87,665 | 6,738 | 431 |
164
- | `fences()` | 238,486 | 13,270 | 864 |
165
- | `outliers()` | 210,058 | 12,584 | 854 |
166
- | `smooth()` | 61,268 | 1,599 | 31 |
167
- | `describe()` | 15,642 | 952 | 29 |
168
-
169
- ### v2.0 Performance Improvements
170
-
171
- Compared to v1.x (CoffeeScript), v2.0 TypeScript is dramatically faster:
172
-
173
- | Method | v1.x (10K) | v2.0 (10K) | Improvement |
174
- |--------|------------|------------|-------------|
175
- | `median()` | 6 ops/sec | 876 ops/sec | **146x faster** |
176
- | `counts()` | 1 ops/sec | 606 ops/sec | **606x faster** |
177
- | `fences()` | 5 ops/sec | 864 ops/sec | **173x faster** |
178
- | `describe()` | 1 ops/sec | 29 ops/sec | **29x faster** |
179
-
180
- Key optimizations:
181
- - O(1) index-based median (was O(n²) recursive)
182
- - Map-based frequency counting (was O(n²) recursive)
183
- - Eliminated unnecessary array copying in smoothing
184
-
185
- ## Example Output
186
-
187
- Applying `describe()` to a Series returns a complete analysis:
188
-
189
- ```javascript
190
- const series = new Series({ data: [48, 59, 63, 30, 57, 92, 73, 47, 31, 5] });
191
- const analysis = series.describe();
192
-
193
- // Result:
194
- {
195
- "original": [48, 59, 63, 30, 57, 92, 73, 47, 31, 5],
196
- "summary": {
197
- "median": { "datum": 52.5, "depth": 5.5 },
198
- "mean": 50.5,
199
- "hinges": [{ "datum": 31, "depth": 3 }, { "datum": 63, "depth": 8 }],
200
- "adjacent": [30, 92],
201
- "outliers": [],
202
- "extremes": [5, 92],
203
- "iqr": 32,
204
- "fences": [4.5, 100.5]
205
- },
206
- "smooths": {
207
- "smooth": [48, 30, 57, 57, 57, 47, 31, 5, 5, 5],
208
- "hanning": [48, 61, 46.5, 43.5, 74.5, 82.5, 60, 39, 18, 5]
209
- },
210
- "transforms": {
211
- "logs": [3.87, 4.08, 4.14, ...],
212
- "roots": [6.93, 7.68, 7.94, ...],
213
- "inverse": [0.021, 0.017, 0.016, ...]
214
- },
215
- "sorted": [5, 30, 31, 47, 48, 57, 59, 63, 73, 92],
216
- "ranked": { "up": {...}, "down": {...}, "groups": {...} },
217
- "binned": { "bins": 4, "width": 26, "binned": {...} }
218
- }
219
- ```
220
-
221
- ## API
222
-
223
- ### Series
224
-
225
- The `Series` class provides methods for exploring 1-dimensional numerical data.
226
-
227
- ```typescript
228
- import { Series } from 'twokeys';
229
-
230
- const series = new Series({ data: [1, 2, 3, 4, 5] });
231
- ```
232
-
233
- #### Summary Statistics
234
-
235
- | Method | Description |
236
- |--------|-------------|
237
- | `mean()` | Arithmetic mean |
238
- | `median()` | Median value and depth |
239
- | `mode()` | Most frequent value(s) |
240
- | `trimean()` | Tukey's trimean: (Q1 + 2×median + Q3) / 4 |
241
- | `extremes()` | [min, max] values |
242
- | `hinges()` | Quartile boundaries (Q1, Q3) |
243
- | `iqr()` | Interquartile range |
244
-
245
- #### Outlier Detection
246
-
247
- | Method | Description |
248
- |--------|-------------|
249
- | `fences()` | Inner fence boundaries (1.5 × IQR) |
250
- | `outer()` | Outer fence boundaries (3 × IQR) |
251
- | `outliers()` | Values outside inner fences |
252
- | `inside()` | Values within fences |
253
- | `outside()` | Values outside outer fences |
254
- | `adjacent()` | Most extreme non-outlier values |
255
-
256
- #### Letter Values & Visualization
257
-
258
- | Method | Description |
259
- |--------|-------------|
260
- | `letterValues()` | Extended quartiles (M, F, E, D, C, B, A...) |
261
- | `stemLeaf()` | Stem-and-leaf text display |
262
- | `midSummaries()` | Symmetric quantile pair averages |
263
-
264
- #### Ranking & Counting
265
-
266
- | Method | Description |
267
- |--------|-------------|
268
- | `sorted()` | Sorted copy of data |
269
- | `ranked()` | Rank information with tie handling |
270
- | `counts()` | Frequency of each value |
271
- | `binned()` | Histogram-style bins |
272
-
273
- #### Transforms
274
-
275
- | Method | Description |
276
- |--------|-------------|
277
- | `logs()` | Natural logarithm of each value |
278
- | `roots()` | Square root of each value |
279
- | `inverse()` | Reciprocal (1/x) of each value |
280
-
281
- #### Smoothing
282
-
283
- | Method | Description |
284
- |--------|-------------|
285
- | `hanning()` | Hanning filter (running averages) |
286
- | `smooth()` | Tukey's 3RSSH smoothing |
287
- | `rough()` | Residuals (original - smooth) |
288
-
289
- #### Full Description
290
-
291
- ```typescript
292
- series.describe();
293
- // Returns complete analysis including all of the above
294
- ```
295
-
296
- ### Points
297
-
298
- The `Points` class handles n-dimensional point data.
299
-
300
- ```typescript
301
- import { Points } from 'twokeys';
302
-
303
- // 100 random 2D points
304
- const points = new Points({ count: 100, dimensionality: 2 });
305
-
306
- // Or provide your own data
307
- const myPoints = new Points({ data: [[1, 2], [3, 4], [5, 6]] });
308
- ```
309
-
310
- ### Twokeys
311
-
312
- The main class provides factory methods and utilities.
313
-
314
- ```typescript
315
- import Twokeys from 'twokeys';
316
-
317
- // Generate random data
318
- const randomData = Twokeys.randomSeries(100, 50); // 100 values, max 50
319
- const randomPoints = Twokeys.randomPoints(50, 3); // 50 3D points
320
-
321
- // Access classes
322
- const series = new Twokeys.Series({ data: [1, 2, 3] });
323
- const points = new Twokeys.Points(100);
324
- ```
325
-
326
- ## Examples
327
-
328
- ### Box Plot Data
329
-
330
- ```typescript
331
- const series = new Series({ data: myData });
332
-
333
- const boxPlot = {
334
- min: series.extremes()[0],
335
- q1: series.hinges()[0].datum,
336
- median: series.median().datum,
337
- q3: series.hinges()[1].datum,
338
- max: series.extremes()[1],
339
- outliers: series.outliers(),
340
- };
341
- ```
342
-
343
- ### Outlier Detection
344
-
345
- ```typescript
346
- const series = new Series({ data: measurements });
347
-
348
- // Inner fences: 1.5 × IQR from hinges
349
- const suspicious = series.outliers();
350
-
351
- // Outer fences: 3 × IQR from hinges
352
- const extreme = series.outside();
353
- ```
354
-
355
- ### Letter Values Display
356
-
357
- ```typescript
358
- const series = new Series({ data: myData });
359
-
360
- // Get extended quartiles
361
- const lv = series.letterValues();
362
- // [
363
- // { letter: 'M', depth: 10.5, lower: 52.5, upper: 52.5, mid: 52.5, spread: 0 },
364
- // { letter: 'F', depth: 5, lower: 31, upper: 73, mid: 52, spread: 42 },
365
- // { letter: 'E', depth: 3, lower: 30, upper: 82, mid: 56, spread: 52 },
366
- // ...
367
- // ]
368
- ```
369
-
370
- ### Stem-and-Leaf Display
371
-
372
- ```typescript
373
- const series = new Series({ data: myData });
374
-
375
- const { display } = series.stemLeaf();
376
- // [
377
- // " 0 | 5",
378
- // " 3 | 0 1",
379
- // " 4 | 7 8",
380
- // " 5 | 7 9",
381
- // " 6 | 3",
382
- // " 7 | 3",
383
- // " 9 | 2"
384
- // ]
385
- ```
386
-
387
- ### Data Transformation
388
-
389
- ```typescript
390
- const series = new Series({ data: skewedData });
391
-
392
- // Try different transforms to normalize
393
- const logTransformed = series.logs();
394
- const sqrtTransformed = series.roots();
395
- ```
396
-
397
- ## Development
398
-
399
- ```bash
400
- # Install dependencies
401
- bun install
402
-
403
- # Run tests
404
- bun test
405
-
406
- # Run tests with coverage
407
- bun test --coverage
408
-
409
- # Build all packages
410
- bun run build
411
-
412
- # Run benchmark
413
- bun run bench.ts
414
- ```
415
-
416
- ## License
417
-
418
- MIT
419
-
420
- ---
421
-
422
- *"The best thing about being a statistician is that you get to play in everyone's backyard."* — John Tukey
1
+ # Twokeys
2
+
3
+ > Exploratory data analysis for graphs, vectors, and series — named after **John Tukey**, the legendary statistician who pioneered EDA.
4
+
5
+ [![CI](https://github.com/buley/twokeys/actions/workflows/ci.yml/badge.svg)](https://github.com/buley/twokeys/actions/workflows/ci.yml)
6
+ [![npm version](https://badge.fury.io/js/twokeys.svg)](https://www.npmjs.com/package/twokeys)
7
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
8
+
9
+ ## What Is This?
10
+
11
+ Tukey taught us that **looking at data** is just as important as modeling it. Twokeys applies that philosophy to five areas:
12
+
13
+ 1. **Graph EDA**: Treat structural properties of graphs (degree distributions, clustering coefficients, assortativity) as data series that deserve full Tukey-style analysis
14
+ 2. **Vector Distance & Similarity** — Cosine similarity, Mahalanobis distance, Jaccard similarity, L2 normalization, and more
15
+ 3. **Multivariate Analysis** — Centroids, covariance matrices, correlation matrices, and Mahalanobis-based outlier detection via the `Points` class
16
+ 4. **1D EDA** — The original `Series` class: medians, fences, letter values, stem-and-leaf plots, smoothing, and everything else Tukey invented
17
+ 5. **Graph Algorithms** — Centrality, community detection, shortest paths, flow, clustering (k-means, hierarchical, DBSCAN), TSP approximation, and a GDS-style catalog
18
+
19
+ Zero dependencies. Pure TypeScript. Works everywhere.
20
+
21
+ ## Installation
22
+
23
+ ```bash
24
+ npm install twokeys
25
+ # or
26
+ bun add twokeys
27
+ ```
28
+
29
+ ## Graph EDA
30
+
31
+ The core insight: graph structural properties ARE data series. Degree distributions get box plots. Clustering coefficients get outlier detection. Assortativity tells you whether your network is stratified.
32
+
33
+ ```typescript
34
+ import { graphEda, graphOutliers } from 'twokeys';
35
+
36
+ const nodes = ['alice', 'bob', 'carol', 'dave', 'eve'] as const;
37
+ const edges = [
38
+ { from: 'alice', to: 'bob', weight: 1 },
39
+ { from: 'alice', to: 'carol', weight: 1 },
40
+ { from: 'bob', to: 'carol', weight: 1 },
41
+ { from: 'carol', to: 'dave', weight: 1 },
42
+ { from: 'dave', to: 'eve', weight: 1 },
43
+ ];
44
+
45
+ const summary = graphEda(nodes, edges);
46
+
47
+ // Every structural metric, analyzed as a Tukey Series:
48
+ summary.density; // edges / maxPossibleEdges
49
+ summary.degreeDistribution; // Full SeriesDescription with median, fences, outliers
50
+ summary.clusteringDistribution; // Clustering coefficients as EDA
51
+ summary.globalClusteringCoefficient;
52
+ summary.averagePathLength;
53
+ summary.diameter;
54
+ summary.reciprocity;
55
+ summary.degreeAssortativity; // Do hubs connect to hubs?
56
+
57
+ // Find structurally unusual nodes
58
+ const unusual = graphOutliers(nodes, edges, { method: 'combined' });
59
+ // [{ nodeId: 'eve', score: 2.1, reason: 'Low degree + low clustering' }]
60
+ ```
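To make the "structural properties are data series" idea concrete, here is a rough, library-independent sketch that derives the degree sequence and density of the example graph by hand. It assumes an undirected simple graph; `Edge`, `people`, `links`, `degreeOf`, and `density` are illustrative names rather than twokeys exports, and `graphEda` may define these metrics differently (for example, for directed graphs).

```typescript
// Hypothetical stand-alone sketch; none of these names come from twokeys.
type Edge = { from: string; to: string };

const people = ['alice', 'bob', 'carol', 'dave', 'eve'];
const links: Edge[] = [
  { from: 'alice', to: 'bob' },
  { from: 'alice', to: 'carol' },
  { from: 'bob', to: 'carol' },
  { from: 'carol', to: 'dave' },
  { from: 'dave', to: 'eve' },
];

// Degree of each node, treating every edge as undirected.
const degreeOf = new Map<string, number>();
people.forEach((p) => degreeOf.set(p, 0));
links.forEach(({ from, to }) => {
  degreeOf.set(from, (degreeOf.get(from) ?? 0) + 1);
  degreeOf.set(to, (degreeOf.get(to) ?? 0) + 1);
});

// The degree sequence is an ordinary numeric series, ready for 1D EDA.
const degrees = people.map((p) => degreeOf.get(p) ?? 0);

// Density of an undirected simple graph: edges / (n * (n - 1) / 2).
const nodeCount = people.length;
const density = links.length / ((nodeCount * (nodeCount - 1)) / 2);

console.log(degrees); // [2, 2, 3, 2, 1]
console.log(density); // 0.5
```

Once the degrees are a plain array, any one-dimensional summary (median, hinges, fences) applies to them unchanged, which is the point the section makes.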
61
+
62
+ ## Vector Distance & Similarity
63
+
64
+ Standalone functions for vector math. These are the workhorses that `Points`, graph algorithms, and consumers all use.
65
+
66
+ ```typescript
67
+ import {
68
+ cosineSimilarity,
69
+ euclideanDistance,
70
+ manhattanDistance,
71
+ mahalanobisDistance,
72
+ normalizeL2,
73
+ cosineSimilaritySparse,
74
+ jaccardSimilarity,
75
+ overlapCoefficient,
76
+ } from 'twokeys';
77
+
78
+ // Dense vectors
79
+ cosineSimilarity([1, 2, 3], [4, 5, 6]); // ≈ 0.975
80
+ euclideanDistance([0, 0], [3, 4]); // 5
81
+ manhattanDistance([0, 0], [3, 4]); // 7
82
+ normalizeL2([3, 4]); // [0.6, 0.8]
83
+ mahalanobisDistance([5, 5], [0, 0], [1, 1]); // 7.07
84
+
85
+ // Sparse vectors (Map<string, number>)
86
+ const a = new Map([['x', 1], ['y', 2]]);
87
+ const b = new Map([['y', 3], ['z', 4]]);
88
+ cosineSimilaritySparse(a, b); // similarity on shared keys
89
+
90
+ // Set similarity
91
+ jaccardSimilarity(new Set([1, 2, 3]), new Set([2, 3, 4])); // 0.5
92
+ overlapCoefficient(new Set([1, 2]), new Set([2, 3, 4])); // 0.5
93
+ ```
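For readers who want to check the numbers in the comments above, the following sketch writes out the textbook definitions. The helper names (`cosine`, `euclidean`, `manhattan`, `diagonalMahalanobis`, `jaccard`) are made up for illustration and are not the twokeys implementations; the zero-vector and empty-set behaviour is an assumption. The Mahalanobis variant uses per-dimension variances only, which is what the three-argument call in the example suggests.

```typescript
// Textbook definitions as a sketch; these helpers are illustrative only.
function dot(u: number[], v: number[]): number {
  return u.reduce((sum, x, i) => sum + x * (v[i] ?? 0), 0);
}

function cosine(u: number[], v: number[]): number {
  const denom = Math.sqrt(dot(u, u)) * Math.sqrt(dot(v, v));
  return denom === 0 ? 0 : dot(u, v) / denom; // assumption: zero vectors give 0
}

function euclidean(u: number[], v: number[]): number {
  return Math.sqrt(u.reduce((sum, x, i) => sum + (x - (v[i] ?? 0)) ** 2, 0));
}

function manhattan(u: number[], v: number[]): number {
  return u.reduce((sum, x, i) => sum + Math.abs(x - (v[i] ?? 0)), 0);
}

// Mahalanobis distance with a diagonal covariance: each squared deviation is
// divided by that dimension's variance before summing.
function diagonalMahalanobis(p: number[], means: number[], variances: number[]): number {
  return Math.sqrt(
    p.reduce((sum, x, i) => sum + (x - (means[i] ?? 0)) ** 2 / (variances[i] ?? 1), 0),
  );
}

// Jaccard index: |A ∩ B| / |A ∪ B|.
function jaccard<T>(s: Set<T>, t: Set<T>): number {
  let shared = 0;
  s.forEach((item) => {
    if (t.has(item)) shared += 1;
  });
  const union = s.size + t.size - shared;
  return union === 0 ? 0 : shared / union; // assumption: two empty sets give 0
}

console.log(cosine([1, 2, 3], [4, 5, 6]));                 // ≈ 0.9746
console.log(euclidean([0, 0], [3, 4]));                    // 5
console.log(manhattan([0, 0], [3, 4]));                    // 7
console.log(diagonalMahalanobis([5, 5], [0, 0], [1, 1]));  // ≈ 7.071
console.log(jaccard(new Set([1, 2, 3]), new Set([2, 3, 4]))); // 0.5
```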
94
+
95
+ ## Multivariate Analysis (Points)
96
+
97
+ Full multivariate EDA: centroids, covariance, correlation, Mahalanobis outlier detection.
98
+
99
+ ```typescript
100
+ import { Points } from 'twokeys';
101
+
102
+ const points = new Points({
103
+ data: [
104
+ [1, 2, 3],
105
+ [4, 5, 6],
106
+ [7, 8, 9],
107
+ [100, 100, 100], // outlier
108
+ ],
109
+ });
110
+
111
+ points.centroid(); // [28, 28.75, 29.5]
112
+ points.variances(); // Per-dimension variance
113
+ points.standardDeviations(); // Per-dimension stddev
114
+ points.covarianceMatrix(); // Full covariance matrix
115
+ points.correlationMatrix(); // Pearson correlation matrix
116
+
117
+ // Mahalanobis distance — the multivariate equivalent of z-score
118
+ points.mahalanobis([50, 50, 50]); // Distance of a single point
119
+ points.mahalanobisAll(); // Distance for each stored point
120
+
121
+ // Tukey-style outlier detection for multivariate data
122
+ points.outliersByMahalanobis(); // [[100, 100, 100]]
123
+
124
+ // Normalization
125
+ points.normalizeL2(); // L2-normalize all points (returns new Points)
126
+ points.normalizeZscore(); // Z-score normalize per dimension
127
+
128
+ // Full description: each dimension analyzed as a Series
129
+ const desc = points.describe();
130
+ desc.centroid; // Mean point
131
+ desc.correlationMatrix; // Pearson correlations
132
+ desc.mahalanobisDistances; // Distance from centroid per point
133
+ desc.outlierCount; // How many outliers
134
+ desc.dimensionSummaries; // Each dimension as a full SeriesDescription
135
+ ```
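The centroid in the example can be reproduced with a few lines of plain TypeScript. The sketch below is not how `Points` is implemented; `centroidOf` and `covarianceOf` are hypothetical helpers, and the covariance divides by n - 1 (the sample convention), which is an assumption since the README does not say which convention the library uses.

```typescript
// Illustrative helpers only; not part of the twokeys API.
const rows = [
  [1, 2, 3],
  [4, 5, 6],
  [7, 8, 9],
  [100, 100, 100],
];

// Centroid: the mean of each column.
function centroidOf(data: number[][]): number[] {
  const first = data[0] ?? [];
  return first.map((_, d) => data.reduce((sum, row) => sum + (row[d] ?? 0), 0) / data.length);
}

// Sample covariance matrix: average product of deviations, divided by n - 1.
function covarianceOf(data: number[][]): number[][] {
  const center = centroidOf(data);
  return center.map((mi, i) =>
    center.map(
      (mj, j) =>
        data.reduce((sum, row) => sum + ((row[i] ?? 0) - mi) * ((row[j] ?? 0) - mj), 0) /
        (data.length - 1),
    ),
  );
}

const mid = centroidOf(rows);
const cov = covarianceOf(rows);

console.log(mid);    // [28, 28.75, 29.5]
console.log(cov[0]); // first row: [2310, 2286, 2262]
```

The single far-away row drags every column mean toward 100, which is why a distance that accounts for spread is more informative here than the raw centroid.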
136
+
137
+ ## 1D Exploratory Data Analysis (Series)
138
+
139
+ The original Tukey toolkit: everything you need to explore univariate data.
140
+
141
+ ```typescript
142
+ import { Series } from 'twokeys';
143
+
144
+ const series = new Series({ data: [2, 4, 4, 4, 5, 5, 7, 9] });
145
+
146
+ // Summary statistics
147
+ series.mean(); // 5
148
+ series.median(); // { datum: 4.5, depth: 4.5 }
149
+ series.mode(); // { count: 3, data: [4] }
150
+ series.trimean(); // Tukey's trimean
151
+
152
+ // Dispersion
153
+ series.variance(); // Sample variance
154
+ series.stddev(); // Standard deviation
155
+ series.iqr(); // Interquartile range
156
+ series.skewness(); // Fisher-Pearson skewness
157
+ series.kurtosis(); // Excess kurtosis
158
+
159
+ // Outlier detection (Tukey fences)
160
+ series.fences(); // Inner fence boundaries (1.5 × IQR)
161
+ series.outliers(); // Values outside inner fences
162
+ series.outside(); // Values outside outer fences (3 × IQR)
163
+
164
+ // Time series
165
+ series.ema(0.3); // Exponential moving average
166
+ series.zscore(); // Z-score normalization
167
+ series.hanning(); // Hanning filter
168
+ series.smooth(); // Tukey's 3RSSH smoothing
169
+ series.rough(); // Residuals (original - smooth)
170
+
171
+ // Visualization
172
+ series.stemLeaf(); // Stem-and-leaf plot
173
+ series.letterValues(); // Extended quartiles (M, F, E, D, C, B, A...)
174
+
175
+ // Everything at once
176
+ series.describe();
177
+ ```
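As a worked example of the depth-based summaries listed above, this sketch computes the median, hinges, inner fences, and trimean for the same eight values using Tukey's textbook rules. All names are illustrative, and the fences follow the classical "1.5 × hinge spread beyond the hinges" definition; the values returned by `Series` may differ if the library uses another convention.

```typescript
// Sketch of Tukey's depth-based summary; names are illustrative, not twokeys exports.
const sample = [2, 4, 4, 4, 5, 5, 7, 9];
const ordered = sample.slice().sort((x, y) => x - y);

// Value at a 1-based depth counted from the low end; half-depths average two neighbours.
function atDepth(sortedValues: number[], depth: number): number {
  const lo = sortedValues[Math.floor(depth) - 1] ?? NaN;
  const hi = sortedValues[Math.ceil(depth) - 1] ?? NaN;
  return (lo + hi) / 2;
}

const size = ordered.length;
const medianDepth = (size + 1) / 2;                   // 4.5
const hingeDepth = (Math.floor(medianDepth) + 1) / 2; // 2.5

const med = atDepth(ordered, medianDepth);                  // 4.5
const lowerHinge = atDepth(ordered, hingeDepth);            // 4
const upperHinge = atDepth(ordered, size + 1 - hingeDepth); // 6

const hingeSpread = upperHinge - lowerHinge;      // 2
const lowerFence = lowerHinge - 1.5 * hingeSpread; // 1
const upperFence = upperHinge + 1.5 * hingeSpread; // 9

// Trimean: (Q1 + 2 × median + Q3) / 4.
const triMean = (lowerHinge + 2 * med + upperHinge) / 4; // 4.75

// Values strictly beyond the inner fences.
const flagged = ordered.filter((v) => v < lowerFence || v > upperFence); // []

console.log({ med, lowerHinge, upperHinge, lowerFence, upperFence, triMean, flagged });
```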
178
+
179
+ ## Graph Algorithms
180
+
181
+ Centrality, community detection, shortest paths, flow, clustering, and a GDS-style catalog.
182
+
183
+ ```typescript
184
+ import {
185
+ // Centrality
186
+ degreeCentrality,
187
+ closenessCentrality,
188
+ betweennessCentrality,
189
+ pageRank,
190
+
191
+ // Community detection
192
+ louvainCommunities,
193
+ labelPropagationCommunities,
194
+
195
+ // Similarity & link prediction
196
+ nodeSimilarity,
197
+ kNearestNeighbors,
198
+ linkPrediction,
199
+
200
+ // Paths & flow
201
+ shortestPath,
202
+ aStarShortestPath,
203
+ yenKShortestPaths,
204
+ allPairsShortestPaths,
205
+ maximumFlow,
206
+ minCostMaxFlow,
207
+
208
+ // Structure
209
+ topologicalSort,
210
+ stronglyConnectedComponents,
211
+ weaklyConnectedComponents,
212
+ minimumSpanningTree,
213
+ articulationPointsAndBridges,
214
+ analyzeGraph,
215
+
216
+ // Clustering
217
+ kMeansClustering,
218
+ kMeansAuto,
219
+ hierarchicalClustering,
220
+ dbscan,
221
+
222
+ // TSP
223
+ travelingSalesmanApprox,
224
+
225
+ // GDS-style catalog
226
+ createGraphCatalog,
227
+ } from 'twokeys';
228
+ ```
229
+
230
+ ### Clustering
231
+
232
+ ```typescript
233
+ import { kMeansClustering, hierarchicalClustering, dbscan } from 'twokeys';
234
+
235
+ const points = [
236
+ [1, 1], [1.5, 2], [3, 4],
237
+ [5, 7], [3.5, 5], [4.5, 5],
238
+ [3.5, 4.5],
239
+ ];
240
+
241
+ // k-Means
242
+ const km = kMeansClustering(points, 2);
243
+
244
+ // Hierarchical (single, complete, average, or ward linkage)
245
+ const hc = hierarchicalClustering(points, 2, { linkage: 'ward' });
246
+ hc.clusters; // Point indices per cluster
247
+ hc.dendrogram; // Merge history
248
+ hc.silhouette; // Cluster quality score
249
+
250
+ // DBSCAN: density-based, finds arbitrarily shaped clusters, handles noise
251
+ const db = dbscan(points, 1.5, 2);
252
+ db.clusters; // Point indices per cluster
253
+ db.noise; // Indices of noise points
254
+ db.clusterCount; // Number of clusters found
255
+ ```
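The README does not show what `kMeansClustering` returns, so as a point of reference here is a minimal sketch of Lloyd's algorithm run on the same seven points. It assumes the first k points seed the centroids (a deterministic choice made for this example) and returns one cluster label per point; `lloyd`, `nearest`, and `sqDist` are hypothetical names, and the library's initialization and result shape may well differ.

```typescript
// Minimal Lloyd's-algorithm sketch. Assumption: the first k points seed the
// centroids, which keeps the run deterministic.
type Vec = number[];

function sqDist(u: Vec, v: Vec): number {
  return u.reduce((sum, x, i) => sum + (x - (v[i] ?? 0)) ** 2, 0);
}

// Index of the closest centroid; ties go to the lower index.
function nearest(p: Vec, centers: Vec[]): number {
  let best = 0;
  centers.forEach((c, idx) => {
    const current = centers[best];
    if (current !== undefined && sqDist(p, c) < sqDist(p, current)) best = idx;
  });
  return best;
}

function lloyd(data: Vec[], k: number, maxIterations = 100): number[] {
  let centers = data.slice(0, k).map((p) => p.slice());
  let labels: number[] = data.map(() => -1);
  for (let iter = 0; iter < maxIterations; iter += 1) {
    // Assignment step: attach every point to its nearest centroid.
    const next = data.map((p) => nearest(p, centers));
    if (next.every((label, i) => label === labels[i])) break; // converged
    labels = next;
    // Update step: move each centroid to the mean of its members.
    centers = centers.map((old, c) => {
      const members = data.filter((_, i) => labels[i] === c);
      if (members.length === 0) return old; // keep an empty cluster's centroid
      return old.map((_, d) => members.reduce((s, m) => s + (m[d] ?? 0), 0) / members.length);
    });
  }
  return labels;
}

const sampleLabels = lloyd(
  [[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]],
  2,
);
console.log(sampleLabels); // [0, 0, 1, 1, 1, 1, 1]
```

With this seeding the run settles after three passes: the two points near the origin form one cluster and the remaining five form the other.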
256
+
257
+ ### GDS-Style Catalog
258
+
259
+ In-memory graph projections with procedure wrappers and pipelines, inspired by Neo4j GDS.
260
+
261
+ ```typescript
262
+ import { createGraphCatalog } from 'twokeys';
263
+
264
+ const gds = createGraphCatalog<string>();
265
+ gds.project('social', nodes, edges, { directed: true });
266
+
267
+ const rank = gds.pageRank('social');
268
+ const pipeline = gds.runPipeline('social', [
269
+ { id: 'rank', kind: 'page-rank' },
270
+ { id: 'sim', kind: 'similarity', options: { metric: 'jaccard' } },
271
+ { id: 'links', kind: 'link-prediction', options: { limit: 10 } },
272
+ ]);
273
+ ```
274
+
275
+ ## API Reference
276
+
277
+ ### Distance & Similarity (`distance.ts`)
278
+
279
+ | Function | Description |
280
+ |----------|-------------|
281
+ | `cosineSimilarity(a, b)` | Cosine similarity between dense vectors, in the range [-1, 1] |
282
+ | `euclideanDistance(a, b)` | Euclidean (L2) distance |
283
+ | `squaredEuclideanDistance(a, b)` | Squared Euclidean distance (avoids sqrt) |
284
+ | `manhattanDistance(a, b)` | Manhattan (L1) distance |
285
+ | `mahalanobisDistance(point, means, variances, epsilon?)` | Mahalanobis distance from per-dimension means and variances |
286
+ | `normalizeL2(vector)` | L2-normalize a vector to unit length |
287
+ | `cosineSimilaritySparse(a, b)` | Cosine similarity for sparse vectors (`Map<string, number>`) |
288
+ | `jaccardSimilarity(a, b)` | Jaccard index for sets |
289
+ | `overlapCoefficient(a, b)` | Overlap coefficient for sets |
290
+
291
+ ### Graph EDA (`graph-eda.ts`)
292
+
293
+ | Function | Description |
294
+ |----------|-------------|
295
+ | `graphEda(nodes, edges, options?)` | Full Tukey-style EDA of graph structure |
296
+ | `clusteringCoefficient(nodes, edges, options?)` | Per-node clustering coefficients |
297
+ | `graphOutliers(nodes, edges, options?)` | Structurally unusual nodes |
298
+
299
+ ### Series
300
+
301
+ | Category | Methods |
302
+ |----------|---------|
303
+ | **Statistics** | `mean()`, `median()`, `mode()`, `trimean()`, `variance()`, `stddev()`, `skewness()`, `kurtosis()` |
304
+ | **Dispersion** | `extremes()`, `hinges()`, `iqr()`, `fences()`, `outer()` |
305
+ | **Outliers** | `outliers()`, `inside()`, `outside()`, `adjacent()` |
306
+ | **Time Series** | `ema(alpha)`, `zscore()`, `hanning()`, `smooth()`, `rough()` |
307
+ | **Visualization** | `stemLeaf()`, `letterValues()`, `midSummaries()` |
308
+ | **Transforms** | `logs()`, `roots()`, `inverse()` |
309
+ | **Sorting** | `sorted()`, `ranked()`, `counts()`, `binned()` |
310
+
311
+ ### Points
312
+
313
+ | Method | Description |
314
+ |--------|-------------|
315
+ | `centroid()` | Mean point across all dimensions |
316
+ | `variances()` | Per-dimension variance |
317
+ | `standardDeviations()` | Per-dimension standard deviation |
318
+ | `covarianceMatrix()` | Full covariance matrix |
319
+ | `correlationMatrix()` | Pearson correlation matrix |
320
+ | `mahalanobis(point)` | Mahalanobis distance of a single point from centroid |
321
+ | `mahalanobisAll()` | Mahalanobis distance for each stored point |
322
+ | `outliersByMahalanobis(threshold?)` | Points with Mahalanobis distance > threshold |
323
+ | `normalizeL2()` | L2-normalize all points (returns new Points) |
324
+ | `normalizeZscore()` | Z-score normalize per dimension (returns new Points) |
325
+ | `describe()` | Full multivariate EDA summary |
326
+
327
+ ### Graph Algorithms
328
+
329
+ | Category | Functions |
330
+ |----------|-----------|
331
+ | **Centrality** | `degreeCentrality`, `closenessCentrality`, `betweennessCentrality`, `pageRank` |
332
+ | **Community** | `louvainCommunities`, `labelPropagationCommunities` |
333
+ | **Similarity** | `nodeSimilarity`, `kNearestNeighbors`, `linkPrediction` |
334
+ | **Paths** | `shortestPath`, `aStarShortestPath`, `yenKShortestPaths`, `allPairsShortestPaths` |
335
+ | **Flow** | `maximumFlow`, `minCostMaxFlow` |
336
+ | **Structure** | `topologicalSort`, `stronglyConnectedComponents`, `weaklyConnectedComponents`, `minimumSpanningTree`, `articulationPointsAndBridges`, `analyzeGraph` |
337
+ | **Clustering** | `kMeansClustering`, `kMeansAuto`, `hierarchicalClustering`, `dbscan` |
338
+ | **TSP** | `travelingSalesmanApprox` |
339
+ | **Catalog** | `createGraphCatalog` — GDS-style projections and pipelines |
340
+
341
+ ## Packages
342
+
343
+ | Package | Description |
344
+ |---------|-------------|
345
+ | `twokeys` | Core TypeScript library |
346
+ | `@buley/twokeys-wasm` | WebAssembly implementation with TypeScript fallback |
347
+ | `@buley/twokeys-types` | Shared Zod schemas for runtime validation |
348
+
349
+ ## Benchmarks
350
+
351
+ Performance on different dataset sizes (operations per second):
352
+
353
+ | Method | 100 elements | 1,000 elements | 10,000 elements |
354
+ |--------|-------------:|---------------:|----------------:|
355
+ | `sorted()` | 224,599 | 14,121 | 874 |
356
+ | `median()` | 199,397 | 14,127 | 876 |
357
+ | `mean()` | 550,610 | 413,551 | 68,399 |
358
+ | `mode()` | 87,665 | 6,738 | 431 |
359
+ | `fences()` | 238,486 | 13,270 | 864 |
360
+ | `outliers()` | 210,058 | 12,584 | 854 |
361
+ | `smooth()` | 61,268 | 1,599 | 31 |
362
+ | `describe()` | 15,642 | 952 | 29 |
363
+
364
+ ## Development
365
+
366
+ ```bash
367
+ bun install
368
+ bun test
369
+ bun test --coverage
370
+ bun run build
371
+ ```
372
+
373
+ ## About John Tukey
374
+
375
+ John Wilder Tukey (1915–2000) revolutionized how we look at data. He invented the box plot, coined the terms "bit" and "software," and championed the idea that looking at data is just as important as modeling it. His book *Exploratory Data Analysis* (1977) changed statistics forever.
376
+
377
+ This library is named after him — a founding mind in [data exploration and analysis](http://en.wikipedia.org/wiki/Exploratory_data_analysis) and a personal hero of the author.
378
+
379
+ ## License
380
+
381
+ MIT
382
+
383
+ ---
384
+
385
+ *"The best thing about being a statistician is that you get to play in everyone's backyard."* — John Tukey