twokeys 2.2.0 → 3.0.0

package/README.md CHANGED
@@ -1,40 +1,22 @@
1
1
  # Twokeys
2
2
 
3
- > A small data exploration and manipulation library, named after **John Tukey**, the legendary statistician who pioneered exploratory data analysis (EDA).
3
+ > Exploratory data analysis for graphs, vectors, and series — named after **John Tukey**, the legendary statistician who pioneered EDA.
4
4
 
5
5
  [![CI](https://github.com/buley/twokeys/actions/workflows/ci.yml/badge.svg)](https://github.com/buley/twokeys/actions/workflows/ci.yml)
6
6
  [![npm version](https://badge.fury.io/js/twokeys.svg)](https://www.npmjs.com/package/twokeys)
7
7
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
8
8
 
9
- ## About John Tukey
10
-
11
- John Wilder Tukey (1915–2000) revolutionized how we look at data. He invented the box plot, coined the terms "bit" and "software," and championed the idea that **looking at data** is just as important as modeling it. His book *Exploratory Data Analysis* (1977) changed statistics forever.
9
+ ## What Is This?
12
10
 
13
- This library is named after him, a founding mind in [data exploration and analysis](http://en.wikipedia.org/wiki/Exploratory_data_analysis) and a personal hero of the author.
11
+ Tukey taught us that **looking at data** is just as important as modeling it. Twokeys applies that philosophy to five areas:
14
12
 
15
- ## Features
16
-
17
- - **Summary Statistics**: Mean, median, mode, trimean, quartiles (hinges)
18
- - **Outlier Detection**: Tukey fences (inner and outer)
19
- - **Letter Values**: Extended quartiles (eighths, sixteenths, etc.)
20
- - **Stem-and-Leaf**: Text-based distribution visualization
21
- - **Ranking**: Full ranking with tie handling
22
- - **Binning**: Histogram-style grouping
23
- - **Smoothing**: Hanning filter, Tukey's 3RSSH smoothing
24
- - **Transforms**: Logarithms, square roots, reciprocals
25
- - **Graph Analytics**: Centrality, communities, paths, flow, clustering, TSP approximation
26
- - **GDS-style Catalog**: In-memory graph projections with procedure wrappers and pipelines
27
- - **WASM Support**: Optional WebAssembly for maximum performance
28
- - **Zero Dependencies**: Pure TypeScript, works everywhere
29
- - **Tiny**: <3KB minified and gzipped
30
-
31
- ## Packages
13
+ 1. **Graph EDA**: Treat structural properties of graphs (degree distributions, clustering coefficients, assortativity) as data series that deserve full Tukey-style analysis
14
+ 2. **Vector Distance & Similarity**: Cosine similarity, Mahalanobis distance, Jaccard similarity, L2 normalization, and more
15
+ 3. **Multivariate Analysis**: Centroids, covariance matrices, correlation matrices, and Mahalanobis-based outlier detection via the `Points` class
16
+ 4. **1D EDA**: The original `Series` class, covering medians, fences, letter values, stem-and-leaf plots, smoothing, and the rest of Tukey's toolkit
17
+ 5. **Graph Algorithms**: Centrality, community detection, shortest paths, flow, clustering (k-means, hierarchical, DBSCAN), TSP approximation, and a GDS-style catalog
32
18
 
33
- | Package | Description |
34
- |---------|-------------|
35
- | `twokeys` | Core TypeScript library |
36
- | `@buley/twokeys-wasm` | WebAssembly implementation with TypeScript fallback |
37
- | `@buley/twokeys-types` | Shared Zod schemas for runtime validation |
19
+ Zero dependencies. Pure TypeScript. Works everywhere.
38
20
 
39
21
  ## Installation
40
22
 
@@ -42,377 +24,358 @@ This library is named after him — a founding mind in [data exploration and ana
42
24
  npm install twokeys
43
25
  # or
44
26
  bun add twokeys
45
- # or
46
- yarn add twokeys
47
27
  ```
48
28
 
49
- For WASM acceleration (optional):
29
+ ## Graph EDA
50
30
 
51
- ```bash
52
- npm install @buley/twokeys-wasm
53
- ```
54
-
55
- ## Quick Start
31
+ The core insight: graph structural properties ARE data series. Degree distributions get box plots. Clustering coefficients get outlier detection. Assortativity tells you whether your network is stratified.
56
32
 
57
33
  ```typescript
58
- import { Series } from 'twokeys';
34
+ import { graphEda, graphOutliers } from 'twokeys';
59
35
 
60
- // Create a series from your data
61
- const series = new Series({ data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 100] });
62
-
63
- // Get summary statistics
64
- console.log(series.mean()); // 14.5
65
- console.log(series.median()); // { datum: 5.5, depth: 5.5 }
66
- console.log(series.trimean()); // Tukey's trimean
67
-
68
- // Detect outliers (using Tukey fences)
69
- console.log(series.outliers()); // [100]
36
+ const nodes = ['alice', 'bob', 'carol', 'dave', 'eve'] as const;
37
+ const edges = [
38
+ { from: 'alice', to: 'bob', weight: 1 },
39
+ { from: 'alice', to: 'carol', weight: 1 },
40
+ { from: 'bob', to: 'carol', weight: 1 },
41
+ { from: 'carol', to: 'dave', weight: 1 },
42
+ { from: 'dave', to: 'eve', weight: 1 },
43
+ ];
70
44
 
71
- // Get a full description
72
- const desc = series.describe();
73
- console.log(desc.summary);
45
+ const summary = graphEda(nodes, edges);
46
+
47
+ // Every structural metric, analyzed as a Tukey Series:
48
+ summary.density; // edges / maxPossibleEdges
49
+ summary.degreeDistribution; // Full SeriesDescription with median, fences, outliers
50
+ summary.clusteringDistribution; // Clustering coefficients as EDA
51
+ summary.globalClusteringCoefficient;
52
+ summary.averagePathLength;
53
+ summary.diameter;
54
+ summary.reciprocity;
55
+ summary.degreeAssortativity; // Do hubs connect to hubs?
56
+
57
+ // Find structurally unusual nodes
58
+ const unusual = graphOutliers(nodes, edges, { method: 'combined' });
59
+ // [{ nodeId: 'eve', score: 2.1, reason: 'Low degree + low clustering' }]
74
60
  ```
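
To make the "structural properties are data series" idea concrete, here is a minimal sketch, independent of the library, that derives a degree sequence and a density value from the edge list used in the example. It assumes the graph is undirected (each edge counts toward both endpoints); the helper names `degreeSequence` and `undirectedDensity` are hypothetical, and `graphEda` may follow different conventions.

```typescript
// Hypothetical helpers for illustration only; not part of the twokeys API.
type Edge = { from: string; to: string };

// Count how many edges touch each node, treating every edge as undirected.
function degreeSequence(nodes: readonly string[], edges: readonly Edge[]): number[] {
  const degree = new Map<string, number>();
  nodes.forEach((node) => degree.set(node, 0));
  edges.forEach(({ from, to }) => {
    degree.set(from, (degree.get(from) ?? 0) + 1);
    degree.set(to, (degree.get(to) ?? 0) + 1);
  });
  return nodes.map((node) => degree.get(node) ?? 0);
}

// Density = edges / maximum possible edges, which is n(n-1)/2 for an undirected simple graph.
function undirectedDensity(nodeCount: number, edgeCount: number): number {
  if (nodeCount < 2) return 0;
  return edgeCount / ((nodeCount * (nodeCount - 1)) / 2);
}

const exampleNodes = ['alice', 'bob', 'carol', 'dave', 'eve'];
const exampleEdges: Edge[] = [
  { from: 'alice', to: 'bob' },
  { from: 'alice', to: 'carol' },
  { from: 'bob', to: 'carol' },
  { from: 'carol', to: 'dave' },
  { from: 'dave', to: 'eve' },
];

const degrees = degreeSequence(exampleNodes, exampleEdges);
const density = undirectedDensity(exampleNodes.length, exampleEdges.length);
console.log(degrees); // [2, 2, 3, 2, 1]
console.log(density); // 0.5
```

The degree sequence is an ordinary numeric array, so it can be handed to `new Series({ data: degrees })` for medians, fences, and outliers.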
75
61
 
76
- ### Using WASM (Optional)
62
+ ## Vector Distance & Similarity
77
63
 
78
- ```typescript
79
- import { loadWasm, analyze, isWasmLoaded } from '@buley/twokeys-wasm';
80
-
81
- // Load WASM (falls back to TypeScript if unavailable)
82
- await loadWasm();
83
-
84
- console.log(isWasmLoaded()); // true if WASM loaded
85
-
86
- // Use the same API - automatically uses WASM when available
87
- const result = analyze([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]);
88
- console.log(result.summary.outliers); // [100]
89
- ```
90
-
91
- ## Graph Analytics + GDS-style Catalog
64
+ Standalone functions for vector math. These are the workhorses that `Points`, graph algorithms, and consumers all use.
92
65
 
93
66
  ```typescript
94
67
  import {
95
- createGraphCatalog,
96
- topologicalSort,
97
- louvainCommunities,
98
- kNearestNeighbors,
99
- linkPrediction,
100
- aStarShortestPath,
101
- yenKShortestPaths,
102
- allPairsShortestPaths,
103
- maximumFlow,
104
- minCostMaxFlow,
68
+ cosineSimilarity,
69
+ euclideanDistance,
70
+ manhattanDistance,
71
+ mahalanobisDistance,
72
+ normalizeL2,
73
+ cosineSimilaritySparse,
74
+ jaccardSimilarity,
75
+ overlapCoefficient,
105
76
  } from 'twokeys';
106
77
 
107
- const nodes = ['root', 'plan', 'build', 'ship'] as const;
108
- const edges = [
109
- { from: 'root', to: 'plan', weight: 1 },
110
- { from: 'plan', to: 'build', weight: 2 },
111
- { from: 'build', to: 'ship', weight: 1 },
112
- ];
113
-
114
- // Vote-weighted linearization for "abstract -> concrete" flows
115
- const linear = topologicalSort(nodes, edges, {
116
- priorityByNode: { ship: 10, build: 6, plan: 3 },
117
- });
118
-
119
- // Community + similarity/link prediction family
120
- const communities = louvainCommunities(nodes, edges);
121
- const knn = kNearestNeighbors(nodes, edges, { k: 2 });
122
- const links = linkPrediction(nodes, edges, { limit: 5 });
123
-
124
- // Path + flow family
125
- const aStar = aStarShortestPath(nodes, edges, 'root', 'ship');
126
- const yen = yenKShortestPaths(nodes, edges, 'root', 'ship', { k: 3 });
127
- const apsp = allPairsShortestPaths(nodes, edges);
128
- const maxFlow = maximumFlow(nodes, edges, 'root', 'ship');
129
- const minCost = minCostMaxFlow(
130
- nodes,
131
- edges.map((edge) => ({
132
- from: edge.from,
133
- to: edge.to,
134
- capacity: edge.weight ?? 1,
135
- cost: edge.weight ?? 1,
136
- })),
137
- 'root',
138
- 'ship',
139
- );
140
-
141
- // GDS-style catalog/procedure wrapper
142
- const gds = createGraphCatalog<string>();
143
- gds.project('tasks', [...nodes], edges, { directed: true });
144
- const rank = gds.pageRank('tasks');
145
- const pipeline = gds.runPipeline('tasks', [
146
- { id: 'rank', kind: 'page-rank' },
147
- { id: 'sim', kind: 'similarity', options: { metric: 'jaccard' } },
148
- { id: 'links', kind: 'link-prediction', options: { limit: 10 } },
149
- ]);
78
+ // Dense vectors
79
+ cosineSimilarity([1, 2, 3], [4, 5, 6]); // ≈ 0.975
80
+ euclideanDistance([0, 0], [3, 4]); // 5
81
+ manhattanDistance([0, 0], [3, 4]); // 7
82
+ normalizeL2([3, 4]); // [0.6, 0.8]
83
+ mahalanobisDistance([5, 5], [0, 0], [1, 1]); // 7.07
84
+
85
+ // Sparse vectors (Map<string, number>)
86
+ const a = new Map([['x', 1], ['y', 2]]);
87
+ const b = new Map([['y', 3], ['z', 4]]);
88
+ cosineSimilaritySparse(a, b); // similarity on shared keys
89
+
90
+ // Set similarity
91
+ jaccardSimilarity(new Set([1, 2, 3]), new Set([2, 3, 4])); // 0.5
92
+ overlapCoefficient(new Set([1, 2]), new Set([2, 3, 4])); // 0.5
150
93
  ```
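
For readers who want to check the numbers in the comments, the following is a rough sketch of three of these measures written from their textbook definitions. The function names are hypothetical, the zero-vector and empty-set fallbacks are assumptions, and the Mahalanobis variant assumes per-dimension variances (a diagonal covariance), which is what the three-argument call in the example suggests. The library's own implementations may differ.

```typescript
// Illustrative re-implementations; not the twokeys source.
function cosineSketch(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // assumption: a zero vector yields 0
}

// Jaccard index: |A ∩ B| / |A ∪ B|
function jaccardSketch<T>(a: Set<T>, b: Set<T>): number {
  let shared = 0;
  a.forEach((item) => {
    if (b.has(item)) shared++;
  });
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union; // assumption: two empty sets yield 0
}

// Mahalanobis distance with diagonal covariance:
// sqrt(sum_i (x_i - mean_i)^2 / variance_i)
function diagonalMahalanobisSketch(point: number[], means: number[], variances: number[]): number {
  let sum = 0;
  for (let i = 0; i < point.length; i++) {
    const diff = point[i] - means[i];
    sum += (diff * diff) / variances[i];
  }
  return Math.sqrt(sum);
}

const cosineValue = cosineSketch([1, 2, 3], [4, 5, 6]);
const jaccardValue = jaccardSketch(new Set([1, 2, 3]), new Set([2, 3, 4]));
const mahalanobisValue = diagonalMahalanobisSketch([5, 5], [0, 0], [1, 1]);
console.log(cosineValue.toFixed(4), jaccardValue, mahalanobisValue.toFixed(2));
```

With unit variances the Mahalanobis distance reduces to the Euclidean distance from the mean, which is why `[5, 5]` sits about 7.07 away from the origin.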
151
94
 
152
- ## Benchmarks
95
+ ## Multivariate Analysis (Points)
153
96
 
154
- Performance on different dataset sizes (operations per second, higher is better):
97
+ Full multivariate EDA: centroids, covariance, correlation, Mahalanobis outlier detection.
155
98
 
156
- ### TypeScript Implementation
99
+ ```typescript
100
+ import { Points } from 'twokeys';
157
101
 
158
- | Method | 100 elements | 1,000 elements | 10,000 elements |
159
- |--------|-------------:|---------------:|----------------:|
160
- | `sorted()` | 224,599 | 14,121 | 874 |
161
- | `median()` | 199,397 | 14,127 | 876 |
162
- | `mean()` | 550,610 | 413,551 | 68,399 |
163
- | `mode()` | 87,665 | 6,738 | 431 |
164
- | `fences()` | 238,486 | 13,270 | 864 |
165
- | `outliers()` | 210,058 | 12,584 | 854 |
166
- | `smooth()` | 61,268 | 1,599 | 31 |
167
- | `describe()` | 15,642 | 952 | 29 |
102
+ const points = new Points({
103
+ data: [
104
+ [1, 2, 3],
105
+ [4, 5, 6],
106
+ [7, 8, 9],
107
+ [100, 100, 100], // outlier
108
+ ],
109
+ });
168
110
 
169
- ### v2.0 Performance Improvements
170
-
171
- Compared to v1.x (CoffeeScript), v2.0 TypeScript is dramatically faster:
172
-
173
- | Method | v1.x (10K) | v2.0 (10K) | Improvement |
174
- |--------|------------|------------|-------------|
175
- | `median()` | 6 ops/sec | 876 ops/sec | **146x faster** |
176
- | `counts()` | 1 ops/sec | 606 ops/sec | **606x faster** |
177
- | `fences()` | 5 ops/sec | 864 ops/sec | **173x faster** |
178
- | `describe()` | 1 ops/sec | 29 ops/sec | **29x faster** |
179
-
180
- Key optimizations:
181
- - O(1) index-based median (was O(n²) recursive)
182
- - Map-based frequency counting (was O(n²) recursive)
183
- - Eliminated unnecessary array copying in smoothing
184
-
185
- ## Example Output
186
-
187
- Applying `describe()` to a Series returns a complete analysis:
188
-
189
- ```javascript
190
- const series = new Series({ data: [48, 59, 63, 30, 57, 92, 73, 47, 31, 5] });
191
- const analysis = series.describe();
192
-
193
- // Result:
194
- {
195
- "original": [48, 59, 63, 30, 57, 92, 73, 47, 31, 5],
196
- "summary": {
197
- "median": { "datum": 52.5, "depth": 5.5 },
198
- "mean": 50.5,
199
- "hinges": [{ "datum": 31, "depth": 3 }, { "datum": 63, "depth": 8 }],
200
- "adjacent": [30, 92],
201
- "outliers": [],
202
- "extremes": [5, 92],
203
- "iqr": 32,
204
- "fences": [4.5, 100.5]
205
- },
206
- "smooths": {
207
- "smooth": [48, 30, 57, 57, 57, 47, 31, 5, 5, 5],
208
- "hanning": [48, 61, 46.5, 43.5, 74.5, 82.5, 60, 39, 18, 5]
209
- },
210
- "transforms": {
211
- "logs": [3.87, 4.08, 4.14, ...],
212
- "roots": [6.93, 7.68, 7.94, ...],
213
- "inverse": [0.021, 0.017, 0.016, ...]
214
- },
215
- "sorted": [5, 30, 31, 47, 48, 57, 59, 63, 73, 92],
216
- "ranked": { "up": {...}, "down": {...}, "groups": {...} },
217
- "binned": { "bins": 4, "width": 26, "binned": {...} }
218
- }
111
+ points.centroid(); // [28, 28.75, 29.5]
112
+ points.variances(); // Per-dimension variance
113
+ points.standardDeviations(); // Per-dimension stddev
114
+ points.covarianceMatrix(); // Full covariance matrix
115
+ points.correlationMatrix(); // Pearson correlation matrix
116
+
117
+ // Mahalanobis distance: the multivariate equivalent of a z-score
118
+ points.mahalanobis([50, 50, 50]); // Distance of a single point
119
+ points.mahalanobisAll(); // Distance for each stored point
120
+
121
+ // Tukey-style outlier detection for multivariate data
122
+ points.outliersByMahalanobis(); // [[100, 100, 100]]
123
+
124
+ // Normalization
125
+ points.normalizeL2(); // L2-normalize all points (returns new Points)
126
+ points.normalizeZscore(); // Z-score normalize per dimension
127
+
128
+ // Full description — each dimension analyzed as a Series
129
+ const desc = points.describe();
130
+ desc.centroid; // Mean point
131
+ desc.correlationMatrix; // Pearson correlations
132
+ desc.mahalanobisDistances; // Distance from centroid per point
133
+ desc.outlierCount; // How many outliers
134
+ desc.dimensionSummaries; // Each dimension as a full SeriesDescription
219
135
  ```
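
The centroid in the example can be verified by hand. The sketch below computes a centroid and per-dimension variances for the same four rows; the helper names are hypothetical, and the use of the sample (n - 1) denominator is an assumption, since the excerpt does not say which denominator `Points` uses.

```typescript
// Illustrative helpers; not the Points implementation.
function centroidSketch(rows: number[][]): number[] {
  const sums = rows[0].map(() => 0);
  for (const row of rows) {
    for (let d = 0; d < sums.length; d++) sums[d] += row[d];
  }
  return sums.map((total) => total / rows.length);
}

// Per-dimension variance with the sample (n - 1) denominator (an assumption).
function sampleVariancesSketch(rows: number[][]): number[] {
  const center = centroidSketch(rows);
  const squared = center.map(() => 0);
  for (const row of rows) {
    for (let d = 0; d < center.length; d++) {
      const diff = row[d] - center[d];
      squared[d] += diff * diff;
    }
  }
  return squared.map((total) => total / (rows.length - 1));
}

const rows = [
  [1, 2, 3],
  [4, 5, 6],
  [7, 8, 9],
  [100, 100, 100], // the outlier pulls every coordinate of the centroid upward
];

const center = centroidSketch(rows);
const spread = sampleVariancesSketch(rows);
console.log(center);    // [28, 28.75, 29.5]
console.log(spread[0]); // 2310
```

The single extreme row dominates both the centroid and the variances, which is the motivation for flagging it with a distance-based outlier check.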
220
136
 
221
- ## API
137
+ ## 1D Exploratory Data Analysis (Series)
222
138
 
223
- ### Series
224
-
225
- The `Series` class provides methods for exploring 1-dimensional numerical data.
139
+ The original Tukey toolkit: everything you need to explore univariate data.
226
140
 
227
141
  ```typescript
228
142
  import { Series } from 'twokeys';
229
143
 
230
- const series = new Series({ data: [1, 2, 3, 4, 5] });
144
+ const series = new Series({ data: [2, 4, 4, 4, 5, 5, 7, 9] });
145
+
146
+ // Summary statistics
147
+ series.mean(); // 5
148
+ series.median(); // { datum: 4.5, depth: 4.5 }
149
+ series.mode(); // { count: 3, data: [4] }
150
+ series.trimean(); // Tukey's trimean
151
+
152
+ // Dispersion
153
+ series.variance(); // Sample variance
154
+ series.stddev(); // Standard deviation
155
+ series.iqr(); // Interquartile range
156
+ series.skewness(); // Fisher-Pearson skewness
157
+ series.kurtosis(); // Excess kurtosis
158
+
159
+ // Outlier detection (Tukey fences)
160
+ series.fences(); // Inner fence boundaries (1.5 x IQR)
161
+ series.outliers(); // Values outside inner fences
162
+ series.outside(); // Values outside outer fences (3 x IQR)
163
+
164
+ // Time series
165
+ series.ema(0.3); // Exponential moving average
166
+ series.zscore(); // Z-score normalization
167
+ series.hanning(); // Hanning filter
168
+ series.smooth(); // Tukey's 3RSSH smoothing
169
+ series.rough(); // Residuals (original - smooth)
170
+
171
+ // Visualization
172
+ series.stemLeaf(); // Stem-and-leaf plot
173
+ series.letterValues(); // Extended quartiles (M, F, E, D, C, B, A...)
174
+
175
+ // Everything at once
176
+ series.describe();
231
177
  ```
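
As a reference point for the fence and trimean methods, here is a small sketch of the textbook Tukey definitions based on depths: the median sits at depth (n + 1) / 2, the hinges at depth (floor(median depth) + 1) / 2, and the inner and outer fences lie 1.5 and 3 hinge-spreads beyond the hinges. The names `valueAtDepth` and `tukeySummarySketch` are hypothetical, the input is assumed to be non-empty, and the values returned by `series.fences()` or `series.trimean()` may be computed differently.

```typescript
// Average of the values at a (possibly half-integer) depth, counted from one end.
function valueAtDepth(sorted: number[], depth: number, fromTop: boolean): number {
  const lo = Math.floor(depth);
  const hi = Math.ceil(depth);
  const pick = (d: number): number => (fromTop ? sorted[sorted.length - d] : sorted[d - 1]);
  return (pick(lo) + pick(hi)) / 2;
}

// Textbook Tukey summary; assumes `data` is non-empty.
function tukeySummarySketch(data: number[]) {
  const sorted = data.slice().sort((a, b) => a - b);
  const medianDepth = (sorted.length + 1) / 2;
  const hingeDepth = (Math.floor(medianDepth) + 1) / 2;

  const median = valueAtDepth(sorted, medianDepth, false);
  const lowerHinge = valueAtDepth(sorted, hingeDepth, false);
  const upperHinge = valueAtDepth(sorted, hingeDepth, true);
  const hingeSpread = upperHinge - lowerHinge;

  const innerFences = [lowerHinge - 1.5 * hingeSpread, upperHinge + 1.5 * hingeSpread];
  const outerFences = [lowerHinge - 3 * hingeSpread, upperHinge + 3 * hingeSpread];
  const outliers = sorted.filter((x) => x < innerFences[0] || x > innerFences[1]);
  const trimean = (lowerHinge + 2 * median + upperHinge) / 4;

  return { median, lowerHinge, upperHinge, hingeSpread, innerFences, outerFences, outliers, trimean };
}

const tukey = tukeySummarySketch([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]);
console.log(tukey.median);      // 5.5
console.log(tukey.innerFences); // [-4.5, 15.5]
console.log(tukey.outliers);    // [100]
console.log(tukey.trimean);     // 5.5
```

Note how the trimean (5.5) ignores the extreme value that drags the arithmetic mean of the same data up to 14.5.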
232
178
 
233
- #### Summary Statistics
234
-
235
- | Method | Description |
236
- |--------|-------------|
237
- | `mean()` | Arithmetic mean |
238
- | `median()` | Median value and depth |
239
- | `mode()` | Most frequent value(s) |
240
- | `trimean()` | Tukey's trimean: (Q1 + 2×median + Q3) / 4 |
241
- | `extremes()` | [min, max] values |
242
- | `hinges()` | Quartile boundaries (Q1, Q3) |
243
- | `iqr()` | Interquartile range |
244
-
245
- #### Outlier Detection
246
-
247
- | Method | Description |
248
- |--------|-------------|
249
- | `fences()` | Inner fence boundaries (1.5 × IQR) |
250
- | `outer()` | Outer fence boundaries (3 × IQR) |
251
- | `outliers()` | Values outside inner fences |
252
- | `inside()` | Values within fences |
253
- | `outside()` | Values outside outer fences |
254
- | `adjacent()` | Most extreme non-outlier values |
255
-
256
- #### Letter Values & Visualization
257
-
258
- | Method | Description |
259
- |--------|-------------|
260
- | `letterValues()` | Extended quartiles (M, F, E, D, C, B, A...) |
261
- | `stemLeaf()` | Stem-and-leaf text display |
262
- | `midSummaries()` | Symmetric quantile pair averages |
179
+ ## Graph Algorithms
263
180
 
264
- #### Ranking & Counting
181
+ Centrality, community detection, shortest paths, flow, clustering, and a GDS-style catalog.
265
182
 
266
- | Method | Description |
267
- |--------|-------------|
268
- | `sorted()` | Sorted copy of data |
269
- | `ranked()` | Rank information with tie handling |
270
- | `counts()` | Frequency of each value |
271
- | `binned()` | Histogram-style bins |
183
+ ```typescript
184
+ import {
185
+ // Centrality
186
+ degreeCentrality,
187
+ closenessCentrality,
188
+ betweennessCentrality,
189
+ pageRank,
272
190
 
273
- #### Transforms
191
+ // Community detection
192
+ louvainCommunities,
193
+ labelPropagationCommunities,
274
194
 
275
- | Method | Description |
276
- |--------|-------------|
277
- | `logs()` | Natural logarithm of each value |
278
- | `roots()` | Square root of each value |
279
- | `inverse()` | Reciprocal (1/x) of each value |
195
+ // Similarity & link prediction
196
+ nodeSimilarity,
197
+ kNearestNeighbors,
198
+ linkPrediction,
280
199
 
281
- #### Smoothing
200
+ // Paths & flow
201
+ shortestPath,
202
+ aStarShortestPath,
203
+ yenKShortestPaths,
204
+ allPairsShortestPaths,
205
+ maximumFlow,
206
+ minCostMaxFlow,
282
207
 
283
- | Method | Description |
284
- |--------|-------------|
285
- | `hanning()` | Hanning filter (running averages) |
286
- | `smooth()` | Tukey's 3RSSH smoothing |
287
- | `rough()` | Residuals (original - smooth) |
208
+ // Structure
209
+ topologicalSort,
210
+ stronglyConnectedComponents,
211
+ weaklyConnectedComponents,
212
+ minimumSpanningTree,
213
+ articulationPointsAndBridges,
214
+ analyzeGraph,
215
+
216
+ // Clustering
217
+ kMeansClustering,
218
+ kMeansAuto,
219
+ hierarchicalClustering,
220
+ dbscan,
221
+
222
+ // TSP
223
+ travelingSalesmanApprox,
224
+
225
+ // GDS-style catalog
226
+ createGraphCatalog,
227
+ } from 'twokeys';
228
+ ```
288
229
 
289
- #### Full Description
230
+ ### Clustering
290
231
 
291
232
  ```typescript
292
- series.describe();
293
- // Returns complete analysis including all of the above
294
- ```
233
+ import { kMeansClustering, hierarchicalClustering, dbscan } from 'twokeys';
295
234
 
296
- ### Points
297
-
298
- The `Points` class handles n-dimensional point data.
235
+ const points = [
236
+ [1, 1], [1.5, 2], [3, 4],
237
+ [5, 7], [3.5, 5], [4.5, 5],
238
+ [3.5, 4.5],
239
+ ];
299
240
 
300
- ```typescript
301
- import { Points } from 'twokeys';
241
+ // k-Means
242
+ const km = kMeansClustering(points, 2);
302
243
 
303
- // 100 random 2D points
304
- const points = new Points({ count: 100, dimensionality: 2 });
244
+ // Hierarchical (single, complete, average, or ward linkage)
245
+ const hc = hierarchicalClustering(points, 2, { linkage: 'ward' });
246
+ hc.clusters; // Point indices per cluster
247
+ hc.dendrogram; // Merge history
248
+ hc.silhouette; // Cluster quality score
305
249
 
306
- // Or provide your own data
307
- const myPoints = new Points({ data: [[1, 2], [3, 4], [5, 6]] });
250
+ // DBSCAN: density-based, finds arbitrarily shaped clusters, handles noise
251
+ const db = dbscan(points, 1.5, 2);
252
+ db.clusters; // Point indices per cluster
253
+ db.noise; // Indices of noise points
254
+ db.clusterCount; // Number of clusters found
308
255
  ```
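
To show what the `eps` and minimum-points arguments control, the following is a compact DBSCAN sketch run on the same seven points. It is not the library's `dbscan`: the name `dbscanSketch` is hypothetical, distances are Euclidean, and a point's neighborhood is assumed to include the point itself when comparing against the minimum-points threshold. Under a different counting convention the first two points would be classed as noise instead of a cluster.

```typescript
// Illustrative DBSCAN; not the twokeys implementation.
type Vec = number[];

function distance(a: Vec, b: Vec): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
  return Math.sqrt(sum);
}

function dbscanSketch(points: Vec[], eps: number, minPts: number) {
  const UNVISITED = -2;
  const NOISE = -1;
  const labels: number[] = points.map(() => UNVISITED);

  // Indices within eps of point i, including i itself.
  const neighborsOf = (i: number): number[] => {
    const found: number[] = [];
    points.forEach((p, j) => {
      if (distance(points[i], p) <= eps) found.push(j);
    });
    return found;
  };

  let clusterCount = 0;
  for (let i = 0; i < points.length; i++) {
    if (labels[i] !== UNVISITED) continue;
    const seeds = neighborsOf(i);
    if (seeds.length < minPts) {
      labels[i] = NOISE; // may later be claimed as a border point
      continue;
    }
    labels[i] = clusterCount;
    const queue = seeds.slice();
    while (queue.length > 0) {
      const j = queue.shift() as number;
      if (labels[j] === NOISE) labels[j] = clusterCount; // border point reached from a core point
      if (labels[j] !== UNVISITED) continue;
      labels[j] = clusterCount;
      const reachable = neighborsOf(j);
      if (reachable.length >= minPts) queue.push(...reachable); // j is a core point: keep expanding
    }
    clusterCount++;
  }

  const clusters: number[][] = [];
  for (let c = 0; c < clusterCount; c++) clusters.push([]);
  const noise: number[] = [];
  labels.forEach((label, i) => {
    if (label === NOISE) noise.push(i);
    else clusters[label].push(i);
  });
  return { clusters, noise, clusterCount };
}

const sample: Vec[] = [
  [1, 1], [1.5, 2], [3, 4],
  [5, 7], [3.5, 5], [4.5, 5],
  [3.5, 4.5],
];
const grouped = dbscanSketch(sample, 1.5, 2);
console.log(grouped.clusters); // [[0, 1], [2, 4, 5, 6]]
console.log(grouped.noise);    // [3]
```

The point `[5, 7]` ends up as noise because no other point lies within 1.5 units of it, while the remaining points chain together through overlapping neighborhoods.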
309
256
 
310
- ### Twokeys
257
+ ### GDS-Style Catalog
311
258
 
312
- The main class provides factory methods and utilities.
259
+ In-memory graph projections with procedure wrappers and pipelines, inspired by Neo4j GDS.
313
260
 
314
261
  ```typescript
315
- import Twokeys from 'twokeys';
262
+ import { createGraphCatalog } from 'twokeys';
316
263
 
317
- // Generate random data
318
- const randomData = Twokeys.randomSeries(100, 50); // 100 values, max 50
319
- const randomPoints = Twokeys.randomPoints(50, 3); // 50 3D points
264
+ const gds = createGraphCatalog<string>();
265
+ // Assumes `nodes` and `edges` are in scope, e.g. the arrays from the Graph EDA example
+ gds.project('social', nodes, edges, { directed: true });
320
266
 
321
- // Access classes
322
- const series = new Twokeys.Series({ data: [1, 2, 3] });
323
- const points = new Twokeys.Points(100);
267
+ const rank = gds.pageRank('social');
268
+ const pipeline = gds.runPipeline('social', [
269
+ { id: 'rank', kind: 'page-rank' },
270
+ { id: 'sim', kind: 'similarity', options: { metric: 'jaccard' } },
271
+ { id: 'links', kind: 'link-prediction', options: { limit: 10 } },
272
+ ]);
324
273
  ```
325
274
 
326
- ## Examples
275
+ ## API Reference
327
276
 
328
- ### Box Plot Data
277
+ ### Distance & Similarity (`distance.ts`)
329
278
 
330
- ```typescript
331
- const series = new Series({ data: myData });
332
-
333
- const boxPlot = {
334
- min: series.extremes()[0],
335
- q1: series.hinges()[0].datum,
336
- median: series.median().datum,
337
- q3: series.hinges()[1].datum,
338
- max: series.extremes()[1],
339
- outliers: series.outliers(),
340
- };
341
- ```
279
+ | Function | Description |
280
+ |----------|-------------|
281
+ | `cosineSimilarity(a, b)` | Cosine similarity between dense vectors [-1, 1] |
282
+ | `euclideanDistance(a, b)` | Euclidean (L2) distance |
283
+ | `squaredEuclideanDistance(a, b)` | Squared Euclidean distance (avoids sqrt) |
284
+ | `manhattanDistance(a, b)` | Manhattan (L1) distance |
285
+ | `mahalanobisDistance(point, means, variances, epsilon?)` | Mahalanobis distance |
286
+ | `normalizeL2(vector)` | L2-normalize a vector to unit length |
287
+ | `cosineSimilaritySparse(a, b)` | Cosine similarity for sparse vectors (`Map<string, number>`) |
288
+ | `jaccardSimilarity(a, b)` | Jaccard index for sets |
289
+ | `overlapCoefficient(a, b)` | Overlap coefficient for sets |
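
The sparse and set-based rows in this table can be sketched from their standard definitions as follows. The names are hypothetical and this is not the library's code: in particular, the sparse cosine below takes the dot product over shared keys but the norms over all entries of each map, which is an assumption about the intended definition.

```typescript
// Illustrative versions of two table entries; not the twokeys source.
function sparseCosineSketch(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((value, key) => {
    normA += value * value;
    const other = b.get(key);
    if (other !== undefined) dot += value * other; // only shared keys contribute to the dot product
  });
  b.forEach((value) => {
    normB += value * value;
  });
  const denom = Math.sqrt(normA * normB);
  return denom === 0 ? 0 : dot / denom; // assumption: an empty or all-zero map yields 0
}

// Overlap coefficient: |A ∩ B| / min(|A|, |B|)
function overlapSketch<T>(a: Set<T>, b: Set<T>): number {
  let shared = 0;
  a.forEach((item) => {
    if (b.has(item)) shared++;
  });
  const smaller = Math.min(a.size, b.size);
  return smaller === 0 ? 0 : shared / smaller; // assumption: an empty set yields 0
}

const sparseValue = sparseCosineSketch(
  new Map([['x', 1], ['y', 2]]),
  new Map([['y', 3], ['z', 4]]),
);
const overlapValue = overlapSketch(new Set([1, 2]), new Set([2, 3, 4]));
console.log(sparseValue.toFixed(4), overlapValue);
```

Unlike the Jaccard index, the overlap coefficient reaches 1 whenever the smaller set is fully contained in the larger one.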
342
290
 
343
- ### Outlier Detection
291
+ ### Graph EDA (`graph-eda.ts`)
344
292
 
345
- ```typescript
346
- const series = new Series({ data: measurements });
293
+ | Function | Description |
294
+ |----------|-------------|
295
+ | `graphEda(nodes, edges, options?)` | Full Tukey-style EDA of graph structure |
296
+ | `clusteringCoefficient(nodes, edges, options?)` | Per-node clustering coefficients |
297
+ | `graphOutliers(nodes, edges, options?)` | Structurally unusual nodes |
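
For orientation, a per-node clustering coefficient is conventionally the fraction of a node's neighbor pairs that are themselves connected. The sketch below implements that definition for an undirected graph; `localClusteringSketch` is a hypothetical name, nodes with fewer than two neighbors are assigned 0 by assumption, and the library's `clusteringCoefficient` may treat direction, weights, or low-degree nodes differently.

```typescript
// Illustrative local clustering coefficient; not the twokeys implementation.
type Link = { from: string; to: string };

function localClusteringSketch(nodes: readonly string[], edges: readonly Link[]): Map<string, number> {
  // Build undirected adjacency sets, ignoring self-loops.
  const neighbors = new Map<string, Set<string>>();
  nodes.forEach((node) => neighbors.set(node, new Set<string>()));
  edges.forEach(({ from, to }) => {
    if (from === to) return;
    neighbors.get(from)?.add(to);
    neighbors.get(to)?.add(from);
  });

  const result = new Map<string, number>();
  nodes.forEach((node) => {
    const set = neighbors.get(node);
    const adjacent: string[] = set ? Array.from(set) : [];
    const k = adjacent.length;
    if (k < 2) {
      result.set(node, 0); // assumption: undefined cases are reported as 0
      return;
    }
    // Count connected pairs among this node's neighbors.
    let linked = 0;
    for (let i = 0; i < k; i++) {
      for (let j = i + 1; j < k; j++) {
        if (neighbors.get(adjacent[i])?.has(adjacent[j])) linked++;
      }
    }
    result.set(node, (2 * linked) / (k * (k - 1)));
  });
  return result;
}

const coefficients = localClusteringSketch(
  ['alice', 'bob', 'carol', 'dave', 'eve'],
  [
    { from: 'alice', to: 'bob' },
    { from: 'alice', to: 'carol' },
    { from: 'bob', to: 'carol' },
    { from: 'carol', to: 'dave' },
    { from: 'dave', to: 'eve' },
  ],
);
console.log(coefficients.get('alice')); // 1
console.log(coefficients.get('dave'));  // 0
```

In this graph `alice` and `bob` sit inside a triangle, `carol` has one connected pair out of three, and `dave` and `eve` sit on a chain, so the coefficients already separate the tight core from the tail.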
347
298
 
348
- // Inner fences: 1.5 × IQR from hinges
349
- const suspicious = series.outliers();
299
+ ### Series
350
300
 
351
- // Outer fences: 3 × IQR from hinges
352
- const extreme = series.outside();
353
- ```
301
+ | Category | Methods |
302
+ |----------|---------|
303
+ | **Statistics** | `mean()`, `median()`, `mode()`, `trimean()`, `variance()`, `stddev()`, `skewness()`, `kurtosis()` |
304
+ | **Dispersion** | `extremes()`, `hinges()`, `iqr()`, `fences()`, `outer()` |
305
+ | **Outliers** | `outliers()`, `inside()`, `outside()`, `adjacent()` |
306
+ | **Time Series** | `ema(alpha)`, `zscore()`, `hanning()`, `smooth()`, `rough()` |
307
+ | **Visualization** | `stemLeaf()`, `letterValues()`, `midSummaries()` |
308
+ | **Transforms** | `logs()`, `roots()`, `inverse()` |
309
+ | **Sorting** | `sorted()`, `ranked()`, `counts()`, `binned()` |
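
As an illustration of what the `alpha` argument to `ema(alpha)` typically means, here is a sketch of the common recursive form of an exponential moving average, seeded with the first observation. The name `emaSketch` and the seeding choice are assumptions; the library may initialize or validate differently.

```typescript
// Illustrative exponential moving average; not the twokeys implementation.
// s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
function emaSketch(values: number[], alpha: number): number[] {
  if (alpha <= 0 || alpha > 1) throw new RangeError('alpha must be in (0, 1]');
  const smoothed: number[] = [];
  values.forEach((x, i) => {
    smoothed.push(i === 0 ? x : alpha * x + (1 - alpha) * smoothed[i - 1]);
  });
  return smoothed;
}

const smoothedValues = emaSketch([1, 2, 3], 0.5);
console.log(smoothedValues); // [1, 1.5, 2.25]
```

A larger `alpha` tracks recent values more closely, while a smaller one produces a smoother, slower-moving series.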
354
310
 
355
- ### Letter Values Display
311
+ ### Points
356
312
 
357
- ```typescript
358
- const series = new Series({ data: myData });
359
-
360
- // Get extended quartiles
361
- const lv = series.letterValues();
362
- // [
363
- // { letter: 'M', depth: 10.5, lower: 52.5, upper: 52.5, mid: 52.5, spread: 0 },
364
- // { letter: 'F', depth: 5, lower: 31, upper: 73, mid: 52, spread: 42 },
365
- // { letter: 'E', depth: 3, lower: 30, upper: 82, mid: 56, spread: 52 },
366
- // ...
367
- // ]
368
- ```
313
+ | Method | Description |
314
+ |--------|-------------|
315
+ | `centroid()` | Mean point across all dimensions |
316
+ | `variances()` | Per-dimension variance |
317
+ | `standardDeviations()` | Per-dimension standard deviation |
318
+ | `covarianceMatrix()` | Full covariance matrix |
319
+ | `correlationMatrix()` | Pearson correlation matrix |
320
+ | `mahalanobis(point)` | Mahalanobis distance of a single point from centroid |
321
+ | `mahalanobisAll()` | Mahalanobis distance for each stored point |
322
+ | `outliersByMahalanobis(threshold?)` | Points with Mahalanobis distance > threshold |
323
+ | `normalizeL2()` | L2-normalize all points (returns new Points) |
324
+ | `normalizeZscore()` | Z-score normalize per dimension (returns new Points) |
325
+ | `describe()` | Full multivariate EDA summary |
326
+
327
+ ### Graph Algorithms
328
+
329
+ | Category | Functions |
330
+ |----------|-----------|
331
+ | **Centrality** | `degreeCentrality`, `closenessCentrality`, `betweennessCentrality`, `pageRank` |
332
+ | **Community** | `louvainCommunities`, `labelPropagationCommunities` |
333
+ | **Similarity** | `nodeSimilarity`, `kNearestNeighbors`, `linkPrediction` |
334
+ | **Paths** | `shortestPath`, `aStarShortestPath`, `yenKShortestPaths`, `allPairsShortestPaths` |
335
+ | **Flow** | `maximumFlow`, `minCostMaxFlow` |
336
+ | **Structure** | `topologicalSort`, `stronglyConnectedComponents`, `weaklyConnectedComponents`, `minimumSpanningTree`, `articulationPointsAndBridges`, `analyzeGraph` |
337
+ | **Clustering** | `kMeansClustering`, `kMeansAuto`, `hierarchicalClustering`, `dbscan` |
338
+ | **TSP** | `travelingSalesmanApprox` |
339
+ | **Catalog** | `createGraphCatalog` — GDS-style projections and pipelines |
369
340
 
370
- ### Stem-and-Leaf Display
341
+ ## Packages
371
342
 
372
- ```typescript
373
- const series = new Series({ data: myData });
374
-
375
- const { display } = series.stemLeaf();
376
- // [
377
- // " 0 | 5",
378
- // " 3 | 0 1",
379
- // " 4 | 7 8",
380
- // " 5 | 7 9",
381
- // " 6 | 3",
382
- // " 7 | 3",
383
- // " 9 | 2"
384
- // ]
385
- ```
343
+ | Package | Description |
344
+ |---------|-------------|
345
+ | `twokeys` | Core TypeScript library |
346
+ | `@buley/twokeys-wasm` | WebAssembly implementation with TypeScript fallback |
347
+ | `@buley/twokeys-types` | Shared Zod schemas for runtime validation |
386
348
 
387
- ### Data Transformation
349
+ ## Benchmarks
388
350
 
389
- ```typescript
390
- const series = new Series({ data: skewedData });
351
+ Performance on different dataset sizes (operations per second):
391
352
 
392
- // Try different transforms to normalize
393
- const logTransformed = series.logs();
394
- const sqrtTransformed = series.roots();
395
- ```
353
+ | Method | 100 elements | 1,000 elements | 10,000 elements |
354
+ |--------|-------------:|---------------:|----------------:|
355
+ | `sorted()` | 224,599 | 14,121 | 874 |
356
+ | `median()` | 199,397 | 14,127 | 876 |
357
+ | `mean()` | 550,610 | 413,551 | 68,399 |
358
+ | `mode()` | 87,665 | 6,738 | 431 |
359
+ | `fences()` | 238,486 | 13,270 | 864 |
360
+ | `outliers()` | 210,058 | 12,584 | 854 |
361
+ | `smooth()` | 61,268 | 1,599 | 31 |
362
+ | `describe()` | 15,642 | 952 | 29 |
396
363
 
397
364
  ## Development
398
365
 
399
366
  ```bash
400
- # Install dependencies
401
367
  bun install
402
-
403
- # Run tests
404
368
  bun test
405
-
406
- # Run tests with coverage
407
369
  bun test --coverage
408
-
409
- # Build all packages
410
370
  bun run build
411
-
412
- # Run benchmark
413
- bun run bench.ts
414
371
  ```
415
372
 
373
+ ## About John Tukey
374
+
375
+ John Wilder Tukey (1915-2000) revolutionized how we look at data. He invented the box plot, coined the terms "bit" and "software," and championed the idea that looking at data is just as important as modeling it. His book *Exploratory Data Analysis* (1977) changed statistics forever.
376
+
377
+ This library is named after him — a founding mind in [data exploration and analysis](http://en.wikipedia.org/wiki/Exploratory_data_analysis) and a personal hero of the author.
378
+
416
379
  ## License
417
380
 
418
381
  MIT