@winm2m/inferential-stats-js 0.1.0

package/README.md ADDED
@@ -0,0 +1,712 @@
1
+ # @winm2m/inferential-stats-js
2
+
3
+ [![npm version](https://img.shields.io/npm/v/@winm2m/inferential-stats-js.svg)](https://www.npmjs.com/package/@winm2m/inferential-stats-js)
4
+ [![license](https://img.shields.io/npm/l/@winm2m/inferential-stats-js.svg)](https://github.com/winm2m/inferential-stats-js/blob/main/LICENSE)
5
+ [![TypeScript](https://img.shields.io/badge/TypeScript-5.7-blue.svg)](https://www.typescriptlang.org/)
6
+ [![WebAssembly](https://img.shields.io/badge/WebAssembly-Pyodide-blueviolet.svg)](https://pyodide.org/)
7
+
8
+ **A headless JavaScript SDK for advanced statistical analysis in the browser using WebAssembly (Pyodide). Performs SPSS-level inferential statistics entirely client-side with no backend required.**
9
+
10
+ ---
11
+
12
+ ## Table of Contents
13
+
14
+ - [Architecture Overview](#architecture-overview)
15
+ - [Installation](#installation)
16
+ - [Quick Start](#quick-start)
17
+ - [CDN / CodePen Usage](#cdn--codepen-usage)
18
+ - [API Reference](#api-reference)
19
+ - [Core Analysis Features — Mathematical & Technical Documentation](#core-analysis-features--mathematical--technical-documentation)
20
+ - [① Descriptive Statistics](#-descriptive-statistics)
21
+ - [② Compare Means](#-compare-means)
22
+ - [③ Regression](#-regression)
23
+ - [④ Classify](#-classify)
24
+ - [⑤ Dimension Reduction](#-dimension-reduction)
25
+ - [⑥ Scale](#-scale)
26
+ - [Sample Data](#sample-data)
27
+ - [Progress Event Handling](#progress-event-handling)
28
+ - [License](#license)
29
+
30
+ ---
31
+
32
+ ## Architecture Overview
33
+
34
+ `@winm2m/inferential-stats-js` runs **entirely in the browser** — no backend server, no API calls, no data ever leaves the client.
35
+
36
+ ```
37
+ ┌─────────────────────────────────────────────────────────┐
38
+ │                       Main Thread                       │
39
+ │  ┌───────────────────────┐    postMessage()             │
40
+ │  │ InferentialStats SDK  │ ──── ArrayBuffer ──────┐     │
41
+ │  │ (ESM / CJS)           │     (Transferable)     │     │
42
+ │  └───────────────────────┘                        ▼     │
43
+ │                               ┌─────────────────────┐   │
44
+ │                               │     Web Worker      │   │
45
+ │                               │ ┌────────────────┐  │   │
46
+ │                               │ │ Pyodide WASM   │  │   │
47
+ │                               │ │ ┌───────────┐  │  │   │
48
+ │                               │ │ │  Python   │  │  │   │
49
+ │                               │ │ │  Runtime  │  │  │   │
50
+ │                               │ │ └───────────┘  │  │   │
51
+ │                               │ └────────────────┘  │   │
52
+ │                               └─────────────────────┘   │
53
+ └─────────────────────────────────────────────────────────┘
54
+ ```
55
+
56
+ ### Key Design Principles
57
+
58
+ | Principle | Description |
59
+ |---|---|
60
+ | **100 % Client-Side** | Statistical computation runs entirely in-browser via WebAssembly. No network requests to any analytics server. |
61
+ | **Web Worker Isolation** | All heavy computation is offloaded to a dedicated Web Worker, keeping the main thread responsive and the UI jank-free. |
62
+ | **ArrayBuffer / TypedArray Transfer** | Data is serialized into a columnar binary format (Float64Array, Int32Array, dictionary-encoded strings) and transferred to the worker using the [Transferable Objects](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Transferable_objects) API for near-zero-copy performance. |
63
+ | **Pyodide WASM Runtime** | The worker loads [Pyodide](https://pyodide.org/) — a full CPython interpreter compiled to WebAssembly — along with pandas, SciPy, statsmodels, scikit-learn, and factor_analyzer. |
64
+ | **Progress Events** | Initialization and computation stages emit `CustomEvent` progress events on a configurable `EventTarget`, enabling real-time progress bars. |
65
+ | **Dual Module Format** | Ships as both ESM (`dist/index.js`) and CommonJS (`dist/index.cjs`) with full TypeScript declarations. |
66
+
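To make the columnar transfer idea concrete, here is a minimal TypeScript sketch that packs numeric columns into `Float64Array`s and dictionary-encodes string columns into `Int32Array` codes. The `encodeColumns` function, the `EncodedColumns` shape, and all field names are hypothetical; the SDK's actual wire format is not documented in this README and may differ.

```typescript
// Hypothetical column layout for illustration only (not the SDK's real format).
type Row = Record<string, number | string>;

interface EncodedColumns {
  numeric: Record<string, Float64Array>;
  categorical: Record<string, { codes: Int32Array; dictionary: string[] }>;
  transfer: ArrayBufferLike[]; // buffers intended for postMessage()'s transfer list
}

function encodeColumns(rows: Row[]): EncodedColumns {
  const encoded: EncodedColumns = { numeric: {}, categorical: {}, transfer: [] };
  const first = rows[0];
  if (first === undefined) return encoded;

  for (const key of Object.keys(first)) {
    if (typeof first[key] === 'number') {
      // Numeric column: one contiguous Float64Array.
      const values = new Float64Array(rows.length);
      rows.forEach((row, i) => { values[i] = Number(row[key]); });
      encoded.numeric[key] = values;
      encoded.transfer.push(values.buffer);
    } else {
      // String column: store each distinct label once, plus one integer code per row.
      const dictionary: string[] = [];
      const index = new Map<string, number>();
      const codes = new Int32Array(rows.length);
      rows.forEach((row, i) => {
        const label = String(row[key]);
        let code = index.get(label);
        if (code === undefined) {
          code = dictionary.length;
          dictionary.push(label);
          index.set(label, code);
        }
        codes[i] = code;
      });
      encoded.categorical[key] = { codes, dictionary };
      encoded.transfer.push(codes.buffer);
    }
  }
  return encoded;
}

const encodedSample = encodeColumns([
  { group: 'A', score: 85 },
  { group: 'B', score: 78 },
  { group: 'A', score: 90 },
]);
```

Passing `encodedSample.transfer` as the second argument of `worker.postMessage(encodedSample, encodedSample.transfer)` would move the underlying buffers to the worker rather than copying them.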
67
+ ---
68
+
69
+ ## Installation
70
+
71
+ ```bash
72
+ npm install @winm2m/inferential-stats-js
73
+ ```
74
+
75
+ > **Peer dependency (optional):** If you want explicit control over the Pyodide version, install `pyodide` (>= 0.26.0) as a peer dependency. Otherwise the SDK loads Pyodide from the jsDelivr CDN automatically.
76
+
77
+ ---
78
+
79
+ ## Quick Start
80
+
81
+ ```typescript
82
+ import { InferentialStats, PROGRESS_EVENT_NAME } from '@winm2m/inferential-stats-js';
83
+
84
+ // 1. Listen for initialization progress
85
+ window.addEventListener(PROGRESS_EVENT_NAME, (e: Event) => {
86
+   const { stage, progress, message } = (e as CustomEvent).detail;
87
+   console.log(`[${stage}] ${progress}% — ${message}`);
88
+ });
89
+
90
+ // 2. Create an instance (pass the URL to the bundled worker)
91
+ const stats = new InferentialStats({
92
+   workerUrl: new URL('@winm2m/inferential-stats-js/worker', import.meta.url).href,
93
+ });
94
+
95
+ // 3. Initialize (loads Pyodide + Python packages inside the worker)
96
+ await stats.init();
97
+
98
+ // 4. Prepare your data
99
+ const data = [
100
+   { group: 'A', score: 85 },
101
+   { group: 'A', score: 90 },
102
+   { group: 'B', score: 78 },
103
+   { group: 'B', score: 82 },
104
+   // ... more rows
105
+ ];
106
+
107
+ // 5. Run an analysis
108
+ const result = await stats.anovaOneway({
109
+   data,
110
+   variable: 'score',
111
+   groupVariable: 'group',
112
+ });
113
+
114
+ console.log(result);
115
+ // {
116
+ //   success: true,
117
+ //   data: { fStatistic: ..., pValue: ..., groupStats: [...], ... },
118
+ //   executionTimeMs: 42
119
+ // }
120
+
121
+ // 6. Clean up when done
122
+ stats.destroy();
123
+ ```
124
+
125
+ ---
126
+
127
+ ## CDN / CodePen Usage
128
+
129
+ You can use the SDK directly in a browser or CodePen with no build step. The snippet below loads the library from a CDN, fetches the sample dataset from GitHub Pages, and runs both a crosstabs analysis and a one-way ANOVA.
130
+
131
+ ```html
132
+ <!DOCTYPE html>
133
+ <html lang="en">
134
+ <head>
135
+   <meta charset="UTF-8" />
136
+   <title>inferential-stats-js CDN Demo</title>
137
+ </head>
138
+ <body>
139
+   <h1>inferential-stats-js — CDN Demo</h1>
140
+   <pre id="output">Initializing…</pre>
141
+
142
+   <!-- The worker is a global IIFE bundle; no import or script tag is needed for it here. -->
143
+   <!-- It is loaded by URL through the workerUrl option in the module script below. -->
144
+
145
+   <script type="module">
146
+     // 1. Import the SDK from a CDN
147
+     import { InferentialStats, PROGRESS_EVENT_NAME } from 'https://unpkg.com/@winm2m/inferential-stats-js/dist/index.js';
148
+
149
+     const output = document.getElementById('output');
150
+     const log = (msg) => { output.textContent += '\n' + msg; };
151
+
152
+     // 2. Listen for progress events
153
+     window.addEventListener(PROGRESS_EVENT_NAME, (e) => {
154
+       const { stage, progress, message } = e.detail;
155
+       log(`[${stage}] ${progress}% — ${message}`);
156
+     });
157
+
158
+     // 3. Create an instance pointing to the CDN-hosted worker
159
+     const stats = new InferentialStats({
160
+       workerUrl: 'https://unpkg.com/@winm2m/inferential-stats-js/dist/stats-worker.js',
161
+     });
162
+
163
+     try {
164
+       // 4. Initialize (downloads Pyodide WASM + Python packages)
165
+       await stats.init();
166
+       log('✅ Initialization complete!');
167
+
168
+       // 5. Fetch sample survey data from GitHub Pages
169
+       const response = await fetch(
170
+         'https://winm2m.github.io/inferential-stats-js/sample-survey-data.json'
171
+       );
172
+       const data = await response.json();
173
+       log(`Loaded ${data.length} rows of survey data.`);
174
+
175
+       // 6. Run Crosstabs (gender × favorite_music)
176
+       const crosstabResult = await stats.crosstabs({
177
+         data,
178
+         rowVariable: 'gender',
179
+         colVariable: 'favorite_music',
180
+       });
181
+       log('\n— Crosstabs (gender × favorite_music) —');
182
+       log(`Chi-Square: ${crosstabResult.data.chiSquare.toFixed(4)}`);
183
+       log(`p-value: ${crosstabResult.data.pValue.toFixed(4)}`);
184
+       log(`Cramér's V: ${crosstabResult.data.cramersV.toFixed(4)}`);
185
+
186
+       // 7. Run One-Way ANOVA (music_satisfaction by age_group)
187
+       const anovaResult = await stats.anovaOneway({
188
+         data,
189
+         variable: 'music_satisfaction',
190
+         groupVariable: 'age_group',
191
+       });
192
+       log('\n— One-Way ANOVA (music_satisfaction by age_group) —');
193
+       log(`F-statistic: ${anovaResult.data.fStatistic.toFixed(4)}`);
194
+       log(`p-value: ${anovaResult.data.pValue.toFixed(4)}`);
195
+       log(`η² (eta²): ${anovaResult.data.etaSquared.toFixed(4)}`);
196
+
197
+     } catch (err) {
198
+       log('❌ Error: ' + err.message);
199
+     } finally {
200
+       stats.destroy();
201
+     }
202
+   </script>
203
+ </body>
204
+ </html>
205
+ ```
206
+
207
+ > **Tip:** Paste the JavaScript portion into the **JS panel** of CodePen (with the "JavaScript preprocessor" set to **None** or **Babel**) and the HTML into the **HTML panel**. The demo runs entirely in the browser.
208
+
209
+ ---
210
+
211
+ ## API Reference
212
+
213
+ All analysis methods are async and return `Promise<AnalysisResult<T>>`:
214
+
215
+ ```typescript
216
+ interface AnalysisResult<T> {
217
+   success: boolean;
218
+   data: T;
219
+   error?: string;
220
+   executionTimeMs: number;
221
+ }
222
+ ```
223
+
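Because every method resolves to the same envelope, callers may want a single place to check `success` before touching `data`. The sketch below assumes failures are reported through `success: false` and `error` rather than a rejected promise, which this README does not spell out. `unwrapResult` is a hypothetical helper, not part of the SDK, and the result object is a hand-written stand-in.

```typescript
// Mirrors the AnalysisResult<T> interface documented above.
interface AnalysisResult<T> {
  success: boolean;
  data: T;
  error?: string;
  executionTimeMs: number;
}

// Hypothetical helper (not part of the SDK): return the payload or throw.
function unwrapResult<T>(result: AnalysisResult<T>): T {
  if (!result.success) {
    throw new Error(result.error ?? 'Analysis failed without an error message');
  }
  return result.data;
}

// Stand-in result; with the SDK this would come from e.g. `await stats.anovaOneway(...)`.
const okResult: AnalysisResult<{ pValue: number }> = {
  success: true,
  data: { pValue: 0.03 },
  executionTimeMs: 42,
};
const payload = unwrapResult(okResult);
```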
224
+ ### Lifecycle Methods
225
+
226
+ | Method | Description |
227
+ |---|---|
228
+ | `new InferentialStats(config)` | Create an instance. `config.workerUrl` is required. Optional: `config.pyodideUrl`, `config.eventTarget`. |
229
+ | `init(): Promise<void>` | Load Pyodide and install Python packages inside the Web Worker. |
230
+ | `isInitialized(): boolean` | Returns `true` if the worker is ready. |
231
+ | `destroy(): void` | Terminate the Web Worker and release resources. |
232
+
233
+ ### Analysis Methods (16 total)
234
+
235
+ #### Descriptive Statistics
236
+
237
+ | # | Method | Input → Output | Description |
238
+ |---|---|---|---|
239
+ | 1 | `frequencies(input)` | `FrequenciesInput` → `FrequenciesOutput` | Frequency distribution and relative percentages for a categorical variable. |
240
+ | 2 | `descriptives(input)` | `DescriptivesInput` → `DescriptivesOutput` | Summary statistics (mean, std, min, max, quartiles, skewness, kurtosis) for numeric variables. |
241
+ | 3 | `crosstabs(input)` | `CrosstabsInput` → `CrosstabsOutput` | Cross-tabulation with observed/expected counts, Chi-square test, and Cramér's V. |
242
+
243
+ #### Compare Means
244
+
245
+ | # | Method | Input → Output | Description |
246
+ |---|---|---|---|
247
+ | 4 | `ttestIndependent(input)` | `TTestIndependentInput` → `TTestIndependentOutput` | Independent-samples t-test with Levene's equality-of-variances test. |
248
+ | 5 | `ttestPaired(input)` | `TTestPairedInput` → `TTestPairedOutput` | Paired-samples t-test for dependent observations. |
249
+ | 6 | `anovaOneway(input)` | `AnovaInput` → `AnovaOutput` | One-way ANOVA with group descriptives and eta-squared effect size. |
250
+ | 7 | `posthocTukey(input)` | `PostHocInput` → `PostHocOutput` | Post-hoc Tukey HSD pairwise comparisons following ANOVA. |
251
+
252
+ #### Regression
253
+
254
+ | # | Method | Input → Output | Description |
255
+ |---|---|---|---|
256
+ | 8 | `linearRegression(input)` | `LinearRegressionInput` → `LinearRegressionOutput` | OLS linear regression with coefficients, R², F-test, and Durbin-Watson statistic. |
257
+ | 9 | `logisticBinary(input)` | `LogisticBinaryInput` → `LogisticBinaryOutput` | Binary logistic regression with odds ratios, pseudo-R², and model fit statistics. |
258
+ | 10 | `logisticMultinomial(input)` | `MultinomialLogisticInput` → `MultinomialLogisticOutput` | Multinomial logistic regression with per-category coefficients and odds ratios. |
259
+
260
+ #### Classify
261
+
262
+ | # | Method | Input → Output | Description |
263
+ |---|---|---|---|
264
+ | 11 | `kmeans(input)` | `KMeansInput` → `KMeansOutput` | K-Means clustering with cluster centers, labels, and inertia. |
265
+ | 12 | `hierarchicalCluster(input)` | `HierarchicalClusterInput` → `HierarchicalClusterOutput` | Agglomerative hierarchical clustering with linkage matrix and dendrogram data. |
266
+
267
+ #### Dimension Reduction
268
+
269
+ | # | Method | Input → Output | Description |
270
+ |---|---|---|---|
271
+ | 13 | `efa(input)` | `EFAInput` → `EFAOutput` | Exploratory Factor Analysis with rotation, KMO, and Bartlett's test. |
272
+ | 14 | `pca(input)` | `PCAInput` → `PCAOutput` | Principal Component Analysis with loadings and explained variance. |
273
+ | 15 | `mds(input)` | `MDSInput` → `MDSOutput` | Multidimensional Scaling with stress value and coordinate output. |
274
+
275
+ #### Scale
276
+
277
+ | # | Method | Input → Output | Description |
278
+ |---|---|---|---|
279
+ | 16 | `cronbachAlpha(input)` | `CronbachAlphaInput` → `CronbachAlphaOutput` | Reliability analysis with Cronbach's alpha, item-total correlations, and alpha-if-deleted. |
280
+
281
+ ---
282
+
283
+ ## Core Analysis Features — Mathematical & Technical Documentation
284
+
285
+ This section documents the mathematical foundations and internal Python implementations of all 16 analyses.
286
+
287
+ ---
288
+
289
+ ### ① Descriptive Statistics
290
+
291
+ #### Frequencies
292
+
293
+ Computes a frequency distribution for a categorical variable, including absolute counts, relative percentages, and cumulative percentages.
294
+
295
+ **Python implementation:** `pandas.Series.value_counts(normalize=True)`
296
+
297
+ **Relative frequency:**
298
+
299
+ $$f_i = \frac{n_i}{N}$$
300
+
301
+ where $n_i$ is the count of category $i$ and $N$ is the total number of observations. Cumulative percentage is the running sum of $f_i \times 100$.
302
+
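The frequency formulas can be sketched in a few lines of TypeScript. This is an illustration of the arithmetic only, not the SDK's code path (which runs pandas inside the worker); `frequencyTable` and the `FrequencyRow` field names are assumptions.

```typescript
interface FrequencyRow {
  category: string;
  count: number;
  percent: number;
  cumulativePercent: number;
}

function frequencyTable(values: string[]): FrequencyRow[] {
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);

  const total = values.length;
  let running = 0;
  // Sort by descending count, then accumulate f_i * 100 as a running sum.
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([category, count]) => {
      const percent = (count / total) * 100;
      running += percent;
      return { category, count, percent, cumulativePercent: running };
    });
}

const musicTable = frequencyTable(['Rock', 'Jazz', 'Rock', 'Pop', 'Rock', 'Jazz', 'Rock', 'Pop']);
// First row: Rock with count 4 and percent 50; the last cumulativePercent reaches 100.
```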
303
+ ---
304
+
305
+ #### Descriptives
306
+
307
+ Produces summary statistics for one or more numeric variables: count, mean, standard deviation, min, max, quartiles (Q1, Q2, Q3), skewness, and kurtosis.
308
+
309
+ **Python implementation:** `pandas.DataFrame.describe()`, `scipy.stats.skew`, `scipy.stats.kurtosis`
310
+
311
+ **Arithmetic mean:**
312
+
313
+ $$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
314
+
315
+ **Sample standard deviation (Bessel-corrected):**
316
+
317
+ $$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$
318
+
319
+ **Skewness (Fisher):**
320
+
321
+ $$g_1 = \frac{m_3}{m_2^{3/2}}, \quad m_k = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^k$$
322
+
323
+ **Excess kurtosis (Fisher):**
324
+
325
+ $$g_2 = \frac{m_4}{m_2^2} - 3$$
326
+
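As a worked check of the moment-based formulas above, the following sketch computes the sample standard deviation, skewness, and excess kurtosis directly from central moments. The function names are illustrative; the SDK itself delegates to pandas and SciPy.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// k-th central moment m_k, using the 1/N divisor from the formulas above.
function centralMoment(xs: number[], k: number): number {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + Math.pow(x - m, k), 0) / xs.length;
}

// Bessel-corrected standard deviation: rescale m_2 from 1/N to 1/(N-1).
function sampleStd(xs: number[]): number {
  return Math.sqrt((centralMoment(xs, 2) * xs.length) / (xs.length - 1));
}

function skewness(xs: number[]): number {
  return centralMoment(xs, 3) / Math.pow(centralMoment(xs, 2), 1.5);
}

function excessKurtosis(xs: number[]): number {
  return centralMoment(xs, 4) / Math.pow(centralMoment(xs, 2), 2) - 3;
}

const scores = [2, 4, 4, 4, 5, 5, 7, 9];
const summary = {
  mean: mean(scores),               // 5
  std: sampleStd(scores),           // sqrt(32 / 7) ≈ 2.138
  skew: skewness(scores),           // 5.25 / 8 ≈ 0.656
  kurtosis: excessKurtosis(scores), // 44.5 / 16 - 3 ≈ -0.219
};
```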
327
+ ---
328
+
329
+ #### Crosstabs
330
+
331
+ Cross-tabulates two categorical variables and tests for independence using Pearson's Chi-square test. Reports observed and expected counts, row/column/total percentages, and Cramér's V as an effect-size measure.
332
+
333
+ **Python implementation:** `pandas.crosstab`, `scipy.stats.chi2_contingency`
334
+
335
+ **Pearson's Chi-square statistic:**
336
+
337
+ $$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
338
+
339
+ where $O_{ij}$ is the observed frequency in cell $(i, j)$ and $E_{ij} = \frac{R_i \cdot C_j}{N}$ is the expected frequency under independence.
340
+
341
+ **Cramér's V:**
342
+
343
+ $$V = \sqrt{\frac{\chi^2}{N \cdot (k-1)}}$$
344
+
345
+ where $k = \min(\text{rows}, \text{cols})$.
346
+
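The statistic and effect size can be reproduced by hand from a table of observed counts. The sketch below is illustrative only: it omits the p-value (which needs the chi-square distribution) and applies no continuity correction, so results for 2×2 tables may differ from a corrected test. `chiSquareIndependence` is a hypothetical name.

```typescript
function chiSquareIndependence(observed: number[][]) {
  const rowTotals = observed.map((row) => row.reduce((s, v) => s + v, 0));
  const colCount = observed[0].length;
  const colTotals = Array.from({ length: colCount }, (_, j) =>
    observed.reduce((s, row) => s + row[j], 0),
  );
  const n = rowTotals.reduce((s, v) => s + v, 0);

  let chiSquare = 0;
  observed.forEach((row, i) => {
    row.forEach((o, j) => {
      const expected = (rowTotals[i] * colTotals[j]) / n; // E_ij = R_i * C_j / N
      chiSquare += (o - expected) ** 2 / expected;
    });
  });

  const k = Math.min(observed.length, colCount);
  return {
    chiSquare,
    df: (observed.length - 1) * (colCount - 1),
    cramersV: Math.sqrt(chiSquare / (n * (k - 1))),
  };
}

// 2x2 example: every expected count is 15, so chi-square = 4 * 25 / 15 ≈ 6.667.
const chiDemo = chiSquareIndependence([
  [10, 20],
  [20, 10],
]);
```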
347
+ ---
348
+
349
+ ### ② Compare Means
350
+
351
+ #### Independent-Samples T-Test
352
+
353
+ Compares the means of a numeric variable between two independent groups. Automatically reports results for both equal-variance and unequal-variance (Welch's) assumptions. Includes Levene's test for equality of variances.
354
+
355
+ **Python implementation:** `scipy.stats.ttest_ind`, `scipy.stats.levene`
356
+
357
+ **T-statistic (equal variance assumed):**
358
+
359
+ $$t = \frac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$
360
+
361
+ **Pooled standard deviation:**
362
+
363
+ $$S_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$$
364
+
365
+ **Degrees of freedom:** $df = n_1 + n_2 - 2$
366
+
367
+ When Levene's test is significant ($p < 0.05$), Welch's t-test is recommended, which uses the Welch–Satterthwaite approximation for degrees of freedom.
368
+
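The two variants differ only in the standard error and the degrees of freedom. The sketch below computes both test statistics from raw samples; p-values are omitted because they require the t distribution. Function names are illustrative and not part of the SDK.

```typescript
function sampleMean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

function sampleVariance(xs: number[]): number {
  const m = sampleMean(xs);
  return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
}

// Equal variances assumed: pooled standard deviation, df = n1 + n2 - 2.
function pooledTTest(a: number[], b: number[]) {
  const n1 = a.length;
  const n2 = b.length;
  const df = n1 + n2 - 2;
  const pooledSd = Math.sqrt(((n1 - 1) * sampleVariance(a) + (n2 - 1) * sampleVariance(b)) / df);
  const t = (sampleMean(a) - sampleMean(b)) / (pooledSd * Math.sqrt(1 / n1 + 1 / n2));
  return { t, df };
}

// Welch's test: separate variances, Welch-Satterthwaite degrees of freedom.
function welchTTest(a: number[], b: number[]) {
  const v1 = sampleVariance(a) / a.length;
  const v2 = sampleVariance(b) / b.length;
  const t = (sampleMean(a) - sampleMean(b)) / Math.sqrt(v1 + v2);
  const df = (v1 + v2) ** 2 / (v1 ** 2 / (a.length - 1) + v2 ** 2 / (b.length - 1));
  return { t, df };
}

const groupA = [85, 90, 88, 92];
const groupB = [78, 82, 80, 76];
const pooled = pooledTTest(groupA, groupB); // t ≈ 4.94, df = 6
const welch = welchTTest(groupA, groupB);   // same t here (equal n), df ≈ 5.88
```

With equal group sizes the two t-statistics coincide; only the degrees of freedom differ.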
369
+ ---
370
+
371
+ #### Paired-Samples T-Test
372
+
373
+ Tests whether the mean difference between two paired measurements is significantly different from zero.
374
+
375
+ **Python implementation:** `scipy.stats.ttest_rel`
376
+
377
+ **T-statistic:**
378
+
379
+ $$t = \frac{\bar{D}}{S_D / \sqrt{n}}$$
380
+
381
+ where $\bar{D} = \frac{1}{n}\sum_{i=1}^{n}(X_{1i} - X_{2i})$ is the mean difference and $S_D$ is the standard deviation of the differences.
382
+
383
+ **Degrees of freedom:** $df = n - 1$
384
+
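A minimal sketch of the paired statistic, working on the per-pair differences. `pairedTTest` is an illustrative name, and the p-value is again left out.

```typescript
function pairedTTest(first: number[], second: number[]) {
  // D_i = X_1i - X_2i for each pair.
  const diffs = first.map((v, i) => v - second[i]);
  const n = diffs.length;
  const meanDiff = diffs.reduce((s, d) => s + d, 0) / n;
  const sdDiff = Math.sqrt(diffs.reduce((s, d) => s + (d - meanDiff) ** 2, 0) / (n - 1));
  return { t: meanDiff / (sdDiff / Math.sqrt(n)), df: n - 1, meanDiff };
}

// Differences are -2, -2, -1, -4, so the mean difference is -2.25.
const pairedDemo = pairedTTest([10, 12, 9, 11], [12, 14, 10, 15]);
```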
385
+ ---
386
+
387
+ #### One-Way ANOVA
388
+
389
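+ Tests whether the means of a numeric variable differ significantly across two or more groups. It is typically used with three or more groups, since with exactly two groups it is equivalent to the equal-variance t-test ($F = t^2$).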
390
+
391
+ **Python implementation:** `scipy.stats.f_oneway`
392
+
393
+ **F-statistic:**
394
+
395
+ $$F = \frac{MS_{between}}{MS_{within}}$$
396
+
397
+ **Sum of Squares Between Groups:**
398
+
399
+ $$SS_{between} = \sum_{j=1}^{k} n_j(\bar{X}_j - \bar{X})^2$$
400
+
401
+ **Sum of Squares Within Groups:**
402
+
403
+ $$SS_{within} = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar{X}_j)^2$$
404
+
405
+ **Mean Squares:**
406
+
407
+ $$MS_{between} = \frac{SS_{between}}{k-1}, \quad MS_{within} = \frac{SS_{within}}{N-k}$$
408
+
409
+ **Effect size (Eta-squared):**
410
+
411
+ $$\eta^2 = \frac{SS_{between}}{SS_{total}}, \quad SS_{total} = SS_{between} + SS_{within}$$
412
+
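The sums of squares, F-statistic, and eta-squared can be sketched directly from the definitions. This is an illustration of the arithmetic, not the SDK's implementation; the returned field names `fStatistic` and `etaSquared` follow the result fields used elsewhere in this README, while `onewayAnova` and `msWithin` are assumptions.

```typescript
function onewayAnova(groups: number[][]) {
  const all = groups.reduce<number[]>((acc, g) => acc.concat(g), []);
  const grandMean = all.reduce((s, x) => s + x, 0) / all.length;

  let ssBetween = 0;
  let ssWithin = 0;
  for (const g of groups) {
    const groupMean = g.reduce((s, x) => s + x, 0) / g.length;
    ssBetween += g.length * (groupMean - grandMean) ** 2;
    ssWithin += g.reduce((s, x) => s + (x - groupMean) ** 2, 0);
  }

  const k = groups.length;
  const n = all.length;
  const msBetween = ssBetween / (k - 1);
  const msWithin = ssWithin / (n - k);
  return {
    fStatistic: msBetween / msWithin,
    etaSquared: ssBetween / (ssBetween + ssWithin), // SS_total = SS_between + SS_within
    msWithin,
  };
}

// Group means 2, 3, 6: SS_between = 26, SS_within = 6, so F = 13 and eta² = 0.8125.
const anovaDemo = onewayAnova([
  [1, 2, 3],
  [2, 3, 4],
  [5, 6, 7],
]);
```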
413
+ ---
414
+
415
+ #### Post-hoc Tukey HSD
416
+
417
+ Performs pairwise comparisons of group means following a significant ANOVA result using the Studentized Range distribution.
418
+
419
+ **Python implementation:** `statsmodels.stats.multicomp.pairwise_tukeyhsd`
420
+
421
+ **Studentized range statistic:**
422
+
423
+ $$q = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{MS_W / n}}$$
424
+
425
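+ where $MS_W$ is the within-group mean square from the ANOVA and $n$ is the per-group sample size. For unequal group sizes (the Tukey–Kramer adjustment), $n$ is the harmonic mean of the two group sizes being compared, $n = 2 / (1/n_i + 1/n_j)$. The critical $q$ value is obtained from the Studentized Range distribution with $k$ groups and $N - k$ degrees of freedom.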
426
+
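For a single pair of groups, the statistic can be computed as below. This sketch uses the Tukey-Kramer form with the harmonic mean of the two group sizes and does not look up the critical value, which requires the Studentized Range distribution. `tukeyQ` is a hypothetical helper.

```typescript
function tukeyQ(meanI: number, meanJ: number, nI: number, nJ: number, msWithin: number): number {
  // Harmonic mean of the two group sizes; equals n when both groups have size n.
  const harmonicN = 2 / (1 / nI + 1 / nJ);
  return Math.abs(meanI - meanJ) / Math.sqrt(msWithin / harmonicN);
}

// Means 2 and 6, three observations each, MS_W = 1: q = 4 / sqrt(1/3) ≈ 6.93.
const qDemo = tukeyQ(2, 6, 3, 3, 1);
```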
427
+ ---
428
+
429
+ ### ③ Regression
430
+
431
+ #### Linear Regression (OLS)
432
+
433
+ Fits an Ordinary Least Squares regression model with one or more independent variables. Reports regression coefficients, standard errors, t-statistics, p-values, confidence intervals, $R^2$, adjusted $R^2$, F-test, and the Durbin-Watson statistic for autocorrelation detection.
434
+
435
+ **Python implementation:** `statsmodels.api.OLS`
436
+
437
+ **Model:**
438
+
439
+ $$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$$
440
+
441
+ where $\epsilon \sim N(0, \sigma^2)$.
442
+
443
+ **OLS estimator:**
444
+
445
+ $$\hat{\beta} = (X^T X)^{-1} X^T Y$$
446
+
447
+ **Coefficient of determination:**
448
+
449
+ $$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
450
+
451
+ where $SS_{res} = \sum(Y_i - \hat{Y}_i)^2$ and $SS_{tot} = \sum(Y_i - \bar{Y})^2$.
452
+
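With a single predictor, the OLS estimator reduces to a closed form, which makes for a compact worked example. The sketch below covers only that one-predictor case and is not the SDK's implementation (which uses statsmodels); `simpleOls` is an illustrative name.

```typescript
function simpleOls(x: number[], y: number[]) {
  const n = x.length;
  const xBar = x.reduce((s, v) => s + v, 0) / n;
  const yBar = y.reduce((s, v) => s + v, 0) / n;

  let sxy = 0;
  let sxx = 0;
  x.forEach((xi, i) => {
    sxy += (xi - xBar) * (y[i] - yBar);
    sxx += (xi - xBar) ** 2;
  });
  const slope = sxy / sxx;                 // beta_1
  const intercept = yBar - slope * xBar;   // beta_0

  let ssRes = 0;
  let ssTot = 0;
  x.forEach((xi, i) => {
    ssRes += (y[i] - (intercept + slope * xi)) ** 2;
    ssTot += (y[i] - yBar) ** 2;
  });
  return { intercept, slope, rSquared: 1 - ssRes / ssTot };
}

// S_xy = 6 and S_xx = 10, so slope = 0.6, intercept = 2.2, R² = 1 - 2.4 / 6 = 0.6.
const olsDemo = simpleOls([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]);
```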
453
+ ---
454
+
455
+ #### Binary Logistic Regression
456
+
457
+ Models the probability of a binary outcome as a function of one or more independent variables. Reports coefficients (log-odds), odds ratios, z-statistics, p-values, pseudo-$R^2$, AIC, and BIC.
458
+
459
+ **Python implementation:** `statsmodels.discrete.discrete_model.Logit`
460
+
461
+ **Logit link function:**
462
+
463
+ $$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$
464
+
465
+ **Predicted probability:**
466
+
467
+ $$P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}}$$
468
+
469
+ Coefficients are estimated by Maximum Likelihood Estimation (MLE). The odds ratio for predictor $j$ is $e^{\beta_j}$.
470
+
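Given fitted coefficients, the predicted probability and odds ratios follow directly from the formulas above. The coefficient values in this sketch are made up for illustration, and both function names are hypothetical; estimating the coefficients by MLE is not shown.

```typescript
function predictProbability(intercept: number, coefficients: number[], x: number[]): number {
  // Linear predictor: beta_0 + sum_j beta_j * x_j.
  const linear = coefficients.reduce((sum, b, j) => sum + b * x[j], intercept);
  return 1 / (1 + Math.exp(-linear));
}

function oddsRatios(coefficients: number[]): number[] {
  return coefficients.map((b) => Math.exp(b));
}

// Illustrative values: log-odds = -1.5 + 0.8 * 2 = 0.1, so p ≈ 0.525.
const probability = predictProbability(-1.5, [0.8], [2]);
const ratios = oddsRatios([0.8]); // e^0.8 ≈ 2.226
```

An odds ratio of about 2.23 means that each one-unit increase in the predictor multiplies the odds of the outcome by roughly 2.23, holding other predictors fixed.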
471
+ ---
472
+
473
+ #### Multinomial Logistic Regression
474
+
475
+ Extends binary logistic regression to outcomes with more than two unordered categories. One category is designated as the reference; the model estimates log-odds of each other category relative to the reference.
476
+
477
+ **Python implementation:** `sklearn.linear_model.LogisticRegression(multi_class='multinomial')`
478
+
479
+ **Log-odds relative to reference category $K$:**
480
+
481
+ $$\ln\left(\frac{P(Y=k)}{P(Y=K)}\right) = \beta_{k0} + \beta_{k1}X_1 + \cdots + \beta_{kp}X_p$$
482
+
483
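+ for each category $k \neq K$, with the reference category's coefficients fixed at zero ($\beta_{K0} = \cdots = \beta_{Kp} = 0$) so that the model is identifiable.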
484
+
485
+ **Predicted probability via softmax:**
486
+
487
+ $$P(Y=k|X) = \frac{e^{\beta_{k0} + \beta_{k1}X_1 + \cdots + \beta_{kp}X_p}}{\sum_{j=1}^{K} e^{\beta_{j0} + \beta_{j1}X_1 + \cdots + \beta_{jp}X_p}}$$
488
+
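The softmax step can be sketched as follows, assuming the reference-category parameterization in which the reference category's linear predictor is zero. The coefficient values are invented for illustration, and `multinomialProbabilities` is a hypothetical name.

```typescript
// Each inner array holds [intercept, slope_1, ..., slope_p] for one non-reference category.
// The returned probabilities list the non-reference categories first and the reference last.
function multinomialProbabilities(coefficients: number[][], x: number[]): number[] {
  const linear = coefficients.map((beta) =>
    beta.slice(1).reduce((sum, b, j) => sum + b * x[j], beta[0]),
  );
  linear.push(0); // reference category: all coefficients fixed at zero

  const maxLinear = Math.max(...linear); // subtract the max for numerical stability
  const weights = linear.map((v) => Math.exp(v - maxLinear));
  const total = weights.reduce((s, w) => s + w, 0);
  return weights.map((w) => w / total);
}

// Linear predictors 1.5 and 0.1 for the two non-reference categories at x = 1.
const classProbabilities = multinomialProbabilities(
  [
    [0.5, 1.0],
    [-0.2, 0.3],
  ],
  [1],
);
```

Taking `Math.log(classProbabilities[0] / classProbabilities[2])` recovers the first category's linear predictor, which is the log-odds relationship stated above.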
489
+ ---
490
+
491
+ ### ④ Classify
492
+
493
+ #### K-Means Clustering
494
+
495
+ Partitions observations into $K$ clusters by iteratively assigning points to the nearest centroid and updating centroids until convergence.
496
+
497
+ **Python implementation:** `sklearn.cluster.KMeans`
498
+
499
+ **Objective function (inertia):**
500
+
501
+ $$J = \sum_{j=1}^{K}\sum_{i \in C_j} \|x_i - \mu_j\|^2$$
502
+
503
+ where $C_j$ is the set of observations in cluster $j$ and $\mu_j$ is the centroid. The algorithm minimizes $J$ using Lloyd's algorithm (Expectation-Maximization style).
504
+
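One iteration of Lloyd's algorithm, together with the inertia it produces, can be sketched like this. It is a simplified illustration (fixed starting centroids, one step, no convergence loop), not what `sklearn.cluster.KMeans` does internally; `lloydStep` is a hypothetical name.

```typescript
type Point = number[];

function squaredDistance(a: Point, b: Point): number {
  return a.reduce((sum, v, d) => sum + (v - b[d]) ** 2, 0);
}

function lloydStep(points: Point[], centroids: Point[]) {
  // Assignment step: label each point with the index of its nearest centroid.
  const labels = points.map((p) => {
    let best = 0;
    centroids.forEach((c, j) => {
      if (squaredDistance(p, c) < squaredDistance(p, centroids[best])) best = j;
    });
    return best;
  });

  // Update step: move each centroid to the mean of its assigned points.
  const updated = centroids.map((c, j) => {
    const members = points.filter((_, i) => labels[i] === j);
    if (members.length === 0) return c; // leave an empty cluster's centroid unchanged
    return c.map((_, d) => members.reduce((s, p) => s + p[d], 0) / members.length);
  });

  // Inertia J: total squared distance from each point to its updated centroid.
  const inertia = points.reduce((s, p, i) => s + squaredDistance(p, updated[labels[i]]), 0);
  return { labels, centroids: updated, inertia };
}

// Two obvious clusters; the centroids move to (1, 1.5) and (8.5, 8), giving J = 1.
const step = lloydStep([[1, 1], [1, 2], [8, 8], [9, 8]], [[0, 0], [10, 10]]);
```

Repeating `lloydStep` with the returned centroids until the labels stop changing would give the full iterative procedure.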
505
+ ---
506
+
507
+ #### Hierarchical (Agglomerative) Clustering
508
+
509
+ Builds a hierarchy of clusters using a bottom-up approach. Supports Ward, complete, average, and single linkage methods. Returns a full linkage matrix and dendrogram data for visualization.
510
+
511
+ **Python implementation:** `scipy.cluster.hierarchy.linkage`, `scipy.cluster.hierarchy.fcluster`
512
+
513
+ **Ward's minimum variance method** (default):
514
+
515
+ $$\Delta(A,B) = \frac{n_A n_B}{n_A + n_B}\|\bar{x}_A - \bar{x}_B\|^2$$
516
+
517
+ At each step, the pair of clusters $(A, B)$ that produces the smallest increase in total within-cluster variance is merged. Ward's method tends to produce compact clusters of roughly similar size.
518
+
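The merge cost in Ward's formula depends only on the cluster sizes and centroids, as this small sketch shows. It evaluates the formula for one candidate pair and does not build a linkage matrix; `wardIncrease` and `centroidOf` are illustrative names.

```typescript
function centroidOf(cluster: number[][]): number[] {
  return cluster[0].map((_, d) => cluster.reduce((s, p) => s + p[d], 0) / cluster.length);
}

// Increase in within-cluster sum of squares if clusters a and b were merged.
function wardIncrease(a: number[][], b: number[][]): number {
  const ca = centroidOf(a);
  const cb = centroidOf(b);
  const squaredDist = ca.reduce((s, v, d) => s + (v - cb[d]) ** 2, 0);
  return ((a.length * b.length) / (a.length + b.length)) * squaredDist;
}

// Centroids (1, 0) and (5, 0): (2 * 1 / 3) * 16 ≈ 10.67.
const wardDelta = wardIncrease([[0, 0], [2, 0]], [[5, 0]]);
```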
519
+ ---
520
+
521
+ ### ⑤ Dimension Reduction
522
+
523
+ #### Exploratory Factor Analysis (EFA)
524
+
525
+ Discovers latent factors underlying a set of observed variables. Supports varimax, promax, oblimin, and no rotation. Reports factor loadings, communalities, eigenvalues, KMO measure of sampling adequacy, and Bartlett's test of sphericity.
526
+
527
+ **Python implementation:** `factor_analyzer.FactorAnalyzer(rotation='varimax')` — installed at runtime via `micropip`
528
+
529
+ **Factor model:**
530
+
531
+ $$X = \Lambda F + \epsilon$$
532
+
533
+ where $X$ is the observed variable vector, $\Lambda$ is the matrix of factor loadings, $F$ is the vector of latent factors, and $\epsilon$ is the unique variance.
534
+
535
+ **Kaiser-Meyer-Olkin (KMO) measure:**
536
+
537
+ $$KMO = \frac{\sum\sum_{i \neq j} r_{ij}^2}{\sum\sum_{i \neq j} r_{ij}^2 + \sum\sum_{i \neq j} u_{ij}^2}$$
538
+
539
+ where $r_{ij}$ are elements of the correlation matrix and $u_{ij}$ are elements of the partial correlation matrix. KMO values above 0.6 are generally considered acceptable for factor analysis.
540
+
541
+ ---
542
+
543
+ #### Principal Component Analysis (PCA)
544
+
545
+ Finds orthogonal components that maximize variance in the data. Reports component loadings, explained variance, cumulative variance ratios, and singular values. Optionally standardizes the input.
546
+
547
+ **Python implementation:** `sklearn.decomposition.PCA`
548
+
549
+ **Objective:** Find the weight vector $w$ that maximizes projected variance:
550
+
551
+ $$\text{Var}(Xw) \to \max \quad \text{subject to} \quad \|w\| = 1$$
552
+
553
+ This is equivalent to finding the eigenvectors of the covariance matrix $\Sigma = \frac{1}{N-1}X^TX$, where $X$ has been mean-centered column by column. The eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots$ represent the variance explained by each component.
554
+
555
+ **Explained variance ratio:**
556
+
557
+ $$\text{EVR}_k = \frac{\lambda_k}{\sum_{i=1}^{p} \lambda_i}$$
558
+
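Once the eigenvalues are known, the explained variance ratios and their cumulative sums are simple to derive. The eigenvalues below are made up for illustration, and the function names are hypothetical.

```typescript
function explainedVarianceRatio(eigenvalues: number[]): number[] {
  const total = eigenvalues.reduce((s, v) => s + v, 0);
  return eigenvalues.map((v) => v / total);
}

function cumulativeSum(values: number[]): number[] {
  let running = 0;
  return values.map((v) => (running += v));
}

const evr = explainedVarianceRatio([2.5, 1.0, 0.5]); // [0.625, 0.25, 0.125]
const cumulativeEvr = cumulativeSum(evr);            // [0.625, 0.875, 1]
```

In this example the first two components together account for 87.5% of the total variance.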
559
+ ---
560
+
561
+ #### Multidimensional Scaling (MDS)
562
+
563
+ Projects high-dimensional data into a lower-dimensional space (typically 2D) while preserving pairwise distances. Supports both metric and non-metric MDS.
564
+
565
+ **Python implementation:** `sklearn.manifold.MDS`
566
+
567
+ **Stress function (Kruskal's Stress-1):**
568
+
569
+ $$\sigma = \sqrt{\frac{\sum_{i<j}(d_{ij} - \delta_{ij})^2}{\sum_{i<j}d_{ij}^2}}$$
570
+
571
+ where $d_{ij}$ is the distance in the reduced space and $\delta_{ij}$ is the original distance (or a monotonic transformation for non-metric MDS). As a rule of thumb attributed to Kruskal, stress near 0.2 indicates a poor fit, 0.1 a fair fit, 0.05 a good fit, and 0.025 or less an excellent fit.
572
+
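The stress formula can be evaluated for any candidate embedding, as sketched below. The function takes the original dissimilarity matrix and the embedded coordinates; it only scores a configuration and does not perform the MDS optimization. `stress1` is an illustrative name and may not match the value a particular library reports under its own stress definition.

```typescript
function euclidean(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((s, v, d) => s + (v - b[d]) ** 2, 0));
}

// original[i][j] holds delta_ij; embedded[i] holds the low-dimensional coordinates of point i.
function stress1(original: number[][], embedded: number[][]): number {
  let numerator = 0;
  let denominator = 0;
  for (let i = 0; i < embedded.length; i++) {
    for (let j = i + 1; j < embedded.length; j++) {
      const d = euclidean(embedded[i], embedded[j]);
      numerator += (d - original[i][j]) ** 2;
      denominator += d ** 2;
    }
  }
  return Math.sqrt(numerator / denominator);
}

// Three points on a line at 0, 1, 3; the original distance between the last two is 3, not 2.
const stressDemo = stress1(
  [
    [0, 1, 3],
    [1, 0, 3],
    [3, 3, 0],
  ],
  [[0], [1], [3]],
); // sqrt(1 / 14) ≈ 0.267
```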
573
+ ---
574
+
575
+ ### ⑥ Scale
576
+
577
+ #### Cronbach's Alpha
578
+
579
+ Measures the internal consistency (reliability) of a set of scale items. Reports raw alpha, standardized alpha, item-total correlations, and alpha-if-item-deleted for diagnostic purposes.
580
+
581
+ **Python implementation:** Custom implementation using `pandas` covariance matrix operations
582
+
583
+ **Cronbach's alpha (raw):**
584
+
585
+ $$\alpha = \frac{K}{K-1}\left(1 - \frac{\sum_{i=1}^{K} \sigma_{Y_i}^2}{\sigma_X^2}\right)$$
586
+
587
+ where $K$ is the number of items, $\sigma_{Y_i}^2$ is the variance of item $i$, and $\sigma_X^2$ is the variance of the total score.
588
+
589
+ **Standardized alpha (based on mean inter-item correlation):**
590
+
591
+ $$\alpha_{std} = \frac{K\bar{r}}{1+(K-1)\bar{r}}$$
592
+
593
+ where $\bar{r}$ is the mean of all pairwise Pearson correlations among items.
594
+
595
+ | $\alpha$ Range | Interpretation |
596
+ |---|---|
597
+ | ≥ 0.9 | Excellent |
598
+ | 0.8 – 0.9 | Good |
599
+ | 0.7 – 0.8 | Acceptable |
600
+ | 0.6 – 0.7 | Questionable |
601
+ | < 0.6 | Poor |
602
+
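Raw alpha can be computed from item variances and the variance of the total score, as in this sketch. It covers only the raw coefficient (not standardized alpha or the item diagnostics), and the response matrix is invented for illustration; `cronbachAlpha` here is a standalone function, not the SDK method of the same name.

```typescript
function variance(xs: number[]): number {
  const m = xs.reduce((s, x) => s + x, 0) / xs.length;
  return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
}

// responses[r][i] is respondent r's answer to item i.
function cronbachAlpha(responses: number[][]): number {
  const k = responses[0].length;
  const itemVariances = Array.from({ length: k }, (_, i) =>
    variance(responses.map((row) => row[i])),
  );
  const totalScores = responses.map((row) => row.reduce((s, v) => s + v, 0));
  const sumItemVariances = itemVariances.reduce((s, v) => s + v, 0);
  return (k / (k - 1)) * (1 - sumItemVariances / variance(totalScores));
}

// Item variances sum to 4, total-score variance is 26/3, so alpha = 21/26 ≈ 0.808.
const alphaDemo = cronbachAlpha([
  [2, 3, 3],
  [4, 4, 5],
  [3, 2, 4],
  [5, 5, 4],
]);
```

By the table above, a value of about 0.81 would fall in the "Good" range.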
603
+ ---
604
+
605
+ ## Sample Data
606
+
607
+ The repository includes a ready-to-use sample dataset at `docs/sample-survey-data.json`, also hosted on GitHub Pages at:
608
+
609
+ ```
610
+ https://winm2m.github.io/inferential-stats-js/sample-survey-data.json
611
+ ```
612
+
613
+ This dataset contains **2,000 rows** of simulated survey data generated with a seeded pseudo-random number generator for full reproducibility.
614
+
615
+ ### Schema
616
+
617
+ | Column | Type | Description |
618
+ |---|---|---|
619
+ | `id` | integer | Unique respondent ID (1–2000) |
620
+ | `gender` | string | `"Male"`, `"Female"`, or `"Other"` |
621
+ | `age_group` | string | `"20s"`, `"30s"`, `"40s"`, `"50s"`, `"60s"` |
622
+ | `nationality` | string | One of several country labels |
623
+ | `favorite_music` | string | Preferred music genre |
624
+ | `favorite_movie` | string | Preferred movie genre |
625
+ | `favorite_art` | string | Preferred art form |
626
+ | `music_satisfaction` | integer (1–5) | Satisfaction with music offerings (Likert scale) |
627
+ | `movie_satisfaction` | integer (1–5) | Satisfaction with movie offerings (Likert scale) |
628
+ | `art_satisfaction` | integer (1–5) | Satisfaction with art offerings (Likert scale) |
629
+ | `weekly_hours_music` | float | Weekly hours spent on music |
630
+ | `weekly_hours_movie` | float | Weekly hours spent on movies |
631
+ | `monthly_art_visits` | integer | Number of art gallery visits per month |
632
+
633
+ This dataset is suitable for exercising every analysis method in the SDK.
634
+
635
+ ---
636
+
637
+ ## Progress Event Handling
638
+
639
+ During `init()`, the SDK dispatches `CustomEvent`s to report progress through multiple stages (loading Pyodide, installing Python packages, etc.). You can use these events to drive a progress bar or loading indicator.
640
+
641
+ ### Event Name
642
+
643
+ The event name is exported as the constant `PROGRESS_EVENT_NAME` (value: `'inferential-stats-progress'`).
644
+
645
+ ### Event Detail
646
+
647
+ ```typescript
648
+ interface ProgressDetail {
649
+   stage: string;    // Current stage identifier (e.g. "pyodide", "packages")
650
+   progress: number; // Percentage complete (0–100)
651
+   message: string;  // Human-readable status message
652
+ }
653
+ ```
654
+
655
+ ### Example: Full Progress Listener
656
+
657
+ ```typescript
658
+ import { InferentialStats, PROGRESS_EVENT_NAME } from '@winm2m/inferential-stats-js';
659
+
660
+ // You can target any EventTarget — window, document, or a custom one.
661
+ const eventTarget = window;
662
+
663
+ const stats = new InferentialStats({
664
+   workerUrl: '/dist/stats-worker.js',
665
+   eventTarget, // Progress events will be dispatched here
666
+ });
667
+
668
+ // Register the listener BEFORE calling init()
669
+ eventTarget.addEventListener(PROGRESS_EVENT_NAME, ((event: CustomEvent) => {
670
+   const { stage, progress, message } = event.detail as {
671
+     stage: string;
672
+     progress: number;
673
+     message: string;
674
+   };
675
+
676
+   // Update a progress bar (skipped if the element is not on the page)
677
+   const progressBar = document.getElementById('progress-bar') as HTMLProgressElement | null;
678
+   if (progressBar) progressBar.max = 100;
679
+   if (progressBar) progressBar.value = progress;
680
+
681
+   // Update a status label
682
+   const statusLabel = document.getElementById('status');
683
+   if (statusLabel) {
684
+     statusLabel.textContent = `[${stage}] ${message} (${progress}%)`;
685
+   }
686
+
687
+   console.log(`[${stage}] ${progress}% — ${message}`);
688
+ }) as EventListener);
689
+
690
+ // Start initialization — progress events will fire throughout
691
+ await stats.init();
692
+ console.log('Ready!');
693
+ ```
694
+
695
+ ### Typical Progress Sequence
696
+
697
+ | Stage | Progress | Message |
698
+ |---|---|---|
699
+ | `pyodide` | 0 | Loading Pyodide runtime… |
700
+ | `pyodide` | 30 | Pyodide runtime loaded |
701
+ | `packages` | 40 | Installing pandas… |
702
+ | `packages` | 55 | Installing scipy… |
703
+ | `packages` | 70 | Installing statsmodels… |
704
+ | `packages` | 80 | Installing scikit-learn… |
705
+ | `packages` | 90 | Installing factor_analyzer… |
706
+ | `ready` | 100 | All packages installed. Ready. |
707
+
708
+ ---
709
+
710
+ ## License
711
+
712
+ [MIT](./LICENSE) © 2026 WinM2M