hvrt 2.1.1__tar.gz → 2.2.0.dev0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. hvrt-2.2.0.dev0/PKG-INFO +750 -0
  2. hvrt-2.2.0.dev0/README.md +709 -0
  3. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/pyproject.toml +4 -1
  4. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/__init__.py +5 -1
  5. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_base.py +65 -8
  6. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_params.py +5 -4
  7. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_partitioning.py +10 -5
  8. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/expand.py +6 -2
  9. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/generation_strategies.py +95 -6
  10. hvrt-2.2.0.dev0/src/hvrt/optimizer.py +478 -0
  11. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/reduce.py +21 -1
  12. hvrt-2.2.0.dev0/src/hvrt.egg-info/PKG-INFO +750 -0
  13. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt.egg-info/SOURCES.txt +2 -0
  14. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt.egg-info/requires.txt +3 -0
  15. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_expand.py +6 -1
  16. hvrt-2.2.0.dev0/tests/test_optimizer.py +131 -0
  17. hvrt-2.1.1/PKG-INFO +0 -460
  18. hvrt-2.1.1/README.md +0 -421
  19. hvrt-2.1.1/src/hvrt.egg-info/PKG-INFO +0 -460
  20. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/LICENSE +0 -0
  21. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/setup.cfg +0 -0
  22. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_budgets.py +0 -0
  23. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_preprocessing.py +0 -0
  24. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_warnings.py +0 -0
  25. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/legacy/__init__.py +0 -0
  26. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/legacy/adaptive_reducer.py +0 -0
  27. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/legacy/sample_reduction.py +0 -0
  28. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/legacy/selection_strategies.py +0 -0
  29. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/model/__init__.py +0 -0
  30. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/model/fast_hvrt.py +0 -0
  31. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/model/hvrt.py +0 -0
  32. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/pipeline/__init__.py +0 -0
  33. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/reduction_strategies.py +0 -0
  34. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt.egg-info/dependency_links.txt +0 -0
  35. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt.egg-info/top_level.txt +0 -0
  36. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_benchmarks.py +0 -0
  37. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_categorical_accuracy_retention.py +0 -0
  38. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_categorical_support.py +0 -0
  39. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_core.py +0 -0
  40. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_deterministic_scalability.py +0 -0
  41. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_reduce.py +0 -0
  42. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_selection_strategies.py +0 -0
Metadata-Version: 2.4
Name: hvrt
Version: 2.2.0.dev0
Summary: Hierarchical Variance-Retaining Transformer (HVRT) — variance-aware sample transformation for tabular data
Author-email: Jake Peace <mail@jakepeace.me>
License-Expression: MIT
Project-URL: Homepage, https://github.com/hotprotato/hvrt
Project-URL: Documentation, https://github.com/hotprotato/hvrt#readme
Project-URL: Repository, https://github.com/hotprotato/hvrt
Project-URL: Issues, https://github.com/hotprotato/hvrt/issues
Keywords: machine-learning,sample-reduction,synthetic-data,data-augmentation,data-preprocessing,variance,kde,tabular-data,heavy-tailed
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: scipy>=1.7.0
Provides-Extra: benchmarks
Requires-Dist: xgboost>=1.5; extra == "benchmarks"
Requires-Dist: matplotlib>=3.5; extra == "benchmarks"
Requires-Dist: pandas>=1.3; extra == "benchmarks"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Provides-Extra: optimizer
Requires-Dist: optuna>=3.0.0; extra == "optimizer"
Dynamic: license-file

# HVRT: Hierarchical Variance-Retaining Transformer

[![PyPI version](https://img.shields.io/pypi/v/hvrt.svg)](https://pypi.org/project/hvrt/)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Variance-aware sample transformation for tabular data: reduce, expand, or augment.

---

## Overview

HVRT partitions a dataset into variance-homogeneous regions via a decision tree fitted on a synthetic extremeness target, then applies a configurable per-partition operation (selection for reduction, sampling for expansion). The tree is fitted once; `reduce()`, `expand()`, and `augment()` all draw from the same fitted model.

| Operation | Method | Description |
|---|---|---|
| **Reduce** | `model.reduce(ratio=0.3)` | Select a geometrically diverse representative subset |
| **Expand** | `model.expand(n=50000)` | Generate synthetic samples via per-partition KDE or another strategy |
| **Augment** | `model.augment(n=15000)` | Concatenate the original data with synthetic samples |

---

## Algorithm

### 1. Z-score normalisation

```
X_z = (X - μ) / σ   per feature
```

Categorical features are integer-encoded, then z-scored.

### 2. Synthetic target construction

**HVRT** — sum of normalised pairwise feature interactions:
```
For all feature pairs (i, j):
  interaction = X_z[:,i] ⊙ X_z[:,j]
  normalised  = (interaction - mean) / std
target = sum of all normalised interaction columns    O(n · d²)
```

**FastHVRT** — sum of z-scores per sample:
```
target_i = Σ_j X_z[i, j]    O(n · d)
```

### 3. Partitioning

A `DecisionTreeRegressor` is fitted on the synthetic target; its leaves form variance-homogeneous partitions. Tree depth and leaf size are auto-tuned to the dataset size.
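
A minimal sketch of steps 1–3 for the FastHVRT variant, using plain numpy and scikit-learn (illustrative only; the library's auto-tuning and categorical handling are not reproduced here):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))               # toy data

X_z = (X - X.mean(axis=0)) / X.std(axis=0)  # step 1: z-score per feature
target = X_z.sum(axis=1)                    # step 2: FastHVRT target, O(n·d)

tree = DecisionTreeRegressor(max_leaf_nodes=20, min_samples_leaf=10, random_state=0)
tree.fit(X_z, target)                       # step 3: fit on the synthetic target
partition_ids = tree.apply(X_z)             # leaf id = partition id
```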

### 4. Per-partition operations

**Reduce:** Select representatives within each partition using the chosen [selection strategy](#selection-strategies). The per-partition budget is proportional to partition size (`variance_weighted=False`) or biased toward high-variance partitions (`variance_weighted=True`).

**Expand:** Draw synthetic samples within each partition using the chosen [generation strategy](#generation-strategies). Budget allocation follows the same logic, sketched below.
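
A hedged sketch of both allocation modes (the library's exact variance weighting may differ in detail):

```python
import numpy as np

def allocate_budgets(sizes, variances, total, variance_weighted=False):
    """Split `total` across partitions: proportional to size, or biased by variance."""
    sizes = np.asarray(sizes, dtype=float)
    weights = sizes * np.asarray(variances) if variance_weighted else sizes
    budgets = np.floor(total * weights / weights.sum()).astype(int)
    # hand the rounding remainder to the heaviest partitions
    budgets[np.argsort(-weights)[: total - budgets.sum()]] += 1
    return budgets

allocate_budgets([100, 50, 10], [0.5, 1.0, 3.0], total=32)                          # proportional
allocate_budgets([100, 50, 10], [0.5, 1.0, 3.0], total=32, variance_weighted=True)  # tail-biased
```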

---

## Installation

```bash
pip install hvrt
```

Or install from source:

```bash
git clone https://github.com/hotprotato/hvrt.git
cd hvrt
pip install -e .
```

---

## Quick Start

```python
from hvrt import HVRT, FastHVRT

# Fit once — reduce and expand from the same model
model = HVRT(random_state=42).fit(X_train, y_train)  # y is optional
X_reduced, idx = model.reduce(ratio=0.3, return_indices=True)
X_synthetic = model.expand(n=50000)
X_augmented = model.augment(n=15000)

# FastHVRT — O(n·d) target; preferred for expansion
model = FastHVRT(random_state=42).fit(X_train)
X_synthetic = model.expand(n=50000)
```

---

## API Reference

### `HVRT`

```python
from hvrt import HVRT

model = HVRT(
    n_partitions=None,      # Max tree leaves; auto-tuned if None
    min_samples_leaf=None,  # Min samples per leaf; auto-tuned if None
    y_weight=0.0,           # 0.0 = unsupervised; 1.0 = y drives splits
    bandwidth='auto',       # KDE bandwidth: 'auto' (default), float, 'scott', 'silverman'
    auto_tune=True,
    random_state=42,
    # Pipeline params (see the sklearn Pipeline section)
    reduce_params=None,
    expand_params=None,
    augment_params=None,
)
```

Target: sum of normalised pairwise feature interactions. O(n · d²). Preferred for reduction.

### `FastHVRT`

```python
from hvrt import FastHVRT

model = FastHVRT(bandwidth='auto', random_state=42)
```

Target: sum of z-scores. O(n · d). Quality is equivalent to HVRT's for expansion, and all constructor parameters are identical to HVRT's.

### `HVRTOptimizer`

Requires: `pip install hvrt[optimizer]`

```python
from hvrt import HVRTOptimizer

opt = HVRTOptimizer(
    n_trials=30,          # Optuna trials; use ≥50 in production
    n_jobs=1,             # Parallel trials (-1 = all cores)
    cv=3,                 # Cross-validation folds for the objective
    expansion_ratio=5.0,  # Synthetic-to-real ratio during evaluation
    task='auto',          # 'auto', 'regression', 'classification'
    timeout=None,         # Wall-clock time limit in seconds
    random_state=None,
    verbose=0,            # 0 = silent, 1 = Optuna trial progress
)
opt = opt.fit(X, y)  # y enables the TSTR Δ objective; required for classification
```

Performs TPE-based Bayesian optimisation over `n_partitions`, `min_samples_leaf`, `y_weight`, kernel/bandwidth, and `variance_weighted`. The HVRT defaults are always evaluated as trial 0 (a warm start), so HPO can only match or improve on the baseline.

**Post-fit attributes:**

| Attribute | Type | Description |
|---|---|---|
| `best_score_` | float | Best mean TSTR Δ across CV folds |
| `best_params_` | dict | Best constructor kwargs (`n_partitions`, `min_samples_leaf`, `y_weight`, `bandwidth`) |
| `best_expand_params_` | dict | Best expand kwargs (`variance_weighted`, optionally `generation_strategy`) |
| `best_model_` | HVRT | Refitted on the full dataset using `best_params_` |
| `study_` | optuna.Study | Full Optuna study for visualisation and diagnostics |

**After fitting:**

```python
opt = HVRTOptimizer(n_trials=50, n_jobs=4, cv=3, random_state=42).fit(X, y)
print(f'Best TSTR Δ: {opt.best_score_:+.4f}')
print(f'Best params: {opt.best_params_}')

X_synth = opt.expand(n=50000)      # y column stripped automatically
X_aug = opt.augment(n=len(X) * 5)  # originals + synthetic
```

`expand()` and `augment()` strip the appended y column, returning arrays with the same number of columns as the training X.

### `fit`

```python
model.fit(X, y=None, feature_types=None)
# feature_types: list of 'continuous' or 'categorical' per column
```
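
For mixed-type data, pass one entry per column, for instance (column layout hypothetical):

```python
# first two columns continuous, third categorical
model = HVRT(random_state=42).fit(X, feature_types=['continuous', 'continuous', 'categorical'])
```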

### `reduce`

```python
X_reduced = model.reduce(
    n=None,                  # Absolute target count
    ratio=None,              # Proportional (e.g. 0.3 = keep 30%)
    method='fps',            # Selection strategy; see Selection Strategies
    variance_weighted=True,  # Oversample high-variance partitions
    return_indices=False,
    n_partitions=None,       # Override tree granularity for this call only
)
```

### `expand`

```python
X_synth = model.expand(
    n=10000,
    variance_weighted=False,   # True = oversample tails
    bandwidth=None,            # Override instance bandwidth; accepts float, 'auto', 'scott'
    adaptive_bandwidth=False,  # Scale bandwidth with the local expansion ratio
    generation_strategy=None,  # See Generation Strategies
    return_novelty_stats=False,
    n_partitions=None,
)
```

`adaptive_bandwidth=True` uses the per-partition bandwidth `bw_p = scott_p × max(1, budget_p/n_p)^(1/d)`.
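
A quick worked example of that scaling, taking `scott_p` as the standard Scott factor `n_p^(-1/(d+4))` (the numbers are hypothetical):

```python
n_p, budget_p, d = 40, 200, 4                       # partition size, its synthetic budget, dimensionality
scott_p = n_p ** (-1 / (d + 4))                     # ≈ 0.63
bw_p = scott_p * max(1, budget_p / n_p) ** (1 / d)  # ≈ 0.63 × 5^0.25 ≈ 0.94: wider kernel when oversampling heavily
```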

### `augment`

```python
X_aug = model.augment(n=15000, variance_weighted=False)
# n must exceed len(X); returns the original X concatenated with (n - len(X)) synthetic samples
```

### Utility methods

```python
partitions = model.get_partitions()
# [{'id': 5, 'size': 120, 'mean_abs_z': 0.84, 'variance': 1.2}, ...]

novelty = model.compute_novelty(X_new)  # min z-space distance per point

params = HVRT.recommend_params(X)  # {'n_partitions': 180, ...}
```

---

## sklearn Pipeline

Operation parameters are declared at construction time via `ReduceParams`, `ExpandParams`, or `AugmentParams`. The tree is fitted once during `fit()`; `transform()` calls the corresponding operation.

```python
from hvrt import HVRT, FastHVRT, ReduceParams, ExpandParams, AugmentParams
from sklearn.pipeline import Pipeline

# Reduce
pipe = Pipeline([('hvrt', HVRT(reduce_params=ReduceParams(ratio=0.3)))])
X_red = pipe.fit_transform(X, y)

# Expand
pipe = Pipeline([('hvrt', FastHVRT(expand_params=ExpandParams(n=50000)))])
X_synth = pipe.fit_transform(X)

# Augment
pipe = Pipeline([('hvrt', HVRT(augment_params=AugmentParams(n=15000)))])
X_aug = pipe.fit_transform(X)
```

Alternatively, import from `hvrt.pipeline` to make the intent explicit:

```python
from hvrt.pipeline import HVRT, ReduceParams
```

### ReduceParams

```python
ReduceParams(
    n=None,
    ratio=None,               # e.g. 0.3
    method='fps',
    variance_weighted=True,
    return_indices=False,
    n_partitions=None,
)
```

### ExpandParams

```python
ExpandParams(
    n=50000,                  # required
    variance_weighted=False,
    bandwidth=None,
    adaptive_bandwidth=False,
    generation_strategy=None,
    return_novelty_stats=False,
    n_partitions=None,
)
```

### AugmentParams

```python
AugmentParams(
    n=15000,                  # required; must exceed len(X)
    variance_weighted=False,
    n_partitions=None,
)
```

---

## Generation Strategies

```python
from hvrt import FastHVRT, epanechnikov, univariate_kde_copula

model = FastHVRT(random_state=42).fit(X)

# By name
X_synth = model.expand(n=10000, generation_strategy='epanechnikov')

# By reference
X_synth = model.expand(n=10000, generation_strategy=univariate_kde_copula)

# Custom callable
def my_strategy(X_z, partition_ids, unique_partitions, budgets, random_state):
    ...
    return X_synthetic  # shape (sum(budgets), n_features), z-score space

X_synth = model.expand(n=10000, generation_strategy=my_strategy)
```
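
To make the callable contract concrete, here is a hypothetical strategy implementing a per-partition bootstrap with small Gaussian jitter. It assumes `random_state` arrives as an integer seed and that every partition is non-empty:

```python
import numpy as np

def jitter_strategy(X_z, partition_ids, unique_partitions, budgets, random_state):
    rng = np.random.default_rng(random_state)  # assumes an int seed
    out = []
    for pid, budget in zip(unique_partitions, budgets):
        members = X_z[partition_ids == pid]
        picks = members[rng.integers(0, len(members), size=budget)]   # bootstrap resample
        out.append(picks + rng.normal(scale=0.05, size=picks.shape))  # small jitter
    return np.vstack(out)  # shape (sum(budgets), n_features), z-score space

X_synth = model.expand(n=10000, generation_strategy=jitter_strategy)
```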

| Strategy | Behaviour | Notes |
|---|---|---|
| `'multivariate_kde'` | `scipy.stats.gaussian_kde` on all features jointly. Uses the instance `bandwidth`. | Captures the full joint covariance |
| `'epanechnikov'` | Product Epanechnikov kernel, Ahrens-Dieter sampling. Bounded support. | Recommended for classification and for ratios ≥5× |
| `'univariate_kde_copula'` | Per-feature 1-D KDE marginals + Gaussian copula. | More flexible per-feature marginals |
| `'bootstrap_noise'` | Resample with replacement + Gaussian noise at 10% of per-feature std. | Fastest; no distributional assumptions |

```python
from hvrt import BUILTIN_GENERATION_STRATEGIES
list(BUILTIN_GENERATION_STRATEGIES)
# ['multivariate_kde', 'univariate_kde_copula', 'bootstrap_noise', 'epanechnikov']
```

---

## Selection Strategies

```python
from hvrt import HVRT

model = HVRT(random_state=42).fit(X, y)

X_red = model.reduce(ratio=0.2, method='fps')  # default
X_red = model.reduce(ratio=0.2, method='medoid_fps')
X_red = model.reduce(ratio=0.2, method='variance_ordered')
X_red = model.reduce(ratio=0.2, method='stratified')

# Custom callable
def my_selector(X_z, partition_ids, unique_partitions, budgets, random_state):
    ...
    return selected_indices  # global indices into X

X_red = model.reduce(ratio=0.2, method=my_selector)
```
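
As a concrete (hypothetical) selector satisfying the same contract, a uniform random pick per partition, essentially re-implementing `'stratified'`:

```python
import numpy as np

def random_selector(X_z, partition_ids, unique_partitions, budgets, random_state):
    rng = np.random.default_rng(random_state)  # assumes an int seed
    selected = []
    for pid, budget in zip(unique_partitions, budgets):
        members = np.flatnonzero(partition_ids == pid)  # global indices into X
        take = min(budget, len(members))
        selected.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(selected)

X_red = model.reduce(ratio=0.2, method=random_selector)
```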

| Strategy | Behaviour |
|---|---|
| `'fps'` / `'centroid_fps'` | Greedy Furthest Point Sampling seeded at the partition centroid. **Default.** |
| `'medoid_fps'` | FPS seeded at the partition medoid. |
| `'variance_ordered'` | Select the samples with the highest local k-NN variance (k=10). |
| `'stratified'` | Random sample within each partition. |

---

## Recommendations

The recommendations below summarise findings from a systematic bandwidth and kernel benchmark across 6 datasets, 3 expansion ratios (2×/5×/10×), and 11 methods (see `benchmarks/bandwidth_benchmark.py` and `findings.md`).

### `bandwidth='auto'` — the default

`bandwidth='auto'` is the default and requires no tuning for most datasets. At each `expand()` call it inspects the fitted partition structure and picks the kernel most likely to produce high-quality synthetic data:

```python
model = HVRT().fit(X)            # bandwidth='auto' by default
X_synth = model.expand(n=50000)  # auto chooses at call time
```

**How it decides:**

At call time, `'auto'` computes the mean number of samples per partition and compares it against a feature-scaled threshold: `max(15, 2 × n_continuous_features)`.

| Condition | Chosen kernel | Reason |
|---|---|---|
| mean partition size **≥** threshold | Narrow Gaussian `h=0.1` | Enough samples for stable multivariate covariance estimation; a tight kernel stays within the partition geometry |
| mean partition size **<** threshold | Epanechnikov product kernel | Too few samples for a reliable covariance; the product kernel needs no covariance matrix, and its bounded support keeps samples within the local region |

The threshold scales with dimensionality because the minimum number of samples needed for a non-degenerate `d`-dimensional covariance matrix grows with `d`. At 5 features the threshold is 15; at 15 features it is 30.
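
The rule itself is compact. A sketch of the documented decision (illustrative, not the library's code):

```python
def choose_kernel(mean_partition_size, n_continuous_features):
    threshold = max(15, 2 * n_continuous_features)
    if mean_partition_size >= threshold:
        return 'gaussian', 0.1     # narrow Gaussian, h=0.1
    return 'epanechnikov', None    # covariance-free product kernel
```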

**Why not just always use one or the other:**

Benchmarking across 4 regression datasets showed a clean crossover depending on partition size. With the default auto-tuned partition count (typically 15–20 partitions at n=500), partitions hold ~25 samples and the narrow Gaussian wins on TSTR. But when partitions are finer — either because the dataset is large and the auto-tuner produces more leaves, or because `n_partitions` is manually increased — Gaussian KDE degrades as partitions become too small for stable covariance estimation, while Epanechnikov holds steady or improves. For example, on the housing dataset (d=6) at 10× expansion:

| Partition count | Gaussian `h=0.1` TSTR | Epanechnikov TSTR |
|---|---|---|
| auto (~18) | +0.004 | −0.014 |
| 50 | −0.033 | **−0.008** |
| 100 | −0.037 | **−0.011** |
| 200 | −0.080 | **−0.008** |

The crossover point depends on dimensionality: higher-dimensional datasets shift it earlier. On multimodal (d=10), Epanechnikov wins from 30 partitions onward (mean partition size ~13 at n=500). On housing (d=6) and emergence_divergence (d=5), the crossover is at ~50 partitions. This is because higher dimensionality makes a d×d covariance matrix harder to estimate stably from small samples, while Epanechnikov is always covariance-free.

`'auto'` captures this automatically: when you call `expand(n_partitions=200)`, `'auto'` sees the resulting small partition sizes and switches to Epanechnikov without any manual intervention.

**When to override `'auto'`:**

- **Heterogeneous or high-skew classification task (mean |skew| ≳ 0.8):** use `generation_strategy='epanechnikov'` directly — Epanechnikov wins consistently when within-partition data is non-Gaussian. On near-Gaussian classification data, `bandwidth='auto'` (`h=0.10`) or `adaptive_bandwidth=True` is competitive or better, particularly at 2×–5× expansion ratios.
- **Small dataset, coarse partitions, regression:** use `bandwidth=0.1` or `bandwidth=0.3` — an explicit narrow Gaussian, if you know partition sizes are large and correlation structure matters.
- **Diagnostics and ablations:** pass explicit values (`bandwidth=0.3`, `bandwidth='scott'`) to isolate the bandwidth effect.

### Why Scott's rule underperforms

Scott's rule is AMISE-optimal for iid Gaussian data. HVRT partitions, while locally more homogeneous than the global distribution, are not Gaussian enough for this to hold (mean |skewness| 0.49–1.37 across benchmark datasets). More importantly, the decision tree already captures the primary variance structure of each partition, so the residual within-partition variance is narrower than Scott's formula assumes. The result is systematic over-smoothing: synthetic samples bleed across partition boundaries and dilute the local density structure. Scott's rule won 0 of 18 benchmark conditions.
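
The scale of the over-smoothing is easy to check against the standard Scott factor `h = n^(-1/(d+4))` (a back-of-envelope comparison, not library code):

```python
n_p, d = 25, 6                   # a typical partition at the auto partition count
h_scott = n_p ** (-1 / (d + 4))  # ≈ 0.72 in z-score units
h_auto = 0.1                     # the narrow Gaussian 'auto' picks for large partitions
```

A kernel roughly seven times wider than the narrow default easily bleeds across partition boundaries.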

Wide bandwidths (≥ 0.75) are actively harmful: they produce synthetic data that degrades downstream ML models (TSTR Δ as low as −0.75 R²). Discriminator accuracy can paradoxically *improve* with wide bandwidths on regression — a metric artifact in which the spreading matches the marginals while destroying the joint structure. Use TSTR as the primary quality signal, not disc_err.

### Partition granularity

If `'auto'` is already in use, increasing `n_partitions` automatically triggers the switch to Epanechnikov once partition sizes fall below the threshold. You can also set the granularity explicitly:

```python
# Finer partitions — 'auto' will pick Epanechnikov when sizes drop below the threshold
model.expand(n=50000, n_partitions=150)

# Or fix it at construction time
model = HVRT(n_partitions=150, min_samples_leaf=10).fit(X)
```

Benchmark evidence (regression datasets, 5×/10× expansion ratios):

| Dataset (d) | Best TSTR at auto (~18 partitions) | Epanechnikov TSTR at 150 partitions |
|---|---|---|
| housing (d=6) | h=0.30: −0.001 | **−0.013** |
| multimodal (d=10) | h=0.30: +0.004 | **+0.001** |
| emergence_divergence (d=5) | h=0.10: +0.007 | **+0.004** |
| emergence_bifurcation (d=5) | h=0.10: −0.022 | **−0.118** |

Note: for the emergence_bifurcation dataset (where the same feature region maps to a bimodal target), all methods remain significantly negative at any partition count. This indicates a structural limit: if the same X values correspond to multiple distinct y outcomes, expansion without conditioning on y cannot reproduce that structure. In such cases, consider conditioning the expansion on y directly (e.g., expand class-conditional subsets separately).

### Hyperparameter optimisation (HPO)

Dataset heterogeneity is the primary driver of how sensitive synthetic quality is to HVRT's parameters. A well-behaved, near-Gaussian dataset with few sub-populations produces good synthetic data at the defaults, with little room to improve. A dataset with distinct clusters, non-linear interactions, or regime-switching needs finer partitions to achieve local homogeneity within each leaf — and the optimal settings are dataset-specific.

Benchmark evidence: on near-Gaussian data (fraud, and housing at the auto partition count), TSTR varied by less than 0.01 across all bandwidth candidates. On heterogeneous datasets (emergence_divergence, emergence_bifurcation), TSTR varied by up to 0.20+ between the best and worst methods at the same partition count. If your data is heterogeneous, HPO pays; if it is well-behaved, the defaults are sufficient.

**When HPO is worth running:**

- TSTR Δ is significantly negative on your downstream task (below −0.05 is a useful rule of thumb)
- Your dataset has known sub-populations, clusters, non-linear interactions, or regime changes (e.g., different dynamics at different feature values)
- You are generating at a high ratio (10×+), where compounding errors matter more

**Parameter search space:**

| Parameter | Default | Suggested search | Effect |
|---|---|---|---|
| `n_partitions` | auto | `None`, 20, 30, 50, 75, 100 | **Primary lever.** More partitions → finer local homogeneity. Start here. |
| `min_samples_leaf` | auto | 5, 10, 15, 20 | Controls the auto-tuner floor; lower allows finer splits when n is large. |
| `bandwidth` | `'auto'` | `'auto'`, 0.05, 0.10, 0.30, `epanechnikov` | `'auto'` is usually near-optimal once the partition count is right. |
| `variance_weighted` | `False` | `True`, `False` | `True` oversamples high-variance partitions; useful for tail-heavy distributions. |
| `y_weight` | 0.0 | 0.1, 0.3, 0.5 | Blends y into the synthetic target; helps when y governs sub-population identity. |

**Evaluation metric:** Use **TSTR Δ** (train-on-synthetic, test-on-real, minus the train-on-real baseline) as the HPO objective. Discriminator accuracy (`disc_err`) is structurally insensitive — wide bandwidths can lower it by spreading the marginals while destroying the joint structure. TSTR directly measures what matters: can a model trained on synthetic data perform as well as one trained on real data?

**Example HPO loop:**

Use `HVRTOptimizer` for automated Bayesian optimisation with Optuna (install the optional extra first: `pip install hvrt[optimizer]`):

```python
from hvrt import HVRTOptimizer

opt = HVRTOptimizer(n_trials=50, n_jobs=4, cv=3, random_state=42).fit(X, y)
print(f'Best TSTR Δ: {opt.best_score_:+.4f}')
print(f'Best params: {opt.best_params_}')

X_synth = opt.expand(n=50000)      # uses the tuned kernel + params
X_aug = opt.augment(n=len(X) * 5)  # originals + synthetic
```

`HVRTOptimizer` searches over `n_partitions`, `min_samples_leaf`, `y_weight`, kernel/bandwidth, and `variance_weighted` using TPE sampling, with TRTR pre-computed once to halve the GBM fitting overhead. The fitted `best_model_` is refitted on the full dataset after tuning.

For a custom objective or a manual grid search:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import numpy as np
from hvrt import HVRT

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

def tstr_delta(n_partitions, bandwidth, variance_weighted=False, seed=42):
    # Append y as a column so the expansion models the joint (X, y) distribution
    XY_tr = np.column_stack([X_tr, y_tr.reshape(-1, 1)])
    model = HVRT(n_partitions=n_partitions, bandwidth=bandwidth,
                 random_state=seed).fit(XY_tr)
    XY_s = model.expand(n=len(X_tr) * 5, variance_weighted=variance_weighted)
    X_s, y_s = XY_s[:, :-1], XY_s[:, -1]
    trtr = r2_score(y_te, GradientBoostingRegressor(
        random_state=seed).fit(X_tr, y_tr).predict(X_te))
    tstr = r2_score(y_te, GradientBoostingRegressor(
        random_state=seed).fit(X_s, y_s).predict(X_te))
    return tstr - trtr

best_score, best_cfg = float('-inf'), {}
for n_parts in [None, 30, 50, 100]:  # None = let auto-tune decide
    for bw in ['auto', 0.10, 0.30]:
        score = tstr_delta(n_partitions=n_parts, bandwidth=bw)
        if score > best_score:
            best_score, best_cfg = score, {'n_partitions': n_parts, 'bandwidth': bw}

print(f'Best TSTR Δ={best_score:+.4f} params={best_cfg}')
```

**Recommended tuning sequence:**

1. **Run with the defaults.** Establish a baseline TSTR Δ. If it is close to zero, stop.
2. **Sweep `n_partitions`.** This has the largest effect on heterogeneous data. Try `None` (auto), 20, 30, 50, 75, 100. More partitions only help when `n` is large enough — a rule of thumb is at least 10–15 real samples per partition.
3. **Check `bandwidth`.** With `'auto'`, HVRT already picks the right kernel for the resulting partition size. If you have prior knowledge (classification → prefer `'epanechnikov'`; regression with large partitions → prefer `0.10`), override it.
4. **Try `variance_weighted=True`** if your dataset has a long tail or rare events you want the expansion to oversample.
5. **If TSTR remains poor at any partition count**, the dataset likely has inherently unpredictable local structure (e.g., the same feature region maps to multiple distinct outcomes). Consider conditioning, as sketched below: split by `y` quantile or class and expand each subset independently.
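
A hedged sketch of that conditioning for a classification task, using only the documented `FastHVRT` API:

```python
import numpy as np
from hvrt import FastHVRT

X_parts, y_parts = [], []
for cls in np.unique(y):
    X_c = X[y == cls]                     # real samples of one class
    model_c = FastHVRT(random_state=42).fit(X_c)
    X_s = model_c.expand(n=len(X_c) * 5)  # expand the subset on its own
    X_parts.append(X_s)
    y_parts.append(np.full(len(X_s), cls))

X_synth, y_synth = np.vstack(X_parts), np.concatenate(y_parts)
```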

**What not to try:** Expanding synthetically and re-fitting HVRT on that output (a "two-phase pipeline") to manufacture fine partitions does not improve TSTR. Phase 1's Gaussian smoothing introduces distribution drift that Phase 2 amplifies, and the net TSTR is worse than single-phase at the auto partition count. Finer partitions must come from more *real* data.

---

## Benchmarks

### Sample reduction

Metric: GBM ROC-AUC on the reduced training set as a % of the full-training-set AUC. n=3,000 train / 2,000 test, seed=42.

| Scenario | Retention | HVRT-fps | HVRT-yw | Random | Stratified |
|---|---|---|---|---|---|
| Well-behaved (Gaussian, no noise) | 10% | 97.1% | 98.1% | 96.9% | 98.0% |
| Well-behaved (Gaussian, no noise) | 20% | 98.7% | 98.9% | 98.3% | 99.0% |
| Noisy labels (20% random flip) | 10% | **96.1%** | 91.1% | 93.3% | 90.4% |
| Noisy labels (20% random flip) | 20% | **95.2%** | 95.9% | 93.1% | 93.1% |
| Heavy-tail + label noise + junk features | 30% | **98.2%** | 98.2% | 94.3% | 95.2% |
| Rare events (5% positive class) | 10% | 98.0% | **99.4%** | 86.5% | 94.1% |
| Rare events (5% positive class) | 20% | 98.0% | **100.4%** | 97.9% | 99.0% |

*HVRT-fps: `method='fps'`, `variance_weighted=True`. HVRT-yw: the same plus `y_weight=0.3`.*

Reproduce: `python benchmarks/reduction_denoising_benchmark.py`

### Synthetic data expansion

Metrics: discriminator accuracy (target 50% = indistinguishable), marginal KS fidelity, and tail MSE. bandwidth=0.5, synthetic-to-real ratio 1×.

| Method | Marginal fidelity | Discriminator | Tail error | Fit time |
|---|---|---|---|---|
| **HVRT** | 0.974 | **49.6%** | **0.004** | 0.07 s |
| Gaussian Copula | 0.998 | 49.4% | 0.017 | 0.02 s |
| GMM (k=10) | 0.989 | 49.2% | 0.093 | 1.06 s |
| Bootstrap + Noise | 0.994 | 49.7% | 0.131 | 0.00 s |
| SMOTE | 1.000 | 48.6% | 0.000 | 0.00 s |
| CTGAN† | 0.920 | 55.8% | 0.500 | 45 s |
| TVAE† | 0.940 | 53.5% | 0.450 | 40 s |
| TabDDPM† | 0.960 | 52.0% | 0.300 | 120 s |
| MOSTLY AI† | 0.975 | 51.0% | 0.150 | 60 s |

*† Published numbers. Discriminator = 50% is ideal; tail error = 0 is ideal.*

Reproduce: `python benchmarks/run_benchmarks.py --tasks expand`

---

## Benchmarking Scripts

```bash
python benchmarks/run_benchmarks.py
python benchmarks/run_benchmarks.py --tasks reduce --datasets adult housing
python benchmarks/run_benchmarks.py --tasks expand
python benchmarks/reduction_denoising_benchmark.py
python benchmarks/adaptive_kde_benchmark.py
python benchmarks/adaptive_full_benchmark.py
python benchmarks/heart_disease_benchmark.py      # requires: pip install ctgan
python benchmarks/bootstrap_failure_benchmark.py
python benchmarks/hpo_benchmark.py                # HPO vs defaults, nested CV (requires: pip install hvrt[optimizer])
python benchmarks/hpo_benchmark.py --quick        # 3 datasets, 10 trials, fast mode
```

---

## Backward Compatibility

The v1 API is still importable:

```python
from hvrt import HVRTSampleReducer, AdaptiveHVRTReducer

reducer = HVRTSampleReducer(reduction_ratio=0.2, random_state=42)
X_reduced, y_reduced = reducer.fit_transform(X, y)
```

The `mode` constructor parameter is deprecated. Replace it with the params objects:

```python
# Deprecated
HVRT(mode='reduce')

# Replacement
HVRT(reduce_params=ReduceParams(ratio=0.3))
```

---

## Testing

```bash
pytest
pytest --cov=hvrt --cov-report=term-missing
```

---

## Citation

```bibtex
@software{hvrt2026,
  author = {Peace, Jake},
  title  = {HVRT: Hierarchical Variance-Retaining Transformer},
  year   = {2026},
  url    = {https://github.com/hotprotato/hvrt}
}
```

---

## License

MIT License — see [LICENSE](LICENSE).

## Acknowledgments

Development assisted by Claude (Anthropic).