hvrt 2.1.1__tar.gz → 2.2.0.dev0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. hvrt-2.2.0.dev0/PKG-INFO +750 -0
  2. hvrt-2.2.0.dev0/README.md +709 -0
  3. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/pyproject.toml +4 -1
  4. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/__init__.py +5 -1
  5. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_base.py +65 -8
  6. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_params.py +5 -4
  7. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_partitioning.py +10 -5
  8. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/expand.py +6 -2
  9. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/generation_strategies.py +95 -6
  10. hvrt-2.2.0.dev0/src/hvrt/optimizer.py +478 -0
  11. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/reduce.py +21 -1
  12. hvrt-2.2.0.dev0/src/hvrt.egg-info/PKG-INFO +750 -0
  13. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt.egg-info/SOURCES.txt +2 -0
  14. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt.egg-info/requires.txt +3 -0
  15. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_expand.py +6 -1
  16. hvrt-2.2.0.dev0/tests/test_optimizer.py +131 -0
  17. hvrt-2.1.1/PKG-INFO +0 -460
  18. hvrt-2.1.1/README.md +0 -421
  19. hvrt-2.1.1/src/hvrt.egg-info/PKG-INFO +0 -460
  20. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/LICENSE +0 -0
  21. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/setup.cfg +0 -0
  22. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_budgets.py +0 -0
  23. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_preprocessing.py +0 -0
  24. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/_warnings.py +0 -0
  25. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/legacy/__init__.py +0 -0
  26. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/legacy/adaptive_reducer.py +0 -0
  27. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/legacy/sample_reduction.py +0 -0
  28. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/legacy/selection_strategies.py +0 -0
  29. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/model/__init__.py +0 -0
  30. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/model/fast_hvrt.py +0 -0
  31. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/model/hvrt.py +0 -0
  32. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/pipeline/__init__.py +0 -0
  33. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt/reduction_strategies.py +0 -0
  34. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt.egg-info/dependency_links.txt +0 -0
  35. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/src/hvrt.egg-info/top_level.txt +0 -0
  36. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_benchmarks.py +0 -0
  37. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_categorical_accuracy_retention.py +0 -0
  38. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_categorical_support.py +0 -0
  39. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_core.py +0 -0
  40. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_deterministic_scalability.py +0 -0
  41. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_reduce.py +0 -0
  42. {hvrt-2.1.1 → hvrt-2.2.0.dev0}/tests/test_selection_strategies.py +0 -0
Metadata-Version: 2.4
Name: hvrt
Version: 2.2.0.dev0
Summary: Hierarchical Variance-Retaining Transformer (HVRT) — variance-aware sample transformation for tabular data
Author-email: Jake Peace <mail@jakepeace.me>
License-Expression: MIT
Project-URL: Homepage, https://github.com/hotprotato/hvrt
Project-URL: Documentation, https://github.com/hotprotato/hvrt#readme
Project-URL: Repository, https://github.com/hotprotato/hvrt
Project-URL: Issues, https://github.com/hotprotato/hvrt/issues
Keywords: machine-learning,sample-reduction,synthetic-data,data-augmentation,data-preprocessing,variance,kde,tabular-data,heavy-tailed
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: scipy>=1.7.0
Provides-Extra: benchmarks
Requires-Dist: xgboost>=1.5; extra == "benchmarks"
Requires-Dist: matplotlib>=3.5; extra == "benchmarks"
Requires-Dist: pandas>=1.3; extra == "benchmarks"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Provides-Extra: optimizer
Requires-Dist: optuna>=3.0.0; extra == "optimizer"
Dynamic: license-file

# HVRT: Hierarchical Variance-Retaining Transformer

[![PyPI version](https://img.shields.io/pypi/v/hvrt.svg)](https://pypi.org/project/hvrt/)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Variance-aware sample transformation for tabular data: reduce, expand, or augment.

---

## Overview

HVRT partitions a dataset into variance-homogeneous regions via a decision tree fitted on a synthetic extremeness target, then applies a configurable per-partition operation (selection for reduction, sampling for expansion). The tree is fitted once; `reduce()`, `expand()`, and `augment()` all draw from the same fitted model.

| Operation | Method | Description |
|---|---|---|
| **Reduce** | `model.reduce(ratio=0.3)` | Select a geometrically diverse representative subset |
| **Expand** | `model.expand(n=50000)` | Generate synthetic samples via per-partition KDE or another strategy |
| **Augment** | `model.augment(n=15000)` | Concatenate the original data with synthetic samples |

---

## Algorithm

### 1. Z-score normalisation

```
X_z = (X - μ) / σ   per feature
```

Categorical features are integer-encoded, then z-scored.

### 2. Synthetic target construction

**HVRT** — sum of normalised pairwise feature interactions:
```
For all feature pairs (i, j):
  interaction = X_z[:,i] ⊙ X_z[:,j]
  normalised  = (interaction - mean) / std
target = sum of all normalised interaction columns    O(n · d²)
```

**FastHVRT** — sum of z-scores per sample:
```
target_i = Σ_j X_z[i, j]    O(n · d)
```

### 3. Partitioning

A `DecisionTreeRegressor` is fitted on the synthetic target; its leaves form variance-homogeneous partitions. Tree depth and leaf size are auto-tuned to the dataset size.
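
A minimal sketch of steps 1–3 for the FastHVRT variant, using plain numpy and scikit-learn (illustrative only; the library's auto-tuning and categorical handling are not reproduced here):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))               # toy data

X_z = (X - X.mean(axis=0)) / X.std(axis=0)  # step 1: z-score per feature
target = X_z.sum(axis=1)                    # step 2: FastHVRT target, O(n·d)

tree = DecisionTreeRegressor(max_leaf_nodes=20, min_samples_leaf=10, random_state=0)
tree.fit(X_z, target)                       # step 3: fit on the synthetic target
partition_ids = tree.apply(X_z)             # leaf id = partition id
```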

### 4. Per-partition operations

**Reduce:** Select representatives within each partition using the chosen [selection strategy](#selection-strategies). The per-partition budget is proportional to partition size (`variance_weighted=False`) or biased toward high-variance partitions (`variance_weighted=True`).

**Expand:** Draw synthetic samples within each partition using the chosen [generation strategy](#generation-strategies). Budget allocation follows the same logic, sketched below.
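
A hedged sketch of both allocation modes (the library's exact variance weighting may differ in detail):

```python
import numpy as np

def allocate_budgets(sizes, variances, total, variance_weighted=False):
    """Split `total` across partitions: proportional to size, or biased by variance."""
    sizes = np.asarray(sizes, dtype=float)
    weights = sizes * np.asarray(variances) if variance_weighted else sizes
    budgets = np.floor(total * weights / weights.sum()).astype(int)
    # hand the rounding remainder to the heaviest partitions
    budgets[np.argsort(-weights)[: total - budgets.sum()]] += 1
    return budgets

allocate_budgets([100, 50, 10], [0.5, 1.0, 3.0], total=32)                          # proportional
allocate_budgets([100, 50, 10], [0.5, 1.0, 3.0], total=32, variance_weighted=True)  # tail-biased
```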

---

## Installation

```bash
pip install hvrt
```

Or install from source:

```bash
git clone https://github.com/hotprotato/hvrt.git
cd hvrt
pip install -e .
```

---

## Quick Start

```python
from hvrt import HVRT, FastHVRT

# Fit once — reduce and expand from the same model
model = HVRT(random_state=42).fit(X_train, y_train)  # y is optional
X_reduced, idx = model.reduce(ratio=0.3, return_indices=True)
X_synthetic = model.expand(n=50000)
X_augmented = model.augment(n=15000)

# FastHVRT — O(n·d) target; preferred for expansion
model = FastHVRT(random_state=42).fit(X_train)
X_synthetic = model.expand(n=50000)
```

---

## API Reference

### `HVRT`

```python
from hvrt import HVRT

model = HVRT(
    n_partitions=None,      # Max tree leaves; auto-tuned if None
    min_samples_leaf=None,  # Min samples per leaf; auto-tuned if None
    y_weight=0.0,           # 0.0 = unsupervised; 1.0 = y drives splits
    bandwidth='auto',       # KDE bandwidth: 'auto' (default), float, 'scott', 'silverman'
    auto_tune=True,
    random_state=42,
    # Pipeline params (see the sklearn Pipeline section)
    reduce_params=None,
    expand_params=None,
    augment_params=None,
)
```

Target: sum of normalised pairwise feature interactions. O(n · d²). Preferred for reduction.

### `FastHVRT`

```python
from hvrt import FastHVRT

model = FastHVRT(bandwidth='auto', random_state=42)
```

Target: sum of z-scores. O(n · d). Quality is equivalent to HVRT's for expansion, and all constructor parameters are identical to HVRT's.

### `HVRTOptimizer`

Requires: `pip install hvrt[optimizer]`

```python
from hvrt import HVRTOptimizer

opt = HVRTOptimizer(
    n_trials=30,          # Optuna trials; use ≥50 in production
    n_jobs=1,             # Parallel trials (-1 = all cores)
    cv=3,                 # Cross-validation folds for the objective
    expansion_ratio=5.0,  # Synthetic-to-real ratio during evaluation
    task='auto',          # 'auto', 'regression', 'classification'
    timeout=None,         # Wall-clock time limit in seconds
    random_state=None,
    verbose=0,            # 0 = silent, 1 = Optuna trial progress
)
opt = opt.fit(X, y)  # y enables the TSTR Δ objective; required for classification
```

Performs TPE-based Bayesian optimisation over `n_partitions`, `min_samples_leaf`, `y_weight`, kernel/bandwidth, and `variance_weighted`. The HVRT defaults are always evaluated as trial 0 (a warm start), so HPO can only match or improve on the baseline.

**Post-fit attributes:**

| Attribute | Type | Description |
|---|---|---|
| `best_score_` | float | Best mean TSTR Δ across CV folds |
| `best_params_` | dict | Best constructor kwargs (`n_partitions`, `min_samples_leaf`, `y_weight`, `bandwidth`) |
| `best_expand_params_` | dict | Best expand kwargs (`variance_weighted`, optionally `generation_strategy`) |
| `best_model_` | HVRT | Refitted on the full dataset using `best_params_` |
| `study_` | optuna.Study | Full Optuna study for visualisation and diagnostics |

**After fitting:**

```python
opt = HVRTOptimizer(n_trials=50, n_jobs=4, cv=3, random_state=42).fit(X, y)
print(f'Best TSTR Δ: {opt.best_score_:+.4f}')
print(f'Best params: {opt.best_params_}')

X_synth = opt.expand(n=50000)      # y column stripped automatically
X_aug = opt.augment(n=len(X) * 5)  # originals + synthetic
```

`expand()` and `augment()` strip the appended y column, returning arrays with the same number of columns as the training X.

### `fit`

```python
model.fit(X, y=None, feature_types=None)
# feature_types: list of 'continuous' or 'categorical' per column
```
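
For mixed-type data, pass one entry per column, for instance (column layout hypothetical):

```python
# first two columns continuous, third categorical
model = HVRT(random_state=42).fit(X, feature_types=['continuous', 'continuous', 'categorical'])
```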

### `reduce`

```python
X_reduced = model.reduce(
    n=None,                  # Absolute target count
    ratio=None,              # Proportional (e.g. 0.3 = keep 30%)
    method='fps',            # Selection strategy; see Selection Strategies
    variance_weighted=True,  # Oversample high-variance partitions
    return_indices=False,
    n_partitions=None,       # Override tree granularity for this call only
)
```

### `expand`

```python
X_synth = model.expand(
    n=10000,
    variance_weighted=False,   # True = oversample tails
    bandwidth=None,            # Override instance bandwidth; accepts float, 'auto', 'scott'
    adaptive_bandwidth=False,  # Scale bandwidth with the local expansion ratio
    generation_strategy=None,  # See Generation Strategies
    return_novelty_stats=False,
    n_partitions=None,
)
```

`adaptive_bandwidth=True` uses the per-partition bandwidth `bw_p = scott_p × max(1, budget_p/n_p)^(1/d)`.
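
A quick worked example of that scaling, taking `scott_p` as the standard Scott factor `n_p^(-1/(d+4))` (the numbers are hypothetical):

```python
n_p, budget_p, d = 40, 200, 4                       # partition size, its synthetic budget, dimensionality
scott_p = n_p ** (-1 / (d + 4))                     # ≈ 0.63
bw_p = scott_p * max(1, budget_p / n_p) ** (1 / d)  # ≈ 0.63 × 5^0.25 ≈ 0.94: wider kernel when oversampling heavily
```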

### `augment`

```python
X_aug = model.augment(n=15000, variance_weighted=False)
# n must exceed len(X); returns the original X concatenated with (n - len(X)) synthetic samples
```

### Utility methods

```python
partitions = model.get_partitions()
# [{'id': 5, 'size': 120, 'mean_abs_z': 0.84, 'variance': 1.2}, ...]

novelty = model.compute_novelty(X_new)  # min z-space distance per point

params = HVRT.recommend_params(X)  # {'n_partitions': 180, ...}
```

---

## sklearn Pipeline

Operation parameters are declared at construction time via `ReduceParams`, `ExpandParams`, or `AugmentParams`. The tree is fitted once during `fit()`; `transform()` calls the corresponding operation.

```python
from hvrt import HVRT, FastHVRT, ReduceParams, ExpandParams, AugmentParams
from sklearn.pipeline import Pipeline

# Reduce
pipe = Pipeline([('hvrt', HVRT(reduce_params=ReduceParams(ratio=0.3)))])
X_red = pipe.fit_transform(X, y)

# Expand
pipe = Pipeline([('hvrt', FastHVRT(expand_params=ExpandParams(n=50000)))])
X_synth = pipe.fit_transform(X)

# Augment
pipe = Pipeline([('hvrt', HVRT(augment_params=AugmentParams(n=15000)))])
X_aug = pipe.fit_transform(X)
```

Alternatively, import from `hvrt.pipeline` to make the intent explicit:

```python
from hvrt.pipeline import HVRT, ReduceParams
```

### ReduceParams

```python
ReduceParams(
    n=None,
    ratio=None,               # e.g. 0.3
    method='fps',
    variance_weighted=True,
    return_indices=False,
    n_partitions=None,
)
```

### ExpandParams

```python
ExpandParams(
    n=50000,                  # required
    variance_weighted=False,
    bandwidth=None,
    adaptive_bandwidth=False,
    generation_strategy=None,
    return_novelty_stats=False,
    n_partitions=None,
)
```

### AugmentParams

```python
AugmentParams(
    n=15000,                  # required; must exceed len(X)
    variance_weighted=False,
    n_partitions=None,
)
```

---

## Generation Strategies

```python
from hvrt import FastHVRT, epanechnikov, univariate_kde_copula

model = FastHVRT(random_state=42).fit(X)

# By name
X_synth = model.expand(n=10000, generation_strategy='epanechnikov')

# By reference
X_synth = model.expand(n=10000, generation_strategy=univariate_kde_copula)

# Custom callable
def my_strategy(X_z, partition_ids, unique_partitions, budgets, random_state):
    ...
    return X_synthetic  # shape (sum(budgets), n_features), z-score space

X_synth = model.expand(n=10000, generation_strategy=my_strategy)
```
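
To make the callable contract concrete, here is a hypothetical strategy implementing a per-partition bootstrap with small Gaussian jitter. It assumes `random_state` arrives as an integer seed and that every partition is non-empty:

```python
import numpy as np

def jitter_strategy(X_z, partition_ids, unique_partitions, budgets, random_state):
    rng = np.random.default_rng(random_state)  # assumes an int seed
    out = []
    for pid, budget in zip(unique_partitions, budgets):
        members = X_z[partition_ids == pid]
        picks = members[rng.integers(0, len(members), size=budget)]   # bootstrap resample
        out.append(picks + rng.normal(scale=0.05, size=picks.shape))  # small jitter
    return np.vstack(out)  # shape (sum(budgets), n_features), z-score space

X_synth = model.expand(n=10000, generation_strategy=jitter_strategy)
```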

| Strategy | Behaviour | Notes |
|---|---|---|
| `'multivariate_kde'` | `scipy.stats.gaussian_kde` on all features jointly. Uses the instance `bandwidth`. | Captures the full joint covariance |
| `'epanechnikov'` | Product Epanechnikov kernel, Ahrens-Dieter sampling. Bounded support. | Recommended for classification and for ratios ≥5× |
| `'univariate_kde_copula'` | Per-feature 1-D KDE marginals + Gaussian copula. | More flexible per-feature marginals |
| `'bootstrap_noise'` | Resample with replacement + Gaussian noise at 10% of per-feature std. | Fastest; no distributional assumptions |

```python
from hvrt import BUILTIN_GENERATION_STRATEGIES
list(BUILTIN_GENERATION_STRATEGIES)
# ['multivariate_kde', 'univariate_kde_copula', 'bootstrap_noise', 'epanechnikov']
```

---

## Selection Strategies

```python
from hvrt import HVRT

model = HVRT(random_state=42).fit(X, y)

X_red = model.reduce(ratio=0.2, method='fps')  # default
X_red = model.reduce(ratio=0.2, method='medoid_fps')
X_red = model.reduce(ratio=0.2, method='variance_ordered')
X_red = model.reduce(ratio=0.2, method='stratified')

# Custom callable
def my_selector(X_z, partition_ids, unique_partitions, budgets, random_state):
    ...
    return selected_indices  # global indices into X

X_red = model.reduce(ratio=0.2, method=my_selector)
```
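
As a concrete (hypothetical) selector satisfying the same contract, a uniform random pick per partition, essentially re-implementing `'stratified'`:

```python
import numpy as np

def random_selector(X_z, partition_ids, unique_partitions, budgets, random_state):
    rng = np.random.default_rng(random_state)  # assumes an int seed
    selected = []
    for pid, budget in zip(unique_partitions, budgets):
        members = np.flatnonzero(partition_ids == pid)  # global indices into X
        take = min(budget, len(members))
        selected.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(selected)

X_red = model.reduce(ratio=0.2, method=random_selector)
```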

| Strategy | Behaviour |
|---|---|
| `'fps'` / `'centroid_fps'` | Greedy Furthest Point Sampling seeded at the partition centroid. **Default.** |
| `'medoid_fps'` | FPS seeded at the partition medoid. |
| `'variance_ordered'` | Select the samples with the highest local k-NN variance (k=10). |
| `'stratified'` | Random sample within each partition. |

---

## Recommendations

The recommendations below summarise findings from a systematic bandwidth and kernel benchmark across 6 datasets, 3 expansion ratios (2×/5×/10×), and 11 methods (see `benchmarks/bandwidth_benchmark.py` and `findings.md`).

### `bandwidth='auto'` — the default

`bandwidth='auto'` is the default and requires no tuning for most datasets. At each `expand()` call it inspects the fitted partition structure and picks the kernel most likely to produce high-quality synthetic data:

```python
model = HVRT().fit(X)            # bandwidth='auto' by default
X_synth = model.expand(n=50000)  # auto chooses at call time
```

**How it decides:**

At call time, `'auto'` computes the mean number of samples per partition and compares it against a feature-scaled threshold: `max(15, 2 × n_continuous_features)`.

| Condition | Chosen kernel | Reason |
|---|---|---|
| mean partition size **≥** threshold | Narrow Gaussian `h=0.1` | Enough samples for stable multivariate covariance estimation; a tight kernel stays within the partition geometry |
| mean partition size **<** threshold | Epanechnikov product kernel | Too few samples for a reliable covariance; the product kernel needs no covariance matrix, and its bounded support keeps samples within the local region |

The threshold scales with dimensionality because the minimum number of samples needed for a non-degenerate `d`-dimensional covariance matrix grows with `d`. At 5 features the threshold is 15; at 15 features it is 30.
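
The rule itself is compact. A sketch of the documented decision (illustrative, not the library's code):

```python
def choose_kernel(mean_partition_size, n_continuous_features):
    threshold = max(15, 2 * n_continuous_features)
    if mean_partition_size >= threshold:
        return 'gaussian', 0.1     # narrow Gaussian, h=0.1
    return 'epanechnikov', None    # covariance-free product kernel
```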

**Why not just always use one or the other:**

Benchmarking across 4 regression datasets showed a clean crossover depending on partition size. With the default auto-tuned partition count (typically 15–20 partitions at n=500), partitions hold ~25 samples and the narrow Gaussian wins on TSTR. But when partitions are finer — either because the dataset is large and the auto-tuner produces more leaves, or because `n_partitions` is manually increased — Gaussian KDE degrades as partitions become too small for stable covariance estimation, while Epanechnikov holds steady or improves. For example, on the housing dataset (d=6) at 10× expansion:

| Partition count | Gaussian `h=0.1` TSTR | Epanechnikov TSTR |
|---|---|---|
| auto (~18) | +0.004 | −0.014 |
| 50 | −0.033 | **−0.008** |
| 100 | −0.037 | **−0.011** |
| 200 | −0.080 | **−0.008** |

The crossover point depends on dimensionality: higher-dimensional datasets shift it earlier. On multimodal (d=10), Epanechnikov wins from 30 partitions onward (mean partition size ~13 at n=500). On housing (d=6) and emergence_divergence (d=5), the crossover is at ~50 partitions. This is because higher dimensionality makes a d×d covariance matrix harder to estimate stably from small samples, while Epanechnikov is always covariance-free.

`'auto'` captures this automatically: when you call `expand(n_partitions=200)`, `'auto'` sees the resulting small partition sizes and switches to Epanechnikov without any manual intervention.

**When to override `'auto'`:**

- **Heterogeneous or high-skew classification task (mean |skew| ≳ 0.8):** use `generation_strategy='epanechnikov'` directly — Epanechnikov wins consistently when within-partition data is non-Gaussian. On near-Gaussian classification data, `bandwidth='auto'` (`h=0.10`) or `adaptive_bandwidth=True` is competitive or better, particularly at 2×–5× expansion ratios.
- **Small dataset, coarse partitions, regression:** use `bandwidth=0.1` or `bandwidth=0.3` — an explicit narrow Gaussian, if you know partition sizes are large and correlation structure matters.
- **Diagnostics and ablations:** pass explicit values (`bandwidth=0.3`, `bandwidth='scott'`) to isolate the bandwidth effect.

### Why Scott's rule underperforms

Scott's rule is AMISE-optimal for iid Gaussian data. HVRT partitions, while locally more homogeneous than the global distribution, are not Gaussian enough for this to hold (mean |skewness| 0.49–1.37 across benchmark datasets). More importantly, the decision tree already captures the primary variance structure of each partition, so the residual within-partition variance is narrower than Scott's formula assumes. The result is systematic over-smoothing: synthetic samples bleed across partition boundaries and dilute the local density structure. Scott's rule won 0 of 18 benchmark conditions.
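
The scale of the over-smoothing is easy to check against the standard Scott factor `h = n^(-1/(d+4))` (a back-of-envelope comparison, not library code):

```python
n_p, d = 25, 6                   # a typical partition at the auto partition count
h_scott = n_p ** (-1 / (d + 4))  # ≈ 0.72 in z-score units
h_auto = 0.1                     # the narrow Gaussian 'auto' picks for large partitions
```

A kernel roughly seven times wider than the narrow default easily bleeds across partition boundaries.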

Wide bandwidths (≥ 0.75) are actively harmful: they produce synthetic data that degrades downstream ML models (TSTR Δ as low as −0.75 R²). Discriminator accuracy can paradoxically *improve* with wide bandwidths on regression — a metric artifact in which the spreading matches the marginals while destroying the joint structure. Use TSTR as the primary quality signal, not disc_err.

### Partition granularity

If `'auto'` is already in use, increasing `n_partitions` automatically triggers the switch to Epanechnikov once partition sizes fall below the threshold. You can also set the granularity explicitly:

```python
# Finer partitions — 'auto' will pick Epanechnikov when sizes drop below the threshold
model.expand(n=50000, n_partitions=150)

# Or fix it at construction time
model = HVRT(n_partitions=150, min_samples_leaf=10).fit(X)
```

Benchmark evidence (regression datasets, 5×/10× expansion ratios):

| Dataset (d) | Best TSTR at auto (~18 partitions) | Epanechnikov TSTR at 150 partitions |
|---|---|---|
| housing (d=6) | h=0.30: −0.001 | **−0.013** |
| multimodal (d=10) | h=0.30: +0.004 | **+0.001** |
| emergence_divergence (d=5) | h=0.10: +0.007 | **+0.004** |
| emergence_bifurcation (d=5) | h=0.10: −0.022 | **−0.118** |

Note: for the emergence_bifurcation dataset (where the same feature region maps to a bimodal target), all methods remain significantly negative at any partition count. This indicates a structural limit: if the same X values correspond to multiple distinct y outcomes, expansion without conditioning on y cannot reproduce that structure. In such cases, consider conditioning the expansion on y directly (e.g., expand class-conditional subsets separately).

### Hyperparameter optimisation (HPO)

Dataset heterogeneity is the primary driver of how sensitive synthetic quality is to HVRT's parameters. A well-behaved, near-Gaussian dataset with few sub-populations produces good synthetic data at the defaults, with little room to improve. A dataset with distinct clusters, non-linear interactions, or regime-switching needs finer partitions to achieve local homogeneity within each leaf — and the optimal settings are dataset-specific.

Benchmark evidence: on near-Gaussian data (fraud, and housing at the auto partition count), TSTR varied by less than 0.01 across all bandwidth candidates. On heterogeneous datasets (emergence_divergence, emergence_bifurcation), TSTR varied by up to 0.20+ between the best and worst methods at the same partition count. If your data is heterogeneous, HPO pays; if it is well-behaved, the defaults are sufficient.

**When HPO is worth running:**

- TSTR Δ is significantly negative on your downstream task (below −0.05 is a useful rule of thumb)
- Your dataset has known sub-populations, clusters, non-linear interactions, or regime changes (e.g., different dynamics at different feature values)
- You are generating at a high ratio (10×+), where compounding errors matter more

**Parameter search space:**

| Parameter | Default | Suggested search | Effect |
|---|---|---|---|
| `n_partitions` | auto | `None`, 20, 30, 50, 75, 100 | **Primary lever.** More partitions → finer local homogeneity. Start here. |
| `min_samples_leaf` | auto | 5, 10, 15, 20 | Controls the auto-tuner floor; lower allows finer splits when n is large. |
| `bandwidth` | `'auto'` | `'auto'`, 0.05, 0.10, 0.30, `epanechnikov` | `'auto'` is usually near-optimal once the partition count is right. |
| `variance_weighted` | `False` | `True`, `False` | `True` oversamples high-variance partitions; useful for tail-heavy distributions. |
| `y_weight` | 0.0 | 0.1, 0.3, 0.5 | Blends y into the synthetic target; helps when y governs sub-population identity. |

**Evaluation metric:** Use **TSTR Δ** (train-on-synthetic, test-on-real, minus the train-on-real baseline) as the HPO objective. Discriminator accuracy (`disc_err`) is structurally insensitive — wide bandwidths can lower it by spreading the marginals while destroying the joint structure. TSTR directly measures what matters: can a model trained on synthetic data perform as well as one trained on real data?

**Example HPO loop:**

Use `HVRTOptimizer` for automated Bayesian optimisation with Optuna (install the optional extra first: `pip install hvrt[optimizer]`):

```python
from hvrt import HVRTOptimizer

opt = HVRTOptimizer(n_trials=50, n_jobs=4, cv=3, random_state=42).fit(X, y)
print(f'Best TSTR Δ: {opt.best_score_:+.4f}')
print(f'Best params: {opt.best_params_}')

X_synth = opt.expand(n=50000)      # uses the tuned kernel + params
X_aug = opt.augment(n=len(X) * 5)  # originals + synthetic
```

`HVRTOptimizer` searches over `n_partitions`, `min_samples_leaf`, `y_weight`, kernel/bandwidth, and `variance_weighted` using TPE sampling, with TRTR pre-computed once to halve the GBM fitting overhead. The fitted `best_model_` is refitted on the full dataset after tuning.

For a custom objective or a manual grid search:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import numpy as np
from hvrt import HVRT

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

def tstr_delta(n_partitions, bandwidth, variance_weighted=False, seed=42):
    # Append y as a column so the expansion models the joint (X, y) distribution
    XY_tr = np.column_stack([X_tr, y_tr.reshape(-1, 1)])
    model = HVRT(n_partitions=n_partitions, bandwidth=bandwidth,
                 random_state=seed).fit(XY_tr)
    XY_s = model.expand(n=len(X_tr) * 5, variance_weighted=variance_weighted)
    X_s, y_s = XY_s[:, :-1], XY_s[:, -1]
    trtr = r2_score(y_te, GradientBoostingRegressor(
        random_state=seed).fit(X_tr, y_tr).predict(X_te))
    tstr = r2_score(y_te, GradientBoostingRegressor(
        random_state=seed).fit(X_s, y_s).predict(X_te))
    return tstr - trtr

best_score, best_cfg = float('-inf'), {}
for n_parts in [None, 30, 50, 100]:  # None = let auto-tune decide
    for bw in ['auto', 0.10, 0.30]:
        score = tstr_delta(n_partitions=n_parts, bandwidth=bw)
        if score > best_score:
            best_score, best_cfg = score, {'n_partitions': n_parts, 'bandwidth': bw}

print(f'Best TSTR Δ={best_score:+.4f} params={best_cfg}')
```

**Recommended tuning sequence:**

1. **Run with the defaults.** Establish a baseline TSTR Δ. If it is close to zero, stop.
2. **Sweep `n_partitions`.** This has the largest effect on heterogeneous data. Try `None` (auto), 20, 30, 50, 75, 100. More partitions only help when `n` is large enough — a rule of thumb is at least 10–15 real samples per partition.
3. **Check `bandwidth`.** With `'auto'`, HVRT already picks the right kernel for the resulting partition size. If you have prior knowledge (classification → prefer `'epanechnikov'`; regression with large partitions → prefer `0.10`), override it.
4. **Try `variance_weighted=True`** if your dataset has a long tail or rare events you want the expansion to oversample.
5. **If TSTR remains poor at any partition count**, the dataset likely has inherently unpredictable local structure (e.g., the same feature region maps to multiple distinct outcomes). Consider conditioning, as sketched below: split by `y` quantile or class and expand each subset independently.
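
A hedged sketch of that conditioning for a classification task, using only the documented `FastHVRT` API:

```python
import numpy as np
from hvrt import FastHVRT

X_parts, y_parts = [], []
for cls in np.unique(y):
    X_c = X[y == cls]                     # real samples of one class
    model_c = FastHVRT(random_state=42).fit(X_c)
    X_s = model_c.expand(n=len(X_c) * 5)  # expand the subset on its own
    X_parts.append(X_s)
    y_parts.append(np.full(len(X_s), cls))

X_synth, y_synth = np.vstack(X_parts), np.concatenate(y_parts)
```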

**What not to try:** Expanding synthetically and re-fitting HVRT on that output (a "two-phase pipeline") to manufacture fine partitions does not improve TSTR. Phase 1's Gaussian smoothing introduces distribution drift that Phase 2 amplifies, and the net TSTR is worse than single-phase at the auto partition count. Finer partitions must come from more *real* data.

---

## Benchmarks

### Sample reduction

Metric: GBM ROC-AUC on the reduced training set as a % of the full-training-set AUC. n=3,000 train / 2,000 test, seed=42.

| Scenario | Retention | HVRT-fps | HVRT-yw | Random | Stratified |
|---|---|---|---|---|---|
| Well-behaved (Gaussian, no noise) | 10% | 97.1% | 98.1% | 96.9% | 98.0% |
| Well-behaved (Gaussian, no noise) | 20% | 98.7% | 98.9% | 98.3% | 99.0% |
| Noisy labels (20% random flip) | 10% | **96.1%** | 91.1% | 93.3% | 90.4% |
| Noisy labels (20% random flip) | 20% | **95.2%** | 95.9% | 93.1% | 93.1% |
| Heavy-tail + label noise + junk features | 30% | **98.2%** | 98.2% | 94.3% | 95.2% |
| Rare events (5% positive class) | 10% | 98.0% | **99.4%** | 86.5% | 94.1% |
| Rare events (5% positive class) | 20% | 98.0% | **100.4%** | 97.9% | 99.0% |

*HVRT-fps: `method='fps'`, `variance_weighted=True`. HVRT-yw: the same plus `y_weight=0.3`.*

Reproduce: `python benchmarks/reduction_denoising_benchmark.py`

### Synthetic data expansion

Metrics: discriminator accuracy (target 50% = indistinguishable), marginal KS fidelity, and tail MSE. bandwidth=0.5, synthetic-to-real ratio 1×.

| Method | Marginal fidelity | Discriminator | Tail error | Fit time |
|---|---|---|---|---|
| **HVRT** | 0.974 | **49.6%** | **0.004** | 0.07 s |
| Gaussian Copula | 0.998 | 49.4% | 0.017 | 0.02 s |
| GMM (k=10) | 0.989 | 49.2% | 0.093 | 1.06 s |
| Bootstrap + Noise | 0.994 | 49.7% | 0.131 | 0.00 s |
| SMOTE | 1.000 | 48.6% | 0.000 | 0.00 s |
| CTGAN† | 0.920 | 55.8% | 0.500 | 45 s |
| TVAE† | 0.940 | 53.5% | 0.450 | 40 s |
| TabDDPM† | 0.960 | 52.0% | 0.300 | 120 s |
| MOSTLY AI† | 0.975 | 51.0% | 0.150 | 60 s |

*† Published numbers. Discriminator = 50% is ideal; tail error = 0 is ideal.*

Reproduce: `python benchmarks/run_benchmarks.py --tasks expand`

---

## Benchmarking Scripts

```bash
python benchmarks/run_benchmarks.py
python benchmarks/run_benchmarks.py --tasks reduce --datasets adult housing
python benchmarks/run_benchmarks.py --tasks expand
python benchmarks/reduction_denoising_benchmark.py
python benchmarks/adaptive_kde_benchmark.py
python benchmarks/adaptive_full_benchmark.py
python benchmarks/heart_disease_benchmark.py      # requires: pip install ctgan
python benchmarks/bootstrap_failure_benchmark.py
python benchmarks/hpo_benchmark.py                # HPO vs defaults, nested CV (requires: pip install hvrt[optimizer])
python benchmarks/hpo_benchmark.py --quick        # 3 datasets, 10 trials, fast mode
```

---

## Backward Compatibility

The v1 API is still importable:

```python
from hvrt import HVRTSampleReducer, AdaptiveHVRTReducer

reducer = HVRTSampleReducer(reduction_ratio=0.2, random_state=42)
X_reduced, y_reduced = reducer.fit_transform(X, y)
```

The `mode` constructor parameter is deprecated. Replace it with the params objects:

```python
# Deprecated
HVRT(mode='reduce')

# Replacement
HVRT(reduce_params=ReduceParams(ratio=0.3))
```

---

## Testing

```bash
pytest
pytest --cov=hvrt --cov-report=term-missing
```

---

## Citation

```bibtex
@software{hvrt2026,
  author = {Peace, Jake},
  title  = {HVRT: Hierarchical Variance-Retaining Transformer},
  year   = {2026},
  url    = {https://github.com/hotprotato/hvrt}
}
```

---

## License

MIT License — see [LICENSE](LICENSE).

## Acknowledgments

Development assisted by Claude (Anthropic).