hvrt 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
hvrt-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Jake Peace

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
hvrt-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,411 @@
Metadata-Version: 2.4
Name: hvrt
Version: 0.1.0
Summary: Hierarchical Variance Reduction Tree (H-VRT) for intelligent sample reduction
Author-email: Jake Peace <mail@jakepeace.me>
License-Expression: MIT
Project-URL: Homepage, https://github.com/hotprotato/hvrt
Project-URL: Documentation, https://github.com/hotprotato/hvrt#readme
Project-URL: Repository, https://github.com/hotprotato/hvrt
Project-URL: Issues, https://github.com/hotprotato/hvrt/issues
Keywords: machine-learning,sample-reduction,data-preprocessing,svm,variance,heavy-tailed
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20.0
Requires-Dist: scikit-learn>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Dynamic: license-file

# H-VRT: Hierarchical Variance Reduction Tree

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**H-VRT** is a deterministic, variance-based sample reduction method that intelligently selects training subsets while preserving predictive accuracy.

## Why H-VRT?

Unlike random sampling, which treats all samples equally, H-VRT optimizes for **explained variance preservation** through:

- **Hierarchical partitioning** based on pairwise feature interactions
- **Diversity-based selection** via Furthest-Point Sampling (FPS)
- **100% deterministic** - same data → same subset every time
- **Hybrid mode** for heavy-tailed data and rare events

## Installation

```bash
pip install hvrt
```

Or install from source:

```bash
git clone https://github.com/hotprotato/hvrt.git
cd hvrt
pip install -e .
```

## Quick Start

```python
from hvrt import HVRTSampleReducer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Load your data
X, y = load_your_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Reduce training set to 20% of original size
reducer = HVRTSampleReducer(reduction_ratio=0.2, random_state=42)
X_train_reduced, y_train_reduced = reducer.fit_transform(X_train, y_train)

# Train any model on reduced data
model = RandomForestRegressor()
model.fit(X_train_reduced, y_train_reduced)
predictions = model.predict(X_test)
```

## When to Use H-VRT

### ✅ H-VRT Excels When

- **Regulatory/audit requirements** - 100% reproducible sample selection
- **Heavy-tailed distributions** - Financial data, extreme events, rare outliers
- **SVM training** - Makes large-scale SVM practical (24-38x speedup)
- **Aggressive reduction needed** - 5-20% retention where every sample counts
- **Small-to-medium datasets** - Up to 50k samples

### ⚠️ Random Sampling May Suffice When

- **Large, well-behaved datasets** (≥50k samples, normal distributions)
- **Modest reduction** (≥50% retention)
- No interpretability or determinism requirements

### ❌ Avoid H-VRT For

- **Distance-based clustering tasks** (K-Means, DBSCAN)
- Very small datasets (n < 1000)

## Key Features

### 1. Deterministic Selection

```python
import numpy as np

# Same random_state → identical samples every time
reducer = HVRTSampleReducer(reduction_ratio=0.2, random_state=42)
X_reduced1, _ = reducer.fit_transform(X, y)

reducer2 = HVRTSampleReducer(reduction_ratio=0.2, random_state=42)
X_reduced2, _ = reducer2.fit_transform(X, y)

assert np.array_equal(X_reduced1, X_reduced2)  # ✓ Always True
```

### 2. Hybrid Mode for Heavy Tails

```python
# For heavy-tailed data or rare events
reducer = HVRTSampleReducer(
    reduction_ratio=0.2,
    y_weight=0.25,  # 25% weight on y-extremeness
    random_state=42
)
X_reduced, y_reduced = reducer.fit_transform(X, y)
```

**When to use `y_weight`:**
- `0.0` (default): Well-behaved data, interaction-driven variance
- `0.25-0.50`: Heavy-tailed distributions, rare events
- `0.50-1.0`: Extreme outlier detection

### 3. SVM Speedup Example

```python
from sklearn.svm import SVR
import time

# Without H-VRT: 30 minutes at 50k samples
start = time.time()
svm = SVR()
svm.fit(X_train, y_train)  # 50k samples
print(f"Training time: {time.time() - start:.1f}s")  # ~1800s

# With H-VRT: 47 seconds (38x faster!)
reducer = HVRTSampleReducer(reduction_ratio=0.2)
X_reduced, y_reduced = reducer.fit_transform(X_train, y_train)  # 10k samples

start = time.time()
svm.fit(X_reduced, y_reduced)
print(f"Training time: {time.time() - start:.1f}s")  # ~47s
```

## API Reference

### AdaptiveHVRTReducer (Recommended)

**Automatically finds optimal reduction level via accuracy testing.**

```python
from hvrt import AdaptiveHVRTReducer

reducer = AdaptiveHVRTReducer(
    accuracy_threshold=0.95,  # Min accuracy retention (95%)
    reduction_ratios=[0.5, 0.3, 0.2, 0.1],  # Levels to test
    validator=None,  # Auto: XGBoost (fast validation)
    scoring='accuracy',  # Metric: accuracy, f1, r2, custom, dict
    cv=3,  # Cross-validation folds
    y_weight=0.0,  # Hybrid mode (0.25 for heavy tails)
    random_state=42
)

X_reduced, y_reduced = reducer.fit_transform(X, y)

# Review all tested reductions
print(reducer.get_reduction_summary())
for result in reducer.reduction_results_:
    print(f"{result['reduction_ratio']}: {result['accuracy_retention']:.1%}")

# Multiple metrics example
reducer_multi = AdaptiveHVRTReducer(
    accuracy_threshold=0.95,
    scoring={'accuracy': 'accuracy', 'f1': 'f1', 'recall': 'recall'}
)
reducer_multi.fit(X, y)
print(reducer_multi.best_reduction_['all_scores'])
```

**Scoring Options:**

Built-in metrics (str):
- Classification: `'accuracy'`, `'f1'`, `'precision'`, `'recall'`, `'roc_auc'`
- Regression: `'r2'`, `'neg_mean_absolute_error'`, `'neg_mean_squared_error'`

Custom scorer (callable):
```python
import numpy as np

def custom_scorer(y_true, y_pred):
    # Any callable works; it must return a score where higher is better.
    # Example: negative mean absolute error
    return -np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```

Multiple metrics (dict):
```python
scoring={'acc': 'accuracy', 'f1': 'f1'}  # First key is primary
```

**Use Cases:**
- Unknown optimal reduction level
- SVM training (use XGBoost validation, get samples for SVM)
- Need accuracy guarantees with specific metrics

**Methods:**
- `fit(X, y)`: Test reductions, find best
- `transform()`: Return best reduced dataset
- `get_reduction_summary()`: Human-readable results table

**Attributes:**
- `reduction_results_`: List of all tested reductions
- `best_reduction_`: Optimal reduction meeting threshold
- `baseline_score_`: Baseline accuracy

---

### HVRTSampleReducer (Manual)

**Direct reduction when you know the ratio.**

```python
from hvrt import HVRTSampleReducer

reducer = HVRTSampleReducer(
    reduction_ratio=0.2,  # Target retention (0.2 = keep 20%)
    y_weight=0.0,  # Hybrid mode weight (0.0-1.0)
    max_leaf_nodes=None,  # Tree partitions (auto-tuned if None)
    min_samples_leaf=None,  # Min samples per partition (auto-tuned)
    auto_tune=True,  # Enable automatic hyperparameter tuning
    random_state=42  # Random seed for reproducibility
)

X_reduced, y_reduced = reducer.fit_transform(X, y)
```

**Methods:**
- `fit(X, y)`: Learn variance-based partitioning
- `transform(X, y=None)`: Return reduced samples
- `fit_transform(X, y)`: Fit and transform in one step
- `get_reduction_info()`: Get partitioning statistics

**Attributes:**
- `selected_indices_`: Indices of selected samples
- `n_partitions_`: Number of partitions created
- `tree_`: Fitted DecisionTreeRegressor

## Examples

See the [`examples/`](examples/) directory for complete demonstrations:

- [`basic_usage.py`](examples/basic_usage.py) - Simple 10-line example
- [`adaptive_reduction.py`](examples/adaptive_reduction.py) - **NEW:** Automatic reduction level selection
- [`adaptive_scoring_options.py`](examples/adaptive_scoring_options.py) - **NEW:** Custom metrics (MAE, F1, callable)
- [`svm_speedup_demo.py`](examples/svm_speedup_demo.py) - SVM training speedup
- [`heavy_tailed_data.py`](examples/heavy_tailed_data.py) - Hybrid mode for rare events
- [`regulatory_compliance.py`](examples/regulatory_compliance.py) - Determinism for regulated industries

Run any example:
```bash
python examples/adaptive_reduction.py  # Automatic optimal reduction
python examples/adaptive_scoring_options.py  # Custom scoring metrics
python examples/basic_usage.py  # Simple manual reduction
```

## Use Cases

### 1. Regulatory Compliance (Healthcare, Finance)

```python
# FDA submission requires reproducible model training
reducer = HVRTSampleReducer(reduction_ratio=0.3, random_state=42)
X_reduced, y_reduced = reducer.fit_transform(X_train, y_train)

# Audit trail: Decision tree shows why samples were selected
info = reducer.get_reduction_info()
print(f"Partitions: {info['n_partitions']}, Tree depth: {info['tree_depth']}")
```

### 2. Financial Data (Heavy Tails)

```python
# Rare extreme events (market crashes, fraud)
reducer = HVRTSampleReducer(
    reduction_ratio=0.2,
    y_weight=0.5,  # Prioritize extreme outcomes
    random_state=42
)
X_reduced, y_reduced = reducer.fit_transform(X_returns, y_volatility)
```

### 3. Hyperparameter Tuning

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Reduce data once, then reuse for all 100+ grid search trials
reducer = HVRTSampleReducer(reduction_ratio=0.2, random_state=42)
X_reduced, y_reduced = reducer.fit_transform(X_train, y_train)

# Grid search on reduced data (10-50x faster)
param_grid = {'C': [0.1, 1, 10], 'epsilon': [0.01, 0.1]}  # example grid
grid = GridSearchCV(SVR(), param_grid, cv=5)
grid.fit(X_reduced, y_reduced)
```

## How It Works

1. **Synthetic Target Construction**
   ```python
   y_synthetic = sum(pairwise_interactions(X)) + α × |y - median(y)|
   ```
   - Captures feature interaction patterns
   - Optional y-extremeness weighting for outliers

2. **Hierarchical Partitioning**
   - Decision tree partitions data by synthetic target
   - Captures variance heterogeneity across feature space
   - Auto-tunes partition count based on dataset size

3. **Diversity Selection**
   - Within each partition: Furthest-Point Sampling (FPS)
   - Removes density, preserves boundaries
   - Centroid-seeded for determinism
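
The selection step above can be sketched in a few lines. The helper below is an illustrative, hypothetical re-implementation of centroid-seeded Furthest-Point Sampling (step 3), not the library's internal code:

```python
import numpy as np

def furthest_point_sampling(X, n_select):
    """Illustrative centroid-seeded FPS sketch.

    Greedily picks n_select rows of X, each maximising its distance
    to the rows already chosen. Seeding at the sample nearest the
    centroid (rather than a random start) makes the result deterministic.
    """
    X = np.asarray(X, dtype=float)
    # Deterministic seed: the sample closest to the centroid
    centroid = X.mean(axis=0)
    selected = [int(np.argmin(np.linalg.norm(X - centroid, axis=1)))]
    # min_dist[i] = distance from row i to its nearest selected row
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))  # furthest from the current subset
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)
```

Run per partition, this thins dense regions while keeping boundary points, and rerunning it on the same partition returns the same indices, which is what makes the overall selection reproducible.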

## Performance

Results from validation experiments with SVM feasibility testing on 10k samples:

| Scenario | Sample Retention | H-VRT Accuracy Retention | Random Accuracy Retention | Speedup | SNR Retention |
|----------|------------------|--------------------------|---------------------------|---------|---------------|
| Well-behaved | 20% | 93.9% | 95.3% | 23.5x | **126.2%** |
| Heavy-tailed | 20% | **106.6%** | 85.3% | 24.0x | **130.1%** |

**Key Findings:**
- **Well-behaved data:** Both methods work (CLT holds); random is slightly better on accuracy
- **Heavy-tailed data:** H-VRT achieves a +21pp accuracy advantage via intelligent noise filtering
- **SNR (Signal-to-Noise Ratio):** H-VRT improves data quality by 26-30% vs baseline
- **SVM Speedup:** 24-38x training time reduction at scale (50k samples)

**Why >100% accuracy?** H-VRT acts as an intelligent denoiser, removing low-signal samples and improving SNR by 30%, which leads to better generalization.

### Experimental Data

All experimental results are included in this repository for full transparency and reproducibility.

**📊 Primary Validation Study:**
- [`archive/experimental/results/svm_pilot/pilot_results_with_snr.json`](archive/experimental/results/svm_pilot/pilot_results_with_snr.json) - Complete SVM pilot data with SNR measurements (15 trials, 10k samples)

**📄 Analysis & Documentation:**
- [`archive/docs/SVM_PILOT_SNR_ANALYSIS.md`](archive/docs/SVM_PILOT_SNR_ANALYSIS.md) - Detailed SNR analysis explaining >100% accuracy
- [`archive/docs/SVM_PERFORMANCE_ANALYSIS.md`](archive/docs/SVM_PERFORMANCE_ANALYSIS.md) - SVM feasibility study (speedup, accuracy)

**🔬 Reproduce Results:**
```bash
cd archive/experimental/experiments
python exp_svm_pilot_with_snr.py  # ~30 seconds
```

**📑 Complete Archive:**
See [`archive/ARCHIVE_INDEX.md`](archive/ARCHIVE_INDEX.md) for the complete experimental data catalog.

## Contributing

Contributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## Testing

```bash
# Run tests
pytest

# With coverage
pytest --cov=hvrt --cov-report=term-missing
```

## Citation

If you use H-VRT in your research, please cite:

```bibtex
@software{hvrt2025,
  author = {Peace, Jake},
  title = {H-VRT: Hierarchical Variance Reduction Tree Sample Reduction},
  year = {2025},
  url = {https://github.com/hotprotato/hvrt}
}
```

## License

MIT License - see [LICENSE](LICENSE) file.

## Acknowledgments

Development assisted by Claude (Anthropic) for rapid prototyping and conceptual refinement.

## Related Work

- **Random Sampling**: Simple but fails on heavy-tailed data
- **CoreSet Methods**: Require distance metrics (not suitable for all data)
- **Active Learning**: Requires iterative labeling (different use case)
- **H-VRT**: Deterministic, variance-based, works without distance metrics