pkboost 0.1.1__cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl


pkboost/__init__.py ADDED
@@ -0,0 +1,5 @@
+ from .pkboost import *
+
+ __doc__ = pkboost.__doc__
+ if hasattr(pkboost, "__all__"):
+     __all__ = pkboost.__all__

pkboost-0.1.1.dist-info/METADATA ADDED
@@ -0,0 +1,370 @@
+ Metadata-Version: 2.4
+ Name: pkboost
+ Version: 0.1.1
+ Classifier: Programming Language :: Rust
+ Classifier: Programming Language :: Python :: Implementation :: CPython
+ Classifier: Programming Language :: Python :: 3.8
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Requires-Dist: numpy>=2.3.4
+ Summary: Gradient boosting that adapts to concept drift in imbalanced data
+ Author-email: Pushp Kharat <kharatpushp16@outlook.com>
+ Requires-Python: >=3.8
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+ Project-URL: Homepage, https://github.com/Pushp-Kharat1/pkboost
+ Project-URL: Repository, https://github.com/Pushp-Kharat1/pkboost
+
+ # PKBoost
+
+ **Gradient boosting that adapts to concept drift in imbalanced data.**
+
+ Built from scratch in Rust, PKBoost handles shifting data distributions in fraud detection at a 0.2% fraud rate, degrading less than 2% under drift, where XGBoost drops 31.8% and LightGBM 42.5%. Without drift, it outperforms XGBoost by 10-18% on standard datasets. It combines information theory (Shannon entropy) with Newton-Raphson boosting to detect shifts in rare events and trigger an adaptive "metamorphosis" for real-time recovery.
+
+ > **"Most boosting libraries overlook concept drift. PKBoost identifies it and evolves to persist."**
+
+ **Perfect for:** streaming fraud detection, real-time medical monitoring, anomaly detection in changing environments, or any scenario where data evolves over time and positive instances are rare.
+
+ ---
+
+ ## 🚀 Quick Start
+
+ Clone the repository and build:
+
+ ```bash
+ git clone https://github.com/Pushp-Kharat1/pkboost.git
+ cd pkboost
+ cargo build --release
+ ```
+
+ Run the benchmark:
+
+ 1. **Use the included sample data** (already in `data/`)
+    ```bash
+    ls data/  # Should show creditcard_train.csv, creditcard_val.csv, etc.
+    ```
+
+ 2. **Run the benchmark**
+    ```bash
+    cargo run --release --bin benchmark
+    ```
+
+ ---
+
+ ## 💻 Basic Usage
+
+ To train and predict (see `src/bin/benchmark.rs` for a full example):
+
+ ```rust
+ use pkboost::*;
+ use std::error::Error;
+
+ fn main() -> Result<(), Box<dyn Error>> {
+     // Load CSV with headers: feature1,feature2,...,Class
+     let (x_train, y_train) = load_csv("train.csv")?;
+     let (x_val, y_val) = load_csv("val.csv")?;
+     let (x_test, y_test) = load_csv("test.csv")?;
+
+     // Auto-configure based on data characteristics
+     let mut model = OptimizedPKBoostShannon::auto(&x_train, &y_train);
+
+     // Train with early stopping on the validation set
+     model.fit(
+         &x_train,
+         &y_train,
+         Some((&x_val, &y_val)), // Optional validation
+         true,                   // Verbose output
+     )?;
+
+     // Predict probabilities (not classes)
+     let test_probs = model.predict_proba(&x_test)?;
+
+     // Evaluate
+     let pr_auc = calculate_pr_auc(&y_test, &test_probs);
+     println!("PR-AUC: {:.4}", pr_auc);
+
+     Ok(())
+ }
+
+ // Helper function (put in your code)
+ fn load_csv(path: &str) -> Result<(Vec<Vec<f64>>, Vec<f64>), Box<dyn Error>> {
+     let mut reader = csv::Reader::from_path(path)?;
+     let headers = reader.headers()?.clone();
+     let target_col_index = headers.iter().position(|h| h == "Class")
+         .ok_or("Class column not found")?;
+
+     let mut features = Vec::new();
+     let mut labels = Vec::new();
+
+     for result in reader.records() {
+         let record = result?;
+         let mut row: Vec<f64> = Vec::new();
+         for (i, value) in record.iter().enumerate() {
+             if i == target_col_index {
+                 labels.push(value.parse()?);
+             } else {
+                 // Empty cells become NaN; the model median-imputes them
+                 let parsed_value = if value.is_empty() {
+                     f64::NAN
+                 } else {
+                     value.parse()?
+                 };
+                 row.push(parsed_value);
+             }
+         }
+         features.push(row);
+     }
+
+     Ok((features, labels))
+ }
+ ```
+
+ **Expected CSV format:**
+ - Header row required
+ - Target column named "Class" with binary values (0.0 or 1.0)
+ - All other columns treated as numerical features
+ - Empty values treated as NaN (median-imputed)
+ - No categorical support (encode categorical features first)
+ - For more data-loading examples, see the `src/bin/*.rs` files such as `benchmark.rs`; CSV is supported via the `csv` crate
+
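+ The example above ends by calling `calculate_pr_auc`. Conceptually, that metric can be computed like the following generic sketch (the crate's own implementation may differ in tie handling and interpolation):
+
+ ```rust
+ // Generic average-precision-style PR-AUC; assumes scores contain no NaN.
+ fn pr_auc(y_true: &[f64], probs: &[f64]) -> f64 {
+     let mut pairs: Vec<(f64, f64)> = probs.iter().cloned()
+         .zip(y_true.iter().cloned()).collect();
+     // Rank predictions by descending score
+     pairs.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
+     let total_pos: f64 = y_true.iter().sum();
+     if total_pos == 0.0 { return 0.0; }
+     let (mut tp, mut fp, mut auc, mut prev_recall) = (0.0, 0.0, 0.0, 0.0);
+     for (_, y) in pairs {
+         if y == 1.0 { tp += 1.0; } else { fp += 1.0; }
+         let recall = tp / total_pos;
+         let precision = tp / (tp + fp);
+         auc += precision * (recall - prev_recall); // rectangle rule
+         prev_recall = recall;
+     }
+     auc
+ }
+ ```
+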
+ ---
+
+ ## ✨ Key Features
+
+ - **Extreme Imbalance Handling:** Automatic class weighting and MI regularization boost recall on rare positives without reducing precision. Binary classification only.
+ - **Adaptive Hyperparameters:** `auto_tune_principled` profiles your dataset for optimal parameters; no manual tuning needed.
+ - **Histogram-Based Trees:** Optimized binning with median imputation for missing values; supports up to 32 bins per feature for fast splits.
+ - **Parallelism & Efficiency:** Rayon-based adaptive parallelism detects hardware and scales thresholds dynamically, with efficient batching for large datasets.
+ - **Adaptation Mechanisms:** `AdversarialLivingBooster` monitors vulnerability scores to detect drift and trigger adaptive responses, such as pruning unused features through "metabolism" tracking.
+ - **Metrics Built-In:** PR-AUC, ROC-AUC, F1@0.5, and threshold optimization are available out of the box (see the sketch below).
+
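+ As an illustration of the threshold-optimization idea, here is a minimal, self-contained sketch; `best_f1_threshold` is a hypothetical helper written for this README, not part of the published API:
+
+ ```rust
+ // Hypothetical helper: sweep candidate probability thresholds and
+ // return the one that maximizes F1 on held-out data.
+ fn best_f1_threshold(y_true: &[f64], probs: &[f64]) -> (f64, f64) {
+     let mut best = (0.5, 0.0); // (threshold, best F1 so far)
+     for i in 1..100 {
+         let t = i as f64 / 100.0;
+         let (mut tp, mut fp, mut fn_) = (0.0, 0.0, 0.0);
+         for (&y, &p) in y_true.iter().zip(probs.iter()) {
+             if p >= t && y == 1.0 { tp += 1.0; }
+             if p >= t && y == 0.0 { fp += 1.0; }
+             if p < t && y == 1.0 { fn_ += 1.0; }
+         }
+         let f1 = if tp > 0.0 { 2.0 * tp / (2.0 * tp + fp + fn_) } else { 0.0 };
+         if f1 > best.1 { best = (t, f1); }
+     }
+     best
+ }
+ ```
+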
+ ---
+
+ ## 📊 Benchmarks
+
+ **Testing methodology:** All models use default settings with no hyperparameter tuning. This reflects real-world usage, where most practitioners cannot dedicate time to extensive tuning.
+
+ PKBoost's auto-tuning provides an edge: it automatically detects imbalance and adjusts parameters. LightGBM and XGBoost can match these results with tuning, but that requires expert knowledge.
+
+ **Reproducibility:** All benchmark code is in `src/bin/benchmark.rs`. Data splits: 60% train, 20% validation, 20% test. LightGBM and XGBoost used default params from their Rust crates. Full benchmarks (10+ datasets): see `BENCHMARKS.md`.
+
+ ### Standard Datasets
+
+ | Dataset | Samples | Imbalance | Model | PR-AUC | F1 | ROC-AUC |
+ |------------------|----------|----------------------|-----------|---------|---------|---------|
+ | **Credit Card** | 170,884 | 0.2% (extreme) | **PKBoost** | **87.8%** | **87.4%** | **97.5%** |
+ | | | | LightGBM | 79.3% | 71.3% | 92.1% |
+ | | | | XGBoost | 74.5% | 79.8% | 91.7% |
+ | *Improvements* | | | vs LGBM | **+10.4%** | **+22.7%** | **+5.7%** |
+ | | | | vs XGBoost | **+17.9%** | **+9.7%** | **+6.1%** |
+ | **Pima Diabetes** | 460 | 35.0% (balanced) | **PKBoost** | **98.0%** | **93.7%** | **98.6%** |
+ | | | | LightGBM | 62.9% | 48.8% | 82.4% |
+ | | | | XGBoost | 68.0% | 60.0% | 82.0% |
+ | *Improvements* | | | vs LGBM | **+55.7%** | **+92.0%** | **+19.6%** |
+ | | | | vs XGBoost | **+44.0%** | **+56.1%** | **+20.1%** |
+ | **Breast Cancer** | 341 | 37.2% (balanced) | PKBoost | 97.9% | 93.2% | 98.6% |
+ | | | | **LightGBM** | **99.1%** | **96.3%** | **99.2%** |
+ | | | | **XGBoost** | **99.2%** | **95.1%** | **99.4%** |
+ | *Improvements* | | | vs LGBM | -1.2% | -3.3% | -0.7% |
+ | | | | vs XGBoost | -1.4% | -2.1% | -0.8% |
+ | **Heart Disease** | 181 | 45.9% (balanced) | **PKBoost** | **87.8%** | **82.5%** | **88.5%** |
+ | **Ionosphere** | 210 | 35.7% (balanced) | **PKBoost** | **98.0%** | **93.7%** | **98.5%** |
+ | | | | LightGBM | 95.4% | 88.9% | 96.0% |
+ | | | | XGBoost | 97.2% | 88.9% | 97.5% |
+ | *Improvements* | | | vs LGBM | **+2.7%** | **+5.4%** | **+2.7%** |
+ | | | | vs XGBoost | **+0.8%** | **+5.4%** | **+1.1%** |
+ | **Sonar** | 124 | 46.8% (balanced) | **PKBoost** | **91.8%** | **87.2%** | **93.6%** |
+ | **SpamBase** | 2,760 | 39.4% (balanced) | **PKBoost** | **98.0%** | **93.3%** | **98.0%** |
+ | **Adult** | - | 24.1% (balanced) | **PKBoost** | **81.2%** | **71.9%** | **92.0%** |
+
+ **Notes:**
+ - PR-AUC is prioritized for imbalance; F1 is computed at the optimized threshold.
+ - Unfilled cells indicate benchmarks in progress.
+ - Pima Diabetes: small datasets (n=460) have high variance due to limited samples; results may not generalize, so re-run with your own data to confirm.
+ - Breast Cancer: PKBoost slightly underperforms on nearly balanced datasets (37% minority). This is expected, since the optimizations target extreme imbalance; for balanced data, use XGBoost.
+
+ ### Why PKBoost Wins on Imbalanced Data
+
+ **Credit Card Fraud (0.2% minority class):**
+
+ - **PKBoost:** 87.8% PR-AUC → performance maintained.
+ - **XGBoost:** 74.5% PR-AUC → 15% degradation from its balanced baseline.
+ - **LightGBM:** 79.3% PR-AUC → 10% degradation from its balanced baseline.
+
+ **Pattern:** As imbalance severity increases (from balanced to 5% to 1% to 0.2%), traditional boosting degrades steadily while PKBoost maintains high accuracy.
+
+ ### Drift Resilience (Credit Card Dataset)
+
+ PKBoost features experimental drift detection that monitors model vulnerabilities and can trigger adaptive retraining.
+
+ **Benchmark:** After introducing a significant covariate shift (adding noise to 10 features), models were tested on corrupted data:
+
+ | Model | Baseline PR-AUC | After Drift | Degradation |
+ |------------------|-----------------|-------------|-------------|
+ | **PKBoost** | **87.8%** | **86.2%** | **1.8%** |
+ | LightGBM | 79.3% | 45.6% | 42.5% |
+ | XGBoost | 74.5% | 50.8% | 31.8% |
+
+ **PKBoost's robustness comes from:**
+ - Conservative tree depth, which prevents overfitting to specific distributions
+ - Quantile-based binning that adapts to feature distributions
+ - Regularization that reduces sensitivity to noise
+
+ **Note:** Adaptive retraining is experimental and didn't trigger in this test. The robustness comes from the base architecture.
+
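+ For intuition, the noise injection above amounts to a covariate shift like the following sketch (illustrative only; the benchmark's actual corruption lives in `src/bin/test_drift.rs` and may differ):
+
+ ```rust
+ use rand::Rng; // assumes the `rand` crate
+
+ // Add zero-mean noise to the first `n_drift` feature columns to
+ // simulate covariate shift on a test matrix.
+ fn inject_drift(x: &mut [Vec<f64>], n_drift: usize, scale: f64) {
+     let mut rng = rand::thread_rng();
+     for row in x.iter_mut() {
+         for value in row.iter_mut().take(n_drift) {
+             *value += scale * (rng.gen::<f64>() - 0.5);
+         }
+     }
+ }
+ ```
+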
+ ---
+
+ ## 🎯 When to Use PKBoost
+
+ ### ✅ Good fit:
+ - Binary classification (0/1 labels)
+ - Extreme imbalance (<5% minority class)
+ - Fraud detection, medical diagnosis, anomaly detection
+ - Seeking good results without hyperparameter tuning
+
+ ### ❌ Not suitable for:
+ - Multi-class classification (not implemented)
+ - Regression tasks
+ - Perfectly balanced datasets (use XGBoost, it's faster)
+ - Datasets with fewer than 1,000 samples (too small for meaningful results)
+
+ ---
+
+ ## 🔬 How It Works
+
+ **Traditional gradient boosting struggles with extreme imbalance because:**
+ - Gradient-based splits favor the majority class: more samples produce stronger gradients.
+ - Regularization does not consider class rarity.
+ - Early stopping uses global metrics that overlook minority-class performance.
+
+ **PKBoost's approach:**
+ - **Shannon entropy guidance** optimizes splits for information gain on the minority class.
+ - **Adaptive class weighting** is calculated automatically from data statistics.
+ - **PR-AUC early stopping** focuses on minority-class performance.
+
+ **Technical innovation:** fusing information theory with Newton boosting. Each split maximizes:
+
+ ```
+ Gain = GradientGain + λ * InformationGain
+ ```
+
+ where λ is adapted to the imbalance severity, as in the sketch below.
+
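+ A compact sketch of that criterion (illustrative, not the library's internals), combining Newton-style gradient gain with the entropy-based information gain of a candidate split:
+
+ ```rust
+ // Illustrative split scoring with per-child sums of gradients (g),
+ // hessians (h), and class counts (pos/neg). `lambda_mi` would be
+ // chosen from the observed imbalance; not PKBoost's actual code.
+ fn newton_gain(g: f64, h: f64, reg: f64) -> f64 {
+     g * g / (h + reg)
+ }
+
+ fn entropy(pos: f64, neg: f64) -> f64 {
+     let n = pos + neg;
+     if n == 0.0 { return 0.0; }
+     [pos / n, neg / n].iter()
+         .filter(|&&p| p > 0.0)
+         .map(|&p| -p * p.log2())
+         .sum()
+ }
+
+ // Gain = GradientGain + lambda * InformationGain
+ fn split_gain(
+     left: (f64, f64, f64, f64),  // (grad, hess, pos, neg)
+     right: (f64, f64, f64, f64),
+     reg: f64,
+     lambda_mi: f64,
+ ) -> f64 {
+     let (gl, hl, pl, nl) = left;
+     let (gr, hr, pr, nr) = right;
+     let grad_gain = newton_gain(gl, hl, reg) + newton_gain(gr, hr, reg)
+         - newton_gain(gl + gr, hl + hr, reg);
+     let (n_l, n_r) = (pl + nl, pr + nr);
+     let n = n_l + n_r;
+     // Parent entropy minus size-weighted child entropies
+     let info_gain = entropy(pl + pr, nl + nr)
+         - (n_l / n) * entropy(pl, nl)
+         - (n_r / n) * entropy(pr, nr);
+     grad_gain + lambda_mi * info_gain
+ }
+
+ fn main() {
+     // Toy example: a split isolating most positives on the left
+     let left = (-4.0, 2.0, 8.0, 1.0);
+     let right = (10.0, 20.0, 2.0, 89.0);
+     println!("gain = {:.3}", split_gain(left, right, 1.0, 0.5));
+ }
+ ```
+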
+ ### Architecture Flow:
+
+ ```
+ [Your Data] → [Auto-Tuner] → [Shannon-Guided Trees] → [Predictions]
+      ↓              ↓                   ↓
+   Detects      Entropy + Gradient    PR-AUC
+   Imbalance    Split Criterion       Optimized
+ ```
+
+ **Core Model:** `OptimizedPKBoostShannon` – Shannon-entropy regularized trees with MI weighting.
+ **Data Prep:** `OptimizedHistogramBuilder` – Fast binning, median imputation, parallel transforms.
+ **Tuning:** `auto_tune_principled` & `auto_params` – Dataset-aware hyperparameters.
+ **Adaptation:** `AdversarialLivingBooster` – Monitors drift through vulnerability scores; triggers adaptive responses such as feature pruning via metabolism tracking.
+ **Parallelism:** `adaptive_parallel` – Hardware-aware Rayon config (core and RAM detection).
+ **Evaluation:** Built-in calculations for PR-AUC, ROC-AUC, and F1.
+ **Drift Sims:** Scripts like `test_drift.rs` and `test_static.rs` for baseline comparisons.
+
+ See `src/` for the full implementation. Binary classification only.
+
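+ Since the `AdversarialLivingBooster` interface lives in `src/`, here is a generic, self-contained sketch of the monitoring idea only (names and thresholds are illustrative, not the crate's API):
+
+ ```rust
+ // Generic drift monitor: tracks a batch error rate and flags drift
+ // when it degrades past a tolerance relative to a stored baseline.
+ struct DriftMonitor {
+     baseline: f64,  // error rate observed at deployment time
+     tolerance: f64, // relative degradation that counts as drift
+ }
+
+ impl DriftMonitor {
+     fn drifted(&self, probs: &[f64], y_true: &[f64]) -> bool {
+         // Mean absolute error of predicted probabilities as a cheap
+         // vulnerability score; entropy-based scores would also work.
+         let err: f64 = probs.iter().zip(y_true)
+             .map(|(p, y)| (p - y).abs())
+             .sum::<f64>() / probs.len() as f64;
+         err > self.baseline * (1.0 + self.tolerance)
+     }
+ }
+
+ fn main() {
+     let monitor = DriftMonitor { baseline: 0.05, tolerance: 0.5 };
+     let (probs, y_true) = (vec![0.9, 0.2, 0.7], vec![1.0, 0.0, 0.0]);
+     if monitor.drifted(&probs, &y_true) {
+         println!("drift detected: retrain or adapt the model");
+     }
+ }
+ ```
+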
+ ---
+
+ ## ⚡ Performance
+
+ **Training Time (Credit Card, 170K samples):**
+
+ - **PKBoost:** ~45s with auto-tuning → 87.8% PR-AUC
+ - **XGBoost:** ~12s with defaults → 74.5% PR-AUC
+ - **XGBoost:** ~12s × 50 tuning trials = ~10 minutes → ~87% PR-AUC (estimated)
+
+ ### The Trade-off:
+ - **PKBoost:** 45 seconds, zero human time
+ - **XGBoost:** 10+ minutes of compute plus ~2 hours of human tuning time
+
+ **Choose your bottleneck:** compute time or engineering time.
+
+ PKBoost prioritizes accuracy over speed. For production inference, all three libraries have similar prediction latency of around 1 ms per sample.
+
+ ---
+
+ ## 📋 Requirements
+
+ - Rust 1.70+ (2021 edition)
+ - 8GB+ RAM for large datasets (>100K samples)
+ - Multi-core CPU recommended (auto-detects and parallelizes)
+
+ ---
+
+ ## 🧪 Running Benchmarks & Tests
+
+ **Install Rust:**
+ ```bash
+ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+ ```
+
+ **Clone & build:** as above.
+
+ **Run:**
+ ```bash
+ cargo run --release --bin benchmark  # uses data/*.csv
+ ```
+
+ **Drift tests:**
+ ```bash
+ cargo run --bin test_drift
+ ```
+
+ Datasets are sourced from the UCI Machine Learning Repository.
+
+ ---
+
+ ## 🛠️ Common Issues
+
+ **"error: linker `cc` not found"**
+ - Ubuntu/Debian: `sudo apt install build-essential`
+ - macOS: install the Xcode Command Line Tools
+
+ **Out of memory during compilation:**
+ ```bash
+ cargo build --release --jobs 1  # Limit parallel compilation
+ ```
+
+ **Slow training on large datasets:**
+ - Ensure you're using the `--release` flag
+ - Check CPU utilization (should be ~800% on 8 cores)
+
+ ---
+
+ ## 🤝 Contributing
+
+ Contributions welcome! Fork and open a PR focused on extensions, optimizations, or new tests. Issues are welcome for bugs or dataset requests.
+
+ **License:** MIT – free for commercial use.
+ **Contact:** kharatpushp16@outlook.com
+
+ ---
+
+ ## 📚 Citation
+
+ If you use PKBoost in your research, please cite:
+
+ ```bibtex
+ @software{kharat2025pkboost,
+   author = {Kharat, Pushp},
+   title = {PKBoost: Shannon-Guided Gradient Boosting for Extreme Imbalance},
+   year = {2025},
+   url = {https://github.com/Pushp-Kharat1/pkboost}
+ }
+ ```
+
+ ---
+
+ ## 📖 Further Reading
+
+ [Rust ML Ecosystem](https://www.arewelearningyet.com/)
+
+ **Questions?** Open an issue.
+
+ ---
+
+ **Project by Pushp Kharat.** Last updated: October 24, 2025.
+

pkboost-0.1.1.dist-info/RECORD ADDED
@@ -0,0 +1,5 @@
+ pkboost-0.1.1.dist-info/METADATA,sha256=QSEm96eAjU4Jf-6oqaQ24esNfsprCEqosIhkHA7Fdew,15476
+ pkboost-0.1.1.dist-info/WHEEL,sha256=AUS7tHOBvWg1bDsPcHg1j3P_rKxqebEdeR--lIGHkyI,129
+ pkboost/__init__.py,sha256=VAa9XU8zKnafGTewiLOxnkP0PRnAlMIgsYusSHIWvEg,111
+ pkboost/pkboost.cpython-312-x86_64-linux-gnu.so,sha256=5qW-ZxooZrtLqC1cC6IIBxyy7xEu4JMLXk5o_TrGWow,1106832
+ pkboost-0.1.1.dist-info/RECORD,,

pkboost-0.1.1.dist-info/WHEEL ADDED
@@ -0,0 +1,4 @@
+ Wheel-Version: 1.0
+ Generator: maturin (1.9.6)
+ Root-Is-Purelib: false
+ Tag: cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64