smelt-ml 0.1.0__tar.gz
- smelt_ml-0.1.0/.dockerignore +4 -0
- smelt_ml-0.1.0/.gitignore +3 -0
- smelt_ml-0.1.0/CLAUDE.md +118 -0
- smelt_ml-0.1.0/Cargo.lock +937 -0
- smelt_ml-0.1.0/Cargo.toml +36 -0
- smelt_ml-0.1.0/Dockerfile +73 -0
- smelt_ml-0.1.0/LICENSE +21 -0
- smelt_ml-0.1.0/PKG-INFO +9 -0
- smelt_ml-0.1.0/README.md +361 -0
- smelt_ml-0.1.0/benches/learners.rs +8 -0
- smelt_ml-0.1.0/docs/HORARIO DE VERANO.docx (28).pdf +0 -0
- smelt_ml-0.1.0/docs/catboost_paper.pdf +0 -0
- smelt_ml-0.1.0/docs/code_review.md +157 -0
- smelt_ml-0.1.0/docs/email_grekousis.md +45 -0
- smelt_ml-0.1.0/docs/estado_del_arte.md +203 -0
- smelt_ml-0.1.0/docs/geo-xgboost.pdf +0 -0
- smelt_ml-0.1.0/docs/paper_strengthening_checklist.md +196 -0
- smelt_ml-0.1.0/docs/roadmap_checklist.md +185 -0
- smelt_ml-0.1.0/docs/security_audit.md +103 -0
- smelt_ml-0.1.0/docs/wos_analysis.md +290 -0
- smelt_ml-0.1.0/docs/wos_queries.md +286 -0
- smelt_ml-0.1.0/examples/ablation_study.rs +444 -0
- smelt_ml-0.1.0/examples/accuracy_validation.rs +198 -0
- smelt_ml-0.1.0/examples/basic_classification.rs +69 -0
- smelt_ml-0.1.0/examples/bench_vectorization.rs +68 -0
- smelt_ml-0.1.0/examples/benchmark_large.rs +329 -0
- smelt_ml-0.1.0/examples/benchmark_prediction.rs +116 -0
- smelt_ml-0.1.0/examples/case_study_king_county.rs +238 -0
- smelt_ml-0.1.0/examples/case_study_spatial.rs +264 -0
- smelt_ml-0.1.0/examples/catboost_categoricals.rs +70 -0
- smelt_ml-0.1.0/examples/conformal_prediction.rs +85 -0
- smelt_ml-0.1.0/examples/gis_workflow.rs +232 -0
- smelt_ml-0.1.0/examples/mem_bench.rs +60 -0
- smelt_ml-0.1.0/examples/profile_catboost.rs +47 -0
- smelt_ml-0.1.0/examples/profile_lightgbm.rs +41 -0
- smelt_ml-0.1.0/examples/profile_scaling.rs +48 -0
- smelt_ml-0.1.0/examples/regression_pipeline.rs +64 -0
- smelt_ml-0.1.0/examples/spatial_ml.rs +86 -0
- smelt_ml-0.1.0/examples/xgboost_tuning.rs +91 -0
- smelt_ml-0.1.0/pyproject.toml +19 -0
- smelt_ml-0.1.0/python/smelt/__init__.py +54 -0
- smelt_ml-0.1.0/python/smelt/conformal.py +5 -0
- smelt_ml-0.1.0/python/smelt/spatial.py +5 -0
- smelt_ml-0.1.0/python/smelt/stats.py +5 -0
- smelt_ml-0.1.0/smelt-py/.gitignore +6 -0
- smelt_ml-0.1.0/smelt-py/Cargo.toml +17 -0
- smelt_ml-0.1.0/smelt-py/python/smelt/__init__.py +54 -0
- smelt_ml-0.1.0/smelt-py/python/smelt/conformal.py +5 -0
- smelt_ml-0.1.0/smelt-py/python/smelt/spatial.py +5 -0
- smelt_ml-0.1.0/smelt-py/python/smelt/stats.py +5 -0
- smelt_ml-0.1.0/smelt-py/src/lib.rs +744 -0
- smelt_ml-0.1.0/src/benchmark.rs +104 -0
- smelt_ml-0.1.0/src/benchmark_design.rs +136 -0
- smelt_ml-0.1.0/src/causal/mod.rs +506 -0
- smelt_ml-0.1.0/src/cluster/isolation_forest.rs +252 -0
- smelt_ml-0.1.0/src/cluster/mod.rs +300 -0
- smelt_ml-0.1.0/src/conformal/cqr.rs +119 -0
- smelt_ml-0.1.0/src/conformal/mod.rs +238 -0
- smelt_ml-0.1.0/src/data/mod.rs +208 -0
- smelt_ml-0.1.0/src/error.rs +36 -0
- smelt_ml-0.1.0/src/importance/mod.rs +187 -0
- smelt_ml-0.1.0/src/importance/shap.rs +272 -0
- smelt_ml-0.1.0/src/learner/adaboost.rs +273 -0
- smelt_ml-0.1.0/src/learner/bagging.rs +282 -0
- smelt_ml-0.1.0/src/learner/catboost.rs +716 -0
- smelt_ml-0.1.0/src/learner/des.rs +197 -0
- smelt_ml-0.1.0/src/learner/ebm.rs +290 -0
- smelt_ml-0.1.0/src/learner/geo_xgboost.rs +362 -0
- smelt_ml-0.1.0/src/learner/hist_pool.rs +148 -0
- smelt_ml-0.1.0/src/learner/histogram.rs +97 -0
- smelt_ml-0.1.0/src/learner/hoeffding.rs +444 -0
- smelt_ml-0.1.0/src/learner/knn.rs +154 -0
- smelt_ml-0.1.0/src/learner/lightgbm.rs +1036 -0
- smelt_ml-0.1.0/src/learner/linear_regression.rs +182 -0
- smelt_ml-0.1.0/src/learner/logistic_regression.rs +293 -0
- smelt_ml-0.1.0/src/learner/mod.rs +78 -0
- smelt_ml-0.1.0/src/learner/naive_bayes.rs +162 -0
- smelt_ml-0.1.0/src/learner/oblique.rs +763 -0
- smelt_ml-0.1.0/src/learner/quantile.rs +168 -0
- smelt_ml-0.1.0/src/learner/quantile_forest.rs +312 -0
- smelt_ml-0.1.0/src/learner/regularized.rs +418 -0
- smelt_ml-0.1.0/src/learner/stacking.rs +238 -0
- smelt_ml-0.1.0/src/learner/svm.rs +254 -0
- smelt_ml-0.1.0/src/learner/tree/decision_tree.rs +183 -0
- smelt_ml-0.1.0/src/learner/tree/extra_trees.rs +276 -0
- smelt_ml-0.1.0/src/learner/tree/gradient_boosting.rs +431 -0
- smelt_ml-0.1.0/src/learner/tree/mod.rs +362 -0
- smelt_ml-0.1.0/src/learner/tree/random_forest.rs +306 -0
- smelt_ml-0.1.0/src/learner/xgboost.rs +1084 -0
- smelt_ml-0.1.0/src/lib.rs +96 -0
- smelt_ml-0.1.0/src/measure/mod.rs +474 -0
- smelt_ml-0.1.0/src/multilabel/mod.rs +187 -0
- smelt_ml-0.1.0/src/multioutput/mod.rs +159 -0
- smelt_ml-0.1.0/src/prediction/mod.rs +83 -0
- smelt_ml-0.1.0/src/preprocess/adasyn.rs +191 -0
- smelt_ml-0.1.0/src/preprocess/encoder.rs +141 -0
- smelt_ml-0.1.0/src/preprocess/filter.rs +461 -0
- smelt_ml-0.1.0/src/preprocess/imputer.rs +124 -0
- smelt_ml-0.1.0/src/preprocess/label_encoder.rs +76 -0
- smelt_ml-0.1.0/src/preprocess/mod.rs +72 -0
- smelt_ml-0.1.0/src/preprocess/pca.rs +153 -0
- smelt_ml-0.1.0/src/preprocess/pipeline.rs +130 -0
- smelt_ml-0.1.0/src/preprocess/rfe.rs +175 -0
- smelt_ml-0.1.0/src/preprocess/scaler.rs +173 -0
- smelt_ml-0.1.0/src/preprocess/smote.rs +159 -0
- smelt_ml-0.1.0/src/resample/mod.rs +102 -0
- smelt_ml-0.1.0/src/resample/spatial.rs +173 -0
- smelt_ml-0.1.0/src/serialize.rs +98 -0
- smelt_ml-0.1.0/src/stats.rs +684 -0
- smelt_ml-0.1.0/src/survival/mod.rs +453 -0
- smelt_ml-0.1.0/src/task/mod.rs +163 -0
- smelt_ml-0.1.0/src/tuning/bayesian.rs +373 -0
- smelt_ml-0.1.0/src/tuning/grid_search.rs +104 -0
- smelt_ml-0.1.0/src/tuning/hyperband.rs +222 -0
- smelt_ml-0.1.0/src/tuning/mod.rs +127 -0
- smelt_ml-0.1.0/src/tuning/random_search.rs +146 -0
- smelt_ml-0.1.0/src/validate.rs +39 -0
- smelt_ml-0.1.0/tests/catboost_perf.py +40 -0
- smelt_ml-0.1.0/tests/catboost_perf.rs +95 -0
- smelt_ml-0.1.0/tests/integration.rs +5879 -0
- smelt_ml-0.1.0/tests/lightgbm_perf.py +40 -0
- smelt_ml-0.1.0/tests/lightgbm_perf.rs +93 -0
- smelt_ml-0.1.0/tests/real_benchmark.rs +106 -0
- smelt_ml-0.1.0/tests/xgboost_perf.rs +102 -0
- smelt_ml-0.1.0/tests/xgboost_vs_official.rs +280 -0
smelt_ml-0.1.0/CLAUDE.md
ADDED
@@ -0,0 +1,118 @@
+# Smelt — Machine Learning Framework for Rust
+
+## Overview
+
+Smelt is an ML framework inspired by [mlr3](https://mlr3.mlr-org.com/) (R), designed for Rust's performance and safety guarantees. The name refers to smelting — refining raw data into useful models.
+
+## Architecture
+
+```
+Task → Learner → TrainedModel → Prediction → Measure
+↑
+Resampling (CV, Holdout)
+Tuning (Grid, Random, Bayesian)
+Preprocessing (Scale, Encode, Impute)
+```
+
+### Core Abstractions (mlr3 mapping)
+
+| Smelt | mlr3 | Purpose |
+|-------|------|---------|
+| `Task` | `Task` | Data container with target |
+| `ClassificationTask` | `TaskClassif` | Discrete target |
+| `RegressionTask` | `TaskRegr` | Continuous target |
+| `Learner` | `Learner` | Algorithm that trains |
+| `TrainedModel` | trained Learner | Fitted model that predicts |
+| `Prediction` | `Prediction` | Output with optional truth |
+| `Measure` | `Measure` | Evaluation metric |
+| `Resample` | `Resampling` | Train/test splitting strategy |
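The flow in the table above (task, learner, trained model, prediction, measure) can be sketched in plain Rust as follows. This is only an illustration under stated assumptions: the struct fields, the trait method signatures, `MeanLearner`, `MeanModel`, and the free function `rmse` are hypothetical and are not taken from the crate's source, which is not shown in this diff.

```rust
/// Hypothetical regression task: a feature matrix plus a continuous target.
pub struct RegressionTask {
    pub features: Vec<Vec<f64>>,
    pub target: Vec<f64>,
}

/// Fitted model that predicts (the `TrainedModel` row of the table).
pub trait TrainedModel {
    fn predict(&self, features: &[Vec<f64>]) -> Vec<f64>;
}

/// Algorithm that trains (the `Learner` row of the table).
pub trait Learner {
    type Model: TrainedModel;
    fn train(&self, task: &RegressionTask) -> Self::Model;
}

/// Featureless baseline: always predicts the mean of the training target.
pub struct MeanLearner;

pub struct MeanModel {
    mean: f64,
}

impl Learner for MeanLearner {
    type Model = MeanModel;

    fn train(&self, task: &RegressionTask) -> MeanModel {
        // `max(1)` avoids a division by zero on an empty task.
        let n = task.target.len().max(1) as f64;
        MeanModel { mean: task.target.iter().sum::<f64>() / n }
    }
}

impl TrainedModel for MeanModel {
    fn predict(&self, features: &[Vec<f64>]) -> Vec<f64> {
        // One prediction per input row, all equal to the training mean.
        vec![self.mean; features.len()]
    }
}

/// Root mean squared error between truth and prediction (a measure).
pub fn rmse(truth: &[f64], predicted: &[f64]) -> f64 {
    assert_eq!(truth.len(), predicted.len(), "length mismatch");
    let sum_sq: f64 = truth
        .iter()
        .zip(predicted)
        .map(|(t, p)| (t - p).powi(2))
        .sum();
    (sum_sq / truth.len() as f64).sqrt()
}
```

Using an associated `Model` type means each learner names the concrete model it produces, so `train` followed by `predict` needs no dynamic dispatch. Whether Smelt makes the same choice cannot be determined from this diff.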
+
+### Module Structure
+
+```
+src/
+├── lib.rs            # Public API + prelude
+├── error.rs          # SmeltError enum (thiserror)
+├── task/mod.rs       # Task, ClassificationTask, RegressionTask
+├── learner/mod.rs    # Learner trait, TrainedModel trait
+├── prediction/       # Prediction enum (Classification/Regression)
+├── measure/          # Accuracy, RMSE, MAE (+ trait Measure)
+├── resample/         # CrossValidation, Holdout (+ trait Resample)
+├── preprocess/       # TODO: StandardScaler, MinMaxScaler, OneHotEncoder
+└── tuning/           # TODO: GridSearch, RandomSearch, BayesianOpt
+```
+
+## Build & Test
+
+```bash
+cargo check        # Type check
+cargo test         # Run tests
+cargo bench        # Run benchmarks (criterion)
+cargo doc --open   # Generate docs
+```
+
+## Design Principles
+
+1. **Type safety** — Classification and Regression are separate types, not runtime tags
+2. **Trait-based extensibility** — Implement `Learner` to add new algorithms
+3. **Zero-copy where possible** — ndarray views, references over clones
+4. **Parallel by default** — rayon for data parallelism (CV folds, ensemble training)
+5. **Composable pipeline** — Task → Learner → Prediction → Measure is always the flow
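The first principle above, separate types rather than runtime tags, can be illustrated with a measure trait that is generic over the prediction type. The sketch below is an assumption-laden example: `ClassificationPrediction`, `RegressionPrediction`, the `Measure<P>` signature, `Accuracy`, and `Mae` are hypothetical names chosen for illustration. The file itself describes `Prediction` as an enum, so the crate's real design may differ.

```rust
/// Predicted and true class labels for a classification task (hypothetical).
pub struct ClassificationPrediction {
    pub predicted: Vec<usize>,
    pub truth: Vec<usize>,
}

/// Predicted and true values for a regression task (hypothetical).
pub struct RegressionPrediction {
    pub predicted: Vec<f64>,
    pub truth: Vec<f64>,
}

/// A measure is tied to one prediction type through the type parameter `P`.
pub trait Measure<P> {
    fn score(&self, prediction: &P) -> f64;
}

pub struct Accuracy;
pub struct Mae;

impl Measure<ClassificationPrediction> for Accuracy {
    /// Fraction of labels predicted correctly.
    fn score(&self, p: &ClassificationPrediction) -> f64 {
        let hits = p
            .predicted
            .iter()
            .zip(&p.truth)
            .filter(|(a, b)| a == b)
            .count();
        hits as f64 / p.truth.len() as f64
    }
}

impl Measure<RegressionPrediction> for Mae {
    /// Mean absolute error.
    fn score(&self, p: &RegressionPrediction) -> f64 {
        let total: f64 = p
            .predicted
            .iter()
            .zip(&p.truth)
            .map(|(a, b)| (a - b).abs())
            .sum();
        total / p.truth.len() as f64
    }
}

// `Accuracy.score(&some_regression_prediction)` is rejected at compile time,
// because no `impl Measure<RegressionPrediction> for Accuracy` exists.
```

With this layout, applying a classification metric to regression output is a type error rather than a runtime check.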
+
+## Implementation Roadmap
+
+### Phase 1 — Core (current)
+- [x] Task system (Classification + Regression)
+- [x] Learner + TrainedModel traits
+- [x] Prediction enum
+- [x] Measures: Accuracy, RMSE, MAE
+- [x] Resampling: CrossValidation, Holdout
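The two resampling strategies listed in Phase 1 come down to producing train and test index sets. A minimal sketch follows, assuming contiguous, unshuffled folds. The function names `k_fold_indices` and `holdout_indices` are hypothetical, and a real implementation would probably shuffle with `rand` first.

```rust
/// Split the indices `0..n` into `k` contiguous folds and return, for each
/// fold, the pair `(train_indices, test_indices)`. No shuffling is done.
pub fn k_fold_indices(n: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(k >= 2 && k <= n, "need 2 <= k <= n");
    let base = n / k;
    let extra = n % k; // the first `extra` folds get one more sample
    let mut folds = Vec::with_capacity(k);
    let mut start = 0;
    for fold in 0..k {
        let size = base + usize::from(fold < extra);
        let end = start + size;
        // The current fold is the test set; everything else is training data.
        let test: Vec<usize> = (start..end).collect();
        let train: Vec<usize> = (0..start).chain(end..n).collect();
        folds.push((train, test));
        start = end;
    }
    folds
}

/// Holdout split: the first `floor(n * train_ratio)` indices are used for
/// training and the remainder for testing.
pub fn holdout_indices(n: usize, train_ratio: f64) -> (Vec<usize>, Vec<usize>) {
    assert!((0.0..=1.0).contains(&train_ratio), "ratio must be in [0, 1]");
    let n_train = (n as f64 * train_ratio).floor() as usize;
    ((0..n_train).collect(), (n_train..n).collect())
}
```

Every index appears in exactly one test fold, so averaging a measure over the folds uses each sample once for evaluation.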
+
+### Phase 2 — First Learners
+- [ ] Decision Tree (CART)
+- [ ] K-Nearest Neighbors
+- [ ] Logistic Regression
+- [ ] Linear Regression (OLS)
+- [ ] Benchmark pipeline (resample + measure loop)
+
+### Phase 3 — Ensembles
+- [ ] Random Forest
+- [ ] Gradient Boosting (XGBoost-style)
+- [ ] Bagging
+
+### Phase 4 — Preprocessing
+- [ ] StandardScaler, MinMaxScaler
+- [ ] OneHotEncoder, LabelEncoder
+- [ ] Missing value imputation
+- [ ] Pipeline chaining (preprocess → learner)
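For the preprocessing items above, the usual pattern is to learn parameters on the training data and reuse them unchanged on new data. The sketch below shows that for standardization of a single feature column. The `StandardScaler` fields and the `fit`/`transform` signatures here are assumptions for illustration, not the crate's API, and the use of the population standard deviation is likewise a choice made for this example.

```rust
/// Hypothetical standard scaler for one feature column.
pub struct StandardScaler {
    mean: f64,
    std: f64,
}

impl StandardScaler {
    /// Learn the mean and population standard deviation from training data.
    pub fn fit(column: &[f64]) -> Self {
        assert!(!column.is_empty(), "cannot fit on empty data");
        let n = column.len() as f64;
        let mean = column.iter().sum::<f64>() / n;
        let var = column.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
        // Constant columns have zero variance; fall back to 1.0 so that
        // `transform` only centers them instead of dividing by zero.
        let std = if var > 0.0 { var.sqrt() } else { 1.0 };
        StandardScaler { mean, std }
    }

    /// Apply the parameters learned in `fit` to any column, including
    /// unseen test data.
    pub fn transform(&self, column: &[f64]) -> Vec<f64> {
        column.iter().map(|x| (x - self.mean) / self.std).collect()
    }
}
```

Keeping `fit` and `transform` separate is what makes pipeline chaining safe under resampling: the scaler is fitted inside each training fold and only applied to the matching test fold.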
+
+### Phase 5 — Tuning
+- [ ] GridSearch
+- [ ] RandomSearch
+- [ ] Bayesian Optimization
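Of the tuning strategies listed above, grid search is the simplest: evaluate every combination and keep the best. A minimal sketch is given below, assuming two hypothetical hyperparameters (a tree depth and a learning rate) and an `objective` closure that stands in for "train, resample, measure". None of these names come from the crate.

```rust
/// Exhaustively evaluate every (depth, rate) combination and return the one
/// with the lowest score as `(depth, rate, score)`, or `None` if a list is
/// empty or every score is NaN.
pub fn grid_search<F>(
    depths: &[usize],
    rates: &[f64],
    objective: F,
) -> Option<(usize, f64, f64)>
where
    F: Fn(usize, f64) -> f64,
{
    let mut best: Option<(usize, f64, f64)> = None;
    for &depth in depths {
        for &rate in rates {
            let score = objective(depth, rate);
            // Ties keep the first combination seen; NaN never wins because
            // comparisons with NaN are false.
            let improves = match best {
                Some((_, _, best_score)) => score < best_score,
                None => !score.is_nan(),
            };
            if improves {
                best = Some((depth, rate, score));
            }
        }
    }
    best
}
```

The cost grows with the product of the list lengths, which is the usual argument for the random and Bayesian alternatives on larger search spaces.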
+
+### Phase 6 — Advanced
+- [ ] Feature importance (permutation, SHAP-like)
+- [ ] Spatial cross-validation (for geo applications)
+- [ ] CSV/Parquet data loading
+- [ ] Model serialization (serde)
+- [ ] Python bindings (PyO3) — expose as `smelt-py`
+
+## Dependencies
+
+- `ndarray` — N-dimensional arrays (feature matrices)
+- `rand` — Random number generation (resampling, stochastic algorithms)
+- `rayon` — Data parallelism
+- `thiserror` — Error types
+- `serde` — Serialization
+- `criterion` — Benchmarks (dev)
+
+## Author
+
+Francisco Parra — francisco.parra.o@usach.cl
+
+## Inspiration
+
+- [mlr3](https://mlr3.mlr-org.com/) (R) — Task/Learner/Measure architecture
+- [scikit-learn](https://scikit-learn.org/) (Python) — fit/predict API
+- [linfa](https://github.com/rust-ml/linfa) (Rust) — Existing Rust ML, but different design philosophy