smelt-ml 0.1.0__tar.gz

Files changed (125)
  1. smelt_ml-0.1.0/.dockerignore +4 -0
  2. smelt_ml-0.1.0/.gitignore +3 -0
  3. smelt_ml-0.1.0/CLAUDE.md +118 -0
  4. smelt_ml-0.1.0/Cargo.lock +937 -0
  5. smelt_ml-0.1.0/Cargo.toml +36 -0
  6. smelt_ml-0.1.0/Dockerfile +73 -0
  7. smelt_ml-0.1.0/LICENSE +21 -0
  8. smelt_ml-0.1.0/PKG-INFO +9 -0
  9. smelt_ml-0.1.0/README.md +361 -0
  10. smelt_ml-0.1.0/benches/learners.rs +8 -0
  11. smelt_ml-0.1.0/docs/HORARIO DE VERANO.docx (28).pdf +0 -0
  12. smelt_ml-0.1.0/docs/catboost_paper.pdf +0 -0
  13. smelt_ml-0.1.0/docs/code_review.md +157 -0
  14. smelt_ml-0.1.0/docs/email_grekousis.md +45 -0
  15. smelt_ml-0.1.0/docs/estado_del_arte.md +203 -0
  16. smelt_ml-0.1.0/docs/geo-xgboost.pdf +0 -0
  17. smelt_ml-0.1.0/docs/paper_strengthening_checklist.md +196 -0
  18. smelt_ml-0.1.0/docs/roadmap_checklist.md +185 -0
  19. smelt_ml-0.1.0/docs/security_audit.md +103 -0
  20. smelt_ml-0.1.0/docs/wos_analysis.md +290 -0
  21. smelt_ml-0.1.0/docs/wos_queries.md +286 -0
  22. smelt_ml-0.1.0/examples/ablation_study.rs +444 -0
  23. smelt_ml-0.1.0/examples/accuracy_validation.rs +198 -0
  24. smelt_ml-0.1.0/examples/basic_classification.rs +69 -0
  25. smelt_ml-0.1.0/examples/bench_vectorization.rs +68 -0
  26. smelt_ml-0.1.0/examples/benchmark_large.rs +329 -0
  27. smelt_ml-0.1.0/examples/benchmark_prediction.rs +116 -0
  28. smelt_ml-0.1.0/examples/case_study_king_county.rs +238 -0
  29. smelt_ml-0.1.0/examples/case_study_spatial.rs +264 -0
  30. smelt_ml-0.1.0/examples/catboost_categoricals.rs +70 -0
  31. smelt_ml-0.1.0/examples/conformal_prediction.rs +85 -0
  32. smelt_ml-0.1.0/examples/gis_workflow.rs +232 -0
  33. smelt_ml-0.1.0/examples/mem_bench.rs +60 -0
  34. smelt_ml-0.1.0/examples/profile_catboost.rs +47 -0
  35. smelt_ml-0.1.0/examples/profile_lightgbm.rs +41 -0
  36. smelt_ml-0.1.0/examples/profile_scaling.rs +48 -0
  37. smelt_ml-0.1.0/examples/regression_pipeline.rs +64 -0
  38. smelt_ml-0.1.0/examples/spatial_ml.rs +86 -0
  39. smelt_ml-0.1.0/examples/xgboost_tuning.rs +91 -0
  40. smelt_ml-0.1.0/pyproject.toml +19 -0
  41. smelt_ml-0.1.0/python/smelt/__init__.py +54 -0
  42. smelt_ml-0.1.0/python/smelt/conformal.py +5 -0
  43. smelt_ml-0.1.0/python/smelt/spatial.py +5 -0
  44. smelt_ml-0.1.0/python/smelt/stats.py +5 -0
  45. smelt_ml-0.1.0/smelt-py/.gitignore +6 -0
  46. smelt_ml-0.1.0/smelt-py/Cargo.toml +17 -0
  47. smelt_ml-0.1.0/smelt-py/python/smelt/__init__.py +54 -0
  48. smelt_ml-0.1.0/smelt-py/python/smelt/conformal.py +5 -0
  49. smelt_ml-0.1.0/smelt-py/python/smelt/spatial.py +5 -0
  50. smelt_ml-0.1.0/smelt-py/python/smelt/stats.py +5 -0
  51. smelt_ml-0.1.0/smelt-py/src/lib.rs +744 -0
  52. smelt_ml-0.1.0/src/benchmark.rs +104 -0
  53. smelt_ml-0.1.0/src/benchmark_design.rs +136 -0
  54. smelt_ml-0.1.0/src/causal/mod.rs +506 -0
  55. smelt_ml-0.1.0/src/cluster/isolation_forest.rs +252 -0
  56. smelt_ml-0.1.0/src/cluster/mod.rs +300 -0
  57. smelt_ml-0.1.0/src/conformal/cqr.rs +119 -0
  58. smelt_ml-0.1.0/src/conformal/mod.rs +238 -0
  59. smelt_ml-0.1.0/src/data/mod.rs +208 -0
  60. smelt_ml-0.1.0/src/error.rs +36 -0
  61. smelt_ml-0.1.0/src/importance/mod.rs +187 -0
  62. smelt_ml-0.1.0/src/importance/shap.rs +272 -0
  63. smelt_ml-0.1.0/src/learner/adaboost.rs +273 -0
  64. smelt_ml-0.1.0/src/learner/bagging.rs +282 -0
  65. smelt_ml-0.1.0/src/learner/catboost.rs +716 -0
  66. smelt_ml-0.1.0/src/learner/des.rs +197 -0
  67. smelt_ml-0.1.0/src/learner/ebm.rs +290 -0
  68. smelt_ml-0.1.0/src/learner/geo_xgboost.rs +362 -0
  69. smelt_ml-0.1.0/src/learner/hist_pool.rs +148 -0
  70. smelt_ml-0.1.0/src/learner/histogram.rs +97 -0
  71. smelt_ml-0.1.0/src/learner/hoeffding.rs +444 -0
  72. smelt_ml-0.1.0/src/learner/knn.rs +154 -0
  73. smelt_ml-0.1.0/src/learner/lightgbm.rs +1036 -0
  74. smelt_ml-0.1.0/src/learner/linear_regression.rs +182 -0
  75. smelt_ml-0.1.0/src/learner/logistic_regression.rs +293 -0
  76. smelt_ml-0.1.0/src/learner/mod.rs +78 -0
  77. smelt_ml-0.1.0/src/learner/naive_bayes.rs +162 -0
  78. smelt_ml-0.1.0/src/learner/oblique.rs +763 -0
  79. smelt_ml-0.1.0/src/learner/quantile.rs +168 -0
  80. smelt_ml-0.1.0/src/learner/quantile_forest.rs +312 -0
  81. smelt_ml-0.1.0/src/learner/regularized.rs +418 -0
  82. smelt_ml-0.1.0/src/learner/stacking.rs +238 -0
  83. smelt_ml-0.1.0/src/learner/svm.rs +254 -0
  84. smelt_ml-0.1.0/src/learner/tree/decision_tree.rs +183 -0
  85. smelt_ml-0.1.0/src/learner/tree/extra_trees.rs +276 -0
  86. smelt_ml-0.1.0/src/learner/tree/gradient_boosting.rs +431 -0
  87. smelt_ml-0.1.0/src/learner/tree/mod.rs +362 -0
  88. smelt_ml-0.1.0/src/learner/tree/random_forest.rs +306 -0
  89. smelt_ml-0.1.0/src/learner/xgboost.rs +1084 -0
  90. smelt_ml-0.1.0/src/lib.rs +96 -0
  91. smelt_ml-0.1.0/src/measure/mod.rs +474 -0
  92. smelt_ml-0.1.0/src/multilabel/mod.rs +187 -0
  93. smelt_ml-0.1.0/src/multioutput/mod.rs +159 -0
  94. smelt_ml-0.1.0/src/prediction/mod.rs +83 -0
  95. smelt_ml-0.1.0/src/preprocess/adasyn.rs +191 -0
  96. smelt_ml-0.1.0/src/preprocess/encoder.rs +141 -0
  97. smelt_ml-0.1.0/src/preprocess/filter.rs +461 -0
  98. smelt_ml-0.1.0/src/preprocess/imputer.rs +124 -0
  99. smelt_ml-0.1.0/src/preprocess/label_encoder.rs +76 -0
  100. smelt_ml-0.1.0/src/preprocess/mod.rs +72 -0
  101. smelt_ml-0.1.0/src/preprocess/pca.rs +153 -0
  102. smelt_ml-0.1.0/src/preprocess/pipeline.rs +130 -0
  103. smelt_ml-0.1.0/src/preprocess/rfe.rs +175 -0
  104. smelt_ml-0.1.0/src/preprocess/scaler.rs +173 -0
  105. smelt_ml-0.1.0/src/preprocess/smote.rs +159 -0
  106. smelt_ml-0.1.0/src/resample/mod.rs +102 -0
  107. smelt_ml-0.1.0/src/resample/spatial.rs +173 -0
  108. smelt_ml-0.1.0/src/serialize.rs +98 -0
  109. smelt_ml-0.1.0/src/stats.rs +684 -0
  110. smelt_ml-0.1.0/src/survival/mod.rs +453 -0
  111. smelt_ml-0.1.0/src/task/mod.rs +163 -0
  112. smelt_ml-0.1.0/src/tuning/bayesian.rs +373 -0
  113. smelt_ml-0.1.0/src/tuning/grid_search.rs +104 -0
  114. smelt_ml-0.1.0/src/tuning/hyperband.rs +222 -0
  115. smelt_ml-0.1.0/src/tuning/mod.rs +127 -0
  116. smelt_ml-0.1.0/src/tuning/random_search.rs +146 -0
  117. smelt_ml-0.1.0/src/validate.rs +39 -0
  118. smelt_ml-0.1.0/tests/catboost_perf.py +40 -0
  119. smelt_ml-0.1.0/tests/catboost_perf.rs +95 -0
  120. smelt_ml-0.1.0/tests/integration.rs +5879 -0
  121. smelt_ml-0.1.0/tests/lightgbm_perf.py +40 -0
  122. smelt_ml-0.1.0/tests/lightgbm_perf.rs +93 -0
  123. smelt_ml-0.1.0/tests/real_benchmark.rs +106 -0
  124. smelt_ml-0.1.0/tests/xgboost_perf.rs +102 -0
  125. smelt_ml-0.1.0/tests/xgboost_vs_official.rs +280 -0
smelt_ml-0.1.0/.dockerignore
@@ -0,0 +1,4 @@
+ target/
+ .git/
+ docs/WOS/
+ *.profraw
smelt_ml-0.1.0/.gitignore
@@ -0,0 +1,3 @@
+ /target
+ docs/WOS/*.bib
+ catboost_info/
smelt_ml-0.1.0/CLAUDE.md
@@ -0,0 +1,118 @@
+ # Smelt — Machine Learning Framework for Rust
+
+ ## Overview
+
+ Smelt is an ML framework inspired by [mlr3](https://mlr3.mlr-org.com/) (R), designed for Rust's performance and safety guarantees. The name refers to smelting — refining raw data into useful models.
+
+ ## Architecture
+
+ ```
+ Task → Learner → TrainedModel → Prediction → Measure
+
+ Resampling (CV, Holdout)
+ Tuning (Grid, Random, Bayesian)
+ Preprocessing (Scale, Encode, Impute)
+ ```
+
+ ### Core Abstractions (mlr3 mapping)
+
+ | Smelt | mlr3 | Purpose |
+ |-------|------|---------|
+ | `Task` | `Task` | Data container with target |
+ | `ClassificationTask` | `TaskClassif` | Discrete target |
+ | `RegressionTask` | `TaskRegr` | Continuous target |
+ | `Learner` | `Learner` | Algorithm that trains |
+ | `TrainedModel` | trained Learner | Fitted model that predicts |
+ | `Prediction` | `Prediction` | Output with optional truth |
+ | `Measure` | `Measure` | Evaluation metric |
+ | `Resample` | `Resampling` | Train/test splitting strategy |
+
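Reviewer note: the flow in the Architecture diagram and the roles in the table above can be illustrated with a small, std-only sketch. Everything below is hypothetical: the trait methods, the `MeanLearner`, `MeanModel`, and `Mae` types, and the use of `Vec<Vec<f64>>` in place of `ndarray` are assumptions made for illustration and are not taken from the smelt sources.

```rust
// Hypothetical sketch of Task -> Learner -> TrainedModel -> Prediction -> Measure.
// None of these signatures are taken from smelt; plain Vecs stand in for ndarray.

/// Data container with a continuous target.
pub struct RegressionTask {
    pub features: Vec<Vec<f64>>, // one row per observation
    pub target: Vec<f64>,
}

/// Output with optional truth.
pub struct Prediction {
    pub response: Vec<f64>,
    pub truth: Option<Vec<f64>>,
}

/// Algorithm that trains.
pub trait Learner {
    type Model: TrainedModel;
    fn train(&self, task: &RegressionTask) -> Self::Model;
}

/// Fitted model that predicts.
pub trait TrainedModel {
    fn predict(&self, features: &[Vec<f64>]) -> Vec<f64>;
}

/// Evaluation metric; returns None when no usable truth is attached.
pub trait Measure {
    fn score(&self, prediction: &Prediction) -> Option<f64>;
}

/// Toy learner that always predicts the mean of the training target.
pub struct MeanLearner;

pub struct MeanModel {
    mean: f64,
}

impl Learner for MeanLearner {
    type Model = MeanModel;
    fn train(&self, task: &RegressionTask) -> MeanModel {
        let n = task.target.len().max(1) as f64;
        MeanModel { mean: task.target.iter().sum::<f64>() / n }
    }
}

impl TrainedModel for MeanModel {
    fn predict(&self, features: &[Vec<f64>]) -> Vec<f64> {
        vec![self.mean; features.len()]
    }
}

/// Mean absolute error.
pub struct Mae;

impl Measure for Mae {
    fn score(&self, p: &Prediction) -> Option<f64> {
        let truth = p.truth.as_ref()?;
        if truth.is_empty() || truth.len() != p.response.len() {
            return None;
        }
        let total: f64 = truth.iter().zip(&p.response).map(|(t, r)| (t - r).abs()).sum();
        Some(total / truth.len() as f64)
    }
}

/// Runs the whole flow once on three rows and returns the MAE.
pub fn run_pipeline() -> Option<f64> {
    let task = RegressionTask {
        features: vec![vec![0.0], vec![1.0], vec![2.0]],
        target: vec![1.0, 2.0, 3.0],
    };
    let model = MeanLearner.train(&task);
    let prediction = Prediction {
        response: model.predict(&task.features),
        truth: Some(task.target.clone()),
    };
    Mae.score(&prediction)
}
```

The associated `Model` type ties each learner to the model it produces, so the compiler checks that `train` and `predict` line up without any dynamic dispatch. Whether smelt uses associated types or trait objects here is not stated in the file.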
+ ### Module Structure
+
+ ```
+ src/
+ ├── lib.rs # Public API + prelude
+ ├── error.rs # SmeltError enum (thiserror)
+ ├── task/mod.rs # Task, ClassificationTask, RegressionTask
+ ├── learner/mod.rs # Learner trait, TrainedModel trait
+ ├── prediction/ # Prediction enum (Classification/Regression)
+ ├── measure/ # Accuracy, RMSE, MAE (+ trait Measure)
+ ├── resample/ # CrossValidation, Holdout (+ trait Resample)
+ ├── preprocess/ # TODO: StandardScaler, MinMaxScaler, OneHotEncoder
+ └── tuning/ # TODO: GridSearch, RandomSearch, BayesianOpt
+ ```
+
+ ## Build & Test
+
+ ```bash
+ cargo check # Type check
+ cargo test # Run tests
+ cargo bench # Run benchmarks (criterion)
+ cargo doc --open # Generate docs
+ ```
+
+ ## Design Principles
+
+ 1. **Type safety** — Classification and Regression are separate types, not runtime tags
+ 2. **Trait-based extensibility** — Implement `Learner` to add new algorithms
+ 3. **Zero-copy where possible** — ndarray views, references over clones
+ 4. **Parallel by default** — rayon for data parallelism (CV folds, ensemble training)
+ 5. **Composable pipeline** — Task → Learner → Prediction → Measure is always the flow
+
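Reviewer note: the first design principle (separate types rather than runtime tags) can be sketched as follows. This is one possible encoding, not the crate's: the file itself lists a `Prediction` enum, whereas this sketch assumes two distinct prediction structs and a `Measure<P>` trait parameterized by the prediction type. All names and signatures here are hypothetical.

```rust
// Hypothetical sketch: none of these types or signatures are taken from smelt.

/// Predictions for a discrete target (class indices).
pub struct ClassificationPrediction {
    pub response: Vec<usize>,
    pub truth: Vec<usize>,
}

/// Predictions for a continuous target.
pub struct RegressionPrediction {
    pub response: Vec<f64>,
    pub truth: Vec<f64>,
}

/// A measure is tied to the prediction type it can score.
pub trait Measure<P> {
    fn score(&self, prediction: &P) -> f64;
}

/// Fraction of correctly predicted classes.
pub struct Accuracy;

impl Measure<ClassificationPrediction> for Accuracy {
    fn score(&self, p: &ClassificationPrediction) -> f64 {
        // Assumes non-empty vectors of equal length.
        let hits = p.response.iter().zip(&p.truth).filter(|(r, t)| r == t).count();
        hits as f64 / p.truth.len() as f64
    }
}

/// Root mean squared error.
pub struct Rmse;

impl Measure<RegressionPrediction> for Rmse {
    fn score(&self, p: &RegressionPrediction) -> f64 {
        // Assumes non-empty vectors of equal length.
        let sse: f64 = p.response.iter().zip(&p.truth).map(|(r, t)| (r - t).powi(2)).sum();
        (sse / p.truth.len() as f64).sqrt()
    }
}

// Because `Accuracy` only implements `Measure<ClassificationPrediction>`,
// calling `Accuracy.score(&some_regression_prediction)` is a compile error
// rather than a runtime check.
```

With this shape, pairing a classification metric with regression output is rejected at compile time, which is the kind of guarantee the principle describes.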
+ ## Implementation Roadmap
+
+ ### Phase 1 — Core (current)
+ - [x] Task system (Classification + Regression)
+ - [x] Learner + TrainedModel traits
+ - [x] Prediction enum
+ - [x] Measures: Accuracy, RMSE, MAE
+ - [x] Resampling: CrossValidation, Holdout
+
+ ### Phase 2 — First Learners
+ - [ ] Decision Tree (CART)
+ - [ ] K-Nearest Neighbors
+ - [ ] Logistic Regression
+ - [ ] Linear Regression (OLS)
+ - [ ] Benchmark pipeline (resample + measure loop)
+
+ ### Phase 3 — Ensembles
+ - [ ] Random Forest
+ - [ ] Gradient Boosting (XGBoost-style)
+ - [ ] Bagging
+
+ ### Phase 4 — Preprocessing
+ - [ ] StandardScaler, MinMaxScaler
+ - [ ] OneHotEncoder, LabelEncoder
+ - [ ] Missing value imputation
+ - [ ] Pipeline chaining (preprocess → learner)
+
+ ### Phase 5 — Tuning
+ - [ ] GridSearch
+ - [ ] RandomSearch
+ - [ ] Bayesian Optimization
+
+ ### Phase 6 — Advanced
+ - [ ] Feature importance (permutation, SHAP-like)
+ - [ ] Spatial cross-validation (for geo applications)
+ - [ ] CSV/Parquet data loading
+ - [ ] Model serialization (serde)
+ - [ ] Python bindings (PyO3) — expose as `smelt-py`
+
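Reviewer note: the roadmap names `CrossValidation` and a "resample + measure loop" without showing either. A minimal sketch of both, under stated assumptions: folds are contiguous and unshuffled (the crate lists `rand` as a dependency, so it presumably shuffles, but that is not shown in this file), and `k_fold_indices` and `cross_validate` are hypothetical names, not smelt functions.

```rust
// Hypothetical sketch of k-fold splitting and a resample + measure loop.
// The function names and the unshuffled, contiguous folds are assumptions.

/// Returns (train_indices, test_indices) for each of `k` folds over `n` rows.
pub fn k_fold_indices(n: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(k >= 2 && k <= n, "need 2 <= k <= n");
    let base = n / k;
    let extra = n % k; // the first `extra` folds get one additional row
    let mut folds = Vec::with_capacity(k);
    let mut start = 0;
    for fold in 0..k {
        let size = base + usize::from(fold < extra);
        let end = start + size;
        let test: Vec<usize> = (start..end).collect();
        let train: Vec<usize> = (0..start).chain(end..n).collect();
        folds.push((train, test));
        start = end;
    }
    folds
}

/// Calls `score_fold(train, test)` once per fold and averages the results.
pub fn cross_validate<F>(n: usize, k: usize, mut score_fold: F) -> f64
where
    F: FnMut(&[usize], &[usize]) -> f64,
{
    let folds = k_fold_indices(n, k);
    let total: f64 = folds
        .iter()
        .map(|(train, test)| score_fold(train.as_slice(), test.as_slice()))
        .sum();
    total / folds.len() as f64
}
```

In a real benchmark the closure would train a learner on the `train` rows, predict on the `test` rows, and return a measure such as RMSE. Since the folds are independent, this loop is also a natural place for the rayon parallelism mentioned under Design Principles.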
+ ## Dependencies
+
+ - `ndarray` — N-dimensional arrays (feature matrices)
+ - `rand` — Random number generation (resampling, stochastic algorithms)
+ - `rayon` — Data parallelism
+ - `thiserror` — Error types
+ - `serde` — Serialization
+ - `criterion` — Benchmarks (dev)
+
+ ## Author
+
+ Francisco Parra — francisco.parra.o@usach.cl
+
+ ## Inspiration
+
+ - [mlr3](https://mlr3.mlr-org.com/) (R) — Task/Learner/Measure architecture
+ - [scikit-learn](https://scikit-learn.org/) (Python) — fit/predict API
+ - [linfa](https://github.com/rust-ml/linfa) (Rust) — Existing Rust ML, but different design philosophy