cv-score-predict 0.1.1__tar.gz

@@ -0,0 +1,9 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Danu ANDRIES
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
6
+
7
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
8
+
9
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,2 @@
1
+ exclude *.ipynb *.txt
2
+ recursive-exclude . *catboost*
@@ -0,0 +1,170 @@
1
+ Metadata-Version: 2.4
2
+ Name: cv-score-predict
3
+ Version: 0.1.1
4
+ Summary: Cross-validated ensemble prediction with LGBM, XGBoost, and CatBoost — with safe categorical handling, multi-seed averaging, and artifact return.
5
+ Author-email: Danu ANDRIES <danu@andries.lu>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/Karabush/cv-score-predict
8
+ Project-URL: Repository, https://github.com/Karabush/cv-score-predict
9
+ Project-URL: Documentation, https://github.com/Karabush/cv-score-predict#readme
10
+ Keywords: cross-validation,ensemble learning,model averaging,LightGBM,XGBoost,CatBoost,categorical encoding,OrdinalEncoder,out-of-fold prediction,OOF,multi-seed CV,repeated cross-validation,early stopping,scikit-learn compatible,pandas,machine learning,classification,regression,model validation,kaggle,safe preprocessing,data leakage prevention,boosting ensemble
11
+ Classifier: Programming Language :: Python :: 3
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
15
+ Requires-Python: >=3.9
16
+ Description-Content-Type: text/markdown
17
+ License-File: LICENSE
18
+ Requires-Dist: numpy>=1.21
19
+ Requires-Dist: pandas>=1.3
20
+ Requires-Dist: scikit-learn>=1.4
21
+ Requires-Dist: lightgbm>=3.3
22
+ Requires-Dist: xgboost>=1.7
23
+ Requires-Dist: catboost>=1.2
24
+ Dynamic: license-file
25
+
26
+ # cv-score-predict
27
+
28
+ A robust utility for **cross-validated ensemble prediction** with LightGBM, XGBoost and CatBoost. Handles categorical features safely, supports custom preprocessing pipelines, repeated CV over multiple seeds, early stopping, and returns out-of-fold (OOF) predictions, test predictions, trained models, and fitted encoder — all in one call.
29
+
30
+ Designed for **kagglers, ML engineers, and data scientists** who need reliable, leakage-free CV with minimal boilerplate.
31
+
32
+ ---
33
+
34
+ ## ✨ Key Features
35
+
36
+ - **Multi-model ensembling**: Train and average predictions from LGBM, XGBoost and CatBoost in a single CV run.
37
+ - **Native categorical support**: Automatically encodes string/categorical columns with `OrdinalEncoder` and configures models (e.g., `cat_features` for CatBoost, `enable_categorical` for XGBoost).
38
+ - **Safe preprocessing**: Integrates any scikit-learn-compatible `processor` pipeline (e.g., `ColumnTransformer`, `Pipeline`) — fitted **per fold** to prevent data leakage.
39
+ - **Repeated CV**: Average results over multiple random seeds for stable metrics.
40
+ - **Early stopping**: Enabled by default for all models using fold-wise validation.
41
+ - **Full artifact return**: Get OOF predictions, test predictions, trained models, and fitted `OrdinalEncoder` for later use.
42
+
43
+ ---
44
+
45
+ ## 📥 Parameters
46
+
47
+ | Parameter | Type | Default | Description |
48
+ |----------|------|--------|-------------|
49
+ | `X` | `pd.DataFrame` | — | Training features. |
50
+ | `y` | `Union[pd.Series, np.ndarray]` | — | Target values. |
51
+ | `X_test` | `Optional[pd.DataFrame]` | `None` | Test set for final prediction. If `None`, no test predictions are returned. |
52
+ | `pred_type` | `str` | — | Either `'classification'` or `'regression'` (**required**). |
53
+ | `processor` | `Optional[object]` | `None` | Preprocessing pipeline with `fit_transform` and `transform` methods. Must return a `pd.DataFrame` (use `set_output(transform='pandas')`). If `None`, features are passed through unchanged. |
54
+ | `process_categorical` | `bool` | `True` | If `True`, object/category columns are encoded with `OrdinalEncoder` (using `-1` for missing/unseen) and converted to pandas `category` dtype for model compatibility. |
55
+ | `models` | `Union[List[str], str]` | `('lgb', 'xgb', 'cb')` | Models to ensemble. Supported: `'lgb'` (LightGBM), `'xgb'` (XGBoost), `'cb'` (CatBoost). |
56
+ | `params_dict` | `Optional[Dict[str, dict]]` | `None` | Model-specific hyperparameters. Keys: model names; values: param dicts. |
57
+ | `scoring_dict` | `Optional[Dict[str, Callable]]` | `None` | Metrics for evaluation. Keys: metric names; values: scoring functions (e.g., `roc_auc_score`). Defaults: `{'roc_auc': roc_auc_score}` (classification), `{'rmse': rmse_fn}` (regression). |
58
+ | `decision_threshold` | `float` | `0.5` | Threshold to convert probabilities to class labels (classification only). |
59
+ | `n_splits` | `int` | `5` | Number of cross-validation folds. |
60
+ | `random_state` | `Union[int, List[int]]` | `42` | Seed(s) for reproducibility. If a list, CV is repeated for each seed and results are averaged. |
61
+ | `early_stopping_rounds` | `int` | `50` | Early stopping rounds for boosting models (if not overridden in `params_dict`). |
62
+ | `verbose` | `int` | `2` | Logging level: `2` = full per-fold details, `1` = final summary, `0` = silent. |
63
+ | `return_trained` | `bool` | `False` | If `True`, returns list of trained model instances (one per model × fold × seed). |
64
+ | `return_oe` | `bool` | `False` | If `True` and `process_categorical=True`, returns the fitted `OrdinalEncoder`. |
65
+ | `predict_proba` | `bool` | `True` | For classification: if `True`, return probabilities; if `False`, return binary labels (using `decision_threshold`). Ignored for regression. |
66
+ ---
67
+
68
+ ## 🚀 Installation
69
+
70
+ ```bash
71
+ pip install cv-score-predict
72
+ ```
73
+
74
+ Requirements:
75
+
76
+ * Python ≥ 3.9
77
+ * Dependencies (automatically installed):
78
+ numpy ≥1.21, pandas ≥1.3, scikit-learn ≥1.4, lightgbm ≥3.3, xgboost ≥1.7, catboost ≥1.2
79
+
80
+ ---
81
+
82
+ ## 📌 Basic Usage
83
+ ```python
84
+ import pandas as pd
85
+ import numpy as np
86
+ from sklearn.preprocessing import StandardScaler
87
+ from sklearn.compose import make_column_transformer
88
+ from cv_score_predict import cv_score_predict
89
+
90
+ # Simulate data
91
+ X = pd.DataFrame({
92
+ "num": [1, 2, 3, 4, 5, 6, 7, 8],
93
+ "cat": ["A", "B", "A", "C", "B", "A", "C", "D"]
94
+ })
95
+ y = [0, 1, 0, 1, 1, 0, 1, 0]
96
+ X_test = pd.DataFrame({"num": [9, 10], "cat": ["B", "E"]})
97
+
98
+ # Optional processor (applied per fold!)
99
+ processor = make_column_transformer(
100
+ (StandardScaler(), ["num"]),
101
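+ remainder="passthrough", verbose_feature_names_out=False,  # keep original column names
102
+ ).set_output(transform="pandas")  # the processor must return a DataFrame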
103
+
104
+ # Run CV with 3 seeds → results averaged over seeds & folds
105
+ oof_pred, test_pred, _, _ = cv_score_predict(
106
+ X=X,
107
+ y=y,
108
+ X_test=X_test,
109
+ pred_type="classification",
110
+ processor=processor,
111
+ models=["lgb", "xgb"],
112
+ process_categorical=True,
113
+ random_state=[42, 123, 999],
114
+ n_splits=3,
115
+ verbose=2,
116
+ )
117
+ ```
118
+
119
+ Output will show scores per seed, then final averaged metrics.
120
+
121
+ ---
122
+
123
+ ## 🔧 Advanced Usage: Reuse Artifacts for New Data
124
+ ```python
125
+ # Step 1: Run CV and return artifacts
126
+ oof, _, trained_models, oe = cv_score_predict(
127
+ X,
128
+ y,
129
+ X_test=None, # we'll predict manually
130
+ pred_type="classification",
131
+ processor=processor,
132
+ models=["lgb", "cb"],
133
+ process_categorical=True,
134
+ random_state=[42, 123],
135
+ n_splits=4,  # the toy data has only 4 samples per class
136
+ return_trained=True,
137
+ return_oe=True,
138
+ )
139
+
140
+ # Step 2: For deployment: refit processor on FULL TRAINING data
141
+ # First: encode categoricals using returned oe
142
+ cat_cols = ["cat"]
143
+ X_full = X.copy()
144
+ X_full[cat_cols] = oe.transform(X_full[cat_cols]).astype('category')
145
+ X_new = pd.DataFrame({"num": [7, 8], "cat": [None, "A"]})
146
+
147
+ # Fit processor on full encoded data
148
+ processor = make_column_transformer(
149
+ (StandardScaler(), ["num"]),
150
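+ remainder="passthrough", verbose_feature_names_out=False,  # keep original column names
151
+ ).set_output(transform="pandas")  # the processor must return a DataFrame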
152
+ processor.fit(X_full)
153
+
154
+ # Apply to new data
155
+ X_new_proc = X_new.copy()
156
+ X_new_proc[cat_cols] = oe.transform(X_new_proc[cat_cols]).astype('category')
157
+ X_new_proc = processor.transform(X_new_proc)
158
+
159
+ # Predict with all trained models and average
160
+ preds = [model.predict_proba(X_new_proc)[:, 1] for model in trained_models]
161
+ final_pred = np.mean(preds, axis=0)
162
+ ```
163
+ ## 📝 Notes
164
+ - Categorical columns are encoded with `OrdinalEncoder(dtype=np.int32)` and converted to `category` dtype for model compatibility.
165
+ - Always use `set_output(transform="pandas")` in sklearn pipelines to preserve dtypes.
166
+ - The processor used in CV is refit on each fold to prevent data leakage, so there is no single global version. For deployment, refit your preprocessing pipeline on the full training set (as shown in the advanced example).
167
+
168
+ ## 📄 License
169
+ This project is licensed under the MIT License.
170
+ See the LICENSE file for details.
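The repeated-CV averaging described in the README above can be sketched in plain Python. In the `core.py` hunk, each fold adds `fold_test_preds / n_splits` to a per-seed total, and each seed adds `test_preds / len(random_states)` to the grand total. The sketch below uses made-up numbers and a hypothetical helper, `average_over_seeds_and_folds`, and only checks that this running accumulation matches the plain mean over all seed × fold predictions when every seed uses the same number of folds.

```python
from statistics import fmean


def average_over_seeds_and_folds(preds_by_seed):
    """Accumulate predictions the way the CV loop does.

    preds_by_seed[s][f] is the prediction for one test row from seed s, fold f
    (illustrative scalars, not real model output).
    """
    n_seeds = len(preds_by_seed)
    total = 0.0
    for fold_preds in preds_by_seed:
        n_splits = len(fold_preds)
        seed_total = 0.0
        for p in fold_preds:
            seed_total += p / n_splits   # average across folds within a seed
        total += seed_total / n_seeds    # average across seeds
    return total


# Two seeds, three folds each (made-up values).
preds_by_seed = [
    [0.62, 0.58, 0.66],
    [0.60, 0.64, 0.59],
]
running = average_over_seeds_and_folds(preds_by_seed)
flat_mean = fmean(p for fold_preds in preds_by_seed for p in fold_preds)
```

Both quantities agree up to floating-point rounding, which is why a single averaged test prediction can be reported no matter how many seeds are supplied.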
@@ -0,0 +1,145 @@
1
+ # cv-score-predict
2
+
3
+ A robust utility for **cross-validated ensemble prediction** with LightGBM, XGBoost and CatBoost. Handles categorical features safely, supports custom preprocessing pipelines, repeated CV over multiple seeds, early stopping, and returns out-of-fold (OOF) predictions, test predictions, trained models, and fitted encoder — all in one call.
4
+
5
+ Designed for **kagglers, ML engineers, and data scientists** who need reliable, leakage-free CV with minimal boilerplate.
6
+
7
+ ---
8
+
9
+ ## ✨ Key Features
10
+
11
+ - **Multi-model ensembling**: Train and average predictions from LGBM, XGBoost and CatBoost in a single CV run.
12
+ - **Native categorical support**: Automatically encodes string/categorical columns with `OrdinalEncoder` and configures models (e.g., `cat_features` for CatBoost, `enable_categorical` for XGBoost).
13
+ - **Safe preprocessing**: Integrates any scikit-learn-compatible `processor` pipeline (e.g., `ColumnTransformer`, `Pipeline`) — fitted **per fold** to prevent data leakage.
14
+ - **Repeated CV**: Average results over multiple random seeds for stable metrics.
15
+ - **Early stopping**: Enabled by default for all models using fold-wise validation.
16
+ - **Full artifact return**: Get OOF predictions, test predictions, trained models, and fitted `OrdinalEncoder` for later use.
17
+
18
+ ---
19
+
20
+ ## 📥 Parameters
21
+
22
+ | Parameter | Type | Default | Description |
23
+ |----------|------|--------|-------------|
24
+ | `X` | `pd.DataFrame` | — | Training features. |
25
+ | `y` | `Union[pd.Series, np.ndarray]` | — | Target values. |
26
+ | `X_test` | `Optional[pd.DataFrame]` | `None` | Test set for final prediction. If `None`, no test predictions are returned. |
27
+ | `pred_type` | `str` | — | Either `'classification'` or `'regression'` (**required**). |
28
+ | `processor` | `Optional[object]` | `None` | Preprocessing pipeline with `fit_transform` and `transform` methods. Must return a `pd.DataFrame` (use `set_output(transform='pandas')`). If `None`, features are passed through unchanged. |
29
+ | `process_categorical` | `bool` | `True` | If `True`, object/category columns are encoded with `OrdinalEncoder` (using `-1` for missing/unseen) and converted to pandas `category` dtype for model compatibility. |
30
+ | `models` | `Union[List[str], str]` | `('lgb', 'xgb', 'cb')` | Models to ensemble. Supported: `'lgb'` (LightGBM), `'xgb'` (XGBoost), `'cb'` (CatBoost). |
31
+ | `params_dict` | `Optional[Dict[str, dict]]` | `None` | Model-specific hyperparameters. Keys: model names; values: param dicts. |
32
+ | `scoring_dict` | `Optional[Dict[str, Callable]]` | `None` | Metrics for evaluation. Keys: metric names; values: scoring functions (e.g., `roc_auc_score`). Defaults: `{'roc_auc': roc_auc_score}` (classification), `{'rmse': rmse_fn}` (regression). |
33
+ | `decision_threshold` | `float` | `0.5` | Threshold to convert probabilities to class labels (classification only). |
34
+ | `n_splits` | `int` | `5` | Number of cross-validation folds. |
35
+ | `random_state` | `Union[int, List[int]]` | `42` | Seed(s) for reproducibility. If a list, CV is repeated for each seed and results are averaged. |
36
+ | `early_stopping_rounds` | `int` | `50` | Early stopping rounds for boosting models (if not overridden in `params_dict`). |
37
+ | `verbose` | `int` | `2` | Logging level: `2` = full per-fold details, `1` = final summary, `0` = silent. |
38
+ | `return_trained` | `bool` | `False` | If `True`, returns list of trained model instances (one per model × fold × seed). |
39
+ | `return_oe` | `bool` | `False` | If `True` and `process_categorical=True`, returns the fitted `OrdinalEncoder`. |
40
+ | `predict_proba` | `bool` | `True` | For classification: if `True`, return probabilities; if `False`, return binary labels (using `decision_threshold`). Ignored for regression. |
41
+ ---
42
+
43
+ ## 🚀 Installation
44
+
45
+ ```bash
46
+ pip install cv-score-predict
47
+ ```
48
+
49
+ Requirements:
50
+
51
+ * Python ≥ 3.9
52
+ * Dependencies (automatically installed):
53
+ numpy ≥1.21, pandas ≥1.3, scikit-learn ≥1.4, lightgbm ≥3.3, xgboost ≥1.7, catboost ≥1.2
54
+
55
+ ---
56
+
57
+ ## 📌 Basic Usage
58
+ ```python
59
+ import pandas as pd
60
+ import numpy as np
61
+ from sklearn.preprocessing import StandardScaler
62
+ from sklearn.compose import make_column_transformer
63
+ from cv_score_predict import cv_score_predict
64
+
65
+ # Simulate data
66
+ X = pd.DataFrame({
67
+ "num": [1, 2, 3, 4, 5, 6, 7, 8],
68
+ "cat": ["A", "B", "A", "C", "B", "A", "C", "D"]
69
+ })
70
+ y = [0, 1, 0, 1, 1, 0, 1, 0]
71
+ X_test = pd.DataFrame({"num": [9, 10], "cat": ["B", "E"]})
72
+
73
+ # Optional processor (applied per fold!)
74
+ processor = make_column_transformer(
75
+ (StandardScaler(), ["num"]),
76
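+ remainder="passthrough", verbose_feature_names_out=False,  # keep original column names
77
+ ).set_output(transform="pandas")  # the processor must return a DataFrame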
78
+
79
+ # Run CV with 3 seeds → results averaged over seeds & folds
80
+ oof_pred, test_pred, _, _ = cv_score_predict(
81
+ X=X,
82
+ y=y,
83
+ X_test=X_test,
84
+ pred_type="classification",
85
+ processor=processor,
86
+ models=["lgb", "xgb"],
87
+ process_categorical=True,
88
+ random_state=[42, 123, 999],
89
+ n_splits=3,
90
+ verbose=2,
91
+ )
92
+ ```
93
+
94
+ Output will show scores per seed, then final averaged metrics.
95
+
96
+ ---
97
+
98
+ ## 🔧 Advanced Usage: Reuse Artifacts for New Data
99
+ ```python
100
+ # Step 1: Run CV and return artifacts
101
+ oof, _, trained_models, oe = cv_score_predict(
102
+ X,
103
+ y,
104
+ X_test=None, # we'll predict manually
105
+ pred_type="classification",
106
+ processor=processor,
107
+ models=["lgb", "cb"],
108
+ process_categorical=True,
109
+ random_state=[42, 123],
110
+ n_splits=4,  # the toy data has only 4 samples per class
111
+ return_trained=True,
112
+ return_oe=True,
113
+ )
114
+
115
+ # Step 2: For deployment: refit processor on FULL TRAINING data
116
+ # First: encode categoricals using returned oe
117
+ cat_cols = ["cat"]
118
+ X_full = X.copy()
119
+ X_full[cat_cols] = oe.transform(X_full[cat_cols]).astype('category')
120
+ X_new = pd.DataFrame({"num": [7, 8], "cat": [None, "A"]})
121
+
122
+ # Fit processor on full encoded data
123
+ processor = make_column_transformer(
124
+ (StandardScaler(), ["num"]),
125
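+ remainder="passthrough", verbose_feature_names_out=False,  # keep original column names
126
+ ).set_output(transform="pandas")  # the processor must return a DataFrame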
127
+ processor.fit(X_full)
128
+
129
+ # Apply to new data
130
+ X_new_proc = X_new.copy()
131
+ X_new_proc[cat_cols] = oe.transform(X_new_proc[cat_cols]).astype('category')
132
+ X_new_proc = processor.transform(X_new_proc)
133
+
134
+ # Predict with all trained models and average
135
+ preds = [model.predict_proba(X_new_proc)[:, 1] for model in trained_models]
136
+ final_pred = np.mean(preds, axis=0)
137
+ ```
138
+ ## 📝 Notes
139
+ - Categorical columns are encoded with `OrdinalEncoder(dtype=np.int32)` and converted to `category` dtype for model compatibility.
140
+ - Always use `set_output(transform="pandas")` in sklearn pipelines to preserve dtypes.
141
+ - The processor used in CV is refit on each fold to prevent data leakage, so there is no single global version. For deployment, refit your preprocessing pipeline on the full training set (as shown in the advanced example).
142
+
143
+ ## 📄 License
144
+ This project is licensed under the MIT License.
145
+ See the LICENSE file for details.
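The `scoring_dict` row in the parameter table does not say what each metric receives as its second argument. Judging by the `core.py` hunk further down, classification metrics whose key contains `roc`, `auc`, `logloss` or `log_loss` are called with probabilities, and every other metric is called with labels thresholded at `decision_threshold`. The following is a minimal sketch of that routing in plain Python; `route_metric_input` is a hypothetical name and not part of the package.

```python
# Substrings that, per the core.py hunk, mark a metric as probability-based.
PROBA_KEYWORDS = ("roc", "auc", "logloss", "log_loss")


def route_metric_input(metric_name, probabilities, decision_threshold=0.5):
    """Return what a metric registered under `metric_name` would be scored on."""
    name = metric_name.lower()
    if any(keyword in name for keyword in PROBA_KEYWORDS):
        return list(probabilities)  # raw probabilities
    # Everything else is scored on hard labels.
    return [int(p >= decision_threshold) for p in probabilities]


probs = [0.2, 0.5, 0.9]
auc_input = route_metric_input("roc_auc", probs)
f1_input = route_metric_input("f1", probs)
brier_input = route_metric_input("brier", probs)
```

Because the match is on the key name, a probability-based metric registered under a key without one of those substrings (for example `brier`) would be given thresholded labels, so the choice of key matters.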
@@ -0,0 +1,4 @@
1
+ from .core import cv_score_predict
2
+
3
+ __version__ = "0.1.1"
4
+ __all__ = ["cv_score_predict"]
@@ -0,0 +1,404 @@
1
+ from typing import (
2
+ Union, List, Dict, Tuple, Optional, Callable, Any, Literal, TypeVar
3
+ )
4
+ import numpy as np
5
+ import pandas as pd
6
+ import lightgbm as lgb
7
+ import xgboost as xgb
8
+ import catboost as cb
9
+
10
+ from sklearn.model_selection import StratifiedKFold, KFold
11
+ from sklearn.metrics import roc_auc_score, mean_squared_error
12
+ from sklearn.preprocessing import OrdinalEncoder
13
+ from sklearn.base import BaseEstimator, TransformerMixin
14
+
15
+ # Type aliases
16
+ ModelKey = Literal['lgb', 'xgb', 'cb']
17
+ PredictionType = Literal['classification', 'regression']
18
+
19
+ def cv_score_predict(
20
+ X: pd.DataFrame,
21
+ y: Union[pd.Series, np.ndarray],
22
+ X_test: Optional[pd.DataFrame] = None,
23
+ pred_type: Optional[PredictionType] = None, # 'classification' or 'regression' (mandatory; validated at call time)
24
+ processor: Optional[Union[BaseEstimator, TransformerMixin]] = None,
25
+ process_categorical: bool = True,
26
+ models: Union[List[ModelKey], ModelKey] = ('lgb', 'xgb', 'cb'),
27
+ params_dict: Optional[Dict[str, dict]] = None,
28
+ scoring_dict: Optional[Dict[str, Callable]] = None,
29
+ decision_threshold: float = 0.5,
30
+ n_splits: int = 5,
31
+ random_state: Union[int, List[int]] = 42,
32
+ early_stopping_rounds: int = 50,
33
+ verbose: int = 2,
34
+ return_trained: bool = False,
35
+ return_oe: bool = False,
36
+ predict_proba: bool = True,
37
+ ) -> Tuple[np.ndarray, Optional[np.ndarray], Optional[List[Any]], Optional[OrdinalEncoder]]:
38
+ """
39
+ Cross-validate supported estimators (optionally repeated over multiple seeds),
40
+ collect out‑of‑fold (OOF) predictions for scoring, and produce averaged
41
+ test‑set predictions for final use. Accepts a scikit‑learn style processor
42
+ pipeline to apply on each fold.
43
+
44
+ Important behavior
45
+ ------------------
46
+ - Early stopping: estimators are trained with early stopping on each fold's
47
+ validation set. Final test predictions (when `X_test` is provided) are
48
+ produced by the early‑stopped estimators from each fold and averaged.
49
+ - Processor contract: if a `processor` is provided it **must** return a
50
+ pandas DataFrame from `.fit_transform` and `.transform`. To guarantee
51
+ this, call `pipeline.set_output(transform='pandas')`. This preserves
52
+ column names and dtypes (including `category`) required by some models.
53
+
54
+ Parameters
55
+ ----------
56
+ X : pd.DataFrame
57
+ Training features.
58
+ y : pd.Series or np.ndarray
59
+ Target values.
60
+ X_test : pd.DataFrame or None, optional
61
+ Final test set to predict. If None, no test predictions are produced.
62
+ pred_type : str
63
+ Either 'classification' or 'regression'.
64
+ processor : object or None, optional
65
+ Preprocessing pipeline with `fit_transform` and `transform` methods.
66
+ Must return a pandas DataFrame.
67
+ process_categorical : bool, default True
68
+ If True, object/category columns are encoded using an OrdinalEncoder
69
+ fitted on the training DataFrame only (no leakage), then converted to
70
+ pandas `category` dtype so libraries that auto‑detect categories work
71
+ correctly. If False, the user is responsible for categorical handling
72
+ (for example, inside `processor`).
73
+ models : list or str, default ('lgb', 'xgb', 'cb')
74
+ Model keys to train. Supported values: 'lgb', 'xgb', 'cb'.
75
+ params_dict : dict or None, optional
76
+ Mapping `model_name -> dict` of model parameters. If None, sensible
77
+ defaults are used. Per‑model entries override top‑level defaults.
78
+ scoring_dict : dict or None, optional
79
+ Mapping `metric_name -> callable(y_true, y_pred_or_proba)`. If None,
80
+ defaults are provided (classification: ROC AUC; regression: RMSE).
81
+ decision_threshold : float, default 0.5
82
+ Threshold to convert probabilities to class labels for threshold‑based metrics.
83
+ n_splits : int, default 5
84
+ Number of CV folds.
85
+ random_state : int or list of ints, default 42
86
+ Single seed or list of seeds to repeat CV. Results are averaged across seeds.
87
+ early_stopping_rounds : int, default 50
88
+ Default early stopping rounds used when model params do not override it.
89
+ verbose : int, default 2
90
+ 2 prints detailed per‑fold/per‑model scores,
91
+ 1 prints only final averaged scores,
92
+ 0 prints nothing.
93
+ return_trained : bool, default False
94
+ If True, return the list of trained estimator instances (one per model
95
+ per fold per seed). If False (default), trained estimators are not
96
+ accumulated and `None` is returned in that position to save memory.
97
+ return_oe : bool, default False
98
+ If True, return the fitted `OrdinalEncoder` instance (or None if no
99
+ categorical processing was performed). Returning `oe` lets the user
100
+ reproduce categorical encoding on new raw data; the user must still
101
+ apply `oe` first and then the same `processor` (if used) before predicting.
102
+ predict_proba : bool, default True
103
+ For classification: if True return probabilities; if False return binary
104
+ labels using `decision_threshold`. Ignored for regression.
105
+
106
+ Returns
107
+ -------
108
+ oof_preds_total : np.ndarray
109
+ Averaged out‑of‑fold predictions (probabilities for classification when
110
+ `predict_proba=True`, otherwise binary labels for classification;
111
+ predictions for regression).
112
+ test_preds_total : np.ndarray or None
113
+ Averaged test‑set predictions across folds and seeds, or None if
114
+ `X_test` is None.
115
+ trained_models_or_none : list or None
116
+ If `return_trained=True`, the list of trained model instances (order preserved).
117
+ Otherwise None.
118
+ oe_or_none : OrdinalEncoder or None
119
+ The fitted `OrdinalEncoder` if `return_oe=True` and categorical processing
120
+ was performed; otherwise None.
121
+ """
122
+ # Input Validation
123
+ if pred_type not in ('classification', 'regression'):
124
+ raise ValueError("pred_type must be 'classification' or 'regression'")
125
+
126
+ if models is None:
127
+ raise ValueError("`models` cannot be None.")
128
+
129
+ if isinstance(models, str):
130
+ models = [models]
131
+ allowed = {'lgb', 'xgb', 'cb'}
132
+ for m in models:
133
+ if m not in allowed:
134
+ raise ValueError(f"Unsupported model '{m}'. Allowed: {allowed}")
135
+
136
+ if isinstance(random_state, int):
137
+ random_states = [random_state]
138
+ else:
139
+ random_states = list(random_state)
140
+
141
+ if X_test is not None and len(X_test) == 0:
142
+ raise ValueError("`X_test` must not be empty if provided.")
143
+
144
+ if len(X) != len(y):
145
+ raise ValueError("`X` and `y` must have the same number of samples.")
146
+
147
+ # Default scoring
148
+ if scoring_dict is None:
149
+ if pred_type == 'classification':
150
+ scoring_dict = {'roc_auc': roc_auc_score}
151
+ else:
152
+ scoring_dict = {'rmse': lambda y_true, y_pred: float(np.sqrt(mean_squared_error(y_true, y_pred)))}
153
+
154
+ # Default parameters
155
+ if params_dict is None:
156
+ params_dict = {m: {} for m in models}
157
+ else:
158
+ for m in models:
159
+ params_dict.setdefault(m, {})
160
+
161
+ # Processor fallback (identity)
162
+ class _IdentityProcessor:
163
+ def fit_transform(self, X, y=None): return X
164
+ def transform(self, X): return X
165
+
166
+ if processor is None:
167
+ processor = _IdentityProcessor()
168
+
169
+ elif not hasattr(processor, 'fit_transform') or not hasattr(processor, 'transform'):
170
+ raise TypeError("`processor` must have `fit_transform` and `transform` methods.")
171
+
172
+ # Defensive Copies
173
+ X = X.copy().reset_index(drop=True)
174
+ y = pd.Series(y).copy().reset_index(drop=True) if isinstance(y, (pd.Series, np.ndarray)) else np.asarray(y)
175
+ if X_test is not None:
176
+ X_test = X_test.copy().reset_index(drop=True)
177
+
178
+ # Categorical handling
179
+ cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
180
+ ordinal_encoder: Optional[OrdinalEncoder] = None
181
+
182
+ if cat_cols and process_categorical:
183
+ oe = OrdinalEncoder(
184
+ dtype=np.int32,
185
+ handle_unknown='use_encoded_value',
186
+ unknown_value=-1,
187
+ encoded_missing_value=-1,
188
+ ).set_output(transform='pandas')
189
+
190
+ X[cat_cols] = oe.fit_transform(X[cat_cols]).astype('category')
191
+
192
+ if X_test is not None:
193
+ X_test[cat_cols] = oe.transform(X_test[cat_cols]).astype('category')
194
+
195
+ ordinal_encoder = oe
196
+
197
+ # Set model-specific categorical flags
198
+ for m in models:
199
+ if m == 'xgb':
200
+ params_dict['xgb']['enable_categorical'] = True
201
+ elif m == 'cb':
202
+ params_dict['cb']['cat_features'] = cat_cols
203
+
204
+ # Prepare result containers
205
+ n_samples = len(X)
206
+ test_n = len(X_test) if X_test is not None else 0
207
+
208
+ oof_preds_total = np.zeros(n_samples, dtype=np.float64)
209
+ test_preds_total = np.zeros(test_n, dtype=np.float64) if X_test is not None else None
210
+ trained_models_list: List[Any] = [] if return_trained else None
211
+
212
+ # CV results storage
213
+ cv_results = {
214
+ 'stacked': {name: [] for name in scoring_dict.keys()},
215
+ 'per_model': {m: {name: [] for name in scoring_dict.keys()} for m in models},
216
+ 'per_seed': [],
217
+ }
218
+ # Helper for controlled printing
219
+ def _print(msg, level=2):
220
+ if verbose >= level:
221
+ print(msg)
222
+
223
+ # Main loop across random states
224
+ for seed in random_states:
225
+ _print(f'\n=== Random State {seed} ===', level=2)
226
+ splitter = (
227
+ StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
228
+ if pred_type == 'classification'
229
+ else KFold(n_splits=n_splits, shuffle=True, random_state=seed)
230
+ )
231
+ oof_preds = np.zeros(n_samples, dtype=float)
232
+ test_preds = np.zeros(test_n, dtype=float) if X_test is not None else None
233
+
234
+ # Per-seed storage for reporting
235
+ seed_model_scores = {m: {name: [] for name in scoring_dict.keys()} for m in models}
236
+ seed_stack_scores = {metric_name: [] for metric_name in scoring_dict.keys()}
237
+
238
+ for fold, (train_idx, val_idx) in enumerate(splitter.split(X, y)):
239
+ _print(f'\nFold {fold + 1}/{n_splits}', level=2)
240
+
241
+ y = y if isinstance(y, pd.Series) else pd.Series(y)
242
+ X_train, X_val = X.iloc[train_idx].copy(), X.iloc[val_idx].copy()
243
+ y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
244
+
245
+ # Apply processor if provided (fit on fold train only)
246
+ X_train = processor.fit_transform(X_train, y_train)
247
+ X_val = processor.transform(X_val)
248
+ X_test_proc = processor.transform(X_test) if X_test is not None else None
249
+
250
+ # Ensure processor returns DataFrame (so categorical dtypes are preserved)
251
+ if not isinstance(X_train, pd.DataFrame):
252
+ raise TypeError("processor.fit_transform must return a pandas DataFrame. "
253
+ "Use pipeline.set_output(transform='pandas').")
254
+
255
+ fold_val_preds_list = []
256
+ fold_test_preds_list = []
257
+
258
+ for model_name in models:
259
+ p = params_dict.get(model_name, {}).copy()
260
+
261
+ # Train model
262
+ if model_name == 'lgb':
263
+ ModelClass = lgb.LGBMClassifier if pred_type == 'classification' else lgb.LGBMRegressor
264
+ p.setdefault('n_estimators', 10000)
265
+ p.setdefault('verbosity', -1)
266
+ model = ModelClass(**p)
267
+ model.fit(
268
+ X_train, y_train,
269
+ eval_set=[(X_val, y_val)],
270
+ callbacks=[lgb.early_stopping(early_stopping_rounds, verbose=False)]
271
+ )
272
+ elif model_name in ['xgb', 'cb']:
273
+ p.setdefault('early_stopping_rounds', early_stopping_rounds)
274
+
275
+ if model_name == 'xgb':
276
+ ModelClass = xgb.XGBClassifier if pred_type == 'classification' else xgb.XGBRegressor
277
+ p.setdefault('n_estimators', 10000)
278
+ else:
279
+ ModelClass = cb.CatBoostClassifier if pred_type == 'classification' else cb.CatBoostRegressor
280
+ p.setdefault('iterations', 10000)
281
+
282
+ model = ModelClass(**p)
283
+ model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
284
+
285
+ # Append trained model instance only if requested
286
+ if return_trained:
287
+ trained_models_list.append(model)
288
+
289
+ # Predictions
290
+ if pred_type == 'classification':
291
+ # Prefer predict_proba; if user requested binary output at top-level,
292
+ # we still compute probabilities here and convert later if needed
293
+ val_preds = model.predict_proba(X_val)[:, 1]
294
+ test_fold_preds = model.predict_proba(X_test_proc)[:, 1] if X_test_proc is not None else None
295
+
296
+ else: # regression
297
+ val_preds = model.predict(X_val)
298
+ test_fold_preds = model.predict(X_test_proc) if X_test_proc is not None else None
299
+
300
+ fold_val_preds_list.append(val_preds)
301
+ if X_test is not None:
302
+ fold_test_preds_list.append(test_fold_preds)
303
+
304
+ # Score individual model on this fold
305
+ if pred_type == 'classification':
306
+ val_binary = (val_preds >= decision_threshold).astype(int)
307
+
308
+ for metric_name, scoring_fn in scoring_dict.items():
309
+ name_l = metric_name.lower()
310
+
311
+ if any(k in name_l for k in ('roc', 'auc', 'logloss', 'log_loss')):
312
+ score = scoring_fn(y_val, val_preds)
313
+ else:
314
+ score = scoring_fn(y_val, val_binary)
315
+
316
+ cv_results['per_model'][model_name][metric_name].append(score)
317
+ seed_model_scores[model_name][metric_name].append(score)
318
+ _print(f' {model_name.upper()} {metric_name}: {score:.5f}', level=2)
319
+ else:
320
+ for metric_name, scoring_fn in scoring_dict.items():
321
+ score = scoring_fn(y_val, val_preds)
322
+ cv_results['per_model'][model_name][metric_name].append(score)
323
+ seed_model_scores[model_name][metric_name].append(score)
324
+ _print(f' {model_name.upper()} {metric_name}: {score:.5f}', level=2)
325
+
326
+ # Aggregate fold predictions (average across models)
327
+ fold_val_preds = np.mean(np.vstack(fold_val_preds_list), axis=0)
328
+ fold_test_preds = np.mean(np.vstack(fold_test_preds_list), axis=0) if X_test is not None else None
329
+
330
+ # Fill OOF and accumulate test preds
331
+ oof_preds[val_idx] = fold_val_preds
332
+
333
+ if X_test is not None:
334
+ test_preds += fold_test_preds / n_splits
335
+
336
+ # Score stacked prediction
337
+ if pred_type == 'classification':
338
+ fold_val_binary = (fold_val_preds >= decision_threshold).astype(int)
339
+ for metric_name, scoring_fn in scoring_dict.items():
340
+ name_l = metric_name.lower()
341
+
342
+ if any(k in name_l for k in ('roc', 'auc', 'logloss', 'log_loss')):
343
+ stacked_score = scoring_fn(y_val, fold_val_preds)
344
+ else:
345
+ stacked_score = scoring_fn(y_val, fold_val_binary)
346
+
347
+ cv_results['stacked'][metric_name].append(stacked_score)
348
+ seed_stack_scores[metric_name].append(stacked_score)
349
+ _print(f' Stacked {metric_name}: {stacked_score:.5f}', level=2)
350
+ else:
351
+ for metric_name, scoring_fn in scoring_dict.items():
352
+ stacked_score = scoring_fn(y_val, fold_val_preds)
353
+ cv_results['stacked'][metric_name].append(stacked_score)
354
+ seed_stack_scores[metric_name].append(stacked_score)
355
+ _print(f' Stacked {metric_name}: {stacked_score:.5f}', level=2)
356
+
357
+ # --- End of folds for this seed ---
358
+
359
+ # Accumulate across seeds
360
+ oof_preds_total += oof_preds / len(random_states)
361
+
362
+ if X_test is not None:
363
+ test_preds_total += test_preds / len(random_states)
364
+
365
+ # Print per-model means (seed)
366
+ _print(f'\nSeed {seed} mean scores:', level=2)
367
+ for model_name in models:
368
+ for metric_name, vals in seed_model_scores[model_name].items():
369
+ mean_val = float(np.mean(vals)) if vals else float('nan')
370
+ _print(f' {model_name.upper()} {metric_name}: {mean_val:.5f}', level=2)
371
+
372
+ # Print stacked mean scores (seed)
373
+ seed_mean_scores = {k: float(np.mean(v)) for k, v in seed_stack_scores.items()}
374
+ for metric_name, score in seed_mean_scores.items():
375
+ _print(f' Stacked {metric_name}: {score:.5f}', level=2)
376
+
377
+ # Final summary print
378
+ if verbose >= 1:
379
+ print('\n' + '=' * 30)
380
+ print('=== CV Results Summary ===\n')
381
+ print('Mean CV Scores per Model:')
382
+ for model_name in models:
383
+ print(f'\n--- {model_name.upper()} ---')
384
+
385
+ for metric_name, scores in cv_results['per_model'][model_name].items():
386
+ print(f' {metric_name}: {np.mean(scores):.5f}')
387
+
388
+ print('\nMean Stacked CV Scores:')
389
+ for metric_name, scores in cv_results['stacked'].items():
390
+ print(f' {metric_name}: {np.mean(scores):.5f}')
391
+
392
+ # Post-process outputs for classification when predict_proba flag is False
393
+ if pred_type == 'classification' and not predict_proba:
394
+ # Convert averaged probabilities to binary labels using decision_threshold
395
+ oof_preds_total = (oof_preds_total >= decision_threshold).astype(int)
396
+ if test_preds_total is not None:
397
+ test_preds_total = (test_preds_total >= decision_threshold).astype(int)
398
+
399
+ # Prepare return values
400
+ test_preds_return = test_preds_total if X_test is not None else None
401
+ trained_return = trained_models_list if return_trained else None
402
+ oe_return = ordinal_encoder if return_oe else None
403
+
404
+ return oof_preds_total, test_preds_return, trained_return, oe_return
@@ -0,0 +1,170 @@
1
+ Metadata-Version: 2.4
2
+ Name: cv-score-predict
3
+ Version: 0.1.1
4
+ Summary: Cross-validated ensemble prediction with LGBM, XGBoost, and CatBoost — with safe categorical handling, multi-seed averaging, and artifact return.
5
+ Author-email: Danu ANDRIES <danu@andries.lu>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/Karabush/cv-score-predict
8
+ Project-URL: Repository, https://github.com/Karabush/cv-score-predict
9
+ Project-URL: Documentation, https://github.com/Karabush/cv-score-predict#readme
10
+ Keywords: cross-validation,ensemble learning,model averaging,LightGBM,XGBoost,CatBoost,categorical encoding,OrdinalEncoder,out-of-fold prediction,OOF,multi-seed CV,repeated cross-validation,early stopping,scikit-learn compatible,pandas,machine learning,classification,regression,model validation,kaggle,safe preprocessing,data leakage prevention,boosting ensemble
11
+ Classifier: Programming Language :: Python :: 3
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
15
+ Requires-Python: >=3.9
16
+ Description-Content-Type: text/markdown
17
+ License-File: LICENSE
18
+ Requires-Dist: numpy>=1.21
19
+ Requires-Dist: pandas>=1.3
20
+ Requires-Dist: scikit-learn>=1.4
21
+ Requires-Dist: lightgbm>=3.3
22
+ Requires-Dist: xgboost>=1.7
23
+ Requires-Dist: catboost>=1.2
24
+ Dynamic: license-file
25
+
26
+ # cv-score-predict
27
+
28
+ A robust utility for **cross-validated ensemble prediction** with LightGBM, XGBoost and CatBoost. It handles categorical features safely, supports custom preprocessing pipelines, repeated CV over multiple seeds, and early stopping, and returns out-of-fold (OOF) predictions, test predictions, trained models, and the fitted encoder, all in one call.
29
+
30
+ Designed for **Kagglers, ML engineers, and data scientists** who need reliable, leakage-free CV with minimal boilerplate.
31
+
32
+ ---
33
+
34
+ ## ✨ Key Features
35
+
36
+ - **Multi-model ensembling**: Train and average predictions from LGBM, XGBoost and CatBoost in a single CV run.
37
+ - **Native categorical support**: Automatically encodes string/categorical columns with `OrdinalEncoder` and configures models (e.g., `cat_features` for CatBoost, `enable_categorical` for XGBoost).
38
+ - **Safe preprocessing**: Integrates any scikit-learn-compatible `processor` pipeline (e.g., `ColumnTransformer`, `Pipeline`) — fitted **per fold** to prevent data leakage.
39
+ - **Repeated CV**: Average results over multiple random seeds for stable metrics.
40
+ - **Early stopping**: Enabled by default for all models using fold-wise validation.
41
+ - **Full artifact return**: Get OOF predictions, test predictions, trained models, and fitted `OrdinalEncoder` for later use.
42
+
43
+ ---
44
+
45
+ ## 📥 Parameters
46
+
47
+ | Parameter | Type | Default | Description |
48
+ |----------|------|--------|-------------|
49
+ | `X` | `pd.DataFrame` | — | Training features. |
50
+ | `y` | `Union[pd.Series, np.ndarray]` | — | Target values. |
51
+ | `X_test` | `Optional[pd.DataFrame]` | `None` | Test set for final prediction. If `None`, no test predictions are returned. |
52
+ | `pred_type` | `str` | — | Either `'classification'` or `'regression'` (**required**). |
53
+ | `processor` | `Optional[object]` | `None` | Preprocessing pipeline with `fit_transform` and `transform` methods. Must return a `pd.DataFrame` (use `set_output(transform='pandas')`). If `None`, features are passed through unchanged. |
54
+ | `process_categorical` | `bool` | `True` | If `True`, object/category columns are encoded with `OrdinalEncoder` (using `-1` for missing/unseen) and converted to pandas `category` dtype for model compatibility. |
55
+ | `models` | `Union[List[str], str]` | `('lgb', 'xgb', 'cb')` | Models to ensemble. Supported: `'lgb'` (LightGBM), `'xgb'` (XGBoost), `'cb'` (CatBoost). |
56
+ | `params_dict` | `Optional[Dict[str, dict]]` | `None` | Model-specific hyperparameters. Keys: model names; values: param dicts. |
57
+ | `scoring_dict` | `Optional[Dict[str, Callable]]` | `None` | Metrics for evaluation. Keys: metric names; values: scoring functions (e.g., `roc_auc_score`). Defaults: `{'roc_auc': roc_auc_score}` (classification), `{'rmse': rmse_fn}` (regression). |
58
+ | `decision_threshold` | `float` | `0.5` | Threshold to convert probabilities to class labels (classification only). |
59
+ | `n_splits` | `int` | `5` | Number of cross-validation folds. |
60
+ | `random_state` | `Union[int, List[int]]` | `42` | Seed(s) for reproducibility. If a list, CV is repeated for each seed and results are averaged. |
61
+ | `early_stopping_rounds` | `int` | `50` | Early stopping rounds for boosting models (if not overridden in `params_dict`). |
62
+ | `verbose` | `int` | `2` | Logging level: `2` = full per-fold details, `1` = final summary, `0` = silent. |
63
+ | `return_trained` | `bool` | `False` | If `True`, returns list of trained model instances (one per model × fold × seed). |
64
+ | `return_oe` | `bool` | `False` | If `True` and `process_categorical=True`, returns the fitted `OrdinalEncoder`. |
65
+ | `predict_proba` | `bool` | `True` | For classification: if `True`, return probabilities; if `False`, return binary labels (using `decision_threshold`). Ignored for regression. |
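
The per-fold scoring in `core.py` decides by metric name whether a scoring function receives probabilities or thresholded labels. The sketch below mirrors that rule in plain Python. `wants_probabilities`, `score_fold`, and the two stand-in metrics are hypothetical names used only for illustration; they are not part of the package API.

```python
import math
from typing import Callable, Dict, Sequence

# Substrings that mark a metric as probability-based (same keywords as the check in core.py).
PROBA_KEYS = ("roc", "auc", "logloss", "log_loss")


def wants_probabilities(metric_name: str) -> bool:
    """Return True if the lowercased metric name contains a probability keyword."""
    name_l = metric_name.lower()
    return any(k in name_l for k in PROBA_KEYS)


def score_fold(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    scoring_dict: Dict[str, Callable],
    decision_threshold: float = 0.5,
) -> Dict[str, float]:
    """Score one fold, handing each metric probabilities or thresholded labels."""
    y_label = [int(p >= decision_threshold) for p in y_prob]
    return {
        name: fn(y_true, y_prob if wants_probabilities(name) else y_label)
        for name, fn in scoring_dict.items()
    }


# Hypothetical stand-in metrics so the sketch runs without scikit-learn.
def accuracy(y_true, y_pred):
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, y_prob):
    terms = (t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, y_prob))
    return -sum(terms) / len(y_true)


scores = score_fold(
    y_true=[0, 1, 1, 0],
    y_prob=[0.2, 0.7, 0.4, 0.6],
    scoring_dict={"accuracy": accuracy, "log_loss": log_loss},
)
```

One consequence, assuming the rule stays as shown in `core.py`: a probability-based metric registered under a name without one of these keywords (for example `'brier'`) would be scored on thresholded labels, so the key chosen in `scoring_dict` matters.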
66
+ ---
67
+
68
+ ## 🚀 Installation
69
+
70
+ ```bash
71
+ pip install cv-score-predict
72
+ ```
73
+
74
+ Requirements:
75
+
76
+ * Python ≥ 3.9
77
+ * Dependencies (automatically installed):
78
+ numpy, pandas, scikit-learn ≥1.4, lightgbm, xgboost, catboost
79
+
80
+ ---
81
+
82
+ ## 📌 Basic Usage
83
+ ```python
84
+ import pandas as pd
85
+ import numpy as np
86
+ from sklearn.preprocessing import StandardScaler
87
+ from sklearn.compose import make_column_transformer
88
+ from cv_score_predict import cv_score_predict
89
+
90
+ # Simulate data
91
+ X = pd.DataFrame({
92
+ "num": [1, 2, 3, 4, 5, 6, 7, 8],
93
+ "cat": ["A", "B", "A", "C", "B", "A", "C", "D"]
94
+ })
95
+ y = [0, 1, 0, 1, 1, 0, 1, 0]
96
+ X_test = pd.DataFrame({"num": [9, 10], "cat": ["B", "E"]})
97
+
98
+ # Optional processor (applied per fold!)
99
+ processor = make_column_transformer(
100
+ (StandardScaler(), ["num"]),
101
+ remainder="passthrough"
102
+ ).set_output(transform="pandas")  # the processor must return a pd.DataFrame
103
+
104
+ # Run CV with 3 seeds → results averaged over seeds & folds
105
+ oof_pred, test_pred, _, _ = cv_score_predict(
106
+ X=X,
107
+ y=y,
108
+ X_test=X_test,
109
+ pred_type="classification",
110
+ processor=processor,
111
+ models=["lgb", "xgb"],
112
+ process_categorical=True,
113
+ random_state=[42, 123, 999],
114
+ n_splits=3,
115
+ verbose=2,
116
+ )
117
+ ```
118
+
119
+ With `verbose=2`, the output shows per-fold scores for each model and for the averaged ensemble, then per-seed means, then the final averaged metrics.
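
How the test predictions are combined can be sketched in a few lines. This is an illustration of the averaging order visible in `core.py` (models within a fold, then folds, then seeds), written in plain Python with made-up numbers; `average_vectors` and `ensemble_test_predictions` are hypothetical helper names, not package functions.

```python
from typing import List

Vector = List[float]


def average_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally long prediction vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def ensemble_test_predictions(preds: List[List[List[Vector]]]) -> Vector:
    """Average preds[seed][fold][model] over models, then folds, then seeds."""
    per_seed = []
    for folds in preds:
        per_fold = [average_vectors(models) for models in folds]
        per_seed.append(average_vectors(per_fold))
    return average_vectors(per_seed)


# Two seeds x two folds x two models, each predicting two test rows (made-up numbers).
toy = [
    [[[0.2, 0.8], [0.4, 0.6]], [[0.3, 0.7], [0.5, 0.5]]],
    [[[0.1, 0.9], [0.3, 0.7]], [[0.2, 0.8], [0.4, 0.6]]],
]
final = ensemble_test_predictions(toy)

# With predict_proba=False, the averaged probabilities are thresholded at the end.
labels = [int(p >= 0.5) for p in final]
```

OOF predictions differ in one respect: within a seed, each training row is predicted only by the fold that held it out, so only the model and seed averages apply to it.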
120
+
121
+ ---
122
+
123
+ ## 🔧 Advanced Usage: Reuse Artifacts for New Data
124
+ ```python
125
+ # Step 1: Run CV and return artifacts
126
+ oof, _, trained_models, oe = cv_score_predict(
127
+ X,
128
+ y,
129
+ X_test=None, # we'll predict manually
130
+ pred_type="classification",
131
+ processor=processor,
132
+ models=["lgb", "cb"],
133
+ process_categorical=True,
134
+ random_state=[42, 123],
135
+ n_splits=5,
136
+ return_trained=True,
137
+ return_oe=True,
138
+ )
139
+
140
+ # Step 2: For deployment: refit processor on FULL TRAINING data
141
+ # First: encode categoricals using returned oe
142
+ cat_cols = ["cat"]
143
+ X_full = X.copy()
144
+ X_full[cat_cols] = oe.transform(X_full[cat_cols]).astype('category')
145
+ X_new = pd.DataFrame({"num": [7, 8], "cat": [None, "A"]})
146
+
147
+ # Fit processor on full encoded data
148
+ processor = make_column_transformer(
149
+ (StandardScaler(), ["num"]),
150
+ remainder="passthrough"
151
+ ).set_output(transform="pandas")  # keep DataFrame output so the category dtype survives
152
+ processor.fit(X_full)
153
+
154
+ # Apply to new data
155
+ X_new_proc = X_new.copy()
156
+ X_new_proc[cat_cols] = oe.transform(X_new_proc[cat_cols]).astype('category')
157
+ X_new_proc = processor.transform(X_new_proc)
158
+
159
+ # Predict with all trained models and average
160
+ preds = [model.predict_proba(X_new_proc)[:, 1] for model in trained_models]
161
+ final_pred = np.mean(preds, axis=0)
162
+ ```
163
+ ## 📝 Notes
164
+ - Categorical columns are encoded with `OrdinalEncoder(dtype=np.int32)` and converted to `category` dtype for model compatibility.
165
+ - Always use `set_output(transform="pandas")` in sklearn pipelines to preserve dtypes.
166
+ - The processor used in CV is refit on each fold to prevent data leakage, so there is no single global version. For deployment, refit your preprocessing pipeline on the full training set (as shown in the advanced example).
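
The per-fold refitting described in the last note can be illustrated without any library. The sketch below is a simplified stand-in, not the package's code: it uses a hand-written standard scaler and contiguous, unshuffled folds, and `fit_scaler`, `transform`, and `k_fold_indices` are hypothetical names.

```python
from statistics import mean, pstdev
from typing import Iterator, List, Tuple


def fit_scaler(values: List[float]) -> Tuple[float, float]:
    """Learn mean and population standard deviation from the given values only."""
    sd = pstdev(values)
    return mean(values), sd if sd > 0 else 1.0


def transform(values: List[float], params: Tuple[float, float]) -> List[float]:
    """Standardize values with previously learned parameters."""
    mu, sd = params
    return [(v - mu) / sd for v in values]


def k_fold_indices(n: int, n_splits: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, val_idx) for contiguous, unshuffled folds."""
    sizes = [n // n_splits + (1 if i < n % n_splits else 0) for i in range(n_splits)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fold_params = []
for train_idx, val_idx in k_fold_indices(len(x), n_splits=3):
    params = fit_scaler([x[i] for i in train_idx])  # fit on the training fold only
    x_val_scaled = transform([x[i] for i in val_idx], params)  # apply to held-out rows
    fold_params.append(params)

# Every fold learned different parameters, so none of them is "the" scaler.
# For deployment, refit on all training rows.
full_params = fit_scaler(x)
```

Because each fold's parameters come from a different subset of rows, held-out rows never influence the statistics used to transform them, which is the leakage the note refers to.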
167
+
168
+ ## 📄 License
169
+ This project is licensed under the MIT License.
170
+ See the LICENSE file for details.
@@ -0,0 +1,13 @@
1
+ LICENSE
2
+ MANIFEST.in
3
+ README.md
4
+ pyproject.toml
5
+ cv_score_predict/__init__.py
6
+ cv_score_predict/core.py
7
+ cv_score_predict.egg-info/PKG-INFO
8
+ cv_score_predict.egg-info/SOURCES.txt
9
+ cv_score_predict.egg-info/dependency_links.txt
10
+ cv_score_predict.egg-info/requires.txt
11
+ cv_score_predict.egg-info/top_level.txt
12
+ tests/test_cv_score_predict.py
13
+ tests/test_regression_no_processor_binary_output.py
@@ -0,0 +1,6 @@
1
+ numpy>=1.21
2
+ pandas>=1.3
3
+ scikit-learn>=1.4
4
+ lightgbm>=3.3
5
+ xgboost>=1.7
6
+ catboost>=1.2
@@ -0,0 +1,2 @@
1
+ cv_score_predict
2
+ dist
@@ -0,0 +1,61 @@
1
+ [build-system]
2
+ requires = ["setuptools>=70.0", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "cv-score-predict"
7
+ version = "0.1.1"
8
+ description = "Cross-validated ensemble prediction with LGBM, XGBoost, and CatBoost — with safe categorical handling, multi-seed averaging, and artifact return."
9
+ readme = "README.md"
10
+ requires-python = ">=3.9"
11
+ license = { text = "MIT" }
12
+ authors = [
13
+ { name = "Danu ANDRIES", email = "danu@andries.lu" },
14
+ ]
15
+ classifiers = [
16
+ "Programming Language :: Python :: 3",
17
+ "Intended Audience :: Developers",
18
+ "Intended Audience :: Science/Research",
19
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
20
+ ]
21
+ dependencies = [
22
+ "numpy>=1.21",
23
+ "pandas>=1.3",
24
+ "scikit-learn>=1.4",
25
+ "lightgbm>=3.3",
26
+ "xgboost>=1.7",
27
+ "catboost>=1.2"
28
+ ]
29
+
30
+ keywords = [
31
+ "cross-validation",
32
+ "ensemble learning",
33
+ "model averaging",
34
+ "LightGBM",
35
+ "XGBoost",
36
+ "CatBoost",
37
+ "categorical encoding",
38
+ "OrdinalEncoder",
39
+ "out-of-fold prediction",
40
+ "OOF",
41
+ "multi-seed CV",
42
+ "repeated cross-validation",
43
+ "early stopping",
44
+ "scikit-learn compatible",
45
+ "pandas",
46
+ "machine learning",
47
+ "classification",
48
+ "regression",
49
+ "model validation",
50
+ "kaggle",
51
+ "safe preprocessing",
52
+ "data leakage prevention",
53
+ "boosting ensemble"
54
+ ]
55
+ [tool.setuptools.packages.find]
56
+ exclude = ["tests*", "notebooks*", "temp*", "scripts*", "*catboost*"]
57
+
58
+ [project.urls]
59
+ Homepage = "https://github.com/Karabush/cv-score-predict"
60
+ Repository = "https://github.com/Karabush/cv-score-predict"
61
+ Documentation = "https://github.com/Karabush/cv-score-predict#readme"
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,44 @@
1
+ def test_classification_full_pipeline():
2
+ import pandas as pd
3
+ from sklearn.preprocessing import StandardScaler
4
+ from sklearn.compose import ColumnTransformer
5
+ from cv_score_predict import cv_score_predict
6
+
7
+ # Data with a categorical column; X_test contains an unseen category
8
+ X = pd.DataFrame({
9
+ "num": [1.0, 2.5, 3.1, 4.8, 5.2, 6.0, 7.3, 8.9],
10
+ "cat": ["X", "Y", "X", "Z", "Y", "X", "Z", "W"]
11
+ })
12
+ y = [0, 1, 0, 1, 1, 0, 1, 0]
13
+ X_test = pd.DataFrame({"num": [9.1, 10.2], "cat": ["Y", "V"]}) # V is unseen
14
+
15
+ processor = ColumnTransformer([
16
+ ("num", StandardScaler(), ["num"]),
17
+ ("cat", "passthrough", ["cat"])
18
+ ]).set_output(transform="pandas")
19
+
20
+ oof, test_pred, trained, oe = cv_score_predict(
21
+ X=X,
22
+ y=y,
23
+ X_test=X_test,
24
+ pred_type="classification",
25
+ processor=processor,
26
+ process_categorical=True,
27
+ models=["lgb", "xgb"],
28
+ random_state=[42, 99],
29
+ n_splits=2,
30
+ return_trained=True,
31
+ return_oe=True,
32
+ verbose=0
33
+ )
34
+
35
+ # Assertions
36
+ assert len(oof) == len(X)
37
+ assert len(test_pred) == len(X_test)
38
+ assert test_pred.min() >= 0 and test_pred.max() <= 1 # probabilities
39
+ assert len(trained) == 2 * 2 * 2 # 2 models × 2 folds × 2 seeds
40
+ assert oe is not None
41
+
42
+ # Check that unseen category 'V' was encoded as -1
43
+ X_enc = oe.transform(X_test[["cat"]])
44
+ assert X_enc.iloc[1, 0] == -1 # 'V' → -1
@@ -0,0 +1,27 @@
1
+ def test_regression_no_processor_binary_output():
2
+ import pandas as pd
3
+ import numpy as np
4
+ from cv_score_predict import cv_score_predict
5
+
6
+ np.random.seed(0)
7
+ X = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])
8
+ y = X.sum(axis=1) + np.random.randn(20) * 0.1
9
+ X_test = pd.DataFrame(np.random.randn(5, 3), columns=["a", "b", "c"])
10
+
11
+ oof, test_pred, _, _ = cv_score_predict(
12
+ X=X,
13
+ y=y,
14
+ X_test=X_test,
15
+ pred_type="regression",
16
+ processor=None,
17
+ models=["lgb", "cb"],
18
+ random_state=[1, 2, 3], # 3 seeds
19
+ n_splits=3,
20
+ predict_proba=False, # irrelevant for regression, but should not break
21
+ verbose=0
22
+ )
23
+
24
+ assert len(oof) == 20
25
+ assert len(test_pred) == 5
26
+ assert isinstance(oof[0], float)
27
+ assert isinstance(test_pred[0], float)