leakproof-ml 0.0.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (29)
  1. leakproof_ml-0.0.1/.gitignore +9 -0
  2. leakproof_ml-0.0.1/LICENSE.txt +21 -0
  3. leakproof_ml-0.0.1/PKG-INFO +176 -0
  4. leakproof_ml-0.0.1/README.md +143 -0
  5. leakproof_ml-0.0.1/pyproject.toml +64 -0
  6. leakproof_ml-0.0.1/src/leakproof_ml/__init__.py +3 -0
  7. leakproof_ml-0.0.1/src/leakproof_ml/interpretability/__init__.py +2 -0
  8. leakproof_ml-0.0.1/src/leakproof_ml/interpretability/_explainer_utils.py +92 -0
  9. leakproof_ml-0.0.1/src/leakproof_ml/interpretability/analysis.py +40 -0
  10. leakproof_ml-0.0.1/src/leakproof_ml/interpretability/explainer.py +381 -0
  11. leakproof_ml-0.0.1/src/leakproof_ml/modeling/__init__.py +1 -0
  12. leakproof_ml-0.0.1/src/leakproof_ml/modeling/_model_utils.py +366 -0
  13. leakproof_ml-0.0.1/src/leakproof_ml/modeling/training.py +290 -0
  14. leakproof_ml-0.0.1/src/leakproof_ml/plots/__init__.py +4 -0
  15. leakproof_ml-0.0.1/src/leakproof_ml/plots/_plots_utils.py +95 -0
  16. leakproof_ml-0.0.1/src/leakproof_ml/plots/explainer_plots.py +262 -0
  17. leakproof_ml-0.0.1/src/leakproof_ml/plots/metric_plots.py +278 -0
  18. leakproof_ml-0.0.1/src/leakproof_ml/preprocessing/__init__.py +3 -0
  19. leakproof_ml-0.0.1/src/leakproof_ml/preprocessing/_pipeline_utils.py +19 -0
  20. leakproof_ml-0.0.1/src/leakproof_ml/preprocessing/cleaning.py +58 -0
  21. leakproof_ml-0.0.1/src/leakproof_ml/preprocessing/pipeline.py +69 -0
  22. leakproof_ml-0.0.1/src/leakproof_ml/preprocessing/selector.py +110 -0
  23. leakproof_ml-0.0.1/src/leakproof_ml/tuning/__init__.py +1 -0
  24. leakproof_ml-0.0.1/src/leakproof_ml/tuning/_tuning_utils.py +204 -0
  25. leakproof_ml-0.0.1/src/leakproof_ml/tuning/tuner.py +356 -0
  26. leakproof_ml-0.0.1/src/leakproof_ml/utils/__init__.py +1 -0
  27. leakproof_ml-0.0.1/src/leakproof_ml/utils/io_utils.py +119 -0
  28. leakproof_ml-0.0.1/src/leakproof_ml/validation/__init__.py +1 -0
  29. leakproof_ml-0.0.1/src/leakproof_ml/validation/splitters.py +78 -0
@@ -0,0 +1,9 @@
+ # Virtual Environment
+ venv/
+ virtualMachine/
+
+ catboost_info/
+
+ __pycache__/
+
+ .pytest_cache/
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Alexei Ortiz
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,176 @@
+ Metadata-Version: 2.4
+ Name: leakproof-ml
+ Version: 0.0.1
+ Summary: ML framework to avoid most common sources of data leakage
+ Project-URL: Homepage, https://github.com/ORALEM00/leakproof-ml.git
+ Author-email: Alexei Ortiz <alex.lztb@gmail.com>
+ License: MIT
+ License-File: LICENSE.txt
+ Keywords: cross-validation,data-leakage,framework,group-kfold,machine-learning,materials-informatics,model-evaluation
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Requires-Python: >=3.9
+ Requires-Dist: numpy<3.0,>=1.21
+ Requires-Dist: optuna>=3.0
+ Requires-Dist: pandas>=1.5
+ Requires-Dist: scikit-learn>=1.1
+ Provides-Extra: dev
+ Requires-Dist: flake8; extra == 'dev'
+ Requires-Dist: pytest; extra == 'dev'
+ Provides-Extra: full
+ Requires-Dist: matplotlib>=3.5; extra == 'full'
+ Requires-Dist: seaborn>=0.12; extra == 'full'
+ Requires-Dist: shap>=0.43; extra == 'full'
+ Provides-Extra: plots
+ Requires-Dist: matplotlib>=3.5; extra == 'plots'
+ Requires-Dist: seaborn>=0.12; extra == 'plots'
+ Provides-Extra: shap
+ Requires-Dist: shap>=0.43; extra == 'shap'
+ Description-Content-Type: text/markdown
+
+
+ # Leakproof ML
+
+ **Leakproof ML** is an open-source, flexible, and easy-to-use Python package designed to systematically prevent data leakage across the complete modelling process. It focuses on the most common sources of data leakage, which arise from improper validation strategies and inadequate isolation between training and test data.
+
+ ## Install
+
+ Leakproof ML can be installed from [PyPI](https://pypi.org/project/leakproof-ml/):
+
+ <pre>
+ pip install leakproof_ml
+ </pre>
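+
+ Optional extras declared in `pyproject.toml` add plotting and SHAP support. The plotting helpers used in the examples below likely need one of them, since matplotlib and seaborn are not core dependencies:
+
+ <pre>
+ pip install "leakproof_ml[plots]"   # matplotlib + seaborn
+ pip install "leakproof_ml[full]"    # plots + shap
+ </pre>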
+
+ ## Data Leakage Framework
+ Leakproof ML provides a unified, leakage-aware framework across its main functionalities. It enforces a standardized implementation of ML workflows so that preprocessing, feature selection, tuning, and fitting are performed exclusively on the training sets, while promoting splitting strategies aligned with the structure of the data.
+
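+ As a minimal illustration of the pattern the framework enforces (shown here with plain scikit-learn rather than Leakproof ML's own API, and assuming `X`, `y`, and `groups` as defined in the Quick start below), every fitted preprocessing step lives inside the pipeline, so it is re-fit on each fold's training split rather than once on the full dataset:
+
+ ```python
+ # Illustrative sketch only (plain scikit-learn, not the Leakproof ML API):
+ # because the scaler sits inside the pipeline, each CV fold fits it on that
+ # fold's training split only, never on the held-out data.
+ from sklearn.pipeline import Pipeline
+ from sklearn.preprocessing import StandardScaler
+ from sklearn.linear_model import Ridge
+ from sklearn.model_selection import GroupKFold, cross_val_score
+
+ pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])
+ scores = cross_val_score(pipe, X, y, cv=GroupKFold(n_splits=5), groups=groups)
+ ```
+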
+ ## Quick start
+ The software provides three main functionalities integrated into the data leakage framework: training, tuning, and interpretability.
+
+ Each functionality can be applied both to a standard single train-test split and to a cross-validation implementation for small-data cases.
+
+ ```python
+ # Setup for the examples
+ import pandas as pd
+ from xgboost import XGBRegressor
+ from leakproof_ml.validation import ShuffledGroupKFold
+
+ df = pd.read_csv("data.csv")
+
+ X = df.drop(columns=["target", "group_id"])
+ y = df["target"]
+ groups = df["group_id"]
+
+ # Group-based splitter for this example
+ # (any compatible splitter can be used)
+ splitter = ShuffledGroupKFold(n_splits=10, random_state=42)
+ ```
+
+ ### Training
+ The simplest functionality: fit a model on the dataset while avoiding data leakage.
+ ```python
+ from leakproof_ml import cv_analysis
+ from leakproof_ml.plots import plot_predictions
+
+ # The model class is passed as a parameter
+ # Results are gathered in dictionary format
+ results = cv_analysis(X, y, XGBRegressor, splitter, groups=groups, params={"max_depth": 4})
+
+ plot_predictions(results['y_true'], results['y_predict'])
+ ```
+ <p align="center">
+ <img width="616" src="./resulting_plots/XGBRegressor/metrics/groupedCV_predictions.png" />
+ </p>
+
+ ### Tuning
+ For hyperparameter optimization, Leakproof ML employs the Tree-structured Parzen Estimator algorithm implemented in the Optuna library.
+
+ In the train-test setting, CV is applied on the training set to optimize the parameters, which are then evaluated on the held-out test set. In contrast, for the CV setting, Leakproof ML implements a nested CV scheme to avoid the optimistic bias that can arise when parameters are tuned on the entire dataset.
+ ```python
+ from leakproof_ml.tuning import nested_cv_tunning
+
+ # Nested CV requires an additional inner splitter
+ inner_splitter = ShuffledGroupKFold(n_splits=3, random_state=42)
+
+ # A function accepting an Optuna trial and returning the search space
+ def search_space(trial):
+     return {
+         "max_depth": trial.suggest_int("max_depth", 2, 5),
+         "subsample": trial.suggest_float("subsample", 0.6, 1.0),
+     }
+
+ # Also returns the optimized set of parameters
+ results = nested_cv_tunning(X, y, XGBRegressor, splitter, inner_splitter, search_space, groups=groups)
+ ```
+
+ ### Interpretability
+ To extract physical insights and underlying mechanisms from data-driven models, Leakproof ML uses two global, model-agnostic interpretability methods, permutation importance (PI) and SHAP, which allow quantification of the magnitude and direction of feature influence. By default, PI is used.
+ ```python
+ from leakproof_ml.interpretability import cv_interpretability
+ from leakproof_ml.plots import plot_interpretability_bar
+
+ results = cv_interpretability(X, y, XGBRegressor, splitter, groups=groups)
+
+ plot_interpretability_bar(results)
+ ```
+ <p align="center">
+ <img width="616" src="./resulting_plots/XGBRegressor/pi/pi_groupedCV.png" />
+ </p>
+
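+ For reference, this is roughly what the underlying permutation-importance computation looks like for a single group-aware split in plain scikit-learn (an illustrative sketch, not the package's internal implementation; `X`, `y`, and `groups` come from the Quick start above):
+
+ ```python
+ from sklearn.inspection import permutation_importance
+ from sklearn.model_selection import GroupShuffleSplit
+
+ # Hold out one group-aware test split and fit the model on the rest
+ train_idx, test_idx = next(GroupShuffleSplit(n_splits=1, random_state=42).split(X, y, groups))
+ model = XGBRegressor(max_depth=4).fit(X.iloc[train_idx], y.iloc[train_idx])
+
+ # Permute each feature on the held-out split and measure the drop in score
+ pi = permutation_importance(model, X.iloc[test_idx], y.iloc[test_idx], n_repeats=10, random_state=42)
+ print(pi.importances_mean)
+ ```
+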
+ ## Custom Pipeline
+ In addition to the default pipelines used by the functions, any custom pipeline can be supplied. To do so, pass a function that returns the pipeline as a parameter to the functions; the final step of the pipeline must always define the model as ('model', model).
+
+ ```python
+ from sklearn.pipeline import Pipeline
+ from sklearn.preprocessing import StandardScaler, PolynomialFeatures
+ from sklearn.impute import SimpleImputer
+ from sklearn.compose import ColumnTransformer, make_column_selector
+ from leakproof_ml import cv_analysis
+
+ # Custom pipeline factory
+ def polynomial_custom_factory(model, degree=2):
+     numeric_pipe = Pipeline(steps=[
+         ('imputer', SimpleImputer(strategy='median')),
+         ('scaler', StandardScaler())
+     ])
+     preprocessor = ColumnTransformer(
+         transformers=[
+             ('num', numeric_pipe, make_column_selector(dtype_include='float64')),
+         ],
+         remainder='passthrough'
+     )
+
+     # Pipeline steps
+     pipe = Pipeline(steps=[
+         ('preprocessor', preprocessor),
+         ('poly', PolynomialFeatures(degree=degree)),
+         ('model', model)
+     ])
+     return pipe
+
+ results = cv_analysis(X, y, XGBRegressor, splitter, groups=groups, params={"max_depth": 4}, pipeline_factory=polynomial_custom_factory)
+ ```
+
+ ## Citation
+ If used in a research project, please cite the paper "Leakproof ML: Data Leakage Prevention with a Robust, Interpretable, and Reproducible Machine Learning Framework":
+
+ <details open>
+ <summary>BibTeX</summary>
+
+ ```bibtex
+ @inproceedings{,
+   title={},
+   author={},
+   booktitle={},
+   pages={},
+   year={}
+ }
+ ```
+ </details>
+
+
+ ## License
+
+ MIT License (see [LICENSE](./LICENSE.txt)).
@@ -0,0 +1,143 @@
+
+ # Leakproof ML
+
+ **Leakproof ML** is an open-source, flexible, and easy-to-use Python package designed to systematically prevent data leakage across the complete modelling process. It focuses on the most common sources of data leakage, which arise from improper validation strategies and inadequate isolation between training and test data.
+
+ ## Install
+
+ Leakproof ML can be installed from [PyPI](https://pypi.org/project/leakproof-ml/):
+
+ <pre>
+ pip install leakproof_ml
+ </pre>
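+
+ Optional extras declared in `pyproject.toml` add plotting and SHAP support. The plotting helpers used in the examples below likely need one of them, since matplotlib and seaborn are not core dependencies:
+
+ <pre>
+ pip install "leakproof_ml[plots]"   # matplotlib + seaborn
+ pip install "leakproof_ml[full]"    # plots + shap
+ </pre>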
+
+ ## Data Leakage Framework
+ Leakproof ML provides a unified, leakage-aware framework across its main functionalities. It enforces a standardized implementation of ML workflows so that preprocessing, feature selection, tuning, and fitting are performed exclusively on the training sets, while promoting splitting strategies aligned with the structure of the data.
+
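+ As a minimal illustration of the pattern the framework enforces (shown here with plain scikit-learn rather than Leakproof ML's own API, and assuming `X`, `y`, and `groups` as defined in the Quick start below), every fitted preprocessing step lives inside the pipeline, so it is re-fit on each fold's training split rather than once on the full dataset:
+
+ ```python
+ # Illustrative sketch only (plain scikit-learn, not the Leakproof ML API):
+ # because the scaler sits inside the pipeline, each CV fold fits it on that
+ # fold's training split only, never on the held-out data.
+ from sklearn.pipeline import Pipeline
+ from sklearn.preprocessing import StandardScaler
+ from sklearn.linear_model import Ridge
+ from sklearn.model_selection import GroupKFold, cross_val_score
+
+ pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])
+ scores = cross_val_score(pipe, X, y, cv=GroupKFold(n_splits=5), groups=groups)
+ ```
+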
+ ## Quick start
+ The software provides three main functionalities integrated into the data leakage framework: training, tuning, and interpretability.
+
+ Each functionality can be applied both to a standard single train-test split and to a cross-validation implementation for small-data cases.
+
+ ```python
+ # Setup for the examples
+ import pandas as pd
+ from xgboost import XGBRegressor
+ from leakproof_ml.validation import ShuffledGroupKFold
+
+ df = pd.read_csv("data.csv")
+
+ X = df.drop(columns=["target", "group_id"])
+ y = df["target"]
+ groups = df["group_id"]
+
+ # Group-based splitter for this example
+ # (any compatible splitter can be used)
+ splitter = ShuffledGroupKFold(n_splits=10, random_state=42)
+ ```
+
+ ### Training
+ The simplest functionality: fit a model on the dataset while avoiding data leakage.
+ ```python
+ from leakproof_ml import cv_analysis
+ from leakproof_ml.plots import plot_predictions
+
+ # The model class is passed as a parameter
+ # Results are gathered in dictionary format
+ results = cv_analysis(X, y, XGBRegressor, splitter, groups=groups, params={"max_depth": 4})
+
+ plot_predictions(results['y_true'], results['y_predict'])
+ ```
+ <p align="center">
+ <img width="616" src="./resulting_plots/XGBRegressor/metrics/groupedCV_predictions.png" />
+ </p>
+
+ ### Tuning
+ For hyperparameter optimization, Leakproof ML employs the Tree-structured Parzen Estimator algorithm implemented in the Optuna library.
+
+ In the train-test setting, CV is applied on the training set to optimize the parameters, which are then evaluated on the held-out test set. In contrast, for the CV setting, Leakproof ML implements a nested CV scheme to avoid the optimistic bias that can arise when parameters are tuned on the entire dataset.
+ ```python
+ from leakproof_ml.tuning import nested_cv_tunning
+
+ # Nested CV requires an additional inner splitter
+ inner_splitter = ShuffledGroupKFold(n_splits=3, random_state=42)
+
+ # A function accepting an Optuna trial and returning the search space
+ def search_space(trial):
+     return {
+         "max_depth": trial.suggest_int("max_depth", 2, 5),
+         "subsample": trial.suggest_float("subsample", 0.6, 1.0),
+     }
+
+ # Also returns the optimized set of parameters
+ results = nested_cv_tunning(X, y, XGBRegressor, splitter, inner_splitter, search_space, groups=groups)
+ ```
+
+ ### Interpretability
+ To extract physical insights and underlying mechanisms from data-driven models, Leakproof ML uses two global, model-agnostic interpretability methods, permutation importance (PI) and SHAP, which allow quantification of the magnitude and direction of feature influence. By default, PI is used.
+ ```python
+ from leakproof_ml.interpretability import cv_interpretability
+ from leakproof_ml.plots import plot_interpretability_bar
+
+ results = cv_interpretability(X, y, XGBRegressor, splitter, groups=groups)
+
+ plot_interpretability_bar(results)
+ ```
+ <p align="center">
+ <img width="616" src="./resulting_plots/XGBRegressor/pi/pi_groupedCV.png" />
+ </p>
+
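+ For reference, this is roughly what the underlying permutation-importance computation looks like for a single group-aware split in plain scikit-learn (an illustrative sketch, not the package's internal implementation; `X`, `y`, and `groups` come from the Quick start above):
+
+ ```python
+ from sklearn.inspection import permutation_importance
+ from sklearn.model_selection import GroupShuffleSplit
+
+ # Hold out one group-aware test split and fit the model on the rest
+ train_idx, test_idx = next(GroupShuffleSplit(n_splits=1, random_state=42).split(X, y, groups))
+ model = XGBRegressor(max_depth=4).fit(X.iloc[train_idx], y.iloc[train_idx])
+
+ # Permute each feature on the held-out split and measure the drop in score
+ pi = permutation_importance(model, X.iloc[test_idx], y.iloc[test_idx], n_repeats=10, random_state=42)
+ print(pi.importances_mean)
+ ```
+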
+ ## Custom Pipeline
+ In addition to the default pipelines used by the functions, any custom pipeline can be supplied. To do so, pass a function that returns the pipeline as a parameter to the functions; the final step of the pipeline must always define the model as ('model', model).
+
+ ```python
+ from sklearn.pipeline import Pipeline
+ from sklearn.preprocessing import StandardScaler, PolynomialFeatures
+ from sklearn.impute import SimpleImputer
+ from sklearn.compose import ColumnTransformer, make_column_selector
+ from leakproof_ml import cv_analysis
+
+ # Custom pipeline factory
+ def polynomial_custom_factory(model, degree=2):
+     numeric_pipe = Pipeline(steps=[
+         ('imputer', SimpleImputer(strategy='median')),
+         ('scaler', StandardScaler())
+     ])
+     preprocessor = ColumnTransformer(
+         transformers=[
+             ('num', numeric_pipe, make_column_selector(dtype_include='float64')),
+         ],
+         remainder='passthrough'
+     )
+
+     # Pipeline steps
+     pipe = Pipeline(steps=[
+         ('preprocessor', preprocessor),
+         ('poly', PolynomialFeatures(degree=degree)),
+         ('model', model)
+     ])
+     return pipe
+
+ results = cv_analysis(X, y, XGBRegressor, splitter, groups=groups, params={"max_depth": 4}, pipeline_factory=polynomial_custom_factory)
+ ```
+
+ ## Citation
+ If used in a research project, please cite the paper "Leakproof ML: Data Leakage Prevention with a Robust, Interpretable, and Reproducible Machine Learning Framework":
+
+ <details open>
+ <summary>BibTeX</summary>
+
+ ```bibtex
+ @inproceedings{,
+   title={},
+   author={},
+   booktitle={},
+   pages={},
+   year={}
+ }
+ ```
+ </details>
+
+
+ ## License
+
+ MIT License (see [LICENSE](./LICENSE.txt)).
@@ -0,0 +1,64 @@
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
+
+ [project]
+ name = "leakproof-ml"
+ version = "0.0.1"
+ authors = [
+   { name="Alexei Ortiz", email="alex.lztb@gmail.com" },
+ ]
+ description = "ML framework to avoid most common sources of data leakage"
+ readme = "README.md"
+ requires-python = ">=3.9"
+ classifiers = [
+     "Programming Language :: Python :: 3",
+     "Operating System :: OS Independent",
+     "Intended Audience :: Science/Research",
+     "License :: OSI Approved :: MIT License",
+     "Topic :: Scientific/Engineering :: Artificial Intelligence"
+ ]
+ license = { text = "MIT" }
+
+ dependencies = [
+     "numpy>=1.21,<3.0",
+     "pandas>=1.5",
+     "scikit-learn>=1.1",
+     "optuna>=3.0"
+ ]
+
+ keywords = [
+     "machine-learning",
+     "data-leakage",
+     "framework",
+     "cross-validation",
+     "group-kfold",
+     "materials-informatics",
+     "model-evaluation"
+ ]
+
+
+ [project.optional-dependencies]
+ shap = ["shap>=0.43"]
+ plots = ["matplotlib>=3.5", "seaborn>=0.12"]
+ full = ["shap>=0.43", "matplotlib>=3.5", "seaborn>=0.12"]
+ dev = ["pytest", "flake8"]
+
+
+ [project.urls]
+ Homepage = "https://github.com/ORALEM00/leakproof-ml.git"
+
+
+ [tool.hatch.build.targets.wheel]
+ packages = ["src/leakproof_ml"]
+
+ [tool.hatch.build.targets.sdist]
+ include = [
+     "src/leakproof_ml",
+     "pyproject.toml",
+     "README.md",
+     "LICENSE.txt",
+ ]
+ exclude = [
+     "virtualMachine/",
+ ]
@@ -0,0 +1,3 @@
+ __version__ = "0.0.1"
+
+ from .modeling import train_test_analysis, cv_analysis
@@ -0,0 +1,2 @@
+ from .explainer import train_test_interpretability, cv_interpretability
+ from .analysis import get_stable_features
@@ -0,0 +1,92 @@
+ import warnings
+
+
+ def _shap_analysis(pipeline, X_train, X_test, background_size, random_state):
+     """
+     Compute SHAP values for a fitted model within a preprocessing pipeline.
+
+     This internal utility function applies SHAP-based interpretability to the
+     trained model contained in a scikit-learn ``Pipeline``. The analysis is
+     performed on data that has already been transformed by the preprocessing
+     steps of the pipeline, ensuring that explanations correspond to the actual
+     feature representation seen by the model.
+
+     The function uses ``shap.Explainer`` for fast, model-specific
+     explanations (e.g., ``TreeExplainer`` or ``LinearExplainer``). If the model
+     type is not supported or automatic detection fails, it falls back to
+     ``KernelExplainer``, issuing a warning due to its higher computational cost.
+
+     Parameters
+     ----------
+     pipeline : sklearn.pipeline.Pipeline
+         A fitted scikit-learn pipeline. The final step must be a predictive
+         estimator exposing a ``predict`` method and is assumed to be named
+         ``'model'``.
+     X_train : pandas.DataFrame or numpy.ndarray of shape (n_train_samples, n_features)
+         The transformed training data used as background for SHAP value
+         estimation. This data is assumed to be the output of the pipeline
+         preprocessing steps.
+     X_test : pandas.DataFrame or numpy.ndarray of shape (n_test_samples, n_features)
+         The transformed test data for which SHAP values are computed.
+     background_size : int or None
+         Number of samples from ``X_train`` to use as background data when
+         ``KernelExplainer`` is employed. If ``None``, all training samples are
+         used. Reducing this value can significantly decrease computation time
+         at the cost of higher variance in SHAP estimates.
+     random_state : int
+         Random seed used when subsampling the background data for
+         ``KernelExplainer``.
+
+     Returns
+     -------
+     shap_values : numpy.ndarray
+         Array of SHAP values with shape
+         ``(n_test_samples, n_features)``, representing the contribution of each
+         transformed feature to the model prediction for each test sample.
+
+     Notes
+     -----
+     - SHAP values are computed in the *transformed feature space*, not on the
+       raw input features. As a result, explanations reflect the learned feature
+       representation produced by the preprocessing pipeline.
+     - For tree-based models, SHAP values are computed using path-dependent
+       expectations, which are robust to feature correlations introduced by
+       preprocessing.
+     - For model-agnostic explanations (``KernelExplainer``), SHAP relies on
+       background data sampling and may produce high-variance attributions in
+       small-data or highly correlated feature settings.
+     """
+     try:
+         import shap
+     except ImportError:
+         raise ImportError(
+             "The 'shap' library is required for SHAP analysis. "
+             "Please install it via 'pip install shap'."
+         )
+
+     # Attempt to use the fast SHAP Explainer
+     try:
+         explainer = shap.Explainer(pipeline.named_steps['model'], X_train, seed=random_state)
+         sv = explainer.shap_values(X_test)
+     # Fall back to KernelExplainer for ensembles or custom estimators
+     except Exception:
+         warnings.warn(
+             "Falling back to SHAP KernelExplainer. "
+             "This may be slower for large datasets or complex models. "
+             "Runtime can be reduced by setting `shap_background_size`.",
+             UserWarning
+         )
+
+         # Wrapper function exposing the fitted model's predict method
+         def model_predict(data):
+             return pipeline.named_steps['model'].predict(data)
+
+         # Reduce background data size if requested to optimize execution time
+         if (background_size is not None) and (background_size < X_train.shape[0]):
+             # Use a subset of the training data as background
+             X_train = shap.sample(X_train, background_size, random_state=random_state)
+
+         explainer = shap.KernelExplainer(model_predict, X_train, seed=random_state)
+         sv = explainer.shap_values(X_test)
+
+     return sv
@@ -0,0 +1,40 @@
+ import numpy as np
+
+
+
+ def get_stable_features(features, threshold=0.5):
+     """
+     Identify features that consistently appear across multiple cross-validation folds.
+
+     In nested cross-validation, feature selection (like Correlation Selection)
+     is performed inside each fold, which results in a different set of features
+     per fold. This function identifies "stable" features: those that were
+     selected frequently enough to meet the specified threshold.
+
+     Parameters
+     ----------
+     features : list of list of str
+         A nested list where each inner list contains the names of features
+         selected in a specific fold.
+         Example: [['feat_A', 'feat_B'], ['feat_A', 'feat_C'], ['feat_A']]
+     threshold : float, default=0.5
+         The minimum fraction of folds (from 0.0 to 1.0) a feature must
+         appear in to be considered stable. For example, with 5 folds and
+         a 0.6 threshold, a feature must appear in at least 3 folds.
+
+     Returns
+     -------
+     stable_features : list of str
+         An alphabetically sorted list of feature names that met or exceeded
+         the selection frequency threshold.
+
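+     Examples
+     --------
+     Illustrative call (hypothetical feature names; the expected output follows
+     from the counting rule described above):
+
+     >>> folds = [['feat_A', 'feat_B'], ['feat_A', 'feat_C'], ['feat_A', 'feat_B'], ['feat_A']]
+     >>> get_stable_features(folds, threshold=0.5)
+     ['feat_A', 'feat_B']
+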
+     """
+     selected_features = np.array([f for sublist in features for f in sublist])
+     unique, counts = np.unique(selected_features, return_counts=True)
+
+     n_folds = len(features)
+     # Minimum number of folds a feature must appear in, rounded down from the threshold
+     min_appearances = int(n_folds * threshold)
+
+     stable_features = [str(col) for col, count in zip(unique, counts) if count >= min_appearances]
+     return stable_features