cherrypick-ml 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Sujal Giri Sanyasi
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,117 @@
1
+ Metadata-Version: 2.4
2
+ Name: cherrypick-ml
3
+ Version: 0.1.0
4
+ Summary: A lightweight ML orchestration library with preprocessing, anomaly detection, and explainability tools
5
+ Author: Sujal G Sanyasi
6
+ Author-email: cherrypickml1@gmail.com
7
+ License: MIT
8
+ Classifier: License :: OSI Approved :: MIT License
9
+ Classifier: Programming Language :: Python :: 3
10
+ Classifier: Operating System :: OS Independent
11
+ Description-Content-Type: text/markdown
12
+ License-File: LICENSE
13
+ Requires-Dist: pandas
14
+ Requires-Dist: numpy
15
+ Requires-Dist: matplotlib
16
+ Requires-Dist: shap
17
+ Requires-Dist: seaborn
18
+ Requires-Dist: joblib
19
+ Requires-Dist: plotly
20
+ Requires-Dist: xgboost
21
+ Requires-Dist: rich
22
+ Requires-Dist: lightgbm
23
+ Requires-Dist: catboost
24
+ Requires-Dist: pytest
25
+ Requires-Dist: imblearn
26
+ Dynamic: author
27
+ Dynamic: author-email
28
+ Dynamic: classifier
29
+ Dynamic: description
30
+ Dynamic: description-content-type
31
+ Dynamic: license
32
+ Dynamic: license-file
33
+ Dynamic: requires-dist
34
+ Dynamic: summary
35
+
36
+ <p align="center">
37
+ <img src="assets/cherrylogo.png" alt="cherrypick-ml logo" width="1100" height="450">
38
+ </p>
39
+
40
+ -----------------
41
+
42
+ # cherrypick-ml: A Machine Learning Orchestration and Pipeline Toolkit
43
+
44
+ | | |
45
+ | --- | --- |
46
+ | Testing | Structured validation of preprocessing, orchestration, and explainability components |
47
+ | Package | PyPI distribution for cherrypick-ml |
48
+ | Meta | MIT License, Python-based machine learning pipeline framework |
49
+
50
+ ---
51
+
52
+ ## What is it?
53
+
54
+ **cherrypick-ml** is a Python package that provides a unified interface for building, managing, and evaluating machine learning workflows. It integrates preprocessing, anomaly detection, model orchestration, and explainability into a single, modular framework.
55
+
56
+ The library is designed to simplify real-world machine learning development by reducing repetitive code while maintaining flexibility and transparency in model pipelines.
57
+
58
+ ---
59
+
60
+ ## Table of Contents
61
+
62
+ - [Main Features](#main-features)
63
+ - [Core Components](#core-components)
64
+ - [Where to get it](#where-to-get-it)
65
+ - [Dependencies](#dependencies)
66
+ - [Installation from sources](#installation-from-sources)
67
+ - [Basic Usage](#basic-usage)
68
+ - [License](#license)
69
+ - [Documentation](#documentation)
70
+
71
+ ---
72
+
73
+ ## Main Features
74
+
75
+ cherrypick-ml provides the following core capabilities:
76
+
77
+ - Automated model orchestration for classification and regression tasks
78
+ - Integrated preprocessing utilities including encoding and missing value handling
79
+ - Outlier detection using statistical methods such as the Interquartile Range (IQR), Z-score, and modified Z-score, as well as Isolation Forest and Local Outlier Factor based outlier pruning
80
+ - SHAP-based explainability for feature importance and model interpretation
81
+ - Flexible train-test splitting utilities
82
+ - Modular design allowing independent usage of components
83
+ - Designed for practical, real-world machine learning workflows
84
+
85
+ ---
86
+
87
+ ## Core Components
88
+
89
+ The library is structured into the following modules:
90
+
91
+ - **Orchestrator**
92
+ High-level interface for training, evaluating, and selecting models with explainable visualisation
93
+
94
+ - **preprocessing**
95
+ Tools for encoding, imputation, and feature preparation
96
+
97
+ - **anomaly**
98
+ Outlier detection and data pruning utilities
99
+
100
+ - **explain**
101
+ Model explainability using SHAP-based analysis
102
+
103
+ - **splits**
104
+ Utilities for dataset partitioning
105
+
106
+ ---
107
+
108
+ ## Where to get it
109
+
110
+ The source code is currently hosted on GitHub at:
111
+
112
+ https://github.com/Sujal-G-Sanyasi/cherrypick-ml
113
+
114
+ Binary installers for the latest released version are available at the Python Package Index (PyPI):
115
+
116
+ ```sh
117
+ pip install cherrypick-ml
@@ -0,0 +1,82 @@
1
+ <p align="center">
2
+ <img src="assets/cherrylogo.png" alt="cherrypick-ml logo" width="1100" height="450">
3
+ </p>
4
+
5
+ -----------------
6
+
7
+ # cherrypick-ml: A Machine Learning Orchestration and Pipeline Toolkit
8
+
9
+ | | |
10
+ | --- | --- |
11
+ | Testing | Structured validation of preprocessing, orchestration, and explainability components |
12
+ | Package | PyPI distribution for cherrypick-ml |
13
+ | Meta | MIT License, Python-based machine learning pipeline framework |
14
+
15
+ ---
16
+
17
+ ## What is it?
18
+
19
+ **cherrypick-ml** is a Python package that provides a unified interface for building, managing, and evaluating machine learning workflows. It integrates preprocessing, anomaly detection, model orchestration, and explainability into a single, modular framework.
20
+
21
+ The library is designed to simplify real-world machine learning development by reducing repetitive code while maintaining flexibility and transparency in model pipelines.
22
+
23
+ ---
24
+
25
+ ## Table of Contents
26
+
27
+ - [Main Features](#main-features)
28
+ - [Core Components](#core-components)
29
+ - [Where to get it](#where-to-get-it)
30
+ - [Dependencies](#dependencies)
31
+ - [Installation from sources](#installation-from-sources)
32
+ - [Basic Usage](#basic-usage)
33
+ - [License](#license)
34
+ - [Documentation](#documentation)
35
+
36
+ ---
37
+
38
+ ## Main Features
39
+
40
+ cherrypick-ml provides the following core capabilities:
41
+
42
+ - Automated model orchestration for classification and regression tasks
43
+ - Integrated preprocessing utilities including encoding and missing value handling
44
+ - Outlier detection using statistical methods such as the Interquartile Range (IQR), Z-score, and modified Z-score, as well as Isolation Forest and Local Outlier Factor based outlier pruning
45
+ - SHAP-based explainability for feature importance and model interpretation
46
+ - Flexible train-test splitting utilities
47
+ - Modular design allowing independent usage of components
48
+ - Designed for practical, real-world machine learning workflows
49
+
50
+ ---
51
+
52
+ ## Core Components
53
+
54
+ The library is structured into the following modules:
55
+
56
+ - **Orchestrator**
57
+ High-level interface for training, evaluating, and selecting models with explainable visualisation
58
+
59
+ - **preprocessing**
60
+ Tools for encoding, imputation, and feature preparation
61
+
62
+ - **anomaly**
63
+ Outlier detection and data pruning utilities
64
+
65
+ - **explain**
66
+ Model explainability using SHAP-based analysis
67
+
68
+ - **splits**
69
+ Utilities for dataset partitioning
70
+
71
+ ---
72
+
73
+ ## Where to get it
74
+
75
+ The source code is currently hosted on GitHub at:
76
+
77
+ https://github.com/Sujal-G-Sanyasi/cherrypick-ml
78
+
79
+ Binary installers for the latest released version are available at the Python Package Index (PyPI):
80
+
81
+ ```sh
82
+ pip install cherrypick-ml
@@ -0,0 +1,4 @@
1
# Public package surface: expose the high-level Orchestrator entry point and
# make the supporting sub-modules importable as package attributes.
from .orchestrator import Orchestrator
from . import explain, preprocessing, anomaly, splits

# Star imports pull in only Orchestrator; the sub-modules remain reachable as
# attributes (e.g. `cherrypick.preprocessing`) rather than re-exported names.
__all__ = ['Orchestrator']
@@ -0,0 +1,120 @@
1
+ import pandas as pd
2
+ import numpy as np
3
+ from scipy.stats import zscore
4
+ from sklearn.ensemble import IsolationForest
5
+ from sklearn.neighbors import LocalOutlierFactor
6
+ from typing import Literal
7
+
8
+ import warnings as war
9
+ war.filterwarnings('ignore')
10
+
11
class OutlierPruner:
    """
    OutlierPruner provides statistical and ML-based methods
    for detecting and removing outliers from a dataset.

    Parameters
    ----------
    method : {'iqr', 'zscore', 'mod_zscore', 'isoforest', 'lof'}
        Method used for outlier detection.

        - ``'iqr'`` - Interquartile Range method
        - ``'zscore'`` - Standard Z-score normalization
        - ``'mod_zscore'`` - Modified Z-score:

          ``modified_Zscore = 0.6745 * (X - median) / MAD``

          where *median* is the median of the sample data, *MAD* the median
          absolute deviation, and *X* the sample data points.
        - ``'isoforest'`` - Isolation Forest, an ensemble-based anomaly detector
        - ``'lof'`` - Local Outlier Factor, detects outliers via local density

    df : pandas.DataFrame
        Input dataset on which outlier pruning will be applied.

    col : str
        Column name used for outlier detection in statistical methods
        ('iqr', 'zscore', 'mod_zscore'); ignored by the ML-based methods.

    contamination : float, default=0.3
        Expected proportion of outliers, forwarded to Isolation Forest.

    n_neighbors : int, default=20
        Neighborhood size, forwarded to Local Outlier Factor.

    Notes
    -----
    - Statistical methods require a specific column (``col``).
    - ML-based methods (Isolation Forest, Local Outlier Factor) operate on
      the numeric columns of ``df`` only.
    - Modified Z-score is robust to extreme values as it uses the median
      instead of the mean.
    """

    def __init__(self, method: Literal['iqr', 'zscore', 'mod_zscore', 'isoforest', 'lof'],
                 df: pd.DataFrame, col: str, *,
                 contamination: float = 0.3, n_neighbors: int = 20):
        self.df = df
        self.col = col
        self.method = method
        # Previously hard-coded; kept as backward-compatible keyword defaults.
        self.contamination = contamination
        self.n_neighbors = n_neighbors

    def __iqr(self):
        # Keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's fences).
        Q1 = self.df[self.col].quantile(0.25)
        Q3 = self.df[self.col].quantile(0.75)
        IQR = Q3 - Q1

        lower_fence = Q1 - 1.5 * IQR
        upper_fence = Q3 + 1.5 * IQR

        return self.df[(self.df[self.col] >= lower_fence) & (self.df[self.col] <= upper_fence)]

    def __zscore(self):
        # Standard |z| < 3 rule on the target column.
        z = zscore(self.df[self.col])
        return self.df[np.abs(z) < 3]

    def __isoforest(self):
        # Isolation Forest labels outliers as -1; keep everything else.
        isolate = IsolationForest(contamination=self.contamination,
                                  n_jobs=-1, random_state=42)
        X = self.df.select_dtypes(include=np.number)
        labels_ = isolate.fit_predict(X)
        return self.df.iloc[labels_ != -1]

    def __lof(self):
        # Local Outlier Factor also labels outliers as -1.
        lof = LocalOutlierFactor(n_jobs=-1, n_neighbors=self.n_neighbors,
                                 algorithm='kd_tree')
        X = self.df.select_dtypes(include=np.number)
        labels = lof.fit_predict(X)
        return self.df.iloc[labels != -1]

    def __modded_zscore(self):
        df1 = self.df
        median = np.median(df1[self.col])
        mad = np.median(np.abs(df1[self.col] - median))
        # If MAD == 0 the score is undefined; return the original DataFrame
        # untouched instead of dividing by zero.
        if mad == 0:
            return self.df

        mod_zscore = 0.6745 * (df1[self.col] - median) / mad
        # Keep rows whose modified Z-score magnitude is below 3.
        return df1[mod_zscore.abs() < 3]

    def remove_outlier(self):
        '''
        Apply the configured detection method and return the pruned DataFrame.

        Returns
        -------
        pandas.DataFrame
            The input data with detected outlier rows removed.

        Raises
        ------
        ValueError
            If ``method`` is not a supported name, or if the underlying
            pruning step fails (original error chained as the cause).
        '''
        dispatch = {
            "iqr": self.__iqr,
            "zscore": self.__zscore,
            "mod_zscore": self.__modded_zscore,
            "isoforest": self.__isoforest,
            "lof": self.__lof,
        }
        try:
            prune = dispatch[self.method]
        except KeyError:
            raise ValueError(f"Provide an appropriate method : {self.method}") from None
        try:
            return prune()
        except Exception as err:
            # Preserve the original ValueError contract but chain the cause
            # so the real traceback is not lost.
            raise ValueError(err) from err
@@ -0,0 +1,178 @@
1
+ import shap
2
+ import numpy as np
3
+ import pandas as pd
4
+ import matplotlib.pyplot as plt
5
+ from sklearn import tree
6
+ from cherrypick.orchestrator import Orchestrator
7
+ from typing import Literal
8
+
9
+
10
def explainer(model, data, impact_type: Literal['pos', 'neg', 'all'] = 'all'):
    """
    Compute SHAP-based feature importance and return sorted impact values.

    Uses SHAP's TreeExplainer to calculate feature contributions for a given
    model and dataset, aggregating across samples and (if applicable) across
    classes.

    Parameters
    ----------
    model : object
        A trained tree-based model compatible with shap.TreeExplainer
        (e.g., XGBoost, LightGBM, RandomForest).

    data : pandas.DataFrame
        Input dataset for which SHAP values are to be computed.
        Must contain only feature columns (no target column).

    impact_type : {'pos', 'neg', 'all'}, default='all'
        Type of feature impact to return:
        - '**pos**' - features with positive mean contribution.
        - '**neg**' - features with negative mean contribution.
        - '**all**' - all features, ranked by mean absolute SHAP value.

    Returns
    -------
    result : pandas.DataFrame
        Sorted feature-importance frame:
        - 'all' → columns ['Features', 'Overall_Impact']
        - 'pos' → columns ['Features', 'Positive_Impact']
        - 'neg' → columns ['Features', 'Negative_Impact']

    shap_values : shap.Explanation
        Raw SHAP explanation object containing per-sample contributions.

    Raises
    ------
    ValueError
        If `impact_type` is not one of {'pos', 'neg', 'all'}, or if the
        SHAP value array has fewer than two dimensions.

    Notes
    -----
    - For multi-class models, SHAP values are averaged across classes.
    - The explanation is also cached in the module-level `_shap_val` so the
      plotting helpers can reuse it.

    Example
    -------
    >>> result, shap_vals = explainer(model, X_test, impact_type='all')
    >>> print(result.head())
    """
    features, all_values = [], []
    pos_feature, pos_values = [], []
    neg_feature, neg_values = [], []

    explain = shap.TreeExplainer(model=model)
    shap_values = explain(X=data)

    # Cache globally so summary_plot / bar_plot can reuse the explanation.
    global _shap_val
    _shap_val = shap_values

    vals = shap_values.values

    # Validate once up front: the original code re-checked impact_type later
    # in an unreachable branch and misspelled "dimensions" in the message.
    if impact_type not in ('pos', 'neg', 'all') or vals.ndim < 2:
        raise ValueError("Invalid Impact type or dimensions of shap values")

    if impact_type == 'all':
        # Overall importance ranks by magnitude, ignoring sign.
        vals = np.abs(vals)
    # Average per-sample contributions; multi-class outputs (ndim >= 3)
    # are additionally averaged over the class axis.
    vals = vals.mean(axis=(0, 2)) if vals.ndim >= 3 else vals.mean(axis=0)

    for feature, value in zip(data.columns, vals):
        features.append(feature)
        all_values.append(value)
        if value < 0:
            neg_feature.append(feature)
            neg_values.append(value)
        else:
            pos_feature.append(feature)
            pos_values.append(value)

    if impact_type == 'neg':
        result = pd.DataFrame({
            "Features": neg_feature,
            "Negative_Impact": neg_values
        }).sort_values(by="Negative_Impact", ascending=False)
    elif impact_type == 'pos':
        result = pd.DataFrame({
            "Features": pos_feature,
            "Positive_Impact": pos_values
        }).sort_values(by="Positive_Impact", ascending=False)
    else:  # 'all' — already validated above
        result = pd.DataFrame({
            "Features": features,
            "Overall_Impact": all_values
        }).sort_values(by="Overall_Impact", ascending=False)

    return result, shap_values
133
+
134
+
135
def summary_plot(data):
    '''
    Summary plot for feature contribution for all the classes.

    Parameters
    ----------
    data : pandas.DataFrame
        The feature data the SHAP values were computed on.

    Notes
    -----
    Relies on the module-level ``_shap_val`` cached by :func:`explainer`;
    calling this before ``explainer`` raises ``NameError``.
    '''

    shap.summary_plot(_shap_val, data)
141
+
142
+
143
def bar_plot(n_classes):
    '''
    Bar plot analysis of feature contribution for each class.

    Parameters
    ----------
    n_classes : int
        Number of classes to iterate over; one figure is shown per class.

    Notes
    -----
    Uses the module-level ``_shap_val`` cached by the last ``explainer`` call.
    '''
    for cls in range(n_classes):
        plt.title(f"For class_id {cls}")
        shap.plots.bar(_shap_val[..., cls])
        plt.tight_layout()
        plt.show()
152
+
153
+
154
+ # def force_plot(shap_values):
155
+ # pass
156
+
157
+
158
def tree_plot(model, feature_names, size: tuple):
    '''
    Render a fitted decision tree with ``sklearn.tree.plot_tree``.

    Parameters
    ----------
    model : object
        A fitted sklearn-compatible tree estimator.
    feature_names : list of str
        Names used to label the tree's split nodes.
    size : tuple
        Figure size in inches, passed to ``plt.figure(figsize=...)``.
    '''
    plt.figure(figsize=size)
    tree.plot_tree(
        model,
        feature_names=feature_names,
        class_names=True,
        filled=True,
    )
    plt.tight_layout()
    plt.show()
163
+
164
+
165
+
166
+
167
+
168
+
169
+
170
+
171
+
172
+
173
+
174
+
175
+
176
+
177
+
178
+