validmind 2.5.6__py3-none-any.whl → 2.5.15__py3-none-any.whl

This diff shows the changes between two publicly released versions of the package, as they appear in their public registry. It is provided for informational purposes only.
Files changed (212)
  1. validmind/__version__.py +1 -1
  2. validmind/ai/test_descriptions.py +26 -7
  3. validmind/api_client.py +89 -43
  4. validmind/client.py +2 -2
  5. validmind/client_config.py +11 -14
  6. validmind/datasets/regression/fred_timeseries.py +67 -138
  7. validmind/template.py +1 -0
  8. validmind/test_suites/__init__.py +0 -2
  9. validmind/test_suites/statsmodels_timeseries.py +1 -1
  10. validmind/test_suites/summarization.py +0 -1
  11. validmind/test_suites/time_series.py +0 -43
  12. validmind/tests/__types__.py +3 -13
  13. validmind/tests/data_validation/ACFandPACFPlot.py +15 -13
  14. validmind/tests/data_validation/ADF.py +31 -24
  15. validmind/tests/data_validation/AutoAR.py +9 -9
  16. validmind/tests/data_validation/AutoMA.py +23 -16
  17. validmind/tests/data_validation/AutoSeasonality.py +18 -16
  18. validmind/tests/data_validation/AutoStationarity.py +21 -16
  19. validmind/tests/data_validation/BivariateScatterPlots.py +67 -96
  20. validmind/tests/data_validation/ChiSquaredFeaturesTable.py +82 -124
  21. validmind/tests/data_validation/ClassImbalance.py +15 -12
  22. validmind/tests/data_validation/DFGLSArch.py +19 -13
  23. validmind/tests/data_validation/DatasetDescription.py +17 -11
  24. validmind/tests/data_validation/DatasetSplit.py +7 -5
  25. validmind/tests/data_validation/DescriptiveStatistics.py +28 -21
  26. validmind/tests/data_validation/Duplicates.py +33 -25
  27. validmind/tests/data_validation/EngleGrangerCoint.py +35 -33
  28. validmind/tests/data_validation/FeatureTargetCorrelationPlot.py +59 -71
  29. validmind/tests/data_validation/HighCardinality.py +19 -12
  30. validmind/tests/data_validation/HighPearsonCorrelation.py +27 -22
  31. validmind/tests/data_validation/IQROutliersBarPlot.py +13 -10
  32. validmind/tests/data_validation/IQROutliersTable.py +40 -36
  33. validmind/tests/data_validation/IsolationForestOutliers.py +21 -14
  34. validmind/tests/data_validation/KPSS.py +34 -29
  35. validmind/tests/data_validation/LaggedCorrelationHeatmap.py +22 -15
  36. validmind/tests/data_validation/MissingValues.py +32 -27
  37. validmind/tests/data_validation/MissingValuesBarPlot.py +25 -21
  38. validmind/tests/data_validation/PearsonCorrelationMatrix.py +71 -84
  39. validmind/tests/data_validation/PhillipsPerronArch.py +37 -30
  40. validmind/tests/data_validation/RollingStatsPlot.py +31 -23
  41. validmind/tests/data_validation/ScatterPlot.py +63 -78
  42. validmind/tests/data_validation/SeasonalDecompose.py +38 -34
  43. validmind/tests/data_validation/Skewness.py +35 -37
  44. validmind/tests/data_validation/SpreadPlot.py +35 -35
  45. validmind/tests/data_validation/TabularCategoricalBarPlots.py +23 -17
  46. validmind/tests/data_validation/TabularDateTimeHistograms.py +21 -13
  47. validmind/tests/data_validation/TabularDescriptionTables.py +51 -16
  48. validmind/tests/data_validation/TabularNumericalHistograms.py +25 -22
  49. validmind/tests/data_validation/TargetRateBarPlots.py +21 -14
  50. validmind/tests/data_validation/TimeSeriesDescription.py +25 -18
  51. validmind/tests/data_validation/TimeSeriesDescriptiveStatistics.py +23 -17
  52. validmind/tests/data_validation/TimeSeriesFrequency.py +24 -17
  53. validmind/tests/data_validation/TimeSeriesHistogram.py +33 -32
  54. validmind/tests/data_validation/TimeSeriesLinePlot.py +17 -10
  55. validmind/tests/data_validation/TimeSeriesMissingValues.py +15 -10
  56. validmind/tests/data_validation/TimeSeriesOutliers.py +37 -33
  57. validmind/tests/data_validation/TooManyZeroValues.py +16 -11
  58. validmind/tests/data_validation/UniqueRows.py +11 -6
  59. validmind/tests/data_validation/WOEBinPlots.py +23 -16
  60. validmind/tests/data_validation/WOEBinTable.py +35 -30
  61. validmind/tests/data_validation/ZivotAndrewsArch.py +34 -28
  62. validmind/tests/data_validation/nlp/CommonWords.py +21 -14
  63. validmind/tests/data_validation/nlp/Hashtags.py +27 -20
  64. validmind/tests/data_validation/nlp/LanguageDetection.py +33 -14
  65. validmind/tests/data_validation/nlp/Mentions.py +21 -15
  66. validmind/tests/data_validation/nlp/PolarityAndSubjectivity.py +32 -9
  67. validmind/tests/data_validation/nlp/Punctuations.py +24 -20
  68. validmind/tests/data_validation/nlp/Sentiment.py +27 -8
  69. validmind/tests/data_validation/nlp/StopWords.py +26 -19
  70. validmind/tests/data_validation/nlp/TextDescription.py +36 -35
  71. validmind/tests/data_validation/nlp/Toxicity.py +32 -9
  72. validmind/tests/decorator.py +81 -42
  73. validmind/tests/model_validation/BertScore.py +36 -27
  74. validmind/tests/model_validation/BleuScore.py +25 -19
  75. validmind/tests/model_validation/ClusterSizeDistribution.py +38 -34
  76. validmind/tests/model_validation/ContextualRecall.py +35 -13
  77. validmind/tests/model_validation/FeaturesAUC.py +32 -13
  78. validmind/tests/model_validation/MeteorScore.py +46 -33
  79. validmind/tests/model_validation/ModelMetadata.py +32 -64
  80. validmind/tests/model_validation/ModelPredictionResiduals.py +75 -73
  81. validmind/tests/model_validation/RegardScore.py +30 -14
  82. validmind/tests/model_validation/RegressionResidualsPlot.py +10 -5
  83. validmind/tests/model_validation/RougeScore.py +36 -30
  84. validmind/tests/model_validation/TimeSeriesPredictionWithCI.py +30 -14
  85. validmind/tests/model_validation/TimeSeriesPredictionsPlot.py +27 -30
  86. validmind/tests/model_validation/TimeSeriesR2SquareBySegments.py +68 -63
  87. validmind/tests/model_validation/TokenDisparity.py +31 -23
  88. validmind/tests/model_validation/ToxicityScore.py +26 -17
  89. validmind/tests/model_validation/embeddings/ClusterDistribution.py +24 -20
  90. validmind/tests/model_validation/embeddings/CosineSimilarityComparison.py +30 -27
  91. validmind/tests/model_validation/embeddings/CosineSimilarityDistribution.py +7 -5
  92. validmind/tests/model_validation/embeddings/CosineSimilarityHeatmap.py +32 -23
  93. validmind/tests/model_validation/embeddings/DescriptiveAnalytics.py +7 -5
  94. validmind/tests/model_validation/embeddings/EmbeddingsVisualization2D.py +15 -11
  95. validmind/tests/model_validation/embeddings/EuclideanDistanceComparison.py +29 -29
  96. validmind/tests/model_validation/embeddings/EuclideanDistanceHeatmap.py +34 -25
  97. validmind/tests/model_validation/embeddings/PCAComponentsPairwisePlots.py +38 -26
  98. validmind/tests/model_validation/embeddings/StabilityAnalysis.py +40 -1
  99. validmind/tests/model_validation/embeddings/StabilityAnalysisKeyword.py +18 -17
  100. validmind/tests/model_validation/embeddings/StabilityAnalysisRandomNoise.py +40 -45
  101. validmind/tests/model_validation/embeddings/StabilityAnalysisSynonyms.py +17 -19
  102. validmind/tests/model_validation/embeddings/StabilityAnalysisTranslation.py +29 -25
  103. validmind/tests/model_validation/embeddings/TSNEComponentsPairwisePlots.py +38 -28
  104. validmind/tests/model_validation/ragas/AnswerCorrectness.py +5 -4
  105. validmind/tests/model_validation/ragas/AnswerRelevance.py +5 -4
  106. validmind/tests/model_validation/ragas/AnswerSimilarity.py +5 -4
  107. validmind/tests/model_validation/ragas/AspectCritique.py +7 -0
  108. validmind/tests/model_validation/ragas/ContextEntityRecall.py +9 -8
  109. validmind/tests/model_validation/ragas/ContextPrecision.py +5 -4
  110. validmind/tests/model_validation/ragas/ContextRecall.py +5 -4
  111. validmind/tests/model_validation/ragas/Faithfulness.py +5 -4
  112. validmind/tests/model_validation/ragas/utils.py +6 -0
  113. validmind/tests/model_validation/sklearn/AdjustedMutualInformation.py +19 -12
  114. validmind/tests/model_validation/sklearn/AdjustedRandIndex.py +22 -17
  115. validmind/tests/model_validation/sklearn/ClassifierPerformance.py +27 -25
  116. validmind/tests/model_validation/sklearn/ClusterCosineSimilarity.py +7 -5
  117. validmind/tests/model_validation/sklearn/ClusterPerformance.py +40 -78
  118. validmind/tests/model_validation/sklearn/ClusterPerformanceMetrics.py +15 -17
  119. validmind/tests/model_validation/sklearn/CompletenessScore.py +17 -11
  120. validmind/tests/model_validation/sklearn/ConfusionMatrix.py +22 -15
  121. validmind/tests/model_validation/sklearn/FeatureImportance.py +95 -0
  122. validmind/tests/model_validation/sklearn/FowlkesMallowsScore.py +7 -7
  123. validmind/tests/model_validation/sklearn/HomogeneityScore.py +19 -12
  124. validmind/tests/model_validation/sklearn/HyperParametersTuning.py +35 -30
  125. validmind/tests/model_validation/sklearn/KMeansClustersOptimization.py +10 -5
  126. validmind/tests/model_validation/sklearn/MinimumAccuracy.py +32 -32
  127. validmind/tests/model_validation/sklearn/MinimumF1Score.py +23 -23
  128. validmind/tests/model_validation/sklearn/MinimumROCAUCScore.py +15 -10
  129. validmind/tests/model_validation/sklearn/ModelsPerformanceComparison.py +26 -19
  130. validmind/tests/model_validation/sklearn/OverfitDiagnosis.py +38 -18
  131. validmind/tests/model_validation/sklearn/PermutationFeatureImportance.py +31 -25
  132. validmind/tests/model_validation/sklearn/PopulationStabilityIndex.py +8 -6
  133. validmind/tests/model_validation/sklearn/PrecisionRecallCurve.py +24 -17
  134. validmind/tests/model_validation/sklearn/ROCCurve.py +12 -7
  135. validmind/tests/model_validation/sklearn/RegressionErrors.py +74 -130
  136. validmind/tests/model_validation/sklearn/RegressionErrorsComparison.py +27 -12
  137. validmind/tests/model_validation/sklearn/{RegressionModelsPerformanceComparison.py → RegressionPerformance.py} +18 -20
  138. validmind/tests/model_validation/sklearn/RegressionR2Square.py +55 -93
  139. validmind/tests/model_validation/sklearn/RegressionR2SquareComparison.py +32 -13
  140. validmind/tests/model_validation/sklearn/RobustnessDiagnosis.py +113 -73
  141. validmind/tests/model_validation/sklearn/SHAPGlobalImportance.py +7 -5
  142. validmind/tests/model_validation/sklearn/SilhouettePlot.py +27 -19
  143. validmind/tests/model_validation/sklearn/TrainingTestDegradation.py +25 -18
  144. validmind/tests/model_validation/sklearn/VMeasure.py +14 -13
  145. validmind/tests/model_validation/sklearn/WeakspotsDiagnosis.py +7 -5
  146. validmind/tests/model_validation/statsmodels/AutoARIMA.py +24 -18
  147. validmind/tests/model_validation/statsmodels/BoxPierce.py +14 -10
  148. validmind/tests/model_validation/statsmodels/CumulativePredictionProbabilities.py +73 -104
  149. validmind/tests/model_validation/statsmodels/DurbinWatsonTest.py +19 -12
  150. validmind/tests/model_validation/statsmodels/GINITable.py +44 -77
  151. validmind/tests/model_validation/statsmodels/JarqueBera.py +27 -22
  152. validmind/tests/model_validation/statsmodels/KolmogorovSmirnov.py +33 -34
  153. validmind/tests/model_validation/statsmodels/LJungBox.py +32 -28
  154. validmind/tests/model_validation/statsmodels/Lilliefors.py +27 -24
  155. validmind/tests/model_validation/statsmodels/PredictionProbabilitiesHistogram.py +87 -119
  156. validmind/tests/model_validation/statsmodels/RegressionCoeffs.py +100 -0
  157. validmind/tests/model_validation/statsmodels/RegressionFeatureSignificance.py +14 -9
  158. validmind/tests/model_validation/statsmodels/RegressionModelForecastPlot.py +17 -13
  159. validmind/tests/model_validation/statsmodels/RegressionModelForecastPlotLevels.py +46 -43
  160. validmind/tests/model_validation/statsmodels/RegressionModelSensitivityPlot.py +38 -36
  161. validmind/tests/model_validation/statsmodels/RegressionModelSummary.py +30 -28
  162. validmind/tests/model_validation/statsmodels/RegressionPermutationFeatureImportance.py +18 -11
  163. validmind/tests/model_validation/statsmodels/RunsTest.py +32 -28
  164. validmind/tests/model_validation/statsmodels/ScorecardHistogram.py +75 -107
  165. validmind/tests/model_validation/statsmodels/ShapiroWilk.py +15 -8
  166. validmind/tests/ongoing_monitoring/FeatureDrift.py +10 -6
  167. validmind/tests/ongoing_monitoring/PredictionAcrossEachFeature.py +31 -25
  168. validmind/tests/ongoing_monitoring/PredictionCorrelation.py +29 -21
  169. validmind/tests/ongoing_monitoring/TargetPredictionDistributionPlot.py +31 -23
  170. validmind/tests/prompt_validation/Bias.py +14 -11
  171. validmind/tests/prompt_validation/Clarity.py +16 -14
  172. validmind/tests/prompt_validation/Conciseness.py +7 -5
  173. validmind/tests/prompt_validation/Delimitation.py +23 -22
  174. validmind/tests/prompt_validation/NegativeInstruction.py +7 -5
  175. validmind/tests/prompt_validation/Robustness.py +12 -10
  176. validmind/tests/prompt_validation/Specificity.py +13 -11
  177. validmind/tests/prompt_validation/ai_powered_test.py +6 -0
  178. validmind/tests/run.py +68 -23
  179. validmind/unit_metrics/__init__.py +81 -144
  180. validmind/unit_metrics/classification/{sklearn/Accuracy.py → Accuracy.py} +1 -1
  181. validmind/unit_metrics/classification/{sklearn/F1.py → F1.py} +1 -1
  182. validmind/unit_metrics/classification/{sklearn/Precision.py → Precision.py} +1 -1
  183. validmind/unit_metrics/classification/{sklearn/ROC_AUC.py → ROC_AUC.py} +1 -2
  184. validmind/unit_metrics/classification/{sklearn/Recall.py → Recall.py} +1 -1
  185. validmind/unit_metrics/regression/{sklearn/AdjustedRSquaredScore.py → AdjustedRSquaredScore.py} +1 -1
  186. validmind/unit_metrics/regression/GiniCoefficient.py +1 -1
  187. validmind/unit_metrics/regression/HuberLoss.py +1 -1
  188. validmind/unit_metrics/regression/KolmogorovSmirnovStatistic.py +1 -1
  189. validmind/unit_metrics/regression/{sklearn/MeanAbsoluteError.py → MeanAbsoluteError.py} +1 -1
  190. validmind/unit_metrics/regression/MeanAbsolutePercentageError.py +1 -1
  191. validmind/unit_metrics/regression/MeanBiasDeviation.py +1 -1
  192. validmind/unit_metrics/regression/{sklearn/MeanSquaredError.py → MeanSquaredError.py} +1 -1
  193. validmind/unit_metrics/regression/QuantileLoss.py +1 -1
  194. validmind/unit_metrics/regression/{sklearn/RSquaredScore.py → RSquaredScore.py} +1 -1
  195. validmind/unit_metrics/regression/{sklearn/RootMeanSquaredError.py → RootMeanSquaredError.py} +1 -1
  196. validmind/vm_models/dataset/dataset.py +2 -0
  197. validmind/vm_models/figure.py +5 -0
  198. validmind/vm_models/test/result_wrapper.py +93 -132
  199. {validmind-2.5.6.dist-info → validmind-2.5.15.dist-info}/METADATA +1 -1
  200. {validmind-2.5.6.dist-info → validmind-2.5.15.dist-info}/RECORD +203 -210
  201. validmind/tests/data_validation/ANOVAOneWayTable.py +0 -138
  202. validmind/tests/data_validation/BivariateFeaturesBarPlots.py +0 -142
  203. validmind/tests/data_validation/BivariateHistograms.py +0 -117
  204. validmind/tests/data_validation/HeatmapFeatureCorrelations.py +0 -124
  205. validmind/tests/data_validation/MissingValuesRisk.py +0 -88
  206. validmind/tests/model_validation/ModelMetadataComparison.py +0 -59
  207. validmind/tests/model_validation/sklearn/FeatureImportanceComparison.py +0 -83
  208. validmind/tests/model_validation/statsmodels/RegressionCoeffsPlot.py +0 -135
  209. validmind/tests/model_validation/statsmodels/RegressionModelsCoeffs.py +0 -103
  210. {validmind-2.5.6.dist-info → validmind-2.5.15.dist-info}/LICENSE +0 -0
  211. {validmind-2.5.6.dist-info → validmind-2.5.15.dist-info}/WHEEL +0 -0
  212. {validmind-2.5.6.dist-info → validmind-2.5.15.dist-info}/entry_points.txt +0 -0

validmind/tests/model_validation/sklearn/{RegressionModelsPerformanceComparison.py → RegressionPerformance.py}

@@ -16,25 +16,27 @@ logger = get_logger(__name__)
 
 
 @dataclass
-class RegressionModelsPerformanceComparison(Metric):
+class RegressionPerformance(Metric):
     """
     Compares and evaluates the performance of multiple regression models using five different metrics: MAE, MSE, RMSE,
     MAPE, and MBD.
 
-    **1. Purpose:**
+    ### Purpose
+
     The Regression Models Performance Comparison metric is used to measure and compare the performance of regression
     models. It calculates multiple evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE),
     Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Mean Bias Deviation (MBD), thereby
     enabling a comprehensive view of model performance.
 
-    **2. Test Mechanism:**
+    ### Test Mechanism
+
     The test starts by sourcing the true and predicted values from the models. It then computes the MAE, MSE, RMSE,
     MAPE, and MBD. These calculations encapsulate both the direction and the magnitude of error in predictions, thereby
     providing a multi-faceted view of model accuracy. It captures these results in a dictionary and compares the
     performance of all models using these metrics. The results are then appended to a table for presenting a
     comparative summary.
 
-    **3. Signs of High Risk:**
+    ### Signs of High Risk
 
     - High values of MAE, MSE, RMSE, and MAPE, which indicate a high error rate and imply a larger departure of the
     model's predictions from the true values.
@@ -42,13 +44,13 @@ class RegressionModelsPerformanceComparison(Metric):
     - If the test returns an error citing that no models were provided for comparison, it implies a risk in the
     evaluation process itself.
 
-    **4. Strengths:**
+    ### Strengths
 
     - The metric evaluates models on five different metrics offering a comprehensive analysis of model performance.
     - It compares multiple models simultaneously, aiding in the selection of the best-performing models.
     - It is designed to handle regression tasks and can be seamlessly integrated with libraries like sklearn.
 
-    **5. Limitations:**
+    ### Limitations
 
     - The metric only evaluates regression models and does not evaluate classification models.
     - The test assumes that the models have been trained and tested appropriately prior to evaluation. It does not
@@ -58,8 +60,8 @@ class RegressionModelsPerformanceComparison(Metric):
     - The test could exhibit performance limitations if a large number of models is input for comparison.
     """
 
-    name = "models_performance_comparison"
-    required_inputs = ["dataset", "models"]
+    name = "regression_performance"
+    required_inputs = ["dataset", "model"]
 
     tasks = ["regression"]
     tags = [
@@ -96,7 +98,7 @@ class RegressionModelsPerformanceComparison(Metric):
         This summary varies depending if we're evaluating a binary or multi-class model
         """
         results = []
-        metrics = metric_value["model_0"].keys()
+        metrics = metric_value[self.inputs.model.input_id].keys()
         error_table = []
         for metric_name in metrics:
             errors_dict = {}
@@ -119,20 +121,16 @@ class RegressionModelsPerformanceComparison(Metric):
 
     def run(self):
        # Check models list is not empty
-        if not self.inputs.models:
+        if not self.inputs.model:
            raise SkipTestError(
-                "List of models must be provided as a `models` parameter to compare performance"
+                "Model must be provided as a `models` parameter to compare performance"
            )
-
-        all_models = self.inputs.models
-
        results = {}
 
-        for idx, model in enumerate(all_models):
-            result = self.regression_errors(
-                y_true_test=self.inputs.dataset.y,
-                y_pred_test=self.inputs.dataset.y_pred(model),
-            )
-            results["model_" + str(idx)] = result
+        result = self.regression_errors(
+            y_true_test=self.inputs.dataset.y,
+            y_pred_test=self.inputs.dataset.y_pred(self.inputs.model),
+        )
+        results[self.inputs.model.input_id] = result
 
        return self.cache_results(results)
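
For context, a minimal usage sketch of the renamed test under its new single-model signature (not part of the diff; the test ID is inferred from the file path above, and the `run_test`/`log` calls follow ValidMind's documented test-running workflow, so treat the exact names as assumptions):

    # Hypothetical call under the 2.5.15-style single-model input; assumes `vm_dataset`
    # and `vm_model` were previously registered via vm.init_dataset() / vm.init_model().
    import validmind as vm

    result = vm.tests.run_test(
        "validmind.model_validation.sklearn.RegressionPerformance",
        inputs={"dataset": vm_dataset, "model": vm_model},
    )
    result.log()  # send the result table to the ValidMind platform, if configured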

validmind/tests/model_validation/sklearn/RegressionR2Square.py

@@ -2,105 +2,67 @@
 # See the LICENSE file in the root of this repository for details.
 # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
 
-from dataclasses import dataclass
+import pandas as pd
 
 from sklearn import metrics
 
 from validmind.tests.model_validation.statsmodels.statsutils import adj_r2_score
-from validmind.vm_models import Metric, ResultSummary, ResultTable
+from validmind import tags, tasks
 
 
-@dataclass
-class RegressionR2Square(Metric):
+@tags("sklearn", "model_performance")
+@tasks("regression")
+def RegressionR2Square(dataset, model):
     """
-    **Purpose**: The purpose of the RegressionR2Square Metric test is to measure the overall goodness-of-fit of a
-    regression model. Specifically, this Python-based test evaluates the R-squared (R2) and Adjusted R-squared (Adj R2)
-    scores: two statistical measures within regression analysis used to evaluate the strength of the relationship
-    between the model's predictors and the response variable.
-
-    **Test Mechanism**: The test deploys the 'r2_score' method from the Scikit-learn metrics module, measuring the R2
-    score on both training and test sets. This score reflects the proportion of the variance in the dependent variable
-    that is predictable from independent variables. The test also considers the Adjusted R2 score, accounting for the
-    number of predictors in the model, to penalize model complexity and thus reduce overfitting. The Adjusted R2 score
-    will be smaller if unnecessary predictors are included in the model.
-
-    **Signs of High Risk**: Indicators of high risk in this test may include a low R2 or Adjusted R2 score, which would
-    suggest that the model does not explain much variation in the dependent variable. The occurrence of overfitting is
-    also a high-risk sign, evident when the R2 score on the training set is significantly higher than on the test set,
-    indicating that the model is not generalizing well to unseen data.
-
-    **Strengths**: The R2 score is a widely-used measure in regression analysis, providing a sound general indication
-    of model performance. It is easy to interpret and understand, as it is essentially representing the proportion of
-    the dependent variable's variance explained by the independent variables. The Adjusted R2 score complements the R2
-    score well by taking into account the number of predictors in the model, which helps control overfitting.
-
-    **Limitations**: R2 and Adjusted R2 scores can be sensitive to the inclusion of unnecessary predictors in the model
-    (even though Adjusted R2 is intended to penalize complexity). Their reliability might also lessen in cases of
-    non-linear relationships or when the underlying assumptions of linear regression are violated. Additionally, while
-    they summarize how well the model fits the data, they do not provide insight on whether the correct regression was
-    used, or whether certain key assumptions have been fulfilled.
+    Assesses the overall goodness-of-fit of a regression model by evaluating R-squared (R2) and Adjusted R-squared (Adj
+    R2) scores to determine the model's explanatory power over the dependent variable.
+
+    ### Purpose
+
+    The purpose of the RegressionR2Square Metric test is to measure the overall goodness-of-fit of a regression model.
+    Specifically, this Python-based test evaluates the R-squared (R2) and Adjusted R-squared (Adj R2) scores, which are
+    statistical measures used to assess the strength of the relationship between the model's predictors and the
+    response variable.
+
+    ### Test Mechanism
+
+    The test deploys the `r2_score` method from the Scikit-learn metrics module to measure the R2 score on both
+    training and test sets. This score reflects the proportion of the variance in the dependent variable that is
+    predictable from the independent variables. The test also calculates the Adjusted R2 score, which accounts for the
+    number of predictors in the model to penalize model complexity and reduce overfitting. The Adjusted R2 score will
+    be smaller if unnecessary predictors are included in the model.
+
+    ### Signs of High Risk
+
+    - Low R2 or Adjusted R2 scores, suggesting that the model does not explain much variation in the dependent variable.
+    - Significant discrepancy between R2 scores on the training set and test set, indicating overfitting and poor
+    generalization to unseen data.
+
+    ### Strengths
+
+    - Widely-used measure in regression analysis, providing a sound general indication of model performance.
+    - Easy to interpret and understand, as it represents the proportion of the dependent variable's variance explained
+    by the independent variables.
+    - Adjusted R2 score helps control overfitting by penalizing unnecessary predictors.
+
+    ### Limitations
+
+    - Sensitive to the inclusion of unnecessary predictors even though Adjusted R2 penalizes complexity.
+    - Less reliable in cases of non-linear relationships or when the underlying assumptions of linear regression are
+    violated.
+    - Does not provide insight on whether the correct regression model was used or if key assumptions have been met.
     """
 
-    name = "regression_errors_r2_square"
-    required_inputs = ["model", "datasets"]
-    tasks = ["regression"]
-    tags = [
-        "sklearn",
-        "model_performance",
-    ]
-
-    def summary(self, raw_results):
-        """
-        Returns a summarized representation of the dataset split information
-        """
-        table_records = []
-        for result in raw_results:
-            for key, _ in result.items():
-                table_records.append(
-                    {
-                        "Metric": key,
-                        "TRAIN": result[key]["train"],
-                        "TEST": result[key]["test"],
-                    }
-                )
-
-        return ResultSummary(results=[ResultTable(data=table_records)])
-
-    def run(self):
-        y_train_true = self.inputs.datasets[0].y
-        y_train_pred = self.inputs.datasets[0].y_pred(self.inputs.model)
-        y_train_true = y_train_true.astype(y_train_pred.dtype)
-
-        y_test_true = self.inputs.datasets[1].y
-        y_test_pred = self.inputs.datasets[1].y_pred(self.inputs.model)
-        y_test_true = y_test_true.astype(y_test_pred.dtype)
-
-        r2s_train = metrics.r2_score(y_train_true, y_train_pred)
-        r2s_test = metrics.r2_score(y_test_true, y_test_pred)
-
-        results = []
-        results.append(
-            {
-                "R-squared (R2) Score": {
-                    "train": r2s_train,
-                    "test": r2s_test,
-                }
-            }
-        )
-
-        X_columns = self.inputs.datasets[0].feature_columns
-        adj_r2_train = adj_r2_score(
-            y_train_true, y_train_pred, len(y_train_true), len(X_columns)
-        )
-        adj_r2_test = adj_r2_score(
-            y_test_true, y_test_pred, len(y_test_true), len(X_columns)
-        )
-        results.append(
-            {
-                "Adjusted R-squared (R2) Score": {
-                    "train": adj_r2_train,
-                    "test": adj_r2_test,
-                }
-            }
-        )
-        return self.cache_results(metric_value=results)
+    y_true = dataset.y
+    y_pred = dataset.y_pred(model)
+    y_true = y_true.astype(y_pred.dtype)
+
+    r2s = metrics.r2_score(y_true, y_pred)
+    adj_r2 = adj_r2_score(y_true, y_pred, len(y_true), len(dataset.feature_columns))
+
+    # Create dataframe with R2 and Adjusted R2 in one row
+    results_df = pd.DataFrame(
+        {"R-squared (R2) Score": [r2s], "Adjusted R-squared (R2) Score": [adj_r2]}
+    )
+
+    return results_df
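
For reference, the two quantities the refactored test returns can be reproduced with scikit-learn and the standard adjusted-R2 formula; the sketch below uses made-up numbers and a hand-rolled adjustment in place of the package's `adj_r2_score` helper, whose exact implementation may differ:

    # Standalone sketch: R2 via sklearn, adjusted R2 via 1 - (1 - R2) * (n - 1) / (n - p - 1)
    import numpy as np
    from sklearn.metrics import r2_score

    y_true = np.array([3.1, 4.8, 6.2, 7.9, 9.1])
    y_pred = np.array([3.0, 5.0, 6.0, 8.2, 8.9])
    n, p = len(y_true), 2  # n observations, p predictors (feature columns)

    r2 = r2_score(y_true, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    print(f"R-squared: {r2:.4f}, Adjusted R-squared: {adj_r2:.4f}")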

validmind/tests/model_validation/sklearn/RegressionR2SquareComparison.py

@@ -13,26 +13,45 @@ from validmind.tests.model_validation.statsmodels.statsutils import adj_r2_score
 @tasks("regression", "time_series_forecasting")
 def RegressionR2SquareComparison(datasets, models):
     """
-    Compare R-Squared and Adjusted R-Squared values for each model and generate a summary table
-    with the results.
+    Compares R-Squared and Adjusted R-Squared values for different regression models across multiple datasets to assess
+    model performance and relevance of features.
 
-    **Purpose**: The purpose of this function is to compare the R-Squared and Adjusted R-Squared values for different models applied to various datasets.
+    ### Purpose
 
-    **Test Mechanism**: The function iterates through each dataset-model pair, calculates the R-Squared and Adjusted R-Squared values, and generates a summary table with these results.
+    The Regression R2 Square Comparison test aims to compare the R-Squared and Adjusted R-Squared values for different
+    regression models across various datasets. It helps in assessing how well each model explains the variability in
+    the dataset, and whether the models include irrelevant features.
 
-    **Signs of High Risk**:
-    - If the R-Squared values are significantly low, it could indicate that the model is not explaining much of the variability in the dataset.
-    - A significant difference between R-Squared and Adjusted R-Squared values might indicate that the model includes irrelevant features.
+    ### Test Mechanism
+
+    This test operates by:
+
+    - Iterating through each dataset-model pair.
+    - Calculating the R-Squared values to measure how much of the variability in the dataset is explained by the model.
+    - Calculating the Adjusted R-Squared values, which adjust the R-Squared based on the number of predictors in the
+    model, making it more reliable when comparing models with different numbers of features.
+    - Generating a summary table containing these values for each combination of dataset and model.
+
+    ### Signs of High Risk
+
+    - If the R-Squared values are significantly low, it indicates the model isn't explaining much of the variability in
+    the dataset.
+    - A significant difference between R-Squared and Adjusted R-Squared values might indicate that the model includes
+    irrelevant features.
+
+    ### Strengths
 
-    **Strengths**:
     - Provides a quantitative measure of model performance in terms of variance explained.
-    - Adjusted R-Squared accounts for the number of predictors, making it a more reliable measure when comparing models with different numbers of features.
+    - Adjusted R-Squared accounts for the number of predictors, making it a more reliable measure when comparing models
+    with different numbers of features.
+    - Useful for time-series forecasting and regression tasks.
 
-    **Limitations**:
-    - Assumes that the dataset is provided as a DataFrameDataset object with `y`, `y_pred`, and `feature_columns` attributes.
-    - The function relies on `adj_r2_score` from the `statsmodels.statsutils` module, which should be correctly implemented and imported.
-    - Requires that `dataset.y_pred(model)` returns the predicted values for the model.
+    ### Limitations
 
+    - Assumes the dataset is provided as a DataFrameDataset object with `y`, `y_pred`, and `feature_columns` attributes.
+    - Relies on `adj_r2_score` from the `statsmodels.statsutils` module, which needs to be correctly implemented and
+    imported.
+    - Requires that `dataset.y_pred(model)` returns the predicted values for the model.
     """
     results_list = []
 
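
The comparison loop the new docstring describes amounts to scoring every dataset-model pair and collecting one row per pair into a summary table; a rough, self-contained sketch with illustrative names only (not the package's internals):

    # Each `pairs` entry stands in for one (dataset, model) combination: a label, the true
    # values, the predictions, and the number of feature columns the model used.
    import pandas as pd
    from sklearn.metrics import r2_score

    def summarize(pairs):
        rows = []
        for label, y_true, y_pred, n_features in pairs:
            r2 = r2_score(y_true, y_pred)
            n = len(y_true)
            adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
            rows.append({"Dataset-Model": label, "R-Squared": r2, "Adjusted R-Squared": adj_r2})
        return pd.DataFrame(rows)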

validmind/tests/model_validation/sklearn/RobustnessDiagnosis.py

@@ -7,9 +7,9 @@ from dataclasses import dataclass
 from operator import add
 from typing import List, Tuple
 
-import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd
+import plotly.graph_objects as go
 import seaborn as sns
 from sklearn import metrics
 
@@ -132,24 +132,28 @@ def _combine_results(results: List[dict]):
 
 
 def _plot_robustness(
-    results: pd.DataFrame, metric: str, threshold: float, columns: List[str]
+    results: pd.DataFrame, metric: str, threshold: float, columns: List[str], model: str
 ):
-    fig, ax = plt.subplots()
-
-    pallete = sns.color_palette("muted", len(results["Dataset"].unique()))
-    sns.lineplot(
-        data=results,
-        x="Perturbation Size",
-        y=metric.upper(),
-        hue="Dataset",
-        style="Dataset",
-        linewidth=3,
-        markers=True,
-        markersize=10,
-        dashes=False,
-        palette=pallete,
-        ax=ax,
-    )
+    fig = go.Figure()
+
+    datasets = results["Dataset"].unique()
+    pallete = [
+        f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}"
+        for r, g, b in sns.color_palette("husl", len(datasets))
+    ]
+
+    for i, dataset in enumerate(datasets):
+        dataset_results = results[results["Dataset"] == dataset]
+        fig.add_trace(
+            go.Scatter(
+                x=dataset_results["Perturbation Size"],
+                y=dataset_results[metric.upper()],
+                mode="lines+markers",
+                name=dataset,
+                line=dict(width=3, color=pallete[i]),
+                marker=dict(size=10),
+            )
+        )
 
     if PERFORMANCE_METRICS[metric]["is_lower_better"]:
         y_label = f"{metric.upper()} (lower is better)"
@@ -157,33 +161,64 @@ def _plot_robustness(
         threshold = -threshold
         y_label = f"{metric.upper()} (higher is better)"
 
-    # add dotted threshold line
-    for i in range(len(results["Dataset"].unique())):
-        baseline = results[results["Dataset"] == results["Dataset"].unique()[i]][
-            metric.upper()
-        ].iloc[0]
-        ax.axhline(
-            y=baseline + threshold,
-            color=pallete[i],
-            linestyle="dotted",
+    # add threshold lines
+    for i, dataset in enumerate(datasets):
+        baseline = results[results["Dataset"] == dataset][metric.upper()].iloc[0]
+        fig.add_trace(
+            go.Scatter(
+                x=results["Perturbation Size"].unique(),
+                y=[baseline + threshold] * len(results["Perturbation Size"].unique()),
+                mode="lines",
+                name=f"threshold_{dataset}",
+                line=dict(dash="dash", width=2, color=pallete[i]),
+                showlegend=True,
+            )
         )
 
-    ax.tick_params(axis="x")
-    ax.set_ylabel(y_label, weight="bold", fontsize=18)
-    ax.legend(fontsize=18)
-    ax.set_xlabel(
-        "Perturbation Size (X * Standard Deviation)", weight="bold", fontsize=18
-    )
-    ax.set_title(
-        f"Perturbed Features: {', '.join(columns)}",
-        weight="bold",
-        fontsize=20,
-        wrap=True,
+    columns_lines = [""]
+    for column in columns:
+        # keep adding to the last line in list until character limit (40)
+        if len(columns_lines[-1]) + len(column) < 40:
+            columns_lines[-1] += f"{column}, "
+        else:
+            columns_lines.append(f"{column}, ")
+
+    fig.update_layout(
+        title=dict(
+            text=(
+                f"Model Robustness for '{model}'<br><sup>As determined by calculating "
+                f"{metric.upper()} decay in the presence of random gaussian noise</sup>"
+            ),
+            font=dict(size=20),
+            x=0.5,
+            xanchor="center",
+        ),
+        xaxis_title=dict(
+            text="Perturbation Size (X * Standard Deviation)",
+        ),
+        yaxis_title=dict(text=y_label),
+        plot_bgcolor="white",
+        margin=dict(t=60, b=80, r=20, l=60),
+        xaxis=dict(showgrid=True, gridcolor="lightgrey"),
+        yaxis=dict(showgrid=True, gridcolor="lightgrey"),
+        annotations=[
+            go.layout.Annotation(
+                text=f"Perturbed Features:<br><sup>{'<br>'.join(columns_lines)}</sup>",
+                align="left",
+                font=dict(size=14),
+                bordercolor="lightgrey",
+                borderwidth=1,
+                borderpad=4,
+                showarrow=False,
+                x=1.025,
+                xref="paper",
+                xanchor="left",
+                y=-0.15,
+                yref="paper",
+            )
+        ],
     )
 
-    # prevent the figure from being displayed
-    plt.close("all")
-
     return fig
 
 
@@ -267,6 +302,7 @@ def robustness_diagnosis(
         metric=metric,
         threshold=performance_decay_threshold,
         columns=datasets[0].feature_columns_numeric,
+        model=model.input_id,
     )
 
     # rename perturbation size for baseline
@@ -279,38 +315,42 @@
 
 @dataclass
 class RobustnessDiagnosis(ThresholdTest):
-    """Evaluate the robustness of a machine learning model to noise
-
-    Robustness refers to a model's ability to maintain a high level of performance in
-    the face of perturbations or changes (particularly noise) added to its input data.
-    This test is designed to help gauge how well the model can handle potential real-
-    world scenarios where the input data might be incomplete or corrupted.
-
-    ## Test Methodology
-    This test is conducted by adding Gaussian noise, proportional to a particular standard
-    deviation scale, to numeric input features of the input datasets. The model's
-    performance on the perturbed data is then evaluated using a user-defined metric or the
-    default metric of AUC for classification tasks and MSE for regression tasks. The results
-    are then plotted to visualize the model's performance decay as the perturbation size
-    increases.
-
-    When using this test, it is highly recommended to tailor the performance metric, list
-    of scaling factors for the standard deviation of the noise, and the performance decay
-    threshold to the specific use case of the model being evaluated.
-
-    **Inputs**:
-    - model (VMModel): The trained model to be evaluated.
-    - datasets (List[VMDataset]): A list of datasets to evaluate the model against.
-
-    ## Parameters
-    - metric (str, optional): The performance metric to be used for evaluation. If not
-    provided, the default metric is used based on the task of the model. Default values
-    are "auc" for classification tasks and "mse" for regression tasks.
-    - scaling_factor_std_dev_list (List[float], optional): A list of scaling factors for
-    the standard deviation of the noise to be added to the input features. The default
-    values are [0.1, 0.2, 0.3, 0.4, 0.5].
-    - performance_decay_threshold (float, optional): The threshold for the performance
-    decay of the model. The default value is 0.05.
+    """
+    Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.
+
+    ### Purpose
+
+    The Robustness Diagnosis test aims to evaluate the resilience of a machine learning model when subjected to
+    perturbations or noise in its input data. This is essential for understanding the model's ability to handle
+    real-world scenarios where data may be imperfect or corrupted.
+
+    ### Test Mechanism
+
+    This test introduces Gaussian noise to the numeric input features of the datasets at varying scales of standard
+    deviation. The performance of the model is then measured using a specified metric. The process includes:
+
+    - Adding Gaussian noise to numerical input features based on scaling factors.
+    - Evaluating the model's performance on the perturbed data using metrics like AUC for classification tasks and MSE
+    for regression tasks.
+    - Aggregating and plotting the results to visualize performance decay relative to perturbation size.
+
+    ### Signs of High Risk
+
+    - A significant drop in performance metrics with minimal noise.
+    - Performance decay values exceeding the specified threshold.
+    - Consistent failure to meet performance standards across multiple perturbation scales.
+
+    ### Strengths
+
+    - Provides insights into the model's robustness against noisy or corrupted data.
+    - Utilizes a variety of performance metrics suitable for both classification and regression tasks.
+    - Visualization helps in understanding the extent of performance degradation.
+
+    ### Limitations
+
+    - Gaussian noise might not adequately represent all types of real-world data perturbations.
+    - Performance thresholds are somewhat arbitrary and might need tuning.
+    - The test may not account for more complex or unstructured noise patterns that could affect model robustness.
     """
 
     name = "robustness"
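
The perturbation scheme described in the rewritten docstring — zero-mean Gaussian noise scaled by a multiple of each numeric feature's standard deviation, with performance re-measured at every scale — can be illustrated with a self-contained scikit-learn sketch (generic MSE here; the test's own metric defaults, threshold handling, and plotting are not reproduced):

    # Fit a toy regression model, then re-score it on progressively noisier copies of X.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
    model = LinearRegression().fit(X, y)
    baseline = mean_squared_error(y, model.predict(X))

    rng = np.random.default_rng(0)
    for scale in [0.1, 0.2, 0.3, 0.4, 0.5]:  # multiples of each column's std deviation
        X_noisy = X + rng.normal(0.0, scale * X.std(axis=0), size=X.shape)
        decay = mean_squared_error(y, model.predict(X_noisy)) - baseline
        print(f"perturbation {scale:.1f} * std -> MSE increase over baseline: {decay:.2f}")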

validmind/tests/model_validation/sklearn/SHAPGlobalImportance.py

@@ -22,13 +22,15 @@ class SHAPGlobalImportance(Metric):
     """
     Evaluates and visualizes global feature importance using SHAP values for model explanation and risk identification.
 
-    **Purpose:**
+    ### Purpose
+
     The SHAP (SHapley Additive exPlanations) Global Importance metric aims to elucidate model outcomes by attributing
     them to the contributing features. It assigns a quantifiable global importance to each feature via their respective
     absolute Shapley values, thereby making it suitable for tasks like classification (both binary and multiclass).
     This metric forms an essential part of model risk management.
 
-    **Test Mechanism:**
+    ### Test Mechanism
+
     The exam begins with the selection of a suitable explainer which aligns with the model's type. For tree-based
     models like XGBClassifier, RandomForestClassifier, CatBoostClassifier, TreeExplainer is used whereas for linear
     models like LogisticRegression, XGBRegressor, LinearRegression, it is the LinearExplainer. Once the explainer
@@ -44,20 +46,20 @@ class SHAPGlobalImportance(Metric):
     gradually changing from low to high. Features are systematically organized in accordance with their importance.
     These plots are generated by the function `_generate_shap_plot()`.
 
-    **Signs of High Risk:**
+    ### Signs of High Risk
 
     - Overemphasis on certain features in SHAP importance plots, thus hinting at the possibility of model overfitting
     - Anomalies such as unexpected or illogical features showing high importance, which might suggest that the model's
     decisions are rooted in incorrect or undesirable reasoning
     - A SHAP summary plot filled with high variability or scattered data points, indicating a cause for concern
 
-    **Strengths:**
+    ### Strengths
 
     - SHAP does more than just illustrating global feature significance, it offers a detailed perspective on how
     different features shape the model's decision-making logic for each instance.
     - It provides clear insights into model behavior.
 
-    **Limitations:**
+    ### Limitations
 
     - High-dimensional data can convolute interpretations.
     - Associating importance with tangible real-world impact still involves a certain degree of subjectivity.
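
The explainer selection the docstring describes (TreeExplainer for tree ensembles, LinearExplainer for linear models, followed by a summary plot of mean absolute SHAP values) looks roughly like the sketch below; it assumes the `shap` package is installed and uses a toy scikit-learn dataset in place of a registered model:

    # Pick the explainer that matches the model family, compute SHAP values, and plot
    # global importance; a LinearExplainer would be used for e.g. LogisticRegression.
    import shap
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    shap.summary_plot(shap_values, X, plot_type="bar", show=False)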