validmind 2.5.8__py3-none-any.whl → 2.5.15__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- validmind/__version__.py +1 -1
- validmind/ai/test_descriptions.py +26 -7
- validmind/api_client.py +89 -43
- validmind/client.py +2 -2
- validmind/client_config.py +11 -14
- validmind/datasets/regression/fred_timeseries.py +67 -138
- validmind/template.py +1 -0
- validmind/test_suites/__init__.py +0 -2
- validmind/test_suites/statsmodels_timeseries.py +1 -1
- validmind/test_suites/summarization.py +0 -1
- validmind/test_suites/time_series.py +0 -43
- validmind/tests/__types__.py +3 -13
- validmind/tests/data_validation/ACFandPACFPlot.py +15 -13
- validmind/tests/data_validation/ADF.py +31 -24
- validmind/tests/data_validation/AutoAR.py +9 -9
- validmind/tests/data_validation/AutoMA.py +23 -16
- validmind/tests/data_validation/AutoSeasonality.py +18 -16
- validmind/tests/data_validation/AutoStationarity.py +21 -16
- validmind/tests/data_validation/BivariateScatterPlots.py +67 -96
- validmind/tests/data_validation/ChiSquaredFeaturesTable.py +82 -124
- validmind/tests/data_validation/ClassImbalance.py +15 -12
- validmind/tests/data_validation/DFGLSArch.py +19 -13
- validmind/tests/data_validation/DatasetDescription.py +17 -11
- validmind/tests/data_validation/DatasetSplit.py +7 -5
- validmind/tests/data_validation/DescriptiveStatistics.py +28 -21
- validmind/tests/data_validation/Duplicates.py +33 -25
- validmind/tests/data_validation/EngleGrangerCoint.py +35 -33
- validmind/tests/data_validation/FeatureTargetCorrelationPlot.py +59 -71
- validmind/tests/data_validation/HighCardinality.py +19 -12
- validmind/tests/data_validation/HighPearsonCorrelation.py +27 -22
- validmind/tests/data_validation/IQROutliersBarPlot.py +13 -10
- validmind/tests/data_validation/IQROutliersTable.py +40 -36
- validmind/tests/data_validation/IsolationForestOutliers.py +21 -14
- validmind/tests/data_validation/KPSS.py +34 -29
- validmind/tests/data_validation/LaggedCorrelationHeatmap.py +22 -15
- validmind/tests/data_validation/MissingValues.py +32 -27
- validmind/tests/data_validation/MissingValuesBarPlot.py +25 -21
- validmind/tests/data_validation/PearsonCorrelationMatrix.py +71 -84
- validmind/tests/data_validation/PhillipsPerronArch.py +37 -30
- validmind/tests/data_validation/RollingStatsPlot.py +31 -23
- validmind/tests/data_validation/ScatterPlot.py +63 -78
- validmind/tests/data_validation/SeasonalDecompose.py +38 -34
- validmind/tests/data_validation/Skewness.py +35 -37
- validmind/tests/data_validation/SpreadPlot.py +35 -35
- validmind/tests/data_validation/TabularCategoricalBarPlots.py +23 -17
- validmind/tests/data_validation/TabularDateTimeHistograms.py +21 -13
- validmind/tests/data_validation/TabularDescriptionTables.py +51 -16
- validmind/tests/data_validation/TabularNumericalHistograms.py +25 -22
- validmind/tests/data_validation/TargetRateBarPlots.py +21 -14
- validmind/tests/data_validation/TimeSeriesDescription.py +25 -18
- validmind/tests/data_validation/TimeSeriesDescriptiveStatistics.py +23 -17
- validmind/tests/data_validation/TimeSeriesFrequency.py +24 -17
- validmind/tests/data_validation/TimeSeriesHistogram.py +33 -32
- validmind/tests/data_validation/TimeSeriesLinePlot.py +17 -10
- validmind/tests/data_validation/TimeSeriesMissingValues.py +15 -10
- validmind/tests/data_validation/TimeSeriesOutliers.py +37 -33
- validmind/tests/data_validation/TooManyZeroValues.py +16 -11
- validmind/tests/data_validation/UniqueRows.py +11 -6
- validmind/tests/data_validation/WOEBinPlots.py +23 -16
- validmind/tests/data_validation/WOEBinTable.py +35 -30
- validmind/tests/data_validation/ZivotAndrewsArch.py +34 -28
- validmind/tests/data_validation/nlp/CommonWords.py +21 -14
- validmind/tests/data_validation/nlp/Hashtags.py +27 -20
- validmind/tests/data_validation/nlp/LanguageDetection.py +33 -14
- validmind/tests/data_validation/nlp/Mentions.py +21 -15
- validmind/tests/data_validation/nlp/PolarityAndSubjectivity.py +32 -9
- validmind/tests/data_validation/nlp/Punctuations.py +24 -20
- validmind/tests/data_validation/nlp/Sentiment.py +27 -8
- validmind/tests/data_validation/nlp/StopWords.py +26 -19
- validmind/tests/data_validation/nlp/TextDescription.py +36 -35
- validmind/tests/data_validation/nlp/Toxicity.py +32 -9
- validmind/tests/decorator.py +81 -42
- validmind/tests/model_validation/BertScore.py +36 -27
- validmind/tests/model_validation/BleuScore.py +25 -19
- validmind/tests/model_validation/ClusterSizeDistribution.py +38 -34
- validmind/tests/model_validation/ContextualRecall.py +35 -13
- validmind/tests/model_validation/FeaturesAUC.py +32 -13
- validmind/tests/model_validation/MeteorScore.py +46 -33
- validmind/tests/model_validation/ModelMetadata.py +32 -64
- validmind/tests/model_validation/ModelPredictionResiduals.py +75 -73
- validmind/tests/model_validation/RegardScore.py +30 -14
- validmind/tests/model_validation/RegressionResidualsPlot.py +10 -5
- validmind/tests/model_validation/RougeScore.py +36 -30
- validmind/tests/model_validation/TimeSeriesPredictionWithCI.py +30 -14
- validmind/tests/model_validation/TimeSeriesPredictionsPlot.py +27 -30
- validmind/tests/model_validation/TimeSeriesR2SquareBySegments.py +68 -63
- validmind/tests/model_validation/TokenDisparity.py +31 -23
- validmind/tests/model_validation/ToxicityScore.py +26 -17
- validmind/tests/model_validation/embeddings/ClusterDistribution.py +24 -20
- validmind/tests/model_validation/embeddings/CosineSimilarityComparison.py +30 -27
- validmind/tests/model_validation/embeddings/CosineSimilarityDistribution.py +7 -5
- validmind/tests/model_validation/embeddings/CosineSimilarityHeatmap.py +32 -23
- validmind/tests/model_validation/embeddings/DescriptiveAnalytics.py +7 -5
- validmind/tests/model_validation/embeddings/EmbeddingsVisualization2D.py +15 -11
- validmind/tests/model_validation/embeddings/EuclideanDistanceComparison.py +29 -29
- validmind/tests/model_validation/embeddings/EuclideanDistanceHeatmap.py +34 -25
- validmind/tests/model_validation/embeddings/PCAComponentsPairwisePlots.py +38 -26
- validmind/tests/model_validation/embeddings/StabilityAnalysis.py +40 -1
- validmind/tests/model_validation/embeddings/StabilityAnalysisKeyword.py +18 -17
- validmind/tests/model_validation/embeddings/StabilityAnalysisRandomNoise.py +40 -45
- validmind/tests/model_validation/embeddings/StabilityAnalysisSynonyms.py +17 -19
- validmind/tests/model_validation/embeddings/StabilityAnalysisTranslation.py +29 -25
- validmind/tests/model_validation/embeddings/TSNEComponentsPairwisePlots.py +38 -28
- validmind/tests/model_validation/ragas/AnswerCorrectness.py +5 -4
- validmind/tests/model_validation/ragas/AnswerRelevance.py +5 -4
- validmind/tests/model_validation/ragas/AnswerSimilarity.py +5 -4
- validmind/tests/model_validation/ragas/AspectCritique.py +7 -0
- validmind/tests/model_validation/ragas/ContextEntityRecall.py +9 -8
- validmind/tests/model_validation/ragas/ContextPrecision.py +5 -4
- validmind/tests/model_validation/ragas/ContextRecall.py +5 -4
- validmind/tests/model_validation/ragas/Faithfulness.py +5 -4
- validmind/tests/model_validation/ragas/utils.py +6 -0
- validmind/tests/model_validation/sklearn/AdjustedMutualInformation.py +19 -12
- validmind/tests/model_validation/sklearn/AdjustedRandIndex.py +22 -17
- validmind/tests/model_validation/sklearn/ClassifierPerformance.py +27 -25
- validmind/tests/model_validation/sklearn/ClusterCosineSimilarity.py +7 -5
- validmind/tests/model_validation/sklearn/ClusterPerformance.py +40 -78
- validmind/tests/model_validation/sklearn/ClusterPerformanceMetrics.py +15 -17
- validmind/tests/model_validation/sklearn/CompletenessScore.py +17 -11
- validmind/tests/model_validation/sklearn/ConfusionMatrix.py +22 -15
- validmind/tests/model_validation/sklearn/FeatureImportance.py +95 -0
- validmind/tests/model_validation/sklearn/FowlkesMallowsScore.py +7 -7
- validmind/tests/model_validation/sklearn/HomogeneityScore.py +19 -12
- validmind/tests/model_validation/sklearn/HyperParametersTuning.py +35 -30
- validmind/tests/model_validation/sklearn/KMeansClustersOptimization.py +10 -5
- validmind/tests/model_validation/sklearn/MinimumAccuracy.py +32 -32
- validmind/tests/model_validation/sklearn/MinimumF1Score.py +23 -23
- validmind/tests/model_validation/sklearn/MinimumROCAUCScore.py +15 -10
- validmind/tests/model_validation/sklearn/ModelsPerformanceComparison.py +26 -19
- validmind/tests/model_validation/sklearn/OverfitDiagnosis.py +38 -18
- validmind/tests/model_validation/sklearn/PermutationFeatureImportance.py +31 -25
- validmind/tests/model_validation/sklearn/PopulationStabilityIndex.py +8 -6
- validmind/tests/model_validation/sklearn/PrecisionRecallCurve.py +24 -17
- validmind/tests/model_validation/sklearn/ROCCurve.py +12 -7
- validmind/tests/model_validation/sklearn/RegressionErrors.py +74 -130
- validmind/tests/model_validation/sklearn/RegressionErrorsComparison.py +27 -12
- validmind/tests/model_validation/sklearn/{RegressionModelsPerformanceComparison.py → RegressionPerformance.py} +18 -20
- validmind/tests/model_validation/sklearn/RegressionR2Square.py +55 -93
- validmind/tests/model_validation/sklearn/RegressionR2SquareComparison.py +32 -13
- validmind/tests/model_validation/sklearn/RobustnessDiagnosis.py +36 -32
- validmind/tests/model_validation/sklearn/SHAPGlobalImportance.py +7 -5
- validmind/tests/model_validation/sklearn/SilhouettePlot.py +27 -19
- validmind/tests/model_validation/sklearn/TrainingTestDegradation.py +25 -18
- validmind/tests/model_validation/sklearn/VMeasure.py +14 -13
- validmind/tests/model_validation/sklearn/WeakspotsDiagnosis.py +7 -5
- validmind/tests/model_validation/statsmodels/AutoARIMA.py +24 -18
- validmind/tests/model_validation/statsmodels/BoxPierce.py +14 -10
- validmind/tests/model_validation/statsmodels/CumulativePredictionProbabilities.py +73 -104
- validmind/tests/model_validation/statsmodels/DurbinWatsonTest.py +19 -12
- validmind/tests/model_validation/statsmodels/GINITable.py +44 -77
- validmind/tests/model_validation/statsmodels/JarqueBera.py +27 -22
- validmind/tests/model_validation/statsmodels/KolmogorovSmirnov.py +33 -34
- validmind/tests/model_validation/statsmodels/LJungBox.py +32 -28
- validmind/tests/model_validation/statsmodels/Lilliefors.py +27 -24
- validmind/tests/model_validation/statsmodels/PredictionProbabilitiesHistogram.py +87 -119
- validmind/tests/model_validation/statsmodels/RegressionCoeffs.py +100 -0
- validmind/tests/model_validation/statsmodels/RegressionFeatureSignificance.py +14 -9
- validmind/tests/model_validation/statsmodels/RegressionModelForecastPlot.py +17 -13
- validmind/tests/model_validation/statsmodels/RegressionModelForecastPlotLevels.py +46 -43
- validmind/tests/model_validation/statsmodels/RegressionModelSensitivityPlot.py +38 -36
- validmind/tests/model_validation/statsmodels/RegressionModelSummary.py +30 -28
- validmind/tests/model_validation/statsmodels/RegressionPermutationFeatureImportance.py +18 -11
- validmind/tests/model_validation/statsmodels/RunsTest.py +32 -28
- validmind/tests/model_validation/statsmodels/ScorecardHistogram.py +75 -107
- validmind/tests/model_validation/statsmodels/ShapiroWilk.py +15 -8
- validmind/tests/ongoing_monitoring/FeatureDrift.py +10 -6
- validmind/tests/ongoing_monitoring/PredictionAcrossEachFeature.py +31 -25
- validmind/tests/ongoing_monitoring/PredictionCorrelation.py +29 -21
- validmind/tests/ongoing_monitoring/TargetPredictionDistributionPlot.py +31 -23
- validmind/tests/prompt_validation/Bias.py +14 -11
- validmind/tests/prompt_validation/Clarity.py +16 -14
- validmind/tests/prompt_validation/Conciseness.py +7 -5
- validmind/tests/prompt_validation/Delimitation.py +23 -22
- validmind/tests/prompt_validation/NegativeInstruction.py +7 -5
- validmind/tests/prompt_validation/Robustness.py +12 -10
- validmind/tests/prompt_validation/Specificity.py +13 -11
- validmind/tests/prompt_validation/ai_powered_test.py +6 -0
- validmind/tests/run.py +68 -23
- validmind/unit_metrics/__init__.py +81 -144
- validmind/unit_metrics/classification/{sklearn/Accuracy.py → Accuracy.py} +1 -1
- validmind/unit_metrics/classification/{sklearn/F1.py → F1.py} +1 -1
- validmind/unit_metrics/classification/{sklearn/Precision.py → Precision.py} +1 -1
- validmind/unit_metrics/classification/{sklearn/ROC_AUC.py → ROC_AUC.py} +1 -2
- validmind/unit_metrics/classification/{sklearn/Recall.py → Recall.py} +1 -1
- validmind/unit_metrics/regression/{sklearn/AdjustedRSquaredScore.py → AdjustedRSquaredScore.py} +1 -1
- validmind/unit_metrics/regression/GiniCoefficient.py +1 -1
- validmind/unit_metrics/regression/HuberLoss.py +1 -1
- validmind/unit_metrics/regression/KolmogorovSmirnovStatistic.py +1 -1
- validmind/unit_metrics/regression/{sklearn/MeanAbsoluteError.py → MeanAbsoluteError.py} +1 -1
- validmind/unit_metrics/regression/MeanAbsolutePercentageError.py +1 -1
- validmind/unit_metrics/regression/MeanBiasDeviation.py +1 -1
- validmind/unit_metrics/regression/{sklearn/MeanSquaredError.py → MeanSquaredError.py} +1 -1
- validmind/unit_metrics/regression/QuantileLoss.py +1 -1
- validmind/unit_metrics/regression/{sklearn/RSquaredScore.py → RSquaredScore.py} +1 -1
- validmind/unit_metrics/regression/{sklearn/RootMeanSquaredError.py → RootMeanSquaredError.py} +1 -1
- validmind/vm_models/dataset/dataset.py +2 -0
- validmind/vm_models/figure.py +5 -0
- validmind/vm_models/test/result_wrapper.py +93 -132
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/METADATA +1 -1
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/RECORD +203 -210
- validmind/tests/data_validation/ANOVAOneWayTable.py +0 -138
- validmind/tests/data_validation/BivariateFeaturesBarPlots.py +0 -142
- validmind/tests/data_validation/BivariateHistograms.py +0 -117
- validmind/tests/data_validation/HeatmapFeatureCorrelations.py +0 -124
- validmind/tests/data_validation/MissingValuesRisk.py +0 -88
- validmind/tests/model_validation/ModelMetadataComparison.py +0 -59
- validmind/tests/model_validation/sklearn/FeatureImportanceComparison.py +0 -83
- validmind/tests/model_validation/statsmodels/RegressionCoeffsPlot.py +0 -135
- validmind/tests/model_validation/statsmodels/RegressionModelsCoeffs.py +0 -103
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/LICENSE +0 -0
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/WHEEL +0 -0
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/entry_points.txt +0 -0
validmind/tests/model_validation/sklearn/{RegressionModelsPerformanceComparison.py → RegressionPerformance.py}

@@ -16,25 +16,27 @@ logger = get_logger(__name__)
 
 
 @dataclass
-class RegressionModelsPerformanceComparison(Metric):
+class RegressionPerformance(Metric):
     """
     Compares and evaluates the performance of multiple regression models using five different metrics: MAE, MSE, RMSE,
     MAPE, and MBD.
 
-    …
+    ### Purpose
+
     The Regression Models Performance Comparison metric is used to measure and compare the performance of regression
     models. It calculates multiple evaluation metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE),
     Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and Mean Bias Deviation (MBD), thereby
     enabling a comprehensive view of model performance.
 
-    …
+    ### Test Mechanism
+
     The test starts by sourcing the true and predicted values from the models. It then computes the MAE, MSE, RMSE,
     MAPE, and MBD. These calculations encapsulate both the direction and the magnitude of error in predictions, thereby
     providing a multi-faceted view of model accuracy. It captures these results in a dictionary and compares the
     performance of all models using these metrics. The results are then appended to a table for presenting a
     comparative summary.
 
-    …
+    ### Signs of High Risk
 
     - High values of MAE, MSE, RMSE, and MAPE, which indicate a high error rate and imply a larger departure of the
     model's predictions from the true values.
@@ -42,13 +44,13 @@ class RegressionModelsPerformanceComparison(Metric):
     - If the test returns an error citing that no models were provided for comparison, it implies a risk in the
     evaluation process itself.
 
-    …
+    ### Strengths
 
     - The metric evaluates models on five different metrics offering a comprehensive analysis of model performance.
     - It compares multiple models simultaneously, aiding in the selection of the best-performing models.
     - It is designed to handle regression tasks and can be seamlessly integrated with libraries like sklearn.
 
-    …
+    ### Limitations
 
     - The metric only evaluates regression models and does not evaluate classification models.
     - The test assumes that the models have been trained and tested appropriately prior to evaluation. It does not
@@ -58,8 +60,8 @@ class RegressionModelsPerformanceComparison(Metric):
     - The test could exhibit performance limitations if a large number of models is input for comparison.
     """
 
-    name = "…
-    required_inputs = ["dataset", "…
+    name = "regression_performance"
+    required_inputs = ["dataset", "model"]
 
     tasks = ["regression"]
     tags = [
@@ -96,7 +98,7 @@ class RegressionModelsPerformanceComparison(Metric):
         This summary varies depending if we're evaluating a binary or multi-class model
         """
         results = []
-        metrics = metric_value[…
+        metrics = metric_value[self.inputs.model.input_id].keys()
         error_table = []
         for metric_name in metrics:
             errors_dict = {}
@@ -119,20 +121,16 @@
 
     def run(self):
         # Check models list is not empty
-        if not self.inputs.…
+        if not self.inputs.model:
             raise SkipTestError(
-                "…
+                "Model must be provided as a `models` parameter to compare performance"
             )
-
-        all_models = self.inputs.models
-
         results = {}
 
-        … (5 lines not shown)
-            results["model_" + str(idx)] = result
+        result = self.regression_errors(
+            y_true_test=self.inputs.dataset.y,
+            y_pred_test=self.inputs.dataset.y_pred(self.inputs.model),
+        )
+        results[self.inputs.model.input_id] = result
 
         return self.cache_results(results)
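For readers unfamiliar with the five error metrics listed in the `RegressionPerformance` docstring above, a minimal sketch of how they are conventionally computed is shown below (illustrative only; the package's `regression_errors` helper may differ in naming and sign conventions):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def regression_error_metrics(y_true, y_pred):
    """Conventional MAE, MSE, RMSE, MAPE and MBD computations (illustrative)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    # MAPE is undefined when y_true contains zeros; this sketch assumes non-zero targets
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    # Mean Bias Deviation: sign indicates whether the model over- or under-predicts on average
    mbd = np.mean(y_pred - y_true)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "MBD": mbd}
```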
validmind/tests/model_validation/sklearn/RegressionR2Square.py

@@ -2,105 +2,67 @@
 # See the LICENSE file in the root of this repository for details.
 # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
 
-…
+import pandas as pd
 
 from sklearn import metrics
 
 from validmind.tests.model_validation.statsmodels.statsutils import adj_r2_score
-from validmind…
+from validmind import tags, tasks
 
 
-@…
-…
+@tags("sklearn", "model_performance")
+@tasks("regression")
+def RegressionR2Square(dataset, model):
     """
-    … (26 lines not shown)
+    Assesses the overall goodness-of-fit of a regression model by evaluating R-squared (R2) and Adjusted R-squared (Adj
+    R2) scores to determine the model's explanatory power over the dependent variable.
+
+    ### Purpose
+
+    The purpose of the RegressionR2Square Metric test is to measure the overall goodness-of-fit of a regression model.
+    Specifically, this Python-based test evaluates the R-squared (R2) and Adjusted R-squared (Adj R2) scores, which are
+    statistical measures used to assess the strength of the relationship between the model's predictors and the
+    response variable.
+
+    ### Test Mechanism
+
+    The test deploys the `r2_score` method from the Scikit-learn metrics module to measure the R2 score on both
+    training and test sets. This score reflects the proportion of the variance in the dependent variable that is
+    predictable from the independent variables. The test also calculates the Adjusted R2 score, which accounts for the
+    number of predictors in the model to penalize model complexity and reduce overfitting. The Adjusted R2 score will
+    be smaller if unnecessary predictors are included in the model.
+
+    ### Signs of High Risk
+
+    - Low R2 or Adjusted R2 scores, suggesting that the model does not explain much variation in the dependent variable.
+    - Significant discrepancy between R2 scores on the training set and test set, indicating overfitting and poor
+    generalization to unseen data.
+
+    ### Strengths
+
+    - Widely-used measure in regression analysis, providing a sound general indication of model performance.
+    - Easy to interpret and understand, as it represents the proportion of the dependent variable's variance explained
+    by the independent variables.
+    - Adjusted R2 score helps control overfitting by penalizing unnecessary predictors.
+
+    ### Limitations
+
+    - Sensitive to the inclusion of unnecessary predictors even though Adjusted R2 penalizes complexity.
+    - Less reliable in cases of non-linear relationships or when the underlying assumptions of linear regression are
+    violated.
+    - Does not provide insight on whether the correct regression model was used or if key assumptions have been met.
     """
 
-    … (9 lines not shown)
-        """
-    … (3 lines not shown)
-        for result in raw_results:
-            for key, _ in result.items():
-                table_records.append(
-                    {
-                        "Metric": key,
-                        "TRAIN": result[key]["train"],
-                        "TEST": result[key]["test"],
-                    }
-                )
-
-        return ResultSummary(results=[ResultTable(data=table_records)])
-
-    def run(self):
-        y_train_true = self.inputs.datasets[0].y
-        y_train_pred = self.inputs.datasets[0].y_pred(self.inputs.model)
-        y_train_true = y_train_true.astype(y_train_pred.dtype)
-
-        y_test_true = self.inputs.datasets[1].y
-        y_test_pred = self.inputs.datasets[1].y_pred(self.inputs.model)
-        y_test_true = y_test_true.astype(y_test_pred.dtype)
-
-        r2s_train = metrics.r2_score(y_train_true, y_train_pred)
-        r2s_test = metrics.r2_score(y_test_true, y_test_pred)
-
-        results = []
-        results.append(
-            {
-                "R-squared (R2) Score": {
-                    "train": r2s_train,
-                    "test": r2s_test,
-                }
-            }
-        )
-
-        X_columns = self.inputs.datasets[0].feature_columns
-        adj_r2_train = adj_r2_score(
-            y_train_true, y_train_pred, len(y_train_true), len(X_columns)
-        )
-        adj_r2_test = adj_r2_score(
-            y_test_true, y_test_pred, len(y_test_true), len(X_columns)
-        )
-        results.append(
-            {
-                "Adjusted R-squared (R2) Score": {
-                    "train": adj_r2_train,
-                    "test": adj_r2_test,
-                }
-            }
-        )
-        return self.cache_results(metric_value=results)
+    y_true = dataset.y
+    y_pred = dataset.y_pred(model)
+    y_true = y_true.astype(y_pred.dtype)
+
+    r2s = metrics.r2_score(y_true, y_pred)
+    adj_r2 = adj_r2_score(y_true, y_pred, len(y_true), len(dataset.feature_columns))
+
+    # Create dataframe with R2 and Adjusted R2 in one row
+    results_df = pd.DataFrame(
+        {"R-squared (R2) Score": [r2s], "Adjusted R-squared (R2) Score": [adj_r2]}
+    )
+
+    return results_df
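The new functional test above calls `adj_r2_score(y_true, y_pred, len(y_true), len(dataset.feature_columns))`. Assuming that helper implements the standard adjusted R-squared formula, the calculation is equivalent to the following sketch:

```python
from sklearn.metrics import r2_score


def adjusted_r2(y_true, y_pred, n_observations, n_features):
    """Standard adjusted R-squared; assumed to mirror validmind's adj_r2_score helper."""
    r2 = r2_score(y_true, y_pred)
    # Penalize the plain R2 for the number of predictors in the model
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_features - 1)
```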
validmind/tests/model_validation/sklearn/RegressionR2SquareComparison.py

@@ -13,26 +13,45 @@ from validmind.tests.model_validation.statsmodels.statsutils import adj_r2_score
 @tasks("regression", "time_series_forecasting")
 def RegressionR2SquareComparison(datasets, models):
     """
-    … (2 lines not shown)
+    Compares R-Squared and Adjusted R-Squared values for different regression models across multiple datasets to assess
+    model performance and relevance of features.
 
-    …
+    ### Purpose
 
-    …
+    The Regression R2 Square Comparison test aims to compare the R-Squared and Adjusted R-Squared values for different
+    regression models across various datasets. It helps in assessing how well each model explains the variability in
+    the dataset, and whether the models include irrelevant features.
 
-    … (3 lines not shown)
+    ### Test Mechanism
+
+    This test operates by:
+
+    - Iterating through each dataset-model pair.
+    - Calculating the R-Squared values to measure how much of the variability in the dataset is explained by the model.
+    - Calculating the Adjusted R-Squared values, which adjust the R-Squared based on the number of predictors in the
+    model, making it more reliable when comparing models with different numbers of features.
+    - Generating a summary table containing these values for each combination of dataset and model.
+
+    ### Signs of High Risk
+
+    - If the R-Squared values are significantly low, it indicates the model isn't explaining much of the variability in
+    the dataset.
+    - A significant difference between R-Squared and Adjusted R-Squared values might indicate that the model includes
+    irrelevant features.
+
+    ### Strengths
 
-    **Strengths**:
     - Provides a quantitative measure of model performance in terms of variance explained.
-    - Adjusted R-Squared accounts for the number of predictors, making it a more reliable measure when comparing models…
+    - Adjusted R-Squared accounts for the number of predictors, making it a more reliable measure when comparing models
+    with different numbers of features.
+    - Useful for time-series forecasting and regression tasks.
 
-    …
-    - Assumes that the dataset is provided as a DataFrameDataset object with `y`, `y_pred`, and `feature_columns` attributes.
-    - The function relies on `adj_r2_score` from the `statsmodels.statsutils` module, which should be correctly implemented and imported.
-    - Requires that `dataset.y_pred(model)` returns the predicted values for the model.
+    ### Limitations
 
+    - Assumes the dataset is provided as a DataFrameDataset object with `y`, `y_pred`, and `feature_columns` attributes.
+    - Relies on `adj_r2_score` from the `statsmodels.statsutils` module, which needs to be correctly implemented and
+    imported.
+    - Requires that `dataset.y_pred(model)` returns the predicted values for the model.
     """
     results_list = []
 
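The Test Mechanism above iterates over dataset-model pairs and collects both scores into a summary table. A rough sketch of that loop follows, assuming paired `datasets` and `models` inputs and the `y`, `y_pred()`, `feature_columns`, and `input_id` attributes mentioned in the docstring; it is not the package's implementation:

```python
import pandas as pd
from sklearn.metrics import r2_score


def r2_comparison_table(datasets, models):
    """Illustrative pairing of datasets and models into a comparison table."""
    rows = []
    for dataset, model in zip(datasets, models):
        y_true = dataset.y
        y_pred = dataset.y_pred(model)
        r2 = r2_score(y_true, y_pred)
        n, p = len(y_true), len(dataset.feature_columns)
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
        rows.append(
            {
                "Model": model.input_id,
                "Dataset": dataset.input_id,
                "R-Squared": r2,
                "Adjusted R-Squared": adj_r2,
            }
        )
    return pd.DataFrame(rows)
```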
validmind/tests/model_validation/sklearn/RobustnessDiagnosis.py

@@ -315,38 +315,42 @@ def robustness_diagnosis(
 
 @dataclass
 class RobustnessDiagnosis(ThresholdTest):
-    """
-    … (31 lines not shown)
+    """
+    Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.
+
+    ### Purpose
+
+    The Robustness Diagnosis test aims to evaluate the resilience of a machine learning model when subjected to
+    perturbations or noise in its input data. This is essential for understanding the model's ability to handle
+    real-world scenarios where data may be imperfect or corrupted.
+
+    ### Test Mechanism
+
+    This test introduces Gaussian noise to the numeric input features of the datasets at varying scales of standard
+    deviation. The performance of the model is then measured using a specified metric. The process includes:
+
+    - Adding Gaussian noise to numerical input features based on scaling factors.
+    - Evaluating the model's performance on the perturbed data using metrics like AUC for classification tasks and MSE
+    for regression tasks.
+    - Aggregating and plotting the results to visualize performance decay relative to perturbation size.
+
+    ### Signs of High Risk
+
+    - A significant drop in performance metrics with minimal noise.
+    - Performance decay values exceeding the specified threshold.
+    - Consistent failure to meet performance standards across multiple perturbation scales.
+
+    ### Strengths
+
+    - Provides insights into the model's robustness against noisy or corrupted data.
+    - Utilizes a variety of performance metrics suitable for both classification and regression tasks.
+    - Visualization helps in understanding the extent of performance degradation.
+
+    ### Limitations
+
+    - Gaussian noise might not adequately represent all types of real-world data perturbations.
+    - Performance thresholds are somewhat arbitrary and might need tuning.
+    - The test may not account for more complex or unstructured noise patterns that could affect model robustness.
     """
 
     name = "robustness"
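The Gaussian-noise perturbation that the `RobustnessDiagnosis` Test Mechanism describes can be sketched as follows. Scaling the noise by each feature's standard deviation is an assumption about the implementation, not a confirmed detail:

```python
import numpy as np
import pandas as pd


def add_gaussian_noise(features: pd.DataFrame, scale_factor: float, seed: int = 0) -> pd.DataFrame:
    """Perturb numeric columns with Gaussian noise proportional to each column's std (illustrative)."""
    rng = np.random.default_rng(seed)
    perturbed = features.copy()
    for col in perturbed.select_dtypes(include="number").columns:
        sigma = perturbed[col].std() * scale_factor
        perturbed[col] = perturbed[col] + rng.normal(0.0, sigma, size=len(perturbed))
    return perturbed
```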
validmind/tests/model_validation/sklearn/SHAPGlobalImportance.py

@@ -22,13 +22,15 @@ class SHAPGlobalImportance(Metric):
     """
     Evaluates and visualizes global feature importance using SHAP values for model explanation and risk identification.
 
-    …
+    ### Purpose
+
     The SHAP (SHapley Additive exPlanations) Global Importance metric aims to elucidate model outcomes by attributing
     them to the contributing features. It assigns a quantifiable global importance to each feature via their respective
     absolute Shapley values, thereby making it suitable for tasks like classification (both binary and multiclass).
     This metric forms an essential part of model risk management.
 
-    …
+    ### Test Mechanism
+
     The exam begins with the selection of a suitable explainer which aligns with the model's type. For tree-based
     models like XGBClassifier, RandomForestClassifier, CatBoostClassifier, TreeExplainer is used whereas for linear
     models like LogisticRegression, XGBRegressor, LinearRegression, it is the LinearExplainer. Once the explainer
@@ -44,20 +46,20 @@ class SHAPGlobalImportance(Metric):
     gradually changing from low to high. Features are systematically organized in accordance with their importance.
     These plots are generated by the function `_generate_shap_plot()`.
 
-    …
+    ### Signs of High Risk
 
     - Overemphasis on certain features in SHAP importance plots, thus hinting at the possibility of model overfitting
     - Anomalies such as unexpected or illogical features showing high importance, which might suggest that the model's
     decisions are rooted in incorrect or undesirable reasoning
     - A SHAP summary plot filled with high variability or scattered data points, indicating a cause for concern
 
-    …
+    ### Strengths
 
     - SHAP does more than just illustrating global feature significance, it offers a detailed perspective on how
     different features shape the model's decision-making logic for each instance.
     - It provides clear insights into model behavior.
 
-    …
+    ### Limitations
 
     - High-dimensional data can convolute interpretations.
     - Associating importance with tangible real-world impact still involves a certain degree of subjectivity.
|
|
20
20
|
@dataclass
|
21
21
|
class SilhouettePlot(Metric):
|
22
22
|
"""
|
23
|
-
Calculates and visualizes Silhouette Score, assessing degree of data point suitability to its cluster in ML
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
|
28
|
-
|
29
|
-
Silhouette Score
|
30
|
-
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
|
35
|
-
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
|
23
|
+
Calculates and visualizes Silhouette Score, assessing the degree of data point suitability to its cluster in ML
|
24
|
+
models.
|
25
|
+
|
26
|
+
### Purpose
|
27
|
+
|
28
|
+
This test calculates the Silhouette Score, which is a model performance metric used in clustering applications.
|
29
|
+
Primarily, the Silhouette Score evaluates how similar a data point is to its own cluster compared to other
|
30
|
+
clusters. The metric ranges between -1 and 1, where a high value indicates that the object is well matched to its
|
31
|
+
own cluster and poorly matched to neighboring clusters. Thus, the goal is to achieve a high Silhouette Score,
|
32
|
+
implying well-separated clusters.
|
33
|
+
|
34
|
+
### Test Mechanism
|
35
|
+
|
36
|
+
The test first extracts the true and predicted labels from the model's training data. The test runs the Silhouette
|
37
|
+
Score function, which takes as input the training dataset features and the predicted labels, subsequently
|
38
|
+
calculating the average score. This average Silhouette Score is printed for reference. The script then calculates
|
39
|
+
the silhouette coefficients for each data point, helping to form the Silhouette Plot. Each cluster is represented
|
40
|
+
in this plot, with color distinguishing between different clusters. A red dashed line indicates the average
|
41
|
+
Silhouette Score. The Silhouette Scores are also collected into a structured table, facilitating model performance
|
42
|
+
analysis and comparison.
|
43
|
+
|
44
|
+
### Signs of High Risk
|
45
|
+
|
40
46
|
- A low Silhouette Score, potentially indicating that the clusters are not well separated and that data points may
|
41
47
|
not be fitting well to their respective clusters.
|
42
48
|
- A Silhouette Plot displaying overlapping clusters or the absence of clear distinctions between clusters visually
|
43
49
|
also suggests poor clustering performance.
|
44
50
|
|
45
|
-
|
51
|
+
### Strengths
|
52
|
+
|
46
53
|
- The Silhouette Score provides a clear and quantitative measure of how well data points have been grouped into
|
47
54
|
clusters, offering insights into model performance.
|
48
55
|
- The Silhouette Plot provides an intuitive, graphical representation of the clustering mechanism, aiding visual
|
49
56
|
assessments of model performance.
|
50
57
|
- It does not require ground truth labels, so it's useful when true cluster assignments are not known.
|
51
58
|
|
52
|
-
|
59
|
+
### Limitations
|
60
|
+
|
53
61
|
- The Silhouette Score may be susceptible to the influence of outliers, which could impact its accuracy and
|
54
62
|
reliability.
|
55
63
|
- It assumes the clusters are convex and isotropic, which might not be the case with complex datasets.
|
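The average score and per-point coefficients referenced in the `SilhouettePlot` Test Mechanism come straight from scikit-learn; a minimal, self-contained example of those calls (on synthetic data, not the package's own plotting code):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

avg_score = silhouette_score(X, labels)      # overall score in [-1, 1]
per_point = silhouette_samples(X, labels)    # one coefficient per data point, used for the plot
print(f"average silhouette: {avg_score:.3f}")
```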
validmind/tests/model_validation/sklearn/TrainingTestDegradation.py

@@ -32,33 +32,40 @@ class TrainingTestDegradation(ThresholdTest):
     """
     Tests if model performance degradation between training and test datasets exceeds a predefined threshold.
 
-    … (12 lines not shown)
+    ### Purpose
+
+    The `TrainingTestDegradation` class serves as a test to verify that the degradation in performance between the
+    training and test datasets does not exceed a predefined threshold. This test measures the model's ability to
+    generalize from its training data to unseen test data, assessing key classification metrics such as accuracy,
+    precision, recall, and f1 score to verify the model's robustness and reliability.
+
+    ### Test Mechanism
+
+    The code applies several predefined metrics, including accuracy, precision, recall, and f1 scores, to the model's
+    predictions for both the training and test datasets. It calculates the degradation as the difference between the
+    training score and test score divided by the training score. The test is considered successful if the degradation
+    for each metric is less than the preset maximum threshold of 10%. The results are summarized in a table showing
+    each metric's train score, test score, degradation percentage, and pass/fail status.
+
+    ### Signs of High Risk
+
     - A degradation percentage that exceeds the maximum allowed threshold of 10% for any of the evaluated metrics.
     - A high difference or gap between the metric scores on the training and the test datasets.
     - The 'Pass/Fail' column displaying 'Fail' for any of the evaluated metrics.
 
-    … (3 lines not shown)
+    ### Strengths
+
+    - Provides a quantitative measure of the model's ability to generalize to unseen data, which is key for predicting
+    its practical real-world performance.
     - By evaluating multiple metrics, it takes into account different facets of model performance and enables a more
     holistic evaluation.
     - The use of a variable predefined threshold allows the flexibility to adjust the acceptability criteria for
     different scenarios.
 
-    … (3 lines not shown)
+    ### Limitations
+
+    - The test compares raw performance on training and test data but does not factor in the nature of the data. Areas
+    with less representation in the training set might still perform poorly on unseen data.
     - It requires good coverage and balance in the test and training datasets to produce reliable results, which may
     not always be available.
     - The test is currently only designed for classification tasks.
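The degradation formula stated in the Test Mechanism above reduces to a one-liner; for example, a train score of 0.92 against a test score of 0.85 gives roughly 7.6% degradation and passes the default 10% threshold:

```python
def degradation(train_score: float, test_score: float, max_threshold: float = 0.10) -> dict:
    """Relative train-to-test degradation, as described above (illustrative)."""
    decay = (train_score - test_score) / train_score
    return {"degradation": decay, "passed": decay < max_threshold}


print(degradation(0.92, 0.85))  # {'degradation': 0.076..., 'passed': True}
```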
validmind/tests/model_validation/sklearn/VMeasure.py

@@ -14,42 +14,43 @@ class VMeasure(ClusterPerformance):
     """
     Evaluates homogeneity and completeness of a clustering model using the V Measure Score.
 
-    …
+    ### Purpose
+
     The purpose of this metric, V Measure Score (V Score), is to evaluate the performance of a clustering model. It
     measures the homogeneity and completeness of a set of cluster labels, where homogeneity refers to each cluster
     containing only members of a single class and completeness meaning all members of a given class are assigned to the
     same cluster.
 
-    … (2 lines not shown)
+    ### Test Mechanism
+
+    ClusterVMeasure is a class that inherits from another class, ClusterPerformance. It uses the `v_measure_score`
     function from the sklearn module's metrics package. The required inputs to perform this metric are the model, train
     dataset, and test dataset. The test is appropriate for models tasked with clustering.
 
-    …
+    ### Signs of High Risk
 
     - Low V Measure Score: A low V Measure Score indicates that the clustering model has poor homogeneity or
     completeness, or both. This might signal that the model is failing to correctly cluster the data.
 
-    …
+    ### Strengths
 
     - The V Measure Score is a harmonic mean between homogeneity and completeness. This ensures that both attributes
     are taken into account when evaluating the model, providing an overall measure of its cluster validity.
-
     - The metric does not require knowledge of the ground truth classes when measuring homogeneity and completeness,
     making it applicable in instances where such information is unavailable.
 
-    … (2 lines not shown)
-    - The V Score can be influenced by the number of clusters, which means that it might not always reflect the quality
-    of the clustering. Partitioning the data into many small clusters could lead to high homogeneity but low
-    completeness, leading to a low V Score even if the clustering might be useful.
+    ### Limitations
 
+    - The V Measure Score can be influenced by the number of clusters, which means that it might not always reflect the
+    quality of the clustering. Partitioning the data into many small clusters could lead to high homogeneity but low
+    completeness, leading to a low V Measure Score even if the clustering might be useful.
     - It assumes equal importance of homogeneity and completeness. In some applications, one may be more important than
-    the other. The V Score does not provide flexibility in assigning different weights to homogeneity and
+    the other. The V Measure Score does not provide flexibility in assigning different weights to homogeneity and
+    completeness.
     """
 
     name = "v_measure_score"
-    required_inputs = ["model", "…
+    required_inputs = ["model", "dataset"]
     tasks = ["clustering"]
     tags = [
         "sklearn",
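The V Measure Score described in the `VMeasure` docstring is scikit-learn's `v_measure_score`, which with its default beta of 1 is the harmonic mean of homogeneity and completeness. A small, self-contained check:

```python
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

h = homogeneity_score(y_true, y_pred)
c = completeness_score(y_true, y_pred)
v = v_measure_score(y_true, y_pred)
# V Measure equals the harmonic mean of homogeneity and completeness (beta = 1)
assert abs(v - 2 * h * c / (h + c)) < 1e-9
```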