validmind 2.5.8__py3-none-any.whl → 2.5.18__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (233)
  1. validmind/__version__.py +1 -1
  2. validmind/ai/test_descriptions.py +80 -119
  3. validmind/ai/test_result_description/config.yaml +29 -0
  4. validmind/ai/test_result_description/context.py +73 -0
  5. validmind/ai/test_result_description/image_processing.py +124 -0
  6. validmind/ai/test_result_description/system.jinja +39 -0
  7. validmind/ai/test_result_description/user.jinja +25 -0
  8. validmind/api_client.py +89 -43
  9. validmind/client.py +2 -2
  10. validmind/client_config.py +11 -14
  11. validmind/datasets/credit_risk/__init__.py +1 -0
  12. validmind/datasets/credit_risk/datasets/lending_club_biased.csv.gz +0 -0
  13. validmind/datasets/credit_risk/lending_club_bias.py +142 -0
  14. validmind/datasets/regression/fred_timeseries.py +67 -138
  15. validmind/template.py +1 -0
  16. validmind/test_suites/__init__.py +0 -2
  17. validmind/test_suites/statsmodels_timeseries.py +1 -1
  18. validmind/test_suites/summarization.py +0 -1
  19. validmind/test_suites/time_series.py +0 -43
  20. validmind/tests/__types__.py +14 -15
  21. validmind/tests/data_validation/ACFandPACFPlot.py +15 -13
  22. validmind/tests/data_validation/ADF.py +31 -24
  23. validmind/tests/data_validation/AutoAR.py +9 -9
  24. validmind/tests/data_validation/AutoMA.py +23 -16
  25. validmind/tests/data_validation/AutoSeasonality.py +18 -16
  26. validmind/tests/data_validation/AutoStationarity.py +21 -16
  27. validmind/tests/data_validation/BivariateScatterPlots.py +67 -96
  28. validmind/tests/{model_validation/statsmodels → data_validation}/BoxPierce.py +34 -34
  29. validmind/tests/data_validation/ChiSquaredFeaturesTable.py +85 -124
  30. validmind/tests/data_validation/ClassImbalance.py +15 -12
  31. validmind/tests/data_validation/DFGLSArch.py +19 -13
  32. validmind/tests/data_validation/DatasetDescription.py +17 -11
  33. validmind/tests/data_validation/DatasetSplit.py +7 -5
  34. validmind/tests/data_validation/DescriptiveStatistics.py +28 -21
  35. validmind/tests/data_validation/Duplicates.py +33 -25
  36. validmind/tests/data_validation/EngleGrangerCoint.py +35 -33
  37. validmind/tests/data_validation/FeatureTargetCorrelationPlot.py +59 -71
  38. validmind/tests/data_validation/HighCardinality.py +19 -12
  39. validmind/tests/data_validation/HighPearsonCorrelation.py +27 -22
  40. validmind/tests/data_validation/IQROutliersBarPlot.py +13 -10
  41. validmind/tests/data_validation/IQROutliersTable.py +40 -36
  42. validmind/tests/data_validation/IsolationForestOutliers.py +21 -14
  43. validmind/tests/data_validation/JarqueBera.py +70 -0
  44. validmind/tests/data_validation/KPSS.py +34 -29
  45. validmind/tests/data_validation/LJungBox.py +66 -0
  46. validmind/tests/data_validation/LaggedCorrelationHeatmap.py +22 -15
  47. validmind/tests/data_validation/MissingValues.py +32 -27
  48. validmind/tests/data_validation/MissingValuesBarPlot.py +25 -21
  49. validmind/tests/data_validation/PearsonCorrelationMatrix.py +71 -84
  50. validmind/tests/data_validation/PhillipsPerronArch.py +37 -30
  51. validmind/tests/data_validation/ProtectedClassesCombination.py +197 -0
  52. validmind/tests/data_validation/ProtectedClassesDescription.py +130 -0
  53. validmind/tests/data_validation/ProtectedClassesDisparity.py +133 -0
  54. validmind/tests/data_validation/ProtectedClassesThresholdOptimizer.py +172 -0
  55. validmind/tests/data_validation/RollingStatsPlot.py +31 -23
  56. validmind/tests/data_validation/RunsTest.py +72 -0
  57. validmind/tests/data_validation/ScatterPlot.py +63 -78
  58. validmind/tests/data_validation/SeasonalDecompose.py +38 -34
  59. validmind/tests/{model_validation/statsmodels → data_validation}/ShapiroWilk.py +35 -30
  60. validmind/tests/data_validation/Skewness.py +35 -37
  61. validmind/tests/data_validation/SpreadPlot.py +35 -35
  62. validmind/tests/data_validation/TabularCategoricalBarPlots.py +23 -17
  63. validmind/tests/data_validation/TabularDateTimeHistograms.py +21 -13
  64. validmind/tests/data_validation/TabularDescriptionTables.py +51 -16
  65. validmind/tests/data_validation/TabularNumericalHistograms.py +25 -22
  66. validmind/tests/data_validation/TargetRateBarPlots.py +21 -14
  67. validmind/tests/data_validation/TimeSeriesDescription.py +25 -18
  68. validmind/tests/data_validation/TimeSeriesDescriptiveStatistics.py +23 -17
  69. validmind/tests/data_validation/TimeSeriesFrequency.py +24 -17
  70. validmind/tests/data_validation/TimeSeriesHistogram.py +33 -32
  71. validmind/tests/data_validation/TimeSeriesLinePlot.py +17 -10
  72. validmind/tests/data_validation/TimeSeriesMissingValues.py +15 -10
  73. validmind/tests/data_validation/TimeSeriesOutliers.py +37 -33
  74. validmind/tests/data_validation/TooManyZeroValues.py +16 -11
  75. validmind/tests/data_validation/UniqueRows.py +11 -6
  76. validmind/tests/data_validation/WOEBinPlots.py +23 -16
  77. validmind/tests/data_validation/WOEBinTable.py +35 -30
  78. validmind/tests/data_validation/ZivotAndrewsArch.py +34 -28
  79. validmind/tests/data_validation/nlp/CommonWords.py +21 -14
  80. validmind/tests/data_validation/nlp/Hashtags.py +42 -40
  81. validmind/tests/data_validation/nlp/LanguageDetection.py +33 -14
  82. validmind/tests/data_validation/nlp/Mentions.py +21 -15
  83. validmind/tests/data_validation/nlp/PolarityAndSubjectivity.py +32 -9
  84. validmind/tests/data_validation/nlp/Punctuations.py +24 -20
  85. validmind/tests/data_validation/nlp/Sentiment.py +27 -8
  86. validmind/tests/data_validation/nlp/StopWords.py +26 -19
  87. validmind/tests/data_validation/nlp/TextDescription.py +39 -36
  88. validmind/tests/data_validation/nlp/Toxicity.py +32 -9
  89. validmind/tests/decorator.py +81 -42
  90. validmind/tests/model_validation/BertScore.py +36 -27
  91. validmind/tests/model_validation/BleuScore.py +25 -19
  92. validmind/tests/model_validation/ClusterSizeDistribution.py +38 -34
  93. validmind/tests/model_validation/ContextualRecall.py +38 -13
  94. validmind/tests/model_validation/FeaturesAUC.py +32 -13
  95. validmind/tests/model_validation/MeteorScore.py +46 -33
  96. validmind/tests/model_validation/ModelMetadata.py +32 -64
  97. validmind/tests/model_validation/ModelPredictionResiduals.py +75 -73
  98. validmind/tests/model_validation/RegardScore.py +30 -14
  99. validmind/tests/model_validation/RegressionResidualsPlot.py +10 -5
  100. validmind/tests/model_validation/RougeScore.py +36 -30
  101. validmind/tests/model_validation/TimeSeriesPredictionWithCI.py +30 -14
  102. validmind/tests/model_validation/TimeSeriesPredictionsPlot.py +27 -30
  103. validmind/tests/model_validation/TimeSeriesR2SquareBySegments.py +68 -63
  104. validmind/tests/model_validation/TokenDisparity.py +31 -23
  105. validmind/tests/model_validation/ToxicityScore.py +26 -17
  106. validmind/tests/model_validation/embeddings/ClusterDistribution.py +24 -20
  107. validmind/tests/model_validation/embeddings/CosineSimilarityComparison.py +30 -27
  108. validmind/tests/model_validation/embeddings/CosineSimilarityDistribution.py +7 -5
  109. validmind/tests/model_validation/embeddings/CosineSimilarityHeatmap.py +32 -23
  110. validmind/tests/model_validation/embeddings/DescriptiveAnalytics.py +7 -5
  111. validmind/tests/model_validation/embeddings/EmbeddingsVisualization2D.py +15 -11
  112. validmind/tests/model_validation/embeddings/EuclideanDistanceComparison.py +29 -29
  113. validmind/tests/model_validation/embeddings/EuclideanDistanceHeatmap.py +34 -25
  114. validmind/tests/model_validation/embeddings/PCAComponentsPairwisePlots.py +38 -26
  115. validmind/tests/model_validation/embeddings/StabilityAnalysis.py +40 -1
  116. validmind/tests/model_validation/embeddings/StabilityAnalysisKeyword.py +18 -17
  117. validmind/tests/model_validation/embeddings/StabilityAnalysisRandomNoise.py +40 -45
  118. validmind/tests/model_validation/embeddings/StabilityAnalysisSynonyms.py +17 -19
  119. validmind/tests/model_validation/embeddings/StabilityAnalysisTranslation.py +29 -25
  120. validmind/tests/model_validation/embeddings/TSNEComponentsPairwisePlots.py +38 -28
  121. validmind/tests/model_validation/ragas/AnswerCorrectness.py +5 -4
  122. validmind/tests/model_validation/ragas/AnswerRelevance.py +5 -4
  123. validmind/tests/model_validation/ragas/AnswerSimilarity.py +5 -4
  124. validmind/tests/model_validation/ragas/AspectCritique.py +12 -6
  125. validmind/tests/model_validation/ragas/ContextEntityRecall.py +9 -8
  126. validmind/tests/model_validation/ragas/ContextPrecision.py +5 -4
  127. validmind/tests/model_validation/ragas/ContextRecall.py +5 -4
  128. validmind/tests/model_validation/ragas/ContextUtilization.py +155 -0
  129. validmind/tests/model_validation/ragas/Faithfulness.py +5 -4
  130. validmind/tests/model_validation/ragas/NoiseSensitivity.py +152 -0
  131. validmind/tests/model_validation/ragas/utils.py +6 -0
  132. validmind/tests/model_validation/sklearn/AdjustedMutualInformation.py +19 -12
  133. validmind/tests/model_validation/sklearn/AdjustedRandIndex.py +22 -17
  134. validmind/tests/model_validation/sklearn/ClassifierPerformance.py +27 -25
  135. validmind/tests/model_validation/sklearn/ClusterCosineSimilarity.py +7 -5
  136. validmind/tests/model_validation/sklearn/ClusterPerformance.py +40 -78
  137. validmind/tests/model_validation/sklearn/ClusterPerformanceMetrics.py +15 -17
  138. validmind/tests/model_validation/sklearn/CompletenessScore.py +17 -11
  139. validmind/tests/model_validation/sklearn/ConfusionMatrix.py +22 -15
  140. validmind/tests/model_validation/sklearn/FeatureImportance.py +95 -0
  141. validmind/tests/model_validation/sklearn/FowlkesMallowsScore.py +7 -7
  142. validmind/tests/model_validation/sklearn/HomogeneityScore.py +19 -12
  143. validmind/tests/model_validation/sklearn/HyperParametersTuning.py +35 -30
  144. validmind/tests/model_validation/sklearn/KMeansClustersOptimization.py +10 -5
  145. validmind/tests/model_validation/sklearn/MinimumAccuracy.py +32 -32
  146. validmind/tests/model_validation/sklearn/MinimumF1Score.py +23 -23
  147. validmind/tests/model_validation/sklearn/MinimumROCAUCScore.py +15 -10
  148. validmind/tests/model_validation/sklearn/ModelsPerformanceComparison.py +26 -19
  149. validmind/tests/model_validation/sklearn/OverfitDiagnosis.py +38 -18
  150. validmind/tests/model_validation/sklearn/PermutationFeatureImportance.py +32 -26
  151. validmind/tests/model_validation/sklearn/PopulationStabilityIndex.py +8 -6
  152. validmind/tests/model_validation/sklearn/PrecisionRecallCurve.py +24 -17
  153. validmind/tests/model_validation/sklearn/ROCCurve.py +12 -7
  154. validmind/tests/model_validation/sklearn/RegressionErrors.py +74 -130
  155. validmind/tests/model_validation/sklearn/RegressionErrorsComparison.py +27 -12
  156. validmind/tests/model_validation/sklearn/{RegressionModelsPerformanceComparison.py → RegressionPerformance.py} +18 -20
  157. validmind/tests/model_validation/sklearn/RegressionR2Square.py +55 -94
  158. validmind/tests/model_validation/sklearn/RegressionR2SquareComparison.py +32 -13
  159. validmind/tests/model_validation/sklearn/RobustnessDiagnosis.py +36 -32
  160. validmind/tests/model_validation/sklearn/SHAPGlobalImportance.py +66 -5
  161. validmind/tests/model_validation/sklearn/SilhouettePlot.py +27 -19
  162. validmind/tests/model_validation/sklearn/TrainingTestDegradation.py +25 -18
  163. validmind/tests/model_validation/sklearn/VMeasure.py +14 -13
  164. validmind/tests/model_validation/sklearn/WeakspotsDiagnosis.py +7 -5
  165. validmind/tests/model_validation/statsmodels/AutoARIMA.py +24 -18
  166. validmind/tests/model_validation/statsmodels/CumulativePredictionProbabilities.py +73 -104
  167. validmind/tests/model_validation/statsmodels/DurbinWatsonTest.py +59 -32
  168. validmind/tests/model_validation/statsmodels/GINITable.py +44 -77
  169. validmind/tests/model_validation/statsmodels/KolmogorovSmirnov.py +33 -34
  170. validmind/tests/model_validation/statsmodels/Lilliefors.py +27 -24
  171. validmind/tests/model_validation/statsmodels/PredictionProbabilitiesHistogram.py +86 -119
  172. validmind/tests/model_validation/statsmodels/RegressionCoeffs.py +100 -0
  173. validmind/tests/model_validation/statsmodels/RegressionFeatureSignificance.py +14 -9
  174. validmind/tests/model_validation/statsmodels/RegressionModelForecastPlot.py +17 -13
  175. validmind/tests/model_validation/statsmodels/RegressionModelForecastPlotLevels.py +46 -43
  176. validmind/tests/model_validation/statsmodels/RegressionModelSensitivityPlot.py +38 -36
  177. validmind/tests/model_validation/statsmodels/RegressionModelSummary.py +30 -28
  178. validmind/tests/model_validation/statsmodels/RegressionPermutationFeatureImportance.py +18 -11
  179. validmind/tests/model_validation/statsmodels/ScorecardHistogram.py +75 -107
  180. validmind/tests/ongoing_monitoring/FeatureDrift.py +10 -6
  181. validmind/tests/ongoing_monitoring/PredictionAcrossEachFeature.py +31 -25
  182. validmind/tests/ongoing_monitoring/PredictionCorrelation.py +29 -21
  183. validmind/tests/ongoing_monitoring/TargetPredictionDistributionPlot.py +31 -23
  184. validmind/tests/prompt_validation/Bias.py +14 -11
  185. validmind/tests/prompt_validation/Clarity.py +16 -14
  186. validmind/tests/prompt_validation/Conciseness.py +7 -5
  187. validmind/tests/prompt_validation/Delimitation.py +23 -22
  188. validmind/tests/prompt_validation/NegativeInstruction.py +7 -5
  189. validmind/tests/prompt_validation/Robustness.py +12 -10
  190. validmind/tests/prompt_validation/Specificity.py +13 -11
  191. validmind/tests/prompt_validation/ai_powered_test.py +6 -0
  192. validmind/tests/run.py +68 -23
  193. validmind/unit_metrics/__init__.py +81 -144
  194. validmind/unit_metrics/classification/{sklearn/Accuracy.py → Accuracy.py} +1 -1
  195. validmind/unit_metrics/classification/{sklearn/F1.py → F1.py} +1 -1
  196. validmind/unit_metrics/classification/{sklearn/Precision.py → Precision.py} +1 -1
  197. validmind/unit_metrics/classification/{sklearn/ROC_AUC.py → ROC_AUC.py} +1 -2
  198. validmind/unit_metrics/classification/{sklearn/Recall.py → Recall.py} +1 -1
  199. validmind/unit_metrics/regression/{sklearn/AdjustedRSquaredScore.py → AdjustedRSquaredScore.py} +1 -1
  200. validmind/unit_metrics/regression/GiniCoefficient.py +1 -1
  201. validmind/unit_metrics/regression/HuberLoss.py +1 -1
  202. validmind/unit_metrics/regression/KolmogorovSmirnovStatistic.py +1 -1
  203. validmind/unit_metrics/regression/{sklearn/MeanAbsoluteError.py → MeanAbsoluteError.py} +1 -1
  204. validmind/unit_metrics/regression/MeanAbsolutePercentageError.py +1 -1
  205. validmind/unit_metrics/regression/MeanBiasDeviation.py +1 -1
  206. validmind/unit_metrics/regression/{sklearn/MeanSquaredError.py → MeanSquaredError.py} +1 -1
  207. validmind/unit_metrics/regression/QuantileLoss.py +1 -1
  208. validmind/unit_metrics/regression/{sklearn/RSquaredScore.py → RSquaredScore.py} +1 -1
  209. validmind/unit_metrics/regression/{sklearn/RootMeanSquaredError.py → RootMeanSquaredError.py} +1 -1
  210. validmind/utils.py +4 -0
  211. validmind/vm_models/dataset/dataset.py +2 -0
  212. validmind/vm_models/figure.py +5 -0
  213. validmind/vm_models/test/metric.py +1 -0
  214. validmind/vm_models/test/result_wrapper.py +143 -158
  215. validmind/vm_models/test/threshold_test.py +1 -0
  216. {validmind-2.5.8.dist-info → validmind-2.5.18.dist-info}/METADATA +4 -3
  217. validmind-2.5.18.dist-info/RECORD +324 -0
  218. validmind/tests/data_validation/ANOVAOneWayTable.py +0 -138
  219. validmind/tests/data_validation/BivariateFeaturesBarPlots.py +0 -142
  220. validmind/tests/data_validation/BivariateHistograms.py +0 -117
  221. validmind/tests/data_validation/HeatmapFeatureCorrelations.py +0 -124
  222. validmind/tests/data_validation/MissingValuesRisk.py +0 -88
  223. validmind/tests/model_validation/ModelMetadataComparison.py +0 -59
  224. validmind/tests/model_validation/sklearn/FeatureImportanceComparison.py +0 -83
  225. validmind/tests/model_validation/statsmodels/JarqueBera.py +0 -73
  226. validmind/tests/model_validation/statsmodels/LJungBox.py +0 -66
  227. validmind/tests/model_validation/statsmodels/RegressionCoeffsPlot.py +0 -135
  228. validmind/tests/model_validation/statsmodels/RegressionModelsCoeffs.py +0 -103
  229. validmind/tests/model_validation/statsmodels/RunsTest.py +0 -71
  230. validmind-2.5.8.dist-info/RECORD +0 -318
  231. {validmind-2.5.8.dist-info → validmind-2.5.18.dist-info}/LICENSE +0 -0
  232. {validmind-2.5.8.dist-info → validmind-2.5.18.dist-info}/WHEEL +0 -0
  233. {validmind-2.5.8.dist-info → validmind-2.5.18.dist-info}/entry_points.txt +0 -0
validmind/tests/data_validation/BivariateHistograms.py (deleted)
@@ -1,117 +0,0 @@
- # Copyright © 2023-2024 ValidMind Inc. All rights reserved.
- # See the LICENSE file in the root of this repository for details.
- # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
-
- from dataclasses import dataclass
-
- import matplotlib.pyplot as plt
- import seaborn as sns
-
- from validmind.vm_models import Figure, Metric
-
-
- @dataclass
- class BivariateHistograms(Metric):
-     """
-     Generates bivariate histograms for paired features, aiding in visual inspection of categorical variables'
-     distributions and correlations.
-
-     **Purpose**: This metric, dubbed BivariateHistograms, is primarily used for visual data analysis via the inspection
-     of variable distribution, specifically categorical variables. Its main objective is to ascertain any potential
-     correlations between these variables and distributions within each defined target class. This is achieved by
-     offering an intuitive avenue into gaining insights into the characteristics of the data and any plausible patterns
-     therein.
-
-     **Test Mechanism**: The working mechanism of the BivariateHistograms module revolves around an input dataset and a
-     series of feature pairs. It uses seaborn's histogram plotting function and matplotlib techniques to create
-     bivariate histograms for each feature pair in the dataset. Two histograms, stratified by the target column status,
-     are produced for every pair of features. This enables the telling apart of different target statuses through color
-     differentiation. The module also offers optional functionality for restricting the data by a specific status
-     through the target_filter parameter.
-
-     **Signs of High Risk**:
-     - Irregular or unexpected distributions of data across the different categories.
-     - Highly skewed data distributions.
-     - Significant deviations from the perceived 'normal' or anticipated distributions.
-     - Large discrepancies in distribution patterns between various target statuses.
-
-     **Strengths**:
-     - Owing to its simplicity, the histogram-based approach is easy to implement and interpret which translates to
-     quick insights.
-     - The metrics provides a consolidated view of the distribution of data across different target conditions for each
-     variable pair, thereby assisting in highlighting potential correlations and patterns.
-     - It proves advantageous in spotting anomalies, comprehending interactions among features, and facilitating
-     exploratory data analysis.
-
-     **Limitations**:
-     - Its simplicity may be a drawback when it comes to spotting intricate or complex patterns in data.
-     - Overplotting might occur when working with larger datasets.
-     - The metric is only applicable to categorical data, and offers limited insights for numerical or continuous
-     variables.
-     - The interpretation of visual results hinges heavily on the expertise of the observer, possibly leading to
-     subjective analysis.
-     """
-
-     name = "bivariate_histograms"
-     required_inputs = ["dataset"]
-     default_params = {"features_pairs": None, "target_filter": None}
-     tasks = ["classification"]
-     tags = [
-         "tabular_data",
-         "categorical_data",
-         "binary_classification",
-         "multiclass_classification",
-         "visualization",
-     ]
-
-     def plot_bivariate_histogram(self, features_pairs, target_filter):
-         status_var = self.inputs.dataset.target_column
-         figures = []
-         palette = {0: (0.5, 0.5, 0.5, 0.8), 1: "tab:red"}
-
-         for x, y in features_pairs.items():
-             df = self.inputs.dataset.df
-             if target_filter is not None:
-                 df = df[df[status_var] == target_filter]
-
-             fig, axes = plt.subplots(2, 1)
-
-             for ax, var in zip(axes, [x, y]):
-                 for status, color in palette.items():
-                     subset = df[df[status_var] == status]
-                     sns.histplot(
-                         subset[var],
-                         ax=ax,
-                         color=color,
-                         edgecolor=None,
-                         kde=True,
-                         label=status_var if status else "Non-" + status_var,
-                     )
-
-                 ax.set_title(f"Histogram of {var} by {status_var}")
-                 ax.set_xlabel(var)
-                 ax.legend()
-
-             plt.tight_layout()
-             plt.show()
-
-             figures.append(
-                 Figure(for_object=self, key=f"{self.key}:{x}_{y}", figure=plt.figure())
-             )
-
-         plt.close("all")
-
-         return figures
-
-     def run(self):
-         features_pairs = self.params["features_pairs"]
-         target_filter = self.params["target_filter"]
-
-         if features_pairs is None:
-             raise ValueError(
-                 "The features_pairs parameter is required for this metric."
-             )
-
-         figures = self.plot_bivariate_histogram(features_pairs, target_filter)
-
-         return self.cache_results(figures=figures)
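The removed test is a thin wrapper around seaborn's `histplot`, stratified by the target column. For readers who relied on it, a minimal standalone sketch of the same visualization (hypothetical `bivariate_histograms` helper, not part of the validmind API; assumes a pandas DataFrame with a binary target column):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def bivariate_histograms(df, feature_pairs, target="target"):
    """Plot KDE-smoothed histograms for each feature in a pair, split by target class."""
    for x, y in feature_pairs:
        fig, axes = plt.subplots(2, 1, figsize=(6, 6))
        for ax, var in zip(axes, (x, y)):
            # hue= colors the bars by target class, mirroring the removed test's palette logic
            sns.histplot(data=df, x=var, hue=target, kde=True, ax=ax)
            ax.set_title(f"Histogram of {var} by {target}")
        fig.tight_layout()
    plt.show()
```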
validmind/tests/data_validation/HeatmapFeatureCorrelations.py (deleted)
@@ -1,124 +0,0 @@
- # Copyright © 2023-2024 ValidMind Inc. All rights reserved.
- # See the LICENSE file in the root of this repository for details.
- # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
-
- from dataclasses import dataclass
-
- import matplotlib.pyplot as plt
- import seaborn as sns
-
- from validmind.vm_models import Figure, Metric
-
-
- @dataclass
- class HeatmapFeatureCorrelations(Metric):
-     """
-     Creates a heatmap to visually represent correlation patterns between pairs of numerical features in a dataset.
-
-     **Purpose:** The HeatmapFeatureCorrelations metric is utilized to evaluate the degree of interrelationships between
-     pairs of input features within a dataset. This metric allows us to visually comprehend the correlation patterns
-     through a heatmap, which can be essential in understanding which features may contribute most significantly to the
-     performance of the model. Features that have high intercorrelation can potentially reduce the model's ability to
-     learn, thus impacting the overall performance and stability of the machine learning model.
-
-     **Test Mechanism:** The metric executes the correlation test by computing the Pearson correlations for all pairs of
-     numerical features. It then generates a heatmap plot using seaborn, a Python data visualization library. The
-     colormap ranges from -1 to 1, indicating perfect negative correlation and perfect positive correlation
-     respectively. A 'declutter' option is provided which, if set to true, removes variable names and numerical
-     correlations from the plot to provide a more streamlined view. The size of feature names and correlation
-     coefficients can be controlled through 'fontsize' parameters.
-
-     **Signs of High Risk:**
-
-     - Indicators of potential risk include features with high absolute correlation values.
-     - A significant degree of multicollinearity might lead to instabilities in the trained model and can also result in
-     overfitting.
-     - The presence of multiple homogeneous blocks of high positive or negative correlation within the plot might
-     indicate redundant or irrelevant features included within the dataset.
-
-     **Strengths:**
-
-     - The strength of this metric lies in its ability to visually represent the extent and direction of correlation
-     between any two numeric features, which aids in the interpretation and understanding of complex data relationships.
-     - The heatmap provides an immediate and intuitively understandable representation, hence, it is extremely useful
-     for high-dimensional datasets where extracting meaningful relationships might be challenging.
-
-     **Limitations:**
-
-     - The central limitation might be that it can only calculate correlation between numeric features, making it
-     unsuitable for categorical variables unless they are already numerically encoded in a meaningful manner.
-     - It uses Pearson's correlation, which only measures linear relationships between features. It may perform poorly
-     in cases where the relationship is non-linear.
-     - Large feature sets might result in cluttered and difficult-to-read correlation heatmaps, especially when the
-     'declutter' option is set to false.
-     """
-
-     name = "heatmap_feature_correlations"
-     required_inputs = ["dataset"]
-     default_params = {"declutter": None, "fontsize": None, "num_features": None}
-     tasks = ["classification", "regression"]
-     tags = ["tabular_data", "visualization", "correlation"]
-
-     def run(self):
-         features = self.params.get("features")
-         declutter = self.params.get("declutter", False)
-         fontsize = self.params.get("fontsize", 13)
-
-         # Filter DataFrame based on num_features
-         if features is None:
-             df = self.inputs.dataset.df
-         else:
-             df = self.inputs.dataset.df[features]
-
-         figure = self.visualize_correlations(df, declutter, fontsize)
-
-         return self.cache_results(figures=figure)
-
-     def visualize_correlations(self, df, declutter, fontsize):
-         # Compute Pearson correlations
-         correlations = df.corr(method="pearson")
-
-         # Create a figure and axes
-         fig, ax = plt.subplots()
-
-         # If declutter option is true, do not show correlation coefficients and variable names
-         if declutter:
-             sns.heatmap(
-                 correlations,
-                 cmap="coolwarm",
-                 vmin=-1,
-                 vmax=1,
-                 ax=ax,
-                 cbar_kws={"label": "Correlation"},
-             )
-             ax.set_xticklabels([])
-             ax.set_yticklabels([])
-             ax.set_xlabel(f"{df.shape[1]} Numerical Features", fontsize=fontsize)
-             ax.set_ylabel(f"{df.shape[1]} Numerical Features", fontsize=fontsize)
-         else:
-             # For the correlation numbers, you can use the 'annot_kws' argument
-             sns.heatmap(
-                 correlations,
-                 cmap="coolwarm",
-                 vmin=-1,
-                 vmax=1,
-                 annot=True,
-                 fmt=".2f",
-                 ax=ax,
-                 cbar_kws={"label": "Correlation"},
-                 annot_kws={"size": fontsize},
-             )
-             plt.yticks(fontsize=fontsize)
-             plt.xticks(rotation=90, fontsize=fontsize)
-
-         # To set the fontsize of the color bar, you can iterate over its text elements and set their size
-         cbar = ax.collections[0].colorbar
-         cbar.ax.tick_params(labelsize=fontsize)
-         cbar.set_label("Correlation", size=fontsize)
-
-         # Show the plot
-         plt.tight_layout()
-         plt.close("all")
-
-         figure = Figure(for_object=self, key=self.key, figure=fig)
-         return [figure]
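The removed metric boils down to a Pearson-correlation heatmap; note that `PearsonCorrelationMatrix` (entry 49 in the file list) was updated in the same release and covers similar ground. A minimal standalone sketch (hypothetical helper, not part of the validmind API; assumes pandas and seaborn):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def correlation_heatmap(df, fontsize=13):
    # Pearson correlations over numeric columns only, as in the removed metric
    corr = df.select_dtypes("number").corr(method="pearson")
    fig, ax = plt.subplots()
    sns.heatmap(
        corr, cmap="coolwarm", vmin=-1, vmax=1,
        annot=True, fmt=".2f", annot_kws={"size": fontsize},
        cbar_kws={"label": "Correlation"}, ax=ax,
    )
    fig.tight_layout()
    return fig
```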
validmind/tests/data_validation/MissingValuesRisk.py (deleted)
@@ -1,88 +0,0 @@
- # Copyright © 2023-2024 ValidMind Inc. All rights reserved.
- # See the LICENSE file in the root of this repository for details.
- # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
-
- from dataclasses import dataclass
-
- from validmind.vm_models import Metric, ResultSummary, ResultTable, ResultTableMetadata
-
-
- @dataclass
- class MissingValuesRisk(Metric):
-     """
-     Assesses and quantifies the risk related to missing values in a dataset used for training an ML model.
-
-     **Purpose**: The Missing Values Risk metric is specifically designed to assess and quantify the risk associated
-     with missing values in the dataset used for machine learning model training. It measures two specific risks: the
-     percentage of total data that are missing, and the percentage of all variables (columns) that contain some missing
-     values.
-
-     **Test Mechanism**: Initially, the metric calculates the total number of data points in the dataset and the count
-     of missing values. It then inspects each variable (column) to determine how many contain at least one missing
-     datapoint. By methodically counting missing datapoints across the entire dataset and each variable (column), it
-     identifies the percentage of missing values in the entire dataset and the percentage of variables (columns) with
-     such values.
-
-     **Signs of High Risk**:
-
-     - Record high percentages in either of the risk measures could suggest a high risk.
-     - If the dataset indicates a high percentage of missing values, it might significantly undermine the model's
-     performance and credibility.
-     - If a significant portion of variables (columns) in the dataset are missing values, this could make the model
-     susceptible to bias and overfitting.
-
-     **Strengths**:
-
-     - The metric offers valuable insights into the readiness of a dataset for model training as missing values can
-     heavily destabilize both the model's performance and predictive capabilities.
-     - The metric's quantification of the risks caused by missing values allows for the use of targeted methods to
-     manage these values correctly- either through removal, imputation, or alternative strategies.
-     - The metric has the flexibility to be applied to both classification and regression assignments, maintaining its
-     utility across a wide range of models and scenarios.
-
-     **Limitations**:
-
-     - The metric primarily identifies and quantifies the risk associated with missing values without suggesting
-     specific mitigation strategies.
-     - The metric does not ascertain whether the missing values are random or associated with an underlying issue in the
-     stages of data collection or preprocessing.
-     - However, the identification of the presence and scale of missing data is the essential initial step towards
-     improving data quality.
-     """
-
-     name = "missing_values_risk"
-     required_inputs = ["dataset"]
-     tasks = ["classification", "regression"]
-     tags = ["tabular_data", "data_quality", "risk_analysis"]
-
-     def run(self):
-         total_cells = self.inputs.dataset.df.size
-         total_missing = self.inputs.dataset.df.isnull().sum().sum()
-         total_columns = self.inputs.dataset.df.shape[1]
-         columns_with_missing = self.inputs.dataset.df.isnull().any().sum()
-
-         risk_measures = {
-             "Missing Values in the Dataset": round(
-                 (total_missing / total_cells) * 100, 2
-             ),
-             "Variables with Missing Values": round(
-                 (columns_with_missing / total_columns) * 100, 2
-             ),
-         }
-
-         return self.cache_results(risk_measures)
-
-     def summary(self, metric_value):
-         risk_measures_table = [
-             {"Risk Metric": measure, "Value (%)": value}
-             for measure, value in metric_value.items()
-         ]
-
-         return ResultSummary(
-             results=[
-                 ResultTable(
-                     data=risk_measures_table,
-                     metadata=ResultTableMetadata(title="Missing Values Risk Measures"),
-                 ),
-             ]
-         )
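No replacement is listed for MissingValuesRisk, but its two risk measures reduce to a few pandas calls. A minimal standalone sketch (hypothetical helper, mirroring the removed `run` method):

```python
import pandas as pd

def missing_values_risk(df: pd.DataFrame) -> dict:
    # Percentage of all cells that are missing
    pct_missing_cells = round(df.isnull().sum().sum() / df.size * 100, 2)
    # Percentage of columns containing at least one missing value
    pct_cols_missing = round(df.isnull().any().sum() / df.shape[1] * 100, 2)
    return {
        "Missing Values in the Dataset": pct_missing_cells,
        "Variables with Missing Values": pct_cols_missing,
    }
```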
validmind/tests/model_validation/ModelMetadataComparison.py (deleted)
@@ -1,59 +0,0 @@
- # Copyright © 2023-2024 ValidMind Inc. All rights reserved.
- # See the LICENSE file in the root of this repository for details.
- # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
-
- import pandas as pd
-
- from validmind import tags, tasks
- from validmind.utils import get_model_info
-
-
- @tags("model_training", "metadata")
- @tasks("regression", "time_series_forecasting")
- def ModelMetadataComparison(models):
-     """
-     Compare metadata of different models and generate a summary table with the results.
-
-     **Purpose**: The purpose of this function is to compare the metadata of different models, including information about their architecture, framework, framework version, and programming language.
-
-     **Test Mechanism**: The function retrieves the metadata for each model using `get_model_info`, renames columns according to a predefined set of labels, and compiles this information into a summary table.
-
-     **Signs of High Risk**:
-     - Inconsistent or missing metadata across models can indicate potential issues in model documentation or management.
-     - Significant differences in framework versions or programming languages might pose challenges in model integration and deployment.
-
-     **Strengths**:
-     - Provides a clear comparison of essential model metadata.
-     - Standardizes metadata labels for easier interpretation and comparison.
-     - Helps identify potential compatibility or consistency issues across models.
-
-     **Limitations**:
-     - Assumes that the `get_model_info` function returns all necessary metadata fields.
-     - Relies on the correctness and completeness of the metadata provided by each model.
-     - Does not include detailed parameter information, focusing instead on high-level metadata.
-     """
-     column_labels = {
-         "architecture": "Modeling Technique",
-         "framework": "Modeling Framework",
-         "framework_version": "Framework Version",
-         "language": "Programming Language",
-     }
-
-     description = []
-
-     for model in models:
-         model_info = get_model_info(model)
-
-         # Rename columns based on provided labels
-         model_info_renamed = {
-             column_labels.get(k, k): v for k, v in model_info.items() if k != "params"
-         }
-
-         # Add model name or identifier if available
-         model_info_renamed = {"Model Name": model.input_id, **model_info_renamed}
-
-         description.append(model_info_renamed)
-
-     description_df = pd.DataFrame(description)
-
-     return description_df
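ModelMetadataComparison was removed while `ModelMetadata.py` (entry 96 in the file list) was reworked in the same release. The core of the removed function is a rename-and-tabulate step over `get_model_info` dictionaries; a minimal standalone sketch (hypothetical `metadata_table` helper, not part of the validmind API):

```python
import pandas as pd

# Labels copied from the removed function
COLUMN_LABELS = {
    "architecture": "Modeling Technique",
    "framework": "Modeling Framework",
    "framework_version": "Framework Version",
    "language": "Programming Language",
}

def metadata_table(model_infos):
    """model_infos: list of dicts shaped like get_model_info output (hypothetical input)."""
    rows = [
        {COLUMN_LABELS.get(k, k): v for k, v in info.items() if k != "params"}
        for info in model_infos
    ]
    return pd.DataFrame(rows)
```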
validmind/tests/model_validation/sklearn/FeatureImportanceComparison.py (deleted)
@@ -1,83 +0,0 @@
- # Copyright © 2023-2024 ValidMind Inc. All rights reserved.
- # See the LICENSE file in the root of this repository for details.
- # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
-
- import pandas as pd
- from sklearn.inspection import permutation_importance
-
- from validmind import tags, tasks
-
-
- @tags("model_explainability", "sklearn")
- @tasks("regression", "time_series_forecasting")
- def FeatureImportanceComparison(datasets, models, num_features=3):
-     """
-     Compare feature importance scores for each model and generate a summary table
-     with the top important features.
-
-     **Purpose**: The purpose of this function is to compare the feature importance scores for different models applied to various datasets.
-
-     **Test Mechanism**: The function iterates through each dataset-model pair, calculates permutation feature importance (PFI) scores, and generates a summary table with the top `num_features` important features for each model.
-
-     **Signs of High Risk**:
-     - If key features expected to be important are ranked low, it could indicate potential issues with model training or data quality.
-     - High variance in feature importance scores across different models may suggest instability in feature selection.
-
-     **Strengths**:
-     - Provides a clear comparison of the most important features for each model.
-     - Uses permutation importance, which is a model-agnostic method and can be applied to any estimator.
-
-     **Limitations**:
-     - Assumes that the dataset is provided as a DataFrameDataset object with `x_df` and `y_df` methods to access feature and target data.
-     - Requires that `model.model` is compatible with `sklearn.inspection.permutation_importance`.
-     - The function's output is dependent on the number of features specified by `num_features`, which defaults to 3 but can be adjusted.
-     """
-     results_list = []
-
-     for dataset, model in zip(datasets, models):
-         x = dataset.x_df()
-         y = dataset.y_df()
-
-         pfi_values = permutation_importance(
-             model.model,
-             x,
-             y,
-             random_state=0,
-             n_jobs=-2,
-         )
-
-         # Create a dictionary to store PFI scores
-         pfi = {
-             column: pfi_values["importances_mean"][i]
-             for i, column in enumerate(x.columns)
-         }
-
-         # Sort features by their importance
-         sorted_features = sorted(pfi.items(), key=lambda item: item[1], reverse=True)
-
-         # Extract the top `num_features` features
-         top_features = sorted_features[:num_features]
-
-         # Prepare the result for the current model and dataset
-         result = {
-             "Model": model.input_id,
-             "Dataset": dataset.input_id,
-         }
-
-         # Dynamically add feature columns to the result
-         for i in range(num_features):
-             if i < len(top_features):
-                 result[
-                     f"Feature {i + 1}"
-                 ] = f"[{top_features[i][0]}; {top_features[i][1]:.4f}]"
-             else:
-                 result[f"Feature {i + 1}"] = None
-
-         # Append the result to the list
-         results_list.append(result)
-
-     # Convert the results list to a DataFrame
-     results_df = pd.DataFrame(results_list)
-     return results_df
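No direct replacement is named, though a new `FeatureImportance.py` (entry 140 in the file list) appears under the same sklearn test directory in 2.5.18. The removed test's core computation is scikit-learn's permutation importance; a minimal standalone sketch (hypothetical `top_permutation_features` helper; assumes `X` is a pandas DataFrame):

```python
from sklearn.inspection import permutation_importance

def top_permutation_features(estimator, X, y, num_features=3):
    """Return the num_features (feature, score) pairs with highest mean permutation importance."""
    pfi = permutation_importance(estimator, X, y, random_state=0, n_jobs=-2)
    scores = dict(zip(X.columns, pfi.importances_mean))  # X assumed to be a DataFrame
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:num_features]
```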
validmind/tests/model_validation/statsmodels/JarqueBera.py (deleted)
@@ -1,73 +0,0 @@
- # Copyright © 2023-2024 ValidMind Inc. All rights reserved.
- # See the LICENSE file in the root of this repository for details.
- # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
-
- from statsmodels.stats.stattools import jarque_bera
-
- from validmind.vm_models import Metric
-
-
- class JarqueBera(Metric):
-     """
-     Assesses normality of dataset features in an ML model using the Jarque-Bera test.
-
-     **Purpose**: The purpose of the Jarque-Bera test as implemented in this metric is to determine if the features in
-     the dataset of a given Machine Learning model follows a normal distribution. This is crucial for understanding the
-     distribution and behavior of the model's features, as numerous statistical methods assume normal distribution of
-     the data.
-
-     **Test Mechanism**: The test mechanism involves computing the Jarque-Bera statistic, p-value, skew, and kurtosis
-     for each feature in the dataset. It utilizes the 'jarque_bera' function from the 'statsmodels' library in Python,
-     storing the results in a dictionary. The test evaluates the skewness and kurtosis to ascertain whether the dataset
-     follows a normal distribution. A significant p-value (typically less than 0.05) implies that the data does not
-     possess normal distribution.
-
-     **Signs of High Risk**:
-     - A high Jarque-Bera statistic and a low p-value (usually less than 0.05) indicates high-risk conditions.
-     - Such results suggest the data significantly deviates from a normal distribution. If a machine learning model
-     expects feature data to be normally distributed, these findings imply that it may not function as intended.
-
-     **Strengths**:
-     - This test provides insights into the shape of the data distribution, helping determine whether a given set of
-     data follows a normal distribution.
-     - This is particularly useful for risk assessment for models that assume a normal distribution of data.
-     - By measuring skewness and kurtosis, it provides additional insights into the nature and magnitude of a
-     distribution's deviation.
-
-     **Limitations**:
-     - The Jarque-Bera test only checks for normality in the data distribution. It cannot provide insights into other
-     types of distributions.
-     - Datasets that aren't normally distributed but follow some other distribution might lead to inaccurate risk
-     assessments.
-     - The test is highly sensitive to large sample sizes, often rejecting the null hypothesis (that data is normally
-     distributed) even for minor deviations in larger datasets.
-     """
-
-     name = "jarque_bera"
-     required_inputs = ["dataset"]
-     tasks = ["classification", "regression"]
-     tags = [
-         "tabular_data",
-         "data_distribution",
-         "statistical_test",
-         "statsmodels",
-     ]
-
-     def run(self):
-         """
-         Calculates JB for each of the dataset features
-         """
-         x_train = self.inputs.dataset.df[self.inputs.dataset.feature_columns_numeric]
-
-         jb_values = {}
-         for col in x_train.columns:
-             jb_stat, jb_pvalue, jb_skew, jb_kurtosis = jarque_bera(x_train[col].values)
-
-             jb_values[col] = {
-                 "stat": jb_stat,
-                 "pvalue": jb_pvalue,
-                 "skew": jb_skew,
-                 "kurtosis": jb_kurtosis,
-             }
-
-         return self.cache_results(jb_values)
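This removal is a relocation rather than a deletion: a `JarqueBera.py` with 70 added lines appears under `validmind/tests/data_validation/` (entry 43 in the file list). The underlying statsmodels call is unchanged; a minimal standalone sketch of the per-column computation (hypothetical helper, assuming a pandas DataFrame):

```python
from statsmodels.stats.stattools import jarque_bera

def jarque_bera_by_column(df):
    """Jarque-Bera statistic, p-value, skew, and kurtosis per numeric column."""
    results = {}
    for col in df.select_dtypes("number").columns:
        stat, pvalue, skew, kurtosis = jarque_bera(df[col].dropna().values)
        results[col] = {"stat": stat, "pvalue": pvalue, "skew": skew, "kurtosis": kurtosis}
    return results
```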
validmind/tests/model_validation/statsmodels/LJungBox.py (deleted)
@@ -1,66 +0,0 @@
- # Copyright © 2023-2024 ValidMind Inc. All rights reserved.
- # See the LICENSE file in the root of this repository for details.
- # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
-
- from statsmodels.stats.diagnostic import acorr_ljungbox
-
- from validmind.vm_models import Metric
-
-
- class LJungBox(Metric):
-     """
-     Assesses autocorrelations in dataset features by performing a Ljung-Box test on each feature.
-
-     **Purpose**: The Ljung-Box test is a type of statistical test utilized to ascertain whether there are
-     autocorrelations within a given dataset that differ significantly from zero. In the context of a machine learning
-     model, this test is primarily used to evaluate data utilized in regression tasks, especially those involving time
-     series and forecasting.
-
-     **Test Mechanism**: The test operates by iterating over each feature within the training dataset and applying the
-     `acorr_ljungbox` function from the `statsmodels.stats.diagnostic` library. This function calculates the Ljung-Box
-     statistic and p-value for each feature. These results are then stored in a dictionary where the keys are the
-     feature names and the values are dictionaries containing the statistic and p-value respectively. Generally, a lower
-     p-value indicates a higher likelihood of significant autocorrelations within the feature.
-
-     **Signs of High Risk**:
-     - A high risk or failure in the model's performance relating to this test might be indicated by high Ljung-Box
-     statistic values or low p-values.
-     - These outcomes suggest the presence of significant autocorrelations in the respective features. If not properly
-     consider or handle in the machine learning model, these can negatively affect model performance or bias.
-
-     **Strengths**:
-     - The Ljung-Box test is a powerful tool for detecting autocorrelations within datasets, especially in time series
-     data.
-     - It provides quantitative measures (statistic and p-value) that allow for precise evaluation of autocorrelation.
-     - This test can be instrumental in avoiding issues related to autoregressive residuals and other challenges in
-     regression models.
-
-     **Limitations**:
-     - The Ljung-Box test cannot detect all types of non-linearity or complex interrelationships among variables.
-     - Testing individual features may not fully encapsulate the dynamics of the data if features interact with each
-     other.
-     - It is designed more for traditional statistical models and may not be fully compatible with certain types of
-     complex machine learning models.
-     """
-
-     name = "ljung_box"
-     required_inputs = ["dataset"]
-     tasks = ["regression"]
-     tags = ["time_series_data", "forecasting", "statistical_test", "statsmodels"]
-
-     def run(self):
-         """
-         Calculates Ljung-Box test for each of the dataset features
-         """
-         x_train = self.inputs.dataset.df
-
-         ljung_box_values = {}
-         for col in x_train.columns:
-             lb_results = acorr_ljungbox(x_train[col].values, return_df=True)
-
-             ljung_box_values[col] = {
-                 "stat": lb_results["lb_stat"].values[0],
-                 "pvalue": lb_results["lb_pvalue"].values[0],
-             }
-
-         return self.cache_results(ljung_box_values)
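As with JarqueBera, `LJungBox.py` re-appears under `validmind/tests/data_validation/` (entry 45 in the file list) with the same 66-line size, so this is a move to the data_validation namespace. A minimal standalone sketch of the per-column check (hypothetical helper; assumes pandas and statsmodels):

```python
from statsmodels.stats.diagnostic import acorr_ljungbox

def ljung_box_by_column(df):
    """First-lag Ljung-Box statistic and p-value per column, as in the removed test."""
    results = {}
    for col in df.columns:
        lb = acorr_ljungbox(df[col].dropna().values, return_df=True)
        results[col] = {"stat": lb["lb_stat"].iloc[0], "pvalue": lb["lb_pvalue"].iloc[0]}
    return results
```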