validmind 2.5.8__py3-none-any.whl → 2.5.18__py3-none-any.whl

This diff shows the contents of publicly available package versions released to a supported registry, and reflects the changes between those versions as they appear in their respective public registries. It is provided for informational purposes only.
Files changed (233)
  1. validmind/__version__.py +1 -1
  2. validmind/ai/test_descriptions.py +80 -119
  3. validmind/ai/test_result_description/config.yaml +29 -0
  4. validmind/ai/test_result_description/context.py +73 -0
  5. validmind/ai/test_result_description/image_processing.py +124 -0
  6. validmind/ai/test_result_description/system.jinja +39 -0
  7. validmind/ai/test_result_description/user.jinja +25 -0
  8. validmind/api_client.py +89 -43
  9. validmind/client.py +2 -2
  10. validmind/client_config.py +11 -14
  11. validmind/datasets/credit_risk/__init__.py +1 -0
  12. validmind/datasets/credit_risk/datasets/lending_club_biased.csv.gz +0 -0
  13. validmind/datasets/credit_risk/lending_club_bias.py +142 -0
  14. validmind/datasets/regression/fred_timeseries.py +67 -138
  15. validmind/template.py +1 -0
  16. validmind/test_suites/__init__.py +0 -2
  17. validmind/test_suites/statsmodels_timeseries.py +1 -1
  18. validmind/test_suites/summarization.py +0 -1
  19. validmind/test_suites/time_series.py +0 -43
  20. validmind/tests/__types__.py +14 -15
  21. validmind/tests/data_validation/ACFandPACFPlot.py +15 -13
  22. validmind/tests/data_validation/ADF.py +31 -24
  23. validmind/tests/data_validation/AutoAR.py +9 -9
  24. validmind/tests/data_validation/AutoMA.py +23 -16
  25. validmind/tests/data_validation/AutoSeasonality.py +18 -16
  26. validmind/tests/data_validation/AutoStationarity.py +21 -16
  27. validmind/tests/data_validation/BivariateScatterPlots.py +67 -96
  28. validmind/tests/{model_validation/statsmodels → data_validation}/BoxPierce.py +34 -34
  29. validmind/tests/data_validation/ChiSquaredFeaturesTable.py +85 -124
  30. validmind/tests/data_validation/ClassImbalance.py +15 -12
  31. validmind/tests/data_validation/DFGLSArch.py +19 -13
  32. validmind/tests/data_validation/DatasetDescription.py +17 -11
  33. validmind/tests/data_validation/DatasetSplit.py +7 -5
  34. validmind/tests/data_validation/DescriptiveStatistics.py +28 -21
  35. validmind/tests/data_validation/Duplicates.py +33 -25
  36. validmind/tests/data_validation/EngleGrangerCoint.py +35 -33
  37. validmind/tests/data_validation/FeatureTargetCorrelationPlot.py +59 -71
  38. validmind/tests/data_validation/HighCardinality.py +19 -12
  39. validmind/tests/data_validation/HighPearsonCorrelation.py +27 -22
  40. validmind/tests/data_validation/IQROutliersBarPlot.py +13 -10
  41. validmind/tests/data_validation/IQROutliersTable.py +40 -36
  42. validmind/tests/data_validation/IsolationForestOutliers.py +21 -14
  43. validmind/tests/data_validation/JarqueBera.py +70 -0
  44. validmind/tests/data_validation/KPSS.py +34 -29
  45. validmind/tests/data_validation/LJungBox.py +66 -0
  46. validmind/tests/data_validation/LaggedCorrelationHeatmap.py +22 -15
  47. validmind/tests/data_validation/MissingValues.py +32 -27
  48. validmind/tests/data_validation/MissingValuesBarPlot.py +25 -21
  49. validmind/tests/data_validation/PearsonCorrelationMatrix.py +71 -84
  50. validmind/tests/data_validation/PhillipsPerronArch.py +37 -30
  51. validmind/tests/data_validation/ProtectedClassesCombination.py +197 -0
  52. validmind/tests/data_validation/ProtectedClassesDescription.py +130 -0
  53. validmind/tests/data_validation/ProtectedClassesDisparity.py +133 -0
  54. validmind/tests/data_validation/ProtectedClassesThresholdOptimizer.py +172 -0
  55. validmind/tests/data_validation/RollingStatsPlot.py +31 -23
  56. validmind/tests/data_validation/RunsTest.py +72 -0
  57. validmind/tests/data_validation/ScatterPlot.py +63 -78
  58. validmind/tests/data_validation/SeasonalDecompose.py +38 -34
  59. validmind/tests/{model_validation/statsmodels → data_validation}/ShapiroWilk.py +35 -30
  60. validmind/tests/data_validation/Skewness.py +35 -37
  61. validmind/tests/data_validation/SpreadPlot.py +35 -35
  62. validmind/tests/data_validation/TabularCategoricalBarPlots.py +23 -17
  63. validmind/tests/data_validation/TabularDateTimeHistograms.py +21 -13
  64. validmind/tests/data_validation/TabularDescriptionTables.py +51 -16
  65. validmind/tests/data_validation/TabularNumericalHistograms.py +25 -22
  66. validmind/tests/data_validation/TargetRateBarPlots.py +21 -14
  67. validmind/tests/data_validation/TimeSeriesDescription.py +25 -18
  68. validmind/tests/data_validation/TimeSeriesDescriptiveStatistics.py +23 -17
  69. validmind/tests/data_validation/TimeSeriesFrequency.py +24 -17
  70. validmind/tests/data_validation/TimeSeriesHistogram.py +33 -32
  71. validmind/tests/data_validation/TimeSeriesLinePlot.py +17 -10
  72. validmind/tests/data_validation/TimeSeriesMissingValues.py +15 -10
  73. validmind/tests/data_validation/TimeSeriesOutliers.py +37 -33
  74. validmind/tests/data_validation/TooManyZeroValues.py +16 -11
  75. validmind/tests/data_validation/UniqueRows.py +11 -6
  76. validmind/tests/data_validation/WOEBinPlots.py +23 -16
  77. validmind/tests/data_validation/WOEBinTable.py +35 -30
  78. validmind/tests/data_validation/ZivotAndrewsArch.py +34 -28
  79. validmind/tests/data_validation/nlp/CommonWords.py +21 -14
  80. validmind/tests/data_validation/nlp/Hashtags.py +42 -40
  81. validmind/tests/data_validation/nlp/LanguageDetection.py +33 -14
  82. validmind/tests/data_validation/nlp/Mentions.py +21 -15
  83. validmind/tests/data_validation/nlp/PolarityAndSubjectivity.py +32 -9
  84. validmind/tests/data_validation/nlp/Punctuations.py +24 -20
  85. validmind/tests/data_validation/nlp/Sentiment.py +27 -8
  86. validmind/tests/data_validation/nlp/StopWords.py +26 -19
  87. validmind/tests/data_validation/nlp/TextDescription.py +39 -36
  88. validmind/tests/data_validation/nlp/Toxicity.py +32 -9
  89. validmind/tests/decorator.py +81 -42
  90. validmind/tests/model_validation/BertScore.py +36 -27
  91. validmind/tests/model_validation/BleuScore.py +25 -19
  92. validmind/tests/model_validation/ClusterSizeDistribution.py +38 -34
  93. validmind/tests/model_validation/ContextualRecall.py +38 -13
  94. validmind/tests/model_validation/FeaturesAUC.py +32 -13
  95. validmind/tests/model_validation/MeteorScore.py +46 -33
  96. validmind/tests/model_validation/ModelMetadata.py +32 -64
  97. validmind/tests/model_validation/ModelPredictionResiduals.py +75 -73
  98. validmind/tests/model_validation/RegardScore.py +30 -14
  99. validmind/tests/model_validation/RegressionResidualsPlot.py +10 -5
  100. validmind/tests/model_validation/RougeScore.py +36 -30
  101. validmind/tests/model_validation/TimeSeriesPredictionWithCI.py +30 -14
  102. validmind/tests/model_validation/TimeSeriesPredictionsPlot.py +27 -30
  103. validmind/tests/model_validation/TimeSeriesR2SquareBySegments.py +68 -63
  104. validmind/tests/model_validation/TokenDisparity.py +31 -23
  105. validmind/tests/model_validation/ToxicityScore.py +26 -17
  106. validmind/tests/model_validation/embeddings/ClusterDistribution.py +24 -20
  107. validmind/tests/model_validation/embeddings/CosineSimilarityComparison.py +30 -27
  108. validmind/tests/model_validation/embeddings/CosineSimilarityDistribution.py +7 -5
  109. validmind/tests/model_validation/embeddings/CosineSimilarityHeatmap.py +32 -23
  110. validmind/tests/model_validation/embeddings/DescriptiveAnalytics.py +7 -5
  111. validmind/tests/model_validation/embeddings/EmbeddingsVisualization2D.py +15 -11
  112. validmind/tests/model_validation/embeddings/EuclideanDistanceComparison.py +29 -29
  113. validmind/tests/model_validation/embeddings/EuclideanDistanceHeatmap.py +34 -25
  114. validmind/tests/model_validation/embeddings/PCAComponentsPairwisePlots.py +38 -26
  115. validmind/tests/model_validation/embeddings/StabilityAnalysis.py +40 -1
  116. validmind/tests/model_validation/embeddings/StabilityAnalysisKeyword.py +18 -17
  117. validmind/tests/model_validation/embeddings/StabilityAnalysisRandomNoise.py +40 -45
  118. validmind/tests/model_validation/embeddings/StabilityAnalysisSynonyms.py +17 -19
  119. validmind/tests/model_validation/embeddings/StabilityAnalysisTranslation.py +29 -25
  120. validmind/tests/model_validation/embeddings/TSNEComponentsPairwisePlots.py +38 -28
  121. validmind/tests/model_validation/ragas/AnswerCorrectness.py +5 -4
  122. validmind/tests/model_validation/ragas/AnswerRelevance.py +5 -4
  123. validmind/tests/model_validation/ragas/AnswerSimilarity.py +5 -4
  124. validmind/tests/model_validation/ragas/AspectCritique.py +12 -6
  125. validmind/tests/model_validation/ragas/ContextEntityRecall.py +9 -8
  126. validmind/tests/model_validation/ragas/ContextPrecision.py +5 -4
  127. validmind/tests/model_validation/ragas/ContextRecall.py +5 -4
  128. validmind/tests/model_validation/ragas/ContextUtilization.py +155 -0
  129. validmind/tests/model_validation/ragas/Faithfulness.py +5 -4
  130. validmind/tests/model_validation/ragas/NoiseSensitivity.py +152 -0
  131. validmind/tests/model_validation/ragas/utils.py +6 -0
  132. validmind/tests/model_validation/sklearn/AdjustedMutualInformation.py +19 -12
  133. validmind/tests/model_validation/sklearn/AdjustedRandIndex.py +22 -17
  134. validmind/tests/model_validation/sklearn/ClassifierPerformance.py +27 -25
  135. validmind/tests/model_validation/sklearn/ClusterCosineSimilarity.py +7 -5
  136. validmind/tests/model_validation/sklearn/ClusterPerformance.py +40 -78
  137. validmind/tests/model_validation/sklearn/ClusterPerformanceMetrics.py +15 -17
  138. validmind/tests/model_validation/sklearn/CompletenessScore.py +17 -11
  139. validmind/tests/model_validation/sklearn/ConfusionMatrix.py +22 -15
  140. validmind/tests/model_validation/sklearn/FeatureImportance.py +95 -0
  141. validmind/tests/model_validation/sklearn/FowlkesMallowsScore.py +7 -7
  142. validmind/tests/model_validation/sklearn/HomogeneityScore.py +19 -12
  143. validmind/tests/model_validation/sklearn/HyperParametersTuning.py +35 -30
  144. validmind/tests/model_validation/sklearn/KMeansClustersOptimization.py +10 -5
  145. validmind/tests/model_validation/sklearn/MinimumAccuracy.py +32 -32
  146. validmind/tests/model_validation/sklearn/MinimumF1Score.py +23 -23
  147. validmind/tests/model_validation/sklearn/MinimumROCAUCScore.py +15 -10
  148. validmind/tests/model_validation/sklearn/ModelsPerformanceComparison.py +26 -19
  149. validmind/tests/model_validation/sklearn/OverfitDiagnosis.py +38 -18
  150. validmind/tests/model_validation/sklearn/PermutationFeatureImportance.py +32 -26
  151. validmind/tests/model_validation/sklearn/PopulationStabilityIndex.py +8 -6
  152. validmind/tests/model_validation/sklearn/PrecisionRecallCurve.py +24 -17
  153. validmind/tests/model_validation/sklearn/ROCCurve.py +12 -7
  154. validmind/tests/model_validation/sklearn/RegressionErrors.py +74 -130
  155. validmind/tests/model_validation/sklearn/RegressionErrorsComparison.py +27 -12
  156. validmind/tests/model_validation/sklearn/{RegressionModelsPerformanceComparison.py → RegressionPerformance.py} +18 -20
  157. validmind/tests/model_validation/sklearn/RegressionR2Square.py +55 -94
  158. validmind/tests/model_validation/sklearn/RegressionR2SquareComparison.py +32 -13
  159. validmind/tests/model_validation/sklearn/RobustnessDiagnosis.py +36 -32
  160. validmind/tests/model_validation/sklearn/SHAPGlobalImportance.py +66 -5
  161. validmind/tests/model_validation/sklearn/SilhouettePlot.py +27 -19
  162. validmind/tests/model_validation/sklearn/TrainingTestDegradation.py +25 -18
  163. validmind/tests/model_validation/sklearn/VMeasure.py +14 -13
  164. validmind/tests/model_validation/sklearn/WeakspotsDiagnosis.py +7 -5
  165. validmind/tests/model_validation/statsmodels/AutoARIMA.py +24 -18
  166. validmind/tests/model_validation/statsmodels/CumulativePredictionProbabilities.py +73 -104
  167. validmind/tests/model_validation/statsmodels/DurbinWatsonTest.py +59 -32
  168. validmind/tests/model_validation/statsmodels/GINITable.py +44 -77
  169. validmind/tests/model_validation/statsmodels/KolmogorovSmirnov.py +33 -34
  170. validmind/tests/model_validation/statsmodels/Lilliefors.py +27 -24
  171. validmind/tests/model_validation/statsmodels/PredictionProbabilitiesHistogram.py +86 -119
  172. validmind/tests/model_validation/statsmodels/RegressionCoeffs.py +100 -0
  173. validmind/tests/model_validation/statsmodels/RegressionFeatureSignificance.py +14 -9
  174. validmind/tests/model_validation/statsmodels/RegressionModelForecastPlot.py +17 -13
  175. validmind/tests/model_validation/statsmodels/RegressionModelForecastPlotLevels.py +46 -43
  176. validmind/tests/model_validation/statsmodels/RegressionModelSensitivityPlot.py +38 -36
  177. validmind/tests/model_validation/statsmodels/RegressionModelSummary.py +30 -28
  178. validmind/tests/model_validation/statsmodels/RegressionPermutationFeatureImportance.py +18 -11
  179. validmind/tests/model_validation/statsmodels/ScorecardHistogram.py +75 -107
  180. validmind/tests/ongoing_monitoring/FeatureDrift.py +10 -6
  181. validmind/tests/ongoing_monitoring/PredictionAcrossEachFeature.py +31 -25
  182. validmind/tests/ongoing_monitoring/PredictionCorrelation.py +29 -21
  183. validmind/tests/ongoing_monitoring/TargetPredictionDistributionPlot.py +31 -23
  184. validmind/tests/prompt_validation/Bias.py +14 -11
  185. validmind/tests/prompt_validation/Clarity.py +16 -14
  186. validmind/tests/prompt_validation/Conciseness.py +7 -5
  187. validmind/tests/prompt_validation/Delimitation.py +23 -22
  188. validmind/tests/prompt_validation/NegativeInstruction.py +7 -5
  189. validmind/tests/prompt_validation/Robustness.py +12 -10
  190. validmind/tests/prompt_validation/Specificity.py +13 -11
  191. validmind/tests/prompt_validation/ai_powered_test.py +6 -0
  192. validmind/tests/run.py +68 -23
  193. validmind/unit_metrics/__init__.py +81 -144
  194. validmind/unit_metrics/classification/{sklearn/Accuracy.py → Accuracy.py} +1 -1
  195. validmind/unit_metrics/classification/{sklearn/F1.py → F1.py} +1 -1
  196. validmind/unit_metrics/classification/{sklearn/Precision.py → Precision.py} +1 -1
  197. validmind/unit_metrics/classification/{sklearn/ROC_AUC.py → ROC_AUC.py} +1 -2
  198. validmind/unit_metrics/classification/{sklearn/Recall.py → Recall.py} +1 -1
  199. validmind/unit_metrics/regression/{sklearn/AdjustedRSquaredScore.py → AdjustedRSquaredScore.py} +1 -1
  200. validmind/unit_metrics/regression/GiniCoefficient.py +1 -1
  201. validmind/unit_metrics/regression/HuberLoss.py +1 -1
  202. validmind/unit_metrics/regression/KolmogorovSmirnovStatistic.py +1 -1
  203. validmind/unit_metrics/regression/{sklearn/MeanAbsoluteError.py → MeanAbsoluteError.py} +1 -1
  204. validmind/unit_metrics/regression/MeanAbsolutePercentageError.py +1 -1
  205. validmind/unit_metrics/regression/MeanBiasDeviation.py +1 -1
  206. validmind/unit_metrics/regression/{sklearn/MeanSquaredError.py → MeanSquaredError.py} +1 -1
  207. validmind/unit_metrics/regression/QuantileLoss.py +1 -1
  208. validmind/unit_metrics/regression/{sklearn/RSquaredScore.py → RSquaredScore.py} +1 -1
  209. validmind/unit_metrics/regression/{sklearn/RootMeanSquaredError.py → RootMeanSquaredError.py} +1 -1
  210. validmind/utils.py +4 -0
  211. validmind/vm_models/dataset/dataset.py +2 -0
  212. validmind/vm_models/figure.py +5 -0
  213. validmind/vm_models/test/metric.py +1 -0
  214. validmind/vm_models/test/result_wrapper.py +143 -158
  215. validmind/vm_models/test/threshold_test.py +1 -0
  216. {validmind-2.5.8.dist-info → validmind-2.5.18.dist-info}/METADATA +4 -3
  217. validmind-2.5.18.dist-info/RECORD +324 -0
  218. validmind/tests/data_validation/ANOVAOneWayTable.py +0 -138
  219. validmind/tests/data_validation/BivariateFeaturesBarPlots.py +0 -142
  220. validmind/tests/data_validation/BivariateHistograms.py +0 -117
  221. validmind/tests/data_validation/HeatmapFeatureCorrelations.py +0 -124
  222. validmind/tests/data_validation/MissingValuesRisk.py +0 -88
  223. validmind/tests/model_validation/ModelMetadataComparison.py +0 -59
  224. validmind/tests/model_validation/sklearn/FeatureImportanceComparison.py +0 -83
  225. validmind/tests/model_validation/statsmodels/JarqueBera.py +0 -73
  226. validmind/tests/model_validation/statsmodels/LJungBox.py +0 -66
  227. validmind/tests/model_validation/statsmodels/RegressionCoeffsPlot.py +0 -135
  228. validmind/tests/model_validation/statsmodels/RegressionModelsCoeffs.py +0 -103
  229. validmind/tests/model_validation/statsmodels/RunsTest.py +0 -71
  230. validmind-2.5.8.dist-info/RECORD +0 -318
  231. {validmind-2.5.8.dist-info → validmind-2.5.18.dist-info}/LICENSE +0 -0
  232. {validmind-2.5.8.dist-info → validmind-2.5.18.dist-info}/WHEEL +0 -0
  233. {validmind-2.5.8.dist-info → validmind-2.5.18.dist-info}/entry_points.txt +0 -0
@@ -16,7 +16,7 @@ class AutoAR(Metric):
  """
  Automatically identifies the optimal Autoregressive (AR) order for a time series using BIC and AIC criteria.

- **Purpose**:
+ ### Purpose

  The AutoAR test is intended to automatically identify the Autoregressive (AR) order of a time series by utilizing
  the Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC). AR order is crucial in forecasting
@@ -24,30 +24,30 @@ class AutoAR(Metric):
  objective is to select the most fitting AR model that encapsulates the trend and seasonality in the time series
  data.

- **Test Mechanism**:
+ ### Test Mechanism

  The test mechanism operates by iterating through a possible range of AR orders up to a defined maximum. An AR model
  is fitted for each order, and the corresponding BIC and AIC are computed. BIC and AIC statistical measures are
  designed to penalize models for complexity, preferring simpler models that fit the data proficiently. To verify the
- stationarity of the time series, the Augmented Dickey-Fuller test is executed. The AR order, BIC, and AIC findings,
+ stationarity of the time series, the Augmented Dickey-Fuller test is executed. The AR order, BIC, and AIC findings
  are compiled into a dataframe for effortless comparison. Then, the AR order with the smallest BIC is established as
  the desirable order for each variable.

- **Signs of High Risk**:
+ ### Signs of High Risk

  - An augmented Dickey Fuller test p-value > 0.05, indicating the time series isn't stationary, may lead to
  inaccurate results.
  - Problems with the model fitting procedure, such as computational or convergence issues.
- - Continuous selection of the maximum specified AR order may suggest insufficient set limit.
+ - Continuous selection of the maximum specified AR order may suggest an insufficient set limit.

- **Strengths**:
+ ### Strengths

  - The test independently pinpoints the optimal AR order, thereby reducing potential human bias.
  - It strikes a balance between model simplicity and goodness-of-fit to avoid overfitting.
- - Has the capability to account for stationarity in a time series, an essential aspect for dependable AR modelling.
- - The results are aggregated into an comprehensive table, enabling an easy interpretation.
+ - Has the capability to account for stationarity in a time series, an essential aspect for dependable AR modeling.
+ - The results are aggregated into a comprehensive table, enabling an easy interpretation.

- **Limitations**:
+ ### Limitations

  - The tests need a stationary time series input.
  - They presume a linear relationship between the series and its lags.
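
The AutoAR docstring above describes fitting AR models over a range of candidate orders and keeping the one with the smallest BIC, after an ADF stationarity check. A minimal standalone sketch of that search, assuming a stationary pandas Series (illustrative names only; this is not the package's implementation):

```python
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.stattools import adfuller


def best_ar_order(series: pd.Series, max_ar_order: int = 3) -> pd.DataFrame:
    series = series.dropna()

    # ADF p-value > 0.05 suggests non-stationarity, so results may be unreliable
    adf_pvalue = adfuller(series)[1]

    rows = []
    for order in range(1, max_ar_order + 1):
        fit = AutoReg(series, lags=order).fit()
        rows.append(
            {"AR Order": order, "BIC": fit.bic, "AIC": fit.aic, "ADF p-value": adf_pvalue}
        )

    # The order with the smallest BIC sorts to the top
    return pd.DataFrame(rows).sort_values("BIC")
```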
@@ -17,32 +17,39 @@ class AutoMA(Metric):
  Automatically selects the optimal Moving Average (MA) order for each variable in a time series dataset based on
  minimal BIC and AIC values.

- **Purpose**: The `AutoMA` metric serves an essential role of automated decision-making for selecting the optimal
- Moving Average (MA) order for every variable in a given time series dataset. The selection is dependent on the
- minimalization of BIC (Bayesian Information Criterion) and AIC (Akaike Information Criterion); these are
- established statistical tools used for model selection. Furthermore, prior to the commencement of the model fitting
- process, the algorithm conducts a stationarity test (Augmented Dickey-Fuller test) on each series.
-
- **Test Mechanism**: Starting off, the `AutoMA` algorithm checks whether the `max_ma_order` parameter has been
- provided. It consequently loops through all variables in the dataset, carrying out the Dickey-Fuller test for
- stationarity. For each stationary variable, it fits an ARIMA model for orders running from 0 to `max_ma_order`. The
- result is a list showcasing the BIC and AIC values of the ARIMA models based on different orders. The MA order,
- which yields the smallest BIC, is chosen as the 'best MA order' for every single variable. The final results
- include a table summarizing the auto MA analysis and another table listing the best MA order for each variable.
-
- **Signs of High Risk**:
+ ### Purpose
+
+ The `AutoMA` metric serves an essential role of automated decision-making for selecting the optimal Moving Average
+ (MA) order for every variable in a given time series dataset. The selection is dependent on the minimalization of
+ BIC (Bayesian Information Criterion) and AIC (Akaike Information Criterion); these are established statistical
+ tools used for model selection. Furthermore, prior to the commencement of the model fitting process, the algorithm
+ conducts a stationarity test (Augmented Dickey-Fuller test) on each series.
+
+ ### Test Mechanism
+
+ Starting off, the `AutoMA` algorithm checks whether the `max_ma_order` parameter has been provided. It consequently
+ loops through all variables in the dataset, carrying out the Dickey-Fuller test for stationarity. For each
+ stationary variable, it fits an ARIMA model for orders running from 0 to `max_ma_order`. The result is a list
+ showcasing the BIC and AIC values of the ARIMA models based on different orders. The MA order, which yields the
+ smallest BIC, is chosen as the 'best MA order' for every single variable. The final results include a table
+ summarizing the auto MA analysis and another table listing the best MA order for each variable.
+
+ ### Signs of High Risk
+
  - When a series is non-stationary (p-value>0.05 in the Dickey-Fuller test), the produced result could be inaccurate.
  - Any error that arises in the process of fitting the ARIMA models, especially with a higher MA order, can
  potentially indicate risks and might need further investigation.

- **Strengths**:
+ ### Strengths
+
  - The metric facilitates automation in the process of selecting the MA order for time series forecasting. This
  significantly saves time and reduces efforts conventionally necessary for manual hyperparameter tuning.
  - The use of both BIC and AIC enhances the likelihood of selecting the most suitable model.
  - The metric ascertains the stationarity of the series prior to model fitting, thus ensuring that the underlying
  assumptions of the MA model are fulfilled.

- **Limitations**:
+ ### Limitations
+
  - If the time series fails to be stationary, the metric may yield inaccurate results. Consequently, it necessitates
  pre-processing steps to stabilize the series before fitting the ARIMA model.
  - The metric adopts a rudimentary model selection process based on BIC and doesn't consider other potential model
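
The AutoMA mechanism is the same BIC-driven selection, but over pure moving-average models: one ARIMA(0, 0, q) fit per candidate order q up to `max_ma_order`. Again a hedged sketch, not the package's code:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA


def best_ma_order(series: pd.Series, max_ma_order: int = 3) -> int:
    series = series.dropna()

    rows = []
    for q in range(max_ma_order + 1):
        # ARIMA(0, 0, q) is a pure moving-average model of order q
        fit = ARIMA(series, order=(0, 0, q)).fit()
        rows.append({"MA Order": q, "BIC": fit.bic, "AIC": fit.aic})

    results = pd.DataFrame(rows)
    # Keep the order whose BIC is smallest
    return int(results.loc[results["BIC"].idxmin(), "MA Order"])
```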
@@ -17,28 +17,30 @@ class AutoSeasonality(Metric):
  Automatically identifies and quantifies optimal seasonality in time series data to improve forecasting model
  performance.

- **Purpose:**
- The AutoSeasonality metric's purpose is to automatically detect and identify the best seasonal order or period for
- each variable in a time series dataset. This detection helps to quantify periodic patterns and seasonality that
- reoccur at fixed intervals of time in the data. This is especially significant for forecasting-based models, where
- understanding the seasonality component can drastically improve prediction accuracy.
-
- **Test Mechanism:**
- This metric uses the seasonal decomposition method from the Statsmodels Python library. The function takes the
+ ### Purpose
+
+ The AutoSeasonality test aims to automatically detect and identify the best seasonal order or period for each
+ variable in a time series dataset. This detection helps to quantify periodic patterns and seasonality that reoccur
+ at fixed intervals in the data. Understanding the seasonality component can drastically improve prediction
+ accuracy, which is especially significant for forecasting-based models.
+
+ ### Test Mechanism
+
+ This test uses the seasonal decomposition method from the Statsmodels Python library. The function takes the
  'additive' model type for each variable and applies it within the prescribed range of 'min_period' and
- 'max_period'. The function decomposes the seasonality for each period in the range and calculates the mean residual
- error for each period. The seasonal period that results in the minimum residuals is marked as the 'Best Period'.
- The test results include the 'Best Period', the calculated residual errors, and a determination of 'Seasonality' or
- 'No Seasonality'.
+ 'max_period'. It decomposes the seasonality for each period in the range and calculates the mean residual error for
+ each period. The seasonal period that results in the minimum residuals is marked as the 'Best Period'. The test
+ results include the 'Best Period', the calculated residual errors, and a determination of 'Seasonality' or 'No
+ Seasonality'.

- **Signs of High Risk:**
+ ### Signs of High Risk

  - If the optimal seasonal period (or 'Best Period') is consistently at the maximum or minimum limit of the offered
  range for a majority of variables, it may suggest that the range set does not adequately capture the true seasonal
  pattern in the series.
  - A high average 'Residual Error' for the selected 'Best Period' could indicate issues with the model's performance.

- **Strengths:**
+ ### Strengths

  - The metric offers an automatic approach to identifying and quantifying the optimal seasonality, providing a
  robust method for analyzing time series datasets.
@@ -46,9 +48,9 @@ class AutoSeasonality(Metric):
  seasonality.
  - The use of concrete and measurable statistical methods improves the objectivity and reproducibility of the model.

- **Limitations:**
+ ### Limitations

- - This AutoSeasonality metric may not be suitable if the time series data exhibits random walk behaviour or lacks
+ - This AutoSeasonality metric may not be suitable if the time series data exhibits random walk behavior or lacks
  clear seasonality, as the seasonal decomposition model may not be appropriate.
  - The defined range for the seasonal period (min_period and max_period) can influence the outcomes. If the actual
  seasonality period lies outside this range, this method will not be able to identify the true seasonal order.
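
A sketch of the residual-error scan the AutoSeasonality docstring describes: decompose the series additively at each candidate period and keep the period with the smallest mean residual. Illustrative only; assumes a pandas Series with a regular time index:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose


def best_seasonal_period(series: pd.Series, min_period: int = 2, max_period: int = 12) -> int:
    errors = {}
    for period in range(min_period, max_period + 1):
        decomposition = seasonal_decompose(series.dropna(), model="additive", period=period)
        # Mean absolute residual left over after removing trend and seasonality
        errors[period] = np.abs(decomposition.resid.dropna()).mean()

    # The candidate period with the smallest residual error is the 'Best Period'
    return min(errors, key=errors.get)
```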
@@ -13,26 +13,30 @@ class AutoStationarity(Metric):
  """
  Automates Augmented Dickey-Fuller test to assess stationarity across multiple time series in a DataFrame.

- **Purpose**: The AutoStationarity metric is intended to automatically detect and evaluate the stationary nature of
- each time series in a DataFrame. It incorporates the Augmented Dickey-Fuller (ADF) test, a statistical approach
- used to assess stationarity. Stationarity is a fundamental property suggesting that statistic features like mean
- and variance remain unchanged over time. This is necessary for many time-series models.
-
- **Test Mechanism**: The mechanism for the AutoStationarity test involves applying the Augmented Dicky-Fuller test
- to each time series within the given dataframe to assess if they are stationary. Every series in the dataframe is
- looped, using the ADF test up to a defined maximum order (configurable and by default set to 5). The p-value
- resulting from the ADF test is compared against a predetermined threshold (also configurable and by default set to
- 0.05). The time series is deemed stationary at its current differencing order if the p-value is less than the
- threshold.
-
- **Signs of High Risk**:
+ ### Purpose
+
+ The AutoStationarity metric is intended to automatically detect and evaluate the stationary nature of each time
+ series in a DataFrame. It incorporates the Augmented Dickey-Fuller (ADF) test, a statistical approach used to
+ assess stationarity. Stationarity is a fundamental property suggesting that statistic features like mean and
+ variance remain unchanged over time. This is necessary for many time-series models.
+
+ ### Test Mechanism
+
+ The mechanism for the AutoStationarity test involves applying the Augmented Dicky-Fuller test to each time series
+ within the given dataframe to assess if they are stationary. Every series in the dataframe is looped, using the ADF
+ test up to a defined maximum order (configurable and by default set to 5). The p-value resulting from the ADF test
+ is compared against a predetermined threshold (also configurable and by default set to 0.05). The time series is
+ deemed stationary at its current differencing order if the p-value is less than the threshold.
+
+ ### Signs of High Risk
+
  - A significant number of series not achieving stationarity even at the maximum order of differencing can indicate
  high risk or potential failure in the model.
  - This could suggest the series may not be appropriately modeled by a stationary process, hence other modeling
  approaches might be required.

+ ### Strengths

- **Strengths**:
  - The key strength in this metric lies in the automation of the ADF test, enabling mass stationarity analysis
  across various time series and boosting the efficiency and credibility of the analysis.
  - The utilization of the ADF test, a widely accepted method for testing stationarity, lends authenticity to the
@@ -40,8 +44,9 @@ class AutoStationarity(Metric):
  - The introduction of the max order and threshold parameters give users the autonomy to determine their preferred
  levels of stringency in the tests.

- **Limitations**:
- - The Augumented Dicky-Fuller test and the stationarity test are not without their limitations. These tests are
+ ### Limitations
+
+ - The Augmented Dickey-Fuller test and the stationarity test are not without their limitations. These tests are
  premised on the assumption that the series can be modeled by an autoregressive process, which may not always hold
  true.
  - The stationarity check is highly sensitive to the choice of threshold for the significance level; an extremely
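
A sketch of the differencing loop described above: apply the ADF test, and difference the series again until the p-value drops below the threshold or the maximum order is reached. Illustrative names, not the package's code:

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller


def stationarity_order(series: pd.Series, max_order: int = 5, threshold: float = 0.05):
    current = series.dropna()
    for order in range(max_order + 1):
        p_value = adfuller(current)[1]
        if p_value < threshold:
            return order, p_value  # stationary at this differencing order
        current = current.diff().dropna()  # difference once more and retry
    return None, p_value  # never reached stationarity within max_order
```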
@@ -3,109 +3,80 @@
  # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial

  import itertools
- from dataclasses import dataclass

  import plotly.express as px

- from validmind.vm_models import Figure, Metric
+ from validmind import tags, tasks


- @dataclass
- class BivariateScatterPlots(Metric):
+ @tags("tabular_data", "numerical_data", "visualization")
+ @tasks("classification")
+ def BivariateScatterPlots(dataset):
      """
-     Generates bivariate scatterplots to visually inspect relationships between pairs of predictor variables in machine
-     learning classification tasks.
-
-     **Purpose**: This metric is intended for visual inspection and monitoring of relationships between pairs of
-     variables in a machine learning model targeting classification tasks. It is especially useful for understanding how
-     predictor variables (features) behave in relation to each other and how they are distributed for different classes
-     of the target variable, which could inform feature selection, model-building strategies, and even alert to possible
-     biases and irregularities in the data.
-
-     **Test Mechanism**: This metric operates by creating a scatter plot for each pair of the selected features in the
-     dataset. If the parameters "selected_columns" are not specified, an error will be thrown. The metric offers
-     flexibility by allowing the user to filter on a specific target class - specified by the "target_filter" parameter
-     - for more granified insights. Each scatterplot is then color-coded based on the category of the target variable
-     for better visual differentiation. The seaborn scatterplot library is used for generating the plots.
-
-     **Signs of High Risk**:
-     - Visual patterns which might suggest non-linear relationships, substantial skewness, multicollinearity,
-     clustering, or isolated outlier points in the scatter plot.
-     - Such issues could affect the assumptions and performance of some models, especially the ones assuming linearity
-     like linear regression or logistic regression.
-
-     **Strengths**:
-     - Scatterplots are simple and intuitive for users to understand, providing a visual tool to pinpoint complex
-     relationships between two variables.
-     - They are useful for outlier detection, identification of variable associations and trends, including non-linear
-     patterns which can be overlooked by other linear-focused metrics or tests.
-     - The implementation also supports visualizing binary or multi-class classification datasets.
-
-     **Limitations**:
-     - Scatterplots are limited to bivariate analysis - the relationship of two variables at a time - and might not
-     reveal the full picture in higher dimensions or where interactions are present.
-     - They are not ideal for very large datasets as points will overlap and render the visualization less informative.
-     - Scatterplots are more of an exploratory tool rather than a formal statistical test, so they don't provide any
-     quantitative measure of model quality or performance.
-     - Interpretation of scatterplots relies heavily on the domain knowledge and judgment of the viewer, which can
-     introduce subjective bias.
+     Generates bivariate scatterplots to visually inspect relationships between pairs of numerical predictor variables
+     in machine learning classification tasks.
+
+     ### Purpose
+
+     This function is intended for visual inspection and monitoring of relationships between pairs of numerical
+     variables in a machine learning model targeting classification tasks. It helps in understanding how predictor
+     variables (features) interact with each other, which can inform feature selection, model-building strategies, and
+     identify potential biases or irregularities in the data.
+
+     ### Test Mechanism
+
+     The function creates scatter plots for each pair of numerical features in the dataset. It first filters out
+     non-numerical and binary features, ensuring the plots focus on meaningful numerical relationships. The resulting
+     scatterplots are color-coded uniformly to avoid visual distraction, and the function returns a tuple of Plotly
+     figure objects, each representing a scatter plot for a pair of features.
+
+     ### Signs of High Risk
+
+     - Visual patterns suggesting non-linear relationships, multicollinearity, clustering, or outlier points in the
+     scatter plots.
+     - Such issues could affect the assumptions and performance of certain models, especially those assuming linearity,
+     like logistic regression.
+
+     ### Strengths
+
+     - Scatterplots provide an intuitive and visual tool to explore relationships between two variables.
+     - They are useful for identifying outliers, variable associations, and trends, including non-linear patterns.
+     - Supports visualization of binary or multi-class classification datasets, focusing on numerical features.
+
+     ### Limitations
+
+     - Scatterplots are limited to bivariate analysis, showing relationships between only two variables at a time.
+     - Not ideal for very large datasets where overlapping points can reduce the clarity of the visualization.
+     - Scatterplots are exploratory tools and do not provide quantitative measures of model quality or performance.
+     - Interpretation is subjective and relies on the domain knowledge and judgment of the viewer.
      """
+     figures = []

-     name = "bivariate_scatter_plots"
-     required_inputs = ["dataset"]
-     default_params = {"selected_columns": None}
-     tasks = ["classification"]
-     tags = [
-         "tabular_data",
-         "categorical_data",
-         "binary_classification",
-         "multiclass_classification",
-         "visualization",
+     # Select numerical features
+     features = dataset.feature_columns_numeric
+
+     # Select non-binary features
+     features = [
+         feature for feature in features if len(dataset.df[feature].unique()) > 2
      ]

-     def plot_bivariate_scatter(self, columns):
-         figures = []
-         df = self.inputs.dataset.df
-
-         # Generate all pairs of columns
-         features_pairs = list(itertools.combinations(columns, 2))
-
-         for x, y in features_pairs:
-             fig = px.scatter(
-                 df,
-                 x=x,
-                 y=y,
-                 title=f"{x} and {y}",
-                 labels={x: x, y: y},
-                 opacity=0.7,
-                 color_discrete_sequence=["blue"],  # Use the same color for all points
-             )
-             fig.update_traces(marker=dict(color="blue"))
-
-             figures.append(
-                 Figure(for_object=self, key=f"{self.key}:{x}_{y}", figure=fig)
-             )
-
-         return figures
-
-     def run(self):
-         selected_columns = self.params["selected_columns"]
-
-         if selected_columns is None:
-             # Use all columns if selected_columns is not provided
-             selected_columns = self.inputs.dataset.df.columns.tolist()
-         else:
-             # Check if all selected columns exist in the dataframe
-             missing_columns = [
-                 col
-                 for col in selected_columns
-                 if col not in self.inputs.dataset.df.columns
-             ]
-             if missing_columns:
-                 raise ValueError(
-                     f"The following selected columns are not in the dataframe: {missing_columns}"
-                 )
-
-         figures = self.plot_bivariate_scatter(selected_columns)
-
-         return self.cache_results(figures=figures)
+     df = dataset.df[features]
+
+     # Generate all pairs of columns
+     features_pairs = list(itertools.combinations(df.columns, 2))
+
+     for x, y in features_pairs:
+         fig = px.scatter(
+             df,
+             x=x,
+             y=y,
+             title=f"{x} and {y}",
+             labels={x: x, y: y},
+             opacity=0.7,
+             color_discrete_sequence=["blue"],  # Use the same color for all points
+         )
+         fig.update_traces(marker=dict(color="blue"))
+
+         figures.append(fig)
+
+     return tuple(figures)
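
Since BivariateScatterPlots is now a plain decorated function, it runs through the library's test runner (see validmind/tests/run.py in the file list). A rough usage sketch, assuming the usual test ID and a dataset already wrapped with `vm.init_dataset(...)` (both assumptions, not confirmed by this diff):

```python
import validmind as vm
from validmind.tests import run_test

# `vm_dataset` is assumed to come from vm.init_dataset(...) earlier in a notebook
result = run_test(
    "validmind.data_validation.BivariateScatterPlots",
    inputs={"dataset": vm_dataset},
)
result.log()  # send the generated figures to the ValidMind platform
```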
@@ -2,40 +2,47 @@
  # See the LICENSE file in the root of this repository for details.
  # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial

+ import pandas as pd
  from statsmodels.stats.diagnostic import acorr_ljungbox

- from validmind.vm_models import Metric
+ from validmind import tags, tasks


- class BoxPierce(Metric):
+ @tasks("regression")
+ @tags("time_series_data", "forecasting", "statistical_test", "statsmodels")
+ def BoxPierce(dataset):
      """
      Detects autocorrelation in time-series data through the Box-Pierce test to validate model performance.

-     **Purpose:** The Box-Pierce test is utilized to detect the presence of autocorrelation in a time-series dataset.
+     ### Purpose
+
+     The Box-Pierce test is utilized to detect the presence of autocorrelation in a time-series dataset.
      Autocorrelation, or serial correlation, refers to the degree of similarity between observations based on the
      temporal spacing between them. This test is essential for affirming the quality of a time-series model by ensuring
      that the error terms in the model are random and do not adhere to a specific pattern.

-     **Test Mechanism:** The implementation of the Box-Pierce test involves calculating a test statistic along with a
-     corresponding p-value derived from the dataset features. These quantities are used to test the null hypothesis that
-     posits the data to be independently distributed. This is achieved by iterating over every feature column in the
-     time-series data and applying the `acorr_ljungbox` function of the statsmodels library. The function yields the
-     Box-Pierce test statistic as well as the respective p-value, all of which are cached as test results.
+     ### Test Mechanism
+
+     The implementation of the Box-Pierce test involves calculating a test statistic along with a corresponding p-value
+     derived from the dataset features. These quantities are used to test the null hypothesis that posits the data to be
+     independently distributed. This is achieved by iterating over every feature column in the time-series data and
+     applying the `acorr_ljungbox` function of the statsmodels library. The function yields the Box-Pierce test
+     statistic as well as the respective p-value, all of which are cached as test results.

-     **Signs of High Risk:**
+     ### Signs of High Risk

      - A low p-value, typically under 0.05 as per statistical convention, throws the null hypothesis of independence
      into question. This implies that the dataset potentially houses autocorrelations, thus indicating a high-risk
      scenario concerning model performance.
      - Large Box-Pierce test statistic values may indicate the presence of autocorrelation.

-     **Strengths:**
+     ### Strengths

      - Detects patterns in data that are supposed to be random, thereby ensuring no underlying autocorrelation.
      - Can be computed efficiently given its low computational complexity.
      - Can be widely applied to most regression problems, making it very versatile.

-     **Limitations:**
+     ### Limitations

      - Assumes homoscedasticity (constant variance) and normality of residuals, which may not always be the case in
      real-world datasets.
@@ -43,29 +50,22 @@ class BoxPierce(Metric):
      correlations.
      - It only provides a general indication of the existence of autocorrelation, without providing specific insights
      into the nature or patterns of the detected autocorrelation.
-     - In the presence of exhibits trends or seasonal patterns, the Box-Pierce test may yield misleading results.
+     - In the presence of trends or seasonal patterns, the Box-Pierce test may yield misleading results.
      - Applicability is limited to time-series data, which limits its overall utility.
      """

-     name = "box_pierce"
-     required_inputs = ["dataset"]
-     tasks = ["regression"]
-     tags = ["time_series_data", "forecasting", "statistical_test", "statsmodels"]
-
-     def run(self):
-         """
-         Calculates Box-Pierce test for each of the dataset features
-         """
-         x_train = self.inputs.dataset.df
-
-         box_pierce_values = {}
-         for col in x_train.columns:
-             bp_results = acorr_ljungbox(
-                 x_train[col].values, boxpierce=True, return_df=True
-             )
-             box_pierce_values[col] = {
-                 "stat": bp_results.iloc[0]["lb_stat"],
-                 "pvalue": bp_results.iloc[0]["lb_pvalue"],
-             }
-
-         return self.cache_results(box_pierce_values)
+     df = dataset.df
+
+     box_pierce_values = {}
+     for col in df.columns:
+         bp_results = acorr_ljungbox(df[col].values, boxpierce=True, return_df=True)
+         box_pierce_values[col] = {
+             "stat": bp_results.iloc[0]["lb_stat"],
+             "pvalue": bp_results.iloc[0]["lb_pvalue"],
+         }
+
+     box_pierce_df = pd.DataFrame.from_dict(box_pierce_values, orient="index")
+     box_pierce_df.reset_index(inplace=True)
+     box_pierce_df.columns = ["column", "stat", "pvalue"]
+
+     return box_pierce_df
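
One detail worth knowing about the statsmodels call used above: with `boxpierce=True` and `return_df=True`, `acorr_ljungbox` returns a DataFrame whose `lb_*` columns hold the Ljung-Box statistics and whose `bp_*` columns hold the Box-Pierce variants (the test above reads the `lb_*` columns). A minimal standalone call on synthetic white noise:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)
white_noise = pd.Series(rng.normal(size=200))  # no autocorrelation expected

results = acorr_ljungbox(white_noise, lags=[10], boxpierce=True, return_df=True)
print(results[["lb_stat", "lb_pvalue", "bp_stat", "bp_pvalue"]])
```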