validmind 2.1.0__py3-none-any.whl → 2.2.2__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- validmind/__version__.py +1 -1
- validmind/ai.py +3 -3
- validmind/api_client.py +2 -3
- validmind/client.py +68 -25
- validmind/datasets/llm/rag/__init__.py +11 -0
- validmind/datasets/llm/rag/datasets/rfp_existing_questions_client_1.csv +30 -0
- validmind/datasets/llm/rag/datasets/rfp_existing_questions_client_2.csv +30 -0
- validmind/datasets/llm/rag/datasets/rfp_existing_questions_client_3.csv +53 -0
- validmind/datasets/llm/rag/datasets/rfp_existing_questions_client_4.csv +53 -0
- validmind/datasets/llm/rag/datasets/rfp_existing_questions_client_5.csv +53 -0
- validmind/datasets/llm/rag/rfp.py +41 -0
- validmind/html_templates/__init__.py +0 -0
- validmind/html_templates/content_blocks.py +89 -14
- validmind/models/__init__.py +7 -4
- validmind/models/foundation.py +8 -34
- validmind/models/function.py +51 -0
- validmind/models/huggingface.py +16 -46
- validmind/models/metadata.py +42 -0
- validmind/models/pipeline.py +66 -0
- validmind/models/pytorch.py +8 -42
- validmind/models/r_model.py +33 -82
- validmind/models/sklearn.py +39 -38
- validmind/template.py +8 -26
- validmind/tests/__init__.py +43 -20
- validmind/tests/data_validation/ANOVAOneWayTable.py +1 -1
- validmind/tests/data_validation/ChiSquaredFeaturesTable.py +1 -1
- validmind/tests/data_validation/DescriptiveStatistics.py +2 -4
- validmind/tests/data_validation/Duplicates.py +1 -1
- validmind/tests/data_validation/IsolationForestOutliers.py +2 -2
- validmind/tests/data_validation/LaggedCorrelationHeatmap.py +1 -1
- validmind/tests/data_validation/TargetRateBarPlots.py +1 -1
- validmind/tests/data_validation/nlp/LanguageDetection.py +59 -0
- validmind/tests/data_validation/nlp/PolarityAndSubjectivity.py +48 -0
- validmind/tests/data_validation/nlp/Punctuations.py +11 -12
- validmind/tests/data_validation/nlp/Sentiment.py +57 -0
- validmind/tests/data_validation/nlp/Toxicity.py +45 -0
- validmind/tests/decorator.py +2 -2
- validmind/tests/model_validation/BertScore.py +100 -98
- validmind/tests/model_validation/BleuScore.py +93 -64
- validmind/tests/model_validation/ContextualRecall.py +74 -91
- validmind/tests/model_validation/MeteorScore.py +86 -74
- validmind/tests/model_validation/RegardScore.py +103 -121
- validmind/tests/model_validation/RougeScore.py +118 -0
- validmind/tests/model_validation/TokenDisparity.py +84 -121
- validmind/tests/model_validation/ToxicityScore.py +109 -123
- validmind/tests/model_validation/embeddings/CosineSimilarityComparison.py +96 -0
- validmind/tests/model_validation/embeddings/CosineSimilarityHeatmap.py +71 -0
- validmind/tests/model_validation/embeddings/EuclideanDistanceComparison.py +92 -0
- validmind/tests/model_validation/embeddings/EuclideanDistanceHeatmap.py +69 -0
- validmind/tests/model_validation/embeddings/PCAComponentsPairwisePlots.py +78 -0
- validmind/tests/model_validation/embeddings/StabilityAnalysis.py +35 -23
- validmind/tests/model_validation/embeddings/StabilityAnalysisKeyword.py +3 -0
- validmind/tests/model_validation/embeddings/StabilityAnalysisRandomNoise.py +7 -1
- validmind/tests/model_validation/embeddings/StabilityAnalysisSynonyms.py +3 -0
- validmind/tests/model_validation/embeddings/StabilityAnalysisTranslation.py +3 -0
- validmind/tests/model_validation/embeddings/TSNEComponentsPairwisePlots.py +99 -0
- validmind/tests/model_validation/ragas/AnswerCorrectness.py +131 -0
- validmind/tests/model_validation/ragas/AnswerRelevance.py +134 -0
- validmind/tests/model_validation/ragas/AnswerSimilarity.py +119 -0
- validmind/tests/model_validation/ragas/AspectCritique.py +167 -0
- validmind/tests/model_validation/ragas/ContextEntityRecall.py +133 -0
- validmind/tests/model_validation/ragas/ContextPrecision.py +123 -0
- validmind/tests/model_validation/ragas/ContextRecall.py +123 -0
- validmind/tests/model_validation/ragas/ContextRelevancy.py +114 -0
- validmind/tests/model_validation/ragas/Faithfulness.py +119 -0
- validmind/tests/model_validation/ragas/utils.py +66 -0
- validmind/tests/model_validation/sklearn/OverfitDiagnosis.py +3 -7
- validmind/tests/model_validation/sklearn/PermutationFeatureImportance.py +8 -9
- validmind/tests/model_validation/sklearn/PopulationStabilityIndex.py +5 -10
- validmind/tests/model_validation/sklearn/PrecisionRecallCurve.py +3 -2
- validmind/tests/model_validation/sklearn/ROCCurve.py +2 -1
- validmind/tests/model_validation/sklearn/RegressionR2Square.py +1 -1
- validmind/tests/model_validation/sklearn/RobustnessDiagnosis.py +2 -3
- validmind/tests/model_validation/sklearn/SHAPGlobalImportance.py +14 -12
- validmind/tests/model_validation/sklearn/WeakspotsDiagnosis.py +3 -4
- validmind/tests/model_validation/statsmodels/RegressionModelForecastPlot.py +1 -1
- validmind/tests/model_validation/statsmodels/RegressionModelForecastPlotLevels.py +1 -1
- validmind/tests/model_validation/statsmodels/RegressionModelInsampleComparison.py +1 -1
- validmind/tests/model_validation/statsmodels/RegressionModelOutsampleComparison.py +1 -1
- validmind/tests/model_validation/statsmodels/RegressionModelSummary.py +1 -1
- validmind/tests/model_validation/statsmodels/RegressionModelsCoeffs.py +1 -1
- validmind/tests/model_validation/statsmodels/RegressionModelsPerformance.py +1 -1
- validmind/tests/model_validation/statsmodels/ScorecardHistogram.py +5 -6
- validmind/unit_metrics/__init__.py +26 -49
- validmind/unit_metrics/composite.py +5 -1
- validmind/unit_metrics/regression/sklearn/AdjustedRSquaredScore.py +1 -1
- validmind/utils.py +56 -6
- validmind/vm_models/__init__.py +1 -1
- validmind/vm_models/dataset/__init__.py +7 -0
- validmind/vm_models/dataset/dataset.py +558 -0
- validmind/vm_models/dataset/utils.py +146 -0
- validmind/vm_models/model.py +97 -72
- validmind/vm_models/test/result_wrapper.py +61 -24
- validmind/vm_models/test_context.py +1 -1
- validmind/vm_models/test_suite/summary.py +3 -4
- {validmind-2.1.0.dist-info → validmind-2.2.2.dist-info}/METADATA +5 -3
- {validmind-2.1.0.dist-info → validmind-2.2.2.dist-info}/RECORD +100 -75
- validmind/models/catboost.py +0 -33
- validmind/models/statsmodels.py +0 -50
- validmind/models/xgboost.py +0 -30
- validmind/tests/model_validation/BertScoreAggregate.py +0 -90
- validmind/tests/model_validation/RegardHistogram.py +0 -148
- validmind/tests/model_validation/RougeMetrics.py +0 -147
- validmind/tests/model_validation/RougeMetricsAggregate.py +0 -133
- validmind/tests/model_validation/SelfCheckNLIScore.py +0 -112
- validmind/tests/model_validation/ToxicityHistogram.py +0 -136
- validmind/vm_models/dataset.py +0 -1303
- {validmind-2.1.0.dist-info → validmind-2.2.2.dist-info}/LICENSE +0 -0
- {validmind-2.1.0.dist-info → validmind-2.2.2.dist-info}/WHEEL +0 -0
- {validmind-2.1.0.dist-info → validmind-2.2.2.dist-info}/entry_points.txt +0 -0
validmind/tests/model_validation/ContextualRecall.py

@@ -2,109 +2,92 @@
# See the LICENSE file in the root of this repository for details.
# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial

-import itertools
-from dataclasses import dataclass
-
import nltk
import pandas as pd
import plotly.graph_objects as go

-from validmind …
+from validmind import tags, tasks


-@…
[… 1 removed line not captured in the source …]
+@tags("nlp", "text_data", "visualization")
+@tasks("text_classification", "text_summarization")
+def ContextualRecall(dataset, model):
    """
-    Evaluates a Natural Language Generation model's ability to generate contextually relevant and factually correct
-    text.
-
-    **Purpose**:
-    The Contextual Recall metric is used to evaluate the ability of a natural language generation (NLG) model to
-    generate text that appropriately reflects the given context or prompt. It measures the model's capability to
-    remember and reproduce the main context in its resulting output. This metric is critical in natural language
-    processing tasks, as the coherency and contextuality of the generated text are essential.
-
-    **Test Mechanism**:
-
-    1. **Preparation of Reference and Candidate Texts**:
-    - **Reference Texts**: Gather the reference text(s) which exemplify the expected or ideal output for a specific
-    context or prompt.
-    - **Candidate Texts**: Generate candidate text(s) from the NLG model under evaluation using the same context.
-    2. **Tokenization and Preprocessing**:
-    - Tokenize the reference and candidate texts into discernible words or tokens using libraries such as NLTK.
-    3. **Computation of Contextual Recall**:
-    - Identify the token overlap between the reference and candidate texts.
-    - The Contextual Recall score is computed by dividing the number of overlapping tokens by the total number of
-    tokens in the reference text. Scores are calculated for each test dataset instance, resulting in an array of
-    scores. These scores are then visualized using a line plot to show score variations across different rows.
-
-    **Signs of High Risk**:
-
-    - Low contextual recall scores could indicate that the model is not effectively reflecting the original context in
-    its output, leading to incoherent or contextually misaligned text.
-    - A consistent trend of low recall scores could suggest underperformance of the model.
+    Evaluates a Natural Language Generation model's ability to generate contextually relevant and factually correct text, visualizing the results through histograms and bar charts, alongside compiling a comprehensive table of descriptive statistics for contextual recall scores.

-    **…
+    **Purpose:**
+    The Contextual Recall metric is used to evaluate the ability of a natural language generation (NLG) model to generate text that appropriately reflects the given context or prompt. It measures the model's capability to remember and reproduce the main context in its resulting output. This metric is critical in natural language processing tasks, as the coherency and contextuality of the generated text are essential.

[… 5 removed lines not captured in the source …]
+    **Test Mechanism:**
+    The function starts by extracting the true and predicted values from the provided dataset and model. It then tokenizes the reference and candidate texts into discernible words or tokens using NLTK. The token overlap between the reference and candidate texts is identified, and the Contextual Recall score is computed by dividing the number of overlapping tokens by the total number of tokens in the reference text. Scores are calculated for each test dataset instance, resulting in an array of scores. These scores are visualized using a histogram and a bar chart to show score variations across different rows. Additionally, a table of descriptive statistics (mean, median, standard deviation, minimum, and maximum) is compiled for the contextual recall scores, providing a comprehensive summary of the model's performance.
+
+    **Signs of High Risk:**
+    - Low contextual recall scores could indicate that the model is not effectively reflecting the original context in its output, leading to incoherent or contextually misaligned text.
+    - A consistent trend of low recall scores could suggest underperformance of the model.

-    **…
+    **Strengths:**
+    - Provides a quantifiable measure of a model's adherence to the context and factual elements of the generated narrative.
+    - Visual representations (histograms and bar charts) make it easier to interpret the distribution and trends of contextual recall scores.
+    - Descriptive statistics offer a concise summary of the model's performance in generating contextually relevant texts.

[… 2 removed lines not captured in the source …]
-    texts lack coherence or meaningful context.
+    **Limitations:**
+    - The focus on word overlap could result in high scores for texts that use many common words, even when these texts lack coherence or meaningful context.
    - This metric does not consider the order of words, which could lead to overestimated scores for scrambled outputs.
    - Models that effectively use infrequent words might be undervalued, as these words might not overlap as often.
    """

[… 47 removed lines of the previous implementation not captured in the source …]
+    y_true = dataset.y
+    y_pred = dataset.y_pred(model)
+
+    score_list = []
+    for y_t, y_p in zip(y_true, y_pred):
+        # Tokenize the reference and candidate texts
+        reference_tokens = nltk.word_tokenize(y_t.lower())
+        candidate_tokens = nltk.word_tokenize(y_p.lower())
+
+        # Calculate overlapping tokens
+        overlapping_tokens = set(reference_tokens) & set(candidate_tokens)
+
+        # Compute contextual recall
+        score_list.append(len(overlapping_tokens) / len(reference_tokens))
+
+    metrics_df = pd.DataFrame(score_list, columns=["Contextual Recall"])
+    figures = []
+
+    # Histogram for Contextual Recall
+    hist_fig = go.Figure(data=[go.Histogram(x=metrics_df["Contextual Recall"])])
+    hist_fig.update_layout(
+        title="Contextual Recall Histogram",
+        xaxis_title="Contextual Recall",
+        yaxis_title="Count",
+    )
+    figures.append(hist_fig)
+
+    # Bar Chart for Contextual Recall
+    bar_fig = go.Figure(
+        data=[go.Bar(x=metrics_df.index, y=metrics_df["Contextual Recall"])]
+    )
+    bar_fig.update_layout(
+        title="Contextual Recall Bar Chart",
+        xaxis_title="Row Index",
+        yaxis_title="Contextual Recall",
+    )
+    figures.append(bar_fig)
+
+    # Calculate statistics for Contextual Recall
+    stats_df = metrics_df.describe().loc[["mean", "50%", "max", "min", "std"]]
+    stats_df = stats_df.rename(
+        index={
+            "mean": "Mean Score",
+            "50%": "Median Score",
+            "max": "Max Score",
+            "min": "Min Score",
+            "std": "Standard Deviation",
+        }
+    ).T
+    stats_df["Count"] = len(metrics_df)
+
+    # Create a DataFrame from all collected statistics
+    result_df = pd.DataFrame(stats_df).reset_index().rename(columns={"index": "Metric"})
+
+    return (result_df, *tuple(figures))
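The new `ContextualRecall` implementation reduces each row's score to the ratio of unique overlapping tokens to total reference tokens. A minimal standalone sketch of that calculation on toy strings (outside the ValidMind test harness; assumes NLTK and its `punkt` tokenizer data are available):

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer data used by nltk.word_tokenize

reference = "The quick brown fox jumps over the lazy dog"
candidate = "A quick brown fox jumped over a sleeping dog"

# Same steps as the test: lowercase, tokenize, intersect, divide by reference length
reference_tokens = nltk.word_tokenize(reference.lower())
candidate_tokens = nltk.word_tokenize(candidate.lower())
overlapping_tokens = set(reference_tokens) & set(candidate_tokens)

score = len(overlapping_tokens) / len(reference_tokens)
print(f"Contextual Recall: {score:.2f}")
```

As the docstring's Limitations note, the set intersection ignores token order and duplicates, so scrambled or repetitive candidates are not penalized.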
validmind/tests/model_validation/MeteorScore.py

@@ -2,91 +2,103 @@
# See the LICENSE file in the root of this repository for details.
# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial

-from dataclasses import dataclass
-
import evaluate
import pandas as pd
import plotly.graph_objects as go

-from validmind …
+from validmind import tags, tasks


-@…
[… 1 removed line not captured in the source …]
+@tags("nlp", "text_data", "visualization")
+@tasks("text_classification", "text_summarization")
+def MeteorScore(dataset, model):
    """
    Computes and visualizes the METEOR score for each text generation instance, assessing translation quality.

-    **Purpose…
[… 13 removed lines not captured in the source …]
-    **…
[… 2 removed lines not captured in the source …]
+    **Purpose:**
+    METEOR (Metric for Evaluation of Translation with Explicit ORdering) is designed to evaluate the quality of machine translations
+    by comparing them against reference translations. It emphasizes both the accuracy and fluency of translations, incorporating
+    precision, recall, and word order into its assessment.
+
+    **Test Mechanism:**
+    The function starts by extracting the true and predicted values from the provided dataset and model. The METEOR score is computed
+    for each pair of machine-generated translation (prediction) and its corresponding human-produced reference. This is done by
+    considering unigram matches between the translations, including matches based on surface forms, stemmed forms, and synonyms.
+    The score is a combination of unigram precision and recall, adjusted for word order through a fragmentation penalty. Scores are
+    compiled into a dataframe, and histograms and bar charts are generated to visualize the distribution of METEOR scores. Additionally,
+    a table of descriptive statistics (mean, median, standard deviation, minimum, and maximum) is compiled for the METEOR scores,
+    providing a comprehensive summary of the model's performance.
+
+    **Signs of High Risk:**
+    - Lower METEOR scores can indicate a lack of alignment between the machine-generated translations and their human-produced references,
+    highlighting potential deficiencies in both the accuracy and fluency of translations.
+    - Significant discrepancies in word order or an excessive fragmentation penalty could signal issues with how the translation model processes
+    and reconstructs sentence structures, potentially compromising the natural flow of translated text.
+    - Persistent underperformance across a variety of text types or linguistic contexts might suggest a broader inability of the model to adapt to the
+    nuances of different languages or dialects, pointing towards gaps in its training or inherent limitations.
+
+    **Strengths:**
+    - Incorporates a balanced consideration of precision and recall, weighted towards recall to reflect the importance of content coverage in translations.
    - Directly accounts for word order, offering a nuanced evaluation of translation fluency beyond simple lexical matching.
    - Adapts to various forms of lexical similarity, including synonyms and stemmed forms, allowing for flexible matching.

-    **Limitations…
-    - While comprehensive, the complexity of METEOR's calculation can make it computationally intensive, especially for
[… 2 removed lines not captured in the source …]
-    quality and relevance to the specific translation task.
+    **Limitations:**
+    - While comprehensive, the complexity of METEOR's calculation can make it computationally intensive, especially for large datasets.
+    - The use of external resources for synonym and stemming matching may introduce variability based on the resources' quality and relevance to the specific
+    translation task.
    """

[… 47 removed lines of the previous implementation not captured in the source …]
+    # Extract true and predicted values
+    y_true = dataset.y
+    y_pred = dataset.y_pred(model)
+
+    # Load the METEOR evaluation metric
+    meteor = evaluate.load("meteor")
+
+    # Calculate METEOR scores
+    score_list = []
+    for y_t, y_p in zip(y_true, y_pred):
+        # Compute the METEOR score
+        score = meteor.compute(predictions=[y_p], references=[y_t])
+        score_list.append(score["meteor"])
+
+    # Convert scores to a dataframe
+    metrics_df = pd.DataFrame(score_list, columns=["METEOR Score"])
+
+    figures = []
+
+    # Histogram for METEOR Score
+    hist_fig = go.Figure(data=[go.Histogram(x=metrics_df["METEOR Score"])])
+    hist_fig.update_layout(
+        title="METEOR Score Histogram",
+        xaxis_title="METEOR Score",
+        yaxis_title="Count",
+    )
+    figures.append(hist_fig)
+
+    # Bar Chart for METEOR Score
+    bar_fig = go.Figure(data=[go.Bar(x=metrics_df.index, y=metrics_df["METEOR Score"])])
+    bar_fig.update_layout(
+        title="METEOR Score Bar Chart",
+        xaxis_title="Row Index",
+        yaxis_title="METEOR Score",
+    )
+    figures.append(bar_fig)
+
+    # Calculate statistics for METEOR Score
+    stats_df = metrics_df.describe().loc[["mean", "50%", "max", "min", "std"]]
+    stats_df = stats_df.rename(
+        index={
+            "mean": "Mean Score",
+            "50%": "Median Score",
+            "max": "Max Score",
+            "min": "Min Score",
+            "std": "Standard Deviation",
+        }
+    ).T
+    stats_df["Count"] = len(metrics_df)
+
+    # Create a DataFrame from all collected statistics
+    result_df = pd.DataFrame(stats_df).reset_index().rename(columns={"index": "Metric"})
+
+    return (result_df, *tuple(figures))
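The METEOR computation above delegates to the Hugging Face `evaluate` package, scoring each prediction against a single reference. A minimal standalone sketch of that call on toy sentences (assumes the `evaluate` package and its NLTK dependency are installed):

```python
import evaluate

# Load the METEOR metric from the evaluate hub (downloads NLTK data on first use)
meteor = evaluate.load("meteor")

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

# compute() returns a dict with a single "meteor" key holding a score in [0, 1]
result = meteor.compute(predictions=predictions, references=references)
print(f"METEOR: {result['meteor']:.3f}")
```

Note that the test calls `compute()` once per row; passing the full lists in a single call would yield one aggregated score rather than the per-row values needed for the histogram and bar chart.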
validmind/tests/model_validation/RegardScore.py

@@ -2,142 +2,124 @@
# See the LICENSE file in the root of this repository for details.
# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial

-import itertools
-from dataclasses import dataclass
-
import evaluate
+import pandas as pd
import plotly.graph_objects as go
-from plotly.subplots import make_subplots

-from validmind …
+from validmind import tags, tasks


-@…
[… 1 removed line not captured in the source …]
+@tags("nlp", "text_data", "visualization")
+@tasks("text_classification", "text_summarization")
+def RegardScore(dataset, model):
    """
+    Computes and visualizes the regard score for each text instance, assessing sentiment and potential biases.
+
    **Purpose:**
-    The `RegardScore` metric …
-    whether it's a classification or summarization result. Especially crucial for applications like sentiment analysis,
-    product reviews, or opinion mining, it provides a granular understanding of how the model perceives or generates content
-    in terms of favorability or sentiment.
+    The `RegardScore` metric is designed to evaluate the regard levels (positive, negative, neutral, or other) of texts generated by models. This helps in understanding the sentiment and biases in the generated content.

    **Test Mechanism:**
-    The …
-    and the model's predicted regard. These elements undergo a series of consistency checks before being processed. Using
-    the `evaluate.load("regard")` tool, regard scores are computed for each segment of text. The results are then visualized
-    in a multi-subplot line graph, where each subplot corresponds to a particular category of regard (e.g., positive, negative,
-    neutral, other) against the input, target, and predicted texts.
+    The function starts by extracting the true and predicted values from the provided dataset and model. The regard scores are computed for each text using a preloaded `regard` evaluation tool. The scores are compiled into dataframes, and histograms and bar charts are generated to visualize the distribution of regard scores. Additionally, a table of descriptive statistics (mean, median, standard deviation, minimum, and maximum) is compiled for the regard scores, providing a comprehensive summary of the model's performance.

    **Signs of High Risk:**
[… 1 removed line not captured in the source …]
-    the model …
-    indicate the model's inability to correctly identify or generate balanced sentiments.
+    - Noticeable skewness in the histogram, especially when comparing the predicted regard scores with the target regard scores, could indicate biases or inconsistencies in the model.
+    - Lack of neutral scores in the model's predictions, despite a balanced distribution in the target data, might signal an issue.

    **Strengths:**
[… 3 removed lines not captured in the source …]
+    - Provides a clear evaluation of regard levels in generated texts, helping to ensure content appropriateness.
+    - Visual representations (histograms and bar charts) make it easier to interpret the distribution and trends of regard scores.
+    - Descriptive statistics offer a concise summary of the model's performance in generating texts with balanced sentiments.

    **Limitations:**
-    The …
[… 2 removed lines not captured in the source …]
-    real-world sentiments often exist on a more complex spectrum. The metric's efficacy is intertwined with the accuracy of
-    the model's predictions; any inherent model inaccuracies can impact the metric's reflection of true sentiments.
+    - The accuracy of the regard scores is contingent upon the underlying `regard` tool.
+    - The scores provide a broad overview but do not specify which portions or tokens of the text are responsible for high regard.
+    - Supplementary, in-depth analysis might be needed for granular insights.
    """

[… 17 removed lines not captured in the source …]
+    # Extract true and predicted values
+    y_true = dataset.y
+    y_pred = dataset.y_pred(model)
+
+    # Load the regard evaluation metric
+    regard_tool = evaluate.load("regard")
+
+    # Function to calculate regard scores
+    def compute_regard_scores(texts):
+        scores = regard_tool.compute(data=texts)["regard"]
+        regard_dicts = [
+            dict((x["label"], x["score"]) for x in sublist) for sublist in scores
+        ]
+        return regard_dicts
+
+    # Calculate regard scores for true and predicted texts
+    true_regard = compute_regard_scores(y_true)
+    pred_regard = compute_regard_scores(y_pred)
+
+    # Convert scores to dataframes
+    true_df = pd.DataFrame(true_regard)
+    pred_df = pd.DataFrame(pred_regard)
+
+    figures = []
+
+    # Function to create histogram and bar chart for regard scores
+    def create_figures(df, title):
+        for category in df.columns:
+            # Histogram
+            hist_fig = go.Figure(data=[go.Histogram(x=df[category])])
+            hist_fig.update_layout(
+                title=f"{title} - {category.capitalize()} Histogram",
+                xaxis_title=category.capitalize(),
+                yaxis_title="Count",
            )
[… 52 removed lines of the previous implementation not captured in the source …]
-                    hoverinfo="y+name",
-                    line=dict(color=category_colors[category], width=1.5),
-                    showlegend=False,
-                ),
-                row=row,
-                col=col,
-            )
-            row_offset += 2
-
-        subplot_height = 350
-        total_height = total_rows * subplot_height + 200
-
-        fig.update_layout(title_text="Regard Scores", height=total_height)
-        fig.update_yaxes(range=[0, 1])
-        fig.update_xaxes(showticklabels=False, row=1, col=1)
-        fig.update_xaxes(title_text="Index", showticklabels=True, row=1, col=1)
-        fig.update_yaxes(title_text="Score", showticklabels=True, row=1, col=1)
-
-        return fig
-
-    def run(self):
-        fig = self.regard_line_plot()
-        return self.cache_results(
-            figures=[Figure(for_object=self, key=self.key, figure=fig)]
-        )
+            figures.append(hist_fig)
+
+            # Bar Chart
+            bar_fig = go.Figure(data=[go.Bar(x=df.index, y=df[category])])
+            bar_fig.update_layout(
+                title=f"{title} - {category.capitalize()} Bar Chart",
+                xaxis_title="Text Instance Index",
+                yaxis_title=category.capitalize(),
+            )
+            figures.append(bar_fig)
+
+    # Create figures for each regard score dataframe
+    create_figures(true_df, "True Text Regard")
+    create_figures(pred_df, "Predicted Text Regard")
+
+    # Calculate statistics for each regard score dataframe
+    def calculate_stats(df, metric_name):
+        stats = df.describe().loc[["mean", "50%", "max", "min", "std"]].T
+        stats.columns = [
+            "Mean Score",
+            "Median Score",
+            "Max Score",
+            "Min Score",
+            "Standard Deviation",
+        ]
+        stats["Metric"] = metric_name
+        stats["Count"] = len(df)
+        return stats
+
+    true_stats = calculate_stats(true_df, "True Text Regard")
+    pred_stats = calculate_stats(pred_df, "Predicted Text Regard")
+
+    # Combine statistics into a single dataframe
+    result_df = (
+        pd.concat([true_stats, pred_stats])
+        .reset_index()
+        .rename(columns={"index": "Category"})
+    )
+    result_df = result_df[
+        [
+            "Metric",
+            "Category",
+            "Mean Score",
+            "Median Score",
+            "Max Score",
+            "Min Score",
+            "Standard Deviation",
+            "Count",
+        ]
+    ]
+
+    return (result_df, *tuple(figures))
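The `regard` measurement used above returns, for each input text, a list of label/score pairs that the test flattens into one dictionary per row before building its dataframes. A minimal standalone sketch of that reshaping on toy sentences (outside the ValidMind harness; the first call downloads the underlying regard classifier):

```python
import evaluate
import pandas as pd

# Load the regard measurement from the evaluate hub
regard_tool = evaluate.load("regard")

texts = [
    "The nurse was praised for her excellent care.",
    "The politician was widely criticized for the scandal.",
]

# Each text yields a list of {"label", "score"} entries (positive, negative,
# neutral, other); flatten them into one {label: score} dict per text
scores = regard_tool.compute(data=texts)["regard"]
rows = [{entry["label"]: entry["score"] for entry in sublist} for sublist in scores]

df = pd.DataFrame(rows)
print(df.round(3))  # one column per regard category, one row per text
```

The resulting dataframe has the same shape as `true_df` and `pred_df` in the test, one column per regard category, which is what the per-category histograms, bar charts, and descriptive statistics are built from.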