validmind 2.5.8__py3-none-any.whl → 2.5.15__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- validmind/__version__.py +1 -1
- validmind/ai/test_descriptions.py +26 -7
- validmind/api_client.py +89 -43
- validmind/client.py +2 -2
- validmind/client_config.py +11 -14
- validmind/datasets/regression/fred_timeseries.py +67 -138
- validmind/template.py +1 -0
- validmind/test_suites/__init__.py +0 -2
- validmind/test_suites/statsmodels_timeseries.py +1 -1
- validmind/test_suites/summarization.py +0 -1
- validmind/test_suites/time_series.py +0 -43
- validmind/tests/__types__.py +3 -13
- validmind/tests/data_validation/ACFandPACFPlot.py +15 -13
- validmind/tests/data_validation/ADF.py +31 -24
- validmind/tests/data_validation/AutoAR.py +9 -9
- validmind/tests/data_validation/AutoMA.py +23 -16
- validmind/tests/data_validation/AutoSeasonality.py +18 -16
- validmind/tests/data_validation/AutoStationarity.py +21 -16
- validmind/tests/data_validation/BivariateScatterPlots.py +67 -96
- validmind/tests/data_validation/ChiSquaredFeaturesTable.py +82 -124
- validmind/tests/data_validation/ClassImbalance.py +15 -12
- validmind/tests/data_validation/DFGLSArch.py +19 -13
- validmind/tests/data_validation/DatasetDescription.py +17 -11
- validmind/tests/data_validation/DatasetSplit.py +7 -5
- validmind/tests/data_validation/DescriptiveStatistics.py +28 -21
- validmind/tests/data_validation/Duplicates.py +33 -25
- validmind/tests/data_validation/EngleGrangerCoint.py +35 -33
- validmind/tests/data_validation/FeatureTargetCorrelationPlot.py +59 -71
- validmind/tests/data_validation/HighCardinality.py +19 -12
- validmind/tests/data_validation/HighPearsonCorrelation.py +27 -22
- validmind/tests/data_validation/IQROutliersBarPlot.py +13 -10
- validmind/tests/data_validation/IQROutliersTable.py +40 -36
- validmind/tests/data_validation/IsolationForestOutliers.py +21 -14
- validmind/tests/data_validation/KPSS.py +34 -29
- validmind/tests/data_validation/LaggedCorrelationHeatmap.py +22 -15
- validmind/tests/data_validation/MissingValues.py +32 -27
- validmind/tests/data_validation/MissingValuesBarPlot.py +25 -21
- validmind/tests/data_validation/PearsonCorrelationMatrix.py +71 -84
- validmind/tests/data_validation/PhillipsPerronArch.py +37 -30
- validmind/tests/data_validation/RollingStatsPlot.py +31 -23
- validmind/tests/data_validation/ScatterPlot.py +63 -78
- validmind/tests/data_validation/SeasonalDecompose.py +38 -34
- validmind/tests/data_validation/Skewness.py +35 -37
- validmind/tests/data_validation/SpreadPlot.py +35 -35
- validmind/tests/data_validation/TabularCategoricalBarPlots.py +23 -17
- validmind/tests/data_validation/TabularDateTimeHistograms.py +21 -13
- validmind/tests/data_validation/TabularDescriptionTables.py +51 -16
- validmind/tests/data_validation/TabularNumericalHistograms.py +25 -22
- validmind/tests/data_validation/TargetRateBarPlots.py +21 -14
- validmind/tests/data_validation/TimeSeriesDescription.py +25 -18
- validmind/tests/data_validation/TimeSeriesDescriptiveStatistics.py +23 -17
- validmind/tests/data_validation/TimeSeriesFrequency.py +24 -17
- validmind/tests/data_validation/TimeSeriesHistogram.py +33 -32
- validmind/tests/data_validation/TimeSeriesLinePlot.py +17 -10
- validmind/tests/data_validation/TimeSeriesMissingValues.py +15 -10
- validmind/tests/data_validation/TimeSeriesOutliers.py +37 -33
- validmind/tests/data_validation/TooManyZeroValues.py +16 -11
- validmind/tests/data_validation/UniqueRows.py +11 -6
- validmind/tests/data_validation/WOEBinPlots.py +23 -16
- validmind/tests/data_validation/WOEBinTable.py +35 -30
- validmind/tests/data_validation/ZivotAndrewsArch.py +34 -28
- validmind/tests/data_validation/nlp/CommonWords.py +21 -14
- validmind/tests/data_validation/nlp/Hashtags.py +27 -20
- validmind/tests/data_validation/nlp/LanguageDetection.py +33 -14
- validmind/tests/data_validation/nlp/Mentions.py +21 -15
- validmind/tests/data_validation/nlp/PolarityAndSubjectivity.py +32 -9
- validmind/tests/data_validation/nlp/Punctuations.py +24 -20
- validmind/tests/data_validation/nlp/Sentiment.py +27 -8
- validmind/tests/data_validation/nlp/StopWords.py +26 -19
- validmind/tests/data_validation/nlp/TextDescription.py +36 -35
- validmind/tests/data_validation/nlp/Toxicity.py +32 -9
- validmind/tests/decorator.py +81 -42
- validmind/tests/model_validation/BertScore.py +36 -27
- validmind/tests/model_validation/BleuScore.py +25 -19
- validmind/tests/model_validation/ClusterSizeDistribution.py +38 -34
- validmind/tests/model_validation/ContextualRecall.py +35 -13
- validmind/tests/model_validation/FeaturesAUC.py +32 -13
- validmind/tests/model_validation/MeteorScore.py +46 -33
- validmind/tests/model_validation/ModelMetadata.py +32 -64
- validmind/tests/model_validation/ModelPredictionResiduals.py +75 -73
- validmind/tests/model_validation/RegardScore.py +30 -14
- validmind/tests/model_validation/RegressionResidualsPlot.py +10 -5
- validmind/tests/model_validation/RougeScore.py +36 -30
- validmind/tests/model_validation/TimeSeriesPredictionWithCI.py +30 -14
- validmind/tests/model_validation/TimeSeriesPredictionsPlot.py +27 -30
- validmind/tests/model_validation/TimeSeriesR2SquareBySegments.py +68 -63
- validmind/tests/model_validation/TokenDisparity.py +31 -23
- validmind/tests/model_validation/ToxicityScore.py +26 -17
- validmind/tests/model_validation/embeddings/ClusterDistribution.py +24 -20
- validmind/tests/model_validation/embeddings/CosineSimilarityComparison.py +30 -27
- validmind/tests/model_validation/embeddings/CosineSimilarityDistribution.py +7 -5
- validmind/tests/model_validation/embeddings/CosineSimilarityHeatmap.py +32 -23
- validmind/tests/model_validation/embeddings/DescriptiveAnalytics.py +7 -5
- validmind/tests/model_validation/embeddings/EmbeddingsVisualization2D.py +15 -11
- validmind/tests/model_validation/embeddings/EuclideanDistanceComparison.py +29 -29
- validmind/tests/model_validation/embeddings/EuclideanDistanceHeatmap.py +34 -25
- validmind/tests/model_validation/embeddings/PCAComponentsPairwisePlots.py +38 -26
- validmind/tests/model_validation/embeddings/StabilityAnalysis.py +40 -1
- validmind/tests/model_validation/embeddings/StabilityAnalysisKeyword.py +18 -17
- validmind/tests/model_validation/embeddings/StabilityAnalysisRandomNoise.py +40 -45
- validmind/tests/model_validation/embeddings/StabilityAnalysisSynonyms.py +17 -19
- validmind/tests/model_validation/embeddings/StabilityAnalysisTranslation.py +29 -25
- validmind/tests/model_validation/embeddings/TSNEComponentsPairwisePlots.py +38 -28
- validmind/tests/model_validation/ragas/AnswerCorrectness.py +5 -4
- validmind/tests/model_validation/ragas/AnswerRelevance.py +5 -4
- validmind/tests/model_validation/ragas/AnswerSimilarity.py +5 -4
- validmind/tests/model_validation/ragas/AspectCritique.py +7 -0
- validmind/tests/model_validation/ragas/ContextEntityRecall.py +9 -8
- validmind/tests/model_validation/ragas/ContextPrecision.py +5 -4
- validmind/tests/model_validation/ragas/ContextRecall.py +5 -4
- validmind/tests/model_validation/ragas/Faithfulness.py +5 -4
- validmind/tests/model_validation/ragas/utils.py +6 -0
- validmind/tests/model_validation/sklearn/AdjustedMutualInformation.py +19 -12
- validmind/tests/model_validation/sklearn/AdjustedRandIndex.py +22 -17
- validmind/tests/model_validation/sklearn/ClassifierPerformance.py +27 -25
- validmind/tests/model_validation/sklearn/ClusterCosineSimilarity.py +7 -5
- validmind/tests/model_validation/sklearn/ClusterPerformance.py +40 -78
- validmind/tests/model_validation/sklearn/ClusterPerformanceMetrics.py +15 -17
- validmind/tests/model_validation/sklearn/CompletenessScore.py +17 -11
- validmind/tests/model_validation/sklearn/ConfusionMatrix.py +22 -15
- validmind/tests/model_validation/sklearn/FeatureImportance.py +95 -0
- validmind/tests/model_validation/sklearn/FowlkesMallowsScore.py +7 -7
- validmind/tests/model_validation/sklearn/HomogeneityScore.py +19 -12
- validmind/tests/model_validation/sklearn/HyperParametersTuning.py +35 -30
- validmind/tests/model_validation/sklearn/KMeansClustersOptimization.py +10 -5
- validmind/tests/model_validation/sklearn/MinimumAccuracy.py +32 -32
- validmind/tests/model_validation/sklearn/MinimumF1Score.py +23 -23
- validmind/tests/model_validation/sklearn/MinimumROCAUCScore.py +15 -10
- validmind/tests/model_validation/sklearn/ModelsPerformanceComparison.py +26 -19
- validmind/tests/model_validation/sklearn/OverfitDiagnosis.py +38 -18
- validmind/tests/model_validation/sklearn/PermutationFeatureImportance.py +31 -25
- validmind/tests/model_validation/sklearn/PopulationStabilityIndex.py +8 -6
- validmind/tests/model_validation/sklearn/PrecisionRecallCurve.py +24 -17
- validmind/tests/model_validation/sklearn/ROCCurve.py +12 -7
- validmind/tests/model_validation/sklearn/RegressionErrors.py +74 -130
- validmind/tests/model_validation/sklearn/RegressionErrorsComparison.py +27 -12
- validmind/tests/model_validation/sklearn/{RegressionModelsPerformanceComparison.py → RegressionPerformance.py} +18 -20
- validmind/tests/model_validation/sklearn/RegressionR2Square.py +55 -93
- validmind/tests/model_validation/sklearn/RegressionR2SquareComparison.py +32 -13
- validmind/tests/model_validation/sklearn/RobustnessDiagnosis.py +36 -32
- validmind/tests/model_validation/sklearn/SHAPGlobalImportance.py +7 -5
- validmind/tests/model_validation/sklearn/SilhouettePlot.py +27 -19
- validmind/tests/model_validation/sklearn/TrainingTestDegradation.py +25 -18
- validmind/tests/model_validation/sklearn/VMeasure.py +14 -13
- validmind/tests/model_validation/sklearn/WeakspotsDiagnosis.py +7 -5
- validmind/tests/model_validation/statsmodels/AutoARIMA.py +24 -18
- validmind/tests/model_validation/statsmodels/BoxPierce.py +14 -10
- validmind/tests/model_validation/statsmodels/CumulativePredictionProbabilities.py +73 -104
- validmind/tests/model_validation/statsmodels/DurbinWatsonTest.py +19 -12
- validmind/tests/model_validation/statsmodels/GINITable.py +44 -77
- validmind/tests/model_validation/statsmodels/JarqueBera.py +27 -22
- validmind/tests/model_validation/statsmodels/KolmogorovSmirnov.py +33 -34
- validmind/tests/model_validation/statsmodels/LJungBox.py +32 -28
- validmind/tests/model_validation/statsmodels/Lilliefors.py +27 -24
- validmind/tests/model_validation/statsmodels/PredictionProbabilitiesHistogram.py +87 -119
- validmind/tests/model_validation/statsmodels/RegressionCoeffs.py +100 -0
- validmind/tests/model_validation/statsmodels/RegressionFeatureSignificance.py +14 -9
- validmind/tests/model_validation/statsmodels/RegressionModelForecastPlot.py +17 -13
- validmind/tests/model_validation/statsmodels/RegressionModelForecastPlotLevels.py +46 -43
- validmind/tests/model_validation/statsmodels/RegressionModelSensitivityPlot.py +38 -36
- validmind/tests/model_validation/statsmodels/RegressionModelSummary.py +30 -28
- validmind/tests/model_validation/statsmodels/RegressionPermutationFeatureImportance.py +18 -11
- validmind/tests/model_validation/statsmodels/RunsTest.py +32 -28
- validmind/tests/model_validation/statsmodels/ScorecardHistogram.py +75 -107
- validmind/tests/model_validation/statsmodels/ShapiroWilk.py +15 -8
- validmind/tests/ongoing_monitoring/FeatureDrift.py +10 -6
- validmind/tests/ongoing_monitoring/PredictionAcrossEachFeature.py +31 -25
- validmind/tests/ongoing_monitoring/PredictionCorrelation.py +29 -21
- validmind/tests/ongoing_monitoring/TargetPredictionDistributionPlot.py +31 -23
- validmind/tests/prompt_validation/Bias.py +14 -11
- validmind/tests/prompt_validation/Clarity.py +16 -14
- validmind/tests/prompt_validation/Conciseness.py +7 -5
- validmind/tests/prompt_validation/Delimitation.py +23 -22
- validmind/tests/prompt_validation/NegativeInstruction.py +7 -5
- validmind/tests/prompt_validation/Robustness.py +12 -10
- validmind/tests/prompt_validation/Specificity.py +13 -11
- validmind/tests/prompt_validation/ai_powered_test.py +6 -0
- validmind/tests/run.py +68 -23
- validmind/unit_metrics/__init__.py +81 -144
- validmind/unit_metrics/classification/{sklearn/Accuracy.py → Accuracy.py} +1 -1
- validmind/unit_metrics/classification/{sklearn/F1.py → F1.py} +1 -1
- validmind/unit_metrics/classification/{sklearn/Precision.py → Precision.py} +1 -1
- validmind/unit_metrics/classification/{sklearn/ROC_AUC.py → ROC_AUC.py} +1 -2
- validmind/unit_metrics/classification/{sklearn/Recall.py → Recall.py} +1 -1
- validmind/unit_metrics/regression/{sklearn/AdjustedRSquaredScore.py → AdjustedRSquaredScore.py} +1 -1
- validmind/unit_metrics/regression/GiniCoefficient.py +1 -1
- validmind/unit_metrics/regression/HuberLoss.py +1 -1
- validmind/unit_metrics/regression/KolmogorovSmirnovStatistic.py +1 -1
- validmind/unit_metrics/regression/{sklearn/MeanAbsoluteError.py → MeanAbsoluteError.py} +1 -1
- validmind/unit_metrics/regression/MeanAbsolutePercentageError.py +1 -1
- validmind/unit_metrics/regression/MeanBiasDeviation.py +1 -1
- validmind/unit_metrics/regression/{sklearn/MeanSquaredError.py → MeanSquaredError.py} +1 -1
- validmind/unit_metrics/regression/QuantileLoss.py +1 -1
- validmind/unit_metrics/regression/{sklearn/RSquaredScore.py → RSquaredScore.py} +1 -1
- validmind/unit_metrics/regression/{sklearn/RootMeanSquaredError.py → RootMeanSquaredError.py} +1 -1
- validmind/vm_models/dataset/dataset.py +2 -0
- validmind/vm_models/figure.py +5 -0
- validmind/vm_models/test/result_wrapper.py +93 -132
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/METADATA +1 -1
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/RECORD +203 -210
- validmind/tests/data_validation/ANOVAOneWayTable.py +0 -138
- validmind/tests/data_validation/BivariateFeaturesBarPlots.py +0 -142
- validmind/tests/data_validation/BivariateHistograms.py +0 -117
- validmind/tests/data_validation/HeatmapFeatureCorrelations.py +0 -124
- validmind/tests/data_validation/MissingValuesRisk.py +0 -88
- validmind/tests/model_validation/ModelMetadataComparison.py +0 -59
- validmind/tests/model_validation/sklearn/FeatureImportanceComparison.py +0 -83
- validmind/tests/model_validation/statsmodels/RegressionCoeffsPlot.py +0 -135
- validmind/tests/model_validation/statsmodels/RegressionModelsCoeffs.py +0 -103
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/LICENSE +0 -0
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/WHEEL +0 -0
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/entry_points.txt +0 -0
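One practical consequence of the renames above is that the classification and regression unit metrics move up one level, dropping the intermediate `sklearn` package (e.g. `unit_metrics/classification/{sklearn/F1.py → F1.py}`). Below is a minimal sketch of what that path change implies for code that refers to metrics by module ID; the IDs are assumptions inferred from the file layout, not something this diff shows directly.

```python
# Illustrative only: metric IDs assumed from the renamed file paths above.
OLD_F1_ID = "validmind.unit_metrics.classification.sklearn.F1"  # 2.5.8 layout
NEW_F1_ID = "validmind.unit_metrics.classification.F1"          # 2.5.15 layout


def migrate_metric_id(metric_id: str) -> str:
    """Drop the 'sklearn' segment that 2.5.15 removes from unit-metric module paths."""
    return metric_id.replace(".sklearn.", ".")


assert migrate_metric_id(OLD_F1_ID) == NEW_F1_ID
```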
validmind/tests/model_validation/MeteorScore.py

```diff
@@ -13,39 +13,52 @@ from validmind import tags, tasks
 @tasks("text_classification", "text_summarization")
 def MeteorScore(dataset, model):
     """
-    [...]
+    Assesses the quality of machine-generated translations by comparing them to human-produced references using the
+    METEOR score, which evaluates precision, recall, and word order.
+
+    ### Purpose
+
+    The METEOR (Metric for Evaluation of Translation with Explicit ORdering) score is designed to evaluate the quality
+    of machine translations by comparing them against reference translations. It emphasizes both the accuracy and
+    fluency of translations, incorporating precision, recall, and word order into its assessment.
+
+    ### Test Mechanism
+
+    The function starts by extracting the true and predicted values from the provided dataset and model. The METEOR
+    score is computed for each pair of machine-generated translation (prediction) and its corresponding human-produced
+    reference. This is done by considering unigram matches between the translations, including matches based on surface
+    forms, stemmed forms, and synonyms. The score is a combination of unigram precision and recall, adjusted for word
+    order through a fragmentation penalty. Scores are compiled into a dataframe, and histograms and bar charts are
+    generated to visualize the distribution of METEOR scores. Additionally, a table of descriptive statistics (mean,
+    median, standard deviation, minimum, and maximum) is compiled for the METEOR scores, providing a comprehensive
+    summary of the model's performance.
+
+    ### Signs of High Risk
+
+    - Lower METEOR scores can indicate a lack of alignment between the machine-generated translations and their
+    human-produced references, highlighting potential deficiencies in both the accuracy and fluency of translations.
+    - Significant discrepancies in word order or an excessive fragmentation penalty could signal issues with how the
+    translation model processes and reconstructs sentence structures, potentially compromising the natural flow of
+    translated text.
+    - Persistent underperformance across a variety of text types or linguistic contexts might suggest a broader
+    inability of the model to adapt to the nuances of different languages or dialects, pointing towards gaps in its
+    training or inherent limitations.
+
+    ### Strengths
+
+    - Incorporates a balanced consideration of precision and recall, weighted towards recall to reflect the importance
+    of content coverage in translations.
+    - Directly accounts for word order, offering a nuanced evaluation of translation fluency beyond simple lexical
+    matching.
+    - Adapts to various forms of lexical similarity, including synonyms and stemmed forms, allowing for flexible
+    matching.
+
+    ### Limitations
+
+    - While comprehensive, the complexity of METEOR's calculation can make it computationally intensive, especially for
+    large datasets.
+    - The use of external resources for synonym and stemming matching may introduce variability based on the resources'
+    quality and relevance to the specific translation task.
     """
 
     # Extract true and predicted values
```
validmind/tests/model_validation/ModelMetadata.py

```diff
@@ -2,66 +2,36 @@
 # See the LICENSE file in the root of this repository for details.
 # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
 
-from dataclasses import dataclass
-
 import pandas as pd
 
+from validmind import tags, tasks
 from validmind.utils import get_model_info
-from validmind.vm_models import Metric, ResultSummary, ResultTable
 
 
-@dataclass
-class ModelMetadata(Metric):
+@tags("model_training", "metadata")
+@tasks("regression", "time_series_forecasting")
+def ModelMetadata(model):
     """
-
-
-    **Purpose:**
-    This test is designed to collect and summarize important metadata related to a particular machine learning model.
-    Such metadata includes the model's architecture (modeling technique), the version and type of modeling framework
-    used, and the programming language the model is written in.
-
-    **Test Mechanism:**
-    The mechanism of this test consists of extracting information from the model instance. It tries to extract the
-    model information such as the modeling technique used, the modeling framework version, and the programming
-    language. It decorates this information into a data frame and returns a summary of the results.
+    Compare metadata of different models and generate a summary table with the results.
 
-    **
+    **Purpose**: The purpose of this function is to compare the metadata of different models, including information about their architecture, framework, framework version, and programming language.
 
-
-    - Unidentifiable language, outdated or unsupported versions of modeling frameworks, or undisclosed model
-    architectures reflect risky situations, as they could hinder future reproducibility, support, and debugging of the
-    model.
+    **Test Mechanism**: The function retrieves the metadata for each model using `get_model_info`, renames columns according to a predefined set of labels, and compiles this information into a summary table.
 
-    **
+    **Signs of High Risk**:
+    - Inconsistent or missing metadata across models can indicate potential issues in model documentation or management.
+    - Significant differences in framework versions or programming languages might pose challenges in model integration and deployment.
 
-
-
-    -
-
-    compliance of software policies, and assists in planning for model obsolescence due to evolving or discontinuing
-    software and dependencies.
+    **Strengths**:
+    - Provides a clear comparison of essential model metadata.
+    - Standardizes metadata labels for easier interpretation and comparison.
+    - Helps identify potential compatibility or consistency issues across models.
 
-    **Limitations
-
-    -
-
-    - If the model's built-in methods for describing its architecture, framework or language are incorrect or lack
-    necessary information, this test will hold limitations.
-    - Moreover, it is not designed to directly evaluate the performance or accuracy of the model, rather it provides
-    supplementary information which aids in comprehensive analysis.
+    **Limitations**:
+    - Assumes that the `get_model_info` function returns all necessary metadata fields.
+    - Relies on the correctness and completeness of the metadata provided by each model.
+    - Does not include detailed parameter information, focusing instead on high-level metadata.
     """
-
-    name = "model_metadata"
-    required_inputs = ["model"]
-    tasks = [
-        "classification",
-        "regression",
-        "text_classification",
-        "text_summarization",
-    ]
-
-    tags = ["model_metadata"]
-
     column_labels = {
         "architecture": "Modeling Technique",
         "framework": "Modeling Framework",
@@ -69,22 +39,20 @@ class ModelMetadata(Metric):
         "language": "Programming Language",
     }
 
-    def
-        df = pd.DataFrame(metric_value.items(), columns=["Attribute", "Value"])
-        # Don't serialize the params attribute
-        df = df[df["Attribute"] != "params"]
-        df["Attribute"] = df["Attribute"].map(self.column_labels)
-
-        return ResultSummary(
-            results=[
-                ResultTable(data=df.to_dict(orient="records")),
-            ]
-        )
-
-    def run(self):
+    def extract_and_rename_metadata(model):
         """
-        Extracts
+        Extracts metadata for a single model and renames columns based on predefined labels.
         """
-        model_info = get_model_info(
+        model_info = get_model_info(model)
+        renamed_info = {
+            column_labels.get(k, k): v for k, v in model_info.items() if k != "params"
+        }
+        return renamed_info
+
+    # Collect metadata for all models
+    metadata_list = [extract_and_rename_metadata(model)]
+
+    # Create a DataFrame from the collected metadata
+    metadata_df = pd.DataFrame(metadata_list)
 
-
+    return metadata_df
```
validmind/tests/model_validation/ModelPredictionResiduals.py

```diff
@@ -12,92 +12,94 @@ from validmind import tags, tasks
 @tags("regression")
 @tasks("residual_analysis", "visualization")
 def ModelPredictionResiduals(
-    [...]
+    dataset, model, nbins=100, p_value_threshold=0.05, start_date=None, end_date=None
 ):
     """
-    [...]
-    with the Kolmogorov-Smirnov normality test results.
+    Assesses normality and behavior of residuals in regression models through visualization and statistical tests.
 
-    [...]
-    assess the normality of residuals using the Kolmogorov-Smirnov test.
+    ### Purpose
 
-    [...]
+    The Model Prediction Residuals test aims to visualize the residuals of model predictions and assess their normality
+    using the Kolmogorov-Smirnov (KS) test. It helps to identify potential issues related to model assumptions and
+    effectiveness.
+
+    ### Test Mechanism
+
+    The function calculates residuals and generates
+    two figures: one for the time series of residuals and one for the histogram of residuals.
     It also calculates the KS test for normality and summarizes the results in a table.
 
-    [...]
+    ### Signs of High Risk
+
+    - Residuals are not normally distributed, indicating potential issues with model assumptions.
+    - High skewness or kurtosis in the residuals, which may suggest model misspecification.
 
-    [...]
+    ### Strengths
+
+    - Provides clear visualizations of residuals over time and their distribution.
     - Includes statistical tests to assess the normality of residuals.
+    - Helps in identifying potential model misspecifications and assumption violations.
+
+    ### Limitations
 
-    [...]
-    - Only generates plots for datasets with a datetime index, and will raise an error for other types of indices.
+    - Assumes that the dataset is provided as a DataFrameDataset object with a .df attribute to access the pandas
+    DataFrame.
+    - Only generates plots for datasets with a datetime index, resulting in errors for other types of indices.
     """
 
+    df = dataset.df.copy()
+
+    # Filter DataFrame by date range if specified
+    if start_date:
+        df = df[df.index >= pd.to_datetime(start_date)]
+    if end_date:
+        df = df[df.index <= pd.to_datetime(end_date)]
+
+    y_true = dataset.y
+    y_pred = dataset.y_pred(model)
+    residuals = y_true - y_pred
+
     figures = []
-    [...]
-    )
-    figures.append(hist_fig)
-
-    # Perform KS normality test
-    ks_stat, p_value = kstest(
-        residuals, "norm", args=(residuals.mean(), residuals.std())
-    )
-    ks_normality = "Normal" if p_value > p_value_threshold else "Not Normal"
-
-    summary.append(
-        {
-            "Model": model.input_id,
-            "KS Statistic": ks_stat,
-            "p-value": p_value,
-            "KS Normality": ks_normality,
-            "p-value Threshold": p_value_threshold,
-        }
-    )
+
+    # Plot residuals
+    residuals_fig = go.Figure()
+    residuals_fig.add_trace(
+        go.Scatter(x=df.index, y=residuals, mode="markers", name="Residuals")
+    )
+    residuals_fig.update_layout(
+        title="Residuals",
+        yaxis_title="Residuals",
+        font=dict(size=16),
+        showlegend=False,
+    )
+    figures.append(residuals_fig)
+
+    # Plot histogram of residuals
+    hist_fig = go.Figure()
+    hist_fig.add_trace(go.Histogram(x=residuals, nbinsx=nbins, name="Residuals"))
+    hist_fig.update_layout(
+        title="Histogram of Residuals",
+        xaxis_title="Residuals",
+        yaxis_title="Frequency",
+        font=dict(size=16),
+        showlegend=False,
+    )
+    figures.append(hist_fig)
+
+    # Perform KS normality test
+    ks_stat, p_value = kstest(
+        residuals, "norm", args=(residuals.mean(), residuals.std())
+    )
+    ks_normality = "Normal" if p_value > p_value_threshold else "Not Normal"
+
+    summary = {
+        "KS Statistic": ks_stat,
+        "p-value": p_value,
+        "KS Normality": ks_normality,
+        "p-value Threshold": p_value_threshold,
+    }
 
     # Create a summary DataFrame for the KS normality test results
-    summary_df = pd.DataFrame(summary)
+    summary_df = pd.DataFrame([summary])
 
     return (summary_df, *figures)
```
validmind/tests/model_validation/RegardScore.py

```diff
@@ -13,26 +13,42 @@ from validmind import tags, tasks
 @tasks("text_classification", "text_summarization")
 def RegardScore(dataset, model):
     """
-    [...]
+    Assesses the sentiment and potential biases in text generated by NLP models by computing and visualizing regard
+    scores.
 
-    [...]
-    The `RegardScore` metric is designed to evaluate the regard levels (positive, negative, neutral, or other) of texts generated by models. This helps in understanding the sentiment and biases in the generated content.
+    ### Purpose
 
-    [...]
+    The `RegardScore` test aims to evaluate the levels of regard (positive, negative, neutral, or other) in texts
+    generated by NLP models. It helps in understanding the sentiment and bias present in the generated content.
 
-    [...]
-    - Noticeable skewness in the histogram, especially when comparing the predicted regard scores with the target regard scores, could indicate biases or inconsistencies in the model.
-    - Lack of neutral scores in the model's predictions, despite a balanced distribution in the target data, might signal an issue.
+    ### Test Mechanism
 
-    [...]
+    This test extracts the true and predicted values from the provided dataset and model. It then computes the regard
+    scores for each text instance using a preloaded `regard` evaluation tool. The scores are compiled into dataframes,
+    and visualizations such as histograms and bar charts are generated to display the distribution of regard scores.
+    Additionally, descriptive statistics (mean, median, standard deviation, minimum, and maximum) are calculated for
+    the regard scores, providing a comprehensive overview of the model's performance.
+
+    ### Signs of High Risk
+
+    - Noticeable skewness in the histogram, especially when comparing the predicted regard scores with the target
+    regard scores, can indicate biases or inconsistencies in the model.
+    - Lack of neutral scores in the model's predictions, despite a balanced distribution in the target data, might
+    signal an issue.
+
+    ### Strengths
+
+    - Provides a clear evaluation of regard levels in generated texts, aiding in ensuring content appropriateness.
+    - Visual representations (histograms and bar charts) make it easier to interpret the distribution and trends of
+    regard scores.
+    - Descriptive statistics offer a concise summary of the model's performance in generating texts with balanced
+    sentiments.
+
+    ### Limitations
 
-    **Limitations:**
     - The accuracy of the regard scores is contingent upon the underlying `regard` tool.
-    - The scores provide a broad overview but do not specify which portions or tokens of the text are responsible for
+    - The scores provide a broad overview but do not specify which portions or tokens of the text are responsible for
+    high regard.
     - Supplementary, in-depth analysis might be needed for granular insights.
     """
 
```
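The mechanism mentions a "preloaded `regard` evaluation tool". One plausible backend is the `regard` measurement from Hugging Face `evaluate`; this is an assumption, since the hunk only changes the docstring. A minimal sketch:

```python
import evaluate

# Assumed scorer; the validmind implementation is not shown in this hunk.
regard = evaluate.load("regard", module_type="measurement")

texts = ["She was highly respected by her colleagues."]
results = regard.compute(data=texts)

# Per the evaluate documentation, each input yields a list of {label, score} entries
# covering the positive / negative / neutral / other categories.
for text, label_scores in zip(texts, results["regard"]):
    top = max(label_scores, key=lambda item: item["score"])
    print(f"{text!r} -> {top['label']} ({top['score']:.2f})")
```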
validmind/tests/model_validation/RegressionResidualsPlot.py

```diff
@@ -16,19 +16,22 @@ class RegressionResidualsPlot(Metric):
     """
     Evaluates regression model performance using residual distribution and actual vs. predicted plots.
 
-    [...]
+    ### Purpose
+
     The `RegressionResidualsPlot` metric aims to evaluate the performance of regression models. By generating and
     analyzing two plots – a distribution of residuals and a scatter plot of actual versus predicted values – this tool
     helps to visually appraise how well the model predicts and the nature of errors it makes.
 
-    [...]
+    ### Test Mechanism
+
     The process begins by extracting the true output values (`y_true`) and the model's predicted values (`y_pred`).
     Residuals are computed by subtracting predicted from true values. These residuals are then visualized using a
     histogram to display their distribution. Additionally, a scatter plot is derived to compare true values against
     predicted values, together with a "Perfect Fit" line, which represents an ideal match (predicted values equal
     actual values), facilitating the assessment of the model's predictive accuracy.
 
-    [...]
+    ### Signs of High Risk
+
     - Residuals showing a non-normal distribution, especially those with frequent extreme values.
     - Significant deviations of predicted values from actual values in the scatter plot.
     - Sparse density of data points near the "Perfect Fit" line in the scatter plot, indicating poor prediction
@@ -36,13 +39,15 @@ class RegressionResidualsPlot(Metric):
     - Visible patterns or trends in the residuals plot, suggesting the model's failure to capture the underlying data
     structure adequately.
 
-    [...]
+    ### Strengths
+
     - Provides a direct, visually intuitive assessment of a regression model’s accuracy and handling of data.
     - Visual plots can highlight issues of underfitting or overfitting.
     - Can reveal systematic deviations or trends that purely numerical metrics might miss.
     - Applicable across various regression model types.
 
-    [...]
+    ### Limitations
+
     - Relies on visual interpretation, which can be subjective and less precise than numerical evaluations.
     - May be difficult to interpret in cases with multi-dimensional outputs due to the plots’ two-dimensional nature.
     - Overlapping data points in the residuals plot can complicate interpretation efforts.
```
validmind/tests/model_validation/RougeScore.py

```diff
@@ -13,44 +13,50 @@ from validmind import tags, tasks
 @tasks("text_classification", "text_summarization")
 def RougeScore(dataset, model, metric="rouge-1"):
     """
-    [...]
+    Assesses the quality of machine-generated text using ROUGE metrics and visualizes the results to provide
+    comprehensive performance insights.
+
+    ### Purpose
+
+    The ROUGE Score test is designed to evaluate the quality of text generated by machine learning models using various
+    ROUGE metrics. ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, measures the overlap of
+    n-grams, word sequences, and word pairs between machine-generated text and reference texts. This evaluation is
+    crucial for tasks like text summarization, machine translation, and text generation, where the goal is to produce
+    text that accurately reflects the content and meaning of human-crafted references.
+
+    ### Test Mechanism
+
+    The test extracts the true and predicted values from the provided dataset and model. It initializes the ROUGE
+    evaluator with the specified metric (e.g., ROUGE-1). For each pair of true and predicted texts, it calculates the
+    ROUGE scores and compiles them into a dataframe. Histograms and bar charts are generated for each ROUGE metric
+    (Precision, Recall, and F1 Score) to visualize their distribution. Additionally, a table of descriptive statistics
+    (mean, median, standard deviation, minimum, and maximum) is compiled for each metric, providing a comprehensive
+    summary of the model's performance.
+
+    ### Signs of High Risk
+
+    - Consistently low scores across ROUGE metrics could indicate poor quality in the generated text, suggesting that
+    the model fails to capture the essential content of the reference texts.
     - Low precision scores might suggest that the generated text contains a lot of redundant or irrelevant information.
     - Low recall scores may indicate that important information from the reference text is being omitted.
-    - An imbalanced performance between precision and recall, reflected by a low F1 Score, could signal issues in the
-    [...]
+    - An imbalanced performance between precision and recall, reflected by a low F1 Score, could signal issues in the
+    model's ability to balance informativeness and conciseness.
 
-    [...]
+    ### Strengths
 
-    - Provides a multifaceted evaluation of text quality through different ROUGE metrics, offering a detailed view of
-    [...]
+    - Provides a multifaceted evaluation of text quality through different ROUGE metrics, offering a detailed view of
+    model performance.
+    - Visual representations (histograms and bar charts) make it easier to interpret the distribution and trends of the
+    scores.
     - Descriptive statistics offer a concise summary of the model's strengths and weaknesses in generating text.
 
-    [...]
+    ### Limitations
 
-    - ROUGE metrics primarily focus on n-gram overlap and may not fully capture semantic coherence, fluency, or
+    - ROUGE metrics primarily focus on n-gram overlap and may not fully capture semantic coherence, fluency, or
+    grammatical quality of the text.
     - The evaluation relies on the availability of high-quality reference texts, which may not always be obtainable.
-    - While useful for comparison, ROUGE scores alone do not provide a complete assessment of a model's performance and
-    [...]
+    - While useful for comparison, ROUGE scores alone do not provide a complete assessment of a model's performance and
+    should be supplemented with other metrics and qualitative analysis.
     """
 
     # Extract true and predicted values
```