validmind 2.5.8__py3-none-any.whl → 2.5.15__py3-none-any.whl
This diff compares publicly available package versions as released to a supported registry. It is provided for informational purposes only and reflects the changes between the two versions as they appear in their respective public registries.
- validmind/__version__.py +1 -1
- validmind/ai/test_descriptions.py +26 -7
- validmind/api_client.py +89 -43
- validmind/client.py +2 -2
- validmind/client_config.py +11 -14
- validmind/datasets/regression/fred_timeseries.py +67 -138
- validmind/template.py +1 -0
- validmind/test_suites/__init__.py +0 -2
- validmind/test_suites/statsmodels_timeseries.py +1 -1
- validmind/test_suites/summarization.py +0 -1
- validmind/test_suites/time_series.py +0 -43
- validmind/tests/__types__.py +3 -13
- validmind/tests/data_validation/ACFandPACFPlot.py +15 -13
- validmind/tests/data_validation/ADF.py +31 -24
- validmind/tests/data_validation/AutoAR.py +9 -9
- validmind/tests/data_validation/AutoMA.py +23 -16
- validmind/tests/data_validation/AutoSeasonality.py +18 -16
- validmind/tests/data_validation/AutoStationarity.py +21 -16
- validmind/tests/data_validation/BivariateScatterPlots.py +67 -96
- validmind/tests/data_validation/ChiSquaredFeaturesTable.py +82 -124
- validmind/tests/data_validation/ClassImbalance.py +15 -12
- validmind/tests/data_validation/DFGLSArch.py +19 -13
- validmind/tests/data_validation/DatasetDescription.py +17 -11
- validmind/tests/data_validation/DatasetSplit.py +7 -5
- validmind/tests/data_validation/DescriptiveStatistics.py +28 -21
- validmind/tests/data_validation/Duplicates.py +33 -25
- validmind/tests/data_validation/EngleGrangerCoint.py +35 -33
- validmind/tests/data_validation/FeatureTargetCorrelationPlot.py +59 -71
- validmind/tests/data_validation/HighCardinality.py +19 -12
- validmind/tests/data_validation/HighPearsonCorrelation.py +27 -22
- validmind/tests/data_validation/IQROutliersBarPlot.py +13 -10
- validmind/tests/data_validation/IQROutliersTable.py +40 -36
- validmind/tests/data_validation/IsolationForestOutliers.py +21 -14
- validmind/tests/data_validation/KPSS.py +34 -29
- validmind/tests/data_validation/LaggedCorrelationHeatmap.py +22 -15
- validmind/tests/data_validation/MissingValues.py +32 -27
- validmind/tests/data_validation/MissingValuesBarPlot.py +25 -21
- validmind/tests/data_validation/PearsonCorrelationMatrix.py +71 -84
- validmind/tests/data_validation/PhillipsPerronArch.py +37 -30
- validmind/tests/data_validation/RollingStatsPlot.py +31 -23
- validmind/tests/data_validation/ScatterPlot.py +63 -78
- validmind/tests/data_validation/SeasonalDecompose.py +38 -34
- validmind/tests/data_validation/Skewness.py +35 -37
- validmind/tests/data_validation/SpreadPlot.py +35 -35
- validmind/tests/data_validation/TabularCategoricalBarPlots.py +23 -17
- validmind/tests/data_validation/TabularDateTimeHistograms.py +21 -13
- validmind/tests/data_validation/TabularDescriptionTables.py +51 -16
- validmind/tests/data_validation/TabularNumericalHistograms.py +25 -22
- validmind/tests/data_validation/TargetRateBarPlots.py +21 -14
- validmind/tests/data_validation/TimeSeriesDescription.py +25 -18
- validmind/tests/data_validation/TimeSeriesDescriptiveStatistics.py +23 -17
- validmind/tests/data_validation/TimeSeriesFrequency.py +24 -17
- validmind/tests/data_validation/TimeSeriesHistogram.py +33 -32
- validmind/tests/data_validation/TimeSeriesLinePlot.py +17 -10
- validmind/tests/data_validation/TimeSeriesMissingValues.py +15 -10
- validmind/tests/data_validation/TimeSeriesOutliers.py +37 -33
- validmind/tests/data_validation/TooManyZeroValues.py +16 -11
- validmind/tests/data_validation/UniqueRows.py +11 -6
- validmind/tests/data_validation/WOEBinPlots.py +23 -16
- validmind/tests/data_validation/WOEBinTable.py +35 -30
- validmind/tests/data_validation/ZivotAndrewsArch.py +34 -28
- validmind/tests/data_validation/nlp/CommonWords.py +21 -14
- validmind/tests/data_validation/nlp/Hashtags.py +27 -20
- validmind/tests/data_validation/nlp/LanguageDetection.py +33 -14
- validmind/tests/data_validation/nlp/Mentions.py +21 -15
- validmind/tests/data_validation/nlp/PolarityAndSubjectivity.py +32 -9
- validmind/tests/data_validation/nlp/Punctuations.py +24 -20
- validmind/tests/data_validation/nlp/Sentiment.py +27 -8
- validmind/tests/data_validation/nlp/StopWords.py +26 -19
- validmind/tests/data_validation/nlp/TextDescription.py +36 -35
- validmind/tests/data_validation/nlp/Toxicity.py +32 -9
- validmind/tests/decorator.py +81 -42
- validmind/tests/model_validation/BertScore.py +36 -27
- validmind/tests/model_validation/BleuScore.py +25 -19
- validmind/tests/model_validation/ClusterSizeDistribution.py +38 -34
- validmind/tests/model_validation/ContextualRecall.py +35 -13
- validmind/tests/model_validation/FeaturesAUC.py +32 -13
- validmind/tests/model_validation/MeteorScore.py +46 -33
- validmind/tests/model_validation/ModelMetadata.py +32 -64
- validmind/tests/model_validation/ModelPredictionResiduals.py +75 -73
- validmind/tests/model_validation/RegardScore.py +30 -14
- validmind/tests/model_validation/RegressionResidualsPlot.py +10 -5
- validmind/tests/model_validation/RougeScore.py +36 -30
- validmind/tests/model_validation/TimeSeriesPredictionWithCI.py +30 -14
- validmind/tests/model_validation/TimeSeriesPredictionsPlot.py +27 -30
- validmind/tests/model_validation/TimeSeriesR2SquareBySegments.py +68 -63
- validmind/tests/model_validation/TokenDisparity.py +31 -23
- validmind/tests/model_validation/ToxicityScore.py +26 -17
- validmind/tests/model_validation/embeddings/ClusterDistribution.py +24 -20
- validmind/tests/model_validation/embeddings/CosineSimilarityComparison.py +30 -27
- validmind/tests/model_validation/embeddings/CosineSimilarityDistribution.py +7 -5
- validmind/tests/model_validation/embeddings/CosineSimilarityHeatmap.py +32 -23
- validmind/tests/model_validation/embeddings/DescriptiveAnalytics.py +7 -5
- validmind/tests/model_validation/embeddings/EmbeddingsVisualization2D.py +15 -11
- validmind/tests/model_validation/embeddings/EuclideanDistanceComparison.py +29 -29
- validmind/tests/model_validation/embeddings/EuclideanDistanceHeatmap.py +34 -25
- validmind/tests/model_validation/embeddings/PCAComponentsPairwisePlots.py +38 -26
- validmind/tests/model_validation/embeddings/StabilityAnalysis.py +40 -1
- validmind/tests/model_validation/embeddings/StabilityAnalysisKeyword.py +18 -17
- validmind/tests/model_validation/embeddings/StabilityAnalysisRandomNoise.py +40 -45
- validmind/tests/model_validation/embeddings/StabilityAnalysisSynonyms.py +17 -19
- validmind/tests/model_validation/embeddings/StabilityAnalysisTranslation.py +29 -25
- validmind/tests/model_validation/embeddings/TSNEComponentsPairwisePlots.py +38 -28
- validmind/tests/model_validation/ragas/AnswerCorrectness.py +5 -4
- validmind/tests/model_validation/ragas/AnswerRelevance.py +5 -4
- validmind/tests/model_validation/ragas/AnswerSimilarity.py +5 -4
- validmind/tests/model_validation/ragas/AspectCritique.py +7 -0
- validmind/tests/model_validation/ragas/ContextEntityRecall.py +9 -8
- validmind/tests/model_validation/ragas/ContextPrecision.py +5 -4
- validmind/tests/model_validation/ragas/ContextRecall.py +5 -4
- validmind/tests/model_validation/ragas/Faithfulness.py +5 -4
- validmind/tests/model_validation/ragas/utils.py +6 -0
- validmind/tests/model_validation/sklearn/AdjustedMutualInformation.py +19 -12
- validmind/tests/model_validation/sklearn/AdjustedRandIndex.py +22 -17
- validmind/tests/model_validation/sklearn/ClassifierPerformance.py +27 -25
- validmind/tests/model_validation/sklearn/ClusterCosineSimilarity.py +7 -5
- validmind/tests/model_validation/sklearn/ClusterPerformance.py +40 -78
- validmind/tests/model_validation/sklearn/ClusterPerformanceMetrics.py +15 -17
- validmind/tests/model_validation/sklearn/CompletenessScore.py +17 -11
- validmind/tests/model_validation/sklearn/ConfusionMatrix.py +22 -15
- validmind/tests/model_validation/sklearn/FeatureImportance.py +95 -0
- validmind/tests/model_validation/sklearn/FowlkesMallowsScore.py +7 -7
- validmind/tests/model_validation/sklearn/HomogeneityScore.py +19 -12
- validmind/tests/model_validation/sklearn/HyperParametersTuning.py +35 -30
- validmind/tests/model_validation/sklearn/KMeansClustersOptimization.py +10 -5
- validmind/tests/model_validation/sklearn/MinimumAccuracy.py +32 -32
- validmind/tests/model_validation/sklearn/MinimumF1Score.py +23 -23
- validmind/tests/model_validation/sklearn/MinimumROCAUCScore.py +15 -10
- validmind/tests/model_validation/sklearn/ModelsPerformanceComparison.py +26 -19
- validmind/tests/model_validation/sklearn/OverfitDiagnosis.py +38 -18
- validmind/tests/model_validation/sklearn/PermutationFeatureImportance.py +31 -25
- validmind/tests/model_validation/sklearn/PopulationStabilityIndex.py +8 -6
- validmind/tests/model_validation/sklearn/PrecisionRecallCurve.py +24 -17
- validmind/tests/model_validation/sklearn/ROCCurve.py +12 -7
- validmind/tests/model_validation/sklearn/RegressionErrors.py +74 -130
- validmind/tests/model_validation/sklearn/RegressionErrorsComparison.py +27 -12
- validmind/tests/model_validation/sklearn/{RegressionModelsPerformanceComparison.py → RegressionPerformance.py} +18 -20
- validmind/tests/model_validation/sklearn/RegressionR2Square.py +55 -93
- validmind/tests/model_validation/sklearn/RegressionR2SquareComparison.py +32 -13
- validmind/tests/model_validation/sklearn/RobustnessDiagnosis.py +36 -32
- validmind/tests/model_validation/sklearn/SHAPGlobalImportance.py +7 -5
- validmind/tests/model_validation/sklearn/SilhouettePlot.py +27 -19
- validmind/tests/model_validation/sklearn/TrainingTestDegradation.py +25 -18
- validmind/tests/model_validation/sklearn/VMeasure.py +14 -13
- validmind/tests/model_validation/sklearn/WeakspotsDiagnosis.py +7 -5
- validmind/tests/model_validation/statsmodels/AutoARIMA.py +24 -18
- validmind/tests/model_validation/statsmodels/BoxPierce.py +14 -10
- validmind/tests/model_validation/statsmodels/CumulativePredictionProbabilities.py +73 -104
- validmind/tests/model_validation/statsmodels/DurbinWatsonTest.py +19 -12
- validmind/tests/model_validation/statsmodels/GINITable.py +44 -77
- validmind/tests/model_validation/statsmodels/JarqueBera.py +27 -22
- validmind/tests/model_validation/statsmodels/KolmogorovSmirnov.py +33 -34
- validmind/tests/model_validation/statsmodels/LJungBox.py +32 -28
- validmind/tests/model_validation/statsmodels/Lilliefors.py +27 -24
- validmind/tests/model_validation/statsmodels/PredictionProbabilitiesHistogram.py +87 -119
- validmind/tests/model_validation/statsmodels/RegressionCoeffs.py +100 -0
- validmind/tests/model_validation/statsmodels/RegressionFeatureSignificance.py +14 -9
- validmind/tests/model_validation/statsmodels/RegressionModelForecastPlot.py +17 -13
- validmind/tests/model_validation/statsmodels/RegressionModelForecastPlotLevels.py +46 -43
- validmind/tests/model_validation/statsmodels/RegressionModelSensitivityPlot.py +38 -36
- validmind/tests/model_validation/statsmodels/RegressionModelSummary.py +30 -28
- validmind/tests/model_validation/statsmodels/RegressionPermutationFeatureImportance.py +18 -11
- validmind/tests/model_validation/statsmodels/RunsTest.py +32 -28
- validmind/tests/model_validation/statsmodels/ScorecardHistogram.py +75 -107
- validmind/tests/model_validation/statsmodels/ShapiroWilk.py +15 -8
- validmind/tests/ongoing_monitoring/FeatureDrift.py +10 -6
- validmind/tests/ongoing_monitoring/PredictionAcrossEachFeature.py +31 -25
- validmind/tests/ongoing_monitoring/PredictionCorrelation.py +29 -21
- validmind/tests/ongoing_monitoring/TargetPredictionDistributionPlot.py +31 -23
- validmind/tests/prompt_validation/Bias.py +14 -11
- validmind/tests/prompt_validation/Clarity.py +16 -14
- validmind/tests/prompt_validation/Conciseness.py +7 -5
- validmind/tests/prompt_validation/Delimitation.py +23 -22
- validmind/tests/prompt_validation/NegativeInstruction.py +7 -5
- validmind/tests/prompt_validation/Robustness.py +12 -10
- validmind/tests/prompt_validation/Specificity.py +13 -11
- validmind/tests/prompt_validation/ai_powered_test.py +6 -0
- validmind/tests/run.py +68 -23
- validmind/unit_metrics/__init__.py +81 -144
- validmind/unit_metrics/classification/{sklearn/Accuracy.py → Accuracy.py} +1 -1
- validmind/unit_metrics/classification/{sklearn/F1.py → F1.py} +1 -1
- validmind/unit_metrics/classification/{sklearn/Precision.py → Precision.py} +1 -1
- validmind/unit_metrics/classification/{sklearn/ROC_AUC.py → ROC_AUC.py} +1 -2
- validmind/unit_metrics/classification/{sklearn/Recall.py → Recall.py} +1 -1
- validmind/unit_metrics/regression/{sklearn/AdjustedRSquaredScore.py → AdjustedRSquaredScore.py} +1 -1
- validmind/unit_metrics/regression/GiniCoefficient.py +1 -1
- validmind/unit_metrics/regression/HuberLoss.py +1 -1
- validmind/unit_metrics/regression/KolmogorovSmirnovStatistic.py +1 -1
- validmind/unit_metrics/regression/{sklearn/MeanAbsoluteError.py → MeanAbsoluteError.py} +1 -1
- validmind/unit_metrics/regression/MeanAbsolutePercentageError.py +1 -1
- validmind/unit_metrics/regression/MeanBiasDeviation.py +1 -1
- validmind/unit_metrics/regression/{sklearn/MeanSquaredError.py → MeanSquaredError.py} +1 -1
- validmind/unit_metrics/regression/QuantileLoss.py +1 -1
- validmind/unit_metrics/regression/{sklearn/RSquaredScore.py → RSquaredScore.py} +1 -1
- validmind/unit_metrics/regression/{sklearn/RootMeanSquaredError.py → RootMeanSquaredError.py} +1 -1
- validmind/vm_models/dataset/dataset.py +2 -0
- validmind/vm_models/figure.py +5 -0
- validmind/vm_models/test/result_wrapper.py +93 -132
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/METADATA +1 -1
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/RECORD +203 -210
- validmind/tests/data_validation/ANOVAOneWayTable.py +0 -138
- validmind/tests/data_validation/BivariateFeaturesBarPlots.py +0 -142
- validmind/tests/data_validation/BivariateHistograms.py +0 -117
- validmind/tests/data_validation/HeatmapFeatureCorrelations.py +0 -124
- validmind/tests/data_validation/MissingValuesRisk.py +0 -88
- validmind/tests/model_validation/ModelMetadataComparison.py +0 -59
- validmind/tests/model_validation/sklearn/FeatureImportanceComparison.py +0 -83
- validmind/tests/model_validation/statsmodels/RegressionCoeffsPlot.py +0 -135
- validmind/tests/model_validation/statsmodels/RegressionModelsCoeffs.py +0 -103
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/LICENSE +0 -0
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/WHEEL +0 -0
- {validmind-2.5.8.dist-info → validmind-2.5.15.dist-info}/entry_points.txt +0 -0
validmind/tests/model_validation/sklearn/FeatureImportance.py (new file)
@@ -0,0 +1,95 @@
+# Copyright © 2023-2024 ValidMind Inc. All rights reserved.
+# See the LICENSE file in the root of this repository for details.
+# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
+
+import pandas as pd
+from sklearn.inspection import permutation_importance
+
+from validmind import tags, tasks
+
+
+@tags("model_explainability", "sklearn")
+@tasks("regression", "time_series_forecasting")
+def FeatureImportance(dataset, model, num_features=3):
+    """
+    Compute feature importance scores for a given model and generate a summary table
+    with the top important features.
+
+    ### Purpose
+
+    The Feature Importance Comparison test is designed to compare the feature importance scores for different models
+    when applied to various datasets. By doing so, it aims to identify the most impactful features and assess the
+    consistency of feature importance across models.
+
+    ### Test Mechanism
+
+    This test works by iterating through each dataset-model pair and calculating permutation feature importance (PFI)
+    scores. It then generates a summary table containing the top `num_features` important features for each model. The
+    process involves:
+
+    - Extracting features and target data from each dataset.
+    - Computing PFI scores using `sklearn.inspection.permutation_importance`.
+    - Sorting and selecting the top features based on their importance scores.
+    - Compiling these features into a summary table for comparison.
+
+    ### Signs of High Risk
+
+    - Key features expected to be important are ranked low, indicating potential issues with model training or data
+    quality.
+    - High variance in feature importance scores across different models, suggesting instability in feature selection.
+
+    ### Strengths
+
+    - Provides a clear comparison of the most important features for each model.
+    - Uses permutation importance, which is a model-agnostic method and can be applied to any estimator.
+
+    ### Limitations
+
+    - Assumes that the dataset is provided as a DataFrameDataset object with `x_df` and `y_df` methods to access
+    feature and target data.
+    - Requires that `model.model` is compatible with `sklearn.inspection.permutation_importance`.
+    - The function's output is dependent on the number of features specified by `num_features`, which defaults to 3 but
+    can be adjusted.
+    """
+    results_list = []
+
+    x = dataset.x_df()
+    y = dataset.y_df()
+
+    pfi_values = permutation_importance(
+        model.model,
+        x,
+        y,
+        random_state=0,
+        n_jobs=-2,
+    )
+
+    # Create a dictionary to store PFI scores
+    pfi = {
+        column: pfi_values["importances_mean"][i] for i, column in enumerate(x.columns)
+    }
+
+    # Sort features by their importance
+    sorted_features = sorted(pfi.items(), key=lambda item: item[1], reverse=True)
+
+    # Extract the top `num_features` features
+    top_features = sorted_features[:num_features]
+
+    # Prepare the result for the current model and dataset
+    result = {}
+
+    # Dynamically add feature columns to the result
+    for i in range(num_features):
+        if i < len(top_features):
+            result[f"Feature {i + 1}"] = (
+                f"[{top_features[i][0]}; {top_features[i][1]:.4f}]"
+            )
+        else:
+            result[f"Feature {i + 1}"] = None
+
+    # Append the result to the list
+    results_list.append(result)
+
+    # Convert the results list to a DataFrame
+    results_df = pd.DataFrame(results_list)
+    return results_df
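Note: the ranking logic of the new test above can be reproduced with plain scikit-learn. The snippet below is an illustrative sketch only; the toy dataset and `RandomForestRegressor` stand in for ValidMind's dataset/model wrappers and are not part of the package.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Toy stand-ins for dataset.x_df(), dataset.y_df(), and model.model
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])
estimator = RandomForestRegressor(random_state=0).fit(X, y)

# Mean drop in score per feature when its values are permuted
pfi = permutation_importance(estimator, X, y, random_state=0, n_jobs=-2)
scores = dict(zip(X.columns, pfi["importances_mean"]))

# Top-3 features, formatted the way the test tabulates them: "[name; score]"
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
row = {f"Feature {i + 1}": f"[{name}; {s:.4f}]" for i, (name, s) in enumerate(top)}
print(pd.DataFrame([row]))
```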
validmind/tests/model_validation/sklearn/FowlkesMallowsScore.py
@@ -15,27 +15,27 @@ class FowlkesMallowsScore(ClusterPerformance):
     Evaluates the similarity between predicted and actual cluster assignments in a model using the Fowlkes-Mallows
     score.

-
+    ### Purpose

     The FowlkesMallowsScore is a performance metric used to validate clustering algorithms within machine learning
     models. The score intends to evaluate the matching grade between two clusters. It measures the similarity between
     the predicted and actual cluster assignments, thus gauging the accuracy of the model's clustering capability.

-
+    ### Test Mechanism

     The FowlkesMallowsScore method applies the `fowlkes_mallows_score` function from the `sklearn` library to evaluate
     the model's accuracy in clustering different types of data. The test fetches the datasets from the model's training
     and testing datasets as inputs then compares the resulting clusters against the previously known clusters to obtain
     a score. A high score indicates a better clustering performance by the model.

-
+    ### Signs of High Risk

     - A low Fowlkes-Mallows score (near zero): This indicates that the model's clustering capability is poor and the
     algorithm isn't properly grouping data.
-    - Inconsistently low scores across different datasets:
+    - Inconsistently low scores across different datasets: This may indicate that the model's clustering performance is
     not robust and the model may fail when applied to unseen data.

-
+    ### Strengths

     - The Fowlkes-Mallows score is a simple and effective method for evaluating the performance of clustering
     algorithms.
@@ -43,7 +43,7 @@ class FowlkesMallowsScore(ClusterPerformance):
     comprehensive measure of model performance.
     - The Fowlkes-Mallows score is non-biased meaning it treats False Positives and False Negatives equally.

-
+    ### Limitations

     - As a pairwise-based method, this score can be computationally intensive for large datasets and can become
     unfeasible as the size of the dataset increases.
@@ -54,7 +54,7 @@ class FowlkesMallowsScore(ClusterPerformance):
     """

     name = "fowlkes_mallows_score"
-    required_inputs = ["model", "
+    required_inputs = ["model", "dataset"]
     tasks = ["clustering"]
     tags = [
         "sklearn",
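Note: the metric this test wraps is `sklearn.metrics.fowlkes_mallows_score`. A minimal standalone sketch with made-up labels:

```python
from sklearn.metrics import fowlkes_mallows_score

# Toy ground-truth classes vs. predicted cluster assignments
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]  # same grouping, different label names

# 1.0 means identical groupings; the score ignores the label names themselves
print(fowlkes_mallows_score(labels_true, labels_pred))  # 1.0
```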
validmind/tests/model_validation/sklearn/HomogeneityScore.py
@@ -15,29 +15,36 @@ class HomogeneityScore(ClusterPerformance):
     Assesses clustering homogeneity by comparing true and predicted labels, scoring from 0 (heterogeneous) to 1
     (homogeneous).

-
-    clusters formed by a machine learning model. In simple terms, a clustering result satisfies homogeneity if all of
-    its clusters contain only points which are members of a single class.
+    ### Purpose

-
-
-
-
+    The Homogeneity Score encapsulated in this performance test is used to measure the homogeneity of the clusters
+    formed by a machine learning model. In simple terms, a clustering result satisfies homogeneity if all of its
+    clusters contain only points which are members of a single class.
+
+    ### Test Mechanism
+
+    This test uses the `homogeneity_score` function from the `sklearn.metrics` library to compare the ground truth
+    class labels of the training and testing sets with the labels predicted by the given model. The returned score is a
+    metric of the clustering accuracy, and ranges from 0.0 to 1.0, with 1.0 denoting the highest possible degree of
+    homogeneity.
+
+    ### Signs of High Risk

-    **Signs of High Risk**:
     - A score close to 0: This denotes that clusters are highly heterogenous and points within the same cluster might
     not belong to the same class.
     - A significantly lower score for testing data compared to the score for training data: This can indicate
     overfitting, where the model has learned to perfectly match the training data but fails to perform well on unseen
     data.

-
+    ### Strengths
+
     - It provides a simple quantitative measure of the degree to which clusters contain points from only one class.
-    - Useful for validating clustering solutions where the ground truth
+    - Useful for validating clustering solutions where the ground truth — class membership of points — is known.
     - It's agnostic to the absolute labels, and cares only that the points within the same cluster have the same class
     label.

-
+    ### Limitations
+
     - The Homogeneity Score is not useful for clustering solutions where the ground truth labels are not known.
     - It doesn’t work well with differently sized clusters since it gives predominance to larger clusters.
     - The score does not address the actual number of clusters formed, or the evenness of cluster sizes. It only checks
@@ -45,7 +52,7 @@ class HomogeneityScore(ClusterPerformance):
     """

     name = "homogeneity_score"
-    required_inputs = ["model", "
+    required_inputs = ["model", "dataset"]
     tasks = ["clustering"]
     tags = [
         "sklearn",
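Note: the underlying call is `sklearn.metrics.homogeneity_score`, which is invariant to the cluster label names themselves. A toy sketch:

```python
from sklearn.metrics import homogeneity_score

# Each cluster contains a single class -> fully homogeneous (1.0),
# regardless of which integer names the clusters use
print(homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0

# One cluster mixing both classes carries no class information -> 0.0
print(homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0]))  # 0.0
```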
validmind/tests/model_validation/sklearn/HyperParametersTuning.py
@@ -16,37 +16,42 @@ class HyperParametersTuning(Metric):
     """
     Exerts exhaustive grid search to identify optimal hyperparameters for the model, improving performance.

-
-
-
-    the
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+    ### Purpose:
+
+    The "HyperParametersTuning" metric aims to find the optimal set of hyperparameters for a given model. The test is
+    designed to enhance the performance of the model by determining the best configuration of hyperparameters. The
+    parameters that are being optimized are defined by the parameter grid provided to the metric.
+
+    ### Test Mechanism:
+
+    The HyperParametersTuning test employs a grid search mechanism using the GridSearchCV function from the
+    scikit-learn library. The grid search algorithm systematically works through multiple combinations of parameter
+    values, cross-validating to determine which combination gives the best model performance. The chosen model and the
+    parameter grid passed for tuning are necessary inputs. Once the grid search is complete, the test caches and
+    returns details of the best model and its associated parameters.
+
+    ### Signs of High Risk:
+
+    - The test raises a SkipTestError if the param_grid is not supplied, indicating a lack of specific parameters to
+    optimize, which can be risky for certain model types reliant on parameter tuning.
+    - Poorly chosen scoring metrics that do not align well with the specific model or problem at hand could reflect
+    potential risks or failures in achieving optimal performance.
+
+    ### Strengths:
+
+    - Provides a comprehensive exploration mechanism to identify the best set of hyperparameters for the supplied
+    model, thereby enhancing its performance.
+    - Implements GridSearchCV, simplifying and automating the time-consuming task of hyperparameter tuning.
+
+    ### Limitations:
+
+    - The grid search algorithm can be computationally expensive, especially with large datasets or complex models, and
+    can be time-consuming as it tests all possible combinations within the specified parameter grid.
+    - The effectiveness of the tuning is heavily dependent on the quality of data and only accepts datasets with
     numerical or ordered categories.
-
-
-    - There
+    - Assumes that the same set of hyperparameters is optimal for all problem sets, which may not be true in every
+    scenario.
+    - There's a potential risk of overfitting the model if the training set is not representative of the data that the
     model will be applied to.
     """

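Note: the grid search described in the new docstring is standard `GridSearchCV` usage. A minimal sketch, assuming a toy SVC model and parameter grid rather than the package's own inputs:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Exhaustive, cross-validated search over every combination in the grid
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

print(search.best_params_)  # best hyperparameter combination found
print(search.best_score_)   # its mean cross-validated accuracy
```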
validmind/tests/model_validation/sklearn/KMeansClustersOptimization.py
@@ -19,13 +19,15 @@ class KMeansClustersOptimization(Metric):
     """
     Optimizes the number of clusters in K-means models using Elbow and Silhouette methods.

-
+    ### Purpose
+
     This metric is used to optimize the number of clusters used in K-means clustering models. It intends to measure and
     evaluate the optimal number of clusters by leveraging two methodologies, namely the Elbow method and the Silhouette
     method. This is crucial as an inappropriate number of clusters can either overly simplify or overcomplicate the
     structure of the data, thereby undermining the effectiveness of the model.

-
+    ### Test Mechanism
+
     The test mechanism involves iterating over a predefined range of cluster numbers and applying both the Elbow method
     and the Silhouette method. The Elbow method computes the sum of the minimum euclidean distances between data points
     and their respective cluster centers (distortion). This value decreases as the number of clusters increases; the
@@ -35,20 +37,23 @@ class KMeansClustersOptimization(Metric):
     of clusters under this method is the one that maximizes the average silhouette score. The results of both methods
     are plotted for visual inspection.

-
+    ### Signs of High Risk
+
     - A high distortion value or a low silhouette average score for the optimal number of clusters.
     - No clear 'elbow' point or plateau observed in the distortion plot, or a uniformly low silhouette average score
     across different numbers of clusters, suggesting the data is not amenable to clustering.
     - An optimal cluster number that is unreasonably high or low, suggestive of overfitting or underfitting,
     respectively.

-
+    ### Strengths
+
     - Provides both a visual and quantitative method to determine the optimal number of clusters.
     - Leverages two different methods (Elbow and Silhouette), thereby affording robustness and versatility in assessing
     the data's clusterability.
     - Facilitates improved model performance by allowing for an informed selection of the number of clusters.

-
+    ### Limitations
+
     - Assumes that a suitable number of clusters exists in the data, which may not always be true, especially for
     complex or noisy data.
     - Both methods may fail to provide definitive answers when the data lacks clear cluster structures.
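Note: the Elbow and Silhouette computations described above can be sketched with scikit-learn as follows; the blob data and cluster range are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow method: distortion (sum of squared distances to the nearest
    # center) drops as k grows; the "elbow" marks the candidate optimum
    distortion = km.inertia_
    # Silhouette method: pick the k that maximizes the average silhouette
    silhouette = silhouette_score(X, km.labels_)
    print(f"k={k}  distortion={distortion:8.1f}  silhouette={silhouette:.3f}")
```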
validmind/tests/model_validation/sklearn/MinimumAccuracy.py
@@ -22,38 +22,38 @@ class MinimumAccuracy(ThresholdTest):
     """
     Checks if the model's prediction accuracy meets or surpasses a specified threshold.

-
-
-
-
-
-
-
-
-
-    threshold
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-    positives or false negatives.
-
+    ### Purpose
+
+    The Minimum Accuracy test’s objective is to verify whether the model's prediction accuracy on a specific dataset
+    meets or surpasses a predetermined minimum threshold. Accuracy, which is simply the ratio of correct predictions to
+    total predictions, is a key metric for evaluating the model's performance. Considering binary as well as multiclass
+    classifications, accurate labeling becomes indispensable.
+
+    ### Test Mechanism
+
+    The test mechanism involves contrasting the model's accuracy score with a preset minimum threshold value, with the
+    default being 0.7. The accuracy score is computed utilizing sklearn’s `accuracy_score` method, where the true
+    labels `y_true` and predicted labels `class_pred` are compared. If the accuracy score is above the threshold, the
+    test receives a passing mark. The test returns the result along with the accuracy score and threshold used for the
+    test.
+
+    ### Signs of High Risk
+
+    - Model fails to achieve or surpass the predefined score threshold.
+    - Persistent scores below the threshold, indicating a high risk of inaccurate predictions.
+
+    ### Strengths
+
+    - Simplicity, presenting a straightforward measure of holistic model performance across all classes.
+    - Particularly advantageous when classes are balanced.
+    - Versatile, as it can be implemented on both binary and multiclass classification tasks.
+
+    ### Limitations
+
+    - Misleading accuracy scores when classes in the dataset are highly imbalanced.
+    - Favoritism towards the majority class, giving an inaccurate perception of model performance.
+    - Inability to measure the model's precision, recall, or capacity to manage false positives or false negatives.
+    - Focused on overall correctness and may not be sufficient for all types of model analytics.
     """

     name = "accuracy_score"
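Note: the pass/fail check described in the new docstring reduces to an `accuracy_score` comparison. An illustrative sketch with toy labels and the documented 0.7 default threshold:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 1]      # toy ground-truth labels
class_pred = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]  # toy model predictions

score = accuracy_score(y_true, class_pred)  # ratio of correct predictions (0.8 here)
print(f"accuracy={score:.2f}, threshold=0.7, passed={score > 0.7}")
```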
validmind/tests/model_validation/sklearn/MinimumF1Score.py
@@ -21,42 +21,42 @@ from validmind.vm_models import (
 @dataclass
 class MinimumF1Score(ThresholdTest):
     """
-
+    Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced
+    performance between precision and recall.
+
+    ### Purpose

-    **Purpose:**
     The main objective of this test is to ensure that the F1 score, a balanced measure of precision and recall, of the
     model meets or surpasses a predefined threshold on the validation dataset. The F1 score is highly useful for
     gauging model performance in classification tasks, especially in cases where the distribution of positive and
     negative classes is skewed.

-
-
-
-
-    calculation is used. The obtained F1 score is then assessed against the
-    expected from the model.
+    ### Test Mechanism
+
+    The F1 score for the validation dataset is computed through scikit-learn's metrics in Python. The scoring mechanism
+    differs based on the classification problem: for multi-class problems, macro averaging is used, and for binary
+    classification, the built-in `f1_score` calculation is used. The obtained F1 score is then assessed against the
+    predefined minimum F1 score that is expected from the model.

-
+    ### Signs of High Risk

     - If a model returns an F1 score that is less than the established threshold, it is regarded as high risk.
-    - A low F1 score might suggest that the model is not finding an optimal balance between precision and recall,
-
+    - A low F1 score might suggest that the model is not finding an optimal balance between precision and recall,
+    failing to effectively identify positive classes while minimizing false positives.

-
+    ### Strengths

-
-
-
-    misleading.
-    - The flexibility of setting the threshold value allows for tailoring the minimum acceptable performance.
+    - Provides a balanced measure of a model's performance by accounting for both false positives and false negatives.
+    - Particularly advantageous in scenarios with imbalanced class distribution, where accuracy can be misleading.
+    - Flexibility in setting the threshold value allows tailored minimum acceptable performance standards.

-
+    ### Limitations

-
-
-
-
-    closely with
+    - May not be suitable for all types of models and machine learning tasks.
+    - The F1 score assumes an equal cost for false positives and false negatives, which may not be true in some
+    real-world scenarios.
+    - Practitioners might need to rely on other metrics such as precision, recall, or the ROC-AUC score that align more
+    closely with specific requirements.
     """

     name = "f1_score"
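Note: the scoring split described above (plain `f1_score` for binary, macro averaging for multi-class) looks like this in scikit-learn; the labels are toy values:

```python
from sklearn.metrics import f1_score

# Binary classification: the plain f1_score is compared to the threshold
print(f1_score([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))  # 0.8

# Multi-class classification: macro averaging weights every class equally
print(f1_score([0, 1, 2, 2, 1], [0, 2, 2, 2, 1], average="macro"))
```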
validmind/tests/model_validation/sklearn/MinimumROCAUCScore.py
@@ -23,32 +23,37 @@ class MinimumROCAUCScore(ThresholdTest):
     """
     Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.

-
-
-
-
-
+    ### Purpose
+
+    The Minimum ROC AUC Score test is used to determine the model's performance by ensuring that the Receiver Operating
+    Characteristic Area Under the Curve (ROC AUC) score on the validation dataset meets or exceeds a predefined
+    threshold. The ROC AUC score indicates how well the model can distinguish between different classes, making it a
+    crucial measure in binary and multiclass classification tasks.
+
+    ### Test Mechanism

-    **Test Mechanism**:
     This test implementation calculates the multiclass ROC AUC score on the true target values and the model's
-
+    predictions. The test converts the multi-class target variables into binary format using `LabelBinarizer` before
     computing the score. If this ROC AUC score is higher than the predefined threshold (defaulted to 0.5), the test
     passes; otherwise, it fails. The results, including the ROC AUC score, the threshold, and whether the test passed
     or failed, are then stored in a `ThresholdTestResult` object.

-
+    ### Signs of High Risk
+
     - A high risk or failure in the model's performance as related to this metric would be represented by a low ROC AUC
     score, specifically any score lower than the predefined minimum threshold. This suggests that the model is
     struggling to distinguish between different classes effectively.

-
+    ### Strengths
+
     - The test considers both the true positive rate and false positive rate, providing a comprehensive performance
     measure.
     - ROC AUC score is threshold-independent meaning it measures the model's quality across various classification
     thresholds.
     - Works robustly with binary as well as multi-class classification problems.

-
+    ### Limitations
+
     - ROC AUC may not be useful if the class distribution is highly imbalanced; it could perform well in terms of AUC
     but still fail to predict the minority class.
     - The test does not provide insight into what specific aspects of the model are causing poor performance if the ROC
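Note: the `LabelBinarizer` plus `roc_auc_score` combination described in the docstring can be sketched as follows with made-up probabilities; the 0.5 threshold mirrors the documented default:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer

y_true = np.array([0, 1, 2, 1, 0, 2])  # toy multi-class targets
y_prob = np.array([                    # toy per-class predicted probabilities
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# Binarize targets so each class gets a one-vs-rest column, then average the AUCs
y_bin = LabelBinarizer().fit_transform(y_true)
score = roc_auc_score(y_bin, y_prob)
print(f"roc_auc={score:.3f}, passed={score > 0.5}")
```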
validmind/tests/model_validation/sklearn/ModelsPerformanceComparison.py
@@ -19,33 +19,40 @@ class ModelsPerformanceComparison(ClassifierPerformance):
     Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,
     precision, recall, and F1 score.

-
-
-
-
-
-
-
-
-
-
-
-
+    ### Purpose
+
+    The Models Performance Comparison test aims to evaluate and compare the performance of various Machine Learning
+    models using test data. It employs multiple metrics such as accuracy, precision, recall, and the F1 score, among
+    others, to assess model performance and assist in selecting the most effective model for the designated task.
+
+    ### Test Mechanism
+
+    The test employs Scikit-learn’s performance metrics to evaluate each model's performance for both binary and
+    multiclass classification tasks. To compare performances, the test runs each model against the test dataset, then
+    produces a comprehensive classification report. This report includes metrics such as accuracy, precision, recall,
+    and the F1 score. Based on whether the task at hand is binary or multiclass classification, it calculates metrics
+    for all the classes and their weighted averages, macro averages, and per-class metrics. The test will be skipped if
+    no models are supplied.
+
+    ### Signs of High Risk
+
     - Low scores in accuracy, precision, recall, and F1 metrics indicate a potentially high risk.
     - A low area under the Receiver Operating Characteristic (ROC) curve (roc_auc score) is another possible indicator
     of high risk.
     - If the metrics scores are significantly lower than alternative models, this might suggest a high risk of failure.

-
-
-
-
+    ### Strengths
+
+    - Provides a simple way to compare the performance of multiple models, accommodating both binary and multiclass
+    classification tasks.
+    - Offers a holistic view of model performance through a comprehensive report of key performance metrics.
     - The inclusion of the ROC AUC score is advantageous, as this robust performance metric can effectively handle
     class imbalance issues.

-
-
-
+    ### Limitations
+
+    - May not be suitable for more complex performance evaluations that consider factors such as prediction speed,
+    computational cost, or business-specific constraints.
     - The test's reliability depends on the provided test dataset; hence, the selected models' performance could vary
     with unseen data or changes in the data distribution.
     - The ROC AUC score might not be as meaningful or easily interpretable for multilabel/multiclass tasks.