@zigrivers/scaffold 3.7.0 → 3.9.0
- package/README.md +113 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/library/library-api-design.md +306 -0
- package/content/knowledge/library/library-architecture.md +247 -0
- package/content/knowledge/library/library-bundling.md +244 -0
- package/content/knowledge/library/library-conventions.md +229 -0
- package/content/knowledge/library/library-dev-environment.md +220 -0
- package/content/knowledge/library/library-documentation.md +300 -0
- package/content/knowledge/library/library-project-structure.md +237 -0
- package/content/knowledge/library/library-requirements.md +173 -0
- package/content/knowledge/library/library-security.md +257 -0
- package/content/knowledge/library/library-testing.md +319 -0
- package/content/knowledge/library/library-type-definitions.md +284 -0
- package/content/knowledge/library/library-versioning.md +300 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/knowledge/mobile-app/mobile-app-architecture.md +283 -0
- package/content/knowledge/mobile-app/mobile-app-conventions.md +180 -0
- package/content/knowledge/mobile-app/mobile-app-deployment.md +298 -0
- package/content/knowledge/mobile-app/mobile-app-dev-environment.md +257 -0
- package/content/knowledge/mobile-app/mobile-app-distribution.md +264 -0
- package/content/knowledge/mobile-app/mobile-app-observability.md +317 -0
- package/content/knowledge/mobile-app/mobile-app-offline-patterns.md +311 -0
- package/content/knowledge/mobile-app/mobile-app-project-structure.md +245 -0
- package/content/knowledge/mobile-app/mobile-app-push-notifications.md +321 -0
- package/content/knowledge/mobile-app/mobile-app-requirements.md +147 -0
- package/content/knowledge/mobile-app/mobile-app-security.md +338 -0
- package/content/knowledge/mobile-app/mobile-app-testing.md +400 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/library-overlay.yml +67 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/content/methodology/mobile-app-overlay.yml +71 -0
- package/dist/cli/commands/init.d.ts +22 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +202 -3
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +190 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +1456 -80
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +87 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +312 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +55 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -1
- package/dist/e2e/project-type-overlays.test.d.ts.map +1 -1
- package/dist/e2e/project-type-overlays.test.js +780 -14
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +16 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +28 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +127 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +224 -4
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +22 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +28 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
|
@@ -0,0 +1,256 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: ml-model-evaluation
|
|
3
|
+
description: Train/val/test splits, cross-validation, metrics by task type, holdout sets, and slice analysis for thorough model evaluation
|
|
4
|
+
topics: [ml, evaluation, train-test-split, cross-validation, metrics, holdout, slice-analysis]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Model evaluation is the difference between knowing whether your model works and believing it works. Most ML evaluation bugs are forms of data leakage: the model has seen information during training that it would not have at inference time, making offline metrics look better than production performance. Rigorous evaluation requires careful data splitting, leak-free preprocessing, appropriate metrics for the task, and systematic analysis of where the model fails.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Split data into train, validation, and test sets — use the test set exactly once. For small datasets, use cross-validation. Choose metrics appropriate to the task: classification, regression, ranking, or generation have different canonical metrics. Analyse model performance by meaningful slices (demographic groups, difficulty levels, data subsets) — aggregate metrics hide subgroup failures. Log evaluation results with experiment metadata for longitudinal comparison.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Data Splitting Principles
|
|
16
|
+
|
|
17
|
+
**Three-way split**: train (model learning), validation (hyperparameter tuning and early stopping), test (final unbiased evaluation):
|
|
18
|
+
|
|
19
|
+
```python
import pandas as pd
from sklearn.model_selection import train_test_split

def create_splits(
    df: pd.DataFrame,
    val_fraction: float = 0.1,
    test_fraction: float = 0.1,
    seed: int = 42,
    stratify_col: str | None = None,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Create reproducible train/val/test splits."""
    stratify = df[stratify_col] if stratify_col else None

    # First carve off the test set.
    train_val, test = train_test_split(
        df,
        test_size=test_fraction,
        random_state=seed,
        stratify=stratify,
    )

    # Rescale the validation fraction so it is relative to the remaining rows.
    val_size_adjusted = val_fraction / (1 - test_fraction)
    stratify_tv = train_val[stratify_col] if stratify_col else None

    train, val = train_test_split(
        train_val,
        test_size=val_size_adjusted,
        random_state=seed,
        stratify=stratify_tv,
    )

    return train, val, test
```
|
|
51
|
+
|
|
52
|
+
**Critical splitting rules**:
|
|
53
|
+
1. **Split before preprocessing**: Fit preprocessing (scalers, encoders, imputers, tokenizer vocabularies) on training data only, then apply it to val/test. Fitting on the combined dataset is data leakage.
|
|
54
|
+
2. **Stratify by label for classification**: Ensures class distribution is preserved in each split.
|
|
55
|
+
3. **Split by entity, not row, for grouped data**: If you have multiple rows per user, all rows for a user must go to the same split. Row-level splitting leaks user-level information.
|
|
56
|
+
4. **Temporal split for time-series**: Train on past, validate and test on future. Random splits would leak future information.
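
Rules 1 and 3 can be sketched together as follows. This is a minimal illustration, assuming scikit-learn's `GroupShuffleSplit` and `StandardScaler`; the column names (`user_id`, `amount`) and the toy data are hypothetical and not part of the original guidance.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 20 users with 5 rows each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "user_id": np.repeat(np.arange(20), 5),
    "amount": rng.normal(100.0, 25.0, size=100),
})

# Rule 3: split by entity so that no user appears in both splits.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["user_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Rule 1: fit the scaler on training rows only, then apply it to the test rows.
scaler = StandardScaler().fit(train[["amount"]])
train_scaled = scaler.transform(train[["amount"]])
test_scaled = scaler.transform(test[["amount"]])

# Sanity check: the two splits share no users.
shared_users = set(train["user_id"]) & set(test["user_id"])
```

The scaled test values will generally not have zero mean, which is expected: the test set is transformed with statistics it did not contribute to.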
|
|
57
|
+
|
|
58
|
+
### Temporal Splits
|
|
59
|
+
|
|
60
|
+
For any dataset with a time dimension, always split by time:
|
|
61
|
+
|
|
62
|
+
```python
import pandas as pd

def temporal_split(
    df: pd.DataFrame,
    timestamp_col: str,
    val_start: str,
    test_start: str,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Create temporal splits: train/val/test defined by date boundaries."""
    df = df.sort_values(timestamp_col)
    train = df[df[timestamp_col] < val_start]
    val = df[(df[timestamp_col] >= val_start) & (df[timestamp_col] < test_start)]
    test = df[df[timestamp_col] >= test_start]
    return train, val, test
```
|
|
76
|
+
|
|
77
|
+
**Backtesting** extends temporal evaluation by simulating deployment across multiple time windows: it checks that a model trained on one period still performs well on the periods that follow.
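
One way to generate backtesting windows is an expanding-window scheme, sketched below under stated assumptions: the helper name `backtest_windows`, the `ts` column, and the boundary dates are all hypothetical. Each window trains on everything before a boundary and tests on the span up to the next boundary.

```python
import pandas as pd

def backtest_windows(df, timestamp_col, boundaries):
    """Yield (train, test) pairs for consecutive boundaries.

    train: all rows strictly before boundaries[i]
    test:  rows in [boundaries[i], boundaries[i + 1])
    """
    df = df.sort_values(timestamp_col)
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        train = df[df[timestamp_col] < start]
        test = df[(df[timestamp_col] >= start) & (df[timestamp_col] < end)]
        yield train, test

# Hypothetical toy data: one row per day for 120 days.
df = pd.DataFrame({"ts": pd.date_range("2024-01-01", periods=120, freq="D")})
boundaries = [
    pd.Timestamp("2024-02-01"),
    pd.Timestamp("2024-03-01"),
    pd.Timestamp("2024-04-01"),
]
windows = list(backtest_windows(df, "ts", boundaries))
```

Training and evaluating a fresh model on each pair gives one metric per window, so a model that only works for one particular period shows up as a drop in later windows.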
|
|
78
|
+
|
|
79
|
+
### Cross-Validation
|
|
80
|
+
|
|
81
|
+
Use k-fold cross-validation when the dataset is too small for a stable held-out set (as a rough rule of thumb, fewer than 10,000 examples):
|
|
82
|
+
|
|
83
|
+
```python
from sklearn.model_selection import StratifiedKFold
import numpy as np

def cross_validate(
    X: np.ndarray,
    y: np.ndarray,
    model_builder,
    n_folds: int = 5,
    seed: int = 42,
) -> dict[str, dict[str, float]]:
    """Stratified k-fold cross-validation.

    `model_builder` returns a fresh, unfitted model on each call.
    `evaluate_model` is assumed to be defined elsewhere and to return
    a dict mapping metric name to a float.
    """
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_metrics = []

    for train_idx, val_idx in skf.split(X, y):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        model = model_builder()
        model.fit(X_train, y_train)
        metrics = evaluate_model(model, X_val, y_val)
        fold_metrics.append(metrics)

    # Aggregate across folds: mean and standard deviation per metric
    return {
        metric: {
            "mean": float(np.mean([m[metric] for m in fold_metrics])),
            "std": float(np.std([m[metric] for m in fold_metrics])),
        }
        for metric in fold_metrics[0]
    }
```
|
|
116
|
+
|
|
117
|
+
**Nested cross-validation** separates hyperparameter selection from model evaluation:
|
|
118
|
+
- Outer loop: Estimate generalisation error
|
|
119
|
+
- Inner loop: Select hyperparameters via grid/random search
|
|
120
|
+
- Prevents over-fitting hyperparameters to the validation set
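
A minimal sketch of nested cross-validation, assuming scikit-learn's `GridSearchCV` as the inner loop and `cross_val_score` as the outer loop. The synthetic dataset, the logistic-regression model, and the `C` grid are illustrative choices, not recommendations from the original text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data standing in for a small real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter search, run only on each outer training fold.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Outer loop: each score comes from data the inner search never saw.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
```

The mean and spread of `outer_scores` estimate generalisation error for the whole procedure (search plus fit), rather than for one particular hyperparameter setting.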
|
|
121
|
+
|
|
122
|
+
### Holdout Sets and Evaluation Integrity
|
|
123
|
+
|
|
124
|
+
**The test set is sacred**: It may be touched exactly once — when reporting final model performance before deployment. Every other decision (architecture, hyperparameters, features) uses the validation set.
|
|
125
|
+
|
|
126
|
+
If you look at test set performance and then make changes, the test set is contaminated — you must collect a fresh test set.
|
|
127
|
+
|
|
128
|
+
**Multiple evaluation sets**:
|
|
129
|
+
- **In-distribution test set**: Same distribution as training data. Measures how well the model learned.
|
|
130
|
+
- **Out-of-distribution test set**: Different time period, geography, or user cohort. Measures generalisation.
|
|
131
|
+
- **Adversarial / challenging test set**: Hard examples, edge cases, known failure modes. Measures robustness.
|
|
132
|
+
- **Slice-specific test sets**: Subsets by demographic, category, or difficulty. Measures fairness and consistency.
|
|
133
|
+
|
|
134
|
+
### Metrics by Task Type
|
|
135
|
+
|
|
136
|
+
**Binary Classification**:
|
|
137
|
+
```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score,
    confusion_matrix, classification_report,
)

def evaluate_binary_classifier(
    y_true: np.ndarray,
    y_pred_proba: np.ndarray,
    threshold: float = 0.5,
) -> dict[str, float]:
    # Threshold-dependent metrics use hard labels; AUC metrics use raw scores.
    y_pred = (y_pred_proba >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_pred_proba),
        "pr_auc": average_precision_score(y_true, y_pred_proba),
    }
```
|
|
159
|
+
|
|
160
|
+
**Regression**:
|
|
161
|
+
```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_regressor(y_true, y_pred) -> dict[str, float]:
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Guard the MAPE denominator against zero (or near-zero) targets.
    denom = np.maximum(np.abs(y_true), 1e-8)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "r2": r2_score(y_true, y_pred),
        "mape": float(np.mean(np.abs(y_true - y_pred) / denom) * 100),
    }
```
|
|
172
|
+
|
|
173
|
+
**Multi-class classification**:
|
|
174
|
+
- Macro-average: Equal weight per class — use when class imbalance should not inflate aggregate metrics
|
|
175
|
+
- Weighted-average: Weight by class support — use for overall system performance
|
|
176
|
+
- Per-class metrics: Report separately to catch poor performance on minority classes
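
The difference between the averaging modes is easiest to see on an imbalanced toy example. The sketch below uses scikit-learn's `f1_score` with its `average` parameter; the labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: 90 majority-class and 10 minority-class examples.
y_true = np.array([0] * 90 + [1] * 10)
# The model gets every majority example right but finds only 2 of 10 minority ones.
y_pred = np.array([0] * 90 + [0] * 8 + [1] * 2)

per_class = f1_score(y_true, y_pred, average=None)    # one F1 per class
macro = f1_score(y_true, y_pred, average="macro")      # unweighted mean of per-class F1
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by class support
```

Here the weighted average stays high because the majority class dominates, while the macro average and the per-class values expose the weak minority-class performance.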
|
|
177
|
+
|
|
178
|
+
### Slice Analysis
|
|
179
|
+
|
|
180
|
+
Aggregate metrics can hide systematic failures in subgroups. Slice analysis breaks down performance by meaningful subsets:
|
|
181
|
+
|
|
182
|
+
```python
import pandas as pd

def slice_analysis(
    df: pd.DataFrame,
    y_true_col: str,
    y_pred_col: str,
    slice_cols: list[str],
    metric_fn,
) -> pd.DataFrame:
    """Compute metrics for each slice of the data.

    `metric_fn(y_true, y_pred)` must return a dict of metric name -> value.
    """
    results = []

    # Overall metrics
    overall = metric_fn(df[y_true_col], df[y_pred_col])
    results.append({"slice": "overall", "n": len(df), **overall})

    # Per-slice metrics
    for col in slice_cols:
        for value, group in df.groupby(col):
            if len(group) < 50:  # Skip slices with too few examples
                continue
            metrics = metric_fn(group[y_true_col], group[y_pred_col])
            results.append({
                "slice": f"{col}={value}",
                "n": len(group),
                **metrics,
            })

    return pd.DataFrame(results)
```
|
|
211
|
+
|
|
212
|
+
**Slices to always analyse**:
|
|
213
|
+
- Demographic groups (if available and legally permissible): age band, gender, geography
|
|
214
|
+
- Data quality slices: high vs. low confidence labels, recent vs. old data
|
|
215
|
+
- Difficulty slices: high vs. low frequency items, short vs. long text
|
|
216
|
+
- Business-relevant slices: product category, customer segment, price tier
|
|
217
|
+
|
|
218
|
+
**Flagging disparities**: If a slice's metric deviates from the overall value by more than a set threshold (e.g., 10 percentage points), flag it for investigation before deployment.
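
The flagging rule can be sketched as a small post-processing step on a slice table shaped like the output of `slice_analysis` (a `slice` column, an `n` column, and one column per metric). The helper name `flag_disparities` and the toy numbers are hypothetical; the threshold is an absolute difference, so `0.10` means 10 percentage points.

```python
import pandas as pd

def flag_disparities(slices, metric, threshold=0.10):
    """Add the delta against the 'overall' row and a boolean flag column."""
    overall = slices.loc[slices["slice"] == "overall", metric].iloc[0]
    out = slices.copy()
    out["delta_vs_overall"] = out[metric] - overall
    out["flagged"] = out["delta_vs_overall"].abs() > threshold
    return out

# Hypothetical slice table.
report = pd.DataFrame({
    "slice": ["overall", "segment=a", "segment=b"],
    "n": [1000, 700, 300],
    "accuracy": [0.90, 0.94, 0.78],
})
flagged = flag_disparities(report, "accuracy")
```

In this example only `segment=b` is flagged, since it sits 12 points below the overall accuracy.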
|
|
219
|
+
|
|
220
|
+
### Baseline Comparisons
|
|
221
|
+
|
|
222
|
+
Every model evaluation must include a comparison to baselines:
|
|
223
|
+
- **Trivial baseline**: Predict the majority class (classification) or mean target value (regression)
|
|
224
|
+
- **Rule-based baseline**: The current production rule or heuristic
|
|
225
|
+
- **Previous model version**: The model currently in production
|
|
226
|
+
- **Simple ML baseline**: Logistic regression or decision tree
|
|
227
|
+
|
|
228
|
+
A model that does not beat all baselines should not be deployed. The trivial-baseline comparison also catches bugs such as broken label encoding, where the model ends up doing no better than always predicting the majority class.
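
A trivial baseline takes only a few lines with scikit-learn's `DummyClassifier`. The sketch below uses made-up, heavily imbalanced labels to show why the baseline should be scored with the same metrics as the candidate model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels; the features are irrelevant to this baseline.
y_train = np.array([0] * 95 + [1] * 5)
y_test = np.array([0] * 19 + [1] * 1)
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, baseline_pred)
baseline_f1 = f1_score(y_test, baseline_pred, zero_division=0)
```

The baseline reaches 95% accuracy with an F1 of zero, so a candidate model reporting "95% accuracy" on this data has not demonstrated anything yet.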
|
|
229
|
+
|
|
230
|
+
### Evaluation Report Structure
|
|
231
|
+
|
|
232
|
+
```markdown
# Evaluation Report: fraud-detector-v2.3.0

## Dataset
- Test set: 45,231 examples (2024-01-01 to 2024-03-31)
- Class balance: 1.2% fraud, 98.8% non-fraud

## Overall Metrics
| Metric | v2.2.0 (prod) | v2.3.0 (candidate) | Delta |
|--------|--------------|-------------------|-------|
| ROC-AUC | 0.921 | 0.934 | +1.4% |
| PR-AUC | 0.712 | 0.748 | +5.1% |
| Recall @ precision=0.9 | 0.68 | 0.73 | +7.4% |

## Slice Analysis
| Slice | n | ROC-AUC | vs. Overall |
|-------|---|---------|-------------|
| Overall | 45,231 | 0.934 | n/a |
| Amount < $50 | 12,445 | 0.941 | +0.7% |
| Amount > $1000 | 3,211 | 0.918 | -1.6% |
| New user (< 30 days) | 8,902 | 0.891 | -4.6% ⚠️ |

## Recommendation
Promote to staging. Investigate new user performance degradation before production.
```
|
|
@@ -0,0 +1,253 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: ml-observability
|
|
3
|
+
description: Model monitoring for drift and decay, prediction logging, explainability tools, and alerting on accuracy drops in production ML systems
|
|
4
|
+
topics: [ml, observability, monitoring, drift, model-decay, explainability, alerting, prediction-logging]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
A model deployed to production without monitoring is a ticking clock. Models decay silently: the world changes, input distributions shift, and accuracy degrades while dashboards show green. Unlike software bugs that throw exceptions, model degradation has no stack trace — predictions simply become less useful. ML observability is the discipline of detecting these degradations before users notice them, through systematic monitoring of model inputs, outputs, and outcomes.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
ML observability covers four pillars: input monitoring (feature drift detection), output monitoring (prediction distribution shifts), outcome monitoring (accuracy against labels), and operational monitoring (latency, error rate). Complement monitoring with prediction logging for post-hoc analysis and explainability tools (SHAP, LIME) for understanding individual predictions and debugging systematic failures. Alert thresholds and on-call rotation for model health are as important as for service health.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### The Four Pillars of ML Observability
|
|
16
|
+
|
|
17
|
+
**Pillar 1 — Input monitoring (data drift)**: Detect when the distribution of model inputs changes from the training distribution. A model trained on winter data receiving summer data will degrade without any software change.
|
|
18
|
+
|
|
19
|
+
**Pillar 2 — Output monitoring (prediction drift)**: Detect when the model's prediction distribution changes — e.g., a fraud model that suddenly classifies 10% of transactions as fraud (vs. the baseline 0.1%).
|
|
20
|
+
|
|
21
|
+
**Pillar 3 — Outcome monitoring (accuracy/concept drift)**: Detect when model accuracy changes on labelled outcomes. Requires ground truth labels, which often arrive with delay (e.g., actual fraud confirmed days after prediction).
|
|
22
|
+
|
|
23
|
+
**Pillar 4 — Operational monitoring**: Latency, throughput, error rate, memory usage. Standard SRE metrics applied to the model serving layer.
|
|
24
|
+
|
|
25
|
+
### Feature Drift Detection
|
|
26
|
+
|
|
27
|
+
Measure drift between training and serving feature distributions using statistical tests:
|
|
28
|
+
|
|
29
|
+
```python
from scipy import stats
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class DriftReport:
    feature: str
    psi: float           # Population Stability Index
    ks_statistic: float  # Kolmogorov-Smirnov statistic
    ks_p_value: float
    is_drifted: bool

def compute_psi(
    expected: np.ndarray,
    actual: np.ndarray,
    buckets: int = 10,
) -> float:
    """Population Stability Index. PSI < 0.1: stable, 0.1-0.2: minor drift, >0.2: significant drift."""
    eps = 1e-6
    expected_counts, bins = np.histogram(expected, bins=buckets)
    # Clip serving values into the training range so that out-of-range values
    # land in the edge buckets instead of being silently dropped.
    actual_counts, _ = np.histogram(np.clip(actual, bins[0], bins[-1]), bins=bins)

    # eps keeps empty buckets from producing log(0) or division by zero.
    expected_pcts = expected_counts / expected_counts.sum() + eps
    actual_pcts = actual_counts / actual_counts.sum() + eps

    return float(np.sum((actual_pcts - expected_pcts) * np.log(actual_pcts / expected_pcts)))

def detect_drift(
    training_values: np.ndarray,
    serving_values: np.ndarray,
    feature_name: str,
    psi_threshold: float = 0.2,
    ks_alpha: float = 0.05,
) -> DriftReport:
    psi = compute_psi(training_values, serving_values)
    ks_stat, ks_p = stats.ks_2samp(training_values, serving_values)
    return DriftReport(
        feature=feature_name,
        psi=psi,
        ks_statistic=ks_stat,
        ks_p_value=ks_p,
        is_drifted=psi > psi_threshold or ks_p < ks_alpha,
    )
```
|
|
75
|
+
|
|
76
|
+
**Reference distribution maintenance**: Store feature statistics (mean, std, percentiles, histogram) from the training set as a "reference profile." Compare each day's serving data to this profile. Refresh the reference when the model is retrained.
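
A reference profile can be as simple as a dictionary of summary statistics per feature. The sketch below is one possible shape, assuming NumPy only; the function name `build_reference_profile` and the field names are hypothetical.

```python
import numpy as np

def build_reference_profile(values, n_bins=10):
    """Summarise one numeric feature from the training set."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=n_bins)
    return {
        "mean": float(values.mean()),
        "std": float(values.std()),
        "percentiles": {p: float(np.percentile(values, p)) for p in (1, 25, 50, 75, 99)},
        # Reuse these edges when bucketing serving data so the buckets line up.
        "bin_edges": edges.tolist(),
        "bin_fractions": (counts / counts.sum()).tolist(),
    }

# Hypothetical training feature: the integers 0..99.
profile = build_reference_profile(np.arange(100))
```

Storing the bin edges alongside the fractions matters: drift scores are only comparable from day to day if serving data is bucketed with the same edges as the training data.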
|
|
77
|
+
|
|
78
|
+
**PSI thresholds** (a widely used rule of thumb):
|
|
79
|
+
- PSI < 0.1: No significant drift — monitor as normal
|
|
80
|
+
- 0.1 ≤ PSI < 0.2: Minor drift — investigate, consider retraining
|
|
81
|
+
- PSI ≥ 0.2: Significant drift — trigger retraining or alert
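
As a worked example of these bands, PSI can be computed directly from bucket fractions as the sum over buckets of `(actual - expected) * ln(actual / expected)`. The helper names and the two-bucket distributions below are hypothetical, chosen to keep the arithmetic easy to follow.

```python
import math

def psi_from_fractions(expected, actual):
    """PSI from two lists of bucket fractions (each summing to 1, no zeros)."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

def psi_band(psi):
    """Map a PSI value onto the three bands listed above."""
    if psi < 0.1:
        return "stable"
    if psi < 0.2:
        return "minor drift"
    return "significant drift"

# Two buckets: training data split 50/50, serving data split 70/30.
psi = psi_from_fractions([0.5, 0.5], [0.7, 0.3])
band = psi_band(psi)
```

For this shift the two terms are roughly 0.067 and 0.102, giving a PSI of about 0.17, which lands in the minor-drift band.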
|
|
82
|
+
|
|
83
|
+
### Prediction Logging
|
|
84
|
+
|
|
85
|
+
Every prediction made in production should be logged for monitoring and post-hoc analysis:
|
|
86
|
+
|
|
87
|
+
```python
# src/serving/prediction_logger.py
import json
import time
from dataclasses import dataclass, asdict
from typing import Any

@dataclass
class PredictionRecord:
    prediction_id: str    # UUID for correlation
    model_version: str
    timestamp: float
    request_id: str       # Trace ID for distributed tracing
    input_features: dict  # Logged features (scrub PII before logging)
    prediction: Any
    confidence: float
    latency_ms: float

class PredictionLogger:
    def __init__(self, sink):  # sink: Kafka producer, Kinesis, or file
        self.sink = sink

    def log(self, record: PredictionRecord) -> None:
        payload = json.dumps(asdict(record))
        self.sink.send(payload)
```
|
|
113
|
+
|
|
114
|
+
**What to log** (balance observability with privacy/cost):
|
|
115
|
+
- Always: prediction ID, model version, timestamp, prediction value, confidence, latency
|
|
116
|
+
- Feature logging: Log features used for prediction (important for drift detection and debugging)
|
|
117
|
+
- PII scrubbing: Never log raw PII fields; log derived features or anonymised values only
|
|
118
|
+
- Sampling: For very high-throughput systems (> 10K RPS), log a representative sample (1–10%)
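
For the sampling case, one option is to decide deterministically from a hash of the prediction ID, so that every service handling the same prediction makes the same keep-or-drop decision. This is a sketch under that assumption; the function name `should_log` is hypothetical.

```python
import hashlib

def should_log(prediction_id: str, sample_rate: float) -> bool:
    """Return True for roughly `sample_rate` of all prediction IDs.

    The decision depends only on the ID, so it is stable across processes
    and restarts (unlike random.random()).
    """
    digest = hashlib.sha256(prediction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)

# Example: keep about 5% of predictions.
sampled = [pid for pid in (f"pred-{i}" for i in range(10_000)) if should_log(pid, 0.05)]
```

Because the decision is keyed on the prediction ID, delayed labels can be joined against the sampled log without a separate record of which predictions were kept.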
|
|
119
|
+
|
|
120
|
+
**Label joining**: When ground truth labels arrive (delayed), join them with prediction logs using the prediction ID to compute accuracy metrics:
|
|
121
|
+
```sql
SELECT
    p.model_version,
    COUNT(*) AS n_predictions,
    AVG(CASE WHEN p.prediction = l.actual_label THEN 1 ELSE 0 END) AS accuracy,
    AVG(p.confidence) AS mean_confidence
FROM predictions p
JOIN labels l ON p.prediction_id = l.prediction_id
WHERE p.timestamp >= NOW() - INTERVAL '7 days'
GROUP BY p.model_version
```
|
|
132
|
+
|
|
133
|
+
### Explainability
|
|
134
|
+
|
|
135
|
+
Explainability tools help debug model failures and satisfy regulatory requirements:
|
|
136
|
+
|
|
137
|
+
**SHAP (SHapley Additive exPlanations)**: Computes per-feature contributions to individual predictions using game-theoretic Shapley values. `KernelExplainer` is model-agnostic; `TreeExplainer` and `DeepExplainer` are faster, model-specific variants.
|
|
138
|
+
|
|
139
|
+
```python
import numpy as np
import shap

# `model`, `X_train`, and `X_test` are assumed to exist already.
# Sample a background dataset for the explainers that need one
background = X_train[np.random.choice(len(X_train), 100, replace=False)]
explainer = shap.TreeExplainer(model)  # For tree models
# explainer = shap.DeepExplainer(model, background)  # For neural networks
# explainer = shap.KernelExplainer(model.predict_proba, background)  # Model-agnostic

# Explain a single prediction.
# The [1] index assumes a binary classifier for which shap_values() returns
# one array per class; some SHAP versions return a single array instead.
shap_values = explainer.shap_values(X_test[0:1])
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_test[0])

# Explain the entire test set (global feature importance)
shap_values_all = explainer.shap_values(X_test)
shap.summary_plot(shap_values_all[1], X_test)
```
|
|
156
|
+
|
|
157
|
+
**LIME (Local Interpretable Model-agnostic Explanations)**: Fits a simple interpretable surrogate model (by default, a weighted linear model) locally around each prediction.
|
|
158
|
+
|
|
159
|
+
```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["legitimate", "fraud"],
    mode="classification",
)

explanation = explainer.explain_instance(
    X_test[0],
    model.predict_proba,
    num_features=10,
)
explanation.show_in_notebook()
```
|
|
176
|
+
|
|
177
|
+
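**Integrated Gradients** (for neural networks): A gradient-based attribution method that satisfies the completeness axiom (attributions sum to the difference between the model output at the input and at the baseline). Available in Captum (PyTorch):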
|
|
178
|
+
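```python
import torch
from captum.attr import IntegratedGradients

# `model` is a trained torch.nn.Module; `input_tensor` is a batch of inputs.
ig = IntegratedGradients(model)
# Captum's keyword is `baselines` (plural); all zeros is a common default.
# For multi-output models, also pass `target=<output index>`.
attributions = ig.attribute(input_tensor, baselines=torch.zeros_like(input_tensor))
```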
|
|
184
|
+
|
|
185
|
+
### Alerting Strategy
|
|
186
|
+
|
|
187
|
+
Define alert thresholds before deployment, not after a production incident:
|
|
188
|
+
|
|
189
|
+
```yaml
# monitoring/alerts.yaml
alerts:
  - name: accuracy_degradation_warning
    metric: val_accuracy_7d_rolling
    condition: "< 0.87"   # Warning: 2pp below target
    severity: warning
    action: page_on_call

  - name: accuracy_degradation_critical
    metric: val_accuracy_7d_rolling
    condition: "< 0.85"   # Critical: at SLA threshold
    severity: critical
    action: page_on_call_and_escalate

  - name: feature_drift_significant
    metric: max_psi_across_features
    condition: "> 0.2"
    severity: warning
    action: notify_ml_team

  - name: prediction_rate_anomaly
    metric: fraud_prediction_rate_1h
    condition: "> 0.05"   # 5% of predictions; 50x the 0.1% baseline rate
    severity: critical
    action: page_on_call

  - name: serving_latency_breach
    metric: p99_latency_ms
    condition: "> 200"
    severity: warning
    action: notify_ml_team
```
|
|
222
|
+
|
|
223
|
+
**Alerting anti-patterns**:
|
|
224
|
+
- Alert fatigue: Too many low-signal alerts cause teams to ignore them. Start with critical-only alerts, then add warnings after establishing baselines.
|
|
225
|
+
- Static thresholds for seasonal data: Use rolling baselines that adapt to weekly/seasonal patterns.
|
|
226
|
+
- No runbook: Every alert must have a runbook link: "When this fires, do X, check Y, escalate to Z."
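
For the seasonal-data case, one possible rolling baseline compares each day with the same weekday in the preceding weeks. The sketch below assumes a daily pandas series; the function name `seasonal_anomalies`, its defaults, and the toy traffic pattern are all hypothetical.

```python
import numpy as np
import pandas as pd

def seasonal_anomalies(daily, n_weeks=4, n_sigmas=3.0, min_std=1.0):
    """Flag days that deviate from the same weekday in the previous n_weeks."""
    # One column per lag: the value 7, 14, ... days earlier.
    history = pd.concat([daily.shift(7 * k) for k in range(1, n_weeks + 1)], axis=1)
    complete = history.notna().all(axis=1)          # require a full history
    mean = history.mean(axis=1)
    std = history.std(axis=1).clip(lower=min_std)   # floor avoids a zero-width band
    return complete & ((daily - mean).abs() > n_sigmas * std)

# Toy metric with a weekly pattern: 100 on weekdays, 40 on weekends.
index = pd.date_range("2024-01-01", periods=42, freq="D")
daily = pd.Series(np.where(index.dayofweek >= 5, 40.0, 100.0), index=index)
daily.iloc[38] = 300.0   # inject one genuine spike on a weekday

flags = seasonal_anomalies(daily)
```

A static threshold such as "alert above 90" would fire on every weekday of this series, whereas the weekday-aware baseline flags only the injected spike.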
|
|
227
|
+
|
|
228
|
+
### Model Monitoring Dashboard
|
|
229
|
+
|
|
230
|
+
A model health dashboard should show at a glance:
|
|
231
|
+
|
|
232
|
+
```
Model: fraud-detector v2.3.1 | Status: HEALTHY | Updated: 5 minutes ago

┌─────────────────┬──────────────────┬──────────────────┐
│ Accuracy (7d)   │ Prediction Rate  │ P99 Latency      │
│ 87.3% ✓         │ 0.12% ✓          │ 142ms ✓          │
│ target: ≥85%    │ baseline: 0.1%   │ SLA: <200ms      │
└─────────────────┴──────────────────┴──────────────────┘

┌──────────────────────────────────────────────────────┐
│ Feature Drift (PSI)                                  │
│ transaction_amount: 0.08 ✓                           │
│ merchant_category:  0.12 ⚠ (minor drift)             │
│ user_age_days:      0.04 ✓                           │
└──────────────────────────────────────────────────────┘
```
|
|
248
|
+
|
|
249
|
+
**Retraining triggers**: Codify when to retrain rather than leaving it to human judgment:
|
|
250
|
+
- Accuracy drops below warning threshold for 48+ consecutive hours
|
|
251
|
+
- PSI > 0.2 on any top-10 feature by SHAP importance
|
|
252
|
+
- Major upstream data source change (schema change, new data source)
|
|
253
|
+
- Scheduled retraining on a fixed cadence (monthly for most models)
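
These triggers can be codified as a small pure function that a scheduled job evaluates. The sketch below is illustrative only: the `ModelHealth` fields, the function name, and the default thresholds mirror the list above but are otherwise hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelHealth:
    hours_below_warning: float      # consecutive hours under the accuracy warning threshold
    max_psi_top_features: float     # highest PSI among the top-10 features
    upstream_schema_changed: bool   # major upstream data source change detected
    days_since_last_training: int

def retraining_reasons(
    health: ModelHealth,
    max_hours_below: float = 48,
    psi_threshold: float = 0.2,
    max_age_days: int = 30,
) -> list:
    """Return the names of all triggers that currently fire (empty list = no retrain)."""
    reasons = []
    if health.hours_below_warning >= max_hours_below:
        reasons.append("accuracy_below_warning")
    if health.max_psi_top_features > psi_threshold:
        reasons.append("feature_drift")
    if health.upstream_schema_changed:
        reasons.append("upstream_change")
    if health.days_since_last_training >= max_age_days:
        reasons.append("scheduled")
    return reasons

healthy = retraining_reasons(ModelHealth(0, 0.05, False, 10))
drifting = retraining_reasons(ModelHealth(50, 0.31, False, 10))
```

Returning the list of reasons, rather than a bare boolean, leaves a record of why each retraining run was started.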
|