@zigrivers/scaffold 3.7.0 → 3.9.0

Files changed (97)
  1. package/README.md +113 -8
  2. package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
  3. package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
  4. package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
  5. package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
  6. package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
  7. package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
  8. package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
  9. package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
  10. package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
  11. package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
  12. package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
  13. package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
  14. package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
  15. package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
  16. package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
  17. package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
  18. package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
  19. package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
  20. package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
  21. package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
  22. package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
  23. package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
  24. package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
  25. package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
  26. package/content/knowledge/library/library-api-design.md +306 -0
  27. package/content/knowledge/library/library-architecture.md +247 -0
  28. package/content/knowledge/library/library-bundling.md +244 -0
  29. package/content/knowledge/library/library-conventions.md +229 -0
  30. package/content/knowledge/library/library-dev-environment.md +220 -0
  31. package/content/knowledge/library/library-documentation.md +300 -0
  32. package/content/knowledge/library/library-project-structure.md +237 -0
  33. package/content/knowledge/library/library-requirements.md +173 -0
  34. package/content/knowledge/library/library-security.md +257 -0
  35. package/content/knowledge/library/library-testing.md +319 -0
  36. package/content/knowledge/library/library-type-definitions.md +284 -0
  37. package/content/knowledge/library/library-versioning.md +300 -0
  38. package/content/knowledge/ml/ml-architecture.md +172 -0
  39. package/content/knowledge/ml/ml-conventions.md +209 -0
  40. package/content/knowledge/ml/ml-dev-environment.md +299 -0
  41. package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
  42. package/content/knowledge/ml/ml-model-evaluation.md +256 -0
  43. package/content/knowledge/ml/ml-observability.md +253 -0
  44. package/content/knowledge/ml/ml-project-structure.md +216 -0
  45. package/content/knowledge/ml/ml-requirements.md +138 -0
  46. package/content/knowledge/ml/ml-security.md +188 -0
  47. package/content/knowledge/ml/ml-serving-patterns.md +243 -0
  48. package/content/knowledge/ml/ml-testing.md +301 -0
  49. package/content/knowledge/ml/ml-training-patterns.md +269 -0
  50. package/content/knowledge/mobile-app/mobile-app-architecture.md +283 -0
  51. package/content/knowledge/mobile-app/mobile-app-conventions.md +180 -0
  52. package/content/knowledge/mobile-app/mobile-app-deployment.md +298 -0
  53. package/content/knowledge/mobile-app/mobile-app-dev-environment.md +257 -0
  54. package/content/knowledge/mobile-app/mobile-app-distribution.md +264 -0
  55. package/content/knowledge/mobile-app/mobile-app-observability.md +317 -0
  56. package/content/knowledge/mobile-app/mobile-app-offline-patterns.md +311 -0
  57. package/content/knowledge/mobile-app/mobile-app-project-structure.md +245 -0
  58. package/content/knowledge/mobile-app/mobile-app-push-notifications.md +321 -0
  59. package/content/knowledge/mobile-app/mobile-app-requirements.md +147 -0
  60. package/content/knowledge/mobile-app/mobile-app-security.md +338 -0
  61. package/content/knowledge/mobile-app/mobile-app-testing.md +400 -0
  62. package/content/methodology/browser-extension-overlay.yml +82 -0
  63. package/content/methodology/data-pipeline-overlay.yml +70 -0
  64. package/content/methodology/library-overlay.yml +67 -0
  65. package/content/methodology/ml-overlay.yml +70 -0
  66. package/content/methodology/mobile-app-overlay.yml +71 -0
  67. package/dist/cli/commands/init.d.ts +22 -0
  68. package/dist/cli/commands/init.d.ts.map +1 -1
  69. package/dist/cli/commands/init.js +202 -3
  70. package/dist/cli/commands/init.js.map +1 -1
  71. package/dist/cli/commands/init.test.js +190 -0
  72. package/dist/cli/commands/init.test.js.map +1 -1
  73. package/dist/config/schema.d.ts +1456 -80
  74. package/dist/config/schema.d.ts.map +1 -1
  75. package/dist/config/schema.js +87 -0
  76. package/dist/config/schema.js.map +1 -1
  77. package/dist/config/schema.test.js +312 -3
  78. package/dist/config/schema.test.js.map +1 -1
  79. package/dist/core/assembly/overlay-loader.test.js +55 -0
  80. package/dist/core/assembly/overlay-loader.test.js.map +1 -1
  81. package/dist/e2e/project-type-overlays.test.d.ts +2 -1
  82. package/dist/e2e/project-type-overlays.test.d.ts.map +1 -1
  83. package/dist/e2e/project-type-overlays.test.js +780 -14
  84. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  85. package/dist/types/config.d.ts +16 -1
  86. package/dist/types/config.d.ts.map +1 -1
  87. package/dist/wizard/questions.d.ts +28 -1
  88. package/dist/wizard/questions.d.ts.map +1 -1
  89. package/dist/wizard/questions.js +127 -1
  90. package/dist/wizard/questions.js.map +1 -1
  91. package/dist/wizard/questions.test.js +224 -4
  92. package/dist/wizard/questions.test.js.map +1 -1
  93. package/dist/wizard/wizard.d.ts +22 -0
  94. package/dist/wizard/wizard.d.ts.map +1 -1
  95. package/dist/wizard/wizard.js +28 -1
  96. package/dist/wizard/wizard.js.map +1 -1
  97. package/package.json +1 -1
@@ -0,0 +1,256 @@
1
+ ---
2
+ name: ml-model-evaluation
3
+ description: Train/val/test splits, cross-validation, metrics by task type, holdout sets, and slice analysis for thorough model evaluation
4
+ topics: [ml, evaluation, train-test-split, cross-validation, metrics, holdout, slice-analysis]
5
+ ---
6
+
7
+ Model evaluation is the difference between knowing whether your model works and believing it works. Most ML evaluation bugs are forms of data leakage: the model has seen information during training that it would not have at inference time, making offline metrics look better than production performance. Rigorous evaluation requires careful data splitting, leak-free preprocessing, appropriate metrics for the task, and systematic analysis of where the model fails.
8
+
9
+ ## Summary
10
+
11
+ Split data into train, validation, and test sets — use the test set exactly once. For small datasets, use cross-validation. Choose metrics appropriate to the task: classification, regression, ranking, or generation have different canonical metrics. Analyse model performance by meaningful slices (demographic groups, difficulty levels, data subsets) — aggregate metrics hide subgroup failures. Log evaluation results with experiment metadata for longitudinal comparison.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Data Splitting Principles
16
+
17
+ **Three-way split**: train (model learning), validation (hyperparameter tuning and early stopping), test (final unbiased evaluation):
18
+
19
+ ```python
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+
+ def create_splits(
+     df: pd.DataFrame,
+     val_fraction: float = 0.1,
+     test_fraction: float = 0.1,
+     seed: int = 42,
+     stratify_col: str | None = None,
+ ) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+     """Create reproducible train/val/test splits."""
+     stratify = df[stratify_col] if stratify_col else None
+
+     # First carve off the test set
+     train_val, test = train_test_split(
+         df,
+         test_size=test_fraction,
+         random_state=seed,
+         stratify=stratify,
+     )
+
+     # val_fraction is relative to the full dataset, so rescale it for the remaining rows
+     val_size_adjusted = val_fraction / (1 - test_fraction)
+     stratify_tv = train_val[stratify_col] if stratify_col else None
+
+     train, val = train_test_split(
+         train_val,
+         test_size=val_size_adjusted,
+         random_state=seed,
+         stratify=stratify_tv,
+     )
+
+     return train, val, test
+ ```
51
+
52
+ **Critical splitting rules**:
53
+ 1. **Split before preprocessing**: Fit preprocessing (scalers, encoders, imputers, tokenizer vocabularies) on training data only, then apply it to val/test. Fitting on the combined dataset is data leakage.
54
+ 2. **Stratify by label for classification**: Ensures class distribution is preserved in each split.
55
+ 3. **Split by entity, not row, for grouped data**: If you have multiple rows per user, all rows for a user must go to the same split. Row-level splitting leaks user-level information.
56
+ 4. **Temporal split for time-series**: Train on past, validate and test on future. Random splits would leak future information.
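Rule 3 above (split by entity) can be sketched with a deterministic hash of the entity ID. This is a minimal illustration, not part of the package: `assign_split` and the `user_id` field are hypothetical names, and in a pandas/scikit-learn workflow `GroupShuffleSplit` serves a similar purpose.

```python
import hashlib

def assign_split(entity_id: str, val_fraction: float = 0.1, test_fraction: float = 0.1) -> str:
    """Map an entity ID to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"

# Hypothetical rows: two of them belong to the same user
rows = [
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u2", "amount": 25.0},
    {"user_id": "u1", "amount": 7.5},
]

splits = {"train": [], "val": [], "test": []}
for row in rows:
    # Every row for a given user lands in the same split
    splits[assign_split(row["user_id"])].append(row)
```

Because the assignment depends only on the ID, new rows for an existing user keep landing in that user's original split when the dataset grows.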
57
+
58
+ ### Temporal Splits
59
+
60
+ For any dataset with a time dimension, always split by time:
61
+
62
+ ```python
+ import pandas as pd
+
+ def temporal_split(
+     df: pd.DataFrame,
+     timestamp_col: str,
+     val_start: str,
+     test_start: str,
+ ) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+     """Create temporal splits: train/val/test defined by date boundaries."""
+     df = df.sort_values(timestamp_col)
+     # Boundaries are half-open: val covers [val_start, test_start), test covers [test_start, ...)
+     train = df[df[timestamp_col] < val_start]
+     val = df[(df[timestamp_col] >= val_start) & (df[timestamp_col] < test_start)]
+     test = df[df[timestamp_col] >= test_start]
+     return train, val, test
+ ```
76
+
77
+ **Backtesting** extends temporal evaluation by simulating deployment across multiple time windows: it tests whether a model trained on one period performs well on the periods that follow.
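One common way to lay out backtesting windows is an expanding training window followed by a fixed-length test window. The sketch below only generates index ranges over time-ordered periods; `backtest_windows` and its parameters are illustrative names, and a sliding (fixed-size) training window is an equally valid choice.

```python
from typing import Iterator, Optional, Tuple

def backtest_windows(
    n_periods: int,
    initial_train: int,
    horizon: int,
    step: Optional[int] = None,
) -> Iterator[Tuple[range, range]]:
    """Yield (train_range, test_range) pairs over time-ordered periods."""
    step = step or horizon
    train_end = initial_train
    while train_end + horizon <= n_periods:
        # Train on everything before train_end, test on the next `horizon` periods
        yield range(0, train_end), range(train_end, train_end + horizon)
        train_end += step

# 12 monthly periods: start with 6 months of training data, test 2 months at a time
windows = list(backtest_windows(n_periods=12, initial_train=6, horizon=2))
for train_range, test_range in windows:
    print(f"train periods 0-{train_range[-1]}, test periods {test_range[0]}-{test_range[-1]}")
```

Each window would be trained and evaluated independently, and the per-window metrics compared to see whether performance holds up over time.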
78
+
79
+ ### Cross-Validation
80
+
81
+ Use k-fold cross-validation when the dataset is too small for a stable held-out set (as a rough guide, fewer than about 10,000 examples):
82
+
83
+ ```python
+ from sklearn.model_selection import StratifiedKFold
+ import numpy as np
+
+ def cross_validate(
+     X: np.ndarray,
+     y: np.ndarray,
+     model_builder,
+     n_folds: int = 5,
+     seed: int = 42,
+ ) -> dict[str, dict[str, float]]:
+     """Stratified k-fold cross-validation; returns mean and std per metric."""
+     skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
+     fold_metrics = []
+
+     for train_idx, val_idx in skf.split(X, y):
+         X_train, X_val = X[train_idx], X[val_idx]
+         y_train, y_val = y[train_idx], y[val_idx]
+
+         model = model_builder()  # Fresh model per fold so no state carries over
+         model.fit(X_train, y_train)
+         # evaluate_model is assumed to be defined elsewhere and to return dict[str, float]
+         metrics = evaluate_model(model, X_val, y_val)
+         fold_metrics.append(metrics)
+
+     # Aggregate across folds
+     return {
+         metric: {
+             "mean": float(np.mean([m[metric] for m in fold_metrics])),
+             "std": float(np.std([m[metric] for m in fold_metrics])),
+         }
+         for metric in fold_metrics[0]
+     }
+ ```
116
+
117
+ **Nested cross-validation** separates hyperparameter selection from model evaluation:
118
+ - Outer loop: Estimate generalisation error
119
+ - Inner loop: Select hyperparameters via grid/random search
120
+ - Prevents over-fitting hyperparameters to the validation set
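The outer/inner structure described above can be sketched in plain Python. To stay self-contained, this uses a deliberately trivial "model" that predicts `alpha * mean(train targets)`, with `alpha` standing in for a hyperparameter; all function names here are hypothetical, and a real project would more likely combine scikit-learn's search and cross-validation utilities.

```python
import random
from statistics import mean

def k_fold_indices(n: int, k: int, seed: int):
    """Yield (train_indices, held_out_indices) for k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        rest = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield rest, folds[i]

def mse(y_true, y_pred) -> float:
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred))

def predict(train_y, alpha: float, n: int):
    """Toy model: predict alpha * mean(train_y) for every example."""
    return [alpha * mean(train_y)] * n

def select_alpha(y_train, alphas, inner_k: int, seed: int) -> float:
    """Inner loop: pick the alpha with the lowest mean inner-fold error."""
    def inner_score(alpha: float) -> float:
        scores = []
        for tr_idx, val_idx in k_fold_indices(len(y_train), inner_k, seed):
            tr = [y_train[i] for i in tr_idx]
            val = [y_train[i] for i in val_idx]
            scores.append(mse(val, predict(tr, alpha, len(val))))
        return mean(scores)
    return min(alphas, key=inner_score)

def nested_cv(y, alphas, outer_k: int = 5, inner_k: int = 3, seed: int = 42) -> float:
    """Outer loop: estimate generalisation error with alpha chosen per outer fold."""
    outer_scores = []
    for train_idx, test_idx in k_fold_indices(len(y), outer_k, seed):
        y_train = [y[i] for i in train_idx]
        y_test = [y[i] for i in test_idx]
        # The outer test fold is never seen during hyperparameter selection
        best_alpha = select_alpha(y_train, alphas, inner_k, seed)
        outer_scores.append(mse(y_test, predict(y_train, best_alpha, len(y_test))))
    return mean(outer_scores)

y = [10.0 + (i % 3) for i in range(30)]
score = nested_cv(y, alphas=[0.5, 1.0])
```

The returned score estimates the error of the whole procedure (selection plus fitting), not of one fixed hyperparameter value.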
121
+
122
+ ### Holdout Sets and Evaluation Integrity
123
+
124
+ **The test set is sacred**: It may be touched exactly once — when reporting final model performance before deployment. Every other decision (architecture, hyperparameters, features) uses the validation set.
125
+
126
+ If you look at test set performance and then make changes, the test set is contaminated — you must collect a fresh test set.
127
+
128
+ **Multiple evaluation sets**:
129
+ - **In-distribution test set**: Same distribution as training data. Measures how well the model learned.
130
+ - **Out-of-distribution test set**: Different time period, geography, or user cohort. Measures generalisation.
131
+ - **Adversarial / challenging test set**: Hard examples, edge cases, known failure modes. Measures robustness.
132
+ - **Slice-specific test sets**: Subsets by demographic, category, or difficulty. Measures fairness and consistency.
133
+
134
+ ### Metrics by Task Type
135
+
136
+ **Binary Classification**:
137
+ ```python
+ import numpy as np
+ from sklearn.metrics import (
+     accuracy_score, precision_score, recall_score,
+     f1_score, roc_auc_score, average_precision_score,
+     confusion_matrix, classification_report,
+ )
+
+ def evaluate_binary_classifier(
+     y_true: np.ndarray,
+     y_pred_proba: np.ndarray,
+     threshold: float = 0.5,
+ ) -> dict[str, float]:
+     # Threshold-dependent metrics use hard labels; AUC metrics use the raw scores
+     y_pred = (y_pred_proba >= threshold).astype(int)
+     return {
+         "accuracy": accuracy_score(y_true, y_pred),
+         "precision": precision_score(y_true, y_pred, zero_division=0),
+         "recall": recall_score(y_true, y_pred, zero_division=0),
+         "f1": f1_score(y_true, y_pred, zero_division=0),
+         "roc_auc": roc_auc_score(y_true, y_pred_proba),
+         "pr_auc": average_precision_score(y_true, y_pred_proba),
+     }
+ ```
159
+
160
+ **Regression**:
161
+ ```python
+ import numpy as np
+ from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
+
+ def evaluate_regressor(y_true, y_pred) -> dict[str, float]:
+     y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
+     return {
+         "mae": mean_absolute_error(y_true, y_pred),
+         "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
+         "r2": r2_score(y_true, y_pred),
+         # Clamp the denominator by magnitude so negative targets cannot cancel the epsilon
+         "mape": np.mean(np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), 1e-8)) * 100,
+     }
+ ```
172
+
173
+ **Multi-class classification**:
174
+ - Macro-average: Equal weight per class — use when class imbalance should not inflate aggregate metrics
175
+ - Weighted-average: Weight by class support — use for overall system performance
176
+ - Per-class metrics: Report separately to catch poor performance on minority classes
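A small worked example may make the difference between the averaging modes concrete. The sketch below computes per-class F1 by hand on a toy imbalanced problem where the classifier always predicts the majority class; the helper names are illustrative, and in practice `f1_score(..., average="macro")` and `average="weighted"` from scikit-learn do the same job.

```python
from collections import Counter

def per_class_f1(y_true, y_pred) -> dict:
    """F1 per class, computed from true/false positives and false negatives."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_and_weighted_f1(y_true, y_pred) -> tuple:
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(f1.values()) / len(f1)                              # equal weight per class
    weighted = sum(f1[c] * support[c] for c in f1) / len(y_true)    # weight by class support
    return macro, weighted

# 8 examples of class "a", 2 of class "b"; the model always predicts "a"
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(per_class_f1(y_true, y_pred))   # class "b" has F1 = 0.0
print(round(macro, 3), round(weighted, 3))   # 0.444 0.711
```

The weighted average looks acceptable while the macro average and the per-class breakdown expose that the minority class is never predicted.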
177
+
178
+ ### Slice Analysis
179
+
180
+ Aggregate metrics can hide systematic failures in subgroups. Slice analysis breaks down performance by meaningful subsets:
181
+
182
+ ```python
+ import pandas as pd
+
+ def slice_analysis(
+     df: pd.DataFrame,
+     y_true_col: str,
+     y_pred_col: str,
+     slice_cols: list[str],
+     metric_fn,
+ ) -> pd.DataFrame:
+     """Compute metrics for each slice of the data."""
+     results = []
+
+     # Overall metrics (metric_fn is expected to return a dict of metric name -> value)
+     overall = metric_fn(df[y_true_col], df[y_pred_col])
+     results.append({"slice": "overall", "n": len(df), **overall})
+
+     # Per-slice metrics
+     for col in slice_cols:
+         for value, group in df.groupby(col):
+             if len(group) < 50:  # Skip slices with too few examples
+                 continue
+             metrics = metric_fn(group[y_true_col], group[y_pred_col])
+             results.append({
+                 "slice": f"{col}={value}",
+                 "n": len(group),
+                 **metrics,
+             })
+
+     return pd.DataFrame(results)
+ ```
211
+
212
+ **Slices to always analyse**:
213
+ - Demographic groups (if available and legally permissible): age band, gender, geography
214
+ - Data quality slices: high vs. low confidence labels, recent vs. old data
215
+ - Difficulty slices: high vs. low frequency items, short vs. long text
216
+ - Business-relevant slices: product category, customer segment, price tier
217
+
218
+ **Flagging disparities**: If a slice's metric deviates from overall by more than a threshold (e.g., 10 percentage points), flag for investigation before deployment.
219
+
220
+ ### Baseline Comparisons
221
+
222
+ Every model evaluation must include a comparison to baselines:
223
+ - **Trivial baseline**: Predict the majority class (classification) or mean target value (regression)
224
+ - **Rule-based baseline**: The current production rule or heuristic
225
+ - **Previous model version**: The model currently in production
226
+ - **Simple ML baseline**: Logistic regression or decision tree
227
+
228
+ A model that does not beat all baselines should not be deployed. The trivial baseline check also catches label encoding bugs: a model that scores no better than always predicting the majority class has probably learned nothing beyond the class prior.
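The trivial baseline for classification takes only a few lines to compute. The sketch below assumes accuracy as the metric and uses hypothetical helper names; on heavily imbalanced data the same comparison is more informative with PR-AUC or recall.

```python
from collections import Counter

def majority_class_accuracy(y_train, y_test) -> float:
    """Accuracy of always predicting the most frequent training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return sum(1 for y in y_test if y == majority_label) / len(y_test)

def beats_baseline(model_score: float, baseline_score: float, min_margin: float = 0.0) -> bool:
    """True only if the model exceeds the baseline by more than min_margin."""
    return model_score > baseline_score + min_margin

# Toy imbalanced labels: 2% positives
y_train = [0] * 98 + [1] * 2
y_test = [0] * 49 + [1] * 1

baseline = majority_class_accuracy(y_train, y_test)
print(baseline)                          # 0.98
print(beats_baseline(0.97, baseline))    # False: 97% accuracy is worse than the trivial baseline
```

A candidate that fails this check on accuracy may still be useful on another metric, which is a reason to choose the headline metric before comparing.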
229
+
230
+ ### Evaluation Report Structure
231
+
232
+ ```markdown
233
+ # Evaluation Report: fraud-detector-v2.3.0
234
+
235
+ ## Dataset
236
+ - Test set: 45,231 examples (2024-01-01 to 2024-03-31)
237
+ - Class balance: 1.2% fraud, 98.8% non-fraud
238
+
239
+ ## Overall Metrics
240
+ | Metric | v2.2.0 (prod) | v2.3.0 (candidate) | Delta |
241
+ |--------|--------------|-------------------|-------|
242
+ | ROC-AUC | 0.921 | 0.934 | +1.4% |
243
+ | PR-AUC | 0.712 | 0.748 | +5.1% |
244
+ | Recall @ precision=0.9 | 0.68 | 0.73 | +7.4% |
245
+
246
+ ## Slice Analysis
247
+ | Slice | n | ROC-AUC | vs. Overall |
248
+ |-------|---|---------|-------------|
249
+ | Overall | 45,231 | 0.934 | — |
250
+ | Amount < $50 | 12,445 | 0.941 | +0.7% |
251
+ | Amount > $1000 | 3,211 | 0.918 | -1.7% |
252
+ | New user (< 30 days) | 8,902 | 0.891 | -4.6% ⚠️ |
253
+
254
+ ## Recommendation
255
+ Promote to staging. Investigate new user performance degradation before production.
256
+ ```
@@ -0,0 +1,253 @@
1
+ ---
2
+ name: ml-observability
3
+ description: Model monitoring for drift and decay, prediction logging, explainability tools, and alerting on accuracy drops in production ML systems
4
+ topics: [ml, observability, monitoring, drift, model-decay, explainability, alerting, prediction-logging]
5
+ ---
6
+
7
+ A model deployed to production without monitoring is a ticking clock. Models decay silently: the world changes, input distributions shift, and accuracy degrades while dashboards show green. Unlike software bugs that throw exceptions, model degradation has no stack trace — predictions simply become less useful. ML observability is the discipline of detecting these degradations before users notice them, through systematic monitoring of model inputs, outputs, and outcomes.
8
+
9
+ ## Summary
10
+
11
+ ML observability covers four pillars: input monitoring (feature drift detection), output monitoring (prediction distribution shifts), outcome monitoring (accuracy against labels), and operational monitoring (latency, error rate). Complement monitoring with prediction logging for post-hoc analysis and explainability tools (SHAP, LIME) for understanding individual predictions and debugging systematic failures. Alert thresholds and on-call rotation for model health are as important as for service health.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### The Four Pillars of ML Observability
16
+
17
+ **Pillar 1 — Input monitoring (data drift)**: Detect when the distribution of model inputs changes from the training distribution. A model trained on winter data receiving summer data will degrade without any software change.
18
+
19
+ **Pillar 2 — Output monitoring (prediction drift)**: Detect when the model's prediction distribution changes — e.g., a fraud model that suddenly classifies 10% of transactions as fraud (vs. the baseline 0.1%).
20
+
21
+ **Pillar 3 — Outcome monitoring (accuracy/concept drift)**: Detect when model accuracy changes on labelled outcomes. Requires ground truth labels, which often arrive with delay (e.g., actual fraud confirmed days after prediction).
22
+
23
+ **Pillar 4 — Operational monitoring**: Latency, throughput, error rate, memory usage. Standard SRE metrics applied to the model serving layer.
24
+
25
+ ### Feature Drift Detection
26
+
27
+ Measure drift between training and serving feature distributions using statistical tests:
28
+
29
+ ```python
+ from scipy import stats
+ import numpy as np
+ from dataclasses import dataclass
+
+ @dataclass
+ class DriftReport:
+     feature: str
+     psi: float            # Population Stability Index
+     ks_statistic: float   # Kolmogorov-Smirnov statistic
+     ks_p_value: float
+     is_drifted: bool
+
+ def compute_psi(
+     expected: np.ndarray,
+     actual: np.ndarray,
+     buckets: int = 10,
+ ) -> float:
+     """Population Stability Index. PSI < 0.1: stable, 0.1-0.2: minor drift, >0.2: significant drift."""
+     eps = 1e-6  # Avoids log(0) and division by zero for empty buckets
+     # Bin edges come from the expected (training) data; actual values outside that range are not counted
+     expected_pcts, bins = np.histogram(expected, bins=buckets)
+     actual_pcts, _ = np.histogram(actual, bins=bins)
+
+     expected_pcts = expected_pcts / expected_pcts.sum() + eps
+     actual_pcts = actual_pcts / actual_pcts.sum() + eps
+
+     return float(np.sum((actual_pcts - expected_pcts) * np.log(actual_pcts / expected_pcts)))
+
+ def detect_drift(
+     training_values: np.ndarray,
+     serving_values: np.ndarray,
+     feature_name: str,
+     psi_threshold: float = 0.2,
+     ks_alpha: float = 0.05,
+ ) -> DriftReport:
+     psi = compute_psi(training_values, serving_values)
+     ks_stat, ks_p = stats.ks_2samp(training_values, serving_values)
+     return DriftReport(
+         feature=feature_name,
+         psi=psi,
+         ks_statistic=float(ks_stat),
+         ks_p_value=float(ks_p),
+         # Flag drift if either test fires
+         is_drifted=bool(psi > psi_threshold or ks_p < ks_alpha),
+     )
+ ```
75
+
76
+ **Reference distribution maintenance**: Store feature statistics (mean, std, percentiles, histogram) from the training set as a "reference profile." Compare each day's serving data to this profile. Refresh the reference when the model is retrained.
77
+
78
+ **PSI thresholds** (a widely used rule of thumb):
79
+ - PSI < 0.1: No significant drift — monitor as normal
80
+ - 0.1 ≤ PSI < 0.2: Minor drift — investigate, consider retraining
81
+ - PSI ≥ 0.2: Significant drift — trigger retraining or alert
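To see how these bands apply, here is a hand-sized PSI calculation on pre-binned proportions, followed by a mapping from the PSI value to the three bands listed above. Both helper names are illustrative, and the bucket proportions are made up for the example.

```python
import math

def psi_from_proportions(expected: list, actual: list, eps: float = 1e-6) -> float:
    """PSI = sum over buckets of (actual - expected) * ln(actual / expected)."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = e + eps, a + eps  # guard against empty buckets
        total += (a - e) * math.log(a / e)
    return total

def psi_band(psi: float) -> str:
    """Map a PSI value to the bands listed above."""
    if psi < 0.1:
        return "stable"
    if psi < 0.2:
        return "minor_drift"
    return "significant_drift"

# Two buckets: training data split 50/50, serving data split 60/40
psi = psi_from_proportions([0.5, 0.5], [0.6, 0.4])
print(round(psi, 4), psi_band(psi))  # 0.0405 stable
```

A ten-point shift between two buckets still counts as stable under these bands, which shows how large a shift has to be before PSI crosses 0.2.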
82
+
83
+ ### Prediction Logging
84
+
85
+ Every prediction made in production should be logged for monitoring and post-hoc analysis:
86
+
87
+ ```python
+ # src/serving/prediction_logger.py
+ import json
+ import time
+ from dataclasses import dataclass, asdict
+ from typing import Any
+
+ @dataclass
+ class PredictionRecord:
+     prediction_id: str    # UUID for correlation
+     model_version: str
+     timestamp: float
+     request_id: str       # Trace ID for distributed tracing
+     input_features: dict  # Logged features (scrub PII before logging)
+     prediction: Any
+     confidence: float
+     latency_ms: float
+
+ class PredictionLogger:
+     def __init__(self, sink):  # sink: Kafka producer, Kinesis, or file
+         self.sink = sink
+
+     def log(self, record: PredictionRecord) -> None:
+         payload = json.dumps(asdict(record))
+         self.sink.send(payload)
+ ```
113
+
114
+ **What to log** (balance observability with privacy/cost):
115
+ - Always: prediction ID, model version, timestamp, prediction value, confidence, latency
116
+ - Feature logging: Log features used for prediction (important for drift detection and debugging)
117
+ - PII scrubbing: Never log raw PII fields; log derived features or anonymised values only
118
+ - Sampling: For very high-throughput systems (> 10K RPS), log a representative sample (1–10%)
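The sampling and PII-scrubbing points can be sketched as two small helpers. Hashing the prediction ID gives a sampling decision that is stable across replicas and retries; `should_log`, `scrub_features`, and the `PII_FIELDS` set are hypothetical names, and a real deployment would take the list of sensitive fields from its own data classification.

```python
import hashlib

# Hypothetical set of raw PII field names that must never reach the log
PII_FIELDS = {"email", "full_name", "phone_number"}

def should_log(prediction_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of predictions."""
    digest = hashlib.sha256(prediction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

def scrub_features(features: dict) -> dict:
    """Drop raw PII fields before the features are logged."""
    return {k: v for k, v in features.items() if k not in PII_FIELDS}

features = {"email": "a@example.com", "account_age_days": 412, "amount": 59.9}
if should_log("pred-0001", sample_rate=1.0):
    safe_features = scrub_features(features)
```

Dropping fields is the simplest policy; where a feature derived from PII is needed for drift detection, logging a bucketed or hashed value is a common alternative.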
119
+
120
+ **Label joining**: When ground truth labels arrive (delayed), join them with prediction logs using the prediction ID to compute accuracy metrics:
121
+ ```sql
+ SELECT
+     p.model_version,
+     COUNT(*) AS n_predictions,
+     AVG(CASE WHEN p.prediction = l.actual_label THEN 1 ELSE 0 END) AS accuracy,
+     AVG(p.confidence) AS mean_confidence
+ FROM predictions p
+ JOIN labels l ON p.prediction_id = l.prediction_id
+ WHERE p.timestamp >= NOW() - INTERVAL '7 days'
+ GROUP BY p.model_version
+ ```
132
+
133
+ ### Explainability
134
+
135
+ Explainability tools help debug model failures and satisfy regulatory requirements:
136
+
137
+ **SHAP (SHapley Additive exPlanations)**: Computes feature importance for individual predictions using game-theoretic Shapley values. Model-specific explainers (trees, deep networks) are fast; `KernelExplainer` works with any model but is slower.
138
+
139
+ ```python
140
+ import shap
141
+
142
+ # Sample a background dataset for the explainers that need one (Deep, Kernel)
143
+ background = X_train[np.random.choice(len(X_train), 100, replace=False)]
144
+ explainer = shap.TreeExplainer(model) # For tree models
145
+ # explainer = shap.DeepExplainer(model, background) # For neural networks
146
+ # explainer = shap.KernelExplainer(model.predict_proba, background) # Model-agnostic
147
+
148
+ # Explain a single prediction (indexing by class below assumes shap returns one array per class; this varies by shap version)
149
+ shap_values = explainer.shap_values(X_test[0:1])
150
+ shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_test[0])
151
+
152
+ # Explain the entire test set (global feature importance)
153
+ shap_values_all = explainer.shap_values(X_test)
154
+ shap.summary_plot(shap_values_all[1], X_test)
155
+ ```
156
+
157
+ **LIME (Local Interpretable Model-agnostic Explanations)**: Fits a simple interpretable model (linear regression) locally around each prediction.
158
+
159
+ ```python
+ from lime.lime_tabular import LimeTabularExplainer
+
+ explainer = LimeTabularExplainer(
+     X_train,
+     feature_names=feature_names,
+     class_names=["legitimate", "fraud"],
+     mode="classification",
+ )
+
+ explanation = explainer.explain_instance(
+     X_test[0],
+     model.predict_proba,
+     num_features=10,
+ )
+ explanation.show_in_notebook()
+ ```
176
+
177
+ **Integrated Gradients** (for neural networks): Attribution method that satisfies the completeness axiom: attributions sum to the difference between the model output at the input and at the baseline. Available in Captum (PyTorch):
178
+ ```python
179
+ import torch
+ from captum.attr import IntegratedGradients
+
+ ig = IntegratedGradients(model)
+ # Captum's keyword is `baselines`; multi-output models also need a `target` index
+ attributions = ig.attribute(input_tensor, baselines=torch.zeros_like(input_tensor))
183
+ ```
184
+
185
+ ### Alerting Strategy
186
+
187
+ Define alert thresholds before deployment, not after a production incident:
188
+
189
+ ```yaml
+ # monitoring/alerts.yaml
+ alerts:
+   - name: accuracy_degradation_warning
+     metric: val_accuracy_7d_rolling
+     condition: "< 0.87"   # Warning: 2pp above the SLA threshold
+     severity: warning
+     action: page_on_call
+
+   - name: accuracy_degradation_critical
+     metric: val_accuracy_7d_rolling
+     condition: "< 0.85"   # Critical: at SLA threshold
+     severity: critical
+     action: page_on_call_and_escalate
+
+   - name: feature_drift_significant
+     metric: max_psi_across_features
+     condition: "> 0.2"
+     severity: warning
+     action: notify_ml_team
+
+   - name: prediction_rate_anomaly
+     metric: fraud_prediction_rate_1h
+     condition: "> 0.05"   # 50x the 0.1% baseline rate
+     severity: critical
+     action: page_on_call
+
+   - name: serving_latency_breach
+     metric: p99_latency_ms
+     condition: "> 200"
+     severity: warning
+     action: notify_ml_team
+ ```
222
+
223
+ **Alerting anti-patterns**:
224
+ - Alert fatigue: Too many low-signal alerts cause teams to ignore them. Start with critical-only, add warnings after establishing baselines.
225
+ - Static thresholds for seasonal data: Use rolling baselines that adapt to weekly/seasonal patterns.
226
+ - No runbook: Every alert must have a runbook link: "When this fires, do X, check Y, escalate to Z."
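For the static-threshold anti-pattern, one simple alternative is to compare the current value against a rolling history taken from comparable periods (for example the same hour of the week). The sketch below uses a mean and standard deviation band; `is_anomalous`, the three-sigma default, and the sample values are assumptions for illustration, not a prescribed method.

```python
from statistics import mean, stdev

def is_anomalous(history: list, current: float, n_sigmas: float = 3.0) -> bool:
    """Flag `current` if it falls outside mean +/- n_sigmas * std of the history."""
    mu = mean(history)
    sigma = max(stdev(history), 1e-12)  # avoid a zero-width band for constant history
    return abs(current - mu) > n_sigmas * sigma

# Hypothetical fraud-prediction rates for the same hour over the previous four weeks
same_hour_history = [0.0010, 0.0012, 0.0009, 0.0011]

print(is_anomalous(same_hour_history, current=0.0011))  # False: within the usual band
print(is_anomalous(same_hour_history, current=0.0100))  # True: roughly 10x the usual rate
```

Because the band is derived from like-for-like periods, a predictable weekend or seasonal swing moves the baseline instead of paging someone.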
227
+
228
+ ### Model Monitoring Dashboard
229
+
230
+ A model health dashboard should show at a glance:
231
+
232
+ ```
233
+ Model: fraud-detector v2.3.1 | Status: HEALTHY | Updated: 5 minutes ago
234
+
235
+ ┌─────────────────┬──────────────────┬──────────────────┐
236
+ │ Accuracy (7d) │ Prediction Rate │ P99 Latency │
237
+ │ 87.3% ✓ │ 0.12% ✓ │ 142ms ✓ │
238
+ │ target: ≥85% │ baseline: 0.1% │ SLA: <200ms │
239
+ └─────────────────┴──────────────────┴──────────────────┘
240
+
241
+ ┌──────────────────────────────────────────────────────┐
242
+ │ Feature Drift (PSI) │
243
+ │ transaction_amount: 0.08 ✓ │
244
+ │ merchant_category: 0.12 ⚠ (minor drift) │
245
+ │ user_age_days: 0.04 ✓ │
246
+ └──────────────────────────────────────────────────────┘
247
+ ```
248
+
249
+ **Retraining triggers**: Codify when to retrain rather than leaving it to human judgment:
250
+ - Accuracy drops below warning threshold for 48+ consecutive hours
251
+ - PSI > 0.2 on any top-10 feature by SHAP importance
252
+ - Major upstream data source change (schema change, new data source)
253
+ - Scheduled retraining on a fixed cadence (monthly for most models)
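The triggers above can be codified as a single function that returns the reasons for retraining, so the decision is auditable. This is a sketch: the `ModelHealth` fields, the reason strings, and the 30-day default standing in for "monthly" are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelHealth:
    hours_below_warning: float      # consecutive hours accuracy has been under the warning threshold
    max_psi_top_features: float     # highest PSI among the top-10 features by SHAP importance
    upstream_source_changed: bool   # schema change or new upstream data source
    days_since_last_training: int

def retraining_reasons(health: ModelHealth, max_age_days: int = 30) -> list:
    """Return every trigger that currently applies; an empty list means no retraining."""
    reasons = []
    if health.hours_below_warning >= 48:
        reasons.append("accuracy_below_warning_48h")
    if health.max_psi_top_features > 0.2:
        reasons.append("feature_drift_psi")
    if health.upstream_source_changed:
        reasons.append("upstream_data_change")
    if health.days_since_last_training >= max_age_days:
        reasons.append("scheduled_cadence")
    return reasons

health = ModelHealth(
    hours_below_warning=0,
    max_psi_top_features=0.27,
    upstream_source_changed=False,
    days_since_last_training=12,
)
print(retraining_reasons(health))  # ['feature_drift_psi']
```

Returning reasons rather than a boolean lets the retraining job record why it ran alongside the experiment metadata.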