@zigrivers/scaffold 3.8.0 → 3.9.1
- package/README.md +73 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/dist/cli/commands/init.d.ts +13 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +122 -2
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +120 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +864 -48
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +53 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +166 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +33 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -2
- package/dist/e2e/project-type-overlays.test.js +499 -33
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +10 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +17 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +75 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +167 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +13 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +17 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
|
@@ -0,0 +1,253 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: ml-observability
|
|
3
|
+
description: Model monitoring for drift and decay, prediction logging, explainability tools, and alerting on accuracy drops in production ML systems
|
|
4
|
+
topics: [ml, observability, monitoring, drift, model-decay, explainability, alerting, prediction-logging]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
A model deployed to production without monitoring is a ticking clock. Models decay silently: the world changes, input distributions shift, and accuracy degrades while dashboards show green. Unlike software bugs that throw exceptions, model degradation has no stack trace — predictions simply become less useful. ML observability is the discipline of detecting these degradations before users notice them, through systematic monitoring of model inputs, outputs, and outcomes.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
ML observability covers four pillars: input monitoring (feature drift detection), output monitoring (prediction distribution shifts), outcome monitoring (accuracy against labels), and operational monitoring (latency, error rate). Complement monitoring with prediction logging for post-hoc analysis and explainability tools (SHAP, LIME) for understanding individual predictions and debugging systematic failures. Alert thresholds and on-call rotation for model health are as important as for service health.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### The Four Pillars of ML Observability
|
|
16
|
+
|
|
17
|
+
**Pillar 1 — Input monitoring (data drift)**: Detect when the distribution of model inputs changes from the training distribution. A model trained on winter data receiving summer data will degrade without any software change.
|
|
18
|
+
|
|
19
|
+
**Pillar 2 — Output monitoring (prediction drift)**: Detect when the model's prediction distribution changes — e.g., a fraud model that suddenly classifies 10% of transactions as fraud (vs. the baseline 0.1%).
|
|
20
|
+
|
|
21
|
+
**Pillar 3 — Outcome monitoring (accuracy/concept drift)**: Detect when model accuracy changes on labelled outcomes. Requires ground truth labels, which often arrive with delay (e.g., actual fraud confirmed days after prediction).
|
|
22
|
+
|
|
23
|
+
**Pillar 4 — Operational monitoring**: Latency, throughput, error rate, memory usage. Standard SRE metrics applied to the model serving layer.
|
|
24
|
+
|
|
25
|
+
### Feature Drift Detection
|
|
26
|
+
|
|
27
|
+
Measure drift between training and serving feature distributions using statistical tests:
|
|
28
|
+
|
|
29
|
+
```python
from dataclasses import dataclass

import numpy as np
from scipy import stats


@dataclass
class DriftReport:
    feature: str
    psi: float           # Population Stability Index
    ks_statistic: float  # Kolmogorov-Smirnov statistic
    ks_p_value: float
    is_drifted: bool


def compute_psi(
    expected: np.ndarray,
    actual: np.ndarray,
    buckets: int = 10,
) -> float:
    """Population Stability Index. PSI < 0.1: stable, 0.1-0.2: minor drift, >0.2: significant drift."""
    eps = 1e-6  # Keeps empty buckets from producing log(0) or division by zero
    # Bucket edges come from the expected (training) sample and are reused for
    # the actual sample; np.histogram drops values that fall outside those edges.
    expected_counts, bins = np.histogram(expected, bins=buckets)
    actual_counts, _ = np.histogram(actual, bins=bins)

    expected_pcts = expected_counts / expected_counts.sum() + eps
    actual_pcts = actual_counts / actual_counts.sum() + eps

    return float(np.sum((actual_pcts - expected_pcts) * np.log(actual_pcts / expected_pcts)))


def detect_drift(
    training_values: np.ndarray,
    serving_values: np.ndarray,
    feature_name: str,
    psi_threshold: float = 0.2,
    ks_alpha: float = 0.05,
) -> DriftReport:
    psi = compute_psi(training_values, serving_values)
    ks_stat, ks_p = stats.ks_2samp(training_values, serving_values)
    return DriftReport(
        feature=feature_name,
        psi=psi,
        ks_statistic=float(ks_stat),
        ks_p_value=float(ks_p),
        # Flag drift when either test fires; cast so the field is a plain bool
        is_drifted=bool(psi > psi_threshold or ks_p < ks_alpha),
    )
```
|
|
75
|
+
|
|
76
|
+
**Reference distribution maintenance**: Store feature statistics (mean, std, percentiles, histogram) from the training set as a "reference profile." Compare each day's serving data to this profile. Refresh the reference when the model is retrained.
|
|
77
|
+
|
|
78
|
+
**PSI thresholds** (a widely used rule of thumb; tune per feature once baselines exist):
|
|
79
|
+
- PSI < 0.1: No significant drift — monitor as normal
|
|
80
|
+
- 0.1 ≤ PSI < 0.2: Minor drift — investigate, consider retraining
|
|
81
|
+
- PSI ≥ 0.2: Significant drift — trigger retraining or alert
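
The reference-profile idea and the PSI bands above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the guide's implementation: `build_reference_profile`, `psi_against_profile`, and `classify_psi` are hypothetical names, bins are equal-width over the training range, and serving values outside that range are clamped into the first or last bin (unlike `np.histogram`, which drops them).

```python
import math
from bisect import bisect_right


def build_reference_profile(training_values, n_bins=10):
    """Store interior bin edges and per-bin fractions from the training set."""
    lo, hi = min(training_values), max(training_values)
    width = (hi - lo) / n_bins or 1.0  # Guard against a constant feature
    edges = [lo + i * width for i in range(1, n_bins)]
    return {"edges": edges, "fractions": _bin_fractions(training_values, edges)}


def _bin_fractions(values, edges):
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1  # Out-of-range values land in the end bins
    return [c / len(values) for c in counts]


def psi_against_profile(profile, serving_values, eps=1e-6):
    """PSI between the stored reference fractions and one batch of serving data."""
    actual = _bin_fractions(serving_values, profile["edges"])
    return sum(
        (a + eps - (e + eps)) * math.log((a + eps) / (e + eps))
        for e, a in zip(profile["fractions"], actual)
    )


def classify_psi(psi):
    """Map a PSI value onto the three bands listed above."""
    if psi < 0.1:
        return "stable"
    if psi < 0.2:
        return "minor"
    return "significant"
```

Storing only the edges and fractions keeps the profile small enough to version alongside the model, so each day's serving batch can be compared without reloading the training set.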
|
|
82
|
+
|
|
83
|
+
### Prediction Logging
|
|
84
|
+
|
|
85
|
+
Every prediction made in production should be logged for monitoring and post-hoc analysis:
|
|
86
|
+
|
|
87
|
+
```python
# src/serving/prediction_logger.py
import json
import time
from dataclasses import dataclass, asdict
from typing import Any


@dataclass
class PredictionRecord:
    prediction_id: str    # UUID for correlation
    model_version: str
    timestamp: float
    request_id: str       # Trace ID for distributed tracing
    input_features: dict  # Logged features (scrub PII before logging)
    prediction: Any
    confidence: float
    latency_ms: float


class PredictionLogger:
    def __init__(self, sink):  # sink: Kafka producer, Kinesis, or file
        self.sink = sink

    def log(self, record: PredictionRecord) -> None:
        payload = json.dumps(asdict(record))
        self.sink.send(payload)
```
|
|
113
|
+
|
|
114
|
+
**What to log** (balance observability with privacy/cost):
|
|
115
|
+
- Always: prediction ID, model version, timestamp, prediction value, confidence, latency
|
|
116
|
+
- Feature logging: Log features used for prediction (important for drift detection and debugging)
|
|
117
|
+
- PII scrubbing: Never log raw PII fields; log derived features or anonymised values only
|
|
118
|
+
- Sampling: For very high-throughput systems (> 10K RPS), log a representative sample (1–10%)
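
The PII-scrubbing and sampling rules above might look like the following in practice. This is a hedged sketch: `PII_FIELDS`, `scrub_features`, and `should_log` are hypothetical names, and the field list is a placeholder for whatever the real schema contains.

```python
import hashlib

# Hypothetical PII field names; replace with the fields your schema actually contains.
PII_FIELDS = frozenset({"email", "phone", "full_name"})


def scrub_features(features):
    """Return a copy of the feature dict with raw PII fields removed."""
    return {k: v for k, v in features.items() if k not in PII_FIELDS}


def should_log(prediction_id, sample_rate):
    """Deterministic sampling: hash the ID into [0, 1) and compare to the rate."""
    digest = hashlib.sha256(prediction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Hashing the prediction ID instead of drawing a random number keeps the decision reproducible: every service that sees the same ID makes the same log-or-skip choice, so sampled records can still be joined to labels later.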
|
|
119
|
+
|
|
120
|
+
**Label joining**: When ground truth labels arrive (delayed), join them with prediction logs using the prediction ID to compute accuracy metrics:
|
|
121
|
+
```sql
SELECT
    p.model_version,
    COUNT(*) AS n_predictions,
    AVG(CASE WHEN p.prediction = l.actual_label THEN 1 ELSE 0 END) AS accuracy,
    AVG(p.confidence) AS mean_confidence
FROM predictions p
JOIN labels l ON p.prediction_id = l.prediction_id
WHERE p.timestamp >= NOW() - INTERVAL '7 days'
GROUP BY p.model_version
```
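
The same label join can be sketched in plain Python for cases where prediction logs are processed outside a warehouse. The function name and record layout here are assumptions for illustration; the sketch also reports label coverage, because accuracy computed on a small labelled fraction can be misleading.

```python
from collections import defaultdict


def accuracy_by_model_version(predictions, labels):
    """Join predictions to delayed labels on prediction_id and score each model version.

    predictions: iterable of dicts with prediction_id, model_version, prediction
    labels: dict mapping prediction_id -> actual_label
    """
    stats = defaultdict(lambda: {"n_predictions": 0, "n_labelled": 0, "n_correct": 0})
    for p in predictions:
        s = stats[p["model_version"]]
        s["n_predictions"] += 1
        if p["prediction_id"] in labels:  # Inner-join semantics for accuracy
            s["n_labelled"] += 1
            s["n_correct"] += int(p["prediction"] == labels[p["prediction_id"]])
    return {
        version: {
            "accuracy": s["n_correct"] / s["n_labelled"] if s["n_labelled"] else None,
            "label_coverage": s["n_labelled"] / s["n_predictions"],
        }
        for version, s in stats.items()
    }
```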
|
|
132
|
+
|
|
133
|
+
### Explainability
|
|
134
|
+
|
|
135
|
+
Explainability tools help debug model failures and satisfy regulatory requirements:
|
|
136
|
+
|
|
137
|
+
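**SHAP (SHapley Additive exPlanations)**: Computes feature importance for individual predictions using game-theoretic Shapley values. The model-agnostic `KernelExplainer` works with any model; specialised explainers for tree and deep models are much faster.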
|
|
138
|
+
|
|
139
|
+
```python
import numpy as np
import shap

# Sample a background dataset for explainers that need one
background = X_train[np.random.choice(len(X_train), 100, replace=False)]
explainer = shap.TreeExplainer(model)  # For tree models
# explainer = shap.DeepExplainer(model, background)  # For neural networks
# explainer = shap.KernelExplainer(model.predict_proba, background)  # Model-agnostic

# Explain a single prediction.
# Indexing with [1] selects the positive class and assumes shap_values is a
# per-class list; the return shape varies by SHAP version and model type.
shap_values = explainer.shap_values(X_test[0:1])
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_test[0])

# Explain the entire test set (global feature importance)
shap_values_all = explainer.shap_values(X_test)
shap.summary_plot(shap_values_all[1], X_test)
```
|
|
156
|
+
|
|
157
|
+
**LIME (Local Interpretable Model-agnostic Explanations)**: Fits a simple interpretable model (linear regression) locally around each prediction.
|
|
158
|
+
|
|
159
|
+
```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["legitimate", "fraud"],
    mode="classification",
)

explanation = explainer.explain_instance(
    X_test[0],
    model.predict_proba,
    num_features=10,
)
explanation.show_in_notebook()
```
|
|
176
|
+
|
|
177
|
+
**Integrated Gradients** (for neural networks): Attribution method that satisfies the completeness axiom: attributions sum to the difference between the model output at the input and at the baseline. Available in Captum (PyTorch):
|
|
178
|
+
```python
import torch
from captum.attr import IntegratedGradients

ig = IntegratedGradients(model)
# An all-zeros baseline is a common default; Captum's keyword is `baselines`
attributions = ig.attribute(input_tensor, baselines=torch.zeros_like(input_tensor))
```
|
|
184
|
+
|
|
185
|
+
### Alerting Strategy
|
|
186
|
+
|
|
187
|
+
Define alert thresholds before deployment, not after a production incident:
|
|
188
|
+
|
|
189
|
+
```yaml
# monitoring/alerts.yaml
alerts:
  - name: accuracy_degradation_warning
    metric: val_accuracy_7d_rolling
    condition: "< 0.87"  # Warning: 2pp below target
    severity: warning
    action: page_on_call

  - name: accuracy_degradation_critical
    metric: val_accuracy_7d_rolling
    condition: "< 0.85"  # Critical: at SLA threshold
    severity: critical
    action: page_on_call_and_escalate

  - name: feature_drift_significant
    metric: max_psi_across_features
    condition: "> 0.2"
    severity: warning
    action: notify_ml_team

  - name: prediction_rate_anomaly
    metric: fraud_prediction_rate_1h
    condition: "> 0.05"  # 50x the 0.1% baseline rate
    severity: critical
    action: page_on_call

  - name: serving_latency_breach
    metric: p99_latency_ms
    condition: "> 200"
    severity: warning
    action: notify_ml_team
```
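
A rule file like this needs something to evaluate it. The sketch below is one hypothetical way to do that, assuming the rules have already been parsed into dictionaries (for example by a YAML parser) and that every condition has the form `"<operator> <threshold>"`. `condition_met` and `firing_alerts` are illustrative names, not part of any monitoring product.

```python
import operator

# Comparison operators accepted in a condition string such as "< 0.87"
_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def condition_met(condition, value):
    """Return True when `value` satisfies a condition like '< 0.87' or '> 200'."""
    op_token, threshold = condition.split()
    return _OPS[op_token](value, float(threshold))


def firing_alerts(alerts, metrics):
    """Names of alerts whose metric is present and breaches its condition."""
    return [
        a["name"]
        for a in alerts
        if a["metric"] in metrics and condition_met(a["condition"], metrics[a["metric"]])
    ]
```

Alerts whose metric is missing from `metrics` are skipped rather than fired; a production evaluator would probably treat a missing metric as its own alert.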
|
|
222
|
+
|
|
223
|
+
**Alerting anti-patterns**:
|
|
224
|
+
- Alert fatigue: Too many low-signal alerts cause teams to ignore them. Start with critical-only alerts and add warnings after establishing baselines.
|
|
225
|
+
- Static thresholds for seasonal data: Use rolling baselines that adapt to weekly/seasonal patterns.
|
|
226
|
+
- No runbook: Every alert must have a runbook link: "When this fires, do X, check Y, escalate to Z."
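
As an illustration of the rolling-baseline advice above, the following sketch flags a metric only when it departs from its own recent history. The function name, the 28-observation window, and the z-score threshold of 3 are assumptions chosen for the example.

```python
import statistics


def breaches_rolling_baseline(history, current, window=28, z_threshold=3.0):
    """Flag `current` when it sits more than z_threshold standard deviations
    from the mean of the most recent `window` observations."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # Not enough data to establish a baseline
    mean = statistics.fmean(recent)
    stdev = statistics.stdev(recent)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold
```

To respect weekly seasonality, one option is to pass only observations from the same weekday as `history`, so a Monday value is compared with earlier Mondays.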
|
|
227
|
+
|
|
228
|
+
### Model Monitoring Dashboard
|
|
229
|
+
|
|
230
|
+
A model health dashboard should show at a glance:
|
|
231
|
+
|
|
232
|
+
```
|
|
233
|
+
Model: fraud-detector v2.3.1 | Status: HEALTHY | Updated: 5 minutes ago
|
|
234
|
+
|
|
235
|
+
┌─────────────────┬──────────────────┬──────────────────┐
|
|
236
|
+
│ Accuracy (7d) │ Prediction Rate │ P99 Latency │
|
|
237
|
+
│ 87.3% ✓ │ 0.12% ✓ │ 142ms ✓ │
|
|
238
|
+
│ target: ≥85% │ baseline: 0.1% │ SLA: <200ms │
|
|
239
|
+
└─────────────────┴──────────────────┴──────────────────┘
|
|
240
|
+
|
|
241
|
+
┌──────────────────────────────────────────────────────┐
|
|
242
|
+
│ Feature Drift (PSI) │
|
|
243
|
+
│ transaction_amount: 0.08 ✓ │
|
|
244
|
+
│ merchant_category: 0.12 ⚠ (minor drift) │
|
|
245
|
+
│ user_age_days: 0.04 ✓ │
|
|
246
|
+
└──────────────────────────────────────────────────────┘
|
|
247
|
+
```
|
|
248
|
+
|
|
249
|
+
**Retraining triggers**: Codify when to retrain rather than leaving it to human judgment:
|
|
250
|
+
- Accuracy drops below warning threshold for 48+ consecutive hours
|
|
251
|
+
- PSI > 0.2 on any top-10 feature by SHAP importance
|
|
252
|
+
- Major upstream data source change (schema change, new data source)
|
|
253
|
+
- Scheduled retraining on a fixed cadence (monthly for most models)
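
The triggers above can be codified roughly as follows. `ModelHealth` and `retraining_reasons` are hypothetical names, and the 30-day cadence is an assumed reading of "monthly".

```python
from dataclasses import dataclass


@dataclass
class ModelHealth:
    hours_below_accuracy_warning: float  # Consecutive hours under the warning threshold
    top_feature_psi: dict                # PSI per top-10 feature (by SHAP importance)
    upstream_schema_changed: bool
    days_since_last_training: int


def retraining_reasons(health, psi_threshold=0.2, max_age_days=30):
    """Return the list of triggers that currently call for retraining."""
    reasons = []
    if health.hours_below_accuracy_warning >= 48:
        reasons.append("accuracy_below_warning_48h")
    drifted = sorted(f for f, psi in health.top_feature_psi.items() if psi > psi_threshold)
    if drifted:
        reasons.append("feature_drift:" + ",".join(drifted))
    if health.upstream_schema_changed:
        reasons.append("upstream_data_change")
    if health.days_since_last_training >= max_age_days:
        reasons.append("scheduled_cadence")
    return reasons
```

Returning the reasons rather than a bare boolean leaves an audit trail of why each retraining run was started.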
|
|
@@ -0,0 +1,216 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: ml-project-structure
|
|
3
|
+
description: Standard ML project directory layout covering src/data, src/models, src/training, src/serving, notebooks, configs, and model artifact storage
|
|
4
|
+
topics: [ml, project-structure, layout, organization, artifacts, notebooks]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
ML projects accumulate files faster than almost any other software domain: datasets, model checkpoints, experiment configs, notebooks, evaluation reports, and serving code. Without a deliberate directory structure, a project becomes disorganised within weeks and new team members struggle to find their way around it. A well-structured ML project separates concerns clearly: source code from notebooks, training from serving, configs from code, and tracked artifacts from ephemeral outputs.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
A standard ML project separates source code (`src/`), exploratory notebooks (`notebooks/`), training configurations (`configs/`), and model artifacts (`models/`). Within `src/`, separate data loading (`src/data/`), model architectures (`src/models/`), training logic (`src/training/`), and serving code (`src/serving/`). Keep large artifacts (datasets, checkpoints) out of git using `.gitignore` and DVC or object storage. The structure should be navigable to a new team member within five minutes.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Top-Level Directory Structure
|
|
16
|
+
|
|
17
|
+
```
|
|
18
|
+
project-root/
|
|
19
|
+
├── configs/ # All experiment and model configs (YAML/TOML)
|
|
20
|
+
├── data/ # Data directory (gitignored content, DVC-tracked)
|
|
21
|
+
│ ├── raw/ # Immutable raw data as received from source
|
|
22
|
+
│ ├── processed/ # Cleaned, transformed datasets
|
|
23
|
+
│ └── splits/ # Train/val/test split files (CSV/JSON of IDs)
|
|
24
|
+
├── docs/ # Architecture decisions, dataset cards, model cards
|
|
25
|
+
├── models/ # Model artifact storage (gitignored; object storage backed)
|
|
26
|
+
│ ├── checkpoints/ # Training checkpoints (epoch-N.pt)
|
|
27
|
+
│ └── registry/ # Production-promoted model versions
|
|
28
|
+
├── notebooks/ # Jupyter notebooks for exploration (outputs cleared before commit)
|
|
29
|
+
├── reports/ # Evaluation reports, figures, experiment summaries
|
|
30
|
+
├── scripts/ # One-off utility scripts (not part of the pipeline)
|
|
31
|
+
├── src/ # All production source code
|
|
32
|
+
│ ├── data/ # Dataset classes, loaders, preprocessing
|
|
33
|
+
│ ├── models/ # Model architecture definitions
|
|
34
|
+
│ ├── training/ # Training loops, loss functions, callbacks
|
|
35
|
+
│ ├── evaluation/ # Metrics, evaluation runners, result serialisation
|
|
36
|
+
│ └── serving/ # Inference pipelines, API handlers, preprocessing wrappers
|
|
37
|
+
├── tests/ # Unit and integration tests
|
|
38
|
+
├── .dvc/ # DVC metadata (committed to git)
|
|
39
|
+
├── .gitignore # Excludes data/, models/, __pycache__, .env
|
|
40
|
+
├── pyproject.toml # Project metadata and dependencies (Poetry)
|
|
41
|
+
├── Makefile # Task runner: train, evaluate, serve, test
|
|
42
|
+
└── README.md # Project overview, setup, and usage
|
|
43
|
+
```
|
|
44
|
+
|
|
45
|
+
### `src/data/` — Data Loading and Preprocessing
|
|
46
|
+
|
|
47
|
+
This directory contains all code that transforms raw data into model-ready tensors:
|
|
48
|
+
|
|
49
|
+
```
|
|
50
|
+
src/data/
|
|
51
|
+
├── __init__.py
|
|
52
|
+
├── dataset.py # PyTorch Dataset or TF Dataset class
|
|
53
|
+
├── datamodule.py # LightningDataModule or equivalent orchestrator
|
|
54
|
+
├── transforms.py # Preprocessing transforms (normalize, tokenize, augment)
|
|
55
|
+
├── augmentation.py # Training-time data augmentation (separated from eval transforms)
|
|
56
|
+
├── collate.py # Custom batch collation functions
|
|
57
|
+
└── utils.py # Data utilities (download, checksum, split generation)
|
|
58
|
+
```
|
|
59
|
+
|
|
60
|
+
Key rules:
|
|
61
|
+
- **Separate training transforms from eval transforms** — augmentation must not be applied at inference
|
|
62
|
+
- Dataset classes must accept a `split` parameter and behave correctly for each split
|
|
63
|
+
- All preprocessing must be reproducible and deterministic at inference time
|
|
64
|
+
- Cache processed data to avoid recomputation on each run
|
|
65
|
+
|
|
66
|
+
### `src/models/` — Architecture Definitions
|
|
67
|
+
|
|
68
|
+
Contains model class definitions only — no training logic, no loss functions:
|
|
69
|
+
|
|
70
|
+
```
|
|
71
|
+
src/models/
|
|
72
|
+
├── __init__.py
|
|
73
|
+
├── backbone.py # Feature extractor (ResNet, ViT, BERT, etc.)
|
|
74
|
+
├── head.py # Task-specific head (classification, regression, generation)
|
|
75
|
+
├── model.py # Composed full model
|
|
76
|
+
└── components/ # Reusable building blocks (attention, MLP, norm layers)
    ├── attention.py
    └── ffn.py
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
Key rules:
|
|
82
|
+
- Models are pure computation graphs — no file I/O, no training state
|
|
83
|
+
- Accept hyperparameters via constructor, not globals
|
|
84
|
+
- Provide a `from_config(cfg)` class method for config-driven instantiation
|
|
85
|
+
- Serialise with `state_dict()` only — never pickle entire model objects
|
|
86
|
+
|
|
87
|
+
### `src/training/` — Training Logic
|
|
88
|
+
|
|
89
|
+
```
|
|
90
|
+
src/training/
|
|
91
|
+
├── __init__.py
|
|
92
|
+
├── trainer.py # Training loop (or LightningModule)
|
|
93
|
+
├── loss.py # Loss functions
|
|
94
|
+
├── optimizer.py # Optimizer and scheduler builders
|
|
95
|
+
├── callbacks.py # Callbacks (early stopping, logging, checkpoint saving)
|
|
96
|
+
└── utils.py # Gradient clipping, mixed precision helpers
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
The training loop is separate from the model. A model knows how to compute predictions; the trainer knows how to update weights. This separation enables:
|
|
100
|
+
- Testing model forward passes independently of training
|
|
101
|
+
- Swapping training strategies (single GPU, DDP, FSDP) without changing the model
|
|
102
|
+
- Using the same model class for training and serving
|
|
103
|
+
|
|
104
|
+
### `src/evaluation/` — Metrics and Evaluation Runners
|
|
105
|
+
|
|
106
|
+
```
|
|
107
|
+
src/evaluation/
|
|
108
|
+
├── __init__.py
|
|
109
|
+
├── metrics.py # Metric computation (accuracy, F1, AUC, etc.)
|
|
110
|
+
├── evaluator.py # Evaluation loop (runs model on eval set, collects predictions)
|
|
111
|
+
├── slice_analysis.py # Per-slice performance breakdown
|
|
112
|
+
└── reports.py # Result serialisation and report generation
|
|
113
|
+
```
|
|
114
|
+
|
|
115
|
+
Evaluation code should run identically offline and online. Do not inline evaluation logic in the training loop, because that makes it impossible to re-evaluate a checkpoint independently.
|
|
116
|
+
|
|
117
|
+
### `src/serving/` — Inference and API
|
|
118
|
+
|
|
119
|
+
```
|
|
120
|
+
src/serving/
|
|
121
|
+
├── __init__.py
|
|
122
|
+
├── predictor.py # Prediction class (loads model, runs inference)
|
|
123
|
+
├── preprocessing.py # Request preprocessing (mirrors training eval transforms)
|
|
124
|
+
├── postprocessing.py # Response postprocessing (calibration, thresholding)
|
|
125
|
+
├── api.py # FastAPI/Flask endpoint definitions
|
|
126
|
+
└── handler.py # TorchServe or Triton handler
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
The `Predictor` class is the contract between the model and the serving infrastructure. It:
|
|
130
|
+
- Loads a model from a path or registry reference
|
|
131
|
+
- Exposes a `predict(inputs)` method with documented input/output types
|
|
132
|
+
- Uses the exact same preprocessing as training evaluation transforms
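
A framework-free sketch of that contract is shown below. It is illustrative only: `eval_transform` stands in for the real evaluation-time preprocessing, the model is any callable, and `loader` is a hypothetical hook that hides the framework-specific load call.

```python
from typing import Callable, List, Sequence


def eval_transform(raw: Sequence[float]) -> List[float]:
    """Stand-in for the evaluation-time preprocessing shared with training."""
    return [float(x) / 255.0 for x in raw]


class Predictor:
    """Contract between a loaded model and the serving layer."""

    def __init__(
        self,
        model: Callable[[List[float]], float],
        preprocess: Callable[[Sequence[float]], List[float]] = eval_transform,
    ):
        self._model = model
        self._preprocess = preprocess  # Same function used during evaluation

    @classmethod
    def from_path(cls, path: str, loader: Callable[[str], Callable]) -> "Predictor":
        """Build a predictor from a path; `loader` hides the framework-specific load call."""
        return cls(loader(path))

    def predict(self, inputs: Sequence[Sequence[float]]) -> List[float]:
        """Score a batch of raw inputs and return one score per input."""
        return [self._model(self._preprocess(x)) for x in inputs]
```

Importing the transform from one shared module, rather than reimplementing it in `src/serving/`, is what keeps training and serving preprocessing from drifting apart.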
|
|
133
|
+
|
|
134
|
+
### `notebooks/` — Exploratory Analysis
|
|
135
|
+
|
|
136
|
+
```
|
|
137
|
+
notebooks/
|
|
138
|
+
├── 01-data-exploration.ipynb # EDA, data quality checks
|
|
139
|
+
├── 02-baseline-model.ipynb # Baseline experiments
|
|
140
|
+
├── 03-feature-engineering.ipynb
|
|
141
|
+
└── 04-error-analysis.ipynb # Post-training error analysis
|
|
142
|
+
```
|
|
143
|
+
|
|
144
|
+
Rules for notebooks:
|
|
145
|
+
- **Clear outputs before committing** — use `nbstripout` as a pre-commit hook
|
|
146
|
+
- Number notebooks in chronological/logical order
|
|
147
|
+
- Notebooks document exploration, not production logic
|
|
148
|
+
- Any reusable code found in notebooks gets refactored into `src/` with tests
|
|
149
|
+
|
|
150
|
+
### `configs/` — Experiment Configuration
|
|
151
|
+
|
|
152
|
+
```
|
|
153
|
+
configs/
|
|
154
|
+
├── base.yaml # Default config merged into all experiments
|
|
155
|
+
├── model/
|
|
156
|
+
│ ├── small.yaml
|
|
157
|
+
│ └── large.yaml
|
|
158
|
+
├── data/
|
|
159
|
+
│ ├── dev.yaml # Small dataset for fast iteration
|
|
160
|
+
│ └── full.yaml # Full production dataset
|
|
161
|
+
└── training/
    ├── debug.yaml # 1 epoch, no logging, fast feedback
    └── production.yaml # Full training run settings
|
|
164
|
+
```
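
The "merged into all experiments" behaviour of `base.yaml` amounts to a recursive dictionary merge. The sketch below shows only the merge semantics, assuming the YAML files have already been parsed into dictionaries; `merge_configs` is a hypothetical helper, and config libraries such as Hydra offer composition of this kind out of the box.

```python
import copy


def merge_configs(base, *overrides):
    """Recursively merge override dicts into a copy of `base`; later overrides win."""
    merged = copy.deepcopy(base)
    for override in overrides:
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_configs(merged[key], value)
            else:
                merged[key] = copy.deepcopy(value)
    return merged
```

For example, merging a base config with a debug override that sets only `training.epochs` keeps every other base setting intact and leaves the base dictionary unmodified.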
|
|
165
|
+
|
|
166
|
+
### `models/` — Artifact Storage
|
|
167
|
+
|
|
168
|
+
Large binary artifacts are not stored in git:
|
|
169
|
+
- Checkpoints and production models live in `models/` but are gitignored
|
|
170
|
+
- Back `models/` with object storage: S3, GCS, Azure Blob Storage
|
|
171
|
+
- Use DVC to track artifact versions alongside the code:
|
|
172
|
+
|
|
173
|
+
```bash
|
|
174
|
+
dvc add models/registry/v1.2.0/model.pt
|
|
175
|
+
git add models/registry/v1.2.0/model.pt.dvc
|
|
176
|
+
git commit -m "feat: register model v1.2.0"
|
|
177
|
+
dvc push # Pushes binary to remote storage
|
|
178
|
+
```
|
|
179
|
+
|
|
180
|
+
Teammates restore the artifact with `dvc pull` — they get the exact binary referenced by the `.dvc` pointer in git.
|
|
181
|
+
|
|
182
|
+
### `.gitignore` Essentials for ML Projects
|
|
183
|
+
|
|
184
|
+
```gitignore
|
|
185
|
+
# Data
|
|
186
|
+
data/raw/
|
|
187
|
+
data/processed/
|
|
188
|
+
data/splits/*.csv
|
|
189
|
+
|
|
190
|
+
# Model artifacts
|
|
191
|
+
models/checkpoints/
|
|
192
|
+
models/registry/
|
|
193
|
+
|
|
194
|
+
# Notebook checkpoints (notebooks themselves are committed with outputs stripped)
.ipynb_checkpoints/
|
|
196
|
+
|
|
197
|
+
# Python
|
|
198
|
+
__pycache__/
|
|
199
|
+
*.pyc
|
|
200
|
+
.venv/
|
|
201
|
+
*.egg-info/
|
|
202
|
+
|
|
203
|
+
# Experiment tracking
|
|
204
|
+
mlruns/
|
|
205
|
+
wandb/
|
|
206
|
+
|
|
207
|
+
# Environment
|
|
208
|
+
.env
|
|
209
|
+
*.env
|
|
210
|
+
```
|
|
211
|
+
|
|
212
|
+
Use `nbstripout` to automatically strip notebook outputs:
|
|
213
|
+
```bash
|
|
214
|
+
pip install nbstripout
|
|
215
|
+
nbstripout --install # Installs as git filter
|
|
216
|
+
```
|
|
@@ -0,0 +1,138 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: ml-requirements
|
|
3
|
+
description: Model performance metrics (accuracy, latency, throughput), business KPIs, fairness/bias requirements, and SLA definitions for ML systems
|
|
4
|
+
topics: [ml, requirements, metrics, fairness, bias, sla, kpi]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
ML requirements differ from traditional software requirements because correctness is probabilistic, not absolute. Before writing a single line of training code, define the target metrics, their measurement methodology, and the business KPIs they serve. Ambiguous requirements — "make the model accurate" — are the root cause of most ML project failures. A requirements document for an ML system must specify numeric thresholds, measurement conditions, and what constitutes an acceptable production deployment.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
ML requirements must specify concrete numeric thresholds for model performance (accuracy, latency, throughput), tie those metrics to business KPIs, define fairness and bias constraints across protected groups, and establish SLAs for production serving. Requirements without measurement methodology are aspirations, not requirements. Capture them in a Model Requirements Document before training begins and treat them as acceptance criteria for production deployment.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Performance Metrics by Task Type
|
|
16
|
+
|
|
17
|
+
Different ML tasks have canonical metrics. Use the right metric for the task — do not default to accuracy for all problems:
|
|
18
|
+
|
|
19
|
+
**Classification**
|
|
20
|
+
- **Accuracy**: Fraction of correct predictions. Misleading for class-imbalanced datasets (a fraud detector that always predicts "not fraud" achieves 99.9% accuracy if fraud is 0.1% of the data).
|
|
21
|
+
- **Precision**: Of all positive predictions, how many are actually positive. Optimise when false positives are costly (spam filter flagging legitimate email).
|
|
22
|
+
- **Recall (Sensitivity)**: Of all actual positives, how many were predicted positive. Optimise when false negatives are costly (cancer screening missing a case).
|
|
23
|
+
- **F1 Score**: Harmonic mean of precision and recall. Good single metric when both matter equally.
|
|
24
|
+
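- **ROC-AUC**: Area under the Receiver Operating Characteristic curve. Threshold-independent, useful for comparing models. Its value does not change with the class ratio, so it can look optimistic on heavily imbalanced data, where many false positives barely move the false positive rate.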
|
|
25
|
+
- **PR-AUC**: Area under the Precision-Recall curve. Better than ROC-AUC for highly imbalanced datasets.
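
The accuracy pitfall described above is easy to reproduce from confusion-matrix counts. This is a minimal sketch; the function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": (tp + tn) / total, "precision": precision, "recall": recall, "f1": f1}


# 100,000 transactions, 0.1% fraud, and a model that always predicts "not fraud"
always_negative = classification_metrics(tp=0, fp=0, fn=100, tn=99_900)
```

Here `always_negative` reports 99.9% accuracy alongside zero recall and zero F1, which is why accuracy alone is not an acceptable requirement for imbalanced problems.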
|
|
26
|
+
|
|
27
|
+
**Regression**
|
|
28
|
+
- **MAE (Mean Absolute Error)**: Average absolute error. Robust to outliers. Easy to interpret in the target unit.
|
|
29
|
+
- **RMSE (Root Mean Squared Error)**: Penalises large errors more than MAE. Use when large errors are disproportionately harmful.
|
|
30
|
+
- **MAPE (Mean Absolute Percentage Error)**: Scale-independent. Problematic when targets near zero.
|
|
31
|
+
- **R² (Coefficient of Determination)**: Variance explained by the model. Context-dependent — R²=0.9 may be poor for weather forecasting but excellent for pricing.
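
A small sketch makes the differences between MAE, RMSE, and MAPE concrete. The function name is hypothetical, and returning `None` for MAPE when any target is zero is an assumption made to avoid a division by zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, RMSE and MAPE for paired sequences of targets and predictions."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined when any target is zero, so report None in that case
    mape = (
        None
        if any(t == 0 for t in y_true)
        else sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    )
    return {"mae": mae, "rmse": rmse, "mape": mape}
```

With three perfect predictions and one that is off by 40 on targets of 100, MAE is 10 while RMSE is 20: the single large error dominates RMSE, which is the behaviour to choose when large errors are disproportionately harmful.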
|
|
32
|
+
|
|
33
|
+
**Ranking / Recommendation**
|
|
34
|
+
- **NDCG (Normalized Discounted Cumulative Gain)**: Relevance-weighted ranking quality. Standard for search and recommendation.
|
|
35
|
+
- **MRR (Mean Reciprocal Rank)**: Average of 1/rank of first relevant result.
|
|
36
|
+
- **Hit Rate @ K**: Fraction of users for whom a relevant item appears in top-K recommendations.
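
The ranking metrics above can be sketched per query as follows; MRR and Hit Rate @ K are then the means of these values over all queries or users. The function names are hypothetical, and this NDCG uses linear gain (the relevance grade itself) rather than the exponential-gain variant.

```python
import math


def reciprocal_rank(relevances):
    """1/rank of the first relevant item in a ranked relevance list, else 0."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def hit_at_k(relevances, k):
    """1.0 if any of the top-k items is relevant, else 0.0."""
    return 1.0 if any(rel > 0 for rel in relevances[:k]) else 0.0


def ndcg_at_k(relevances, k):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

Each function takes the relevance grades of one result list in ranked order, so a list such as `[0, 0, 1, 0]` means only the third result was relevant.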
|
|
37
|
+
|
|
38
|
+
**Generation (LLM/NLG)**
|
|
39
|
+
- **BLEU / ROUGE**: Reference-based n-gram overlap. Weak proxy for quality — supplement with human evaluation.
|
|
40
|
+
- **Perplexity**: Exponentiated average negative log-likelihood per token on a held-out corpus. Lower is better; useful for comparing language models that share a tokenizer.
|
|
41
|
+
- **Human evaluation**: Win rate against baseline, Likert scale ratings. Required for production quality gating.
|
|
42
|
+
|
|
43
|
+
### Business KPI Alignment
|
|
44
|
+
|
|
45
|
+
Every model metric must map to a business KPI. Without this mapping, teams optimise metrics that do not move the needle:
|
|
46
|
+
|
|
47
|
+
| Model Metric | Business KPI | Notes |
|
|
48
|
+
|---|---|---|
|
|
49
|
+
| Fraud detection recall | Revenue protected from fraud | 1% recall improvement may not justify infra cost |
|
|
50
|
+
| Recommendation CTR | Gross Merchandise Value | CTR can rise while GMV falls (clicks on cheap items) |
|
|
51
|
+
| Search NDCG | Query success rate, conversion | Offline NDCG and online conversion often diverge |
|
|
52
|
+
| Churn prediction AUC | Customer retention rate | A more accurate model does not raise retention unless the retention treatment itself is effective |
|
|
53
|
+
|
|
54
|
+
Document this mapping explicitly. When offline metrics improve but the business metric does not, the mapping is wrong.
|
|
55
|
+
|
|
56
|
+
### Latency Requirements
|
|
57
|
+
|
|
58
|
+
Latency requirements are determined by the use case, not the model team's preferences:
|
|
59
|
+
|
|
60
|
+
- **Interactive / real-time**: User-facing features require P95 latency under 100ms and P99 under 500ms. Recommendation, search, and content ranking fall here.
|
|
61
|
+
- **Near-real-time**: Fraud detection at checkout tolerates 200–500ms P95.
|
|
62
|
+
- **Batch / async**: Offline scoring pipelines have no strict latency requirement but do have throughput requirements (e.g., score 10M records in 4 hours).
|
|
63
|
+
|
|
64
|
+
Define latency budgets from the user experience backward:
|
|
65
|
+
1. Total page load budget: 2000ms
|
|
66
|
+
2. Backend API budget: 500ms
|
|
67
|
+
3. ML inference budget: 100ms (within the API budget)
|
|
68
|
+
4. Model must fit within that budget at P99 under peak load
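
The budget chain above can be checked against load-test data. This is a minimal sketch assuming latencies are collected as a list of milliseconds and using the nearest-rank percentile definition; the names `BUDGETS_MS`, `percentile`, and `within_inference_budget` are hypothetical.

```python
import math

# Budgets from the list above, in milliseconds.
BUDGETS_MS = {"page_load": 2000, "backend_api": 500, "ml_inference": 100}


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def within_inference_budget(latencies_ms: list[float]) -> bool:
    """True when P99 latency fits inside the ML inference budget."""
    return percentile(latencies_ms, 99) <= BUDGETS_MS["ml_inference"]


# Hypothetical peak-load sample: 98 fast requests and 2 slow ones.
latencies = [40.0] * 98 + [180.0] * 2
print(percentile(latencies, 95), percentile(latencies, 99))
print(within_inference_budget(latencies))
```

In this made-up sample the P95 looks comfortable at 40ms, but the P99 is 180ms, so the model misses the 100ms budget. Tail percentiles need to be measured directly rather than inferred from the median or P95.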
|
|
69
|
+
|
|
70
|
+
**Throughput requirements** are independent of latency: "The model must score 50,000 requests per second at peak." Throughput is met by horizontal scaling; latency is met by model optimisation (quantisation, distillation, hardware selection).
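
A rough capacity estimate follows from the throughput target. The sketch below assumes each replica sustains a fixed request rate measured in a load test and that replicas run at a target utilisation below 100% to leave headroom; the per-replica rate, the utilisation figure, and both function names are hypothetical.

```python
import math


def replicas_needed(peak_rps: float, per_replica_rps: float, target_utilisation: float) -> int:
    """Replica count that serves peak_rps while staying at or below the target utilisation."""
    return math.ceil(peak_rps / (per_replica_rps * target_utilisation))


def required_batch_rate(records: int, hours: float) -> float:
    """Records per second needed to finish a batch job inside its window."""
    return records / (hours * 3600)


# 50,000 RPS at peak; hypothetical replica sustaining 1,200 RPS, run at 70% utilisation.
print(replicas_needed(50_000, 1_200, 0.7))
# The batch example from earlier in this section: 10M records in 4 hours.
print(round(required_batch_rate(10_000_000, 4), 1))
```

Under these assumptions the peak needs 60 replicas, and the batch job needs roughly 694 records per second.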
|
|
71
|
+
|
|
72
|
+
### Fairness and Bias Requirements
|
|
73
|
+
|
|
74
|
+
Fairness requirements must be defined before training, not audited afterward:
|
|
75
|
+
|
|
76
|
+
**Protected attributes**: Race, gender, age, disability status, national origin, religion. Model inputs should not include protected attributes directly; proxy features (zip code, name) may encode them.
|
|
77
|
+
|
|
78
|
+
**Fairness metrics**:
|
|
79
|
+
- **Demographic parity**: Equal positive prediction rate across groups. `P(ŷ=1 | A=0) = P(ŷ=1 | A=1)`
|
|
80
|
+
- **Equalized odds**: Equal TPR and FPR across groups.
|
|
81
|
+
- **Calibration parity**: Predicted probabilities match observed frequencies equally across groups.
|
|
82
|
+
- **Individual fairness**: Similar individuals receive similar predictions.
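
The group-level metrics can be computed from per-group prediction rates. This is a minimal sketch assuming binary labels and predictions and a single group attribute; `group_rates` and `gap` are hypothetical helper names, and the data is made up to show that one metric can pass while another fails.

```python
def group_rates(y_true: list[int], y_pred: list[int], groups: list[str]) -> dict[str, dict[str, float]]:
    """Per-group positive prediction rate, true positive rate, and false positive rate."""
    stats = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, a in zip(y_true, y_pred, groups) if a == g]
        pos = [p for t, p in rows if t == 1]
        neg = [p for t, p in rows if t == 0]
        stats[g] = {
            "positive_rate": sum(p for _, p in rows) / len(rows),
            "tpr": sum(pos) / len(pos) if pos else float("nan"),
            "fpr": sum(neg) / len(neg) if neg else float("nan"),
        }
    return stats


def gap(stats: dict[str, dict[str, float]], key: str) -> float:
    """Largest difference in one rate across groups."""
    values = [s[key] for s in stats.values()]
    return max(values) - min(values)


# Hypothetical data: both groups receive positives at the same rate,
# but the model is far less accurate for group A.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
stats = group_rates(y_true, y_pred, groups)
print(gap(stats, "positive_rate"), gap(stats, "tpr"), gap(stats, "fpr"))
```

Here the demographic parity gap is 0.0 while the TPR and FPR gaps are both 0.5, so demographic parity holds and equalized odds does not. A requirement stated as a maximum FPR gap would compare `gap(stats, "fpr")` against its threshold.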
|
|
83
|
+
|
|
84
|
+
**Fairness-accuracy tradeoff**: Several fairness definitions cannot hold at once. When base rates differ across groups and the model is not a perfect predictor, calibration and equalized odds cannot both be satisfied (the fairness impossibility results). Choose the fairness constraint that aligns with the legal and ethical context, then optimise accuracy subject to it.
|
|
85
|
+
|
|
86
|
+
**Requirement format**: "Model's false positive rate for group A must not exceed the false positive rate for group B by more than 5 percentage points."
|
|
87
|
+
|
|
88
|
+
### Model Monitoring SLAs
|
|
89
|
+
|
|
90
|
+
Define monitoring SLAs as part of requirements:
|
|
91
|
+
|
|
92
|
+
- **Accuracy SLA**: "Model accuracy must remain above 85% on the weekly validation set. Alert if it drops below 87% (warning threshold) or 85% (critical threshold)."
|
|
93
|
+
- **Drift SLA**: "Input feature distribution shift (PSI > 0.2) triggers model retraining within 48 hours."
|
|
94
|
+
- **Prediction latency SLA**: "P99 inference latency must remain under 200ms. Alert at 150ms."
|
|
95
|
+
- **Availability SLA**: "Model serving endpoint must maintain 99.9% uptime (43 minutes downtime/month)."
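
Two of the SLAs above rest on small calculations. The sketch below computes the Population Stability Index over fixed bin edges and converts an uptime percentage into a downtime allowance. The bin edges, the `eps` floor for empty bins, the 30-day month, and the function names are assumptions made for illustration.

```python
import math


def psi(expected: list[float], actual: list[float], edges: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a live sample."""

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below the value.
            counts[sum(1 for e in edges if v >= e)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def allowed_downtime_minutes(uptime: float, days: int = 30) -> float:
    """Downtime budget implied by an uptime target over a period of `days`."""
    return (1 - uptime) * days * 24 * 60


# Hypothetical feature: training data is mostly low values, live traffic mostly high.
training = [5.0] * 50 + [15.0] * 30 + [25.0] * 20
live = [5.0] * 20 + [15.0] * 30 + [25.0] * 50
drift = psi(training, live, edges=[10.0, 20.0])
print(round(drift, 3), drift > 0.2)
print(round(allowed_downtime_minutes(0.999), 1))
```

In this made-up example the PSI is about 0.55, well above the 0.2 threshold, so the drift SLA would trigger retraining. A 99.9% target over 30 days allows 43.2 minutes of downtime, matching the figure quoted in the availability SLA.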
|
|
96
|
+
|
|
97
|
+
### Model Requirements Document Template
|
|
98
|
+
|
|
99
|
+
```markdown
|
|
100
|
+
# Model Requirements: [Model Name]
|
|
101
|
+
|
|
102
|
+
## Business Context
|
|
103
|
+
- Business problem:
|
|
104
|
+
- KPI being optimised:
|
|
105
|
+
- KPI owner:
|
|
106
|
+
|
|
107
|
+
## Performance Requirements
|
|
108
|
+
- Primary metric: [metric] >= [threshold] on [evaluation set]
|
|
109
|
+
- Secondary metric: [metric] >= [threshold]
|
|
110
|
+
- Baseline to beat: [current rule-based / previous model performance]
|
|
111
|
+
|
|
112
|
+
## Latency / Throughput
|
|
113
|
+
- P50 latency: <= [X]ms
|
|
114
|
+
- P99 latency: <= [X]ms
|
|
115
|
+
- Throughput: >= [X] RPS at peak load
|
|
116
|
+
|
|
117
|
+
## Fairness Requirements
|
|
118
|
+
- Protected groups: [list]
|
|
119
|
+
- Fairness metric: [metric] gap <= [threshold] across groups
|
|
120
|
+
|
|
121
|
+
## Data Requirements
|
|
122
|
+
- Training data: [source, size, date range, labeling methodology]
|
|
123
|
+
- Minimum training set size: [N]
|
|
124
|
+
- Label quality: [agreement rate, labeling error budget]
|
|
125
|
+
|
|
126
|
+
## Monitoring SLAs
|
|
127
|
+
- Accuracy degradation alert: < [threshold] on weekly eval
|
|
128
|
+
- Feature drift alert: PSI > [threshold]
|
|
129
|
+
- Retraining trigger: [condition]
|
|
130
|
+
|
|
131
|
+
## Acceptance Criteria
|
|
132
|
+
- [ ] Primary metric exceeds threshold on holdout set
|
|
133
|
+
- [ ] Fairness constraints satisfied
|
|
134
|
+
- [ ] P99 latency within budget under load test
|
|
135
|
+
- [ ] No critical findings in bias audit
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
This document is the acceptance test for model deployment. If the model does not satisfy it, it does not go to production.
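
One way to make the acceptance criteria executable is a small gate that compares measured values against the documented thresholds. This is a sketch only: the `Check` class, the `deployment_gate` function, and every number in the example are hypothetical placeholders, not values from the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One acceptance criterion: a measured value compared against a threshold."""

    name: str
    measured: float
    threshold: float
    higher_is_better: bool = True

    def passed(self) -> bool:
        if self.higher_is_better:
            return self.measured >= self.threshold
        return self.measured <= self.threshold


def deployment_gate(checks: list[Check]) -> tuple[bool, list[str]]:
    """Return (approved, names of failed checks); any failure blocks deployment."""
    failures = [c.name for c in checks if not c.passed()]
    return (not failures, failures)


# Hypothetical evaluation results for a candidate model.
checks = [
    Check("primary metric on holdout", measured=0.91, threshold=0.88),
    Check("FPR gap across groups", measured=0.07, threshold=0.05, higher_is_better=False),
    Check("P99 latency under load (ms)", measured=92.0, threshold=100.0, higher_is_better=False),
]
approved, failed = deployment_gate(checks)
print(approved, failed)
```

In this example the candidate meets its accuracy and latency thresholds but is still rejected, because the fairness gap exceeds its limit. Items such as the bias audit remain manual sign-offs and are not covered by a numeric gate like this one.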
|