@zigrivers/scaffold 3.8.0 → 3.9.1
- package/README.md +73 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/dist/cli/commands/init.d.ts +13 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +122 -2
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +120 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +864 -48
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +53 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +166 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +33 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -2
- package/dist/e2e/project-type-overlays.test.js +499 -33
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +10 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +17 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +75 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +167 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +13 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +17 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
@@ -0,0 +1,285 @@
---
name: ml-experiment-tracking
description: MLflow and Weights & Biases integration, artifact storage, experiment run comparison, and hyperparameter sweep management
topics: [ml, experiment-tracking, mlflow, wandb, artifacts, sweeps, reproducibility]
---

Without experiment tracking, ML development is archaeology: "which config produced that result?" is answered by digging through notebook history, chat logs, and failing memory. Experiment tracking tools are version control for training runs: every metric, every hyperparameter, every artifact, linked to the code that produced it. The discipline of logging everything during training pays dividends when a stakeholder asks "how does this model compare to what we had six months ago?"

## Summary

Use MLflow (self-hosted, open source) or Weights & Biases (cloud, more feature-rich) to track every training run. Log hyperparameters, metrics at each epoch, model artifacts, and the git commit SHA. Store large artifacts (checkpoints, datasets) in object storage that the experiment tracker references. Use sweep tooling (W&B Sweeps, or MLflow combined with Optuna or Hyperopt) for systematic hyperparameter search rather than manual iteration.

## Deep Guidance

### MLflow Integration

MLflow is a widely used open-source tool for experiment tracking. It runs locally or on a managed server:

```bash
# Start local tracking server (stores runs in ./mlruns)
mlflow server --host 0.0.0.0 --port 5000

# Or use the SQLite backend for better performance
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlartifacts \
  --host 0.0.0.0 --port 5000
```

**Instrument training code**:
```python
import subprocess

import mlflow
import mlflow.pytorch
from omegaconf import DictConfig, OmegaConf

# Set tracking server
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("fraud-detector")

def train(cfg: DictConfig) -> dict:
    # model, train_epoch, and evaluate are defined elsewhere in the project
    with mlflow.start_run(run_name=cfg.experiment.name) as run:
        # Log all hyperparameters from config
        # (nested sections are stored as stringified dicts unless flattened first)
        mlflow.log_params(OmegaConf.to_container(cfg, resolve=True))

        # Log git commit for reproducibility
        git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
        mlflow.set_tag("git_commit", git_sha)
        mlflow.set_tag("model_type", cfg.model.type)

        for epoch in range(cfg.training.epochs):
            train_metrics = train_epoch(...)
            val_metrics = evaluate(...)

            # Log metrics with step (epoch) for time-series view
            mlflow.log_metrics({
                "train_loss": train_metrics["loss"],
                "val_loss": val_metrics["loss"],
                "val_auc": val_metrics["auc"],
            }, step=epoch)

        # Log best model
        mlflow.pytorch.log_model(
            model,
            artifact_path="model",
            registered_model_name="fraud-detector",  # Register in Model Registry
        )

        # Log additional artifacts
        mlflow.log_artifact("configs/train.yaml")
        mlflow.log_artifact("reports/eval_report.json")

        return {"run_id": run.info.run_id, **val_metrics}
```
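
The run-comparison query later in this guide reads parameters under dotted keys such as `optimizer.lr`, which suggests the nested config is flattened before logging. A minimal sketch of that flattening step, assuming a plain nested mapping (the helper name `flatten_params` is hypothetical and not part of MLflow or OmegaConf):

```python
from collections.abc import Mapping
from typing import Any


def flatten_params(config: Mapping[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten a nested mapping into dotted keys.

    {"optimizer": {"lr": 0.001}} becomes {"optimizer.lr": 0.001}.
    """
    flat: dict[str, Any] = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            # Recurse into nested sections, carrying the dotted prefix along
            flat.update(flatten_params(value, dotted))
        else:
            flat[dotted] = value
    return flat


example = {"optimizer": {"lr": 0.001}, "training": {"batch_size": 32, "epochs": 10}}
flat_example = flatten_params(example)
print(flat_example)
```

In the training function above, the result could be passed to `mlflow.log_params(flatten_params(OmegaConf.to_container(cfg, resolve=True)))` so that each leaf value becomes its own searchable parameter.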

**MLflow Model Registry** (promote to production):
```python
import mlflow
import mlflow.pytorch
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a run's model in the registry
# (run_id is the ID of a finished training run, e.g. the "run_id" returned by train())
model_uri = f"runs:/{run_id}/model"
mv = mlflow.register_model(model_uri, "fraud-detector")

# Transition to staging after validation
# Note: registry stages are deprecated in recent MLflow releases in favour of
# model version aliases; check the documentation for the version you run.
client.transition_model_version_stage(
    name="fraud-detector",
    version=mv.version,
    stage="Staging",
    archive_existing_versions=False,
)

# Load production model in serving
production_model = mlflow.pytorch.load_model(
    model_uri="models:/fraud-detector/Production"
)
```

### Weights & Biases Integration

W&B offers a richer UI and more built-in features than MLflow, and is available as a hosted cloud service:

```python
import wandb
from omegaconf import OmegaConf

# cfg, train_epoch, and scheduler are defined elsewhere in the project
wandb.init(
    project="fraud-detector",
    name=cfg.experiment.name,
    config=OmegaConf.to_container(cfg, resolve=True),
    tags=["baseline", "v2-features"],
    notes="Testing new feature set with gradient clipping",
)

# Log metrics
for epoch in range(cfg.training.epochs):
    metrics = train_epoch(...)
    wandb.log({
        "epoch": epoch,
        "train/loss": metrics["train_loss"],
        "val/loss": metrics["val_loss"],
        "val/auc": metrics["val_auc"],
        "lr": scheduler.get_last_lr()[0],
    })

# Log model artifact
artifact = wandb.Artifact("fraud-detector", type="model")
artifact.add_file("models/checkpoints/best.pt")
wandb.log_artifact(artifact)

wandb.finish()
```

**W&B-specific features**:
- **System monitoring**: GPU utilisation, memory, temperature logged automatically
- **Gradient histograms**: `wandb.watch(model, log="gradients")` logs gradient distributions per layer, which is invaluable for debugging vanishing/exploding gradients
- **Media logging**: Log images, audio, tables, confusion matrices directly in the UI
- **Alerts**: Set threshold alerts on metrics (email/Slack when val_loss > threshold)

### Artifact Storage Strategy

Artifacts are the binary outputs of training runs: model checkpoints, preprocessed datasets, evaluation reports, and confusion matrices. Never store large binary artifacts in git.

**Storage hierarchy**:
```
Small artifacts (< 1 MB): Log directly to tracker
  - Config files, evaluation reports (JSON/CSV)
  - Example predictions, confusion matrices (images)

Medium artifacts (1 MB – 1 GB): Log as tracker artifacts
  - Model checkpoints for experimentation
  - Feature engineering outputs

Large artifacts (> 1 GB): Object storage with tracker reference
  - Full training datasets
  - Final production model weights
  - Large evaluation outputs
```
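
The size thresholds in the hierarchy above can be turned into a small routing helper. This is only a sketch: the function names and the tier labels are hypothetical, and the boundaries simply mirror the 1 MB and 1 GB cut-offs listed above.

```python
import os

MB = 1024 ** 2
GB = 1024 ** 3


def storage_tier(size_bytes: int) -> str:
    """Map an artifact size in bytes to a storage tier from the hierarchy."""
    if size_bytes < MB:
        return "small"    # log directly to the tracker
    if size_bytes <= GB:
        return "medium"   # log as a tracker artifact
    return "large"        # push to object storage, log only a reference


def storage_tier_for_file(path: str) -> str:
    """Look up the tier for a file on disk."""
    return storage_tier(os.path.getsize(path))


for size in (10_000, 5 * MB, 2 * GB):
    print(size, storage_tier(size))
```

A training script could call `storage_tier_for_file` before logging and, for the large tier, upload the file to object storage and record only its URI as a tag.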

**S3 artifact storage for MLflow**:
```bash
mlflow server \
  --default-artifact-root s3://my-bucket/mlflow-artifacts \
  --backend-store-uri postgresql://user:pass@host/mlflow
```

**DVC for dataset versioning alongside MLflow**:
```bash
# Version dataset with DVC
dvc add data/processed/features_v3.parquet
git add data/processed/features_v3.parquet.dvc
git commit -m "Track features_v3 with DVC"

# Log DVC dataset reference in MLflow.
# These are Python calls made inside the training run, not shell commands;
# git_sha is the commit that contains the .dvc file.
#   mlflow.set_tag("dvc_dataset_commit", git_sha)
#   mlflow.set_tag("dataset_path", "data/processed/features_v3.parquet")
```
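
Alongside the commit SHA, a content hash of the dataset file gives a tag that changes whenever the bytes change, even if the path stays the same. The sketch below is an assumption about how one might do this, not something DVC or MLflow requires; `file_sha256` is a hypothetical helper built only on the standard library.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Inside a training run this could be logged as, for example, `mlflow.set_tag("dataset_version", file_sha256("data/processed/features_v3.parquet"))`, matching the `dataset_version` tag used in the logging checklist.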

### Run Comparison and Analysis

**Finding the best run** (MLflow Python API):
```python
from mlflow.tracking import MlflowClient
import pandas as pd

client = MlflowClient()

# Get runs in an experiment with val_auc > 0.85, sorted by val_auc (best first)
runs = client.search_runs(
    experiment_ids=["1"],
    filter_string="metrics.val_auc > 0.85",
    order_by=["metrics.val_auc DESC"],
    max_results=20,
)

# Convert to DataFrame for analysis
run_data = [{
    "run_id": r.info.run_id,
    "name": r.info.run_name,
    "val_auc": r.data.metrics.get("val_auc"),
    "lr": r.data.params.get("optimizer.lr"),
    "batch_size": r.data.params.get("training.batch_size"),
} for r in runs]

df = pd.DataFrame(run_data)
print(df.head(10))
```

**Comparing runs in W&B**: Use the parallel coordinates plot (built into W&B UI) to visualise the relationship between hyperparameters and metrics across many runs at once.

### Hyperparameter Sweeps

**W&B Sweeps** (cloud-managed sweep coordinator):
```yaml
# sweep_config.yaml
program: train.py
method: bayes  # bayes, random, or grid
metric:
  name: val/auc
  goal: maximize
parameters:
  optimizer.lr:
    min: 1.0e-5
    max: 1.0e-2
    distribution: log_uniform_values
  training.batch_size:
    values: [16, 32, 64, 128]
  model.dropout:
    min: 0.0
    max: 0.5
early_terminate:
  type: hyperband
  min_iter: 3
```

```bash
wandb sweep sweep_config.yaml     # Returns sweep ID
wandb agent <sweep-id> --count 50  # Launch 50 trials
```
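
The `log_uniform_values` distribution in the sweep config samples the learning rate so that its logarithm is uniform between the logs of `min` and `max`, which spreads trials evenly across orders of magnitude. The following standard-library sketch illustrates that idea; `sample_log_uniform` is a hypothetical helper, not W&B code.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(1000)]

# About half of the samples should fall below the geometric midpoint,
# whereas a plain uniform draw would put almost all of them above it.
geometric_mid = math.sqrt(1e-5 * 1e-2)
fraction_below = sum(s < geometric_mid for s in samples) / len(samples)
print(f"fraction below {geometric_mid:.2e}: {fraction_below:.2f}")
```

Optuna's `suggest_float(..., log=True)`, used in the self-hosted alternative, follows the same principle.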

**MLflow + Optuna** (self-hosted alternative):
```python
import optuna
import mlflow

def objective(trial):
    # Each trial is logged as a child of the parent sweep run
    with mlflow.start_run(nested=True):
        lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
        mlflow.log_param("lr", lr)

        # train_and_evaluate is the project's own training entry point
        val_auc = train_and_evaluate(lr=lr)
        mlflow.log_metric("val_auc", val_auc)
        return val_auc

with mlflow.start_run(run_name="hyperparameter-sweep"):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    mlflow.log_params(study.best_params)
    mlflow.log_metric("best_val_auc", study.best_value)
```

### Experiment Logging Checklist

Log these for every training run, no exceptions:

```python
import mlflow
import mlflow.pytorch
import torch

# Required: hyperparameters
mlflow.log_params({...})  # Full config dict

# Required: metrics at each epoch
mlflow.log_metrics({...}, step=epoch)

# Required: final metrics
mlflow.log_metrics({"final_val_auc": val_auc, "final_val_loss": val_loss})

# Required: reproducibility tags
mlflow.set_tag("git_commit", git_sha)
mlflow.set_tag("dataset_version", dataset_version)

# Required: model artifact
mlflow.pytorch.log_model(model, "model")

# Recommended: environment
mlflow.log_artifact("environment.yml")
mlflow.set_tag("cuda_version", torch.version.cuda)
mlflow.set_tag("pytorch_version", torch.__version__)
```
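
A checklist like this is easier to enforce when a script can report what a run is missing. The sketch below works on plain dictionaries and a list of artifact paths; the function name and the `metric:`/`tag:` prefixes are hypothetical, and the required keys are taken from the checklist above.

```python
REQUIRED_FINAL_METRICS = ("final_val_auc", "final_val_loss")
REQUIRED_TAGS = ("git_commit", "dataset_version")


def missing_checklist_items(
    params: dict, metrics: dict, tags: dict, artifact_paths: list[str]
) -> list[str]:
    """Return the required checklist items that a run did not log."""
    missing: list[str] = []
    if not params:
        missing.append("params")
    missing += [f"metric:{name}" for name in REQUIRED_FINAL_METRICS if name not in metrics]
    missing += [f"tag:{name}" for name in REQUIRED_TAGS if name not in tags]
    # The model is expected under the "model" artifact path
    if not any(p == "model" or p.startswith("model/") for p in artifact_paths):
        missing.append("artifact:model")
    return missing


incomplete = missing_checklist_items(
    params={"lr": 0.001},
    metrics={"final_val_auc": 0.93},
    tags={"git_commit": "abc123"},
    artifact_paths=[],
)
print(incomplete)
```

With MLflow, the inputs could come from `MlflowClient().get_run(run_id).data` (its `params`, `metrics`, and `tags`) and from the paths returned by `list_artifacts(run_id)`; a CI job might fail when the returned list is non-empty.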
@@ -0,0 +1,256 @@
---
name: ml-model-evaluation
description: Train/val/test splits, cross-validation, metrics by task type, holdout sets, and slice analysis for thorough model evaluation
topics: [ml, evaluation, train-test-split, cross-validation, metrics, holdout, slice-analysis]
---

Model evaluation is the difference between knowing whether your model works and believing it works. Most ML evaluation bugs are forms of data leakage: the model has seen information during training that it would not have at inference time, making offline metrics look better than production performance. Rigorous evaluation requires careful data splitting, leak-free preprocessing, appropriate metrics for the task, and systematic analysis of where the model fails.

## Summary

Split data into train, validation, and test sets, and use the test set exactly once. For small datasets, use cross-validation. Choose metrics appropriate to the task: classification, regression, ranking, or generation have different canonical metrics. Analyse model performance by meaningful slices (demographic groups, difficulty levels, data subsets), because aggregate metrics hide subgroup failures. Log evaluation results with experiment metadata for longitudinal comparison.

## Deep Guidance

### Data Splitting Principles

**Three-way split**: train (model learning), validation (hyperparameter tuning and early stopping), test (final unbiased evaluation):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def create_splits(
    df: pd.DataFrame,
    val_fraction: float = 0.1,
    test_fraction: float = 0.1,
    seed: int = 42,
    stratify_col: str | None = None,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Create reproducible train/val/test splits."""
    stratify = df[stratify_col] if stratify_col else None

    # First carve off the test set
    train_val, test = train_test_split(
        df,
        test_size=test_fraction,
        random_state=seed,
        stratify=stratify,
    )

    # Rescale so val_fraction is relative to the full dataset, not to train_val
    val_size_adjusted = val_fraction / (1 - test_fraction)
    stratify_tv = train_val[stratify_col] if stratify_col else None

    train, val = train_test_split(
        train_val,
        test_size=val_size_adjusted,
        random_state=seed,
        stratify=stratify_tv,
    )

    return train, val, test
```

**Critical splitting rules**:
1. **Split before preprocessing**: Fit preprocessing (scalers, encoders, imputers, tokenizer vocabularies) on training data only, then apply to val/test. Fitting on the combined dataset is data leakage.
2. **Stratify by label for classification**: Ensures class distribution is preserved in each split.
3. **Split by entity, not row, for grouped data**: If you have multiple rows per user, all rows for a user must go to the same split. Row-level splitting leaks user-level information.
4. **Temporal split for time-series**: Train on past, validate and test on future. Random splits would leak future information.
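
Rule 3 above (split by entity, not row) can be sketched with a deterministic hash of the entity ID, so every row for a given user lands in the same split regardless of row order. This is one possible approach under stated assumptions, not the only one: `assign_split` and the `salt` value are hypothetical, and scikit-learn's `GroupShuffleSplit` is an alternative when exact split sizes matter more than determinism across datasets.

```python
import hashlib


def assign_split(
    entity_id: str,
    val_fraction: float = 0.1,
    test_fraction: float = 0.1,
    salt: str = "split-v1",
) -> str:
    """Deterministically map an entity ID to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(f"{salt}:{entity_id}".encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled to a value in [0, 1)
    bucket = int(digest[:8], 16) / 0x1_0000_0000
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"


# Ten rows per user: the split is decided by user_id, never by row
rows = [{"user_id": f"user-{i % 200}", "amount": i} for i in range(2000)]
for row in rows:
    row["split"] = assign_split(row["user_id"])

splits_per_user: dict[str, set[str]] = {}
for row in rows:
    splits_per_user.setdefault(row["user_id"], set()).add(row["split"])
```

Because the assignment depends only on the ID and the salt, new rows for an existing user automatically join that user's split; the realised split sizes are approximate rather than exact.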

### Temporal Splits

For any dataset with a time dimension, always split by time:

```python
import pandas as pd

def temporal_split(
    df: pd.DataFrame,
    timestamp_col: str,
    val_start: str,
    test_start: str,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Create temporal splits: train/val/test defined by date boundaries."""
    # timestamp_col is expected to hold datetimes (or ISO-formatted strings)
    df = df.sort_values(timestamp_col)
    train = df[df[timestamp_col] < val_start]
    val = df[(df[timestamp_col] >= val_start) & (df[timestamp_col] < test_start)]
    test = df[df[timestamp_col] >= test_start]
    return train, val, test
```

**Backtesting** extends temporal evaluation by simulating deployment across multiple time windows: it checks that a model trained on one period performs well on the periods that follow.
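
A minimal sketch of generating backtest windows, assuming a sliding training window of fixed length followed by a fixed-length test window. The function name and the day-based parameters are hypothetical; an expanding-window variant would keep `train_start` fixed instead of advancing it.

```python
from datetime import date, timedelta


def backtest_windows(
    start: date, end: date, train_days: int, test_days: int, step_days: int
) -> list[tuple[date, date, date]]:
    """Return (train_start, train_end, test_end) tuples.

    Training covers [train_start, train_end); testing covers [train_end, test_end).
    """
    windows = []
    train_start = start
    while True:
        train_end = train_start + timedelta(days=train_days)
        test_end = train_end + timedelta(days=test_days)
        if test_end > end:
            break  # not enough data left for a full test window
        windows.append((train_start, train_end, test_end))
        train_start += timedelta(days=step_days)
    return windows


windows = backtest_windows(
    date(2024, 1, 1), date(2024, 7, 1), train_days=90, test_days=30, step_days=30
)
for train_start, train_end, test_end in windows:
    print(f"train {train_start} to {train_end}, test {train_end} to {test_end}")
```

Each tuple can be fed to a date-boundary split such as `temporal_split`, retraining the model per window and comparing the test metrics across windows.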

### Cross-Validation

Use k-fold cross-validation when the dataset is too small for a stable held-out set (as a rough rule of thumb, fewer than about 10,000 examples):

```python
from sklearn.model_selection import StratifiedKFold
import numpy as np

def cross_validate(
    X: np.ndarray,
    y: np.ndarray,
    model_builder,
    n_folds: int = 5,
    seed: int = 42,
) -> dict[str, dict[str, float]]:
    """Stratified k-fold cross-validation."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_metrics = []

    for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        # Build a fresh model per fold so no state leaks between folds
        model = model_builder()
        model.fit(X_train, y_train)
        # evaluate_model is project-specific and returns a dict of metric values
        metrics = evaluate_model(model, X_val, y_val)
        fold_metrics.append(metrics)

    # Aggregate across folds
    return {
        metric: {
            "mean": np.mean([m[metric] for m in fold_metrics]),
            "std": np.std([m[metric] for m in fold_metrics]),
        }
        for metric in fold_metrics[0]
    }
```

**Nested cross-validation** separates hyperparameter selection from model evaluation:
- Outer loop: Estimate generalisation error
- Inner loop: Select hyperparameters via grid/random search
- Prevents over-fitting hyperparameters to the validation set
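
The two loops can be sketched in plain Python as follows. This is an illustration of the control flow only, assuming unshuffled contiguous folds and a caller-supplied `fit_and_score(candidate, train_idx, eval_idx)` callback; all names here are hypothetical, and a real project would more likely combine scikit-learn's search and cross-validation utilities.

```python
def kfold_indices(n_samples: int, n_folds: int) -> list[tuple[list[int], list[int]]]:
    """Contiguous, unshuffled k-fold (train_idx, held_out_idx) pairs."""
    indices = list(range(n_samples))
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, held_out))
        start += size
    return folds


def nested_cv(n_samples, candidates, fit_and_score, outer_folds=5, inner_folds=3):
    """Return one generalisation score per outer fold."""
    outer_scores = []
    for outer_train, outer_test in kfold_indices(n_samples, outer_folds):

        def inner_score(candidate):
            # Inner loop: score a candidate using only the outer training data
            scores = []
            for tr_pos, val_pos in kfold_indices(len(outer_train), inner_folds):
                inner_train = [outer_train[i] for i in tr_pos]
                inner_val = [outer_train[i] for i in val_pos]
                scores.append(fit_and_score(candidate, inner_train, inner_val))
            return sum(scores) / len(scores)

        best = max(candidates, key=inner_score)
        # Outer loop: the held-out fold is touched only after selection
        outer_scores.append(fit_and_score(best, outer_train, outer_test))
    return outer_scores


calls = []

def toy_fit_and_score(candidate, train_idx, eval_idx):
    """Stand-in for real training: the score peaks at candidate == 0.1."""
    calls.append((set(train_idx), set(eval_idx)))
    return -abs(candidate - 0.1)

scores = nested_cv(30, [0.01, 0.1, 1.0], toy_fit_and_score)
```

The key property is that hyperparameter selection never sees the outer test fold, so the outer scores estimate how the whole selection procedure generalises.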

### Holdout Sets and Evaluation Integrity

**The test set is sacred**: It may be touched exactly once: when reporting final model performance before deployment. Every other decision (architecture, hyperparameters, features) uses the validation set.

If you look at test set performance and then make changes, the test set is contaminated and you must collect a fresh test set.

**Multiple evaluation sets**:
- **In-distribution test set**: Same distribution as training data. Measures how well the model learned.
- **Out-of-distribution test set**: Different time period, geography, or user cohort. Measures generalisation.
- **Adversarial / challenging test set**: Hard examples, edge cases, known failure modes. Measures robustness.
- **Slice-specific test sets**: Subsets by demographic, category, or difficulty. Measures fairness and consistency.

### Metrics by Task Type

**Binary Classification**:
```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score,
    confusion_matrix, classification_report,
)

def evaluate_binary_classifier(
    y_true: np.ndarray,
    y_pred_proba: np.ndarray,
    threshold: float = 0.5,
) -> dict[str, float]:
    # Threshold-dependent metrics use hard labels; AUC metrics use raw scores
    y_pred = (y_pred_proba >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_pred_proba),
        "pr_auc": average_precision_score(y_true, y_pred_proba),
    }
```

**Regression**:
```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_regressor(y_true, y_pred) -> dict[str, float]:
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
        "r2": r2_score(y_true, y_pred),
        # Divide by |y_true| (floored at 1e-8) so negative or zero targets
        # cannot flip the sign or divide by zero
        "mape": np.mean(np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), 1e-8)) * 100,
    }
```

**Multi-class classification**:
- Macro-average: Equal weight per class. Use when large classes should not dominate the aggregate metric
- Weighted-average: Weight by class support. Use for overall system performance
- Per-class metrics: Report separately to catch poor performance on minority classes
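
The difference between the averaging modes is easiest to see on a small imbalanced example. The sketch below computes per-class, macro, and weighted F1 with the standard library only; the function names are hypothetical, and in practice `sklearn.metrics.f1_score` with `average="macro"` or `average="weighted"` covers the same ground.

```python
from collections import Counter


def per_class_f1(y_true: list, y_pred: list) -> dict:
    """F1 score for each label seen in either y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true: list, y_pred: list) -> float:
    """Mean of per-class F1 weighted by how often each class occurs in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# 8 examples of class "a", 2 of class "b"; one "b" is misclassified as "a"
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Here the minority class has the lower F1, so the macro average comes out below the weighted average: the weighted figure is pulled towards the well-predicted majority class.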

### Slice Analysis

Aggregate metrics can hide systematic failures in subgroups. Slice analysis breaks down performance by meaningful subsets:

```python
import pandas as pd

def slice_analysis(
    df: pd.DataFrame,
    y_true_col: str,
    y_pred_col: str,
    slice_cols: list[str],
    metric_fn,
) -> pd.DataFrame:
    """Compute metrics for each slice of the data."""
    # metric_fn takes (y_true, y_pred) and returns a dict of metric values
    results = []

    # Overall metrics
    overall = metric_fn(df[y_true_col], df[y_pred_col])
    results.append({"slice": "overall", "n": len(df), **overall})

    # Per-slice metrics
    for col in slice_cols:
        for value, group in df.groupby(col):
            if len(group) < 50:  # Skip slices with too few examples
                continue
            metrics = metric_fn(group[y_true_col], group[y_pred_col])
            results.append({
                "slice": f"{col}={value}",
                "n": len(group),
                **metrics,
            })

    return pd.DataFrame(results)
```

**Slices to always analyse**:
- Demographic groups (if available and legally permissible): age band, gender, geography
- Data quality slices: high vs. low confidence labels, recent vs. old data
- Difficulty slices: high vs. low frequency items, short vs. long text
- Business-relevant slices: product category, customer segment, price tier

**Flagging disparities**: If a slice's metric deviates from overall by more than a threshold (e.g., 10 percentage points), flag for investigation before deployment.
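
The flagging rule can be sketched as a simple comparison against the overall metric. The function name is hypothetical, and the threshold is expressed as an absolute difference on a 0 to 1 scale, so 10 percentage points is `0.10`; the example values reuse the ROC-AUC figures from the sample evaluation report in this guide.

```python
def flag_disparities(
    slice_metrics: dict[str, float], overall: float, threshold: float = 0.10
) -> list[str]:
    """Return the names of slices whose metric is more than `threshold` away from overall."""
    return sorted(
        name for name, value in slice_metrics.items()
        if abs(value - overall) > threshold
    )


slice_auc = {
    "Amount < $50": 0.941,
    "Amount > $1000": 0.918,
    "New user (< 30 days)": 0.891,
}

# A tighter threshold of 4 points is used here so that one slice is flagged
flagged = flag_disparities(slice_auc, overall=0.934, threshold=0.04)
print(flagged)
```

The right threshold depends on the metric and on slice size: small slices have noisier metrics, so a fixed cut-off may need to be paired with the minimum-sample filter used in `slice_analysis`.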

### Baseline Comparisons

Every model evaluation must include a comparison to baselines:
- **Trivial baseline**: Predict the majority class (classification) or mean target value (regression)
- **Rule-based baseline**: The current production rule or heuristic
- **Previous model version**: The model currently in production
- **Simple ML baseline**: Logistic regression or decision tree

A model that does not beat all baselines should not be deployed. The trivial baseline check catches label encoding bugs (where the model learns the majority class trivially).
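
The trivial baseline for classification takes only a few lines to compute, which makes it a cheap sanity check. The sketch below uses the standard library; the function names and the `min_gain` parameter are hypothetical, and the example labels are synthetic, chosen to match the 1.2% positive rate of the sample report in this guide.

```python
from collections import Counter


def majority_class_baseline(y_train: list, y_test: list) -> tuple:
    """Predict the most common training label for every test example.

    Returns (majority_label, accuracy_on_test).
    """
    majority_label, _ = Counter(y_train).most_common(1)[0]
    correct = sum(1 for y in y_test if y == majority_label)
    return majority_label, correct / len(y_test)


def beats_baseline(model_score: float, baseline_score: float, min_gain: float = 0.0) -> bool:
    """True when the model exceeds the baseline by more than min_gain."""
    return model_score > baseline_score + min_gain


# Synthetic labels with 1.2% positives
y_train = [0] * 988 + [1] * 12
y_test = [0] * 494 + [1] * 6

label, baseline_accuracy = majority_class_baseline(y_train, y_test)
print(label, baseline_accuracy)
```

With this class balance the do-nothing baseline already reaches 98.8% accuracy, which is why accuracy alone says little for rare-event problems and metrics such as PR-AUC are reported alongside it.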

### Evaluation Report Structure

```markdown
# Evaluation Report: fraud-detector-v2.3.0

## Dataset
- Test set: 45,231 examples (2024-01-01 to 2024-03-31)
- Class balance: 1.2% fraud, 98.8% non-fraud

## Overall Metrics
| Metric | v2.2.0 (prod) | v2.3.0 (candidate) | Delta |
|--------|--------------|-------------------|-------|
| ROC-AUC | 0.921 | 0.934 | +1.4% |
| PR-AUC | 0.712 | 0.748 | +5.1% |
| Recall @ precision=0.9 | 0.68 | 0.73 | +7.4% |

## Slice Analysis
| Slice | n | ROC-AUC | vs. Overall |
|-------|---|---------|-------------|
| Overall | 45,231 | 0.934 | - |
| Amount < $50 | 12,445 | 0.941 | +0.7% |
| Amount > $1000 | 3,211 | 0.918 | -1.7% |
| New user (< 30 days) | 8,902 | 0.891 | -4.6% ⚠️ |

## Recommendation
Promote to staging. Investigate new user performance degradation before production.
```