@zigrivers/scaffold 3.8.0 → 3.9.1

Files changed (70)
  1. package/README.md +73 -8
  2. package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
  3. package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
  4. package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
  5. package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
  6. package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
  7. package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
  8. package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
  9. package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
  10. package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
  11. package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
  12. package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
  13. package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
  14. package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
  15. package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
  16. package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
  17. package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
  18. package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
  19. package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
  20. package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
  21. package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
  22. package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
  23. package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
  24. package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
  25. package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
  26. package/content/knowledge/ml/ml-architecture.md +172 -0
  27. package/content/knowledge/ml/ml-conventions.md +209 -0
  28. package/content/knowledge/ml/ml-dev-environment.md +299 -0
  29. package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
  30. package/content/knowledge/ml/ml-model-evaluation.md +256 -0
  31. package/content/knowledge/ml/ml-observability.md +253 -0
  32. package/content/knowledge/ml/ml-project-structure.md +216 -0
  33. package/content/knowledge/ml/ml-requirements.md +138 -0
  34. package/content/knowledge/ml/ml-security.md +188 -0
  35. package/content/knowledge/ml/ml-serving-patterns.md +243 -0
  36. package/content/knowledge/ml/ml-testing.md +301 -0
  37. package/content/knowledge/ml/ml-training-patterns.md +269 -0
  38. package/content/methodology/browser-extension-overlay.yml +82 -0
  39. package/content/methodology/data-pipeline-overlay.yml +70 -0
  40. package/content/methodology/ml-overlay.yml +70 -0
  41. package/dist/cli/commands/init.d.ts +13 -0
  42. package/dist/cli/commands/init.d.ts.map +1 -1
  43. package/dist/cli/commands/init.js +122 -2
  44. package/dist/cli/commands/init.js.map +1 -1
  45. package/dist/cli/commands/init.test.js +120 -0
  46. package/dist/cli/commands/init.test.js.map +1 -1
  47. package/dist/config/schema.d.ts +864 -48
  48. package/dist/config/schema.d.ts.map +1 -1
  49. package/dist/config/schema.js +53 -0
  50. package/dist/config/schema.js.map +1 -1
  51. package/dist/config/schema.test.js +166 -3
  52. package/dist/config/schema.test.js.map +1 -1
  53. package/dist/core/assembly/overlay-loader.test.js +33 -0
  54. package/dist/core/assembly/overlay-loader.test.js.map +1 -1
  55. package/dist/e2e/project-type-overlays.test.d.ts +2 -2
  56. package/dist/e2e/project-type-overlays.test.js +499 -33
  57. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  58. package/dist/types/config.d.ts +10 -1
  59. package/dist/types/config.d.ts.map +1 -1
  60. package/dist/wizard/questions.d.ts +17 -1
  61. package/dist/wizard/questions.d.ts.map +1 -1
  62. package/dist/wizard/questions.js +75 -1
  63. package/dist/wizard/questions.js.map +1 -1
  64. package/dist/wizard/questions.test.js +167 -0
  65. package/dist/wizard/questions.test.js.map +1 -1
  66. package/dist/wizard/wizard.d.ts +13 -0
  67. package/dist/wizard/wizard.d.ts.map +1 -1
  68. package/dist/wizard/wizard.js +17 -1
  69. package/dist/wizard/wizard.js.map +1 -1
  70. package/package.json +1 -1
@@ -0,0 +1,285 @@
1
+ ---
2
+ name: ml-experiment-tracking
3
+ description: MLflow and Weights & Biases integration, artifact storage, experiment run comparison, and hyperparameter sweep management
4
+ topics: [ml, experiment-tracking, mlflow, wandb, artifacts, sweeps, reproducibility]
5
+ ---
6
+
7
+ Without experiment tracking, ML development is archaeology: "which config produced that result?" is answered by digging through notebook history, chat logs, and failing memory. Experiment tracking tools are version control for training runs — every metric, every hyperparameter, every artifact, linked to the code that produced it. The discipline of logging everything during training pays dividends when a stakeholder asks "how does this model compare to what we had six months ago?"
8
+
9
+ ## Summary
10
+
11
+ Use MLflow (self-hosted, open source) or Weights & Biases (cloud, more feature-rich) to track every training run. Log hyperparameters, metrics at each epoch, model artifacts, and the git commit SHA. Store large artifacts (checkpoints, datasets) in object storage that the experiment tracker references. Use sweep tooling (Optuna with MLflow, or W&B Sweeps) for systematic hyperparameter search rather than manual iteration.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### MLflow Integration
16
+
17
+ MLflow is a widely used open-source experiment tracker. It runs locally or on a managed server:
18
+
19
+ ```bash
20
+ # Start local tracking server (stores runs in ./mlruns)
21
+ mlflow server --host 0.0.0.0 --port 5000
22
+
23
+ # Or use the SQLite backend for better performance
24
+ mlflow server \
25
+ --backend-store-uri sqlite:///mlflow.db \
26
+ --default-artifact-root ./mlartifacts \
27
+ --host 0.0.0.0 --port 5000
28
+ ```
29
+
30
+ **Instrument training code**:
31
+ ```python
+ import subprocess
+
+ import mlflow
+ import mlflow.pytorch
+ from omegaconf import DictConfig, OmegaConf
+
+ # Set tracking server
+ mlflow.set_tracking_uri("http://localhost:5000")
+ mlflow.set_experiment("fraud-detector")
+
+ def train(cfg: DictConfig) -> dict:
+     with mlflow.start_run(run_name=cfg.experiment.name) as run:
+         # Log all hyperparameters from config
+         mlflow.log_params(OmegaConf.to_container(cfg, resolve=True))
+
+         # Log git commit for reproducibility
+         git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
+         mlflow.set_tag("git_commit", git_sha)
+         mlflow.set_tag("model_type", cfg.model.type)
+
+         for epoch in range(cfg.training.epochs):
+             train_metrics = train_epoch(...)
+             val_metrics = evaluate(...)
+
+             # Log metrics with step (epoch) for time-series view
+             mlflow.log_metrics({
+                 "train_loss": train_metrics["loss"],
+                 "val_loss": val_metrics["loss"],
+                 "val_auc": val_metrics["auc"],
+             }, step=epoch)
+
+         # Log the model after the final epoch
+         # (`model` is the trained module; its construction is elided here)
+         mlflow.pytorch.log_model(
+             model,
+             artifact_path="model",
+             registered_model_name="fraud-detector",  # Register in Model Registry
+         )
+
+         # Log additional artifacts
+         mlflow.log_artifact("configs/train.yaml")
+         mlflow.log_artifact("reports/eval_report.json")
+
+         return {"run_id": run.info.run_id, **val_metrics}
+ ```
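
The training code above passes a nested config straight to `mlflow.log_params`, while the run-comparison code later reads keys such as `optimizer.lr`. Those dotted keys only exist if the config is flattened before logging. A minimal sketch of such a flattening step, assuming plain nested dicts as input; `flatten_config` is a hypothetical helper, not part of MLflow or OmegaConf:

```python
from collections.abc import Mapping
from typing import Any


def flatten_config(cfg: Mapping[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested mappings into dotted keys: {"optimizer": {"lr": 1}} -> {"optimizer.lr": 1}."""
    flat: dict[str, Any] = {}
    for key, value in cfg.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            # Recurse into nested sections, carrying the dotted prefix along
            flat.update(flatten_config(value, dotted))
        else:
            flat[dotted] = value
    return flat


if __name__ == "__main__":
    nested = {
        "optimizer": {"lr": 0.001},
        "training": {"batch_size": 32, "epochs": 10},
    }
    print(flatten_config(nested))
```

In the training function this would be used as `mlflow.log_params(flatten_config(OmegaConf.to_container(cfg, resolve=True)))`, so that each leaf value becomes its own searchable parameter.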
74
+
75
+ **MLflow Model Registry** (promote to production):
76
+ ```python
+ import mlflow
+ import mlflow.pytorch
+ from mlflow.tracking import MlflowClient
+
+ client = MlflowClient()
+
+ # Register a run's model in the registry
+ # (`run_id` comes from a finished training run, e.g. the value returned by train())
+ model_uri = f"runs:/{run_id}/model"
+ mv = mlflow.register_model(model_uri, "fraud-detector")
+
+ # Transition to staging after validation
+ # Note: registry stages are deprecated in recent MLflow releases in favour of
+ # model version aliases; check the documentation for the version you run.
+ client.transition_model_version_stage(
+     name="fraud-detector",
+     version=mv.version,
+     stage="Staging",
+     archive_existing_versions=False,
+ )
+
+ # Load the model currently in the Production stage when serving
+ production_model = mlflow.pytorch.load_model(
+     model_uri="models:/fraud-detector/Production"
+ )
+ ```
98
+
99
+ ### Weights & Biases Integration
100
+
101
+ W&B provides a richer UI and more features than MLflow, with a cloud-hosted option:
102
+
103
+ ```python
104
+ import wandb
105
+
106
+ wandb.init(
107
+ project="fraud-detector",
108
+ name=cfg.experiment.name,
109
+ config=OmegaConf.to_container(cfg, resolve=True),
110
+ tags=["baseline", "v2-features"],
111
+ notes="Testing new feature set with gradient clipping",
112
+ )
113
+
114
+ # Log metrics
115
+ for epoch in range(cfg.training.epochs):
116
+ metrics = train_epoch(...)
117
+ wandb.log({
118
+ "epoch": epoch,
119
+ "train/loss": metrics["train_loss"],
120
+ "val/loss": metrics["val_loss"],
121
+ "val/auc": metrics["val_auc"],
122
+ "lr": scheduler.get_last_lr()[0],
123
+ })
124
+
125
+ # Log model artifact
126
+ artifact = wandb.Artifact("fraud-detector", type="model")
127
+ artifact.add_file("models/checkpoints/best.pt")
128
+ wandb.log_artifact(artifact)
129
+
130
+ wandb.finish()
131
+ ```
132
+
133
+ **W&B-specific features**:
134
+ - **System monitoring**: GPU utilisation, memory, temperature logged automatically
135
+ - **Gradient histograms**: `wandb.watch(model, log="gradients")` logs gradient distributions per layer — invaluable for debugging vanishing/exploding gradients
136
+ - **Media logging**: Log images, audio, tables, confusion matrices directly in the UI
137
+ - **Alerts**: Set threshold alerts on metrics (email/Slack when val_loss > threshold)
138
+
139
+ ### Artifact Storage Strategy
140
+
141
+ Artifacts are the binary outputs of training runs: model checkpoints, preprocessed datasets, evaluation reports, and confusion matrices. Never store large binary artifacts in git:
142
+
143
+ **Storage hierarchy**:
144
+ ```
145
+ Small artifacts (< 1 MB): Log directly to tracker
146
+ - Config files, evaluation reports (JSON/CSV)
147
+ - Example predictions, confusion matrices (images)
148
+
149
+ Medium artifacts (1 MB – 1 GB): Log as tracker artifacts
150
+ - Model checkpoints for experimentation
151
+ - Feature engineering outputs
152
+
153
+ Large artifacts (> 1 GB): Object storage with tracker reference
154
+ - Full training datasets
155
+ - Final production model weights
156
+ - Large evaluation outputs
157
+ ```
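
The hierarchy above can be turned into a small routing rule so that training code does not decide ad hoc where an artifact goes. A minimal sketch, assuming binary units (1 MB = 1024² bytes) and treating exactly 1 GB as medium; the function name and tier labels are illustrative, not part of any tracker API:

```python
MB = 1024 ** 2
GB = 1024 ** 3


def storage_tier(size_bytes: int) -> str:
    """Map an artifact size to the storage tier described in the hierarchy."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < MB:
        return "tracker-direct"      # small: configs, reports, example predictions
    if size_bytes <= GB:
        return "tracker-artifact"    # medium: experiment checkpoints, feature outputs
    return "object-storage"          # large: datasets, production weights


if __name__ == "__main__":
    for size in (200 * 1024, 300 * MB, 5 * GB):
        print(size, storage_tier(size))
```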
158
+
159
+ **S3 artifact storage for MLflow**:
160
+ ```bash
161
+ mlflow server \
162
+ --default-artifact-root s3://my-bucket/mlflow-artifacts \
163
+ --backend-store-uri postgresql://user:pass@host/mlflow
164
+ ```
165
+
166
+ **DVC for dataset versioning alongside MLflow**:
167
+ ```bash
+ # Version dataset with DVC
+ dvc add data/processed/features_v3.parquet
+ git add data/processed/features_v3.parquet.dvc
+
+ # Then, inside the Python training run (not the shell), record the dataset reference:
+ #   mlflow.set_tag("dvc_dataset_commit", git_sha)
+ #   mlflow.set_tag("dataset_path", "data/processed/features_v3.parquet")
+ ```
176
+
177
+ ### Run Comparison and Analysis
178
+
179
+ **Finding the best run** (MLflow Python API):
180
+ ```python
181
+ from mlflow.tracking import MlflowClient
182
+ import pandas as pd
183
+
184
+ client = MlflowClient()
185
+
186
+ # Get all runs in an experiment, sorted by val_auc
187
+ runs = client.search_runs(
188
+ experiment_ids=["1"],
189
+ filter_string="metrics.val_auc > 0.85",
190
+ order_by=["metrics.val_auc DESC"],
191
+ max_results=20,
192
+ )
193
+
194
+ # Convert to DataFrame for analysis
195
+ run_data = [{
196
+ "run_id": r.info.run_id,
197
+ "name": r.info.run_name,
198
+ "val_auc": r.data.metrics.get("val_auc"),
199
+ "lr": r.data.params.get("optimizer.lr"),
200
+ "batch_size": r.data.params.get("training.batch_size"),
201
+ } for r in runs]
202
+
203
+ df = pd.DataFrame(run_data)
204
+ print(df.head(10))
205
+ ```
206
+
207
+ **Comparing runs in W&B**: Use the parallel coordinates plot (built into W&B UI) to visualise the relationship between hyperparameters and metrics across many runs at once.
208
+
209
+ ### Hyperparameter Sweeps
210
+
211
+ **W&B Sweeps** (cloud-managed sweep coordinator):
212
+ ```yaml
+ # sweep_config.yaml
+ program: train.py
+ method: bayes  # bayes, random, or grid
+ metric:
+   name: val/auc
+   goal: maximize
+ parameters:
+   optimizer.lr:
+     min: 1.0e-5
+     max: 1.0e-2
+     distribution: log_uniform_values
+   training.batch_size:
+     values: [16, 32, 64, 128]
+   model.dropout:
+     min: 0.0
+     max: 0.5
+ early_terminate:
+   type: hyperband
+   min_iter: 3
+ ```
233
+
234
+ ```bash
235
+ wandb sweep sweep_config.yaml # Returns sweep ID
236
+ wandb agent <sweep-id> --count 50 # Launch 50 trials
237
+ ```
238
+
239
+ **MLflow + Optuna** (self-hosted alternative):
240
+ ```python
+ import optuna
+ import mlflow
+
+ def objective(trial):
+     # Each trial is a child run nested under the parent sweep run
+     with mlflow.start_run(nested=True):
+         lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
+         mlflow.log_param("lr", lr)
+
+         val_auc = train_and_evaluate(lr=lr)
+         mlflow.log_metric("val_auc", val_auc)
+         return val_auc
+
+ with mlflow.start_run(run_name="hyperparameter-sweep"):
+     study = optuna.create_study(direction="maximize")
+     study.optimize(objective, n_trials=50)
+     mlflow.log_params(study.best_params)
+     mlflow.log_metric("best_val_auc", study.best_value)
+ ```
259
+
260
+ ### Experiment Logging Checklist
261
+
262
+ Log these for every training run — no exceptions:
263
+
264
+ ```python
265
+ # Required: hyperparameters
266
+ mlflow.log_params({...}) # Full config dict
267
+
268
+ # Required: metrics at each epoch
269
+ mlflow.log_metrics({...}, step=epoch)
270
+
271
+ # Required: final metrics
272
+ mlflow.log_metrics({"final_val_auc": val_auc, "final_val_loss": val_loss})
273
+
274
+ # Required: reproducibility tags
275
+ mlflow.set_tag("git_commit", git_sha)
276
+ mlflow.set_tag("dataset_version", dataset_version)
277
+
278
+ # Required: model artifact
279
+ mlflow.pytorch.log_model(model, "model")
280
+
281
+ # Recommended: environment
282
+ mlflow.log_artifact("environment.yml")
283
+ mlflow.set_tag("cuda_version", torch.version.cuda)
284
+ mlflow.set_tag("pytorch_version", torch.__version__)
285
+ ```
@@ -0,0 +1,256 @@
1
+ ---
2
+ name: ml-model-evaluation
3
+ description: Train/val/test splits, cross-validation, metrics by task type, holdout sets, and slice analysis for thorough model evaluation
4
+ topics: [ml, evaluation, train-test-split, cross-validation, metrics, holdout, slice-analysis]
5
+ ---
6
+
7
+ Model evaluation is the difference between knowing whether your model works and believing it works. Most ML evaluation bugs are forms of data leakage: the model has seen information during training that it would not have at inference time, making offline metrics look better than production performance. Rigorous evaluation requires careful data splitting, leak-free preprocessing, appropriate metrics for the task, and systematic analysis of where the model fails.
8
+
9
+ ## Summary
10
+
11
+ Split data into train, validation, and test sets — use the test set exactly once. For small datasets, use cross-validation. Choose metrics appropriate to the task: classification, regression, ranking, or generation have different canonical metrics. Analyse model performance by meaningful slices (demographic groups, difficulty levels, data subsets) — aggregate metrics hide subgroup failures. Log evaluation results with experiment metadata for longitudinal comparison.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Data Splitting Principles
16
+
17
+ **Three-way split**: train (model learning), validation (hyperparameter tuning and early stopping), test (final unbiased evaluation):
18
+
19
+ ```python
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+
+ def create_splits(
+     df: pd.DataFrame,
+     val_fraction: float = 0.1,
+     test_fraction: float = 0.1,
+     seed: int = 42,
+     stratify_col: str | None = None,
+ ) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+     """Create reproducible train/val/test splits."""
+     stratify = df[stratify_col] if stratify_col else None
+
+     # First carve off the test set
+     train_val, test = train_test_split(
+         df,
+         test_size=test_fraction,
+         random_state=seed,
+         stratify=stratify,
+     )
+
+     # Rescale so val_fraction stays a fraction of the full dataset
+     val_size_adjusted = val_fraction / (1 - test_fraction)
+     stratify_tv = train_val[stratify_col] if stratify_col else None
+
+     train, val = train_test_split(
+         train_val,
+         test_size=val_size_adjusted,
+         random_state=seed,
+         stratify=stratify_tv,
+     )
+
+     return train, val, test
+ ```
51
+
52
+ **Critical splitting rules**:
53
+ 1. **Split before preprocessing**: Fit preprocessing (scalers, encoders, imputers, tokenizer vocabularies) on training data only, then apply it to val/test. Fitting on the combined dataset is data leakage.
54
+ 2. **Stratify by label for classification**: Ensures class distribution is preserved in each split.
55
+ 3. **Split by entity, not row, for grouped data**: If you have multiple rows per user, all rows for a user must go to the same split. Row-level splitting leaks user-level information.
56
+ 4. **Temporal split for time-series**: Train on past, validate and test on future. Random splits would leak future information.
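
Rule 3 (split by entity) can be enforced by deriving the split from the entity ID rather than from the row. A minimal sketch using a salted hash, so that every row for a given user lands in the same split and the assignment stays stable across reruns. `assign_split` and the `salt` value are hypothetical; scikit-learn's `GroupShuffleSplit` and `GroupKFold` are existing alternatives when the data fits in memory:

```python
import hashlib


def assign_split(
    entity_id: str,
    val_fraction: float = 0.1,
    test_fraction: float = 0.1,
    salt: str = "split-v1",
) -> str:
    """Deterministically map an entity (e.g. a user ID) to train, val, or test."""
    digest = hashlib.sha256(f"{salt}:{entity_id}".encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled to a value in [0, 1)
    bucket = int(digest[:8], 16) / 0x1_0000_0000
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"


if __name__ == "__main__":
    rows = [
        {"user_id": "u1", "amount": 10.0},
        {"user_id": "u1", "amount": 25.0},
        {"user_id": "u2", "amount": 7.5},
    ]
    for row in rows:
        print(row["user_id"], assign_split(row["user_id"]))
```

Changing the salt produces a fresh assignment, which is one way to draw a new holdout without touching the data itself.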
57
+
58
+ ### Temporal Splits
59
+
60
+ For any dataset with a time dimension, always split by time:
61
+
62
+ ```python
63
+ def temporal_split(
64
+ df: pd.DataFrame,
65
+ timestamp_col: str,
66
+ val_start: str,
67
+ test_start: str,
68
+ ) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
69
+ """Create temporal splits — train/val/test defined by date boundaries."""
70
+ df = df.sort_values(timestamp_col)
71
+ train = df[df[timestamp_col] < val_start]
72
+ val = df[(df[timestamp_col] >= val_start) & (df[timestamp_col] < test_start)]
73
+ test = df[df[timestamp_col] >= test_start]
74
+ return train, val, test
75
+ ```
76
+
77
+ **Backtesting** extends temporal evaluation by simulating deployment across multiple time windows: it tests whether a model trained on one period holds up on each subsequent period.
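
One way to set up such a backtest is to generate the window boundaries first and then train and evaluate once per window. A minimal sketch of an expanding-window schedule, assuming daily granularity; `backtest_windows` and its parameters are illustrative names, not from the original guidance:

```python
from datetime import date, timedelta


def backtest_windows(
    start: date,
    end: date,
    initial_train_days: int,
    test_days: int,
) -> list[tuple[date, date, date]]:
    """Return (train_start, test_start, test_end) tuples for an expanding-window backtest.

    Training covers [train_start, test_start); testing covers [test_start, test_end).
    """
    windows = []
    test_start = start + timedelta(days=initial_train_days)
    while test_start + timedelta(days=test_days) <= end:
        test_end = test_start + timedelta(days=test_days)
        windows.append((start, test_start, test_end))
        # The next window trains on everything up to the end of this test period
        test_start = test_end
    return windows


if __name__ == "__main__":
    begin = date(2024, 1, 1)
    for window in backtest_windows(begin, begin + timedelta(days=120), 60, 30):
        print(window)
```

Each tuple can be fed to a date-boundary split so that no window ever trains on data from its own test period.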
78
+
79
+ ### Cross-Validation
80
+
81
+ Use k-fold cross-validation when dataset size is insufficient for a stable held-out set (< 10,000 examples):
82
+
83
+ ```python
+ from sklearn.model_selection import StratifiedKFold
+ import numpy as np
+
+ def cross_validate(
+     X: np.ndarray,
+     y: np.ndarray,
+     model_builder,
+     n_folds: int = 5,
+     seed: int = 42,
+ ) -> dict[str, dict[str, float]]:
+     """Stratified k-fold cross-validation; returns mean and std per metric."""
+     skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
+     fold_metrics = []
+
+     for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
+         X_train, X_val = X[train_idx], X[val_idx]
+         y_train, y_val = y[train_idx], y[val_idx]
+
+         # Build a fresh model per fold so no state leaks between folds
+         model = model_builder()
+         model.fit(X_train, y_train)
+         # evaluate_model is a project-specific helper returning a dict of metric name -> value
+         metrics = evaluate_model(model, X_val, y_val)
+         fold_metrics.append(metrics)
+
+     # Aggregate across folds
+     return {
+         metric: {
+             "mean": np.mean([m[metric] for m in fold_metrics]),
+             "std": np.std([m[metric] for m in fold_metrics]),
+         }
+         for metric in fold_metrics[0]
+     }
+ ```
116
+
117
+ **Nested cross-validation** separates hyperparameter selection from model evaluation:
118
+ - Outer loop: Estimate generalisation error
119
+ - Inner loop: Select hyperparameters via grid/random search
120
+ - Prevents over-fitting hyperparameters to the validation set
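
The two loops above map onto scikit-learn by passing a `GridSearchCV` estimator (inner loop) to `cross_val_score` (outer loop). A minimal sketch on synthetic data; the dataset, the logistic-regression model, and the `C` grid are assumptions chosen only to keep the example small. Wrapping the scaler in a `Pipeline` also keeps preprocessing fitted on each training fold only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def nested_cv_scores(seed: int = 42):
    """Run nested cross-validation and return one ROC-AUC score per outer fold."""
    X, y = make_classification(n_samples=300, n_features=10, random_state=seed)

    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Inner loop: pick C using only the outer-training portion of each fold
    search = GridSearchCV(
        pipeline,
        param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
        scoring="roc_auc",
        cv=inner_cv,
    )

    # Outer loop: score the tuned estimator on data the search never saw
    return cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)


if __name__ == "__main__":
    scores = nested_cv_scores()
    print(f"nested CV ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```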
121
+
122
+ ### Holdout Sets and Evaluation Integrity
123
+
124
+ **The test set is sacred**: It may be touched exactly once — when reporting final model performance before deployment. Every other decision (architecture, hyperparameters, features) uses the validation set.
125
+
126
+ If you look at test set performance and then make changes, the test set is contaminated — you must collect a fresh test set.
127
+
128
+ **Multiple evaluation sets**:
129
+ - **In-distribution test set**: Same distribution as training data. Measures how well the model learned.
130
+ - **Out-of-distribution test set**: Different time period, geography, or user cohort. Measures generalisation.
131
+ - **Adversarial / challenging test set**: Hard examples, edge cases, known failure modes. Measures robustness.
132
+ - **Slice-specific test sets**: Subsets by demographic, category, or difficulty. Measures fairness and consistency.
133
+
134
+ ### Metrics by Task Type
135
+
136
+ **Binary Classification**:
137
+ ```python
138
+ from sklearn.metrics import (
139
+ accuracy_score, precision_score, recall_score,
140
+ f1_score, roc_auc_score, average_precision_score,
141
+ confusion_matrix, classification_report,
142
+ )
143
+
144
+ def evaluate_binary_classifier(
145
+ y_true: np.ndarray,
146
+ y_pred_proba: np.ndarray,
147
+ threshold: float = 0.5,
148
+ ) -> dict[str, float]:
149
+ y_pred = (y_pred_proba >= threshold).astype(int)
150
+ return {
151
+ "accuracy": accuracy_score(y_true, y_pred),
152
+ "precision": precision_score(y_true, y_pred, zero_division=0),
153
+ "recall": recall_score(y_true, y_pred, zero_division=0),
154
+ "f1": f1_score(y_true, y_pred, zero_division=0),
155
+ "roc_auc": roc_auc_score(y_true, y_pred_proba),
156
+ "pr_auc": average_precision_score(y_true, y_pred_proba),
157
+ }
158
+ ```
159
+
160
+ **Regression**:
161
+ ```python
+ import numpy as np
+ from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
+
+ def evaluate_regressor(y_true, y_pred) -> dict[str, float]:
+     y_true = np.asarray(y_true, dtype=float)
+     y_pred = np.asarray(y_pred, dtype=float)
+     return {
+         "mae": mean_absolute_error(y_true, y_pred),
+         "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
+         "r2": r2_score(y_true, y_pred),
+         # Divide by |y_true| floored at 1e-8 so zero or near-zero targets cannot divide by zero
+         "mape": np.mean(np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), 1e-8)) * 100,
+     }
+ ```
172
+
173
+ **Multi-class classification**:
174
+ - Macro-average: Equal weight per class — use when class imbalance should not inflate aggregate metrics
175
+ - Weighted-average: Weight by class support — use for overall system performance
176
+ - Per-class metrics: Report separately to catch poor performance on minority classes
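
The difference between the averaging modes is easiest to see on an imbalanced toy example. A minimal pure-Python sketch (the helper names are illustrative; `sklearn.metrics.f1_score` with `average="macro"`, `"weighted"`, or `None` is the library route):

```python
from collections import Counter


def per_class_f1(y_true: list, y_pred: list) -> dict:
    """F1 for each label seen in either the true or predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true: list, y_pred: list) -> tuple[float, float]:
    """Return (macro F1, support-weighted F1)."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[label] * support[label] for label in scores) / len(y_true)
    return macro, weighted


if __name__ == "__main__":
    # A model that always predicts the majority class "a"
    y_true = ["a"] * 8 + ["b"] * 2
    y_pred = ["a"] * 10
    print(per_class_f1(y_true, y_pred))
    print(macro_and_weighted_f1(y_true, y_pred))
```

On this example the weighted average stays high because class "a" dominates, while the macro average and the per-class breakdown expose the complete failure on class "b".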
177
+
178
+ ### Slice Analysis
179
+
180
+ Aggregate metrics can hide systematic failures in subgroups. Slice analysis breaks down performance by meaningful subsets:
181
+
182
+ ```python
183
+ def slice_analysis(
184
+ df: pd.DataFrame,
185
+ y_true_col: str,
186
+ y_pred_col: str,
187
+ slice_cols: list[str],
188
+ metric_fn,
189
+ ) -> pd.DataFrame:
190
+ """Compute metrics for each slice of the data."""
191
+ results = []
192
+
193
+ # Overall metrics
194
+ overall = metric_fn(df[y_true_col], df[y_pred_col])
195
+ results.append({"slice": "overall", "n": len(df), **overall})
196
+
197
+ # Per-slice metrics
198
+ for col in slice_cols:
199
+ for value, group in df.groupby(col):
200
+ if len(group) < 50: # Skip slices with too few examples
201
+ continue
202
+ metrics = metric_fn(group[y_true_col], group[y_pred_col])
203
+ results.append({
204
+ "slice": f"{col}={value}",
205
+ "n": len(group),
206
+ **metrics,
207
+ })
208
+
209
+ return pd.DataFrame(results)
210
+ ```
211
+
212
+ **Slices to always analyse**:
213
+ - Demographic groups (if available and legally permissible): age band, gender, geography
214
+ - Data quality slices: high vs. low confidence labels, recent vs. old data
215
+ - Difficulty slices: high vs. low frequency items, short vs. long text
216
+ - Business-relevant slices: product category, customer segment, price tier
217
+
218
+ **Flagging disparities**: If a slice's metric deviates from overall by more than a threshold (e.g., 10 percentage points), flag for investigation before deployment.
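
The flagging rule can be applied directly to per-slice results. A minimal sketch, assuming rows shaped like the output of the slice analysis above (a `slice` name, a count `n`, and metric values on a 0 to 1 scale, so 0.10 means 10 percentage points); `flag_disparities` is a hypothetical helper:

```python
def flag_disparities(
    slice_metrics: list[dict],
    metric: str,
    threshold: float = 0.10,
) -> list[dict]:
    """Return slices whose metric differs from the overall value by more than threshold."""
    overall = next(row for row in slice_metrics if row["slice"] == "overall")
    flagged = []
    for row in slice_metrics:
        if row["slice"] == "overall":
            continue
        gap = row[metric] - overall[metric]
        if abs(gap) > threshold:
            flagged.append({"slice": row["slice"], "n": row["n"], "gap": gap})
    return flagged


if __name__ == "__main__":
    rows = [
        {"slice": "overall", "n": 1000, "recall": 0.80},
        {"slice": "segment=A", "n": 600, "recall": 0.78},
        {"slice": "segment=B", "n": 400, "recall": 0.65},
    ]
    print(flag_disparities(rows, metric="recall"))
```

The threshold is a policy choice; a tighter value may suit high-stakes slices.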
219
+
220
+ ### Baseline Comparisons
221
+
222
+ Every model evaluation must include a comparison to baselines:
223
+ - **Trivial baseline**: Predict the majority class (classification) or mean target value (regression)
224
+ - **Rule-based baseline**: The current production rule or heuristic
225
+ - **Previous model version**: The model currently in production
226
+ - **Simple ML baseline**: Logistic regression or decision tree
227
+
228
+ A model that does not beat all baselines should not be deployed. The trivial baseline check catches label encoding bugs (where the model learns the majority class trivially).
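
The trivial baseline and the deployment gate are both a few lines of code. A minimal sketch in pure Python, assuming accuracy as the comparison metric and a higher-is-better score; the function names are illustrative (scikit-learn's `DummyClassifier` provides the same majority-class baseline as an estimator):

```python
from collections import Counter


def majority_class_accuracy(y_train: list, y_test: list) -> float:
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return sum(label == majority_label for label in y_test) / len(y_test)


def beats_all_baselines(candidate_score: float, baseline_scores: dict[str, float]) -> bool:
    """True only if the candidate is strictly better than every baseline."""
    return all(candidate_score > score for score in baseline_scores.values())


if __name__ == "__main__":
    y_train = [0] * 90 + [1] * 10
    y_test = [0] * 45 + [1] * 5
    trivial = majority_class_accuracy(y_train, y_test)
    print(trivial, beats_all_baselines(0.93, {"trivial": trivial, "rules": 0.91}))
```

On heavily imbalanced data the trivial accuracy is already high, which is one reason to run the same comparison on PR-AUC or recall as well.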
229
+
230
+ ### Evaluation Report Structure
231
+
232
+ ```markdown
233
+ # Evaluation Report: fraud-detector-v2.3.0
234
+
235
+ ## Dataset
236
+ - Test set: 45,231 examples (2024-01-01 to 2024-03-31)
237
+ - Class balance: 1.2% fraud, 98.8% non-fraud
238
+
239
+ ## Overall Metrics
240
+ | Metric | v2.2.0 (prod) | v2.3.0 (candidate) | Delta |
241
+ |--------|--------------|-------------------|-------|
242
+ | ROC-AUC | 0.921 | 0.934 | +1.4% |
243
+ | PR-AUC | 0.712 | 0.748 | +5.1% |
244
+ | Recall @ precision=0.9 | 0.68 | 0.73 | +7.4% |
245
+
246
+ ## Slice Analysis
247
+ | Slice | n | ROC-AUC | vs. Overall |
248
+ |-------|---|---------|-------------|
249
+ | Overall | 45,231 | 0.934 | — |
250
+ | Amount < $50 | 12,445 | 0.941 | +0.7% |
251
+ | Amount > $1000 | 3,211 | 0.918 | -1.7% |
252
+ | New user (< 30 days) | 8,902 | 0.891 | -4.6% ⚠️ |
253
+
254
+ ## Recommendation
255
+ Promote to staging. Investigate new user performance degradation before production.
256
+ ```
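
The "Recall @ precision=0.9" row in the report above is an operating-point metric: the highest recall achievable at any score threshold whose precision is still at least the target. A minimal pure-Python sketch, assuming binary 0/1 labels and higher scores meaning more likely positive; `recall_at_precision` is a hypothetical helper (`sklearn.metrics.precision_recall_curve` yields the same curve for larger datasets):

```python
def recall_at_precision(y_true: list[int], scores: list[float], min_precision: float = 0.9) -> float:
    """Best recall over all thresholds whose precision is at least min_precision."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("y_true contains no positive examples")

    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores fall on the same side of any threshold
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        if precision >= min_precision:
            best_recall = max(best_recall, tp / total_positives)
    return best_recall


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0]
    model_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
    print(recall_at_precision(labels, model_scores, min_precision=0.9))
```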