@zigrivers/scaffold 3.8.0 → 3.9.1

Files changed (70)
  1. package/README.md +73 -8
  2. package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
  3. package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
  4. package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
  5. package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
  6. package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
  7. package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
  8. package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
  9. package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
  10. package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
  11. package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
  12. package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
  13. package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
  14. package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
  15. package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
  16. package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
  17. package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
  18. package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
  19. package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
  20. package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
  21. package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
  22. package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
  23. package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
  24. package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
  25. package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
  26. package/content/knowledge/ml/ml-architecture.md +172 -0
  27. package/content/knowledge/ml/ml-conventions.md +209 -0
  28. package/content/knowledge/ml/ml-dev-environment.md +299 -0
  29. package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
  30. package/content/knowledge/ml/ml-model-evaluation.md +256 -0
  31. package/content/knowledge/ml/ml-observability.md +253 -0
  32. package/content/knowledge/ml/ml-project-structure.md +216 -0
  33. package/content/knowledge/ml/ml-requirements.md +138 -0
  34. package/content/knowledge/ml/ml-security.md +188 -0
  35. package/content/knowledge/ml/ml-serving-patterns.md +243 -0
  36. package/content/knowledge/ml/ml-testing.md +301 -0
  37. package/content/knowledge/ml/ml-training-patterns.md +269 -0
  38. package/content/methodology/browser-extension-overlay.yml +82 -0
  39. package/content/methodology/data-pipeline-overlay.yml +70 -0
  40. package/content/methodology/ml-overlay.yml +70 -0
  41. package/dist/cli/commands/init.d.ts +13 -0
  42. package/dist/cli/commands/init.d.ts.map +1 -1
  43. package/dist/cli/commands/init.js +122 -2
  44. package/dist/cli/commands/init.js.map +1 -1
  45. package/dist/cli/commands/init.test.js +120 -0
  46. package/dist/cli/commands/init.test.js.map +1 -1
  47. package/dist/config/schema.d.ts +864 -48
  48. package/dist/config/schema.d.ts.map +1 -1
  49. package/dist/config/schema.js +53 -0
  50. package/dist/config/schema.js.map +1 -1
  51. package/dist/config/schema.test.js +166 -3
  52. package/dist/config/schema.test.js.map +1 -1
  53. package/dist/core/assembly/overlay-loader.test.js +33 -0
  54. package/dist/core/assembly/overlay-loader.test.js.map +1 -1
  55. package/dist/e2e/project-type-overlays.test.d.ts +2 -2
  56. package/dist/e2e/project-type-overlays.test.js +499 -33
  57. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  58. package/dist/types/config.d.ts +10 -1
  59. package/dist/types/config.d.ts.map +1 -1
  60. package/dist/wizard/questions.d.ts +17 -1
  61. package/dist/wizard/questions.d.ts.map +1 -1
  62. package/dist/wizard/questions.js +75 -1
  63. package/dist/wizard/questions.js.map +1 -1
  64. package/dist/wizard/questions.test.js +167 -0
  65. package/dist/wizard/questions.test.js.map +1 -1
  66. package/dist/wizard/wizard.d.ts +13 -0
  67. package/dist/wizard/wizard.d.ts.map +1 -1
  68. package/dist/wizard/wizard.js +17 -1
  69. package/dist/wizard/wizard.js.map +1 -1
  70. package/package.json +1 -1
package/content/knowledge/ml/ml-observability.md
@@ -0,0 +1,253 @@
1
+ ---
2
+ name: ml-observability
3
+ description: Model monitoring for drift and decay, prediction logging, explainability tools, and alerting on accuracy drops in production ML systems
4
+ topics: [ml, observability, monitoring, drift, model-decay, explainability, alerting, prediction-logging]
5
+ ---
6
+
7
+ A model deployed to production without monitoring is a ticking clock. Models decay silently: the world changes, input distributions shift, and accuracy degrades while dashboards show green. Unlike software bugs that throw exceptions, model degradation has no stack trace — predictions simply become less useful. ML observability is the discipline of detecting these degradations before users notice them, through systematic monitoring of model inputs, outputs, and outcomes.
8
+
9
+ ## Summary
10
+
11
+ ML observability covers four pillars: input monitoring (feature drift detection), output monitoring (prediction distribution shifts), outcome monitoring (accuracy against labels), and operational monitoring (latency, error rate). Complement monitoring with prediction logging for post-hoc analysis and explainability tools (SHAP, LIME) for understanding individual predictions and debugging systematic failures. Alert thresholds and on-call rotation for model health are as important as for service health.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### The Four Pillars of ML Observability
16
+
17
+ **Pillar 1 — Input monitoring (data drift)**: Detect when the distribution of model inputs changes from the training distribution. A model trained on winter data receiving summer data will degrade without any software change.
18
+
19
+ **Pillar 2 — Output monitoring (prediction drift)**: Detect when the model's prediction distribution changes — e.g., a fraud model that suddenly classifies 10% of transactions as fraud (vs. the baseline 0.1%).
20
+
21
+ **Pillar 3 — Outcome monitoring (accuracy/concept drift)**: Detect when model accuracy changes on labelled outcomes. Requires ground truth labels, which often arrive with delay (e.g., actual fraud confirmed days after prediction).
22
+
23
+ **Pillar 4 — Operational monitoring**: Latency, throughput, error rate, memory usage. Standard SRE metrics applied to the model serving layer.
24
+
25
+ ### Feature Drift Detection
26
+
27
+ Measure drift between training and serving feature distributions using statistical tests:
28
+
+ ```python
+ from dataclasses import dataclass
+
+ import numpy as np
+ from scipy import stats
+
+ @dataclass
+ class DriftReport:
+     feature: str
+     psi: float            # Population Stability Index
+     ks_statistic: float   # Kolmogorov-Smirnov statistic
+     ks_p_value: float
+     is_drifted: bool
+
+ def compute_psi(
+     expected: np.ndarray,
+     actual: np.ndarray,
+     buckets: int = 10,
+ ) -> float:
+     """Population Stability Index. PSI < 0.1: stable, 0.1-0.2: minor drift, >0.2: significant drift."""
+     eps = 1e-6  # keeps empty buckets from producing log(0) or division by zero
+     # Bin edges come from the expected (training) sample. Serving values that
+     # fall outside the training range are not counted by np.histogram.
+     expected_counts, bins = np.histogram(expected, bins=buckets)
+     actual_counts, _ = np.histogram(actual, bins=bins)
+
+     expected_pcts = expected_counts / expected_counts.sum() + eps
+     actual_pcts = actual_counts / actual_counts.sum() + eps
+
+     return float(np.sum((actual_pcts - expected_pcts) * np.log(actual_pcts / expected_pcts)))
+
+ def detect_drift(
+     training_values: np.ndarray,
+     serving_values: np.ndarray,
+     feature_name: str,
+     psi_threshold: float = 0.2,
+     ks_alpha: float = 0.05,
+ ) -> DriftReport:
+     psi = compute_psi(training_values, serving_values)
+     ks_stat, ks_p = stats.ks_2samp(training_values, serving_values)
+     return DriftReport(
+         feature=feature_name,
+         psi=psi,
+         ks_statistic=ks_stat,
+         ks_p_value=ks_p,
+         # Flag drift if either test fires
+         is_drifted=psi > psi_threshold or ks_p < ks_alpha,
+     )
+ ```
75
+
76
+ **Reference distribution maintenance**: Store feature statistics (mean, std, percentiles, histogram) from the training set as a "reference profile." Compare each day's serving data to this profile. Refresh the reference when the model is retrained.
77
+
78
+ **PSI thresholds** (a widely used rule of thumb):
79
+ - PSI < 0.1: No significant drift — monitor as normal
80
+ - 0.1 ≤ PSI < 0.2: Minor drift — investigate, consider retraining
81
+ - PSI ≥ 0.2: Significant drift — trigger retraining or alert
82
+
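The reference-profile idea above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the package's implementation: `build_reference_profile`, `psi_against_profile`, and `classify_psi` are hypothetical names, bins are equal-width, and the bin edges are stored with the profile so that serving data is always bucketed the same way as the training data.

```python
import math
from bisect import bisect_right

def _proportions(values, edges):
    """Share of values in each bucket defined by the interior edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # Values below the first edge land in bucket 0 and values above the
        # last edge land in the final bucket, so out-of-range data is counted.
        counts[bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]

def build_reference_profile(values, buckets=10):
    """Store equal-width bin edges and bucket proportions from training data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets or 1.0  # fall back to 1.0 for constant features
    edges = [lo + i * width for i in range(1, buckets)]  # interior edges only
    return {"edges": edges, "proportions": _proportions(values, edges)}

def psi_against_profile(profile, serving_values, eps=1e-6):
    """PSI between the stored training proportions and a serving sample."""
    actual = _proportions(serving_values, profile["edges"])
    return sum(
        ((a + eps) - (e + eps)) * math.log((a + eps) / (e + eps))
        for e, a in zip(profile["proportions"], actual)
    )

def classify_psi(psi):
    """Map a PSI value onto the three bands listed above."""
    if psi < 0.1:
        return "stable"
    if psi < 0.2:
        return "minor_drift"
    return "significant_drift"

training = list(range(100))
profile = build_reference_profile(training)
shifted = [v + 50 for v in training]
print(classify_psi(psi_against_profile(profile, training)))
print(classify_psi(psi_against_profile(profile, shifted)))
```

Storing the edges rather than recomputing them per day keeps daily PSI values comparable with each other; the profile would be rebuilt whenever the model is retrained.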
83
+ ### Prediction Logging
84
+
85
+ Every prediction made in production should be logged for monitoring and post-hoc analysis:
86
+
+ ```python
+ # src/serving/prediction_logger.py
+ import json
+ import time
+ from dataclasses import dataclass, asdict
+ from typing import Any
+
+ @dataclass
+ class PredictionRecord:
+     prediction_id: str    # UUID for correlation
+     model_version: str
+     timestamp: float
+     request_id: str       # Trace ID for distributed tracing
+     input_features: dict  # Logged features (scrub PII before logging)
+     prediction: Any
+     confidence: float
+     latency_ms: float
+
+ class PredictionLogger:
+     def __init__(self, sink):  # sink: Kafka producer, Kinesis, or file
+         self.sink = sink
+
+     def log(self, record: PredictionRecord) -> None:
+         payload = json.dumps(asdict(record))
+         self.sink.send(payload)
+ ```
113
+
114
+ **What to log** (balance observability with privacy/cost):
115
+ - Always: prediction ID, model version, timestamp, prediction value, confidence, latency
116
+ - Feature logging: Log features used for prediction (important for drift detection and debugging)
117
+ - PII scrubbing: Never log raw PII fields; log derived features or anonymised values only
118
+ - Sampling: For very high-throughput systems (> 10K RPS), log a representative sample (1–10%)
119
+
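The PII-scrubbing and sampling rules above might look like the following sketch. The field names in `PII_FIELDS` and the helper names `scrub_features` and `should_log` are hypothetical, chosen only for illustration. Sampling is keyed on a hash of the prediction ID, so the decision is deterministic and the same prediction is either always or never logged across retries.

```python
import hashlib

# Hypothetical PII field names; a real project would derive these from its schema.
PII_FIELDS = {"email", "full_name", "phone", "ip_address"}

def scrub_features(features: dict) -> dict:
    """Drop raw PII fields before the record is logged."""
    return {k: v for k, v in features.items() if k not in PII_FIELDS}

def should_log(prediction_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: hash the ID into [0, 2**64) and keep the lowest share."""
    digest = hashlib.sha256(prediction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64

features = {"email": "a@example.com", "amount": 12.5}
print(scrub_features(features))
print(should_log("abc", 1.0), should_log("abc", 0.0))
```

Hash-based sampling is one option among several; random sampling per request is simpler but makes it harder to join sampled predictions with delayed labels reproducibly.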
120
+ **Label joining**: When ground truth labels arrive (delayed), join them with prediction logs using the prediction ID to compute accuracy metrics:
121
+ ```sql
122
+ SELECT
123
+ p.model_version,
124
+ COUNT(*) as n_predictions,
125
+ AVG(CASE WHEN p.prediction = l.actual_label THEN 1 ELSE 0 END) as accuracy,
126
+ AVG(p.confidence) as mean_confidence
127
+ FROM predictions p
128
+ JOIN labels l ON p.prediction_id = l.prediction_id
129
+ WHERE p.timestamp >= NOW() - INTERVAL '7 days'
130
+ GROUP BY p.model_version
131
+ ```
132
+
133
+ ### Explainability
134
+
135
+ Explainability tools help debug model failures and satisfy regulatory requirements:
136
+
137
+ **SHAP (SHapley Additive exPlanations)**: Computes feature importance for individual predictions using game-theoretic Shapley values. Works with any model.
138
+
+ ```python
+ import numpy as np
+ import shap
+
+ # `model`, `X_train`, and `X_test` are assumed to be defined elsewhere
+ # Sample a background dataset for the explainer
+ background = X_train[np.random.choice(len(X_train), 100, replace=False)]
+ explainer = shap.TreeExplainer(model)  # For tree models
+ # explainer = shap.DeepExplainer(model, background)  # For neural networks
+ # explainer = shap.KernelExplainer(model.predict_proba, background)  # Model-agnostic
+
+ # Explain a single prediction
+ # Indexing with [1] assumes a binary classifier and a SHAP version that
+ # returns one array per class; other versions may return a single array.
+ shap_values = explainer.shap_values(X_test[0:1])
+ shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_test[0])
+
+ # Explain the entire test set (global feature importance)
+ shap_values_all = explainer.shap_values(X_test)
+ shap.summary_plot(shap_values_all[1], X_test)
+ ```
156
+
157
+ **LIME (Local Interpretable Model-agnostic Explanations)**: Fits a simple interpretable model (linear regression) locally around each prediction.
158
+
+ ```python
+ from lime.lime_tabular import LimeTabularExplainer
+
+ explainer = LimeTabularExplainer(
+     X_train,
+     feature_names=feature_names,
+     class_names=["legitimate", "fraud"],
+     mode="classification",
+ )
+
+ explanation = explainer.explain_instance(
+     X_test[0],
+     model.predict_proba,
+     num_features=10,
+ )
+ explanation.show_in_notebook()
+ ```
176
+
177
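+ **Integrated Gradients** (for neural networks): Attribution method that satisfies the completeness axiom (attributions sum to the difference between the model output at the input and at the baseline). Available in Captum (PyTorch):
+ ```python
+ import torch
+ from captum.attr import IntegratedGradients
+
+ # `model` and `input_tensor` are assumed to be defined elsewhere
+ ig = IntegratedGradients(model)
+ # Captum's keyword is `baselines` (plural); all zeros is a common default.
+ # For multi-class outputs, also pass target=<class index>.
+ attributions = ig.attribute(input_tensor, baselines=torch.zeros_like(input_tensor))
+ ```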
184
+
185
+ ### Alerting Strategy
186
+
187
+ Define alert thresholds before deployment, not after a production incident:
188
+
+ ```yaml
+ # monitoring/alerts.yaml
+ alerts:
+   - name: accuracy_degradation_warning
+     metric: val_accuracy_7d_rolling
+     condition: "< 0.87"   # Warning: 2pp below target
+     severity: warning
+     action: page_on_call
+
+   - name: accuracy_degradation_critical
+     metric: val_accuracy_7d_rolling
+     condition: "< 0.85"   # Critical: at SLA threshold
+     severity: critical
+     action: page_on_call_and_escalate
+
+   - name: feature_drift_significant
+     metric: max_psi_across_features
+     condition: "> 0.2"
+     severity: warning
+     action: notify_ml_team
+
+   - name: prediction_rate_anomaly
+     metric: fraud_prediction_rate_1h
+     condition: "> 0.05"   # 50x the 0.1% baseline rate
+     severity: critical
+     action: page_on_call
+
+   - name: serving_latency_breach
+     metric: p99_latency_ms
+     condition: "> 200"
+     severity: warning
+     action: notify_ml_team
+ ```
222
+
223
+ **Alerting anti-patterns**:
224
+ - Alert fatigue: Too many low-signal alerts cause teams to ignore them. Start with critical-only alerts and add warnings after establishing baselines.
225
+ - Static thresholds for seasonal data: Use rolling baselines that adapt to weekly/seasonal patterns.
226
+ - No runbook: Every alert must have a runbook link: "When this fires, do X, check Y, escalate to Z."
227
+
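A rolling baseline, as recommended in the second anti-pattern, can be sketched as below. This is a simplified illustration: `rolling_baseline_alerts` is a hypothetical helper, the window is a plain trailing window over daily values, and the three-sigma rule is an assumption rather than a recommendation from the source. Strongly weekly data may instead need a comparison against the same weekday in earlier weeks.

```python
from statistics import mean, stdev

def rolling_baseline_alerts(values, window=7, n_sigmas=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than n_sigmas standard deviations of that window."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        threshold = n_sigmas * max(sigma, 1e-9)  # guard against a flat history
        if abs(values[i] - mu) > threshold:
            alerts.append(i)
    return alerts

# Daily fraud-prediction rates (in percent) with a spike on the last day
daily_rate = [0.10, 0.11, 0.09, 0.10, 0.11, 0.09, 0.10, 0.10, 0.30]
print(rolling_baseline_alerts(daily_rate))
```

Because the threshold is recomputed from recent history, a slow seasonal rise moves the baseline with it, while a sudden jump still stands out.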
228
+ ### Model Monitoring Dashboard
229
+
230
+ A model health dashboard should show at a glance:
231
+
232
+ ```
233
+ Model: fraud-detector v2.3.1 | Status: HEALTHY | Updated: 5 minutes ago
234
+
235
+ ┌─────────────────┬──────────────────┬──────────────────┐
236
+ │ Accuracy (7d) │ Prediction Rate │ P99 Latency │
237
+ │ 87.3% ✓ │ 0.12% ✓ │ 142ms ✓ │
238
+ │ target: ≥85% │ baseline: 0.1% │ SLA: <200ms │
239
+ └─────────────────┴──────────────────┴──────────────────┘
240
+
241
+ ┌──────────────────────────────────────────────────────┐
242
+ │ Feature Drift (PSI) │
243
+ │ transaction_amount: 0.08 ✓ │
244
+ │ merchant_category: 0.12 ⚠ (minor drift) │
245
+ │ user_age_days: 0.04 ✓ │
246
+ └──────────────────────────────────────────────────────┘
247
+ ```
248
+
249
+ **Retraining triggers**: Codify when to retrain rather than leaving it to human judgment:
250
+ - Accuracy drops below warning threshold for 48+ consecutive hours
251
+ - PSI > 0.2 on any top-10 feature by SHAP importance
252
+ - Major upstream data source change (schema change, new data source)
253
+ - Scheduled retraining on a fixed cadence (monthly for most models)
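One way to codify these triggers is a small pure function that returns the reasons retraining is due. This is a sketch, not the package's code: `ModelHealth` and `retraining_reasons` are hypothetical names, and the defaults (48 hours, PSI 0.2, a 30-day cadence standing in for "monthly") mirror the list above.

```python
from dataclasses import dataclass

@dataclass
class ModelHealth:
    hours_below_warning: float     # consecutive hours under the accuracy warning threshold
    top_feature_psi: dict          # PSI for each top-10 feature by SHAP importance
    upstream_schema_changed: bool  # major upstream data source change detected
    days_since_last_training: int

def retraining_reasons(health, psi_threshold=0.2, max_hours_below=48, cadence_days=30):
    """Return the list of triggers that currently fire (empty list: no retraining)."""
    reasons = []
    if health.hours_below_warning >= max_hours_below:
        reasons.append("accuracy_below_warning")
    drifted = sorted(f for f, psi in health.top_feature_psi.items() if psi > psi_threshold)
    if drifted:
        reasons.append("feature_drift:" + ",".join(drifted))
    if health.upstream_schema_changed:
        reasons.append("upstream_change")
    if health.days_since_last_training >= cadence_days:
        reasons.append("scheduled")
    return reasons

healthy = ModelHealth(0, {"transaction_amount": 0.08, "merchant_category": 0.12}, False, 10)
drifting = ModelHealth(0, {"transaction_amount": 0.08, "merchant_category": 0.31}, False, 10)
print(retraining_reasons(healthy))
print(retraining_reasons(drifting))
```

Returning reasons rather than a bare boolean makes the decision auditable: the retraining job can record why it ran.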
package/content/knowledge/ml/ml-project-structure.md
@@ -0,0 +1,216 @@
1
+ ---
2
+ name: ml-project-structure
3
+ description: Standard ML project directory layout covering src/data, src/models, src/training, src/serving, notebooks, configs, and model artifact storage
4
+ topics: [ml, project-structure, layout, organization, artifacts, notebooks]
5
+ ---
6
+
7
+ ML projects accumulate files faster than almost any other software domain: datasets, model checkpoints, experiment configs, notebooks, evaluation reports, and serving code. Without a deliberate directory structure, projects become disorganised within weeks and new team members struggle to find their way around. A well-structured ML project separates concerns clearly: source code from notebooks, training from serving, configs from code, and tracked artifacts from ephemeral outputs.
8
+
9
+ ## Summary
10
+
11
+ A standard ML project separates source code (`src/`), exploratory notebooks (`notebooks/`), training configurations (`configs/`), and model artifacts (`models/`). Within `src/`, separate data loading (`src/data/`), model architectures (`src/models/`), training logic (`src/training/`), and serving code (`src/serving/`). Keep large artifacts (datasets, checkpoints) out of git using `.gitignore` and DVC or object storage. The structure should be navigable to a new team member within five minutes.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Top-Level Directory Structure
16
+
17
+ ```
18
+ project-root/
19
+ ├── configs/ # All experiment and model configs (YAML/TOML)
20
+ ├── data/ # Data directory (gitignored content, DVC-tracked)
21
+ │ ├── raw/ # Immutable raw data as received from source
22
+ │ ├── processed/ # Cleaned, transformed datasets
23
+ │ └── splits/ # Train/val/test split files (CSV/JSON of IDs)
24
+ ├── docs/ # Architecture decisions, dataset cards, model cards
25
+ ├── models/ # Model artifact storage (gitignored; object storage backed)
26
+ │ ├── checkpoints/ # Training checkpoints (epoch-N.pt)
27
+ │ └── registry/ # Production-promoted model versions
28
+ ├── notebooks/ # Jupyter notebooks for exploration (outputs cleared before commit)
29
+ ├── reports/ # Evaluation reports, figures, experiment summaries
30
+ ├── scripts/ # One-off utility scripts (not part of the pipeline)
31
+ ├── src/ # All production source code
32
+ │ ├── data/ # Dataset classes, loaders, preprocessing
33
+ │ ├── models/ # Model architecture definitions
34
+ │ ├── training/ # Training loops, loss functions, callbacks
35
+ │ ├── evaluation/ # Metrics, evaluation runners, result serialisation
36
+ │ └── serving/ # Inference pipelines, API handlers, preprocessing wrappers
37
+ ├── tests/ # Unit and integration tests
38
+ ├── .dvc/ # DVC metadata (committed to git)
39
+ ├── .gitignore # Excludes data/, models/, __pycache__, .env
40
+ ├── pyproject.toml # Project metadata and dependencies (Poetry)
41
+ ├── Makefile # Task runner: train, evaluate, serve, test
42
+ └── README.md # Project overview, setup, and usage
43
+ ```
44
+
45
+ ### `src/data/` — Data Loading and Preprocessing
46
+
47
+ This directory contains all code that transforms raw data into model-ready tensors:
48
+
49
+ ```
50
+ src/data/
51
+ ├── __init__.py
52
+ ├── dataset.py # PyTorch Dataset or TF Dataset class
53
+ ├── datamodule.py # LightningDataModule or equivalent orchestrator
54
+ ├── transforms.py # Preprocessing transforms (normalize, tokenize, augment)
55
+ ├── augmentation.py # Training-time data augmentation (separated from eval transforms)
56
+ ├── collate.py # Custom batch collation functions
57
+ └── utils.py # Data utilities (download, checksum, split generation)
58
+ ```
59
+
60
+ Key rules:
61
+ - **Separate training transforms from eval transforms** — augmentation must not be applied at inference
62
+ - Dataset classes must accept a `split` parameter and behave correctly for each split
63
+ - All preprocessing must be reproducible and deterministic at inference time
64
+ - Cache processed data to avoid recomputation on each run
65
+
66
+ ### `src/models/` — Architecture Definitions
67
+
68
+ Contains model class definitions only — no training logic, no loss functions:
69
+
+ ```
+ src/models/
+ ├── __init__.py
+ ├── backbone.py    # Feature extractor (ResNet, ViT, BERT, etc.)
+ ├── head.py        # Task-specific head (classification, regression, generation)
+ ├── model.py       # Composed full model
+ └── components/    # Reusable building blocks (attention, MLP, norm layers)
+     ├── attention.py
+     └── ffn.py
+ ```
80
+
81
+ Key rules:
82
+ - Models are pure computation graphs — no file I/O, no training state
83
+ - Accept hyperparameters via constructor, not globals
84
+ - Provide a `from_config(cfg)` class method for config-driven instantiation
85
+ - Serialise with `state_dict()` only — never pickle entire model objects
86
+
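The constructor and `from_config` rules above could be sketched as follows. The class is a framework-free stand-in so the example runs without PyTorch; `ModelConfig`, `Classifier`, and the field names are hypothetical, and a real model would subclass its framework's module type.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters with defaults; unknown or missing keys fail at construction."""
    input_dim: int
    hidden_dim: int = 64
    num_classes: int = 2

class Classifier:
    """Stand-in for a model class: hyperparameters arrive via the constructor only."""

    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.num_classes = num_classes

    @classmethod
    def from_config(cls, cfg: dict) -> "Classifier":
        config = ModelConfig(**cfg)  # raises TypeError on unknown or missing keys
        return cls(config.input_dim, config.hidden_dim, config.num_classes)

model = Classifier.from_config({"input_dim": 16, "num_classes": 3})
print(model.input_dim, model.hidden_dim, model.num_classes)
```

Validating the config through a dataclass means a typo in a YAML key fails at construction time instead of silently falling back to a default.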
87
+ ### `src/training/` — Training Logic
88
+
89
+ ```
90
+ src/training/
91
+ ├── __init__.py
92
+ ├── trainer.py # Training loop (or LightningModule)
93
+ ├── loss.py # Loss functions
94
+ ├── optimizer.py # Optimizer and scheduler builders
95
+ ├── callbacks.py # Callbacks (early stopping, logging, checkpoint saving)
96
+ └── utils.py # Gradient clipping, mixed precision helpers
97
+ ```
98
+
99
+ The training loop is separate from the model. A model knows how to compute predictions; the trainer knows how to update weights. This separation enables:
100
+ - Testing model forward passes independently of training
101
+ - Swapping training strategies (single GPU, DDP, FSDP) without changing the model
102
+ - Using the same model class for training and serving
103
+
104
+ ### `src/evaluation/` — Metrics and Evaluation Runners
105
+
106
+ ```
107
+ src/evaluation/
108
+ ├── __init__.py
109
+ ├── metrics.py # Metric computation (accuracy, F1, AUC, etc.)
110
+ ├── evaluator.py # Evaluation loop (runs model on eval set, collects predictions)
111
+ ├── slice_analysis.py # Per-slice performance breakdown
112
+ └── reports.py # Result serialisation and report generation
113
+ ```
114
+
115
+ Evaluation code runs identically offline and online. Do not inline evaluation logic in the training loop — this makes it impossible to re-evaluate a checkpoint independently.
116
+
117
+ ### `src/serving/` — Inference and API
118
+
119
+ ```
120
+ src/serving/
121
+ ├── __init__.py
122
+ ├── predictor.py # Prediction class (loads model, runs inference)
123
+ ├── preprocessing.py # Request preprocessing (mirrors training eval transforms)
124
+ ├── postprocessing.py # Response postprocessing (calibration, thresholding)
125
+ ├── api.py # FastAPI/Flask endpoint definitions
126
+ └── handler.py # TorchServe or Triton handler
127
+ ```
128
+
129
+ The `Predictor` class is the contract between the model and the serving infrastructure. It:
130
+ - Loads a model from a path or registry reference
131
+ - Exposes a `predict(inputs)` method with documented input/output types
132
+ - Uses the exact same preprocessing as training evaluation transforms
133
+
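A minimal sketch of the `Predictor` contract described above, assuming a hypothetical `eval_transform` shared with training evaluation and a caller-supplied `loader` that turns a path into a callable model. None of these names come from the package, and the lambda model stands in for a real loaded network.

```python
from typing import Callable, List, Sequence

def eval_transform(x: Sequence[float]) -> List[float]:
    """Hypothetical evaluation-time transform, shared by evaluation and serving."""
    return [v / 255.0 for v in x]

class Predictor:
    def __init__(self, model: Callable[[List[float]], float],
                 transform: Callable[[Sequence[float]], List[float]] = eval_transform):
        self._model = model
        self._transform = transform

    @classmethod
    def from_path(cls, path: str, loader: Callable[[str], Callable]) -> "Predictor":
        """Load a model from a path (or registry reference) via the given loader."""
        return cls(loader(path))

    def predict(self, inputs: Sequence[Sequence[float]]) -> List[float]:
        """Take a batch of raw feature vectors and return one score per input."""
        return [self._model(self._transform(x)) for x in inputs]

# Stand-in loader: ignores the path and returns a model that sums its features
predictor = Predictor.from_path("models/registry/v1.2.0/model.pt",
                                loader=lambda path: (lambda feats: sum(feats)))
print(predictor.predict([[255.0, 0.0], [0.0, 0.0]]))
```

Importing the transform from one shared module, instead of re-implementing it in the serving layer, is what keeps training and serving preprocessing from drifting apart.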
134
+ ### `notebooks/` — Exploratory Analysis
135
+
136
+ ```
137
+ notebooks/
138
+ ├── 01-data-exploration.ipynb # EDA, data quality checks
139
+ ├── 02-baseline-model.ipynb # Baseline experiments
140
+ ├── 03-feature-engineering.ipynb
141
+ └── 04-error-analysis.ipynb # Post-training error analysis
142
+ ```
143
+
144
+ Rules for notebooks:
145
+ - **Clear outputs before committing** — use `nbstripout` as a pre-commit hook
146
+ - Number notebooks in chronological/logical order
147
+ - Notebooks document exploration, not production logic
148
+ - Any reusable code found in notebooks gets refactored into `src/` with tests
149
+
150
+ ### `configs/` — Experiment Configuration
151
+
+ ```
+ configs/
+ ├── base.yaml              # Default config merged into all experiments
+ ├── model/
+ │   ├── small.yaml
+ │   └── large.yaml
+ ├── data/
+ │   ├── dev.yaml           # Small dataset for fast iteration
+ │   └── full.yaml          # Full production dataset
+ └── training/
+     ├── debug.yaml         # 1 epoch, no logging, fast feedback
+     └── production.yaml    # Full training run settings
+ ```
165
+
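Merging `base.yaml` into every experiment implies a recursive merge in which experiment-specific values override defaults. The sketch below assumes the YAML files have already been parsed into dictionaries; `merge_configs` is a hypothetical helper, and dedicated config libraries offer the same behaviour with more features.

```python
import copy

def merge_configs(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

base = {"training": {"epochs": 50, "lr": 0.001}, "seed": 42}   # e.g. parsed base.yaml
debug = {"training": {"epochs": 1}}                            # e.g. parsed training/debug.yaml
print(merge_configs(base, debug))
```

Copying rather than mutating `base` means the same defaults can be reused safely across several experiment configs in one process.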
166
+ ### `models/` — Artifact Storage
167
+
168
+ Large binary artifacts are not stored in git:
169
+ - Checkpoints and production models live in `models/` but are gitignored
170
+ - Back `models/` with object storage: S3, GCS, Azure Blob Storage
171
+ - Use DVC to track artifact versions alongside the code:
172
+
173
+ ```bash
174
+ dvc add models/registry/v1.2.0/model.pt
175
+ git add models/registry/v1.2.0/model.pt.dvc
176
+ git commit -m "feat: register model v1.2.0"
177
+ dvc push # Pushes binary to remote storage
178
+ ```
179
+
180
+ Teammates restore the artifact with `dvc pull` — they get the exact binary referenced by the `.dvc` pointer in git.
181
+
182
+ ### `.gitignore` Essentials for ML Projects
183
+
184
+ ```gitignore
185
+ # Data
186
+ data/raw/
187
+ data/processed/
188
+ data/splits/*.csv
189
+
190
+ # Model artifacts
191
+ models/checkpoints/
192
+ models/registry/
193
+
194
+ # Notebook checkpoints (notebooks themselves are committed with outputs stripped)
195
+ .ipynb_checkpoints/
196
+
197
+ # Python
198
+ __pycache__/
199
+ *.pyc
200
+ .venv/
201
+ *.egg-info/
202
+
203
+ # Experiment tracking
204
+ mlruns/
205
+ wandb/
206
+
207
+ # Environment
208
+ .env
209
+ *.env
210
+ ```
211
+
212
+ Use `nbstripout` to automatically strip notebook outputs:
213
+ ```bash
214
+ pip install nbstripout
215
+ nbstripout --install # Installs as git filter
216
+ ```
package/content/knowledge/ml/ml-requirements.md
@@ -0,0 +1,138 @@
1
+ ---
2
+ name: ml-requirements
3
+ description: Model performance metrics (accuracy, latency, throughput), business KPIs, fairness/bias requirements, and SLA definitions for ML systems
4
+ topics: [ml, requirements, metrics, fairness, bias, sla, kpi]
5
+ ---
6
+
7
+ ML requirements differ from traditional software requirements because correctness is probabilistic, not absolute. Before writing a single line of training code, define the target metrics, their measurement methodology, and the business KPIs they serve. Ambiguous requirements — "make the model accurate" — are the root cause of most ML project failures. A requirements document for an ML system must specify numeric thresholds, measurement conditions, and what constitutes an acceptable production deployment.
8
+
9
+ ## Summary
10
+
11
+ ML requirements must specify concrete numeric thresholds for model performance (accuracy, latency, throughput), tie those metrics to business KPIs, define fairness and bias constraints across protected groups, and establish SLAs for production serving. Requirements without measurement methodology are aspirations, not requirements. Capture them in a Model Requirements Document before training begins and treat them as acceptance criteria for production deployment.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Performance Metrics by Task Type
16
+
17
+ Different ML tasks have canonical metrics. Use the right metric for the task — do not default to accuracy for all problems:
18
+
19
+ **Classification**
20
+ - **Accuracy**: Fraction of correct predictions. Misleading for class-imbalanced datasets: a fraud detector that always predicts "not fraud" achieves 99.9% accuracy if fraud is 0.1% of the data.
21
+ - **Precision**: Of all positive predictions, how many are actually positive. Optimise when false positives are costly (spam filter flagging legitimate email).
22
+ - **Recall (Sensitivity)**: Of all actual positives, how many were predicted positive. Optimise when false negatives are costly (cancer screening missing a case).
23
+ - **F1 Score**: Harmonic mean of precision and recall. Good single metric when both matter equally.
24
+ - **ROC-AUC**: Area under the Receiver Operating Characteristic curve. Threshold-independent, useful for comparing models. Its value does not change with the class ratio, which means it can look optimistic on highly imbalanced data.
25
+ - **PR-AUC**: Area under the Precision-Recall curve. Better than ROC-AUC for highly imbalanced datasets.
26
+
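The accuracy pitfall described above is easy to verify from confusion-matrix counts. The sketch below uses a hypothetical helper, `classification_metrics`, and an illustrative dataset of 10,000 transactions with 10 fraud cases (0.1%), scored by a detector that always predicts "not fraud".

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Always predicts "not fraud": every fraud case is a false negative
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=9990)
print(always_negative["accuracy"], always_negative["recall"])
```

Accuracy comes out at 0.999 while recall is 0.0, which is why recall, precision, or PR-AUC is the better requirement for this kind of problem.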
27
+ **Regression**
28
+ - **MAE (Mean Absolute Error)**: Average absolute error. Robust to outliers. Easy to interpret in the target unit.
29
+ - **RMSE (Root Mean Squared Error)**: Penalises large errors more than MAE. Use when large errors are disproportionately harmful.
30
+ - **MAPE (Mean Absolute Percentage Error)**: Scale-independent. Problematic when targets near zero.
31
+ - **R² (Coefficient of Determination)**: Variance explained by the model. Context-dependent — R²=0.9 may be poor for weather forecasting but excellent for pricing.
32
+
33
+ **Ranking / Recommendation**
34
+ - **NDCG (Normalized Discounted Cumulative Gain)**: Relevance-weighted ranking quality. Standard for search and recommendation.
35
+ - **MRR (Mean Reciprocal Rank)**: Average of 1/rank of first relevant result.
36
+ - **Hit Rate @ K**: Fraction of users for whom a relevant item appears in top-K recommendations.
37
+
38
+ **Generation (LLM/NLG)**
39
+ - **BLEU / ROUGE**: Reference-based n-gram overlap. Weak proxy for quality — supplement with human evaluation.
40
+ - **Perplexity**: Model confidence on a held-out corpus. Lower is better; useful for comparing language models.
41
+ - **Human evaluation**: Win rate against baseline, Likert scale ratings. Required for production quality gating.
42
+
43
+ ### Business KPI Alignment
44
+
45
+ Every model metric must map to a business KPI. Without this mapping, teams optimise metrics that do not move the needle:
46
+
47
+ | Model Metric | Business KPI | Notes |
48
+ |---|---|---|
49
+ | Fraud detection recall | Revenue protected from fraud | 1% recall improvement may not justify infra cost |
50
+ | Recommendation CTR | Gross Merchandise Value | CTR can rise while GMV falls (clicks on cheap items) |
51
+ | Search NDCG | Query success rate, conversion | Offline NDCG and online conversion often diverge |
52
+ | Churn prediction AUC | Customer retention rate | An accurate churn model only helps if the retention treatment applied to flagged customers is effective |
53
+
54
+ Document this mapping explicitly. When offline metrics improve but the business metric does not, the mapping is wrong.
55
+
56
+ ### Latency Requirements
57
+
58
+ Latency requirements are determined by the use case, not the model team's preferences:
59
+
60
+ - **Interactive / real-time**: User-facing features require P95 latency under 100ms and P99 under 500ms. Recommendation, search, and content ranking fall here.
61
+ - **Near-real-time**: Fraud detection at checkout tolerates 200–500ms P95.
62
+ - **Batch / async**: Offline scoring pipelines have no strict latency requirement but do have throughput requirements (e.g., score 10M records in 4 hours).
63
+
64
+ Define latency budgets from the user experience backward:
65
+ 1. Total page load budget: 2000ms
66
+ 2. Backend API budget: 500ms
67
+ 3. ML inference budget: 100ms (within the API budget)
68
+ 4. Model must fit within that budget at P99 under peak load
69
+
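Checking a latency budget at a given percentile can be sketched as follows. The nearest-rank percentile definition and the helper names `percentile` and `within_budget` are assumptions made for illustration; monitoring systems often use interpolated or histogram-based percentiles instead.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def within_budget(latencies_ms, budget_ms=100.0, pct=99):
    """True if the chosen percentile of observed latencies fits the budget."""
    return percentile(latencies_ms, pct) <= budget_ms

# Synthetic load-test latencies: 1ms, 2ms, ..., 100ms
latencies = list(range(1, 101))
print(percentile(latencies, 99), within_budget(latencies, budget_ms=100.0))
```

Running this check against load-test results at peak traffic, rather than against average traffic, is what ties it to the fourth step of the budget above.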
70
+ **Throughput requirements** are independent of latency: "The model must score 50,000 requests per second at peak." Throughput is met by horizontal scaling; latency is met by model optimisation (quantisation, distillation, hardware selection).
71
+
72
+ ### Fairness and Bias Requirements
73
+
74
+ Fairness requirements must be defined before training, not audited afterward:
75
+
76
+ **Protected attributes**: Race, gender, age, disability status, national origin, religion. Model inputs should not include protected attributes directly; proxy features (zip code, name) may encode them.
77
+
78
+ **Fairness metrics**:
79
+ - **Demographic parity**: Equal positive prediction rate across groups. `P(ŷ=1 | A=0) = P(ŷ=1 | A=1)`
80
+ - **Equalized odds**: Equal TPR and FPR across groups.
81
+ - **Calibration parity**: Predicted probabilities match observed frequencies equally across groups.
82
+ - **Individual fairness**: Similar individuals receive similar predictions.
83
+
84
+ **Fairness-accuracy tradeoff**: Several common fairness definitions cannot all hold at once: when base rates differ across groups, an imperfect classifier cannot satisfy calibration and equalized odds simultaneously (the fairness impossibility results). Choose the fairness constraint that aligns with the legal and ethical context, then optimise accuracy subject to it.
85
+
86
+ **Requirement format**: "Model's false positive rate for group A must not exceed the false positive rate for group B by more than 5 percentage points."
87
+
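A requirement in the format above can be turned into an automated check. The sketch below computes the per-group positive prediction rate (for demographic parity) and false positive rate, then tests the example requirement. `rates_by_group` and `meets_fpr_requirement` are hypothetical names, labels and predictions are assumed to be binary 0/1, and the tiny dataset is invented for illustration.

```python
def rates_by_group(y_true, y_pred, groups):
    """Positive prediction rate and false positive rate for each group."""
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, v in enumerate(groups) if v == g]
        preds = [y_pred[i] for i in idx]
        neg_preds = [y_pred[i] for i in idx if y_true[i] == 0]  # predictions on true negatives
        out[g] = {
            "positive_rate": sum(preds) / len(preds),
            "fpr": sum(neg_preds) / len(neg_preds) if neg_preds else 0.0,
        }
    return out

def meets_fpr_requirement(rates, group_a, group_b, max_gap=0.05):
    """True if group_a's FPR exceeds group_b's by at most max_gap."""
    return rates[group_a]["fpr"] - rates[group_b]["fpr"] <= max_gap

y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
groups = ["A"] * 5 + ["B"] * 5
rates = rates_by_group(y_true, y_pred, groups)
print(rates)
print(meets_fpr_requirement(rates, "A", "B"))
```

In this toy data group A has one false positive among four negatives and group B has none, so the 5-percentage-point requirement fails for A relative to B.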
88
+ ### Model Monitoring SLAs
89
+
90
+ Define monitoring SLAs as part of requirements:
91
+
92
+ - **Accuracy SLA**: "Model accuracy must remain above 85% on the weekly validation set. Alert if it drops below 87% (warning threshold) or 85% (critical threshold)."
93
+ - **Drift SLA**: "Input feature distribution shift (PSI > 0.2) triggers model retraining within 48 hours."
94
+ - **Prediction latency SLA**: "P99 inference latency must remain under 200ms. Alert at 150ms."
95
+ - **Availability SLA**: "Model serving endpoint must maintain 99.9% uptime (about 43 minutes of downtime per 30-day month)."
96
+
97
+ ### Model Requirements Document Template
98
+
99
+ ```markdown
100
+ # Model Requirements: [Model Name]
101
+
102
+ ## Business Context
103
+ - Business problem:
104
+ - KPI being optimised:
105
+ - KPI owner:
106
+
107
+ ## Performance Requirements
108
+ - Primary metric: [metric] >= [threshold] on [evaluation set]
109
+ - Secondary metric: [metric] >= [threshold]
110
+ - Baseline to beat: [current rule-based / previous model performance]
111
+
112
+ ## Latency / Throughput
113
+ - P50 latency: <= [X]ms
114
+ - P99 latency: <= [X]ms
115
+ - Throughput: >= [X] RPS at peak load
116
+
117
+ ## Fairness Requirements
118
+ - Protected groups: [list]
119
+ - Fairness metric: [metric] gap <= [threshold] across groups
120
+
121
+ ## Data Requirements
122
+ - Training data: [source, size, date range, labeling methodology]
123
+ - Minimum training set size: [N]
124
+ - Label quality: [agreement rate, labeling error budget]
125
+
126
+ ## Monitoring SLAs
127
+ - Accuracy degradation alert: < [threshold] on weekly eval
128
+ - Feature drift alert: PSI > [threshold]
129
+ - Retraining trigger: [condition]
130
+
131
+ ## Acceptance Criteria
132
+ - [ ] Primary metric exceeds threshold on holdout set
133
+ - [ ] Fairness constraints satisfied
134
+ - [ ] P99 latency within budget under load test
135
+ - [ ] No critical findings in bias audit
136
+ ```
137
+
138
+ This document is the acceptance test for model deployment. If the model does not satisfy it, it does not go to production.