@zigrivers/scaffold 3.22.0 → 3.24.0
- package/README.md +44 -23
- package/content/knowledge/core/automated-review-tooling.md +3 -3
- package/content/knowledge/core/multi-model-review-dispatch.md +13 -4
- package/content/knowledge/data-science/README.md +23 -0
- package/content/knowledge/data-science/data-science-architecture.md +163 -0
- package/content/knowledge/data-science/data-science-conventions.md +233 -0
- package/content/knowledge/data-science/data-science-data-versioning.md +198 -0
- package/content/knowledge/data-science/data-science-dev-environment.md +159 -0
- package/content/knowledge/data-science/data-science-experiment-tracking.md +194 -0
- package/content/knowledge/data-science/data-science-model-evaluation.md +160 -0
- package/content/knowledge/data-science/data-science-notebook-discipline.md +170 -0
- package/content/knowledge/data-science/data-science-observability.md +161 -0
- package/content/knowledge/data-science/data-science-project-structure.md +178 -0
- package/content/knowledge/data-science/data-science-reproducibility.md +164 -0
- package/content/knowledge/data-science/data-science-requirements.md +151 -0
- package/content/knowledge/data-science/data-science-security.md +151 -0
- package/content/knowledge/data-science/data-science-testing.md +183 -0
- package/content/knowledge/ml/README.md +10 -0
- package/content/methodology/data-science-overlay.yml +39 -0
- package/content/pipeline/build/multi-agent-resume.md +7 -6
- package/content/pipeline/build/multi-agent-start.md +7 -6
- package/content/pipeline/build/single-agent-resume.md +7 -6
- package/content/pipeline/build/single-agent-start.md +7 -6
- package/content/pipeline/environment/automated-pr-review.md +79 -27
- package/content/skills/mmr/SKILL.md +72 -2
- package/content/skills/scaffold-runner/SKILL.md +65 -19
- package/content/tools/review-code.md +74 -16
- package/content/tools/review-pr.md +25 -6
- package/dist/cli/commands/check.d.ts.map +1 -1
- package/dist/cli/commands/check.js +28 -17
- package/dist/cli/commands/check.js.map +1 -1
- package/dist/config/schema.d.ts +672 -126
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +8 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +2 -2
- package/dist/config/schema.test.js.map +1 -1
- package/dist/config/validators/data-science.d.ts +4 -0
- package/dist/config/validators/data-science.d.ts.map +1 -0
- package/dist/config/validators/data-science.js +15 -0
- package/dist/config/validators/data-science.js.map +1 -0
- package/dist/config/validators/index.d.ts.map +1 -1
- package/dist/config/validators/index.js +2 -0
- package/dist/config/validators/index.js.map +1 -1
- package/dist/core/assembly/knowledge-loader.d.ts.map +1 -1
- package/dist/core/assembly/knowledge-loader.js +6 -0
- package/dist/core/assembly/knowledge-loader.js.map +1 -1
- package/dist/core/assembly/knowledge-loader.test.js +34 -0
- package/dist/core/assembly/knowledge-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.js +73 -0
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/project/adopt.d.ts.map +1 -1
- package/dist/project/adopt.js +3 -1
- package/dist/project/adopt.js.map +1 -1
- package/dist/project/detectors/coverage.test.d.ts +2 -0
- package/dist/project/detectors/coverage.test.d.ts.map +1 -0
- package/dist/project/detectors/coverage.test.js +78 -0
- package/dist/project/detectors/coverage.test.js.map +1 -0
- package/dist/project/detectors/data-science.d.ts +4 -0
- package/dist/project/detectors/data-science.d.ts.map +1 -0
- package/dist/project/detectors/data-science.js +32 -0
- package/dist/project/detectors/data-science.js.map +1 -0
- package/dist/project/detectors/data-science.test.d.ts +2 -0
- package/dist/project/detectors/data-science.test.d.ts.map +1 -0
- package/dist/project/detectors/data-science.test.js +62 -0
- package/dist/project/detectors/data-science.test.js.map +1 -0
- package/dist/project/detectors/disambiguate.d.ts +2 -0
- package/dist/project/detectors/disambiguate.d.ts.map +1 -1
- package/dist/project/detectors/disambiguate.js +3 -2
- package/dist/project/detectors/disambiguate.js.map +1 -1
- package/dist/project/detectors/disambiguate.test.js +10 -1
- package/dist/project/detectors/disambiguate.test.js.map +1 -1
- package/dist/project/detectors/index.d.ts.map +1 -1
- package/dist/project/detectors/index.js +2 -0
- package/dist/project/detectors/index.js.map +1 -1
- package/dist/project/detectors/library.d.ts.map +1 -1
- package/dist/project/detectors/library.js +1 -0
- package/dist/project/detectors/library.js.map +1 -1
- package/dist/project/detectors/resolve-detection.test.js +31 -0
- package/dist/project/detectors/resolve-detection.test.js.map +1 -1
- package/dist/project/detectors/types.d.ts +6 -2
- package/dist/project/detectors/types.d.ts.map +1 -1
- package/dist/project/detectors/types.js.map +1 -1
- package/dist/types/config.d.ts +8 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/copy/core.d.ts.map +1 -1
- package/dist/wizard/copy/core.js +4 -0
- package/dist/wizard/copy/core.js.map +1 -1
- package/dist/wizard/copy/data-science.d.ts +3 -0
- package/dist/wizard/copy/data-science.d.ts.map +1 -0
- package/dist/wizard/copy/data-science.js +15 -0
- package/dist/wizard/copy/data-science.js.map +1 -0
- package/dist/wizard/copy/index.d.ts.map +1 -1
- package/dist/wizard/copy/index.js +2 -0
- package/dist/wizard/copy/index.js.map +1 -1
- package/dist/wizard/copy/types.d.ts +5 -1
- package/dist/wizard/copy/types.d.ts.map +1 -1
- package/dist/wizard/copy/types.test-d.js +7 -0
- package/dist/wizard/copy/types.test-d.js.map +1 -1
- package/dist/wizard/questions.d.ts +2 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +9 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +14 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +1 -0
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
- package/skills/mmr/SKILL.md +72 -2
- package/skills/scaffold-runner/SKILL.md +65 -19
|
@@ -0,0 +1,161 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-science-observability
|
|
3
|
+
description: Monitoring deployed DS models and pipelines — prediction logging to Parquet, scheduled evaluation, basic drift detection, and Evidently for deeper analysis
|
|
4
|
+
topics: [data-science, observability, monitoring, drift, evidently]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Models don't fail loudly. A scoring job keeps running, rows keep landing in the output table, dashboards stay green — and quietly, the predictions get worse. The world drifts away from whatever snapshot you trained on, and nobody notices until a stakeholder says "these numbers look weird." Observability for a solo DS isn't a platform; it's a small set of habits that give you a chance to catch decay before someone else does.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
For a solo or small-team data scientist with something deployed (even just a weekly cron), observability boils down to four habits: log every prediction with its inputs to a dated Parquet file, re-run your evaluation script on a schedule and alert on metric drops, check a handful of key features for distributional drift, and reach for `Evidently` when you want a pre-built drift report instead of writing your own. The goal is a tripwire, not a dashboard — you want to get paged when something's wrong, not stare at graphs hoping to spot it.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Log predictions + inputs
|
|
16
|
+
|
|
17
|
+
Every time your model scores something, append a row to a dated Parquet log. This is the single most useful thing you can do for future-you — drift analysis, debugging, label backfill, and post-mortems all depend on having this log.
|
|
18
|
+
|
|
19
|
+
Layout:
|
|
20
|
+
|
|
21
|
+
```
data/processed/predictions/
  2026-04-21/
    run-20260421T0300-abc123.parquet
  2026-04-22/
    run-20260422T0300-def456.parquet
```
|
|
28
|
+
|
|
29
|
+
Schema (one row per prediction):
|
|
30
|
+
|
|
31
|
+
```python
# src/monitor/prediction_log.py
import uuid
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd


def log_predictions(
    features: pd.DataFrame,
    predictions: pd.Series,
    model_version: str,
    log_root: Path = Path("data/processed/predictions"),
) -> Path:
    """Write one scoring run (inputs plus predictions) to a dated Parquet file."""
    # Take a single timestamp so the directory date, the run ID, and logged_at
    # cannot disagree when a run straddles midnight UTC.
    now = datetime.now(timezone.utc)
    today = now.strftime("%Y-%m-%d")
    run_id = now.strftime("%Y%m%dT%H%M") + "-" + uuid.uuid4().hex[:6]
    out_dir = log_root / today
    out_dir.mkdir(parents=True, exist_ok=True)

    # Copy so the caller's frame is not mutated.
    df = features.copy()
    df["prediction"] = predictions.values
    df["model_version"] = model_version
    df["logged_at"] = now
    df["run_id"] = run_id
    df["ground_truth"] = pd.NA  # Backfilled later when labels arrive

    out = out_dir / f"run-{run_id}.parquet"
    df.to_parquet(out, index=False)
    return out
```
|
|
60
|
+
|
|
61
|
+
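Parquet is right for this: columnar, compressed, and fast to scan across dates. Point `pd.read_parquet("data/processed/predictions/")` at the log root to load every run at once; with the default pyarrow engine, pandas reads a directory recursively but does not expand glob patterns such as `**/*.parquet`. If inputs have PII, hash or drop those columns before logging, since you rarely need raw identifiers to do drift or error analysis.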
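One way to handle the PII point is to replace identifiers with a keyed hash before the row is logged, so records stay joinable without exposing the raw value. The following is a minimal sketch using only the standard library; the helper names, the `PII_COLUMNS` tuple, and the way the salt is supplied are illustrative assumptions rather than part of the logging module above.

```python
import hashlib
import hmac

# Hypothetical list of columns to pseudonymize; adjust to your own schema.
PII_COLUMNS = ("email", "customer_id")


def pseudonymize(value: str, salt: bytes) -> str:
    """Return a keyed SHA-256 digest of a raw identifier.

    The same (value, salt) pair always maps to the same digest, so hashed
    rows can still be grouped or joined.
    """
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub_row(row: dict, salt: bytes, pii_columns=PII_COLUMNS) -> dict:
    """Copy a row, hashing every PII column that is present and not None."""
    clean = dict(row)
    for col in pii_columns:
        if clean.get(col) is not None:
            clean[col] = pseudonymize(str(clean[col]), salt)
    return clean


row = {"email": "a@example.com", "amount": 12.5}
scrubbed = scrub_row(row, salt=b"load-me-from-a-secret-store")
```

With a DataFrame, the same helper can be applied per column, for example `df[col].map(lambda v: pseudonymize(str(v), salt))`, before calling `log_predictions`. Keep the salt outside the repository; anyone holding it can test guesses against the hashed values.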
|
|
62
|
+
|
|
63
|
+
### Scheduled eval re-runs
|
|
64
|
+
|
|
65
|
+
Your training-time evaluation script is also your monitoring script. Run it weekly or monthly against recent predictions joined to whatever ground truth has arrived, and alert when the headline metric breaches a threshold.
|
|
66
|
+
|
|
67
|
+
```python
# src/monitor/eval.py
import sys

import pandas as pd
from sklearn.metrics import roc_auc_score

THRESHOLD = 0.80  # Alert if AUC drops below this


def main() -> int:
    preds = pd.read_parquet("data/processed/predictions/")
    labels = pd.read_parquet("data/processed/labels/")
    joined = preds.merge(labels, on="record_id", how="inner")
    if len(joined) < 500:
        print("Not enough labeled data yet; skipping.")
        return 0

    auc = roc_auc_score(joined["actual"], joined["prediction"])
    print(f"AUC on {len(joined)} labeled rows: {auc:.3f}")
    if auc < THRESHOLD:
        # Send email / Slack webhook here
        print(f"ALERT: AUC {auc:.3f} below threshold {THRESHOLD}", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```
|
|
94
|
+
|
|
95
|
+
Schedule it with whatever you already have — cron, a GitHub Actions `schedule:` workflow, Airflow if you run it, or your platform's scheduled job. Exit code 1 plus a Slack webhook is a perfectly good alerting system at this scale.
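The "exit code plus Slack webhook" pattern needs very little code. Below is a hedged sketch using only the standard library: Slack incoming webhooks accept a JSON body with a `text` field, while the function names and the message wording here are assumptions made for illustration.

```python
import json
import urllib.request


def build_alert_payload(metric: str, value: float, threshold: float) -> bytes:
    """Encode a Slack incoming-webhook message as JSON bytes."""
    text = f"ALERT: {metric} {value:.3f} below threshold {threshold:.2f}"
    return json.dumps({"text": text}).encode("utf-8")


def send_alert(webhook_url: str, payload: bytes, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


payload = build_alert_payload("AUC", 0.7712, 0.80)
# send_alert(webhook_url, payload)  # webhook_url: your own incoming-webhook URL
```

Calling `send_alert` just before `return 1` in the eval script keeps the non-zero exit code as the primary signal, so the scheduler still marks the job as failed if the webhook call itself breaks.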
|
|
96
|
+
|
|
97
|
+
### Basic drift detection
|
|
98
|
+
|
|
99
|
+
Before reaching for a library, do the cheap thing: compare this period's feature distribution to your training distribution. Mean, std, a couple of quantiles, and a KS statistic cover most of what you need.
|
|
100
|
+
|
|
101
|
+
```python
# src/monitor/drift.py
import pandas as pd
from scipy.stats import ks_2samp


def feature_drift(reference: pd.Series, current: pd.Series) -> dict:
    stat, p = ks_2samp(reference.dropna(), current.dropna())
    return {
        "ref_mean": reference.mean(), "cur_mean": current.mean(),
        "ref_std": reference.std(), "cur_std": current.std(),
        "ks_stat": stat, "ks_p": p,
        "drifted": p < 0.01,
    }


train = pd.read_parquet("data/processed/train.parquet")
recent = pd.read_parquet("data/processed/predictions/2026-04-21/")
for col in ["amount", "user_tenure_days", "n_items"]:
    print(col, feature_drift(train[col], recent[col]))
```
|
|
120
|
+
|
|
121
|
+
Run this alongside your scheduled eval. You don't need a dashboard — printing to the job log and alerting on `drifted=True` on any monitored feature is enough.
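To turn per-feature results into a single alert, the scheduled job can collapse them into an exit code. This is a small sketch that assumes each result is a dict shaped like the output of `feature_drift`; the function names and the hand-written example values are hypothetical.

```python
import sys


def drifted_features(results: dict) -> list:
    """Return the names of features whose result has drifted=True."""
    return sorted(name for name, res in results.items() if res.get("drifted"))


def report_drift(results: dict) -> int:
    """Print one line per feature; return 1 if any feature drifted, else 0."""
    for name, res in sorted(results.items()):
        print(f"{name}: ks_stat={res['ks_stat']:.3f} ks_p={res['ks_p']:.4f}")
    flagged = drifted_features(results)
    if flagged:
        print(f"ALERT: drift detected in {', '.join(flagged)}", file=sys.stderr)
        return 1
    return 0


# Hand-written values for illustration only; real ones come from feature_drift().
example = {
    "amount": {"ks_stat": 0.21, "ks_p": 0.0003, "drifted": True},
    "n_items": {"ks_stat": 0.03, "ks_p": 0.41, "drifted": False},
}
exit_code = report_drift(example)
```

Ending the job with `raise SystemExit(exit_code)` lets the same scheduler alerting that covers the eval script cover drift as well.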
|
|
122
|
+
|
|
123
|
+
### Evidently for more
|
|
124
|
+
|
|
125
|
+
When you outgrow ad-hoc KS tests, `Evidently` gives you a pre-built drift report across all features, plus data quality checks and target drift, as an HTML page you can open or ship to S3.
|
|
126
|
+
|
|
127
|
+
```python
|
|
128
|
+
# src/monitor/evidently_report.py
|
|
129
|
+
import pandas as pd
|
|
130
|
+
from evidently import Report
|
|
131
|
+
from evidently.presets import DataDriftPreset
|
|
132
|
+
|
|
133
|
+
reference = pd.read_parquet("data/processed/train.parquet")
|
|
134
|
+
current = pd.read_parquet("data/processed/predictions/2026-04-21/")
|
|
135
|
+
|
|
136
|
+
report = Report([DataDriftPreset()])
|
|
137
|
+
snapshot = report.run(reference_data=reference, current_data=current)
|
|
138
|
+
snapshot.save_html("reports/drift-2026-04-21.html")
|
|
139
|
+
```
|
|
140
|
+
|
|
141
|
+
This is opt-in. If plain pandas + SciPy is telling you what you need to know, don't add a dependency. Reach for Evidently when you have enough features that per-column code is tedious, or when you want a shareable artifact for a stakeholder.
|
|
142
|
+
|
|
143
|
+
### The prediction / feedback loop
|
|
144
|
+
|
|
145
|
+
Ground truth almost never arrives at prediction time. A churn model predicts today who'll cancel next month; a fraud model predicts now whether a transaction is bad, confirmed days later. That delay is why the Parquet log exists — you keep predictions around until labels catch up, then join.
|
|
146
|
+
|
|
147
|
+
```python
|
|
148
|
+
# src/monitor/backfill_labels.py
|
|
149
|
+
import pandas as pd
|
|
150
|
+
|
|
151
|
+
preds = pd.read_parquet("data/processed/predictions/")
|
|
152
|
+
labels = pd.read_parquet("data/processed/labels/") # record_id, actual, label_time
|
|
153
|
+
merged = preds.merge(labels, on="record_id", how="left")
merged["ground_truth"] = merged["actual"]  # fill the placeholder written at logging time
|
|
154
|
+
merged.to_parquet("data/processed/predictions_labeled.parquet", index=False)
|
|
155
|
+
```
|
|
156
|
+
|
|
157
|
+
Keep at least one full feedback cycle of prediction logs (if labels arrive after 30 days, keep 60-90 days). This join is how you get a real accuracy number on production traffic, not just your static test-set number from training day.
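The retention window can be enforced with a short cleanup step. The sketch below assumes the dated `YYYY-MM-DD` directory layout shown earlier; the function name and the `keep_days` parameter are illustrative, not an existing helper.

```python
from datetime import date, datetime, timedelta
from pathlib import Path


def expired_log_dirs(log_root: Path, today: date, keep_days: int) -> list:
    """List dated subdirectories (YYYY-MM-DD) older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for child in sorted(log_root.iterdir()):
        if not child.is_dir():
            continue
        try:
            day = datetime.strptime(child.name, "%Y-%m-%d").date()
        except ValueError:
            continue  # Not a dated log directory; leave it alone
        if day < cutoff:
            expired.append(child)
    return expired
```

A cleanup job could call `expired_log_dirs(Path("data/processed/predictions"), date.today(), keep_days=90)` and pass each result to `shutil.rmtree`. Returning the list instead of deleting directly makes a dry run trivial.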
|
|
158
|
+
|
|
159
|
+
### What NOT to build
|
|
160
|
+
|
|
161
|
+
Resist the urge to over-engineer. At solo scale you do not need streaming drift detection, a Prometheus/Grafana stack, a model registry with canary deploys, or a dedicated monitoring dashboard. Those are ML-platform-team concerns — build them when there's a team to own them. A dated Parquet log, a scheduled eval script, a handful of drift checks, and an alert that emails you is more than enough to catch the failures that actually happen.
|
|
@@ -0,0 +1,178 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-science-project-structure
|
|
3
|
+
description: Opinionated directory layout for solo and small-team data-science projects — notebooks, src, data, models, reports, tests, configs — with a promotion path from exploration to tested modules
|
|
4
|
+
topics: [data-science, project-structure, layout]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
A solo data-science project accumulates artifacts faster than most software: half-finished notebooks, CSV dumps, parquet caches, serialized models, PNG charts, and the occasional markdown write-up. Without a deliberate directory structure, the project turns into a folder of 40 loose files within a month and a new contributor — including future-you — cannot tell what is canonical, what is scratch, and what is safe to delete. A clear layout fixes three problems at once: discoverability (where does X live?), git hygiene (what is tracked vs generated?), and the promotion path (how does throwaway notebook code become tested library code?).
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
A solo DS project has six top-level directories that each answer one question: `notebooks/` (exploration), `src/` (importable Python modules), `data/` (split into raw/interim/processed — `data/raw/` is always gitignored; small processed artifacts may be committed or DVC-tracked), `models/` (serialized artifacts, tracked via DVC or git-lfs), `reports/` (rendered outputs — figures, HTML, markdown), and `tests/` (pytest suite mirroring `src/`). `configs/` holds YAML run parameters, and `pyproject.toml` at the root defines the package. The `.gitignore` excludes raw data, most of `models/`, and common binary formats that were not deliberately promoted. Reusable logic follows a strict promotion path: explored in a notebook, extracted into `src/`, unit-tested in `tests/`, then re-imported by notebooks or pipeline scripts.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Top-level layout
|
|
16
|
+
|
|
17
|
+
```
project-root/
├── notebooks/              # Exploratory notebooks (Marimo preferred; numbered chronologically)
├── src/                    # Importable Python modules: the library
│   └── <project>/
│       ├── __init__.py
│       ├── ingestion.py    # Load raw data from source (CSV, DB, API)
│       ├── features.py     # Feature engineering / transforms
│       ├── training.py     # Model fitting routines
│       ├── evaluation.py   # Metrics, CV loops, slice analysis
│       └── serving.py      # Inference helpers (load artifact, predict)
├── data/                   # Datasets at every pipeline stage
│   ├── raw/                # Immutable inputs; GITIGNORED (always)
│   ├── interim/            # Cached intermediates; small Parquet may be committed
│   └── processed/          # Analysis-ready; usually DVC-tracked, small files may be committed
├── models/                 # Serialized model artifacts (DVC / git-lfs tracked)
├── reports/                # Rendered output: figures/, HTML reports, markdown summaries
│   └── figures/
├── tests/                  # pytest suite; mirrors src/ structure
├── configs/                # YAML run configs (Hydra-style or plain)
├── pyproject.toml          # Package metadata, dependencies, tool config
├── .gitignore
└── README.md
```
|
|
41
|
+
|
|
42
|
+
One-liners per dir:
|
|
43
|
+
- `notebooks/` — exploration, EDA, prototyping; numbered `01-…`, `02-…` so ordering is obvious
|
|
44
|
+
- `src/` — every reusable function that a second notebook or a pipeline script will call
|
|
45
|
+
- `data/` — all datasets at every stage; raw is always gitignored, selected processed artifacts (small Parquet in `data/interim/` or `data/processed/`) may be committed directly or tracked via DVC — see `data-science-data-versioning`
|
|
46
|
+
- `models/` — trained model artifacts; tracked through DVC or git-lfs pointers, never raw binaries
|
|
47
|
+
- `reports/` — things a human reads: charts, HTML reports, markdown summaries
|
|
48
|
+
- `tests/` — pytest tests for code in `src/`
|
|
49
|
+
- `configs/` — experiment parameters (paths, seeds, hyperparams) separate from code
|
|
50
|
+
|
|
51
|
+
### Data: gitignore raw, deliberately admit small processed artifacts
|
|
52
|
+
|
|
53
|
+
The single hardest rule in DS project hygiene: **never commit raw datasets under `data/raw/` or raw model binaries under `models/` to git**. A 200 MB parquet file committed to history stays in every clone until the history itself is rewritten, for example with `git filter-repo`, and that rewrite changes the hash of every commit from the offending one onward. Prevent the problem at the `.gitignore` layer before it happens.
|
|
54
|
+
|
|
55
|
+
Gitignoring the entire `data/` tree is the safest default, but it under-serves a common small-team workflow: a cleaned, analysis-ready Parquet in `data/interim/` that's <10 MB, changes rarely, and is useful to have alongside the code. See `data-science-data-versioning` for the full size-based decision rule. The pattern below gitignores raw data and external copies wholesale, and allows opt-in commits of small processed Parquet through a deliberate un-ignore rule. Anything larger (>50 MB, frequent churn, binary artifacts) goes through DVC or git-lfs instead — never direct git commits.
|
|
56
|
+
|
|
57
|
+
```gitignore
|
|
58
|
+
# Raw / external data — never committed (bulky, usually not redistributable)
|
|
59
|
+
data/raw/
|
|
60
|
+
data/external/
|
|
61
|
+
|
|
62
|
+
# Processed / interim data — default: ignore; opt in to specific small artifacts below
|
|
63
|
+
data/interim/*
|
|
64
|
+
data/processed/*
|
|
65
|
+
!data/.gitkeep
|
|
66
|
+
!data/interim/.gitkeep
|
|
67
|
+
!data/processed/.gitkeep
|
|
68
|
+
# Allow small cleaned Parquet to be committed (see data-science-data-versioning
|
|
69
|
+
# for size guidance — under ~10 MB, rare changes). Larger artifacts belong in
|
|
70
|
+
# DVC or git-lfs.
|
|
71
|
+
!data/interim/*.parquet
|
|
72
|
+
!data/processed/*.parquet
|
|
73
|
+
|
|
74
|
+
# Model artifacts — tracked via DVC or git-lfs, not raw binaries
|
|
75
|
+
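# Ignore the contents, not the directory itself: git cannot re-include a file
# whose parent directory is excluded, so a bare `models/` would defeat the negations below.
models/*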
|
|
76
|
+
!models/.gitkeep
|
|
77
|
+
!models/**/*.dvc
|
|
78
|
+
|
|
79
|
+
# Common large binary formats (defense in depth — catch anything dropped elsewhere)
|
|
80
|
+
*.feather
|
|
81
|
+
*.joblib
|
|
82
|
+
*.pt
|
|
83
|
+
*.pth
|
|
84
|
+
*.onnx
|
|
85
|
+
*.h5
|
|
86
|
+
*.hdf5
|
|
87
|
+
*.npy
|
|
88
|
+
*.npz
|
|
89
|
+
|
|
90
|
+
# Python
|
|
91
|
+
__pycache__/
|
|
92
|
+
*.pyc
|
|
93
|
+
.venv/
|
|
94
|
+
.ruff_cache/
|
|
95
|
+
.pytest_cache/
|
|
96
|
+
*.egg-info/
|
|
97
|
+
|
|
98
|
+
# Notebook outputs (if not using a tool that strips them)
|
|
99
|
+
.ipynb_checkpoints/
|
|
100
|
+
|
|
101
|
+
# Environment / secrets
|
|
102
|
+
.env
|
|
103
|
+
.env.*
|
|
104
|
+
!.env.example
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
Two things are load-bearing in this snippet. First, `*.parquet` is **not** in the blanket block-list — we want `data/interim/*.parquet` to match as "allowed" once the un-ignore rules kick in. Second, the `!data/interim/*.parquet` and `!data/processed/*.parquet` patterns mean processed Parquet is committable **by default** at this layer; the policy choice of whether to actually commit a given file is made at `git add` time, not in `.gitignore`. If your team's policy is DVC-first for every dataset, drop those `!…*.parquet` lines. The `!data/.gitkeep` family keeps the directories present in fresh clones.
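Because the size decision is made at `git add` time, a small pre-commit check can catch an oversized Parquet file before it reaches history. The following sketch assumes the roughly 10 MB ceiling mentioned above; the function name and the constant are hypothetical and should be aligned with whatever rule `data-science-data-versioning` defines.

```python
from pathlib import Path

MAX_COMMIT_BYTES = 10 * 1024 * 1024  # ~10 MB ceiling assumed from the guidance above


def oversized_parquet(data_dirs: list, limit: int = MAX_COMMIT_BYTES) -> list:
    """Return Parquet files directly inside data_dirs that exceed the limit."""
    too_big = []
    for directory in data_dirs:
        if not directory.is_dir():
            continue  # Skip directories that do not exist in this checkout
        for path in sorted(directory.glob("*.parquet")):
            if path.stat().st_size > limit:
                too_big.append(path)
    return too_big
```

Running `oversized_parquet([Path("data/interim"), Path("data/processed")])` and failing when the list is non-empty gives a cheap guard; anything it reports is a candidate for DVC or git-lfs instead.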
|
|
108
|
+
|
|
109
|
+
For versioned datasets and models, see `data-science-data-versioning`: DVC or git-lfs pointers are committed, while the binaries themselves live in remote storage. Treat any pickle-based model file as executable code. Stdlib pickle, `joblib` (which is built on pickle), and `torch.load` without `weights_only=True` can all execute arbitrary code on load, so a model file from an untrusted source becomes an RCE vector. When artifacts cross a trust boundary, prefer a data-only format such as ONNX.
|
|
110
|
+
|
|
111
|
+
### Notebooks → src/ promotion
|
|
112
|
+
|
|
113
|
+
Notebooks are for exploration, not production. The moment a function in a notebook becomes useful to a second notebook — or looks like it will survive longer than the current sitting — it gets promoted:
|
|
114
|
+
|
|
115
|
+
1. **Identify**: a cell (or few cells) encapsulating reusable logic — a loader, a transform, a metric computation
|
|
116
|
+
2. **Extract**: move the function into the appropriate `src/<project>/` module (`ingestion.py`, `features.py`, etc.) with type hints and a docstring
|
|
117
|
+
3. **Test**: add a pytest case in `tests/` that exercises a representative input → output case
|
|
118
|
+
4. **Re-import**: the notebook now does `from <project>.features import clean_customer_ids` instead of defining the function inline
|
|
119
|
+
|
|
120
|
+
This discipline keeps notebooks short (exploration, narrative, charts) and concentrates correctness-critical code where it can be reviewed, tested, and reused. See `data-science-notebook-discipline` for the mechanics of cell size, output clearing, and `%autoreload`, which picks up edits in `src/` without a kernel restart.
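The four promotion steps can be sketched end to end with the `clean_customer_ids` example. The function body below is an assumption made for illustration (the source only names the function and its test); in a real project the two halves would live in separate files.

```python
# Would live in src/<project>/features.py
from collections.abc import Iterable


def clean_customer_ids(ids: Iterable) -> list:
    """Strip surrounding whitespace from each customer ID (hypothetical body)."""
    return [raw.strip() for raw in ids]


# Would live in tests/test_features.py
def test_clean_customer_ids_strips_whitespace() -> None:
    assert clean_customer_ids(["  ab12 ", "cd34\n"]) == ["ab12", "cd34"]


# pytest would discover the test automatically; it is called here to keep the sketch runnable.
test_clean_customer_ids_strips_whitespace()
```

After this, the notebook cell that used to define the function shrinks to a single import line.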
|
|
121
|
+
|
|
122
|
+
### Configs and reproducibility
|
|
123
|
+
|
|
124
|
+
Hard-coded paths and hyperparameters inside notebook cells are the single biggest reproducibility killer in a DS project. Push them into `configs/` so a run is defined by a config file + a git SHA.
|
|
125
|
+
|
|
126
|
+
```yaml
# configs/train_baseline.yaml
run_name: baseline_v1
seed: 42

data:
  raw_path: data/raw/transactions_2024.csv
  processed_path: data/processed/transactions_clean.parquet
  target: churned_30d
  test_size: 0.2
  split_seed: 42

features:
  include:
    - tenure_days
    - monthly_spend
    - support_tickets_30d
  log_transform:
    - monthly_spend

model:
  type: gradient_boosting
  params:
    n_estimators: 200
    max_depth: 5
    learning_rate: 0.05

output:
  model_path: models/baseline_v1.joblib
  report_path: reports/baseline_v1.html
```
|
|
157
|
+
|
|
158
|
+
Training code reads the config with `yaml.safe_load` (or Hydra / pydantic-settings for richer projects) and a teammate can reproduce the run with `python -m <project>.training --config configs/train_baseline.yaml`. For Hydra specifically, configs split into `configs/data/`, `configs/model/`, `configs/training/` and compose at the command line.
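As a rough illustration of the entry point described above, the sketch below parses `--config` and validates the parsed mapping into a frozen dataclass. `RunConfig`, its fields, and `parse_args` are hypothetical names; the dict literal stands in for what `yaml.safe_load` would return, so the example runs without PyYAML installed.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    run_name: str
    seed: int
    raw_path: str
    target: str
    model_type: str
    model_params: dict

    @classmethod
    def from_dict(cls, cfg: dict) -> "RunConfig":
        """Build a RunConfig from the mapping returned by yaml.safe_load."""
        try:
            return cls(
                run_name=cfg["run_name"],
                seed=int(cfg["seed"]),
                raw_path=cfg["data"]["raw_path"],
                target=cfg["data"]["target"],
                model_type=cfg["model"]["type"],
                model_params=dict(cfg["model"].get("params", {})),
            )
        except KeyError as missing:
            raise ValueError(f"config is missing required key {missing}") from None


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Train a model from a config file")
    parser.add_argument("--config", required=True, help="Path to a YAML run config")
    return parser.parse_args(argv)


# In a real entry point: cfg = yaml.safe_load(Path(args.config).read_text())
args = parse_args(["--config", "configs/train_baseline.yaml"])
cfg = {
    "run_name": "baseline_v1",
    "seed": 42,
    "data": {"raw_path": "data/raw/transactions_2024.csv", "target": "churned_30d"},
    "model": {"type": "gradient_boosting", "params": {"n_estimators": 200}},
}
config = RunConfig.from_dict(cfg)
```

Failing fast on a missing key is the point of the design: a typo in the config surfaces before any data is loaded, not halfway through training.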
|
|
159
|
+
|
|
160
|
+
### Tests layout
|
|
161
|
+
|
|
162
|
+
`tests/` mirrors `src/` one-to-one. If `src/<project>/features.py` defines `clean_customer_ids`, then `tests/test_features.py` contains `test_clean_customer_ids_strips_whitespace` and friends.
|
|
163
|
+
|
|
164
|
+
```
|
|
165
|
+
tests/
|
|
166
|
+
├── conftest.py # Shared fixtures (tiny sample dataframes, tmp_path helpers)
|
|
167
|
+
├── test_ingestion.py # Tests for src/<project>/ingestion.py
|
|
168
|
+
├── test_features.py # Tests for src/<project>/features.py
|
|
169
|
+
├── test_training.py # Tests for src/<project>/training.py — usually smoke tests
|
|
170
|
+
└── test_evaluation.py # Tests for src/<project>/evaluation.py
|
|
171
|
+
```
|
|
172
|
+
|
|
173
|
+
Naming rules:
|
|
174
|
+
- Test files: `test_<module>.py` — pytest discovers these by default
|
|
175
|
+
- Test functions: `test_<unit>_<behavior>` — e.g. `test_clean_customer_ids_strips_whitespace`, `test_load_transactions_raises_on_missing_file`
|
|
176
|
+
- Fixtures live in `conftest.py` at the `tests/` root when shared across files; local fixtures stay in the file that uses them
|
|
177
|
+
|
|
178
|
+
Training and evaluation tests are typically **smoke tests** over a 10-row fixture dataframe, not full-dataset runs — the goal is catching shape/dtype/column regressions, not validating model quality (model quality belongs in the evaluation report, not the unit test suite).
|
|
@@ -0,0 +1,164 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-science-reproducibility
|
|
3
|
+
description: Reproducibility for solo/small-team DS — pin deps with uv lock, seed everything, set PYTHONHASHSEED, and reach for Docker only at OS boundaries
|
|
4
|
+
topics: [data-science, reproducibility, determinism, uv, docker]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
You show a result in Monday's meeting. Six months later, on a new laptop, you can't reproduce it. Three things usually cause this: dependencies drifted (a minor NumPy release changed a default), randomness wasn't pinned (a shuffle or init picked a different seed), or the data changed underneath you. Reproducibility is the discipline of eliminating all three so the same inputs always produce the same numbers.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Pin dependencies with `uv lock` and commit `uv.lock` — `uv sync --frozen` rebuilds the exact environment anywhere. Control randomness with a single `set_seed(seed)` helper that seeds Python `random`, NumPy, PyTorch, and TensorFlow at the top of every script. Export `PYTHONHASHSEED=0` via `.envrc` so hash-order is deterministic across interpreter runs. Log the git SHA and data hash with every run so you can walk back to the exact code + data that produced any number. Reach for Docker only when you're crossing an OS or CUDA boundary — for greenfield solo work, `uv sync` is enough.
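Logging the git SHA and a data hash takes only a few lines. Here is a minimal standard-library sketch; `file_sha256`, `git_sha`, and `run_provenance` are hypothetical helper names, and where the resulting dict gets stored (experiment tracker, JSON next to the model) is left open.

```python
import hashlib
import subprocess
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_sha() -> str:
    """Return the current commit hash, or "unknown" outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def run_provenance(data_path: Path) -> dict:
    """Collect the identifiers worth logging next to every metric."""
    return {"git_sha": git_sha(), "data_sha256": file_sha256(data_path)}
```

Note that the commit hash only identifies the code if the working tree is clean, so it is worth committing before a run you intend to cite.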
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Pinning dependencies with uv
|
|
16
|
+
|
|
17
|
+
`uv` resolves the full transitive dependency graph into `uv.lock`, which records the exact version and content hash of every package, including transitive deps you never directly imported. Commit it. On a new machine, `uv sync --frozen` reproduces the environment byte-for-byte without re-resolving anything.
|
|
18
|
+
|
|
19
|
+
```bash
|
|
20
|
+
# First time: declare top-level deps in pyproject.toml, then lock
|
|
21
|
+
uv lock
|
|
22
|
+
|
|
23
|
+
# On any machine (CI, teammate's laptop, 6 months later):
|
|
24
|
+
uv sync --frozen # install exactly what's in uv.lock, never re-resolve
|
|
25
|
+
|
|
26
|
+
# Upgrade a single package intentionally:
|
|
27
|
+
uv lock --upgrade-package numpy
|
|
28
|
+
# Review the lock diff in PR. Re-run your eval suite before merging.
|
|
29
|
+
|
|
30
|
+
# Add a new dependency:
|
|
31
|
+
uv add pandas # updates pyproject.toml AND uv.lock atomically
|
|
32
|
+
```
|
|
33
|
+
|
|
34
|
+
Rules:
|
|
35
|
+
- Commit `uv.lock`. It is not a build artifact; it is a reproducibility contract.
|
|
36
|
+
- Use `--frozen` in CI and release scripts. A silent re-resolve on deploy is the bug you're trying to prevent.
|
|
37
|
+
- Upgrade packages one at a time, with a PR and an eval run. Bulk upgrades hide which bump broke your metrics.
|
|
38
|
+
- Pin the Python version too: add `requires-python = "==3.12.*"` in `pyproject.toml` and let uv install and manage the interpreter. Minor Python versions change float formatting, dict ordering guarantees, and stdlib behavior in ways that can move your numbers.
|
|
39
|
+
|
|
40
|
+
### Seed management
|
|
41
|
+
|
|
42
|
+
Every source of randomness in your stack has its own PRNG. Seed all of them from a single call, at the top of every train/eval/predict entry point.
|
|
43
|
+
|
|
44
|
+
```python
# src/utils/seed.py
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed every PRNG we might touch. Call at the top of every script."""
    # Only affects subprocesses started after this point: the running
    # interpreter fixed its hash seed at startup (export it in the shell too).
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

    # PyTorch and TensorFlow are optional; seed them only if installed.
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass
```
|
|
70
|
+
|
|
71
|
+
Call `set_seed(42)` before any data split, model init, or sampling. If a library accepts a `random_state` argument (scikit-learn does almost everywhere), pass the seed explicitly — global seeding is a safety net, not a substitute.
|
|
72
|
+
|
|
73
|
+
```python
# Explicit is better than implicit:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from src.utils.seed import set_seed  # the helper defined above

set_seed(42)  # global safety net

# X, y: feature matrix and labels, loaded elsewhere
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # explicit
)
model = RandomForestClassifier(random_state=42)  # explicit
```
|
|
85
|
+
|
|
86
|
+
The one gotcha: multi-worker DataLoaders in PyTorch spawn subprocesses that need their own seeding. Pass a `worker_init_fn` that seeds `random` and NumPy inside each worker, together with a seeded `torch.Generator` via the `generator` argument, or you can get different augmentation sequences across runs even with `set_seed` called in the main process.
|
|
87
|
+
|
|
88
|
+
### Hash determinism
|
|
89
|
+
|
|
90
|
+
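Python randomizes the hash seed for `str` and `bytes` objects on every interpreter run by default. That means set iteration order, and anything else that depends on `hash()` of those values, varies between runs. Dicts preserve insertion order, but a dict or list built by iterating a set inherits the set's order. It is a subtle reproducibility leak that only shows up when you try to diff two training runs.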
|
|
91
|
+
|
|
92
|
+
```bash
|
|
93
|
+
# .envrc (direnv)
|
|
94
|
+
export PYTHONHASHSEED=0
|
|
95
|
+
```
|
|
96
|
+
|
|
97
|
+
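`set_seed()` also writes `PYTHONHASHSEED` into `os.environ`, but the running interpreter reads that variable only at startup, so the assignment affects only subprocesses launched afterwards. Exporting it in `.envrc` covers everything in the shell session (notebooks, ad-hoc scripts, the test runner) before any Python code runs.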
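One way to see the effect is to hash the same string in fresh interpreters. The sketch below is illustrative only: the helper name `hash_in_subprocess` is hypothetical, and it assumes `sys.executable` points at a working Python.

```python
import os
import subprocess
import sys
from typing import Optional


def hash_in_subprocess(text: str, hash_seed: Optional[str]) -> int:
    """Return hash(text) as computed by a freshly started interpreter.

    hash_seed is exported as PYTHONHASHSEED for the child process;
    None removes the variable so the child uses its default behaviour.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = hash_seed
    completed = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout.strip())


# With the seed pinned, two separate interpreters agree on the hash value.
pinned_hashes = {hash_in_subprocess("feature_a", "0") for _ in range(2)}
```

Calling the helper with `hash_seed=None` a few times will usually show the values changing from run to run, which is the leak the `.envrc` export closes.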
|
|
98
|
+
|
|
99
|
+
### GPU determinism (brief)
|
|
100
|
+
|
|
101
|
+
Full GPU determinism requires cuDNN-level flags and disabling non-deterministic kernels:
|
|
102
|
+
|
|
103
|
+
```python
|
|
104
|
+
# Only if you actually need this:
|
|
105
|
+
torch.backends.cudnn.deterministic = True
|
|
106
|
+
torch.backends.cudnn.benchmark = False
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
This has a real performance cost (often 10-30% slower training) and doesn't cover every op. For DS-1, don't chase it. CPU-level determinism from `set_seed()` + pinned deps is enough for 95% of analyses. Reach for GPU determinism only under regulatory requirement, scientific publication, or when debugging a numerics bug that you can't otherwise isolate.
|
|
110
|
+
|
|
111
|
+
### Git SHA and data versioning
|
|
112
|
+
|
|
113
|
+
A reproducible run needs four things pinned: code, dependencies, randomness, and data. We've covered two: dependencies and randomness. For code, log the git SHA with every experiment (see `data-science-experiment-tracking.md` for the logging pattern; don't duplicate the plumbing here). For data, hash the input dataset or pin a DVC / lakeFS / Git-LFS reference (see `data-science-data-versioning.md`).
|
|
114
|
+
|
|
115
|
+
The minimum metadata for any reported result:
|
|
116
|
+
|
|
117
|
+
```text
|
|
118
|
+
git_sha: a1b2c3d4
|
|
119
|
+
uv_lock: sha256:... # hash of uv.lock
|
|
120
|
+
seed: 42
|
|
121
|
+
data_hash: sha256:... # hash of the input dataset(s)
|
|
122
|
+
python: 3.12.1
|
|
123
|
+
platform: darwin-arm64
|
|
124
|
+
```
|
|
125
|
+
|
|
126
|
+
If all six fields match, the numbers should match. If any differ, you know exactly which knob moved.
|
|
127
|
+
|
|
128
|
+
A working pattern: log these fields into your experiment tracker alongside metrics, and include them in any reported result (paper, slide, dashboard tile). The friction cost is near zero once automated; the debugging cost of a result you can't trace back to its exact code + data is enormous.
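As a rough sketch of how those fields could be collected automatically, the helper below uses only the standard library. The function names (`sha256_of_file`, `current_git_sha`, `run_metadata`) and the single-file `data_path` argument are assumptions made for illustration, not part of any tracker's API; the actual logging call belongs in the experiment-tracking plumbing.

```python
import hashlib
import platform
import subprocess
import sys
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def current_git_sha() -> str:
    """Short SHA of HEAD, or 'unknown' when git or a repository is unavailable."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "--short=8", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
        return completed.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def run_metadata(seed: int, lockfile: Path, data_path: Path) -> dict:
    """Collect the minimum metadata fields for a reported result."""
    return {
        "git_sha": current_git_sha(),
        "uv_lock": sha256_of_file(lockfile),
        "seed": seed,
        "data_hash": sha256_of_file(data_path),
        "python": platform.python_version(),
        "platform": f"{sys.platform}-{platform.machine()}",
    }
```

The returned dict can be passed to whatever parameter-logging call your tracker exposes, or dumped next to the metrics as JSON.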
|
|
129
|
+
|
|
130
|
+
### Docker: only at OS boundaries
|
|
131
|
+
|
|
132
|
+
Docker solves a real problem: "it works on my Mac but not on the Linux GPU box." It does not solve "I forgot to commit `uv.lock`." Reach for containers when you're genuinely crossing a boundary:
|
|
133
|
+
|
|
134
|
+
- Developing on macOS, deploying on Linux — native wheels differ, BLAS differs, occasionally results differ.
|
|
135
|
+
- CUDA version mismatch between dev and prod GPUs.
|
|
136
|
+
- A team standardizing a shared prod environment where `uv sync` isn't enough because the base OS libs drift.
|
|
137
|
+
|
|
138
|
+
For a solo greenfield project on one laptop, a Dockerfile is pure overhead. Start with `uv sync --frozen` and add Docker the first time you actually hit a cross-OS reproducibility failure — not before.
|
|
139
|
+
|
|
140
|
+
When you do reach for it, keep the image minimal and derived from your lockfile:
|
|
141
|
+
|
|
142
|
+
```dockerfile
|
|
143
|
+
FROM python:3.12-slim
|
|
144
|
+
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
|
|
145
|
+
WORKDIR /app
|
|
146
|
+
COPY pyproject.toml uv.lock ./
|
|
147
|
+
RUN uv sync --frozen --no-dev
|
|
148
|
+
COPY src/ ./src/
|
|
149
|
+
ENV PYTHONHASHSEED=0
|
|
150
|
+
CMD ["uv", "run", "python", "-m", "src.train"]
|
|
151
|
+
```
|
|
152
|
+
|
|
153
|
+
Pin the base image by digest (`python:3.12-slim@sha256:...`) once the project is in prod — floating tags drift and will silently give you a different glibc next month.
|
|
154
|
+
|
|
155
|
+
### Reproducibility checklist
|
|
156
|
+
|
|
157
|
+
Before calling any analysis "done":
|
|
158
|
+
|
|
159
|
+
- `uv.lock` committed and current (`uv sync --frozen` in CI succeeds)
|
|
160
|
+
- `set_seed()` called at the top of every entry point
|
|
161
|
+
- `PYTHONHASHSEED=0` in `.envrc` (and `.envrc` committed, `.env` gitignored)
|
|
162
|
+
- Git SHA + data hash logged with every experiment run
|
|
163
|
+
- Eval suite passes on a clean clone in CI — the real test of reproducibility is a fresh machine, not your own
|
|
164
|
+
|
|
@@ -0,0 +1,151 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-science-requirements
|
|
3
|
+
description: Problem framing, success metrics, evaluation-test design, stakeholder contracts, and nonfunctional requirements for solo/small-team data science projects
|
|
4
|
+
topics: [data-science, requirements, evaluation, success-metrics, reproducibility]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
As a solo or small-team data scientist without an existing data platform, the single biggest risk to your project is not a bad model — it is ambiguous requirements. Without a tight written spec, a DS project sprawls: the question drifts week to week, the notebook becomes unreproducible, and the stakeholder quietly reinterprets the output. This document defines what "done" looks like for an analytical pipeline, model, or report built from scratch — so you can stop work on time and defend the result.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
A data-science requirements doc states a single well-framed question, one primary success metric with a numeric acceptance threshold declared before any modeling, an evaluation design using held-out data, a stakeholder contract (who consumes the output, in what format, on what cadence), and a nonfunctional budget (reproducibility, runtime, storage). Write the target threshold into a test before you touch training data. If you cannot name the metric and the number, you are not ready to start.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Problem framing
|
|
16
|
+
|
|
17
|
+
Most DS projects fail at step one: the question is fuzzy ("understand churn") rather than decidable ("predict 30-day churn for active paying users, with recall >= 0.6 at precision >= 0.3"). The discipline is to force yourself, in writing, to name the decision the output will drive. If you cannot name that decision, stop and interview the stakeholder until you can.
|
|
18
|
+
|
|
19
|
+
Use a short, copyable problem-statement block at the top of your project README or PRD. The one below is opinionated — it forces every ambiguous field to get filled in before modeling starts. The tradeoff: for pure exploratory work (e.g. a one-off investigation) this is overkill; a 3-line hypothesis is enough.
|
|
20
|
+
|
|
21
|
+
```yaml
# docs/problem-statement.yaml
question: >
  For monthly paying users active in the last 30 days, predict whether they
  will cancel their subscription within the next 30 days.
decision_driven:
  who: Growth team
  action: Enroll top-decile predicted churners in a retention email campaign
  cadence: Weekly scoring
unit_of_analysis: user_id x scoring_date
prediction_target: churn_within_30d (bool)
out_of_scope:
  - free-tier users
  - annual subscribers
  - users less than 14 days old at scoring time
known_confounders:
  - planned price change on 2026-05-01
  - seasonality around end-of-year
```
|
|
40
|
+
|
|
41
|
+
### Success metrics
|
|
42
|
+
|
|
43
|
+
State the primary success metric and its acceptance threshold in writing before you train anything. The number comes from the stakeholder contract, not from what the model can achieve — otherwise you are reverse-engineering the bar to whatever you got. Pick one primary metric; secondary metrics are tie-breakers, not co-equals.
|
|
44
|
+
|
|
45
|
+
Typical patterns:
|
|
46
|
+
|
|
47
|
+
- **Predictive model**: one primary metric tied to the downstream decision. For a ranked retention campaign, `recall@top-10%` or `precision@k` beats accuracy or raw AUC, because the campaign can only email the top decile.
|
|
48
|
+
- **Regression / forecast**: `RMSE` in the target's natural unit, plus a naive baseline (last-value, rolling-mean). Beating the baseline is mandatory; if you cannot, the project is not viable.
|
|
49
|
+
- **Analytical pipeline / ETL**: functional correctness plus a p95 runtime budget (e.g. "daily job must finish in < 20 min on the scheduled box").
|
|
50
|
+
- **Report / dashboard**: domain acceptance threshold — the numbers in the report must match an independently computed source-of-truth query within a stated tolerance (e.g. "<= 0.1% deviation from the finance ledger").
|
|
51
|
+
|
|
52
|
+
Encode the success metric as a function so it is unambiguous and testable. The expression below is the whole contract — write it the day you start.
|
|
53
|
+
|
|
54
|
+
```python
# src/metrics.py
from sklearn.metrics import precision_recall_curve
import numpy as np

TARGET_RECALL = 0.60
MIN_PRECISION = 0.30  # at the threshold that achieves TARGET_RECALL

def primary_metric(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    """Primary success metric: precision at the threshold that hits target recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # recall is non-increasing along the curve, so reverse it to get an
    # ascending array for searchsorted, then map the position back. The result
    # is the highest threshold whose recall still meets the target.
    idx = np.searchsorted(recall[::-1], TARGET_RECALL)
    idx = len(recall) - 1 - idx
    return {
        "recall": float(recall[idx]),
        "precision": float(precision[idx]),
        # thresholds has one fewer entry than precision/recall, hence the clamp
        "threshold": float(thresholds[min(idx, len(thresholds) - 1)]),
        "passes": bool(recall[idx] >= TARGET_RECALL and precision[idx] >= MIN_PRECISION),
    }
```
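For the ranked-campaign case, where only the top decile gets emailed, the `recall@top-10%` metric named above can be sketched without any dependencies. The function name `recall_at_top_fraction` is hypothetical, and flooring the cutoff (with a minimum of one row) is an assumption you should align with how the campaign actually selects users.

```python
from typing import Sequence


def recall_at_top_fraction(
    y_true: Sequence[int], y_score: Sequence[float], fraction: float = 0.10
) -> float:
    """Share of all positives that land in the top `fraction` of scores."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must have the same length")

    total_positives = sum(y_true)
    if total_positives == 0:
        return 0.0

    # Floor the cutoff, but always select at least one row.
    n_selected = max(1, int(len(y_true) * fraction))
    # sorted() is stable, so tied scores keep their input order.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    captured = sum(label for _, label in ranked[:n_selected])
    return captured / total_positives
```

A fixed-size campaign makes this a closer proxy for the downstream decision than a threshold-based metric, because the cutoff is set by capacity, not by score.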
|
|
75
|
+
|
|
76
|
+
### Evaluation-test design
|
|
77
|
+
|
|
78
|
+
The evaluation test is the single gate between "training run" and "ship it." Its job is to answer one question: does the model hit the stated metric on data it has not seen? Get this wrong — leak the future into the past, evaluate on training rows — and every downstream decision is poisoned.
|
|
79
|
+
|
|
80
|
+
Opinionated defaults:
|
|
81
|
+
|
|
82
|
+
- **Temporal target**: split by time, not randomly. Train on `[t0, t1)`, hold out `[t1, t2)`. Random splits with temporal data leak future information and will silently inflate metrics.
|
|
83
|
+
- **Non-temporal target**: stratified split by the label, fixed `random_state`, held-out fraction 15-20%.
|
|
84
|
+
- **Small data (< 10k rows)**: 5-fold cross-validation with the same fold seed every run; report mean plus std of the primary metric.
|
|
85
|
+
- **Never** tune hyperparameters on the holdout. Use a third validation split or inner CV. Tradeoff: if your dataset is tiny you may have to pool — document the risk explicitly.
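The temporal default above can be sketched in a few lines. This is a minimal, dependency-free illustration: `temporal_split` and the `scoring_date` key are hypothetical names, and rows are assumed to be dicts carrying a `datetime.date`.

```python
from datetime import date
from typing import Dict, List, Tuple

Row = Dict[str, object]


def temporal_split(
    rows: List[Row], t0: date, t1: date, t2: date, date_key: str = "scoring_date"
) -> Tuple[List[Row], List[Row]]:
    """Train on [t0, t1), hold out [t1, t2). Rows outside both windows are dropped."""
    if not t0 < t1 < t2:
        raise ValueError("expected t0 < t1 < t2")
    train = [row for row in rows if t0 <= row[date_key] < t1]
    holdout = [row for row in rows if t1 <= row[date_key] < t2]
    return train, holdout


example_rows = [
    {"user_id": 1, "scoring_date": date(2025, 11, 3)},
    {"user_id": 2, "scoring_date": date(2025, 12, 15)},
    {"user_id": 3, "scoring_date": date(2026, 1, 1)},
    {"user_id": 4, "scoring_date": date(2026, 2, 20)},
]
train_rows, holdout_rows = temporal_split(
    example_rows, date(2025, 10, 1), date(2026, 1, 1), date(2026, 4, 1)
)
```

Because the windows are half-open, a row dated exactly `t1` lands in the holdout, so no training row is ever dated on or after the first holdout row.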
|
|
86
|
+
|
|
87
|
+
The evaluation belongs in the test suite, not a notebook. The stakeholder should be able to run `pytest tests/test_model_evaluation.py` and see green before accepting the deliverable.
|
|
88
|
+
|
|
89
|
+
```python
# tests/test_model_evaluation.py
import joblib
import pandas as pd
import pytest
from src.metrics import primary_metric, TARGET_RECALL, MIN_PRECISION

HOLDOUT_PATH = "data/holdout_2026_q1.parquet"
MODEL_PATH = "artifacts/churn_model.pkl"

@pytest.fixture(scope="module")
def scored_holdout():
    df = pd.read_parquet(HOLDOUT_PATH)
    model = joblib.load(MODEL_PATH)
    X = df.drop(columns=["churn_within_30d"])
    y_true = df["churn_within_30d"].to_numpy()
    y_score = model.predict_proba(X)[:, 1]
    return y_true, y_score

def test_model_beats_acceptance_threshold(scored_holdout):
    y_true, y_score = scored_holdout
    result = primary_metric(y_true, y_score)
    assert result["passes"], (
        f"Model failed acceptance: recall={result['recall']:.3f} "
        f"(target {TARGET_RECALL}), precision={result['precision']:.3f} "
        f"(min {MIN_PRECISION})"
    )

def test_model_beats_naive_baseline(scored_holdout):
    # Baseline: predict global churn rate for everyone. Any real model must beat it.
    y_true, y_score = scored_holdout
    baseline_score = pd.Series([y_true.mean()] * len(y_true)).to_numpy()
    assert primary_metric(y_true, y_score)["precision"] > \
        primary_metric(y_true, baseline_score)["precision"]
```
|
|
124
|
+
|
|
125
|
+
### Stakeholder contract
|
|
126
|
+
|
|
127
|
+
A stakeholder contract makes the hand-off concrete. Without it, you deliver a notebook and the recipient quietly asks for a PDF, a Slack message, a dashboard, or a CSV — all different artifacts. Write this down the same week you write the problem statement.
|
|
128
|
+
|
|
129
|
+
Minimum fields, in order of how often they get skipped:
|
|
130
|
+
|
|
131
|
+
- **Consumer**: named human or team, not "the business."
|
|
132
|
+
- **Artifact format**: one of `csv`, `parquet`, `dashboard (URL)`, `API endpoint`, `PDF report`, `Slack summary`. Pick exactly one primary.
|
|
133
|
+
- **Schema**: column names, types, units, PII flags. Include an example row.
|
|
134
|
+
- **Cadence**: one-shot, daily, weekly, on-demand. If recurring, name the day-of-week and time-of-day.
|
|
135
|
+
- **Freshness SLA**: how stale is the underlying data allowed to be at delivery time.
|
|
136
|
+
- **Failure behavior**: what happens if the pipeline fails — silent retry, page the owner, stale-serve, fail loud.
|
|
137
|
+
- **Sunset criteria**: when does this deliverable stop being needed. If you cannot answer, the project has no natural end.
|
|
138
|
+
|
|
139
|
+
A one-off analysis can collapse this into a single paragraph; a recurring pipeline needs all seven fields in a short `CONTRACT.md` alongside the code.
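If you would rather have the contract checked than merely written, the seven fields can be mirrored in a small dataclass. This is only a sketch: the class name `StakeholderContract`, the field names, and the `problems` method are assumptions chosen to match the list above, and free-text strings stand in for whatever structure your schema needs.

```python
from dataclasses import dataclass, fields
from typing import List

# The artifact formats listed in the contract fields above.
ALLOWED_FORMATS = {
    "csv", "parquet", "dashboard (URL)", "API endpoint", "PDF report", "Slack summary",
}


@dataclass(frozen=True)
class StakeholderContract:
    consumer: str
    artifact_format: str
    schema: str
    cadence: str
    freshness_sla: str
    failure_behavior: str
    sunset_criteria: str

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the contract is complete."""
        issues = [
            f"{field.name} is empty"
            for field in fields(self)
            if not getattr(self, field.name).strip()
        ]
        if self.artifact_format.strip() and self.artifact_format not in ALLOWED_FORMATS:
            issues.append(f"artifact_format {self.artifact_format!r} is not a known format")
        return issues


draft = StakeholderContract(
    consumer="Growth team",
    artifact_format="csv",
    schema="user_id:int, churn_score:float (0-1), scoring_date:date",
    cadence="weekly, Monday 08:00",
    freshness_sla="source data no older than 24h at delivery",
    failure_behavior="fail loud, page the owner",
    sunset_criteria="",
)
```

Here `draft.problems()` flags the missing sunset criteria, which is the field the list above warns is most easily skipped.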
|
|
140
|
+
|
|
141
|
+
### Nonfunctional requirements
|
|
142
|
+
|
|
143
|
+
Nonfunctional requirements are what separates a notebook from a deliverable. Three to name explicitly:
|
|
144
|
+
|
|
145
|
+
- **Reproducibility**: the pipeline must produce byte-identical outputs given identical inputs. That means a pinned `requirements.txt` (or `pyproject.toml` + lockfile), explicit `random_state` on every stochastic step (train/test split, model init, shuffling, samplers), a recorded data snapshot (immutable parquet under a dated path, not a mutable SQL query), and an entry-point script that runs end-to-end without manual cells. Test it: delete your local `.venv`, re-clone, run the script, diff the outputs. If they differ, reproducibility is broken. The tradeoff: strict byte-reproducibility is hard on GPU — for deep-learning projects, accept statistical reproducibility (metric within a tolerance) and document the exact hardware/CUDA version.
|
|
146
|
+
- **Runtime budget**: name a wall-clock ceiling for the full pipeline on the hardware you actually have. A useful default for small-team work: "end-to-end run (data pull -> train -> evaluate -> scoring output) must complete in <= 1 hour on a 16GB MacBook Pro." If you blow past it, either simplify or move to a bigger box deliberately — do not let runtime creep silently.
|
|
147
|
+
- **Storage budget**: cap the on-disk footprint of raw data, features, and model artifacts. For laptop-scale work, `< 20 GB` total is a reasonable starting point; over that, you need a deliberate story (external object store, partitioned pulls, sampling). Record the budget in the README and check it in CI with a simple `du -sh` assertion.
|
|
148
|
+
|
|
149
|
+
Encode these as top-of-project invariants, not aspirations. If the model hits the success metric but the pipeline is unreproducible or blows the runtime budget, the project is not done.
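The storage invariant, for example, can be enforced with a small script instead of a shell one-liner. The sketch below is one possible shape, with hypothetical names (`directory_size_bytes`, `check_storage_budget`). It sums apparent file sizes, which can differ slightly from what `du` reports in disk blocks.

```python
import os
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the apparent size of every regular file under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = Path(dirpath) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def check_storage_budget(root: Path, budget_bytes: int) -> int:
    """Return the measured size, or raise AssertionError if it exceeds the budget."""
    used = directory_size_bytes(root)
    if used > budget_bytes:
        raise AssertionError(
            f"{root} uses {used} bytes, over the budget of {budget_bytes} bytes"
        )
    return used
```

In CI this could run as a test, for instance `check_storage_budget(Path("data"), 20 * 1024**3)` for the 20 GB starting point suggested above; the `data` directory name is an assumption.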
|
|
150
|
+
|
|
151
|
+
Taken together, these five sections — problem framing, success metric, evaluation test, stakeholder contract, and nonfunctional budget — form the acceptance spec for the project. Write them up front, commit them alongside the code, and treat any drift as a scope change that requires re-agreeing with the stakeholder.
|