@zigrivers/scaffold 3.22.0 → 3.24.0

Files changed (111)
  1. package/README.md +44 -23
  2. package/content/knowledge/core/automated-review-tooling.md +3 -3
  3. package/content/knowledge/core/multi-model-review-dispatch.md +13 -4
  4. package/content/knowledge/data-science/README.md +23 -0
  5. package/content/knowledge/data-science/data-science-architecture.md +163 -0
  6. package/content/knowledge/data-science/data-science-conventions.md +233 -0
  7. package/content/knowledge/data-science/data-science-data-versioning.md +198 -0
  8. package/content/knowledge/data-science/data-science-dev-environment.md +159 -0
  9. package/content/knowledge/data-science/data-science-experiment-tracking.md +194 -0
  10. package/content/knowledge/data-science/data-science-model-evaluation.md +160 -0
  11. package/content/knowledge/data-science/data-science-notebook-discipline.md +170 -0
  12. package/content/knowledge/data-science/data-science-observability.md +161 -0
  13. package/content/knowledge/data-science/data-science-project-structure.md +178 -0
  14. package/content/knowledge/data-science/data-science-reproducibility.md +164 -0
  15. package/content/knowledge/data-science/data-science-requirements.md +151 -0
  16. package/content/knowledge/data-science/data-science-security.md +151 -0
  17. package/content/knowledge/data-science/data-science-testing.md +183 -0
  18. package/content/knowledge/ml/README.md +10 -0
  19. package/content/methodology/data-science-overlay.yml +39 -0
  20. package/content/pipeline/build/multi-agent-resume.md +7 -6
  21. package/content/pipeline/build/multi-agent-start.md +7 -6
  22. package/content/pipeline/build/single-agent-resume.md +7 -6
  23. package/content/pipeline/build/single-agent-start.md +7 -6
  24. package/content/pipeline/environment/automated-pr-review.md +79 -27
  25. package/content/skills/mmr/SKILL.md +72 -2
  26. package/content/skills/scaffold-runner/SKILL.md +65 -19
  27. package/content/tools/review-code.md +74 -16
  28. package/content/tools/review-pr.md +25 -6
  29. package/dist/cli/commands/check.d.ts.map +1 -1
  30. package/dist/cli/commands/check.js +28 -17
  31. package/dist/cli/commands/check.js.map +1 -1
  32. package/dist/config/schema.d.ts +672 -126
  33. package/dist/config/schema.d.ts.map +1 -1
  34. package/dist/config/schema.js +8 -0
  35. package/dist/config/schema.js.map +1 -1
  36. package/dist/config/schema.test.js +2 -2
  37. package/dist/config/schema.test.js.map +1 -1
  38. package/dist/config/validators/data-science.d.ts +4 -0
  39. package/dist/config/validators/data-science.d.ts.map +1 -0
  40. package/dist/config/validators/data-science.js +15 -0
  41. package/dist/config/validators/data-science.js.map +1 -0
  42. package/dist/config/validators/index.d.ts.map +1 -1
  43. package/dist/config/validators/index.js +2 -0
  44. package/dist/config/validators/index.js.map +1 -1
  45. package/dist/core/assembly/knowledge-loader.d.ts.map +1 -1
  46. package/dist/core/assembly/knowledge-loader.js +6 -0
  47. package/dist/core/assembly/knowledge-loader.js.map +1 -1
  48. package/dist/core/assembly/knowledge-loader.test.js +34 -0
  49. package/dist/core/assembly/knowledge-loader.test.js.map +1 -1
  50. package/dist/e2e/project-type-overlays.test.js +73 -0
  51. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  52. package/dist/project/adopt.d.ts.map +1 -1
  53. package/dist/project/adopt.js +3 -1
  54. package/dist/project/adopt.js.map +1 -1
  55. package/dist/project/detectors/coverage.test.d.ts +2 -0
  56. package/dist/project/detectors/coverage.test.d.ts.map +1 -0
  57. package/dist/project/detectors/coverage.test.js +78 -0
  58. package/dist/project/detectors/coverage.test.js.map +1 -0
  59. package/dist/project/detectors/data-science.d.ts +4 -0
  60. package/dist/project/detectors/data-science.d.ts.map +1 -0
  61. package/dist/project/detectors/data-science.js +32 -0
  62. package/dist/project/detectors/data-science.js.map +1 -0
  63. package/dist/project/detectors/data-science.test.d.ts +2 -0
  64. package/dist/project/detectors/data-science.test.d.ts.map +1 -0
  65. package/dist/project/detectors/data-science.test.js +62 -0
  66. package/dist/project/detectors/data-science.test.js.map +1 -0
  67. package/dist/project/detectors/disambiguate.d.ts +2 -0
  68. package/dist/project/detectors/disambiguate.d.ts.map +1 -1
  69. package/dist/project/detectors/disambiguate.js +3 -2
  70. package/dist/project/detectors/disambiguate.js.map +1 -1
  71. package/dist/project/detectors/disambiguate.test.js +10 -1
  72. package/dist/project/detectors/disambiguate.test.js.map +1 -1
  73. package/dist/project/detectors/index.d.ts.map +1 -1
  74. package/dist/project/detectors/index.js +2 -0
  75. package/dist/project/detectors/index.js.map +1 -1
  76. package/dist/project/detectors/library.d.ts.map +1 -1
  77. package/dist/project/detectors/library.js +1 -0
  78. package/dist/project/detectors/library.js.map +1 -1
  79. package/dist/project/detectors/resolve-detection.test.js +31 -0
  80. package/dist/project/detectors/resolve-detection.test.js.map +1 -1
  81. package/dist/project/detectors/types.d.ts +6 -2
  82. package/dist/project/detectors/types.d.ts.map +1 -1
  83. package/dist/project/detectors/types.js.map +1 -1
  84. package/dist/types/config.d.ts +8 -1
  85. package/dist/types/config.d.ts.map +1 -1
  86. package/dist/wizard/copy/core.d.ts.map +1 -1
  87. package/dist/wizard/copy/core.js +4 -0
  88. package/dist/wizard/copy/core.js.map +1 -1
  89. package/dist/wizard/copy/data-science.d.ts +3 -0
  90. package/dist/wizard/copy/data-science.d.ts.map +1 -0
  91. package/dist/wizard/copy/data-science.js +15 -0
  92. package/dist/wizard/copy/data-science.js.map +1 -0
  93. package/dist/wizard/copy/index.d.ts.map +1 -1
  94. package/dist/wizard/copy/index.js +2 -0
  95. package/dist/wizard/copy/index.js.map +1 -1
  96. package/dist/wizard/copy/types.d.ts +5 -1
  97. package/dist/wizard/copy/types.d.ts.map +1 -1
  98. package/dist/wizard/copy/types.test-d.js +7 -0
  99. package/dist/wizard/copy/types.test-d.js.map +1 -1
  100. package/dist/wizard/questions.d.ts +2 -1
  101. package/dist/wizard/questions.d.ts.map +1 -1
  102. package/dist/wizard/questions.js +9 -1
  103. package/dist/wizard/questions.js.map +1 -1
  104. package/dist/wizard/questions.test.js +14 -0
  105. package/dist/wizard/questions.test.js.map +1 -1
  106. package/dist/wizard/wizard.d.ts.map +1 -1
  107. package/dist/wizard/wizard.js +1 -0
  108. package/dist/wizard/wizard.js.map +1 -1
  109. package/package.json +1 -1
  110. package/skills/mmr/SKILL.md +72 -2
  111. package/skills/scaffold-runner/SKILL.md +65 -19
@@ -0,0 +1,161 @@
1
+ ---
2
+ name: data-science-observability
3
+ description: Monitoring deployed DS models and pipelines — prediction logging to Parquet, scheduled evaluation, basic drift detection, and Evidently for deeper analysis
4
+ topics: [data-science, observability, monitoring, drift, evidently]
5
+ ---
6
+
7
+ Models don't fail loudly. A scoring job keeps running, rows keep landing in the output table, dashboards stay green — and quietly, the predictions get worse. The world drifts away from whatever snapshot you trained on, and nobody notices until a stakeholder says "these numbers look weird." Observability for a solo DS isn't a platform; it's a small set of habits that give you a chance to catch decay before someone else does.
8
+
9
+ ## Summary
10
+
11
+ For a solo or small-team data scientist with something deployed (even just a weekly cron), observability boils down to four habits: log every prediction with its inputs to a dated Parquet file, re-run your evaluation script on a schedule and alert on metric drops, check a handful of key features for distributional drift, and reach for `Evidently` when you want a pre-built drift report instead of writing your own. The goal is a tripwire, not a dashboard — you want to get paged when something's wrong, not stare at graphs hoping to spot it.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Log predictions + inputs
16
+
17
+ Every time your model scores something, append a row to a dated Parquet log. This is the single most useful thing you can do for future-you — drift analysis, debugging, label backfill, and post-mortems all depend on having this log.
18
+
19
+ Layout:
20
+
21
+ ```
+ data/processed/predictions/
+   2026-04-21/
+     run-20260421T0300-abc123.parquet
+   2026-04-22/
+     run-20260422T0300-def456.parquet
+ ```
28
+
29
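+ Schema (one row per prediction; `features` should carry a `record_id` key column, because the evaluation and backfill joins later in this guide merge on it):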
30
+
31
+ ```python
+ # src/monitor/prediction_log.py
+ import uuid
+ from datetime import datetime, timezone
+ from pathlib import Path
+ import pandas as pd
+
+ def log_predictions(
+     features: pd.DataFrame,
+     predictions: pd.Series,
+     model_version: str,
+     log_root: Path = Path("data/processed/predictions"),
+ ) -> Path:
+     # Take one timestamp so the date directory, run_id, and logged_at agree
+     # even when a run straddles midnight UTC.
+     now = datetime.now(timezone.utc)
+     today = now.strftime("%Y-%m-%d")
+     run_id = now.strftime("%Y%m%dT%H%M") + "-" + uuid.uuid4().hex[:6]
+     out_dir = log_root / today
+     out_dir.mkdir(parents=True, exist_ok=True)
+
+     df = features.copy()
+     df["prediction"] = predictions.values
+     df["model_version"] = model_version
+     df["logged_at"] = now
+     df["run_id"] = run_id
+     df["ground_truth"] = pd.NA  # Backfilled later when labels arrive
+
+     out = out_dir / f"run-{run_id}.parquet"
+     df.to_parquet(out, index=False)
+     return out
+ ```
60
+
61
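+ Parquet is right for this: columnar, compressed, and fast to scan across dates. Point `pd.read_parquet("data/processed/predictions/")` at the log root to load every run; pandas' default pyarrow engine reads a directory of Parquet files but does not expand glob patterns such as `**/*.parquet`. If inputs have PII, hash or drop those columns before logging; you rarely need raw identifiers to do drift or error analysis.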
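One way to follow the PII advice above is to replace identifiers with a keyed hash before they reach the log. This is a minimal stdlib-only sketch: `pseudonymize` and the example salt are hypothetical, not part of the package, and it assumes the real salt is a secret stored outside the repo.

```python
import hashlib
import hmac

def pseudonymize(value: str, salt: bytes) -> str:
    """Return a stable, non-reversible token for an identifier (keyed SHA-256)."""
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256)
    # Truncated for readability; keep more hex characters if collisions matter.
    return digest.hexdigest()[:16]

# Hypothetical usage on a features DataFrame before calling log_predictions():
#   features["user_id"] = features["user_id"].astype(str).map(lambda v: pseudonymize(v, SALT))
token = pseudonymize("user-123", salt=b"example-salt-not-for-production")
```

The same input with the same salt always yields the same token, so joins across days keep working without the raw identifier being stored.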
62
+
63
+ ### Scheduled eval re-runs
64
+
65
+ Your training-time evaluation script is also your monitoring script. Run it weekly or monthly against recent predictions joined to whatever ground truth has arrived, and alert when the headline metric breaches a threshold.
66
+
67
+ ```python
+ # src/monitor/eval.py
+ import sys
+ import pandas as pd
+ from sklearn.metrics import roc_auc_score
+
+ THRESHOLD = 0.80  # Alert if AUC drops below this
+
+ def main() -> int:
+     preds = pd.read_parquet("data/processed/predictions/")
+     labels = pd.read_parquet("data/processed/labels/")
+     joined = preds.merge(labels, on="record_id", how="inner")
+     if len(joined) < 500:
+         print("Not enough labeled data yet; skipping.")
+         return 0
+
+     auc = roc_auc_score(joined["actual"], joined["prediction"])
+     print(f"AUC on {len(joined)} labeled rows: {auc:.3f}")
+     if auc < THRESHOLD:
+         # Send email / Slack webhook here
+         print(f"ALERT: AUC {auc:.3f} below threshold {THRESHOLD}", file=sys.stderr)
+         return 1
+     return 0
+
+ if __name__ == "__main__":
+     raise SystemExit(main())
+ ```
94
+
95
+ Schedule it with whatever you already have — cron, a GitHub Actions `schedule:` workflow, Airflow if you run it, or your platform's scheduled job. Exit code 1 plus a Slack webhook is a perfectly good alerting system at this scale.
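The "Slack webhook" half of that alert can be sketched with the standard library alone. This assumes a Slack incoming-webhook URL, which accepts a JSON body with a `text` field; `build_alert_payload` and `send_alert` are illustrative names, and where the URL comes from (for example an environment variable) is left to the reader.

```python
import json
import urllib.request

def build_alert_payload(metric: str, value: float, threshold: float) -> dict:
    """Build the JSON body for a Slack incoming-webhook message."""
    return {"text": f"ALERT: {metric} {value:.3f} below threshold {threshold:.2f}"}

def send_alert(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

# Building the payload needs no network; send_alert() is only called on a breach.
payload = build_alert_payload("AUC", 0.7312, 0.80)
```

Keeping payload construction separate from the network call makes the message format easy to unit-test without hitting Slack.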
96
+
97
+ ### Basic drift detection
98
+
99
+ Before reaching for a library, do the cheap thing: compare this period's feature distribution to your training distribution. Mean, std, a couple of quantiles, and a KS statistic cover most of what you need.
100
+
101
+ ```python
+ # src/monitor/drift.py
+ from scipy.stats import ks_2samp
+ import pandas as pd
+
+ def feature_drift(reference: pd.Series, current: pd.Series) -> dict:
+     stat, p = ks_2samp(reference.dropna(), current.dropna())
+     return {
+         "ref_mean": reference.mean(), "cur_mean": current.mean(),
+         "ref_std": reference.std(), "cur_std": current.std(),
+         "ks_stat": stat, "ks_p": p,
+         "drifted": p < 0.01,
+     }
+
+ train = pd.read_parquet("data/processed/train.parquet")
+ recent = pd.read_parquet("data/processed/predictions/2026-04-21/")
+ for col in ["amount", "user_tenure_days", "n_items"]:
+     print(col, feature_drift(train[col], recent[col]))
+ ```
120
+
121
+ Run this alongside your scheduled eval. You don't need a dashboard — printing to the job log and alerting on `drifted=True` on any monitored feature is enough.
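One caveat when alerting on the KS p-value alone: with large prediction logs, even a negligible shift becomes statistically significant. A possible refinement, sketched below, is to also require a minimum KS statistic before flagging a feature. The function name and the `min_ks_stat=0.1` cutoff are illustrative assumptions, not recommendations from the source; the input is a dict of per-feature results shaped like the output of `feature_drift`.

```python
def drifted_features(
    results: dict[str, dict],
    alpha: float = 0.01,
    min_ks_stat: float = 0.1,
) -> list[str]:
    """Names of features whose KS test is both significant and large enough to matter."""
    return sorted(
        name
        for name, r in results.items()
        if r["ks_p"] < alpha and r["ks_stat"] >= min_ks_stat
    )

# Example inputs (made-up numbers for illustration only).
results = {
    "amount": {"ks_stat": 0.21, "ks_p": 1e-9},
    "n_items": {"ks_stat": 0.02, "ks_p": 1e-4},  # significant but tiny shift
    "user_tenure_days": {"ks_stat": 0.05, "ks_p": 0.40},
}
flagged = drifted_features(results)
```

In a scheduled job, a non-empty `flagged` list would map to a non-zero exit code, in the same way as the evaluation script.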
122
+
123
+ ### Evidently for more
124
+
125
+ When you outgrow ad-hoc KS tests, `Evidently` gives you a pre-built drift report across all features, plus data quality checks and target drift, as an HTML page you can open or ship to S3.
126
+
127
+ ```python
128
+ # src/monitor/evidently_report.py
129
+ import pandas as pd
130
+ from evidently import Report
131
+ from evidently.presets import DataDriftPreset
132
+
133
+ reference = pd.read_parquet("data/processed/train.parquet")
134
+ current = pd.read_parquet("data/processed/predictions/2026-04-21/")
135
+
136
+ report = Report([DataDriftPreset()])
137
+ snapshot = report.run(reference_data=reference, current_data=current)
138
+ snapshot.save_html("reports/drift-2026-04-21.html")
139
+ ```
140
+
141
+ This is opt-in. If plain pandas + SciPy is telling you what you need to know, don't add a dependency. Reach for Evidently when you have enough features that per-column code is tedious, or when you want a shareable artifact for a stakeholder.
142
+
143
+ ### The prediction / feedback loop
144
+
145
+ Ground truth almost never arrives at prediction time. A churn model predicts today who'll cancel next month; a fraud model predicts now whether a transaction is bad, confirmed days later. That delay is why the Parquet log exists — you keep predictions around until labels catch up, then join.
146
+
147
+ ```python
148
+ # src/monitor/backfill_labels.py
149
+ import pandas as pd
150
+
151
+ preds = pd.read_parquet("data/processed/predictions/")
152
+ labels = pd.read_parquet("data/processed/labels/") # record_id, actual, label_time
153
+ merged = preds.merge(labels, on="record_id", how="left")
154
+ merged.to_parquet("data/processed/predictions_labeled.parquet", index=False)
155
+ ```
156
+
157
+ Keep prediction logs for two to three full feedback cycles (if labels arrive after 30 days, keep 60-90 days) so that late labels still find their predictions. This join is how you get a real accuracy number on production traffic, not just your static test-set number from training day.
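The retention window can be enforced with a small pruning helper. This sketch assumes the dated `YYYY-MM-DD` directory layout shown earlier; `expired_partitions` is a hypothetical name, and it only lists candidates so that deletion stays an explicit, separate step.

```python
from datetime import date, timedelta
from pathlib import Path

def expired_partitions(log_root: Path, keep_days: int, today: date) -> list[Path]:
    """Return dated partition directories (YYYY-MM-DD) older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for child in sorted(log_root.iterdir()):
        if not child.is_dir():
            continue
        try:
            partition_date = date.fromisoformat(child.name)
        except ValueError:
            continue  # not a dated partition; leave it alone
        if partition_date < cutoff:
            expired.append(child)
    return expired
```

A weekly job could pass the result to `shutil.rmtree` once the listed directories have been reviewed or archived.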
158
+
159
+ ### What NOT to build
160
+
161
+ Resist the urge to over-engineer. At solo scale you do not need streaming drift detection, a Prometheus/Grafana stack, a model registry with canary deploys, or a dedicated monitoring dashboard. Those are ML-platform-team concerns — build them when there's a team to own them. A dated Parquet log, a scheduled eval script, a handful of drift checks, and an alert that emails you is more than enough to catch the failures that actually happen.
@@ -0,0 +1,178 @@
1
+ ---
2
+ name: data-science-project-structure
3
+ description: Opinionated directory layout for solo and small-team data-science projects — notebooks, src, data, models, reports, tests, configs — with a promotion path from exploration to tested modules
4
+ topics: [data-science, project-structure, layout]
5
+ ---
6
+
7
+ A solo data-science project accumulates artifacts faster than most software: half-finished notebooks, CSV dumps, parquet caches, serialized models, PNG charts, and the occasional markdown write-up. Without a deliberate directory structure, the project turns into a folder of 40 loose files within a month and a new contributor — including future-you — cannot tell what is canonical, what is scratch, and what is safe to delete. A clear layout fixes three problems at once: discoverability (where does X live?), git hygiene (what is tracked vs generated?), and the promotion path (how does throwaway notebook code become tested library code?).
8
+
9
+ ## Summary
10
+
11
+ A solo DS project has six top-level directories that each answer one question: `notebooks/` (exploration), `src/` (importable Python modules), `data/` (split into raw/interim/processed — `data/raw/` is always gitignored; small processed artifacts may be committed or DVC-tracked), `models/` (serialized artifacts, tracked via DVC or git-lfs), `reports/` (rendered outputs — figures, HTML, markdown), and `tests/` (pytest suite mirroring `src/`). `configs/` holds YAML run parameters, and `pyproject.toml` at the root defines the package. The `.gitignore` excludes raw data, most of `models/`, and common binary formats that were not deliberately promoted. Reusable logic follows a strict promotion path: explored in a notebook, extracted into `src/`, unit-tested in `tests/`, then re-imported by notebooks or pipeline scripts.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Top-level layout
16
+
17
+ ```
18
+ project-root/
19
+ ├── notebooks/ # Exploratory notebooks (Marimo preferred; numbered chronologically)
20
+ ├── src/ # Importable Python modules — the library
21
+ │ └── <project>/
22
+ │ ├── __init__.py
23
+ │ ├── ingestion.py # Load raw data from source (CSV, DB, API)
24
+ │ ├── features.py # Feature engineering / transforms
25
+ │ ├── training.py # Model fitting routines
26
+ │ ├── evaluation.py # Metrics, CV loops, slice analysis
27
+ │ └── serving.py # Inference helpers (load artifact, predict)
28
+ ├── data/ # Datasets at every pipeline stage
29
+ │ ├── raw/ # Immutable inputs — GITIGNORED (always)
30
+ │ ├── interim/ # Cached intermediates — small Parquet may be committed
31
+ │ └── processed/ # Analysis-ready — usually DVC-tracked; small files may be committed
32
+ ├── models/ # Serialized model artifacts (DVC / git-lfs tracked)
33
+ ├── reports/ # Rendered output: figures/, HTML reports, markdown summaries
34
+ │ └── figures/
35
+ ├── tests/ # pytest suite — mirrors src/ structure
36
+ ├── configs/ # YAML run configs (Hydra-style or plain)
37
+ ├── pyproject.toml # Package metadata, dependencies, tool config
38
+ ├── .gitignore
39
+ └── README.md
40
+ ```
41
+
42
+ One-liners per dir:
43
+ - `notebooks/` — exploration, EDA, prototyping; numbered `01-…`, `02-…` so ordering is obvious
44
+ - `src/` — every reusable function that a second notebook or a pipeline script will call
45
+ - `data/` — all datasets at every stage; raw is always gitignored, selected processed artifacts (small Parquet in `data/interim/` or `data/processed/`) may be committed directly or tracked via DVC — see `data-science-data-versioning`
46
+ - `models/` — trained model artifacts; tracked through DVC or git-lfs pointers, never raw binaries
47
+ - `reports/` — things a human reads: charts, HTML reports, markdown summaries
48
+ - `tests/` — pytest tests for code in `src/`
49
+ - `configs/` — experiment parameters (paths, seeds, hyperparams) separate from code
50
+
51
+ ### Data: gitignore raw, deliberately admit small processed artifacts
52
+
53
+ The single hardest rule in DS project hygiene: **never commit raw datasets under `data/raw/` or raw model binaries under `models/` to git**. A 200 MB Parquet file committed to history stays in every clone; removing it takes a history rewrite with a tool such as `git filter-repo`, which changes the hash of every commit from that point on. Prevent the problem at the `.gitignore` layer before it happens.
54
+
55
+ Gitignoring the entire `data/` tree is the safest default, but it under-serves a common small-team workflow: a cleaned, analysis-ready Parquet in `data/interim/` that's <10 MB, changes rarely, and is useful to have alongside the code. See `data-science-data-versioning` for the full size-based decision rule. The pattern below gitignores raw data and external copies wholesale, and allows opt-in commits of small processed Parquet through a deliberate un-ignore rule. Anything larger (>50 MB, frequent churn, binary artifacts) goes through DVC or git-lfs instead — never direct git commits.
56
+
57
+ ```gitignore
58
+ # Raw / external data — never committed (bulky, usually not redistributable)
59
+ data/raw/
60
+ data/external/
61
+
62
+ # Processed / interim data — default: ignore; opt in to specific small artifacts below
63
+ data/interim/*
64
+ data/processed/*
65
+ !data/.gitkeep
66
+ !data/interim/.gitkeep
67
+ !data/processed/.gitkeep
68
+ # Allow small cleaned Parquet to be committed (see data-science-data-versioning
69
+ # for size guidance — under ~10 MB, rare changes). Larger artifacts belong in
70
+ # DVC or git-lfs.
71
+ !data/interim/*.parquet
72
+ !data/processed/*.parquet
73
+
74
+ # Model artifacts — tracked via DVC or git-lfs, not raw binaries
75
+ models/*
76
+ !models/.gitkeep
77
+ !models/**/*.dvc
78
+
79
+ # Common large binary formats (defense in depth — catch anything dropped elsewhere)
80
+ *.feather
81
+ *.joblib
82
+ *.pt
83
+ *.pth
84
+ *.onnx
85
+ *.h5
86
+ *.hdf5
87
+ *.npy
88
+ *.npz
89
+
90
+ # Python
91
+ __pycache__/
92
+ *.pyc
93
+ .venv/
94
+ .ruff_cache/
95
+ .pytest_cache/
96
+ *.egg-info/
97
+
98
+ # Notebook outputs (if not using a tool that strips them)
99
+ .ipynb_checkpoints/
100
+
101
+ # Environment / secrets
102
+ .env
103
+ .env.*
104
+ !.env.example
105
+ ```
106
+
107
+ Two things are load-bearing in this snippet. First, `*.parquet` is **not** in the blanket block-list — we want `data/interim/*.parquet` to match as "allowed" once the un-ignore rules kick in. Second, the `!data/interim/*.parquet` and `!data/processed/*.parquet` patterns mean processed Parquet is committable **by default** at this layer; the policy choice of whether to actually commit a given file is made at `git add` time, not in `.gitignore`. If your team's policy is DVC-first for every dataset, drop those `!…*.parquet` lines. The `!data/.gitkeep` family keeps the directories present in fresh clones.
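Because the commit-or-not decision happens at `git add` time, a small size guard can back up the policy. The sketch below is illustrative only: `oversized_files` and the 10 MB constant are assumptions that mirror the size guidance in this section, not tooling shipped with the package.

```python
from pathlib import Path

SIZE_LIMIT_BYTES = 10 * 1024 * 1024  # ~10 MB, mirroring the guidance above

def oversized_files(paths: list[Path], limit: int = SIZE_LIMIT_BYTES) -> list[Path]:
    """Return the files that are too large to commit directly to git."""
    return [p for p in paths if p.is_file() and p.stat().st_size > limit]

# Hypothetical usage before staging:
#   too_big = oversized_files(list(Path("data").rglob("*.parquet")))
#   if too_big: route those files through DVC or git-lfs instead of `git add`.
```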
108
+
109
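+ For versioned datasets and models, see `data-science-data-versioning`: DVC or git-lfs pointers are committed, and the binaries themselves live in remote storage. Treat every pickle-based model file as code: stdlib pickle, `joblib` dumps, and PyTorch `.pt` checkpoints loaded without `weights_only=True` can all execute arbitrary code on load, so a model file from an untrusted source becomes an RCE vector. Prefer data-only formats such as ONNX or safetensors for anything that crosses a trust boundary, and load pickle-based artifacts only when you produced them yourself.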
110
+
111
+ ### Notebooks → src/ promotion
112
+
113
+ Notebooks are for exploration, not production. The moment a function in a notebook becomes useful to a second notebook — or looks like it will survive longer than the current sitting — it gets promoted:
114
+
115
+ 1. **Identify**: a cell (or few cells) encapsulating reusable logic — a loader, a transform, a metric computation
116
+ 2. **Extract**: move the function into the appropriate `src/<project>/` module (`ingestion.py`, `features.py`, etc.) with type hints and a docstring
117
+ 3. **Test**: add a pytest case in `tests/` that exercises a representative input → output case
118
+ 4. **Re-import**: the notebook now does `from <project>.features import clean_customer_ids` instead of defining the function inline
119
+
120
+ This discipline keeps notebooks short (exploration, narrative, charts) and concentrates correctness-critical code where it can be reviewed, tested, and reused. See `data-science-notebook-discipline` for the mechanics of cell size, output clearing, and `%autoreload` so edits in `src/` are picked up in the notebook without a kernel restart.
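As a rough illustration of the four promotion steps, here is what a promoted function and its test might look like. The cleaning behavior (strip, uppercase, drop empties) is a hypothetical example; only the function and test names come from the text, and a real project would likely operate on a pandas Series rather than a list.

```python
# src/<project>/features.py (extracted from a notebook cell)
def clean_customer_ids(ids: list[str]) -> list[str]:
    """Normalize raw customer IDs: strip whitespace, uppercase, drop empties."""
    cleaned = (raw.strip().upper() for raw in ids)
    return [cid for cid in cleaned if cid]

# tests/test_features.py
def test_clean_customer_ids_strips_whitespace() -> None:
    assert clean_customer_ids(["  ab12 ", "cd34\n", "   "]) == ["AB12", "CD34"]
```

The notebook then imports `clean_customer_ids` from the package instead of redefining it, so there is exactly one tested copy of the logic.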
121
+
122
+ ### Configs and reproducibility
123
+
124
+ Hard-coded paths and hyperparameters inside notebook cells are the single biggest reproducibility killer in a DS project. Push them into `configs/` so a run is defined by a config file + a git SHA.
125
+
126
+ ```yaml
+ # configs/train_baseline.yaml
+ run_name: baseline_v1
+ seed: 42
+
+ data:
+   raw_path: data/raw/transactions_2024.csv
+   processed_path: data/processed/transactions_clean.parquet
+   target: churned_30d
+   test_size: 0.2
+   split_seed: 42
+
+ features:
+   include:
+     - tenure_days
+     - monthly_spend
+     - support_tickets_30d
+   log_transform:
+     - monthly_spend
+
+ model:
+   type: gradient_boosting
+   params:
+     n_estimators: 200
+     max_depth: 5
+     learning_rate: 0.05
+
+ output:
+   model_path: models/baseline_v1.joblib
+   report_path: reports/baseline_v1.html
+ ```
157
+
158
+ Training code reads the config with `yaml.safe_load` (or Hydra / pydantic-settings for richer projects) and a teammate can reproduce the run with `python -m <project>.training --config configs/train_baseline.yaml`. For Hydra specifically, configs split into `configs/data/`, `configs/model/`, `configs/training/` and compose at the command line.
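Once the YAML is parsed, it can help to validate it into a typed object before training starts. The sketch below assumes the nesting of the example config (`data.target`, `model.params`, and so on); `TrainConfig` is a hypothetical class, and a literal dict stands in for the result of `yaml.safe_load` so the example runs without PyYAML.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    run_name: str
    seed: int
    target: str
    test_size: float
    model_type: str
    model_params: dict

    @classmethod
    def from_dict(cls, raw: dict) -> "TrainConfig":
        """Build a typed config from the dict returned by yaml.safe_load."""
        test_size = float(raw["data"]["test_size"])
        if not 0.0 < test_size < 1.0:
            raise ValueError(f"test_size must be in (0, 1), got {test_size}")
        return cls(
            run_name=raw["run_name"],
            seed=int(raw["seed"]),
            target=raw["data"]["target"],
            test_size=test_size,
            model_type=raw["model"]["type"],
            model_params=dict(raw["model"].get("params", {})),
        )

# Stand-in for: raw = yaml.safe_load(Path("configs/train_baseline.yaml").read_text())
raw = {
    "run_name": "baseline_v1",
    "seed": 42,
    "data": {"target": "churned_30d", "test_size": 0.2},
    "model": {"type": "gradient_boosting", "params": {"n_estimators": 200}},
}
config = TrainConfig.from_dict(raw)
```

Failing fast on a bad `test_size` or a missing key is cheaper than discovering the typo after a long training run.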
159
+
160
+ ### Tests layout
161
+
162
+ `tests/` mirrors `src/` one-to-one. If `src/<project>/features.py` defines `clean_customer_ids`, then `tests/test_features.py` contains `test_clean_customer_ids_strips_whitespace` and friends.
163
+
164
+ ```
165
+ tests/
166
+ ├── conftest.py # Shared fixtures (tiny sample dataframes, tmp_path helpers)
167
+ ├── test_ingestion.py # Tests for src/<project>/ingestion.py
168
+ ├── test_features.py # Tests for src/<project>/features.py
169
+ ├── test_training.py # Tests for src/<project>/training.py — usually smoke tests
170
+ └── test_evaluation.py # Tests for src/<project>/evaluation.py
171
+ ```
172
+
173
+ Naming rules:
174
+ - Test files: `test_<module>.py` — pytest discovers these by default
175
+ - Test functions: `test_<unit>_<behavior>` — e.g. `test_clean_customer_ids_strips_whitespace`, `test_load_transactions_raises_on_missing_file`
176
+ - Fixtures live in `conftest.py` at the `tests/` root when shared across files; local fixtures stay in the file that uses them
177
+
178
+ Training and evaluation tests are typically **smoke tests** over a 10-row fixture dataframe, not full-dataset runs — the goal is catching shape/dtype/column regressions, not validating model quality (model quality belongs in the evaluation report, not the unit test suite).
@@ -0,0 +1,164 @@
1
+ ---
2
+ name: data-science-reproducibility
3
+ description: Reproducibility for solo/small-team DS — pin deps with uv lock, seed everything, set PYTHONHASHSEED, and reach for Docker only at OS boundaries
4
+ topics: [data-science, reproducibility, determinism, uv, docker]
5
+ ---
6
+
7
+ You show a result in Monday's meeting. Six months later, on a new laptop, you can't reproduce it. Three things usually cause this: dependencies drifted (a minor NumPy release changed a default), randomness wasn't pinned (a shuffle or init picked a different seed), or the data changed underneath you. Reproducibility is the discipline of eliminating all three so the same inputs always produce the same numbers.
8
+
9
+ ## Summary
10
+
11
+ Pin dependencies with `uv lock` and commit `uv.lock` — `uv sync --frozen` rebuilds the exact environment anywhere. Control randomness with a single `set_seed(seed)` helper that seeds Python `random`, NumPy, PyTorch, and TensorFlow at the top of every script. Export `PYTHONHASHSEED=0` via `.envrc` so hash-order is deterministic across interpreter runs. Log the git SHA and data hash with every run so you can walk back to the exact code + data that produced any number. Reach for Docker only when you're crossing an OS or CUDA boundary — for greenfield solo work, `uv sync` is enough.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Pinning dependencies with uv
16
+
17
+ `uv` resolves the full transitive dependency graph into `uv.lock`, which records the exact version and content hash of every package, including transitive deps you never directly imported. Commit it. On a new machine, `uv sync --frozen` installs exactly the locked versions without re-resolving anything. `--frozen` trusts the lockfile as-is; use `uv sync --locked` in CI if you also want the run to fail when `uv.lock` is out of date with `pyproject.toml`.
18
+
19
+ ```bash
20
+ # First time: declare top-level deps in pyproject.toml, then lock
21
+ uv lock
22
+
23
+ # On any machine (CI, teammate's laptop, 6 months later):
24
+ uv sync --frozen # install exactly what's in uv.lock, never re-resolve
25
+
26
+ # Upgrade a single package intentionally:
27
+ uv lock --upgrade-package numpy
28
+ # Review the lock diff in PR. Re-run your eval suite before merging.
29
+
30
+ # Add a new dependency:
31
+ uv add pandas # updates pyproject.toml AND uv.lock atomically
32
+ ```
33
+
34
+ Rules:
35
+ - Commit `uv.lock`. It is not a build artifact; it is a reproducibility contract.
36
+ - Use `--frozen` in CI and release scripts. A silent re-resolve on deploy is the bug you're trying to prevent.
37
+ - Upgrade packages one at a time, with a PR and an eval run. Bulk upgrades hide which bump broke your metrics.
38
+ - Pin the Python version too: add `requires-python = "==3.12.*"` in `pyproject.toml` and let uv install and manage the interpreter. A new Python minor version changes stdlib behavior and shifts which wheel builds of your numeric libraries get installed, either of which can move your numbers.
39
+
40
+ ### Seed management
41
+
42
+ Every source of randomness in your stack has its own PRNG. Seed all of them from a single call, at the top of every train/eval/predict entry point.
43
+
44
+ ```python
+ # src/utils/seed.py
+ import os
+ import random
+ import numpy as np
+
+ def set_seed(seed: int = 42) -> None:
+     """Seed every PRNG we might touch. Call at the top of every script."""
+     # Only affects child processes; the running interpreter read it at startup.
+     os.environ["PYTHONHASHSEED"] = str(seed)
+     random.seed(seed)
+     np.random.seed(seed)
+
+     try:
+         import torch
+         torch.manual_seed(seed)
+         if torch.cuda.is_available():
+             torch.cuda.manual_seed_all(seed)
+     except ImportError:
+         pass
+
+     try:
+         import tensorflow as tf
+         tf.random.set_seed(seed)
+     except ImportError:
+         pass
+ ```
70
+
71
+ Call `set_seed(42)` before any data split, model init, or sampling. If a library accepts a `random_state` argument (scikit-learn does almost everywhere), pass the seed explicitly — global seeding is a safety net, not a substitute.
72
+
73
+ ```python
74
+ # Explicit is better than implicit:
75
+ from sklearn.model_selection import train_test_split
76
+ from sklearn.ensemble import RandomForestClassifier
77
+
78
+ set_seed(42) # global safety net
79
+
80
+ X_train, X_test, y_train, y_test = train_test_split(
81
+ X, y, test_size=0.2, random_state=42 # explicit
82
+ )
83
+ model = RandomForestClassifier(random_state=42) # explicit
84
+ ```
85
+
86
+ The one gotcha: multi-worker DataLoaders in PyTorch spawn subprocesses that need their own seeding. Pass `worker_init_fn` to seed each worker, or you'll get different augmentation sequences across runs even with `set_seed` called in the main process.
87
+
88
+ ### Hash determinism
89
+
90
+ Python randomizes the hash seed for `str` and `bytes` per interpreter run by default. Dicts keep insertion order regardless, but set iteration order and anything else that depends on `hash()` of a string can vary between runs, which is a subtle reproducibility leak that only shows up when you try to diff two training runs.
91
+
92
+ ```bash
93
+ # .envrc (direnv)
94
+ export PYTHONHASHSEED=0
95
+ ```
96
+
97
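+ `set_seed()` writes the variable too, but the interpreter reads `PYTHONHASHSEED` only at startup, so that assignment affects child processes, not the script that is already running. Exporting it in `.envrc` covers everything in the shell session (notebooks, ad-hoc scripts, the test runner) before any Python code runs.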
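A quick way to confirm that the variable is honored at interpreter startup is to hash the same string in two fresh child processes. This is a small demonstration sketch; `string_hash_in_child` is a hypothetical helper and `'feature_a'` is an arbitrary example string.

```python
import os
import subprocess
import sys

def string_hash_in_child(hash_seed: str) -> str:
    """Run a fresh interpreter with PYTHONHASHSEED set and return hash('feature_a')."""
    env = {**os.environ, "PYTHONHASHSEED": hash_seed}
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('feature_a'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

# Two separate interpreters, same fixed seed: the string hash is identical.
first = string_hash_in_child("0")
second = string_hash_in_child("0")
```

Passing `"random"` instead of `"0"` restores the default randomized behavior, in which the two values will usually differ.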
98
+
99
+ ### GPU determinism (brief)
100
+
101
+ Full GPU determinism requires cuDNN-level flags and disabling non-deterministic kernels:
102
+
103
+ ```python
104
+ # Only if you actually need this:
105
+ torch.backends.cudnn.deterministic = True
106
+ torch.backends.cudnn.benchmark = False
107
+ ```
108
+
109
+ This has a real performance cost (often 10-30% slower training) and doesn't cover every op. For DS-1, don't chase it. CPU-level determinism from `set_seed()` + pinned deps is enough for 95% of analyses. Reach for GPU determinism only under regulatory requirement, scientific publication, or when debugging a numerics bug that you can't otherwise isolate.
110
+
111
+ ### Git SHA and data versioning
112
+
113
+ A reproducible run needs four things pinned: code, dependencies, randomness, and data. We've covered two: dependencies and randomness. For code, log the git SHA with every experiment (see `data-science-experiment-tracking.md` for the logging pattern; don't duplicate the plumbing here). For data, hash the input dataset or pin a DVC / lakeFS / Git-LFS reference (see `data-science-data-versioning.md`).
114
+
115
+ The minimum metadata for any reported result:
116
+
117
+ ```text
118
+ git_sha: a1b2c3d4
119
+ uv_lock: sha256:... # hash of uv.lock
120
+ seed: 42
121
+ data_hash: sha256:... # hash of the input dataset(s)
122
+ python: 3.12.1
123
+ platform: darwin-arm64
124
+ ```
125
+
126
+ If all six match, the numbers should match. If any differ, you know exactly which knob moved.
127
+
128
+ A working pattern: log these fields into your experiment tracker alongside metrics, and include them in any reported result (paper, slide, dashboard tile). The friction cost is near zero once automated; the debugging cost of a result you can't trace back to its exact code + data is enormous.
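
As a rough sketch of automating that logging, the snippet below gathers the six fields into a dict that could be passed to whatever experiment tracker is in use. The function names (`sha256_file`, `git_sha`, `collect_run_metadata`) and the file paths passed in are assumptions for illustration, not part of any tracker's API.

```python
import hashlib
import platform
import subprocess
import sys
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def git_sha() -> str:
    """Return the short SHA of HEAD, or "unknown" when git or a checkout is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short=8", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def collect_run_metadata(lockfile: Path, data_file: Path, seed: int) -> dict:
    """Collect the minimum metadata fields listed above for one run."""
    return {
        "git_sha": git_sha(),
        "uv_lock": sha256_file(lockfile),
        "seed": seed,
        "data_hash": sha256_file(data_file),
        "python": platform.python_version(),
        "platform": f"{sys.platform}-{platform.machine()}",
    }
```

A single input file is assumed here; with several input files, one option is to hash each and record the list of hashes under `data_hash`.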
129
+
130
+ ### Docker: only at OS boundaries
131
+
132
+ Docker solves a real problem: "it works on my Mac but not on the Linux GPU box." It does not solve "I forgot to commit `uv.lock`." Reach for containers when you're genuinely crossing a boundary:
133
+
134
+ - Developing on macOS, deploying on Linux — native wheels differ, BLAS differs, occasionally results differ.
135
+ - CUDA version mismatch between dev and prod GPUs.
136
+ - A team standardizing a shared prod environment where `uv sync` isn't enough because the base OS libs drift.
137
+
138
+ For a solo greenfield project on one laptop, a Dockerfile is pure overhead. Start with `uv sync --frozen` and add Docker the first time you actually hit a cross-OS reproducibility failure — not before.
139
+
140
+ When you do reach for it, keep the image minimal and derived from your lockfile:
141
+
142
+ ```dockerfile
143
+ FROM python:3.12-slim
144
+ COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
145
+ WORKDIR /app
146
+ COPY pyproject.toml uv.lock ./
147
+ RUN uv sync --frozen --no-dev
148
+ COPY src/ ./src/
149
+ ENV PYTHONHASHSEED=0
150
+ CMD ["uv", "run", "python", "-m", "src.train"]
151
+ ```
152
+
153
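+ Pin the base image by digest (`python:3.12-slim@sha256:...`) once the project is in prod; floating tags drift and will silently give you a different glibc next month. The same applies to the `uv:latest` tag in the `COPY --from` line: pin it to a specific version tag or digest.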
154
+
155
+ ### Reproducibility checklist
156
+
157
+ Before calling any analysis "done":
158
+
159
+ - `uv.lock` committed and current (`uv sync --frozen` in CI succeeds)
160
+ - `set_seed()` called at the top of every entry point
161
+ - `PYTHONHASHSEED=0` in `.envrc` (and `.envrc` committed, `.env` gitignored)
162
+ - Git SHA + data hash logged with every experiment run
163
+ - Eval suite passes on a clean clone in CI — the real test of reproducibility is a fresh machine, not your own
164
+
@@ -0,0 +1,151 @@
1
+ ---
2
+ name: data-science-requirements
3
+ description: Problem framing, success metrics, evaluation-test design, stakeholder contracts, and nonfunctional requirements for solo/small-team data science projects
4
+ topics: [data-science, requirements, evaluation, success-metrics, reproducibility]
5
+ ---
6
+
7
+ As a solo or small-team data scientist without an existing data platform, the single biggest risk to your project is not a bad model — it is ambiguous requirements. Without a tight written spec, a DS project sprawls: the question drifts week to week, the notebook becomes unreproducible, and the stakeholder quietly reinterprets the output. This document defines what "done" looks like for an analytical pipeline, model, or report built from scratch — so you can stop work on time and defend the result.
8
+
9
+ ## Summary
10
+
11
+ A data-science requirements doc states a single well-framed question, one primary success metric with a numeric acceptance threshold declared before any modeling, an evaluation design using held-out data, a stakeholder contract (who consumes the output, in what format, on what cadence), and a nonfunctional budget (reproducibility, runtime, storage). Write the target threshold into a test before you touch training data. If you cannot name the metric and the number, you are not ready to start.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Problem framing
16
+
17
+ Most DS projects fail at step one: the question is fuzzy ("understand churn") rather than decidable ("predict 30-day churn for active paying users, with recall >= 0.6 at precision >= 0.3"). The discipline is to force yourself, in writing, to name the decision the output will drive. If you cannot name that decision, stop and interview the stakeholder until you can.
18
+
19
+ Use a short, copyable problem-statement block at the top of your project README or PRD. The one below is opinionated — it forces every ambiguous field to get filled in before modeling starts. The tradeoff: for pure exploratory work (e.g. a one-off investigation) this is overkill; a 3-line hypothesis is enough.
20
+
21
+ ```yaml
22
+ # docs/problem-statement.yaml
23
+ question: >
24
+   For monthly paying users active in the last 30 days, predict whether they
25
+   will cancel their subscription within the next 30 days.
26
+ decision_driven:
27
+   who: Growth team
28
+   action: Enroll top-decile predicted churners in a retention email campaign
29
+   cadence: Weekly scoring
30
+ unit_of_analysis: user_id x scoring_date
31
+ prediction_target: churn_within_30d (bool)
32
+ out_of_scope:
33
+   - free-tier users
34
+   - annual subscribers
35
+   - users less than 14 days old at scoring time
36
+ known_confounders:
37
+   - planned price change on 2026-05-01
38
+   - seasonality around end-of-year
39
+ ```
40
+
41
+ ### Success metrics
42
+
43
+ State the primary success metric and its acceptance threshold in writing before you train anything. The number comes from the stakeholder contract, not from what the model can achieve — otherwise you are reverse-engineering the bar to whatever you got. Pick one primary metric; secondary metrics are tie-breakers, not co-equals.
44
+
45
+ Typical patterns:
46
+
47
+ - **Predictive model**: one primary metric tied to the downstream decision. For a ranked retention campaign, `recall@top-10%` or `precision@k` beats accuracy or raw AUC, because the campaign can only email the top decile.
48
+ - **Regression / forecast**: `RMSE` in the target's natural unit, plus a naive baseline (last-value, rolling-mean). Beating the baseline is mandatory; if you cannot, the project is not viable.
49
+ - **Analytical pipeline / ETL**: functional correctness plus a p95 runtime budget (e.g. "daily job must finish in < 20 min on the scheduled box").
50
+ - **Report / dashboard**: domain acceptance threshold — the numbers in the report must match an independently computed source-of-truth query within a stated tolerance (e.g. "<= 0.1% deviation from the finance ledger").
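
For the regression / forecast pattern, the "beat the naive baseline" rule can be sketched in a few lines. This is an illustrative version using only the standard library; the names `rmse`, `last_value_forecast`, and `beats_baseline` are hypothetical, and it assumes one-step-ahead forecasts where `model_pred[i]` is the prediction for `series[i + 1]`.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error, in the target's natural unit."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def last_value_forecast(series: Sequence[float]) -> list[float]:
    """Naive baseline: forecast each point as the previously observed value."""
    return list(series[:-1])


def beats_baseline(series: Sequence[float], model_pred: Sequence[float]) -> bool:
    """True if the model's RMSE is strictly lower than the last-value baseline's."""
    actual = list(series[1:])
    return rmse(actual, model_pred) < rmse(actual, last_value_forecast(series))
```

A rolling-mean baseline would slot in the same way by swapping `last_value_forecast` for a function that averages a trailing window.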
51
+
52
+ Encode the success metric as a function so it is unambiguous and testable. The expression below is the whole contract — write it the day you start.
53
+
54
+ ```python
55
+ # src/metrics.py
56
+ from sklearn.metrics import precision_recall_curve
57
+ import numpy as np
58
+
59
+ TARGET_RECALL = 0.60
60
+ MIN_PRECISION = 0.30 # at the threshold that achieves TARGET_RECALL
61
+
62
+ def primary_metric(y_true: np.ndarray, y_score: np.ndarray) -> dict:
63
+     """Primary success metric: precision at the threshold that hits target recall."""
64
+     precision, recall, thresholds = precision_recall_curve(y_true, y_score)
65
+     # Walk from highest threshold down; stop when recall crosses target.
66
+     idx = np.searchsorted(recall[::-1], TARGET_RECALL)
67
+     idx = len(recall) - 1 - idx
68
+     return {
69
+         "recall": float(recall[idx]),
70
+         "precision": float(precision[idx]),
71
+         "threshold": float(thresholds[min(idx, len(thresholds) - 1)]),
72
+         "passes": bool(recall[idx] >= TARGET_RECALL and precision[idx] >= MIN_PRECISION),
73
+     }
74
+ ```
75
+
76
+ ### Evaluation-test design
77
+
78
+ The evaluation test is the single gate between "training run" and "ship it." Its job is to answer one question: does the model hit the stated metric on data it has not seen? Get this wrong — leak the future into the past, evaluate on training rows — and every downstream decision is poisoned.
79
+
80
+ Opinionated defaults:
81
+
82
+ - **Temporal target**: split by time, not randomly. Train on `[t0, t1)`, hold out `[t1, t2)`. Random splits with temporal data leak future information and will silently inflate metrics.
83
+ - **Non-temporal target**: stratified split by the label, fixed `random_state`, held-out fraction 15-20%.
84
+ - **Small data (< 10k rows)**: 5-fold cross-validation with the same fold seed every run; report mean plus std of the primary metric.
85
+ - **Never** tune hyperparameters on the holdout. Use a third validation split or inner CV. Tradeoff: if your dataset is tiny you may have to pool — document the risk explicitly.
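
The temporal default can be sketched without any dataframe library. The function below is a hypothetical helper (`temporal_split` is not from the source) that assumes each row is a dict carrying a `scoring_date` value of type `datetime.date`; rows outside both windows are dropped.

```python
from datetime import date
from typing import Iterable


def temporal_split(
    rows: Iterable[dict],
    t0: date,
    t1: date,
    t2: date,
    key: str = "scoring_date",
) -> tuple[list[dict], list[dict]]:
    """Train on [t0, t1) and hold out [t1, t2), so no holdout row predates a training row."""
    if not t0 < t1 < t2:
        raise ValueError("expected t0 < t1 < t2")
    train: list[dict] = []
    holdout: list[dict] = []
    for row in rows:
        when = row[key]
        if t0 <= when < t1:
            train.append(row)
        elif t1 <= when < t2:
            holdout.append(row)
    return train, holdout
```

Because the boundary `t1` is exclusive for training and inclusive for the holdout, a row dated exactly `t1` can never land in both sets.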
86
+
87
+ The evaluation belongs in the test suite, not a notebook. The stakeholder should be able to run `pytest tests/test_model_evaluation.py` and see green before accepting the deliverable.
88
+
89
+ ```python
90
+ # tests/test_model_evaluation.py
91
+ import joblib
92
+ import pandas as pd
93
+ import pytest
94
+ from src.metrics import primary_metric, TARGET_RECALL, MIN_PRECISION
95
+
96
+ HOLDOUT_PATH = "data/holdout_2026_q1.parquet"
97
+ MODEL_PATH = "artifacts/churn_model.pkl"
98
+
99
+ @pytest.fixture(scope="module")
100
+ def scored_holdout():
100
+     df = pd.read_parquet(HOLDOUT_PATH)
101
+     model = joblib.load(MODEL_PATH)
102
+     X = df.drop(columns=["churn_within_30d"])
103
+     y_true = df["churn_within_30d"].to_numpy()
104
+     y_score = model.predict_proba(X)[:, 1]
105
+     return y_true, y_score
106
+
107
+ def test_model_beats_acceptance_threshold(scored_holdout):
108
+     y_true, y_score = scored_holdout
109
+     result = primary_metric(y_true, y_score)
110
+     assert result["passes"], (
111
+         f"Model failed acceptance: recall={result['recall']:.3f} "
112
+         f"(target {TARGET_RECALL}), precision={result['precision']:.3f} "
113
+         f"(min {MIN_PRECISION})"
114
+     )
115
+
116
+ def test_model_beats_naive_baseline(scored_holdout):
117
+     # Baseline: predict global churn rate for everyone. Any real model must beat it.
118
+     y_true, y_score = scored_holdout
119
+     baseline_score = pd.Series([y_true.mean()] * len(y_true)).to_numpy()
120
+     assert primary_metric(y_true, y_score)["precision"] > \
121
+         primary_metric(y_true, baseline_score)["precision"]
123
+ ```
124
+
125
+ ### Stakeholder contract
126
+
127
+ A stakeholder contract makes the hand-off concrete. Without it, you deliver a notebook and the recipient quietly asks for a PDF, a Slack message, a dashboard, or a CSV — all different artifacts. Write this down the same week you write the problem statement.
128
+
129
+ Minimum fields, in order of how often they get skipped:
130
+
131
+ - **Consumer**: named human or team, not "the business."
132
+ - **Artifact format**: one of `csv`, `parquet`, `dashboard (URL)`, `API endpoint`, `PDF report`, `Slack summary`. Pick exactly one primary.
133
+ - **Schema**: column names, types, units, PII flags. Include an example row.
134
+ - **Cadence**: one-shot, daily, weekly, on-demand. If recurring, name the day-of-week and time-of-day.
135
+ - **Freshness SLA**: how stale is the underlying data allowed to be at delivery time.
136
+ - **Failure behavior**: what happens if the pipeline fails — silent retry, page the owner, stale-serve, fail loud.
137
+ - **Sunset criteria**: when does this deliverable stop being needed. If you cannot answer, the project has no natural end.
138
+
139
+ A one-off analysis can collapse this into a single paragraph; a recurring pipeline needs all seven fields in a short `CONTRACT.md` alongside the code.
140
+
141
+ ### Nonfunctional requirements
142
+
143
+ Nonfunctional requirements are what separates a notebook from a deliverable. Three to name explicitly:
144
+
145
+ - **Reproducibility**: the pipeline must produce byte-identical outputs given identical inputs. That means a pinned `requirements.txt` (or `pyproject.toml` + lockfile), explicit `random_state` on every stochastic step (train/test split, model init, shuffling, samplers), a recorded data snapshot (immutable parquet under a dated path, not a mutable SQL query), and an entry-point script that runs end-to-end without manual cells. Test it: delete your local `.venv`, re-clone, run the script, diff the outputs. If they differ, reproducibility is broken. The tradeoff: strict byte-reproducibility is hard on GPU — for deep-learning projects, accept statistical reproducibility (metric within a tolerance) and document the exact hardware/CUDA version.
146
+ - **Runtime budget**: name a wall-clock ceiling for the full pipeline on the hardware you actually have. A useful default for small-team work: "end-to-end run (data pull -> train -> evaluate -> scoring output) must complete in <= 1 hour on a 16GB MacBook Pro." If you blow past it, either simplify or move to a bigger box deliberately — do not let runtime creep silently.
147
+ - **Storage budget**: cap the on-disk footprint of raw data, features, and model artifacts. For laptop-scale work, `< 20 GB` total is a reasonable starting point; over that, you need a deliberate story (external object store, partitioned pulls, sampling). Record the budget in the README and check it in CI with a simple `du -sh` assertion.
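
The storage budget check can also be written as a small Python helper that a test or CI step calls. This is a sketch under assumptions: the names `dir_size_bytes` and `within_storage_budget` are hypothetical, and it sums apparent file sizes rather than allocated disk blocks, so the figure may differ slightly from what `du -sh` reports.

```python
import os
from pathlib import Path
from typing import Iterable

# Laptop-scale starting point from the text: total footprint under 20 GB.
STORAGE_BUDGET_BYTES = 20 * 1024**3


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = Path(dirpath) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def within_storage_budget(
    roots: Iterable[Path], budget_bytes: int = STORAGE_BUDGET_BYTES
) -> bool:
    """True if the combined footprint of all roots is strictly under the budget."""
    return sum(dir_size_bytes(Path(root)) for root in roots) < budget_bytes
```

In a test suite this might be called with the raw data, feature, and artifact directories, failing the build as soon as the combined size crosses the recorded budget.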
148
+
149
+ Encode these as top-of-project invariants, not aspirations. If the model hits the success metric but the pipeline is unreproducible or blows the runtime budget, the project is not done.
150
+
151
+ Taken together, these five sections — problem framing, success metric, evaluation test, stakeholder contract, and nonfunctional budget — form the acceptance spec for the project. Write them up front, commit them alongside the code, and treat any drift as a scope change that requires re-agreeing with the stakeholder.