@zigrivers/scaffold 3.22.0 → 3.24.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (111)
  1. package/README.md +44 -23
  2. package/content/knowledge/core/automated-review-tooling.md +3 -3
  3. package/content/knowledge/core/multi-model-review-dispatch.md +13 -4
  4. package/content/knowledge/data-science/README.md +23 -0
  5. package/content/knowledge/data-science/data-science-architecture.md +163 -0
  6. package/content/knowledge/data-science/data-science-conventions.md +233 -0
  7. package/content/knowledge/data-science/data-science-data-versioning.md +198 -0
  8. package/content/knowledge/data-science/data-science-dev-environment.md +159 -0
  9. package/content/knowledge/data-science/data-science-experiment-tracking.md +194 -0
  10. package/content/knowledge/data-science/data-science-model-evaluation.md +160 -0
  11. package/content/knowledge/data-science/data-science-notebook-discipline.md +170 -0
  12. package/content/knowledge/data-science/data-science-observability.md +161 -0
  13. package/content/knowledge/data-science/data-science-project-structure.md +178 -0
  14. package/content/knowledge/data-science/data-science-reproducibility.md +164 -0
  15. package/content/knowledge/data-science/data-science-requirements.md +151 -0
  16. package/content/knowledge/data-science/data-science-security.md +151 -0
  17. package/content/knowledge/data-science/data-science-testing.md +183 -0
  18. package/content/knowledge/ml/README.md +10 -0
  19. package/content/methodology/data-science-overlay.yml +39 -0
  20. package/content/pipeline/build/multi-agent-resume.md +7 -6
  21. package/content/pipeline/build/multi-agent-start.md +7 -6
  22. package/content/pipeline/build/single-agent-resume.md +7 -6
  23. package/content/pipeline/build/single-agent-start.md +7 -6
  24. package/content/pipeline/environment/automated-pr-review.md +79 -27
  25. package/content/skills/mmr/SKILL.md +72 -2
  26. package/content/skills/scaffold-runner/SKILL.md +65 -19
  27. package/content/tools/review-code.md +74 -16
  28. package/content/tools/review-pr.md +25 -6
  29. package/dist/cli/commands/check.d.ts.map +1 -1
  30. package/dist/cli/commands/check.js +28 -17
  31. package/dist/cli/commands/check.js.map +1 -1
  32. package/dist/config/schema.d.ts +672 -126
  33. package/dist/config/schema.d.ts.map +1 -1
  34. package/dist/config/schema.js +8 -0
  35. package/dist/config/schema.js.map +1 -1
  36. package/dist/config/schema.test.js +2 -2
  37. package/dist/config/schema.test.js.map +1 -1
  38. package/dist/config/validators/data-science.d.ts +4 -0
  39. package/dist/config/validators/data-science.d.ts.map +1 -0
  40. package/dist/config/validators/data-science.js +15 -0
  41. package/dist/config/validators/data-science.js.map +1 -0
  42. package/dist/config/validators/index.d.ts.map +1 -1
  43. package/dist/config/validators/index.js +2 -0
  44. package/dist/config/validators/index.js.map +1 -1
  45. package/dist/core/assembly/knowledge-loader.d.ts.map +1 -1
  46. package/dist/core/assembly/knowledge-loader.js +6 -0
  47. package/dist/core/assembly/knowledge-loader.js.map +1 -1
  48. package/dist/core/assembly/knowledge-loader.test.js +34 -0
  49. package/dist/core/assembly/knowledge-loader.test.js.map +1 -1
  50. package/dist/e2e/project-type-overlays.test.js +73 -0
  51. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  52. package/dist/project/adopt.d.ts.map +1 -1
  53. package/dist/project/adopt.js +3 -1
  54. package/dist/project/adopt.js.map +1 -1
  55. package/dist/project/detectors/coverage.test.d.ts +2 -0
  56. package/dist/project/detectors/coverage.test.d.ts.map +1 -0
  57. package/dist/project/detectors/coverage.test.js +78 -0
  58. package/dist/project/detectors/coverage.test.js.map +1 -0
  59. package/dist/project/detectors/data-science.d.ts +4 -0
  60. package/dist/project/detectors/data-science.d.ts.map +1 -0
  61. package/dist/project/detectors/data-science.js +32 -0
  62. package/dist/project/detectors/data-science.js.map +1 -0
  63. package/dist/project/detectors/data-science.test.d.ts +2 -0
  64. package/dist/project/detectors/data-science.test.d.ts.map +1 -0
  65. package/dist/project/detectors/data-science.test.js +62 -0
  66. package/dist/project/detectors/data-science.test.js.map +1 -0
  67. package/dist/project/detectors/disambiguate.d.ts +2 -0
  68. package/dist/project/detectors/disambiguate.d.ts.map +1 -1
  69. package/dist/project/detectors/disambiguate.js +3 -2
  70. package/dist/project/detectors/disambiguate.js.map +1 -1
  71. package/dist/project/detectors/disambiguate.test.js +10 -1
  72. package/dist/project/detectors/disambiguate.test.js.map +1 -1
  73. package/dist/project/detectors/index.d.ts.map +1 -1
  74. package/dist/project/detectors/index.js +2 -0
  75. package/dist/project/detectors/index.js.map +1 -1
  76. package/dist/project/detectors/library.d.ts.map +1 -1
  77. package/dist/project/detectors/library.js +1 -0
  78. package/dist/project/detectors/library.js.map +1 -1
  79. package/dist/project/detectors/resolve-detection.test.js +31 -0
  80. package/dist/project/detectors/resolve-detection.test.js.map +1 -1
  81. package/dist/project/detectors/types.d.ts +6 -2
  82. package/dist/project/detectors/types.d.ts.map +1 -1
  83. package/dist/project/detectors/types.js.map +1 -1
  84. package/dist/types/config.d.ts +8 -1
  85. package/dist/types/config.d.ts.map +1 -1
  86. package/dist/wizard/copy/core.d.ts.map +1 -1
  87. package/dist/wizard/copy/core.js +4 -0
  88. package/dist/wizard/copy/core.js.map +1 -1
  89. package/dist/wizard/copy/data-science.d.ts +3 -0
  90. package/dist/wizard/copy/data-science.d.ts.map +1 -0
  91. package/dist/wizard/copy/data-science.js +15 -0
  92. package/dist/wizard/copy/data-science.js.map +1 -0
  93. package/dist/wizard/copy/index.d.ts.map +1 -1
  94. package/dist/wizard/copy/index.js +2 -0
  95. package/dist/wizard/copy/index.js.map +1 -1
  96. package/dist/wizard/copy/types.d.ts +5 -1
  97. package/dist/wizard/copy/types.d.ts.map +1 -1
  98. package/dist/wizard/copy/types.test-d.js +7 -0
  99. package/dist/wizard/copy/types.test-d.js.map +1 -1
  100. package/dist/wizard/questions.d.ts +2 -1
  101. package/dist/wizard/questions.d.ts.map +1 -1
  102. package/dist/wizard/questions.js +9 -1
  103. package/dist/wizard/questions.js.map +1 -1
  104. package/dist/wizard/questions.test.js +14 -0
  105. package/dist/wizard/questions.test.js.map +1 -1
  106. package/dist/wizard/wizard.d.ts.map +1 -1
  107. package/dist/wizard/wizard.js +1 -0
  108. package/dist/wizard/wizard.js.map +1 -1
  109. package/package.json +1 -1
  110. package/skills/mmr/SKILL.md +72 -2
  111. package/skills/scaffold-runner/SKILL.md +65 -19
@@ -0,0 +1,194 @@
1
+ ---
2
+ name: data-science-experiment-tracking
3
+ description: Local MLflow setup, run instrumentation, git commit tagging, and run comparison for solo and small-team data science work
4
+ topics: [data-science, experiment-tracking, mlflow, weights-and-biases, reproducibility]
5
+ ---
6
+
7
+ Without experiment tracking, data science becomes archaeology: three weeks after a promising result, a stakeholder asks "which config produced that number?" and answering it turns into a forensic exercise — sifting through notebook history, Slack messages, and commented-out cells. A lightweight experiment tracker fixes this with one discipline: every run logs its hyperparameters, metrics, artifacts, and the git commit SHA that produced it. For a solo DS or small team, you do not need a shared server or a cloud account — a local MLflow instance on SQLite is enough to get the full benefit, and you can graduate to a shared deployment later without changing the instrumentation.
8
+
9
+ ## Summary
10
+
11
+ Self-host MLflow locally with a SQLite backend and a local artifact directory — it is the minimum setup that still gives you a queryable run history, a browsable UI, and reproducible run IDs. Every run logs the full hyperparameter dict, metrics per epoch (or iteration), the git commit SHA as a tag, dataset version, and any config or report artifacts. Weights & Biases is a reasonable cloud alternative if you value the polished UI and do not mind cloud storage — but for a DS-1 setup it is not the primary recommendation. Never log PII into run metadata or artifacts, and never commit `mlflow.db` or `mlartifacts/` to git.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### What to log per run
16
+
17
+ Treat every training run, hyperparameter tweak, or evaluation pass as a tracked experiment — even the exploratory ones you think will be throwaway. The cost of logging is trivial; the cost of not logging a run that turns out to matter is measured in hours of re-running and second-guessing. The minimum payload is:
18
+
19
+ - **Hyperparameters**: the full config dict (learning rate, batch size, seed, feature set, model type, loss weights, regularization). Log it all: future-you does not know which knob will matter, and a parameter you did not log cannot be recovered after the fact.
20
+ - **Metrics**: logged with `step=epoch` (or `step=iteration`) so the UI can render a time-series plot. Log train and validation metrics side by side; a single final-value log loses the overfitting story.
21
+ - **Git commit SHA**: a tag pointing to the exact commit that produced the run. Without this, "reproduce run 47" is unanswerable, because the config alone does not capture code changes in the training loop, data loader, or feature engineering.
22
+ - **Dataset version**: a tag or param identifying which dataset snapshot was used — a DVC hash, a filename with a date suffix, or a data commit SHA. Without this, "reproduce run 47" is still unanswerable even if you have the code, because the data moved underneath it.
23
+ - **Run name**: a human-readable name (`baseline-v3-with-dropout`) so the UI list is browsable without clicking every row to read the params.
24
+ - **Artifacts**: the resolved config YAML, the evaluation report JSON, any confusion matrix images, and the final model checkpoint. Small artifacts go inline with the run; large model weights can be stored by reference.
25
+
26
+ ### MLflow self-hosted setup
27
+
28
+ Run the tracking server locally. SQLite is the right backend for a solo workflow — it gives you the full query API without the ops burden of Postgres, and the `mlflow.db` file is small enough that you can zip and share it with a collaborator if you really need to:
29
+
30
+ ```bash
31
+ mlflow server \
32
+ --backend-store-uri sqlite:///mlflow.db \
33
+ --default-artifact-root ./mlartifacts \
34
+ --host 127.0.0.1 --port 5000
35
+ ```
36
+
37
+ Bind to `127.0.0.1` rather than `0.0.0.0` so you do not accidentally expose an unauthenticated tracking server to your network. Leave it running in a terminal tab, a `tmux` pane, or under `launchd`/`systemd` — whatever keeps it up between sessions.
38
+
39
+ Point your code at the server via an environment variable. Using `direnv` keeps this per-project and avoids polluting your shell:
40
+
41
+ ```bash
42
+ # .envrc
43
+ export MLFLOW_TRACKING_URI=http://localhost:5000
44
+ export MLFLOW_EXPERIMENT_NAME=churn-baseline
45
+ ```
46
+
47
+ Add the tracking artifacts to `.gitignore` — they are large, local, and not reproducible from source. Committing them bloats the repo and leaks local-path metadata into history:
48
+
49
+ ```gitignore
50
+ # .gitignore
51
+ mlflow.db
52
+ mlflow.db-journal
53
+ mlartifacts/
54
+ mlruns/
55
+ ```
56
+
57
+ When you later graduate to a shared MLflow server (team deployment, S3 artifact store, Postgres backend), the only change is the `MLFLOW_TRACKING_URI` — your instrumentation code stays identical, and historical runs stay on your laptop as a personal archive.
58
+
59
+ ### Instrumenting a training / experiment run
60
+
61
+ Wrap the training loop in `mlflow.start_run`. The context manager handles start and end timestamps, guarantees the run closes even on exception, and exposes `run.info.run_id` — the stable handle you use later for comparison, export, or model loading:
62
+
63
+ ```python
+ import subprocess
+ import mlflow
+ import yaml
+
+ # Hardcoded for clarity; omit both calls to pick up MLFLOW_TRACKING_URI and
+ # MLFLOW_EXPERIMENT_NAME from .envrc instead.
+ mlflow.set_tracking_uri("http://localhost:5000")
+ mlflow.set_experiment("churn-baseline")
+
+ def train(cfg: dict) -> dict:
+     with mlflow.start_run(run_name=cfg["experiment"]["name"]) as run:
+         # Log full hyperparameter dict (flatten nested keys to dot-paths)
+         mlflow.log_params(_flatten(cfg))
+
+         # Reproducibility tags: git commit is the single most important one
+         git_sha = subprocess.check_output(
+             ["git", "rev-parse", "HEAD"]
+         ).decode().strip()
+         mlflow.set_tag("git_commit", git_sha)
+         mlflow.set_tag("dataset_version", cfg["data"]["version"])
+         mlflow.set_tag("model_type", cfg["model"]["type"])
+
+         # Per-epoch metrics: step=epoch is what gives you a time-series plot
+         for epoch in range(cfg["training"]["epochs"]):
+             train_metrics = train_epoch(...)  # placeholder for your training step
+             val_metrics = evaluate(...)       # placeholder for your validation step
+             mlflow.log_metrics({
+                 "train_loss": train_metrics["loss"],
+                 "val_loss": val_metrics["loss"],
+                 "val_auc": val_metrics["auc"],
+             }, step=epoch)
+
+         # Artifacts: resolved config + eval report
+         with open("configs/resolved.yaml", "w") as f:
+             yaml.safe_dump(cfg, f)
+         mlflow.log_artifact("configs/resolved.yaml")
+         mlflow.log_artifact("reports/eval_report.json")
+
+         return {"run_id": run.info.run_id, **val_metrics}
+ ```
102
+
103
+ A few notes on the shape of this code. `mlflow.log_params` takes a flat dict, so a helper like `_flatten` turns `{"optimizer": {"lr": 1e-3}}` into `{"optimizer.lr": "0.001"}` — values are coerced to strings. Log the **resolved** config after any CLI overrides or hydra composition, not the raw file on disk, so the stored params match what actually ran. If the working tree is dirty at training time, either commit first or log `git status --porcelain` output as a tag so you can tell the logged commit is not the whole story. Keep the returned `run_id` — it is the primary key you will use to find this run in the UI, export its metadata, register its model later, or reference it from a downstream evaluation run via `mlflow.set_tag("parent_run_id", ...)`.
104
+
105
+ ### Run comparison and selection
106
+
107
+ Open the MLflow UI at `http://localhost:5000`. The three views that earn their keep:
108
+
109
+ - **Run list** — sort by `metrics.val_auc` or filter by `tags.git_commit = "<sha>"`. Tag filters are the fastest way to find "the runs I launched from this branch." Sort by columns to see the run_id of your best-performing experiment, then click through for the full picture.
110
+ - **Parallel coordinates plot** — select several runs, switch to the parallel coordinates view, and see which hyperparameters correlate with your target metric. This is the view that turns dozens of runs into a readable pattern — hover a line to see the full config, drag axes to filter a band, and the plot re-paints to show only the runs that meet your criterion.
111
+ - **Metric plot** — overlay `val_loss` across selected runs to spot overfitting (train loss drops, val loss rises), bad seeds (wildly different trajectories with the same config), or early-stopping candidates (val metric plateaued ten epochs before training ended).
112
+
113
+ You can also query programmatically when the UI's filters are not expressive enough:
114
+
115
+ ```python
116
+ from mlflow.tracking import MlflowClient
117
+ import pandas as pd
118
+
119
+ client = MlflowClient()
120
+ runs = client.search_runs(
121
+ experiment_ids=[client.get_experiment_by_name("churn-baseline").experiment_id],
122
+ filter_string="metrics.val_auc > 0.82 and tags.dataset_version = '2026-03'",
123
+ order_by=["metrics.val_auc DESC"],
124
+ max_results=20,
125
+ )
126
+ df = pd.DataFrame([{
127
+ "run_id": r.info.run_id,
128
+ "name": r.info.run_name,
129
+ "val_auc": r.data.metrics.get("val_auc"),
130
+ "lr": r.data.params.get("optimizer.lr"),
131
+ } for r in runs])
132
+ ```
133
+
134
+ When you have a winner, export its config back into the repo for a clean retrain:
135
+
136
+ ```python
137
+ run = client.get_run("<run_id>")
138
+ client.download_artifacts(run.info.run_id, "resolved.yaml", dst_path="configs/")
139
+ ```
140
+
141
+ Commit the exported config so "run 47's exact recipe" becomes a file in git, not a memory and not a database row that lives only on your laptop.
142
+
143
+ ### Weights & Biases as alternative
144
+
145
+ Weights & Biases is the polished cloud alternative. It has a richer UI, built-in system metric logging (GPU, memory, temperature), gradient histograms via `wandb.watch(model, log="gradients")`, media logging (images, audio, tables, confusion matrices rendered inline), and better collaboration features — named reports, shared dashboards, thread-style comments. For a small team that has already decided it is comfortable with cloud storage, W&B removes the "who runs the server" question entirely, and the onboarding for a new teammate is `pip install wandb && wandb login`.
146
+
147
+ The instrumentation shape is familiar:
148
+
149
+ ```python
+ import wandb
+ wandb.init(project="churn-baseline", name=cfg["experiment"]["name"],
+            config=cfg, tags=["baseline", "v2-features"])
+ for epoch in range(cfg["training"]["epochs"]):
+     # val_auc and loss are placeholders for values computed by your training step
+     wandb.log({"epoch": epoch, "val/auc": val_auc, "train/loss": loss})
+ wandb.finish()
+ ```
157
+
158
+ Two things to weigh before picking it over MLflow for a DS-1 setup. First, the free tier caps private projects, artifact retention, and seats: fine for a solo experimenter, worth pricing out before a team commits. Second, you are shipping your experiment metadata (and potentially artifacts) to a third party, which matters if your dataset values, config parameters, or run names might accidentally encode something sensitive. MLflow self-hosted stays on your laptop; W&B lives in someone else's cloud. If your org has a data-residency or vendor-review process, self-hosted MLflow sidesteps it because nothing leaves your machine.
159
+
160
+ ### Graduating from solo to shared
161
+
162
+ The path from "SQLite on my laptop" to "shared team tracker" is short and deliberately low-risk, because your instrumentation already speaks the MLflow protocol:
163
+
164
+ 1. **Stand up a shared MLflow server** behind your internal network — Postgres backend, S3 or equivalent object store for artifacts, authentication in front (oauth2-proxy, nginx basic auth, or a cloud load balancer).
165
+ 2. **Flip the tracking URI** in each project's `.envrc` to point at the shared server. No code changes.
166
+ 3. **Optionally backfill historic runs** using `mlflow artifacts download` plus `search_runs`, then re-log to the shared server — only worth it for runs you want the team to see.
167
+ 4. **Keep the local server configured** for offline or air-gapped work — you can still set `MLFLOW_TRACKING_URI=file:./mlruns` for quick local-only iteration when the shared server is down or you are on a plane.
168
+
169
+ This is why the self-hosted local setup is the right default even if you know you will eventually run a team server: the instrumentation you write today is exactly the instrumentation that will talk to production tomorrow.
170
+
171
+ ### Nested runs for sweeps and evaluations
172
+
173
+ When you run a small hyperparameter sweep from a laptop — a few learning rates, two or three seeds — use MLflow's nested runs rather than one flat run per trial. A parent run captures the sweep-level config and best-metric summary; each trial becomes a child with its own params and metrics:
174
+
175
+ ```python
+ with mlflow.start_run(run_name="lr-sweep") as parent:
+     mlflow.log_param("sweep_type", "grid")
+     best_auc = 0.0
+     for lr in [1e-4, 3e-4, 1e-3, 3e-3]:
+         with mlflow.start_run(run_name=f"lr={lr}", nested=True) as child:
+             mlflow.log_param("lr", lr)
+             val_auc = train_one(lr)
+             mlflow.log_metric("val_auc", val_auc)
+             best_auc = max(best_auc, val_auc)
+     mlflow.log_metric("best_val_auc", best_auc)
+ ```
187
+
188
+ In the UI, the parent shows an expandable tree of children, which keeps the run list navigable once you have hundreds of rows. For larger sweeps, pair MLflow with Optuna (`mlflow.start_run(nested=True)` inside an `objective` function) — you get Bayesian search on top of MLflow's persistence.
189
+
190
+ ### Hygiene
191
+
192
+ **Never log PII.** Experiment metadata and artifacts are easy to share with collaborators, screenshot into a ticket, or accidentally export to a cloud UI when you later migrate to W&B or a shared MLflow. Hyperparameter values, metric names, run names, tags, and artifact contents must all be free of customer identifiers, emails, names, or raw records. If a config carries a data path, keep it at the dataset level (`data/processed/churn_2026_03.parquet`) — never at the row or user level (`data/users/alice@example.com/history.parquet`). Evaluation reports that include example predictions must redact any PII in the input columns before being logged as artifacts. See `data-science-security.md` for the broader no-PII rules that apply across the whole DS workflow.
193
+
194
+ **Never commit the tracking store.** `mlflow.db`, `mlflow.db-journal`, `mlartifacts/`, and `mlruns/` all belong in `.gitignore`. They are large (easily hundreds of MB once you log a few model checkpoints), machine-local (paths and timestamps are specific to your laptop), and not reproducible from git. If you need a teammate to see a run, export the specific artifacts you want to share via `mlflow artifacts download --run-id <id>` or by running a shared tracking server — do not push the whole store into the repo and do not email `mlflow.db` as an attachment.
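One way to back the no-PII rule with a cheap guard is to scan params and tags for email-shaped strings before logging them. The sketch below is an assumption-laden illustration: `find_email_like` and `EMAIL_RE` are hypothetical names, the regex only catches email-like tokens, and it is no substitute for a real PII review.

```python
import re

# Deliberately simple pattern: catches email-shaped tokens, nothing more
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_email_like(values: dict) -> list:
    """Return the keys whose name or stringified value contains an email-shaped token."""
    return [
        key for key, value in values.items()
        if EMAIL_RE.search(str(key)) or EMAIL_RE.search(str(value))
    ]


# The two data paths used as examples in the paragraph above
safe = {"data.path": "data/processed/churn_2026_03.parquet"}
unsafe = {"data.path": "data/users/alice@example.com/history.parquet"}

safe_hits = find_email_like(safe)
unsafe_hits = find_email_like(unsafe)
```

A caller could refuse to log (or raise) whenever the returned list is non-empty, for example right before `mlflow.log_params`.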
@@ -0,0 +1,160 @@
1
+ ---
2
+ name: data-science-model-evaluation
3
+ description: Honest model evaluation for solo/small-team DS — metric choice, one-shot holdout, cross-validation, calibration, and error slicing with sklearn
4
+ topics: [data-science, evaluation, sklearn, cross-validation, calibration]
5
+ ---
6
+
7
+ Every solo DS project produces a moment where a notebook prints `0.92 accuracy` on a test set and the author quietly believes the model works. Then it ships — and recall on the minority class is 0.12, the probabilities are miscalibrated, and a single region drives half the error. Evaluation discipline is the only thing separating a model that works from a model that looked like it worked on a single split. At solo scale you do not have an ML platform team checking your work, which makes the discipline entirely your responsibility.
8
+
9
+ ## Summary
10
+
11
+ Match the metric to the business question: do not report accuracy on an imbalanced label, do not report RMSE when one outlier dominates the loss. Split the data once, use cross-validation on the training portion for model selection, and touch the holdout exactly once at the end. If downstream decisions consume probabilities (thresholding, expected value, stacking), check calibration — a 0.9 ROC-AUC model can still output probabilities that are wildly overconfident. Always slice errors by meaningful subgroups (region, bucket, cohort); aggregate metrics hide the failures that matter.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Picking the right metric
16
+
17
+ Metric choice is a business decision, not a math decision. The right starting question is: "what does a false positive cost, and what does a false negative cost?" If the two are comparable and the classes are balanced, accuracy is fine. Once the costs diverge — or the base rate is skewed — accuracy becomes actively misleading.
18
+
19
+ A small rubric for classification:
20
+
21
+ - **Balanced binary classification**: accuracy is fine.
22
+ - **Imbalanced binary (fraud, churn, rare disease)**: precision / recall / F1, and PR-AUC over ROC-AUC (ROC-AUC flatters models on heavy class imbalance).
23
+ - **Ranking / thresholding later**: `roc_auc_score` measures order, not calibration.
24
+ - **Decisions that consume probabilities**: `log_loss` or Brier score — rewards calibrated confidence, punishes overconfident mistakes.
25
+ - **Multi-class**: `classification_report` for per-class precision/recall, and pick `average="macro"` (equal weight per class) vs `"weighted"` (weight by support) deliberately.
26
+
27
+ And for regression:
28
+
29
+ - **Magnitude matters**: RMSE (penalizes large errors quadratically).
30
+ - **Outliers you do not want to chase**: MAE (robust to a few extreme points).
31
+ - **Explained variance / reporting to stakeholders**: R².
32
+ - **Relative error across scales**: MAPE, but guard against zeros in the denominator.
33
+
34
+ ```python
35
+ from sklearn.metrics import classification_report, roc_auc_score, log_loss
36
+
37
+ y_proba = model.predict_proba(X_test)[:, 1]
38
+ y_pred = (y_proba >= 0.5).astype(int)
39
+
40
+ print(classification_report(y_test, y_pred, digits=3))
41
+ print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
42
+ print(f"log-loss: {log_loss(y_test, y_proba):.3f}")
43
+ ```
44
+
45
+ Report at least one threshold-free metric (ROC-AUC or PR-AUC) and one threshold-dependent metric (precision/recall at your operating point). Reporting only accuracy on a 95/5 class split is the canonical way to lie to yourself — the "always predict no" baseline gets 0.95 without a model.
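The 95/5 claim is easy to verify by hand. The following sketch uses plain Python rather than sklearn so the arithmetic stays visible; the `accuracy` and `recall` helpers are illustrative stand-ins, not library functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the predictions recover."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# 5 positives, 95 negatives, and a "model" that always predicts the negative class
y_true = [1] * 5 + [0] * 95
y_always_no = [0] * 100

baseline_accuracy = accuracy(y_true, y_always_no)
baseline_recall = recall(y_true, y_always_no)
```

The baseline scores 0.95 accuracy while finding none of the positives, which is why a threshold-dependent metric on the minority class belongs next to any headline number.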
46
+
47
+ ### Holdout discipline
48
+
49
+ Split once, at the top of the notebook, before any exploration on the target:
50
+
51
+ ```python
52
+ from sklearn.model_selection import train_test_split
53
+
54
+ X_train, X_test, y_train, y_test = train_test_split(
55
+ X, y,
56
+ test_size=0.2,
57
+ random_state=42,
58
+ stratify=y, # preserve class balance for classification
59
+ )
60
+ ```
61
+
62
+ Rules:
63
+
64
+ 1. The test set is touched exactly once — at the end, for the final number you report.
65
+ 2. All model selection, feature engineering decisions, and hyperparameter tuning happen on `X_train` (via cross-validation).
66
+ 3. If you peek at test performance and then change the model, the test set is contaminated. Either live with the contamination and note it, or collect a new holdout.
67
+ 4. Fit preprocessing (`StandardScaler`, `OneHotEncoder`, imputers) on train only, then apply to test — wrap it in a `Pipeline` so you cannot leak by accident.
68
+
69
+ ### Cross-validation for model selection
70
+
71
+ Use cross-validation on the training set to compare models and pick hyperparameters. This gives you a mean and standard deviation, so you can see whether model A actually beats model B or is one lucky fold away.
72
+
73
+ ```python
74
+ from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
75
+ from sklearn.ensemble import RandomForestClassifier
76
+
77
+ cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
78
+
79
+ scores = cross_val_score(
80
+ RandomForestClassifier(n_estimators=200, random_state=42),
81
+ X_train, y_train,
82
+ cv=cv,
83
+ scoring="roc_auc",
84
+ )
85
+ print(f"CV ROC-AUC: {scores.mean():.3f} ± {scores.std():.3f}")
86
+
87
+ grid = GridSearchCV(
88
+ RandomForestClassifier(random_state=42),
89
+ param_grid={"max_depth": [4, 8, None], "min_samples_leaf": [1, 5, 20]},
90
+ cv=cv,
91
+ scoring="roc_auc",
92
+ n_jobs=-1,
93
+ )
94
+ grid.fit(X_train, y_train)
95
+ ```
96
+
97
+ Use `StratifiedKFold` for classification (preserves class balance per fold) and plain `KFold` for regression. **For any data with a time dimension, use `TimeSeriesSplit` instead** — random folds leak future information into the training set and will make your model look dramatically better offline than it is in production.
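To make the leakage point concrete, here is a hand-rolled expanding-window splitter. It is a sketch of the idea behind time-ordered splitting, not sklearn's implementation; in a real project `TimeSeriesSplit` is the tool to reach for. The function name and fold-sizing rule are assumptions chosen for illustration.

```python
def expanding_window_splits(n_samples: int, n_splits: int):
    """Yield (train_idx, test_idx) pairs where every training row precedes every test row."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test = n_samples - n_splits * fold
    for start in range(first_test, n_samples, fold):
        # Train on everything before `start`, test on the next `fold` rows
        yield list(range(start)), list(range(start, start + fold))


# Rows are assumed to be sorted by time already
splits = list(expanding_window_splits(n_samples=8, n_splits=3))
```

Each fold trains only on the past and tests on the block that follows, so no future row can inform a prediction about an earlier one.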
98
+
99
+ ### Calibration
100
+
101
+ A model with a great ROC-AUC can still output badly calibrated probabilities — random forests and boosted trees are both notorious for this. If downstream code takes `predict_proba` output and uses it as a probability (expected-value calculations, threshold tuning based on cost, stacking, active learning), calibration matters at least as much as discrimination.
102
+
103
+ ```python
104
+ from sklearn.calibration import calibration_curve, CalibratedClassifierCV
105
+ import matplotlib.pyplot as plt
106
+
107
+ prob_true, prob_pred = calibration_curve(y_test, y_proba, n_bins=10, strategy="quantile")
108
+
109
+ plt.plot(prob_pred, prob_true, marker="o")
110
+ plt.plot([0, 1], [0, 1], "--", color="gray")
111
+ plt.xlabel("Predicted probability"); plt.ylabel("Observed frequency")
112
+ ```
113
+
114
+ A well-calibrated model tracks the diagonal. If the curve sags below it you are overconfident; if it bulges above, underconfident. Fix with `CalibratedClassifierCV(method="isotonic")` (flexible, needs more data) or `method="sigmoid"` (Platt scaling, works with ~1k examples). Fit calibration on a held-out slice of the training set — never on the test set.
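What the calibration curve computes can be sketched in a few lines of plain Python. This version uses equal-width bins (the snippet above uses sklearn's quantile strategy), and `reliability_bins` is a hypothetical helper written only for illustration.

```python
def reliability_bins(y_true, y_proba, n_bins=5):
    """Return (mean predicted, observed frequency, count) for each non-empty equal-width bin."""
    buckets = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, y_proba):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        buckets[idx].append((p, label))
    rows = []
    for bucket in buckets:
        if bucket:
            mean_pred = sum(p for p, _ in bucket) / len(bucket)
            observed = sum(label for _, label in bucket) / len(bucket)
            rows.append((mean_pred, observed, len(bucket)))
    return rows


# Toy data: the model says 0.9 for four rows, but only one of them is positive
y_true = [1, 0, 0, 0, 1, 1, 0, 1]
y_proba = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
rows = reliability_bins(y_true, y_proba)
```

In the high-probability bin the model predicts about 0.9 while the observed frequency is 0.25, a point well below the diagonal and therefore overconfident.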
115
+
116
+ ### Error analysis and slicing
117
+
118
+ Overall metrics hide systematic failures. A pandas `groupby` on the predictions is usually enough:
119
+
120
+ ```python
121
+ import pandas as pd
122
+
123
+ eval_df = pd.DataFrame({
124
+ "y_true": y_test,
125
+ "y_pred": y_pred,
126
+ "y_proba": y_proba,
127
+ "region": X_test["region"].values,
128
+ "age_bucket": pd.cut(X_test["age"], bins=[0, 25, 45, 65, 120]),
129
+ })
130
+ eval_df["correct"] = eval_df["y_true"] == eval_df["y_pred"]
131
+
132
+ print(eval_df.groupby("region")["correct"].agg(["mean", "count"]))
133
+ print(eval_df.groupby("age_bucket")["correct"].agg(["mean", "count"]))
134
+ ```
135
+
136
+ Look for slices where the metric is materially worse than overall AND the slice has enough examples to be real (set a floor like n ≥ 50). Those are your debugging targets before shipping.
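The "materially worse and large enough" filter can be written down explicitly. The sketch below uses plain Python; `weak_slices`, the 0.05 margin, and the toy regions are assumptions for illustration, and only the n ≥ 50 floor comes from the text above.

```python
from collections import defaultdict


def weak_slices(records, key, min_count=50, margin=0.05):
    """Flag slices whose accuracy trails the overall accuracy by more than `margin`.

    Slices with fewer than `min_count` rows are skipped as too small to trust.
    """
    overall = sum(r["correct"] for r in records) / len(records)
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["correct"])
    flagged = {}
    for name, outcomes in groups.items():
        acc = sum(outcomes) / len(outcomes)
        if len(outcomes) >= min_count and acc < overall - margin:
            flagged[name] = (acc, len(outcomes))
    return flagged


def make_rows(region, n_total, n_correct):
    """Build toy evaluation rows: the first `n_correct` are right, the rest wrong."""
    return [{"region": region, "correct": i < n_correct} for i in range(n_total)]


records = make_rows("north", 100, 90) + make_rows("south", 60, 36) + make_rows("west", 10, 2)
flagged = weak_slices(records, key="region")
```

Here "south" is flagged (accuracy 0.6 on 60 rows), while "west" is ignored despite its poor accuracy because 10 rows fall under the floor.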
137
+
138
+ **Fairness note**: slicing by sensitive attributes (age, gender, region, race where legally permitted) surfaces disparate impact. This is a minimum floor — if you ship models that affect people, read a proper fairness reference (Barocas/Hardt/Narayanan "Fairness and Machine Learning") rather than treating a groupby as the whole story.
139
+
140
+ ### What NOT to do
141
+
142
+ - **Do not tune on the test set.** Every time you look at a test number and change the model, you are fitting to the test set in slow motion. The `GridSearchCV` call above uses CV on the train set specifically to avoid this.
143
+ - **Do not cherry-pick a random seed.** If the model only wins with `random_state=7`, it does not actually win. Run with 3–5 different seeds and report the spread if you suspect the result is seed-fragile.
144
+ - **Do not report only the best fold.** Report mean and std across CV folds. A model with 0.85 ± 0.12 is not better than 0.82 ± 0.02 — the first one is one unlucky fold away from losing.
145
+ - **Do not ship without a trivial baseline.** Compare against predicting the majority class (classification) or the training mean (regression). If your fancy model cannot beat that, the problem is the data or the label, not the model.
146
+ - **Do not evaluate on preprocessed-then-split data.** Fit the scaler, encoder, or imputer on train only, then transform test. Anything else is leakage and will inflate your offline numbers.
147
+ - **Do not change the metric after seeing the results.** Pick the metric before training, based on the business question, and stick with it. Swapping from precision to ROC-AUC because one looked nicer is a cousin of p-hacking.
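The rules about seeds, folds, and baselines all reduce to comparing means against spreads. A minimal sketch with the standard library follows; the `clearly_better` decision rule and the score lists are assumptions made up for illustration, not a formal statistical test.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Mean and population standard deviation, matching NumPy's default `.std()`."""
    return mean(scores), pstdev(scores)


def clearly_better(candidate, reference):
    """True when the candidate's mean lead exceeds the larger of the two spreads."""
    cand_mean, cand_std = summarize(candidate)
    ref_mean, ref_std = summarize(reference)
    return (cand_mean - ref_mean) > max(cand_std, ref_std)


# Hypothetical per-fold (or per-seed) scores
noisy = [0.97, 0.73, 0.91, 0.70, 0.94]    # mean 0.85, wide spread
steady = [0.82, 0.80, 0.84, 0.83, 0.81]   # mean 0.82, tight spread
majority_baseline = [0.50] * 5

noisy_beats_steady = clearly_better(noisy, steady)
steady_beats_baseline = clearly_better(steady, majority_baseline)
```

The noisy model's 0.03 lead is swamped by its own spread, so it does not count as a win, while the steady model clears the trivial baseline by a wide margin.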
148
+
149
+ ## Minimum evaluation checklist
150
+
151
+ Before calling a model "done" at solo scale, every item below should be true:
152
+
153
+ 1. Metric is chosen to match the business cost of errors, documented in the notebook or readme.
154
+ 2. Data was split once with a fixed `random_state`, stratified for classification or split temporally for time series.
155
+ 3. All preprocessing lives inside a `Pipeline` and is fit on train only.
156
+ 4. Model selection was done with cross-validation on the training set, with mean ± std reported per candidate.
157
+ 5. At least one trivial baseline was beaten by a margin larger than the CV standard deviation.
158
+ 6. Test set was evaluated exactly once, at the end, and that number is what you report.
159
+ 7. If `predict_proba` is consumed downstream, a calibration curve was inspected and the model was recalibrated if needed.
160
+ 8. Errors were sliced by at least one meaningful business dimension, and any slice with materially worse metrics is either fixed or explicitly noted as a known limitation.
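
Items 4 and 5 of the checklist can be checked mechanically. The sketch below assumes you already have per-fold scores for the model and for a trivial baseline; the function names and the score lists are illustrative, and the choice of the sample standard deviation is an assumption rather than something the checklist prescribes.

```python
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) across CV folds."""
    return mean(scores), stdev(scores)

def beats_baseline(model_scores, baseline_scores):
    """True if the model's mean beats the baseline's mean by more than the model's CV std."""
    model_mean, model_std = summarize(model_scores)
    baseline_mean, _ = summarize(baseline_scores)
    return (model_mean - baseline_mean) > model_std

# Hypothetical fold scores, for illustration only.
stable_model = [0.84, 0.86, 0.83, 0.85, 0.87]
noisy_model = [0.97, 0.73, 0.85, 0.99, 0.71]
weak_baseline = [0.70, 0.71, 0.69, 0.70, 0.70]
strong_baseline = [0.80, 0.80, 0.80, 0.80, 0.80]

print(beats_baseline(stable_model, weak_baseline))
print(beats_baseline(noisy_model, strong_baseline))
```

Both hypothetical models average 0.85, but only the stable one clears its baseline by more than its own fold-to-fold spread.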
@@ -0,0 +1,170 @@
1
+ ---
2
+ name: data-science-notebook-discipline
3
+ description: Notebook discipline for reproducible data science — Marimo as primary, Jupyter plus jupytext as fallback, promoting working cells to tested modules
4
+ topics: [data-science, notebooks, marimo, jupyter, reproducibility]
5
+ ---
6
+
7
+ Every data scientist has shipped a notebook that "worked for me in a session" and then produced different numbers the next morning — or worse, different numbers in a colleague's environment or a production run. The usual cause is not a bug in the code; it is hidden state. Jupyter cells can be executed in any order, re-run selectively, or silently depend on variables that were defined in a cell that has since been edited or deleted. The kernel's in-memory state becomes the real program, and the `.ipynb` file is just a partial, sometimes misleading, transcript. For solo and small-team DS work, this is the single biggest source of "it worked yesterday" pain, and it is entirely avoidable with the right tooling and habits.
8
+
9
+ ## Summary
10
+
11
+ Use **Marimo** as your primary notebook tool: the file format is pure `.py` (git-diffable), execution is reactive (editing a cell re-runs its downstream dependents automatically), and there is no hidden-cell-order hazard by construction. When you cannot switch (existing Jupyter investment, team inertia, library widgets that only work in classic Jupyter), pair every `.ipynb` with a `.py` via **jupytext** and commit the `.py`. Either way, the key discipline is promotion: when a cell works, extract it to `src/<module>.py`, write a test, and import it back. Run finished notebooks as pipelines with `python notebook.py` (Marimo) or `papermill` (Jupyter); `marimo run` serves a notebook as a read-only app rather than executing it as a batch job.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### The hidden-state problem
16
+
17
+ Classic Jupyter lets you execute cells in any order. Consider this sequence:
18
+
19
+ 1. Cell A defines `df = pd.read_csv("raw.csv")`.
20
+ 2. Cell B defines `df = df.dropna()`.
21
+ 3. You run A, then B, then A again.
22
+ 4. `df` is now the raw frame, but cell B's saved output still shows the cleaned version, and the outputs of any downstream cells that already ran were computed from the cleaned `df`.
23
+
24
+ Nothing about the notebook on disk reveals this inconsistency. "Restart kernel and run all" is the only way to prove a notebook is reproducible, and most DS workflows skip that step for months at a time. Outputs are cached in the `.ipynb`, so a reader sees plausible numbers and has no signal that the state is corrupt. This is **hidden state** — the kernel's memory diverges from the code as written, and the notebook lies about what it computed.
25
+
26
+ Second-order effects make it worse: merge conflicts on `.ipynb` JSON are unreadable; diffs show base64 image blobs; collaborators re-run cells in different orders and get different results. The notebook as a unit of collaboration is broken unless you impose discipline from outside.
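
The A, B, A sequence described in this section can be reproduced without Jupyter. This is a toy simulation, not how the kernel is implemented: a dict stands in for kernel memory, another dict stands in for the outputs saved in the `.ipynb`, and a plain list replaces the DataFrame.

```python
namespace = {}   # stands in for the kernel's in-memory state
displayed = {}   # stands in for the outputs saved in the .ipynb

# Two "cells" that mirror the example: A loads raw data, B drops missing values.
cells = {
    "A": "df = [1, None, 3]",
    "B": "df = [x for x in df if x is not None]",
}

def run(cell_id):
    """Execute one cell against the shared namespace and record its output."""
    exec(cells[cell_id], namespace)
    displayed[cell_id] = list(namespace["df"])

# Click order from the example: A, then B, then A again.
for cell_id in ["A", "B", "A"]:
    run(cell_id)

print(namespace["df"])   # [1, None, 3]  <- what the kernel actually holds
print(displayed["B"])    # [1, 3]        <- what cell B's output still shows
```

Nothing in `cells` (the code on disk) records that A ran last, which is exactly the gap between the transcript and the real program.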
27
+
28
+ ### Marimo as primary
29
+
30
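+ [Marimo](https://marimo.io) is a reactive Python notebook that solves hidden state at the architecture level. Each notebook is a pure `.py` file; cells form a dependency graph based on the variables each cell defines and reads; when you edit a cell, Marimo re-runs all of its dependents automatically. The displayed state cannot drift from what the code computes because of stale execution order, since the runtime enforces topological order on every edit. One caveat: Marimo tracks variable definitions, not in-place mutations, so mutating an object in a different cell from the one that created it is not detected.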
31
+
32
+ A minimal Marimo notebook looks like this — note it is ordinary Python you can read in any editor:
33
+
34
+ ```python
+ # notebook.py
+ import marimo
+
+ app = marimo.App()
+
+ @app.cell
+ def __():
+     import marimo as mo
+     import pandas as pd
+     df = pd.read_csv("data/raw.csv")
+     return df, mo
+
+ @app.cell
+ def __(df, mo):
+     clean = df.dropna()
+     mo.md(f"Rows: **{len(clean)}**")  # last expression is the cell's output
+     return (clean,)
+
+ if __name__ == "__main__":
+     app.run()
51
+ ```
52
+
53
+ Key commands:
54
+
55
+ - `marimo edit notebook.py` — opens the reactive editor in your browser
56
+ - `marimo run notebook.py` — serves the notebook as a read-only web app (great for stakeholders)
57
+ - `marimo export html notebook.py -o out.html` — static HTML snapshot for reports
58
+
59
+ Because the file is `.py`, `git diff` shows real code changes. Code review on a Marimo notebook works the same as code review on any Python file. There are no output cells to strip, no JSON diffs to parse.
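
To make the reactive model in this section concrete, here is a rough sketch of dependency-ordered re-execution using only the standard library. It is not Marimo's implementation; the `cells` table and `dependents_in_order` are hypothetical, and the sketch assumes each variable is defined by exactly one cell.

```python
from graphlib import TopologicalSorter

# Hypothetical notebook: cell id -> (names it defines, names it reads).
cells = {
    "load": ({"df"}, set()),
    "clean": ({"clean"}, {"df"}),
    "plot": ({"fig"}, {"clean"}),
    "notes": ({"text"}, set()),
}

def dependents_in_order(edited):
    """Return the edited cell plus everything downstream of it, in run order."""
    # Map each variable to the single cell that defines it.
    definer = {name: cid for cid, (defs, _) in cells.items() for name in defs}
    # A cell depends on the cells that define the names it reads.
    graph = {cid: {definer[name] for name in refs} for cid, (_, refs) in cells.items()}
    order = list(TopologicalSorter(graph).static_order())
    stale = {edited}
    for cid in order:
        # Walking in topological order makes staleness propagate transitively.
        if graph[cid] & stale:
            stale.add(cid)
    return [cid for cid in order if cid in stale]

print(dependents_in_order("load"))   # ['load', 'clean', 'plot']
print(dependents_in_order("clean"))  # ['clean', 'plot']
```

The `notes` cell never re-runs because it reads nothing that changed, which is the property that keeps reactive notebooks fast as well as consistent.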
60
+
61
+ ### Jupyter plus jupytext fallback
62
+
63
+ When Marimo is not an option — you depend on a Jupyter-only widget, you share notebooks with non-Marimo users, or your infrastructure is built around `.ipynb` — use **jupytext** to pair each `.ipynb` with a `.py` representation. Install jupytext, then configure pairing at the repo root:
64
+
65
+ ```toml
66
+ # .jupytext.toml
67
+ formats = "ipynb,py:percent"
68
+ notebook_metadata_filter = "-all"
69
+ cell_metadata_filter = "-all"
70
+ ```
71
+
72
+ Or pair a single notebook explicitly:
73
+
74
+ ```bash
75
+ jupytext --set-formats ipynb,py:percent notebooks/eda.ipynb
76
+ ```
77
+
78
+ The `py:percent` format splits cells with `# %%` markers and produces a clean, diffable Python file. Rule of thumb for the repo:
79
+
80
+ - **Commit** the `.py` version — it is the source of truth for review and diffs
81
+ - **Gitignore** the `.ipynb` (or commit it with `nbstripout` installed to strip outputs; see the data-science-security doc for the outputs-as-secrets angle)
82
+ - **Do not** try to keep both hand-edited: once paired, jupytext writes both files whenever you save in Jupyter, and `jupytext --sync` reconciles edits made outside it
83
+
84
+ This does not fix hidden state (Jupyter still runs cells in click-order), but it does make review and merges sane, and it gives you a textual artifact that survives kernel-state bugs.
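
For orientation, a paired `py:percent` file is ordinary Python with `# %%` cell markers. The sketch below shows what a hypothetical `notebooks/eda.py` might contain; the cell contents and data are made up for illustration.

```python
# %% [markdown]
# # Exploratory analysis
# Markdown cells are stored as comments under a `[markdown]` marker.

# %%
readings = [3.2, None, 4.8, None, 5.0]  # illustrative data, not from a real dataset

# %%
clean = [r for r in readings if r is not None]
print(f"Rows after dropping missing values: {len(clean)}")
```

Because the markers are comments, the file runs unchanged with `python notebooks/eda.py` and diffs like any other source file.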
85
+
86
+ ### Promotion: notebook to src to test to re-import
87
+
88
+ The most important habit in any notebook workflow — Marimo or Jupyter — is **promotion**. The moment a cell does real work, extract it to a tested module and import it back.
89
+
90
+ Before (inline in the notebook, untested, untyped):
91
+
92
+ ```python
+ @app.cell
+ def __(df):
+     df["hour"] = pd.to_datetime(df["ts"]).dt.hour
+     df["is_weekend"] = pd.to_datetime(df["ts"]).dt.dayofweek >= 5
+     df["log_amount"] = np.log1p(df["amount"])
+     return (df,)
99
+ ```
100
+
101
+ After — extract to `src/features/engineer.py`:
102
+
103
+ ```python
+ # src/features/engineer.py
+ import numpy as np
+ import pandas as pd
+
+ def add_time_features(df: pd.DataFrame, ts_col: str = "ts") -> pd.DataFrame:
+     """Add hour and is_weekend columns derived from a timestamp column."""
+     out = df.copy()
+     ts = pd.to_datetime(out[ts_col])
+     out["hour"] = ts.dt.hour
+     out["is_weekend"] = ts.dt.dayofweek >= 5
+     return out
+
+ def add_log_amount(df: pd.DataFrame, amount_col: str = "amount") -> pd.DataFrame:
+     out = df.copy()
+     out["log_amount"] = np.log1p(out[amount_col])
+     return out
120
+ ```
121
+
122
+ Write a test — small, fast, no data dependency:
123
+
124
+ ```python
+ # tests/features/test_engineer.py
+ import pandas as pd
+ from src.features.engineer import add_time_features
+
+ def test_weekend_flag_friday_vs_saturday():
+     df = pd.DataFrame({"ts": ["2026-04-17 10:00", "2026-04-18 10:00"]})
+     out = add_time_features(df)
+     assert out["is_weekend"].tolist() == [False, True]
133
+ ```
134
+
135
+ Re-import in the notebook:
136
+
137
+ ```python
+ @app.cell
+ def __(df):
+     from src.features.engineer import add_time_features, add_log_amount
+     # Marimo allows only one cell to define a given name, so bind the result to a new variable
+     features = add_log_amount(add_time_features(df))
+     return (features,)
143
+ ```
144
+
145
+ The notebook becomes a thin orchestration + visualization layer over tested modules. Hidden state matters less because the logic lives in files that are exercised by CI. Pull requests become reviewable — the reviewer reads typed functions with tests, not a wall of chained DataFrame mutations.
146
+
147
+ ### Running notebooks as pipelines
148
+
149
+ Finished notebooks often need to run on a schedule — daily reports, weekly retraining, monthly audits. Do not copy-paste the code into a script; run the notebook directly.
150
+
151
+ **Marimo**: because the file is already Python, you can run it as a script or as an app:
152
+
153
+ ```bash
154
+ marimo run notebook.py # serve as web app
155
+ python notebook.py # run the cells in dependency order as a plain script (needs the `app.run()` guard)
156
+ marimo export html notebook.py -o out.html # produce a static report artifact
157
+ ```
158
+
159
+ **Jupyter**: use `papermill` to parameterize and execute an `.ipynb`, producing an executed output notebook:
160
+
161
+ ```bash
+ papermill notebooks/weekly_report.ipynb \
+     outputs/report_$(date +%Y%m%d).ipynb \
+     -p start_date 2026-04-14 \
+     -p end_date 2026-04-20
166
+ ```
167
+
168
+ Papermill looks for the cell tagged `parameters` in the notebook and injects a new cell with the overriding values immediately after it, so the tagged cell holds the defaults. Use Marimo's `mo.cli_args()` for the equivalent in Marimo. Either way, pair this with a lightweight scheduler (cron, GitHub Actions, Airflow, Prefect): the notebook is the unit of work, not a script that tries to re-implement it.
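
As a sketch of what the parameterized notebook might look like on the inside, the first cell below would carry the `parameters` tag and hold defaults matching the `-p` flags in the command above, and the second cell consumes them. The validation step and the `window_days` name are illustrative assumptions, not part of papermill.

```python
# %% tags=["parameters"]
# Defaults only: papermill overrides these with the values passed via -p.
start_date = "2026-04-14"
end_date = "2026-04-20"

# %%
from datetime import date

# Parse early so a malformed parameter fails at the top of the run, not mid-report.
start = date.fromisoformat(start_date)
end = date.fromisoformat(end_date)
if end < start:
    raise ValueError(f"end_date {end_date} is before start_date {start_date}")

window_days = (end - start).days + 1  # inclusive window length
print(f"Reporting window: {start} to {end} ({window_days} days)")
```

Keeping the defaults runnable means the notebook still executes interactively when no scheduler is supplying parameters.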
169
+
170
+ A useful rule: if a notebook is scheduled to run unattended, its logic should be ~90% imports from `src/` and ~10% glue. The promotion discipline from the previous section is what makes scheduled notebook runs trustworthy.