@zigrivers/scaffold 3.22.0 → 3.23.0
- package/README.md +21 -7
- package/content/knowledge/data-science/README.md +23 -0
- package/content/knowledge/data-science/data-science-architecture.md +163 -0
- package/content/knowledge/data-science/data-science-conventions.md +233 -0
- package/content/knowledge/data-science/data-science-data-versioning.md +198 -0
- package/content/knowledge/data-science/data-science-dev-environment.md +159 -0
- package/content/knowledge/data-science/data-science-experiment-tracking.md +194 -0
- package/content/knowledge/data-science/data-science-model-evaluation.md +160 -0
- package/content/knowledge/data-science/data-science-notebook-discipline.md +170 -0
- package/content/knowledge/data-science/data-science-observability.md +161 -0
- package/content/knowledge/data-science/data-science-project-structure.md +178 -0
- package/content/knowledge/data-science/data-science-reproducibility.md +164 -0
- package/content/knowledge/data-science/data-science-requirements.md +151 -0
- package/content/knowledge/data-science/data-science-security.md +151 -0
- package/content/knowledge/data-science/data-science-testing.md +183 -0
- package/content/knowledge/ml/README.md +10 -0
- package/content/methodology/data-science-overlay.yml +39 -0
- package/dist/config/schema.d.ts +672 -126
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +8 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +2 -2
- package/dist/config/schema.test.js.map +1 -1
- package/dist/config/validators/data-science.d.ts +4 -0
- package/dist/config/validators/data-science.d.ts.map +1 -0
- package/dist/config/validators/data-science.js +15 -0
- package/dist/config/validators/data-science.js.map +1 -0
- package/dist/config/validators/index.d.ts.map +1 -1
- package/dist/config/validators/index.js +2 -0
- package/dist/config/validators/index.js.map +1 -1
- package/dist/core/assembly/knowledge-loader.d.ts.map +1 -1
- package/dist/core/assembly/knowledge-loader.js +6 -0
- package/dist/core/assembly/knowledge-loader.js.map +1 -1
- package/dist/core/assembly/knowledge-loader.test.js +34 -0
- package/dist/core/assembly/knowledge-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.js +73 -0
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/project/adopt.d.ts.map +1 -1
- package/dist/project/adopt.js +3 -1
- package/dist/project/adopt.js.map +1 -1
- package/dist/project/detectors/coverage.test.d.ts +2 -0
- package/dist/project/detectors/coverage.test.d.ts.map +1 -0
- package/dist/project/detectors/coverage.test.js +78 -0
- package/dist/project/detectors/coverage.test.js.map +1 -0
- package/dist/project/detectors/data-science.d.ts +4 -0
- package/dist/project/detectors/data-science.d.ts.map +1 -0
- package/dist/project/detectors/data-science.js +32 -0
- package/dist/project/detectors/data-science.js.map +1 -0
- package/dist/project/detectors/data-science.test.d.ts +2 -0
- package/dist/project/detectors/data-science.test.d.ts.map +1 -0
- package/dist/project/detectors/data-science.test.js +62 -0
- package/dist/project/detectors/data-science.test.js.map +1 -0
- package/dist/project/detectors/disambiguate.d.ts +2 -0
- package/dist/project/detectors/disambiguate.d.ts.map +1 -1
- package/dist/project/detectors/disambiguate.js +3 -2
- package/dist/project/detectors/disambiguate.js.map +1 -1
- package/dist/project/detectors/disambiguate.test.js +10 -1
- package/dist/project/detectors/disambiguate.test.js.map +1 -1
- package/dist/project/detectors/index.d.ts.map +1 -1
- package/dist/project/detectors/index.js +2 -0
- package/dist/project/detectors/index.js.map +1 -1
- package/dist/project/detectors/library.d.ts.map +1 -1
- package/dist/project/detectors/library.js +1 -0
- package/dist/project/detectors/library.js.map +1 -1
- package/dist/project/detectors/resolve-detection.test.js +31 -0
- package/dist/project/detectors/resolve-detection.test.js.map +1 -1
- package/dist/project/detectors/types.d.ts +6 -2
- package/dist/project/detectors/types.d.ts.map +1 -1
- package/dist/project/detectors/types.js.map +1 -1
- package/dist/types/config.d.ts +8 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/copy/core.d.ts.map +1 -1
- package/dist/wizard/copy/core.js +4 -0
- package/dist/wizard/copy/core.js.map +1 -1
- package/dist/wizard/copy/data-science.d.ts +3 -0
- package/dist/wizard/copy/data-science.d.ts.map +1 -0
- package/dist/wizard/copy/data-science.js +15 -0
- package/dist/wizard/copy/data-science.js.map +1 -0
- package/dist/wizard/copy/index.d.ts.map +1 -1
- package/dist/wizard/copy/index.js +2 -0
- package/dist/wizard/copy/index.js.map +1 -1
- package/dist/wizard/copy/types.d.ts +5 -1
- package/dist/wizard/copy/types.d.ts.map +1 -1
- package/dist/wizard/copy/types.test-d.js +7 -0
- package/dist/wizard/copy/types.test-d.js.map +1 -1
- package/dist/wizard/questions.d.ts +2 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +9 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +14 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +1 -0
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
@@ -0,0 +1,160 @@
---
name: data-science-model-evaluation
description: Honest model evaluation for solo/small-team DS — metric choice, one-shot holdout, cross-validation, calibration, and error slicing with sklearn
topics: [data-science, evaluation, sklearn, cross-validation, calibration]
---

Every solo DS project reaches a moment where a notebook prints `0.92 accuracy` on a test set and the author quietly believes the model works. Then it ships, and recall on the minority class turns out to be 0.12, the probabilities are miscalibrated, and a single region drives half the error. Evaluation discipline is the only thing separating a model that works from a model that looked like it worked on a single split. At solo scale you do not have an ML platform team checking your work, which makes the discipline entirely your responsibility.

## Summary

Match the metric to the business question: do not report accuracy on an imbalanced label, and do not report RMSE when one outlier dominates the loss. Split the data once, use cross-validation on the training portion for model selection, and touch the holdout exactly once at the end. If downstream decisions consume probabilities (thresholding, expected value, stacking), check calibration: a model with 0.9 ROC-AUC can still output probabilities that are wildly overconfident. Always slice errors by meaningful subgroups (region, bucket, cohort); aggregate metrics hide the failures that matter.

## Deep Guidance

### Picking the right metric

Metric choice is a business decision, not a math decision. The right starting question is: "what does a false positive cost, and what does a false negative cost?" If the two are comparable and the classes are balanced, accuracy is fine. Once the costs diverge, or the base rate is skewed, accuracy becomes actively misleading.

A small rubric for classification:

- **Balanced binary classification**: accuracy is fine.
- **Imbalanced binary (fraud, churn, rare disease)**: precision / recall / F1, and PR-AUC over ROC-AUC (ROC-AUC flatters models on heavy class imbalance).
- **Ranking / thresholding later**: ROC-AUC (`roc_auc_score`), which measures ordering, not calibration.
- **Decisions that consume probabilities**: `log_loss` or Brier score, which reward calibrated confidence and punish overconfident mistakes.
- **Multi-class**: `classification_report` for per-class precision/recall, and pick `average="macro"` (equal weight per class) vs `"weighted"` (weight by support) deliberately.

And for regression:

- **Magnitude matters**: RMSE (penalizes large errors quadratically).
- **Outliers you do not want to chase**: MAE (robust to a few extreme points).
- **Explained variance / reporting to stakeholders**: R².
- **Relative error across scales**: MAPE, but guard against zeros in the denominator.

```python
from sklearn.metrics import classification_report, roc_auc_score, log_loss

# Assumes a fitted binary classifier `model` and the held-out X_test / y_test.
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
y_pred = (y_proba >= 0.5).astype(int)        # default 0.5 operating point

print(classification_report(y_test, y_pred, digits=3))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
print(f"log-loss: {log_loss(y_test, y_proba):.3f}")
```

Report at least one threshold-free metric (ROC-AUC or PR-AUC) and one threshold-dependent metric (precision/recall at your operating point). Reporting only accuracy on a 95/5 class split is the canonical way to lie to yourself: the "always predict no" baseline gets 0.95 without a model.
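
To make the 95/5 point concrete, here is a minimal sketch that scores a majority-class baseline on synthetic labels. The data and the names `y_demo`, `X_demo`, and `baseline` are illustrative assumptions, not part of any real project; `DummyClassifier` is scikit-learn's built-in trivial baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

# Synthetic 95/5 labels (assumption for illustration); features are irrelevant
# to a majority-class baseline, so a column of zeros is enough.
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)
baseline_proba = baseline.predict_proba(X_demo)[:, 1]

baseline_accuracy = accuracy_score(y_demo, baseline_pred)
baseline_recall = recall_score(y_demo, baseline_pred)
baseline_pr_auc = average_precision_score(y_demo, baseline_proba)

print(f"accuracy: {baseline_accuracy:.2f}")  # high, despite learning nothing
print(f"recall:   {baseline_recall:.2f}")    # no positive is ever found
print(f"PR-AUC:   {baseline_pr_auc:.2f}")    # equals the positive base rate
```

Accuracy looks healthy while recall is zero and PR-AUC collapses to the base rate, which is why the imbalanced-label rubric above prefers precision, recall, and PR-AUC.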

### Holdout discipline

Split once, at the top of the notebook, before any exploration on the target:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,  # preserve class balance for classification
)
```

Rules:

1. The test set is touched exactly once: at the end, for the final number you report.
2. All model selection, feature engineering decisions, and hyperparameter tuning happen on `X_train` (via cross-validation).
3. If you peek at test performance and then change the model, the test set is contaminated. Either live with the contamination and note it, or collect a new holdout.
4. Fit preprocessing (`StandardScaler`, `OneHotEncoder`, imputers) on train only, then apply to test. Wrap it in a `Pipeline` so you cannot leak by accident.
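
Rule 4 might look like the following sketch. The toy frame, the column names `amount` and `region`, and the choice of `LogisticRegression` are assumptions made only so the example runs on its own; the point is that imputing, scaling, and encoding are fitted inside `clf.fit` on the training rows alone.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with gaps, one categorical column.
rng = np.random.default_rng(0)
n = 200
amount = rng.normal(100, 20, n)
amount[::25] = np.nan
toy = pd.DataFrame({"amount": amount, "region": rng.choice(["north", "south", "west"], n)})
toy_y = (np.arange(n) % 3 == 0).astype(int)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression(max_iter=1000))])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)
clf.fit(X_tr, y_tr)                         # every preprocessing step is fitted on train only
test_proba = clf.predict_proba(X_te)[:, 1]  # test rows are only transformed, never fitted
print(test_proba[:5])
```

Because the whole pipeline is one estimator, passing `clf` to `cross_val_score` or `GridSearchCV` refits the preprocessing inside each fold as well.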

### Cross-validation for model selection

Use cross-validation on the training set to compare models and pick hyperparameters. This gives you a mean and standard deviation, so you can see whether model A actually beats model B or is one lucky fold away.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score one candidate: mean and spread across folds, training data only.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42),
    X_train, y_train,
    cv=cv,
    scoring="roc_auc",
)
print(f"CV ROC-AUC: {scores.mean():.3f} ± {scores.std():.3f}")

# Tune hyperparameters on the same folds; the test set is never involved.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [4, 8, None], "min_samples_leaf": [1, 5, 20]},
    cv=cv,
    scoring="roc_auc",
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_} (CV ROC-AUC {grid.best_score_:.3f})")
```

Use `StratifiedKFold` for classification (preserves class balance per fold) and plain `KFold` for regression. **For any data with a time dimension, use `TimeSeriesSplit` instead.** Random folds leak future information into the training set and will make your model look dramatically better offline than it is in production.
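
A small sketch of what `TimeSeriesSplit` does differently, using 20 synthetic rows that are assumed to be already sorted by time (the array `X_time` is a placeholder, not real data). Every validation fold comes strictly after the rows used for training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 observations, assumed to be in chronological order.
X_time = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_time))

for train_idx, val_idx in folds:
    # Training indices always end before validation indices begin.
    print(f"train 0..{train_idx.max():2d}  ->  validate {val_idx.min():2d}..{val_idx.max():2d}")
```

The splitter object can be passed as `cv=` to `cross_val_score` or `GridSearchCV` in place of `StratifiedKFold`.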

### Calibration

A model with a great ROC-AUC can still output badly calibrated probabilities; random forests and boosted trees are both notorious for this. If downstream code takes `predict_proba` output and uses it as a probability (expected-value calculations, threshold tuning based on cost, stacking, active learning), calibration matters at least as much as discrimination.

```python
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
import matplotlib.pyplot as plt

# Quantile bins keep roughly the same number of predictions in each bin.
prob_true, prob_pred = calibration_curve(y_test, y_proba, n_bins=10, strategy="quantile")

plt.plot(prob_pred, prob_true, marker="o")
plt.plot([0, 1], [0, 1], "--", color="gray")  # perfect-calibration diagonal
plt.xlabel("Predicted probability")
plt.ylabel("Observed frequency")
plt.show()
```

A well-calibrated model tracks the diagonal. If the curve sags below it, predicted probabilities are higher than the observed frequencies (overconfident); if it bulges above, they are lower (underconfident). Fix with `CalibratedClassifierCV(method="isotonic")` (flexible, needs more data) or `method="sigmoid"` (Platt scaling, works with ~1k examples). Fit calibration on a held-out slice of the training set, never on the test set.
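
One way to follow that advice is to let `CalibratedClassifierCV` do the held-out slicing itself. The sketch below is an assumption-laden example on synthetic data from `make_classification`: with `cv=5`, each calibrator is fitted on a fold the forest did not train on, and the evaluation split is only used for scoring.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset.
X_syn, y_syn = make_classification(n_samples=2000, n_features=10, random_state=0)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    method="sigmoid",  # Platt scaling; switch to "isotonic" when data is plentiful
    cv=5,              # calibrators are fitted on folds held out from the forest
)
calibrated.fit(X_fit, y_fit)

calibrated_proba = calibrated.predict_proba(X_eval)[:, 1]
brier = brier_score_loss(y_eval, calibrated_proba)
print(f"Brier score after calibration: {brier:.3f}")
```

Compare the Brier score and the calibration curve before and after; recalibration is only worth keeping if both improve on data the calibrator never saw.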

### Error analysis and slicing

Overall metrics hide systematic failures. A pandas `groupby` on the predictions is usually enough:

```python
import pandas as pd

# One row per test example, with the slicing columns alongside the predictions.
eval_df = pd.DataFrame({
    "y_true": y_test,
    "y_pred": y_pred,
    "y_proba": y_proba,
    "region": X_test["region"].values,
    "age_bucket": pd.cut(X_test["age"], bins=[0, 25, 45, 65, 120]),
})
eval_df["correct"] = eval_df["y_true"] == eval_df["y_pred"]

print(eval_df.groupby("region")["correct"].agg(["mean", "count"]))
# observed=True reports only the age buckets that actually occur in the data.
print(eval_df.groupby("age_bucket", observed=True)["correct"].agg(["mean", "count"]))
```

Look for slices where the metric is materially worse than overall AND the slice has enough examples to be real (set a floor like n ≥ 50). Those are your debugging targets before shipping.
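
The "materially worse and large enough" filter can be written down explicitly. This is a sketch under assumptions: the helper name `weak_slices`, the 0.05 accuracy gap, and the synthetic `demo` frame are all illustrative; only the n ≥ 50 floor comes from the text above.

```python
import pandas as pd


def weak_slices(eval_df, by, min_count=50, min_gap=0.05):
    """Return slices that are large enough to trust and clearly worse than overall."""
    overall = eval_df["correct"].mean()
    stats = eval_df.groupby(by, observed=True)["correct"].agg(["mean", "count"])
    stats["gap"] = overall - stats["mean"]  # positive gap = slice is worse than overall
    return stats[(stats["count"] >= min_count) & (stats["gap"] >= min_gap)]


# Synthetic stand-in for eval_df: region "b" is weak, region "c" is too small to judge.
demo = pd.DataFrame({
    "region": ["a"] * 100 + ["b"] * 60 + ["c"] * 5,
    "correct": [True] * 90 + [False] * 10 + [True] * 30 + [False] * 30 + [False] * 5,
})
flagged = weak_slices(demo, "region")
print(flagged)
```

Region "c" has the worst accuracy but only five rows, so it is left out; region "b" is the slice worth debugging.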

**Fairness note**: slicing by sensitive attributes (age, gender, region, race where legally permitted) surfaces disparate impact. This is a minimum floor: if you ship models that affect people, read a proper fairness reference (Barocas, Hardt, and Narayanan, "Fairness and Machine Learning") rather than treating a groupby as the whole story.

### What NOT to do

- **Do not tune on the test set.** Every time you look at a test number and change the model, you are fitting to the test set in slow motion. The `GridSearchCV` call above uses CV on the train set specifically to avoid this.
- **Do not cherry-pick a random seed.** If the model only wins with `random_state=7`, it does not actually win. Run with 3–5 different seeds and report the spread if you suspect the result is seed-fragile.
- **Do not report only the best fold.** Report mean and std across CV folds. A model with 0.85 ± 0.12 is not clearly better than one with 0.82 ± 0.02; the first is one unlucky fold away from losing.
- **Do not ship without a trivial baseline.** Compare against predicting the majority class (classification) or the training mean (regression). If your fancy model cannot beat that, the problem is the data or the label, not the model.
- **Do not evaluate on preprocessed-then-split data.** Fit the scaler, encoder, or imputer on train only, then transform test. Anything else is leakage and will inflate your offline numbers.
- **Do not change the metric after seeing the results.** Pick the metric before training, based on the business question, and stick with it. Swapping from precision to ROC-AUC because one looked nicer is a cousin of p-hacking.
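
The seed-fragility check from the list above could be sketched as follows. The synthetic dataset and the three seeds are assumptions for illustration; the idea is to vary both the fold assignment and the model seed, then look at the spread of the per-seed CV means.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training set.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)

seed_means = []
for seed in [0, 1, 2]:
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    fold_scores = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="roc_auc")
    seed_means.append(fold_scores.mean())

seed_means = np.array(seed_means)
print(f"ROC-AUC across seeds: {seed_means.mean():.3f} ± {seed_means.std():.3f}")
print(f"min {seed_means.min():.3f}, max {seed_means.max():.3f}")
```

If the gap between two candidate models is smaller than this spread, the comparison has not been decided yet.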

## Minimum evaluation checklist

Before calling a model "done" at solo scale, every item below should be true:

1. Metric is chosen to match the business cost of errors, documented in the notebook or readme.
2. Data was split once with a fixed `random_state`, stratified for classification or temporally for time-series.
3. All preprocessing lives inside a `Pipeline` and is fit on train only.
4. Model selection was done with cross-validation on the training set, with mean ± std reported per candidate.
5. At least one trivial baseline was beaten by a margin larger than the CV standard deviation.
6. Test set was evaluated exactly once, at the end, and that number is what you report.
7. If `predict_proba` is consumed downstream, a calibration curve was inspected and recalibrated if needed.
8. Errors were sliced by at least one meaningful business dimension, and any slice with materially worse metrics is either fixed or explicitly noted as a known limitation.
@@ -0,0 +1,170 @@
---
name: data-science-notebook-discipline
description: Notebook discipline for reproducible data science — Marimo as primary, Jupyter plus jupytext as fallback, promoting working cells to tested modules
topics: [data-science, notebooks, marimo, jupyter, reproducibility]
---

Every data scientist has shipped a notebook that "worked for me in a session" and then produced different numbers the next morning, or worse, different numbers in a colleague's environment or a production run. The usual cause is not a bug in the code; it is hidden state. Jupyter cells can be executed in any order, re-run selectively, or silently depend on variables that were defined in a cell that has since been edited or deleted. The kernel's in-memory state becomes the real program, and the `.ipynb` file is just a partial, sometimes misleading, transcript. For solo and small-team DS work, this is the single biggest source of "it worked yesterday" pain, and it is entirely avoidable with the right tooling and habits.

## Summary

Use **Marimo** as your primary notebook tool: the file format is pure `.py` (git-diffable), execution is reactive (editing a cell re-runs its downstream dependents automatically), and there is no hidden-cell-order hazard by construction. When you cannot switch (existing Jupyter investment, team inertia, library widgets that only work in classic Jupyter), pair every `.ipynb` with a `.py` via **jupytext** and commit the `.py`. Either way, the key discipline is promotion: when a cell works, extract it to `src/<module>.py`, write a test, and import it back. Run finished notebooks as pipelines: `python notebook.py` or `marimo run` for Marimo, `papermill` for Jupyter.

## Deep Guidance

### The hidden-state problem

Classic Jupyter lets you execute cells in any order. Consider this sequence:

1. Cell A defines `df = pd.read_csv("raw.csv")`.
2. Cell B defines `df = df.dropna()`.
3. You run A, then B, then A again.
4. `df` is now the raw frame, but cell B's output still shows the cleaned version, and any downstream cell that already ran computed its results from the cleaned `df`.

Nothing about the notebook on disk reveals this inconsistency. "Restart kernel and run all" is the only way to prove a notebook is reproducible, and most DS workflows skip that step for months at a time. Outputs are stored in the `.ipynb`, so a reader sees plausible numbers and has no signal that the state is corrupt. This is **hidden state**: the kernel's memory diverges from the code as written, and the notebook lies about what it computed.
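
The A, B, A sequence can be replayed in plain Python to see the divergence. In this sketch the functions `cell_a` and `cell_b` stand in for notebook cells and the inline frame stands in for `raw.csv`; both are assumptions made so the example runs without a kernel or a file.

```python
import pandas as pd

raw = pd.DataFrame({"x": [1.0, None, 3.0]})  # stands in for raw.csv


def cell_a():
    """Cell A: load the raw data."""
    return raw.copy()


def cell_b(frame):
    """Cell B: drop rows with missing values."""
    return frame.dropna()


df = cell_a()
df = cell_b(df)
rows_shown_by_b = len(df)  # what cell B's stored output keeps displaying

df = cell_a()              # re-running A silently replaces the cleaned frame
rows_in_kernel = len(df)   # what every later cell will actually see

print(f"cell B output says {rows_shown_by_b} rows; the kernel now holds {rows_in_kernel}")
```

The two numbers disagree, and nothing in the saved notebook records which one later cells were computed from.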

Second-order effects make it worse: merge conflicts on `.ipynb` JSON are unreadable; diffs show base64 image blobs; collaborators re-run cells in different orders and get different results. The notebook as a unit of collaboration is broken unless you impose discipline from outside.

### Marimo as primary

[Marimo](https://marimo.io) is a reactive Python notebook that solves hidden state at the architecture level. Each notebook is a pure `.py` file; cells form a dependency graph; when you edit a cell, Marimo re-runs all of its dependents automatically. There is no way for the displayed state to diverge from what the code computes, because the runtime enforces topological order on every edit.

A minimal Marimo notebook looks like this. Note that it is ordinary Python you can read in any editor:

```python
# notebook.py
import marimo

app = marimo.App()


@app.cell
def __():
    # Cells only see names that some cell defines, so `mo` is imported inside a cell.
    import marimo as mo
    import pandas as pd

    df = pd.read_csv("data/raw.csv")
    return (df, mo)


@app.cell
def __(df, mo):
    clean = df.dropna()
    mo.md(f"Rows: **{len(clean)}**")  # the last expression is the cell's displayed output
    return (clean,)


if __name__ == "__main__":
    app.run()  # lets `python notebook.py` execute the cells as a script
```

Key commands:

- `marimo edit notebook.py`: opens the reactive editor in your browser
- `marimo run notebook.py`: serves the notebook as a read-only web app (great for stakeholders)
- `marimo export html notebook.py -o out.html`: static HTML snapshot for reports

Because the file is `.py`, `git diff` shows real code changes. Code review on a Marimo notebook works the same as code review on any Python file. There are no output cells to strip, no JSON diffs to parse.

### Jupyter plus jupytext fallback

When Marimo is not an option (you depend on a Jupyter-only widget, you share notebooks with non-Marimo users, or your infrastructure is built around `.ipynb`), use **jupytext** to pair each `.ipynb` with a `.py` representation. Install jupytext, then configure pairing at the repo root:

```toml
# .jupytext.toml
formats = "ipynb,py:percent"
notebook_metadata_filter = "-all"
cell_metadata_filter = "-all"
```

Or pair a single notebook explicitly:

```bash
jupytext --set-formats ipynb,py:percent notebooks/eda.ipynb
```

The `py:percent` format splits cells with `# %%` markers and produces a clean, diffable Python file. Rule of thumb for the repo:

- **Commit** the `.py` version; it is the source of truth for review and diffs
- **Gitignore** the `.ipynb` (or commit it with `nbstripout` installed to strip outputs; see the data-science-security doc for the outputs-as-secrets angle)
- **Do not** try to keep both hand-edited; jupytext's pre-save hook keeps them in sync automatically

This does not fix hidden state (Jupyter still runs cells in click-order), but it does make review and merges sane, and it gives you a textual artifact that survives kernel-state bugs.
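
For reference, a paired `py:percent` file might look like the sketch below. The notebook content is made up for illustration (including the inline frame, which replaces a real data file so the example runs anywhere); the `# %%` and `# %% [markdown]` markers are the part that jupytext defines.

```python
# %% [markdown]
# # Exploratory analysis
# Markdown cells become commented lines under a `# %% [markdown]` marker.

# %%
import pandas as pd

# Hypothetical inline data so the sketch runs without a file on disk.
df = pd.DataFrame({"amount": [10.0, None, 30.0]})

# %%
clean = df.dropna()
row_count = len(clean)
print(f"Rows: {row_count}")
```

Because the markers are ordinary comments, the file also runs unchanged with `python`, and many editors recognize `# %%` as a cell boundary.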

### Promotion: notebook to src to test to re-import

The most important habit in any notebook workflow, Marimo or Jupyter, is **promotion**. The moment a cell does real work, extract it to a tested module and import it back.

Before (inline in the notebook, untested, untyped):

```python
@app.cell
def __(df):
    df["hour"] = pd.to_datetime(df["ts"]).dt.hour
    df["is_weekend"] = pd.to_datetime(df["ts"]).dt.dayofweek >= 5
    df["log_amount"] = np.log1p(df["amount"])
    return (df,)
```

After (extracted to `src/features/engineer.py`):

```python
# src/features/engineer.py
import numpy as np
import pandas as pd


def add_time_features(df: pd.DataFrame, ts_col: str = "ts") -> pd.DataFrame:
    """Add hour and is_weekend columns derived from a timestamp column."""
    out = df.copy()  # never mutate the caller's frame
    ts = pd.to_datetime(out[ts_col])
    out["hour"] = ts.dt.hour
    out["is_weekend"] = ts.dt.dayofweek >= 5  # Monday=0, so 5 and 6 are Sat/Sun
    return out


def add_log_amount(df: pd.DataFrame, amount_col: str = "amount") -> pd.DataFrame:
    """Add a log1p-transformed copy of the amount column."""
    out = df.copy()
    out["log_amount"] = np.log1p(out[amount_col])
    return out
```

Write a test that is small, fast, and has no data dependency:

```python
# tests/features/test_engineer.py
import pandas as pd
from src.features.engineer import add_time_features


def test_weekend_flag_friday_vs_saturday():
    # 2026-04-17 is a Friday, 2026-04-18 a Saturday.
    df = pd.DataFrame({"ts": ["2026-04-17 10:00", "2026-04-18 10:00"]})
    out = add_time_features(df)
    assert out["is_weekend"].tolist() == [False, True]
```

Re-import in the notebook:

```python
@app.cell
def __(df):
    from src.features.engineer import add_time_features, add_log_amount

    # Bind the result to a new name: Marimo lets only one cell define each variable.
    features = add_log_amount(add_time_features(df))
    return (features,)
```

The notebook becomes a thin orchestration + visualization layer over tested modules. Hidden state matters less because the logic lives in files that are exercised by CI. Pull requests become reviewable: the reviewer reads typed functions with tests, not a wall of chained DataFrame mutations.
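
A companion test for `add_log_amount` could look like this sketch. To keep it self-contained the function is repeated inline rather than imported from `src/features/engineer.py`, and the test name is an assumption. It checks the two properties the promoted version has and the inline cell lacked: known values, and no mutation of the input frame.

```python
import numpy as np
import pandas as pd


def add_log_amount(df, amount_col="amount"):
    """Inline copy of the promoted function, repeated so this sketch runs alone."""
    out = df.copy()
    out["log_amount"] = np.log1p(out[amount_col])
    return out


def test_add_log_amount_values_and_no_mutation():
    df = pd.DataFrame({"amount": [0.0, 9.0]})
    out = add_log_amount(df)

    assert "log_amount" not in df.columns               # input frame is left untouched
    assert out["log_amount"].iloc[0] == 0.0             # log1p(0) == 0
    assert np.isclose(out["log_amount"].iloc[1], np.log(10.0))  # log1p(9) == log(10)


test_add_log_amount_values_and_no_mutation()
print("add_log_amount behaves as expected")
```

In a real repository the function would be imported and the final call dropped, since `pytest` discovers and runs the test itself.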

### Running notebooks as pipelines

Finished notebooks often need to run on a schedule (daily reports, weekly retraining, monthly audits). Do not copy-paste the code into a script; run the notebook directly.

**Marimo**: because the file is already Python, you can run it as a script or as an app:

```bash
marimo run notebook.py                      # serve as web app
python notebook.py                          # execute the cells in dependency order as a plain script
marimo export html notebook.py -o out.html  # produce a static report artifact
```

**Jupyter**: use `papermill` to parameterize and execute an `.ipynb`, producing an executed output notebook:

```bash
papermill notebooks/weekly_report.ipynb \
  outputs/report_$(date +%Y%m%d).ipynb \
  -p start_date 2026-04-14 \
  -p end_date 2026-04-20
```

Papermill looks for the cell tagged `parameters` in the notebook and injects a new cell with the passed values directly after it, overriding the defaults. Use Marimo's `mo.cli_args()` for the equivalent in Marimo. Either way, pair this with a lightweight scheduler (cron, GitHub Actions, Airflow, Prefect): the notebook is the unit of work, not a script that tries to re-implement it.

A useful rule: if a notebook is scheduled to run unattended, its logic should be ~90% imports from `src/` and ~10% glue. The promotion discipline from the previous section is what makes scheduled notebook runs trustworthy.
@@ -0,0 +1,161 @@
---
name: data-science-observability
description: Monitoring deployed DS models and pipelines — prediction logging to Parquet, scheduled evaluation, basic drift detection, and Evidently for deeper analysis
topics: [data-science, observability, monitoring, drift, evidently]
---

Models don't fail loudly. A scoring job keeps running, rows keep landing in the output table, dashboards stay green, and quietly the predictions get worse. The world drifts away from whatever snapshot you trained on, and nobody notices until a stakeholder says "these numbers look weird." Observability for a solo DS isn't a platform; it's a small set of habits that give you a chance to catch decay before someone else does.

## Summary

For a solo or small-team data scientist with something deployed (even just a weekly cron), observability boils down to four habits: log every prediction with its inputs to a dated Parquet file, re-run your evaluation script on a schedule and alert on metric drops, check a handful of key features for distributional drift, and reach for `Evidently` when you want a pre-built drift report instead of writing your own. The goal is a tripwire, not a dashboard: you want to get paged when something's wrong, not stare at graphs hoping to spot it.

## Deep Guidance

### Log predictions + inputs

Every time your model scores something, write the scored rows to a dated Parquet log. This is the single most useful thing you can do for future-you: drift analysis, debugging, label backfill, and post-mortems all depend on having this log.

Layout:

```
data/processed/predictions/
  2026-04-21/
    run-20260421T0300-abc123.parquet
  2026-04-22/
    run-20260422T0300-def456.parquet
```

Schema (one row per prediction):

```python
# src/monitor/prediction_log.py
import uuid
from datetime import datetime, timezone
from pathlib import Path
import pandas as pd


def log_predictions(
    features: pd.DataFrame,
    predictions: pd.Series,
    model_version: str,
    log_root: Path = Path("data/processed/predictions"),
) -> Path:
    """Write one Parquet file per scoring run under a UTC-dated directory."""
    # Read the clock once so the directory, run_id, and logged_at always agree.
    now = datetime.now(timezone.utc)
    today = now.strftime("%Y-%m-%d")
    run_id = now.strftime("%Y%m%dT%H%M") + "-" + uuid.uuid4().hex[:6]
    out_dir = log_root / today
    out_dir.mkdir(parents=True, exist_ok=True)

    df = features.copy()
    df["prediction"] = predictions.values
    df["model_version"] = model_version
    df["logged_at"] = now
    df["run_id"] = run_id
    df["ground_truth"] = pd.NA  # Backfilled later when labels arrive

    out = out_dir / f"run-{run_id}.parquet"
    df.to_parquet(out, index=False)
    return out
```

Parquet is right for this: columnar, compressed, and fast to scan across dates. With the pyarrow engine, `pd.read_parquet("data/processed/predictions/")` reads every file under the directory. If inputs have PII, hash or drop those columns before logging; you rarely need raw identifiers to do drift or error analysis.
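
Hashing identifier columns before they reach the log might look like this sketch. The helper name `hash_pii`, the salt handling, and the `email` column are assumptions for illustration; a salted SHA-256 keeps rows joinable (equal inputs give equal hashes) without storing the raw value. Whether hashing alone satisfies your privacy obligations is a separate question that depends on the data.

```python
import hashlib

import pandas as pd


def hash_pii(df, columns, salt):
    """Return a copy of df with the given columns replaced by salted SHA-256 digests."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype(str).map(
            lambda value: hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
        )
    return out


# Hypothetical feature frame with one identifying column.
demo = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "b@example.com"],
    "amount": [1.0, 2.0, 3.0],
})
hashed = hash_pii(demo, ["email"], salt="demo-salt")
print(hashed)
```

Call it on the feature frame right before `log_predictions`, and keep the salt out of the repository so the hashes cannot be rebuilt from a list of known emails.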
|
|
62
|
+
|
|
63
|
+
### Scheduled eval re-runs
|
|
64
|
+
|
|
65
|
+
Your training-time evaluation script is also your monitoring script. Run it weekly or monthly against recent predictions joined to whatever ground truth has arrived, and alert when the headline metric breaches a threshold.
|
|
66
|
+
|
|
67
|
+
```python
|
|
68
|
+
# src/monitor/eval.py
|
|
69
|
+
import sys
|
|
70
|
+
import pandas as pd
|
|
71
|
+
from sklearn.metrics import roc_auc_score
|
|
72
|
+
|
|
73
|
+
THRESHOLD = 0.80 # Alert if AUC drops below this
|
|
74
|
+
|
|
75
|
+
def main() -> int:
|
|
76
|
+
preds = pd.read_parquet("data/processed/predictions/")
|
|
77
|
+
labels = pd.read_parquet("data/processed/labels/")
|
|
78
|
+
joined = preds.merge(labels, on="record_id", how="inner")
|
|
79
|
+
if len(joined) < 500:
|
|
80
|
+
print("Not enough labeled data yet; skipping.")
|
|
81
|
+
return 0
|
|
82
|
+
|
|
83
|
+
auc = roc_auc_score(joined["actual"], joined["prediction"])
|
|
84
|
+
print(f"AUC on {len(joined)} labeled rows: {auc:.3f}")
|
|
85
|
+
if auc < THRESHOLD:
|
|
86
|
+
# Send email / Slack webhook here
|
|
87
|
+
print(f"ALERT: AUC {auc:.3f} below threshold {THRESHOLD}", file=sys.stderr)
|
|
88
|
+
return 1
|
|
89
|
+
return 0
|
|
90
|
+
|
|
91
|
+
if __name__ == "__main__":
|
|
92
|
+
raise SystemExit(main())
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
Schedule it with whatever you already have — cron, a GitHub Actions `schedule:` workflow, Airflow if you run it, or your platform's scheduled job. Exit code 1 plus a Slack webhook is a perfectly good alerting system at this scale.
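
The Slack half of that alert can be sketched with the standard library alone, assuming a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names here are hypothetical, and the webhook URL is something you would supply yourself (an environment variable is one option).

```python
import json
import urllib.request


def build_alert_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Build a JSON POST for a Slack incoming webhook without sending it."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url: str, message: str, timeout: float = 10.0) -> int:
    """Send the alert and return the HTTP status code."""
    request = build_alert_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_alert(...)` in the branch that currently prints the `ALERT:` line keeps the exit code behavior unchanged while also notifying a channel.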
|
|
96
|
+
|
|
97
|
+
### Basic drift detection
|
|
98
|
+
|
|
99
|
+
Before reaching for a library, do the cheap thing: compare this period's feature distribution to your training distribution. Mean, std, a couple of quantiles, and a KS statistic cover most of what you need.
|
|
100
|
+
|
|
101
|
+
```python
|
|
102
|
+
# src/monitor/drift.py
|
|
103
|
+
from scipy.stats import ks_2samp
|
|
104
|
+
import pandas as pd
|
|
105
|
+
|
|
106
|
+
def feature_drift(reference: pd.Series, current: pd.Series) -> dict:
|
|
107
|
+
stat, p = ks_2samp(reference.dropna(), current.dropna())
|
|
108
|
+
return {
|
|
109
|
+
"ref_mean": reference.mean(), "cur_mean": current.mean(),
|
|
110
|
+
"ref_std": reference.std(), "cur_std": current.std(),
|
|
111
|
+
"ks_stat": stat, "ks_p": p,
|
|
112
|
+
"drifted": p < 0.01,
|
|
113
|
+
}
|
|
114
|
+
|
|
115
|
+
train = pd.read_parquet("data/processed/train.parquet")
|
|
116
|
+
recent = pd.read_parquet("data/processed/predictions/2026-04-21/")
|
|
117
|
+
for col in ["amount", "user_tenure_days", "n_items"]:
|
|
118
|
+
print(col, feature_drift(train[col], recent[col]))
|
|
119
|
+
```
|
|
120
|
+
|
|
121
|
+
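Run this alongside your scheduled eval. You don't need a dashboard: printing to the job log and alerting when any monitored feature reports `drifted=True` is enough. With large samples the KS p-value flags even tiny shifts, so if alerts get noisy, threshold on `ks_stat` as well.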
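
To turn the per-feature results into a job exit code, something like the following would do. It is a sketch that assumes each result is a dict with a `drifted` key, as returned by `feature_drift` above; the helper names and the hand-written example values are illustrative, not measured output.

```python
import sys


def drifted_features(results: dict[str, dict]) -> list[str]:
    """Return the names of features whose result dict has a truthy "drifted" flag."""
    return sorted(name for name, result in results.items() if result.get("drifted"))


def exit_code_for(results: dict[str, dict]) -> int:
    """Print an alert to stderr and return 1 if any feature drifted, else return 0."""
    flagged = drifted_features(results)
    if flagged:
        print(f"ALERT: drift detected in {', '.join(flagged)}", file=sys.stderr)
        return 1
    return 0


# Hand-written results standing in for feature_drift() output.
example = {
    "amount": {"ks_stat": 0.21, "ks_p": 0.0004, "drifted": True},
    "n_items": {"ks_stat": 0.03, "ks_p": 0.62, "drifted": False},
}
```

Ending the drift script with `raise SystemExit(exit_code_for(results))` gives it the same failure signal as the eval script, so the scheduler can treat both jobs alike.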
|
|
122
|
+
|
|
123
|
+
### Evidently for more
|
|
124
|
+
|
|
125
|
+
When you outgrow ad-hoc KS tests, `Evidently` gives you a pre-built drift report across all features, plus data quality checks and target drift, as an HTML page you can open or ship to S3.
|
|
126
|
+
|
|
127
|
+
```python
|
|
128
|
+
# src/monitor/evidently_report.py
|
|
129
|
+
import pandas as pd
|
|
130
|
+
from evidently import Report
|
|
131
|
+
from evidently.presets import DataDriftPreset
|
|
132
|
+
|
|
133
|
+
reference = pd.read_parquet("data/processed/train.parquet")
|
|
134
|
+
current = pd.read_parquet("data/processed/predictions/2026-04-21/")
|
|
135
|
+
|
|
136
|
+
report = Report([DataDriftPreset()])
|
|
137
|
+
snapshot = report.run(reference_data=reference, current_data=current)
|
|
138
|
+
snapshot.save_html("reports/drift-2026-04-21.html")
|
|
139
|
+
```
|
|
140
|
+
|
|
141
|
+
This is opt-in. If plain pandas + SciPy is telling you what you need to know, don't add a dependency. Reach for Evidently when you have enough features that per-column code is tedious, or when you want a shareable artifact for a stakeholder.
|
|
142
|
+
|
|
143
|
+
### The prediction / feedback loop
|
|
144
|
+
|
|
145
|
+
Ground truth almost never arrives at prediction time. A churn model predicts today who'll cancel next month; a fraud model predicts now whether a transaction is bad, confirmed days later. That delay is why the Parquet log exists — you keep predictions around until labels catch up, then join.
|
|
146
|
+
|
|
147
|
+
```python
|
|
148
|
+
# src/monitor/backfill_labels.py
|
|
149
|
+
import pandas as pd
|
|
150
|
+
|
|
151
|
+
preds = pd.read_parquet("data/processed/predictions/")
|
|
152
|
+
labels = pd.read_parquet("data/processed/labels/") # record_id, actual, label_time
|
|
153
|
+
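# Left join keeps predictions whose labels have not arrived yet
merged = preds.merge(labels, on="record_id", how="left")
merged["ground_truth"] = merged["actual"]  # Fill the placeholder written at logging time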
|
|
154
|
+
merged.to_parquet("data/processed/predictions_labeled.parquet", index=False)
|
|
155
|
+
```
|
|
156
|
+
|
|
157
|
+
Keep at least one full feedback cycle of prediction logs (if labels arrive after 30 days, keep 60-90 days). This join is how you get a real accuracy number on production traffic, not just your static test-set number from training day.
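
A retention check for the dated log directories might look like the sketch below. It assumes the `YYYY-MM-DD` subdirectory layout written by the logging function; the function name `expired_partitions` and the `keep_days` parameter are hypothetical. It only lists candidates and deletes nothing.

```python
from datetime import date, timedelta
from pathlib import Path


def expired_partitions(log_root: Path, keep_days: int, today: date) -> list[Path]:
    """List dated subdirectories (YYYY-MM-DD) older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for child in sorted(log_root.iterdir()):
        if not child.is_dir():
            continue
        try:
            partition_date = date.fromisoformat(child.name)
        except ValueError:
            continue  # Skip directories that are not date partitions
        if partition_date < cutoff:
            expired.append(child)
    return expired
```

Once the listing looks right, each returned path can be passed to `shutil.rmtree`. Keeping listing and deletion separate makes it easy to dry-run the policy first.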
|
|
158
|
+
|
|
159
|
+
### What NOT to build
|
|
160
|
+
|
|
161
|
+
Resist the urge to over-engineer. At solo scale you do not need streaming drift detection, a Prometheus/Grafana stack, a model registry with canary deploys, or a dedicated monitoring dashboard. Those are ML-platform-team concerns; build them when there's a team to own them. A dated Parquet log, a scheduled eval script, a handful of drift checks, and an alert that emails you are more than enough to catch the failures that actually happen.
|